Forum:The clogged search
A search for my name, Fred Bauder, or, for example, Joseph Smith, produces a great many hits on me or on Mini:Joseph Smith, Jr., the Mormon prophet. That is fine if that is what you're looking for. However, particularly in my case, that is unlikely unless one is a student of the Arbitration Committee at Wikipedia. The problem is to gracefully achieve a search that produces hits about the "Fred Bauder"s who are not me or the "Joseph Smith"s who are not the Mormon prophet. In the case of Joseph Smith, this is complicated by the probability that some of the Joseph Smiths are descendants of the prophet and Mormons themselves.
Please see Mini:Fred Bauder for some preliminary work I have done. Fred Bauder 18:32, 10 January 2008 (UTC)
- Should someone know the search engine lingo for this phenomenon, that information would be helpful. I doubt I am the first to notice. Fred Bauder 18:32, 10 January 2008 (UTC)
- I don't know if you have progressed on this in the meantime but there are search engines which provide categorisation of the results. Try, for example, putting Joseph through the search on Vivisimo. I don't know how it works but the results are quite impressive. Is this the sort of thing you are looking for? Ipigott 15:22, 18 January 2008 (UTC)
Namespaces
In general, this problem is about namespaces. It's a classic problem in computer science concerning the name of something. When two things have the same name, which one do you mean? Since programming languages are deterministic, it's solved by creating a space in which names live (a "namespace"), then placing each name into its proper space. The same idea can be applied to human languages as well. If you say "Joseph" in a room of 5 people, you probably mean the Joseph in that room. The implied namespace is created contextually. In a pet store, "Joseph" may be the name of a fish. To disambiguate the two, you might refer to Joseph, the fish at the Puppies to Guppies store, versus Joseph in the Old Testament.
Fred, your question above seems to be all about picking which namespace you are talking about. You want the search engine to be able to discard search hits for one namespace in your results, for example, "search for the Joseph that is not the Mormon prophet." You discard one namespace and permit all others.
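To make "discard one namespace and permit all others" concrete, here is a minimal sketch in Python. It assumes each search hit already carries a namespace label; the hit titles, the label strings, and the `exclude_namespace` helper are all made up for illustration and do not describe how any real search engine stores its results.

```python
# Hypothetical search hits, each already tagged with a namespace label.
HITS = [
    {"title": "Joseph Smith, Jr.", "namespace": "Mormon prophet"},
    {"title": "Joseph Smith (carpenter)", "namespace": "other"},
    {"title": "Joseph Smith (sea captain)", "namespace": "other"},
]

def exclude_namespace(hits, unwanted):
    """Keep every hit except those labelled with the unwanted namespace."""
    return [hit for hit in hits if hit["namespace"] != unwanted]

# "Search for the Joseph Smith that is not the Mormon prophet."
remaining = exclude_namespace(HITS, "Mormon prophet")
for hit in remaining:
    print(hit["title"])
```

The filter itself is trivial; the hard part is getting the labels onto the pages in the first place.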
Okay, so how does a search engine know that a particular page is about one kind of Joseph or another? A web crawler is, after all, just tirelessly indexing pages, and it is only as smart as we make it. The direct way to know which Joseph a page is about is to examine the page for context. If the words "Egypt" and "Potiphar" are on the page, the search engine may place weight towards the Joseph in the Old Testament. If the words "Mormon" and "Latter" are on the page, the engine may place weight towards it being the Mormon prophet instead.
For this to happen, either manually or automatically, someone has to:
- Create namespaces
- Create a context profile for that namespace
- During web crawling, place words or phrases into the right namespaces based on how closely they match the profile
A page may match two namespaces partially, one namespace at 95% and another at 10%. Perhaps your search results would list the 95% match first, or ask you which namespace you meant.
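The profile matching described above could be sketched roughly as follows in Python. This is only an illustration under simple assumptions: a profile is a plain set of words, and the score is the fraction of profile words that appear on the page. The words "Egypt", "Potiphar", "Mormon" and "Latter" come from the discussion above; the other profile words, the namespace labels, and the function names are invented for the example.

```python
import re

# Hypothetical context profiles: namespace label -> words that suggest it.
PROFILES = {
    "Joseph (Old Testament)": {"egypt", "potiphar", "pharaoh", "genesis"},
    "Joseph Smith (Mormon prophet)": {"mormon", "latter", "nauvoo", "prophet"},
}

def score_page(text, profiles=PROFILES):
    """Return, per namespace, the fraction of its profile words found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {ns: len(words & profile) / len(profile) for ns, profile in profiles.items()}

def rank_namespaces(text, profiles=PROFILES):
    """List (namespace, score) pairs, best match first."""
    scores = score_page(text, profiles)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

page = "Joseph was sold into Egypt and served in the house of Potiphar."
ranking = rank_namespaces(page)
for namespace, score in ranking:
    print(f"{namespace}: {score:.0%}")
```

A search front end could show the top-ranked namespace first, or ask which one was meant when two scores are close. A real engine would presumably weight words rather than count them equally.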
Another way for this to happen is to see what links in and out of the page:
- If "Old Testament" pages link to Joseph, that would add weight towards the page being about Joseph of the Old Testament
- If the page links to other pages about the Old Testament, that adds more weight
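The link-based weighting in the two points above might look something like this Python sketch. It assumes some pages have already been assigned to a namespace; the `KNOWN` table, the page titles, and the equal default weights for inbound and outbound links are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: pages that have already been assigned to a namespace.
KNOWN = {
    "Book of Genesis": "Old Testament",
    "Potiphar": "Old Testament",
    "Book of Mormon": "Latter Day Saints",
}

def link_weights(inbound, outbound, known=KNOWN, in_weight=1.0, out_weight=1.0):
    """Add weight to a namespace for each already-classified page linking in or out."""
    weights = Counter()
    for title in inbound:
        if title in known:
            weights[known[title]] += in_weight
    for title in outbound:
        if title in known:
            weights[known[title]] += out_weight
    return dict(weights)

# A page about some Joseph: one classified inbound link, two classified outbound links.
weights = link_weights(
    inbound=["Book of Genesis", "Main Page"],
    outbound=["Potiphar", "Book of Mormon"],
)
print(weights)
```

Links to or from unclassified pages (such as "Main Page" here) simply contribute nothing.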
By combining direct analysis (matching a profile of other words on the page) with indirect analysis (looking at links in and out of the page), there is a high likelihood you will get the search matches you want. (Namespaces section added by --Caswick 16:24, 12 January 2008 (UTC))