Forum:Domain Specific Search Plugins
I've got a new idea. Search is often problematic on websites that are not optimized for it. In particular, most sites ignore semantic search engines. We therefore need a way to write plugins that extract well-structured data from poorly structured websites. These plugins could be written in ECMAScript/JavaScript and should be maintained by trusted users, much as bots are on Wikipedia.
Users should write plugins for websites with great content that ought to be queryable. This includes websites such as:
- transport sites (bus, flight, taxi, ...)
- public databases (IMDB, Last.FM, Friendster, Orkut, Amazon)
- restaurants and hotels (primarily sites that collect such data)
- recipes and products (Woochi, allrecipes, eBay, Amazon, WalMart)
One problem that might arise is that website developers who want better search results for their own sites may want to write their own scrapers. They should focus primarily on providing RDF data on their sites rather than resorting to such hacks.
— MovGP0 23:05, 1 June 2007 (UTC)
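To make the proposal a bit more concrete, here is a minimal sketch of what such a plugin might look like in JavaScript. Everything in it is an assumption for illustration: the site `recipes.example.org`, its markup, the plugin shape (`matches` and `extract`), and the helper `extractWithPlugins` are all hypothetical, and the output is a list of plain subject/predicate/object triples rather than real RDF.

```javascript
// Hypothetical plugin for a made-up recipe site; names and markup are illustrative only.
const recipePlugin = {
  name: "example-recipes",

  // Decide whether this plugin is responsible for a given URL.
  matches(url) {
    return new URL(url).hostname === "recipes.example.org";
  },

  // Pull structured fields out of loosely structured HTML.
  extract(url, html) {
    const title = /<h1[^>]*>([^<]*)<\/h1>/i.exec(html);
    const ingredients = [];
    const itemPattern = /<li class="ingredient">([^<]*)<\/li>/gi;
    let match;
    while ((match = itemPattern.exec(html)) !== null) {
      ingredients.push(match[1].trim());
    }

    // Emit subject/predicate/object triples, loosely in the spirit of RDF.
    const triples = [];
    if (title) triples.push([url, "title", title[1].trim()]);
    for (const item of ingredients) triples.push([url, "ingredient", item]);
    return triples;
  },
};

const plugins = [recipePlugin];

// Run the first plugin that claims the URL; return null when none applies,
// so the caller can fall back to a conventional crawler.
function extractWithPlugins(url, html) {
  const plugin = plugins.find((p) => p.matches(url));
  return plugin ? plugin.extract(url, html) : null;
}

const sampleHtml = `
  <h1> Pancakes </h1>
  <ul>
    <li class="ingredient">2 eggs</li>
    <li class="ingredient">250 ml milk</li>
  </ul>`;

const triples = extractWithPlugins("http://recipes.example.org/pancakes", sampleHtml);
console.log(triples);
```

Regular expressions are used here only to keep the sketch self-contained; a real plugin would more likely walk the parsed DOM. Because each plugin declares which URLs it handles, it stays site-specific, and pages that no plugin claims can still go through ordinary crawling.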
Discussion
Looks like an interesting idea, but how will you know what data to extract? Wouldn't the bots have to be site-specific, since otherwise they would just be another kind of crawler? --Oshani 16:14, 2 June 2007 (UTC)
- I'm thinking of a threefold approach to indexing:
- Conventional crawlers plus syntactic-to-semantic text processing that maps sites to the planned Wikipedia ontology (Ontoworld) using a semantic relatedness algorithm (I would prefer Wikipedia-based Explicit Semantic Analysis). Relatedness determined by an algorithm could be expressed using BayesOWL.
- Domain-specific, user-written plugins for the bot. With these, the bot can extract semantic data from partly structured sites. Writing one plugin makes it possible to index a whole website, instead of having users index one page after another manually.
- Users who map content to the Wikipedia ontology, using their natural understanding of a site's content to further improve the mapping. I'm thinking of a model similar to Metaweb's Freebase, where semantic annotations are shown in a sidebar next to an external website.
- I think such automation is needed because users alone won't be able to handle the volume of data from regularly updated sites manually. In fact, I think only a very small number of people will be willing to work on improving search results; the vast majority of users just want to search.
- — MovGP0 18:11, 3 June 2007 (UTC)
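For the relatedness step in the first point, a rough sketch may help. As I understand Explicit Semantic Analysis, a text is represented as a weighted vector over Wikipedia concepts, and two texts are compared by the cosine of their vectors. The snippet below shows only that comparison step. The concept names and weights are made up for illustration; in a real system they would come from an index built over Wikipedia articles, and the function name `cosineRelatedness` is my own.

```javascript
// Cosine similarity between two sparse concept vectors (Map: concept name -> weight).
function cosineRelatedness(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [concept, weight] of a) {
    normA += weight * weight;
    dot += weight * (b.get(concept) || 0); // concepts missing from b contribute 0
  }
  for (const weight of b.values()) {
    normB += weight * weight;
  }
  if (normA === 0 || normB === 0) return 0; // an empty vector is related to nothing
  return dot / Math.sqrt(normA * normB);
}

// Toy vectors with invented weights; not derived from any real Wikipedia index.
const busTimetable = new Map([["Bus", 0.8], ["Public transport", 0.6]]);
const flightSchedule = new Map([["Airline", 0.8], ["Public transport", 0.6]]);
const pancakeRecipe = new Map([["Pancake", 0.9], ["Egg", 0.4]]);

const transportScore = cosineRelatedness(busTimetable, flightSchedule);
const unrelatedScore = cosineRelatedness(busTimetable, pancakeRecipe);
console.log(transportScore, unrelatedScore);
```

The two transport pages share one concept and so get a score above zero, while the timetable and the recipe share none and score zero. A score like this could then be attached to a mapping as its degree of confidence.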
See also
- Creating a Bot from Wikipedia
- Screen Scraper Howto from the Piggy Bank Project