Brainstorming

We should start brainstorming about our ideas for a new search engine.

Just throw in your ideas!

  • Creation of a tags/comments page for every indexed page, as well as community rating (mimicking the functionality of StumbleUpon[1], but integrating ratings with search rank)
  • Local Search - by city (for pizza, niche shops, etc., allow businesses to include metadata like items they carry or foods served at restaurants)
  • Integrate TIP (Total Information Pages) technology to enhance the search platform. TIP is a compiler that links every word in a document to thousands of sources; there are TIPed documents outsourced to wikis on the internet now. The educational community would benefit greatly from this type of search technology.
  • Why not let every website spider (index) its own content and submit the results (like they do a sitemap)? - I agree; it would be useful to have a PHP, ASP and JavaScript crawler (factory?). This could then be run as a cron job and be built into CMSs such as Drupal, Mambo, etc.
  • Don't re-invent the wheel (search engine) - focus on PRIVACY
  • SearchPlatform
  • CommunityRanking
  • proxy server option (for the schoolkids lol)
  • Anti-search - show what sites don't want to be indexed. A database of every site's robots.txt would be nice.
  • media search, e.g. images, video, audio
  • personal homepages with customisable content
  • open source software
  • peer to peer search spider
  • user rating of search results / certain pages
  • a comment system incl. rating
  • distributed spidering and indexing of web pages using networked video game systems (xbox, etc.)
  • semantic web?
  • social bookmarking
  • semi-automatic semantic tagging
  • transparent engine that explains why this link is the first
    • NPOV criteria for improving ranking by generic relevance, using aggregated tagging and ranking of material against specified criteria?
    • POV search by a person's own criteria, e.g. the person values x over y sources, open licenses, local to me, most recent first?
  • date: how old is a page? - yes, along with filtering results by date (useful for searching for news)
  • edit search results
  • whitelists: what sites are worth crawling?
  • blacklist: what sites are spam / dangerous (e.g. phishing)?
  • mark results:
    • do I have to pay for the online content?
    • do I have to have a login to see the content?
  • open licensed materials I can remix and redistribute
  • Proximity based ranking
  • Popularity-aware ranking
    • It may be even more interesting to record which top-rated results were skipped in favour of lower-rated ones, since top-rated results will be popular anyway. Such a negative rating might make it easier for previously lower-rated links to "move up"?
  • Pareto-Efficiency-Criteria Ranking [2]
  • Multi-search sessions
  • site thumbnails
    • thumbnails, but just on request
  • flexible user-defined search with choice of ranking and voting panel to rank suitability
  • Thesaurus style search grouped by similarity of meaning. Owners place pages in contextual boxes.
  • context-sensitive hints (like Amazon, etc.): "Others also have searched for ..., ..., ..."
  • Use web servers as self-defined search agents: each website could define its available content based on tags and ranks, and serve it to other web servers in P2P agent networks, following semantic web concepts
  • Visual Searching: Use metadata about picture elements (and each pixel's immediate neighbor) to find similar images. Diana Day points us to: Hermitage Museum RGB search

See also: eVision and Able Image from Mu Labs... web-based, drag-and-drop, searches that find blue pictures, not just pictures with blue.png filenames.
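As a rough illustration of the "find blue pictures, not just blue.png filenames" idea, here is a minimal Python sketch that compares images by coarse color histograms. It is only one possible approach, not how the Hermitage search, eVision or Able Image actually work. The function names are made up for this example, images are assumed to be plain lists of (r, g, b) tuples, and the sketch ignores the neighboring-pixel metadata mentioned above.

```python
from collections import Counter

def quantize(pixel, levels=4):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse color bucket."""
    step = 256 // levels
    return tuple(channel // step for channel in pixel)

def color_histogram(pixels, levels=4):
    """Return the fraction of pixels that fall into each coarse color bucket."""
    counts = Counter(quantize(p, levels) for p in pixels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def histogram_similarity(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    buckets = set(hist_a) | set(hist_b)
    return sum(min(hist_a.get(b, 0.0), hist_b.get(b, 0.0)) for b in buckets)

def find_similar(query_pixels, indexed_images, levels=4):
    """Rank indexed images (name -> pixel list) by color similarity to the query."""
    query_hist = color_histogram(query_pixels, levels)
    scored = [
        (histogram_similarity(query_hist, color_histogram(pixels, levels)), name)
        for name, pixels in indexed_images.items()
    ]
    # Highest similarity first; ties broken by name for a stable order.
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

With a mostly blue query image, a picture that really is blue ranks above a red picture that merely happens to be named blue.png.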

  • Allow contributors to edit categories, not results. Use solid searching, but allow searchers to edit "cluster" categories and directory hierarchies...
  • Allow the users to categorize the content, but do not force them to do this completely: e.g. allow a user to categorize an article under chemistry without requiring them to place it under organic or inorganic chemistry. This allows sequential improvement of the categorization.
  • Allow every aspect of an article to be categorized; do not require a single categorization.
  • Allow a content provider to categorize an article; perhaps allow users to overrule this categorization in order to limit abuse.
  • Check experiences with categorization schemes like the Japanese Patent Classification, International Patent Classification etc. and design a compatible system, or at least use the work provided by these organizations.
  • Here's something I don't know if I've mentioned before: One thing we could do to apply the community to improve search results is to constantly rotate in different algorithms and let users rate the result relevance. Sure, maybe Joe Averageuser wouldn't want to have much to do with that, but even a dedicated core of a few hundred power users could quickly create significant statistics about what algorithms are *actually* working better. But you'd never want to stop - constantly take in new ideas about what data to collect and how to weight it, then collect data about how well the algorithm is working in real time. Use the feedback from that to "evolve" the algorithms.
  • mobile version to search with PDAs etc.
  • Having induction periods where newly added websites are displayed in a random order and whichever gets the most clicks comes first
  • Something similar to Google's "Subscribed Links" where people can add specific data on a specialized subject, or access to a specific web application (i.e. searching for "whois:www.domain.com" will provide the whois data for that domain, or searching for "where is company ABC" will bring up their address from an address book)
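The "rotate in different algorithms and let users rate the results" idea from the list above could be sketched roughly as follows. This is only an illustration using a simple epsilon-greedy rule (mostly serve the best-rated algorithm, sometimes a random one); nobody has proposed that specific rule here. The class name, the rating scale and the explore_rate parameter are all assumptions made up for the example.

```python
import random
from collections import defaultdict

class AlgorithmRotation:
    """Rotate between ranking algorithms and track user ratings for each."""

    def __init__(self, algorithms, explore_rate=0.2, rng=None):
        self.algorithms = dict(algorithms)  # name -> callable(query) -> results
        self.explore_rate = explore_rate    # share of searches given to a random algorithm
        self.rng = rng or random.Random()
        self.rating_sum = defaultdict(float)
        self.rating_count = defaultdict(int)

    def mean_rating(self, name):
        count = self.rating_count[name]
        return self.rating_sum[name] / count if count else 0.0

    def pick(self):
        names = sorted(self.algorithms)
        # Every algorithm gets rated at least once before we compare them.
        unrated = [n for n in names if self.rating_count[n] == 0]
        if unrated:
            return unrated[0]
        # Keep rotating in other algorithms so the statistics never go stale.
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(names)
        return max(names, key=self.mean_rating)

    def search(self, query):
        name = self.pick()
        return name, self.algorithms[name](query)

    def rate(self, name, rating):
        """Record one user's relevance rating for the algorithm that served them."""
        self.rating_sum[name] += rating
        self.rating_count[name] += 1
```

The search call returns the algorithm name along with the results so that the user's rating can be credited to the right algorithm afterwards.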

Integrate users' personal social networking accounts. For example, include del.icio.us bookmarks, Digg bookmarks, friends' Digg marks, and MySpace pages (including friends' pages, with extended friends as an option) in searches. The user can log in to include whichever social networking sites they want.

In addition to searching the Internet as a whole, offer an option to search ONLY a list of familiar sites.
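A minimal sketch of the "familiar sites only" option, assuming results arrive as a list of URLs and the user's familiar sites are given as bare domain names. The function name and that input format are invented for this example.

```python
from urllib.parse import urlparse

def restrict_to_sites(result_urls, familiar_sites):
    """Keep only results hosted on one of the familiar sites or their subdomains."""
    allowed = {site.lower() for site in familiar_sites}
    kept = []
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself or any subdomain, but not lookalike names.
        if any(host == site or host.endswith("." + site) for site in allowed):
            kept.append(url)
    return kept
```

Matching on the hostname rather than on the raw URL string avoids letting in a lookalike such as notwikipedia.org when the user listed wikipedia.org.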

  • Google does not serve more than 1000 results for any query. Why not serve ALL results? And show some options with the results. (Maybe this work could be done in the user's computer, maybe by a browser extension) Some options could be:
    • Order results in reverse mode.
    • Or shuffle mode.
    • Or specifying a rank.
    • Or I want to watch the development of a topic online, show me a date view
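The result options above are cheap to do once the full result list is available, for instance in a browser extension. A small sketch, assuming each result is a dict with a "url" and a "date" key (that record layout and the function names are made up for illustration):

```python
import random
from datetime import date

def reverse_view(results):
    """Show the lowest-ranked results first."""
    return list(reversed(results))

def shuffle_view(results, seed=None):
    """Show results in random order; pass a seed to get a repeatable shuffle."""
    shuffled = list(results)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def rank_range_view(results, first, last):
    """Show only the results ranked first..last (1-based, inclusive)."""
    return results[first - 1:last]

def date_view(results):
    """Show results oldest first, to follow how a topic developed over time."""
    return sorted(results, key=lambda r: r["date"])
```

None of these views changes the stored ranking; each returns a new list and leaves the original results untouched.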


  • It does not matter if all results are returned; 88% of users only look at the first 3 pages before changing their search query.
    • But being able to do a more complete search could be a niche to fill, or an advantage over what Google already does. That about 85% of users use Google, Yahoo or MSN doesn't mean that there shouldn't be a Wikia search engine. (Search Engine Watch, Sept '07)
  • Search should focus on the high-precision results (the first n results)
  • When using P2P to store your index you want incremental results OR just the top n results
  • Let webservers index their own content and let them act like peers in a hybrid P2P network (for an overview, see a paper I wrote, "Towards large scale P2P web search")
  • A secure version of the site.
  • An option to only search secure sites.
  • An option to only search FTP sites.
  • Harvest External Links from Wikipedia articles - this is kinda what Wikiseek does.
  • The term "human agent"
  • Centralized control of all your Grub clients, e.g. from the grub.org webpage. I would like to send schedules to all clients at one time, disable/enable all clients and so on.
  • The possibility for the Grub client to use more of the allocated resources. When I set up the client to use 100% CPU and 100% network usage, the client still only utilizes a fraction of this power. Maybe this should be done as a "dedicated crawler" install, e.g. to be used by ISPs who would like to put dedicated machines into the fight, to ensure their sites get indexed (Local crawling).
  • Make local crawling available on a team level. E.g. I would like to get everyone at our office to install the Grub client, but I would also like for them all to crawl the websites we produce. As this is several hundred pages, I would like to add these pages to the team, instead of to each client.
  • It would also be nice to be able to add sites to be crawled locally, from code. This could be used by ISPs who create many websites.
  • Arrange META data -and other factors- obtained from each site to try a "semantic" search engine
  • I really want the ability to do search stemming (e.g., the search term "*librar*" can find pages to do with libraries, librarians, metalibraries, etc.).
  • Allow users to tag search results, and allow users to choose whether their search results are influenced by the tagging or not. Allow users to "second" a tag, giving it more weight if the tag is seen as appropriate.
  • The most difficult aspect will be the handling of commercial pages (search engine spam), where traditional search engines have failed - possibly due to their own commercial interests. Any rating/tagging/algorithm design process should account for the fundamental divide between purely informational pages and the type of infomercial/spam pages that have rendered the traditional competitor just about entirely useless when seeking (non-biased) information.
  • Firefox/IE Toolbar
  • Add a "freshness" rating. An article may be highly relevant when first published, but after a period of time, the data in the article becomes dated or useless.
  • Pruning of dead links and stale articles.
  • Session generation -- automatically generate Firefox/Opera sessions of each search result in a tab, with the option of excluding search results
  • Search session history -- ability to exclude certain results over long periods of time (see the persistent search session history idea above)
  • RSS/Atom feed for search results on particular topics
  • Shared Search -- generate a unique session ID and allow two or more people to tick off results from the list
  • Special ASCII character search
  • We could have all the different search tools available in a similar way to Facebook applications, in that people can seek out the search tool they most like the look of and have that on their personalised homepage. This would lead to things like popularity rankings for the search tools themselves...
  • Tagging can be used to gain buzzwords describing a found page. These tags can be used as an index.
    • Tags can be gained from the meta data of the page
      • The contents of the page can further be analyzed semantically for a second-level index. The contents contain many words and phrases that are common and not specific to the subject of the page; such information should therefore not be weighted too much.
    • Allow the user to supply tags to a search result (generate a tag cloud for each page). The more users supply a result with the same tags, the more appropriate those tags are.
  • A page that is often visited seems to be important. There are other resources for finding out how well received the contents of the page are.
    • How many entries for a page can be found at del.icio.us or Digg can indicate how widely the page is known and how well it is appreciated.
    • Some sites have their own critic section à la 'How helpful was this article?' Make use of this. There may be some need for standardisation but there is already good data out there.
    • Allow a page to indicate that it was visited. The webmaster of a site adds some lines to his page and every click on the page sends an event to the ranking mechanism. It must then be decided whether this was a regular click or a fraudulent click, and the ranking of the page adjusted accordingly.
  • It must be easy for the user to 'comment' on a search result. The easier it is, the more it will be used. The comments should be available in various granularities, e.g. good or bad, a rating between 0 and 9, or a written comment. Such a thing can be done with an add-on or another browser extension.
  • There are many people out there who are bad spellers. It is therefore imperative to suggest correctly spelled search terms before the query is submitted, in contrast to Google's "Did you mean ..." after the request.
  • For performance reasons it may be helpful to treat frequent queries differently.
  • Today much information is found in blogs and forums. Information is only valuable if it is correct. Correctness is something only a human can decide on after studying an article. However, if he knew the date of the article beforehand, he could decide whether it is out of date or worth pursuing. The date need not be the actual date of the page's last change at the time of the search request; it can be the last change as of the time the page was last crawled.
  • It must be transparent how the crawling mechanism works, so the user can decide whether the found content is out of date, e.g. when the result page announces some event that is already in the past. Otherwise, to find out whether only the text in the displayed search result is old or the page itself, he must open the page.
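The "freshness" rating and the page-date ideas above could be combined roughly like this: discount a result's relevance by the age of its last known change, as seen at the last crawl. The exponential half-life decay, the 180-day default and the record fields ("relevance", "last_changed") are assumptions chosen for this sketch, not something agreed on this page.

```python
from datetime import date

def freshness(last_changed, today, half_life_days=180):
    """Return 1.0 for a page changed today, halving every half_life_days."""
    age_days = max((today - last_changed).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rank_with_freshness(results, today, half_life_days=180):
    """Order results by relevance discounted by how stale the page is."""
    def score(result):
        return result["relevance"] * freshness(
            result["last_changed"], today, half_life_days
        )
    return sorted(results, key=score, reverse=True)
```

A news search would probably want a much shorter half-life than a search for reference material, so the half-life is left as a parameter.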

See also

A new Method for Product Search - BrandPages


This page was last modified 13:22, 12 January 2008. GFDL