Whitelist

This is a list of sites that the community considers prime "must have" sites for the first crawl of the web. We are aiming for around 35 million URLs in the first index, which is of course very small (more to come soon!). The list of URLs, with notes about each of them, should evolve into a generally accepted consensus on a good starting set of sites to be deep crawled and included.

The list is divided into 3 groups. You can loosely think of these as "top ten", "top 50" (i.e. 40 more sites) and "top 100" (i.e. 50 more sites), but to avoid pointless edit warring, there is no need to stick too rigidly to "10" things in the first rank. The idea though is just to get some kind of rough prioritized list.

Sites should be good content sites, as opposed to seed sites. They should be the kind of thing people will want to find in a good quality search result: the "must have" sites. A section at the bottom will list good "seed" sites, in the sense of "sites with good external links but which are not really destination sites themselves".
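As a rough illustration of how a crawler might consume such a prioritized list, here is a minimal sketch. It is not the project's actual format or code: the `WhitelistEntry` type, the `seed_only` flag, and both helper functions are hypothetical names made up for this example, and the handful of hosts are simply copied from the lists on this page.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class WhitelistEntry:
    """Hypothetical record for one whitelisted site (not an official schema)."""
    host: str                # host name as written on this page
    rank: int                # 1, 2 or 3 for Class I, II, III
    seed_only: bool = False  # True for link-rich "seed" sites that are not destinations


# A few entries copied from this page; the ranks mirror the classes they appear in.
ENTRIES = [
    WhitelistEntry("bbc.co.uk", 1),
    WhitelistEntry("www.imdb.com", 1),
    WhitelistEntry("distrowatch.com", 2),
    WhitelistEntry("www.w3.org", 2),
    WhitelistEntry("esperanto.net", 3),
]


def crawl_order(entries):
    """Return content-site hosts sorted by class, keeping list order within a class."""
    content = [e for e in entries if not e.seed_only]
    # sorted() is stable, so entries of equal rank keep their original order.
    return [e.host for e in sorted(content, key=lambda e: e.rank)]


def rank_for_url(url, entries):
    """Return the class of the whitelisted host the URL belongs to, or None."""
    host = (urlsplit(url).hostname or "").lower()
    for e in entries:
        # Treat "www.example.org" and "example.org" as the same site.
        bare = e.host[4:] if e.host.startswith("www.") else e.host
        if host == bare or host.endswith("." + bare):
            return e.rank
    return None
```

With this sketch, `crawl_order(ENTRIES)` lists the Class I hosts first, and `rank_for_url("http://news.bbc.co.uk/2/hi/", ENTRIES)` matches the subdomain against the `bbc.co.uk` entry. Keeping seed sites in the same structure behind a flag is only one possible design; a separate list would work equally well.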

Class I

websites in many different languages

  • youtube.com - short video clips, mostly homemade
  • LibraryThing - books (including data from libraries around the world, and user-provided content)
  • bbc.co.uk - relatively few links to other sites, solid content site - normally it has links out to sites that relate to a particular story
  • indymedia.org - Decentralized news site. News reported by local people about the things that occur in their area.
This is one of the most POV sites I know; it surely has its value, but IMHO it's not suitable for Class I.
Are you kidding me? POV? There isn't such a thing as NPOV news, and anywhere that claims such a thing tends to be either the most purposely biased or entirely oblivious to its own blinding biases. Besides, if we're going to start talking about biases, I'm going to start questioning the inclusion of sites like myspace and youtube. Oh, and definitely don't forget ebay, amazon and microsoft: no one's more dishonest than when they're trying to scam you out of a buck. At least if you are misled by something on indymedia you're only going to be wrong, not lose your shirt. Also, please sign your comments. Iconoclast 15:20, 6 April 2008 (UTC)


  • www.imdb.com - Good for movies... is this the best choice? What else to include in this category? Not very useful as a crawl seed, since it has essentially zero links to other sites, good content site.
  • www.sourceforge.net - Open Source software development web site. Coming from the open source area, I think this is one that can't be missed -- this has few external links
Are ebay.com and amazon.com good choices for Class I? They really do not fit the good content (ie information) category as they are more of consumer e-commerce sites. Joe Jaxx 16:46, 31 December 2007 (UTC)
I tend to agree with you on Ebay but not Amazon. Amazon has a lot of useful reviews and such like that. However, I don't want anyone to accuse me of conflict of interest (Amazon is an investor in Wikia), so I defer to others. Certainly both Amazon and Ebay should be spidered, but whether they are top priority or not is a good question.----Jimbo Wales 22:53, 7 January 2008 (UTC)
Why should Ebay be spidered? What else except auctions is on their site? It should be included in the index if someone searches for auctions, but why spider all of their offers as well when there is so much out there which isn't spidered? --195.14.26.130 13:22, 8 January 2008 (UTC)
This is the reason I think bookzilla is in the right place here.

Others are:

  1. office.microsoft.com
  2. update.microsoft.com
  3. support.microsoft.com
  4. go.microsoft.com
  5. windowsupdate.microsoft.com
  6. download.microsoft.com
  7. search.microsoft.com
source (http://www.alexa.com/data/details/traffic_details/microsoft.com) --Cy 09:20, 21 April 2008 (UTC)
  • www.britannica.com - Excellent general reference source. Even though it is a subscription service, the first article is usually free when linked to.
  • www.grin.com - huge archive of scientific texts such as research projects, theses, dissertations, and academic essays: more than 77,700 as of February 2008, with over 1,000 new papers per month. The authors are free to decide whether they receive monetary benefits from their work or simply contribute their publication for the use of others. Topics range from 'aerospace technology', ... 'economics', ... 'health science', 'physics' ... to 'women studies / gender studies'. Available languages: German and English; French and Spanish coming soon.
  • www.euregio.org - Euregio Rhine-Waal aims at removing obstacles caused by inner-European borders. INTERREG IIIA

websites in English language

  • Forbes.com - Articles generally have good links to relevant content, even if it's only within the forbes.com site.
  • kuro5hin - a collaborative web magazine. Articles are created by Kuro5hin's users and submitted to a queue for evaluation.

websites in Portuguese language

  • Globo - website with news, entertainment and sports
  • Orkut - the most accessed social community in Portuguese

websites in German language

  • *.heise.de - a very popular German news site about almost everything in the IT industry
  • www.tagesschau.de - Germany's most popular television newscast (online edition is increasingly popular), by the first public broadcaster "ARD".
  • www.zdf.de - Site of the second public German news station "ZDF".
  • www.dw-world.de - German and European news, analysis and multimedia from Deutsche Welle - in 30 languages
  • www.spiegel.de - "Spiegel Online", a German Weblog (former online edition of the former most important German news magazine "Der Spiegel").
  • bookzilla - 5% of their income goes to the Free Software Foundation Europe
and I actually can't see why this page should be a good choice either.... Zierfish

websites in Italian language

Class II

websites in many different languages

  • CERN - Scientific information and links

websites in English language

  • *.censolution.org - Censolution, organizing the fight against censorship
  • distrowatch.com - list of many Linux distributions, lots of links to external sites, voted one of the best sites on the web.
  • www.ubuntuforums.org - Linux troubleshooting. I propose ubuntuforums because I know there is a lot of activity there, but feel free to choose another; my point is to have something for Linux troubleshooting.
  • Open-Access.net - Information on open access, free access to scientific knowledge
  • www.w3.org -- good entry point to the technical side of the Web
  • guardian.uk - British newspaper with lots of international news.
  • Jamendo - The most popular Free Music/Creative Commons Music label
  • Magnatune.com - Music label, CC-licensed middle-quality previews
  • The Numbers - movie database with fairly detailed info on about 10,000 of the top US and international movies. A decent selection of outbound links.
  • Rotten Tomatoes - database of movie reviews. Lots of outbound links to movie review sites, and some others.
  • mahalo.com - not really yet a "top tier" website, but has some decent stuff that is likely relevant to our visitors. Plus Jason is a friend and supporter of the project. :-)
So what? It's this sort of insider baseball thing that will cripple wikia. Is Wikia really about who you know?
Actually I wrote that as more of a joke, because Jason is not really a supporter of the project, but a competitor who likes to troll on the mailing list in favor of his proprietary alternative. :) But I like Jason nevertheless, and he seems to enjoy tweaking me, so this was just a way of teasing him back. (I assume he is looking at this, but I could be wrong.)----Jimbo Wales 00:09, 27 December 2007 (UTC)
  • deviantART - The largest art community site in the world, maybe even has potential for class I
  • www.today-in-history.de - History today and every day - Today in History covers historical topics, events, personalities and celebrities over time and history
  • www.walkingthetmb.com - Although this site appears at first glance to be centered on a single subject (the Tour du Mont Blanc), it actually contains a wealth of information for novice walkers wishing to embrace wilderness walking. In particular it provides a lot of practical kit advice, and as a non-commercial site it is well worth white-listing.
  • IGN - video gaming website

websites in German language

  • faz.net - One of the most important German newspapers (Orientation: Traditional)
  • taz - Important Left/Alternative German newspaper.
  • zeit.de - Very important German newspaper.
  • Gulli - important German cracker board (news, magazine, forum, etc.)
  • Fotocommunity - popular community for photographers in different languages
  • Region Köln Bonn - New German website: information, links, tourism, business, science, etc.

Class III

websites in many different languages

  • European Union - official gateway to the EU and all of its publications in different languages (and you'll never be able to find anything without search)
  • www.mods-ham.com - Useful and well-structured page with all kinds of information concerning amateur radio. Mixed German and English content.
  • esperanto.net - Multlingva Informcentro pri Esperanto (multilingual information center about Esperanto)
  • Tom's Hardware - A popular site for news about hardware releases and tests, software, and so on. Translated into common languages.
  • www.belgium.be - België - Belgique - Belgien
  • Moto-Notes - A toolbox of useful applications that are connected in some way with the automotive field

websites in English language

websites in French language

  • CEA LIST - Lab of applied research on software-intensive technologies
  • Pictosaic - Free online creator of photo mosaics

websites in Italian language

websites in Spanish language

websites in German language

full text repositories/online libraries

full text online lexica

dictionaries

  • Leo.org - most important German dictionary

legal

other

websites in Dutch language

  • www.datishetverschil.nl - A Dutch website where people can compare insurance policies.
  • www.aklik.nl - The leading Dutch search engine marketing firm
  • [2] - A Dutch insurance website for quote comparisons
  • www.geencentteveel.nl - A Dutch financial website with price comparison tools.
  • www.advogarant.de - A German website with a few thousand articles about law
  • [3] - A Dutch website about pension plans and savings
  • [4] - A Dutch website about insurance marketing
  • www.pieterbeens.nl - A Dutch website with articles and publications about politics, sociology and other society-related issues

Directory sites with good external links (seed sites)

User-driven sites with many external links

These tend to be biased toward techies.

Blog sites

Is there a way to use Technorati data or similar to get just the most "important" blogs for this first run?

Online Community

Forum / Chat

International Education Portals

Religion

Trustworthy Health and Science sites

I think at least one of these should go in Class I, but I don't have strong feelings as to which one.

Commercial interests

These are businesses which do not appear in the index:

  • www.hamisrad.co.il - ציוד משרדי (office supplies), online office shop.
  • www.novamedia.de - nova media offers cell phone management and internet access software for Macintosh computers running Mac OS X

Best online dictionaries

This is a list of the best dictionaries on the web:

Other

Associations

  • http://www.amarc.org/ --- AMARC is the world federation of free radios and is an international non-governmental organization serving the community radio movement, with almost 3000 members and associates in 110 countries.
  • http://freie-radios.de/ --- The "Bundesverband Freier Radios" (BFR) is the association of free radios in Germany. The BFR, founded in 1994, is a decentralized grass roots democratic organization. The BFR is associated directly with the federations of free radios in German-speaking countries (VFRÖ in Austria, and UNIKOM in Switzerland) and works indirectly with AMARC (the world federation of free radios), Indymedia, and others through individual members.
  • http://freie-radios.at/ --- The "Verband Freier Radios Österreich" (VFRÖ) is the association of free radios in Austria.
  • http://www.unikomradios.ch/ --- The "Union Nicht-Kommerzorientierter Lokalradios" (UNIKOM) is the association of free radios in Switzerland.
  • http://www.planetology.org/ --- EAGAPE European Association for Geosciences, Applied Planetology and Exobiology. Hamburg / Germany.
  • http://www.jugendrechtshaus.de/ --- Bundesverband der Jugendrechtshäuser in Deutschland (federal association of the "Jugendrechtshaus" youth law centers in Germany)
  • http://www.sakaryaaktuel.com/ --- Online Web

Retrieved from "http://search.wikia.com/wiki/Whitelist"

This page was last modified 22:47, 11 May 2008. GFDL