Talk:Whitelist

What is the best approach to deciding what should go into this initial crawl?----Jimbo Wales 22:03, 17 December 2007 (UTC)

For starters, this is a test. Will Wikia Search act as a censor? How about this? DO NOT VOTE FOR OBAMA : [1] --unsigned comment from anonymous IP number

No, we will not act as a censor. Now, my guess is that particular website is not one that people would consider to be part of the "core" most important sites on the net, and the purpose of the whitelist is to find sites that are "must haves" for an initial crawl of the web.----Jimbo Wales 22:52, 17 December 2007 (UTC)

Jim, I agree. Feel free to delete if you like. Like I said, this post was made by me to test the search engine's fences. It is a sign of things to come. I know you are aware. Wikipedia vandalism will pale in comparison. But I am so certain this project will be the best in the world. I guarantee it. You have prematurely revolutionized the Internet. I am merely forecasting it. - Chris Desouza


Note that it is important to distinguish between two different phenomena: 1. sites (locations) and 2. websites (content available at locations). Strictly speaking, domain names are also "content", namely the data registered at the registry level. From this perspective, the "location" for the content is the URL field of the software application used to navigate the WWW. A whitelist consisting of sites (domain names) is different from a whitelist consisting of websites (so-called "content"). Most search engines index websites (by creating cached copies of the websites) rather than indexing the sites. However, the ranking algorithms commonly place the primary value on the domain name -- therefore, a search for "Amazon" will most probably return the amazon.com site (domain name) rather than a website (content) about the concept "Amazon" (a river in South America). ----Websites

Important update

These are to be content sites, not seed sites. The point is, we will launch with a fairly limited dataset due to current hardware and time constraints. So while dmoz.org is a good "seed" site, it is a poor "search result" site.----Jimbo Wales 00:31, 18 December 2007 (UTC)

Why not give .us the first look? It is a ccTLD and it would be easy to rapidly build a whitelist of .us sites. There are only 1.3M or so .us domains, so the number of actively developed websites would be a lot lower (less than 50% at a guess). The danger, this early in a search engine project, is trying to spider everything. And with ccTLDs becoming more important, a set of good ccTLD search engines might be better than one large search engine that has OK results.--Jmcc 05:11, 18 December 2007 (UTC)

Only English pages?

Just a question, because I am not really involved with the plans for Wikia Search (there is not much time left between constructing http://www.zeno.org and writing articles like Nordiberische Kreuzotter): it seems that the first crawl will include only English pages (I am confused that only en.wikipedia is listed and not de, fr and other big communities). Will this be the direction for the future, or only for the startup? Greetings from Berlin -- Achim Raschka 09:26, 18 December 2007 (UTC)

(p.s.: @ Jimbo: there is a lovely little text about you in the German Spiegel: They quoted you as being a snob and tell us, that core wikipedians think that "most internet users are idiots and shall not write articles" – Ummh, I think, I can agree with you ;O)

IMDB disallows crawling?

Insofar as I know (from prev. experience) imdb.com does not permit crawling -- does Wikia have some sort of relationship to permit it? - dizzyd
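Whether a site permits crawling is normally declared in its robots.txt file, so the question above can be checked mechanically before a site goes on the whitelist. The following is only a minimal sketch using Python's standard `urllib.robotparser`. The sample robots.txt text, the crawler name `ExampleCrawler` and the example.com URLs are assumptions for illustration, not the actual rules served by imdb.com or the real Wikia Search user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only; this is NOT
# the actual file served by imdb.com or any other listed site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleCrawler" is a made-up user agent name.
    print(may_crawl(SAMPLE_ROBOTS_TXT, "ExampleCrawler",
                    "http://www.example.com/title/"))
```

In a real crawl, the robots.txt text would be fetched from each candidate site first. A site that disallows everything would need an explicit agreement before it could serve as a content site.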

Images

Is Wikia Search going to crawl images? Wikimedia Commons isn't on the list, and image namespace in Wikipedia isn't mentioned. --Emijrp 20:11, 30 December 2007 (UTC)


Sites with to-be-paid-for content

I noticed Vlex mentioned as a suitable site. I'd not heard of it. On looking, it seems you have to be a paid-up subscriber to access its contents. Is that a reason for not including it? --User:Sebs 01.47, 2 January 2008

Another issue is sites with paid-for links. I've added startpagina.nl as it is the largest Dutch directory, but I know that a small portion of its links are paid for by advertisers. Please undo my addition of startpagina.nl if this disturbs the search results. Merlijn 22:22, 12 January 2008 (UTC)

How about AP and Reuters directly?

http://www.reuters.com/ and http://www.ap.org/: what tier would they be in? There is also Agence France-Presse (http://en.wikipedia.org/wiki/Agence_France-Presse), with an English-language website here: http://www.afp.com/english/home/ There is also the (it appears) primarily German-language Hugin Group: http://www.hugingroup.com/# 76.180.120.161 11:20, 2 January 2008 (UTC)

Huh?

I'm confused. Isn't this all part of an algorithm which is fed by user rankings? Isn't a single global wiki page sort of an odd way to rank them? How does it become customized for a wide variety of people? I'm stuck at: One page, one vote. Seems like a database is more in order than a wiki page.

This page is for the websites to be mined/crawled by the project and then used as search results. It has nothing to do with rankings, apart from the small difference in size and the order in which they should be crawled. --Markie 11:25, 3 January 2008 (UTC)

Just the most "important" blogs

Is there a way to use Technorati data or similar to get just the most "important" blogs for this first run? The easiest way to go about it is to look at posting frequency (only read the RSS/Atom feed). When a blog has not been updated recently and does not have a certain volume, it is not very relevant. Technorati is utterly out of date and their supposed ping/update function works unreliably. Only use it if you want the state of the blogosphere of 2006.
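The posting-frequency idea above can be sketched roughly as follows, assuming a plain RSS 2.0 feed whose items carry a `pubDate`. The function name `recent_post_count`, the 30-day window and the sample feed are all made up for illustration. Atom feeds use different element names and would need separate handling.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed for illustration; not taken from any real blog.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>a</title><pubDate>Mon, 07 Jan 2008 10:00:00 +0000</pubDate></item>
<item><title>b</title><pubDate>Tue, 25 Dec 2007 09:00:00 +0000</pubDate></item>
<item><title>c</title><pubDate>Wed, 01 Nov 2006 12:00:00 +0000</pubDate></item>
</channel></rss>"""


def recent_post_count(rss_xml: str, now: datetime, window_days: int = 30) -> int:
    """Count RSS items whose pubDate falls within the last window_days."""
    root = ET.fromstring(rss_xml)
    cutoff = now - timedelta(days=window_days)
    count = 0
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            continue  # items without a date cannot be judged
        # RSS 2.0 dates follow the RFC 822 format that email.utils parses.
        if parsedate_to_datetime(pub) >= cutoff:
            count += 1
    return count


if __name__ == "__main__":
    now = datetime(2008, 1, 10, tzinfo=timezone.utc)
    print(recent_post_count(SAMPLE_FEED, now))  # -> 2
```

A blog would then qualify for the first run only if this count exceeds some threshold. What threshold counts as "a certain volume" is left open in the comment above.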

Important websites to be added?

Hi All,

I am missing some important pages from my country of origin:

http://www.pagina.nl/ (most important link directory)
http://onderwerpen.startkabel.nl/ (link directory)

With these 2 sites you should be able to index about 80% of Dutch websites. (Via 3rd or 4th level probably 99%)

This site is important for education: http://www.kennisnet.nl/

Grimweb 15:04, 7 January 2008 (UTC)

What is to be considered Class I and not

I think something has to be wrong when it comes to placing search engines as Class I sites. A search engine should generally not be included, since the point is to create a search engine with better results. I can agree that some parts of Google should be included, like the open source parts, but if the rest of Google is to be indexed then all other search engines should be at the same level, regardless of whether they are some kind of kinkysearch.xxx. I noticed some other Swedish search engines like hitta.se were included, and I think they should not be included.

/ Alf Lövbo, Sweden (too lazy to register (maybe later))

Hungarian sites

I added some sites from Hungary to the end of the lists. I think these are important sites that we use frequently, and with them Wikia may reach almost all of the important sites in Hungary :)

make better results

Hello, and sorry for my questions,
but:

  1. How can users make the search index better at this time?
  2. Which pages are good enough for the whitelist (with regard to small pages)?

Thank you for your help, Conny 17:33, 10 January 2008 (UTC).

a different way of compiling sites

Forgive my ignorance if this is already being done somewhere, but what if a tool were provided for users to simply upload their Favorites/bookmarks? That would seem to provide a quick and easy way for users to vote on what they believe are the important/useful sites and then the sites could be broken out into Top 50, Top 100, etc.

If this exists in some form already, would somebody please direct me to it? Thanks!
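The bookmark idea above amounts to a simple tally: each uploaded bookmark list counts as one vote per host, and the hosts are then ranked by vote count. The following is a minimal sketch of that counting step only. The function name `tally_bookmarked_hosts` and the sample bookmark lists are hypothetical, and parsing of real browser export formats is left out.

```python
from collections import Counter
from urllib.parse import urlparse


def tally_bookmarked_hosts(bookmark_lists):
    """Count, for each host, how many users bookmarked at least one of its URLs."""
    votes = Counter()
    for urls in bookmark_lists:
        # A set gives each user at most one vote per host.
        hosts = {urlparse(url).hostname for url in urls}
        hosts.discard(None)  # drop entries that had no parsable host
        votes.update(hosts)
    return votes


if __name__ == "__main__":
    # Made-up bookmark lists from three hypothetical users.
    uploads = [
        ["http://en.wikipedia.org/wiki/Main_Page", "http://www.reuters.com/",
         "http://en.wikipedia.org/wiki/Search"],
        ["http://en.wikipedia.org/", "http://www.kennisnet.nl/"],
        ["http://www.reuters.com/", "http://en.wikipedia.org/"],
    ]
    for host, count in tally_bookmarked_hosts(uploads).most_common(2):
        print(host, count)
```

Taking `most_common(50)` or `most_common(100)` from the result would give the Top 50 and Top 100 breakdown suggested above. Counting one vote per user per host is a design choice made here so that a single user with many bookmarks on one site does not dominate the tally.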

IP edits

Hi, I think we should block the Whitelist for IP editors. Anybody who adds a URL to the whitelist should at least have a user account. Regards --Cy 18:55, 7 March 2008 (UTC)

Are there different opinions on this? Regards --Cy 10:36, 8 March 2008 (UTC)

I think we should generally rate each new site proposed for the whitelist. I don't know how to implement it. My idea: a list of new requests for the whitelist, where every page can be rated by every user who has a user account. Only the good sites would be included. But this is only an idea, and a lot of work, I know. Please post your ideas here. I think this is very important for the search results. Regards --Cy 20:50, 9 March 2008 (UTC)
Please give me your opinions on that. It's very important to make some changes to this list. The Whitelist has a Google PageRank of 4 and this talk page has a 3. Because of that, every webmaster feels invited to post his links, even ones that do not have good content. Regards --Cy 11:24, 13 May 2008 (UTC)

Wikibooks

I think *.wikibooks.org REALLY should be included. There are many completed and nearly completed books there! -- MichaelSchoenitzer 19:11, 11 March 2008 (UTC)


Blacklist

Although this could be difficult to maintain, there should be a blacklist for spam sites etc. -- NoodlesNZ


Torrent sites?

There is a single torrent site on the list (TPB). While torrent sites don't have a lot of text content, they are content sites of a kind. They are also likely to be the target of searches. 91.152.48.20 15:50, 6 June 2008 (UTC)

Retrieved from "http://search.wikia.com/wiki/Talk:Whitelist"

This page was last modified 15:50, 6 June 2008. GFDL