Search Team Update: August 13, 2008

August 13th, 2008 by Dan Lewis

Here’s the update from the last week (or so) of Search Team goodness:

Toolbar
* Wikia Evolution, the search toolbar, launched.  Download it at http://re.search.wikia.com/toolbar/download.html
* Want to tinker with the code?  Grab it via SVN at http://svn.swlabs.org/re.search/cool/toolbar/ - there’s no documentation yet, but we’re working on it.  The test .xpi (evolution.test.xpi) is an ordinary .zip archive, so any unzip tool will open it.
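
If you would rather unpack the .xpi from a script than from a desktop unzip tool, here is a minimal Python sketch.  It relies only on the fact that an .xpi is a ZIP archive; the function name `unpack_xpi` and the output directory are our own placeholders, not part of the toolbar code.

```python
import zipfile
from pathlib import Path


def unpack_xpi(xpi_path, dest_dir):
    """Extract a Firefox .xpi (a ZIP archive) and return its member names."""
    dest = Path(dest_dir)
    # Create the target directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(xpi_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

With evolution.test.xpi in the current directory, `unpack_xpi("evolution.test.xpi", "evolution-unpacked")` would write the add-on's files into `evolution-unpacked/` and return the list of paths it found.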

UI Stuff
(If you’re not on the search-ui mailing list, that’s at http://lists.wikia.com/mailman/listinfo/search-ui)
* More work on widget/application framework
* Working on a search engine comparison tool
* Started on a “light” fork to the results UI.

Atlas
(There’s a distribution list for Atlas at http://lists.wikia.com/mailman/listinfo/atlas-l and a wiki page about the project at http://search.wikia.com/wiki/Atlas.  Check out both.)
* Updated the Atlas protocol spec for “knuggets”
* A lot of the prototypes are starting to come together.  Atlas-l has a discussion starting at http://lists.wikia.com/pipermail/atlas-l/2008-July/000092.html and the SVN is at http://svn.swlabs.org/atlas/.

Operations
* Assisted with Nutch re-index
* Started a new crawl
* Fixes to KT importer, added the ability to load/populate the new location table
* Built a new 0.1.3 HBase cluster, loaded it with a production data snapshot, populated the new location table, and set up new KT code (with new features) pointed at the new cluster with the new data (kt.search.isc.org/ktdev/)
* Tweaked lots of system monitoring
* Lots of work with the crawler, trying to find the source of very high fetch failure rates
* Deploy-redeploy of KT /ktdev/, started review of code
* Determining new hardware requirements
* Bind updates
* More work on Grub

Nutch
* Finished test rollout of new indexing and scoring systems.
* Started working on shard management servers.
* Started work on pornography and bad content identification.
* Started integration of kt input into analysis algorithms.
* Documentation, bug fixes, and unit tests for new scoring and indexing frameworks for Nutch.  Working to get final patches submitted and committed into the Nutch core.
* Finished new crawl, working on deployment and roll-out of new indexing and scoring systems to test
* Finished all patches and code documentation for new scoring and indexing systems.  Everything has been submitted to Nutch for inclusion in the Nutch code distribution.
* Finished modifications to FieldIndexer, including a field filter extension point and the field-basic and field-boost plugins that integrate arbitrary boosting with the new indexing framework.

Other Stuff
* Improved contact importer for the social tools
* Working on a Facebook application

With Great Power Comes Great Responsibility

August 12th, 2008 by Dan Lewis

James Grimmelmann is an adjunct professor of law at New York Law School here in Manhattan.  His recent paper, “The Google Dilemma” (available via SSRN), tells five stories about the power Google currently wields, and why Google, or whoever “controls” search, needs to wield that power with great responsibility.

The paper is a quick read — even with footnotes, it’s a mere 11 pages — so it’s not fair of me to republish it here.  I’ll certainly have more thoughts on it later on, though.

Around the Blogs: Wikia Evolution

August 11th, 2008 by Dan Lewis

Our announcement about our new Wikia Evolution toolbar definitely made the rounds.  Here are some of the blog posts about it — feel free to use the toolbar to add these posts right to the Wikia Search index :)

You can download the toolbar here.

Introducing Wikia Evolution

August 6th, 2008 by Dan Lewis

One of our core values at Wikia Search is Community.  We want everyone to be able to participate in the Wikia Search project.  That’s why we are proud to introduce Wikia Evolution, our new Firefox toolbar.  You can download it here, or via Mozilla’s Firefox Add-On Library.

The mission of Wikia Evolution: To empower users to interact with search.

We want to make it dead-simple for you to add URLs into our index under appropriate keywords.  Already, we’re on the cutting edge when it comes to incorporating user feedback into our search results, so much so that Google is experimenting with eerily similar features.  Wikia Evolution pushes the envelope even further.  It allows you to quickly and easily add the web page you are on into Wikia Search, directly from your browser, for whatever keyword is appropriate.  Instant indexing!  Then, you can modify the search result to make it really killer, all without leaving the page you’re on.

Letting everyone modify search results pages is of fundamental importance — but you can’t on either Google or Yahoo.    We can’t change that, sadly, but Wikia Evolution does the next best thing.  Using Wikia Evolution, you can add and rate URLs directly from Google or Yahoo, and those contributions will be immediately incorporated into Wikia Search.



Jimmy left a comment on the Mozilla download page, and I think it bears repeating:  “This toolbar, like everything we are doing at Wikia Search, is open source. We hope that if you are a toolbar fan and programmer, you will let us know what features need to be added and/or take this and do something surprising and cool with it.”  Community: Let’s make it happen, together.



Knol Proves The Importance of Transparency

July 29th, 2008 by Dan Lewis

From day one of the Wikia Search project, the Wikia Search community collectively brainstormed the core principles of the project: the things that search currently lacks and needs.  One of them, transparency, is needed now more than ever.

Last week, Google released its new content endeavor, Knol: a platform by which anyone can write up a page about a topic, invite others to help, and make some pocket change using Google’s AdWords platform.  Google, of course, also makes some coin off those AdWords ads, and with the long tail working to their benefit, can cash in big time.  The only trick?  How to get traffic to all these new Knol pages.

Well, they happen to have a pretty big search engine, and already, some people are noticing that Knol entries tend to do well in Google Search results.  Jason Calacanis is probably the perfect person to point out the flaw: he’s no fan of ours here at Wikia Search (snif!) and admits that he’s a “Google man” who “love[s] the Google”, but even he is concerned.  Yesterday, his screen shot of “how to backpack” made its way around the web, replete with an ominous tag line, showing that in just five days, a Knol made it to the top of the relevant search result.

But let’s face it, Google is not going to re-write their algorithms to favor Knol.   It’d be mindbogglingly idiotic to do, and more importantly, unnecessary.   Why?  Because Knol already has an advantage that you, I, and the rest of the non-Google world don’t have — access to Google’s search team, and to that algorithm itself.

For most people (us average beings), Google recommends that you work with a Search Engine Optimization specialist (“SEO”).  No, not explicitly, but read that page and you’ll see that (a) they don’t directly answer the question as to whether one should hire an SEO and (b) very little on that page is on point in general.  The small part that is says this:

A great time to hire [an SEO] is when you’re considering a site redesign, or planning to launch a new site. That way, you and your SEO can ensure that your site is designed to be search engine-friendly from the bottom up. However, a good SEO can also help improve an existing site.

But the fact is that SEOs do not know, exactly, how Google’s algorithm works.  Only one company does: Google itself.  And at some point, if it has not happened already, someone from the Google Search team and someone from the Google Knol team will get together and give Knol a big lesson in SEO.  Maybe it will be an explicit, high-level decision.  Maybe it will just be two people, one from each team, sitting down for lunch with the Search guy saying “hey, if you want to give your stuff a boost, do <muffled sounds>.”  Maybe it happened six months ago.  Maybe it will happen in three years.  Who knows?  All we know is two things:

  1. It’s only possible because Google hides their algorithm from the non-Google world.  If everyone could do it, Knol would have no appreciable advantage.
  2. It’s inevitable.   Even if Google’s corporate powers-that-be mandate that the two groups not mix nor mingle, the knowledge that flows through those halls will be impossible to shutter.

The solution?

Open source that algorithm, and everyone, Knol included, is on a level playing field.  All accusations of impropriety go away, and the inevitable occurrence of Team Knol benefiting from private lessons with Team Search is instantly moot.

Transparency.  Search demands it.

What’s the Other 75 Percent? Blogs!

July 28th, 2008 by Dan Lewis

Last Friday, Google announced that, according to them, the web has one trillion unique URLs.   But Google’s index, quoth their blog, is not that big:

We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers.

Google does not tell us how large their index is, but they do claim to be “proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.”

Meanwhile, this morning marked the launch of Cuil, another search engine start-up.   Like Google, it’s a closed-source, read-only, hardly-transparent search engine.   And also like Google, Cuil claims to have the most comprehensive index out there, with over 120 billion URLs ready for the picking.   Naturally, we expect a bit of a PR grudge match here — a muted one, of course, but one nonetheless.  So let’s give Google the credit it is due, and say that they’re beating Cuil by, oh, just over 2:1.  That is, let’s just make up a number and assume (probably incorrectly) that Google’s index is 250 billion web pages.

One trillion unique URLs.

250 billion indexed.

What’s the other 75%?

Blogs.  Well, not all of the 75%.  In fact, not even close.  But a lot of blogs, for one reason or another, do not make it into Google’s main index.  Google tells us as much, via its blogsearch tool at http://www.google.com/blogsearch/.

Those are the blog search results for three top blog hosting domains, and they total just under 7.5 billion URLs, or, per Google’s one-trillion count, about 0.75% of the whole web.

Those?  The same results, using Google’s regular search engine.  Just over half a billion URLs.

So either (a) Google’s numbers are junk or (b) we just accounted for 7 billion missing URLs.    Maybe Cuil has them?  Don’t know.   But with 7 billion missing blog posts, there’s a lot of work to be done! So, here’s how to add your blog to Google’s search index:

Yeah, I don’t really know.
Here’s how you add your blog to Wikia Search:

  1. Go to http://search.wikia.com
  2. Type in a search term where you think your blog should come up.
  3. If it’s there, yay!  Go back to step two for another search term.
  4. If not?  No worries.  Just add the relevant URL into the “Add to this result” field on the right.  It looks like this:
  5. Hit “add”.
  6. Clean up the entry if you don’t like how we auto-populated it.
  7. Go back to step two for another search term.

Will we get all 7 billion blog posts that way?  Of course not.  But we will be sure to index yours — and it only takes about 10 seconds.   Now that really is cool.
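
For anyone who wants to check the back-of-the-envelope math in this post, here is a small Python sketch.  The figures are the ones quoted above, rounded; the 250 billion Google index size is the made-up assumption from earlier in the post, not a real number, and the variable names are ours.

```python
TOTAL_URLS = 1_000_000_000_000          # Google's "one trillion unique URLs"
CUIL_INDEX = 120_000_000_000            # Cuil's claimed index size
ASSUMED_GOOGLE_INDEX = 250_000_000_000  # made-up guess from the post, not a real figure
BLOGSEARCH_HITS = 7_500_000_000         # "just under 7.5 billion", rounded
WEBSEARCH_HITS = 500_000_000            # "just over half a billion", rounded


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the web that would sit outside the assumed Google index.
unindexed_pct = share(TOTAL_URLS - ASSUMED_GOOGLE_INDEX, TOTAL_URLS)
# How far ahead of Cuil the assumed Google index would be.
google_vs_cuil = ASSUMED_GOOGLE_INDEX / CUIL_INDEX
# Blog search hits as a share of the trillion-URL web.
blog_pct_of_web = share(BLOGSEARCH_HITS, TOTAL_URLS)
# Blog URLs found by blog search but not by regular search.
missing_blog_urls = BLOGSEARCH_HITS - WEBSEARCH_HITS

print(f"Outside the assumed index: {unindexed_pct:.0f}%")
print(f"Assumed Google vs. Cuil: {google_vs_cuil:.2f} to 1")
print(f"Blog search hits as a share of the web: {blog_pct_of_web:.2f}%")
print(f"Blog URLs missing from regular search: {missing_blog_urls:,}")
```

Under those rounded inputs, the gap between blog search and regular search is the 7 billion URLs discussed above.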

Search Team Update: July 23, 2008

July 23rd, 2008 by Dan Lewis

Here’s what the search team did last week:

Nutch:
* Finished FieldIndexer
* Finished BasicFields
* Working on AnchorFields

These are all part of the new Indexer that will allow fine-grained control of the fields that go into our index. The FieldIndexer is the actual indexer itself, which replaces the current Nutch indexer. The BasicFields job replaces the current Nutch functionality for fields from the indexer and the BasicIndexingFilter plugin. The AnchorFields job both replaces the current AnchorIndexingPlugin and enhances it to allow analysis and ordering by score of the anchors to be indexed. The AnchorFields job should be finished, tested, and ready for larger deployment early this coming week.
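
To give a feel for what “ordering by score of anchors” could mean in practice, here is a small Python sketch.  It is an illustration only and not the AnchorFields code or any Nutch API: the `Anchor` type, the `select_anchors` function, the case-insensitive de-duplication, and the `max_anchors` cap are all assumptions of ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical inbound anchor: its link text and a link score."""
    text: str
    score: float


def select_anchors(anchors, max_anchors=3):
    """Return the highest-scoring distinct anchor texts, best first."""
    best = {}
    for anchor in anchors:
        # Treat anchors that differ only by case or padding as the same text.
        key = anchor.text.strip().lower()
        if not key:
            continue  # skip empty anchor text
        if key not in best or anchor.score > best[key].score:
            best[key] = anchor
    # Highest score first, then keep only the top few for the index.
    ranked = sorted(best.values(), key=lambda a: a.score, reverse=True)
    return [a.text for a in ranked[:max_anchors]]
```

The idea is simply that low-value anchor text (“click here”) ends up at the bottom of the list or is dropped, while the best-supported text is what gets indexed.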

Search Tools:
* Worked on a new, experimental “light” fork to the results UI
* Lots of work testing new KT tools in development
* Brainstorming about the widget framework and how to speed up results
* Lots of work with the crawler, trying to find the source of very high fetch failure rates
* Continued development of the toolbar

Community Tools:
* Began work on a contact importer
* Created interface for translating Wikia Search interface

Operations:
* Deploy-redeploy of KT /ktdev/, started review of code
* Started work on determining new hardware requirements

Grub: A Roadmap

July 23rd, 2008 by Jeremie Miller

Grub, the open source web crawler, is a key part of Wikia Search and of a community-benefiting web search experience in general.

My larger vision is pretty simple. I want Grub to be a decent and open snapshot of the web that is kept fresh, and then stored/shared in ways useful to the whole community — to individuals and companies alike, whoever is contributing.  One way of doing this is having the data in Hadoop and running community-directed MapReduce jobs on the whole dataset at the ISC, with the results being openly available to anyone to use.

I think we are getting close to at least the last part of that: Seth almost has the ARCs being uploaded into HDFS.  Things get a little more complicated after that, as we need both to write MapReduce jobs that will slice or index the ARCs in some useful ways and to get something that better creates the workunits from that dataset.
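
As a rough illustration of the kind of job that could “slice” crawl data, here is a toy map/reduce in plain Python that counts crawled records per host.  It is a sketch under stated assumptions, not Grub or Hadoop code: the `(url, content)` record shape and the names `map_record`, `reduce_counts`, and `run_job` are hypothetical, and a real job would read ARC records from HDFS rather than from a Python list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url, content):
    """Map step: emit (host, 1) for each crawled record with a parseable host."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_counts(host, counts):
    """Reduce step: total the counts emitted for one host."""
    return host, sum(counts)


def run_job(records):
    """Run the map, group-by-key, and reduce phases in memory."""
    grouped = defaultdict(list)
    for url, content in records:
        for key, value in map_record(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(host, counts) for host, counts in grouped.items())
```

The same map and reduce functions could, in principle, be ported to a real MapReduce framework; only the in-memory `run_job` driver would be replaced.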

And after that?  Then, everyone can do the interesting things which Grub’s promise holds: things like experimenting with content categorization, finding similar pages, detecting language and microformats, and so forth.  Anyone will be able to use Grub to help the community produce an open and fresh copy of the web, accessible to developers and researchers building new tools and making new discoveries.

It’s great to see the project making progress, even if slowly.  It is a really big task to take on, but if we keep it simple and keep trying, we’ll get there!

Tonight: Wikia Search at the NYC Creative Commons Salon

July 23rd, 2008 by Dan Lewis

What: Creative Commons Salon NYC

Where: The Open Planning Project, 349 W. 12th Street, New York City

When: TONIGHT!  7pm to 10pm

Who:  Us!  We’re presenting on Wikia Search, alongside the Livable Streets Network and comedian Max Silvestri.

What else: Free beer!

RSVP? Via Facebook

Yes, the Search Query Counter Went Backward. Our Bad.

July 21st, 2008 by Dan Lewis

Oops.

About two weeks ago, we announced that the Wikia Search project hit the 2,000,000 query mark.  We were wrong.  Some funkified caching on our end caused the counter to jump a lot more often than it really should have, and, well, jump it did.  That’s the real reason.  Here are the fake reasons we also came up with, which are more entertaining but, again, dirty dirty lies:

  • The last million queries sucked, so we took them out
  • We had 1 million in query writeoffs due to bad loans
  • It was tied to the Zimbabwean dollar
  • We’re also working on an open-source time machine

We’re really at just north of 1.975 million queries, and the counter now reflects the true number.   The next 2,000,000 though — that’ll be the real one.  Sorry.
