lunaru.com | Thoughts on Development

Nov/05

4

Thoughts on Google and Search

The boom around search has come in large part from the financial success of search pioneers like Google. In celebrating this newfound interest in the web, it’s worth stepping back to look at the nascent industry, reflect on how it got to where it is now, and perhaps draw inferences about where it might be going.

It all began with the first “spiders” or “crawlers” that automated the process of exploring the unknown of the web. The first crawler-based engines used classical information retrieval techniques to match documents to search phrases. These techniques used some components of the well-known “tf.idf” measure (as it is referred to in IR circles) to score documents based on how frequently a term appears in a document and how rare that term is across the whole collection. These scores, cached for each document/term pair, provided a basis for the earliest web ranking techniques, but were too generous to overly “optimized” pages. Being a classical technique, standalone tf.idf was suited to older applications, where the environment wasn’t necessarily as dynamic as the web. But because the web offered such an opportunity for anyone looking to publish content, the search engines began waging what would be an uphill battle against the spamming (aka optimization) community intent on making a buck by gaming search engine results.
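To make the weakness concrete, here is a minimal sketch of a tf.idf-style scorer. It is not what any particular engine ran; the function name `tf_idf_scores`, the whitespace tokenizer, the log-based idf, and the three toy documents are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query):
    """Score each document against a query with a plain tf * idf sum.

    Hypothetical helper for illustration only: tokenization is a simple
    lowercase whitespace split, and idf is log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)      # how frequent the term is here
            idf = math.log(n_docs / df[term])    # how rare the term is overall
            score += tf * idf
        scores.append(score)
    return scores


# Toy collection (invented): the first "page" just repeats the keywords.
docs = [
    "cheap flights cheap flights cheap flights cheap flights",
    "a guide to finding cheap flights and hotels",
    "notes on classical information retrieval",
]
scores = tf_idf_scores(docs, "cheap flights")
```

In this toy collection the keyword-stuffed first page outscores the genuinely useful second one, which is exactly the kind of generosity towards “optimized” pages described above.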

Enter “Backrub”, now known as Google. To fight off the gross irrelevance being returned by the search engines of that time, the founders of Google looked at how the link structure of the web could be exploited to produce more relevant results. The intuition was that pages with a high count of backlinks are generally more popular (also known as the random walker argument: someone clicking links at random ends up on those pages more often), and thus more likely to be authoritative on any given subject, provided that the page was relevant to the subject in the first place. Combined with the classical techniques of identifying key terms in a “secret sauce”, this new PageRank system provided astoundingly accurate results for most queries and skyrocketed Google to what it is today. Of course, the generalized idea here is that this success came in large part from the way the PageRank system mitigated the local optimization efforts of spammers on their own webpages.
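The random walker argument can be sketched in a few lines. This is only a toy power iteration over a made-up four-page link graph, not Google’s actual implementation; the function name `pagerank`, the page names, the iteration count, and the damping factor of 0.85 (a commonly quoted value) are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key. Illustrative sketch only.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Otherwise the walker follows one of the page's links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented graph: three pages link to "hub"; nobody links to "c".
links = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(links)
```

Here “hub” ends up with the highest rank because three pages link to it, while “c”, which nobody links to, is left with only the random-jump share. Stuffing keywords into “c” would not change that, which is why on-page tricks alone stopped working.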

Thus, if you look at the success of PageRank, you’ll see that it was really an arms escalation on the part of the search industry to fight off the spammers. This essentially bought some time for the industry, but it is very apparent that the spammers have caught up. Using a combination of tricks to farm PageRank for pages, “search engine optimization” efforts have, to a large degree, closed the gap on Google. The battle has yet to be decided, and it is hotly debated in IR circles whether or not there is even a true solution, short of human intelligence, that can deal with the rising sophistication of gaming techniques. As one would expect, new methods for fighting off search spam are being researched and developed in both industry and academia.

But what really needs more reflection is whether or not PageRank is still as relevant to the success of a search engine as it was when it was first invented. Of course there will always be something to be said for the democratic backlink model, because it’s just intuitive that a page is about x if everyone says it’s about x. But the converse is not true. That is, it can’t necessarily be said that a page is not about x just because no one recognizes it as being about x. The idea can basically be summed up this way: because PageRank weights the importance of backlinks so heavily, it might be ignoring what can be referred to as the long tail of search. In the IR industry, this is of course known as sacrificing recall for precision.
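For anyone outside IR circles, the precision/recall trade-off is easy to show with a small calculation. The page names and the two result lists below are invented purely for illustration, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list.

    precision = share of returned pages that are relevant
    recall    = share of relevant pages that were returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four pages are truly about x, two of them well linked.
relevant = {"popular1", "popular2", "obscure1", "obscure2"}

# A backlink-heavy ranking that only surfaces the well-linked pages.
strict = ["popular1", "popular2"]

# A broader ranking that reaches the long tail but lets some spam through.
broad = ["popular1", "popular2", "obscure1", "obscure2", "spam1", "spam2"]

strict_p, strict_r = precision_recall(strict, relevant)
broad_p, broad_r = precision_recall(broad, relevant)
```

The strict list is perfectly precise but finds only half of the relevant pages, while the broad list finds all of them at the cost of returning spam. The obscure pages are the long tail that a backlink-heavy formula risks missing.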

Even if PageRank was effective in fighting off spam in 2000, it cannot be considered just as effective today. In fact, it is quite clear, even to an outsider, that the issue of spam has to be handled in an entirely different manner for Google today. That said, it raises the question of whether new models of search can be built that diminish the role backlinks play and instead focus on a “smarter” way of categorizing pages, and in one fell swoop solve the problems of topical ranking and spam. From what I hear (and I could be totally wrong), http://www.kosmix.com/ is one such project, focused on high recall by not necessarily letting backlink-based PageRank be the major ingredient of the formula.

To sum up, what is evident to me is that search has to gradually change as consumer expectations change. And as time goes on, it is worth reflecting on and re-evaluating the ubiquitous techniques of the day to see if they will still be relevant tomorrow. However, I trust the guys at Google are well aware that resting on their laurels, especially in a dynamic and nascent field, is not an option.
