Archive for November, 2005

Thoughts on Google and Search

Friday, November 4th, 2005

The boom around search has come in large part from the financial success of search pioneers like Google. In celebrating this newfound interest in the web, it’s always interesting to step back, look at the nascent industry, reflect on how it got to where it is now, and perhaps draw inferences about where it might be going.

It all began with the first “spiders” or “crawlers” that automated the process of exploring the unknown reaches of the web. The first crawler-based engines used classical information retrieval techniques to match documents to search phrases. These techniques used some components of the well-known “tf.idf” measure (as it is referred to in IR circles) to score documents based on the frequency of a term within a document and the rarity of that term across the collection. These scores, cached for each document/term pair, provided a basis for the earliest web ranking techniques, but were too generous to overly “optimized” pages. Being a classical technique, standalone tf.idf was suited to older applications, where the environment was not necessarily as dynamic as the web. But because the web offered such an opportunity for anyone looking to publish content, the search engines began waging what would be an uphill battle against the spamming (aka optimization) community intent on making a buck by gaming search engine results.
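To make the weakness concrete, here is a minimal sketch of one common tf.idf variant (term frequency normalized by document length, times the log of inverse document frequency). The function name, the toy documents, and the exact weighting are my own assumptions for illustration, not the formula any particular engine used. It shows how a page that simply repeats a phrase outscores a more natural page for that term.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Return {doc_index: {term: score}} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = {}
    for i, tokens in enumerate(docs):
        counts = Counter(tokens)
        scores[i] = {
            # tf = share of the document taken up by the term,
            # idf = log(total documents / documents containing the term).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy corpus: document 0 is a keyword-stuffed page.
docs = [
    "cheap flights cheap flights cheap flights".split(),
    "guide to booking cheap flights online".split(),
    "history of the jet engine".split(),
]
scores = tf_idf_scores(docs)
stuffed = scores[0]["cheap"]
natural = scores[1]["cheap"]
print(f"stuffed page: {stuffed:.3f}, natural page: {natural:.3f}")
```

Because the score depends only on what the page itself says, the page author controls it completely, which is exactly the opening the spammers used.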

Enter “Backrub”, now known as Google. To fight off the gross irrelevance of the results being returned by the search engines of that time, the founders of Google looked at how the link structure of the web could be exploited to produce more relevant results. The intuition was that pages with a high count of backlinks are generally considered more popular (also known as the random walker argument), and thus more likely to be authoritative on any given subject, provided that the page was relevant to the subject in the first place. Combined in a “secret sauce” with the classical techniques of identifying key terms, this new PageRank system produced astoundingly accurate results for most queries and skyrocketed Google to what it is today. Of course, the general idea here is that this success came in large part from the way the PageRank system mitigated the local optimization efforts of spammers on their own webpages.
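The backlink intuition can be sketched as a simplified power iteration over a tiny link graph. This is only an illustration of the random walker idea, not Google's actual formula: the function name, the toy graph, the iteration count, and the damping value of 0.85 (a commonly cited choice) are all assumptions on my part.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start the walker anywhere with equal odds
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise it follows one of the page's outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical graph: three pages link to "hub"; "hub" links back to "a" only.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The point of the design is that a page's score is decided by other pages, so stuffing your own page with keywords no longer moves the needle on its own.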

Thus, if you look at the success of PageRank, you’ll see that it was really an arms escalation on the part of the search industry to fight off the spammers. This bought the industry some time, but it is very apparent that the spammers have caught up. Using a combination of tricks to farm PageRank for pages, “search engine optimization” efforts have, to a large degree, closed the gap on Google. The battle has yet to be decided, and it is hotly debated in IR circles whether there is even a true solution, short of human intelligence, that can deal with the rising sophistication of gaming techniques. As one would expect, new methods for fighting off search spam are being researched and developed in both industry and academia.

But what really needs more reflection is whether PageRank is still as relevant to the success of a search engine as it was when it was first invented. Of course there will always be something to be said for the democratic backlink model, because it’s just intuitive that a page is about x if everyone says it’s about x. But the inverse does not hold. That is, it can’t necessarily be said that a page is not about x just because no one recognizes it as being about x. The idea can basically be summed up like this: because PageRank weights the importance of backlinks so heavily, it might be ignoring what can be referred to as the long tail of search. In the IR industry, this is of course known as sacrificing recall for precision.
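For readers outside IR, the trade-off can be put in numbers with the standard definitions: precision is the share of returned results that are relevant, and recall is the share of relevant pages that were returned. The page names below are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: one well-linked page and three long-tail pages are all
# genuinely about the topic, but only the well-linked one is returned.
relevant = {"popular_page", "niche_page_1", "niche_page_2", "niche_page_3"}
returned = ["popular_page"]
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Everything returned is on topic, yet three quarters of the pages that were about x never show up because nobody linked to them.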

If PageRank was effective in fighting off spam in 2000, it cannot be considered just as effective today. In fact, it is quite clear, even to an outsider, that Google has to handle the issue of spam in an entirely different manner today. That raises the question of whether new models of search can be built that diminish the role backlinks play and instead focus on a “smarter” way of categorizing pages, and in one fell swoop solve the problems of topical ranking and spam. From what I hear (and I could be totally wrong), http://www.kosmix.com/ is one such project, focused on high recall by not necessarily letting backlink-based PageRank be the major ingredient of the formula.

To sum up, what is evident to me is that search has to gradually change as consumer expectations change. And as time goes on, it is worth reflecting on and re-evaluating the ubiquitous techniques of the day to see if they will still be relevant tomorrow. However, I trust the guys at Google are well aware that resting on their laurels, especially in a dynamic and nascent field, is not an option.

Tale of Two

Tuesday, November 1st, 2005

If computing were a snake, it would have two heads. On one side would be the business world, driving and pushing the snake toward the automation model. On the other would be the academic world, focused on the core definition of computing and its mathematical heritage. As a computer scientist, computer engineer, software developer, or what have you, I often find myself struggling to find a unifying theme, a universal purpose for what computing really is.

Ask the economist and he/she will tell you that the two forces of our world are gravity and currency. The latter force is the one that drives our innovation, because, as fate would have it, capitalist societies reward those who give back value with money. This basic driving force for innovation is key to the technological golden age we are in today. Here, in the pursuit of financial gain, is where we find the birthplace of a lot of software computing ideas. CRM, ERP, Help Desk, and generalized financial software are all forms of software innovation that have risen to meet the challenges of a world ever increasing in its demand for automation. Software products succeed and fail on one central criterion: “Does it solve my problem, and can I afford it?” This is perhaps why, if you read material intended for the industry, you’ll often find it 90% business and, if you’re lucky, 10% technical. For example, you can ask, in a typical company, how many of the CIO’s roles are business related? How many, conversely, are technical? What backgrounds and experiences are important for a CIO/CTO?

On the flip side, you have academic computer science. Being research based, the primary actors might or might not be motivated by immediate financial gain. Most of the innovation here comes from a deep-down passion for the subject and from a will to change the technological landscape, however big or small the change. Unfortunately, the lack of monetary stimuli leaves much of it to the personal desires of the individuals involved. As a result, you will typically have few students of computer science interested in the theories and the fundamentals. Having much experience in this matter myself, I can tell you that not many people go into a computer science degree expecting to learn what it is they end up learning. Where they expected a practical curriculum mirroring the demands of society and industry (like implementing solutions in a multitude of languages), they instead find deep mathematical theory tied to no one concrete, intended application. Even those who check “Computer Engineering” instead of “Computer Science” on their applications are typically in for a surprise. Look at each school’s degree conferral summary and you will find something like: “A degree in Computer Science from (insert school here) is a guarantee by the school that you are well prepared in the breadth of knowledge to tackle problems faced in a computer science occupation.” Is that so? Many understandably doubt it, and as a result, many people simply do not see the point and drop out.

Computing has come down to two threads of existence. One survives to automate - to solve practical problems in the business world and find financial reward. The other survives to truly further our world through innovation. While the former has certainly played a big role in our world (just think of all the big software businesses out there), the latter thread is the one that, at the end of the day, creates the great leaps we enjoy as a society today and will enjoy in the years to come. Just think of the internet and its roots as a military/academic project to provide a way to decentralize computing. With it came email, and then the web. These are the truly essential technologies that even the automation world depends upon.

We have been given the gift of technology - but this gift is no promise. When it comes down to it, it’s up to the many brilliant minds out there toiling away in the land of academia to create new ways we can change computing. Perhaps an interesting question is then: can we find a blend of computing that is viable both as an academic pursuit and as an industrial success? To that, I’d have to point out that the Google boys did. They are one example of how an academic innovation can bear both intellectual gain (in the form of adding to our society’s knowledge) and financial gain. In the end, I believe that perhaps the reality is that a truly brilliant leap in technology will always carry a guarantee of financial success, but conversely, not every financial success carries a guarantee of a leap in technology.

- LW