While much of this paper is now “quaint” in light of the amount of money that Google is making, what with their “noble” goals to save the world and all, the fact remains that there are many interesting ideas in the paper and it does a good job of bridging the gap between real world systems and academic research. Essentially, this paper comes down to two main parts that I am interested in:

  1. PageRank - The formula and approach for ranking web pages.
  2. System Anatomy - The discussion of how Brin and Page set up their initial system.

I’ll skip the details on performance, etc., since it should be pretty obvious that it works. It should be noted, though, that this new approach did wonders for how the web was searched. If you remember the web prior to Google, most search results were pretty low quality, except for Yahoo!, which relied on a directory structure.
PageRank
Here is the discussion of PageRank from http://infolab.stanford.edu/~backrub/google.html

We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.

PageRank is really just the notion that the more pages that point to a page, the higher that page should rank. Additionally, the effect is cumulative: if A points at B and B points at C, then A’s PR is also a factor in calculating C’s PR. The paper doesn’t go into details about the exact algorithm for calculating PR, simply stating that it is a simple iterative algorithm. Of course, this is now part of Google’s proprietary system, so the world may never know, much to the chagrin of many an SEO consultant. In the upcoming papers, we will see references to PR and how it can be used for other NLP tasks, so it is good to know the basic idea.
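To make the "simple iterative algorithm" concrete, here is a minimal sketch in Python of what such an iteration could look like, using the PR formula quoted above. The toy graph, the fixed iteration count, and the starting value of 1.0 are my own choices for illustration, not details from the paper.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                   # arbitrary starting ranks
    out_count = {p: len(links[p]) for p in pages}  # C(A): outgoing link count
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over every page T that points at this page
            incoming = sum(pr[t] / out_count[t]
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Tiny example: a three-page cycle A -> B -> C -> A
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
```

On this symmetric cycle every page's rank converges to the same value, which matches the intuition that no page in the cycle is more "cited" than another.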

System Anatomy

From an engineering standpoint, the System Anatomy section of the paper is quite interesting, especially since it goes into some detail about how to set up a basic Google-like search engine. Of note is the discussion, albeit brief, of the Distributed File System (called BigFiles, section 4.2.1; for an open source version, see Hadoop, which has both a DFS and an implementation of Google’s Map/Reduce algorithm). Moving on, there is some discussion of how the lexicon and the documents are stored. These sections strike me as standard indexing techniques.

Following the lexicon subsection, though, is the discussion of how they store the “Hit Lists”, that is, the lists of occurrences of terms in documents, along with payload information about the terms such as font size, capitalization, etc. In comparison, Lucene currently only supports the term occurrence info and not payload information, although there is a submitted patch that allows for indexing payloads. These Hit Lists are then stored in both the forward index and the inverted index.
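As a rough illustration of the idea, here is a toy hit list builder: for each term it records where the term occurs plus a bit of per-occurrence payload (just capitalization here, standing in for the paper's richer font-size encoding). The field names and structure are my own invention, not the paper's actual compact encoding.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text; returns term -> list of hits (the hit list)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split()):
            index[word.lower()].append({
                "doc": doc_id,              # which document the hit is in
                "pos": position,            # word position within the document
                "capitalized": word[0].isupper(),  # toy payload info
            })
    return index

index = build_inverted_index({1: "The Web is big", 2: "web search"})
# index["web"] now holds one hit per document, each with position and payload
```

A real engine would pack each hit into a couple of bytes rather than a dict, but the shape of the data is the same.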

Section 4.4 concerns indexing the web and dealing with the plethora of errors that occur in malformed HTML, as well as how to create the various indexes. Currently, I think there are a number of good solutions that can help solve this problem, given the right amount of hardware (see Nutch for an open source version).

Section 4.5 deals with the issues of searching, namely how to handle single word queries and multi-word queries. Multi-word queries can be a bit tricky, since you want to weight hits with terms that occur closer together in a document higher than those that are further apart.
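A quick sketch of that proximity idea: given the positions of two query terms in a document (the kind of positions a hit list provides), score the document by how close the terms ever get. The scoring function here (the reciprocal of the smallest gap) is an arbitrary illustration on my part, not the actual scheme from the paper.

```python
def proximity_score(positions_a, positions_b):
    """positions_*: word positions of each query term within one document.
    Returns a higher score when the terms occur closer together."""
    best_gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / best_gap  # smallest gap of 1 (adjacent terms) scores 1.0

# "search" at positions [2, 40] and "engine" at positions [3, 90]:
# the closest pair is (2, 3), a gap of 1, so this document scores highly
score = proximity_score([2, 40], [3, 90])
```

A document where the same terms sit 20 words apart would score 1/20 under this toy scheme, pushing it down the result list.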

The final section (section 5) covers results and performance, which I’m not going to go into because I think it is obvious to anyone who has had an online pulse in the last 10 years that the Google approach works.

Next week, we will start looking into some graph based papers that have some basis in the PageRank calculation, but use it in different contexts.
