20071223

PageRank Explained


PageRank is a method invented by Google to measure the relative importance of web pages, which is often called popularity. It is based on the topology of the web, i.e. the links structure between pages.

The main idea is that if a page A has a link to a page B, then the page A "thinks" that page B is important enough to deserve being cited and maybe visited by visitors of page A. This link from A to B increase the PageRank of B.

There are two other essential ideas:

  • the higher the PageRank of page A, the higher the increase of the PageRank of page B. In other terms, it is greatly efficient to get a link from the homepage of Google than one from a page of your cousin's site (otherwise he's a potential genius!).
  • the less page A has out-links, the more page B's PageRank increases. In other terms, if page A "thinks" that there is only one page deserving a link, then it is quite natural that the PageRank of this page B increases more than in the case where lots of pages get a link from page A.

Now that you know the main principles of PageRank, let's see the mathematical formulat... Our explanations are based upon a paper written by the two founders of Google (1), even though since this time the algorithm must have changed -- the basis remains the same.

Let A1, A2, ..., An: be n pages linking to a page B. Let PR(Ak) be the PageRank of page Ak, N(Ak) the number of outward links within page Ak, and d a factor between 0 and 1, generally equal to 0.85.

Then the PageRank of page B is computed from the PageRank of all pages Ak in the following way:

PR(B) = (1-d) + d x ( PR(A1) / N(A1) + ... + PR(An) / N(An) )

AS you may think, this formula is in the same time simple and complex. Simple because it only depends on several terms, complex because it is recursive: if you wan to compute the PageRank of one particular page, you must have already computed the PageRank of all pages linking to it. Then which page should you begin to process?

Actually it is very simple, you only need to initialize all the PageRank with the same value (for example: 1). The choice of this value doesn't have any impact on the final result -- if you give all pages the same initial value. The first iteration of the formula gives you another PageRank for each page, closer to the reality than the value we selected at the beginning.
Then you continue iterating this formula, computing the PageRank of all the pages of the system, based upon the previous values computed in the previous iteration. After some iterations, the system converges: the value of the PageRank of each page doesn't change from one iteration to the next one.
In pratice, the convergence is obtained after several tens of iterations (depending upon the total number of pages!).

Next step: First points to note about PageRank.

(1) The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, www-db.stanford.edu/~backrub/google.html

No comments: