PCPlus 281: Indexing the Internet

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year, since there’s an Xmas issue as well, so it’s a bit more than monthly). The column is called Theory Workshop and appears in the Make It section of the magazine. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after, say, a year or so. What I’ll do is publish the article from roughly a year ago here when I purchase the current issue.

I was in England when the May issue came out, so I’m able to post this a little earlier than usual (my Barnes & Noble generally gets an issue 5-6 weeks after it appears in newsagents in England).

This particular piece was a pure layman’s article about how to index text and in particular how big search engines index web pages. I covered the usual suspects: inverted indexes and PageRank, with asides on stemming and SEO (search engine optimization).
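For anyone who hasn't met an inverted index before, here is a minimal sketch in Python. It is not code from the article; the function names and the three sample documents are made up for illustration, and it assumes words are simply split on whitespace, with no stemming or ranking.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the postings of the first word, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by an arbitrary id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))
```

Note that searching for "dog" would miss document 3, which only contains "dogs". That is the gap stemming is meant to close.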

As it happens, in doing the research for this article, I read Sergey Brin & Larry Page’s seminal paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” for the first time. This was the paper that essentially launched Google and that changed the landscape of search engines. The techniques discussed in this paper have obviously improved in the 12 years since then (I dare say that Google no longer just uses PageRank but instead uses a panoply of different indexing mechanisms to improve results), but it is still an excellent exposition of what happens in a large-scale search engine.

And... 12 years ago? How the internet has changed since Brin and Page presented their paper at the Seventh International World-Wide Web Conference in 1998.

This article first appeared in issue 281, May 2009.

You can download the PDF here.

Now playing:
Massive Attack - Babel
(from Heligoland)
