The Architecture of a Crawler

Written by dcat on 01.11.2008 | General

I’m going to explain how crawlers work. There are three main tasks that a crawler has to take care of.

1. Find new hosts to crawl.
2. Request data from a host that is being crawled.
3. Display to the user the data gathered.
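
As a rough illustration, the three tasks above could be sketched like this in Python. This is only a minimal sketch and not the actual crawler's code: the names `crawl`, `report`, `query_host` and `max_hosts` are made up for the example, and `query_host` is assumed to be supplied by the caller and to return the host's data along with any other hosts it knows about.

```python
from collections import deque


def crawl(seeds, query_host, max_hosts=100):
    """Crawl outward from the seed hosts.

    query_host(host) is a hypothetical callable that returns
    (data, neighbours) and raises OSError if the host is unreachable.
    """
    pending = deque(dict.fromkeys(seeds))  # task 1: hosts found but not yet crawled
    seen = set(pending)                    # every host queued so far
    results = {}

    while pending and len(results) < max_hosts:
        host = pending.popleft()
        try:
            data, neighbours = query_host(host)  # task 2: request data from the host
        except OSError:
            continue  # unreachable host, move on to the next one
        results[host] = data
        for other in neighbours:
            if other not in seen:  # task 1 again: queue newly discovered hosts
                seen.add(other)
                pending.append(other)
    return results


def report(results):
    """Task 3: turn the gathered data into something a user can read."""
    return "\n".join(f"{host}: {data}" for host, data in sorted(results.items()))
```

Passing `query_host` in as a parameter keeps the network code separate from the bookkeeping, so the loop can be tried out against a fake network before pointing it at real hosts.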

This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all [...]

Recent Updates

Written by dcat on 19.10.2008 | General

My focus lately has been on hub uptimes. There is a new page showing hub uptime distribution graphs. It gives a visual representation of some of the categories on the uptimes page. The overall hub uptime distribution graph also features two vertical lines. The red line shows where the average hub uptime is and the [...]

Quick g2paranha update

Written by dcat on 10.07.2008 | General

The crawler has been running pretty well, with only minor tweaks from day to day, which sometimes show up as blips in the graph. It was also down for a few days due to a failing hard drive.
Yesterday the crawler got into the Foxy network again, which uses the same protocol as G2 but is [...]

g2paranha - The New G2 Crawler

Written by dcat on 23.06.2008 | General

Anyone who has read through this blog knows that the crawler has tended to crash fairly often. In recent times it was crashing too much to even continue running it. But rather than give up entirely, I decided to write my own crawler. Five weeks later, g2paranha has emerged. To go along with the [...]
