The Architecture of a Crawler
I’m going to explain how crawlers work. A crawler has three main tasks to take care of:
1. Find new hosts to crawl.
2. Request data from a host that is being crawled.
3. Display the gathered data to the user.
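To make the three tasks concrete, here is a minimal single-process sketch in Python. It is only an illustration under some assumptions of mine: hosts are discovered by reading the links in fetched pages, and the names `find_new_hosts`, `crawl`, `display`, and the `fetch` callable are all hypothetical, not taken from any particular crawler. The `fetch` argument stands in for the real network request, so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_new_hosts(html, known_hosts):
    """Task 1: return hosts linked from `html` that are not in `known_hosts`."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for link in parser.links:
        host = urlparse(link).netloc  # empty for relative links, which are skipped
        if host and host not in known_hosts and host not in found:
            found.append(host)
    return found


def crawl(seed_host, fetch, max_hosts=10):
    """Drive tasks 1 and 2 over a queue of hosts.

    `fetch` is a hypothetical callable (host -> HTML string) that stands in
    for the real request to the host.
    """
    known = {seed_host}
    queue = deque([seed_host])
    results = {}
    while queue and len(results) < max_hosts:
        host = queue.popleft()
        html = fetch(host)          # Task 2: request data from the host
        results[host] = len(html)   # keep something small to show the user
        for new_host in find_new_hosts(html, known):
            known.add(new_host)
            queue.append(new_host)
    return results


def display(results):
    """Task 3: format the gathered data for the user."""
    return "\n".join(
        f"{host}: {size} characters" for host, size in sorted(results.items())
    )
```

To try it, pass any function that maps a host name to a page of HTML, for example a lookup in a small dictionary of canned pages, and print the result of `display(crawl("a.example", fetch))`.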
This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all [...]