G2 Crawler News
http://crawler.trillinux.org/news

The Architecture of a Crawler
http://crawler.trillinux.org/news/2008/11/01/the-architecture-of-a-crawler/
Sat, 01 Nov 2008 15:25:35 +0000 by dcat

I’m going to explain how crawlers work. There are three main tasks that a crawler has to take care of.

  1. Find new hosts to crawl.
  2. Request data from a host that is being crawled.
  3. Display to the user the data gathered.

This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all work in parallel and independently. All the host crawlers need is a coordinator (the one that performs task 1) to feed them lists of hosts that aren’t duplicated. The host crawlers then send their responses back to the coordinator, which finds new hosts in the responses and then stores the responses. Lastly, the aggregator or statistics generator (the one that performs task 3) periodically runs through all the data collected and creates useful ways for the user to view this information.
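
To make the division of labour concrete, here is a minimal single-process sketch of the three roles. It is not how any real crawler is written: the class and function names, the in-memory fake network, and the batch size are all made up for illustration, and a real host crawler would talk to hosts over the network instead of reading a dictionary.

```python
from collections import deque


class Coordinator:
    """Hands out hosts to crawl and never queues the same host twice."""

    def __init__(self, seed_hosts):
        self.seen = set()        # every host ever queued
        self.pending = deque()   # hosts waiting to be crawled
        self.responses = {}      # host -> stored response
        for host in seed_hosts:
            self._enqueue(host)

    def _enqueue(self, host):
        if host not in self.seen:
            self.seen.add(host)
            self.pending.append(host)

    def next_batch(self, size):
        """Task 1: give a host crawler up to `size` hosts nobody else has."""
        batch = []
        while self.pending and len(batch) < size:
            batch.append(self.pending.popleft())
        return batch

    def submit(self, host, neighbours):
        """Store a crawler's response and queue any newly discovered hosts."""
        self.responses[host] = list(neighbours)
        for neighbour in neighbours:
            self._enqueue(neighbour)


def crawl_host(host, network):
    """Task 2 stand-in: look the host up in an in-memory fake network."""
    return network.get(host, [])


def summarize(responses):
    """Task 3: turn stored responses into something a user can read."""
    return {host: len(neighbours) for host, neighbours in responses.items()}


# Hypothetical network: host -> hosts it reports being connected to.
network = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}
coordinator = Coordinator(["a"])
while True:
    batch = coordinator.next_batch(2)
    if not batch:
        break
    for host in batch:
        coordinator.submit(host, crawl_host(host, network))
stats = summarize(coordinator.responses)
```

Starting from the single seed host, the loop ends once every reachable host has been crawled exactly once. In a distributed setup the inner `for` loop is the part that would run on separate machines.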

That’s how a crawler works in general terms. But the fun begins when a crawler has to actually be built. One of the most important decisions is how to store the data collected and how to store and distribute the list of hosts that need to be crawled next.

Relational Database Approach

This is the approach that g2paranha takes. It’s an easy and straightforward way to store data for anyone trained on traditional databases. But the data that a crawler needs to store is, for the most part, not relational, except for the links between hosts. Another problem with relational databases is that many of them lock all of the data whenever data is being written or read. This creates a huge bottleneck in a distributed environment where many reads and writes happen at the same time. So while it is easy to implement, it may not be an optimal solution. On the positive side, the extremely powerful SQL language is available for extracting statistics from the data.
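
As a small illustration of that last point, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, the addresses, and the leaf counts are invented for the example and are not g2paranha's actual schema or data.

```python
import sqlite3

# In-memory database with a hypothetical one-table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hubs (address TEXT PRIMARY KEY, vendor TEXT, leaves INTEGER)"
)
conn.executemany(
    "INSERT INTO hubs VALUES (?, ?, ?)",
    [
        ("192.0.2.1:6346", "RAZA", 120),
        ("192.0.2.2:6346", "RAZA", 80),
        ("192.0.2.3:6346", "UNKN", 40),
    ],
)

# One query yields the hub count and average leaf count per vendor code.
rows = conn.execute(
    "SELECT vendor, COUNT(*), AVG(leaves) FROM hubs "
    "GROUP BY vendor ORDER BY vendor"
).fetchall()
conn.close()
```

Getting the same per-vendor summary out of a pile of raw responses would take a hand-written loop, which is the convenience being traded away by leaving SQL behind.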

Non-relational Database Approach

This is the direction I have been heading in for the crawler. It seems to be a good fit for the type of data that the crawler needs to store. But I have very little experience in this area. So far my only forays into this field have been with CouchDB, which is still in heavy development. CouchDB looks promising but I haven’t had much luck with getting it to work.

So if anyone has experience in non-relational databases or in creating distributed crawlers I’d like to hear from you.

Recent Updates
http://crawler.trillinux.org/news/2008/10/19/recent-updates/
Sun, 19 Oct 2008 15:03:09 +0000 by dcat

My focus lately has been on hub uptimes. There is a new page showing hub uptime distribution graphs. It gives a visual representation of some of the categories on the uptimes page. The overall hub uptime distribution graph also features two vertical lines. The red line shows where the average hub uptime is and the green line shows where the median hub uptime is. Eventually all of the graphs will have this extra information.

The other major addition is to the uptimes page. The second table of information is new and expands on the information in the first table. The new table shows for each grouping/category:

  • average
  • median
  • minimum uptime
  • maximum uptime
  • the total number of hubs that fit this category
  • the number of hubs below the average for this category
  • the number of hubs above the average for this category
  • the ratio of hubs under the average to hubs over the average
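
For anyone who wants to reproduce these numbers from their own data, the columns can be computed roughly as follows. This is a sketch rather than the crawler's code: the function name, the dictionary keys, and the sample uptimes are made up.

```python
from statistics import mean, median


def uptime_summary(uptimes):
    """Summarize one category of hub uptimes (any consistent unit)."""
    avg = mean(uptimes)
    below = sum(1 for u in uptimes if u < avg)
    above = sum(1 for u in uptimes if u > avg)
    return {
        "average": avg,
        "median": median(uptimes),
        "minimum": min(uptimes),
        "maximum": max(uptimes),
        "total": len(uptimes),
        "below_average": below,
        "above_average": above,
        # Hubs under the average divided by hubs over it; None if none are over.
        "ratio": below / above if above else None,
    }


# Made-up uptimes in hours: four short-lived hubs and one long-lived hub.
summary = uptime_summary([1, 2, 3, 4, 40])
```

With this sample the single long-lived hub pulls the average up to 10 while the median stays at 3, which is why the two vertical lines on the graph can sit far apart.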

On the vendors page I brought back the GDNA data. But for now no hubs will show up as GDNA. GnucDNA does not send any vendor code at all, so the current crawler just shows them as UNKN. The old crawler looked at the hub’s User-Agent to figure out if it was GnucDNA and then set the vendor code appropriately. So while new data won’t be logged the same way, the trend of GDNA can at least be seen again on the yearly vendor graph.
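
The old crawler's fallback can be pictured along these lines. This is only a guess at the logic: the function name is made up, and the assumption that a GnucDNA-based client has "GnucDNA" somewhere in its User-Agent string is mine, as are the sample User-Agent strings.

```python
def vendor_code(reported_code, user_agent):
    """Return a four-character vendor code for a hub."""
    if reported_code:
        return reported_code
    # Assumption: a GnucDNA-based client names GnucDNA in its User-Agent.
    if "gnucdna" in user_agent.lower():
        return "GDNA"
    # No vendor code and no recognizable User-Agent.
    return "UNKN"


# Hypothetical inputs; the User-Agent strings are invented for the example.
codes = [
    vendor_code("RAZA", "ExampleClient 2.0"),
    vendor_code("", "ExampleClient 1.0 (GnucDNA 1.1)"),
    vendor_code("", "SomethingElse 0.1"),
]
```

The current crawler skips the User-Agent check, so the second case would come out as UNKN there.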

Lastly, there is an experimental feature that shows where each hub is on a geographic map if its user provided that information in their profile. Of course, most users do not reveal this information and some lie about their location, but it can still be a lot of fun. You get to see that the G2 network has users all over the world and they’re all interconnected. This mapping feature is experimental because it requires a fast computer in order to work smoothly. So give it a try, but understand that it may not work well for everyone. In my experience Chrome did really well while Firefox and IE did poorly. If you already have Chrome installed, you might try this feature in that browser even if you don’t use it for anything else.

Green lines are used to show hubs that are connected together. Clicking on a marker will reveal additional information about that hub:

  • username if provided
  • name and version of the software they are using
  • number of hubs and leaves they are connected to
  • the actual country they are in based on their IP address and MaxMind’s GeoIP
  • the exact coordinates they provided

I hope you like the additions. Keep the suggestions coming.

Quick g2paranha update
http://crawler.trillinux.org/news/2008/07/10/quick-g2paranha-update/
Thu, 10 Jul 2008 23:55:51 +0000 by dcat

The crawler has been running pretty well with only minor tweaks from day to day which sometimes show up as blips in the graph. It was also down for a few days due to a failing hard drive.

Yesterday the crawler got into the Foxy network again. Foxy uses the same protocol as G2 but is a private network separate from G2, probably using GnucDNA’s authentication scheme. Their network is significantly larger than the G2 network, as can be seen from the spike in the history graph. Since it is so much larger, it presents problems for crawling, and since it doesn’t represent the true open network of G2, I have filtered it out from being crawled.

g2paranha - The New G2 Crawler
http://crawler.trillinux.org/news/2008/06/23/g2paranha/
Mon, 23 Jun 2008 04:04:34 +0000 by dcat

Anyone who has read through this blog knows that the crawler has tended to crash fairly often. In recent times it was crashing too much to even continue running it. But rather than give up entirely I decided to write my own crawler. Five weeks later, g2paranha has emerged. Alongside the new crawler comes a redesigned website written by kevogod. g2paranha has been designed to be distributed. Currently I run one crawler and Datz kindly volunteered to run another.

Over the next few weeks expect there to be a few bumps as any bugs get ironed out. I’ll also be rounding out the feature set provided by the original crawler. Right now that means adding country flags and keeping track of unique nodes. This last feature is the best estimate of the network size, so until it is implemented the network counter won’t be put back on the front page. The leaf count on the history page overcounts the network size because each leaf can connect to multiple hubs.
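
The difference between the two numbers is easy to see with a toy example. The hub and leaf names below are invented, and a real crawler would identify nodes by something like address or GUID.

```python
# Hypothetical crawl result: hub -> set of leaves connected to it.
hub_leaves = {
    "hub1": {"leafA", "leafB"},
    "hub2": {"leafA", "leafB"},
    "hub3": {"leafA", "leafB", "leafC"},
}

# Summing per-hub leaf counts counts a leaf once for every hub it uses.
leaf_count = sum(len(leaves) for leaves in hub_leaves.values())

# Counting distinct leaves removes the duplicates.
unique_leaves = set().union(*hub_leaves.values())

# Unique nodes: every hub plus every distinct leaf, each counted once.
unique_nodes = set(hub_leaves) | unique_leaves
```

Here the summed leaf count is 7 even though only 3 distinct leaves exist, so the network really holds 6 nodes.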

I’ll try to keep the news updated as I progress.

The State of G2
http://crawler.trillinux.org/news/2008/02/28/the-state-of-g2/
Fri, 29 Feb 2008 03:08:40 +0000 by dcat

I was reading the Gnutella2 article on Wikipedia today and I noticed both entries in the External Links section point to my sites (crawler.trillinux.org and g2.trillinux.org). The latter is the new home for the G2 specs after gnutella2.com was allowed to expire. This got me thinking that it looks like I’m the only one trying to keep G2 from completely disappearing.

This is partly to benefit others and partly out of self-interest. I don’t think the G2 protocol as a whole is all that spectacular anymore, if it ever was. But parts of the protocol can be reapplied to accomplish other things. For example, at its core is a specification for a compact, extensible tree structure for communication. This could be made generic and used for all sorts of applications outside of G2. The search mechanism of a random walk is not original or unique, but G2 is the largest P2P network I’m aware of that still makes use of it, so from that perspective it could be interesting to study.

I run the crawler out of self-interest. I like data, statistics, and graphs. I never turn down the opportunity to collect raw data and turn it into graphs and make inferences from the data.

I started maintaining the G2 website sometime in (late?) 2005 and moved it to its current home in October 2007. The crawler has similarly been running since late 2005. Here’s to many more years to come.

More Crawler Downtime
http://crawler.trillinux.org/news/2008/01/30/more-crawler-downtime/
Wed, 30 Jan 2008 05:33:31 +0000 by dcat

I spent last weekend replacing my router with another computer. The transition was a bit bumpy but things are starting to get sorted out. More extended periods of downtime are possible over the next few weeks as I get things completely transitioned and working reliably.

Crawler Downtime
http://crawler.trillinux.org/news/2007/10/08/crawler-downtime/
Tue, 09 Oct 2007 02:28:00 +0000 by dcat

The crawler has been down since Friday because I’m doing hardware work on my router. It is also the computer that does backups and it has had a slowly failing hard drive for the last few months. I finally bought a new hard drive and have been deciding how to set it up. In the meantime my network is a bit fragmented and the consumer router I’m using as backup folds under the load of the crawler. Things should be back to normal sometime on Tuesday.

New Graphs
http://crawler.trillinux.org/news/2007/05/24/new-graphs/
Thu, 24 May 2007 21:11:20 +0000 by dcat

I added some new graphs back in April on the hub density page. They show the percentage of hubs with a certain number of leaves. This way the capacity of hubs can be tracked more granularly than just the average leaves per hub statistic.
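
The numbers behind a graph like this can be produced roughly as follows. Grouping leaf counts into fixed-width buckets is my assumption for the sketch, as are the function name, the bucket size, and the sample data.

```python
from collections import Counter


def leaf_distribution(leaf_counts, bucket_size=100):
    """Percentage of hubs falling in each leaf-count bucket."""
    # Map each hub to the lower bound of its bucket, e.g. 150 -> 100.
    buckets = Counter((n // bucket_size) * bucket_size for n in leaf_counts)
    total = len(leaf_counts)
    return {
        start: 100.0 * count / total
        for start, count in sorted(buckets.items())
    }


# Made-up leaf counts for five hubs.
dist = leaf_distribution([20, 80, 150, 250, 290], bucket_size=100)
```

The average for this sample is 158 leaves per hub, which hides the fact that most hubs sit either well below or well above that figure.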

Let me know about other improvements you’d like to see and I will try to make them happen.

Country Database Update
http://crawler.trillinux.org/news/2006/10/11/country-database-update/
Wed, 11 Oct 2006 19:58:51 +0000 by dcat

You may have noticed that the number of “Unknown”/“??” countries has been increasing since the graphs came back in June. This is because the country is determined by using MaxMind’s GeoLite Country database, which maps IP addresses to countries. The version the crawler was using hadn’t been updated since September 2005 and many new IP blocks have been added since then. So the database is now current and hopefully the country stats will more accurately reflect where people really are. Some of the countries which should see increases are Australia, Costa Rica, India, Japan, Korea, Mexico, New Zealand, and Thailand.
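
To see why a stale database produces unknowns, here is a toy range lookup. It does not use MaxMind's database format or library. The table is a made-up stand-in built on documentation address blocks, and the function name is hypothetical.

```python
import bisect
import ipaddress


def ip_int(text):
    """Convert a dotted IPv4 address to an integer for range comparison."""
    return int(ipaddress.ip_address(text))


# Made-up table of (first address, last address, country), sorted by start.
RANGES = sorted([
    (ip_int("192.0.2.0"), ip_int("192.0.2.255"), "AU"),
    (ip_int("198.51.100.0"), ip_int("198.51.100.255"), "JP"),
])
STARTS = [start for start, _, _ in RANGES]


def country_for(ip):
    """Return the country for an address, or "??" if no range covers it."""
    value = ip_int(ip)
    # Find the last range that starts at or before this address.
    index = bisect.bisect_right(STARTS, value) - 1
    if index >= 0 and value <= RANGES[index][1]:
        return RANGES[index][2]
    return "??"


results = [country_for(ip) for ip in ("192.0.2.10", "198.51.100.200", "203.0.113.5")]
```

The third address falls in a block the table has never heard of, so it comes back as “??” until a newer table with that block is loaded.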

New best hub uptime
http://crawler.trillinux.org/news/2006/10/06/new-best-hub-uptime/
Fri, 06 Oct 2006 16:57:52 +0000 by dcat

Today a new best hub uptime was established, beating the old one of 209h 10m 45s. This is because, for some reason unknown to me, the crawler has decided to run for over a week continuously without dying. I did however check system logs recently and it does look like it may be crashing because of a memory leak. With any luck, magic fixed it and it will now run smoothly forever.
