A different way to look at the network

Since the middle of August the crawler has been recording the times at which hubs join and leave the network, which makes it possible to see time-based trends. Each hub is identified by its IP address. One way to visualize a set of IP addresses is with the Hilbert curve, which was made popular by xkcd. The tool I used is called ipv4-heatmap. Generating a heatmap every 2 hours and then playing the images in order produces a time-lapse video. The video shows when users in different parts of the world are online, depending on the time of day and the day of the week.


Direct download (3.0 MB, MKV)

Here is a rough breakdown of what the video shows:

  • On the right side, halfway up, is where a lot of Brazilian users are.
  • Above them is where a lot of Asian IPs are.
  • In the bottom left is where Europe is.
  • In the center left and the top center is North America.

Consult the ipv4-heatmap website, or the slightly outdated internet map from xkcd, for more details on IP distribution.
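To give a feel for how an IP address ends up at a particular spot on such a map, here is a minimal sketch of the textbook Hilbert curve index-to-coordinate conversion. It assumes a 4096x4096 image in which each pixel stands for one /24 block; the function names are made up for illustration, and the orientation ipv4-heatmap actually uses may differ from this one.

```python
import ipaddress


def hilbert_d2xy(side, d):
    """Convert distance d along a Hilbert curve into (x, y) on a side x side grid.

    side must be a power of two. This is the standard iterative conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_pixel(ip, side=4096):
    """Map an IPv4 address to a pixel, one pixel per /24 (assumed layout)."""
    block = int(ipaddress.IPv4Address(ip)) >> 8  # drop the last octet
    return hilbert_d2xy(side, block)


# Addresses in the same /24 share a pixel.
same_block = ip_to_pixel("10.0.0.1") == ip_to_pixel("10.0.0.200")
```

Because the curve never jumps, neighbouring address blocks land on neighbouring pixels, which is why regions of the address space show up as contiguous patches in the video.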

Thanks to Daisuke from irc.p2pchat.net #shareaza for his excellent work converting the images to a video.

General


Two new experimental features

A few weeks ago two new features were released. The first is a world map view of the country page and the second shows how the network size changes over time.

World Map

The world map shows two different data sets. Red circles represent where hubs say they are located. The size of the circle indicates how many hubs are reported to be at that location. Green circles represent how many hubs are in each country, determined by mapping each hub’s IP address to a country with MaxMind’s GeoIP. Each green circle is drawn either roughly in the center of the country it represents or at the country’s capital. The country location data was collected from Freebase.com.
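The green-circle counts boil down to a tally of hubs per country. The sketch below shows that tally only. The `country_of` lookup is a hypothetical stand-in for a GeoIP query (it is not MaxMind’s API), and the addresses and country codes are made up.

```python
from collections import Counter


def hubs_per_country(hub_ips, country_of):
    """Count hubs per country, given a function that maps an IP to a country code."""
    return Counter(country_of(ip) for ip in hub_ips)


# Made-up lookup table standing in for a real GeoIP database.
FAKE_GEOIP = {
    "192.0.2.10": "US",
    "192.0.2.11": "US",
    "198.51.100.7": "BR",
    "203.0.113.5": "DE",
}

counts = hubs_per_country(FAKE_GEOIP.keys(), FAKE_GEOIP.get)
```

Each count would then decide how large the green circle for that country is drawn.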

Network size changes

The network size page shows graphically how the network size is changing. Each data point shows how many hubs or leaves joined or departed the network since the previous network measurement. So, for example, if one data point reads +200 hubs, then the hub count grew by 200 between that measurement and the one before it.
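As a rough illustration, each data point is the difference between two consecutive measurements. The sketch below assumes measurements are stored as (label, hub count) pairs; that layout and the numbers are made up for the example.

```python
def size_changes(measurements):
    """Return (label, change) pairs, one per measurement after the first."""
    return [
        (label, count - previous_count)
        for (_, previous_count), (label, count) in zip(measurements, measurements[1:])
    ]


# Made-up hub counts from four consecutive measurements.
hub_counts = [("t0", 1500), ("t1", 1700), ("t2", 1650), ("t3", 1650)]
changes = size_changes(hub_counts)
```

Here `changes` is `[("t1", 200), ("t2", -50), ("t3", 0)]`, so the first plotted point would read +200 hubs.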

This is also the debut of pChart on the crawler website. It makes creating beautiful and useful graphs easy. I look forward to using it more in the future.

General


Network size

The network size is once again featured on the front page. When the new crawler was implemented, that statistic had to be dropped because it was too resource intensive to calculate given how the new crawler worked. That issue has now been resolved.

Some background

The number of leaves on the network isn’t a good measure of the number of users because most users connect to 2 or more hubs and are therefore counted at least twice. As a result the leaf count is approximately double the real network size. The unique leaves statistic that has been brought back counts the number of unique IP addresses on the network and so is much more accurate.
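The difference between the two numbers is easy to see in a small sketch. It assumes the crawler ends up with a list of leaf IP addresses per hub; the data structure and the addresses are made up for illustration.

```python
def leaf_counts(leaves_by_hub):
    """Return (total leaf connections, unique leaf IP addresses)."""
    total = sum(len(leaves) for leaves in leaves_by_hub.values())
    unique = len(set().union(*leaves_by_hub.values()))
    return total, unique


# Three made-up users, each connected to two hubs.
leaves_by_hub = {
    "hub-a": ["198.51.100.1", "198.51.100.2"],
    "hub-b": ["198.51.100.1", "198.51.100.3"],
    "hub-c": ["198.51.100.2", "198.51.100.3"],
}

total, unique = leaf_counts(leaves_by_hub)
```

In this example `total` is 6 while `unique` is 3, which matches the "approximately double" effect described above.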

Website updates

A few minor styling changes were made to the website.

General


A Quick Update

I haven’t made a post in a while so I thought I should.

Not much is going on with the crawler right now. I’ve been pretty busy lately and haven’t had any time to spend on improving the crawler. However, there were a few subtle updates to the webpages, and more detailed descriptions were added to many of them.

The uptime graphs page saw the most changes. Hubs that have been up for longer than 3 days are no longer included on the graphs. This was done because the graphs were not very useful when the X-axis had to cover hundreds of hours of uptime. New graphs were also added that show the uptimes by country for several of the top countries.

General


The Architecture of a Crawler

I’m going to explain how crawlers work. There are three main tasks that a crawler has to take care of.

  1. Find new hosts to crawl.
  2. Request data from a host that is being crawled.
  3. Display to the user the data gathered.

This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all work in parallel and independently. All the host crawlers need is a coordinator (the one that performs task 1) to feed them lists of hosts with no duplicates. The host crawlers then send their responses back to the coordinator, which finds new hosts in the responses and then stores the responses. Lastly, the aggregator or statistics generator (the one that performs task 3) periodically runs through all the data collected and creates useful ways for the user to view this information.
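The three roles can be sketched in a few lines. This is a single-process toy, not how g2paranha is written: the function names, the reply format, and the fake network are all made up, and a real host crawler would make a network request where `query_host` is called.

```python
from collections import deque


def crawl(seed_hosts, query_host):
    """Coordinator loop: hand out each host once and harvest new hosts from replies."""
    seen = set(seed_hosts)        # every host ever queued, so none is handed out twice
    pending = deque(seen)         # task 1: hosts still waiting to be crawled
    responses = {}                # stored replies, keyed by host
    while pending:
        host = pending.popleft()
        reply = query_host(host)  # task 2: the job of a host crawler
        if reply is None:         # the host did not answer
            continue
        responses[host] = reply
        for neighbour in reply["neighbours"]:
            if neighbour not in seen:
                seen.add(neighbour)
                pending.append(neighbour)
    return responses


def summarize(responses):
    """Task 3: turn the stored replies into numbers a user can read."""
    return {
        "hosts_crawled": len(responses),
        "leaves_reported": sum(r["leaves"] for r in responses.values()),
    }


# Made-up three-hub network standing in for real hosts; "hub-d" never answers.
FAKE_NETWORK = {
    "hub-a": {"neighbours": ["hub-b", "hub-c"], "leaves": 10},
    "hub-b": {"neighbours": ["hub-a", "hub-d"], "leaves": 20},
    "hub-c": {"neighbours": ["hub-a"], "leaves": 5},
}

stats = summarize(crawl(["hub-a"], FAKE_NETWORK.get))
```

In a distributed setup the `query_host` calls would run on separate machines, while the `seen` set and the `pending` queue stay with the coordinator so that no host is handed out twice.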

That’s how a crawler works in general terms. But the fun begins when a crawler has to actually be built. One of the most important decisions is how to store the data collected and how to store and distribute the list of hosts that need to be crawled next.

Relational Database Approach

This is the approach that g2paranha takes. It’s an easy and straightforward way to store data for anyone trained on these traditional databases. But the data that a crawler needs to store is for the most part not relational, except for the links between hosts. Another problem with relational databases is that many of them lock all of the data whenever data is being written or read. This creates a huge bottleneck in a distributed environment where many reads and writes are happening at once. So while this approach is easy to implement, it may not be an optimal solution. On the positive side, the extremely powerful SQL language is available for extracting statistics from the data.

Non-relational Database Approach

This is the direction I have been heading in for the crawler. It seems to be a good fit for the type of data that the crawler needs to store. But I have very little experience in this area. So far my only forays into this field have been with CouchDB, which is still in heavy development. CouchDB looks promising but I haven’t had much luck with getting it to work.

So if anyone has experience in non-relational databases or in creating distributed crawlers I’d like to hear from you.

General


Recent Updates

My focus lately has been on hub uptimes. There is a new page showing hub uptime distribution graphs. It gives a visual representation of some of the categories on the uptimes page. The overall hub uptime distribution graph also features two vertical lines. The red line shows where the average hub uptime is and the green line shows where the median hub uptime is. Eventually all of the graphs will have this extra information.

The other major addition is to the uptimes page. The second table of information is new and expands on the information in the first table. The new table shows for each grouping/category:

  • average
  • median
  • minimum uptime
  • maximum uptime
  • the total number of hubs that fit this category
  • the number of hubs below the average for this category
  • the number of hubs above the average for this category
  • the ratio of hubs under the average to hubs over the average
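For anyone curious how the columns relate to each other, here is a minimal sketch that computes them for one category from a list of hub uptimes. The function name, the dictionary keys, and the sample uptimes (in hours) are made up for illustration.

```python
from statistics import mean, median


def uptime_summary(uptimes):
    """Compute the per-category table columns from a non-empty list of uptimes."""
    average = mean(uptimes)
    below = sum(1 for u in uptimes if u < average)
    above = sum(1 for u in uptimes if u > average)
    return {
        "average": average,
        "median": median(uptimes),
        "minimum": min(uptimes),
        "maximum": max(uptimes),
        "total": len(uptimes),
        "below_average": below,
        "above_average": above,
        # No ratio when no hub is above the average (all uptimes equal).
        "ratio": below / above if above else None,
    }


summary = uptime_summary([1, 2, 3, 10])
```

With these sample values the average is 4 and the median is 2.5: one long-running hub pulls the average up, so three hubs sit below it and only one above.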

On the vendors page I restored the GDNA data. For now, though, no hubs will show up as GDNA. GnucDNA does not send any vendor code at all, so the current crawler just shows those hubs as UNKN. The old crawler looked at the hub’s User-Agent to figure out if it was GnucDNA and then set the vendor code appropriately. So while new data won’t be logged the same way, the trend of GDNA can at least be seen again on the yearly vendor graph.
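The old crawler’s fallback could have looked roughly like the sketch below. This is a guess at the logic, not the old crawler’s code: the function name is made up, and the assumption that the User-Agent contains the text "GnucDNA" is mine.

```python
def vendor_code(reported_code, user_agent):
    """Pick a vendor code, falling back to the User-Agent when none was sent."""
    if reported_code:
        return reported_code
    # Assumption: GnucDNA-based clients mention "GnucDNA" in their User-Agent.
    if "gnucdna" in (user_agent or "").lower():
        return "GDNA"
    return "UNKN"


# Made-up User-Agent strings for illustration.
with_fallback = vendor_code("", "ExampleClient 1.0 (GnucDNA 1.1)")
without_match = vendor_code("", "ExampleClient 1.0")
```

The current crawler effectively skips the User-Agent check, which is why those hubs now show up as UNKN.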

Lastly, there is an experimental feature that shows where each hub is on a geographic map, if its user provided that information in their profile. Of course most users do not reveal this information and some lie about their location, but it can still be a lot of fun. You get to see that the G2 network has users all over the world and they’re all interconnected. This mapping feature is experimental because it requires a fast computer in order to work smoothly. So give it a try but understand that it may not work well for everyone. In my experience Chrome did really well while Firefox and IE did poorly. If you already have Chrome installed you might try this feature in that browser even if you don’t use it for anything else.

Green lines are used to show hubs that are connected together. Clicking on a marker will reveal additional information about that hub:

  • username if provided
  • name and version of the software they are using
  • number of hubs and leaves they are connected to
  • the actual country they are in based on their IP address and MaxMind’s GeoIP
  • the exact coordinates they provided

I hope you like the additions. Keep the suggestions coming.

General


Quick g2paranha update

The crawler has been running pretty well with only minor tweaks from day to day which sometimes show up as blips in the graph. It was also down for a few days due to a failing hard drive.

Yesterday the crawler got into the Foxy network again. Foxy uses the same protocol as G2 but is a private network separate from G2, probably using GnucDNA’s authentication scheme. Their network is significantly larger than the G2 network, as can be seen from the spike in the history graph. Since it is so much larger it presents problems for crawling, and since it doesn’t represent the true open network of G2 I have filtered it out from being crawled.

General


g2paranha – The New G2 Crawler

Anyone who has read through this blog knows that the crawler has tended to crash fairly often. In recent times it was crashing too much to even continue running it. But rather than give up entirely I decided to write my own crawler. Five weeks later, g2paranha has emerged. Along with the new crawler comes a redesigned website written by kevogod. g2paranha has been designed to be distributed. Currently I run one crawler and Datz kindly volunteered to run another.

Over the next few weeks expect there to be a few bumps as any bugs get ironed out. I’ll also be rounding out the feature set provided by the original crawler. Right now that means adding country flags and keeping track of unique nodes. This last feature is the best estimate of the network size, so until it is implemented the network counter won’t be put back on the front page. The leaf count on the history page overcounts the network size because each leaf can connect to multiple hubs.

I’ll try to keep the news updated as I progress.

General


The State of G2

I was reading the Gnutella2 article on Wikipedia today and I noticed both entries in the External Links section point to my sites (crawler.trillinux.org and g2.trillinux.org). The latter is the new home for the G2 specs after gnutella2.com was allowed to expire. This got me thinking that it looks like I’m the only one trying to keep G2 from completely disappearing.

This is partly to benefit others and partly out of self-interest. I don’t think the G2 protocol as a whole is all that spectacular anymore, if it ever was. But parts of the protocol can be reapplied to accomplish other things. For example, at its core is a specification for a compact, extensible tree structure for communication. This could be made generic and used for all sorts of applications outside of G2. The search mechanism of a random walk is not original or unique, but G2 is the largest P2P network I’m aware of that still makes use of it, so from that perspective it could be interesting to study.

I run the crawler out of self-interest. I like data, statistics, and graphs. I never turn down the opportunity to collect raw data and turn it into graphs and make inferences from the data.

I started maintaining the G2 website sometime in (late?) 2005 and moved it to its current home in October 2007. The crawler has similarly been running since late 2005. Here’s to many more years to come.

General


More Crawler Downtime

I spent last weekend replacing my router with another computer. The transition was a bit bumpy but things are starting to get sorted out. More extended periods of downtime are possible over the next few weeks as I get things completely transitioned and working reliably.

General
