Here are a couple of graphical representations of the results. They make great wallpapers!
My Current Wallpaper
A Zoomed Out Link Graph
Obtaining the Data
My first task was to download the Wikipedia dataset in its entirety. I chose the XML format, which is approximately 10GB compressed and nearly 40GB uncompressed. I naively attempted to load the dataset using the .NET runtime XML parsers. Unfortunately, they attempt to load the entire document into memory before it can be queried. My server has 4GB of RAM, and I quickly learned that this simply wouldn't work.
I decided to work with a subset of the data to ease the load on my server. This worked out well.
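If you run into the same memory wall, a streaming parser is one way around it. The following is a minimal sketch in Python, not the code from this project. It assumes the dump wraps each article in a `page` element containing `title` and `text` elements, and the names `iter_pages`, `local_name`, `LINK_RE`, and `SAMPLE` are made up for illustration. It reads one page at a time and can stop early, which is also a cheap way to take a subset.

```python
import io
import re
import xml.etree.ElementTree as ET

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
# Assumption: namespaced links such as [[File:...]] are not filtered out here.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def local_name(tag):
    """Strip an XML namespace prefix like '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source, max_pages=None):
    """Yield (title, [linked titles]) one page at a time from a file or path."""
    title, text, count = None, "", 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, [target.strip() for target in LINK_RE.findall(text)]
            elem.clear()  # release this page's text and child elements
            title, text = None, ""
            count += 1
            if max_pages is not None and count >= max_pages:
                return  # stop early to work with a subset


# Tiny hand-written stand-in for a dump, only to show the output shape.
SAMPLE = b"""<mediawiki>
  <page><title>Dinosaur</title><revision><text>A [[Reptile|reptile]] related to [[Bird]]s.</text></revision></page>
  <page><title>Bird</title><revision><text>See [[Dinosaur#Origin]].</text></revision></page>
</mediawiki>"""

for page_title, links in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, links)
```

Passing a file path instead of the `io.BytesIO` object works the same way, and `max_pages` limits how much of the dump is read.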
Assembling a Graph
The links on Wikipedia can be modeled using the abstract graph data type.
I have a type that consists of the page name and a list of all the other pages it links to. To determine a path, I recursively search for the destination page. Once I find a path to the destination, I record the hops necessary to get there. I then keep searching for an even shorter path until there are no unexplored paths left.
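The search described above can be sketched roughly as follows. This is a Python illustration rather than my original code: the graph is a plain dictionary from page name to linked page names, and `shortest_path`, `max_hops`, and the toy link data are all made up for the example.

```python
def shortest_path(graph, start, goal, max_hops=6):
    """Recursive search that keeps the shortest path found so far.

    graph maps a page name to the list of page names it links to.
    Returns the path as a list of page names, or None if the goal is
    not reachable within max_hops.
    """
    best = None

    def visit(page, path):
        nonlocal best
        # A path at least as long as the best one cannot improve on it.
        if best is not None and len(path) >= len(best):
            return
        if page == goal:
            best = list(path)  # record the hops needed to get here
            return
        if len(path) - 1 >= max_hops:
            return  # hop budget used up
        for nxt in graph.get(page, ()):
            if nxt not in path:  # do not walk in circles
                path.append(nxt)
                visit(nxt, path)
                path.pop()

    visit(start, [start])
    return best


# Toy link data, invented for illustration only.
links = {
    "Dinosaur": ["Reptile", "Bird"],
    "Reptile": ["Animal"],
    "Animal": ["Evolution"],
    "Bird": ["Evolution"],
}

print(shortest_path(links, "Dinosaur", "Evolution"))
# → ['Dinosaur', 'Bird', 'Evolution']
```

The search first finds the three-hop route through Reptile and then replaces it with the two-hop route through Bird. On a large graph a breadth-first search is usually a faster way to get the same answer, since it reaches the destination by the fewest hops without having to revisit longer paths.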
Graphing the Results!
Once I realized that I could definitely beat this game with some simple programming, I quickly lost interest in the project. I decided to try graphing the output instead. Here is one of the first graphs that I created. This one originates at Dinosaur and performs 6 hops. There is a lot of overlapping text, but it looks good!
An early prototype!
Linux is the Source
Starting to look Good
Apple Inc.
Inkscape, most bugs are fixed at this point
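If you want to try something similar, here is a minimal Python sketch that collects every link within a given number of hops of a starting page and writes it out as Graphviz DOT text. It assumes Graphviz for rendering, which is not necessarily the tool behind the images above, and `neighborhood_edges`, `to_dot`, and the toy link data are made up for illustration.

```python
from collections import deque


def neighborhood_edges(graph, start, max_hops):
    """Collect (source, target) links from pages fewer than max_hops from start."""
    depth = {start: 0}
    edges = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_hops:
            continue  # do not expand pages at the hop limit
        for nxt in graph.get(page, ()):
            edges.append((page, nxt))
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return edges


def quote(name):
    """Escape backslashes and double quotes for a DOT quoted identifier."""
    return name.replace("\\", "\\\\").replace('"', '\\"')


def to_dot(edges):
    """Render the edge list as a DOT digraph."""
    lines = ["digraph links {"]
    for src, dst in edges:
        lines.append(f'  "{quote(src)}" -> "{quote(dst)}";')
    lines.append("}")
    return "\n".join(lines)


# Toy link data, invented for illustration only.
toy_links = {"Dinosaur": ["Bird"], "Bird": ["Feather"], "Feather": ["Keratin"]}

print(to_dot(neighborhood_edges(toy_links, "Dinosaur", 2)))
```

Saving the printed text as `links.dot` and running `dot -Tsvg links.dot -o links.svg` should produce an SVG that can be opened in Inkscape. Expect overlapping labels on anything large.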
If anyone wishes to see the original SVG files, I will be happy to share them. They are rather large so I won't bother hosting them until I have some serious interest.
Releasing the Code
I have decided not to release this code. There is nothing proprietary about what I have done, and I have given you more than enough information to create your own implementation. This was a weekend experiment and, as such, is riddled with dependencies on my server and on specific files that existed on my hard drive at the time. It is untested and is definitely not production code.
This became more of an art project than anything and the results are fantastic. If you decide to implement your own, I would love to see it!