Wikipedia not only provides the digital world with a vast amount of high-quality information; it also opens up new opportunities to investigate the processes behind the creation of that content, as well as the relations between knowledge domains.
In their daily work, Wikipedia editors keep articles up to date: natural disasters, shiny new pop icons and scandals are reflected in new articles or in new links between them. But how do these pages and their links evolve over time? Can we visually track how ties between subject areas grow stronger? Is there a way to notice that an article is becoming more influential?
Our first attempt at answering these questions is a visualization that renders pages as the nodes of a graph. A link between two pages is represented as an edge. Each graph is a snapshot of the articles at a specific date; the slider and the video controls on the left let you navigate back and forth in time.
Try it out: scroll to zoom in and out, use the video controls to start and pause the animation, or drag the slider to any point in time.
Selection of the Nodes
There are currently 3.6 million articles in the English Wikipedia, and displaying nodes for all of them at the same time would make little sense. For our first prototype we decided to display only the 50 most important nodes of a given data set.
How do we define importance? We select the top nodes by their indegree: the number of links that point to an article, a simple way to measure basic influence and relevance. The data sets are built from related categories on Wikipedia. For example, to look at modern musical groups we take all members of the categories “Musical groups established in 1990”, “Musical groups established in 1991” and so forth.
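As a rough illustration of the indegree selection, here is a minimal Python sketch. It is not the code behind the visualization: the function name `top_by_indegree`, the edge-list input and the choice to count only links between members of the data set are assumptions made for this example.

```python
from collections import Counter


def top_by_indegree(links, members, k=50):
    """Return the k member articles with the highest indegree.

    links   -- iterable of (source, target) article-title pairs
    members -- set of article titles that belong to the data set
    """
    indegree = Counter()
    seen = set()
    for source, target in links:
        # Skip self-links and links that were already counted.
        if source == target or (source, target) in seen:
            continue
        # Assumption: only links inside the data set contribute.
        if source in members and target in members:
            seen.add((source, target))
            indegree[target] += 1
    # Highest indegree first; ties are broken alphabetically by title.
    ranked = sorted(members, key=lambda title: (-indegree[title], title))
    return [(title, indegree[title]) for title in ranked[:k]]


# Toy data with placeholder article names, not real Wikipedia links.
example_links = [
    ("Band B", "Band A"),
    ("Band C", "Band A"),
    ("Band A", "Band B"),
    ("Band D", "Band C"),
    ("Band C", "Band C"),          # self-link, ignored
    ("Band A", "Outside article"),  # leaves the data set, ignored
]
example_members = {"Band A", "Band B", "Band C", "Band D"}
print(top_by_indegree(example_links, example_members, k=2))
```

With `k=50` and the members of the “Musical groups established in …” categories as `members`, the same function would yield the node list for one snapshot.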
Collecting the necessary data is a time-consuming process. The usual approach to network analysis on Wikipedia is to use the complete database dumps provided by the Wikimedia Foundation. The problem with these dumps is that they are either very large (the complete dump that contains all historical data: 5 TB) or do not offer a high enough date resolution to accurately track the development of current events. To get around these issues we developed a data fetcher that uses the HTTP API. It continuously collects and stores the minimal amount of information we need to build link networks for a selected list of articles at the desired date resolution.
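To give an idea of what a request to the HTTP API can look like, the following sketch builds a query URL for the outgoing links of a few articles and turns a JSON response into an edge list. It is an assumption-laden illustration, not our fetcher: the helper names `build_links_query` and `extract_links` are made up, the sketch only covers the current links of a page (no historical revisions, no result continuation), and it performs no network access itself.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_query(titles, limit="max"):
    """Build a MediaWiki API URL asking for the links on the given pages."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": "|".join(titles),  # several titles are separated by "|"
        "plnamespace": 0,            # article namespace only
        "pllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_links(response_text):
    """Turn a JSON response of the query above into (source, target) pairs."""
    data = json.loads(response_text)
    edges = []
    for page in data.get("query", {}).get("pages", {}).values():
        # Pages without links (or missing pages) carry no "links" key.
        for link in page.get("links", []):
            edges.append((page["title"], link["title"]))
    return sorted(edges)


# Hand-written sample in the shape of an API response, with placeholder titles.
sample_response = json.dumps({
    "query": {"pages": {"1": {
        "pageid": 1, "ns": 0, "title": "Band A",
        "links": [{"ns": 0, "title": "Band C"}, {"ns": 0, "title": "Band B"}],
    }}}
})
print(build_links_query(["Band A", "Band B"]))
print(extract_links(sample_response))
```

A real fetcher would additionally have to follow the continuation data the API returns for long link lists and store a timestamp with every snapshot.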
Looking at the changes in the graph over time, it becomes clear that the simple indegree criterion has some shortcomings. Because it counts all links accumulated so far, it favors long-established articles and fails to discover fast-rising subjects. Or, speaking figuratively: despite the attention they currently receive, Lady Gaga and Justin Bieber do not stand a chance against Madonna or Eric Clapton. While one might claim that this situation is perfectly justified and reflects their artistic contributions, it would still be interesting to develop a set of metrics that select and rank nodes based on short-term spikes in interest or relevance.
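One conceivable metric of this kind, sketched here purely as an illustration and not as something we have implemented or evaluated, is the relative growth of the indegree between two snapshots. The function name `rank_by_indegree_growth`, the `smoothing` parameter and the numbers in the example are all assumptions.

```python
def rank_by_indegree_growth(earlier, later, k=50, smoothing=1):
    """Rank articles by relative indegree growth between two snapshots.

    earlier, later -- dicts mapping article title to indegree
    smoothing      -- added to the denominator so that articles which are
                      new in the later snapshot do not cause a division by zero
    """
    scores = {}
    for title, now in later.items():
        before = earlier.get(title, 0)
        scores[title] = (now - before) / (before + smoothing)
    # Strongest growth first; ties are broken alphabetically by title.
    return sorted(scores, key=lambda title: (-scores[title], title))[:k]


# Invented indegree values for placeholder articles.
snapshot_old = {"Established act": 400, "Rising act": 20, "Newcomer": 0}
snapshot_new = {"Established act": 410, "Rising act": 80, "Newcomer": 5}
print(rank_by_indegree_growth(snapshot_old, snapshot_new, k=3))
```

In this toy example the established article gains ten links but ends up last, because the gain is small relative to its existing indegree. A relative measure like this is noisy for articles with very few links, so it would probably need a minimum-indegree threshold in practice.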
posted by Reto Kleeb