Tuesday, July 19, 2011

Wikimaps Revised

The first version of the Wikimaps Page (http://www.ickn.org/wikimaps/) that we published a couple of weeks ago helped to visualize the basic idea of Wikimaps. It consists of an interactive animation that allows visitors to visually track the changes in Wikipedia articles over a given time period. Real world activities and events are reflected in updates of the respective articles and the links between them.

Rise and Fall (of Swiss Tennis Star Hingis) on Wikimaps
A good example is the retirement of the Swiss tennis player Martina Hingis. While her page (node) is still well connected to the network in 2008 (roughly a year after she retired), the page is not listed in the network anymore after February 2010. It is important to note that this view of the network is filtered and only displays nodes that “survive the cut”: The page of the former number one ranked player is still there and has many links pointing to it, just not enough to appear in this filtered “most important pages” view.

Another example is the case of the former president of the International Monetary Found, Dominique Strauss-Kahn. We tracked changes in related pages for a time span of approximately 8 months and built a network with weekly snapshots. On May 14th 2011, Strauss-Kahn was arrested in New York City, this event lead to an spike in the activities in the network surrounding the page of Strauss-Kahn. Interestingly enough the increased activities that lead to this spike were not solely based on pages directly related to the arrest. The attention lead to a general increase of activities on related pages.

Watch a video of the changes in the Dominique Strass-Kahn graph:

The following graph shows the spike in activities in the graph around the 14th of May.

(Activity in a network is defined as the sum of additions and deletions of nodes within a given time frame)

Although we think this first visualization is already pretty cool, the results did not really surprise anyone. The data that was initially used was very static. We simply picked (seemingly related) categories and selected the pages that had the highest indegree values. Pages that would be “close” or relevant but not members of the selected categories would never show up in the graph.

To mitigate the shortcomings of this approach we decided to change our approach for the collection of the pages that would be considered candidates for the graph. The most promising idea was and still is, a combination of weighted components, possibly applied in multiple iterations. Or as we call it, a Filtered Breadth First Search.

Effective Filtering is Key
One of the challenges of working with the Wikipedia graph is the size of it. An optimal algorithm should therefore handle the trade-off between maintaining a small sub-graph while still returning meaningful results. A naively executed BFS would quickly lead to an explosion of articles that would have to be considered. To prevent this we only follow edges (links) that are considered interesting or relevant. The decision whether to follow a link during the execution of the search is based on a weighted mix of the following metrics:
● Local Indegree
● Global Indegree
● Number of recent page edits
● Reciprocal Links to source page
● Shortest Path Distance to Source Page
● Wikipedia Full-text search results

Naive Degree-Based Filtering leads to “boring” results
It would be a lot easier to simply include pages based on a single metric, namely the one that is the least expensive and seemingly a very meaningful one: The (local) Indegree, the number of pages that link to a certain page. The problem is, that this metric strongly favors so called hub-pages, these are pages that are linked to a lot altough they are semantically not directly related. Typical examples are pages for certain dates or countries. There were ideas to filter these pages using blacklists or to work with an Indegree-Band (as opposed to a lower limit).

These two ideas to however turned out to be very tedious and error-prone. We further believe that the most relevant results can only be found by a cleverly tuned combination of many factors.

There is another network on wikipedia besides the one that based on articles and links. It’s the network of the Wikipedia authors and their collaborations. We anticipate that the incorporation of these informations will additionally improve the relevance of the nodes in a Wikimap network. Read this previous blog post for an explanation of the basic idea.

posted by Reto Kleeb