ODCSSS 2010: Week 2

Week of Monday, 7th of June 2010

Flickr Stream

Monday was a Bank Holiday in Ireland, so we had the day off. Over the weekend I had decided that I really didn’t want to spend another week reading research papers, so I was set on doing as much coding as possible. I started off on Tuesday by researching ways to extract data from Wikipedia. There are a number of methods that can be used, such as:

  • Wikimedia’s Database Dumps (http://en.wikipedia.org/wiki/Wikipedia_database) – These are periodic database dumps of entire Wikimedia sites. They’re available as .sql files and can run to hundreds of gigabytes. You can also get HTML dumps of the latest version of the site, but that loses the history and wiki markup, both of which I need. I was very intrigued by the possibility of using the dumps, but when I realised that the English Wikipedia runs to a few hundred gigabytes of data, I decided that it was probably better to find something else.
  • Wikipedia’s Special:Export feature – This will export either up to the first 1000 revisions of a page or the latest version with all the wiki markup, for the page(s) that you specify. It also has the handy feature of letting you specify categories of pages, so for instance you could download all of the Social Science pages if you wanted to. This looked very promising, except that it only allows the first X revisions, up to 1000, which I felt was a bit limiting. I thought that in the long run I would need more data from each page to make it a worthwhile dataset.
  • Wikipedia’s API – A well-documented, if rarely used, API that Wikipedia provides for programmers to access its data. You can get all the data for any article, back through 5000 revisions (that’s the API limit for bots). The beauty of the API is that you can specify exactly what data you want, and in what format (see the API documentation for all the options). When I saw this I knew it was the one. It has plenty of options and exactly the kind of data I needed.
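To make the API option a bit more concrete, here is a minimal sketch of how a revision query URL might be put together. The helper name `revisions_url` and the choice of fields are my own assumptions for illustration, not the script I actually wrote; the parameters (`action=query`, `prop=revisions`, `rvlimit`, `rvprop`, `format`) are standard MediaWiki API options.

```ruby
require "uri"

# Wikipedia's API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Hypothetical helper: builds a URL asking the API for the most recent
# revisions of one page, returned as XML.
def revisions_url(title, limit: 10)
  params = {
    "action"  => "query",
    "prop"    => "revisions",
    "titles"  => title,
    "rvlimit" => limit,
    # Which pieces of each revision to return.
    "rvprop"  => "ids|user|timestamp|comment|content",
    "format"  => "xml"
  }
  "#{API_ENDPOINT}?#{URI.encode_www_form(params)}"
end

puts revisions_url("Google")
```

Opening the printed URL in a browser should show the same data the script would receive.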

So I decided on using the API to gather the data, but now the issue was how to actually implement it. After a fair bit of research I determined that there was no decent, maintained and well-documented Ruby wrapper for the API. I was all for writing my own wrapper, but then I realised that I only needed one thing: to query a URL and get the data back.

I didn’t need or want any other operation. If I went to the page in a browser it gave me everything. Now I just needed a way to get the data programmatically. I had heard of cURL before, but I’d only ever used it when copy-pasting Bash commands. Since I had a tiny bit of knowledge I decided to run with it. I did some Googling and identified Curb (the cURL bindings for Ruby) as the way to go.

It didn’t seem to want to install itself on the Linux desktop that I’d been given, and after some battling I gave up on making it work. However, it installed on my Mac the first time round (I’m so tempted to say “Mac: it just works”).

Anyway, with Curb working I wrote a quick script in Ruby that used cURL to query the API and get the data back in XML. After some tweaking and refinement, I came out with a nice generic script that takes a list of pages (from a separate file), gets the data for those pages and saves it locally as XML files.
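A rough sketch of that kind of script is below. It is not my original: it uses Net::HTTP from Ruby’s standard library in place of Curb so that it runs without any gems, and the names `save_pages` and `DEFAULT_FETCHER` and the one-title-per-line file format are assumptions made for illustration.

```ruby
require "net/http"
require "uri"
require "fileutils"

# Default fetcher: a plain HTTP GET via the standard library (not Curb).
# Wikimedia asks API clients to send a descriptive User-Agent; that is
# left out here to keep the sketch short.
DEFAULT_FETCHER = ->(url) { Net::HTTP.get(URI(url)) }

# Reads page titles (one per line) from list_path, fetches the revision
# data for each, and writes it to out_dir/<title>.xml.
# Returns the list of file paths written.
def save_pages(list_path, out_dir, fetcher: DEFAULT_FETCHER)
  FileUtils.mkdir_p(out_dir)
  titles = File.readlines(list_path, chomp: true).map(&:strip).reject(&:empty?)
  titles.map do |title|
    query = URI.encode_www_form(
      "action" => "query", "prop" => "revisions", "titles" => title,
      "rvlimit" => 10, "rvprop" => "ids|user|timestamp|comment|content",
      "format" => "xml"
    )
    xml = fetcher.call("https://en.wikipedia.org/w/api.php?#{query}")
    # Replace characters that are awkward in file names.
    path = File.join(out_dir, "#{title.gsub(/[^\w.-]/, '_')}.xml")
    File.write(path, xml)
    path
  end
end
```

With a hypothetical `pages.txt` containing one title per line, `save_pages("pages.txt", "xml")` would leave one XML file per page in the `xml` directory. Passing the fetcher in as an argument makes it easy to swap in Curb or a stub for testing.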

Once I had that working I moved on to figuring out a way to parse the XML to get the appropriate data out of it. From my experience with Ruby and Rails I had a rough idea of what I was looking for: something like Active Record, which maps data to an object that I can then manipulate. After a bit of searching and trial and error with a few tools, I found HappyMapper (GitHub, RubyForge, RDoc, RubyGems), which takes XML and maps it (happily) to an object. It also provides a nice couple of methods for traversing the data.

After some jiggery-pokery trying to get the XML that Wikipedia gives back to fit in with the way that HappyMapper defines its objects, I finally got it working. After that I wrote a simple proof of concept script that just printed out the following data from each revision:

  • The User that made a revision
  • The Date & Time of that revision
  • The Comment that the user left
  • The text of the revision

This is not the final product by any means, but it shows that all of the appropriate data that we need can be extracted easily. There is more information in there if we want to get it, but I didn’t want to spend all day on this one proof of concept. I’ve included some sample output from the Google page at the end of this article.
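For anyone curious what the proof of concept boils down to, here is a small sketch of the same idea. It uses REXML, which ships with Ruby, instead of HappyMapper, so it is not the code I ran; the `Revision` struct and `parse_revisions` name are made up for illustration. The sample XML is hand-written in the shape the API returns, reusing two revisions from the output at the end of this article with the article text left out.

```ruby
require "rexml/document"

# Plain value object holding the fields of one revision.
Revision = Struct.new(:revid, :parentid, :user, :timestamp, :comment, :text)

# Pulls every <rev> element out of an API response and maps it to a Revision.
def parse_revisions(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//rev").map do |rev|
    a = rev.attributes
    Revision.new(a["revid"], a["parentid"], a["user"],
                 a["timestamp"], a["comment"], rev.text.to_s)
  end
end

# Hand-written sample in the shape of the API's XML response.
SAMPLE = <<~XML
  <api><query><pages><page title="Google"><revisions>
    <rev revid="366673948" parentid="366673616" user="Histornomicon"
         timestamp="2010-06-07T23:32:10Z" comment="seven">(text omitted)</rev>
    <rev revid="366673616" parentid="366673431" user="Histornomicon"
         timestamp="2010-06-07T23:30:17Z" comment="">(text omitted)</rev>
  </revisions></page></pages></query></api>
XML

parse_revisions(SAMPLE).each do |r|
  puts "Revid: #{r.revid}, Parentid: #{r.parentid}"
  puts "User: #{r.user}"
  puts "Timestamp: #{r.timestamp}"
  puts "Comment: #{r.comment}"
  puts "-" * 47
end
```

The struct plays the role that the HappyMapper-defined object played in my script: once the XML is mapped onto it, each field is just a method call away.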

So with those two things wrapped up nicely I finished up for the day, and for the week, on Wednesday. Before I got offered my ODCSSS internship I had agreed to be a ‘Student Leader’ for the CSI Summer School, which was on all day Thursday & Friday. So after clearing it with my ODCSSS supervisor, I spent those two days helping out at the summer school instead of working on my project. It was good fun, and turned out to be better paid than ODCSSS.

So there you have it, the second week of my ODCSSS internship. Unfortunately it was cut short by the summer school, but I’m looking forward to next week and getting a good solid week’s work done.

N.B.

This shows the information from the last 10 revisions of the Google page. The text itself is rather long, so I didn’t include it, but it’s all there.

Google

_________________________________________________
Revid: 366826294, Parentid: 366673948
User: Citation bot
Timestamp: 2010-06-08T17:18:31+00:00
Comment: Citations: [161]Tweaked: url. [[User:Mono|Mono]]
-----------------------------------------------
Revid: 366673948, Parentid: 366673616
User: Histornomicon
Timestamp: 2010-06-07T23:32:10+00:00
Comment: seven
-----------------------------------------------
Revid: 366673616, Parentid: 366673431
User: Histornomicon
Timestamp: 2010-06-07T23:30:17+00:00
Comment:
-----------------------------------------------
Revid: 366673431, Parentid: 366655325
User: Histornomicon
Timestamp: 2010-06-07T23:29:17+00:00
Comment: six
-----------------------------------------------
Revid: 366655325, Parentid: 366654954
User: Histornomicon
Timestamp: 2010-06-07T21:53:59+00:00
Comment: two
-----------------------------------------------
Revid: 366654954, Parentid: 365226870
User: Histornomicon
Timestamp: 2010-06-07T21:51:54+00:00
Comment: one
-----------------------------------------------
Revid: 365226870, Parentid: 365022855
User: StephaneOdul
Timestamp: 2010-05-31T14:46:46+00:00
Comment: /* Innovation Time Off */
-----------------------------------------------
Revid: 365022855, Parentid: 365021980
User: Ianmacm
Timestamp: 2010-05-30T13:45:13+00:00
Comment: rv, clear [[WP:TOPIC]], please discuss on the talk page rather than reverting
-----------------------------------------------
Revid: 365021980, Parentid: 365020595
User: Rahulchoudhary003
Timestamp: 2010-05-30T13:38:42+00:00
Comment: Reverted to revision 365019743 by [[Special:Contributions/Rahulchoudhary003|Rahulchoudhary003]]; dont make this article booring . ([[WP:TW|TW]])
-----------------------------------------------
Revid: 365020595, Parentid: 365019743
User: Ianmacm
Timestamp: 2010-05-30T13:27:13+00:00
Comment: rv [[WP:AGF|good faith]] edit per [[WP:TOPIC]]. [[Google]] is about the company, the founders have their own BLPs
-----------------------------------------------