ODCSSS Wrapup


Given my failed attempts to chronicle my internship on a week-by-week basis, I’ve decided instead to post a brief wrap-up of my summer research project: what the aim of the project was, how close I got to completing it, and what I learned in the process. I will also describe the end result and my future plans.

The name of the internship was Assessing the Authoritativeness of Source on Wikipedia and, in short, the goal was to create an algorithm which would automatically classify pages as good, bad or of middling quality, something which has yet to be done effectively by anybody. After some background research, which involved reading a number of long-winded papers by people who had worked in the same area, I set to work.

My first step was relatively simple: I had to figure out how to get the data I was going to analyse. I wrote a Ruby script which used the MediaWiki API to gather all the data about a page. The API provides an interface to pretty much every datastore on a MediaWiki-based site, so it was the best way to get the information we needed. In the end my code fetched a large amount of data, including the entire text of every revision of a page, which user made the revision and when, and what comments they made about it.
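To give a feel for what such a request looks like, here is a minimal sketch rather than the actual script. It builds a revision query URL and pulls the per-revision metadata out of a JSON response. The method names (`revisions_url`, `parse_revisions`), the default endpoint and the limit of 50 are assumptions made for illustration. The sketch assumes the classic JSON layout, where pages are keyed by page id, and it leaves out revision text, which needs `content` added to `rvprop`.

```ruby
require 'json'
require 'uri'

# Build the URL for a revision query against a MediaWiki API endpoint.
# The endpoint and limit defaults are illustrative assumptions.
def revisions_url(title, limit: 50, endpoint: 'https://en.wikipedia.org/w/api.php')
  params = {
    action: 'query',
    prop: 'revisions',
    titles: title,
    rvprop: 'ids|timestamp|user|comment',
    rvlimit: limit,
    format: 'json'
  }
  "#{endpoint}?#{URI.encode_www_form(params)}"
end

# Flatten a JSON response body into one hash per revision.
# Assumes the layout query -> pages -> <page id> -> revisions.
def parse_revisions(json_body)
  pages = JSON.parse(json_body).dig('query', 'pages') || {}
  pages.values.flat_map do |page|
    (page['revisions'] || []).map do |rev|
      {
        page: page['title'],
        id: rev['revid'],
        user: rev['user'],
        timestamp: rev['timestamp'],
        comment: rev['comment']
      }
    end
  end
end
```

Fetching the body itself could be done with `Net::HTTP.get(URI(revisions_url('Some page')))`. A page with a long history needs several requests, since the API returns revisions in batches.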

The next step was deciding how to analyse the data itself. Our picture of the final algorithm was rather fuzzy throughout the internship. We knew what the end goal was, but how we were going to get there changed after every meeting. Throughout the process it was clear that we would need to extract some information from the data that wasn’t given to us explicitly by the API. One of those pieces of information was reverts. On Wikipedia a revert is an edit which entirely undoes another user’s edit, and ‘reverts’ the article back to a previous version. Wikipedia can record this information, but not in a reliable way: it usually shows up as tags or as a comment by the user who did the revert. So it was my task to write some code which could identify a revert. There were a couple of other pieces of information that had to be extracted, but the reverts were by far the trickiest. In the end I hashed the text of every revision and then compared the hash to those of all the revisions before it. If two revisions are exactly the same, their hashes will be the same, so the later one is a revert back to the earlier version.
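The hashing idea can be sketched in a few lines. This is an illustration rather than the project code: the method name `find_reverts`, the choice of SHA1 and the shape of the result are all assumptions. Instead of comparing each hash against every earlier one, it keeps a lookup table of digests already seen, which gives the same answer with less work. It also assumes a revert points back to the earliest revision with the same text.

```ruby
require 'digest'

# Given the texts of a page's revisions in chronological order, return one
# entry per revert: the index of the reverting revision and the index of the
# earliest earlier revision with identical text.
def find_reverts(revision_texts)
  first_seen = {} # digest => index of the first revision with that text
  reverts = []
  revision_texts.each_with_index do |text, index|
    digest = Digest::SHA1.hexdigest(text)
    if first_seen.key?(digest)
      reverts << { revert: index, restores: first_seen[digest] }
    else
      first_seen[digest] = index
    end
  end
  reverts
end
```

For example, `find_reverts(%w[a b a c b])` reports that revision 2 restores revision 0 and revision 4 restores revision 1. Only exact matches are caught, so a revert that also changes a single character is not detected.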

It was around this time that I had to give a midterm presentation about my project. I’ll include it here just in case you want to see it. Don’t worry, it’s not a boring ol’ PowerPoint. It actually looks rather snazzy, even if I do say so myself. It’s more of a ‘high level overview’ of the goal of the internship.

Once we had extracted the necessary features from the data, we set about analysing it. Over the course of the internship we had formed a picture of how we would do this. We took a user-centric view, in which we classified pages based on the users that had contributed to them. Initially we created a dataset of users, in which each user had certain values associated with them:

  • Bot: Whether or not the user is a registered bot on Wikipedia
  • Edit Count: The number of edits that a user has made on Wikipedia (within our sample set of pages)
  • Page Count: The number of pages that a user has edited on Wikipedia (within our sample set of pages)
  • Reverted To: The number of times a user has been reverted to, i.e. a later edit restored that user’s version over another user’s
  • Reverted Over: The number of times a user has been reverted over, i.e. a later user has undone all of that user’s edits
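As a rough illustration of how the counting values above could be tallied, here is a sketch under a few assumptions. Each page is given as the list of its editors in revision order, plus a list of reverts as index pairs (the `:revert` and `:restores` keys are made up for this example). The user who wrote the restored revision is counted as reverted to, and every user with a revision in between is counted as reverted over. The Bot flag is not covered, since it would come from Wikipedia’s own user data rather than from counting.

```ruby
require 'set'

# pages: { title => { revision_users: [...], reverts: [{ revert: i, restores: j }] } }
# Returns { user => { edit_count:, pages:, reverted_to:, reverted_over: } }.
def user_stats(pages)
  stats = Hash.new do |hash, user|
    hash[user] = { edit_count: 0, pages: Set.new, reverted_to: 0, reverted_over: 0 }
  end
  pages.each do |title, page|
    users = page[:revision_users]
    users.each do |user|
      stats[user][:edit_count] += 1
      stats[user][:pages] << title # Page Count is the size of this set
    end
    page[:reverts].each do |revert|
      # The author of the restored revision was reverted to.
      stats[users[revert[:restores]]][:reverted_to] += 1
      # Everyone who edited between the restored revision and the revert
      # had their work undone (counted once per revert).
      undone = users[(revert[:restores] + 1)...revert[:revert]]
      undone.uniq.each { |user| stats[user][:reverted_over] += 1 }
    end
  end
  stats
end
```

Whether a user reverted over twice in a single revert should count once or twice is a judgment call. This sketch counts once per revert.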

Initially I manually classified a small group of users as either ‘reliable’ (likely to produce good content) or ‘other’, and used machine learning to check the internal consistency of that set. Since the internal consistency was high, we decided to use the sample set as seed data to classify the other 13,000 registered users that we had gathered data about.
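One common way to check the internal consistency of a hand-labelled set is leave-one-out cross-validation: hide each user in turn, predict their label from everyone else, and see how often the prediction matches. The sketch below uses a simple nearest-neighbour rule as a stand-in learner. That choice is an assumption made for illustration, not a description of the learner used in the project, and the method names are made up.

```ruby
# Euclidean distance between two equal-length feature vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Label of the labelled example closest to the sample.
# labelled is an array of [features, label] pairs.
def nearest_label(sample, labelled)
  labelled.min_by { |features, _label| distance(sample, features) }.last
end

# Fraction of examples whose label is predicted correctly when each one is
# held out in turn and classified using the remaining examples.
def leave_one_out_accuracy(labelled)
  correct = (0...labelled.size).count do |i|
    features, label = labelled[i]
    rest = labelled[0...i] + labelled[(i + 1)..-1]
    nearest_label(features, rest) == label
  end
  correct.to_f / labelled.size
end
```

With features such as `[edit_count, page_count]`, a score close to 1.0 suggests the manual labels follow a pattern a learner can pick up. Real features would need scaling first, since raw edit counts would otherwise dominate the distance.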

Using this labelled dataset of users, we then set about doing something similar for pages. We created a dataset of pages with another set of values:

  • Edit Count: The number of edits (revisions) on that page
  • Link Count: The number of links (from pages within Wikipedia) pointing to that article
  • Revert Count: The number of reverts which that page has had
  • User Count: The total number of users that have contributed to that page
  • User (type) Count: The number of users of each classification that have contributed to that page (reliable, other)
  • User (type) Percentage: The percentage of total users of each classification that have contributed to that page (reliable, other)
  • User (type) Edit Percentage: The percentage of the edits that each classification of user has made on that page (reliable, other)
  • Classification: The classification that Wikipedia users have given to the article according to the scale outlined in the WikiProjects
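The user-based page values above boil down to counting, as this sketch shows. It takes the list of editors of a page in revision order and a table of user labels. The method name `page_user_features` and the feature key names are made up for the example, and treating unlabelled users as ‘other’ is an assumption. Link Count, Revert Count and Classification come from other data and are left out.

```ruby
# revision_users: editors of one page, one entry per revision.
# user_labels: { user => :reliable or :other }.
# Users without a label are treated as :other (an assumption).
def page_user_features(revision_users, user_labels, types: %i[reliable other])
  raise ArgumentError, 'page has no revisions' if revision_users.empty?

  users = revision_users.uniq
  features = { edit_count: revision_users.size, user_count: users.size }
  types.each do |type|
    users_of_type = users.count { |u| user_labels.fetch(u, :other) == type }
    edits_of_type = revision_users.count { |u| user_labels.fetch(u, :other) == type }
    features[:"#{type}_count"] = users_of_type
    features[:"#{type}_percentage"] = 100.0 * users_of_type / users.size
    features[:"#{type}_edit_percentage"] = 100.0 * edits_of_type / revision_users.size
  end
  features
end
```

For a page edited by alice, bob, alice and carol, where only alice is labelled reliable, a third of the users are reliable but they account for half of the edits. That gap between the two percentages is the kind of signal these features are meant to capture.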

We then used machine learning on this labelled dataset to see if we could predict the quality of a page. The results were very promising, with Weka correctly predicting the quality of a page 97% of the time, a very successful result in the end.

As you may have guessed, I’ve skimmed over large parts of the internship, so these are very much the cliff notes. However, if you want to read the full version, you can download the paper that I wrote about the project. It’s a long paper, but if you do get through it I’d love to hear your thoughts.

As for the future, well, I didn’t get all the tests done that I wanted to do, nor did I get to put the algorithm to work on any of the unclassified pages on Wikipedia. Since I finished the internship I’ve been slowly improving the code, both in quality and in runtime, and I’ve been writing tests. In the next few weeks I plan on running some of the analysis that I didn’t get to do during the internship. Hopefully I’ll get it good enough to get a paper published on the topic and start to make a name for myself. :)

If you want to keep up to date on the latest code, feel free to follow PageAnalyzer on GitHub.