Andrew Lih asked if anyone had done research on the quantity and distribution of videos on Wikipedia. We responded by parsing for File: links, video templates and a special category.
We adapted an existing dump-format parser and ran it on 2011 data left over from our earliest work. This showed that we could get good information from article pages alone.
We downloaded a new dump, which had grown from 13Gb to 40Gb over the intervening years, this time to our everyday laptop, which has a solid-state drive. Repeating the parse showed that the number of videos had grown dramatically.
Tally of video file formats found in recent Wikipedia dump.
For both runs we captured the page name, the (last) video name, and the presence of the has-video-clip category as a text file of three columns. The file grew from 700 rows to 4000 rows between runs.
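As a rough illustration of the three-column capture, here is a minimal Python sketch. It is not our parser. The list of video extensions, the category name passed as a default, and the function name video_row are all assumptions made for the example, and it looks only at File: links and the category, not at video templates.

```python
import re

# Assumed video extensions; note that .ogg can also be audio-only.
VIDEO_EXTENSIONS = ("ogv", "webm", "ogg")

# Matches [[File:Name.ext|...]] or [[Image:Name.ext]] for the extensions above.
FILE_LINK = re.compile(
    r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+?\.(?:%s))\s*(?:\||\]\])"
    % "|".join(VIDEO_EXTENSIONS),
    re.IGNORECASE,
)


def video_row(title, wikitext, category="Articles containing video clips"):
    """Return a tab-separated row for a page, or None if it has no video.

    The category name is an assumption; pass whatever the dump actually uses.
    """
    videos = [m.group(1).strip() for m in FILE_LINK.finditer(wikitext)]
    category_link = re.compile(
        r"\[\[\s*Category\s*:\s*%s\s*(?:\||\]\])" % re.escape(category),
        re.IGNORECASE,
    )
    has_category = category_link.search(wikitext) is not None
    if not videos and not has_category:
        return None
    # Keep only the last video on the page, as in our text file.
    last_video = videos[-1] if videos else ""
    return "\t".join([title, last_video, "yes" if has_category else "no"])
```

Calling video_row once per article page and writing the non-None results to a file would give one row per page that has a video link or the category.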
While tuning our parser we discovered a record 18Gb into the dataset that would exceed a fixed recursion limit in our parser. For some limits we discard the record and resume. We should do that for all limits.
We should also consider a start-over-in-the-middle feature that could retry parser improvements on a troublesome record. This could be automatic when a parse doesn't complete. We could back up to the last sediment or write the position of the last record parsed on failure.
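The skip-and-resume idea might look something like the following Python sketch. The names LimitExceeded, parse_all and resume_point are hypothetical, and for simplicity the checkpoint stores a record index where a real dump reader would more likely store a byte offset into the file.

```python
import json
import os


class LimitExceeded(Exception):
    """Hypothetical error a parser raises when any fixed limit is hit."""


def parse_all(records, parse, checkpoint_path, start=0):
    """Parse records from index `start`, discarding any that hit a limit.

    On each failure the position of the failing record is written to
    `checkpoint_path` so a later run can start over from that record.
    """
    skipped = []
    for index, record in enumerate(records):
        if index < start:
            continue
        try:
            parse(record)
        except (LimitExceeded, RecursionError):
            # Discard the record, note where it was, and resume.
            skipped.append(index)
            with open(checkpoint_path, "w") as f:
                json.dump({"last_failed": index, "skipped": skipped}, f)
    return skipped


def resume_point(checkpoint_path):
    """Return the index to restart from, or 0 if no failure was recorded."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["last_failed"]
```

After improving the parser, a retry could call parse_all with start=resume_point(checkpoint_path) to begin at the troublesome record and skip everything before it.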
Futures
Andrew says, I suppose using the API, it's not too hard to analyze the categories for each of the articles that contain video, and see what trends there are. From eyeballing it, it seems lots of the articles are space, science and flight related. Not a surprise given many of these may be PD NASA videos, or expired copyright.
I think right now we'd be satisfied with a lay of the land of what's out there, so ideally we could discover:
What category of articles have the most videos?
Who are the top producers of videos (usernames, that is)?
What is the average length of the videos? What does a histogram look like if we plot the lengths of videos?
How many articles have more than one video?
How many videos are used in more than one article?
If a video shows up in en: article X, does it also show up in the interwiki versions (other-language versions of that article)?
Within videos:
How many contain a language-specific audio track (narration, interview, dialogue, etc)?
How many edits/cuts are in the video?
What is the most popular resolution of videos?
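Two of the questions above, how many articles have more than one video and how many videos are used in more than one article, could be answered from a list of (article, video) pairs. The sketch below assumes such a list exists; our current text file keeps only the last video per page, so the parse would first have to emit every video. The function name reuse_counts is hypothetical.

```python
from collections import defaultdict


def reuse_counts(pairs):
    """Count multi-video articles and multi-article videos.

    `pairs` is an iterable of (article, video) tuples, an assumed input
    format. Returns (articles with more than one video,
    videos used in more than one article).
    """
    videos_by_article = defaultdict(set)
    articles_by_video = defaultdict(set)
    for article, video in pairs:
        videos_by_article[article].add(video)
        articles_by_video[video].add(article)
    multi_video_articles = sum(
        1 for videos in videos_by_article.values() if len(videos) > 1
    )
    shared_videos = sum(
        1 for articles in articles_by_video.values() if len(articles) > 1
    )
    return multi_video_articles, shared_videos
```

Sets are used so that a video linked twice from the same article is counted once.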