3,500,000 Articles

We retrieved the English Wikipedia current-articles-only XML dump sometime in 2011. We extract the title and text elements for each article. We'll develop a variety of useful parsers, understanding that there is no reliable spec for the markup. wikipedia

dump-xml = page-tag | .
page-tag = '<page>' << ( specific-tag | !'</page>' . )* >> '</page>'
specific-tag = title-tag | text-tag
title-tag = '<title>' << ( !'<' . )+ >> '</title>'
text-tag = '<text xml:space="preserve">' << wikitext >> '</text>'
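The title-tag and text-tag productions pick out the only two elements we care about. As a rough illustration of what they select, here is a minimal sketch (TypeScript on Node) that streams the dump and pulls out the same elements with plain string matching. The file name is hypothetical, and the real extraction is driven by the grammar above rather than regular expressions.

import * as fs from "fs";
import * as readline from "readline";

// Stream the dump a line at a time and report the wikitext found for each
// article. This only approximates what title-tag and text-tag match.
async function extract(dumpPath: string) {
  const rl = readline.createInterface({
    input: fs.createReadStream(dumpPath),
    crlfDelay: Infinity,
  });

  let title = "";
  let inText = false;
  let text: string[] = [];

  for await (const line of rl) {
    const t = line.match(/<title>([^<]+)<\/title>/);
    if (t) title = t[1];

    if (line.includes('<text xml:space="preserve">')) inText = true;
    if (inText) text.push(line);
    if (line.includes("</text>")) {
      inText = false;
      console.log(title, text.join("\n").length, "characters of wikitext");
      text = [];
    }
  }
}

extract("enwiki-pages-articles.xml").catch(console.error);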

grammar: 35 productions, dataset: 13.56 GB

Strategy

See 30,000 Web Pages for a mockup of how this might work. For this dataset we're actually going to manage the remote job from this wiki page.

Create a Parse plugin that initiates a remote job based on dataset information in its text field and any Code plugins it finds looking left.
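A sketch of how that initiation might look, assuming a simple item and page shape and a hypothetical /jobs endpoint; none of these names come from the actual plugin.

// Collect Code items from pages left of the Parse plugin in the lineup and
// post them, with the dataset description, to the remote job server.
interface Item { type: string; text: string }
interface Page { title: string; story: Item[] }

function codeLookingLeft(lineup: Page[], pageIndex: number): string[] {
  return lineup
    .slice(0, pageIndex)
    .flatMap(page => page.story)
    .filter(item => item.type === "code")
    .map(item => item.text);
}

async function initiateJob(lineup: Page[], pageIndex: number, parseItem: Item) {
  const job = {
    dataset: parseItem.text,                      // dataset information from the text field
    grammar: codeLookingLeft(lineup, pageIndex),  // Code plugins found looking left
  };
  const res = await fetch("http://parse.example.com/jobs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(job),
  });
  return res.json() as Promise<{ run: string; socket: string }>; // run id for later reconnects
}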

With each initiation, add a copy of the Parse plugin to the current page. Store sufficient run information so that a WebSocket connection can be reestablished to capture live status of the run.
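A sketch of the reconnect idea, assuming the run id and socket address returned at initiation are kept with the copied Parse item; the field names and status message shape are assumptions.

interface RunInfo { run: string; socket: string; started: string }

// Keep the run information with the item so it is persisted with the page
// and survives a reload.
function rememberRun(item: { run?: RunInfo }, info: RunInfo) {
  item.run = info;
}

// Reopen the status socket for a remembered run and retry if it drops.
function reconnect(info: RunInfo, onStatus: (status: unknown) => void): WebSocket {
  const ws = new WebSocket(`${info.socket}?run=${info.run}`);
  ws.onmessage = event => onStatus(JSON.parse(event.data));
  ws.onclose = () => setTimeout(() => reconnect(info, onStatus), 5000);
  return ws;
}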

The Parse plugin shows progress and offers to open SVG parse charts as ghost pages that track progress and offer sample matches for each arc. These samples show up as an additional ghost page filled with Record objects that can outlive a parse.
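One way the sample-match ghost page could be assembled, assuming a Record-like item shape; the shapes and names here are illustrative, not the plugin's actual data model.

interface RecordItem { type: "record"; id: string; text: string }
interface GhostPage { title: string; story: RecordItem[] }

// Build a ghost page with one Record per grammar arc, each holding a few of
// the sample strings that matched that arc during the run.
function samplesPage(runId: string, samples: Map<string, string[]>): GhostPage {
  const story: RecordItem[] = [];
  for (const [arc, matches] of samples) {
    story.push({
      type: "record",
      id: `${runId}-${arc}`,
      text: `${arc}\n${matches.slice(0, 5).join("\n")}`,
    });
  }
  return { title: `Parse Samples for ${runId}`, story };
}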

Aside: We're exploring this interaction by creating a mock Parse plugin that just pretends to launch jobs on the remote server. So far it feels a lot like any desktop, except this one has history. github