1.29GB
Scraped Text
Mixed Character Sets
We examine the HTML found on the front pages of 30,000 websites scraped from domains extracted from the .com zone file.
The scraper wrote records with two fields: first the domain name, then the homepage retrieved from that domain.
record = domain homepage?
domain = < ch+ > {bind(0)} gs
ch = !gs !eor .
gs = '\035'
eor = '\036\036\036\036'
When things went wrong, the scraper may or may not have written some descriptive information in place of the homepage.
homepage = << ( html-document | other-document ) >> eor
other-document = << ch+ >>
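The record layout described by these rules can be sketched outside the grammar. The following Python is only an illustration, not the scraper's or the parser's actual code. The names GS, EOR and parse_records are hypothetical, and it assumes a homepage never contains the gs byte, which holds for other-document but is not confirmed here for html-document.

```python
GS = b"\x1d"        # gs  = '\035' (octal), the group separator
EOR = b"\x1e" * 4   # eor = '\036\036\036\036', four record separators


def parse_records(data: bytes):
    """Split raw scraper output into (domain, homepage) pairs.

    homepage is None when the record has no homepage field.
    Assumption: no homepage contains the GS byte.
    """
    records = []
    pos = 0
    while pos < len(data):
        # domain = ch+ gs : one or more bytes up to the next GS
        gs_at = data.find(GS, pos)
        if gs_at <= pos:
            raise ValueError(f"expected a domain at offset {pos}")
        domain = data[pos:gs_at]
        if EOR in domain:
            raise ValueError(f"end-of-record marker inside domain at offset {pos}")
        pos = gs_at + 1

        # homepage? = ch+ eor : present only if EOR comes before the next GS
        next_gs = data.find(GS, pos)
        next_eor = data.find(EOR, pos)
        homepage = None
        if next_eor > pos and (next_gs == -1 or next_eor < next_gs):
            homepage = data[pos:next_eor]
            pos = next_eor + len(EOR)
        records.append((domain, homepage))
    return records
```

Working on bytes rather than decoded text sidesteps the mixed character sets until a homepage is actually inspected.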
Results
real 1m43.665s
user 1m41.920s
sys 0m0.818s
Refinement
Continue matching html-documents.
Match start tags and their arguments, but ignore text and end tags.
Scraper error messages left in place of documents.
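As a rough picture of what keeping start tags and arguments while ignoring text and end tags produces, here is a sketch that uses Python's standard html.parser module in place of the grammar. It is a swapped-in technique for illustration only. StartTagCollector and start_tags are hypothetical names.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag with its attributes; ignore text and end tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lowercased by HTMLParser; attrs is a list of (name, value) pairs
        self.start_tags.append((tag, attrs))


def start_tags(html_text: str):
    collector = StartTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.start_tags
```

A homepage read as bytes would first need decoding, for example with homepage.decode("utf-8", errors="replace"), since the scraped pages use mixed character sets.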