30,000 Web Pages

1.29GB

Scraped Text
Mixed Character Sets

We examine the HTML found on the front pages of 30,000 web sites, scraped from domains extracted from the .com zone file.

The scraper wrote records with two fields: first the domain name, then the homepage retrieved from that domain.

record = domain homepage?
domain = < ch+ > {bind(0)} gs
ch = !gs !eor .
gs = '\035'
eor = '\036\036\036\036'

When things went wrong, the scraper may or may not have written some descriptive information in place of the homepage.

homepage = << ( html-document | other-document ) >> eor
other-document = << ch+ >>
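The record layout described by these rules can also be read outside the grammar. What follows is a minimal Python sketch, not the parser used for the results here. It works on bytes because the pages use mixed character sets. The name parse_records is hypothetical, and the sketch assumes that no document contains the gs byte, which the grammar guarantees for other-document but not necessarily for html-document.

```python
GS = b"\x1d"        # gs  = '\035' (octal), the ASCII group separator
EOR = b"\x1e" * 4   # eor = '\036\036\036\036', four record separators


def parse_records(data: bytes):
    """Return (domain, homepage) pairs; homepage is None when absent.

    Assumption: a gs byte seen before the next eor means the homepage
    field was omitted and the next domain has already started.
    """
    records = []
    pos = 0
    while pos < len(data):
        gs_at = data.find(GS, pos)
        if gs_at <= pos:
            # -1: no further domain field; == pos: empty domain (ch+ fails)
            break
        domain = data[pos:gs_at]
        pos = gs_at + 1

        eor_at = data.find(EOR, pos)
        next_gs = data.find(GS, pos)
        if eor_at == -1 or (next_gs != -1 and next_gs < eor_at):
            # homepage? did not match; the next record starts at pos
            records.append((domain, None))
            continue

        records.append((domain, data[pos:eor_at]))
        pos = eor_at + len(EOR)
    return records


sample = (
    b"example.com" + GS + b"<html>hi</html>" + EOR
    + b"nopage.com" + GS
    + b"error.com" + GS + b"timeout" + EOR
)
for domain, homepage in parse_records(sample):
    print(domain, homepage)
```

The sample data is invented for illustration. Deciding whether a homepage is an html-document or an other-document is left to the grammar.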

Results

real	1m43.665s
user	1m41.920s
sys	0m0.818s

homepage other document 41,594
/root/ 41,594

Refinement

Continue matching html-documents.

Match start tags and their arguments, ignoring text and end tags.
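As a rough illustration of this refinement, the sketch below collects start tags and their arguments while skipping text and end tags. It swaps in Python's standard html.parser for the grammar rules, so it is not the approach taken on this page. The names StartTagCollector and start_tags are hypothetical, and the input is assumed to have been decoded to text already.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record (tag, attributes) for each start tag; ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        self.tags.append((tag, attrs))

    # handle_data and handle_endtag keep their default no-op behavior,
    # so text and end tags are skipped.


def start_tags(html_text: str):
    collector = StartTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


page = '<html><body class="x"><a href="/a">text</a><br/></body></html>'
for tag, attrs in start_tags(page):
    print(tag, attrs)
```

Given the mixed character sets in the scrape, one option is to decode each homepage with errors="replace" before feeding it to the collector, accepting that some characters will be lost.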

Match the scraper error messages left in place of documents.