Start Tags

We look for start tags, possibly with arguments, and complete the parse when we find them. We then observe how the arguments are used in specific cases.

html-document = << html-markup+ >>
html-markup = tag | end-tag | other-text | other-char
tag = << ( familiar-tag | other-tag ) >>
end-tag = << '</' [a-zA-Z]+ '>' >>
tag-arguments = << (!'>' ch)+ >>
other-tag = << '<' [a-zA-Z]+ tag-arguments? '>' >>
other-char = << ch >>
other-text = << '<'* (!'<' ch)+ >>
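As a rough cross-check of the grammar, its ordered choice can be approximated with a regular expression in Python. This is only a sketch, not the parser that produced the results: the function name count_markup and the counter keys are made up here, and familiar-tag is left out because the grammar does not define it yet.

```python
import re
from collections import Counter

# One alternative per html-markup choice, tried in the grammar's order:
# tag | end-tag | other-text | other-char.
MARKUP = re.compile(
    r"(?P<tag><[a-zA-Z]+(?P<args>[^>]+)?>)"   # other-tag with optional tag-arguments
    r"|(?P<end_tag></[a-zA-Z]+>)"             # end-tag
    r"|(?P<other_text><*[^<]+)"               # other-text
    r"|(?P<other_char>.)",                    # other-char: any single character
    re.DOTALL,
)

def count_markup(html):
    """Count how often each rule matches (hypothetical helper, for illustration)."""
    counts = Counter()
    for m in MARKUP.finditer(html):
        if m.group("tag") is not None:
            counts["tag"] += 1
            if m.group("args") is not None:
                counts["tag_arguments"] += 1
        elif m.group("end_tag") is not None:
            counts["end_tag"] += 1
        elif m.group("other_text") is not None:
            counts["other_text"] += 1
        else:
            counts["other_char"] += 1
    return counts

if __name__ == "__main__":
    sample = '<p class="x">Hi <b>there</b></p><'
    print(dict(count_markup(sample)))
```

On the sample string this counts two tags (one with arguments), two end tags, two runs of other text, and one other char for the trailing '<' that nothing else can consume.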

Results

real	2m4.497s
user	2m3.299s
sys	0m0.900s

html document
other text 18,342,285
tag 15,302,451
end tag 12,141,210
other char 22
other tag 15,302,451
tag arguments 11,608,623

homepage 41,594
/root/ 41,594
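The counts can be turned into ratios. The sketch below assumes that 15,302,451 is the other tag count and 11,608,623 the tag arguments count, which is the only assignment consistent with a grammar where tag-arguments appears at most once per other-tag. The variable names are illustrative.

```python
# Counts copied from the results; the other_tags / tag_arguments pairing is an assumption.
other_tags = 15_302_451
tag_arguments = 11_608_623
end_tags = 12_141_210

# Share of start tags that carry arguments.
args_share = tag_arguments / other_tags

# End tags per start tag; below 1 because some start tags are never closed.
end_ratio = end_tags / other_tags

print(f"tags with arguments: {args_share:.1%}")
print(f"end tags per start tag: {end_ratio:.3f}")
```

Roughly three out of four start tags carry arguments, which is why the arguments are worth examining in specific cases.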

Refinement

Continue matching familiar-tags.

Tags for Dynamic Content managed with scripts.

Tags for Tables as used for formatting.

Tags for Images large and small.