We look for start tags, possibly with arguments, and complete the parse when we find them. We then observe how arguments are used in specific cases.
html-document = << html-markup+ >>
html-markup   = tag | end-tag | other-text | other-char
tag           = << ( familiar-tag | other-tag ) >>
end-tag       = << '</' [a-zA-Z]+ '>' >>
tag-arguments = << (!'>' ch)+ >>
other-tag     = << '<' [a-zA-Z]+ tag-arguments? '>' >>
other-char    = << ch >>
other-text    = << '<'* (!'<' ch)+ >>
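The grammar above can be approximated with ordinary regular expressions to see which rule claims each piece of input. The following is a minimal Python sketch, not the parser used for the measurements. It assumes that ch means any single character (newlines included), it leaves out familiar-tag because that rule is not defined in this section, and the names MARKUP and count_rules are made up for illustration.

```python
import re
from collections import Counter

# One alternative per grammar rule, tried in the same order as html-markup:
# tag (here only other-tag), end-tag, other-text, other-char.
# Assumption: `ch` is any character, so DOTALL lets "." match newlines too.
MARKUP = re.compile(
    r"(?P<other_tag><[a-zA-Z]+(?P<tag_arguments>[^>]+)?>)"
    r"|(?P<end_tag></[a-zA-Z]+>)"
    r"|(?P<other_text><*[^<]+)"
    r"|(?P<other_char>.)",
    re.DOTALL,
)

RULES = ("other_tag", "end_tag", "other_text", "other_char")


def count_rules(html):
    """Count how often each rule matches while scanning html left to right."""
    counts = Counter()
    for match in MARKUP.finditer(html):
        # Exactly one top-level alternative matched; find out which one.
        for rule in RULES:
            if match.group(rule) is not None:
                counts[rule] += 1
                break
        # tag-arguments is optional and only occurs inside other-tag.
        if match.group("tag_arguments") is not None:
            counts["tag_arguments"] += 1
    return counts


if __name__ == "__main__":
    sample = '<a href="/root/">home</a><br><'
    for rule, count in sorted(count_rules(sample).items()):
        print(rule, count)
```

Because other-char accepts any character, the scan never skips input: a lone '<' at the end of the sample is counted as other-char, which is the role that rule plays in the grammar.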
Results
real    2m4.497s
user    2m3.299s
sys     0m0.900s
html document
    other text                  18,342,285
    tag                         15,302,451
    end tag                     12,141,210
    other char                          22
other tag, tag arguments        11,608,623, 15,302,451
homepage                            41,594
/root/                              41,594
Refinement
Continue matching familiar-tags.
Tags for Dynamic Content managed with scripts.
Tags for Tables as used for formatting.
Tags for Images large and small.
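One way to picture this refinement is to sort start tags into the three groups listed above before falling back to other-tag. The sketch below is only an illustration in Python. The tag names assigned to each group (script, table, tr, td, th, img), the FAMILIAR_TAGS table and the classify_tags function are assumptions, not the familiar-tag rules of the actual grammar.

```python
import re
from collections import Counter

# Hypothetical assignment of tag names to the groups named above.
FAMILIAR_TAGS = {
    "script": "dynamic content",
    "table": "table",
    "tr": "table",
    "td": "table",
    "th": "table",
    "img": "image",
}

# Same shape as other-tag: '<', a run of letters, optional arguments, '>'.
START_TAG = re.compile(r"<([a-zA-Z]+)[^>]*>")


def classify_tags(html):
    """Count start tags per familiar group; anything else is an other tag."""
    counts = Counter()
    for match in START_TAG.finditer(html):
        name = match.group(1).lower()
        counts[FAMILIAR_TAGS.get(name, "other tag")] += 1
    return counts


if __name__ == "__main__":
    page = '<table><tr><td><img src="a.png"></td></tr></table><script>x()</script><p>'
    print(classify_tags(page))
```

End tags are not counted here because the pattern requires a letter directly after '<'.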