The HTML5 Parsing Algorithm (webkit.org)
85 points by sant0sk1 on Aug 5, 2010 | 53 comments


Yay! Also coming in Firefox 4: http://hacks.mozilla.org/2010/05/firefox-4-the-html5-parser-...

and IE9: http://blogs.msdn.com/b/ie/archive/2010/03/16/html5-hardware...

Unfortunately the new parsing algorithm breaks my online banking site: http://bugzil.la/565689


Ugh, their response:

This is an intentional change. Prior to the HTML5 parsing algorithm, browsers backtracked and reparsed when seeing an EOF inside a script to deal with </script> inside an inline script. This means that prior to HTML5, an accidental or maliciously forced premature end of file could change the executability properties of pieces of an HTML file.

The current magic in the spec was carefully designed and researched to permit forward-only parsing in a maximally Web-compatible way. The solution in the spec was known to break a handful of pages among lots and lots of pages listed by dmoz, but the breakage was deemed negligible.

This is probably the highest-risk change in the HTML5 parsing algorithm, and this is the first report of it breaking an "important" contemporary site. Considering that keeping the forward-only tokenization behavior is highly desirable, in the absence of evidence of more important breakage, I'm treating this as an evang issue realizing that further evidence may force us to revisit this part of the spec. But let's try to get away with forward-only parsing!

What is the logic behind this? Just because we want "forward-only parsing", we're going to try to force people not to write </script> inside of a Javascript string? That seems rather silly.

Are there any benefits to forward-only parsing besides speed? Because if this change was done in the name of performance, then we really should re-evaluate who exactly is proposing these types of changes and why.


browsers backtracked and reparsed when seeing an EOF inside a script to deal with </script> inside an inline script

Correct me if I'm wrong, but it seems to me that in order to detect that the </script> is inside an inline script, you'll have to parse the JavaScript. Requiring every HTML5 parser to include a JavaScript parser in order to parse it correctly seems a little too much for me.

Another reason: code complexity. Backtracking makes everything a lot more complex which is the exact opposite of what they're trying to achieve with the HTML5 parsing algorithm. Less code = fewer bugs, easier to understand and easier to implement.
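To make the forward-only behaviour concrete, here is a minimal sketch (a hypothetical helper, not WebKit's or Gecko's actual code, and ignoring the spec's escaped-script-data states for `<!-- ... -->` inside scripts): scanning strictly forward, script text ends at the first case-insensitive `</script` that is immediately followed by whitespace, `/` or `>`, with no JavaScript parsing and no backtracking.

```javascript
// Forward-only scan for the end of "script data", per a simplified reading of
// the HTML5 tokenizer: a case-insensitive "</script" only counts if the next
// character is whitespace, "/" or ">", so "</scripty>" does NOT end the script.
function scriptDataEnd(src, from = 0) {
  const re = /<\/script(?=[\t\n\f />])/gi; // lookahead: next char decides
  re.lastIndex = from;
  const m = re.exec(src);
  return m ? m.index : -1; // -1: the script ran to EOF without a close tag
}

console.log(scriptDataEnd('var x = 1;</script>'));    // 10
console.log(scriptDataEnd('var s = "</scripty>";'));  // -1
console.log(scriptDataEnd('a</SCRIPT >'));            // 1
```

Note that a single regex pass over the input is all this takes; the pre-HTML5 reparse-on-EOF behaviour cannot be expressed this way.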


Forward only parsing is not only faster, it's less complicated which should translate to fewer bugs. It's worth noting that the HTML parser isn't used only when initially rendering the page, it's also used when rewriting the contents with JavaScript.

The bug report has a perfectly reasonable workaround you can use, and you can always put your JavaScript in a separate file or escape < >. It's a good idea to do this even with the current generation of browsers, because if you don't, you sometimes get strange bugs... often bugs that are not reproducible in all browsers.

Honestly, allowing unescaped </script> inside inline JavaScript wasn't a good idea to begin with.


ECMAScript has an escape sequence just for this occasion:

    <\/script>
Works everywhere (it's a shame so few people know about it and instead use the uglier and invalid "</sc"+"ript>").
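A tiny demonstration of why this works: in a JavaScript string literal, `\/` is simply `/`, so the escaped form yields exactly the same runtime string while keeping the literal character sequence `</script` out of the raw source that the HTML tokenizer sees.

```javascript
// "\/" in a string literal is just "/": same runtime value, but the raw
// source no longer contains "</script", so the HTML tokenizer never sees
// a premature end tag inside the inline script.
const escaped = '<\/script>';
console.log(escaped === '</script>'); // true
console.log(escaped.length);          // 9
```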


Why is that invalid?


http://www.w3.org/TR/html4/types.html

"The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element's content."

Of course browsers never implemented this correctly.
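As a concrete reading of the rule quoted above: a hypothetical strictly-HTML4-conforming parser would end the element's CDATA content at the very first `</`, even in the middle of a JavaScript string literal.

```javascript
// Hypothetical strictly-HTML4 behaviour: CDATA content (e.g. a script body)
// ends at the first "</", wherever it occurs -- even inside a string literal.
function html4CdataEnd(src) {
  return src.indexOf('</'); // -1 if the script runs to EOF
}

console.log(html4CdataEnd('var tag = "</b>";')); // 11 -- ends mid-string!
console.log(html4CdataEnd('alert(1)'));          // -1
```

Under that rule, even harmless markup like `"</b>"` in a string would terminate the script, which is part of why no browser ever implemented it as written.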


"Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code."

Anybody else think that if your markup language requires a 10k line parser, somebody took a wrong turn at the complexity vs simplicity fork in the road?


Pay attention to this part:

  All browsers that implement the HTML5 parsing algorithm
  should parse HTML the same way, which means your web page
  should parse the same way in Firefox 4 and the WebKit
  nightly, even if it contains invalid markup.
Parsing perfect HTML5 would be easy, but one of the features of the HTML5 spec is that it does what no spec did before: it defines exactly how parsing should work, even in the case of invalid markup. Also, the parser has to deal with deprecated elements no longer in the spec (such as the infamous <font>). I assume most of the work went into this "how to parse tag soup" part.
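A toy illustration of what "defined parsing for invalid markup" means in practice (this is not the real tree-construction algorithm, just one of its recovery rules in miniature): the spec says a new `<p>` start tag implies an end tag for a `<p>` that is still open, so misnested input like `<p>a<p>b` yields the same well-defined two-paragraph tree in every conforming browser.

```javascript
// Toy sketch of one HTML5 recovery rule: a <p> start tag implies </p> for an
// already-open paragraph, and EOF implies </p> too. Real browsers serialize
// "<p>a<p>b" as "<p>a</p><p>b</p>" for exactly this reason.
function parseParagraphs(src) {
  const out = [];
  let current = null; // text of the paragraph currently open, or null
  for (const piece of src.split(/(<\/?p>)/)) {
    if (piece === '<p>') {
      if (current !== null) out.push(current); // implied </p>
      current = '';
    } else if (piece === '</p>') {
      if (current !== null) { out.push(current); current = null; }
    } else if (piece && current !== null) {
      current += piece;
    }
  }
  if (current !== null) out.push(current); // implied </p> at EOF
  return out;
}

console.log(parseParagraphs('<p>a<p>b'));        // [ 'a', 'b' ]
console.log(parseParagraphs('<p>a</p><p>b</p>')); // [ 'a', 'b' ]
```

The real spec defines dozens of such rules (implied end tags, foster parenting, the adoption agency algorithm), which is where most of those 10,000 lines go.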


Deciding that invalid markup should work would be the wrong turn I alluded to.

It would have been very easy to drop all the back compat bs by saying "An HTML5 document is one that begins with the 6 bytes '<html5' and if the document is invalid, reject it. Anything else, parse however you want." Browsers that support HTML5 add text/html5 to the Accept header.


Alas, your six bytes would drop older browsers into quirks mode, which is not nice. The only reason HTML5 has a doctype at all is because it forces browsers into standards mode. What you propose sounds a bit like the way XHTML2 was intended to go, and we know how that ended. There are and will be tons of invalid documents on the web (the vast majority of it), so browsers rejecting them has no future; the blame will be on the browser, not the authors. Having at least some consistency in dealing with invalid documents is A Good Thing™.


So, how many different parsers did you want to see in the browser again?

You'd still need to support HTML4 somewhere. Supporting HTML5 separately just means duplicating the common parts of the parser. The simplicity boat has already sailed.


Addition is simpler than combination.

    HTML4 + HTML5 < HTML4 * HTML5


They're extremely similar. You don't get an explosion of code size. Unless you're suggesting that HTML5 should also be radically different syntactically as well?


I'm suggesting that a strict HTML5 parser that doesn't have complex recovery code is radically simpler than one that does.


I'm suggesting that an HTML4 parser that has complex recovery code, plus a mostly-copy-and-paste HTML5 parser that doesn't, isn't a big win.

Browsers are going to have a complex, ugly, recovery-enabled parser in them either way, and the effort to add HTML5 to the recovery-enabled parser isn't very big, comparatively speaking.


Not really true in this case. WebKit's previous HTML parser is similar in complexity to the HTML5 parser. Adding a second HTML parser would have been more code, more complexity, and a more complex test matrix.


Complexity does not increase linearly with the number of lines of code. 110k lines can easily be 5x as complex as 100k lines of code.


1 + 1 > 1 * 1


The problem with this approach is the same problem people serving XHTML as application/xhtml+xml ran into: You don't control your page any more these days.

If you allow people to leave comments on your page or if you are serving ads, you lose a bit of control in what gets put on a particular page of yours.

In case of comments you could try and sanitize them, but with ads that's hardly possible.


I tend to think that very few things that see widespread real-world use are "simple" by any reasonable metric.

The basic idea may be simple, but by the time you have covered a reasonable fraction of real-world use cases and error conditions for a sizeable population, the resulting code is anything but.

And 10k lines still counts as "moderately simple" - 10k lines is well within scope for one developer working alone.


This must be your first time hearing about HTML, welcome!

Just imagine how large "quirks mode" must be.


You make it sound like it was a conscious decision. Most of the complexity comes from the need to tolerate the janky markup that older browsers processed in a reasonable way.


Just for kicks, I am going to try and implement these two algorithms using the State Machine Compiler (http://smc.sourceforge.net/). I expect that my "code" will be much less than 10,000 lines, although I don't really know how long the code SMC itself generates will be.


Is there any advantage to SMC vs. Ragel for parsing? The code looks similar but a bit larger.


Heh, just what I need, a new toy to play with!

I spent a few hours looking into Ragel over the weekend, and, honestly, I can't envisage how an HTML5 parser mostly written in Ragel would look. But I'm going to give it a crack.


What's the magic number of LOC that any markup language should be implemented under?


To put 10k in perspective, that's substantially more than the pcc C compiler front end, and closing in on the compiler's total line count.

When I consider what I can accomplish using C vs what I can accomplish using HTML, that doesn't appear to be the expected result.


C compilers don't have to deal with users that expect purely random garbage thrown at the parsers to have a "valid" parsing that's consistent across every implementation. Unfortunately, making a language simple to use by not requiring that users get everything right makes it very hard to implement.


Considering the number of people that can accomplish anything useful with HTML vs the number of people that can accomplish anything useful with C (or even XML), it's well in line with my expectations.

It's fairly easy to get a simpler implementation by shifting complexity from the parser to the user. But from a "total society productivity" standpoint, having a handful of developers spend a few months to implement a resilient parser is a cheap price to pay to have umpteen million people save hours of debugging for every page they create.


Back when HTML was invented, it was based on SGML. In retrospect, this was a fundamental design flaw. SGML has all sorts of fillips that were intended to make documents easy to type (back in the pre-GUI era) but which in practice make them hard to parse. SGML parsers, even parsers built into expensive commercial software, were famous for not reliably interoperating. The HTML5 spec abandons SGML entirely because, as it turned out, even those fillips of SGML did not make it possible for browsers to consistently agree on how to parse HTML in the wild.

It would have been better for everyone if HTML had used a bog-simple Lisp-like format from day one, but without a time machine we’re stuck with these kludges of history.


Most markup languages (indeed most languages) don't have error discovery and recovery built in and specified.

HTML5 does, and most of the spec (and probably the parser) is about dealing with error cases.

Plus it's not like XML parsers are small. And HTML parsing is a far more complex affair (due to not just blowing up on error).


Having this complexity ensures that nobody will ever endeavor to write any more browser engines, thus ensuring that there will be no more browser bugs. The rest of the job is fixing the behavior of the already existing browsers, and then the web will be perfect.


HTML already tried the "let's switch to a simple parser" route, and failed:

http://dig.csail.mit.edu/breadcrumbs/node/166

Now HTML5 has to deal with two parsers.


I would prefer that the parser just failed when invalid code was encountered, personally I saw this as a strong point of XHTML.


One of the nice things about the web is that all websites still work. Here is the first webpage: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/T... That one wouldn’t work with a strict parser (look at the source: it uses the header tag instead of the head tag).

Breaking half or more of all the websites out there wouldn’t exactly be what I would call the “spirit of the www”.


Use a new parser for new pages, the old parser for old pages. The introduction of H.264 didn't suddenly render all older formats unplayable. Why can't a new markup format coexist with older formats, too?


What would be the advantage of that? Also, you then never truly can ditch the old parser if you want to keep everything working. Your browser has to come with two parsers for all eternity.


How many codecs does your media player come with?


Not enough because there always seems to be this one file it absolutely cannot play ;)

It’s possible to do it but it’s also needlessly complex while giving you very little in return. The biggest, most problematic flaw of allowing browsers to parse invalid code, namely that different browsers might handle failure differently, is in the process of being fixed and that’s good enough, I think.

Strict parsing has been tried and it was a resounding failure. It’s not going to happen.


Web browsers already have this, and it sucks. There's XML mode, HTML standards mode, HTML quirks mode and HTML almost standards mode. It's horrible to test and maintain.


1. How do you discriminate?

2. What would the point be, apart from annoying every single end user?


Maybe a content-type header from the server? The point, as I may wrongly see it, is that attempting to support backward compatibility indefinitely will result in overly complex and bloated browsers in the future.


Nobody sets headers and meta information correctly anyway. Those headers cannot be trusted, and browsers have to guess what content-type and character encoding should actually apply to any particular document.


You have to realize that what you are proposing is simply a bad idea, even if everything is trying to be pure and perfect:

http://diveintomark.org/archives/2004/01/14/thought_experime...


I think that article may have seriously changed my mind.


You can still write the HTML5 vocabulary as XHTML if you really want to. That's perfectly valid, and defined in the spec.

The HTML5 parsing algorithm is what the browser vendors want and need, however. Since HTML was never specified properly - not even in HTML4, where large areas were just undefined - HTML parsers have slowly evolved through trial and error. If big sites depend on a particular behaviour in a particular browser, other browsers have tried to be bug-compatible with that browser. Browsers that failed to do so have lost market share. A browser that decided to halt on encountering anything invalid would lose all of its users instantly, since the vast majority of documents are invalid.

Parsers have slowly converged towards each other, and the HTML5 algorithm is the compromise between them that breaks the least amount of content.

In other words, the goal of the HTML5 parsing spec was never to make something nice and clean. The goal was to define how to parse the unholy mess all you web developers out there have created during the past two decades. A clear spec, a good test suite and exactly identical behaviour between browsers is a benefit for everyone.


I agree. But: Why do we need exactly identical behaviour? Wasn't HTML intended to leave lots of discretion for rendering decisions to the browser?


If you parse a string of bytes, it's nice if it always turns into the same DOM. That's the scope of the HTML5 parser. There is no benefit to inconsistencies here.

Allowing browsers on different platforms to adjust page widths, font sizes, and whether or not to load images and plugins in accordance with the user's preferences is fine, but that's an entirely different issue.


Thanks for the clarification.


Hmmm. I suspect that would break 90% of web pages on the web. On the other hand, people would make their pages valid pretty quick. :-)


What about pages that are not maintained but still carry value?


No. Browsers that implemented this would just lose their market share instantly. Users prefer being able to use their favourite sites.



