The HTML5 Parsing Algorithm (webkit.org)
85 points by sant0sk1 on Aug 5, 2010 | 53 comments


Yay! Also coming in Firefox 4: http://hacks.mozilla.org/2010/05/firefox-4-the-html5-parser-...

and IE9: http://blogs.msdn.com/b/ie/archive/2010/03/16/html5-hardware...

Unfortunately the new parsing algorithm breaks my online banking site: http://bugzil.la/565689


Ugh, their response:

This is an intentional change. Prior to the HTML5 parsing algorithm, browsers backtracked and reparsed when seeing an EOF inside a script to deal with </script> inside an inline script. This means that prior to HTML5, an accidental or maliciously forced premature end of file could change the executability properties of pieces of an HTML file.

The current magic in the spec was carefully designed and researched to permit forward-only parsing in a maximally Web-compatible way. The solution in the spec was known to break a handful of pages among lots and lots of pages listed by dmoz, but the breakage was deemed negligible.

This is probably the highest-risk change in the HTML5 parsing algorithm, and this is the first report of it breaking an "important" contemporary site. Considering that keeping the forward-only tokenization behavior is highly desirable, in the absence of evidence of more important breakage, I'm treating this as an evang issue realizing that further evidence may force us to revisit this part of the spec. But let's try to get away with forward-only parsing!

What is the logic behind this? Just because we want "forward-only parsing", we're going to try to force people not to write </script> inside of a Javascript string? That seems rather silly.

Are there any benefits to forward-only parsing besides speed? Because if this change was done in the name of performance, then we really should re-evaluate who exactly is proposing these types of changes and why.


browsers backtracked and reparsed when seeing an EOF inside a script to deal with </script> inside an inline script

Correct me if I'm wrong, but it seems to me that in order to detect that the </script> is inside an inline script, you'll have to parse the JavaScript. Requiring every HTML5 parser to include a JavaScript parser in order to parse it correctly seems a little too much for me.

Another reason: code complexity. Backtracking makes everything a lot more complex which is the exact opposite of what they're trying to achieve with the HTML5 parsing algorithm. Less code = fewer bugs, easier to understand and easier to implement.
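To make the forward-only behaviour concrete, here is a minimal sketch (a hypothetical helper, not WebKit's or Gecko's actual code, and ignoring the spec's escaped-script-data states for `<!-- ... -->` inside scripts): scanning strictly forward, script text ends at the first case-insensitive `</script` that is immediately followed by whitespace, `/` or `>`, with no JavaScript parsing and no backtracking.

```javascript
// Forward-only scan for the end of "script data", per a simplified reading of
// the HTML5 tokenizer: a case-insensitive "</script" only counts if the next
// character is whitespace, "/" or ">", so "</scripty>" does NOT end the script.
function scriptDataEnd(src, from = 0) {
  const re = /<\/script(?=[\t\n\f />])/gi; // lookahead: next char decides
  re.lastIndex = from;
  const m = re.exec(src);
  return m ? m.index : -1; // -1: the script ran to EOF without a close tag
}

console.log(scriptDataEnd('var x = 1;</script>'));    // 10
console.log(scriptDataEnd('var s = "</scripty>";'));  // -1
console.log(scriptDataEnd('a</SCRIPT >'));            // 1
```

Note that a single regex pass over the input is all this takes; the pre-HTML5 reparse-on-EOF behaviour cannot be expressed this way.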


Forward only parsing is not only faster, it's less complicated which should translate to fewer bugs. It's worth noting that the HTML parser isn't used only when initially rendering the page, it's also used when rewriting the contents with JavaScript.

The bug report has a perfectly reasonable workaround you can use, and you can always put your JavaScript in a separate file or escape < >. It's a good idea to do this even with the current generation of browsers, because if you don't, you sometimes get strange bugs... often bugs that are not reproducible in all browsers.

Honestly, allowing unescaped </script> inside inline JavaScript wasn't a good idea to begin with.


ECMAScript has an escape sequence just for this occasion:

    <\/script>
Works everywhere (it's a shame so few people know about it and instead use the uglier and invalid "</sc"+"ript>").
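A tiny demonstration of why this works: in a JavaScript string literal, `\/` is simply `/`, so the escaped form yields exactly the same runtime string while keeping the literal character sequence `</script` out of the raw source that the HTML tokenizer sees.

```javascript
// "\/" in a string literal is just "/": same runtime value, but the raw
// source no longer contains "</script", so the HTML tokenizer never sees
// a premature end tag inside the inline script.
const escaped = '<\/script>';
console.log(escaped === '</script>'); // true
console.log(escaped.length);          // 9
```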


Why is that invalid?


http://www.w3.org/TR/html4/types.html

"The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element's content."

Of course browsers never implemented this correctly.
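As a concrete reading of the rule quoted above: a hypothetical strictly-HTML4-conforming parser would end the element's CDATA content at the very first `</`, even in the middle of a JavaScript string literal.

```javascript
// Hypothetical strictly-HTML4 behaviour: CDATA content (e.g. a script body)
// ends at the first "</", wherever it occurs -- even inside a string literal.
function html4CdataEnd(src) {
  return src.indexOf('</'); // -1 if the script runs to EOF
}

console.log(html4CdataEnd('var tag = "</b>";')); // 11 -- ends mid-string!
console.log(html4CdataEnd('alert(1)'));          // -1
```

Under that rule, even harmless markup like `"</b>"` in a string would terminate the script, which is part of why no browser ever implemented it as written.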


"Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code."

Anybody else think that if your markup language requires a 10k line parser, somebody took a wrong turn at the complexity vs simplicity fork in the road?


Pay attention to this part:

  All browsers that implement the HTML5 parsing algorithm
  should parse HTML the same way, which means your web page
  should parse the same way in Firefox 4 and the WebKit
  nightly, even if it contains invalid markup.
Parsing perfect HTML5 would be easy, but one of the features of the HTML5 spec is that it does what no spec did before: it defines exactly how parsing should work, even in the case of invalid markup. Also, the parser has to deal with deprecated elements no longer in the spec (such as the infamous <font>). I assume most of the work went into this "how to parse tag soup" part.
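A toy illustration of what "defined parsing for invalid markup" means in practice (this is not the real tree-construction algorithm, just one of its recovery rules in miniature): the spec says a new `<p>` start tag implies an end tag for a `<p>` that is still open, so misnested input like `<p>a<p>b` yields the same well-defined two-paragraph tree in every conforming browser.

```javascript
// Toy sketch of one HTML5 recovery rule: a <p> start tag implies </p> for an
// already-open paragraph, and EOF implies </p> too. Real browsers serialize
// "<p>a<p>b" as "<p>a</p><p>b</p>" for exactly this reason.
function parseParagraphs(src) {
  const out = [];
  let current = null; // text of the paragraph currently open, or null
  for (const piece of src.split(/(<\/?p>)/)) {
    if (piece === '<p>') {
      if (current !== null) out.push(current); // implied </p>
      current = '';
    } else if (piece === '</p>') {
      if (current !== null) { out.push(current); current = null; }
    } else if (piece && current !== null) {
      current += piece;
    }
  }
  if (current !== null) out.push(current); // implied </p> at EOF
  return out;
}

console.log(parseParagraphs('<p>a<p>b'));        // [ 'a', 'b' ]
console.log(parseParagraphs('<p>a</p><p>b</p>')); // [ 'a', 'b' ]
```

The real spec defines dozens of such rules (implied end tags, foster parenting, the adoption agency algorithm), which is where most of those 10,000 lines go.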


Deciding that invalid markup should work would be the wrong turn I alluded to.

It would have been very easy to drop all the back compat bs by saying "An HTML5 document is one that begins with the 6 bytes '<html5' and if the document is invalid, reject it. Anything else, parse however you want." Browsers that support HTML5 add text/html5 to the Accept header.


Alas, your six bytes would drop older browsers into quirks mode, which is not nice. The only reason HTML5 has a doctype at all is because it forces browsers into standards mode. What you propose sounds a bit like the way XHTML2 was intended to go, and we know how that ended. There are and will be tons of invalid documents on the web (the vast majority of it), so browsers rejecting them has no future; the blame will be on the browser, not the authors. Having at least some consistency in dealing with invalid documents is A Good Thing™.


So, how many different parsers did you want to see in the browser again?

You'd still need to support HTML4 somewhere. Supporting HTML5 separately just means duplicating the common parts of the parser. The simplicity boat has already sailed.


Addition is simpler than combination.

    HTML4 + HTML5 < HTML4 * HTML5


They're extremely similar. You don't get an explosion of code size. Unless you're suggesting that HTML5 should also be radically different syntactically as well?


I'm suggesting that a strict HTML5 parser that doesn't have complex recovery code is radically simpler than one that does.


I'm suggesting that an HTML4 parser that has complex recovery code, plus a mostly-copy-and-paste HTML5 parser that doesn't, isn't a big win.

Browsers are going to have a complex, ugly, recovery-enabled parser in them either way, and the effort to add HTML5 to the recovery-enabled parser isn't very big, comparatively speaking.


Not really true in this case. WebKit's previous HTML parser is similar in complexity to the HTML5 parser. Adding a second HTML parser would have been more code, more complexity, and a more complex test matrix.


Complexity does not increase linearly with the number of lines of code. 110k lines can easily be 5x as complex as 100k lines of code.


1 + 1 > 1 * 1


The problem with this approach is the same problem people serving XHTML as application/xhtml+xml ran into: You don't control your page any more these days.

If you allow people to leave comments on your page or if you are serving ads, you lose a bit of control in what gets put on a particular page of yours.

In case of comments you could try and sanitize them, but with ads that's hardly possible.


I tend to think that very few things that see widespread real-world use are "simple" by any reasonable metric.

The basic idea may be simple, but by the time you have covered a reasonable fraction of real-world use cases and error conditions for a sizeable population, the resulting code is anything but.

And 10k lines still counts as "moderately simple" - 10k lines is well within scope for one developer working alone.


This must be your first time hearing about HTML, welcome!

Just imagine how large "quirks mode" must be.


You make it sound like it was a conscious decision. Most of the complexity comes from the need to tolerate the janky markup that older browsers processed in a reasonable way.


Just for kicks, I am going to try and implement these two algorithms using the State Machine Compiler (http://smc.sourceforge.net/). I expect that my "code" will be much less than 10,000 lines, although I don't really know how long the code SMC itself generates will be.


Is there any advantage to SMC vs. Ragel for parsing? The code looks similar but a bit larger.


Heh, just what I need, a new toy to play with!

I spent a few hours looking into Ragel over the weekend, and, honestly, I can't envisage how an HTML5 parser mostly written in Ragel would look. But I'm going to give it a crack.


What's the magic number of LOC that any markup language should be implemented under?


To put 10k in perspective, that's substantially more than the pcc C compiler front end, and closing in on the compiler's total line count.

When I consider what I can accomplish using C vs what I can accomplish using HTML, that doesn't appear to be the expected result.


C compilers don't have to deal with users that expect purely random garbage thrown at the parsers to have a "valid" parsing that's consistent across every implementation. Unfortunately, making a language simple to use by not requiring that users get everything right makes it very hard to implement.


Considering the number of people that can accomplish anything useful with HTML vs the number of people that can accomplish anything useful with C (or even XML), it's well in line with my expectations.

It's fairly easy to get a simpler implementation by shifting complexity from the parser to the user. But from a "total society productivity" standpoint, having a handful of developers spend a few months to implement a resilient parser is a cheap price to pay to have umpteen million people save hours of debugging for every page they create.


Back when HTML was invented, it was based on SGML. In retrospect, this was a fundamental design flaw. SGML has all sorts of fillips that were intended to make documents easy to type (back in the pre-GUI era) but which in practice make them hard to parse. SGML parsers, even parsers built into expensive commercial software, were famous for not reliably interoperating. The HTML5 spec abandons SGML entirely because, as it turned out, even those fillips of SGML did not make it possible for browsers to consistently agree on how to parse HTML in the wild.

It would have been better for everyone if HTML had used a bog-simple Lisp-like format from day one, but without a time machine we’re stuck with these kludges of history.


Most markup languages (indeed most languages) don't have error discovery and recovery built in and specified.

HTML5 does, and most of the spec (and probably the parser) is about dealing with error cases.

Plus it's not like XML parsers are small. And HTML parsing is a far more complex affair (due to not just blowing up on error).


Having this complexity ensures that nobody will ever endeavor to write any more browser engines, thus ensuring that there will be no more browser bugs. The rest of the job is fixing the behavior of the already existing browsers, and then the web will be perfect.


HTML already tried the "let's switch to a simple parser" route, and failed:

http://dig.csail.mit.edu/breadcrumbs/node/166

Now HTML5 has to deal with two parsers.


I would prefer that the parser just failed when invalid code was encountered, personally I saw this as a strong point of XHTML.


One of the nice things about the web is that all websites still work. Here is the first webpage: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/T... That one wouldn’t work with a strict parser (look at the source: it uses the header tag instead of the head tag).

Breaking half or more of all the websites out there wouldn’t exactly be what I would call the “spirit of the www”.


Use a new parser for new pages, the old parser for old pages. The introduction of H.264 didn't suddenly render all older formats unplayable. Why can't a new markup format coexist with older formats, too?


What would be the advantage of that? Also, you then never truly can ditch the old parser if you want to keep everything working. Your browser has to come with two parsers for all eternity.


How many codecs does your media player come with?


Not enough because there always seems to be this one file it absolutely cannot play ;)

It’s possible to do it but it’s also needlessly complex while giving you very little in return. The biggest, most problematic flaw of allowing browsers to parse invalid code, namely that different browsers might handle failure differently, is in the process of being fixed and that’s good enough, I think.

Strict parsing has been tried and it was a resounding failure. It’s not going to happen.


Web browsers already have this, and it sucks. There's XML mode, HTML standards mode, HTML quirks mode and HTML almost standards mode. It's horrible to test and maintain.


1. How do you discriminate?

2. What would the point be, apart from annoying every single end user?


Maybe a content-type header from the server? The point, as I may wrongly see it, is that attempting to support backward compatibility indefinitely will result in overly complex and bloated browsers in the future.


Nobody sets headers and meta information correctly anyway. Those headers cannot be trusted, and browsers have to guess what content-type and character encoding should actually apply to any particular document.


You have to realize that what you are proposing is simply a bad idea, even if everything is trying to be pure and perfect:

http://diveintomark.org/archives/2004/01/14/thought_experime...


I think that article may have seriously changed my mind.


You can still write the HTML5 vocabulary as XHTML if you really want to. That's perfectly valid, and defined in the spec.

The HTML5 parsing algorithm is what the browser vendors want and need, however. Since HTML was never specified properly - not even in HTML4, where large areas were just undefined - HTML parsers have slowly evolved through trial and error. If big sites depend on a particular behaviour in a particular browser, other browsers have tried to be bug-compatible with that browser. Browsers that failed to do so have lost market share. A browser that decided to halt on encountering anything invalid would lose all of its users instantly, since the vast majority of documents are invalid.

Parsers have slowly converged towards each other, and the HTML5 algorithm is the compromise between them that breaks the least amount of content.

In other words, the goal of the HTML5 parsing spec was never to make something nice and clean. The goal was to define how to parse the unholy mess all you web developers out there have created during the past two decades. A clear spec, a good test suite and exactly identical behaviour between browsers is a benefit for everyone.


I agree. But: Why do we need exactly identical behaviour? Wasn't HTML intended to leave lots of discretion for rendering decisions to the browser?


If you parse a string of bytes, it's nice if it always turns into the same DOM. That's the scope of the HTML5 parser. There is no benefit to inconsistencies here.

Allowing browsers on different platforms to adjust page widths, font sizes, and whether or not to load images and plugins in accordance with the user's preferences is fine, but that's an entirely different issue.


Thanks for the clarification.


Hmmm. I suspect that would break 90% of web pages on the web. On the other hand, people would make their pages valid pretty quick. :-)


What about pages that are not maintained but still carry value?


No. Browsers that implemented this would just lose their market share instantly. Users prefer being able to use their favourite sites.



