This is an intentional change. Prior to the HTML5 parsing algorithm, browsers
backtracked and reparsed when seeing an EOF inside a script to deal with
</script> inside an inline script. This means that prior to HTML5, an
accidental or maliciously forced premature end of file could change the
executability properties of pieces of an HTML file.
The current magic in the spec was carefully designed and researched to permit
forward-only parsing in a maximally Web-compatible way. The solution in the
spec was known to break a handful of pages among lots and lots of pages listed
by dmoz, but the breakage was deemed negligible.
This is probably the highest-risk change in the HTML5 parsing algorithm, and
this is the first report of it breaking an "important" contemporary site.
Considering that keeping the forward-only tokenization behavior is highly
desirable, in the absence of evidence of more important breakage, I'm treating
this as an evangelism issue, realizing that further evidence may force us to revisit
this part of the spec. But let's try to get away with forward-only parsing!
What is the logic behind this? Just because we want "forward-only parsing", we're going to try to force people not to write </script> inside of a JavaScript string? That seems rather silly.
Are there any benefits to forward-only parsing besides speed? Because if this change was done in the name of performance, then we really should re-evaluate who exactly is proposing these types of changes and why.
browsers backtracked and reparsed when seeing an EOF inside a script to deal with </script> inside an inline script
Correct me if I'm wrong, but it seems to me that in order to detect that the </script> is inside an inline script, you'll have to parse the JavaScript. Requiring every HTML5 parser to include a JavaScript parser in order to parse it correctly seems a little too much for me.
Another reason: code complexity. Backtracking makes everything a lot more complex which is the exact opposite of what they're trying to achieve with the HTML5 parsing algorithm. Less code = fewer bugs, easier to understand and easier to implement.
Forward-only parsing is not only faster, it's less complicated, which should translate to fewer bugs. It's worth noting that the HTML parser isn't used only when initially rendering the page; it's also used when rewriting the contents with JavaScript.
The bug report has a perfectly reasonable workaround you can use, and you can always put your JavaScript in a separate file or escape < >. It's a good idea to do this even with the current generation of browsers, because if you don't, you sometimes get strange bugs... often bugs that are not reproducible in all browsers.
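As a concrete illustration of the escaping workaround mentioned above (a minimal sketch; the variable names are made up):

```javascript
// Inside an inline <script>, a literal "</script>" anywhere -- even inside a
// JavaScript string -- ends the script element under forward-only tokenization.
// Escaping the slash keeps the HTML tokenizer inside the script data:
var unsafe = "</scr" + "ipt>"; // split so this snippet is itself embeddable
var safe = "<\/script>";       // in JavaScript, "\/" is just "/"
console.log(safe === unsafe);  // → true: the string values are identical
```

Both forms produce the exact same string at runtime; the only difference is that the raw bytes of the page never contain the character sequence the tokenizer is looking for.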
Honestly, allowing unescaped </script> inside inline JavaScript wasn't a good idea to begin with.
"Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code."
Anybody else think that if your markup language requires a 10k line parser, somebody took a wrong turn at the complexity vs simplicity fork in the road?
All browsers that implement the HTML5 parsing algorithm
should parse HTML the same way, which means your web page
should parse the same way in Firefox 4 and the WebKit
nightly, even if it contains invalid markup.
Parsing perfect HTML5 would be easy, but one of the features of the HTML5 spec is that it does what no spec did before: it defines exactly how parsing should work, even in the case of invalid markup. Also, the parser has to deal with deprecated elements no longer in the spec (such as the infamous <font>). I assume most of the work went into this "how to parse tag soup" part.
Deciding that invalid markup should work would be the wrong turn I alluded to.
It would have been very easy to drop all the back compat bs by saying "An HTML5 document is one that begins with the 6 bytes '<html5' and if the document is invalid, reject it. Anything else, parse however you want." Browsers that support HTML5 add text/html5 to the Accept header.
Alas, your six bytes would drop older browsers into quirks mode, which is not nice. The only reason HTML5 has a doctype at all is that it forces browsers into standards mode.
What you propose sounds a bit like the way XHTML2 was intended to go, and we know how that ended.
There are and will be tons of invalid documents on the web (the vast majority of it), so a browser rejecting them has no future; the blame will fall on the browser, not the authors. Having at least some consistency in dealing with invalid documents is A Good Thing™.
So, how many different parsers did you want to see in the browser again?
You'd still need to support HTML4 somewhere. Supporting HTML5 separately just means duplicating the common parts of the parser. The simplicity boat has already sailed.
They're extremely similar. You don't get an explosion of code size. Unless you're suggesting that HTML5 should also be radically different syntactically as well?
I'm suggesting that an HTML4 parser with complex recovery code, plus a mostly copy-and-paste HTML5 parser without it, isn't a big win.
Browsers are going to have a complex, ugly, recovery-enabled parser in them either way, and the effort to add HTML5 to the recovery-enabled parser isn't very big, comparatively speaking.
Not really true in this case. WebKit's previous HTML parser is similar in complexity to the HTML5 parser. Adding a second HTML parser would have been more code, more complexity, and a more complex test matrix.
The problem with this approach is the same problem people serving XHTML as application/xhtml+xml ran into: You don't control your page any more these days.
If you allow people to leave comments on your page or if you are serving ads, you lose a bit of control in what gets put on a particular page of yours.
In the case of comments you could try to sanitize them, but with ads that's hardly possible.
I tend to think that very few things that see widespread real-world use are "simple" by any reasonable metric.
The basic idea may be simple, but by the time you have covered a reasonable fraction of real-world use cases and error conditions for a sizeable population, the resulting code is anything but.
And 10k lines still counts as "moderately simple" - 10k lines is well within scope for one developer working alone.
You make it sound like it was a conscious decision. Most of the complexity comes from the need to tolerate the janky markup that older browsers processed in a reasonable way.
Just for kicks, I am going to try to implement these two algorithms using the State Machine Compiler (http://smc.sourceforge.net/). I expect that my "code" will be much less than 10,000 lines, although I don't really know how long the code SMC itself generates will be.
I spent a few hours looking into Ragel over the weekend, and, honestly, I can't envisage how an HTML5 parser mostly written in Ragel would look. But I'm going to give it a crack.
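To make the forward-only idea concrete, here is a rough, hypothetical sketch in plain JavaScript (not Ragel, and deliberately much simpler than the spec's actual script-data state machine): consume characters exactly once, looking for "</script" followed by whitespace, "/", or ">".

```javascript
// Hypothetical forward-only scan of script data. Returns the index where an
// appropriate end tag begins, or input.length if the script is unterminated.
// The real spec expresses this as tokenizer states; this is just the gist.
function scanScriptData(input, pos) {
  const lower = input.toLowerCase(); // tag names match case-insensitively
  let i = pos;
  while (i < lower.length) {
    i = lower.indexOf("</script", i);
    if (i === -1) return input.length;     // EOF: the script never closes
    const next = lower[i + 8];
    if (next !== undefined && " \t\n\f/>".includes(next)) {
      return i;                            // end tag found; never backtrack
    }
    i += 1;                                // e.g. "</scripts" -- keep scanning
  }
  return input.length;
}

// The forward-only rule in action: the scan stops at the "</script>" inside
// the string literal, which is exactly what this bug report is about.
scanScriptData('var s = "</script>"; doStuff();', 0); // → 9
```

Note what this sketch does *not* do: it never inspects JavaScript syntax, so it cannot tell a string literal from real markup. That trade-off is the whole point of the design.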
C compilers don't have to deal with users who expect purely random garbage thrown at the parser to produce a "valid" parse that's consistent across every implementation. Unfortunately, making a language easy to use by not requiring users to get everything right makes it very hard to implement.
Considering the number of people that can accomplish anything useful with HTML vs the number of people that can accomplish anything useful with C (or even XML), it's well in line with my expectations.
It's fairly easy to get a simpler implementation by shifting complexity from the parser to the user. But from a "total society productivity" standpoint, having a handful of developers spend a few months to implement a resilient parser is a cheap price to pay to have umpteen million people save hours of debugging for every page they create.
Back when HTML was invented, it was based on SGML. In retrospect, this was a fundamental design flaw. SGML has all sorts of fillips that were intended to make documents easy to type (back in the pre-GUI era) but which in practice make them hard to parse. SGML parsers, even parsers built into expensive commercial software, were famous for not reliably interoperating. The HTML5 spec abandons SGML entirely because, as it turned out, even those fillips of SGML did not make it possible for browsers to consistently agree on how to parse HTML in the wild.
It would have been better for everyone if HTML had used a bog-simple Lisp-like format from day one, but without a time machine we’re stuck with these kludges of history.
Having this complexity ensures that nobody will ever endeavor to write any more browser engines, thus ensuring that there will be no more browser bugs. The rest of the job is fixing the behavior of the already existing browsers, and then the web will be perfect.
One of the nice things about the web is that all websites still work. Here is the first webpage: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/T... That one wouldn’t work with a strict parser (look at the source: it uses the header tag instead of the head tag).
Breaking half or more of all the websites out there wouldn’t exactly be what I would call the “spirit of the www”.
Use a new parser for new pages, the old parser for old pages. The introduction of H.264 didn't suddenly render all older formats unplayable. Why can't a new markup format coexist with older formats, too?
What would be the advantage of that? Also, you then never truly can ditch the old parser if you want to keep everything working. Your browser has to come with two parsers for all eternity.
Not enough because there always seems to be this one file it absolutely cannot play ;)
It’s possible to do it but it’s also needlessly complex while giving you very little in return. The biggest, most problematic flaw of allowing browsers to parse invalid code, namely that different browsers might handle failure differently, is in the process of being fixed and that’s good enough, I think.
Strict parsing has been tried and it was a resounding failure. It’s not going to happen.
Web browsers already have this, and it sucks. There's XML mode, HTML standards mode, HTML quirks mode and HTML almost standards mode. It's horrible to test and maintain.
Maybe a content-type header from the server? The point, as I may wrongly see it, is that attempting to support backward compatibility indefinitely will result in overly complex and bloated browsers in the future.
Nobody sets headers and meta information correctly anyway. Those headers cannot be trusted, and browsers have to guess what content-type and character encoding should actually apply to any particular document.
You can still write the HTML5 vocabulary as XHTML if you really want to. That's perfectly valid, and defined in the spec.
The HTML5 parsing algorithm is what the browser vendors want and need, however. Since HTML was never specified properly - not even in HTML4, where large areas were just undefined - HTML parsers have slowly evolved through trial and error. If big sites depend on a particular behaviour in a particular browser, other browsers have tried to be bug-compatible with that browser. Browsers that failed to do so have lost market share. A browser that decided to halt on encountering anything invalid would lose all of its users instantly, since the vast majority of documents are invalid.
Parsers have slowly converged towards each other, and the HTML5 algorithm is the compromise between them that breaks the least amount of content.
In other words, the goal of the HTML5 parsing spec was never to make something nice and clean. The goal was to define how to parse the unholy mess all you web developers out there have created during the past two decades. A clear spec, a good test suite and exactly identical behaviour between browsers is a benefit for everyone.
If you parse a string of bytes, it's nice if it always turns into the same DOM. That's the scope of the HTML5 parser. There is no benefit to inconsistencies here.
Allowing browsers on different platforms to adjust page widths, font sizes, whether or not load images and plugins in accordance with the user's preferences is fine, but that's an entirely different issue.
and IE9: http://blogs.msdn.com/b/ie/archive/2010/03/16/html5-hardware...
Unfortunately the new parsing algorithm breaks my online banking site: http://bugzil.la/565689