The HTML5 Parsing Algorithm
Over the past few months, we’ve been hard at work implementing the parsing algorithm from HTML5. Before HTML5, there was no standard for how browsers should parse invalid HTML. As a result, every browser developed its own parsing quirks, harming interoperability for pages that contain invalid HTML. HTML5, in contrast, specifies a complete algorithm for parsing HTML documents, including invalid ones. Switching to the HTML5 parsing algorithm has three main benefits:
- Better interoperability between browsers. All browsers that implement the HTML5 parsing algorithm should parse HTML the same way, which means your web page should parse the same way in Firefox 4 and the WebKit nightly, even if it contains invalid markup. Improving interoperability makes it easier to author HTML by reducing the differences between browsers.
- Better compatibility with web pages. A mind-boggling amount of data analysis has gone into designing the HTML parsing algorithm. By crawling the web, the designers were able to carefully weigh the trade-offs and maximize compatibility with existing web pages. By implementing the algorithm, we leverage their effort and improve the compatibility of our parser, making it less likely that users will run across broken pages.
- SVG and MathML in HTML. One of the cool new features of the HTML5 parsing algorithm is the ability to embed SVG and MathML directly in HTML pages. To embed SVG, you simply add an <svg> tag to your HTML page and you can use the full power of SVG.
(View source to see the demo SVG code inline in this HTML post.)
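As a minimal sketch of what inline SVG looks like, a page parsed with the HTML5 algorithm could contain markup along these lines. The shapes, sizes, and colors are purely illustrative and are not the demo from this post.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Inline SVG sketch</title>
</head>
<body>
  <p>This paragraph is ordinary HTML.</p>

  <!-- No namespace declaration or external file is needed here: the HTML5
       tree builder places svg and its descendants in the SVG namespace. -->
  <svg width="120" height="120" viewBox="0 0 120 120">
    <!-- Illustrative shapes only. -->
    <circle cx="60" cy="60" r="50" fill="steelblue" />
    <rect x="35" y="35" width="50" height="50" fill="white" />
  </svg>

  <p>HTML content continues after the closing svg tag.</p>
</body>
</html>
```

Inside the svg element, the trailing "/" on tags such as circle is honored, so those elements really are self-closing. That is specific to SVG and MathML content and does not carry over to ordinary HTML elements.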
We’ve been implementing the HTML5 parsing algorithm in phases. Two months ago, we finished the first phase, which consisted of the tokenization algorithm. Late last night, we finished the second major piece: the tree builder algorithm. Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code. In the next phase, we’ll tackle fragment parsing (which is used by innerHTML and HTML5test.com).
One of the challenges we’ve encountered in switching to the HTML5 parsing algorithm is that some HTML documents rely on WebKit-specific parser quirks. For example, some websites use self-closing script tags (e.g., <script src="…" />). WebKit is the only major rendering engine that supports this syntax (other browsers ignore the trailing “/” and look for a “</script>” tag). By implementing HTML5, we improve interoperability with other browsers at the cost of compatibility with some WebKit-specific content. In the long run, however, we believe these changes are good for the web platform as a whole.
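To make the script example concrete, here is a small sketch of how such markup is handled once the trailing "/" is ignored. The file names first.js and second.js are hypothetical, chosen only for illustration.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Self-closing script sketch</title>

  <!-- Written for the old WebKit quirk (hypothetical file names). Under the
       HTML5 algorithm the trailing "/" is ignored, so the first tag below
       only opens a script element. The parser then treats everything up to
       the next script end tag as that element's text, which swallows the
       start tag for second.js. -->
  <script src="first.js" />
  <script src="second.js"></script>

  <!-- Interoperable form: always write an explicit end tag. -->
  <script src="first.js"></script>
  <script src="second.js"></script>
</head>
<body>
  <p>Page content.</p>
</body>
</html>
```

In the first pair, the expected result under the HTML5 rules is a single script element for first.js whose text content is the literal second start tag, so second.js is never requested. The second pair produces two separate script elements in every browser.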
Implementing the HTML5 parsing algorithm has also given us an opportunity to give back to the community. For example, we’ve sent feedback to the W3C HTML working group about how to improve the correctness and compatibility of the parsing algorithm itself, and we’ve contributed over 250 test cases to the html5lib parser test suite.