Comparison of HTML parsers

Parsing HTML is an automated task performed by so-called HTML parsers, which serve two main purposes:

HTML traversal: offer an interface that lets programmers easily access and modify the HTML source code. Canonical example: DOM parsers.
HTML cleaning: fix invalid HTML and improve the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
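
As a minimal sketch of the traversal purpose, the following uses `html.parser` from the Python standard library, which is not one of the parsers compared in this article and is event-based rather than a DOM parser. The names `LinkCollector` and `collect_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list
        # of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the list of link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Calling `collect_links` on a string of markup returns the link targets in document order. A DOM parser would instead build a tree that can be queried and modified after parsing.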

Parser	License	Implementation language(s)	Latest date*	HTML Parsing^[1]	Clean HTML**	Update HTML***
Beautiful Soup (based on lxml and html5lib)^[2]	Python Software Foundation License	Python	2013-05-31	Yes	?	?
Gumbo	Apache License 2.0	C	2013-08-13	Yes	?	?
html5lib	MIT License	Python and PHP	2013-12-23^[3]	Yes	Yes	No
HTML::Parser	Perl license	Perl	2013-03-28	No^[4]	?	?
htmlPurifier	GNU Lesser GPL	PHP	2009-03-25^[5]	No	Yes	Yes
HTML Tidy	W3C license	ANSI C	2009-03-25^[6]	Yes^[7]	Yes	?
HtmlCleaner	BSD License^[8]	Java	2013-09-05	No	Yes	?
Hubbub	MIT License	C	2013-04-19	Yes	?	?
Jaunt API	Jaunt Beta License	Java	2013-08-01	Yes	Yes	No
Jericho HTML Parser	Eclipse Public License	Java	2012-10-30^[9]	No?	?	?
jsdom	MIT License	JavaScript	2013-07-21	No	?	?
jsoup	MIT License	Java	2013-01-27^[10]	Yes	Yes	Yes
JTidy	JTidy License	Java	2009-12-01^[11]	Yes	?	?
libxml2 HTMLparser	MIT License	C	2012-09-11^[12]	Yes	?	?
NekoHTML	Apache License 2.0	Java	2013-02-27^[13]	No	?	?
TagSoup	Apache License 2.0	Java	2011-07-07	No	?	?
Validator.nu HTML Parser	MIT License	Java	2012-06-05	Yes	?	?

* Date of the latest release with significant changes.

** Sanitize (generate standards-compliant web pages, reduce spam, etc.) and clean (strip out surplus presentational tags, remove XSS code, etc.) HTML code.

*** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
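
The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows, again with the standard-library `html.parser` rather than any parser from the table. The names `CenterToDiv` and `update_markup` are hypothetical. This is an illustration under simplifying assumptions, not how any listed tool implements its update step: it rewrites only `<center>` and passes everything else through, lowercasing end-tag names.

```python
from html.parser import HTMLParser


class CenterToDiv(HTMLParser):
    """Re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            self.out.append('<div style="text-align:center;">')
        else:
            # Emit the start tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_markup(markup):
    """Return markup with deprecated <center> elements converted."""
    parser = CenterToDiv()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

A real updater would also need to merge the new style with any attributes already present on the element and handle the other deprecated tags.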

References

[1] 12.2 Parsing HTML documents — HTML Standard

[2] http://www.crummy.com/software/BeautifulSoup/

[3] Releases · html5lib/html5lib-python

[4] Bug #53300 for HTML-Parser: HTML 5

[5] HTML Tidy for Windows

[6] HTML Tidy for Windows

[7] Tidy parser example: class.tidynode of PHP

[8] HtmlCleaner is distributed under BSD License

[9] Jericho HTML Parser - Browse /jericho-html/3.3 at SourceForge.net

[10] jsoup/CHANGES at master · jhy/jsoup · GitHub

[11] JTidy - Browse /JTidy at SourceForge.net

[12] libxml2 Releases

[13] NekoHTML | Change History
