
Talk:Comparison of HTML parsers

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 200.45.200.41 (talk) at 07:20, 5 December 2014 (Other parsers that are not listed in the article: Add a forgotten parser). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Please check/review column definitions

Parser
The software: an "HTML parser"... Is a DOM with a LoadHTML method an "HTML parser"!? There is also standalone software that only transforms HTML, and software that lets programmers traverse all nodes, etc. What is the software taxonomy here?
License
Ok.
Implementation language(s)
Ok, but not to be confused with a "driver/bridge for a binary implementation".
Latest date
Latest release date of significant changes in the implementation source code.
HTML Parsing
Common sense says that all "HTML parsers" have YES for "HTML Parsing"... So this is the same problem as with the "Parser" column: does a DOMDocument class with a LoadHTML method count as enabling programmers to do "HTML parsing"!?
Clean HTML
Sanitize (generate a standards-compliant web page, reduce spam, etc.) and clean (strip out surplus presentational tags, remove XSS code, etc.) HTML code.
Update HTML
Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").

Is there a consensus here? --Krauss (talk) 13:20, 9 June 2013 (UTC)
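To make the "Update HTML" definition concrete, here is a minimal sketch of the CENTER to DIV conversion, assuming Python 3 and only its standard html.parser module. The names CenterToDiv and update_center are hypothetical, invented for this illustration; this is not how any parser listed in the article implements the feature.

```python
from html import escape
from html.parser import HTMLParser


class CenterToDiv(HTMLParser):
    """Hypothetical helper: re-emit HTML, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _render_attrs(attrs):
        # The parser hands attribute values over unescaped, so escape them again.
        return "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Attributes of the original <center> tag are dropped in this sketch.
            self.out.append('<div style="text-align:center;">')
        else:
            self.out.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written back in the same form.
        self.out.append(f"<{tag}{self._render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(html_text):
    """Return html_text with every <center> element turned into a centered <div>."""
    parser = CenterToDiv()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(update_center("<CENTER>Hi &amp; bye</CENTER>"))
```

Note that the tag names come back in lower case, because html.parser normalizes them. A real "Update HTML" feature would have to cover many more deprecated tags and attributes than this single case.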

Other parsers that are not listed in the article

In case anyone is interested and has enough time to add them, the following parsers are linked at the end of the Beautiful Soup documentation:

  • Rubyful Soup: port of Beautiful Soup to Ruby.
  • Hpricot: written in Ruby and C (currently its development is discontinued).
  • ElementTree: fast Python XML parser (last updated in September 2007).
  • HtmlPrag: Scheme library for parsing bad HTML (source code here).
  • xmltramp: a "standard" XML/XHTML parser. Like most parsers, it makes you traverse the tree yourself, but it's easy to use.
  • pullparser includes a tree-traversal method. It is unmaintained today (now part of mechanize, but the interface is no longer public).
  • Mike Foord didn't like the way Beautiful Soup can change HTML if you write the tree back out, so he wrote HTML Scraper. It's basically a version of HTMLParser that can handle bad HTML (published in 2004 and possibly obsolete).
  • Ka-Ping Yee's scrape.py combines page scraping with URL opening.

Reviewing the history of this discussion, it can also be seen that in this edit someone else suggested htmLawed (a PHP alternative to Tidy). --200.45.200.41 (talk) 07:20, 5 December 2014 (UTC)
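Since several of the entries above are described relative to HTMLParser and to "traversing the tree yourself", the following small sketch may help with the distinction. It assumes Python 3's standard html.parser module, which is not Mike Foord's HTML Scraper nor any other library in the list; the names TagEvents and tag_events are hypothetical. It shows that an event-based parser reports tags in document order, builds no tree, and does not repair bad HTML on its own.

```python
from html.parser import HTMLParser


class TagEvents(HTMLParser):
    """Hypothetical helper: record parser events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the event list short.
        if data.strip():
            self.events.append(("data", data.strip()))


def tag_events(html_text):
    """Return the list of (kind, value) events seen while parsing html_text."""
    parser = TagEvents()
    parser.feed(html_text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Two unclosed <p> tags and a stray </b>: reported as-is, nothing is fixed.
    for event in tag_events("<p>one<p>two</b>"):
        print(event)
```

Whether such an event-only interface should count as an "HTML parser" in the article's table is the same taxonomy question raised in the section above; a tree-building library would also have to decide how to nest the unclosed tags.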