Fix HTML Parser to Support HTML5 Syntax #106

pbottine · 2025-11-25T15:09:33Z

This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.

Issues Resolved

Fixes #25 - Unquoted HTML attributes
Fixes #26 - Closing tag matching errors
Fixes #80 - Text nodes between elements missing from diffs

This PR implements a proper HTML parser using lxml.html:

Adds lxml dependency (lxml>=4.9.0) for HTML5-compliant parsing
Overrides build_tree() in HTML class to use lxml.html parser
Converts lxml trees to ElementTree format for compatibility with existing code
Includes graceful fallback to XML parser if lxml is unavailable
Maintains XML parsing unchanged - strict parsing still used for XML files

Test Coverage
Added 6 new HTML-specific tests covering:
✅ Unquoted attributes ()
✅ Mixed quoted/unquoted attributes
✅ Backward compatibility with quoted attributes
✅ Complex real-world attribute patterns
✅ Closing tag handling
✅ Text nodes between elements

This commit fixes issue #25 by replacing the XML parser with lxml's HTML parser for HTML files. The XML parser incorrectly required quoted attributes (e.g., name="foo"), but HTML5 allows unquoted attributes (e.g., name=foo). Changes: - Added lxml>=4.9.0 as a dependency in pyproject.toml - Overrode build_tree() method in HTML class to use lxml.html parser - Added _lxml_to_et() helper to convert lxml trees to ElementTree format - Implemented graceful fallback to XML parser if lxml is not available - Added comprehensive tests for both quoted and unquoted HTML attributes - XML parsing remains unchanged and continues to use strict ET parser The lxml.html parser provides lenient HTML5-compliant parsing while maintaining compatibility with the existing XMLElement tree structure. All 66 existing tests pass with no regressions. Fixes #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This commit adds a dedicated test file (test_html.py) with comprehensive tests for the HTML unquoted attributes fix (issue #25). Tests added: - test_unquoted_attributes: Tests the exact issue from #25 - parsing HTML with unquoted attributes like <meta name=foo> - test_mixed_quoted_unquoted_attributes: Ensures mixed quoted/unquoted attributes work correctly - test_quoted_attributes_backward_compatibility: Verifies backward compatibility with fully quoted HTML - test_complex_unquoted_attributes: Tests more complex real-world cases with multiple unquoted attributes All tests verify that: 1. HTML files parse without errors 2. Trees are built successfully 3. Diffs can be computed between files 4. Changes are correctly captured in the diff output The test file is designed to be extensible for future HTML-related improvements. Related to #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This commit adds tests for two additional HTML parsing issues that are resolved by the lxml HTML parser: Issue #26 - Fails to match closing HTML tags: - Added test_closing_tags_issue_26() to verify that HTML files with closing tags like </head> parse and diff correctly - The XML parser previously threw errors on closing tags - lxml's HTML parser handles these correctly Issue #80 - Text missing from HTML diff: - Added test_text_between_elements_issue_80() to verify that text nodes between elements are correctly parsed and included in diffs - Tests the exact example from the issue: text outside of tags like "some <div>text</div> more text" - lxml correctly parses text nodes as element text and tail attributes Both tests pass, confirming that the lxml HTML parser (introduced to fix issue #25) also resolves these related HTML parsing issues. Test results: 72 tests pass (66 original + 6 HTML tests) Related to #26, #80 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

pbottine · 2025-11-25T20:12:45Z

cc: @smoelius

ESultanik

This looks good, but there's one recursive function that will almost certainly cause a stack overflow on large HTML files. It needs to be converted to an iterative function.

ESultanik · 2025-12-01T22:10:32Z

graphtage/xml.py

+
+        # Recursively convert children
+        for child in lxml_elem:
+            et_elem.append(HTML._lxml_to_et(child))


This recursion will exhaust Python's tiny stack if the HTML DOM is very deep. _lxml_to_et needs to be converted to an iterative function (e.g., using a stack).

pbottine and others added 3 commits November 24, 2025 16:44

pbottine requested a review from ESultanik as a code owner November 25, 2025 15:09

ESultanik requested changes Dec 1, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix HTML Parser to Support HTML5 Syntax #106

Fix HTML Parser to Support HTML5 Syntax #106

Uh oh!

pbottine commented Nov 25, 2025

Uh oh!

pbottine commented Nov 25, 2025

Uh oh!

ESultanik left a comment

Uh oh!

ESultanik Dec 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix HTML Parser to Support HTML5 Syntax #106

Are you sure you want to change the base?

Fix HTML Parser to Support HTML5 Syntax #106

Uh oh!

Conversation

pbottine commented Nov 25, 2025

Uh oh!

pbottine commented Nov 25, 2025

Uh oh!

ESultanik left a comment

Choose a reason for hiding this comment

Uh oh!

ESultanik Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants