Skip to content

Conversation

@pbottine
Copy link

This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.

Issues Resolved

Fixes #25 - Unquoted HTML attributes
Fixes #26 - Closing tag matching errors
Fixes #80 - Text nodes between elements missing from diffs

This PR implements a proper HTML parser using lxml.html:

  • Adds lxml dependency (lxml>=4.9.0) for HTML5-compliant parsing
  • Overrides build_tree() in HTML class to use lxml.html parser
  • Converts lxml trees to ElementTree format for compatibility with existing code
  • Includes graceful fallback to XML parser if lxml is unavailable
  • Maintains XML parsing unchanged - strict parsing still used for XML files

Test Coverage
Added 6 new HTML-specific tests covering:
✅ Unquoted attributes ()
✅ Mixed quoted/unquoted attributes
✅ Backward compatibility with quoted attributes
✅ Complex real-world attribute patterns
✅ Closing tag handling
✅ Text nodes between elements

pbottine and others added 3 commits November 24, 2025 16:44
This commit fixes issue #25 by replacing the XML parser with lxml's HTML
parser for HTML files. The XML parser incorrectly required quoted attributes
(e.g., name="foo"), but HTML5 allows unquoted attributes (e.g., name=foo).

Changes:
- Added lxml>=4.9.0 as a dependency in pyproject.toml
- Overrode build_tree() method in HTML class to use lxml.html parser
- Added _lxml_to_et() helper to convert lxml trees to ElementTree format
- Implemented graceful fallback to XML parser if lxml is not available
- Added comprehensive tests for both quoted and unquoted HTML attributes
- XML parsing remains unchanged and continues to use strict ET parser

The lxml.html parser provides lenient HTML5-compliant parsing while
maintaining compatibility with the existing XMLElement tree structure.
All 66 existing tests pass with no regressions.

Fixes #25

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit adds a dedicated test file (test_html.py) with comprehensive
tests for the HTML unquoted attributes fix (issue #25).

Tests added:
- test_unquoted_attributes: Tests the exact issue from #25 - parsing HTML
  with unquoted attributes like <meta name=foo>
- test_mixed_quoted_unquoted_attributes: Ensures mixed quoted/unquoted
  attributes work correctly
- test_quoted_attributes_backward_compatibility: Verifies backward
  compatibility with fully quoted HTML
- test_complex_unquoted_attributes: Tests more complex real-world cases
  with multiple unquoted attributes

All tests verify that:
1. HTML files parse without errors
2. Trees are built successfully
3. Diffs can be computed between files
4. Changes are correctly captured in the diff output

The test file is designed to be extensible for future HTML-related
improvements.

Related to #25

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit adds tests for two additional HTML parsing issues that are
resolved by the lxml HTML parser:

Issue #26 - Fails to match closing HTML tags:
- Added test_closing_tags_issue_26() to verify that HTML files with
  closing tags like </head> parse and diff correctly
- The XML parser previously threw errors on closing tags
- lxml's HTML parser handles these correctly

Issue #80 - Text missing from HTML diff:
- Added test_text_between_elements_issue_80() to verify that text nodes
  between elements are correctly parsed and included in diffs
- Tests the exact example from the issue: text outside of tags like
  "some <div>text</div> more text"
- lxml correctly parses text nodes as element text and tail attributes

Both tests pass, confirming that the lxml HTML parser (introduced to fix
issue #25) also resolves these related HTML parsing issues.

Test results: 72 tests pass (66 original + 6 HTML tests)

Related to #26, #80

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@pbottine pbottine requested a review from ESultanik as a code owner November 25, 2025 15:09
@pbottine
Copy link
Author

cc: @smoelius

Copy link
Collaborator

@ESultanik ESultanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but there's one recursive function that will almost certainly cause a stack overflow on large HTML files. It needs to be converted to an iterative function.


# Recursively convert children
for child in lxml_elem:
et_elem.append(HTML._lxml_to_et(child))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recursion will exhaust Python's tiny stack if the HTML DOM is very deep. _lxml_to_et needs to be converted to an iterative function (e.g., using a stack).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Text missing from HTML diff Fails to match closing HTML tags Expects quoted attributes in HTML unnecessarily, preventing diffing

3 participants