-
Notifications
You must be signed in to change notification settings - Fork 51
Fix HTML Parser to Support HTML5 Syntax #106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This commit fixes issue #25 by replacing the XML parser with lxml's HTML parser for HTML files. The XML parser incorrectly required quoted attributes (e.g., name="foo"), but HTML5 allows unquoted attributes (e.g., name=foo). Changes: - Added lxml>=4.9.0 as a dependency in pyproject.toml - Overrode build_tree() method in HTML class to use lxml.html parser - Added _lxml_to_et() helper to convert lxml trees to ElementTree format - Implemented graceful fallback to XML parser if lxml is not available - Added comprehensive tests for both quoted and unquoted HTML attributes - XML parsing remains unchanged and continues to use strict ET parser The lxml.html parser provides lenient HTML5-compliant parsing while maintaining compatibility with the existing XMLElement tree structure. All 66 existing tests pass with no regressions. Fixes #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
This commit adds a dedicated test file (test_html.py) with comprehensive tests for the HTML unquoted attributes fix (issue #25). Tests added: - test_unquoted_attributes: Tests the exact issue from #25 - parsing HTML with unquoted attributes like <meta name=foo> - test_mixed_quoted_unquoted_attributes: Ensures mixed quoted/unquoted attributes work correctly - test_quoted_attributes_backward_compatibility: Verifies backward compatibility with fully quoted HTML - test_complex_unquoted_attributes: Tests more complex real-world cases with multiple unquoted attributes All tests verify that: 1. HTML files parse without errors 2. Trees are built successfully 3. Diffs can be computed between files 4. Changes are correctly captured in the diff output The test file is designed to be extensible for future HTML-related improvements. Related to #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
This commit adds tests for two additional HTML parsing issues that are resolved by the lxml HTML parser: Issue #26 - Fails to match closing HTML tags: - Added test_closing_tags_issue_26() to verify that HTML files with closing tags like </head> parse and diff correctly - The XML parser previously threw errors on closing tags - lxml's HTML parser handles these correctly Issue #80 - Text missing from HTML diff: - Added test_text_between_elements_issue_80() to verify that text nodes between elements are correctly parsed and included in diffs - Tests the exact example from the issue: text outside of tags like "some <div>text</div> more text" - lxml correctly parses text nodes as element text and tail attributes Both tests pass, confirming that the lxml HTML parser (introduced to fix issue #25) also resolves these related HTML parsing issues. Test results: 72 tests pass (66 original + 6 HTML tests) Related to #26, #80 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
cc: @smoelius |
ESultanik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but there's one recursive function that will almost certainly cause a stack overflow on large HTML files. It needs to be converted to an iterative function.
|
|
||
| # Recursively convert children | ||
| for child in lxml_elem: | ||
| et_elem.append(HTML._lxml_to_et(child)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This recursion will exhaust Python's tiny stack if the HTML DOM is very deep. _lxml_to_et needs to be converted to an iterative function (e.g., using a stack).
This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.
Issues Resolved
Fixes #25 - Unquoted HTML attributes
Fixes #26 - Closing tag matching errors
Fixes #80 - Text nodes between elements missing from diffs
This PR implements a proper HTML parser using lxml.html:
Test Coverage
Added 6 new HTML-specific tests covering:
✅ Unquoted attributes ()
✅ Mixed quoted/unquoted attributes
✅ Backward compatibility with quoted attributes
✅ Complex real-world attribute patterns
✅ Closing tag handling
✅ Text nodes between elements