Commit 9c2bec8

feat: implement MarkdownHeaderLevelsInferrer (#373)
* add component and tests
* rework to match component-level output pattern
* improved test cases
* fix linting error
* resolve typing issues
* remove pytest from global dependencies
* move logger to top of file
* Update haystack_experimental/components/preprocessors/md_header_level_inferrer.py
  Co-authored-by: David S. Batista <[email protected]>
* use doc.content instead of extra variable 'content'
* Update haystack_experimental/components/preprocessors/md_header_level_inferrer.py
  Co-authored-by: David S. Batista <[email protected]>
* Update haystack_experimental/components/preprocessors/md_header_level_inferrer.py
  Co-authored-by: David S. Batista <[email protected]>
* refactor for readability
* adding docstrings and simplifying
* removing uv.lock
* adding new component to README.md
* extending tests
* adding link to discussion

---------

Co-authored-by: David S. Batista <[email protected]>
1 parent bbfbfce commit 9c2bec8

3 files changed: +318 -5 lines changed

README.md

Lines changed: 10 additions & 5 deletions
@@ -41,16 +41,17 @@ that includes it. Once it reaches the end of its lifespan, the experiment will b
 
 ### Active experiments
 
-| Name | Type | Expected End Date | Dependencies | Cookbook | Discussion |
-|---------------------------------------|--------------------------------|-------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
-| [`InMemoryChatMessageStore`][1] | Memory Store | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
-| [`ChatMessageRetriever`][2] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
-| [`ChatMessageWriter`][3] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| Name | Type | Expected End Date | Dependencies | Cookbook | Discussion |
+|---------------------------------------|--------------------------------|-------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| [`InMemoryChatMessageStore`][1] | Memory Store | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| [`ChatMessageRetriever`][2] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| [`ChatMessageWriter`][3] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
 | [`QueryExpander`][5] | Query Expansion Component | October 2025 | None | None | [Discuss][6] |
 | [`EmbeddingBasedDocumentSplitter`][8] | EmbeddingBasedDocumentSplitter | August 2025 | None | None | [Discuss][7] |
 | [`MultiQueryEmbeddingRetriever`][13] | MultiQueryEmbeddingRetriever | November 2025 | None | None | [Discuss][11] |
 | [`MultiQueryTextRetriever`][14] | MultiQueryTextRetriever | November 2025 | None | None | [Discuss][12] |
 | [`OpenAIChatGenerator`][9] | Chat Generator Component | November 2025 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/hallucination_score_calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][10] |
+| [`MarkdownHeaderLevelInferrer`][15] | Preprocessor | January 2025 | None | None | [Discuss][16] |
 
 [1]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/chat_message_stores/in_memory.py
 [2]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/retrievers/chat_message_retriever.py
@@ -66,6 +67,10 @@ that includes it. Once it reaches the end of its lifespan, the experiment will b
 [12]: https://github.com/deepset-ai/haystack-experimental/discussions/364
 [13]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/retrievers/multi_query_embedding_retriever.py
 [14]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/retrievers/multi_query_text_retriever.py
+[15]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/preprocessors/md_header_level_inferrer.py
+[16]: https://github.com/deepset-ai/haystack-experimental/discussions/376
+
+
 
 ### Adopted experiments
 | Name | Type | Final release |
haystack_experimental/components/preprocessors/md_header_level_inferrer.py

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+import re
+
+from haystack import Document, component, logging
+
+logger = logging.getLogger(__name__)
+
+
+@component
+class MarkdownHeaderLevelInferrer:
+    """
+    Infers and rewrites header levels in Markdown text to normalize hierarchy.
+
+    First header → Always becomes level 1 (#)
+    Subsequent headers → Level increases if no content between headers, stays same if content exists
+    Maximum level → Capped at 6 (######)
+
+    ### Usage example
+    ```python
+    from haystack import Document
+    from haystack_experimental.components.preprocessors import MarkdownHeaderLevelInferrer
+
+    # Create a document with uniform header levels
+    text = "## Title\nSome content\n## Section\nMore content\n## Subsection\nFinal content"
+    doc = Document(content=text)
+
+    # Initialize the inferrer and process the document
+    inferrer = MarkdownHeaderLevelInferrer()
+    result = inferrer.run([doc])
+
+    # Content separates every pair of headers here, so all of them end up at level 1
+    print(result["documents"][0].content)
+    > # Title\nSome content\n# Section\nMore content\n# Subsection\nFinal content
+    ```
+    """
+
+    def __init__(self):
+        """Initializes the MarkdownHeaderLevelInferrer."""
+        # handles headers with optional trailing spaces and empty content
+        self._header_pattern = re.compile(r"(?m)^(#{1,6})\s+(.+?)(?:\s*)$")
+
+    @component.output_types(documents=list[Document])
+    def run(self, documents: list[Document]) -> dict:
+        """
+        Infers and rewrites the header levels in the content for documents that use uniform header levels.
+
+        :param documents: list of Document objects to process.
+
+        :returns:
+            dict: a dictionary with the key 'documents' containing the processed Document objects.
+        """
+        if not documents:
+            logger.warning("No documents provided to process")
+            return {"documents": []}
+
+        logger.debug(f"Inferring and rewriting header levels for {len(documents)} documents")
+        processed_docs = [self._process_document(doc) for doc in documents]
+        return {"documents": processed_docs}
+
+    def _process_document(self, doc: Document) -> Document:
+        """
+        Processes a single document, inferring and rewriting header levels.
+
+        :param doc: Document object to process.
+        :returns:
+            Document object with rewritten header levels.
+        """
+        if doc.content is None:
+            logger.warning(f"Document {getattr(doc, 'id', '')} content is None; skipping header level inference.")
+            return doc
+
+        matches = list(re.finditer(self._header_pattern, doc.content))
+        if not matches:
+            logger.info(f"No headers found in document {doc.id}; skipping header level inference.")
+            return doc
+
+        modified_text = MarkdownHeaderLevelInferrer._rewrite_headers(doc.content, matches)
+        logger.info(f"Rewrote {len(matches)} headers with inferred levels in document {doc.id}.")
+        return MarkdownHeaderLevelInferrer._build_final_document(doc, modified_text)
+
+    @staticmethod
+    def _rewrite_headers(content: str, matches: list[re.Match]) -> str:
+        """
+        Rewrites the headers in the content with inferred levels.
+
+        :param content: Original Markdown content.
+        :param matches: List of regex matches for headers.
+        """
+        modified_text = content
+        offset = 0
+        current_level = 1
+
+        for i, match in enumerate(matches):
+            original_header = match.group(0)
+            header_text = match.group(2).strip()
+
+            # Skip empty headers
+            if not header_text:
+                logger.warning(f"Skipping empty header at position {match.start()}")
+                continue
+
+            has_content = MarkdownHeaderLevelInferrer._has_content_between_headers(content, matches, i)
+            inferred_level = MarkdownHeaderLevelInferrer._infer_level(i, current_level, has_content)
+            current_level = inferred_level
+
+            new_header = f"{'#' * inferred_level} {header_text}"
+            start_pos = match.start() + offset
+            end_pos = match.end() + offset
+            modified_text = modified_text[:start_pos] + new_header + modified_text[end_pos:]
+            offset += len(new_header) - len(original_header)
+
+        return modified_text
+
+    @staticmethod
+    def _has_content_between_headers(content: str, matches: list[re.Match], i: int) -> bool:
+        """Checks if there is content between the previous and current header."""
+        if i == 0:
+            return False
+        prev_end = matches[i - 1].end()
+        current_start = matches[i].start()
+        content_between = content[prev_end:current_start]
+        return bool(content_between.strip())
+
+    @staticmethod
+    def _infer_level(i: int, current_level: int, has_content: bool) -> int:
+        """Infers the header level for the current header."""
+        if i == 0:
+            return 1
+        if has_content:
+            return current_level
+        return min(current_level + 1, 6)
+
+    @staticmethod
+    def _build_final_document(doc: Document, new_content: str) -> Document:
+        """Creates a new Document with updated content, preserving other fields."""
+        return Document(
+            id=getattr(doc, "id", "") or "",
+            content=new_content,
+            blob=getattr(doc, "blob", None),
+            meta=getattr(doc, "meta", {}) or {},
+            score=getattr(doc, "score", None),
+            embedding=getattr(doc, "embedding", None),
+        )
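
The level-inference rule in this component can be tried without installing Haystack. The following is a minimal sketch of the same rule, not the component itself: `infer_header_levels`, `HEADER_RE`, and `MAX_LEVEL` are hypothetical names, and the pattern uses `[ \t]` where the commit uses `\s`, an assumption made here so that a header match always stays on one line. It relies on `re.sub` with a callback, so match offsets refer to the original text and no offset bookkeeping is needed.

```python
import re

# Assumption: [ \t] instead of the commit's \s, so a match never spans a line break.
HEADER_RE = re.compile(r"(?m)^(#{1,6})[ \t]+(.+?)[ \t]*$")
MAX_LEVEL = 6


def infer_header_levels(text: str) -> str:
    """Rewrite header levels: the first header is level 1, back-to-back headers nest deeper."""
    level = 1
    prev_end = None  # end offset of the previous header match; None until one is seen

    def rewrite(match: re.Match) -> str:
        nonlocal level, prev_end
        title = match.group(2).strip()
        if not title:
            # Leave an empty header untouched, but remember where it ended.
            prev_end = match.end()
            return match.group(0)
        # Go one level deeper only when nothing but whitespace separates
        # this header from the previous one; otherwise keep the current level.
        if prev_end is not None and not text[prev_end:match.start()].strip():
            level = min(level + 1, MAX_LEVEL)
        prev_end = match.end()
        return f"{'#' * level} {title}"

    return HEADER_RE.sub(rewrite, text)


if __name__ == "__main__":
    print(infer_header_levels("## Title\n## Section\nBody\n## Next\n## Deeper"))
```

Because the callback receives matches against the unmodified input, the "content between headers" check can slice `text` directly, which is the same comparison `_has_content_between_headers` performs on `content`.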
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from haystack import Document
+from haystack_experimental.components.preprocessors.md_header_level_inferrer import MarkdownHeaderLevelInferrer
+
+
+def test_single_header_level_inference():
+    text = "## H1\nSome content\n## H2\nContent"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    # Expect both headers at level 1: the first always is, and content separates it from the second
+    expected = "# H1\nSome content\n# H2\nContent"
+    assert content == expected
+
+
+def test_header_level_increase_on_consecutive_headers():
+    text = "## H1\n## H2\n## H3"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    # Expect the first header to be level 1, the next two to increase in level
+    expected = "# H1\n## H2\n### H3"
+    assert content == expected
+
+
+def test_no_headers():
+    text = "This is just some text without headers."
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    assert content == text
+
+
+def test_complex_structure():
+    text = (
+        "## Title\n"
+        "## Section\n"
+        "Section content\n"
+        "## Subsection\n"
+        "Subsection content\n"
+        "## Another Section\n"
+        "## Another Subsection\n"
+        "Even more content\n"
+        "## Final Section\n"
+        "Final content"
+    )
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    expected = (
+        "# Title\n"
+        "## Section\n"
+        "Section content\n"
+        "## Subsection\n"
+        "Subsection content\n"
+        "## Another Section\n"
+        "### Another Subsection\n"
+        "Even more content\n"
+        "### Final Section\n"
+        "Final content"
+    )
+    assert content == expected
+
+
+def test_empty_documents_list():
+    inferrer = MarkdownHeaderLevelInferrer()
+    result = inferrer.run([])
+    assert result["documents"] == []
+
+
+def test_document_with_none_content():
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=None)
+    result = inferrer.run([doc])
+    assert result["documents"][0].content is None
+
+
+def test_document_with_empty_content():
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content="")
+    result = inferrer.run([doc])
+    assert result["documents"][0].content == ""
+
+
+def test_headers_with_trailing_spaces():
+    text = "## Header 1 \nContent\n## Header 2 \nMore content"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    expected = "# Header 1\nContent\n# Header 2\nMore content"
+    assert content == expected
+
+
+def test_headers_with_leading_spaces():
+    text = " ## Header 1\nContent\n ## Header 2\nMore content"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    # Headers with leading spaces should not match the pattern
+    assert result["documents"][0].content == text
+
+
+def test_maximum_header_level():
+    text = "## H1\n## H2\n## H3\n## H4\n## H5\n## H6\n## H7\n## H8"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    content = result["documents"][0].content
+    expected = "# H1\n## H2\n### H3\n#### H4\n##### H5\n###### H6\n###### H7\n###### H8"
+    assert content == expected
+
+
+def test_multiple_documents():
+    text1 = "## Title 1\nContent 1"
+    text2 = "## Title 2\nContent 2"
+    inferrer = MarkdownHeaderLevelInferrer()
+    docs = [Document(content=text1), Document(content=text2)]
+    result = inferrer.run(docs)
+
+    assert len(result["documents"]) == 2
+    assert result["documents"][0].content == "# Title 1\nContent 1"
+    assert result["documents"][1].content == "# Title 2\nContent 2"
+
+
+def test_headers_with_special_characters():
+    text = "## Header with émojis 🚀\nContent\n## Header with numbers 123\nMore content"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    expected = "# Header with émojis 🚀\nContent\n# Header with numbers 123\nMore content"
+    assert result["documents"][0].content == expected
+
+
+def test_headers_with_markdown_formatting():
+    text = "## Header with **bold** text\nContent\n## Header with *italic* text\nMore content"
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+    expected = "# Header with **bold** text\nContent\n# Header with *italic* text\nMore content"
+    assert result["documents"][0].content == expected
+
+
+def test_very_long_content():
+    lines = ["## Header " + str(i) + "\nContent for header " + str(i) for i in range(50)]
+    text = "\n".join(lines)
+    inferrer = MarkdownHeaderLevelInferrer()
+    doc = Document(content=text)
+    result = inferrer.run([doc])
+
+    # verify first header becomes level 1, others follow the pattern
+    content = result["documents"][0].content
+    assert content.startswith("# Header 0")
+    assert "# Header 1" in content
+    assert len(content.split("\n")) == len(text.split("\n"))
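
None of the test inputs above contains a blank line directly after a header. The check below uses only the standard `re` module and the pattern string copied from the component's `__init__`. It suggests that case may deserve a test of its own, because `\s` also matches newlines, so the header match can include the line break that follows it. `SAME_LINE` is a hypothetical variant shown for comparison and is not part of the commit.

```python
import re

# Pattern string copied from the component's __init__.
ORIGINAL = re.compile(r"(?m)^(#{1,6})\s+(.+?)(?:\s*)$")
# Hypothetical variant: only spaces and tabs may surround the header text.
SAME_LINE = re.compile(r"(?m)^(#{1,6})[ \t]+(.+?)[ \t]*$")

# A header followed by a blank line, then body text.
text = "## H1\n\nBody"

original_match = ORIGINAL.search(text)
same_line_match = SAME_LINE.search(text)

# repr() makes a newline captured inside the match visible.
print(repr(original_match.group(0)))
print(repr(same_line_match.group(0)))
```

If the matched span includes the newline, replacing that span with the rewritten header drops the blank line from the output, which the line-count assertion in `test_very_long_content` would not catch because its input has no blank lines.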