Commit e86a11f

feat: adding RegexTextExtractor component (#378)
* initial import of component
* stripping down parameters
* adding tests
* adding missing files
* adding missing test files
* adding component to README.MD
* LICENSE header checks running on tests
* Update haystack_experimental/components/extractors/regex_text_extractor.py

  Co-authored-by: Julian Risch <[email protected]>
* Update haystack_experimental/components/extractors/regex_text_extractor.py

  Co-authored-by: Julian Risch <[email protected]>
* Update test/components/extractors/test_regex_text_extractor.py

  Co-authored-by: Julian Risch <[email protected]>
* Update test/components/extractors/test_regex_text_extractor.py

  Co-authored-by: Julian Risch <[email protected]>
* PR comments
* removing unused imports

---------

Co-authored-by: Julian Risch <[email protected]>
1 parent 5bc99c2 commit e86a11f

4 files changed

+325
-12
lines changed

README.md

Lines changed: 15 additions & 12 deletions
@@ -41,18 +41,19 @@ that includes it. Once it reaches the end of its lifespan, the experiment will b
 
 ### Active experiments
 
-| Name | Type | Expected End Date | Dependencies | Cookbook | Discussion |
-|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|-------------------|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
-| [`InMemoryChatMessageStore`][1] | Memory Store | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
-| [`ChatMessageRetriever`][2] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
-| [`ChatMessageWriter`][3] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
-| [`QueryExpander`][5] | Query Expansion Component | October 2025 | None | None | [Discuss][6] |
-| [`EmbeddingBasedDocumentSplitter`][8] | EmbeddingBasedDocumentSplitter | August 2025 | None | None | [Discuss][7] |
-| [`MultiQueryEmbeddingRetriever`][13] | MultiQueryEmbeddingRetriever | November 2025 | None | None | [Discuss][11] |
-| [`MultiQueryTextRetriever`][14] | MultiQueryTextRetriever | November 2025 | None | None | [Discuss][12] |
-| [`OpenAIChatGenerator`][9] | Chat Generator Component | November 2025 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/hallucination_score_calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][10] |
-| [`MarkdownHeaderLevelInferrer`][15] | Preprocessor | January 2025 | None | None | [Discuss][16] |
-| [`Agent`][17]; [Confirmation Policies][18]; [ConfirmationUIs][19]; [ConfirmationStrategies][20]; [`ConfirmationUIResult` and `ToolExecutionDecision`][21] [HITLBreakpointException][22] | Human in the Loop | December 2025 | rich | None | [Discuss][23] |
+| Name | Type | Expected End Date | Dependencies | Cookbook | Discussion |
+|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|-------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| [`InMemoryChatMessageStore`][1] | Memory Store | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| [`ChatMessageRetriever`][2] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| [`ChatMessageWriter`][3] | Memory Component | December 2024 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/conversational_rag_using_memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][4] |
+| [`QueryExpander`][5] | Query Expansion Component | October 2025 | None | None | [Discuss][6] |
+| [`EmbeddingBasedDocumentSplitter`][8] | EmbeddingBasedDocumentSplitter | August 2025 | None | None | [Discuss][7] |
+| [`MultiQueryEmbeddingRetriever`][13] | MultiQueryEmbeddingRetriever | November 2025 | None | None | [Discuss][11] |
+| [`MultiQueryTextRetriever`][14] | MultiQueryTextRetriever | November 2025 | None | None | [Discuss][12] |
+| [`OpenAIChatGenerator`][9] | Chat Generator Component | November 2025 | None | <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/hallucination_score_calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> | [Discuss][10] |
+| [`MarkdownHeaderLevelInferrer`][15] | Preprocessor | January 2025 | None | None | [Discuss][16] |
+| [`Agent`][17]; [Confirmation Policies][18]; [ConfirmationUIs][19]; [ConfirmationStrategies][20]; [`ConfirmationUIResult` and `ToolExecutionDecision`][21] [HITLBreakpointException][22] | Human in the Loop | December 2025 | rich | None | [Discuss][23] |
+| [`RegexTextExtractor`][24] | Text Extractor Component | January 2025 | None | None | [Discuss][25] |
 
 [1]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/chat_message_stores/in_memory.py
 [2]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/retrievers/chat_message_retriever.py
@@ -77,6 +78,8 @@ that includes it. Once it reaches the end of its lifespan, the experiment will b
 [21]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/agents/human_in_the_loop/dataclasses.py
 [22]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/agents/human_in_the_loop/errors.py
 [23]: https://github.com/deepset-ai/haystack-experimental/discussions/XXX
+[24]: https://github.com/deepset-ai/haystack-experimental/blob/main/haystack_experimental/components/extractors/regex_text_extractor.py
+[25]: https://github.com/deepset-ai/haystack-experimental/discussions/XXX
 
 ### Adopted experiments
 | Name | Type | Final release |

haystack_experimental/components/extractors/regex_text_extractor.py

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+import re
+from typing import Union
+
+from haystack import component, logging
+from haystack.dataclasses import ChatMessage
+
+logger = logging.getLogger(__name__)
+
+
+@component
+class RegexTextExtractor:
+    """
+    Extracts text from chat message or string input using a regex pattern.
+
+    RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
+    It can be configured to search through all messages or only the last message in a list of ChatMessages.
+
+    ### Usage example
+
+    ```python
+    from haystack_experimental.components.extractors import RegexTextExtractor
+    from haystack.dataclasses import ChatMessage
+
+    # Using with a string
+    parser = RegexTextExtractor(regex_pattern='<issue url=\"(.+)\">')
+    result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
+    # result: {"captured_text": "github.com/hahahaha"}
+
+    # Using with ChatMessages
+    messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
+    result = parser.run(text_or_messages=messages)
+    # result: {"captured_text": "github.com/hahahaha"}
+    ```
+    """
+
+    def __init__(self, regex_pattern: str):
+        """
+        Creates an instance of the RegexTextExtractor component.
+
+        :param regex_pattern:
+            The regular expression pattern used to extract text.
+            The pattern should include a capture group to extract the desired text.
+            Example: '<issue url=\"(.+)\">' captures 'github.com/hahahaha' from '<issue url="github.com/hahahaha">'.
+        """
+        self.regex_pattern = regex_pattern
+
+        # Check if the pattern has at least one capture group
+        num_groups = re.compile(regex_pattern).groups
+        if num_groups < 1:
+            logger.warning(
+                "The provided regex pattern {regex_pattern} doesn't contain any capture groups. "
+                "The entire match will be returned instead.",
+                regex_pattern=regex_pattern,
+            )
+
+    @component.output_types(captured_text=str, captured_texts=list[str])
+    def run(self, text_or_messages: Union[str, list[ChatMessage]]) -> dict:
+        """
+        Extracts text from input using the configured regex pattern.
+
+        :param text_or_messages:
+            Either a string or a list of ChatMessage objects to search through.
+
+        :returns:
+            - If match found: {"captured_text": "matched text"}
+            - If no match and return_empty_on_no_match=True: {}
+
+        :raises:
+            - ValueError: if receiving a list the last element is not a ChatMessage instance.
+        """
+        if isinstance(text_or_messages, str):
+            return RegexTextExtractor._build_result(self._extract_from_text(text_or_messages))
+        if not text_or_messages:
+            logger.warning("Received empty list of messages")
+            return {}
+        return self._process_last_message(text_or_messages)
+
+    @staticmethod
+    def _build_result(result: Union[str, list[str]]) -> dict:
+        """Helper method to build the return dictionary based on configuration."""
+        if (isinstance(result, str) and result == "") or (isinstance(result, list) and not result):
+            return {}
+        return {"captured_text": result}
+
+    def _process_last_message(self, messages: list[ChatMessage]) -> dict:
+        """Process only the last message and build the result."""
+        last_message = messages[-1]
+        if not isinstance(last_message, ChatMessage):
+            raise ValueError(f"Expected ChatMessage object, got {type(last_message)}")
+        if last_message.text is None:
+            logger.warning("Last message has no text content")
+            return {}
+        result = self._extract_from_text(last_message.text)
+        return RegexTextExtractor._build_result(result)
+
+    def _extract_from_text(self, text: str) -> Union[str, list[str]]:
+        """
+        Extract text using the regex pattern.
+
+        :param text:
+            The text to search through.
+
+        :returns:
+            The text captured by the first capturing group in the regex pattern.
+            If the pattern has no capture groups, returns the entire match.
+            If no match is found, returns an empty string.
+        """
+        match = re.search(self.regex_pattern, text)
+        if not match:
+            return ""
+        if match.groups():
+            return match.group(1)
+        return match.group(0)
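
The extraction rules in `_extract_from_text` and `_build_result` can be tried out without installing Haystack. The following is a minimal sketch using only the standard-library `re` module. The function names `extract_first_group` and `build_result` are hypothetical stand-ins for the private methods in the diff, not part of the component's API.

```python
import re
from typing import Union


def extract_first_group(pattern: str, text: str) -> str:
    """Hypothetical stand-alone mirror of `_extract_from_text` from the diff."""
    match = re.search(pattern, text)
    if not match:
        return ""  # no match at all
    if match.groups():
        return match.group(1)  # first capture group
    return match.group(0)  # pattern has no groups: whole match


def build_result(result: Union[str, list]) -> dict:
    """Hypothetical mirror of `_build_result`: empty captures become an empty dict."""
    if (isinstance(result, str) and result == "") or (isinstance(result, list) and not result):
        return {}
    return {"captured_text": result}


text = '<issue url="github.com/hahahaha">hahahah</issue>'

# Pattern with a capture group: only the group is returned.
print(build_result(extract_first_group(r'<issue url="(.+)">', text)))
# {'captured_text': 'github.com/hahahaha'}

# Pattern without a capture group: the whole match is returned.
print(build_result(extract_first_group(r'<issue url=".+">', text)))
# {'captured_text': '<issue url="github.com/hahahaha">'}

# No match: the result dictionary is empty.
print(build_result(extract_first_group(r'<pr url="(.+)">', text)))
# {}

# The constructor's warning check relies on the compiled pattern's group count.
print(re.compile(r'<issue url=".+">').groups)  # 0
```

Two details of the diff are worth keeping in mind when reading this sketch: the `:returns:` section of `run` mentions `return_empty_on_no_match`, which is not a constructor parameter in the code shown, and `run` only ever returns the `captured_text` key even though `captured_texts` is also declared as an output type.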

test/components/extractors/__init__.py

Whitespace-only changes.
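
Returning to the usage example in the `RegexTextExtractor` docstring: its pattern uses a greedy `(.+)`, which works for the single-tag input shown there. As a hedged illustration using plain `re` (the input string with two tags and the `.example` hosts are made up for this sketch), a greedy group can capture more than intended when the text contains several matching tags, while a lazy `(.+?)` or a negated character class stops at the first closing quote.

```python
import re

# Assumed input with two <issue> tags; not taken from the commit.
text = '<issue url="a.example">one</issue> <issue url="b.example">two</issue>'

# Greedy: (.+) runs to the LAST '">' in the text.
greedy_capture = re.search(r'<issue url="(.+)">', text).group(1)

# Lazy: (.+?) stops at the FIRST '">'.
lazy_capture = re.search(r'<issue url="(.+?)">', text).group(1)

# Negated class: never crosses a double quote.
negated_capture = re.search(r'<issue url="([^"]+)">', text).group(1)

print(greedy_capture)   # a.example">one</issue> <issue url="b.example
print(lazy_capture)     # a.example
print(negated_capture)  # a.example
```

Since the component passes `regex_pattern` straight to `re.search`, the choice between these forms is left to whoever writes the pattern.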