Description
Self Checks
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-English title submissions will be closed directly (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
Describe your problem
My goal
I have some Markdown (.md) files that I have pre-processed. Each entry/record in these files is explicitly separated by `\n\n<hr>\n\n`.
My goal is to have RAGFlow treat each section separated by `<hr>` as a single, complete chunk, regardless of how long that section is.
Example data structure in `my_file.md`:

```markdown
# Entry 1 Title
This is the content for entry 1.
It might be very long, over 1000 characters.
## SubTitle
...
...
<hr>
# Entry 2 Title
This is the content for entry 2.
It might be short.
<hr>
# Entry 3 Title
This is another long entry.
...
```

My desired chunks are: Chunk 1 = all content for Entry 1, Chunk 2 = all content for Entry 2, and so on.
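The desired behavior can be sketched as a plain Python pre-processing step (`SEPARATOR` and `split_entries` are my own illustrative names, not RAGFlow API):

```python
# Split pre-processed Markdown into one chunk per entry, using the
# explicit delimiter described above. This only illustrates the desired
# chunking; it is not RAGFlow's internal logic.
SEPARATOR = "\n\n<hr>\n\n"

def split_entries(text: str) -> list[str]:
    """Return one chunk per entry, regardless of entry length."""
    return [part.strip() for part in text.split(SEPARATOR) if part.strip()]

doc = (
    "# Entry 1 Title\nThis is the content for entry 1.\n\n<hr>\n\n"
    "# Entry 2 Title\nThis is the content for entry 2."
)
chunks = split_entries(doc)
# chunks[0] holds all of Entry 1; chunks[1] holds all of Entry 2
```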
What I have tried
Using the "General" Knowledge Base Import
When I upload the file directly to a Knowledge Base using the "General" method, RAGFlow still splits my entries. For example, even though I set the "Text segmentation identifier" to `\n\n<hr>\n\n`, an "Entry 1" of 1500 characters is split into 3-4 smaller chunks when chunk_size is 512. This breaks the context of my data. (In this case I have 1000 structured entries, and with chunk_size 512 my file is split into 2127 chunks.)
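The over-splitting is consistent with size-capped chunking: each entry longer than chunk_size is broken into roughly ceil(length / chunk_size) pieces, which is how 1000 entries can balloon into 2000+ chunks. A minimal sketch of that arithmetic (my own illustration, not RAGFlow's actual splitter):

```python
import math

def count_chunks(entry_lengths: list[int], chunk_size: int = 512) -> int:
    # Each entry is split into ceil(len / chunk_size) size-capped pieces;
    # entries shorter than chunk_size stay as a single chunk.
    return sum(math.ceil(n / chunk_size) for n in entry_lengths)

count_chunks([1500])       # a 1500-char entry -> 3 chunks at chunk_size=512
count_chunks([400, 1500])  # 1 + 3 = 4 chunks
```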
Using an Ingestion Pipeline (Agent):
I tried to build a custom processing pipeline to control this. My pipeline looks like this (using the template): File -> Parser -> Chunker -> Tokenizer (Indexer)
I hoped the Chunker could split the Markdown file at H1 headings, but I ran into an error: [ERROR] Input error: ... Input should be 'json' or 'chunks' [type=literal_error, input_value='text']
My Question
How can I correctly configure RAGFlow to chunk only on my custom `<hr>` separator and prevent any further splitting (e.g., by chunk_size or paragraphs) within those sections?
A) Is there a setting in the standard Knowledge Base import that I missed (e.g., "split by custom regex" or "disable size limit")?
B) If this must be done with an Ingestion Pipeline, what is the correct node setup and configuration to achieve this? Is there anything I missed?
Thanks for your time : )