
[Question]: How to chunk a Markdown file only by a custom delimiter (e.g., <hr>) without further splitting? #10890

@VVX94

Description


Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Describe your problem

My goal

I have some Markdown (.md) files that I have pre-processed. Each entry/record in these files is explicitly separated by `\n\n<hr>\n\n`.

My goal is to have RAGFlow treat each section separated by `<hr>` as a single, complete chunk, regardless of how long that section is.

Example data structure in my_file.md:

```markdown
# Entry 1 Title
This is the content for entry 1.
It might be very long, over 1000 characters.
## SubTitle
...
...

<hr>

# Entry 2 Title
This is the content for entry 2.
It might be short.

<hr>

# Entry 3 Title
This is another long entry.
...
```

My desired chunks would then be: Chunk 1 (all content for Entry 1), Chunk 2 (all content for Entry 2), and so on.
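In other words, the behaviour I am after is equivalent to a plain split on the delimiter. A minimal Python sketch, for illustration only (this is not RAGFlow code):

```python
# Illustration only: the chunking behaviour I expect, expressed as a plain
# Python split on the custom delimiter. This is not RAGFlow code.
from pathlib import Path

text = Path("my_file.md").read_text(encoding="utf-8")

# One chunk per entry, however long; empty pieces (e.g. trailing whitespace) are dropped.
chunks = [piece.strip() for piece in text.split("\n\n<hr>\n\n") if piece.strip()]

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {len(chunk)} characters")
```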

What I have tried

Using the "General" Knowledge Base Import

When I upload the file directly to a Knowledge Base using the "General" chunking method, RAGFlow still splits my entries. For example, even though I set the "Text segmentation identifier" to `\n\n<hr>\n\n`, an entry like "Entry 1" that is 1500 characters long gets split into 3–4 smaller chunks when chunk_size is 512. This breaks the context of my data. (In this case I have 1000 structured entries, and with chunk_size = 512 the file ends up split into 2127 chunks.)
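To make the effect concrete, what I observe behaves roughly like the following sketch (this is only my reading of the behaviour, not RAGFlow's actual code):

```python
# Rough illustration of the behaviour I observe (not RAGFlow's actual code):
# an entry longer than chunk_size seems to be re-split into chunk_size pieces,
# even though the custom segmentation identifier is set.
chunk_size = 512
entry = "x" * 1500          # stands in for "Entry 1" (1500 characters)
pieces = [entry[i:i + chunk_size] for i in range(0, len(entry), chunk_size)]
print(len(pieces))          # 3 -- which is how 1000 entries end up as 2127 chunks
```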


Using an Ingestion Pipeline (Agent):

I tried to build a custom processing pipeline to control this. My pipeline (based on the template) looks like this: File -> Parser -> Chunker -> Tokenizer (Indexer)


I hoped the Chunker could split the Markdown file by H1 headings, but I ran into an error: `[ERROR] Input error: ... Input should be 'json' or 'chunks' [type=literal_error, input_value='text']`
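If pre-splitting outside RAGFlow is an acceptable workaround, I could write the entries (split as in the earlier sketch) to a JSON file before upload. Note that the output layout and the "text" field name below are my assumptions, not a documented RAGFlow format:

```python
# Hypothetical workaround: split the Markdown on the custom delimiter and write
# one JSON object per entry. The "text" field name is an assumption on my part,
# not a documented RAGFlow schema.
import json
from pathlib import Path

entries = [
    piece.strip()
    for piece in Path("my_file.md").read_text(encoding="utf-8").split("\n\n<hr>\n\n")
    if piece.strip()
]

Path("my_file_chunks.json").write_text(
    json.dumps([{"text": entry} for entry in entries], ensure_ascii=False, indent=2),
    encoding="utf-8",
)
```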

My Question

How can I correctly configure RAGFlow to chunk only on my custom `<hr>` separator and prevent any further splitting (like by chunk_size or paragraphs) within those sections?
A) Is there a setting in the standard Knowledge Base import that I missed (e.g., "split by custom regex" or "disable size limit")?

B) If this must be done with an Ingestion Pipeline, what is the correct node setup and configuration to achieve this? Is there anything I missed?

Thanks for your time : )
