Replies: 2 comments
wow~ ⊙o⊙, I understand what you mean, thank you for your comprehensive and detailed suggestions! @fixitup8
Thanks for your suggestion!
https://github.com/Agent-on-the-Fly/Memento
A RAGFlow pipeline typically ingests a user query, runs it through information-extraction and retrieval stages, then augments an LLM prompt with retrieved context to generate an answer. To graft Memento's memory-based continual-learning loop onto this pipeline, one can treat Memento as an external planning/memory service that sits alongside the retrieval and generation stages. In this design, an incoming query first triggers the usual RAGFlow steps (preprocessing, indexing, retrieval) to collect relevant documents or chunks. Before invoking the LLM, the pipeline calls Memento's memory-read step to fetch analogous past "cases" (experience tuples) that are relevant to the current query.
These cases are then concatenated or injected into the LLM prompt as additional context (much like extra retrieved facts). The LLM then generates an answer. After generation (and any tool execution), a feedback signal (e.g. answer correctness) is passed back to Memento, which triggers the memory update (write) loop. In effect, Memento’s planner-executor-episodic memory architecture runs in parallel with the RAGFlow pipeline: the planner suggests subtasks (here, refining retrieval or answer strategy), the executor is the LLM (with optional tools), and the Case Bank accumulates (query, plan/answer, success) tuples. For example:
Step 1: User query → RAGFlow retrieval (documents, knowledge base lookup) → Preliminary context.
Step 2: Memory retrieval (Memento Case Bank) → fetch similar past cases based on query embedding.
Step 3: Build prompt = query + retrieved documents + retrieved cases → LLM answer.
Step 4: Execute any tools or verify answer. Measure success/feedback.
Step 5: Memory write = store (query, answer, feedback) into the Case Bank (possibly only the final step of each trajectory).
This loop can be summarized in pseudocode:
def handle_query(query, memory_bank, retrieval_index, llm, top_k=4):
    docs = retrieval_index.search(query)                   # RAG retrieval
    cases = memory_bank.read_similar(query, top_k=top_k)   # Memento memory read
    prompt = compose_prompt(query, docs, cases)            # combine context
    answer = llm.generate(prompt)                          # LLM generation
    feedback = evaluate_answer(answer)                     # user or automatic feedback
    memory_bank.write(case=(query, answer, feedback))      # Memento memory write
    return answer
A typical RAGFlow pipeline (query analysis → multi-index retrieval → ranking → LLM answer) can be augmented by inserting Memento’s memory read/write steps. Retrieved past cases enrich the prompt before LLM generation, and post-answer feedback is used to update memory.
==================================
Managing Evolving Memory
Memento’s memory (the Case Bank) grows online as new queries are answered. To prevent uncontrolled growth and redundancy, Memento only writes a compact summary of each episode into memory. In practice, each “case” is stored as a tuple of (state, action, reward) – for example, (query, answer, success_flag) – and only the final step of each trajectory is written. This keeps the memory bank concise and informative. In the RAG context, this means each completed QA turn (or conversation turn) yields at most one memory entry. Care should be taken to prune or archive stale or low-value cases (e.g. duplicate queries or irrelevant answers) as the bank grows. One strategy is to limit the bank to the top-N most similar/valuable cases per query cluster, or to periodically remove very old entries.
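A minimal sketch of this write policy, assuming the bank is a plain Python list and a trajectory is a list of (state, action, reward) steps; only the final step is stored, and a size cap keeps growth in check:

def write_episode(case_bank, trajectory, max_cases=10_000):
    """Store only a compact summary of the episode: its final (state, action, reward) step."""
    state, action, reward = trajectory[-1]   # e.g. (query, answer, success_flag)
    case_bank.append({"query": state, "answer": action, "reward": reward})
    # Keep the bank bounded: when it overflows, drop the lowest-reward entries first.
    if len(case_bank) > max_cases:
        case_bank.sort(key=lambda c: c["reward"])
        del case_bank[: len(case_bank) - max_cases]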
During a multi-turn conversation or document session, the memory can evolve on two scales:
Session (short-term) memory: Store recent utterances and answers within the current dialog as a transient context. This can be kept in fast-access memory and optionally injected into each new prompt.
Global (long-term) memory: Persist the Case Bank across sessions (e.g. in a database or vector index). Use a vector search index (like Elasticsearch or a vector DB) to efficiently retrieve similar past cases by embedding similarity. For modularity, keep these stores separate: e.g. use RAGFlow’s existing document index for static knowledge, and a separate vector index for cases.
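As a concrete, illustrative separation, the long-term store can be a small embedding index of its own, assuming some embed_fn that maps text to a vector (any sentence encoder would do); it lives apart from RAGFlow's document index and is queried by cosine similarity:

import numpy as np

class GlobalCaseMemory:
    """Long-term Case Bank backed by embedding similarity, kept separate from the document index."""

    def __init__(self, embed_fn):
        self.embed = embed_fn   # assumed text -> vector function
        self.vectors = []       # case embeddings
        self.payloads = []      # (query, answer, reward) tuples

    def add(self, query, answer, reward):
        self.vectors.append(np.asarray(self.embed(query), dtype=float))
        self.payloads.append((query, answer, reward))

    def read_similar(self, query, top_k=4):
        if not self.vectors:
            return []
        q = np.asarray(self.embed(query), dtype=float)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        best = np.argsort(-sims)[:top_k]
        return [self.payloads[i] for i in best]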
In all cases, updates should be transactional or asynchronous. For example, after the LLM generates an answer and it’s evaluated, an event can be queued to add that case to memory. Batch writes or eventual consistency avoids slowing down the real-time pipeline.
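For example, a background worker fed by an in-process queue (shown here with Python's standard library; a message broker would play the same role in a distributed deployment) keeps writes off the answer path:

import queue
import threading

case_bank = []                 # shared in-process store; a real deployment would use a durable DB
write_queue = queue.Queue()

def memory_writer():
    """Background worker: consumes new-answer events and writes them to the Case Bank."""
    while True:
        event = write_queue.get()      # blocks until an event arrives
        if event is None:              # sentinel used to shut the worker down cleanly
            break
        case_bank.append(event)        # event is a (query, answer, reward) tuple
        write_queue.task_done()

threading.Thread(target=memory_writer, daemon=True).start()

# In the request path, enqueueing is cheap and never blocks answer generation:
# write_queue.put((query, answer, reward))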
==================================
Memory-Enhanced Retrieval and Context Injection
Memento’s key advantage is using past experience to guide future actions. In a RAG pipeline, memory can improve retrieval and answer generation by augmenting the context. For each new query, the system should retrieve not only documents from the static knowledge base but also the most semantically similar “cases” from memory. As noted, Memento’s planner concatenates retrieved cases with the current query to form the prompt. Analogously, a RAGFlow pipeline can inject relevant memory snippets alongside the usual retrieved documents. This “memory context” might include previously successful answers, partial plans, or even notes about what worked in the past. By treating memory entries as additional “pseudo-documents,” the LLM sees analogous situations when generating its answer.
For example, if a user previously asked “What is the tallest building in City X?” and the system answered correctly, that Q&A pair can be stored. When later the user (or another user) asks a similar question about City X or a related city, memory retrieval will supply the prior case. The prompt might then read: “Similar to earlier: [Past Q&A]. Now user asks: [new Q].” This guides the LLM toward the correct answer (like a strong prompt hint). Technically, memory retrieval can occur in the retrieval stage (mixing memory vectors into the search), or as a separate “memory-to-prompt” step just before generation.
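A simple prompt-assembly helper along these lines makes the "memory-to-prompt" step explicit (the template wording is an illustrative choice, not something prescribed by Memento or RAGFlow); it also fills in the compose_prompt used in the pseudocode above:

def compose_prompt(query, docs, cases):
    """Combine retrieved documents and similar past cases into a single LLM prompt."""
    parts = ["Retrieved documents:"]
    parts += [f"- {doc}" for doc in docs]
    if cases:
        parts.append("Similar past cases (question -> answer):")
        parts += [f"- Q: {q} -> A: {a}" for q, a, _reward in cases]
    parts.append(f"Now answer the user's question: {query}")
    return "\n".join(parts)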
Memory can also refine the retrieval query itself: e.g., key phrases or entities from top cases can be added to the search query to fetch better documents. In other words, memory can act either as additional context or as a query-rewriting signal. In Memento, a Case Memory module ranks and supplies cases via cosine similarity or a learned Q-value (arxiv.org). A similar architecture can be used here: index the memory case embeddings and retrieve the top-k case texts per query. Those case texts then become part of the prompt, or even part of the knowledge index for hybrid search.
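A rough sketch of the query-rewriting variant, using the read_similar interface sketched earlier and a crude length/overlap heuristic standing in for a real keyword extractor (both are assumptions for illustration):

def expand_query(query, memory, max_terms=5):
    """Append salient terms from successful past cases to the document-retrieval query."""
    extra_terms = []
    for past_query, past_answer, reward in memory.read_similar(query, top_k=3):
        if reward <= 0:
            continue                     # only borrow from cases that worked
        for term in past_answer.split():
            if len(term) > 4 and term.lower() not in query.lower():
                extra_terms.append(term)
    return query + " " + " ".join(extra_terms[:max_terms])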
==================================
Continual Learning and Feedback Loop
A crucial benefit of Memento is online learning: every answer is an opportunity to learn. In a RAGFlow setting, feedback can come from user ratings, downstream task success, or heuristic signals (e.g. did the user follow up with another question, indicating dissatisfaction?). Memento formalizes this via reinforcement-learning-style signals: each case has an associated reward (success or failure), and the planner (LLM) "policy" is updated by writing high-reward cases into memory and (if parametric) by adjusting retrieval Q-values (arxiv.org).
In practice, structure the RAG loop as follows:
After LLM generation: Evaluate the answer. This could be an automated check (e.g. verifying facts against a knowledge base) or explicit user feedback (“Was this helpful?”).
Reward signals: Convert feedback into a numeric or categorical reward. For instance, +1 for correct, 0 for incorrect.
Memory write: If the episode succeeded (or even if it failed, to learn what went wrong), write the final state/action/reward to the Case Bank. Memento's design only writes upon task completion (arxiv.org), so in a conversational RAG setting one might write only at the end of a user session or after each closed question.
Policy improvement via retrieval: Over time, the memory grows to reflect what worked. The retrieval step "learns" because successful cases become more likely to be retrieved for similar future queries. (In Memento's parametric variant, a Q-function is updated on writes to bias future reads; arxiv.org.)
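Putting the steps above together, a minimal sketch of closing one episode; the reward mapping and the memory.add interface are illustrative assumptions, not Memento's exact scheme:

def close_episode(query, answer, user_feedback, memory):
    """Convert feedback into a reward and write one compact case at episode end."""
    # Illustrative mapping: thumbs-up -> 1.0, thumbs-down -> 0.0, no signal -> neutral 0.5.
    reward = {"up": 1.0, "down": 0.0}.get(user_feedback, 0.5)
    # Write once per closed question (or per session), success or failure alike,
    # so the bank also records what did not work.
    memory.add(query, answer, reward)
    return reward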
Importantly, this feedback loop is separate from the LLM weights: the LLM itself is not fine-tuned or updated (Memento's core insight; arxiv.org). Instead, learning happens by shaping the memory. Thus, the RAG pipeline remains flexible: the LLM can be replaced or updated independently, while the memory bank carries the adaptation. Over time, the system should improve retrieval relevance and prompt assembly, yielding better answers without ever re-training the model.
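One simple way to let high-reward cases surface more often at read time, shown here as an illustrative heuristic rather than Memento's actual parametric Q-function, is to blend embedding similarity with stored reward when ranking candidate cases:

def rank_cases(similarities, rewards, alpha=0.8):
    """Order candidate cases by a blend of embedding similarity and stored reward."""
    scores = [alpha * sim + (1.0 - alpha) * rew for sim, rew in zip(similarities, rewards)]
    # Higher score = more similar to the current query and/or more successful in the past.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)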
==================================
Modularity and Interface Design
To preserve clean separation between Memento and RAGFlow, design each as an independent service or module with well-defined interfaces. For example:
Memory Service: Expose Memento’s read/write via an API. For reads, accept a query embedding and return top-N cases. For writes, accept (query, answer, reward). This service encapsulates the vector store, case indexing, and similarity search internally.
RAG Pipeline: Orchestrate the flow: call the Memory Service before/after LLM invocation. The pipeline code should not depend on Memento’s internals; it just passes prompts and feedback.
Communication: Use asynchronous queues or REST calls. For instance, the RAGFlow task executor could publish “new-answer” events to a message queue, which a separate memory-worker consumes to perform writes. This decouples runtime latency.
Maintain modularity by treating Memento as a black-box "context source." That way, the RAG pipeline can fall back gracefully (e.g. if memory is unavailable, it simply proceeds without memory context). Similarly, Memento can be upgraded or scaled separately (e.g. sharding the Case Bank) without touching the retrieval or LLM code. In a containerized deployment, Memento could even be an independent container/pod with its own stateful storage (e.g. Redis or a vector DB).
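A thin client wrapper in this spirit keeps the pipeline decoupled from Memento's internals and degrades gracefully when the service is unreachable; the endpoint paths and payload shapes below are assumptions for illustration, not an existing API:

import requests

class MemoryClient:
    """Thin HTTP client for a Memento-style memory service; the pipeline never sees its internals."""

    def __init__(self, base_url, timeout=0.5):
        self.base_url = base_url
        self.timeout = timeout          # keep memory lookups off the latency-critical path

    def read_similar(self, query, top_k=4):
        try:
            resp = requests.post(f"{self.base_url}/cases/search",
                                 json={"query": query, "top_k": top_k},
                                 timeout=self.timeout)
            resp.raise_for_status()
            return resp.json().get("cases", [])
        except requests.RequestException:
            return []                   # graceful fallback: answer from static docs only

    def write(self, query, answer, reward):
        try:
            requests.post(f"{self.base_url}/cases",
                          json={"query": query, "answer": answer, "reward": reward},
                          timeout=self.timeout)
        except requests.RequestException:
            pass                        # writes are best-effort; a queue can retry later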
==================================
Pipeline Lifecycle and State Management
A dynamic RAG+Memento pipeline mixes stateless inference and stateful memory. Best practices:
Persistent Storage: Keep the Case Bank in durable storage (database or vector index) so that memory survives restarts. RAGFlow already uses Elasticsearch/MinIO for documents (github.com); similarly, use a persistent store (e.g. a vector DB or database table) for cases.
Session State: For conversational agents, maintain ephemeral state (chat history, conversation memory) in-session. This could be cached in memory or a short-lived store (Redis). Do not mix up session context with the long-term Case Bank.
Scalability: Since memory reads/writes are frequent, index them (e.g. using approximate nearest neighbors) for speed. Also consider batching writes. If running multiple RAGFlow workers, coordinate memory access with locks or distributed transactions to avoid duplicates.
Fallbacks: Implement fallbacks if memory lookup fails or is slow. The RAGFlow pipeline can still answer from static docs even if memory is offline.
Monitoring and Pruning: Monitor memory size and retrieval latency. Periodically prune unhelpful cases (low-reward or rarely used) to prevent "catastrophic forgetting" of useful info (arxiv.org); a pruning sketch follows after this list.
Stateless vs. Stateful Components: The LLM and retrieval calls are inherently stateless (each query is independent), so they scale horizontally. The memory service is stateful and may become a bottleneck; manage it like a database service (e.g. scaling shards, using a cloud vector store).
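A minimal pruning pass along the lines of the Monitoring and Pruning point, assuming each stored case carries a reward and a last_retrieved_at timestamp (field names are illustrative):

import time

def prune_cases(cases, min_reward=0.5, max_idle_days=90, now=None):
    """Maintenance pass: keep cases that are either useful (high reward) or recently retrieved."""
    now = now if now is not None else time.time()
    max_idle_seconds = max_idle_days * 24 * 3600
    kept = [
        c for c in cases
        if c["reward"] >= min_reward or (now - c["last_retrieved_at"]) <= max_idle_seconds
    ]
    # Cases that are both low-reward and long-unused are dropped (or archived elsewhere).
    return kept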
In summary, treat the inference pipeline as stateless microservices orchestrated by a workflow engine, with memory as a separate stateful component. This separation lets you restart or scale parts of the system independently. For example, you can upgrade the LLM without touching memory, or extend memory capacity without retraining. By clearly delineating where memory enters and exits the pipeline, you preserve modularity and make the system maintainable.
==================================
Citations
Key principles and design patterns are drawn from Memento's architecture and from RAGFlow's documented pipeline. Memento uses a planner–executor framework with an episodic case memory to continuously adapt an agent without changing LLM weights (arxiv.org). Retrieval-Augmented Generation systems like RAGFlow typically draw on fixed corpora (arxiv.org), so integrating Memento provides the needed adaptability. RAGFlow itself stages extraction, indexing, retrieval, and generation (bestarion.com), into which the memory read/write steps can be inserted. Memento's documentation shows concatenating similar past cases with the current instruction (arxiv.org) and only writing final outcomes to memory for compactness (arxiv.org). These inform the above recommendations for evolving memory, retrieval enhancement, feedback incorporation, modular interfacing, and stateful service design.