[Algorithm] Advanced RAG: Query Characterization, Expansion, and Group Ranking for Improved Retrieval #39

@karthik18495

Summary

Implement an advanced RAG pipeline that, for each user query, performs the following steps:

  1. Characterizes the query type (e.g., theory or experiment).
  2. Generates 3 additional queries aligned with the original query to improve information retrieval (query rewriting/expansion).
  3. Searches the database with all queries in parallel.
  4. Aggregates and group-ranks the retrieved information.
  5. Passes top-ranked information to the LLM to generate a cohesive answer.
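
A rough sketch of how steps 1 to 3 could fit together, to make the data flow concrete. Everything here is a stand-in: `classify_query` and `expand_query` are toy, hypothetical replacements for the LLM chains, `toy_retriever` and `CORPUS` replace the vector database, and a standard-library thread pool is used in place of LangChain's RunnableParallel so the example runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical keyword list; the real step would use an LLM classifier.
EXPERIMENT_TERMS = ("measure", "detector", "experiment", "data")

# Toy corpus standing in for the vector database.
CORPUS = {
    "doc1": "jet energy calibration in the detector",
    "doc2": "asymptotic freedom in qcd",
}


def classify_query(query: str) -> str:
    """Step 1 (stand-in): label the query as 'theory' or 'experiment'."""
    lowered = query.lower()
    return "experiment" if any(t in lowered for t in EXPERIMENT_TERMS) else "theory"


def expand_query(query: str, query_type: str) -> list[str]:
    """Step 2 (stand-in): produce 3 rewrites; an LLM would do this in practice."""
    return [
        f"{query} ({query_type} overview)",
        f"background for: {query}",
        f"key results related to: {query}",
    ]


def toy_retriever(query: str) -> list[str]:
    """Return ids of documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in CORPUS.items() if words & set(text.split())]


def retrieve_all(
    queries: list[str], retriever: Callable[[str], list[str]]
) -> dict[str, list[str]]:
    """Step 3: run one retrieval per query concurrently, keyed by query."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        hits = list(pool.map(retriever, queries))  # map preserves input order
    return dict(zip(queries, hits))


def run_pipeline(query: str, retriever: Callable[[str], list[str]]) -> dict:
    """Classify, expand, then retrieve with the original query plus 3 rewrites."""
    query_type = classify_query(query)
    queries = [query] + expand_query(query, query_type)
    return {
        "type": query_type,
        "queries": queries,
        "results": retrieve_all(queries, retriever),
    }


if __name__ == "__main__":
    out = run_pipeline("How do we measure the jet energy?", toy_retriever)
    print(out["type"])
    for q, docs in out["results"].items():
        print(q, "->", docs)
```

Keeping the per-query results keyed by the query string is one way to preserve which query retrieved which document for the ranking and transparency steps.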

Implementation Plan (LangChain)

  1. Query Characterization Node

    • Use an LLM chain with a prompt to classify the incoming query as "theory" or "experiment" (or other scientific categories as needed).
    • Store the type for downstream logic and metadata.
  2. Query Expansion Node

    • Use an LLM chain to rewrite the original query into 3 additional queries that are semantically aligned but phrased differently to maximize recall.
    • Example prompt: "Given the query below, generate 3 alternative queries that would help retrieve more relevant information about the same topic, considering it is a [theory|experiment] question."
  3. Parallel Retrieval Node

    • Search the vector database (or multiple databases) using all 4 queries (original + 3 expansions).
    • Use LangChain's RunnableParallel to run retrievals concurrently.
  4. Group Ranking Node

    • Combine all retrieved results.
    • Deduplicate by document ID or content, and aggregate scores where the retriever provides them.
    • Use a custom ranking function or an LLM chain to group and re-rank results based on relevance to the original query and coverage across expanded queries.
    • Optionally, annotate which query retrieved which document for transparency.
  5. Cohesive Answer Generation

    • Pass the top-ranked, grouped context to the LLM for answer synthesis.
    • Prompt should encourage the model to use information from multiple viewpoints and sources.
    • Cite sources in the generated answer, as in the baseline RAG flow.
  6. Integration and UI

    • Add a toggle/option for users to enable "Advanced RAG" mode in the Streamlit UI.
    • Expose query type and expanded queries in the UI for transparency/debugging.
  7. Testing & Evaluation

    • Test on both theory and experiment queries.
    • Compare recall, precision, and answer quality to baseline RAG.
    • Log query types and rewrites for analysis.
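
For the Group Ranking Node, one possible "custom ranking function" is reciprocal rank fusion (RRF). This is a suggestion, not something the plan above prescribes. The sketch below assumes each query returns an ordered list of document ids. The function name `group_rank`, the constant `k = 60`, and the `top_n` cutoff are illustrative choices. It deduplicates across queries, sums a rank-based score per document, and records which queries retrieved each document.

```python
from collections import defaultdict


def group_rank(
    results_by_query: dict[str, list[str]], k: int = 60, top_n: int = 5
) -> list[tuple[str, float, list[str]]]:
    """Fuse per-query rankings with reciprocal rank fusion.

    Returns (doc_id, fused_score, queries_that_retrieved_it) tuples,
    best first. Documents found by several queries accumulate score,
    which rewards coverage across the expanded queries.
    """
    scores: dict[str, float] = defaultdict(float)
    provenance: dict[str, list[str]] = defaultdict(list)

    for query, doc_ids in results_by_query.items():
        seen: set[str] = set()
        for rank, doc_id in enumerate(doc_ids, start=1):
            if doc_id in seen:  # count a document once per query
                continue
            seen.add(doc_id)
            scores[doc_id] += 1.0 / (k + rank)
            provenance[doc_id].append(query)

    # Sort by descending score; break ties by doc id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [(d, scores[d], provenance[d]) for d in ranked[:top_n]]


if __name__ == "__main__":
    results = {
        "original": ["a", "b", "c"],
        "rewrite 1": ["b", "a"],
        "rewrite 2": ["b", "d"],
    }
    for doc_id, score, queries in group_rank(results):
        print(doc_id, round(score, 4), queries)
```

RRF needs only rank positions, so it works even when the retriever does not return comparable similarity scores. If scores are available, a weighted sum or an LLM re-ranker could replace the fusion step.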

Future Improvements

  • Support for more granular query types (simulation, review, data, etc.)
  • User feedback loop on query expansion effectiveness
  • Integration with LangGraph for graph-based control flow
  • Continuous learning of effective query rewrites
