[Algorithm] Advanced RAG: Query Characterization, Expansion, and Group Ranking for Improved Retrieval

## Summary
Implement an advanced RAG pipeline that, for each user query, performs the following steps:
1. Characterizes the query type (e.g., theory or experiment).
2. Generates 3 additional queries aligned with the original query to improve information retrieval (query rewriting/expansion).
3. Searches the database with all queries in parallel.
4. Aggregates and group-ranks the retrieved information.
5. Passes top-ranked information to the LLM to generate a cohesive answer.

## Implementation Plan (LangChain)

1. **Query Characterization Node**
   - Use an LLM chain with a prompt to classify the incoming query as "theory" or "experiment" (or other scientific categories as needed).
   - Store the type for downstream logic and metadata.
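
A minimal sketch of the classification step, assuming the model is reachable through a plain `str -> str` callable. The names `classify_query`, `CLASSIFY_PROMPT`, and `fake_llm` are hypothetical and stand in for whatever LLM chain the project ends up using. An unrecognized reply falls back to a default label so that downstream nodes always receive a valid type.

```python
from typing import Callable

QUERY_TYPES = ("theory", "experiment")  # the two categories named in the plan

# Hypothetical prompt wording; adjust to the project's own prompt style.
CLASSIFY_PROMPT = (
    "Classify the following scientific query as one of: {labels}.\n"
    "Answer with the label only.\n\nQuery: {query}"
)


def classify_query(query: str, llm: Callable[[str], str], default: str = "other") -> str:
    """Return the query type, or `default` when the reply is not a known label."""
    prompt = CLASSIFY_PROMPT.format(labels=", ".join(QUERY_TYPES), query=query)
    # Normalize case and strip surrounding punctuation such as "Experiment."
    reply = llm(prompt).strip().lower().strip(" .\"'")
    return reply if reply in QUERY_TYPES else default


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, only here to make the sketch runnable."""
    return "Experiment." if "measure" in prompt.lower() else "Theory"


if __name__ == "__main__":
    print(classify_query("How do we measure the Hall coefficient?", fake_llm))
```

The returned label can be stored alongside the query and reused when building the expansion prompt.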

2. **Query Expansion Node**
   - Use an LLM chain to rewrite the original query into 3 additional queries that are semantically aligned but phrased differently to maximize recall.
   - Example prompt: "Given the query below, generate 3 alternative queries that would help retrieve more relevant information about the same topic, considering it is a [theory|experiment] question."
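
The expansion step could be sketched as follows, again assuming a `str -> str` callable for the model. The helper `expand_query` is hypothetical, and the "one query per line" instruction is an addition to the example prompt above so that the reply is easy to parse. The parser strips list markers, drops duplicates and echoes of the original query, and keeps at most `n` rewrites.

```python
import re
from typing import Callable, List

EXPAND_PROMPT = (
    "Given the query below, generate {n} alternative queries that would help "
    "retrieve more relevant information about the same topic, considering it "
    "is a {query_type} question. Return one query per line.\n\nQuery: {query}"
)

# Matches leading bullets ("-", "*") or numbering ("1.", "2)") on a line.
_LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s*")


def expand_query(
    query: str, query_type: str, llm: Callable[[str], str], n: int = 3
) -> List[str]:
    """Ask the model for rewrites and return up to `n` distinct ones."""
    reply = llm(EXPAND_PROMPT.format(n=n, query_type=query_type, query=query))
    seen = {query.strip().lower()}  # never return the original query itself
    rewrites: List[str] = []
    for line in reply.splitlines():
        text = _LIST_MARKER.sub("", line).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            rewrites.append(text)
    return rewrites[:n]
```

If the model returns fewer than `n` usable lines, the function returns what it has rather than padding, so callers should not assume exactly four queries reach the retriever.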

3. **Parallel Retrieval Node**
   - Search the vector database (or multiple databases) using all 4 queries (original + 3 expansions).
   - Run the retrievals concurrently, e.g. with LangChain's `RunnableParallel` or the retriever's `batch` method.
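
To illustrate the fan-out without requiring LangChain to be installed, the sketch below uses a standard-library thread pool instead of `RunnableParallel`; the shape of the result is what matters for the ranking step. `retrieve_all` is a hypothetical helper, and `retriever` is assumed to be any callable that maps a query string to a ranked list of document IDs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def retrieve_all(
    queries: List[str],
    retriever: Callable[[str], List[str]],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Run `retriever` on every distinct query concurrently.

    Returns a mapping from each query to its ranked list of document IDs.
    """
    unique = list(dict.fromkeys(queries))  # drop repeats, keep order
    workers = max(1, min(max_workers, len(unique)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `unique`.
        results = list(pool.map(retriever, unique))
    return dict(zip(unique, results))
```

Threads suit this step because vector-store lookups are typically I/O-bound; keeping the per-query lists separate preserves the rank information that the group-ranking node needs.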

4. **Group Ranking Node**
   - Combine all retrieved results.
   - Deduplicate by document ID or content, and aggregate scores where available.
   - Use a custom ranking function or an LLM chain to group and re-rank results based on relevance to the original query and coverage across expanded queries.
   - Optionally, annotate which query retrieved which document for transparency.
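
One possible "custom ranking function" is reciprocal rank fusion (RRF). This is a technique swapped in here for illustration and is not prescribed by the plan. Each document scores `1 / (k + rank)` for every query list it appears in, so documents retrieved by several queries rise to the top. The helper name `group_rank` and the constant `k = 60` (a commonly used default) are assumptions; the sketch also records which queries retrieved each document.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_rank(
    results_by_query: Dict[str, List[str]], k: int = 60, top_n: int = 5
) -> List[Tuple[str, float, List[str]]]:
    """Fuse per-query rankings with reciprocal rank fusion.

    Returns (doc_id, fused_score, queries_that_retrieved_it) tuples,
    best first, limited to `top_n` entries.
    """
    scores: Dict[str, float] = defaultdict(float)
    sources: Dict[str, List[str]] = defaultdict(list)
    for query, doc_ids in results_by_query.items():
        seen = set()  # count each document once per query list
        for rank, doc_id in enumerate(doc_ids, start=1):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            scores[doc_id] += 1.0 / (k + rank)
            sources[doc_id].append(query)
    # Sort by score descending; break ties by doc_id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [(d, scores[d], sources[d]) for d in ranked[:top_n]]
```

RRF only needs ranks, so it works even when the vector store returns no scores or when scores from different databases are not comparable. An LLM-based re-ranker could replace or follow it.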

5. **Cohesive Answer Generation**
   - Pass the top-ranked, grouped context to the LLM for answer synthesis.
   - Prompt should encourage the model to use information from multiple viewpoints and sources.
   - Cite sources in the same way as the baseline RAG pipeline.
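
The synthesis prompt might be assembled along these lines. The function `build_answer_prompt`, the prompt wording, and the `[n]` citation format are all hypothetical; the point is that each passage carries a numbered source tag the model can cite.

```python
from typing import List, Tuple


def build_answer_prompt(
    question: str, passages: List[Tuple[str, str]], max_passages: int = 5
) -> str:
    """Build a synthesis prompt from (source, text) pairs, best-ranked first."""
    blocks = [
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(passages[:max_passages], start=1)
    ]
    context = "\n".join(blocks)
    return (
        "Answer the question using only the numbered context passages.\n"
        "Combine information from several passages where they complement "
        "each other, and cite passages by their bracketed numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo = [("paper_a.pdf", "The Hall coefficient is ..."),
            ("notes.md", "Measurements were taken at ...")]
    print(build_answer_prompt("What is the Hall coefficient?", demo))
```

Capping the number of passages keeps the prompt within the model's context window; the cap of five is an arbitrary placeholder.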

6. **Integration and UI**
   - Add a toggle/option for users to enable "Advanced RAG" mode in the Streamlit UI.
   - Expose query type and expanded queries in the UI for transparency/debugging.
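
On the backend side, the toggle could map to a single boolean that selects the code path, with the query type and rewrites returned next to the answer so the UI can display them. The sketch below is framework-agnostic and leaves out the Streamlit widgets; `RagResult`, `run_rag`, and the four callables are hypothetical names for the nodes described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RagResult:
    answer: str
    query_type: str = ""  # empty when advanced mode is off
    expanded_queries: List[str] = field(default_factory=list)


def run_rag(
    query: str,
    advanced: bool,
    classify: Callable[[str], str],
    expand: Callable[[str, str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    answer: Callable[[str, List[str]], str],
) -> RagResult:
    """Run baseline or advanced RAG depending on the `advanced` flag."""
    if not advanced:
        # Baseline path: retrieve with the original query only.
        return RagResult(answer=answer(query, retrieve([query])))
    query_type = classify(query)
    rewrites = expand(query, query_type)
    context = retrieve([query, *rewrites])
    return RagResult(answer(query, context), query_type, rewrites)
```

Returning the intermediate values in one object means the UI layer only has to render `query_type` and `expanded_queries`, and the same object can be logged for later analysis.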

7. **Testing & Evaluation**
   - Test on both theory and experiment queries.
   - Compare recall, precision, and answer quality against the baseline RAG pipeline.
   - Log query types and rewrites for analysis.
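
For the retrieval part of the comparison, precision and recall at a cutoff `k` could be computed per query as sketched below. This assumes a hand-labeled set of relevant document IDs exists for each test query, which the plan does not specify; `precision_recall_at_k` is a hypothetical helper. Answer quality needs a separate, likely manual or LLM-assisted, evaluation.

```python
from typing import Iterable, List, Tuple


def precision_recall_at_k(
    retrieved: List[str], relevant: Iterable[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query.

    `retrieved` is the ranked list of document IDs; `relevant` is the
    labeled ground truth. Duplicate IDs in `retrieved` are counted once.
    """
    relevant_set = set(relevant)
    top_k = list(dict.fromkeys(retrieved))[:k]  # dedupe, keep rank order
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"a", "b", "c"}
    print("baseline:", precision_recall_at_k(["a", "x", "y", "b"], relevant, k=3))
    print("advanced:", precision_recall_at_k(["a", "b", "x", "c"], relevant, k=3))
```

Averaging these values separately over theory and experiment queries would show whether expansion helps one category more than the other.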

## References
- [LangChain Parallel Runnables](https://python.langchain.com/docs/expression_language/how_to/parallel)
- [LangChain Custom Ranking](https://python.langchain.com/docs/modules/data_connection/retrievers)
- Related: [Query rewriting in RAG](https://arxiv.org/abs/2305.14283)

## Future Improvements
- Support for more granular query types (simulation, review, data, etc.)
- User feedback loop on query expansion effectiveness
- Integration with LangGraph for graph-based control flow
- Continuous learning of effective query rewrites

