[Store] Add HybridStore with BM25 ranking for PostgreSQL #783
3807878 to 8d4ccfe
Pull Request Overview
This PR introduces PostgresHybridStore, a new vector store implementation that combines semantic vector search (pgvector) with PostgreSQL Full-Text Search (FTS) using Reciprocal Rank Fusion (RRF), following Supabase's hybrid search approach.
Key changes:
- Implements configurable hybrid search with adjustable semantic ratio (0.0 for pure FTS, 1.0 for pure vector, 0.5 for balanced)
- Uses RRF algorithm with k=60 default to merge vector similarity and ts_rank_cd rankings
- Supports multilingual content through configurable PostgreSQL text search configurations
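The fusion step described above can be sketched in a few lines. This is a minimal illustration in Python, not the PR's PHP/SQL implementation: the function name `reciprocal_rank_fusion` and the way `semantic_ratio` weights the two rankings are assumptions made for the example; only the RRF formula `1 / (k + rank)` and the default `k=60` come from the overview.

```python
from collections import defaultdict


def reciprocal_rank_fusion(vector_ids, keyword_ids, semantic_ratio=0.5, k=60):
    """Fuse two ranked ID lists (best first) into a list of (id, score) pairs.

    Assumption for this sketch: semantic_ratio weights the vector ranking and
    (1 - semantic_ratio) weights the keyword ranking.
    """
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0.0 and 1.0")

    weighted_rankings = [
        (semantic_ratio, vector_ids),
        (1.0 - semantic_ratio, keyword_ids),
    ]

    scores = defaultdict(float)
    for weight, ranking in weighted_rankings:
        if weight == 0.0:
            continue  # pure FTS or pure vector mode: ignore the other ranking
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion(["a", "b", "c"], ["c", "a", "d"])
print([doc_id for doc_id, _ in fused])
```

Because RRF only looks at ranks, it does not need the cosine distances and the text-search scores to share a scale, which is why it suits merging two very different rankers.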
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/store/src/Bridge/Postgres/PostgresHybridStore.php | Core implementation of hybrid store with vector/FTS query building, RRF fusion logic, and table setup with tsvector generation |
| src/store/tests/Bridge/Postgres/PostgresHybridStoreTest.php | Comprehensive test coverage for constructor validation, setup, pure vector/FTS queries, hybrid RRF queries, and various configuration options |
chr-hertel
left a comment
In general this is a super cool feature. Some Copilot findings seem valid to me, please check.
On top of that, I was unsure whether every sprintf really needs to be a sprintf, or whether some values can/should be prepared parameters. It'd be great to double-check that as well, please.
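The distinction raised in this comment can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the table, function, and allow-list names are hypothetical. The principle carries over: values (search terms, limits) can be bound as parameters, while identifiers such as table names cannot be bound and therefore have to be interpolated, ideally only after validation.

```python
import sqlite3

# Hypothetical allow-list: identifiers cannot be bound as parameters,
# so they are validated before being interpolated into the SQL string.
ALLOWED_TABLES = {"docs"}


def search(conn, table, term, limit):
    """Return rows whose body contains `term`, binding all values as parameters."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    sql = f"SELECT id, body FROM {table} WHERE body LIKE ? ORDER BY id LIMIT ?"
    return conn.execute(sql, (f"%{term}%", limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [("apollo mission",), ("o'neill cylinder",)],
)

# The quote in the search term is harmless because it is bound, not interpolated.
print(search(conn, "docs", "o'neill", 10))
```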
Side-by-side comparison of FTS, Hybrid (RRF), and Semantic search.
Uses Supabase (pgvector + PostgreSQL FTS). 30 sample articles with interactive Live Component.
Related: symfony/ai#783
Author: Ahmed EBEN HASSINE
@ahmed-bhs could you please have a look at the pipeline failures? I think there are still some minor parts open.
Combines pgvector semantic search with PostgreSQL Full-Text Search using Reciprocal Rank Fusion (RRF), following Supabase approach.
Features:
- Configurable semantic/keyword ratio (0.0 to 1.0)
- RRF fusion with customizable k parameter
- Multilingual FTS support (default: 'simple')
- Optional relevance filtering with defaultMaxScore
- All pgvector distance metrics supported
- Extract WHERE clause logic into addFilterToWhereClause() helper method
- Fix embedding param logic: ensure it's set before maxScore uses it
- Replace fragile str_replace() with robust str_starts_with() approach
- Remove code duplication between buildFtsOnlyQuery and buildHybridQuery
This addresses review feedback about fragile WHERE clause manipulation and centralizes the logic in a single, reusable method.
- Rename class from PostgresHybridStore to HybridStore
- The namespace already indicates it's Postgres-specific
- Add postgres-hybrid.php RAG example demonstrating:
  * Different semantic ratios (0.0, 0.5, 1.0)
  * RRF (Reciprocal Rank Fusion) hybrid search
  * Full-text search with 'q' parameter
  * Per-query semanticRatio override
c75380e to 19623bb
19623bb to 2c7b49a
Replace ts_rank_cd (PostgreSQL Full-Text Search) with BM25 algorithm
for better keyword search ranking in hybrid search.
Changes:
- Add bm25Language parameter (configurable via YAML)
- Replace FTS CTEs with bm25topk() function calls
- Add DISTINCT ON fixes to prevent duplicate results
- Add fuzzy matching with word_similarity (pg_trgm)
- Add score normalization (0-100 range)
- Add searchable attributes with field-specific boosting
- Bundle configuration in options.php and AiBundle.php
Tests:
- Update 6 existing tests for BM25 compatibility
- Add 3 new tests for fuzzy matching and searchable attributes
- All 19 tests passing (132 assertions)
Breaking changes:
- Requires plpgsql_bm25 extension instead of native FTS
- BM25 uses short language codes ('en', 'fr') vs FTS full names
Add 3 new tests covering newly introduced functionality:
- testFuzzyMatchingWithWordSimilarity: Verifies pg_trgm fuzzy matching with word_similarity() and custom thresholds (primary, secondary, strict)
- testSearchableAttributesWithBoost: Ensures field-specific tsvector columns are created with proper GIN indexes (title_tsv, overview_tsv)
- testFuzzyWeightParameter: Validates fuzzy weight distribution in RRF formula when combining vector, BM25, and fuzzy scores
All tests verify SQL generation via callback assertions.
Test suite: 19 tests, 132 assertions, all passing.
Open to finish this PR, @ahmed-bhs?
…dStore
- Extract RRF logic into dedicated ReciprocalRankFusion class
- Introduce TextSearchStrategyInterface for pluggable search strategies
- Remove debug code (file_put_contents calls)
- Replace empty() with strict comparisons ([] !==) per PHPStan rules
- Add missing PHPDoc types for array parameters
- Mark properties as readonly for immutability
- Extract helper methods (buildTsvectorColumns, createSearchTextTrigger)
- Use NullVector for results without embeddings
- Update tests to reflect new setup() execution order
Hi @OskarStark,
The work on my side is complete; you can take a look whenever you have time.
To give you a bit of context on the evolution of this work: initially, I wanted to propose a hybrid search implementation on PostgreSQL, combining semantic search (pgvector) with PostgreSQL’s native text search based on TF-IDF. However, TF-IDF has well-known scoring limitations.
That’s why I then suggested using BM25 (the algorithm used by Elasticsearch, Meilisearch, Lucene), which addresses these issues through saturation and document length normalization. BM25, however, requires the plpgsql_bm25 extension.
I also extracted the RRF (Reciprocal Rank Fusion) logic into a dedicated class for reusability. Feel free to reach out if you have any questions or feedback!
171ba50 to 27954f9
- Demonstrate BM25TextSearchStrategy vs native PostgreSQL FTS
- Show explicit ReciprocalRankFusion configuration
- Add comparison between both text search strategies
- Simplify summary and improve clarity
27954f9 to d7446d5
Problem
Vector search and full-text search each have limitations:
Users often need both: "Find documents about space travel that mention Apollo"
Solution
New PostgreSQL HybridStore combining three search methods with Reciprocal Rank Fusion (RRF):
Why BM25 over native PostgreSQL FTS?
Native PostgreSQL uses TF-IDF which has known limitations:
BM25 fixes these issues with saturation and length normalization — that's why Elasticsearch, Meilisearch, and Lucene all use it.
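To make "saturation" and "length normalization" concrete, here is a small sketch of the standard Okapi BM25 per-term score in Python. It is not the plpgsql_bm25 implementation; the function name is hypothetical, and `k1=1.2` and `b=0.75` are the commonly used defaults, assumed here rather than taken from the PR.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with Okapi BM25.

    tf: occurrences of the term in the document
    doc_len / avg_doc_len: document length relative to the corpus average
    n_docs / doc_freq: corpus size and number of documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the fraction approaches (k1 + 1) as tf grows, so repeating
    # a term many times yields diminishing returns.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


common = dict(doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
for tf in (1, 10, 100):
    print(tf, round(bm25_term_score(tf=tf, **common), 3))
```

Raising `tf` from 10 to 100 adds far less to the score than raising it from 1 to 10, and the same `tf` in a document twice the average length scores lower than in an average-length one.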
Fallback strategy
BM25 requires the plpgsql_bm25 extension. For users without it:
- PostgresTextSearchStrategy using native ts_rank_cd (works everywhere)
- Bm25TextSearchStrategy for better ranking (requires extension)
Features
- 0.0 = keyword-only → 1.0 = vector-only
- title: 2x, overview: 1x
Configuration
Usage
References