@ahmed-bhs
Contributor

@ahmed-bhs ahmed-bhs commented Oct 15, 2025

| Q | A |
| --- | --- |
| Bug fix? | no |
| New feature? | yes |
| Docs? | no |
| License | MIT |

Problem

Vector search and full-text search each have limitations:

  • Vector search: Great for semantic similarity, but may rank exact term matches lower
  • Full-text search: Great for exact matches, but misses semantic relationships

Users often need both: "Find documents about space travel that mention Apollo"

Solution

New PostgreSQL HybridStore combining three search methods with Reciprocal Rank Fusion (RRF):

| Method | Extension | Purpose |
| --- | --- | --- |
| Semantic | pgvector | Conceptual similarity |
| Keyword | BM25 or native FTS | Exact term matching |
| Fuzzy | pg_trgm | Typo tolerance |
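
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the idea: each ranked list contributes 1 / (k + rank) for every document it contains, and the contributions are summed. The function name `rrfFuse`, the document IDs, and the list contents are assumptions made for illustration. This is not the PR's ReciprocalRankFusion class, and k = 60 is simply the value commonly used for RRF.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not the PR's API): fuses several ranked ID lists with RRF.
 *
 * @param list<list<string>> $rankings each inner list is ordered best-first
 *
 * @return array<string, float> fused score per ID, highest first
 */
function rrfFuse(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ranking) {
        foreach ($ranking as $position => $id) {
            // Ranks are 1-based, so the top hit of a list contributes 1 / (k + 1).
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }
    arsort($scores);

    return $scores;
}

// Illustrative rankings only; the IDs are made up.
$fused = rrfFuse([
    ['apollo-13', 'interstellar', 'gravity'], // semantic (pgvector) order
    ['apollo-13', 'first-man'],               // keyword order
    ['gravity', 'apollo-13'],                 // fuzzy (pg_trgm) order
]);
```

A document that appears near the top of several lists accumulates the largest sum, so it ends up first even if no single method scores it far above the rest.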

Why BM25 over native PostgreSQL FTS?

Native PostgreSQL ranking (ts_rank_cd) is term-frequency based, often loosely described as TF-IDF, and has known limitations:

  • No document length normalization by default (long documents score higher unfairly)
  • Term frequency grows unbounded (repeating a word 100x inflates score)

BM25 addresses both issues with term-frequency saturation and length normalization, which is why Lucene and Elasticsearch use it as their default similarity.
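
To make saturation and length normalization concrete, here is a small sketch of the textbook BM25 per-term score. The function name `bm25TermScore` is hypothetical, and the defaults k1 = 1.2 and b = 0.75 are the values commonly quoted for BM25. Nothing here is taken from the plpgsql_bm25 implementation.

```php
<?php

declare(strict_types=1);

/**
 * Textbook BM25 contribution of a single term (illustrative helper, not the PR's code).
 *
 * @param float $termFrequency how often the term occurs in the document
 * @param float $docLength     document length in tokens
 * @param float $avgDocLength  average document length in the collection
 * @param float $idf           inverse document frequency of the term (fixed to 1.0 here)
 */
function bm25TermScore(
    float $termFrequency,
    float $docLength,
    float $avgDocLength,
    float $idf = 1.0,
    float $k1 = 1.2,
    float $b = 0.75,
): float {
    // Longer-than-average documents get a larger denominator, hence a lower score.
    $lengthNorm = 1.0 - $b + $b * ($docLength / $avgDocLength);

    // The fraction approaches (k1 + 1) as the term frequency grows: this is the saturation.
    return $idf * ($termFrequency * ($k1 + 1.0)) / ($termFrequency + $k1 * $lengthNorm);
}

$once = bm25TermScore(1, 100, 100);            // term appears once, average-length document
$repeated = bm25TermScore(100, 100, 100);      // same document, term repeated 100 times
$longDocument = bm25TermScore(1, 200, 100);    // term appears once, document twice as long
```

With these parameters the score for 100 repetitions stays below k1 + 1 = 2.2 instead of growing a hundredfold, and the same single match in a document twice the average length scores lower than in an average-length one.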

Fallback strategy

BM25 requires the plpgsql_bm25 extension. For users without it:

  • Default: PostgresTextSearchStrategy using native ts_rank_cd (works everywhere)
  • Optional: Bm25TextSearchStrategy for better ranking (requires extension)

// Native FTS fallback (default)
$store = new HybridStore($pdo, 'movies');

// BM25 for better ranking (requires extension)
$store = new HybridStore($pdo, 'movies',
    textSearchStrategy: new Bm25TextSearchStrategy(bm25Language: 'en')
);

Features

  • Pluggable text search: BM25 (plpgsql_bm25) or native PostgreSQL FTS
  • RRF fusion: Merges vector + keyword + fuzzy rankings
  • Configurable ratio: 0.0 = keyword-only → 1.0 = vector-only
  • Fuzzy matching: Typo tolerance via pg_trgm
  • Field boosting: title: 2x, overview: 1x
  • Score normalization: 0-100 range
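
The description does not spell out how the ratio and the 0-100 normalization are applied, so the following is only one plausible reading: weight the semantic score by the ratio, the keyword score by its complement, then rescale so the best result lands at 100. The function name `blendAndNormalize` and the max-scaling choice are assumptions, not the PR's implementation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: blends per-document scores from two rankers and rescales to 0-100.
 *
 * @param array<string, float> $vectorScores  semantic score per document ID
 * @param array<string, float> $keywordScores keyword score per document ID
 *
 * @return array<string, float> blended scores in the 0-100 range, highest first
 */
function blendAndNormalize(array $vectorScores, array $keywordScores, float $semanticRatio): array
{
    if ($semanticRatio < 0.0 || $semanticRatio > 1.0) {
        throw new InvalidArgumentException('The semantic ratio must be between 0.0 and 1.0.');
    }

    $blended = [];
    foreach ($vectorScores as $id => $score) {
        $blended[$id] = $semanticRatio * $score;
    }
    foreach ($keywordScores as $id => $score) {
        // 1.0 - ratio is the keyword share: ratio 0.0 means keyword-only.
        $blended[$id] = ($blended[$id] ?? 0.0) + (1.0 - $semanticRatio) * $score;
    }

    // Assumed normalization: scale relative to the best blended score.
    $max = [] === $blended ? 0.0 : max($blended);
    foreach ($blended as $id => $score) {
        $blended[$id] = $max > 0.0 ? 100.0 * $score / $max : 0.0;
    }
    arsort($blended);

    return $blended;
}

$vectorOnly = blendAndNormalize(['a' => 0.5, 'b' => 0.25], ['c' => 1.0], 1.0);
$keywordOnly = blendAndNormalize(['a' => 0.5, 'b' => 0.25], ['c' => 1.0], 0.0);
```

At a ratio of 1.0 the keyword-only document `c` drops to 0, and at 0.0 it becomes the only document with a non-zero score, which matches the two endpoints listed above.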

Configuration

framework:
    ai:
        stores:
            hybrid:
                postgres:
                    connection: doctrine.dbal.default_connection
                    table_name: movies
                    semantic_ratio: 0.7
                    fuzzy_weight: 0.3
                    normalize_scores: true
                    searchable_attributes:
                        title: { boost: 2.0, metadata_key: 'title' }
                        overview: { boost: 1.0, metadata_key: 'overview' }
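
As a rough illustration of what the `boost` values could mean, the sketch below multiplies each field's match score by its boost and sums the results. The function name `boostedScore` and the additive combination are assumptions. The PR may combine field scores differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: combines per-field match scores using per-field boosts.
 *
 * @param array<string, float> $fieldScores match score per searchable attribute
 * @param array<string, float> $boosts      boost per searchable attribute (defaults to 1.0)
 */
function boostedScore(array $fieldScores, array $boosts): float
{
    $total = 0.0;
    foreach ($fieldScores as $field => $score) {
        // A field without an explicit boost counts once.
        $total += ($boosts[$field] ?? 1.0) * $score;
    }

    return $total;
}

// Mirrors the configuration above: title counts double, overview counts once.
$score = boostedScore(
    ['title' => 0.5, 'overview' => 0.25],
    ['title' => 2.0, 'overview' => 1.0],
);
```

Under this reading, a match in the title contributes twice as much as an equally strong match in the overview.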

Usage

$store = new HybridStore($pdo, 'movies', semanticRatio: 0.7);
$results = $store->query($vector, ['q' => 'space adventure', 'limit' => 10]);

@carsonbot carsonbot added the Feature, Store, and Status: Needs Review labels Oct 15, 2025
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from 3807878 to 8d4ccfe Compare October 16, 2025 07:36
@chr-hertel chr-hertel requested a review from Copilot October 23, 2025 19:06
Copilot AI left a comment

Pull Request Overview

This PR introduces PostgresHybridStore, a new vector store implementation that combines semantic vector search (pgvector) with PostgreSQL Full-Text Search (FTS) using Reciprocal Rank Fusion (RRF), following Supabase's hybrid search approach.

Key changes:

  • Implements configurable hybrid search with adjustable semantic ratio (0.0 for pure FTS, 1.0 for pure vector, 0.5 for balanced)
  • Uses RRF algorithm with k=60 default to merge vector similarity and ts_rank_cd rankings
  • Supports multilingual content through configurable PostgreSQL text search configurations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/store/src/Bridge/Postgres/PostgresHybridStore.php | Core implementation of hybrid store with vector/FTS query building, RRF fusion logic, and table setup with tsvector generation |
| src/store/tests/Bridge/Postgres/PostgresHybridStoreTest.php | Comprehensive test coverage for constructor validation, setup, pure vector/FTS queries, hybrid RRF queries, and various configuration options |

Member

@chr-hertel chr-hertel left a comment

In general this is a super cool feature. Some Copilot findings seem valid to me, please check.

On top of that, I was unsure whether every sprintf call needs to be sprintf, or whether some values can/should be prepared parameters. It would be great to double-check that as well, please.
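
A sketch of the distinction raised here: SQL identifiers such as a table name cannot be bound as parameters, so they still need sprintf (after validation), while user-supplied values can go through named placeholders. The function name `buildKeywordQuery`, the column name `content_tsv`, and the query shape are assumptions for illustration, not the store's actual SQL.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical query builder: interpolates only the validated identifier,
 * and leaves every value to be bound by the driver.
 *
 * @return array{sql: string, params: array<string, int|string>}
 */
function buildKeywordQuery(string $tableName, string $searchText, int $limit): array
{
    // Identifiers cannot be bound as parameters, so validate before interpolating.
    if (1 !== preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $tableName)) {
        throw new InvalidArgumentException(sprintf('Invalid table name "%s".', $tableName));
    }

    $sql = sprintf(
        "SELECT id, ts_rank_cd(content_tsv, q) AS score "
        ."FROM %s, plainto_tsquery('simple', :query) AS q "
        ."WHERE content_tsv @@ q ORDER BY score DESC LIMIT :limit",
        $tableName,
    );

    // The search text and limit never touch the SQL string.
    return ['sql' => $sql, 'params' => ['query' => $searchText, 'limit' => $limit]];
}

$built = buildKeywordQuery('movies', 'space adventure', 10);
```

With PDO, the returned values would then be attached via `PDOStatement::bindValue()`, using `PDO::PARAM_INT` for the limit, so the search text is never concatenated into the statement.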

ahmed-bhs added a commit to ahmed-bhs/ai-demo that referenced this pull request Oct 30, 2025
Side-by-side comparison of FTS, Hybrid (RRF), and Semantic search.
Uses Supabase (pgvector + PostgreSQL FTS).
30 sample articles with interactive Live Component.

Related: symfony/ai#783
Author: Ahmed EBEN HASSINE <[email protected]>
@chr-hertel
Member

@ahmed-bhs could you please have a look at the pipeline failures? I think there are still some minor parts open.

Combines pgvector semantic search with PostgreSQL Full-Text Search
using Reciprocal Rank Fusion (RRF), following Supabase approach.

Features:
- Configurable semantic/keyword ratio (0.0 to 1.0)
- RRF fusion with customizable k parameter
- Multilingual FTS support (default: 'simple')
- Optional relevance filtering with defaultMaxScore
- All pgvector distance metrics supported

- Extract WHERE clause logic into addFilterToWhereClause() helper method
- Fix embedding param logic: ensure it's set before maxScore uses it
- Replace fragile str_replace() with robust str_starts_with() approach
- Remove code duplication between buildFtsOnlyQuery and buildHybridQuery

This addresses review feedback about fragile WHERE clause manipulation
and centralizes the logic in a single, reusable method.
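
The commit names addFilterToWhereClause() but does not show its signature, so the following is only a guess at how a str_starts_with()-based helper might look. The parameter names, the free-function form, and the parenthesization are assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: appends a condition to an optional WHERE clause
 * without relying on str_replace() over the whole SQL string.
 */
function addFilterToWhereClause(string $whereClause, string $filter): string
{
    $whereClause = trim($whereClause);

    if ('' === $whereClause) {
        // No existing clause: the filter becomes the whole WHERE clause.
        return 'WHERE '.$filter;
    }

    if (str_starts_with(strtoupper($whereClause), 'WHERE ')) {
        // Strip only the leading keyword; the rest of the clause is left untouched.
        $whereClause = substr($whereClause, 6);
    }

    // Parentheses keep any OR inside the existing clause from changing precedence.
    return sprintf('WHERE (%s) AND (%s)', $whereClause, $filter);
}

$combined = addFilterToWhereClause("WHERE metadata->>'year' = :year", 'search_text @@ q');
```

Checking only the prefix avoids the failure mode of a global str_replace(), where the word "WHERE" appearing inside a string literal or subquery would also be rewritten.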
- Rename class from PostgresHybridStore to HybridStore
- The namespace already indicates it's Postgres-specific

- Add postgres-hybrid.php RAG example demonstrating:
  * Different semantic ratios (0.0, 0.5, 1.0)
  * RRF (Reciprocal Rank Fusion) hybrid search
  * Full-text search with 'q' parameter
  * Per-query semanticRatio override
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from c75380e to 19623bb Compare November 7, 2025 13:56
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 19623bb to 2c7b49a Compare November 7, 2025 13:57
Replace ts_rank_cd (PostgreSQL Full-Text Search) with BM25 algorithm
for better keyword search ranking in hybrid search.

Changes:
- Add bm25Language parameter (configurable via YAML)
- Replace FTS CTEs with bm25topk() function calls
- Add DISTINCT ON fixes to prevent duplicate results
- Add fuzzy matching with word_similarity (pg_trgm)
- Add score normalization (0-100 range)
- Add searchable attributes with field-specific boosting
- Bundle configuration in options.php and AiBundle.php

Tests:
- Update 6 existing tests for BM25 compatibility
- Add 3 new tests for fuzzy matching and searchable attributes
- All 19 tests passing (132 assertions)

Breaking changes:
- Requires plpgsql_bm25 extension instead of native FTS
- BM25 uses short language codes ('en', 'fr') vs FTS full names

Add 3 new tests covering newly introduced functionality:

- testFuzzyMatchingWithWordSimilarity: Verifies pg_trgm fuzzy matching
  with word_similarity() and custom thresholds (primary, secondary, strict)

- testSearchableAttributesWithBoost: Ensures field-specific tsvector
  columns are created with proper GIN indexes (title_tsv, overview_tsv)

- testFuzzyWeightParameter: Validates fuzzy weight distribution in RRF
  formula when combining vector, BM25, and fuzzy scores

All tests verify SQL generation via callback assertions.
Test suite: 19 tests, 132 assertions, all passing.
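
To give an intuition for the trigram matching these tests exercise, here is a plain-PHP approximation of trigram similarity for single ASCII words: pad the lowercased word, collect its three-character windows, and divide the shared trigrams by the total distinct trigrams. It approximates pg_trgm's `similarity()` rather than `word_similarity()`, and the function names are made up for this sketch.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative only: distinct trigrams of a single ASCII word,
 * padded with two leading spaces and one trailing space.
 *
 * @return list<string>
 */
function trigrams(string $word): array
{
    $padded = '  '.strtolower($word).' ';
    $set = [];
    for ($i = 0, $n = strlen($padded) - 2; $i < $n; ++$i) {
        $set[substr($padded, $i, 3)] = true;
    }

    return array_map('strval', array_keys($set));
}

/**
 * Shared trigrams divided by all distinct trigrams of both words.
 */
function trigramSimilarity(string $a, string $b): float
{
    $ta = trigrams($a);
    $tb = trigrams($b);
    $shared = count(array_intersect($ta, $tb));

    return $shared / (count($ta) + count($tb) - $shared);
}

$typo = trigramSimilarity('apollo', 'appolo');       // transposed letters
$unrelated = trigramSimilarity('apollo', 'gravity'); // no trigram in common
```

The misspelling still shares 4 of 10 distinct trigrams with the intended word, which is why a trigram threshold can tolerate typos that an exact keyword match would miss.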
@ahmed-bhs ahmed-bhs changed the title [Store] Add PostgresHybridStore with RRF following Supabase approach [Store] Add HybridStore with BM25 ranking for PostgreSQL Nov 23, 2025
@OskarStark
Contributor

Are you open to finishing this PR, @ahmed-bhs?

…dStore

- Extract RRF logic into dedicated ReciprocalRankFusion class
- Introduce TextSearchStrategyInterface for pluggable search strategies
- Remove debug code (file_put_contents calls)
- Replace empty() with strict comparisons ([] !==) per PHPStan rules
- Add missing PHPDoc types for array parameters
- Mark properties as readonly for immutability
- Extract helper methods (buildTsvectorColumns, createSearchTextTrigger)
- Use NullVector for results without embeddings
- Update tests to reflect new setup() execution order
@ahmed-bhs
Contributor Author

Hi @OskarStark,

The work on my side is complete, you can take a look whenever you have time.

To give you a bit of context on the evolution of this work:

Initially, I wanted to propose a hybrid search implementation on PostgreSQL, combining semantic search (pgvector) with PostgreSQL’s native text search ranking (ts_rank_cd), which is term-frequency based and often loosely described as TF-IDF.

However, this ranking has well-known scoring limitations:

  • No length normalization by default: longer documents are unfairly favored
  • Unbounded term frequency: repeating a word 100 times artificially inflates the score

That’s why I then suggested using BM25 (the default similarity in Lucene and Elasticsearch), which addresses these issues through saturation and document length normalization.

BM25, however, requires the plpgsql_bm25 extension, which is not installed by default. So I implemented a fallback architecture:

  • Default: PostgresTextSearchStrategy using native FTS (works everywhere)
  • Optional: Bm25TextSearchStrategy for better ranking (requires the extension)

I also extracted the RRF (Reciprocal Rank Fusion) logic into a dedicated class for reusability.

Feel free to reach out if you have any questions or feedback!

@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 171ba50 to 27954f9 Compare November 26, 2025 04:05
- Demonstrate Bm25TextSearchStrategy vs native PostgreSQL FTS
- Show explicit ReciprocalRankFusion configuration
- Add comparison between both text search strategies
- Simplify summary and improve clarity
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 27954f9 to d7446d5 Compare November 26, 2025 04:12