-
Notifications
You must be signed in to change notification settings - Fork 81
Description
I want to provide search results with more information, like how many documents have been found (counting) and also how many documents belong to other fields (faceted search).
I have solved this use cases in a single query using a CTE. However when i want to include information about how many documents have been found, the query planner decides to not picking the HNSW index resulting in very slow queries.
The question is: How to fix it to use the HNSW index always?
Context:
Table
CREATE TABLE documents (
id integer PRIMARY KEY,
src text,
author text,
doc_emb vecf16(1000),
);Index
CREATE INDEX hnsw_doc_index ON documents
USING vectors (doc_emb vecf16_dot_ops)
WITH (options='[indexing.hnsw]
m = 16
ef_construction = 200');Example of simple query (use HNSW index = yes)
EXPLAIN ANALYZE
WITH cte AS
(
SELECT *
FROM documents
WHERE doc_emb <#> $1 < 0.2
)
(
SELECT
'doc_result' AS Section,
json_build_object( 'id', id, 'src', src ) AS JSON_Value
FROM cte
ORDER BY doc_emb <#> $1 asc
LIMIT 5
OFFSET 0
);Limit (cost=0.00..2.91 rows=5 width=68) (actual time=4.240..4.914 rows=5 loops=1)
-> Index Scan using hnsw_doc_index on documents (cost=0.00..1044978.07 rows=1794161 width=68) (actual time=4.238..4.912 rows=5 loops=1)
Order By: (doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16)
Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
Planning Time: 1.260 ms
Execution Time: 5.040 msExample with count (use HNSW index = no)
EXPLAIN ANALYZE
WITH cte AS (
SELECT *
FROM documents
WHERE doc_emb <#> $1 < 0.2
)
(
SELECT
'doc_result' AS Section,
json_build_object( 'id', id, 'src', src ) AS JSON_Value
FROM cte
ORDER BY doc_emb <#> $1 asc
LIMIT 5
OFFSET 0
)
UNION ALL
(
SELECT
'num_documents' AS Section,
json_build_object('count', COUNT(*)) AS JSON_Value
FROM cte
);Append (cost=1110658.65..1151027.38 rows=6 width=64) (actual time=40454.851..40464.023 rows=6 loops=1)
CTE cte
-> Seq Scan on documents (cost=0.00..1036004.26 rows=1794161 width=1208) (actual time=1.287..39950.807 rows=37641 loops=1)
Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
Rows Removed by Filter: 5378463
-> Subquery Scan on "*SELECT* 1_1" (cost=74654.39..74654.46 rows=5 width=64) (actual time=40454.850..40454.854 rows=5 loops=1)
-> Limit (cost=74654.39..74654.41 rows=5 width=68) (actual time=40454.665..40454.667 rows=5 loops=1)
-> Sort (cost=74654.39..79139.80 rows=1794161 width=68) (actual time=40213.111..40213.112 rows=5 loops=1)
Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ... , -0.033477783, -0.015335083]'::vecf16))
Sort Method: top-N heapsort Memory: 27kB
-> CTE Scan on cte (cost=0.00..44854.03 rows=1794161 width=68) (actual time=1.355..40198.036 rows=37641 loops=1)
-> Aggregate (cost=40368.62..40368.64 rows=1 width=64) (actual time=9.162..9.163 rows=1 loops=1)
-> CTE Scan on cte cte_1 (cost=0.00..35883.22 rows=1794161 width=0) (actual time=0.001..7.846 rows=37641 loops=1)
Planning Time: 1.464 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.109 ms, Inlining 53.273 ms, Optimization 104.714 ms, Emission 83.744 ms, Total 242.840 ms
Execution Time: 40486.784 msExample with count + faceted search (use HNSW index = no)
documents table has author column. In this example i want to count how many documents has each author.
EXPLAIN ANALYZE
WITH cte AS
(
SELECT *
FROM documents
WHERE doc_emb <#> $1 < 0.2
)
(
SELECT
'doc_result' AS Section,
json_build_object( 'id', id, 'src', src ) AS JSON_Value
FROM cte
ORDER BY doc_emb <#> $1 asc
LIMIT 5
OFFSET 0
)
UNION ALL
(
SELECT
'num_documents' AS Section,
json_build_object('count', COUNT(*)) AS JSON_Value
FROM cte
)
UNION ALL
(
-- faceted search info of author field
SELECT
'author' AS Section,
json_build_object(
'value', author,
'count', COUNT(*)
) AS JSON_Value
FROM cte
GROUP BY author
ORDER BY COUNT(*) DESC
LIMIT 20
)
-- more faceted search fields here...
;Append (cost=1110658.65..1195889.58 rows=26 width=64) (actual time=45792.299..45828.560 rows=26 loops=1)
CTE cte
-> Seq Scan on documents (cost=0.00..1036004.26 rows=1794161 width=1208) (actual time=1.415..45124.253 rows=37641 loops=1)
Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
Rows Removed by Filter: 5378463
-> Subquery Scan on "*SELECT* 1_1" (cost=74654.39..74654.46 rows=5 width=64) (actual time=45792.293..45792.299 rows=5 loops=1)
-> Limit (cost=74654.39..74654.41 rows=5 width=68) (actual time=45792.231..45792.234 rows=5 loops=1)
-> Sort (cost=74654.39..79139.80 rows=1794161 width=68) (actual time=45449.864..45449.866 rows=5 loops=1)
Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16))
Sort Method: top-N heapsort Memory: 27kB
-> CTE Scan on cte (cost=0.00..44854.03 rows=1794161 width=68) (actual time=1.470..45429.721 rows=37641 loops=1)
-> Aggregate (cost=40368.62..40368.64 rows=1 width=64) (actual time=11.863..11.863 rows=1 loops=1)
-> CTE Scan on cte cte_1 (cost=0.00..35883.22 rows=1794161 width=0) (actual time=0.002..10.550 rows=37641 loops=1)
-> Subquery Scan on "*SELECT* 3" (cost=44861.85..44862.10 rows=20 width=64) (actual time=24.379..24.383 rows=20 loops=1)
-> Limit (cost=44861.85..44861.90 rows=20 width=104) (actual time=24.365..24.368 rows=20 loops=1)
-> Sort (cost=44861.85..44862.35 rows=200 width=104) (actual time=24.359..24.361 rows=20 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 28kB
-> HashAggregate (cost=44854.02..44856.52 rows=200 width=104) (actual time=18.820..23.519 rows=5304 loops=1)
Group Key: cte_2.author
Batches: 1 Memory Usage: 737kB
-> CTE Scan on cte cte_2 (cost=0.00..35883.22 rows=1794161 width=32) (actual time=0.001..11.229 rows=37641 loops=1)
Planning Time: 1.583 ms
JIT:
Functions: 21
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.918 ms, Inlining 58.359 ms, Optimization 163.785 ms, Emission 120.338 ms, Total 344.398 ms
Execution Time: 45850.317 msSide note 1
In all queries i pass the query document embedding as a postgre binary parameter and I use $1 to reference it in the query.
Every query has 2 parts:
- The filtering part in the common CTE:
WHERE doc_emb <#> $1 < 0.2 - The sorting part in the "doc_results" section:
ORDER BY doc_emb <#> $1 asc
I want to know also if the doc_emb <#> $1 computation is reused because i have the same query embedding ($1) in both parts. Or is recomended to precompute a distance column ?
Side note 2
OFFSET 0 is neccesary becase i have pagination and this is page 1, for other pages the OFFSET value will be different. If you came up with other pagination strategies I will appreciete it