Allow thresholding on vector and fulltext indexes for Hybrid retrievers #239

willtai · 2025-01-03T13:55:39Z

Description

Allow thresholding on vector and fulltext indexes for Hybrid retrievers. The user can provide two thresholds at search time, one for the vector index and one for the fulltext index, to control which results from each index are considered important enough to keep.
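To make the idea concrete, here is a minimal sketch in plain Python, not the code from this PR. It assumes each index returns (node_id, score) pairs, that scores are normalized by the per-index maximum (as the hybrid query in `neo4j_queries.py` does), and that results below a threshold are dropped before merging. The names `normalize`, `hybrid_with_thresholds`, `vector_threshold` and `fulltext_threshold` are hypothetical.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]


def normalize(results: Scored) -> Dict[str, float]:
    """Divide every score by the maximum score returned by the same index."""
    if not results:
        return {}
    max_score = max(score for _, score in results)
    if max_score <= 0:
        return {node: 0.0 for node, _ in results}
    return {node: score / max_score for node, score in results}


def hybrid_with_thresholds(
    vector_results: Scored,
    fulltext_results: Scored,
    vector_threshold: float = 0.0,    # hypothetical parameter name
    fulltext_threshold: float = 0.0,  # hypothetical parameter name
    top_k: int = 5,
) -> Scored:
    """Drop results below their index's threshold, then merge by max score."""
    merged: Dict[str, float] = {}
    for results, threshold in (
        (vector_results, vector_threshold),
        (fulltext_results, fulltext_threshold),
    ):
        for node, score in normalize(results).items():
            if score < threshold:
                continue  # assumption: below-threshold results are discarded
            merged[node] = max(merged.get(node, 0.0), score)
    # Highest normalized score first, limited to top_k results.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:top_k]


vector = [("a", 0.9), ("b", 0.45)]
fulltext = [("b", 4.0), ("c", 1.0)]
filtered = hybrid_with_thresholds(vector, fulltext, 0.6, 0.5)
unfiltered = hybrid_with_thresholds(vector, fulltext)
```

With both thresholds left at 0.0 the sketch behaves like an unthresholded hybrid search, which mirrors the intent of avoiding a breaking change when no thresholds are provided.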

Type of Change

Complexity

Complexity: Low

How Has This Been Tested?

Unit tests
E2E tests
Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

Documentation has been updated
Unit tests have been updated
E2E tests have been updated
Examples have been updated
New files have copyright header
CLA (https://neo4j.com/developer/cla/) has been signed
CHANGELOG.md updated if appropriate

src/neo4j_graphrag/retrievers/hybrid.py

tests/unit/test_neo4j_queries.py

alexthomas93

LGTM, just a few minor points

stellasia · 2025-01-06T08:38:49Z

I'd be interested in seeing an example of this.

Here are the points bothering me:

Aren't these parameters too query-dependent, since the range of normalized scores can vary a lot for each query?
What if most of the scores end up being 0? Then we do not have any ordering?
Is this compatible with the effective search ratio implemented in langchain? Typically, this ratio would bring more items in this part of the query and so change the normalization.
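The first concern can be illustrated with a small sketch. The scores below are made-up toy values, not measurements, and `normalized` and `kept` are hypothetical helpers. The sketch assumes max-normalization per query and shows the same threshold keeping very different numbers of results depending on how the raw scores are distributed.

```python
from typing import List


def normalized(scores: List[float]) -> List[float]:
    """Scale scores so the best result of the query gets 1.0."""
    top = max(scores)
    return [score / top for score in scores]


def kept(scores: List[float], threshold: float) -> List[float]:
    """Normalized scores that survive the threshold."""
    return [score for score in normalized(scores) if score >= threshold]


# Toy raw scores for two different queries against the same index.
clustered_query = [8.0, 7.2, 6.4, 6.0]  # all results score similarly
skewed_query = [8.0, 2.0, 1.0, 0.4]     # one result dominates

threshold = 0.5
kept_clustered = kept(clustered_query, threshold)
kept_skewed = kept(skewed_query, threshold)
```

Here the clustered query keeps all four results while the skewed query keeps only the top one, even though the threshold is identical.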

willtai · 2025-01-15T11:02:19Z

@CodiumAI-Agent /update_changelog

CodiumAI-Agent · 2025-01-15T11:02:57Z

Changelog updates: 🔄

2025-01-15

Added

Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'


alexthomas93 · 2025-01-16T10:17:09Z

src/neo4j_graphrag/neo4j_queries.py

-            f"RETURN n.node AS node, (n.score / ft_index_max_score) AS score }} "
-            f"WITH node, max(score) AS score ORDER BY score DESC LIMIT $top_k"
-        )
+        return f"""CALL () {{


Why change this to a multi-line string here?

willtai · 2025-01-16

This helps me modify the unit tests more easily, without worrying about correct indentation and spaces.

alexthomas93 · 2025-01-16

With multi-line strings you do have to worry about correct indentation, though. With the

( "Hello " "world" )

approach you only have to worry about there being a space at the end of every line.

willtai · 2025-01-16

Hmm, I still generally prefer multi-line strings, as I find them more readable and tend to avoid mistakes after changes. Do you think we should revert this?

alexthomas93 · 2025-01-16

I'm inclined to prefer the older approach, but I'm also open to this option. @stellasia what are your thoughts?
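The trade-off discussed in this thread can be sketched as follows. The query text is a simplified placeholder, not the real hybrid query, and `collapse_whitespace` is a hypothetical helper that a test could use to compare queries without depending on layout.

```python
# Adjacent string literals: every fragment needs its own trailing space.
concatenated = (
    "MATCH (n) "
    "RETURN n "
    "LIMIT 5"
)

# Forgetting one trailing space silently glues two tokens together.
missing_space = (
    "MATCH (n)"
    "RETURN n"
)

# Triple-quoted string: newlines and indentation become part of the value.
triple_quoted = """
    MATCH (n)
    RETURN n
    LIMIT 5
"""


def collapse_whitespace(query: str) -> str:
    """Hypothetical test helper: reduce any run of whitespace to one space."""
    return " ".join(query.split())
```

The two styles produce different strings, so unit tests that compare the query text exactly are sensitive to the choice. Comparing after `collapse_whitespace` would make such tests independent of it.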


NathalieCharbel · 2025-01-22T10:28:40Z

I apologise for chiming in a bit late, but here's my opinion.
While thresholding in a hybrid retriever can be useful, it can also work against the main objective of identifying the most relevant results across both full-text and vector-based searches. With thresholding, results that do not exceed the threshold in one search method are often filtered out before their relevance in the other method is considered, so relevant results can be lost prematurely.
Also, from a user's perspective, I find adjusting how much weight is given to each type of score more intuitive than tuning two score-based threshold parameters, since a score might be an obscure value for the user.
One way to do this is to combine the vector and full-text scores into a single score: fetch the top candidates from each search, compute one combined score, and then pick the best overall. This way the user only needs to tune a single parameter, something like:
combined_score = α * score_vector + (1 − α) * score_fulltext, where α ∈ [0, 1]
wdyt?
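A minimal sketch of this proposal, assuming both sets of scores are already normalized to [0, 1] and that a node missing from one index contributes a score of 0 for that index (an assumption, not something stated above). The function name `combine_scores` is hypothetical.

```python
from typing import Dict, List, Tuple


def combine_scores(
    vector_scores: Dict[str, float],
    fulltext_scores: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank nodes by alpha * score_vector + (1 - alpha) * score_fulltext."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    nodes = set(vector_scores) | set(fulltext_scores)
    combined = {
        node: alpha * vector_scores.get(node, 0.0)  # assumption: missing -> 0
        + (1 - alpha) * fulltext_scores.get(node, 0.0)
        for node in nodes
    }
    # Sort by descending score; break ties by node id for determinism.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


vector_scores = {"a": 1.0, "b": 0.5}
fulltext_scores = {"b": 1.0, "c": 0.25}
balanced = combine_scores(vector_scores, fulltext_scores, alpha=0.5)
vector_only = combine_scores(vector_scores, fulltext_scores, alpha=1.0)
```

With alpha = 0.5, node "b" ranks first because it scores well in both indexes. With alpha = 1.0 the ranking follows the vector scores alone.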

CodiumAI-Agent · 2025-01-22T10:29:17Z

Changelog updates: 🔄

2025-01-22

Added

Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

CodiumAI-Agent · 2025-02-24T11:10:44Z

Changelog updates: 🔄

2025-02-24

Added

Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

willtai requested a review from a team as a code owner January 3, 2025 13:55

willtai force-pushed the hybrid-retriever-weight branch from 95dd2d9 to 4c1976c Compare January 3, 2025 13:56

alexthomas93 reviewed Jan 3, 2025

src/neo4j_graphrag/retrievers/hybrid.py (resolved)

alexthomas93 reviewed Jan 3, 2025

tests/unit/test_neo4j_queries.py (outdated, resolved)

alexthomas93 approved these changes Jan 3, 2025

willtai force-pushed the hybrid-retriever-weight branch from 3e4d313 to ef33344 Compare January 14, 2025 11:17

willtai requested a review from stellasia January 14, 2025 13:14

willtai force-pushed the hybrid-retriever-weight branch from f2447f6 to be3d3cb Compare January 14, 2025 15:19

willtai added 7 commits January 15, 2025 19:17

Allow thresholding on vector and fulltext indexes for Hybrid retrievers

60e6dbb

Ruff

21636fd

Update unit tests and _get_hybrid_query to use triple quote strings

336b372

Update example

7aa97cd

Add additional detail to docstring

9c8ce02

Avoid breaking change by only filtering out nodes with 0 scores when thresholds are provided

bae7579

Update CHANGELOG

72a6e88

willtai force-pushed the hybrid-retriever-weight branch from 3ac51fd to 72a6e88 Compare January 15, 2025 11:18

alexthomas93 reviewed Jan 16, 2025

Add space to f string for hybrid query

454b231

willtai requested a review from alexthomas93 January 16, 2025 10:58

willtai closed this Feb 24, 2025

willtai mentioned this pull request Feb 24, 2025

Add linear hybrid search ranker #284

Merged

15 tasks
