Allow thresholding on vector and fulltext indexes for Hybrid retrievers #239

Closed
wants to merge 8 commits into from

Conversation

willtai
Contributor

@willtai willtai commented Jan 3, 2025

Description

Allow thresholding on vector and fulltext indexes for Hybrid retrievers. The user can provide two thresholds at search time, one for the vector index and one for the fulltext index, to control how much importance the search results from each index are given.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Documentation update
  • Project configuration change

Complexity

Complexity: Low

How Has This Been Tested?

  • Unit tests
  • E2E tests
  • Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

  • Documentation has been updated
  • Unit tests have been updated
  • E2E tests have been updated
  • Examples have been updated
  • New files have copyright header
  • CLA (https://neo4j.com/developer/cla/) has been signed
  • CHANGELOG.md updated if appropriate

@willtai willtai requested a review from a team as a code owner January 3, 2025 13:55
@willtai willtai force-pushed the hybrid-retriever-weight branch from 95dd2d9 to 4c1976c Compare January 3, 2025 13:56
Contributor

@alexthomas93 alexthomas93 left a comment

LGTM, just a few minor points

@stellasia
Contributor

I'd be interested in seeing an example of this.

Here are the points bothering me:

  • Aren't these parameters too query-dependent, since the range of normalized scores can vary a lot for each query?
  • What if most of the scores end up being 0? Then we do not have any ordering?
  • Is this compatible with the effective search ratio implemented in langchain? Typically, this ratio would bring more items in this part of the query and so change the normalization.
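To make the first two concerns concrete, here is a minimal sketch. It assumes scores are normalized by dividing by the per-query maximum (as the `n.score / ft_index_max_score` fragment in this PR suggests) and that thresholding zeroes out any normalized score below the cutoff; that second assumption, the helper names, and the sample scores are all hypothetical and not taken from the PR.

```python
def normalize(scores):
    """Divide every score by the largest one so the best hit gets 1.0."""
    top = max(scores, default=0.0)
    if top <= 0:
        return [0.0 for _ in scores]
    return [s / top for s in scores]


def apply_threshold(scores, threshold):
    """Zero out normalized scores below the threshold (assumed behaviour)."""
    return [s if s >= threshold else 0.0 for s in scores]


# Two hypothetical queries with differently shaped raw score distributions.
tight_query = [9.0, 8.5, 8.0, 7.5]   # every hit scores close to the best one
spread_query = [9.0, 3.0, 1.0, 0.5]  # one strong hit, then a long tail

threshold = 0.8
tight_kept = apply_threshold(normalize(tight_query), threshold)
spread_kept = apply_threshold(normalize(spread_query), threshold)

print(tight_kept)
print(spread_kept)
```

With the same threshold, the first query keeps all four hits while the second keeps only the top one. The three zeroed hits in the second query are tied, so they no longer have any ordering among themselves.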

@willtai willtai force-pushed the hybrid-retriever-weight branch from 3e4d313 to ef33344 Compare January 14, 2025 11:17
@willtai willtai requested a review from stellasia January 14, 2025 13:14
@willtai willtai force-pushed the hybrid-retriever-weight branch from f2447f6 to be3d3cb Compare January 14, 2025 15:19
@willtai
Contributor Author

willtai commented Jan 15, 2025

@CodiumAI-Agent /update_changelog

@CodiumAI-Agent

Changelog updates: 🔄

2025-01-15

Added

  • Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

@willtai willtai force-pushed the hybrid-retriever-weight branch from 3ac51fd to 72a6e88 Compare January 15, 2025 11:18
f"RETURN n.node AS node, (n.score / ft_index_max_score) AS score }} "
f"WITH node, max(score) AS score ORDER BY score DESC LIMIT $top_k"
)
return f"""CALL () {{
Contributor

Why change this to a multi-line string here?

Contributor Author

This helps me modify the unit tests more easily without worrying about correct indentation and spaces

Contributor

With multi-line strings, you do have to worry about correct indentation though. With the

(
"Hello "
"world"
)

approach you only have to worry about there being a space at the end of every line.
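The trade-off discussed in this thread can be sketched as follows. The Cypher text is a hypothetical placeholder, not the query from this PR; the point is only how Python builds each string.

```python
# Adjacent string literals inside parentheses are joined at compile time,
# so the only whitespace in the result is what each literal contains.
concatenated = (
    "MATCH (n) "
    "RETURN n"
)


def build_query():
    # A triple-quoted string keeps every newline and every leading space
    # exactly as it appears in the source file, including code indentation.
    return """MATCH (n)
    RETURN n"""


multi_line = build_query()

print(repr(concatenated))
print(repr(multi_line))
```

Both strings hold the same query text, but they are not equal as Python strings, so a unit test that compares query strings exactly has to match the whitespace of whichever style is used. Collapsing whitespace before comparing, for example with `" ".join(query.split())`, is one way to make such tests independent of the style.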

Contributor Author

Hmm, I still generally prefer multi-line strings: I find them more readable, and I tend to make fewer mistakes when changing them. Do you think we should revert this?

Contributor

I’m inclined to prefer the older approach, but I’m also open to this option. @stellasia what are your thoughts?

@willtai willtai requested a review from alexthomas93 January 16, 2025 10:58
@NathalieCharbel
Contributor

I apologise for chiming in a bit late, but here's my opinion.
While thresholding in a hybrid retriever can be useful, it can also work against the main objective, which is to identify the most relevant results across both full-text and vector-based searches. With thresholding, you often filter out any result that does not exceed the threshold in one search method before considering its relevance in the other, so relevant results can be lost prematurely.
Also, from a user's perspective, I find adjusting how much weight each type of score gets more intuitive than tuning two score-based thresholds, since the score itself might be an obscure value for the user.
One way to do this is to combine the vector and full-text scores into a single score: fetch the top candidates from each search, compute one combined score per candidate, and pick the best overall. This way the user only needs to tune a single parameter, something like:
combined_score = α * score_vector + (1−α) * score_fulltext where α ∈ [0,1]
wdyt?
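A minimal sketch of the weighted combination proposed above. It assumes both score sets are already normalized to [0, 1] and that a node missing from one search contributes 0 for that search; the function name `combine_scores`, its parameters, and the sample data are hypothetical and not part of the library.

```python
def combine_scores(vector_scores, fulltext_scores, alpha, top_k):
    """Rank nodes by alpha * score_vector + (1 - alpha) * score_fulltext."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Consider every node returned by either search.
    node_ids = set(vector_scores) | set(fulltext_scores)
    combined = {
        node_id: alpha * vector_scores.get(node_id, 0.0)
        + (1 - alpha) * fulltext_scores.get(node_id, 0.0)
        for node_id in node_ids
    }
    # Highest combined score first; ties are broken by node id for stability.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical normalized scores keyed by node id.
vector_hits = {"a": 1.0, "b": 0.5}
fulltext_hits = {"b": 1.0, "c": 0.5}

result = combine_scores(vector_hits, fulltext_hits, alpha=0.5, top_k=2)
print(result)
```

Here node `b` ranks first because it scores reasonably in both searches, even though it is not the top vector hit. A per-index threshold applied before merging could have dropped it from the vector side.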

@CodiumAI-Agent

Changelog updates: 🔄

2025-01-22

Added

  • Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

@willtai willtai closed this Feb 24, 2025
@CodiumAI-Agent

Changelog updates: 🔄

2025-02-24

Added

  • Support for thresholding on vector and fulltext indexes in Hybrid retrievers, enabling users to set importance levels for search results.

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

@willtai willtai mentioned this pull request Feb 24, 2025
15 tasks