Discussion: Semantic Search instead of Full-Text-Search #1

Open
alexferrari88 opened this issue Dec 15, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@alexferrari88

Hi Ian, great project. It's been on my to-do list of projects to build, but thank god you got this done :)

I was wondering: wouldn't it be better to use semantic search instead of full-text search?

At least this was my idea for creating a project similar to yours.

I'd be glad to give more details, if my question is not clear.

(also interested in contributing, if you want to go in this direction)

@iansinnott
Owner

Hey @alexferrari88, thanks for the kind words. Agreed, semantic search would be great. I've thought about it as well, but unlike sqlite-in-wasm, I'm unaware of a solution to semantic search in JS.

Running a vector store remotely is an option but one of my goals with this project was to have it be useful as a standalone product.

What are your thoughts on how to implement semantic search?

iansinnott added the enhancement label on Dec 16, 2023

@alexferrari88
Author

Right after submitting the issue, I started looking for a wasm vector search, since — I agree with you — it would be nicer to have this extension be sort of self-contained.

Unfortunately, the solutions are still few and far between. The best solutions I found so far:

  1. https://github.com/tantaraio/voy
  2. https://github.com/not-pizza/victor

Of the two, Voy seems quite nice, and there are also JS examples showing how to use it.

Curious to know your thoughts about this.

@iansinnott
Owner

Awesome, thanks for the links. After a quick look I have some thoughts:

  • Voy is currently an in-memory store, which means we'd have to load everything into memory and initialize the index. This will work, but is not ideal since the amount of full-text data is unbounded and will presumably be measured in gigabytes once the user has browsed for long enough.
  • Victor looks pretty ideal in that it uses OPFS for storage in the browser. However, OPFS does not currently work in the background thread of web extensions. More details below.
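
To make the in-memory concern concrete, here is a minimal sketch of brute-force vector search over records held in memory. It is not Voy's API or its indexing algorithm; the `EmbeddedPage` shape and the function names are made up for illustration. The point is only that every embedding has to be resident in memory before a query can run.

```typescript
// Hypothetical record shape: one embedding per stored page.
interface EmbeddedPage {
  url: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as dissimilar to everything
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every page against the query and returns the k best matches.
// The whole `pages` array must already be loaded, which is the cost
// discussed above.
function searchInMemory(
  pages: EmbeddedPage[],
  query: number[],
  k: number
): EmbeddedPage[] {
  return pages
    .map((page) => ({ page, score: cosineSimilarity(page.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.page);
}
```

A real index would avoid scanning every record per query, but it would still need the data in memory, so memory use grows with browsing history either way.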

I initially created this extension with WebSQL, which works for extensions using manifest v2. MV2 extensions are no longer allowed though, so while porting to MV3 I initially wanted to use OPFS and the official sqlite-wasm implementation.

I was unable to get OPFS to work in the web extension service worker. It works in browser tabs and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.

I ended up using IndexedDB as the backing filesystem via the excellent wa-sqlite implementation. That's the current state of things: IndexedDB, because it happens to work in service workers.

@alexferrari88
Author

Thank you for looking into it further. That's unfortunate ☹️. A standalone extension that takes care of everything, without the user having to install anything extra, would be ideal, but it seems that is not feasible at the moment.

Ideally, one could proceed with an external (but local) vector store (e.g. Chromadb) and create a repository layer that would allow an easy swap for a wasm implementation in the future. I understand this is completely outside the scope of this extension.
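
A rough sketch of what such a repository layer could look like. Everything here is hypothetical: the `VectorStore` interface, the `SearchHit` shape, and the in-memory implementation are illustrative names, not part of this extension or of Chroma. The idea is that the extension only talks to the interface, so an implementation backed by a local server could later be swapped for a wasm one.

```typescript
// Hypothetical result shape returned by any store implementation.
interface SearchHit {
  id: string;
  score: number;
}

// The only surface the rest of the extension would depend on.
interface VectorStore {
  upsert(id: string, embedding: number[]): Promise<void>;
  search(query: number[], k: number): Promise<SearchHit[]>;
}

// Dot product; equals cosine similarity when both vectors are normalized
// (an assumption of this sketch).
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Stand-in implementation. An external-store or wasm-backed class would
// implement the same two methods.
class InMemoryVectorStore implements VectorStore {
  private readonly vectors = new Map<string, number[]>();

  async upsert(id: string, embedding: number[]): Promise<void> {
    this.vectors.set(id, embedding);
  }

  async search(query: number[], k: number): Promise<SearchHit[]> {
    const hits: SearchHit[] = [];
    this.vectors.forEach((embedding, id) => {
      hits.push({ id, score: dotProduct(query, embedding) });
    });
    return hits.sort((x, y) => y.score - x.score).slice(0, k);
  }
}
```

Callers would hold a `VectorStore` reference and never name the concrete class, which keeps the swap to a single construction site.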

I might fork it and start working on it but can't promise anything 😎

@rhashimoto

> I was unable to get OPFS to work in the web extension service worker. It works in browser tabs and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.

Technically (and pedantically) speaking, OPFS should work in any context, including service workers. The restriction is that the synchronous access handles, which are what make OPFS file operations fast, are only available in dedicated workers. That is a deliberate choice, not a bug: the rationale is that blocking calls should not be used anywhere else.
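
One way to check which flavor of OPFS a given context offers is feature detection, sketched below. `navigator.storage.getDirectory` and `createSyncAccessHandle` are the real web API names; the `ScopeLike` type, the `OpfsMode` labels, and `detectOpfsMode` are invented for this example. The scope is passed in as a parameter so the logic can be exercised outside a browser.

```typescript
// Hypothetical labels for what a context supports.
type OpfsMode = "sync-access-handle" | "async-only" | "unavailable";

// Minimal structural view of a global scope (window, worker, service worker).
interface ScopeLike {
  navigator?: { storage?: { getDirectory?: unknown } };
  FileSystemFileHandle?: { prototype: object };
}

function detectOpfsMode(scope: ScopeLike): OpfsMode {
  // OPFS itself is reached through navigator.storage.getDirectory().
  const hasOpfs = typeof scope.navigator?.storage?.getDirectory === "function";
  if (!hasOpfs) {
    return "unavailable";
  }
  // createSyncAccessHandle is only exposed on the prototype in dedicated
  // workers, so its presence distinguishes the fast synchronous path.
  const proto = scope.FileSystemFileHandle?.prototype;
  const hasSync = proto !== undefined && "createSyncAccessHandle" in proto;
  return hasSync ? "sync-access-handle" : "async-only";
}
```

In a real context this would be called as `detectOpfsMode(globalThis as ScopeLike)`; a service worker would be expected to report `"async-only"` under the rule described above.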

For Chrome extensions, although they are implemented as service workers, I think there is a workaround. An offscreen document can be attached to an extension, and this document can create a Worker where the entire OPFS API should be usable. Perhaps that path is worth exploring.
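
A hedged sketch of the first half of that workaround: creating the offscreen document at most once. Chrome's `chrome.offscreen.createDocument` call and its `WORKERS` reason are real, but the `OffscreenApi` interface, the `ensureOffscreenDocument` helper, and the `offscreen.html` file name are assumptions made for this example. The API object is injected so the caching logic does not depend on the `chrome` global.

```typescript
// Minimal structural view of the part of chrome.offscreen used here.
interface OffscreenApi {
  createDocument(params: {
    url: string;
    reasons: string[];
    justification: string;
  }): Promise<void>;
}

// Cached promise so concurrent callers share one creation attempt.
let creatingOffscreenDocument: Promise<void> | null = null;

// Creates the offscreen document on first call and reuses the same
// promise afterwards. `url` is a hypothetical page bundled with the
// extension that would start a dedicated Worker.
function ensureOffscreenDocument(api: OffscreenApi, url: string): Promise<void> {
  if (creatingOffscreenDocument === null) {
    creatingOffscreenDocument = api.createDocument({
      url,
      reasons: ["WORKERS"],
      justification: "Run a dedicated worker for storage access",
    });
  }
  return creatingOffscreenDocument;
}
```

In the service worker this might be called with `chrome.offscreen` (cast to `OffscreenApi` if the type definitions disagree) and `"offscreen.html"`. The page itself would then create the Worker, and messages would be relayed between the service worker and that Worker; this sketch does not handle the document being closed later.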

@iansinnott
Owner

Thanks for chiming in, @rhashimoto. Interesting. I had looked at the offscreen document API for DOM parsing, but if it allows access to a normal worker, that might be an option. A bit roundabout, but vector search for browsing history may well be worth it.

@iansinnott
Owner

iansinnott commented Aug 22, 2024

There is a new, viable option: using pgvector via pglite (https://pglite.dev/extensions/#pgvector). I'm exploring this now.
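
For a sense of what queries against pgvector look like, here is a small sketch that only builds SQL strings. The `vector(n)` column type and the `<=>` cosine distance operator come from pgvector; the `page_embedding` table, its columns, and the helper names are hypothetical, and nothing here is this project's actual schema. Executing the statements through pglite is left out.

```typescript
// Formats an embedding as pgvector's text literal, e.g. "[1,2.5,-3]".
function toVectorLiteral(embedding: number[]): string {
  return "[" + embedding.join(",") + "]";
}

// Builds the setup statements for a hypothetical page_embedding table.
function createTableSql(dimensions: number): string {
  if (dimensions <= 0 || Math.floor(dimensions) !== dimensions) {
    throw new Error("dimensions must be a positive integer");
  }
  return [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    "CREATE TABLE IF NOT EXISTS page_embedding (" +
      "url TEXT PRIMARY KEY, " +
      "embedding vector(" + dimensions + ")" +
      ");",
  ].join("\n");
}

// Nearest-neighbor query by cosine distance.
// $1 is a vector literal from toVectorLiteral, $2 is the result limit.
const SEARCH_SQL =
  "SELECT url, embedding <=> $1::vector AS distance " +
  "FROM page_embedding ORDER BY distance LIMIT $2;";
```

The appeal of this route is that the embeddings live in the same kind of SQL store the extension already uses, rather than in a separate in-memory index.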

3 participants