Discussion: Semantic Search instead of Full-Text-Search #1

alexferrari88 · 2023-12-15T08:45:53Z

Hi Ian, great project. It's in my todo list of projects to build but thank god you got this done :)

I was wondering: wouldn't it be better to use semantic search instead of full-text search?

At least this was my idea for creating a project similar to yours.

I'd be glad to give more details, if my question is not clear.

(also interested in contributing, if you want to go in this direction)

iansinnott · 2023-12-16T08:30:04Z

Hey @alexferrari88, thanks for the kind words. Agreed, semantic search would be great. I've thought about it as well, but unlike sqlite-in-wasm, I'm unaware of a solution for semantic search in JS.

Running a vector store remotely is an option but one of my goals with this project was to have it be useful as a standalone product.

What are your thoughts on how to implement semantic search?

alexferrari88 · 2023-12-16T15:03:42Z

Right after submitting the issue, I started looking for a wasm vector search, since — I agree with you — it would be nicer to have this extension be sort of self-contained.

Unfortunately, solutions are still few and far between. The best I have found so far are Voy and Victor.

Of the two, Voy seems quite nice, and there are also JS examples showing how to use it.

Curious to know your thoughts about this.

iansinnott · 2023-12-17T03:18:59Z

Awesome, thanks for the links. After a quick look I have some thoughts:

Voy is currently an in-memory store, which means we'd have to load everything into memory and initialize the index. This will work, but it is not ideal, since the amount of full-text data is unbounded and will presumably be measured in gigabytes once the user has browsed for long enough.
Victor looks pretty ideal in that it uses OPFS for storage in the browser. However, OPFS does not currently work in the background thread of web extensions. More details below.
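To make the memory concern concrete, here is a minimal sketch of what any in-memory vector index has to hold. It is a brute-force cosine-similarity scan, not Voy's actual API or algorithm, and the `EmbeddedDoc` shape, the function names, and the 384-dimension figure in the comment are all assumptions for illustration.

```typescript
// Hypothetical document shape; not the schema used by the extension.
interface EmbeddedDoc {
  url: string;
  embedding: Float32Array;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k search: every embedding must already be resident in memory.
function searchInMemory(
  docs: EmbeddedDoc[],
  query: Float32Array,
  k: number
): { url: string; score: number }[] {
  return docs
    .map((d) => ({ url: d.url, score: cosineSimilarity(d.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Rough lower bound on index size: one float32 per dimension per document.
// For example, 100,000 pages at an assumed 384 dimensions is 153,600,000 bytes.
function estimateIndexBytes(docCount: number, dims: number): number {
  return docCount * dims * Float32Array.BYTES_PER_ELEMENT;
}
```

The estimate covers only the raw vectors. Any index structure built on top of them, and the page text itself if it is kept alongside, adds to that.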

I initially created this extension with WebSQL, which works for extensions using manifest v2. MV2 extensions are no longer allowed though, so while porting to MV3 I initially wanted to use OPFS and the official sqlite-wasm implementation.

I was unable to get OPFS to work in the web extension service worker. It works in browser tabs and in normal web workers, but it would not work specifically in the background service worker that replaced background scripts in MV3. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.

I ended up using IndexedDB as the backing filesystem via the excellent wa-sqlite implementation. That's the current state of things -- Using IndexedDB because it happens to work in service workers.

alexferrari88 · 2023-12-18T09:22:18Z

Thank you for looking into it further. That's unfortunate ☹️. A standalone extension that takes care of everything, without the user having to install extra stuff, would be ideal, but it seems that is not feasible at the moment.

Ideally, one could proceed with an external (but local) vector store (e.g. Chromadb) and create a repository layer that would allow an easy swap for a wasm implementation in the future. I understand this is completely outside the scope of this extension.
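A repository layer of the kind described above might look roughly like this. Every name here (`VectorStore`, `VectorHit`, `InMemoryVectorStore`, `findSimilar`) is hypothetical, and the in-memory class is only a stand-in: a class talking to an external local store, or later a wasm one, would implement the same interface.

```typescript
// Hypothetical types for illustration; none of these names come from the extension.
interface VectorHit {
  id: string;
  score: number;
}

interface VectorStore {
  upsert(id: string, embedding: number[]): Promise<void>;
  query(embedding: number[], k: number): Promise<VectorHit[]>;
}

// Dot product; equals cosine similarity when both vectors are unit length.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Stand-in backend. A remote or wasm-backed store would expose the same methods.
class InMemoryVectorStore implements VectorStore {
  private rows = new Map<string, number[]>();

  async upsert(id: string, embedding: number[]): Promise<void> {
    this.rows.set(id, embedding);
  }

  async query(embedding: number[], k: number): Promise<VectorHit[]> {
    const hits: VectorHit[] = [];
    this.rows.forEach((row, id) => hits.push({ id, score: dot(row, embedding) }));
    return hits.sort((a, b) => b.score - a.score).slice(0, k);
  }
}

// Application code depends only on the interface, so the backend can be swapped.
async function findSimilar(store: VectorStore, embedding: number[]): Promise<string[]> {
  const hits = await store.query(embedding, 5);
  return hits.map((h) => h.id);
}
```

The methods are asynchronous even for the in-memory case because a store reached over HTTP or through a worker would have to be.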

I might fork it and start working on it but can't promise anything 😎

rhashimoto · 2023-12-19T17:30:09Z

> I was unable to get OPFS to work in the web extension service worker. It works in browser tabs, and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the chrome implementation. So perhaps it's now possible.

Technically (and pedantically) speaking, OPFS should work in any context, including service workers. The restriction is that the synchronous access handles, which are what make OPFS file operations fast, are only available in dedicated workers. That is a deliberate choice, not a bug: the rationale is that blocking calls should not be used anywhere else.
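One way to check this distinction at runtime is to feature-detect the synchronous method rather than OPFS as a whole. The sketch below assumes that `createSyncAccessHandle` only appears on `FileSystemFileHandle.prototype` in contexts where it is exposed (dedicated workers). `ScopeLike` and `hasSyncAccessHandles` are hypothetical names.

```typescript
// Minimal structural type for the part of a global scope this check reads.
interface ScopeLike {
  FileSystemFileHandle?: { prototype: object };
}

// True when the current context exposes synchronous OPFS access handles.
// Assumption: the method is absent from the prototype outside dedicated workers.
function hasSyncAccessHandles(scope: ScopeLike): boolean {
  const ctor = scope.FileSystemFileHandle;
  return ctor !== undefined && "createSyncAccessHandle" in ctor.prototype;
}

// In real code the argument would be the current global, for example:
//   hasSyncAccessHandles(globalThis as ScopeLike)
// The asynchronous parts of OPFS (navigator.storage.getDirectory() and
// getFileHandle()) are a separate question and are not covered by this check.
```

A storage layer could use a check like this to choose between a fast synchronous backend and a slower asynchronous fallback.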

For Chrome extensions, although they are implemented as service workers, I think there is a workaround. An offscreen document can be attached to an extension, and this document can create a Worker where the entire OPFS API should be usable. Perhaps that path is worth exploring.
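A rough sketch of that workaround from the service worker's side follows. `OffscreenApi` is a hypothetical wrapper so the logic can be read outside Chrome. In an extension, `createDocument` would presumably be backed by `chrome.offscreen.createDocument`, which takes a `url`, a list of `reasons` (of which `"WORKERS"` is one), and a `justification`, and requires the `"offscreen"` permission in the manifest. How `hasDocument` is implemented is left open here.

```typescript
// Hypothetical wrapper around the extension's offscreen API.
interface OffscreenApi {
  hasDocument(): Promise<boolean>;
  createDocument(params: {
    url: string;
    reasons: string[];
    justification: string;
  }): Promise<void>;
}

// Creates the offscreen document once; returns true if this call created it.
async function ensureOffscreenDocument(api: OffscreenApi, url: string): Promise<boolean> {
  if (await api.hasDocument()) return false;
  await api.createDocument({
    url,
    reasons: ["WORKERS"],
    justification: "Spawn a dedicated worker for storage access",
  });
  return true;
}

// The page at `url` would then start the worker itself, e.g. with
//   new Worker("storage-worker.js")   // hypothetical file name
// and relay messages between the service worker and that dedicated worker.
```

The check before creation matters because an extension can only have one offscreen document open at a time.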

iansinnott · 2024-02-10T04:35:20Z

Thanks for chiming in, @rhashimoto. Interesting. I had looked at the offscreen document API for DOM parsing, but if it allows access to a normal worker, that might be an option. A bit roundabout, but vector search for browsing history may well be worth it.

iansinnott · 2024-08-22T00:22:26Z

There is a new, viable option: using pgvector via pglite (https://pglite.dev/extensions/#pgvector). I'm exploring this now.
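For a sense of what the pgvector route involves, here is a sketch of the SQL side. The `SqlClient` interface, the `page_embedding` table, and the function names are hypothetical. A PGlite instance with the vector extension loaded would presumably be passed in as `db`, but whether it matches this exact interface is an assumption. The SQL itself is standard pgvector: a `vector(n)` column and the `<=>` cosine-distance operator.

```typescript
// Hypothetical minimal client interface; any Postgres-style client could be adapted to it.
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// pgvector accepts vectors as text literals such as "[0.1,0.2,0.3]".
function toVectorLiteral(v: number[]): string {
  return `[${v.join(",")}]`;
}

// Enables the extension and creates a table with a fixed-dimension vector column.
async function setupSchema(db: SqlClient, dims: number): Promise<void> {
  if (!Number.isInteger(dims) || dims <= 0) throw new Error("dims must be a positive integer");
  await db.query("CREATE EXTENSION IF NOT EXISTS vector");
  await db.query(
    `CREATE TABLE IF NOT EXISTS page_embedding (url TEXT PRIMARY KEY, embedding vector(${dims}))`
  );
}

// Returns the k rows whose embeddings are closest to the query by cosine distance.
async function nearestPages(db: SqlClient, embedding: number[], k: number): Promise<unknown[]> {
  const result = await db.query(
    "SELECT url FROM page_embedding ORDER BY embedding <=> $1::vector LIMIT $2",
    [toVectorLiteral(embedding), k]
  );
  return result.rows;
}
```

Because the vectors live in a database table rather than in a JS array, only the rows a query touches need to be in memory, which addresses the concern raised earlier about in-memory stores.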

iansinnott added the "enhancement" (New feature or request) label on Dec 16, 2023
