problems found during the ws package upgrade that make ci error #7067

Open · kryon-7 opened this issue Apr 8, 2025 · 13 comments

kryon-7 (Contributor) commented Apr 8, 2025

To be honest, the CI failure is not caused by ws itself; rather, randomDelayStorage has a problem handling concurrency. The issue existed before, but the ws upgrade changed the timing of message transmission, which made the problem much more likely to surface. There are many ways for the problem to appear. When I delete the other cases and only run the "doing insert after subscribe should end with the correct results" case, the probability of the problem occurring decreases... I don't know why.

    const c = await humansCollection.create(1);
    let result = [];
    c.insert(schemaObjects.humanData()); // do not await here!
    c.find().$.subscribe((r) => {
        console.log(`subscribe get, ${JSON.stringify(r, null, 2)}`);
        result = r;
    });

    await c.insert(schemaObjects.humanData());
    console.log('wait');
    await waitUntil(() => result.length === 3);

Let's label the three documents inserted by this code as 0, 1, and 2 in insertion order. Document 0 is created by the awaited create() call, so it is not a problem.

The issue occurs with the inserts of documents 1 and 2 together with the find query and its subscription.

randomDelayStorage waits for a random amount of time before it actually uses the underlying memory storage. So even though the network data for the bulkWrite call is sent before the query's network data, the server may still process the query first and read the current documents as [0]. The insert then executes very quickly, and the server rapidly sends the changeStream and bulkWrite packets back to the client; only after that does it send the query result back over the websocket. In this situation the client first receives the insert result, discards a changeStream event that it should have applied, and only then receives the query result [0]. The outcome depends on how fast the insert and the query are processed; it is even possible to end up with only [0, 1], with document 2 lost because its data came back too quickly. All of these extreme interleavings can occur.
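
To make the timing concrete, here is a rough sketch of the kind of wrapper randomDelayStorage is. The MinimalStorageInstance interface below is a simplified assumption for illustration only, not RxDB's real storage interface and not the actual randomDelayStorage implementation:

    // Simplified, hypothetical storage shape - NOT RxDB's real RxStorageInstance type.
    interface MinimalStorageInstance {
        bulkWrite(rows: { id: string; data: any }[]): Promise<void>;
        query(preparedQuery: any): Promise<{ documents: any[] }>;
    }

    const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

    // Every call waits a random time before it touches the inner (memory) storage.
    // Two requests issued in the order bulkWrite -> query can therefore be
    // processed in the order query -> bulkWrite, which is exactly the race above.
    function wrapWithRandomDelay(inner: MinimalStorageInstance, maxDelayMs = 50): MinimalStorageInstance {
        return {
            async bulkWrite(rows) {
                await sleep(Math.random() * maxDelayMs);
                return inner.bulkWrite(rows);
            },
            async query(preparedQuery) {
                await sleep(Math.random() * maxDelayMs);
                return inner.query(preparedQuery);
            }
        };
    }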

Solutions:

  1. Implement read-write lock control in the storage so that read and write operations are atomic with respect to each other and data stays consistent. Performance will generally decrease, but ensuring data consistency is necessary (see the sketch at the end of this comment).
  2. Accept up front that the current approach still has extreme cases (maybe; I just want to share these two feasible options): when querying, start reading the changeStream and cache it, and return a checkpoint together with the query result. The checkpoint is obtained first and the query is executed afterwards; the checkpoint is included in the query response, and every changeStream event after the checkpoint is applied on top of the result. There are still issues, because getting the query result and getting the checkpoint are not atomic operations, so concurrency problems remain possible. However, since create, update, and delete operations are essentially overwrites of the document, implementing idempotent processing is not difficult: redundant data is much simpler to handle than missing data.

The first solution is very simple and won't cause any problems. For the second solution, I'm not sure if it's feasible or if there are extreme cases that need to be considered. Additionally, since I'm new to rxdb and have only read a small portion of the source code, I'm not clear how much modification to the source code would be required. I hope to get some advice on this.
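
Here is a minimal sketch of what I mean by the first solution, reusing the simplified MinimalStorageInstance shape from the sketch above (again just an illustration, not RxDB's actual API): a single async lock serializes reads against writes, so a query can never observe a half-finished bulkWrite.

    // Tiny async mutex: run() chains tasks so they execute strictly one at a time.
    class AsyncMutex {
        private tail: Promise<unknown> = Promise.resolve();
        run<T>(task: () => Promise<T>): Promise<T> {
            const result = this.tail.then(task, task);
            this.tail = result.catch(() => undefined); // keep the chain alive on errors
            return result;
        }
    }

    // Guard reads and writes with the same lock: a query issued while a bulkWrite
    // is in flight only runs after that write has fully completed.
    function wrapWithLock(inner: MinimalStorageInstance): MinimalStorageInstance {
        const lock = new AsyncMutex();
        return {
            bulkWrite: rows => lock.run(() => inner.bulkWrite(rows)),
            query: preparedQuery => lock.run(() => inner.query(preparedQuery))
        };
    }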

kryon-7 (Contributor, Author) commented Apr 9, 2025

Alright, the second solution may not be that complex. Since the query keeps a record of the latest change event (latestChangesEvent), the approach is quite straightforward: record a checkpoint before the query is executed, and then check against this checkpoint after the ReExec is completed. During that check the validation has to be idempotent. Now everything in CI is working fine.
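
Roughly, the idea looks like this (an illustrative sketch with an assumed event shape and helper name, not RxDB's actual internals): remember a checkpoint before the query runs, buffer the changeStream, and afterwards replay everything newer than the checkpoint onto the result in an idempotent way.

    // Hypothetical change event shape; a real RxDB change event carries more fields.
    interface ChangeEvent {
        sequence: number; // position in the changeStream
        operation: 'INSERT' | 'UPDATE' | 'DELETE';
        documentId: string;
        documentData?: any;
    }

    // Replay every buffered event that is newer than the checkpoint taken before
    // the query was executed. Upserting by primary key makes re-applying an event
    // that the result already reflects harmless: redundant events are fine, only
    // missing events would corrupt the result.
    function applyEventsAfterCheckpoint(
        queryResult: Map<string, any>, // documents keyed by primary key
        bufferedEvents: ChangeEvent[],
        checkpointSequence: number
    ): Map<string, any> {
        for (const event of bufferedEvents) {
            if (event.sequence <= checkpointSequence) {
                continue; // already contained in the query result
            }
            if (event.operation === 'DELETE') {
                queryResult.delete(event.documentId);
            } else {
                queryResult.set(event.documentId, event.documentData);
            }
        }
        return queryResult;
    }

A real implementation would of course also have to re-check the query's selector and sort order for each replayed document; the sketch only shows the idempotent-merge part.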

We can now discuss whether there are any edge cases to consider for the second solution. If there are no issues, I will clean up the code style and then show you the changes again. I accidentally reformatted the code with the IDE's formatter, and it seems lint doesn't accept it. Additionally, I added some logs for debugging and forgot to remove them.

pubkey (Owner) commented Apr 10, 2025

Hi @kryon-7
Thank you for the investigation.
Do you think it is possible to craft a unit test that fails nearly 100% of the time to reproduce this? I think we have to first define exactly how a storage should behave, and when exactly it should emit events and respond to ongoing queries.

kryon-7 (Contributor, Author) commented Apr 10, 2025

> Hi @kryon-7 Thank you for the investigation. Do you think it is possible to craft a unit test that fails nearly 100% of the time to reproduce this? I think we have to first define exactly how a storage should behave, and when exactly it should emit events and respond to ongoing queries.

Of course. You can look at PR #7068. Through diagrammatic analysis, I've identified the root cause. We need to insert some data during the window between the server completing its query operation and the client receiving the query results. This timing might be difficult to control precisely. An alternative approach would be to have the server insert some data itself immediately after completing the query operation, and only then return the query results to the client.
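
As a sketch of that alternative (reusing the simplified MinimalStorageInstance shape from my earlier sketch; this is just an illustration, not a real test utility): the wrapper runs the inner query, writes one more document before handing the result back, and the returned result is stale by construction.

    // Force the race deterministically: after the inner query has produced its result,
    // write one extra document before the result is returned to the caller.
    // The caller then receives a result that is guaranteed to miss that document.
    function wrapWithInsertAfterQuery(
        inner: MinimalStorageInstance,
        makeExtraDoc: () => { id: string; data: any }
    ): MinimalStorageInstance {
        return {
            bulkWrite: rows => inner.bulkWrite(rows),
            async query(preparedQuery) {
                const result = await inner.query(preparedQuery);
                await inner.bulkWrite([makeExtraDoc()]); // lands inside the "query window"
                return result; // stale by construction
            }
        };
    }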

kryon-7 (Contributor, Author) commented Apr 10, 2025

We can make the server wait 2 seconds after finishing the query (to fake processing time) before sending results back. During this wait, the client can send lots of insert/delete/update requests.
I checked the code: randomDelayStorage returns immediately after the query, without this delay, which makes the rare issue even harder to reproduce.
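
A sketch of what such a wrapper could look like (same simplified MinimalStorageInstance shape as before; the real test helper may be structured differently): only the query response is held back, so the client has a window to issue writes while the result is still "on the wire".

    const delay = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

    // Delay only the *response* of a query: the inner storage has already produced
    // the result, but the caller does not see it yet. Writes issued during this
    // window are processed after the query on the server, yet their responses can
    // reach the client before the query result does.
    function wrapWithLongQueryDelay(inner: MinimalStorageInstance, delayMs = 2000): MinimalStorageInstance {
        return {
            bulkWrite: rows => inner.bulkWrite(rows),
            async query(preparedQuery) {
                const result = await inner.query(preparedQuery);
                await delay(delayMs); // fake processing / transfer time
                return result;
            }
        };
    }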

kryon-7 (Contributor, Author) commented Apr 10, 2025

Extending the query execution time may hit collection-creation limits. Perhaps we should configure the test cases to run sequentially rather than in parallel. Alternatively, we could create a dedicated randomDelayStorage just for this case. However, we cannot rule out that a 'delayed response after query completion' might introduce other issues.

kryon-7 (Contributor, Author) commented Apr 11, 2025

> Hi @kryon-7 Thank you for the investigation. Do you think it is possible to craft a unit test that fails nearly 100% of the time to reproduce this? I think we have to first define exactly how a storage should behave, and when exactly it should emit events and respond to ongoing queries.

OK, I've submitted PR #7075 to ensure the error can be reproduced with near 100% reliability.

Of course, this test case still passes in PR #7068

pubkey (Owner) commented Apr 15, 2025

Hi @kryon-7

> error can be reproduced with near 100% reliability

This test also only reproduces the problem if random delays are monkey-patched into the storage. Also, your test uses getStorage('memory-long-query-delay').getStorage() in all cases, so it does not test the dexie storage or any of the "normal" ones, for example.

kryon-7 (Contributor, Author) commented Apr 15, 2025

Hi @pubkey, understood. I attempted to reproduce the issue by inserting a large number of documents while simultaneously executing queries with multiple sort fields. However, I observed that the queries wait for all document insertions to complete. After reviewing the code, I noticed that all storage implementations intentionally await the completion of all operations in their respective write queues, as in this snippet:

    await this.writeQueue;

This design makes it particularly challenging to control the timing for reproducing the error. Additionally, the query and insertion performance varies across different storage implementations, making it difficult to create a universal test case that ensures all storage systems complete their operations consistently.
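
For context, the write-queue pattern referenced above works roughly like this (a simplified sketch, not the actual implementation of any particular storage): every write is chained onto a shared promise, and a read that awaits that chain only runs once all previously queued writes have been applied.

    // Simplified write-queue pattern: writes are serialized on one promise chain,
    // and a query that awaits the chain sees every write queued before it.
    class QueuedStorage {
        private writeQueue: Promise<void> = Promise.resolve();
        private docs = new Map<string, any>();

        bulkWrite(rows: { id: string; data: any }[]): Promise<void> {
            this.writeQueue = this.writeQueue.then(() => {
                for (const row of rows) {
                    this.docs.set(row.id, row.data);
                }
            });
            return this.writeQueue;
        }

        async query(): Promise<any[]> {
            await this.writeQueue; // wait until all pending writes are applied
            return Array.from(this.docs.values());
        }
    }

This is why inserting many documents while querying does not naturally interleave with local storages: the query always waits for the whole queue first.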

The issue is further complicated by several factors:

  • Local storage operations execute extremely quickly

  • The aforementioned write queue synchronization mechanism

  • Node.js's single-threaded nature

To reliably reproduce this issue, one would need to intentionally introduce an artificial delay during query result sorting. This represents an extreme edge case scenario.
But based on my testing, this error primarily manifests with MongoDB (and likely most remote storage systems) in CI, under the specific conditions I've described (inserting many docs while a query is in flight).

kryon-7 (Contributor, Author) commented Apr 16, 2025

Hello @pubkey! I've been thinking about this some more. It's really hard to hit this error when using a real storage, because the test environment is too ideal, including the network and so on. Could we use a wrapper similar to randomDelayStorage to force the real storage to wait for a while after completing the query? The situation is possible: if the real storage awaits during query processing and yields control, this interleaving can occur; it is just very hard to reproduce because the test environment is so favorable. I think only this approach can guarantee that every storage triggers the error with 100% certainty. Although this is somewhat like monkey-patching the storage, it can genuinely happen in a real environment, which is even more complex: in production we are not only running RxDB but possibly other libraries as well (React or something else) that compete for execution time, and during the period we are waiting for the query, the thread is not guaranteed to be handed to any particular piece of code.

The most critical point is this: In theory, no matter how long our query takes to execute, it should never result in errors—that's the fundamental principle!

As for why the remote storage only exhibits issues after the WebSocket upgrade, I believe it's because the ws upgrade improves concurrency, allowing more requests to be processed simultaneously, which amplifies the concurrency-related problems (and since the remote storage still relies on the memory storage underneath, the issue is particularly difficult to reproduce at the moment). This leads to situations where some insert operations complete before their corresponding query requests have finished processing, a conclusion I reached during my earlier investigation.

Under this established principle, the issue is inherently difficult to reproduce with local storage. Without artificially controlling query duration and intentionally introducing wait states, reliable reproduction becomes nearly impossible.

Each storage implementation has unique characteristics. Take Dexie as an example: it appears to employ read-write transactions that lock writes during read operations. Errors might only occur if data insertion happens precisely at the query completion moment (due to its await mechanism), while subsequent query result processing remains synchronous. In Node.js' single-threaded environment, our local storage's read/write operations share one thread. During query result processing, other insert operations cannot acquire thread execution rights, making natural reproduction unfeasible.

Conversely, remote storage lacks these constraints, making it more prone to concurrency issues.

pubkey (Owner) commented Apr 16, 2025

The thing is that JavaScript has a single process and many parts of RxDB intentionally rely on that. When the insert transaction completes, we directly emit the events to the changestream. By definition of how JavaScript works, nothing can happen "in between". Only if you patch the storage to wait at these points in time will it of course behave differently.

To reproduce your problem, you do not need a test case that fails 100% of the time; something like 10% would be good enough to work on.

kryon-7 (Contributor, Author) commented Apr 16, 2025

But remote storage runs in another process, so it can be reproduced in the MongoDB CI results.

kryon-7 (Contributor, Author) commented Apr 17, 2025

Actually, this issue isn't limited to MongoDB—it can happen with any remote storage. I tried running the remote server in a separate process instead of sharing one with the client, and the error still occurs.

stale bot commented Apr 24, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed soon. Please update it or it may be closed to keep our repository organized. The best way is to add some more information or make a pull request with a test case. Also, you might get help in fixing it at the RxDB Community Chat. If you know you will continue working on this, just write any message to the issue (like "ping") to remove the stale tag.

stale bot added the stale label on Apr 24, 2025
pubkey added the never-stale label and removed the stale label on May 5, 2025