problems found during the ws package upgrade that make ci error #7067
Hi @kryon-7
Of course. You can look at PR #7068. Through diagrammatic analysis, I've identified the root cause. We need to insert some data during the window between the server completing its query operation and the client receiving the query results. This timing might be difficult to control precisely. An alternative approach would be to have the server autonomously insert some data immediately after completing the query operation, and only then return the query results to the client.
We can make the server wait 2 seconds after finishing the query (to fake processing time) before sending results back. During this wait, the client can send lots of insert/delete/update requests.
Extending query execution time may trigger collection creation limits. Perhaps we should configure the test cases to run sequentially rather than in parallel. Alternatively, we could create a dedicated randomDelayStorage to test this case. However, we cannot guarantee that the "delayed response after query completion" won't introduce other potential issues.
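The delayed-response idea discussed above can be sketched as a thin wrapper around a storage instance. This is a minimal sketch, assuming a hypothetical simplified `StorageLike` interface and a `withPostQueryDelay` helper (not the real RxDB storage API):

```typescript
// Hypothetical simplified storage shape, NOT the real RxDB storage interface.
type Doc = { id: string; value: number };

interface StorageLike {
  query(): Promise<Doc[]>;
  bulkWrite(docs: Doc[]): Promise<void>;
}

// Wrap a storage so query() finishes its work first, then waits before
// returning results -- giving concurrent writes a window to land in between.
function withPostQueryDelay(inner: StorageLike, delayMs: number): StorageLike {
  return {
    async query() {
      const result = await inner.query(); // query work is already done here
      await new Promise((res) => setTimeout(res, delayMs)); // artificial gap
      return result; // results may be stale by the time they are delivered
    },
    bulkWrite: (docs) => inner.bulkWrite(docs),
  };
}
```

A write issued during the delay window then lands after the query snapshot was taken but before the client receives it, which is exactly the race being discussed.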
OK, I've submitted PR #7075 to ensure the error can be reproduced with near-100% reliability. Of course, this test case still passes in PR #7068.
Hi @kryon-7
This test also only reproduces if random delays are monkey-patched into the storage. Also, your test uses
Hi @pubkey, I understand. I attempted to reproduce the issue by inserting a large number of documents while simultaneously executing queries with multiple sort fields. However, I observed that the queries wait for all document insertions to complete. After reviewing the code, I noticed that all storage implementations intentionally await the completion of all operations in their respective write queues, as shown in this snippet: `await this.writeQueue;`. This design makes it particularly challenging to control the timing needed to reproduce the error. Additionally, query and insertion performance varies across storage implementations, making it difficult to create a universal test case that ensures all storage systems complete their operations consistently. The issue is further complicated by several factors:
To reliably reproduce this issue, one would need to intentionally introduce an artificial delay during query result sorting. This represents an extreme edge-case scenario.
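The write-queue behavior mentioned above (`await this.writeQueue;`) follows a common promise-chaining pattern. Here is a minimal sketch using a hypothetical `QueuedWriter` class; the actual RxDB storage implementations differ in detail:

```typescript
// Hypothetical sketch of a promise-chain write queue, not real RxDB code.
class QueuedWriter {
  private writeQueue: Promise<unknown> = Promise.resolve();

  // Every write first awaits all previously queued writes, so writes are
  // strictly serialized even when callers do not await each other.
  write(task: () => Promise<void>): Promise<void> {
    const run = this.writeQueue.then(() => task());
    this.writeQueue = run.catch(() => undefined); // keep the chain alive on errors
    return run;
  }
}
```

Because each write awaits the whole chain before running, a query that is also funneled through (or blocked behind) this queue cannot interleave with pending writes, which is why the timing is so hard to control in a test.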
Hello @pubkey, I've been thinking about this. It's really hard to encounter this error when using the real storage, because the test environment is too ideal, including the network and so on. Could we use a wrapper similar to randomDelayStorage to force the real storage to wait for a while after completing the query? This is possible: if the real storage calls await during query processing and yields the thread, this situation can occur. Since the test environment is quite favorable, it's very difficult to reproduce. I think only this approach can ensure that all storages trigger this error with 100% certainty.

Although this is somewhat like monkey-patching the storage, it can indeed happen in a real environment, which is even more complex. In a real environment, we are not only using RxDB but may also be using other libraries (such as React) that compete for thread execution, and the thread may not yield at the same points during the query wait.

The most critical point is this: in theory, no matter how long a query takes to execute, it should never result in errors. That's the fundamental principle.

As for why the remote storage only exhibits issues after the WebSocket upgrade, I believe the ws upgrade improves concurrency performance, allowing more requests to be processed simultaneously, which amplifies the concurrency-related problems (and since the remote storage still relies on memory storage underneath, the issue is particularly difficult to reproduce at the moment). This leads to situations where some insert operations complete before their corresponding query requests have finished processing, a conclusion I reached during my earlier investigation. Under this principle, the issue is inherently difficult to reproduce with local storage. Without artificially controlling query duration and intentionally introducing wait states, reliable reproduction becomes nearly impossible.
Each storage implementation has unique characteristics. Take Dexie as an example: it appears to use read-write transactions that lock writes during read operations. Errors might only occur if a data insertion happens precisely at the moment a query completes (due to its await), while subsequent query result processing remains synchronous. In Node.js' single-threaded environment, our local storage's read/write operations share one thread; during query result processing, other insert operations cannot acquire execution rights, making natural reproduction unfeasible. Conversely, remote storage lacks these constraints, making it more prone to concurrency issues.
The thing is that JavaScript has a single process, and many parts of RxDB intentionally rely on that. If the insert transaction completes, we directly emit the events to the changestream. By definition of how JavaScript works, nothing can happen "in between". Only if you patch the storage to wait at these points in time will it of course behave differently. To reproduce your problem, you do not need a test case that fails 100% of the time; around 10% would be good enough to work on.
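The single-threaded argument above can be illustrated directly: without an await between the insert and the event emission, no other callback can run in between, whereas an await that yields to the event loop opens exactly the gap a patched storage introduces. An illustrative sketch with hypothetical function names, not RxDB code:

```typescript
// Hypothetical demo functions, not RxDB code.
const log: string[] = [];

function insertAndEmitSync(): void {
  log.push('insert');
  log.push('emit'); // no await in between: nothing else can run here
}

async function insertAndEmitWithGap(): Promise<void> {
  log.push('insert');
  await new Promise((r) => setTimeout(r, 0)); // yields the thread
  log.push('emit'); // other queued work may have run in the gap
}
```

In the synchronous version the pair is atomic from the event loop's perspective; in the awaiting version, any queued callback (e.g. a competing write) slips into the gap.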
But remote storage is another process, so it can be reproduced in the MongoDB CI results.
Actually, this issue isn't limited to MongoDB—it can happen with any remote storage. I tried running the remote server in a separate process instead of sharing one with the client, and the error still occurs. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed soon. Please update it or it may be closed to keep our repository organized. The best way is to add some more information or make a pull request with a test case. Also you might get help in fixing it at the RxDB Community Chat If you know you will continue working on this, just write any message to the issue (like "ping") to remove the stale tag. |
To be honest, the CI issue is not related to ws itself; rather, randomDelayStorage has a problem with handling concurrency. The issue existed before, but the ws upgrade changed the speed of message transmission, which amplified it. The issue itself exists; the ws upgrade only increased the likelihood of it occurring. There are many ways for the problem to appear. When I delete other cases and only run the
`doing insert after subscribe should end with the correct results`
case, the probability of the problem occurring decreases; I don't know why. We mark the three documents inserted in this code as 0, 1, 2 according to insertion order. Document 0 is created during the create operation and is awaited, so there's no problem with it.
The issue occurs with the last inserts of documents 1 and 2 and the find operation with subscription.
randomDelayStorage randomly delays for a period of time before actually using the memory storage. Therefore, even though the network data for the bulkWrite method is sent before the query method's network data, it's still possible that the server processes the query first and reads the current document data as [0]. Then the insert method executes very quickly, rapidly returning the changeStream and bulkWrite packets to the client; after sending these, it sends the query result back to the client via WebSocket. In this situation, the client first gets the insert result, discards one change from the changeStream that it should have kept, and then gets the query result [0]. This depends on the relative processing speed of insertion and query. It's even possible to get only [0, 1], with the data for 2 being lost because it was returned too quickly. These extreme possibilities can occur.
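The out-of-order completion described above can be modeled with a toy server whose per-request delay (as in randomDelayStorage) decides completion order independently of arrival order. A hedged sketch with hypothetical names, not the real RxDB remote protocol:

```typescript
// Hypothetical toy model; not the real RxDB remote messaging protocol.
type Req = { kind: 'bulkWrite' | 'query'; delayMs: number };

async function serveInDelayOrder(reqs: Req[]): Promise<string[]> {
  const completed: string[] = [];
  // Requests are handled concurrently; each one's artificial delay decides
  // when it "finishes" on the server, regardless of arrival order.
  await Promise.all(
    reqs.map(async (req) => {
      await new Promise((r) => setTimeout(r, req.delayMs));
      completed.push(req.kind);
    })
  );
  return completed; // completion order, which is what the client observes
}
```

If the bulkWrite draws a longer random delay than the query that arrived after it, the client sees the query answered from the pre-write state, matching the [0] result described above.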
Solutions:
The first solution is very simple and won't cause any problems. For the second solution, I'm not sure whether it's feasible or whether there are extreme cases that need to be considered. Additionally, since I'm new to RxDB and have only read a small portion of the source code, I'm not clear how much modification to the source code would be required. I hope to get some advice on this.