
Non monotonic reads, e.g. prior local reads/writes reverted, due to disconnecting/connecting, particularly on a flaky network. #253

Open
nurturenature opened this issue Mar 11, 2025 · 2 comments


@nurturenature

Hi,

This is a breakout and decomposition of issue "db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly" #226

It's also a way to trigger "Local changes reverted when a replication delay is present" powersync-ja/powersync-sqlite-core#54

The "Local changes reverted when a replication delay is present" in powersync-sqlite-core was easy to reproduce in v1.10.0 and became harder in v1.11.0 using disconnecting/connecting to trigger the bug.

As of the current version, v1.12.1, adding some network instability, e.g. an occasional network partition, on top of disconnecting/connecting makes it possible to reliably trigger the bug.

Under these conditions, the bug presents in two ways:

  • a {key: value} is read that is older, one that was previously returned by prior reads, in place of the current, last read
    • i.e. the {key: value} appears non-monotonic to the client, going backwards in time (see the sketch after this list)
  • the client reads an empty table, i.e. SELECT * returns [], even though there have been successful prior reads
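
For reference, the non-monotonic read check is essentially the invariant that a client never reads an older value for a key than the maximum value it has already read or written for that key. Here is a minimal sketch of that invariant in Dart; this is not the fuzzer's actual CausalChecker, and the map-based bookkeeping is an assumption:

/// Minimal sketch of the monotonic-read invariant used to flag the bug:
/// once a client has seen {k: v}, any later read of k must return >= v.
/// This is NOT the fuzzer's CausalChecker, just the core idea.
class MonotonicReadChecker {
  // Highest value seen so far per key by this client (reads and writes).
  final Map<int, int> _maxSeen = {};

  /// Checks a read-all result and returns any keys that went backwards.
  List<String> checkReadAll(Map<int, int> readAll) {
    final violations = <String>[];
    readAll.forEach((k, v) {
      final prior = _maxSeen[k];
      if (prior != null && v < prior) {
        violations.add(
            'read of {$k: $v} is less than expected read of {$k: $prior}');
      } else {
        _maxSeen[k] = v;
      }
    });
    return violations;
  }
}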

A fuzzing application has been created to reproduce this and other anomalies, and fuzz test PowerSync in general.

Please see A Dart CLI fuzzer for PowerSync.

There's a GitHub Actions matrix, fuzz-non-monotonic-reads.

And a script for reproducing locally: Local Host - Helper Script to Loop Until Error.


Here's a typical example of working through a failed powersync_fuzz.log output.

./powersync-fuzz-loop.sh ./powersync_fuzz --table mww --clients 5 --rate 10 --time 100 --disconnect --partition

We can see the disconnecting, connecting, and partitioning nemesis activity:

$ grep 'nemesis' powersync_fuzz.log

[2025-03-11 02:18:57.576457] [main] [INFO] nemesis: disconnect/connect: start listening to stream of disconnected/connected messages
[2025-03-11 02:18:57.576504] [main] [INFO] nemesis: partition: start listening to stream of partition messages
[2025-03-11 02:19:01.847681] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (1, 3)
[2025-03-11 02:19:03.575833] [main] [INFO] nemesis: disconnect/connect: connected: clients: (1, 5, 3, 2, 4)
[2025-03-11 02:19:05.290393] [main] [INFO] nemesis: partition: bidirectional
[2025-03-11 02:19:09.693580] [main] [INFO] nemesis: partition: none
[2025-03-11 02:19:13.468192] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (2)
[2025-03-11 02:19:18.185368] [main] [INFO] nemesis: partition: outbound
[2025-03-11 02:19:21.954030] [main] [INFO] nemesis: disconnect/connect: connected: clients: (3, 4, 2, 1, 5)
[2025-03-11 02:19:25.980436] [main] [INFO] nemesis: partition: none
...
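
For context, the disconnect/connect nemesis boils down to repeatedly calling db.disconnect() and db.connect() on a random subset of clients. A minimal sketch, assuming a list of PowerSyncDatabase instances and a shared connector; the delays are made up and this is not the fuzzer's actual nemesis code:

import 'dart:math';

import 'package:powersync/powersync.dart';

/// Minimal sketch of a disconnect/connect nemesis: every few seconds,
/// disconnect a random subset of clients, then reconnect all of them.
/// The databases/connector and the delays are assumptions, not the
/// fuzzer's actual implementation.
Future<void> disconnectConnectNemesis(
  List<PowerSyncDatabase> databases,
  PowerSyncBackendConnector connector,
) async {
  final random = Random();
  while (true) {
    // Disconnect a random subset of clients.
    for (final db in databases.where((_) => random.nextBool())) {
      await db.disconnect();
    }
    await Future.delayed(const Duration(seconds: 2));

    // Reconnect every client, then pause before the next round.
    for (final db in databases) {
      await db.connect(connector: connector);
    }
    await Future.delayed(const Duration(seconds: 4));
  }
}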

We can see the nemesis is triggering upload/download errors in the SyncStatus stream, which the client treats as ignorable:

$ grep 'ignorable' powersync_fuzz.log

[2025-03-11 02:19:21.948436] [ps-5] [INFO] SyncStatus: ignorable upload error in statusStream: ClientException: Connection closed before full header was received, uri=http://powersync:8080/write-checkpoint2.json?client_id=adb498d8-3697-4459-b009-823ee5a1f83b
...
[2025-03-11 02:19:31.108449] [ps-1] [INFO] SyncStatus: ignorable download error in statusStream: ClosedException
...
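
The "ignorable" lines come from listening to the SyncStatus stream and logging upload/download errors instead of failing the test. A minimal sketch, assuming SyncStatus exposes uploadError/downloadError and using package:logging; the logger wiring is hypothetical:

import 'package:logging/logging.dart';
import 'package:powersync/powersync.dart';

/// Minimal sketch: treat upload/download errors surfaced in the
/// SyncStatus stream as ignorable by logging them rather than failing.
/// Assumes SyncStatus exposes uploadError/downloadError.
void watchSyncStatus(PowerSyncDatabase db, Logger log) {
  db.statusStream.listen((SyncStatus status) {
    if (status.uploadError != null) {
      log.info('SyncStatus: ignorable upload error in statusStream: '
          '${status.uploadError}');
    }
    if (status.downloadError != null) {
      log.info('SyncStatus: ignorable download error in statusStream: '
          '${status.downloadError}');
    }
  });
}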

And now for the non-monotonic read error.

At the end of powersync_fuzz.log, the SEVERE messages will show the cause.
The easiest cause to debug is "expected because myPreviousRead".

Here is clientNum: 3 reading backwards in time:

[2025-03-11 02:20:19.507539] [main] [SEVERE] read of {0: 422} is less than expected read of {0: 703}, expected because myPreviousRead, in op: {clientNum: 3, clientType: ps, f: txn, id: 156, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 422, ..., 99: 402}}, {f: write-some, k: -1, v: {85: 818, 14: 818, 64: 818, 27: 818}}]}
[2025-03-11 02:20:19.507554] [main] [SEVERE] CausalChecker client states with {0: [422, 703]}:
[2025-03-11 02:20:19.507561] [main] [SEVERE] 	{0: 703}: in client 1: reason: myPreviousRead
[2025-03-11 02:20:19.507565] [main] [SEVERE] 	{0: 703}: in client 3: reason: myPreviousRead
[2025-03-11 02:20:19.507569] [main] [SEVERE] CausalChecker Monotonic Read/Write Follows Reads states:
[2025-03-11 02:20:19.507583] [main] [SEVERE] 	{0: 422}: written when: [422, ..., 316]
[2025-03-11 02:20:19.507599] [main] [SEVERE] 	{0: 703}: written when: [703, ..., 650]

We can see that client 3 read the correct, current value many times previously:

$ grep 'ps-3' powersync_fuzz.log |  grep '0: 703'

[2025-03-11 02:20:09.679222] [ps-3] [FINE] SQL txn: response: {clientNum: 3, clientType: ps, f: txn, id: 140, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 665}}, {f: write-some, k: -1, v: {13: 720, 57: 720, 6: 720, 90: 720}}]}
...
read 15 additional times
...

And observe the correct value originally being written, uploaded, and read by client 1:

$ grep 'ps-1' powersync_fuzz.log |  grep '0: 703'

# the write
[2025-03-11 02:20:07.979747] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 141, table: mww, type: ok, value: [{f: read-all, k: -1, v: {...}}, {f: write-some, k: -1, v: {98: 703, 12: 703, 18: 703, 0: 703}}]}

# a read
[2025-03-11 02:20:08.579595] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 142, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 650}}, {f: write-some, k: -1, v: {29: 709, 98: 709, 66: 709, 25: 709}}]}

# the BackendConnector.upload()
[2025-03-11 02:20:09.179271] [ps-1] [FINER] uploadData: txn: 142 patch: {0: 703}

... multiple additional reads
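
For reference, the uploadData log line corresponds to the BackendConnector.uploadData() path: drain the local CRUD queue and send each entry's patch to the backend. A minimal sketch with a hypothetical HTTP endpoint and credentials; this is not the fuzzer's actual connector:

import 'dart:convert';

import 'package:http/http.dart' as http;
import 'package:powersync/powersync.dart';

/// Minimal sketch of a backend connector that uploads the local CRUD
/// queue. The endpoint URLs, token, and JSON shape are hypothetical;
/// this is not the fuzzer's actual BackendConnector.
class SketchConnector extends PowerSyncBackendConnector {
  @override
  Future<PowerSyncCredentials?> fetchCredentials() async {
    // Hypothetical local sync service and dev token.
    return PowerSyncCredentials(
        endpoint: 'http://powersync:8080', token: 'dev-token');
  }

  @override
  Future<void> uploadData(PowerSyncDatabase database) async {
    final batch = await database.getCrudBatch();
    if (batch == null) return; // nothing queued

    for (final CrudEntry entry in batch.crud) {
      // opData holds the changed columns, e.g. the {k, v} patch.
      await http.post(
        Uri.parse('http://backend/api/upload'), // hypothetical endpoint
        headers: {'Content-Type': 'application/json'},
        body: jsonEncode({
          'table': entry.table,
          'id': entry.id,
          'op': entry.op.name,
          'data': entry.opData,
        }),
      );
    }

    // Mark the batch as uploaded so it is removed from the queue.
    await batch.complete();
  }
}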

There may actually be multiple bugs exposed by disconnecting/connecting with a flaky network.

The GitHub Actions workflow runs a matrix of 240 jobs, of which:

  • 25 fail with non monotonic reads
  • 16 fail with empty table reads
  • 141 fail with divergent final reads

The test matrix has been tuned over time to elicit the maximum number of failures.
This issue focuses on the non-monotonic and empty reads. There are cleaner ways to produce divergent reads, and they behave differently, so I will open an independent issue for them.

The first local Docker build takes a bit of time, mostly creating the Flutter/Dart build image.

The local script that loops until error will sometimes find an error on the first attempt; other times it will grind and fuzz for a bit.
That's the nature of fuzzing.

I know this is a lot.
I hope that the fuzzing environment can be made easy to use, maybe even run as an action on commits, so please ask questions and request any changes that would be helpful.

Thanks!

@rkistner
Contributor

Thanks - the fuzzing scripts look great!

I got the fuzzing scripts running on my machine. The first iteration failed with Divergent final reads!, but the actual cause appeared to be a timeout while downloading the data: waited for db.currentStatus.downloading == false: tried 31 times every 1000ms. I haven't investigated this further.

The second iteration did fail with "non monotonic reads". So at the very least I can reproduce your results, and start investigating the issue.

@nurturenature
Author

The first iteration failed with Divergent final reads!, but the actual cause appeared to be a timeout while downloading the data: waited for db.currentStatus.downloading == false: tried 31 times every 1000ms

It's a bit arbitrary to say the test is conclusively complete when faults disrupt replication. After running the transactions at the given --rate for the given --time, the test:

  • waits for 3 seconds
  • then waits for UploadQueueStats.count == 0 on all clients
  • then waits for SyncStatus.downloading == false on all clients
  • then does final reads on each client for comparison

It seems that once faults disrupt replication to the point that it doesn't resume, SyncStatus.downloading stays true indefinitely. The test gives it 30s before calling it quits and doing the final reads.
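
For reference, the settling logic above is roughly the following per-client wait loop. A minimal sketch, assuming db.getUploadQueueStats() and db.currentStatus.downloading; the retry counts mirror the ~30 tries every 1000ms from the log, but this is not the fuzzer's actual code:

import 'package:powersync/powersync.dart';

/// Minimal sketch of the end-of-test settling logic described above:
/// wait, drain the upload queue, then give downloading up to ~30s to
/// finish before doing final reads. Not the fuzzer's actual code.
Future<void> waitForQuiescence(List<PowerSyncDatabase> databases) async {
  // Let in-flight transactions land.
  await Future.delayed(const Duration(seconds: 3));

  for (final db in databases) {
    // Wait for the local upload queue to drain.
    while ((await db.getUploadQueueStats()).count != 0) {
      await Future.delayed(const Duration(seconds: 1));
    }

    // Wait for downloading == false, but give up after ~30 tries,
    // since a disrupted replication stream can stay "downloading" forever.
    var tries = 0;
    while (db.currentStatus.downloading && tries < 30) {
      await Future.delayed(const Duration(seconds: 1));
      tries++;
    }
  }
  // Final reads on each client happen after this returns.
}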

I'll work on a better/cleaner/clearer way to detect:

  • that replication has stopped
  • no clients are being updated
  • try to identify the last replicated transaction for each client
