
Non monotonic reads, e.g. prior local reads/writes reverted, due to disconnecting/connecting, particularly on a flaky network. #253

Open
nurturenature opened this issue Mar 11, 2025 · 2 comments


@nurturenature

Hi,

This is a breakout and decomposition of issue "db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly" #226

It's also a way to trigger "Local changes reverted when a replication delay is present" powersync-ja/powersync-sqlite-core#54

The "Local changes reverted when a replication delay is present" in powersync-sqlite-core was easy to reproduce in v1.10.0 and became harder in v1.11.0 using disconnecting/connecting to trigger the bug.

As of the current version, v1.12.1, adding some network instability, e.g. an occasional network partition, on top of disconnecting/connecting makes it possible to reliably trigger the bug.

Under these conditions, the bug presents in two ways:

  • a {key: value} is read that is older, one that was previously returned by prior reads, in place of the current, last read
    • i.e. the {key: value} appears non-monotonic to the client, going backwards in time (see the sketch after this list)
  • the client reads an empty table, i.e. SELECT * returns [], even though there have been successful prior reads
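
For reference, the non-monotonic read check is essentially the invariant that a client never reads an older value for a key than the maximum value it has already read or written for that key. Here is a minimal sketch of that invariant in Dart; this is not the fuzzer's actual CausalChecker, and the map-based bookkeeping is an assumption:

/// Minimal sketch of the monotonic-read invariant used to flag the bug:
/// once a client has seen {k: v}, any later read of k must return >= v.
/// This is NOT the fuzzer's CausalChecker, just the core idea.
class MonotonicReadChecker {
  // Highest value seen so far per key by this client (reads and writes).
  final Map<int, int> _maxSeen = {};

  /// Checks a read-all result and returns any keys that went backwards.
  List<String> checkReadAll(Map<int, int> readAll) {
    final violations = <String>[];
    readAll.forEach((k, v) {
      final prior = _maxSeen[k];
      if (prior != null && v < prior) {
        violations.add(
            'read of {$k: $v} is less than expected read of {$k: $prior}');
      } else {
        _maxSeen[k] = v;
      }
    });
    return violations;
  }
}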

A fuzzing application has been created to reproduce this and other anomalies, and fuzz test PowerSync in general.

Please see A Dart CLI fuzzer for PowerSync.

There's a GitHub Actions matrix, fuzz-non-monotonic-reads.

And a script for reproducing locally: Local Host - Helper Script to Loop Until Error.


Here's a typical example of working through a failed powersync_fuzz.log output.

./powersync-fuzz-loop.sh ./powersync_fuzz --table mww --clients 5 --rate 10 --time 100 --disconnect --partition

We can see the disconnecting, connecting, and partitioning nemesis activity:

$ grep 'nemesis' powersync_fuzz.log

[2025-03-11 02:18:57.576457] [main] [INFO] nemesis: disconnect/connect: start listening to stream of disconnected/connected messages
[2025-03-11 02:18:57.576504] [main] [INFO] nemesis: partition: start listening to stream of partition messages
[2025-03-11 02:19:01.847681] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (1, 3)
[2025-03-11 02:19:03.575833] [main] [INFO] nemesis: disconnect/connect: connected: clients: (1, 5, 3, 2, 4)
[2025-03-11 02:19:05.290393] [main] [INFO] nemesis: partition: bidirectional
[2025-03-11 02:19:09.693580] [main] [INFO] nemesis: partition: none
[2025-03-11 02:19:13.468192] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (2)
[2025-03-11 02:19:18.185368] [main] [INFO] nemesis: partition: outbound
[2025-03-11 02:19:21.954030] [main] [INFO] nemesis: disconnect/connect: connected: clients: (3, 4, 2, 1, 5)
[2025-03-11 02:19:25.980436] [main] [INFO] nemesis: partition: none
...
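
For context, the disconnect/connect nemesis boils down to repeatedly calling db.disconnect() and db.connect() on a random subset of clients. A minimal sketch, assuming a list of PowerSyncDatabase instances and a shared connector; the delays are made up and this is not the fuzzer's actual nemesis code:

import 'dart:math';

import 'package:powersync/powersync.dart';

/// Minimal sketch of a disconnect/connect nemesis: every few seconds,
/// disconnect a random subset of clients, then reconnect all of them.
/// The databases/connector and the delays are assumptions, not the
/// fuzzer's actual implementation.
Future<void> disconnectConnectNemesis(
  List<PowerSyncDatabase> databases,
  PowerSyncBackendConnector connector,
) async {
  final random = Random();
  while (true) {
    // Disconnect a random subset of clients.
    for (final db in databases.where((_) => random.nextBool())) {
      await db.disconnect();
    }
    await Future.delayed(const Duration(seconds: 2));

    // Reconnect every client, then pause before the next round.
    for (final db in databases) {
      await db.connect(connector: connector);
    }
    await Future.delayed(const Duration(seconds: 4));
  }
}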

We can see the nemesis is triggering upload/download errors in the SyncStatus stream, which the client treats as ignorable:

$ grep 'ignorable' powersync_fuzz.log

[2025-03-11 02:19:21.948436] [ps-5] [INFO] SyncStatus: ignorable upload error in statusStream: ClientException: Connection closed before full header was received, uri=http://powersync:8080/write-checkpoint2.json?client_id=adb498d8-3697-4459-b009-823ee5a1f83b
...
[2025-03-11 02:19:31.108449] [ps-1] [INFO] SyncStatus: ignorable download error in statusStream: ClosedException
...
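
The "ignorable" lines come from listening to the SyncStatus stream and logging upload/download errors instead of failing the test. A minimal sketch, assuming SyncStatus exposes uploadError/downloadError and using package:logging; the logger wiring is hypothetical:

import 'package:logging/logging.dart';
import 'package:powersync/powersync.dart';

/// Minimal sketch: treat upload/download errors surfaced in the
/// SyncStatus stream as ignorable by logging them rather than failing.
/// Assumes SyncStatus exposes uploadError/downloadError.
void watchSyncStatus(PowerSyncDatabase db, Logger log) {
  db.statusStream.listen((SyncStatus status) {
    if (status.uploadError != null) {
      log.info('SyncStatus: ignorable upload error in statusStream: '
          '${status.uploadError}');
    }
    if (status.downloadError != null) {
      log.info('SyncStatus: ignorable download error in statusStream: '
          '${status.downloadError}');
    }
  });
}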

And now for the non-monotonic read error.

At the end of powersync_fuzz.log, the SEVERE messages will show the cause.
The easiest cause to debug is "expected because myPreviousRead".

Here is clientNum: 3 reading backwards in time:

[2025-03-11 02:20:19.507539] [main] [SEVERE] read of {0: 422} is less than expected read of {0: 703}, expected because myPreviousRead, in op: {clientNum: 3, clientType: ps, f: txn, id: 156, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 422, ..., 99: 402}}, {f: write-some, k: -1, v: {85: 818, 14: 818, 64: 818, 27: 818}}]}
[2025-03-11 02:20:19.507554] [main] [SEVERE] CausalChecker client states with {0: [422, 703]}:
[2025-03-11 02:20:19.507561] [main] [SEVERE] 	{0: 703}: in client 1: reason: myPreviousRead
[2025-03-11 02:20:19.507565] [main] [SEVERE] 	{0: 703}: in client 3: reason: myPreviousRead
[2025-03-11 02:20:19.507569] [main] [SEVERE] CausalChecker Monotonic Read/Write Follows Reads states:
[2025-03-11 02:20:19.507583] [main] [SEVERE] 	{0: 422}: written when: [422, ..., 316]
[2025-03-11 02:20:19.507599] [main] [SEVERE] 	{0: 703}: written when: [703, ..., 650]

We can see that client 3 read the correct, current value many times previously:

$ grep 'ps-3' powersync_fuzz.log |  grep '0: 703'

[2025-03-11 02:20:09.679222] [ps-3] [FINE] SQL txn: response: {clientNum: 3, clientType: ps, f: txn, id: 140, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 665}}, {f: write-some, k: -1, v: {13: 720, 57: 720, 6: 720, 90: 720}}]}
...
read 15 additional times
...

And observe the correct value originally being written, uploaded, and read by client 1:

$ grep 'ps-1' powersync_fuzz.log |  grep '0: 703'

# the write
[2025-03-11 02:20:07.979747] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 141, table: mww, type: ok, value: [{f: read-all, k: -1, v: {...}}, {f: write-some, k: -1, v: {98: 703, 12: 703, 18: 703, 0: 703}}]}

# a read
[2025-03-11 02:20:08.579595] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 142, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 650}}, {f: write-some, k: -1, v: {29: 709, 98: 709, 66: 709, 25: 709}}]}

# the BackendConnector.upload()
[2025-03-11 02:20:09.179271] [ps-1] [FINER] uploadData: txn: 142 patch: {0: 703}

... multiple additional reads
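
For reference, the uploadData log line corresponds to the BackendConnector.uploadData() path: drain the local CRUD queue and send each entry's patch to the backend. A minimal sketch with a hypothetical HTTP endpoint and credentials; this is not the fuzzer's actual connector:

import 'dart:convert';

import 'package:http/http.dart' as http;
import 'package:powersync/powersync.dart';

/// Minimal sketch of a backend connector that uploads the local CRUD
/// queue. The endpoint URLs, token, and JSON shape are hypothetical;
/// this is not the fuzzer's actual BackendConnector.
class SketchConnector extends PowerSyncBackendConnector {
  @override
  Future<PowerSyncCredentials?> fetchCredentials() async {
    // Hypothetical local sync service and dev token.
    return PowerSyncCredentials(
        endpoint: 'http://powersync:8080', token: 'dev-token');
  }

  @override
  Future<void> uploadData(PowerSyncDatabase database) async {
    final batch = await database.getCrudBatch();
    if (batch == null) return; // nothing queued

    for (final CrudEntry entry in batch.crud) {
      // opData holds the changed columns, e.g. the {k, v} patch.
      await http.post(
        Uri.parse('http://backend/api/upload'), // hypothetical endpoint
        headers: {'Content-Type': 'application/json'},
        body: jsonEncode({
          'table': entry.table,
          'id': entry.id,
          'op': entry.op.name,
          'data': entry.opData,
        }),
      );
    }

    // Mark the batch as uploaded so it is removed from the queue.
    await batch.complete();
  }
}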

There may actually be multiple bugs exposed by disconnecting/connecting with a flaky network.

The GitHub Actions workflow runs a matrix of 240 jobs, of which:

  • 25 fail with non monotonic reads
  • 16 fail with empty table reads
  • 141 fail with divergent final reads

The test matrix has been tuned over time to elicit the maximum number of failures.
This issue focuses on the non-monotonic and empty reads. There are cleaner ways to produce divergent reads, and they behave differently, so I will open an independent issue for them.

The first local Docker build takes a bit of time, mostly creating the Flutter/Dart build image.

The local script that loops until error will sometimes find an error on the first attempt; other times it will grind and fuzz for a bit.
That's the nature of fuzzing.

I know this is a lot.
I hope that the fuzzing environment can be made easy to use, maybe even run as an action on commits, so please ask questions and request any changes that would be helpful.

Thanks!

@rkistner
Contributor

Thanks - the fuzzing scripts look great!

I got the fuzzing scripts running on my machine. The first iteration failed with Divergent final reads!, but the actual cause appeared to be a timeout while downloading the data: waited for db.currentStatus.downloading == false: tried 31 times every 1000ms. I haven't investigated this further.

The second iteration did fail with "non monotonic reads". So at the very least I can reproduce your results, and start investigating the issue.

@nurturenature
Author

The first iteration failed with Divergent final reads!, but the actual cause appeared to be a timeout while downloading the data: waited for db.currentStatus.downloading == false: tried 31 times every 1000ms

It's a bit arbitrary to say the test is conclusively complete when faults disrupt replication. After running the transactions at the given --rate for the given --time, the test:

  • waits for 3 seconds
  • then waits for UploadQueueStats.count == 0 on all clients
  • then waits for SyncStatus.downloading == false on all clients
  • then does final reads on each client for comparison

It seems that once faults disrupt replication to the point that it doesn't resume, SyncStatus.downloading stays true indefinitely. The test gives it 30s before calling it quits and doing the final reads.
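
For reference, the settling logic above is roughly the following per-client wait loop. A minimal sketch, assuming db.getUploadQueueStats() and db.currentStatus.downloading; the retry counts mirror the ~30 tries every 1000ms from the log, but this is not the fuzzer's actual code:

import 'package:powersync/powersync.dart';

/// Minimal sketch of the end-of-test settling logic described above:
/// wait, drain the upload queue, then give downloading up to ~30s to
/// finish before doing final reads. Not the fuzzer's actual code.
Future<void> waitForQuiescence(List<PowerSyncDatabase> databases) async {
  // Let in-flight transactions land.
  await Future.delayed(const Duration(seconds: 3));

  for (final db in databases) {
    // Wait for the local upload queue to drain.
    while ((await db.getUploadQueueStats()).count != 0) {
      await Future.delayed(const Duration(seconds: 1));
    }

    // Wait for downloading == false, but give up after ~30 tries,
    // since a disrupted replication stream can stay "downloading" forever.
    var tries = 0;
    while (db.currentStatus.downloading && tries < 30) {
      await Future.delayed(const Duration(seconds: 1));
      tries++;
    }
  }
  // Final reads on each client happen after this returns.
}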

I'll work on a better/cleaner/clearer way to detect:

  • that replication has stopped
  • no clients are being updated
  • try to identify the last replicated transaction for each client
