  WIP: WIP-0013
  Title: Apr 2021 Chain Fork Post-Mortem
  Author: The witnet-rust developers 
  Discussions-To: `#dev-general` channel on Witnet Community's Discord server
  Status: Final
  Type: Informational
  Created: 2021-04-24
  License: PD

What went wrong

For reasons much like those described in WIP-0010, on April 22, 2021, the size of the superblock consensus committee progressively shrank until it reached the minimum of 1.

After several hours of a total lack of superblock consensus, different segments of the network consolidated different superblocks. This resulted in multiple, incompatible chain forks. The last block that was common to all of them was #364269 with hash a23f647471f0976a358de9d7ac901a3f715e78335e6f412d58e5b037c11668ea.

At that point, at least two forks were registered:

| Chain | First block number after #364269 | First block hash after #364269 |
|-------|----------------------------------|--------------------------------|
| A | #364270 | 462d4fc6815a8492a07d133f796e66919ba459ee6431ce3376ad47f311744821 |
| B | #365292 | 8a09fd775d473340bec892c7c6cd543e01922fce5ba3e94a1892800b11c43d3e |

What went right

By April 24, the A chain above appeared healthy: it showed a consistent superblock committee size of 100 and a low rate of rollbacks.

Root cause

The root cause of this network disruption is assumed to be the same as the one described in WIP-0010.

Action items

Immediate

These immediate actions are geared towards resuming the normal operation of the network. That is, making sure that all nodes can get a valid chain state on top of which new blocks can be appended and validated by the entire network.

These actions are aimed specifically at nodes that were not able to keep up with the A chain above.

  1. Release a witnet-rust version, tagged as 1.2.1.
  2. Optionally, make this release preemptively soften some peering restrictions that are not consensus-critical (e.g. inbound Sybil protection, icing period, outbound peers target, etc.) to facilitate the discovery of potential peers and speed up the recovery.
  3. Release a recovery script that evaluates whether a local node is on the A chain and, if it is not, rewinds the chain back to the common checkpoint (364269) using the rewind CLI method. This script can also let the node know the public addresses of stable nodes that are known to be on the A chain.
  4. Bundle the recovery script into the witnet/witnet-rust:1.2.1 and witnet/witnet-rust:latest docker images, and modify migrations.sh so the recovery script is run upon starting or restarting a container.

Nodes that are not on the A chain SHOULD upgrade to 1.2.1 as soon as possible.

Nodes that are already on the A chain SHOULD NOT upgrade to 1.2.1: upgrading would make no difference to them, and restarting them would reduce the set of available connections to peers that are already on the leading chain.

Node operators can check whether a local node is on the A chain by running this one-liner:

docker exec witnet_node sh -c "witnet node blockchain --epoch 368699 --limit 1 2>&1 | grep -q '#368699 had digest 3b0b03df' && echo 'Your node seems to be OK' || echo 'Oh no! Your node is forked'"

This is the equivalent for nodes not using docker:

./witnet node blockchain --epoch 368699 --limit 1 2>&1 | grep -q '#368699 had digest 3b0b03df' && echo 'Your node seems to be OK' || echo 'Oh no! Your node is forked'

The source code for the recovery script can be found in the witnet-rust repository.
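For orientation only, the flow described in step 3 can be sketched as follows. This is not the script shipped in the repository: it reuses the epoch and digest from the one-liners above, and everything else is an assumption made for illustration, namely the function names, the `WITNET_BIN` and `RUN_RECOVERY` variables, and the exact arguments passed to the rewind CLI method.

```sh
#!/bin/sh
# Minimal sketch of the recovery flow; NOT the official recovery script.

WITNET_BIN="${WITNET_BIN:-./witnet}"  # path to the node binary (assumption)
CHECK_EPOCH=368699                    # epoch checked by the one-liners above
CHECK_DIGEST=3b0b03df                 # digest prefix of that block on the A chain
CHECKPOINT=364269                     # last block common to all forks

# Succeeds if the given `blockchain` output contains the A chain block.
is_on_chain_a() {
    printf '%s\n' "$1" | grep -q "#${CHECK_EPOCH} had digest ${CHECK_DIGEST}"
}

# Prints the action to take for the given output: "keep" or "rewind".
# Like the one-liners above, any output without the expected digest
# (including an error message) is treated as a fork.
decide_action() {
    if is_on_chain_a "$1"; then
        echo keep
    else
        echo rewind
    fi
}

recover() {
    output=$("$WITNET_BIN" node blockchain --epoch "$CHECK_EPOCH" --limit 1 2>&1)
    if [ "$(decide_action "$output")" = rewind ]; then
        echo "Node is forked; rewinding to #${CHECKPOINT}"
        # ASSUMPTION: the arguments of the rewind CLI method are not given
        # in this document; check them against your witnet-rust version.
        "$WITNET_BIN" node rewind --epoch "$CHECKPOINT"
    else
        echo "Node is on the A chain; nothing to do"
    fi
}

# Only act when explicitly requested, so the functions can be sourced safely.
if [ "${RUN_RECOVERY:-0}" = 1 ]; then
    recover
fi
```

Running it as `RUN_RECOVERY=1 sh recovery-sketch.sh` (a hypothetical file name) would perform the check and, if needed, the rewind. Without that variable the script only defines its functions.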

Medium term

Most medium-term actions mandated by WIP-0010 have been addressed by WIP-0009, WIP-0011, and WIP-0012. All these protocol improvements are scheduled to be activated on 28 April 2021 at 9am UTC, so no further action is needed at this time.