Skip to content

Commit

Permalink
Increase BlockSynchronizer timeouts (MystenLabs/narwhal#937)
Browse files Browse the repository at this point in the history
Currently `BlockSynchronizer` steps have 2s timeout. They seem a bit short considering
- Roundtrip to other nodes may take multiple secs
- Other nodes may be busy
- Large payload may need to be loaded from disk.

For `certificates_synchronize_timeout` and `payload_availability_timeout`, only one response is needed within timeout to make progress. However 2s still seems too short. Maybe these timeouts can be 10s~20s instead.
  • Loading branch information
mwtian authored Sep 10, 2022
1 parent db822d8 commit 883ec17
Show file tree
Hide file tree
Showing 9 changed files with 82 additions and 67 deletions.
6 changes: 0 additions & 6 deletions narwhal/Docker/validators/parameters.json
Original file line number Diff line number Diff line change
@@ -1,11 +1,5 @@
{
"batch_size": 500000,
"block_synchronizer": {
"certificates_synchronize_timeout": "2_000ms",
"handler_certificate_deliver_timeout": "2_000ms",
"payload_availability_timeout": "2_000ms",
"payload_synchronize_timeout": "2_000ms"
},
"consensus_api_grpc": {
"get_collections_timeout": "5_000ms",
"remove_collections_timeout": "5_000ms",
Expand Down
18 changes: 9 additions & 9 deletions narwhal/benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ bench_params = {
'tx_size': 512,
'faults': 0,
'duration': 300,
'mem_profiling': False
'mem_profiling': False
}
```
They specify the number of primaries (`nodes`) and workers per primary (`workers`) to deploy, the input rate (transactions per second, or tx/s) at which the clients submit transactions to the system (`rate`), the size of each transaction in bytes (`tx_size`), the number of faulty nodes ('faults`), and the duration of the benchmark in seconds (`duration`). The minimum transaction size is 9 bytes; this ensures the transactions of a client are all different.
Expand All @@ -38,10 +38,10 @@ node_params = {
'batch_size': 500_000,
'max_batch_delay': '100ms',
'block_synchronizer': {
'certificates_synchronize_timeout': '2000ms',
'payload_synchronize_timeout': '2000ms',
'payload_availability_timeout': '2000ms',
'handler_certificate_deliver_timeout': '2_000ms'
'certificates_synchronize_timeout': '30s',
'payload_synchronize_timeout': '30s',
'payload_availability_timeout': '30s',
'handler_certificate_deliver_timeout': '30s'
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/0/http",
Expand All @@ -65,8 +65,8 @@ They are defined as follows:
* `socket_addr`: The socket address the consensus api gRPC server should be listening to.
* `get_collections_timeout`: The timeout configuration when requesting batches from workers.
* `remove_collections_timeout`: The timeout configuration when removing batches from workers.
* `handler_certificate_deliver_timeout`: When a certificate is fetched on the fly from peers, it is submitted from the block synchronizer handler for further processing to core
to validate and ensure parents are available and history is causal complete. This property is the timeout while we wait for core to perform this processes and the certificate to become
* `handler_certificate_deliver_timeout`: When a certificate is fetched on the fly from peers, it is submitted from the block synchronizer handler for further processing to core
to validate and ensure parents are available and history is causal complete. This property is the timeout while we wait for core to perform this processes and the certificate to become
available to the handler to consume.
* `max_concurrent_requests`: The maximum number of concurrent requests for primary-to-primary and worker-to-worker messages.

Expand Down Expand Up @@ -120,7 +120,7 @@ to be turned on. To enable memory profiling mode, set the following benchmark o
* `'mem_profiling': True`
* `duration`: This should be set to 5 minutes or longer

`dhat-heap-*.json` files will be written to the benchmark dir, one for each primary.
`dhat-heap-*.json` files will be written to the benchmark dir, one for each primary.
You might not get files for all primaries as some primaries might die with a panic (see https://github.com/nnethercote/dhat-rs/issues/19).

To view the profiling information, use the [hosted viewer](https://nnethercote.github.io/dh_view/dh_view.html) or see [the viewing instructions](https://docs.rs/dhat/latest/dhat/index.html#viewing) for details.
Expand Down Expand Up @@ -148,7 +148,7 @@ If you don't have an SSH key, you can create one through Amazon or by using [ssh
$ ssh-keygen -t ed25519 -f ~/.ssh/aws
```

Then follow the instructions for [Amazon EC2 key pairs and Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to import the public key to Amazon EC2 (using the *Actions > Import key pair* function).
Then follow the instructions for [Amazon EC2 key pairs and Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to import the public key to Amazon EC2 (using the *Actions > Import key pair* function).

This operation is manual (AWS exposes APIs to manipulate keys) and needs to be repeated for each AWS region that you plan to use. Upon importing your key, AWS requires you to choose a 'name' for your key; ensure you set the same name on all AWS regions. This SSH key will be used by the Python scripts to execute commands and upload/download files to your AWS instances.

Expand Down
18 changes: 0 additions & 18 deletions narwhal/benchmark/fabfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,12 +31,6 @@ def local(ctx, debug=True):
'sync_retry_nodes': 3, # number of nodes
'batch_size': 500_000, # bytes
'max_batch_delay': '200ms', # ms,
'block_synchronizer': {
'certificates_synchronize_timeout': '2_000ms',
'payload_synchronize_timeout': '2_000ms',
'payload_availability_timeout': '2_000ms',
'handler_certificate_deliver_timeout': '2_000ms'
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/0/http",
"get_collections_timeout": "5_000ms",
Expand Down Expand Up @@ -68,12 +62,6 @@ def demo(ctx, debug=True):
}
node_params = {
"batch_size": 500000,
"block_synchronizer": {
"certificates_synchronize_timeout": "2_000ms",
"handler_certificate_deliver_timeout": "2_000ms",
"payload_availability_timeout": "2_000ms",
"payload_synchronize_timeout": "2_000ms"
},
"consensus_api_grpc": {
"get_collections_timeout": "5_000ms",
"remove_collections_timeout": "5_000ms",
Expand Down Expand Up @@ -193,12 +181,6 @@ def remote(ctx, debug=False):
'sync_retry_nodes': 3, # number of nodes
'batch_size': 500_000, # bytes
'max_batch_delay': '200ms', # ms,
'block_synchronizer': {
'certificates_synchronize_timeout': '2_000ms',
'payload_synchronize_timeout': '2_000ms',
'payload_availability_timeout': '2_000ms',
'handler_certificate_deliver_timeout': '2_000ms'
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/0/http",
"get_collections_timeout": "5_000ms",
Expand Down
74 changes: 52 additions & 22 deletions narwhal/config/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -178,35 +178,67 @@ impl Default for ConsensusAPIGrpcParameters {
}

#[derive(Clone, Debug, Deserialize, Serialize)]
#[serde(default)]
pub struct BlockSynchronizerParameters {
/// The timeout configuration when requesting certificates from peers.
#[serde(with = "duration_format")]
#[serde(
with = "duration_format",
default = "BlockSynchronizerParameters::default_certificates_synchronize_timeout"
)]
pub certificates_synchronize_timeout: Duration,
/// Timeout when has requested the payload for a certificate and is
/// waiting to receive them.
#[serde(with = "duration_format")]
#[serde(
with = "duration_format",
default = "BlockSynchronizerParameters::default_payload_synchronize_timeout"
)]
pub payload_synchronize_timeout: Duration,
/// The timeout configuration when for when we ask the other peers to
/// discover who has the payload available for the dictated certificates.
#[serde(with = "duration_format")]
#[serde(
with = "duration_format",
default = "BlockSynchronizerParameters::default_payload_availability_timeout"
)]
pub payload_availability_timeout: Duration,
/// When a certificate is fetched on the fly from peers, it is submitted
/// from the block synchronizer handler for further processing to core
/// to validate and ensure parents are available and history is causal
/// complete. This property is the timeout while we wait for core to
/// perform this processes and the certificate to become available to
/// the handler to consume.
#[serde(with = "duration_format")]
#[serde(
with = "duration_format",
default = "BlockSynchronizerParameters::default_handler_certificate_deliver_timeout"
)]
pub handler_certificate_deliver_timeout: Duration,
}

impl BlockSynchronizerParameters {
fn default_certificates_synchronize_timeout() -> Duration {
Duration::from_secs(30)
}
fn default_payload_synchronize_timeout() -> Duration {
Duration::from_secs(30)
}
fn default_payload_availability_timeout() -> Duration {
Duration::from_secs(30)
}
fn default_handler_certificate_deliver_timeout() -> Duration {
Duration::from_secs(30)
}
}

impl Default for BlockSynchronizerParameters {
fn default() -> Self {
Self {
certificates_synchronize_timeout: Duration::from_millis(2_000),
payload_synchronize_timeout: Duration::from_millis(2_000),
payload_availability_timeout: Duration::from_millis(2_000),
handler_certificate_deliver_timeout: Duration::from_millis(2_000),
certificates_synchronize_timeout:
BlockSynchronizerParameters::default_certificates_synchronize_timeout(),
payload_synchronize_timeout:
BlockSynchronizerParameters::default_payload_synchronize_timeout(),
payload_availability_timeout:
BlockSynchronizerParameters::default_payload_availability_timeout(),
handler_certificate_deliver_timeout:
BlockSynchronizerParameters::default_handler_certificate_deliver_timeout(),
}
}
}
Expand Down Expand Up @@ -248,22 +280,22 @@ impl Parameters {
self.max_batch_delay.as_millis()
);
info!(
"Synchronize certificates timeout set to {} ms",
"Synchronize certificates timeout set to {} s",
self.block_synchronizer
.certificates_synchronize_timeout
.as_millis()
.as_secs()
);
info!(
"Payload (batches) availability timeout set to {} ms",
"Payload (batches) availability timeout set to {} s",
self.block_synchronizer
.payload_availability_timeout
.as_millis()
.as_secs()
);
info!(
"Synchronize payload (batches) timeout set to {} ms",
"Synchronize payload (batches) timeout set to {} s",
self.block_synchronizer
.payload_synchronize_timeout
.as_millis()
.as_secs()
);
info!(
"Consensus API gRPC Server set to listen on on {}",
Expand All @@ -280,10 +312,10 @@ impl Parameters {
.as_millis()
);
info!(
"Handler certificate deliver timeout set to {} ms",
"Handler certificate deliver timeout set to {} s",
self.block_synchronizer
.handler_certificate_deliver_timeout
.as_millis()
.as_secs()
);
info!(
"Max concurrent requests set to {}",
Expand Down Expand Up @@ -664,17 +696,15 @@ mod tests {
assert!(logs_contain("Sync retry nodes set to 3 nodes"));
assert!(logs_contain("Batch size set to 500000 B"));
assert!(logs_contain("Max batch delay set to 100 ms"));
assert!(logs_contain("Synchronize certificates timeout set to 30 s"));
assert!(logs_contain(
"Synchronize certificates timeout set to 2000 ms"
));
assert!(logs_contain(
"Payload (batches) availability timeout set to 2000 ms"
"Payload (batches) availability timeout set to 30 s"
));
assert!(logs_contain(
"Synchronize payload (batches) timeout set to 2000 ms"
"Synchronize payload (batches) timeout set to 30 s"
));
assert!(logs_contain(
"Handler certificate deliver timeout set to 2000 ms"
"Handler certificate deliver timeout set to 30 s"
));
assert!(logs_contain(
"Consensus API gRPC Server set to listen on on /ip4/127.0.0.1/tcp"
Expand Down
3 changes: 1 addition & 2 deletions narwhal/config/tests/config_tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -170,8 +170,7 @@ fn parameters_import_snapshot_matches() {
"block_synchronizer": {
"certificates_synchronize_timeout": "2s",
"payload_synchronize_timeout": "3_000ms",
"payload_availability_timeout": "4_000ms",
"handler_certificate_deliver_timeout": "1_000ms"
"payload_availability_timeout": "4_000ms"
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/0/http",
Expand Down
8 changes: 4 additions & 4 deletions narwhal/config/tests/snapshots/config_tests__parameters.snap
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,10 @@ expression: parameters
"batch_size": 500000,
"max_batch_delay": "100ms",
"block_synchronizer": {
"certificates_synchronize_timeout": "2000ms",
"payload_synchronize_timeout": "2000ms",
"payload_availability_timeout": "2000ms",
"handler_certificate_deliver_timeout": "2000ms"
"certificates_synchronize_timeout": "30000ms",
"payload_synchronize_timeout": "30000ms",
"payload_availability_timeout": "30000ms",
"handler_certificate_deliver_timeout": "30000ms"
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/8081/http",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ expression: params
"certificates_synchronize_timeout": "2000ms",
"payload_synchronize_timeout": "3000ms",
"payload_availability_timeout": "4000ms",
"handler_certificate_deliver_timeout": "1000ms"
"handler_certificate_deliver_timeout": "30000ms"
},
"consensus_api_grpc": {
"socket_addr": "/ip4/127.0.0.1/tcp/0/http",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -535,9 +535,9 @@ async fn test_multiple_overlapping_requests() {
worker_network: PrimaryToWorkerNetwork::default(),
payload_store,
certificate_store,
certificates_synchronize_timeout: Duration::from_millis(2_000),
payload_synchronize_timeout: Duration::from_millis(2_000),
payload_availability_timeout: Duration::from_millis(2_000),
certificates_synchronize_timeout: Duration::from_secs(1),
payload_synchronize_timeout: Duration::from_secs(1),
payload_availability_timeout: Duration::from_secs(1),
};

// ResultSender
Expand Down Expand Up @@ -651,6 +651,10 @@ async fn test_timeout_while_waiting_for_certificates() {
.unwrap();

// AND create the synchronizer
let params = BlockSynchronizerParameters {
certificates_synchronize_timeout: Duration::from_secs(1),
..Default::default()
};
let _synchronizer_handle = BlockSynchronizer::spawn(
name.clone(),
committee.clone(),
Expand All @@ -662,7 +666,7 @@ async fn test_timeout_while_waiting_for_certificates() {
PrimaryNetwork::new(network),
payload_store.clone(),
certificate_store.clone(),
BlockSynchronizerParameters::default(),
params,
);

// AND the channel to respond to
Expand Down
8 changes: 7 additions & 1 deletion narwhal/primary/tests/integration_tests_validator_api.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
use arc_swap::ArcSwap;
// Copyright (c) 2022, Mysten Labs, Inc.
// SPDX-License-Identifier: Apache-2.0
use config::{Committee, Parameters, WorkerId};
use config::{BlockSynchronizerParameters, Committee, Parameters, WorkerId};
use consensus::{dag::Dag, metrics::ConsensusMetrics};
use crypto::PublicKey;
use fastcrypto::{traits::KeyPair as _, Hash};
Expand Down Expand Up @@ -864,6 +864,12 @@ async fn test_get_collections_with_missing_certificates() {

let parameters = Parameters {
batch_size: 200, // Two transactions.
block_synchronizer: BlockSynchronizerParameters {
certificates_synchronize_timeout: Duration::from_secs(1),
payload_synchronize_timeout: Duration::from_secs(1),
payload_availability_timeout: Duration::from_secs(1),
handler_certificate_deliver_timeout: Duration::from_secs(1),
},
..Parameters::default()
};

Expand Down

0 comments on commit 883ec17

Please sign in to comment.