Bulk fetch of batches in executor (MystenLabs#9624)
## Description

TLDR: This PR introduces bulk fetch for all batches required to execute a committed subdag, which improves catchup by 2.5x.

I was waiting to submit this PR because another PR, still being tested and experimented with, introduces batching for payload fetch in the synchronizer and unites the following three batch fetch paths through a shared `BatchFetcher` worker component:

1. Synchronize header batch (blocking)
2. Synchronize certificate batches (non-blocking)
3. Synchronize certificate batches & fetch batches for execution (blocking)

The main goal of the synchronizer is to get the payload into the worker's local store, so that at fetch time we hit our local worker rather than a remote worker most of the time. The synchronizer will feed digests to the `BatchFetcher`, which keeps a queue of missing digests and can bulk fetch them from the local store or from remote workers, either in the background or blocking if they are required immediately. This will significantly reduce the total number of fetch requests, because we bulk fetch and dedupe the digests being fetched.

Unfortunately we have not seen major results from those changes yet, as there seem to be other bottlenecks that need to be fixed first. That is why we are deploying this incremental change first, to speed up catchup and improve tail latencies. I will send the follow-up PR when the experimental impact justifies the risk of the refactor.
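
The queue-and-dedupe idea described for the shared `BatchFetcher` can be illustrated with a small sketch. This is not the code in this PR or in the follow-up: `Digest`, `Batch`, `BatchSource`, `MemorySource`, `enqueue`, and `fetch_pending` are hypothetical names chosen for illustration, the sketch is synchronous, and it assumes each source can answer a multi-digest request in one call.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-ins: the real digest and batch types are not shown here.
type Digest = u64;
type Batch = Vec<u8>;

/// Hypothetical source of batches (a local store or a remote worker).
trait BatchSource {
    /// Returns whichever of the requested batches this source holds, in one call.
    fn fetch_many(&mut self, digests: &[Digest]) -> HashMap<Digest, Batch>;
}

/// In-memory source that counts how many bulk requests it has served.
struct MemorySource {
    batches: HashMap<Digest, Batch>,
    requests: usize,
}

impl BatchSource for MemorySource {
    fn fetch_many(&mut self, digests: &[Digest]) -> HashMap<Digest, Batch> {
        self.requests += 1;
        digests
            .iter()
            .filter_map(|d| self.batches.get(d).map(|b| (*d, b.clone())))
            .collect()
    }
}

/// Sketch of a fetcher that queues missing digests and fetches them in bulk.
struct BatchFetcher<L: BatchSource, R: BatchSource> {
    local: L,
    remote: R,
    queue: VecDeque<Digest>,
    queued: HashSet<Digest>,
}

impl<L: BatchSource, R: BatchSource> BatchFetcher<L, R> {
    fn new(local: L, remote: R) -> Self {
        Self { local, remote, queue: VecDeque::new(), queued: HashSet::new() }
    }

    /// Queues digests, skipping any that are already pending.
    /// Returns how many digests were newly queued.
    fn enqueue(&mut self, digests: impl IntoIterator<Item = Digest>) -> usize {
        let mut added = 0;
        for d in digests {
            if self.queued.insert(d) {
                self.queue.push_back(d);
                added += 1;
            }
        }
        added
    }

    /// Drains the queue with one bulk request to the local source, then one
    /// bulk request to the remote source for whatever is still missing.
    fn fetch_pending(&mut self) -> HashMap<Digest, Batch> {
        let pending: Vec<Digest> = self.queue.drain(..).collect();
        self.queued.clear();
        if pending.is_empty() {
            return HashMap::new();
        }
        let mut found = self.local.fetch_many(&pending);
        let missing: Vec<Digest> = pending
            .iter()
            .copied()
            .filter(|d| !found.contains_key(d))
            .collect();
        if !missing.is_empty() {
            found.extend(self.remote.fetch_many(&missing));
        }
        found
    }
}
```

In this sketch, queuing the same digest twice before a drain costs nothing extra, and draining a queue of N digests issues at most two requests (one local, one remote) instead of up to N. A real component would also need async I/O, retries, and a choice of which remote workers to ask, none of which is modeled here.
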
## Test Plan Unit tests and benchmark cluster - 8-node geo distributed -- **Before batching** ~ 16-72 certs per second Example total catchup time - [2 hours of down time, about 1 hour to catchup](https://mysten.grafana.net/explore?left=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:false,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D%5D,%22range%22:%7B%22from%22:%221678411411288%22,%22to%22:%221678424058450%22%7D%7D&orgId=1&right=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:true,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D,%7B%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22promethe
us%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22sum%20by%20%28host%29%20%28rate%28sequencing_certificate_attempt%7Bhost%3D~%5C%22.%2A%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B$__rate_interval%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221678411411288%22,%22to%22:%221678424058450%22%7D%7D) Commit round rate - [2-9 rounds/s with one spike up to 15 r/s](https://mysten.grafana.net/explore?left=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22hide%22:false,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D%5D,%22range%22:%7B%22from%22:%221678394911392%22,%22to%22:%221678483239964%22%7D%7D&orgId=1&right=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:true,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_
processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D,%7B%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22sum%20by%20%28host%29%20%28rate%28sequencing_certificate_attempt%7Bhost%3D~%5C%22.%2A%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B$__rate_interval%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221678394911392%22,%22to%22:%221678483239964%22%7D%7D) -- **After batching** ~ 96-184 certs per second Example total catchup time - [7.5 hours of down time, about 45 minutes to catchup](https://mysten.grafana.net/explore?left=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:false,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D%5D,%22range%22:%7B%22from%22:%221678520418067%22,%22to%22:%221678552191281%22%7D%7D&orgId=1&right=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Ce
wr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:true,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D,%7B%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22sum%20by%20%28host%29%20%28rate%28sequencing_certificate_attempt%7Bhost%3D~%5C%22.%2A%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B$__rate_interval%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221678520418067%22,%22to%22:%221678552191281%22%7D%7D) Commit round rate - [12-23 rounds/s with one spike of about 40 
r/s](https://mysten.grafana.net/explore?left=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22hide%22:false,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D%5D,%22range%22:%7B%22from%22:%221678520418067%22,%22to%22:%221678552191281%22%7D%7D&orgId=1&right=%7B%22datasource%22:%228Xt1pVoVk%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28last_committed_round%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%29%22,%22hide%22:true,%22range%22:true,%22refId%22:%22C%22,%22interval%22:%22%22%7D,%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22sum%20by%28host%29%20%28rate%28subscriber_processed_batches%7Bhost%3D~%5C%22ams-bnc-val-00%7Cewr-bnc-val-00%5C%22,%20network%3D%5C%22benchmark%5C%22%7D%5B5m%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22hide%22:true%7D,%7B%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%228Xt1pVoVk%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22sum%20by%20%28host%29%20%28rate%28sequencing_certificate_attempt%7Bhost%3D~%5C%22.%2A%5C%22,%20n
etwork%3D%5C%22benchmark%5C%22%7D%5B$__rate_interval%5D%29%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%221678520418067%22,%22to%22:%221678552191281%22%7D%7D) - 100 node geo distributed -- [**Before Batching** ~ 200-300 certs per second (2-3 rounds per second)](https://mysten.grafana.net/d/ORCQSHfVk/subscriber-bulk-fetch-dashboard?var-Environment=mysten-metrics-internal&var-network=benchmark&var-validator=atl-bnc-val-00&var-validator=atl-bnc-val-01&orgId=1&from=1679379442288&to=1679398925376&viewPanel=1) -- [**After Batching** ~300-500 certs per second (3-5 rounds per second)](https://mysten.grafana.net/d/ORCQSHfVk/subscriber-bulk-fetch-dashboard?var-Environment=mysten-metrics-internal&var-network=benchmark&var-validator=atl-bnc-val-00&var-validator=atl-bnc-val-01&orgId=1&from=1679540599207&to=1679542836527)