-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
indexer-alt: file-based configs #20461
Conversation
## Description Introduce file-based configs to `sui-indexer-alt`. This is done by: - Splitting arguments into ones that are passed by command-line argument (ending in `*Args`), and arguments that are passed by configuration file (ending in `*Config`). - Using `DefaultConfig` to ensure there's a single source of truth for the default values of file-based configs (their `Default` impl). - Fleshing out configurations for pipelines: - `PipelineConfig` becomes `CommitterConfig` because it only dealt with the commit side. - Introduced new `ConcurrentConfig` and `SequentialConfig` structs to represent the configs for a given pipeline. These were plumbed through to the functions that add pipelines to the indexer, which means that each pipeline gets its own configuration now, where previously they shared a committer configuration and their pruner could not be configured at all. This does mean that configuring the committer is more cumbersome now (changes needs to be replicated for each pipeline), but it gives the ability to configure the pruner, and add special cases to the committer as well. In future PRs, we will: - Re-introduce the ability to configure the committer in one place, and then override it for specific pipelines. - Introduce a command to generate a default config, to make it easier to start writing your own. - Move some more configuration into the file that were moved to various other places to make them easier to modify during experimentation: - Which pipelines are enabled. - Write concurrency. This PR also performs some smaller changes: - Some re-orderings to make function argument order the same, and to pull config structs up to be the first struct in their respective modules. - Introducing a test that we don't try and enable `skip_watermark` for a sequential pipeline (more relevant now that we configure the committers per pipeline). ## Test plan CI + we will be able to test the actual configuration part more easily once we have some more machinery set-up.
## Description Basic command for generating a config TOML based on the Default config. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [pipeline.sum-coin-balances] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.wal-coin-balances] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-obj-types] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.wal-obj-types] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-displays.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-packages.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.ev-emit-mod.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.ev-struct-inst.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-checkpoints.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-epoch-ends.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-epoch-starts.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-feature-flags.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-objects.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-protocol-configs.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-transactions.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.obj-versions.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-affected-addresses.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-affected-objects.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-balance-changes.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-calls.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-digests.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-kinds.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false ```
## Description Add a common `CommitterConfig` and per-pipeline overrides, to avoid repetition for parameters that are often shared. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-coin-balances] [pipeline.wal-coin-balances] [pipeline.sum-obj-types] [pipeline.wal-obj-types] [pipeline.sum-displays] [pipeline.sum-packages] [pipeline.ev-emit-mod] [pipeline.ev-struct-inst] [pipeline.kv-checkpoints] [pipeline.kv-epoch-ends] [pipeline.kv-epoch-starts] [pipeline.kv-feature-flags] [pipeline.kv-objects] [pipeline.kv-protocol-configs] [pipeline.kv-transactions] [pipeline.obj-versions] [pipeline.tx-affected-addresses] [pipeline.tx-affected-objects] [pipeline.tx-balance-changes] [pipeline.tx-calls] [pipeline.tx-digests] [pipeline.tx-kinds] ```
## Description It doesn't make much sense to skip watermarks for some pipelines and not others from one running instance of the indexer, and it is more common to keep everything else fixed in a configuration and change this flag from run to run, because it is related to the `--first-checkpoint` and `--last-checkpoint` flags, so moving it from the file-based config to the command-line config. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 [pipeline.sum-coin-balances] [pipeline.wal-coin-balances] [pipeline.sum-obj-types] [pipeline.wal-obj-types] [pipeline.sum-displays] [pipeline.sum-packages] [pipeline.ev-emit-mod] [pipeline.ev-struct-inst] [pipeline.kv-checkpoints] [pipeline.kv-epoch-ends] [pipeline.kv-epoch-starts] [pipeline.kv-feature-flags] [pipeline.kv-objects] [pipeline.kv-protocol-configs] [pipeline.kv-transactions] [pipeline.obj-versions] [pipeline.tx-affected-addresses] [pipeline.tx-affected-objects] [pipeline.tx-balance-changes] [pipeline.tx-calls] [pipeline.tx-digests] [pipeline.tx-kinds] ```
## Description Rather than deciding which pipelines to run according to a `--pipeline` flag, decide them based on which pipelines have been configured in the TOML file. It suffices that the pipeline is mentioned in the config (no config fields need to be overridden for it). Additionally, this change makes it so that the write-ahead log portion of consistent tables is only written to when a checkpoint lag has been specified. ## Test plan Run indexers with the following configs: ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [pipeline.kv-objects] [pipeline.kv-transactions] ``` ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [consistency] consistent-range = 1000 [pipeline.sum-obj-types] ``` ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [committer] collect-interval-ms = 1000 [pipeline.tx-calls] [pipeline.tx-affected-objects] collect-interval-ms = 5000 ```
## Description Move the source of checkpoint data (remote store or local path) into its own struct that is parsed using clap, instead of serde. This is to cater to the case where the same indexer configuration might be used to index different networks (mainnet, testnet, devnet, etc). ## Test plan Run the indexer against a variety of configs: ``` sui$ cargo run -p sui-indexer-alt --release -- \ indexer --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config $CONFIG ``` Where config is a file that contains one of the following: ``` [pipeline.kv_objects] [pipeline.kv_transactions] ``` ``` [consistency] consistent-range = 1000 [pipeline.sum_obj_types] ``` ``` [committer] collect-interval-ms = 1000 [pipeline.tx_calls] [pipeline.tx_affected_objects] collect-interval-ms = 5000 ```
## Description TOML deserialization ignores fields that are not relevant to the struct being targeted. Given all the fields in our config have default values, this behaviour can cause confusion because it means that a typo on a field name will result in the value being ignored and replaced with its default value without warning. It's tough to change this behaviour wholesale (to detect and error on any unrecognised field name), but this chane tries to bridge most of the gap by detecting unexpected pipeline configurations and erroring for those. ## Test plan Run the indexer on a config with typos in its pipeline configs. It will now produce an error explaining that it doesn't recognise those configs: ``` sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io' Error: Unexpected pipeline configurations: [tx-calls] [tx-affected-objects] collect-interval-ms = 5000 ```
## Description This override can now be performed using file-based configuration, so we don't need to hard code overrides in the binary. ## Test plan CI
## Description Introduce a `*Layer` type for each file-based `*Config` type. `*Layer` types typically follow the same structure as their corresponding config, but all their fields are optional. This abstraction will serve the following purposes (to be realised in future changes -- this change is a pure refactoring): - It allows us to isolate the dependency on `sui_default_config` to code that will only belong in our indexer, and not into the indexer framework. - Layers can be merged together, this allows us to accept multiple configuration files and merge them into one file. This can be used for storing the configs for each pipeline in their own files to mix and match, and/or to support parameterised indexing in future. - When converting from a Layer to its Config type (finishing it), we can check that the layer does not have any extra (unexpected) fields. ## Test plan CI + tests + run the indexer against some existing configs
The latest updates on your projects. Learn more about Vercel for Git ↗︎
3 Skipped Deployments
|
#[arg(long)] | ||
pub skip_watermark: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit in particular makes me wonder what the ideal situation would be when it comes to cli args vs file-based configs, when do you put something in one vs the other, when do you put something in both (with cli arg having precedence), etc.
If you look at how git
is configured it even allows for EnvVars for some options as well as a general purpose --config <key>=<value>
cli flag for setting all config values for the duration of the command run. I think that for our purposes thats probably very extreme since we expect this to generally be set and forget, or one-off runs for backfill so I'm not suggesting we do this for the indexer...but it does make me wonder if this is something we should explore for the new sui CLI when we get to that.
Sorry for venturing a bit off-topic :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we definitely need to think holistically about our "philosophy" for configuring the new CLI. It will be more challenging there than here, and I struggled to come up with clear rules even in this setting.
## Description Add back the `--pipeline` command-line argument, layered on top of the file-based configuration. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml # Remove some pipelines sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml \ --pipeline i_dont_exist sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml \ --pipeline kv_objects --pipeline kv_transactions sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml ```
## Description Document how the macros for adding pipelines work. ## Test plan :eyes:
Description
Introduce file-based configs to
sui-indexer-alt
. This is done by:*Args
), and arguments that are passed by configuration file (ending in*Config
).DefaultConfig
to ensure there's a single source of truth for the default values of file-based configs (theirDefault
impl).PipelineConfig
becomesCommitterConfig
because it only dealt with the commit side.ConcurrentConfig
andSequentialConfig
structs to represent the configs for a given pipeline. These were plumbed through to the functions that add pipelines to the indexer, which means that each pipeline gets its own configuration now, where previously they shared a committer configuration and their pruner could not be configured at all.This PR also performs some smaller changes:
Test plan
CI + manual testing by generating some configs and running the indexer on them, including running on an intentionally broken config:
Stack
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.