indexer-alt: file-based configs #20461

amnn · 2024-11-28T13:40:14Z

Description

Introduce file-based configs to sui-indexer-alt. This is done by:

Splitting arguments into ones that are passed by command-line argument (ending in *Args), and arguments that are passed by configuration file (ending in *Config).
Using DefaultConfig to ensure there's a single source of truth for the default values of file-based configs (their Default impl).
Fleshing out configurations for pipelines:
- PipelineConfig becomes CommitterConfig because it only dealt with the commit side.
- Introduced new ConcurrentConfig and SequentialConfig structs to represent the configs for a given pipeline. These were plumbed through to the functions that add pipelines to the indexer, which means that each pipeline gets its own configuration now, where previously they shared a committer configuration and their pruner could not be configured at all.
Introduce the ability to configure the committer in one place, and then override it for specific pipelines.
Introduce a command to generate a default config, to make it easier to start writing your own.

This PR also performs some smaller changes:

Some re-orderings to make function argument order the same, and to pull config structs up to be the first struct in their respective modules.

Test plan

CI + manual testing by generating some configs and running the indexer on them, including running on an intentionally broken config:

sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io'
Error: Unexpected pipeline configurations:
[tx-calls]

[tx-affected-objects]
collect-interval-ms = 5000

Stack

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

## Description Introduce file-based configs to `sui-indexer-alt`. This is done by: - Splitting arguments into ones that are passed by command-line argument (ending in `*Args`), and arguments that are passed by configuration file (ending in `*Config`). - Using `DefaultConfig` to ensure there's a single source of truth for the default values of file-based configs (their `Default` impl). - Fleshing out configurations for pipelines: - `PipelineConfig` becomes `CommitterConfig` because it only dealt with the commit side. - Introduced new `ConcurrentConfig` and `SequentialConfig` structs to represent the configs for a given pipeline. These were plumbed through to the functions that add pipelines to the indexer, which means that each pipeline gets its own configuration now, where previously they shared a committer configuration and their pruner could not be configured at all. This does mean that configuring the committer is more cumbersome now (changes needs to be replicated for each pipeline), but it gives the ability to configure the pruner, and add special cases to the committer as well. In future PRs, we will: - Re-introduce the ability to configure the committer in one place, and then override it for specific pipelines. - Introduce a command to generate a default config, to make it easier to start writing your own. - Move some more configuration into the file that were moved to various other places to make them easier to modify during experimentation: - Which pipelines are enabled. - Write concurrency. This PR also performs some smaller changes: - Some re-orderings to make function argument order the same, and to pull config structs up to be the first struct in their respective modules. - Introducing a test that we don't try and enable `skip_watermark` for a sequential pipeline (more relevant now that we configure the committers per pipeline). ## Test plan CI + we will be able to test the actual configuration part more easily once we have some more machinery set-up.

## Description Basic command for generating a config TOML based on the Default config. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [pipeline.sum-coin-balances] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.wal-coin-balances] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-obj-types] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.wal-obj-types] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-displays.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-packages.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.ev-emit-mod.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.ev-struct-inst.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-checkpoints.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-epoch-ends.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-epoch-starts.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-feature-flags.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-objects.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-protocol-configs.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.kv-transactions.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.obj-versions.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-affected-addresses.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-affected-objects.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-balance-changes.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-calls.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-digests.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.tx-kinds.committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false ```

## Description Add a common `CommitterConfig` and per-pipeline overrides, to avoid repetition for parameters that are often shared. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 skip-watermark = false [pipeline.sum-coin-balances] [pipeline.wal-coin-balances] [pipeline.sum-obj-types] [pipeline.wal-obj-types] [pipeline.sum-displays] [pipeline.sum-packages] [pipeline.ev-emit-mod] [pipeline.ev-struct-inst] [pipeline.kv-checkpoints] [pipeline.kv-epoch-ends] [pipeline.kv-epoch-starts] [pipeline.kv-feature-flags] [pipeline.kv-objects] [pipeline.kv-protocol-configs] [pipeline.kv-transactions] [pipeline.obj-versions] [pipeline.tx-affected-addresses] [pipeline.tx-affected-objects] [pipeline.tx-balance-changes] [pipeline.tx-calls] [pipeline.tx-digests] [pipeline.tx-kinds] ```

## Description It doesn't make much sense to skip watermarks for some pipelines and not others from one running instance of the indexer, and it is more common to keep everything else fixed in a configuration and change this flag from run to run, because it is related to the `--first-checkpoint` and `--last-checkpoint` flags, so moving it from the file-based config to the command-line config. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config [ingestion] checkpoint-buffer-size = 5000 ingest-concurrency = 200 retry-interval-ms = 200 [consistency] consistent-pruning-interval-ms = 300000 pruner-delay-ms = 120000 [committer] write-concurrency = 5 collect-interval-ms = 500 watermark-interval-ms = 500 [pipeline.sum-coin-balances] [pipeline.wal-coin-balances] [pipeline.sum-obj-types] [pipeline.wal-obj-types] [pipeline.sum-displays] [pipeline.sum-packages] [pipeline.ev-emit-mod] [pipeline.ev-struct-inst] [pipeline.kv-checkpoints] [pipeline.kv-epoch-ends] [pipeline.kv-epoch-starts] [pipeline.kv-feature-flags] [pipeline.kv-objects] [pipeline.kv-protocol-configs] [pipeline.kv-transactions] [pipeline.obj-versions] [pipeline.tx-affected-addresses] [pipeline.tx-affected-objects] [pipeline.tx-balance-changes] [pipeline.tx-calls] [pipeline.tx-digests] [pipeline.tx-kinds] ```

## Description Rather than deciding which pipelines to run according to a `--pipeline` flag, decide them based on which pipelines have been configured in the TOML file. It suffices that the pipeline is mentioned in the config (no config fields need to be overridden for it). Additionally, this change makes it so that the write-ahead log portion of consistent tables is only written to when a checkpoint lag has been specified. ## Test plan Run indexers with the following configs: ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [pipeline.kv-objects] [pipeline.kv-transactions] ``` ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [consistency] consistent-range = 1000 [pipeline.sum-obj-types] ``` ``` [ingestion] remote-store-url = "https://checkpoints.mainnet.sui.io" [committer] collect-interval-ms = 1000 [pipeline.tx-calls] [pipeline.tx-affected-objects] collect-interval-ms = 5000 ```

## Description Move the source of checkpoint data (remote store or local path) into its own struct that is parsed using clap, instead of serde. This is to cater to the case where the same indexer configuration might be used to index different networks (mainnet, testnet, devnet, etc). ## Test plan Run the indexer against a variety of configs: ``` sui$ cargo run -p sui-indexer-alt --release -- \ indexer --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config $CONFIG ``` Where config is a file that contains one of the following: ``` [pipeline.kv_objects] [pipeline.kv_transactions] ``` ``` [consistency] consistent-range = 1000 [pipeline.sum_obj_types] ``` ``` [committer] collect-interval-ms = 1000 [pipeline.tx_calls] [pipeline.tx_affected_objects] collect-interval-ms = 5000 ```

## Description TOML deserialization ignores fields that are not relevant to the struct being targeted. Given all the fields in our config have default values, this behaviour can cause confusion because it means that a typo on a field name will result in the value being ignored and replaced with its default value without warning. It's tough to change this behaviour wholesale (to detect and error on any unrecognised field name), but this chane tries to bridge most of the gap by detecting unexpected pipeline configurations and erroring for those. ## Test plan Run the indexer on a config with typos in its pipeline configs. It will now produce an error explaining that it doesn't recognise those configs: ``` sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io' Error: Unexpected pipeline configurations: [tx-calls] [tx-affected-objects] collect-interval-ms = 5000 ```

## Description This override can now be performed using file-based configuration, so we don't need to hard code overrides in the binary. ## Test plan CI

## Description Introduce a `*Layer` type for each file-based `*Config` type. `*Layer` types typically follow the same structure as their corresponding config, but all their fields are optional. This abstraction will serve the following purposes (to be realised in future changes -- this change is a pure refactoring): - It allows us to isolate the dependency on `sui_default_config` to code that will only belong in our indexer, and not into the indexer framework. - Layers can be merged together, this allows us to accept multiple configuration files and merge them into one file. This can be used for storing the configs for each pipeline in their own files to mix and match, and/or to support parameterised indexing in future. - When converting from a Layer to its Config type (finishing it), we can check that the layer does not have any extra (unexpected) fields. ## Test plan CI + tests + run the indexer against some existing configs

vercel · 2024-11-28T13:40:18Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
sui-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 29, 2024 5:10pm

3 Skipped Deployments

Name	Status	Preview	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview	Nov 29, 2024 5:10pm
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview	Nov 29, 2024 5:10pm
sui-typescript-docs	⬜️ Ignored (Inspect)	Visit Preview	Nov 29, 2024 5:10pm

crates/sui-indexer-alt/src/lib.rs

bmwill · 2024-11-28T16:27:08Z

crates/sui-indexer-alt/src/lib.rs

+    #[arg(long)]
+    pub skip_watermark: bool,


This commit in particular makes me wonder what the ideal situation would be when it comes to cli args vs file-based configs, when do you put something in one vs the other, when do you put something in both (with cli arg having precedence), etc.

If you look at how git is configured it even allows for EnvVars for some options as well as a general purpose --config <key>=<value> cli flag for setting all config values for the duration of the command run. I think that for our purposes thats probably very extreme since we expect this to generally be set and forget, or one-off runs for backfill so I'm not suggesting we do this for the indexer...but it does make me wonder if this is something we should explore for the new sui CLI when we get to that.

Sorry for venturing a bit off-topic :)

I think we definitely need to think holistically about our "philosophy" for configuring the new CLI. It will be more challenging there than here, and I struggled to come up with clear rules even in this setting.

crates/sui-indexer-alt/src/lib.rs

crates/sui-indexer-alt/src/benchmark.rs

crates/sui-indexer-alt/src/config.rs

crates/sui-indexer-alt/src/lib.rs

crates/sui-indexer-alt/src/ingestion/mod.rs

## Description Add back the `--pipeline` command-line argument, layered on top of the file-based configuration. ## Test plan ``` sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml # Remove some pipelines sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml \ --pipeline i_dont_exist sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml \ --pipeline kv_objects --pipeline kv_transactions sui$ cargo run -p sui-indexer-alt -- indexer \ --last-checkpoint 10000 \ --remote-store-url https://checkpoints.mainnet.sui.io \ --config /tmp/indexer.toml ```

## Description Document how the macros for adding pipelines work. ## Test plan :eyes:

amnn added 10 commits November 28, 2024 12:49

indexer-alt: Remove WRITE_CONCURRENCY_OVERRIDE

5357d39

## Description This override can now be performed using file-based configuration, so we don't need to hard code overrides in the binary. ## Test plan CI

indexer-alt: IngestionArgs -> ClientArgs

09cc8d7

amnn requested review from lxfind, bmwill, emmazzz, gegaowp and wlmyng November 28, 2024 13:40

amnn self-assigned this Nov 28, 2024

amnn temporarily deployed to sui-typescript-aws-kms-test-env November 28, 2024 13:40 — with GitHub Actions Inactive

This was referenced Nov 28, 2024

indexer-alt: merging file-based configs #20462

Merged

indexer-alt: detect and warn against unrecognised fields. #20463

Merged

indexer-alt: generate example config #20464

Merged

chore(indexer-alt): rename TxAffectedAddress(es) #20465

Merged

bmwill approved these changes Nov 28, 2024

View reviewed changes

amnn added 2 commits November 29, 2024 17:03

doc(indexer-alt): file-based configs add comment for macros

293d47e

## Description Document how the macros for adding pipelines work. ## Test plan :eyes:

amnn temporarily deployed to sui-typescript-aws-kms-test-env November 29, 2024 17:09 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs November 29, 2024 17:10 View deployment

Base automatically changed from amnn/idx-cfg-default to amnn/idx-cfg-ms November 29, 2024 17:11

amnn merged commit 293d47e into amnn/idx-cfg-ms Nov 29, 2024
51 checks passed

amnn deleted the amnn/idx-file-cfg branch November 29, 2024 18:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

indexer-alt: file-based configs #20461

indexer-alt: file-based configs #20461

amnn commented Nov 28, 2024

vercel bot commented Nov 28, 2024 •

edited

Loading

bmwill Nov 28, 2024

amnn Nov 29, 2024

indexer-alt: file-based configs #20461

indexer-alt: file-based configs #20461

Conversation

amnn commented Nov 28, 2024

Description

Test plan

Stack

Release notes

vercel bot commented Nov 28, 2024 • edited Loading

bmwill Nov 28, 2024

Choose a reason for hiding this comment

amnn Nov 29, 2024

Choose a reason for hiding this comment

vercel bot commented Nov 28, 2024 •

edited

Loading