Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexer-alt: file-based configs #20461

Merged
merged 12 commits into from
Nov 29, 2024
Merged

indexer-alt: file-based configs #20461

merged 12 commits into from
Nov 29, 2024

Conversation

amnn
Copy link
Member

@amnn amnn commented Nov 28, 2024

Description

Introduce file-based configs to sui-indexer-alt. This is done by:

  • Splitting arguments into ones that are passed by command-line argument (ending in *Args), and arguments that are passed by configuration file (ending in *Config).
  • Using DefaultConfig to ensure there's a single source of truth for the default values of file-based configs (their Default impl).
  • Fleshing out configurations for pipelines:
    • PipelineConfig becomes CommitterConfig because it only dealt with the commit side.
    • Introduced new ConcurrentConfig and SequentialConfig structs to represent the configs for a given pipeline. These were plumbed through to the functions that add pipelines to the indexer, which means that each pipeline gets its own configuration now, where previously they shared a committer configuration and their pruner could not be configured at all.
  • Introduce the ability to configure the committer in one place, and then override it for specific pipelines.
  • Introduce a command to generate a default config, to make it easier to start writing your own.

This PR also performs some smaller changes:

  • Some re-orderings to make function argument order the same, and to pull config structs up to be the first struct in their respective modules.

Test plan

CI + manual testing by generating some configs and running the indexer on them, including running on an intentionally broken config:

sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io'
Error: Unexpected pipeline configurations:
[tx-calls]

[tx-affected-objects]
collect-interval-ms = 5000

Stack


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

amnn added 10 commits November 28, 2024 12:49
## Description

Introduce file-based configs to `sui-indexer-alt`. This is done by:

- Splitting arguments into ones that are passed by command-line argument
  (ending in `*Args`), and arguments that are passed by configuration
  file (ending in `*Config`).
- Using `DefaultConfig` to ensure there's a single source of truth for
  the default values of file-based configs (their `Default` impl).
- Fleshing out configurations for pipelines:
  - `PipelineConfig` becomes `CommitterConfig` because it only dealt
    with the commit side.
  - Introduced new `ConcurrentConfig` and `SequentialConfig` structs to
    represent the configs for a given pipeline. These were plumbed
    through to the functions that add pipelines to the indexer, which
    means that each pipeline gets its own configuration now, where
    previously they shared a committer configuration and their pruner
    could not be configured at all.

This does mean that configuring the committer is more cumbersome now
(changes needs to be replicated for each pipeline), but it gives the
ability to configure the pruner, and add special cases to the committer
as well. In future PRs, we will:

- Re-introduce the ability to configure the committer in one place, and
  then override it for specific pipelines.
- Introduce a command to generate a default config, to make it easier to
  start writing your own.
- Move some more configuration into the file that were moved to various
  other places to make them easier to modify during experimentation:
  - Which pipelines are enabled.
  - Write concurrency.

This PR also performs some smaller changes:

- Some re-orderings to make function argument order the same, and to
  pull config structs up to be the first struct in their respective
  modules.
- Introducing a test that we don't try and enable `skip_watermark` for a
  sequential pipeline (more relevant now that we configure the
  committers per pipeline).

## Test plan

CI + we will be able to test the actual configuration part more easily
once we have some more machinery set-up.
## Description

Basic command for generating a config TOML based on the Default config.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[pipeline.sum-coin-balances]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.wal-coin-balances]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-obj-types]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.wal-obj-types]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-displays.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-packages.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.ev-emit-mod.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.ev-struct-inst.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-checkpoints.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-epoch-ends.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-epoch-starts.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-feature-flags.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-objects.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-protocol-configs.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-transactions.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.obj-versions.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-affected-addresses.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-affected-objects.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-balance-changes.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-calls.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-digests.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-kinds.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false
```
## Description

Add a common `CommitterConfig` and per-pipeline overrides, to avoid
repetition for parameters that are often shared.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-coin-balances]

[pipeline.wal-coin-balances]

[pipeline.sum-obj-types]

[pipeline.wal-obj-types]

[pipeline.sum-displays]

[pipeline.sum-packages]

[pipeline.ev-emit-mod]

[pipeline.ev-struct-inst]

[pipeline.kv-checkpoints]

[pipeline.kv-epoch-ends]

[pipeline.kv-epoch-starts]

[pipeline.kv-feature-flags]

[pipeline.kv-objects]

[pipeline.kv-protocol-configs]

[pipeline.kv-transactions]

[pipeline.obj-versions]

[pipeline.tx-affected-addresses]

[pipeline.tx-affected-objects]

[pipeline.tx-balance-changes]

[pipeline.tx-calls]

[pipeline.tx-digests]

[pipeline.tx-kinds]
```
## Description

It doesn't make much sense to skip watermarks for some pipelines and not
others from one running instance of the indexer, and it is more common
to keep everything else fixed in a configuration and change this flag
from run to run, because it is related to the `--first-checkpoint` and
`--last-checkpoint` flags, so moving it from the file-based config to
the command-line config.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500

[pipeline.sum-coin-balances]

[pipeline.wal-coin-balances]

[pipeline.sum-obj-types]

[pipeline.wal-obj-types]

[pipeline.sum-displays]

[pipeline.sum-packages]

[pipeline.ev-emit-mod]

[pipeline.ev-struct-inst]

[pipeline.kv-checkpoints]

[pipeline.kv-epoch-ends]

[pipeline.kv-epoch-starts]

[pipeline.kv-feature-flags]

[pipeline.kv-objects]

[pipeline.kv-protocol-configs]

[pipeline.kv-transactions]

[pipeline.obj-versions]

[pipeline.tx-affected-addresses]

[pipeline.tx-affected-objects]

[pipeline.tx-balance-changes]

[pipeline.tx-calls]

[pipeline.tx-digests]

[pipeline.tx-kinds]
```
## Description

Rather than deciding which pipelines to run according to a `--pipeline`
flag, decide them based on which pipelines have been configured in the
TOML file. It suffices that the pipeline is mentioned in the config (no
config fields need to be overridden for it).

Additionally, this change makes it so that the write-ahead log portion
of consistent tables is only written to when a checkpoint lag has been
specified.

## Test plan

Run indexers with the following configs:

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[pipeline.kv-objects]
[pipeline.kv-transactions]
```

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[consistency]
consistent-range = 1000

[pipeline.sum-obj-types]
```

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[committer]
collect-interval-ms = 1000

[pipeline.tx-calls]

[pipeline.tx-affected-objects]
collect-interval-ms = 5000
```
## Description

Move the source of checkpoint data (remote store or local path) into its
own struct that is parsed using clap, instead of serde. This is to cater
to the case where the same indexer configuration might be used to index
different networks (mainnet, testnet, devnet, etc).

## Test plan

Run the indexer against a variety of configs:

```
sui$ cargo run -p sui-indexer-alt --release --         \
 indexer --last-checkpoint 10000                       \
 --remote-store-url https://checkpoints.mainnet.sui.io \
 --config $CONFIG
```

Where config is a file that contains one of the following:

```
[pipeline.kv_objects]
[pipeline.kv_transactions]
```

```
[consistency]
consistent-range = 1000

[pipeline.sum_obj_types]
```

```
[committer]
collect-interval-ms = 1000

[pipeline.tx_calls]

[pipeline.tx_affected_objects]
collect-interval-ms = 5000
```
## Description

TOML deserialization ignores fields that are not relevant to the struct
being targeted. Given all the fields in our config have default values,
this behaviour can cause confusion because it means that a typo on a
field name will result in the value being ignored and replaced with its
default value without warning.

It's tough to change this behaviour wholesale (to detect and error on
any unrecognised field name), but this chane tries to bridge most of the
gap by detecting unexpected pipeline configurations and erroring for
those.

## Test plan

Run the indexer on a config with typos in its pipeline configs. It will
now produce an error explaining that it doesn't recognise those configs:

```
sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io'
Error: Unexpected pipeline configurations:
[tx-calls]

[tx-affected-objects]
collect-interval-ms = 5000
```
## Description

This override can now be performed using file-based configuration, so we
don't need to hard code overrides in the binary.

## Test plan

CI
## Description

Introduce a `*Layer` type for each file-based `*Config` type. `*Layer`
types typically follow the same structure as their corresponding config,
but all their fields are optional. This abstraction will serve the
following purposes (to be realised in future changes -- this change is a
pure refactoring):

- It allows us to isolate the dependency on `sui_default_config` to code
  that will only belong in our indexer, and not into the indexer
  framework.
- Layers can be merged together, this allows us to accept multiple
  configuration files and merge them into one file. This can be used for
  storing the configs for each pipeline in their own files to mix and
  match, and/or to support parameterised indexing in future.
- When converting from a Layer to its Config type (finishing it), we can
  check that the layer does not have any extra (unexpected) fields.

## Test plan

CI + tests + run the indexer against some existing configs
@amnn amnn self-assigned this Nov 28, 2024
Copy link

vercel bot commented Nov 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 29, 2024 5:10pm
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 5:10pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 5:10pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 5:10pm

crates/sui-indexer-alt/src/lib.rs Outdated Show resolved Hide resolved
crates/sui-indexer-alt/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines +63 to +64
#[arg(long)]
pub skip_watermark: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit in particular makes me wonder what the ideal situation would be when it comes to cli args vs file-based configs, when do you put something in one vs the other, when do you put something in both (with cli arg having precedence), etc.

If you look at how git is configured it even allows for EnvVars for some options as well as a general purpose --config <key>=<value> cli flag for setting all config values for the duration of the command run. I think that for our purposes thats probably very extreme since we expect this to generally be set and forget, or one-off runs for backfill so I'm not suggesting we do this for the indexer...but it does make me wonder if this is something we should explore for the new sui CLI when we get to that.

Sorry for venturing a bit off-topic :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we definitely need to think holistically about our "philosophy" for configuring the new CLI. It will be more challenging there than here, and I struggled to come up with clear rules even in this setting.

crates/sui-indexer-alt/src/lib.rs Show resolved Hide resolved
crates/sui-indexer-alt/src/benchmark.rs Outdated Show resolved Hide resolved
crates/sui-indexer-alt/src/config.rs Show resolved Hide resolved
crates/sui-indexer-alt/src/lib.rs Show resolved Hide resolved
crates/sui-indexer-alt/src/ingestion/mod.rs Show resolved Hide resolved
amnn added 2 commits November 29, 2024 17:03
## Description

Add back the `--pipeline` command-line argument, layered on top of the
file-based configuration.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml
# Remove some pipelines

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml                            \
  --pipeline i_dont_exist

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml                            \
  --pipeline kv_objects --pipeline kv_transactions

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml
```
## Description
Document how the macros for adding pipelines work.

## Test plan
:eyes:
@amnn amnn temporarily deployed to sui-typescript-aws-kms-test-env November 29, 2024 17:09 — with GitHub Actions Inactive
Base automatically changed from amnn/idx-cfg-default to amnn/idx-cfg-ms November 29, 2024 17:11
@amnn amnn merged commit 293d47e into amnn/idx-cfg-ms Nov 29, 2024
51 checks passed
@amnn amnn deleted the amnn/idx-file-cfg branch November 29, 2024 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants