Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexer-alt: simplify duration command-line parsing #20459

Merged
merged 27 commits into from
Nov 29, 2024
Merged

indexer-alt: simplify duration command-line parsing #20459

merged 27 commits into from
Nov 29, 2024

Conversation

amnn
Copy link
Member

@amnn amnn commented Nov 28, 2024

Description

When accepting a duration as a command-line (or eventually file-based) configuration. Rather than using the Duration type directly, accept it as a number in milliseconds and expose a constructor on the config type to build the Duration from it.

This is helpful for

  • Accepting durations as file-based configs, where support for parsing Duration directly is more complicated.
  • Ensuring there is a single source of truth for config defaults (Even for command-line arguments, we could not supply a Duration as a default_value_t which meant we could not easily share a default value between a command-line default and the impl Default.
  • Standardise all durations accepted by command-line/file to milliseconds rather than some in seconds and some in milliseconds.

Test plan

CI +

sui$ cargo run -p sui-indexer-alt -- --help

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

amnn added 18 commits November 24, 2024 23:14
## Description

Switch the `IngestionConfig::retry_interval: Duration` field to
`IngestionConfig::retry_interval_ms: u64`. This is to avoid dealing with
parsing/deserializing a `Duration` as part of loading a config.

This is the first step towards converting configs to a file-based
format.

## Test plan

CI.
## Description

Switch

- `PipelineConfig::commit_interval: Duration` and
- `PipelineConfig::watermark_interval: Duration`

to

- `PipelineConfig::commit_interval_ms: u64` and
- `PipelineConfig::watermark_interval_ms: u64`

respectively. To avoid dealing with parsing/deserializing a `Duration`
as part of loading a config. Continuing the work to support converting
indexer's configs to a file-based format.

## Test plan

CI.
…elay}_ms

## Description

Switch

- `ConsistencyConfig::consistency_pruning_interval: Duration` and
- `ConsistencyConfig::pruner_delay: Duration`

to

- `ConsistencyConfig::consistency_pruning_interval_ms: u64` and
- `ConsistencyConfig::pruner_delay_ms: u64`

respectively. Note that this is a bigger change than other flags,
because we are changing the unit that the value is written in from
seconds to milliseconds. This was:

- for consistency (no pun intended) -- it seems like it would be more
  confusing in a file-based config to have a different duration values
  using different units.
- for precision: the inputs are all integers, so if we did want to set
  these values to half a second or something, we wouldn't be able to if
  it was denominated in whole seconds.

## Test plan

CI
## Description

Switch

- `PrunerConfig::interval: Duration` and
- `PrunerConfig::delay: Duration`

to

- `PrunerConfig::interval_ms: u64` and
- `PrunerConfig::delay_ms: u64`

respectively, to prepare for porting to file-based configs.

## Test plan

CI
## Description

Change `DbConfig::connection_timeout: Duration` to
`DbConfig::connection_timeout: u64`, so we can drop the dependency on
`const_str`.

Also introduce a default value for the database URL so that we can
implement a `Default` impl for `DbConfig` overall, and which serves as a
single source of truth for defaults.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- --help
```
## Description

`GraphQLConfig` is a derive macro for deriving `Serialize`,
`Deserialize`, and `Debug`, and introducing `#[serde(default = ...)]`
annotations that rely on the fields of the type's `Default`
implementation.

This allows config types to rely on their `Default` implementation as a
single source of truth for default values, and is used through-out
`sui-graphql-rpc` to define its config types.

This change generalizes it (renames it) so it can be used in the indexer
as well.

## Test plan

Build and test `sui-graphql-rpc`, `sui-mvr-graphql-rpc`:

```
sui$ cargo nextest run -p sui-graphql-rpc
sui$ cargo nextest run -p sui-graphql-e2e-tests
sui$ cargo nextest run -p sui-mvr-graphql-rpc
```

This is a behaviour preserving change for the moment.
## Description

`DefaultConfig` previously automatically added derives for `Serialize`,
`Deserialize`, and `Clone`, `Eq` and `PartialEq`, which was convenient
for its use in `sui-graphql-rpc`, but it does not generalize well,
because:

- Currently, it requires that the traits are in scope, which somewhat
  defeats the point of not having to explicitly derive them, in the case
  of `Serialize`/`Deserialize`.
- The logic to detect when a derive is already applied is fragile,
  because it works by name (so it wouldn't play nicely with
  fully-qualified or aliased derives).

This change removes that support in favour of a simpler approach:

- `DefaultConfig` adds derives for `serde::Serialize` and
  `serde::Deserialize`, using fully-qualified names.
- Other derives need to be added explicitly.
- The logic to only add a derive if it hasn't already been added has
  been removed.

## Test plan

CI for existing use cases.
## Description

Add support to `sui-default-config` to change the naming scheme of its
fields (previously always used kebab-case). This allows pipelines to
retain the names they were given in source code (which use underscores).

## Test plan

Will be used and tested in a future PR.
## Description

Introduce file-based configs to `sui-indexer-alt`. This is done by:

- Splitting arguments into ones that are passed by command-line argument
  (ending in `*Args`), and arguments that are passed by configuration
  file (ending in `*Config`).
- Using `DefaultConfig` to ensure there's a single source of truth for
  the default values of file-based configs (their `Default` impl).
- Fleshing out configurations for pipelines:
  - `PipelineConfig` becomes `CommitterConfig` because it only dealt
    with the commit side.
  - Introduced new `ConcurrentConfig` and `SequentialConfig` structs to
    represent the configs for a given pipeline. These were plumbed
    through to the functions that add pipelines to the indexer, which
    means that each pipeline gets its own configuration now, where
    previously they shared a committer configuration and their pruner
    could not be configured at all.

This does mean that configuring the committer is more cumbersome now
(changes needs to be replicated for each pipeline), but it gives the
ability to configure the pruner, and add special cases to the committer
as well. In future PRs, we will:

- Re-introduce the ability to configure the committer in one place, and
  then override it for specific pipelines.
- Introduce a command to generate a default config, to make it easier to
  start writing your own.
- Move some more configuration into the file that were moved to various
  other places to make them easier to modify during experimentation:
  - Which pipelines are enabled.
  - Write concurrency.

This PR also performs some smaller changes:

- Some re-orderings to make function argument order the same, and to
  pull config structs up to be the first struct in their respective
  modules.
- Introducing a test that we don't try and enable `skip_watermark` for a
  sequential pipeline (more relevant now that we configure the
  committers per pipeline).

## Test plan

CI + we will be able to test the actual configuration part more easily
once we have some more machinery set-up.
## Description

Basic command for generating a config TOML based on the Default config.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[pipeline.sum-coin-balances]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.wal-coin-balances]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-obj-types]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.wal-obj-types]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-displays.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-packages.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.ev-emit-mod.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.ev-struct-inst.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-checkpoints.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-epoch-ends.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-epoch-starts.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-feature-flags.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-objects.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-protocol-configs.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.kv-transactions.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.obj-versions.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-affected-addresses.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-affected-objects.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-balance-changes.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-calls.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-digests.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.tx-kinds.committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false
```
## Description

Add a common `CommitterConfig` and per-pipeline overrides, to avoid
repetition for parameters that are often shared.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500
skip-watermark = false

[pipeline.sum-coin-balances]

[pipeline.wal-coin-balances]

[pipeline.sum-obj-types]

[pipeline.wal-obj-types]

[pipeline.sum-displays]

[pipeline.sum-packages]

[pipeline.ev-emit-mod]

[pipeline.ev-struct-inst]

[pipeline.kv-checkpoints]

[pipeline.kv-epoch-ends]

[pipeline.kv-epoch-starts]

[pipeline.kv-feature-flags]

[pipeline.kv-objects]

[pipeline.kv-protocol-configs]

[pipeline.kv-transactions]

[pipeline.obj-versions]

[pipeline.tx-affected-addresses]

[pipeline.tx-affected-objects]

[pipeline.tx-balance-changes]

[pipeline.tx-calls]

[pipeline.tx-digests]

[pipeline.tx-kinds]
```
## Description

It doesn't make much sense to skip watermarks for some pipelines and not
others from one running instance of the indexer, and it is more common
to keep everything else fixed in a configuration and change this flag
from run to run, because it is related to the `--first-checkpoint` and
`--last-checkpoint` flags, so moving it from the file-based config to
the command-line config.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config
[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500

[pipeline.sum-coin-balances]

[pipeline.wal-coin-balances]

[pipeline.sum-obj-types]

[pipeline.wal-obj-types]

[pipeline.sum-displays]

[pipeline.sum-packages]

[pipeline.ev-emit-mod]

[pipeline.ev-struct-inst]

[pipeline.kv-checkpoints]

[pipeline.kv-epoch-ends]

[pipeline.kv-epoch-starts]

[pipeline.kv-feature-flags]

[pipeline.kv-objects]

[pipeline.kv-protocol-configs]

[pipeline.kv-transactions]

[pipeline.obj-versions]

[pipeline.tx-affected-addresses]

[pipeline.tx-affected-objects]

[pipeline.tx-balance-changes]

[pipeline.tx-calls]

[pipeline.tx-digests]

[pipeline.tx-kinds]
```
## Description

Rather than deciding which pipelines to run according to a `--pipeline`
flag, decide them based on which pipelines have been configured in the
TOML file. It suffices that the pipeline is mentioned in the config (no
config fields need to be overridden for it).

Additionally, this change makes it so that the write-ahead log portion
of consistent tables is only written to when a checkpoint lag has been
specified.

## Test plan

Run indexers with the following configs:

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[pipeline.kv-objects]
[pipeline.kv-transactions]
```

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[consistency]
consistent-range = 1000

[pipeline.sum-obj-types]
```

```
[ingestion]
remote-store-url = "https://checkpoints.mainnet.sui.io"

[committer]
collect-interval-ms = 1000

[pipeline.tx-calls]

[pipeline.tx-affected-objects]
collect-interval-ms = 5000
```
## Description

Move the source of checkpoint data (remote store or local path) into its
own struct that is parsed using clap, instead of serde. This is to cater
to the case where the same indexer configuration might be used to index
different networks (mainnet, testnet, devnet, etc).

## Test plan

Run the indexer against a variety of configs:

```
sui$ cargo run -p sui-indexer-alt --release --         \
 indexer --last-checkpoint 10000                       \
 --remote-store-url https://checkpoints.mainnet.sui.io \
 --config $CONFIG
```

Where config is a file that contains one of the following:

```
[pipeline.kv_objects]
[pipeline.kv_transactions]
```

```
[consistency]
consistent-range = 1000

[pipeline.sum_obj_types]
```

```
[committer]
collect-interval-ms = 1000

[pipeline.tx_calls]

[pipeline.tx_affected_objects]
collect-interval-ms = 5000
```
## Description

TOML deserialization ignores fields that are not relevant to the struct
being targeted. Given all the fields in our config have default values,
this behaviour can cause confusion because it means that a typo on a
field name will result in the value being ignored and replaced with its
default value without warning.

It's tough to change this behaviour wholesale (to detect and error on
any unrecognised field name), but this chane tries to bridge most of the
gap by detecting unexpected pipeline configurations and erroring for
those.

## Test plan

Run the indexer on a config with typos in its pipeline configs. It will
now produce an error explaining that it doesn't recognise those configs:

```
sui$ cargo run -p sui-indexer-alt -- indexer --config /tmp/tx.toml --last-checkpoint 10000 --remote-store-url 'https://checkpoints.mainnet.sui.io'
Error: Unexpected pipeline configurations:
[tx-calls]

[tx-affected-objects]
collect-interval-ms = 5000
```
## Description

This override can now be performed using file-based configuration, so we
don't need to hard code overrides in the binary.

## Test plan

CI
## Description

Introduce a `*Layer` type for each file-based `*Config` type. `*Layer`
types typically follow the same structure as their corresponding config,
but all their fields are optional. This abstraction will serve the
following purposes (to be realised in future changes -- this change is a
pure refactoring):

- It allows us to isolate the dependency on `sui_default_config` to code
  that will only belong in our indexer, and not into the indexer
  framework.
- Layers can be merged together, this allows us to accept multiple
  configuration files and merge them into one file. This can be used for
  storing the configs for each pipeline in their own files to mix and
  match, and/or to support parameterised indexing in future.
- When converting from a Layer to its Config type (finishing it), we can
  check that the layer does not have any extra (unexpected) fields.

## Test plan

CI + tests + run the indexer against some existing configs
@amnn amnn self-assigned this Nov 28, 2024
Copy link

vercel bot commented Nov 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 29, 2024 6:18pm
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 6:18pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 6:18pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Nov 29, 2024 6:18pm

Copy link
Contributor

@bmwill bmwill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, modulo the probably unintentional removal of some code in one commit.

crates/sui-indexer-alt/src/args.rs Show resolved Hide resolved
crates/sui-indexer-alt/src/db.rs Show resolved Hide resolved
amnn added 9 commits November 29, 2024 17:03
## Description

Add back the `--pipeline` command-line argument, layered on top of the
file-based configuration.

## Test plan

```
sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml
# Remove some pipelines

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml                            \
  --pipeline i_dont_exist

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml                            \
  --pipeline kv_objects --pipeline kv_transactions

sui$ cargo run -p sui-indexer-alt -- indexer            \
  --last-checkpoint 10000                               \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --config /tmp/indexer.toml
```
## Description
Document how the macros for adding pipelines work.

## Test plan
:eyes:
## Description

Add an ability to merge together configurations. This is useful for
configuring each pipeline in its own file and then merging them into
pods, or combining the configurations of multiple application-specific
indexers into one configuration.

## Test plan

New unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt
```

Run a test command:

```
sui$ cargo run -p sui-indexer-alt -- merge-configs \
  --config tx.toml                                 \
  --config obj.toml                                \
  --config kv.toml

[ingestion]

[consistency]
consistent-range = 1000

[committer]
collect-interval-ms = 1000

[pipeline.sum_obj_types]

[pipeline.kv_objects]

[pipeline.kv_transactions]

[pipeline.tx_affected_objects]

[pipeline.tx_calls]

```
## Description

Similar to the change introducing a shared committer config, this change
introduces a shared pruner config between all concurrent pipelines. The
values in this config are only used to fill out concurrent pipelines who
have already specified that they want pruning enabled (i.e. unlike the
committer override which is always inherited, this will only be
inherited if the pruner config is called for).

Additionally, unlike other fields where the last write wins when merging
configs, `PrunerConfig.retention` will be merged by `max`. This is to
deal with merging the configs of two different apps that need to index
the same table, but with different retentions.

## Test plan

New unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt
```
## Description

Add an extra flattened generic, `toml::Table` to every file-based config
struct to pick up any flags that were not recognised by the indexer.

Extra fields will be logged as a warning and then ignored when
processing flags, so that we are aware of potential typos, but will not
cause an error to preserve some form of backwards compatibility (e.g.
imagine a new pipeline is added to the indexer, and we update its TOML
and try and run an old version of the indexer with the config that
mentions the new pipeline).

## Test plan

Try merging together some configs that contain typos in them:

STDERR:
```
2024-11-27T00:29:37.177152Z  WARN sui_indexer_alt::config: Found unrecognized committer field which will be ignored by this indexer. This could be because of a typo, or because it was introduced in a newer version of the indexer:
collect-interval-secs = 1000

2024-11-27T00:29:37.177242Z  WARN sui_indexer_alt::config: Found unrecognized pipeline field which will be ignored by this indexer. This could be because of a typo, or because it was introduced in a newer version of the indexer:
[tx-affected-objects]
collect-interval-ms = 5000

2024-11-27T00:29:37.177328Z  WARN sui_indexer_alt::config: Found unrecognized consistency field which will be ignored by this indexer. This could be because of a typo, or because it was introduced in a newer version of the indexer:
consistency-range = 1000
```

STDOUT:
```
[ingestion]

[consistency]

[committer]

[pruner]

[pipeline.sum_obj_types]

[pipeline.kv_objects]

[pipeline.kv_transactions]

[pipeline.tx_calls]

```
## Description

Using the default impl for `IndexerConfig` to generate the output for
`generate-config` lead to sub-par results because most of its fields are
optional, and so will not show up.

This change introduces an explicit `example` function that is
responsible for generating a non-default output that does a better job
of documenting what fields and pipelines are available to configure.

## Test plan

Generate a config:

```
sui$ cargo run -p sui-indexer-alt -- generate-config

[ingestion]
checkpoint-buffer-size = 5000
ingest-concurrency = 200
retry-interval-ms = 200

[consistency]
consistent-pruning-interval-ms = 300000
pruner-delay-ms = 120000

[committer]
write-concurrency = 5
collect-interval-ms = 500
watermark-interval-ms = 500

[pruner]
interval-ms = 300000
delay-ms = 120000
retention = 4000000
max-chunk-size = 2000

[pipeline.sum_coin_balances]

[pipeline.wal_coin_balances]

[pipeline.sum_obj_types]

[pipeline.wal_obj_types]

[pipeline.sum_displays]

[pipeline.sum_packages]

[pipeline.ev_emit_mod]

[pipeline.ev_struct_inst]

[pipeline.kv_checkpoints]

[pipeline.kv_epoch_ends]

[pipeline.kv_epoch_starts]

[pipeline.kv_feature_flags]

[pipeline.kv_objects]

[pipeline.kv_protocol_configs]

[pipeline.kv_transactions]

[pipeline.obj_versions]

[pipeline.tx_affected_addresses]

[pipeline.tx_affected_objects]

[pipeline.tx_balance_changes]

[pipeline.tx_calls]

[pipeline.tx_digests]

[pipeline.tx_kinds]

```
## Description
Replace use of macros with a `Merge` trait.

## Test plan
CI
## Description
Avoid catch-all (`..`) patterns, and be explicit about the `extra`
pattern, so that if we introduce new fields in future, we notice.

## Test plan
CI
## Description

Pluralize it to match the other instances of the name.

## Test plan

CI
@amnn amnn temporarily deployed to sui-typescript-aws-kms-test-env November 29, 2024 17:12 — with GitHub Actions Inactive
@amnn amnn temporarily deployed to sui-typescript-aws-kms-test-env November 29, 2024 18:17 — with GitHub Actions Inactive
@amnn amnn merged commit ddb604b into main Nov 29, 2024
97 checks passed
@amnn amnn deleted the amnn/idx-cfg-ms branch November 29, 2024 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants