RFC 9481 - 2022-02-23 - Configuration Schema

Vector's configuration, while driven by code, is hard to document, as well as programmatically describe and validate outside of Vector itself. This RFC proposes an enhancement to how we define configuration types so that Vector can emit an authoritative schema that can be used to drive configuration editing and validation, in both interactive and programmatic contexts.

Context

  • Document options naming and remediate #3714
  • Formalize how options are deprecated #4023
  • Versioned Vector configuration #6231
  • Fix broken handling of adaptive_concurrency defaults #8189
  • Fuzz/validate Vector's configuration options #8229
  • ARC was not enabled by default #9727
  • Avoid enum deserialization for configuration #10582

Cross cutting concerns

  • This work will eventually help power the UI for Datadog's Observability Pipelines, in terms of validating user-supplied configurations. We want to be cognizant of areas where we can provide more flexibility/generalization to allow that work to extract whatever information it may require.
  • Powering the new website's reference documentation.
  • There are many potentially off-label usages of the configuration schema -- configuration migration, generative testing, and so on -- where we'll be walking a fine line between encoding our configuration schema as an API contract and encoding it as a representation of implementation details.

Scope

In scope

  • Encoding the schema of Vector's configuration.
  • Modifications to how configuration types are created and used to provide invariants that make generating such a schema easier.

Out of scope

  • Providing the configuration schema as part of a Vector release.
  • Providing additional tooling, or enhancements to existing tooling, to use the schema to help better validate configurations.
  • Generating Cue documentation from the configuration schema.
  • Generating configuration migrations between versions of Vector.

Pain

Vector's configuration represents a fairly dense entrypoint into the vast flexibility provided by Vector: sources, transforms, sinks, tests, each with their own varying levels of complexity.

While Vector's documentation is generally fairly high-quality, users struggle to efficiently and correctly configure Vector for a variety of different reasons:

  • lack of available examples
  • incorrectly documented fields, or lack of documentation of fields
  • confusion over field types and supported values (e.g. a field that can be a string or object)
  • human-unfriendly rendering of default values (e.g. 1e+07 instead of 10000000, even when we know the unit is bytes, and could instead display 10MB)

Additionally, some of these pains are also felt during development. While users may struggle to understand an issue caused by following incorrect documentation, developers also struggle to correctly update the documentation itself. Vector uses Cue, a data definition language that can define data and its schema together, to codify our configuration in a somewhat programmatic/strict way. While Cue is a powerful data constraint language, it can take time to master and can be fairly inscrutable when errors are encountered.

Overlapping with some of the user pains, it is easy during development to miss that the documentation needs to be updated, or simply to forget to update it, which in turn leads to documentation that grows out-of-sync until someone realizes what has occurred.

Proposal

User Experience

In order to address these concerns, we would update Vector to generate a configuration schema to provide a single source of truth in terms of how a particular version of Vector could be configured.

This schema would be based on JSONSchema, a specification for defining and validating JSON objects, as we can trivially convert any supported encoding for Vector configurations into JSON. JSONSchema is the most comprehensive specification for schema definition and validation, particularly given the ubiquity of JSON, in which JSONSchema documents are themselves written.

As a developer, the primary impact of this would be in terms of needing to utilize new helpers, and patterns, that become required for documenting their configuration types. This would represent a net-new change to creating and updating components, although as part of developing this feature, all existing components would be bootstrapped in terms of being migrated over.

In turn, developers could equate this to something like a new CI check or lint being added that told them when their code did not comply, and what changes needed to be made in order to do so.

Overall, the cognitive overhead of this proposal would be low, as we would rely on the compiler, and CI, as much as possible in order to surface errors or non-compliance and explain exactly what needed to be changed in order to fix the errors.

Implementation

1. Introduce a new trait and custom derive macro: Configurable

This trait, and derive macro, would form the basis of validating the compliance of a configuration as well as walking a configuration to generate its schema. There are two primary requirements for generating the configuration schema: discoverability and compliance.

Discoverability, or the actual logic of inspecting the configuration types, encoding information about their fields, allowable values, and so on, is the most obvious requirement. We need to be able to find these types, know they can be inspected, and then actually do the work. Compliance is perhaps even more important: unless all configuration types are able to be inspected, then the configuration schema can never be used to correctly validate a user-supplied configuration.

The Configurable trait would form the basis for discoverability. It would provide a minimalistic interface that walked the type, and walked its fields, mapping closely to the traditional "Visitor" pattern. The trait would allow exposing common items such as name, description, allowable type, units, and so on. Additionally, it would allow for defining custom metadata, or extensions, that could be parsed by external code to satisfy more advanced workflows, e.g. configuration migration, testing, etc.

The Configurable derive macro would form the basis for compliance. While it would generally provide the scaffolding to generate an implementation of the Configurable trait -- walking each field, gathering attributes and doc comments and so on -- it would also be able to validate that those things exist at compile time. As an example, we can enforce that all Configurable implementors are fully documented: the type itself, their subfields, and so on. It would then become extremely hard for developers to add new fields or types in configuration that weren't documented upon cutting a new Vector release.

2. Configurable trait as a vehicle for encapsulating all facets of configuration

As discussed in point 1, the Configurable trait is meant to provide a common interface for configuration types, and the types used within those types, such as types from the standard library or third-party crates, such that they can describe their "shape", value constraints, and any other relevant metadata. Below is an abbreviated version of the Configurable trait, along with supporting types:

/// The shape of the field.
///
/// This maps closely to the concept of JSON's data types, where types are generalized and have
/// generalized representations. This allows us to provide general-but-relevant mappings to core
/// types, such as integers and strings and so on, while providing escape hatches for customized
/// types that may be encoded and decoded via "normal" types but otherwise have specific rules or
/// requirements.
///
/// Additionally, the shape of a field can encode some basic properties about the field to which it
/// is attached. For example, numbers can be bounded on the lower or upper end, while strings
/// could define a minimum length, or even an allowed pattern via regular expressions.
///
/// In this way, they describe a more complete shape of the field than simply the data type alone.
#[derive(Clone)]
pub enum Shape {
    Null,
    Boolean,
    String(StringShape),
    Number(NumberShape),
    Array(ArrayShape),
    Map(MapShape),
    Composite(Vec<Shape>),
}

#[derive(Clone, Default)]
pub struct StringShape {
    minimum_length: Option<usize>,
    maximum_length: Option<usize>,
    allowed_pattern: Option<&'static str>,
}

#[derive(Clone)]
pub enum NumberShape {
    Unsigned {
        effective_lower_bound: u128,
        effective_upper_bound: u128,
    },
    Signed {
        effective_lower_bound: i128,
        effective_upper_bound: i128,
    },
    FloatingPoint {
        effective_lower_bound: f64,
        effective_upper_bound: f64,
    }
}

#[derive(Clone)]
pub struct ArrayShape {
    element_shape: Box<Shape>,
    minimum_length: Option<usize>,
    maximum_length: Option<usize>,
}

#[derive(Clone)]
pub struct MapShape {
    required_fields: HashMap<&'static str, Shape>,
    allowed_unknown_field_shape: Option<Shape>,
}

pub struct Field {
    name: &'static str,
    description: &'static str,
    shape: Shape,
    fields: Vec<Field>,
    metadata: Metadata<Value>,
}

#[derive(Clone, Default)]
pub struct Metadata<T: Serialize> {
    default: Option<T>,
    attributes: Vec<(String, String)>,
}

pub trait Configurable<'de>: Serialize + Deserialize<'de> + Sized
where
    Self: Clone,
{
    /// Gets the human-readable description of this value, if any.
    ///
    /// For standard types, this will be `None`. Commonly, custom types would implement this
    /// directly, while fields using standard types would provide a field-specific description that
    /// would be used instead of the default description.
    fn description() -> Option<&'static str>;

    /// Gets the shape of this value.
    fn shape() -> Shape;

    /// Gets the metadata for this value.
    fn metadata() -> Metadata<Self>;

    /// The fields for this value, if any.
    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>>;
}

The Configurable trait defines some very basic core functionality: the description of this type (if applicable), the "shape" of the type, any metadata associated with it, and the fields it exposes. It also enforces (de)serialization capabilities on the type as this represents a base level of functionality required by types that will be utilized in a Vector configuration.

Description and shape are required because they are both inherent and inextricable qualities of anything that we expose as a configurable option. Metadata and fields are optional as not every type will have metadata, and not every type actually has fields. For example, any scalar value -- string, number, bool, etc -- is a singular unit, and the same with arrays. Anything that looks like an "object", however, must have fields, as that is an inherent characteristic of an "object".

At the top level, there must always be a type that is Configurable which maps to the Vector configuration itself, and then fields within it. From this point on, we'll relate characteristics of the Configurable trait in the context of the types that implement it being fields.

Shape represents the inherent type of a field, as well as any additional constraints on that type. This is where we start to see the mappings from Rust types to their serialized representation, and in general, the Shape variants map closely to the various JSON types. We've added some general constraints here -- lower/upper bounds on numbers, min/max length and acceptable regex pattern for strings, expected element shape for arrays, expected fields for maps, etc -- but this is merely for fleshing out the concept. We could extend this as needed but generally we would strive to only encode intrinsic properties of these types within Shape, depending on metadata for more custom/situational constraints.
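To make the idea of constraints living on the shape more concrete, here is a minimal sketch of how a consumer could check values against them. The types are simplified stand-ins for StringShape and the Unsigned variant of NumberShape, and the check methods are hypothetical helpers that are not part of the proposed trait; the regex pattern is left out so the sketch only needs the standard library.

```rust
/// Simplified stand-in for the RFC's `StringShape` (pattern matching omitted).
struct StringShape {
    minimum_length: Option<usize>,
    maximum_length: Option<usize>,
}

/// Simplified stand-in for the `Unsigned` variant of `NumberShape`.
struct UnsignedShape {
    effective_lower_bound: u128,
    effective_upper_bound: u128,
}

impl StringShape {
    /// Hypothetical helper: checks a string against the length constraints.
    fn check(&self, value: &str) -> Result<(), String> {
        let length = value.chars().count();
        if let Some(minimum) = self.minimum_length {
            if length < minimum {
                return Err(format!("length {} is below the minimum of {}", length, minimum));
            }
        }
        if let Some(maximum) = self.maximum_length {
            if length > maximum {
                return Err(format!("length {} exceeds the maximum of {}", length, maximum));
            }
        }
        Ok(())
    }
}

impl UnsignedShape {
    /// Hypothetical helper: checks a number against the inclusive bounds.
    fn check(&self, value: u128) -> Result<(), String> {
        if value < self.effective_lower_bound || value > self.effective_upper_bound {
            return Err(format!(
                "{} is outside of {}..={}",
                value, self.effective_lower_bound, self.effective_upper_bound
            ));
        }
        Ok(())
    }
}
```

A non-zero u32 field, for instance, would carry a lower bound of 1 and an upper bound of u32::MAX, so both 0 and anything above 4294967295 would be rejected without the validator knowing anything else about the field.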

Following on from Shape, we have the ability to define metadata about fields. One major thing that we utilize metadata for is to provide default values for a given type/field. This allows Shape to avoid having to deal with that as it makes it a bit messier. Another thing it allows us to do is use a generically-typed struct to capture real Rust values, and then eventually serialize them down to a generic representation that can eventually flow into the schema. Additionally, and perhaps most obviously, metadata can also be used for generic key/value data about the given type.

Finally, we come to fields. As mentioned above, fields are the realization, essentially, of the sum of Configurable types that can represent a Vector configuration. They are a coalesced version of all the data provided by Configurable and are ultimately the data that gets used to drive schema generation. One point here is that this is the interface where typed metadata will be serialized such that Field has all the data necessary to generate a schema: name, description, shape, default value, custom metadata, subfields, etc.
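As a rough illustration of why a tree of fields is enough to drive generation, the sketch below walks a trimmed-down Field (shape and metadata omitted) depth-first and records each field as a dotted path. The flatten function and the dotted-path convention are assumptions made for this example only; a real generator would emit schema nodes rather than strings.

```rust
/// Trimmed-down version of the RFC's `Field`: shape and metadata are left out.
struct Field {
    name: &'static str,
    description: &'static str,
    fields: Vec<Field>,
}

/// Depth-first walk that records every field as a dotted path plus its description.
/// `prefix` is the path of the parent, or an empty string for the root.
fn flatten(field: &Field, prefix: &str, out: &mut Vec<(String, &'static str)>) {
    let path = if prefix.is_empty() {
        field.name.to_string()
    } else {
        format!("{}.{}", prefix, field.name)
    };
    out.push((path.clone(), field.description));
    for child in &field.fields {
        flatten(child, &path, out);
    }
}
```

Calling flatten on a root field named sink that contains endpoint and batch, where batch contains max_events, yields the paths sink, sink.endpoint, sink.batch and sink.batch.max_events, in that order.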

Below is an example of a very simple sink configuration which supports batching and uses the ubiquitous BatchConfig type:

#[derive(Serialize, Deserialize, Clone)]
struct SinkConfig {
    endpoint: String,
    batch: BatchConfig,
}

impl<'de> Configurable<'de> for SinkConfig {
    fn description() -> Option<&'static str> {
        Some("Configuration for the sink.")
    }

    fn shape() -> Shape {
        let mut required_fields = HashMap::new();
        required_fields.insert("endpoint", <String as Configurable>::shape());
        required_fields.insert("batch", <BatchConfig as Configurable>::shape());

        Shape::Map(MapShape {
            required_fields,
            allowed_unknown_field_shape: None,
        })
    }

    fn metadata() -> Metadata<Self> {
        Metadata {
            default: Some(SinkConfig {
                endpoint: String::from("foo"),
                batch: BatchConfig::default(),
            }),
            ..Default::default()
        }
    }

    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
        let shape = Self::shape();
        let mut required_field_shapes = match shape {
            Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
            _ => unreachable!("SinkConfig is a fixed-field object and cannot be another shape"),
        };

        let base_metadata = <Self as Configurable>::metadata();
        let merged_metadata = merge_metadata_overrides(base_metadata, overrides);

        let endpoint_shape = required_field_shapes.remove("endpoint").expect("shape for `endpoint` must exist");
        let endpoint_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.endpoint.clone());

        let batch_shape = required_field_shapes.remove("batch").expect("shape for `batch` must exist");
        let batch_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.batch.clone());

        let mut fields = HashMap::new();
        fields.insert("endpoint", Field::new::<String>(
            "endpoint",
            "The endpoint to send requests to.",
            endpoint_shape,
            endpoint_override_metadata,
        ));
        fields.insert("batch", Field::new::<BatchConfig>(
            "batch",
            <BatchConfig as Configurable>::description().expect("`BatchConfig` has no defined description, and an override description was not provided."),
            batch_shape,
            batch_override_metadata,
        ));

        Some(fields)
    }
}


#[derive(Serialize, Deserialize, Default, Clone)]
struct BatchConfig {
    max_events: Option<u32>,
    max_bytes: Option<u32>,
    max_timeout: Option<Duration>,
}

impl<'de> Configurable<'de> for BatchConfig {
    fn description() -> Option<&'static str> {
        Some("Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.")
    }

    fn shape() -> Shape {
        let mut required_fields = HashMap::new();
        required_fields.insert("max_events", <Option<u32> as Configurable>::shape());
        required_fields.insert("max_bytes", <Option<u32> as Configurable>::shape());
        required_fields.insert("max_timeout", <Option<Duration> as Configurable>::shape());

        Shape::Map(MapShape {
            required_fields,
            allowed_unknown_field_shape: None,
        })
    }

    fn metadata() -> Metadata<Self> {
        Metadata {
            default: Some(BatchConfig {
                max_events: Some(1000),
                max_bytes: Some(1048576),
                max_timeout: Some(Duration::from_secs(60)),
            }),
            ..Default::default()
        }
    }

    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
        let shape = Self::shape();
        let mut required_field_shapes = match shape {
            Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
            _ => unreachable!("BatchConfig is a fixed-field object and cannot be another shape"),
        };

        let base_metadata = <Self as Configurable>::metadata();
        let merged_metadata = merge_metadata_overrides(base_metadata, overrides);

        let max_events_shape = required_field_shapes.remove("max_events").expect("shape for `max_events` must exist");
        let max_events_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_events);

        let max_bytes_shape = required_field_shapes.remove("max_bytes").expect("shape for `max_bytes` must exist");
        let max_bytes_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_bytes);

        let max_timeout_shape = required_field_shapes.remove("max_timeout").expect("shape for `max_timeout` must exist");
        let max_timeout_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_timeout);

        let mut fields = HashMap::new();
        fields.insert("max_events", Field::new::<Option<u32>>(
            "max_events",
            "Maximum number of events per batch.",
            max_events_shape,
            max_events_override_metadata,
        ));
        fields.insert("max_bytes", Field::new::<Option<u32>>(
            "max_bytes",
            "Maximum number of bytes per batch.",
            max_bytes_shape,
            max_bytes_override_metadata,
        ));
        fields.insert("max_timeout", Field::new::<Option<Duration>>(
            "max_timeout",
            "Maximum period of time a batch can exist before being forcibly flushed.",
            max_timeout_shape,
            max_timeout_override_metadata,
        ));

        Some(fields)
    }
}

This code represents the expected boilerplate that would be generated by a Configurable derive macro, so it may appear verbose but such verbosity would be entirely hidden from developers unless they needed to manually implement Configurable for a remote type in a third-party crate.

Immediately, we can observe a few things about the trait's usage in practice. The design of Configurable::metadata and Configurable::fields allows us to define metadata such that it is automatically propagated downward as far as we wish. In the above example, we use this specifically for the ability to define a default BatchConfig value at the SinkConfig level, while being able to pass the value of each individual field down. This means that, while we may be using a global definition of the shape of BatchConfig, we can define an override of the default value for it at the point of usage.
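The helpers merge_metadata_overrides and map_default_value are used in the generated code above but are not defined in this RFC. The sketch below shows one way they could behave, under the assumptions that an override's default wins over the base default and that attributes are merged by key; the Metadata type here drops the Serialize bound so the example is self-contained.

```rust
/// Stand-in for the RFC's `Metadata<T>`, without the `Serialize` bound.
#[derive(Clone, Debug, PartialEq)]
struct Metadata<T> {
    default: Option<T>,
    attributes: Vec<(String, String)>,
}

impl<T> Metadata<T> {
    /// Assumed behavior: project the typed default of a parent onto one of its
    /// fields, carrying the attributes along unchanged.
    fn map_default_value<U>(self, f: impl FnOnce(&T) -> U) -> Metadata<U> {
        Metadata {
            default: self.default.as_ref().map(f),
            attributes: self.attributes,
        }
    }
}

/// Assumed behavior: values in `overrides` take precedence over values in `base`.
fn merge_metadata_overrides<T>(base: Metadata<T>, overrides: Metadata<T>) -> Metadata<T> {
    let mut attributes = base.attributes;
    for (key, value) in overrides.attributes {
        // Replace an existing attribute with the same key, otherwise append it.
        if let Some(position) = attributes.iter().position(|(k, _)| *k == key) {
            attributes[position].1 = value;
        } else {
            attributes.push((key, value));
        }
    }
    Metadata {
        default: overrides.default.or(base.default),
        attributes,
    }
}
```

With these semantics, a default supplied at the point of usage replaces the type-level default, and map_default_value then hands each field its own slice of that merged default, which is the downward propagation described above.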

Additionally, as the metadata is typed, the derive macro can generate rich code that utilizes the raw types involved rather than having to deal with downcasted/generalized versions. We utilize this in the map_default_value calls above to grab the value of a specific field from a value of Self, allowing us to continue generating and providing typed metadata as we render each field, and that field's fields, and so on.

Additionally, you can see the generated code around things like descriptions, where we can layer on additional checks to ensure required fields are present, giving us the ability to add in run-time checks on top of compile-time checks, further extending our goal of misuse resistance.

3. Configurable derive macro as a vehicle for easily defining high-level constraints on configuration types

Following from the code examples in point 2, we'll explore what the user-defined types would look like when using the proposed Configurable derive macro. Some of the fields and other metadata may differ from the above example as it would have become too verbose to display above. All of that said, let's take a look:

/// Configuration for the sink.
#[derive(Clone)]
#[configurable_component]
#[configurable(metadata("status", "beta"))]
struct SinkConfig {
    /// The endpoint to send requests to.
    #[configurable(format(uri), deprecated("url"))]
    #[serde(alias = "url")]
    endpoint: String,
    #[serde(default = "default_batch_config_for_sink")]
    #[configurable(subfield(max_events, range(max = 1000)))]
    batch: BatchConfig,
}

/// Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.
#[derive(Default, Clone)]
#[configurable_component]
struct BatchConfig {
    /// Maximum number of events per batch.
    #[configurable(non_zero)]
    max_events: Option<u32>,
    /// Maximum number of bytes per batch.
    #[configurable(non_zero)]
    max_bytes: Option<u32>,
    /// Maximum period of time a batch can exist before being forcibly flushed.
    #[configurable(non_zero)]
    max_timeout: Option<Duration>,
}

At a high level, the Configurable derive macro primarily deals with generating the boilerplate Configurable implementation for the given type, but goes further with actually being able to introspect existing attributes as well as allowing further constraints to be applied. Additionally, you can see a distinct attribute macro here -- configurable_component -- which we use to apply the derive attribute for us. You can think of #[configurable_component] as being a string replacement marker for #[derive(Serialize, Deserialize, Configurable)]. This lets us enforce that Serialize and Deserialize are derived, along with Configurable, as they all must be present for any configuration type that we want to include within the schema.

Above, we can see the two previously shown types, with doc comments, typical derives for (de)serialization, and so on. The Configurable derive macro can trivially consume information such as the doc comments to provide a description for types and fields. The real power, as mentioned above, is when we get into using the configurable attribute on fields, and how it uses existing attributes.

We can see a number of usages of #[configurable(...)] which represents specific attributes supported by the Configurable derive macro. In this example, we're doing a few different things:

  • defining a JSONSchema format of uri on SinkConfig::endpoint, which will let consumers of the schema know to validate this field according to the uri format defined in the JSONSchema specification
  • defining the alias of "url" on the SinkConfig::endpoint as deprecated, which gets added as custom metadata
  • defining a subfield constraint on SinkConfig::batch where we apply a range override, specifically setting the maximum value for max_events to 1000
  • setting all fields on BatchConfig to have a non-zero constraint, ensuring that none of the values can ever be passed in as zero
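The interaction between the last two bullets (a type-level non_zero constraint and a usage-level range override) can be sketched as a range intersection. The narrow function below is a hypothetical helper, not something the derive macro is specified to emit; it assumes that an override may only tighten the bounds already implied by the field's type and constraints.

```rust
use std::ops::RangeInclusive;

/// Narrows `base` by an optional override on either end.
/// Returns `None` when no value could satisfy both the base range and the override.
fn narrow(
    base: RangeInclusive<u64>,
    min: Option<u64>,
    max: Option<u64>,
) -> Option<RangeInclusive<u64>> {
    // An override can raise the lower bound or reduce the upper bound, never loosen them.
    let lower = min.map_or(*base.start(), |m| m.max(*base.start()));
    let upper = max.map_or(*base.end(), |m| m.min(*base.end()));
    if lower <= upper {
        Some(lower..=upper)
    } else {
        None
    }
}
```

For max_events, the base range of a non-zero u32 is 1..=4294967295, and applying range(max = 1000) narrows it to 1..=1000. An override that contradicts the base range produces None, which a derive macro could surface as an error.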

In and of themselves, these are powerful constraints to be able to apply inline with the definition of the configuration types themselves, and then can be exposed via the schema using either native JSONSchema support or custom metadata if they weren't natively supported yet.

As well, we can interrogate the other attributes present on the fields, including existing serde field attributes. On the batch field in SinkConfig, a default batch configuration has been defined using the typical serde field attribute, default, which can either fall back to the type's Default implementation or name a function that generates the value. We too can see this attribute when our derive macro runs, and we can utilize it to generate our own default value. This is generically applicable to whatever attributes we want to be able to interrogate, and so this provides an extremely powerful primitive to be able to take advantage of existing code, as well as support new attributes from other crates that get utilized in the future.

Additionally, the ability to specify custom metadata using the configurable attribute is a powerful escape hatch when we need to encode behaviours, or inline relevant data, to configuration types and fields that aren't related to the schema of a Vector configuration itself... or aren't possible to encode in JSONSchema. For example, this could be used for something simple, like in the above example, where we're defining the status of the sink implementation as beta. This might be used to drive the generated content of the vector.dev website and documentation.

Some constraints might be harder to express in a JSONSchema, however, such as whether or not a sink supports acknowledgements. Whether or not a sink supports acknowledgements isn't terribly relevant to the Vector configuration itself, beyond validating whether or not any fields which toggle it on or off have been set right, and so on. There is a semantic relevance, however: knowing that a sink does or doesn't support acknowledgements could allow a validator to surface issues to the user. For example, if they have acknowledgements enabled on a sink but their source does not support acknowledgements, then end-to-end acknowledgements would not be able to actually function. While the configuration loading code can also detect this, being able to provide these semantic definitions within the schema itself allows us to more generically encode these types of behaviors and allow external tools, which don't have the benefit of running Vector directly, to correctly suss out incompatibilities and misconfigurations.
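A minimal sketch of the kind of check an external validator could perform with that information is shown below. The Component type, its field names, and the check_acknowledgements function are all hypothetical: they assume the validator has already combined the schema's custom metadata (whether acknowledgements are supported) with the user's configuration (whether they are enabled).

```rust
/// Hypothetical view of a component, as an external validator might assemble it.
struct Component {
    name: &'static str,
    /// Would come from custom metadata in the schema.
    supports_acknowledgements: bool,
    /// Would come from the user-supplied configuration.
    acknowledgements_enabled: bool,
}

/// Reports a misconfiguration when acknowledgements are enabled on a sink but
/// cannot work end to end for the given source/sink pair.
fn check_acknowledgements(source: &Component, sink: &Component) -> Result<(), String> {
    if sink.acknowledgements_enabled && !sink.supports_acknowledgements {
        return Err(format!("sink `{}` does not support acknowledgements", sink.name));
    }
    if sink.acknowledgements_enabled && !source.supports_acknowledgements {
        return Err(format!(
            "sink `{}` has acknowledgements enabled, but source `{}` does not support them",
            sink.name, source.name
        ));
    }
    Ok(())
}
```

Nothing in this check requires running Vector, which is the point: the schema carries enough semantic information for a third-party tool to reach the same conclusion.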

Further, and following the example code itself, this can allow us to enforce constraints that are only partially able to be represented in JSONSchema, such as aliased fields. While JSONSchema already supports the ability to define a schema such that a field can be represented by multiple names, it has no way to express that one of those names is a deprecated alias for another: the deprecated annotation added in draft 2019-09 only flags a property, without saying what replaces it. Utilizing custom metadata, we can encode metadata that indicates which field name variant is the deprecated one. This can be used not only to drive behavior in the generated documentation, but in other tools as well, such as automatically transforming one version of a configuration to a newer version by analyzing the schemas, and being able to reason that if field X used to be able to be referenced via A or B, and now only B is allowed, we can just rename A to B. We could also check that schemas don't remove fields unless those fields were already marked as deprecated, being able to enforce our unofficial guideline of not removing fields unless they've been marked as deprecated for at least one release.
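The "rename A to B" migration described above could look roughly like the following. This is only a sketch under simplifying assumptions: the configuration is flattened to a string map, and DeprecatedAlias is a hypothetical record that a tool would build from the schema's custom metadata. Configuration migration itself remains out of scope for this RFC.

```rust
use std::collections::BTreeMap;

/// Hypothetical record derived from schema metadata: `deprecated` is an old
/// field name and `replacement` is the name that should be used instead.
struct DeprecatedAlias {
    deprecated: &'static str,
    replacement: &'static str,
}

/// Renames deprecated keys in place and returns the names that were rewritten.
/// If the replacement is already present, that alias is skipped, since choosing
/// between two conflicting values would be a guess.
fn migrate_aliases(
    config: &mut BTreeMap<String, String>,
    aliases: &[DeprecatedAlias],
) -> Vec<&'static str> {
    let mut renamed = Vec::new();
    for alias in aliases {
        if config.contains_key(alias.replacement) {
            continue;
        }
        if let Some(value) = config.remove(alias.deprecated) {
            config.insert(alias.replacement.to_string(), value);
            renamed.push(alias.deprecated);
        }
    }
    renamed
}
```

Applied to a sink configuration that still uses url, this moves the value to endpoint and reports url as rewritten, which a tool could then present to the user as a suggested change.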

4. Utilize schemars to generate the actual JSONSchema

With the necessary information being derived from our configuration types directly, we still need a way to take that and emit an actual JSONSchema document for Vector's configuration. We would utilize schemars, a crate for generating JSONSchema documents, for this purpose.

The schemars crate, among other things, provides a small set of traits and types related specifically to programmatically generating a JSONSchema document. Types which a user wishes to document must implement the JsonSchema trait, which interacts with a SchemaGenerator object that actually holds the in-progress schema as a type is being walked.

Our Configurable derive macro would implement this trait automatically for the given type, using the information provided by Configurable to ultimately drive the schema generation. While the boilerplate to read the data given by Configurable and use it to feed the schema generator is almost entirely mechanical and boring, we can take a look at how the above code example from point 3 might look when actually turned into a JSONSchema document:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "SinkConfig",
  "type": "object",
  "oneOf": [
    { "required": ["endpoint"] },
    { "required": ["url"] }
  ],
  "properties": {
    "endpoint": {
      "description": "The endpoint to send requests to.",
      "type": "string",
      "format": "uri"
    },
    "url": {
      "description": "The endpoint to send requests to.",
      "type": "string",
      "format": "uri",
      "deprecated": true
    },
    "batch": {
      "allOf": [{ "$ref": "#/definitions/BatchConfig" }],
      "properties": {
        "max_events": {
          "type": [
            "null",
            "integer"
          ],
          "maximum": 1000
        }
      },
      "default": {
        "max_bytes": 1048576,
        "max_events": 1000,
        "max_timeout": 60
      }
    }
  },
  "_metadata": {
    "status": "beta"
  },
  "definitions": {
    "BatchConfig": {
      "description": "Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.",
      "type": "object",
      "properties": {
        "max_bytes": {
          "description": "Maximum number of bytes per batch.",
          "type": [
            "null",
            "integer"
          ],
          "minimum": 1,
          "maximum": 4294967295
        },
        "max_events": {
          "description": "Maximum number of events per batch.",
          "type": [
            "null",
            "integer"
          ],
          "minimum": 1,
          "maximum": 4294967295
        },
        "max_timeout": {
          "description": "Maximum period of time a batch can exist before being forcibly flushed.",
          "anyOf": [
            { "type": "null" },
            { "$ref": "#/definitions/duration" }
          ],
          "minimum": 1
        }
      }
    },
    "duration": {
      "type": "number",
      "minimum": 0,
      "maximum": 9007199254740991
    }
  }
}

While this is long and verbose, users almost exclusively interact with JSONSchema documents by using a library that validates JSON documents using the schema. Briefly, though, you can observe the following things:

  • we can see that for the SinkConfig::endpoint field, we've enumerated its alias, url, along with the fact that it's deprecated
  • throughout the various fields, we can set field constraints such as the minimum or maximum value for integers
  • we've utilized a common definition of the BatchConfig type, but we can also combine its definition (via allOf) with a separate definition that enforces a maximum size of 1000 events per batch
  • basic information such as field and type descriptions are present, enriching the schema
  • we have support for custom metadata, which is letting us include the sink's beta status in the schema for rich, semantic descriptions
  • while it does not affect validation at all, we can also include other semantic information, such as the default value for a field, even when that field has a shared definition, like BatchConfig

Rationale

In general, the lack of a configuration schema, and the inability to treat our Rust code as the source of truth, hurts both users and developers. As such, doing this work represents a huge opportunity to reduce developer toil when it comes to generating and synchronizing our generated documentation. It additionally represents a large quality of life improvement for our users, who look to our documentation to be up-to-date and semantically meaningful.

If we didn't do this work, we would be stuck with the current manual toil, which is not only extra work on the part of developers -- learning Cue, remembering to update it, etc -- but also reviewers, who now need to catch when these things are missed.

Utilizing an extensible system for expressing the schema of Vector's configuration gives us the runway to solve our current problems, but also the ability to handle future problems and requirements as they come up.

Drawbacks

The primary drawback of this approach is that it limits us to things we can reasonably express within the limitations of the Rust language itself, and what we're comfortable with representing via attributes. For example, single-line and multi-line doc comments are trivially supported, but if we wanted to start pushing more semantically-relevant information, such as configuration snippets, it would be technically possible but would require very unwieldy attribute usage, which could make the source muddy and unclear.

Additionally, while derive macros are written in Rust code, which can be much easier to grok than declarative macros, they still represent a section of code that may be harder to understand and modify than if we used a more brute force approach.

Prior Art

Notably, and perhaps most obviously, Cue itself can be used for data validation purposes. There are also other projects that slot themselves into the same general domain, such as OPA/Rego and CDDL.

The primary problem with all of these solutions is that they're all custom languages, with far less cross-platform support, and a far steeper learning curve. Even though our approach generally concerns itself with abstracting away the schema tool itself, and focusing on making it trivial to generate the schema from annotated Rust types/fields, eventually the rubber must meet the road, and this is where these other tools would fall down for our case and become very hard to wield correctly.

Alternatives

Use schemars only

While schemars itself has support for annotating Rust types in almost the exact same way as we've proposed above, it lacks a few features necessary for our use case:

  • no support for the serde alias attribute feature
  • no support for defining generic metadata for a type/field and exposing it in the schema
  • no mechanism to override constraints for a field which already defines its own constraints at the type level

The lack of these features makes it much harder to correctly generate our schema, as we would still need to build support for custom attributes to cover the missing features, and the work to support at least one attribute is 90% of the work to support two, or three, or more, custom attributes. It would also mean that developers would need to figure out whether or not schemars provided a certain behavior via its attributes, rather than only needing to remember how to get to the documentation for our proposed internal approach.
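
As an illustration of the third missing feature, the following sketch shows one way field-level constraints could be layered over type-level ones. The Bounds type and its methods are hypothetical and are not part of schemars or Vector; the numbers echo the earlier example, where a batch size with a type-level range of 1 to 4294967296 is narrowed to a maximum of 1000 events for one sink.

```rust
/// Numeric constraints attached to a type or to a field.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Bounds {
    pub minimum: Option<u64>,
    pub maximum: Option<u64>,
}

impl Bounds {
    /// Layer field-level bounds over type-level bounds: any bound set on the
    /// field wins, and anything left unset falls back to the type's bound.
    pub fn overridden_by(self, field: Bounds) -> Bounds {
        Bounds {
            minimum: field.minimum.or(self.minimum),
            maximum: field.maximum.or(self.maximum),
        }
    }

    /// Check a value against whichever bounds are present.
    pub fn allows(&self, value: u64) -> bool {
        self.minimum.map_or(true, |min| value >= min)
            && self.maximum.map_or(true, |max| value <= max)
    }
}
```

With the type-level bounds set to a minimum of 1 and a maximum of 4294967296, and a field-level override that only sets a maximum of 1000, the merged bounds keep the minimum of 1 and adopt the tighter maximum.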

Generate the Rust configuration types from a non-Rust source of truth

Another alternative is the possibility of moving the configuration source-of-truth outside of Rust and using it in the opposite direction, to generate the Rust types themselves. This is technically possible, although fundamentally suboptimal for a few reasons:

  • it still requires developers to become well-versed in whatever language is used to define the schema, which is already a pain point that we hope this work can be used to help solve in general
  • it would require a potentially error-prone way of generating the types and then importing the code for use in the codebase

In general, the issue of integrating the code is the biggest reason to avoid such an approach. Developers already have issues with getting IDE language assist extensions/plugins to correctly provide type information and hinting for types that are in Rust code which is imported from the filesystem, such as the approach taken by prost for importing the Rust code it generates from Protocol Buffers definitions. Those issues can be easily kept at bay as-is because the rate of change to things like our Protocol Buffers definitions is low, but configuration types for components are both more prevalent overall and experience far more churn, which means whatever potential issues existed would statistically show up more often.

Even an approach where we more directly placed the code into the normal src hierarchy, to avoid needing to do prost-style code import, would still be at risk of causing friction during the development process:

  • we would be forced into a specific directory/module hierarchy to ensure the tooling put the definitions in the correct location
  • developers would need to run an external tool every time they changed the configuration schema in order to get their updated configuration types

Outstanding Questions

  • Can downstream consumers extract the necessary information purely from the specified JSONSchema fields and any additional custom metadata as string key/value pairs? Do they need a richer representation of custom metadata?
  • Is there a better way, or a way at all, to handle overriding subfields at compile-time? (Procedural macros operate on the Rust AST, so for example, we cannot annotate a field that points to type B and reference a field that only exists on B, so we have to generate code such that it eventually fails at run-time.)
  • How can we provide logical constraints between disparate components? For example, the end-to-end acknowledgements example is a very real scenario where there is no existing way to describe the relationship between sinks based on their acknowledgement support, because schema validation doesn't interpret a configuration like Vector does, where it's actually wired up in a graph vs simply parsed and validated for conformance.

Plan Of Attack

Incremental steps to execute this change. These will be converted to issues after the RFC is approved:

  • Develop the Configurable trait, and supporting types, and implement it by hand for Rust standard library types, and at least one of each major component type (source, transform, sink) to vet out the design and expose any corner cases.
  • Develop an initial set of helper types/methods for generating a JSONSchema document from a root Configurable type, and implement the JsonSchema trait from schemars by hand for all types which have a manual Configurable implementation.
  • Create a top-level root type that fully encompasses the concept of a Vector configuration, and manually implement the JsonSchema trait for it.
  • Create a Vector subcommand for running the schema generation code that prints it to the console.
  • Develop an initial version of the Configurable derive macro, and configurable_component attribute macro, that can generate a boilerplate Configurable trait implementation without any support for attributes or attribute interrogation.
  • Add macro support for also deriving the boilerplate JsonSchema trait implementation.
  • Add macro support for interrogating serde type/field/variant attributes to generate metadata with a goal of being on par with what schemars supports, plus whatever they don't support that we need.
  • Add macro support for configurable attributes, specifically in the vein of what schemars provides, in terms of adding field value constraints or custom metadata.
  • Add a comprehensive internal README covering the usage of the new macro and macro helper attributes, similar in spirit to serde's own website.
  • Replace hand-written implementations of Configurable/JsonSchema with the derive macro/attributes.
  • Continue updating configuration types with the derive macro until all components are using it.
  • Update the logic used to register a component (the inventory-based stuff) to enforce that the configuration types implement Configurable, thereby ensuring that all configuration types are adhering to the requirements enforced by the derive macro.
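
The first step in the plan, hand-writing Configurable for standard library types and one component-style type, might look roughly like the sketch below. The trait shape, method names, and the ExampleBatchConfig type are assumptions made for illustration and are not the final design; the schema is rendered as a plain string to keep the example free of dependencies, whereas the real implementation would build on schemars types.

```rust
/// Hypothetical stand-in for the proposed `Configurable` trait.
pub trait Configurable {
    /// Human-readable description, normally taken from doc comments.
    fn description() -> Option<&'static str> {
        None
    }

    /// JSON Schema fragment describing this type, rendered as a string.
    fn schema() -> String;
}

impl Configurable for u32 {
    fn schema() -> String {
        format!(r#"{{"type":"integer","minimum":0,"maximum":{}}}"#, u32::MAX)
    }
}

impl Configurable for String {
    fn schema() -> String {
        r#"{"type":"string"}"#.to_string()
    }
}

/// Optional values are either `null` or whatever the inner type allows.
impl<T: Configurable> Configurable for Option<T> {
    fn schema() -> String {
        format!(r#"{{"anyOf":[{{"type":"null"}},{}]}}"#, T::schema())
    }
}

/// Simplified, hypothetical component configuration type.
pub struct ExampleBatchConfig {
    pub max_events: Option<u32>,
}

/// A hand-written implementation of the kind the derive macro would
/// later generate from the struct definition and its attributes.
impl Configurable for ExampleBatchConfig {
    fn description() -> Option<&'static str> {
        Some("Controls batching behavior.")
    }

    fn schema() -> String {
        format!(
            r#"{{"type":"object","properties":{{"max_events":{}}}}}"#,
            <Option<u32>>::schema()
        )
    }
}
```

Writing a few of these by hand makes the repetitive parts obvious (one properties entry per field, one wrapper per generic container), which is the boilerplate the later derive macro steps aim to remove.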

Future Improvements

  • Move all configuration validation to configurable attributes, as well as default values that get merged in after-the-fact. Currently, many defaults are merged in, and validations are performed, only once a sink is in the process of being built, and the configuration has been deserialized, which means some validation happens during deserialization and some happens after, leading to slightly disjointed error messages.
  • Generate a configuration schema whenever we cut a release, and store it in Git, followed by checking it against the last release's schema to check for incompatibilities: removal of fields that weren't already marked as deprecated, etc.
  • Generate Cue definitions/documentation based on the JSONSchema itself. This is apparently a one-liner cue command but would also involve figuring out how to integrate the resulting Cue into our existing Cue documentation.
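
The release-to-release compatibility check mentioned above could be sketched as follows, assuming each schema has already been flattened into a map from field path to deprecation flag. The FieldIndex type, the helper functions, and the field paths used with them are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Flattened view of a schema: field path -> whether it is marked deprecated.
pub type FieldIndex = BTreeMap<String, bool>;

/// Build an index from `(path, deprecated)` pairs.
pub fn index(fields: &[(&str, bool)]) -> FieldIndex {
    fields
        .iter()
        .map(|(path, deprecated)| (path.to_string(), *deprecated))
        .collect()
}

/// Return the fields that exist in `previous` but not in `current` and that
/// were not marked deprecated before being removed. Results are sorted by
/// path because `BTreeMap` iterates in key order.
pub fn breaking_removals(previous: &FieldIndex, current: &FieldIndex) -> Vec<String> {
    previous
        .iter()
        .filter(|(path, deprecated)| !**deprecated && !current.contains_key(*path))
        .map(|(path, _)| path.clone())
        .collect()
}
```

A release check would fail whenever breaking_removals returns a non-empty list; removing a field that the previous release already marked as deprecated does not count as a violation.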