Vector's configuration, while driven by code, is hard to document, as well as programmatically describe and validate outside of Vector itself. This RFC proposes an enhancement to how we define configuration types so that Vector can emit an authoritative schema that can be used to drive configuration editing and validation, in both interactive and programmatic contexts.
- Document options naming and remediate #3714
- Formalize how options are deprecated #4023
- Versioned Vector configuration #6231
- Fix broken handling of adaptive_concurrency defaults #8189
- Fuzz/validate Vector's configuration options #8229
- ARC was not enabled by default #9727
- Avoid enum deserialization for configuration #10582
- This work will eventually help power the UI for Datadog's Observability Pipelines, in terms of validating user-supplied configurations. We want to be cognizant of areas where we can provide more flexibility/generalization to allow that work to extract whatever information it may require.
- Powering the new website's reference documentation.
- There are many potentially off-label usages of the configuration schema -- configuration migration, generative testing, and so on -- where we'll be walking a fine line between encoding our configuration schema as an API contract and encoding it as a representation of implementation details.
- Encoding the schema of Vector's configuration.
- Modifications to how configuration types are created and used to provide invariants that make generating such a schema easier.
- Providing the configuration schema as part of a Vector release.
- Providing additional tooling, or enhancements to existing tooling, to use the schema to help better validate configurations.
- Generating Cue documentation from the configuration schema.
- Generating configuration migrations between versions of Vector.
Vector's configuration represents a fairly dense entrypoint into the vast flexibility provided by Vector: sources, transforms, sinks, tests, each with their own varying levels of complexity.
While Vector's documentation is generally fairly high-quality, users struggle to efficiently and correctly configure Vector for a variety of different reasons:
- lack of available examples
- incorrectly documented fields, or lack of documentation of fields
- confusion over field types and supported values (e.g. a field that can be a string or object)
- human-unfriendly rendering of default values (e.g. `1e+07` instead of `10000000`, even when we know the unit is bytes, and could instead display `10MB`)
Additionally, some of these pains are also felt during development. While users may struggle to understand an issue caused by following incorrect documentation, developers also struggle to correctly update the documentation itself. Vector uses Cue, a data definition language that can define data and its schema together, to codify our configuration in a somewhat programmatic and strict way. While Cue itself is a powerful data constraint language, it can take time to master and can be fairly inscrutable when errors are encountered.
Overlapping with some of the user pains, it is easy during development to miss, or simply forget, that the documentation needs to be updated, which in turn leads to documentation that drifts out of sync until someone notices.
In order to address these concerns, we would update Vector to generate a configuration schema to provide a single source of truth in terms of how a particular version of Vector could be configured.
This schema would be based on JSONSchema, a specification for defining and validating JSON objects, as we can trivially convert any supported encoding for Vector configurations into JSON. JSONSchema is the most comprehensive specification for schema definition and validation, and it benefits from the ubiquity of JSON itself, in which JSONSchema documents are written.
As a developer, the primary impact of this would be in terms of needing to utilize new helpers, and patterns, that become required for documenting their configuration types. This would represent a net-new change to creating and updating components, although as part of developing this feature, all existing components would be bootstrapped in terms of being migrated over.
In turn, developers could equate this to something like a new CI check or lint being added that told them when their code did not comply, and what changes needed to be made in order to do so.
Overall, the cognitive overhead of this proposal would be low, as we would rely on the compiler, and CI, as much as possible in order to surface errors or non-compliance and explain exactly what needed to be changed in order to fix the errors.
This trait, and derive macro, would form the basis of validating the compliance of a configuration as well as walking a configuration to generate its schema. There are two primary requirements for generating the configuration schema: discoverability and compliance.
Discoverability, or the actual logic of inspecting the configuration types, encoding information about their fields, allowable values, and so on, is the most obvious requirement. We need to be able to find these types, know they can be inspected, and then actually do the work. Compliance is perhaps even more important: unless all configuration types are able to be inspected, then the configuration schema can never be used to correctly validate a user-supplied configuration.
The `Configurable` trait would form the basis for discoverability. It would provide a minimalistic interface that walked the type, and walked its fields, mapping closely to the traditional "Visitor" pattern. The trait would allow exposing common items such as name, description, allowable type, units, and so on. Additionally, it would allow for defining custom metadata, or extensions, that could be parsed by external code to satisfy more advanced workflows, e.g. configuration migration, testing, etc.
The `Configurable` derive macro would form the basis for compliance. While it would generally provide the scaffolding to generate an implementation of the `Configurable` trait -- walking each field, gathering attributes and doc comments and so on -- it would also be able to validate that those things exist at compile time. As an example, we can enforce that all `Configurable` implementors are fully documented: the type itself, their subfields, and so on. It would then become extremely hard for developers to add new fields or types in configuration that weren't documented upon cutting a new Vector release.
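The compile-time half of this idea can be illustrated without a procedural macro. The sketch below is not the proposed `Configurable` derive macro. It is a hypothetical declarative macro, `described_struct!`, that only accepts a struct whose type and fields all carry a doc comment, and that exposes those comments at run time. The names `described_struct!`, `type_description`, and `field_descriptions` are assumptions made purely for illustration.

```rust
// Hypothetical stand-in for the proposed derive macro. Doc comments reach a
// macro as `#[doc = "..."]` attributes, so requiring one per item in the
// pattern makes an undocumented type or field a compile error.
macro_rules! described_struct {
    (
        #[doc = $type_doc:literal]
        struct $name:ident {
            $(
                #[doc = $field_doc:literal]
                $field:ident : $ty:ty
            ),+ $(,)?
        }
    ) => {
        #[doc = $type_doc]
        #[allow(dead_code)]
        struct $name {
            $(
                #[doc = $field_doc]
                $field: $ty,
            )+
        }

        #[allow(dead_code)]
        impl $name {
            /// Returns the type-level doc comment.
            fn type_description() -> &'static str {
                $type_doc.trim()
            }

            /// Returns `(field name, doc comment)` pairs in declaration order.
            fn field_descriptions() -> Vec<(&'static str, &'static str)> {
                vec![$((stringify!($field), $field_doc.trim())),+]
            }
        }
    };
}

described_struct! {
    /// Configuration for the sink.
    struct SinkConfig {
        /// The endpoint to send requests to.
        endpoint: String,
        /// Maximum number of events per batch.
        max_events: Option<u32>,
    }
}
```

Removing any of the doc comments in the invocation makes the macro pattern fail to match, so the omission surfaces as a build failure rather than as missing documentation.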
As discussed in point 1, the `Configurable` trait is meant to provide a common interface for configuration types, and the types used within those types, such as types from the standard library or third-party crates, such that they can describe their "shape", value constraints, and any other relevant metadata. Below is an abbreviated version of the `Configurable` trait, along with supporting types:
/// The shape of the field.
///
/// This maps closely to the concept of JSON's data types, where types are generalized and have
/// generalized representations. This allows us to provide general-but-relevant mappings to core
/// types, such as integers and strings and so on, while providing escape hatches for customized
/// types that may be encoded and decoded via "normal" types but otherwise have specific rules or
/// requirements.
///
/// Additionally, the shape of a field can encode some basic properties about the field to which it
/// is attached. For example, numbers can be bounded on the lower or upper end, while strings
/// could define a minimum length, or even an allowed pattern via regular expressions.
///
/// In this way, they describe a more complete shape of the field than simply the data type alone.
#[derive(Clone)]
pub enum Shape {
Null,
Boolean,
String(StringShape),
Number(NumberShape),
Array(ArrayShape),
Map(MapShape),
Composite(Vec<Shape>),
}
#[derive(Clone, Default)]
pub struct StringShape {
minimum_length: Option<usize>,
maximum_length: Option<usize>,
allowed_pattern: Option<&'static str>,
}
#[derive(Clone)]
pub enum NumberShape {
Unsigned {
effective_lower_bound: u128,
effective_upper_bound: u128,
},
Signed {
effective_lower_bound: i128,
effective_upper_bound: i128,
},
FloatingPoint {
effective_lower_bound: f64,
effective_upper_bound: f64,
}
}
#[derive(Clone)]
pub struct ArrayShape {
element_shape: Box<Shape>,
minimum_length: Option<usize>,
maximum_length: Option<usize>,
}
#[derive(Clone)]
pub struct MapShape {
required_fields: HashMap<&'static str, Shape>,
allowed_unknown_field_shape: Option<Shape>,
}
pub struct Field {
name: &'static str,
description: &'static str,
shape: Shape,
fields: Vec<Field>,
metadata: Metadata<Value>,
}
#[derive(Clone, Default)]
pub struct Metadata<T: Serialize> {
default: Option<T>,
attributes: Vec<(String, String)>,
}
pub trait Configurable<'de>: Serialize + Deserialize<'de> + Sized
where
Self: Clone,
{
/// Gets the human-readable description of this value, if any.
///
/// For standard types, this will be `None`. Commonly, custom types would implement this
/// directly, while fields using standard types would provide a field-specific description that
/// would be used instead of the default description.
fn description() -> Option<&'static str>;
/// Gets the shape of this value.
fn shape() -> Shape;
/// Gets the metadata for this value.
fn metadata() -> Metadata<Self>;
/// The fields for this value, if any.
fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>>;
}
The `Configurable` trait defines some very basic core functionality: the description of this type (if applicable), the "shape" of the type, any metadata associated with it, and the fields it exposes. It also enforces (de)serialization capabilities on the type as this represents a base level of functionality required by types that will be utilized in a Vector configuration.
Description and shape are required because they are both inherent and inextricable qualities of anything that we expose as a configurable option. Metadata and fields are optional as not every type will have metadata, and not every type actually has fields. For example, any scalar value -- string, number, bool, etc -- is a singular unit, and the same with arrays. Anything that looks like an "object", however, must have fields, as that is an inherent characteristic of an "object".
At the top level, there must always be a type that is `Configurable` which maps to the Vector configuration itself, and then fields within it. From this point on, we'll relate characteristics of the `Configurable` trait in the context of the types that implement it being fields.
`Shape` represents the inherent type of a field, as well as any additional constraints on that type. This is where we start to see the mappings from Rust types to their serialized representation, and in general, the `Shape` variants map closely to the various JSON types. We've added some general constraints here -- lower/upper bounds on numbers, min/max length and acceptable regex pattern for strings, expected element shape for arrays, expected fields for maps, etc -- but this is merely for fleshing out the concept. We could extend this as needed but generally we would strive to only encode intrinsic properties of these types within `Shape`, depending on metadata for more custom/situational constraints.
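To make the mapping from Rust types to shapes more concrete, here is a minimal sketch using a cut-down `Shape` enum. The `HasShape` trait and the `accepts_unsigned` helper are hypothetical names introduced for illustration, and the enum is deliberately smaller than the one proposed above. It assumes an `Option<T>` would be represented as a composite of `Null` and the inner type's shape.

```rust
// Cut-down, illustrative version of `Shape`; not the proposed definition.
#[derive(Clone, Debug, PartialEq)]
enum Shape {
    Null,
    String,
    Unsigned { lower: u128, upper: u128 },
    Composite(Vec<Shape>),
}

// Hypothetical trait standing in for `Configurable::shape`.
trait HasShape {
    fn shape() -> Shape;
}

impl HasShape for u32 {
    fn shape() -> Shape {
        // The intrinsic bounds of the type become the effective bounds.
        Shape::Unsigned {
            lower: u32::MIN as u128,
            upper: u32::MAX as u128,
        }
    }
}

impl HasShape for String {
    fn shape() -> Shape {
        Shape::String
    }
}

impl<T: HasShape> HasShape for Option<T> {
    fn shape() -> Shape {
        // An optional value is either null or whatever the inner type allows.
        Shape::Composite(vec![Shape::Null, T::shape()])
    }
}

/// Checks an optional unsigned value against a shape. `None` models a null.
fn accepts_unsigned(shape: &Shape, value: Option<u128>) -> bool {
    match (shape, value) {
        (Shape::Null, None) => true,
        (Shape::Unsigned { lower, upper }, Some(v)) => *lower <= v && v <= *upper,
        (Shape::Composite(shapes), _) => shapes.iter().any(|s| accepts_unsigned(s, value)),
        _ => false,
    }
}
```

Situational limits, such as a per-sink maximum, are intentionally absent here: only the bounds inherent to `u32` are encoded in the shape.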
Following on from `Shape`, we have the ability to define metadata about fields. One major thing that we utilize metadata for is to provide default values for a given type/field. This keeps default handling out of `Shape`, where it would make things messier. Another thing it allows us to do is use a generically-typed struct to capture real Rust values, and then eventually serialize them down to a generic representation that can flow into the schema. Additionally, and perhaps most obviously, metadata can also be used for generic key/value data about the given type.
Finally, we come to fields. As mentioned above, fields are the realization, essentially, of the sum of `Configurable` types that can represent a Vector configuration. They are a coalesced version of all the data provided by `Configurable` and are ultimately the data that gets used to drive schema generation. One point here is that this is the interface where typed metadata will be serialized such that `Field` has all the data necessary to generate a schema: name, description, shape, default value, custom metadata, subfields, etc.
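As a rough illustration of how a consumer might walk this coalesced data, the sketch below flattens a tree of fields into dotted paths with their descriptions. The `Field` struct here is a simplified stand-in carrying only a name, description, and subfields, and `collect_descriptions` and `example_sink_field` are hypothetical helpers, not part of the proposal.

```rust
// Simplified stand-in for the proposed `Field`; shape and metadata omitted.
struct Field {
    name: &'static str,
    description: &'static str,
    fields: Vec<Field>,
}

/// Depth-first walk that records `(dotted path, description)` for each field.
fn collect_descriptions(field: &Field, prefix: &str, out: &mut Vec<(String, &'static str)>) {
    let path = if prefix.is_empty() {
        field.name.to_string()
    } else {
        format!("{}.{}", prefix, field.name)
    };
    out.push((path.clone(), field.description));
    for child in &field.fields {
        collect_descriptions(child, &path, out);
    }
}

/// Builds a small tree resembling the sink example used in this document.
fn example_sink_field() -> Field {
    Field {
        name: "sink",
        description: "Configuration for the sink.",
        fields: vec![
            Field {
                name: "endpoint",
                description: "The endpoint to send requests to.",
                fields: vec![],
            },
            Field {
                name: "batch",
                description: "Controls batching behavior.",
                fields: vec![Field {
                    name: "max_events",
                    description: "Maximum number of events per batch.",
                    fields: vec![],
                }],
            },
        ],
    }
}
```

A schema generator would do the same kind of traversal, emitting a schema node per field instead of a path string.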
Below is an example of a very simple sink configuration which supports batching and uses the ubiquitous `BatchConfig` type:
#[derive(Serialize, Deserialize, Clone)]
struct SinkConfig {
endpoint: String,
batch: BatchConfig,
}
impl<'de> Configurable<'de> for SinkConfig {
fn description() -> Option<&'static str> {
Some("Configuration for the sink.")
}
fn shape() -> Shape {
let mut required_fields = HashMap::new();
required_fields.insert("endpoint", <String as Configurable>::shape());
required_fields.insert("batch", <BatchConfig as Configurable>::shape());
Shape::Map(MapShape {
required_fields,
allowed_unknown_field_shape: None,
})
}
fn metadata() -> Metadata<Self> {
Metadata {
default: Some(SinkConfig {
endpoint: String::from("foo"),
batch: BatchConfig::default(),
}),
..Default::default()
}
}
fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
let shape = Self::shape();
let mut required_field_shapes = match shape {
Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
_ => unreachable!("SinkConfig is a fixed-field object and cannot be another shape"),
};
let base_metadata = <Self as Configurable>::metadata();
let merged_metadata = merge_metadata_overrides(base_metadata, overrides);
let endpoint_shape = required_field_shapes.remove("endpoint").expect("shape for `endpoint` must exist");
let endpoint_override_metadata = merged_metadata.clone()
.map_default_value(|default| default.endpoint.clone());
let batch_shape = required_field_shapes.remove("batch").expect("shape for `batch` must exist");
let batch_override_metadata = merged_metadata.clone()
.map_default_value(|default| default.batch.clone());
let mut fields = HashMap::new();
fields.insert("endpoint", Field::new::<String>(
"endpoint",
"The endpoint to send requests to.",
endpoint_shape,
endpoint_override_metadata,
));
fields.insert("batch", Field::new::<BatchConfig>(
"batch",
<BatchConfig as Configurable>::description().expect("`BatchConfig` has no defined description, and an override description was not provided."),
batch_shape,
batch_override_metadata,
));
Some(fields)
}
}
#[derive(Serialize, Deserialize, Default, Clone)]
struct BatchConfig {
max_events: Option<u32>,
max_bytes: Option<u32>,
max_timeout: Option<Duration>,
}
impl<'de> Configurable<'de> for BatchConfig {
fn description() -> Option<&'static str> {
Some("Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.")
}
fn shape() -> Shape {
let mut required_fields = HashMap::new();
required_fields.insert("max_events", <Option<u32> as Configurable>::shape());
required_fields.insert("max_bytes", <Option<u32> as Configurable>::shape());
required_fields.insert("max_timeout", <Option<Duration> as Configurable>::shape());
Shape::Map(MapShape {
required_fields,
allowed_unknown_field_shape: None,
})
}
fn metadata() -> Metadata<Self> {
Metadata {
default: Some(BatchConfig {
max_events: Some(1000),
max_bytes: Some(1048576),
max_timeout: Some(Duration::from_secs(60)),
}),
..Default::default()
}
}
fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
let shape = Self::shape();
let mut required_field_shapes = match shape {
Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
_ => unreachable!("BatchConfig is a fixed-field object and cannot be another shape"),
};
let base_metadata = <Self as Configurable>::metadata();
let merged_metadata = merge_metadata_overrides(base_metadata, overrides);
let max_events_shape = required_field_shapes.remove("max_events").expect("shape for `max_events` must exist");
let max_events_override_metadata = merged_metadata.clone()
.map_default_value(|default| default.max_events);
let max_bytes_shape = required_field_shapes.remove("max_bytes").expect("shape for `max_bytes` must exist");
let max_bytes_override_metadata = merged_metadata.clone()
.map_default_value(|default| default.max_bytes);
let max_timeout_shape = required_field_shapes.remove("max_timeout").expect("shape for `max_timeout` must exist");
let max_timeout_override_metadata = merged_metadata.clone()
.map_default_value(|default| default.max_timeout);
let mut fields = HashMap::new();
fields.insert("max_events", Field::new::<Option<u32>>(
"max_events",
"Maximum number of events per batch.",
max_events_shape,
max_events_override_metadata,
));
fields.insert("max_bytes", Field::new::<Option<u32>>(
"max_bytes",
"Maximum number of bytes per batch.",
max_bytes_shape,
max_bytes_override_metadata,
));
fields.insert("max_timeout", Field::new::<Option<Duration>>(
"max_timeout",
"Maximum period of time a batch can exist before being forcibly flushed.",
max_timeout_shape,
max_timeout_override_metadata,
));
Some(fields)
}
}
This code represents the expected boilerplate that would be generated by a `Configurable` derive macro, so it may appear verbose but such verbosity would be entirely hidden from developers unless they needed to manually implement `Configurable` for a remote type in a third-party crate.
Immediately, we can observe a few things about the trait's usage in practice. The design of `Configurable::metadata` and `Configurable::fields` allows us to define metadata such that it is automatically propagated downward as far as we wish. In the above example, we use this specifically for the ability to define a default `BatchConfig` value at the `SinkConfig` level, while being able to pass the value of each individual field down. This means that, while we may be using a global definition of the shape of `BatchConfig`, we can define an override of the default value for it at the point of usage.
Additionally, as the metadata is typed, the derive macro can generate rich code that utilizes the raw types involved rather than having to deal with downcasted/generalized versions. We utilize this in the `map_default_value` calls above to grab the value of a specific field from a value of `Self`, allowing us to continue generating and providing typed metadata as we render each field, and that field's fields, and so on.
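The generated code above calls `merge_metadata_overrides` and `map_default_value` without defining them. The sketch below shows one way they could behave, assuming overrides take precedence over base values and that attributes are merged by key. The signatures, the merge policy, and the trimmed-down `Metadata` and `BatchConfig` types are all assumptions for illustration rather than the proposed implementation.

```rust
// Trimmed-down stand-ins for the types used in this document.
#[derive(Clone, Debug, PartialEq)]
struct Metadata<T> {
    default: Option<T>,
    attributes: Vec<(String, String)>,
}

#[derive(Clone, Debug, PartialEq)]
struct BatchConfig {
    max_events: Option<u32>,
    max_bytes: Option<u32>,
}

impl<T> Metadata<T> {
    /// Projects the typed default value (if any) down to one of its fields,
    /// carrying the attributes along unchanged.
    fn map_default_value<U>(self, f: impl FnOnce(&T) -> U) -> Metadata<U> {
        Metadata {
            default: self.default.as_ref().map(f),
            attributes: self.attributes,
        }
    }
}

/// Assumed merge policy: an override default replaces the base default, and
/// override attributes replace base attributes that share the same key.
fn merge_metadata_overrides<T>(base: Metadata<T>, overrides: Metadata<T>) -> Metadata<T> {
    let mut attributes = base.attributes;
    for (key, value) in overrides.attributes {
        if let Some(pos) = attributes.iter().position(|(k, _)| *k == key) {
            attributes[pos].1 = value;
        } else {
            attributes.push((key, value));
        }
    }
    Metadata {
        default: overrides.default.or(base.default),
        attributes,
    }
}
```

With this policy, a sink that supplies its own default `BatchConfig` sees that value flow into each subfield, while a sink that supplies none falls back to the type-level default.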
Additionally, you can see the generated code around things like descriptions, where we can layer on additional checks to ensure required fields are present, giving us the ability to add run-time checks on top of compile-time checks, further extending our goal of misuse resistance.
3. `Configurable` derive macro as a vehicle for easily defining high-level constraints on configuration types
Following from the code examples in point 2, we'll explore what the user-defined types would look like when using the proposed `Configurable` derive macro. Some of the fields and other metadata may differ from the above example as it would have become too verbose to display above. All of that said, let's take a look:
/// Configuration for the sink.
#[derive(Clone)]
#[configurable_component]
#[configurable(metadata("status", "beta"))]
struct SinkConfig {
/// The endpoint to send requests to.
#[configurable(format(uri), deprecated("url"))]
#[serde(alias = "url")]
endpoint: String,
#[serde(default = "default_batch_config_for_sink")]
#[configurable(subfield(max_events, range(max = 1000)))]
batch: BatchConfig,
}
/// Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.
#[derive(Default, Clone)]
#[configurable_component]
struct BatchConfig {
/// Maximum number of events per batch.
#[configurable(non_zero)]
max_events: Option<u32>,
#[configurable(non_zero)]
/// Maximum number of bytes per batch.
max_bytes: Option<u32>,
#[configurable(non_zero)]
/// Maximum period of time a batch can exist before being forcibly flushed.
max_timeout: Option<Duration>,
}
At a high level, the `Configurable` derive macro primarily deals with generating the boilerplate `Configurable` implementation for the given type, but goes further with actually being able to introspect existing attributes as well as allowing further constraints to be applied. Additionally, you can see a distinct attribute macro here -- `configurable_component` -- which we use to apply the derive attribute for us. You can think of `#[configurable_component]` as being a string replacement marker for `#[derive(Serialize, Deserialize, Configurable)]`. This lets us enforce that `Serialize` and `Deserialize` are derived, along with `Configurable`, as they all must be present for any configuration type that we want to include within the schema.
Above, we can see the two previously shown types, with doc comments, typical derives for (de)serialization, and so on. The `Configurable` derive macro can trivially consume information such as the doc comments to provide a description for types and fields. The real power, as mentioned above, is when we get into using the `configurable` attribute on fields, and how it uses existing attributes.
We can see a number of usages of `#[configurable(...)]`, which represent specific attributes supported by the `Configurable` derive macro. In this example, we're doing a few different things:
- defining a JSONSchema format of `uri` on `SinkConfig::endpoint`, which will let consumers of the schema know to validate this field according to the `uri` format defined in the JSONSchema specification
- defining the alias of "url" on `SinkConfig::endpoint` as deprecated, which gets added as custom metadata
- defining a subfield constraint on `SinkConfig::batch` where we apply a range override, specifically setting the maximum value for `max_events` to 1000
- setting all fields on `BatchConfig` to have a non-zero constraint, ensuring that none of the values can ever be passed in as zero
In and of themselves, these are powerful constraints to be able to apply inline with the definition of the configuration types themselves, and then can be exposed via the schema using either native JSONSchema support or custom metadata if they weren't natively supported yet.
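One way to picture how the `non_zero` constraint and the `range(max = 1000)` subfield override compose is as successive narrowing of a numeric range. The sketch below assumes that an override can only tighten, never widen, the bounds implied by the type. `Bounds` and its methods are hypothetical names, not part of the proposed macro.

```rust
// Hypothetical inclusive range used to model layered numeric constraints.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: u64,
    max: u64,
}

impl Bounds {
    /// Intrinsic bounds of a `u32` field.
    fn for_u32() -> Self {
        Bounds { min: 0, max: u32::MAX as u64 }
    }

    /// Models `#[configurable(non_zero)]`: raise the lower bound to at least 1.
    fn non_zero(self) -> Self {
        Bounds { min: self.min.max(1), ..self }
    }

    /// Models `range(max = ...)`: lower the upper bound, never raise it.
    fn with_max(self, max: u64) -> Self {
        Bounds { max: self.max.min(max), ..self }
    }

    /// Returns true when `value` satisfies every constraint applied so far.
    fn allows(&self, value: u64) -> bool {
        self.min <= value && value <= self.max
    }
}
```

For `SinkConfig::batch.max_events`, applying the type-level constraint and then the point-of-use override would give `Bounds::for_u32().non_zero().with_max(1000)`, that is, the inclusive range from 1 to 1000.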
As well, we can interrogate the other attributes present on the fields, including existing serde field attributes. On the `batch` field in `SinkConfig`, a default batch configuration has been defined using the typical serde field attribute, `default`, which can either be used bare, falling back to `Default::default()`, or be given the path to a function that generates the value. We too can see this attribute when our derive macro runs, and we can utilize it to generate our own default value. This is generically applicable to whatever attributes we want to be able to interrogate, and so this provides an extremely powerful primitive to be able to take advantage of existing code, as well as support new attributes from other crates that get utilized in the future.
Additionally, the ability to specify custom metadata using the `configurable` attribute is a powerful escape hatch when we need to encode behaviours, or inline relevant data, to configuration types and fields that aren't related to the schema of a Vector configuration itself... or aren't possible to encode in JSONSchema. For example, this could be used for something simple, like in the above example, where we're defining the status of the sink implementation as beta. This might be used to drive the generated content of the vector.dev website and documentation.
Some constraints might be harder to express in JSONSchema, however, such as whether or not a sink supports acknowledgements. Whether or not a sink supports acknowledgements isn't terribly relevant to the Vector configuration itself, beyond validating whether or not any fields which toggle it on or off have been set correctly, and so on. There is semantic relevance, however: knowing that a sink does or doesn't support acknowledgements could allow a validator to surface issues to the user. For example, if they have acknowledgements enabled on a sink but their source does not support acknowledgements, then end-to-end acknowledgements would not be able to actually function. While the configuration loading code can also detect this, being able to provide these semantic definitions within the schema itself allows us to more generically encode these types of behaviors and allow external tools, which don't have the benefit of running Vector directly, to correctly suss out incompatibilities and misconfigurations.
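A sketch of what such an external check could look like is shown below. It assumes the tool has already extracted, from schema metadata and the user's configuration, whether each source supports acknowledgements and which sinks enable them. For simplicity it also assumes sinks list sources directly as inputs, ignoring any transforms in between. The `Source` and `Sink` structs and `find_acknowledgement_mismatches` are hypothetical.

```rust
// Hypothetical, pre-extracted view of a configuration plus schema metadata.
struct Source {
    name: &'static str,
    supports_acknowledgements: bool,
}

struct Sink {
    name: &'static str,
    acknowledgements_enabled: bool,
    inputs: Vec<&'static str>,
}

/// Reports every sink that enables acknowledgements while reading from a
/// source whose metadata says acknowledgements are unsupported.
fn find_acknowledgement_mismatches(sources: &[Source], sinks: &[Sink]) -> Vec<String> {
    let mut problems = Vec::new();
    for sink in sinks.iter().filter(|s| s.acknowledgements_enabled) {
        for input in &sink.inputs {
            // Inputs that do not name a known source are skipped here.
            if let Some(source) = sources.iter().find(|s| s.name == *input) {
                if !source.supports_acknowledgements {
                    problems.push(format!(
                        "sink `{}` enables acknowledgements, but source `{}` does not support them",
                        sink.name, source.name
                    ));
                }
            }
        }
    }
    problems
}
```

Nothing in this check requires running Vector. It only needs the semantic metadata that the schema would carry.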
Further, and following the example code itself, this can allow us to enforce constraints that are only partially able to be represented in JSONSchema, such as aliased fields. While JSONSchema already supports the ability to define a schema such that a field can be represented by multiple names, it has no concept of a deprecated field. Utilizing custom metadata, we can encode metadata that indicates which field name variant is the deprecated one. This can be used not only to drive behavior in the generated documentation, but in other tools as well, such as automatically transforming one version of a configuration to a newer version by analyzing the schemas, and being able to reason that if field X used to be able to be referenced via A or B, and now only B is allowed, we can just rename A to B. We could also check that schemas don't remove fields unless those fields were already marked as deprecated, being able to enforce our unofficial guideline of not removing fields unless they've been marked as deprecated for at least one release.
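The rename-based migration described above can be sketched as follows, assuming a tool has already extracted `(deprecated name, current name)` pairs from the schema's custom metadata and flattened one component's configuration into a string map. The function name and the policy of leaving the configuration untouched when both names are present are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// Renames deprecated keys to their current names and returns the deprecated
/// names that were rewritten. If the current name is already present, the
/// deprecated key is left alone so the conflict can be reported elsewhere.
fn migrate_deprecated_keys(
    config: &mut BTreeMap<String, String>,
    renames: &[(&str, &str)],
) -> Vec<String> {
    let mut migrated = Vec::new();
    for (old, new) in renames {
        if config.contains_key(*new) {
            continue;
        }
        if let Some(value) = config.remove(*old) {
            config.insert((*new).to_string(), value);
            migrated.push((*old).to_string());
        }
    }
    migrated
}
```

For the sink example, passing `&[("url", "endpoint")]` would rewrite a configuration that still uses `url` so that it uses `endpoint` instead.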
With the necessary information being derived from our configuration types directly, we still need a way to take that and emit an actual JSONSchema document for Vector's configuration. We would utilize `schemars`, a crate for generating JSONSchema documents, for this purpose.
The `schemars` crate, among other things, provides a small set of traits and types related specifically to programmatically generating a JSONSchema document. Types which a user wishes to document must implement `JsonSchema`, which interacts with a `SchemaGenerator` object that actually holds the in-progress schema as a type is being walked.
Our `Configurable` derive macro would implement this trait automatically for the given type, using the information provided by `Configurable` to ultimately drive the schema generation. While the boilerplate to read the data given by `Configurable` and use it to feed the schema generator is almost entirely mechanical and boring, we can take a look at how the above code example from point 3 might look when actually turned into a JSONSchema document:
{
"$schema": "https://json-schema.org/draft/2020-12/schema#",
"title": "SinkConfig",
"type": "object",
"oneOf": [
{ "required": ["endpoint"] },
{ "required": ["url"] }
],
"properties": {
"endpoint": {
"description": "The endpoint to send requests to.",
"type": "string",
"format": "uri"
},
"url": {
"description": "The endpoint to send requests to.",
"type": "string",
"format": "uri",
"deprecated": true
},
"batch": {
"allOf": [{ "$ref": "#/definitions/BatchSettings" }],
"properties": {
"max_events": {
"type": [
"null",
"number"
],
"maximum": 1000
}
},
"default": {
"max_bytes": 1048576,
"max_events": 1000,
"max_timeout": 60
}
}
},
"_metadata": {
"status": "beta"
},
"definitions": {
"BatchSettings": {
"description": "Controls batching behavior.",
"type": "object",
"properties": {
"max_bytes": {
"description": "Maximum number of bytes per batch.",
"type": [
"null",
"integer"
],
"minimum": 1,
"maximum": 4294967295
},
"max_events": {
"description": "Maximum number of events per batch.",
"type": [
"null",
"integer"
],
"minimum": 1,
"maximum": 4294967295
},
"max_timeout": {
"description": "Maximum period of time a batch can exist before being forcibly flushed.",
"anyOf": [
{ "type": "null" },
{ "$ref": "#/definitions/duration" }
],
"minimum": 1
}
}
},
"duration": {
"type": "number",
"minimum": 0,
"maximum": 9007199254740991
}
}
}
While this is long and verbose, users almost exclusively interact with JSONSchema documents by using a library that validates JSON documents using the schema. Briefly, though, you can observe the following things:
- we can see that for the `SinkConfig::endpoint` field, we've enumerated its alias, `url`, along with the fact that it's deprecated
- throughout the various fields, we can set field constraints such as the minimum or maximum value for integers
- we've utilized a common definition of the `BatchConfig` type, but we can also combine its definition, via `allOf`, with a separate definition that enforces a maximum size of 1000 events per batch
- basic information such as field and type descriptions is present, enriching the schema
- we have support for custom metadata, which is letting us include the sink's beta status in the schema for rich, semantic descriptions
- while it does not affect validation at all, we can also include other semantic information, such as the default value for a field, even when that field has a shared definition, like `BatchConfig`
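The `oneOf` block in the schema above is what expresses the alias: exactly one of `endpoint` or `url` must be present. In practice a JSONSchema validation library would enforce this, but the rule itself is small enough to sketch by hand. The function below is a hypothetical illustration of that single rule, operating on the list of keys found in a sink's configuration.

```rust
/// Mirrors the `oneOf` rule from the example schema: exactly one of
/// `endpoint` or `url` must be present. Returns the name that was used.
fn validate_endpoint_keys(keys: &[&str]) -> Result<&'static str, String> {
    let has_endpoint = keys.contains(&"endpoint");
    let has_url = keys.contains(&"url");
    match (has_endpoint, has_url) {
        (true, false) => Ok("endpoint"),
        // Valid, but a tool could use the `deprecated` flag to warn here.
        (false, true) => Ok("url"),
        (true, true) => Err("both `endpoint` and its alias `url` are set".to_string()),
        (false, false) => Err("one of `endpoint` or `url` is required".to_string()),
    }
}
```

A configuration that sets both names fails, just as it would fail `oneOf` because both subschemas match.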
In general, the lack of a configuration schema, and of the ability to treat our Rust code as the source of truth, hurts both users and developers. As such, doing this work represents a huge opportunity to reduce developer toil when it comes to generating and synchronizing our generated documentation. It additionally represents a large quality of life improvement for our users, who look to our documentation to be up-to-date and semantically meaningful.
If we didn't do this work, we would be stuck with the current manual toil, which is not only extra work on the part of developers -- learning Cue, remembering to update it, etc -- but also reviewers, who now need to catch when these things are missed.
Utilizing an extensible system for expressing the schema of Vector's configuration gives us the runway to solve our current problems, but also the ability to handle future problems and requirements as they come up.
The primary drawback of this approach is that it limits us to things we can reasonably express within the limitations of the Rust language itself, and what we're comfortable with representing via attributes. For example, single-line and multi-line doc comments are trivially supported, but if we wanted to start pushing more semantically-relevant information, such as configuration snippets, etc, it could be technically possible but appear as a very ugly attribute usage, which could make the source muddy and unclear.
Additionally, while derive macros are written in Rust code, which can be much easier to grok than declarative macros, it represents a section of code that may be harder to understand and modify than if we used a more brute force approach.
Notably, and perhaps most obviously, Cue itself can be used for data validation purposes. There are also other projects that slot themselves into the same general domain, such as OPA/Rego and CDDL.
The primary problem with all of these solutions is that they're all custom languages, with far less cross-platform support, and a far steeper learning curve. Even though our approach generally concerns itself with hiding the schema tool itself, and focusing on making it trivial to generate the schema from annotated Rust types/fields, eventually the rubber must meet the road, and this is where these other tools would fall down for our case and become very hard to wield correctly.
While `schemars` itself has support for annotating Rust types in almost the exact same way as we've proposed above, it lacks a few features necessary for our use case:
- no support for the serde `alias` attribute feature
- no support for defining generic metadata for a type/field and exposing it in the schema
- no mechanism to override constraints for a field which already defines its own constraints at the type level
The lack of these features makes it much harder to correctly generate our schema, as we would still be required to do the minimum amount of work to support defining custom attributes for the missing features, and the work to support at least one attribute is 90% of the work to support two, or three, or more, custom attributes. It would also mean that developers would need to figure out whether or not `schemars` provided a certain behavior via its attributes, rather than only needing to remember how to get to the documentation for our proposed internal approach.
Another alternative is the possibility of moving the configuration source-of-truth outside of Rust and using it in the opposite direction, to generate the Rust types themselves. This is technically possible, although fundamentally suboptimal for a few reasons:
- it still requires developers to become well-versed in whatever language is used to define the schema, which is already a pain point that we hope this work can be used to help solve in general
- it would require a potentially error-prone way of generating the types and then importing the code for use in the codebase
In general, the issue of integrating the code is the biggest reason to avoid such an approach. Developers already have issues with getting IDE language assist extensions/plugins to correctly provide type information and hinting for types that are in Rust code which is imported from the filesystem, such as the approach taken by prost for importing the Rust code it generates from Protocol Buffers definitions. Those issues can be easily kept at bay as-is because the rate of change to things like our Protocol Buffers definitions is low, but configuration types for components are both more prevalent overall and experience far more churn, which means whatever potential issues existed would statistically show up more often.
Even an approach where we more directly placed the code into the normal src hierarchy, to avoid needing to do prost-style code import, would still be at risk of causing friction during the development process:
- we would be forced into a specific directory/module hierarchy to ensure the tooling put the definitions in the correct location
- developers would need to run an external tool every time they changed the configuration schema in order to get their updated configuration types
- Can downstream consumers extract the necessary information purely from the specified JSONSchema fields and any additional custom metadata as string key/value pairs? Do they need a richer representation of custom metadata?
- Is there a better way, or a way at all, to handle overriding subfields at compile-time? (Procedural macros operate on the Rust AST, so for example, we cannot annotate a field that points to type B and reference a field that only exists on B, so we have to generate code such that it eventually fails at run-time.)
- How can we provide logical constraints between disparate components? For example, the end-to-end acknowledgements example is a very real scenario where there is no existing way to describe the relationship between sinks based on their acknowledgement support, because schema validation doesn't interpret a configuration like Vector does, where it's actually wired up in a graph vs simply parsed and validated for conformance.
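To make the last question concrete, the following is a minimal sketch of the kind of graph-aware check that a plain schema validator cannot express. It assumes, purely for illustration, that the rule is "every sink downstream of a source with acknowledgements enabled must itself support acknowledgements". The `Component` enum, its fields, and the function names are all hypothetical and are not Vector's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of a component once the configuration has
/// been parsed and wired together. Not Vector's real types.
#[derive(Debug)]
enum Component {
    Source { acknowledgements: bool },
    Transform { inputs: Vec<String> },
    Sink { inputs: Vec<String>, supports_acks: bool },
}

/// Returns true if `id` is a source with acknowledgements enabled, or if any
/// of its (transitive) inputs is such a source.
fn fed_by_acking_source(
    id: &str,
    topology: &HashMap<String, Component>,
    seen: &mut Vec<String>,
) -> bool {
    // Guard against cycles and repeated work on shared upstream nodes.
    if seen.iter().any(|s| s == id) {
        return false;
    }
    seen.push(id.to_string());
    match topology.get(id) {
        Some(Component::Source { acknowledgements }) => *acknowledgements,
        Some(Component::Transform { inputs }) | Some(Component::Sink { inputs, .. }) => inputs
            .iter()
            .any(|input| fed_by_acking_source(input, topology, seen)),
        None => false,
    }
}

/// Lists, in sorted order, the sinks that cannot acknowledge events but
/// receive them from an acknowledging source (the assumed rule, see above).
fn ack_violations(topology: &HashMap<String, Component>) -> Vec<String> {
    let mut bad: Vec<String> = topology
        .iter()
        .filter_map(|(id, component)| match component {
            Component::Sink { supports_acks: false, .. } => {
                let mut seen = Vec::new();
                if fed_by_acking_source(id, topology, &mut seen) {
                    Some(id.clone())
                } else {
                    None
                }
            }
            _ => None,
        })
        .collect();
    bad.sort();
    bad
}

/// Hypothetical topology: one acknowledging source fanned out to two sinks.
fn example_topology() -> HashMap<String, Component> {
    let mut t = HashMap::new();
    t.insert("in".to_string(), Component::Source { acknowledgements: true });
    t.insert("parse".to_string(), Component::Transform { inputs: vec!["in".to_string()] });
    t.insert(
        "durable".to_string(),
        Component::Sink { inputs: vec!["parse".to_string()], supports_acks: true },
    );
    t.insert(
        "console".to_string(),
        Component::Sink { inputs: vec!["parse".to_string()], supports_acks: false },
    );
    t
}

fn main() {
    let violations = ack_violations(&example_topology());
    println!("sinks without acknowledgement support: {:?}", violations);
}
```

The check needs the `inputs` edges to be resolved across components, which is exactly the information a per-component schema does not carry.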
Incremental steps to execute this change. These will be converted to issues after the RFC is approved:
- Develop the Configurable trait, and supporting types, and implement it by hand for Rust standard library types, and at least one of each major component type (source, transform, sink) to vet out the design and expose any corner cases.
- Develop an initial set of helper types/methods for generating a JSONSchema document from a root Configurable type, and implement the JSONSchema trait from schemars by hand for all types which have a manual Configurable implementation.
- Create a top-level root type that fully encompasses the concept of a Vector configuration, and manually implement the JSONSchema trait for it.
trait for it. - Create a Vector subcommand for running the schema generation code that prints it to the console.
- Develop an initial version of the Configurable derive macro, and configurable_component attribute macro, that can generate a boilerplate Configurable trait implementation without any support for attributes or attribute interrogation.
- Add macro support for also deriving the boilerplate JSONSchema trait implementation.
- Add macro support for interrogating serde type/field/variant attributes to generate metadata with a goal of being on par with what schemars supports, plus whatever they don't support that we need.
- Add macro support for configurable attributes, specifically in the vein of what schemars provides, in terms of adding field value constraints or custom metadata.
- Add a comprehensive internal README covering the usage of the new macro and macro helper attributes, similar in spirit to serde's own website.
- Replace hand-written implementations of Configurable/JSONSchema with the derive macro/attributes.
- Continue updating configuration types with the derive macro until all components are using it.
- Update the logic used to register a component (the inventory-based stuff) to enforce that the configuration types implement Configurable, thereby ensuring that all configuration types are adhering to the requirements enforced by the derive macro.
- Move all configuration validation to configurable attributes, as well as default values that get merged in after-the-fact. Currently, many defaults are merged in, and validations are performed, only once a sink is in the process of being built and the configuration has been deserialized. This means some validation happens during deserialization and some happens after, leading to slightly disjointed error messages.
- Generate a configuration schema whenever we cut a release, and store it in Git, followed by checking it against the last release's schema to check for incompatibilities: removal of fields that weren't already marked as deprecated, etc.
- Generate Cue definitions/documentation based on the JSONSchema itself. This is apparently a one-liner cue command but would also involve figuring out how to integrate the resulting Cue into our existing Cue documentation.
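The release-time compatibility check mentioned in the plan above could look roughly like the following. This is only a sketch under simplifying assumptions: each schema is assumed to have already been flattened into a map from field path to a "deprecated" flag, whereas a real check would walk the generated JSONSchema document. The `FlatSchema` alias, the helper names, and the field paths are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened view of a schema: field path -> "is deprecated".
type FlatSchema = BTreeMap<String, bool>;

/// Returns the field paths that exist in `previous`, are missing from
/// `current`, and were not marked as deprecated in `previous`.
/// The result is sorted because `BTreeMap` iterates in key order.
fn breaking_removals(previous: &FlatSchema, current: &FlatSchema) -> Vec<String> {
    previous
        .iter()
        .filter(|(path, deprecated)| !**deprecated && !current.contains_key(*path))
        .map(|(path, _)| path.clone())
        .collect()
}

/// Small helper for building a flattened schema from literal pairs.
fn schema(fields: &[(&str, bool)]) -> FlatSchema {
    fields
        .iter()
        .map(|(path, deprecated)| (path.to_string(), *deprecated))
        .collect()
}

fn main() {
    // Hypothetical field paths, for illustration only.
    let previous = schema(&[
        ("sinks.http.uri", false),
        ("sinks.http.batch_size", true), // deprecated in the last release
        ("sinks.http.encoding", false),
    ]);
    let current = schema(&[("sinks.http.uri", false)]);

    for path in breaking_removals(&previous, &current) {
        println!("removed without deprecation: {}", path);
    }
}
```

Removing a field that was already deprecated passes the check, while removing one that was never deprecated is reported, which matches the incompatibility described in the plan.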