Knative Runtime Contract

Abstract

The Knative serverless compute infrastructure extends the Open Container Initiative Runtime Specification to describe the functionality and requirements for serverless execution workloads. In contrast to general-purpose containers, stateless request-triggered (i.e. on-demand) autoscaled containers have the following properties:

Little or no long-term runtime state (especially in cases where code might be scaled to zero in the absence of request traffic).
Logging and monitoring aggregation (telemetry) is important for understanding and debugging the system, as containers might be created or deleted at any time in response to autoscaling.
Multitenancy is highly desirable to allow cost sharing for bursty applications on relatively stable underlying hardware resources.

This contract does not define the control surfaces over the runtime environment except by reference to the Knative Kubernetes resources. Similarly, this contract does not define the implementation of metrics or logging aggregation, except to provide a contract for the collection of logging data. It is expected that access to the aggregated telemetry will be provided by the platform operator.

Background

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.

The OCI specification (v1.0.1) is the basis for this document. When this document and the OCI specification conflict, this document is assumed to override the general OCI recommendations. Where this document does not specify behavior, runtime implementations SHOULD be OCI compliant with respect to those features. Additionally, the core Knative definition assumes the Linux Container Configuration.

In particular, the default Knative implementation relies on Kubernetes behavior to implement container operation. In some cases, current Kubernetes behavior in 2018 is not as performant as envisioned in this documentation. The goal of the Knative authors is to push as much of the needed functionality into Kubernetes and/or HTTP routers as possible, rather than implementing reach-around layers.

This document considers two users of a given Knative environment, and is particularly concerned with the expectations of developers (and language and tooling developers, by extension) running code in the environment.

Developers write code which is packaged into a container which is run on the Knative cluster.
- Language and tooling developers typically write tools used by developers to package code into containers. As such, they are concerned that tooling which wraps developer code complies with this runtime contract.
Operators (also known as platform providers) provision the compute resources and manage the software configuration of Knative and the underlying abstractions (for example, Linux, Kubernetes, Istio, etc).

Runtime and Lifecycle

Knative aims to minimize the amount of tuning and production configuration needed to run a service. Some of these production-friendly features include:

Stateless computation at request-scale or event-scale granularity.
Automatic scaling between 0 and many instances (the process scale-out model).
Automatic adjustment of resource requirements based on observed behavior, where possible.

In order to achieve these properties, containers which are operated as part of a serverless platform are expected to observe the following properties:

Fast startup time (<1s until a request or event can be processed, given container image layer caching),
Minimize local state (in support of autoscaling and scale to zero),
CPU usage only while requests are active (see this issue for reasons an operator might want to de-allocate CPU between requests).

State

In a highly-shared environment, containers might experience the following:

Containers with status of stopped MAY be immediately reclaimed by the system.
The container process MAY be started as PID 1, through the use of PID namespaces or other processes.

Lifecycle

The container MAY be killed when the container is inactive. Containers MUST be considered "active" while they are handling at least one request, but other conditions MAY also be used to determine that a container is active. The container is sent a SIGTERM signal when it is killed via the OCI specification's kill command to allow for a graceful shutdown of existing resources and connections. If the container has not shut down after a defined grace period, the container is forcibly killed via a SIGKILL signal.
The environment MAY restrict the use of prestart, poststart, and poststop hooks to platform operators rather than developers. All of these hooks are defined in the context of the runtime namespace, rather than the container namespace, and might expose system-level information (and are non-portable).
Failures of the developer-specified process MUST be logged to a developer-visible logging system.
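
As a non-normative illustration of the SIGTERM-then-SIGKILL sequence described above, a container process might record the SIGTERM and stop taking new work so that it exits before the grace period ends. This is a minimal sketch; the names `shutdown_requested`, `install_handlers`, and `run_until_shutdown` are hypothetical and not part of this contract.

```python
import signal
import threading

# Set once the platform has asked the container to stop (SIGTERM received).
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the work loop notices it and returns."""
    shutdown_requested.set()


def install_handlers():
    """Register the SIGTERM handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_until_shutdown(handle_one, poll_seconds=0.1):
    """Call handle_one() repeatedly until SIGTERM arrives, then return.

    Returning promptly matters: after the grace period the platform sends
    SIGKILL, which cannot be caught.
    """
    while not shutdown_requested.is_set():
        handle_one()
        shutdown_requested.wait(poll_seconds)
```

A process would call `install_handlers()` during startup and release connections and other resources after `run_until_shutdown` returns.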

In addition, some serverless environments MAY use an execution model other than Docker on Linux (for example, runv or Kata Containers). Implementations using an execution model beyond Docker on Linux MAY alter the lifecycle contract beyond the OCI specification as long as:

An OCI-compliant lifecycle contract is the default, regardless of how many extensions are provided.
The implementation of an extended execution model or lifecycle MUST provide documentation about the extended model or lifecycle and documentation about how to opt in to the extended lifecycle contract.

Errors

Platforms MAY provide mechanisms for post-mortem viewing of filesystem contents from a particular execution. Because containers (particularly failing containers) can experience frequent starts, operators or platform providers SHOULD limit the total space consumed by these failures.

Warnings

As specified by OCI.

Operations

It is expected that containers do not have direct access to the OCI interface as providing access allows containers to circumvent runtime restrictions that are enforced by the Knative control plane. The operator or platform provider MAY have the ability to directly interact with the OCI interface, but that is beyond the scope of this specification.

An OPTIONAL method of invoking the kill operation MAY be exposed to developers to provide signalling to the container.

Hooks

Operation hooks SHOULD NOT be configurable by the Knative developer. Operators or platform providers MAY use hooks to implement their own lifecycle controls.

Linux Runtime

File descriptors

A read from the stdin file descriptor on the container SHOULD always result in EOF. The stdout and stderr file descriptors on the container SHOULD be collected and retained in a developer-accessible logging repository. (TODO:docs#902).
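
Because stdout and stderr are the collected streams, an application can simply log to them rather than to files of its own choosing. The following is a sketch only; the logger name `app` and the line format are assumptions, not requirements of this contract.

```python
import logging
import sys


def configure_logging(stream=sys.stdout, level=logging.INFO):
    """Send application logs to a standard stream so the platform can collect them."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```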

Within the container, pipes and file descriptors can be used to communicate between processes running in the same container.

Dev symbolic links

As specified by OCI.

Network Environment

For request-response functions, 0->many scaling is enabled by control of the inbound request path to enable capturing and stalling inbound requests until an autoscaled container is available to serve that request.

Inbound network connectivity

Inbound network connectivity is assumed to use HTTP/1.1 compatible transport.

Protocols and Ports

The container MUST accept HTTP/1.1 requests from the environment. The environment SHOULD offer an HTTP/2 upgrade option (Upgrade: h2c on either the initial request or an OPTIONS request) on the same port as HTTP/1.1. The developer MAY specify this port at deployment; if the developer does not specify a port, the platform provider MUST provide a default. Only one inbound containerPort SHALL be specified in the core.v1.Container specification. The hostPort parameter SHOULD NOT be set by the developer or the platform provider, as it can interfere with ingress autoscaling. Regardless of its source, the selected port will be made available in the PORT environment variable.
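
To illustrate how a container might honor the PORT environment variable, here is a minimal sketch of an HTTP/1.1 server built on the Python standard library. The fallback value `DEFAULT_PORT` is an assumption for running outside the platform and is not defined by this contract; `listen_port`, `Handler`, and `make_server` are hypothetical names.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: used only when PORT is absent, e.g. when run outside the platform.
DEFAULT_PORT = 8080


def listen_port(environ=os.environ):
    """Return the port the platform selected, as published in PORT."""
    return int(environ.get("PORT", DEFAULT_PORT))


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(environ=os.environ):
    """Bind on all interfaces so the platform's proxy can reach the process."""
    return ThreadingHTTPServer(("0.0.0.0", listen_port(environ)), Handler)
```

Calling `make_server().serve_forever()` starts serving; the sketch leaves that call to the caller so the module can be imported without blocking.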

The platform provider SHOULD configure the platform to perform HTTPS termination and protocol transformation, e.g. between QUIC or HTTP/2 and HTTP/1.1. Developers ought not to need to implement multiple transports between the platform and their code. Unless overridden by setting the name field on the inbound port, the platform will perform automatic detection as described above. If the core.v1.Container.ports[0].name is set to one of the following values, HTTP negotiation will be disabled and the following protocol will be used:

http1: HTTP/1.1 transport; the platform will not attempt to upgrade to h2c.
h2c: HTTP/2 transport, as described in section 3.4 of the HTTP/2 spec (Starting HTTP/2 with Prior Knowledge).

Developers ought to use automatic content negotiation where available, and MUST NOT set the name field to arbitrary values, as additional transports might be defined in the future. Developers can assume all traffic is intermediated by an L7 proxy. Developers cannot assume a direct network connection between their server process and client processes.
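
The port-name rule above can be sketched as a small lookup. This is an illustration only: the function name and the `"auto"` return value are hypothetical labels for "negotiation enabled", and rejecting unknown names is an assumption about how a platform might enforce the MUST NOT.

```python
# Port names defined by this contract that disable HTTP negotiation.
FIXED_TRANSPORTS = ("http1", "h2c")


def transport_for_port_name(name):
    """Map core.v1.Container.ports[0].name to the transport the platform uses.

    An unset or empty name means the platform negotiates automatically.
    """
    if not name:
        return "auto"  # hypothetical label for automatic detection
    if name in FIXED_TRANSPORTS:
        return name
    # Arbitrary names are not allowed; future transports may claim them.
    raise ValueError(f"unsupported port name: {name!r}")
```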

Headers

As requests to the container will be proxied by the platform, all inbound request headers SHOULD be set to the same values as the incoming request. Some implementations MAY strip certain HTTP headers for security or other reasons; such implementations SHOULD document the set of stripped headers. Because the full set of HTTP headers is constantly evolving, it is RECOMMENDED that platforms which strip headers define a common prefix which covers all headers removed by the platform.

In addition, the following base set of HTTP/1.1 headers MUST be set on the request:

Host - As specified by RFC 7230 Section 5.4

Also, the following proxy-specific request headers MUST be set:

Forwarded - As specified by RFC 7239.

Additionally, the following legacy headers SHOULD be set for compatibility with client software:

X-Forwarded-For
X-Forwarded-Proto
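
As an illustration of how application code might consume these headers, the sketch below reads the first hop of a Forwarded header and falls back to the legacy headers. It is deliberately simplified: it does not handle commas or semicolons inside quoted values, and it assumes `headers` is a plain dict keyed by the exact header names shown. The function names are hypothetical.

```python
def parse_forwarded(value):
    """Parse a Forwarded header value into a list of dicts, one per proxy hop."""
    hops = []
    for element in value.split(","):
        params = {}
        for pair in element.split(";"):
            if "=" not in pair:
                continue
            key, _, val = pair.strip().partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
        if params:
            hops.append(params)
    return hops


def original_client(headers):
    """Return (client, proto) from Forwarded, else from the X-Forwarded-* headers."""
    forwarded = headers.get("Forwarded")
    if forwarded:
        hops = parse_forwarded(forwarded)
        if hops:
            first = hops[0]  # the first element describes the original client
            return first.get("for"), first.get("proto")
    xff = headers.get("X-Forwarded-For", "")
    client = xff.split(",")[0].strip() or None
    return client, headers.get("X-Forwarded-Proto")
```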

In addition, the following headers SHOULD be set to enable tracing and observability features:

Trace headers - Platform providers SHOULD provide and document headers needed to propagate trace contexts, in the absence of W3C standardization.

Operators and platform providers MAY provide additional headers to provide environment specific information.

Meta Requests

The core.v1.Container object allows specifying both a readinessProbe and a livenessProbe. If not provided, container startup and listening on the declared HTTP socket is considered sufficient to declare the container "ready" and "live" (see the probe definition below). If specified, liveness and readiness probes are REQUIRED to be of the httpGet or tcpSocket types, and MUST target the inbound container port; platform providers SHOULD disallow other probe methods.

Because serverless platforms automatically scale instances based on inbound requests, and because noncompliant (or even failing) containers might be provided by developers, the following defaults SHOULD be applied by the platform provider if not set by the developer. The probes are intended to be trivially supportable by naive conforming containers while preventing interference with developer code. These settings apply to both livenessProbe and readinessProbe:

tcpSocket set to the container's port
initialDelaySeconds set to 0
periodSeconds set to platform-specific value

Setting initialDelaySeconds to a value greater than 0 impacts container startup time (aka cold start time) as a container will not serve traffic until the probe succeeds.
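
The defaulting rules above could be applied by a platform roughly as follows. This is a sketch under stated assumptions: probes are represented as plain dicts using the Kubernetes field names, `PLATFORM_PERIOD_SECONDS` stands in for the platform-specific value, and rejecting `exec` probes is one possible way to disallow other probe methods.

```python
import copy

# Assumption: placeholder for the platform-specific periodSeconds value.
PLATFORM_PERIOD_SECONDS = 10


def with_default_probe(probe, container_port, period_seconds=PLATFORM_PERIOD_SECONDS):
    """Return a copy of a liveness or readiness probe with contract defaults filled in."""
    result = copy.deepcopy(probe) if probe else {}
    if "exec" in result:
        # Only httpGet and tcpSocket probes are permitted by the contract.
        raise ValueError("exec probes are not permitted")
    if "httpGet" not in result and "tcpSocket" not in result:
        result["tcpSocket"] = {"port": container_port}
    result.setdefault("initialDelaySeconds", 0)
    result.setdefault("periodSeconds", period_seconds)
    return result
```

Developer-supplied values are left untouched; only missing fields are filled in.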

Deployment probe

On the initial deployment, platform providers SHOULD start an instance of the container to validate that the container is valid and will become ready. This startup SHOULD occur even if the container would not serve any user requests. If a container cannot satisfy the readinessProbe during deployment startup, the Revision SHOULD be marked as failed.

Initial readiness probes allow the platform to avoid attempting to later provision or scale deployments (Revisions) which cannot become healthy, and act as a backstop to developer testing (via CI/CD or otherwise) which has been performed on the supplied container. Common causes of these failures can include: malformed dynamic code not tested in the container, environment differences between testing and deployment environment, and missing or misconfigured backends. This also provides an opportunity for the container to be run at least once despite scale-to-zero guarantees.

Outbound network connectivity

OCI does not specify any properties of the network environment in which a container runs. The following items are OPTIONAL additions to the runtime contract which describe services which might be of particular value to platform providers.

DNS

Platform providers SHOULD override the DNS related configuration files under /etc to enable local DNS lookups in the target environment (see Default Filesystems).

Metadata Services

Platform providers MAY provide a network service to provide introspection and environment information to the running process. Such a network service SHOULD be an HTTP server with an operator- or provider-defined URL schema. If a metadata service is provided, the schema MUST be documented. Sample use cases for such metadata include:

Container information or control interfaces.
Host information, including maintenance or capability information.
Access to external configuration stores (such as the Kubernetes ConfigMap APIs).
Access to secrets or identity tokens, to enable key rotation.

Configuration

Root

Platform providers MAY set the readonly bit on the container to true in order to reduce the possible disk space provisioning and management of serverless workloads. Containers MUST use the provided temporary storage areas (see Default Filesystems) for working files and caches.
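
With a read-only root filesystem, working files belong in the temporary storage area. A minimal sketch, assuming `/tmp` is the writable area (as listed under Default Filesystems); the helper name and the `scratch-` prefix are hypothetical.

```python
import tempfile


def write_scratch(data, directory="/tmp"):
    """Write bytes to a new file in the temporary storage area and return its path."""
    with tempfile.NamedTemporaryFile(
        dir=directory, prefix="scratch-", delete=False
    ) as handle:
        handle.write(data)
        return handle.name
```

Files written this way do not survive the container, so they are suitable only for caches and intermediate results.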

Mounts

In general, stateless applications package their dependencies within the container and do not rely on mutable external state for templates, logging configuration, etc. In some cases, it might be necessary for certain application settings to be overridden at deploy time (for example, database backends or authentication credentials). When these settings need to be loaded via a file, read-only mounts of application configuration and secrets are supported by ConfigMap and Secrets volumes. Platform providers MAY apply updates to Secrets and ConfigMaps while the application is running; these updates could complicate rollout and rollback. It is up to the developer to choose appropriate policies for mounting and updating ConfigMap and Secrets which are mounted as volumes.

As serverless applications are expected to scale horizontally and statelessly, per-container volumes are likely to introduce state and scaling bottlenecks and are NOT RECOMMENDED.

Process

Serverless applications which scale horizontally are expected to be managed in a declarative fashion, and individual instances SHOULD NOT be interacted with or connected directly.

The terminal property SHOULD NOT be set to true.
The Linux process-specific properties MUST NOT be configurable by the developer, and MAY be set by the operator or platform provider.

The following environment variables MUST be set:

Name	Meaning
`PORT`	Ingress `containerPort` for ingress requests and health checks. See Inbound network connectivity for more details.

The following environment variables SHOULD be set:

Name	Meaning
`K_REVISION`	Name of the current Revision.
`K_CONFIGURATION`	Name of the Configuration that created the current Revision.
`K_SERVICE`	If the current Revision has been created by manipulating a Knative Service object, name of this Knative Service.
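
Because these variables are only a SHOULD, code that reads them ought to tolerate their absence. A small sketch (the function name and the dictionary keys are hypothetical) that collects them, for example to label log lines:

```python
import os


def knative_identity(environ=os.environ):
    """Collect the Knative identity variables; missing ones are reported as None."""
    return {
        "revision": environ.get("K_REVISION"),
        "configuration": environ.get("K_CONFIGURATION"),
        "service": environ.get("K_SERVICE"),
    }
```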

Platform providers MAY set additional environment variables. Standardization of such variables will follow demonstrated usage and utility.

User

Developers MAY specify that containers be run as a specific user or group ID using the runAsUser container property. If specified, the runtime MUST run the container as the specified user ID if allowed by the platform (see below). If no runAsUser is specified, a platform-specific default SHALL be used. Platform Providers SHOULD document this default behavior.

Operators and Platform Providers MAY prohibit certain user IDs, such as root, from executing code. In this case, if the identity selected by the developer is invalid, the container execution MUST be failed.

Default Filesystems

The OCI specification describes a default container environment which can be used for many different purposes, including containerization of existing legacy or stateful processes which might store substantial amounts of on-disk state. In a scaled-out, stateless environment, container startup and teardown is accelerated when on-disk resources are kept to a minimum. Additionally, developers might not have access to the container's filesystems (or the containers might be rapidly recycled), so log aggregation SHOULD be provided.

In addition to the filesystems recommended in the OCI, the following filesystems MUST be provided:

Mount	Description
`/tmp`	MUST be Read-write. SHOULD be backed by tmpfs if disk load is a concern.
`/var/log`	MUST be a directory with write permissions for logs storage. Implementations MAY permit the creation of additional subdirectories and log rotation and renaming.

To enable DNS resolution, the following files might be overwritten at runtime:

File	Description
`/etc/hosts`	MAY be overridden to provide host mappings for well-known or provider-specific resources.
`/etc/hostname`	Some environments MAY set this to a different value for each container, but other environments might use the same value for all containers.
`/etc/resolv.conf`	SHOULD be set to a valid cluster-specific recursive resolver. Providers MAY provide additional default search domains to improve customer experience in the cluster.

Platform providers MAY provide additional platform-specific mount points (example: shared read-only object stores or DB connection brokers). If provided, the location and contents of the mount points SHOULD be documented by the platform provider.

Namespaces

The namespace configuration MUST be provided by the operator or platform provider; developers or container providers MUST NOT set or assume a particular namespace configuration.

Devices

Developers MUST NOT use OCI devices to request additional devices beyond the OCI specification "Default Devices".

Control Groups

Control group (cgroups) controllers MUST be selected and configured by the operator or platform provider. The cgroup devices SHOULD be mounted as read-only.

Memory and CPU limits

The serverless platform MAY automatically adjust the resource limits (e.g. CPU) based on observed resource usage. The limits enforced on a container SHOULD be exposed in:

/sys/fs/cgroup/memory/memory.limit_in_bytes
/sys/fs/cgroup/cpu/cpu.cfs_period_us
/sys/fs/cgroup/cpu/cpu.cfs_quota_us
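
A process can read these files to size thread pools or caches to the limits actually enforced. The sketch below assumes the cgroup v1 layout shown above; on hosts that use cgroup v2 the files live elsewhere, and these helpers simply return None. The function names and the `root` parameter are hypothetical, the latter added so the code can be exercised against a test directory.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")


def _read_int(path):
    """Return the integer stored in path, or None if it is missing or malformed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def memory_limit_bytes(root=CGROUP_ROOT):
    """Memory limit in bytes, or None if it cannot be determined."""
    return _read_int(Path(root) / "memory" / "memory.limit_in_bytes")


def cpu_limit_cores(root=CGROUP_ROOT):
    """CPU limit as a number of cores (quota / period), or None if unlimited or unknown."""
    quota = _read_int(Path(root) / "cpu" / "cpu.cfs_quota_us")
    period = _read_int(Path(root) / "cpu" / "cpu.cfs_period_us")
    if quota is None or period is None or quota <= 0 or period <= 0:
        return None  # a quota of -1 means no CFS limit is set
    return quota / period
```

Since the platform may adjust limits over time, a long-running process may want to re-read these values rather than cache them at startup.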

Additionally, operators or the platform MAY restrict or prevent CPU scheduling for instances when no requests are active, where this capability is available. The Knative authors are currently discussing the best implementation options for this feature with the Kubernetes SIG-Node team.

Sysctl

The sysctl parameter applies system-wide kernel parameter tuning, which could interfere with other workloads on the host system. This is not appropriate for a shared environment, and SHOULD NOT be exposed for developer tuning.

Seccomp

Seccomp provides a mechanism for further restricting the set of Linux syscalls permitted to the processes running inside the container environment. A seccomp sandbox MAY be enforced by the platform operator; any such application profiles SHOULD be configured and applied in a consistent mechanism outside of the container specification. A seccomp policy MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.

Rootfs Mount Propagation

From the OCI spec:

rootfsPropagation (string, OPTIONAL) sets the rootfs's mount propagation. Its value is either slave, private, shared or unbindable. The Shared Subtrees article in the kernel documentation has more information about mount propagation.

This option MAY be set by the operator or platform provider, and MUST NOT be configurable by the developer. Mount propagation MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.

Masked Paths

This option MAY be set by the operator or platform provider, and MUST NOT be configurable by the developer. Masked paths MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.

Readonly Paths

This option MAY only be set by the operator or platform provider, and MUST NOT be configurable by the developer.

Posix-platform Hooks

Operation hooks SHOULD NOT be configurable by the developer. Operators or platform providers MAY use hooks to implement their own lifecycle controls.

Annotations

As specified by OCI.
