The Knative serverless compute infrastructure extends the Open Container Initiative Runtime Specification to describe the functionality and requirements for serverless execution workloads. In contrast to general-purpose containers, stateless request-triggered (i.e. on-demand) autoscaled containers have the following properties:
- Little or no long-term runtime state (especially in cases where code might be scaled to zero in the absence of request traffic).
- Logging and monitoring aggregation (telemetry) is important for understanding and debugging the system, as containers might be created or deleted at any time in response to autoscaling.
- Multitenancy is highly desirable to allow cost sharing for bursty applications on relatively stable underlying hardware resources.
This contract does not define the control surfaces over the runtime environment except by reference to the Knative Kubernetes resources. Similarly, this contract does not define the implementation of metrics or logging aggregation, except to provide a contract for the collection of logging data. It is expected that access to the aggregated telemetry will be provided by the platform operator.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.
The OCI specification (v1.0.1) is the basis for this document. When this document and the OCI specification conflict, this document is assumed to override the general OCI recommendations. Where this document does not specify behavior, runtime implementations SHOULD be OCI compliant with respect to those features. Additionally, the core Knative definition assumes the Linux Container Configuration.
In particular, the default Knative implementation relies on Kubernetes behavior to implement container operation. In some cases, current Kubernetes behavior in 2018 is not as performant as envisioned in this documentation. The goal of the Knative authors is to push as much of the needed functionality into Kubernetes and/or HTTP routers as possible, rather than implementing reach-around layers.
This document considers two users of a given Knative environment, and is particularly concerned with the expectations of developers (and language and tooling developers, by extension) running code in the environment.
- Developers write code which is packaged into a container which is run on the Knative cluster.
- Language and tooling developers typically write tools used by developers to package code into containers. As such, they are concerned that tooling which wraps developer code complies with this runtime contract.
- Operators (also known as platform providers) provision the compute resources and manage the software configuration of Knative and the underlying abstractions (for example, Linux, Kubernetes, Istio, etc).
Knative aims to minimize the amount of tuning and production configuration needed to run a service. Some of these production-friendly features include:
- Stateless computation at request-scale or event-scale granularity.
- Automatic scaling between 0 and many instances (the process scale-out model).
- Automatic adjustment of resource requirements based on observed behavior, where possible.
In order to achieve these properties, containers which are operated as part of a serverless platform are expected to observe the following properties:
- Fast startup time (<1s until a request or event can be processed, given container image layer caching),
- Minimize local state (in support of autoscaling and scale to zero),
- CPU usage only while requests are active (see this issue for reasons an operator might want to de-allocate CPU between requests).
In a highly-shared environment, containers might experience the following:
- Containers with a status of `stopped` MAY be immediately reclaimed by the system.
- The container process MAY be started as pid 0, through the use of PID namespaces or other processes.
- The container MAY be killed when it is inactive. Containers MUST be considered "active" while they are handling at least one request, but other conditions MAY also be used to determine that a container is active. When the container is killed via the OCI specification's `kill` command, it is sent a `SIGTERM` signal to allow for a graceful shutdown of existing resources and connections (a minimal graceful-shutdown sketch follows this list). If the container has not shut down after a defined grace period, it is forcibly killed via a `SIGKILL` signal.
- The environment MAY restrict the use of `prestart`, `poststart`, and `poststop` hooks to platform operators rather than developers. All of these hooks are defined in the context of the runtime namespace, rather than the container namespace, and might expose system-level information (and are non-portable).
- Failures of the developer-specified process MUST be logged to a developer-visible logging system.
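As a sketch of the `SIGTERM` handling described above, the following Go program drains in-flight HTTP requests when the platform signals shutdown. The port and the 10-second drain timeout are assumptions for illustration; the actual grace period is platform-defined.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok\n"))
		})}

	// Serve until SIGTERM arrives via the platform's kill operation.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Drain existing connections before the grace period expires; 10s is an
	// assumed value, not one mandated by this contract.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```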
In addition, some serverless environments MAY use an execution model other than Docker on Linux (for example, runV or Kata Containers). Implementations using an execution model beyond Docker on Linux MAY alter the lifecycle contract beyond the OCI specification as long as:
- An OCI-compliant lifecycle contract is the default, regardless of how many extensions are provided.
- The implementation of an extended execution model or lifecycle MUST provide documentation about the extended model or lifecycle and documentation about how to opt in to the extended lifecycle contract.
- Platforms MAY provide mechanisms for post-mortem viewing of filesystem contents from a particular execution. Because containers (particularly failing containers) can experience frequent starts, operators or platform providers SHOULD limit the total space consumed by these failures.
As specified by OCI.
It is expected that containers do not have direct access to the OCI interface as providing access allows containers to circumvent runtime restrictions that are enforced by the Knative control plane. The operator or platform provider MAY have the ability to directly interact with the OCI interface, but that is beyond the scope of this specification.
An OPTIONAL method of invoking the `kill` operation MAY be exposed to developers to provide signalling to the container.
Operation hooks SHOULD NOT be configurable by the Knative developer. Operators or platform providers MAY use hooks to implement their own lifecycle controls.
A read from the `stdin` file descriptor on the container SHOULD always result in `EOF`. The `stdout` and `stderr` file descriptors on the container SHOULD be collected and retained in a developer-accessible logging repository. (TODO: docs#902)
Within the container, pipes and file descriptors can be used to communicate between processes running in the same container.
As specified by OCI.
For request-response functions, 0->many scaling is enabled by control of the inbound request path to enable capturing and stalling inbound requests until an autoscaled container is available to serve that request.
Inbound network connectivity is assumed to use HTTP/1.1 compatible transport.
The container MUST accept HTTP/1.1 requests from the environment. The environment SHOULD offer an HTTP/2.0 upgrade option (`Upgrade: h2c` on either the initial request or an `OPTIONS` request) on the same port as HTTP/1.1. The developer MAY specify this port at deployment; if the developer does not specify a port, the platform provider MUST provide a default. Only one inbound `containerPort` SHALL be specified in the `core.v1.Container` specification. The `hostPort` parameter SHOULD NOT be set by the developer or the platform provider, as it can interfere with ingress autoscaling. Regardless of its source, the selected port will be made available in the `PORT` environment variable.
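A minimal sketch of a container honoring the `PORT` contract above, serving HTTP/1.1; the fallback port of 8080 is an assumption for running outside the platform, not part of the contract.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// The platform injects the selected inbound port via PORT.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // assumed fallback for local testing
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello from %s\n", r.Host)
	})
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```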
The platform provider SHOULD configure the platform to perform HTTPS termination and protocol transformation, e.g. between QUIC or HTTP/2 and HTTP/1.1. Developers ought not need to implement multiple transports between the platform and their code. Unless overridden by setting the `name` field on the inbound port, the platform will perform automatic detection as described above. If `core.v1.Container.ports[0].name` is set to one of the following values, HTTP negotiation will be disabled and the corresponding protocol will be used:

- `http1`: HTTP/1.1 transport; the platform will not attempt to upgrade to h2c.
- `h2c`: HTTP/2 transport, as described in section 3.4 of the HTTP/2 specification (Starting HTTP/2 with Prior Knowledge).

Developers ought to use automatic content negotiation where available, and MUST NOT set the `name` field to arbitrary values, as additional transports might be defined in the future. Developers can assume all traffic is intermediated by an L7 proxy. Developers cannot assume a direct network connection between their server process and client processes.
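For developers who opt out of negotiation by naming the port `h2c`, the sketch below serves HTTP/2 over cleartext using the `golang.org/x/net/http2/h2c` helper. This is an illustrative sketch under the assumptions above, not a required implementation.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Proto reports "HTTP/2.0" for prior-knowledge h2c requests.
		fmt.Fprintf(w, "proto: %s\n", r.Proto)
	})
	// h2c.NewHandler accepts HTTP/2 without TLS, matching the "h2c" port name.
	log.Fatal(http.ListenAndServe(":8080", h2c.NewHandler(handler, &http2.Server{})))
}
```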
As requests to the container will be proxied by the platform, all inbound request headers SHOULD be set to the same values as the incoming request. Some implementations MAY strip certain HTTP headers for security or other reasons; such implementations SHOULD document the set of stripped headers. Because the full set of HTTP headers is constantly evolving, it is RECOMMENDED that platforms which strip headers define a common prefix which covers all headers removed by the platform.
In addition, the following base set of HTTP/1.1 headers MUST be set on the request:

- `Host` - As specified by RFC 7230 Section 5.4.

Also, the following proxy-specific request headers MUST be set:

- `Forwarded` - As specified by RFC 7239.

Additionally, the following legacy headers SHOULD be set for compatibility with client software:

- `X-Forwarded-For`
- `X-Forwarded-Proto`

In addition, the following headers SHOULD be set to enable tracing and observability features:

- Trace headers - Platform providers SHOULD provide and document headers needed to propagate trace contexts, in the absence of W3C standardization.
Operators and platform providers MAY provide additional headers to provide environment specific information.
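The handler sketch below logs the forwarding and trace headers discussed above. The `X-B3-TraceId` header is only an example of a provider-documented trace header, not one this contract mandates.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Headers set by the platform's L7 proxy on the inbound request.
		log.Printf("Host=%s Forwarded=%q X-Forwarded-For=%q X-Forwarded-Proto=%q",
			r.Host, r.Header.Get("Forwarded"),
			r.Header.Get("X-Forwarded-For"), r.Header.Get("X-Forwarded-Proto"))
		// The trace header name is provider-specific; X-B3-TraceId is an example.
		log.Printf("trace=%q", r.Header.Get("X-B3-TraceId"))
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```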
The `core.v1.Container` object allows specifying both a `readinessProbe` and a `livenessProbe`. If not provided, container startup and listening on the declared HTTP socket is considered sufficient to declare the container "ready" and "live" (see the probe definition below). If specified, liveness and readiness probes are REQUIRED to be of the `httpGet` or `tcpSocket` types, and MUST target the inbound container port; platform providers SHOULD disallow other probe methods.
Because serverless platforms automatically scale instances based on inbound requests, and because noncompliant (or even failing) containers might be provided by developers, the following defaults SHOULD be applied by the platform provider if not set by the developer. The probes are intended to be trivially supportable by naive conforming containers while preventing interference with developer code. These settings apply to both `livenessProbe` and `readinessProbe`:

- `tcpSocket` set to the container's port
- `initialDelaySeconds` set to 0
- `periodSeconds` set to a platform-specific value
Setting `initialDelaySeconds` to a value greater than 0 impacts container startup time (aka cold start time), as a container will not serve traffic until the probe succeeds.
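The defaults above succeed as soon as the container accepts TCP connections on its port. A developer who instead declares an `httpGet` readiness probe might expose a health path on the inbound port, as in this sketch; the `/healthz` path and readiness flag are assumptions, not part of the contract.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

var ready atomic.Bool // flipped once initialization has finished

func main() {
	// Probe target on the same port as user traffic, as required above.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})

	// Mark the container ready once any warm-up work is complete.
	ready.Store(true)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```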
On the initial deployment, platform providers SHOULD start an instance of the container to validate that the container is valid and will become ready. This startup SHOULD occur even if the container would not serve any user requests. If a container cannot satisfy the `readinessProbe` during deployment startup, the Revision SHOULD be marked as failed.
Initial readiness probes allow the platform to avoid attempting to later provision or scale deployments (Revisions) which cannot become healthy, and act as a backstop to developer testing (via CI/CD or otherwise) which has been performed on the supplied container. Common causes of these failures can include: malformed dynamic code not tested in the container, environment differences between testing and deployment environment, and missing or misconfigured backends. This also provides an opportunity for the container to be run at least once despite scale-to-zero guarantees.
OCI does not specify any properties of the network environment in which a container runs. The following items are OPTIONAL additions to the runtime contract which describe services which might be of particular value to platform providers.
Platform providers SHOULD override the DNS-related configuration files under `/etc` to enable local DNS lookups in the target environment (see Default Filesystems).
Platform providers MAY provide a network service to provide introspection and environment information to the running process. Such a network service SHOULD be an HTTP server with an operator- or provider-defined URL schema. If a metadata service is provided, the schema MUST be documented. Sample use cases for such metadata include:
- Container information or control interfaces.
- Host information, including maintenance or capability information.
- Access to external configuration stores (such as the Kubernetes ConfigMap APIs).
- Access to secrets or identity tokens, to enable key rotation.
Platform providers MAY set the `readonly` bit on the container to `true` in order to reduce the disk space provisioning and management required for serverless workloads. Containers MUST use the provided temporary storage areas (see Default Filesystems) for working files and caches.
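Since the root filesystem might be mounted read-only, a container should keep scratch data under the writable areas listed in Default Filesystems. A small sketch, assuming `/tmp` as the scratch location:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Create a per-process scratch directory under the writable /tmp mount;
	// writes elsewhere can fail when the readonly bit is set on the rootfs.
	dir, err := os.MkdirTemp("/tmp", "scratch-")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	if err := os.WriteFile(filepath.Join(dir, "cache.dat"), []byte("data"), 0o600); err != nil {
		log.Fatal(err)
	}
	log.Printf("working files live under %s", dir)
}
```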
In general, stateless applications package their dependencies within the container and do not rely on mutable external state for templates, logging configuration, etc. In some cases, it might be necessary for certain application settings to be overridden at deploy time (for example, database backends or authentication credentials). When these settings need to be loaded via a file, read-only mounts of application configuration and secrets are supported by `ConfigMap` and `Secrets` volumes. Platform providers MAY apply updates to `Secrets` and `ConfigMaps` while the application is running; these updates could complicate rollout and rollback. It is up to the developer to choose appropriate policies for mounting and updating `ConfigMap` and `Secrets` which are mounted as volumes.
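A sketch of consuming such a read-only mount, assuming the deployment mounts a ConfigMap at the hypothetical path `/etc/app-config` with a `database-url` key:

```go
package main

import (
	"log"
	"os"
	"strings"
)

func main() {
	// Each key of the mounted ConfigMap or Secret appears as a file in the
	// volume's mount path; the path below is an assumed mount location.
	raw, err := os.ReadFile("/etc/app-config/database-url")
	if err != nil {
		log.Fatalf("configuration not mounted: %v", err)
	}
	dbURL := strings.TrimSpace(string(raw))
	log.Printf("using database backend %s", dbURL)
}
```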
As serverless applications are expected to scale horizontally and statelessly, per-container volumes are likely to introduce state and scaling bottlenecks and are NOT RECOMMENDED.
Serverless applications which scale horizontally are expected to be managed in a declarative fashion, and individual instances SHOULD NOT be interacted with or connected directly.
- The `terminal` property SHOULD NOT be set to `true`.
- The Linux process-specific properties MUST NOT be configurable by the developer, and MAY be set by the operator or platform provider.
The following environment variables MUST be set:
| Name | Meaning |
| --- | --- |
| `PORT` | Ingress `containerPort` for ingress requests and health checks. See Inbound network connectivity for more details. |
The following environment variables SHOULD be set:
| Name | Meaning |
| --- | --- |
| `K_REVISION` | Name of the current Revision. |
| `K_CONFIGURATION` | Name of the Configuration that created the current Revision. |
| `K_SERVICE` | If the current Revision was created by manipulating a Knative Service object, the name of this Knative Service. |
Platform providers MAY set additional environment variables. Standardization of such variables will follow demonstrated usage and utility.
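A sketch of reading the variables above, treating the `K_*` values as optional since they are only SHOULD-level requirements:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// PORT MUST be set by the platform; the K_* variables SHOULD be set.
	log.Printf("serving on port %s", os.Getenv("PORT"))
	for _, name := range []string{"K_REVISION", "K_CONFIGURATION", "K_SERVICE"} {
		if v, ok := os.LookupEnv(name); ok {
			log.Printf("%s=%s", name, v)
		} else {
			log.Printf("%s not provided by this platform", name)
		}
	}
}
```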
Developers MAY specify that containers be run as a specific user or group ID using the `runAsUser` container property. If specified, the runtime MUST run the container as the specified user ID if allowed by the platform (see below). If no `runAsUser` is specified, a platform-specific default SHALL be used. Platform providers SHOULD document this default behavior.
Operators and platform providers MAY prohibit certain user IDs, such as `root`, from executing code. In this case, if the identity selected by the developer is invalid, the container execution MUST be failed.
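As a sketch of the developer-facing field, the `core.v1.Container` Go type carries `runAsUser` inside its security context. The UID 1000 and the image reference below are only examples; a specific UID might be rejected by an operator's policy.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	uid := int64(1000) // example non-root UID; platforms MAY reject specific IDs
	container := corev1.Container{
		Name:  "user-container",
		Image: "example.com/app:latest", // placeholder image reference
		SecurityContext: &corev1.SecurityContext{
			RunAsUser: &uid,
		},
	}
	fmt.Printf("runAsUser=%d\n", *container.SecurityContext.RunAsUser)
}
```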
The OCI specification describes a default container environment which can be used for many different purposes, including containerization of existing legacy or stateful processes which might store substantial amounts of on-disk state. In a scaled-out, stateless environment, container startup and teardown is accelerated when on-disk resources are kept to a minimum. Additionally, developers might not have access to the container's filesystems (or the containers might be rapidly recycled), so log aggregation SHOULD be provided.
In addition to the filesystems recommended in the OCI, the following filesystems MUST be provided:
| Mount | Description |
| --- | --- |
| `/tmp` | MUST be read-write. SHOULD be backed by tmpfs if disk load is a concern. |
| `/var/log` | MUST be a directory with write permissions for log storage. Implementations MAY permit the creation of additional subdirectories and log rotation and renaming. |
To enable DNS resolution, the following files might be overwritten at runtime:
| File | Description |
| --- | --- |
| `/etc/hosts` | MAY be overridden to provide host mappings for well-known or provider-specific resources. |
| `/etc/hostname` | Some environments MAY set this to a different value for each container, but other environments might use the same value for all containers. |
| `/etc/resolv.conf` | SHOULD be set to a valid cluster-specific recursive resolver. Providers MAY provide additional default search domains to improve customer experience in the cluster. |
Platform providers MAY provide additional platform-specific mount points (example: shared read-only object stores or DB connection brokers). If provided, the location and contents of the mount points SHOULD be documented by the platform provider.
The namespace configuration MUST be provided by the operator or platform provider; developers or container providers MUST NOT set or assume a particular namespace configuration.
Developers MUST NOT use OCI `devices` to request additional devices beyond the OCI specification "Default Devices".
Control group (cgroups) controllers MUST be selected and configured by the operator or platform provider. The cgroup devices SHOULD be mounted as read-only.
The serverless platform MAY automatically adjust the resource limits (e.g. CPU) based on observed resource usage. The limits enforced on a container SHOULD be exposed in:

- `/sys/fs/cgroup/memory/memory.limit_in_bytes`
- `/sys/fs/cgroup/cpu/cpu.cfs_period_us`
- `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`
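A sketch of inspecting those cgroup v1 files from inside the container, computing the effective CPU allotment from quota and period; the quota is -1 (or the file absent, e.g. on cgroup v2 hosts) when no CPU limit is enforced.

```go
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
)

// readInt parses a single integer from a cgroup file.
func readInt(path string) (int64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	if memLimit, err := readInt("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
		log.Printf("memory limit: %d bytes", memLimit)
	}
	quota, qErr := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, pErr := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if qErr == nil && pErr == nil && quota > 0 && period > 0 {
		// Effective CPU share: quota microseconds of CPU per period microseconds.
		log.Printf("cpu limit: %.2f cores", float64(quota)/float64(period))
	} else {
		log.Print("no CPU quota enforced (or not a cgroup v1 host)")
	}
}
```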
Additionally, operators or the platform MAY restrict or prevent CPU scheduling for instances when no requests are active, where this capability is available. The Knative authors are currently discussing the best implementations options for this feature with the Kubernetes SIG-Node team.
The sysctl parameter applies system-wide kernel parameter tuning, which could interfere with other workloads on the host system. This is not appropriate for a shared environment, and SHOULD NOT be exposed for developer tuning.
Seccomp provides a mechanism for further restricting the set of linux syscalls permitted to the processes running inside the container environment. A seccomp sandbox MAY be enforced by the platform operator; any such application profiles SHOULD be configured and applied in a consistent mechanism outside of the container specification. A seccomp policy MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.
From the OCI spec:
`rootfsPropagation` (string, OPTIONAL) sets the rootfs's mount propagation. Its value is either `slave`, `private`, `shared`, or `unbindable`. The Shared Subtrees article in the kernel documentation has more information about mount propagation.
This option MAY be set by the operator or platform provider, and MUST NOT be configurable by the developer. Mount propagation MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.
This option MAY be set by the operator or platform provider, and MUST NOT be configurable by the developer. Masked paths MAY be part of the platform security configuration that operators can tune over time as the threat environment changes.
This option MAY only be set by the operator or platform provider, and MUST NOT be configurable by the developer.
Operation hooks SHOULD NOT be configurable by the developer. Operators or platform providers MAY use hooks to implement their own lifecycle controls.
As specified by OCI.