This folder provides Rust benchmarks.
```shell
$ cargo bench                           # Just run benchmarks
$ cargo bench -- --quick                # Just run benchmarks, with fewer samples
$ cargo bench -- --profile-time 10      # Run benchmarks with CPU profile; results will be in out/rust/criterion/<group>/<test>/profile/profile.pb

$ # Compare to a baseline
$ cargo bench -- --save-baseline <name> # Save a baseline
$ # ...change something...
$ cargo bench -- --baseline <name>      # Compare against it
```
Ztunnel performance largely falls into two buckets: throughput and latency. While these are sometimes at odds with each other, ztunnel is a generic proxy, so we aim to make it perform well on both metrics.
The primary responsibility of the proxy is copying bits between peers. Currently, this is always either `TCP<-->TCP` or `TCP<-->HBONE`.
The `TCP<-->TCP` case is the simplest, and common amongst many proxies. `copy.rs` does the bulk of the work, essentially just bidirectionally copying bytes between the two sockets.
Typical bidirectional copies use a fixed-size buffer. To adapt to various workloads, we use dynamically sized buffers that can grow from 1kb -> 16kb -> 256kb as enough traffic is received. This allows high-throughput workloads to perform well, without excessive memory costs for low-bandwidth services.
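As a rough illustration, here is a minimal sketch of such a tiered buffer. The tier sizes match the 1kb -> 16kb -> 256kb progression above, but the growth trigger (grow after moving a fixed multiple of the current buffer size) is an illustrative assumption, not ztunnel's actual heuristic:

```rust
// Tier sizes from the description above: 1kb -> 16kb -> 256kb.
const TIERS: &[usize] = &[1024, 16 * 1024, 256 * 1024];

struct DynamicBuffer {
    buf: Vec<u8>,
    tier: usize,
    bytes_since_resize: u64,
}

impl DynamicBuffer {
    fn new() -> Self {
        Self { buf: vec![0; TIERS[0]], tier: 0, bytes_since_resize: 0 }
    }

    /// Record `n` bytes copied; grow to the next tier once enough data has
    /// moved to suggest a high-throughput connection.
    fn record(&mut self, n: usize) {
        self.bytes_since_resize += n as u64;
        // Hypothetical policy: grow after filling the current buffer ~16 times.
        if self.tier + 1 < TIERS.len()
            && self.bytes_since_resize > 16 * self.buf.len() as u64
        {
            self.tier += 1;
            self.buf.resize(TIERS[self.tier], 0);
            self.bytes_since_resize = 0;
        }
    }
}

fn main() {
    let mut b = DynamicBuffer::new();
    for _ in 0..64 {
        let n = b.buf.len(); // pretend each read filled the whole buffer
        b.record(n);
    }
    assert_eq!(b.buf.len(), 256 * 1024);
    println!("final buffer size: {}", b.buf.len());
}
```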
The `TCP<-->HBONE` case ends up being much more complex, as we flow through HTTP/2 and TLS. The full flow looks as such (pseudocode):
```text
copy_bidi():
  loop {
    data = tcp_in.read(up to 256k)  # based on dynamic buffer size
    h2.write(data)
  }

h2::write(data):
  Buffer data as a DATA frame, up to a max of `max_send_buffer_size`. We configure this to 256k.
  Asynchronously, the connection driver will pick up this data and call `rustls.write_vectored([256 bytes, rest of data])`.

rustls::write(data):
  data = encrypt(data)
  # TLS records are at most 16k.
  # In practice I have observed at most 4 chunks; unclear where this is configured.
  tcp_out.write_vectored([chunks of 16k])
```
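Both the 256k send buffer and the 1mb frame size cap discussed below are standard `h2` builder knobs. A minimal sketch of tuning a client connection this way, assuming the public `h2::client::Builder` API (ztunnel's actual setup may set more options than shown):

```rust
use h2::client;

// Sketch: an h2 client builder tuned with the values described in this
// document. These are real `h2::client::Builder` options; which other
// options ztunnel sets is not shown here.
fn tuned_builder() -> client::Builder {
    let mut b = client::Builder::new();
    b.max_send_buffer_size(256 * 1024); // buffer up to 256k per stream before the driver drains it
    b.max_frame_size(1024 * 1024);      // allow DATA frames up to 1mb
    b
}
```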
From an `iperf` load, this ends up looking something like this in `strace`:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.21    0.841290           5    140711           writev
 44.78    0.682359          17     38481           recvfrom
```
This will be from `writev([16kb * 4])` calls and `recvfrom(256kb)`.
This flow is substantially different from the inverse direction. The receive flow is driven by `h2`, which under the hood uses a `LengthDelimitedCodec`. `h2` will attempt to decode one frame at a time, using an internal buffer. This buffer starts at 8kb but will grow to meet the size of incoming frames; we allow frame sizes up to a max of 1mb (`config.frame_size`).
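To illustrate the framing model, here is a standalone sketch using `tokio_util`'s public `LengthDelimitedCodec` directly. This is an analogy to what `h2` does internally, not ztunnel's code; it assumes the `tokio`, `tokio-util` (with the `codec` feature), and `futures` crates:

```rust
use futures::StreamExt;
use tokio::io::AsyncRead;
use tokio_util::codec::{FramedRead, LengthDelimitedCodec};

// Sketch: a length-delimited decoder accumulates bytes in an internal buffer
// and yields one complete frame at a time, with a cap on the maximum frame
// size (1mb here, mirroring config.frame_size above).
async fn read_frames<R: AsyncRead + Unpin>(io: R) {
    let codec = LengthDelimitedCodec::builder()
        .max_frame_length(1024 * 1024) // reject frames larger than 1mb
        .new_codec();
    let mut frames = FramedRead::new(io, codec);
    while let Some(frame) = frames.next().await {
        match frame {
            Ok(bytes) => println!("frame of {} bytes", bytes.len()),
            Err(e) => {
                eprintln!("framing error: {e}");
                break;
            }
        }
    }
}
```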
Filling that buffer ultimately calls `rustls.read(buf)`. This goes through a few indirections, but ends up in rustls's `deframer_buffer`, which is what calls `read()` on the underlying IO, in our case the TCP connection. This buffer is configured to do 4kb reads generally. Upon reading frames from the wire, `h2` buffers them up, and we read them in `recv_stream.poll_data`, triggered by the `copy_bidirectional`. Finally, this writes out one DATA frame's worth of data to the upstream TCP connection.
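A minimal sketch of that last step, using the public `h2` and `tokio` APIs; this mirrors the shape of the flow described above, not ztunnel's exact code:

```rust
use h2::RecvStream;
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Sketch: drain one DATA frame at a time from an h2 RecvStream, write it to
// the upstream TCP connection, and release flow-control capacity as we go.
async fn h2_to_tcp(
    mut recv: RecvStream,
    tcp_out: &mut TcpStream,
) -> Result<(), Box<dyn std::error::Error>> {
    while let Some(frame) = recv.data().await {
        let data = frame?;
        // One DATA frame's worth of bytes, written straight to the socket.
        tcp_out.write_all(&data).await?;
        // Tell the sender it may transmit this many more bytes.
        recv.flow_control().release_capacity(data.len())?;
    }
    Ok(())
}
```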
From an `iperf` load, this ends up looking something like this in `strace`:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.08    1.253541          50     24703           sendto
 38.19    0.783733           2    360707         8 recvfrom
```
This will be from `sendto(256kb)` calls, with many `recvfrom()` calls ranging from 4k to 16k.
For comparison, under an `iperf` load, the Envoy client:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.24    1.363149           3    440440         1 sendto
 31.72    0.633584          11     55114        31 readv
```
This is from many `sendto(16k)` calls, and `readv([16k]*8)`.
The Envoy server:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.24    1.199264           1    757275         8 recvfrom
 34.73    0.638315          26     23670           writev
```
This is from many calls of `recvfrom(5); recvfrom(16k)`, and `writev([16k]*16)`.
(All `strace` commands are looking at `-e trace=write,writev,read,recvfrom,sendto,readv`.)
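For reference, a summary like the tables above can be reproduced with something like the following; attaching to a process named `ztunnel` via `pgrep` is an assumption, so point `-p` at whichever proxy process you are measuring:

```shell
$ strace -f -c -e trace=write,writev,read,recvfrom,sendto,readv -p "$(pgrep ztunnel)"
```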