Skip to content

Commit

Permalink
crypto/tls: replace custom *block with standard buffers
Browse files Browse the repository at this point in the history
The crypto/tls record layer used a custom buffer implementation with its
own semantics, freelist, and offset management. Replace it all with
per-task bytes.Buffer, bytes.Reader and byte slices, along with a
refactor of all the encrypt and decrypt code.

The main quirk of *block was to do a best-effort read past the record
boundary, so that if a closeNotify was waiting it would be peeked and
surfaced along with the last Read. Address that with atLeastReader and
ReadFrom to avoid a useless copy (instead of a LimitReader or CopyN).

There was also an optimization to split blocks along record boundary
lines without having to copy in and out the data. Replicate that by
aliasing c.input into consumed c.rawInput (after an in-place decrypt
operation). This is safe because c.rawInput is not used until c.input is
drained.

The benchmarks are noisy but look like an improvement across the board,
which is a nice side effect :)

name                                       old time/op   new time/op   delta
HandshakeServer/RSA-8                        817µs ± 2%    797µs ± 2%  -2.52%  (p=0.000 n=10+9)
HandshakeServer/ECDHE-P256-RSA-8             984µs ±11%    897µs ± 0%  -8.89%  (p=0.000 n=10+9)
HandshakeServer/ECDHE-P256-ECDSA-P256-8      206µs ±10%    199µs ± 3%    ~     (p=0.113 n=10+9)
HandshakeServer/ECDHE-X25519-ECDSA-P256-8    204µs ± 3%    202µs ± 1%  -1.06%  (p=0.013 n=10+9)
HandshakeServer/ECDHE-P521-ECDSA-P521-8     15.5ms ± 0%   15.6ms ± 1%    ~     (p=0.095 n=9+10)
Throughput/MaxPacket/1MB-8                  5.35ms ±19%   5.39ms ±36%    ~     (p=1.000 n=9+10)
Throughput/MaxPacket/2MB-8                  9.20ms ±15%   8.30ms ± 8%  -9.79%  (p=0.035 n=10+9)
Throughput/MaxPacket/4MB-8                  13.8ms ± 7%   13.6ms ± 8%    ~     (p=0.315 n=10+10)
Throughput/MaxPacket/8MB-8                  25.1ms ± 3%   23.2ms ± 2%  -7.66%  (p=0.000 n=10+9)
Throughput/MaxPacket/16MB-8                 46.9ms ± 1%   43.0ms ± 3%  -8.29%  (p=0.000 n=9+10)
Throughput/MaxPacket/32MB-8                 88.9ms ± 2%   82.3ms ± 2%  -7.40%  (p=0.000 n=9+9)
Throughput/MaxPacket/64MB-8                  175ms ± 2%    164ms ± 4%  -6.18%  (p=0.000 n=10+10)
Throughput/DynamicPacket/1MB-8              5.79ms ±26%   5.82ms ±22%    ~     (p=0.912 n=10+10)
Throughput/DynamicPacket/2MB-8              9.23ms ±14%   9.50ms ±23%    ~     (p=0.971 n=10+10)
Throughput/DynamicPacket/4MB-8              14.5ms ±11%   13.8ms ± 6%  -4.66%  (p=0.019 n=10+10)
Throughput/DynamicPacket/8MB-8              25.6ms ± 4%   23.5ms ± 3%  -8.33%  (p=0.000 n=10+10)
Throughput/DynamicPacket/16MB-8             47.3ms ± 3%   44.6ms ± 7%  -5.65%  (p=0.000 n=10+10)
Throughput/DynamicPacket/32MB-8             91.9ms ±14%   85.0ms ± 4%  -7.55%  (p=0.000 n=10+10)
Throughput/DynamicPacket/64MB-8              177ms ± 2%    168ms ± 4%  -4.97%  (p=0.000 n=8+10)
Latency/MaxPacket/200kbps-8                  694ms ± 0%    694ms ± 0%    ~     (p=0.315 n=10+9)
Latency/MaxPacket/500kbps-8                  279ms ± 0%    279ms ± 0%    ~     (p=0.447 n=9+10)
Latency/MaxPacket/1000kbps-8                 140ms ± 0%    140ms ± 0%    ~     (p=0.661 n=9+10)
Latency/MaxPacket/2000kbps-8                71.1ms ± 0%   71.1ms ± 0%  +0.05%  (p=0.019 n=9+9)
Latency/MaxPacket/5000kbps-8                30.4ms ± 7%   30.5ms ± 4%    ~     (p=0.720 n=9+10)
Latency/DynamicPacket/200kbps-8              134ms ± 0%    134ms ± 0%    ~     (p=0.075 n=10+10)
Latency/DynamicPacket/500kbps-8             54.8ms ± 0%   54.8ms ± 0%    ~     (p=0.631 n=10+10)
Latency/DynamicPacket/1000kbps-8            28.5ms ± 0%   28.5ms ± 0%    ~     (p=1.000 n=8+8)
Latency/DynamicPacket/2000kbps-8            15.7ms ±12%   16.1ms ± 0%    ~     (p=0.109 n=10+7)
Latency/DynamicPacket/5000kbps-8            8.20ms ±26%   8.17ms ±13%    ~     (p=1.000 n=9+9)

name                                       old speed     new speed     delta
Throughput/MaxPacket/1MB-8                 193MB/s ±14%  202MB/s ±30%    ~     (p=0.897 n=8+10)
Throughput/MaxPacket/2MB-8                 230MB/s ±14%  249MB/s ±17%    ~     (p=0.089 n=10+10)
Throughput/MaxPacket/4MB-8                 304MB/s ± 6%  309MB/s ± 7%    ~     (p=0.315 n=10+10)
Throughput/MaxPacket/8MB-8                 334MB/s ± 3%  362MB/s ± 2%  +8.29%  (p=0.000 n=10+9)
Throughput/MaxPacket/16MB-8                358MB/s ± 1%  390MB/s ± 3%  +9.08%  (p=0.000 n=9+10)
Throughput/MaxPacket/32MB-8                378MB/s ± 2%  408MB/s ± 2%  +8.00%  (p=0.000 n=9+9)
Throughput/MaxPacket/64MB-8                384MB/s ± 2%  410MB/s ± 4%  +6.61%  (p=0.000 n=10+10)
Throughput/DynamicPacket/1MB-8             178MB/s ±24%  182MB/s ±24%    ~     (p=0.604 n=9+10)
Throughput/DynamicPacket/2MB-8             228MB/s ±13%  225MB/s ±20%    ~     (p=0.971 n=10+10)
Throughput/DynamicPacket/4MB-8             291MB/s ±10%  305MB/s ± 6%  +4.83%  (p=0.019 n=10+10)
Throughput/DynamicPacket/8MB-8             327MB/s ± 4%  357MB/s ± 3%  +9.08%  (p=0.000 n=10+10)
Throughput/DynamicPacket/16MB-8            355MB/s ± 3%  376MB/s ± 6%  +6.07%  (p=0.000 n=10+10)
Throughput/DynamicPacket/32MB-8            366MB/s ±12%  395MB/s ± 4%  +7.91%  (p=0.000 n=10+10)
Throughput/DynamicPacket/64MB-8            380MB/s ± 2%  400MB/s ± 4%  +5.26%  (p=0.000 n=8+10)

Note that this reduced the buffer for the first read from 1024 to 5+512,
so it triggered the issue described at golang#24198 when using a synchronous
net.Pipe: the first server flight was not being consumed entirely by the
first read anymore, causing a deadlock as both the client and the server
were trying to send (the client a reply to the ServerHello, the server
the rest of the buffer). Fixed by rebasing on top of CL 142817.

Change-Id: Ie31b0a572b2ad37878469877798d5c6a5276f931
Reviewed-on: https://go-review.googlesource.com/c/142818
Run-TryBot: Filippo Valsorda <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Adam Langley <[email protected]>
  • Loading branch information
FiloSottile committed Oct 24, 2018
1 parent 202e903 commit ab51b1d
Show file tree
Hide file tree
Showing 3 changed files with 287 additions and 376 deletions.
30 changes: 18 additions & 12 deletions src/crypto/tls/cipher_suites.go
Original file line number Diff line number Diff line change
Expand Up @@ -134,25 +134,29 @@ func macSHA1(version uint16, key []byte) macFunction {
copy(mac.key, key)
return mac
}
return tls10MAC{hmac.New(newConstantTimeHash(sha1.New), key)}
return tls10MAC{h: hmac.New(newConstantTimeHash(sha1.New), key)}
}

// macSHA256 returns a SHA-256 based MAC. These are only supported in TLS 1.2
// so the given version is ignored.
func macSHA256(version uint16, key []byte) macFunction {
return tls10MAC{hmac.New(sha256.New, key)}
return tls10MAC{h: hmac.New(sha256.New, key)}
}

type macFunction interface {
// Size returns the length of the MAC.
Size() int
MAC(digestBuf, seq, header, data, extra []byte) []byte
// MAC appends the MAC of (seq, header, data) to out. The extra data is fed
// into the MAC after obtaining the result to normalize timing. The result
// is only valid until the next invocation of MAC as the buffer is reused.
MAC(seq, header, data, extra []byte) []byte
}

type aead interface {
cipher.AEAD

// explicitIVLen returns the number of bytes used by the explicit nonce
// that is included in the record. This is eight for older AEADs and
// explicitNonceLen returns the number of bytes of explicit nonce
// included in each record. This is eight for older AEADs and
// zero for modern ones.
explicitNonceLen() int
}
Expand Down Expand Up @@ -245,6 +249,7 @@ func aeadChaCha20Poly1305(key, fixedNonce []byte) cipher.AEAD {
type ssl30MAC struct {
h hash.Hash
key []byte
buf []byte
}

func (s ssl30MAC) Size() int {
Expand All @@ -257,7 +262,7 @@ var ssl30Pad2 = [48]byte{0x5c, 0x5c, 0x5c, 0x5c, 0x5c, 0x5c, 0x5c, 0x5c, 0x5c, 0

// MAC does not offer constant timing guarantees for SSL v3.0, since it's deemed
// useless considering the similar, protocol-level POODLE vulnerability.
func (s ssl30MAC) MAC(digestBuf, seq, header, data, extra []byte) []byte {
func (s ssl30MAC) MAC(seq, header, data, extra []byte) []byte {
padLength := 48
if s.h.Size() == 20 {
padLength = 40
Expand All @@ -270,13 +275,13 @@ func (s ssl30MAC) MAC(digestBuf, seq, header, data, extra []byte) []byte {
s.h.Write(header[:1])
s.h.Write(header[3:5])
s.h.Write(data)
digestBuf = s.h.Sum(digestBuf[:0])
s.buf = s.h.Sum(s.buf[:0])

s.h.Reset()
s.h.Write(s.key)
s.h.Write(ssl30Pad2[:padLength])
s.h.Write(digestBuf)
return s.h.Sum(digestBuf[:0])
s.h.Write(s.buf)
return s.h.Sum(s.buf[:0])
}

type constantTimeHash interface {
Expand Down Expand Up @@ -304,7 +309,8 @@ func newConstantTimeHash(h func() hash.Hash) func() hash.Hash {

// tls10MAC implements the TLS 1.0 MAC function. RFC 2246, Section 6.2.3.
type tls10MAC struct {
h hash.Hash
h hash.Hash
buf []byte
}

func (s tls10MAC) Size() int {
Expand All @@ -314,12 +320,12 @@ func (s tls10MAC) Size() int {
// MAC is guaranteed to take constant time, as long as
// len(seq)+len(header)+len(data)+len(extra) is constant. extra is not fed into
// the MAC, but is only provided to make the timing profile constant.
func (s tls10MAC) MAC(digestBuf, seq, header, data, extra []byte) []byte {
func (s tls10MAC) MAC(seq, header, data, extra []byte) []byte {
s.h.Reset()
s.h.Write(seq)
s.h.Write(header)
s.h.Write(data)
res := s.h.Sum(digestBuf[:0])
res := s.h.Sum(s.buf[:0])
if extra != nil {
s.h.Write(extra)
}
Expand Down
Loading

0 comments on commit ab51b1d

Please sign in to comment.