initial implementation of zero-copy deserialization for SimpleMetadata (

astral-sh#1249) (Please review this PR commit by commit.) This PR closes an initial loop on zero-copy deserialization. That is, provides a way to get a `Archived<SimpleMetadata>` (spelled `OwnedArchive<SimpleMetadata>` in the code) from a `CachedClient`. The main benefit of zero-copy deserialization is that we can read bytes from a file, cast those bytes to a structured representation without cost, and then start using that type as any other Rust type. The "catch" is that the structured representation is not the actual type you started with, but the "archived" version of it. In order to make all this work, we ended up needing to shave a rather large yak: we had to re-implement HTTP cache semantics. Previously, we were using the `http-cache-semantics` crate. While it does support Serde, it doesn't support `rkyv`. Moreover, even simple support for `rkyv` wouldn't be enough. What we actually want is for the HTTP cache semantics to be implemented on the *archived* type so that we can decide whether our cached response is stale or not without needing to do a full deserialization into the unarchived type. This is why, in this PR, you'll see `impl ArchivedCachePolicy { ... }` instead of `impl CachePolicy { ... }`. (The `derive(rkyv::Archive)` macro automatically introduces the `ArchivedCachePolicy` type into the current namespace.) Unfortunately, this PR does not fully realize the dream that is zero-copy deserialization. Namely, while a `CachedClient` can now provide an `OwnedArchive<SimpleMetadata>`, the rest of our code doesn't really make use of it. Indeed, as soon as we go to build a `VersionMap`, we eagerly convert our archived metadata into an owned `SimpleMetadata` via deserialization (that *isn't* zero-copy). After this change, a lot of the work now shifts to `rkyv` deserialization and `VersionMap` construction. More precisely, the main thing we drop here is `CachePolicy` deserialization (which is now truly zero-copy) and the parsing of the MessagePack format for `SimpleMetadata`. But we are still paying for deserialization. We're just paying for it in a different place. This PR does seem to bring a speed-up, but it is somewhat underwhelming. My measurements have been pretty noisy, but I get a 1.1x speedup fairly often: ``` $ hyperfine -w5 "puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null" "puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null" ; A kang Benchmark 1: puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null Time (mean ± σ): 164.4 ms ± 18.8 ms [User: 427.1 ms, System: 348.6 ms] Range (min … max): 131.1 ms … 190.5 ms 18 runs Benchmark 2: puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null Time (mean ± σ): 148.3 ms ± 10.2 ms [User: 357.1 ms, System: 319.4 ms] Range (min … max): 136.8 ms … 184.4 ms 19 runs Summary puffin-test pip compile --cache-dir ~/astral/tmp/cache-test ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null ran 1.11 ± 0.15 times faster than puffin-main pip compile --cache-dir ~/astral/tmp/cache-main ~/astral/tmp/reqs/home-assistant-reduced.in -o /dev/null ``` One downside is that this does increase cache size (`rkyv`'s serialization format is not as compact as MessagePack). On disk size increases by about 1.8x for our `simple-v0` cache. ``` $ sort-filesize cache-main 4.0K cache-main/CACHEDIR.TAG 4.0K cache-main/.gitignore 8.0K cache-main/interpreter-v0 8.7M cache-main/wheels-v0 18M cache-main/archive-v0 59M cache-main/simple-v0 109M cache-main/built-wheels-v0 193M cache-main 193M total $ sort-filesize cache-test 4.0K cache-test/CACHEDIR.TAG 4.0K cache-test/.gitignore 8.0K cache-test/interpreter-v0 8.7M cache-test/wheels-v0 18M cache-test/archive-v0 107M cache-test/simple-v0 109M cache-test/built-wheels-v0 242M cache-test 242M total ``` Also, while I initially intended to do a simplistic implementation of HTTP cache semantics, I found that everything was somewhat inter-connected. I could have wrote code that _specifically_ only worked with the present behavior of PyPI, but then it would need to be special cased and everything else would need to continue to use `http-cache-sematics`. By implementing what we need based on what Puffin actually is (which is still less than what `http-cache-semantics` does), we can avoid special casing and use zero-copy deserialization for our cache policy in _all_ cases.
exag · Feb 5, 2024 · d4b4c21 · d4b4c21
1 parent 398659a
commit d4b4c21
Show file tree

Hide file tree

Showing 17 changed files with 2,681 additions and 347 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -50,7 +50,6 @@ hmac = { version = "0.12.1" }
 home = { version = "0.5.9" }
 html-escape = { version = "0.2.13" }
 http = { version = "0.2.11" }
-http-cache-semantics = { version = "1.0.2" }
 indexmap = { version = "2.1.0" }
 indicatif = { version = "0.17.7" }
 indoc = { version = "2.0.4" }

diff --git a/crates/puffin-cache/src/lib.rs b/crates/puffin-cache/src/lib.rs
@@ -402,7 +402,7 @@ pub enum CacheBucket {
     /// The following requirements:
     /// ```text
     /// # git source dist
-    /// pydantic-extra-types @ git+https://github.com/pydantic/pydantic-extra-types.git    
+    /// pydantic-extra-types @ git+https://github.com/pydantic/pydantic-extra-types.git
     /// # pypi source dist
     /// django_allauth==0.51.0
     /// # url source dist
@@ -488,8 +488,8 @@ pub enum CacheBucket {
     /// Index responses through the simple metadata API.
     ///
     /// Cache structure:
-    ///  * `simple-v0/pypi/<package_name>.msgpack`
-    ///  * `simple-v0/<digest(index_url)>/<package_name>.msgpack`
+    ///  * `simple-v0/pypi/<package_name>.rkyv`
+    ///  * `simple-v0/<digest(index_url)>/<package_name>.rkyv`
     ///
     /// The response is parsed into `puffin_client::SimpleMetadata` before storage.
     Simple,
@@ -575,15 +575,15 @@ impl CacheBucket {
                 }
             }
             CacheBucket::Simple => {
-                // For `pypi` wheels, we expect a MsgPack file per package, indexed by name.
+                // For `pypi` wheels, we expect a rkyv file per package, indexed by name.
                 let root = cache.bucket(self).join(WheelCacheKind::Pypi);
-                summary += rm_rf(root.join(format!("{name}.msgpack")))?;
+                summary += rm_rf(root.join(format!("{name}.rkyv")))?;
 
                 // For alternate indices, we expect a directory for every index, followed by a
                 // MsgPack file per package, indexed by name.
                 let root = cache.bucket(self).join(WheelCacheKind::Url);
                 for directory in directories(root) {
-                    summary += rm_rf(directory.join(format!("{name}.msgpack")))?;
+                    summary += rm_rf(directory.join(format!("{name}.rkyv")))?;
                 }
             }
             CacheBucket::FlatIndex => {

diff --git a/crates/puffin-client/Cargo.toml b/crates/puffin-client/Cargo.toml
@@ -23,7 +23,6 @@ fs-err = { workspace = true, features = ["tokio"] }
 futures = { workspace = true }
 html-escape = { workspace = true }
 http = { workspace = true }
-http-cache-semantics = { workspace = true }
 reqwest = { workspace = true }
 reqwest-middleware = { workspace = true }
 reqwest-retry = { workspace = true }

diff --git a/crates/puffin-client/src/cache_headers.rs b/crates/puffin-client/src/cache_headers.rs