list and get specific files from remote zip archives without downloading the whole thing
> [!TIP]
> **New:** Experimental support for mounting a remote zip file as a local directory. See mounting below.
See the GitHub releases for the latest release; you can download binaries directly from GitHub.
`cz` is available as a single binary, so there's no installation; simply stick it somewhere in your `$PATH`.
Alternatively, clone and build the project from source:
```sh
git clone https://github.com/ozkatz/cloudzip.git
cd cloudzip
go build -o cz main.go
```
Then copy the `cz` binary into a location in your `$PATH`:

```sh
cp cz /usr/local/bin/
```
Listing the contents of a zip file without downloading it:

```sh
cz ls s3://example-bucket/path/to/archive.zip
```

Printing a summary of the contents (number of files, total size compressed/uncompressed):

```sh
cz info s3://example-bucket/path/to/archive.zip
```

Downloading and extracting a specific object from within a zip file:

```sh
cz cat s3://example-bucket/path/to/archive.zip images/cat.png > cat.png
```

HTTP proxy mode (see below):

```sh
cz http s3://example-bucket/path
```

Mounting (see below):

```sh
cz mount s3://example-bucket/path/to/archive.zip some_dir/
```

Unmounting:

```sh
cz umount some_dir
```
My use case was a pretty specific access pattern:

> Upload lots of small (~1-100KB) files as quickly as possible, while still allowing random access to them.
How does `cz` solve this? Well, uploading many small files to object stores is hard to do efficiently. Bundling them into one large object and using multipart uploads to parallelize the upload while retaining bigger chunks is the most efficient way. While this is commonly done with `tar`, the tar format doesn't keep an index of the files included in it: scanning the archive until we find the file we're looking for means we might end up downloading the whole thing.

Zip, on the other hand, has a central directory, which is an index! It stores the paths in the archive and their offsets in the file. This index, together with byte range requests (supported by all major object stores), allows reading small files from a large archive without having to fetch the entire thing!
We can even write a zip file directly to remote storage without saving it locally:

```sh
zip -r - -0 * | aws s3 cp - "s3://example-bucket/path/to/archive.zip"
```
Zip files don't have to be compressed! `zip -0` will result in an uncompressed archive, so there's no additional overhead.
Listing is done by issuing two HTTP range requests:

- Fetch the last 64KB of the zip file, looking for the End Of Central Directory (EOCD), and possibly EOCD64.
- The EOCD contains the exact start offset and size of the Central Directory, which is then read by issuing another HTTP range request.

Once the central directory has been read, it is parsed and written to `stdout`, similar to the output of `unzip -l`.
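
To illustrate the idea, here is a minimal sketch (not `cz`'s actual implementation; the URL is a placeholder, and it assumes the server honors `Range` headers): Go's standard `archive/zip` only needs an `io.ReaderAt` plus the total size, so backing it with HTTP range requests yields a remote listing from a few small reads.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
)

// httpReaderAt turns HTTP range requests into an io.ReaderAt,
// letting archive/zip read the central directory remotely.
type httpReaderAt struct {
	url string
}

func (r *httpReaderAt) ReadAt(p []byte, off int64) (int, error) {
	req, err := http.NewRequest(http.MethodGet, r.url, nil)
	if err != nil {
		return 0, err
	}
	// Ask the server for only the bytes we need.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.ReadFull(resp.Body, p)
}

func main() {
	url := "https://example.com/path/to/archive.zip" // placeholder URL

	// The total size normally comes from a HEAD request (Content-Length).
	head, err := http.Head(url)
	if err != nil {
		panic(err)
	}

	// zip.NewReader locates the EOCD and parses the central directory
	// through our ReaderAt, issuing only small range requests.
	zr, err := zip.NewReader(&httpReaderAt{url: url}, head.ContentLength)
	if err != nil {
		panic(err)
	}
	for _, f := range zr.File {
		fmt.Printf("%10d  %s\n", f.UncompressedSize64, f.Name)
	}
}
```

Running this prints an `unzip -l`-style listing while transferring only the tail of the archive, never its full contents.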
Reading a file from the remote zip involves another HTTP range request: once we have the central directory, we find the relevant entry for the file we wish to get and figure out its offset and size. These are then used to issue a third HTTP range request. Because zip files store each file (whether compressed or not) independently, this is enough to decompress the file and write it to `stdout`.
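
Continuing the hedged sketch above (same package and imports), extracting a single entry falls out of the same `*zip.Reader`; this illustrates the flow, not `cz`'s code:

```go
// catEntry streams one entry from the remote archive to w.
// zr is the *zip.Reader built over httpReaderAt in the sketch above.
func catEntry(zr *zip.Reader, name string, w io.Writer) error {
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open() // range-reads this entry's local header + data
		if err != nil {
			return err
		}
		defer rc.Close()
		_, err = io.Copy(w, rc) // decompresses on the fly if needed
		return err
	}
	return fmt.Errorf("%s: not found in archive", name)
}
```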
CloudZip can run in proxy mode, allowing you to read archived files directly from an HTTP client (usually a browser).

```sh
cz http s3://example-bucket/path
```

This will open an HTTP server on a random port (use `--listen` to bind to another address). The server will map the requested path relative to the supplied S3 URL argument. A single query argument, `filename`, should be supplied, referencing the file within the zip file. E.g. `GET /a/b/c.zip?filename=foobar.png` will serve `foobar.png` from within the `s3://example-bucket/path/a/b/c.zip` archive.
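
As a hedged illustration, here is a minimal Go client for the proxy; it assumes the server was started with `--listen localhost:8080` (the flag value and port are assumptions for this example, since by default a random port is chosen):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumption: `cz http s3://example-bucket/path --listen localhost:8080`
	// is running; adjust the address to wherever your proxy is listening.
	resp, err := http.Get("http://localhost:8080/a/b/c.zip?filename=foobar.png")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Stream foobar.png (served out of s3://example-bucket/path/a/b/c.zip)
	// to stdout.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}
```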
Instead of listing and downloading individual files from the remote zip, you can now mount it to a local directory.
```sh
cz mount s3://example-bucket/path/to/archive.zip my_dir/
```
This would show up on your local filesystem as a directory with the contents of the zip archive inside it, as if you had downloaded and extracted it. However... behind the scenes, it fetches only the file listing from the remote zip (just like `cz ls`), spins up a small NFS server listening on localhost, and mounts it to `my_dir/`. When reading files from `my_dir/`, they are downloaded and decompressed on the fly, just like `cz cat` does.
These files are downloaded into a cache directory which, if not explicitly set, will be purged on unmount. To set it to a specific location (and retain it across mount/umount cycles), set the `CLOUDZIP_CACHE_DIR` environment variable:

```sh
export CLOUDZIP_CACHE_DIR="/nvme/fast/cache"
cz mount s3://example-bucket/path/to/archive.zip my_dir/
```
To unmount:

```sh
cz umount my_dir
```

This will unmount the NFS share from the directory and terminate the local NFS server for you.
Demo: mounting a 32GB dataset directly from Kaggle's storage (see Kaggle usage below) as a local directory, with DuckDB reading a single file in about 1 second.
> [!CAUTION]
> Mounting is still experimental (and currently supported only on Linux and macOS).
Set the `CLOUDZIP_LOGGING` environment variable to `DEBUG` to log storage calls to stderr:

```sh
export CLOUDZIP_LOGGING="DEBUG"
cz ls s3://example-bucket/path/to/archive.zip  # will log S3 calls to stderr
```
S3 URLs use the default AWS credentials resolution order. Example:

```sh
cz ls s3://example-bucket/path/to/archive.zip
```
HTTP(S) URLs are also supported. Example:

```sh
cz ls https://example.com/path/to/archive.zip
```
Kaggle's Dataset Download API returns a URL to a zip file, so we can use it easily with `cz`!

Before getting started, generate an API key and store the JSON file in `~/.kaggle/kaggle.json` (see "Authentication" in the Kaggle API docs). Alternatively, you can store `kaggle.json` in a different location and set the `KAGGLE_KEY_FILE` environment variable to its path.

Example:

```sh
cz ls kaggle://{userSlug}/{datasetSlug}
```

For example, for the dataset at https://www.kaggle.com/datasets/datasnaek/youtube-new, the `cz` URL would be `kaggle://datasnaek/youtube-new`.
lakeFS is fully supported. `cz` will probe the lakeFS server for pre-signed URL support and, if supported, will transparently use pre-signed URLs; otherwise it will fetch data through the lakeFS API.

```sh
cz ls lakefs://repository/main/path/to/archive.zip
```
Prefix the path with `file://` to read from the local filesystem. Both relative and absolute paths are accepted.

Example:

```sh
cz ls file://archive.zip             # relative to current directory (./archive.zip)
cz ls file:///home/user/archive.zip  # absolute path (/home/user/archive.zip)
```