cz - Cloud Zip

list and get specific files from remote zip archives without downloading the whole thing

Tip

New: Experimental support for mounting a remote zip file as a local directory. See mounting below.

Installation

Download cloudzip

See releases for the latest release; you can download binaries directly from GitHub.

cz ships as a single binary, so there is nothing to install: simply place it somewhere in your $PATH.

Building from source

Clone and build the project:

git clone https://github.com/ozkatz/cloudzip.git
cd cloudzip
go build -o cz main.go

Then copy the cz binary into a location in your $PATH:

cp cz /usr/local/bin/

Usage

Listing the contents of a zip file without downloading it:

cz ls s3://example-bucket/path/to/archive.zip

Printing a summary of the contents (number of files, total size compressed/uncompressed):

cz info s3://example-bucket/path/to/archive.zip

Downloading and extracting a specific object from within a zip file:

cz cat s3://example-bucket/path/to/archive.zip images/cat.png > cat.png

HTTP proxy mode (see below):

cz http s3://example-bucket/path

Mounting (see below):

cz mount s3://example-bucket/path/to/archive.zip some_dir/

Unmounting:

cz umount some_dir

Why does cz exist?

My use case was a pretty specific access pattern:

Upload lots of small (~1-100 KB) files as quickly as possible, while still allowing random access to them.

How does cz solve this?

Well, uploading many small files to object stores is hard to do efficiently.

Bundling them into one large object and uploading it with multipart uploads, which parallelizes the transfer while keeping each chunk large, is the most efficient approach.

While this is commonly done with tar, the tar format doesn't keep an index of the files included in it. Scanning the archive until we find the file we're looking for means we might end up downloading the whole thing.

Zip, on the other hand, has a central directory, which is an index! It stores paths in the archive and their offset in the file.

This index, together with byte range requests (supported by all major object stores), allows reading one or more small files from a large archive without having to fetch the entire thing!
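As a small illustration of what a byte range request looks like, the sketch below builds the standard HTTP `Range` header for two cases: a fixed window inside an object, and the last N bytes of an object. The function names are hypothetical and are not part of cz; only the header syntax itself comes from the HTTP specification.

```python
def range_header(start: int, length: int) -> dict[str, str]:
    """Header asking for `length` bytes beginning at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return {"Range": f"bytes={start}-{start + length - 1}"}


def suffix_range_header(length: int) -> dict[str, str]:
    """Header asking for the last `length` bytes of the object."""
    if length <= 0:
        raise ValueError("length must be > 0")
    # A suffix range needs no knowledge of the object's total size.
    return {"Range": f"bytes=-{length}"}
```

A suffix range is convenient for reading the end of an archive, where the zip index lives, because the client does not need to know the archive's size up front.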

We can even write a zip file directly to remote storage without saving it locally:

zip -r - -0 * | aws s3 cp - "s3://example-bucket/path/to/archive.zip"

But what about CPU usage? Won't compression slow down the upload?

Zip files don't have to be compressed! zip -0 will result in an uncompressed archive, so there's no additional overhead.
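For readers who build archives programmatically, here is a minimal sketch of the same idea using Python's standard `zipfile` module with the "stored" (no compression) method. It is only an analogue of `zip -0`, not something cz itself does; `build_stored_zip` is a hypothetical helper name.

```python
import io
import zipfile


def build_stored_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory zip whose members are stored without compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


payload = b"hello" * 100
archive = build_stored_zip({"a.txt": payload})

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.getinfo("a.txt")
    # Stored members occupy exactly as many bytes as the original data.
    stored_uncompressed = (
        info.compress_type == zipfile.ZIP_STORED
        and info.compress_size == info.file_size
    )
```

Because nothing is compressed, each member's bytes appear verbatim inside the archive, which is why writing such an archive costs essentially no extra CPU.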

How Does it Work?

cz ls

Listing is done by issuing 2 HTTP range requests:

  1. Fetch the last 64 KB of the zip file, looking for the End Of Central Directory (EOCD) record, and possibly the ZIP64 variant (EOCD64).
  2. The EOCD contains the exact start offset and size of the Central Directory, which is then read by issuing another HTTP range request.

Once the central directory is read, it is parsed and written to stdout, similar to the output of unzip -l.
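The two steps above can be sketched as follows. This is an illustration in Python rather than cz's actual Go code, and it makes several simplifying assumptions: the "remote object" is an in-memory `bytes` value, `read_range` stands in for one HTTP range request, ZIP64 archives are not handled, and file names are assumed to be ASCII or UTF-8. All function names are hypothetical; the record layouts follow the zip format's fixed 22-byte EOCD and 46-byte central directory header.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4s4H2IH"      # fixed 22-byte EOCD record
CDH_SIGNATURE = b"PK\x01\x02"
CDH_FORMAT = "<4s6H3I5H2I"    # fixed 46-byte central directory header
TAIL_SIZE = 64 * 1024


def read_range(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for one HTTP range request against a remote object."""
    return blob[start:start + length]


def locate_central_directory(blob: bytes) -> tuple[int, int]:
    """Request 1: fetch the tail, find the EOCD, return (cd_offset, cd_size)."""
    tail_start = max(0, len(blob) - TAIL_SIZE)
    tail = read_range(blob, tail_start, TAIL_SIZE)
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + struct.calcsize(EOCD_FORMAT) > len(tail):
        raise ValueError("EOCD record not found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, pos)
    cd_size, cd_offset = fields[5], fields[6]
    return cd_offset, cd_size


def list_names(blob: bytes) -> list[str]:
    """Request 2: fetch the central directory and walk its entries."""
    cd_offset, cd_size = locate_central_directory(blob)
    cd = read_range(blob, cd_offset, cd_size)
    fixed = struct.calcsize(CDH_FORMAT)
    names, pos = [], 0
    while pos + fixed <= len(cd):
        fields = struct.unpack_from(CDH_FORMAT, cd, pos)
        if fields[0] != CDH_SIGNATURE:
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        name = cd[pos + fixed:pos + fixed + name_len]
        names.append(name.decode("utf-8"))  # assumption: ASCII/UTF-8 names
        pos += fixed + name_len + extra_len + comment_len
    return names


def build_demo_archive() -> bytes:
    """Small in-memory archive used only to exercise the sketch."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", b"alpha")
        zf.writestr("dir/b.txt", b"beta" * 1000)
    return buf.getvalue()
```

Calling `list_names(build_demo_archive())` walks the same two reads described above: one for the tail and one for the central directory.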

cz cat

Reading a file from the remote zip involves another HTTP range request: once we have the central directory, we find the relevant entry for the file we wish to get, and figure out its offset and size. This is then used to issue a 3rd HTTP range request.

Because zip files store each file (whether compressed or not) independently, this is enough to uncompress and write the file to stdout.
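A rough Python sketch of this step is shown below, under stated assumptions: the archive is an in-memory `bytes` value, `read_range` stands in for an HTTP range request, and `lookup_entry` uses Python's `zipfile` only as a stand-in for a central directory that has already been parsed. The sketch reads the local file header and the payload separately for clarity; it does not claim to mirror how cz sizes its single request. All function names are hypothetical.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"
LOCAL_HEADER_FORMAT = "<4s5H3I2H"  # fixed 30-byte local file header
LOCAL_HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FORMAT)


def read_range(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for one HTTP range request against a remote object."""
    return blob[start:start + length]


def lookup_entry(blob: bytes, name: str) -> tuple[int, int, int]:
    """Stand-in for a parsed central directory: (offset, size, method)."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        info = zf.getinfo(name)
        return info.header_offset, info.compress_size, info.compress_type


def cat_member(blob: bytes, name: str) -> bytes:
    """Read one member using only its own byte range of the archive."""
    header_offset, compress_size, method = lookup_entry(blob, name)
    header = read_range(blob, header_offset, LOCAL_HEADER_SIZE)
    fields = struct.unpack(LOCAL_HEADER_FORMAT, header)
    if fields[0] != LOCAL_HEADER_SIGNATURE:
        raise ValueError("bad local file header")
    name_len, extra_len = fields[9], fields[10]
    # The payload starts right after the variable-length name and extra field.
    data_start = header_offset + LOCAL_HEADER_SIZE + name_len + extra_len
    payload = read_range(blob, data_start, compress_size)
    if method == zipfile.ZIP_STORED:
        return payload
    if method == zipfile.ZIP_DEFLATED:
        return zlib.decompress(payload, -15)  # -15 selects raw deflate
    raise NotImplementedError(f"unsupported compression method {method}")


def build_demo_archive() -> bytes:
    """Small in-memory archive with one stored and one deflated member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("raw.bin", b"\x00\x01\x02" * 10,
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("text.txt", b"meow " * 500,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```

Only the bytes belonging to the requested member are touched, which is what makes extraction independent of the archive's total size.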

⚠️ Experimental: cz http

CloudZip can run in proxy mode, allowing you to read archived files directly from an HTTP client (usually a browser).

cz http s3://example-bucket/path

This will open an HTTP server on a random port (use --listen to bind to a specific address). The server maps the requested path relative to the supplied S3 URL argument. A single query argument, filename, should be supplied, referencing the file within the zip file. For example, GET /a/b/c.zip?filename=foobar.png will serve foobar.png from within the s3://example-bucket/path/a/b/c.zip archive.
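The path mapping described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not cz's implementation; `resolve_request` is a hypothetical name, and details such as escaping or rejecting repeated `filename` arguments are assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def resolve_request(base_url: str, request_target: str) -> tuple[str, str]:
    """Map an incoming request target to (archive URL, member name)."""
    parts = urlsplit(request_target)
    filenames = parse_qs(parts.query).get("filename", [])
    if len(filenames) != 1:
        raise ValueError("exactly one 'filename' query argument is required")
    # The request path is resolved relative to the base URL given to the proxy.
    archive_url = base_url.rstrip("/") + "/" + parts.path.lstrip("/")
    return archive_url, filenames[0]
```

For the example in the text, `resolve_request("s3://example-bucket/path", "/a/b/c.zip?filename=foobar.png")` yields the archive `s3://example-bucket/path/a/b/c.zip` and the member `foobar.png`.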

⚠️ Experimental: cz mount

Instead of listing and downloading individual files from the remote zip, you can now mount it to a local directory.

cz mount s3://example-bucket/path/to/archive.zip my_dir/

The archive shows up on your local filesystem as a directory with the contents of the zip archive inside it, as if you had downloaded and extracted it.

Behind the scenes, however, cz fetches only the file listing from the remote zip (just like cz ls), spins up a small NFS server listening on localhost, and mounts it at my_dir/.

When you read a file from my_dir/, it is first downloaded and decompressed on the fly, just as cz cat does.

These files are downloaded into a cache directory. Unless a location is explicitly set, the cache is purged on unmount. To set it to a specific location (and retain it across mount/umount cycles), set the CLOUDZIP_CACHE_DIR environment variable:

export CLOUDZIP_CACHE_DIR="/nvme/fast/cache"
cz mount s3://example-bucket/path/to/archive.zip my_dir/

To unmount:

cz umount my_dir

This unmounts the NFS share from the directory and terminates the local NFS server for you.

Mounting, illustrated:

[Image: Demo]

Mounting a 32 GB dataset directly from Kaggle's storage (see Kaggle usage below) as a local directory, with DuckDB reading a single file with ~1 second load time.

Caution

This is still experimental (and only supported on Linux and macOS for now).

Logging

Set the $CLOUDZIP_LOGGING environment variable to DEBUG to log storage calls to stderr:

export CLOUDZIP_LOGGING="DEBUG"
cz ls s3://example-bucket/path/to/archive.zip  # will log S3 calls to stderr

Supported backends

AWS S3

cz uses the default AWS credentials resolution order.

Example:

cz ls s3://example-bucket/path/to/archive.zip

HTTP / HTTPS

Example:

cz ls https://example.com/path/to/archive.zip

Kaggle

Kaggle's Dataset Download API returns a URL for a zip file, so we can use it easily with cz! Before getting started, generate an API key and store the JSON file in ~/.kaggle/kaggle.json (see "Authentication" in the Kaggle API docs).

Alternatively, you can store kaggle.json in a different location and set the KAGGLE_KEY_FILE environment variable to its path.

Example:

cz ls kaggle://{userSlug}/{datasetSlug}

For example, for the dataset at https://www.kaggle.com/datasets/datasnaek/youtube-new, the cz url should be kaggle://datasnaek/youtube-new.
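As a convenience, the mapping from a dataset page URL to a cz URL can be sketched like this. The helper name `kaggle_cz_url` is hypothetical and not part of cz; the sketch assumes the page URL has the `/datasets/{userSlug}/{datasetSlug}` shape shown above.

```python
from urllib.parse import urlsplit


def kaggle_cz_url(dataset_page_url: str) -> str:
    """Turn a Kaggle dataset page URL into a kaggle:// URL for cz."""
    parts = urlsplit(dataset_page_url)
    segments = [s for s in parts.path.split("/") if s]
    if (
        parts.netloc not in ("www.kaggle.com", "kaggle.com")
        or len(segments) < 3
        or segments[0] != "datasets"
    ):
        raise ValueError(f"not a Kaggle dataset URL: {dataset_page_url}")
    user_slug, dataset_slug = segments[1], segments[2]
    return f"kaggle://{user_slug}/{dataset_slug}"
```

The result can then be passed to cz, for example as the argument to cz ls.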

lakeFS

lakeFS is fully supported. cz probes the lakeFS server for pre-signed URL support and, if it is available, uses pre-signed URLs transparently. Otherwise, it fetches data through the API.

cz ls lakefs://repository/main/path/to/archive.zip

Local files

Prefix the path with file:// to read from the local filesystem. Both relative and absolute paths are accepted.

Example:

cz ls file://archive.zip  # relative to current directory (./archive.zip)
cz ls file:///home/user/archive.zip  # absolute path (/home/user/archive.zip)
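One way to read the convention above: everything after the file:// prefix is treated as the path. The Python sketch below illustrates that reading and contrasts it with a generic URL parser, which would interpret `archive.zip` in `file://archive.zip` as a host name. How cz actually parses these URLs is not stated here, so `local_path_from_url` is a hypothetical helper built on that assumption.

```python
import posixpath
from urllib.parse import urlsplit

FILE_PREFIX = "file://"


def local_path_from_url(url: str) -> str:
    """Return the filesystem path encoded in a file:// URL, cz-style."""
    if not url.startswith(FILE_PREFIX):
        raise ValueError("expected a file:// URL")
    path = url[len(FILE_PREFIX):]
    if not path:
        raise ValueError("empty path")
    return path


relative = local_path_from_url("file://archive.zip")
absolute = local_path_from_url("file:///home/user/archive.zip")

# A generic URL parser puts "archive.zip" in the host part instead,
# which is why the prefix is stripped by hand in this sketch.
generic = urlsplit("file://archive.zip")
```

Here `relative` is `archive.zip` (resolved against the current directory when opened) and `absolute` is `/home/user/archive.zip`.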
