Chat with us via the orange icon on quiltdata.com.
It's easy to install code dependencies with projects like pip and npm. But what about data dependencies? That's where quilt comes in.
Install, compile, and version data with Quilt.
- Install and import data with simple one-liners:
from quilt.data.bob import sales
- Compile files into memory-mapped binary data frames that load 5X to 20X faster than text files
- Version your data. Hash, tag, and version quilt data packages.
A data package is a namespace of binary data frames. You can use data packages from the community, publish packages for others to use, or keep packages private to you.
quilt is the command-line client that builds, retrieves, and stores packages. quilt works in conjunction with a server-side registry, not covered in this document. quilt currently pushes to and pulls from the registry at quiltdata.com.
- Data packages built in Python 3.x are not always backwards compatible with Python 2.7+. This happens because pickle is not backwards compatible across major Python versions. Workaround: If you need to use Python 2.7 and Python 3.x, build packages on Python 2.7.
- Anaconda with Python 2.7 has an old version of setuptools. Strangely, pip install --upgrade setuptools run three times, yes three times, will ultimately succeed.
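The pickle issue comes down to protocol versions: a Python 3 interpreter can write pickle protocols that Python 2.7 cannot read, while protocol 2 is readable by both. The sketch below only illustrates that constraint with the standard library; it is not Quilt's serialization code, and pickle_protocol is a hypothetical helper introduced here. A matching protocol number is necessary but may not be sufficient for complex objects such as data frames.

```python
import pickle


def pickle_protocol(payload):
    """Return the protocol number recorded in a pickle payload.

    Protocols 2 and above start with the PROTO opcode (0x80) followed by
    a one-byte protocol number. Protocols 0 and 1 have no such header,
    so None is returned for them.
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    return None


record = {"name": "wine", "rows": 1599}

# Protocol 2 is the highest protocol that Python 2.7 can read.
portable = pickle.dumps(record, protocol=2)

# The highest protocol of a Python 3 interpreter is not readable on 2.7.
py3_only = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

print("portable payload uses protocol", pickle_protocol(portable))
print("Python 3 payload uses protocol", pickle_protocol(py3_only))
```

This is the reasoning behind the workaround above: anything written by Python 2.7 uses a protocol that Python 3 can still read.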
- Open Terminal
- $ pip install quilt
- $ quilt install examples/wine (install a sample package)
- $ python (fire up Python)
- You've got data frames:
from quilt.data.examples import wine
wine.quality.red # this is a pandas.DataFrame
pip install git+https://github.com/quiltdata/quilt.git (more up-to-date than pip install quilt)
Let's install the public package examples/wine
quilt install examples/wine
Now let's fire up Python and import the package.
$ python
>>> from quilt.data.examples import wine
The import syntax is from quilt.data.USER import PACKAGE.
Let's see what's in the wine package:
>>> wine
<class 'quilt.data.DataNode'>
File: /Users/kmoore/toa/github/quilt2/quilt_packages/examples/wine.json
Path: /
README/
quality/
>>> wine.quality
<class 'quilt.data.DataNode'>
File: /Users/kmoore/toa/github/quilt2/quilt_packages/examples/wine.json
Path: /quality/
red/
white/
>>> type(wine.quality.red)
<class 'pandas.core.frame.DataFrame'>
>>> wine.quality.red
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
...
[1599 rows x 12 columns]
The simplest way to create a data package is from a set of input files. Quilt's build command can take a source file directory as a parameter and automatically build a package based on its contents.
quilt build USER/PACKAGE -d PATH_TO_INPUT_FILES
That will create a data package USER/PACKAGE on your local machine. You can inspect the contents using:
quilt inspect USER/PACKAGE
You can now use your package locally:
from quilt.data.USER import PACKAGE
Data packages deserialize 5x to 20x faster than text files because loading involves little to no parsing: the data is simply copied from disk into memory.
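To see why skipping the parsing step matters, the following sketch loads the same numbers once from CSV text and once from a raw binary buffer. It uses only the standard library and is an illustration of the general idea, not of Quilt's on-disk format; the function names are hypothetical and the timings will vary by machine.

```python
import array
import csv
import io
import timeit

values = [i * 0.5 for i in range(1000)]

# Text representation: every number must be parsed from characters.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerows([v] for v in values)
csv_text = text_buffer.getvalue()

# Binary representation: 8-byte doubles in native byte order.
binary_blob = array.array("d", values).tobytes()


def load_from_text(text):
    """Parse one float per CSV row."""
    return [float(row[0]) for row in csv.reader(io.StringIO(text))]


def load_from_binary(blob):
    """Copy raw bytes straight into a typed array; no parsing step."""
    loaded = array.array("d")
    loaded.frombytes(blob)
    return loaded.tolist()


# Both loaders recover exactly the same values.
assert load_from_text(csv_text) == values
assert load_from_binary(binary_blob) == values

# Time each loader; only the relative difference is of interest.
print("text  :", timeit.timeit(lambda: load_from_text(csv_text), number=20))
print("binary:", timeit.timeit(lambda: load_from_binary(binary_blob), number=20))
```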
Running quilt build USER/PACKAGE -d PATH as described above generates a data package and a file, build.yml, that specifies the contents of the package.
Your file should look something like this:
---
contents:
  one:
    file: src/bar/your.txt
    transform: csv
  two:
    file: another.csv
  um:
    buckle:
      file: finance/excel_file.xls
    my:
      file: numbers/excel_file.xlsx
    shoe:
      transform: tsv
      file: measurements.txt
...
The above build.yml tells quilt how to build a package from a set of input files. By editing the automatically generated build.yml or creating a configuration file of your own, you can control the exact names of DataFrames and files in your package.
The tree structure under contents dictates the package tree. If the package is imported as foo, then foo.one and foo.two will import as data frames. foo.um is a group containing three data frames. foo.um.buckle is a data frame, etc.
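The mapping from the contents tree to import paths can be sketched as a short recursive walk. This is a minimal illustration, assuming that any node with a file key is a leaf (data frame) and every other node is a group; package_paths is a hypothetical helper, not part of quilt.

```python
def package_paths(contents, prefix):
    """Yield (dotted_path, kind) pairs for a build.yml-style tree.

    A node with a "file" key is treated as a leaf (data frame); any other
    mapping is treated as a group whose children are visited in turn.
    """
    for name, node in contents.items():
        path = f"{prefix}.{name}"
        if "file" in node:
            yield path, "data frame"
        else:
            yield path, "group"
            yield from package_paths(node, path)


# The contents tree from the build.yml above, written as a Python dict.
contents = {
    "one": {"file": "src/bar/your.txt", "transform": "csv"},
    "two": {"file": "another.csv"},
    "um": {
        "buckle": {"file": "finance/excel_file.xls"},
        "my": {"file": "numbers/excel_file.xlsx"},
        "shoe": {"transform": "tsv", "file": "measurements.txt"},
    },
}

for path, kind in package_paths(contents, "foo"):
    print(path, "->", kind)
```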
The transform key specifies the parser. This is useful when the file extension does not match the file format (e.g. your.txt is actually in CSV format). Without a transform key, the build system will try to infer the parser from the file extension. To prevent a file from being compiled, and simply copy it into the package as is, set transform: id.
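The fallback from an explicit transform to the file extension might look roughly like the sketch below. The parser names come from this document; the extension table, the choose_transform name, and the decision to raise an error for unknown extensions are assumptions made for illustration, and the real build system may behave differently.

```python
import os

# Parser names listed in this document; the extension mapping is an assumption.
EXTENSION_TO_TRANSFORM = {
    ".xls": "xls",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".tsv": "tsv",
    ".ssv": "ssv",
}


def choose_transform(node):
    """Pick a parser name for one leaf node of a build.yml-style tree.

    An explicit "transform" key always wins; otherwise the parser is
    inferred from the extension of the "file" value.
    """
    if "transform" in node:
        return node["transform"]
    extension = os.path.splitext(node["file"])[1].lower()
    if extension not in EXTENSION_TO_TRANSFORM:
        raise ValueError(f"cannot infer a parser for {node['file']!r}")
    return EXTENSION_TO_TRANSFORM[extension]


# Explicit transform overrides the misleading .txt extension.
print(choose_transform({"file": "src/bar/your.txt", "transform": "csv"}))
# No transform key: the parser is inferred from the extension.
print(choose_transform({"file": "numbers/excel_file.xlsx"}))
```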
Each leaf node in contents is specified by a list of the form [parser, file]. You can have as many leaf nodes (data frames) and non-leaf nodes (groups) as you choose.
Note: parser and file's extension may differ, and in practice often do. For example foo.one uses the csv parser to read from a .txt file that, contrary to its extension, is actually in CSV format. The separation of parser and file allows you to change parsers without changing file names.
- xls or xlsx for Excel
- csv for comma-separated values
- tsv for tab-separated values
- ssv for semicolon-separated values
quilt can be extended to support more parsers. See TARGET in quilt/data/tools/constants.py.
Packages can include data and other contents that are not representable as DataFrames. To include an input file unmodified, set the parser value to raw.
Files can be accessed by using Python's built-in open function.
from quilt.data.USER import PACKAGE
with open(PACKAGE.a_file, 'r') as localfile:
    print(localfile.read())
quilt build USER/PACKAGE build.yml
build parses the source files referenced in the contents tree of build.yml, transforms them with the specified parser into data frames, then serializes the data frames to memory-mapped binary formats. At present quilt packages are pandas data frames stored in HDF5. In the future we will support R, Spark, and binary formats like Parquet.
So far your package lives on your local machine. Now you can push it to a secure registry in the cloud.
- quilt login. Sign in or create an account, then paste your confirmation code into quilt.
- quilt push YOU/YOUR_PACKAGE adds your package to the registry. By default all packages are private to the owner (you).
Note: all packages are private by default, visible only to the owner.
To share a package with another user:
quilt access add YOU/YOUR_PACKAGE FRIEND
Now user FRIEND can quilt install YOU/YOUR_PACKAGE. In the near future the quilt registry at quiltdata.com will offer a graphical user interface for easy access control.
If you wish to make a package public:
quilt access add YOU/YOUR_PACKAGE public
If you change your mind:
quilt access remove YOU/YOUR_PACKAGE public
Once you've pushed a package to the registry, you can list its versions and tags.
quilt tag list USER/PACKAGE
latest: 7f6ca2546aba49be878c7f407bb49ef9388c51be716360685bce2d2cdae4fcd1
The tag latest
is automatically added to the most recently pushed instance of a data package. To add a new tag, copy the package hash for the package instance you want to tag and run:
quilt tag add USER/PACKAGE NEW_TAG PKG_HASH
quilt tag list USER/PACKAGE
latest: 7f6ca2546aba49be878c7f407bb49ef9388c51be716360685bce2d2cdae4fcd1
newtag: 7f6ca2546aba49be878c7f407bb49ef9388c51be716360685bce2d2cdae4fcd1
To create a new version, copy the package hash for the package instance you want to version and run:
quilt version add USER/PACKAGE VERSION PKG_HASH
quilt version list USER/PACKAGE
0.0.1: 7f6ca2546aba49be878c7f407bb49ef9388c51be716360685bce2d2cdae4fcd1
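Tags and versions are both names that point at a package hash. The sketch below models that relationship in memory to make the behavior described above concrete. PackageHistory and its methods are hypothetical and exist only for illustration; they are not the registry's implementation. Falling back to latest when no selector is given is also an assumption.

```python
class PackageHistory:
    """Hypothetical in-memory model of one package's tags and versions."""

    def __init__(self):
        self.hashes = []    # pushed instances, oldest first
        self.tags = {}      # tag name -> package hash
        self.versions = {}  # version string -> package hash

    def _require_known(self, pkg_hash):
        if pkg_hash not in self.hashes:
            raise KeyError(f"unknown package hash: {pkg_hash}")

    def push(self, pkg_hash):
        """Record a pushed instance and move the 'latest' tag to it."""
        self.hashes.append(pkg_hash)
        self.tags["latest"] = pkg_hash

    def tag_add(self, tag, pkg_hash):
        self._require_known(pkg_hash)
        self.tags[tag] = pkg_hash

    def version_add(self, version, pkg_hash):
        self._require_known(pkg_hash)
        self.versions[version] = pkg_hash

    def resolve(self, pkg_hash=None, version=None, tag=None):
        """Turn a hash, version, or tag into a package hash.

        Assumption: with no selector, the 'latest' tag is used.
        """
        given = [x for x in (pkg_hash, version, tag) if x is not None]
        if len(given) > 1:
            raise ValueError("give at most one of pkg_hash, version, tag")
        if pkg_hash is not None:
            self._require_known(pkg_hash)
            return pkg_hash
        if version is not None:
            return self.versions[version]
        return self.tags[tag if tag is not None else "latest"]


FIRST = "7f6ca2546aba49be878c7f407bb49ef9388c51be716360685bce2d2cdae4fcd1"
SECOND = "0" * 64  # placeholder hash for a second, later push

history = PackageHistory()
history.push(FIRST)
history.tag_add("newtag", FIRST)
history.version_add("0.0.1", FIRST)
history.push(SECOND)  # 'latest' now points at the newer instance

print(history.resolve() == SECOND)
print(history.resolve(tag="newtag") == FIRST)
print(history.resolve(version="0.0.1") == FIRST)
```

The point of the model is that latest moves with every push, while other tags and versions stay pinned to the hash they were given.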
- quilt -h for a list of commands
- quilt CMD -h for info about a command
- quilt login
- quilt build USER/PACKAGE FILE.YML
- quilt push USER/PACKAGE stores the package in the registry
- quilt install [-x HASH | -v VERSION | -t TAG] USER/PACKAGE installs a package
- quilt access list USER/PACKAGE to see who has access to a package
- quilt access {add, remove} USER/PACKAGE ANOTHER_USER to set access
- quilt log USER/PACKAGE to see all changes to a package
- quilt version list USER/PACKAGE to see versions of a package
- quilt version add USER/PACKAGE VERSION HASH to create a new version
- quilt tag list USER/PACKAGE to see tags of a package
- quilt tag add USER/PACKAGE TAG HASH to create a new tag
- quilt tag remove USER/PACKAGE TAG to delete a tag
- pip install pylint pytest pytest-cov
- pytest will run any test_* files in any subdirectory
- All new modules, files, and functions should have a corresponding test
- Track test code coverage by running:
python -m pytest --cov=quilt/tools/ --cov-report html:cov_html quilt/test -v
- View coverage results by opening cov_html/index.html
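As a rough illustration of the testing convention above, a test file could look like the following. The file name quilt/test/test_example.py and the split_package_name helper are hypothetical; they are not taken from the quilt code base and only give the tests something to check.

```python
# Hypothetical contents of a file such as quilt/test/test_example.py.
# pytest collects files named test_* and runs functions named test_*.


def split_package_name(name):
    """Split 'USER/PACKAGE' into a (user, package) tuple.

    Hypothetical helper used only to give the tests something to check.
    """
    user, sep, package = name.partition("/")
    if not sep or not user or not package or "/" in package:
        raise ValueError(f"expected USER/PACKAGE, got {name!r}")
    return user, package


def test_split_package_name():
    assert split_package_name("examples/wine") == ("examples", "wine")


def test_split_package_name_rejects_missing_user():
    try:
        split_package_name("wine")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

In a real test module, pytest.raises(ValueError) is the more idiomatic way to write the second check; the plain try/except form is used here so the sketch runs without pytest installed.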
git clone https://github.com/quiltdata/quilt.git
cd quilt
- From the repository root:
pip install -e .