forked from quiltdata/quilt
Commit
Quilt Data committed Feb 10, 2017
0 parents, commit e77194f
Showing 25 changed files with 21,826 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
build | ||
.cache | ||
dist | ||
*.egg-info | ||
*.py[cod] | ||
tmp_* | ||
|
||
quilt_packages |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
# This is alpha software | ||
It will eventually be awesome. Until then we welcome your contributions. | ||
If you hit any snags or want to chat, please find us via | ||
the little orange chat icon on [beta.quiltdata.com](https://beta.quiltdata.com/). | ||
|
||
We're three engineers with a strong commitment to quality but a long list of things | ||
to do :) | ||
|
||
# Overview | ||
[Quilt](https://beta.quiltdata.com/) is a data package manager. | ||
You can use data packages from the community, or publish packages for others to use. | ||
|
||
`quilt` is the command-line client that builds, retrieves, and stores | ||
packages. `quilt` works in conjunction with a server-side registry, | ||
not covered in this document. `quilt` currently pushes to and pulls from | ||
the registry at [beta.quiltdata.com](https://beta.quiltdata.com/). In the near | ||
future users will be able to browse packages in the registry. | ||
|
||
## Benefits | ||
* Access data frames [5X to 20X faster](http://wesmckinney.com/blog/pandas-and-apache-arrow/). | ||
Quilt stores data frames in high-efficiency, memory-mapped binary formats like HDF5. | ||
* Version your data. Pull packages by version number or tag (incomplete feature). | ||
* Publish data packages for the benefit of the community. | ||
* Satisfy your data dependencies with one command, `quilt install dependency`. | ||
|
||
# Install `quilt` | ||
- `pip install quilt` | ||
|
||
# Install a package | ||
Let's install a public package containing wine quality data from the UCI Machine | ||
Learning Repository. | ||
- `quilt install akarve/wine` | ||
|
||
Now let's fire up Python and import the package. | ||
``` | ||
$ python | ||
>>> from quilt.data.akarve import wine | ||
``` | ||
The import syntax is `from quilt.data.USER import PACKAGE`. | ||
|
||
Let's see what's in the `wine` package: | ||
``` | ||
>>> wine | ||
<class 'quilt.data.Node'> | ||
File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5 | ||
Path: / | ||
quality/ | ||
>>> wine.quality | ||
<class 'quilt.data.Node'> | ||
File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5 | ||
Path: /quality/ | ||
red/ | ||
white/ | ||
>>> wine.quality.red | ||
# ... omitting lots of rows | ||
1598 11.0 6 | ||
[1599 rows x 12 columns] | ||
>>> type(wine.quality.red) | ||
<class 'pandas.core.frame.DataFrame'> | ||
>>> type(wine.quality) | ||
<class 'quilt.data.Node'> | ||
``` | ||
As you can see, a `quilt` package is a tree of groups and data frames. | ||
You can enumerate a package tree as follows: | ||
``` | ||
>>> wine.quality._keys() | ||
dict_keys(['red', 'white']) | ||
>>> wine.quality._groups() | ||
[] | ||
>>> wine.quality._dfs() | ||
['red', 'white'] | ||
``` | ||
|
||
## Traverse a package | ||
|
||
`foo._keys()` enumerates all children of `foo`, whereas `foo._dfs()` and | ||
`foo._groups()` partition keys into data frames and groups, respectively. | ||
Groups are like folders for data frames. | ||
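The traversal helpers can be combined into a recursive walk over a whole package. The following is a minimal sketch, not part of `quilt`: `DemoNode` is a hypothetical stand-in that only mimics the `_keys()`, `_groups()`, and `_dfs()` methods described above, and `walk` is an illustrative helper name. Plain strings stand in for data frames so the example runs without any installed package.

```python
class DemoNode:
    """Hypothetical stand-in for a package node (not quilt's own class)."""

    def __init__(self, children):
        # Maps a child name to a DemoNode (group) or any other object
        # (standing in for a data frame).
        self._children = children

    def __getattr__(self, name):
        # Only called for attributes that are not found the normal way.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._children[name]
        except KeyError:
            raise AttributeError(name)

    def _keys(self):
        return list(self._children)

    def _groups(self):
        return [k for k, v in self._children.items() if isinstance(v, DemoNode)]

    def _dfs(self):
        return [k for k, v in self._children.items() if not isinstance(v, DemoNode)]


def walk(node, prefix=''):
    """Yield the slash-separated path of every data frame below `node`."""
    for name in sorted(node._dfs()):
        yield prefix + '/' + name
    for name in sorted(node._groups()):
        yield from walk(getattr(node, name), prefix + '/' + name)


# Same shape as the wine package shown earlier: one group, two data frames.
wine = DemoNode({'quality': DemoNode({'red': 'red-frame', 'white': 'white-frame'})})
paths = list(walk(wine))
print(paths)
```

Assuming a real package node exposes the same three methods, `walk` should work on it unchanged.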
|
||
# Create your first package | ||
Create a `build.yml` file. Your file should look something like this: | ||
```yaml | ||
--- | ||
tables: | ||
  one: [csv, src/bar/your.txt] | ||
  two: [csv, another.csv] | ||
  um: | ||
    buckle: [xls, finance/excel_file.xls] | ||
    my: [xlsx, numbers/excel_file.xlsx] | ||
    shoe: [tsv, measurements.txt] | ||
... | ||
``` | ||
The above `build.yml` tells `quilt` how to build a package from a set | ||
of input files. The `tables` dictionary is required. The tree | ||
structure under `tables` dictates the package tree. `foo.one` and | ||
`foo.two` will import as data frames. `foo.um` is a group containing | ||
three data frames. `foo.um.buckle` is a data frame, etc. | ||
|
||
Each leaf node in `tables` is specified by a list of the form | ||
`[parser, file]`. You can have as many leaf nodes (data frames) and non-leaf nodes | ||
(groups) as you choose. | ||
|
||
**Note**: `parser` and `file`'s extension may differ, and in | ||
practice often do. For example `foo.one` uses the `csv` | ||
parser to read from a `.txt` file that, contrary to its extension, is actually | ||
in CSV format. The separation of `parser` and `file` allows you to change | ||
parsers without changing file names. | ||
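To make the mapping from `build.yml` to the package tree concrete, here is a small sketch that flattens the `tables` dictionary into one row per data frame. It is only an illustration: `leaves` is a hypothetical helper, the dictionary is written inline rather than loaded from YAML, and `foo` is the placeholder package name used in the text above.

```python
def leaves(tables, prefix='foo'):
    """Yield (import path, parser, file) for every leaf under `tables`."""
    for name, value in tables.items():
        path = prefix + '.' + name
        if isinstance(value, dict):
            # A nested dictionary is a group; recurse into it.
            yield from leaves(value, path)
        else:
            # A leaf is a two-item list of the form [parser, file].
            parser, filename = value
            yield path, parser, filename


# The example build.yml from above, written as a Python dictionary.
build = {
    'tables': {
        'one': ['csv', 'src/bar/your.txt'],
        'two': ['csv', 'another.csv'],
        'um': {
            'buckle': ['xls', 'finance/excel_file.xls'],
            'my': ['xlsx', 'numbers/excel_file.xlsx'],
            'shoe': ['tsv', 'measurements.txt'],
        },
    }
}

rows = list(leaves(build['tables']))
for path, parser, filename in rows:
    print(path, parser, filename)
```

Each printed row corresponds to one data frame; intermediate names such as `foo.um` are groups and never appear as rows of their own.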
|
||
## Supported parsers | ||
- `xls` or `xlsx` for Excel | ||
- `csv` for comma-separated values | ||
- `tsv` for tab-separated values | ||
- `ssv` for semicolon-separated values | ||
|
||
`quilt` can be extended to support more parsers. See `TARGET` in `quilt/data/tools/constants.py`. | ||
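The three text parsers differ only in the separator they expect. The sketch below illustrates that idea with Python's standard `csv` module. It is an assumption-laden illustration, not how `quilt` implements parsing (the contents of `TARGET` are not shown in this document): `DELIMITERS` and `read_rows` are hypothetical names, and Excel parsing is left out because it needs a third-party library.

```python
import csv
import io

# Hypothetical mapping from parser name to field separator.
DELIMITERS = {'csv': ',', 'tsv': '\t', 'ssv': ';'}


def read_rows(parser, text):
    """Split `text` into rows using the separator for `parser`."""
    try:
        delimiter = DELIMITERS[parser]
    except KeyError:
        raise ValueError('unsupported parser: %s' % parser)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


sample = read_rows('ssv', 'a;b\n1;2\n')
print(sample)
```

Because the parser name is chosen separately from the file name, the same file can be re-read with a different separator by changing one word in `build.yml`.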
|
||
## Build the package | ||
- `quilt build USER/PACKAGE build.yml` | ||
|
||
`build` parses the files referenced in `build.yml`, transforms them with the specified | ||
parser into data frames, then serializes the data frames to | ||
memory-mapped binary formats. At present, quilt packages are pandas | ||
data frames stored in HDF5. In the future we will support R, Spark, and | ||
binary formats like Parquet. | ||
|
||
You can now use your package locally: | ||
``` | ||
>>> from quilt.data.user import package | ||
``` | ||
Data packages deserialize 5x to 20x faster than text files. | ||
|
||
## Push the package | ||
So far your package lives on your local machine. Now you can | ||
push it to a secure registry in the cloud. | ||
|
||
1. `quilt login`. Sign in or create an account, then paste your confirmation code into | ||
`quilt`. | ||
|
||
1. `quilt push YOU/YOUR_PACKAGE`. This stores the package in the registry. | ||
|
||
1. `quilt access add YOU/YOUR_PACKAGE FRIEND`. Now user `FRIEND` can | ||
`quilt install YOU/YOUR_PACKAGE`. In the near future | ||
the quilt registry at [beta.quiltdata.com](https://beta.quiltdata.com/) will offer | ||
a graphical user interface for easy access control. | ||
|
||
If you wish to make a package public, type `quilt access add YOU/YOUR_PACKAGE public`. | ||
|
||
**Note**: all packages are private by default, visible only to the owner. | ||
|
||
## Manage access | ||
- `quilt access list USER/PACKAGE` reveals who can view a package | ||
- `quilt access add USER/PACKAGE FRIEND` adds `FRIEND` as a viewer | ||
- `quilt access remove USER/PACKAGE ENEMY` removes `ENEMY` as a viewer | ||
|
||
# Command summary | ||
* `quilt -h` for a list of commands | ||
* `quilt CMD -h` for info about a command | ||
* `quilt login` | ||
* `quilt build USER/PACKAGE FILE.YML` | ||
* `quilt push USER/PACKAGE` stores the package in the registry | ||
* `quilt access list USER/PACKAGE` to see who has access to a package | ||
* `quilt access {add, remove} USER/PACKAGE ANOTHER_USER` to set access | ||
|
||
# Developer | ||
- `pip install pylint pytest` | ||
- `pytest` will run any `test_*` files in any subdirectory | ||
- All new modules, files, and functions should have a corresponding test | ||
|
||
## Local installation | ||
1. `git clone https://github.com/quiltdata/quilt.git` | ||
1. `cd quilt` | ||
1. From the repository root: `pip install -e .` | ||
|
||
## If you need h5py | ||
### The easy way with binaries | ||
Use conda to `conda install h5py`. | ||
|
||
### The hard way from source (YMMV; this is for Mac OS) | ||
1. Install HDF5: `brew install homebrew/science/[email protected]` | ||
- [See also this `h5py` doc](http://docs.h5py.org/en/latest/build.html#source-installation-on-linux-and-os-x) | ||
1. Expose compiler flags in `~/.bash_profile`. Follow the homebrew instructions, which should look something like this: | ||
``` | ||
export LDFLAGS="-L/usr/local/opt/[email protected]/lib" | ||
export CPPFLAGS="-I/usr/local/opt/[email protected]/include" | ||
``` |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,157 @@ | ||
""" | ||
Magic module that maps its submodules to Quilt tables. | ||
Submodules have the following format: quilt.data.$user.$package.$table | ||
E.g.: | ||
  import quilt.data.$user.$package as $package | ||
  print $package.$table | ||
or | ||
  from quilt.data.$user.$package import $table | ||
  print $table | ||
The corresponding data is looked up in `quilt_packages/$user/$package.h5` | ||
in ancestors of the current directory. | ||
""" | ||
|
||
import imp | ||
import os.path | ||
import sys | ||
|
||
from .tools.build import get_store | ||
from .tools.store import PackageStore | ||
|
||
__path__ = [] # Required for submodules to work | ||
|
||
class Node(object): | ||
    """ | ||
    Represents either the root of the store or a group, similar to nodes | ||
    in HDFStore's `root`. | ||
    """ | ||
    def __init__(self, store, prefix=''): | ||
        self._prefix = prefix | ||
        self._store = store | ||
|
||
    def __getattr__(self, name): | ||
        # TODO clean if... up since VALID_NAME_RE no longer allows leading _ | ||
        if name.startswith('_'): | ||
            raise AttributeError | ||
        path = self._prefix + '/' + name | ||
        return self._get_store_obj(path) | ||
|
||
    def __repr__(self): | ||
        cinfo = str(self.__class__) | ||
        finfo = 'File: ' + self._store.get_path() | ||
        pinfo = 'Path: ' + self._prefix + '/' | ||
        #TODO maybe show all descendant subpaths instead of just children | ||
        spaths = [k + '/' for k in self._keys()] | ||
        spaths.sort() | ||
        output = [cinfo, finfo, pinfo] + spaths | ||
        return '\n'.join(output) | ||
|
||
    def _dfs(self): | ||
        """ | ||
        every child key referencing a dataframe | ||
        """ | ||
        pref = self._prefix + '/' | ||
        return [k for k in self._keys() | ||
                if not isinstance(self._get_store_obj(pref + k), Node)] | ||
|
||
    def _get_store_obj(self, path): | ||
        try: | ||
            with self._store: | ||
                return self._store.get(path) | ||
        except KeyError: | ||
            # No such group or table | ||
            raise AttributeError("No such table or group: %s" % path) | ||
        except TypeError: | ||
            # This is awful, but that's what happens when the object being looked up | ||
            # is a group rather than a table. | ||
            return Node(self._store, path) | ||
|
||
    def _groups(self): | ||
        """ | ||
        every child key referencing a group that is not a dataframe | ||
        """ | ||
        pref = self._prefix + '/' | ||
        return [k for k in self._keys() | ||
                if isinstance(self._get_store_obj(pref + k), Node)] | ||
|
||
    def _keys(self): | ||
        """ | ||
        keys directly accessible on this object via getattr or . | ||
        """ | ||
        return self._store.keys(self._prefix) | ||
|
||
class FakeLoader(object): | ||
    """ | ||
    Fake module loader used to create intermediate user and package modules. | ||
    """ | ||
    def __init__(self, path): | ||
        self._path = path | ||
|
||
    def load_module(self, fullname): | ||
        """ | ||
        Returns an empty module. | ||
        """ | ||
        mod = sys.modules.setdefault(fullname, imp.new_module(fullname)) | ||
        mod.__file__ = self._path | ||
        mod.__loader__ = self | ||
        mod.__path__ = [] | ||
        mod.__package__ = fullname | ||
        return mod | ||
|
||
class PackageLoader(object): | ||
    """ | ||
    Module loader for Quilt tables. | ||
    """ | ||
    def __init__(self, path, store): | ||
        self._path = path | ||
        self._store = store | ||
|
||
    def load_module(self, fullname): | ||
        """ | ||
        Returns an object that lazily looks up tables and groups. | ||
        """ | ||
        mod = sys.modules.get(fullname) | ||
        if mod is not None: | ||
            return mod | ||
|
||
        # We're creating an object rather than a module. It's a hack, but it's approved by Guido: | ||
        # https://mail.python.org/pipermail/python-ideas/2012-May/014969.html | ||
|
||
        mod = Node(self._store) | ||
        sys.modules[fullname] = mod | ||
        return mod | ||
|
||
class ModuleFinder(object): | ||
    """ | ||
    Looks up submodules. | ||
    """ | ||
    @staticmethod | ||
    def find_module(fullname, path=None): | ||
        """ | ||
        Looks up the table based on the module path. | ||
        """ | ||
        if not fullname.startswith(__name__ + '.'): | ||
            # Not a quilt submodule. | ||
            return None | ||
|
||
        submodule = fullname[len(__name__) + 1:] | ||
        parts = submodule.split('.') | ||
|
||
        if len(parts) == 1: | ||
            for package_dir in PackageStore.find_package_dirs(): | ||
                file_path = os.path.join(package_dir, parts[0]) | ||
                if os.path.isdir(file_path): | ||
                    return FakeLoader(file_path) | ||
        elif len(parts) == 2: | ||
            user, package = parts | ||
            store = get_store(user, package) | ||
            if store: | ||
                file_path = store.get_path() | ||
                return PackageLoader(file_path, store) | ||
|
||
        return None | ||
|
||
sys.meta_path.append(ModuleFinder) |
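The module above hooks into imports through `find_module`/`load_module` and the `imp` module. That protocol was deprecated in later Python 3 releases, and `imp` was removed in Python 3.12. As a rough sketch of the same idea on the newer `importlib` protocol (not a port of this module), the example below registers a finder that claims every name under a made-up `demo_data` namespace and returns empty package-like modules, comparable in spirit to `FakeLoader`. `DemoFinder`, `DemoLoader`, and `demo_data` are hypothetical names, and no store lookup is attempted.

```python
import importlib.abc
import importlib.util
import sys


class DemoLoader(importlib.abc.Loader):
    """Creates empty package-like modules."""

    def create_module(self, spec):
        return None  # fall back to the default module object

    def exec_module(self, module):
        pass  # nothing to execute; the module stays empty


class DemoFinder(importlib.abc.MetaPathFinder):
    """Claims every module under the hypothetical `demo_data` namespace."""

    PREFIX = 'demo_data'

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.PREFIX and not fullname.startswith(self.PREFIX + '.'):
            return None  # not ours; let other finders handle it
        # is_package=True gives the module a __path__, so submodule
        # imports are routed back through this finder.
        return importlib.util.spec_from_loader(
            fullname, DemoLoader(), is_package=True)


sys.meta_path.append(DemoFinder())

import demo_data.alice.wine as wine
print(wine.__name__)
```

A real replacement would return a spec only when a matching package store exists, and would still need some way to hand back a lazily evaluated node instead of a plain module.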
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
tables: | ||
  csv: [csv, data/10KRows13Cols.csv] | ||
  tsv: [tsv, data/10KRows13Cols.tsv] | ||
  xls: [xlsx, data/10KRows13Cols.xlsx] |