Initial code commit
Quilt Data committed Feb 10, 2017
0 parents commit e77194f
Showing 25 changed files with 21,826 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
build
.cache
dist
*.egg-info
*.py[cod]
tmp_*

quilt_packages
184 changes: 184 additions & 0 deletions README.md
@@ -0,0 +1,184 @@
# This is alpha software
It will eventually be awesome. Until then we welcome your contributions.
If you hit any snags or want to chat, please find us via
the little orange chat icon on [beta.quiltdata.com](https://beta.quiltdata.com/).

We're three engineers with a strong commitment to quality but a long list of things
to do :)

# Overview
[Quilt](https://beta.quiltdata.com/) is a data package manager.
You can use data packages from the community, or publish packages for others to use.

`quilt` is the command-line client that builds, retrieves, and stores
packages. `quilt` works in conjunction with a server-side registry,
not covered in this document. `quilt` currently pushes to and pulls from
the registry at [beta.quiltdata.com](https://beta.quiltdata.com/). In the near
future users will be able to browse packages in the registry.

## Benefits
* Access data frames [5X to 20X faster](http://wesmckinney.com/blog/pandas-and-apache-arrow/).
Quilt stores data frames in high-efficiency, memory-mapped binary formats like HDF5.
* Version your data. Pull packages by version number or tag (incomplete feature).
* Publish data packages for the benefit of the community.
* Satisfy your data dependencies with one command, `quilt install dependency`.

# Install `quilt`
- `pip install quilt`

# Install a package
Let's install a public package containing wine quality data from the UCI Machine
Learning Repository.
- `quilt install akarve/wine`

Now let's fire up Python and import the package.
```
$ python
>>> from quilt.data.akarve import wine
```
The import syntax is `from quilt.data.USER import PACKAGE`.

Let's see what's in the `wine` package:
```
>>> wine
<class 'quilt.data.Node'>
File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5
Path: /
quality/
>>> wine.quality
<class 'quilt.data.Node'>
File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5
Path: /quality/
red/
white/
>>> wine.quality.red
# ... omitting lots of rows
1598 11.0 6
[1599 rows x 12 columns]
>>> type(wine.quality.red)
<class 'pandas.core.frame.DataFrame'>
>>> type(wine.quality)
<class 'quilt.data.Node'>
```
As you can see, `quilt` packages are a tree of groups and data frames.
You can enumerate a package tree as follows:
```
>>> wine.quality._keys()
dict_keys(['red', 'white'])
>>> wine.quality._groups()
[]
>>> wine.quality._dfs()
['red', 'white']
```

## Traverse a package

`foo._keys()` enumerates all children of `foo`, whereas `foo._dfs()` and
`foo._groups()` partition keys into data frames and groups, respectively.
Groups are like folders for data frames.
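
As a rough sketch, these three methods are enough to walk a whole package tree recursively. The `walk` helper below is hypothetical (it is not part of `quilt`), and `FakeNode` is only a stand-in that mimics `_keys()`, `_groups()`, and `_dfs()` so the example runs without an installed package. With a real package you would pass the imported package (for example `wine`) instead.

```python
def walk(node, prefix=''):
    """Yield the slash-separated path of every data frame below `node`.

    Hypothetical helper, not part of quilt. It relies only on the
    `_groups()` and `_dfs()` methods described above.
    """
    for name in sorted(node._dfs()):
        yield prefix + '/' + name
    for name in sorted(node._groups()):
        for path in walk(getattr(node, name), prefix + '/' + name):
            yield path


class FakeNode(object):
    """Stand-in for a package node, for illustration only."""
    def __init__(self, children):
        # Nested dicts act as groups; any other value acts as a data frame.
        self._children = children

    def __getattr__(self, name):
        # Only reached for names that normal lookup did not find.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            child = self._children[name]
        except KeyError:
            raise AttributeError(name)
        return FakeNode(child) if isinstance(child, dict) else child

    def _keys(self):
        return list(self._children)

    def _groups(self):
        return [k for k, v in self._children.items() if isinstance(v, dict)]

    def _dfs(self):
        return [k for k, v in self._children.items() if not isinstance(v, dict)]


# Mirrors the shape of the `wine` package shown earlier (strings replace frames).
wine = FakeNode({'quality': {'red': 'red frame', 'white': 'white frame'}})
paths = list(walk(wine))
print(paths)
```

Data frames are listed before groups at each level, so the output reads like a depth-first directory listing.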

# Create your first package
Create a `build.yml` file. Your file should look something like this:
```yaml
---
tables:
  one: [csv, src/bar/your.txt]
  two: [csv, another.csv]
  um:
    buckle: [xls, finance/excel_file.xls]
    my: [xlsx, numbers/excel_file.xlsx]
    shoe: [tsv, measurements.txt]
...
```
The above `build.yml` tells `quilt` how to build a package from a set
of input files. The `tables` dictionary is required. The tree
structure under `tables` dictates the package tree. `foo.one` and
`foo.two` will import as data frames. `foo.um` is a group containing
three data frames. `foo.um.buckle` is a data frame, etc.

Each leaf node in `tables` is specified by a list of the form
`[parser, file]`. You can have as many leaf nodes (data frames) and non-leaf nodes
(groups) as you choose.
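
To make the leaf/group distinction concrete, here is a small sketch that walks a `tables` dictionary and lists every leaf with its parser and file. It assumes the YAML has already been loaded into plain Python dicts and lists; the `leaves` function is a hypothetical illustration, not how `quilt build` is implemented.

```python
def leaves(tables, prefix=''):
    """Yield (path, parser, file) for every leaf under a `tables` dict."""
    for name, value in sorted(tables.items()):
        path = prefix + '.' + name if prefix else name
        if isinstance(value, dict):
            # A nested mapping is a group; recurse into it.
            for leaf in leaves(value, path):
                yield leaf
        else:
            # A leaf is a two-element list: [parser, file].
            parser, filename = value
            yield path, parser, filename


# The same structure as the example build.yml, written as a Python dict.
build = {
    'tables': {
        'one': ['csv', 'src/bar/your.txt'],
        'two': ['csv', 'another.csv'],
        'um': {
            'buckle': ['xls', 'finance/excel_file.xls'],
            'my': ['xlsx', 'numbers/excel_file.xlsx'],
            'shoe': ['tsv', 'measurements.txt'],
        },
    },
}

result = list(leaves(build['tables']))
for path, parser, filename in result:
    print('%-10s %-5s %s' % (path, parser, filename))
```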

**Note**: `parser` and `file`'s extension may differ, and in
practice often do. For example `foo.one` uses the `csv`
parser to read from a `.txt` file that, contrary to its extension, is actually
in CSV format. The separation of `parser` and `file` allows you to change
parsers without changing file names.

## Supported parsers
- `xls` or `xlsx` for Excel
- `csv` for comma-separated values
- `tsv` for tab-separated values
- `ssv` for semicolon-separated values

`quilt` can be extended to support more parsers. See `TARGET` in `quilt/data/tools/constants.py`.
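
The three text parsers listed above differ in the delimiter they expect. The sketch below illustrates that idea with the standard library `csv` module; it is not `quilt`'s code, and `DELIMITERS` and `parse_rows` are hypothetical names chosen for this example.

```python
import csv
import io

# Hypothetical mapping from parser name to field delimiter.
DELIMITERS = {'csv': ',', 'tsv': '\t', 'ssv': ';'}


def parse_rows(parser, text):
    """Split delimited text into a list of rows using the named parser."""
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITERS[parser])
    return [row for row in reader]


rows = parse_rows('ssv', 'a;b\n1;2\n')
print(rows)
```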

## Build the package
- `quilt build USER/PACKAGE build.yml`

`build` parses the files referenced in `build.yml`, transforms them with the specified
parsers into data frames, then serializes the data frames to
memory-mapped binary formats. At present, `quilt` packages are pandas
data frames stored in HDF5. In the future we will support R, Spark, and
binary formats like Parquet.

You can now use your package locally:
```
>>> from quilt.data.user import package
```
Data packages deserialize 5x to 20x faster than text files.

## Push the package
So far your package lives on your local machine. Now you can
push it to a secure registry in the cloud.

1. `quilt login`. Sign in or create an account, then paste your confirmation code into
`quilt`.

1. `quilt push YOU/YOUR_PACKAGE`. This stores the package in the registry.

1. `quilt access add YOU/YOUR_PACKAGE FRIEND`. Now user `FRIEND` can
`quilt install YOU/YOUR_PACKAGE`. In the near future
the quilt registry at [beta.quiltdata.com](https://beta.quiltdata.com/) will offer
a graphical user interface for easy access control.

If you wish to make a package public, type `quilt access add YOU/YOUR_PACKAGE public`.

**Note**: all packages are private by default, visible only to the owner.

## Manage access
- `quilt access list USER/PACKAGE` reveals who can view a package
- `quilt access add USER/PACKAGE FRIEND` adds `FRIEND` as a viewer
- `quilt access remove USER/PACKAGE ENEMY` removes `ENEMY` as a viewer

# Command summary
* `quilt -h` for a list of commands
* `quilt CMD -h` for info about a command
* `quilt login`
* `quilt build USER/PACKAGE FILE.YML`
* `quilt push USER/PACKAGE` stores the package in the registry
* `quilt access list USER/PACKAGE` to see who has access to a package
* `quilt access {add, remove} USER/PACKAGE ANOTHER_USER` to set access

# Developer
- `pip install pylint pytest`
- `pytest` will run any `test_*` files in any subdirectory
- All new modules, files, and functions should have a corresponding test
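
For orientation, a test file that `pytest` would collect might look like the sketch below. The file name `test_example.py` and the helper `split_package` are hypothetical and exist only for this illustration; they are not part of the `quilt` code base.

```python
# test_example.py (hypothetical file name; pytest collects files named test_*)

def split_package(name):
    """Hypothetical helper: split 'USER/PACKAGE' into (user, package)."""
    user, package = name.split('/')
    return user, package


def test_split_package():
    assert split_package('akarve/wine') == ('akarve', 'wine')


def test_split_package_rejects_missing_user():
    # Unpacking a single-element list into two names raises ValueError.
    try:
        split_package('wine')
    except ValueError:
        pass
    else:
        raise AssertionError('expected ValueError')
```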

## Local installation
1. `git clone https://github.com/quiltdata/quilt.git`
1. `cd quilt`
1. From the repository root: `pip install -e .`

## If you need h5py
### The easy way with binaries
Use conda to `conda install h5py`.

### The hard way from source (YMMV; this is for Mac OS)
1. Install HDF5: `brew install homebrew/science/[email protected]`
- [See also this `h5py` doc](http://docs.h5py.org/en/latest/build.html#source-installation-on-linux-and-os-x)
1. Expose compiler flags in `~/.bash_profile`. Follow the homebrew instructions, which should look something like this:
```
export LDFLAGS="-L/usr/local/opt/[email protected]/lib"
export CPPFLAGS="-I/usr/local/opt/[email protected]/include"
```
Empty file added quilt/__init__.py
Empty file.
157 changes: 157 additions & 0 deletions quilt/data.py
@@ -0,0 +1,157 @@
"""
Magic module that maps its submodules to Quilt tables.
Submodules have the following format: quilt.data.$user.$package.$table
E.g.:
  import quilt.data.$user.$package as $package
  print $package.$table
or
  from quilt.data.$user.$package import $table
  print $table
The corresponding data is looked up in `quilt_packages/$user/$package.h5`
in ancestors of the current directory.
"""

import imp
import os.path
import sys

from .tools.build import get_store
from .tools.store import PackageStore

__path__ = [] # Required for submodules to work

class Node(object):
    """
    Represents either the root of the store or a group, similar to nodes
    in HDFStore's `root`.
    """
    def __init__(self, store, prefix=''):
        self._prefix = prefix
        self._store = store

    def __getattr__(self, name):
        # TODO clean if... up since VALID_NAME_RE no longer allows leading _
        if name.startswith('_'):
            raise AttributeError
        path = self._prefix + '/' + name
        return self._get_store_obj(path)

    def __repr__(self):
        cinfo = str(self.__class__)
        finfo = 'File: ' + self._store.get_path()
        pinfo = 'Path: ' + self._prefix + '/'
        #TODO maybe show all descendant subpaths instead of just children
        spaths = [k + '/' for k in self._keys()]
        spaths.sort()
        output = [cinfo, finfo, pinfo] + spaths
        return '\n'.join(output)

    def _dfs(self):
        """
        every child key referencing a dataframe
        """
        pref = self._prefix + '/'
        return [k for k in self._keys()
                if not isinstance(self._get_store_obj(pref + k), Node)]

    def _get_store_obj(self, path):
        try:
            with self._store:
                return self._store.get(path)
        except KeyError:
            # No such group or table
            raise AttributeError("No such table or group: %s" % path)
        except TypeError:
            # This is awful, but that's what happens when the object being looked up
            # is a group rather than a table.
            return Node(self._store, path)

    def _groups(self):
        """
        every child key referencing a group that is not a dataframe
        """
        pref = self._prefix + '/'
        return [k for k in self._keys()
                if isinstance(self._get_store_obj(pref + k), Node)]

    def _keys(self):
        """
        keys directly accessible on this object via getattr or .
        """
        return self._store.keys(self._prefix)

class FakeLoader(object):
    """
    Fake module loader used to create intermediate user and package modules.
    """
    def __init__(self, path):
        self._path = path

    def load_module(self, fullname):
        """
        Returns an empty module.
        """
        mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
        mod.__file__ = self._path
        mod.__loader__ = self
        mod.__path__ = []
        mod.__package__ = fullname
        return mod

class PackageLoader(object):
    """
    Module loader for Quilt tables.
    """
    def __init__(self, path, store):
        self._path = path
        self._store = store

    def load_module(self, fullname):
        """
        Returns an object that lazily looks up tables and groups.
        """
        mod = sys.modules.get(fullname)
        if mod is not None:
            return mod

        # We're creating an object rather than a module. It's a hack, but it's approved by Guido:
        # https://mail.python.org/pipermail/python-ideas/2012-May/014969.html

        mod = Node(self._store)
        sys.modules[fullname] = mod
        return mod

class ModuleFinder(object):
    """
    Looks up submodules.
    """
    @staticmethod
    def find_module(fullname, path=None):
        """
        Looks up the table based on the module path.
        """
        if not fullname.startswith(__name__ + '.'):
            # Not a quilt submodule.
            return None

        submodule = fullname[len(__name__) + 1:]
        parts = submodule.split('.')

        if len(parts) == 1:
            for package_dir in PackageStore.find_package_dirs():
                file_path = os.path.join(package_dir, parts[0])
                if os.path.isdir(file_path):
                    return FakeLoader(file_path)
        elif len(parts) == 2:
            user, package = parts
            store = get_store(user, package)
            if store:
                file_path = store.get_path()
                return PackageLoader(file_path, store)

        return None

sys.meta_path.append(ModuleFinder)
Empty file added quilt/test/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions quilt/test/build.yml
@@ -0,0 +1,5 @@
---
tables:
  csv: [csv, data/10KRows13Cols.csv]
  tsv: [tsv, data/10KRows13Cols.tsv]
  xls: [xlsx, data/10KRows13Cols.xlsx]