Initial code commit

manzo1991 · Feb 10, 2017 · commit e77194f
Showing 25 changed files with 21,826 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,8 @@
+build
+.cache
+dist
+*.egg-info
+*.py[cod]
+tmp_*
+
+quilt_packages
diff --git a/README.md b/README.md
@@ -0,0 +1,184 @@
+# This is alpha software
+It will eventually be awesome. Until then we welcome your contributions.
+If you hit any snags or want to chat, please find us via
+the little orange chat icon on [beta.quiltdata.com](https://beta.quiltdata.com/).
+
+We're three engineers with a strong commitment to quality but a long list of things
+to do :)
+
+# Overview
+[Quilt](https://beta.quiltdata.com/) is a data package manager.
+You can use data packages from the community, or publish packages for others to use.
+
+`quilt` is the command-line client that builds, retrieves, and stores
+packages. `quilt` works in conjunction with a server-side registry,
+not covered in this document. `quilt` currently pushes to and pulls from
+the registry at [beta.quiltdata.com](https://beta.quiltdata.com/). In the near
+future users will be able to browse packages in the registry.
+
+## Benefits
+* Access data frames [5X to 20X faster](http://wesmckinney.com/blog/pandas-and-apache-arrow/).
+Quilt stores data frames in high-efficiency, memory-mapped binary formats like HDF5.
+* Version your data. Pull packages by version number or tag (incomplete feature).
+* Publish data packages for the benefit of the community.
+* Satisfy your data dependencies with one command, `quilt install dependency`.
+
+# Install `quilt`
+- `pip install quilt`
+
+# Install a package
+Let's install a public package containing wine quality data from the UCI Machine
+Learning Repository.
+- `quilt install akarve/wine`
+
+Now let's fire up Python and import the package.
+```
+$ python
+>>> from quilt.data.akarve import wine
+```
+The import syntax is `from quilt.data.USER import PACKAGE`.
+
+Let's see what's in the `wine` package:
+```
+>>> wine
+<class 'quilt.data.Node'>
+File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5
+Path: /
+quality/
+>>> wine.quality
+<class 'quilt.data.Node'>
+File: /Users/karve/code/quilt-cli/quilt_packages/akarve/wine.h5
+Path: /quality/
+red/
+white/
+>>> wine.quality.red
+# ... omitting lots of rows
+1598     11.0        6  
+
+[1599 rows x 12 columns]
+>>> type(wine.quality.red)
+<class 'pandas.core.frame.DataFrame'>
+>>> type(wine.quality)
+<class 'quilt.data.Node'>
+```
+As you can see, `quilt` packages are a tree of groups and data frames.
+You can enumerate a package tree as follows:
+```
+>>> wine.quality._keys()
+dict_keys(['red', 'white'])
+>>> wine.quality._groups()
+[]
+>>> wine.quality._dfs()
+['red', 'white']
+```
+
+## Traverse a package
+
+`foo._keys()` enumerates all children of `foo`, whereas `foo._dfs()` and
+`foo._groups()` partition keys into data frames and groups, respectively.
+Groups are like folders for data frames.
+
+# Create your first package
+Create a `build.yml` file. Your file should look something like this:
+```yaml
+---
+tables:
+  one: [csv, src/bar/your.txt]
+  two: [csv, another.csv]
+  um:
+    buckle: [xls, finance/excel_file.xls]
+    my: [xlsx, numbers/excel_file.xlsx]
+    shoe: [tsv, measurements.txt]
+...
+```
+The above `build.yml` tells `quilt` how to build a package from a set
+of input files. The `tables` dictionary is required. The tree
+structure under `tables` dictates the package tree. `foo.one` and
+`foo.two` will import as data frames. `foo.um` is a group containing
+three data frames. `foo.um.buckle` is a data frame, etc.
+
+Each leaf node in `tables` is specified by a list of the form
+`[parser, file]`. You can have as many leaf nodes (data frames) and non-leaf nodes
+(groups) as you choose.
+
+**Note**: `parser` and `file`'s extension may differ, and in
+practice often do. For example `foo.one` uses the `csv`
+parser to read from a `.txt` file that, contrary to its extension, is actually
+in CSV format. The separation of `parser` and `file` allows you to change
+parsers without changing file names.
+
+## Supported parsers
+- `xls` or `xlsx` for Excel
+- `csv` for comma-separated values
+- `tsv` for tab-separated values
+- `ssv` for semicolon-separated values
+
+`quilt` can be extended to support more parsers. See `TARGET` in `quilt/tools/constants.py`.
+
+## Build the package
+- `quilt build USER/PACKAGE build.yml`
+
+`build` parses the files referenced in `build.yml`, transforms them with the specified
+parser into data frames, then serializes the data frames to
+memory-mapped binary formats. At present quilt packages are pandas
+data frames stored in HDF5. In the future we will support R, Spark, and
+binary formats like Parquet.
+
+You can now use your package locally:
+```
+>>> from quilt.data.user import package
+```
+Data packages deserialize 5x to 20x faster than text files.
+
+## Push the package
+So far your package lives on your local machine. Now you can
+push it to a secure registry in the cloud.
+
+1. `quilt login`. Sign in or create an account, then paste your confirmation code into `quilt`.
+1. `quilt push YOU/YOUR_PACKAGE`. This stores the package in the registry.
+
+1. `quilt access add YOU/YOUR_PACKAGE FRIEND`. Now user `FRIEND` can
+`quilt install YOU/YOUR_PACKAGE`. In the near future
+the quilt registry at [beta.quiltdata.com](https://beta.quiltdata.com/) will offer
+a graphical user interface for easy access control.
+
+If you wish to make a package public, type `quilt access add YOU/YOUR_PACKAGE public`.
+
+**Note**: all packages are private by default, visible only to the owner. 
+
+## Manage access
+- `quilt access list USER/PACKAGE` reveals who can view a package
+- `quilt access add USER/PACKAGE FRIEND` adds `FRIEND` as a viewer
+- `quilt access remove USER/PACKAGE ENEMY` removes `ENEMY` as a viewer
+
+# Command summary
+* `quilt -h` for a list of commands
+* `quilt CMD -h` for info about a command
+* `quilt login`
+* `quilt build USER/PACKAGE FILE.YML`
+* `quilt push USER/PACKAGE` stores the package in the registry
+* `quilt access list USER/PACKAGE` to see who has access to a package
+* `quilt access {add, remove} USER/PACKAGE ANOTHER_USER` to set access
+
+# Developer
+- `pip install pylint pytest`
+- `pytest` will run any `test_*` files in any subdirectory
+- All new modules, files, and functions should have a corresponding test
+
+## Local installation
+1. `git clone https://github.com/quiltdata/quilt.git`
+1. `cd quilt`
+1. From the repository root: `pip install -e .`
+
+## If you need h5py
+### The easy way with binaries
+Use conda to `conda install h5py`.
+
+### The hard way from source (YMMV; this is for Mac OS)
+1. Install HDF5: `brew install homebrew/science/hdf5@1.8`
+  - [See also this `h5py` doc](http://docs.h5py.org/en/latest/build.html#source-installation-on-linux-and-os-x)
+1. Expose compiler flags in `~/.bash_profile`. Follow the homebrew instructions, which should look something like this:
+```
+export LDFLAGS="-L/usr/local/opt/hdf5@1.8/lib"
+export CPPFLAGS="-I/usr/local/opt/hdf5@1.8/include"
+```
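The README's `build.yml` section says that the tree under `tables` dictates the package tree and that each leaf is a `[parser, file]` pair. A minimal sketch of that rule in plain Python follows. It is an illustration only, not quilt's build code: `walk_tables` is a hypothetical helper, and the dict literal stands in for the parsed YAML so the example needs no YAML library.

```python
def walk_tables(tables, prefix=""):
    """Yield (path, parser, file) for every leaf under a `tables` mapping.

    Hypothetical helper: nested dicts are treated as groups, and
    two-element lists are treated as [parser, file] leaves.
    """
    for name, node in sorted(tables.items()):
        path = prefix + "/" + name
        if isinstance(node, dict):
            # A group: recurse into its children.
            yield from walk_tables(node, path)
        elif isinstance(node, list) and len(node) == 2:
            parser, file = node
            yield path, parser, file
        else:
            raise ValueError("Expected a group or [parser, file] at %s" % path)


# Same structure as the README's example build.yml, written as a dict.
build = {
    "tables": {
        "one": ["csv", "src/bar/your.txt"],
        "two": ["csv", "another.csv"],
        "um": {
            "buckle": ["xls", "finance/excel_file.xls"],
            "my": ["xlsx", "numbers/excel_file.xlsx"],
            "shoe": ["tsv", "measurements.txt"],
        },
    }
}

leaves = list(walk_tables(build["tables"]))
for path, parser, file in leaves:
    print(path, parser, file)
```

Each yielded path corresponds to one data frame in the built package, so `/um/buckle` here plays the role of `foo.um.buckle` in the README.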
diff --git a/quilt/__init__.py b/quilt/__init__.py
diff --git a/quilt/data.py b/quilt/data.py
@@ -0,0 +1,157 @@
+"""
+Magic module that maps its submodules to Quilt tables.
+
+Submodules have the following format: quilt.data.$user.$package.$table
+
+E.g.:
+  import quilt.data.$user.$package as $package
+  print $package.$table
+or
+  from quilt.data.$user.$package import $table
+  print $table
+
+The corresponding data is looked up in `quilt_packages/$user/$package.h5`
+in ancestors of the current directory.
+"""
+
+import imp
+import os.path
+import sys
+
+from .tools.build import get_store
+from .tools.store import PackageStore
+
+__path__ = []  # Required for submodules to work
+
+class Node(object):
+    """
+    Represents either the root of the store or a group, similar to nodes
+    in HDFStore's `root`.
+    """
+    def __init__(self, store, prefix=''):
+        self._prefix = prefix
+        self._store = store
+
+    def __getattr__(self, name):
+        # TODO clean if... up since VALID_NAME_RE no longer allows leading _
+        if name.startswith('_'):
+            raise AttributeError
+        path = self._prefix + '/' + name
+        return self._get_store_obj(path)
+
+    def __repr__(self):
+        cinfo = str(self.__class__)
+        finfo = 'File: ' + self._store.get_path()
+        pinfo = 'Path: ' + self._prefix + '/'
+        #TODO maybe show all descendant subpaths instead of just children
+        spaths = [k + '/' for k in self._keys()]
+        spaths.sort()
+        output = [cinfo, finfo, pinfo] + spaths
+        return '\n'.join(output)
+
+    def _dfs(self):
+        """
+        every child key referencing a dataframe
+        """
+        pref = self._prefix + '/'
+        return [k for k in self._keys()
+                if not isinstance(self._get_store_obj(pref + k), Node)]
+
+    def _get_store_obj(self, path):
+        try:
+            with self._store:
+                return self._store.get(path)
+        except KeyError:
+            # No such group or table
+            raise AttributeError("No such table or group: %s" % path)
+        except TypeError:
+            # This is awful, but that's what happens when the object being looked up
+            # is a group rather than a table.
+            return Node(self._store, path)
+
+    def _groups(self):
+        """
+        every child key referencing a group that is not a dataframe
+        """
+        pref = self._prefix + '/'
+        return [k for k in self._keys()
+                if isinstance(self._get_store_obj(pref + k), Node)]
+
+    def _keys(self):
+        """
+        keys directly accessible on this object via getattr or dot notation
+        """
+        return self._store.keys(self._prefix)
+
+class FakeLoader(object):
+    """
+    Fake module loader used to create intermediate user and package modules.
+    """
+    def __init__(self, path):
+        self._path = path
+
+    def load_module(self, fullname):
+        """
+        Returns an empty module.
+        """
+        mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
+        mod.__file__ = self._path
+        mod.__loader__ = self
+        mod.__path__ = []
+        mod.__package__ = fullname
+        return mod
+
+class PackageLoader(object):
+    """
+    Module loader for Quilt tables.
+    """
+    def __init__(self, path, store):
+        self._path = path
+        self._store = store
+
+    def load_module(self, fullname):
+        """
+        Returns an object that lazily looks up tables and groups.
+        """
+        mod = sys.modules.get(fullname)
+        if mod is not None:
+            return mod
+
+        # We're creating an object rather than a module. It's a hack, but it's approved by Guido:
+        # https://mail.python.org/pipermail/python-ideas/2012-May/014969.html
+
+        mod = Node(self._store)
+        sys.modules[fullname] = mod
+        return mod
+
+class ModuleFinder(object):
+    """
+    Looks up submodules.
+    """
+    @staticmethod
+    def find_module(fullname, path=None):
+        """
+        Looks up the table based on the module path.
+        """
+        if not fullname.startswith(__name__ + '.'):
+            # Not a quilt submodule.
+            return None
+
+        submodule = fullname[len(__name__) + 1:]
+        parts = submodule.split('.')
+
+        if len(parts) == 1:
+            for package_dir in PackageStore.find_package_dirs():
+                file_path = os.path.join(package_dir, parts[0])
+                if os.path.isdir(file_path):
+                    return FakeLoader(file_path)
+        elif len(parts) == 2:
+            user, package = parts
+            store = get_store(user, package)
+            if store:
+                file_path = store.get_path()
+                return PackageLoader(file_path, store)
+
+        return None
+
+sys.meta_path.append(ModuleFinder)
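`quilt/data.py` above hooks into `sys.meta_path` through `find_module` and `load_module`, and uses the `imp` module. As far as I know, `imp` and the `find_module` hook were removed in Python 3.12, so the following is a rough sketch of the same idea using `find_spec` and `exec_module` instead. It is a swapped-in technique, not the commit's code. `fake_quilt_data`, `FAKE_PACKAGES`, and `DictFinder` are hypothetical names, an in-memory dict replaces the HDF5 store, and a module-level `__getattr__` replaces the trick of placing a `Node` object directly in `sys.modules`.

```python
import importlib.abc
import importlib.util
import sys

ROOT = "fake_quilt_data"  # hypothetical top-level name, not part of quilt

# Hypothetical in-memory stand-in for the on-disk package store.
FAKE_PACKAGES = {
    "wine": {"quality": {"red": [1, 2, 3], "white": [4, 5]}},
}


class Node:
    """Lazily resolves attribute access against a nested dict."""

    def __init__(self, tree, prefix=""):
        self._tree = tree
        self._prefix = prefix

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            child = self._tree[name]
        except KeyError:
            msg = "No such table or group: %s/%s" % (self._prefix, name)
            raise AttributeError(msg) from None
        if isinstance(child, dict):
            return Node(child, self._prefix + "/" + name)  # a group
        return child  # a leaf (a data frame in quilt, a list here)


class DictFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serves `fake_quilt_data` and `fake_quilt_data.<package>`."""

    def find_spec(self, fullname, path=None, target=None):
        parts = fullname.split(".")
        if parts[0] != ROOT or len(parts) > 2:
            return None  # not ours
        if len(parts) == 2 and parts[1] not in FAKE_PACKAGES:
            return None  # unknown package
        # The root acts as a package so that submodule imports are attempted.
        return importlib.util.spec_from_loader(
            fullname, self, is_package=(len(parts) == 1))

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        parts = module.__name__.split(".")
        if len(parts) == 2:
            node = Node(FAKE_PACKAGES[parts[1]])
            # Module-level __getattr__ forwards lookups to the Node.
            module.__getattr__ = lambda name: getattr(node, name)


sys.meta_path.append(DictFinder())

from fake_quilt_data.wine import quality
print(quality.red)
```

Keeping a real module object and forwarding attribute lookups avoids relying on the import system accepting a non-module entry in `sys.modules`, at the cost of one extra indirection per lookup.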
diff --git a/quilt/test/__init__.py b/quilt/test/__init__.py
diff --git a/quilt/test/build.yml b/quilt/test/build.yml
@@ -0,0 +1,5 @@
+---
+tables:
+  csv: [csv, data/10KRows13Cols.csv]
+  tsv: [tsv, data/10KRows13Cols.tsv]
+  xls: [xlsx, data/10KRows13Cols.xlsx]
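The test `build.yml` above pairs a parser name with each input file, and the README lists `csv`, `tsv`, and `ssv` as the delimited-text parsers. One way to picture that mapping is a lookup from parser name to delimiter. The sketch below assumes exactly that and uses only the standard library. `DELIMITERS` and `parse_text` are hypothetical names, and quilt itself produces pandas data frames rather than lists of rows.

```python
import csv
import io

# Assumed mapping from the README's parser names to field delimiters.
DELIMITERS = {"csv": ",", "tsv": "\t", "ssv": ";"}


def parse_text(parser, text):
    """Parse delimited text into a list of rows using the named parser."""
    try:
        delimiter = DELIMITERS[parser]
    except KeyError:
        raise ValueError("Unsupported parser: %r" % parser) from None
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = parse_text("ssv", "a;b\n1;2\n")
print(rows)
```

Because the parser is chosen by name rather than by file extension, a `.txt` file holding comma-separated data can still be read with `parse_text("csv", ...)`, which mirrors the README's note about `parser` and extension differing.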