
sampathweb/quilt



OS        Python support
Linux     2.7, 3.5, 3.6
Windows   3.5, 3.6

Docs

Visit docs.quiltdata.com, or browse the docs on GitHub.

Quilt is a data registry

Quilt provides versioned, reusable building blocks for analysis in the form of data packages. A data package may contain data of any type or size. In spirit, Quilt does for data what package managers and Docker registries do for code: provide a centralized, collaborative store of record.

Benefits

  • Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
  • Collaboration and transparency - Data likes to be shared. Quilt offers a centralized data warehouse for finding and sharing data.
  • Auditing - The registry tracks all reads and writes so that admins know when data are accessed or changed.
  • Less data prep - The registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
  • Deduplication - Data fragments are hashed with SHA256. Duplicate data fragments are written to disk only once per user. As a result, large, repeated data fragments consume less disk space and network bandwidth.
  • Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
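
The deduplication benefit above can be illustrated with a minimal sketch of a content-addressed store: each fragment is saved under its SHA-256 digest, so identical bytes map to the same file and are written only once. The `FragmentStore` class, its methods, and the on-disk layout are hypothetical and chosen for illustration; they are not Quilt's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path


class FragmentStore:
    """Hypothetical content-addressed store: one file per unique SHA-256 digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Write a fragment if it is new and return its hex digest."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():
            # Identical bytes hash to the same name, so a duplicate skips the write.
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Read a fragment back by its digest."""
        return (self.root / digest).read_bytes()


def demo() -> tuple:
    """Store a duplicate and a distinct fragment, then count files on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        store = FragmentStore(Path(tmp) / "objs")
        first = store.put(b"a,b\n1,2\n")
        second = store.put(b"a,b\n1,2\n")  # same bytes as the first fragment
        third = store.put(b"a,b\n3,4\n")   # different bytes
        files_on_disk = len(list(store.root.iterdir()))
        return first == second, first != third, files_on_disk
```

Calling `demo()` stores three fragments but leaves only two files on disk, because the repeated fragment resolves to an existing digest.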

Commands

Here are the basic Quilt commands:

Service

Quilt is offered as a managed service at quiltdata.com.

Architecture

Quilt consists of three source-level components:

  1. A data catalog

    • Displays package metadata in HTML
    • Implemented in JavaScript with Redux and redux-saga
  2. A data registry

    • Controls permissions
    • Stores package fragments in blob storage
    • Stores package metadata
    • De-duplicates repeated data fragments
    • Implemented in Python with Flask and PostgreSQL
  3. A data compiler

    • Serializes tabular data to Apache Parquet
    • Transforms and parses files
    • Builds packages locally
    • Pushes packages to the registry
    • Pulls packages from the registry
    • Implemented in Python with pandas and PyArrow
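
To make the division of labor between compiler and registry more concrete, here is a rough sketch of how a push and pull could work: the package is described by a manifest that maps logical names to fragment digests, the manifest's own hash serves as an unambiguous version reference, and only fragments the registry has not yet seen are stored. Everything here (`InMemoryRegistry`, `package_hash`, the manifest layout) is an assumption made for illustration and does not reflect Quilt's real API, wire format, or storage schema.

```python
import hashlib
import json


def fragment_hash(data: bytes) -> str:
    """SHA-256 digest of a single data fragment."""
    return hashlib.sha256(data).hexdigest()


def package_hash(manifest: dict) -> str:
    """Hash a manifest (logical name -> fragment digest) in canonical form."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryRegistry:
    """Hypothetical stand-in for the registry's blob storage and metadata store."""

    def __init__(self) -> None:
        self.blobs = {}     # fragment digest -> bytes
        self.packages = {}  # (package name, package hash) -> manifest

    def push(self, name: str, files: dict) -> tuple:
        """Store a package; return (version hash, number of new fragments stored)."""
        manifest = {}
        uploaded = 0
        for logical_name, data in files.items():
            digest = fragment_hash(data)
            if digest not in self.blobs:  # skip fragments already stored
                self.blobs[digest] = data
                uploaded += 1
            manifest[logical_name] = digest
        version = package_hash(manifest)
        self.packages[(name, version)] = manifest
        return version, uploaded

    def pull(self, name: str, version: str) -> dict:
        """Return the exact contents recorded for a package version."""
        manifest = self.packages[(name, version)]
        return {logical: self.blobs[digest] for logical, digest in manifest.items()}
```

For example, pushing `{"a.csv": ..., "b.csv": ...}` and then pushing again with only `b.csv` changed yields a new version hash while storing just one additional fragment; pulling the first hash still returns the original contents.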

