Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at h2oai.github.io/db-benchmark.
We focus mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and data complexity.
Contribution and feedback are very welcome!
- groupby
- join
- sort
- read
- dask
- data.table
- dplyr
- DataFrames.jl
- pandas
- (py)datatable
- spark
- cuDF
- ClickHouse (`join` not yet added)
More solutions have been proposed. Some of them are not yet mature enough to address the benchmark questions well (e.g. modin). Others have not yet been evaluated or implemented. The status of all of them can be tracked in dedicated issues labelled as "new solution" in the project repository.
- edit `path.env` and set `julia` and `java` paths
- if solution uses python create new `virtualenv` as `$solution/py-$solution`, example for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (if needed activate each `virtualenv`)
- edit `run.conf` to define solutions and tasks to benchmark
- generate data, for `groupby` use `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary data where needed, create `data` directory and keep all data files there
- edit `_control/data.csv` to define data sizes to benchmark using `active` flag
- start benchmark with `./run.sh`
- generate data (see related point above)
- set data name env var, for example in `groupby` use something like `export SRC_GRP_LOCAL=G1_1e7_1e2_0_0`
- if solution uses python activate the `virtualenv` of that solution
- enter interactive console and run lines of script interactively
- `cuDF`
  - use `conda` instead of `virtualenv`
- `ClickHouse`
  - generate data having extra primary key column according to `clickhouse/setup-clickhouse.sh`
  - follow "reproduce interactive environment" section from `clickhouse/setup-clickhouse.sh`
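For a python solution the interactive steps might be sketched as follows (the script path `pandas/groupby-pandas.py` is an assumption about the repository layout, not something stated above):

```shell
#!/bin/sh
# point the benchmark script at the generated data (name without .csv)
export SRC_GRP_LOCAL=G1_1e7_1e2_0_0

# for a python solution, activate its virtualenv first
. pandas/py-pandas/bin/activate

# open the console and paste lines from the solution's groupby script,
# e.g. from pandas/groupby-pandas.py (assumed path)
python
```

For `cuDF` the activation line would be replaced by `conda activate` of the corresponding environment.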
- setting up r3-8xlarge (244GB RAM, 32 cores): Amazon EC2 for beginners
- (slightly outdated) full reproduce script on clean Ubuntu 16.04: `_utils/repro.sh`
Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we impose a timeout on each benchmark script; once the timeout is reached the script is terminated. Please check issues labelled as "exceptions" in our repository for a list of issues/defects in solutions that prevent us from providing all timings.