Repository for reproducible benchmarking of database-like operations in a single-node environment.
The benchmark report is available at h2oai.github.io/db-benchmark.
We focus mainly on portability and reproducibility. The benchmark is re-run routinely to present up-to-date timings, and most of the solutions are automatically upgraded to their latest stable or development versions.
This benchmark is meant to compare scalability in both data volume and data complexity.
Contribution and feedback are very welcome!
Benchmarked tasks:

- groupby
- join
Benchmarked solutions:

- dask
- data.table
- dplyr
- DataFrames.jl
- pandas
- (py)datatable
- spark
- cuDF
- ClickHouse (`join` not yet added)
More solutions have been proposed; their status can be tracked in the issue tracker of our project repository using the "new solution" label.
To reproduce the full batch benchmark run (a sketch of the whole sequence follows this list):

- edit `path.env` and set the `julia` and `java` paths
- if a solution uses python, create a new virtualenv as `$solution/py-$solution`; for example, for pandas use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution following the `$solution/setup-$solution.sh` scripts
- edit `run.conf` to define the solutions and tasks to benchmark
- generate data; for `groupby` use `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary format where needed (see below), create a `data` directory and keep all data files there
- edit `_control/data.csv` to define the data sizes to benchmark using the `active` flag
- ensure SWAP is disabled and the ClickHouse server is not running
- start the benchmark with `./run.sh`
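A minimal sketch of that sequence, assuming the pandas solution and python 3.6 at `/usr/bin/python3.6`; the editor, python path, and swap command are assumptions, so adjust them to your machine:

```bash
# sketch of a batch run setup, with pandas as the example solution
$EDITOR path.env                               # set julia and java paths
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
./pandas/setup-pandas.sh                       # install the solution ($solution/setup-$solution.sh pattern)
$EDITOR run.conf                               # define solutions and tasks to benchmark
Rscript _data/groupby-datagen.R 1e7 1e2 0 0    # creates G1_1e7_1e2_0_0.csv
mkdir -p data && mv G1_1e7_1e2_0_0.csv data/
$EDITOR _control/data.csv                      # set the 'active' flag for data sizes
sudo swapoff -a                                # ensure SWAP is disabled
./run.sh
```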
To benchmark a single solution (a complete example invocation follows this list):

- install the solution software
  - for python we recommend using a `virtualenv` for better isolation
  - for R, ensure the library is installed in a solution subdirectory, so that `library("dplyr", lib.loc="./dplyr/r-dplyr")` or `library("data.table", lib.loc="./datatable/r-datatable")` works
  - note that some solutions may require another one to be installed to speed up csv data loading; for example, `dplyr` requires `data.table` and similarly `pandas` requires `(py)datatable`
- generate data using the `_data/*-datagen.R` scripts; for example, `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` creates `G1_1e7_1e2_0_0.csv`; put data files in the `data` directory
- run the benchmark for a single solution using `./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7`
- run other data cases by passing extra parameters `--k=1e2 --na=0 --sort=0`
- use `--quiet=true` to suppress the script's output and print timings only; use `--print=question,run,time_sec` to specify the columns printed to console, or `--print=*` to print all
- use `--out=time.csv` to write timings to a file rather than the console
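Putting those options together, a complete single-solution invocation could look like this (all flags are the ones described above):

```bash
# run data.table groupby on the 1e7 data case, print timings only
# and also write them to time.csv
./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
  --k=1e2 --na=0 --sort=0 \
  --quiet=true --print=question,run,time_sec \
  --out=time.csv
```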
Some solutions need extra care (a hedged sketch of the binary re-save step follows this list):

- cudf
  - use `conda` instead of `virtualenv`
- clickhouse
  - generate data having an extra primary key column according to `clickhouse/setup-clickhouse.sh`
  - follow the "reproduce interactive environment" section from `clickhouse/setup-clickhouse.sh`
- pydatatable
  - re-save csv join-1e9 data into `jay` format (should not be needed after h2oai/datatable#1750)
- dask
  - re-save csv groupby-1e9 and join-1e9 data into `parquet` format
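A hedged sketch of the re-save step, using plain datatable and dask calls from the solutions' virtualenvs; the 1e9 file names below follow the data generators' naming pattern but are only examples, so use the files you actually generated:

```bash
# re-save 1e9 csv data into jay (pydatatable) and parquet (dask) formats;
# file names are illustrative only
source pydatatable/py-pydatatable/bin/activate
python -c "import datatable as dt; dt.fread('data/J1_1e9_NA_0_0.csv').to_jay('data/J1_1e9_NA_0_0.jay')"
deactivate
# dask's to_parquet needs a parquet engine (pyarrow or fastparquet) installed
source dask/py-dask/bin/activate
python -c "import dask.dataframe as dd; dd.read_csv('data/G1_1e9_1e2_0_0.csv').to_parquet('data/G1_1e9_1e2_0_0.parquet')"
deactivate
```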
Example environment:

- setting up r3-8xlarge (244 GB RAM, 32 cores): Amazon EC2 for beginners
- (slightly outdated) full reproduce script on clean Ubuntu 16.04: `_utils/repro.sh`
Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer every question in every solution. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we added a timeout for a single benchmark script; once the timeout value is reached the script is terminated. Please check the "exceptions" label in our repository for the list of issues/defects in solutions that prevent us from providing all timings. There is also a "no documentation" label that lists issues blocked by missing documentation in the solutions we are benchmarking.