Repository for reproducible benchmarking of database-like operations.

The benchmark is mainly focused on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Benchmarked tasks:
- groupby
- join
- sort
- read

Benchmarked solutions:

- data.table
- dplyr
- pandas
- (py)datatable
- spark
- dask
- modin (not yet capable of groupby)

To reproduce the full batch benchmark run:

- if the solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (if needed, activate its `virtualenv` first)
- edit `run.conf` to define the tasks to benchmark
- generate data; for `groupby`, use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- edit `data.csv` to define the data sizes to benchmark
- start the benchmark with `./run.sh`; a sketch of the full sequence follows this list
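
For orientation, here is a minimal sketch of the batch sequence above, assuming the repository root as the working directory and python3.6 at `/usr/bin/python3.6`; how each solution is installed varies, so the `pip` line below is only an illustration for `pandas`:

```sh
# create a virtualenv for each python-based solution, e.g. pandas
virtualenv pandas/py-pandas --python=/usr/bin/python3.6

# install the solution inside its virtualenv
# (illustrative; actual install steps differ per solution)
source pandas/py-pandas/bin/activate
python -m pip install --upgrade pandas
deactivate

# generate groupby data: 1e7 rows, 1e2 groups -> G1_1e7_1e2.csv
Rscript groupby-datagen.R 1e7 1e2

# after editing run.conf (tasks) and data.csv (data sizes),
# launch the batch benchmark
./run.sh
```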

To benchmark a single solution on a single task:

- if the solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install the solution (if needed, activate its `virtualenv` first)
- generate data; for `groupby`, use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- start a single task and solution with `SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py`; see the sketch after this list
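
A minimal sketch of a single `pandas` `groupby` run under the same assumptions; `SRC_GRP_LOCAL` points the benchmark script at the generated csv:

```sh
# one-off environment and install for pandas, as above
# (the pip line is illustrative)
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
source pandas/py-pandas/bin/activate
python -m pip install --upgrade pandas

# generate the input data
Rscript groupby-datagen.R 1e7 1e2

# run one task for one solution; the input file is passed
# via an environment variable
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py
```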

Example environment:

- setting up r3-8xlarge (244GB RAM, 32 cores): Amazon EC2 for beginners
- full reproduce script on clean Ubuntu 16.04: repro.sh

Notes:

- Solution `modin` is not yet capable of performing the `groupby` task.
- It may happen that the report shows no date for the `spark` version in use. This is because SPARK-16864 was "resolved" as "Won't Fix", so we are unable to look up that information from the GitHub repo.
- Solution `dask` is currently not presented on the plot due to "groupby aggregation does not scale well with amount of groups"; the `groupby` script is in place, so anyone interested can run it already.
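
Assuming the `dask` script follows the same layout as the `pandas` one (the `./dask/groupby-dask.py` path is an assumption by analogy, not confirmed here), it can be launched the same way:

```sh
# hypothetical path, by analogy with ./pandas/groupby-pandas.py
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./dask/groupby-dask.py
```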