Repository for reproducible benchmarking of database-like operations.
The benchmark focuses mainly on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Benchmarked tasks:
- groupby
- join
- sort
- read

Benchmarked solutions:

- dask
- data.table
- dplyr
- juliadf (for status see #30)
- pandas
- (py)datatable
- spark
- modin (for status see #38)
To reproduce a full benchmark run:

- if a solution uses python, create a new virtualenv as `[$solution]/py-$solution`; for example, for pandas use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (activating its virtualenv first if needed)
- edit `run.conf` to define the tasks to benchmark
- generate data; for `groupby` use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- edit `data.csv` to define the data sizes to benchmark
- start the benchmark with `./run.sh` (a consolidated sketch follows this list)
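
The following is a minimal sketch of the steps above for a run that benchmarks only pandas. It assumes the repository root as the working directory, that `run.conf` and `data.csv` have already been edited by hand, and that installing the solution amounts to a `pip install` inside its virtualenv; the meaning of the datagen arguments is an assumption, not confirmed by this README.

```sh
# Sketch of a batch run benchmarking only pandas (assumes run.conf and
# data.csv were already edited to select tasks and data sizes).

# per-solution virtualenv for python-based solutions
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
source pandas/py-pandas/bin/activate
pip install pandas        # install the solution inside its virtualenv
deactivate

# generate groupby input data; the two arguments are assumed to be
# the row count (1e7) and the group cardinality (1e2)
Rscript groupby-datagen.R 1e7 1e2   # creates G1_1e7_1e2.csv

# run the benchmark as configured in run.conf and data.csv
./run.sh
```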
To run a single task and solution:

- if the solution uses python, create a new virtualenv as `[$solution]/py-$solution`; for example, for pandas use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install the solution (activating the virtualenv first if needed)
- generate data; for `groupby` use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- start a single task and solution with `SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py` (see the sketch after this list)
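
Chained together, a one-off pandas `groupby` run might look like the sketch below; the virtualenv and data-generation steps are the same as in the batch sketch, and `SRC_GRP_LOCAL` points the benchmark script at the generated file.

```sh
# Single task (groupby) with a single solution (pandas).
source pandas/py-pandas/bin/activate   # virtualenv created as above
Rscript groupby-datagen.R 1e7 1e2      # creates G1_1e7_1e2.csv
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py
deactivate
```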
Example environment:

- setting up r3.8xlarge (244 GB RAM, 32 cores): Amazon EC2 for beginners
- full reproduce script on clean Ubuntu 16.04: repro.sh
Notes:

- The `modin` solution is not yet capable of performing the `groupby` task.
- The report may occasionally lack a date for the corresponding `spark` version; SPARK-16864 was resolved as "Won't Fix", so we are unable to look up that information from the GitHub repo.
- The above issue currently also affects `juliadf`; this will hopefully be fixed by JuliaLang/Pkg.jl#793.