Spencerx/db-benchmark

Repository for reproducible benchmarking of database-like operations.
The benchmark focuses mainly on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Tasks

  • groupby
  • join
  • sort
  • read

Solutions

Reproduce

all tasks and all solutions

  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for pandas, for example, use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution (activating its virtualenv first if needed)
  • edit run.conf to define the tasks to benchmark
  • generate data; for the groupby task, use Rscript groupby-datagen.R 1e7 1e2 to create G1_1e7_1e2.csv
  • edit data.csv to define the data sizes to benchmark
  • start the benchmark with ./run.sh (a sketch of the full flow follows this list)
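A minimal end-to-end sketch of the steps above, using pandas as the example solution. The pip install line is an assumption (each solution documents its own installation); the other commands and file names come from the steps above.

```bash
# 1. Create a per-solution virtualenv and install the solution into it.
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
source pandas/py-pandas/bin/activate
python -m pip install pandas          # assumption: pandas installs via pip
deactivate

# 2. Define which tasks and solutions to benchmark.
"$EDITOR" run.conf

# 3. Generate input data for the groupby task: 1e7 rows, 1e2 groups.
Rscript groupby-datagen.R 1e7 1e2     # creates G1_1e7_1e2.csv

# 4. Define which data sizes to benchmark.
"$EDITOR" data.csv

# 5. Start the benchmark.
./run.sh
```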

single task and single solution

  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for pandas, for example, use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install the solution (activating its virtualenv first if needed)
  • generate data; for the groupby task, use Rscript groupby-datagen.R 1e7 1e2 to create G1_1e7_1e2.csv
  • run a single task and solution with SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py (a sketch follows this list)
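A minimal sketch of a single-task, single-solution run, assuming the pandas virtualenv from the previous section is already set up and activated:

```bash
# Generate the input data: 1e7 rows, 1e2 groups.
Rscript groupby-datagen.R 1e7 1e2                        # creates G1_1e7_1e2.csv

# Run only the pandas groupby script; the env var selects the input file.
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py
```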

Example environment

Acknowledgment

  • The modin solution is not yet capable of performing the groupby task.
  • The report may eventually lack a date for the corresponding spark version. This is because SPARK-16864 was resolved as "Won't Fix", so we are unable to look up that information from the GitHub repo.
  • The dask solution is currently not presented on the plot because "groupby aggregation does not scale well with amount of groups"; the groupby script is in place, so anyone interested can run it already.
