Spencerx/db-benchmark

Repository for reproducible benchmarking of database-like operations.
The benchmark is focused mainly on portability and reproducibility. It is meant to compare scalability in both data volume and data complexity.

Tasks

  • groupby
  • join
  • sort
  • read

Solutions

Reproduce

all tasks and all solutions

  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution (activating its virtualenv first if needed)
  • edit run.conf to define tasks to benchmark
  • generate data; for groupby, use Rscript groupby-datagen.R 1e7 1e2 to create G1_1e7_1e2.csv (1e7 rows, 1e2 groups)
  • edit data.csv to define data sizes to benchmark
  • start the benchmark with ./run.sh; a combined sketch of these steps follows below
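
Putting the steps above together, a minimal end-to-end sketch for the pandas solution; the pip install line is an assumption here, as each solution documents its own install procedure:

    # python solutions only: create and activate the virtualenv
    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    source pandas/py-pandas/bin/activate
    python -m pip install pandas   # assumed install step for this solution
    deactivate

    # generate groupby data: 1e7 rows, 1e2 groups -> G1_1e7_1e2.csv
    Rscript groupby-datagen.R 1e7 1e2

    # edit run.conf (tasks) and data.csv (data sizes), then run the benchmark
    ./run.sh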

single task and single solution

  • if the solution uses Python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install the solution (activating its virtualenv first if needed)
  • generate data; for groupby, use Rscript groupby-datagen.R 1e7 1e2 to create G1_1e7_1e2.csv (1e7 rows, 1e2 groups)
  • run the single task and solution with SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py; see the sketch below
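
For a one-off run, a minimal sketch, assuming the pandas virtualenv from the section above already exists:

    # activate the solution's virtualenv, if it uses one
    source pandas/py-pandas/bin/activate

    # generate the input data if not already present
    Rscript groupby-datagen.R 1e7 1e2

    # pass the data file via environment variable and run the script
    SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py

    deactivate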

Example environment

Acknowledgment

  • The modin solution is not yet capable of performing the groupby task.
  • It may happen that the report shows no date for the corresponding spark version. This is because SPARK-16864 was resolved as "Won't Fix", so we are unable to look up that information from the GitHub repo.
  • The above issue currently also affects juliadf; this will hopefully be fixed by JuliaLang/Pkg.jl#793.

