Repository for reproducible benchmarking of database-like operations.

The benchmark is mainly focused on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Benchmarked tasks:
- groupby
- join
- sort
- read

Benchmarked solutions:

- data.table
- dplyr
- pandas
- (py)datatable
- spark
- dask
- modin (not yet capable of groupby)

To reproduce the full batch benchmark run:

- if the solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (if needed, activate its `virtualenv` first)
- edit `run.conf` to define the tasks to benchmark
- generate data; for `groupby`, use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- edit `data.csv` to define the data sizes to benchmark
- start the benchmark with `./run.sh`; a sketch of the full sequence follows this list
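
For orientation, here is a minimal sketch of the batch sequence above, assuming the repository root as the working directory and python3.6 at `/usr/bin/python3.6`; how each solution is installed varies, so the `pip` line below is only an illustration for `pandas`:

```sh
# create a virtualenv for each python-based solution, e.g. pandas
virtualenv pandas/py-pandas --python=/usr/bin/python3.6

# install the solution inside its virtualenv
# (illustrative; actual install steps differ per solution)
source pandas/py-pandas/bin/activate
python -m pip install --upgrade pandas
deactivate

# generate groupby data: 1e7 rows, 1e2 groups -> G1_1e7_1e2.csv
Rscript groupby-datagen.R 1e7 1e2

# after editing run.conf (tasks) and data.csv (data sizes),
# launch the batch benchmark
./run.sh
```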

To benchmark a single solution on a single task:

- if the solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install the solution (if needed, activate its `virtualenv` first)
- generate data; for `groupby`, use `Rscript groupby-datagen.R 1e7 1e2` to create `G1_1e7_1e2.csv`
- start a single task and solution with `SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py`; see the sketch after this list
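
A minimal sketch of a single `pandas` `groupby` run under the same assumptions; `SRC_GRP_LOCAL` points the benchmark script at the generated csv:

```sh
# one-off environment and install for pandas, as above
# (the pip line is illustrative)
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
source pandas/py-pandas/bin/activate
python -m pip install --upgrade pandas

# generate the input data
Rscript groupby-datagen.R 1e7 1e2

# run one task for one solution; the input file is passed
# via an environment variable
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./pandas/groupby-pandas.py
```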

Example environment:

- setting up r3-8xlarge (244GB RAM, 32 cores): Amazon EC2 for beginners
- full reproduce script on clean Ubuntu 16.04: repro.sh

Notes:

- Solution `modin` is not yet capable of performing the `groupby` task.
- It may happen that the report shows no date for the `spark` version in use. This is because SPARK-16864 was "resolved" as "Won't Fix", so we are unable to look up that information from the GitHub repo.
- Solution `dask` is currently not presented on the plot due to "groupby aggregation does not scale well with amount of groups"; the `groupby` script is in place, so anyone interested can run it already.
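
Assuming the `dask` script follows the same layout as the `pandas` one (the `./dask/groupby-dask.py` path is an assumption by analogy, not confirmed here), it can be launched the same way:

```sh
# hypothetical path, by analogy with ./pandas/groupby-pandas.py
SRC_GRP_LOCAL=G1_1e7_1e2.csv ./dask/groupby-dask.py
```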