Readme with new PR protocol (#52)
* updated read me a little bit. need to test the actual process

* update report to reflect new machine type

* going to add time.csv and logs.csv to the main repo

* update .gitignore to start including time.csv

* add logs.csv and time.csv

* add VERSION for each file

* updates to README with new PR guidelines

* add report update request to index.Rmd as well

* update the regression.yml to remove time.csv and logs.csv. Remove q10 arrow result that doesn't work

* update readme and regression.yml

* Update regression.yml

remove time.csv and logs.csv from all regression test as well.

* final update to README

* small update to publish.sh

* update juliads setup
Tmonster authored Nov 3, 2023
1 parent dc4a11e commit 86ee579
Showing 25 changed files with 7,296 additions and 42 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/regression.yml
@@ -42,6 +42,10 @@ jobs:
shell: bash
run: ./_utils/generate-data-small.sh

- name: Remove old logs
shell: bash
run: rm time.csv logs.csv

- name: Install all solutions
shell: bash
run: source path.env && python3 _utils/install_all_solutions.py ${{ matrix.solution }}
@@ -114,6 +118,10 @@ jobs:
shell: bash
run: ./_utils/generate-data-small.sh

- name: Remove old logs
shell: bash
run: rm time.csv logs.csv

- name: Install all solutions
shell: bash
run: source path.env && python3 _utils/install_all_solutions.py all
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,12 +3,13 @@ metastore_db/*
*.log
*.html
*.csv
!time.csv
!logs.csv
*.md5
.Rproj.user
.Rhistory
db-benchmark.Rproj
*/REVISION
*/VERSION
token
.token
public/
34 changes: 33 additions & 1 deletion README.md
@@ -24,7 +24,10 @@ Contribution and feedback are very welcome!
- [x] [Polars](https://github.com/ritchie46/polars)
- [x] [Arrow](https://github.com/apache/arrow)
- [x] [DuckDB](https://github.com/duckdb/duckdb)
- [x] [DuckDB-latest](https://github.com/duckdb/duckdb)
- [x] [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl)
- [x] [InMemoryDatasets.jl](https://github.com/sl-solution/InMemoryDatasets.jl)
- [x] [DataFusion](https://github.com/apache/arrow-datafusion)

If you would like your solution to be included, feel free to file a PR with the necessary setup-_solution_/ver-_solution_/groupby-_solution_/join-_solution_ scripts. Once the team at DuckDB Labs approves the PR, it will be merged.

@@ -60,10 +63,39 @@ If you would like your solution to be included, feel free to file a PR with the
- call `SRC_DATANAME=G1_1e7_1e2_0_0 R`, if desired replace `R` with `python` or `julia`
- proceed pasting code from benchmark script
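
For example, a minimal interactive session for a python solution might look like the sketch below (the exact script name is an assumption based on the groupby-_solution_ naming convention):

```bash
# sketch of an interactive debugging session, run from the repository root
source path.env                        # put the solution runtimes on PATH
export SRC_DATANAME=G1_1e7_1e2_0_0     # dataset created by _utils/generate-data-small.sh
python3                                # or R / julia, depending on the solution
# then paste code from e.g. pandas/groupby-pandas.py step by step
```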

# Updating the benchmark

The benchmark will now be updated upon request. A request can be made by creating a PR that includes the items described below.

The PR **must** include
- updates to the time.csv and logs.csv files from a run on a c6id.metal machine. If you are re-enabling a query for a solution, you can include new times and logs for just that query; however, the version must match the currently reported version.
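
As a quick sanity check before opening the PR, you can list the versions recorded for your solution in the updated time.csv. This is a rough sketch, assuming time.csv keeps `solution` and `version` columns (as the `write.log()` fields in the solution scripts suggest):

```bash
# hedged sketch: print the distinct versions recorded for one solution in time.csv
sol=duckdb   # hypothetical solution name, replace with yours
s=$(head -1 time.csv | tr ',' '\n' | grep -nx solution | cut -d: -f1)
v=$(head -1 time.csv | tr ',' '\n' | grep -nx version | cut -d: -f1)
awk -F, -v s="$s" -v v="$v" -v sol="$sol" 'NR > 1 && $s == sol { print $v }' time.csv | sort -u
```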

The PR must include **one** of the following
- changes to a solution VERSION file (see the example below this list).
- changes to a solution groupby or join script. This can mean:
1. Loading the data differently
2. Changing settings for a solution.
3. Re-enabling a query for a solution
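
For illustration, the simplest qualifying change is a version bump for a single solution (the version number below is hypothetical):

```bash
# hypothetical example: request a re-run of duckdb-latest against a newer release
echo "0.9.2" > duckdb-latest/VERSION
```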

To facilitate creating an instance identical to the one used for the current results, the script `_utils/format_and_mount.sh` is provided. The script does the following (a rough sketch of the mount step follows this list):
1. Formats and mounts an nvme drive so that solutions have access to instance storage
2. Creates a new directory `db-benchmark-metal` on the nvme drive. This directory is a clone of the repository. Having a clone of the benchmark on the nvme drive enables the solutions to load the data faster (assuming you follow the steps to copy the data onto the nvme mount).
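
For reference, the format-and-mount step boils down to something like the sketch below; the device name and mount point are assumptions, and the authoritative commands live in `_utils/format_and_mount.sh`:

```bash
# rough sketch: format the instance-store NVMe drive and mount it
lsblk                               # identify the unformatted instance-store device, e.g. /dev/nvme1n1
sudo mkfs -t ext4 /dev/nvme1n1      # assumption: adjust the device name to your instance
sudo mkdir -p /mnt/nvme             # assumption: the mount point used by the script may differ
sudo mount /dev/nvme1n1 /mnt/nvme
sudo chown -R ubuntu /mnt/nvme      # the script then clones the repo into db-benchmark-metal here
```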

Once the `db-benchmark-metal` directory is created, you will need to
1. Create or generate all the datasets. The benchmark will not be updated if only a subset of datasets are tested.
   - If you call `./_utils/format_and_mount.sh -c` the datasets will be created for you. Creating every dataset will take more than an hour.
2. Install the solutions you wish to have updated. The {{solution}}/setup-{{solution}}.sh should have everything you need
3. Update the solution(s) groupby or join scripts with any desired changes
4. Run the benchmark for your solution against all datasets.
5. Generate the report to see how the results compare to other solutions. The report is generated automatically after a run; you can find it in `public`.
6. Create your PR! Include the updated time.csv and logs.csv files (see the end-to-end sketch below).
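
Put together, the update flow on a fresh c6id.metal instance might look roughly like this sketch (the mount point is an assumption and duckdb is used as a stand-in for your solution):

```bash
# rough end-to-end sketch of preparing a benchmark update request
./_utils/format_and_mount.sh -c          # format/mount the NVMe drive, clone the repo, generate all datasets
cd /mnt/nvme/db-benchmark-metal          # assumption: use whatever directory the script actually created
./duckdb/setup-duckdb.sh                 # install the solution(s) you want updated
# apply your changes to the solution's groupby/join scripts or VERSION file, then run everything
./run.sh                                 # produces time.csv and logs.csv; the report ends up in public/
# commit the new time.csv and logs.csv along with your changes and open the PR
```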

The PR will then be reviewed by the DuckDB Labs team, and we will run the benchmark again ourselves to validate the new results. If there are no open questions, we will merge the PR and publish a new report!


# Example environment

- setting up m4.10xlarge: 160GB RAM, 32 cores: [Amazon link](https://aws.amazon.com/ec2/instance-types/)
- setting up c6id.metal: 250GB RAM, 128 cores: [Amazon link](https://aws.amazon.com/ec2/instance-types/)
- Full reproduce script on clean Ubuntu 22.04: [_utils/repro.sh](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh)

# Acknowledgment
3 changes: 2 additions & 1 deletion _control/nodenames.csv
@@ -1,4 +1,5 @@
nodename,cpu_model,cpu_cores,memory_model,memory_gb,gpu_model,gpu_num,gpu_gb
mr-0xc11,Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz,20,DIMM DDR4 Synchronous 2133 MHz,125.80,,,
mr-dl11,Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz,40,DIMM Synchronous 2133 MHz,125.78,GeForce GTX 1080 Ti,2,21.83
m4.10xlarge,Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz,40,unkown,157,None,None,None
m4.10xlarge,Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz,40,unkown,157,None,None,None
c6id.metal,Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz,128,NVMe SSD,250,None,None,None
29 changes: 23 additions & 6 deletions _report/index.Rmd
@@ -12,7 +12,7 @@ output:

The code for this benchmark can be found at [https://github.com/duckdblabs/db-benchmark](https://github.com/duckdblabs/db-benchmark) and has been forked from [https://github.com/h2oai/db-benchmark](https://github.com/h2oai/db-benchmark).

This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against very latest versions of these packages and automatically updates. We provide this as a service to both developers of these packages and to users. You can find out more about the project in [_Efficiency in data processing_ slides](https://jangorecki.gitlab.io/r-talks/2019-12-26_Mumbai_Efficiency-in-data-processing/Efficiency-in-data-processing.pdf) and [talk made by Matt Dowle on H2OWorld 2019 NYC conference](https://www.youtube.com/watch?v=fZpA_cU0SPg).
This page aims to benchmark various database-like tools popular in open-source data science. It runs whenever a PR is opened requesting an update, provided the PR author has run the benchmark themselves. We provide this as a service to both developers of these packages and to users. You can find out more about the project in [_Efficiency in data processing_ slides](https://jangorecki.gitlab.io/r-talks/2019-12-26_Mumbai_Efficiency-in-data-processing/Efficiency-in-data-processing.pdf) and [talk made by Matt Dowle on H2OWorld 2019 NYC conference](https://www.youtube.com/watch?v=fZpA_cU0SPg).

We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks or not, and if the timing differences matter to you or not. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.

@@ -41,7 +41,7 @@ source("./_benchplot/benchplot-dict.R", chdir=TRUE)
ld = time_logs()
lld = ld[script_recent==TRUE]
# lld_nodename = as.character(unique(lld$nodename))
lld_nodename = "m4.10xlarge"
lld_nodename = "c6id.metal"
if (length(lld_nodename)>1L)
stop(sprintf("There are multiple different 'nodename' to be presented on single report '%s'", report_name))
lld_unfinished = lld[is.na(script_time_sec)]
@@ -260,23 +260,40 @@ rpivotTable::rpivotTable(
unusedAttrsVertical = TRUE
)
```
## Requesting an updated run

The benchmark will now be updated via PR requests. To publish new results for one or more solutions, open a PR with changes to the solution scripts or VERSION files, together with updates to the time.csv and logs.csv files from a run on a c6id.metal machine. To facilitate creating an instance identical to the one used for the current results, the script `_utils/format_and_mount.sh` is provided. The script does the following:

1. Formats and mounts an nvme drive so that solutions have access to instance storage
2. Creates a new directory `db-benchmark-metal` on the nvme drive. This directory is a clone of the repository

Once the `db-benchmark-metal` directory is created, you will need to
1. Create or generate all the datasets. The benchmark will not be updated if only a subset of datasets are tested.
2. Install the solutions you wish to have updated
3. Update the solution(s) groupby or join scripts with any desired changes
4. Run the benchmark on your solution
5. Generate the report to see how the results compare to other solutions
6. Create your PR! (make sure the new time.csv and logs.csv files are included!)

The PR will then be reviewed by the DuckDB Labs team, and we will run the benchmark ourselves to validate the new results. If there are no open questions, we will merge your PR and publish a new report!


## Notes


- You are welcome to run this benchmark yourself! all scripts related to setting up environment, data and benchmark are in the repository [repository](https://github.com/duckdblabs/db-benchmark).
- You are welcome to run this benchmark yourself! All scripts related to setting up the environment, data and benchmark are in the [repository](https://github.com/duckdblabs/db-benchmark), in the `_utils` directory.
- Data used to generate benchmark plots on this website can be obtained from [time.csv](./time.csv) (together with [logs.csv](./logs.csv)). See [_report/report.R](https://github.com/duckdblabs/db-benchmark/blob/master/_report/report.R) for quick introduction how to work with those.
- Solutions are using in-memory data storage to achieve best timing. In case a solution runs out of memory (we use 160 GB machine), it will use on-disk data storage if possible. In such a case solution name is denoted by a `*` suffix on the legend.
- Solutions are using in-memory data storage to achieve best timing. In case a solution runs out of memory (we use a 250GB machine), it will use NVMe storage if correctly set up. In such a case the solution name is denoted by a `*` suffix in the legend.
- ClickHouse and DuckDB queries are `CREATE TABLE ans AS SELECT ...` to match the functionality provided by other solutions in terms of caching results of queries, see [#151](https://github.com/h2oai/db-benchmark/issues/151).
- We ensure that calculations are not deferred by solution.
- Because of the above, at present the join timings of python datatable suffer from an extra deep copy. As a result of that extra overhead it additionally runs out of memory for the 1e9 q5 _big-to-big_ join.
- We also tested that answers produced by different solutions match each other; for details see [_utils/answers-validation.R](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/answers-validation.R).

## Environment configuration

- R 4.2.2
- R 4.3.2
- python 3.10
- Julia 1.9.2
- Julia 1.9.3

```{r environment_hardware}
pretty_component = function(x) gsub("_", " ", fixed=TRUE,
16 changes: 8 additions & 8 deletions _report/publish.sh
@@ -15,26 +15,26 @@ publishGhPages(){

## Reset gh-pages branch
git remote add upstream "git@github.com:duckdblabs/db-benchmark.git"
git fetch -q upstream gh-pages 2>err.txt
git fetch -q upstream gh-pages
rm -f err.txt
git checkout -q gh-pages
git reset -q --hard "4eadfc22cc86eade8c91f7809aae01a9753c4d90" 2>err.txt
git reset -q --hard "4eadfc22cc86eade8c91f7809aae01a9753c4d90"

rm -f err.txt
cp -r ../public/* ./
git add -A
git commit -q -m 'publish benchmark report' 2>err.txt
git commit -q -m 'publish benchmark report'
cp ../time.csv .
cp ../logs.csv .
git add time.csv logs.csv 2>err.txt
git add time.csv logs.csv
md5sum time.csv > time.csv.md5
md5sum logs.csv > logs.csv.md5
git add time.csv.md5 logs.csv.md5 2>err.txt
git add time.csv.md5 logs.csv.md5
gzip --keep time.csv
gzip --keep logs.csv
git add time.csv.gz logs.csv.gz 2>err.txt
git commit -q -m 'publish benchmark timings and logs' 2>err.txt
git push --force upstream gh-pages 2>err.txt
git add time.csv.gz logs.csv.gz
git commit -q -m 'publish benchmark timings and logs'
git push --force upstream gh-pages

cd ..

52 changes: 48 additions & 4 deletions _utils/format_and_mount.sh
100644 → 100755
@@ -11,9 +11,54 @@ sudo chown -R ubuntu db-benchmark-metal/
cd db-benchmark-metal
git clone https://github.com/duckdblabs/db-benchmark.git .

mkdir data
cd data
cp ~/db-benchmark/data/*.csv .
# if you have an EBS volume, you can generate the data once, save it on the ebs volume, and transfer it
# each time.

if [[ $# -gt 0 ]]
then
echo "Creating data"
mkdir -p ~/db-benchmark-metal/data/
cd ~/db-benchmark-metal/data/
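# groupby-datagen.R arguments match the dataset name G1_<nrow>_<k>_<na>_<sort>: row count, group cardinality, NA percentage, sorted flag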
echo "Creating 500mb group by datasets"
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e7 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e7 1e2 5 0
echo "Creating 5gb group by datasets"
Rscript ../_data/groupby-datagen.R 1e8 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e8 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e8 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e8 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e8 1e2 5 0
echo "Creating 50gb group by datasets"
Rscript ../_data/groupby-datagen.R 1e9 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e9 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e9 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e9 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e9 1e2 5 0
echo "Creating 500mb join datasets"
Rscript ../_data/join-datagen.R 1e7 0 0
Rscript ../_data/join-datagen.R 1e7 5 0
Rscript ../_data/join-datagen.R 1e7 0 1
echo "Creating 5gb join datasets"
Rscript ../_data/join-datagen.R 1e8 0 0
Rscript ../_data/join-datagen.R 1e8 5 0
Rscript ../_data/join-datagen.R 1e8 0 1
echo "Creating 50gb join datasets"
Rscript ../_data/join-datagen.R 1e9 0 0
cd ..
elif [[ ! -d ~/db-benchmark/data ]]
then
echo "No arguments passed and no existing data directory to copy from."
echo "ERROR: directory ~/db-benchmark/data does not exist"
else
mkdir -p ~/db-benchmark-metal/data/
cd ~/db-benchmark-metal/data/
echo "Copying data from ~/db-benchark/data"
cp ~/db-benchmark/data/*.csv
cd ~/db-benchmark-metal
fi


./_launcher/setup.sh
@@ -40,4 +85,3 @@ sudo cp clickhouse/clickhouse-mount-config.xml /etc/clickhouse-server/config.d/d
echo "------------------------------------------"
echo "------------------------------------------"
echo "READY TO RUN BENCHMARK. ./run.sh"

2 changes: 2 additions & 0 deletions _utils/generate-data-small.sh
@@ -3,6 +3,7 @@
mkdir -p data
cd data/
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e2 15 0
Rscript ../_data/join-datagen.R 1e7 0 0 0

cp G1_1e7_1e2_0_0.csv G1_1e9_1e2_0_0.csv
@@ -21,6 +22,7 @@ mv _control/data.csv _control/data.csv.original

echo "task,data,nrow,k,na,sort,active" > _control/data.csv
echo "groupby,G1_1e7_1e2_0_0,1e7,1e2,0,0,1" >> _control/data.csv
echo "groupby,G1_1e7_1e2_15_0,1e7,1e2,15,0,1" >> _control/data.csv
echo "groupby,G1_1e9_1e2_0_0,1e9,1e2,0,0,1" >> _control/data.csv
echo "join,J1_1e7_NA_0_0,1e7,NA,0,0,1" >> _control/data.csv
echo "join,J1_1e9_NA_0_0,1e9,NA,0,0,1" >> _control/data.csv
40 changes: 20 additions & 20 deletions arrow/groupby-arrow.R
@@ -219,26 +219,26 @@ rm(ans)
# print(tail(ans, 3))
# rm(ans)

question = "sum v3 count by id1:id6" # q10
t = system.time({
ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
rm(ans)
t = system.time({
ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
rm(ans)
# question = "sum v3 count by id1:id6" # q10
# t = system.time({
# ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
# print(dim(ans))
# })[["elapsed"]]
# m = memory_usage()
# chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
# write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
# rm(ans)
# t = system.time({
# ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
# print(dim(ans))
# })[["elapsed"]]
# m = memory_usage()
# chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
# write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
# ans <- collect(ans)
# print(head(ans, 3))
# print(tail(ans, 3))
# rm(ans)

cat(sprintf("grouping finished, took %.0fs\n", proc.time()[["elapsed"]]-task_init))

1 change: 1 addition & 0 deletions clickhouse/VERSION
@@ -0,0 +1 @@
23.9.1.1854
1 change: 1 addition & 0 deletions collapse/VERSION
@@ -0,0 +1 @@
2.0.3
1 change: 1 addition & 0 deletions dask/VERSION
@@ -0,0 +1 @@
2023.10.0
1 change: 1 addition & 0 deletions datatable/VERSION
@@ -0,0 +1 @@
1.14.9
1 change: 1 addition & 0 deletions dplyr/VERSION
@@ -0,0 +1 @@
1.1.3
1 change: 1 addition & 0 deletions duckdb-latest/VERSION
@@ -0,0 +1 @@
0.9.1.1
1 change: 1 addition & 0 deletions duckdb/VERSION
@@ -0,0 +1 @@
0.8.1.3
1 change: 1 addition & 0 deletions juliadf/VERSION
@@ -0,0 +1 @@
1.6.1
1 change: 1 addition & 0 deletions juliads/VERSION
@@ -0,0 +1 @@
0.7.18
2 changes: 1 addition & 1 deletion juliads/setup-juliads.sh
@@ -6,7 +6,7 @@ sudo mv julia-1.9.3 /opt
rm julia-1.9.3-linux-x86_64.tar.gz

# put to paths
echo 'export JULIA_HOME=/opt/julia-1.9.1' >> path.env
echo 'export JULIA_HOME=/opt/julia-1.9.3' >> path.env
echo 'export PATH=$PATH:$JULIA_HOME/bin' >> path.env
echo "export JULIA_NUM_THREADS=40" >> path.env
# note that cron job must have path updated as well
