Readme with new PR protocol (#52)
* updated read me a little bit. need to test the actual process

* update report to reflect new machine type

* going to add time.csv and logs.csv to the main repo

* update .gitignore to start including time.csv

* add logs.csv and time.csv

* add VERSION for each file

* updates to README with new PR guidelines

* add report update request to index.Rmd as well

* update the regression.yml to remove time.csv and logs.csv. Remove q10 arrow result that doesn't work

* update readme and regression.yml

* Update regression.yml

remove time.csv and logs.csv from all regression test as well.

* final update to README

* small update to publish.sh

* update juliads setup
Tmonster authored Nov 3, 2023
1 parent dc4a11e commit 86ee579
Showing 25 changed files with 7,296 additions and 42 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/regression.yml
@@ -42,6 +42,10 @@ jobs:
shell: bash
run: ./_utils/generate-data-small.sh

- name: Remove old logs
shell: bash
run: rm time.csv logs.csv

- name: Install all solutions
shell: bash
run: source path.env && python3 _utils/install_all_solutions.py ${{ matrix.solution }}
@@ -114,6 +118,10 @@ jobs:
shell: bash
run: ./_utils/generate-data-small.sh

- name: Remove old logs
shell: bash
run: rm time.csv logs.csv

- name: Install all solutions
shell: bash
run: source path.env && python3 _utils/install_all_solutions.py all
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,12 +3,13 @@ metastore_db/*
*.log
*.html
*.csv
!time.csv
!logs.csv
*.md5
.Rproj.user
.Rhistory
db-benchmark.Rproj
*/REVISION
*/VERSION
token
.token
public/
34 changes: 33 additions & 1 deletion README.md
@@ -24,7 +24,10 @@ Contribution and feedback are very welcome!
- [x] [Polars](https://github.com/ritchie46/polars)
- [x] [Arrow](https://github.com/apache/arrow)
- [x] [DuckDB](https://github.com/duckdb/duckdb)
- [x] [DuckDB-latest](https://github.com/duckdb/duckdb)
- [x] [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl)
- [x] [InMemoryDatasets.jl](https://github.com/sl-solution/InMemoryDatasets.jl)
- [x] [DataFusion](https://github.com/apache/arrow-datafusion)

If you would like your solution to be included, feel free to file a PR with the necessary setup-_solution_/ver-_solution_/groupby-_solution_/join-_solution_ scripts. Once the team at DuckDB Labs approves the PR, it will be merged.

@@ -60,10 +63,39 @@ If you would like your solution to be included, feel free to file a PR with the
- call `SRC_DATANAME=G1_1e7_1e2_0_0 R`, if desired replace `R` with `python` or `julia`
- proceed pasting code from benchmark script
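
For example, a minimal interactive session for a python solution might look like the sketch below (the exact script name is an assumption based on the groupby-_solution_ naming convention):

```bash
# sketch of an interactive debugging session, run from the repository root
source path.env                        # put the solution runtimes on PATH
export SRC_DATANAME=G1_1e7_1e2_0_0     # dataset created by _utils/generate-data-small.sh
python3                                # or R / julia, depending on the solution
# then paste code from e.g. pandas/groupby-pandas.py step by step
```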

# Updating the benchmark

The benchmark will now be updated upon request. A request can be made by creating a PR that includes the items described below.

The PR **must** include
- updates to the time.csv and logs.csv files from a run on a c6id.metal machine. If you are re-enabling a query for a solution, you can include new times and logs for just that query; however, the version must match the currently reported version.
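
As a quick sanity check before opening the PR, you can list the versions recorded for your solution in the updated time.csv. This is a rough sketch, assuming time.csv keeps `solution` and `version` columns (as the `write.log()` fields in the solution scripts suggest):

```bash
# hedged sketch: print the distinct versions recorded for one solution in time.csv
sol=duckdb   # hypothetical solution name, replace with yours
s=$(head -1 time.csv | tr ',' '\n' | grep -nx solution | cut -d: -f1)
v=$(head -1 time.csv | tr ',' '\n' | grep -nx version | cut -d: -f1)
awk -F, -v s="$s" -v v="$v" -v sol="$sol" 'NR > 1 && $s == sol { print $v }' time.csv | sort -u
```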

The PR must include **one** of the following
- changes to a solution VERSION file (see the example below this list).
- changes to a solution groupby or join script. This can mean:
1. Loading the data differently
2. Changing settings for a solution.
3. Re-enabling a query for a solution
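
For illustration, the simplest qualifying change is a version bump for a single solution (the version number below is hypothetical):

```bash
# hypothetical example: request a re-run of duckdb-latest against a newer release
echo "0.9.2" > duckdb-latest/VERSION
```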

To facilitate creating an instance identical to the one used for the current results, the script `_utils/format_and_mount.sh` is provided. The script does the following (a rough sketch of the mount step follows this list):
1. Formats and mounts an nvme drive so that solutions have access to instance storage
2. Creates a new directory `db-benchmark-metal` on the nvme drive. This directory is a clone of the repository. Having a clone of the benchmark on the nvme drive enables the solutions to load the data faster (assuming you follow the steps to copy the data onto the nvme mount).
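
For reference, the format-and-mount step boils down to something like the sketch below; the device name and mount point are assumptions, and the authoritative commands live in `_utils/format_and_mount.sh`:

```bash
# rough sketch: format the instance-store NVMe drive and mount it
lsblk                               # identify the unformatted instance-store device, e.g. /dev/nvme1n1
sudo mkfs -t ext4 /dev/nvme1n1      # assumption: adjust the device name to your instance
sudo mkdir -p /mnt/nvme             # assumption: the mount point used by the script may differ
sudo mount /dev/nvme1n1 /mnt/nvme
sudo chown -R ubuntu /mnt/nvme      # the script then clones the repo into db-benchmark-metal here
```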

Once the `db-benchmark-metal` directory is created, you will need to
1. Create or generate all the datasets. The benchmark will not be updated if only a subset of datasets are tested.
   - If you call `./_utils/format_and_mount.sh -c` the datasets will be created for you. Creating every dataset will take more than an hour.
2. Install the solutions you wish to have updated. The {{solution}}/setup-{{solution}}.sh should have everything you need
3. Update the solution(s) groupby or join scripts with any desired changes
4. Run the benchmark for your solution against all datasets.
5. Generate the report to see how the results compare to other solutions. The report is generated automatically after a run; you can find it in `public`.
6. Create your PR! Include the updated time.csv and logs.csv files (see the end-to-end sketch below).
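
Put together, the update flow on a fresh c6id.metal instance might look roughly like this sketch (the mount point is an assumption and duckdb is used as a stand-in for your solution):

```bash
# rough end-to-end sketch of preparing a benchmark update request
./_utils/format_and_mount.sh -c          # format/mount the NVMe drive, clone the repo, generate all datasets
cd /mnt/nvme/db-benchmark-metal          # assumption: use whatever directory the script actually created
./duckdb/setup-duckdb.sh                 # install the solution(s) you want updated
# apply your changes to the solution's groupby/join scripts or VERSION file, then run everything
./run.sh                                 # produces time.csv and logs.csv; the report ends up in public/
# commit the new time.csv and logs.csv along with your changes and open the PR
```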

The PR will then be reviewed by the DuckDB Labs team, and we will run the benchmark again ourselves to validate the new results. If there are no open questions, we will merge the PR and publish a new report!


# Example environment

- setting up m4.10xlarge: 160GB RAM, 32 cores: [Amazon link](https://aws.amazon.com/ec2/instance-types/)
- setting up c6id.metal: 250GB RAM, 128 cores: [Amazon link](https://aws.amazon.com/ec2/instance-types/)
- Full reproduce script on clean Ubuntu 22.04: [_utils/repro.sh](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh)

# Acknowledgment
3 changes: 2 additions & 1 deletion _control/nodenames.csv
@@ -1,4 +1,5 @@
nodename,cpu_model,cpu_cores,memory_model,memory_gb,gpu_model,gpu_num,gpu_gb
mr-0xc11,Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz,20,DIMM DDR4 Synchronous 2133 MHz,125.80,,,
mr-dl11,Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz,40,DIMM Synchronous 2133 MHz,125.78,GeForce GTX 1080 Ti,2,21.83
m4.10xlarge,Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz,40,unkown,157,None,None,None
m4.10xlarge,Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz,40,unkown,157,None,None,None
c6id.metal,Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz,128,NVMe SSD,250,None,None,None
29 changes: 23 additions & 6 deletions _report/index.Rmd
@@ -12,7 +12,7 @@ output:

The code for this benchmark can be found at [https://github.com/duckdblabs/db-benchmark](https://github.com/duckdblabs/db-benchmark) and has been forked from [https://github.com/h2oai/db-benchmark](https://github.com/h2oai/db-benchmark).

This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against very latest versions of these packages and automatically updates. We provide this as a service to both developers of these packages and to users. You can find out more about the project in [_Efficiency in data processing_ slides](https://jangorecki.gitlab.io/r-talks/2019-12-26_Mumbai_Efficiency-in-data-processing/Efficiency-in-data-processing.pdf) and [talk made by Matt Dowle on H2OWorld 2019 NYC conference](https://www.youtube.com/watch?v=fZpA_cU0SPg).
This page aims to benchmark various database-like tools popular in open-source data science. It runs whenever a PR is opened requesting an update, provided the PR author has run the benchmark themselves. We provide this as a service to both developers of these packages and to users. You can find out more about the project in [_Efficiency in data processing_ slides](https://jangorecki.gitlab.io/r-talks/2019-12-26_Mumbai_Efficiency-in-data-processing/Efficiency-in-data-processing.pdf) and [talk made by Matt Dowle on H2OWorld 2019 NYC conference](https://www.youtube.com/watch?v=fZpA_cU0SPg).

We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks or not, and if the timing differences matter to you or not. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.

@@ -41,7 +41,7 @@ source("./_benchplot/benchplot-dict.R", chdir=TRUE)
ld = time_logs()
lld = ld[script_recent==TRUE]
# lld_nodename = as.character(unique(lld$nodename))
lld_nodename = "m4.10xlarge"
lld_nodename = "c6id.metal"
if (length(lld_nodename)>1L)
stop(sprintf("There are multiple different 'nodename' to be presented on single report '%s'", report_name))
lld_unfinished = lld[is.na(script_time_sec)]
@@ -260,23 +260,40 @@ rpivotTable::rpivotTable(
unusedAttrsVertical = TRUE
)
```
## Requesting an updated run

The benchmark will now be updated via PR requests. To publish new results for one or more solutions, open a PR with changes to the solution scripts or VERSION files, together with updates to the time.csv and logs.csv files from a run on a c6id.metal machine. To facilitate creating an instance identical to the one used for the current results, the script `_utils/format_and_mount.sh` is provided. The script does the following:

1. Formats and mounts an nvme drive so that solutions have access to instance storage
2. Creates a new directory `db-benchmark-metal` on the nvme drive. This directory is a clone of the repository

Once the `db-benchmark-metal` directory is created, you will need to
1. Create or generate all the datasets. The benchmark will not be updated if only a subset of datasets are tested.
2. Install the solutions you wish to have updated
3. Update the solution(s) groupby or join scripts with any desired changes
4. Run the benchmark on your solution
5. Generate the report to see how the results compare to other solutions
6. Create your PR! (make sure the new time.csv and logs.csv files are included!)

The PR will then be reviewed by the DuckDB Labs team, and we will run the benchmark ourselves to validate the new results. If there are no open questions, we will merge your PR and publish a new report!


## Notes


- You are welcome to run this benchmark yourself! all scripts related to setting up environment, data and benchmark are in the repository [repository](https://github.com/duckdblabs/db-benchmark).
- You are welcome to run this benchmark yourself! All scripts related to setting up the environment, data and benchmark are in the [repository](https://github.com/duckdblabs/db-benchmark), in the `_utils` directory.
- Data used to generate benchmark plots on this website can be obtained from [time.csv](./time.csv) (together with [logs.csv](./logs.csv)). See [_report/report.R](https://github.com/duckdblabs/db-benchmark/blob/master/_report/report.R) for quick introduction how to work with those.
- Solutions are using in-memory data storage to achieve best timing. In case a solution runs out of memory (we use 160 GB machine), it will use on-disk data storage if possible. In such a case solution name is denoted by a `*` suffix on the legend.
- Solutions are using in-memory data storage to achieve best timing. In case a solution runs out of memory (we use a 250GB machine), it will use NVMe storage if correctly set up. In such a case the solution name is denoted by a `*` suffix in the legend.
- ClickHouse and DuckDB queries are `CREATE TABLE ans AS SELECT ...` to match the functionality provided by other solutions in terms of caching results of queries, see [#151](https://github.com/h2oai/db-benchmark/issues/151).
- We ensure that calculations are not deferred by solution.
- Because of the above, at present the join timings of python datatable suffer from an extra deep copy. As a result of that extra overhead it additionally runs out of memory for the 1e9 q5 _big-to-big_ join.
- We also tested that answers produced by different solutions match each other; for details see [_utils/answers-validation.R](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/answers-validation.R).

## Environment configuration

- R 4.2.2
- R 4.3.2
- python 3.10
- Julia 1.9.2
- Julia 1.9.3

```{r environment_hardware}
pretty_component = function(x) gsub("_", " ", fixed=TRUE,
16 changes: 8 additions & 8 deletions _report/publish.sh
@@ -15,26 +15,26 @@ publishGhPages(){

## Reset gh-pages branch
git remote add upstream "git@github.com:duckdblabs/db-benchmark.git"
git fetch -q upstream gh-pages 2>err.txt
git fetch -q upstream gh-pages
rm -f err.txt
git checkout -q gh-pages
git reset -q --hard "4eadfc22cc86eade8c91f7809aae01a9753c4d90" 2>err.txt
git reset -q --hard "4eadfc22cc86eade8c91f7809aae01a9753c4d90"

rm -f err.txt
cp -r ../public/* ./
git add -A
git commit -q -m 'publish benchmark report' 2>err.txt
git commit -q -m 'publish benchmark report'
cp ../time.csv .
cp ../logs.csv .
git add time.csv logs.csv 2>err.txt
git add time.csv logs.csv
md5sum time.csv > time.csv.md5
md5sum logs.csv > logs.csv.md5
git add time.csv.md5 logs.csv.md5 2>err.txt
git add time.csv.md5 logs.csv.md5
gzip --keep time.csv
gzip --keep logs.csv
git add time.csv.gz logs.csv.gz 2>err.txt
git commit -q -m 'publish benchmark timings and logs' 2>err.txt
git push --force upstream gh-pages 2>err.txt
git add time.csv.gz logs.csv.gz
git commit -q -m 'publish benchmark timings and logs'
git push --force upstream gh-pages

cd ..

52 changes: 48 additions & 4 deletions _utils/format_and_mount.sh
100644 → 100755
@@ -11,9 +11,54 @@ sudo chown -R ubuntu db-benchmark-metal/
cd db-benchmark-metal
git clone https://github.com/duckdblabs/db-benchmark.git .

mkdir data
cd data
cp ~/db-benchmark/data/*.csv .
# if you have an EBS volume, you can generate the data once, save it on the ebs volume, and transfer it
# each time.

if [[ $# -gt 0 ]]
then
echo "Creating data"
mkdir -p ~/db-benchmark-metal/data/
cd ~/db-benchmark-metal/data/
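# groupby-datagen.R arguments match the dataset name G1_<nrow>_<k>_<na>_<sort>: row count, group cardinality, NA percentage, sorted flag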
echo "Creating 500mb group by datasets"
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e7 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e7 1e2 5 0
echo "Creating 5gb group by datasets"
Rscript ../_data/groupby-datagen.R 1e8 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e8 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e8 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e8 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e8 1e2 5 0
echo "Creating 50gb group by datasets"
Rscript ../_data/groupby-datagen.R 1e9 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e9 1e1 0 0
Rscript ../_data/groupby-datagen.R 1e9 2e0 0 0
Rscript ../_data/groupby-datagen.R 1e9 1e2 0 1
Rscript ../_data/groupby-datagen.R 1e9 1e2 5 0
echo "Creating 500mb join datasets"
Rscript ../_data/join-datagen.R 1e7 0 0
Rscript ../_data/join-datagen.R 1e7 5 0
Rscript ../_data/join-datagen.R 1e7 0 1
echo "Creating 5gb join datasets"
Rscript ../_data/join-datagen.R 1e8 0 0
Rscript ../_data/join-datagen.R 1e8 5 0
Rscript ../_data/join-datagen.R 1e8 0 1
echo "Creating 50gb join datasets"
Rscript ../_data/join-datagen.R 1e9 0 0
cd ..
elif [[ ! -d ~/db-benchmark/data ]]
then
echo "No arguments passed and no existing data directory to copy from."
echo "ERROR: directory ~/db-benchmark/data does not exist"
else
mkdir -p ~/db-benchmark-metal/data/
cd ~/db-benchmark-metal/data/
echo "Copying data from ~/db-benchark/data"
cp ~/db-benchmark/data/*.csv
cd ~/db-benchmark-metal
fi


./_launcher/setup.sh
@@ -40,4 +85,3 @@ sudo cp clickhouse/clickhouse-mount-config.xml /etc/clickhouse-server/config.d/d
echo "------------------------------------------"
echo "------------------------------------------"
echo "READY TO RUN BENCHMARK. ./run.sh"

2 changes: 2 additions & 0 deletions _utils/generate-data-small.sh
@@ -3,6 +3,7 @@
mkdir -p data
cd data/
Rscript ../_data/groupby-datagen.R 1e7 1e2 0 0
Rscript ../_data/groupby-datagen.R 1e7 1e2 15 0
Rscript ../_data/join-datagen.R 1e7 0 0 0

cp G1_1e7_1e2_0_0.csv G1_1e9_1e2_0_0.csv
@@ -21,6 +22,7 @@ mv _control/data.csv _control/data.csv.original

echo "task,data,nrow,k,na,sort,active" > _control/data.csv
echo "groupby,G1_1e7_1e2_0_0,1e7,1e2,0,0,1" >> _control/data.csv
echo "groupby,G1_1e7_1e2_15_0,1e7,1e2,15,0,1" >> _control/data.csv
echo "groupby,G1_1e9_1e2_0_0,1e9,1e2,0,0,1" >> _control/data.csv
echo "join,J1_1e7_NA_0_0,1e7,NA,0,0,1" >> _control/data.csv
echo "join,J1_1e9_NA_0_0,1e9,NA,0,0,1" >> _control/data.csv
40 changes: 20 additions & 20 deletions arrow/groupby-arrow.R
@@ -219,26 +219,26 @@ rm(ans)
# print(tail(ans, 3))
# rm(ans)

question = "sum v3 count by id1:id6" # q10
t = system.time({
ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
rm(ans)
t = system.time({
ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
rm(ans)
# question = "sum v3 count by id1:id6" # q10
# t = system.time({
# ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
# print(dim(ans))
# })[["elapsed"]]
# m = memory_usage()
# chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
# write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
# rm(ans)
# t = system.time({
# ans <- collect(x %>% group_by(id1, id2, id3, id4, id5, id6) %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()))
# print(dim(ans))
# })[["elapsed"]]
# m = memory_usage()
# chkt = system.time(chk <- collect(summarise(ungroup(ans), v3=sum(v3), count=sum(bit64::as.integer64(count)))))[["elapsed"]]
# write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
# ans <- collect(ans)
# print(head(ans, 3))
# print(tail(ans, 3))
# rm(ans)

cat(sprintf("grouping finished, took %.0fs\n", proc.time()[["elapsed"]]-task_init))

1 change: 1 addition & 0 deletions clickhouse/VERSION
@@ -0,0 +1 @@
23.9.1.1854
1 change: 1 addition & 0 deletions collapse/VERSION
@@ -0,0 +1 @@
2.0.3
1 change: 1 addition & 0 deletions dask/VERSION
@@ -0,0 +1 @@
2023.10.0
1 change: 1 addition & 0 deletions datatable/VERSION
@@ -0,0 +1 @@
1.14.9
1 change: 1 addition & 0 deletions dplyr/VERSION
@@ -0,0 +1 @@
1.1.3
1 change: 1 addition & 0 deletions duckdb-latest/VERSION
@@ -0,0 +1 @@
0.9.1.1
1 change: 1 addition & 0 deletions duckdb/VERSION
@@ -0,0 +1 @@
0.8.1.3
1 change: 1 addition & 0 deletions juliadf/VERSION
@@ -0,0 +1 @@
1.6.1
1 change: 1 addition & 0 deletions juliads/VERSION
@@ -0,0 +1 @@
0.7.18
2 changes: 1 addition & 1 deletion juliads/setup-juliads.sh
@@ -6,7 +6,7 @@ sudo mv julia-1.9.3 /opt
rm julia-1.9.3-linux-x86_64.tar.gz

# put to paths
echo 'export JULIA_HOME=/opt/julia-1.9.1' >> path.env
echo 'export JULIA_HOME=/opt/julia-1.9.3' >> path.env
echo 'export PATH=$PATH:$JULIA_HOME/bin' >> path.env
echo "export JULIA_NUM_THREADS=40" >> path.env
# note that cron job must have path updated as well
