Suggestion: Include reshape benchmarks #3

Open
grantmcdermott opened this issue Apr 14, 2023 · 5 comments
Labels
Solution Include new solution

Comments

@grantmcdermott

grantmcdermott commented Apr 14, 2023

Stoked to see this back up and running!

(As an aside, the relentless performance gains of DuckDB are truly impressive.)

Two suggestions:

  1. Please consider the collapse R package. In my own set of benchmarks, collapse is typically at or near the top for various groupby operations on datasets on the order of 0.5-5 GB. (I haven't tested anything larger than that, and I should also say that it doesn't support join operations yet.) I can add a PR if interested. Closed via [WIP] New solution: r-collapse #33.
  2. There was talk over at the old repo of adding a set of reshape benchmarks. Personally, I think this would be great to have. See: reshape task (pivot, unpivot) h2oai/db-benchmark#175

Thanks again for all the effort in resurrecting this.

@grantmcdermott grantmcdermott changed the title Suggestions Suggestions: Include r-collapse and reshape benchmarks Apr 14, 2023
@Tmonster
Collaborator

Hi Grant, thank you for the suggestion!

I currently don't have much bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary setup-collapse.sh, ver-collapse.sh, upg-collapse.sh, groupby-collapse.R, and join-collapse.R, then I'd be happy to review it. A good place to start would be to copy the files in the dplyr folder of the benchmark and change the imported libraries. That will probably get you more than halfway.
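The copy-and-rename step suggested above could be sketched roughly as follows. This is only an illustration: the helper name `scaffold_solution` is hypothetical, and it assumes the dplyr scripts follow the same naming pattern as the collapse files listed above (for example `dplyr/setup-dplyr.sh`), which should be checked against the actual repository. The copied scripts would still need their library imports edited by hand.

```python
import shutil
from pathlib import Path

# File name patterns taken from the list above; "{s}" stands for the solution name.
TEMPLATES = [
    "setup-{s}.sh",
    "ver-{s}.sh",
    "upg-{s}.sh",
    "groupby-{s}.R",
    "join-{s}.R",
]


def scaffold_solution(repo_root, source="dplyr", target="collapse"):
    """Copy the source solution's scripts into a folder named after the target.

    Assumption: the scripts live at <repo_root>/<source>/ and are named
    after the solution. Files missing from the source folder are skipped.
    Returns the list of newly created paths.
    """
    src_dir = Path(repo_root) / source
    dst_dir = Path(repo_root) / target
    dst_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for template in TEMPLATES:
        src = src_dir / template.format(s=source)
        if not src.exists():
            continue  # do not invent files the source solution lacks
        dst = dst_dir / template.format(s=target)
        shutil.copyfile(src, dst)
        created.append(dst)
    return created
```

Calling `scaffold_solution(".")` from the repository root would create a `collapse` folder with renamed copies of whichever dplyr scripts exist.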

See repro.sh for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5 GB and 5 GB datasets, I'd be happy to merge your PR and re-run the benchmark to include results for collapse.

@Tmonster
Collaborator

As for the reshaping benchmarks, I think it's a great idea!

It would take a while to get those queries into the benchmark, however, as I would need to:

  1. Create new queries and datasets. (Although I believe the group by datasets could work well for this)
  2. Create new reshape-solution.* scripts for each of the solutions that support reshaping functionality
  3. Modify the report generation code to include reshape results

I would also like to rework the report generation code, as it was hard to track down bugs while re-running the benchmark. As mentioned in h2oai#175, however, I would be happy to review or collaborate on any PRs that help maintain and improve the benchmark!
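For readers unfamiliar with the reshape task (pivot, unpivot) referenced above, here is a minimal sketch of the two operations in plain Python. It is not one of the proposed benchmark queries, which have not been defined in this thread; the function names, the column names `id1`, `v1`, `v2`, and the choice to sum duplicate cells are all assumptions made for illustration.

```python
from collections import defaultdict


def unpivot(rows, id_cols, value_cols, var_name="variable", value_name="value"):
    """Wide to long: emit one output row per (input row, value column) pair."""
    out = []
    for row in rows:
        for col in value_cols:
            rec = {k: row[k] for k in id_cols}
            rec[var_name] = col
            rec[value_name] = row[col]
            out.append(rec)
    return out


def pivot(rows, index, column, value):
    """Long to wide: one output row per index value, one column per distinct
    value found in `column`.

    Duplicate (index, column) pairs are summed here; that is only one of
    several possible aggregation choices.
    """
    wide = defaultdict(dict)
    for row in rows:
        cells = wide[row[index]]
        cells[row[column]] = cells.get(row[column], 0) + row[value]
    return [{index: key, **cells} for key, cells in wide.items()]


# Toy data; the column names are illustrative only.
wide_rows = [
    {"id1": "a", "v1": 1, "v2": 10},
    {"id1": "b", "v1": 2, "v2": 20},
]
long_rows = unpivot(wide_rows, id_cols=["id1"], value_cols=["v1", "v2"])
round_trip = pivot(long_rows, index="id1", column="variable", value="value")
```

Here `long_rows` holds four rows (one per id and value column), and pivoting it back reproduces `wide_rows`. A real reshape benchmark would run the equivalent operations in each solution (for example melt/dcast-style functions) on much larger data.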

@Tmonster Tmonster added the Solution Include new solution label Apr 19, 2023
@grantmcdermott

This comment was marked as resolved.

@SebKrantz

SebKrantz commented Sep 18, 2023

collapse author here. Thanks @grantmcdermott and @vincentarelbundock for the initiative! I'm happy to have collapse added to the benchmarks, and also happy to take any suggested code, but I would like to wait for the pending v2.0 release (which includes implementations of table joins and reshaping). I will also ensure the benchmarking code is equivalent to that of the other solutions (collapse has some unfavorable defaults, e.g. sort = TRUE, na.rm = TRUE, nthreads = 1). I expect v2.0 to be released within a month, and will then get back to this and submit a comprehensive PR, integrating what was suggested here.

@vincentarelbundock

Sounds good @SebKrantz.

You may want to use my PR as a starting point since most of the setup and group-by stuff is close to done.

FYI, the dplyr and data.table benchmarks use na.rm=TRUE, but you are right that the sort and nthreads arguments may need to be adjusted.

@grantmcdermott grantmcdermott changed the title Suggestions: Include r-collapse and reshape benchmarks Suggestion: Include reshape benchmarks Jan 5, 2024