-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
groupby task should test NAs in grouping columns (distinct result for NA group), and NAs in value columns too #40
Comments
this will probably need to wait for pandas-dev/pandas#3729 |
If a solution like pandas can't handle a test, then that is not a good reason to disable the test. It should be shown as a fail. Similarly, for example, data.table should be shown as a fail for 500GB whereas pydatatable and Spark will work. |
They can handle but require another syntax, so two test scripts for same test, or branching |
Yes it's a tough tradeoff.
But there'll be exceptions for other tests, not just this one. So capability to deal with exceptions needs to be added anyway. Although, in this case, I'm sure what you mean by exceptions or why an exception is needed: "two test scripts for same test, or branching if before each query". |
To be clear, this issue is about NAs in the grouping columns, right? i.e. dup of #54. If so, can the title be changed. |
Pandas issue is about NA in grouping columns, but we want to tests NAs in all columns. |
pandas finally implemented NA support in groupby: pandas-dev/pandas#3729 (comment) |
According to this information https://stackoverflow.com/questions/63057886/clickhouse-string-field-disk-usage-null-vs-empty |
Interesting that it still doesn't work for pandas. Till pandas-dev/pandas#36327 is resolved we will have to escape this script for pandas because it will produce incorrect results. Dask inherits this issue from pandas as well. |
Groupby benchmark scripts has been amended, wherever possible, to provide NA-aware processing. Where it was not possible (pandas and dask) it will skip this data case. When this capability will be implemented there, then we will remove skipping this data cases to enable it. Below is dummy data we will need to use for q8 validation to ensure it behaves as expected (described in 5bc7113#comments):
What is left now is to run all and tweak reports to handle new change. |
This has been implemented. Function calls are now NA-aware, in all data cases. Over the course of development, the following issues has been encountered: Rdatatable/data.table#4849 They are blocking data case having NAs for pandas and dask, and in case of python datatable, only 1e9 size data case having NA. Issue is now resolved. I would like to acknowledge valuable feedback I got from Julia developers (@bkamins and @nalimilan) during implementation. |
In 73647e5 I removed initial filtering out of all NAs variables in q9. It is to align answers between solutions, and to keep all groups (if a group was full of NAs then it would have been filtered out). |
Fix groupby setup and join scripts for duckdb and duckdb latest
No description provided.
The text was updated successfully, but these errors were encountered: