☂ Statistics streamlining #961

Open
3 of 9 tasks
Jolanrensen opened this issue Nov 21, 2024 · 5 comments
Labels: bug, ☂ umbrella issue

Comments

@Jolanrensen
Collaborator

Jolanrensen commented Nov 21, 2024

Continuation of #558 which fixed the most annoying bugs related to describe.

See #558 for more information.

Our statistics functions need some more love. We used to have many missing types (mostly fixed by #937), but there are still some inconsistencies to resolve:

As mentioned in #543, some functions like median(ints) might return an unexpectedly rounded Int. It might be better to let all functions return Double and handle BigInteger / BigDecimal separately for now, as they're Java-specific.

There are plenty of public overloads on Iterable and Sequence. It's fine to have them internally, but I feel like we're clogging the public scope here. mean, for instance, is already covered in the stdlib (as average()).

We'll need to hide public functions that are not on DataColumn as @AndreiKingsley will probably make a statistics library for that anyway.

We need to honor a conversion table (see below).

We won't support UByte, UShort, UInt, and ULong since they don't inherit Number.

We also drop support for BigInteger and BigDecimal, as this makes generic typing and conversion very difficult and unpredictable.
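The rounding concern with median(ints) raised in the list above can be illustrated with a small sketch. Both functions below are hypothetical and are not the library's implementation; in particular, the use of truncating integer division for the Int variant is an assumption about how an Int-typed median might end up rounded.

```kotlin
// Hypothetical Int-typed median: for an even-sized list the two middle
// values are averaged with integer division, so the fraction is lost.
fun medianAsInt(values: List<Int>): Int? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2
}

// Hypothetical Double-typed median: the average of the two middle values
// keeps its fractional part, and an empty input yields Double.NaN.
fun medianAsDouble(values: List<Int>): Double {
    if (values.isEmpty()) return Double.NaN
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) {
        sorted[mid].toDouble()
    } else {
        (sorted[mid - 1].toDouble() + sorted[mid].toDouble()) / 2.0
    }
}
```

For the input listOf(1, 2), the Int variant returns 1 while the Double variant returns 1.5, which is the kind of surprise the proposal to return Double is meant to avoid.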

Progress:

| Function | Conversion | Extra information | Nulls in input |
|---|---|---|---|
| mean | Int -> Double | For all: Double.NaN if no elements | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double.NaN | | |
| sum | Int -> Int | All default to zero if no values | All nulls are filtered out |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Float | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Number | skipNaN option, false by default | |
| | Nothing / no values -> Double (0.0) | | |
| cumSum | Int -> Int | All default to zero if no values | All can optionally skip nulls in input with skipNull option, true by default (important because order matters with cumSum) |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, true by default | |
| | Float -> Float | skipNaN option, true by default | |
| | Number -> Conversion(Common number type) -> Number | skipNaN option, true by default | |
| | Nothing / no values -> Double (0.0) | | |
| min/max | `T -> T? where T : Comparable<T>` | For all: null if no elements, has -OrNull overloads | All nulls are filtered out |
| | Int -> Int? | | |
| | Short -> Short? | | |
| | Byte -> Byte? | | |
| | Long -> Long? | | |
| | Double -> Double? | skipNaN option, false by default, returns NaN when in the input | |
| | Float -> Float? | skipNaN option, false by default, returns NaN when in the input | |
| | Number -> Number? | Would need more overloads and more work | |
| | Nothing / no values -> Nothing? (null) | | |
| | (Don't convert Short/Byte to Int!) | | |
| median/percentile | `T -> T? where T : Comparable<T>` | For all: median of even list will cause conversion to Double if possible, else lower middle, and Double.NaN or null if no elements | All nulls are filtered out |
| | Int -> Double | | |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | | |
| | Float -> Double | | |
| | Number -> Conversion(Common number type) -> Double | Would need more overloads and more work | |
| | Nothing / no values -> Nothing? (null) | | |
| std | Int -> Double | All have DDoF (Delta Degrees of Freedom) argument, and Double.NaN if no elements | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double.NaN | | |
| var (want to add?) | same as std | | |
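To make a few rows of the table concrete, here is a minimal sketch over plain lists. The names meanSketch, sumSketch and stdSketch are hypothetical and do not exist in the library; the sketch only mirrors the documented rules (nulls filtered out, skipNaN false by default, Short -> Int for sum, Double.NaN or zero for empty input). The table does not state a default for the DDoF argument, so it is a required parameter here.

```kotlin
import kotlin.math.sqrt

// mean row: nulls are dropped; NaN propagates unless skipNaN = true.
fun meanSketch(values: List<Double?>, skipNaN: Boolean = false): Double {
    val present = values.filterNotNull()
    val used = if (skipNaN) present.filterNot { it.isNaN() } else present
    return used.average() // average() of an empty list is Double.NaN
}

// sum row: Short -> Int, nulls are dropped, zero if there are no values.
fun sumSketch(values: List<Short?>): Int =
    values.filterNotNull().sumOf { it.toInt() }

// std row: nulls are dropped, Double.NaN if there are no elements.
// ddof is subtracted from the element count in the divisor.
fun stdSketch(values: List<Double?>, ddof: Int): Double {
    val present = values.filterNotNull()
    if (present.isEmpty()) return Double.NaN
    val mean = present.average()
    val squaredDeviations = present.sumOf { (it - mean) * (it - mean) }
    return sqrt(squaredDeviations / (present.size - ddof))
}
```

With these definitions, meanSketch(listOf(1.0, Double.NaN)) is NaN, while passing skipNaN = true gives 1.0, and sumSketch on an empty list gives 0.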
Jolanrensen added the bug and ☂ umbrella issue labels on Nov 21, 2024
Jolanrensen added this to the 0.16.0 milestone on Nov 21, 2024
Jolanrensen self-assigned this on Nov 21, 2024
@Jolanrensen
Collaborator Author

Also see #961

@Jolanrensen
Collaborator Author

Check all AnyRow.rowXXX functions, like rowMean, rowMin, etc.

rowMinOrNull, for instance, is defined like:

public fun AnyRow.rowMinOrNull(): Any? = values().filterIsInstance<Comparable<*>>().minWithOrNull(compareBy { it })

This will break if you have a Number and String column in your row. While they both are Comparable, they are not comparable to each other. We probably need to expand the interComparableColumns() or valuesAreComparable() function for these cases.
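The failure described above can be reproduced without the library. In the sketch below, naiveRowMinOrNull restates the quoted definition over a plain List<Any?>, and guardedRowMinOrNull is a hypothetical helper (not the library's interComparableColumns() or valuesAreComparable()) that rejects rows whose comparable values do not share one runtime class. That same-class check is a deliberate simplification: it would also reject a mix of Int and Double, which a real fix would likely want to handle.

```kotlin
// The quoted definition, restated over a plain list so it runs standalone.
fun naiveRowMinOrNull(values: List<Any?>): Any? =
    values.filterIsInstance<Comparable<*>>().minWithOrNull(compareBy { it })

// Hypothetical guard: only compare when all comparable values share a class.
fun guardedRowMinOrNull(values: List<Any?>): Any? {
    val comparables = values.filterIsInstance<Comparable<*>>()
    val classes = comparables.map { it.javaClass }.distinct()
    require(classes.size <= 1) { "Values are not mutually comparable: $classes" }
    return comparables.minWithOrNull(compareBy { it })
}
```

Calling naiveRowMinOrNull(listOf(1, "a")) fails at runtime with a ClassCastException, because Int.compareTo is handed a String. The guarded variant fails earlier with an IllegalArgumentException that names the offending classes.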

@AndreiKingsley
Collaborator

#1060 adds percentile, which is similar to all these functions and inherits all the above problems. After the merge, we will have to fix all of this for it as well.

@Jolanrensen
Collaborator Author

I've adjusted the table. We can support auto-conversion of mixed number types to Double, except when a BigInteger or BigDecimal is among the values. Converting a big number to Double is lossy and can result in infinities. It's best to throw an exception and tell users to first convert all their values to BigDecimal and then call the BigDecimal -> BigDecimal overload of the function.
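A minimal sketch of that rule, assuming a hypothetical helper named meanOfMixedNumbers (not a library function): primitive number types are converted to Double, while any BigInteger or BigDecimal causes an exception instead of a silent lossy conversion.

```kotlin
import java.math.BigDecimal
import java.math.BigInteger

// Hypothetical helper: mean over mixed Number types, converted to Double.
// Nulls are filtered out; big numbers are rejected rather than converted.
fun meanOfMixedNumbers(values: Iterable<Number?>): Double {
    val doubles = values.filterNotNull().map { value ->
        when (value) {
            is BigInteger, is BigDecimal -> throw IllegalArgumentException(
                "Found ${value.javaClass.simpleName}; convert all values to BigDecimal first."
            )
            else -> value.toDouble()
        }
    }
    return doubles.average() // Double.NaN when there are no values
}

// Why the conversion is rejected: a large BigDecimal overflows to infinity.
val overflowed: Double = BigDecimal("1e400").toDouble()
```

Here meanOfMixedNumbers(listOf(1, 2L, 3.0f)) evaluates to 2.0, whereas including BigDecimal.ONE in the list throws. The overflowed value is infinite, which is the lossy behavior the comment warns about.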

@Jolanrensen
Collaborator Author

Jolanrensen commented Feb 19, 2025

#1068
