Docs say CORRELATION returns Pearson coeff for numeric columns, but returns R^2 #487
Comments
I wonder why we first compute r and then yield r^2 instead of just yielding r. I'm sure when I wrote that code I was simply mimicking past behaviour, but I have no idea why it was that way.
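To make the difference concrete, here is a minimal sketch in plain Python, not the project's actual code: the helper `pearson_r` and the sample data are made up for illustration. It shows that squaring r keeps the magnitude of the correlation but throws away its sign.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; not the BayesDB implementation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


# Made-up data: one column rises with xs, the other is its mirror image.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = [2.0, 4.1, 5.9, 8.2, 9.8]
inverse = [-y for y in direct]

r_direct = pearson_r(xs, direct)    # positive: direct relationship
r_inverse = pearson_r(xs, inverse)  # negative: inverse relationship

# The squared values coincide, so r^2 alone cannot tell the two cases apart.
print(r_direct, r_inverse)
print(r_direct ** 2, r_inverse ** 2)
```

Yielding r and letting the caller square it would preserve both pieces of information; yielding r^2 only preserves the magnitude.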
Pearson's coefficient is useful because it states the direction of the effect, i.e. whether one variable is directly or inversely correlated with the other. That said, one reason to yield r^2 is that there is no "r" for categorical variables. The datasets we analyze often include both numeric and categorical columns. If you want to plot both kinds of correlation on the same heatmap, the correlation metric should be on the same scale for both stattypes. If numerical variables' coefficients can be expressed as r but categorical ones can only be expressed as r^2, maybe that's why we chose r^2 for both. My vote is for Pearson's coefficient to be available somehow, in some circumstance, because it is what most researchers use and what we would compare our Bayesian dependencies to. I'm not sure what the best way to set this up is, though, given that variable types are mixed. There is also a judgment call to make on what to use for categorical variables, because there are multiple flavors of correlation coefficient. Once we make that call, maybe it can be documented somewhere so the next person understands the rationale the previous person had. I don't know enough about this to have a strong preference myself.
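As one illustration of a measure that puts a categorical column on the same 0-to-1 scale as r^2, the sketch below computes the squared correlation ratio (eta^2, the between-group share of variance) for a categorical column against a numeric one. This is only one of the possible flavors mentioned above and is an assumption for illustration, not a statement of what CORRELATION actually does; the function name and the data are hypothetical.

```python
from collections import defaultdict


def correlation_ratio_squared(categories, values):
    """Squared correlation ratio (eta^2) of a numeric column given a categorical one.

    Hypothetical helper for illustration; returns a value in [0, 1] and has no sign.
    """
    n = len(values)
    if n != len(categories) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    grand_mean = sum(values) / n

    # Group the numeric values by category label.
    groups = defaultdict(list)
    for label, value in zip(categories, values):
        groups[label].append(value)

    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("undefined for a constant numeric column")

    # Variation explained by differences between the category means.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return ss_between / ss_total


# Made-up data: the category almost fully determines the numeric value.
colors = ["red", "red", "blue", "blue", "green", "green"]
heights = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]
eta2 = correlation_ratio_squared(colors, heights)
print(eta2)
```

Because a quantity like this is unsigned by construction, mixing it with signed r values on one heatmap would put two different scales side by side, which is the consistency concern raised in the comment above.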
A simple solution is to implement both
The main purpose of CORRELATION is to make heat maps that we contrast with the much better-looking DEPENDENCE PROBABILITY heat maps, as a tool for finding plausibly related variables. For that purpose r^2 is more applicable than r: we need a consistent measure across all pairs of columns, and in these heat maps the sign of the correlation is of less interest than its magnitude.
Documentation is here: http://probcomp.csail.mit.edu/bayesdb/doc/bql.html#bql-expressions
Either functionality or doc should be changed.