Docs say CORRELATION returns Pearson coeff for numeric columns, but returns R^2 #487

versar · 2016-10-19T21:23:59Z

Documentation is here: http://probcomp.csail.mit.edu/bayesdb/doc/bql.html#bql-expressions

Either the functionality or the docs should be changed.

riastradh-probcomp · 2016-10-19T22:22:28Z

I wonder why we first compute r and then yield r^2 instead of just yielding r. I'm sure when I wrote that code I was simply mimicking past behaviour but I have no idea why it was that way.

versar · 2016-10-19T23:26:50Z

Pearson's coeff is useful b/c it states the direction of the effect, e.g. whether one variable is directly or inversely correlated to the other.
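To make the point about direction concrete, here is a minimal sketch in plain Python. `pearson_r` is a hypothetical helper written for this illustration, not the code behind CORRELATION, and the data are made up. It shows a perfectly increasing and a perfectly decreasing relationship getting opposite signs for r but the same r^2.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration only; not bayeslite's implementation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: one column rises with xs, the other falls.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2.0, 4.0, 6.0, 8.0, 10.0]
decreasing = [10.0, 8.0, 6.0, 4.0, 2.0]

r_up = pearson_r(xs, increasing)
r_down = pearson_r(xs, decreasing)

# r keeps the sign; r^2 discards it.
print(r_up, r_up ** 2)
print(r_down, r_down ** 2)
```

The two pairs differ only in the sign of r, which is exactly the information lost when r^2 is returned.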

That said, one reason to yield r^2 is that there is no "r" for categorical variables. The sets of columns in the datasets we analyze often include both numeric and categorical variables. If you want to plot both kinds of correlation on the same heatmap, the correlation metric should be on the same scale for both stattypes. If numerical variables' coefficients can be expressed as r but categorical ones can only be expressed as r^2, maybe that's why we chose to use r^2 for both.
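As an illustration of why a categorical column has no natural signed "r", here is a sketch of one possible r^2-like measure for a categorical/numeric pair: the squared correlation ratio (between-group sum of squares over total sum of squares, as in one-way ANOVA). This is an assumption chosen for the example, not necessarily what CORRELATION computes; `correlation_ratio_squared` and the data are hypothetical.

```python
from collections import defaultdict

def correlation_ratio_squared(categories, values):
    """Squared correlation ratio (eta^2) of a numeric column given a categorical one.

    Hypothetical helper for illustration only; one of several possible measures.
    """
    # Group the numeric values by category label.
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    # Total variation of the numeric column around its grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    # Variation explained by the category means.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in groups.values()
    )
    return ss_between / ss_total

# Made-up data: the category fully determines the numeric value.
colors = ["red", "red", "blue", "blue", "green", "green"]
heights = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]

eta2 = correlation_ratio_squared(colors, heights)

# Renaming the categories changes nothing: the labels have no order,
# so there is no "direction" for a sign to encode.
renamed = [{"red": "z", "blue": "a", "green": "m"}[c] for c in colors]
eta2_renamed = correlation_ratio_squared(renamed, heights)
print(eta2, eta2_renamed)
```

The result lies in [0, 1] like r^2, which is what lets it share a heatmap scale with squared Pearson coefficients.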

My vote is for the Pearson's coeff to be available somehow, in some circumstance, b/c it is what most researchers use and is what we would compare our Bayesian dependencies to. I'm not sure what the best way to set this up is, though, considering variable types are mixed.

Also, there is a judgment call to make on what to use for categorical variables, because there are multiple flavors of correlation coefficients. Once we make the judgment call, maybe it can be documented somewhere so the next person understands the rationale the previous person had. I don't know enough about this to have a strong preference myself.

fsaad · 2016-11-05T18:23:26Z

A simple solution is to implement both CORRELATION and CORRELATION2, which really should be named PEARSON R and PEARSON R2, because "correlation" is quite a general term.

riastradh-probcomp · 2016-12-28T21:56:49Z

The main purpose of CORRELATION is to make heat maps which we contrast with the much better-looking DEPENDENCE PROBABILITY heat maps, as a tool for finding plausibly related variables. In that respect, r^2 is more applicable than r, since we need some kind of consistent measure between all pairs of columns, and the orientation of the correlation is of less interest in these heat maps than the magnitude of the correlation.
