satRdays: a snarky correlation function

There are a lot of ways to run correlations in R (I use corr.test from the psych package most of the time), but I just found out about the correlation function from the psycho package, and it might be my new favorite. I'll give a quick tutorial of it here (see The Psycho Blog for the developers' own tutorial).

Tutorial

If you’re following along, you’ll need to install the package from github:

# install.packages("devtools")
devtools::install_github("neuropsychology/psycho.R") # installs newest version
library(psycho)

Now let's see it in action using the mtcars dataset, which contains fuel consumption and design data for various 1973-74 car models (super exciting, I know).

head(mtcars) # this dataset is preloaded in R
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Imagine we have a hunch (er, hypothesis) that car weight (wt) will be negatively associated with fuel efficiency (mpg). The correlation function prints out the results in a manuscript-ready format.

psycho::correlation(mtcars[, c('mpg', 'wt')])
## Pearson Full correlation (p value correction: holm):
## 
##    - mpg / wt:   Results of the Pearson correlation showed a significant large, and negative association between mpg and wt (r(30) = -0.87, p < .001***).
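As a quick sanity check (not part of the psycho workflow), the same numbers can be reproduced with base R's cor.test, which needs no extra packages:

```r
# Cross-check the reported r(30) = -0.87, p < .001 with base R
res <- cor.test(mtcars$wt, mtcars$mpg)  # Pearson by default
round(res$estimate, 2)  # r = -0.87
res$parameter           # df = 30
res$p.value < .001      # TRUE
```

The df of 30 in the output is just the sample size (32 cars) minus 2.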

The correlation function also makes it easy to create formatted correlation matrices, complete with cute little significance stars. Let's look at the correlations between a few more totally-theory-driven metrics.

cors <- psycho::correlation(mtcars[, c('mpg', 'cyl', 'hp','wt', 'disp')])
cortable <- summary(cors)
# to save the table as a .csv (openable in Excel), run:
# write.csv(cortable, "correlationtable.csv")
knitr::kable(cortable) # optional: kable just formats the table nicely for HTML rendering
        mpg        cyl        hp         wt
mpg
cyl    -0.85***
hp     -0.78***    0.83***
wt     -0.87***    0.78***    0.66***
disp   -0.85***    0.9***     0.79***    0.89***

We can also quickly make a nice little plot of the correlations to share with collaborators:

plot(cors)

I’m pretty sure that plot title is supposed to be a play on words…

Alpha correction

You may have noticed earlier that the correlation function automatically applies a p-value correction. The default is Holm's method, but you can specify others by setting the adjust argument to 'bonferroni', 'fdr', or 'none'. In this case it doesn't change much.

boncors <- psycho::correlation(mtcars[, c('mpg', 'cyl', 'hp','wt', 'disp')],
                            adjust = 'bonferroni')
boncortable <- summary(boncors)
knitr::kable(boncortable) 
        mpg        cyl        hp         wt
mpg
cyl    -0.85***
hp     -0.78***    0.83***
wt     -0.87***    0.78***    0.66***
disp   -0.85***    0.9***     0.79***    0.89***
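Under the hood these corrections are the same ones offered by base R's p.adjust; a quick sketch with made-up p-values (not the mtcars ones) shows how Holm and Bonferroni differ:

```r
# Four hypothetical p-values from four tests (made up for illustration)
p <- c(0.001, 0.01, 0.03, 0.04)
p.adjust(p, method = "holm")        # step-down: smallest p multiplied by 4, next by 3, etc.
p.adjust(p, method = "bonferroni")  # every p multiplied by the number of tests (4)
p.adjust(p, method = "none")        # left unchanged
```

Bonferroni is the bluntest instrument; Holm is uniformly at least as powerful, which is presumably why it's the default.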

The ‘snarky’ part

So all that is pretty cool, but the main reason I like the correlation function is that it trolls you if you get too carried away with correlations.

Let's imagine we were just fishing around for significant correlations (i.e., totally normal science stuff), so we ran all possible pairwise correlations in the mtcars dataset¹; we wouldn't want any pesky alpha corrections getting in the way of discovering an interesting correlation, so let's make sure we set that adjust argument to 'none'.

allcors <- psycho::correlation(mtcars, adjust = 'none')
## Warning in psycho::correlation(mtcars, adjust = "none"): We've detected that you are running a lot (> 10) of correlation tests without adjusting the p values. To help you in your p-fishing, we've added some interesting variables: You never know, you might find something significant!
## To deactivate this, change the 'i_am_cheating' argument to TRUE.
allcorstable <- summary(allcors)
knitr::kable(allcorstable) 
                            mpg        cyl        disp       hp         drat       wt         qsec       vs         am         gear       carb      Local_Air_Density  Reincarnation_Cycle  Communism_Level  Alien_Mothership_Distance  Schopenhauers_Optimism
mpg
cyl                        -0.85***
disp                       -0.85***    0.9***
hp                         -0.78***    0.83***    0.79***
drat                        0.68***   -0.7***    -0.71***   -0.45**
wt                         -0.87***    0.78***    0.89***    0.66***   -0.71***
qsec                        0.42*     -0.59***   -0.43*     -0.71***    0.09      -0.17
vs                          0.66***   -0.81***   -0.71***   -0.72***    0.44*     -0.55***    0.74***
am                          0.6***    -0.52**    -0.59***   -0.24       0.71***   -0.69***   -0.23       0.17
gear                        0.48**    -0.49**    -0.56***   -0.13       0.7***    -0.58***   -0.21       0.21       0.79***
carb                       -0.55**     0.53**     0.39*      0.75***   -0.09       0.43*     -0.66***   -0.57***    0.06       0.27
Local_Air_Density           0.86***   -0.92***   -0.99***   -0.88***    0.67***   -0.87***    0.52**     0.74***    0.52**     0.47**    -0.5**
Reincarnation_Cycle        -0.06      -0.02       0          0.11       0.2       -0.05      -0.22       0.06       0.18       0.1        0.15     -0.03
Communism_Level             0.27      -0.27      -0.49**     0.14       0.51**    -0.5**     -0.31       0.12       0.61***    0.72***    0.43*     0.35               0.15
Alien_Mothership_Distance  -0.33       0.3        0.21       0.21      -0.31       0.21      -0.18      -0.08      -0.2       -0.15       0.08     -0.22               0.24                -0.05
Schopenhauers_Optimism      0.87***   -0.87***   -0.82***   -0.98***    0.51**    -0.72***    0.69***    0.75***    0.32       0.2       -0.75***   0.9***            -0.11                -0.07            -0.26
Hulks_Power                 0.19      -0.18      -0.15      -0.18       0.14      -0.19       0.11       0.22      -0.06       0.18      -0.04       0.16               0.04                -0.01            -0.33                      0.19

Now we think to ourselves:

“Wow, look at all those super interesting significant correlations we found! But wait, what's going on with these extra variables that aren't in the original dataset!? We didn't collect data on Schopenhauers_Optimism or Local_Air_Density…”

What’s going on?

To make it harder to 'accidentally' run too many uncorrected tests, the correlation function simulates new random variables (complete with funny names) whenever you run more than 10 uncorrected correlations. If you're dead set on p-hacking, you can override this by setting the i_am_cheating argument to TRUE.

allcors <- psycho::correlation(mtcars, adjust = 'none', i_am_cheating = TRUE)
allcorstable <- summary(allcors)
knitr::kable(allcorstable) 
        mpg        cyl        disp       hp         drat       wt         qsec       vs         am        gear
mpg
cyl    -0.85***
disp   -0.85***    0.9***
hp     -0.78***    0.83***    0.79***
drat    0.68***   -0.7***    -0.71***   -0.45**
wt     -0.87***    0.78***    0.89***    0.66***   -0.71***
qsec    0.42*     -0.59***   -0.43*     -0.71***    0.09      -0.17
vs      0.66***   -0.81***   -0.71***   -0.72***    0.44*     -0.55***    0.74***
am      0.6***    -0.52**    -0.59***   -0.24       0.71***   -0.69***   -0.23       0.17
gear    0.48**    -0.49**    -0.56***   -0.13       0.7***    -0.58***   -0.21       0.21       0.79***
carb   -0.55**     0.53**     0.39*      0.75***   -0.09       0.43*     -0.66***   -0.57***    0.06      0.27

There, now you can publish all those super significant, interesting, real, and totally a priori correlations you found without all those spurious ones that the correlation function created… (just make sure to hide your code from reviewers!)
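If you're wondering why the warning kicks in above 10 tests, the family-wise error arithmetic makes the point. The sketch below is my own base-R illustration of the decoy trick, not the package's code (the noise variable is a made-up stand-in):

```r
# With k independent tests at alpha = .05,
# P(at least one false positive) = 1 - 0.95^k
k <- 11
1 - (1 - 0.05)^k  # ~0.43: nearly a coin flip that something spurious turns up

# The decoy trick itself: correlate a pure-noise column with every real variable
set.seed(42)
noise <- rnorm(nrow(mtcars))  # no true relationship with anything
p_vals <- sapply(mtcars, function(x) cor.test(x, noise)$p.value)
sum(p_vals < 0.05)  # nonzero surprisingly often across reruns
```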

Conclusion

Lame p-hacking/fishing jokes aside, the correlation function is really easy to use and makes reporting more efficient and reproducible. It's just an added benefit that it also makes it a little harder to unknowingly run a bunch of uncorrected (and potentially spurious) correlations. I'd love to see more packages set defaults that are explicitly aimed at reducing questionable research practices.

Know of any others? Let me know about them at pdurkee@utexas.edu or @durkeepk so I can learn them.


  1. A bunch of the variables in mtcars are dichotomous or categorical, so this doesn't make much sense, but bear with me because that's not the point.
