Monday, 21 July 2014

Correlation: Nuts, Bolts and an Exploded View

This post deals with correlation by breaking down the formula into elementary statistical operations and explaining how each component contributes to the calculation of the final correlation statistic. All Courier Font Text is R code that you can try out while you read.

The correlation metric measures the degree of linear association between two sets of numbers. In your case these two sets might be experimental observations, where one set is the numerical level of a perturbation and the second set is the numerical output of the perturbed system. The value of correlation lies between -1 and 1: values between 0 and 1 indicate positive correlation, and values between -1 and 0 indicate negative correlation. Positive correlation means that as the level of perturbation goes up, the experimentally observed output of the system goes up. Negative correlation means the opposite.
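To make the sign convention concrete, here is a minimal sketch using made-up numbers. The names dose, rises and falls are hypothetical and not part of the example used later in this post.

```r
# Hypothetical perturbation levels and two made-up, perfectly linear responses
dose  <- 1:10
rises <- 3 * dose + 2       # output goes up as the perturbation goes up
falls <- 20 - 1.5 * dose    # output goes down as the perturbation goes up

cor(dose, rises)   # 1: perfect positive linear association
cor(dose, falls)   # -1: perfect negative linear association
```

Real data will land somewhere between these two extremes, because noise weakens the linear association.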

Theory of Correlation:




Correlation is calculated between two paired sets of numbers, X and Y. It is the covariance of X and Y scaled (divided) by the product of the standard deviations of X and Y. So to understand how this works you need to know covariance and standard deviation.
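Written out, the description above corresponds to the standard sample (Pearson) correlation formula. This is a reconstruction from the text, and the symbols r, n, x-bar, y-bar, s_X and s_Y are notation chosen here; the n-1 divisor matches what R's cov() and sd() use.

```latex
r_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
       = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
               \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```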

Covariance measures how two random variables (in this case X and Y) vary together. Covariance is the expected value (average, mean or first moment) of the product of the deviations of X and Y from their respective means.

Let's break it down and try it out.

#Create simulated correlated data with uniformly distributed error using a linear model

x<-seq(0,1,by=0.01)
y=2*x+runif(length(x),min=-0.5,max=0.5)

plot(x,y,col="blue")
lines(x,2*x,col="red")

cor(x,y) # gives me a decent correlation of 0.91 (runif is random and no seed is set, so your value will differ slightly)
cov(x,y) # gives me a covariance of 0.184

However, even though covariance measures how two random variables covary, the value of covariance changes with scaling. To see how this happens, we multiply every number in X and Y by 3 and recalculate the covariance:

cov(3*x,3*y) # this gives me a covariance of 1.652

On the other hand the correlation is unaffected by scaling as can be seen by:

cor(3*x,3*y) # this still gives me a correlation of 0.91

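This happens because covariance is built from the deviations of X and Y around their means, and those deviations grow with the scale of the numbers. Multiplying X and Y by 3 multiplies every deviation by 3, so each product of deviations, and hence the covariance, grows by a factor of 3*3 = 9 (1.652 is about 9 times 0.1836). Merely shifting the means by adding a constant would not change the covariance at all. The standard deviation quantifies the spread around the mean in the same units, so sd(X) and sd(Y) are each multiplied by 3 as well. Dividing the covariance by their product cancels the factor of 9, which is why the correlation is unaffected.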
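Here is a quick check of how scaling and shifting act on covariance, standard deviation and correlation. It is a sketch that regenerates data in the same way as above; the set.seed(1) call is an addition, used only to make the run repeatable, so the numbers will not match the ones quoted in this post.

```r
set.seed(1)   # arbitrary seed, not used in the original example
x <- seq(0, 1, by = 0.01)
y <- 2 * x + runif(length(x), min = -0.5, max = 0.5)

# Scaling both variables by 3 multiplies the covariance by 3 * 3 = 9
all.equal(cov(3 * x, 3 * y), 9 * cov(x, y))            # TRUE

# ...and multiplies the product of the standard deviations by 9 as well
all.equal(sd(3 * x) * sd(3 * y), 9 * sd(x) * sd(y))    # TRUE

# Shifting the means, by contrast, leaves the covariance untouched
all.equal(cov(x + 100, y - 50), cov(x, y))             # TRUE

# The factor of 9 cancels in the ratio, so the correlation is unchanged
all.equal(cor(3 * x, 3 * y), cor(x, y))                # TRUE
```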

Let us calculate the covariance from elementary operations.

#Calculate product of deviation around the means for X and Y
dev_mean=((x-mean(x))*(y-mean(y)))

mean(dev_mean)
# [1] 0.1817975

sum(dev_mean)/length(x)
# [1] 0.1817975 # dividing by n gives a slight under-estimate of the covariance

sum(dev_mean)/(length(x)-1)

# [1] 0.1836154 # accounting for the DOF loss

# In the code above we divide by (n-1) instead of n to account for the loss of one degree of freedom (DOF), because the means of X and Y used in the previous operation were themselves calculated from the same data.

# Degrees of freedom count how many values in a system are free to vary independently, i.e. cannot be worked out from the others. Given, say, 9 out of 10 unrelated numbers, we still can't predict the value of the 10th. However, once I have calculated the mean (or the sum) of the 10 numbers, I can use that mean and 9 of the numbers to recover the 10th, because the mean is not independent of the data: it is an aggregate statistic computed from all the numbers, so its value is tied to theirs.

# We used the means of X and Y to calculate the products of the deviations, and we are now averaging those products. Since we've already reduced the degrees of freedom of the system by estimating the means from the same data, dividing by n would give an under-estimate, because deviations measured from the sample mean are on average slightly smaller than deviations from the true mean. Therefore we divide by (n-1).
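The degrees-of-freedom argument can be checked directly. This is a small sketch with ten made-up numbers; vals, m and tenth are hypothetical names used only here.

```r
# Ten made-up numbers (hypothetical data for illustration)
vals <- c(4, 8, 15, 16, 23, 42, 7, 1, 9, 12)
m    <- mean(vals)

# Knowing the mean and any nine of the values pins down the tenth
tenth <- m * length(vals) - sum(vals[1:9])
all.equal(tenth, vals[10])        # TRUE

# Equivalently, the deviations from the mean always sum to zero (up to
# floating-point error), so only n-1 of them are free to vary
abs(sum(vals - m)) < 1e-12        # TRUE
```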
  


cov(x,y)
# [1] 0.1836154  # the same value as we calculated manually above

# The two steps above (the product of deviations and the division by n-1) can be condensed into one line
 
(sum((x-mean(x))*(y-mean(y))))/(length(x)-1)
# [1] 0.1836154

Now standard deviation is calculated as the square root of the variance. The variance is calculated as the expected value of the squares of the deviations around the mean (for a sample we again divide by n-1).

 sd(x)
# [1] 0.2930017

 sd(x)^2
# [1] 0.08585

var(x)
# [1] 0.08585

sum((x-mean(x))^2)/(length(x)-1) #variance formula

# [1] 0.08585


sqrt(sum((x-mean(x))^2)/(length(x)-1)) # sqrt(var(x))=sd(x)
# [1] 0.2930017
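One way to connect the two building blocks: the variance is simply the covariance of a variable with itself, since the product of deviations becomes a squared deviation. A minimal check using the same x as above:

```r
x <- seq(0, 1, by = 0.01)

all.equal(cov(x, x), var(x))         # TRUE: variance is covariance with itself
all.equal(sqrt(cov(x, x)), sd(x))    # TRUE: so sd is its square root
all.equal(cor(x, x), 1)              # TRUE: cov(x,x) / (sd(x) * sd(x)) is 1
```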

#Now we combine all of these independent formulas into one user defined function that calculates the correlation

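user_corrfx<-function(x,y){
  # covariance: sum of the products of deviations, divided by (n-1)
  user_cov<- (sum((x-mean(x))*(y-mean(y))))/(length(x)-1)
  # product of the standard deviations of x and y
  user_sdProd<- sqrt(sum((x-mean(x))^2)/(length(x)-1)) * sqrt(sum((y-mean(y))^2)/(length(y)-1))
  # correlation: covariance scaled by the product of the standard deviations
  user_cor<-user_cov/user_sdProd
  return(user_cor)
}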

Now we test our user-defined function against the standard function provided by R and verify its accuracy

user_corrfx(x,y)
# [1] 0.9107999

cor(x,y)
# [1] 0.9107999
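A single matching value is encouraging, but a slightly broader check is easy. The sketch below repeats the function so that it runs on its own and compares it with cor() on positively related, negatively related and unrelated data. The seed and the names y_pos, y_neg and y_none are additions made for this illustration.

```r
# Same function as above, repeated so this block is self-contained
user_corrfx <- function(x, y) {
  user_cov    <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
  user_sdProd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1)) *
                 sqrt(sum((y - mean(y))^2) / (length(y) - 1))
  user_cov / user_sdProd
}

set.seed(42)   # arbitrary seed so the check is repeatable
x <- seq(0, 1, by = 0.01)

y_pos  <-  2 * x + runif(length(x), min = -0.5, max = 0.5)  # positive slope
y_neg  <- -2 * x + runif(length(x), min = -0.5, max = 0.5)  # negative slope
y_none <- runif(length(x), min = -0.5, max = 0.5)           # no dependence on x

# all.equal() compares with a small numerical tolerance
all.equal(user_corrfx(x, y_pos),  cor(x, y_pos))    # TRUE
all.equal(user_corrfx(x, y_neg),  cor(x, y_neg))    # TRUE
all.equal(user_corrfx(x, y_none), cor(x, y_none))   # TRUE
```

Using all.equal() rather than == avoids false alarms from floating-point rounding, since the two computations may not perform their arithmetic in the same order.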




Woo hoo! It works! So now you hopefully have a clearer picture of how correlation works. Your next exercise should be to act like a child with a hammer who thinks everything is a nail that needs a good pounding: apply correlations to different sets of data that you think might be related and look for associations. Of course you may find absurd correlations, and sometimes the data set is too small to confidently say there is "significant correlation". We will examine this so-called "significance" of correlation in terms of P-values in the next post.
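As a small taste of why tiny data sets deserve suspicion, here is a sketch with made-up numbers: with only two paired observations, a straight line always fits perfectly, so the correlation is always +1 or -1 regardless of the values.

```r
# Two points always lie exactly on a line (made-up values for illustration)
cor(c(1, 2), c(10, 3))        # -1
cor(c(0.3, 0.9), c(-4, 7))    # 1
```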

