Showing posts with label Theory. Show all posts

Tuesday, 14 April 2015

Principal Components Analysis: making omelettes without breaking ellipsoids

One way to represent the results of an experiment with one user-controlled variable (the independent variable) and another variable that supposedly changes with it (the dependent variable) is the scatter-plot. If the data-points are uncorrelated they will be dispersed evenly across the plot; if they are correlated they will cluster along a diagonal. We call it positive correlation when they lie along the rising diagonal and negative correlation when they lie along the falling diagonal.

However, this data is of the co-ordinate form (x,y) and therefore two-dimensional. Now suppose we are looking at more than one dependent variable. Our data might then become three-dimensional, with co-ordinates (x,y,z). 3-D data can still be inspected by plotting (x,y) and (x,z), but with more dimensions, say 4, 7 or 12, it becomes hard to plot and observe. We would still like something like a scatterplot that shows the variation in the data, but it becomes much harder to find one.

Enter Principal Components Analysis. PCA needs you to format/formalize your data as a matrix whose columns represent the various dependent variables. Each column represents a dimension in space, so each row is a point in N-dimensional space, where N is the number of columns/variables in the matrix.

So the first step of a PCA is to center your data using the mean. You then calculate the covariance matrix of this mean-centered data, find its eigenvalues and corresponding eigenvectors, and sort the eigenvectors by the magnitude of their eigenvalues. These eigenvectors define the directions onto which the data-points are projected to obtain a lower-dimensional representation that preserves the maximum amount of variation in the data. The magnitude of each eigenvalue relative to the total gives the proportion of variance explained along the corresponding eigenvector.
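To keep things concrete, here is that recipe spelled out in a few lines of R on a made-up data matrix (the names and sizes are just for illustration):

```r
# A minimal PCA sketch: center, covariance, eigendecomposition
set.seed(1)
X  <- matrix(rnorm(300), nrow = 100, ncol = 3)   # 100 observations, 3 variables
Xc <- scale(X, center = TRUE, scale = FALSE)     # step 1: mean-center
C  <- cov(Xc)                                    # step 2: covariance matrix
e  <- eigen(C)                                   # step 3: eigenvalues and eigenvectors
prop_var <- e$values / sum(e$values)             # proportion of variance per component
```

Note that eigen already returns the eigenvalues in decreasing order, so the sorting step comes for free.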


So, yeah, that is the explanation every text on PCA gives you, so what am I saying differently here?

A couple of things. First, mean-centering brings the data to the origin of the co-ordinate system, with the mean as the origin, instead of the data-points sitting arbitrarily in one corner of the graph. So, geometrically, when mean-centered, each observation is defined by values which, when squared, added, and square-rooted, give its Euclidean distance from the origin.
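A quick sketch of that, on a tiny made-up matrix:

```r
# After mean-centering, the column means sit at the origin and each
# row's length is its Euclidean distance from that origin
X  <- matrix(c(5, 7, 9, 2, 4, 6), nrow = 3)   # tiny made-up data, 3 points in 2-D
Xc <- scale(X, center = TRUE, scale = FALSE)
cm <- colMeans(Xc)                            # ~ c(0, 0)
d1 <- sqrt(sum(Xc[1, ]^2))                    # distance of the first point from the origin
```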

Next, what does it mean to calculate the covariance matrix?

Covariance is a measure of how two or more sets of data-points/random variables vary with each other. If you go through the post on Correlation: Nuts, Bolts and an Exploded View, you will see how covariance measures, in an un-normalized fashion, how much two random variables co-vary.

Variance is also called the second moment, so covariance is the second (mixed) moment for two sets of data-points and measures how the two vary with respect to each other. Doesn't that sound like something familiar from calculus? The rate of change of a variable is expressed as a derivative, isn't it? So is the covariance matrix some sort of matrix of derivatives? It turns out the moments of a distribution can indeed be obtained as derivatives, namely the derivatives of its moment-generating function evaluated at zero, so the analogy is not as far-fetched as it sounds.



To make matters more fun, we come to the problem of computing eigenvalues and eigenvectors. The eigenvectors of a matrix are vectors whose direction is unchanged when multiplied by the matrix (they are not rotated), only their length changes. The length of a vector, assuming we have mean-centered everything before this step, is simply its Euclidean distance from the origin (0,0,0,...), i.e. the square-root of the sum of squares of its co-ordinates. Length is a scalar quantity, and multiplying the matrix by an eigenvector simply scales that eigenvector by a scalar. That particular scalar is the eigenvalue. So every eigenvector has a corresponding eigenvalue that indicates how much the length of the vector changes when multiplied by the matrix.
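You can check this directly in R on a toy matrix (chosen arbitrarily here):

```r
# Multiplying an eigenvector by its matrix only rescales it
A <- matrix(c(2, 1, 1, 2), 2, 2)   # a toy symmetric matrix
e <- eigen(A)
v <- e$vectors[, 1]                # first eigenvector
lambda <- e$values[1]              # its eigenvalue (3 for this matrix)
Av <- as.vector(A %*% v)
# Av points in the same direction as v, with its length multiplied by lambda
```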


 

So we form the covariance matrix with the yet-to-be-computed eigenvalue subtracted from each diagonal entry and calculate something called the characteristic polynomial, which is simply the determinant of this matrix: det(C - λI), where C is the covariance matrix and I the identity.
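A sketch of that, again on an arbitrary toy matrix, using R's det:

```r
# det(C - lambda*I) is the characteristic polynomial;
# it vanishes exactly at the eigenvalues
C <- matrix(c(2, 1, 1, 2), 2, 2)
charpoly <- function(lambda) det(C - lambda * diag(nrow(C)))
vals <- sapply(eigen(C)$values, charpoly)   # ~ c(0, 0)
```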






After we have the characteristic polynomial, we equate it to zero and solve for the eigenvalues; they are simply the roots of that polynomial. Where does maximization come in? The first principal component is the direction that maximizes the variance of the projected data, and setting up that maximization (with the constraint that the direction vector has unit length) leads, via the same equate-the-derivative-to-zero logic familiar from calculus, exactly to this eigenvalue problem. Since the covariance matrix is positive semi-definite, all the eigenvalues are non-negative, and the largest one is the maximum variance achievable along any single direction.

So what is the determinant here? Geometrically, the determinant is the area (in higher dimensions, the volume) of the space spanned by the column vectors of the matrix for which it is calculated. So det(C - λI) = 0 says that subtracting λ from the diagonal collapses the volume spanned by the shifted covariance matrix to zero, i.e. makes it singular; the eigenvalues are precisely the shifts for which this happens.
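A tiny sketch of that volume picture (the rectangle here is a made-up example):

```r
# The determinant as the area spanned by the column vectors
M <- cbind(c(2, 0), c(0, 3))   # columns span a 2-by-3 rectangle
dM <- det(M)                   # 6, the rectangle's area
# subtracting an eigenvalue of M (here 2 or 3) from the diagonal
# collapses that area to zero, which is what det(C - lambda*I) = 0 expresses
d0 <- det(M - 2 * diag(2))
```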

 


So the eigenvectors corresponding to these eigenvalues are the directions along which the data shows the amount of spread described by the corresponding eigenvalue.

So when you rotate the data to lie along these eigenvectors, they describe the maximum amount of variance. Essentially, the data points are rotated to a new set of co-ordinates in which the maximum amount of variation present in the data is represented.
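A sketch of that rotation on made-up 2-D data: projecting the mean-centered points onto the eigenvectors gives new co-ordinates whose variances are exactly the eigenvalues.

```r
# Variances along the eigenvector axes equal the eigenvalues
set.seed(5)
X  <- cbind(rnorm(500, sd = 3), rnorm(500, sd = 1))   # elongated cloud
Xc <- scale(X, center = TRUE, scale = FALSE)
e  <- eigen(cov(Xc))
rotated  <- Xc %*% e$vectors          # the points in eigenvector co-ordinates
axis_var <- apply(rotated, 2, var)    # matches e$values
```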

For intuitiveness, let's imagine that our data is an ellipsoid/egg rather than a circle/sphere, so that it has directions with different variations/spreads. Later we will generalize it to a scalene ellipsoid, which looks more like a bar of soap and has different spreads along all three axes.

So ideally, if PCA is able to find the directions with the maximum variance, then for an ellipsoid it should find the semi-axes.

Now for some practical programming exercises.

Various texts on PCA will tell you that doing a PCA rotates the data-points in such a way that you are able to visualize the maximum amount of variance in your data. To make this intuitive, let us put together a data-set of 3 dimensions (2 dimensions can already be visualized with an ordinary scatterplot). We will make a 3-D ellipse, properly called an ellipsoid, of data-points so that we can see it using the interactive 3-D plots available in R under the package rgl (installation instructions in the post on Brownian motion).

So we start with formulating a function for an ellipsoid. A 3-D ellipsoid has co-ordinates (x,y,z) and semi-axes (a,b,c). The equation is:



{x^2 \over a^2}+{y^2 \over b^2}+{z^2 \over c^2}=1,

in R we type

ellipsoid <- function(xyz, abc, radius){
  # value of the ellipsoid equation for every point
  ellipse <- (xyz[,1]^2/abc[1]^2) + (xyz[,2]^2/abc[2]^2) + (xyz[,3]^2/abc[3]^2)
  # keep only the points on or inside the surface
  ellipsis <- which(ellipse <= radius)
  return(xyz[ellipsis, ])
}

So, R newbies, this function uses a few shortcuts. I'm passing the data as a matrix and indexing it in the body of the function to extract the respective (x,y,z) co-ordinates, and sending the semi-axes as a vector of numbers. Compared to earlier posts this is a little more brief, but the meaning stays the same and you will see it soon enough.

So here is what we are going to do: we start with a box peppered with points everywhere uniformly, and that will be our random cloud of data.

library(rgl)
xyz<-matrix(runif(30000,-2,2),10000,3)
plot3d(xyz[,1],xyz[,2],xyz[,3])




Then we transform it into an ellipsoid of points using the function ellipsoid which we just wrote. This ellipsoid will be our high-dimensional data. It is not really high-dimensional, and you could very well use scatterplots to see what is happening, but for the sake of argument let us say that it is high-dimensional and we are unable to view it.


exyz<-ellipsoid(xyz,abc=c(2,1,1),radius=1)
plot3d(exyz[,1],exyz[,2],exyz[,3],xlim=c(-3,3),ylim=c(-3,3),zlim=c(-3,3),col="red")



Now with this ellipsoidal cloud of points we will perform a principal components analysis. Since PCA gives us vectors called Principal Components which describe the greatest amount of variance in the data, for an ellipsoid the PCs should be the semi-axes. Thereafter we will see that tweaking the lengths of the semi-axes (the values of a, b & c) is reflected in the amount of variance in the principal components.

First, a brief word about ellipsoids: they can have different shapes depending on the lengths of the semi-axes (a,b,c). As a special case, if all three semi-axes are of the same length (1,1,1), the ellipsoid becomes a sphere. If one of them is about twice as long as the other two, which are equal (2,1,1), the ellipsoid becomes egg-shaped and is called a prolate spheroid. Similarly it could become wider than it is tall, which is an oblate spheroid; that is what the Earth is like, being squashed at the Poles. A rarer case is when a > b > c (there is a convention of ordering the semi-axes: a is the semi-axis along x, b along y and c along z). That case looks like a computer mouse and is called a tri-axial or scalene ellipsoid.

 
Code for scalene ellipsoid:
plot3d(ellipsoid(xyz,abc=c(1,2,3),radius=0.5),xlim=c(-3,3),ylim=c(-3,3),zlim=c(-3,3),col="red")
 


Why am I talking about ellipsoids here? Because data with multiple dependent variables forms an ellipsoid in higher dimensions, or at least the limiting case of normally distributed data does (oblate, prolate, spherical or scalene): the contours of a multivariate normal distribution are ellipsoids. Points spread evenly throughout the co-ordinates, by contrast, would have to be uniformly distributed, which your data wouldn't be unless it was just random noise.

So, knowing from theory that PCA rotates the data in such a way that we can see maximum variation, for an ellipsoid the directions of maximum variation would be along the semi-axes, right?

We needed to know the different kinds of ellipsoids because the above information will help us track how the variance described by the Principal Components changes when we modify the lengths of the axes. We also need a convenient function to tweak the parameters quickly and view the results. Let us, for a short time, be really horrible programmers and write one function that either displays the shape of the ellipsoid or does a principal components analysis with the relevant plots, depending on what we feel we need to see.

Ellipsoid3D_PCA <- function(xyz, abc, radius, mode=NULL){
require(rgl)

pointcloud <- ellipsoid(xyz, abc, radius)

# check for a missing mode first: comparing NULL with "==" returns a
# zero-length logical and would make the if() calls below fail
if(is.null(mode)){
print("Mode is missing : use 3dPlot or pcaPCplot")
return(invisible(NULL))
}

if(mode=="pcaPCplot"){
trypca <- princomp(pointcloud)
print(trypca)
biplot(trypca, cex=0.4)
}

if(mode=="3dPlot"){
plot3d(pointcloud[,1], pointcloud[,2], pointcloud[,3], xlim=c(-3,3), ylim=c(-3,3), zlim=c(-3,3))
}

}

Now this function simplifies a lot of typing and helps you to quickly modify parameters by repeating the commands using the UP Arrow key.

 Ellipsoid3D_PCA(xyz,abc=c(2,1,1),1,mode="pcaPCplot")

So PCA doesn't really care what lengths you choose for the semi-axes; it will always find the longest semi-axis as the first PC, the second longest as the second PC and the third longest as the third PC. PCA will give you the same result even if you re-order the axis sizes like this.

Ellipsoid3D_PCA(xyz,abc=c(1,1,2),1,mode="pcaPCplot")
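If you would rather check this invariance without the ellipsoid machinery, here is a self-contained sketch using normally distributed clouds (all names and sizes are illustrative):

```r
# Re-ordering the axes of the same cloud leaves the sorted
# component standard deviations unchanged
set.seed(7)
cloud <- cbind(rnorm(1000, sd = 2), rnorm(1000, sd = 1), rnorm(1000, sd = 1))
shuffled <- cloud[, c(2, 3, 1)]            # same points, axes re-ordered
sd_a <- unname(princomp(cloud)$sdev)
sd_b <- unname(princomp(shuffled)$sdev)    # the same values, axis order notwithstanding
```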


So the semi-axes are like averaging lines running through the cloud of points, each spanning as much of the ellipsoid's spread as possible along one of the independent co-ordinates x, y and z. By independent I mean that the x, y and z co-ordinates cannot be expressed in terms of each other: there is no linear relationship that would allow you to express one axis in terms of one or more of the others. This is the concept of orthogonality.

Principal Components are orthogonal to each other and are made up of linear combinations of the variables in the matrix (the columns). So, a principal component tells you how much of the variance that you see in your data can be described in terms of orthogonal axes that are composed of linear combinations of the variables in your experiment. So, in terms of biology, if one of your experiments is very noisy and has data scattered throughout the co-ordinates, it would contribute considerably to the principal components.

Now PCA is also said to be a method for dimensionality reduction. So you can reduce the number of dimensions in your data while retaining the maximum amount of variation that defines your data. So how would that work, what does dimensionality reduction mean?

Dimensionality here would be the number of variables in your experiment. Experiments are generally represented as a numeric matrix where the rows represent the entities being experimented on and observed, and the columns represent the different perturbations that have been applied. Here is where the question you are asking becomes important. Is your question "what are the experiments which describe the maximum amount of variance in my data?" or is it "what are the entities which describe the maximum amount of variance in my data?"

If it is the first, then your dimensions are "experiments", else you have to take the transpose of your matrix such that "experiments" comes on the rows and "entities" come on the columns.

So how can you reduce dimensionality, you may ask, because obviously the number of experiments we have done is fixed. You may say, "I'm not going to remove the results of an experiment that I did just so that my data becomes manageable. I want all my experiments to be factored in!"

What if I told you that maybe a bunch of experiments that you did, despite all the hard work that went into them, do not contribute anything in terms of new results? Not philosophically, but numerically: they do not give any new numbers that might be statistically analyzed for interesting results, because they all say the same thing. They are collinear, and therefore any one of them describes the variation just as well as the others in the bunch do.

"Well, I'm still not throwing them out!", you say, and rightly so. We don't need to throw them out; we can make another column/experiment that combines the bunch into one column containing all their information, expressed as a linear weighted sum of the individual experiments of the bunch.

This weighting allows you to retain all the experiments and their results, but re-weighted in terms of the amount of information they contribute to your data. Information here is defined as variance: where we once considered variance to be noise in experiments, we now consider it a sign that something is happening in our data, something we can't directly quantify in a numerical, pair-wise comparison but which changes the amount of variation observed. So the variance observed is considered to be a result of the perturbation.

So, the eigenvectors you get from the PCA decomposition map your original data matrix into a sub-space with fewer dimensions, where the new dimensions are linear combinations of the original ones. You have managed to express most of the significant variation in your data in a matrix that is much smaller than the one you started with. This is what it means when it is said that PCA is used for dimensionality reduction.
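A minimal sketch of that reduction, with a deliberately near-collinear third column (the noise level and cut-off here are illustrative choices):

```r
# A near-collinear third column adds almost no new variance,
# so two components carry nearly everything
set.seed(2)
X <- matrix(rnorm(400), 200, 2)
X <- cbind(X, X[, 1] + X[, 2] + rnorm(200, sd = 0.01))  # near-collinear column
p <- princomp(X)
retained <- cumsum(p$sdev^2) / sum(p$sdev^2)  # variance retained per cut-off
reduced  <- p$scores[, 1:2]                   # the data, now in 2 dimensions
```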

We'll see a more hands-on approach to PCA in the next installment of this post. Mull over everything you've seen and learned here and check out other resources on PCA to see if you've gained a better understanding of what is going on. Also try out functions other than ellipsoids and see how PCA behaves for them. Increase the number of dimensions so that you only have a 3-D projection of, say, a 7-D data-set. Play with parameters and axis values, and have fun.

Monday, 21 July 2014

Correlation: Nuts, Bolts and an Exploded View

This post deals with correlation by breaking down the formula into elementary statistical operations and explaining how each component contributes to the calculation of the final correlation statistic. All Courier Font Text is R code that you can try out while you read.

The correlation metric measures the degree of linear association between two sets of numbers. In your case these two sets are experimental observations, where one set is the numerical change in perturbation and the second is the numerical output of the perturbed system, i.e. the observations. The value of correlation lies between -1 and 1, where values between 0 and 1 indicate positive correlation and values between -1 and 0 indicate negative correlation. Positive correlation states that as the level of perturbation goes up, the experimentally observed output of the system goes up. Negative correlation states the opposite.

Theory of Correlation:




So correlation is calculated between two paired sets of numbers X and Y: it is the covariance of X and Y scaled by the product of the standard deviations of X and Y, i.e. cor(X,Y) = cov(X,Y)/(sd(X)*sd(Y)). So to understand how this works you need to know covariance and standard deviation.

Covariance measures how two random variables (in this case X and Y) vary together. It is the expected (average/first moment/mean) value of the product of the deviations of X and Y around their respective means.

Let's break it down and try it out.

#Create simulated correlated data with uniformly distributed error using a linear model

x<-seq(0,1,by=0.01)
y=2*x+runif(length(x),min=-0.5,max=0.5)

plot(x,y,col="blue")
lines(x,2*x,col="red")

cor(x,y) # gives me a decent correlation of 0.91
cov(x,y) # gives me a covariance of 0.183

However even though covariance measures how two random variables covary, the value of covariance changes with scaling. To see how this happens we multiply every number in X and Y by 3 and recalculate the covariance

cov(3*x,3*y) # this gives me a covariance of 1.652

On the other hand the correlation is unaffected by scaling as can be seen by:

cor(3*x,3*y) # this still gives me a correlation of 0.91

This happens because covariance is calculated from the deviations around the means, and re-scaling the numbers re-scales those deviations: multiplying X and Y by 3 multiplies their product of deviations, and hence the covariance, by 9. Dividing by the standard deviations of X and Y accounts for this because each standard deviation is scaled by the same factor of 3, so the scaling cancels out in the ratio.
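A quick check of that cancellation on made-up data:

```r
# Covariance scales with the square of the factor; correlation doesn't
set.seed(3)
x <- rnorm(50)
y <- x + rnorm(50)
cov_scaled <- cov(3 * x, 3 * y)   # equals 9 * cov(x, y)
cor_scaled <- cor(3 * x, 3 * y)   # equals cor(x, y)
```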

Let us calculate the covariance from elementary operations.

#Calculate product of deviation around the means for X and Y
dev_mean=((x-mean(x))*(y-mean(y)))

mean(dev_mean)
# [1] 0.1817975

sum(dev_mean)/length(x)
# [1] 0.1817975 # slightly smaller: dividing by n under-estimates the covariance

sum(dev_mean)/(length(x)-1)

# [1] 0.1836154 # accounting for the DOF loss

# In the code above we account for the loss of one degree of freedom (DOF), because we used the mean of X and Y in the previous operation. So divide by (n-1) instead of n when averaging.

# Degrees of freedom measure how many values in the system cannot be estimated from the others; knowing, say, 9 out of 10 values, we still can't predict the value of the 10th. However, if I also know the mean (or sum, or any other aggregate statistic) of the 10 values, I can use it together with the other 9 to calculate the 10th, because the mean is not independent of the values: it is an aggregate statistic computed from all of them, and its value varies with theirs.

# We used the means of X and Y to calculate the products of deviations, and we are now averaging those products. Since we've already spent one degree of freedom by using the mean, dividing by n would give a biased (too small) estimate. Therefore we divide by (n-1).
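A sketch of that bookkeeping with ten made-up numbers: once the mean is known, the tenth value is pinned down by the other nine.

```r
# The mean removes one degree of freedom
v <- c(4, 8, 15, 16, 23, 42, 7, 3, 9, 11)
m <- mean(v)
# the 10th value, recovered from the mean and the other nine
recovered <- length(v) * m - sum(v[-10])
```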
  


cov(x,y)
# [1] 0.1836154  # the same value as we calculated manually above

# The first expression and its predecessor can be condensed into one line
 
(sum((x-mean(x))*(y-mean(y))))/(length(x)-1)
# [1] 0.1836154

Now, standard deviation is calculated as the square-root of the variance, and the variance is the expected value of the squared deviations around the mean.

 sd(x)
# [1] 0.2930017

 sd(x)^2
# [1] 0.08585

var(x)
# [1] 0.08585

sum((x-mean(x))^2)/(length(x)-1) #variance formula

# [1] 0.08585


sqrt(sum((x-mean(x))^2)/(length(x)-1)) # sqrt(var(x))=sd(x)
# [1] 0.2930017

#Now we combine all of these independent formulas into one user defined function that calculates the correlation

user_corrfx <- function(x, y){
  # covariance of x and y, dividing by (n-1)
  user_cov <- (sum((x-mean(x))*(y-mean(y))))/(length(x)-1)
  # product of the standard deviations of x and y
  user_sdProd <- sqrt(sum((x-mean(x))^2)/(length(x)-1)) * sqrt(sum((y-mean(y))^2)/(length(y)-1))
  user_cor <- user_cov/user_sdProd
  return(user_cor)
}

Now we test our user-defined function against the standard function provided by R and verify its accuracy

user_corrfx(x,y)
[1] 0.9107999
 

cor(x,y)
[1] 0.9107999




Woo hoo! It works! So now you hopefully have a clearer picture of how correlation works. Your next exercise: be like a child with a hammer who thinks everything is a nail that needs a good pounding. Apply correlations to different sets of data that you think might be related and find associations. Of course you may find absurd correlations, and at times the data will be too small to confidently say there is "significant correlation". We will examine this so-called "significance" of correlation in terms of P-values in the next post.


Monday, 9 June 2014

The Tea Test ceremony

Understanding the T-test


The T-test is a frequently used statistical test for testing the difference of means between two groups. However, as these things go, people don't really seem to understand how it works, so I've written some intuitive simulations in R that may explain it better.

requires: R, install.packages("corrplot")

The T-test statistic is designed to help you determine whether the difference of means (averages) between two groups of numbers is large enough to be called significant. You see such a situation in life when it is reported, for example, that the average marks of a class have improved from 65 to 70, which is reported as an improvement.

If you are familiar with the properties of the mean, you will know that it is influenced by a few outliers (extreme values) and that it tells you nothing about the spread of values around it.

The statistic that tells you about the spread of values is the variance. The variance is calculated by subtracting the mean from every observation, squaring these differences, adding all the squared values into one lumped value, and dividing by the number of observations minus one (the minus one accounting for the lost degree of freedom, as discussed in the correlation post).
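Spelled out on a handful of made-up marks:

```r
# The sample variance by hand (note the n - 1 divisor)
g <- c(65, 70, 72, 68, 75)
by_hand <- sum((g - mean(g))^2) / (length(g) - 1)
# matches what var(g) returns
```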

So the mean alone cannot tell you whether the control and test groups are different enough to claim that there is an effect or an improvement. To illustrate that, look at the graphs below.


The mean difference in the figures above is the same: the means are -1.5 and 1.5, so the difference is 3, but the variance (spread of the values around the mean) varies all the way from 0.5 to 7. As you can clearly see, more variance produces more overlap between the two histograms, so even if the means are different it is hard to say that the scores divide neatly into two groups. In the first tile of the graph the groups seem to separate neatly, so the difference in means can be considered representative of the difference between the two groups as a whole. In the last tile, even though the difference of means is the same, the spread and overlap of the values show that random fluctuations could switch scores from the first group into the second or vice-versa, preventing a clear differentiation.


So the T-test addresses precisely this issue. By taking both the mean and the variance into account, it combines them to produce a statistic which looks like a cousin of the Signal to Noise Ratio (SNR):



t = {\bar{X}_1 - \bar{X}_2 \over \sqrt{{S_1^2 \over N_1}+{S_2^2 \over N_2}}}

Here X1 is the mean of the first group, X2 is the mean of the second group, S1 & S2 are the standard deviations of the first and second groups respectively, and N1 and N2 are the numbers of observations in the two groups.
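As a sketch (with made-up groups), this is the same number that R's t.test reports by default, since the default is the Welch two-sample test, which uses exactly this formula:

```r
# The two-sample t-statistic by hand, compared against t.test
set.seed(4)
g1 <- rnorm(150, mean = 0, sd = 3)
g2 <- rnorm(150, mean = 1, sd = 3)
ts <- (mean(g1) - mean(g2)) / sqrt(var(g1)/length(g1) + var(g2)/length(g2))
# unname(t.test(g1, g2)$statistic) gives the same value
```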

Calculating the t-statistic from these values gives us a variable which follows what is known as the t-distribution. As an aside, I should probably mention that one of the major objectives of mathematical statistics is to reduce a bunch of numbers to a single number using a formula called a test-statistic. This reduction (under certain assumptions) causes the statistic to follow a distribution of predictable shape and size, which can then be used to estimate the probability of the statistic being that extreme by chance.

So we will be looking at the T-distribution and T-test from a simulation perspective in R

mx <- seq(-10, 10, by=0.5) # create an evenly spaced set of means
my <- seq(-10, 10, by=0.5)

#using each of those means, draw from a normal distribution of that mean
# and standard deviation of 3
simx<-sapply(mx,function(x){rnorm(150,mean=x,sd=3)})
simy<-sapply(my,function(x){rnorm(150,mean=x,sd=3)})

# a function to calculate the t-statistic of two vectors
t_stat<-function(x,y){
ts<- (mean(x)-mean(y))/sqrt((var(x)/length(x))+(var(y)/length(y)))
return(ts)}

# Now simx and simy are two matrices the columns of which consist of 150 numbers which are normally distributed with the mean taken from each element of mx and my which vary in steps of 0.5 from -10 to 10

# now we take each pair of vectors that can be made by matching the 1st column of simx to every column of simy and then moving on to the 2nd column of simx and matching it to every column of simy and so on till all pairs are constructed. These pairs of vectors will be sent to t_stat to compute the t-statistics for the pair. That t-statistic will be stored in a matrix which we will visualize to see how the value of the t-statistic changes as we keep varying the two means.

t_mat<-matrix(0,41,41)
rownames(t_mat)<-mx
colnames(t_mat)<-my

for(i in 1:dim(simx)[2]){
for(j in 1:dim(simy)[2]){
t_mat[i,j]<-t_stat(simx[,i],simy[,j])
}
}

library(corrplot)
corrplot(t_mat,is.corr=FALSE, addgrid.col=FALSE,tl.cex=0.6,main="T-statistic landscape with variation in mean of X versus mean of Y")




From this image it is clearly visible that the t-statistic increases in magnitude as the difference in means increases in either direction, only changing sign. The matrix looks symmetric in magnitude (strictly speaking it is anti-symmetric: swapping the two groups flips the sign of the statistic), but close observation will show small deviations. That is because we used two separate sets of random draws to generate our distributions, and since we sampled only 150 numbers we are not guaranteed convergence to the ideal means we specified for the random draws. For all practical purposes, though, the pattern is symmetric.

The heat of the colours shows the value of the t-statistic as mapped to the colour bar on the right. Along the principal diagonal, where the difference in means is small, the t-statistic goes to zero.

Just so you trust me on this, we'll calculate the actual means of our simulated data and plot it to make sure

amx<-apply(simx, 2, mean)
amy<-apply(simy, 2, mean)
rownames(t_mat)<-round(amx,2)
colnames(t_mat)<-round(amy,2)
corrplot(t_mat,is.corr=FALSE, addgrid.col=FALSE,tl.cex=0.6,main="T-statistic landscape with variation in mean of X versus mean of Y")

So now you can see that there are minor variations in the actual values of the means of the distributions that were sampled from the normal distribution using the means that we specified from -10 to 10 in steps of 0.5

Anyway now we need to know the p-value that is associated with the t-statistic.

The p-value is calculated from the t-distribution, in our case we can visualize this distribution by plotting the t-statistics that we have calculated as a histogram.

tstatdist<-as.vector(t_mat)
hist(tstatdist)

Slightly heavy-tailed if you look carefully. Note that this histogram pools t-statistics from all pairs of means, so it is not itself a t-distribution; only the statistics computed under equal means (the diagonal of the matrix) follow the t-distribution, and the heavy tails come from the pairs with large mean differences. If you increased the number of samples from 150 to, say, 1000 (by changing the call to rnorm), the equal-means statistics would converge more closely to the theoretical t-distribution.

So the way to calculate a (one-sided) p-value is to take the area under the t-distribution curve from minus infinity up to the value of your t-statistic and subtract it from 1 (since the total area under a probability distribution is 1); for a two-sided test you double that tail area.
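For reference, the exact tail area can be read off R's t-distribution function pt. A sketch, where the statistic and the degrees of freedom are illustrative values, not something computed above:

```r
# Exact tail areas from the t-distribution
ts <- 2.1     # an illustrative t-statistic
df <- 148     # illustrative degrees of freedom
p_one_sided <- 1 - pt(ts, df)   # area to the right of ts
p_two_sided <- 2 * p_one_sided  # both tails, for a two-sided test
```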

You can do that roughly (I mean very, very roughly) by using the histogram binning data to approximate the probability distribution and taking a cumulative sum, normalized by the total, as an approximate numerical integration of the area under the curve.

tstathist <- hist(tstatdist)
csumtstathist <- cumsum(tstathist$counts) # the constant bin width cancels in the normalization
plot(csumtstathist, type="l") # approximate cumulative distribution function
csumtstathist/max(csumtstathist) # normalized so the cumulative distribution ends at 1