Friday 13 June 2014

Oh those Gaussians!


Central Limit Theorem and the Gaussian/Normal distribution


The Gaussian distribution is an extremely popular distribution for most things in statistics related to population studies and biology in general. However, for the longest time I did not really understand why it was so important. The Central Limit Theorem was a buzzword that I didn't have a feel for. Using R, however, it became very intuitive.

The CLT states that the mean/average/sum, i.e. the additive effect, of a large number of independent random draws (here from a uniform distribution, though any distribution with finite variance will do) converges towards a bell-shaped curve called the Gaussian or Normal distribution.

To see this in R:

unifmat<-matrix(runif(1000000),1000,1000) # a 1000 x 1000 matrix filled with uniform random numbers
hist(as.vector(unifmat)) # collapse to a vector and look at the histogram: it is flat

So that is a uniformly distributed set of numbers inside the matrix: we collapsed it to a vector and saw from the histogram that it is pretty much flat/uniform. Now we take the row sums and row means and see what their distribution looks like.

hist(rowSums(unifmat))


hist(rowMeans(unifmat))

Both of them look the same because the mean of each row is just the sum of that row scaled down by the same number (the number of columns, 1000).
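A quick sanity check of that claim with the matrix we just made:

all.equal(rowMeans(unifmat), rowSums(unifmat)/ncol(unifmat)) # TRUE: the means are the sums scaled by 1/1000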

So the Central Limit Theorem works in simulation. In practice, the distribution of heights in a population, believed to be a complex trait inherited through many loci, is normally distributed, presumably because the recombination of genes is a random event. That means most people are of a middling height, and the shorter and taller ones become rarer as you move to the left or right along the X-axis. You can estimate the proportion/percentage of people who will be taller than a given height by taking the area under the curve for that height range divided by the total area under the curve. This distribution holds for IQs as well. The reason the bell curve holds is that the mixing is uniformly random; if you start selecting from specific regions of the curve and then generate means, you will get a skew towards the left or the right. Now think Francis Galton, regression to the mean and eugenics.
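For example, that area-under-the-curve calculation is a one-liner in R. The mean and standard deviation below are made-up numbers, just for illustration:

mean_h <- 170 # hypothetical mean height in cm
sd_h <- 10    # hypothetical standard deviation in cm
1 - pnorm(185, mean=mean_h, sd=sd_h) # proportion taller than 185 cm, roughly 0.07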

Now, we will generate this distribution using an analytic formula that allows us to feed in numbers and get a nice smooth Gaussian curve.

x=seq(-3,3,by=0.01)
y=exp(-x^2) # e raised to the power of -x^2
plot(x,y,type="l",col="red")

This comes from the equation y = e^(-x^2), feeding in x values from -3 to 3 very smoothly in steps of 0.01.

Now the normalized version of this curve has an area of 1 under it. The reason for insisting on an area of 1 is that statisticians, especially the ones who love probability, like to describe phenomena in terms of what they call probability density functions (PDF). These probability-loving statisticians would die if all the probabilities did not add up to 1. So the area under the curve (AUC) is made equal to one by transforming the formula into something that looks like this:

f(x) = (1 / (sigma * sqrt(2 * pi))) * e^( -(x - mu)^2 / (2 * sigma^2) )

Terrifying, isn't it? But then you can see the skeleton of the first equation, and eventually you can make out that there are only some subtractions, which shift the curve along the x-axis, and some divisions, which scale the numbers down.

We can code our own R function to generate Gaussian curves and this is how we will go about it.

Gauss<-function(x,mu,sigma){
  # normal density with mean mu and standard deviation sigma
  gx<-(1/(sigma*sqrt(2*pi))) * exp(-((x-mu)^2)/(2*sigma^2))
  return(gx)
}

Now to use it:
x<-seq(-3,3,by=0.01)
y<-Gauss(x,0,1) # we pass x and the mu/mean and the sigma/standard deviation
plot(x,y,type="h",col="red")
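As a quick check that the normalization really gives an area of 1, we can integrate our Gauss function numerically and compare it against R's built-in normal density dnorm:

integrate(Gauss, -Inf, Inf, mu=0, sigma=1) # area under the whole curve, ~1
all.equal(Gauss(x,0,1), dnorm(x, mean=0, sd=1)) # TRUE: identical to the built-in density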

So, with that we have our own Gaussian function to play with. In particular, we can now modify the mean and the variance and obtain custom-shaped normal distributions. We can also take two curves and perform operations like adding and multiplying them to see what happens to their shapes.

The first thing to do is to make a Dirac delta function, an extremely thin normal distribution that looks like a spike.

y=Gauss(x,0,0.001)
plot(x,y, pch="*",col="blue",type="b")
Yes, I like to add stars to my graphs sometimes; they look nice. The point here is that the Dirac delta is approximated using a normal distribution with a very small sigma/standard deviation.

So you can create your own thin or fat curves by tweaking sigma, and shift the curve to the left or right of zero by choosing a non-zero (negative or positive) mean. Since this post is approaching the TL;DR limit, I will put the experiments with Gaussians in another post.

Tuesday 10 June 2014

Brownian motion on the loose

Brownian action: a still motion picture


Brownian motion is used to describe processes as varied as the movement of dust particles in a room, stock market patterns and the distribution of the Kolmogorov-Smirnov statistic. It is random, with no apparent order that you can use to predict the particle's next position in time: a non-deterministic/stochastic process, as they are called. Generating Brownian motion on a computer is easy, however, and this post will show you how to go about it in R.

install.packages("rgl") # we need this for 3-D visualization

If this install fails on Linux, you need to install "libglu1-mesa-dev" using apt-get or yum or whatever package manager you prefer.

So a Brownian walk is the accumulated motion of a particle that is equally likely to move in one direction as in the other at each instant, such that the mean displacement per step is nearly zero.

library(rgl)
rxyz<-matrix(runif(3*100000,-1,1),ncol=3) # fill 3 columns of a matrix with uniform random numbers in [-1, 1]
crxyz<-apply(rxyz,2,cumsum) # integrate the motion over time with cumulative sum
plot3d(crxyz,type="l",col="red") # plot the integration in 3-D

So here we generate an x,y,z trajectory for a particle over 100000 instances. The individual steps are completely random, uniformly distributed between -1 and 1, so the mean step is close to zero; however, because of small differences in the randomly generated values, there is a net drift that is itself random in the x, y and z axes.


This is a 2-D plot of the motion in the x,y plane, generated using
plot(crxyz[,1],crxyz[,2],type="l",col="red")

The 3-D plot allows you to visualize and rotate the motion in 3-D and observe the fine structure of the random walk by zooming in and out.


That is it, now enjoy your drunken walks knowing that you can still integrate when you are high.

Monday 9 June 2014

The Tea Test ceremony

Understanding the T-test


The T-test is a frequently used statistical test for testing the difference of means between two groups. However, as these things go, people don't really seem to understand how it works, so I've written some intuitive simulations in R that may explain it better.

requires: R, install.packages("corrplot")

The T-test statistic is designed to help you determine whether the difference of means (averages) between two groups of numbers is large enough to be called significant. You see such a situation in life when it is reported, for example, that the average marks of a class have improved from 65 to 70.

If you are familiar with the properties of the mean, you will know that it is influenced by the values of a few outliers (extreme values), and that it tells you nothing about the spread of values around it.

The statistic that tells you about the spread of values is the variance. The variance is calculated by subtracting the mean from every observation, squaring these differences, adding all the squared values into one lumped value, and then dividing by the number of observations.
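Here is that calculation spelled out on a small made-up set of numbers (note that R's built-in var() divides by n-1, the sample variance, rather than by n):

obs <- c(2, 4, 4, 4, 5, 5, 7, 9) # a small made-up set of observations
devs <- obs - mean(obs)          # subtract the mean from every observation
sum(devs^2) / length(obs)        # divide the summed squares by n: gives 4
sum(devs^2) / (length(obs) - 1)  # divide by n-1 instead: 4.571..., the same as var(obs)
var(obs)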

So, the mean alone cannot tell you whether the difference between the control and test groups is large enough to claim that there is an effect or an improvement. To illustrate that, look at the graphs below.


So the mean difference in the above figures is the same: the means are -1.5 and 1.5, a difference of 3. But the variance (the spread of the values around the mean) varies all the way from 0.5 to 7. As you can clearly see, more variance produces more overlap between the two histograms, so that even if the means differ, it is hard to say that the scores neatly divide into two groups. In the first tile of the graph the groups seem to separate cleanly, so the difference in means can be considered representative of the difference between the two groups as a whole. In the last tile, however, even though the difference of means is the same, the spread and overlap of the values show that random fluctuations in the scores could switch scores from the first group into the second or vice-versa, making a clear differentiation impossible.
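Since the original figures are not reproduced here, a rough sketch to regenerate plots like the ones described: two normally distributed groups with means of -1.5 and 1.5, with the variance increasing from panel to panel (the exact panel values are my assumption):

par(mfrow=c(2,2)) # a 2x2 grid of panels
for(v in c(0.5, 2, 4, 7)){ # variance increasing from panel to panel
  g1 <- rnorm(1000, mean=-1.5, sd=sqrt(v)) # first group, mean -1.5
  g2 <- rnorm(1000, mean=1.5, sd=sqrt(v))  # second group, mean 1.5
  hist(g1, col=rgb(1,0,0,0.5), xlim=c(-20,20), main=paste("variance =", v), xlab="score")
  hist(g2, col=rgb(0,0,1,0.5), add=TRUE) # overlay the second group
}
par(mfrow=c(1,1))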


So the T-test addresses precisely this issue. By taking both the mean and the variance into account, it combines them into a statistic which looks like a cousin of the Signal to Noise Ratio (SNR):

t = (X1 - X2) / sqrt( S1^2/N1 + S2^2/N2 )

So X1 is the mean of the first group, X2 is the mean of the second group, S1 & S2 are the standard deviations of the first and second group respectively, and N1 and N2 are the numbers of observations in the two groups.

Calculating the t-statistic from the values gives us a variable which follows what is known as the t-distribution. As an aside, I should probably mention that one of the major objectives of mathematical statistics is to reduce a bunch of numbers to a single number using a formula called a test statistic. This reduction (under certain assumptions) causes the statistic to follow a distribution with a predictable shape and size, which can then be used to estimate the probability of the statistic being that high by chance.

So we will be looking at the T-distribution and T-test from a simulation perspective in R

mx=seq(-10, 10,by=0.5) # create an evenly spaced set of means for the first group
my=seq(-10, 10,by=0.5) # and the same for the second group

#using each of those means, draw from a normal distribution of that mean
# and standard deviation of 3
simx<-sapply(mx,function(x){rnorm(150,mean=x,sd=3)})
simy<-sapply(my,function(x){rnorm(150,mean=x,sd=3)})

# a function to calculate the t-statistic of two vectors
t_stat<-function(x,y){
  ts<-(mean(x)-mean(y))/sqrt((var(x)/length(x))+(var(y)/length(y)))
  return(ts)
}
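As a quick check, this hand-rolled formula should agree with the t-statistic that R's own t.test reports (its default is the same unequal-variance Welch form):

x1 <- rnorm(150, mean=0, sd=3)
y1 <- rnorm(150, mean=1, sd=3)
t_stat(x1, y1)            # our statistic
t.test(x1, y1)$statistic  # the same value from the built-in test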

# Now simx and simy are two matrices the columns of which consist of 150 numbers which are normally distributed with the mean taken from each element of mx and my which vary in steps of 0.5 from -10 to 10

# now we take each pair of vectors that can be made by matching the 1st column of simx to every column of simy and then moving on to the 2nd column of simx and matching it to every column of simy and so on till all pairs are constructed. These pairs of vectors will be sent to t_stat to compute the t-statistics for the pair. That t-statistic will be stored in a matrix which we will visualize to see how the value of the t-statistic changes as we keep varying the two means.

t_mat<-matrix(0,41,41)
rownames(t_mat)<-mx
colnames(t_mat)<-my

for(i in 1:dim(simx)[2]){
for(j in 1:dim(simy)[2]){
t_mat[i,j]<-t_stat(simx[,i],simy[,j])
}
}

library(corrplot)
corrplot(t_mat,is.corr=FALSE, addgrid.col=FALSE,tl.cex=0.6,main="T-statistic landscape with variation in mean of X versus mean of Y")




From this image it is clearly visible that the magnitude of the t-statistic increases as the difference in means increases in either direction, with only the sign changing. The matrix looks symmetric in magnitude (strictly, swapping the two groups flips the sign of the statistic), but close observation will show small asymmetries. That is because we used two separate sets of random draws to generate our distributions with the same means, and since we sampled only 150 numbers we are not guaranteed convergence towards the ideal means that we specified for the random draws. But for all practical purposes the matrix is symmetric.

The heat of the colours shows the value of the t-statistic as mapped to the colour bar on the right. Along the principal diagonal, where the two means are nearly equal, the t-statistic goes to zero.

Just so you trust me on this, we'll calculate the actual means of our simulated data and plot them to make sure.

amx<-apply(simx, 2, mean)
amy<-apply(simy, 2, mean)
rownames(t_mat)<-round(amx,2)
colnames(t_mat)<-round(amy,2)
corrplot(t_mat,is.corr=FALSE, addgrid.col=FALSE,tl.cex=0.6,main="T-statistic landscape with variation in mean of X versus mean of Y")

So now you can see that there are minor variations between the actual means of the sampled data and the means we specified, from -10 to 10 in steps of 0.5.

Anyway now we need to know the p-value that is associated with the t-statistic.

The p-value is calculated from the t-distribution. In our case we can visualize this distribution by plotting the t-statistics that we have calculated as a histogram.

tstatdist<-as.vector(t_mat)
hist(tstatdist)

Slightly heavy-tailed if you look carefully. The point is that if you increased the number of samples from 150 to, say, 1000, you would get a distribution converging more closely towards the theoretical shape (strictly speaking, the t-distribution describes the statistic when the two groups share the same true mean, so this pooled histogram is only an illustration). You can do that by changing the 150 in the rnorm calls to 1000.

So the way to calculate the p-value is to compute the area under the curve from minus infinity up to the point where your t-statistic lies. This area is subtracted from 1 to get the (one-sided) p-value. (Since the AUC of a probability distribution always sums to 1.)
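R will also do this tail-area calculation for you through pt(), the cumulative distribution function of the t-distribution. A small sketch on two hypothetical samples:

x1 <- rnorm(150, mean=1, sd=3)
y1 <- rnorm(150, mean=0, sd=3)
tt <- t.test(x1, y1)     # Welch two-sample t-test
ts <- tt$statistic       # the t-statistic
df <- tt$parameter       # the degrees of freedom
1 - pt(ts, df)           # one-sided p-value: the area to the right of ts
2 * pt(-abs(ts), df)     # two-sided p-value, the same number t.test reports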

You can do that roughly (I mean very, very roughly) by using the histogram binning data to approximate a probability density function and taking a cumulative sum, normalized by the total sum, to obtain an approximate (very approximate) numerical integration of the area under the curve.

tstathist<-hist(tstatdist) # keep the binning data from the histogram
csumtstathist<-cumsum(tstathist$counts) # running total of the bin counts
plot(csumtstathist, type="l") # approximate (unnormalized) cumulative distribution function
csumtstathist/max(csumtstathist) # normalize by the total count so it runs from 0 to 1
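For comparison, R has a built-in for exactly this kind of empirical cumulative distribution function, so the histogram-based approximation can be sanity-checked against ecdf():

plot(ecdf(tstatdist), main="Empirical CDF of the t-statistics") # built-in empirical CDF
ecdf(tstatdist)(0) # fraction of t-statistics below zero, should be around 0.5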