Tuesday 30 September 2014

Correlation: How significant is it? Part 1


This very long post deals with establishing the significance of your correlation. In other words: is the correlation that you report just a random correlation that came up, or does it truly represent a linear relationship? Ideas of permutation testing and analytical ways to obtain P-values are discussed, and some simulations are shown for an appreciation of the ability of randomness to generate patterns.
TLDR-over

We've discussed the theory of correlation at length in the Nuts and Bolts post and the scientific philosophy of correlation in the Philosophy and Practicality posts. Correlation is a cornerstone method in the sciences for detecting statistical association between two variables in an experimental study.

One thing you should very firmly hammer into your own head is that when we talk about correlation we are not talking about causality; there is absolutely no way to establish causality using correlation. When you say that two phenomena A and B are correlated, you can't say that A causes B or vice-versa. You can only hypothesize that there is an association between the two phenomena, which may be direct or indirect, and you can't know which unless you have direct evidence.

For example, you can detect a correlation between greenhouse-gas emission levels from automobiles and the increase in global temperature. You may therefore hypothesize that increased vehicular CO2 emissions contribute to global warming. However, precisely because you can only hypothesize, and can't completely prove that CO2 emissions are the only factor contributing to global warming, people are not going to believe that global warming is real.

You could probably also detect a correlation between the average number of natural disasters over a time window and the global temperature change within that window. When I say natural disasters, I mean the climatic variety: hurricanes, heat waves, cold spells, flash floods and unseasonal weather for certain places at that particular time. However, using correlation, you still can't absolutely prove that these disasters are caused by global warming, as much as you would like to say it definitively.

Besides being unable to prove that global warming is causing all of that, correlation is also unable to prove that the relationship between global temperature rise and CO2 is a direct one. For example, I could say that the increased greenhouse effect and concomitant temperature rise is because of methane (a more potent greenhouse gas) being farted out by livestock, the production of which has increased to sustain the ever-growing population of meat-eaters. I could then argue that the increasing CO2 levels are just a consequence of the respiration of the increasing population and livestock, and of the meat and feed processing technologies that consume fossil fuels, and that therefore the correlation we observe between methane levels and global temperature is the more correct one and explains global warming better.

Also consider a scenario where increased CO2 levels lead to an increase in photosynthesis and therefore an increase in the total plant biomass available for animal feeding. Consequently the population of animals grows, feeds more on the plants and farts out even more methane. So methane levels increase because of the increase in CO2, and this results in an increase in global warming.

Now in such a case, even though you may find a correlation between CO2 levels and global temperatures, your inference that the increase in global temperatures is due to the greenhouse effect of CO2 will not be correct, because that isn't what is happening. Increased CO2 leads to increased plant biomass, which leads to increased animal feeding and concomitant population growth, and the feeding of that population generates methane which, being a more potent greenhouse gas, leads to the global temperature increase. This is a problem of transitivity: the correlation metric cannot differentiate between direct and indirect associations. Even if your variables are working in a sequential manner to cause a particular phenomenon, applying correlation between the variables will show all of them as being highly inter-correlated.

So you can see that trying to prove a hypothesis using correlation is a dicey business: you cannot prove anything. You can only postulate an association, or rather a hypothesis.


So given two sets of observations A and B, under the classical hypothesis testing format, you formulate the null hypothesis thus: you state that A and B don't have any association between them. Then you calculate the correlation between them, although you don't know whether this is a direct correlation, as in, whether A directly influences B or A acts through X, which then acts on Y, which then exerts an effect on B (transitivity).

After you've calculated the correlation you need to determine whether the number that you got is significant or not. That can be done using a P-value derived from a Permutation Testing procedure. If the P-value falls below your chosen significance level, you reject the null hypothesis (that there is no correlation) and accept the alternative hypothesis that there is a correlation.

The Permutation Testing procedure is used in many contexts to establish the significance of a statistic. I'm not entirely sure, but I think that when a statistic does not have an analytic formula describing its null distribution, the Permutation test is used to establish a significance level. For analytically tractable statistics such as the T-statistic, it is normally done using the Area Under the Curve (AUC) of the corresponding distribution: the T-distribution in the case of the T-statistic, and other distributions in the case of other statistics.
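For Pearson's correlation specifically, the analytical route is standard textbook material: under the null hypothesis of zero correlation, the quantity t = r*sqrt((n-2)/(1-r^2)) follows a T-distribution with n-2 degrees of freedom, and as far as I can tell this is also what cor.test in R does. A minimal sketch (the function name cor_pval is mine):

cor_pval<-function(r,n){
  # T-statistic corresponding to a correlation r computed from n pairs
  t_stat<-r*sqrt((n-2)/(1-r^2))
  # two-sided tail area under the T-distribution with n-2 degrees of freedom
  2*pt(-abs(t_stat),df=n-2)
}
# e.g. cor_pval(0.58,15) lands in the same ball-park as the cor.test output further down in this post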

At the most basic level the Permutation procedure relies on permuting the data points on which the statistic has been calculated. It then recalculates the statistic for this permuted, jumbled and by now nonsense data. The newly calculated statistics (often referred to as a null distribution) are then compared with the statistic for the original non-permuted data. A P-value estimate (I don't know if calling it a P-value is strictly correct, because in some places I have read that this method is also used in calculating the False Discovery Rate, but then we're talking about the limits of what I know; I would happily be corrected in the comments) is then made by counting the number of times the statistic of the randomized data came out higher than that of the original data. This count divided by the number of permutations gives you an estimate of the so-called P-value.
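In code form, the bare-bones version of that recipe looks something like the sketch below (the post builds its own variant further down; the name perm_pval and the default of 10000 permutations are mine):

perm_pval<-function(a,b,n_perm=10000){
  obs<-cor(a,b)                                   # statistic on the original pairing
  null_cors<-replicate(n_perm,cor(a,sample(b)))   # statistic on permuted (jumbled) pairings
  sum(null_cors>=obs)/n_perm                      # fraction of permutations beating the original
}
# note this counts only permutations at least as large as the observed
# correlation, i.e. it is a one-sided estimate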

Thus the Permutation test is used to calculate the significance of statistics. For correlation, the Permutation test tells you how likely it is that you got such a nice correlation by chance. The advantage of having a significance level is (if you go by Ben Lehner's papers themed on stochasticity and noise in biological systems) that you can detect very weak correlations which would have been ignored in the standard statistical methodology, because a correlation like 0.10 is considered insignificant in magnitude. In biology, apparently 0.6 is considered a ball-park value for a good correlation. For the very weak correlations in Lehner's papers, however, when you calculate the P-value for that low correlation it turns out to be 10^-28 (10 to the power negative 28) or some similarly tiny value, which makes that tiny correlation significant. This means that the chance that you got such a weak correlation by accident is about 1 in 10^28.

So how would you pick up a weak correlation, or show that a strong correlation you got did not arise by chance? You can use a simple piece of code that permutes the values of one variable; to keep the example meaningful for the biological sciences, we will work with correlations calculated over a small set of values.

To thicken the plot, let's examine the phenomenon of "spurious correlation". This is a correlation that appears when you take ratios of numbers before calculating the correlation. Examples include the popular strategy of taking log ratios in micro-array data analysis to calculate fold change, or, in the field of yeast synthetic genetics, taking the ratio of the growth rate of a mutant strain to the wild-type to examine fitness changes caused by mutation. When you use ratios to find correlations, you can end up with artifact correlations that are generated purely because you took the ratio. When looking for correlations you should use the raw values, not the fold-changes.

To explain it more clearly, let's try some R code

x<-rnorm(1000,mean=10)
y<-rnorm(1000,mean=10)
z<-rnorm(1000,mean=10)

cor(x,y) # gives a weak correlation -0.0050
plot(x,y,pch="*",main="Scatter of 2 independent Normal dist RV",col="red",xlab="Random Variable X",ylab="Random Variable Y")


cor(x,y/x) # gives a strong negative correlation of -0.717
plot(x,y/x,pch="*",main="Scatter of RV X and Y/x",col="purple",xlab="Random Variable X",ylab="Y/x")
It is easy to see why this would happen. As x changes, the value of y/x is locked on to the change: an increase in x causes a decrease in y/x and a decrease in x causes an increase in y/x. If we call the ratio y' (y prime), then y' = y*(1/x), so its value depends on y as well as x; and since y just hovers around its mean, y' behaves roughly like a constant times 1/x, which gives a nice negative correlation with x.


cor(x/z,y/z) # gives a relatively strong correlation of 0.4874412
plot(x/z,y/z,pch="*",main="Scatter of RV X/z and Y/z",col="purple",xlab="Ratio X/Z",ylab="Y/Z")

It is also easy to see how this would happen. Z is a normal random variate, and whenever z takes a high value, x/z and y/z both become low, and when z takes a low value, x/z and y/z both become high, thus introducing a positive correlation.

So dividing both variables by the same set of random normal numbers gives you a relatively strong positive correlation (0.48), and you can see why taking ratios can be a problem. In biology it is relatively common practice to divide by a total or by a particular variable, and this can lead to incorrect conclusions drawn from correlations that are not really there, having been introduced erroneously by the normalization procedure.


Now, does a permutation test prove that this correlation is not significant? Not really, since it is actually a proper correlation caused by a mathematical relation. However, we can see how the permutation testing procedure tests for significance of correlations by trying it out on some simulated data. To ensure that the simulated data is truly random, we will generate it using random numbers from a uniform distribution and write a program that selects a randomized set that passes the correlation criterion.


Let us try an exercise where we first write a program that tries to randomly generate data with a correlation that we demand from it, for a specified number of values. Basically, we want to feed in a correlation value, say 0.6, and ask for 10 numbers which, when correlated against 1 to 10, give us a correlation of 0.6.

cor_RandGen<-function(cor_val=0.6,len=10){

  # correlate candidate vectors against the fixed reference vector 1..len
  x=1:len
  trial=0
  # first candidate: a vector of uniformly distributed random numbers
  y_var<-runif(len)
  samp_cor<-cor(x,y_var)

  # keep drawing fresh random vectors until the correlation, rounded to
  # one decimal place, matches the requested value
  while(cor_val!=round(samp_cor,1))
  {
    y_var<-runif(len)
    samp_cor<-cor(x,y_var)
    trial=trial+1
  }
  # return the successful vector and the number of draws it took, as a list
  out=vector(mode="list")
  out$y_var=y_var
  out$trial=trial
  return(out)
}


This function does precisely that. It takes a correlation value as an argument, along with the length of the vector to be generated, and then keeps generating sets of uniformly distributed random numbers until it finds one that exhibits the correlation we've asked for. It returns, as a list, the numbers it has found along with the number of trials it took to get that set from random number draws.
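A quick sanity check of the function (the exact output will differ on every run, since the draws are random):

res<-cor_RandGen(cor_val=0.6,len=10)
round(cor(1:10,res$y_var),1)   # rounds to 0.6 by construction
res$trial                      # number of random draws it took to find such a set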

It isn't too hard to see that as the length of the requested vector increases, the number of iterations needed to find a vector satisfying that particular value of correlation also increases.

Now as another fun exercise, let us increase the vector length from 5 to 60 in steps of 5 (12 steps/iterations), keeping the correlation value at 0.6, and repeatedly record the number of random draws that were required to get that correlation for each vector length. Following that, we will draw a box plot of the iteration counts to see how much of an effect chance plays in getting that correlation. For shorter vectors it is likely that the spread of iteration counts will be small, since it is easy to stumble upon that much correlation by chance, but for bigger vectors the spread should be quite large, since it can take a long time to reach that high a correlation randomly.

Now, I've written some quick and dirty code to do this. It takes about half an hour to execute on a 2.1 GHz Intel Core 2 Duo machine with 4 GB RAM running RStudio on Windows 7. Most of the half hour is spent on the later part, where high correlations have to be discovered in larger sets of numbers and therefore more iterations are needed.

# for each vector length (5,10,...,60), repeat the random search 50 times and
# record how many draws were needed to hit a 0.6 correlation each time
corlenseq<-seq(5,60,by=5)
corlist<-vector(mode="list",length=12)
for(i in 1:12)
{
  for(j in 1:50)
  {
    corlist[[i]]<-c(corlist[[i]],cor_RandGen(cor_val=0.6,len=corlenseq[i])$trial)
  }
  print(i)   # progress indicator
}


uncorlist<-unlist(corlist)
matuncor<-matrix(uncorlist,12,50,byrow=TRUE)
boxplot(t(matuncor),col="purple",main="Iteration count for random hits of 0.6 correlation for set length 5-60 in steps of 5")




The plot on a log scale looks a little better: you can see the monotonic increase in the median number of iterations required to reach the same correlation as the number of elements in the set keeps increasing.
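The code for the log-scale version isn't shown above; one plausible way to produce it (my guess at what was done) is simply to ask boxplot for a logarithmic y-axis:

boxplot(t(matuncor),log="y",col="purple",main="Iteration count (log scale) for random hits of 0.6 correlation for set length 5-60 in steps of 5")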


I dare say that the median value of the number of iterations increases exponentially and therefore a log scaling could give a nice linear fit for the medians.

# ylist (the log2 medians of the iteration counts) is computed a little further down
plot(corlenseq,ylist,main="Scatterplot of log2 median iterations vs. sequence length Correlation: 0.6",col="red",xlab="Sequence length",ylab="Log 2 Median iterations")


We can try a regression on the log-transformed medians. If the model fits well then we can presume that the log transform linearizes the medians. Now, if a quantity becomes linear on log transformation then it probably grows exponentially to begin with: if the median count grows roughly like c*2^(b*n) with vector length n, then log2(median) is roughly log2(c) + b*n, a straight line in n. In that case we could say that the median number of iterations it takes to discover a vector (generated from a uniform distribution) with a correlation of 0.6 grows exponentially with the length of the vector.

Now to fit a linear model to the data, we take the y vector as the log transform of the median values of the distribution of the number of iterations taken to get a vector with that high a correlation (0.6) as the length of the random vector increased from 5 to 60 in steps of 5.  The distribution of number of iterations is obtained by repeating the generation of a random vector of a particular length 50 times. The x vector is simply the lengths of the sequence of the random vector from 5 to 60 (corlenseq).

ylist<-unlist(lapply(corlist,function(x){median(log2(x))}))
# [1] 3.459432 4.491497 5.740900 6.554507 8.065728 9.803805 10.810313 12.207590 [9] 12.752444 14.529334 15.603466 17.687929

cor.test(corlenseq,ylist)

------------------------------------------------------------
    Pearson's product-moment correlation

data:  corlenseq and ylist
t = 56.3941, df = 10, p-value = 7.461e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9942186 0.9995751
sample estimates:
      cor
0.9984315
------------------------------------------------------------
 

lmxy<-lm(ylist~corlenseq)
abline(lmxy,col="blue",lty=3)
 


summary(lmxy)
-------------------------------------------------------------
Call:
lm(formula = ylist ~ corlenseq)

Residuals:
     Min       1Q   Median       3Q      Max
-0.26541 -0.16303  0.05005  0.11609  0.28841

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 1.493194   0.113555   13.15 1.23e-07 ***
corlenseq   0.174022   0.003086   56.39 7.46e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1845 on 10 degrees of freedom
Multiple R-squared: 0.9969,    Adjusted R-squared: 0.9966
F-statistic:  3180 on 1 and 10 DF,  p-value: 7.456e-14

-------------------------------------------------------------

So the R-squared comes out at around 0.99, so we can safely say that the number of draws needed to generate a sequence of a particular correlation from a uniform distribution grows exponentially with the length of the sequence. From the summary above, the slope of the line fit is about 0.174 and the intercept is about 1.49 (on the log2 scale). The P-value obtained from the correlation/regression, 7.456e-14, indicates that this is highly significant.
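For the record, the slope and intercept quoted above can also be read straight off the fitted model object:

coef(lmxy)
# (Intercept)   corlenseq
#    1.493194    0.174022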




Now that is a cool thing that even I did not know: that the median time taken to randomly find a sequence of a particular correlation varies as an exponential function of the length of the sequence.

Okay so now the post has started to acquire a mind of its own. Anyway, back to business, the post is still about establishing significance levels of correlation using permutation testing and that is exactly what we are going to do with a randomly generated vector of a certain correlation.

So we generate a vector of length 15 that has a correlation of 0.6 using our random correlation sequence generator. Then we'll permute the values to see how significant the correlation turns out to be.
 
randcor0p6<-cor_RandGen(cor_val=0.6,len=15)
# randcor0p6 $y_var [1] 0.31093842 0.29422432 0.10033297 0.55573146 0.23636603 0.08042674 0.75749109 0.67582663 0.75344375 0.77688401 [11] 0.32971845 0.70740437 0.85071694 0.28012281 0.99635235

xvar<-1:15
plot(xvar,randcor0p6$y_var)
cor(xvar,randcor0p6$y_var)

#[1] 0.5831759
  

plot(xvar,randcor0p6$y_var,main="15 Random Points of Correlation 0.6",col="red",xlab="1:15",ylab="Uniformly Distributed Random Numbers")

 
We can find a P-value for the correlation in R by using the cor.test function instead of plain ol' "cor". Note that the P-value given by cor.test is the same as the one given by the linear model fit summary further down.
 
cor.test(xvar,randcor0p6$y_var)
------------------------------------------------------------ 

 Pearson's product-moment correlation

data:  xvar and randcor0p6$y_var 
t = 2.5884, df = 13, p-value = 0.0225
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.1011222 0.8434631 
sample estimates:
      cor 
0.5831759 
------------------------------------------------------------ 
 
randLm<-lm(randcor0p6$y_var~xvar)
abline(randLm,col="blue",lty=3) 
 
 
summary(randLm)
--------------------------------------------------------------------------



Call:
 lm(formula = randcor0p6$y_var ~ xvar)

Residuals:
 Min        1Q    Median     3Q     Max 
-0.46377 -0.19194 0.06573 0.19094 0.28212 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|) 
(Intercept) 0.20685  0.13475     1.535   0.1487 
xvar        0.03836  0.01482     2.588   0.0225 *
 ---
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.248 on 13 degrees of freedom
Multiple R-squared: 0.3401, Adjusted R-squared: 0.2893 
F-statistic: 6.7 on 1 and 13 DF, p-value: 0.0225 
--------------------------------------------------------------------------

Now we should try to see if we can write our own variant of a permutation testing function and see if it gives P-values anywhere close to the in-built routines in R.

So we will write some quick and dirty code that permutes the values of the y vector and then calculates the correlation of the x vector against each permutation of the y vector. The number of times this correlation comes out higher than the correlation of the un-permuted data, divided by the total number of permutations, passes for the P-value.
 
This takes about 2 minutes to execute on an 8 GB, quad-core Xeon machine (Dell Precision T5400).

cor_rand<-vector()
for(i in 1:100000){
  # correlation of x against a random permutation of the y values
  cor_rand<-c(cor_rand,cor(xvar,sample(randcor0p6$y_var)))
}


# P-value as I've seen it defined for permutation testing.
 
length(which(cor_rand>cor(xvar,randcor0p6$y_var)))/length(cor_rand)
 [1] 0.01186

Hmm, interestingly, this seems to be an underestimate of the P-value compared to what we obtain from the regression/correlation tests, which both seem to be using the same method to estimate the P-value. (One likely contributor to the gap: cor.test reports a two-sided P-value, while the permutation count above is one-sided.) Could it be that the tests are obtaining the P-value by some other means? We could try building the distribution of the permuted correlations, which follow the T-distribution as I've heard, and then try a rough numerical integration to see how much of the Area Under the Curve is covered until a correlation of 0.6 is reached. Note that this is highly experimental and possibly wrong, but then on a random blog we can try it out and learn something new; wrong things lead the way to right things, eventually...

Using a density approximation, we can get a smoothed histogram. The object returned by using the density command is a list containing x and y values.

So now I will use the trapezium rule for integration to find the area under the density curve of the randomized correlations obtained from permutation of the original data. Each trapezium has area 0.5 * (sum of parallel sides) * height. The height here is the successive difference as we step through the x-axis, calculated using the R function diff, which returns a vector containing the successive differences. The sums of parallel sides are obtained by looping through the y values, adding the i-th and (i+1)-th elements together until you reach the end, and concatenating the results into a vector.

 
plot(density(cor_rand))
dens_cor<-density(cor_rand)

height<-diff(dens_cor$x)   # widths of the trapezia (successive x differences)
sops<-vector()             # sums of parallel sides (successive y pairs)
for(i in 1:(length(dens_cor$y)-1))
{
  sops<-c(sops,sum(dens_cor$y[c(i,i+1)]))
}


# Finally you just multiply the vector of differences with the vector of successive sums,
# take the sum of the resulting vector and multiply it by 0.5. Ideally you get the complete
# area under the curve, which should be 1, give or take a little because of numerical error.
 
sum(sops*height)*0.5
# [1] 1.000978
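# (Aside: the loop above can be replaced by a single vectorized line that computes
# exactly the same trapezium sum; this is just an equivalent rewrite, not a new method.)
sum(0.5*(head(dens_cor$y,-1)+tail(dens_cor$y,-1))*diff(dens_cor$x))
# this should reproduce the same total AUC as sum(sops*height)*0.5 above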

# That is pretty close to 1. Now we need the area under the curve when we integrate
# up to the observed correlation of 0.58.
  
length(which(dens_cor$x<=0.58))# [1] 408

# The area under the curve up to a correlation of 0.58 is the summed area of the
# trapezia formed by the first 408 sides.

sum(0.5*sops[1:408]*height[1:408])
# [1] 0.9886911

# The P-value is then 1 - AUC up to index 408

1- (0.5*sum(sops[1:408]*height[1:408]))
# [1] 0.01130888


Um, well, as it turns out, this P-value is also not close to the P-value we found using the regression/correlation in R; in fact it is closer to the P-value that we found using the Permutation testing procedure (0.01186). So I think we may be barking up the wrong tree here. Anyway, looking up how R gets the P-value for its correlation, I found something about a transformation called Fisher's Z-transform. It seems that the correlation values we get from the sampling distribution are not normally distributed except for low values of correlation, and tend to be negatively skewed for higher values; applying this transformation mitigates the skewing problem.

The transformation is:

z = 0.5 * log((1 + r) / (1 - r))

where r is the correlation coefficient (this is the same as the inverse hyperbolic tangent, arctanh(r)).

Now to do all of that neatly, we need to cook up some functions: one for the Fisher's Z-transform and a second for the trapezium integration.

A fairly simple function for the Fisher's Z-transform

FisherZ<-function(x){
  z<- 0.5 * (log(1+x)-log(1-x))
return(z)
}
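Incidentally, since 0.5*(log(1+x)-log(1-x)) is just the inverse hyperbolic tangent, base R's atanh should agree with this function:

all.equal(FisherZ(0.3),atanh(0.3))
# [1] TRUE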


For the trapezium integration, I've just put the earlier integration code together into a function to which you pass the randomized correlation distribution and the correlation threshold up to which you want the AUC calculated.

TrpzInteg_AUC<-function(cor_rand,thresh){

  # kernel density estimate of the permuted correlation distribution
  dens_cor<-density(cor_rand)
  height<-diff(dens_cor$x)   # widths of the trapezia

  sops<-vector()             # sums of parallel sides

  for(i in 1:(length(dens_cor$y)-1))
    {
    sops<-c(sops,sum(dens_cor$y[c(i,i+1)]))
    }

  totauc<-sum(0.5*sops*height)               # total area, should be close to 1
  message(paste("Total AUC",totauc))
  ix<-length(which(dens_cor$x<=thresh))      # index of the last x value at or below the threshold
  pauc<-sum(0.5*sops[1:ix]*height[1:ix])     # partial area up to the threshold
  return(pauc)
}


Now we can try applying this transform to our permuted correlation coefficients and see if it brings the distribution closer to a normal distribution (trivia: such transforms are called variance stabilizing transforms). So we'll try the same thing we did earlier, i.e. we start with the distribution of correlation coefficients obtained from our permutation (cor_rand), this time run the Fisher's Z-transform on those values, use the transformed values to estimate the Area Under the Curve again using the trapezium rule, and consequently find the P-value.

ZtranCor<-FisherZ(cor_rand)

Let's try plotting the histograms to see how things change with the transform, and then we'll do an over-plotted density plot to see how well they overlap, so that R's prettily printed separate graphs don't mislead us about how different the histograms look.

hist(cor_rand,col="yellow",main="Permuted Random Correlations",xlab="Correlation")
hist(ZtranCor,col="purple",main="Fisher-Z-transformed Correlations",xlab="Correlation")
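To look at the two histograms in the same window (just a convenience, not part of the original workflow), you can split the plotting device:

op<-par(mfrow=c(1,2))   # two panels side by side
hist(cor_rand,col="yellow",main="Permuted Random Correlations",xlab="Correlation")
hist(ZtranCor,col="purple",main="Fisher-Z-transformed Correlations",xlab="Correlation")
par(op)                 # restore the previous layout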


 


You have to admit, though, that the Z-transform has clearly made the histogram much more symmetrical.


plot(density(ZtranCor),col="purple",type="l",lty=2,main="Permuted Correlations (PC's) and Z-transformed PC's density")
 

lines(density(cor_rand),col="red",type="l",lty=3)


So, applying the transform and looking at the density plot, we can see that the red curve, which is the original distribution of permuted correlation values, is slightly asymmetrical around the centre, while the purple curve, which is the Fisher Z-transformed version, is distributed more or less symmetrically around the centre.

So now we execute the function TrpzInteg_AUC on the Z-transformed permuted correlation values (ZtranCor) to find the estimate of the P-value according to the same principle as before.

TrpzInteg_AUC(cor_rand,0.58)
# Total AUC 1.00097847575004
# [1] 0.9886911
 

1-TrpzInteg_AUC(cor_rand,0.58)
# Total AUC 1.00097847575004
# [1] 0.01130888
 

1-TrpzInteg_AUC(ZtranCor,0.58)
# Total AUC 1.00097114802702
# [1] 0.02189813




So as you can see, the function behaves correctly, as evidenced by it giving the same value for the non-transformed data as we calculated earlier.

The P-value estimate for the Z-transformed correlation distribution (0.0219) is closer to the P-value obtained from the regression/correlation (0.0225). So we can probably say that the correlation/regression functions in R use a similar, and definitely more accurate, method than our rough numerical integration to arrive at their P-value estimate. I'm not implying any warranties that what we did was correct, but I could say that it was not entirely wrong.


However, the main objective of this post is not yet complete. Whilst we have shown how the P-values are obtained, and it was a good exercise and everything, we still haven't shown the difference in significance levels between the correlation of a randomly generated vector and that of a real data vector with noise added to bring the correlation down to 0.6.

So, as it turns out (proof not given), a high correlation such as 0.6 generated even through random means has just as good a P-value as a perfect correlation that has been noised down to 0.6. This means that a correlation of 0.6 for a vector of length 15 holds significance by virtue of its length. The blog post on Correlation: Philosophy and Practicality has an interesting graph which shows the decline of a perfect correlation as exponentially distributed noise is added to it. For correlations down to about 0.6 you can see the "cloud around the diagonal", but as the correlation drops lower, to say 0.2, you don't see the cloud anymore. At that point it pretty much looks randomized, and that is where the P-value estimates are going to come in really handy. Proving that is going to take some more coding and I think this post has stretched far too long. Another post on another day; part 2 will probably have the simulations for that.

Tuesday 23 September 2014

Shathranj ke Khiladi (Experts of Chess)

TLDR;
Random musings about chess and life. The title of this post is from a story by Premchand, a Hindi writer who wrote pithy, wry and direct stories about the aristocracy and the lives of poor men under them. Briefly, Experts of Chess is the story of two chess players so engrossed in their games that they ignore the fall of their king, his kingdom and the decimation of his soldiers; later they argue over the game, duel each other and are both mortally wounded, which is the end of the story. This post has little to do with that story, though, and more to do with the game of chess and an analogy to life.

#include<chess.h>
#include<musings.h>
void main()
{
I’ve taken to playing chess about 4 months ago. I was introduced to the game again by a friend (I used to play really lousily when I was a kid). She is good at it and has beaten me in all but 2 of the 12 games that we’ve played; there was a vodka-fuelled game where I was beaten really badly, and I still can’t say it was because I was inebriated. Then I’ve played against the computer on Chess Titans on Windows 7. I started with Level 5; the first game lasted two hours with plenty of Undo Moves and a reversal from a checkmate (which, according to the game statistics, is counted as a loss even if you undo), played on from there and won by checkmate.

While I played I realized that early in the game there was no way to predict my opponent's future moves more than 3 moves ahead, as the number of possible combinations was insane. However, shit could get real very fast if you played fast and loose.

There is quite a bit of breathing room in the beginning, and this is when you should prepare the layout of your pieces so that they back each other up. If you play a move just for the heck of it, without an idea of what strategy that move serves, then you’ve wasted your move. A move that could tighten your offence or strengthen your defence is what you should always be on the lookout for. Also, there is no point taking out the big guns early when the board is full and you have no space to move. When the opponent is clever enough, even if you have them defended you will lose them to the pawns, which is never a good deal. Use the Knight early on; it can jump over pieces, and on a chock-a-block board it is a good way to pick out defensive pawn positions. Avoid losing the Knight if you can help it. The computer always tries to get my Knights in exchange for its own and considers it a fair trade. I wonder why it would do that, considering it could have used them just as well.

There are no spare moves towards the middle of the game. You have to play everything with a purpose and exactly right, keeping all the pieces backing each other up while breaching the enemy lines. The middle of the game is when the pieces start going off the board and you have to start weighing your options. Would you mind losing a Knight to save a Rook or a Bishop? A pawn threatens both your Knight and your Bishop: which one do you choose? The Queen, yes, you have to save her, and you could end up sacrificing quite a bit just to save her. The computer always goes for your Queen, so never put her up front with a lesser piece defending her. Send the Rook in for an attack if you have to and then defend the Rook with the Queen, but never vice-versa, because the computer will just take your Queen and not care what it loses in return. In other scenarios you will notice it putting its Queen head-to-head with yours and then losing its Queen to take yours. But this is a pre-meditated calculation that assumes one of its Pawns is already half-way across the board and will reach the end and promote itself to a Queen, which will make your King a sitting duck.

The end of the game is a little anti-climactic. I’m told that towards the end, the game is predictable, so predictable that in fact the computer stores look-up tables for the moves that can result in the fastest check-mate. The irony is that, despite the board being largely empty, there is not much space to run around to and you will eventually be trapped into a corner and check-mated. Hope or horror also exists when a pawn goes over to the other side of the board and gets promoted, mostly to Queen. That is mostly the end, if you have just the King and other odd-end pawns, it could also be the end for the opponent if you have a bishop or another Queen on a largely empty board with the other King in the corner.

It seems to be quite a useful analogy for life. Play for a strong defence and build your fundamentals. When the attack comes you should have your defence ready, otherwise it will be decimation. There is no point taking out the big guns first. Play the small battles with the pawns and weaken the enemy's defence, then start moving in for the kill with the big guns.

Recently I happened to play Level 6. It is very hard for me and this particular game lasted 2 hours or more. I got check-mated in at least 6 different ways and kept undoing the game to see what the point of no-return was. I played a variety of endings which were ending in check-mate. Finally I found an undo point where it seemed I was able to smuggle my Queen bang in the middle of enemy territory. It already had its Queen and kept troubling my King in an oscillating check series where I stayed put in a, for the lack of a better word, fortress of Pawns in a triangle. Then began my series of check moves, I had my Queen backed up with a Rook and that Rook backed up with another Rook just so that the computer Queen wouldn’t dare kill on the diagonal. Then a sudden check by the computer Queen which came and stood right behind my King and I took the Queen for nothing. Then there was nothing to do but to use my Queen as soon as possible for a check-mate and that was it. I had won Level 6 for the first time ever. I had drawn it previously but a win always has a nice warm fuzzy feeling to it. To beat a machine, well maybe not the best chess engine but a fairly decent one.

However, if I were to stretch the analogy of chess for life far enough: is it possible that you could have made a series of moves from which there is a point of no return? It would not really be possible to know, because I knew that until I did about 9 undos (10 moves back, more or less) I was not able to get to the point in the game where I could save my power pieces and also move in for the kill. I repeatedly lost my Rooks and the Queen in exchanges where the computer retained a Rook, lost its Queen and managed to promote a Pawn to Queen within the next 3 moves to give me a check-mate. Out of the 9 variants of the undo games I played (they all ended in check-mate for me, each differently), only in one did I manage to defend as well as move in for the kill simultaneously.

Is it possible that I may have reached that point in life as well, where I may have just made the best of the positions I set out with initially, but now it is just a few moves to check-mate? There is no Ctrl-Z for the moves you make in life, and there is of course no going back to play a different variant. Are the late 20s the middle game of academia-chess? Have the choices that determine the closing moves of the endgame already been made? I guess I will know 5 years down the line, and it will be too late to do anything by then. All I know is that, going from one day to the next, I make sure I always know more than what I started out with, and knowledge has always been power. By a check-mate I mean the glass wall you hit up against and just can't go higher than; a win would be a place where you do the science you like and you live it too. Also when you realize that the academic hierarchy, like a lot of others in the world, is a little lop-sided, and you definitely don't want to be at the bottom of a pyramid of incompetents.

As to the question of whether it ends in a check-mate for me, your guess is just as good as mine.

}

Tuesday 2 September 2014

Science Humour


In the first 2 years of my Ph.D., I happened to educate myself on how to write scientific text that can be horrifyingly obfuscating as well as technically impressive, while looking scholarly enough to pass for publishable writing. Sometimes, however, there has been unintended humour that comes out only when you read the text aloud. On the other hand, I try not to pass up the opportunity to crack a subtle joke in an institute seminar or lab meeting. So here are a couple.

For instance, we had this hypothesis that exhaled breath would contain organic compounds that would help us detect the constitution of the microbiome of the gut. Basically, the relative proportions of sulphur-containing gases, methane and other gases would indicate what sort of bacteria were present in the gut. Being in an omics institute, we gave this collective parameter the name "gas-ome".

If that wasn't funny enough, in a proposal that I ended up writing, we had figured that different bacteria could emit the same gases, so gas production alone couldn't be used to determine constitution. So we wrote, "Since gas production is degenerate among different species", to peals of laughter on recognizing the unintended pun. By degeneracy we meant that different species could produce the same gases (much as different codons encode the same amino acid); the social implication of that sentence, however, wasn't clear until we wrote it down and read it aloud.

My seminar jokes have been subtle. One which I never made was this one

 "As much as you think that this is a scene from the movie "Kill Bill Vol I", it is not. It is a correlation of epistasis values between FBA and SGA for 97000 double knockouts."

Of course this wasn't going to go well with an audience that may not have watched the movie so I let it pass.

Another one, which I didn't crack but which was there for everyone to see, was a slide on using mathematical models to construct a minimal genome. See if you can spot it; it is a calculus joke.





Hint: Top left corner, math expression is a joke.

Speaking of calculus jokes, here is one from my favorite (although slightly irregular) science comic at www.abstrusegoose.com


That's all for the random blog entry of the day folks!