Tuesday, June 5, 2012

Learning to think in R... I

A developmental process.

I have recently started working in a new position which gives me the opportunity to closely collaborate with "real" statistician. This new colleague is a like a treasure trove of knowledge to me, when it comes to the stringent analysis of genome-scale datasets - which is what I now do on a daily basis. Of course, this involves using R (http://cran.r-project.org/) and the Bioconductor packages (http://bioconductor.org). I have previously made use of the R environment for different types of analyses, but these were always restricted to dealing with almost finalized datasets, i.e. all the data tidying, pruning and preparation had already been done, usually using a combination of SQL (Postgres), scripting languages (Ruby, Python, Perl) and shell scripts (bash/awk/sed).

Recently I have begun moving my datasets straight into R and perform the re-formatting and organization within the R environment. As anyone who uses R on a regular basis knows, the R language is actually extremely powerful for these particular tasks. However, for someone who originally learned programming using procedural languages (object-orientation was in its infancy when I learned Unix SysV programming...), it does require some changes in the way of thinking. My initial ineptitude can be nicely illustrated in this following example:

Objective: To re-organize data from four individual tables (samples in columns, CNV probabilities per BAC in a row, one table each for "normal", "loss", "gain", "amplification", all organized in the same fashion), into matrices for each sample with BACs in columns, and a row per status containing the probabilities. This information is then supposed to be collapsed into a single "status" variable.

My first (non-working) approach looked like this:

for ( i in 1:length(colnames(Probain))) {
        name <- paste("matrix", colnames(Probgain[i]), sep=".")
        for ( j in 1:length(rownames(Probgain[i]))) {
                if (j == 1) {
      assign(name, as.matrix(c(Probnorm[j,i],Probloss[j,i],Probgain[j,i],Probamp[j,i])))
                else {

This is essentially a tell-tale sign of a procedural thought process: Create two nested loops which work throught the , include a check for the condition the loop is in, and based on that change the behaviour. The end result was a properly named matrix per sample, which only contained a single column...

After a lot of introspection, I finally arrived at this solution:

for ( i in 1:length(colnames(Probgain))) {
        name<-paste("matrix", colnames(Probgain[i]), sep=".")

This does the job. End result are matrices for each sample, containing the probabilities for each status for each BAC. And it also nicely illustrates, that R 's way of dealing with data objects really lends itself to re-formatting data...

I am planning to blog on a regular basis about my experiences delving deeper into the R environment, and I hope to thereby provide some interesting information for other people in the field in similar situations.

Monday, June 4, 2012

# This command will create a new high-dimensional list which will contain the 
# value "gain" for all variables which had a numerical value between 0.5 and 1,
# as well as the value "NA" for all variables with values between 0 and 0.5
+recode(x, recodes="0.5:1='gain'; 0:0.5=NA", as.factor.result = FALSE))
# now we're turning this list into a data.frame again and attaching 
# the col and rownames from the original data frame