A tutorial on organizing and plotting data with R

Every year I teach in the Programming for Evolutionary Biology course held in Leipzig. It’s an intensive three weeks course, in which we take 25 evolutionary biologists with no prior knowledge of programming, we lock them in a room together with some very good teaching assistants, and we keep explaining them how to program until they learn or manage to escape.

Jokes aside, the course is a very nice experience, and people have a lot of fun, as the three weeks are full of discussion about evolutionary biology and about bioinformatics. The nice things is that this year two brilliant former students (Karl and Michiel) organized a whole reunion conference, which has been called the PEB conference and is taking place this week at the CIBIO near Porto. This reunion conference is also an opportunity to follow-up the students, and I have been in charge of organizing a couple of workshops, one about installing software on linux, and one about advanced R programming.

The tutorial for advanced R programming is available on github and below. I think it may be of interest for anybody with some knowledge of R, but wishing to learn some new tricks. In particular the tutorial is about good ways to organize a dataframe, and I tried to cover a few beginner mistakes about data analysis that I saw in Biostar. It describes the differences between a dataframe in a wide or a long format, how to convert one to the other, and what are the advantages of doing that. It also teaches how to calculate group-based summaries with dplyr, and how to plot them with ggplot2.

DIfference between a dataset in a wide or a long format
DIfference between a dataset in a wide or a long format

PEB advanced R workshop

Welcome to the PEB advanced R workshop!

We will take a “messy” gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.

For a better experience see the full version of this tutorial on github:

Requirements

Let’s install all the required packages:

The tidyr package requires at least R 3.1.2. If you can’t install it, you can use reshape2 instead, but you won’t be able to follow some of the exercises.

We also need to download the following data files:

the peb_expression.csv file

Let’s read the peb_expression.csv file that we just downloaded, and have a look at its contents.

The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.

A piping system for R

The dplyr package introduced a piping system for R, using the %>% symbol.

This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.

For example, the following is equivalent to head(peb)

We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.

Converting to a long format

A dataset can be encoded in a “wide” format (more columns and less rows), or in a “long” format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis.

Our peb data frame is in a “wide” format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let’s convert it using the gather() function from tidyr:

Explanation:

  • sample is the name of the new column containing the key variables
  • expression is the new column containing the values variable
  • we use the operator to define which columns must not be converted to the long format

Note how the new format encodes exactly the same data as before, but has only four columns.

Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach.

If you couldn’t install the tidyr package, you can achieve the same format using the melt function from reshape2

Tidying-up the peb.long dataframe

In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let’s split this column into two, using separate from tidyr:

The above code can be simplified using the %>% operator:

To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the “Control” rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column.

Group operations

Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.

  • group_by is used to define how the rows of a dataset must be grouped. For example, we can group the rows of peb.long by population.
  • summarise is used to calculate summaries of a grouped dataset – for example we can use it to calculate the mean and standard deviation of the expression for every population:

  • mutate is used to add columns to a data frame, or to modify existing columns. For example we can use it to annotate whether each gene is over or under expressed, compared to the expression of other genes in the same individual:

Plotting the dataset

We will use the ggplot2 package to plot our peb.long dataframe.

A ggplot2 plot is composed by a base ggplot() object, defining the dataset and the basic variables used, and by additional elements defining how to represent them. Let’s see an example:

 

Explanation:

  • ggplot(peb.long, aes(x=overexpressed, y=expression)) initialize the base ggplot2 object. We use the aes (for aesthetic) function to define the default x and y variables
  • geom_point adds a scatterplot representation of the base plot
  • ggtitle is another element of the plot, in this case the title.

Let’ try a boxplot representation:

 

Let’s create a nicer plot, in which we:

 

Classifying cancer genes

Remember that we also downloaded a file containing the classification of genes into oncogenes and tumor suppressors, from the NCG database:

This file doesn’t follow the principles of “tidy” data, according to which each row should represent only one single observation.

In this case we can use the unnest function from tidyr to convert it to a “long” format:

Now we can use left_join from dplyr to add a column to our peb expression data frame:

OPTIONAL exercise

You may notice that the number of rows in the dataframe increased after the join. The reason is that three genes are classified both as Oncogenes and as Tumor Suppressors.

To find them you can do any of the following:

End of the OPTIONAL exercise

advanced ggplot2: facetting and saving to a file

Now that we have more qualitative variables in peb.long, we can create more sophisticated plots. For example, an useful way to represent multi-dimensional data is facetting:

 

If we want to save this plot to a file, we can do:

a Barchart plot

Let’s see how many oncogenes/tumor suppressors are overexpressed in each population:

 


 

Leave a Reply