Data analysis involves a large amount of janitor work – munging and cleaning data to facilitate downstream data analysis. This lesson demonstrates techniques for advanced data manipulation and analysis with the split-apply-combine strategy. We will use the dplyr package in R to effectively manipulate and conditionally compute summary statistics over subsets of a “big” dataset containing many observations.
Recommended reading: Review the Introduction (10.1) and Tibbles vs. data.frame (10.3) sections of the R for Data Science book. We will initially be using the read_* functions from the readr package. These functions load data into a tibble instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a little easier. These sections explain the few key differences between traditional data.frames and tibbles.
We are going to use a yeast gene expression dataset. This is a cleaned up version of a gene expression dataset from Brauer et al. Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol Cell 19:352-367. These data are from a gene expression microarray, and in this paper the authors are examining the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast rich media loaded with nutrients but restrict the supply of a single nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes whose expression changes in response to growth rate, and genes that respond specifically to which nutrient is limiting.
The file is called brauer2007_tidy.csv. Later on we’ll actually start with the original raw data (minimally processed) and manipulate it so that we can make it more amenable for analysis.
We need to load both the dplyr and readr packages for efficiently reading in and displaying this data. We’re also going to use many other functions from the dplyr package. Make sure you have these packages installed as described on the setup page.
# Load packages
library(readr)
library(dplyr)
# Read in data
ydat <- read_csv('https://raw.githubusercontent.com/bioconnector/workshops/master/data/brauer2007_tidy.csv')
# Display the data
ydat
# Optionally, bring up the data in a viewer window
# View(ydat)
## # A tibble: 198,430 x 7
##    symbol systematic_name nutrient  rate expression bp                      mf                   
##    <chr>  <chr>           <chr>    <dbl>      <dbl> <chr>                   <chr>                
##  1 SFB2   YNL049C         Glucose   0.05      -0.24 ER to Golgi transport   molecular function u~
##  2 NA     YNL095C         Glucose   0.05       0.28 biological process un~  molecular function u~
##  3 QRI7   YDL104C         Glucose   0.05      -0.02 proteolysis and pepti~  metalloendopeptidase~
##  4 CFT2   YLR115W         Glucose   0.05      -0.33 mRNA polyadenylylation* RNA binding          
##  5 SSO2   YMR183C         Glucose   0.05       0.05 vesicle fusion*         t-SNARE activity     
##  6 PSP2   YML017W         Glucose   0.05      -0.69 biological process un~  molecular function u~
##  7 RIB2   YOL066C         Glucose   0.05      -0.55 riboflavin biosynthesis pseudouridylate synt~
##  8 VMA13  YPR036W         Glucose   0.05      -0.75 vacuolar acidification  hydrogen-transportin~
##  9 EDC3   YEL015W         Glucose   0.05      -0.24 deadenylylation-indep~  molecular function u~
## 10 VPS5   YOR069W         Glucose   0.05      -0.16 protein retention in ~  protein transporter ~
## # ... with 198,420 more rows
The dplyr package is a relatively new R package that makes data manipulation fast and easy. It imports functionality from another package called magrittr that allows you to chain commands together into a pipeline that will completely change the way you write R code such that you’re writing code the way you’re thinking about the problem.
When you read in data with the readr package (read_csv()) and you already have the dplyr package loaded, the data frame takes on this "special" class of data frames called a tbl (pronounced "tibble"), which you can see with class(ydat). If you have other "regular" data frames in your workspace, the as_tibble() function will convert them into the special dplyr tbl that displays nicely (e.g.: iris <- as_tibble(iris)). You don't have to turn all your data frame objects into tibbles, but it does make working with large datasets a bit easier.
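For instance, a minimal check (a sketch; the exact class names printed can vary with your readr and tibble versions):
# Tibbles are still data frames underneath
class(ydat)
# Convert a regular built-in data frame to a tibble and check again
iris <- as_tibble(iris)
class(iris)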
You can read more about tibbles in the Tibbles chapter in R for Data Science or in the tibbles vignette. They keep most of the features of data frames, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors). You can read more about the differences between data frames and tibbles in this section of the tibbles vignette, but the major convenience for us concerns printing (aka displaying) a tibble to the screen. When you print (i.e., display) a tibble, it only shows the first 10 rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type. You can control the default appearance with options:

options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.

The dplyr package gives you a handful of useful verbs for managing data. On their own they don't do anything that base R can't do. Here are some of the single-table verbs we'll be working with in this lesson (single-table meaning that they only work on a single table – contrast that to two-table verbs used for joining data together, which we'll cover in a later lesson).
filter()
select()
mutate()
arrange()
summarize()
group_by()
They all take a data frame or tibble as their input for the first argument, and they all return a data frame or tibble as output.
If you want to filter rows of the data where some condition is true, use the filter() function.

The first argument is the data you want to filter, e.g. filter(mydata, ...).
The second argument is a condition that must be satisfied, e.g. filter(ydat, symbol == "LEU1"). If you want to satisfy all of multiple conditions, you can use the "and" operator, &. The "or" operator | (the pipe character, usually shift-backslash) will return a subset that meets any of the conditions.

==: Equal to
!=: Not equal to
>, >=: Greater than, greater than or equal to
<, <=: Less than, less than or equal to

Let's try it out. For this to work you have to have already loaded the dplyr package. Let's take a look at LEU1, a gene involved in leucine synthesis.
# First, make sure you've loaded the dplyr package
library(dplyr)
# Look at a single gene involved in leucine synthesis pathway
filter(ydat, symbol == "LEU1")
# Optionally, bring that result up in a View window
# View(filter(ydat, symbol == "LEU1"))
# Look at multiple genes
filter(ydat, symbol=="LEU1" | symbol=="ADH2")
# Look at LEU1 expression at a low growth rate due to nutrient depletion
# Notice how LEU1 is highly upregulated when leucine is depleted!
filter(ydat, symbol=="LEU1" & rate==.05)
# But expression goes back down when the growth/nutrient restriction is relaxed
filter(ydat, symbol=="LEU1" & rate==.3)
# Show only stats for LEU1 and Leucine depletion.
# LEU1 expression starts off high and drops
filter(ydat, symbol=="LEU1" & nutrient=="Leucine")
# What about LEU1 expression with other nutrients being depleted?
filter(ydat, symbol=="LEU1" & nutrient=="Glucose")
Let's look at this graphically. Don't worry about what these commands are doing just yet - we'll cover that later on when we talk about ggplot2. Here I'm taking the filtered dataset containing just expression estimates for LEU1 where I have 36 rows (one for each of 6 nutrients \(\times\) 6 growth rates), and I'm piping that dataset to the plotting function, where I'm plotting rate on the x-axis, expression on the y-axis, mapping the value of nutrient to the color, and using a line plot to display the data.
library(ggplot2)
filter(ydat, symbol=="LEU1") %>%
ggplot(aes(rate, expression, colour=nutrient)) + geom_line(lwd=1.5)
Look closely at that! LEU1 is highly expressed when starved of leucine because the cell has to synthesize its own! And as the amount of leucine in the environment (the growth rate) increases, the cell can worry less about synthesizing leucine, so LEU1 expression goes back down. Consequently the cell can devote more energy to other functions, and we see other genes' expression very slightly rising.
EXERCISE 1
1. Display the data where the gene ontology biological process (the bp variable) is "leucine biosynthesis" (case-sensitive) and the limiting nutrient was Leucine. (Answer should return a 24-by-7 data frame – 4 genes \(\times\) 6 growth rates).
2. Display the data where the expression value is higher than the 99th percentile of all expression values. Hint: see ?quantile and try quantile(ydat$expression, probs=.99) to see the expression value which is higher than 99% of all the data, then filter() based on that. Try wrapping your answer with a View() function so you can see the whole thing. What does it look like those genes are doing? Answer should return a 1971-by-7 data frame.
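One possible approach, sketched with the functions covered so far (not the only way to do it):
# 1. Genes annotated "leucine biosynthesis" where the limiting nutrient is Leucine
filter(ydat, bp == "leucine biosynthesis" & nutrient == "Leucine")
# 2. Find the 99th percentile of expression, then filter for values above it
quantile(ydat$expression, probs = .99)
filter(ydat, expression > quantile(expression, probs = .99))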
What we've done up to this point is read in data from a file (read_csv(...)), and assigned that to an object in our workspace (ydat <- ...). When we run operations like filter() on our data, consider two things:

1. The ydat object in our workspace is not being modified directly. That is, we can filter(ydat, ...), and a result is returned to the screen, but ydat remains the same. This effect is similar to what we demonstrated in our first session.

# Assign the value '50' to the weight object.
weight <- 50
# Print out weight to the screen (50)
weight
# What's the value of weight plus 10?
weight + 10
# Weight is still 50
weight
# Weight is only modified if we *reassign* weight to the modified value
weight <- weight+10
# Weight is now 60
weight
2. The data file on disk (data/brauer2007_tidy.csv) is never modified. No matter what we do to ydat, the file is never modified. If we want to save the result of an operation to a file on disk, we can assign the result of an operation to an object, and write_csv that object to disk. See the help for ?write_csv (note, write_csv() with an underscore is part of the readr package – not to be confused with the built-in write.csv() function).

# What's the result of this filter operation?
filter(ydat, nutrient=="Leucine" & bp=="leucine biosynthesis")
# Assign the result to a new object
leudat <- filter(ydat, nutrient=="Leucine" & bp=="leucine biosynthesis")
# Write that out to disk
write_csv(leudat, "leucinedata.csv")
Note that this is different than saving your entire workspace to an Rdata file, which would contain all the objects we’ve created (weight, ydat, leudat, etc).
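For comparison, a minimal sketch of saving objects to an .Rdata file (the file names here are just examples):
# Save specific objects to a single .Rdata file
save(ydat, leudat, file = "myobjects.Rdata")
# Or save the entire workspace
save.image(file = "everything.Rdata")
# Either file can be restored later with load(), e.g. load("myobjects.Rdata")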
The filter()
function allows you to return only certain
rows matching a condition. The select()
function
returns only certain columns. The first argument is the data,
and subsequent arguments are the columns you want.
# Select just the symbol and systematic_name
select(ydat, symbol, systematic_name)
# Alternatively, just remove columns. Remove the bp and mf columns.
select(ydat, -bp, -mf)
# Notice that the original data doesn't change!
ydat
Notice above how the original data doesn’t change. We’re selecting
out only certain columns of interest and throwing away columns we don’t
care about. If we wanted to keep this data, we would need to
reassign the result of the select()
operation to a
new object. Let’s make a new object called nogo
that does
not contain the GO annotations. Notice again how the original data is
unchanged.
# Create a new dataset without the GO annotations.
nogo <- select(ydat, -bp, -mf)
nogo
# we could filter this new dataset
filter(nogo, symbol=="LEU1" & rate==.05)
# Notice how the original data is unchanged - still have all 7 columns
ydat
The mutate() function adds new columns to the data. Remember, it doesn't actually modify the data frame you're operating on, and the result is transient unless you assign it to a new object or reassign it back to itself (generally not a good practice).
The expression level reported here is the \(\log_2\) of the sample signal divided by the signal in the reference channel, where the reference RNA for all samples was taken from the glucose-limited chemostat grown at a dilution rate of 0.25 \(h^{-1}\). Let's mutate this data to add a new variable called "signal" that's the actual raw signal ratio instead of the log-transformed signal.
mutate(nogo, signal=2^expression)
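Just to connect the numbers (illustrative arithmetic, not part of the original analysis): an expression value of -1 means the raw signal was half the reference, 0 means equal to the reference, and +1 means double.
# The raw ratio is 2 raised to the log2 expression value
2^c(-1, 0, 1)
## [1] 0.5 1.0 2.0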
Mutate has a nice little feature too in that it’s “lazy.” You can mutate and add one variable, then continue mutating to add more variables based on that variable. Let’s make another column that’s the square root of the signal ratio.
mutate(nogo, signal=2^expression, sigsr=sqrt(signal))
Again, don’t worry about the code here to make the plot – we’ll learn about this later. Why do you think we log-transform the data prior to analysis?
library(tidyr)
mutate(nogo, signal=2^expression, sigsr=sqrt(signal)) %>%
gather(unit, value, expression:sigsr) %>%
ggplot(aes(value)) + geom_histogram(bins=100) + facet_wrap(~unit, scales="free")
The arrange()
function does what it sounds like. It
takes a data frame or tbl and arranges (or sorts) by column(s) of
interest. The first argument is the data, and subsequent arguments are
columns to sort on. Use the desc()
function to arrange by
descending.
# arrange by gene symbol
arrange(ydat, symbol)
# arrange by expression (default: increasing)
arrange(ydat, expression)
# arrange by decreasing expression
arrange(ydat, desc(expression))
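arrange() can also sort on more than one column at a time. A quick sketch (not shown in the examples above):
# Sort by gene symbol, breaking ties by decreasing expression within each symbol
arrange(ydat, symbol, desc(expression))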
EXERCISE 2
1. First, re-run the command you used above to filter the data for genes in the "leucine biosynthesis" biological process where the limiting nutrient is Leucine.
2. Wrap this entire command in a call to arrange() where you'll arrange the result of #1 by the gene symbol.
3. Wrap that whole operation in a View() statement so you can see the entire result.
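A sketch of one possible solution, assuming the same filter used in Exercise 1:
# Filter, arrange the result by symbol, and send the whole thing to the viewer
View(arrange(filter(ydat, bp == "leucine biosynthesis" & nutrient == "Leucine"), symbol))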
The summarize()
function summarizes multiple values to a
single value. On its own the summarize()
function doesn’t
seem to be all that useful. The dplyr package provides a few convenience
functions called n()
and n_distinct()
that
tell you the number of observations or the number of distinct values of
a particular variable.
Notice that summarize takes a data frame and returns a data frame. In this case it's a 1x1 data frame with a single row and a single column. The name of the column, by default, is whatever expression was used to summarize the data. This usually isn't pretty, and if we wanted to work with this resulting data frame later on, we'd want to name that returned value something easier to deal with.
# Get the mean expression for all genes
summarize(ydat, mean(expression))
# Use a more friendly name, e.g., meanexp, or whatever you want to call it.
summarize(ydat, meanexp=mean(expression))
# Measure the correlation between rate and expression
summarize(ydat, r=cor(rate, expression))
# Get the number of observations
summarize(ydat, n())
# The number of distinct gene symbols in the data
summarize(ydat, n_distinct(symbol))
We saw that summarize() isn't that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame where operations are performed by group.
# The original (ungrouped) data
ydat
# Grouped by nutrient
group_by(ydat, nutrient)
# Grouped by nutrient, then by rate
group_by(ydat, nutrient, rate)
The real power comes in where group_by()
and
summarize()
are used together. First, write the
group_by()
statement. Then wrap the result of that with a
call to summarize()
.
# Get the mean expression for each gene
# group_by(ydat, symbol)
summarize(group_by(ydat, symbol), meanexp=mean(expression))
# Get the correlation between rate and expression for each nutrient
# group_by(ydat, nutrient)
summarize(group_by(ydat, nutrient), r=cor(rate, expression))
This is where things get awesome. The dplyr package imports
functionality from the magrittr package that
lets you pipe the output of one function to the input of
another, so you can avoid nesting functions. It looks like this:
%>%
. You don’t have to load the
magrittr package to use it since dplyr imports its functionality when
you load the dplyr package.
Here’s the simplest way to use it. Remember the tail()
function. It expects a data frame as input, and the next argument is the
number of lines to print. These two commands are identical:
tail(ydat, 5)
ydat %>% tail(5)
Let’s use one of the dplyr verbs.
filter(ydat, nutrient=="Leucine")
ydat %>% filter(nutrient=="Leucine")
So what?
Now, think about this for a minute. What if we wanted to get the correlation between the growth rate and expression separately for each limiting nutrient only for genes in the leucine biosynthesis pathway, and return a sorted list of those correlation coefficients rounded to two digits? Mentally we would do something like this:
Take the ydat dataset
then filter() it for genes in the leucine biosynthesis pathway
then group_by() the limiting nutrient
then summarize() to get the correlation (cor()) between rate and expression
then mutate() to round the result of the above calculation to two significant digits
then arrange() by the rounded correlation coefficient above

But in code, it gets ugly. First, take the ydat dataset
ydat
then filter()
it for genes in the leucine
biosynthesis pathway
filter(ydat, bp=="leucine biosynthesis")
then group_by()
the limiting nutrient
group_by(filter(ydat, bp=="leucine biosynthesis"), nutrient)
then summarize()
to get the correlation
(cor()
) between rate and expression
summarize(group_by(filter(ydat, bp == "leucine biosynthesis"), nutrient), r = cor(rate,
expression))
then mutate()
to round the result of the above
calculation to two significant digits
mutate(summarize(group_by(filter(ydat, bp == "leucine biosynthesis"), nutrient),
r = cor(rate, expression)), r = round(r, 2))
then arrange()
by the rounded correlation
coefficient above
arrange(
mutate(
summarize(
group_by(
filter(ydat, bp=="leucine biosynthesis"),
nutrient),
r=cor(rate, expression)),
r=round(r, 2)),
r)
Now compare that with the mental process of what you’re actually
trying to accomplish. The way you would do this without pipes is
completely inside-out and backwards from the way you express in words
and in thought what you want to do. The pipe operator
%>%
allows you to pass the output data frame from one
function to the input data frame to another function.
This is how we would do that in code. It's as simple as replacing the word "then" with the %>% symbol in code. (There's
a keyboard shortcut that I’ll use frequently to insert the
%>%
sequence – you can see what it is by clicking the
Tools menu in RStudio, then selecting Keyboard Shortcut
Help. On Mac, it’s CMD-SHIFT-M.)
ydat %>%
filter(bp=="leucine biosynthesis") %>%
group_by(nutrient) %>%
summarize(r=cor(rate, expression)) %>%
mutate(r=round(r,2)) %>%
arrange(r)
EXERCISE 3
Here’s a warm-up round. Try the following.
Show the limiting nutrient and expression values for the gene ADH2
when the growth rate is restricted to 0.05. Hint: 2 pipes:
filter
and select
.
ydat %>% filter(symbol=="ADH2" & rate==0.05) %>% select(nutrient, expression)
What are the four most highly expressed genes when the growth rate is
restricted to 0.05 by restricting glucose? Show only the symbol,
expression value, and GO terms. Hint: 4 pipes:
filter
, arrange
, head
, and
select
.
ydat %>%
filter(nutrient=="Glucose" & rate==.05) %>%
arrange(desc(expression)) %>%
head(4) %>%
select(symbol, expression, bp, mf)
When the growth rate is restricted to 0.05, what is the average
expression level across all genes in the “response to stress” biological
process, separately for each limiting nutrient? What about genes in the
“protein biosynthesis” biological process? Hint: 3 pipes:
filter
, group_by
, summarize
.
ydat %>%
filter(rate==.05 & bp=="response to stress") %>%
group_by(nutrient) %>%
summarize(meanexp=mean(expression))
ydat %>%
filter(rate==.05 & bp=="protein biosynthesis") %>%
group_by(nutrient) %>%
summarize(meanexp=mean(expression))
EXERCISE 4
That was easy, right? How about some tougher ones.
First, some review. How do we see the number of distinct values of a
variable? Use n_distinct()
within a
summarize()
call.
ydat %>% summarize(n_distinct(mf))
Which 10 biological process annotations have the most genes
associated with them? What about molecular functions? Hint: 4
pipes: group_by
, summarize
with
n_distinct
, arrange
, head
.
ydat %>%
group_by(bp) %>%
summarize(n=n_distinct(symbol)) %>%
arrange(desc(n)) %>%
head(10)
ydat %>%
group_by(mf) %>%
summarize(n=n_distinct(symbol)) %>%
arrange(desc(n)) %>%
head(10)
How many distinct genes are there where we know what process the gene
is involved in but we don’t know what it does? Hint: 3 pipes;
filter
where
bp!="biological process unknown" & mf=="molecular function unknown"
,
and after select
ing columns of interest, pipe the output to
distinct()
. The answer should be 737, and the code below displays a few of them:
ydat %>%
filter(bp!="biological process unknown" & mf=="molecular function unknown") %>%
select(symbol, bp, mf) %>%
distinct()
When the growth rate is restricted to 0.05 by limiting Glucose, which
biological processes are the most upregulated? Show a sorted list with
the most upregulated BPs on top, displaying the biological process and
the average expression of all genes in that process rounded to two
digits. Hint: 5 pipes: filter
,
group_by
, summarize
, mutate
,
arrange
.
ydat %>%
filter(nutrient=="Glucose" & rate==.05) %>%
group_by(bp) %>%
summarize(meanexp=mean(expression)) %>%
mutate(meanexp=round(meanexp, 2)) %>%
arrange(desc(meanexp))
Group the data by limiting nutrient (primarily) then by biological
process. Get the average expression for all genes annotated with each
process, separately for each limiting nutrient, where the growth rate is
restricted to 0.05. Arrange the result to show the most upregulated
processes on top. The initial result will look like what the code below produces.
Pipe this output to a View()
statement. What’s going on?
Why didn’t the arrange()
work? Hint: 5 pipes:
filter
, group_by
, summarize
,
arrange
, View
.
ydat %>%
filter(rate==0.05) %>%
group_by(nutrient, bp) %>%
summarize(meanexp=mean(expression)) %>%
arrange(desc(meanexp))
Let’s try to further process that result to get only the top three
most upregulated biological processes for each limiting nutrient. Google
search “dplyr first result within group.” You’ll need a
filter(row_number()......)
in there somewhere.
Hint: 5 pipes: filter
, group_by
,
summarize
, arrange
,
filter(row_number()...
. Note: dplyr’s pipe syntax
used to be %.%
before it changed to %>%
. So
when looking around, you might still see some people use the old syntax.
Now if you try to use the old syntax, you’ll get a deprecation
warning.
ydat %>%
filter(rate==0.05) %>%
group_by(nutrient, bp) %>%
summarize(meanexp=mean(expression)) %>%
arrange(desc(meanexp)) %>%
filter(row_number()<=3)