10 Data Frames
Almost every dataset you will meet in genomics arrives as a table: genes down the side, samples across the top, measurements in the middle, and a few columns of annotation bolted on. In R, the structure built for exactly this shape is the data.frame. Get comfortable with it and most of the rest of the book — loading expression data, filtering genes, summarizing by condition — becomes variations on a theme you already know.
A data.frame looks a bit like an R matrix in that it has two dimensions, rows and columns. The difference that matters is this: a matrix must hold a single data type, while a data.frame may hold a different data type in each column — numbers in one, gene names in another, an experimental condition in a third. That is exactly what a real dataset looks like, which is why the data.frame, not the matrix, is the workhorse for tabular data in R.
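A minimal sketch of that difference, using two made-up rows (the object names `m` and `df` are only for illustration): forcing numbers and text into one matrix coerces everything to text, while a data.frame keeps the numeric column numeric.

```r
# A matrix holds a single type: mixing text and numbers coerces all to character.
m <- cbind(gene = c("SFB2", "QRI7"), expression = c(-0.24, -0.02))
class(m[, "expression"])   # the numbers have become text

# A data.frame stores each column with its own type.
df <- data.frame(gene = c("SFB2", "QRI7"), expression = c(-0.24, -0.02))
sapply(df, class)          # expression stays numeric

# Arithmetic only works where the numbers survived as numbers.
mean(df$expression)
```

Calling `mean(m[, "expression"])` on the matrix version would return `NA` with a warning, because the values are now character strings.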
10.1 What you’ll learn
By the end of this chapter you will be able to:
- Explain how a data.frame differs from a matrix, and why that difference matters.
- Load tabular data from a file or URL with read.csv().
- Inspect a data frame with head(), dim(), summary(), and friends.
- Pull out columns and rows by position, name, and logical condition.
- Summarize measurements by category with aggregate().
- Build a data.frame from scratch and save it back to disk.
We’ll do all of this on a real yeast gene-expression dataset.
10.2 Dataset
The data used here are borrowed directly from the fantastic Bioconnector tutorials and are a cleaned-up version of the data from Brauer et al. Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol Cell 19:352-367. These data are from a gene expression microarray, and in this paper the authors examine the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast a rich medium loaded with nutrients but restrict the supply of a single nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes that:
- Raise or lower their expression in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress. The authors found that expression of >25% of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. They also found that the subset of negatively growth-correlated genes is enriched for peroxisomal functions, and positively correlated genes mainly encode ribosomal functions.
- Respond differently when different nutrients are being limited. If you see particular genes that respond very differently when a nutrient is sharply restricted, these genes might be involved in the transport or metabolism of that specific nutrient.
The dataset can be downloaded directly from:
We are going to read this dataset into R and then use it as a playground for learning about data.frames.
10.3 Reading in data
R has many capabilities for reading in data. Many of the functions have names that help us to understand what data format is to be expected. In this case, the filename that we want to read ends in .csv, meaning comma-separated-values. The read.csv() function reads in .csv files. As usual, it is worth reading help('read.csv') to get a better sense of the possible bells-and-whistles.
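To see the basic call pattern without any network access, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names `csv_path` and `toy` are hypothetical stand-ins, not the chapter's dataset.

```r
# Write a three-line CSV (header plus two rows) to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol,rate,expression",
             "SFB2,0.05,-0.24",
             "QRI7,0.05,-0.02"),
           csv_path)

# By default read.csv() expects a header row and comma-separated fields.
toy <- read.csv(csv_path)
class(toy)
dim(toy)

# Remove the temporary file again.
unlink(csv_path)
```

The same call works unchanged when the first argument is a URL string instead of a local path.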
The read.csv() function can read directly from a URL, so we do not need to download the file first. This dataset is relatively large (about 16 MB), so reading it may take a moment depending on your network connection speed.
Once the result of read.csv() has been assigned to a variable named ydat, that variable “contains” the downloaded and parsed data. We can check what data type read.csv() gave us:
class(ydat)
[1] "data.frame"
R reports "data.frame", so read.csv() did what its name promises: it turned a table of text into the structure we want to work with.
10.4 Inspecting data.frames
Your ydat variable is a data.frame. The dataset is fairly large, so you will not be able to look at it all at once on the screen. Fortunately, R gives you many tools to inspect a data.frame a piece at a time.
- Overviews of content
  - head() to show the first few rows
  - tail() to show the last few rows
- Size
  - dim() for dimensions (rows, columns)
  - nrow()
  - ncol()
  - object.size() for power users interested in the memory used to store an object
- Data and attribute summaries
  - colnames() to get the names of the columns
  - rownames() to get the “names” of the rows; these may not be present
  - summary() to get per-column summaries of the data in the data.frame
head(ydat)
  symbol systematic_name nutrient rate expression
1 SFB2 YNL049C Glucose 0.05 -0.24
2 <NA> YNL095C Glucose 0.05 0.28
3 QRI7 YDL104C Glucose 0.05 -0.02
4 CFT2 YLR115W Glucose 0.05 -0.33
5 SSO2 YMR183C Glucose 0.05 0.05
6 PSP2 YML017W Glucose 0.05 -0.69
bp
1 ER to Golgi transport
2 biological process unknown
3 proteolysis and peptidolysis
4 mRNA polyadenylylation*
5 vesicle fusion*
6 biological process unknown
mf
1 molecular function unknown
2 molecular function unknown
3 metalloendopeptidase activity
4 RNA binding
5 t-SNARE activity
6 molecular function unknown
tail(ydat)
  symbol systematic_name nutrient rate expression
198425 DOA1 YKL213C Uracil 0.3 0.14
198426 KRE1 YNL322C Uracil 0.3 0.28
198427 MTL1 YGR023W Uracil 0.3 0.27
198428 KRE9 YJL174W Uracil 0.3 0.43
198429 UTH1 YKR042W Uracil 0.3 0.19
198430 <NA> YOL111C Uracil 0.3 0.04
bp
198425 ubiquitin-dependent protein catabolism*
198426 cell wall organization and biogenesis
198427 cell wall organization and biogenesis
198428 cell wall organization and biogenesis*
198429 mitochondrion organization and biogenesis*
198430 biological process unknown
mf
198425 molecular function unknown
198426 structural constituent of cell wall
198427 molecular function unknown
198428 molecular function unknown
198429 molecular function unknown
198430 molecular function unknown
dim(ydat)
[1] 198430 7
nrow(ydat)
[1] 198430
ncol(ydat)
[1] 7
colnames(ydat)
[1] "symbol" "systematic_name" "nutrient"
[4] "rate" "expression" "bp"
[7] "mf"
summary(ydat)
  symbol systematic_name nutrient
Length :198430 Length :198430 Length :198430
N.unique : 4210 N.unique : 5536 N.unique : 6
N.blank : 0 N.blank : 0 N.blank : 0
Min.nchar: 2 Min.nchar: 5 Min.nchar: 6
Max.nchar: 9 Max.nchar: 9 Max.nchar: 9
NAs : 47250
rate expression bp
Min. :0.0500 Min. :-6.500000 Length :198430
1st Qu.:0.1000 1st Qu.:-0.290000 N.unique : 880
Median :0.2000 Median : 0.000000 N.blank : 0
Mean :0.1752 Mean : 0.003367 Min.nchar: 7
3rd Qu.:0.2500 3rd Qu.: 0.290000 Max.nchar: 82
Max. :0.3000 Max. : 6.640000 NAs : 7663
mf
Length :198430
N.unique : 1085
N.blank : 0
Min.nchar: 11
Max.nchar: 125
NAs : 7663
The dim() output tells you the shape at a glance: 198,430 rows and 7 columns. summary() gives a per-column thumbnail: the range and quartiles for the numeric columns (rate and expression), and the length, number of unique values, and number of NAs for the character columns.
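Two of the tools in the list, rownames() and object.size(), are not demonstrated on ydat here. A minimal sketch on a small made-up table (the name `toy` and its three rows are hypothetical) shows what they return. str() is an extra base R function, not mentioned in the list, that prints a compact overview of every column.

```r
# A three-row stand-in for a much larger table.
toy <- data.frame(symbol = c("SFB2", "QRI7", "CFT2"),
                  rate = c(0.05, 0.05, 0.05),
                  expression = c(-0.24, -0.02, -0.33))

# Without explicit row names, R numbers the rows and reports them as text.
rownames(toy)

# Memory used by the object, in bytes and in a friendlier unit.
object.size(toy)
format(object.size(toy), units = "Kb")

# One line per column: name, type, and the first few values.
str(toy)
```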
View()
In RStudio there is one more inspection tool, View() (note the capital “V”), that opens the first 1000 rows in a spreadsheet-style window you can scroll and sort. It is a great way to get a feel for a dataset before you start writing code against it. We don’t run it here because it only works in interactive RStudio, not when the book is rendered.
View(ydat)
10.5 Accessing variables (columns) and subsetting
In R, data.frames can be subset similarly to other two-dimensional data structures. The [ in R is used to denote subsetting of any kind. When working with two-dimensional data, we need two values inside the [ ] to specify the details. The specification is [rows, columns]. For example, to get the first three rows of ydat, use:
ydat[1:3, ]
  symbol systematic_name nutrient rate expression
1 SFB2 YNL049C Glucose 0.05 -0.24
2 <NA> YNL095C Glucose 0.05 0.28
3 QRI7 YDL104C Glucose 0.05 -0.02
bp
1 ER to Golgi transport
2 biological process unknown
3 proteolysis and peptidolysis
mf
1 molecular function unknown
2 molecular function unknown
3 metalloendopeptidase activity
Note how the second value, the columns, is blank. R takes that to mean “all the columns”. Similarly, we can combine row and column specifications arbitrarily.
ydat[1:3, 1:3]
  symbol systematic_name nutrient
1 SFB2 YNL049C Glucose
2 <NA> YNL095C Glucose
3 QRI7 YDL104C Glucose
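Positions are not the only way to fill the two slots. The following sketch, on a small hypothetical table called `toy`, shows column names, negative positions, and the `drop` argument, all of which are standard behaviour of `[` on data.frames.

```r
toy <- data.frame(symbol = c("SFB2", "QRI7", "CFT2"),
                  nutrient = c("Glucose", "Glucose", "Glucose"),
                  expression = c(-0.24, -0.02, -0.33))

# Columns can be chosen by name instead of by position.
by_name <- toy[1:2, c("symbol", "expression")]

# A negative position drops that row or column.
no_first_col <- toy[, -1]

# Selecting a single column with [ , ] returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data.frame.
as_vector <- toy[, "expression"]
as_frame  <- toy[, "expression", drop = FALSE]
```

Choosing columns by name is usually the more robust habit, since the code keeps working if the column order changes.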
Because selecting a single variable, or column, is such a common operation, there are two shortcuts for doing so with data.frames. The first, the $ operator, works like so:
# Look at the column names, just to refresh memory
colnames(ydat)
[1] "symbol" "systematic_name" "nutrient"
[4] "rate" "expression" "bp"
[7] "mf"
# Note that I am using "head" here to limit the output
head(ydat$symbol)
[1] "SFB2" NA "QRI7" "CFT2" "SSO2" "PSP2"
# What is the actual length of "symbol"?
length(ydat$symbol)
[1] 198430
The second is related to the fact that, in R, data.frames are also lists. We subset a list by using [[]] notation. To get the second column of ydat, we can use:
head(ydat[[2]])
[1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C"
[6] "YML017W"
Alternatively, we can use the column name:
head(ydat[["systematic_name"]])
[1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C"
[6] "YML017W"
$ versus [[ ]]
Both pull out a single column as a plain vector, so why two notations? $ is the quick, everyday form when you know the column name as you type it (ydat$symbol). [[ ]] is the form to reach for when the column is chosen by a value — a number (ydat[[2]]) or a name stored in a variable (col <- "symbol"; ydat[[col]]), which $ cannot do. Reach for $ by default and [[ ]] when you’re programming with the column choice.
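A short sketch of the case where $ falls short, using a hypothetical two-row table `toy` and a variable named `col`:

```r
toy <- data.frame(symbol = c("SFB2", "QRI7"),
                  expression = c(-0.24, -0.02))

# The column to use is decided at run time and stored in a variable.
col <- "expression"

toy[[col]]   # looks up the column whose name is stored in `col`
toy$col      # NULL: `$` looks for a column literally called "col"

# Because a data.frame is a list of columns, [[ ]] also works inside a loop.
for (name in colnames(toy)) {
  cat(name, ":", class(toy[[name]]), "\n")
}
```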
10.6 Aggregating and exploring
There are a couple of columns that hold numeric values. You can confirm which by asking each column its class():
class(ydat$symbol)
[1] "character"
class(ydat$rate)
[1] "numeric"
class(ydat$expression)
[1] "numeric"
So symbol is a character column, while rate and expression are numeric. We’ll put those numeric columns to work in the exercises at the end of the chapter.
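Checking columns one at a time gets tedious on wider tables. As a sketch on a hypothetical three-column table `toy`, sapply() applies a function to every column at once, which works because a data.frame is a list of columns.

```r
toy <- data.frame(symbol = c("SFB2", "QRI7"),
                  rate = c(0.05, 0.05),
                  expression = c(-0.24, -0.02))

# Class of every column in one call.
sapply(toy, class)

# Names of just the numeric columns.
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols
```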
10.6.1 More advanced indexing and subsetting
We can use, for example, logical values (TRUE/FALSE) to subset data.frames. Here we keep only the rows where the symbol column equals LEU1:
head(ydat[ydat$symbol == 'LEU1', ])
  symbol systematic_name nutrient rate expression bp
NA <NA> <NA> <NA> NA NA <NA>
NA.1 <NA> <NA> <NA> NA NA <NA>
NA.2 <NA> <NA> <NA> NA NA <NA>
NA.3 <NA> <NA> <NA> NA NA <NA>
NA.4 <NA> <NA> <NA> NA NA <NA>
NA.5 <NA> <NA> <NA> NA NA <NA>
mf
NA <NA>
NA.1 <NA>
NA.2 <NA>
NA.3 <NA>
NA.4 <NA>
NA.5 <NA>
tail(ydat[ydat$symbol == 'LEU1', ])
  symbol systematic_name nutrient rate expression
NA.47244 <NA> <NA> <NA> NA NA
NA.47245 <NA> <NA> <NA> NA NA
NA.47246 <NA> <NA> <NA> NA NA
NA.47247 <NA> <NA> <NA> NA NA
NA.47248 <NA> <NA> <NA> NA NA
NA.47249 <NA> <NA> <NA> NA NA
bp mf
NA.47244 <NA> <NA>
NA.47245 <NA> <NA>
NA.47246 <NA> <NA>
NA.47247 <NA> <NA>
NA.47248 <NA> <NA>
NA.47249 <NA> <NA>
Notice the rows full of NA that came along for the ride. That is not a bug in your code — it is one of the most common surprises in R, and worth understanding once so it never trips you up again.
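When you filter with a condition like ydat$symbol == 'LEU1', any row whose symbol is NA yields NA for the test, not FALSE. R does not skip those rows; for each NA in the row index it returns a placeholder row filled with NA, which is where the phantom rows come from (47,250 of them here, one per missing symbol). The fix is to exclude the missing values explicitly with !is.na():
head(ydat[!is.na(ydat$symbol) & ydat$symbol == 'LEU1', ])
  symbol systematic_name nutrient rate expression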
1526 LEU1 YGL009C Glucose 0.05 -1.12
7043 LEU1 YGL009C Glucose 0.10 -0.77
12555 LEU1 YGL009C Glucose 0.15 -0.67
18071 LEU1 YGL009C Glucose 0.20 -0.59
23603 LEU1 YGL009C Glucose 0.25 -0.20
29136 LEU1 YGL009C Glucose 0.30 0.03
bp
1526 leucine biosynthesis
7043 leucine biosynthesis
12555 leucine biosynthesis
18071 leucine biosynthesis
23603 leucine biosynthesis
29136 leucine biosynthesis
mf
1526 3-isopropylmalate dehydratase activity
7043 3-isopropylmalate dehydratase activity
12555 3-isopropylmalate dehydratase activity
18071 3-isopropylmalate dehydratase activity
23603 3-isopropylmalate dehydratase activity
29136 3-isopropylmalate dehydratase activity
Now only the genuine LEU1 rows remain.
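The same behaviour is easy to reproduce on a table small enough to read in full. This is a minimal sketch with four made-up rows (`toy` is hypothetical). The last line uses which(), a base R function not otherwise covered in this chapter, as an alternative that also discards NA.

```r
toy <- data.frame(symbol = c("LEU1", NA, "SFB2", "LEU1"),
                  expression = c(-1.12, 0.28, -0.24, -0.77))

# Comparing NA with anything gives NA, not FALSE.
is_leu1 <- toy$symbol == "LEU1"
is_leu1

# Indexing with that vector returns an all-NA row for the NA entry.
with_na <- toy[is_leu1, ]

# Requiring a non-missing symbol removes the phantom row.
clean <- toy[!is.na(toy$symbol) & is_leu1, ]

# which() keeps only the positions that are TRUE, so NA is dropped too.
also_clean <- toy[which(is_leu1), ]
```

Here `with_na` has three rows (two real, one phantom), while `clean` and `also_clean` each have the two genuine LEU1 rows.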
Sometimes the rows themselves matter less than how many of them there are. Wrapping a subset in dim() reports the number of rows and columns that survive the filter.
dim(ydat[ydat$expression > 3, ])
[1] 714 7
Find the highly expressed genes when leucine-starved. For this task we can also use subset(), which lets us treat column names as R variables (no $ needed).
subset(ydat, nutrient == 'Leucine' & rate == 0.05 & expression > 3)
  symbol systematic_name nutrient rate expression
133768 QDR2 YIL121W Leucine 0.05 4.61
133772 LEU1 YGL009C Leucine 0.05 3.84
133858 BAP3 YDR046C Leucine 0.05 4.29
135186 <NA> YPL033C Leucine 0.05 3.43
135187 <NA> YLR267W Leucine 0.05 3.23
135288 HXT3 YDR345C Leucine 0.05 5.16
135963 TPO2 YGR138C Leucine 0.05 3.75
135965 YRO2 YBR054W Leucine 0.05 4.40
136102 GPG1 YGL121C Leucine 0.05 3.08
136109 HSP42 YDR171W Leucine 0.05 3.07
136119 HXT5 YHR096C Leucine 0.05 4.90
136151 <NA> YJL144W Leucine 0.05 3.06
136152 MOH1 YBL049W Leucine 0.05 3.43
136153 <NA> YBL048W Leucine 0.05 3.95
136189 HSP26 YBR072W Leucine 0.05 4.86
136231 NCA3 YJL116C Leucine 0.05 4.03
136233 <NA> YBR116C Leucine 0.05 3.28
136486 <NA> YGR043C Leucine 0.05 3.07
137443 ADH2 YMR303C Leucine 0.05 4.15
137448 ICL1 YER065C Leucine 0.05 3.54
137451 SFC1 YJR095W Leucine 0.05 3.72
137569 MLS1 YNL117W Leucine 0.05 3.76
bp
133768 multidrug transport
133772 leucine biosynthesis
133858 amino acid transport
135186 meiosis*
135187 biological process unknown
135288 hexose transport
135963 polyamine transport
135965 biological process unknown
136102 signal transduction
136109 response to stress*
136119 hexose transport
136151 response to dessication
136152 biological process unknown
136153 <NA>
136189 response to stress*
136231 mitochondrion organization and biogenesis
136233 <NA>
136486 biological process unknown
137443 fermentation*
137448 glyoxylate cycle
137451 fumarate transport*
137569 glyoxylate cycle
mf
133768 multidrug efflux pump activity
133772 3-isopropylmalate dehydratase activity
133858 amino acid transporter activity
135186 molecular function unknown
135187 molecular function unknown
135288 glucose transporter activity*
135963 spermine transporter activity
135965 molecular function unknown
136102 signal transducer activity
136109 unfolded protein binding
136119 glucose transporter activity*
136151 molecular function unknown
136152 molecular function unknown
136153 <NA>
136189 unfolded protein binding
136231 molecular function unknown
136233 <NA>
136486 transaldolase activity
137443 alcohol dehydrogenase activity
137448 isocitrate lyase activity
137451 succinate:fumarate antiporter activity
137569 malate synthase activity
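For comparison, here is the same kind of filter written both ways on a small hypothetical table `toy`. The sketch also uses the `select` argument of subset(), which trims the columns in the same call.

```r
toy <- data.frame(symbol = c("QDR2", "LEU1", "SFB2"),
                  nutrient = c("Leucine", "Leucine", "Glucose"),
                  rate = c(0.05, 0.05, 0.05),
                  expression = c(4.61, 3.84, -0.24))

# Bracket form: every column needs the toy$ prefix.
a <- toy[toy$nutrient == "Leucine" & toy$expression > 3, ]

# subset() form: bare column names, plus `select` to keep only some columns.
b <- subset(toy, nutrient == "Leucine" & expression > 3,
            select = c(symbol, expression))
```

One further difference worth knowing: subset() drops rows where the condition evaluates to NA, so it does not produce the phantom NA rows that plain bracket indexing can.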
10.7 Aggregating data
Aggregating data, or summarizing by category, is a common way to look for trends or differences in measurements between categories. Use aggregate to find the mean expression by gene symbol.
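head(aggregate(ydat$expression, by = list(ydat$symbol), FUN = mean))
  Group.1 x
1 AAC1 0.52888889
2 AAC3 -0.21628571
3 AAD10 0.43833333
4 AAD14 -0.07166667
5 AAD16 0.24194444
6 AAD4 -0.79166667
head(aggregate(expression ~ symbol, data = ydat, FUN = mean))
  symbol expression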
1 AAC1 0.52888889
2 AAC3 -0.21628571
3 AAD10 0.43833333
4 AAD14 -0.07166667
5 AAD16 0.24194444
6 AAD4 -0.79166667
Both calls do the same thing: collapse the many measurements for each gene into a single mean expression value, one row per symbol. The formula form (expression ~ symbol) reads as “expression by symbol” and is usually the easier to type and to read.
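The two call styles can be checked by hand on a table small enough to average mentally. This is a sketch with made-up values; `toy`, `by_list`, and `by_formula` are hypothetical names.

```r
toy <- data.frame(symbol = c("AAC1", "AAC1", "AAC3", "AAC3"),
                  expression = c(0.4, 0.6, -0.1, -0.3))

# Vector-and-list form: result columns are named Group.1 and x.
by_list <- aggregate(toy$expression, by = list(toy$symbol), FUN = mean)

# Formula form: result columns keep the original names.
by_formula <- aggregate(expression ~ symbol, data = toy, FUN = mean)

by_list
by_formula
```

Both results have one row per symbol, with means of 0.5 for AAC1 and -0.2 for AAC3. Only the column names differ.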
10.8 Creating a data.frame from scratch
Sometimes it is useful to combine related data into one object. For example, let’s simulate some data.
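The simulation code itself is not shown in this text, so the following is only a sketch of one way to produce two such variables. The group size of 50 and the group means of 4 and 2 are assumptions chosen for illustration, not values confirmed by the chapter.

```r
set.seed(42)   # fix the random number stream so results repeat

n <- 50        # assumed number of people per group

# A factor labelling each observation; smokers come first.
smoker <- factor(rep(c("smoker", "non-smoker"), each = n))

# Assumed group means (4 for smokers, 2 for non-smokers) plus
# standard normal noise for every observation.
risk <- rep(c(4, 2), each = n) + rnorm(2 * n)
```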
We set a seed first with set.seed(42) so the random draws are reproducible: anyone who runs the code will see the same numbers. We now have two related variables, risk and smoker. We can make a data.frame out of them:
smoker_risk <- data.frame(smoker = smoker, risk = risk)
head(smoker_risk)
  smoker risk
1 smoker 5.370958
2 smoker 3.435302
3 smoker 4.363128
4 smoker 4.632863
5 smoker 4.404268
6 smoker 3.893875
R also has plotting shortcuts that work with data.frames. Because smoker is a factor, plot() recognizes the risk ~ smoker formula and draws a boxplot of risk in each group:
plot(risk ~ smoker, data = smoker_risk)
As we built the data, smokers carry the higher simulated risk — and the boxplot shows exactly that.
10.9 Saving a data.frame
Once we have a data.frame of interest, we may want to save it. The most portable way to save a data.frame is to use one of the write functions. In this case, let’s save the data as a .csv file.
write.csv(smoker_risk, "smoker_risk.csv")
10.10 Exercises
These all use the yeast dataset, ydat, from earlier in the chapter.
- Which columns of ydat are numeric? Check rate and expression.
- Make a histogram of the expression values, then of the rate values.
- rate has only a handful of distinct values. Use table() to count how many rows fall at each rate. Which rate corresponds to the most nutrient-starved condition?
10.11 Summary
You can now load a real dataset into a data.frame, inspect its shape and contents, pull out the rows and columns you care about (by position, name, or a logical test), summarize values by category with aggregate(), and write results back to disk. These are the moves you’ll repeat on every dataset for the rest of the book.
10.12 Reflection
- Can you state one thing a data.frame can do that a matrix cannot?
- Given a column ydat$rate, can you write the subset that keeps only rows where rate < 0.1?
- When would you reach for aggregate() instead of subsetting?

