10  Data Frames

Published

June 1, 2024

Modified

June 2, 2026

Almost every dataset you will meet in genomics arrives as a table: genes down the side, samples across the top, measurements in the middle, and a few columns of annotation bolted on. In R, the structure built for exactly this shape is the data.frame. Get comfortable with it and most of the rest of the book — loading expression data, filtering genes, summarizing by condition — becomes variations on a theme you already know.

A data.frame looks a bit like an R matrix in that it has two dimensions, rows and columns. The difference that matters is this: a matrix must hold a single data type, while a data.frame may hold a different data type in each column — numbers in one, gene names in another, an experimental condition in a third. That is exactly what a real dataset looks like, which is why the data.frame, not the matrix, is the workhorse for tabular data in R.

10.1 What you’ll learn

By the end of this chapter you will be able to:

  • Explain how a data.frame differs from a matrix, and why that difference matters.
  • Load tabular data from a file or URL with read.csv().
  • Inspect a data frame with head(), dim(), summary(), and friends.
  • Pull out columns and rows by position, name, and logical condition.
  • Summarize measurements by category with aggregate().
  • Build a data.frame from scratch and save it back to disk.

We’ll do all of this on a real yeast gene-expression dataset.

10.2 Dataset

The data used here are borrowed directly from the fantastic Bioconnector tutorials and are a cleaned up version of the data from Brauer et al. Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol Cell 19:352-367. These data are from a gene expression microarray, and in this paper the authors examine the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast a rich media loaded with nutrients except restrict the supply of a single nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes that:

  1. Raise or lower their expression in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress. The authors found that expression of >25% of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. They also found that the subset of negatively growth-correlated genes is enriched for peroxisomal functions, and positively correlated genes mainly encode ribosomal functions.
  2. Respond differently when different nutrients are being limited. If you see particular genes that respond very differently when a nutrient is sharply restricted, these genes might be involved in the transport or metabolism of that specific nutrient.

The dataset can be downloaded directly from:

We are going to read this dataset into R and then use it as a playground for learning about data.frames.

10.3 Reading in data

R has many capabilities for reading in data. Many of the functions have names that help us to understand what data format is to be expected. In this case, the filename that we want to read ends in .csv, meaning comma-separated-values. The read.csv() function reads in .csv files. As usual, it is worth reading help('read.csv') to get a better sense of the possible bells-and-whistles.

The read.csv() function can read directly from a URL, so we do not need to download the file directly. This dataset is relatively large (about 16MB), so this may take a bit depending on your network connection speed.

options(width=60)
url <- paste0(
    'https://raw.githubusercontent.com',
    '/bioconnector/workshops/master/data/brauer2007_tidy.csv'
)
ydat <- read.csv(url)

Our variable, ydat, now “contains” the downloaded and read data. We can check to see what data type read.csv gave us:

class(ydat)
[1] "data.frame"

R reports "data.frame", so read.csv() did what its name promises: it turned a table of text into the structure we want to work with.

10.4 Inspecting data.frames

Your ydat variable is a data.frame. The dataset is fairly large, so you will not be able to look at it all at once on the screen. Fortunately, R gives you many tools to inspect a data.frame a piece at a time.

  • Overviews of content
    • head() to show first few rows
    • tail() to show last few rows
  • Size
  • Data and attribute summaries
    • colnames() to get the names of the columns
    • rownames() to get the “names” of the rows–may not be present
    • summary() to get per-column summaries of the data in the data.frame.
head(ydat)
  symbol systematic_name nutrient rate expression
1   SFB2         YNL049C  Glucose 0.05      -0.24
2   <NA>         YNL095C  Glucose 0.05       0.28
3   QRI7         YDL104C  Glucose 0.05      -0.02
4   CFT2         YLR115W  Glucose 0.05      -0.33
5   SSO2         YMR183C  Glucose 0.05       0.05
6   PSP2         YML017W  Glucose 0.05      -0.69
                            bp
1        ER to Golgi transport
2   biological process unknown
3 proteolysis and peptidolysis
4      mRNA polyadenylylation*
5              vesicle fusion*
6   biological process unknown
                             mf
1    molecular function unknown
2    molecular function unknown
3 metalloendopeptidase activity
4                   RNA binding
5              t-SNARE activity
6    molecular function unknown
tail(ydat)
       symbol systematic_name nutrient rate expression
198425   DOA1         YKL213C   Uracil  0.3       0.14
198426   KRE1         YNL322C   Uracil  0.3       0.28
198427   MTL1         YGR023W   Uracil  0.3       0.27
198428   KRE9         YJL174W   Uracil  0.3       0.43
198429   UTH1         YKR042W   Uracil  0.3       0.19
198430   <NA>         YOL111C   Uracil  0.3       0.04
                                               bp
198425    ubiquitin-dependent protein catabolism*
198426      cell wall organization and biogenesis
198427      cell wall organization and biogenesis
198428     cell wall organization and biogenesis*
198429 mitochondrion organization and biogenesis*
198430                 biological process unknown
                                        mf
198425          molecular function unknown
198426 structural constituent of cell wall
198427          molecular function unknown
198428          molecular function unknown
198429          molecular function unknown
198430          molecular function unknown
dim(ydat)
[1] 198430      7
nrow(ydat)
[1] 198430
ncol(ydat)
[1] 7
colnames(ydat)
[1] "symbol"          "systematic_name" "nutrient"       
[4] "rate"            "expression"      "bp"             
[7] "mf"             
summary(ydat)
       symbol        systematic_name        nutrient     
 Length   :198430   Length   :198430   Length   :198430  
 N.unique :  4210   N.unique :  5536   N.unique :     6  
 N.blank  :     0   N.blank  :     0   N.blank  :     0  
 Min.nchar:     2   Min.nchar:     5   Min.nchar:     6  
 Max.nchar:     9   Max.nchar:     9   Max.nchar:     9  
 NAs      : 47250                                        
      rate          expression                bp        
 Min.   :0.0500   Min.   :-6.500000   Length   :198430  
 1st Qu.:0.1000   1st Qu.:-0.290000   N.unique :   880  
 Median :0.2000   Median : 0.000000   N.blank  :     0  
 Mean   :0.1752   Mean   : 0.003367   Min.nchar:     7  
 3rd Qu.:0.2500   3rd Qu.: 0.290000   Max.nchar:    82  
 Max.   :0.3000   Max.   : 6.640000   NAs      :  7663  
         mf        
 Length   :198430  
 N.unique :  1085  
 N.blank  :     0  
 Min.nchar:    11  
 Max.nchar:   125  
 NAs      :  7663  

The dim() output tells you the shape at a glance — tens of thousands of rows and a handful of columns — and summary() gives a per-column thumbnail: numeric ranges for the measurement columns and counts for the categorical ones.

NoteLook before you leap with View()

In RStudio there is one more inspection tool, View() (note the capital “V”), that opens the first 1000 rows in a spreadsheet-style window you can scroll and sort. It is a great way to get a feel for a dataset before you start writing code against it. We don’t run it here because it only works in interactive RStudio, not when the book is rendered.

View(ydat)

10.5 Accessing variables (columns) and subsetting

In R, data.frames can be subset similarly to other two-dimensional data structures. The [ in R is used to denote subsetting of any kind. When working with two-dimensional data, we need two values inside the [ ] to specify the details. The specification is [rows, columns]. For example, to get the first three rows of ydat, use:

ydat[1:3, ]
  symbol systematic_name nutrient rate expression
1   SFB2         YNL049C  Glucose 0.05      -0.24
2   <NA>         YNL095C  Glucose 0.05       0.28
3   QRI7         YDL104C  Glucose 0.05      -0.02
                            bp
1        ER to Golgi transport
2   biological process unknown
3 proteolysis and peptidolysis
                             mf
1    molecular function unknown
2    molecular function unknown
3 metalloendopeptidase activity

Note how the second number, the columns, is blank. R takes that to mean “all the columns”. Similarly, we can combine rows and columns specification arbitrarily.

ydat[1:3, 1:3]
  symbol systematic_name nutrient
1   SFB2         YNL049C  Glucose
2   <NA>         YNL095C  Glucose
3   QRI7         YDL104C  Glucose

Because selecting a single variable, or column, is such a common operation, there are two shortcuts for doing so with data.frames. The first, the $ operator works like so:

# Look at the column names, just to refresh memory
colnames(ydat)
[1] "symbol"          "systematic_name" "nutrient"       
[4] "rate"            "expression"      "bp"             
[7] "mf"             
# Note that I am using "head" here to limit the output
head(ydat$symbol)
[1] "SFB2" NA     "QRI7" "CFT2" "SSO2" "PSP2"
# What is the actual length of "symbol"?
length(ydat$symbol)
[1] 198430

The second is related to the fact that, in R, data.frames are also lists. We subset a list by using [[]] notation. To get the second column of ydat, we can use:

head(ydat[[2]])
[1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C"
[6] "YML017W"

Alternatively, we can use the column name:

head(ydat[["systematic_name"]])
[1] "YNL049C" "YNL095C" "YDL104C" "YLR115W" "YMR183C"
[6] "YML017W"
Note$ versus [[ ]]

Both pull out a single column as a plain vector, so why two notations? $ is the quick, everyday form when you know the column name as you type it (ydat$symbol). [[ ]] is the form to reach for when the column is chosen by a value — a number (ydat[[2]]) or a name stored in a variable (col <- "symbol"; ydat[[col]]), which $ cannot do. Reach for $ by default and [[ ]] when you’re programming with the column choice.

10.6 Aggregating and exploring

There are a couple of columns that hold numeric values. You can confirm which by asking each column its class():

class(ydat$symbol)
[1] "character"
class(ydat$rate)
[1] "numeric"
class(ydat$expression)
[1] "numeric"

So symbol is a character column, while rate and expression are numeric. We’ll put those numeric columns to work in the exercises at the end of the chapter.

10.6.1 More advanced indexing and subsetting

We can use, for example, logical values (TRUE/FALSE) to subset data.frames. Here we keep only the rows where the symbol column equals LEU1:

head(ydat[ydat$symbol == 'LEU1', ])
     symbol systematic_name nutrient rate expression   bp
NA     <NA>            <NA>     <NA>   NA         NA <NA>
NA.1   <NA>            <NA>     <NA>   NA         NA <NA>
NA.2   <NA>            <NA>     <NA>   NA         NA <NA>
NA.3   <NA>            <NA>     <NA>   NA         NA <NA>
NA.4   <NA>            <NA>     <NA>   NA         NA <NA>
NA.5   <NA>            <NA>     <NA>   NA         NA <NA>
       mf
NA   <NA>
NA.1 <NA>
NA.2 <NA>
NA.3 <NA>
NA.4 <NA>
NA.5 <NA>
tail(ydat[ydat$symbol == 'LEU1', ])
         symbol systematic_name nutrient rate expression
NA.47244   <NA>            <NA>     <NA>   NA         NA
NA.47245   <NA>            <NA>     <NA>   NA         NA
NA.47246   <NA>            <NA>     <NA>   NA         NA
NA.47247   <NA>            <NA>     <NA>   NA         NA
NA.47248   <NA>            <NA>     <NA>   NA         NA
NA.47249   <NA>            <NA>     <NA>   NA         NA
           bp   mf
NA.47244 <NA> <NA>
NA.47245 <NA> <NA>
NA.47246 <NA> <NA>
NA.47247 <NA> <NA>
NA.47248 <NA> <NA>
NA.47249 <NA> <NA>

Notice the rows full of NA that came along for the ride. That is not a bug in your code — it is one of the most common surprises in R, and worth understanding once so it never trips you up again.

WarningLogical subsetting and missing values

When you filter with a condition like ydat$symbol == 'LEU1', any row whose symbol is NA returns NA for the test — not FALSE — so R keeps it, and you get phantom rows full of NA. The fix is to exclude the missing values explicitly with !is.na():

head(ydat[ydat$symbol == 'LEU1' & !is.na(ydat$symbol), ])
      symbol systematic_name nutrient rate expression
1526    LEU1         YGL009C  Glucose 0.05      -1.12
7043    LEU1         YGL009C  Glucose 0.10      -0.77
12555   LEU1         YGL009C  Glucose 0.15      -0.67
18071   LEU1         YGL009C  Glucose 0.20      -0.59
23603   LEU1         YGL009C  Glucose 0.25      -0.20
29136   LEU1         YGL009C  Glucose 0.30       0.03
                        bp
1526  leucine biosynthesis
7043  leucine biosynthesis
12555 leucine biosynthesis
18071 leucine biosynthesis
23603 leucine biosynthesis
29136 leucine biosynthesis
                                          mf
1526  3-isopropylmalate dehydratase activity
7043  3-isopropylmalate dehydratase activity
12555 3-isopropylmalate dehydratase activity
18071 3-isopropylmalate dehydratase activity
23603 3-isopropylmalate dehydratase activity
29136 3-isopropylmalate dehydratase activity

Now only the genuine LEU1 rows remain.

Sometimes, looking at the data themselves is not that important. Using dim() is one possibility to look at the number of rows and columns after subsetting.

dim(ydat[ydat$expression > 3, ])
[1] 714   7

Find the high expressed genes when leucine-starved. For this task we can also use subset which allows us to treat column names as R variables (no $ needed).

subset(ydat, nutrient == 'Leucine' & rate == 0.05 & expression > 3)
       symbol systematic_name nutrient rate expression
133768   QDR2         YIL121W  Leucine 0.05       4.61
133772   LEU1         YGL009C  Leucine 0.05       3.84
133858   BAP3         YDR046C  Leucine 0.05       4.29
135186   <NA>         YPL033C  Leucine 0.05       3.43
135187   <NA>         YLR267W  Leucine 0.05       3.23
135288   HXT3         YDR345C  Leucine 0.05       5.16
135963   TPO2         YGR138C  Leucine 0.05       3.75
135965   YRO2         YBR054W  Leucine 0.05       4.40
136102   GPG1         YGL121C  Leucine 0.05       3.08
136109  HSP42         YDR171W  Leucine 0.05       3.07
136119   HXT5         YHR096C  Leucine 0.05       4.90
136151   <NA>         YJL144W  Leucine 0.05       3.06
136152   MOH1         YBL049W  Leucine 0.05       3.43
136153   <NA>         YBL048W  Leucine 0.05       3.95
136189  HSP26         YBR072W  Leucine 0.05       4.86
136231   NCA3         YJL116C  Leucine 0.05       4.03
136233   <NA>         YBR116C  Leucine 0.05       3.28
136486   <NA>         YGR043C  Leucine 0.05       3.07
137443   ADH2         YMR303C  Leucine 0.05       4.15
137448   ICL1         YER065C  Leucine 0.05       3.54
137451   SFC1         YJR095W  Leucine 0.05       3.72
137569   MLS1         YNL117W  Leucine 0.05       3.76
                                              bp
133768                       multidrug transport
133772                      leucine biosynthesis
133858                      amino acid transport
135186                                  meiosis*
135187                biological process unknown
135288                          hexose transport
135963                       polyamine transport
135965                biological process unknown
136102                       signal transduction
136109                       response to stress*
136119                          hexose transport
136151                   response to dessication
136152                biological process unknown
136153                                      <NA>
136189                       response to stress*
136231 mitochondrion organization and biogenesis
136233                                      <NA>
136486                biological process unknown
137443                             fermentation*
137448                          glyoxylate cycle
137451                       fumarate transport*
137569                          glyoxylate cycle
                                           mf
133768         multidrug efflux pump activity
133772 3-isopropylmalate dehydratase activity
133858        amino acid transporter activity
135186             molecular function unknown
135187             molecular function unknown
135288          glucose transporter activity*
135963          spermine transporter activity
135965             molecular function unknown
136102             signal transducer activity
136109               unfolded protein binding
136119          glucose transporter activity*
136151             molecular function unknown
136152             molecular function unknown
136153                                   <NA>
136189               unfolded protein binding
136231             molecular function unknown
136233                                   <NA>
136486                 transaldolase activity
137443         alcohol dehydrogenase activity
137448              isocitrate lyase activity
137451 succinate:fumarate antiporter activity
137569               malate synthase activity

10.7 Aggregating data

Aggregating data, or summarizing by category, is a common way to look for trends or differences in measurements between categories. Use aggregate to find the mean expression by gene symbol.

head(aggregate(ydat$expression, by = list(ydat$symbol), mean))
  Group.1           x
1    AAC1  0.52888889
2    AAC3 -0.21628571
3   AAD10  0.43833333
4   AAD14 -0.07166667
5   AAD16  0.24194444
6    AAD4 -0.79166667
# or
head(aggregate(expression ~ symbol, mean, data = ydat))
  symbol  expression
1   AAC1  0.52888889
2   AAC3 -0.21628571
3  AAD10  0.43833333
4  AAD14 -0.07166667
5  AAD16  0.24194444
6   AAD4 -0.79166667

Both calls do the same thing: collapse the many measurements for each gene into a single mean expression value, one row per symbol. The formula form (expression ~ symbol) reads as “expression by symbol” and is usually the easier to type and to read.

10.8 Creating a data.frame from scratch

Sometimes it is useful to combine related data into one object. For example, let’s simulate some data.

set.seed(42)
smoker <- factor(rep(c("smoker", "non-smoker"), each = 50))
smoker_numeric <- as.numeric(smoker)
x <- rnorm(100)
risk <- x + 2 * smoker_numeric

We set a seed first with set.seed(42) so the random draws are reproducible — you and a reader will see the same numbers. We now have two related variables, risk and smoker. We can make a data.frame out of them:

smoker_risk <- data.frame(smoker = smoker, risk = risk)
head(smoker_risk)
  smoker     risk
1 smoker 5.370958
2 smoker 3.435302
3 smoker 4.363128
4 smoker 4.632863
5 smoker 4.404268
6 smoker 3.893875

R also has plotting shortcuts that work with data.frames. Because smoker is a factor, plot() recognizes the risk ~ smoker formula and draws a boxplot of risk in each group:

plot(risk ~ smoker, data = smoker_risk)

As we built the data, smokers carry the higher simulated risk — and the boxplot shows exactly that.

10.9 Saving a data.frame

Once we have a data.frame of interest, we may want to save it. The most portable way to save a data.frame is to use one of the write functions. In this case, let’s save the data as a .csv file.

write.csv(smoker_risk, "smoker_risk.csv")

10.10 Exercises

These all use the yeast dataset, ydat, from earlier in the chapter.

  1. Which columns of ydat are numeric? Check rate and expression.
  2. Make a histogram of the expression values, then of the rate values.
  3. rate has only a handful of distinct values. Use table() to count how many rows fall at each rate. Which rate corresponds to the most nutrient-starved condition?
class(ydat$rate); class(ydat$expression)   # both numeric
[1] "numeric"
[1] "numeric"
hist(ydat$expression)

hist(ydat$rate)

table(ydat$rate)                            # the smallest rate = most starved

 0.05   0.1  0.15   0.2  0.25   0.3 
32741 33132 33145 33121 33177 33114 

Growth rate is the limiting nutrient dialed down, so the lowest rate (0.05) is the most starved condition.

10.11 Summary

You can now load a real dataset into a data.frame, inspect its shape and contents, pull out the rows and columns you care about (by position, name, or a logical test), summarize values by category with aggregate(), and write results back to disk. These are the moves you’ll repeat on every dataset for the rest of the book.

10.12 Reflection

  • Can you state one thing a data.frame can do that a matrix cannot?
  • Given a column ydat$rate, can you write the subset that keeps only rows where rate < 0.1?
  • When would you reach for aggregate() instead of subsetting?