Almost every dataset you will meet in genomics arrives as a table: genes down the side, samples across the top, measurements in the middle, and a few columns of annotation bolted on. In R, the structure built for exactly this shape is the data.frame. Get comfortable with it and most of the rest of the book — loading expression data, filtering genes, summarizing by condition — becomes variations on a theme you already know.
A data.frame looks a bit like an R matrix in that it has two dimensions, rows and columns. The difference that matters is this: a matrix must hold a single data type, while a data.frame may hold a different data type in each column — numbers in one, gene names in another, a logical flag in a third. That is exactly what a real dataset looks like, which is why the data.frame, not the matrix, is the workhorse for tabular data in R.
13.1 What you’ll learn
By the end of this chapter you will be able to:
Explain how a data.frame differs from a matrix, and why that difference matters.
Build a data.frame from scratch with data.frame().
We’ll do all of this on a small gene-expression table that we build ourselves, so there is nothing to download and every value is in plain sight.
13.2 Building a data.frame from scratch
The most direct way to understand a data.frame is to make one. Recall from the vectors chapter that a column of a data frame is just a vector. So a data.frame is really a set of equal-length vectors stood up side by side, each one becoming a named column. The data.frame() function does exactly that assembly.
Imagine we have measured a handful of genes: each gene has a symbol, the chromosome it sits on, an average expression value, the log2 fold change between a tumor and a normal sample, and a flag for whether it is a known cancer gene. Each of those is a vector, and they are all the same length — one entry per gene.
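A minimal sketch of that assembly follows. The chromosome and expression values are the ones that appear later in this chapter; the gene symbols, fold changes, and cancer flags are illustrative assumptions chosen to fit them, not measured data.

```r
# Five equal-length vectors, one entry per gene.
# symbol, log2fc and is_cancer are assumed values for illustration only.
symbol     <- c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2")
chromosome <- c("17", "17", "7", "8", "10", "12", "2", "13")
expression <- c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9)
log2fc     <- c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2)
is_cancer  <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Stand the vectors up side by side; each argument name becomes a column name.
# stringsAsFactors = FALSE keeps text columns as character on R versions before 4.0.
genes <- data.frame(
  symbol     = symbol,
  chromosome = chromosome,
  expression = expression,
  log2fc     = log2fc,
  is_cancer  = is_cancer,
  stringsAsFactors = FALSE
)

genes  # typing the name prints the whole table
```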
Notice that R lined up our five vectors into five columns and numbered the rows 1 through 8 down the left edge. Those row numbers are the default row names. Each column kept its own type: symbol and chromosome are character, expression and log2fc are numeric, and is_cancer is logical. That mix of types in one object is the whole point of a data.frame — and the one thing a matrix could never do.
We can confirm that R really did give us a data.frame:
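One way to check, sketched on a stand-in copy of the table (the symbol, log2fc, and is_cancer values are assumptions):

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

class(genes)          # "data.frame"
is.data.frame(genes)  # TRUE
```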
13.3 Inspecting a data.frame
Our table is small enough to print in full, but real datasets have thousands or tens of thousands of rows and you will not be able to look at them all at once. Fortunately, R gives you a toolkit for inspecting a data.frame a piece at a time.
Structure at a glance: str() shows the columns, their types, and a preview of each.
The single most useful one is str(), which you met in the lists chapter. It reports the structure: how many rows (“obs.”), how many columns (“variables”), and for each column its name, type, and first few values.
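A short sketch of the call, again on a stand-in copy of the table with assumed symbol, log2fc, and is_cancer values:

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# One line per column: name, type abbreviation (chr, num, logi), first values.
str(genes)
```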
The size functions answer “how big is this?” at a glance. dim() returns the rows and columns together; nrow() and ncol() return them one at a time, which is handy inside other calculations:
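For the example table, these calls might look like the sketch below. The table is rebuilt with assumed symbol, log2fc, and is_cancer values; the sizes do not depend on those assumptions.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

dim(genes)   # 8 5: rows first, then columns
nrow(genes)  # 8
ncol(genes)  # 5

# Single numbers slot straight into other calculations:
nrow(genes) * ncol(genes)  # 40 cells in total
```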
The name functions tell you what the rows and columns are called. colnames() is the one you’ll reach for constantly — it’s a menu of everything you can ask for:
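A brief sketch on a stand-in copy of the table (symbol, log2fc, and is_cancer values assumed):

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

colnames(genes)  # "symbol" "chromosome" "expression" "log2fc" "is_cancer"
rownames(genes)  # the default row names, "1" through "8", as character strings
```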
Finally, summary() walks across the columns and gives each an appropriate thumbnail: numeric ranges and quartiles for the measurement columns, simple counts for the others.
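A sketch of the call on a stand-in copy of the table. The symbol, log2fc, and is_cancer values are assumptions, so only the expression column's summary reflects numbers from this chapter.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

summary(genes)             # one thumbnail per column
summary(genes$expression)  # the same idea for a single numeric column
```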
In RStudio there is one more inspection tool, View() (note the capital “V”), which opens the data frame in a spreadsheet-style window you can scroll and sort. It is a great way to get a feel for a dataset before you start writing code against it. We don’t run it here because it needs an interactive session, so it cannot run when the book is rendered.
13.4 Extracting a column
Because pulling out a single column is such a common operation, R gives data.frames two shortcuts for it. The first, the $ operator, takes the data frame on the left and a column name on the right:
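For instance, on a stand-in copy of the table (the gene symbols shown are assumed; the expression values are the chapter's):

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

genes$symbol      # a character vector of length 8
genes$expression  # a numeric vector of length 8
```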
Each of these hands you back a plain vector — exactly the vector you put in when you built the data frame. So everything you learned about vectors applies directly. You can, for example, take the mean of the expression column:
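A sketch of that calculation. The table is a stand-in with assumed symbol, log2fc, and is_cancer values; the expression values are the ones printed in this chapter, so the mean shown holds for them.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

mean(genes$expression)  # 7.325
```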
The second shortcut comes from a fact worth remembering: in R, a data.frame is also a list, where each column is one element. So the list notation [[ ]] works too, either by position or by name:
genes[[3]]  # the third column, by position
[1]  8.2  5.1 11.4  9.8  6.7  7.3  4.2  5.9
genes[["expression"]]  # the same column, by name
[1]  8.2  5.1 11.4  9.8  6.7  7.3  4.2  5.9
Note: $ versus [[ ]]
Both pull out a single column as a plain vector, so why two notations? $ is the quick, everyday form when you know the column name as you type it (genes$expression). [[ ]] is the form to reach for when the column is chosen by a value — a number (genes[[3]]) or a name stored in a variable (col <- "expression"; genes[[col]]), which $ cannot do. Reach for $ by default and [[ ]] when you’re programming with the column choice.
13.5 Subsetting with [rows, columns]
A data.frame has two dimensions, so to pull out a rectangular piece we use the square brackets with two values inside, separated by a comma: [rows, columns]. Leaving one side of the comma blank means “all of them.”
To get the first three rows and all columns, fill in the rows and leave the columns blank:
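Sketched on a stand-in copy of the table (symbol, log2fc, and is_cancer values assumed):

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Rows 1 to 3 before the comma, nothing after it: every column is kept.
first_three <- genes[1:3, ]
first_three
```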
And to get one whole column while leaving the rows blank, name the column on the right of the comma:
genes[, "chromosome"]
[1] "17" "17" "7" "8" "10" "12" "2" "13"
Tip: The comma is the whole trick
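With a one-dimensional vector you write x[i], with one value in the brackets. With a two-dimensional data.frame you write df[i, j]: two values, separated by a comma, meaning [rows, columns]. Forget the comma and R treats the single value as a column selector, so genes[1:3] quietly returns the first three columns, and a logical row filter written without the comma usually fails with “undefined columns selected.” The error “incorrect number of dimensions” signals the opposite mistake: a comma used on a one-dimensional object such as a vector.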
13.6 Adding a column
Adding a column works just like assigning to a column that doesn’t exist yet: name it with $ on the left of the assignment arrow, and give it a vector of the right length. Here we add an abs_log2fc column holding the absolute fold change, which is often what we sort on when we care about the size of a change regardless of its direction:
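A minimal sketch, using a stand-in copy of the table whose log2fc values are assumptions:

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# abs_log2fc does not exist yet, so the assignment creates it.
genes$abs_log2fc <- abs(genes$log2fc)

ncol(genes)  # 6
genes
```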
The data frame now has six columns instead of five. Because abs() is vectorized (see the vectors chapter), one short line computed the new value for every row at once.
13.7 Selecting rows by a logical condition
This is where data.frames earn their keep. Recall the logical-indexing pattern from the vectors chapter: ask a yes/no question of a column, get back a TRUE/FALSE vector, then use it to keep only the rows where the answer is TRUE. With a data.frame, that logical vector goes in the row position of [rows, columns].
For example, keep only the genes with expression above 7:
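A sketch of that filter on a stand-in copy of the table. The expression values are the chapter's, so the rows selected (1, 3, 4, and 6) follow from them; the other columns hold assumed values.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# The logical vector sits in the row position; the blank keeps all columns.
high_expr <- genes[genes$expression > 7, ]
high_expr
```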
Read the inside first: genes$expression > 7 produces a logical vector — one TRUE/FALSE per gene — and genes[..., ] keeps the rows where it is TRUE, with the blank after the comma keeping all the columns.
You can combine conditions with & (“and”) and | (“or”). Here we ask for cancer genes whose expression is also above 7:
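A sketch of the combined filter. Because the is_cancer flags in this stand-in table are assumptions, the particular rows returned are illustrative only.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# is_cancer is already logical, so it needs no comparison of its own.
cancer_high <- genes[genes$is_cancer & genes$expression > 7, ]
cancer_high
```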
Writing genes$ in front of every column name gets repetitive. The subset() function lets you refer to the columns by their bare names, because it knows to look them up inside the data frame:
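The same filter with subset(), sketched on a stand-in copy of the table (is_cancer, symbol, and log2fc values assumed):

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Bare column names are looked up inside `genes`.
cancer_high <- subset(genes, is_cancer & expression > 7)
cancer_high
```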
It reads almost like English: subset the genes where is_cancer is true and expression exceeds 7. Use whichever form is clearer to you; they do the same job.
13.8 Saving a data.frame
Once you have a data.frame worth keeping, you’ll want to save it. The most portable way is to write it to a .csv file (comma-separated values), which any other program — R, Python, even a spreadsheet — can read. The write.csv() function does this. We pass row.names = FALSE so the default row numbers aren’t written as an extra unnamed column:
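A sketch of a round trip, under some assumptions: the file name genes.csv and its location in R's temporary directory are hypothetical, and the table is a stand-in with assumed symbol, log2fc, and is_cancer values. The colClasses argument is there because read.csv() would otherwise guess that a column of chromosome labels like "17" and "7" is integer.

```r
# Stand-in copy of `genes` so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary directory keeps the example tidy.
out_file <- file.path(tempdir(), "genes.csv")

# row.names = FALSE leaves the default row numbers out of the file.
write.csv(genes, out_file, row.names = FALSE)

# Read it back, keeping chromosome as text rather than letting it become integer.
genes_back <- read.csv(out_file, colClasses = c(chromosome = "character"))
str(genes_back)
```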
The function names follow a pattern worth noticing: read.csv() and write.csv() are mirror images, and the .csv in each name tells you the file format. As always, help('read.csv') is worth a look for the many options these functions accept.
13.9 Exercises
These all use the genes data frame we built earlier in the chapter (including the abs_log2fc column we added).
How many genes, how many columns? Report the number of rows and the number of columns in genes two ways: with a single call to dim(), and with nrow() and ncol() separately.
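One possible solution, sketched on a stand-in copy of genes (symbol, log2fc, and is_cancer values assumed) with the abs_log2fc column already added:

```r
# Stand-in copy of `genes`, including abs_log2fc, so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
genes$abs_log2fc <- abs(genes$log2fc)

dim(genes)   # 8 6
nrow(genes)  # 8
ncol(genes)  # 6
```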
dim() returns both numbers as a length-2 vector (rows first, then columns); nrow() and ncol() return each one on its own, which is handier inside other calculations.
Extract a column and summarize it. Pull the log2fc column out as a plain vector and compute its mean. Confirm the extracted object is numeric.
The two conditions are joined with &; the combined logical vector goes in the row position of [rows, columns], and the blank after the comma keeps every column. nrow() then counts the surviving rows.
Add a column. Add a logical column called strong that is TRUE when a gene’s absolute fold change (abs_log2fc) is at least 2. Then count how many genes are strong.
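One possible solution, sketched on a stand-in copy of genes. The fold changes are assumed values, so the count of 2 applies to this stand-in only.

```r
# Stand-in copy of `genes`, including abs_log2fc, so this block runs on its own;
# symbol, log2fc and is_cancer values are assumed.
genes <- data.frame(
  symbol     = c("TP53", "BRCA1", "ACTB", "MYC", "PTEN", "GAPDH", "MSH2", "BRCA2"),
  chromosome = c("17", "17", "7", "8", "10", "12", "2", "13"),
  expression = c(8.2, 5.1, 11.4, 9.8, 6.7, 7.3, 4.2, 5.9),
  log2fc     = c(-1.8, -0.9, 0.1, 3.1, -2.4, -0.2, 0.7, 1.2),
  is_cancer  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
genes$abs_log2fc <- abs(genes$log2fc)

# The comparison yields one TRUE/FALSE per gene, stored as a new column.
genes$strong <- genes$abs_log2fc >= 2

sum(genes$strong)  # 2 with these assumed fold changes
```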
Assigning to genes$strong creates the new column from a logical comparison. Because TRUE counts as 1, sum() of a logical column tallies the TRUEs — the same count-by-summing trick from the vectors chapter.
13.10 Summary
You can now build a data.frame from a set of equal-length vectors, inspect its shape and contents, pull out the rows and columns you care about, add new columns, and write the result back to disk. Specifically, you can:
Build a data frame with data.frame(name = vector, ...), mixing column types freely — the one thing a matrix cannot do.
Extract a column with $ or [[ ]], getting back a plain vector.
Subset a rectangle with [rows, columns], where a blank means “all.”
Filter rows by a logical condition in the row position — the single most important move, and the same logical-indexing pattern you learned for vectors.
Add a column by assigning to df$newname, and save with write.csv().
The logical-indexing pattern — ask a yes/no question of a column, then use the answer to keep rows — is the move you’ll repeat on every dataset for the rest of the book.
13.11 Reflection
Can you state one thing a data.frame can do that a matrix cannot?
Given the column genes$expression, can you write the subset that keeps only rows where expression < 6?
What’s the difference between genes[2], genes[[2]], and genes[, 2]?