12 Introduction to dplyr: mammal sleep dataset
Almost every analysis you’ll do in biology starts with a table: a sample sheet with one row per sample, a count matrix with one row per gene, a results table with one row per differentially expressed transcript. Over and over you’ll want to do the same handful of things to those tables — keep only the rows that pass a filter, pull out a few columns, sort by a statistic, add a derived column, or collapse many rows into a per-group summary.
That recurring pattern has a name: split-apply-combine. You split a table into groups (say, by treatment condition), apply a calculation to each group (say, the mean expression), and combine the answers back into one tidy table. The dplyr package gives you a small, consistent vocabulary of verbs for exactly these moves, and they read almost like English.
We’ll learn those verbs on a gentle, friendly dataset — the sleep habits of mammals — before you turn them loose on genomics tables. The dataset is small enough to see the whole thing, so you can watch each verb do its job. Every move you make here (filtering rows, selecting columns, summarizing by group) is the same move you’ll later make on a count matrix or a sample sheet.
12.1 What you’ll learn
- Filter rows of a table with filter() and select columns with select().
- Re-order rows with arrange() and add derived columns with mutate().
- Collapse a table to summary statistics with summarise().
- Use group_by() to compute those summaries per group: the split-apply-combine pattern.
- Chain verbs together with the pipe operator, |>, to read an analysis left to right.
msleep is a warm-up, chosen because it is small and the columns are easy to reason about. Keep the biological payoff in view as you go: filtering mammals that sleep more than 16 hours is the same operation as filtering genes with an adjusted p-value below 0.05; selecting the name and sleep_total columns is the same as pulling sample_id and condition out of a sample sheet; and summarizing sleep by taxonomic order is the same as summarizing mean expression by treatment group. Learn the verbs here; reuse them on your own data.
12.2 What is dplyr?
The dplyr package is a specialized package for working with data.frames (and the closely related tibble) to transform and summarize tabular data with rows and columns. For another explanation, see the package vignette, Introduction to dplyr.
12.3 Why is dplyr useful?
dplyr provides a set of functions — commonly called the dplyr “verbs” — that perform the common data manipulations: filtering rows, selecting columns, re-ordering rows, adding new columns, and summarizing data. It also makes the split-apply-combine pattern straightforward through grouping.
Compared with base R functions, the dplyr verbs are often easier to work with. They are more consistent in their syntax and are designed around data frames rather than individual vectors, which is exactly the shape biological data usually arrives in.
12.4 Data: mammals sleep
The msleep (mammals sleep) dataset is an updated and expanded version of the mammals sleep dataset, with sleep times and weights taken from Savage and West [1]. It contains the sleep times and weights for a set of mammals, with 83 rows and 11 variables. It ships as a dataset inside the ggplot2 package, so there is nothing to download; you just install ggplot2 once if you don’t already have it.
[1] Van M. Savage, Geoffrey B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, Jan 2007, 104 (3) 1051-1056. DOI: 10.1073/pnas.0610080104
Then load the library, which makes msleep available.
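A minimal sketch of those two steps follows. The install line is commented out on the assumption that ggplot2 may already be present, and the dim() call is only an illustrative check against the 83 rows and 11 variables described above.

```r
# install.packages("ggplot2")  # run once if ggplot2 is not yet installed
library(ggplot2)               # attaching ggplot2 makes msleep available

dim(msleep)                    # number of rows, then number of columns
```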
As with many datasets in R, a help page describes the dataset itself.
?msleep
The columns are described on that help page, and are listed here for convenience.
| column name | Description |
|---|---|
| name | common name |
| genus | taxonomic rank |
| vore | carnivore, omnivore or herbivore? |
| order | taxonomic rank |
| conservation | the conservation status of the mammal |
| sleep_total | total amount of sleep, in hours |
| sleep_rem | rem sleep, in hours |
| sleep_cycle | length of sleep cycle, in hours |
| awake | amount of time spent awake, in hours |
| brainwt | brain weight in kilograms |
| bodywt | body weight in kilograms |
12.5 The dplyr verbs
dplyr has many functions, but the six verbs below do the bulk of everyday work. The rest of this chapter takes them one at a time.
| dplyr verbs | Description |
|---|---|
| select() | select columns |
| filter() | filter rows |
| arrange() | re-order or arrange rows |
| mutate() | create new columns |
| summarise() | summarise values |
| group_by() | allows for group operations in the “split-apply-combine” concept |
Before going further, install dplyr (once) and load it.
install.packages("dplyr")
library(dplyr)
12.6 Selecting columns with select()
select() keeps only the columns you name. Here we keep the name and sleep_total columns — think of pulling just the sample identifier and one measurement out of a wider table.
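The call behind this output is not shown here. A call along the following lines would produce it, assuming ggplot2 and dplyr are installed; head() is used only to limit the display to the first six rows.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Keep two columns, then show the first six rows
head(select(msleep, name, sleep_total))
```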
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
To keep all columns except one, put a minus sign in front of it (negative indexing). For example, every column except name:
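As a sketch under the same assumptions (ggplot2 and dplyr installed), dropping a single column could look like this:

```r
library(dplyr)
library(ggplot2)  # provides msleep

# The minus sign drops name and keeps the other ten columns
head(select(msleep, -name))
```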
# A tibble: 6 × 10
genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Acinonyx carni Carnivo… lc 12.1 NA NA 11.9
2 Aotus omni Primates <NA> 17 1.8 NA 7
3 Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6
4 Blarina omni Soricom… lc 14.9 2.3 0.133 9.1
5 Bos herbi Artioda… domesticated 4 0.7 0.667 20
6 Bradypus herbi Pilosa <NA> 14.4 2.2 0.767 9.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
To select a range of columns by name, use the : operator. Notice that dplyr lets you write the column names without quotes, almost as if they were variables:
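One plausible form of the range selection, assuming the column order listed in the table above (name through order are the first four columns):

```r
library(dplyr)
library(ggplot2)  # provides msleep

# name:order selects every column from name through order, inclusive
head(select(msleep, name:order))
```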
# A tibble: 6 × 4
name genus vore order
<chr> <chr> <chr> <chr>
1 Cheetah Acinonyx carni Carnivora
2 Owl monkey Aotus omni Primates
3 Mountain beaver Aplodontia herbi Rodentia
4 Greater short-tailed shrew Blarina omni Soricomorpha
5 Cow Bos herbi Artiodactyla
6 Three-toed sloth Bradypus herbi Pilosa
To select every column whose name starts with “sl”, use the helper starts_with():
head(select(msleep, starts_with("sl")))
# A tibble: 6 × 3
sleep_total sleep_rem sleep_cycle
<dbl> <dbl> <dbl>
1 12.1 NA NA
2 17 1.8 NA
3 14.4 2.4 NA
4 14.9 2.3 0.133
5 4 0.7 0.667
6 14.4 2.2 0.767
Other helpers let you select columns by other criteria:
- ends_with(): columns that end with a character string
- contains(): columns that contain a character string
- matches(): columns that match a regular expression
- one_of(): columns whose names are in a given group of names
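The helpers can be tried on msleep as in the sketch below. The search strings are illustrative choices rather than examples from the original text. The last line uses any_of(), which current tidyselect documentation recommends in place of the superseded one_of(); "not_a_column" is a deliberately made-up name to show that missing names are skipped.

```r
library(dplyr)
library(ggplot2)  # provides msleep

names(select(msleep, ends_with("wt")))     # the two weight columns
names(select(msleep, contains("sleep")))   # every column with "sleep" in its name
names(select(msleep, matches("^sleep_")))  # same columns, via a regular expression

# any_of() takes a character vector and silently ignores names that are absent
names(select(msleep, any_of(c("name", "genus", "not_a_column"))))
```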
12.7 Selecting rows with filter()
filter() keeps only the rows that match a condition. For example, keep the mammals that sleep a total of 16 or more hours:
filter(msleep, sleep_total >= 16)
# A tibble: 8 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
2 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
3 North … Dide… omni Dide… lc 18 4.9 0.333 6
4 Big br… Epte… inse… Chir… lc 19.7 3.9 0.117 4.3
5 Thick-… Lutr… carni Dide… lc 19.4 6.6 NA 4.6
6 Little… Myot… inse… Chir… <NA> 19.9 2 0.2 4.1
7 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
8 Arctic… Sper… herbi Rode… lc 16.6 NA NA 7.4
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
You can give filter() more than one condition, separated by commas; a row must satisfy all of them. Here we want mammals that sleep at least 16 hours and weigh at least one kilogram:
filter(msleep, sleep_total >= 16, bodywt >= 1)
# A tibble: 3 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
2 North … Dide… omni Dide… lc 18 4.9 0.333 6
3 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
You can also match against a set of values with the %in% operator, which returns TRUE for each element of the first vector that appears in the second. Here we keep mammals in either the Perissodactyla or Primates taxonomic order:
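A sketch of that filter, assuming ggplot2 and dplyr are installed. The first line shows %in% on two small made-up vectors before it is applied to the order column.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# %in% tests each element on the left for membership in the set on the right
c("a", "b", "z") %in% c("a", "b", "c")

# Keep rows whose order is one of the two named orders
filter(msleep, order %in% c("Perissodactyla", "Primates"))
```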
# A tibble: 15 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
2 Grivet Cerc… omni Prim… lc 10 0.7 NA 14
3 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
4 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
5 Patas… Eryt… omni Prim… lc 10.9 1.1 NA 13.1
6 Galago Gala… omni Prim… <NA> 9.8 1.1 0.55 14.2
7 Human Homo omni Prim… <NA> 8 1.9 1.5 16
8 Mongo… Lemur herbi Prim… vu 9.5 0.9 NA 14.5
9 Macaq… Maca… omni Prim… <NA> 10.1 1.2 0.75 13.9
10 Slow … Nyct… carni Prim… <NA> 11 NA NA 13
11 Chimp… Pan omni Prim… <NA> 9.7 1.4 1.42 14.3
12 Baboon Papio omni Prim… <NA> 9.4 1 0.667 14.6
13 Potto Pero… omni Prim… lc 11 NA NA 13
14 Squir… Saim… omni Prim… <NA> 9.6 1.4 NA 14.4
15 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Any of the comparison operators (>, <, >=, <=, ==, !=, %in%) can form the logical test. This is exactly how you would later keep only the genes passing a significance cutoff.
12.8 Chaining verbs with the pipe, |>
Real analyses string several of these operations together. The pipe operator, |>, lets you “pipe” the output of one function straight into the first argument of the next. There is nothing magical about it — but it lets you read a chain of operations from left to right, in the order they happen.
x |> f() |> g() means: take x, and then apply f, and then apply g. Without the pipe you’d either save each step in a temporary variable and pass it along, or nest the calls inside out — g(f(x)) — which is hard to read once the chain grows. The pipe keeps the steps in the order you think about them.
Here is an example we have already written the nested way:
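The nested form is presumably the select() call from the earlier section, wrapped in head(). It reads inside out: select() runs first, then head().

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Nested: the innermost call, select(), is evaluated first
head(select(msleep, name, sleep_total))
```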
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
With the pipe, you read the same thing top to bottom: take msleep, and then select two columns (name and sleep_total), and then show the head of the result.
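Written as a pipe, the chain might look like the sketch below. The native pipe |> needs R 4.1 or later; that version requirement is an assumption about your setup, not something stated in this chapter.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Take msleep, and then select two columns, and then show the head
msleep |>
  select(name, sleep_total) |>
  head()
```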
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
You’ll see how much this helps once we start combining many verbs. From here on, we’ll use the pipe throughout.
12.9 Re-ordering rows with arrange()
arrange() sorts the rows by one or more columns. To sort by the taxonomic order, name the column:
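A pipe along these lines would give the sorted table; head() is an assumption made to match the six rows displayed.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Sort all rows alphabetically by taxonomic order, then show the first six
msleep |>
  arrange(order) |>
  head()
```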
# A tibble: 6 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Tenrec Tenr… omni Afro… <NA> 15.6 2.3 NA 8.4
2 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
3 Roe de… Capr… herbi Arti… lc 3 NA NA 21
4 Goat Capri herbi Arti… lc 5.3 0.6 NA 18.7
5 Giraffe Gira… herbi Arti… cd 1.9 0.4 NA 22.1
6 Sheep Ovis herbi Arti… domesticated 3.8 0.6 NA 20.2
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Now let’s combine verbs: select three columns, sort the rows first by taxonomic order and then by sleep_total, and finally show the head of the result.
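The three steps can be sketched as one pipe, assuming the three columns are name, order, and sleep_total as in the output:

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  select(name, order, sleep_total) |>  # keep three columns
  arrange(order, sleep_total) |>       # sort by order, then by sleep within each order
  head()                               # show the first six rows
```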
# A tibble: 6 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Tenrec Afrosoricida 15.6
2 Giraffe Artiodactyla 1.9
3 Roe deer Artiodactyla 3
4 Sheep Artiodactyla 3.8
5 Cow Artiodactyla 4
6 Goat Artiodactyla 5.3
The same chain, but instead of showing the head we filter to the mammals that sleep 16 or more hours:
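Swapping the final head() for a filter() step gives a sketch like this:

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  select(name, order, sleep_total) |>
  arrange(order, sleep_total) |>
  filter(sleep_total >= 16)  # keep only the long sleepers; sort order is preserved
```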
# A tibble: 8 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Big brown bat Chiroptera 19.7
2 Little brown bat Chiroptera 19.9
3 Long-nosed armadillo Cingulata 17.4
4 Giant armadillo Cingulata 18.1
5 North American Opossum Didelphimorphia 18
6 Thick-tailed opposum Didelphimorphia 19.4
7 Owl monkey Primates 17
8 Arctic ground squirrel Rodentia 16.6
To sort a column in descending order, wrap it in desc(). Here we sort by order ascending, then by sleep_total descending:
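A version of the previous pipe with desc() applied to the second sort key might read:

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  select(name, order, sleep_total) |>
  arrange(order, desc(sleep_total)) |>  # order ascending, sleep descending within each order
  filter(sleep_total >= 16)
```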
# A tibble: 8 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Little brown bat Chiroptera 19.9
2 Big brown bat Chiroptera 19.7
3 Giant armadillo Cingulata 18.1
4 Long-nosed armadillo Cingulata 17.4
5 Thick-tailed opposum Didelphimorphia 19.4
6 North American Opossum Didelphimorphia 18
7 Owl monkey Primates 17
8 Arctic ground squirrel Rodentia 16.6
12.10 Creating new columns with mutate()
mutate() adds new columns computed from existing ones. Here we add a column rem_proportion, the ratio of REM sleep to total sleep:
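A sketch of that step: the new column is appended at the right-hand end of the table, and rows with a missing sleep_rem get NA for the ratio.

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  mutate(rem_proportion = sleep_rem / sleep_total) |>  # computed row by row
  head()
```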
# A tibble: 6 × 12
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
# ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>
You can add several columns at once, separated by commas. Here we add rem_proportion and also bodywt_grams, the body weight expressed in grams:
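With two derived columns the call could look as follows. The factor of 1000 assumes bodywt is recorded in kilograms, as the column table above states.

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  mutate(
    rem_proportion = sleep_rem / sleep_total,
    bodywt_grams   = bodywt * 1000  # kilograms to grams
  ) |>
  head()
```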
# A tibble: 6 × 13
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
# ℹ 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
# bodywt_grams <dbl>
Computing a derived column like this is the same move as turning raw counts into counts-per-million, or a fold change into a log fold change.
12.11 Summarizing with summarise()
summarise() collapses a column down to a single summary statistic. To get the average amount of sleep, apply mean() to sleep_total and name the result avg_sleep:
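Neither the call nor its output appears at this point, so the sketch below is a reconstruction. It returns a one-row, one-column tibble.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Collapse all 83 rows into a single mean
msleep |>
  summarise(avg_sleep = mean(sleep_total))
```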
Mammals in this dataset sleep about 10.4 hours a day on average.
You can compute several summaries at once. Besides mean(), useful summary functions include sd(), min(), max(), median(), sum(), n() (the number of rows), first(), last(), and n_distinct() (the number of distinct values):
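A call that yields the four columns in the output would look roughly like this, with the column names taken from that output:

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  summarise(
    avg_sleep = mean(sleep_total),
    min_sleep = min(sleep_total),
    max_sleep = max(sleep_total),
    total     = n()  # number of rows in the table
  )
```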
# A tibble: 1 × 4
avg_sleep min_sleep max_sleep total
<dbl> <dbl> <dbl> <int>
1 10.4 1.9 19.9 83
That single row tells you the average, the extremes, and the sample size all at once.
12.12 Grouping with group_by()
group_by() is what makes summarise() powerful. On its own, summarise() collapses the whole table to one row. Combined with group_by(), it gives you one summary row per group — this is split-apply-combine in action.
Picture the table sorted into piles, one pile per group (here, one per taxonomic order). dplyr splits the data into those piles, applies your summary to each pile independently, and combines the per-pile answers into one tidy result table. It’s the same idea as base R’s aggregate(), but it fits naturally into a dplyr pipe.
Let’s split msleep by taxonomic order, then ask for the same summary statistics as before. We expect one row of statistics per order:
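The grouped version is the same summarise() call with one extra step in front, sketched here under the assumption that the summary columns match the previous example:

```r
library(dplyr)
library(ggplot2)  # provides msleep

msleep |>
  group_by(order) |>                 # split: one group per taxonomic order
  summarise(                         # apply: the same statistics within each group
    avg_sleep = mean(sleep_total),
    min_sleep = min(sleep_total),
    max_sleep = max(sleep_total),
    total     = n()
  )                                  # combine: one result row per order
```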
# A tibble: 19 × 5
order avg_sleep min_sleep max_sleep total
<chr> <dbl> <dbl> <dbl> <int>
1 Afrosoricida 15.6 15.6 15.6 1
2 Artiodactyla 4.52 1.9 9.1 6
3 Carnivora 10.1 3.5 15.8 12
4 Cetacea 4.5 2.7 5.6 3
5 Chiroptera 19.8 19.7 19.9 2
6 Cingulata 17.8 17.4 18.1 2
7 Didelphimorphia 18.7 18 19.4 2
8 Diprotodontia 12.4 11.1 13.7 2
9 Erinaceomorpha 10.2 10.1 10.3 2
10 Hyracoidea 5.67 5.3 6.3 3
11 Lagomorpha 8.4 8.4 8.4 1
12 Monotremata 8.6 8.6 8.6 1
13 Perissodactyla 3.47 2.9 4.4 3
14 Pilosa 14.4 14.4 14.4 1
15 Primates 10.5 8 17 12
16 Proboscidea 3.6 3.3 3.9 2
17 Rodentia 12.5 7 16.6 22
18 Scandentia 8.9 8.9 8.9 1
19 Soricomorpha 11.1 8.4 14.9 5
Each row is one taxonomic order, with its average, minimum, and maximum sleep and the number of species. Swap “order” for “treatment condition” and “sleep” for “expression” and you have a standard genomics summary.
12.13 Exercises
Use the msleep dataset (with ggplot2 and dplyr loaded) for each of these. Try to write the pipe before opening the solution.
- Pick a verb. You want only the mammals that sleep fewer than 5 hours, showing just their name and sleep_total. Which two verbs do you need, and in what order? Write the pipe.

  Solution:

  # A tibble: 11 × 2
     name             sleep_total
     <chr>                  <dbl>
   1 Cow                      4
   2 Roe deer                 3
   3 Asian elephant           3.9
   4 Horse                    2.9
   5 Donkey                   3.1
   6 Giraffe                  1.9
   7 Pilot whale              2.7
   8 African elephant         3.3
   9 Sheep                    3.8
  10 Caspian seal             3.5
  11 Brazilian tapir          4.4

  You need filter() to choose the rows and select() to choose the columns. Either order works here, but filtering first means select() operates on fewer rows, a habit that scales to large tables.

- Sleepiest carnivores. Keep only the carnivores (vore == "carni"), then sort them from most to least total sleep. Show name, vore, and sleep_total.

  Solution:

  # A tibble: 19 × 3
     name                       vore  sleep_total
     <chr>                      <chr>       <dbl>
   1 Thick-tailed opposum       carni        19.4
   2 Long-nosed armadillo       carni        17.4
   3 Tiger                      carni        15.8
   4 Northern grasshopper mouse carni        14.5
   5 Lion                       carni        13.5
   6 Domestic cat               carni        12.5
   7 Arctic fox                 carni        12.5
   8 Cheetah                    carni        12.1
   9 Slow loris                 carni        11
  10 Jaguar                     carni        10.4
  11 Dog                        carni        10.1
  12 Red fox                    carni         9.8
  13 Northern fur seal          carni         8.7
  14 Genet                      carni         6.3
  15 Gray seal                  carni         6.2
  16 Common porpoise            carni         5.6
  17 Bottle-nosed dolphin       carni         5.2
  18 Caspian seal               carni         3.5
  19 Pilot whale                carni         2.7

  filter() chooses the rows, select() trims the columns, and arrange(desc(...)) sorts largest-first. Reading the pipe top to bottom tells the whole story.

- A derived column. Is there a relationship between rem_proportion (REM sleep divided by total sleep) and body weight? Add a rem_proportion column, then sort the table so the mammals with the highest REM proportion come first, and look at their bodywt.

  Solution:

  # A tibble: 6 × 3
    name                    bodywt rem_proportion
    <chr>                    <dbl>          <dbl>
  1 European hedgehog        0.77           0.347
  2 Thick-tailed opposum     0.37           0.340
  3 Giant armadillo         60              0.337
  4 Tree shrew               0.104          0.292
  5 Dog                     14              0.287
  6 North American Opossum   1.7            0.272

  mutate() builds the new column, then arrange(desc(...)) ranks by it. The heaviest mammals are not at the top, hinting that large-bodied animals tend to spend a smaller fraction of their sleep in REM. That pattern is worth a proper plot to confirm.

- Summarize by group. For each feeding type (vore), compute the average total sleep and the number of species. Which group sleeps the most on average?

  Solution:

  # A tibble: 5 × 3
    vore    avg_sleep     n
    <chr>       <dbl> <int>
  1 carni       10.4     19
  2 herbi        9.51    32
  3 insecti     14.9      5
  4 omni        10.9     20
  5 <NA>        10.2      7

  This is the split-apply-combine pattern: split by vore, apply mean() and n(), combine into one row per group. The NA row collects mammals whose feeding type is unknown, a reminder to check for missing values in real data.
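The solution pipes themselves are not shown with the outputs. One plausible reconstruction is sketched below; the object names (short_sleepers, sleepy_carnivores, top_rem, sleep_by_vore) are illustrative, and the head() in the third pipe is an assumption made to match the six rows displayed.

```r
library(dplyr)
library(ggplot2)  # provides msleep

# Exercise 1: choose the rows first, then the columns
short_sleepers <- msleep |>
  filter(sleep_total < 5) |>
  select(name, sleep_total)

# Exercise 2: carnivores, sorted from most to least total sleep
sleepy_carnivores <- msleep |>
  filter(vore == "carni") |>
  select(name, vore, sleep_total) |>
  arrange(desc(sleep_total))

# Exercise 3: derive rem_proportion, rank by it, show the top rows
top_rem <- msleep |>
  mutate(rem_proportion = sleep_rem / sleep_total) |>
  arrange(desc(rem_proportion)) |>
  select(name, bodywt, rem_proportion) |>
  head()

# Exercise 4: one summary row per feeding type (unknown types form an NA group)
sleep_by_vore <- msleep |>
  group_by(vore) |>
  summarise(avg_sleep = mean(sleep_total), n = n())

short_sleepers
sleepy_carnivores
top_rem
sleep_by_vore
```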
12.14 Summary
You now have the core dplyr toolkit, and every verb maps onto something you’ll do with biological tables:
- select() keeps the columns you want (e.g. pulling sample_id and condition from a sample sheet).
- filter() keeps the rows that pass a condition (e.g. genes below a p-value cutoff).
- arrange() sorts rows, with desc() for largest-first.
- mutate() adds derived columns (e.g. log fold changes).
- summarise() collapses a column to a summary statistic.
- group_by() turns summarise() into per-group summaries: the split-apply-combine pattern at the heart of so many analyses.
- The pipe, |>, chains these verbs so you can read an analysis left to right, as a sequence of “and then” steps.
Master these six verbs on msleep and you can carry them straight over to count matrices, sample sheets, and results tables.