12  Introduction to dplyr: mammal sleep dataset

Author

Stephen Turner (with modifications by Sean Davis)

Published

June 1, 2024

Modified

June 2, 2026

Almost every analysis you’ll do in biology starts with a table: a sample sheet with one row per sample, a count matrix with one row per gene, a results table with one row per differentially expressed transcript. Over and over you’ll want to do the same handful of things to those tables — keep only the rows that pass a filter, pull out a few columns, sort by a statistic, add a derived column, or collapse many rows into a per-group summary.

That recurring pattern has a name: split-apply-combine. You split a table into groups (say, by treatment condition), apply a calculation to each group (say, the mean expression), and combine the answers back into one tidy table. The dplyr package gives you a small, consistent vocabulary of verbs for exactly these moves, and they read almost like English.

We’ll learn those verbs on a gentle, friendly dataset — the sleep habits of mammals — before you turn them loose on genomics tables. The dataset is small enough to see the whole thing, so you can watch each verb do its job. Every move you make here (filtering rows, selecting columns, summarizing by group) is the same move you’ll later make on a count matrix or a sample sheet.

12.1 What you’ll learn

  • Filter rows of a table with filter() and select columns with select().
  • Re-order rows with arrange() and add derived columns with mutate().
  • Collapse a table to summary statistics with summarise().
  • Use group_by() to compute those summaries per group — the split-apply-combine pattern.
  • Chain verbs together with the pipe operator, |>, to read an analysis left to right.
Tip: Why a sleep dataset for a genomics book?

msleep is a warm-up, chosen because it is small and the columns are easy to reason about. Keep the biological payoff in view as you go: filtering mammals that sleep more than 16 hours is the same operation as filtering genes with an adjusted p-value below 0.05; selecting the name and sleep_total columns is the same as pulling sample_id and condition out of a sample sheet; and summarizing sleep by taxonomic order is the same as summarizing mean expression by treatment group. Learn the verbs here; reuse them on your own data.

12.2 What is dplyr?

The dplyr package is built for transforming and summarizing tabular data held in data frames (and the closely related tibbles). For another explanation, see the package vignette, Introduction to dplyr.

12.3 Why is dplyr useful?

dplyr provides a set of functions — commonly called the dplyr “verbs” — that perform the common data manipulations: filtering rows, selecting columns, re-ordering rows, adding new columns, and summarizing data. It also makes the split-apply-combine pattern straightforward through grouping.

Compared with base R functions, the dplyr verbs are often easier to work with. They are more consistent in their syntax and are designed around data frames rather than individual vectors, which is exactly the shape biological data usually arrives in.

12.4 Data: mammals sleep

The msleep (mammals sleep) dataset is an updated and expanded version of the mammals sleep dataset, with sleep times and weights taken from Savage and West [1]. It has 83 rows (one per mammal) and 11 variables. It ships as a dataset inside the ggplot2 package, so there is nothing to download; you just install ggplot2 once if you don’t already have it.

[1] Van M. Savage and Geoffrey B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, Jan 2007, 104 (3) 1051-1056. DOI: 10.1073/pnas.0610080104

install.packages("ggplot2")

Then load the library, which makes msleep available.

library(ggplot2)

As with many datasets in R, a help page describes the dataset itself.

?msleep

The columns are described on that help page, and are listed here for convenience.

Column name     Description
name            common name
genus           taxonomic rank
vore            carnivore, omnivore or herbivore?
order           taxonomic rank
conservation    the conservation status of the mammal
sleep_total     total amount of sleep, in hours
sleep_rem       rem sleep, in hours
sleep_cycle     length of sleep cycle, in hours
awake           amount of time spent awake, in hours
brainwt         brain weight in kilograms
bodywt          body weight in kilograms

12.5 The dplyr verbs

dplyr has many functions, but the six verbs below do the bulk of everyday work. The rest of this chapter takes them one at a time.

dplyr verb      Description
select()        select columns
filter()        filter rows
arrange()       re-order or arrange rows
mutate()        create new columns
summarise()     summarise values
group_by()      group rows so that later verbs operate per group (the “split-apply-combine” pattern)

Before going further, install dplyr (once) and load it.

install.packages("dplyr")
library(dplyr)


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

12.6 Selecting columns with select()

select() keeps only the columns you name. Here we keep the name and sleep_total columns — think of pulling just the sample identifier and one measurement out of a wider table.

sleepData <- select(msleep, name, sleep_total)
head(sleepData)
# A tibble: 6 × 2
  name                       sleep_total
  <chr>                            <dbl>
1 Cheetah                           12.1
2 Owl monkey                        17  
3 Mountain beaver                   14.4
4 Greater short-tailed shrew        14.9
5 Cow                                4  
6 Three-toed sloth                  14.4

To keep all columns except one, put a minus sign in front of it (negative indexing). For example, every column except name:

head(select(msleep, -name))
# A tibble: 6 × 10
  genus      vore  order    conservation sleep_total sleep_rem sleep_cycle awake
  <chr>      <chr> <chr>    <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Acinonyx   carni Carnivo… lc                  12.1      NA        NA      11.9
2 Aotus      omni  Primates <NA>                17         1.8      NA       7  
3 Aplodontia herbi Rodentia nt                  14.4       2.4      NA       9.6
4 Blarina    omni  Soricom… lc                  14.9       2.3       0.133   9.1
5 Bos        herbi Artioda… domesticated         4         0.7       0.667  20  
6 Bradypus   herbi Pilosa   <NA>                14.4       2.2       0.767   9.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

To select a range of columns by name, use the : operator. Notice that dplyr lets you write the column names without quotes, almost as if they were variables:

head(select(msleep, name:order))
# A tibble: 6 × 4
  name                       genus      vore  order       
  <chr>                      <chr>      <chr> <chr>       
1 Cheetah                    Acinonyx   carni Carnivora   
2 Owl monkey                 Aotus      omni  Primates    
3 Mountain beaver            Aplodontia herbi Rodentia    
4 Greater short-tailed shrew Blarina    omni  Soricomorpha
5 Cow                        Bos        herbi Artiodactyla
6 Three-toed sloth           Bradypus   herbi Pilosa      

To select every column whose name starts with “sl”, use the helper starts_with():

head(select(msleep, starts_with("sl")))
# A tibble: 6 × 3
  sleep_total sleep_rem sleep_cycle
        <dbl>     <dbl>       <dbl>
1        12.1      NA        NA    
2        17         1.8      NA    
3        14.4       2.4      NA    
4        14.9       2.3       0.133
5         4         0.7       0.667
6        14.4       2.2       0.767

Other helpers let you select columns by other criteria:

  1. ends_with() — columns that end with a character string
  2. contains() — columns that contain a character string
  3. matches() — columns that match a regular expression
  4. one_of() — columns whose names are in a given group of names
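
As a minimal sketch of these helpers on msleep (assuming ggplot2 and dplyr are installed), each call below picks columns by a different rule. It uses any_of() in place of one_of(), because the current tidyselect documentation marks one_of() as superseded by any_of() and all_of(). The name "not_a_column" is a made-up entry, included only to show that any_of() skips names that are absent.

```r
library(ggplot2)  # provides msleep
library(dplyr)

# Columns whose names end in "wt"
wt_cols <- msleep |> select(ends_with("wt"))
names(wt_cols)

# Columns whose names contain "sleep"
sleep_cols <- msleep |> select(contains("sleep"))
names(sleep_cols)

# Columns matching a regular expression: start with "sleep_" or end with "wt"
regex_cols <- msleep |> select(matches("^sleep_|wt$"))
names(regex_cols)

# Columns named in a character vector; any_of() ignores names that do not exist
wanted <- c("name", "genus", "not_a_column")  # "not_a_column" is hypothetical
some_cols <- msleep |> select(any_of(wanted))
names(some_cols)
```

Keeping the wanted names in a character vector is handy when the list of columns comes from somewhere else, such as a configuration file.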

12.7 Selecting rows with filter()

filter() keeps only the rows that match a condition. For example, keep the mammals that sleep a total of 16 or more hours:

filter(msleep, sleep_total >= 16)
# A tibble: 8 × 11
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
2 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
3 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
4 Big br… Epte… inse… Chir… lc                  19.7       3.9       0.117   4.3
5 Thick-… Lutr… carni Dide… lc                  19.4       6.6      NA       4.6
6 Little… Myot… inse… Chir… <NA>                19.9       2         0.2     4.1
7 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
8 Arctic… Sper… herbi Rode… lc                  16.6      NA        NA       7.4
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

You can give filter() more than one condition, separated by commas; a row must satisfy all of them. Here we want mammals that sleep at least 16 hours and weigh at least one kilogram:

filter(msleep, sleep_total >= 16, bodywt >= 1)
# A tibble: 3 × 11
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
2 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
3 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

You can also match against a set of values with the %in% operator, which returns TRUE for each element of the first vector that appears in the second. Here we keep mammals in either the Perissodactyla or Primates taxonomic order:

filter(msleep, order %in% c("Perissodactyla", "Primates"))
# A tibble: 15 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 2 Grivet Cerc… omni  Prim… lc                  10         0.7      NA      14  
 3 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
 4 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA      20.9
 5 Patas… Eryt… omni  Prim… lc                  10.9       1.1      NA      13.1
 6 Galago Gala… omni  Prim… <NA>                 9.8       1.1       0.55   14.2
 7 Human  Homo  omni  Prim… <NA>                 8         1.9       1.5    16  
 8 Mongo… Lemur herbi Prim… vu                   9.5       0.9      NA      14.5
 9 Macaq… Maca… omni  Prim… <NA>                10.1       1.2       0.75   13.9
10 Slow … Nyct… carni Prim… <NA>                11        NA        NA      13  
11 Chimp… Pan   omni  Prim… <NA>                 9.7       1.4       1.42   14.3
12 Baboon Papio omni  Prim… <NA>                 9.4       1         0.667  14.6
13 Potto  Pero… omni  Prim… lc                  11        NA        NA      13  
14 Squir… Saim… omni  Prim… <NA>                 9.6       1.4      NA      14.4
15 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9    19.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
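
The %in% test used in the filter above can be tried on its own, outside filter(). This small sketch uses a made-up character vector, orders, to show the logical vector that %in% produces.

```r
# %in% tests each element of the left-hand vector for membership in the right-hand one
orders <- c("Primates", "Rodentia", "Perissodactyla", "Carnivora")  # illustrative values
keep <- orders %in% c("Perissodactyla", "Primates")

keep          # one logical value per element of `orders`
orders[keep]  # the elements that matched
```

Inside filter(), a logical vector like keep is what decides which rows stay.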

Any of the comparison operators (==, !=, >, <, >=, <=) or %in% can form the logical test. This is exactly how you would later keep only the genes passing a significance cutoff.
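
To make the genomics analogy concrete, here is a sketch on a hypothetical differential-expression table. The table de_results, its column names (gene, log2fc, padj), and every value in it are invented for illustration. The example also shows one behavior worth knowing: filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical differential-expression results; all values are invented
de_results <- tibble(
    gene   = c("geneA", "geneB", "geneC", "geneD", "geneE"),
    log2fc = c(2.5, -0.3, -1.8, 0.9, 3.1),
    padj   = c(0.001, 0.40, 0.02, 0.04, NA)
)

# Keep genes that are significant AND change at least two-fold in either direction.
# geneE has a missing padj, so its condition is NA and the row is dropped.
sig_genes <- de_results |>
    filter(padj < 0.05, abs(log2fc) >= 1)
sig_genes

# Conditions can also be combined with | (or); is.na() tests for missing values
up_or_untested <- de_results |>
    filter(log2fc > 2 | is.na(padj))
up_or_untested
```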

12.8 Chaining verbs with the pipe, |>

Real analyses string several of these operations together. The pipe operator, |>, lets you “pipe” the output of one function straight into the first argument of the next. There is nothing magical about it — but it lets you read a chain of operations from left to right, in the order they happen.

Tip: Read the pipe as “and then”

x |> f() |> g() means: take x, and then apply f, and then apply g. Without the pipe you’d either save each step in a temporary variable and pass it along, or nest the calls inside out — g(f(x)) — which is hard to read once the chain grows. The pipe keeps the steps in the order you think about them.
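
The three styles described above can be compared side by side on a small numeric vector. This sketch uses only base R; the native pipe requires R 4.1.0 or later.

```r
# Three ways to write the same computation on a small numeric vector
x <- c(1, 4, 9, 16)

# 1. Nested calls, read inside out
nested <- sum(sqrt(x))

# 2. A temporary variable for each step
roots <- sqrt(x)
stepwise <- sum(roots)

# 3. The native pipe, read left to right
piped <- x |> sqrt() |> sum()

identical(nested, piped)    # TRUE
identical(stepwise, piped)  # TRUE
```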

Here is the first select() example again, this time written the nested way:

head(select(msleep, name, sleep_total))
# A tibble: 6 × 2
  name                       sleep_total
  <chr>                            <dbl>
1 Cheetah                           12.1
2 Owl monkey                        17  
3 Mountain beaver                   14.4
4 Greater short-tailed shrew        14.9
5 Cow                                4  
6 Three-toed sloth                  14.4

With the pipe, you read the same thing top to bottom: take msleep, and then select two columns (name and sleep_total), and then show the head of the result.

msleep |>
    select(name, sleep_total) |>
    head()
# A tibble: 6 × 2
  name                       sleep_total
  <chr>                            <dbl>
1 Cheetah                           12.1
2 Owl monkey                        17  
3 Mountain beaver                   14.4
4 Greater short-tailed shrew        14.9
5 Cow                                4  
6 Three-toed sloth                  14.4

You’ll see how much this helps once we start combining many verbs. From here on, we’ll use the pipe throughout.

12.9 Re-ordering rows with arrange()

arrange() sorts the rows by one or more columns. To sort by the taxonomic order, name the column:

msleep |> arrange(order) |> head()
# A tibble: 6 × 11
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Tenrec  Tenr… omni  Afro… <NA>                15.6       2.3      NA       8.4
2 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
3 Roe de… Capr… herbi Arti… lc                   3        NA        NA      21  
4 Goat    Capri herbi Arti… lc                   5.3       0.6      NA      18.7
5 Giraffe Gira… herbi Arti… cd                   1.9       0.4      NA      22.1
6 Sheep   Ovis  herbi Arti… domesticated         3.8       0.6      NA      20.2
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Now let’s combine verbs: select three columns, sort the rows first by taxonomic order and then by sleep_total, and finally show the head of the result.

msleep |>
    select(name, order, sleep_total) |>
    arrange(order, sleep_total) |>
    head()
# A tibble: 6 × 3
  name     order        sleep_total
  <chr>    <chr>              <dbl>
1 Tenrec   Afrosoricida        15.6
2 Giraffe  Artiodactyla         1.9
3 Roe deer Artiodactyla         3  
4 Sheep    Artiodactyla         3.8
5 Cow      Artiodactyla         4  
6 Goat     Artiodactyla         5.3

The same chain, but instead of showing the head we filter to the mammals that sleep 16 or more hours:

msleep |>
    select(name, order, sleep_total) |>
    arrange(order, sleep_total) |>
    filter(sleep_total >= 16)
# A tibble: 8 × 3
  name                   order           sleep_total
  <chr>                  <chr>                 <dbl>
1 Big brown bat          Chiroptera             19.7
2 Little brown bat       Chiroptera             19.9
3 Long-nosed armadillo   Cingulata              17.4
4 Giant armadillo        Cingulata              18.1
5 North American Opossum Didelphimorphia        18  
6 Thick-tailed opposum   Didelphimorphia        19.4
7 Owl monkey             Primates               17  
8 Arctic ground squirrel Rodentia               16.6

To sort a column in descending order, wrap it in desc(). Here we sort by order ascending, then by sleep_total descending:

msleep |>
    select(name, order, sleep_total) |>
    arrange(order, desc(sleep_total)) |>
    filter(sleep_total >= 16)
# A tibble: 8 × 3
  name                   order           sleep_total
  <chr>                  <chr>                 <dbl>
1 Little brown bat       Chiroptera             19.9
2 Big brown bat          Chiroptera             19.7
3 Giant armadillo        Cingulata              18.1
4 Long-nosed armadillo   Cingulata              17.4
5 Thick-tailed opposum   Didelphimorphia        19.4
6 North American Opossum Didelphimorphia        18  
7 Owl monkey             Primates               17  
8 Arctic ground squirrel Rodentia               16.6
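
The same sorting carries over to ranking a results table. The sketch below uses a hypothetical table, de_results, whose column names and values are invented. It also illustrates a detail of arrange(): missing values are placed at the end whether the sort is ascending or descending.

```r
library(dplyr)

# Hypothetical results table; values are invented for illustration
de_results <- tibble(
    gene   = c("geneA", "geneB", "geneC", "geneD"),
    log2fc = c(1.2, -2.4, NA, 3.0),
    padj   = c(0.03, 0.001, 0.20, 0.001)
)

# Smallest adjusted p-value first; ties broken by largest log2fc first
ranked <- de_results |>
    arrange(padj, desc(log2fc))
ranked

# The row with a missing log2fc (geneC) sorts last in both directions
by_fc_up   <- de_results |> arrange(log2fc)
by_fc_down <- de_results |> arrange(desc(log2fc))
by_fc_up
by_fc_down
```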

12.10 Creating new columns with mutate()

mutate() adds new columns computed from existing ones. Here we add a column rem_proportion, the ratio of REM sleep to total sleep:

msleep |>
    mutate(rem_proportion = sleep_rem / sleep_total) |>
    head()
# A tibble: 6 × 12
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
# ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>

You can add several columns at once, separated by commas. Here we add rem_proportion and also bodywt_grams, the body weight expressed in grams:

msleep |>
    mutate(rem_proportion = sleep_rem / sleep_total,
           bodywt_grams = bodywt * 1000) |>
    head()
# A tibble: 6 × 13
  name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
# ℹ 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
#   bodywt_grams <dbl>

Computing a derived column like this is the same move as turning raw counts into counts-per-million, or a fold change into a log fold change.
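
As a rough sketch of that analogy, the example below computes counts-per-million and a log2 fold change on a hypothetical per-gene table. The table counts, its column names, and all numbers are invented; real pipelines usually rely on dedicated packages for normalization, so treat this only as an illustration of mutate().

```r
library(dplyr)

# Hypothetical per-gene table for one sample; all numbers are invented
counts <- tibble(
    gene        = c("geneA", "geneB", "geneC", "geneD"),
    count       = c(500, 1500, 0, 8000),
    fold_change = c(2, 0.5, 1, 8)
)

derived <- counts |>
    mutate(
        # counts-per-million: scale each count by the sample's total
        cpm = count / sum(count) * 1e6,
        # log2 fold change: 2-fold up becomes 1, 2-fold down becomes -1
        log2_fc = log2(fold_change),
        # a column created earlier in the same mutate() can be reused
        is_up = log2_fc > 0
    )
derived
```

Note that sum(count) is computed over the whole column, while the division happens row by row; mutate() recycles the single total across all rows.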

12.11 Summarizing with summarise()

summarise() collapses a column down to a single summary statistic. To get the average amount of sleep, apply mean() to sleep_total and name the result avg_sleep:

msleep |>
    summarise(avg_sleep = mean(sleep_total))
# A tibble: 1 × 1
  avg_sleep
      <dbl>
1      10.4

Mammals in this dataset sleep about 10.4 hours a day, on average.

You can compute several summaries at once. Besides mean(), useful summary functions include sd(), min(), max(), median(), sum(), n() (the number of rows), first(), last(), and n_distinct() (the number of distinct values):

msleep |>
    summarise(avg_sleep = mean(sleep_total),
              min_sleep = min(sleep_total),
              max_sleep = max(sleep_total),
              total = n())
# A tibble: 1 × 4
  avg_sleep min_sleep max_sleep total
      <dbl>     <dbl>     <dbl> <int>
1      10.4       1.9      19.9    83

That single row tells you the average, the extremes, and the sample size all at once.
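
The summaries above work cleanly because sleep_total has no missing values. Columns such as sleep_rem do contain NA, as the earlier printouts show, and most summary functions then return NA unless told otherwise. A minimal sketch, assuming ggplot2 and dplyr are loaded:

```r
library(ggplot2)  # provides msleep
library(dplyr)

rem_summary <- msleep |>
    summarise(
        # sleep_rem has missing values, so a plain mean() returns NA
        avg_rem_raw = mean(sleep_rem),
        # na.rm = TRUE drops the missing values before averaging
        avg_rem     = mean(sleep_rem, na.rm = TRUE),
        # how many values were missing
        n_missing   = sum(is.na(sleep_rem)),
        # how many distinct taxonomic orders the table contains
        n_orders    = n_distinct(order)
    )
rem_summary
```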

12.12 Grouping with group_by()

group_by() is what makes summarise() powerful. On its own, summarise() collapses the whole table to one row. Combined with group_by(), it gives you one summary row per group — this is split-apply-combine in action.

Note: Split-apply-combine

Picture the table sorted into piles, one pile per group (here, one per taxonomic order). dplyr splits the data into those piles, applies your summary to each pile independently, and combines the per-pile answers into one tidy result table. It’s the same idea as base R’s aggregate(), but it fits naturally into a dplyr pipe.

Let’s split msleep by taxonomic order, then ask for the same summary statistics as before. We expect one row of statistics per order:

msleep |>
    group_by(order) |>
    summarise(avg_sleep = mean(sleep_total),
              min_sleep = min(sleep_total),
              max_sleep = max(sleep_total),
              total = n())
# A tibble: 19 × 5
   order           avg_sleep min_sleep max_sleep total
   <chr>               <dbl>     <dbl>     <dbl> <int>
 1 Afrosoricida        15.6       15.6      15.6     1
 2 Artiodactyla         4.52       1.9       9.1     6
 3 Carnivora           10.1        3.5      15.8    12
 4 Cetacea              4.5        2.7       5.6     3
 5 Chiroptera          19.8       19.7      19.9     2
 6 Cingulata           17.8       17.4      18.1     2
 7 Didelphimorphia     18.7       18        19.4     2
 8 Diprotodontia       12.4       11.1      13.7     2
 9 Erinaceomorpha      10.2       10.1      10.3     2
10 Hyracoidea           5.67       5.3       6.3     3
11 Lagomorpha           8.4        8.4       8.4     1
12 Monotremata          8.6        8.6       8.6     1
13 Perissodactyla       3.47       2.9       4.4     3
14 Pilosa              14.4       14.4      14.4     1
15 Primates            10.5        8        17      12
16 Proboscidea          3.6        3.3       3.9     2
17 Rodentia            12.5        7        16.6    22
18 Scandentia           8.9        8.9       8.9     1
19 Soricomorpha        11.1        8.4      14.9     5

Each row is one taxonomic order, with its average, minimum, and maximum sleep and the number of species. Swap “order” for “treatment condition” and “sleep” for “expression” and you have a standard genomics summary.
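
Here is a sketch of that swap on a hypothetical long-format expression table. The table expr, its column names, and all values are invented. It also groups by two columns at once, which gives one summary row per gene-by-condition combination.

```r
library(dplyr)

# Hypothetical long-format expression table; values are invented
expr <- tibble(
    gene       = rep(c("geneA", "geneB"), each = 4),
    condition  = rep(c("control", "control", "treated", "treated"), times = 2),
    expression = c(10, 12, 30, 34, 5, 7, 6, 4)
)

# One summary row per gene-by-condition group
per_group <- expr |>
    group_by(gene, condition) |>
    summarise(mean_expr = mean(expression),
              n_samples = n(),
              .groups = "drop")  # return an ungrouped tibble
per_group
```

Setting .groups = "drop" is a design choice here: it removes the grouping from the result so that later verbs operate on the whole table again.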

12.13 Exercises

Use the msleep dataset (with ggplot2 and dplyr loaded) for each of these. Try to write the pipe before opening the solution.

  1. Pick a verb. You want only the mammals that sleep fewer than 5 hours, showing just their name and sleep_total. Which two verbs do you need, and in what order? Write the pipe.

    msleep |>
        filter(sleep_total < 5) |>
        select(name, sleep_total)
    # A tibble: 11 × 2
       name             sleep_total
       <chr>                  <dbl>
     1 Cow                      4  
     2 Roe deer                 3  
     3 Asian elephant           3.9
     4 Horse                    2.9
     5 Donkey                   3.1
     6 Giraffe                  1.9
     7 Pilot whale              2.7
     8 African elephant         3.3
     9 Sheep                    3.8
    10 Caspian seal             3.5
    11 Brazilian tapir          4.4

    You need filter() to choose the rows and select() to choose the columns. Either order works here, but filtering first means select() operates on fewer rows — the habit that scales to large tables.

  2. Sleepiest carnivores. Keep only the carnivores (vore == "carni"), then sort them from most to least total sleep. Show name, vore, and sleep_total.

    msleep |>
        filter(vore == "carni") |>
        select(name, vore, sleep_total) |>
        arrange(desc(sleep_total))
    # A tibble: 19 × 3
       name                       vore  sleep_total
       <chr>                      <chr>       <dbl>
     1 Thick-tailed opposum       carni        19.4
     2 Long-nosed armadillo       carni        17.4
     3 Tiger                      carni        15.8
     4 Northern grasshopper mouse carni        14.5
     5 Lion                       carni        13.5
     6 Domestic cat               carni        12.5
     7 Arctic fox                 carni        12.5
     8 Cheetah                    carni        12.1
     9 Slow loris                 carni        11  
    10 Jaguar                     carni        10.4
    11 Dog                        carni        10.1
    12 Red fox                    carni         9.8
    13 Northern fur seal          carni         8.7
    14 Genet                      carni         6.3
    15 Gray seal                  carni         6.2
    16 Common porpoise            carni         5.6
    17 Bottle-nosed dolphin       carni         5.2
    18 Caspian seal               carni         3.5
    19 Pilot whale                carni         2.7

    filter() chooses the rows, select() trims the columns, and arrange(desc(...)) sorts largest-first. Reading the pipe top to bottom tells the whole story.

  3. A derived column. Is there a relationship between rem_proportion (REM sleep divided by total sleep) and body weight? Add a rem_proportion column, then sort the table so the mammals with the highest REM proportion come first, and look at their bodywt.

    msleep |>
        mutate(rem_proportion = sleep_rem / sleep_total) |>
        select(name, bodywt, rem_proportion) |>
        arrange(desc(rem_proportion)) |>
        head()
    # A tibble: 6 × 3
      name                   bodywt rem_proportion
      <chr>                   <dbl>          <dbl>
    1 European hedgehog       0.77           0.347
    2 Thick-tailed opposum    0.37           0.340
    3 Giant armadillo        60              0.337
    4 Tree shrew              0.104          0.292
    5 Dog                    14              0.287
    6 North American Opossum  1.7            0.272

    mutate() builds the new column, then arrange(desc(...)) ranks by it. The heaviest mammals are not at the top, hinting that large-bodied animals tend to spend a smaller fraction of their sleep in REM — a pattern worth a proper plot to confirm.

  4. Summarize by group. For each feeding type (vore), compute the average total sleep and the number of species. Which group sleeps the most on average?

    msleep |>
        group_by(vore) |>
        summarise(avg_sleep = mean(sleep_total),
                  n = n())
    # A tibble: 5 × 3
      vore    avg_sleep     n
      <chr>       <dbl> <int>
    1 carni       10.4     19
    2 herbi        9.51    32
    3 insecti     14.9      5
    4 omni        10.9     20
    5 <NA>        10.2      7

    This is the split-apply-combine pattern: split by vore, apply mean() and n(), combine into one row per group. Insectivores sleep the most on average (about 14.9 hours), though that group has only five species. The NA row collects mammals whose feeding type is unknown, a reminder to check for missing values in real data.
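
If the NA group in the last exercise is not wanted, one option is to filter it out before grouping. This is a sketch of one reasonable approach, not the only one; whether to drop or keep unknown categories depends on the question being asked.

```r
library(ggplot2)  # provides msleep
library(dplyr)

known_vore <- msleep |>
    filter(!is.na(vore)) |>            # drop mammals with unknown feeding type
    group_by(vore) |>
    summarise(avg_sleep = mean(sleep_total),
              n = n()) |>
    arrange(desc(avg_sleep))           # sleepiest group first
known_vore
```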

12.14 Summary

You now have the core dplyr toolkit, and every verb maps onto something you’ll do with biological tables:

  • select() keeps the columns you want (e.g. pulling sample_id and condition from a sample sheet).
  • filter() keeps the rows that pass a condition (e.g. genes below a p-value cutoff).
  • arrange() sorts rows, with desc() for largest-first.
  • mutate() adds derived columns (e.g. log fold changes).
  • summarise() collapses a column to a summary statistic.
  • group_by() turns summarise() into per-group summaries — the split-apply-combine pattern at the heart of so many analyses.
  • The pipe, |>, chains these verbs so you can read an analysis left to right, as a sequence of “and then” steps.

Master these six verbs on msleep and you can carry them straight over to count matrices, sample sheets, and results tables.
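
As a closing sketch, the pipe below uses all six verbs in one chain on msleep, assuming ggplot2 and dplyr are loaded. The cutoff of three species per order is an arbitrary choice made for illustration.

```r
library(ggplot2)  # provides msleep
library(dplyr)

order_summary <- msleep |>
    filter(!is.na(sleep_rem)) |>                          # rows with a REM measurement
    select(name, order, sleep_total, sleep_rem) |>        # only the columns we need
    mutate(rem_proportion = sleep_rem / sleep_total) |>   # derived column
    group_by(order) |>                                    # split by taxonomic order
    summarise(mean_rem_prop = mean(rem_proportion),       # apply and combine
              n = n()) |>
    filter(n >= 3) |>                                     # orders with at least 3 species
    arrange(desc(mean_rem_prop))                          # highest REM proportion first
order_summary
```

Read top to bottom, each line is one “and then” step, and each step could be swapped for its genomics counterpart without changing the shape of the pipe.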