The dataset we will be using to introduce the dplyr package is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep1.

Learning goals

  • Know that dplyr is just a different approach to manipulating data in data.frames.
  • List the commonly used dplyr verbs and how they can be used to manipulate data.frames.
  • Show how to aggregate and summarized data using dplyr
  • Know what the piping operator, %>%, is and how it can be used.

Learning objectives

  • Select subsets of the mammal sleep dataset.
  • Reorder the dataset.
  • Add columns to the dataset based on existing columns.
  • Summarize the amount of sleep by categorical variables using group_by and summarize.

What is dplyr?

The dplyr package is a specialized package for working with data.frames (and the related tibble) to transform and summarize tabular data with rows and columns. For another explanation of dplyr see the dplyr package vignette: Introduction to dplyr

Why Is dplyr userful?

dplyr contains a set of functions–commonly called the dplyr “verbs”–that perform common data manipulations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. In addition, dplyr contains a useful function to perform another common task which is the “split-apply-combine” concept.

Compared to base functions in R, the functions in dplyr are often easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.

Data: Mammals Sleep

The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on github. This data set contains 83 rows and 11 variables. The data happen to be available as a dataset in the ggplot2 package. To get access to the msleep dataset, we need to first install the ggplot2 package.

install.packages('ggplot2')

Then, we can load the library.

library(ggplot2)
data(msleep)

As with many datasets in R, “help” is available to describe the dataset itself.

?msleep

The columns are described in the help page, but are included here, also.

column name Description
name common name
genus taxonomic rank
vore carnivore, omnivore or herbivore?
order taxonomic rank
conservation the conservation status of the mammal
sleep_total total amount of sleep, in hours
sleep_rem rem sleep, in hours
sleep_cycle length of sleep cycle, in hours
awake amount of time spent awake, in hours
brainwt brain weight in kilograms
bodywt body weight in kilograms

dplyr verbs

The dplyr verbs are listed here. There are many other functions available in dplyr, but we will focus on just these.

dplyr verbs Description
select() select columns
filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns
summarise() summarise values
group_by() allows for group operations in the “split-apply-combine” concept

Using the dplyr verbs

The two most basic functions are select() and filter(), which selects columns and filters rows respectively. What are the equivalent ways to select columns without dplyr? And filtering to include only specific rows?

Before proceeding, we need to install the dplyr package:

install.packages('dplyr')

And then load the library:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Selecting columns: select()

Select a set of columns such as the name and the sleep_total columns.

sleepData <- select(msleep, name, sleep_total)
head(sleepData)
## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4

To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing). For example, to select all columns except name:

head(select(msleep, -name))
## # A tibble: 6 × 10
##   genus      vore  order    conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>      <chr> <chr>    <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Acinonyx   carni Carnivo… lc                  12.1      NA        NA      11.9
## 2 Aotus      omni  Primates <NA>                17         1.8      NA       7  
## 3 Aplodontia herbi Rodentia nt                  14.4       2.4      NA       9.6
## 4 Blarina    omni  Soricom… lc                  14.9       2.3       0.133   9.1
## 5 Bos        herbi Artioda… domesticated         4         0.7       0.667  20  
## 6 Bradypus   herbi Pilosa   <NA>                14.4       2.2       0.767   9.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

To select a range of columns by name, use the “:” operator. Note that dplyr allows us to use the column names without quotes and as “indices” of the columns.

head(select(msleep, name:order))
## # A tibble: 6 × 4
##   name                       genus      vore  order       
##   <chr>                      <chr>      <chr> <chr>       
## 1 Cheetah                    Acinonyx   carni Carnivora   
## 2 Owl monkey                 Aotus      omni  Primates    
## 3 Mountain beaver            Aplodontia herbi Rodentia    
## 4 Greater short-tailed shrew Blarina    omni  Soricomorpha
## 5 Cow                        Bos        herbi Artiodactyla
## 6 Three-toed sloth           Bradypus   herbi Pilosa

To select all columns that start with the character string “sl”, use the function starts_with().

head(select(msleep, starts_with("sl")))
## # A tibble: 6 × 3
##   sleep_total sleep_rem sleep_cycle
##         <dbl>     <dbl>       <dbl>
## 1        12.1      NA        NA    
## 2        17         1.8      NA    
## 3        14.4       2.4      NA    
## 4        14.9       2.3       0.133
## 5         4         0.7       0.667
## 6        14.4       2.2       0.767

Some additional options to select columns based on a specific criteria include:

  1. ends_with() = Select columns that end with a character string
  2. contains() = Select columns that contain a character string
  3. matches() = Select columns that match a regular expression
  4. one_of() = Select column names that are from a group of names

Selecting rows: filter()

The filter() function allows us to filter rows to include only those rows that match the filter. For example, we can filter the rows for mammals that sleep a total of more than 16 hours.

filter(msleep, sleep_total >= 16)
## # A tibble: 8 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 2 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
## 3 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
## 4 Big br… Epte… inse… Chir… lc                  19.7       3.9       0.117   4.3
## 5 Thick-… Lutr… carni Dide… lc                  19.4       6.6      NA       4.6
## 6 Little… Myot… inse… Chir… <NA>                19.9       2         0.2     4.1
## 7 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
## 8 Arctic… Sper… herbi Rode… lc                  16.6      NA        NA       7.4
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

Filter the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram.

filter(msleep, sleep_total >= 16, bodywt >= 1)
## # A tibble: 3 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
## 2 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
## 3 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. The %in% operator is a logical operator that returns TRUE for values of a vector that are present in a second vector.

filter(msleep, order %in% c("Perissodactyla", "Primates"))
## # A tibble: 15 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  2 Grivet Cerc… omni  Prim… lc                  10         0.7      NA      14  
##  3 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
##  4 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA      20.9
##  5 Patas… Eryt… omni  Prim… lc                  10.9       1.1      NA      13.1
##  6 Galago Gala… omni  Prim… <NA>                 9.8       1.1       0.55   14.2
##  7 Human  Homo  omni  Prim… <NA>                 8         1.9       1.5    16  
##  8 Mongo… Lemur herbi Prim… vu                   9.5       0.9      NA      14.5
##  9 Macaq… Maca… omni  Prim… <NA>                10.1       1.2       0.75   13.9
## 10 Slow … Nyct… carni Prim… <NA>                11        NA        NA      13  
## 11 Chimp… Pan   omni  Prim… <NA>                 9.7       1.4       1.42   14.3
## 12 Baboon Papio omni  Prim… <NA>                 9.4       1         0.667  14.6
## 13 Potto  Pero… omni  Prim… lc                  11        NA        NA      13  
## 14 Squir… Saim… omni  Prim… <NA>                 9.6       1.4      NA      14.4
## 15 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9    19.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.

“Piping”” with %>%

It is not unusual to want to perform a set of operations using dplyr. The pipe operator %>% allows us to “pipe” the output from one function into the input of the next. While there is nothing special about how R treats operations that are written in a pipe, the idea of piping is to allow us to read multiple functions operating one after another from left-to-right. Without piping, one would either 1) save each step in set of functions as a temporary variable and then pass that variable along the chain or 2) have to “nest” functions, which can be hard to read.

Here’s an example we have already used:

head(select(msleep, name, sleep_total))
## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4

Now in this case, we will pipe the msleep data frame to the function that will select two columns (name and sleep\_total) and then pipe the new data frame to the function head(), which will return the head of the new data frame.

msleep %>% 
    select(name, sleep_total) %>% 
    head()
## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4

You will soon see how useful the pipe operator is when we start to combine many functions.

Now that you know about the pipe operator (%>%), we will use it throughout the rest of this tutorial.

Arrange Or Re-order Rows Using arrange()

To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:

msleep %>% arrange(order) %>% head()
## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Tenrec  Tenr… omni  Afro… <NA>                15.6       2.3      NA       8.4
## 2 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 3 Roe de… Capr… herbi Arti… lc                   3        NA        NA      21  
## 4 Goat    Capri herbi Arti… lc                   5.3       0.6      NA      18.7
## 5 Giraffe Gira… herbi Arti… cd                   1.9       0.4      NA      22.1
## 6 Sheep   Ovis  herbi Arti… domesticated         3.8       0.6      NA      20.2
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep_total. Finally, show the head of the final data frame:

msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, sleep_total) %>% 
    head()
## # A tibble: 6 × 3
##   name     order        sleep_total
##   <chr>    <chr>              <dbl>
## 1 Tenrec   Afrosoricida        15.6
## 2 Giraffe  Artiodactyla         1.9
## 3 Roe deer Artiodactyla         3  
## 4 Sheep    Artiodactyla         3.8
## 5 Cow      Artiodactyla         4  
## 6 Goat     Artiodactyla         5.3

Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:

msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, sleep_total) %>% 
    filter(sleep_total >= 16)
## # A tibble: 8 × 3
##   name                   order           sleep_total
##   <chr>                  <chr>                 <dbl>
## 1 Big brown bat          Chiroptera             19.7
## 2 Little brown bat       Chiroptera             19.9
## 3 Long-nosed armadillo   Cingulata              17.4
## 4 Giant armadillo        Cingulata              18.1
## 5 North American Opossum Didelphimorphia        18  
## 6 Thick-tailed opposum   Didelphimorphia        19.4
## 7 Owl monkey             Primates               17  
## 8 Arctic ground squirrel Rodentia               16.6

For something slightly more complicated do the same as above, except arrange the rows in the sleep_total column in a descending order. For this, use the function desc()

msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, desc(sleep_total)) %>% 
    filter(sleep_total >= 16)
## # A tibble: 8 × 3
##   name                   order           sleep_total
##   <chr>                  <chr>                 <dbl>
## 1 Little brown bat       Chiroptera             19.9
## 2 Big brown bat          Chiroptera             19.7
## 3 Giant armadillo        Cingulata              18.1
## 4 Long-nosed armadillo   Cingulata              17.4
## 5 Thick-tailed opposum   Didelphimorphia        19.4
## 6 North American Opossum Didelphimorphia        18  
## 7 Owl monkey             Primates               17  
## 8 Arctic ground squirrel Rodentia               16.6

Create New Columns Using mutate()

The mutate() function will add new columns to the data frame. Create a new column called rem_proportion, which is the ratio of rem sleep to total amount of sleep.

msleep %>% 
    mutate(rem_proportion = sleep_rem / sleep_total) %>%
    head()
## # A tibble: 6 × 12
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # … with 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>

You can add many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.

msleep %>% 
    mutate(rem_proportion = sleep_rem / sleep_total, 
           bodywt_grams = bodywt * 1000) %>%
    head()
## # A tibble: 6 × 13
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # … with 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
## #   bodywt_grams <dbl>

Is there a relationship between rem_proportion and bodywt? How about sleep_total?

Create summaries: summarise()

The summarise() function will create summary statistics for a given column in the data frame such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean() function to the column sleep_total and call the summary value avg_sleep.

msleep %>% 
    summarise(avg_sleep = mean(sleep_total))
## # A tibble: 1 × 1
##   avg_sleep
##       <dbl>
## 1      10.4

There are many other summary statistics you could consider such sd(), min(), max(), median(), sum(), n() (returns the length of vector), first() (returns first value in vector), last() (returns last value in vector) and n_distinct() (number of distinct values in vector).

msleep %>% 
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total),
              max_sleep = max(sleep_total),
              total = n())
## # A tibble: 1 × 4
##   avg_sleep min_sleep max_sleep total
##       <dbl>     <dbl>     <dbl> <int>
## 1      10.4       1.9      19.9    83

Grouping data: group_by()

The group_by() verb is an important function in dplyr. The group_by allows us to use the concept of “split-apply-combine”. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output. This approach is similar to the aggregate function from R, but group_by integrates with dplyr.

Let’s do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.

msleep %>% 
    group_by(order) %>%
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total), 
              max_sleep = max(sleep_total),
              total = n())
## # A tibble: 19 × 5
##    order           avg_sleep min_sleep max_sleep total
##    <chr>               <dbl>     <dbl>     <dbl> <int>
##  1 Afrosoricida        15.6       15.6      15.6     1
##  2 Artiodactyla         4.52       1.9       9.1     6
##  3 Carnivora           10.1        3.5      15.8    12
##  4 Cetacea              4.5        2.7       5.6     3
##  5 Chiroptera          19.8       19.7      19.9     2
##  6 Cingulata           17.8       17.4      18.1     2
##  7 Didelphimorphia     18.7       18        19.4     2
##  8 Diprotodontia       12.4       11.1      13.7     2
##  9 Erinaceomorpha      10.2       10.1      10.3     2
## 10 Hyracoidea           5.67       5.3       6.3     3
## 11 Lagomorpha           8.4        8.4       8.4     1
## 12 Monotremata          8.6        8.6       8.6     1
## 13 Perissodactyla       3.47       2.9       4.4     3
## 14 Pilosa              14.4       14.4      14.4     1
## 15 Primates            10.5        8        17      12
## 16 Proboscidea          3.6        3.3       3.9     2
## 17 Rodentia            12.5        7        16.6    22
## 18 Scandentia           8.9        8.9       8.9     1
## 19 Soricomorpha        11.1        8.4      14.9     5

  1. A quantitative, theoretical framework for understanding mammalian sleep. Van M. Savage, Geoffrey B. West. Proceedings of the National Academy of Sciences Jan 2007, 104 (3) 1051-1056; DOI: 10.1073/pnas.0610080104↩︎