The dataset we will be using to introduce the dplyr package is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep1.
dplyr is just a different approach to
manipulating data in data.frames.dplyr verbs and how they can be
used to manipulate data.frames.dplyr%>%, is and how it
can be used.group_by and summarize.The dplyr package is a specialized package for working with
data.frames (and the related tibble) to
transform and summarize tabular data with rows and columns. For another
explanation of dplyr see the dplyr package vignette: Introduction
to dplyr
dplyr contains a set of functions–commonly called the dplyr “verbs”–that perform common data manipulations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. In addition, dplyr contains a useful function to perform another common task which is the “split-apply-combine” concept.
Compared to base functions in R, the functions in dplyr are often easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.
The msleep (mammals sleep) data set contains the sleep times and
weights for a set of mammals and is available in the dagdata repository
on github. This data set contains 83 rows and 11 variables. The data
happen to be available as a dataset in the ggplot2
package. To get access to the msleep dataset, we need to
first install the ggplot2 package.
install.packages('ggplot2')Then, we can load the library.
library(ggplot2)
data(msleep)As with many datasets in R, “help” is available to describe the dataset itself.
?msleepThe columns are described in the help page, but are included here, also.
| column name | Description | 
|---|---|
| name | common name | 
| genus | taxonomic rank | 
| vore | carnivore, omnivore or herbivore? | 
| order | taxonomic rank | 
| conservation | the conservation status of the mammal | 
| sleep_total | total amount of sleep, in hours | 
| sleep_rem | rem sleep, in hours | 
| sleep_cycle | length of sleep cycle, in hours | 
| awake | amount of time spent awake, in hours | 
| brainwt | brain weight in kilograms | 
| bodywt | body weight in kilograms | 
The dplyr verbs are listed here. There are many other functions available in dplyr, but we will focus on just these.
| dplyr verbs | Description | 
|---|---|
| select() | select columns | 
| filter() | filter rows | 
| arrange() | re-order or arrange rows | 
| mutate() | create new columns | 
| summarise() | summarise values | 
| group_by() | allows for group operations in the “split-apply-combine” concept | 
The two most basic functions are select() and
filter(), which selects columns and filters rows
respectively. What are the equivalent ways to select columns without
dplyr? And filtering to include only specific rows?
Before proceeding, we need to install the dplyr package:
install.packages('dplyr')And then load the library:
library(dplyr)## 
## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':
## 
##     filter, lag## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, unionselect()Select a set of columns such as the name and the
sleep_total columns.
sleepData <- select(msleep, name, sleep_total)
head(sleepData)## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4To select all the columns except a specific column, use the
“-” (subtraction) operator (also known as negative indexing). For
example, to select all columns except name:
head(select(msleep, -name))## # A tibble: 6 × 10
##   genus      vore  order    conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>      <chr> <chr>    <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Acinonyx   carni Carnivo… lc                  12.1      NA        NA      11.9
## 2 Aotus      omni  Primates <NA>                17         1.8      NA       7  
## 3 Aplodontia herbi Rodentia nt                  14.4       2.4      NA       9.6
## 4 Blarina    omni  Soricom… lc                  14.9       2.3       0.133   9.1
## 5 Bos        herbi Artioda… domesticated         4         0.7       0.667  20  
## 6 Bradypus   herbi Pilosa   <NA>                14.4       2.2       0.767   9.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>To select a range of columns by name, use the “:” operator. Note that dplyr allows us to use the column names without quotes and as “indices” of the columns.
head(select(msleep, name:order))## # A tibble: 6 × 4
##   name                       genus      vore  order       
##   <chr>                      <chr>      <chr> <chr>       
## 1 Cheetah                    Acinonyx   carni Carnivora   
## 2 Owl monkey                 Aotus      omni  Primates    
## 3 Mountain beaver            Aplodontia herbi Rodentia    
## 4 Greater short-tailed shrew Blarina    omni  Soricomorpha
## 5 Cow                        Bos        herbi Artiodactyla
## 6 Three-toed sloth           Bradypus   herbi PilosaTo select all columns that start with the character string “sl”, use
the function starts_with().
head(select(msleep, starts_with("sl")))## # A tibble: 6 × 3
##   sleep_total sleep_rem sleep_cycle
##         <dbl>     <dbl>       <dbl>
## 1        12.1      NA        NA    
## 2        17         1.8      NA    
## 3        14.4       2.4      NA    
## 4        14.9       2.3       0.133
## 5         4         0.7       0.667
## 6        14.4       2.2       0.767Some additional options to select columns based on a specific criteria include:
ends_with() = Select columns that end with a character
stringcontains() = Select columns that contain a character
stringmatches() = Select columns that match a regular
expressionone_of() = Select column names that are from a group of
namesfilter()The filter() function allows us to filter rows to
include only those rows that match the filter. For example, we
can filter the rows for mammals that sleep a total of more than 16
hours.
filter(msleep, sleep_total >= 16)## # A tibble: 8 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 2 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
## 3 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
## 4 Big br… Epte… inse… Chir… lc                  19.7       3.9       0.117   4.3
## 5 Thick-… Lutr… carni Dide… lc                  19.4       6.6      NA       4.6
## 6 Little… Myot… inse… Chir… <NA>                19.9       2         0.2     4.1
## 7 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
## 8 Arctic… Sper… herbi Rode… lc                  16.6      NA        NA       7.4
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>Filter the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram.
filter(msleep, sleep_total >= 16, bodywt >= 1)## # A tibble: 3 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Long-n… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
## 2 North … Dide… omni  Dide… lc                  18         4.9       0.333   6  
## 3 Giant … Prio… inse… Cing… en                  18.1       6.1      NA       5.9
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>Filter the rows for mammals in the Perissodactyla and Primates
taxonomic order. The %in% operator is a logical operator
that returns TRUE for values of a vector that are present
in a second vector.
filter(msleep, order %in% c("Perissodactyla", "Primates"))## # A tibble: 15 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  2 Grivet Cerc… omni  Prim… lc                  10         0.7      NA      14  
##  3 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
##  4 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA      20.9
##  5 Patas… Eryt… omni  Prim… lc                  10.9       1.1      NA      13.1
##  6 Galago Gala… omni  Prim… <NA>                 9.8       1.1       0.55   14.2
##  7 Human  Homo  omni  Prim… <NA>                 8         1.9       1.5    16  
##  8 Mongo… Lemur herbi Prim… vu                   9.5       0.9      NA      14.5
##  9 Macaq… Maca… omni  Prim… <NA>                10.1       1.2       0.75   13.9
## 10 Slow … Nyct… carni Prim… <NA>                11        NA        NA      13  
## 11 Chimp… Pan   omni  Prim… <NA>                 9.7       1.4       1.42   14.3
## 12 Baboon Papio omni  Prim… <NA>                 9.4       1         0.667  14.6
## 13 Potto  Pero… omni  Prim… lc                  11        NA        NA      13  
## 14 Squir… Saim… omni  Prim… <NA>                 9.6       1.4      NA      14.4
## 15 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9    19.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
%>%It is not unusual to want to perform a set of operations using dplyr.
The pipe operator %>% allows us to “pipe” the output
from one function into the input of the next. While there is nothing
special about how R treats operations that are written in a pipe, the
idea of piping is to allow us to read multiple functions operating one
after another from left-to-right. Without piping, one would either 1)
save each step in set of functions as a temporary variable and then pass
that variable along the chain or 2) have to “nest” functions, which can
be hard to read.
Here’s an example we have already used:
head(select(msleep, name, sleep_total))## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4Now in this case, we will pipe the msleep data frame to the function
that will select two columns (name and
sleep\_total) and then pipe the new data frame to the
function head(), which will return the head of the new data
frame.
msleep %>% 
    select(name, sleep_total) %>% 
    head()## # A tibble: 6 × 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Cheetah                           12.1
## 2 Owl monkey                        17  
## 3 Mountain beaver                   14.4
## 4 Greater short-tailed shrew        14.9
## 5 Cow                                4  
## 6 Three-toed sloth                  14.4You will soon see how useful the pipe operator is when we start to combine many functions.
Now that you know about the pipe operator (%>%), we will use it throughout the rest of this tutorial.
arrange()To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
msleep %>% arrange(order) %>% head()## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Tenrec  Tenr… omni  Afro… <NA>                15.6       2.3      NA       8.4
## 2 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 3 Roe de… Capr… herbi Arti… lc                   3        NA        NA      21  
## 4 Goat    Capri herbi Arti… lc                   5.3       0.6      NA      18.7
## 5 Giraffe Gira… herbi Arti… cd                   1.9       0.4      NA      22.1
## 6 Sheep   Ovis  herbi Arti… domesticated         3.8       0.6      NA      20.2
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep_total. Finally, show the head of the final data frame:
msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, sleep_total) %>% 
    head()## # A tibble: 6 × 3
##   name     order        sleep_total
##   <chr>    <chr>              <dbl>
## 1 Tenrec   Afrosoricida        15.6
## 2 Giraffe  Artiodactyla         1.9
## 3 Roe deer Artiodactyla         3  
## 4 Sheep    Artiodactyla         3.8
## 5 Cow      Artiodactyla         4  
## 6 Goat     Artiodactyla         5.3Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, sleep_total) %>% 
    filter(sleep_total >= 16)## # A tibble: 8 × 3
##   name                   order           sleep_total
##   <chr>                  <chr>                 <dbl>
## 1 Big brown bat          Chiroptera             19.7
## 2 Little brown bat       Chiroptera             19.9
## 3 Long-nosed armadillo   Cingulata              17.4
## 4 Giant armadillo        Cingulata              18.1
## 5 North American Opossum Didelphimorphia        18  
## 6 Thick-tailed opposum   Didelphimorphia        19.4
## 7 Owl monkey             Primates               17  
## 8 Arctic ground squirrel Rodentia               16.6For something slightly more complicated do the same as above, except
arrange the rows in the sleep_total column in a descending order. For
this, use the function desc()
msleep %>% 
    select(name, order, sleep_total) %>%
    arrange(order, desc(sleep_total)) %>% 
    filter(sleep_total >= 16)## # A tibble: 8 × 3
##   name                   order           sleep_total
##   <chr>                  <chr>                 <dbl>
## 1 Little brown bat       Chiroptera             19.9
## 2 Big brown bat          Chiroptera             19.7
## 3 Giant armadillo        Cingulata              18.1
## 4 Long-nosed armadillo   Cingulata              17.4
## 5 Thick-tailed opposum   Didelphimorphia        19.4
## 6 North American Opossum Didelphimorphia        18  
## 7 Owl monkey             Primates               17  
## 8 Arctic ground squirrel Rodentia               16.6mutate()The mutate() function will add new columns to the data
frame. Create a new column called rem_proportion, which is the ratio of
rem sleep to total amount of sleep.
msleep %>% 
    mutate(rem_proportion = sleep_rem / sleep_total) %>%
    head()## # A tibble: 6 × 12
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # … with 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>You can add many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.
msleep %>% 
    mutate(rem_proportion = sleep_rem / sleep_total, 
           bodywt_grams = bodywt * 1000) %>%
    head()## # A tibble: 6 × 13
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # … with 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
## #   bodywt_grams <dbl>Is there a relationship between rem_proportion and
bodywt? How about sleep_total?
summarise()The summarise() function will create summary statistics
for a given column in the data frame such as finding the mean. For
example, to compute the average number of hours of sleep, apply the
mean() function to the column sleep_total and call the
summary value avg_sleep.
msleep %>% 
    summarise(avg_sleep = mean(sleep_total))## # A tibble: 1 × 1
##   avg_sleep
##       <dbl>
## 1      10.4There are many other summary statistics you could consider such
sd(), min(), max(),
median(), sum(), n() (returns the
length of vector), first() (returns first value in vector),
last() (returns last value in vector) and
n_distinct() (number of distinct values in vector).
msleep %>% 
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total),
              max_sleep = max(sleep_total),
              total = n())## # A tibble: 1 × 4
##   avg_sleep min_sleep max_sleep total
##       <dbl>     <dbl>     <dbl> <int>
## 1      10.4       1.9      19.9    83group_by()The group_by() verb is an important function in dplyr.
The group_by allows us to use the concept of
“split-apply-combine”. We literally want to split the data frame by some
variable (e.g. taxonomic order), apply a function to the individual data
frames and then combine the output. This approach is similar to the
aggregate function from R, but group_by
integrates with dplyr.
Let’s do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
msleep %>% 
    group_by(order) %>%
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total), 
              max_sleep = max(sleep_total),
              total = n())## # A tibble: 19 × 5
##    order           avg_sleep min_sleep max_sleep total
##    <chr>               <dbl>     <dbl>     <dbl> <int>
##  1 Afrosoricida        15.6       15.6      15.6     1
##  2 Artiodactyla         4.52       1.9       9.1     6
##  3 Carnivora           10.1        3.5      15.8    12
##  4 Cetacea              4.5        2.7       5.6     3
##  5 Chiroptera          19.8       19.7      19.9     2
##  6 Cingulata           17.8       17.4      18.1     2
##  7 Didelphimorphia     18.7       18        19.4     2
##  8 Diprotodontia       12.4       11.1      13.7     2
##  9 Erinaceomorpha      10.2       10.1      10.3     2
## 10 Hyracoidea           5.67       5.3       6.3     3
## 11 Lagomorpha           8.4        8.4       8.4     1
## 12 Monotremata          8.6        8.6       8.6     1
## 13 Perissodactyla       3.47       2.9       4.4     3
## 14 Pilosa              14.4       14.4      14.4     1
## 15 Primates            10.5        8        17      12
## 16 Proboscidea          3.6        3.3       3.9     2
## 17 Rodentia            12.5        7        16.6    22
## 18 Scandentia           8.9        8.9       8.9     1
## 19 Soricomorpha        11.1        8.4      14.9     5A quantitative, theoretical framework for understanding mammalian sleep. Van M. Savage, Geoffrey B. West. Proceedings of the National Academy of Sciences Jan 2007, 104 (3) 1051-1056; DOI: 10.1073/pnas.0610080104↩︎