install.packages('ggplot2')
11 Introduction to dplyr: mammal sleep dataset
The dataset we will be using to introduce the dplyr package is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep1.
1 A quantitative, theoretical framework for understanding mammalian sleep. Van M. Savage, Geoffrey B. West. Proceedings of the National Academy of Sciences Jan 2007, 104 (3) 1051-1056; DOI: 10.1073/pnas.0610080104
11.1 Learning goals
- Know that
dplyr
is just a different approach to manipulating data in data.frames. - List the commonly used
dplyr
verbs and how they can be used to manipulate data.frames. - Show how to aggregate and summarized data using
dplyr
- Know what the piping operator,
|>
, is and how it can be used.
11.2 Learning objectives
- Select subsets of the mammal sleep dataset.
- Reorder the dataset.
- Add columns to the dataset based on existing columns.
- Summarize the amount of sleep by categorical variables using
group_by
andsummarize
.
11.3 What is dplyr?
The dplyr package is a specialized package for working with data.frames
(and the related tibble
) to transform and summarize tabular data with rows and columns. For another explanation of dplyr see the dplyr package vignette: Introduction to dplyr
11.4 Why Is dplyr userful?
dplyr contains a set of functions–commonly called the dplyr “verbs”–that perform common data manipulations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. In addition, dplyr contains a useful function to perform another common task which is the “split-apply-combine” concept.
Compared to base functions in R, the functions in dplyr are often easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.
11.5 Data: Mammals Sleep
The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on github. This data set contains 83 rows and 11 variables. The data happen to be available as a data
set in the ggplot2 package. To get access to the msleep
dataset, we need to first install the ggplot2 package.
Then, we can load the library.
As with many datasets in R, “help” is available to describe the dataset itself.
?msleep
The columns are described in the help page, but are included here, also.
column name | Description |
---|---|
name | common name |
genus | taxonomic rank |
vore | carnivore, omnivore or herbivore? |
order | taxonomic rank |
conservation | the conservation status of the mammal |
sleep_total | total amount of sleep, in hours |
sleep_rem | rem sleep, in hours |
sleep_cycle | length of sleep cycle, in hours |
awake | amount of time spent awake, in hours |
brainwt | brain weight in kilograms |
bodywt | body weight in kilograms |
11.6 dplyr verbs
The dplyr verbs are listed here. There are many other functions available in dplyr, but we will focus on just these.
dplyr verbs | Description |
---|---|
select() |
select columns |
filter() |
filter rows |
arrange() |
re-order or arrange rows |
mutate() |
create new columns |
summarise() |
summarise values |
group_by() |
allows for group operations in the “split-apply-combine” concept |
11.7 Using the dplyr verbs
The two most basic functions are select()
and filter()
, which selects columns and filters rows respectively. What are the equivalent ways to select columns without dplyr? And filtering to include only specific rows?
Before proceeding, we need to install the dplyr package:
install.packages('dplyr')
And then load the library:
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
11.7.1 Selecting columns: select()
Select a set of columns such as the name
and the sleep_total
columns.
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing). For example, to select all columns except name
:
# A tibble: 6 × 10
genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Acinonyx carni Carnivo… lc 12.1 NA NA 11.9
2 Aotus omni Primates <NA> 17 1.8 NA 7
3 Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6
4 Blarina omni Soricom… lc 14.9 2.3 0.133 9.1
5 Bos herbi Artioda… domesticated 4 0.7 0.667 20
6 Bradypus herbi Pilosa <NA> 14.4 2.2 0.767 9.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
To select a range of columns by name, use the “:” operator. Note that dplyr allows us to use the column names without quotes and as “indices” of the columns.
# A tibble: 6 × 4
name genus vore order
<chr> <chr> <chr> <chr>
1 Cheetah Acinonyx carni Carnivora
2 Owl monkey Aotus omni Primates
3 Mountain beaver Aplodontia herbi Rodentia
4 Greater short-tailed shrew Blarina omni Soricomorpha
5 Cow Bos herbi Artiodactyla
6 Three-toed sloth Bradypus herbi Pilosa
To select all columns that start with the character string “sl”, use the function starts_with()
.
head(select(msleep, starts_with("sl")))
# A tibble: 6 × 3
sleep_total sleep_rem sleep_cycle
<dbl> <dbl> <dbl>
1 12.1 NA NA
2 17 1.8 NA
3 14.4 2.4 NA
4 14.9 2.3 0.133
5 4 0.7 0.667
6 14.4 2.2 0.767
Some additional options to select columns based on a specific criteria include:
-
ends_with()
= Select columns that end with a character string -
contains()
= Select columns that contain a character string -
matches()
= Select columns that match a regular expression -
one_of()
= Select column names that are from a group of names
11.7.2 Selecting rows: filter()
The filter()
function allows us to filter rows to include only those rows that match the filter. For example, we can filter the rows for mammals that sleep a total of more than 16 hours.
filter(msleep, sleep_total >= 16)
# A tibble: 8 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
2 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
3 North … Dide… omni Dide… lc 18 4.9 0.333 6
4 Big br… Epte… inse… Chir… lc 19.7 3.9 0.117 4.3
5 Thick-… Lutr… carni Dide… lc 19.4 6.6 NA 4.6
6 Little… Myot… inse… Chir… <NA> 19.9 2 0.2 4.1
7 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
8 Arctic… Sper… herbi Rode… lc 16.6 NA NA 7.4
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Filter the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram.
filter(msleep, sleep_total >= 16, bodywt >= 1)
# A tibble: 3 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
2 North … Dide… omni Dide… lc 18 4.9 0.333 6
3 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. The %in%
operator is a logical operator that returns TRUE
for values of a vector that are present in a second vector.
# A tibble: 15 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
2 Grivet Cerc… omni Prim… lc 10 0.7 NA 14
3 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
4 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
5 Patas… Eryt… omni Prim… lc 10.9 1.1 NA 13.1
6 Galago Gala… omni Prim… <NA> 9.8 1.1 0.55 14.2
7 Human Homo omni Prim… <NA> 8 1.9 1.5 16
8 Mongo… Lemur herbi Prim… vu 9.5 0.9 NA 14.5
9 Macaq… Maca… omni Prim… <NA> 10.1 1.2 0.75 13.9
10 Slow … Nyct… carni Prim… <NA> 11 NA NA 13
11 Chimp… Pan omni Prim… <NA> 9.7 1.4 1.42 14.3
12 Baboon Papio omni Prim… <NA> 9.4 1 0.667 14.6
13 Potto Pero… omni Prim… lc 11 NA NA 13
14 Squir… Saim… omni Prim… <NA> 9.6 1.4 NA 14.4
15 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
11.8 “Piping”” with |>
It is not unusual to want to perform a set of operations using dplyr. The pipe operator |>
allows us to “pipe” the output from one function into the input of the next. While there is nothing special about how R treats operations that are written in a pipe, the idea of piping is to allow us to read multiple functions operating one after another from left-to-right. Without piping, one would either 1) save each step in set of functions as a temporary variable and then pass that variable along the chain or 2) have to “nest” functions, which can be hard to read.
Here’s an example we have already used:
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
Now in this case, we will pipe the msleep data frame to the function that will select two columns (name
and sleep\_total
) and then pipe the new data frame to the function head()
, which will return the head of the new data frame.
# A tibble: 6 × 2
name sleep_total
<chr> <dbl>
1 Cheetah 12.1
2 Owl monkey 17
3 Mountain beaver 14.4
4 Greater short-tailed shrew 14.9
5 Cow 4
6 Three-toed sloth 14.4
You will soon see how useful the pipe operator is when we start to combine many functions.
Now that you know about the pipe operator (|>), we will use it throughout the rest of this tutorial.
11.8.1 Arrange Or Re-order Rows Using arrange()
To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
# A tibble: 6 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Tenrec Tenr… omni Afro… <NA> 15.6 2.3 NA 8.4
2 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
3 Roe de… Capr… herbi Arti… lc 3 NA NA 21
4 Goat Capri herbi Arti… lc 5.3 0.6 NA 18.7
5 Giraffe Gira… herbi Arti… cd 1.9 0.4 NA 22.1
6 Sheep Ovis herbi Arti… domesticated 3.8 0.6 NA 20.2
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep_total. Finally, show the head of the final data frame:
# A tibble: 6 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Tenrec Afrosoricida 15.6
2 Giraffe Artiodactyla 1.9
3 Roe deer Artiodactyla 3
4 Sheep Artiodactyla 3.8
5 Cow Artiodactyla 4
6 Goat Artiodactyla 5.3
Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
msleep |>
select(name, order, sleep_total) |>
arrange(order, sleep_total) |>
filter(sleep_total >= 16)
# A tibble: 8 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Big brown bat Chiroptera 19.7
2 Little brown bat Chiroptera 19.9
3 Long-nosed armadillo Cingulata 17.4
4 Giant armadillo Cingulata 18.1
5 North American Opossum Didelphimorphia 18
6 Thick-tailed opposum Didelphimorphia 19.4
7 Owl monkey Primates 17
8 Arctic ground squirrel Rodentia 16.6
For something slightly more complicated do the same as above, except arrange the rows in the sleep_total column in a descending order. For this, use the function desc()
msleep |>
select(name, order, sleep_total) |>
arrange(order, desc(sleep_total)) |>
filter(sleep_total >= 16)
# A tibble: 8 × 3
name order sleep_total
<chr> <chr> <dbl>
1 Little brown bat Chiroptera 19.9
2 Big brown bat Chiroptera 19.7
3 Giant armadillo Cingulata 18.1
4 Long-nosed armadillo Cingulata 17.4
5 Thick-tailed opposum Didelphimorphia 19.4
6 North American Opossum Didelphimorphia 18
7 Owl monkey Primates 17
8 Arctic ground squirrel Rodentia 16.6
11.9 Create New Columns Using mutate()
The mutate()
function will add new columns to the data frame. Create a new column called rem_proportion, which is the ratio of rem sleep to total amount of sleep.
# A tibble: 6 × 12
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
# ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>
You can add many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.
# A tibble: 6 × 13
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
# ℹ 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
# bodywt_grams <dbl>
Is there a relationship between rem_proportion
and bodywt
? How about sleep_total
?
11.9.1 Create summaries: summarise()
The summarise()
function will create summary statistics for a given column in the data frame such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean()
function to the column sleep_total and call the summary value avg_sleep.
There are many other summary statistics you could consider such sd()
, min()
, max()
, median()
, sum()
, n()
(returns the length of vector), first()
(returns first value in vector), last()
(returns last value in vector) and n_distinct()
(number of distinct values in vector).
11.10 Grouping data: group_by()
The group_by()
verb is an important function in dplyr. The group_by
allows us to use the concept of “split-apply-combine”. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output. This approach is similar to the aggregate
function from R, but group_by
integrates with dplyr.
Let’s do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
msleep |>
group_by(order) |>
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
# A tibble: 19 × 5
order avg_sleep min_sleep max_sleep total
<chr> <dbl> <dbl> <dbl> <int>
1 Afrosoricida 15.6 15.6 15.6 1
2 Artiodactyla 4.52 1.9 9.1 6
3 Carnivora 10.1 3.5 15.8 12
4 Cetacea 4.5 2.7 5.6 3
5 Chiroptera 19.8 19.7 19.9 2
6 Cingulata 17.8 17.4 18.1 2
7 Didelphimorphia 18.7 18 19.4 2
8 Diprotodontia 12.4 11.1 13.7 2
9 Erinaceomorpha 10.2 10.1 10.3 2
10 Hyracoidea 5.67 5.3 6.3 3
11 Lagomorpha 8.4 8.4 8.4 1
12 Monotremata 8.6 8.6 8.6 1
13 Perissodactyla 3.47 2.9 4.4 3
14 Pilosa 14.4 14.4 14.4 1
15 Primates 10.5 8 17 12
16 Proboscidea 3.6 3.3 3.9 2
17 Rodentia 12.5 7 16.6 22
18 Scandentia 8.9 8.9 8.9 1
19 Soricomorpha 11.1 8.4 14.9 5