The dataset we will be using to introduce the dplyr package is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep1.
dplyr
is just a different approach to
manipulating data in data.frames.dplyr
verbs and how they can be
used to manipulate data.frames.dplyr
%>%
, is and how it
can be used.group_by
and summarize
.The dplyr package is a specialized package for working with
data.frames
(and the related tibble
) to
transform and summarize tabular data with rows and columns. For another
explanation of dplyr see the dplyr package vignette: Introduction
to dplyr
dplyr contains a set of functions–commonly called the dplyr “verbs”–that perform common data manipulations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data. In addition, dplyr contains a useful function to perform another common task which is the “split-apply-combine” concept.
Compared to base functions in R, the functions in dplyr are often easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.
The msleep (mammals sleep) data set contains the sleep times and
weights for a set of mammals and is available in the dagdata repository
on github. This data set contains 83 rows and 11 variables. The data
happen to be available as a data
set in the ggplot2
package. To get access to the msleep
dataset, we need to
first install the ggplot2 package.
install.packages('ggplot2')
Then, we can load the library.
library(ggplot2)
data(msleep)
As with many datasets in R, “help” is available to describe the dataset itself.
?msleep
The columns are described in the help page, but are included here, also.
column name | Description |
---|---|
name | common name |
genus | taxonomic rank |
vore | carnivore, omnivore or herbivore? |
order | taxonomic rank |
conservation | the conservation status of the mammal |
sleep_total | total amount of sleep, in hours |
sleep_rem | rem sleep, in hours |
sleep_cycle | length of sleep cycle, in hours |
awake | amount of time spent awake, in hours |
brainwt | brain weight in kilograms |
bodywt | body weight in kilograms |
The dplyr verbs are listed here. There are many other functions available in dplyr, but we will focus on just these.
dplyr verbs | Description |
---|---|
select() |
select columns |
filter() |
filter rows |
arrange() |
re-order or arrange rows |
mutate() |
create new columns |
summarise() |
summarise values |
group_by() |
allows for group operations in the “split-apply-combine” concept |
The two most basic functions are select()
and
filter()
, which selects columns and filters rows
respectively. What are the equivalent ways to select columns without
dplyr? And filtering to include only specific rows?
Before proceeding, we need to install the dplyr package:
install.packages('dplyr')
And then load the library:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
select()
Select a set of columns such as the name
and the
sleep_total
columns.
sleepData <- select(msleep, name, sleep_total)
head(sleepData)
## # A tibble: 6 × 2
## name sleep_total
## <chr> <dbl>
## 1 Cheetah 12.1
## 2 Owl monkey 17
## 3 Mountain beaver 14.4
## 4 Greater short-tailed shrew 14.9
## 5 Cow 4
## 6 Three-toed sloth 14.4
To select all the columns except a specific column, use the
“-” (subtraction) operator (also known as negative indexing). For
example, to select all columns except name
:
head(select(msleep, -name))
## # A tibble: 6 × 10
## genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Acinonyx carni Carnivo… lc 12.1 NA NA 11.9
## 2 Aotus omni Primates <NA> 17 1.8 NA 7
## 3 Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6
## 4 Blarina omni Soricom… lc 14.9 2.3 0.133 9.1
## 5 Bos herbi Artioda… domesticated 4 0.7 0.667 20
## 6 Bradypus herbi Pilosa <NA> 14.4 2.2 0.767 9.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
To select a range of columns by name, use the “:” operator. Note that dplyr allows us to use the column names without quotes and as “indices” of the columns.
head(select(msleep, name:order))
## # A tibble: 6 × 4
## name genus vore order
## <chr> <chr> <chr> <chr>
## 1 Cheetah Acinonyx carni Carnivora
## 2 Owl monkey Aotus omni Primates
## 3 Mountain beaver Aplodontia herbi Rodentia
## 4 Greater short-tailed shrew Blarina omni Soricomorpha
## 5 Cow Bos herbi Artiodactyla
## 6 Three-toed sloth Bradypus herbi Pilosa
To select all columns that start with the character string “sl”, use
the function starts_with()
.
head(select(msleep, starts_with("sl")))
## # A tibble: 6 × 3
## sleep_total sleep_rem sleep_cycle
## <dbl> <dbl> <dbl>
## 1 12.1 NA NA
## 2 17 1.8 NA
## 3 14.4 2.4 NA
## 4 14.9 2.3 0.133
## 5 4 0.7 0.667
## 6 14.4 2.2 0.767
Some additional options to select columns based on a specific criteria include:
ends_with()
= Select columns that end with a character
stringcontains()
= Select columns that contain a character
stringmatches()
= Select columns that match a regular
expressionone_of()
= Select column names that are from a group of
namesfilter()
The filter()
function allows us to filter rows to
include only those rows that match the filter. For example, we
can filter the rows for mammals that sleep a total of more than 16
hours.
filter(msleep, sleep_total >= 16)
## # A tibble: 8 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
## 2 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
## 3 North … Dide… omni Dide… lc 18 4.9 0.333 6
## 4 Big br… Epte… inse… Chir… lc 19.7 3.9 0.117 4.3
## 5 Thick-… Lutr… carni Dide… lc 19.4 6.6 NA 4.6
## 6 Little… Myot… inse… Chir… <NA> 19.9 2 0.2 4.1
## 7 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
## 8 Arctic… Sper… herbi Rode… lc 16.6 NA NA 7.4
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
Filter the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram.
filter(msleep, sleep_total >= 16, bodywt >= 1)
## # A tibble: 3 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Long-n… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
## 2 North … Dide… omni Dide… lc 18 4.9 0.333 6
## 3 Giant … Prio… inse… Cing… en 18.1 6.1 NA 5.9
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
Filter the rows for mammals in the Perissodactyla and Primates
taxonomic order. The %in%
operator is a logical operator
that returns TRUE
for values of a vector that are present
in a second vector.
filter(msleep, order %in% c("Perissodactyla", "Primates"))
## # A tibble: 15 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
## 2 Grivet Cerc… omni Prim… lc 10 0.7 NA 14
## 3 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
## 4 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
## 5 Patas… Eryt… omni Prim… lc 10.9 1.1 NA 13.1
## 6 Galago Gala… omni Prim… <NA> 9.8 1.1 0.55 14.2
## 7 Human Homo omni Prim… <NA> 8 1.9 1.5 16
## 8 Mongo… Lemur herbi Prim… vu 9.5 0.9 NA 14.5
## 9 Macaq… Maca… omni Prim… <NA> 10.1 1.2 0.75 13.9
## 10 Slow … Nyct… carni Prim… <NA> 11 NA NA 13
## 11 Chimp… Pan omni Prim… <NA> 9.7 1.4 1.42 14.3
## 12 Baboon Papio omni Prim… <NA> 9.4 1 0.667 14.6
## 13 Potto Pero… omni Prim… lc 11 NA NA 13
## 14 Squir… Saim… omni Prim… <NA> 9.6 1.4 NA 14.4
## 15 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
%>%
It is not unusual to want to perform a set of operations using dplyr.
The pipe operator %>%
allows us to “pipe” the output
from one function into the input of the next. While there is nothing
special about how R treats operations that are written in a pipe, the
idea of piping is to allow us to read multiple functions operating one
after another from left-to-right. Without piping, one would either 1)
save each step in set of functions as a temporary variable and then pass
that variable along the chain or 2) have to “nest” functions, which can
be hard to read.
Here’s an example we have already used:
head(select(msleep, name, sleep_total))
## # A tibble: 6 × 2
## name sleep_total
## <chr> <dbl>
## 1 Cheetah 12.1
## 2 Owl monkey 17
## 3 Mountain beaver 14.4
## 4 Greater short-tailed shrew 14.9
## 5 Cow 4
## 6 Three-toed sloth 14.4
Now in this case, we will pipe the msleep data frame to the function
that will select two columns (name
and
sleep\_total
) and then pipe the new data frame to the
function head()
, which will return the head of the new data
frame.
msleep %>%
select(name, sleep_total) %>%
head()
## # A tibble: 6 × 2
## name sleep_total
## <chr> <dbl>
## 1 Cheetah 12.1
## 2 Owl monkey 17
## 3 Mountain beaver 14.4
## 4 Greater short-tailed shrew 14.9
## 5 Cow 4
## 6 Three-toed sloth 14.4
You will soon see how useful the pipe operator is when we start to combine many functions.
Now that you know about the pipe operator (%>%), we will use it throughout the rest of this tutorial.
arrange()
To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
msleep %>% arrange(order) %>% head()
## # A tibble: 6 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Tenrec Tenr… omni Afro… <NA> 15.6 2.3 NA 8.4
## 2 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 3 Roe de… Capr… herbi Arti… lc 3 NA NA 21
## 4 Goat Capri herbi Arti… lc 5.3 0.6 NA 18.7
## 5 Giraffe Gira… herbi Arti… cd 1.9 0.4 NA 22.1
## 6 Sheep Ovis herbi Arti… domesticated 3.8 0.6 NA 20.2
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep_total. Finally, show the head of the final data frame:
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
head()
## # A tibble: 6 × 3
## name order sleep_total
## <chr> <chr> <dbl>
## 1 Tenrec Afrosoricida 15.6
## 2 Giraffe Artiodactyla 1.9
## 3 Roe deer Artiodactyla 3
## 4 Sheep Artiodactyla 3.8
## 5 Cow Artiodactyla 4
## 6 Goat Artiodactyla 5.3
Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
filter(sleep_total >= 16)
## # A tibble: 8 × 3
## name order sleep_total
## <chr> <chr> <dbl>
## 1 Big brown bat Chiroptera 19.7
## 2 Little brown bat Chiroptera 19.9
## 3 Long-nosed armadillo Cingulata 17.4
## 4 Giant armadillo Cingulata 18.1
## 5 North American Opossum Didelphimorphia 18
## 6 Thick-tailed opposum Didelphimorphia 19.4
## 7 Owl monkey Primates 17
## 8 Arctic ground squirrel Rodentia 16.6
For something slightly more complicated do the same as above, except
arrange the rows in the sleep_total column in a descending order. For
this, use the function desc()
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, desc(sleep_total)) %>%
filter(sleep_total >= 16)
## # A tibble: 8 × 3
## name order sleep_total
## <chr> <chr> <dbl>
## 1 Little brown bat Chiroptera 19.9
## 2 Big brown bat Chiroptera 19.7
## 3 Giant armadillo Cingulata 18.1
## 4 Long-nosed armadillo Cingulata 17.4
## 5 Thick-tailed opposum Didelphimorphia 19.4
## 6 North American Opossum Didelphimorphia 18
## 7 Owl monkey Primates 17
## 8 Arctic ground squirrel Rodentia 16.6
mutate()
The mutate()
function will add new columns to the data
frame. Create a new column called rem_proportion, which is the ratio of
rem sleep to total amount of sleep.
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total) %>%
head()
## # A tibble: 6 × 12
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## # … with 3 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>
You can add many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total,
bodywt_grams = bodywt * 1000) %>%
head()
## # A tibble: 6 × 13
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## # … with 4 more variables: brainwt <dbl>, bodywt <dbl>, rem_proportion <dbl>,
## # bodywt_grams <dbl>
Is there a relationship between rem_proportion
and
bodywt
? How about sleep_total
?
summarise()
The summarise()
function will create summary statistics
for a given column in the data frame such as finding the mean. For
example, to compute the average number of hours of sleep, apply the
mean()
function to the column sleep_total and call the
summary value avg_sleep.
msleep %>%
summarise(avg_sleep = mean(sleep_total))
## # A tibble: 1 × 1
## avg_sleep
## <dbl>
## 1 10.4
There are many other summary statistics you could consider such
sd()
, min()
, max()
,
median()
, sum()
, n()
(returns the
length of vector), first()
(returns first value in vector),
last()
(returns last value in vector) and
n_distinct()
(number of distinct values in vector).
msleep %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
## # A tibble: 1 × 4
## avg_sleep min_sleep max_sleep total
## <dbl> <dbl> <dbl> <int>
## 1 10.4 1.9 19.9 83
group_by()
The group_by()
verb is an important function in dplyr.
The group_by
allows us to use the concept of
“split-apply-combine”. We literally want to split the data frame by some
variable (e.g. taxonomic order), apply a function to the individual data
frames and then combine the output. This approach is similar to the
aggregate
function from R, but group_by
integrates with dplyr.
Let’s do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
msleep %>%
group_by(order) %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
## # A tibble: 19 × 5
## order avg_sleep min_sleep max_sleep total
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Afrosoricida 15.6 15.6 15.6 1
## 2 Artiodactyla 4.52 1.9 9.1 6
## 3 Carnivora 10.1 3.5 15.8 12
## 4 Cetacea 4.5 2.7 5.6 3
## 5 Chiroptera 19.8 19.7 19.9 2
## 6 Cingulata 17.8 17.4 18.1 2
## 7 Didelphimorphia 18.7 18 19.4 2
## 8 Diprotodontia 12.4 11.1 13.7 2
## 9 Erinaceomorpha 10.2 10.1 10.3 2
## 10 Hyracoidea 5.67 5.3 6.3 3
## 11 Lagomorpha 8.4 8.4 8.4 1
## 12 Monotremata 8.6 8.6 8.6 1
## 13 Perissodactyla 3.47 2.9 4.4 3
## 14 Pilosa 14.4 14.4 14.4 1
## 15 Primates 10.5 8 17 12
## 16 Proboscidea 3.6 3.3 3.9 2
## 17 Rodentia 12.5 7 16.6 22
## 18 Scandentia 8.9 8.9 8.9 1
## 19 Soricomorpha 11.1 8.4 14.9 5
A quantitative, theoretical framework for understanding mammalian sleep. Van M. Savage, Geoffrey B. West. Proceedings of the National Academy of Sciences Jan 2007, 104 (3) 1051-1056; DOI: 10.1073/pnas.0610080104↩︎