14 Exploring data with ggplot2
A plot answers questions a table of numbers cannot. Does a cost rise with age? Do two groups behave differently? Are there outliers worth a second look? ggplot2 is the package most R users reach for to ask those questions, and it does so with a small, consistent vocabulary you can combine to build almost any graphic. In this chapter you’ll start from a blank canvas and, one layer at a time, build up a single plot that shows four variables at once.
This chapter is based on the Intro to ggplot2 chapter from the book Modern Data Visualization with R by Robert Kabacoff, which is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. The original chapter has been modified to fit the context of this book.
The insurance dataset is described in the book Machine Learning with R by Brett Lantz. A cleaned version of the dataset is also available on Kaggle. The dataset describes medical information and costs billed by health insurance companies in 2013, as compiled by the United States Census Bureau. Variables include age, sex, body mass index, number of children covered by health insurance, smoker status, US region, and individual medical costs billed by health insurance for 1338 individuals.
14.1 What you’ll learn
- Initialize a plot with ggplot() and tell it which data frame to use.
- Add geometric layers (points, lines) with geom_*() functions.
- Map variables to aesthetics such as the x-axis, y-axis, and colour, and use colour to group observations.
- Adjust how variables are displayed with scale_*() functions.
- Split a plot into small multiples with facet_wrap().
- Add informative labels and titles with labs(), and restyle the plot with a theme_*() function.
ggplot2 is built on an idea called the grammar of graphics: a plot is assembled from independent pieces you stack with +. The pieces are always the same — data (the data frame), aesthetics (which variable maps to x, to y, to colour, …), geoms (the shapes that get drawn: points, lines, bars), and optional scales, facets, labels, and themes. You don’t memorize a separate function for every chart type; you learn this handful of building blocks and recombine them. That is why the same code patterns produce a scatterplot here and, with a different data frame, a plot of gene-expression values.
To get started, you need to install and load the ggplot2 package. If you haven’t installed it yet, you can do so with:
install.packages("ggplot2")
Once installed, load the package:
library(ggplot2)
Next, read the insurance dataset into R. We’ll use a convenient online version and the read.csv() function to load it:
# load the data
url <- "https://tinyurl.com/mtktm8e5"
insurance <- read.csv(url)
In RStudio you can use the View() function to inspect the dataset in a spreadsheet-style window. We don’t run it here because it only works in interactive RStudio, not when the book is rendered:
# view the dataset
View(insurance)
Next, we’ll add a variable indicating whether a patient is obese. We’ll define obesity as a body mass index greater than or equal to 30:
# create an obesity variable
insurance$obese <- ifelse(insurance$bmi >= 30,
                          "obese", "not obese")
When you build a ggplot2 graph, only the first two pieces below, ggplot() and at least one geom_*(), are required. The rest are optional and can appear in any order.
14.2 ggplot()
The ggplot() function initializes the plot. It takes a data frame as its first argument and can also include aesthetic mappings (aes()) that define how variables in the data map to visual properties such as the x and y axes, colour, and size.
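The call behind the empty canvas might look like the sketch below. To keep the snippet runnable on its own it uses a small made-up data frame, toy, as a stand-in; with the chapter's data you would pass insurance instead. The values in toy are illustrative only and are not taken from the dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the insurance data frame (made-up values)
toy <- data.frame(
  age      = c(19, 27, 35, 46, 58, 64),
  expenses = c(1800, 3200, 5100, 8300, 11900, 14500)
)

# Set up the canvas: data plus aesthetic mappings, but no geom yet
p <- ggplot(data = toy,
            mapping = aes(x = age, y = expenses))

# Count the layers attached to the plot object so far
n_layers <- length(p$layers)
n_layers
```

Printing p would draw the axes for age and expenses and nothing else, because the plot object carries no layers until a geom is added.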
Why is Figure 14.1 showing an empty panel? Because we haven’t added any layers yet. The ggplot() function only sets up the canvas; it doesn’t draw anything until we add a geom. We told it to map age to the x-axis and expenses to the y-axis, but we haven’t yet said what shape to put on the graph.
14.3 geom_*()
The geom_*() functions add layers to the plot. Each one corresponds to a specific geometric object: geom_point() adds points, geom_line() adds lines, geom_bar() adds bars, and so on. Let’s add points.
# add points to the plot
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point()
In Figure 14.2 we added points with geom_point(). The + operator stacks layers onto the plot, and the aes() call passed to the mapping argument specifies how variables map to visual properties. You can already see that insurance expenses tend to rise with age, though there is a lot of variability.
A geom can take parameters (options) of its own. For geom_point(), the common ones are color, size, and alpha. These control point colour, point size, and transparency. Transparency ranges from 0 (completely transparent) to 1 (completely opaque).
# make points blue, larger, and semi-transparent
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 2)
alpha to see through crowded points
When many points land on top of one another — common with hundreds or thousands of observations — a solid plot hides how the data pile up. Setting alpha below 1 makes each point partly transparent, so dense regions render darker and sparse ones lighter. It’s the cheapest fix for an overplotted scatterplot, and it works the same way whether your points are patients or single cells.
Next, let’s layer on a line of best fit — in other words, a regression fit. We do that with geom_smooth(). Its options control the type of line (linear, quadratic, nonparametric), the line’s thickness and colour, and whether a confidence band is shown. Here we ask for a linear regression line with method = "lm" (where lm stands for linear model).
# add a line of best fit
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 2) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
In Figure 14.4 we added a fitted line with geom_smooth(method = "lm"). The method argument chooses the type of fit; here a straight line summarizes the upward trend, and the shaded band around it is the confidence interval for the fitted line (95% by default).
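To see what method = "lm" is doing, it can help to fit the same straight line directly with lm() and compare. The sketch below does that on a small made-up data frame (toy is a hypothetical stand-in, not the insurance data) and uses layer_data() to pull out the values ggplot2 computed for the smooth layer.

```r
library(ggplot2)

# Hypothetical stand-in for the insurance data (made-up values)
toy <- data.frame(
  age      = c(19, 27, 35, 46, 58, 64),
  expenses = c(1800, 3500, 4900, 8300, 11200, 14500)
)

# Fit the straight line directly with lm()
fit <- lm(expenses ~ age, data = toy)
coef(fit)  # intercept and slope of the fitted line

# The same kind of line, drawn by ggplot2
p <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point() +
  geom_smooth(method = "lm")

# Values ggplot2 computed for the smooth layer (layer 2 of the plot)
smooth <- layer_data(p, 2)

# Predict from our own model at the same x positions and compare
lm_pred <- unname(predict(fit, newdata = data.frame(age = smooth$x)))
same_line <- isTRUE(all.equal(lm_pred, smooth$y))
same_line
```

If same_line is TRUE, the line ggplot2 draws is the ordinary least-squares fit of y on x, which is also what the message about formula = 'y ~ x' is reporting.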
14.4 Grouping
Beyond the x and y axes, you can map groups of observations to colour, shape, size, transparency, and other visual characteristics. This lets you superimpose several groups in a single graph. Let’s add smoker status and show it with colour.
# group points by smoker status
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses, color = smoker)) +
  geom_point(alpha = .7, size = 2) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
In Figure 14.5 we moved the color aesthetic into aes() and mapped it to the smoker variable. Because color is now part of the mapping, ggplot2 splits the data by smoker status and draws a separate set of points and a separate fitted line for each group. Setting se = FALSE drops the confidence bands so the two lines stay easy to compare. It probably comes as no surprise that smokers appear to incur greater expenses than non-smokers.
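The difference between setting a colour (a constant outside aes()) and mapping a colour (a variable inside aes()) can be checked on a toy example. The sketch below uses a made-up data frame; the column values, including the "yes"/"no" coding of smoker, are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the insurance data (made-up values)
toy <- data.frame(
  age      = c(19, 27, 35, 46, 58, 64),
  expenses = c(1800, 16500, 4900, 24300, 11200, 30500),
  smoker   = c("no", "yes", "no", "yes", "no", "yes")
)

# Setting: a constant outside aes() applies to every point
p_set <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue")

# Mapping: a variable inside aes() gets one colour per group, plus a legend
p_map <- ggplot(toy, aes(x = age, y = expenses, color = smoker)) +
  geom_point()

# Count the distinct colours each plot actually uses for its points
n_set <- length(unique(layer_data(p_set)$colour))
n_map <- length(unique(layer_data(p_map)$colour))
c(set = n_set, map = n_map)
```

The set version uses a single colour for all points, while the mapped version uses one colour per smoker group.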
14.5 Scales
Scales control how variables are translated into the visual characteristics of the plot. Scale functions (which start with scale_) let you adjust that translation. Next we’ll change the spacing of the x and y axis tick marks, format the y-axis as dollars, and pick our own colours.
# modify the x and y axes and specify the colours to be used
ggplot(data = insurance,
       mapping = aes(x = age,
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5,
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue"))
`geom_smooth()` using formula = 'y ~ x'
In Figure 14.6 we used scale_x_continuous() and scale_y_continuous() to set the axis tick marks. The breaks argument specifies where the ticks fall, and the labels argument formats the y-axis values as dollar amounts with scales::dollar. scale_color_manual() then assigns our chosen colours to the two smoker groups, in the order of the group levels. (Note that line thickness in geom_smooth() is set with linewidth, the current argument for the width of a line geom.)
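The two ingredients passed to scale_y_continuous() can be inspected on their own, outside any plot. This small sketch assumes only that the scales package is available (it is installed alongside ggplot2); the names y_breaks and y_labels are just illustrative.

```r
# Tick positions: from 0 to 60000 in steps of 20000
y_breaks <- seq(0, 60000, 20000)
y_breaks

# scales::dollar turns plain numbers into dollar-formatted text
y_labels <- scales::dollar(y_breaks)
y_labels
```

Running the pieces separately like this is a quick way to preview what an axis will show before building the full plot.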
14.6 Facets
Faceting splits a plot into small multiples — one panel per level of a categorical variable. This is handy for comparing relationships across groups. The facet_wrap() function does this; the ~obese formula tells it to make one panel for each value of the obese variable.
# reproduce the plot for obese and non-obese individuals
ggplot(data = insurance,
       mapping = aes(x = age,
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~obese)
`geom_smooth()` using formula = 'y ~ x'
In Figure 14.7, facet_wrap(~obese) produced separate panels for obese and non-obese individuals, letting us compare the age-versus-expenses relationship side by side. We have now packed four dimensions of data — age, smoking status, obesity, and annual expenses — into a single two-dimensional figure.
14.7 Labels and titles
Labels and titles make a plot self-explanatory. The labs() function sets the axis labels, the legend title, and the plot title, subtitle, and caption.
# add informative labels
ggplot(data = insurance,
       mapping = aes(x = age,
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~obese) +
  labs(title = "Relationship between patient demographics and medical costs",
       subtitle = "US Census Bureau 2013",
       caption = "source: http://mosaic-web.org/",
       x = "Age (years)",
       y = "Annual expenses",
       color = "Smoker?")
`geom_smooth()` using formula = 'y ~ x'
In Figure 14.8 we used labs() to add a title, subtitle, caption, and clearer axis and legend labels. A reader can now understand the figure without hunting for context elsewhere.
14.8 Theming
Finally, you can fine-tune the plot’s overall appearance with a theme. Theme functions (which start with theme_) control background colours, fonts, grid lines, legend placement, and other non-data features. Let’s switch to a cleaner look.
# use a minimalist theme
ggplot(data = insurance,
       mapping = aes(x = age,
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~obese) +
  labs(title = "Relationship between age and medical expenses",
       subtitle = "US Census Data 2013",
       caption = "source: https://github.com/dataspelunking/MLwR",
       x = "Age (years)",
       y = "Medical expenses",
       color = "Smoker?") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
14.9 Exercises
The insurance data is a convenient, non-biological playground, but every technique below is exactly what you’d use to plot a biological dataset: swap expenses for a gene’s expression level and region for a treatment group, and the rest of the code stays the same. Try these on the insurance data frame you loaded above.
1. A boxplot by group. Instead of a scatterplot, draw a boxplot of expenses split by smoker. Map smoker to the x-axis and expenses to the y-axis, and use geom_boxplot().
   Solution:
   ggplot(data = insurance,
          mapping = aes(x = smoker, y = expenses)) +
     geom_boxplot()
   A boxplot is just a different geom on the same aes() skeleton. The boxes make it obvious that smokers’ expenses are both higher and more spread out than non-smokers’.
2. Colour by region. Make a scatterplot of expenses against age and colour the points by region. Add alpha = .6 so crowded points stay readable.
   Solution:
   ggplot(data = insurance,
          mapping = aes(x = age, y = expenses, color = region)) +
     geom_point(alpha = .6)
   Mapping region to color inside aes() gives each of the four regions its own colour. The clouds overlap heavily, which tells you region alone doesn’t explain much of the variation in expenses.
3. Facet by sex. Take the age-versus-expenses scatterplot and split it into one panel per sex with facet_wrap().
   Solution:
   ggplot(data = insurance,
          mapping = aes(x = age, y = expenses)) +
     geom_point(alpha = .5) +
     facet_wrap(~sex)
   facet_wrap(~sex) draws the same plot once for each sex. The two panels look very similar, suggesting sex has little effect on the age-expenses relationship on its own.
4. Add a fitted line. Starting from the plot in Exercise 3, add a linear fit with geom_smooth(method = "lm") so each panel gets its own trend line.
   Solution:
   ggplot(data = insurance,
          mapping = aes(x = age, y = expenses)) +
     geom_point(alpha = .5) +
     geom_smooth(method = "lm") +
     facet_wrap(~sex)
   `geom_smooth()` using formula = 'y ~ x'
   Because faceting splits the data first, geom_smooth() fits a separate line within each panel. Both slopes point upward and look nearly parallel, reinforcing that the age trend holds for both groups.
14.10 Summary
You can now build a ggplot2 graphic from the ground up. You initialized a plot with ggplot(), drew data with geom_point() and geom_smooth(), mapped variables to aesthetics and used colour to separate groups, reshaped the display with scale_*() functions, split the plot into panels with facet_wrap(), labelled it with labs(), and restyled it with theme_minimal(). Each was an independent layer added with + — the grammar of graphics in action.
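Because every piece is added with +, a plot can also be stored in a variable and extended later instead of being retyped in full each time. The sketch below illustrates the idea on a made-up data frame; toy, base, and final are hypothetical names, and the values are not from the insurance data.

```r
library(ggplot2)

# Hypothetical stand-in for the insurance data (made-up values)
toy <- data.frame(
  age      = c(20, 35, 50, 22, 40, 60, 25, 45, 62, 30, 48, 64),
  expenses = c(2000, 4500, 8000, 2500, 5500, 10500,
               15000, 19000, 24000, 34000, 40000, 47000),
  smoker   = rep(c("no", "no", "yes", "yes"), each = 3),
  obese    = rep(c("not obese", "obese", "not obese", "obese"), each = 3)
)

# Required pieces, stored once: data, aesthetics, and two geoms
base <- ggplot(toy, aes(x = age, y = expenses, color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", se = FALSE)

# Optional pieces, stacked onto the stored plot
final <- base +
  facet_wrap(~obese) +
  labs(x = "Age (years)", y = "Annual expenses", color = "Smoker?") +
  theme_minimal()

# Facets, labels, and themes are not layers, so the layer count is unchanged
n_layers <- length(final$layers)

# One panel per value of obese
n_panels <- length(unique(layer_data(final, 1)$PANEL))
c(layers = n_layers, panels = n_panels)
```

Storing the shared part in base keeps variations (a different theme, a different facet) down to one extra line each.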
Reading the final figure (Figure 14.9), it appears that:
- There is a positive linear relationship between age and expenses, and the slope stays roughly constant across smoking and obesity status.
- Smokers and obese patients have higher medical expenses.
- Smoking and obesity interact. Non-smokers look similar across obesity groups, but among smokers, obese patients have much higher expenses.
- A few very high outliers sit in the obese-smoker group.
These findings are tentative. They rest on a limited sample and involve no statistical testing to assess whether the differences could be due to chance.
For the design principles behind these plots — the data-ink ratio, colorblind-safe palettes — and specialized plot types for genomics such as UpSet plots and heatmaps, see A self-guided tour of data visualization in R.