6 Chart: Boxplot
6.1 Overview
This section covers how to make boxplots.
6.2 tl;dr
I want a nice example and I want it NOW!
Here’s a look at the weights of newborn chicks split by the feed supplement they received:
And here’s the code:
library(datasets) # data
library(ggplot2) # plotting
# reorder supplements
supps <- c("horsebean", "linseed", "soybean", "meatmeal", "sunflower", "casein")
# boxplot by feed supplement with jitter layer
ggplot(chickwts, aes(x = factor(feed, levels = supps),
                     y = weight)) +
  # plotting
  geom_boxplot(fill = "#cc9a38", color = "#473e2c") +
  geom_jitter(alpha = 0.2, width = 0.1, color = "#926d25") +
  # formatting
  ggtitle("Casein Makes You Fat?!",
          subtitle = "Boxplots of Chick Weights by Feed Supplement") +
  labs(x = "Feed Supplement", y = "Chick Weight (g)", caption = "Source: datasets::chickwts") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))
For more info on this dataset, type ?datasets::chickwts into the console.
6.3 Simple examples
Okay…much simpler please.
Let’s use the airquality dataset from the datasets package:
library(datasets)
head(airquality, n = 5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
6.3.1 Boxplot using base R
# plot data
boxplot(airquality, col = 'lightBlue', main = "Base R Boxplots of airquality")
Boxplots with base R are super easy. Like histograms, boxplots only need the data. In this case, we passed a dataframe with six variables, so it made a separate boxplot for each variable. You may not want to create boxplots for every variable, in which case you could specify the variables individually or pick columns with select from the dplyr package (filter, by contrast, subsets rows, not variables).
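To plot only some of the variables, one option is to subset the columns before calling boxplot. The sketch below uses base R indexing. The choice of Ozone, Wind, and Temp is arbitrary, and dplyr::select(airquality, Ozone, Wind, Temp) should pick out the same columns if you prefer dplyr.

```r
library(datasets)

# keep three of the six columns (an arbitrary choice for illustration)
vars <- c("Ozone", "Wind", "Temp")

# boxplot() draws one box per remaining column and invisibly returns its statistics
b <- boxplot(airquality[, vars], col = "lightblue",
             main = "Base R Boxplots of Selected Variables")

# names of the groups that were plotted
b$names
```

Saving the return value is optional, but it is a handy way to check which groups ended up in the plot.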
6.3.2 Boxplot using ggplot2
# import ggplot
library(ggplot2)
# plot data
g1 <- ggplot(stack(airquality), aes(x = ind, y = values)) +
  geom_boxplot(fill = "lightBlue") +
  # extra formatting
  labs(x = "") +
  ggtitle("ggplot2 Boxplots of airquality")
g1
## Warning: Removed 44 rows containing non-finite values (stat_boxplot).
ggplot2 requires data to be mapped to the x and y aesthetics. Here we use the stack function to combine each column of the airquality dataframe. Reading the documentation for the stack function (?utils::stack), we see the new stacked dataframe has two columns: values and ind, which we use to create the boxplots. Notice: ggplot2 warns us that it is ignoring “non-finite values”, which are the NA’s in the dataset.
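To see what stack actually hands to ggplot2, it can help to inspect the stacked dataframe directly. This is a quick sketch; the name stacked is just a placeholder chosen here.

```r
library(datasets)

# one long dataframe: all six columns of airquality placed end to end
stacked <- stack(airquality)

head(stacked, n = 3)        # a `values` column and an `ind` column
levels(stacked$ind)         # one level per original column
nrow(stacked)               # rows of airquality times its number of columns
sum(is.na(stacked$values))  # the missing values behind the ggplot2 warning
```

The last line counts the NA's in the stacked values, which is where the 44 removed rows in the warning come from.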
6.4 Theory
Here’s a quote by Hadley Wickham that sums up boxplots nicely:
The boxplot is a compact distributional summary, displaying less detail than a histogram or kernel density, but also taking up less space. Boxplots use robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. They are particularly useful for comparing distributions across groups. - Hadley Wickham
Another important use of the boxplot is in showing outliers. Points that fall beyond the fences, conventionally 1.5 times the interquartile range past the quartiles, are drawn individually, so you can see how far each one sits from the bulk of the data. Use the boxplot when you have data with outliers so that they can be exposed. What it lacks in specificity it makes up for with its ability to clearly summarize large data sets.
- For more info about boxplots and continuous variables, check out Chapter 3 of the textbook.
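The quartiles and fences described above can be computed by hand. The sketch below does this for the Ozone column of airquality and compares the result with boxplot.stats. It assumes the conventional multiplier of 1.5, which is the default coef in boxplot.stats, and the variable names are placeholders. Base R uses Tukey's hinges from fivenum, while the ggplot2 documentation notes that geom_boxplot uses the 25th and 75th percentiles, so the two can differ slightly on small samples.

```r
library(datasets)

# drop missing values first
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(ozone)
lower_hinge <- five[2]
upper_hinge <- five[4]
box_height <- upper_hinge - lower_hinge

# fences sit 1.5 box heights beyond each hinge
lower_fence <- lower_hinge - 1.5 * box_height
upper_fence <- upper_hinge + 1.5 * box_height

# points outside the fences are the ones drawn individually
manual_outliers <- ozone[ozone < lower_fence | ozone > upper_fence]
builtin_outliers <- boxplot.stats(ozone)$out

manual_outliers
builtin_outliers
```

Both vectors should list the same points, since boxplot.stats applies the same rule internally.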
6.5 When to use
Boxplots should be used to display continuous variables. They are particularly useful for identifying outliers and comparing different groups.
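As a small illustration of comparing groups, the sketch below draws one box of daily temperature per month from airquality. Month is stored as a number, so it is converted to a factor first. The object name g_month and the labels are choices made here.

```r
library(datasets)
library(ggplot2)

# convert Month to a factor so ggplot2 draws one box per month
g_month <- ggplot(airquality, aes(x = factor(Month), y = Temp)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Month", y = "Temperature (degrees F)") +
  ggtitle("Daily Temperature by Month")
g_month
```

Leaving Month as a plain number would likely give a single box instead of one per month, since ggplot2 needs a discrete variable to split the boxes.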
Aside: Boxplots may even help you convince someone you are their outlier. (If you like it when people over-explain jokes, here is why that comic is funny.)
6.6 Considerations
6.6.1 Flipping orientation
Often you want boxplots to be horizontal. Super easy to do: just tack on coord_flip():
# g1 plot from above (6.3.2)
g1 + coord_flip()
## Warning: Removed 44 rows containing non-finite values (stat_boxplot).
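Another way to get horizontal boxes, assuming ggplot2 3.3.0 or later, is to swap the aesthetics so the grouping variable is mapped to y. This is a sketch of that alternative, not a replacement for coord_flip(). The name g2 is a placeholder, and na.rm = TRUE is added here only to silence the warning about missing values.

```r
library(datasets)
library(ggplot2)

# map the values to x and the grouping variable to y for horizontal boxes
g2 <- ggplot(stack(airquality), aes(x = values, y = ind)) +
  geom_boxplot(fill = "lightblue", na.rm = TRUE) +
  # extra formatting
  labs(y = "") +
  ggtitle("Horizontal ggplot2 Boxplots of airquality")
g2
```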
6.6.2 NOT for categorical data
Boxplots are great, but they do NOT work with categorical data. Make sure your variable is continuous before using boxplots. Here’s an example of what not to do:
library(likert) # data
library(dplyr) # data manipulation
library(ggplot2) # plotting
# load/format data
data(pisaitems)
pisa <- pisaitems[1:100, 2:7] %>%
  dplyr::mutate_all(as.integer) %>%
  dplyr::filter(complete.cases(.))
# create theme (named bad_theme so it does not mask ggplot2::theme)
bad_theme <- theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))
# create plot (named bad_plot so it does not mask base::plot)
bad_plot <- ggplot(stack(pisa), aes(x = ind, y = values)) +
  geom_boxplot(fill = "#9B3535") +
  ggtitle("Don't Plot Boxplots of Categorical Variables Like This",
          subtitle = "...seriously don't. Here, I'll make it red so it looks scary:") +
  labs(x = "Assessment Code", y = "Values", caption = "Source: likert::pisaitems")
# bad boxplot
bad_plot + bad_theme
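To see why this goes wrong, it helps to look at the numbers a boxplot would draw for categorical codes. The sketch below uses made-up survey answers coded 1 to 4 (hypothetical counts, not taken from pisaitems).

```r
# hypothetical survey answers coded 1 to 4 (made-up counts for illustration)
responses <- rep(1:4, times = c(5, 40, 45, 10))

# the five-number summary behind a boxplot: with so few distinct codes,
# some of the five numbers land on the same value
fivenum(responses)

# a frequency table keeps the information the boxplot hides
table(responses)
```

With only a handful of distinct codes, the box collapses onto a few integers and says little about how the answers are spread across categories, while the counts show it directly.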
6.7 External resources
- Tukey, John W. 1977. Exploratory Data Analysis. Addison-Wesley. (Chapter 2): the primary source in which boxplots are first presented.
- DataCamp: Quick Exercise on Boxplots: a simple example of making boxplots from a dataset.
- Article on boxplots with ggplot2: An excellent collection of code examples on how to make boxplots with ggplot2. Covers layering, working with legends, faceting, formatting, and more. If you want a boxplot to look a certain way, this article will help.
- Boxplots with plotly package: boxplot examples using the plotly package. These allow for a little interactivity on hover, which might better explain the underlying statistics of your plot.
- ggplot2 Boxplot: Quick Start Guide: Article from STHDA on making boxplots using ggplot2. Excellent starting point for getting immediate results and custom formatting.
- ggplot2 cheatsheet: Always good to have close by.
- Hadley Wickham and Lisa Stryjewski on boxplots: good for understanding basics of more complex boxplots and some of the history behind them.