Chapter 14 Stacked Bar Charts and Treemaps

Jasmine Bao and Yingnan Wu

This tutorial covers how to make static and interactive stacked bar charts and static treemaps.

In this section, we discuss ways of displaying multivariate categorical data, i.e., combinations of categorical variables, using bar charts and treemaps.

14.1 Grouped and Stacked Bar Chart

14.1.1 Overview

Grouped and stacked bar charts are good for showing the counts for combinations of two or three categorical variables. The dataset we use is the same data frame shown in class.

14.1.2 ggplot2

We first use the ggplot2 package to make stacked bar charts. The input data frame should have three columns: a numeric count and two categorical variables, one for the group and one for the subgroup.
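A minimal sketch of such a stacked bar chart follows. The data frame is a made-up toy example (not the data frame from class), and the column names Favorite and Freq are assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical toy data: one row per group/subgroup combination with its count.
df <- data.frame(
  Age      = c("young", "young", "old", "old"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee"),
  Freq     = c(8, 2, 3, 7)
)

# geom_col() uses Freq as the bar height; position = "stack" is the default,
# so the Favorite segments are stacked within each Age bar.
p_stack <- ggplot(df, aes(x = Age, y = Freq, fill = Favorite)) +
  geom_col() +
  labs(y = "Count")

p_stack
```

geom_col() is used here because the counts are already computed; geom_bar() would count rows instead.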

To make a percentage stacked bar chart, we only need to switch to position="fill". The y-axis label should then be changed to proportion or percent accordingly. If the original numeric values are counts, it is better to format the y-axis as a continuous percentage scale.
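As a sketch of the percentage version, again on made-up toy data with assumed column names, the y-axis can be formatted with the percent labeller from the scales package (installed alongside ggplot2):

```r
library(ggplot2)

# Hypothetical toy data (assumed column names, made-up counts).
df <- data.frame(
  Age      = c("young", "young", "old", "old"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee"),
  Freq     = c(8, 2, 3, 7)
)

# position = "fill" rescales every bar to a total height of 1,
# and scales::percent prints the axis breaks as percentages.
p_fill <- ggplot(df, aes(x = Age, y = Freq, fill = Favorite)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent")

p_fill
```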

The argument position="dodge" places the bars beside each other (the default is position="stack"). Stacked bar charts are easily overused; a grouped bar chart is often preferable because every bar starts from a common baseline on the same y-axis scale, which makes the bars easier to compare.

In this case, we have two options: either categorical variable can be mapped to fill, with the other one defining the groups on the x-axis.
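The two options can be sketched as follows. The toy data and the column names Favorite and Freq are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical toy data (assumed column names, made-up counts).
df <- data.frame(
  Age      = c("young", "young", "old", "old"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee"),
  Freq     = c(8, 2, 3, 7)
)

# Option 1: group by Age on the x-axis, color by Favorite.
p_by_age <- ggplot(df, aes(x = Age, y = Freq, fill = Favorite)) +
  geom_col(position = "dodge")

# Option 2: group by Favorite on the x-axis, color by Age.
p_by_fav <- ggplot(df, aes(x = Favorite, y = Freq, fill = Age)) +
  geom_col(position = "dodge")

p_by_age
p_by_fav
```

Which option reads better depends on which comparison matters more: bars within the same x-axis group sit next to each other and are the easiest to compare.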

We can also add facets to avoid overusing colors. Observe that the colors disappear; this is because all categorical variables are already clearly labeled.

Similarly, we can have facets on the other categorical variable Age.
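A sketch of both faceted versions, using made-up toy data with assumed column names, could look like this:

```r
library(ggplot2)

# Hypothetical toy data (assumed column names, made-up counts).
df <- data.frame(
  Age      = c("young", "young", "old", "old"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee"),
  Freq     = c(8, 2, 3, 7)
)

# One panel per Favorite level; no fill is needed because the
# panel strips and the x-axis already label both variables.
p_facet_fav <- ggplot(df, aes(x = Age, y = Freq)) +
  geom_col() +
  facet_wrap(~Favorite)

# The same idea with the roles swapped: one panel per Age level.
p_facet_age <- ggplot(df, aes(x = Favorite, y = Freq)) +
  geom_col() +
  facet_wrap(~Age)

p_facet_fav
p_facet_age
```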

The grouped bar chart with facets also works when we have three categorical variables. When reshaping the dataset, we need complete() here to avoid dropping combinations with zero counts.
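The role of complete() can be sketched on a small made-up data frame in which one combination of levels never occurs. The column names Favorite, Music, and Freq, and all the counts, are assumptions for illustration.

```r
library(ggplot2)
library(tidyr)

# Hypothetical toy counts; the (old, bubble gum, rock) combination
# has no row at all, i.e. an implicit zero count.
df3 <- data.frame(
  Age      = c("young", "young", "young", "young", "old", "old", "old"),
  Favorite = c("bubble gum", "bubble gum", "coffee", "coffee",
               "bubble gum", "coffee", "coffee"),
  Music    = c("classical", "rock", "classical", "rock",
               "classical", "classical", "rock"),
  Freq     = c(1, 5, 3, 1, 2, 4, 1)
)

# complete() adds the missing combination and fills its count with 0.
df3_full <- complete(df3, Age, Favorite, Music, fill = list(Freq = 0))

# Two variables define the facet grid; the third goes on the x-axis.
p3 <- ggplot(df3_full, aes(x = Music, y = Freq)) +
  geom_col() +
  facet_grid(Age ~ Favorite)

p3
```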

14.1.3 plotly

The R package plotly can also be used to plot interactive bar charts for multivariate categorical data.

For the corresponding grouped bar chart, we only need to change barmode = 'stack' to barmode = 'group'. The counts for each subgroup can also be displayed directly on the interactive plot.
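A minimal plotly sketch of both bar modes is shown below. The data are made up, the column names are assumptions, and the counts are printed on the bars through the text and textposition arguments.

```r
library(plotly)

# Hypothetical toy data (assumed column names, made-up counts).
df <- data.frame(
  Age      = c("young", "young", "old", "old"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee"),
  Freq     = c(8, 2, 3, 7)
)

# One bar trace per Favorite level; text = ~Freq writes each count on its bar.
base <- plot_ly(
  df,
  x = ~Age, y = ~Freq, color = ~Favorite,
  type = "bar",
  text = ~Freq, textposition = "auto"
)

# The only difference between the two charts is the barmode setting.
p_stack <- layout(base, barmode = "stack", yaxis = list(title = "Count"))
p_group <- layout(base, barmode = "group", yaxis = list(title = "Count"))

p_stack
p_group
```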

14.1.4 Consideration

The stacked and grouped bar charts above can only be used to visualize two or three categorical variables, and they require reshaping the data (into a messier or tidier form). For more general cases, we can use mosaic plots, doubledecker plots, fluctuation diagrams, treemaps, association plots, and parallel sets/categorical parallel coordinate plots.

14.1.5 External resources

  1. Grouped, stacked and percent stacked barplot in ggplot2: a good reference for learning how to build grouped, stacked and percent stacked bar plots with R and ggplot2, with multiple examples.

  2. How to make a bar chart in R using plotly: a detailed tutorial on making bar plots using the plotly package.

  3. R documentation tidyr: complete()

  4. R documentation ggplot2: geom_bar()

  5. 14MosaicPlots.pdf by Professor Joyce Robbins

14.2 Treemap

14.2.1 Overview

This section covers how to make treemaps. A treemap is a plot of filled rectangles representing hierarchical data; it is similar to a pie chart in that the area of each rectangle can represent a proportion. Treemaps can be drawn in R using the treemap function in the treemap package.

What types of datasets are appropriate for treemaps? First, we need a quantitative variable with positive values. Then we need one or more hierarchical categorical variables associated with that quantitative variable.
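As a sketch of the kind of input treemap() expects, the data frame below uses placeholder names and made-up values (they are not real population figures). The column names Continent, Region, Country, and Population are assumptions modeled on the index values used in this section.

```r
library(treemap)

# Hypothetical toy data: three nested categorical columns and one
# positive quantitative column. Names and values are placeholders.
pop <- data.frame(
  Continent  = c("A", "A", "A", "B", "B"),
  Region     = c("A1", "A1", "A2", "B1", "B2"),
  Country    = c("A1a", "A1b", "A2a", "B1a", "B2a"),
  Population = c(50, 30, 20, 40, 10)
)

# A single level of hierarchy: one rectangle per continent, with area
# proportional to the summed Population of its rows.
tm <- treemap(
  pop,
  index = c("Continent"),
  vSize = "Population",
  title = "Population by continent (toy data)"
)
```

treemap() draws the plot as a side effect and returns the computed rectangles invisibly, so assigning the result is optional.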

14.2.3 Region level

Now we would like to look at the population at the region level. We simply add "Region" after "Continent" in the index=c("Continent") line. Note that the categorical variables have to go in decreasing order of hierarchy, for instance, index=c("group", "subgroup", "sub-subgroup",...).

We observe that Southern Asia and Eastern Asia have the highest proportions of the population within Asia, and their proportions are almost equal, while Central Asia has the lowest. Western Africa has the highest proportion of the population within Africa, Southern Africa has the lowest, and so on.
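A sketch of the two-level call is shown here on placeholder data; the names and values are made up and are not the population data discussed in the text.

```r
library(treemap)

# Hypothetical toy data with placeholder names and made-up values.
pop <- data.frame(
  Continent  = c("A", "A", "A", "B", "B"),
  Region     = c("A1", "A1", "A2", "B1", "B2"),
  Country    = c("A1a", "A1b", "A2a", "B1a", "B2a"),
  Population = c(50, 30, 20, 40, 10)
)

# index lists the variables from the highest level of the hierarchy
# to the lowest: regions are drawn nested inside their continent.
tm_region <- treemap(
  pop,
  index = c("Continent", "Region"),
  vSize = "Population",
  title = "Population by continent and region (toy data)"
)
```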

14.2.4 Country Level

If we would also like to look at population at the country level, we simply add "Country" after "Region" in the index=c("Continent", "Region") line.

However, one problem might arise when we have many levels of hierarchy: the labels might become hard to read. In that case, we can adjust the label parameters to make them easier to read.

We observe that China has the highest proportion of the population in Eastern Asia, while India has the highest proportion of the population in Southern Asia, and so on.
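One possible set of label adjustments is sketched below on placeholder data. The specific sizes, colors, and alignments are arbitrary choices for illustration, not the settings used for the plots discussed in this section.

```r
library(treemap)

# Hypothetical toy data with placeholder names and made-up values.
pop <- data.frame(
  Continent  = c("A", "A", "A", "B", "B"),
  Region     = c("A1", "A1", "A2", "B1", "B2"),
  Country    = c("A1a", "A1b", "A2a", "B1a", "B2a"),
  Population = c(50, 30, 20, 40, 10)
)

# Each label argument takes one value per hierarchy level, in the
# same order as index (Continent, Region, Country).
tm_country <- treemap(
  pop,
  index = c("Continent", "Region", "Country"),
  vSize = "Population",
  fontsize.labels  = c(16, 12, 8),                 # larger text for higher levels
  fontcolor.labels = c("white", "black", "black"),
  bg.labels        = "transparent",                # no box behind the labels
  align.labels     = list(c("left", "top"),        # continent: top-left corner
                          c("center", "center"),   # region: centered
                          c("right", "bottom")),   # country: bottom-right corner
  overlap.labels   = 0.5,                          # tolerated overlap, between 0 and 1
  title = "Population by continent, region and country (toy data)"
)
```

Placing the labels of each level in a different corner is one simple way to keep them from covering each other.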

14.2.5 Consideration

The static treemaps covered in this section are handy when we have a hierarchical dataset with at most three levels of hierarchy. However, if there are more levels, we should consider plotting an interactive treemap using a package such as d3Tree, as it will greatly increase the readability of the plot.

Another important reminder: when we have multivariate categorical data, we should only consider treemaps when the data has a hierarchical structure and we are interested in the relationship between the quantitative variable and the different levels of subgroups. If that is not the case, we should consider plotting the data using graphs such as mosaic plots, doubledecker plots, fluctuation diagrams, association plots, and parallel sets/categorical parallel coordinate plots.

14.2.6 External resources

  1. How to make a treemap in R using treemap(): a detailed tutorial on making treemaps using the treemap package.

  2. R documentation treemap()

  3. Graphical Data Analysis with R by Antony Unwin