A scatterplot matrix is a collection of scatterplots organized into a grid, where each scatterplot shows the relationship between one pair of variables. It is useful for getting a rough idea of the linear correlation between variables. Collinearity among predictors is undesirable when building a model, and inspecting the scatterplot matrix early on gives an idea of which variables to include. There are various ways to plot a scatterplot matrix. This post introduces six functions for creating one, compares them, and discusses their pros and cons. The example dataset is Seatbelts, a multivariate time series. For convenience, we also create a data frame version of it.
data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)
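Before plotting, it can help to confirm what the two objects look like. The following is a minimal check using only base R; it assumes nothing beyond the two objects created above.

```r
data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)

# The original object is a multivariate time series; the copy is a plain data frame
class(Seatbelts)
class(Seatbelts.df)

# Number of observations and variables, and the variable names
dim(Seatbelts.df)
names(Seatbelts.df)

# "law" is a 0/1 indicator, unlike the other columns
table(Seatbelts.df$law)
```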
Generally, there are two genres of functions for creating a scatterplot matrix. The first genre consists of:
• pairs() in base R
• cpairs() in library “gclus”
• splom() in library “lattice”
These functions plot every variable against every other variable and are straightforward to understand.
The pairs() function requires only one input, x, which the documentation describes as “the coordinates of points given as numeric columns of a matrix or data frame”. In other words, a data frame, a tibble, a time-series object, and so on can all be passed to pairs(), which covers most needs if we only want a glimpse of the correlation between variables. Note that, although it will plot them, pairs() does not handle logical, factor, or other categorical or discrete variables well. To get the most out of it, we may want to exclude such variables from the scatterplot matrix. In the example, the Seatbelts dataset includes a binary variable called “law”, and all we can tell from its panels is that it takes two different values. In this case we might consider other approaches, such as coloring the points in the remaining panels by the level of “law”.
# pairs from Base R
pairs(Seatbelts)
# Coloring points by different level of variable "law"
pairs(Seatbelts.df[,1:7], col=ifelse(Seatbelts.df$law==0, "black", "red"))
pairs() has many useful arguments. “pch” changes the plotting symbol, which helps to differentiate points further if there is more than one categorical variable to display. Setting “upper.panel = NULL” removes the redundant upper half of the scatterplot matrix.
# changing shape of points
pairs(Seatbelts.df[,1:7],pch=ifelse(Seatbelts.df$law==0, 1,3), upper.panel = NULL)
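Instead of dropping the upper half, the upper.panel argument can also take a function. The sketch below is modeled on the panel function idea in the pairs() help page; panel.cor is a helper name chosen here for illustration, not something pairs() provides.

```r
data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)

# Illustrative helper: print the Pearson correlation in a panel,
# with the text size scaled by the absolute correlation
panel.cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))          # restore the panel's coordinate system
  par(usr = c(0, 1, 0, 1))         # use unit coordinates to center the text
  r <- cor(x, y)
  text(0.5, 0.5, format(r, digits = digits), cex = 0.8 + 1.5 * abs(r))
}

# Scatterplots below the diagonal, correlations above it
pairs(Seatbelts.df[, 1:7],
      col = ifelse(Seatbelts.df$law == 0, "black", "red"),
      upper.panel = panel.cor)
```

This keeps the lower half as ordinary scatterplots and uses the otherwise redundant upper half for a numeric summary.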
The second method is cpairs() from the package “gclus”, a package for plotting scatterplot matrices and parallel coordinate displays with the variables arranged in a chosen order. cpairs() is described as an enhanced scatterplot matrix: it can order the variables as you like and color the panels for better readability. For example, we can order the variables by correlation.
# function cpairs in "gclus"
# install.packages("gclus")
library("gclus")
## Loading required package: cluster
sb.cor <- cor(Seatbelts.df)
# assign color
sb.color <- dmat.color(sb.cor)
# assign order
sb.o <- order.hclust(sb.cor)
cpairs(Seatbelts, order = sb.o, panel.colors = sb.color, upper.panel = NULL)
In this example, cpairs() takes two extra arguments, one for the variable order and one for the panel colors. The order comes from order.hclust(), which arranges the variables by hierarchical clustering of the correlation matrix, and the panel colors come from dmat.color(), which is also from this package.
By default, dmat.color() assigns one of three colors to each panel according to the size of its correlation: the highest correlations are in pink, the middle ones in blue, and the lowest in yellow.
cpairs() and the “gclus” package provide a more readable scatterplot matrix. This is useful when there are many variables, since color and ordering help us pick out information more quickly than checking each scatterplot separately.
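To see what ordering by correlation amounts to, the idea can be approximated with base R alone. The sketch below is not what gclus does internally: it swaps in stats::hclust() on a distance of 1 minus the absolute correlation (an assumption made here for illustration), then passes the reordered columns to pairs().

```r
data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)
sb.cor <- cor(Seatbelts.df)

# Treat strongly correlated variables (positive or negative) as "close"
sb.dist <- as.dist(1 - abs(sb.cor))

# Cluster the variables and read off the left-to-right dendrogram order
sb.hc <- hclust(sb.dist, method = "average")
sb.ord <- sb.hc$order

# Variable names in their new order, then the reordered matrix
colnames(Seatbelts.df)[sb.ord]
pairs(Seatbelts.df[, sb.ord], upper.panel = NULL)
```

With this ordering, variables that move together tend to sit next to each other, so the most structured panels cluster near the diagonal.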
The third method is the function splom() in the package “lattice”. “lattice” is a data visualization package that is particularly good at multivariate plots. It was first developed in S, the predecessor of R, and has since been brought to R. splom(), short for ScatterPLOt Matrix, differs from pairs() and cpairs() in that it expects a data frame (or a formula built from one) as its input. The dataset Seatbelts, which is a time series object, will not be accepted by splom().
# function splom in "lattice"
# install.packages("lattice")
library("lattice")
splom(~Seatbelts.df[,1:7]|Seatbelts.df[,8],aspect = 0.3,layout = c(2, 1))
One feature of splom() is that the plot can be faceted: adding “|” to the formula splits the plot by the levels of the conditioning variable. The argument “layout = c(number of columns, number of rows)” controls how the facets are arranged. “aspect” is the ratio of the y axis to the x axis, so aspect = 0.5 produces panels whose x axis is twice as long as the y axis.
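As an alternative to faceting, splom() can also overlay the groups in a single matrix through its groups argument. The sketch below assumes the same Seatbelts.df data frame; law.f and its labels “before law” and “after law” are names chosen here for illustration.

```r
library(lattice)

data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)

# Illustrative factor version of the 0/1 "law" indicator
law.f <- factor(Seatbelts.df$law, levels = c(0, 1),
                labels = c("before law", "after law"))

# One matrix, with points distinguished by group and a legend on top
p <- splom(~Seatbelts.df[, 1:7],
           groups = law.f,
           auto.key = list(columns = 2),
           pscales = 0)   # drop axis tick labels to reduce clutter
print(p)
```

Overlaying makes it easier to compare the two periods within one panel, while faceting keeps each period's panels uncluttered.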
The traditional scatterplot matrix, as produced by the first genre above, is a natural tool for displaying pairs of quantitative variables. When a data set includes one or more categorical variables, the traditional display offers limited flexibility. The second genre of functions implements the generalized pairs plot: ggpairs() in library GGally and gpairs() in library gpairs. The generalized pairs plot addresses the need for a more flexible display of a mixture of quantitative and categorical variables, and brings out important features of such data.
GGally extends ggplot2, a “plotting system based on the grammar of graphics”, with several functions for building highly customized matrix plots. The function ggpairs() uses a modular design for pairwise comparisons of multivariate data and displays either the density or the count of each variable along the diagonal. There are three types of comparisons. A comparison between two quantitative variables is called quantitative-quantitative and is usually explored with a scatterplot. A panel (or comparison) between one categorical and one quantitative variable is called quantitative-categorical, where side-by-side boxplots, faceted histograms, or density plots can be used. The last type, between two categorical variables, is called a categorical-categorical comparison. When appropriate, two-way tables and mosaic plots are used for pairs of categorical variables. As with many R functions, arguments recognized by ggplot2 can be provided to ggpairs() and are passed through to the lower-level plotting functions.
In the following examples, the iris data is used to plot within-group and between-group comparisons, because its categorical variable has several levels.
# function ggpairs in "GGally"
# install.packages("GGally")
library("GGally")
## Loading required package: ggplot2
iris$Long.Petal<-as.factor(ifelse(iris$Petal.Length>median(iris$Petal.Length), "long", "short"))
ggpairs(iris)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this example, we used the iris data and split petal length into two subgroups, long (above the median petal length) and short (at or below the median). Density plots and bar charts on the diagonal show the marginal distributions of the variables. Among the quantitative-quantitative panels, there is a strong positive association between petal length and petal width, supported by a correlation of 0.963. The mosaic panels between iris species and petal length group show both conditional distributions; for example, the panel in row 5, column 6 gives the distribution of petal length groups within each species.
gpairs is another package for displaying the generalized pairs plot. The gpairs() function takes a data frame (or matrix) with any combination of quantitative and categorical variables.
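The numbers behind these panels can be checked directly in base R. The sketch below recomputes the petal length grouping in a separate vector (long.petal, a name chosen here so that iris itself is left untouched), then prints the correlation matrix and the counts underlying the species by petal length group panel.

```r
# Same grouping rule as in the example, kept in a separate vector
long.petal <- factor(ifelse(iris$Petal.Length > median(iris$Petal.Length),
                            "long", "short"))

# Pairwise Pearson correlations for the four measurements
iris.cor <- cor(iris[, 1:4])
round(iris.cor, 3)

# Counts behind the Species x petal-length-group panel
group.counts <- table(iris$Species, long.petal)
group.counts
```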
# function gpairs in "gpairs"
# install.packages("gpairs")
library("gpairs")
gpairs(iris)
While the side-by-side boxplots in ggpairs() show that petal length is shorter on average for setosa flowers, they obscure information in the conditional distributions because they reduce the data to a few summary statistics.
boxplot(iris$Petal.Length~iris$Species, horizontal=TRUE)
Instead, the barcode plot from the gpairs() function provides an alternative quantitative-categorical display that maintains the full data resolution. The slim strokes alleviate overplotting in dense regions, and the ties reveal some interesting information hidden in a boxplot: flowers #78, #114, #120 and #147 have identical petal lengths (#78 is versicolor and the other three are virginica). Similarly, other groups of flowers are tied at high petal lengths.
library(barcode)
barcode(list(Setosa = iris[iris$Species == "setosa", ]$Petal.Length,
             Versicolor = iris[iris$Species == "versicolor", ]$Petal.Length,
             Virginica = iris[iris$Species == "virginica", ]$Petal.Length),
        horizontal = TRUE)
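The ties mentioned above can be confirmed without any plotting package. This is a small base R sketch; tied and tie.groups are names introduced here for illustration.

```r
# Flowers sharing the petal length of flower #78, with their species
tied <- which(iris$Petal.Length == iris$Petal.Length[78])
data.frame(flower = tied, species = iris$Species[tied])

# All petal lengths that occur more than once, as lists of row numbers
tie.groups <- split(seq_len(nrow(iris)), iris$Petal.Length)
tie.groups <- tie.groups[lengths(tie.groups) > 1]
length(tie.groups)
```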
The generalized pairs plot can combine scatterplots, mosaic plots, and the detailed barcode plots with the higher-level summary of traditional boxplots.
ggpairs() and gpairs() also differ in how they coordinate axis scales and labels. ggpairs() uses “global limits” to ensure that all panels of the generalized pairs plot are ranged properly on each axis. gpairs(), by default, displays the variable names and axis labels on the diagonal.
The ggcorr() function is part of GGally and shows the correlation coefficients between continuous variables as a correlation matrix. Its input can be a data frame or a matrix. In the default output of ggcorr(), the non-numeric columns Species and Long.Petal are dropped from the pairwise correlation matrix.
# function ggcorr in "GGally"
library("GGally")
ggcorr(iris)
## Warning in ggcorr(iris): data in column(s) 'Species', 'Long.Petal' are not
## numeric and were ignored
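The same column filtering can be reproduced in base R to see which values the plot is built from. This is only a sketch of the idea, not ggcorr()'s own code; num.cols and iris.num.cor are names chosen here.

```r
# Flag the numeric columns, mirroring what the warning above reports
num.cols <- vapply(iris, is.numeric, logical(1))
names(iris)[!num.cols]        # columns that would be ignored

# Pairwise correlations among the remaining columns
iris.num.cor <- cor(iris[, num.cols])
round(iris.num.cor, 2)
```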
The first genre, the traditional scatterplot matrix, can obscure important information when the data include categorical variables, while the second genre, the generalized pairs plot, adds more varied displays of the relationships between categorical and quantitative variables and between pairs of categorical variables.
Each implementation in the first and second genre could be extended further. For example, time series data, such as an object of class ts, could be drawn with lines instead of points in base R. Additional features may be employed to handle ordered factors or spatially distributed data.
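As one possible take on the time series idea, pairs() accepts a custom panel function, so consecutive observations can be joined by line segments. The sketch below is only an illustration: panel.path is a helper name made up here, and the choice of four columns is arbitrary.

```r
data("Seatbelts")
Seatbelts.df <- as.data.frame(Seatbelts)

# Illustrative panel: join consecutive months with a grey path,
# then overlay small points on top
panel.path <- function(x, y, ...) {
  lines(x, y, col = "grey60")
  points(x, y, pch = 20, cex = 0.5, ...)
}

# Rows are in time order, so the path traces each pair through time
pairs(Seatbelts.df[, c("drivers", "front", "rear", "PetrolPrice")],
      panel = panel.path)
```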