Base R and the plyr, dplyr and tidyr packages all provide efficient tools for data manipulation such as subsetting, sorting and merging. Although their syntax, approach and complexity differ, they can produce the same result. Here, I use the dataset “strikes” to compare their similarities and differences.
The dataset “strikes” covers 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables are shown below:
strikes <- read.csv("strikes.csv")
head(strikes)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
If we want to study the average unemployment rate, inflation rate, and strike volume for each year in the strikes data set, we can use either base R or the tidyverse.
First, we need to split our data into appropriate chunks, each of which can be handled by our function. Here, the function split() is often helpful. Recall that split(df, f = my.factor) splits a data frame df into several data frames, one for each level of the factor my.factor.
years.split <- split(strikes, strikes$year)
str(years.split[[1]])
## 'data.frame': 18 obs. of 8 variables:
## $ country : Factor w/ 18 levels "Australia","Austria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1951 1951 1951 1951 1951 1951 1951 1951 1951 1951 ...
## $ strike.volume : int 296 43 242 242 3 288 299 112 773 437 ...
## $ unemployment : num 1.3 3.5 4.5 2.4 9.7 0.1 0.6 6.4 7.3 8.8 ...
## $ inflation : num 19.8 27.5 9.6 10.4 10.5 16.3 17.7 7.7 7.9 14.3 ...
## $ left.parliament: num 43 43.6 39.6 78.7 44.6 ...
## $ centralization : num 0.374859 0.997524 0.753247 0.000225 0.498754 ...
## $ density : num NA NA NA NA NA NA NA NA NA NA ...
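As a smaller illustration of the same idea, the sketch below applies split() to a made-up data frame. The object toy and its columns g and x are hypothetical and are not part of strikes. It shows that the result is a named list with one data frame per distinct value of the grouping column.

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(g = c("a", "b", "a", "b", "c"),
                  x = c(1, 2, 3, 4, 5))

# split() returns a named list with one data frame per distinct value of g
toy.split <- split(toy, toy$g)

names(toy.split)         # one name per group
toy.split[["a"]]         # the rows of toy where g == "a"
sapply(toy.split, nrow)  # number of rows in each piece
```

Each element of the list can then be passed to a function on its own, which is what the following steps do with years.split.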
Now we have several subsets of strikes, one per year. Next, define a function that calculates the mean of unemployment, inflation, and strike volume for each subset.
three.mean <- function(df) {
return(apply(df[, c("unemployment", "inflation", "strike.volume")], 2, mean))
}
Finally, apply our function to each data frame in years.split. Here, the function sapply() is helpful, because it also combines the per-year results into a matrix.
years.avg.apply <- sapply(years.split, three.mean)
str(years.avg.apply)
## num [1:3, 1:35] 3.09 13.09 359.22 3.68 5.79 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "unemployment" "inflation" "strike.volume"
## ..$ : chr [1:35] "1951" "1952" "1953" "1954" ...
years.avg.apply[, 1:6]
## 1951 1952 1953 1954 1955
## unemployment 3.088889 3.683333 3.594444 3.505556 3.044444
## inflation 13.088889 5.794444 1.333333 1.833333 1.294444
## strike.volume 359.222222 588.666667 211.944444 139.333333 215.277778
## 1956
## unemployment 3.033333
## inflation 3.705556
## strike.volume 561.944444
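One caveat: mean() returns NA if any value in a column is missing. The averages printed above contain no NA, but other columns of strikes, such as density, do have missing values. A hedged variant, assuming that dropping missing values is acceptable for the analysis, uses colMeans() with na.rm = TRUE. The helper name three.mean.na and the toy data are hypothetical.

```r
# Hypothetical NA-tolerant variant of three.mean()
three.mean.na <- function(df) {
  colMeans(df[, c("unemployment", "inflation", "strike.volume")], na.rm = TRUE)
}

# Toy data with one missing value (made up, not taken from strikes)
toy <- data.frame(year = c(1951, 1951, 1952, 1952),
                  unemployment = c(1, 3, 2, NA),
                  inflation = c(10, 20, 5, 7),
                  strike.volume = c(100, 300, 50, 150))

# Split by year, apply the helper, and combine into a matrix
toy.avg <- sapply(split(toy, toy$year), three.mean.na)
toy.avg  # features in rows, years in columns
```

Whether silently dropping missing values is appropriate depends on the research question, so this is a design choice rather than a default.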
For the same research question, the tidyverse-style approach is more concise and straightforward. Two packages, “plyr” and “dplyr”, can both be used to solve this kind of data manipulation problem. Note that dplyr is part of the tidyverse, while plyr is its predecessor and has to be loaded separately.
“plyr” provides an extremely useful family of apply-like functions. Here we use ddply(), which splits the input data frame, applies a function to each piece and then combines the results into a new data frame. If we want the output to be an array or a list instead, daply() and dlply() are helpful.
The details can be found here: https://www.rdocumentation.org/packages/plyr/versions/1.8.4
library(plyr)
years.avg.plyr <- ddply(strikes[, c("year", "unemployment", "inflation", "strike.volume")], .(year),
apply, MARGIN = 2, FUN = mean)
str(years.avg.plyr)
## 'data.frame': 35 obs. of 4 variables:
## $ year : num 1951 1952 1953 1954 1955 ...
## $ unemployment : num 3.09 3.68 3.59 3.51 3.04 ...
## $ inflation : num 13.09 5.79 1.33 1.83 1.29 ...
## $ strike.volume: num 359 589 212 139 215 ...
head(years.avg.plyr)
## year unemployment inflation strike.volume
## 1 1951 3.088889 13.088889 359.2222
## 2 1952 3.683333 5.794444 588.6667
## 3 1953 3.594444 1.333333 211.9444
## 4 1954 3.505556 1.833333 139.3333
## 5 1955 3.044444 1.294444 215.2778
## 6 1956 3.033333 3.705556 561.9444
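To make the difference between ddply(), daply() and dlply() concrete, here is a minimal sketch on made-up data. The data frame toy and the helper two.mean are hypothetical. The same split-and-apply step is run three times, and only the type of the combined output changes.

```r
library(plyr)

# Toy data (made up for illustration)
toy <- data.frame(year = c(1951, 1951, 1952, 1952),
                  unemployment = c(1, 3, 2, 4),
                  inflation = c(10, 20, 5, 7))

# Function applied to each piece: column means of the two measures
two.mean <- function(df) colMeans(df[, c("unemployment", "inflation")])

toy.df    <- ddply(toy, .(year), two.mean)  # data frame with a year column
toy.array <- daply(toy, .(year), two.mean)  # array: years in rows, measures in columns
toy.list  <- dlply(toy, .(year), two.mean)  # list with one element per year
```

The first letter of the function name gives the input type (d for data frame) and the second gives the output type (d, a or l).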
“dplyr” is a grammar of data manipulation, providing a consistent set of verbs to solve the most common data manipulation challenges.
First, we use select() to pick the columns of strikes that we need. Then, we use group_by() to split strikes into groups by year. Finally, we use summarise_all() to compute a summary statistic for every remaining column within each group. Since we want means here, we pass mean to summarise_all(). The details can be found here: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8
It is worth mentioning that the pipe %>% takes the output of the previous function and passes it as the first argument to the next one, which is useful when you need to apply many steps to the same data set and want each step to be clear.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0.9000
## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
years.avg.dplyr <- strikes %>%
select(year, unemployment, inflation, strike.volume) %>%
group_by(year) %>%
summarise_all(mean)
str(years.avg.dplyr)
## Classes 'tbl_df', 'tbl' and 'data.frame': 35 obs. of 4 variables:
## $ year : int 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
## $ unemployment : num 3.09 3.68 3.59 3.51 3.04 ...
## $ inflation : num 13.09 5.79 1.33 1.83 1.29 ...
## $ strike.volume: num 359 589 212 139 215 ...
head(years.avg.dplyr)
## # A tibble: 6 x 4
## year unemployment inflation strike.volume
## <int> <dbl> <dbl> <dbl>
## 1 1951 3.09 13.1 359.
## 2 1952 3.68 5.79 589.
## 3 1953 3.59 1.33 212.
## 4 1954 3.51 1.83 139.
## 5 1955 3.04 1.29 215.
## 6 1956 3.03 3.71 562.
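To show what the pipe is doing, the sketch below writes the same grouped summary three ways on made-up data (toy is hypothetical). The first form uses pipes, the second uses nested calls, and the third uses across(), which assumes dplyr 1.0.0 or later; it is not available in the dplyr 0.7.8 session shown above.

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(year = c(1951, 1951, 1952, 1952),
                  unemployment = c(1, 3, 2, 4),
                  inflation = c(10, 20, 5, 7))

# Piped form, read top to bottom
piped <- toy %>%
  group_by(year) %>%
  summarise_all(mean)

# The same steps as nested calls, read inside-out
nested <- summarise_all(group_by(toy, year), mean)

# Assumption: dplyr >= 1.0.0, where across() is available
with.across <- toy %>%
  group_by(year) %>%
  summarise(across(everything(), mean))
```

All three objects should hold the same per-year means; the piped form simply keeps the order of operations visible.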
Base R and the tidyverse can handle the same task and produce equivalent results. With base R, you perform three steps (split, process each piece, and combine) one by one and store the intermediate results. With the tidyverse, the same steps are chained into a single pipeline that yields the final result directly.
Another significant difference is the structure of the result. With base R (sapply()), the result is a matrix in which the target features are rows and the groups are columns. With the tidyverse, the result is a data frame in which the target features are columns and the groups are the values of the first column.
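The two layouts carry the same information, and one can be converted into the other. As a minimal sketch using base R only, with a made-up matrix named wide standing in for the sapply() result:

```r
# Toy matrix in the sapply() layout: features in rows, years in columns
wide <- matrix(c(2, 15, 3, 6), nrow = 2,
               dimnames = list(c("unemployment", "inflation"),
                               c("1951", "1952")))

# Transpose so that years become rows, then add the year as the first column
long <- data.frame(year = as.integer(colnames(wide)), t(wide))
rownames(long) <- NULL
long
```

After the conversion, long has one row per year and one column per feature, which matches the layout of the dplyr result.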
We can use base R to plot a trend chart of the feature averages over the years.
As the ranges of the three features (“unemployment”, “inflation”, “strike.volume”) differ greatly, we build a plot with two y-axes. Reference: https://www.r-bloggers.com/r-single-plot-with-two-different-y-axes/
par(mar = c(5, 4, 2, 4))
# The column names of the sapply() result hold the years as character strings
years <- as.numeric(colnames(years.avg.apply))
max.rate <- max(years.avg.apply[1:2, ])
min.rate <- min(years.avg.apply[1:2, ])
# Left axis: unemployment and inflation
plot(years, years.avg.apply[1, ], xlab = "Year", ylab = "Rate",
     type = "o", col = "#234003", ylim = c(min.rate, max.rate))
points(years, years.avg.apply[2, ], type = "o", col = "#a61c00")
# Second axis for strike.volume
par(new = TRUE)
plot(years, years.avg.apply[3, ], type = "o", col = "#3d85c6", yaxt = "n", ann = FALSE)
axis(side = 4)
mtext(side = 4, line = 3, "Days")
legend("topright", c("Unemployment", "Inflation", "strike.volume"), fill = c("#234003", "#a61c00", "#3d85c6"), cex = .5)
The tidyverse can also draw this trend chart, but since this is a simple task, I prefer base R.
The “ggplot2” package, included in the tidyverse, is well suited to drawing a Cleveland dot plot. First, we use tidyr::gather() to tidy the data, that is, to convert multiple feature columns into key-value pairs. Then, we use the ggplot grammar to draw the Cleveland dot plot. The different ranges of the features still need to be handled with a second axis, which is why strike.volume is divided by 50 before plotting and the secondary axis multiplies by 50 to recover the original scale.
# Tidy data
years.avg.dplyr$strike.volume <- years.avg.dplyr$strike.volume / 50
years.avg.dplyr_tidy <- gather(years.avg.dplyr, key = "Features", value = "Avg", -year)
years.avg.dplyr_tidy$Features <- fct_relevel(years.avg.dplyr_tidy$Features, "strike.volume", after = Inf)
head(years.avg.dplyr_tidy)
## # A tibble: 6 x 3
## year Features Avg
## <int> <fct> <dbl>
## 1 1951 unemployment 3.09
## 2 1952 unemployment 3.68
## 3 1953 unemployment 3.59
## 4 1954 unemployment 3.51
## 5 1955 unemployment 3.04
## 6 1956 unemployment 3.03
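The reshaping step can be seen more easily on a tiny made-up data frame (toy.wide is hypothetical). The sketch below uses gather() as in the text, and also pivot_longer(), which assumes tidyr 1.0.0 or later and is not available in the tidyr 0.8.2 session shown above.

```r
library(tidyr)

# Toy wide data: one column per feature (made up for illustration)
toy.wide <- data.frame(year = c(1951, 1952),
                       unemployment = c(2, 3),
                       inflation = c(15, 6))

# gather(): every column except year becomes a key-value pair
toy.long <- gather(toy.wide, key = "Features", value = "Avg", -year)

# Assumption: tidyr >= 1.0.0, where pivot_longer() is available
toy.long2 <- pivot_longer(toy.wide, cols = -year,
                          names_to = "Features", values_to = "Avg")
```

Both results have one row per year and feature. The row order differs: gather() stacks the data column by column, while pivot_longer() goes row by row.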
# Cleveland dot plot with multiple dots
ggplot(years.avg.dplyr_tidy,
aes(x = Avg,
y = fct_reorder2(as.factor(year), Features, -Avg))) +
geom_point(aes(col = Features)) +
ylab("years") +
scale_x_continuous(
"Rate",
sec.axis = sec_axis(~ . * 50, name = "Days")) +
ggtitle("Trend Chart over Years")
A Cleveland dot plot like this is harder to draw using base R alone.
Pros are:
Cons are:
Pros are:
Cons are:
Between “plyr” and “dplyr”, the former can solve most data manipulation tasks with a single function call, while the latter requires several functions applied step by step. However, “plyr” is harder to learn, and its calls are less readable for new users.