Using Base R and Tidyverse for Data Manipulation

Base R and the plyr, dplyr and tidyr packages are all efficient tools for data manipulation tasks such as subsetting, sorting and merging. Their syntax, approach and complexity differ, but they can produce the same results. Here, I use the dataset “strikes” to compare their similarities and differences.

The dataset “strikes” covers 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables are as follows:

country,year: country and year of data collection
strike.volume: days on strike per 1000 workers
unemployment: unemployment rate
inflation: inflation rate
left.parliament: left-wing share of the government
centralization: centralization of unions
density: density of unions

strikes <- read.csv("strikes.csv")
head(strikes)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

If we want to study the average unemployment rate, inflation rate, and strike volume for each year in the strikes data set, we can use either base R or tidyverse.

Using base R

First, we need to split our data into appropriate chunks, each of which can be handled by our function. Here, the function split() is often helpful. Recall that split(df, f = my.factor) splits a data frame df into several data frames, one for each level of the factor my.factor.

years.split <- split(strikes, strikes$year)
str(years.split[[1]])

## 'data.frame':    18 obs. of  8 variables:
##  $ country        : Factor w/ 18 levels "Australia","Austria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year           : int  1951 1951 1951 1951 1951 1951 1951 1951 1951 1951 ...
##  $ strike.volume  : int  296 43 242 242 3 288 299 112 773 437 ...
##  $ unemployment   : num  1.3 3.5 4.5 2.4 9.7 0.1 0.6 6.4 7.3 8.8 ...
##  $ inflation      : num  19.8 27.5 9.6 10.4 10.5 16.3 17.7 7.7 7.9 14.3 ...
##  $ left.parliament: num  43 43.6 39.6 78.7 44.6 ...
##  $ centralization : num  0.374859 0.997524 0.753247 0.000225 0.498754 ...
##  $ density        : num  NA NA NA NA NA NA NA NA NA NA ...
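
To see what split() returns in isolation, here is a minimal sketch on a made-up data frame. The name toy and its values are illustrative and are not taken from strikes.

```r
# Toy data: two countries observed in two years (values are invented)
toy <- data.frame(
  country = c("A", "B", "A", "B"),
  year = c(1951, 1951, 1952, 1952),
  unemployment = c(1.0, 3.0, 2.0, 6.0)
)

# split() returns a named list with one data frame per level of the factor
chunks <- split(toy, toy$year)

names(chunks)            # the years, stored as character strings
nrow(chunks[["1951"]])   # number of rows in the 1951 chunk
```

Each element of the list can then be passed to a function on its own, which is what the next step relies on.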

Now we have several subsets of strikes, one per year. Next, we define a function that calculates the mean of unemployment, inflation, and strike volume for each subset.

three.mean <- function(df) {
  return(apply(df[, c("unemployment", "inflation", "strike.volume")], 2, mean))
}

Finally, we apply our function to each data frame in years.split. Here, the function sapply() is helpful.

years.avg.apply <- sapply(years.split, three.mean)
str(years.avg.apply)

##  num [1:3, 1:35] 3.09 13.09 359.22 3.68 5.79 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "unemployment" "inflation" "strike.volume"
##   ..$ : chr [1:35] "1951" "1952" "1953" "1954" ...

years.avg.apply[, 1:6]

##                     1951       1952       1953       1954       1955
## unemployment    3.088889   3.683333   3.594444   3.505556   3.044444
## inflation      13.088889   5.794444   1.333333   1.833333   1.294444
## strike.volume 359.222222 588.666667 211.944444 139.333333 215.277778
##                     1956
## unemployment    3.033333
## inflation       3.705556
## strike.volume 561.944444
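
The matrix shape above comes from how sapply() simplifies its result: when every chunk returns a vector of the same length, the vectors become the columns of a matrix. The sketch below shows this on invented data, next to vapply(), a stricter base R variant that requires the expected result type up front. The helper two.mean and the data frame toy are hypothetical names used only for this illustration.

```r
# Toy data (invented values): two observations per year
toy <- data.frame(
  year = c(1951, 1951, 1952, 1952),
  unemployment = c(1, 3, 2, 6),
  inflation = c(10, 20, 4, 8)
)
chunks <- split(toy, toy$year)

# Hypothetical helper: column means of the two measured variables
two.mean <- function(df) colMeans(df[, c("unemployment", "inflation")])

# sapply() simplifies the list of length-2 vectors into a 2 x 2 matrix
by.sapply <- sapply(chunks, two.mean)

# vapply() does the same but fails loudly if a chunk returns something else
by.vapply <- vapply(chunks, two.mean, FUN.VALUE = numeric(2))

by.sapply
```

In both cases the variables end up as rows and the years as columns.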

Using tidyverse

For the same research question, the tidyverse approach is more concise and straightforward. Two packages can be used for this kind of data manipulation: “dplyr”, which is part of tidyverse, and its predecessor “plyr”, which is a separate package and has to be loaded on its own.

plyr

“plyr” provides an extremely useful family of apply-like functions. Here we use ddply(), which splits the input data frame, applies a function to each piece, and then combines the results into a new data frame. If we want the output to be an array or a list instead, daply() and dlply() are helpful.

The details can be found here: https://www.rdocumentation.org/packages/plyr/versions/1.8.4

library(plyr)
years.avg.plyr <- ddply(strikes[, c("year", "unemployment", "inflation", "strike.volume")], .(year), 
      apply, MARGIN = 2, FUN = mean)
str(years.avg.plyr)

## 'data.frame':    35 obs. of  4 variables:
##  $ year         : num  1951 1952 1953 1954 1955 ...
##  $ unemployment : num  3.09 3.68 3.59 3.51 3.04 ...
##  $ inflation    : num  13.09 5.79 1.33 1.83 1.29 ...
##  $ strike.volume: num  359 589 212 139 215 ...

head(years.avg.plyr)

##   year unemployment inflation strike.volume
## 1 1951     3.088889 13.088889      359.2222
## 2 1952     3.683333  5.794444      588.6667
## 3 1953     3.594444  1.333333      211.9444
## 4 1954     3.505556  1.833333      139.3333
## 5 1955     3.044444  1.294444      215.2778
## 6 1956     3.033333  3.705556      561.9444

dplyr

“dplyr” is a grammar of data manipulation, providing a consistent set of verbs to solve the most common data manipulation challenges.

First, we use select() to pick the columns of strikes that we need. Then, we use group_by() to split the dataset into groups by year. Finally, we use summarise_all() to compute a summary statistic for every remaining column within each group. Since we want the means here, we pass mean to summarise_all(). The details can be found here: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8

It is worth mentioning that the pipe %>% takes the output of the previous function and passes it directly to the next one, which is useful when you need to apply many steps to the same data set and want to keep each step clear.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0          ✔ purrr   0.2.5     
## ✔ tibble  2.0.1          ✔ dplyr   0.7.8     
## ✔ tidyr   0.8.2          ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0.9000

## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()   masks plyr::arrange()
## ✖ purrr::compact()   masks plyr::compact()
## ✖ dplyr::count()     masks plyr::count()
## ✖ dplyr::failwith()  masks plyr::failwith()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::id()        masks plyr::id()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::mutate()    masks plyr::mutate()
## ✖ dplyr::rename()    masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()

years.avg.dplyr <- strikes %>%
  select(year, unemployment, inflation, strike.volume) %>%
  group_by(year) %>%
  summarise_all(mean)
str(years.avg.dplyr)

## Classes 'tbl_df', 'tbl' and 'data.frame':    35 obs. of  4 variables:
##  $ year         : int  1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
##  $ unemployment : num  3.09 3.68 3.59 3.51 3.04 ...
##  $ inflation    : num  13.09 5.79 1.33 1.83 1.29 ...
##  $ strike.volume: num  359 589 212 139 215 ...

head(years.avg.dplyr)

## # A tibble: 6 x 4
##    year unemployment inflation strike.volume
##   <int>        <dbl>     <dbl>         <dbl>
## 1  1951         3.09     13.1           359.
## 2  1952         3.68      5.79          589.
## 3  1953         3.59      1.33          212.
## 4  1954         3.51      1.83          139.
## 5  1955         3.04      1.29          215.
## 6  1956         3.03      3.71          562.
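
To make the role of the pipe concrete, the sketch below runs the same grouped mean twice on a made-up data frame, once with %>% and once as nested calls. It assumes dplyr 1.0.0 or later, where across() is the documented successor to summarise_all(); the output above was produced with dplyr 0.7.8, which predates across(). The name toy and its values are illustrative.

```r
library(dplyr)

# Toy data (invented values): two observations per year
toy <- data.frame(
  year = c(1951, 1951, 1952, 1952),
  unemployment = c(1, 3, 2, 6),
  inflation = c(10, 20, 4, 8)
)

# Piped: read top to bottom, one step per line
piped <- toy %>%
  group_by(year) %>%
  summarise(across(everything(), mean))

# Nested: the same calls, read from the inside out
nested <- summarise(group_by(toy, year), across(everything(), mean))

# Both forms produce the same table
all.equal(as.data.frame(piped), as.data.frame(nested))
```

The two versions compute the same thing; the piped form simply keeps the order of the steps visible.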

Comparison of base R and tidyverse for Data Manipulation

How do their features differ?

Base R and tidyverse can handle the same task and produce equivalent results. With base R, you perform the three steps (split, process each piece, and combine) one by one and store the intermediate results. With tidyverse, the same task is expressed as a single pipeline that returns the final result directly.

Another significant difference is the structure of the result. With base R, the result is a matrix in which the target variables are rows and the groups (years) are columns. With tidyverse, the result is a data frame in which the target variables are columns and the groups are the values of the first column.
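
The two layouts hold the same numbers, so one can be converted into the other. A minimal sketch in base R, using an invented matrix m shaped like the sapply() result, followed by aggregate(), a different base R function that returns the one-row-per-group layout directly. Both are offered as illustrations rather than as the method used in this post.

```r
# Matrix shaped like the sapply() result: variables in rows, years in columns
m <- matrix(
  c(2, 15, 4, 6), nrow = 2,
  dimnames = list(c("unemployment", "inflation"), c("1951", "1952"))
)

# Transpose so that years become rows, and restore year as a numeric column
# (the column names of m are character strings)
long <- data.frame(year = as.numeric(colnames(m)), t(m), row.names = NULL)
long

# Alternative: aggregate() goes from raw rows to one row per year in one call
toy <- data.frame(
  year = c(1951, 1951, 1952, 1952),
  unemployment = c(1, 3, 2, 6),
  inflation = c(10, 20, 4, 8)
)
agg <- aggregate(cbind(unemployment, inflation) ~ year, data = toy, FUN = mean)
agg
```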

Which tasks is each better suited for?

Trend chart

We can use base R to plot a trend chart of the yearly averages of the different variables.

As the ranges of the three variables (“unemployment”, “inflation”, “strike.volume”) are very different, we build a plot with two y-axes. Reference: https://www.r-bloggers.com/r-single-plot-with-two-different-y-axes/

par(mar = c(5, 4, 2, 4))
# Column names are character strings, so convert the years to numbers for the x-axis
years <- as.numeric(colnames(years.avg.apply))
max.rate <- max(years.avg.apply[1:2, ])
min.rate <- min(years.avg.apply[1:2, ])
# First axis: unemployment and inflation rates
plot(years, years.avg.apply[1, ], xlab = "Year", ylab = "Rate", 
     type = "o", col = "#234003", ylim = c(min.rate, max.rate))
points(years, years.avg.apply[2, ], type = "o", col = "#a61c00")
# Second axis for strike.volume
par(new = TRUE)
plot(years, years.avg.apply[3, ], type = "o", col = "#3d85c6", yaxt = "n", ann = FALSE)
axis(side = 4)
mtext(side = 4, line = 3, "Days")
legend("topright", c("Unemployment", "Inflation", "strike.volume"), fill = c("#234003", "#a61c00", "#3d85c6"), cex = .5)

tidyverse can also draw this trend chart, but since this is a simple task, I prefer base R.

Cleveland dot plot

The “ggplot2” package, included in tidyverse, is very helpful for drawing a Cleveland dot plot. First, we use tidyr::gather() to tidy the data, which means converting multiple columns into key-value pairs. Then, we use the ggplot2 grammar to draw the Cleveland dot plot. Recall that the different ranges of the variables still need to be handled by adding a second axis.

# Tidy data
years.avg.dplyr$strike.volume <- years.avg.dplyr$strike.volume / 50
years.avg.dplyr_tidy <- gather(years.avg.dplyr, key = "Features", value = "Avg", -year)
years.avg.dplyr_tidy$Features <- fct_relevel(years.avg.dplyr_tidy$Features, "strike.volume", after = Inf)
head(years.avg.dplyr_tidy)

## # A tibble: 6 x 3
##    year Features       Avg
##   <int> <fct>        <dbl>
## 1  1951 unemployment  3.09
## 2  1952 unemployment  3.68
## 3  1953 unemployment  3.59
## 4  1954 unemployment  3.51
## 5  1955 unemployment  3.04
## 6  1956 unemployment  3.03
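
In tidyr 1.0.0 and later, gather() is marked as superseded by pivot_longer(). A minimal sketch of the equivalent reshaping, assuming a recent tidyr and using a two-year excerpt of the rounded averages shown earlier (the name wide is illustrative):

```r
library(tidyr)

# Two-year excerpt of the yearly averages, one column per variable
wide <- data.frame(
  year = c(1951, 1952),
  unemployment = c(3.09, 3.68),
  inflation = c(13.09, 5.79)
)

# Every column except year becomes a (Features, Avg) key-value pair
long <- pivot_longer(wide, cols = -year,
                     names_to = "Features", values_to = "Avg")
long
```

One visible difference is row order: pivot_longer() keeps the rows of each year together, while gather() stacks the output one variable at a time, as in the output above.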

# Cleveland dot plot with multiple dots
ggplot(years.avg.dplyr_tidy, 
       aes(x = Avg, 
           y = fct_reorder2(as.factor(year), Features, -Avg))) + 
  geom_point(aes(col = Features)) + 
  ylab("years") +
  scale_x_continuous(
    "Rate", 
    sec.axis = sec_axis(~ . * 50, name = "Days")) +
  ggtitle("Trend Chart over Years")
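
The division by 50 in the tidying step and the sec_axis(~ . * 50) call are two halves of one transformation: the strike volumes are shrunk to fit the primary axis, and the secondary axis labels undo the shrinking. The arithmetic is sketched below with the first three yearly averages from the output above, rounded to two decimals. The ratio-of-maxima rule at the end is only one possible heuristic for choosing a factor and is an assumption of this sketch, not how the value 50 was chosen here.

```r
# First three yearly averages (1951-1953), rounded
strike.volume <- c(359.22, 588.67, 211.94)
inflation <- c(13.09, 5.79, 1.33)

scale.factor <- 50

# What gets plotted on the primary (rate) axis
scaled <- strike.volume / scale.factor

# What the secondary axis labels show after the inverse transform
restored <- scaled * scale.factor

# One heuristic for picking a factor: the ratio of the two maxima
suggested <- max(strike.volume) / max(inflation)
suggested
```

If the two factors do not match, the points are still drawn, but the secondary axis no longer reports the true strike volumes.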

Drawing this Cleveland dot plot, with several variables per year and a second axis, is harder using base R alone.
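
For a single variable, base R does provide dotchart() in the graphics package. A minimal sketch with the first three yearly unemployment averages from the output above, rounded to two decimals; it does not attempt the multi-variable, two-axis version drawn with ggplot2.

```r
# Yearly average unemployment for 1951-1953 (rounded), named by year
avg.unemployment <- c("1951" = 3.09, "1952" = 3.68, "1953" = 3.59)

# Sort so that the dots run from the lowest to the highest value
sorted <- sort(avg.unemployment)

# One dot per year; the vector names are used as row labels
dotchart(sorted, labels = names(sorted), pch = 19,
         xlab = "Average unemployment rate")
```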

What are the pros and cons of each?

base R

Pros are:

does not depend on other packages;
all the steps are clear and intuitive;
all the intermediary results can be easily obtained and changed.

Cons are:

all the intermediate results must be stored;
type conversions are sometimes needed;
groups become column names.

tidyverse