tidyr vs reshape2

1. General Introduction

According to the R documentation, tidyr “is designed specifically for tidying data, not general reshaping (reshape2)”. The documentation also describes tidyr as a replacement for reshape2. Indeed, when we check the development versions of both packages on GitHub, tidyr is still under active development at the time of writing, while the last commit to reshape2 was in 2017.

In this post, we will try to compare these two packages to see what they have in common and how they complement each other. One common need in data reshaping/tidying is the transformation between long form and wide form. Let’s illustrate the meaning of “long” and “wide” with examples.

Most data we encounter today is probably in wide form. In wide form, the multiple measures of a single observation are stored in a single row. For example, the following crime data is in wide form.

##        State Murder Assault UrbanPop Rape
## 1    Alabama   13.2     236       58 21.2
## 2     Alaska   10.0     263       48 44.5
## 3    Arizona    8.1     294       80 31.0
## 4   Arkansas    8.8     190       50 19.5
## 5 California    9.0     276       91 40.6
## 6   Colorado    7.9     204       78 38.7
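The rows above match R's built-in USArrests dataset with the state names moved from the row names into a column. The post does not show how crime was created, so the following is only a sketch under that assumption:

```r
# Assumption: `crime` is the built-in USArrests data with its row names
# turned into a regular State column. This construction is not in the post.
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
head(crime)
```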

For each state, we have 4 measures: Murder, Assault, UrbanPop and Rape. Then let’s look at the same dataset after being transformed into long form. In the long form, each row corresponds to one measure on one observation, as shown below.

##        State Measure Value
## 1    Alabama  Murder  13.2
## 2     Alaska  Murder  10.0
## 3    Arizona  Murder   8.1
## 4   Arkansas  Murder   8.8
## 5 California  Murder   9.0
## 6   Colorado  Murder   7.9

tidyr itself does not define long form and wide form. Instead, it classifies data as tidy or messy. The long-form data shown above is tidy because it satisfies three criteria:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

All other forms of data are called messy data.

At first, reshaping data into long form (or tidy form) may seem weird, but such data is easier to work with when we need to summarise data with dplyr.
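As a minimal sketch of why long form suits dplyr, the code below computes the mean of every measure with a single grouped summary. It assumes crime is USArrests with a State column (the post does not show its construction), and the names crime_long and measure_means are illustrative.

```r
library(tidyr)
library(dplyr)

# Assumption: `crime` is USArrests with a State column (not shown in the post).
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)

# Long form: one row per state and measure.
crime_long <- gather(crime, Measure, Value, -State)

# One grouped summary covers all four measures at once.
measure_means <- crime_long %>%
  group_by(Measure) %>%
  summarise(Mean = mean(Value))
measure_means
```

In wide form the same summary would need one expression per column, or a column-wise helper.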

Now that we have defined some key terms, let’s start comparing functions from the two packages that perform similar tasks.

2. gather() vs melt()

Both functions transform data from wide form to long form.

Here we continue to use the crime dataset. We first compare the two by running each function on the data without any further arguments.

head(gather(crime))

##     key      value
## 1 State    Alabama
## 2 State     Alaska
## 3 State    Arizona
## 4 State   Arkansas
## 5 State California
## 6 State   Colorado

head(melt(crime))

## Using State as id variables

##        State variable value
## 1    Alabama   Murder  13.2
## 2     Alaska   Murder  10.0
## 3    Arizona   Murder   8.1
## 4   Arkansas   Murder   8.8
## 5 California   Murder   9.0
## 6   Colorado   Murder   7.9

We can see that, without a column selection, gather() stacks every column, State included, into key-value pairs. The first rows of the result therefore hold the state names, and State is no longer available as an identifier, which is not what we want.

On the other hand, melt() treats “State” as an “id variable” and produces a correct long-form dataset. It seems that if melt() does not receive any id.vars, it uses the factor or character columns as id variables.
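The two defaults can be checked by counting rows. This is a sketch that again assumes crime is USArrests with a State column; all_gathered and molten are illustrative names.

```r
library(tidyr)
library(reshape2)

# Assumption: `crime` is USArrests with a State column (not shown in the post).
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)

# gather() with no column selection stacks all five columns, State included.
all_gathered <- gather(crime)
table(all_gathered$key)    # row count per original column

# melt() with no id.vars keeps the non-numeric column as the identifier.
molten <- melt(crime)
levels(molten$variable)    # only the numeric measure columns are melted
```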

To get the same result from both functions, we need to tell each of them that State is the id variable, as below.

crime.l <- gather(crime,Measure,Value,-State)
head(crime.l)

##        State Measure Value
## 1    Alabama  Murder  13.2
## 2     Alaska  Murder  10.0
## 3    Arizona  Murder   8.1
## 4   Arkansas  Murder   8.8
## 5 California  Murder   9.0
## 6   Colorado  Murder   7.9

# The argument is value.name (singular); a misspelled value.names is silently ignored.
head(melt(crime, id.vars = "State", variable.name = "Measure", value.name = "Value"))

##        State Measure Value
## 1    Alabama  Murder  13.2
## 2     Alaska  Murder  10.0
## 3    Arizona  Murder   8.1
## 4   Arkansas  Murder   8.8
## 5 California  Murder   9.0
## 6   Colorado  Murder   7.9

However, the two functions give the same output only on data frames. As shown below, gather() does not work on arrays or matrices, but melt() does.

set.seed(6)
matrices <- matrix(rnorm(6),ncol=2)
matrices

##            [,1]       [,2]
## [1,]  0.2696060 1.72719552
## [2,] -0.6299854 0.02418764
## [3,]  0.8686598 0.36802518

# gather(matrices)   # commented out so that the document knits
# If uncommented, it fails with: Error in UseMethod("gather_") : no applicable method for 'gather_' applied to an object of class "c('matrix', 'double', 'numeric')"

melt(matrices)

##   Var1 Var2       value
## 1    1    1  0.26960598
## 2    2    1 -0.62998541
## 3    3    1  0.86865983
## 4    1    2  1.72719552
## 5    2    2  0.02418764
## 6    3    2  0.36802518
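If we want to stay within tidyr, one possible workaround is to convert the matrix to a data frame first. The sketch below is not from the original comparison; row_id, col_id and value are illustrative names.

```r
library(tidyr)

set.seed(6)
matrices <- matrix(rnorm(6), ncol = 2)

# Convert to a data frame and keep the row index as an explicit column.
# as.data.frame() names the columns V1 and V2 for a matrix without dimnames.
matrix_df <- as.data.frame(matrices)
matrix_df$row_id <- seq_len(nrow(matrix_df))

# Gather every column except the row index into key-value pairs.
matrix_long <- gather(matrix_df, col_id, value, -row_id)
matrix_long
```

Unlike melt(), this keeps the column labels as the strings V1 and V2 rather than integer positions.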

3. spread() vs dcast()

Both functions transform data from long form to wide form.

crime.dcast <- dcast(crime.l,State ~ Measure, value.var = "Value")
head(crime.dcast)

##        State Assault Murder Rape UrbanPop
## 1    Alabama     236   13.2 21.2       58
## 2     Alaska     263   10.0 44.5       48
## 3    Arizona     294    8.1 31.0       80
## 4   Arkansas     190    8.8 19.5       50
## 5 California     276    9.0 40.6       91
## 6   Colorado     204    7.9 38.7       78

The dcast() function uses a formula to describe the shape of the output. The variables to the left of ‘~’ (here State) are the id variables, and the variable to the right of ‘~’ (here Measure) supplies the new column names. The value.var argument names the column that holds the values.
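The formula can also hold more than one id variable on its left-hand side. Here is a small sketch on made-up data; the scores data frame and its column names are hypothetical and not part of the crime example.

```r
library(reshape2)

# Hypothetical long-form data: two id variables (student, term) and one
# variable (subject) whose values become the new column names.
scores <- data.frame(
  student = c("Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob"),
  term    = c("T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2"),
  subject = c("math", "art", "math", "art", "math", "art", "math", "art"),
  score   = c(90, 85, 92, 88, 70, 75, 72, 78)
)

# Left of ~ : id variables kept as columns. Right of ~ : variable to spread.
scores_wide <- dcast(scores, student + term ~ subject, value.var = "score")
scores_wide
```

The result has one row per student and term, and one column per subject.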

crime.spread <- spread(crime.l,Measure,Value)
head(crime.spread)

##        State Assault Murder Rape UrbanPop
## 1    Alabama     236   13.2 21.2       58
## 2     Alaska     263   10.0 44.5       48
## 3    Arizona     294    8.1 31.0       80
## 4   Arkansas     190    8.8 19.5       50
## 5 California     276    9.0 40.6       91
## 6   Colorado     204    7.9 38.7       78

The spread() function is the complement of gather(). Its key and value arguments are column names or positions. Unlike dcast(), spread() needs no formula argument.
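A quick way to see that the two functions undo each other is a round trip. This sketch assumes crime is USArrests with a State column; crime_long and crime_wide are illustrative names.

```r
library(tidyr)

# Assumption: `crime` is USArrests with a State column (not shown in the post).
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)

crime_long <- gather(crime, Measure, Value, -State)
crime_wide <- spread(crime_long, Measure, Value)

# Same shape and values as the original; the measure columns may come back
# in a different order, so compare them by name.
dim(crime_wide)
all.equal(crime_wide$Murder, crime$Murder)
```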

In addition, reshape2 has an acast() function that works like dcast(). The only difference is the output: acast() returns a vector, matrix, or array, while dcast() returns a data frame. In contrast, spread() neither accepts nor returns an array or matrix, and tidyr has no similar function that does.

# spread(matrices, Var1, Var2, value)   # commented out so that the document knits
# If uncommented, it fails with: Error in UseMethod("spread_") : no applicable method for 'spread_' applied to an object of class "c('matrix', 'double', 'numeric')"
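For comparison, here is a short sketch of acast() on the crime data. It is given the same kind of long data frame as dcast(), and the result is a matrix rather than a data frame. The construction of crime is an assumption, and crime_matrix is an illustrative name.

```r
library(reshape2)

# Assumption: `crime` is USArrests with a State column (not shown in the post).
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
crime_long <- melt(crime, id.vars = "State")

# acast() returns a matrix, with State as row names and the measures as
# column names.
crime_matrix <- acast(crime_long, State ~ variable, value.var = "value")
is.matrix(crime_matrix)
crime_matrix["Alabama", "Murder"]
```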

4. separate() vs colsplit()

Sometimes we have messy data where two variables are stored in the same column. In this case, we may want to split the column into multiple columns.

##      Name Gender_Score
## 1    John      Male_89
## 2 Brandon      Male_75
## 3    Anna    Female_80

In this dataset, we can see that for each person, the measures “gender” and “score” are stored in one column. Using colsplit() from reshape2, we can split this column.

tidy <- colsplit(messy$Gender_Score, "_", c("Gender", "Score"))
tidy

##   Gender Score
## 1   Male    89
## 2   Male    75
## 3 Female    80

Notice that colsplit() only returns the split columns. It does not automatically keep the other columns of the original dataset. To get the “Name” column back, we need to combine the columns ourselves.

tidy <- cbind(messy$Name, tidy)
colnames(tidy)[1] <- "Name"
tidy

##      Name Gender Score
## 1    John   Male    89
## 2 Brandon   Male    75
## 3    Anna Female    80

In comparison, separate() from tidyr keeps all the other columns and replaces only the column being split, which saves us some time.

separate(messy, Gender_Score, into = c("Gender", "Score"), sep = "_", remove = TRUE)

##      Name Gender Score
## 1    John   Male    89
## 2 Brandon   Male    75
## 3    Anna Female    80
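One detail the printed output hides is the column type: separate() leaves the new columns as character unless asked otherwise. The sketch below rebuilds messy from the printed rows (an assumption, since the post does not show how it was created) and uses convert = TRUE to get a numeric Score.

```r
library(tidyr)

# Assumption: `messy` is rebuilt here from the rows printed in the post.
messy <- data.frame(
  Name = c("John", "Brandon", "Anna"),
  Gender_Score = c("Male_89", "Male_75", "Female_80")
)

# convert = TRUE asks separate() to guess column types after splitting,
# so Score becomes a number instead of a character string.
tidy_converted <- separate(messy, Gender_Score, into = c("Gender", "Score"),
                           sep = "_", convert = TRUE)
str(tidy_converted)
```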

5. Bigger picture

As stated before, tidyr is a package designed specifically for data tidying. Therefore, in addition to data reshaping, it has quite a few functions that help you clean data, such as complete() and drop_na(). These functions are particularly useful when you use tidyr and dplyr together. Let’s say we have a dataset that records the number of customers visiting a store.

##    week      time n_of_customers
## 1 week1   morning            267
## 2 week2 afternoon             25
## 3 week1 afternoon            199

It seems we are missing one possible combination: week2 and morning. This is probably because nobody visited the store in the mornings of week2. However, we still want to see this combination in our dataset. The following code adds it.

complete(visit,week,time, fill = list(n_of_customers=0))

## # A tibble: 4 x 3
##   week  time      n_of_customers
##   <fct> <fct>              <dbl>
## 1 week1 afternoon            199
## 2 week1 morning              267
## 3 week2 afternoon             25
## 4 week2 morning                0
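To see how complete() and drop_na() relate, the sketch below leaves out the fill argument. The visit data frame is rebuilt from the printed rows (an assumption), and visit_full is an illustrative name.

```r
library(tidyr)

# Assumption: `visit` is rebuilt here from the rows printed in the post.
visit <- data.frame(
  week = c("week1", "week2", "week1"),
  time = c("morning", "afternoon", "afternoon"),
  n_of_customers = c(267, 25, 199)
)

# Without `fill`, the missing combination is added with NA.
visit_full <- complete(visit, week, time)
visit_full

# drop_na() removes rows that contain NA, which takes us back to three rows.
drop_na(visit_full)
```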

As for reshape2, it does have some functions for general data reshaping that have no equivalents in tidyr. Notice that gather() and spread() only work on data frames. However, reshape2 allows us to work with other data types, such as arrays and lists, through methods like melt.array() and melt.list().
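As a small sketch of these methods, melt() can be called directly on an array or a list, and it dispatches to the matching method. The objects arr and lst are made up for illustration.

```r
library(reshape2)

# A 3-dimensional array: one row per cell, one VarN column per dimension.
arr <- array(1:24, dim = c(2, 3, 4))
arr_long <- melt(arr)
head(arr_long)

# A list of vectors: the values plus a column recording the list element.
lst <- list(a = 1:3, b = 4:5)
lst_long <- melt(lst)
lst_long
```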

6. Conclusion

In conclusion, tidyr and reshape2 do complement each other in the following ways:

gather() and melt(): similar, but melt() can automatically identify id variables without further arguments passed into the function.
spread() and dcast(): similar.
separate() and colsplit(): similar, but separate() keeps the other columns after splitting.
reshape2 has functions that work with arrays and lists, while the tidyr functions only work with data frames.
tidyr has additional functions to clean data: complete(), drop_na(), etc.