According to the R documentation, tidyr “is designed specifically for tidying data, not general reshaping (reshape2)”. The documentation also says tidyr is a replacement for reshape2. Indeed, when we check the development versions of both packages on GitHub, tidyr is still under active development, while the last commit to reshape2 was in 2017.
In this post, we compare these two packages to see what they have in common and how they complement each other. One common need in data reshaping and tidying is converting between long form and wide form. Let’s illustrate the meaning of “long” and “wide” with examples.
Most data we encounter today are probably in wide form. In wide form, the multiple measures of a single observation are stored in a single row. For example, the following crime data is in wide form.
## State Murder Assault UrbanPop Rape
## 1 Alabama 13.2 236 58 21.2
## 2 Alaska 10.0 263 48 44.5
## 3 Arizona 8.1 294 80 31.0
## 4 Arkansas 8.8 190 50 19.5
## 5 California 9.0 276 91 40.6
## 6 Colorado 7.9 204 78 38.7
For each state, we have 4 measures: Murder, Assault, UrbanPop and Rape. Then let’s look at the same dataset after being transformed into long form. In the long form, each row corresponds to one measure on one observation, as shown below.
## State Measure Value
## 1 Alabama Murder 13.2
## 2 Alaska Murder 10.0
## 3 Arizona Murder 8.1
## 4 Arkansas Murder 8.8
## 5 California Murder 9.0
## 6 Colorado Murder 7.9
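The post does not show how the crime data frame was created. The sketch below is one way to reproduce both tables, assuming that crime is the built-in USArrests dataset with its row names moved into a State column (an assumption based on the column names in the printed output). The name crime_long is only illustrative.

```r
library(tidyr)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
head(crime)        # wide form: one row per state, one column per measure

# Long form: one row per (state, measure) pair
crime_long <- gather(crime, Measure, Value, -State)
head(crime_long)
```

Each of the 50 states contributes four rows to the long form, one per measure.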
tidyr itself does not define long form and wide form. Instead, it classifies data as tidy or messy. The long-form data shown above is tidy because it satisfies three criteria:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
All other forms of data are called messy data.
At first, reshaping data into long (or tidy) form may seem odd, but such data is easier to work with when we need to summarise it with dplyr.
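As a rough illustration of why long form pairs well with dplyr, the sketch below computes per-measure summaries with a single group_by() and summarise(). It assumes crime is built from the built-in USArrests dataset (not confirmed by the post), and the names crime_long and crime_summary are illustrative.

```r
library(tidyr)
library(dplyr)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
crime_long <- gather(crime, Measure, Value, -State)

# One grouped summary covers every measure at once
crime_summary <- crime_long %>%
  group_by(Measure) %>%
  summarise(Mean = mean(Value), Max = max(Value))
crime_summary
```

With the wide form, the same summary would need one expression per column.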
Now that we have defined some key terms, let’s compare functions from the two packages that perform similar tasks.
Both gather() from tidyr and melt() from reshape2 transform data from wide form to long form.
Here we continue to use the crime dataset. We first run both functions on the data without any further arguments.
head(gather(crime))
## key value
## 1 State Alabama
## 2 State Alaska
## 3 State Arizona
## 4 State Arkansas
## 5 State California
## 6 State Colorado
head(melt(crime))
## Using State as id variables
## State variable value
## 1 Alabama Murder 13.2
## 2 Alaska Murder 10.0
## 3 Arizona Murder 8.1
## 4 Arkansas Murder 8.8
## 5 California Murder 9.0
## 6 Colorado Murder 7.9
We can see that gather() with no column selection gathers every column, including State, into key-value pairs (head() shows only the first rows, which come from the State column). This is not what we want.
On the other hand, melt() treats “State” as an “id variable” and produces a correct long-form dataset. When melt() does not receive any id variables, it uses the factor and character columns as id variables.
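To see what each default call does beyond the first six rows, it helps to look at the row counts and the distinct keys. This is a small check, again assuming crime is built from the built-in USArrests dataset; all_gathered and molten are illustrative names.

```r
library(tidyr)
library(reshape2)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)

# gather() with no column selection stacks every column, State included
all_gathered <- gather(crime)
unique(all_gathered$key)
nrow(all_gathered)   # number of columns times number of rows

# melt() keeps the non-numeric column as the id and stacks only the measures
molten <- melt(crime)
nrow(molten)
```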
To get the same result from both functions, we need to specify the id variable explicitly, as below.
crime.l <- gather(crime,Measure,Value,-State)
head(crime.l)
## State Measure Value
## 1 Alabama Murder 13.2
## 2 Alaska Murder 10.0
## 3 Arizona Murder 8.1
## 4 Arkansas Murder 8.8
## 5 California Murder 9.0
## 6 Colorado Murder 7.9
head(melt(crime,variable.name="Measure",value.name="Value",id.vars = "State"))
## State Measure Value
## 1 Alabama Murder 13.2
## 2 Alaska Murder 10.0
## 3 Arizona Murder 8.1
## 4 Arkansas Murder 8.8
## 5 California Murder 9.0
## 6 Colorado Murder 7.9
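The two results can also be compared programmatically rather than by eye. The sketch below assumes crime is built from the built-in USArrests dataset. One difference it accounts for is that melt() returns the variable column as a factor, while gather() returns a character column by default.

```r
library(tidyr)
library(reshape2)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)

long_tidyr <- gather(crime, Measure, Value, -State)
long_reshape2 <- melt(crime, id.vars = "State",
                      variable.name = "Measure", value.name = "Value")

# Align the column types before comparing
long_reshape2$Measure <- as.character(long_reshape2$Measure)
all.equal(long_tidyr, long_reshape2, check.attributes = FALSE)
```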
However, the two functions give the same output only for data frames. As shown below, gather() does not work on arrays or matrices, whereas melt() does.
set.seed(6)
matrices <- matrix(rnorm(6),ncol=2)
matrices
## [,1] [,2]
## [1,] 0.2696060 1.72719552
## [2,] -0.6299854 0.02418764
## [3,] 0.8686598 0.36802518
# gather(matrices)  # commented out so that the document knits
# If uncommented, it fails with: 'Error in UseMethod("gather_") : no applicable method for 'gather_' applied to an object of class "c('matrix', 'double', 'numeric')"'
melt(matrices)
## Var1 Var2 value
## 1 1 1 0.26960598
## 2 2 1 -0.62998541
## 3 3 1 0.86865983
## 4 1 2 1.72719552
## 5 2 2 0.02418764
## 6 3 2 0.36802518
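If only tidyr is available, one possible workaround is to convert the matrix to a data frame first and carry the row index along as an ordinary column. This is a sketch rather than a feature of tidyr; the names m, m_df, m_long, row and column are illustrative.

```r
library(tidyr)

set.seed(6)
m <- matrix(rnorm(6), ncol = 2)

# as.data.frame() names the columns V1, V2 when the matrix has no column names
m_df <- as.data.frame(m)
m_df$row <- seq_len(nrow(m_df))   # keep the row index as a regular column

m_long <- gather(m_df, column, value, -row)
m_long
```

The result has one row per matrix cell, much like the output of melt(), except that the column index is stored as the names V1 and V2 rather than as integers.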
Both spread() from tidyr and dcast() from reshape2 transform data from long form to wide form.
crime.dcast <- dcast(crime.l,State ~ Measure, value.var = "Value")
head(crime.dcast)
## State Assault Murder Rape UrbanPop
## 1 Alabama 236 13.2 21.2 58
## 2 Alaska 263 10.0 44.5 48
## 3 Arizona 294 8.1 31.0 80
## 4 Arkansas 190 8.8 19.5 50
## 5 California 276 9.0 40.6 91
## 6 Colorado 204 7.9 38.7 78
The dcast() function uses a formula to describe the shape of the data. The variables on the left of the ‘~’ (here State) are the ID variables, and the variables on the right of the ‘~’ are the ones whose values are swung into column names. The value.var argument names the column that holds the measured values.
crime.spread <- spread(crime.l,Measure,Value)
head(crime.spread)
## State Assault Murder Rape UrbanPop
## 1 Alabama 236 13.2 21.2 58
## 2 Alaska 263 10.0 44.5 48
## 3 Arizona 294 8.1 31.0 80
## 4 Arkansas 190 8.8 19.5 50
## 5 California 276 9.0 40.6 91
## 6 Colorado 204 7.9 38.7 78
The spread() function is the complement of gather(). Its key and value arguments are column names or positions. Unlike dcast(), spread() needs no formula argument.
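The printed outputs of dcast() and spread() look the same, and this can be checked directly. The sketch below assumes crime is built from the built-in USArrests dataset; wide_dcast and wide_spread are illustrative names.

```r
library(tidyr)
library(reshape2)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
crime_long <- gather(crime, Measure, Value, -State)

wide_dcast  <- dcast(crime_long, State ~ Measure, value.var = "Value")
wide_spread <- spread(crime_long, Measure, Value)

# Compare values only; attributes such as row names are ignored
all.equal(wide_dcast, wide_spread, check.attributes = FALSE)
```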
In addition, reshape2 has an acast() function that works like dcast(). The only difference is that acast() returns a vector, matrix, or array, while dcast() returns a data frame. In contrast, spread() neither accepts nor returns an array or matrix, and tidyr has no other function that does.
# spread(m,Var1,Var2,value)  # commented out so that the document knits
# If uncommented, it fails with: 'Error in UseMethod("spread_") : no applicable method for 'spread_' applied to an object of class "c('matrix', 'double', 'numeric')"'
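A quick way to see what acast() produces is to cast the long crime data and inspect the class of the result. The sketch assumes crime is built from the built-in USArrests dataset; crime_matrix is an illustrative name.

```r
library(tidyr)
library(reshape2)

# Assumption: crime is USArrests with the row names turned into a State column
crime <- data.frame(State = rownames(USArrests), USArrests, row.names = NULL)
crime_long <- gather(crime, Measure, Value, -State)

# Same formula as dcast(), but the result is a matrix with dimnames
crime_matrix <- acast(crime_long, State ~ Measure, value.var = "Value")
is.matrix(crime_matrix)
dim(crime_matrix)
crime_matrix["Alaska", "Murder"]
```

Because the states become row names and the measures become column names, individual cells can be looked up by name.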
Sometimes we have messy data where two variables are stored in the same column. In this case, we may want to split the column into multiple columns.
## Name Gender_Score
## 1 John Male_89
## 2 Brandon Male_75
## 3 Anna Female_80
In this dataset, the two variables “gender” and “score” are stored in one column for each person. Using colsplit() from reshape2, we can split this column.
tidy <- colsplit(messy$Gender_Score, "_", c("Gender", "Score"))
tidy
## Gender Score
## 1 Male 89
## 2 Male 75
## 3 Female 80
Notice that colsplit() returns only the split columns. It does not keep the other columns of the original dataset. To get the “Name” column back, we need to combine the columns ourselves.
tidy <- cbind(messy$Name, tidy)
colnames(tidy)[1] <- "Name"
tidy
## Name Gender Score
## 1 John Male 89
## 2 Brandon Male 75
## 3 Anna Female 80
In comparison, separate() from tidyr keeps all the other columns and gives us the option (remove = TRUE) to drop the column being split, which saves us some time.
separate(messy, Gender_Score, into = c("Gender", "Score"), sep = "_", remove = TRUE)
## Name Gender Score
## 1 John Male 89
## 2 Brandon Male 75
## 3 Anna Female 80
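The messy data frame is not constructed in the post, so the sketch below rebuilds it from the printed table (an assumption about how it was created). It also shows one difference that the printed output hides: separate() leaves the new columns as character unless convert = TRUE is passed, whereas colsplit() converts column types on its own.

```r
library(tidyr)
library(reshape2)

# Assumption: messy reconstructed from the printed table
messy <- data.frame(
  Name = c("John", "Brandon", "Anna"),
  Gender_Score = c("Male_89", "Male_75", "Female_80"),
  stringsAsFactors = FALSE
)

split_cols    <- colsplit(messy$Gender_Score, "_", c("Gender", "Score"))
sep_default   <- separate(messy, Gender_Score, into = c("Gender", "Score"), sep = "_")
sep_converted <- separate(messy, Gender_Score, into = c("Gender", "Score"),
                          sep = "_", convert = TRUE)

# Compare the type of Score produced by each approach
sapply(list(colsplit = split_cols$Score,
            separate = sep_default$Score,
            separate_convert = sep_converted$Score), class)
```

If Score is going to be used in arithmetic or summaries, convert = TRUE avoids a separate as.numeric() step.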
As stated before, tidyr is a package specifically designed for data tidying. Therefore, in addition to data reshaping, it has quite a few functions that help you clean data: complete(), drop_na(), etc. These functions can be particularly useful when you use tidyr and dplyr together. Let’s say we have a dataset that records the number of customers visiting a store.
## week time n_of_customers
## 1 week1 morning 267
## 2 week2 afternoon 25
## 3 week1 afternoon 199
It seems we are missing one possible combination: week2 and morning. This is probably because nobody visited the store in the mornings of week2. However, we still want to see this combination in our dataset. The following code does that.
complete(visit,week,time, fill = list(n_of_customers=0))
## # A tibble: 4 x 3
## week time n_of_customers
## <fct> <fct> <dbl>
## 1 week1 afternoon 199
## 2 week1 morning 267
## 3 week2 afternoon 25
## 4 week2 morning 0
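The visit data frame is not constructed in the post either. The sketch below rebuilds it from the printed table (the factor and double column types are inferred from the tibble header) and shows how complete() and drop_na() relate: without fill, the missing combination appears with NA, and drop_na() removes such rows again. The name completed is illustrative.

```r
library(tidyr)

# Assumption: visit reconstructed from the printed table
visit <- data.frame(
  week = factor(c("week1", "week2", "week1")),
  time = factor(c("morning", "afternoon", "afternoon")),
  n_of_customers = c(267, 25, 199)
)

# Without fill, the missing week2/morning combination gets NA
completed <- complete(visit, week, time)
completed

# drop_na() removes rows that contain NA, undoing the completion here
drop_na(completed)

# With fill, the missing count is replaced by 0 instead of NA
complete(visit, week, time, fill = list(n_of_customers = 0))
```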
As for reshape2, it has some functions for general data reshaping that have no equivalents in tidyr. Notice that gather() and spread() work only on data frames. However, reshape2 lets us work with other data types such as arrays and lists, because melt() dispatches to methods like melt.array() and melt.list().
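As a small illustration of melt() on non-data-frame inputs, the sketch below melts a named list and a three-dimensional array. The objects scores and cube are hypothetical examples, not data from the post.

```r
library(reshape2)

# Hypothetical list of vectors with different lengths
scores <- list(a = 1:3, b = 4:5)
molten_list <- melt(scores)    # one row per element, with the list name in L1
molten_list

# Hypothetical 2 x 2 x 2 array
cube <- array(1:8, dim = c(2, 2, 2))
molten_array <- melt(cube)     # one row per cell, one index column per dimension
molten_array
```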
In conclusion, tidyr and reshape2 complement each other in the following ways:
gather() vs melt(): similar, but melt() can identify the id variables automatically when none are passed to the function.
spread() vs dcast(): similar.
separate() vs colsplit(): separate() keeps the other columns after splitting.
reshape2 has functions that work with arrays and lists, while tidyr functions work only with data frames.
tidyr has additional functions for cleaning data: complete(), drop_na(), etc.