53 Comparison among base R, tidyverse, and data.table
Siyue Han
# Load the tidyverse package
library(tidyverse)
# Load the data.table package
library(data.table)
53.1 Introduction
There are many ways to read and analyze data in R, and the data.frame provided in base R can handle most situations. I have therefore used it most of the time so far, occasionally using the tibble from the tidyverse. But once, when dealing with a large csv, I found data.frame painfully slow. With the help of Google, I tried data.table for the first time, and it was amazing. I would therefore like to share this way of reading and analyzing data with more people.
53.2 Reading Data
First, let’s see the performance of reading csv data among these three environments.
53.2.1 Data.Frame
This is an example of reading csv into a data.frame.
start <- Sys.time()
df <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 3.001485 secs
53.2.2 Tibble
This is an example of reading csv into a tibble.
start <- Sys.time()
tb <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 1.689931 secs
53.2.3 Data.Table
This is an example of reading csv into a data.table.
start <- Sys.time()
dt <- fread("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 0.2768054 secs
As we can see, data.table reads a csv file very quickly, especially when the file is large. The tibble reader from the tidyverse is noticeably faster than base R's data.frame, but still much slower than data.table.
53.3 Processing Data
Next, let's compare how these three environments process data.
53.3.1 Selecting column(s) and row(s)
## Time difference of 0.002945662 secs
start <- Sys.time()
x2 <- select(tb, Lat, Long_) %>% slice(101:110)
end <- Sys.time()
print(end - start)
## Time difference of 0.02623582 secs
## Time difference of 0.01489043 secs
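Only the tidyverse code for this step appears above; the first and last timings belong to the base R and data.table versions, whose code is not shown. Hedged sketches of equivalent operations (assuming the same Lat and Long_ columns and rows 101 to 110 as in the tidyverse version) would be:

```r
# Base R: subset two columns, then take rows 101-110
start <- Sys.time()
x1 <- df[101:110, c("Lat", "Long_")]
end <- Sys.time()
print(end - start)

# data.table: the same selection in data.table syntax
start <- Sys.time()
x3 <- dt[101:110, .(Lat, Long_)]
end <- Sys.time()
print(end - start)
```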
53.3.2 Filtering row(s)
## Time difference of 0.02817774 secs
## Time difference of 0.04359889 secs
## Time difference of 0.006566763 secs
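The code producing these three timings is not shown. Hedged sketches of one plausible filtering task in each environment (the Province_State condition and the value "New York" are assumptions, not the author's original filter) would be:

```r
# Base R: logical indexing on a column
start <- Sys.time()
x1 <- df[df$Province_State == "New York", ]
end <- Sys.time()
print(end - start)

# tidyverse: filter() with the same condition
start <- Sys.time()
x2 <- filter(tb, Province_State == "New York")
end <- Sys.time()
print(end - start)

# data.table: the condition goes directly in the i position
start <- Sys.time()
x3 <- dt[Province_State == "New York"]
end <- Sys.time()
print(end - start)
```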
53.3.3 Sorting the table
## Time difference of 0.03687096 secs
## Time difference of 0.5127223 secs
## Time difference of 0.009575844 secs
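Again the code behind these timings is missing. Hedged sketches of an equivalent sort in each environment (sorting descending by the 10/31/21 column, which read.csv renames to X10.31.21, is an assumption) would be:

```r
# Base R: order() returns row indices, used to reorder the data.frame
start <- Sys.time()
x1 <- df[order(-df$X10.31.21), ]
end <- Sys.time()
print(end - start)

# tidyverse: arrange() with desc() for descending order
start <- Sys.time()
x2 <- arrange(tb, desc(`10/31/21`))
end <- Sys.time()
print(end - start)

# data.table: order() inside the i position
start <- Sys.time()
x3 <- dt[order(-`10/31/21`)]
end <- Sys.time()
print(end - start)
```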
53.3.4 Summarizing columns by group
start <- Sys.time()
x1 <- aggregate(df$`X10.31.21`, list(df$Province_State), sum)
end <- Sys.time()
print(end - start)
## Time difference of 0.006486416 secs
start <- Sys.time()
x2 <- group_by(tb, Province_State) %>% summarise(sum(`10/31/21`))
end <- Sys.time()
print(end - start)
## Time difference of 0.03093123 secs
start <- Sys.time()
x3 <- dt[ , lapply(.(`10/31/21`), sum), by = Province_State]
end <- Sys.time()
print(end - start)
## Time difference of 0.004686117 secs
53.3.5 Pivoting longer
start <- Sys.time()
x1 <- reshape(df,
varying = 12:dim(df)[2],
timevar = "Date", v.names="Cases",
direction = "long")
x1 <- x1[, c('Combined_Key', 'Date', 'Cases')]
end <- Sys.time()
print(end - start)
## Time difference of 27.89158 secs
start <- Sys.time()
x2 <- pivot_longer(tb,
names_to = "Date",
values_to = "Cases",
-(1:11)) %>%
select(Combined_Key, Date, Cases)
end <- Sys.time()
print(end - start)
## Time difference of 1.585238 secs
start <- Sys.time()
x3 <- melt(dt,
id.vars = 1:11,
variable.name = "Date",
value.name = "Cases")[
, .(Combined_Key, Date, Cases)]
end <- Sys.time()
print(end - start)
## Time difference of 1.465139 secs
53.3.6 Joining tables
## Time difference of 39.40367 secs
start <- Sys.time()
x2 <- left_join(select(tb, 1:11), x2, by = "Combined_Key")
end <- Sys.time()
print(end - start)
## Time difference of 2.279808 secs
## Time difference of 0.5547302 secs
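Only the tidyverse join survives above; the first and last timings come from the base R and data.table versions. Hedged sketches of equivalent joins (reusing the long tables x1 and x3 produced in the pivoting section; the column choices mirror the tidyverse version and are assumptions) would be:

```r
# Base R: merge the first 11 columns back onto the long table
start <- Sys.time()
x1 <- merge(df[, 1:11], x1, by = "Combined_Key")
end <- Sys.time()
print(end - start)

# data.table: join via the i position with an on = clause
start <- Sys.time()
x3 <- dt[, 1:11][x3, on = "Combined_Key"]
end <- Sys.time()
print(end - start)
```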
53.3.7 Chaining structures
Base R has no chaining structure like those in the tidyverse or data.table, so here we compare chaining in the tidyverse and data.table only.
The tidyverse uses the pipe %>% to connect operations.
data.table chains bracketed operations back to back, as in [...][...].
start <- Sys.time()
x2 <- tb %>%
mutate(year = `10/31/21` - `10/31/20`) %>%
group_by(Province_State) %>%
summarise(year = sum(year)) %>%
arrange(-year)
end <- Sys.time()
print(end - start)
## Time difference of 0.04686999 secs
start <- Sys.time()
x3 <- dt[, year := `10/31/21` - `10/31/20`, ][
order(-year), .(year = sum(year)), by = Province_State]
end <- Sys.time()
print(end - start)
## Time difference of 0.008189201 secs
From all of the above, we can see that base R finishes simple operations such as selecting, filtering, and sorting very quickly, but complex operations such as pivoting and joining cost it a huge amount of time. Comparing the tidyverse and data.table, data.table is somewhat faster in almost every task. With chaining in particular, data.table finishes much faster than the tidyverse, probably because data.table combines several operations in a single bracketed expression: in the example above, one bracketed operation does the work of the tidyverse's group_by, summarise, and arrange. On the other hand, since each tidyverse function does only one task, and that task is easy to infer from the function's name, tidyverse code is more readable than data.table code.
53.4 Summary of key functions
Task | base | tidyverse | data.table |
---|---|---|---|
Supported data class | data.frame | tibble | data.table |
Reading data | read.csv | read_csv | fread |
Subset by column | [ , …] | select() | [ ,… , ] |
Subset by rows | [… , ] | filter() | [… , , ] |
Create new column | df$y = … | mutate(tb, y = …) | [ , y := …, ] |
Delete a column | df$y = NULL | select(tb, -y) | [ , y := NULL, ] |
Summarize | apply(df[ , y], 2, …) | summarise() | [ , …(y), ] |
Grouping | aggregate() | group_by() | [ , , by = …] |
Pivot to long | reshape() | pivot_longer() | melt() |
Pivot to wide | reshape() | pivot_wider() | dcast() |
Joining tables | merge() | left_join() | dt1[dt2, on = …] |
53.5 Conclusion
The motivation for this project was that one day, while dealing with a large csv, I found data.frame very slow. With the help of Google, I tried data.table for the first time, and it was amazing. I therefore wanted to compare its performance against base R and the tidyverse for reading and analyzing data.
From this project, I learned how to use data.table to analyze data, as well as the advantages of each of the three approaches. data.frame in base R is the most convenient way to handle everyday data analysis tasks, but it takes too much time when the data is large or the operation is complex. In such cases, it is better to use data.table or the tidyverse. When handling a very large dataset, data.table is a good choice since it runs extremely fast. When speed matters less, especially when collaborating with others, the tidyverse is a good choice since its code is more readable.