53 Comparison among base R, tidyverse, and data.table

Siyue Han

# Load the tidyverse package
library(tidyverse)
# Load the data.table package
library(data.table)

53.1 Introduction

There are many ways to read and analyze data in R, and the data.frame provided in base R can handle most situations. I have therefore used it most of the time, with the occasional tibble from the tidyverse. But one day, while working with a large csv file, I found data.frame painfully slow. With the help of Google I tried data.table for the first time, and it was amazing. I would therefore like to share this way of reading and analyzing data with more people.

53.2 Reading Data

First, let’s compare how quickly the three environments read a csv file.

53.2.1 Data.Frame

This is an example of reading csv into a data.frame.

start <- Sys.time()
df <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 3.001485 secs

53.2.2 Tibble

This is an example of reading csv into a tibble.

start <- Sys.time()
tb <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 1.689931 secs

53.2.3 Data.Table

This is an example of reading csv into a data.table.

start <- Sys.time()
dt <- fread("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
## Time difference of 0.2768054 secs

As we can see, data.table reads a csv file very quickly, and its advantage grows with file size. Reading into a tibble with read_csv from the tidyverse is somewhat faster than read.csv in base R, but both are much slower than fread.
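These one-off timings also include the network download, which adds noise. A fairer sketch (assuming the file has first been cached locally; the file name confirmed_US.csv is just an illustrative choice) times only the read with system.time():

url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
download.file(url, "confirmed_US.csv")            # cache once so the download is not timed

system.time(df <- read.csv("confirmed_US.csv"))
system.time(tb <- read_csv("confirmed_US.csv"))
system.time(dt <- fread("confirmed_US.csv"))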

53.3 Processing Data

Next, let’s look at how the three environments differ when processing data.

53.3.1 Selecting column(s) and row(s)

start <- Sys.time()
x1 <- df[101:110, c('Lat', 'Long_')]
end <- Sys.time()
print(end - start)
## Time difference of 0.002945662 secs
start <- Sys.time()
x2 <- select(tb, Lat, Long_) %>% slice(101:110)
end <- Sys.time()
print(end - start)
## Time difference of 0.02623582 secs
start <- Sys.time()
x3 <- dt[101:110, .(Lat, Long_)]
end <- Sys.time()
print(end - start)
## Time difference of 0.01489043 secs

53.3.2 Filtering row(s)

start <- Sys.time()
x1 <- df[df$`X10.31.21` > 500000,]
end <- Sys.time()
print(end - start)
## Time difference of 0.02817774 secs
start <- Sys.time()
x2 <- filter(tb, `10/31/21` > 500000)
end <- Sys.time()
print(end - start)
## Time difference of 0.04359889 secs
start <- Sys.time()
x3 <- dt[`10/31/21` > 500000,]
end <- Sys.time()
print(end - start)
## Time difference of 0.006566763 secs
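Note that the same column has different names in the three objects: read.csv() runs make.names() on the headers by default, so 10/31/21 becomes X10.31.21, while read_csv() and fread() keep the original name (which then needs backticks). A quick check:

"X10.31.21" %in% names(df)   # TRUE: read.csv() sanitized the header
"10/31/21" %in% names(tb)    # TRUE: read_csv() kept the name as-is
"10/31/21" %in% names(dt)    # TRUE: fread() kept the name as-is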

53.3.3 Sorting the table

start <- Sys.time()
x1 <- df[order(-df$`X10.31.21`), ]
end <- Sys.time()
print(end - start)
## Time difference of 0.03687096 secs
start <- Sys.time()
x2 <- arrange(tb, -`10/31/21`)
end <- Sys.time()
print(end - start)
## Time difference of 0.5127223 secs
start <- Sys.time()
x3 <- dt[order(-`10/31/21`), ]
end <- Sys.time()
print(end - start)
## Time difference of 0.009575844 secs
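In all three snippets, ordering returns a reordered copy. data.table can also sort in place with setorderv(), which avoids the copy entirely; a minimal sketch, not timed above (note that this reorders dt itself for the rest of the session):

setorderv(dt, "10/31/21", order = -1L)   # sort dt by the 10/31/21 column, descending, by reference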

53.3.4 Summarizing columns by group

start <- Sys.time()
x1 <- aggregate(df$`X10.31.21`, list(df$Province_State), sum)
end <- Sys.time()
print(end - start)
## Time difference of 0.006486416 secs
start <- Sys.time()
x2 <- group_by(tb, Province_State) %>% summarise(sum(`10/31/21`))
end <- Sys.time()
print(end - start)
## Time difference of 0.03093123 secs
start <- Sys.time()
x3 <- dt[ , lapply(.(`10/31/21`), sum), by = Province_State] 
end <- Sys.time()
print(end - start)
## Time difference of 0.004686117 secs
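The lapply(.(…), sum) form above treats .() as list() inside j. An equivalent and perhaps more familiar way to write the same aggregation (not timed above) is:

dt[, .(total = sum(`10/31/21`)), by = Province_State]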

53.3.5 Pivoting longer

start <- Sys.time()
x1 <- reshape(df, 
              varying = 12:dim(df)[2],
              timevar = "Date", v.names="Cases",
              direction = "long")
x1 <- x1[, c('Combined_Key', 'Date', 'Cases')]
end <- Sys.time()
print(end - start)
## Time difference of 27.89158 secs
start <- Sys.time()
x2 <- pivot_longer(tb, 
                   names_to = "Date", 
                   values_to = "Cases", 
                   -(1:11)) %>%
      select(Combined_Key, Date, Cases)
end <- Sys.time()
print(end - start)
## Time difference of 1.585238 secs
start <- Sys.time()
x3 <- melt(dt, 
           id.vars = 1:11, 
           variable.name = "Date", 
           value.name = "Cases")[
             , .(Combined_Key, Date, Cases)]
end <- Sys.time()
print(end - start)
## Time difference of 1.465139 secs

53.3.6 Joining tables

start <- Sys.time()
x1 <- merge(df[, 1:11], x1)
end <- Sys.time()
print(end - start)
## Time difference of 39.40367 secs
start <- Sys.time()
x2 <- left_join(select(tb, 1:11), x2, by = "Combined_Key")
end <- Sys.time()
print(end - start)
## Time difference of 2.279808 secs
start <- Sys.time()
x3 <- dt[, 1:11][x3, on = "Combined_Key"]
end <- Sys.time()
print(end - start)
## Time difference of 0.5547302 secs

53.3.7 Chaining structures

Base R (prior to the native pipe |> introduced in R 4.1) does not have a chaining structure like the tidyverse or data.table, so here we compare only those two. The tidyverse uses %>% to connect operations, while data.table chains bracketed operations back to back, as in [...][...].

start <- Sys.time()
x2 <- tb %>% 
      mutate(year = `10/31/21` - `10/31/20`) %>% 
      group_by(Province_State) %>% 
      summarise(year = sum(year)) %>% 
      arrange(-year)

end <- Sys.time()
print(end - start)
## Time difference of 0.04686999 secs
start <- Sys.time()
x3 <- dt[, year := `10/31/21` - `10/31/20`, ][
         order(-year), .(year = sum(year)), by = Province_State]
end <- Sys.time()
print(end - start)
## Time difference of 0.008189201 secs
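One subtlety worth noting: := in the data.table chain adds the year column to dt itself by reference, while mutate() above returned a new tibble and left tb untouched. A quick check:

"year" %in% names(dt)   # TRUE:  := modified dt in place
"year" %in% names(tb)   # FALSE: mutate() returned a new tibble, assigned to x2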

From all of the above, we can see that for simple operations such as selecting, filtering, and sorting, base R finishes very quickly. For complex operations such as pivoting and joining, however, base R takes a huge amount of time. Comparing the tidyverse and data.table, data.table is somewhat faster than the tidyverse in almost every task, and the gap is largest for the chained example. This is probably because data.table combines several operations inside a single bracketed expression: in the example above, one bracketed call does the work of group_by, summarise, and arrange in the tidyverse. On the other hand, since each tidyverse function does only one task and its name makes that task easy to understand, tidyverse code is more readable than data.table code.

53.4 Summary of key functions

Environment            base                     tidyverse            data.table
Supported data class   data.frame               tibble               data.table
Reading data           read.csv                 read_csv             fread
Subset by column       [ , …]                   select()             [ , … , ]
Subset by rows         [… , ]                   filter()             [… , , ]
Create new column      df$y = …                 mutate(tb, y = …)    [ , y := …, ]
Delete a column        df$y = NULL              select(tb, -y)       [ , y := NULL, ]
Summarize              apply(df[ , y], 2, …)    summarise()          [ , …(y), ]
Grouping               aggregate()              group_by()           [ , , by = …]
Pivot to long          reshape()                pivot_longer()       melt()
Pivot to wide          reshape()                pivot_wider()        dcast()
Joining tables         merge()                  left_join()          dt1[dt2, on = …]
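The create-column and delete-column rows are not demonstrated elsewhere in this chapter; a minimal sketch using the same dataset (the column name year2 is just an illustrative choice, picked to avoid clobbering the year column created in 53.3.7):

# base R: read.csv() sanitized "10/31/21" to "X10.31.21"
df$year2 <- df$X10.31.21 - df$X10.31.20
df$year2 <- NULL

# tidyverse: returns a new tibble; tb itself is unchanged unless reassigned
tb2 <- mutate(tb, year2 = `10/31/21` - `10/31/20`)
tb2 <- select(tb2, -year2)

# data.table: modifies dt in place by reference
dt[, year2 := `10/31/21` - `10/31/20`]
dt[, year2 := NULL]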

53.5 Conclusion

The motivation for this project came from the day I found data.frame too slow for a large csv file. With the help of Google I used data.table for the first time, and it was amazing, so I wanted to compare its performance with that of base R and the tidyverse for reading and analyzing data.

From this project, I learned how to use data.table to analyze data, as well as the advantages of each of the three approaches. data.frame in base R is the most convenient and familiar way to handle everyday data analysis tasks, but it becomes too slow when the data are large or the operation is complex. In those cases it is better to use data.table or the tidyverse. When handling a very large dataset, data.table is a good choice because it runs extremely fast; when speed matters less, especially when collaborating with others, the tidyverse is a good choice because its code is more readable.