Data table
is a package that provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed[1].
In this article, we will compare Data table
with other packages to find out the advantages and disadvantages of Data table
.
TL;DR:
Data table
is fastest in grouping, reading and writing compared to base r
and tidyverse
. But it is a little slower in sorting than base r
Data table
has a more convinient and concise syntax while dplyr
is easier to read with the help of pipeline
.Data table
and knitr
can both print out a nice format of table in html than data frame
. But Data table
can provide an interactive interface of table with sorting and searching functions.The latest version of Data table
is 1.12.0
as of 2019-03-20. You can simply install this package by using install.packages command.
install.packages("data.table")
In this part, we are going to introduce the basic syntax of Data table
.
We can create one using the data.table()
.
DT = data.table(
sample1 = sample(1:10, 4),
sample2 = sample(11:20, 4),
sample3 = c('a','b','c','d')
)
DT
## sample1 sample2 sample3
## 1: 8 17 a
## 2: 1 14 b
## 3: 9 15 c
## 4: 2 16 d
You can also convert existing data.frame and list to a data.table using setDT()
or as.data.table()
for other structures. Unlike data.frames, columns of character
type are never converted to factors
by default.
data.table’s general syntax is written as DT[i, j, by]
, read as Take DT, subset/reorder rows using i
, then calculate j
, grouped by by
.
DT <- iris
DT <- setDT(DT)
DT[Species == 'setosa' & Sepal.Length > 5.0][1:3]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 5.4 3.9 1.7 0.4 setosa
## 3: 5.4 3.7 1.5 0.2 setosa
Within the frame of a data.table
, columns can be referred to as if they are variables. We do not need to add the prefix DT$
each time. A comma after the condition in i is not required. In data.frames
, however, the comma is necessary.
Select Sepal.Length
column, but return it as a vector.
DT[, Sepal.Length][1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
Select Sepal.Length
column, but return as a data.table instead.
DT[, list(Sepal.Length)][1:3]
## Sepal.Length
## 1: 5.1
## 2: 4.9
## 3: 4.7
Select both Sepal.Length
and Sepal.Width
columns and rename them to Length
and Width
.
DT[, .(Length = Sepal.Length, Width = Sepal.Width)][1:3]
## Length Width
## 1: 5.1 3.5
## 2: 4.9 3.0
## 3: 4.7 3.2
Calculate the average Sepal.Length and Sepal.Width for all setosa specie.
DT[Species == 'setosa', .(mlen = mean(Sepal.Length), mwid = mean(Sepal.Width))]
## mlen mwid
## 1: 5.006 3.428
.N
is a special built-in variable that holds the number of observations in the current group. In the absence of group by operations, it simply returns the number of rows in the subset.
DT[Species == 'setosa', .N]
## [1] 50
How can we get the number of each specie?
DT[, .N, by = Species]
## Species N
## 1: setosa 50
## 2: versicolor 50
## 3: virginica 50
Let’s first preview our .csv file relatively read as data.table, tibble and data.frame. This dataset has 1 million rows with 6 columns which sampled as characters, integers and double.
## sample1 sample2 sample3 sample4 sample5 sample6
## 1: id027 id0000001402 62 4027 2 53.4276
## 2: id038 id0000006957 33 9190 3 52.0274
## 3: id058 id0000007289 21 7011 4 4.4083
## ---
## 999998: id081 id0000002287 37 4665 4 10.8610
## 999999: id070 id0000007561 55 2626 3 98.9201
## 1000000: id061 id0000003997 16 148 1 21.9067
## # A tibble: 1,000,000 x 6
## sample1 sample2 sample3 sample4 sample5 sample6
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 id027 id0000001402 62 4027 2 53.4
## 2 id038 id0000006957 33 9190 3 52.0
## 3 id058 id0000007289 21 7011 4 4.41
## 4 id091 id0000000917 84 8531 2 37.2
## 5 id021 id0000000667 15 7745 5 1.70
## # … with 1e+06 more rows
## sample1 sample2 sample3 sample4 sample5 sample6
## 1 id027 id0000001402 62 4027 2 53.4276
## 2 id038 id0000006957 33 9190 3 52.0274
## 3 id058 id0000007289 21 7011 4 4.4083
## 4 id091 id0000000917 84 8531 2 37.1531
## 5 id021 id0000000667 15 7745 5 1.7010
## [ reached 'max' / getOption("max.print") -- omitted 999995 rows ]
As a main strength of data.table, the lightening speed of processing data makes it popular among users. This part concerns the running time of conducting functions such as reading, writing, grouping and sorting using different tools and we’ll make it clearly by showing the barplot.
dtr <- system.time(fread(setpath))
tbr <- system.time(read_csv(setpath))
dfr <- system.time(read.csv(setpath))
dtr # Data.table reading file
## user system elapsed
## 0.798 0.038 0.108
tbr # Tidyverse reading file
## user system elapsed
## 1.237 0.046 0.491
dfr # Base R reading file
## user system elapsed
## 1.524 0.047 1.575
dtw <- system.time(fwrite(DT, setpath))
tbw <- system.time(write_csv(TB, setpath))
dfw <- system.time(write.csv(DF, setpath, row.names = FALSE))
dtw # Data.table writing file
## user system elapsed
## 0.385 0.033 0.060
tbw # Tidyverse writing file
## user system elapsed
## 1.990 0.068 1.161
dfw # Base R writing file
## user system elapsed
## 1.772 0.054 1.833
dtg <- system.time(DT[, .(.N), by = .(sample1)])
tbg <- system.time(
TB %>%
group_by(sample1) %>%
summarise(n1 = n())
)
dfg <- system.time(by(DF, DT$sample1, count))
dtg # Data.table grouping
## user system elapsed
## 0.064 0.025 0.014
tbg # Tidyverse grouping
## user system elapsed
## 0.363 0.047 0.063
dfg # Base R grouping
## user system elapsed
## 0.181 0.017 0.193
dts <- system.time(DT[order(sample6)])
tbs <- system.time(TB %>% arrange(sample6))
dfs <- system.time(DF[order(DF$sample6),])
dts # Data.table sorting
## user system elapsed
## 0.468 0.072 0.080
tbs # Tidyverse sorting
## user system elapsed
## 1.151 0.030 0.361
dfs # Base R sorting
## user system elapsed
## 0.087 0.007 0.094
runningtime <- data.frame(
time = c(dtr[1], tbr[1], dfr[1], dtw[1], tbw[1], dfw[1],
dtg[1], tbg[1], dfg[1], dts[1], tbs[1], dfs[1]),
tool = rep(c('data.table', 'tidyverse', 'base R'), 4),
func = rep(c('reading', 'writing', 'grouping', 'sorting'), rep(3,4))
)
runningtime %>%
ggplot() +
geom_col(aes(x = tool, y = time, fill = tool)) +
facet_grid(func ~ ., scales = "free", space = "free") +
coord_flip() +
scale_fill_brewer(palette = 'Set2') +
theme_minimal() +
xlab('') +
ylab('Running Time (sec)') +
guides(fill = FALSE)
As we can see in the plot, data.table consumes the least time to achieve the same goal compared with tidyverse and base R in most functions. R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0 which makes base R reach the same speed with data.table in sorting function.
Data table
and dplyr
are similar in some way. Both of them can manipulate data such as selecting columns, filtering values, sorting data and so on. But their syntaxes are quite different. Here are some examples:
For this part, we are going to use the data from https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e. The data is named as Payroll_data.
Task 1:
List the 10 agencies Agency Name
with the highest median base salaries Base Salary
in descending order by median base salary.
Data table
Payroll_data[,.(median = median(`Base Salary`)),by=.(`Agency Name`)][order(-median)][1:10]
## Agency Name median
## 1: DISTRICTING COMMISSION 176099.0
## 2: FINANCIAL INFO SVCS AGENCY 105875.0
## 3: OFFICE OF COLLECTIVE BARGAININ 99000.0
## 4: BRONX COMMUNITY BOARD #3 98489.0
## 5: DEPT OF ED PEDAGOGICAL 82900.0
## 6: INDEPENDENT BUDGET OFFICE 82012.0
## 7: OFFICE OF THE MAYOR 79000.0
## 8: BRONX COMMUNITY BOARD #1 78190.0
## 9: CONFLICTS OF INTEREST BOARD 78000.0
## 10: BRONX COMMUNITY BOARD #5 77105.5
dplyr
Payroll_data %>%
group_by(`Agency Name`)%>%
dplyr::summarize(median = median(`Base Salary`))%>%
ungroup()%>%
arrange(-median)%>%
top_n(10)
## # A tibble: 10 x 2
## `Agency Name` median
## <chr> <dbl>
## 1 DISTRICTING COMMISSION 176099
## 2 FINANCIAL INFO SVCS AGENCY 105875
## 3 OFFICE OF COLLECTIVE BARGAININ 99000
## 4 BRONX COMMUNITY BOARD #3 98489
## 5 DEPT OF ED PEDAGOGICAL 82900
## 6 INDEPENDENT BUDGET OFFICE 82012
## 7 OFFICE OF THE MAYOR 79000
## 8 BRONX COMMUNITY BOARD #1 78190
## 9 CONFLICTS OF INTEREST BOARD 78000
## 10 BRONX COMMUNITY BOARD #5 77106.
Task 2:
For Pay Basis
== “per Annum” rows, list the mean of Base Salary
, Regular Hours
and Regular Gross Paid
for every combination of Agency Name
and Fiscal Year
in descending order by mean base salary. Present the top 10 rows.
Data table
Payroll_data[`Pay Basis` == "per Annum",
lapply(.SD, mean),
by = .(`Fiscal Year`, `Agency Name`),
.SDcols = c("Base Salary", "Regular Hours","Regular Gross Paid")
][order(-`Base Salary`)
][1:10]
## Fiscal Year Agency Name Base Salary Regular Hours
## 1: 2014 DISTRICTING COMMISSION 158699.7 0.000
## 2: 2016 BROOKLYN COMMUNITY BOARD #9 138978.0 594.000
## 3: 2017 OFFICE OF COLLECTIVE BARGAININ 115778.8 1825.000
## 4: 2018 OFFICE OF COLLECTIVE BARGAININ 115778.8 1781.750
## 5: 2018 FINANCIAL INFO SVCS AGENCY 109684.0 1664.911
## 6: 2017 PUBLIC ADMINISTRATOR-RICHMOND 109400.8 1520.833
## 7: 2016 OFFICE OF COLLECTIVE BARGAININ 108167.3 1692.611
## 8: 2016 PUBLIC ADMINISTRATOR-RICHMOND 108015.0 1525.000
## 9: 2017 FINANCIAL INFO SVCS AGENCY 107847.7 1704.999
## 10: 2017 BRONX COMMUNITY BOARD #3 107115.0 1825.000
## Regular Gross Paid
## 1: 7910.243
## 2: 70614.470
## 3: 118979.448
## 4: 114941.314
## 5: 101727.832
## 6: 82937.115
## 7: 99065.806
## 8: 87886.057
## 9: 103801.862
## 10: 109131.555
dplyr
Payroll_data%>%
filter(`Pay Basis` == "per Annum")%>%
group_by(`Fiscal Year`, `Agency Name`)%>%
dplyr::summarize(`Base Salary` = mean(`Base Salary`),
`Regular Hours` = mean(`Regular Hours`),
`Regular Gross Paid` = mean(`Regular Gross Paid`))%>%
ungroup()%>%
arrange(-`Base Salary`)%>%
.[1:10,]
## # A tibble: 10 x 5
## `Fiscal Year` `Agency Name` `Base Salary` `Regular Hours`
## <int> <chr> <dbl> <dbl>
## 1 2014 DISTRICTING … 158700. 0
## 2 2016 BROOKLYN COM… 138978 594
## 3 2017 OFFICE OF CO… 115779. 1825
## 4 2018 OFFICE OF CO… 115779. 1782.
## 5 2018 FINANCIAL IN… 109684. 1665.
## 6 2017 PUBLIC ADMIN… 109401. 1521.
## 7 2016 OFFICE OF CO… 108167. 1693.
## 8 2016 PUBLIC ADMIN… 108015 1525
## 9 2017 FINANCIAL IN… 107848. 1705.
## 10 2017 BRONX COMMUN… 107115 1825
## # … with 1 more variable: `Regular Gross Paid` <dbl>
While data table
and dplyr
can both provide a flow of actions, data table
is more concise but may be harder to figure the function of these codes.
dplyr
is more readable and convenient for multiple actions with the help of pipeline
. You can understand the function of the action step by step.
While the output of data frames may be unorganized, data table can also make the output look nicer by applying DT
library(An R interface to the JavaScript library DataTables). There are also other packages can do the same thing, such as tibble and knitr.
The data we are using is wine data from pgmm
package.
library(pgmm)
data(wine)
wine <- wine[,1:10] # Limiting the number of columns
By using setDT function, we can turn a data set into data table.
wine_DT <- setDT(wine)
wine[1:3,]
## Type Alcohol Sugar-free Extract Fixed Acidity Tartaric Acid Malic Acid
## 1: 1 14.23 24.82 73.1 1.21 1.71
## 2: 1 13.20 26.30 72.8 1.84 1.78
## 3: 1 13.16 26.30 68.5 1.94 2.36
## Uronic Acids pH Ash Alcalinity of Ash
## 1: 0.72 3.38 2.43 15.6
## 2: 0.71 3.30 2.14 11.2
## 3: 0.84 3.48 2.67 18.6
This table is not organized and not easy to read.
library(DT)
datatable(wine[1:50,])
Data table
(with the help of DT
package) provides a tidy and interactive interface to the data. You can browse more data by clicking bottom right corner or present more data per page in upper left corner
library(knitr)
knitr::kable(wine[1:5,])
Type | Alcohol | Sugar-free Extract | Fixed Acidity | Tartaric Acid | Malic Acid | Uronic Acids | pH | Ash | Alcalinity of Ash |
---|---|---|---|---|---|---|---|---|---|
1 | 14.23 | 24.82 | 73.1 | 1.21 | 1.71 | 0.72 | 3.38 | 2.43 | 15.6 |
1 | 13.20 | 26.30 | 72.8 | 1.84 | 1.78 | 0.71 | 3.30 | 2.14 | 11.2 |
1 | 13.16 | 26.30 | 68.5 | 1.94 | 2.36 | 0.84 | 3.48 | 2.67 | 18.6 |
1 | 14.37 | 25.85 | 74.9 | 1.59 | 1.95 | 0.72 | 3.43 | 2.50 | 16.8 |
1 | 13.24 | 26.05 | 83.5 | 1.30 | 2.59 | 1.10 | 3.42 | 2.87 | 21.0 |
Knitr
also helps organizing the table. But it has fewer functions. You can choose the package based on your needs.
You can learn more about data table at: