Data table and more

Introduction
Installation
Basic syntax
Lightning speed
- Reading
- Writing
- Grouping
- Sorting
Data table v.s. Dplyr
Nice table output

Introduction

Data table is a package that provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed^[1].

In this article, we will compare Data table with other packages to find out the advantages and disadvantages of Data table.

TL;DR:

Data table is fastest in grouping, reading and writing compared to base r and tidyverse. But it is a little slower in sorting than base r
Data table has a more convinient and concise syntax while dplyr is easier to read with the help of pipeline.
Data table and knitr can both print out a nice format of table in html than data frame. But Data table can provide an interactive interface of table with sorting and searching functions.

Installation

The latest version of Data table is 1.12.0 as of 2019-03-20. You can simply install this package by using install.packages command.

install.packages("data.table")

Basic syntax

In this part, we are going to introduce the basic syntax of Data table.

We can create one using the data.table().

DT = data.table(
  sample1 = sample(1:10, 4),
  sample2 = sample(11:20, 4),
  sample3 = c('a','b','c','d')
)
DT

##    sample1 sample2 sample3
## 1:       8      17       a
## 2:       1      14       b
## 3:       9      15       c
## 4:       2      16       d

You can also convert existing data.frame and list to a data.table using setDT() or as.data.table() for other structures. Unlike data.frames, columns of character type are never converted to factors by default.

data.table’s general syntax is written as DT[i, j, by], read as Take DT, subset/reorder rows using i, then calculate j, grouped by by.

Subset row(s) in i

DT <- iris
DT <- setDT(DT)
DT[Species == 'setosa' & Sepal.Length > 5.0][1:3]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          5.4         3.9          1.7         0.4  setosa
## 3:          5.4         3.7          1.5         0.2  setosa

Within the frame of a data.table, columns can be referred to as if they are variables. We do not need to add the prefix DT$ each time. A comma after the condition in i is not required. In data.frames, however, the comma is necessary.

Select column(s) in j

Select Sepal.Length column, but return it as a vector.

DT[, Sepal.Length][1:10]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

Select Sepal.Length column, but return as a data.table instead.

DT[, list(Sepal.Length)][1:3]

##    Sepal.Length
## 1:          5.1
## 2:          4.9
## 3:          4.7

Select both Sepal.Length and Sepal.Width columns and rename them to Length and Width.

DT[, .(Length = Sepal.Length, Width = Sepal.Width)][1:3]

##    Length Width
## 1:    5.1   3.5
## 2:    4.9   3.0
## 3:    4.7   3.2

Subset in i and do in j

Calculate the average Sepal.Length and Sepal.Width for all setosa specie.

DT[Species == 'setosa', .(mlen = mean(Sepal.Length), mwid = mean(Sepal.Width))]

##     mlen  mwid
## 1: 5.006 3.428

.N is a special built-in variable that holds the number of observations in the current group. In the absence of group by operations, it simply returns the number of rows in the subset.

DT[Species == 'setosa', .N]

## [1] 50

Grouping using by

How can we get the number of each specie?

DT[, .N, by = Species]

##       Species  N
## 1:     setosa 50
## 2: versicolor 50
## 3:  virginica 50

Lightning speed

Let’s first preview our .csv file relatively read as data.table, tibble and data.frame. This dataset has 1 million rows with 6 columns which sampled as characters, integers and double.

##          sample1      sample2 sample3 sample4 sample5 sample6
##       1:   id027 id0000001402      62    4027       2 53.4276
##       2:   id038 id0000006957      33    9190       3 52.0274
##       3:   id058 id0000007289      21    7011       4  4.4083
##      ---                                                     
##  999998:   id081 id0000002287      37    4665       4 10.8610
##  999999:   id070 id0000007561      55    2626       3 98.9201
## 1000000:   id061 id0000003997      16     148       1 21.9067

## # A tibble: 1,000,000 x 6
##   sample1 sample2      sample3 sample4 sample5 sample6
##   <chr>   <chr>          <dbl>   <dbl>   <dbl>   <dbl>
## 1 id027   id0000001402      62    4027       2   53.4 
## 2 id038   id0000006957      33    9190       3   52.0 
## 3 id058   id0000007289      21    7011       4    4.41
## 4 id091   id0000000917      84    8531       2   37.2 
## 5 id021   id0000000667      15    7745       5    1.70
## # … with 1e+06 more rows

##   sample1      sample2 sample3 sample4 sample5 sample6
## 1   id027 id0000001402      62    4027       2 53.4276
## 2   id038 id0000006957      33    9190       3 52.0274
## 3   id058 id0000007289      21    7011       4  4.4083
## 4   id091 id0000000917      84    8531       2 37.1531
## 5   id021 id0000000667      15    7745       5  1.7010
##  [ reached 'max' / getOption("max.print") -- omitted 999995 rows ]

As a main strength of data.table, the lightening speed of processing data makes it popular among users. This part concerns the running time of conducting functions such as reading, writing, grouping and sorting using different tools and we’ll make it clearly by showing the barplot.

Reading

dtr <- system.time(fread(setpath))
tbr <- system.time(read_csv(setpath))
dfr <- system.time(read.csv(setpath))
dtr # Data.table reading file

##    user  system elapsed 
##   0.798   0.038   0.108

tbr # Tidyverse reading file

##    user  system elapsed 
##   1.237   0.046   0.491

dfr # Base R reading file

##    user  system elapsed 
##   1.524   0.047   1.575

Writing

dtw <- system.time(fwrite(DT, setpath))
tbw <- system.time(write_csv(TB, setpath))
dfw <- system.time(write.csv(DF, setpath, row.names = FALSE))
dtw # Data.table writing file

##    user  system elapsed 
##   0.385   0.033   0.060

tbw # Tidyverse writing file

##    user  system elapsed 
##   1.990   0.068   1.161

dfw # Base R writing file

##    user  system elapsed 
##   1.772   0.054   1.833

Grouping

dtg <- system.time(DT[, .(.N), by = .(sample1)])
tbg <- system.time(
                   TB %>%
                   group_by(sample1) %>%
                   summarise(n1 = n())
                   )
dfg <- system.time(by(DF, DT$sample1, count))
dtg # Data.table grouping

##    user  system elapsed 
##   0.064   0.025   0.014

tbg # Tidyverse grouping

##    user  system elapsed 
##   0.363   0.047   0.063

dfg # Base R grouping

##    user  system elapsed 
##   0.181   0.017   0.193

Sorting

dts <- system.time(DT[order(sample6)])
tbs <- system.time(TB %>% arrange(sample6))
dfs <- system.time(DF[order(DF$sample6),])
dts # Data.table sorting

##    user  system elapsed 
##   0.468   0.072   0.080

tbs # Tidyverse sorting

##    user  system elapsed 
##   1.151   0.030   0.361

dfs # Base R sorting

##    user  system elapsed 
##   0.087   0.007   0.094

runningtime <- data.frame(
                          time = c(dtr[1], tbr[1], dfr[1], dtw[1], tbw[1], dfw[1],
                                       dtg[1], tbg[1], dfg[1], dts[1], tbs[1], dfs[1]),
                          tool = rep(c('data.table', 'tidyverse', 'base R'), 4),
                          func = rep(c('reading', 'writing', 'grouping', 'sorting'), rep(3,4))
                      )
runningtime %>%
    ggplot() +
    geom_col(aes(x = tool, y = time, fill = tool)) +
    facet_grid(func ~ ., scales = "free", space = "free") +
    coord_flip() +
    scale_fill_brewer(palette = 'Set2') +
    theme_minimal() +
    xlab('') +
    ylab('Running Time (sec)') +
    guides(fill = FALSE)

As we can see in the plot, data.table consumes the least time to achieve the same goal compared with tidyverse and base R in most functions. R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0 which makes base R reach the same speed with data.table in sorting function.

Data table v.s. Dplyr

Data table and dplyr are similar in some way. Both of them can manipulate data such as selecting columns, filtering values, sorting data and so on. But their syntaxes are quite different. Here are some examples:

For this part, we are going to use the data from https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e. The data is named as Payroll_data.

Task 1:

List the 10 agencies Agency Name with the highest median base salaries Base Salary in descending order by median base salary.

Data table

Payroll_data[,.(median = median(`Base Salary`)),by=.(`Agency Name`)][order(-median)][1:10]

##                        Agency Name   median
##  1:         DISTRICTING COMMISSION 176099.0
##  2:     FINANCIAL INFO SVCS AGENCY 105875.0
##  3: OFFICE OF COLLECTIVE BARGAININ  99000.0
##  4:       BRONX COMMUNITY BOARD #3  98489.0
##  5:         DEPT OF ED PEDAGOGICAL  82900.0
##  6:      INDEPENDENT BUDGET OFFICE  82012.0
##  7:            OFFICE OF THE MAYOR  79000.0
##  8:       BRONX COMMUNITY BOARD #1  78190.0
##  9:    CONFLICTS OF INTEREST BOARD  78000.0
## 10:       BRONX COMMUNITY BOARD #5  77105.5

dplyr

Payroll_data %>%
  group_by(`Agency Name`)%>%
  dplyr::summarize(median = median(`Base Salary`))%>%
  ungroup()%>%
  arrange(-median)%>%
  top_n(10)

## # A tibble: 10 x 2
##    `Agency Name`                   median
##    <chr>                            <dbl>
##  1 DISTRICTING COMMISSION         176099 
##  2 FINANCIAL INFO SVCS AGENCY     105875 
##  3 OFFICE OF COLLECTIVE BARGAININ  99000 
##  4 BRONX COMMUNITY BOARD #3        98489 
##  5 DEPT OF ED PEDAGOGICAL          82900 
##  6 INDEPENDENT BUDGET OFFICE       82012 
##  7 OFFICE OF THE MAYOR             79000 
##  8 BRONX COMMUNITY BOARD #1        78190 
##  9 CONFLICTS OF INTEREST BOARD     78000 
## 10 BRONX COMMUNITY BOARD #5        77106.

Task 2:

For Pay Basis == “per Annum” rows, list the mean of Base Salary, Regular Hours and Regular Gross Paid for every combination of Agency Name and Fiscal Year in descending order by mean base salary. Present the top 10 rows.

Data table

Payroll_data[`Pay Basis` == "per Annum",                      
        lapply(.SD, mean),                    
        by = .(`Fiscal Year`, `Agency Name`),           
        .SDcols = c("Base Salary", "Regular Hours","Regular Gross Paid")
        ][order(-`Base Salary`)
          ][1:10]

##     Fiscal Year                    Agency Name Base Salary Regular Hours
##  1:        2014         DISTRICTING COMMISSION    158699.7         0.000
##  2:        2016    BROOKLYN COMMUNITY BOARD #9    138978.0       594.000
##  3:        2017 OFFICE OF COLLECTIVE BARGAININ    115778.8      1825.000
##  4:        2018 OFFICE OF COLLECTIVE BARGAININ    115778.8      1781.750
##  5:        2018     FINANCIAL INFO SVCS AGENCY    109684.0      1664.911
##  6:        2017  PUBLIC ADMINISTRATOR-RICHMOND    109400.8      1520.833
##  7:        2016 OFFICE OF COLLECTIVE BARGAININ    108167.3      1692.611
##  8:        2016  PUBLIC ADMINISTRATOR-RICHMOND    108015.0      1525.000
##  9:        2017     FINANCIAL INFO SVCS AGENCY    107847.7      1704.999
## 10:        2017       BRONX COMMUNITY BOARD #3    107115.0      1825.000
##     Regular Gross Paid
##  1:           7910.243
##  2:          70614.470
##  3:         118979.448
##  4:         114941.314
##  5:         101727.832
##  6:          82937.115
##  7:          99065.806
##  8:          87886.057
##  9:         103801.862
## 10:         109131.555

dplyr

Payroll_data%>%
  filter(`Pay Basis` == "per Annum")%>%
  group_by(`Fiscal Year`, `Agency Name`)%>%
  dplyr::summarize(`Base Salary` = mean(`Base Salary`),
                   `Regular Hours` = mean(`Regular Hours`),
                   `Regular Gross Paid` = mean(`Regular Gross Paid`))%>%
  ungroup()%>%
  arrange(-`Base Salary`)%>%
  .[1:10,]

## # A tibble: 10 x 5
##    `Fiscal Year` `Agency Name` `Base Salary` `Regular Hours`
##            <int> <chr>                 <dbl>           <dbl>
##  1          2014 DISTRICTING …       158700.              0 
##  2          2016 BROOKLYN COM…       138978             594 
##  3          2017 OFFICE OF CO…       115779.           1825 
##  4          2018 OFFICE OF CO…       115779.           1782.
##  5          2018 FINANCIAL IN…       109684.           1665.
##  6          2017 PUBLIC ADMIN…       109401.           1521.
##  7          2016 OFFICE OF CO…       108167.           1693.
##  8          2016 PUBLIC ADMIN…       108015            1525 
##  9          2017 FINANCIAL IN…       107848.           1705.
## 10          2017 BRONX COMMUN…       107115            1825 
## # … with 1 more variable: `Regular Gross Paid` <dbl>

While data table and dplyr can both provide a flow of actions, data table is more concise but may be harder to figure the function of these codes.

dplyr is more readable and convenient for multiple actions with the help of pipeline. You can understand the function of the action step by step.

Nice table output

While the output of data frames may be unorganized, data table can also make the output look nicer by applying DT library(An R interface to the JavaScript library DataTables). There are also other packages can do the same thing, such as tibble and knitr.

The data we are using is wine data from pgmm package.

library(pgmm)
data(wine)
wine <- wine[,1:10] # Limiting the number of columns

By using setDT function, we can turn a data set into data table.

wine_DT <- setDT(wine)

Data frame

wine[1:3,]

##    Type Alcohol Sugar-free Extract Fixed Acidity Tartaric Acid Malic Acid
## 1:    1   14.23              24.82          73.1          1.21       1.71
## 2:    1   13.20              26.30          72.8          1.84       1.78
## 3:    1   13.16              26.30          68.5          1.94       2.36
##    Uronic Acids   pH  Ash Alcalinity of Ash
## 1:         0.72 3.38 2.43              15.6
## 2:         0.71 3.30 2.14              11.2
## 3:         0.84 3.48 2.67              18.6

This table is not organized and not easy to read.

Data table

library(DT) 
datatable(wine[1:50,])

Data table(with the help of DT package) provides a tidy and interactive interface to the data. You can browse more data by clicking bottom right corner or present more data per page in upper left corner

Knitr

library(knitr)
knitr::kable(wine[1:5,])

Type	Alcohol	Sugar-free Extract	Fixed Acidity	Tartaric Acid	Malic Acid	Uronic Acids	pH	Ash	Alcalinity of Ash
1	14.23	24.82	73.1	1.21	1.71	0.72	3.38	2.43	15.6
1	13.20	26.30	72.8	1.84	1.78	0.71	3.30	2.14	11.2
1	13.16	26.30	68.5	1.94	2.36	0.84	3.48	2.67	18.6
1	14.37	25.85	74.9	1.59	1.95	0.72	3.43	2.50	16.8
1	13.24	26.05	83.5	1.30	2.59	1.10	3.42	2.87	21.0

Knitr also helps organizing the table. But it has fewer functions. You can choose the package based on your needs.

You can learn more about data table at:

Data table and more

GR5293 Statistical Graphic Community Contribution

Chao Yin, Zeyu Yang

Introduction

Installation

Basic syntax

Subset row(s) in i

Select column(s) in j

Subset in i and do in j

Grouping using by

Lightning speed

Reading

Writing

Grouping

Sorting

Data table v.s. Dplyr

Nice table output

Data frame

Data table

Knitr