Chapter 5 All About Dataframes

Thomas Causero

The purpose of this cheatsheet is to present functions to deal with dataframes. To do so, I’ll use 2 different ways: one that doesn’t require any additional libraries and one using two libraries (dplyr and tidyr). Note that the tidyverse library contains those two libraries among others. So, one can decide to load dplyr and tidyr sperately or load all of them at the same time using tidyverse.

5.1 Create Data Frames

There are several ways to create a dataframe using R.

The first one is to create a dataframe with vectors as inputs for every column, to do so, you need to use the following syntax:

data.frame(x = vector_1, y = vector_2,… ,col_name = vector_n).

To avoid writing too long vectors, you might also use random data or random sampling:

  • rnorm(k) will return a k random samples from a standard normal distribution

  • runif(k) will return a k random samples from a [0,1] uniform distribution

  • rpois(k, lambda) for a poison distribution

  • rbinom(k, n, p) for a binomial distribution

Another way to create a dataframe is to use external data, either from a csv file, or using web scrapping techniques. To do so, there is a very useful package rvest, which is rather easy to use and which has a function html_table() which convert the html table into a dataframe.

In the same way, the “read_csv” function takes a csv file as input and ouput a dataframe. Note that for these functions, there are a lot of parameters such as header which enables to control which rows or columns we would like to keep.

To have more information on an R function, you can just tap ?function_name.

Another useful function is write_csv(), which is rather explicit.

Here are a few other functions that can be useful to deal with data:

  • to create a vector: 1:10 or c(1:10)

  • to repeat 5 10 times: rep(5,10)

  • it also works with strings: rep(‘hey’,10) repeat hey 10 times

  • rep(c(1,2,3),2) will repeat the vector 1:3 twice

  • rep(c(1,2,3), each = 2) will repeat each value of the vector twice

##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 2  0.18470422  0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535  0.39218406
## 4  1.33592831  0.7596725 -0.67016045
## 5  0.28714408 -1.1602793 -0.09767826

5.2 Get information on the dataframe

When dealing with dataframes, it is very useful to first look at its characteristic such as the shape or the type of data we are dealing with, here are the functions to do so.

##             x         y          z
## 1 -0.06539989 0.2256787 -0.1942168
## [1] 3
## [1] 5
## [1] 5 3
## 'data.frame':    5 obs. of  3 variables:
##  $ x: num  -0.0654 0.1847 -0.033 1.3359 0.2871
##  $ y: num  0.226 0.379 -0.227 0.76 -1.16
##  $ z: num  -0.1942 -0.2593 0.3922 -0.6702 -0.0977
## [1] Tomato Tomato Tomato Tomato Tomato Apple  Apple  Apple  Apple 
## Levels: Apple Tomato
## [1] "x" "y" "z"
## [1] "1" "2" "3" "4" "5"
##           new
## 1 -0.06539989
## 2  0.18470422
## 3 -0.03302085
## 4  1.33592831
## 5  0.28714408
##           new          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 2  0.18470422  0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535  0.39218406
## 4  1.33592831  0.7596725 -0.67016045
## 5  0.28714408 -1.1602793 -0.09767826

5.3 Concatenate dataframes

When dealing with dataframes, it can also be very useful to concatenate them. To do so, there are several functions such as rbind or cbind. In this section, I will also present functions to merge dataframe (depending on a specific key)

##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3
## [4,]    1    2    3
## [5,]    1    2    3
##   ID value.x value.y
## 1  1       5       6
## 2  2       4       7
## 3  3       3       8
## 4  4       2       9
## 5  5       1      10
##   ID value.x value.y
## 1  1       5       6
## 2  2       4       7
## 3  3       3       8
## 4  4       2       9
## 5  5       1      10

5.4 Order dataframes

Ordering a dataframe means that you want to reorder the rows depending on the value of a specific colum. You can also change the order of the columns.

##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 3 -0.03302085 -0.2270535  0.39218406
## 2  0.18470422  0.3790845 -0.25926828
## 5  0.28714408 -1.1602793 -0.09767826
## 4  1.33592831  0.7596725 -0.67016045
##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 3 -0.03302085 -0.2270535  0.39218406
## 2  0.18470422  0.3790845 -0.25926828
## 5  0.28714408 -1.1602793 -0.09767826
## 4  1.33592831  0.7596725 -0.67016045
##             x
## 1 -0.06539989
## 2  0.18470422
## 3 -0.03302085
## 4  1.33592831
## 5  0.28714408
##             x         y          z
## 1 -0.06539989 0.2256787 -0.1942168
##             x           z          y
## 1 -0.06539989 -0.19421678  0.2256787
## 2  0.18470422 -0.25926828  0.3790845
## 3 -0.03302085  0.39218406 -0.2270535
## 4  1.33592831 -0.67016045  0.7596725
## 5  0.28714408 -0.09767826 -1.1602793
##             x          y           z
## 1  1.33592831  0.7596725 -0.67016045
## 2  0.28714408 -1.1602793 -0.09767826
## 3  0.18470422  0.3790845 -0.25926828
## 4 -0.03302085 -0.2270535  0.39218406
## 5 -0.06539989  0.2256787 -0.19421678
##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 2 -0.03302085 -0.2270535  0.39218406
## 3  0.18470422  0.3790845 -0.25926828
## 4  0.28714408 -1.1602793 -0.09767826
## 5  1.33592831  0.7596725 -0.67016045

5.5 Subset of data tables

Getting subset of a dataframe is very important. Those functions are like filters in a way that they will only return specific rows (or columns) of a dataframe depending on its value.

##             x          y          z
## 1 -0.06539989  0.2256787 -0.1942168
## 2  0.18470422  0.3790845 -0.2592683
## 3 -0.03302085 -0.2270535  0.3921841
## 4  1.33592831  0.7596725 -0.6701605
##             x          y           z
## 3 -0.03302085 -0.2270535  0.39218406
## 5  0.28714408 -1.1602793 -0.09767826
##             x          y          z
## 1 -0.06539989  0.2256787 -0.1942168
## 2  0.18470422  0.3790845 -0.2592683
## 3 -0.03302085 -0.2270535  0.3921841
##             x          y           z
## 3 -0.03302085 -0.2270535  0.39218406
## 4  1.33592831  0.7596725 -0.67016045
## 5  0.28714408 -1.1602793 -0.09767826
##             x          y
## 1 -0.06539989  0.2256787
## 2  0.18470422  0.3790845
## 3 -0.03302085 -0.2270535
## 4  1.33592831  0.7596725
## 5  0.28714408 -1.1602793
##             x          y          z
## 1 -0.06539989  0.2256787 -0.1942168
## 3 -0.03302085 -0.2270535  0.3921841
##           x         y          z
## 2 0.1847042 0.3790845 -0.2592683
## 4 1.3359283 0.7596725 -0.6701605
##             x          y         z
## 1 -0.03302085 -0.2270535 0.3921841
##             x          y          z
## 1 -0.06539989  0.2256787 -0.1942168
## 2 -0.03302085 -0.2270535  0.3921841
##           x         y          z
## 1 0.1847042 0.3790845 -0.2592683
## 2 1.3359283 0.7596725 -0.6701605
##           x         y          z
## 1 0.1847042 0.3790845 -0.2592683
## 2 1.3359283 0.7596725 -0.6701605
##             x
## 1 -0.06539989
## 2  0.18470422
## 3 -0.03302085
## 4  1.33592831
## 5  0.28714408
##             x          y
## 1 -0.06539989  0.2256787
## 2  0.18470422  0.3790845
## 3 -0.03302085 -0.2270535
## 4  1.33592831  0.7596725
## 5  0.28714408 -1.1602793
##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 2  0.18470422  0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535  0.39218406
## 4  1.33592831  0.7596725 -0.67016045
## 5  0.28714408 -1.1602793 -0.09767826
##            y           z
## 1  0.2256787 -0.19421678
## 2  0.3790845 -0.25926828
## 3 -0.2270535  0.39218406
## 4  0.7596725 -0.67016045
## 5 -1.1602793 -0.09767826
##             z
## 1 -0.19421678
## 2 -0.25926828
## 3  0.39218406
## 4 -0.67016045
## 5 -0.09767826

5.6 Change dataframe shape

##          [,1]       [,2]        [,3]       [,4]        [,5]
## x -0.06539989  0.1847042 -0.03302085  1.3359283  0.28714408
## y  0.22567875  0.3790845 -0.22705346  0.7596725 -1.16027931
## z -0.19421678 -0.2592683  0.39218406 -0.6701605 -0.09767826
##   col1  col value
## 1    1 col2     1
## 2    2 col2     2
## 3    3 col2     3
## 4    1 col3     1
## 5    2 col3     2
## 6    3 col3     3
##   col1 col2 col3
## 1    1    1    1
## 2    2    2    2
## 3    3    3    3

5.7 Transforming data

Transforming data is also very important. Indeed, when one creates a new column depending on the value of others, or if one would like to change the values of a specific column, then it transforms the dataframe.

##            x          y           z
## 1  -6.539989  0.2256787 -0.19421678
## 2  18.470422  0.3790845 -0.25926828
## 3  -3.302085 -0.2270535  0.39218406
## 4 133.592831  0.7596725 -0.67016045
## 5  28.714408 -1.1602793 -0.09767826
##             x          y           z
## 1 -0.06539989  0.2256787 -0.19421678
## 2  0.18470422  0.3790845 -0.25926828
## 3 -3.30208519 -0.2270535  0.39218406
## 4  1.33592831  0.7596725 -0.67016045
## 5  0.28714408 -1.1602793 -0.09767826
##             x          y           z new_col
## 1 -0.06539989  0.2256787 -0.19421678      -1
## 2  0.18470422  0.3790845 -0.25926828       1
## 3 -0.03302085 -0.2270535  0.39218406      -1
## 4  1.33592831  0.7596725 -0.67016045       1
## 5  0.28714408 -1.1602793 -0.09767826       1
##             x          y           z          w  r  t  o  i
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1
##             x
## 1 -0.06539989
## 2  0.18470422
## 3 -0.03302085
## 4  1.33592831
## 5  0.28714408
## [1] -0.06539989  0.18470422 -0.03302085  1.33592831  0.28714408
## [1] -0.06539989  0.18470422 -0.03302085  1.33592831  0.28714408
## [1] -0.06539989  0.18470422 -0.03302085  1.33592831  0.28714408
##             x          y           z          w  r  t  o  i          a
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 -0.1307998
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1  0.3694084
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 -0.0660417
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1  2.6718566
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1  0.5742882
##            b
## 1  0.4513575
## 2  0.7581690
## 3 -0.4541069
## 4  1.5193451
## 5 -2.3205586
##             x          y           z          w  r  t  o  i          a
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 -0.1307998
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1  0.3694084
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 -0.0660417
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1  2.6718566
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1  0.5742882
##            b
## 1 -0.2615996
## 2  0.7388169
## 3 -0.1320834
## 4  5.3437132
## 5  1.1485763
##            a          b
## 1 -0.1307998 -0.2615996
## 2  0.3694084  0.7388169
## 3 -0.0660417 -0.1320834
## 4  2.6718566  5.3437132
## 5  0.5742882  1.1485763

5.8 Dealing with duplicates and missing values

NA values or duplicate values can be very annoying in a dataframe. Hopefully, some functions exist to deal with them.

##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1 NA
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1 NA
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1 NA
##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1  0
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1  0
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1  0
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1  0
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1  0
##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1 NA
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1 NA
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1 NA
## [1] x  y  z  w  r  t  o  i  NA
## <0 rows> (or 0-length row.names)
##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1 NA
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1 NA
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1 NA
##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1 NA
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1 NA
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1 NA
## [1] x  y  z  w  r  t  o  i  NA
## <0 rows> (or 0-length row.names)
##             x          y           z          w  r  t  o  i NA
## 1 -0.06539989  0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2  0.18470422  0.3790845 -0.25926828  0.3694084  1  1  1  1 NA
## 3 -0.03302085 -0.2270535  0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4  1.33592831  0.7596725 -0.67016045  2.6718566  1  1  1  1 NA
## 5  0.28714408 -1.1602793 -0.09767826  0.5742882  1  1  1  1 NA
## [1] x  y  z  w  r  t  o  i  NA
## <0 rows> (or 0-length row.names)

5.9 group_by function

The group_by function is used to create groups of observations in order to apply functions to each group separately. It is equivalent to the GROUP BY function is SQL. You need the dplyr package for this section.

##    name_col
## 1 0.3418712
## # A tibble: 3 x 4
##   tailnum count  dist delay
##   <chr>   <int> <dbl> <dbl>
## 1 N722MQ    513  546.  4.91
## 2 N723MQ    507  538.  6.42
## 3 N725MQ    575  559.  4.67