Chapter 5 All About Dataframes
Thomas Causero
The purpose of this cheatsheet is to present functions to deal with dataframes. To do so, I’ll use 2 different ways: one that doesn’t require any additional libraries and one using two libraries (dplyr and tidyr). Note that the tidyverse library contains those two libraries among others. So, one can decide to load dplyr and tidyr sperately or load all of them at the same time using tidyverse.
5.1 Create Data Frames
There are several ways to create a dataframe using R.
The first one is to create a dataframe with vectors as inputs for every column, to do so, you need to use the following syntax:
data.frame(x = vector_1, y = vector_2,… ,col_name = vector_n).
To avoid writing too long vectors, you might also use random data or random sampling:
rnorm(k) will return a k random samples from a standard normal distribution
runif(k) will return a k random samples from a [0,1] uniform distribution
rpois(k, lambda) for a poison distribution
rbinom(k, n, p) for a binomial distribution
Another way to create a dataframe is to use external data, either from a csv file, or using web scrapping techniques. To do so, there is a very useful package rvest, which is rather easy to use and which has a function html_table() which convert the html table into a dataframe.
In the same way, the “read_csv” function takes a csv file as input and ouput a dataframe. Note that for these functions, there are a lot of parameters such as header which enables to control which rows or columns we would like to keep.
To have more information on an R function, you can just tap ?function_name.
Another useful function is write_csv(), which is rather explicit.
Here are a few other functions that can be useful to deal with data:
to create a vector: 1:10 or c(1:10)
to repeat 5 10 times: rep(5,10)
it also works with strings: rep(‘hey’,10) repeat hey 10 times
rep(c(1,2,3),2) will repeat the vector 1:3 twice
rep(c(1,2,3), each = 2) will repeat each value of the vector twice
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 2 0.18470422 0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535 0.39218406
## 4 1.33592831 0.7596725 -0.67016045
## 5 0.28714408 -1.1602793 -0.09767826
5.2 Get information on the dataframe
When dealing with dataframes, it is very useful to first look at its characteristic such as the shape or the type of data we are dealing with, here are the functions to do so.
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## [1] 3
## [1] 5
## [1] 5 3
## 'data.frame': 5 obs. of 3 variables:
## $ x: num -0.0654 0.1847 -0.033 1.3359 0.2871
## $ y: num 0.226 0.379 -0.227 0.76 -1.16
## $ z: num -0.1942 -0.2593 0.3922 -0.6702 -0.0977
#There are three main data types: numeric, string and factors (very similar to vectors
#but very useful with nominal data because it also provides the set of possible values)
factor(c('Tomato','Tomato','Tomato','Tomato','Tomato','Apple','Apple','Apple','Apple'))
## [1] Tomato Tomato Tomato Tomato Tomato Apple Apple Apple Apple
## Levels: Apple Tomato
## [1] "x" "y" "z"
## [1] "1" "2" "3" "4" "5"
#to change row names or column names with a vector
colnames(df) <- c('a','y','z')
#change only one column name without needing to write all the others
colnames(df)[colnames(df)=="a"] <- 'x' #with the column name
colnames(df)[1]<-"x" #with the column number
#Using dplyr and tidyr
#you can rename variables with select but it will drop all the variables not explicitly mentioned
select(df, new = x)
## new
## 1 -0.06539989
## 2 0.18470422
## 3 -0.03302085
## 4 1.33592831
## 5 0.28714408
## new y z
## 1 -0.06539989 0.2256787 -0.19421678
## 2 0.18470422 0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535 0.39218406
## 4 1.33592831 0.7596725 -0.67016045
## 5 0.28714408 -1.1602793 -0.09767826
5.3 Concatenate dataframes
When dealing with dataframes, it can also be very useful to concatenate them. To do so, there are several functions such as rbind or cbind. In this section, I will also present functions to merge dataframe (depending on a specific key)
## [,1] [,2]
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
## [3,] 1 2 3
## [4,] 1 2 3
## [5,] 1 2 3
df1 <- data.frame(ID = 1:5, value = 5:1)
df2 <- data.frame(ID = 1:5, value = 6:10)
#to merge with a single key
merge(df1, df2, by = 'ID')
## ID value.x value.y
## 1 1 5 6
## 2 2 4 7
## 3 3 3 8
## 4 4 2 9
## 5 5 1 10
#can also merge with multiple ids: by = c(col1, col2 ...)
#merge with a key that has a different name in datasets
df3 <- data.frame(ID3 = 1:5, value = 6:10)
merge(df1, df3, by.x = 'ID', by.y = 'ID3')
## ID value.x value.y
## 1 1 5 6
## 2 2 4 7
## 3 3 3 8
## 4 4 2 9
## 5 5 1 10
5.4 Order dataframes
Ordering a dataframe means that you want to reorder the rows depending on the value of a specific colum. You can also change the order of the columns.
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 3 -0.03302085 -0.2270535 0.39218406
## 2 0.18470422 0.3790845 -0.25926828
## 5 0.28714408 -1.1602793 -0.09767826
## 4 1.33592831 0.7596725 -0.67016045
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 3 -0.03302085 -0.2270535 0.39218406
## 2 0.18470422 0.3790845 -0.25926828
## 5 0.28714408 -1.1602793 -0.09767826
## 4 1.33592831 0.7596725 -0.67016045
## x
## 1 -0.06539989
## 2 0.18470422
## 3 -0.03302085
## 4 1.33592831
## 5 0.28714408
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## x z y
## 1 -0.06539989 -0.19421678 0.2256787
## 2 0.18470422 -0.25926828 0.3790845
## 3 -0.03302085 0.39218406 -0.2270535
## 4 1.33592831 -0.67016045 0.7596725
## 5 0.28714408 -0.09767826 -1.1602793
## x y z
## 1 1.33592831 0.7596725 -0.67016045
## 2 0.28714408 -1.1602793 -0.09767826
## 3 0.18470422 0.3790845 -0.25926828
## 4 -0.03302085 -0.2270535 0.39218406
## 5 -0.06539989 0.2256787 -0.19421678
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 2 -0.03302085 -0.2270535 0.39218406
## 3 0.18470422 0.3790845 -0.25926828
## 4 0.28714408 -1.1602793 -0.09767826
## 5 1.33592831 0.7596725 -0.67016045
5.5 Subset of data tables
Getting subset of a dataframe is very important. Those functions are like filters in a way that they will only return specific rows (or columns) of a dataframe depending on its value.
#For all those functions, no additional libraries are required
#Only return specific row and column numbers
#df[row_numbers, col_numbers]
df[1:4,1:3]
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## 2 0.18470422 0.3790845 -0.2592683
## 3 -0.03302085 -0.2270535 0.3921841
## 4 1.33592831 0.7596725 -0.6701605
## x y z
## 3 -0.03302085 -0.2270535 0.39218406
## 5 0.28714408 -1.1602793 -0.09767826
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## 2 0.18470422 0.3790845 -0.2592683
## 3 -0.03302085 -0.2270535 0.3921841
## x y z
## 3 -0.03302085 -0.2270535 0.39218406
## 4 1.33592831 0.7596725 -0.67016045
## 5 0.28714408 -1.1602793 -0.09767826
## x y
## 1 -0.06539989 0.2256787
## 2 0.18470422 0.3790845
## 3 -0.03302085 -0.2270535
## 4 1.33592831 0.7596725
## 5 0.28714408 -1.1602793
#select rows under condition of a certain column
df[df$x<0,] #df[df$col %in% c('var1','var2','var3')]
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## 3 -0.03302085 -0.2270535 0.3921841
## x y z
## 2 0.1847042 0.3790845 -0.2592683
## 4 1.3359283 0.7596725 -0.6701605
#Now, I'll show similar functions using additional libraries (dplyr and tidyr)
#sample_n() (fixed number) and sample_frac() to take random samples (fixed fraction)
sample_n(df, 1)
## x y z
## 1 -0.03302085 -0.2270535 0.3921841
## x y z
## 1 -0.06539989 0.2256787 -0.1942168
## 2 -0.03302085 -0.2270535 0.3921841
## x y z
## 1 0.1847042 0.3790845 -0.2592683
## 2 1.3359283 0.7596725 -0.6701605
## x y z
## 1 0.1847042 0.3790845 -0.2592683
## 2 1.3359283 0.7596725 -0.6701605
## x
## 1 -0.06539989
## 2 0.18470422
## 3 -0.03302085
## 4 1.33592831
## 5 0.28714408
## x y
## 1 -0.06539989 0.2256787
## 2 0.18470422 0.3790845
## 3 -0.03302085 -0.2270535
## 4 1.33592831 0.7596725
## 5 0.28714408 -1.1602793
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 2 0.18470422 0.3790845 -0.25926828
## 3 -0.03302085 -0.2270535 0.39218406
## 4 1.33592831 0.7596725 -0.67016045
## 5 0.28714408 -1.1602793 -0.09767826
## y z
## 1 0.2256787 -0.19421678
## 2 0.3790845 -0.25926828
## 3 -0.2270535 0.39218406
## 4 0.7596725 -0.67016045
## 5 -1.1602793 -0.09767826
## z
## 1 -0.19421678
## 2 -0.25926828
## 3 0.39218406
## 4 -0.67016045
## 5 -0.09767826
5.6 Change dataframe shape
## [,1] [,2] [,3] [,4] [,5]
## x -0.06539989 0.1847042 -0.03302085 1.3359283 0.28714408
## y 0.22567875 0.3790845 -0.22705346 0.7596725 -1.16027931
## z -0.19421678 -0.2592683 0.39218406 -0.6701605 -0.09767826
#Using tidyr library
#gather and spread
#gathering the column names and turning them into a pair of new variables.
#One variable represents the column names as values, and the other variable
#contains the values previously associated with the column names
tmp <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
df_gather <- gather(tmp, key = 'col', value = 'value', -col1)
df_gather
## col1 col value
## 1 1 col2 1
## 2 2 col2 2
## 3 3 col2 3
## 4 1 col3 1
## 5 2 col3 2
## 6 3 col3 3
#spread to get initial dataset
#spread is the inverse function of gather
spread(df_gather, key = col, value = value)
## col1 col2 col3
## 1 1 1 1
## 2 2 2 2
## 3 3 3 3
5.7 Transforming data
Transforming data is also very important. Indeed, when one creates a new column depending on the value of others, or if one would like to change the values of a specific column, then it transforms the dataframe.
## x y z
## 1 -6.539989 0.2256787 -0.19421678
## 2 18.470422 0.3790845 -0.25926828
## 3 -3.302085 -0.2270535 0.39218406
## 4 133.592831 0.7596725 -0.67016045
## 5 28.714408 -1.1602793 -0.09767826
## x y z
## 1 -0.06539989 0.2256787 -0.19421678
## 2 0.18470422 0.3790845 -0.25926828
## 3 -3.30208519 -0.2270535 0.39218406
## 4 1.33592831 0.7596725 -0.67016045
## 5 0.28714408 -1.1602793 -0.09767826
#create a new column with a value of 1 or -1 depending on the sign of x
transform(df, new_col = ifelse(x>0, 1, -1))
## x y z new_col
## 1 -0.06539989 0.2256787 -0.19421678 -1
## 2 0.18470422 0.3790845 -0.25926828 1
## 3 -0.03302085 -0.2270535 0.39218406 -1
## 4 1.33592831 0.7596725 -0.67016045 1
## 5 0.28714408 -1.1602793 -0.09767826 1
#create new colum (there are several ways to do so)
df['w'] = 2*df['x'] #create a colum w whose values are twice the value of x
df[,'r'] = ifelse(df[['x']]>0, 1, -1) #create a column r whose values are 1 or -1
#depending on the value of x
#note that there are [[]] and not []) - df[,'r'] = ifelse(df['x']>0, 1, -1) doesn't work
df['t'] = ifelse(df[,'x']>0, 1, -1)
df$o = ifelse(df[,'x']>0, 1, -1)
df$i = ifelse(df$x>0, 1, -1)
df
## x y z w r t o i
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1
## x
## 1 -0.06539989
## 2 0.18470422
## 3 -0.03302085
## 4 1.33592831
## 5 0.28714408
## [1] -0.06539989 0.18470422 -0.03302085 1.33592831 0.28714408
## [1] -0.06539989 0.18470422 -0.03302085 1.33592831 0.28714408
## [1] -0.06539989 0.18470422 -0.03302085 1.33592831 0.28714408
#Using dplyr and tidyr libraries
#mutate() and transmute() to add new variables that are functions of existing variables.
#dplyr::mutate() is similar to transform(), but allows you to refer to columns that you’ve just created
mutate(df,a = 2*x,b = 2*y)
## x y z w r t o i a
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 -0.1307998
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 0.3694084
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 -0.0660417
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 2.6718566
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 0.5742882
## b
## 1 0.4513575
## 2 0.7581690
## 3 -0.4541069
## 4 1.5193451
## 5 -2.3205586
## x y z w r t o i a
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 -0.1307998
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 0.3694084
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 -0.0660417
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 2.6718566
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 0.5742882
## b
## 1 -0.2615996
## 2 0.7388169
## 3 -0.1320834
## 4 5.3437132
## 5 1.1485763
#transform(df, a = 2*X, b = 2*a) #doesn't work
#If you only want to keep the new variables, use transmute():
transmute(df,a = 2*x,b = 2*a)
## a b
## 1 -0.1307998 -0.2615996
## 2 0.3694084 0.7388169
## 3 -0.0660417 -0.1320834
## 4 2.6718566 5.3437132
## 5 0.5742882 1.1485763
5.8 Dealing with duplicates and missing values
NA values or duplicate values can be very annoying in a dataframe. Hopefully, some functions exist to deal with them.
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 NA
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 NA
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 NA
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 0
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 0
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 0
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 0
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 0
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 NA
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 NA
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 NA
## [1] x y z w r t o i NA
## <0 rows> (or 0-length row.names)
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 NA
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 NA
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 NA
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 NA
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 NA
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 NA
## [1] x y z w r t o i NA
## <0 rows> (or 0-length row.names)
## x y z w r t o i NA
## 1 -0.06539989 0.2256787 -0.19421678 -0.1307998 -1 -1 -1 -1 NA
## 2 0.18470422 0.3790845 -0.25926828 0.3694084 1 1 1 1 NA
## 3 -0.03302085 -0.2270535 0.39218406 -0.0660417 -1 -1 -1 -1 NA
## 4 1.33592831 0.7596725 -0.67016045 2.6718566 1 1 1 1 NA
## 5 0.28714408 -1.1602793 -0.09767826 0.5742882 1 1 1 1 NA
## [1] x y z w r t o i NA
## <0 rows> (or 0-length row.names)
5.9 group_by function
The group_by function is used to create groups of observations in order to apply functions to each group separately. It is equivalent to the GROUP BY function is SQL. You need the dplyr package for this section.
library(nycflights13)
#summarise() to condense multiple values to a single value
#It collapses a data frame to a single row
#very useful with group by
#examples of functions: min(), max(), mean(), sum(), sd(), median()
#n(): the number of observations in the current group or n_distinct(x):the number of unique values in x
summarise(df,name_col = mean(x, na.rm = TRUE))
## name_col
## 1 0.3418712
#group_by() + summarize()
flights %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(delay, count > 500, dist < 800)
## # A tibble: 3 x 4
## tailnum count dist delay
## <chr> <int> <dbl> <dbl>
## 1 N722MQ 513 546. 4.91
## 2 N723MQ 507 538. 6.42
## 3 N725MQ 575 559. 4.67