9 Introduction to Data Manipulation

Yuhe Wang

9.1 Data Wrangling with dplyr

9.1.1 Pipe:

Here we use %>% as pipeline. a %>% operation(…) == operation(a, ….)

9.1.2 Tibble

Tibble is a type of data structure that is similar to data frame in base R.Compared to normal dataframee, tibble never changes the type of the input, or the names of the variables, or the row names.

tibble(x=1:5, y=1,z=x^2+y)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

9.1.3 Pick observations with filter()

R provides standard suites: <, >=, <=, !=, ==, %in% for you to apply conditions on rows

vehicles %>% filter(year>1999)
## # A tibble: 16,649 × 12
##       id make  model  year class       trans drive   cyl displ fuel    hwy   cty
##    <dbl> <chr> <chr> <dbl> <chr>       <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
##  1 16573 Acura 3.2CL  2001 Compact Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  2 17489 Acura 3.2CL  2002 Compact Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  3 18458 Acura 3.2CL  2003 Compact Ca… Manu… Fron…     6   3.2 Prem…    26    17
##  4 18459 Acura 3.2CL  2003 Compact Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  5 15871 Acura 3.2TL  2000 Midsize Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  6 16734 Acura 3.2TL  2001 Midsize Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  7 17664 Acura 3.2TL  2002 Midsize Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  8 18629 Acura 3.2TL  2003 Midsize Ca… Auto… Fron…     6   3.2 Prem…    27    17
##  9 15872 Acura 3.5RL  2000 Midsize Ca… Auto… Fron…     6   3.5 Prem…    22    16
## 10 16735 Acura 3.5RL  2001 Midsize Ca… Auto… Fron…     6   3.5 Prem…    22    16
## # … with 16,639 more rows

9.1.4 Reorder rows with arrange()

vehicles %>% arrange(year,class,trans)
## # A tibble: 33,442 × 12
##       id make      model    year class trans drive   cyl displ fuel    hwy   cty
##    <dbl> <chr>     <chr>   <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
##  1 27049 Buick     Electr…  1984 Larg… Auto… 2-Wh…     6   4.1 Regu…    19    14
##  2 27050 Buick     Electr…  1984 Larg… Auto… 2-Wh…     8   5   Regu…    20    14
##  3 27051 Buick     Electr…  1984 Larg… Auto… 2-Wh…     8   5.7 Dies…    26    18
##  4 27057 Cadillac  Brough…  1984 Larg… Auto… Rear…     8   4.1 Regu…    19    14
##  5 27058 Cadillac  Brough…  1984 Larg… Auto… Rear…     8   5.7 Dies…    26    18
##  6 28105 Cadillac  Brough…  1984 Larg… Auto… Rear…     8   4.1 Regu…    19    14
##  7 28106 Cadillac  Fleetw…  1984 Larg… Auto… Rear…     6   4.3 Dies…    31    21
##  8 28225 Chevrolet S10 Pi…  1984 Smal… Auto… 2-Wh…     4   2   Regu…    24    18
##  9 27219 Dodge     Ram 50…  1984 Smal… Auto… 2-Wh…     4   2   Regu…    21    20
## 10 27220 Dodge     Ram 50…  1984 Smal… Auto… 2-Wh…     4   2   Regu…    20    18
## # … with 33,432 more rows

9.1.5 Create new variables using mutate()

vehicles %>% mutate(cyl_2 = cyl*2)
## # A tibble: 33,442 × 13
##       id make  model  year class trans drive   cyl displ fuel    hwy   cty cyl_2
##    <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
##  1 13309 Acura 2.2C…  1997 Subc… Auto… Fron…     4   2.2 Regu…    26    20     8
##  2 13310 Acura 2.2C…  1997 Subc… Manu… Fron…     4   2.2 Regu…    28    22     8
##  3 13311 Acura 2.2C…  1997 Subc… Auto… Fron…     6   3   Regu…    26    18    12
##  4 14038 Acura 2.3C…  1998 Subc… Auto… Fron…     4   2.3 Regu…    27    19     8
##  5 14039 Acura 2.3C…  1998 Subc… Manu… Fron…     4   2.3 Regu…    29    21     8
##  6 14040 Acura 2.3C…  1998 Subc… Auto… Fron…     6   3   Regu…    26    17    12
##  7 14834 Acura 2.3C…  1999 Subc… Auto… Fron…     4   2.3 Regu…    27    20     8
##  8 14835 Acura 2.3C…  1999 Subc… Manu… Fron…     4   2.3 Regu…    29    21     8
##  9 14836 Acura 2.3C…  1999 Subc… Auto… Fron…     6   3   Regu…    26    17    12
## 10 11789 Acura 2.5TL  1995 Comp… Auto… Fron…     5   2.5 Prem…    23    18    10
## # … with 33,432 more rows

9.1.6 Create new calculations by catgories using summarize()

We can use summarize to get some statistics for different groups. In the following example, we are getting average of difference between air time and scheduled air time, grouped by different carriers. Note that we have to add “na.rm” inside a functio to remove NA, else we will see a lot of NAs.
Common operation functions:
sd(x): standard deviation
mean(x): mean
IQR(x): interquartile range
mad(x): median absolute deviation
min(x): min
quantile(x, 0.5): ith quartile
max(x): max
first(x): the first row
nth(x,1): the nth row
last(): the last row
n(): count
n_distinct(x): count distinct
sum(): sum

vehicles %>% group_by(class, fuel) %>% summarize(mean_cty =mean(cty, na.rm = TRUE))
## # A tibble: 151 × 3
## # Groups:   class [34]
##    class        fuel                       mean_cty
##    <chr>        <chr>                         <dbl>
##  1 Compact Cars CNG                            24.5
##  2 Compact Cars Diesel                         29.2
##  3 Compact Cars Electricity                   110  
##  4 Compact Cars Gasoline or E85                21.9
##  5 Compact Cars Midgrade                       16  
##  6 Compact Cars Premium                        17.6
##  7 Compact Cars Premium Gas or Electricity     35  
##  8 Compact Cars Premium or E85                 17.2
##  9 Compact Cars Regular                        21.1
## 10 Large Cars   CNG                            13.6
## # … with 141 more rows

9.2 Tidy Data with dplyr

9.2.1 Gather/Spread

We can reshape table from wide format into long format using Gather. We can reshape table from wide format into long format using Sprea vice versa. (The following example is untrue data)

t1 <- tibble(country=c('China', 'US', 'Korea'), `1999` = c(123,323,4245),`2000` = c(12,32,424))
t1
## # A tibble: 3 × 3
##   country `1999` `2000`
##   <chr>    <dbl>  <dbl>
## 1 China      123     12
## 2 US         323     32
## 3 Korea     4245    424
t1 %>% gather(`1999`, `2000`,key='year',value = 'GDP')
## # A tibble: 6 × 3
##   country year    GDP
##   <chr>   <chr> <dbl>
## 1 China   1999    123
## 2 US      1999    323
## 3 Korea   1999   4245
## 4 China   2000     12
## 5 US      2000     32
## 6 Korea   2000    424

9.2.2 Seperate/Unite

We can combine/separate values in one column into two, or two into one(using some special characters). We can add a parameter of “convert=TRUE” to convert chars to integer directly.

t2 <- tibble(country=c('China', 'US', 'Korea'), rate=c('12/232','123/20384','2328/2301823'))
t2 %>% separate(rate,into=c("numerator", "denominator"),convert=TRUE)
## # A tibble: 3 × 3
##   country numerator denominator
##   <chr>       <int>       <int>
## 1 China          12         232
## 2 US            123       20384
## 3 Korea        2328     2301823

9.3 Relational Data with dplyr

9.3.1 prerequisites

KEYS in DBMS is an attribute or set of attributes which helps you to identify a row or a relation. A primary key uniquely identifies an observation. A foreign key uniquely identify an observation in another table. Join between two table usually takes place between two keys within different tables to ensure single join.

9.3.2 Understanding different types of joins

x () y inner join: only return the matching pairs existing in both tables left join: only keep all observations in x right join: only keep all observations in y full join:keep all observations in x and y Note that only one of the tables can have duplicate keys. If both of the tables have duplicate keys, this will cause error. Usually, having duplicate key in one table can produce unexpected result. So, the best practice is to investigate both tables firstly before joining.

# These data are frictional 
x <- tibble(Country = c('China','US','Japan','Canada'), population=c(100,200,300,400))
y <- tibble(Country = c('China','US','Japan','Mexico'), GDP=c(100,23,2142,234))
left_join(x, y, by='Country')
## # A tibble: 4 × 3
##   Country population   GDP
##   <chr>        <dbl> <dbl>
## 1 China          100   100
## 2 US             200    23
## 3 Japan          300  2142
## 4 Canada         400    NA
right_join(x, y, by='Country')
## # A tibble: 4 × 3
##   Country population   GDP
##   <chr>        <dbl> <dbl>
## 1 China          100   100
## 2 US             200    23
## 3 Japan          300  2142
## 4 Mexico          NA   234
inner_join(x, y, by='Country')
## # A tibble: 3 × 3
##   Country population   GDP
##   <chr>        <dbl> <dbl>
## 1 China          100   100
## 2 US             200    23
## 3 Japan          300  2142
full_join(x, y, by='Country')
## # A tibble: 5 × 3
##   Country population   GDP
##   <chr>        <dbl> <dbl>
## 1 China          100   100
## 2 US             200    23
## 3 Japan          300  2142
## 4 Canada         400    NA
## 5 Mexico          NA   234

In the case that two tables have different names of their keys, we can use by=c(“a”=“b”)

9.4 Datetime with Lubridate

9.4.1 Datetime from strings

Lubridate is able to parse different format of datetime strings

ymd('2020-01-01')
## [1] "2020-01-01"
mdy('March 1 2021')
## [1] "2021-03-01"
dmy('02/01/2020')
## [1] "2020-01-02"

Getting components from datetime

t <- ymd_hms('2020-01-01 12:00:00')
year(t)
## [1] 2020
## [1] 1
# day of the month
mday(t)
## [1] 1
# day of the year
yday(t)
## [1] 1
# day of the week
wday(t)
## [1] 4
hour(t)
## [1] 12
## [1] 0
## [1] 0

9.4.2 Timespan

Duration: exact number of seconds
Periods: human units (week and months)
Intervals: starting and ending point

Duration

## [1] "600s (~10 minutes)"
## [1] "1814400s (~3 weeks)"
## [1] "31557600s (~1 years)"

Periods

## [1] "15S"
days(7)
## [1] "7d 0H 0M 0S"
months(1:6)
## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
years(1)/days(1)
## [1] 365.25
today() + years(1)
## [1] "2023-01-28"

9.5 String manipulations with stringr

9.5.1 Matching Patterns

We can do regular expression matching in R easily too. string_view() is a very useful function that showcase string patterns. The first input will be input variable, the second input will be the string that we are trying to match. The following regular expression matches will be introduced using this method.

We can match a substring directly easily.

a <-c('root', 'create','time','death')
str_view(a, "ea")

We can use ‘.’ as a wildcard to match any character

a <-c('root', 'create','time','death')
str_view(a, "im.")

We can use ‘^a’ to match any string starting with an “a”, and “a$” to find any string that ending with “a”. We can also use ‘^’ at the beginning and “$” in the end to make sure that it’s an exact match.

a <-c('root', 'create','time','death','eath')
str_view(a, "^c")
str_view(a, "^death$")
str_view(a, "e$")

We can also specify how many times a character repeats itself. {n} represents we have exactly n times of repetition,{n,}: n or more, {,n}: at most n, {n,m}: between n and m

a <-'CCccoljenlq'
str_view(a, 'C{2}')

There are also extra type of strings we can match other than characters we mentioned. ‘\d’ matches any digit,‘\s’ matches any white space, [xyz] matches x, y or z, [^xyz] matches anything other than x, y and z. Here noticing that because we want to match a substring starting with one backlash, we have to specify two backlashes in the string matching.

a <- 'xeqowhe22'
str_view(a, '\\d{2}')
str_view(a, '[xyz]')

Then we will introduce a couple of methods that are useful in conjunction to use with these regular expression.
1.str_detect(): to see if we detect certain substring, return TRUE or FALSE
2.str_extract(): to extract the actual substring from the string
3.str_subset(): return a group of strings that matches certain pattern
4.str_count(): count the number of substring appearances in a string
5.str_replace(): replace substrings with certain patterns
6.str_split(): split string according to patterns

str_detect(c('case','happy','sad'), '[ey]$')
## [1]  TRUE  TRUE FALSE
str_extract_all(c('case','happy','sad'), '[ey]$')
## [[1]]
## [1] "e"
## 
## [[2]]
## [1] "y"
## 
## [[3]]
## character(0)
str_subset(c('case','happy','sad'), '[ey]$')
## [1] "case"  "happy"
a <- 'laaalk3kr23'
str_count(a,'[al]{3}')
## [1] 1
a <- 'laaalk3kr23'
str_replace(a,'[al]{3}','happy')
## [1] "happyalk3kr23"
str_split('abc def',' ')
## [[1]]
## [1] "abc" "def"

Source: R for Data Science