91 Joining and reshaping in R

Yisi Liu

91.1 Motivation

The majority of the analyses we have done in assignments and lectures so far have been based on a single data set. In the real world, however, raw data may be saved in different files or tables, so there may not always be a meta data set containing all of the information. For example, we want a summary table showing how many students are in each department, but department information and student information are kept separate. To begin the analysis, we will need to combine the two or more data sets into a single one and, if necessary, change it into a preferred format.

During my previous intern experience as an analytics engineer, merging and transforming the source data into the desired table is a crucial step. We performed data wrangling and ETL (extract-transform-loading) pipelines with SQL back then. While working on assignments for this class, I surprisingly discovered some comparable functions within dplyr and tidyr, which I will explain in this tutorial.

91.2 Content

  • Joining / Merging data sets
    • bind_rows, bind_cols
    • inner_join, full_join, left_join, right_join
    • semi_join, anti_join
  • Reshaping data
    • pivot_longer
    • pivot_wider

91.3 Combining Tables

91.3.1 Setup

We will use student table to walk through the tutorial today. student includes information student_id, student_name, and department_id so far.

set.seed(111)

student <- data.frame('student_id' = sample(100000:999999, 4, replace=FALSE),
                      'student_name' = c('James','Alice','Amy','Roger'),
                      'department_id' =  c(1,1,3,4))
grid.arrange(tableGrob(student), top = 'student table')

91.3.2 Types of Bind

bind_rows()

At the beginning of this semester, two new students Troy and Wendy transferred to our school. Table new_student records their student id, name, and department id (new students’ majors remain undeclared so NA here).

set.seed(123)
new_student = data.frame('student_id' = sample(100000:999999, 2, replace=FALSE),
                        'student_name' = c('Troy','Wendy'),
                        'department_id' =  NA)

grid.arrange(tableGrob(new_student), top = 'new_student table')

For the convenience of future analyses, we want to add new students’ information to our existing student table. To append the rows of new_students onto student, we perform bind_rows() function in dplyr.

student <- bind_rows(student, new_student)
grid.arrange(tableGrob(student), top = 'bind_rows: student & new_student')

bind_cols()

In this semester, suppose the first half of students on student table are assigned to take STAT 5701 while the second half to take STAT 5702. A table course records the course id and course name each student takes, but there is no student name.

course = data.frame(
        'course_id' = c(rep('STAT 5701',nrow(student)/2),rep('STAT 5702',nrow(student)/2)),
        'course_name' = c(rep('Probability',nrow(student)/2),rep('Data Visualization',nrow(student)/2)))
grid.arrange(tableGrob(course), top = 'course table')

As we are interested in questions like which class James takes or if Roger and Amy take the same class this year, we want to include these two columns in our student table.

To add an additional column, we use bind_columns() here. Now we are able to tell that James takes STAT 5701, and Roger and Amy are not taking the same class this semester.

bind_c = bind_cols(student, course) 

grid.arrange(tableGrob(bind_c) , top = 'bind_rows: student & course')

Note: rbind() and cbind() function in base R perform similarly as bind_rows() and bind_cols(). You can find more information about rbind() and cbind() by trying ?rbind and ?cbind.

91.3.3 Mutating Joins

We have shown that bind_cols() could append columns to a table. However, it is less likely that a table with the same number of rows as the primary table would always exist. Consider the case where the student table has 10,000 rows. Another table with 10,000 rows, each indicating a student’s major, would take up a lot of space. A more common way to store relational data is as follows:

department <- data.frame('department_id' = c(1,2,3,4,5),
                         'department_name' = c('Math','Computer Science', 
                                               'Philosophy','Linguistics',
                                               'Economics'))

grid.arrange(tableGrob(student),tableGrob(department), nrow=1, top = 'student table, department table')

In this way, the department table will only have to store a modest amount of data. However, the new difficulty is how to link this data to the student table, as we want to know which students belong to which departments.

Keys

We require a common variable (or combination of variables) that uniquely identifies an observation in at least one of the tables and identifies it in the same way across both tables to appropriately align observations between them. This variable is commonly referred to as a ‘key’. department id is the key shared by student and department in our case.

To accommodate the many ways tables are arranged, we require multiple distinct forms of joins. Here is a summary table. https://dplyr.tidyverse.org/reference/mutate-joins.html

Join types Meaning
inner_join keep rows that have matching records in both tables
full_join keep rows when there is a match in either left or right table
left_join keep rows from left table, and matched records from right table
right_join keep rows from right table, and matched records from left table

Now, let’s perform different joins to merge the tables.

inner_join()

join_inner <- inner_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_inner), top = 'inner join: student & department')

full_join()

join_full <- full_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_full), top = 'full join: student & department')

left_join()

join_left <- left_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_left), top = 'left join: student & department')

right_join()

join_right <- right_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_right), top = 'right join: student & department')

Multiple keys

Multiple keys may be required in some circumstances to uniquely identify an observation. Our data sets below, for example, each student has two separate observations, therefore we need to connect them by both student_id and course_id.

student2 <- data.frame('student_id' = rep(student$student_id, 2),
                       'student_name' = rep(student$student_name, 2),
                       'course_id' = c(rep('STAT 5701',nrow(student)),rep('STAT 5702',nrow(student)))) %>%
            arrange(student_id)
grades <- data.frame('course_id' = c(rep('STAT 5701',nrow(student)),rep('STAT 5702',nrow(student))),
                     'student_id' = rep(student$student_id, 2),
                     'final_grade' = c('A','A-','B','B+','A','B+','B','A-','A','A','A','A'))%>%
            arrange(course_id, student_id)

grid.arrange(tableGrob(student2),tableGrob(grades), nrow=1, top = 'student2 table, grades table')
student_with_grades = inner_join(student2, grades, by = c('student_id','course_id'))
grid.arrange(tableGrob(student_with_grades), top = 'inner join: student2 & grades')

91.3.4 Filtering joins

In addition to mutating joins that add columns to x from y, there are two more filtering joins functions within dplyr. Filtering joins filter rows from x based on the presence or absence of matches in y. https://dplyr.tidyverse.org/reference/filter-joins.html

Join types Meaning
semi_join keep rows from left table with a match in right table
anti_join keep rows from left table without a match in right table

Recall that department_id is the key for both student table and department table, and department_id= {1,3,4} occurs in both tables.

semi_join()

#only keep rows in student table
join_semi1 <- semi_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_semi1), top = 'semi join: student & department')
#only keep rows in department table
join_semi2 <- semi_join(department,student, by = 'department_id')
grid.arrange(tableGrob(join_semi2), top = 'semi join: department & student')

anti_join()

#only keep rows in student table
join_anti1 <- anti_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_anti1), top = 'anti join: student & department')
#only keep rows in department table
join_anti2 <- anti_join(department, student, by = 'department_id')
grid.arrange(tableGrob(join_anti2), top = 'anti join: department & student')

91.3.5 R vs. SQL

The commands for joins in R and SQL are included in the table below. We can see that both languages have mutating/filtering joins.

dplyr SQL
bind_rows(x, y) (SELECT * FROM x) UNION (SELECT * FROM y)
inner_join(x, y, by = ‘z’) SELECT * FROM x INNER JOIN y ON x.z = y.z
full_join(x, y, by = ‘z’) SELECT * FROM x FULL OUTER JOIN y ON x.z = y.z
left_join(x, y, by = ‘z’) SELECT * FROM x LEFT JOIN y ON x.z = y.z
right_join(x, y, by = ‘z’) SELECT * FROM x RIGHT JOIN y ON x.z = y.z
semi_join(x, y, by = ‘z’) SELECT x.* FROM x LEFT JOIN y ON x.z = y.z WHERE y.z IS NOT NULL
anti_join(x, y, by = ‘z’) SELECT x.* FROM x LEFT JOIN y ON x.z = y.z WHERE y.z IS NULL

91.4 Reshaping

pivot_wider()

Recall the table student_with_grades.

The above data makes it difficult to read and compare students’ grades in both classes. Alternatively, we can extract the courses’ names and divide them into two columns, each containing the final grade as the value. We could make such a table with the pivot wider() function in tidyr.

student_with_grades_2 = student_with_grades %>% pivot_wider(names_from = course_id, values_from = final_grade)
grid.arrange(tableGrob(student_with_grades_2), top = 'pivot_wider: student_with_grades_2')

pivot_longer()

Conversely, if we are given the ‘wider’ table student_with_grades_2, and want to convert the format to be student_with_grades in order to compare the distribution of final grades between two classes. We can do this by using the pivor_longer() function to reshape the table and plot a side-by-side bar chart from the ‘longer’ table.

student_with_grades_3 <- student_with_grades_2 %>% 
                        pivot_longer(cols = c('STAT 5701','STAT 5702'), 
                                     names_to = 'course_id', 
                                     values_to = 'final_grade')
grid.arrange(tableGrob(student_with_grades_3), top = 'pivot_longer: student_with_grades_3')
student_with_grades_3 %>% 
  ggplot(aes(fct_relevel(final_grade,'B', after = 3))) + 
  geom_bar() + 
  facet_grid(~course_id) +
  ggtitle('Distribution of Final Grades in STAT 5701 & STAT 5702') +
  xlab('Final grades')

91.5 Self-evaluation

Before starting this project, I only knew mutating joins such as inner_join() and left_join(). I discovered filtering joins like semi_join() and anti_join() while working on it. The filtering joins are extremely useful when we only need data from x based on some standards from y.

In SQL, we normally use the where statement to create such a constraint in x, as seen in the ‘R vs. SQL’ comparison table I included above. This, I believe, is the most intriguing aspect of knowing many data-handling tools: we can always find other approaches that perform the same function.

Due to time constraints, this tutorial only covers as many as ‘merging’ and ‘reshaping’ functions that I am familiar with. If I get another chance, I will go over the parameters in each functions, such as copy, keep, na_matches in mutating joins, to assist everyone understand them better.

91.6 Reference

R4DS Chapter 13: https://r4ds.had.co.nz/relational-data.html