92 Running Sql Queries on Dataframes

Chao Pang and Krishna Kalluri

library(tidyverse)
library(openintro)
library(sparklyr)

92.1 Motivation

The goal of this tutorial is to allow ourselves and other students to have the capability to write SQL queries against dplyr dataframes directly rather than to use dplyr functions in our assignments. Despite the convenience dplyr offers, we actually prefer the beloved classic SQL queries because that’s what we are using in our daily work. We believe that a lot of people who aren’t familiar with R (specifically tidyverse) would feel the same way and prefer SQL at the beginning of their adventure into the R “universe”.

It’s important to point out that writing SQL SHOULD not be a complete replacement of dplyr, but rather an alternative option on the table for students because we just feel that one should not limit themselves to a particular data manipulation tool. We do want to acknolwedge that dplyr is a well-thought-out library, it uses the same principles as SQL queries, however, Googling for the right syntax doesn’t seem to be straightforward at times (unlike pandas for Python). The motivation of this project stemmed from our frustrations of not being able to find the right functions/syntax to manipulate the data with ease. Although we’ve learned how to use dplyr after three assignments, it would be nice to provide this alternative to the future generations of students.

92.2 SQL use cases

92.2.1 Get started

knitr is the underlying engine for knitting the R markdown files into a document, and provides support for for different programming languages. In particular, knitr provides a SQL engine that allows you write SQL queries in the code chunk, see official documentation at knitr sql engine. However, this engine requires a connection to a database, and we can only query tables that are materialized in this database. To leverage this engine, we have to create a temp database and import the dataframes of interest as tables. The dbplyer library provides a convenience function to get a in-memory database engine (sqllite) and additional functions for importing dataframes as tables. If you are not familiar with SQL queries and want to learn more about it, please visit this funny tutorial at learn-how-to-write-sql-from-r

92.2.1.1 Load the table into the dbplyr in-mem database

connection = dbplyr::src_memdb()$con
dplyr::copy_to(connection, iris)

Wow, this is actually super easy to do, it only took two lines of code to import a table. However, it took us a long time to find out such a function even existed, hopefully you won’t go through the same pain in the future 😊!

92.2.1.2 Run your first query

Once you’ve imported the table, you could get the database connection object from dbplyr directly, now you just need to pass it into the code chunk in the chunk options. There are three options specified in the below code chunk,

sql: the key word that indicates this is a sql code chunk
connection: the sqllite database connection we created earlier (any database connection would work though!)

--chunk options: {sql, connection=connection}
SELECT
  *
FROM iris

92.2.1.3 Group by

Calculate the average petal length, and petal width by species.

--chunk options: {sql, connection=connection}
SELECT
  species,
  AVG(`Petal.Length`) AS avg_petal_length,
  AVG(`Petal.Width`) AS avg_petal_width
FROM iris
GROUP BY species

92.2.1.4 Saving results as dataframes

--chunk options: {sql, connection=connection, output.var="result_df"}
SELECT
  Species AS specs,
  `Petal.Width` AS petal_width,
  `Petal.Length` AS ptal_length
FROM iris
WHERE species = 'setosa'

result_df %>% rmarkdown::paged_table()

92.2.2 Complex Use Cases

92.2.2.0.1 Popular pet names

In this tutorial, are going to use a pet name dataset called seattlepets for demonstration.The question we are trying to figure out is — what are the 20 most popular cat and dog names?

Let’s try to answer this question using the dplyr way.

seattlepets %>% 
  drop_na(animal_name) %>%
  filter (species %in% c("Dog", "Cat")) %>% 
  group_by(species, animal_name) %>%
  summarise(count=n()) %>%
  ungroup() %>%
  group_by(species) %>%
  arrange(desc(count)) %>%
  mutate(row_number=row_number()) %>%
  ungroup() %>%
  filter(row_number <= 20) %>%
  rmarkdown::paged_table()

Now let’s try to write a SQL query that does the same thing.

connection = dbplyr::src_memdb()$con
dplyr::copy_to(connection, seattlepets)

--chunk options: {sql, connection=connection}
WITH pets AS
(
  SELECT
    species, 
    animal_name,
    COUNT(*) AS count
  FROM seattlepets
  WHERE animal_name IS NOT NULL
    AND species IN ("Cat", "Dog")
  GROUP BY species, animal_name
),
animal_name_row_number AS
(
  SELECT
    species,
    animal_name,
    count,
    ROW_NUMBER() OVER(PARTITION BY species ORDER BY count DESC) AS row_number
  FROM pets
)

SELECT
  *
FROM animal_name_row_number
WHERE row_number <= 20
ORDER BY species DESC, count DESC

Voila, the tables generated by two different approaches look exactly the same, how cool is that! We can choose between the two depending on how we feel!

Let’s look at a slightly more complicated example, we want to know the 20 most popular names for dogs and cats combined. To do that, we need to structure the data in the long form for generating a Cleveland plot. Again we will do it using dplyr way first.

seattlepets %>%
  drop_na(animal_name) %>%
  filter (species %in% c("Dog", "Cat")) %>%
  group_by(animal_name, species) %>%
  summarise(count=n()) %>%
  ungroup() %>%
  pivot_wider(names_from=species, values_from=count) %>%
  mutate(total = ifelse(is.na(Dog), 0, Dog) + ifelse(is.na(Cat), 0, Cat)) %>%
  arrange(desc(total), animal_name) %>%
  mutate(row_number=row_number()) %>%
  filter(row_number <= 20) %>%
  mutate(total_count = total) %>%
  pivot_longer(cols=c(Dog, Cat, total), names_to="group_type", values_to="count") %>%

 ggplot() + 
    geom_point(aes(fct_reorder(animal_name, total_count), count, group=group_type, color=group_type)) +
    coord_flip() + 
    xlab("Animal Name") + 
    ylab("Count")

The cool thing about SQL engine is that it returns the result as a regular dplyr dataframe, which allows us to mix and match SQL queries and dplyr syntax. Here is a demonstration, where we calculate the total count of dog and cat names combined in SQL query, and leverage the pivot_longer function to structure the data for plotting.

--chunk options: {sql, connection=connection, output.var="total_pet_count"}
WITH pet_counts AS
(
  SELECT
    animal_name, 
    species,
    COUNT(*) AS count
  FROM seattlepets
  WHERE animal_name IS NOT NULL
    AND species IN ("Dog", "Cat")
  GROUP BY animal_name, species
)

SELECT
  animal_name,
  Cat,
  Dog,
  Cat + Dog AS total,
  ROW_NUMBER() OVER(ORDER BY Cat + Dog DESC) AS row_number
FROM
(
  SELECT
    animal_name,
    IFNULL(SUM(CASE WHEN species = "Cat" THEN count END), 0) AS Cat,
    IFNULL(SUM(CASE WHEN species = "Dog" THEN count END), 0) AS Dog
  FROM pet_counts
  GROUP BY animal_name
) p
ORDER BY Cat + Dog DESC

total_pet_count %>%
  filter(row_number <= 20) %>%
  mutate(total_count = total) %>%
  pivot_longer(cols=c(Dog, Cat, total), names_to="group_type", values_to="count") %>%

 ggplot() + 
    geom_point(aes(fct_reorder(animal_name, total_count), count, group=group_type, color=group_type)) +
    coord_flip() + 
    xlab("Animal Name") + 
    ylab("Count")

92.3 Lessons learned

Writing SQL queries in R markdown is not as complicated as we thought. Although Knitr already provides support for the SQL engine, their documentation does not provide enough information for our use case, where we want to write SQL queries against the dataframes. It took us some time to realize that we could use an in-memory SqlLite database and import dataframes as tables, doing so allows us to use the SQL engine directly. We feel there is a missing gap in the documentation, which is the reason we created this tutorial to help others get up to speed.

In general, most of the dplyr functions can be reproduced in SQL queries, and vice versa. One could perform data transformation using either dplyr functions or SQL queries or the combination of both. However, there are certain things that are definitely easier to do depending on your background and preference (subjective opinions). For example, we feel doing joins and aliasing columns are much easier in SQL queries, whereas pivot operations are away easier and straightforward in dplyr.

We do want to re-iterate that writing SQL should not be a replacement of dplyr functions, but rather an alternative approach one could take. In addition, we acknowledge that there are certain drawbacks in the SQL approach.

SqlLite doesn’t support a few functions including pivot, unpivot, full outer join, and right join. (These functions are offered in other SQL dialects though e.g. MSSQL)
For the intermediate results generated by the SQL chunk, we would have to import them as tables into the same in-mem database so we can use them in the subsequent SQL code chunks.
The SQL chunk doesn’t produce a paged table on its own. One would have to save the result as a dataframe and view it in a R chunk.

92.4 Future work

Given the limited time, we are happy with what we’ve achieved, however, there are few things we would like to improve upon in the near future.

92.4.1 Missing pagination for sql engine query output

We noticed that the table generated by sql engine doesn’t come with a pager, unlike the formatiting and pagination offered by dplyr by default. We looked through the options offered by the sql engine, but it doesn’t seem to support this feature. So we are planning to expand the table formatter in the existing sql engine to support pagination.

92.4.2 sql engine for dplyr dataframes

Despite the success, we still feel that importing the tables to the in-mem database and passing the connection along is tedious and unncessary, in an idea world, we could just reference the dataframes in the environment directly in the SQL queries. We have explored this idea briefly by creating a new knitr engine named eng_sql_local based on the existing eng_sql implementation. The idea is summarized below,

we use a library called queryparser to extract all the table references from the query
we get the corresponding dataframes by name from the knit_global environment
we import the dataframes into the dbplyr in-memory database
we invoke the existing eng_sql to run the query

Here is a code snippet, and we also have the code in our own fork.

# sql engine using dbplyr in-memory db
eng_sql_local = function(options) {

  query = one_string(options$code)
  table_references <- queryparser::parse_query(query)$from
  options$connection <- dbplyr::src_memdb()$con
  overwrite <-ifelse(is.null(options$overwrite), FALSE, options$overwrite)

  output <- tryCatch({
    for (i in range(1, length(table_references)))
    {
      table_name <- as.character(table_references[[i]])
      dataframe = get(table_name, envir = knit_global())
      if ((!DBI::dbExistsTable(options$connection, table_name)) | overwrite){
        DBI::dbWriteTable(dbplyr::src_memdb()$con, table_name, dataframe)
      }
    }
  },error = function(e) {
    e$message <-
      paste(e$message,
            paste("The table ", table_name, " doesn't exist in the global environment.", sep=""),
            "Using the following command to set the dataframe in the global environment:",
            "env <- knitr::knit_global()",
            paste("env$`", table_name, "` <- `", table_name, "`", sep=""),
            sep = "\n")
    e
  })

  if (inherits(output, "error"))
    return(engine_output(options, query, one_string(output)))

  return(eng_sql(options))
}

This has been tested successfully in our local environment (one would have to fork or clone our repo, build knitr locally to test out this new code block). However, the library queryparser doesn’t handle the complex queries with Common Table Expression (CTE), we would have to implement our own queryparser using regular expression to identify table names in the query. Another problem we found with this custom sql engine is that it doesn’t support the syntax highlighting, while the existing sql engine does, we will try to fix this issue in the future.

91 Joining and reshaping in R

93 TidyVerse