18 Picturing Inflation: An Exercise in Web Scraping, Plotly and Timeseries

Kevin Taylor

##############################################################
## Install Necessary Packages for Web Driver, Plotting etc. ##
##############################################################

# install.packages('RSelenium')
# install.packages('rvest')
# install.packages('dplyr')
# install.packages('XML')
# install.packages('lubridate')
# install.packages('tidyr')
# install.packages('plotly')

library(RSelenium)
library(rvest)
library(dplyr)
library(XML)
library(lubridate)
library(tidyr)
library(plotly)

18.1 Background / Goals & Motivation

Inflation in the wake of COVID-19 is one of today’s most important issues, reflected in the prices of almost all goods and services. Some products have rapidly doubled or tripled in price, while others have remained relatively stable. There are many ways to measure inflation, and many indices that reflect national averages for price, but perhaps the most well-known are published by the Bureau of Labor Statistics (BLS). Notably, the Consumer Price Index (CPI) measures the average price paid by consumers for a wide variety of commodities. Countless agencies use the Consumer/Producer Price Index for various commodities as an indicator for nationwide inflation, but collecting the data can be quite tedious. My goal/motivation was to develop a framework for gathering Price Index data efficiently from the Bureau of Labor Statistics to picture inflation, while simultaneously exploring the syntax for RSelenium, plotly and timeseries data in R.

I sought to develop a framework using web scraping and interactive visualizations to compare inflation among different commodities. The Bureau of Labor statistics offers visualizations for individual commodities. However, there is no simple way to collect data for multiple commodities and compare the relative change over time for each. The framework below scrapes data for each price index listed, and creates an interactive visualization including data from each commodity for easy comparison.

18.2 Methods

18.2.1 Creating a Web Scraping Function

I began by creating a function to scrape the data and metadata from a single url. The function takes three inputs - the url as a string, an object rD which is an instance of the rsDriver object from the RSelenium package, and an integer start_year which represents the first year for collecting data. The web driver opens the url, interacts with the page through dropdowns and buttons to update the data start year, then reads the HTML tables to scrape the data. The HTML tables on the page include the data itself as well as tables with metadata detailing the particular commodity name, industry, seasonal adjustment to the timeseries, etc. After the data is read, the data is processed for plotting, including removing characters for preliminary measurements ie) 246.6 (P) and scaling the values using the first value as 100. Scaling the data is extremely important as it allows for easy relative comparison between multiple price indices. The function get_index_data below details the process of scraping and processing the data from one Bureau of Labor Statistics timeseries url.

##########################################################
## Function to access the input url and scrape the data ##
##########################################################

#' @title get_index_data
#' @descrition Gathers Data and Metadata from a Price Index
#' Timeseries URL
#' @param url = the url string of the price index timeseries
#' @param rD = an instance of RSelenium rsDriver with appropriate parameters
#' @param start_year = the year(int) to use as a baseline
#' @return df DataFrame including timeseries index values and metadata

get_index_data = function(url, rD, start_year)
{
  # navigate to the URL
  remDr = rD[['client']]
  remDr$navigate(url)
  
  Sys.sleep(3) # Sleep to allow the webpage to load
  
  # find the 'Start Year' dropdown and select the start year
  year_checkbox = remDr$findElement(using = 'id', value = 'dv-start-year')
  year_checkbox$clickElement()
  option = remDr$findElement(using = 'xpath',
                             sprintf("//*/option[@value = '%s']", start_year))
  option$clickElement()
  
  # find the 'Update' button and click it
  update_button = remDr$findElement(using = 'xpath', "//input[@value='Update']")
  update_button$clickElement()
  
  Sys.sleep(1) ## Sleep to allow the webpage to update
  
  # read the HTML and extract the tables as a list 
  html_tables = remDr$getPageSource()[[1]] |> read_html() |> html_table()
  
  ###############################################################
  ## Using the Scraped HTML Tables, Tidy the Data for Plotting ##
  ###############################################################
  
  df = html_tables[[5]] # this is the table with the bulk of the data
  names(df) = colnames(html_tables[[4]]) # Contains the column names
  # the below table contains metadata 
  metadata = (html_tables[[3]][, c(1, 3)] |> pivot_longer(cols = -X1))
  rownames(metadata) = metadata$X1
  
  # some values listed as preliminary x.x (P) - remove this and cast as numeric
  df$Value = as.numeric(gsub('\\(P\\)', '', df$Value)) 
  # scale the value column to first value = 100
  df$Value = (df$Value/df$Value[1])*100
  # calculate the row-by-row percent change in the index value
  df$percent_change_last_update = c(
    NA,
    (df$Value[-1]-df$Value[-nrow(df)])/df$Value[-nrow(df)]
  )

  
  # data published quarterly - change quarters to months
  df$Label = gsub('Qtr1', 'Mar',
                  gsub('Qtr2', 'Jun',
                       gsub('Qtr3', 'Sep',
                            gsub('Qtr4', 'Dec', df$Label))))
  
  # data published biannually - change halves to months
  df$Label = gsub('Half1', 'Jun',
                  gsub('Half2', 'Dec', df$Label))
  
  df$dt = lubridate::ym(df$Label) # convert label to datetime year-month form
  
  df$url = url # add url to table
  df$Item = metadata['Item',]$value # add item name to table
  
  return(df)
}

18.2.2 Gathering the Data for Some Indices

This function was interesting it uses some of the web scraping techniques covered in class, as well as timeseries principles such as scaling timeseries data (base year as value 100) and lubridate package functions for conversion from character type to Date type. After creating the web scraping function, I gathered the data for a small set of indices. Items such as eggs, milk and gasoline are often highlighted as they are common items that have recently experienced inflation. I created a list of the url for the Consumer Price Index for each of these items in U.S. cities, and compiled the data using the above web scraping function. For each item I passed the function the url, the RSelenium rsDriver object I created, and the start year of 2015 to get a picture of prices before and after the COVID-19 pandemic.

urls = c(
  'https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH', # eggs
  'https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFJ01', # milk
  'https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SETB01' # gasoline
)
start_year = 2015

At this point, I ran the code below. As a web driver was used and cannot be incorporated into the bookdown collection without launching the web driver, I have commented out the code and instead read the data from a static file representing the webdriver output. To use the webdriver on Bureau of Labor Statistics webpages, simply remove the comments.

# create the Selenium Web Driver with the appropriate Chrome Driver
# check Chrome version and run binman::list_versions("chromedriver")
# to see all possible chromedriver inputs

########################
## Web Scraping Block ##
########################
# df_total = data.frame() # aggregate dataframe
# 
#   rD = rsDriver(
#    verbose = TRUE,
#    browser = 'chrome',
#    chromever = '107.0.5304.62',
#    port = 4545L,
#    check = TRUE
#   )
#   
#   for(url in urls)
#   {
#    df = get_index_data(url, rD, 2015);
#    if(nrow(df_total) == 0)
#    {
#      df_total = df
#    }
#    else
#      df_total = rbind(df_total, df)
#   }
#   df_total =  df_total[,c('dt',
#                          'Value',
#                          'percent_change_last_update',
#                          'Item',
#                          'url')]
#  #########################
df_total = read.csv('./resources/inflation_web_scraping/bls_data.csv')
df_total$dt = as.Date(df_total$dt) # reads dates as characters
head(df_total)
##   X         dt     Value percent_change_last_update Item
## 1 1 2015-01-01 100.00000                         NA Eggs
## 2 2 2015-02-01 100.94505                0.009450543 Eggs
## 3 3 2015-03-01 100.09782               -0.008393045 Eggs
## 4 4 2015-04-01  97.57482               -0.025205373 Eggs
## 5 5 2015-05-01  96.18938               -0.014198645 Eggs
## 6 6 2015-06-01 113.33988                0.178299253 Eggs
##                                                            url
## 1 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH
## 2 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH
## 3 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH
## 4 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH
## 5 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH
## 6 https://beta.bls.gov/dataViewer/view/timeseries/CUUR0000SEFH

18.2.3 Creating Interactive Visualizations

Plotly is a package that allows users to create interactive visualizations. It comes with editable layout elements and a tooltip which displays information about the data when the user hovers over the graph. I began by creating the axes and adding some general formatting (theme, title, and font elements). For each commodity timeseries I collected above, I added a trace to the Plotly figure I created, each time formatting the tooltip to display some relevant information. The final output, a composite figure with multiple indices, allows all the commodities to be compared on the same axis, a feature not included in the Bureau of Labor Statistics data viewer. Furthermore, scaling each timeseries to a base value of 100 allows the relative inflation of each commodity to be compared on the same basis. Please note: the price index value itself is not proportional to any particular unit price; rather, viewers should gain insight from the relative change in each price index over time.

#########################################################
## Create the Plotly Figure and add General Formatting ##
#########################################################

fig = plot_ly()
fig = fig |> layout(
  width = 1000,
  template = 'seaborn', # set a general template
  plot_bgcolor = 'white',
  paper_bgcolor = 'white',
  title = list(
    text = sprintf('<b>Price Index Values (Scaled)</b>
    Source:<a href="https://beta.bls.gov/dataQuery/find">Bureau of Labor Statistics</a>
    Baseline Jan %s', start_year),
    font_size = 18,
    font_family = 'Arial',
    font_color = 'black'
  ),
  xaxis = list( # x-axis formatting
    title = list(
      text = 'Date',
      font_size = 16,
      font_family = 'Arial'
    ),
    tickfont = list(
      size = 14,
      family = 'Arial'
    ),
    showgrid = TRUE,
    gridwidth = 0.6,
    gridcolor = '#EFEFEF'
  ),
  yaxis = list( # y-axis formatting
    title = list(
      text = 'Price Index Value (Scaled)',
      font_size = 16,
      font_family = 'Arial'
    ),
    tickfont = list(
      size = 14,
      family = 'Arial'
    ),
    showgrid = TRUE,
    gridwidth = 0.6,
    gridcolor = '#EFEFEF'
  ),
  hovermode = 'x unified', # display settings when hovering over chart
  hoverlabel = list(
    font_size = 14,
    font_family = 'Arial',
    align = 'left'
  )
)

#############################################################
## Add a trace to the figure for each commodity timeseries ##
#############################################################

for(url in unique(urls))
{
  data = df_total[df_total$url == url,]
  fig = fig |> add_trace( # add a trace for a particular commodity
    data = data,
    customdata = ~percent_change_last_update,
    text = ~Item,
    x = ~dt,
    y = ~Value,
    type = 'scatter',
    mode = 'markers+lines',
    name = data$Item[1],
    line = list(
      width = 2.5
    ),
    marker = list(
      opacity = 0.5
    ),
    hovertemplate = '<b>%{text}</b><br>Value: %{y:.1f}<br>% Change from Prev Update: %{customdata:.1%}<extra></extra>'
  )
}

fig

18.3 Takeaways / Lessons Learned

From this figure, based on the latest OCT-22 values published in NOV-22 retrospectively, it seems gasoline has been the most volatile commodity in the past 5 years, even doubling in average price in 2022 when compared to the baseline of JAN-2015. In OCT-2022, Milk and eggs have increased in price 15% and 28%, respectively, from JAN-2015 and gasoline is 83% more expensive, on average, than it was in JAN-2015. However, in the most recent update, eggs have experienced the highest monthly inflation, increasing 10.1% from the last month. This figure paints a clearer picture of inflation for the average consumer and can be customized with any combination of commodities.

Overall this project was an interesting exercise in creating a reproducible web scraping and visualization workflow, and providing additional visualization functionality that did not previously exist (viewing multiple scaled commodities on the same axis to compare relative inflation). I learned a great deal about webpage layout and types of interactive .html elements, web scraping syntax with the RSelenium package, timeseries and the Date type, and plotly syntax. To reproduce any of this workflow, simply use the functions and loops created in this file. Make sure to un-comment the web scraping loop and change the chromedriver parameter to your machine’s version of Chrome! Hopefully this exercise helps the readers better understand web scraping, timeseries and interactive visualization through plotly, and provides a good resource for visualizing and understanding inflation.