library(tidyverse)
library(rvest) #html webscraping
library(packageRank) #to retrieve CRAN download counts2 Appendix A: Tidyverse Gallery Scrape
The following code is designed to scrape the names of packages in the ggplot environment from the tidyverse gallery (https://exts.ggplot2.tidyverse.org/gallery/) and then retrieve the cumulative historical cran download count for each. It also scrapes the github star count listed for these packages.
First import the necessary packages:
Read in the downloaded tidyverse gallery page html file. Must download webpage for the github star scrape to work properly.
df <- read_html("raw_data/exts.ggplot2.tidyverse.org.html")Scrape the package names and store in a package_names vector.
package_names <- df |>
html_elements("div.card-content") |>
html_elements("span.card-title") |>
html_text()To find the most current total historical download count, set a target_date of two days before today. Depending on the time of day, cranDownloads is updated to either 1 or 2 days previous to the current day.
target_date <- Sys.Date()-2The below function get_total_downloads takes in a package name to retrieve a cumulative count of that package’s cran downloads up until the set target_date by utilizing the cranDownloads funcion of packageRank. Handles error that arises when package is not found on CRAN. Returned as dataframe.
get_total_downloads <- function(pkg) {
#to = 2025 pulls entire download history
cd <- tryCatch(
cranDownloads(packages = pkg, to = 2025),
#if the package is not found in cran return NA
error = function(e) NA
)
#retrieving the 'cumulative' value of a particular date gets total download count up to that date
count <- ifelse(length(cd) == 1, NA, cd$cranlogs.data$cumulative [
cd$cranlogs.data$date == target_date
])
data.frame(package = pkg, downloads = count)
}Retrieve cran download count for each package by mapping get_total_downloads across scraped package_names and combining returned dataframes in one df. Will take a few minutes to complete.
gallery_packages <- map_dfr(package_names, get_total_downloads)Scrape github star count from the gallery webpage and add to dataframe.
github_stars <- df |>
html_elements("span.github-btn") |>
html_elements("a.gh-count") |>
html_text() |> as.numeric()
gallery_packages$stars = github_starsStore data in a new dataframe with a column indicating gallery as the source and export as csv with a timestamp for when the data was generated.
gallery_packages$gallery = TRUE
head(gallery_packages) package downloads stars gallery
1 ggQQunif 24841 8 TRUE
2 ggupset 182240 354 TRUE
3 xmrr 26242 7 TRUE
4 ggpcp 18302 1 TRUE
5 gg3D NA 104 TRUE
6 ggQC 63896 46 TRUE
timestamp <- paste("# Gallery data generated on:", Sys.time())
data_lines <- format_csv(gallery_packages)
# Combine and write
write_lines(c(timestamp, data_lines), "generated_data/gallery_packages.csv")