2 Tidyverse Gallery Scrape
The following code scrapes the names of the ggplot2 extension packages listed in the tidyverse gallery (https://exts.ggplot2.tidyverse.org/gallery/) and then retrieves the cumulative historical CRAN download count for each. It also scrapes the GitHub star count listed for these packages.
First import the necessary packages:
library(tidyverse)
library(rvest) #html webscraping
library(packageRank) #to retrieve CRAN download counts
Read in the downloaded HTML file of the tidyverse gallery page. The page must be downloaded and saved locally for the GitHub star scrape to work properly.
df <- read_html("raw_data/exts.ggplot2.tidyverse.org.html")
Scrape the package names and store in a package_names vector.
package_names <- df |>
  html_elements("div.card-content") |>
  html_elements("span.card-title") |>
  html_text()
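The selector chain above can be tried on a small inline page without touching the saved gallery file. The sketch below assumes a card layout like the one the selectors imply; the HTML fragment and the package names pkgA and pkgB are made up for illustration.

```r
library(rvest)

# Hypothetical fragment; the real gallery markup may differ
toy_page <- minimal_html('
  <div class="card-content"><span class="card-title">pkgA</span></div>
  <div class="card-content"><span class="card-title">pkgB</span></div>
')

# Same two-step selection as in the gallery scrape
toy_names <- toy_page |>
  html_elements("div.card-content") |>
  html_elements("span.card-title") |>
  html_text()

print(toy_names)
```

Because html_elements() returns every match, the result is a character vector with one entry per card title found.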
To find the most current total historical download count, set a target_date of two days before today. Depending on the time of day, the data returned by cranDownloads runs through either one or two days before the current day, so two days back is the most recent date that is reliably present.
target_date <- Sys.Date() - 2
The function get_total_downloads below takes a package name and retrieves the cumulative count of that package's CRAN downloads up to the set target_date, using the cranDownloads function of packageRank. It handles the error that arises when a package is not found on CRAN by returning NA for the download count. The result is returned as a data frame.
get_total_downloads <- function(pkg) {
  #to = 2025 pulls entire download history
  cd <- tryCatch(
    cranDownloads(packages = pkg, to = 2025),
    #if the package is not found in cran return NA
    error = function(e) NA
  )
  #retrieving the 'cumulative' value of a particular date gets total download count up to that date
  count <- ifelse(length(cd) == 1, NA, cd$cranlogs.data$cumulative[
    cd$cranlogs.data$date == target_date
  ])
  data.frame(package = pkg, downloads = count)
}
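The error-handling and date-lookup pattern in this function can be checked offline. The sketch below is not the author's code: fake_cran_downloads is a hypothetical stand-in that only mimics the cranlogs.data shape used above (date and cumulative columns), and lookup_total is an illustrative variant that takes the target date as an argument.

```r
# Hypothetical stand-in for cranDownloads(); no network access needed
fake_cran_downloads <- function(pkg) {
  if (pkg == "not_on_cran") stop("package not found")
  counts <- c(10, 5, 7)
  list(cranlogs.data = data.frame(
    date = as.Date("2025-01-01") + 0:2,
    count = counts,
    cumulative = cumsum(counts)
  ))
}

# Illustrative variant of get_total_downloads() with an explicit date argument
lookup_total <- function(pkg, target_date) {
  # A failed lookup becomes NULL instead of stopping the whole run
  cd <- tryCatch(fake_cran_downloads(pkg), error = function(e) NULL)
  if (is.null(cd)) {
    return(data.frame(package = pkg, downloads = NA_real_))
  }
  logs <- cd$cranlogs.data
  hit <- logs$cumulative[logs$date == target_date]
  # A date missing from the log also yields NA rather than a zero-length value
  data.frame(package = pkg, downloads = if (length(hit) == 1) hit else NA_real_)
}

found <- lookup_total("some_pkg", as.Date("2025-01-02"))
missing <- lookup_total("not_on_cran", as.Date("2025-01-02"))
```

Returning a one-row data frame in every branch keeps the results easy to stack, which is what the mapping step relies on.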
Retrieve the CRAN download count for each package by mapping get_total_downloads across the scraped package_names and combining the returned data frames into one. This will take a few minutes to complete.
gallery_packages <- map_dfr(package_names, get_total_downloads)
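map_dfr still works, but purrr 1.0.0 marks it as superseded in favor of map() followed by list_rbind(). A minimal sketch of that form, using a hypothetical toy_lookup function in place of get_total_downloads so that it runs without network access:

```r
library(purrr)

# Hypothetical stand-in for get_total_downloads(); returns a one-row data frame
toy_lookup <- function(pkg) {
  data.frame(package = pkg, downloads = nchar(pkg))
}

# map() returns a list of data frames; list_rbind() stacks them row-wise
toy_result <- map(c("aa", "bbb", "c"), toy_lookup) |>
  list_rbind()
```

The result has one row per input name, in input order, just as with map_dfr.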
Scrape the GitHub star counts from the gallery webpage and add them to the data frame.
github_stars <- df |>
  html_elements("span.github-btn") |>
  html_elements("a.gh-count") |>
  html_text() |>
  as.numeric()
gallery_packages$stars <- github_stars
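Assigning the stars by position works only if every package card has exactly one star count and the two scrapes return results in the same order. One possible safeguard, sketched below, is to select each card first and then use html_element() (singular), which returns one result per card and NA where nothing matches. The div.card wrapper and the toy markup are assumptions for illustration, not the gallery's confirmed structure.

```r
library(rvest)

# Hypothetical fragment: the second card has no star count
toy_cards_page <- minimal_html('
  <div class="card"><span class="card-title">pkgA</span><a class="gh-count">12</a></div>
  <div class="card"><span class="card-title">pkgB</span></div>
')

cards <- html_elements(toy_cards_page, "div.card")

# html_element() keeps one entry per card, so names and stars stay aligned
paired <- data.frame(
  package = cards |> html_element("span.card-title") |> html_text(),
  stars = cards |> html_element("a.gh-count") |> html_text() |> as.numeric()
)

print(paired)
```

With this layout a missing star count shows up as NA in the right row instead of shifting every later value up by one.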
Add a gallery column marking the gallery as the source of these packages, then export the data frame as a CSV.
gallery_packages$gallery <- TRUE
head(gallery_packages)
   package downloads stars gallery
1 ggQQunif     23664     8    TRUE
2  ggupset    158066   354    TRUE
3     xmrr     25255     7    TRUE
4    ggpcp      7421     1    TRUE
5     gg3D        NA   104    TRUE
6     ggQC     63134    46    TRUE
write_csv(gallery_packages, "generated_data/gallery_packages.csv")