> [!IMPORTANT]
> This package is part of a homework exercise for STATS 290 regarding data mining and web APIs.
The goal of explorecourses
is to automatically retrieve course information from Stanford University’s ExploreCourses API.
## Installation
You can install the development version of explorecourses from GitHub with:
```r
# install.packages("remotes")
remotes::install_github("coatless-rpkg/explorecourses")
```
## Usage
First, load the package:
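```r
library(explorecourses)
```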
The package contains three main functions:

- `fetch_all_courses()`: Fetches all courses from the ExploreCourses API for a set of departments (default: all departments).
- `fetch_department_courses()`: Fetches the courses for a specific department.
- `fetch_departments()`: Fetches the list of departments from the ExploreCourses API.
By default, we’ll retrieve all courses across all departments for the current academic year using:
```r
all_courses <- fetch_all_courses()
```
We can also request specific courses for a set of departments in a given academic year. For example, to retrieve all courses for the departments of “STATS” and “MATH” for the academic year 2023-2024, we can use:
```r
stats_and_math_courses <- fetch_all_courses(c("STATS", "MATH"), year = "20232024")
```
This function is excellent for retrieving course information across multiple departments for a given academic year as it allows for parallel processing of the data.
For a single department, we can use the `fetch_department_courses()` function to retrieve that department's courses for any academic year. This function has lower overhead because it does not use parallel processing. For example, to retrieve all courses for the "STATS" department, we can use:
```r
department_courses <- fetch_department_courses("STATS")
```
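Since any academic year can be requested for a single department, and assuming `fetch_department_courses()` accepts the same `year` argument as `fetch_all_courses()` (the signature is not shown above, so treat this as a sketch), a specific year could be requested with:

```r
# Hypothetical: `year` argument assumed to mirror fetch_all_courses()
stats_2023 <- fetch_department_courses("STATS", year = "20232024")
```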
To determine possible department shortcodes, we can use:
```r
departments <- fetch_departments()
```
This will return a data frame with the department short name, long name, and school the department is associated with.
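To get a feel for the result, we can inspect the returned data frame with base R:

```r
str(departments)   # column types and a preview of the values
head(departments)  # first few departments
```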
## Cache
To cache the data, we can use the `cache_dir` parameter in the `fetch_all_courses()`, `fetch_department_courses()`, and `fetch_departments()` functions. This will cause the XML data downloaded from the API to be stored in the specified directory and reused on subsequent calls.
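For example (the cache directory name is arbitrary):

```r
# Download once and store the XML locally; later calls reuse the cached files
departments <- fetch_departments(cache_dir = "explorecourses_cache")
stats_courses <- fetch_department_courses("STATS", cache_dir = "explorecourses_cache")
```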
We can list the current cache contents using the `list_cache()` function:
```r
list_cache() # List current cache
# Cache contents:
#
# Found 256 cached files
# Directory: explorecourses_cache
#
# AA       ACCT     AFRICAAM ALP      AMELANG
# AMHRLANG AMSTUD   ANES     ANTHRO   APPPHYS
# ARABLANG ARCHLGY  ARMELANG ARTHIST  ARTSINST
# ...
```
## Parallel Processing
We can speed up fetching and transforming course data by using parallel processing. For the `fetch_all_courses()` function, we've set up parallel processing using the `furrr` package, which provides `purrr`'s functional interface to the `future` parallel processing framework. As a result, we can download and process the courses for every department in parallel. Moreover, we've set up progress reporting using the `progressr` package to track the progress of the parallel processing.
```r
library(explorecourses)
library(future)
library(progressr)

# Set up parallel processing
plan(multisession)

# Set up progress reporting
handlers(handler_progress())

# Show progress bar for fetching all courses
with_progress({
  # Fetch all courses for the departments in parallel
  all_courses <- fetch_all_courses()
})

# Reset to sequential processing
plan(sequential)
```
Please note that we need to deactivate the `multisession` plan by resetting it to `sequential` after we've finished using it.
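The `future` package also lets us limit the number of parallel workers, for example:

```r
# Use at most four parallel R sessions
plan(multisession, workers = 4)
```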