Web scraping is a useful tool for data practitioners, to state the obvious. Often, scraping is most valuable when performed on a schedule, so that new or refreshed values are incorporated into a dataset.
In the past, I’ve paid a (small) monthly fee to PythonAnywhere to run scraping jobs. However, there’s a better, free alternative offered by a familiar platform: GitHub Actions. While GitHub Actions is largely designed for code deployment automation (testing pull requests, deploying merged pull requests to production), it can also be used to run scheduled jobs, including web scraping jobs.
This post walks through the implementation of a simple GitHub action that runs daily and scrapes the headline mortgage rates posted on Freddie Mac’s home page.
To get started, create a directory called .github/workflows in your repository. Within the .github/workflows directory, create a .yml file. This will contain the details of the action workflow.
The .yml file has two basic parts:
on: specifies when the job should run, and
jobs: defines what steps should be taken.
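This action is scheduled to run daily at 08:00 UTC (GitHub evaluates cron schedules in UTC), and workflow_dispatch also allows it to be started manually:

    on:
      workflow_dispatch:
      schedule:
        - cron: '0 8 * * *'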
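To make the schedule concrete, here is a small Python sketch of how the five cron fields (minute, hour, day of month, month, day of week) translate into a run time. The helper name next_daily_run is hypothetical, and it only handles the daily case shown here. It is not how GitHub's scheduler is implemented, and scheduled runs can start later than the nominal time.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Return the next UTC run time for a daily 'M H * * *' cron expression."""
    minute, hour, day_of_month, month, day_of_week = cron.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")

    # Cron schedules on GitHub Actions are interpreted in UTC.
    now = now.astimezone(timezone.utc)
    candidate = now.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If today's slot has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


after_todays_slot = datetime(2021, 12, 18, 9, 0, tzinfo=timezone.utc)
print(next_daily_run("0 8 * * *", after_todays_slot))
```

Changing the first two fields moves the run time; for example, '30 14 * * *' would correspond to 14:30 UTC each day.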
Copy the scraper.yml, and modify as needed for your use case. Update the .py script and requirements.txt file in the root directory accordingly. This tutorial, as well as the official GitHub documentation, are good resources for building on this template.
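For readers starting from scratch, the following Python sketch scaffolds the .github/workflows directory and writes one plausible workflow file into it. The workflow contents are an assumption, not the author's actual scraper.yml: the job name, the script name scraper.py, the data folder, and the commit step are all placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical workflow contents: the job name, script name (scraper.py),
# data/ folder, and git identity are placeholders, not the author's file.
WORKFLOW = """\
name: scrape
on:
  workflow_dispatch:
  schedule:
    - cron: '0 8 * * *'
permissions:
  contents: write
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install -r requirements.txt
      - run: python scraper.py
      - name: Commit new snapshot
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add data
          git diff --staged --quiet || git commit -m "Add snapshot"
          git push
"""


def write_workflow(repo_root: Path, filename: str = "scraper.yml") -> Path:
    """Create .github/workflows under repo_root and write the workflow file."""
    workflow_dir = repo_root / ".github" / "workflows"
    workflow_dir.mkdir(parents=True, exist_ok=True)
    path = workflow_dir / filename
    path.write_text(WORKFLOW)
    return path


# Demonstration against a throwaway directory rather than a real repository.
workflow_path = write_workflow(Path(tempfile.mkdtemp()))
print(workflow_path.relative_to(workflow_path.parents[2]))
```

The commit step assumes the scraper writes its output into a data folder that already exists by the time the step runs. The `git diff --staged --quiet ||` guard skips the commit when a run produces nothing new.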
Here, the Python script grabs the mortgage rates posted on Freddie Mac’s homepage and saves them to a new file.
Over time, enough snapshots accumulate to do something meaningful with this data!
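A minimal sketch of such a script is shown below, using only the standard library. It is not the author's scraper: the HTML pattern is hypothetical (the real Freddie Mac page would need to be inspected and the pattern adapted), the function names and CSV layout are assumptions, and the sample values are made-up placeholders so the example can run offline.

```python
import csv
import re
import tempfile
import urllib.request
from datetime import date
from pathlib import Path

# Hypothetical markup: the real page is structured differently, so this
# pattern would need to be adapted after inspecting the live HTML.
RATE_PATTERN = re.compile(
    r'<span class="label">(?P<product>[^<]+)</span>\s*'
    r'<span class="rate">(?P<rate>\d+\.\d+)%</span>'
)


def fetch_html(url: str) -> str:
    """Download a page and return its decoded HTML."""
    request = urllib.request.Request(url, headers={"User-Agent": "rate-scraper-sketch"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rates(html: str) -> dict[str, float]:
    """Map each product label found in the HTML to its rate in percent."""
    return {m["product"].strip(): float(m["rate"]) for m in RATE_PATTERN.finditer(html)}


def save_snapshot(rates: dict[str, float], out_dir: Path, day: date) -> Path:
    """Write one CSV per day so that repeated runs accumulate snapshots."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"rates_{day.isoformat()}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "product", "rate_pct"])
        for product, rate in rates.items():
            writer.writerow([day.isoformat(), product, rate])
    return path


# Offline demonstration with placeholder HTML and made-up values.
sample_html = (
    '<span class="label">30-Yr FRM</span> <span class="rate">1.23%</span>'
    '<span class="label">15-Yr FRM</span> <span class="rate">4.56%</span>'
)
sample_rates = extract_rates(sample_html)
snapshot_path = save_snapshot(sample_rates, Path(tempfile.mkdtemp()), date(2000, 1, 1))
print(snapshot_path.name, sample_rates)
```

In a scheduled run, the sample HTML would be replaced by the result of fetch_html on the page's URL, and the output directory by a folder inside the repository. Writing one dated file per run keeps each snapshot separate, which makes a bad scrape easy to spot and remove.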
Let’s get a quick sense of how mortgage rates have changed since the action was first configured on December 18, 2021.
The .R script reads, joins, and cleans the saved files, and creates a trend plot.
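For readers who prefer to stay in Python, the read-and-join step could be sketched as follows. This is a stand-in for the R script rather than a translation of it: it assumes each snapshot is a CSV with date, product, and rate_pct columns (a hypothetical layout), it stops short of plotting, and the demonstration files contain placeholder values only.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def load_snapshots(data_dir: Path) -> list[dict[str, str]]:
    """Read every CSV snapshot in data_dir into one list of rows."""
    rows = []
    for path in sorted(data_dir.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


def build_series(rows: list[dict[str, str]]) -> dict[str, list[tuple[str, float]]]:
    """Group rows by product, dropping duplicate dates and sorting by date."""
    series = defaultdict(dict)
    for row in rows:
        # A later row for the same product and date overwrites the earlier one.
        series[row["product"]][row["date"]] = float(row["rate_pct"])
    # ISO-formatted dates sort chronologically as plain strings.
    return {product: sorted(by_date.items()) for product, by_date in series.items()}


# Demonstration with two placeholder snapshot files in a throwaway folder.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "rates_2000-01-02.csv").write_text(
    "date,product,rate_pct\n2000-01-02,30-Yr FRM,1.5\n"
)
(demo_dir / "rates_2000-01-01.csv").write_text(
    "date,product,rate_pct\n2000-01-01,30-Yr FRM,1.0\n"
)
trend = build_series(load_snapshots(demo_dir))
print(trend)
```

Each product's list of (date, rate) pairs can then be handed to whichever plotting library is at hand to draw the trend line.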
The takeaway? It’s clear that mortgage rates are rising rapidly from their historically low levels, propelled by the expectation of rate hikes by the Fed to counter inflation.
Thanks for following along. Check out this repo for all the components of the walkthrough.