Web Scraping: NBA Salaries

Inspired in part by Python’s Beautiful Soup, the R package rvest makes it delightfully easy to scrape data from the web. As part of the Tidyverse collection of packages, rvest fits nicely within the broader data workflow:

The Tidyverse data science workflow (source)

In this post I’ll walk through an example of using rvest to compile a dataset of NBA player salaries. To follow along, create a free RStudio Cloud account to write and run R code, and bookmark the SelectorGadget tool to help identify the CSS selectors of the page elements you want to scrape.

Background

Professional athletes are paid handsomely for their highly specialized skill sets, and NBA players are no exception. Take Steph Curry, the highest-paid player of the 2018-2019 season, who brought in a cool $37M in salary.

The NBA is the professional sports league with the highest player wages worldwide (source, image source)

ESPN publishes annual salary data going back to the 1999-2000 season. Rather than manually copying this data, which is spread across hundreds of web pages, we can write a script to compile the data automatically using rvest. The data can then be used to explore variation in salary over time, by team, and by position.

Page Structure

The first step in any web scraping project is to become familiar with the target site’s URL structure. In this case, ESPN has a different page for each season. For example, the link below contains the player salaries for the 2018-2019 season:

http://www.espn.com/nba/salaries/_/year/2019

Screenshot of the ESPN NBA Player Salaries page for the 2018-19 season

Within each season, the salaries are spread across several sub-pages, with 40 players listed on each sub-page:

http://www.espn.com/nba/salaries/_/year/2019/page/2

This means our code will need to dynamically determine the number of sub-pages to loop through when scraping the player salaries from each season.
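
The sub-page pattern can be captured in a small helper. This is only a sketch: `subpage_urls` is a hypothetical name, and it assumes the first sub-page can also be reached as `/page/1`, which the post does not confirm.

```r
# Hypothetical helper: build the sub-page URLs for one season,
# following the /page/<n> pattern shown above.
# Assumes n_pages >= 1 and that /page/1 is a valid address for the first sub-page.
subpage_urls <- function(season_url, n_pages) {
  paste0(season_url, "/page/", seq_len(n_pages))
}

urls_2019 <- subpage_urls("http://www.espn.com/nba/salaries/_/year/2019", 3)
urls_2019
```

The number of sub-pages is passed in as an argument because it differs from season to season and has to be read off the page itself.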

Code Walkthrough

With that background, let’s dive into the code! To get started, we’ll need three packages: rvest, tidyverse, and stringr. (Loading tidyverse already attaches stringr, so the separate call is optional.)

Next, we’ll define a vector of season URLs to loop over, one URL per season.
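
A minimal sketch of that vector, using base R only. The 2000 to 2019 range is an assumption based on the seasons mentioned in this post (ESPN labels each season by its ending year, so 2000 is the 1999-2000 season).

```r
# One URL per season; the year range is an assumption, adjust as needed.
seasons <- 2000:2019
season_urls <- paste0("http://www.espn.com/nba/salaries/_/year/", seasons)

head(season_urls, 2)
```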

Next comes the bulk of the code required to perform the scraping. Here we loop over each of the season URLs. By extracting the text from the .page-numbers element on each page, we can dynamically determine the number of sub-pages for each season.
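
To illustrate the page-count step without hitting the network, the sketch below parses a small inline HTML snippet instead of a live ESPN page. The "1 of 14" text format is an assumption about what the .page-numbers element contains, and `html_element()` assumes rvest 1.0 or later.

```r
library(rvest)
library(stringr)

# Stand-in for a downloaded season page. In a real run this would be
# read_html(season_url); the "1 of 14" text is an assumed format.
page <- read_html(
  "<html><body><div class='page-numbers'>1 of 14</div></body></html>"
)

page_text <- page %>%
  html_element(".page-numbers") %>%
  html_text(trim = TRUE)

# Take the trailing run of digits as the number of sub-pages.
n_pages <- as.integer(str_extract(page_text, "\\d+$"))
n_pages
```

With `n_pages` in hand, the inner loop simply visits sub-pages 1 through `n_pages` for that season.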

The html_table() command from the rvest package detects and extracts the table of salaries from each sub-page.
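
Here is a small sketch of `html_table()` on an inline table. The header labels and the "$17,142,000" salary format are assumptions about the ESPN layout; the "name, position" cell format follows the description in this post.

```r
library(rvest)

# Stand-in for one sub-page; ESPN's real markup is more elaborate.
page <- read_html("
<table>
  <tr><td>RK</td><td>NAME</td><td>TEAM</td><td>SALARY</td></tr>
  <tr><td>1</td><td>Shaquille O'Neal, C</td><td>Los Angeles Lakers</td><td>$17,142,000</td></tr>
  <tr><td>2</td><td>Kevin Garnett, PF</td><td>Minnesota Timberwolves</td><td>$16,806,000</td></tr>
</table>")

# header = FALSE keeps the header row as data, so it can be filtered out later.
raw <- page %>%
  html_element("table") %>%
  html_table(header = FALSE)

raw
```

Because the header row is kept as an ordinary row here, every column comes back as text, which is why a cleaning step is needed afterwards.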

The last step is to clean the raw scraping output by adding column names, removing extra rows, and splitting the “name” and “position” fields into two columns, since they were stored as a single column in the ESPN tables, separated by a comma.
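
The cleaning step might look roughly like the sketch below. The raw column layout and the use of `separate()` and `parse_number()` are assumptions chosen for illustration, not necessarily the functions used in the original script.

```r
library(dplyr)
library(tidyr)
library(readr)

# Stand-in for the raw html_table() output of one sub-page (layout assumed).
raw <- tibble(
  X1 = c("RK", "1", "2"),
  X2 = c("NAME", "Shaquille O'Neal, C", "Kevin Garnett, PF"),
  X3 = c("TEAM", "Los Angeles Lakers", "Minnesota Timberwolves"),
  X4 = c("SALARY", "$17,142,000", "$16,806,000")
)

clean <- raw %>%
  setNames(c("rank", "name", "team", "salary")) %>%
  filter(rank != "RK") %>%                              # drop header rows
  separate(name, into = c("name", "position"), sep = ", ") %>%
  mutate(
    rank   = as.integer(rank),
    salary = parse_number(salary),                      # strip "$" and ","
    season = 2000L                                      # season of this page
  ) %>%
  select(rank, name, position, team, salary, season)

clean
```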

The final dataset has 9,456 rows. Below are the first 10:

rank  name              position  team                    salary    season
1     Shaquille O'Neal  C         Los Angeles Lakers      17142000  2000
2     Kevin Garnett     PF        Minnesota Timberwolves  16806000  2000
3     Alonzo Mourning   C         Miami Heat              15004000  2000
4     Juwan Howard      PF        Washington Wizards      15000000  2000
5     Scottie Pippen    SF        Portland Trail Blazers  14795000  2000
6     Karl Malone       PF        Utah Jazz               14000000  2000
7     Larry Johnson     F         New York Knicks         11910000  2000
8     Gary Payton       PG        Seattle SuperSonics     11020000  2000
9     Rasheed Wallace   PF        Portland Trail Blazers  10800000  2000
10    Shawn Kemp        C         Cleveland Cavaliers     10780000  2000

Visualization

With the hard part out of the way, let’s quickly explore the trend in player salaries. We’ll visualize the distribution of salary by season:

Notice how the right tail has grown over time. It appears that most of the growth in average player salary is being driven by salaries paid to superstars. This observation is consistent with a similar analysis of NBA salaries done by Dimitrije Curcic.

“The average salary in the NBA has increased more than 7x since the 1990-91 season. Median salary has slower growth than the average; this could suggest that the financial gap between the top talents and the rest is getting larger.”

Here’s the code used to create the chart.
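
A minimal sketch of a comparable chart, assuming a data frame with `salary` and `season` columns like the final dataset. The boxplot geometry is an assumption and may differ from the original figure; the sample rows are the first five from the table above, so only one season appears.

```r
library(ggplot2)

# First five rows of the final dataset; the full data would span all seasons.
salaries <- data.frame(
  season = 2000L,
  salary = c(17142000, 16806000, 15004000, 15000000, 14795000)
)

# One box per season, with salary shown in millions of dollars.
p <- ggplot(salaries, aes(x = factor(season), y = salary / 1e6)) +
  geom_boxplot() +
  labs(x = "Season", y = "Salary ($M)", title = "NBA player salaries by season")

p
```

With the full dataset, each season gets its own box, which makes a growing right tail visible as an increasing number of high outliers.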

Web scraping is an important tool for data analysis because it enables data collection at scale. The rvest package streamlines the process, allowing you to quickly create novel datasets for analysis, like NBA player salaries.

Complete R Script: link
Final Dataset: link
