
Web Scraping: NBA Salaries

Inspired in part by Python’s Beautiful Soup, the R package rvest makes it delightfully easy to scrape data from the web. As part of the Tidyverse collection of packages, rvest fits nicely within the broader data workflow:

The Tidyverse data science workflow (source)

In this post I’ll walk through an example of using rvest to compile a dataset of NBA player salaries. To follow along, create a free RStudio Cloud account to write and run R code and bookmark the SelectorGadget tool to help identify HTML/CSS tags.

Background

Professional athletes are paid handsomely for their highly specialized skill sets, and NBA players are no exception. Take Steph Curry, the highest-paid player of the 2018-19 season, when he brought in a cool $37M in salary.

The NBA is the professional sports league with the highest player wages worldwide (source, image source)

ESPN publishes annual salary data going back to the 1999-2000 season. Rather than manually copying this data, which is spread across hundreds of web pages, we can write a script to compile the data automatically using rvest. The data can then be used to explore variation in salary over time, by team, and by position.

Page Structure

The first step in any web scraping project is to become familiar with the target site’s URL structure. In this case, ESPN has a separate page for each season. For example, the link below contains the player salaries for the 2018-2019 season:

http://www.espn.com/nba/salaries/_/year/2019

Screenshot of the ESPN NBA Player Salaries page for the 2018-19 season

Within each season, the salaries are spread across several sub-pages, with 40 players listed on each sub-page:

http://www.espn.com/nba/salaries/_/year/2019/page/2

This means our code will need to dynamically determine the number of sub-pages to loop through when scraping the player salaries from each season.

Code Walkthrough

With that background, let’s dive into the code! To get started, we’ll need three packages: rvest, tidyverse, and stringr:
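Loading them looks something like this (assuming they’re already installed):

```r
# Packages used throughout this post
library(rvest)      # scraping HTML pages
library(tidyverse)  # data wrangling and plotting
library(stringr)    # string helpers (also attached by the tidyverse)
```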

Next, we’ll define a vector of season URLs to loop over:
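A sketch of that vector, using ESPN’s year-based URL pattern shown above; the exact span of seasons is a choice rather than anything dictated by the site:

```r
# ESPN labels each season by its ending year: 2000 = the 1999-2000 season
seasons     <- 2000:2019
season_urls <- paste0("http://www.espn.com/nba/salaries/_/year/", seasons)
```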

Next comes the bulk of the code required to perform the scraping. Here we loop over each of the season URLs. By extracting the text from the .page-numbers element on each page, we can dynamically determine the number of sub-pages for each season.

The html_table() command from the rvest package detects and extracts the table of salaries from each sub-page.
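A simplified sketch of that loop is below. The “.page-numbers” text is assumed to read something like “1 of 11”, and the table layout is taken from the page structure described above, so the original script may differ in the details:

```r
salary_raw <- map_dfr(seasons, function(yr) {
  season_url <- paste0("http://www.espn.com/nba/salaries/_/year/", yr)

  # Read the first sub-page and parse the ".page-numbers" text
  # (assumed format "1 of 11") to find how many sub-pages the season has
  first_page <- read_html(season_url)
  page_text  <- first_page %>% html_node(".page-numbers") %>% html_text()
  n_pages    <- as.integer(str_extract(page_text, "\\d+$"))

  # Loop over the sub-pages and stack the salary tables
  map_dfr(seq_len(n_pages), function(page) {
    paste0(season_url, "/page/", page) %>%
      read_html() %>%
      html_node("table") %>%
      html_table(header = FALSE) %>%
      mutate(season = yr)
  })
})
```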

The last step is to clean the raw scraping output by adding column names, removing extra rows, and splitting the “name” and “position” fields into two columns, since they were stored as a single column in the ESPN tables, separated by a comma.
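A sketch of that cleanup, assuming the raw columns arrive in the order rank, name/position, team, salary (plus the season column added above):

```r
salary_clean <- salary_raw %>%
  set_names(c("rank", "name_position", "team", "salary", "season")) %>%
  # Each sub-page repeats the header row ("RK", "NAME", ...); drop those
  filter(rank != "RK") %>%
  # "Stephen Curry, PG" -> name = "Stephen Curry", position = "PG"
  separate(name_position, into = c("name", "position"), sep = ", ") %>%
  mutate(
    rank   = as.integer(rank),
    salary = parse_number(salary)  # strips the "$" and commas
  )
```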

The final dataset has 9,456 rows. Below are the first 10:

rank | name | position | team | salary | season
1 | Shaquille O'Neal | C | Los Angeles Lakers | 17142000 | 2000
2 | Kevin Garnett | PF | Minnesota Timberwolves | 16806000 | 2000
3 | Alonzo Mourning | C | Miami Heat | 15004000 | 2000
4 | Juwan Howard | PF | Washington Wizards | 15000000 | 2000
5 | Scottie Pippen | SF | Portland Trail Blazers | 14795000 | 2000
6 | Karl Malone | PF | Utah Jazz | 14000000 | 2000
7 | Larry Johnson | F | New York Knicks | 11910000 | 2000
8 | Gary Payton | PG | Seattle SuperSonics | 11020000 | 2000
9 | Rasheed Wallace | PF | Portland Trail Blazers | 10800000 | 2000
10 | Shawn Kemp | C | Cleveland Cavaliers | 10780000 | 2000

Visualization

With the hard part out of the way, let’s quickly explore the trend in player salaries. We’ll visualize the distribution of salary by season:

Notice how the right tail has grown over time. It appears that most of the growth in average player salary is being driven by salaries paid to superstars. This observation is consistent with a similar analysis of NBA salaries by Dimitrije Curcic.

“The average salary in the NBA has increased more than 7x since the 1990-91 season. Median salary has slower growth than the average; this could suggest that the financial gap between the top talents and the rest is getting larger.”

Here’s the code used to create the chart.
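A minimal ggplot2 sketch of that kind of chart, using the salary and season columns from the cleaned dataset above (the published chart may differ in geometry and styling):

```r
salary_clean %>%
  ggplot(aes(x = factor(season), y = salary)) +
  geom_boxplot(outlier.alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar) +
  coord_flip() +
  labs(
    x = "Season (ending year)", y = "Player salary",
    title = "Distribution of NBA player salaries by season"
  ) +
  theme_minimal()
```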

Web scraping is an important tool for data analysis because it enables data collection at scale. The rvest package streamlines the process, allowing you to quickly create novel datasets for analysis, like this one of NBA player salaries.

Complete R Script: link
Final Dataset: link

Which programming language should I learn first?

Aspiring programmers and data scientists often ask, “Which programming language should I learn first?” It’s a valid question, since it can take hundreds of hours of practice to become competent with your first programming language. There are a few key factors to take into consideration, like how easy the language is to learn, the job market for it, and its long-term prospects.

In this post, we’ll take a data-driven approach to determining which programming languages are the most popular and growing the fastest in order to make an informed recommendation to new entrants to the developer community.

Common Programming Languages (Source)

Quantifying Popularity

There are several ways you could measure the popularity or growth of programming languages over time. The PYPL (PopularitY of Programming Language) index is created by analyzing how often language tutorials are searched on Google; the more a language’s tutorials are searched, the more popular the language is assumed to be.

Another avenue could be analyzing GitHub metadata. GitHub is the largest code host in the world, with 40 million users and more than 100 million repositories (source). We could quantify the popularity of a programming language by measuring the number of pull requests, pushes, stars, or issues over time (example, example).

Finally, the popularity proxy I’ll use is the number of questions posted per programming language on Stack Overflow. Stack Overflow is a question-and-answer site for programmers. Questions carry tags like java and python, which makes it easier for people to find and answer them.

We’ll visualize how programming languages have trended over the last 10 years based on use of their tags on Stack Overflow.

Data Explorer

So, how are we going to source this data? Should we scrape all 18 million questions or start hitting the Stack Exchange API? No! There’s an easier way: Stack Exchange (Stack Overflow’s “parent”) exposes a data explorer to run queries against historical data.

Screenshot of the Stack Exchange Data Explorer

In other words, we can review the Stack Overflow database schema and write a SQL query to extract the data we need. Before writing any SQL, let’s think about how we’d like the query output to be structured. Each row should contain a tag (e.g. java, python), a date (year / month), and count of the number of times a question was posted using that tag:

Year | Month | Tag | Question Count

The SQL query below joins the Posts, Tags, and PostTags tables, counts the number of questions per tag each month, and returns the 100 most-used tags for each month:

Below are the first ten rows returned by the query:

Year | Month | Tag | Count | Rank
2010 | 1 | c# | 5116 | 1
2010 | 1 | java | 3728 | 2
2010 | 1 | php | 3442 | 3
2010 | 1 | javascript | 2620 | 4
2010 | 1 | .net | 2340 | 5
2010 | 1 | jquery | 2338 | 6
2010 | 1 | iphone | 2246 | 7
2010 | 1 | asp.net | 2213 | 8
2010 | 1 | c++ | 2002 | 9
2010 | 1 | python | 1949 | 10

Great, now we have the data we need. Next, how should we visualize it to measure programming language popularity over time? Let’s try an animated bar race chart using Flourish. Flourish is an online data studio that helps you visualize and tell stories with data.

In order to get the data into the right format for Flourish visualization, we’ll use R to filter and reshape the data. To smooth the trend, we’ll also calculate a moving average of tag question count.
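A sketch of that step, assuming the query results sit in a data frame called tag_counts with the Year, Month, Tag, and Count columns shown above; the three-month window is a judgment call:

```r
library(tidyverse)
library(zoo)  # rollmean()

tags_wide <- tag_counts %>%
  arrange(Tag, Year, Month) %>%
  group_by(Tag) %>%
  # Smooth each tag's monthly question count with a trailing 3-month average
  mutate(count_smooth = rollmean(Count, k = 3, fill = NA, align = "right")) %>%
  ungroup() %>%
  mutate(month_label = sprintf("%d-%02d", Year, Month)) %>%
  select(Tag, month_label, count_smooth) %>%
  # Flourish's bar chart race expects one row per tag, one column per period
  pivot_wider(names_from = month_label, values_from = count_smooth)

write_csv(tags_wide, "tags_for_flourish.csv")
```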

After uploading the reshaped data to Flourish and formatting the animated bar race chart, we can sit back and watch the programming languages fight it out for the top spot over the last decade:

It’s hard to miss the steady rise of Python, hovering in fourth and fifth place from 2010 to 2017 before accelerating into first place by late 2018.

Why has Python become so popular? First, it’s more concise, requiring less time, effort, and code to perform the same operations as languages like C++ and Java. Python is well known for its simple syntax, readability, and English-like commands. For those reasons, not to mention its rich set of libraries and large community, Python is a great place to start for new programmers and data scientists.

The story our animated bar chart tells is validated by the reporting published by Stack Overflow Insights, where we see Python growing steadily over time, measured as a percentage of questions asked on Stack Overflow in a month:

Conclusion

Using question tag data from Stack Overflow, we’ve determined that Python is probably the best programming language to learn first. We could have saved ourselves some time and done a simple Google search or consulted Reddit to come to the same conclusion, but there’s something satisfying about validating the hype with real data.

How to build a simple mobile app in 30 minutes or less

A few weeks ago I took a GRE practice test to gauge how much preparation I’d need before officially taking the exam. The scores revealed a skill gap, especially in the Verbal Reasoning section. To do well, it turns out you need a pretty robust and sophisticated vocabulary. 💡

With that insight, it was apparent I needed a way to log unfamiliar words to expand my vocabulary. I hoped to automate the process of inputting word definitions. Combining the power of Google Sheets, Glide, and Rapid API, I built a simple mobile app to help level up my vocabulary-building efforts. Take it for a spin here!

App Components

Google Sheets (Database): Spreadsheets are arguably the most successful programming model of all time, and for small projects, a Google Sheet is a great database.

Google Sheets: The database

Glide (App UI): Glide allows you to build beautiful apps without any code using drag-and-drop building blocks. After connecting to the Google Sheet “back end”, it was easy to assemble a polished progressive web app that looks and feels like a native iOS app from the App Store.


Glide: The app user interface

RapidAPI (The Dictionary API): To automate the process of sourcing word definitions and synonyms, I turned to the WordsAPI, available in the RapidAPI marketplace. The free tier includes 2,500 API calls a day.

RapidAPI / WordsAPI: the dictionary API

The cool part is that you can make external API calls from a Google Sheet via Apps Script (a scripting platform developed by Google for lightweight application development in the G Suite platform). This required dabbling in JavaScript, which was new to me as a heavy R/Python user.

The code is broken up into two functions:

  • wordsAPI makes a GET call to the WordsAPI and saves the response
  • getWordSynonym extracts the word definition from the JSON body

The best part is that once the code is saved in the App Scripts file, you can call the custom function directly in Google Sheets, just like a regular function:

=ARRAYFORMULA(getWordSynonym(A2:A))

With Google Sheets and Glide, together with the WordsAPI, I was able to make the simple tool I needed to save, learn, and review new words in preparation for the GRE. It’s exciting to see new no-code/low-code platforms like Glide being developed that make anyone a builder.

My GRE Vocab App: grevocab.glideapp.io
Underlying Google Sheet (View Only): Link
App Scripts JS Code: Link

Trends in Vault Banking Rankings

As a society, we love to rank things. We rank colleges (US News & World Report), companies (Fortune 500), sports teams (AP Top 25 Poll), and even people (IMDb STARmeter).

Sometimes rankings are useful, since they collapse many data points into a single metric, allowing for easy comparison. The problem comes when rankings built on subjective methodologies or abstract criteria are taken as absolute truth rather than as a directional guide.

With that disclaimer as a backdrop, consider Vault.com, which surveys professionals to rank the top employers in industries like law, consulting, and banking. The rankings it produces are based on surveys that try to measure things like prestige, culture, satisfaction, work/life balance, training, and compensation.

Vault rankings are created using “a weighted formula that reflects the issues professionals care most about”, such as prestige, culture, and satisfaction (source)

Obviously, the inputs (“prestige” and “culture”) are inherently abstract and highly subjective, so the output (rankings) is likely to be noisy and subjective as well. That said, I was interested to see how rankings, specifically in banking, had changed over time, so I compiled the Top 50 lists from 2011 to 2020.

The lists are composed of companies across the banking spectrum, from bulge bracket firms like Goldman Sachs and Morgan Stanley to elite boutiques like Centerview and Evercore to middle market banks like Piper Sandler and Raymond James.

Below are the results for the bulge bracket and elite boutique segments, along with a few observations, based on loose categories suggested by mergersandinquisitions.com.

  • Dominance of GS: Over the ten-year period, Goldman only dipped below #1 briefly, in 2012-13.
  • Decline of JPM: Despite clinching the #1 spot in 2012-13, JPM declined in the following years, landing at #5 in 2020.
  • Growth of BAML: Starting at #9 in 2011, BAML’s rank steadily improved over time, reaching #3 by 2020.

I compiled this data manually, but used R and ggplot2 to clean and filter the data and create the charts. You can find the full repo on GitHub here.

Import, Define ggplot Theme
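A sketch of that setup; the file name, the year/firm/rank/segment column names, and the theme choices are illustrative stand-ins rather than the original script:

```r
library(tidyverse)

# Manually compiled rankings: one row per firm per year
vault <- read_csv("vault_banking_rankings_2011_2020.csv")

# A shared theme so every chart in the post looks consistent
theme_vault <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title       = element_text(face = "bold"),
    legend.position  = "none"
  )
```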

Plot
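And a sketch of a bump-style chart of rank over time for the bulge bracket segment, assuming the columns above:

```r
p_bulge <- vault %>%
  filter(segment == "Bulge Bracket") %>%
  ggplot(aes(x = year, y = rank, color = firm)) +
  geom_line() +
  geom_point(size = 2) +
  # Label each line at its most recent year instead of using a legend
  geom_text(
    data = function(d) filter(d, year == max(year)),
    aes(label = firm), hjust = -0.1, size = 3
  ) +
  scale_y_reverse(breaks = 1:15) +          # rank 1 sits at the top
  scale_x_continuous(breaks = 2011:2020) +
  labs(
    x = NULL, y = "Vault rank",
    title = "Vault Banking 50: bulge bracket rankings, 2011-2020"
  ) +
  theme_vault
```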

Export
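Finally, the export step (dimensions are arbitrary):

```r
ggsave("vault_bulge_bracket.png", plot = p_bulge, width = 9, height = 6, dpi = 300)
```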

Thanks for reading! Feel free to check out my other blog posts or click a tag below to see related blog posts.

Visualizing Rap Communities with Python & Spotify’s API

Finding new music you like can be tough. In my experience, there’s no single discovery mechanism that delivers consistently. I usually rely on a mix of sources: websites like Pitchfork or Genius, subreddits like popheads or hiphopheads, and curated playlists like Get Turnt or Hot Rhythmic. Lately, I’ve found new favorites through a Spotify feature called “Fans Also Like”.

FANS ALSO LIKE – A Spotify music discovery feature

Listed on each artist page, the “Fans Also Like” section is an algorithmically populated discovery feature built using a metric called “artist similarity”. This metric is based on shared fans, meaning the more fans two artists have in common, the higher their similarity score.

“Artist similarity is probably the second-most important piece of data we extract from listening patterns—after popularity. It’s the data behind radio, genres, and Discover pages.”

Glenn McDonald, Spotify’s data alchemist (source)

The cool thing is that Spotify exposes this discovery algorithm via its API. After authenticating and supplying an artist ID, the API returns a list of 20 similar artists. Obviously, this is a huge win for music data nerds everywhere.

In this post, I’ll leverage Spotify’s “similar artists” API to build interactive network charts, visualizing how artists are linked together, as measured by the similarity of their fans.

Walkthrough

To access the Spotify API, you’ll need a Spotify account (free or premium), and a registered application. To make things easy, I used the spotipy library in Python, which supports all of the features of the Spotify Web API.

Next, leaning on the spotipy library to do the heavy lifting, I can retrieve the artist and “similar artist” data with two lines, passing the artist ID to the artist and artist_related_artists functions.

Here’s a sample of the result when we query Spotify for the artists most similar to Drake, according to listener behavior:

Name | Popularity | Follower Count
Big Sean | 87 | 7,113,709
J. Cole | 90 | 10,379,858
Jeremih | 84 | 4,094,532
Wale | 80 | 2,457,939
Rick Ross | 86 | 3,839,127

The list of similar artists is returned ranked by similarity score, meaning that according to the listener data, Drake is most similar to Big Sean, J. Cole, and Jeremih. Surprising? Let’s make the list more visual by creating an interactive plot using Flourish.

It’s a fun visual, but you’d find these same faces if you looked at “Fans Also Like” on Drake’s artist page. Let’s take it a step further and query the API for similar artists for the artists similar to Drake. Then we’ll start to get a sense of the pop-rap landscape.

Right off, it looks like Jeremih is the odd one out, with none of his peer artists overlapping with the rest of the group. In contrast, three of Big Sean’s five similar artists (J. Cole, Wale, and Rick Ross) also appear on Drake’s list.

Let’s see how things look when we pull in the full dataset, with each of Drake’s top 20 most similar artists and each of their 20 most similar artists.

How could we use this data to find new music? Counting the number of times an artist appeared across the second iteration of similar artists, below are the top artists to check out if you’re a Drake fan:

This has been one approach to understanding “community” in rap music. Another would be to analyze collaboration between artists and the frequency of features shared. However you find new music, “Fans Also Like” is a fantastic tool to explore new artists, and even genres.

You can find the full code to create the dataset used here and the dataset itself here.

Building a Birthday Text Bot using Twilio

A good way to show family and friends you care is remembering their birthday. It seems simple enough, but in practice, tracking birthdays for anyone beyond immediate family and very close friends can be time-consuming. Thankfully, you can automate that!

While outsourcing birthday check-in duties does feel a bit impersonal, you can always follow up on the generic message after getting a reply. This post is a tutorial for building a birthday text bot using Twilio.

The first step to building a birthday bot is storing the list of birthdays and contact information somewhere. In this example, I’ve used Coda.io to store names, birthdays, and phone numbers. While I’d prefer to use Google Sheets, Coda’s API makes it very easy to import data into the Python environment. Authentication occurs via a bearer token, and the API returns a JSON file.

After a bit of unpacking and cleaning, we have a birthday data frame like the one below (this is dummy data, for obvious privacy reasons).

Next, since this code will be deployed to a server and run on a daily schedule, we need to determine which, if any, of our family or friends is celebrating their birthday today.

Finally, we need to tap into the power of Twilio to send the actual SMS message. Twilio is a really cool API service that allows you to programmatically make phone calls and send or receive text messages.

twilio.com

Making the birthday bot come to life only requires about eight lines of code. After supplying an account identifier and authentication token, the message client takes as input the body of your text, your Twilio number, and the recipient’s phone number.

That’s it! Let’s see what the message looks like on the recipient’s end.

Very slick. By connecting a database (Coda.io) to a messaging API (Twilio), we’ve created a simple birthday text service, capable of earning you the reputation of most thoughtful friend. Enjoy!

Full code can be found on GitHub here.

Feature photo by Sarah Pflug from Burst.

Building a Lyrics Profanity Analyzer with R Shiny

A few weeks ago, a family member asked me to make them a Spotify playlist with recent rap hits. To avoid including anything excessively profane, I’d pull up the song lyrics on genius.com and manually search for potentially offensive words or phrases. Looking to streamline this process (and have a bit of fun with Shiny Apps), I built a simple tool that quickly measures profanity in any song, based on lyrics from genius.com.

The app is embedded below, but you can find a full-screen version here. Here are some sample songs to try out:

  • “Rap God”: https://genius.com/Eminem-rap-god-lyrics
  • “The London”: https://genius.com/Young-thug-the-london-lyrics
  • “Money in the Grave”: https://genius.com/Drake-money-in-the-grave-lyrics

Building the App

The first step was to create a list of offensive words to check song lyrics against. The list I used was developed by Luis von Ahn. As he notes on his resource page, “the list contains some words that many people won’t find offensive, but it’s a good start for anybody wanting to block offensive or profane terms on their site.”

Next, I needed to develop a function to scrape the lyrics from genius.com, tidy the text into a data frame format, and summarize the profanity by count in descending order.
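A sketch of such a function, using rvest to pull the page and tidytext to tokenize the lyrics; the “.lyrics” CSS selector is an assumption about Genius’s markup, and profanity_list stands in for the word list loaded earlier:

```r
library(rvest)
library(tidyverse)
library(tidytext)

profanity_summary <- function(genius_url, profanity_list) {
  # Pull the lyrics text from the Genius page
  lyrics <- read_html(genius_url) %>%
    html_nodes(".lyrics") %>%
    html_text()

  # One word per row, then keep only words that appear on the profanity list
  tibble(text = lyrics) %>%
    unnest_tokens(word, text) %>%
    filter(word %in% profanity_list) %>%
    count(word, sort = TRUE)
}
```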

Finally, I needed an interface for users to interact with. Luckily, the shiny package makes this easy. After importing the profanity list, writing the genius.com scrape function, and building the Shiny app interface locally, I was ready to deploy it to shinyapps.io.
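The Shiny piece can be as small as a text input and a table. A minimal skeleton, assuming the profanity_summary() function and profanity_list from above are already defined:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Song Profanity Analyzer"),
  textInput("url", "genius.com lyrics URL"),
  actionButton("go", "Analyze"),
  tableOutput("summary")
)

server <- function(input, output) {
  result <- eventReactive(input$go, {
    profanity_summary(input$url, profanity_list)
  })
  output$summary <- renderTable(result())
}

shinyApp(ui, server)
```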

Voila! Now I have a basic tool to quickly summarize profanity in any song found on genius.com.

Hopefully this will come in handy next time I need to put together a “family friendly” mix for events, parties, or road trips. You can find the GitHub repo for this project here, and can access the Shiny app directly here.

Feature photo by Matthew Henry from Burst.

Studying Trends in World Religion using R

Using a data set from the Pew Research Center, this post is about unpacking trends in world religion. The data set contains estimated religious compositions by country from 2010 to 2050.

Sourcing the Data

Made readily available via GitHub, the file was easy to import into the R environment. Reshaping the data from wide to long format using the tidyverse’s “gather” function simplifies plotting down the road.
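A sketch of that import-and-reshape step; the file name and the Country, Region, and Year column names are placeholders for whatever the Pew file actually uses:

```r
library(tidyverse)

# The Pew projections file (hosted on GitHub; path is a placeholder)
religion_raw <- read_csv("religion_projections.csv")

# Wide -> long: one row per country, year, and religion
religion_long <- religion_raw %>%
  gather(key = "religion", value = "percent", -Country, -Region, -Year)
```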

After reshaping, the data resembles the table below:

Visualizations

Let’s start by visualizing religious composition by region over time.
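A sketch of such a chart, built from the long table above with a simple unweighted average across countries in each region (the original analysis may aggregate differently):

```r
religion_long %>%
  group_by(Region, Year, religion) %>%
  summarise(percent = mean(percent), .groups = "drop") %>%
  ggplot(aes(x = Year, y = percent, fill = religion)) +
  geom_col(position = "fill") +
  facet_wrap(~ Region) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = NULL, y = "Share of population", fill = NULL,
    title = "Religious composition by region, 2010-2050"
  ) +
  theme_minimal()
```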

A few observations:

  • Asia-Pacific has the least concentrated religious mix, with a “rainbow” assortment of Hindus, Muslims, and Buddhists.
  • Christianity is on the decline in North America and Europe.
  • Simultaneously, the percentage of people reporting to be “unaffiliated” with any religion is growing in North America and Europe.

Next, let’s take a look at the least religious countries.

Any patterns of interest?

  • Most of the least religious countries are in Europe and Asia.
  • The Czech Republic tops the list with 76% unaffiliated, beating communist North Korea by a full five percentage points.
  • More than 50% of the population in China, Hong Kong, and Japan is non-religious.

Lastly, what will change between 2010 and 2050?

For simplicity, I’ve only included changes of more than two percentage points in either direction.

  • Again, we see evidence of a decline in the percentage of Christians globally, although it appears to be most concentrated in Europe and Sub-Saharan Africa.
  • Meanwhile, a larger portion of the population in places like Europe and Asia-Pacific is expected to be Muslim or non-religious.

Conclusion

This was a good exercise in brainstorming ways to slice a seemingly simple data set in pursuit of insights. You can find the data set for your own analysis here, or find the code that produced the visuals here.

Featured photo by Janilson Alves Furtado from Burst.

Building a Simple Crypto Alert Bot in Python

Introduction

In the long run, I think cryptocurrencies will be more valuable than they are today, on average. The investment strategy consistent with that belief is to buy and hold (disclaimer below). However, given a record of considerable volatility, could a crypto enthusiast be smarter about when to buy, in pursuit of a “bargain”?

This post outlines the process of building a simple crypto “bargain buy” alert system using Python, which sends a notification when a given cryptocurrency (BTC, XRP, ETH, etc.) appears “cheap” relative to historical prices. I use CoinAPI for current and historical cryptocurrency pricing and the Slack API for iOS and web push notifications.

My “Crypto Alerts” Slack bot notifies me of “bargain” opportunities daily

The true focus here is not the specific strategy (i.e. determining the right time to buy) but rather, demonstrating how APIs can power the creation of new and valuable services.

I broke the alert system process into four pieces:

  • Retrieve the crypto’s current price (CoinAPI)
  • Retrieve the crypto’s historical price data (CoinAPI)
  • Determine if current price is a “bargain”
  • Summarize findings via push notification (Slack API)
CoinAPI offers an entry-tier API key with 100 free daily calls

After writing the script in Python, I deployed it to PythonAnywhere and scheduled it to run daily. With that overview in place, let’s dive in and walk through the details!

Code Walkthrough

As usual, we’ll start by bringing in the necessary libraries. We’ll use the requests library to make the API calls (GET from CoinAPI and POST to the Slack API) and the pandas library to organize the JSON response.

To start, we send a request to CoinAPI to retrieve the current price of the cryptocurrency, measured in USD.

To retrieve historical exchange rates, we’ll modify the URL and specify that we’d like daily values for the last 30 days. For simplicity, we can save the results into a pandas data frame.

Now that we have the current price and a historical benchmark, we can take a stab at determining if the crypto is a “bargain”.

My approach here is unsophisticated. If the current price is less than the 20th percentile of prices from the last 30 days, it’s considered a bargain. If it’s greater than the 80th percentile, it’s a “rip-off”.

This goes without saying, but this strategy won’t make you a Bitcoin millionaire! However, it does provide a basic alert bot framework.

When I ran this code while testing, at a price of $11,706, BTC was labeled as a rip-off. Here’s a sample of the message the bot produces:

BTC is a RIP-OFF today. The current price of $11,706.27 is higher than 83.3% of closing prices during the last 30 days.

Finally, the last piece of the alert system is to distribute the trading insight via a push notification. Luckily, this is pretty easily accomplished using the Slack API.

To leverage this free resource, I created a new Slack workspace and registered an application. This supplied the required authentication token.

Once automated through PythonAnywhere, the messages look like this inside my “crypto-alerts” channel. They are also conveniently pushed to my iPhone via the Slack mobile app.

You can find the complete script here. Thanks for reading!

Disclaimer: This content is for informational purposes only. Nothing contained here constitutes a solicitation, recommendation, endorsement, or offer to buy or sell any securities or other financial instruments (including cryptocurrencies) in this or any other jurisdiction.

Measuring Commute Times with IFTTT and R

Each morning I make the journey from the suburbs of Westchester County to downtown New York City. In the process, I ride the bus, train, and subway. This post is about quantifying my time spent commuting using IFTTT and R, which will hopefully add some weight to my complaints about the daily grind.

IFTTT is a free web service that “gets all your apps and devices talking to each other.” It allows you to create simple conditional statements to automate everyday tasks. Many of the applets are centered around making your smart home “smarter”, like automatically adjusting the thermostat when you leave home.

Rather than manually log when I leave home and work each day, I automate the tracking using IFTTT. To do so, I set up two “geo-fences”: one for home and one for work. Each time I enter or exit either of those areas, a new row is created in a Google Sheet. After letting this process run in the background for about two months, I have a good sample to work with.

Let’s start by calling the necessary libraries and importing the data. The googlesheets package by Jennifer Bryan makes this easy.
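A sketch of that step; the sheet title is a placeholder, and gs_title() will prompt for Google authorization in the browser:

```r
library(tidyverse)
library(googlesheets)

# Pull the sheet IFTTT has been appending rows to
commute_sheet <- gs_title("IFTTT Commute Log")
commute_raw   <- gs_read(commute_sheet)
```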

After a quick bit of cleaning, I can calculate commute times by applying some simple logic. IFTTT is triggered every time I leave home or work, like when I grab lunch near the office or run to the grocery store. I only want to measure time when I leave home and then arrive at work, or leave work and then arrive home. I check those conditions in a for loop by comparing the locations of events i and i+1, as sketched below.
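A sketch of that loop, assuming the cleaned log is a data frame called commute_log, ordered by time, with timestamp, place (“home”/“work”), and action (“exited”/“entered”) columns:

```r
commute_times <- tibble(leg = character(), hours = numeric())

for (i in seq_len(nrow(commute_log) - 1)) {
  this_event <- commute_log[i, ]
  next_event <- commute_log[i + 1, ]

  elapsed <- as.numeric(difftime(next_event$timestamp, this_event$timestamp,
                                 units = "hours"))

  # Morning leg: exited home, then the next event is entering work
  if (this_event$place == "home" && this_event$action == "exited" &&
      next_event$place == "work" && next_event$action == "entered") {
    commute_times <- add_row(commute_times, leg = "to work", hours = elapsed)
  }

  # Evening leg: exited work, then the next event is entering home
  if (this_event$place == "work" && this_event$action == "exited" &&
      next_event$place == "home" && next_event$action == "entered") {
    commute_times <- add_row(commute_times, leg = "to home", hours = elapsed)
  }
}
```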

Now for the fun part. Let’s make a density plot to visualize the distribution of times for both legs of the commute:
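A sketch of the plot, using the commute_times table built above:

```r
ggplot(commute_times, aes(x = hours, fill = leg)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Commute time (hours)", y = "Density", fill = NULL,
    title = "Distribution of commute times, by direction"
  ) +
  theme_minimal()
```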

Because I catch the same bus every morning, travel times are more predictable, and more tightly centered around 1.1 hours. On the other hand, I rarely leave work at a consistent time. As a result, there’s more variation in how long it takes to get home, with some quick trips just over one hour and others close to two hours! In the future, I hope to leverage the Google Maps API to find the perfect times to leave work to minimize my commute home.

Thanks for reading! Check out the full code here.
