Data & Technology Archives - Unboxed Analytics

Hip Hop’s 2023 Heavyweights

With over 15 million listeners, Spotify’s RapCaviar has been called “the most influential playlist in music.” RapCaviar is curated by Spotify’s editorial team and updated daily to represent the latest and greatest hip-hop and rap tracks.

For the last year, I’ve saved a daily snapshot of the playlist using the Spotify API to empirically determine the biggest rappers in hip hop today. In this post, we’ll use hard data to approximate influence, hustle, and longevity for rap’s biggest names during 2023.

Methodology

To collect the data, I scheduled a Python script to run daily to (1) hit Spotify’s API to collect the RapCaviar track list and (2) save the resulting data frame as a .csv file to an S3 bucket. After pulling down and combining all daily files from S3 using an R script, the tidied dataset contains 11 fields:

Field Name	Sample Value
Playlist Id	37i9dQZF1DX0XUsuxWHRQd
Playlist Name	RapCaviar
Track Playlist Position	2
Track Name	Mad Max
Track Id	2i2qDe3dnTl6maUE31FO7c
Track Release Date	2022-12-16
Track Added At	2022-12-30
Artist Track Position	1
Artist Name	Lil Durk
Artist Id	3hcs9uc56yIGFCSy9leWe7
Date	2023-01-02

The value for “artist track position” helps distinguish owners from features. For example, both Lil Durk and Future participate in the track Mad Max. Since it’s Lil Durk’s song and Future is the feature, a two row exists in the dataset, with values of “artist track position” set to 1 (Lil Durk) and 2 (Future).

After the cleaning and duplication process, the dataset contains 469 total tracks with 271 distinct artists represented across 351 distinct playlist snapshots between January 1 to December 27, 2023.

Metrics

Influence

Let’s start with influence: what percent of available days was a given artist represented on the playlist? For example, if an artist appeared in 50 of the 351 possible daily snapshots, their “influence” score would be 14.2%.

Here are the top ten rappers ranked by this influence metric, for 2023:

Name	Days Represented	Percent of Available
Drake	351	100%
Future	351	100%
Gucci Mane	351	100%
Travis Scott	351	100%
21 Savage	344	98%
Kodak Black	330	94%
Yeat	328	93%
Latto	311	89%
Lil Uzi Vert	309	88%
Quavo	302	86%

Impressively, four artists yielded sufficient influence to maintain a presence on the playlist every day of the year: Drake, Future, Gucci Mane, and Travis Scott. Here’s a visual representation of their dominant year:

Each colored line represents a unique track. With the y-axis reversed, the chart shows how new tracks enter the playlist positioned near the top and then descend over time. The biggest surprise to me is Gucci Mane, who managed to maintain his presence on the playlist via 14 distinct tracks released throughout the year:

The hustle shown here reminds me of my favorite Lil Wayne clip of all time.

Notably, 21 Savage was only a week short of full coverage, coming in at 98%.

Looking at the distribution of influence scores for all artists appearing at least once during the year, 38 (14%) were present in the RapCaviar playlist more than half of the year:

Density

It’s one thing for an artist to have one of their tracks represented on RapCaviar, but the heavyweights often have several at once. “Density” is calculated as a distinct count of tracks by artist and day.

The highest density score for 2023 was 6, a score achieved by just four rappers:

Density	Artist \| Dates
6 tracks	21 Savage \| Jun 23 – Jul 13 (21 days)
6 tracks	Lil Wayne \| Nov 10 – 16 (7 days)
6 tracks	Drake \| Oct 7 – 12 (6 days)
6 tracks	Travis Scott \| Aug 3 (1 day)

Most impressive is 21 Savage’s dominant 21-day, 6-track run over the summer, preceded by a 20-day, 5-track run. Notably, during the 6-track spree, all six were features or joint tracks:

Pull Up (feat. 21 Savage)
Wit Da Racks (feat. 21 Savage, Travis Scott & Yak Gotti)
Peaches & Eggplants (feat. 21 Savage)
06 Gucci (feat. DaBaby & 21 Savage)
War Bout It (feat. 21 Savage)
Spin Bout U (with Drake)

Contributing to more than 10% of the playlist’s track count simultaneously is truly impressive (RapCaviar usually has 50 tracks total); rap’s heavyweights are dense.

Longevity

Finally, let’s consider longevity, meaning how long an artist’s tracks remains on the playlist. Here are the top ten songs by lifespan on the RapCaviar track list during ’23:

Track	Artist	Days	First Day	Last Day
f*kumean	Gunna	179	Jun 19	Dec 14
Turn Yo Clic Up	Quavo	167	Jul 14	Dec 27
Search & Rescue	Drake	161	Apr 7	Sep 15
500lbs	Lil Tecca	159	Jul 21	Dec 27
I KNOW ?	Travis Scott	153	Jul 28	Dec 27
Paint The Town Red	Doja Cat	146	Aug 4	Dec 27
MELTDOWN	Travis Scott	143	Aug 1	Dec 21
Private Landing	Don Toliver	136	Feb 24	Jul 13
Superhero	Metro Boomin	135	Jan 2	May 25
All My Life	Lil Durk	133	May 12	Sep 21

Importantly, four of the tracks in the top ten are still active (italicized above), so there’s a decent chance Turn Yo Clic Up could outlive f*kumean. Speaking of which, Gunna’s first top-ten solo single managed to spend almost six months on RapCaviar, complete with a position surge in mid-August:

Zooming in, here’s the position history for all of those top ten tracks:

Most of the time, a track will debut on the playlist and then fade out over time, sinking deeper in the set list before falling off. Good examples are All My Life, Private Landing, and Search & Rescue. Hits like 500lbs and Paint The Town Red are more anomalous, with momentum building within the playlist over time.

To close this metric out, let’s look at the top ten rappers with the highest average longevity per track, for those artists with three or more distinct tracks ever appearing on the playlist during the year:

Name	Median Longevity	Average Longevity	Track Count
Gunna	116	103	4
Metro Boomin	92	96	6
Ice Spice	96	72	4
Latto	75	74	4
Moneybagg Yo	75	69	6
Lil Uzi Vert	68	64	8
Don Toliver	41	64	5
Toosii	79	63	3
Key Glock	61	62	4
Sexyy Red	63	62	4

Conclusion

The influence and density metrics point toward the same heavyweights: 21 Savage, Drake, and Travis Scott. This is intuitive since the two metrics are correlated. The longevity metric shines the spotlight on a different subset of rappers, like Gunna, Metro Boomin, and Ice Spice.

Either way, it was a great year for rap. Thanks for reading!

Python script (scheduled sourcing)
R script (combining, cleaning, visualization)
Raw dataset
Clean dataset

GitHub Actions for Data Analysts

Web scraping is a useful tool for data practitioners, to state the obvious. Often, scraping is most valuable when performed on a scheduled basis, to incorporate new or refreshed values into the dataset.

In the past, I’ve paid a (small) monthly fee to PythonAnywhere to run scraping jobs. However, there’s a better, free alternative offered by a familiar platform: GitHub Actions. While GitHub Actions is largely designed for code deployment automation (testing pull requests, deploying merged pull requests to production) it can also be used to run jobs, including web scraping jobs.

This post walks through the implementation of a simple GitHub action, which scrapes the headline mortgage rates posted on Freddie Mac’s home page daily.

Setup

To get started, create a directory called .github/workflows in your repository. Within the .github/workflows directory, create a .yml file. This will contain the details of the action workflow.

The .yml file structure has two basic parts: on:specifies when the job should run, and jobs: defines what steps should be taken.

This action has been scheduled to run daily:

on:
  workflow_dispatch:
  schedule:
    - cron: '0 8 * * *'

Copy the scraper.yml, and modify as needed for your use case. Update the .py script and requirements.txt file in the root directory accordingly. This tutorial, as well as the official GitHub documentation, are good resources for building on this template.

Here, the python script is grabbing the mortgage rates posted on Freddie Mac’s homepage and saving them to a new .csv file.

Over time, enough snapshots accumulate to do something meaningful with this data!

Analysis

Let’s get a quick sense of how mortgage rates have changes since the action was first configured on December 18, 2021.

This .R script reads, joins, and cleans the saved files, and creates a trend plot:

The takeaway? It’s clear that mortgage rates are rapidly rising from their historically-low levels, propelled by the expectation of rate hikes by the Fed to counter inflation.

Thanks for following along. Check out this repo for all the components of the walkthrough.

Meetinghouses: A Proxy for Growth in the LDS Church

Background

When the Church of Jesus Christ of Latter-Day Saints was organized in a small town in New York in 1830, there were only six members. Today, the Mormon Church has grown to over 16 million members, with congregations in 160 countries.

Leaders of the church have taught that this extraordinary growth is fulfillment of the Old Testament prophecy. Daniel 2:31–45 describes a stone “cut out of the mountain without hands” which would roll forth to fill the whole earth. Like the stone, the Church is prophesied to spread and fill “every nation, kindred, tongue, and people.” (See D&C 42:58)

Given the ambitious scope of that prophecy, it’s no wonder that many parties, inside and outside the organization, are interested in measuring and tracking the growth of the LDS Church. While high-level membership metrics are shared bi-annually by church leadership, country or state-specific trends are not provided. This project is an attempt to more precisely measure church growth by tracking changes in the number and distribution of meetinghouses and wards over time.

Wards are the basic organizational unit of the Church, e.g., a congregation.

DATA SOURCE

To help members or visitors locate worship services nearby, the Church provides a meetinghouse locator tool. After entering an address, the user is shown nearby meetinghouses and hours of ward worship services.

Since there are many thousands of meetinghouses owned by the Church across the world, it would be very difficult to collect meetinghouse and ward details manually. However, using the back-end web service that powers the meetinghouse locator, it’s possible to query the full list of meetinghouses, along with the ward units assigned to those meetinghouses. You can find a copy of the code used to extract and clean the data here.

Note: The meetinghouse locator is publicly available online and is not restricted to authenticated users. Consequently, the underlying meetinghouse data is presumed to be open and available for collection and analysis.

DATA Structure

There are currently two data outputs: (1) a list of the ~19,000+ meetinghouses owned or operated by the Church and (2) a list of ~30,000+ wards or other organizational units “assigned” to those meetinghouses. Below is a simple example to make this relationship clear:

Meetinghouses Table

id	address	city	state	country	latitude	longitude
5272017-01-01	6695 S 2200 W	West Jordan	Utah	USA	40.629395	-111.9480540

Meetinghouse Assignments Table

meetinghouse_id	assignment_id	assignment_type	assignment_name
5272017-01-01	125857	ward	Colonial Park Ward
5272017-01-01	170534	ward	Meadowland Ward

In other words, the first table says there’s a meetinghouse (church building) on 6695 South 2200 West in West Jordan, UT. The second table says that there are two wards that meet in that building, the Colonial Park Ward and the Meadowland Ward.

Use Cases

The data described above represents a snapshot in time. It describes the distribution of meetinghouses and wards around the world at the moment the data is compiled, and could be used to answer these kinds of questions:

How many meetinghouses are there currently in Sao Paulo, Brazil?
How many wards are there currently in Coconut Creek, Florida?

While these are interesting questions, the bigger prize is understanding how the number of meetinghouses and wards in each country, state, or zip code is changing over time. To accomplish this, we need to capture and compare snapshots at regular intervals.

For example, by comparing the list of meetinghouses in January 2020 and June 2021, we could infer which meetinghouses are new, where they are located, which meetinghouses have been closed, and where they are located. Ultimately, this should serve as a kind of (imperfect) proxy for growth or migration effects within the church. My intention is to capture monthly snapshots of this data, and then stitch it together to analyze trends in growth (or decline).

Data download

Visit ldsmeetinghouses.com or the GitHub repo for the latest files for download.

Have a question, suggestion, or idea? Create a new issue via GitHub here.

Minivan Wars: Visualizing Prices in the Used Car Market

With the recent birth of our second child, it was time to face a harsh reality: the impending necessity of a minivan. After trying to cope by dreaming up a list of alternative “family” cars, the truth set in: with young kids, features like sliding doors, captain chairs, and amble storage space can’t be beat.

Looking to get acquainted with prices in the used minivan market, I scraped 20 years’ worth of monthly average price data from CarGurus for five minivan models: Kia Sedona, Toyota Sienna, Chrysler Pacifica, Honda Odyssey, and Dodge Grand Caravan. May the best car win!

As one of the most visited car shopping sites in the United States, CarGurus tracks prices for millions of used car listings every year. With a bit of web scraping (using R), I compiled a dataset to visualize how car prices for used minivans have changed over time.

Here’s the result, for minivan models released between 2015 and 2019:

At first glance, my impression is that the Honda Odyssey and Toyota Sienna fall in the “premium” segment of the minivan market (You be the judge: is premium minivan an oxymoron?). On average, prices are higher compared to the Kia Sedona and Dodge Grand Caravan.

Second, I was struck by how steadily deprecation appears to occur for the Honda Odyssey. Roughly speaking, you can expect your Odyssey to depreciate by about $5k a year in the early years of ownership.

Finally, the impact of the COVID-19 pandemic and related semiconductor shortage becomes really clear in this picture. Notice the uptick in average price across the board for almost all make-model and year combinations. Because of the reduced supply of new vehicles (thanks to the semiconductor shortage), would-be buyers of new cars have moved into the used car market, driving up prices.

Bottom line, this visual helped me develop a better feel for the prices we’ll encounter in the used minivan market. You can find the script used to create the dataset here (and below), and the dataset itself here. Thanks for reading!

Exploring the Marvel Cinematic Universe in Tableau

The first Marvel Avengers movie was released right around the time I graduated from high school. In fact, I saw it in theaters during Senior Day with my graduating class. Since then, over the last 10 years, the Marvel Cinematic Universe (MCU) franchise has grown astronomically, far out-grossing other major film franchises like Star Wars and Harry Potter.

While re-watching parts of the series during paternity leave, I compiled a dataset measuring things like budget, box office sales, and Rotten Tomatoes rating for the 23 movies. Using this data, I created an interactive visual in Tableau allowing comparison of measures across the films in different orders, like release date and chronological order.

You can find the visual on Tableau Public here, and the dataset here.

The first takeaway is that these movies were (and are) big money-makers. You have to admire the way Bob Iger‘s gathered quality intellectual property (e.g., Pixar, Marvel, LucasFilm) under the Disney umbrella via acquisition during his 15-year tenure as CEO, creating a deep catalog of content for the Disney+ streaming service. According to the data I collected from Wikipedia, total gross box office revenue for the MCU franchise is north of $22 Billion.

Second, I’m always interested in comparing critic and audience ratings on Rotten Tomatoes. While ratings were generally in sync for most films in the franchise, there were some notable exceptions. For example, the average rating for Captain Marvel among critics was 79%, compared to 45% for audiences, a 34 point difference! Sadly, there were reports of review bombing with troll comments attacking the film for perceived feminism.

Rotten Tomatoes Audience Score in ranked order

This was a light-hearted project, and a fun way to practice more advanced Tableau techniques like parameters, nested calculated fields, and custom shapes. You can explore the viz for yourself here.

How to scrape IMDb and analyze your favorite TV shows like a true nerd

Like many people, my wife and I relax by watching a show before going to bed at the end of the day. My preference is light-hearted comedy that doesn’t require much brainpower. Not surprisingly, frequent picks include episodes from sitcoms like The Office and Community.

Curious to see how well-liked some of my favorite shows were in their time, I scrapped 818 episode ratings and descriptions from IMDb.com for my top shows: The Office, Parks & Recreation, Modern Family, Community, New Girl, and The Good Place. I used IMDb’s crowd-sourced episode ratings to plot popularity across seasons, and extracted character name counts from episode descriptions to loosely quantify character importance.

Rating Trends

IMDb lists five data elements for each episode: name, release date, average rating, number of votes, and description:

For example, here’s how episode one of season one of *The Office* looks.

Looking for high-level rating trends, I plotted all 800+ episode ratings by release date for all six shows in a single chart, with an overlaid bold line to emphasize the trend.

A few show-specific observations:

The Office: It’s pretty easy to spot the impact of Michael’s (Steve Carell) departure from the show at the end of the seventh season. The final season punches below average until the last three episodes, which audiences appeared to adore (9.1, 9.5, and 9.8, respectively).

Community: Something is obviously off in season four. Wikipedia notes: “The [fourth] season marked the departure of show-runner Dan Harmon and overall received mixed reviews from critics. In the fifth season, Harmon returned as show-runner, and the fourth season was referred to retroactively as ‘the gas-leak year’.”

Modern Family: There’s a clear downward trend in average rating, but the show’s longevity definitely speaks to some kind of loyal fan base.

Character Importance

A dynamic and likeable cast of characters is really a key ingredient to any sitcom; personalities like Schmidt from New Girl, Ron Swanson from Parks & Recreation, or Abed from Community keep audiences coming back for more.

As a proxy for character “importance”, I counted the number of times a character’s name appeared in the IMDb descriptions, divided by the total number of episodes.

Here’s the calculation: The Office has 188 episodes. “Michael” appeared in the episode descriptions 128 times, so his “character importance” is ~68%. The actual value doesn’t matter as much as its relative position compared to other characters.

Community and The Good Place seem to have a fairly balanced character line up. In contrast, Parks & Recreation and The Office have an obvious “main” character (Leslie Knope and Michael Scott, respectively), with a solid cast of supporting personalities.

Keep in mind, this metric is a pretty rough proxy for character importance; a much better measure would be something like percentage of screen time or dialogue.

Code Walkthrough

Let’s start by pulling in the necessary packages.

Next, we’ll create a tibble (tidyverse data frame) containing a list of TV shows to scrape from IMDb.

Sourcing the imdb_id is easy, just search for the show you’re interested in and pull the last component of the URL (e.g. imdb.com/title/tt1442437 for Modern Family).

Next, define a scraper function to extract the key data elements (like episode name, average rating, and description). Here we loop over the list of shows and seasons previously defined.

After some cleaning, the data is ready to visualize. The geom_smooth function powers the overlaid bold line to emphasize the overall trend.

The code above is what produces the ratings trend chart:

Next, I used the str_detect() function from the stringr package to count the number of times a character name appeared in the episode descriptions. For example, Michael, Dwight, and Jim would have been counted once in the description below:

Ready to finalize his deal for a new condo, Michael is away with Dwight while Jim rallies the staff together for office games.
“Office Olympics”, The Office Season 2

As described previously, the “character importance” calculation is as simple as dividing the number of times a character’s name appears in the episode descriptions divided by the total number of episodes in the show.

The code above is what produces the character importance chart:

That’s it. Now you can impress (or bore) your friends and family with data-driven TV sitcom trivia, like a a true nerd. You can find the full code here.

The Rise of Rap: A Genre Popularity Analysis

Today it feels like rap is bigger and more mainstream than ever. A casual scan of the charts reveals that many of today’s biggest music icons are rappers. How long has it been this way? I remember a time when pop legends like Katy Perry, Lady Gaga, and Rihanna ruled the charts.

Looking for more than anecdotal evidence of the rise of rap as a genre in the mainstream music landscape, I developed a data-driven methodology to measure the high-level trend in music genre popularity over time.

Using Billboard’s Hot 100 Artist data, and mapping each artist to a genre using the Spotify API, I calculated what percent of the artists were represented within each genre over time, from 2006 to the present.

Here’s the trend:

Here’s another view with the same data, as a line chart:

It seems like the data supports my observation that rap has gone mainstream, with the percentage of rap artists in the Billboard Hot 100 growing steadily since 2014, and surpassing pop artists in 2018.

According to Rolling Stone, much of rap’s growth can be attributed to its reactivity on streaming services, with 92% of the genre’s total consumption coming from streaming channels. The timeline fits, with streaming giants like Apple launching in 2015, and Spotify hitting ~40 MAU in 2016.

While pop and country have maintained a relatively stable level of popularity, rock appears to be trending down, with rock artists composing less than 5% of the Billboard Hot 100 artist list in 2019.

What does the future hold? As the lines between genres continue to blur, with artists like Post Malone and Lil Nas X cutting across pop, rap, country, and even rock, it stops making sense to box artists into a single genre. In the age of the playlist, it’s easier than ever to rebel against the very idea of genre.

Walkthrough

The first step of the project was scraping the historical list of Hot 100 Artists from Billboard. Using the tidyverse and rvest packages in R, I quickly looped over the 13 years of available data:

Below is a preview of the first few rows of the resulting dataset:

Next, using a Python script and the Spotify API, I looped through each of the artists from the Billboard Hot 100 dataset and collected a list of their corresponding sub-genres. For example, Spotify associates Sean Paul with the sub-genres of dance pop, dance hall, and pop rap.

Here’s a preview of what the resulting data looks like:

The next step took some thought. I needed a way to map back each of the thousands of sub-genres labeled by Spotify into a few core genres, like pop, rap, country, and rock. Tapping into the work of the Every Noise project, which attempts to create “an algorithmically-generated, readability-adjusted scatter-plot of the musical genre-space”, I developed logic to assign a single genre to each of the artists in the original Billboard Hot 100 artist table:

Using this logic, each artist was assigned to a single genre bucket:

The last step was to merge the billboard artist and artist genre tables and calculate the genre percentage breakdown over time, from 2006 to 2019.

This R code produces the chart below that visualizes relative genre popularity over time:

You can find the GitHub repo for this project here.

Building a Scripture Search Tool with R Shiny

Many religions have texts that contain beliefs, ritual practices, or commandments. The Quran is the central religious text of Islam, believed by Muslims to be a revelation from Allah. The Bible is a collection of religious texts sacred to Christians, Jews, and others. Unique to the Latter Day Saint movement is the Book of Mormon.

The study of these texts is a core religious practice of believers. Looking for a way to quickly understand what the scriptures say on a given topic, I developed a simple Shiny app using R as a study tool:

When a user enters a search term (e.g. “faith”, “gospel”, “sacrifice”, etc.) and clicks “Search”, the app returns a summary table and a detail table. The summary table shows the number of verses that contain the search term by book of scripture. The detail table shows the actual text of all the verses containing the search term.

You can find a full-screen version of the web app here.

In the future, I’d like to enhance the app by adding the ability to search for a phrase (e.g. holy ghost), instead of just a single word. I’d also like to add functionality to compare the presence of multiple words and phrases in different volumes of scriptures. For example, comparing the frequency of the appearance of words like “man” and “woman”.

Hopefully this simple scripture search app can be a helpful tool in your own study. You can find the R code for this project here and access the Shiny app directly here.

Web Scraping: NBA Salaries

Inspired in part by Python’s Beautiful Soup, the R package rvest makes it delightfully easy to scrape data from the web. As part of the Tidyverse collection of packages, rvest fits nicely within the broader data workflow:

The Tidyverse data science workflow (source)

In this post I’ll walk through an example of using rvest to compile a dataset of NBA player salaries. To follow along, create a free RStudio Cloud account to write and run R code and bookmark the SelectorGadget tool to help identify HTML/CSS tags.

Background

Professional athletes are paid handsomely for their highly-specialized skill sets, and NBA players are no exception. Take Steph Curry, the highest-paid player during the 2018-2019 season. During that season he brought in a cool $37M in salary.

The NBA is the professional sports league with the
highest player wages worldwide (source, image source)

ESPN publishes annual salary data going back to the 1999-2000 season. Rather than manually copying this data, which is spread across hundreds of web pages, we can write a script to compile the data automatically using rvest. The data can then be used to explore variation in salary over time, by team, and by position.

Page Structure

The first step in any web scraping project is to become familiar with the target site URL structure. In this case, ESPN has a different page for each season. For example, the link below contains the player salaries for the 2018-2019 season:

http://www.espn.com/nba/salaries/_/year/2019

Screenshot of the ESPN NBA Player Salaries page for the 2018-19 season

Within each season, the salaries are spread across several sub-pages, with 40 players listed on each sub-page:

http://www.espn.com/nba/salaries/_/year/2019/page/2

This means our code will need to dynamically determine the number of sub-pages to loop through when scraping the player salaries from each season.

Code Walkthrough

With that background, let’s dive into the code! To get started, we’ll need three packages: rvest, tidyverse, and stringr:

Next, we’ll define a vector of season URLs to loop over:

Next comes the bulk of the code required to perform the scraping. Here we loop over each of the season URLs. By extracting the text from the .page-numbers element on each page, we can dynamically determine the number of sub-pages for each season.

The html_table() command from the rvest package detects and extracts the table of salaries from each sub-page.

The last step is to clean the raw scraping output by adding column names, removing extra rows, and splitting the “name” and “position” fields into two columns, since they were stored as a single column in the ESPN tables, separated by a comma.

The final dataset has 9,456 rows. Below are the first 10:

rank	name	position	team	salary	season
1	Shaquille O'Neal	C	Los Angeles Lakers	17142000	2000
2	Kevin Garnett	PF	Minnesota Timberwolves	16806000	2000
3	Alonzo Mourning	C	Miami Heat	15004000	2000
4	Juwan Howard	PF	Washington Wizards	15000000	2000
5	Scottie Pippen	SF	Portland Trail Blazers	14795000	2000
6	Karl Malone	PF	Utah Jazz	14000000	2000
7	Larry Johnson	F	New York Knicks	11910000	2000
8	Gary Payton	PG	Seattle SuperSonics	11020000	2000
9	Rasheed Wallace	PF	Portland Trail Blazers	10800000	2000
10	Shawn Kemp	C	Cleveland Cavaliers	10780000	2000

Visualization

With the hard part out of the way, let’s quickly explore the trend in player salaries. We’ll visualize the distribution of salary by season:

Notice how the right tail has grown over time. It appears that most of the growth in average player salary is being drive by salaries paid to superstars. This observation is consistent with a similar analysis of NBA salaries done by Dimitrije Curcic.

“The average salary in the NBA has increased more than 7x since the 1990-91 season. Median salary has slower growth than the average; this could suggest that the financial gap between the top talents and the rest is getting larger.”

Here’s the code used to create the chart.

Web scraping is an important tool for data analysis because it enables data collection at scale. Rvest streamlines the process, allowing you to quickly create novel datasets for analysis, like NBA player salaries.

Complete R Script: link
Final Dataset: link

Which programming language should I learn first?

Aspiring programmers and data scientists often ask, “Which programming language should I learn first?” It’s a valid question, since it can take hundreds of hours of practice to become competent with your first programming language. There are a couple of key factors to take into consideration, like how easy the language is to learn, the job market for the language, and the long term prospects for the language.

In this post, we’ll take a data-driven approach to determining which programming languages are the most popular and growing the fastest in order to make an informed recommendation to new entrants to the developer community.

Quantifying Popularity

There are several ways you could measure the popularity or growth of programming languages over time. The PYPL (PopularitY of Programming Language Index) is created by analyzing how often language tutorials are searched on Google; the more a language tutorial is searched, the more popular the language is assumed to be.

Another avenue could be analyzing GitHub metadata. GitHub is the largest code host in the world, with 40 million users and more than 100 million repositories (source). We could quantify the popularity of a programming language by measuring the number of pull requests / push requests / stars / issues over time (example, example).

Finally, the popularity proxy I’ll use is the number of questions posted by programming language on Stack Overflow. Stack Overflow is a question and answer site for programmers. Questions have tags like java and python which makes it easier for people to find and answer questions.

We’ll visualize how programming languages have trended over the last 10 years based on use of their tags on Stack Overflow.

Data Explorer

So, how are we going to source this data? Should we scrape all 18 million questions or start hitting the Stack Exchange API? No! There’s an easier way: Stack Exchange (Stack Overflow’s “parent”) exposes a data explorer to run queries against historical data.

In other words, we can review the Stack Overflow database schema and write a SQL query to extract the data we need. Before writing any SQL, let’s think about how we’d like the query output to be structured. Each row should contain a tag (e.g. java, python), a date (year / month), and count of the number of times a question was posted using that tag:

Year | Month | Tag | Question Count

The SQL query below joins the Posts, Tags, and PostTags tables, counts the number of questions by tag each month, and returns the top 100 tags each month:

Below are the first ten rows returned by the query:

Year	Month	Tag	Count	Rank
2010	1	c#	5116	1
2010	1	java	3728	2
2010	1	php	3442	3
2010	1	javascript	2620	4
2010	1	.net	2340	5
2010	1	jquery	2338	6
2010	1	iphone	2246	7
2010	1	asp.net	2213	8
2010	1	c++	2002	9
2010	1	python	1949	10

Great, now we have the data we need. Next, how should we visualize it to measure programming language popularity over time? Let’s try an animated bar race chart using Flourish. Flourish is an online data studio that helps you visualize and tell stories with data.

In order to get the data into the right format for Flourish visualization, we’ll use R to filter and reshape the data. To smooth the trend, we’ll also calculate a moving average of tag question count.

After uploading the reshaped data to Flourish and formatting the animated bar race chart, we can sit back and watch the programming languages fight it out for the top spot over the last decade:

It’s hard to miss the steady rise of Python, hovering in fourth and five place from 2010 to 2017 before accelerating into first place by late 2018.

Why has Python become so popular? First, it’s more concise and requires less time, effort, and lines of code to perform the same operations as languages like C++ and Java. Python is well-known for its simple programming syntax, code readability and English-like commands. For those reason, not to mention its rich set of libraries and large community, Python is a great place to start for new programmers and data scientists.

The story our animated bar chart tells is validated by the reporting published by Stack Overflow Insights, where we see Python growing steadily over time, measured as a percentage of questions asked on Stack Overflow in a month:

Conclusion

Using question tag data from Stack Overflow, we’ve determined that Python is probably the best programming languages to learn first. We could have saved ourselves some time and done a simple Google search or consulted Reddit to come to the same conclusion, but there’s something satisfying about validating the hype with real data.