With the recent birth of our second child, it was time to face a harsh reality: the impending necessity of a minivan. After trying to cope by dreaming up a list of alternative “family” cars, the truth set in: with young kids, features like sliding doors, captain chairs, and amble storage space can’t be beat.
Looking to get acquainted with prices in the used minivan market, I scraped 20 years’ worth of monthly average price data from CarGurus for five minivan models: Kia Sedona, Toyota Sienna, Chrysler Pacifica, Honda Odyssey, and Dodge Grand Caravan. May the best car win!
As one of the most visited car shopping sites in the United States, CarGurus tracks prices for millions of used car listings every year. With a bit of web scraping (using R), I compiled a dataset to visualize how car prices for used minivans have changed over time.
Here’s the result, for minivan models released between 2015 and 2019:
At first glance, my impression is that the Honda Odyssey and Toyota Sienna fall in the “premium” segment of the minivan market (You be the judge: is premium minivan an oxymoron?). On average, prices are higher compared to the Kia Sedona and Dodge Grand Caravan.
Second, I was struck by how steadily deprecation appears to occur for the Honda Odyssey. Roughly speaking, you can expect your Odyssey to depreciate by about $5k a year in the early years of ownership.
Finally, the impact of the COVID-19 pandemic and related semiconductor shortage becomes really clear in this picture. Notice the uptick in average price across the board for almost all make-model and year combinations. Because of the reduced supply of new vehicles (thanks to the semiconductor shortage), would-be buyers of new cars have moved into the used car market, driving up prices.
Bottom line, this visual helped me develop a better feel for the prices we’ll encounter in the used minivan market. You can find the script used to create the dataset here (and below), and the dataset itself here. Thanks for reading!
Like many people, my wife and I relax by watching a show before going to bed at the end of the day. My preference is light-hearted comedy that doesn’t require much brainpower. Not surprisingly, frequent picks include episodes from sitcoms like The Office and Community.
Curious to see how well-liked some of my favorite shows were in their time, I scrapped 818 episode ratings and descriptions from IMDb.com for my top shows: The Office, Parks & Recreation, Modern Family, Community, New Girl, and The Good Place. I used IMDb’s crowd-sourced episode ratings to plot popularity across seasons, and extracted character name counts from episode descriptions to loosely quantify character importance.
IMDb lists five data elements for each episode: name, release date, average rating, number of votes, and description:
Looking for high-level rating trends, I plotted all 800+ episode ratings by release date for all six shows in a single chart, with an overlaid bold line to emphasize the trend.
A few show-specific observations:
The Office: It’s pretty easy to spot the impact of Michael’s (Steve Carell) departure from the show at the end of the seventh season. The final season punches below average until the last three episodes, which audiences appeared to adore (9.1, 9.5, and 9.8, respectively).
Community: Something is obviously off in season four. Wikipedia notes: “The [fourth] season marked the departure of show-runner Dan Harmon and overall received mixed reviews from critics. In the fifth season, Harmon returned as show-runner, and the fourth season was referred to retroactively as ‘the gas-leak year’.”
Modern Family: There’s a clear downward trend in average rating, but the show’s longevity definitely speaks to some kind of loyal fan base.
A dynamic and likeable cast of characters is really a key ingredient to any sitcom; personalities like Schmidt from New Girl, Ron Swanson from Parks & Recreation, or Abed from Community keep audiences coming back for more.
As a proxy for character “importance”, I counted the number of times a character’s name appeared in the IMDb descriptions, divided by the total number of episodes.
Here’s the calculation: The Office has 188 episodes. “Michael” appeared in the episode descriptions 128 times, so his “character importance” is ~68%. The actual value doesn’t matter as much as its relative position compared to other characters.
Community and The Good Place seem to have a fairly balanced character line up. In contrast, Parks & Recreation and The Office have an obvious “main” character (Leslie Knope and Michael Scott, respectively), with a solid cast of supporting personalities.
Keep in mind, this metric is a pretty rough proxy for character importance; a much better measure would be something like percentage of screen time or dialogue.
Let’s start by pulling in the necessary packages.
Next, we’ll create a tibble (tidyverse data frame) containing a list of TV shows to scrape from IMDb.
Sourcing the imdb_id is easy, just search for the show you’re interested in and pull the last component of the URL (e.g. imdb.com/title/tt1442437 for Modern Family).
Next, define a scraper function to extract the key data elements (like episode name, average rating, and description). Here we loop over the list of shows and seasons previously defined.
After some cleaning, the data is ready to visualize. The geom_smooth function powers the overlaid bold line to emphasize the overall trend.
The code above is what produces the ratings trend chart:
Next, I used the str_detect() function from the stringr package to count the number of times a character name appeared in the episode descriptions. For example, Michael, Dwight, and Jim would have been counted once in the description below:
Ready to finalize his deal for a new condo, Michael is away with Dwight while Jim rallies the staff together for office games.
“Office Olympics”, The Office Season 2
As described previously, the “character importance” calculation is as simple as dividing the number of times a character’s name appears in the episode descriptions divided by the total number of episodes in the show.
The code above is what produces the character importance chart:
That’s it. Now you can impress (or bore) your friends and family with data-driven TV sitcom trivia, like a a true nerd. You can find the full code here.
Aspiring programmers and data scientists often ask, “Which programming language should I learn first?” It’s a valid question, since it can take hundreds of hours of practice to become competent with your first programming language. There are a couple of key factors to take into consideration, like how easy the language is to learn, the job market for the language, and the long term prospects for the language.
In this post, we’ll take a data-driven approach to determining which programming languages are the most popular and growing the fastest in order to make an informed recommendation to new entrants to the developer community.
There are several ways you could measure the popularity or growth of programming languages over time. The PYPL (PopularitY of Programming Language Index) is created by analyzing how often language tutorials are searched on Google; the more a language tutorial is searched, the more popular the language is assumed to be.
Another avenue could be analyzing GitHub metadata. GitHub is the largest code host in the world, with 40 million users and more than 100 million repositories (source). We could quantify the popularity of a programming language by measuring the number of pull requests / push requests / stars / issues over time (example, example).
Finally, the popularity proxy I’ll use is the number of questions posted by programming language on Stack Overflow. Stack Overflow is a question and answer site for programmers. Questions have tags like java and python which makes it easier for people to find and answer questions.
We’ll visualize how programming languages have trended over the last 10 years based on use of their tags on Stack Overflow.
So, how are we going to source this data? Should we scrape all 18 million questions or start hitting the Stack Exchange API? No! There’s an easier way: Stack Exchange (Stack Overflow’s “parent”) exposes a data explorer to run queries against historical data.
In other words, we can review the Stack Overflow database schema and write a SQL query to extract the data we need. Before writing any SQL, let’s think about how we’d like the query output to be structured. Each row should contain a tag (e.g. java, python), a date (year / month), and count of the number of times a question was posted using that tag:
Year | Month | Tag | Question Count
The SQL query below joins the Posts, Tags, and PostTags tables, counts the number of questions by tag each month, and returns the top 100 tags each month:
Below are the first ten rows returned by the query:
Great, now we have the data we need. Next, how should we visualize it to measure programming language popularity over time? Let’s try an animated bar race chart using Flourish. Flourish is an online data studio that helps you visualize and tell stories with data.
In order to get the data into the right format for Flourish visualization, we’ll use R to filter and reshape the data. To smooth the trend, we’ll also calculate a moving average of tag question count.
After uploading the reshaped data to Flourish and formatting the animated bar race chart, we can sit back and watch the programming languages fight it out for the top spot over the last decade:
It’s hard to miss the steady rise of Python, hovering in fourth and five place from 2010 to 2017 before accelerating into first place by late 2018.
Why has Python become so popular? First, it’s more concise and requires less time, effort, and lines of code to perform the same operations as languages like C++ and Java. Python is well-known for its simple programming syntax, code readability and English-like commands. For those reason, not to mention its rich set of libraries and large community, Python is a great place to start for new programmers and data scientists.
The story our animated bar chart tells is validated by the reporting published by Stack Overflow Insights, where we see Python growing steadily over time, measured as a percentage of questions asked on Stack Overflow in a month:
Using question tag data from Stack Overflow, we’ve determined that Python is probably the best programming languages to learn first. We could have saved ourselves some time and done a simple Google search or consulted Reddit to come to the same conclusion, but there’s something satisfying about validating the hype with real data.
Sometimes rankings are useful, since they collapse many data points into a single metric, allowing for easy comparison. The problem is when rankings build on subjective methodologies or abstract criteria are taken as absolute truth, rather than a directional guide.
With that disclaimer as backdrop, it’s no surprise that Vault.com surveys professionals to rank the top employers in industries like law, consulting, and banking. The rankings they produce are based on surveys that try to measure things like prestige, culture, satisfaction, work/life balance, training, and compensation.
Obviously, the inputs (“prestige” and “culture”) are inherently abstract and highly subjective, so the output (rankings) is likely to be noisy and subjective as well. That said, I was interested to see how rankings, specifically in banking, had changed over time, so I compiled the Top 50 lists from 2011 to 2020.
The lists are composed of companies across the banking spectrum, from bulge bracket firms like Goldman Sachs and Morgan Stanley to elite boutiques like Centerview and Evercore to middle market banks like Piper Sandler and Raymond James.
Below are the results for the bulge bracket and elite boutique segments, along with a few observations, based on loose categories suggested by mergersandinquisitions.com.
Dominance of GS: Over the ten year period, Goldman only dipped below #1 briefly, in 2012-13.
Decline of JPM: Despite clenching the #1 spot in 2012-13, JPM declined in the following years, landing at #5 in 2020.
Growth of BAML: Starting in #9 in 2011, BAML’s rank steadily improved over time, hovering at #3 in 2020.
I compiled this data manually, but used r and ggplot to clean and filter the data and create the charts. You can find the full repo on Github here.
Import, Define ggplot Theme
Thanks for reading! Feel free to check out my other blog posts or click a tag below to see related blog posts.
Using a data set from the Pew Research Center, this post is about unpacking trends in world religion. The data set contains estimated religious compositions by country from 2010 to 2050.
Sourcing the Data
Made readily available via Github, the file was easy to import into the R environment. Reshaping the data (wide to long format) using the tidyverse “gather” function simplifies plotting down the road.
After reshaping, the data resembles the table below:
Let’s start by visualizing religious composition by region over time.
A few observations:
Asia-Pacific has the least concentrated religious mix, with a “rainbow” assortment of Hindus, Muslims, and Buddhists.
Christianity is on the decline in North American and Europe.
Simultaneously, the percentage of people reporting to be “unaffiliated” with any religion is growing in North America and Europe.
Next, let’s take a look at the least religious countries.
Any patterns of interest?
Most of the least religious countries are in Europe and Asian.
The Czech Republic tops the list with 76% unaffiliated, beating communist North Korea by a full five percentage points.
50%+ of the China, Hong Kong, and Japan population is non-religious.
Lastly, what will change between 2010 and 2050?
For simplicity, I’ve only included differences greater or less than 2%.
Again, we see evidence of a decline in the percentage of Christians globally, although it appears to be most concentrated in Europe and Sub-Saharan Africa.
Meanwhile, a larger portion of the population in places like Europe and Asia-Pacific is expected to be Muslim or non-religious.
This was a good exercise in brainstorming ways to slice a seemingly simple data set in pursuit of insights. You can find the data set for your own analysis here, or find the code that produced the visuals here.
Each morning I make the journey from the suburbs of Westchester County to downtown New York City. In the process, I ride the bus, train, and subway. This post is about quantifying my time spent commuting using IFTTT and R, which will hopefully add some weight to my complaints about the daily grind.
IFTTT is a free web service that “gets all your apps and devices talking to each other.” It allows you to create simple conditional statements to automate everyday tasks. Many of the applets are centered around making your smart home “smarter”, like automatically adjusting the thermostat when you leave home.
Rather than manually log when I leave home and work each day, I automate the tracking using IFTTT. To do so, I set up two “geo-fences“: one for home and one for work. Each time I enter or exit either of those areas, a new row is created in a Google Sheet. After letting this process run in the background for about two months, I have a good sample to work with.
Let’s start by calling the necessary libraries and importing the data. The googlesheets package by Jennifer Bryanmakes makes this easy.
After a quick bit of cleaning, I can calculate commute times by applying some simple logic. IFTTT is triggered every time I leave home or work, like when I grab lunch near the office or run to the grocery store. I only want to measure time when I leave home and then arrive at work or leave work and then arrive home. I check those conditions in a for-loop by comparing the location of event i and i+1.
Now for the fun part. Let’s make a density plot to visualize the distribution of times for both legs of the commute:
Because I catch the same bus every morning, travel times are more predictable, and more tightly centered around 1.1 hours. On the other hand, I rarely leave work at a consistent time. As a result, there’s more variation in how long it takes to get home, with some quick trips just over one hour and others close to two hours! In the future, I hope to leverage the Google Maps API to find the perfect times to leave work to minimize my commute home.
Even though it’s been around for years, I just recently discovered Shark Tank, the show where hopeful entrepreneurs pitch business ideas to a panel of wealthy investors, or “sharks”. I usually wonder if there’s a method to the deal-making madness, especially when a pitch that resonates with me falls flat on the sharks.
In this post, I take my fandom to a deeper level by using episode descriptions from Wikipedia to understand what kinds of pitches have the highest chance of being offered a deal. In the process, I’ll use tools like web scraping, natural language processing, and API calls to gather, transform, enhance, and visual the data.
I’ve divided my workflow for this project into four steps:
Obtain episode-level descriptions via web scraping
Visualize key trends by season and pitch categories
1. Obtain episode-level descriptions via web scraping
This analysis is possible because of a Wikipedia page that contains short descriptions of every pitch delivered on Shark Tank.
The first step is to extract this information via the rvest package in R, looping over each of the nine tables (corresponding to nine seasons) within the page.
Next we’ll do a bit of cleaning, simplifying column naming conventions, and adjusting the data types for the air date and viewership fields.
2. Reshape data from episode-level to pitch-level
In its current form, we won’t be able to detect any patterns with this data since the descriptions are bundled at the episode-level, like this:
“Crooked Jaw” a mixed martial arts clothing line (NO); “Lifebelt” a device that prevents the car from starting without the seat belt being fastened (NO); “A Perfect Pear” a gourmet food business (YES);
We need to “un-nest” the descriptions so that each row contains a single pitch. This is easily accomplished using the unnest function from tidyr.
Now we have a clean dataset, ready to enhance and analyze. Here’s a sample of the data structure, highlighting a few variables:
a pie company
an implantable Bluetooth device requiring surgery to insert the device into the user's head
an electronic hand-held device for waiting rooms
a plastic elephant-shaped device that helps parents give small children oral medicine
a packing and organizing service based on an already successful business called College Hunks Hauling Junk
a mixed martial arts clothing line
a device that prevents the car from starting without the seat belt being fastened
a gourmet food business
a Post-It note arm for laptops
a musical way to teach students Shakespeare
3. Enhance data by categorizing descriptions via API
How can we systematically analyze what kind of pitches are more likely to be offered a deal when all we have is a brief text description? Rather than build my own NLP model from scratch to categorize pitches, I used uClassify, which offers “Classification as a Service” (CAAS).
These functions construct a URL with my personal API key, the classifier API name, and the text (pitch description) to be categorized. A GET call then returns a JSON with a list of categories and “match” scores.
For example, take pitch #803, “Thrive+”, which has this description: “capsules that reduce alcohol’s negative effects.” The category with the highest “match” score was Health, followed closely by Science. By categorizing the pitch descriptions, we’ll be more equipped to uncover some key elements of successful Shark Tank pitches.
4. Visualize key trends by season and pitch categories
Now for the fun part! After compiling, cleaning, and enhancing our dataset, we’re ready to visualize and model the data. First, let’s take a look at Shark Tank’s popularity over time, measured in TV viewership (in millions).
Even without the fitted line, it’s easy to see a rise and fall in popularity, with the peak around 2015 with 7.5 million viewers. Next, let’s look at how willing sharks were to make deals over the course of the show, across nine seasons:
During Season 1, less than 50% of pitches were offered a deal from the sharks. By season 9, deals were made over 65% of the time! I wonder if this had anything to do with sliding viewership.
Let’s dig a bit deeper and start looking at characteristics of successful pitches. Using the tidytext methodology, I determined which words within the pitch descriptions were most often associated with a strong response from the sharks (for better or worse).
Clothing is mentioned in 22 pitch descriptions, 70% of which were unsuccessful! On the flip side, when the pitch included something “portable”, the sharks were willing to make a deal 10 out of 13 times. If you make it onto Shark Tank, don’t mention ice! For whatever reason, almost 90% of those pitches resulted in no deal with the sharks.
Now let’s see what else we can learn by using the categories generated from the uClassify API classifiers:
Here we summarize pitch success by category, with the total number of pitches within the category represented above each bar. The dashed grey line represents the 50% cutoff, where a pitch within a given category is equally likely to be accepted or rejected.
Notice how over 65% of deals classified as “Recreation” were offered a deal by the sharks over the course of the nine seasons. It looks like “Game” entrepreneurs didn’t snag funding quite as easily!
This has been a fun and quick way to explore some of the nuance in the world of Shark Tank deal-making. Truthfully, the dataset we created was pretty limited. Adding in information like which shark (or sharks) made the deal, for how much, and for what percentage of equity would add more precision compared to simply knowing if a deal was made or not.
In addition, access to full pitch transcripts (rather than simplistic descriptions of ~10 words or less) would be much more helpful in accurately classifying the pitches into meaningful categories.
You can find the complete R code here and the final dataset here, both hosted on GitHub. Thanks for reading!
With our baby’s due date quickly approaching, my wife and I needed to find a hospital for delivery. Hoping to contribute something meaningful to the decision, I found data published by the state of New York on labor and delivery metrics. By visualizing measures like percentage of cesarean deliveries, I narrowed the list of hospitals within our county.
Despite my belief in “data-driven” decision-making, I understand that in the real world, most decisions are part art, part science, requiring a mix of qualitative and quantitative factors. That being said, in this post, I describe how I leveraged publicly-available data to help choose a hospital for my wife’s delivery.
The dataset spans a ten-year period, from 2008 to 2016, with data for 146 hospitals in 52 counties. Four general categories of metrics are present:
Anesthesia & Analgesia
Characteristics of Labor & Delivery
Infant Feeding Method
Route & Method
Since I lack the subject matter expertise to understand something like the difference between paracervical and pudendal anesthesia, some of the value of the dataset is lost. Despite the knowledge gaps, I’ll next visualize some of the more straightforward measures of labor and delivery to uncover insights about hospital quality.
Visualization & Analysis
First item of discussion: Where are most babies born in Westchester County?
In 2016, the most babies were born at the White Plains Hospital Center.
Volume may matter. Hospitals who deliver more babies may be exposed a wider spectrum of complications and be prepared to deliver treatment accordingly. On the other hand, large-scale operations likely produce strict standardized policies and procedures, with little room for customized delivery plans.
How has the volume of births change over the 10-year period?
Every hospital seems to be trending flat or down, which may
be a reflection of more general demographic trends.
Next up, let’s examine which hospitals work with midwives. This was an important consideration in our decision process.
Pretty clear. Phelps Memorial and Hudson Valley Hospitals
are midwife friendly, with 40%+ of births attended by a midwife.
Is there any relationship between births attended by midwifes and other labor outcomes?
It appears that mid-wife friendly hospitals enjoy a lower c-section rate, although I’m not implying that one causes the other. It would take more than a scatter plot to tease out the true nature of that relationship.
Let’s take a closer look at c-section rates by hospital over time.
There was a long stretch of time at Lawrence hospital where more cesarean sections were performed than vaginal births. Easy red flag!
This simple analysis was informative and eye-opening. With the list significantly narrowed, it’s time to tour the facilities, read reviews, and speak with medical providers to make the final decision.
Here’s a link to the code and data. Thanks for reading!
StreetEasy, NYC’s leading real estate marketplace, makes some fantastic housing data freely available through its data dashboard. Among the datasets available for download is a monthly breakdown of housing inventory by borough and neighborhood over the last 8 years. In this post I’ll use the gganimate package in R to visualize the ebb and flow of rental housing availability in NYC. If the law of supply and demand holds, this should inform ideal times for apartment hunting.
Rental Inventory Over Time
Let’s first visualize the number of rental units on the market over time by borough. This is a monthly view, from January 2010 – December 2018.
Here we see StreetEasy’s growth as marketplace year over year, together with distinct seasonal variation. Let’s explore seasonality next.
Average Rental Inventory by Month
We’d expect some flavor of seasonality with real estate. In the US it’s estimated that 80% of moves occur between April and September. Let’s see if the same pattern is true in NYC.
Sure enough, we observe a “peak season” with an influx of rental units coming onto the market from May to September, although the trend is strongest in Manhattan.
Rental Inventory in Brooklyn’s Neighborhoods
Finally, let’s visualize how housing availability has fluctuated on StreetEasy’s marketplace in each of Brooklyn’s neighborhoods over time.
Using face_wrap from ggplot2, we can easily observe the trend in each neighborhood simultaneously.
Conclusion & Appendix
Kudos to StreetEasy for making this dataset open to the public. There’s certainly more to explore and analyze in their data dashboard. Also, I find gganimate a really useful addition to any data storyteller’s toolkit, and I hope to find more opportunities to leverage this package in the future. Thanks for reading!
R Script: Link Data Source: Link [Rental Inventory]