How to scrape IMDb and analyze your favorite TV shows like a true nerd

Like many people, my wife and I relax by watching a show before going to bed at the end of the day. My preference is light-hearted comedy that doesn’t require much brainpower. Not surprisingly, frequent picks include episodes from sitcoms like The Office and Community.

Curious to see how well-liked some of my favorite shows were in their time, I scrapped 818 episode ratings and descriptions from IMDb.com for my top shows: The Office, Parks & Recreation, Modern Family, Community, New Girl, and The Good Place. I used IMDb’s crowd-sourced episode ratings to plot popularity across seasons, and extracted character name counts from episode descriptions to loosely quantify character importance.

Rating Trends

IMDb lists five data elements for each episode: name, release date, average rating, number of votes, and description:

For example, here’s how episode one of season one of The Office looks.

Looking for high-level rating trends, I plotted all 800+ episode ratings by release date for all six shows in a single chart, with an overlaid bold line to emphasize the trend.

A few show-specific observations:

The Office: It’s pretty easy to spot the impact of Michael’s (Steve Carell) departure from the show at the end of the seventh season. The final season punches below average until the last three episodes, which audiences appeared to adore (9.1, 9.5, and 9.8, respectively).

Community: Something is obviously off in season four. Wikipedia notes: “The [fourth] season marked the departure of show-runner Dan Harmon and overall received mixed reviews from critics. In the fifth season, Harmon returned as show-runner, and the fourth season was referred to retroactively as ‘the gas-leak year’.”

Modern Family: There’s a clear downward trend in average rating, but the show’s longevity definitely speaks to some kind of loyal fan base.

Character Importance

A dynamic and likeable cast of characters is really a key ingredient to any sitcom; personalities like Schmidt from New Girl, Ron Swanson from Parks & Recreation, or Abed from Community keep audiences coming back for more.

As a proxy for character “importance”, I counted the number of times a character’s name appeared in the IMDb descriptions, divided by the total number of episodes.

Here’s the calculation: The Office has 188 episodes. “Michael” appeared in the episode descriptions 128 times, so his “character importance” is ~68%. The actual value doesn’t matter as much as its relative position compared to other characters.

Community and The Good Place seem to have a fairly balanced character line up. In contrast, Parks & Recreation and The Office have an obvious “main” character (Leslie Knope and Michael Scott, respectively), with a solid cast of supporting personalities.

Keep in mind, this metric is a pretty rough proxy for character importance; a much better measure would be something like percentage of screen time or dialogue.

Code Walkthrough

Let’s start by pulling in the necessary packages.

Next, we’ll create a tibble (tidyverse data frame) containing a list of TV shows to scrape from IMDb.

Sourcing the imdb_id is easy, just search for the show you’re interested in and pull the last component of the URL (e.g. imdb.com/title/tt1442437 for Modern Family).

Next, define a scraper function to extract the key data elements (like episode name, average rating, and description). Here we loop over the list of shows and seasons previously defined.

After some cleaning, the data is ready to visualize. The geom_smooth function powers the overlaid bold line to emphasize the overall trend.

The code above is what produces the ratings trend chart:

Next, I used the str_detect() function from the stringr package to count the number of times a character name appeared in the episode descriptions. For example, Michael, Dwight, and Jim would have been counted once in the description below:

Ready to finalize his deal for a new condo, Michael is away with Dwight while Jim rallies the staff together for office games.

“Office Olympics”, The Office Season 2

As described previously, the “character importance” calculation is as simple as dividing the number of times a character’s name appears in the episode descriptions divided by the total number of episodes in the show.

The code above is what produces the character importance chart:

That’s it. Now you can impress (or bore) your friends and family with data-driven TV sitcom trivia, like a a true nerd. You can find the full code here.

Web Photo Archiving with R

My wife and two of her sisters ran cross-country and track in high school. I recently learned that their team website, which hosts thousands of event photos from the past 10 years, is being shut down. Wanting to save my mother-in-law from the unimaginably tedious task of manually downloading each image, I wrote a script in R to automate the process. 

The website has a page for each season with links to event photo albums. For example, in the 2012 season, there are 81 photos albums and 10,000+ photos. 

Each photo album contains somewhere between 80 and 150 photos. I needed to design the script to loop through and download each photo from each photo album.

In other words, I needed a way to pass a URL like the one below into the “file.download” function to save an image to my computer.

old.runtwolf.com/CC2012/Camp1/images/img_0973.jpg

Code Walkthrough

Let’s start by calling the two necessary packages: rvest and dplyr. These both form part of tidyverse, a collection of packages created by Hadley Wickham that share a common design philosophy. 

After downloading the season overview page with the list of photo albums, I used html_nodes and grepexpr to extract and clean the list of album names to form a list of album URLs. 

Finally, I looped through each photo album, replicating the folder structure locally, and downloading each of the .JPEG files.

After all was said and done, I had downloaded 100,005 images from 759 photo albums across 9 XC seasons.

The final step was the upload the images to the cloud for easy sharing and storage. Luckily, the googledrive package allowed me to upload the images via a script rather than manual bulk upload.

Assuming each image would have taken 20 seconds to download, label, and upload, the manual process would have taken ~500 hours, non-stop! Writing the scripts and monitoring the download and upload process took about 8 hours, for a net time saved of ~492 hours.  

You can find the complete code here and archived photos here. 

Thank you to Jen Fitzgarrald for capturing so many wonderful images over the past decade. 

The Hunt for Housing in NYC: A Data-Driven Approach

This summer my wife and I relocated to New York City in preparation for the start of my new job. Housing in Manhattan and the surrounding boroughs is notoriously expensive, so I decided to pursue a data-driven approach to our apartment search. I wrote a Python script to scrape 9,000+ apartment listings on Craigslist for zip codes in the five boroughs: Manhattan, Bronx, Brooklyn, Queens, and Staten Island. I then visualized the median rent by zip code in Tablaeu. Check out the dashboard here!

Gathering the Data

Before digging into some housing insights, let’s walk through the process used to obtain the data. First, I obtained data about the organization of New York City’s boroughs, neighborhoods, and zip codes from a New York State Department of Health website. I then leveraged the structure of Craigslists’s URLs to construct a vector of links to search for apartments in each of the zip codes. Here’s what the URL to search for apartments with the zip code 10453 looks like:

https://newyork.craigslist.org/search/aap?postal=10453

Let’s see what that looks like in code.

The ‘nyc-zip-codes.csv’ file referenced above can be found here. Next, I wrote a function to extract the pertinent information from each listing from each of these links. I extracted the listing title, posting date, monthly rent, and the number of bedrooms, when available.

This is what the function returns when fed the sample link for zip code 10453.

At this point, we just need a way to loop through each zip code and compile the data the function returns.

After cleaning the data and removing duplicates, we have about 9,400 listings to work with.

Analyzing the Data

Let’s start with the big picture and then zoom in. Below we have the median rental price of listings by borough. Manhattan is by far the most expensive place to live, followed in distant second by Brooklyn. Queens, Staten Island, and the Bronx are actually somewhat comparable, with median rent in Queens only $250 higher than median rent in the Bronx.

How does rent vary in the five boroughs by the number of bedrooms the unit has? Filtering the data to include only units with 1 to 4 bedrooms, Manhattan is still the most expensive for each number of bedrooms.


Note that the bracketed, italicized numbers above show the number of listings for each borough and bedroom combination.

My wife and I had hoped to find a 2-bedroom apartment in a safe neighborhood with a 30-minute commute to Midtown for $2,000 or less. But, as you can see in the image below depicting median 2-bedroom rent by zip code in Queens, that may be a tough find!

Now, what else would I have liked to add to this analysis? Since one major consideration in the hunt for housing is commute time, how about a distance-adjusted median rental price metric for each zip code? This is something I’ll tackle in a future post.

Conclusion

Ultimately, my wife and I found housing in Scarsdale through a family friend and didn’t end up living in any of the five boroughs! Luckily, by feeding the script a different set of zip codes and modifying the Craigslist URL structure, I’ll be able to replicate this data-driven process in future apartment searches.

Find the complete code here, hosted as a Gist on GitHub.

Check out my other data projects here.