Jamie Dimon’s Shareholder Letters: A Text Analysis in R

With the recent release of Warren Buffett’s much-anticipated annual shareholder letter, I decided to show J.P. Morgan Chase chief Jamie Dimon some love by performing a text analysis on a sample of his annual shareholder letters.

Screenshot of 2017 JPMC Annual Report

The shareholder letters are hosted on the firm’s Investor Relations page. To avoid parsing text out of PDFs, I analyzed only the letters that are published as web pages, which covers the years 2014 to 2017.

In this post I’ll analyze Jamie’s thoughts on the firm, the economy, and politics using tidytext principles, including sentiment, term frequency-inverse document frequency, and bigram network visualization.

Getting the Data

To import the letters into R, I used the rvest package and wrote a function to extract the text from the full-copyarea section of the HTML pages. Below is a sample implementation for the 2016 letter:
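
A minimal sketch of what such an extraction function might look like. The function name `extract_letter_text`, the `.full-copyarea` class selector, and the inline sample page are assumptions for illustration; the real page may expose that section as an id rather than a class.

```r
library(rvest)

# Extract the text of a letter from a parsed HTML document.
# `selector` is an assumption: the post only names a "full-copyarea" section,
# so it could be a class (".full-copyarea") or an id ("#full-copyarea").
extract_letter_text <- function(page, selector = ".full-copyarea") {
  nodes <- html_elements(page, selector)   # all nodes matching the selector
  text <- html_text2(nodes)                # text with sensible line breaks
  paste(text, collapse = "\n")
}

# Stand-in for a downloaded letter (invented content).
sample_page <- read_html(
  '<html><body>
     <div class="nav">Menu</div>
     <div class="full-copyarea">
       <p>Dear Fellow Shareholders,</p>
       <p>We had a strong year.</p>
     </div>
   </body></html>'
)

letter_text <- extract_letter_text(sample_page)
```

For a real letter, `sample_page` would be replaced by the result of `read_html()` called on the letter's URL.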

After combining the text from the four years’ worth of letters, I used the unnest_tokens function from the tidytext package to create a table where each word is a row.
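
The tokenization step can be sketched roughly as follows. The `letters_raw` table and its text are made up for illustration, and the regular expression used to drop numbers is an assumption, not necessarily the filter used in the original analysis.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the combined letters: one row per letter (text is invented).
letters_raw <- tibble(
  year = c(2016, 2017),
  text = c("The bank earned 24 billion in 2016.",
           "Our clients and the bank had a strong year.")
)

letter_words <- letters_raw %>%
  unnest_tokens(word, text) %>%             # one lowercase word per row
  anti_join(stop_words, by = "word") %>%    # drop common stop words
  filter(!grepl("^[0-9.,]+$", word))        # drop purely numeric tokens
```

`stop_words` ships with tidytext, so no extra download is needed for this step.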

Below are the first seven rows of the table. Notice that I’ve removed common stop words and numbers. With 31,986 total words and 5,049 unique words, we’re now ready for a text analysis!


Before moving into sentiment and other more complex methods of analysis, let’s start by looking at the top 10 most frequently used words by Jamie overall, across all years.
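
Counting the most frequent words takes only a few lines once the text is in a one-word-per-row table. This is a sketch on an invented `letter_words` table, not the real data:

```r
library(dplyr)

# Toy word table in the one-word-per-row shape (values invented).
letter_words <- tibble(
  year = c(2016, 2016, 2017, 2017, 2017),
  word = c("bank", "clients", "bank", "business", "bank")
)

top_words <- letter_words %>%
  count(word, sort = TRUE) %>%   # frequency across all years, largest first
  head(10)                       # keep at most the top 10
```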

No surprises here. Terms like bank and banks, business and businesses top the chart. This doesn’t provide much insight, which is why we’ll use methods like term frequency-inverse document frequency, or tf-idf, to measure how important a word actually is in the letters.


Despite all of the hype around machine learning and artificial intelligence, it’s still tricky for a computer to understand writing or speech. For example, how would you write a program to interpret the snippet below from Jamie’s 2017 letter?

“Throughout a period of profound political and economic change around the world, our company has been steadfast in our dedication to the clients, communities and countries we serve while earning a fair return for our shareholders.”

Jamie Dimon, 2017 Shareholder Letter

A simple method to measure the sentiment of a text is to assign a sentiment score to each word within the text. Using the AFINN lexicon (easily accessible through the tidytext package) I can do just that:


A word like “risk” has a negative score, while “excellent” has a positive score. The sentiment of a sentence or paragraph is then the aggregate of individual word scores.
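
To make the aggregation concrete, here is a small sketch that scores two sentences. The `toy_lexicon` table mimics the shape of the AFINN table (a `word` column and a numeric `value` column), but its scores are illustrative assumptions, not values copied from AFINN; the real lexicon is available through `get_sentiments("afinn")`.

```r
library(dplyr)
library(tidytext)

# Hand-made lexicon with assumed scores (not the real AFINN values).
toy_lexicon <- tibble(
  word  = c("strong", "stable", "excellent", "risk", "crisis"),
  value = c(2, 2, 3, -2, -3)
)

sentences <- tibble(
  sentence_id = 1:2,
  text = c("We are strong and stable.",
           "There will be another crisis and more risk.")
)

sentence_sentiment <- sentences %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%   # keep only words that have a score
  group_by(sentence_id) %>%
  summarise(sentiment = sum(value))          # sentence score = sum of word scores
```

With these assumed scores, the first sentence nets 4 and the second nets -5.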

With that background, I’ll calculate sentiment at a “block” level of 20 sentences to understand how Jamie’s tone changes over the course of his letters.
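
One way to form such blocks is integer division on the sentence number. The sketch below assumes a block's score is the sum of its sentence scores (the post does not say whether a sum or an average was used), and the `sentence_scores` table is invented.

```r
library(dplyr)

# Hypothetical per-sentence scores for one letter (values invented).
sentence_scores <- tibble(
  year = 2017,
  sentence_id = 1:50,
  sentiment = rep(c(1, -1, 2, 0, 1), times = 10)
)

block_sentiment <- sentence_scores %>%
  mutate(block = (sentence_id - 1) %/% 20) %>%   # sentences 1-20 -> 0, 21-40 -> 1, ...
  group_by(year, block) %>%
  summarise(sentiment = sum(sentiment), .groups = "drop")
```

The last block covers only sentences 41 to 50 here, so a shorter final block is something to keep in mind when comparing block scores.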

In each of the four years, Jamie starts off positive, typically holding off on any “bad news” for several paragraphs. Let’s take a look at examples of sentences with positive and negative sentiment.

J.P. Morgan Chase Annual Reports, 2014 – 2017

In 2014, Jamie wrote the following, with a net sentiment score of 11:

“We are able to do our part in supporting communities and economies around the world because we are strong, stable and permanent.”

Jamie Dimon, 2014 Shareholder Letter, Sentence #155

Words like “strong”, “stable” and “permanent” clearly have a positive sentiment. In contrast, that same year he issued a warning, with a net sentiment score of -15:

“Some things never change — there will be another crisis, and its impact will be felt by the financial markets.”

Jamie Dimon, 2014 Shareholder Letter, Sentence #545

Now that we’ve examined sentiment at a sentence level, let’s visualize the top words contributing to sentiment in the 2017 letter. The contribution to sentiment value is calculated by multiplying the sentiment score of the word by its frequency in the document.
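
The contribution calculation described above can be sketched like this. Both the lexicon scores and the word list are invented for illustration:

```r
library(dplyr)

# Assumed scores, not the real AFINN values.
toy_lexicon <- tibble(
  word  = c("strong", "risk", "crisis"),
  value = c(2, -2, -3)
)

# Invented stand-in for the tokenized 2017 letter.
words_2017 <- tibble(
  word = c("risk", "risk", "risk", "strong", "strong", "crisis", "bank")
)

contributions <- words_2017 %>%
  count(word) %>%                            # frequency of each word
  inner_join(toy_lexicon, by = "word") %>%   # unscored words drop out
  mutate(contribution = n * value) %>%       # score times frequency
  arrange(desc(abs(contribution)))           # largest absolute impact first
```

Sorting by absolute value puts the heaviest positive and negative contributors together at the top, which is convenient for a diverging bar chart.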

Words like “issues”, “risk” and “crisis” may strike fear into an investor’s heart, but it’s important to remember that this approach to sentiment is far from perfect. For example, in some parts of the letter, Jamie may actually be speaking to the ways the firm successfully manages risk, or is prepared to handle a crisis.


Wikipedia calls tf-idf a “numerical statistic intended to reflect how important a word is to a document in a collection or corpus.” Tf-idf is the product of two terms: term frequency (tf) and inverse document frequency (idf). Term frequency is the number of times a word appears in a document divided by the total number of words in that document. Inverse document frequency is the log of the number of documents in the corpus divided by the number of documents in which the term appears. (Source)
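
A small worked example of those definitions, using `bind_tf_idf` from tidytext on invented counts. Each "document" here is one year's letter:

```r
library(dplyr)
library(tidytext)

# Toy word counts per letter (invented numbers).
word_counts <- tibble(
  year = c(2016, 2016, 2017, 2017),
  word = c("eu", "bank", "tax", "bank"),
  n    = c(1, 3, 2, 2)
)

letter_tf_idf <- word_counts %>%
  bind_tf_idf(word, year, n) %>%   # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))
```

In this toy corpus "bank" appears in both letters, so its idf is log(2 / 2) = 0 and its tf-idf is 0 regardless of how often it is used. "eu" appears in only one of the two letters, so its tf-idf is (1 / 4) * log(2 / 1).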

Tf-idf is frequently used in search engines and text-based recommender systems. Here, we’ll use it to examine the words that are “rare” within each letter compared to other letters. We’ll start with Jamie’s 2016 letter:

Check out the first word: EU, or European Union. The tf-idf metric is telling me that this term is relatively important and unique to the 2016 letter compared to other years.

2016 was the year of Brexit, when 52% of voters in the UK chose to withdraw from the EU. We can infer that Jamie’s 2016 letter addressed the uncertainty associated with the exit and its potential effect on J.P. Morgan Chase and global markets.

Next, let’s find the top tf-idf scores for bigrams, or pairs of consecutive words. This time, we’ll use the 2017 shareholder letter.
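
Switching from words to bigrams only changes the tokenization step. A sketch on invented text, using the `token = "ngrams"` option of `unnest_tokens`:

```r
library(dplyr)
library(tidytext)

# Invented stand-in for the letters: one row per year.
letters_raw <- tibble(
  year = c(2016, 2017),
  text = c("the european union vote was close",
           "business tax reform and business tax rates")
)

bigram_counts <- letters_raw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # overlapping word pairs
  count(year, bigram, sort = TRUE)

bigram_tf_idf <- bigram_counts %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))
```

Note that stop words are not removed in this sketch; in practice, bigrams containing a stop word are usually dropped before scoring.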

Again, this metric helps highlight what’s relevant in a given document compared to a group of documents. Here, we see the timely relevance of “business tax”, given the introduction of the Tax Cuts and Jobs Act on November 2, 2017.

Bigram Network

Before wrapping up, let’s try and visualize the relationships between words using a network chart. We accomplish this by counting the bigrams, filtering out common combinations, and passing the resulting igraph object to ggraph.
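
A rough sketch of that pipeline with igraph and ggraph. The bigram counts are invented, and the sketch assumes the filter keeps only bigrams at or above a hypothetical count threshold of 10; the threshold and the plot styling are illustrative choices, not the original settings.

```r
library(dplyr)
library(igraph)
library(ggraph)

# Invented bigram counts, already split into two word columns.
bigram_counts <- tibble(
  word1 = c("stress", "balance", "financial", "financial", "federal"),
  word2 = c("test", "sheet", "crisis", "system", "reserve"),
  n     = c(30, 25, 18, 12, 4)
)

bigram_graph <- bigram_counts %>%
  filter(n >= 10) %>%          # hypothetical threshold
  graph_from_data_frame()      # first two columns become edges, n an edge attribute

set.seed(2018)                 # the "fr" layout is randomized
network_plot <- ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_width = n), alpha = 0.6) +   # wider edge = more frequent
  geom_node_point(colour = "black") +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```

Calling `print(network_plot)` draws the chart.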

Each of the black nodes is a word. The size of the connection between nodes represents the frequency of that particular combination.

Some bigrams naturally flow together: balance sheet, european union, federal reserve, etc. Notice the web surrounding the word “financial” and the significant frequency of the bigram “stress test”.


There are several interesting ways this analysis could be extended. I’d like to compare measures of sentiment to market movements after the release of the annual letter. I’d also like to compare Jamie’s letters to those written by heads of other large banks. But for now, thanks for reading!

You can find the full code written for this project as a Gist here.

Choosing the Right Hospital: Exploratory Analysis in R

With our baby’s due date quickly approaching, my wife and I needed to find a hospital for delivery. Hoping to contribute something meaningful to the decision, I found data published by the state of New York on labor and delivery metrics. By visualizing measures like percentage of cesarean deliveries, I narrowed the list of hospitals within our county.

Despite my belief in “data-driven” decision-making, I understand that in the real world, most decisions are part art, part science, requiring a mix of qualitative and quantitative factors. That being said, in this post, I describe how I leveraged publicly available data to help choose a hospital for my wife’s delivery.

Data Overview

The dataset spans nine years, from 2008 to 2016, with data for 146 hospitals in 52 counties. Four general categories of metrics are present:

  • Anesthesia & Analgesia
  • Characteristics of Labor & Delivery
  • Infant Feeding Method
  • Route & Method

Since I lack the subject matter expertise to understand something like the difference between paracervical and pudendal anesthesia, some of the value of the dataset is lost. Despite the knowledge gaps, I’ll next visualize some of the more straightforward measures of labor and delivery to uncover insights about hospital quality.

Visualization & Analysis

First item of discussion: Where are most babies born in Westchester County?

In 2016, the most babies were born at the White Plains Hospital Center.
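
Answering this kind of question is a filter and a sort. The sketch below uses an invented `births` table; the column names, hospital labels, and numbers are all hypothetical and do not reflect the state's actual file layout.

```r
library(dplyr)

# Hypothetical extract; column names and values are invented.
births <- tibble(
  county       = c("Westchester", "Westchester", "Westchester", "Albany"),
  hospital     = c("Hospital A", "Hospital B", "Hospital C", "Hospital D"),
  year         = c(2016, 2016, 2016, 2016),
  total_births = c(1900, 850, 1200, 2100)
)

westchester_2016 <- births %>%
  filter(county == "Westchester", year == 2016) %>%   # one county, one year
  arrange(desc(total_births))                          # busiest hospital first
```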

Volume may matter. Hospitals that deliver more babies may be exposed to a wider spectrum of complications and be better prepared to treat them. On the other hand, large-scale operations likely produce strict standardized policies and procedures, with little room for customized delivery plans.

How has the volume of births changed over the period covered by the data?

Every hospital seems to be trending flat or down, which may be a reflection of more general demographic trends.

Next up, let’s examine which hospitals work with midwives. This was an important consideration in our decision process.

Pretty clear. Phelps Memorial and Hudson Valley Hospitals are midwife-friendly, with more than 40% of births attended by a midwife.

Is there any relationship between births attended by midwives and other labor outcomes?

It appears that midwife-friendly hospitals enjoy a lower c-section rate, although I’m not implying that one causes the other. It would take more than a scatter plot to tease out the true nature of that relationship.
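
A quick way to put a number on a relationship like this is a correlation coefficient alongside the scatter plot. The `hospital_rates` table below is invented (anonymous hospitals, made-up percentages), so the result only illustrates the mechanics:

```r
library(ggplot2)

# Hypothetical per-hospital rates; all values are invented.
hospital_rates <- data.frame(
  hospital     = paste("Hospital", LETTERS[1:5]),
  pct_midwife  = c(45, 40, 10, 5, 0),
  pct_cesarean = c(22, 25, 35, 38, 41)
)

# Pearson correlation: negative means more midwife births go with fewer c-sections.
midwife_csection_cor <- cor(hospital_rates$pct_midwife, hospital_rates$pct_cesarean)

scatter <- ggplot(hospital_rates, aes(x = pct_midwife, y = pct_cesarean)) +
  geom_point() +
  labs(x = "Births attended by a midwife (%)", y = "Cesarean deliveries (%)")
```

A correlation still says nothing about causation; hospitals that attract midwife-attended births may also serve lower-risk patients.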

Let’s take a closer look at c-section rates by hospital over time.

There was a long stretch of time at Lawrence Hospital where more cesarean sections were performed than vaginal births. Easy red flag!
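
A red flag like this can also be surfaced programmatically by comparing the two delivery counts for each hospital and year. The `deliveries` table is hypothetical, with invented hospitals, columns, and counts:

```r
library(dplyr)

# Hypothetical delivery counts by hospital and year (values invented).
deliveries <- tibble(
  hospital = rep(c("Hospital A", "Hospital B"), each = 3),
  year     = rep(2014:2016, times = 2),
  cesarean = c(300, 320, 310, 520, 540, 480),
  vaginal  = c(700, 690, 720, 480, 470, 500)
)

csection_rates <- deliveries %>%
  mutate(
    pct_cesarean = 100 * cesarean / (cesarean + vaginal),
    more_cesarean_than_vaginal = cesarean > vaginal
  )

# Hospital-years where cesareans outnumber vaginal births.
flagged <- csection_rates %>%
  filter(more_cesarean_than_vaginal)
```

In this toy data only Hospital B in 2014 and 2015 is flagged.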

This simple analysis was informative and eye-opening. With the list significantly narrowed, it’s time to tour the facilities, read reviews, and speak with medical providers to make the final decision.

Here’s a link to the code and data. Thanks for reading!