With the recent release of Warren Buffett’s much-anticipated annual shareholder letter, I decided to show J.P. Morgan Chase chief Jamie Dimon some love by performing a text analysis on a sample of his annual shareholder letters.
The shareholder letters are hosted on the firm’s Investor Relations page. To avoid parsing PDFs, I analyzed only the letters that are also published as web pages, which covers the years 2014 to 2017.
In this post I’ll analyze Jamie’s thoughts on the firm, the economy, and politics using tidytext principles, including sentiment, term frequency-inverse document frequency, and bigram network visualization.
Getting the Data
To import the letters into R, I used the rvest package and wrote a function to extract the text from the full-copyarea section of the HTML pages. Below is a sample implementation for the 2016 letter:
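A minimal sketch of that kind of extraction, under stated assumptions: the helper name extract_letter_text is hypothetical, full-copyarea is assumed to be a CSS class (it could equally be an id), and a small stand-in HTML string replaces the real letter page so the example runs offline.

```r
library(rvest)

# Hypothetical helper: pull the letter body out of a parsed HTML page.
# ASSUMPTION: "full-copyarea" is a CSS class; use "#full-copyarea" if it is an id.
extract_letter_text <- function(page, selector = ".full-copyarea") {
  nodes <- html_elements(page, selector)
  paste(html_text2(nodes), collapse = "\n")
}

# Stand-in HTML (made up) so the sketch runs without a network connection.
# For a real letter, pass the page URL to read_html() instead.
sample_page <- read_html(paste0(
  "<html><body>",
  "<div class='nav'>Investor Relations</div>",
  "<div class='full-copyarea'>",
  "<p>Dear Fellow Shareholders,</p><p>We had a strong year.</p>",
  "</div>",
  "</body></html>"
))

letter_text <- extract_letter_text(sample_page)
print(letter_text)
```

Only the text inside the selected element is returned, so navigation and footer text elsewhere on the page is left out.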
After combining the text from the four years’ worth of letters, I used the unnest_tokens function from the tidytext package to create a table where each word is a row.
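The tokenizing step can be sketched on a toy table, assuming one row per letter with a text column. The sentences and the letters_raw name are made up for illustration, and the filtering of stop words and numbers is one plausible way to do that cleanup rather than the exact code used here.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the combined letters: one row per letter (text is made up).
letters_raw <- tibble(
  year = c(2016, 2017),
  text = c("The bank earned 24 billion in 2016.",
           "Our business and our clients remain strong.")
)

tidy_words <- letters_raw %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  filter(!grepl("^[0-9.,]+$", word))       # drop purely numeric tokens

print(tidy_words)
```

The year column is carried along with every word, which is what makes the per-letter comparisons later on possible.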
Below are the first seven rows of the table. Notice that I’ve removed common stop words and numbers. With 31,986 total words and 5,049 unique words, we’re now ready for a text analysis!
Before moving into sentiment and other more complex methods of analysis, let’s start by looking at the 10 words Jamie uses most frequently across all four letters.
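Counting the most frequent words is a one-liner once the data is in one-word-per-row form. A small sketch, with a made-up token table standing in for the real one:

```r
library(dplyr)

# Toy token table standing in for the one-word-per-row table (words are made up).
tidy_words <- tibble(
  word = c("bank", "business", "bank", "clients", "bank", "business")
)

top_words <- tidy_words %>%
  count(word, sort = TRUE) %>%   # frequency of each word, largest first
  head(10)                       # keep at most the 10 most frequent

print(top_words)
```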
No surprises here. Terms like bank and banks, business and businesses top the chart. This doesn’t provide much insight, which is why we’ll use methods like term frequency-inverse document frequency, or tf-idf, to measure how important a word actually is in the letters.
Despite all of the hype around machine learning and artificial intelligence, it’s still tricky for a computer to understand writing or speech. For example, how would you write a program to interpret the snippet below from Jamie’s 2017 letter?
“Throughout a period of profound political and economic change around the world, our company has been steadfast in our dedication to the clients, communities and countries we serve while earning a fair return for our shareholders.”
Jamie Dimon, 2017 Shareholder Letter
A simple method to measure the sentiment of a text is to assign a sentiment score to each word within the text. Using the AFINN lexicon (easily accessible through the tidytext package) I can do just that:
A word like “risk” has a negative score, while “excellent” has a positive score. The sentiment of a sentence or paragraph is then the aggregate of individual word scores.
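Here is a rough sketch of that aggregation. To keep it runnable offline, it uses a tiny hand-made lexicon whose scores are illustrative stand-ins, not values taken from AFINN; with the real lexicon you would join against get_sentiments("afinn") instead. The two sentences are also made up.

```r
library(dplyr)
library(tidytext)

# Hand-made lexicon with illustrative scores (an assumption, not real AFINN data).
# The real lexicon is available via get_sentiments("afinn"), score column `value`.
toy_lexicon <- tibble(
  word  = c("risk", "crisis", "excellent", "strong"),
  value = c(-2, -3, 3, 2)
)

# Made-up sentences, one per row.
sentences <- tibble(
  sentence_id = 1:2,
  text = c("We delivered excellent results from a strong franchise.",
           "There is always the risk of another crisis.")
)

sentence_sentiment <- sentences %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%   # keep only words that have a score
  group_by(sentence_id) %>%
  summarise(sentiment = sum(value))          # sentence score = sum of word scores

print(sentence_sentiment)
```

Words that are not in the lexicon simply drop out of the join, so they contribute nothing to the total.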
With that background, I’ll calculate sentiment at a “block” level of 20 sentences to understand how Jamie’s tone changes over the course of his letters.
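One way to form such blocks is integer division on the sentence number. The sketch below assumes the letter has already been split into one sentence per row; the sentences and the two-word lexicon are made up, and the lexicon scores are illustrative rather than real AFINN values.

```r
library(dplyr)
library(tidytext)

block_size <- 20  # sentences per block

# Made-up letter: 40 sentences, the first 20 upbeat and the last 20 cautious.
sentences <- tibble(
  sentence_id = 1:40,
  text = c(rep("Our results were excellent.", 20),
           rep("We must manage risk.", 20))
)

# Illustrative scores only (an assumption, not real AFINN data).
toy_lexicon <- tibble(word = c("excellent", "risk"), value = c(3, -2))

block_sentiment <- sentences %>%
  # Sentences 1-20 fall in block 1, sentences 21-40 in block 2, and so on.
  mutate(block = (sentence_id - 1) %/% block_size + 1) %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(block) %>%
  summarise(sentiment = sum(value))

print(block_sentiment)
```

Plotting sentiment against block then gives a rough picture of how tone moves from the start of a letter to the end.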
In each of the four years, Jamie starts off positive, typically holding off on any “bad news” for several paragraphs. Let’s take a look at examples of sentences with positive and negative sentiment.
In 2014, Jamie wrote the following, with a net sentiment score of 11:
“We are able to do our part in supporting communities and economies around the world because we are strong, stable and permanent.”
Jamie Dimon, 2014 Shareholder Letter, Sentence #155
Words like “strong”, “stable” and “permanent” clearly have a positive sentiment. In contrast, that same year he issued a warning, with a net sentiment score of -15:
“Some things never change — there will be another crisis, and its impact will be felt by the financial markets.”
Jamie Dimon, 2014 Shareholder Letter, Sentence #545
Now that we’ve examined sentiment at a sentence level, let’s visualize the top words contributing to sentiment in the 2017 letter. The contribution to sentiment value is calculated by multiplying the sentiment score of the word by its frequency in the document.
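The contribution calculation is a simple product. A sketch with made-up word counts and illustrative scores (neither comes from the actual letters or from AFINN):

```r
library(dplyr)

# Made-up word counts for a single letter.
word_counts <- tibble(word = c("risk", "strong", "crisis"), n = c(10, 8, 2))

# Illustrative scores only (an assumption, not real AFINN data).
toy_lexicon <- tibble(word = c("risk", "strong", "crisis"), value = c(-2, 2, -3))

contributions <- word_counts %>%
  inner_join(toy_lexicon, by = "word") %>%
  mutate(contribution = value * n) %>%   # score times frequency
  arrange(desc(abs(contribution)))       # biggest movers first, either direction

print(contributions)
```

A mildly negative word used often can outweigh a strongly negative word used rarely, which is why "risk" ranks above "crisis" in this toy example.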
Words like “issues”, “risk” and “crisis” may strike fear into an investor’s heart, but it’s important to remember that this approach to sentiment is far from perfect. For example, in some parts of the letter, Jamie may actually be speaking to the ways the firm successfully manages risk, or is prepared to handle a crisis.
Wikipedia calls tf-idf a “numerical statistic intended to reflect how important a word is to a document in a collection or corpus.” Tf-idf is composed of two terms: term frequency (tf) and inverse document frequency (idf). Term frequency is the number of times a word appears in a document divided by the total number of words in that document. Inverse document frequency is the log of the number of documents in the corpus divided by the number of documents where the specific term appears. Tf-idf is the product of the two. (Source)
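To make the definition concrete, here is a small worked example using bind_tf_idf from tidytext, which uses the natural log for idf. The counts are made up: "eu" appears in only one of two toy documents, while "bank" appears in both.

```r
library(dplyr)
library(tidytext)

# Made-up counts: one row per (document, word) pair.
word_counts <- tibble(
  year = c(2016, 2016, 2017, 2017),
  word = c("eu", "bank", "bank", "tax"),
  n    = c(2, 8, 9, 1)
)

tf_idf <- word_counts %>%
  bind_tf_idf(word, year, n)   # adds tf, idf and tf_idf columns

# Manual check for "eu" in 2016:
#   tf  = 2 / (2 + 8)                              = 0.2
#   idf = ln(2 documents / 1 document with "eu")   = ln(2)
manual_eu <- (2 / 10) * log(2 / 1)

print(tf_idf)
print(manual_eu)
```

Because "bank" appears in every document, its idf is ln(1) = 0, so its tf-idf is zero no matter how often it is used. That is exactly the behavior we want for words that dominate a raw frequency chart.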
Tf-idf is frequently used in search engines and text-based recommender systems. Here, we’ll use it to examine the words that are “rare” within each letter compared to other letters. We’ll start with Jamie’s 2016 letter:
Check out the first word: EU, or European Union. The tf-idf metric is telling me that this term is relatively important and unique to the 2016 letter compared to other years.
2016 was the year of Brexit, when 52% of voters in the UK chose to withdraw from the EU. We can infer that Jamie’s 2016 letter addressed the uncertainty associated with the exit and its potential effect on J.P. Morgan Chase and global markets.
Next, let’s find the top tf-idf scores for bigrams, or pairs of adjacent words. This time, we’ll use the 2017 shareholder letter.
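Bigrams can be produced with the same unnest_tokens function by switching the tokenizer. A sketch on made-up text, assuming that bigrams containing a stop word should be dropped before counting:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up text standing in for a letter.
letters_raw <- tibble(
  year = 2017,
  text = "Business tax reform matters. The business tax rate was too high."
)

bigrams <- letters_raw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # overlapping word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                        # drop pairs with a stop word
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(year, bigram, sort = TRUE)

print(bigrams)
```

The resulting counts have the same shape as the single-word counts, so they can be passed to bind_tf_idf(bigram, year, n) in the same way.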
Again, this metric helps highlight what’s relevant in a given document compared to a group of documents. Here, we see the timely relevance of “business tax”, given the introduction of the Tax Cuts and Jobs Act on November 2, 2017.
Before wrapping up, let’s try to visualize the relationships between words using a network chart. We accomplish this by counting the bigrams, filtering out common combinations, and passing the resulting igraph object to a plotting function.
Each of the black nodes is a word. The size of the connection between nodes represents the frequency of that particular combination.
Some bigrams naturally flow together: balance sheet, european union, federal reserve, etc. Notice the web surrounding the word “financial” and the significant frequency of the bigram “stress test”.
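The graph-building step can be sketched with igraph’s graph_from_data_frame. The bigram counts and the cutoff of 10 below are made up for illustration, and the plotting step is left out because it depends on which plotting package you prefer.

```r
library(dplyr)
library(igraph)

# Made-up bigram counts, already split into two word columns.
bigram_counts <- tibble(
  word1 = c("balance", "stress", "federal", "financial", "financial"),
  word2 = c("sheet", "test", "reserve", "crisis", "system"),
  n     = c(30, 45, 25, 12, 3)
)

bigram_graph <- bigram_counts %>%
  filter(n >= 10) %>%          # hypothetical cutoff: keep frequent pairs only
  graph_from_data_frame()      # first two columns become edges, n an edge attribute

print(vcount(bigram_graph))    # number of distinct words (nodes)
print(ecount(bigram_graph))    # number of bigrams kept (edges)
print(E(bigram_graph)$n)       # frequencies, usable as edge width when plotting
```

Each distinct word becomes a node, and the n attribute on each edge is what a plot would map to the thickness of the connection.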
There are several interesting ways this analysis could be extended. I’d like to compare measures of sentiment to market movements after the release of the annual letter. I’d also like to compare Jamie’s letters to those written by heads of other large banks. But for now, thanks for reading!
You can find the full code written for this project as a Gist here.