Aspiring programmers and data scientists often ask, “Which programming language should I learn first?” It’s a valid question, since it can take hundreds of hours of practice to become competent with your first programming language. There are a couple of key factors to take into consideration, like how easy the language is to learn, the job market for the language, and the long term prospects for the language.
In this post, we’ll take a data-driven approach to determining which programming languages are the most popular and growing the fastest in order to make an informed recommendation to new entrants to the developer community.
There are several ways you could measure the popularity or growth of programming languages over time. The PYPL (PopularitY of Programming Language Index) is created by analyzing how often language tutorials are searched on Google; the more a language tutorial is searched, the more popular the language is assumed to be.
Another avenue could be analyzing GitHub metadata. GitHub is the largest code host in the world, with 40 million users and more than 100 million repositories (source). We could quantify the popularity of a programming language by measuring the number of pull requests / push requests / stars / issues over time (example, example).
Finally, the popularity proxy I’ll use is the number of questions posted by programming language on Stack Overflow. Stack Overflow is a question and answer site for programmers. Questions have tags like
python which makes it easier for people to find and answer questions.
We’ll visualize how programming languages have trended over the last 10 years based on use of their tags on Stack Overflow.
So, how are we going to source this data? Should we scrape all 18 million questions or start hitting the Stack Exchange API? No! There’s an easier way: Stack Exchange (Stack Overflow’s “parent”) exposes a data explorer to run queries against historical data.
In other words, we can review the Stack Overflow database schema and write a SQL query to extract the data we need. Before writing any SQL, let’s think about how we’d like the query output to be structured. Each row should contain a tag (e.g.
python), a date (year / month), and count of the number of times a question was posted using that tag:
Year | Month | Tag | Question Count
The SQL query below joins the
PostTags tables, counts the number of questions by tag each month, and returns the top 100 tags each month:
Below are the first ten rows returned by the query:
Great, now we have the data we need. Next, how should we visualize it to measure programming language popularity over time? Let’s try an animated bar race chart using Flourish. Flourish is an online data studio that helps you visualize and tell stories with data.
In order to get the data into the right format for Flourish visualization, we’ll use R to filter and reshape the data. To smooth the trend, we’ll also calculate a moving average of tag question count.
After uploading the reshaped data to Flourish and formatting the animated bar race chart, we can sit back and watch the programming languages fight it out for the top spot over the last decade:
It’s hard to miss the steady rise of Python, hovering in fourth and five place from 2010 to 2017 before accelerating into first place by late 2018.
Why has Python become so popular? First, it’s more concise and requires less time, effort, and lines of code to perform the same operations as languages like C++ and Java. Python is well-known for its simple programming syntax, code readability and English-like commands. For those reason, not to mention its rich set of libraries and large community, Python is a great place to start for new programmers and data scientists.
The story our animated bar chart tells is validated by the reporting published by Stack Overflow Insights, where we see Python growing steadily over time, measured as a percentage of questions asked on Stack Overflow in a month:
Using question tag data from Stack Overflow, we’ve determined that Python is probably the best programming languages to learn first. We could have saved ourselves some time and done a simple Google search or consulted Reddit to come to the same conclusion, but there’s something satisfying about validating the hype with real data.