If you conduct social science research using Stata, SAS, or SPSS, you might be looking to learn some of the newer tools on the block.
R and Python are two popular programming languages among data analysts and data scientists, and they provide many more features than the aforementioned statistical software packages. Although you could learn both, that would require a significant time investment, especially if you have never coded before. So which should you start with? And which one is best for social scientists?
The short answer is that learning either R or Python will give you lots of room to grow as a data analyst. They are similar enough that you can transfer much of what you learn in one language to the other. Both languages are free and open source, and both were developed in the early 90s: R for statistical analysis, and Python as a general-purpose programming language. One could argue that Python has better libraries for data collection and a wider range of data structures, while R has better statistical and graphics libraries, but the best language for you will often depend on what your community uses.
Now, here’s the long answer!
Python: Many people who are new to coding find Python the easier language to learn. It’s also designed as more of a general-purpose language than R, meaning that it is better suited to interfacing with the rest of your operating and file systems, and it has more in common with other general-purpose programming languages like C++ or Java. As social scientists acquire new sources of data and new ways to analyze them, the more literate you are in general programming, the more prepared you will be to use tools from other disciplines.
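To give a rough sense of what "interfacing with your file system" can look like, here is a minimal sketch that uses only Python's standard library. It assumes a folder of CSV files, for example one file per survey wave, that all share a numeric column. The function name `mean_by_file` and the folder and column names in the usage note are hypothetical and chosen purely for illustration.

```python
import csv
from pathlib import Path
from statistics import mean


def mean_by_file(folder, column):
    """Return {file name: mean of `column`} for each CSV file in `folder`.

    Hypothetical helper for illustration; rows where the column is
    empty are skipped rather than treated as zero.
    """
    results = {}
    # Path.glob finds every file in the folder whose name ends in .csv
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader uses the first row of the file as column names
            values = [
                float(row[column])
                for row in csv.DictReader(handle)
                if row[column]
            ]
        if values:
            results[path.name] = mean(values)
    return results
```

Calling something like `mean_by_file("survey_waves", "age")` would then return one average per file. In practice, many analysts would reach for a data analysis library for this kind of task, but the sketch shows how little code it takes to move between files on disk and a simple summary.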
Python tends to be more widely used by computer scientists than R, so many machine learning libraries are better supported in Python than in R. For example, if you are particularly interested in getting into deep learning, Python is the better choice. While many of these libraries are being ported to R, the online documentation still tends to be better for Python users. Python also tends to run faster than R at scale, although both are relatively slow in the grand scheme of programming languages. Both R and Python can be linked up to C or C++ for a performance boost, but the average performance of built-in Python functions tends to be a bit better.
R: R is an open source statistical computing language that has traditionally been used by academics and researchers, which means it tends to have more packages for statistical analysis. Because of its easy-to-use, centralized package dissemination system, most new statistical techniques also tend to get implemented in R before other languages. In particular, if you are interested in network analysis, R is the clear choice over Python due to the active R developer community in this area.
Some disciplines have a strong bias toward using R scripts in replication materials, so knowing R is a great way to collaborate with your peers. Because of its open source nature, the latest techniques get released quickly and there are lots of support communities on the internet.
R is also great for data visualization. While the base R graphics package is comprehensive and powerful, additional libraries such as ggplot2 and lattice make R the go-to language for powerful data visualization approaches.
Which Language Should I Start With? While each language has its strengths, in all honesty, the differences between R and Python are starting to break down. Most of the common tasks once associated with one language or the other are now doable in both. If you still aren’t sure which of the two suits your needs best, note that you really can’t go wrong, since both can do most of the same things.
The best guidance in deciding which language to focus on is to look at what peers in your field are working with, and follow their lead. If your department is more familiar with one language, learning that one could save you a lot of effort. You can always pick up the other one later. It is a good idea to ask colleagues who already work in data science and advanced analytics which language they predominantly use.