By Matt Denny, PhD Candidate in Political Science and Social Data Analytics, NSF Big Data Social Science IGERT Fellow at Penn State University, and course instructor on Practical Data Management with R for Social Scientists.
If you are just getting started with R, and coming to it from a background using statistical analysis software like Excel, SAS, Stata, or SPSS, then one of the first things you will have to get used to is the concept of a data structure. In all of the aforementioned software, you read in your data as a spreadsheet and then only operate on that one spreadsheet (with some exceptions). In R you can represent data in many more formats than just a spreadsheet, and you can hold all of these objects in memory at the same time. This is a very powerful concept, and one that allows R to perform many data management tasks that would simply be impossible in the programs named above. Here I will provide a brief conceptual overview of five of the most commonly used data structures in R.
So what are data structures?
Data structures are essentially containers for storing individual values (like the number 12.4 or the sentence “I love cats!”) in different configurations. They can also hold other data structures, which lets them represent complex datasets (from Excel spreadsheets to network data, and even geospatial data). A single data structure can even store thousands or millions of separate datasets and allow for seamless access to each one from R.
Which ones do I need to know about?
There are five basic data structures in R that you’ll want to get acquainted with pretty quickly. Here is a quick summary of the basic ones and their uses:
- Individual values and vectors
Individual values and vectors allow us to represent a number, string, or Boolean (TRUE/FALSE) by a variable name (e.g., my_number) or even a whole bunch of numbers/strings/Booleans as a vector. So we can essentially give a name to a value (a number like 12.4, for example) and then use the name to get access to the number later, even if it changes (to 26.2, for example). These data structures can only store one type of data at a time (e.g., strings, numbers, etc.) but are very useful for storing temporary data as part of a larger data management task. Understanding how these work will be a good stepping stone to understanding more complex data structures.
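As a rough illustration of the ideas above, here is a minimal sketch in R. Apart from my_number, the variable names (my_numbers, my_strings, second_number, mixed) are made up for this example.

```r
# Give a name to a single value, then change it later
my_number <- 12.4
my_number <- 26.2

# A vector holds a whole bunch of values of the same type
my_numbers <- c(12.4, 26.2, 3)
my_strings <- c("I love cats!", "I love dogs!")

# Get one element back by its position (R starts counting at 1)
second_number <- my_numbers[2]

# Mixing types in one vector converts everything to a single type
mixed <- c(1, "two", TRUE)
class(mixed)  # "character"
```

Note that R does not raise an error when types are mixed; it quietly converts the values to a common type, which is worth keeping in mind when a column of numbers unexpectedly turns into strings.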
- Matrices
Matrices add a second dimension to vectors so that they can have both rows and columns and be indexed in two directions. Matrices are very good at representing social network and spatial data. For example, rows could indicate y-coordinates, and columns could represent x-coordinates.
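To make the network example a little more concrete, here is a small sketch of a matrix that records ties among three people. The names (ties, Ann, Bob, Cat) and the values are invented purely for illustration.

```r
# A 3 x 3 matrix for a hypothetical network of three people:
# a 1 in row i, column j means person i sends a tie to person j
people <- c("Ann", "Bob", "Cat")
ties <- matrix(
  c(0, 1, 0,
    1, 0, 1,
    0, 0, 0),
  nrow = 3, ncol = 3, byrow = TRUE,
  dimnames = list(people, people)
)

# Index in two directions: [row, column]
bob_to_cat <- ties[2, 3]

# Leave the column blank to get a whole row (all of Ann's outgoing ties)
ann_ties <- ties["Ann", ]

# Number of rows and columns
dim(ties)
```

Every cell here is a number, which fits the rule that a matrix holds only one type of value.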
- Data frames
Data frames are like the big sister of matrices. These are probably the most common data structure you will be working with in R, and are essentially equivalent to Excel spreadsheets. Whereas matrices can only hold one type of value in the entire matrix, data frames can hold different types of data (strings and numbers, most often) in different columns.
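Here is a short sketch of a data frame with a different type of data in each column. The dataset (survey) and its column names are hypothetical and exist only for this example.

```r
# Each column is a vector of one type; different columns can differ in type
survey <- data.frame(
  name = c("Ann", "Bob", "Cat"),   # strings
  age = c(34, 27, 45),             # numbers
  voted = c(TRUE, FALSE, TRUE),    # Booleans
  stringsAsFactors = FALSE         # keep strings as strings in older R versions
)

# Pull out one column by name
ages <- survey$age

# Keep only the rows where voted is TRUE
voters <- survey[survey$voted, ]

# Summarize a column
average_age <- mean(survey$age)
```

The row-and-column indexing works the same way as for matrices, with the added convenience of referring to columns by name using `$`.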
- Lists
Lists are the most flexible data structure and are incredibly useful. Lists can hold other lists, as well as collections of vectors, matrices, and data frames in any combination. They are sort of like “super vectors” and are effective for managing multiple datasets at once.
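A list that mixes several kinds of objects might look something like the sketch below. The list (project) and its element names are assumptions made up for illustration.

```r
# One list holding a string, a vector, a matrix, a data frame, and another list
project <- list(
  label = "pilot study",
  scores = c(12.4, 26.2),
  grid = matrix(1:4, nrow = 2),
  respondents = data.frame(id = 1:2, voted = c(TRUE, FALSE)),
  nested = list(note = "lists can hold lists")
)

# Get elements back by name with $ or [[ ]]
scores <- project$scores
grid_value <- project[["grid"]][1, 2]   # row 1, column 2 of the stored matrix
note <- project$nested$note             # reach into the list inside the list

# Number of top-level elements
length(project)
```

Because each element keeps its own structure, the same pattern can be scaled up to a list of many data frames, one per dataset.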
There are of course more complex formats too, but for now values and vectors, matrices, data frames, and lists are a good place to start learning about data structures. Practical Data Management with R for Social Scientists goes into more detail if you’re keen to learn more.
Want to hear more from Matt Denny? Read his blog post A Bitesize Intro to... Web Scraping