Introduction to Applied Data Science Methods for Social Scientists
Next course runs from [[date]]
In this course we demystify the tools and methods of an emerging field that is changing the way we collect, process, and analyze information.
The course is divided into twelve modules. After explaining the relationship between data science and the social sciences, we discuss the ethics of data science. We then dive into the applied methods and code with data science tools, looking at:
surveys and crowdsourcing
collecting data from the web
cleaning and preprocessing data
computational text analysis
The course concludes by looking forward to new ways to collect, analyze, and process data, along with new ways to make research more reproducible.
This course makes extensive use of interactive programming in Jupyter notebooks, part of the larger Project Jupyter. The notebooks integrate code with explanatory markdown text, so you can read the narrative of a programming task and write code of your own to fit into it. To bootstrap your learning, we provide the programming environment for you, in the cloud! We use a JupyterHub to host the materials, and you can program directly in your web browser from any device.
For a bulk order of 5 or more learners on any of our courses, you can claim a 50% discount. Contact us for more information.
This course is divided into 12 modules. The first 2 modules will orientate you in the new big data landscape, and the remaining modules teach the practical application of data science methods to the social sciences using Python and R. The second part of the course is less sequential than the first, but we build from and return to the same examples introduced earlier.
There is great flexibility in taking this course, but we recommend you take the course in order and not break up a single module over multiple sittings.
All programming exercises are provided in both R and Python, so you can choose your path, or try your hand at both!
You have 3 months' access to this course. During the first 4 weeks of your course you will receive learning support from an SME (subject matter expert). We recommend working through as many modules as possible in these initial 4 weeks so that you can make the most of the SME’s expertise. They will be on hand to answer any questions, or help you if you get stuck.
After the learning support period, you’ll still have access to the course materials but you won’t receive assistance from the instructor. SAGE Campus will help you with any IT or platform issues you might have throughout the course.
This module will introduce you to the objectives of the course via a visual overview of each module. It will discuss how data science is changing social science and statistics, and will cover reliability, generalizability, and reproducibility.
This module will teach you about the shortcomings and problems of data science with respect to the groups of people it affects and represents, and how to responsibly acknowledge these issues in research. Beyond problems in data collection and data sources are issues of privacy, sampling, population size, interpretation, and application. This module places particular emphasis on deidentification, reidentification, and data security.
This module will introduce you to the data science tools commonly used in social science research. We will discuss the value of open-source programming languages, specifically R and Python, for research of this nature and weigh the advantages and disadvantages of each. You can share your experience of each language in the forum.
This module will also introduce the Jupyter environment. Demonstrations will be done in Jupyter notebooks, which provide user-friendly environments for executing and sharing code written in these two programming languages, and which integrate code with markdown text that explains it. This module will conclude with a brief overview of Git and GitHub, as they have become essential for collaborative research programming projects.
This module will teach you what formats data comes in, and how we should structure our own data if we collect it ourselves. This module has three core lessons discussing delimiter-separated values formats, specifically the CSV and TSV, tree formats such as XML, and key-value pair data, specifically JSON. You will learn how to manipulate this data in Python or R.
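As a rough illustration of the three format families named above, here is a minimal Python sketch that reads the same two records from CSV, TSV, XML, and JSON using only the standard library. The sample records, field names, and helper functions are hypothetical and are not taken from the course notebooks.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two survey responses in four formats.
csv_text = "id,age,city\n1,34,Leeds\n2,27,Austin\n"
tsv_text = "id\tage\tcity\n1\t34\tLeeds\n2\t27\tAustin\n"
xml_text = (
    "<responses>"
    "<response id='1'><age>34</age><city>Leeds</city></response>"
    "<response id='2'><age>27</age><city>Austin</city></response>"
    "</responses>"
)
json_text = (
    '[{"id": "1", "age": "34", "city": "Leeds"},'
    ' {"id": "2", "age": "27", "city": "Austin"}]'
)


def read_delimited(text, delimiter):
    """Parse delimiter-separated text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def read_xml(text):
    """Flatten each <response> element of the tree into a dict."""
    root = ET.fromstring(text)
    return [
        {
            "id": node.get("id"),          # stored as an attribute
            "age": node.findtext("age"),   # stored as child elements
            "city": node.findtext("city"),
        }
        for node in root.findall("response")
    ]


rows_csv = read_delimited(csv_text, ",")
rows_tsv = read_delimited(tsv_text, "\t")
rows_xml = read_xml(xml_text)
rows_json = json.loads(json_text)  # key-value pairs map directly to dicts
```

All four readers end up with the same list of dictionaries, which is one reason a list of records is a convenient common structure to convert into before analysis.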
This module will give an overview of how to construct a survey and crowdsource responses. Specifically, you will be introduced to Qualtrics and learn about other freely available tools for building surveys. We will discuss Amazon’s Mechanical Turk, which is becoming the new norm for online data collection in the social sciences.
This module will teach you how to extract data from web resources appropriate to your research question. You will learn what an API is, its main use cases, why it is a valuable source of data, and how it differs from web scraping. Special attention will be given to how to obtain permission from hosts, and proper etiquette when using APIs and scraping.
Through video and programming exercises, this module will also introduce you to common tools used by data scientists to collect data from the web. The activities for this module will be completed in a Jupyter notebook using Python or R. You will construct your own API requests, scrape a website, and write the collected data into a CSV.
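The three activities (building an API request, scraping a page, writing a CSV) might look roughly like the Python sketch below. It is an offline illustration, not the course's own code: the endpoint URL, query parameters, JSON response, and HTML page are all hypothetical stand-ins, and no network request is made.

```python
import csv
import io
import json
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.org/v1/articles"


def build_request_url(base_url, **params):
    """Attach URL-encoded query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())


# Canned stand-ins for what a live API call and page download would return.
api_response = '{"results": [{"title": "Turnout in 2020", "year": 2020}]}'
page_html = (
    "<html><body><h2>Turnout in 2016</h2><p>text</p>"
    "<h2>Turnout in 2012</h2></body></html>"
)

# API route: structured JSON, so fields are read by name.
url = build_request_url(BASE_URL, topic="turnout", limit=10)
api_titles = [item["title"] for item in json.loads(api_response)["results"]]

# Scraping route: the data has to be pulled out of the page markup.
parser = HeadlineParser()
parser.feed(page_html)

# Write both sources to CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source", "title"])
for title in api_titles:
    writer.writerow(["api", title])
for title in parser.headlines:
    writer.writerow(["scrape", title])
csv_output = buffer.getvalue()
```

In a real project the canned strings would be replaced by responses fetched from a host that permits it, at a request rate that respects the host's terms.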
The efficiency of programming in R or Python means that running complex models often involves only minimal coding. However, effort must first be spent preparing the data into a format that the modelling functions can recognize. This module will introduce you to the standard conventions for doing so: basic subsetting, missing value handling, imputation, data type and structure conversion, data de-duplication, and regular expressions.
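To make these preparation steps concrete, here is a small Python sketch that applies de-duplication, type conversion, missing value handling, mean imputation, a regular expression, and subsetting to a handful of records. The records, the set of missing-value markers, and the choice of mean imputation are assumptions made for the example; the course may use different data and libraries.

```python
import re
from statistics import mean

# Hypothetical raw records showing typical problems.
raw = [
    {"id": "1", "age": "34", "phone": "(555) 010-1234"},
    {"id": "2", "age": "", "phone": "555.010.9876"},
    {"id": "2", "age": "", "phone": "555.010.9876"},  # exact duplicate
    {"id": "3", "age": "n/a", "phone": "555 010 4567"},
    {"id": "4", "age": "52", "phone": "5550100000"},
]

# Strings treated as missing values (an assumption for this example).
MISSING = {"", "n/a", "na", "null"}


def to_int_or_none(value):
    """Convert a string to int, mapping missing-value markers to None."""
    return None if value.strip().lower() in MISSING else int(value)


def normalize_phone(value):
    """Use a regular expression to strip every non-digit character."""
    return re.sub(r"\D", "", value)


# 1. De-duplication: keep the first copy of each identical record.
seen = set()
unique = []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique.append(row)

# 2. Data type conversion and text normalization.
cleaned = [
    {
        "id": int(row["id"]),
        "age": to_int_or_none(row["age"]),
        "phone": normalize_phone(row["phone"]),
    }
    for row in unique
]

# 3. Imputation: replace missing ages with the rounded mean of observed ages.
observed_ages = [row["age"] for row in cleaned if row["age"] is not None]
fill_value = round(mean(observed_ages))
for row in cleaned:
    if row["age"] is None:
        row["age"] = fill_value

# 4. Subsetting: keep respondents older than 40.
over_40 = [row for row in cleaned if row["age"] > 40]
```

Mean imputation is used here only because it is short to write; whether it is appropriate depends on why the values are missing.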
This module will discuss effective presentation methods for various data types and variables. It will introduce you to effective means to visualize data for presentation, and effective methods for exploratory data analysis (EDA). You will test your knowledge by interpreting and explaining visualizations through multiple choice quizzes and matching, as well as by selecting the most suitable visualization for a particular analysis. You will also create your own visualizations in Jupyter notebooks in Python or R.
In this module, we will demystify network analysis. We begin with a publicly available dataset of emails from the Enron Corporation, which went bankrupt amid a large scandal in 2001, in order to model explicit relationships between the prominent actors in this company.
Using Gephi we will interact with the network visualization and examine the statistical properties of these relationships in co-mention networks. Lastly, we will contextualize the statistical properties of the network, such as eigenvector centrality (‘emergent leader’), by comparing a small number of case studies of the actors with the highest titles and positions in the network.
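Gephi computes eigenvector centrality for you, but the idea behind the statistic can be sketched in a few lines of Python. The example below uses power iteration on a tiny undirected network; the five actors and their co-mention edges are hypothetical, and the iteration count and tolerance are arbitrary choices rather than Gephi's defaults.

```python
import math

# Hypothetical co-mention edges between five actors (undirected).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def eigenvector_centrality(adjacency, iterations=200, tolerance=1e-10):
    """Approximate eigenvector centrality by power iteration.

    Each node's score is repeatedly replaced by the sum of its neighbours'
    scores, then the score vector is rescaled to unit length.
    """
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        updated = {
            node: sum(scores[n] for n in neighbours)
            for node, neighbours in adjacency.items()
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        change = max(abs(updated[n] - scores[n]) for n in adjacency)
        scores = updated
        if change < tolerance:
            break
    return scores


centrality = eigenvector_centrality(build_adjacency(edges))
ranking = sorted(centrality, key=centrality.get, reverse=True)
```

A node scores highly when its neighbours also score highly, which is why the measure is sometimes read as an indicator of emergent leadership rather than a simple count of connections.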
This module will teach you how to identify machine learning applications, the importance of data "cleaning", the rationale for splitting data into training, cross-validation and test sets, and basic ideas about algorithm construction and configuration settings.
A combination of statistics and computer science, machine learning provides a variety of methods for problem-solving in academia, industry, and business. R and Python are two common programming languages for machine learning that will be used to introduce you to core organizational concepts of classification and regression, data preprocessing, fitting a model to a training dataset, and choosing and updating hyperparameters using validation set results to prevent overfitting and underfitting.
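The workflow described here (split the data, fit on the training set, choose a hyperparameter on the validation set, report on the test set) can be sketched in plain Python. The synthetic data, the 60/20/20 split, the one-dimensional k-nearest-neighbours classifier, and the candidate values of k are all assumptions chosen to keep the example short; they are not the course's own exercise.

```python
import random


def split_data(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def knn_predict(train_rows, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train_rows, eval_rows, k):
    """Share of eval_rows whose label is predicted correctly."""
    correct = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return correct / len(eval_rows)


# Hypothetical synthetic data: one noisy feature and a binary label.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = 1 if x + rng.gauss(0, 1.5) > 5 else 0
    data.append((x, label))

train_rows, val_rows, test_rows = split_data(data)

# Choose the hyperparameter k using validation accuracy only.
candidate_ks = [1, 5, 15, 45]  # odd values avoid tied votes
val_scores = {k: accuracy(train_rows, val_rows, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# The test set is touched once, after k has been fixed.
test_accuracy = accuracy(train_rows, test_rows, best_k)
```

A very small k tends to follow noise in the training data (overfitting), while a very large k smooths away real structure (underfitting); comparing validation scores is one way to pick a value between the two.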
This module will review the basic building blocks which serve as the foundation for computational text analysis. Modern computational methods have opened up new frontiers for the aggregation and analysis of digital text. These approaches can supplement, amplify, and augment traditional social science and humanistic approaches to text analysis. The module will begin by importing, structuring, and pre-processing text for computational analysis. We will then demonstrate two common methods for the categorization of text documents: 1) supervised classification, and 2) unsupervised LDA topic modelling. You will be asked to create a workflow designed in a Jupyter notebook.
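As a hedged illustration of the importing, structuring, and pre-processing step, the Python sketch below lowercases and tokenizes two short documents, removes a few stopwords, and builds a document-term matrix of word counts. The documents, the stopword list, and the tokenization rule are hypothetical simplifications; classification and topic modelling would take a matrix like this as input.

```python
import re
from collections import Counter

# A tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on"}

# Hypothetical documents.
documents = [
    "The senate passed the budget bill.",
    "Voters in the city rejected the budget.",
]


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(documents)
```

Each row of the matrix describes one document and each column one term, so the two documents here overlap only in the column for "budget".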
This module wraps up the course with a discussion of best practices. For the social sciences, reproducibility in data science methods is paramount. This module will teach you how to best document your research for yourself and others as reproducible research and workflows, and how to make it available to other researchers in an accessible format. We will also discuss cutting edge directions in social science and data science research, and where you should look for next steps.
Please see below answers to some of the most frequent questions we get about this course:
Yes! We use Jupyter notebooks to walk you through code that accomplishes specific tasks, and ask you to build on the code and solve mini challenges. Answers with explanations are provided in separate files, so you won’t get stuck!
Only if you want to! You’re welcome to use your local installation of R or Python, but we encourage you to use our JupyterHub so that all of your programming takes place in your web browser on your computer, tablet, or phone! You will also be able to download your work for reference later. For one module, we will use the popular network analysis tool Gephi, which you can also download to your computer.
The course will be run over 4 weeks, during which you will have access to learning support provided by the course instructor. After the 4 weeks, you will still have access to the course materials for another 2 months, but you will not be able to receive learning support from the instructor, and if there is a course forum, you will not be able to ask any questions.
All of our courses offer a certificate of completion signed by your instructor. You will be able to download this certificate from the Learning Platform when you complete the course.
Next course starts [[date]]. Book your place today!