Introduction to Applied Data Science Methods for Social Scientists

Next course 25 September - 22 October 2017




Course overview

In this course, we demystify the tools and methods of an emerging field that is changing the way we collect, process, and analyze information. The course is divided into twelve modules. After explaining the relationship between data science and the social sciences, we discuss the ethics of data science. We then dive into the applied methods and code with data science tools. We look at data formats, surveys and crowdsourcing, collecting data from the web, cleaning and preprocessing data, data visualization, network analysis, machine learning, and computational text analysis. The course concludes by looking forward to new ways to collect, analyze, and process data, along with new ways to make research more reproducible. 

This course makes extensive use of interactive programming in Jupyter notebooks, a part of the larger Jupyter Project. The notebooks offer a seamless integration of code with explanatory markdown text, allowing you to read the narrative of a programming task and write code of your own that fits into that larger narrative. To bootstrap your learning, we provide the programming environment for you - in the cloud! We use a JupyterHub to host the materials, and you can program directly in your web browser from any device.

Any Questions? - Contact us

For a bulk order of 5 or more learners on any of our courses, you can claim a 50% discount. Contact us for more information.

Introduction to Applied Data Science Methods for Social Scientists

12+ hours
You will need to have a working knowledge of Python or R. You should be comfortable with for-loops, conditionals, and functions. But don’t worry, the notebooks will walk you through each lesson step-by-step and we provide various types of learner support.
Dr. Claudia von Vacano, Christopher Hench, Geoff Bacon, Dr. Evan Muzzall, Dr. Laura Nelson, Dr. Adam Anderson, Professor David Harding, Alex Estes and Rachel Jansen
In association with
Social Science Data Lab (D-Lab) at the University of California, Berkeley

Course Instructors




How It Works


This course is divided into twelve modules. The first two modules will orient you in the new big data landscape, and the remaining modules teach the practical application of data science methods to the social sciences using Python and R. The second part of the course is less sequential than the first, but we build from and return to the same examples introduced earlier.

The course could be completed over a long weekend, or over a couple of weeks by completing one module per day. There is great flexibility in taking this course, but we recommend you take the modules in order and not break up a single module over multiple sittings.

All programming exercises are provided in both R and Python, so you can choose your path, or try your hand at both! 





Module 1

A social science perspective on data science

This module will introduce you to the objectives of the course via a visual overview of each module. It will discuss how data science is changing social science and statistics, and will cover reliability, generalizability, and reproducibility.

Module 2

Ethics of Data Science Research

This module will teach you about the shortcomings and problems of data science with respect to the groups of people it affects and represents, and how to responsibly acknowledge these issues in research. Beyond problems in data collection and data sources are issues of privacy, sampling, population size, interpretation, and application. Importantly, this module emphasizes issues of de-identification, re-identification, and data security.

Module 3

Data Science Tools

This module will introduce you to the data science tools commonly used in social science research. We will discuss the value of open-source programming languages, specifically R and Python, for research of this nature and weigh the advantages and disadvantages of each. You can share your experience of each language in the forum.

This module will also introduce the Jupyter environment. Demonstrations will be done in Jupyter notebooks, which provide user-friendly environments for executing and sharing code written in these two programming languages, and which seamlessly integrate code with markdown for explaining it. This module will conclude with a brief overview of Git and GitHub, which have become essential for collaborative research programming projects.

Module 4

Data Formats

This module will teach you the formats data comes in, and how to structure data you collect yourself. This module has three core lessons covering delimiter-separated values formats, specifically CSV and TSV; tree formats such as XML; and key-value pair data, specifically JSON. You will learn how to manipulate this data in Python or R.
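To give a flavor of these two formats, here is a minimal Python sketch, using only the standard library, that reads the same hypothetical survey record from JSON (key-value) and CSV (delimiter-separated) text; the field names are invented for illustration:

```python
import csv
import io
import json

# A toy record in JSON (key-value) form -- a hypothetical survey response.
json_text = '{"respondent": 1, "age": 34, "city": "Berkeley"}'
record = json.loads(json_text)

# The same record as a row in a CSV (delimiter-separated) file.
csv_text = "respondent,age,city\n1,34,Berkeley\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(record["city"])   # Berkeley
print(rows[0]["age"])   # "34" -- note that CSV values arrive as strings
```

One practical difference the module explores: JSON preserves types (the age above is an integer), while CSV delivers every field as a string that you must convert yourself.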

Module 5

Surveys and Crowdsourcing Data

This module will give an overview of how to construct a survey and crowdsource responses. Specifically, you will be introduced to Qualtrics and learn about other freely available tools for building surveys. We will discuss Amazon’s Mechanical Turk, which is becoming the new norm for online data collection in the social sciences. 

Module 6

Collecting Data from the Web

This module will teach you how to extract data from web resources appropriate to your research question. You will learn what an API is, its main use cases, why it is a valuable source of data, and how it differs from web scraping. Special attention will be given to obtaining permission from hosts, and to proper etiquette when using APIs and scraping.

This lesson will also, through video and programming exercises, introduce you to common tools used by data scientists to collect data from the web. The activities for this module will be completed in a Jupyter notebook using Python or R. You will construct your own API requests, scrape a website, and write the collected data into a CSV.
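The request-parse-write workflow described above can be sketched in a few lines of Python. In a real project you would fetch the response over the network (with `urllib.request` or the `requests` library); here the API payload is a canned, invented example so the sketch runs offline:

```python
import csv
import io
import json

# Hypothetical JSON payload, standing in for a real API response.
api_response = '{"results": [{"user": "alice", "posts": 12}, {"user": "bob", "posts": 7}]}'
data = json.loads(api_response)

# Write the collected records out as CSV, as in the module's exercise.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["user", "posts"])
writer.writeheader()
writer.writerows(data["results"])
csv_text = out.getvalue()
```

In the course exercise you would write to a file on disk rather than an in-memory buffer, but the parsing and writing steps are the same.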

Module 7

Cleaning Data and Preprocessing

This module will introduce you to standard conventions for basic subsetting, missing value handling, imputation, data type and structure conversion, data de-duplication, and regular expressions. The efficiency of programming in R or Python means that running complex models often involves only minimal coding. However, a main limitation of such efficient practices is that effort must first be spent preparing the data into a format the coded function can recognize. This module will introduce you to those skills.
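Several of these cleaning steps can be illustrated together in a short Python sketch; the raw values are invented, and the mean imputation shown is just one of the strategies the module covers:

```python
import re

# Invented raw survey values: whitespace, a missing-value code, a duplicate,
# and a value that cannot be converted to a number.
raw = ["  34 ", "N/A", "28", "28", "forty-two"]

cleaned = []
seen = set()
for value in raw:
    value = value.strip()          # normalize whitespace
    if value in seen:              # de-duplication
        continue
    seen.add(value)
    if re.fullmatch(r"\d+", value):  # regular expression: digits only?
        cleaned.append(int(value))   # type conversion: str -> int
    else:
        cleaned.append(None)         # treat everything else as missing

# Simple mean imputation for the missing values.
observed = [v for v in cleaned if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else mean for v in cleaned]
```

In practice you would do this with pandas (Python) or the tidyverse (R), as taught in the module, but the underlying logic is the same.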

Module 8

Data Visualization

This module will discuss effective presentation methods for various data types and variables. It will introduce you to effective means to visualize data for presentation, and effective methods for exploratory data analysis (EDA). You will test your knowledge by interpreting and explaining visualizations through multiple choice quizzes and matching, as well as by selecting the most suitable visualization for a particular analysis. You will also create your own visualizations in Jupyter notebooks in Python or R.

Module 9

Network Analysis

In this module, we will demystify network analysis. We begin with a publicly released dataset of emails from the Enron Corporation, which went bankrupt amid a major scandal in 2001, in order to model explicit relationships among the company's prominent actors.

Using Gephi, we will interact with the network visualization and examine the statistical properties of these relationships in co-mention networks. Lastly, we will contextualize the statistical properties of the network, such as eigenvector centrality ('emergent leader'), by comparing a small number of case studies of the actors holding the highest titles and positions in the network.
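Gephi computes eigenvector centrality for you, but the idea is easy to see in code. This minimal Python sketch runs power iteration on a tiny, invented four-person email network (in practice you would use a library such as networkx):

```python
# Toy adjacency matrix for a hypothetical four-person email network:
# adj[i][j] = 1 means person i and person j exchanged email.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
n = len(adj)

# Power iteration: repeatedly multiply a score vector by the adjacency
# matrix and renormalize; the scores converge to eigenvector centrality.
scores = [1.0] * n
for _ in range(100):
    new = [sum(adj[i][j] * scores[j] for j in range(n)) for i in range(n)]
    norm = max(new)
    scores = [s / norm for s in new]

# Person 0 is connected to everyone, so they emerge with the top score.
leader = scores.index(max(scores))
```

Intuitively, a node scores highly not just for having many ties but for being tied to other well-connected nodes, which is why the measure is read as identifying 'emergent leaders'.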

Module 10

Machine Learning

This module will teach you how to identify machine learning applications, the importance of data "cleaning", the rationale for splitting data into training, cross-validation and test sets, and basic ideas about algorithm construction and configuration settings. 

A combination of statistics and computer science, machine learning provides a variety of methods for problem-solving in academia, industry, and business. R and Python are two common programming languages for machine learning that will be used to introduce you to core organizational concepts of classification and regression, data preprocessing, fitting a model to a training dataset, and choosing and updating hyperparameters using validation set results to prevent overfitting and underfitting. 
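The train/validation/test split mentioned above can be sketched in a few lines of standard-library Python; the data here are just placeholder indices, and the 60/20/20 proportions are one common choice rather than a fixed rule:

```python
import random

# 100 placeholder examples standing in for labelled observations.
data = list(range(100))

rng = random.Random(42)   # fixed seed so the split is reproducible
rng.shuffle(data)

# A common 60/20/20 split: fit on train, tune hyperparameters on the
# validation set, and report final performance on the held-out test set.
train = data[:60]
val = data[60:80]
test = data[80:]
```

Keeping the test set untouched until the very end is what lets you detect overfitting: a model that scores well on train but poorly on test has memorized rather than generalized.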

Module 11

Text Analysis

This module will review the basic building blocks that serve as the foundation for computational text analysis. Modern computational methods have opened up new frontiers for the aggregation and analysis of digital text. These approaches can supplement, amplify, and augment traditional social science and humanistic approaches to text analysis. The module will begin by importing, structuring, and pre-processing text for computational analysis. We will then demonstrate two common methods for the categorization of text documents: 1) supervised classification, and 2) unsupervised LDA topic modeling. You will be asked to create a workflow designed in a Jupyter notebook.
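The pre-processing step that both methods depend on - lowercasing, tokenizing, removing stop words, and counting - can be sketched in standard-library Python; the documents and the tiny stop list are invented for illustration:

```python
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the fox.",
]
stopwords = {"the", "at", "over"}   # a tiny illustrative stop list

# Turn each document into a bag of words: lowercase, tokenize,
# drop stop words, and count what remains.
bags = []
for doc in docs:
    tokens = re.findall(r"[a-z]+", doc.lower())
    tokens = [t for t in tokens if t not in stopwords]
    bags.append(Counter(tokens))
```

These word-count representations are the input that both a supervised classifier and an LDA topic model consume; the module builds the full workflow on top of them.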

Module 12

Looking forward

This module wraps up the course with a discussion of best practices. For the social sciences, reproducibility in data science methods is paramount. This module will teach you how to best document your research for yourself and others as reproducible research and workflows, and how to make it available to other researchers in an accessible format. We will also discuss cutting-edge directions in social science and data science research, and where you should look for next steps.




Frequently Asked Questions

Please see below the answers to some of the most frequently asked questions about this course.

Will I be programming interactively?

Yes! We use Jupyter notebooks to walk you through code that accomplishes specific tasks, and ask you to build on the code and solve mini challenges. Answers with explanations are provided in separate files, so you won’t get stuck!

Do I need to install any software?

Only if you want to! You’re welcome to use your local installation of R or Python, but we encourage you to use our JupyterHub so that all of your programming takes place in your web browser on your computer, tablet, or phone! You will also be able to download your work for reference later. For one module, we will use the popular network analysis tool Gephi, which you can also download to your computer.

How long will I have access to the course?

The course will be run over 4 weeks, during which you will have access to learning support provided by the course instructor. After the 4 weeks, you will still have access to the course materials for another 2 months, but you will not be able to receive learning support from the instructor, and if there is a course forum, you will not be able to ask any questions.

Do learners get a certificate?

All of our courses offer a certificate of completion signed by your instructor. You will be able to download this certificate from the Learning Platform when you complete the course.

Can't find what you're looking for?