
By Jonathan Slapin, Professor in the Department of Government at the University of Essex, Director of the Essex Summer School in Social Science Data Analysis, and course instructor on Fundamentals of Quantitative Text Analysis for Social Scientists.


Quantitative Text Analysis (QTA) is the automated, systematic processing of large amounts of text. It lets us carry out tasks such as extracting policy positions from election manifestos or speeches, or even studying attitudes and emotion in newspaper articles.

What the methods used in QTA have in common is that they can be reduced to three basic steps: first, we define a corpus from the texts we want to examine; second, we determine our unit of analysis; and finally, we assemble a document-feature matrix. This post takes a closer look at what these steps involve.


1. Selecting the Texts to Examine

We start the process of QTA by selecting the texts we wish to examine, that is, by defining a corpus. A corpus is a collection of real texts organized in a way that makes it suitable for QTA. Which texts we select depends on the question we seek to answer: the texts must be appropriate for that question, and when deciding what they might be, we need to consider the process that generated them.

Let's consider, as an example, analyzing electoral manifestos to measure the political ideology expressed in those documents. The data-generating process here is that parties write an electoral platform at election time to express their ideology and attract voters. The documents are therefore comparable for studying party ideology. They would not be suitable for trying to understand what citizens think about politics or ideology, because that is not the process that generated them.
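In code, a corpus can be as simple as a collection of named texts. A minimal sketch in Python, where the party names and manifesto snippets are invented placeholders rather than real data:

```python
# A corpus as a mapping from document names to raw text.
# These snippets are invented examples, not real manifesto data.
corpus = {
    "party_a_2019": "We will cut taxes and reduce regulation for businesses.",
    "party_b_2019": "We will invest in public services and raise the minimum wage.",
}

# A quick sanity check on what the corpus contains.
for name, text in corpus.items():
    print(name, "-", len(text.split()), "words")
```

In practice a corpus would be read from files and stored with metadata (party, year, country), but the principle is the same: a structured collection of texts chosen to match the research question.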



2. Deciding the Unit of Analysis

Once the texts have been carefully selected, the next step is to decide on the unit of analysis: the features we will count.

Features to consider:

  • Words (unigrams) - a sequence of characters between two white spaces. We can count unique words (types) or every occurrence of a word (tokens).
  • Multiple words - bigrams, trigrams, and n-grams more generally.
  • Word stems and lemmas - equivalence classes of words, e.g. run, runs and ran.
  • Word counts within documents, paragraphs and sentences.
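The distinctions above can be made concrete with a short sketch. The sentence and the crude suffix-stripping rule below are purely illustrative; real analysis would use a proper stemmer or lemmatizer:

```python
import re

text = "She runs fast and he runs faster"  # toy sentence for illustration

# Unigrams: tokens are every word occurrence; types are the unique words.
tokens = re.findall(r"\w+", text.lower())
types = set(tokens)

# Bigrams: adjacent pairs of tokens (n-grams generalize this to n words).
bigrams = list(zip(tokens, tokens[1:]))

# A crude stand-in for stemming, stripping a trailing "s" so that
# "runs" and "run" fall into the same equivalence class.
stems = [t[:-1] if t.endswith("s") else t for t in tokens]

print(len(tokens), len(types))  # 7 tokens, 6 types ("runs" appears twice)
print(bigrams[0])               # ('she', 'runs')
```

Note how the token count and type count differ as soon as any word repeats; which of these you count is itself a unit-of-analysis decision.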


3. Creating a Document Feature Matrix

Once we've counted our features, we need to put them into a usable format for analysis: a document-feature matrix. We represent each document as a vector of word counts. We take a document, turn it into a set of word counts, do this for every selected text, and stack each document as a column of the matrix. The words become the rows, the columns are the documents, and each element of the matrix is the number of times a word appears in a document.
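The construction just described can be sketched in a few lines. The two toy documents below are invented; in practice they would be the corpus texts:

```python
from collections import Counter

# Toy documents standing in for the corpus texts.
docs = {
    "doc1": "taxes taxes spending",
    "doc2": "spending welfare welfare welfare",
}

# Turn each document into word counts.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary gives the rows of the matrix.
vocab = sorted(set(w for c in counts.values() for w in c))

# Words as rows, documents as columns, as described above.
dfm = [[counts[d][w] for d in docs] for w in vocab]

for w, row in zip(vocab, dfm):
    print(w, row)
```

Here the row for "taxes" is [2, 0] (twice in doc1, absent from doc2) and the row for "welfare" is [0, 3]. Real toolkits store this as a sparse matrix, since most words never appear in most documents.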

We also have to think about whether any words should be excluded: for example, words that appear very few times in the corpus (perhaps just once), or words that appear in one document but none of the others. These are decisions the researcher needs to make and document.
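One way to implement such exclusions is to compute each word's total frequency and document frequency and filter on both. The thresholds below are illustrative only; as noted above, choosing them is a judgment call the researcher must make and document:

```python
from collections import Counter

# Toy documents for illustration.
docs = {
    "doc1": "economy economy growth reform",
    "doc2": "economy welfare reform",
    "doc3": "economy growth",
}

# Total frequency and document frequency for each word.
total = Counter()
doc_freq = Counter()
for text in docs.values():
    words = text.split()
    total.update(words)        # counts every occurrence
    doc_freq.update(set(words))  # counts each document at most once

# Illustrative rule: drop words that appear only once in the corpus,
# or in only one document.
keep = sorted(w for w in total if total[w] > 1 and doc_freq[w] > 1)
print(keep)
```

In this toy corpus, "welfare" is dropped because it occurs once in a single document, while "economy", "growth" and "reform" survive both thresholds.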

This is just a quick look at the three basic steps of quantitative text analysis, and there is obviously more to share than will fit into a blog post! If you’re interested in learning more, take a look at the syllabus for Fundamentals of Quantitative Text Analysis for Social Scientists. We cover what I've written here in much more detail, and the practical elements of the course let you put what you're learning into practice. Find out more about the course here.

Want to read more of Jon's posts? Check out How are you analyzing your texts?