By Matt Denny, PhD Candidate in Political Science and Social Data Analytics, NSF Big Data Social Science IGERT Fellow at Penn State University, and course instructor on Practical Data Management with R for Social Scientists.
The internet represents a vast and ever-expanding source of social science data. Some of this data is well curated and easily downloadable, but much of it is “hidden in plain sight”. An increasingly important tool in the social scientist’s toolkit is the ability to collect data from the internet automatically – a process commonly referred to as web scraping. Here’s a bitesize look at web scraping.
What is web scraping?
Whenever you visit a web page on your computer or phone, a server somewhere in the world sends a packet of data to your device containing all of the information necessary to display the content of the webpage to you. Web scraping is a name for the process of automatically downloading these data and extracting useful bits of information from them. This content could be text, pictures, video, posts, tweets, etc. and be accessed through a variety of means – from simply typing in a web address, to using an Application Programming Interface (API) to gain access to the information.
Web pages as HTML documents
When you visit a webpage, one of the main items the server hosting that page sends to your device is an HTML document containing the information necessary to render the page on your screen. HTML documents are essentially plain text documents (like you might create in a simple text editor), but with extra markup added so that your computer knows how to format and display the information. If we think about them this way, then we can use text processing tools to extract useful bits of information from webpages. However, this is easier for some webpages than others and requires practice to get right. Some websites are designed to make it essentially impossible to automate the process of extracting information from them. It is important to look out for these websites and try to recognize if the information you want is simply not possible to obtain automatically (short of just visiting every page by hand and recording the information), before you invest lots of time in a project.
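To make the "HTML as plain text" idea concrete, here is a minimal sketch of extracting a headline from an HTML document using Python's standard-library `html.parser` module. The HTML snippet and the tag choices are invented purely for illustration; real pages are messier, which is why dedicated scraping libraries exist.

```python
from html.parser import HTMLParser

# A toy HTML document standing in for what a server might send back.
# The content is invented for illustration only.
PAGE = """
<html>
  <body>
    <h1>City Council Votes on Budget</h1>
    <p class="byline">By A. Reporter</p>
    <p>The council approved the budget on Tuesday.</p>
  </body>
</html>
"""

class HeadlineExtractor(HTMLParser):
    """Collect the text inside every <h1> tag."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headlines.append(data.strip())

parser = HeadlineExtractor()
parser.feed(PAGE)
print(parser.headlines)  # ['City Council Votes on Budget']
```

Because the markup tells the parser where each piece of content lives, the same few lines of code can pull the headline out of thousands of pages that share a layout – which is exactly what makes some sites easy to scrape and others (with inconsistent or deliberately obfuscated markup) nearly impossible.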
Getting data from Twitter
Scraping Twitter is a bit different from scraping an ordinary webpage. For starters, the Twitter terms of service prohibit a user from simply trying to automatically visit the URLs for thousands of tweets. Instead, Twitter has set up an application programming interface (API) for accessing the live stream of tweets from its platform. This means you cannot really get access to a whole lot of historical Twitter data (it is possible to get some, though) without paying a third party company to provide it to you. Instead, you can use the API to essentially collect tweets as they are written, in real time. From a practical perspective, this means that if you want to collect tweets about some event or for some time period, you need to plan ahead and have a computer actively collecting tweets before or shortly after an event occurs.
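As a rough sketch of what "collecting tweets in real time" looks like in code, the function below connects to a streaming endpoint and yields one tweet at a time. The endpoint URL reflects Twitter's v2 sampled-stream API at the time of writing, and you would need your own developer bearer token; check the current developer documentation before relying on either assumption.

```python
import json
import urllib.request

def bearer_header(token):
    """Build the HTTP Authorization header the Twitter API expects."""
    return {"Authorization": "Bearer " + token}

def stream_tweets(token,
                  endpoint="https://api.twitter.com/2/tweets/sample/stream"):
    """Connect to the live stream and yield one tweet (as a dict) at a time.

    The endpoint URL is an assumption based on Twitter's v2 API and may
    change; consult the official developer docs for the current version.
    """
    request = urllib.request.Request(endpoint, headers=bearer_header(token))
    with urllib.request.urlopen(request) as response:
        for line in response:
            if line.strip():          # the stream sends keep-alive blank lines
                yield json.loads(line)

# Usage (requires your own developer credentials):
# for tweet in stream_tweets("YOUR_BEARER_TOKEN"):
#     print(tweet["data"]["text"])
```

Note that a script like this only captures tweets from the moment it connects – which is the practical point above: if the collector is not running before (or very shortly after) the event you care about, those tweets are gone.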
Don’t be a cyber criminal!
Before you start collecting information from the Internet, it is important to consider the legal and practical issues around web scraping. There is a sometimes fine line between collecting information automatically as part of research and putting stress on a website’s hosting services. A type of cyber-attack known as a denial-of-service (DoS) attack is a malicious, intentional attack meant to bring down a website. A denial-of-service attack is essentially just a souped-up version of web scraping: the attacker sets up a computer to make as many requests as possible for information from a website, automatically, until the server crashes. Always check a website’s terms of service to see if they prohibit web scraping or if they place limits on the number of requests per hour, and follow those guidelines to make sure you stay out of trouble. If no guidelines exist, a good rule of thumb is to limit yourself to 10 requests per minute. The best strategy is always to take a conservative approach and go slow.
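The 10-requests-per-minute rule of thumb is easy to enforce in code. Here is one minimal sketch (the class name and structure are my own, not from any particular scraping library): a small rate limiter that sleeps just long enough between requests to stay under the limit.

```python
import time

class RateLimiter:
    """Enforce a minimum delay between successive requests."""
    def __init__(self, requests_per_minute=10):
        self.min_interval = 60.0 / requests_per_minute  # seconds between requests
        self.last_request = None

    def wait(self):
        """Sleep just long enough to respect the limit, then record the time."""
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

# Usage: call limiter.wait() before every page request, e.g.
# limiter = RateLimiter(requests_per_minute=10)
# for url in urls_to_scrape:          # urls_to_scrape is your own list
#     limiter.wait()
#     html = download(url)            # download() stands in for your fetch code
```

Pausing between requests costs you almost nothing on a research timescale, and it is the single simplest way to keep your scraper from looking like an attack.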
If this post has inspired you to learn more about web scraping, take a look at Practical Data Management with R for Social Scientists. There’s a module that looks at how you can use R for scraping and managing your data in a lot more detail than I can fit into a blog post.
Want to hear more from Matt Denny? Take a look at his blog post A bitesize intro to… The Basic Data Structures in R