By Dr Chris Hench, Program Development Lead for the D-Lab and Digital Humanities at the University of California, Berkeley, and instructor for the courses in our Fundamentals of Data Science for Social Scientists package for institutions.
Cleaning and preprocessing data is the practice of transforming your raw data into a consistent and correct format ready for statistical analysis or machine learning. This is also known as "data munging" or "data wrangling".
Common data cleaning tasks include the following:
Removing duplicated rows/columns
Standardizing inconsistent spelling
Finding out the unit of measurement for a column
Removing incorrect values
Imputing missing values
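To make these tasks concrete, here is a minimal sketch in Python using only the standard library. The records, the column names (id, city, height_cm) and the rule that a non-positive height is incorrect are all invented for illustration; a real project would more likely use a data-frame library, and the right imputation method depends on the data. Finding out the unit of measurement is a research step rather than a coding step, so the sketch simply records the unit in the column name.

```python
from statistics import median

# Hypothetical raw records; names and values are invented for illustration.
raw_rows = [
    {"id": 1, "city": "Berkeley", "height_cm": 170.0},
    {"id": 1, "city": "Berkeley", "height_cm": 170.0},   # exact duplicate row
    {"id": 2, "city": "berkeley ", "height_cm": None},   # inconsistent spelling, missing value
    {"id": 3, "city": "Oakland", "height_cm": -5.0},     # impossible value
    {"id": 4, "city": "OAKLAND", "height_cm": 180.0},    # inconsistent spelling
]


def clean(rows):
    """Return a cleaned copy of rows; the input list is left untouched."""
    # 1. Remove duplicated rows, keeping the first occurrence of each.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Standardize inconsistent spelling: trim whitespace, use title case.
    for row in unique:
        row["city"] = row["city"].strip().title()

    # 3. Remove incorrect values: assume a height must be positive,
    #    and mark anything else as missing.
    for row in unique:
        height = row["height_cm"]
        if height is not None and height <= 0:
            row["height_cm"] = None

    # 4. Impute missing values with the median of the observed heights.
    observed = [r["height_cm"] for r in unique if r["height_cm"] is not None]
    fill_value = median(observed)
    for row in unique:
        if row["height_cm"] is None:
            row["height_cm"] = fill_value

    return unique


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

The order of the steps matters in this sketch: incorrect values are marked as missing before imputation, so they do not distort the median used to fill the gaps.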
Many data scientists estimate that they spend up to 80% of their time cleaning data! That share may seem excessive, but it reflects how important data cleaning is. The key reasons you should always clean your data are:
Most machine learning algorithms require data to be in a particular format, such as a rectangular array
Datasets often have missing and incorrect values
Cleaning and preprocessing a dataset is how we come to know it well
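A quick audit is one way to act on these reasons before any analysis starts. The sketch below, again in plain Python, checks whether a set of records is rectangular (every row has the same columns) and counts missing values per column. The audit function and the sample records are hypothetical, and treating None as the only marker of a missing value is an assumption; real datasets often use other placeholders such as empty strings or sentinel numbers.

```python
def audit(rows):
    """Report whether rows form a rectangle and how many values each column lacks."""
    # Collect every column name that appears in any row.
    columns = sorted({key for row in rows for key in row})

    # The data is rectangular only if every row has exactly those columns.
    is_rectangular = all(set(row) == set(columns) for row in rows)

    # Count rows where a column is absent or explicitly None.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }
    return is_rectangular, missing


# Hypothetical survey records, invented for illustration.
sample = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # age recorded as missing
    {"age": 29},                      # income column absent from this row
]

is_rectangular, missing = audit(sample)
print(is_rectangular, missing)
```

Running a check like this on every new dataset, including one received from a collaborator, shows where cleaning is needed before an algorithm fails on the data or silently produces misleading results.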
Whatever its source, you should consider every dataset you work with to be “dirty data” (meaning it hasn’t been cleaned and preprocessed yet), even if you receive it from a collaborator. You should therefore never skip the crucial cleaning stage of the data analysis process!
Our Fundamentals of Data Science for Social Scientists package for institutions includes 7 short “bytesize” courses, including Bytesize: Cleaning Data and Preprocessing. In this course, you will learn how to prepare data so that it is in a format that functions in R or Python can work with. This course is only available as part of the full package for institutions. Enquire today on the package page.