By Dr Chris Hench, Program Development Lead for the D-Lab and Digital Humanities at the University of California, Berkeley, and course instructor on Introduction to Data Science for Social Scientists.
Cleaning and preprocessing data is the practice of transforming your raw data into a consistent and correct format ready for statistical analysis or machine learning. This is also known as "data munging" or "data wrangling".
Common data cleaning tasks include the following:
- Removing duplicated rows/columns
- Standardizing inconsistent spelling
- Finding out the unit of measurement for a column
- Removing incorrect values
- Imputing missing values
- Type coercion
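Several of the tasks above can be sketched in a few lines of pandas. This is a minimal illustration, not a recipe: the toy survey table, its column names, and the cleaning rules (for example, treating negative ages as incorrect and imputing with the median) are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy survey data; the columns and values are invented for illustration.
raw = pd.DataFrame({
    "city": ["Berkeley", "berkeley ", "Oakland", "Oakland", "Oakland"],
    "age": ["34", "29", "41", "41", "-5"],
    "income": [52000.0, None, 61000.0, 61000.0, 48000.0],
})

df = raw.copy()

# Standardize inconsistent spelling: trim whitespace and normalize capitalization.
df["city"] = df["city"].str.strip().str.title()

# Remove duplicated rows (rows that match on every column).
df = df.drop_duplicates().reset_index(drop=True)

# Type coercion: ages arrived as strings; unparseable entries would become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Remove incorrect values: here we assume a negative age is a data-entry error.
df = df[df["age"] >= 0].reset_index(drop=True)

# Impute missing values: fill gaps in income with the column median (one choice of many).
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

The order of the steps matters: standardizing spelling first lets `drop_duplicates` catch rows that differ only in capitalization or stray whitespace, and removing incorrect rows before imputing keeps them from influencing the median.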
Many data scientists estimate that they spend up to 80% of their time cleaning data! That may sound like a lot, but it shows just how important data cleaning is. The key reasons you should always clean your data are:
- Most machine learning algorithms require data to be in a particular format, such as a rectangular array
- Datasets often have missing and incorrect values
- Through cleaning and preprocessing data, we come to know the dataset very well
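A quick first inspection shows how these reasons play out in practice. The sketch below assumes a hypothetical list of records with uneven fields; the field names and the rule that a height must be positive are invented for illustration.

```python
import pandas as pd

# Hypothetical raw records; the second one lacks a weight, the third has an impossible height.
records = [
    {"id": 1, "height_cm": 170.0, "weight_kg": 65.0},
    {"id": 2, "height_cm": 182.5},
    {"id": 3, "height_cm": -1.0, "weight_kg": 80.0},
]

# Loading into a DataFrame gives a rectangular table; absent fields become NaN.
df = pd.DataFrame(records)
shape = df.shape

# Count missing values per column.
missing_per_column = df.isna().sum()

# Flag values that cannot be right (assumed rule: heights must be positive).
suspicious = df[df["height_cm"] <= 0]

print(shape)
print(missing_per_column)
print(suspicious)
```

Even this small check surfaces one missing value and one incorrect value, and reading the output is a first step toward knowing the dataset well.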
Whatever its source, you should consider every dataset you work with to be “dirty data” (meaning it hasn’t been cleaned and preprocessed yet), even if you receive it from a collaborator. Never skip the crucial cleaning stage of the data analysis process!
SAGE Campus Bytesize courses are a series of short courses that teach core data science skills to people who are eager to learn, but short on time.