Cleaning Messy Data
This course introduces the fundamentals of cleaning messy data. It gives a clear understanding of what messy datasets are and why they need to be cleaned, along with many practical examples of cleaning them.
This course will help learners to:
Recognize when data are messy and require cleaning
Apply cleaning methods to messy datasets
Understand how cleaning messy data contributes to good data management
Perform quality control of data
Language: English
Time to complete: 3 hours
Level: Beginner
Instructor: Dr. Alessandra Vigilante
How to access: Sage Campus is a digital library product. If you are a librarian, find out how to get Sage Campus for your university. If you are faculty, a researcher, or a student, recommend Sage Campus to your library.
Even the most organized person can make mistakes when recording and saving data. At first, datasets can look clean and reproducible, but as soon as we try to add more data or use them for analysis or visualization, issues begin to arise and we find ourselves needing to clean the data. In this module, you will learn what messy data are and why it’s so important to recognize and clean them as soon as possible (and to avoid them in the future).
Messy data waste your time, confuse your collaborators, and undermine your analysis and research output.
In this module, we’ll explain why it’s so important to have clean data you can trust, both to obtain reliable results and to create sustainable, interoperable datasets.
Most of the time, quantitative data are recorded and saved using a spreadsheet program. Excel isn’t the only spreadsheet program, but it’s arguably the most widely used. Free alternatives include LibreOffice Calc and, for Apple users, Numbers. This module provides background information on different spreadsheet programs and shares key skills for cleaning messy data manually.
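The course itself focuses on manual cleaning in spreadsheet programs. For readers curious how the same kind of cleanup might look in code, here is a minimal Python sketch. The sample data, column names, list of missing-value markers, and function names are all hypothetical and chosen only for illustration; they are not taken from the course materials.

```python
import csv
import io

# Hypothetical messy export: stray spaces, inconsistent capitalization,
# and mixed markers for missing values.
RAW = """participant, Age ,city
 P01 ,34, London
p02,N/A,london
P03, 41 ,
"""

# Assumed set of markers that should count as "missing".
MISSING = {"", "na", "n/a", "null", "-"}


def clean_cell(value):
    """Trim whitespace and map common missing-value markers to None."""
    value = value.strip()
    return None if value.lower() in MISSING else value


def clean_rows(text):
    """Return a list of dicts with tidy headers and normalized values."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower() for h in next(reader)]
    rows = []
    for raw in reader:
        if not any(cell.strip() for cell in raw):
            continue  # skip completely blank lines
        record = dict(zip(header, (clean_cell(c) for c in raw)))
        if record["participant"]:
            record["participant"] = record["participant"].upper()
        if record["city"]:
            record["city"] = record["city"].title()
        if record["age"] is not None:
            record["age"] = int(record["age"])
        rows.append(record)
    return rows


cleaned = clean_rows(RAW)
for row in cleaned:
    print(row)
```

The sketch makes three typical fixes: it trims stray whitespace, converts inconsistent missing-value markers to a single representation, and standardizes capitalization so that "london" and "London" are treated as the same value.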
Students, researchers and faculty can try all Sage Campus courses by signing up for a 7-day free trial. 30-day institutional trials are set up via your institution’s library, so recommend Sage Campus to your library to request a campus-wide trial.
This course is aimed at all learners who work with large data sets that need to be cleaned and reformatted before processing, from undergraduates to early career researchers.
This practical course will help you gain the knowledge and skills to use R for social science research, step-by-step.
Learn how to use R to manage data in a wide variety of formats, in a reproducible manner, at scale.
Perfect for beginners, this course will teach you the fundamentals of Python programming through taught materials and practical examples.
Gain the skills you need to manipulate and visualize a variety of data types using Python.
Learn the techniques and tools for presenting data in visually attractive and interactive ways using the R programming language.
Learn the essentials of collecting social media data and gain the skills to plan, gather and analyze social media data for your research.
Gain a conceptual overview of the text mining landscape and a foundational understanding of the analysis of digital textual data sets.
Learn how to analyze large amounts of textual data, at scale, using the R programming language.
Gives learners a full understanding of what artificial intelligence is and how it is used and applied in society and research methods, covering important ethical considerations and challenges.
Dr. Alessandra Vigilante is a Senior Lecturer in Bioinformatics at the Center for Stem Cells and Regenerative Medicine with a focus on genotype-phenotype interactions and data integration. Alessandra obtained her PhD in Bioinformatics in Naples (2008-2011) before moving to the UK to join the Nicholas Luscombe group first at the EMBL-European Bioinformatics Institute as a visiting student (2011-2012) and then as a postdoctoral fellow at UCL (2012-2017).
Alessandra Vigilante’s group has significant expertise in the analysis and integration of large-scale genomic, epigenomic and transcriptomic data (e.g. single-cell RNA-seq, ATAC-seq and ChIP-seq datasets), and in the implementation of novel computational methods for bespoke analyses to gain biological insights. She is actively involved in a broad network of collaborations to develop multidisciplinary approaches to research, working with faculty members within King’s and other research institutes.