6 Data cleaning, visualizing and analysis

Data cleaning is the process of detecting and then correcting or removing incomplete, incorrect, irrelevant, duplicated or improperly formatted data in a dataset. Such errors can limit the downstream data analysis and distort its results. Data cleaning can resolve some of these problems and so improve the data, the analysis and their outcome.

Data cleaning should be fully code-based: from this point on, nothing in the data should be changed by hand. Make your code openly available (e.g. on GitHub) so that the data cleaning workflow is transparent and reproducible. For more details, see the section on data cleaning (Chapter 6).
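As an illustration of what code-based cleaning can look like, here is a minimal sketch in Python using only the standard library. The dataset, the column names (site, species, count) and the function name clean_rows are invented for this example, and the cleaning rules (trim whitespace, unify capitalisation, drop rows with a missing count, drop exact duplicates) are assumptions. Which rules are appropriate depends on your own data.

```python
import csv
import io

# Hypothetical raw data; column names and values are invented for illustration.
RAW_CSV = """site,species,count
A, Salix herbacea ,12
A, Salix herbacea ,12
B,Silene acaulis,
C,silene acaulis,7
"""


def clean_rows(rows):
    """Return cleaned copies of the rows; the raw input is left untouched."""
    seen = set()
    cleaned = []
    for row in rows:
        # Improperly formatted text: strip whitespace, unify capitalisation.
        site = row["site"].strip()
        species = row["species"].strip().capitalize()
        count = row["count"].strip()
        # Incomplete records: here we drop rows without a count (an assumption;
        # in other projects flagging them may be the better choice).
        if not count:
            continue
        record = (site, species, int(count))
        # Duplicated records: keep only the first occurrence.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append({"site": site, "species": species, "count": int(count)})
    return cleaned


raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
clean = clean_rows(raw_rows)

for row in clean:
    print(row)
```

Because every change is written down as a rule in the script, rerunning it on the raw data reproduces the same cleaned dataset, and anyone reading the code can see exactly what was changed and why.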

After data cleaning, we recommend saving a clean version of your data and indicating again in the file name that this is the clean version.
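One way to make the naming step part of the script is to derive the clean file name from the raw file name. The sketch below is only an illustration: the function name clean_file_path, the "_raw" and "_clean" suffixes and the example file names are assumptions, not a prescribed convention.

```python
from pathlib import Path


def clean_file_path(raw_path, raw_suffix="_raw", clean_suffix="_clean"):
    """Derive the path for the cleaned file from the path of the raw file.

    The "_raw"/"_clean" naming scheme is a hypothetical convention.
    """
    raw_path = Path(raw_path)
    stem = raw_path.stem
    # If the raw file is already marked as raw, replace that marker.
    if stem.endswith(raw_suffix):
        stem = stem[: -len(raw_suffix)]
    return raw_path.with_name(f"{stem}{clean_suffix}{raw_path.suffix}")


# Example file names are invented for illustration.
out_a = clean_file_path("data/survey_2023_raw.csv")
out_b = clean_file_path("data/survey_2023.csv")
print(out_a.name)
print(out_b.name)
```

Writing the cleaned data to a new path, rather than overwriting the raw file, keeps the raw data intact so the cleaning script can always be rerun from the original.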

Data cleaning is often a circular process…