Page 43 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide
3.6 A Summary of Data Preparation Procedures
The aim of data cleaning is to produce a data set that is usable and ready to be
analysed. Thus, the data set should be free from invalid duplicates, missing values,
inconsistent values and extreme values (or outliers). Figure 3.2 shows an example of a
problematic data set. Can you spot the problems in this data set?
Figure 3.2: Sample of problematic data
The problems are as follows:
1. Inconsistent code for gender (id=3 and id=13)
2. Implausible patient age of 445 years (id=14)
3. Missing value for race (id=5)
4. Male patient with a 'pregnant' status (id=20)
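
Checks of this kind can also be run programmatically. The following is a minimal Python sketch, not the procedure used in this guide. The records, the column names (id, gender, age, race, pregnant), the accepted gender codes and the upper age limit of 120 years are all assumptions made for illustration, since the contents of Figure 3.2 are not reproduced here.

```python
# Hypothetical records: these rows and column names are illustrative
# assumptions and are NOT the data shown in Figure 3.2.
records = [
    {"id": 1, "gender": "M", "age": 34, "race": "Malay", "pregnant": "No"},
    {"id": 2, "gender": "male", "age": 51, "race": "Chinese", "pregnant": "No"},
    {"id": 3, "gender": "F", "age": 445, "race": "Indian", "pregnant": "No"},
    {"id": 4, "gender": "F", "age": 29, "race": None, "pregnant": "Yes"},
    {"id": 5, "gender": "M", "age": 40, "race": "Malay", "pregnant": "Yes"},
    {"id": 1, "gender": "M", "age": 34, "race": "Malay", "pregnant": "No"},
]

VALID_GENDER_CODES = {"M", "F"}  # assumed coding scheme
MAX_PLAUSIBLE_AGE = 120          # assumed upper limit, in years


def find_problems(rows):
    """Return a list of (id, problem) pairs for records that need review."""
    problems = []
    seen_ids = set()
    for row in rows:
        rid = row["id"]

        # Inconsistent code: gender is not one of the agreed codes.
        if row["gender"] not in VALID_GENDER_CODES:
            problems.append((rid, "inconsistent gender code"))

        # Extreme value: age is missing or outside a plausible range.
        age = row["age"]
        if age is None or not 0 <= age <= MAX_PLAUSIBLE_AGE:
            problems.append((rid, "implausible age"))

        # Missing value: race is empty.
        if row["race"] in (None, ""):
            problems.append((rid, "missing race"))

        # Inconsistent values across fields: male and pregnant.
        if row["gender"] == "M" and row["pregnant"] == "Yes":
            problems.append((rid, "male patient recorded as pregnant"))

        # Duplicate: the same id has already appeared.
        if rid in seen_ids:
            problems.append((rid, "duplicate id"))
        seen_ids.add(rid)
    return problems


problems = find_problems(records)
for rid, problem in problems:
    print(f"id={rid}: {problem}")
```

In this sketch the cross-field check for pregnancy only recognises the code "M", so gender codes would need to be standardised before that check can be relied on. The function only flags records for review; it does not decide how a flagged value should be corrected.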