Page 43 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide

3.6 A Summary of Data Preparation Procedures

               The aim of data cleaning is to produce a data set that is usable and ready to be analysed. Thus, the data set should be free from invalid duplicates, missing values, inconsistent values and extreme values (or outliers). Figure 3.2 shows an example of a problematic data set. Can you spot the problems with this data set?


































                                          Figure 3.2: Sample of problematic data





               The problems are as follows:

                   1. Inconsistent code for gender (id=3 and id=13)


                   2. Implausible patient age of 445 years (id=14)

                   3. Missing value for race (id=5)


                   4. Male patient with a 'pregnant' status (id=20)
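
The four kinds of problem listed above, together with the duplicate check mentioned earlier, can also be screened for automatically. The following is a minimal Python sketch, not the guide's own procedure. The records are invented for illustration and are not the contents of Figure 3.2. The gender codes "M"/"F", the upper age limit of 120 years, the field names and the function name find_problems are all assumptions made for this example.

```python
# Hypothetical records: the values are invented for illustration and are
# NOT the contents of Figure 3.2. Field names are assumptions as well.
records = [
    {"id": 1, "gender": "F", "age": 34, "race": "A", "pregnant": "yes"},
    {"id": 3, "gender": "male", "age": 51, "race": "B", "pregnant": "no"},
    {"id": 5, "gender": "M", "age": 47, "race": "", "pregnant": "no"},
    {"id": 14, "gender": "F", "age": 445, "race": "A", "pregnant": "no"},
    {"id": 20, "gender": "M", "age": 29, "race": "B", "pregnant": "yes"},
]

VALID_GENDERS = {"M", "F"}  # assumed coding scheme
MAX_AGE = 120               # assumed plausibility limit, in years


def find_problems(rows):
    """Return a list of (id, issue) pairs for records that fail a check."""
    problems = []
    seen_ids = set()
    for row in rows:
        rid = row["id"]
        # Duplicate: the same id appears more than once.
        if rid in seen_ids:
            problems.append((rid, "duplicate id"))
        seen_ids.add(rid)
        # Inconsistent value: gender code outside the agreed scheme.
        if row["gender"] not in VALID_GENDERS:
            problems.append((rid, "inconsistent gender code"))
        # Extreme value: age outside a plausible human range.
        if not 0 <= row["age"] <= MAX_AGE:
            problems.append((rid, "implausible age"))
        # Missing value: race left empty.
        if row["race"] in ("", None):
            problems.append((rid, "missing race"))
        # Logical inconsistency between two fields.
        if row["gender"] == "M" and row["pregnant"] == "yes":
            problems.append((rid, "male patient recorded as pregnant"))
    return problems


for rid, issue in find_problems(records):
    print(f"id={rid}: {issue}")
```

Each check only flags a record; deciding whether to correct, re-code or exclude the flagged value remains a judgement for the analyst, ideally made against the source documents.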