Page 44 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide
These problems may be easy to spot by eye because the data set is small. Large data sets
with dozens of variables are usually more complicated and require highly skilled statisticians
(using appropriate software) to handle the data. To solve these problems, a
statistician needs to verify the accuracy of the data by referring to the original records,
rectify any errors found and, for certain variables such as age, apply an
appropriate imputation technique to replace any missing values in the data set. After data
cleaning has been completed, the 'clean' data set should be kept by saving it in a
separate file. A good data set should contain the correct number of subjects, with all their
related variables grouped together under each category of the variable definitions (see Figure
3.3 and Figure 3.4). Finally, the data set should also be given a proper name in order
to facilitate future retrieval, such as "osa_study20090930", which indicates a project related to
the Obstructive Sleep Apnoea (OSA) study, locked on 30 September 2009.
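
As a rough illustration of the cleaning steps described above, the following Python sketch fills in missing ages, leaves the original records untouched, and builds a dated data set name. The sample records, the function names and the "_clean.csv" suffix are all hypothetical, and median imputation is only one simple option; the text does not prescribe a particular technique, and a statistician may well choose a different one.

```python
import csv
from datetime import date
from statistics import median

# Hypothetical records for illustration only; None marks a missing age.
records = [
    {"id": 1, "age": 45},
    {"id": 2, "age": None},
    {"id": 3, "age": 61},
    {"id": 4, "age": 52},
]


def impute_median_age(rows):
    """Return a cleaned copy of rows with missing ages set to the observed median."""
    observed = [r["age"] for r in rows if r["age"] is not None]
    fill = median(observed)
    # dict(r, age=...) copies each record, so the original rows are not modified.
    return [dict(r, age=fill if r["age"] is None else r["age"]) for r in rows]


def locked_name(prefix, lock_date):
    """Build a data set name from a study prefix and its lock date (YYYYMMDD)."""
    return f"{prefix}{lock_date:%Y%m%d}"


def save_clean(rows, path):
    """Write the cleaned rows to a separate CSV file."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


clean = impute_median_age(records)
name = locked_name("osa_study", date(2009, 9, 30))
# To keep the clean copy in its own file (hypothetical file name):
# save_clean(clean, name + "_clean.csv")
```

Returning a copy rather than editing the records in place mirrors the advice to keep the 'clean' data set apart from the original, so that any imputed value can still be checked against the source records.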