Page 35 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide

It is also a good practice for the researcher to pre-specify an acceptable range for all

numerical variables. This is important because it enables the researcher to detect any outliers,

abnormal values and data inconsistencies during the data cleaning process. For example, if

the acceptable range for an adult patient's age is 18 to 100 years, we should

query any record in which a patient's age has been reported as 'more than 100 years'. For best practices

in effective data management, the upper and lower limits of each numerical variable

should be pre-determined and also clearly specified in the database system so that all

out-of-range values will automatically be removed from the registries. In a similar vein to the

procedure for removal of duplicates, it is also necessary to carefully screen for the presence

of outliers and other abnormal values even though a proper set of data validation rules has

already been established within the database system, as there is always a chance for a

programming error to occur in the system.
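
The range check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record layout, the field names (`patient_id`, `age`), the sample records and the helper `find_out_of_range` are all hypothetical, and only the 18 to 100 year age range comes from the text. The sketch lists out-of-range values so that they can be queried, rather than removing them.

```python
# Pre-specified acceptable ranges for numerical variables.
# Only the age range (18 to 100 years) is taken from the text.
ACCEPTABLE_RANGES = {
    "age": (18, 100),
}


def find_out_of_range(records, ranges=ACCEPTABLE_RANGES):
    """Return (record_index, variable, value) for every value outside its range."""
    queries = []
    for index, record in enumerate(records):
        for variable, (low, high) in ranges.items():
            value = record.get(variable)
            if value is None:
                continue  # missing values are not range-checked here
            if not (low <= value <= high):
                queries.append((index, variable, value))
    return queries


# Made-up records, for illustration only.
records = [
    {"patient_id": "P001", "age": 45},
    {"patient_id": "P002", "age": 130},
    {"patient_id": "P003", "age": 17},
    {"patient_id": "P004", "age": None},
]

queries = find_out_of_range(records)
for index, variable, value in queries:
    print(records[index]["patient_id"], variable, value)
```

Running the same screen on data exported from the database system, even when validation rules already exist there, is one way to catch values that slipped through because of a programming error.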

There are two approaches to adopt for data cleaning, namely: (i) to clean the data

obtained from its original source (front end), or (ii) to clean the data obtained from the

database system (back end). Cleaning the data obtained from its original source is much

more time-consuming because this will involve making reference to the original source

documents or the subjects themselves in order to verify the accuracy of the data collected.

Although such a data cleaning process is very likely to increase the validity of the

data, it may occasionally not be feasible due to both time and resource constraints.

An alternative approach is to apply statistical techniques for data cleaning (back end).

These techniques commonly include data imputation or simply declaring the data as 'missing'.

Simply declaring the data as 'missing' is undoubtedly the easiest option, and it is usually

applied whenever the proportion of missing values is very small. When preparing the

registry data set for data analysis, it will first be necessary to create new variables and to
