Page 35 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide
It is also a good practice for the researcher to pre-specify an acceptable range for all
numerical variables. This is important because it enables the researcher to detect any outliers,
abnormal values and data inconsistencies during the data cleaning process. For example, suppose
the acceptable range for an adult patient's age is 18 to 100 years. We should then
query any record in which the patient's age is reported as more than 100 years. As a best practice
in effective data management, the upper and lower limits of each numerical variable
should be pre-determined and also clearly specified in the database system so that all
out-of-range values will automatically be removed from the registries. As with the
procedure for removing duplicates, it is still necessary to screen carefully for
outliers and other abnormal values even when a proper set of data validation rules has
already been established within the database system, because a programming error in the
system is always possible.
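As a minimal sketch of such a screening step, the Python example below checks each numerical variable against its pre-specified range and lists the values that should be queried. It assumes the records have been exported from the registry as a list of dictionaries. Only the age range of 18 to 100 years comes from the example above; the field names, the blood pressure limits, the function name and the sample records are hypothetical and serve for illustration only.

```python
# Pre-specified acceptable ranges for numerical variables.
# The age range follows the example in the text; the variable names and
# the blood pressure limits are hypothetical.
ACCEPTABLE_RANGES = {
    "age_years": (18, 100),
    "systolic_bp_mmhg": (60, 260),
}


def find_out_of_range(records, ranges=ACCEPTABLE_RANGES):
    """Return (record id, variable, value) for every value outside its range."""
    queries = []
    for record in records:
        for variable, (lower, upper) in ranges.items():
            value = record.get(variable)
            if value is None:
                continue  # missing values are handled in a separate step
            if not (lower <= value <= upper):
                queries.append((record["id"], variable, value))
    return queries


# Hypothetical sample records, for illustration only.
records = [
    {"id": "P001", "age_years": 45, "systolic_bp_mmhg": 120},
    {"id": "P002", "age_years": 104, "systolic_bp_mmhg": 135},
    {"id": "P003", "age_years": 17, "systolic_bp_mmhg": None},
]

queries = find_out_of_range(records)
for record_id, variable, value in queries:
    print(f"Query {record_id}: {variable} = {value}")
```

In this sketch the out-of-range values are only listed for review rather than deleted, so that each one can be checked against the source documents before a decision is made.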
There are two approaches to adopt for data cleaning, namely: (i) to clean the data
obtained from its original source (front end), or (ii) to clean the data obtained from the
database system (back end). Cleaning the data obtained from its original source is much
more time-consuming because this will involve making reference to the original source
documents or the subjects themselves in order to verify the accuracy of the data collected.
Although such a data cleaning process is very likely to increase the validity of the
data, it may occasionally not be feasible because of time and resource constraints.
An alternative approach is to apply statistical techniques for data cleaning (back end).
These techniques commonly include data imputation or simply declaring the data as 'missing'.
Simply declaring the data as 'missing' is the easiest option, and it is usually
applied when the proportion of missing values is very small. When preparing the
registry data set for data analysis, it will first be necessary to create new variables and to