Page 35 - PATIENT REGISTRY DATA FOR RESEARCH: A Basic Practical Guide

It is also a good practice for the researcher to pre-specify an acceptable range for all

               numerical variables. This is important because it enables the researcher to detect any outliers,


               abnormal values and data inconsistencies during the data cleaning process. For example, say

               the acceptable range for an adult patient's age should be from 18 to 100 years. We should


               therefore query if a patient's age has been reported as 'more than 100 years'. For best practices

               in effective data management, the upper and lower limits of each numerical variable


               should be pre-determined and also clearly specified in the database system so that all

               out-of-range values will automatically be removed from the registries. In a similar vein to the


               procedure for removal of duplicates, it is also necessary to carefully screen for the presence

               of outliers and other abnormal values even though a proper set of data validation rules has


               already been established within the database system, as there is always a chance for a

               programming error to occur in the system.
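The range check described above can be sketched in a few lines of Python. This is only a minimal illustration, not the guide's own procedure: the function name `find_out_of_range`, the record layout and the `systolic_bp` limits are assumptions made for the example; only the age range of 18 to 100 years comes from the text. The sketch flags out-of-range values for query rather than deleting them, so that each one can still be verified.

```python
# Pre-specified acceptable ranges for numerical variables.
# Only the age range (18 to 100 years) comes from the text;
# the systolic_bp limits are assumed purely for illustration.
ACCEPTABLE_RANGES = {
    "age": (18, 100),
    "systolic_bp": (60, 260),
}


def find_out_of_range(records, ranges=ACCEPTABLE_RANGES):
    """Return (record index, variable, value) for every value outside its range."""
    queries = []
    for index, record in enumerate(records):
        for variable, (lower, upper) in ranges.items():
            value = record.get(variable)
            if value is None:
                continue  # missing values are not range errors; handle them separately
            if not lower <= value <= upper:
                queries.append((index, variable, value))
    return queries


# Hypothetical registry records used only to exercise the check.
records = [
    {"age": 45, "systolic_bp": 130},
    {"age": 104, "systolic_bp": 120},
    {"age": 17, "systolic_bp": None},
]

queries = find_out_of_range(records)
for index, variable, value in queries:
    print(f"Record {index}: {variable} = {value} is outside the acceptable range")
```

Running such a check on data exported from the database, even when validation rules already exist there, is one way to catch values that slipped through because of a programming error.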

                         There are two approaches to adopt for data cleaning, namely: (i) to clean the data


                   obtained from its original source (front end), or (ii) to clean the data obtained from the

                  database system (back end).  Cleaning the data obtained from its original source is much


                   more time-consuming because this will involve making reference to the original source

                 documents or the subjects themselves in order to verify the accuracy of the data collected.


                 Even though such a data cleaning process is very likely to increase the validity of the

                 data, it may occasionally not be feasible due to both time and resource constraints.


                       An alternative approach is to apply statistical techniques for data cleaning (back end).

               These techniques commonly include data imputation or simply declaring the data as 'missing'.


               Simply declaring the data as 'missing' is the easiest option, and it is usually

               applied whenever the proportion of missing values is very small. When preparing the

               registry data set for data analysis, it will first be necessary to create new variables and to