Making the Data Complete: An Approach to Deal with Missing Data Problems
Abstract
How missing values should be treated is a recurring dilemma in statistics. The subject has been researched extensively, mainly with respect to methods of handling missing values.
While in theory there are many methods of dealing with missing values, the method actually applied depends on the type of the variables in particular and the type of data in general. This paper attempts to bring together the methods available for treating missing values in machine learning and statistical data analysis, and to assess the situations in which each should (or should not) be applied, with some examples and case studies. It also explores some relatively newer techniques for dealing with missing values (such as multiple imputation, as implemented in SAS and R), the advantages they offer over other techniques, and the situations in which they should not be used. This is done by comparing methods and presenting a different approach to treating missing values on a sample dataset, showing how drastically the conclusions of predictive models can differ depending on the treatment method applied.