Deciding how to impute missing values in a dataset before feeding it to a classifier is often difficult. Imputing with a wrong value can significantly skew the data and produce a poor classifier. The ideal solution is a clean dataset without any NULL values, but getting one may mean throwing out most of the data. There are no perfect workarounds: classifiers are built from the information in the data, and where that information is missing the resulting classifier suffers.
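Before choosing a strategy, it helps to measure how much data would be lost by simply dropping incomplete rows. The following is a minimal base R sketch; the data frame `df` and its values are hypothetical and only serve to illustrate the check.

```r
# Toy data frame (hypothetical values) with NAs scattered across columns
df <- data.frame(
  age    = c(25, NA, 31, 47, NA, 52),
  income = c(40, 55, NA, 80, 62, NA),
  label  = c("a", "b", "a", "b", "a", "b")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(df))

# Fraction of rows that survive if every row containing an NA is dropped
complete_fraction <- mean(complete.cases(df))

print(na_per_column)
print(complete_fraction)
```

In this toy case only a third of the rows are complete, even though each column is missing just two values, which is why dropping rows is often too costly.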
Another alternative is to use tree-based models (such as random forest), which are less susceptible to NULL values than models such as linear regression.
If there are missing values in the target variable, those rows should be discarded.
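Dropping rows with a missing target can be sketched in base R as follows. The data frame and the column names `x` and `y` are hypothetical; `y` plays the role of the target variable.

```r
# Hypothetical data: x is a predictor, y is the target
df <- data.frame(
  x = c(1.2, NA, 3.1, 4.8, 5.0),
  y = c("yes", "no", NA, "yes", NA)
)

# Keep only the rows where the target is observed
labelled <- df[!is.na(df$y), , drop = FALSE]

print(labelled)
```

Note that missing values in the predictor `x` are left in place here; only rows without a target are removed, and the remaining NAs are handled by imputation.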
Median value Imputation
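In median imputation, each missing value is replaced with the median (the middle value) of the observed values in that column. The median is robust to outliers, but filling many gaps with it pulls the data towards the middle and reduces its spread.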
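A minimal base R sketch of median imputation on a single column, using made-up values that include one outlier:

```r
# Hypothetical column with two missing values and one outlier (100)
x <- c(2, 4, NA, 100, 6, NA)

# Median of the observed values only
med <- median(x, na.rm = TRUE)

# Replace each NA with that median
x_imputed <- ifelse(is.na(x), med, x)

print(med)
print(x_imputed)
```

The observed values are 2, 4, 6 and 100, so the median is 5, whereas the mean of the same values would be 28. This is the sense in which the median is less affected by outliers.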
Nearest Neighbour Imputation (knn)
Here each NA is set to the average of that variable over the k nearest observations, where nearness is measured on the other variables of the observation.
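To make the idea concrete, here is a simplified base R sketch of nearest neighbour imputation for one column. It is not how any particular package implements it; the function name `knn_impute` and the toy data are hypothetical, and the sketch assumes the predictor columns are numeric and contain no NAs themselves.

```r
# Simplified kNN imputation for a single column (illustrative only)
knn_impute <- function(df, target, k = 2) {
  predictors <- setdiff(names(df), target)
  # Standardise predictors so no single variable dominates the distance
  z <- scale(df[, predictors, drop = FALSE])
  missing  <- which(is.na(df[[target]]))
  observed <- which(!is.na(df[[target]]))
  for (i in missing) {
    # Euclidean distance from row i to every row with an observed target
    diffs <- sweep(z[observed, , drop = FALSE], 2, z[i, ])
    d <- sqrt(rowSums(diffs^2))
    # Average the target over the k closest observed rows
    nearest <- observed[order(d)[seq_len(k)]]
    df[i, target] <- mean(df[nearest, target])
  }
  df
}

# Hypothetical data: the last weight is missing
people <- data.frame(
  height = c(150, 152, 180, 182, 151),
  weight = c(50, 52, 80, 82, NA)
)

result <- knn_impute(people, target = "weight", k = 2)
print(result)
```

The person with height 151 is closest to the two people with heights 150 and 152, so the missing weight becomes the average of 50 and 52.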
Zero value imputation
Generally not recommended, as it fills the gaps with values that may be wrong and skews the data. In certain situations, however, it is the right choice, for example when a missing entry genuinely means zero.
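The two hypothetical columns below illustrate when zero imputation may or may not make sense. Both the variable names and the values are invented for the example.

```r
# Hypothetical case where zero is plausible:
# NA in a count column means no purchases were recorded
purchases <- c(3, NA, 1, NA, 7)
purchases_zero <- replace(purchases, is.na(purchases), 0)

# Hypothetical case where zero is wrong:
# a missing temperature reading does not mean 0 degrees
temperature <- c(21.5, NA, 19.0, 22.5)
temperature_zero <- replace(temperature, is.na(temperature), 0)

mean_observed <- mean(temperature, na.rm = TRUE)  # mean of observed readings
mean_zero     <- mean(temperature_zero)           # mean after zero imputation

print(purchases_zero)
print(c(mean_observed, mean_zero))
```

For the temperature column, zero imputation drags the mean from 21 down to 15.75, which is the kind of skew the text warns about.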
The decision to impute NAs with a particular value is often domain specific and contextual (check out this talk by Anand). The methods described here are general suggestions, and the caret package handles many of them with minimal code. In R, imputation can be handled during preprocessing before training the model, or during the training of the model itself.
library(caret)
library(RANN)
library(datasets)
head(mtcars)

# making some samples with NAs
set.seed(5)
index <- sample(nrow(mtcars), 0.2 * nrow(mtcars), replace = FALSE)
mtcars[index, "qsec"] <- NA
mtcars[index, "hp"] <- NA

# making test and train data
set.seed(555)
idx <- sample(nrow(mtcars), 0.7 * nrow(mtcars), replace = FALSE)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

# Calling preProcess to impute with k-Nearest Neighbours
# With "knnImpute", preProcess also centers and scales the data
preProcVal <- preProcess(train, method = c("knnImpute"))
# other imputations include "bagImpute" and "medianImpute"

# preProcess is called on new test data using predict
# to get imputed dataset
imputed_test <- predict(preProcVal, newdata = test)
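As a variation on the same pattern, the sketch below uses the "medianImpute" method mentioned above. It assumes the caret package is installed; the object names `cars_na`, `na_rows`, `medianProc` and `imputed` are hypothetical. Unlike "knnImpute", median imputation leaves the data on its original scale.

```r
library(caret)
library(datasets)

# Work on a copy of mtcars and blank out a few hp values
cars_na <- mtcars
set.seed(5)
na_rows <- sample(nrow(cars_na), 6)
cars_na[na_rows, "hp"] <- NA

# Learn the column medians, then apply them to fill the NAs
medianProc <- preProcess(cars_na, method = "medianImpute")
imputed <- predict(medianProc, newdata = cars_na)

# Each imputed hp should equal the median of the observed hp values
print(imputed[na_rows, "hp"])
print(median(cars_na$hp, na.rm = TRUE))
```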
Preprocessing while training a model
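# na.action = na.pass lets rows with NAs reach preProcess;
# the formula interface of train() fails on NAs by default
model <- train(mpg ~ ., data = train,
               method = "glm",
               preProcess = c("center", "scale", "knnImpute"),
               na.action = na.pass)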