Handling missing values in a Dataset before training

Deciding how to impute missing values in a dataset before feeding it to a classifier is often difficult. Imputing with a wrong value can significantly skew the data and result in a poor classifier. The ideal solution is a clean dataset without any NULL values, but getting there might mean throwing out most of the data. There are no perfect workarounds: classifiers are built from the information in the data, and a lack of it results in a poor classifier.

Another alternative is to use tree-based models (such as random forests), which tend to be less susceptible to NULL values than linear regression or logistic regression, although whether missing values are accepted directly depends on the implementation.
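As a rough illustration of a tree model coping with missing predictors, here is a minimal sketch. It uses a single rpart decision tree rather than a random forest, because rpart handles NAs in predictors through surrogate splits. The injected NAs and the seed are arbitrary choices for the example.

```r
library(rpart)

# Copy mtcars and blank out hp for six randomly chosen rows (illustrative only)
df <- mtcars
set.seed(1)
df[sample(nrow(df), 6), "hp"] <- NA

# rpart keeps rows with missing predictors and falls back on surrogate splits
fit <- rpart(mpg ~ ., data = df, method = "anova")

# Predictions are produced for every row, including those with missing hp
preds <- predict(fit, newdata = df)
anyNA(preds)
```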

If there are missing values in the target variable, those rows should be discarded.
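In base R this can be as simple as the sketch below. The data frame and its column names (x, y) are made up for the example.

```r
# Toy data: y is the target, x is a predictor (both hypothetical)
df <- data.frame(x = c(1.2, 3.4, NA, 5.6),
                 y = c(0, NA, 1, 1))

# Keep only the rows where the target is observed;
# missing predictors (row 3) are left for a later imputation step
df_labeled <- df[!is.na(df$y), ]
nrow(df_labeled)  # 3
```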

Median value Imputation

In median imputation, missing data is replaced with the median (the middle value) of the observed values for that variable. This can skew the data towards a middle value.
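A minimal base R sketch of the idea, using a made-up vector:

```r
# Toy vector with two missing entries and one large outlier
x <- c(4, NA, 10, 6, NA, 100)

# Median of the observed values (4, 6, 10, 100) is 8
med <- median(x, na.rm = TRUE)

# Replace each NA with that median
x_imputed <- ifelse(is.na(x), med, x)
x_imputed  # 4 8 10 6 8 100
```

Note how both imputed entries land on the same middle value, which is exactly the pull towards the centre described above.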

Nearest Neighbour Imputation (knn)

Here we set the NA values to the average of the values of the nearest neighbours of that observation, i.e. the observations most similar to it on the other variables.
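To make the mechanics concrete, here is a hand-rolled sketch for a single missing value. The helper knn_impute_one is hypothetical (it is not part of any package) and assumes the other columns have no NAs. In practice a library routine would be used instead.

```r
# Hypothetical helper: impute the missing `target_col` value in row `i`
# as the mean over the k nearest rows that have `target_col` observed.
# Distances are Euclidean on the remaining columns, after scaling.
knn_impute_one <- function(df, i, target_col, k = 3) {
  predictors <- setdiff(names(df), target_col)
  complete   <- which(!is.na(df[[target_col]]))
  scaled     <- scale(df[, predictors, drop = FALSE])
  # Difference between every complete row and row i, column by column
  diffs   <- sweep(scaled[complete, , drop = FALSE], 2, scaled[i, ])
  dists   <- sqrt(rowSums(diffs^2))
  nearest <- complete[order(dists)[seq_len(k)]]
  mean(df[[target_col]][nearest])
}

df <- mtcars[, c("mpg", "wt", "hp")]
df[1, "hp"] <- NA
imputed_hp <- knn_impute_one(df, i = 1, target_col = "hp", k = 3)
imputed_hp
```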

Zero value imputation

Not recommended, as it can skew the data with possibly wrong values. But in certain situations this might be the right choice.
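One situation where zero can be reasonable is a count column where a missing entry really means "nothing recorded". The sketch below assumes exactly that about a made-up n_returns column. Whether the assumption holds is a domain question.

```r
# Hypothetical data: NA in n_returns is assumed to mean "no returns"
orders <- data.frame(customer  = c("a", "b", "c"),
                     n_returns = c(2, NA, 1))

# Replace the missing counts with zero
orders$n_returns[is.na(orders$n_returns)] <- 0
orders$n_returns  # 2 0 1
```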

Imputation

The decision to impute NAs with a particular value is often domain-specific and contextual (check out this talk by Anand). What follows are general suggestions, and the caret package handles a lot of these with minimal code. In R, imputation can be handled during preprocessing before training the model, or during the training of the model.

Using preProcess

library(caret)
library(RANN)
library(datasets)

head(mtcars)

# making some samples with NAs
# (sample sizes are based on mtcars, not the unrelated cars dataset)
set.seed(5)
index <- sample(nrow(mtcars), floor(0.2 * nrow(mtcars)), replace = FALSE)
mtcars[index,"qsec"] <- NA
mtcars[index,"hp"] <- NA

# making test and train data
set.seed(555)
idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)), replace = FALSE)
train <- mtcars[idx,]
test <- mtcars[-idx,]

# Calling preProcess to impute with k-Nearest Neighbours
# "knnImpute" automatically centers and scales the data as part of imputation

preProcVal <- preProcess(train, method = c("knnImpute"))

# other imputations include "bagImpute" and "medianImpute"

# predict() applies the fitted preProcess object to the test data
# and returns the imputed (centered and scaled) dataset

imputed_test <- predict(preProcVal, newdata = test)
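For comparison, here is a small sketch of the "medianImpute" option on its own. The three-column subset and the rows chosen for NAs are arbitrary. Unlike "knnImpute", this method leaves the observed values on their original scale.

```r
library(caret)

# Toy data with a few missing hp values (illustrative only)
df <- mtcars[, c("mpg", "hp", "qsec")]
df[c(2, 5, 9), "hp"] <- NA

# medianImpute fills each column's NAs with that column's median
pp_median <- preProcess(df, method = "medianImpute")
df_median <- predict(pp_median, newdata = df)

# No NAs remain, and the filled-in values equal the observed median of hp
anyNA(df_median)
df_median[c(2, 5, 9), "hp"]
```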

Preprocessing while training a model

# na.action = na.pass lets rows with NAs reach the preProcess step
model <- train(mpg~., data=train, method="glm",
               preProcess=c("center", "scale", "knnImpute"),
               na.action = na.pass)
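A model trained this way carries its preprocessing with it, so new data with NAs can be passed straight to predict(). The sketch below is self-contained and makes a few assumptions: a smaller set of predictors than above, NAs injected only into hp, the names train_df and test_df, and na.action = na.pass on both calls so that rows with missing values are not rejected or dropped. The RANN package is assumed to be installed, as in the example above.

```r
library(caret)
library(RANN)

# Illustrative data: a few mtcars columns with NAs injected into hp
set.seed(5)
df <- mtcars[, c("mpg", "wt", "hp", "qsec")]
df[sample(nrow(df), 6), "hp"] <- NA

set.seed(555)
idx      <- sample(nrow(df), floor(0.7 * nrow(df)))
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# Imputation is fitted on the training data as part of model training
model <- train(mpg ~ ., data = train_df, method = "glm",
               preProcess = c("center", "scale", "knnImpute"),
               na.action = na.pass)

# The same centering, scaling and imputation is applied to the new data
preds <- predict(model, newdata = test_df, na.action = na.pass)
length(preds) == nrow(test_df)
```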

Reference

  1. Applied Predictive Modelling
  2. Source Code for Caret
