Handling missing values in a Dataset before training

Deciding how to impute missing values in a dataset before feeding it to a classifier is often difficult. Imputing with a poorly chosen value can significantly skew the data and produce a misleading classifier. The ideal solution is a clean dataset without any NULL values, but getting there by dropping incomplete rows may mean throwing out most of the data. There is no perfect workaround: classifiers are built from the information in the data, and missing information degrades the resulting classifier.
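As a rough illustration of the trade-off described above, the sketch below compares dropping incomplete rows with filling gaps by the column median. The toy DataFrame and its column names are hypothetical, and median imputation is only one of several possible strategies, not necessarily the one this post recommends.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries (not from the original post)
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(len(df), len(dropped))  # rows before and after dropping
print(imputed)
```

Dropping rows keeps only observed values but shrinks the dataset, while imputation keeps every row at the cost of introducing values that were never observed.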

Extracting top feature names for a trained classifier in order

This post describes how to extract the top feature names from a supervised learning classifier in sklearn.

Note: The training data X_train and y_train are pandas objects, and X_train is a DataFrame with column names.

After fitting a classifier clf, the feature scores can be accessed through an attribute that depends on the classifier used.

  • For example, for logistic regression the score is the magnitude of the coefficients; the signed coefficients are available as clf.coef_, so take np.abs(clf.coef_)
  • For DecisionTreeClassifier, it is clf.feature_importances_

np.argsort() returns indices in ascending order of score, so reverse the result to get descending order and use it to index X_train.columns.

# For Decision Tree classifier

from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

importances = clf.feature_importances_

# Print the top 5 features of the fitted classifier
print(X_train.columns[np.argsort(importances)[::-1]][:5])
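The same idea can be sketched for logistic regression, where the score is the magnitude of each coefficient. The example below uses the iris dataset restricted to two classes as stand-in data (an assumption, since the post does not say what X_train contains), and it assumes the features are on comparable scales; otherwise coefficient magnitudes are not directly comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): iris restricted to two classes for a binary problem
iris = load_iris(as_frame=True)
mask = iris.target < 2
X_train = iris.data[mask]
y_train = iris.target[mask]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem; take magnitudes
scores = np.abs(clf.coef_[0])

# argsort is ascending, so reverse it to rank from largest to smallest
order = np.argsort(scores)[::-1]
top_features = X_train.columns[order][:3]
print(list(top_features))
```

For a multiclass model, clf.coef_ has one row per class, so the rows would need to be combined (for example by averaging magnitudes) before ranking.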

Horizontal bar chart with 3 encodings using matplotlib


The chart shows the gender difference in school performance across the states of India. Full project report

Finding d3 bl.ocks

If you use d3 for visualization, you probably already know what a bl.ock is. Apart from Mike Bostock’s bl.ocks, it is often hard to find good examples that you can use and build on. Since most d3 examples are new creative ventures, they are hard to classify and therefore hard to index. These are some websites where you can browse users' bl.ocks:


  1. https://medium.com/@enjalot/searching-for-examples-2c0f75709c1a
