How to impute missing values in a dataset before feeding to a classifier is often a difficult decision. Imputing with a wrong value can significantly skew the data and result in wrong classifier. The ideal solution is to get a clean data set without any NULL values but then, we might have to throw out most data. There are no perfect workarounds as most classifiers are built based on the information from data and lack thereof results in the wrong classifier. Continue reading “Handling missing values in a Dataset before training”
Post describes how to extract top feature names from a supervised learning classifier in sklearn.
Note: The training dataset
y_train are pandas dataframe with column names.
After fitting/training a classifier
clf, the scoring for features can be accessed (method varies depending on the classifier used).
- For example, for logistic regression it is the magnitude of the coefficients and can be accessed as
- For DecisionTree, it is
Sort the scores in descending order using
np.argsort() and pass it as an index to the column names of
# For Decision Tree classifier from sklearn.tree import DecisionTreeClassifier import numpy as np clf = DecisionTreeClassifier(random_state=42) clf.fit(X_train, y_train) importances = clf.feature_importances_ # printing top 5 features of fitted classifier print (X_train.columns[(np.argsort(importances)[::-1])][:5]) OR print(sorted(zip(X_train.columns,importances),key=lambda x: x)[::-1]