Extracting top feature names for a trained classifier in order

This post describes how to extract the top feature names from a supervised learning classifier in sklearn.

Note: The training datasets X_train and y_train are pandas DataFrames with column names.

After fitting a classifier clf, the feature scores can be read from an attribute of the fitted model; which attribute depends on the classifier used.

  • For example, for logistic regression the score is the magnitude (absolute value) of the coefficients, which are stored in clf.coef_ with shape (1, n_features) for a binary problem
  • For DecisionTree, it is clf.feature_importances_

np.argsort() returns the indices that sort the scores in ascending order, so reverse the result to get descending order and use it to index the column names in X_train.columns.


# For Decision Tree classifier

from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

importances = clf.feature_importances_

# print the top 5 features of the fitted classifier
print(X_train.columns[np.argsort(importances)[::-1]][:5])
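
For a linear model the same ranking works on coefficient magnitudes instead of feature_importances_. The following is a minimal sketch, not the only way to do it: the helper name top_feature_names, the feature names, and the coefficient values are made up for illustration, and averaging the magnitudes across classes for a 2-D coef_ is an assumption of this sketch.

```python
import numpy as np

def top_feature_names(scores, names, k=5):
    """Return the k names with the largest absolute scores, best first."""
    scores = np.abs(np.asarray(scores, dtype=float))
    if scores.ndim == 2:
        # coef_ has shape (n_classes, n_features); average the magnitudes
        # over classes (an assumption made for this sketch)
        scores = scores.mean(axis=0)
    # argsort is ascending, so reverse it to get the largest scores first
    order = np.argsort(scores)[::-1]
    return [names[i] for i in order[:k]]

# made-up values standing in for clf.coef_ of a binary logistic regression
coef = np.array([[0.2, -1.5, 0.7, 0.05]])
names = ["age", "income", "height", "weight"]

top2 = top_feature_names(coef, names, k=2)
print(top2)
```

Coefficient magnitudes are only comparable across features when the features are on a similar scale, so standardizing X_train before fitting is usually a good idea.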

Horizontal bar chart with 3 encodings using matplotlib

The chart shows the gender difference in school performance across the states of India. Full project report

View python source code inside packages

Often we want to know how a function is written in an imported package. This post explains how to examine the source code of a function/class.

To find out where a package is installed:

[package_name].__file__

For the package pandas:

import pandas
pandas.__file__

To examine the source code of a given function or class, import the package inspect.

import inspect as insp
print(insp.getsourcefile(pandas.DataFrame))  # prints the path to the source file

print(insp.getsource(pandas.DataFrame))  # prints the source code
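
The inspect module has three closely related helpers: getsourcefile returns the path, getsourcelines returns a tuple of (list of lines, starting line number), and getsource returns the source as a single string. Here is a small sketch of the difference, using textwrap.dedent from the standard library as an arbitrary target so that it runs without pandas installed.

```python
import inspect
import textwrap

# path of the file that defines the function
path = inspect.getsourcefile(textwrap.dedent)

# (list of source lines, line number where the definition starts)
lines, start = inspect.getsourcelines(textwrap.dedent)

# the same source as one string
source = inspect.getsource(textwrap.dedent)

print(path)
print(start)
print("".join(lines[:3]))  # first three lines of the definition
```

These helpers only work for objects defined in Python source files; for built-ins implemented in C they raise a TypeError.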

Documentation for the inspect module can be found here.

Viewing the source code from IPython Notebook

Append ? to the function name inside an IPython Notebook cell to view its docstring, and ?? to view the full source code.


import pandas

pandas.DataFrame? # shows the docstring

pandas.DataFrame?? # shows the source code and docstring

Return top “N” elements from an array

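top_n returns a boolean mask such as [True, False, True, False, False, ...] with True at the positions of the top n values. Indexing an array with the mask returns those values. Because the ranking is dense, tied values share a rank, so the mask can mark more than n positions.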

import numpy as np
from scipy.stats import rankdata

def top_n(list_array, n=1):
  """
  Returns a boolean mask with True for the n greatest (distinct) values
  """
  np_array = np.array(list_array)
  # dense ranks of the flattened values: ties share a rank, lowest value = 1
  r = rankdata(np_array.ravel(), method="dense")
  # flip the ranks so that the highest value has rank 1
  r = (r.max() + 1) - r
  # mark every position whose rank is within the top n
  mask = r <= n
  return mask.reshape(np_array.shape)

boolean_filter returns the items of a list at the positions where the boolean mask is True.

def boolean_filter(b_list, boolean):
  """
  This function returns values in b_list where the boolean is true
  """
  return [item for i, item in enumerate(b_list) if boolean[i]]
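
To see how the mask and the filter fit together, here is a self-contained sketch on made-up values. So that it runs without scipy, it builds the dense ranking with np.unique rather than rankdata; the name top_n_mask is hypothetical and stands in for top_n above.

```python
import numpy as np

def top_n_mask(values, n=1):
    """Boolean mask that is True for the n greatest distinct values."""
    arr = np.asarray(values)
    # return_inverse gives, for each element, the index of its value among
    # the sorted unique values, which is a 0-based dense rank
    uniq, inverse = np.unique(arr.ravel(), return_inverse=True)
    mask = inverse >= len(uniq) - n
    return mask.reshape(arr.shape)

def boolean_filter(b_list, boolean):
    """Return the items of b_list at positions where boolean is True."""
    return [item for i, item in enumerate(b_list) if boolean[i]]

values = np.array([[3, 9, 1],
                   [9, 4, 7]])
mask = top_n_mask(values, n=2)

# a NumPy array can be indexed with the mask directly
top_values = values[mask]

# a plain list needs the filter function and a flat mask
flat_list = [3, 9, 1, 9, 4, 7]
filtered = boolean_filter(flat_list, mask.ravel())

print(mask)
print(top_values)
print(filtered)
```

The two greatest distinct values here are 9 and 7, and since 9 appears twice the mask marks three positions.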