Extracting top feature names for a trained classifier in order

This post describes how to extract the top feature names from a trained supervised learning classifier in sklearn.

Note: The training data X_train is assumed to be a pandas DataFrame with column names, and y_train holds the matching labels.

After fitting/training a classifier clf, the scoring for features can be accessed (method varies depending on the classifier used).

  • For example, for logistic regression the score is the magnitude of the coefficients. The signed coefficients are available as clf.coef_, so take their absolute value with np.abs()
  • For DecisionTree, it is clf.feature_importances_

np.argsort() returns the indices that sort the scores in ascending order, so reverse the result to get descending order and use it to index the column names in X_train.columns.


# For Decision Tree classifier

from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

importances = clf.feature_importances_

# print the names of the top 5 features of the fitted classifier, most important first
print(X_train.columns[np.argsort(importances)[::-1]][:5])
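
The same ranking step could be wrapped in a small helper that also covers coefficient-based models such as logistic regression. This is only a sketch: top_feature_names is a hypothetical name (not part of sklearn), and the coefficient values and feature names below are made up for illustration.

```python
import numpy as np

def top_feature_names(scores, names, k=5, use_magnitude=False):
    """Return the k names with the largest scores, best first."""
    # clf.coef_ is 2-D for a binary logistic regression, so flatten it
    scores = np.asarray(scores, dtype=float).ravel()
    if use_magnitude:
        # coefficients can be negative; rank them by absolute value
        scores = np.abs(scores)
    # argsort is ascending, so reverse it to get descending order
    order = np.argsort(scores)[::-1]
    return [names[i] for i in order[:k]]

# made-up stand-ins for clf.coef_ and list(X_train.columns)
coef = np.array([[0.2, -1.5, 0.7, 0.05]])
names = ["age", "income", "score", "height"]

top = top_feature_names(coef, names, k=2, use_magnitude=True)
print(top)  # ['income', 'score']
```

For a tree-based model, the same helper would be called with clf.feature_importances_ and use_magnitude left as False, since importances are never negative.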

Horizontal bar chart with 3 encodings using matplotlib

[Figure: horizontal bar chart]

The chart shows the gender difference in school performance across different states of India. Full project report

Starting a Web Server on Mac for D3.js

Running a webpage locally requires starting a web server, especially if the page uses JavaScript.

  1. Download/clone the repository from the GitHub link (or any other link) to a folder on your machine.
  2. Switch to the terminal, cd into the folder containing the downloaded files, and start the web server as follows (requires Python 3):
     python -m http.server 8070

    This will start a web server on port 8070.

  3. Open a web browser and go to: http://localhost:8070/index.html
  4. index.html is loaded by default. To view another page, replace index.html in the URL with [name].html.

This setup was used for this project.
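
The same server could also be started from a Python script instead of the command line. The following is a minimal sketch using only the standard library, assuming Python 3.7 or later; make_server and serve are hypothetical helper names, and port 8070 simply mirrors the command above.

```python
import functools
import http.server
import socketserver

def make_server(directory=".", port=8070):
    """Create (but do not start) a static file server for the given folder."""
    # serve files from `directory` rather than the current working directory
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return socketserver.TCPServer(("127.0.0.1", port), handler)

def serve(directory=".", port=8070):
    """Serve the folder until interrupted with Ctrl+C."""
    with make_server(directory, port) as httpd:
        print(f"Serving {directory} at http://localhost:{port}/")
        httpd.serve_forever()
```

Calling serve() from the folder with the downloaded files should behave much like the one-line command, with the page available at http://localhost:8070/index.html.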

 

Learning Git version control

Git has undoubtedly become the version control standard throughout industry, and knowing it is almost indispensable for collaboration. This post is a small starting point for anyone new.

The Git documentation can be intimidating and overwhelming for most beginners because of its many options and commands. In reality, most developers end up using only a handful of them.

Since the commands are always hard to remember, I keep this cheatsheet (Atlassian) posted on my desk.

Another nifty little command in the Mac terminal opens the graphical repository browser:


$ gitk

Check out these awesome websites, which teach you to use Git graphically.

Resources

  1. learngitbranching.js.org (highly recommended)
  2. Git-it (an excellent learn-by-doing, cross-platform project)
  3. https://www.atlassian.com/git/tutorials/learn-git-with-bitbucket-cloud
  4. https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging

Some useful links for beginners to get involved

  1. http://www.firsttimersonly.com/
  2. https://github.com/search?utf8=%E2%9C%93&q=label%3Afirst-timers-only+is%3Aopen&type=Issues&ref=searchresults
  3. Medium blog for first timers

Return top “N” elements from an array

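top_n returns a boolean mask such as [True, False, True, False, False, ...] with True at the positions of the top n values. Values are ranked densely, so tied values share a rank and are all included. The mask can then be used as an index into an array to keep only those values.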

import numpy as np
from scipy.stats import rankdata

def top_n(list_array, n=1):
  """
  Returns a boolean mask with "True" for the greatest "n" distinct values
  (dense ranking, so tied values share a rank)
  """
  np_array = np.array(list_array)
  # dense rank of the flattened values; the smallest value gets rank 1
  r = rankdata(np_array.ravel(), method="dense")
  # invert the ranks so that the highest value gets rank 1
  r = (r.max() + 1) - r
  # mark every position whose rank falls within the top n
  mask = r <= n
  return mask.reshape(np_array.shape)

boolean_filter returns a list of the items whose corresponding mask entry is True.

def boolean_filter(b_list, boolean):
  """
  This function returns the values in b_list where the matching entry of boolean is true
  """
  # pair each item with its flag and keep the items whose flag is truthy
  return [item for item, keep in zip(b_list, boolean) if keep]
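
As a rough end-to-end illustration of the mask-then-filter idea, here is an alternative sketch that needs only NumPy (no scipy). top_n_mask is a hypothetical name, and it swaps rank computation for a threshold on the distinct values, which should give the same "top n distinct values, ties included" behaviour. The scores are made up.

```python
import numpy as np

def top_n_mask(values, n=1):
    """Boolean mask that is True for the n largest distinct values."""
    arr = np.asarray(values)
    if n < 1 or arr.size == 0:
        # nothing to select
        return np.zeros(arr.shape, dtype=bool)
    # np.unique returns the distinct values sorted in ascending order
    distinct = np.unique(arr)
    # the n-th largest distinct value is the cut-off (or the minimum if n is too big)
    threshold = distinct[-n] if n <= distinct.size else distinct[0]
    return arr >= threshold

scores = [3, 9, 1, 9, 5]  # made-up data
mask = top_n_mask(scores, n=2)

# keep only the items whose mask entry is True
selected = [s for s, keep in zip(scores, mask) if keep]
print(mask.tolist())  # [False, True, False, True, True]
print(selected)       # [9, 9, 5]
```

Both occurrences of 9 are kept because ties share a rank, so the result can contain more than n items.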