Extract and load data directly from a tarball
This is sample code to extract a tarball (tar.gz) and load the data into a NumPy array. You can also load the file into a pandas DataFrame.
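A minimal sketch of the idea, not the post's own code: it builds a tiny tar.gz in memory (the member name data.csv and its contents are made up for illustration) and then reads that member straight into a NumPy array without unpacking it to disk.

```python
import io
import tarfile

import numpy as np

# Build a tiny tar.gz in memory so the sketch is self-contained.
# The file name and contents are hypothetical.
csv_bytes = b"1,2,3\n4,5,6\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)

# Read the member directly from the archive, without writing it to disk.
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    member = tar.extractfile("data.csv")            # binary file-like object
    text = io.TextIOWrapper(member, encoding="utf-8")
    data = np.loadtxt(text, delimiter=",")

print(data.shape)
```

For a real archive on disk, tarfile.open("archive.tar.gz", mode="r:gz") would replace the in-memory buffer, and the same file-like object could be handed to pandas.read_csv instead of np.loadtxt.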
Category: DataScience
Multiple sub plots using matplotlib in python
I keep forgetting how to plot multiple graphs in a single figure. Then I came across this wonderful blog and cheat sheet. Also check out chapter 4.08 of Jake VanderPlas's Python Data Science Handbook for a more comprehensive guide.
get() method for python dictionary
The get() method is useful when reading a value from a dictionary by key. It returns the value associated with the key if the key is present; otherwise it returns a default value that you supply (for example, -1).
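A short sketch of that behaviour. The dictionary and its keys are made up for illustration.

```python
# Hypothetical dictionary, for illustration only.
counts = {"apple": 3, "banana": 5}

present = counts.get("apple", -1)    # key exists, so its value is returned
missing = counts.get("cherry", -1)   # key is absent, so the default -1 is returned
no_default = counts.get("cherry")    # without a default, get() returns None

print(present, missing, no_default)
```

Unlike counts["cherry"], which raises a KeyError, get() never raises for a missing key.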
Running Jupyter Notebook on a remote server
With only a command-line interface, it is often hard to quickly scan through the contents of a server. You can work around this by running JupyterLab (or Jupyter Notebook) on the server and accessing it from a client machine. I presume you have already installed JupyterLab (or Jupyter Notebook) on the server. JupyterLab is the better option, as it comes with a file navigator, a spreadsheet viewer (faster than Excel, reminds me of Sublime Text) and an image viewer. Check out this video for the latest feature updates in JupyterLab.
Handling missing values in a Dataset before training
Deciding how to impute missing values in a dataset before feeding it to a classifier is often difficult. Imputing with the wrong value can significantly skew the data and produce a poor classifier. The ideal solution is a clean dataset without any NULL values, but getting there may mean throwing out most of the data. There is no perfect workaround, because classifiers are built from the information in the data, and missing information leads to a worse classifier.
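To make the trade-off concrete, here is a small sketch with pandas. The column names and values are invented, and median imputation is only one possible choice, not necessarily the one the full post recommends.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 58.0],
})

# Option 1: drop every row that has a missing value (loses half the rows here).
dropped = df.dropna()

# Option 2: fill each missing value with the median of its column.
imputed = df.fillna(df.median())

print(len(df), len(dropped))
print(imputed)
```

Dropping rows keeps only observed values but shrinks the dataset, while imputing keeps every row at the cost of introducing values that were never measured.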
Creating ROC curve in R
Although several packages can plot a ROC curve, caTools seems to be the most convenient.
library(caTools)
# Predict on test: p
p <- predict(model, test, type = "response")
# Create ROC curve
colAUC(p, test[["Class"]], plotROC = TRUE)
Extracting top feature names in order for a trained scikit-learn classifier
This post describes how to extract the top feature names from a supervised learning classifier in sklearn.
Note: The training data X_train and y_train are pandas DataFrames with column names.
After fitting a classifier clf, the feature scores can be accessed; the attribute varies with the classifier used.
- For logistic regression, the score is the magnitude of the coefficients, which are available as clf.coef_
- For a decision tree, it is clf.feature_importances_
Get the indices that sort the scores with np.argsort(), reverse them for descending order, and use them to index the column names in X_train.columns.
# For a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
importances = clf.feature_importances_

# Print the top 5 features of the fitted classifier
print(X_train.columns[np.argsort(importances)[::-1]][:5])

# Or print every (feature, importance) pair, highest importance first
print(sorted(zip(X_train.columns, importances), key=lambda x: x[1])[::-1])
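The ranking step can also be seen in isolation. The sketch below uses made-up feature names and coefficients (standing in for a row of clf.coef_ from a logistic regression) and takes absolute values first, so that large negative coefficients count as important.

```python
import numpy as np

# Hypothetical feature names and coefficients, for illustration only.
feature_names = np.array(["age", "income", "height", "weight"])
coefficients = np.array([0.2, -1.5, 0.05, 0.9])

# Rank by magnitude, ignoring the sign of each coefficient.
scores = np.abs(coefficients)

# argsort returns ascending order, so reverse it for highest score first.
order = np.argsort(scores)[::-1]

top_features = feature_names[order][:2]
print(top_features)
```

Comparing coefficient magnitudes like this is only meaningful when the features are on comparable scales, for example after standardization.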
Horizontal bar chart with 3 encodings using matplotlib
The chart shows the gender difference in school performance across different states of India. Full project report
- 1st Encoding (lines): median values of performance for boys and girls
- 2nd Encoding (colored bars): difference in median values
- 3rd Encoding (circle size): count of values used to find median
Finding d3 bl.ocks
If you use d3 for visualization, you probably already know what a bl.ock is. Apart from Mike Bostock's bl.ocks, it is often hard to find good examples that you can use and build on. Since most d3 examples are new creative ventures, they are hard to classify and therefore hard to index. These are some websites where you can browse users' bl.ocks:
- http://blockbuilder.org/search (Highly recommended)
- http://bl.ocks.org/enjalot/raw/211bd42857358a60a936/
- http://bl.ocksplorer.org/ (often not very helpful)
Notes on Data Visualization – D3.js
Below are some links on data visualization, focused on D3.js. These links are compiled from the Data Visualization course on Udacity.
Serialize python object to JSON
This is a wonderful article on how to serialize a Python object into JSON.
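As a rough illustration of the topic (not taken from the linked article), the standard json module can serialize a custom object if json.dumps is given a default function. The Point class and the to_serializable helper below are hypothetical names used only for this sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Point:
    """Hypothetical object that json cannot serialize on its own."""
    x: int
    y: int


def to_serializable(obj):
    """Called by json.dumps for any object it does not know how to encode."""
    if isinstance(obj, Point):
        return asdict(obj)
    raise TypeError(f"Cannot serialize object of type {type(obj).__name__}")


text = json.dumps({"origin": Point(0, 0)}, default=to_serializable)
print(text)
```

Raising TypeError for unknown types keeps the usual json.dumps behaviour for anything the helper does not handle.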