
Saturday, May 21, 2016

An intro to Regression Analysis with Decision Trees

It's been a while since the last post on this blog, but The Glowing Python is still active and strong! I've decided to publish some of my posts on the Cambridge Coding Academy blog. Here are the links to a series of two posts about Regression Analysis with Decision Trees. In this introduction to Regression Analysis we will see how to use scikit-learn to train Decision Trees to solve a specific problem: "How to predict the number of bikes hired in a bike sharing system in a given day?"

In the first post, we will see how to train a simple Decision Tree to exploit the relation between temperature and bikes hired; this tree will then be analysed to explain the result of the training process and to gain insights about the data. In the second post, we will see how to learn more complex decision trees and how to assess the accuracy of the predictions using cross validation. To give a taste of what's covered there, here's a minimal sketch of the idea using scikit-learn's DecisionTreeRegressor (the temperatures and hire counts below are made up for illustration; the actual posts work on the bike sharing dataset):
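from sklearn.tree import DecisionTreeRegressor

# hypothetical data: daily mean temperature (Celsius) and bikes hired that day
temperature = [[3.2], [7.1], [12.5], [18.0], [22.4], [26.8]]
bikes_hired = [120, 340, 900, 1450, 1800, 1600]

# a shallow tree keeps the learned rules easy to inspect
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(temperature, bikes_hired)

# predict the number of bikes hired on a 15 degrees day
print tree.predict([[15.0]])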

Here's a sneak peek of the figures that we will generate:

Tuesday, July 23, 2013

Combining Scikit-Learn and NLTK

In Chapter 6 of the book Natural Language Processing with Python there is a nice example which shows how to train and test a Naive Bayes classifier that can identify the dialogue act types of instant messages. The classifier is trained on the NPS Chat Corpus, which consists of over 10,000 posts from instant messaging sessions labeled with one of 15 dialogue act types.
The implementation of the Naive Bayes classifier used in the book is the one provided in the NLTK library. Here we will see how to use the Support Vector Machine (SVM) classifier implemented in Scikit-Learn without touching the feature representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):
import nltk

def dialogue_act_features(sentence):
    """
        Extracts a set of features from a message.
    """
    features = {}
    tokens = nltk.word_tokenize(sentence)
    for t in tokens:
        features['contains(%s)' % t.lower()] = True    
    return features

# data structure representing the XML annotation for each post
posts = nltk.corpus.nps_chat.xml_posts() 
# label set
cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis', 
'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
featuresets = [] # list of tuples of the form (features, label)
for post in posts: # applying the feature extractor to each post
    # post.get('class') is the label of the current post
    featuresets.append((dialogue_act_features(post.text),
                        cls_set.index(post.get('class'))))
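To get a feel for this representation, here's a quick check (my addition, not from the book) of what the extractor returns for a short message:
print dialogue_act_features('Hi everyone!')
# prints something like:
# {'contains(hi)': True, 'contains(everyone)': True, 'contains(!)': True}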
After the feature extraction we can split the data we obtained into a training and a test set:
from random import shuffle
shuffle(featuresets)
size = int(len(featuresets) * .1) # 10% is used for the test set
train = featuresets[size:]
test = featuresets[:size]
Now we can instantiate the classifier using the scikit-learn wrapper provided by NLTK and train it:
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
# SVM with a Linear Kernel and default parameters 
classif = SklearnClassifier(LinearSVC())
classif.train(train)
In order to use the batch_classify method provided by the NLTK wrapper we have to organize the test set into two lists: the first one with the feature dictionaries and the second one with the target labels:
test_skl = []
t_test_skl = []
for d in test:
    test_skl.append(d[0])
    t_test_skl.append(d[1])
Then we can run the classifier on the test set and print a full report of its performances:
# run the classifier on the test set
# (in recent NLTK versions batch_classify has been renamed classify_many)
p = classif.batch_classify(test_skl)
from sklearn.metrics import classification_report
# getting a full report
# the explicit labels argument keeps the rows aligned with target_names
print classification_report(t_test_skl, p, labels=range(len(cls_set)),
                            target_names=cls_set)
The report will look like this:
              precision    recall  f1-score   support

    Emotion       0.83      0.85      0.84       101
 ynQuestion       0.78      0.78      0.78        58
    yAnswer       0.40      0.40      0.40         5
  Continuer       0.33      0.15      0.21        13
 whQuestion       0.78      0.72      0.75        50
     System       0.99      0.98      0.98       259
     Accept       0.80      0.59      0.68        27
    Clarify       0.00      0.00      0.00         6
   Emphasis       0.59      0.59      0.59        17
    nAnswer       0.73      0.80      0.76        10
      Greet       0.94      0.91      0.93       160
  Statement       0.76      0.86      0.81       311
     Reject       0.57      0.31      0.40        13
        Bye       0.94      0.68      0.79        25
      Other       0.00      0.00      0.00         1

avg / total       0.84      0.85      0.84      1056
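For a closer look at which classes get mixed up with each other, one can also print a confusion matrix; this is an addition of mine along the same lines, not part of the original post:

from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predicted labels,
# both ordered as in cls_set
print confusion_matrix(t_test_skl, p, labels=range(len(cls_set)))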

Monday, May 14, 2012

Manifold learning on handwritten digits with Isomap

The Isomap algorithm is an approach to manifold learning. Isomap seeks a lower dimensional embedding of a set of high dimensional data points by estimating the intrinsic geometry of the data manifold from a rough estimate of each point's neighbors: it builds a neighborhood graph, approximates the geodesic distance between points with shortest paths on that graph, and finally applies classical multidimensional scaling to those distances.
The scikit-learn library provides a great implementation of the Isomap algorithm and a dataset of handwritten digits. In this post we'll see how to load the dataset and how to compute an embedding of the data on a two-dimensional space.
Let's load the dataset and show some samples:
from pylab import scatter,text,show,cm,figure
from pylab import subplot,imshow,NullLocator
from sklearn import manifold, datasets

# load the digits dataset
# 901 samples, about 180 samples per class 
# the digits represented are 0, 1, 2, 3 and 4
digits = datasets.load_digits(n_class=5)
X = digits.data
color = digits.target

# show some digits
figure(1)
for i in range(36):
    ax = subplot(6, 6, i+1) # subplot indices start at 1
    ax.xaxis.set_major_locator(NullLocator()) # remove ticks
    ax.yaxis.set_major_locator(NullLocator())
    imshow(digits.images[i], cmap=cm.gray_r)
The result should be as follows:


Now X is a matrix where each row is a vector that represents a digit. Each vector has 64 elements and has been obtained by spatially resampling the images above. We can apply the Isomap algorithm on this data and plot the result with the following lines:
# running Isomap
# 5 neighbors will be considered and the data reduced to a 2D space
Y = manifold.Isomap(n_neighbors=5, n_components=2).fit_transform(X)

# plotting the result
figure(2)
scatter(Y[:,0], Y[:,1], c='k', alpha=0.3, s=10)
for i in range(Y.shape[0]):
    text(Y[i, 0], Y[i, 1], str(color[i]),
         color=cm.Dark2(color[i] / 5.),
         fontdict={'weight': 'bold', 'size': 11})
show()
The new embedding for the data will be as follows:


We computed a two-dimensional version of each pattern in the dataset, and it's easy to see that the separation between the five classes in the new embedding is pretty neat.
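As a rough quantitative check of this impression (again my addition, not in the original post), we can hold out part of the embedded points and see how well a simple nearest neighbors classifier predicts the digit from the two coordinates alone:

from sklearn.neighbors import KNeighborsClassifier

# crude split: the last 100 embedded points are held out for testing
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(Y[:-100], color[:-100])
print knn.score(Y[-100:], color[-100:]) # fraction of digits correctly recognized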