
Tuesday, April 22, 2014

Parameters selection with Cross-Validation

Most pattern recognition techniques have one or more free parameters, and choosing them for a given classification problem is often not a trivial task. In real applications we only have access to a finite set of examples, usually smaller than we would like, and we need to test our model on samples not seen during the training process. A model that simply memorized the samples it has seen would score very well on them, but would definitely fail to predict unseen data. This situation is called overfitting, and to avoid it we need to apply an appropriate validation procedure to select the parameters. A tool that can help us solve this problem is Cross-Validation (CV). The idea behind CV is simple: the data are split into train and test sets several consecutive times, and the average of the prediction scores obtained on the different test sets is the evaluation of the classifier.
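Just to illustrate the idea, a minimal sketch with sklearn could look like the following (here clf, X and y stand for a generic classifier and a labeled dataset, not the objects defined later in this post):

from sklearn.cross_validation import cross_val_score

# evaluate the classifier with 10 consecutive train/test splits
# and summarize the scores obtained on the different splits
scores = cross_val_score(clf, X, y, cv=10)
print scores.mean(), scores.std()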
Let's see a simple example where the smoothing parameter of a Bayesian classifier is selected using the capabilities of the sklearn library.
To begin, we load one of the test datasets provided by sklearn (the same used here) and we hold out 33% of the samples for the final evaluation:
from sklearn.datasets import load_digits
data = load_digits()
from sklearn.cross_validation import train_test_split
X,X_test,y,y_test = train_test_split(data.data,data.target,
                                     test_size=.33,
                                     random_state=1899)
Now, we import the classifier we want to use (a Bernoulli Naive Bayes in this case), specify a set of values for the parameter we want to choose, and run a grid search:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
# test the model for alpha = 0.1, 0.2, ..., 1.0
parameters = [{'alpha':np.linspace(0.1,1,10)}]

from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(BernoulliNB(), parameters, cv=10, scoring='f1')
clf.fit(X,y) # running the grid search
The grid search has evaluated the classifier for each value specified for the parameter alpha using the CV. We can visualize the results as follows:
from pylab import subplot, plot, show

# mean score, standard deviation of the scores and alpha value
# for each parameter setting evaluated by the grid search
res = zip(*[(f1m, f1s.std(), p['alpha']) 
            for p, f1m, f1s in clf.grid_scores_])
subplot(2,1,1)
plot(res[2],res[0],'-o')
subplot(2,1,2)
plot(res[2],res[1],'-o')
show()

The plots above show the average score (top) and the standard deviation of the score (bottom) for each value of alpha used. Looking at the graphs, it seems plausible that a small alpha is a good choice.
We can also verify that the alpha value that gave the best results during the CV stage yields a similar score on the test set we held out at the beginning:
from sklearn.metrics import f1_score
print 'Best alpha in CV = %0.01f' % clf.best_params_['alpha']
final = f1_score(y_test,clf.best_estimator_.predict(X_test))
print 'F1-score on the final testset: %0.5f' % final
Best alpha in CV = 0.1
F1-score on the final testset: 0.85861
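To compare the two scores directly, we can also print the best mean score obtained during the CV stage (best_score_ is an attribute of the fitted GridSearchCV object):

print 'Best CV score = %0.5f' % clf.best_score_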

Monday, January 28, 2013

A toy Bloom Filter

A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false positive retrieval results are possible, but false negatives are not. In this post we will see a pure Python implementation of the Bloom Filter, and at the end we will see how to tune its parameters in order to minimize the number of false positive results.
Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length m, initially all set to 0, and then choose k independent hash functions, h1, h2, ..., hk, each with range [1, m]. When an element a is added to the set, the bits at positions h1(a), h2(a), ..., hk(a) in the bit vector are set to 1. Given a query element q, we can test whether it is in the set by looking at the bits at positions h1(q), h2(q), ..., hk(q): if any of these bits is 0 we report that q is not in the set, otherwise we report that q is. The thing we have to keep in mind is that in the latter case there remains some probability that q is actually not in the set, which leads to a false positive response.
The following class is a naive implementation of a Bloom Filter (note that this implementation is not meant to be suitable for production; it is made just to show how a Bloom Filter works and to study its behavior):
class Bloom:
 """ Bloom Filter """
 def __init__(self,m,k,hash_fun):
  """
   m, size of the vector
   k, number of hash functions to compute
   hash_fun, hash function to use
  """
  self.m = m
  # initialize the vector 
  # (attention a real implementation 
  #  should use an actual bit-array)
  self.vector = [0]*m
  self.k = k
  self.hash_fun = hash_fun
  self.data = {} # data structure to store the data
  self.false_positive = 0

 def insert(self,key,value):
  """ insert the pair (key,value) in the database """
  self.data[key] = value
  for i in range(self.k):
   self.vector[self.hash_fun(key+str(i)) % self.m] = 1

 def contains(self,key):
  """ check if key is cointained in the database
      using the filter mechanism """
  for i in range(self.k):
   if self.vector[self.hash_fun(key+str(i)) % self.m] == 0:
    return False # the key doesn't exist
  return True # the key can be in the data set

 def get(self,key):
  """ return the value associated with key,
      None if key is not in the database """
  if self.contains(key):
   try:
    return self.data[key] # actual lookup
   except KeyError:
    # the filter reported the key as present
    # but it is not in the database: false positive
    self.false_positive += 1
The usage of this filter is pretty easy: we initialize the data structure with a hash function, a value for k and the size of the bit vector, then we can start adding items as in this example:
import hashlib

def hash_f(x):
 h = hashlib.sha256(x) # we'll use sha256 just for this example
 return int(h.hexdigest(),base=16)

b = Bloom(100,10,hash_f)
b.insert('this is a key','this is a value')
print b.get('this is a key')
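We can also query a key that was never inserted; in most cases the filter will answer False right away (a quick check not in the original example; since false positives are possible, the answer is not guaranteed):

print b.contains('a key never inserted') # almost certainly False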
Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. After inserting n elements into a table of size m using k hash functions, the probability that a particular bit is still 0 is exactly

$$\left(1 - \frac{1}{m}\right)^{kn}$$

Hence, after the n insertions, the probability that a certain bit is 1 is

$$1 - \left(1 - \frac{1}{m}\right)^{kn} \approx 1 - e^{-kn/m}$$

A query for an element that is not in the set returns a false positive only if all of its k bits are set to 1, which happens with probability approximately $(1 - e^{-kn/m})^k$. So, for fixed parameters m and n, the value of k that minimizes this false positive probability is

$$k = \frac{m}{n} \ln 2$$
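As a quick numeric check of these formulas (these helpers are not part of the filter implementation above, just a small sketch of the math), we can compute the approximate false positive probability and the optimal k for given values of m and n:

from math import log, exp

def optimal_k(m, n):
 """ value of k that minimizes the false positive probability """
 return float(m) / n * log(2)

def fp_probability(m, n, k):
 """ approximate false positive probability for given m, n and k """
 return (1.0 - exp(-float(k) * n / m)) ** k

print optimal_k(10000, 1000)         # about 6.93
print fp_probability(10000, 1000, 7) # about 0.008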
With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of m, n and k, counting the percentage of false positives:
import random

def rand_data(n, chars):
 """ generate random strings using the characters in chars """
 return ''.join(random.choice(chars) for i in range(n))

def bloomTest(m,n,k):
 """ return the percentage of false positives """
 bloom = Bloom(m,k,hash_f)
 # generating random keys
 rand_keys = [rand_data(10,'abcde') for i in range(n)]
 # pushing the items into the data structure
 for rk in rand_keys:
  bloom.insert(rk,'data')
 # adding keys that have never been inserted into the filter
 rand_keys = rand_keys + [rand_data(10,'fghil') for i in range(n)]
 # performing a query for each element of the dataset
 for rk in rand_keys:
  bloom.get(rk)
 return float(bloom.false_positive)/n*100.0
If we fix m = 10000 and n = 1000, according to the equations above, the value of k which minimizes the false positive probability is around 6.9314. We can confirm this experimentally with the following test:
# testing the filter
m = 10000
n = 1000
k = range(1,64)
perc = [bloomTest(m,n,kk) for kk in k] # k is varying

# plotting the result of the test
from pylab import plot,show,xlabel,ylabel
plot(k,perc,'--ob',alpha=.7)
ylabel('false positive %')
xlabel('k')
show()
The result of the test should be as follows


Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.

Thursday, August 16, 2012

Kernel Density Estimation with scipy

This post continues the last one, where we have seen how to fit two types of distribution functions (Normal and Rayleigh). This time we will see how to use Kernel Density Estimation (KDE) to estimate the probability density function. KDE is a non-parametric technique for density estimation in which a known density function (the kernel) is averaged across the observed data points to create a smooth approximation. Given the non-parametric nature of KDE, the estimator has no fixed functional form but depends on all the data points used for the estimation. Let's see the snippet:
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
from numpy import linspace,hstack
from pylab import plot,show,hist

# creating data with two peaks
sampD1 = norm.rvs(loc=-1.0,scale=1,size=300)
sampD2 = norm.rvs(loc=2.0,scale=0.5,size=300)
samp = hstack([sampD1,sampD2])

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(samp)

# plotting the result
x = linspace(-5,5,100)
plot(x,my_pdf(x),'r') # distribution function
hist(samp,normed=1,alpha=.3) # histogram
show()
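Since my_pdf is a gaussian_kde object rather than a fixed array of values, we can also use it to estimate the probability mass in an interval. For example (a small follow-up using the same samp as above, through the integrate_box_1d method of gaussian_kde):

# estimated probability that a sample falls between -5 and 0
print my_pdf.integrate_box_1d(-5, 0)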
The result should be as follows:

Friday, July 20, 2012

Distribution fitting with scipy

Distribution fitting is the procedure of selecting a statistical distribution that best fits a dataset generated by some random process. In this post we will see how to fit a distribution using the techniques implemented in the Scipy library.
This is the first snippet:
from scipy.stats import norm
from numpy import linspace
from pylab import plot,show,hist,figure,title

# picking 150 samples from a normal distribution
# with mean 0 and standard deviation 1
samp = norm.rvs(loc=0,scale=1,size=150) 

param = norm.fit(samp) # distribution fitting

# now, param[0] and param[1] are the mean and 
# the standard deviation of the fitted distribution
x = linspace(-5,5,100)
# fitted distribution
pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = norm.pdf(x)

title('Normal distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The result should be as follows


In the code above, a dataset of 150 samples has been created using a normal distribution with mean 0 and standard deviation 1, then a fitting procedure has been applied to the data. In the figure we can see the original distribution (blue curve) and the fitted distribution (red curve), and we can observe that they are very similar.
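We can also check this numerically by printing the fitted parameters (a small addition to the snippet above; for norm the tuple returned by fit contains the estimated mean and standard deviation):

print 'fitted mean = %0.4f, fitted std = %0.4f' % param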
Let's do the same with a Rayleigh distribution:
from scipy.stats import norm,rayleigh

samp = rayleigh.rvs(loc=5,scale=2,size=150) # samples generation

param = rayleigh.fit(samp) # distribution fitting

x = linspace(5,13,100)
# fitted distribution
pdf_fitted = rayleigh.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = rayleigh.pdf(x,loc=5,scale=2)

title('Rayleigh distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The resulting plot:


As expected, the two distributions are very close.