Showing posts with label numpy.

Thursday, March 28, 2019

Speeding up the Sieve of Eratosthenes with Numba

Lately, on the invitation of my right honourable friend Michal, I've been trying to solve some problems from Project Euler and felt the need for a good way to find prime numbers. So I implemented the Sieve of Eratosthenes. The algorithm is simple and efficient: it creates a list of all integers below a number n, then filters out the multiples of all primes less than or equal to the square root of n; the remaining numbers are the eagerly-awaited primes. Here's the first version of the implementation I came up with:
def sieve_python(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime
This returns a list is_prime where is_prime[n] is True if n is a prime number. The code is straightforward but it wasn't fast enough for my taste, so I decided to time it:
from timeit import timeit

def elapse_time(s):
    s = timeit(s, number=100, globals=globals())
    return f'{s:.3f} seconds'

print(elapse_time('sieve_python(100000)'))
1.107 seconds
1.1 seconds to check 100000 values indeed sounded too slow, so I decided to precompile the function with Numba:
from numba import njit

@njit
def sieve_python_jit(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_python_jit(10) # compilation
print(elapse_time('sieve_python_jit(100000)'))
0.103 seconds
The only addition to the previous version is the decorator @njit, and this simple change resulted in a whopping 10x speed up! However, Michal shared some code with me that made me notice that combining Numba with the appropriate Numpy data structures leads to impressive results, so this implementation materialized:
import numpy as np

@njit
def sieve_numpy_jit(limit):
    is_prime = np.full(limit, True)
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(np.sqrt(limit) + 1)):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_numpy_jit(10) # compilation
print(elapse_time('sieve_numpy_jit(100000)'))
0.018 seconds
The speed up with respect to the first version is 61x!
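As a side note, the boolean array returned by these functions can be turned into the actual primes with np.flatnonzero; a minimal sketch of the idea (primes is just an illustrative name):
# the indexes where the mask is True are the primes themselves
primes = np.flatnonzero(sieve_numpy_jit(100))
print(primes) # the 25 primes below 100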

Lessons learned:
  • Using Numba is very straightforward and a Python function written in a decent manner can be sped up with little effort.
  • Python lists are too heavy in some cases. Even with pre-allocation of the memory they can't beat Numpy arrays for this specific task.
  • Assigning types correctly is key. Using a Numpy array of integers instead of bools in the function sieve_numpy_jit would result in a slow down.
Update: Thanks to gwillicoder, who made me realize the code could be sped up by checking if the divisor is a prime and who provided a very efficient numpy implementation here.

Wednesday, March 20, 2019

Ravel and unravel with numpy

Raveling and unraveling are common operations when working with matrices. With a ravel operation we go from matrix coordinates to index coordinates, while with an unravel operation we go the opposite way. In this post we will see through an example how they can be done with numpy in a very easy way. Let's assume that we have a matrix of dimensions 4-by-4, and that we want the index of the element (1, 1), counting from the top left corner of the matrix. Using ravel_multi_index the solution is easy:
import numpy as np
coordinates = [[1], [1]]
shape = (4, 4)
idx = np.ravel_multi_index(coordinates, shape)
print(idx)

array([5])
What if we want to go back to the original coordinates? In this case we can use unravel_index:
np.unravel_index(idx, shape)

(array([1]), array([1]))
So now we know that the element (1, 1) has index 5 ;-)
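The same functions also accept many coordinates at once; here is a small sketch along the lines of the example above (coords and idx3 are just illustrative names):
coords = [[0, 1, 3], [2, 1, 3]] # rows and columns of three elements
idx3 = np.ravel_multi_index(coords, shape)
print(idx3) # [ 2  5 15]
print(np.unravel_index(idx3, shape)) # (array([0, 1, 3]), array([2, 1, 3]))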

Friday, May 3, 2013

A new RefCard from the GlowingPython!

Check out the DZone RefCard from the GlowingPython:


This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python. In the RefCard you will find the following contents:
  • How to import and visualize data.
  • How to classify and cluster data.
  • How to discover relationships in the data using regression and correlation measures.
  • How to reduce the dimensionality of the data in order to compress and visualize the information it brings.
  • How to analyze structured data with networkx.
Each topic is covered with code examples based on four of the major Python libraries for data analysis and manipulation: numpy, matplotlib, sklearn and networkx. Here is a preview of the first two pages:


Click on the preview to get the RefCard!

Friday, October 12, 2012

Visualizing correlation matrices

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The function corrcoef provided by numpy returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are variables and whose columns are observations. Each element of the matrix R represents the correlation between two variables and it is computed as

where cov(X,Y) is the covariance between X and Y, while σX and σY are their standard deviations (that is, R_XY = cov(X,Y)/(σX·σY)). If N is the number of variables, then R is an N-by-N matrix. So, when we have a large number of variables we need a way to visualize R. The following snippet uses a pseudocolor plot to visualize R:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
from pylab import pcolor, show, colorbar, xticks, yticks

# generating some uncorrelated data
data = rand(10,100) # each row represents a variable

# creating correlation between the variables
# variable 2 is correlated with all the other variables
data[2,:] = sum(data,0)
# variable 4 is correlated with variable 8
data[4,:] = log(data[8,:])*0.5

# plotting the correlation matrix
R = corrcoef(data)
pcolor(R)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
show()
The result should be as follows:


As we expected, the correlation coefficients for the variable 2 are higher than the others and we observe a strong correlation between the variables 4 and 8.
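To tie the plot back to the formula above, a quick check along these lines should show that an entry of R equals the covariance divided by the product of the standard deviations (the ddof=1 matches the normalization used by corrcoef):
from numpy import cov, std
# correlation between variables 4 and 8 computed from the definition
c48 = cov(data[4,:], data[8,:])[0,1]
r48 = c48/(std(data[4,:], ddof=1)*std(data[8,:], ddof=1))
print(R[4,8])
print(r48) # the two values should coincide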

Saturday, September 29, 2012

Weighted random choice

Weighted random choice lets you select a random value out of a set of values using a distribution specified through a set of weights. So, given a list, we want to pick some of its elements at random, but the chance of picking a specific element has to be defined by a weight. In the following code we have a function that implements the weighted random choice mechanism and an example of how to use it:
from numpy import cumsum, sort, sum, searchsorted
from numpy.random import rand
from pylab import hist,show,xticks

def weighted_pick(weights,n_picks):
 """
  Weighted random selection
  returns n_picks random indexes.
  the chance to pick the index i 
  is given by the weight weights[i].
 """
 t = cumsum(weights)
 s = sum(weights)
 return searchsorted(t,rand(n_picks)*s)

# weights don't have to sum up to one
w = [0.1, 0.2, 0.5, 0.5, 1.0, 1.1, 2.0]

# picking 10000 times
picked_list = weighted_pick(w,10000)

# plotting the histogram
hist(picked_list,bins=len(w),normed=1,alpha=.8,color='red')
show()
The code above plots the distribution of the selected indexes:


We can observe that the chance to pick the element i is proportional to the weight w[i].
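As an aside, recent versions of numpy also provide numpy.random.choice, which accepts a vector of probabilities; normalizing the weights should give equivalent picks. A rough sketch (picked is just an illustrative name):
from numpy import array
from numpy.random import choice

w = array([0.1, 0.2, 0.5, 0.5, 1.0, 1.1, 2.0])
picked = choice(len(w), size=10000, p=w/w.sum()) # probabilities must sum to one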

Saturday, April 14, 2012

k-nearest neighbor search

A k-nearest neighbor search identifies the top k nearest neighbors to a query. The problem is: given a dataset D of vectors in a d-dimensional space and a query point x in the same space, find the k closest points in D to x. The following function performs a k-nearest neighbor search using the euclidean distance:
from numpy import random,argsort,sqrt
from pylab import plot,show

def knn_search(x, D, K):
 """ find K nearest neighbours of data among D """
 ndata = D.shape[1]
 K = K if K < ndata else ndata
 # euclidean distances from the other points
 sqd = sqrt(((D - x[:,:ndata])**2).sum(axis=0))
 idx = argsort(sqd) # sorting
 # return the indexes of K nearest neighbours
 return idx[:K]
The function computes the euclidean distance between every point of D and x, then returns the indexes of the K points with the smallest distances.
Now, we will test this function on a random bidimensional dataset:
# knn_search test
data = random.rand(2,200) # random dataset
x = random.rand(2,1) # query point

# performing the search
neig_idx = knn_search(x,data,10)

# plotting the data and the input point
plot(data[0,:],data[1,:],'ob',x[0,0],x[1,0],'or')
# highlighting the neighbours
plot(data[0,neig_idx],data[1,neig_idx],'o',
  markerfacecolor='None',markersize=15,markeredgewidth=1)
show()
The result is as follows:


The red point is the query vector and the blue ones represent the data. The blue points surrounded by a black circle are the nearest neighbors.
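As a possible variation, in recent numpy versions argpartition avoids the full sort done by argsort, which can help when D contains many points. A sketch of the same search written that way (knn_search_partition is just an illustrative name):
from numpy import argpartition, argsort, sqrt

def knn_search_partition(x, D, K):
 """ find K nearest neighbours of x among D without a full sort """
 sqd = sqrt(((D - x)**2).sum(axis=0)) # euclidean distances from x
 idx = argpartition(sqd, K)[:K]       # indexes of the K smallest distances
 return idx[argsort(sqd[idx])]        # ordered by distance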

Saturday, March 24, 2012

Linear regression with Numpy

A few posts ago, we saw how to use the function numpy.linalg.lstsq(...) to solve an over-determined system. This time, we'll use it to estimate the parameters of a regression line.
A linear regression line is of the form w1*x + w2 = y and it is the line that minimizes the sum of the squares of the distances from each data point to the line. So, given n pairs of data (xi, yi), the parameters that we are looking for are w1 and w2, which minimize the error

E(w1, w2) = Σi (w1*xi + w2 - yi)^2

and we can compute the parameter vector w = (w1, w2)^T as the least-squares solution of the over-determined system A*w = y, where the i-th row of A is (xi, 1) and y = (y1, ..., yn)^T.
Let's use numpy to compute the regression line:
from numpy import arange,array,ones,linalg
from pylab import plot,show

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]
w = linalg.lstsq(A.T,y)[0] # obtaining the parameters

# plotting the line
line = w[0]*xi+w[1] # regression line
plot(xi,line,'r-',xi,y,'o')
show()
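Optionally, a couple of extra lines like these print the estimated parameters and the residual sum of squares of the fit:
print(w) # the estimated w1 and w2
print(((w[0]*xi + w[1] - y)**2).sum()) # residual sum of squares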
We can see the result in the plot below.



You can find more about data fitting using numpy in the following posts. Update: the same result could be achieved using the function scipy.stats.linregress (thanks ianalis!):
from numpy import arange,array,ones#,random,linalg
from pylab import plot,show
from scipy import stats

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]

slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)

print 'r value', r_value
print  'p_value', p_value
print 'standard deviation', std_err

line = slope*xi+intercept
plot(xi,line,'r-',xi,y,'o')
show()

Friday, March 16, 2012

SVD decomposition with numpy

The SVD decomposition is a factorization of a matrix, with many useful applications in signal processing and statistics. In this post we will see
  • how to compute the SVD decomposition of a matrix A using numpy,
  • how to compute the inverse of A using the matrices computed by the decomposition,
  • and how to solve a linear system of equations Ax=b using the SVD.
The SVD decomposition of a matrix A is of the form

A = U*S*V^T

where U and V are orthogonal and S is the diagonal matrix of the singular values of A.
Since U and V are orthogonal (this means that U^T*U=I and V^T*V=I), we can write the inverse of A as (see Solving overdetermined systems with the QR decomposition for the tricks)

A^-1 = V*S^-1*U^T
So, let's start computing the factorization and the inverse
from numpy import *

A = floor(random.rand(4,4)*20-10) # generating a random
b = floor(random.rand(4,1)*20-10) # system Ax=b

U,s,V = linalg.svd(A) # SVD decomposition of A

# computing the inverse using pinv
pinv = linalg.pinv(A)
# computing the inverse using the SVD decomposition
pinv_svd = dot(dot(V.T,linalg.inv(diag(s))),U.T)

print "Inverse computed by linalg.pinv()\n",pinv
print "Inverse computed using SVD\n",pinv_svd
As we can see, the output shows that pinv and pinv_svd are equal
Inverse computed by linalg.pinv()
[[ 0.06578301 -0.04663721  0.0436917   0.089838  ]
 [ 0.15243004  0.044919   -0.03681885  0.00294551]
 [ 0.18213058 -0.01718213  0.06872852 -0.07216495]
 [ 0.03976436  0.09867452  0.03387334 -0.04270987]]
Inverse computed using SVD
[[ 0.06578301 -0.04663721  0.0436917   0.089838  ]
 [ 0.15243004  0.044919   -0.03681885  0.00294551]
 [ 0.18213058 -0.01718213  0.06872852 -0.07216495]
 [ 0.03976436  0.09867452  0.03387334 -0.04270987]]
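Before moving on, a quick sanity check along these lines should confirm that the matrix we computed really behaves like an inverse:
# A^-1*A should be close to the identity matrix up to numerical precision
print(allclose(dot(pinv_svd,A),eye(4)))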
Now, we can solve Ax=b using the inverse:

x = A^-1*b = V*S^-1*U^T*b

or solving the system

U*S*V^T*x = b

Multiplying by U^T we obtain

S*V^T*x = U^T*b

Then, letting c be U^T*b and w be V^T*x, the system becomes

S*w = c

Since S is diagonal, we can easily obtain w solving the system above. And finally, we can obtain x solving

V^T*x = w
Let's compare the results of those methods:

x = linalg.solve(A,b) # solve Ax=b using linalg.solve

xPinv = dot(pinv_svd,b) # solving Ax=b computing x = A^-1*b

# solving Ax=b using the equation above
c = dot(U.T,b) # c = U^t*b
w = linalg.solve(diag(s),c) # solving S*w = c
xSVD = dot(V.T,w) # x = V*w

print "Ax=b solutions compared"
print x.T
print xSVD.T
print xPinv.T
As expected, we have the same solutions:
Ax=b solutions compared
[[ 0.13549337 -0.37260677  1.62886598 -0.09720177]]
[[ 0.13549337 -0.37260677  1.62886598 -0.09720177]]
[[ 0.13549337 -0.37260677  1.62886598 -0.09720177]]

Saturday, March 10, 2012

Solving overdetermined systems with the QR decomposition

A system of linear equations is considered overdetermined if there are more equations than unknowns. In practice, we have a system Ax=b where A is an m by n matrix and b is an m dimensional vector, but m is greater than n. In this case, the vector b cannot, in general, be expressed as a linear combination of the columns of A. Hence, we can't find an x that satisfies Ax=b (except in specific cases), but it is possible to determine x so that Ax is as close to b as possible. So we wish to find the x which minimizes the following error

E(x) = ||A*x - b||
Considering the QR decomposition of A, the system Ax=b becomes

Q*R*x = b

multiplying by Q^T we obtain

Q^T*Q*R*x = Q^T*b

and since Q has orthonormal columns (this means that Q^T*Q=I) we have

R*x = Q^T*b

Now, this is a well defined system: R is an upper triangular matrix and Q^T*b is a vector. More precisely, Q*Q^T*b is the orthogonal projection of b onto the range of A, and the solution is

x = R^-1*Q^T*b
The function linalg.lstsq() provided by numpy returns the least-squares solution to a linear matrix equation and is able to solve overdetermined systems. Let's compare the solution of linalg.lstsq() with the one computed using the QR decomposition:
from numpy import *

# generating a random overdetermined system
A = random.rand(5,3)
b = random.rand(5,1) 

x_lstsq = linalg.lstsq(A,b)[0] # computing the numpy solution

Q,R = linalg.qr(A) # qr decomposition of A
Qb = dot(Q.T,b) # computing Q^T*b (project b onto the range of A)
x_qr = linalg.solve(R,Qb) # solving R*x = Q^T*b

# comparing the solutions
print 'qr solution'
print x_qr
print 'lstqs solution'
print x_lstsq
This is the output of the script above:
qr solution
[[ 0.08704059]
 [-0.10106932]
 [ 0.56961487]]
lstqs solution
[[ 0.08704059]
 [-0.10106932]
 [ 0.56961487]]
As we can see, the solutions are the same.
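A further check we can make is on the size of the residual, that is how far A*x is from b for the least-squares solution; something like this should do:
print(linalg.norm(dot(A,x_qr)-b)) # norm of the residual A*x - b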

Tuesday, February 28, 2012

Finite differences with Toeplitz matrix

A Toeplitz matrix is a matrix in which each descending diagonal from left to right is constant. In this post we will see how to approximate the derivative of a function f(x) as a matrix-vector product between a Toeplitz matrix and a vector of equally spaced values of f. Let's see how to generate the matrices we need using the function toeplitz(...) provided by scipy:
from numpy import *
from scipy.linalg import toeplitz
import pylab

def forward(size):
 """ returns a toeplitz matrix
   for forward differences
 """
 r = zeros(size)
 c = zeros(size)
 r[0] = -1
 r[size-1] = 1
 c[1] = 1
 return toeplitz(r,c)

def backward(size):
 """ returns a toeplitz matrix
   for backward differences
 """
 r = zeros(size)
 c = zeros(size)
 r[0] = 1
 r[size-1] = -1
 c[1] = -1
 return toeplitz(r,c).T

def central(size):
 """ returns a toeplitz matrix
   for central differences
 """
 r = zeros(size)
 c = zeros(size)
 r[1] = .5
 r[size-1] = -.5
 c[1] = -.5
 c[size-1] = .5
 return toeplitz(r,c).T

# testing the functions printing some 4-by-4 matrices
print 'Forward matrix'
print forward(4)
print 'Backward matrix'
print backward(4)
print 'Central matrix'
print central(4)
The result of the test above is as follows:
Forward matrix
[[-1.  1.  0.  0.]
 [ 0. -1.  1.  0.]
 [ 0.  0. -1.  1.]
 [ 1.  0.  0. -1.]]
Backward matrix
[[ 1.  0.  0. -1.]
 [-1.  1.  0.  0.]
 [ 0. -1.  1.  0.]
 [ 0.  0. -1.  1.]]
Central matrix
[[ 0.   0.5  0.  -0.5]
 [-0.5  0.   0.5  0. ]
 [ 0.  -0.5  0.   0.5]
 [ 0.5  0.  -0.5  0. ]]
We can observe that the matrix-vector product between those matrices and the vector of equally spaced values of f(x) implements, respectively, the following equations:

Forward difference: f'(x) ≈ (f(x+h) - f(x)) / h
Backward difference: f'(x) ≈ (f(x) - f(x-h)) / h
Central difference: f'(x) ≈ (f(x+h) - f(x-h)) / (2*h)

where h is the step size between the samples. Those equations are called Finite Differences and they give us an approximate derivative of f. So, let's approximate some derivatives!
x = linspace(0,10,15)
y = cos(x) # recall, the derivative of cos(x) is -sin(x)
# we need the step h to compute f'(x)
# because the product gives h*f'(x);
# note that h = x[1]-x[2] is negative, which flips the sign of the result
# so that it can be compared directly with sin(x) below
h = x[1]-x[2]
# generating the matrices
Tf = forward(15)/h 
Tb = backward(15)/h
Tc = central(15)/h

pylab.subplot(211)
# approximation and plotting
pylab.plot(x,dot(Tf,y),'g',x,dot(Tb,y),'r',x,dot(Tc,y),'m')
pylab.plot(x,sin(x),'b--',linewidth=3)
pylab.axis([0,10,-1,1])

# the same experiment with more samples (h is smaller)
x = linspace(0,10,50)
y = cos(x)
h = x[1]-x[2]
Tf = forward(50)/h
Tb = backward(50)/h
Tc = central(50)/h

pylab.subplot(212)
pylab.plot(x,dot(Tf,y),'g',x,dot(Tb,y),'r',x,dot(Tc,y),'m')
pylab.plot(x,sin(x),'b--',linewidth=3)
pylab.axis([0,10,-1,1])
pylab.legend(['Forward', 'Backward', 'Central', 'True f prime'],loc=4)
pylab.show()
The resulting plot would appear as follows:


As the theory suggests, the approximation is better when h is smaller and the central differences are more accurate (note that they have a higher order of accuracy with respect to the backward and forward ones).
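To put a number on the accuracy, a rough check like the following should work with the script above (the first and last entries are skipped because the Toeplitz matrices wrap around at the borders):
approx = dot(Tc,y) # central approximation with 50 samples
print(abs(approx[1:-1]-sin(x)[1:-1]).max()) # maximum deviation at the interior points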

Sunday, February 5, 2012

Convolution with numpy

A convolution is a way to combine two sequences, x and w, to get a third sequence, y, that is a filtered version of x. The convolution of the sample x_t is computed as follows:

y_t = (w_0*x_t + w_1*x_(t-1) + ... + w_(k-1)*x_(t-k+1)) / (w_0 + w_1 + ... + w_(k-1))

It is the mean of the weighted summation over a window of length k, where the w_i are the weights. Usually, the sequence w is generated using a window function. Numpy has a number of window functions already implemented: bartlett, blackman, hamming, hanning and kaiser. So, let's plot some Kaiser windows varying the parameter beta:
import numpy
import pylab

beta = [2,4,16,32]

pylab.figure()
for b in beta:
 w = numpy.kaiser(101,b) 
 pylab.plot(range(len(w)),w,label="beta = "+str(b))
pylab.xlabel('n')
pylab.ylabel('W_K')
pylab.legend()
pylab.show()
The graph would appear as follows:



And now, we can use the function convolve(...) to compute the convolution between a vector x and one of the Kaiser window we have seen above:
def smooth(x,beta):
 """ kaiser window smoothing """
 window_len=11
 # extending the data at beginning and at the end
 # to apply the window at the borders
 s = numpy.r_[x[window_len-1:0:-1],x,x[-1:-window_len:-1]]
 w = numpy.kaiser(window_len,beta)
 y = numpy.convolve(w/w.sum(),s,mode='valid')
 return y[5:len(y)-5]
Let's test it on a random sequence:
# random data generation
y = numpy.random.random(100)*100 
for i in range(100):
 y[i]=y[i]+i**((150-i)/80.0) # modifies the trend

# smoothing the data
pylab.figure(1)
pylab.plot(y,'-k',label="original signal",alpha=.3)
for b in beta:
 yy = smooth(y,b) 
 pylab.plot(yy,label="filtered (beta = "+str(b)+")")
pylab.legend()
pylab.show()
The program would have an output similar to the following:



As we can see, the original sequence has been smoothed by the windows.
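A minimal sanity check of the smooth function: since the window is normalized by w.sum(), smoothing a constant signal should leave it unchanged. Something along these lines can verify it:
const = numpy.ones(100)
print(numpy.allclose(smooth(const,16),1.0)) # should print True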

Saturday, January 21, 2012

Monte Carlo estimate for pi with numpy

In this post we will use a Monte Carlo method to approximate pi. The idea behind the method that we are going to see is the following:

Draw the unit square and the unit circle. Consider only the part of the circle inside the square and pick uniformly a large number of points at random over the square. Now, the part of the circle inside the square has an area of pi/4, while the square has an area of 1. So, of the total number of points that fall within the square, the number of points that hit the circle quadrant is proportional to the area of that part. This gives a way to approximate pi/4 as the ratio between the number of points inside the circle and the total number of points; multiplying this ratio by 4 we have an estimate of pi.

Let's see the python script that implements the method discussed above using the numpy's indexing facilities:
from pylab import plot,show,axis
from numpy import random,sqrt,pi

# scattering n points over the unit square
n = 1000000
p = random.rand(n,2)

# counting the points inside the unit circle
idx = sqrt(p[:,0]**2+p[:,1]**2) < 1

plot(p[idx,0],p[idx,1],'b.') # point inside
plot(p[idx==False,0],p[idx==False,1],'r.') # point outside
axis([-0.1,1.1,-0.1,1.1]) 
show()

# estimation of pi
print '%0.16f' % (sum(idx).astype('double')/n*4),'result'
print '%0.16f' % pi,'real pi'
The program will print the pi approximation on the standard out:
3.1457199999999998 result
3.1415926535897931 real pi
and will show a graph with the generated points:



Note that the lines of code used to estimate pi are just 3!
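As a rough idea of the accuracy, the standard error of this binomial estimate decreases as 1/sqrt(n); an approximate value can be printed like this:
p_in = sum(idx).astype('double')/n # fraction of points inside the circle
print(4*sqrt(p_in*(1-p_in)/n)) # approximate standard error of the pi estimate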

Thursday, September 8, 2011

Sound Synthesis

Physically, sound is an oscillation of a mechanical medium that makes the surrounding air also oscillate and transport the sound as a compression wave. Mathematically, the oscillations can be described as

y(t) = A*sin(2*pi*f*t)

where t is the time, f the frequency of the oscillation and A its amplitude. Sound on a computer is a sequence of numbers and in this post we will see how to generate a musical tone with numpy and how to write it to a wav file. Each musical note vibrates at a particular frequency and the following script contains a function to generate a tone (note(...)); we'll use this function to generate the A tone, creating an oscillation with f = 440 Hz.
from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16
from pylab import plot,show,axis

# tone synthesis
def note(freq, len, amp=1, rate=44100):
 t = linspace(0,len,len*rate)
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

# A tone, 2 seconds, 44100 samples per second
tone = note(440,2,amp=10000)

write('440hzAtone.wav',44100,tone) # writing the sound to a file

plot(linspace(0,2,2*44100),tone)
axis([0,0.4,15000,-15000])
show()
Now we can play the file 440hzAtone.wav with an external player. This plot shows a part of the signal generated by the script:
And using the plotSpectrum function defined in a previous post we can verify that 440 Hz is the fundamental frequency of the tone.
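Of course, the same function should be able to generate any other note; for instance, the E above our A is about 659.26 Hz in equal temperament (the file name below is just an example):
e_tone = note(659.26,2,amp=10000)
write('659hzEtone.wav',44100,e_tone) # writing the E tone to a file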

Wednesday, July 27, 2011

PCA and image compression with numpy

In the previous post we have seen the princomp function. This function performs principal components analysis (PCA) on the n-by-p data matrix and uses all the p principal components to compute the principal component scores. In this new post, we will see a modified version of princomp where the representation of the original data in the principal component space is computed with fewer than p principal components:
from numpy import mean,cov,cumsum,dot,linalg,size,flipud,argsort

def princomp(A,numpc=0):
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M))
 p = size(coeff,axis=1)
 idx = argsort(latent) # sorting the eigenvalues
 idx = idx[::-1]       # in descending order
 # sorting eigenvectors according to the sorted eigenvalues
 coeff = coeff[:,idx]
 latent = latent[idx] # sorting eigenvalues
 if numpc < p and numpc >= 0:
  coeff = coeff[:,range(numpc)] # cutting some PCs if needed
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent
The following code uses the new version of princomp to compute the PCA of a matrix that represents an image in gray scale. The PCA is computed repeatedly with an increasing number of principal components. The script shows the images reconstructed using fewer than 50 principal components (out of the 200 available).
from pylab import imread,subplot,imshow,title,gray,figure,show,NullLocator
A = imread('shakira.jpg') # load an image
A = mean(A,2) # to get a 2-D array
full_pc = size(A,axis=1) # numbers of all the principal components
i = 1
dist = []
for numpc in range(0,full_pc+10,10): # 0 10 20 ... full_pc
 coeff, score, latent = princomp(A,numpc)
 Ar = dot(coeff,score).T+mean(A,axis=0) # image reconstruction
 # difference in Frobenius norm
 dist.append(linalg.norm(A-Ar,'fro'))
 # showing the pics reconstructed with less than 50 PCs
 if numpc <= 50:
  ax = subplot(2,3,i,frame_on=False)
  ax.xaxis.set_major_locator(NullLocator()) # remove ticks
  ax.yaxis.set_major_locator(NullLocator())
  i += 1 
  imshow(flipud(Ar))
  title('PCs # '+str(numpc))
  gray()

figure()
imshow(flipud(A))
title('numpc FULL')
gray()
show()
The resulting images:


We can see that 40 principal components are enough to reconstruct the original image.


At the end of this experiment, we can plot the distance of the reconstructed images from the original image in Frobenius norm (red curve) and the cumulative sum of the eigenvalues (blue curve). Recall that the cumulative sum of the eigenvalues shows the level of variance accounted for by the corresponding eigenvectors. On the x axis there is the number of eigenvalues/eigenvectors used.
from pylab import plot,axis,array
figure()
perc = cumsum(latent)/sum(latent)
dist = array(dist)/max(dist) # dist is a list, convert it before dividing
plot(range(len(perc)),perc,'b',range(0,full_pc+10,10),dist,'r')
axis([0,full_pc,0,1.1])
show()
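From the perc curve we can also read off, for instance, how many components account for 90% of the variance; a small sketch (numpc_90 is just an illustrative name):
# number of components needed to account for 90% of the variance;
# .real guards against a possibly complex dtype returned by eig
numpc_90 = (perc.real < 0.9).sum() + 1
print(numpc_90)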

Friday, July 22, 2011

Principal Component Analysis with numpy

The following function is a three-line implementation of the Principal Component Analysis (PCA). It is inspired by the function princomp of the matlab's statistics toolbox.
from numpy import mean,cov,double,cumsum,dot,linalg,array,rank
from pylab import plot,subplot,axis,stem,show,figure

def princomp(A):
 """ performs principal components analysis 
     (PCA) on the n-by-p data matrix A
     Rows of A correspond to observations, columns to variables. 

 Returns :  
  coeff :
    is a p-by-p matrix, each column containing coefficients 
    for one principal component.
  score : 
    the principal component scores; that is, the representation 
    of A in the principal component space. Rows of SCORE 
    correspond to observations, columns to components.
  latent : 
    a vector containing the eigenvalues 
    of the covariance matrix of A.
 """
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M)) # attention:not always sorted
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent
(In this other post you can find an updated version of this function).
In the following test a 2D dataset will be used. The result of this test is a plot with the two principal components (dashed lines), the original data (blue dots) and the new data (green stars). As we expected, the first principal component describes the direction of maximum variance and the second is orthogonal to the first.
A = array([ [2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9],
            [2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1] ])

coeff, score, latent = princomp(A.T)

figure()
subplot(121)
# every eigenvector describe the direction
# of a principal component.
m = mean(A,axis=1)
plot([0, -coeff[0,0]*2]+m[0], [0, -coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(A[0,:],A[1,:],'ob') # the data
axis('equal')
subplot(122)
# new data
plot(score[0,:],score[1,:],'*g')
axis('equal')
show()
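A quick way to convince ourselves that coeff and score carry all the information of the original data is to project back and compare with the input; a sketch along these lines should print True:
from numpy import allclose
# back-projection of the scores plus the mean recovers the data
Ar = dot(coeff,score).T + mean(A.T,axis=0)
print(allclose(Ar,A.T))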

In this second example princomp(.) is tested on a 4D dataset. In this example the matrix of the data is rank deficient and only the first two components are necessary to bring the information of the entire dataset.
A = array([[-1, 1, 2, 2],
           [-2, 3, 1, 0],
           [ 4, 0, 3,-1]],dtype=double)

coeff, score, latent = princomp(A)
perc = cumsum(latent)/sum(latent)
figure()
# the following plot shows that first two components 
# account for 100% of the variance.
stem(range(len(perc)),perc,'--b')
axis([-0.3,4.3,0,1.3])
show()
print 'the principal component scores'
print score.T # only the first two columns are nonzero
print 'The rank of A is'
print rank(A)  # indeed, the rank of A is 2

the principal component scores
[[  1.464140e+00   1.588382e+00   0.000000e+00  -4.440892e-16]
 [  2.768170e+00  -1.292503e+00  -2.775557e-17   6.557254e-16]
 [ -4.232310e+00  -2.958795e-01   1.110223e-16  -3.747002e-16]]
The rank of A is
2

Thursday, July 14, 2011

Polynomial curve fitting

We have seen already how to fit a given set of points by minimizing an error function; now we will see how to find a fitting polynomial for the data using the function polyfit provided by numpy:
from numpy import *
import pylab

# data to fit
x = random.rand(6)
y = random.rand(6)

# fit the data with a 4th degree polynomial
z4 = polyfit(x, y, 4) 
p4 = poly1d(z4) # construct the polynomial 

z5 = polyfit(x, y, 5)
p5 = poly1d(z5)

xx = linspace(0, 1, 100)
pylab.plot(x, y, 'o', xx, p4(xx),'-g', xx, p5(xx),'-b')
pylab.legend(['data to fit', '4th degree poly', '5th degree poly'])
pylab.axis([0,1,0,1])
pylab.show()
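We can also inspect the fits numerically; for instance, printing the coefficients and the residuals at the data points should show that the 5th degree polynomial passes exactly through the six points:
print(z4) # coefficients of the 4th degree polynomial
print(abs(p4(x) - y)) # residuals of the 4th degree fit at the data points
print(abs(p5(x) - y)) # close to zero: 6 points determine a 5th degree polynomial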

Let's see the two polynomials:

Saturday, May 28, 2011

Data fitting using fmin

We have seen already how to find the minimum of a function using fmin; in this example we will see how to use it to fit a set of data with a curve by minimizing an error function:
from pylab import *
from numpy import *
from numpy.random import normal
from scipy.optimize import fmin

# parametric function, x is the independent variable
# and c are the parameters.
# it's a polynomial of degree 2
fp = lambda c, x: c[0]+c[1]*x+c[2]*x*x
real_p = rand(3)

# error function to minimize
e = lambda p, x, y: (abs((fp(p,x)-y))).sum()

# generating data with noise
n = 30
x = linspace(0,1,n)
y = fp(real_p,x) + normal(0,0.05,n)

# fitting the data with fmin
p0 = rand(3) # initial parameter value
p = fmin(e, p0, args=(x,y))

print 'estimated parameters: ', p
print 'real parameters: ', real_p

xx = linspace(0,1,n*3)
plot(x,y,'bo', xx,fp(real_p,xx),'g', xx, fp(p,xx),'r')

show()
The following figure will be shown: in green the original curve used to generate the noisy data, in blue the noisy data and in red the curve found by the minimization process:

The parameters will also be printed:
Optimization terminated successfully.
         Current function value: 0.861885
         Iterations: 77
         Function evaluations: 146
estimated parameters:  [ 0.92504602  0.87328979  0.64051926]
real parameters:  [ 0.86284356  0.95994753  0.67643758]
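A possible variation is to minimize a squared-error cost instead of the absolute one used above; the sketch below does just that (e2 and p_sq are just illustrative names):
e2 = lambda p, x, y: ((fp(p,x)-y)**2).sum() # squared-error cost
p_sq = fmin(e2, p0, args=(x,y))
print(p_sq)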

Friday, May 27, 2011

Delaunay triangulation with matplotlib

How to plot the Delaunay triangulation for a set of points in the plane using matplotlib:
import matplotlib.delaunay as triang
import pylab
import numpy

# 10 random points (x,y) in the plane
x,y =  numpy.array(numpy.random.standard_normal((2,10)))
cens,edg,tri,neig = triang.delaunay(x,y)

for t in tri:
 # t[0], t[1], t[2] are the points indexes of the triangle
 t_i = [t[0], t[1], t[2], t[0]]
 pylab.plot(x[t_i],y[t_i])

pylab.plot(x,y,'o')
pylab.show()
The output will be similar to this:

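Note that the matplotlib.delaunay module is no longer available in recent versions of matplotlib; a similar plot can be obtained with scipy.spatial.Delaunay and triplot, roughly as follows:
from scipy.spatial import Delaunay
import numpy, pylab

points = numpy.random.standard_normal((10,2)) # 10 random points (x,y)
tri = Delaunay(points)
pylab.triplot(points[:,0], points[:,1], tri.simplices)
pylab.plot(points[:,0], points[:,1], 'o')
pylab.show()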
Monday, May 23, 2011

Four ways to compute the Google Pagerank

As described in THE $25,000,000,000 EIGENVECTOR: THE LINEAR ALGEBRA BEHIND GOOGLE, we can compute the score of a page on a web as the maximal eigenvector of the Google matrix

M = (1-m)*A + m*S

where A is the scaled connectivity matrix of the web, S is an n × n matrix with all entries 1/n and m is a real number between 0 and 1.

Here are four ways to compute the maximal eigenvector of the matrix, implemented using numpy:
from numpy import *

def powerMethodBase(A,x0,iter):
 """ basic power method """
 for i in range(iter):
  x0 = dot(A,x0)
  x0 = x0/linalg.norm(x0,1)
 return x0

def powerMethod(A,x0,m,iter):
 """ power method modified to compute
     the maximal real eigenvector 
     of the matrix M built on top of the input matrix A """
 n = A.shape[1]
 delta = m*(array([1]*n,dtype='float64')/n) # array([1]*n is [1 1 ... 1] n times
 for i in range(iter):
  x0 = dot((1-m),dot(A,x0)) + delta
 return x0

def maximalEigenvector(A):
 """ using the eig function to compute eigenvectors """
 n = A.shape[1]
 w,v = linalg.eig(A)
 return abs(real(v[:n,0])/linalg.norm(v[:n,0],1))

def linearEquations(A,m):
 """ solving linear equations 
     of the system (I-(1-m)*A)*x = m*s """
 n = A.shape[1]
 C = eye(n,n)-dot((1-m),A)
 b = m*(array([1]*n,dtype='float64')/n)
 return linalg.solve(C,b)

def getTeleMatrix(A,m):
 """ return the matrix M
     of the web described by A """
 n = A.shape[1]
 S = ones((n,n))/n
 return (1-m)*A+m*S

A = array([ [0,     0,     0,     1, 0, 1],
            [1/2.0, 0,     0,     0, 0, 0],
            [0,     1/2.0, 0,     0, 0, 0],
            [0,     1/2.0, 1/3.0, 0, 0, 0],
            [0,     0,     1/3.0, 0, 0, 0],
            [1/2.0, 0,     1/3.0, 0, 1, 0 ] ])

n = A.shape[1] # A is n x n
m = 0.15
M = getTeleMatrix(A,m)

x0 = [1]*n
x1 = powerMethod(A,x0,m,130)
x2 = powerMethodBase(M,x0,130)
x3 = maximalEigenvector(M)
x4 = linearEquations(A,m)

# comparison of the four methods
labels = range(1,7)
print array([labels, x1, x2, x3, x4]).T

The matrix A used to test the program describes the following web


The scores are (the first column shows the labels):
[[ 1.          0.32954577  0.32954577  0.32954577  0.32954577]
 [ 2.          0.16505695  0.16505695  0.16505695  0.16505695]
 [ 3.          0.0951492   0.0951492   0.0951492   0.0951492 ]
 [ 4.          0.12210815  0.12210815  0.12210815  0.12210815]
 [ 5.          0.05195894  0.05195894  0.05195894  0.05195894]
 [ 6.          0.23618099  0.23618099  0.23618099  0.23618099]]
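Since the scores form a probability distribution over the pages, each of the vectors above should also sum to one; a quick check:
for scores in (x1, x2, x3, x4):
 print(scores.sum()) # each value should be very close to 1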

Friday, May 20, 2011

Latent Semantic Analysis with Term-Document matrix

This example is inspired by the second paragraph of the paper Matrices, vector spaces, and information retrieval. It shows a vector space representation of information used to represent documents in a collection and the query algorithm to find relevant documents. This example implements the model and the query matching algorithm using the linear algebra module provided by numpy. The program is tested on the sample data in Figure 2 of the paper.
import numpy
def buildTermDocumentMatrix(terms,docs):
 """ build a term-document matrix """
 tlen = len(terms)
 dlen = len(docs)
 A = numpy.zeros((tlen, dlen))

 for i,t in enumerate(terms):
  for j,d in enumerate(docs):
   A[i,j] = d.lower().count(t) # computing terms frequencies

 for i in range(dlen): # normalize columns
  A[:tlen,i] = A[:tlen,i]/numpy.linalg.norm(A[:tlen,i])

 return A

def query(A,q,docs):
 """ make the query and print the result """
 q = q/numpy.linalg.norm(q) # normalize query vector
 for i in range(len(docs)):
  # dot product
  print '-Doc  :',docs[i],'\n-Match:',numpy.dot(A[:6,i].T,q) 

# documents collection
docs =['How to Bake Bread Without Recipes',
'The Classic Art of Viennese Pastry',
'Numerical Recipes: The Art of Scientific Computing',
'Breads, Pastries, Pies and Cakes : Quantity Baking Recipes',
'Pastry: A Book of Best French Recipe']
# interesting terms
terms = ['bak','recipe','bread','cake','pastr','pie']

# will return a matrix 6 terms x 5 documents
A = buildTermDocumentMatrix(terms,docs) 
print 'Normalized Terms-Documents matrix'
print A

print '\n*** Query: "bak(e,ing)" + "bread"'
q1 = numpy.array([1,0,1,0,0,0])
query(A,q1,docs)

print '\n*** Query: "bak(e,ing)" only'
q2 = numpy.array([1,0,0,0,0,0])
query(A,q2,docs)
The results are the same as in the reference paper:
Normalized Terms-Documents matrix
[[ 0.57735027  0.          0.          0.40824829  0.        ]
 [ 0.57735027  0.          1.          0.40824829  0.70710678]
 [ 0.57735027  0.          0.          0.40824829  0.        ]
 [ 0.          0.          0.          0.40824829  0.        ]
 [ 0.          1.          0.          0.40824829  0.70710678]
 [ 0.          0.          0.          0.40824829  0.        ]]

*** Query: "bak(e,ing)" + "bread"
-Doc  : How to Bake Bread Without Recipes 
-Match: 0.816496580928
-Doc  : The Classic Art of Viennese Pastry 
-Match: 0.0
-Doc  : Numerical Recipes: The Art of Scientific Computing 
-Match: 0.0
-Doc  : Breads, Pastries, Pies and Cakes : Quantity Baking Recipes 
-Match: 0.57735026919
-Doc  : Pastry: A Book of Best French Recipe 
-Match: 0.0

*** Query: "bak(e,ing)" only
-Doc  : How to Bake Bread Without Recipes 
-Match: 0.57735026919
-Doc  : The Classic Art of Viennese Pastry 
-Match: 0.0
-Doc  : Numerical Recipes: The Art of Scientific Computing 
-Match: 0.0
-Doc  : Breads, Pastries, Pies and Cakes : Quantity Baking Recipes 
-Match: 0.408248290464
-Doc  : Pastry: A Book of Best French Recipe 
-Match: 0.0
Other resources about the model implemented can be found here: