Self-Taught Learning: Implementation Using MATLAB

Self-taught learning is a new framework in the domain of machine learning. Its potential lies in using unsupervised learning strategies to automatically learn and extract information and consequently create a succinct representation of the given input. I present the theoretical underpinnings of the idea, then show a possible implementation of the algorithm in the MATLAB programming language, and include the experimental results obtained during my research. Neural networks are used extensively in the project.


THESIS

Bod Gergely
Debrecen
2013





University of Debrecen
Faculty of Informatics






Self-taught learning:
Implementation using MATLAB





Supervisor: Dr. Antal Bálint          Author: Bod Gergely
Position: Assistant professor          Degree programme: Computer Science (programtervező informatikus)




Table of contents
1. Machine learning primer
a) Supervised versus Unsupervised learning
b) Neural networks
2. Self-taught learning
a) Introduction
b) Representation learning
3. Implementation
a) Initialization
b) Setting up the environment for learning
c) Feature learning
d) Classification
e) Image database construction
f) Enhancements of the standard algorithm
g) Experiments
4. Conclusion
5. Acknowledgments








1. Machine learning primer
In this thesis I would like to present a relatively new concept that has emerged from machine
learning. Before going into specific details regarding this new framework, it is advisable to
spend some time elaborating on what the field of machine learning deals with. Machine
learning is universally defined as the construction of intelligent computer programs that can
learn and with time improve their performance on some task (Mitchell, 1997).
Machine learning is often categorized as a branch of computer science and is usually considered
a subfield of artificial intelligence. While this categorization places machine learning strictly
into the domain of computing, it is fair to mention other disciplines that might be equally if not
more influential to the development of this science. First of all, machine learning employs a
plethora of methods and technical tools from statistics, mathematics, neuroscience, and biological
systems. One particular subfield of mathematics, namely optimization, is a very important tool
for broad applications in machine learning. Statistics has many overlapping features with
machine learning in concepts, tools, and methods, but there seems to be one distinctive feature. While
statistics emphasizes inference, the primary goal of machine learning is prediction.
Typical industrial applications of ML include spam filtering, handwritten character
recognition, and image classification. To give a gentle example, consider the following from
Alpaydın: "For some tasks, however, we do not have an algorithm; for example, to tell spam
emails from legitimate emails. We know what the input is: an email document that in the
simplest case is a file of characters. We know what the output should be: a yes/no output
indicating whether the message is spam or not. We do not know how to transform the input to
the output. What can be considered spam changes in time and from individual to individual.
What we lack in knowledge, we make up for in data. We can easily compile thousands of
example messages some of which we know to be spam and what we want is to learn what
constitutes spam from them. In other words, we would like the computer (machine) to extract
automatically the algorithm for this task. There is no need to learn to sort numbers, we
already have algorithms for that; but there are many applications for which we do not have
an algorithm but do have example data...Think, for example, of a supermarket chain that has
hundreds of stores all over a country selling thousands of goods to millions of
customers... What the supermarket chain wants is to be able to predict who are the likely
customers for a product. Again, the algorithm for this is not evident; it changes in time and by
geographic location." (Alpaydın, 2010, pp. 1-2)

The previous citation gives a good understanding of when machine learning can be employed.
We do not have a rigid algorithm, but we have a massive amount of data. This data can
compensate for the lack of an exact algorithm: by discovering certain statistical and
mathematical patterns in the data we can construct different learning strategies.
a) Supervised versus Unsupervised learning
There is a particular dimension in which we can differentiate the learning algorithms: the
absence or presence of the labels we provide with each incoming example. Imagine a
dataset which consists of images of cars and cats. The individual images have labels attached
to them describing which category (car or cat) they belong to. We could train the system by
splitting the dataset into two parts: a) a training set and b) a test set. By first running the algorithm on
the training set, it can make certain corrections to the parameters of the underlying
model whenever it sees an example and is told what category it belongs to. This way it is
possible to devise a system that can predict, with a certain statistical predictive power, which
category a fresh example that has not been observed by the algorithm before belongs to.
In contrast, when employing unsupervised learning we do not provide the labels (concretely,
the metadata about the category) of the examples. In this particular case we are trying to find
some hidden pattern in the underlying data. It is very well summarized by Alpaydın: "In
supervised learning, the aim is to learn a mapping from the input to an output whose correct
values are provided by a supervisor. In unsupervised learning, there is no such supervisor and
we only have input data.
The aim is to find the regularities in the input. There is a structure to the input space such that
certain patterns occur more often than others, and we want to see what generally happens
and what does not. In statistics, this is called density estimation." (Alpaydın, 2010, p. 11)


b) Neural networks
Since the algorithm devised by Andrew Ng at Stanford University relies heavily on the use of
neural networks (I would even say that they are the workhorse of the algorithm), I feel the
need to spend some time explaining what they are and when they are useful. The idea of a
neural network was inspired by a biological construct, namely the brain. The brain consists of
millions of neurons that form a very complex information processing and forwarding system.
Neural networks consist of simple processing units that interact via weighted connections.
They are sometimes designed in hardware but most research nowadays involves software
simulations. They were originally inspired by ideas about how the brain makes its
computations (Hinton, 1999).
A typical processing unit first computes a total input which is a weighted sum of the
incoming values from other units plus a bias term. The next step is to put its total input
through an activation function to calculate the activity of the unit. One of the most common
activation functions is the sigmoid, where y = 1/(1+exp(-x)) (Hinton, 1999).
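As a minimal illustrative sketch of my own (the input values and weights below are made up), this is how a single unit's total input and sigmoid activity can be computed in MATLAB:

x = [0.5; -1.2; 0.3];            % incoming values from other units
w = [0.8; 0.1; -0.4];            % weights on the incoming connections
b = 0.05;                        % bias term
totalInput = w' * x + b;         % weighted sum of the incoming values plus the bias
y = 1 / (1 + exp(-totalInput));  % sigmoid activity of the unit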

One of the most interesting properties of neural networks is their ability to learn from
examples by adapting the weights on the connections. The most widely adopted machine
learning algorithms are supervised: they assume that there is a set of training examples, each
consisting of an input vector and a desired output vector. Learning involves sweeping through
the training set a couple of times and gradually adjusting the weights so that the actual output
produced by the network gets closer to the desired output. The simplest neural network
architecture is built of some input units with directed, weighted connections to an output unit.
(Hinton, 1999)




By the introduction of so-called hidden layers, neural networks can express complicated
nonlinear mappings between the input and the output. Finding the optimal weights is generally
computationally intractable, but gradient methods can be effectively used to find sets of
weights that work well for many practical real-life tasks.
An algorithm called back propagation (Rumelhart et al., 1986) can be used to compute the
derivatives of the error function with respect to each weight in the network. The standard
error function is the squared difference between the actual and the desired outputs (Hinton,
1999).
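Written out in MATLAB-style notation (an illustrative sketch of my own, with y denoting the network's actual output vector and d the desired output vector), this standard error function for one training case is:

E = 0.5 * sum((y - d).^2);   % squared difference between actual and desired outputs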

[Illustration A. Source: The MathWorks Inc, Neural Network Product help]
[Illustration B: A single neuron with input p and weight w. Source: MathWorks, Neural Network Product help]

For each training case the activities of the units are calculated by a forward pass through the
network. Then starting with the output units a backward pass is done through the network to
compute the derivatives of the error function with respect to the total input received by each
unit (Hinton, 1999).
Below one can see a graphical representation of the one-layer neural network architecture.
[Illustration C. Source: MathWorks, Neural Network Product help]
Illustration C depicts a one-layer network with R input elements and S neurons.
In this network, each element of the input vector p is connected to each neuron input through
the weight matrix W. The ith neuron has a summer function that gathers its weighted inputs
and bias to form its own scalar output, n(i). The various n(i) taken together form an S-element
net input vector, n. Finally, the neuron layer outputs form a column vector a. The expression
for a is shown at the bottom of the picture. (MathWorks, Neural Network Product help)
Note that it is not uncommon for the number of inputs to a layer to be different from the
number of neurons (i.e., R is not necessarily equal to S). A layer is not limited to have the
number of its inputs equal to the number of its neurons. (MathWorks, Neural Network
Product help)
A network is not constrained to one hidden layer. One can stack multiple layers onto each
other, where each layer's input is the output of the previous layer.
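As an illustrative sketch of my own (random weights, not code from the implementation chapter), a forward pass through a stack of two such layers can be written in MATLAB as follows; note how each layer applies its own weight matrix, bias vector and sigmoid transfer function to the output of the previous layer:

p  = rand(4,1);                       % input vector with R = 4 elements
W1 = rand(3,4);  b1 = rand(3,1);      % layer 1: S1 = 3 neurons fed by the 4 inputs
W2 = rand(2,3);  b2 = rand(2,1);      % layer 2: S2 = 2 neurons fed by the 3 outputs of layer 1
a1 = 1 ./ (1 + exp(-(W1*p  + b1)));   % output of layer 1
a2 = 1 ./ (1 + exp(-(W2*a1 + b2)));   % output of layer 2, computed from a1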
[Illustration D: The weight matrix IW1,1 has a source 1 (second index) and a destination 1 (first index). Elements of layer 1, such as its bias, net input, and output, have a superscript 1 to indicate that they are associated with the first layer. Source: MathWorks, Neural Network Product help]
2. Self-taught learning
In this chapter I will introduce the idea of a new framework postulated by Raina et al. in their
paper (Raina et al., 2007).
a) Introduction
This new framework has the potential to use unlabeled data to enhance supervised
classification tasks. They (Raina et al., 2007) do not assume that the unlabeled data follows
the same class labels or distribution as the labeled dataset. Consequently they were able to use
a large number of unlabeled images or other input types like audio or text that are downloaded
randomly from the Internet. This huge unlabeled dataset is to be used to enhance the
performance of the supervised classification task. Since such unlabeled data is much easier
and cheaper to obtain than labeled data, it is possible to apply self-taught learning to a
broad range of practical problems. Their approach uses sparse coding to construct higher-
level features using the unlabeled data. These features form a compact input representation
and in theory can significantly improve the classification performance. (Raina et al., 2007)
Their approach is motivated by the observation that even many randomly downloaded images
will contain basic visual patterns (such as edges) that are similar to the images to be originally
classified. Therefore, we can learn to recognize such patterns from the unlabeled data, and these
patterns can be used for the supervised learning task of interest (Raina et al., 2007).
They distinguish their approach from the earlier method of semi-supervised learning by stating that the
unlabeled data does not share the class labels or the generative distribution of the labeled data.
"For example, given unlimited access to natural sounds (audio), can we perform better speaker
identification?" (Raina et al., 2007, pp. 1-2)
The similarity with semi-supervised learning (Nigam et al., 2000) is that both use labeled and
unlabeled data for the classification task. But unlike semi-supervised learning, the new
method they propose does not assume that the unlabeled data can be assigned to the
supervised learning task's class labels.
The consequence is that acquiring unlabeled images is far easier than acquiring the same
amount of labeled data from a certain category, since it is perfectly sufficient to randomly
download, for example, 100,000 images from the Internet.
Their argument also enjoys a biological motivation. It has long been assumed by
neuroscientists that most human learning is performed in an unsupervised fashion.
Their approach is split into the following two stages: first, a representation is learned using
only unlabeled data; then this representation is applied to the labeled data and used for the
classification task (Raina et al., 2007). Once the representation has been learned in the first stage, it can
then be applied repeatedly to different classification tasks.
b) Representation learning
Many of our machine learning tasks are hindered by the fact that it is often very difficult to
recognize the underlying factors that contribute the most explanatory power. Nowadays a
common method is to manually preprocess the data by discovering the most fundamental
factors and transform the data so that a traditional classifier can solve the problem at hand
effectively, but this often requires industry-specific knowledge. Feature engineering like this is
important but also has the weakness that it is labor intensive and difficult to apply in a general
context (Bengio, 2012). Recent advances in machine learning concentrate on representation
learning, which can broaden the applicability of our already known algorithms to
classify/predict on the data. Self-taught learning holds promise by incorporating several
methods from representation learning, like sparse auto-encoders and deep networks, thereby
automating the process of feature extraction and augmenting the process of the final
classification/prediction task.
3. Implementation
At the beginning of my research I decided to start by implementing the proposed
algorithms presented on the website of the Stanford class UFLDL (Unsupervised Feature
Learning and Deep Learning, http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial),
where some framework-related, mostly initialization code was available; I have included it
in my implementation. I have coded most of the
algorithms both in MATLAB and in NumPy/SciPy (NumPy and SciPy are extensions to the
Python language for numerical computations). My observation was that MATLAB provided
me with a much better environment for rapid development and prototyping. The convenience of
the debugger in MATLAB made a big impact on the speed of my algorithm development.
Considering the above, I decided to provide the code snippets in MATLAB instead of
Python (some of the utility scripts were nevertheless scripted in Python).
a) Initialization
The MATLAB environment I have created consists of a top-level script file called run.m whose
job is to start the profiling of the whole algorithm and to kick off the several stages of the
process. Let's take a look at the code:

4
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

9

run.m
profile on;       % start the MATLAB profiler
start = tic;
linearDecoder;    % feature learning with the sparse auto-encoder (linear decoder)
cnn;              % convolution, pooling and softmax classification
elapsed = toc(start);
fprintf('Elapsed time: %f seconds\n', elapsed);
fprintf('Done.\n');

By calling the tic instruction we tell the environment to assign the current time to the start
variable. After the algorithm has finished, the toc function is called with the start variable,
and the result is stored in the elapsed variable, which we can use to print the elapsed number
of seconds.
b) Setting up the environment for learning
At this point we should step into the linearDecoder.m script file, which performs several tasks.
Specifically it needs to:
Initialize the architecture and the specific tasks related parameters
Optionally create the image batches
Apply ZCA whitening
Define the number of image patches to work with
Randomly initialize the weights of the neural network
Start the optimization of the neural network's cost function
Save the optimal weights that will become our optimal features we can use for
representation
Visualize the learned features
Now let's look at the parameter initialization section:
imageChannels = 3; % number of channels (rgb, so 3)

patchDim = 8; % patch dimension
numPatches = 100000; % number of patches


visibleSize = patchDim * patchDim * imageChannels; % number of input units
outputSize = visibleSize; % number of output units
hiddenSize = 400; % number of hidden units

sparsityParam = 0.035; % desired average activation of the hidden units.
lambda = 3e-3; % weight decay parameter
beta = 5; % weight of sparsity penalty term

epsilon = 0.1; % epsilon for ZCA whitening
I set the number of image channels to three since we will be processing colored (RGB)
pictures, where we store an image in a three-dimensional matrix in which one dimension holds the
values for the red, green, and blue components. The variable numPatches holds the
number of patches we would like to extract from our unlabeled image dataset. I set the
size of one patch to an 8 x 8 matrix by specifying patchDim to be eight. We need to take into
account how much RAM our system has. By setting the number of patches to
100,000 we will be allocating a total of (8 x 8 x 3) x 100,000 eight-byte double values, which
is a total of ~153 MB.
In the next section we calculate the number of input and output units. In our case it will
be 192, since we have 8 x 8 x 3 = 192 incoming values for one image patch
example. We set the number of hidden units to 400.
By setting the sparsity parameter in the next line to a number close to zero we are eventually
driving the neural network to make most of its hidden units inactive, since we want only a small
number of nodes to be active, essentially in firing mode. The lambda and beta parameters will have an
impact on the behaviour of the neural network's cost function.
Let me proceed further in linearDecoder.m:
batchSize = 25000; % should be less than the number of patches
patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches
The batchSize variable we specify here tells the algorithm later how many examples each batch
used during the optimization of the cost function should contain. This is required for limited-memory
processing, since this way one does not need to load all the data into RAM at once. In our case we
have 100,000 sample patches and this enables us to divide the processing into four separate
iterations, consequently putting a lot less strain on the main memory.
The sampleIMAGES_fromBatches function takes two arguments. The first one is a
string denoting the folder containing our saved batches of images. Let me first show the
MATLAB code used to create these image batches.
saveImagesToFiles.m
function IMAGES = saveImagesToFiles(pathFromDir,saveDir,row, col)

numOfImagesInOneBatch = 500;

fileFolder = fullfile(pathFromDir);
dirOutput = dir(fullfile(fileFolder,'*.jpg'));
fileNames = {dirOutput.name}';
numOfImages = numel(fileNames);

I = imread(fileNames{1});


numOfBatches = floor(numOfImages / numOfImagesInOneBatch);
numOfRemainder = mod(numOfImages, numOfImagesInOneBatch);

fprintf('Number of images to save: %d\n', numOfImages);
fprintf('Number of images in one batch: %d\n', numOfImagesInOneBatch);
fprintf('Number of batches: %d\n', numOfBatches + 1);


numberOfImagesSaved = 0;

for n=1:numOfBatches
batchName = strcat('imageBatch_', num2str(n));
% Preallocate the batch
IMAGES = zeros([row col 3 numOfImagesInOneBatch],class(I));
for i=1:numOfImagesInOneBatch
currentImage = (n-1)*numOfImagesInOneBatch + i;
I = imread(fileNames{currentImage});

if not(numel(size(I)) == 3)
fprintf(fileNames{currentImage})
continue
end

I = imresize(I, [row col]); % make sure the image has the requested dimensions
IMAGES(:,:,:,i) = I;
numberOfImagesSaved = numberOfImagesSaved + 1;
fprintf('Loaded image %d in batch %d. Global image number: %d\n', i, n, numberOfImagesSaved);
end
save(strcat(saveDir,'/', batchName, '.mat'), 'IMAGES');
fprintf('Saved batch number %d out %d\n', n, numOfBatches + 1 );
clear IMAGES;
end

if not(numOfRemainder == 0)
batchName = strcat('imageBatch_', 'remainder_', num2str(numOfRemainder));
% Preallocate the batch
IMAGES = zeros([row col 3 numOfRemainder],class(I));
for i=1:numOfRemainder
currentImage = numOfBatches*numOfImagesInOneBatch + i;
I = imread(fileNames{currentImage});

if not(numel(size(I)) == 3)
fprintf(fileNames{currentImage})
continue
end

I = imresize(I, [row col]); % make sure the image has the requested dimensions
IMAGES(:,:,:,i) = I;
numberOfImagesSaved = numberOfImagesSaved + 1;
fprintf('Loaded image %d in the remainder batch. Global image number: %d\n', i, numberOfImagesSaved);
end
save(strcat(saveDir,'/', batchName, '.mat'), 'IMAGES');
fprintf('Saved the remainder batch with %d images in it.\n', numOfRemainder);

clear IMAGES;
end


We pass the directory containing our original images as the first parameter. The second
parameter specifies the folder to save the transformed and packaged images to. The row, col
parameters denote the dimensions to which we would like to transform our images. We can
specify in the function how many images we want to have in one batch file. There is always a
trade-off when working with batches of data. The more batches we have, the less need we
have for memory, but on the other hand it will require many I/O operations to move the data
from disk to RAM, and the code readability suffers too. In case we have a plethora of memory
we should set this number higher so that fewer I/O operations need to be performed, but this
will put more pressure on the main memory.
In the next steps we calculate the number of batches required to hold all the images with the
given extension from the given folder. In the for loop we start putting together the batches by
reading each image into memory by calling the imread built-in MATLAB function. The
dimensions of the batch matrices are the following: we need the number of rows, columns,
color channels (three), and the number of images to be stored in one batch. As exception
handling, I guard against invalid images that do not have a three-dimensional
structure by simply skipping them and logging the name of the particular image.
After this short diversion let's get back to our function, sampleIMAGES_fromBatches. Here
the first parameter denotes the folder where we now store the images that we have packaged
into batch files. We also tell it how many patches we want to create, which in our case is 100,000.
patchsize = 8; % Use 8x8 patches
numpatches = numberOfPatchesToCreate;

numOfChannels = 3;

% Initialize patches with zeros.
patches = zeros(patchsize*patchsize*numOfChannels, numpatches);
fileFolder = fullfile(directoryOfBatches);
dirOutput = dir(fullfile(fileFolder,'*.mat'));
fileNames = {dirOutput.name}';
numOfBatches = numel(fileNames);


samplesCreated = 0;

for n = 1:numOfBatches
load(fileNames{n});
fprintf('Sampling from batch: %s\n', fileNames{n});
numOfImagesInCurrentBatch = size(IMAGES,4);

batchSize = numberOfPatchesToCreate/numOfBatches;

for i = 1:batchSize
randImg = randi(numOfImagesInCurrentBatch);

patchRowStart = randi(505); % 505 = 512 - patchsize + 1 (assuming 512 x 512 batch images)
patchRowRange = patchRowStart:patchRowStart+patchsize-1;
patchColumnStart = randi(505);
patchColumnRange = patchColumnStart:patchColumnStart+patchsize-1;
patchSample = zeros(patchsize*patchsize*numOfChannels,1);
for j=1:numOfChannels
patchSample((j-1)*patchsize*patchsize+1:(j*patchsize*patchsize),1) = ...
reshape(IMAGES(patchRowRange,patchColumnRange,j,randImg),patchsize*patchsize,1);
end
patches(:,(n-1)*batchSize + i) = patchSample;
samplesCreated = samplesCreated + 1;
end
end

fprintf('Number of samples created: %d\n', samplesCreated);
In the above code excerpt we are randomly selecting 8 x 8 x 3 dimensional patches that we
are going to feed into our neural network. The return value will be a matrix containing the
randomly sampled 100,000 patches, each consisting of 192 values.
Now going back to the linearDecoder.m:
patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches

We now have the sampled patches (192 x 100,000) in the patches variable. For convenience it
is recommended to save this matrix into a MATLAB file making it persistent on disk in case
we want to rerun the algorithm.
Proceed forward in linearDecoder.m:
% Subtract mean patch
meanPatch = mean(patches, 2);
patches = bsxfun(@minus, patches, meanPatch);

% Apply ZCA whitening
sigma = patches * patches' / numPatches;
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
patches = ZCAWhite * patches;

In this step we normalize the values of the patches by subtracting the mean patch from each
one and then apply ZCA whitening.
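As a quick sanity check of my own (not part of the original scripts), one can verify that after whitening the covariance matrix of the patches is close to the identity, which is exactly what ZCA whitening is meant to achieve (it will not be exactly the identity because of the epsilon regularization term):

covWhite = patches * patches' / numPatches;   % covariance of the whitened patches
identityError = mean(abs(covWhite(:) - reshape(eye(visibleSize), [], 1)));
fprintf('Mean absolute deviation from the identity: %f\n', identityError);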

c) Feature learning

Now we get to the exciting part. This section does the heavy lifting by optimizing the weights
of the neural network to yield the feature activation vector. Let's go back to our code:

theta = initializeParameters(hiddenSize, visibleSize);

The initializeParameters function takes two incoming arguments, namely the number of
hidden units and the number of input/output units; these are among the most important
characteristics of the architecture of any neural network.
initializeParameters.m

function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.
r = sqrt(6) / sqrt(hiddenSize+visibleSize+1); % we'll choose weights uniformly from the interval [-r, r]

W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% parameters into a vector
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];


In the above code snippet we will get a randomly initialized vector of size 154,192 where the
hidden size is 400 and visible size is 192 (which is the dimension of a single image patch).

Now we are ready to look at the code where we are going to conduct the feature
optimization step.

% Use minFunc to minimize the function
addpath minFunc/

options = struct;
options.Method = 'lbfgs';
options.maxIter = 200;
options.display = 'on';

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
visibleSize, hiddenSize, ...
lambda, sparsityParam, ...
beta, patches, batchSize), ...
theta, options);

% Save the learned features and the preprocessing matrices
fprintf('Saving learned features and preprocessing matrices...\n');
save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');
fprintf('Saved\n');


We are registering with MATLAB the path to our optimizer, which is provided on the Stanford
class site (http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial). This optimizer is capable
of conducting a flavour of limited-memory BFGS
optimization. When I experimented with one of MATLAB's built-in optimizers (fminunc) I
had memory problems, since it was impossible to obtain the Hessian matrix of such a huge
vector (154,192 free variables). A limited-memory BFGS optimizer tries to approximate the Hessian
matrix without building the whole matrix in memory. As a side note I would like to point out
that I found that SciPy provides the minimize function in its optimization library, which can do
l-bfgs out of the box.
The minFunc optimizer takes a function handle to the cost function to be minimized, along
with the variables (in this case the features, theta) and the options that govern the behaviour
of the optimizer. The function will yield the optimized variables and the cost associated with
those variable values.
After we have obtained those values, it is again recommended to save the optimal features to
disk so they are available the next time we want to continue experimenting.
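Before launching a long optimization run it is also worth verifying that the gradients returned by the cost function are correct. The following is a small debugging sketch of my own (it assumes the variables defined above and is best run on a drastically reduced problem, since every cost evaluation sweeps the whole dataset); it compares a few analytically computed gradient components against finite-difference approximations:

costFunc = @(p) sparseAutoencoderLinearCostBatch(p, visibleSize, hiddenSize, ...
                    lambda, sparsityParam, beta, patches, batchSize);
[~, grad] = costFunc(theta);
EPS = 1e-4;
for k = randperm(numel(theta), 5)      % check a handful of randomly chosen components
    e = zeros(size(theta)); e(k) = EPS;
    numGrad = (costFunc(theta + e) - costFunc(theta - e)) / (2*EPS);
    fprintf('component %d: analytic %f, numerical %f\n', k, grad(k), numGrad);
end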
Now we have reached one of the most important algorithms in this thesis: the cost
function of the neural network, which will act as an auto-encoder whose task is to produce a
good approximation of the input values by devising the optimal feature activations. It is time
for code excerpts once again.
function [cost,grad] = sparseAutoencoderLinearCostBatch(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data, batchSizeParam)
The above is the signature of the cost function, which upon completion returns the cost and
the gradients that will be used by the optimizer.
Let me enumerate the parameters of this function:
theta: the weight values to be used by the neural network (this is to be optimized)
visibleSize: number of nodes in the input/output layer
hiddenSize: number of nodes in the hidden layer
lambda: weight decay parameter
sparsityParam: this poses an incentive on the hidden unit average activations to
converge to this value
beta: weight of the sparsity penalty

data: the matrix containing the training data (in our case a 192 x 100,000 matrix)
batchSizeParam: the number of examples to process in one batch (in our case 25,000)

As a first step we distribute the variables in the theta vector among the weight parameters of
the neural network.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
Let's proceed forward:
numOfExamples = size(data,2);
batchSize = batchSizeParam;
numOfBatches = numOfExamples/batchSize;
hwb = zeros(visibleSize,batchSize);
a2 = zeros(hiddenSize, batchSize);
pj = zeros(hiddenSize,batchSize);
summa = 0;
batchStart = 1;
batchEnd = batchSize;
for t=1:numOfBatches
%forward propagation optimized with vectorization
z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);
a2 = sigmoid(z2);
z3 = W2 * a2 + repmat(b2,1,batchSize);
hwb = z3; % makes it a LINEAR DECODER!
... to be continued.
First we assign the number of examples we are providing to the variable numOfExamples. We
assign the number of examples in one batch to the variable batchSize, which we chose to be
25,000. I found this to be a good number on the medium-priced PC where I conducted my
experiments, since having a total of 100,000 examples then requires only four iterations. We are
faced with a trade-off again here, as with any batch processing. MATLAB has very powerful
linear algebra libraries that can exploit vectorized calculations very efficiently, which creates
a strong incentive to pass a matrix with all the examples in it. But that might be
impossible or might put enormous pressure on the memory system. On the other hand, by using
smaller batches of examples instead of one big batch, we can make the optimizer converge
faster to the optimal value and also put less pressure on RAM. I found a batch size of 25,000
examples to be a good compromise.
The next step in our calculations is to start forward propagation. Using the weights between the
input and the hidden layer we can determine the hidden unit activations a2 for each example in the
given batch. Please note that here we are using vectorized notation, which is a first-line
optimization method, practically necessary in MATLAB. Unfortunately, it makes the
code much harder to read, but the speed gains are worth it in almost every
scenario. During rapid prototyping I used for loops, which invoked the MATLAB
interpreter and resulted in a serious degradation in performance.
We are using the sigmoid function to decide whether a node is active or not. The sigmoid
function has the property that it maps 0 to 0.5, converges to +1 very rapidly when the input
gets relatively large, and converges to zero as the input gets a relatively large negative value.
The MATLAB code to compute the sigmoid function:

function sigm = sigmoid(x)

sigm = 1 ./ (1 + exp(-x));

end

[Figure: plot of the sigmoid transfer function. Source: MathWorks, Product help]

z3 = W2 * a2 + repmat(b2,1,batchSize);
hwb = z3; % makes it a LINEAR DECODER!

The next step in the calculation is to compute the activations of the output units, where the
chosen method is to use a linear activation function instead of a sigmoid, since a linear
function will not constrain the output values to the [0,1] interval as the sigmoid does.
Specifically, our linear function is the identity function applied to z3, so we can assign z3 to the
output unit variable, hwb, right away.
Let's proceed forward. Now we compute the squared error term of the cost function, where we
penalize any large deviation of the output activation, hwb, from the original input variable,
data. The incentive of the neural network becomes that the target value should be equal to
the input.

for l=1:batchSize
diff = hwb(:,l) - data(:,(t-1)*batchSize+l);
summa = summa + (diff'*diff) * 0.5;
end

The next line adds the current activations to pj, which we use to accumulate the sum of the
activations over all examples so that we will be able to calculate an average activation per node.

pj = pj + a2;

The next two lines are simply housekeeping, where we increment the variables holding the
range of examples to be processed next. This is also the end of the for loop that iterates
through each batch in order to complete the feed-forward step.

batchStart = batchStart + batchSize;
batchEnd = batchEnd + batchSize;
end % end of 'for t=1:numOfBatches'

As can be seen in the next line, we calculate the average activations of the hidden units.
pj = sum(pj,2) * numOfExamples^-1;
Then we have our cost term by averaging over all of the examples:
cost = summa / numOfExamples;
In the next code section we can see how the weight decay term is calculated:
W2Sum = sum(sum(W2 .* W2));
W1Sum = sum(sum(W1 .* W1));
cost = cost + (lambda/2) * (W1Sum + W2Sum);
The vectorized notation obscures the code again. The weight decay term acts as a regularization
agent whose purpose is to avoid over-fitting. Let's say we have our cost function
measuring the error term E(w) given w, denoting the weight matrix. Then we should use
Emod(w) = E(w) + (lambda/2) * ||w||^2. This has the effect of penalizing large weights.
We should proceed now to the next step:
p = sparsityParam;
sumKL = sum(p*(log(p./pj)) + (1-p)*log((1-p) ./ (ones(hiddenSize,1)-pj)));
cost = cost + beta * sumKL;
In the above code snippet we are making use of our desired sparsity parameter for the
activations. We are effectively penalizing any deviation from our chosen sparsity value and
assigning that penalty to the cost function. The method of calculating this penalty is based on the
Kullback-Leibler divergence, which measures the difference between two probability
distributions. It tries to measure the information loss that occurs when we try to approximate or
substitute one distribution with a different distribution. Here the distributions have the means
p and pj, where p represents our desired average activation and pj is the actual average activation.
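Written out in the notation of the code above, the penalty contributed by one hidden unit is KL(p || pj) = p*log(p/pj) + (1-p)*log((1-p)/(1-pj)), the Kullback-Leibler divergence between two Bernoulli distributions with means p and pj; it is zero when pj equals p and grows as the actual average activation pj moves away from the desired sparsity value.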
Finally we calculate the sparsity term by:
sparsityTerm = beta*( (-1*p)./pj + (1-p)./(ones(hiddenSize,1)-pj) );
Now we have reached the point where we are ready to apply back-propagation:
% HERE CALCULATE IN A LOOP THE FEEDFORWARD ACTIVATIONS AGAIN!
% initialize the gradient accumulators
W1grad = zeros(size(W1)); W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); b2grad = zeros(size(b2));
batchStart = 1;
batchEnd = batchSize;
for t=1:numOfBatches

z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);
a2 = sigmoid(z2);
z3 = W2 * a2 + repmat(b2,1,batchSize);
hwb = z3; % makes it a LINEAR DECODER!

errorOutput = (-1*(data(:,batchStart:batchEnd) - hwb)); % with a linear decoder no derivOutput term is needed since its derivative is 1
derivHidden = a2 .* (ones(hiddenSize, batchSize) - a2);
% add the sparsity term!
errorHidden = (W2'*errorOutput + repmat(sparsityTerm, 1,batchSize) ).*derivHidden;

W2grad = W2grad + errorOutput * a2';
b2grad = b2grad + sum(errorOutput,2);

W1grad = W1grad + errorHidden * data(:,batchStart:batchEnd)';
b1grad = b1grad + sum(errorHidden,2);

batchStart = batchStart + batchSize;
batchEnd = batchEnd + batchSize;
end
We are required to calculate the feed-forward activations once again due to our batch
processing solution. This incurs a performance penalty on our algorithm but makes it possible to
operate on machines with limited memory resources. By applying back-propagation
we have a very efficient way to calculate the derivatives of our cost function. We are
effectively calculating the error terms between our activations and the desired target value, namely
our original input, data.
What we are left with is to perform an averaging of the gradients.
W2grad = W2grad * (numOfExamples^-1);
b2grad = b2grad * (numOfExamples^-1);
W1grad = W1grad * (numOfExamples^-1);
b1grad = b1grad * (numOfExamples^-1);

% add the weight decay term to W2grad and W1grad
W2grad = W2grad + lambda*W2;
W1grad = W1grad + lambda*W1;
Lastly we need to add the weight decay term to the gradients as shown above.
Finally we unroll the weights and the bias values into a vector:
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

By passing the previously described cost function to be minimized to the l-BFGS optimizer
after several iterations we can get our hands on the optimal weights and the associated cost
value:
[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
visibleSize, hiddenSize, ...
lambda, sparsityParam, ...
beta, patches, batchSize), ...
theta, options);
As a last step it is worthwhile and interesting to visually represent the learned features by:
W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);
b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
displayColorNetwork( (W*ZCAWhite)');

Running the algorithm resulted in the picture shown below:

[Figure: visualization of the learned features produced by displayColorNetwork, showing edge-like color patches.]
One can easily spot the edges learned from the pictures in the above image. It is simply an
image representation of the optimal weights obtained by running the optimizer.
The previous picture is useful for debugging purposes as well. In case most of the squares
contain only blurred images and no clear edge detection is present, we should suspect that
our algorithm is not working as it is supposed to.
As a final step if one used a profiler during execution it is advisable to stop the profiler and
save the results to identify possible bottlenecks in the code execution:
profile viewer
p = profile('info');
profsave(p,'profile_results')
d) Classification
If we look back to our run.m top level script we can see that the next step in our processing is
to step into the cnn.m file.
As a first step we initialize the environment once again with our parameters regarding the
architecture of the neural network, the dimension of the image patches, etc.
Please remember that we have saved the optimal weights of the network along with the ZCA
whitening information used at the end of our sparse auto-encoder algorithm:
save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');
At this point the next is to load the data back to memory by issuing the following command:
load imagenet_Features;
This command will load into RAM the optimal features, the matrix used for ZCA whitening
and the mean values for each input variable (in our case 192).
Once again we reshape our incoming features into matrices representing the weights and the
bias term by:
W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);
b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

The next step is to load our labeled dataset containing the training and the test data. These
images will be used during our supervised classification task.
load trainImages;
load testImages;

Both trainImages and testImages are MATLAB-format files containing raw colored pictures
in matrix form. Obtaining labeled images for each category is a very expensive process. It is
often very difficult to obtain massive amounts of labeled data per category. One of the
ways I collected these labeled images was by going to www.flickr.com. One can search for
images by typing the required category into the search field and a reasonable amount of
pictures will be presented by the web site. Obviously the categorization by Flickr is far from
perfect, and manual picking of images is necessary to make sure that we are indeed acquiring
images of the correct category. I have devised four categories:
Cars
Cats
Tigers
Trees
By manually downloading a small number of samples from each category one can easily
divide them into training and test sets. As a preprocessing step it is necessary to transform
the images into a uniform and rather small dimension (64 x 64) for computational purposes,
which can be done with the following function:
function IMAGES = loadandResizeImages(pathToDir, row, col)

fileFolder = fullfile(pathToDir);
dirOutput = dir(fullfile(fileFolder,'*.jpg'));
fileNames = {dirOutput.name}';
numFrames = numel(fileNames);

I = imread(fileNames{1});

% Preallocate the array
IMAGES = zeros([row col 3 numFrames],class(I));


for i=1:numFrames
I = imread(fileNames{i});
I = imresize(I,[row col]);
imwrite(I, strcat(pathToDir,'resized/',fileNames{i}(:,1:end-4), '.bmp'));
fprintf('Resized picture number and saved: %d\n', i);
end

Once we have the resized images on disk we should package them into MATLAB matrix
form.
At this point we are ready to apply convolution and pooling on both the training and the test
data. Convolution and pooling, among other things, exploit certain statistical similarities of
neighboring parts of the image. This results in significant computational ease, since we can
represent the incoming data more succinctly. In our case we convolve the learned 8 x 8 features
with the large incoming labeled images. Pooling is very useful for statistical aggregation: it takes
large contiguous, non-overlapping parts of the image and transforms each of them into a single
value. It makes our algorithm much more computationally feasible.
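To get a feel for the dimensions involved (an illustrative calculation of my own; the 19 x 19 pooling region is only an assumed example value): convolving one 64 x 64 image with one 8 x 8 feature yields a (64 - 8 + 1) x (64 - 8 + 1) = 57 x 57 map of activations, and pooling that map over non-overlapping 19 x 19 regions reduces it to 3 x 3 values, so with 400 learned features each image ends up being described by 400 x 3 x 3 = 3,600 numbers instead of 64 x 64 x 3 = 12,288 raw pixel values.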
For memory and performance reasons we conduct the convolution and pooling steps in
several iterations over the features.
Let's look at the signature of the convolution function:
function convolvedFeatures = cnnConvolve(patchDim, numFeatures, images, W, b, ZCAWhite, ...
meanPatch)
In the next few instructions we allocate space for our variables and for the final convolved
features.
numImages = size(images, 4);
imageDim = size(images, 1);
imageChannels = size(images, 3);
convolvedFeatures = zeros(numFeatures, numImages, imageDim - patchDim + 1, imageDim - patchDim + 1);
We need to apply ZCA whitening once again:
WT = W*ZCAWhite;
bT = b - WT * meanPatch;
WT = reshape(WT, numFeatures, patchDim*patchDim, imageChannels);

Then we conduct convolution on each image and each feature and each channel. A convolved
image is created by executing the following instructions:
feature = flipud(fliplr(squeeze(feature)));
im = squeeze(images(:, :, channel, imageNum));
convolvedImage = convolvedImage + conv2(im, feature, 'valid');
Then we apply the sigmoid function in order to produce the activations:
convolvedImage = sigmoid(convolvedImage);
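For completeness, the following is a minimal sketch of my own showing how these instructions fit into the surrounding loops (it assumes the variables defined in the snippets above; the actual cnnConvolve code may differ in details such as where the bias is added):

for imageNum = 1:numImages
  for featureNum = 1:numFeatures
    convolvedImage = zeros(imageDim - patchDim + 1, imageDim - patchDim + 1);
    for channel = 1:imageChannels
      % take the whitened feature for this channel and flip it for conv2
      feature = reshape(WT(featureNum, :, channel), patchDim, patchDim);
      feature = flipud(fliplr(squeeze(feature)));
      im = squeeze(images(:, :, channel, imageNum));
      convolvedImage = convolvedImage + conv2(im, feature, 'valid');
    end
    % add the whitening-adjusted bias and apply the sigmoid activation
    convolvedImage = sigmoid(convolvedImage + bT(featureNum));
    convolvedFeatures(featureNum, imageNum, :, :) = convolvedImage;
  end
end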
The pooling is performed by the following function:
cnnPool.m
function pooledFeatures = cnnPool(poolDim, convolvedFeatures)
numImages = size(convolvedFeatures, 2);
numFeatures = size(convolvedFeatures, 1);
convolvedDim = size(convolvedFeatures, 3);

pooledFeatures = zeros(numFeatures, numImages, floor(convolvedDim / poolDim), floor(convolvedDim / poolDim));
pool_length = floor(convolvedDim / poolDim);
rowbegin = 0;
rowend = 0;
columnbegin = 0;
columnend = 0;

for i = 1 : numFeatures
for j = 1 : numImages
for r = 1 : pool_length
for c = 1 : pool_length
rowbegin = 1 + poolDim * (r-1);
rowend = poolDim * r;
columnbegin = 1 + poolDim * (c-1);
columnend = poolDim * c;
pooledFeatures(i, j, r, c) = ...
mean(mean(convolvedFeatures(i, j, rowbegin : rowend, columnbegin : columnend)));
end
end
end
end



We are pooling the features over poolDim x poolDim regions so that we produce a matrix
with dimensions numFeatures x numImages x floor(convolvedDim/poolDim) x
floor(convolvedDim/poolDim), where a single value within this matrix corresponds to a single pooling region.
After we are done with convolving and pooling all the features on each image, it is again
recommended to save the result by:
save('pooledFeatures.mat', 'pooledFeaturesTrain', 'pooledFeaturesTest');
Now we are presented with a computationally feasible supervised classification task. We are
going to use the pooledFeaturesTrain and pooledFeaturesTest to train a softmax classifier.
softmaxLambda = 1e-4;
numClasses = 4; %2
% Reshape the pooledFeatures to form an input vector for softmax
softmaxX = permute(pooledFeaturesTrain, [1 3 4 2]);
softmaxX = reshape(softmaxX, numel(pooledFeaturesTrain) / numTrainImages,...
numTrainImages);
softmaxY = trainLabels;

options = struct;
options.maxIter = 200;
softmaxModel = softmaxTrain(numel(pooledFeaturesTrain) / numTrainImages,...
numClasses, softmaxLambda, softmaxX, softmaxY, options);
Let's open the softmaxTrain function to see how it operates exactly.
softmaxTrain.m
function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, ...
options)
if ~exist('options', 'var')
options = struct;
end

if ~isfield(options, 'maxIter')
options.maxIter = 400;
end
% initialize parameters

theta = 0.005 * randn(numClasses * inputSize, 1);
addpath minFunc/
options.Method = 'lbfgs';
minFuncOptions.display = 'on';
[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
numClasses, inputSize, lambda, ...
inputData, labels), ...
theta, options);

softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;

In case we did not provide the maximum iteration number in the options structure, we are
going to perform 400 iterations. Then we randomly initialize the variable to be optimized, theta.
We specify the optimizer to perform an l-BFGS search and we pass the function softmaxCost
to the optimizer. After we are finished with the optimization we save the results into a
structure. But let's look into the cost function to gain some understanding of the softmax classifier:
softmaxCost.m
function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

theta = reshape(theta, numClasses, inputSize);
numCases = size(data, 2);
groundTruth = full(sparse(labels, 1:numCases, 1));
cost = 0;
thetagrad = zeros(numClasses, inputSize);
M = theta * data;
M = bsxfun(@minus, M, max(M, [], 1));
expM = exp(M);
normTerm = 1./sum(expM,1);
h = (repmat(normTerm',1,numClasses) .* expM')';
probSum = 0;

for i=1:numCases
for j=1:numClasses
probSum = probSum + groundTruth(j,i) * log(h(j,i));
end
end


cost = -1*(numCases^-1) * probSum;
weightTerm = (lambda/2) * sum(sum(theta .* theta));
cost = cost + weightTerm;
% calculate the gradient
thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta;
% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];
end

This function accepts the following parameters:
numClasses - the number of classes
inputSize - the size N of the input vector
lambda - weight decay parameter
data - the N x M input matrix, where each column data(:, i) corresponds to a single example
labels - an M x 1 matrix containing the labels corresponding to the input data
The instruction M = theta * data makes M contain the theta*x exponents for each class and
each example. M = bsxfun(@minus, M, max(M, [], 1)) has the purpose of subtracting the maximum
of each theta*x vector for each example so that we do not have large values that might cause
overflow. expM = exp(M) computes the exponentials of the theta*x matrix. With normTerm =
1./sum(expM,1) we first compute the normalizing term, which makes the exponentials sum to one.
We then compute the probabilities, where the two instructions
h = (repmat(normTerm',1,numClasses) .* expM')' and
probSum = probSum + groundTruth(j,i) * log(h(j,i))
are the important ones.
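In other words, h holds for each class j and example i the standard softmax probability, which in the notation of this code is h(j,i) = exp(theta(j,:)*data(:,i)) / sum over k of exp(theta(k,:)*data(:,i)); subtracting the column-wise maximum beforehand leaves these ratios unchanged while keeping the exponentials in a numerically safe range.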
Finally we compute the weight decay term by: weightTerm = (lambda/2) * sum(sum(theta .*
theta)) and the cost will become:
cost = -1*(numCases^-1) * probSum;
cost = cost + weightTerm;
The last step is to calculate the gradient matrix by:
thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta

Finally we unroll the gradients into vector form by: grad = [thetagrad(:)];
We are now ready to test our trained classifier's predictive capabilities on a separate test set
containing our labeled images.
softmaxX = permute(pooledFeaturesTest, [1 3 4 2]);
softmaxX = reshape(softmaxX, numel(pooledFeaturesTest) / numTestImages, numTestImages);
softmaxY = testLabels;
[pred] = softmaxPredict(softmaxModel, softmaxX);
acc = (pred(:) == softmaxY(:));
acc = sum(acc) / size(acc, 1);
fprintf('Accuracy: %2.3f%%\n', acc * 100);
The softmaxX will represent the pooled test images and softmaxY will carry the labels
associated with each test example. We pass into our softmaxPredict function the previously
acquired classification model along with the test examples and it will return the predictions
made on each example.
softmaxPredict.m
function [pred] = softmaxPredict(softmaxModel, data)
theta = softmaxModel.optTheta;
pred = zeros(1, size(data, 2));
M = theta * data;
M = bsxfun(@minus, M, max(M, [], 1));
expM = exp(M);
normTerm = 1./sum(expM,1);
h = (repmat(normTerm',1,softmaxModel.numClasses) .* expM')';
[y,i] = max(h);
pred = i;
end

By using the [y,i] = max(h) expression we are effectively choosing the category that has the
highest probability the example belongs to. Finally we are returning the vector containing the
labels of our categories that we predicted for each example.
Upon return we can easily compare the actual labels of the test examples and what our model
has just predicted and finally print the accuracy of our prediction:
[pred] = softmaxPredict(softmaxModel, softmaxX);
acc = (pred(:) == softmaxY(:));

acc = sum(acc) / size(acc, 1);
fprintf('Accuracy: %2.3f%%\n', acc * 100);
e) Image database construction
I had three ways of collecting images during my research. The first one was very
straightforward. I used the STL-10 dataset provided by a link on the web site of the UFLDL
tutorial. The goal was to ensure that my implementation worked just as in the lecture notes.
After ensuring my implementation was working well I decided to try the algorithm on a
different dataset. My strategy was to randomly download a massive amount of images from the
Internet, as implied in (Raina et al., 2007). I chose Flickr as my search engine for images. The
dataset for the unsupervised sparse auto-encoder algorithm was compiled by searching for
images with the keyword "life". I successfully downloaded about 2000 unlabeled images in this
general category. This dataset was used to create the activation features to be used
later in the softmax classification task.
I have sampled images in four categories: car, cat, tiger, tree. I have divided each category
into a training and a test set to eliminate any sampling bias. Each training set consisted of around
40-50 images in the given category. Each test set contained around 15-20 images. (This
exemplifies how difficult it is to get labeled images, contrasted with simply downloading
random images.)
By training and testing I was able to achieve a ~85% accuracy rate. If I used random weights
instead of the optimized activation features in the softmax training, the prediction rate
dropped to ~57%! It was clear that the sparse auto-encoder algorithm contributed a significant
improvement to the prediction performance.
The third way to acquire images was to download images from the ImageNet database. I
downloaded the 2011 Fall dataset, which consisted of around 14 million direct image URLs. I wrote
a small Python script that I used to download randomly sampled images.
import requests
import numpy as np

imagesDownloaded = 0
index = 0

file_database = open("fall11_urls.txt", "r")
lines = file_database.readlines()
numOfLines = len(lines)
randomIndexes = np.random.permutation(range(numOfLines))
outFolder = "2011Fall"
while imagesDownloaded < 20000:
    for line in lines:
        if imagesDownloaded >= 20000:
            break                     # stop once the requested number of images has been reached
        index = index + 1
        try:
            url = lines[randomIndexes[index]].split('\t')[1].strip()
            fileName = url.split('/')[-1]
            req = requests.get(url, timeout=1)
            outFile = open(outFolder + "/" + fileName, "wb")
            outFile.write(req.content)
            outFile.close()
            imagesDownloaded = imagesDownloaded + 1
            print "Downloaded file: ", imagesDownloaded, "Random index: ", randomIndexes[index]
        except:
            print "Problem at line: ", line
print "Done."
file_database.close()
print "Number of files downloaded: ", imagesDownloaded
By using the above script I quickly acquired 20,000 images to work with. The next step was
to clean the dataset of invalid images, since it contained a significant number of stale
images.
import imghdr, os
import glob

fileList = glob.glob(os.getcwd() + "/*")
print "Number of files to check: ", len(fileList)
jpegCount = 0
noTypeCount = 0
otherTypeCount = 0
numFilesChecked = 0
for filePath in fileList:
    imgType = imghdr.what(filePath)
    tokenz = filePath.split('/')
    if imgType == "jpeg":
        tokenz[-1] = str(numFilesChecked)
        newPath = "/".join(tokenz) + ".jpg"
        os.rename(filePath, newPath)
        jpegCount = jpegCount + 1
    elif (imgType == "" or imgType == None):
        os.remove(filePath)
        noTypeCount = noTypeCount + 1
    else:
        os.remove(filePath)
        otherTypeCount = otherTypeCount + 1
    numFilesChecked = numFilesChecked + 1
    print numFilesChecked
print "Jpeg count: ", jpegCount
print "Other type: ", otherTypeCount
print "No type: ", noTypeCount

f) Enhancements of the standard algorithm
I have made several enhancements to the main algorithm proposed by the Stanford class. I did
the enhancements mainly to increase the memory efficiency of the learning process.
In the following section I will enumerate the changes I introduced into my implementation:
Compile the raw images into batches and save them in the standard MATLAB format. By
introducing this feature, one is capable of working with the images in batch iterations, meaning
it is not necessary to have all the images in RAM at once. It enabled me to create a much larger
sample size.

Training the neural network with limited-memory batch processing. To be fair, this was a
recommendation by the UFLDL site, but they did not elaborate much on the process. By training
the network in batches one does not need to read all the input into RAM, but can train and
optimize the network in successive iterations. On the one hand this is more efficient, since the
load on RAM is smaller and finding the optimal weights is faster (the weights converge faster);
on the other hand it is less efficient, since a redundant factor has been introduced, namely the
need to calculate the feedforward activations twice in order to get the average activations
required by the back-propagation step.

Vectorized forms of all the computationally expensive algorithms. This makes the code more
difficult to read and comprehend, but on the other hand MATLAB can work with vectorized
forms very efficiently thanks to its sophisticated matrix manipulation libraries.

Testing the algorithm on real-life datasets. By this I mean that I simply downloaded images
from the Internet, created categories, and transformed the images to verify the validity of the
algorithm.

g) Experiments
After running the sparse auto-encoder on the STL-10 sampled patches (100,000 8 x 8 pixel
patches) I was able to produce a visually very similar feature activation to the one shown on
the tutorial's web site.

Dataset   Number of categories   Method                                       Prediction accuracy
STL-10    4                      Raw (random weights without pre-training)    ~58%
STL-10    4                      Sparse auto-encoder                          ~81%

The above table shows the effect of pre-training the weights with the unsupervised learning
step. We can achieve much better classification performance if we support the supervised
learning task by first running an unsupervised algorithm.
My experience with datasets other than STL-10 reinforced the validity of the self-taught
learning method, even though I was never able to achieve classification performance better
than 85%.

Data set         Unlabeled data set size   Labeled data   Categories                     Accuracy
www.flickr.com   2000                      200            4 (car, cat, tiger, tree)      85%
www.flickr.com   2000 (same as above)      400            2 (male face, female face)     72%
ImageNet         15000                     200            4 (car, cat, tiger, tree)      81%

The above table represents my experiments with custom datasets. I wanted to see whether the
algorithm works well with images compiled by a user simply by randomly downloading
images. The algorithm works well at differentiating significantly different objects, like cats
from trees or trees from cars. But it has serious problems deciding between cats and tigers, for
instance. I observed the same issue with the STL-10 dataset: it couldn't really tell whether it
saw a cat or a non-cat mammal.
4. Conclusion
Self-taught learning holds great promise for future research and for classification tasks where
we do not have access to a plethora of labeled images, since it is much easier and cheaper to
simply acquire tons of unlabeled images.
A further direction of my research will be to change the underlying architecture and its
parameters. When I tried to use bigger datasets I was not able to show significantly greater
performance, thus I suspect one promising way forward is to use stacked auto-encoders. The
idea is nicely presented at the UFLDL site
(http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders).
My next step in further development and research will be to expand the algorithm with
stacked auto-encoders. By having only a single hidden layer one can only extract edges
from the images, working effectively as an edge detector. Inserting multiple hidden layers
could extract deeper underlying features from images. They could detect longer contours, or
perhaps detect simple "parts of objects." An even deeper layer might then group together these
contours or detect even more complex features
(http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview).
5. Acknowledgments
In this section I would like to express my gratitude to my supervisor, Bálint Antal, who
guided me in my research and presented to me the idea of self-taught learning. His help and
his guidance immensely determined the direction and the success of my research in this topic.
I would like to thank Andrew Ng for his great lectures at Coursera, where I was first exposed
to the ideas and techniques of machine learning in general. His course on Machine Learning,
freely available on YouTube, and his handouts greatly helped me to learn the fundamentals of
machine learning. His UFLDL tutorial on deep learning ideas, along with implementation tips,
should be the standard of education at every institute.






















References

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng. Self-taught
Learning: Transfer Learning from Unlabeled Data. 2007.
Bruno A. Olshausen, David J. Field. Sparse coding of sensory inputs. Current Opinion in
Neurobiology, 14(4):481-487, 2004.
Honglak Lee, Alexis Battle, Rajat Raina, Andrew Y. Ng. Efficient sparse coding algorithms.
2006.
Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and
New Perspectives. 2012.
Tom M. Mitchell. Machine Learning. March 1, 1997.
Ethem Alpaydın. Introduction to Machine Learning, 2nd ed. The MIT Press, Cambridge,
Massachusetts / London, England, 2010.
Hinton, G. E. Supervised learning in multilayer neural networks. In: The MIT Encyclopedia of
the Cognitive Sciences, editors: Robert A. Wilson and Frank C. Keil. The MIT Press, 1999.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-
propagating errors. Nature, 323:533-536, 1986.
Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. Text classification from labeled and
unlabeled documents using EM. Machine Learning, 39:103-134, 2000.
B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381:607-609, 1996.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy
employed by V1? Vision Research, 37:3311-3325, 1997.
M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural
Computation, 12(2), 2000.
B. A. Olshausen. Sparse coding of time-varying natural images. Journal of Vision, 2(7):130,
2002.
MathWorks Inc. Neural Network Product Help.


Sources from the Internet:
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Neural_Networks (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Linear_Decoders (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Pooling (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders (accessed 5 November 2013)
http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview (accessed 5 November 2013)
