Different Advanced Algorithms for Machine Learning
Supervised learning --- where the algorithm generates a function that maps inputs
to desired outputs. One standard formulation of the supervised learning task is the
classification problem: the learner is required to learn (to approximate the behavior
of) a function which maps a vector into one of several classes by looking at several
input-output examples of the function.
Unsupervised learning --- which models a set of inputs: labeled examples are not
available.
Semi-supervised learning --- which combines both labeled and unlabeled examples
to generate an appropriate function or classifier.
Reinforcement learning --- where the algorithm learns a policy of how to act given
an observation of the world. Every action has some impact in the environment, and
the environment provides feedback that guides the learning algorithm.
Transduction --- similar to supervised learning, but does not explicitly construct a
function: instead, it tries to predict new outputs based on training inputs, training
outputs, and new inputs.
Learning to learn --- where the algorithm learns its own inductive bias based on
previous experience.
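To make the contrast between the first two paradigms concrete, the following is a minimal sketch, assuming NumPy and scikit-learn are available; the data and variable names are purely illustrative. The supervised learner is given the desired outputs, while the unsupervised learner sees only the inputs:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)            # desired outputs (labels)

    clf = LogisticRegression().fit(X, y)         # supervised: uses the labels
    km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: labels unused

    print(clf.predict(X[:3]))   # classes predicted by the learned function
    print(km.labels_[:3])       # cluster ids discovered from the inputs alone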
Supervised learning is the most common technique for training neural networks and
decision trees. Both of these techniques are highly dependent on the information given by
the pre-determined classifications. In the case of neural networks, the classification is used
to determine the error of the network and then adjust the network to minimize it, and in
decision trees, the classifications are used to determine what attributes provide the most
information that can be used to solve the classification puzzle. We'll look at both of these in
more detail, but for now, it should be sufficient to know that both of these examples thrive
on having some "supervision" in the form of pre-determined classifications.
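As a small illustration of the decision-tree side of this, the sketch below (NumPy assumed; the data is illustrative) computes the information gain of two attributes: the entropy of the pre-determined classifications minus the entropy that remains after splitting on the attribute. This is the quantity a tree learner can use to decide which attribute is most informative:

    import numpy as np

    def entropy(labels):
        # Shannon entropy of the class distribution, in bits.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(attribute, labels):
        # Entropy of the labels minus the weighted entropy after splitting.
        gain = entropy(labels)
        for v in np.unique(attribute):
            mask = attribute == v
            gain -= mask.mean() * entropy(labels[mask])
        return gain

    # Illustrative data: 'outlook' separates the labels, 'windy' barely does.
    outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain"])
    windy   = np.array([True, False, True, False, False, True])
    play    = np.array([1, 1, 0, 0, 1, 0])
    print(information_gain(outlook, play))  # 1.0: fully informative
    print(information_gain(windy, play))    # ~0.08: nearly uninformative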
Inductive machine learning is the process of learning a set of rules from instances (examples
in a training set), or, more generally speaking, creating a classifier that can
be used to generalize to new instances. The process of applying supervised ML to a real-
world problem is described in Figure 1. The first step is collecting the dataset. If a requisite
expert is available, then s/he could suggest which fields (attributes, features) are the most
informative. If not, then the simplest method is that of brute-force, which means
measuring everything available in the hope that the right (informative, relevant) features
can be isolated. However, a dataset collected by the brute-force method is not directly
suitable for induction. In most cases it contains noise and missing feature values, and
therefore requires significant pre-processing, according to Zhang et al. (Zhang, 2002).
The second step is data preparation and data pre-processing. Depending on the
circumstances, researchers have a number of methods to choose from to handle missing data
(Batista, 2003). Hodge et al. (Hodge, 2004) have recently introduced a survey of
contemporary techniques for outlier (noise) detection, identifying the advantages and
disadvantages of each technique. Instance selection is used not only to handle noise but
also to cope with the infeasibility of learning from very large datasets. Instance
selection in these datasets is an optimization problem that attempts to maintain the mining
quality while minimizing the sample size. It reduces the data and enables a data mining
algorithm to function and work effectively with very large datasets. There is a variety of
procedures for sampling instances from a large dataset; see figure 2 below.
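As a rough sketch of two of these preparation steps, the following (NumPy assumed; the data and names are illustrative) imputes missing values with column means, which is one of the simplest of the methods surveyed, and then selects a random subset of instances:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100_000, 5))
    data[rng.random(data.shape) < 0.01] = np.nan   # 1% missing feature values

    # Mean imputation: replace each missing entry with its column's mean.
    col_means = np.nanmean(data, axis=0)
    rows, cols = np.where(np.isnan(data))
    data[rows, cols] = col_means[cols]

    # Instance selection by uniform random sampling: shrink the dataset
    # handed to the learner while trying to preserve mining quality.
    idx = rng.choice(len(data), size=5_000, replace=False)
    sample = data[idx]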
Feature subset selection is the process of identifying and removing as many irrelevant and
redundant features as possible (Yu, 2004). This reduces the dimensionality of the data and
enables data mining algorithms to operate faster and more effectively. The fact that many
features depend on one another often unduly influences the accuracy of supervised ML
classification models. This problem can be addressed by constructing new features from the
basic feature set. This technique is called feature construction/transformation. These newly
generated features may lead to the creation of more concise and accurate classifiers. In
addition, the discovery of meaningful features contributes to better comprehensibility of the
produced classifier, and a better understanding of the learned concept.
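A minimal sketch of feature subset selection, assuming scikit-learn is available: a simple univariate filter scores each feature against the labels and keeps the k highest-scoring ones (the data and names are illustrative):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)
    informative = y[:, None] + rng.normal(0, 0.5, (200, 2))  # related to y
    noise = rng.normal(size=(200, 8))                        # irrelevant features
    X = np.hstack([informative, noise])

    selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print(selector.get_support(indices=True))  # expected: the 2 informative columns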
Speech recognition using hidden Markov models and Bayesian networks relies on some
elements of supervision as well, in order to adjust parameters to, as usual, minimize the
error on the given inputs.

Notice something important here: in the classification problem, the goal of the
learning algorithm is to minimize the error with respect to the given inputs. These inputs,
often called the "training set", are the examples from which the agent tries to learn. But
learning the training set well is not necessarily the best thing to do. For instance, if I tried to
teach you exclusive-or, but only showed you combinations consisting of one true and one
false, but never both false or both true, you might learn the rule that the answer is always
true. Similarly, with machine learning algorithms, a common problem is overfitting the
data and essentially memorizing the training set rather than learning a more general
classification technique. As you might imagine, not all training sets have the inputs
classified correctly. This can lead to problems if the algorithm used is powerful enough to
memorize even the apparently "special cases" that don't fit the more general principles. This,
too, can lead to overfitting, and it is a challenge to find algorithms that are both powerful
enough to learn complex functions and robust enough to produce generalisable results.
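The following sketch makes this concrete (scikit-learn assumed; the data is illustrative): a deep decision tree memorizes a noisily labelled training set almost perfectly, while a depth-limited tree gives up some training accuracy but generalizes better to held-out data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    y[rng.random(400) < 0.1] ^= 1              # 10% mislabelled "special cases"

    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    for depth in (None, 3):                    # unlimited vs restricted tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
        print(depth, tree.score(Xtr, ytr), tree.score(Xte, yte))
    # Typically: depth=None scores ~1.0 on training data but worse on the
    # test set than the restricted tree, i.e. it memorized rather than learned.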
[Figure 1. The process of supervised machine learning: problem, identification of data, data pre-processing, algorithm selection, parameter tuning and training, followed by an evaluation loop (NO/YES) that returns to earlier steps until an acceptable classifier is produced.]
In reinforcement learning, the agent is rewarded for taking certain actions and punished
for taking others. Often, a form of reinforcement learning can be used
for unsupervised learning, where the agent bases its actions on the previous rewards and
punishments without necessarily even learning any information about the exact ways that
its actions affect the world. In a way, all of this information is unnecessary because by
learning a reward function, the agent simply knows what to do without any processing
because it knows the exact reward it expects to achieve for each action it could take. This can
be extremely beneficial in cases where calculating every possibility is very time consuming
(even if all of the transition probabilities between world states were known). On the other
hand, it can be very time consuming to learn by, essentially, trial and error. But this kind of
learning can be powerful because it assumes no pre-discovered classification of examples. In
some cases, for example, our classifications may not be the best possible. One striking
example is that the conventional wisdom about the game of backgammon was turned on its
head when a series of computer programs (Neurogammon and TD-Gammon) that learned
through unsupervised learning became stronger than the best human backgammon players merely
by playing themselves over and over. These programs discovered some principles that
surprised the backgammon experts and performed better than backgammon programs
trained on pre-classified examples.

A second type of unsupervised learning is called clustering. In this type of learning, the
goal is not to maximize a utility function, but simply to find similarities in the training
data. The assumption is often that the clusters discovered will match reasonably well with
an intuitive classification. For instance, clustering individuals based on demographics
might result in a clustering of the wealthy in one group and the poor in another. Although
the algorithm won't have names to assign to these clusters, it can produce them and then
use those clusters to assign new examples into one or the other of the clusters. This is a
data-driven approach that can work well when there is sufficient data; for instance, social
information filtering algorithms, such as those that Amazon.com uses to recommend books,
are based on the principle of finding similar groups of people and then assigning new users
to groups. In some cases, such as with social information filtering, the information about
other members of a cluster (such as what books they read) can be sufficient for the
algorithm to produce meaningful results. In other cases, it may be that the clusters are
merely a useful tool for a human analyst.

Unfortunately, even unsupervised learning suffers from the problem of overfitting the
training data. There is no silver bullet for avoiding the problem, because any algorithm that
can learn from its inputs needs to be quite powerful.
Unsupervised learning algorithms, according to Ghahramani (Ghahramani, 2008), are
designed to extract structure from data samples. The quality of a structure is measured by a
cost function, which is usually minimized to infer optimal parameters characterizing the
hidden structure in the data. Reliable and robust inference requires a guarantee that
extracted structures are typical for the data source, i.e., similar structures have to be
extracted from a second sample set of the same data source. Lack of robustness is known as
overfitting in the statistics and machine learning literature. This phenomenon has been
characterized for a class of histogram clustering models which play a prominent role in
information retrieval, linguistic, and computer vision applications. Learning algorithms
with robustness to sample fluctuations can be derived from large deviation results and the
maximum entropy principle for the learning process.
Linear Classifiers:
    Logistic Regression
    Naïve Bayes Classifier
    Perceptron
    Support Vector Machine
Quadratic Classifiers
K-Means Clustering
Boosting
Decision Tree
Random Forest
Neural Networks
Bayesian Networks
Linear Classifiers: In machine learning, the goal of classification is to group items that have
similar feature values into groups. Timothy et al. (Timothy Jason Shepard, 1998) stated that
a linear classifier achieves this by making a classification decision based on the value of
the linear combination of the features. If the input feature vector to the classifier is
a real vector \vec{x}, then the output score is

y = f(\vec{w} \cdot \vec{x}) = f\left( \sum_j w_j x_j \right),
where \vec{w} is a real vector of weights and f is a function that converts the dot product of the
two vectors into the desired output. The weight vector is learned from a set of labelled
training samples. Often f is a simple function that maps all values above a certain threshold
to the first class and all other values to the second class. A more complex f might give the
probability that an item belongs to a certain class.
For a two-class classification problem, one can visualize the operation of a linear classifier as
splitting a high-dimensional input space with a hyperplane: all points on one side of the
hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is
often used in situations where the speed of classification is an issue, since it is often the
fastest classifier, especially when \vec{x} is sparse. However, decision trees can be faster. Also,
linear classifiers often work very well when the number of dimensions in \vec{x} is large, as
in document classification, where each element in \vec{x} is typically the number of counts of a
word in a document (see document-term matrix). In such cases, the classifier should be well-
regularized.
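A minimal sketch of this scoring rule in NumPy (the weights, the threshold form of f, and the added bias term are illustrative, not learned):

    import numpy as np

    w = np.array([0.8, -0.4, 0.2])   # weight vector, one entry per feature
    b = 0.1                          # bias/threshold term (a common extension)

    def classify(x):
        score = np.dot(w, x) + b     # the linear combination w . x
        return 1 if score > 0 else 0 # f: threshold maps the score to a class

    print(classify(np.array([1.0, 0.5, 2.0])))  # prints 1 (score = 1.1)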
A Two-Dimensional Example
Before considering N-dimensional hyperplanes, let's look at a simple 2-dimensional
example. Assume we wish to perform a classification, and our data has a categorical target
variable with two categories. Also assume that there are two predictor variables with
continuous values. If we plot the data points using the value of one predictor on the X axis
and the other on the Y axis we might end up with an image such as shown below. One
category of the target variable is represented by rectangles while the other category is
represented by ovals.
In this idealized example, the cases with one category are in the lower left corner and the
cases with the other category are in the upper right corner; the cases are completely
separated. The SVM analysis attempts to find a 1-dimensional hyperplane (i.e., a line) that
separates the cases based on their target categories. There are an infinite number of possible
lines; two candidate lines are shown above. The question is which line is better, and how
we define the optimal line.
The dashed lines drawn parallel to the separating line mark the distance between the
dividing line and the closest vectors to the line. The distance between the dashed lines is
called the margin. The vectors (points) that constrain the width of the margin are the support
vectors. The following figure illustrates this.
An SVM analysis (Luis Gonz, 2005) finds the line (or, in general, hyperplane) that is
oriented so that the margin between the support vectors is maximized. In the figure above,
the line in the right panel is superior to the line in the left panel.
If all analyses consisted of two-category target variables with two predictor variables, and
the cluster of points could be divided by a straight line, life would be easy. Unfortunately,
this is not generally the case, so SVM must deal with (a) more than two predictor variables,
(b) separating the points with non-linear curves, (c) handling the cases where clusters
cannot be completely separated, and (d) handling classifications with more than two
categories.
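A sketch of such an SVM analysis using scikit-learn, mirroring the two-predictor, two-category example above (the data and names are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    lower_left = rng.normal([0, 0], 0.5, size=(20, 2))   # one category
    upper_right = rng.normal([3, 3], 0.5, size=(20, 2))  # the other category
    X = np.vstack([lower_left, upper_right])
    y = np.array([0] * 20 + [1] * 20)

    svm = SVC(kernel="linear", C=1.0).fit(X, y)  # maximum-margin separating line
    print(svm.support_vectors_)                  # the points that fix the margin
    print(svm.predict([[1.5, 1.5]]))             # which side of the hyperplane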
In this chapter, we shall explain three main machine learning techniques with their
examples and how they perform in reality. These are:
K-Means Clustering
Neural Network
Self-Organising Map
K-Means Clustering
K-means (Bishop C. M., 1995) and (Tapas Kanungo, 2002) is one of the simplest
unsupervised learning algorithms that solve the well-known clustering problem. The
procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids,
one for each cluster. These centroids should be placed in a cunning way, because different
locations cause different results. So the better choice is to place them as far away from each
other as possible. The next step is to take each point belonging to a given data set and
associate it with the nearest centroid. When no point is pending, the first step is completed and
an early grouping is done. At this point we need to re-calculate k new centroids as
barycentres of the clusters resulting from the previous step. After we have these k new
centroids, a new binding has to be done between the same data set points and the nearest
new centroid. A loop has been generated. As a result of this loop we may notice that the k
centroids change their location step by step until no more changes are done. In other words,
centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error
function. The objective function is

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2,
where \left\| x_i^{(j)} - c_j \right\|^2 is a chosen distance measure between a data point
x_i^{(j)} and the cluster centre c_j, and J is an indicator of the distance of the n data points
from their respective cluster centres.
The algorithm in Figure 4 is composed of the following steps:
1. Place k points into the space represented by the objects being clustered. These
points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
Although it can be proved that the procedure will always terminate, the k-means algorithm
does not necessarily find the optimal configuration, corresponding to the global
objective function minimum. The algorithm is also significantly sensitive to the initial
randomly selected cluster centres. The k-means algorithm can be run multiple times to
reduce this effect. K-means is a simple algorithm that has been adapted to many problem
domains. As we are going to see, it is a good candidate for extension to work with fuzzy
feature vectors.
An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we
know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster
i. If the clusters are well separated, we can use a minimum-distance classifier to separate
them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k
distances. This suggests the following procedure for finding the k means:
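In outline: start with initial guesses for the means m1, ..., mk; classify each sample into the cluster of its nearest mean; replace each mean with the average of the samples assigned to it; and repeat until no mean changes. A minimal NumPy sketch of this procedure (Euclidean distance and random initialization assumed; names are illustrative):

    import numpy as np

    def k_means(x, k, seed=0):
        rng = np.random.default_rng(seed)
        # Initial guesses: k randomly chosen samples serve as the means.
        means = x[rng.choice(len(x), size=k, replace=False)]
        while True:
            # Minimum-distance classification: each sample joins the
            # cluster of its nearest mean.
            dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Replace each mean with the barycentre of its cluster
            # (the empty-cluster case noted below is ignored here).
            new_means = np.array([x[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_means, means):   # nothing moved: converged
                return means, labels
            means = new_means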
Here is an example showing how the means m1 and m2 move into the centers of two
clusters.
This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm
for partitioning the n samples into k clusters so as to minimize the sum of the squared
distances to the cluster centers. It does have some weaknesses:
The way to initialize the means was not specified. One popular way to start is to
randomly choose k of the samples.
The results produced depend on the initial values for the means, and it frequently
happens that suboptimal partitions are found. The standard solution is to try a
number of different starting points.
It can happen that the set of samples closest to mi is empty, so that mi cannot be
updated. This is an annoyance that must be handled in an implementation, but that
we shall ignore.
The results depend on the metric used to measure || x - mi ||. A popular solution
is to normalize each variable by its standard deviation, though this is not always
desirable.
The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how
many clusters exist. In the example shown above, the same algorithm applied to the same
data produces the following 3-means clustering. Is it better or worse than the 2-means
clustering?
Unfortunately, there is no general theoretical solution for finding the optimal number of
clusters for any given data set. A simple approach is to compare the results of multiple runs
with different values of k and choose the best one according to a given criterion.
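As a sketch of this approach (scikit-learn assumed; the data is illustrative), the following compares runs with different k, using the within-cluster squared error (inertia) and the silhouette score as criteria:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 3, 6)])

    for k in range(2, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
    # Inertia always decreases as k grows, so one looks for an "elbow";
    # the silhouette score instead peaks at the best-matching k (here, 3).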