Machine Learning, Part I: Types of Learning Problems
_____________________________________________
•Machine learning is about just that: designing algorithms that allow a computer
to learn.
2. Decision problems
Utility functions are usually combined with probabilities about things like
the state of the world and the likelihood that the agent's actions will actually work
in the expected way. For instance, if you were programming a robot exploring a
jungle, you might not always know exactly where the robot was located, and
sometimes when the robot took a step forward, it might end up running into a tree
and turning left. In the first case, the robot doesn't know exactly what the state of
the world is; in the second case, the robot isn't sure that its actions will work as
expected.
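The arithmetic behind this is straightforward expected utility. As a rough sketch (the probabilities and utilities below are invented for illustration, not taken from any real robot):

```python
# Expected-utility sketch for the jungle robot (illustrative numbers):
# an action's value is each outcome's utility weighted by the
# probability of that outcome actually occurring.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

# Hypothetical model: stepping forward usually works, but sometimes
# the robot hits a tree and ends up turning left instead.
step_forward = [(0.8, 10.0),   # moves ahead as planned
                (0.2, -5.0)]   # hits a tree, turns left

print(expected_utility(step_forward))  # 0.8*10 + 0.2*(-5) = 7.0
```

An agent comparing actions would compute this value for each one and pick the largest.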
My Summary
Machine learning is all about designing algorithms that allow a computer
to learn. These algorithms can be classified in many ways; a common
classification is:
i. Classifying the input - the algorithm is expected to classify
something like handwriting or speech recognition.
ii. Decision making - the algorithm is expected to make a decision,
for example a robot walking in a jungle.
________________________________________________________________
_____________________________________________
Machine Learning, Part II: Supervised and Unsupervised Learning
1. Supervised Learning
Supervised learning is the most common technique for training neural networks
and decision trees. Both of these techniques are highly dependent on the
information given by the pre-determined classifications. In the case of neural
networks, the classification is used to determine the error of the network and then
adjust the network to minimize it, and in decision trees, the classifications are
used to determine what attributes provide the most information that can be used
to solve the classification puzzle. We'll look at both of these in more detail, but for
now, it should be sufficient to know that both of these examples thrive on having
some "supervision" in the form of pre-determined classifications.
Speech recognition using hidden Markov models and Bayesian networks relies
on some elements of supervision as well in order to adjust parameters to, as
usual, minimize the error on the given inputs.
Notice something important here: in the classification problem, the goal of the
learning algorithm is to minimize the error with respect to the given inputs. These
inputs, often called the "training set", are the examples from which the agent tries
to learn. But learning the training set well is not necessarily the best thing to do.
For instance, if I tried to teach you exclusive-or, but only showed you
combinations consisting of one true and one false, but never both false or both
true, you might learn the rule that the answer is always true. Similarly, with
machine learning algorithms, a common problem is over-fitting the data and
essentially memorizing the training set rather than learning a more general
classification technique.
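The exclusive-or trap above can be sketched in a few lines. Here a memorizing learner (a simple nearest-neighbor classifier, chosen just for illustration) is trained only on the one-true-one-false cases:

```python
# Sketch of the exclusive-or trap: the training set only ever shows one
# true and one false input, so a memorizing learner concludes that the
# answer is always true.

def nearest_neighbor(train, x):
    """1-nearest-neighbor: classify x by the closest training example."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], x))[1]

# Incomplete XOR training set: (0,1) and (1,0) only, both labeled True.
train = [((0, 1), True), ((1, 0), True)]

for x in [(0, 0), (1, 1)]:
    print(x, nearest_neighbor(train, x))  # both come out True, but XOR says False
```

The learner scores perfectly on its training set yet gets every unseen input wrong, which is exactly what over-fitting looks like.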
As you might imagine, not all training sets have the inputs classified correctly.
This can lead to problems if the algorithm used is powerful enough to memorize
even the apparently "special cases" that don't fit the more general principles.
This, too, can lead to overfitting, and it is a challenge to find algorithms that are
both powerful enough to learn complex functions and robust enough to produce
generalizable results.
2. Unsupervised learning
Unsupervised learning seems much harder: the goal is to have the computer
learn how to do something that we don't tell it how to do! There are actually two
approaches to unsupervised learning. The first approach is to teach the
agent not by giving explicit categorizations, but by using some sort of
reward system to indicate success. Note that this type of training will generally
fit into the decision problem framework because the goal is not to produce a
classification but to make decisions that maximize rewards. This approach nicely
generalizes to the real world, where agents might be rewarded for doing certain
actions and punished for doing others.
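A minimal sketch of such a reward system might look like this (the actions, rewards, and update rule are all invented for illustration; real reinforcement learning is considerably more involved):

```python
# Minimal reward-driven learning sketch: the agent is never told which
# action is "correct"; it only sees a numeric reward after acting, and
# shifts its value estimates toward whatever paid off.

def train(rewards, episodes=100, step=0.1):
    """rewards: maps each action to its (deterministic) reward."""
    values = {a: 0.0 for a in rewards}           # estimated value per action
    for _ in range(episodes):
        for a, r in rewards.items():             # try every action in turn
            values[a] += step * (r - values[a])  # move estimate toward reward
    return values

# Hypothetical jungle-robot actions and their rewards.
values = train({"step_forward": 1.0, "turn_left": -0.5})
best = max(values, key=values.get)
print(best)  # the agent ends up preferring "step_forward"
```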
The second approach is clustering: rather than maximizing a reward, the goal
is simply to find similarities in the training data. Although the algorithm
won't have names to assign to these clusters, it can
produce them and then use those clusters to assign new examples into one or
the other of the clusters. This is a data-driven approach that can work well when
there is sufficient data; for instance, social information filtering algorithms, such
as those that Amazon.com use to recommend books, are based on the principle
of finding similar groups of people and then assigning new users to groups. In
some cases, such as with social information filtering, the information about other
members of a cluster (such as what books they read) can be sufficient for the
algorithm to produce meaningful results. In other cases, it may be the case that
the clusters are merely a useful tool for a human analyst. Unfortunately, even
unsupervised learning suffers from the problem of overfitting the training data.
There's no silver bullet to avoiding the problem because any algorithm that can
learn from its inputs needs to be quite powerful.
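As a concrete sketch of clustering, here is a bare-bones k-means loop (one clustering algorithm among many; the article doesn't commit to a specific one). Note that the algorithm never sees labels and never names its clusters:

```python
# Bare-bones k-means sketch on 1-D points: group points by nearness,
# with no labels supplied and no names for the resulting clusters.

def kmeans(points, centers, rounds=10):
    groups = [[] for _ in centers]
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[i].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

# Two obvious groups of 1-D points; the initial centers are a guess.
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centers, groups = kmeans(points, centers=[0.0, 10.0])
print(centers)  # roughly [1.0, 9.0]
```

A new example can then be assigned to whichever final center it falls nearest, which is how clusters classify unseen data.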
•Summary
Unsupervised learning has produced many successes, such as world-champion
calibre backgammon programs and even machines capable of driving cars! It can
be a powerful technique when there is an easy way to assign values to actions.
Clustering can be useful when there is enough data to form clusters (though this
turns out to be difficult at times) and especially when additional data about
members of a cluster can be used to produce further results due to
dependencies in the data.
Both techniques can be valuable and which one you choose should depend on
the circumstances--what kind of problem is being solved, how much time is
allotted to solving it (supervised learning or clustering is often faster than
reinforcement learning techniques), and whether supervised learning is even
possible.
My Summary
Supervised learning - a set of examples is given for the algorithm to
study and learn from. More suitable for classification problems. The
example set has to be prepared with extreme care.
Unsupervised learning - we do not give any example cases; the
algorithm must learn things by itself. There are two approaches
to this:
i. Reward system.
ii. Clustering.
________________________________________________________________
____________________________________________
Machine Learning, Part III: Testing Algorithms, and The "No Free Lunch
Theorem"
Now that you have a sense of the classifications of machine learning algorithms,
before diving into the specifics of individual algorithms, the only other background
required is a sense of how to test machine learning algorithms.
In most scenarios, there will be three types of data: training set data and testing
data will be used to train and evaluate the algorithm, but the ultimate test will be
how it performs on real data. We'll focus primarily on results from training and
testing because we'll generally assume that the test set is a reasonable
approximation of the real world. (As an aside, some machine learning techniques
use a fourth set of data, called a validation set, which is used during training to
avoid overfitting. We'll discuss validation sets when we look at decision trees
because they are a common optimization for decision tree learning.)
We've already talked a bit about the fact that algorithms may over-fit the training
set. A corollary of this principle is that a learning algorithm should never be
evaluated for its results in the training set because this shows no evidence of an
ability to generalize to unseen instances. It's important that the training and test
sets be kept distinct (or at least that the two be independently generated, even if
they do share some inputs to the algorithm).
Unfortunately, this can lead to issues when the amount of training and test data
is relatively limited. If, for instance, you only have 20 samples, there's not much
data to use for a training set and still leave a significant test set. A solution to this
problem is to run your algorithm twenty times (once for each input), using 19 of
the samples as training data and the last sample as a test set so that you end up
using each sample to test your results once. This gives you a much larger
training set for each trial, meaning that your algorithm will have enough data to
learn from, but it also gives a fairly large number of tests (20 instead of 5 or 10).
The drawback to this approach is that the results of each individual test aren't
going to be independent of the results of the other tests. Nevertheless, if you're in
a bind for data, this can yield passable results with lower variance than simply
using one test set and one training set.
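The twenty-trials scheme described above is usually called leave-one-out cross-validation. A sketch, using a deliberately trivial majority-vote "learner" as a stand-in for a real algorithm:

```python
# Leave-one-out cross-validation sketch: with n samples, train n times
# on n-1 of them and test on the one held out, so every sample serves
# as a test case exactly once.

def leave_one_out(samples, train_fn, predict_fn):
    correct = 0
    for i in range(len(samples)):
        held_out = samples[i]
        training = samples[:i] + samples[i + 1:]   # all but one sample
        model = train_fn(training)
        if predict_fn(model, held_out[0]) == held_out[1]:
            correct += 1
    return correct / len(samples)                  # fraction of tests passed

# Toy stand-in for a real learner: always predict the majority label.
def train_fn(training):
    labels = [y for _, y in training]
    return max(set(labels), key=labels.count)

def predict_fn(model, x):
    return model

samples = [(i, i < 15) for i in range(20)]         # 15 True, 5 False
print(leave_one_out(samples, train_fn, predict_fn))  # 0.75 for this toy learner
```

The trials are not independent (each pair of training sets shares 18 samples), which is the drawback mentioned above.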
So far, a major theme in these machine learning articles has been having
algorithms generalize from the training data rather than simply memorizing it. But
there is a subtle issue that plagues all machine learning algorithms, summarized
as the "no free lunch theorem". The gist of this theorem is that you can't get
learning "for free" just by looking at training instances. Why not? Well, the fact is,
the only things you know about the data are what you have seen.
For example, if I give you the training inputs (0, 0) and (1, 1) and the
classifications of the input as both being "false", there are two obvious
hypotheses that fit the data: first, every possible input could result in a
classification of false. On the other hand, every input except the two
training inputs might be true--these training inputs could be the only examples of
inputs that are classified as false. In short, given a training set, there are always
at least two equally plausible, but totally opposite, generalizations that could be
made. This means that any learning algorithm requires some kind of "bias" to
distinguish between these plausible outcomes.
Some algorithms may have such strong biases that they can only learn certain
kinds of functions. For instance, a certain kind of basic neural network, the
perceptron, is biased to learning only linear functions (functions with inputs that
can be separated into classifications by drawing a line). This can be a weakness
in cases where the input isn't actually linearly separable, but if the input is linearly
separable, it can force learning when more flexible algorithms might have more
trouble.
For instance, if you have the inputs ((0, 0), false), ((1, 0), true), and ((0, 1), true),
then the only linearly separable function possible would be the boolean OR
function, which the perceptron would learn. Now, if the true function were
boolean OR, then the perceptron would have correctly generalized from three
training instances to the full set of instances. On the other hand, a more flexible
algorithm might not have made that same generalization, since it lacks the bias
toward a particular type of function.
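This perceptron example is easy to reproduce. The sketch below uses the classic perceptron update rule (a standard formulation; the article doesn't spell one out):

```python
# Minimal perceptron sketch: a threshold unit trained with the classic
# perceptron update rule on the three examples from the text.

def train_perceptron(examples, epochs=10, rate=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred                  # +1, 0, or -1
            w[0] += rate * err * x[0]       # nudge weights toward the
            w[1] += rate * err * x[1]       # correct side of the line
            b += rate * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# The three training instances from the text.
examples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1)]
w, b = train_perceptron(examples)

# The unseen input (1, 1): the perceptron's linear bias fills it in as OR.
print(predict(w, b, (1, 1)))  # 1 (true)
```

Because the data is linearly separable, the update rule is guaranteed to converge, and the learned line generalizes the three examples to the full OR function.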
We'll often see algorithms that are specifically designed to introduce constraints
about the problem domain in order to introduce subtle biases to create specific
instances of general algorithms to enable better generalizations and
consequently better learning. One example of this when working with digit
recognition would be to create a specialized neural network that was designed to
mimic the way the eye perceives data. This might actually give (in fact, it has
given) better results than using a more powerful but more general network.
Continue to Decision Trees, Part I: Understanding Decision Trees, which covers
how decision trees work and how they exploit a particular type of bias to increase
learning.
My Summary
Testing - there are basically three types of data used with a machine
learning algorithm:
i. Training data set
ii. Testing data
iii. Real-world data
________________________________________________________________
_____________________________________________