Index
Contents

Articles
Artificial neural network
Supervised learning
Semi-supervised learning
Active learning (machine learning)
Structured prediction
Learning to rank
Unsupervised learning
Reinforcement learning
Fuzzy logic
Fuzzy set
Fuzzy number

References
Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses
License
Artificial neural network
Background
The original inspiration for the term Artificial Neural Network came from examination of central nervous systems and their neurons, axons, dendrites and synapses, which constitute the processing elements of the biological neural networks investigated by neuroscience. In an artificial neural network, simple artificial nodes, called variously "neurons", "neurodes", "processing elements" (PEs) or "units", are connected together to form a network of nodes mimicking the biological neural networks, hence the term "artificial neural network".
The more general approach of such adaptive systems is better suited to real-world problem solving, but it has far less to do with the traditional artificial intelligence connectionist models. What they do have in common, however, is the principle of non-linear, distributed, parallel and local processing and adaptation.
Models
Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are
essentially simple mathematical models defining a function $f : X \rightarrow Y$ or a distribution over $X$ or both $X$ and $Y$; sometimes models are also intimately associated with a particular learning algorithm or learning rule. A common use of
the phrase ANN model really means the definition of a class of such functions (where members of the class are
obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons
or their connectivity).
Network function
The word network in the term 'artificial neural network' refers to the inter–connections between the neurons in the
different layers of each system. The most basic system has three layers. The first layer has input neurons which send
data via synapses to the second layer of neurons and then via more synapses to the third layer of output neurons.
More complex systems have more layers of neurons, some with additional layers of input neurons and
output neurons. The synapses store parameters called "weights" which are used to manipulate the data in the
calculations.
The layers are connected through the mathematics of the system's algorithms. The network function $f(x)$ is defined as a composition of other functions $g_i(x)$, which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum, $f(x) = K\left(\sum_i w_i g_i(x)\right)$, where $K$ (commonly referred to as the activation function[1]) is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions $g_i$ simply as a vector $g = (g_1, g_2, \ldots, g_n)$.
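A minimal sketch, assuming NumPy and hypothetical randomly chosen weights, of this nonlinear weighted sum composed over three layers (the 3-dimensional and 2-dimensional intermediate vectors match the decomposition described next):

import numpy as np

def layer(W, b, x, K=np.tanh):
    # One layer: apply the activation K to a weighted sum of the inputs
    return K(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input vector

# Hypothetical weights: 4 inputs -> 3 hidden units -> 2 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
W3, b3 = rng.normal(size=(1, 2)), np.zeros(1)

h = layer(W1, b1, x)                        # 3-dimensional vector h
g = layer(W2, b2, h)                        # 2-dimensional vector g
f = layer(W3, b3, g)                        # final output f
print(f)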
This figure depicts such a decomposition of $f$, with dependencies between variables indicated by arrows. These can be interpreted in two ways.
The first view is the functional view: the input $x$ is transformed into a 3-dimensional vector $h$, which is then transformed into a 2-dimensional vector $g$, which is finally transformed into $f$. This view is most commonly encountered in the context of optimization.
The second view is the probabilistic view: the random variable $F = f(G)$ depends upon the random variable $G = g(H)$, which depends upon $H = h(X)$, which depends upon the random variable $X$. This view is most commonly encountered in the context of graphical models.
The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of $g$ are independent of each other given their input $h$). This naturally enables a degree of parallelism in the implementation.
Networks such as the previous one are commonly called feedforward, because their
graph is a directed acyclic graph. Networks with cycles are commonly called
recurrent. Such networks are commonly depicted in the manner shown at the top of
the figure, where $f$ is shown as being dependent upon itself. However, there is an
implied temporal dependence which is not shown.
Learning
What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve and a class of functions $F$, learning means using a set of observations to find $f^* \in F$ which solves the task in some optimal sense.
This entails defining a cost function $C : F \rightarrow \mathbb{R}$ such that, for the optimal solution $f^*$, $C(f^*) \leq C(f)$ for all $f \in F$ (i.e., no solution has a cost less than the cost of the optimal solution).
The cost function is an important concept in learning, as it is a measure of how far away a particular solution is
from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a
function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations; otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model $f$ which minimizes $C = E\left[(f(x) - y)^2\right]$ for data pairs $(x, y)$ drawn from some distribution $\mathcal{D}$. In practical situations we would only have $N$ samples from $\mathcal{D}$ and thus, for the above example, we would only minimize $\hat{C} = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2$. Thus, the cost is minimized over a sample of the data rather than over the entire data set.
When $N \rightarrow \infty$, some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when $\mathcal{D}$ is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are
supervised learning, unsupervised learning and reinforcement learning. Usually any given type of network
architecture can be employed in any of those tasks.
Supervised learning
In supervised learning, we are given a set of example pairs $(x, y)$, $x \in X$, $y \in Y$, and the aim is to find a function $f : X \rightarrow Y$ in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by
the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains
prior knowledge about the problem domain.
A commonly used cost is the mean-squared error which tries to minimize the average squared error between the
network's output, f(x), and the target value y over all the example pairs. When one tries to minimize this cost using
gradient descent for the class of neural networks called Multi-Layer Perceptrons, one obtains the common and
well-known backpropagation algorithm for training neural networks.
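A minimal sketch, assuming NumPy and a synthetic dataset, of gradient-descent training of a one-hidden-layer perceptron on a (half) mean-squared-error cost; this is a bare-bones version of the idea, not a production backpropagation implementation:

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) plus noise (illustrative only)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# One hidden layer of 10 tanh units, trained by gradient descent
W1, b1 = rng.normal(scale=0.5, size=(1, 10)), np.zeros(10)
W2, b2 = rng.normal(scale=0.5, size=(10, 1)), np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # Forward pass
    H = np.tanh(X @ W1 + b1)        # hidden activations
    Y_hat = H @ W2 + b2             # network output
    err = Y_hat - y                 # prediction error

    # Backward pass: gradients of 0.5 * mean squared error w.r.t. the weights
    n = len(X)
    dW2 = H.T @ err / n
    db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H**2)    # backpropagate through tanh
    dW1 = X.T @ dH / n
    db1 = dH.mean(axis=0)

    # Gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float((err**2).mean()))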
Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and
regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential
data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a
function that provides continuous feedback on the quality of solutions obtained thus far.
Unsupervised learning
In unsupervised learning we are given some data $x$ and the cost function to be minimized, which can be any function of the data $x$ and the network's output, $f$.
The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit
properties of our model, its parameters and the observed variables).
As a trivial example, consider the model $f(x) = a$, where $a$ is a constant, and the cost $C = E\left[(x - f(x))^2\right]$. Minimizing this cost will give us a value of $a$ that is equal to the mean of the data. The cost function can be much more
complicated. Its form depends on the application: for example, in compression it could be related to the mutual
information between x and y, whereas in statistical modelling, it could be related to the posterior probability of the
model given the data. (Note that in both of those examples those quantities would be maximized rather than
minimized).
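A quick numerical check of the trivial example above, assuming NumPy: the cost-minimizing constant coincides with the sample mean.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])            # some data

def cost(a):
    # Empirical version of C = E[(x - f(x))^2] for the constant model f(x) = a
    return np.mean((x - a) ** 2)

grid = np.linspace(0, 10, 10001)
best = grid[np.argmin([cost(a) for a in grid])]
print(best, x.mean())                          # both are 3.5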
Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications
include clustering, the estimation of statistical distributions, compression and filtering.
Reinforcement learning
In reinforcement learning, data $x_t$ are usually not given, but generated by an agent's interactions with the environment. At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modelled as a Markov decision process (MDP) with states $s_1, \ldots, s_n \in S$ and actions $a_1, \ldots, a_m \in A$ and the following probability distributions: the instantaneous cost distribution $P(c_t \mid s_t)$, the observation distribution $P(x_t \mid s_t)$ and the transition $P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost, i.e. the MC for which the cost is minimal.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
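A minimal sketch, assuming a tiny hand-specified MDP with known dynamics (in practice the dynamics and costs are unknown and must be estimated, often with an ANN as function approximator), of finding the cost-minimizing policy by value iteration:

import numpy as np

# Two states, two actions; P[a, s, s'] = transition probability, C[s, a] = instantaneous cost
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],     # action 0
              [[0.1, 0.9],
               [0.7, 0.3]]])    # action 1
C = np.array([[1.0, 0.5],
              [2.0, 0.1]])
gamma = 0.9                     # discount factor for the long-term cost

V = np.zeros(2)                 # optimal cost-to-go, found by value iteration
for _ in range(500):
    Q = C + gamma * (P @ V).T   # Q[s, a] = c(s, a) + gamma * E[V(s') | s, a]
    V = Q.min(axis=1)           # minimize, because these are costs rather than rewards

policy = Q.argmin(axis=1)       # cheapest action in each state
print(V, policy)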
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential
decision making tasks.
See also: dynamic programming, stochastic control
Learning algorithms
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a
Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion.
There are numerous algorithms available for training neural network models; most of them can be viewed as a
straightforward application of optimization theory and statistical estimation. Recent developments in this field use
particle swarm optimization and other swarm intelligence techniques.
Most of the algorithms used in training artificial neural networks employ some form of gradient descent. This is done
by simply taking the derivative of the cost function with respect to the network parameters and then changing those
parameters in a gradient-related direction.
Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are some
commonly used methods for training neural networks. See also machine learning.
Temporal perceptual learning relies on finding temporal relationships in sensory signal streams. In an environment,
statistically salient temporal correlations can be found by monitoring the arrival times of sensory signals. This is
done by the perceptual network.
Applications
The utility of artificial neural network models lies in the fact that they can be used to infer a function from
observations. This is particularly useful in applications where the complexity of the data or task makes the design of
such a function by hand impractical.
Tasks to which artificial neural networks are applied include:
• Data processing, including filtering, clustering, blind source separation and compression.
• Robotics, including directing manipulators and computer numerical control.
Application areas include system identification and control (vehicle control, process control), quantum chemistry,[2]
game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face
identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition),
medical diagnosis, financial applications (automated trading systems), data mining (or knowledge discovery in
databases, "KDD"), visualization and e-mail spam filtering.
Types of models
Many models are used in the field, defined at different levels of abstraction and modelling different aspects of neural systems. They range from models of the short-term behaviour of individual neurons, through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behaviour can arise from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and their relation to learning and memory, from the individual neuron to the system level.
Current research
While initial research was concerned mostly with the electrical characteristics of neurons, a particularly
important part of the investigation in recent years has been the exploration of the role of neuromodulators such as
dopamine, acetylcholine, and serotonin on behaviour and learning.
Biophysical models, such as BCM theory, have been important in understanding mechanisms for synaptic plasticity,
and have had applications in both computer science and neuroscience. Research is ongoing in understanding the
computational algorithms used in the brain, with some recent biological evidence for radial basis networks and
neural backpropagation as mechanisms for processing data.
Computational devices have been created in CMOS for both biophysical simulation and neuromorphic computing.
More recent efforts show promise for creating nanodevices for very large scale principal components analyses and
convolution. If successful, these efforts could usher in a new era of neural computing that is a step beyond digital
computing, because it depends on learning rather than programming and because it is fundamentally analog rather
than digital even though the first instantiations may in fact be with CMOS digital devices.
Theoretical properties
Computational power
The multi-layer perceptron (MLP) is a universal function approximator, as proven by the Cybenko theorem.
However, the proof is not constructive regarding the number of neurons required or the settings of the weights.
Work by Hava Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with
rational valued weights (as opposed to full precision real number-valued weights) has the full power of a Universal
Turing Machine[4] using a finite number of neurons and standard linear connections. They have further shown that
the use of irrational values for weights results in a machine with super-Turing power.
Capacity
Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to
model any given function. It is related to the amount of information that can be stored in the network and to the
notion of complexity.
Convergence
Nothing can be said in general about convergence since it depends on a number of factors. Firstly, there may exist
many local minima. This depends on the cost function and the model. Secondly, the optimization method used might
not be guaranteed to converge when far away from a local minimum. Thirdly, for a very large amount of data or
parameters, some methods become impractical. In general, it has been found that theoretical guarantees regarding
convergence are an unreliable guide to practical application.
Generalization and statistics
In statistical learning theory, where the aim is a model that generalizes well to unseen examples, the goal is to minimize two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error on unseen data due to overfitting.
Supervised neural networks that use an MSE cost function can use
formal statistical methods to determine the confidence of the
trained model. The MSE on a validation set can be used as an
estimate for variance. This value can then be used to calculate the
confidence interval of the output of the network, assuming a
normal distribution. A confidence analysis made this way is
statistically valid as long as the output probability distribution
stays the same and the network is not modified.
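A minimal sketch, assuming NumPy/SciPy and a hypothetical trained predict function, of the confidence analysis described above: the validation-set MSE serves as the variance estimate and a normal output distribution is assumed.

import numpy as np
from scipy.stats import norm

def confidence_interval(predict, x_new, mse_val, level=0.95):
    # Approximate interval for a network output, using the validation MSE as a variance estimate
    z = norm.ppf(0.5 + level / 2)          # e.g. 1.96 for a 95% interval
    y_hat = predict(x_new)
    half_width = z * np.sqrt(mse_val)
    return y_hat - half_width, y_hat + half_width

# Example with a hypothetical trained model and a validation MSE of 0.04
lo, hi = confidence_interval(lambda x: 2.0 * x, x_new=1.5, mse_val=0.04)
print(lo, hi)                              # roughly 2.61 .. 3.39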
Dynamic properties
Various techniques originally developed for studying disordered magnetic systems (i.e., the spin glass) have been
successfully applied to simple neural network architectures, such as the Hopfield network. Influential work by E.
Gardner and B. Derrida has revealed many interesting properties about perceptrons with real-valued synaptic
weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.
Criticism
A common criticism of artificial neural networks, particularly in robotics, is that they require a large diversity of
training for real-world operation. Dean Pomerleau, in his research presented in the paper "Knowledge-based
Training of Artificial Neural Networks for Autonomous Robot Driving," uses a neural network to train a robotic
vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.). A large amount of his research is
devoted to (1) extrapolating multiple training scenarios from a single training experience, and (2) preserving past
training diversity so that the system does not become overtrained (if, for example, it is presented with a series of
right turns – it should not learn to always turn right). These issues are common in neural networks that must decide
from amongst a wide variety of responses.
A. K. Dewdney, a former Scientific American columnist, wrote in 1997, "Although neural nets do solve a few toy
problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general
problem-solving tool." (Dewdney, p. 82)
Arguments for Dewdney's position are that implementing large and effective software neural networks requires committing substantial processing and storage resources. While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a highly simplified form on Von Neumann architecture may force the designer to fill many millions of database rows for its connections, which can consume vast amounts of RAM and disk space. Furthermore, the designer of such systems often needs to simulate the transmission of signals through many of these connections and their associated neurons, which demands a great deal of CPU processing power and time. While neural networks often yield effective programs, they too often do so at the cost of efficiency in time and money.
Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and
diverse tasks, ranging from autonomously flying aircraft[5] to detecting credit card fraud.[6] Technology writer Roger
Bridgman commented on Dewdney's statements about neural nets:
Neural networks, for instance, are in the dock not only because they have been hyped to high heaven,
(what hasn't?) but also because you could create a successful net without understanding how it worked:
the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable
table...valueless as a scientific resource". In spite of his emphatic declaration that science is not
technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them
are just trying to be good engineers. An unreadable table that a useful machine could read would still be
well worth having.[7]
Some other criticisms came from believers of hybrid models (combining neural networks and symbolic approaches).
They advocate the intermix of these two approaches and believe that hybrid models can better capture the
mechanisms of the human mind (Sun and Bookman 1994).
See also
• 20Q
• Adaptive resonance theory
• Artificial life
• Associative memory
• Autoencoder
• Biological neural network
• Biologically inspired computing
• Blue brain
• Cascade Correlation
• Clinical decision support system
• Connectionist expert system
• Decision tree
• Expert system
• Fuzzy logic
• Genetic algorithm
• In Situ Adaptive Tabulation
• Linear discriminant analysis
• Logistic regression
• Memristor
• Multilayer perceptron
• Nearest neighbor (pattern recognition)
• Neural network
• Neuroevolution, NeuroEvolution of Augmented Topologies (NEAT)
• Neural network software
• Ni1000 chip
• Optical neural network
• Particle swarm optimization
• Perceptron
• Predictive analytics
• Principal components analysis
• Regression analysis
• Simulated annealing
• Systolic array
• Time delay neural network (TDNN)
References
[1] "The Machine Learning Dictionary" (http://www.cse.unsw.edu.au/~billw/mldict.html#activnfn).
[2] Roman M. Balabin, Ekaterina I. Lomakina (2009). "Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies". J. Chem. Phys. 131 (7): 074104. doi:10.1063/1.3206326. PMID 19708729.
[3] "DANN:Genetic Wavelets" (http://wiki.syncleus.com/index.php/DANN:Genetic_Wavelets). dANN project. Retrieved 12 July 2010.
[4] Siegelmann, H.T.; Sontag, E.D. (1991). "Turing computability with neural nets" (http://www.math.rutgers.edu/~sontag/FTP_DIR/aml-turing.pdf). Appl. Math. Lett. 4 (6): 77–80. doi:10.1016/0893-9659(91)90080-F.
[5] "NASA NEURAL NETWORK PROJECT PASSES MILESTONE" (http://www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html). NASA. Retrieved 12 July 2010.
[6] "Counterfeit Fraud" (http://www.visa.ca/en/personal/pdfs/counterfeit_fraud.pdf) (PDF). VISA. p. 1. Retrieved 12 July 2010. "Neural Networks (24/7 Monitoring):"
[7] Roger Bridgman's defence of neural networks (http://members.fortunecity.com/templarseries/popper.html)
Bibliography
• Bar-Yam, Yaneer (2003). Dynamics of Complex Systems, Chapter 2 (http://necsi.org/publications/dcs/
Bar-YamChap2.pdf).
• Bar-Yam, Yaneer (2003). Dynamics of Complex Systems, Chapter 3 (http://necsi.org/publications/dcs/
Bar-YamChap3.pdf).
• Bar-Yam, Yaneer (2005). Making Things Work (http://necsi.org/publications/mtw/). Please see Chapter 3
• Bhadeshia H. K. D. H. (1999). " Neural Networks in Materials Science (http://www.msm.cam.ac.uk/
phase-trans/abstracts/neural.review.pdf)". ISIJ International 39: 966–979. doi:10.2355/isijinternational.39.966.
• Bhagat, P.M. (2005) Pattern Recognition in Industry, Elsevier. ISBN 0-08-044538-1
• Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press. ISBN
0-19-853849-9 (hardback) or ISBN 0-19-853864-2 (paperback)
• Cybenko, G.V. (1989). Approximation by Superpositions of a Sigmoidal function, Mathematics of Control,
Signals and Systems, Vol. 2 pp. 303–314. electronic version (http://actcomm.dartmouth.edu/gvc/papers/
approx_by_superposition.pdf)
• Duda, R.O., Hart, P.E., Stork, D.G. (2001) Pattern classification (2nd edition), Wiley, ISBN 0-471-05669-3
• Egmont-Petersen, M., de Ridder, D., Handels, H. (2002). "Image processing with neural networks - a review".
Pattern Recognition 35 (10): 2279–2301. doi:10.1016/S0031-3203(01)00178-9.
• Gurney, K. (1997) An Introduction to Neural Networks London: Routledge. ISBN 1-85728-673-1 (hardback) or
ISBN 1-85728-503-4 (paperback)
• Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1
• Fahlman, S, Lebiere, C (1991). The Cascade-Correlation Learning Architecture, created for National Science
Foundation, Contract Number EET-8716324, and Defense Advanced Research Projects Agency (DOD), ARPA
Order No. 4976 under Contract F33615-87-C-1499. electronic version (http://www.cs.iastate.edu/~honavar/
fahlman.pdf)
• Hertz, J., Palmer, R.G., Krogh. A.S. (1990) Introduction to the theory of neural computation, Perseus Books.
ISBN 0-201-51560-1
• Lawrence, Jeanette (1994) Introduction to Neural Networks, California Scientific Software Press. ISBN
1-883157-00-5
• Masters, Timothy (1994) Signal and Image Processing with Neural Networks, John Wiley & Sons, Inc. ISBN
0-471-04963-8
• Ness, Erik. 2005. SPIDA-Web (http://www.conbio.org/cip/article61WEB.cfm). Conservation in Practice
6(1):35-36. On the use of artificial neural networks in species taxonomy.
• Ripley, Brian D. (1996) Pattern Recognition and Neural Networks, Cambridge
• Siegelmann, H.T. and Sontag, E.D. (1994). Analog computation via neural networks, Theoretical Computer
Science, v. 131, no. 2, pp. 331–360. electronic version (http://www.math.rutgers.edu/~sontag/FTP_DIR/
nets-real.pdf)
• Sergios Theodoridis, Konstantinos Koutroumbas (2009) "Pattern Recognition" , 4th Edition, Academic Press,
ISBN 978-1-59749-272-0.
• Smith, Murray (1993) Neural Networks for Statistical Modeling, Van Nostrand Reinhold, ISBN 0-442-01310-8
• Wasserman, Philip (1993) Advanced Methods in Neural Computing, Van Nostrand Reinhold, ISBN
0-442-00461-3
Further reading
• Dedicated issue of Philosophical Transactions B on Neural Networks and Perception. Some articles are freely
available. (http://publishing.royalsociety.org/neural-networks)
External links
• Performance comparison of neural network algorithms tested on UCI data sets (http://tunedit.org/results?e=&
d=UCI/&a=neural+rbf+perceptron&n=)
• A close view to Artificial Neural Networks Algorithms (http://www.learnartificialneuralnetworks.com)
• Neural Networks (http://www.dmoz.org/Computers/Artificial_Intelligence/Neural_Networks/) at the Open
Directory Project
• A Brief Introduction to Neural Networks (D. Kriesel) (http://www.dkriesel.com/en/science/neural_networks)
- Illustrated, bilingual manuscript about artificial neural networks; Topics so far: Perceptrons, Backpropagation,
Radial Basis Functions, Recurrent Neural Networks, Self Organizing Maps, Hopfield Networks.
• Neural Networks in Materials Science (http://www.msm.cam.ac.uk/phase-trans/abstracts/neural.review.
html)
• A practical tutorial on Neural Networks (http://www.ai-junkie.com/ann/evolved/nnt1.html)
Supervised learning
Supervised learning is the machine learning task of inferring a function from supervised training data. The training
data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm
analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete, see
classification) or a regression function (if the output is continuous, see regression). The inferred function should
predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the
training data to unseen situations in a "reasonable" way (see inductive bias). (Compare with unsupervised learning.)
The parallel task in human and animal psychology is often referred to as concept learning.
Overview
In order to solve a given problem of supervised learning, one has to perform various steps:
1. Determine the type of training examples. Before doing anything else, the engineer should decide what kind of
data is to be used as an example. For instance, this might be a single handwritten character, an entire handwritten
word, or an entire line of handwriting.
2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set
of input objects is gathered and corresponding outputs are also gathered, either from human experts or from
measurements.
3. Determine the input feature representation of the learned function. The accuracy of the learned function depends
strongly on how the input object is represented. Typically, the input object is transformed into a feature vector,
which contains a number of features that are descriptive of the object. The number of features should not be too
large, because of the curse of dimensionality; but should contain enough information to accurately predict the
output.
4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer
may choose to use support vector machines or decision trees.
5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning
algorithms require the user to determine certain control parameters. These parameters may be adjusted by
optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the
resulting function should be measured on a test set that is separate from the training set.
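A minimal sketch of steps 2 through 6 above, assuming scikit-learn and a synthetic dataset standing in for gathered training examples:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2 (stand-in): a synthetic training set of feature vectors and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 6 requires a held-out test set, separate from the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 4-5: choose a learner (here an SVM) and tune its control parameters by cross-validation
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate the learned function on the separate test set
print(accuracy_score(y_test, search.predict(X_test)))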
A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no
single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).
There are four major issues to consider in supervised learning:
Bias-variance tradeoff
A first issue is the tradeoff between bias and variance.[1] Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input $x$ if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for $x$. A learning algorithm has high variance for a particular input $x$ if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[2]
Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so
that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently,
and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this
tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can
adjust).
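An illustrative sketch, assuming NumPy and synthetic data, of the variance half of this tradeoff: a flexible model (high-degree polynomial) fitted to many resampled training sets gives widely varying predictions at a fixed input, while a rigid model's predictions barely change.

import numpy as np

rng = np.random.default_rng(0)
x0 = 0.5                                   # the particular input we ask about

def predictions_at_x0(degree, n_sets=200):
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(-1, 1, 30)         # a fresh training set each time
        y = np.sin(3 * x) + 0.3 * rng.normal(size=30)
        coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

for degree in (1, 9):
    p = predictions_at_x0(degree)
    print(degree, "variance of prediction:", p.var())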
A further issue is the presence of interactions and non-linearities. If each of the features makes an independent contribution to the
output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector
Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with
Gaussian kernels) generally perform well. However, if there are complex interactions among features, then
algorithms such as decision trees and neural networks work better, because they are specifically designed to
discover these interactions. Linear methods can also be applied, but the engineer must manually specify the
interactions when using them.
When considering a new application, the engineer can compare multiple learning algorithms and experimentally
determine which one works best on the problem at hand (see cross-validation). Tuning the performance of a learning
algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting
additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
The most widely used learning algorithms are Support Vector Machines, linear regression, logistic regression, naive
Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithm, and Neural Networks (Multilayer
perceptron).
A popular regularization penalty is $\sum_j \beta_j^2$, the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other norms include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ "norm", which is the number of non-zero $\beta_j$'s.
The parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance. When $\lambda$ is large, the learning algorithm will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross-validation.
The complexity penalty has a Bayesian interpretation as the negative log prior probability of $f$, $-\log P(f)$, in which case $J(f)$ is the posterior probability of $f$.
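A minimal sketch, assuming scikit-learn and synthetic data, of an $L_2$ (squared-Euclidean-norm) penalty in ridge regression, with the tradeoff parameter chosen empirically by cross-validation as described above:

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ beta_true + 0.5 * rng.normal(size=100)

# alpha plays the role of the tradeoff parameter: near 0 -> plain empirical risk minimization,
# large alpha -> heavily penalized weights (higher bias, lower variance)
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print("chosen alpha:", model.alpha_)
print("weights:", model.coef_)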
Generative training
The training methods described above are discriminative training methods, because they seek to find a function $f$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood $-\sum_i \log P(x_i, y_i)$, a risk minimization algorithm is said to perform generative training, because $f$ can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form, as in naive Bayes and linear discriminant analysis.
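A minimal sketch, assuming scikit-learn and synthetic data, of generative training with a closed-form solution: Gaussian naive Bayes is fitted directly from class priors and per-class feature means and variances.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes generated from different Gaussians
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)       # closed-form fit: class priors, means and variances
print(model.theta_)                  # estimated per-class feature means
print(model.predict([[0.2, 0.1], [2.8, 3.1]]))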
Applications
• Bioinformatics
• Cheminformatics
• Quantitative structure-activity relationship
• Database marketing
• Handwriting recognition
• Information retrieval
• Learning to rank
• Object recognition in computer vision
• Optical character recognition
• Spam detection
• Pattern recognition
• Speech recognition
• Forecasting Fraudulent Financial Statements
General issues
• Computational learning theory
• Inductive bias
• Overfitting (machine learning)
• (Uncalibrated) Class membership probabilities
• Version spaces
Notes
[1] Geman et al., 1992
[2] James, 2003
[3] Vapnik, 2000
References
• L. Breiman (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics 24(6),
2350-2382.
• G. James (2003) Variance and Bias for General Loss Functions, Machine Learning 51, 115-135. (http://www-bcf.usc.edu/~gareth/research/bv.pdf)
• S. Geman, E. Bienenstock, and R. Doursat (1992). Neural networks and the bias/variance dilemma. Neural
Computation 4, 1–58.
• Vapnik, V. N. The Nature of Statistical Learning Theory (2nd Ed.), Springer Verlag, 2000.
External links
• Several supervised machine learning algorithm implementations in Ruby (http://ai4r.rubyforge.org)
Semi-supervised learning
In computer science, semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled
data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and
supervised learning (with completely labeled training data). Many machine-learning researchers have found that
unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable
improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled
human agent to manually classify training examples. The cost associated with the labeling process thus may render a
fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such
situations, semi-supervised learning can be of great practical value.
One example of a semi-supervised learning technique is co-training, in which two or possibly more learners are each
trained on a set of examples, but with each learner using a different, and ideally independent, set of features for each
example.
An alternative approach is to model the joint probability distribution of the features and the labels. For the unlabelled
data the labels can then be treated as 'missing data'. Techniques that handle missing data, such as Gibbs sampling or
the EM algorithm, can then be used to estimate the parameters of the model.
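A minimal sketch, assuming scikit-learn, of one graph-based semi-supervised method (label propagation, which is related to but distinct from the EM approach above): only a handful of points are labeled and the rest are marked as unlabeled with -1.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

# Keep labels for only 10 points; mark the rest as unlabeled (-1 by convention)
y_train = np.full_like(y, -1)
labeled = np.random.default_rng(0).choice(len(y), size=10, replace=False)
y_train[labeled] = y[labeled]

model = LabelPropagation().fit(X, y_train)
print("accuracy on all points:", (model.predict(X) == y).mean())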
See also
• Constrained clustering
• Transductive learning
References
1. Abney, S., Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC, 2008.
2. Blum, A., Mitchell, T. Combining labeled and unlabeled data with co-training [1]. COLT: Proceedings of the
Workshop on Computational Learning Theory, Morgan Kaufmann, 1998, p. 92-100.
3. Chapelle, O., B. Schölkopf and A. Zien: Semi-Supervised Learning. MIT Press, Cambridge, MA (2006). Further
information [2].
4. Huang T-M., Kecman V., Kopriva I. [3], "Kernel Based Algorithms for Mining Huge Data Sets, Supervised,
Semisupervised and Unsupervised Learning", Springer-Verlag, Berlin, Heidelberg, 260 pp. 96 illus., Hardcover,
ISBN 3-540-31681-7, 2006.
5. O'Neill, T. J. (1978) Normal discrimination with unclassified observations. Journal of the American Statistical
Association, 73, 821–826.
6. Theodoridis S., Koutroumbas K. (2009) "Pattern Recognition" , 4th Edition, Academic Press, ISBN:
978-1-59749-272-0.
7. Zhu, X. Semi-supervised learning literature survey [4].
8. Zhu, X., Goldberg, A. Introduction to Semi-Supervised Learning [5]. Morgan & Claypool Publishers, 2009.
References
[1] http://www.cs.wustl.edu/~zy/paper/cotrain.ps
[2] http://www.kyb.tuebingen.mpg.de/ssl-book/
[3] http://www.learning-from-data.com
[4] http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
[5] http://www.morganclaypool.com/doi/abs/10.2200/S00196ED1V01Y200906AIM006
Active learning (machine learning)

Definitions
Let $T$ be the total set of all data under consideration. For example, in a protein engineering problem, $T$ would include all proteins that are known to have a certain interesting activity and all additional proteins that one might want to test for that activity.
During each iteration $i$, $T$ is broken up into three subsets:
1. $T_{K,i}$: Data points where the label is known.
2. $T_{U,i}$: Data points where the label is unknown.
3. $T_{C,i}$: A subset of $T_{U,i}$ that is chosen to be labeled.
Most of the current research in active learning involves the best method to choose the data points for $T_{C,i}$.
Maximum Curiosity
Another active learning method, which typically learns a data set with fewer examples than Minimum Marginal Hyperplane but is more computationally intensive and only works for discrete classifiers, is Maximum Curiosity.[3]
Maximum Curiosity takes each unlabeled datum in $T_{U,i}$ and assumes all possible labels that datum might have. The datum with each assumed class is added to $T_{K,i}$ and then the new $T_{K,i}$ is cross-validated. It is assumed that when the datum is paired with its correct label, the cross-validated accuracy (or correlation coefficient) of $T_{K,i}$ will improve the most. The datum with the most improved accuracy is placed in $T_{C,i}$ to be labeled.
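A rough sketch, assuming scikit-learn and a hypothetical toy dataset, of the Maximum Curiosity selection step: each unlabeled datum is tried with every possible label and scored by cross-validated accuracy.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def max_curiosity_pick(X_known, y_known, X_unknown, labels=(0, 1)):
    # Return the index of the unlabeled datum whose best assumed label most improves CV accuracy
    best_score, best_idx = -np.inf, None
    for i, x in enumerate(X_unknown):
        for lab in labels:
            # Tentatively add the datum with an assumed label, then cross-validate
            X_try = np.vstack([X_known, x])
            y_try = np.append(y_known, lab)
            score = cross_val_score(DecisionTreeClassifier(random_state=0), X_try, y_try, cv=3).mean()
            if score > best_score:
                best_score, best_idx = score, i
    return best_idx

# Hypothetical toy data: 6 labeled points, 3 unlabeled candidates
X_known = np.array([[0.], [1.], [2.], [8.], [9.], [10.]])
y_known = np.array([0, 0, 0, 1, 1, 1])
X_unknown = np.array([[3.], [5.], [7.]])
print(max_curiosity_pick(X_known, y_known, X_unknown))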
Notes
[1] Settles, Burr (2009), "Active Learning Literature Survey" (http://pages.cs.wisc.edu/~bsettles/pub/settles.activelearning.pdf), Computer Sciences Technical Report 1648, University of Wisconsin–Madison, retrieved 2010-09-14.
[2] Danziger, S.A., Swamidass, S.J., Zeng, J., Dearth, L.R., Lu, Q., Chen, J.H., Cheng, J., Hoang, V.P., Saigo, H., Luo, R., Baldi, P., Brachmann, R.K. and Lathrop, R.H. Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants (2006), IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3, 114-125.
[3] Danziger, S.A., Zeng, J., Wang, Y., Brachmann, R.K. and Lathrop, R.H. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants (2007), Bioinformatics, 23(13), 104-114. (http://bioinformatics.oxfordjournals.org/cgi/reprint/23/13/i104.pdf)
Structured prediction
Structured prediction is an umbrella term for machine learning and regression techniques that involve predicting
structured objects. For example, the problem of translating a natural language sentence into a semantic representation
such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of
all possible parse trees. Structured prediction generalizes supervised learning where the output domain is usually a
small or simple set.
Probabilistic graphical models form a large class of structured prediction models. In particular, Bayesian networks
and random fields are popularly used to solve structured prediction problems in a wide variety of application
domains including bioinformatics, natural language processing, speech recognition, and computer vision.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained by
means of observed data in which the true prediction value is used to adjust model parameters. Due to the complexity of the model and the interrelations of the predicted variables, prediction with a trained model and training itself are often computationally infeasible, and approximate inference and learning methods are used.
Another commonly used term for structured prediction is structured output learning.
Learning to rank
Learning to rank[1] or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine
learning problem in which the goal is to automatically construct a ranking model from training data. Training data
consists of lists of items with some partial order specified between items in each list. This order is typically induced
by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The ranking model's purpose is to rank, i.e. to produce a permutation of the items in new, unseen lists in a way that is "similar" to the rankings in the training data.
Learning to rank is a relatively new research area which has emerged in the past decade.
Applications
In information retrieval
Ranking is a central part of many information retrieval
problems, such as document retrieval, collaborative
filtering, sentiment analysis, computational advertising
(online ad placement).
When applied to document retrieval, the task of
learning to rank is to construct a ranking function for a
search engine. In this case each list in training data
represents documents which match a search query and
they are ordered according to relevance to the query.
A possible architecture of a machine-learned search
engine is shown in the figure to the right.
Training data consists of queries and documents
matching them together with relevance degree of each
match. It may be prepared manually by human
assessors (or raters, as Google calls them), who check
results for some queries and determine relevance of
each result. It is not feasible to check relevance of all documents, and so typically a technique called pooling is used
— only top few documents, retrieved by some existing ranking models are checked. Alternatively, training data may
be derived automatically by analyzing clickthrough logs (i.e. search results which got clicks from users),[2] query
chains,[3] or such search engines' features as Google's SearchWiki.
Training data is used by a learning algorithm to produce a ranking model which computes relevance of documents
for actual queries.
Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web
search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a
two-phase scheme is used.[4] First, a small number of potentially relevant documents are identified using simpler
retrieval models which permit fast query evaluation, such as the vector space model, the boolean model, weighted AND,[5] or BM25. This phase is called top-$k$ document retrieval, and many good heuristics have been proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes.[6] In the second phase, a more accurate
but computationally expensive machine-learned model is used to re-rank these documents.
In other areas
Learning to rank algorithms have been applied in areas other than information retrieval:
• In machine translation for ranking a set of hypothesized translations;[7]
• In computational biology for ranking candidate 3-D structures in protein structure prediction problem.[7]
• In proteomics for the identification of frequent top scoring peptides.[8]
Feature vectors
For convenience of MLR algorithms, query-document pairs are usually represented by numerical vectors, which are
called feature vectors. Such an approach is sometimes called bag of features and is analogous to the bag of words and
vector space model used in information retrieval for representation of documents.
Components of such vectors are called features, factors or ranking signals. They may be divided into three groups
(features from document retrieval are shown as examples):
• Query-independent or static features — those features, which depend only on the document, but not on the query.
For example, PageRank or document's length. Such features can be precomputed in off-line mode during
indexing. They may be used to compute document's static quality score (or static rank), which is often used to
speed up search query evaluation.[6] [9]
• Query-dependent or dynamic features — those features, which depend both on the contents of the document and
the query, such as TF-IDF score or other non-machine-learned ranking functions.
• Query features, which depend only on the query. For example, the number of words in a query.
Some examples of features, which were used in the well-known LETOR dataset:[10]
• TF, TF-IDF, BM25, and language modeling scores of document's zones (title, body, anchors text, URL) for a
given query;
• Lengths and IDF sums of document's zones;
• Document's PageRank, HITS ranks and their variants.
Selecting and designing good features is an important area in machine learning, which is called feature engineering.
Evaluation measures
There are several measures (metrics) which are commonly used to judge how well an algorithm is doing on training
data and to compare performance of different MLR algorithms. Often a learning-to-rank problem is reformulated as
an optimization problem with respect to one of these metrics.
Examples of ranking quality measures:
• Mean average precision (MAP);
• DCG and NDCG;
• Precision@n, NDCG@n, where "@n" denotes that the metrics are evaluated only on top n documents;
• Mean reciprocal rank;
• Kendall's tau
DCG and its normalized variant NDCG are usually preferred in academic research when multiple levels of relevance
are used.[11] Other metrics such as MAP, MRR and precision, are defined only for binary judgements.
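A minimal sketch, assuming NumPy and one common formulation of the gain ($2^{rel} - 1$) and discount ($\log_2(i + 1)$), of computing DCG@n and NDCG@n from graded relevance labels:

import numpy as np

def dcg_at_n(relevances, n):
    # Discounted cumulative gain over the top n results (graded relevance labels)
    rel = np.asarray(relevances, dtype=float)[:n]
    gains = 2 ** rel - 1
    discounts = np.log2(np.arange(2, len(rel) + 2))   # log2(i + 1) for i = 1..n
    return float((gains / discounts).sum())

def ndcg_at_n(relevances, n):
    # DCG normalized by the DCG of the ideal (descending-relevance) ordering
    ideal = sorted(relevances, reverse=True)
    return dcg_at_n(relevances, n) / dcg_at_n(ideal, n)

# A ranking whose third result is the most relevant document
print(ndcg_at_n([1, 0, 3, 2, 0], n=5))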
Recently, several new evaluation metrics have been proposed which claim to model the user's satisfaction with search results better than the DCG metric:
• Expected reciprocal rank (ERR);[12]
• Yandex's pfound.[13]
Both of these metrics are based on the assumption that the user is more likely to stop looking at search results after
examining a more relevant document, than after a less relevant document.
Approaches
Tie-Yan Liu of Microsoft Research Asia in his paper "Learning to Rank for Information Retrieval"[1] and talks at
several leading conferences has analyzed existing algorithms for learning to rank problems and categorized them into
three groups by their input representation and loss function:
Pointwise approach
In this case it is assumed that each query-document pair in the training data has a numerical or ordinal score. Then
the learning-to-rank problem can be approximated by a regression problem: given a single query-document pair,
predict its score.
A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal
regression and classification algorithms can also be used in the pointwise approach when they are used to predict the score of a single query-document pair, which takes a small, finite number of values.
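A minimal sketch of the pointwise approach, assuming scikit-learn and hypothetical feature vectors and relevance scores: a regressor is fitted to the scores and the candidate documents for a new query are ranked by its predictions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: one feature vector per query-document pair, plus a relevance score
X = rng.normal(size=(1000, 6))          # e.g. BM25, TF-IDF, PageRank, ... features
scores = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1000)).round().clip(0, 4)

model = GradientBoostingRegressor().fit(X, scores)

# Ranking for one new query: sort its candidate documents by predicted score
X_query = rng.normal(size=(10, 6))
ranking = np.argsort(-model.predict(X_query))
print(ranking)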
Pairwise approach
In this case the learning-to-rank problem is approximated by a classification problem: learning a binary classifier which can tell which document is better in a given pair of documents. The goal is to minimize the average number of inversions in ranking.
Listwise approach
These algorithms try to directly optimize the value of one of the above evaluation measures, averaged over all
queries in the training data. This is difficult because most evaluation measures are not continuous functions with
respect to ranking model's parameters, and so continuous approximations or bounds on evaluation measures have to
be used.
List of methods
A partial list of published learning-to-rank algorithms is shown below with the year of first publication of each method:
• 2008: Ranking Refinement[35] [36] (pairwise). A semi-supervised approach to learning to rank that uses boosting.
• 2009: MPBoost[41] (pairwise). A magnitude-preserving variant of RankBoost. The idea is that the more unequal the labels of a pair of documents are, the harder the algorithm should try to rank them.
• 2009: BoltzRank[42] (listwise). Unlike earlier methods, BoltzRank produces a ranking model that looks, during query time, not only at a single document but also at pairs of documents.
• 2010: GBlend[46] (pairwise). Extends GBRank to the learning-to-blend problem of jointly solving multiple learning-to-rank problems with some shared features.
Note: as most supervised learning algorithms can be applied to pointwise case, only those methods which are
specifically designed with ranking in mind are shown above.
History
C. Manning et al.[48] trace the earliest works on the learning-to-rank problem to papers from the late 1980s and early 1990s. They suggest that these early works achieved limited results in their time due to little available training data and poor machine learning techniques.
In the mid-1990s, Berkeley researchers used logistic regression to train a successful ranking function at the TREC conference.
Several conferences, such as NIPS, SIGIR and ICML, have held workshops devoted to the learning-to-rank problem since the mid-2000s, and this has stimulated much of the academic research.
References
[1] Tie-Yan Liu (2009), Learning to Rank for Information Retrieval, Foundations and Trends in Information Retrieval: Vol. 3: No 3, pp. 225–331, doi:10.1561/1500000016, ISBN 978-1-60198-244-5. Slides from Tie-Yan Liu's talk at the WWW 2009 conference are available online (http://www2009.org/pdf/T7A-LEARNING TO RANK TUTORIAL.pdf)
[2] Joachims, T. (2003), "Optimizing Search Engines using Clickthrough Data" (http://www.cs.cornell.edu/people/tj/publications/joachims_02c.pdf), Proceedings of the ACM Conference on Knowledge Discovery and Data Mining.
[3] Joachims T., Radlinski F. (2005), "Query Chains: Learning to Rank from Implicit Feedback" (http://radlinski.org/papers/Radlinski05QueryChains.pdf), Proceedings of the ACM Conference on Knowledge Discovery and Data Mining.
[4] B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt, "Early exit optimizations for additive machine learned ranking systems" (http://olivier.chapelle.cc/pub/wsdm2010.pdf), WSDM '10: Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010.
[5] Broder A., Carmel D., Herscovici M., Soffer A., Zien J. (2003), "Efficient query evaluation using a two-level retrieval process" (http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf), Proceedings of the twelfth international conference on Information and knowledge management: 426–434, ISBN 1-58113-723-0.
[6] Manning C., Raghavan P. and Schütze H. (2008), Introduction to Information Retrieval, Cambridge University Press. Section 7.1 (http://nlp.stanford.edu/IR-book/html/htmledition/efficient-scoring-and-ranking-1.html)
[7] Kevin K. Duh (2009), Learning to Rank with Partially-Labeled Data (http://ssli.ee.washington.edu/people/duh/thesis/uwthesis.pdf).
[8] Henneges C., Hinselmann G., Jung S., Madlung J., Schütz W., Nordheim A., Zell A. (2009), Ranking Methods for the Prediction of Frequent Top Scoring Peptides from Proteomics Data (http://www.omicsonline.com/ArchiveJPB/2009/May/01/JPB2.226.pdf).
[9] Richardson, M.; Prakash, A. and Brill, E. (2006). "Beyond PageRank: Machine Learning for Static Ranking" (http://research.microsoft.com/en-us/um/people/mattri/papers/www2006/staticrank.pdf). pp. 707–715.
[10] LETOR 3.0. A Benchmark Collection for Learning to Rank for Information Retrieval (http://research.microsoft.com/en-us/people/taoqin/letor3.pdf)
[11] http://www.stanford.edu/class/cs276/handouts/lecture15-learning-ranking.ppt
[12] Olivier Chapelle, Donald Metzler, Ya Zhang, Pierre Grinspan (2009), "Expected Reciprocal Rank for Graded Relevance" (http://research.yahoo.com/files/err.pdf), CIKM.
[13] Gulin A., Karpovich P., Raskovalov D., Segalovich I. (2009), "Yandex at ROMIP'2009: optimization of ranking algorithms by machine
learning methods" (http:/ / romip. ru/ romip2009/ 15_yandex. pdf), Proceedings of ROMIP'2009: 163–168, (in Russian)
[14] http:/ / research. microsoft. com/ apps/ pubs/ default. aspx?id=65610
[15] http:/ / citeseerx. ist. psu. edu/ viewdoc/ summary?doi=10. 1. 1. 20. 378
[16] http:/ / jmlr. csail. mit. edu/ papers/ volume4/ freund03a/ freund03a. pdf
[17] http:/ / research. microsoft. com/ en-us/ um/ people/ cburges/ papers/ ICML_ranking. pdf
[18] http:/ / research. microsoft. com/ en-us/ people/ tyliu/ cao-et-al-sigir2006. pdf
[19] http:/ / research. microsoft. com/ en-us/ um/ people/ cburges/ papers/ lambdarank. pdf
[20] http:/ / research. microsoft. com/ en-us/ people/ junxu/ sigir2007-adarank. pdf
[21] http:/ / research. microsoft. com/ apps/ pubs/ default. aspx?id=70364
[22] http:/ / www. cc. gatech. edu/ ~zha/ papers/ fp086-zheng. pdf
[23] http:/ / research. microsoft. com/ apps/ pubs/ default. aspx?id=70428
[24] http:/ / research. microsoft. com/ apps/ pubs/ default. aspx?id=68128
[25] http:/ / www. stat. rutgers. edu/ ~tzhang/ papers/ nips07-ranking. pdf
[26] http:/ / research. microsoft. com/ en-us/ people/ hangli/ qin_ipm_2008. pdf
[27] http:/ / citeseerx. ist. psu. edu/ viewdoc/ download?doi=10. 1. 1. 90. 220& rep=rep1& type=pdf
[28] http:/ / tucs. fi/ publications/ attachment. php?fname=inpPaTsAiBoSa07a. pdf
[29] http:/ / www. cs. cornell. edu/ People/ tj/ publications/ yue_etal_07a. pdf
[30] ftp:/ / ftp. research. microsoft. com/ pub/ tr/ TR-2008-109. pdf
[31] C. Burges. (2010). From RankNet to LambdaRank to LambdaMART: An Overview (http:/ / research. microsoft. com/ en-us/ um/ people/
cburges/ tech_reports/ MSR-TR-2010-82. pdf).
[32] http:/ / research. microsoft. com/ en-us/ people/ tyliu/ icml-listmle. pdf
[33] http:/ / research. microsoft. com/ en-us/ people/ junxu/ sigir2008-directoptimize. pdf
[34] http:/ / research. microsoft. com/ apps/ pubs/ ?id=63585
[35] http:/ / www. cse. msu. edu/ ~valizade/ Publications/ ranking_refinement. pdf
[36] Rong Jin, Hamed Valizadegan, Hang Li, Ranking Refinement and Its Application for Information Retrieval (http:/ / www. cse. msu. edu/
~valizade/ Publications/ ranking_refinement. pdf), in International Conference on World Wide Web (WWW), 2008.
[37] http:/ / www-connex. lip6. fr/ ~amini/ SSRankBoost/
[38] Massih-Reza Amini, Vinh Truong, Cyril Goutte, A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled
Data (http:/ / www-connex. lip6. fr/ ~amini/ Publis/ SemiSupRanking_sigir08. pdf), International ACM SIGIR conference, 2008.
[39] http:/ / phd. dii. unisi. it/ PosterDay/ 2009/ Tiziano_Papini. pdf
[40] Leonardo Rigutini, Tiziano Papini, Marco Maggini, Franco Scarselli, "SortNet: learning to rank by a neural-based sorting algorithm" (http:/ /
research. microsoft. com/ en-us/ um/ beijing/ events/ lr4ir-2008/ PROCEEDINGS-LR4IR 2008. PDF), SIGIR 2008 workshop: Learning to
Rank for Information Retrieval, 2008
[41] http:/ / itcs. tsinghua. edu. cn/ papers/ 2009/ 2009031. pdf
[42] http:/ / www. cs. toronto. edu/ ~zemel/ Papers/ boltzRank-ICML2009. pdf
[43] http:/ / www. iis. sinica. edu. tw/ papers/ whm/ 8820-F. pdf
[44] http:/ / www. cse. msu. edu/ ~valizade/ Publications/ NDCG_Boost. pdf
[45] Hamed Valizadegan, Rong Jin, Ruofei Zhang, Jianchang Mao, Learning to Rank by Optimizing NDCG Measure (http:/ / www. cse. msu.
edu/ ~valizade/ Publications/ NDCG_Boost. pdf), in Proceeding of Neural Information Processing Systems (NIPS), 2010.
[46] http:/ / arxiv. org/ abs/ 1001. 4597
[47] http:/ / wume. cse. lehigh. edu/ ~ovd209/ wsdm/ proceedings/ docs/ p151. pdf
[48] Manning C., Raghavan P. and Schütze H. (2008), Introduction to Information Retrieval, Cambridge University Press. Sections 7.4 (http:/ /
nlp. stanford. edu/ IR-book/ html/ htmledition/ references-and-further-reading-7. html) and 15.5 (http:/ / nlp. stanford. edu/ IR-book/ html/
htmledition/ references-and-further-reading-15. html)
[49] Jan O. Pedersen. The MLR Story (http:/ / jopedersen. com/ Presentations/ The_MLR_Story. pdf)
[50] U.S. Patent 7197497 (http:/ / www. google. com/ patents?vid=7197497)
[51] Bing Search Blog: User Needs, Features and the Science behind Bing (http:/ / www. bing. com/ community/ blogs/ search/ archive/ 2009/
06/ 01/ user-needs-features-and-the-science-behind-bing. aspx?PageIndex=4)
[52] Yandex corporate blog entry about new ranking model "Snezhinsk" (http:/ / webmaster. ya. ru/ replies. xml?item_no=5707& ncrnd=5118)
(in Russian)
[53] The algorithm wasn't disclosed, but a few details were made public in (http:/ / download. yandex. ru/ company/ experience/ GDD/
Zadnie_algoritmy_Karpovich. pdf) and (http:/ / download. yandex. ru/ company/ experience/ searchconf/
Searchconf_Algoritm_MatrixNet_Gulin. pdf).
[54] Yandex's Internet Mathematics 2009 competition page (http:/ / imat2009. yandex. ru/ academic/ mathematic/ 2009/ en/ )
[55] Yahoo Learning to Rank Challenge (http:/ / learningtorankchallenge. yahoo. com/ )
[56] Rajaraman, Anand (2008-05-24). "Are Machine-Learned Models Prone to Catastrophic Errors?" (http:/ / www. webcitation. org/
5sq8irWNM). Archived from the original (http:/ / anand. typepad. com/ datawocky/ 2008/ 05/
are-human-experts-less-prone-to-catastrophic-errors-than-machine-learned-models. html) on 2010-09-18. .
[57] Costello, Tom (2009-06-26). "Cuil Blog: So how is Bing doing?" (http://www.webcitation.org/5sq7DX3Pj). Archived from the original (http://www.cuil.com/info/blog/2009/06/26/so-how-is-bing-doing) on 2010-09-15.
External links
Competitions and public datasets
• LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval (http://research.
microsoft.com/en-us/um/people/letor/)
• Yandex's Internet Mathematics 2009 (http://imat2009.yandex.ru/en/)
• Yahoo! Learning to Rank Challenge (http://learningtorankchallenge.yahoo.com/)
• Microsoft Learning to Rank Datasets (http://research.microsoft.com/en-us/projects/mslr/default.aspx)
Unsupervised learning
In machine learning, unsupervised learning is a class of problems in which one seeks to determine how the data are
organized. Many of the methods employed are based on data mining methods used to preprocess data. It is
distinguished from supervised learning (and reinforcement learning) in that the learner is given only unlabeled
examples.
Unsupervised learning is closely related to the problem of density estimation in statistics. However unsupervised
learning also encompasses many other techniques that seek to summarize and explain key features of the data.
One form of unsupervised learning is clustering. Another example is blind source separation based on Independent
Component Analysis (ICA).
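The following is a minimal clustering sketch in Python, using the classic k-means procedure as a stand-in for the clustering methods mentioned above; the toy data, the choice of two clusters, and the fixed iteration count are assumptions made purely for illustration.

    import random

    def kmeans(points, k, iterations=20):
        """Cluster 2-D points into k groups with Lloyd's algorithm (no labels needed)."""
        centroids = random.sample(points, k)                     # initial centroid guesses
        for _ in range(iterations):
            # Assignment step: attach every point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster))
        return centroids, clusters

    # Unlabeled toy data forming two loose groups.
    data = [(1.0, 1.1), (0.9, 1.3), (1.2, 0.8), (8.0, 8.2), (7.8, 8.5), (8.3, 7.9)]
    print(kmeans(data, k=2)[0])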
Among neural network models, the Self-organizing map (SOM) and Adaptive resonance theory (ART) are
commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations
in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with
problem size and lets the user control the degree of similarity between members of the same clusters by means of a
user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks,
such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed
by Carpenter and Grossberg (1988).
Bibliography
• Geoffrey Hinton, Terrence J. Sejnowski (editors) (1999): Unsupervised Learning: Foundations of Neural
Computation, MIT Press, ISBN 0-262-58168-X (This book focuses on unsupervised learning in neural networks.)
• Richard O. Duda, Peter E. Hart, David G. Stork: Unsupervised Learning and Clustering, Ch. 10 in Pattern
classification (2nd edition), p. 571, Wiley, New York, ISBN 0-471-05669-3, 2001.
• Ranjan Acharyya (2008): A New Approach for Blind Source Separation of Convolutive Sources, ISBN
978-3639077971 (this book focuses on unsupervised learning with Blind Source Separation)
See also
• Artificial neural network
• Blind Source Separation
• Data clustering
• Data mining
• Expectation-maximization algorithm
• Generative topographic map
• Multivariate analysis
• Radial basis function network
• Self-organizing map
• Time Adaptive Self-Organizing Map
Reinforcement learning
Inspired by behaviorist psychology, reinforcement learning is an area of machine learning in computer science
concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative
reward. The problem, due to its generality, is studied in many other disciplines, such as control theory, operations
research, information theory, simulation-based optimization, statistics, and genetic algorithms. In the operations
research and control literature the field where reinforcement learning methods are studied is called approximate
dynamic programming. The problem has been studied in the theory of optimal control, though most studies there are
concerned with existence of optimal solutions and their characterization, and not with the learning or approximation
aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise
under bounded rationality.
In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many
reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main
difference to these classical techniques is that reinforcement learning algorithms do not need the knowledge of the
MDP and they target large MDPs where exact methods become infeasible.
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never
presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which
involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The
exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the
multi-armed bandit problem and in finite MDPs.
The basic reinforcement learning model consists of:
1. a set of environment states S;
2. a set of actions A;
3. rules of transitioning between states;
4. rules that determine the scalar immediate reward of a transition; and
5. rules that describe what the agent observes.
The rules are often stochastic. The observation typically involves the scalar immediate reward associated with the last
transition. In many works, the agent is also assumed to observe the current environmental state, in which case we
talk about full observability, whereas in the opposing case we talk about partial observability. Sometimes the set of
actions available to the agent is restricted (e.g., the agent cannot spend more money than it possesses).
A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent
receives an observation o_t, which typically includes the reward r_t. It then chooses an action a_t from the set of
actions available, which is subsequently sent to the environment. The environment moves to a new state s_{t+1} and
the reward r_{t+1} associated with the transition is determined. The goal of a reinforcement learning agent is to collect as
much reward as possible. The agent can choose any action as a function of the history, and it can even randomize
its action selection.
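In code, the interaction loop just described is only a few lines; the environment object below and its reset/step interface are assumptions made for the sketch (a common convention, not part of the formal model):

    def run_episode(environment, policy):
        """One episode of the agent-environment loop: observe, act, receive reward, repeat."""
        total_reward = 0.0
        observation = environment.reset()            # initial observation
        done = False
        while not done:
            action = policy(observation)              # agent picks an action from those available
            observation, reward, done = environment.step(action)  # environment returns next observation and reward
            total_reward += reward                    # the agent tries to make this sum large
        return total_reward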
When the agent's performance is compared to that of an agent which acts optimally from the beginning, the
difference in performance gives rise to the notion of regret. Note that in order to act near-optimally, the agent must
reason about the long-term consequences of its actions: for example, in order to maximize future income it may be
better to go to school now, although the immediate monetary reward associated with this might be negative.
Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term
reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling,
telecommunications, backgammon and chess (Sutton and Barto 1998, Chapter 11).
Two components make reinforcement learning powerful: The use of samples to optimize performance and the use of
function approximation to deal with large environments. Thus, reinforcement learning is most useful when the
environment is large or cannot be described precisely. Reinforcement learning methods can also be applied when the
environment can only be interacted with through a reasonably accurate simulation, a problem studied in simulation-based optimization.
Exploration
The reinforcement learning problem as described requires clever exploration mechanisms. Randomly selecting
actions is known to give rise to very poor performance. The case of (small) finite MDPs is relatively well understood
by now. However, due to the lack of algorithms that would provably scale well with the number of states (or scale to
problems with infinite state spaces), in practice people resort to simple exploration methods. One such method is
ε-greedy, in which the agent chooses the action that it believes has the best long-term effect with probability 1 − ε, and
chooses an action uniformly at random otherwise. Here, 0 < ε < 1 is a tuning parameter, which is sometimes
changed, either according to a fixed schedule (making the agent explore less as time goes by), or adaptively based on
some heuristics (Tokic, 2010).
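A minimal sketch of ε-greedy action selection as described above; the dictionary of estimated action values is an assumption of the example:

    import random

    def epsilon_greedy(action_values, epsilon):
        """Return the greedy action with probability 1 - epsilon, a uniformly random one otherwise.

        action_values maps each available action to its current estimated long-term value."""
        if random.random() < epsilon:
            return random.choice(list(action_values))        # explore
        return max(action_values, key=action_values.get)     # exploit current knowledge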
Criterion of optimality
For simplicity, assume for a moment that the problem studied is episodic, an episode ending when some terminal
state is reached. Assume further that no matter what course of actions the agent takes, termination is inevitable with
probability one. Under some additional mild regularity conditions the expectation of the total reward is then
well-defined, for any policy and any initial distribution over the states. Given a fixed initial distribution μ, we
can thus assign the expected return ρ^π to a policy π:

ρ^π = E[R],   where   R = r_1 + r_2 + … + r_N,

r_t is the reward received after the t-th transition, the initial state is sampled at random from μ and
actions are selected by policy π. Here, N denotes the (random) time when a terminal state is reached, i.e., the time
when the episode terminates.
In the case of non-episodic problems the return is often discounted,

R = r_1 + γ r_2 + γ^2 r_3 + … ,

giving rise to the total expected discounted reward criterion. Here 0 ≤ γ < 1 is the so-called discount factor.
Since the undiscounted return is a special case of the discounted return, from now on we will assume discounting.
Although this looks innocent enough, discounting is in fact problematic if one cares about online performance. This
is because discounting makes the initial time steps more important. Since a learning agent is likely to make mistakes
during the first few steps after its "life" starts, no uninformed learning algorithm can achieve near-optimal
performance under discounting even if the class of environments is restricted to that of finite MDPs. (This does not
mean though that, given enough time, a learning agent cannot figure out how to act near-optimally if time were
restarted.)
The problem then is to specify an algorithm that can be used to find a policy with maximum expected return. From
the theory of MDPs it is known that, without the loss of generality, the search can be restricted to the set of the
so-called stationary policies. A policy is called stationary if the action-distribution returned by it depends only on the
last state visited (which is part of the observation history of the agent, by our simplifying assumption). In fact, the
search can be further restricted to deterministic stationary policies. A deterministic stationary policy is one which
deterministically selects actions based on the current state. Since any such policy can be identified with a mapping
from the set of states to the set of actions, these policies can be identified with such mappings with no loss of
generality.
Brute force
The naive brute force approach entails the following two steps:
1. For each possible policy, sample returns while following it
2. Choose the policy with the largest expected return
One problem with this is that the number of policies can be extremely large, or even infinite. Another is that the
variance of the returns might be large, in which case a large number of samples will be required to estimate the
return of each policy accurately.
These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one
policy to influence the estimates made for another. The two main approaches for achieving this are value function
estimation and direct policy search.
Value function approaches
Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of
expected returns for some policy (usually either the current one or the optimal one). The value of a state s under a
policy π is defined as

V^π(s) = E[R | s, π],

where R stands for the random return associated with following π from the initial state s. Define V*(s) as the
maximum possible value of V^π(s), where π is allowed to change:

V*(s) = max_π V^π(s).

A policy which achieves these optimal values in each state is called optimal. Clearly, a policy optimal in this strong
sense is also optimal in the sense that it maximizes the expected return ρ^π, since ρ^π = E[V^π(S)], where S is a
state randomly sampled from the initial distribution μ.
Although state-values suffice to define optimality, it will prove useful to define action-values. Given a state s,
an action a and a policy π, the action-value of the pair (s, a) under π is defined by

Q^π(s, a) = E[R | s, a, π],

where R now stands for the random return associated with first taking action a in state s and following π
thereafter.
It is well known from the theory of MDPs that if someone gives us Q for an optimal policy, we can always choose
optimal actions (and thus act optimally) by simply choosing the action with the highest value at each state. The
action-value function of such an optimal policy is called the optimal action-value function and is denoted by Q*. In
summary, the knowledge of the optimal action-value function alone suffices to know how to act optimally.
Assuming full knowledge of the MDP, there are two basic approaches to compute the optimal action-value function,
value iteration and policy iteration. Both algorithms compute a sequence of functions Q_k (k = 0, 1, 2, …)
which converges to Q*. Computing these functions involves computing expectations over the whole state space,
which is impractical for all but the smallest (finite) MDPs, never mind the case when the MDP is unknown. In
reinforcement learning methods the expectations are approximated by averaging over samples, and function
approximation techniques are used to cope with the need to represent value functions over large state-action spaces.
A common choice is linear function approximation, in which each state-action pair is mapped to a feature vector and
the action-values are represented as a linear combination of the features with a vector of weights. The algorithms
then adjust the weights, instead of adjusting the values associated with the individual state-action
pairs. However, linear function approximation is not the only choice. More recently, methods based on ideas from
nonparametric statistics (which can be seen to construct their own features) have been explored.
So far, the discussion was restricted to how policy iteration can be used as a basis for designing reinforcement
learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the
Q-learning algorithm (Watkins 1989) and its many variants.
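As a concrete sketch of this sample-based, value-iteration-flavoured approach, here is minimal tabular Q-learning in Python; the environment interface (reset, step, actions) and the hyperparameter values are assumptions for the example rather than part of the algorithm's definition:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning: update action-value estimates from sampled transitions."""
        Q = defaultdict(float)                                   # Q[(state, action)], defaults to 0
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy behaviour policy
                if random.random() < epsilon:
                    action = random.choice(env.actions(state))
                else:
                    action = max(env.actions(state), key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # move Q(s, a) toward the sampled target: reward + gamma * max over next actions
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q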
The problem with methods that use action-values is that they may need highly precise estimates of the competing
action values, which can be hard to obtain when the returns are noisy. Though this problem is mitigated to some
extent by temporal difference methods and by the so-called compatible function approximation method,
more work remains to be done to increase generality and efficiency. Another problem specific to temporal difference
methods comes from their reliance on the recursive Bellman equation. Most temporal difference methods have a
so-called λ parameter (0 ≤ λ ≤ 1) that allows one to continuously interpolate between Monte-Carlo methods
(which do not rely on the Bellman equations) and the basic temporal difference methods (which rely entirely on the
Bellman equations), and which can thus be effective in palliating this issue.
Direct policy search
An alternative is to search directly in (some subset of) the policy space, typically by considering a family of policies
parameterized by a vector θ, so that the expected return ρ becomes a function of θ. Under mild conditions this
function will be differentiable as a function of the parameter vector θ. If the gradient of ρ were known, one could use
gradient ascent. Since an analytic expression for the gradient is not available, one must rely on a noisy estimate.
Such an estimate can be constructed in many ways, giving rise to algorithms like Williams' REINFORCE method
(which is also known as the likelihood ratio method in the simulation-based optimization literature). Policy gradient
methods have received a lot of attention in the last couple of years (e.g., Peters et al. (2003)), but they remain an
active field. The issue with many of these methods is that they may get stuck in local optima (as they are based on
local search).
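A rough sketch of a likelihood-ratio (REINFORCE-style) gradient estimate for a tabular softmax policy; the environment interface and the learning rate are again assumptions made for the example:

    import math
    import random
    from collections import defaultdict

    def policy_probs(theta, state, actions):
        """Softmax (Gibbs) policy with one preference parameter theta[(state, action)] per pair."""
        prefs = [math.exp(theta[(state, a)]) for a in actions]
        total = sum(prefs)
        return [p / total for p in prefs]

    def reinforce_episode(env, theta, alpha=0.01, gamma=1.0):
        """Run one episode with the current policy, then take one likelihood-ratio gradient step."""
        trajectory, state, done = [], env.reset(), False
        while not done:
            actions = env.actions(state)
            probs = policy_probs(theta, state, actions)
            action = random.choices(actions, weights=probs)[0]
            next_state, reward, done = env.step(action)
            trajectory.append((state, actions, action, probs, reward))
            state = next_state
        grad, G = defaultdict(float), 0.0
        for state, actions, action, probs, reward in reversed(trajectory):
            G = reward + gamma * G                        # return following this time step
            for a, p in zip(actions, probs):              # d log pi(action|state) / d theta[(state, a)]
                grad[(state, a)] += G * ((1.0 if a == action else 0.0) - p)
        for key, g in grad.items():                       # single gradient-ascent step on the estimate
            theta[key] += alpha * g
        return theta

    # usage sketch with a hypothetical episodic environment: theta = defaultdict(float)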
A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy
search or methods of evolutionary computation. Many gradient-free methods can achieve (in theory and in the limit)
a global optimum. In a number of cases they have indeed demonstrated remarkable performance.
The issue with policy search methods is that they may converge slowly when the information on which they are based
is noisy. For example, this happens when, in episodic problems, the trajectories are long and the variance of the returns
is large. As argued beforehand, value-function based methods that rely on temporal differences might help in this
case. In recent years, several actor-critic algorithms have been proposed following this idea and were demonstrated
to perform well on various benchmarks.
Theory
The theory for small, finite MDPs is quite mature. Both the asymptotic and finite-sample behavior of most
algorithms is well-understood. As mentioned beforehand, algorithms with provably good online performance
(addressing the exploration issue) are known. The theory of large MDPs needs more work. Efficient exploration is
largely untouched (except for the case of bandit problems). Although finite-time performance bounds appeared for
many algorithms in recent years, these bounds are expected to be rather loose and thus more work is needed to
better understand the relative advantages, as well as the limitations, of these algorithms. For incremental algorithms,
asymptotic convergence issues have been settled. Recently, new incremental, temporal-difference-based algorithms
have appeared which converge under a much wider set of conditions than was previously possible (for example,
when used with arbitrary, smooth function approximation).
Current research
Current research topics include: adaptive methods which work with fewer (or no) parameters under a large number
of conditions, addressing the exploration problem in large MDPs, large scale empirical evaluations, learning and
acting under partial information (e.g., using Predictive State Representation), modular and hierarchical reinforcement
learning, improving existing value-function and policy search methods, algorithms that work well with large (or
continuous) action spaces, transfer learning, lifelong learning, efficient sample-based planning (e.g., based on
Monte-Carlo tree search). Multiagent or Distributed Reinforcement Learning is also a topic of interest in current
research. There is also a growing interest in real-life applications of reinforcement learning. Successes of
reinforcement learning are collected here [1] and here [2].
Reinforcement learning algorithms such as TD learning are also being investigated as a model for dopamine-based
learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia
function as the prediction error. Reinforcement learning has also been used as a part of the model for human skill
learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first
publication on this application was in 1995–1996, and there have been many follow-up studies). See
http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism for further details of these research areas.
Literature
Conferences, journals
Most reinforcement learning papers are published at the major machine learning and AI conferences (ICML, NIPS,
AAAI, IJCAI, UAI, AI and Statistics) and journals (JAIR [3], JMLR [4], Machine learning journal [5]). Some theory
papers are published at COLT and ALT. However, many papers appear in robotics conferences (IROS, ICRA) and
the "agent" conference AAMAS. Operations researchers publish their papers at the INFORMS conference and, for
example, in the Operation Research [6], and the Mathematics of Operations Research [7] journals. Control researchers
publish their papers at the CDC and ACC conferences, or, e.g., in the journals IEEE Transactions on Automatic
Control [8], or Automatica [9], although applied works tend to be published in more specialized journals. The Winter
Simulation Conference [10] also publishes many relevant papers. Other than this, papers also published in the major
conferences of the neural networks, fuzzy, and evolutionary computation communities. The annual IEEE symposium
titled Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the biannual European
Workshop on Reinforcement Learning (EWRL) are two regularly held meetings where RL researchers meet.
See also
• Temporal difference learning
• Q learning
• SARSA
• Fictitious play
• Optimal control
• Dynamic treatment regimes
• Error-driven learning
Implementations
• RL-Glue [11] provides a standard interface that allows you to connect agents, environments, and experiment
programs together, even if they are written in different languages.
• Maja Machine Learning Framework [12] The Maja Machine Learning Framework (MMLF) is a general
framework for problems in the domain of Reinforcement Learning (RL) written in python.
• Software Tools for Reinforcement Learning (Matlab and Python) [13]
• PyBrain (Python) [14]
• TeachingBox [15] is a Java reinforcement learning framework supporting many features such as RBF networks and
gradient descent learning methods.
• Open source C++ implementations [16] for some well known reinforcement learning algorithms.
• Orange, a free data mining software suite, module orngReinforcement [17]
References
• Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement Learning [18]. (PhD thesis).
• Williams, Ronald J. (1987). "A class of gradient-estimating algorithms for reinforcement learning in neural
networks" [19]. Proceedings of the IEEE First International Conference on Neural Networks [19].
• Sutton, Richard S. (1988). "Learning to predict by the method of temporal differences" [20]. Machine Learning
(Springer) 3: 9–44. doi:10.1007/BF00115009.
• Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards [21]. (PhD thesis).
• Bradtke, Steven J.; Andrew G. Barto (1996). "Linear Least-Squares Algorithms for Temporal Difference Learning" [22].
Machine Learning (Springer) 22: 33–57. doi:10.1023/A:1018056104778.
• Bertsekas, Dimitri P.; John Tsitsiklis (1996). Neuro-Dynamic Programming [23]. Nashua, NH: Athena Scientific.
ISBN 1-886529-10-8.
• Kaelbling, Leslie P.; Michael L. Littman; Andrew W. Moore (1996). "Reinforcement Learning: A Survey" [24].
Journal of Artificial Intelligence Research 4: 237–285.
• Sutton, Richard S.; Andrew G. Barto (1998). Reinforcement Learning: An Introduction [25]. MIT Press.
ISBN 0-262-19398-1.
• Peters, Jan; Sethu Vijayakumar; Stefan Schaal (2003). "Reinforcement Learning for Humanoid Robotics" [26].
IEEE-RAS International Conference on Humanoid Robots [26].
• Powell, Warren (2007). Approximate dynamic programming: solving the curses of dimensionality [27].
Wiley-Interscience. ISBN 0470171553.
• Auer, Peter; Thomas Jaksch; Ronald Ortner (2010). "Near-optimal regret bounds for reinforcement learning" [28].
Journal of Machine Learning Research 11: 1563–1600.
• Szita, Istvan; Csaba Szepesvari (2010). "Model-based Reinforcement Learning with Nearly Tight Exploration
Complexity Bounds" [29]. ICML 2010 [29]. Omnipress. pp. 1031–1038.
• Bertsekas, Dimitri P. (August 2010). "Chapter 6 (online): Approximate Dynamic Programming" [30]. Dynamic
Programming and Optimal Control. II (3 ed.).
• Busoniu, Lucian; Robert Babuska; Bart De Schutter; Damien Ernst (2010). Reinforcement Learning and
Dynamic Programming using Function Approximators [31]. Taylor & Francis CRC Press.
ISBN 978-1-4398-2108-4.
• Tokic, Michel (2010). "Adaptive ε-Greedy Exploration in Reinforcement Learning Based on Value Differences" [32].
KI 2010: Advances in Artificial Intelligence. Lecture Notes in Computer Science 6359. Springer Berlin /
Heidelberg. pp. 203–210.
External links
• Reinforcement Learning Repository [33]
• Reinforcement Learning and Artificial Intelligence [34] (Sutton's lab at the University of Alberta)
• Autonomous Learning Laboratory [35] (Barto's lab at the University of Massachusetts Amherst)
• RL-Glue [36]
• Software Tools for Reinforcement Learning (Matlab and Python) [13]
• The UofA Reinforcement Learning Library (texts) [37]
• The Reinforcement Learning Toolbox from the (Graz University of Technology) [38]
• Hybrid reinforcement learning [39]
• Piqle: a Generic Java Platform for Reinforcement Learning [40]
• A Short Introduction To Some Reinforcement Learning Algorithms [41]
• Reinforcement Learning applied to Tic-Tac-Toe Game [42]
• Scholarpedia Reinforcement Learning [43]
• Scholarpedia Temporal Difference Learning [44]
• Annual Reinforcement Learning Competition [45]
References
[1] http://umichrl.pbworks.com/Successes-of-Reinforcement-Learning/
[2] http://rl-community.org/wiki/Successes_Of_RL/
[3] http://www.jair.org
[4] http://www.jmlr.org
[5] http://www.springer.com/computer/ai/journal/10994
[6] http://or.pubs.informs.org
[7] http://mor.pubs.informs.org
[8] http://www.nd.edu/~ieeetac/
[9] http://www.elsevier.com/locate/automatica
[10] http://www.wintersim.org/
[11] http://glue.rl-community.org/
[12] http://mmlf.sourceforge.net/
[13] http://www.dia.fi.upm.es/~jamartin/download.htm
[14] http://www.pybrain.org/
[15] http://servicerobotik.hs-weingarten.de/en/teachingbox.php
[16] http://people.cs.uu.nl/hado/code.html
[17] http://www.ailab.si/orange/doc/modules/orngReinforcement.htm
[18] http://webdocs.cs.ualberta.ca/~sutton/papers/Sutton-PhD-thesis.pdf
[19] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.8871
[20] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.1503
[21] http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf
[22] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.857
[23] http://www.athenasc.com/ndpbook.html
[24] http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html
[25] http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html
[26] http://www-clmc.usc.edu/publications/p/peters-ICHR2003.pdf
[27] http://www.castlelab.princeton.edu/adp.htm
[28] http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html
[29] http://www.icml2010.org/papers/546.pdf
[30] http://web.mit.edu/dimitrib/www/dpchapter.pdf
[31] http://www.dcsc.tudelft.nl/rlbook/
[32] http://www.hs-weingarten.de/~tokicm/web/tokicm/publikationen/papers/AdaptiveEpsilonGreedyExploration.pdf
[33] http://www-anw.cs.umass.edu/rlr/
[34] http://rlai.cs.ualberta.ca/
[35] http://www-all.cs.umass.edu/
[36] http://glue.rl-community.org
[37] http://rlai.cs.ualberta.ca/RLR/index.html
[38] http://www.igi.tugraz.at/ril-toolbox
[39] http://www.cogsci.rpi.edu/~rsun/hybrid-rl.html
[40] http://sourceforge.net/projects/piqle/
[41] http://people.cs.uu.nl/hado/rl_algs/rl_algs.html
[42] http://www.lwebzem.com/cgi-bin/ttt/ttt.html
[43] http://www.scholarpedia.org/article/Reinforcement_Learning
[44] http://www.scholarpedia.org/article/Temporal_difference_learning
[45] http://www.rl-competition.org/
Fuzzy logic
Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate
rather than accurate. In contrast with "crisp logic", where binary sets have two-valued logic, fuzzy logic variables may
have a truth value that ranges in degree between 0 and 1 and is not constrained to the two truth values of classic
propositional logic.[1] Furthermore, when linguistic variables are used, these degrees may be managed by specific functions.
Fuzzy logic emerged as a consequence of the 1965 proposal of fuzzy set theory by Lotfi Zadeh.[2] [3] Though fuzzy
logic has been applied to many fields, from control theory to artificial intelligence, it still remains controversial
among most statisticians, who prefer Bayesian logic, and some control engineers, who prefer traditional two-valued
logic.
Degrees of truth
Fuzzy logic and probabilistic logic are mathematically similar – both have truth values ranging between 0 and 1 –
but conceptually distinct, due to different interpretations—see interpretations of probability theory. Fuzzy logic
corresponds to "degrees of truth", while probabilistic logic corresponds to "probability, likelihood"; as these differ,
fuzzy logic and probabilistic logic yield different models of the same real-world situations.
Both degrees of truth and probabilities range between 0 and 1 and hence may seem similar at first. For example, let a
100 ml glass contain 30 ml of water. Then we may consider two concepts: Empty and Full. The meaning of each of
them can be represented by a certain fuzzy set. Then one might define the glass as being 0.7 empty and 0.3 full. Note
that the concept of emptiness would be subjective and thus would depend on the observer or designer. Another
designer might equally well design a set membership function where the glass would be considered full for all values
down to 50 ml. It is essential to realize that fuzzy logic uses truth degrees as a mathematical model of the vagueness
phenomenon while probability is a mathematical model of ignorance. The same could be achieved using
probabilistic methods, by defining a binary variable "full" that depends on a continuous variable that describes how
full the glass is. There is no consensus on which method should be preferred in a specific situation.
In this image, the meaning of the expressions cold, warm, and hot is represented by functions mapping a temperature
scale. A point on that scale has three "truth values"—one for each of the three functions. The vertical line in the
image represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow points to
zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2) may describe it as "slightly
warm" and the blue arrow (pointing at 0.8) "fairly cold".
Linguistic variables
While variables in mathematics usually take numerical values, in fuzzy logic applications non-numeric linguistic
variables are often used to facilitate the expression of rules and facts.[4]
A linguistic variable such as age may have a value such as young or its antonym old. However, the great utility of
linguistic variables is that they can be modified via linguistic hedges applied to primary terms. The linguistic hedges
can be associated with certain functions. For example, L. A. Zadeh proposed to take the square of the membership
function. This model, however, does not work properly. For more details, see the references.
Example
Fuzzy set theory defines fuzzy operators on fuzzy sets. The problem in applying this is that the appropriate fuzzy
operator may not be known. For this reason, fuzzy logic usually uses IF-THEN rules, or constructs that are
equivalent, such as fuzzy associative matrices.
Rules are usually expressed in the form:
IF variable IS property THEN action
For example, a simple temperature regulator that uses a fan might look like this:
IF temperature IS very cold THEN stop fan
IF temperature IS cold THEN turn down fan
IF temperature IS normal THEN maintain level
IF temperature IS hot THEN speed up fan
There is no "ELSE" – all of the rules are evaluated, because the temperature might be "cold" and "normal" at the
same time to different degrees.
The AND, OR, and NOT operators of boolean logic exist in fuzzy logic, usually defined as the minimum, maximum,
and complement; when they are defined this way, they are called the Zadeh operators. So for the fuzzy variables x
and y:
NOT x = (1 - truth(x))
x AND y = minimum(truth(x), truth(y))
x OR y = maximum(truth(x), truth(y))
There are also other operators, more linguistic in nature, called hedges that can be applied. These are generally
adverbs such as "very", or "somewhat", which modify the meaning of a set using a mathematical formula.
Logical analysis
In mathematical logic, there are several formal systems of "fuzzy logic"; most of them belong among so-called
t-norm fuzzy logics.
• Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is Gödel t-norm. It has the axioms
of BL plus an axiom of idempotence of conjunction, and its models are called G-algebras.
• Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is product t-norm. It has the
axioms of BL plus another axiom for cancellativity of conjunction, and its models are called product algebras.
• Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a further
generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have traditional syntax and
many-valued semantics, in EVŁ the syntax is evaluated as well. This means that each formula has an evaluation.
Axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A generalization of the classical Gödel completeness
theorem is provable in EVŁ.
Fuzzy databases
Once fuzzy relations are defined, it is possible to develop fuzzy relational databases. The first fuzzy relational
database, FRDB, appeared in Maria Zemankova's dissertation. Later, some other models arose like the Buckles-Petry
model, the Prade-Testemale Model, the Umano-Fukami model or the GEFRED model by J.M. Medina, M.A. Vila et
al. In the context of fuzzy databases, some fuzzy querying languages have been defined, highlighting the SQLf by P.
Bosc et al. and the FSQL by J. Galindo et al. These languages define some structures in order to include fuzzy
aspects in the SQL statements, like fuzzy conditions, fuzzy comparators, fuzzy constants, fuzzy constraints, fuzzy
thresholds, linguistic labels and so on.
Comparison to probability
Fuzzy logic and probability are different ways of expressing uncertainty. While both fuzzy logic and probability
theory can be used to represent subjective belief, fuzzy set theory uses the concept of fuzzy set membership (i.e.,
how much a variable is in a set), whereas probability theory uses the concept of subjective probability (i.e., how
probable it is that a variable is in a set). While this distinction is mostly philosophical, the fuzzy-logic-derived possibility
measure is inherently different from the probability measure, hence they are not directly equivalent. However, many
statisticians are persuaded by the work of Bruno de Finetti that only one kind of mathematical uncertainty is needed
and thus fuzzy logic is unnecessary. On the other hand, Bart Kosko argues that probability is a subtheory of fuzzy
logic, as probability only handles one kind of uncertainty. He also claims to have proven a derivation of Bayes'
theorem from the concept of fuzzy subsethood. Lotfi Zadeh argues that fuzzy logic is different in character from
probability, and is not a replacement for it. He fuzzified probability to fuzzy probability and also generalized it to
what is called possibility theory. (cf.[5] )
See also
• Artificial intelligence
• Artificial neural network
• Defuzzification
• Dynamic logic
• Expert system
• False dilemma
• Fuzzy associative matrix
• Fuzzy classification
• Fuzzy concept
• Fuzzy Control Language
• Fuzzy Control System
• Fuzzy electronics
• Fuzzy mathematics
• Fuzzy set
• Fuzzy subalgebra
• FuzzyCLIPS expert system
• Machine learning
• Multi-valued logic
• Neuro-fuzzy
• Paradox of the heap
• Rough set
• Type-2 fuzzy sets and systems
• Vagueness
• Interval finite element
Notes
[1] Novák, V., Perfilieva, I. and Močkoř, J. (1999) Mathematical Principles of Fuzzy Logic. Dordrecht: Kluwer Academic. ISBN 0-7923-8595-0
[2] "Fuzzy Logic" (http://plato.stanford.edu/entries/logic-fuzzy/). Stanford Encyclopedia of Philosophy. Stanford University. 2006-07-23. Retrieved 2008-09-29.
[3] Zadeh, L.A. (1965). "Fuzzy sets", Information and Control 8 (3): 338–353.
[4] Zadeh, L. A. et al. 1996 Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, ISBN 9810224214
[5] Novák, V. "Are fuzzy sets a reasonable tool for modeling vague phenomena?", Fuzzy Sets and Systems 156 (2005) 341–348.
Bibliography
• Von Altrock, Constantin (1995). Fuzzy logic and NeuroFuzzy applications explained. Upper Saddle River, NJ:
Prentice Hall PTR. ISBN 0-13-368465-2.
• Biacino, L.; Gerla, G. (2002). "Fuzzy logic, continuity and effectiveness". Archive for Mathematical Logic 41 (7):
643–667. doi:10.1007/s001530100128. ISSN 0933-5846.
• Cox, Earl (1994). The fuzzy systems handbook: a practitioner's guide to building, using, maintaining fuzzy
systems. Boston: AP Professional. ISBN 0-12-194270-8.
• Gerla, Giangiacomo (2006). "Effectiveness and Multivalued Logics". Journal of Symbolic Logic 71 (1): 137–162.
doi:10.2178/jsl/1140641166. ISSN 0022-4812.
• Hájek, Petr (1998). Metamathematics of fuzzy logic. Dordrecht: Kluwer. ISBN 0792352386.
• Hájek, Petr (1995). "Fuzzy logic and arithmetical hierarchy". Fuzzy Sets and Systems 3 (8): 359–363.
doi:10.1016/0165-0114(94)00299-M. ISSN 0165-0114.
• Halpern, Joseph Y. (2003). Reasoning about uncertainty. Cambridge, Mass: MIT Press. ISBN 0-262-08320-5.
• Höppner, Frank; Klawonn, F.; Kruse, R.; Runkler, T. (1999). Fuzzy cluster analysis: methods for classification,
data analysis and image recognition. New York: John Wiley. ISBN 0-471-98864-2.
• Ibrahim, Ahmad M. (1997). Introduction to Applied Fuzzy Electronics. Englewood Cliffs, N.J: Prentice Hall.
ISBN 0-13-206400-6.
• Klir, George J.; Folger, Tina A. (1988). Fuzzy sets, uncertainty, and information. Englewood Cliffs, N.J: Prentice
Hall. ISBN 0-13-345984-5.
• Klir, George J.; St Clair, Ute H.; Yuan, Bo (1997). Fuzzy set theory: foundations and applications. Englewood
Cliffs, NJ: Prentice Hall. ISBN 0133410587.
• Klir, George J.; Yuan, Bo (1995). Fuzzy sets and fuzzy logic: theory and applications. Upper Saddle River, NJ:
Prentice Hall PTR. ISBN 0-13-101171-5.
• Kosko, Bart (1993). Fuzzy thinking: the new science of fuzzy logic. New York: Hyperion. ISBN 0-7868-8021-X.
• Kosko, Bart; Isaka, Satoru (July 1993). "Fuzzy Logic". Scientific American 269 (1): 76–81.
doi:10.1038/scientificamerican0793-76.
• Montagna, F. (2001). "Three complexity problems in quantified fuzzy logic". Studia Logica 68 (1): 143–152.
doi:10.1023/A:1011958407631. ISSN 0039-3215.
• Mundici, Daniele; Cignoli, Roberto; D'Ottaviano, Itala M. L. (1999). Algebraic foundations of many-valued
reasoning. Dordrecht: Kluwer Academic. ISBN 0-7923-6009-5.
• Novák, Vilém (1989). Fuzzy Sets and Their Applications. Bristol: Adam Hilger. ISBN 0-85274-583-4.
• Novák, Vilém (2005). "On fuzzy type theory". Fuzzy Sets and Systems 149: 235–273.
doi:10.1016/j.fss.2004.03.027.
• Novák, Vilém; Perfilieva, Irina; Močkoř, Jiří (1999). Mathematical principles of fuzzy logic. Dordrecht: Kluwer
Academic. ISBN 0-7923-8595-0.
• Passino, Kevin M.; Yurkovich, Stephen (1998). Fuzzy control. Boston: Addison-Wesley. ISBN 020118074X.
• Pedrycz, Witold; Gomide, Fernando (2007). Fuzzy systems engineering: Toward Human-Centric Computing.
Hoboken: Wiley-Interscience. ISBN 978-0-471-78857-7.
• Pu, Pao Ming; Liu, Ying Ming (1980). "Fuzzy topology. I. Neighborhood structure of a fuzzy point and
Moore-Smith convergence". Journal of Mathematical Analysis and Applications 76 (2): 571–599.
doi:10.1016/0022-247X(80)90048-7. ISSN 0022-247X
• Santos, Eugene S. (1970). "Fuzzy Algorithms". Information and Control 17 (4): 326–339.
doi:10.1016/S0019-9958(70)80032-8.
• Scarpellini, Bruno (1962). "Die Nichtaxiomatisierbarkeit des unendlichwertigen Prädikatenkalküls von
Łukasiewicz" (http://jstor.org/stable/2964111). Journal of Symbolic Logic (Association for Symbolic Logic)
27 (2): 159–170. doi:10.2307/2964111. ISSN 0022-4812.
• Steeb, Willi-Hans (2008). The Nonlinear Workbook: Chaos, Fractals, Cellular Automata, Neural Networks,
Genetic Algorithms, Gene Expression Programming, Support Vector Machine, Wavelets, Hidden Markov Models,
Fuzzy Logic with C++, Java and SymbolicC++ Programs (4th edition). World Scientific. ISBN 981-281-852-9.
• Wiedermann, J. (2004). "Characterizing the super-Turing computing power and efficiency of classical fuzzy
Turing machines". Theor. Comput. Sci. 317: 61–69. doi:10.1016/j.tcs.2003.12.004.
• Yager, Ronald R.; Filev, Dimitar P. (1994). Essentials of fuzzy modeling and control. New York: Wiley.
ISBN 0-471-01761-2.
• Van Pelt, Miles (2008). Fuzzy Logic Applied to Daily Life. Seattle, WA: No No No No Press.
ISBN 0-252-16341-9.
• Wilkinson, R.H. (1963). "A method of generating functions of several variables using analog diode logic". IEEE
Transactions on Electronic Computers 12: 112–129. doi:10.1109/PGEC.1963.263419.
• Zadeh, L.A. (1968). "Fuzzy algorithms". Information and Control 12 (2): 94–102.
doi:10.1016/S0019-9958(68)90211-8. ISSN 0019-9958.
• Zadeh, L.A. (1965). "Fuzzy sets". Information and Control 8 (3): 338–353.
doi:10.1016/S0019-9958(65)90241-X. ISSN 0019-9958.
• Zemankova-Leech, M. (1983). Fuzzy Relational Data Bases. Ph. D. Dissertation. Florida State University.
• Zimmermann, H. (2001). Fuzzy set theory and its applications. Boston: Kluwer Academic Publishers.
ISBN 0-7923-7435-5.
External links
Additional articles
• Formal fuzzy logic (http://en.citizendium.org/wiki/Formal_fuzzy_logic) - article at Citizendium
• Fuzzy Logic (http://www.scholarpedia.org/article/Fuzzy_Logic) - article at Scholarpedia
• Modeling With Words (http://www.scholarpedia.org/article/Modeling_with_words) - article at Scholarpedia
• Fuzzy logic (http://plato.stanford.edu/entries/logic-fuzzy/) - article at Stanford Encyclopedia of Philosophy
• Fuzzy Math (http://blog.peltarion.com/2006/10/25/fuzzy-math-part-1-the-theory) - Beginner level
introduction to Fuzzy Logic.
• Fuzzy Logic and the Internet of Things: I-o-T (http://www.i-o-t.org/post/WEB_3)
Links pages
• Web page about FSQL (http://www.lcc.uma.es/~ppgg/FSQL/): References and links about FSQL
Tutorials
• Fuzzy Logic Tutorial (http://www.jimbrule.com/fuzzytutorial.html)
• Another Fuzzy Logic Tutorial (http://www.calvin.edu/~pribeiro/othrlnks/Fuzzy/home.htm) with
MATLAB/Simulink Tutorial
• Fuzzy logic in your game (http://www.byond.com/members/DreamMakers?command=view_post&
post=37966) - tutorial aimed towards game programming.
• Simple test to check how well you understand it (http://www.answermath.com/fuzzymath.htm)
Applications
• Research article that describes how industrial foresight could be integrated into capital budgeting with intelligent
agents and Fuzzy Logic (http://econpapers.repec.org/paper/amrwpaper/398.htm)
• A doctoral dissertation describing how Fuzzy Logic can be applied in profitability analysis of very large industrial
investments (http://econpapers.repec.org/paper/pramprapa/4328.htm)
• A method for asset valuation that uses fuzzy logic and fuzzy numbers for real option valuation (http://users.abo.
fi/mcollan/fuzzypayoff.html)
Research Centres
• Institute for Research and Applications of Fuzzy Modeling (http://irafm.osu.cz/)
• European Centre for Soft Computing (http://www.softcomputing.es/)
• Fuzzy Logic Lab Linz-Hagenberg (http://www.flll.jku.at/)
Fuzzy set
Fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced by Lotfi A. Zadeh
(1965) as an extension of the classical notion of set.[1] In classical set theory, the membership of elements in a set is
assessed in binary terms according to a bivalent condition — an element either belongs or does not belong to the set.
By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in a set; this is described
with the aid of a membership function valued in the real unit interval [0, 1]. Fuzzy sets generalize classical sets, since
the indicator functions of classical sets are special cases of the membership functions of fuzzy sets, if the latter only
take values 0 or 1.[2] Classical bivalent sets are usually called crisp sets in fuzzy set theory. Fuzzy set theory can
be used in a wide range of domains in which information is incomplete or imprecise, such as bioinformatics.[3]
Definition
A fuzzy set is a pair (A, m) where A is a set and m : A → [0, 1] is a membership function.
For each x ∈ A, the value m(x) is called the grade of membership of x in (A, m). For a finite set
A = {x_1, ..., x_n}, the fuzzy set (A, m) is often denoted by {m(x_1)/x_1, ..., m(x_n)/x_n}.
Let x ∈ A. Then x is called not included in the fuzzy set (A, m) if m(x) = 0, x is called fully included if
m(x) = 1, and x is called a fuzzy member if 0 < m(x) < 1.[4] The set {x ∈ A | m(x) > 0} is called the
support of (A, m) and the set {x ∈ A | m(x) = 1} is called its kernel.
Sometimes, more general variants of the notion of fuzzy set are used, with membership functions taking values in a
(fixed or variable) algebra or structure L of a given kind; usually it is required that L be at least a poset or lattice.
The usual membership functions with values in [0, 1] are then called [0, 1]-valued membership functions. This kind
of generalization was first considered in 1967 by Joseph Goguen, who was a student of Zadeh.[5]
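A small illustration of the finite-set notation above, with a made-up membership assignment:

    # A fuzzy set over A = {"cold", "mild", "hot"}, written as a dict x -> m(x).
    m = {"cold": 0.8, "mild": 1.0, "hot": 0.0}

    support = {x for x, grade in m.items() if grade > 0}      # {x in A : m(x) > 0}
    kernel  = {x for x, grade in m.items() if grade == 1.0}   # {x in A : m(x) = 1}
    print(support)   # {'cold', 'mild'}
    print(kernel)    # {'mild'}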
Fuzzy logic
As an extension of the case of multi-valued logic, valuations (μ : V → W) of propositional variables (V)
into a set of membership degrees (W) can be thought of as membership functions mapping predicates into fuzzy
sets (or more formally, into an ordered set of fuzzy pairs, called a fuzzy relation). With these valuations,
many-valued logic can be extended to allow for fuzzy premises from which graded conclusions may be drawn.[6]
This extension is sometimes called "fuzzy logic in the narrow sense" as opposed to "fuzzy logic in the wider sense,"
which originated in the engineering fields of automated control and knowledge engineering, and which encompasses
many topics involving fuzzy sets and "approximated reasoning."[7]
Industrial applications of fuzzy sets in the context of "fuzzy logic in the wider sense" can be found at fuzzy logic.
Fuzzy number
A fuzzy number is a convex, normalized fuzzy set whose membership function is at least segmentally
continuous and has the functional value m(x) = 1 at precisely one element.
This can be likened to the funfair game "guess your weight," where someone guesses the contestant's weight, with
closer guesses being more correct, and where the guesser "wins" if he or she guesses near enough to the contestant's
weight, with the actual weight being completely correct (mapping to 1 by the membership function).
Fuzzy interval
A fuzzy interval is an uncertain set with a mean interval whose elements possess the membership function
value m(x) = 1. As with fuzzy numbers, the membership function must be convex, normalized, and at least
segmentally continuous.[8]
See also
• Alternative set theory
• Defuzzification
• Fuzzy mathematics
• Fuzzy measure theory
• Fuzzy set operations
• Fuzzy subalgebra
• Linear partial information
• Neuro-fuzzy
• Rough fuzzy hybridization
• Rough set
• Type-2 Fuzzy Sets and Systems
• Uncertainty
• Interval finite element
• Multiset
External links
• Uncertainty model Fuzziness [9]
• Fuzzy Systems Journal (http://www.elsevier.com/wps/find/journaldescription.cws_home/505545/description#description)
• ScholarPedia [10]
• The Algorithm of Fuzzy Analysis [11]
• Fuzzy Image Processing [12]
• Zadeh's 1965 paper on Fuzzy Sets [13]
References
[1] L. A. Zadeh (1965) "Fuzzy sets" (http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf). Information and Control 8 (3) 338–353.
[2] D. Dubois and H. Prade (1988) Fuzzy Sets and Systems. Academic Press, New York.
[3] Lily R. Liang, Shiyong Lu, Xuena Wang, Yi Lu, Vinay Mandal, Dorrelyn Patacsil, and Deepak Kumar, "FM-test: A Fuzzy-Set-Theory-Based Approach to Differential Gene Expression Data Analysis", BMC Bioinformatics, 7 (Suppl 4): S7. 2006.
[4] AAAI http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/FuzzyLogic
[5] Goguen, Joseph A., 1967, "L-fuzzy sets". Journal of Mathematical Analysis and Applications 18: 145–174
[6] Siegfried Gottwald, 2001. A Treatise on Many-Valued Logics. Baldock, Hertfordshire, England: Research Studies Press Ltd., ISBN 978-0863802621
[7] "The concept of a linguistic variable and its application to approximate reasoning," Information Sciences 8: 199–249, 301–357; 9: 43–80.
[8] "Fuzzy sets as a basis for a theory of possibility," Fuzzy Sets and Systems 1: 3–28
[9] http://www.uncertainty-in-engineering.net/uncertainty_models/fuzziness
[10] http://www.scholarpedia.org/article/Fuzzy_sets
[11] http://www.uncertainty-in-engineering.net/uncertainty_methods/fuzzy_analysis/
[12] http://pami.uwaterloo.ca/tizhoosh/set.htm
[13] http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf
Fuzzy number
A fuzzy number is an extension of a regular number in the sense that it does not refer to one single value but rather
to a connected set of possible values, where each possible value has its own weight between 0 and 1. The function
assigning these weights is called the membership function. A fuzzy number is thus a special case of a convex fuzzy
set.[1] Just as fuzzy logic is an extension of Boolean logic (which uses 'yes' and 'no' only, and nothing in between),
fuzzy numbers are an extension of real numbers. Calculations with fuzzy numbers allow the incorporation of
uncertainty on parameters, properties, geometry, initial conditions, etc.
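A common concrete choice is a triangular fuzzy number; the sketch below and its particular numbers are invented for illustration:

    def triangular(a, b, c):
        """Membership function of a triangular fuzzy number: 0 outside (a, c), exactly 1 only at b."""
        def membership(x):
            if x <= a or x >= c:
                return 0.0
            if x <= b:
                return (x - a) / (b - a)
            return (c - x) / (c - b)
        return membership

    about_five = triangular(4.0, 5.0, 6.0)                     # the fuzzy number "about 5"
    print(about_five(5.0), about_five(4.5), about_five(7.0))   # 1.0 0.5 0.0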
See also
• Fuzzy set
• Uncertainty
References
[1] Michael Hanss, 2005. Applied Fuzzy Arithmetic, An Introduction with Engineering Applications. Springer, ISBN 3-540-24201-5
External links
• Fuzzy Logic Tutorial (http://www.seattlerobotics.org/Encoder/mar98/fuz/flindex.html)
Article Sources and Contributors 47
Supervised learning Source: http://en.wikipedia.org/w/index.php?oldid=386774147 Contributors: 144.132.75.xxx, APH, Ahoerstemeier, Alfio, Ancheta Wis, AndrewHZ, Beetstra,
BenBildstein, BertSeghers, Boleslav Bobcik, Buster7, CapitalR, Chadloder, Cherkash, Classifier1234, Conversion script, Cyp, Da monster under your bed, Damian Yerrick, Darius Bacon, David
Eppstein, Denoir, Dfass, Doloco, Domanix, Duncharris, Erylaos, EverGreg, Fly by Night, Fstonedahl, Gene s, Giftlite, Hike395, Isomorph, Jamelan, Jlc46, Joerg Kurt Wegner, KnowledgeOfSelf,
Kotsiantis, LC, Lloydd, Mailseth, MarkSweep, Markus Krötzsch, Michael Hardy, MikeGasser, Mostafa mahdieh, MrOllie, Mscnln, Mxn, Naveen Sundar, Oliver Pereira, Paskari, Pintaio, Prolog,
Reedy, Ritchy, Rotem Dan, Sad1225, Sam Hocevar, Skbkekas, Skeppy, Snoyes, Sun116, Tdietterich, Tribaal, Twri, Unknown, X7q, Zadroznyelkan, Zeno Gantner, 59 anonymous edits
Semi-supervised learning Source: http://en.wikipedia.org/w/index.php?oldid=386413341 Contributors: Benwing, Bkkbrad, Bookuser, DaveWF, Delirium, Facopad, Furrykef, Grisendo,
Jcarroll, Lamro, MrOllie, Phoxhat, Pintaio, Rahimiali, Rajah, Ruud Koot, Soultaco, Stheodor, Tbmurphy, 19 anonymous edits
Active learning (machine learning) Source: http://en.wikipedia.org/w/index.php?oldid=384834323 Contributors: Bearcat, Tdietterich, X7q
Learning to rank Source: http://en.wikipedia.org/w/index.php?oldid=392188687 Contributors: Aminimassih, Epheswiki, ML Trick, Mild Bill Hiccup, Ppntizi, Rrenaud, X7q, 17 anonymous
edits
Unsupervised learning Source: http://en.wikipedia.org/w/index.php?oldid=388768175 Contributors: 3mta3, Aaronbrick, Agentesegreto, Ahoerstemeier, Alex Kosorukoff, Alfio, Algorithms,
AnAj, Auntof6, BertSeghers, Bobo192, Chire, CommodiCast, Daryakav, David Eppstein, Denoir, EverGreg, Fly by Night, Gene s, Hike395, Ida Shaw, Kku, Kotsiantis, Lambiam, Les boys,
Maheshbest, Michael Hardy, Mietchen, Ng.j, Nkour, Ojigiri, Ranjan.acharyya, Salvamoreno, Stheodor, Tablizer, Timohonkela, Trebor, 32 anonymous edits
Reinforcement learning Source: http://en.wikipedia.org/w/index.php?oldid=388103889 Contributors: Albertzeyer, Altenmann, Ash.dyer, Beetstra, Ceran, Charles Matthews, Correction45,
Delirium, Digfarenough, DopefishJustin, Dpbert, DrewNoakes, Fabrice.Rossi, Flohack, Gene s, Giftlite, Gosavia, Hike395, Imran, J04n, Jcarroll, Jcautilli, Jiuguang Wang, Julian, Kartoun, Kku,
Kpmiyapuram, MBK004, Maderlock, Masatran, Mdchang, Mianarshad, Michael Hardy, Mitar, Mr ashyash, MrOllie, MrinalKalakrishnan, Mrwojo, Nedrutland, Nvrmnd, Oleg Alexandrov,
Olethros, Qsung, Rev.bayes, Rinconsoleao, Rlguy, Sebastjanmm, Seliopou, Shyking, Skittleys, Stuhlmueller, Szepi, Tobacman, Tremilux, Vermorel, Wfu, Wikinacious, Wmorgan, XApple,
Yworo, 111 anonymous edits
Fuzzy logic Source: http://en.wikipedia.org/w/index.php?oldid=392000149 Contributors: AK Auto, Abtin, Academic Challenger, Ace Frahm, Acer, Adrian, Ahoerstemeier, Aiyasamy, Ajensen,
Alarichus, Alca Isilon, Allmightyduck, Amitauti, Andres, AndrewHowse, Anonymous Dissident, Ap, Aperezbios, Arabani, ArchonMagnus, Arjun01, Aronbeekman, Arthur Rubin, Atabəy,
AugPi, Avoided, Aylabug, Babytoys, Bairam, BenRayfield, BertSeghers, Betterusername, Bjankuloski06en, Blackmetalbaz, Blainster, BlaiseFEgan, Boffob, Borgx, Brat32, Brentdax, C2math,
CLW, CRGreathouse, Catchpole, Cedars, Cesarpermanente, Chalst, Charvest, Christian List, Chronulator, Ck lostsword, Clemwang, Closedmouth, Cnoguera, Crunchy Numbers, Cryptographic
hash, Cybercobra, Damian Yerrick, Denoir, Dethomas, Dhollm, Diegomonselice, Dragonfiend, Drwu82, Duniyadnd, EdH, Elockid, Elwikipedista, Em3ryguy, Eric119, Eulenreich,
EverettColdwell, Expensivehat, EyeSerene, False vacuum, Felipe Gonçalves Assis, Flewis, Fratrep, Fullofstars, Furrykef, Fyyer, Gauss, Gbellocchi, George100, Gerla, Gerla314, GideonFubar,
Giftlite, Gregbard, Gryllida, Guard, Gurch, Gurchzilla, Gyrofrog, Gökhan, H11, Hargle, Harry Wood, Heron, History2007, Hkhandan, Honglyshin, Hypertall, ISEGeek, Icairns, Ignacioerrico,
Igoldste, Ihope127, Intgr, Ioverka, Iridescent, Ixfd64, J.delanoy, J04n, Jack and Mannequin, Jadorno, Jaganath, Jbbyiringiro, Jchernia, Jcrada, JesseHogan, JimBrule, Joriki, Junes,
JustAnotherJoe, K.Nevelsteen, KSmrq, Kadambarid, Kariteh, Katzmik, Kilmer-san, Kingmu, Klausness, Koavf, Kuru, Kzollman, L353a1, LBehounek, Lambiam, LanguidMandala, Lars
Washington, Lawrennd, Lbhales, Lese, Letranova, Leujohn, Lord Hawk, Loren Rosen, Lynxoid84, MC MasterChef, MER-C, Maddie!, Malcolmxl5, Mani1, Manop, Marcus Beyer,
Mastercampbell, Mathaddins, Maurice Carbonaro, Mdd, Mdotley, Megatronium, Melcombe, Mhss, Michael Hardy, Mkcmkc, Mladifilozof, Mneser, Moilleadóir, Mr. Billion, Nbarth, Ndavies2,
Nortexoid, Ohka-, Oicumayberight, Oleg Alexandrov, Olethros, Oli Filth, Omegatron, Omicronpersei8, Oroso, Palfrey, Panadero45, Paper Luigi, Passino, Paul August, PeterFisk, Peterdjones,
Peyna, Pickles8, Pit, Pkrecker, Pownuk, Predictor, Ptroen, Quackor, R. S. Shaw, RTC, Rabend, RedHouse18, Requestion, Rgheck, Rjstott, Rjwilmsi, Rohan Ghatak, Ruakh, Rursus, S. Neuman,
SAE1962, Sahelefarda, Saibo, Samohyl Jan, Saros136, Scimitar, Scriber, Sebastjanmm, Sebesta, Serpentdove, Shervink, Slashme, Smmurphy, Snespeca, SparkOfCreation, Srikeit, Srinivasasha,
StephenReed, Stevertigo, SuzanneKn, Swagato Barman Roy, T2gurut2, T3hZ10n, T3kcit, TankMiche, Tarquin, Teutonic Tamer, ThornsCru, Thumperward, Tide rolls, Traroth, TreyHarris,
Trovatore, Trusilver, Turnstep, Typochimp, Ultimatewisdom, Ululuca, Vansskater692, Velho, Vendettax, Virtk0s, Vizier, Voyagerfan5761, Wavelength, Williamborg, Wireless friend,
Woohookitty, Xaosflux, Xezbeth, Yamamoto Ichiro, Zfr, Zoicon5, 435 anonymous edits
Fuzzy set Source: http://en.wikipedia.org/w/index.php?oldid=390110806 Contributors: Abdel Hameed Nawar, Bjankuloski06en, Boleslav Bobcik, Bouktin, CRGreathouse, Cesarpermanente,
Charles Matthews, Charvest, Dreadstar, Duncharris, El C, Elwikipedista, Evercat, Furrykef, Gaius Cornelius, George100, Gerla, Giftlite, Gregbard, Grendelkhan, Helgus, History2007, Hydrogen
Iodide, InformationSpace, Ixfd64, JRSpriggs, Jaredwf, Jcobb, Joriki, Kilmer-san, Krzysiulek, Kusma, LBehounek, Lukipuk, MartinHarper, Matsievsky, Maurice Carbonaro, Michael Hardy,
Michael Slone, Ml720834, NotQuiteEXPComplete, Palfrey, Peak, Pgallert, Phe, Pownuk, Predictor, QYV, R. S. Shaw, Rijkbenik, Ryan Reich, Salix alba, Smmurphy, Srinivasasha, Supten,
T2gurut2, Taw, The tree stump, Toby Bartels, Ty580, Urhixidur, VashiDonsk, VeryVerily, Wavelength, Wireless friend, Zundark, Александър, 88 anonymous edits
Fuzzy number Source: http://en.wikipedia.org/w/index.php?oldid=366920057 Contributors: Bjankuloski06en, Curtdbz, Dude1818, Excirial, KoenDelaere, Nikkimaria, 1 anonymous edits
Image Sources, Licenses and Contributors 48
License
Creative Commons Attribution-Share Alike 3.0 Unported
http://creativecommons.org/licenses/by-sa/3.0/