MLT UNIT-3 Notes

UNIT – 3

 Decision Tree:

o Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed based on the features of the given dataset.
o It is a graphical representation for getting all the
possible solutions to a problem/decision based on
given conditions.
o It is called a decision tree because, like a tree, it starts with
the root node, which expands on further branches and
constructs a tree-like structure.
o To build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question and based on the
answer (Yes/No), it further splits the tree into subtrees.
Decision Tree Terminology:

 Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.

 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.

 Branch/Sub Tree: A subtree formed by splitting a node of the tree.

 Pruning: Pruning is the process of removing unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

Example:

Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an attribute selection measure, ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
 Attribute selection measures used in decision trees are:
1. Entropy: Entropy is a metric that measures the impurity in an attribute. It specifies the randomness in the data. For a two-class problem, the value of entropy ranges from 0 to 1.
Entropy can be calculated as:
Entropy(S) = -P(Yes) log₂ P(Yes) - P(No) log₂ P(No)
where,
S -> the set of samples
P(Yes) -> probability of Yes
P(No) -> probability of No
A log function of base 2 is used because entropy is measured in bits.

2. Information Gain:
Information gain is the difference between before and after a split
on a given attribute. It measures how much information a feature
provides about a target.

Constructing a decision tree is about repeatedly finding the feature that returns the highest information gain. At each step, the feature with the highest information gain produces the best split, classifying the training dataset better according to the target variable.

Information gain has the following formula:

Gain(S, a) = Entropy(S) - Σ (|Sv| / |S|) × Entropy(Sv), summed over every value v of attribute a

Where:

 a is the specific attribute or class label.
 Entropy(S) is the entropy of dataset S.
 |Sv| / |S| is the proportion of examples in S for which attribute a takes the value v (the size of the subset Sv relative to the size of S).

3. Gain Ratio:
Gain Ratio or Uncertainty Coefficient is used to normalize the
information gain of an attribute against how much entropy that
attribute has. The information gain measure is biased towards tests
with many outcomes.
The formula for gain ratio is given by:

Gain Ratio = Information Gain / Split Information

where Split Information is the entropy of the attribute's own distribution of values (it grows when an attribute splits the data into many small branches).
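To make these three measures concrete, here is a minimal Python sketch (not part of the original notes). It assumes the dataset is represented as a list of dictionaries with a discrete target column named "target"; the function and column names are illustrative only.

```python
# Hedged sketch of the three attribute selection measures described above.
import math
from collections import Counter

def entropy(rows, target="target"):
    """Entropy(S) = -sum(p * log2(p)) over the target values present in S."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target="target"):
    """Gain(S, a) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv))."""
    total = len(rows)
    gain = entropy(rows, target)
    for value in set(row[attribute] for row in rows):
        subset = [row for row in rows if row[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset, target)
    return gain

def split_information(rows, attribute):
    """SplitInfo(S, a) = -sum(|Sv|/|S| * log2(|Sv|/|S|))."""
    total = len(rows)
    counts = Counter(row[attribute] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, attribute, target="target"):
    """Gain Ratio = Information Gain / Split Information."""
    si = split_information(rows, attribute)
    return information_gain(rows, attribute, target) / si if si > 0 else 0.0
```

On the 14-example weather dataset used later in this unit, information_gain(rows, "Outlook") would come out to roughly 0.24.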


 ID3 Algorithm:

ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. ID3 was invented by Ross Quinlan to generate a decision tree from a dataset, and it is one of the most popular algorithms used to construct trees.

ID3 is the core algorithm for building a decision tree. It employs a top-down greedy search through the space of all possible branches with no backtracking. This algorithm uses information gain and entropy to construct a classification decision tree.

Steps:

1. Calculate the entropy of the whole dataset.

2. For each attribute:

Calculate the entropy of each of its categorical values.

Calculate the information gain for the attribute.

3. Find the attribute with the maximum information gain and split on it.

4. Repeat on each branch until the tree is complete (every branch ends in a leaf node, or no attributes remain).
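A rough sketch of this recursive loop, assuming categorical features stored as a list of dictionaries and reusing the information_gain helper from the earlier sketch (names are illustrative):

```python
# Hedged sketch of the ID3 steps above (not the original author's code).
from collections import Counter

def id3(rows, attributes, target="target"):
    labels = [row[target] for row in rows]
    # Stop if the node is pure or no attributes remain (return a leaf label).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the attribute with the maximum information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # Step 4: split on each value of the chosen attribute and recurse.
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```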

Example Of ID3 Algorithm:


Suppose we had the following dataset of 14 weather observations (with the attributes Outlook, Temp, Humidity, and Wind, and the target Play Volleyball):

From the above example dataset, we are required to construct a decision tree to help us decide whether we should play volleyball based on the weather conditions.

To construct a decision tree, we need to pick the features that will best guide us to make a viable decision on whether we should play or not play volleyball.

We can't randomly select a feature from the dataset to build the tree, so entropy and information gain are good criteria for this problem.

To begin with, we have four features we need to consider:


 Outlook
 Temp
 Humidity
 Wind

Finding the root node feature

Since we cannot just pick one of the features to start our decision
tree, we need to make calculations to get the feature with the
highest information gain from which we start splitting.

Calculate the entropy of the entire dataset (Entropy(S))

We can see that we have 5 Noes (negatives) and 9 Yeses (positives). The total number of entries is 14.
The entropy of the whole dataset is:

Entropy(S) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) ≈ 0.94
Calculate the information gain for the Outlook feature

Outlook has 3 values:

 Sunny
 Overcast
 Rain

So, we will calculate the entropy of each of the corresponding subsets (Sv) as follows:

We have 5 Sunny examples for Outlook:

 3 negative Sunny Outlooks (when Play Volleyball is No).
 2 positive Sunny Outlooks (when Play Volleyball is Yes).

Let's calculate:

Entropy(S_Sunny) = -(2/5) log₂(2/5) - (3/5) log₂(3/5) ≈ 0.97

We have 4 Overcast examples for Outlook:

 0 negative Overcast Outlooks.
 4 positive Overcast Outlooks.

Let's calculate:

Entropy(S_Overcast) = -(4/4) log₂(4/4) - 0 = 0

We have 5 Rain examples for Outlook:

 2 negative Rain examples.
 3 positive Rain examples.

Let's calculate:

Entropy(S_Rain) = -(3/5) log₂(3/5) - (2/5) log₂(2/5) ≈ 0.97

The information gain for Outlook

We have the following entropies:

 Entropy(S) = 0.94
 Entropy(S_Sunny) = 0.97
 Entropy(S_Overcast) = 0
 Entropy(S_Rain) = 0.97

We use the formula for information gain to calculate the gain. So:

Gain(S, Outlook) = 0.94 - (5/14)(0.97) - (4/14)(0) - (5/14)(0.97) ≈ 0.24

The information gain for Outlook is 0.24.

Similarly, we must calculate the information gain for the other features.
Calculate the information gain for the Temp feature.

Temp has 3 values:

 Hot
 Mild
 Cool

Since we already have the entropy for the entire dataset (Entropy(S)), we will calculate the entropy of each value (Entropy(Sv)) of Temp, just as we did with Outlook.

The entropy of Hot:

The entropy of Mild:

The entropy of Cool:

Calculate the information gain for the Temp feature:

Information gain for Temp is 0.03.

Similarly, calculate the information gain for Humidity and Wind. All
information gain values will be:

 Gain(S, Outlook) = 0.24


 Gain(S, Temp) = 0.03
 Gain(S, Humidity) = 0.15
 Gain(S, Wind) = 0.04
💡 Outlook gives the highest information gain about our target variable. It will act as the root node of our tree, from where the splitting will begin.

Note that for the Sunny and Rain branches we cannot immediately conclude a Yes or a No, since each contains events where Play Volleyball is Yes and events where it is No. Their entropy is therefore greater than zero (they are impure), so we need to split them further.

💡 Overcast is a branch with zero entropy, since all of its events have Play Volleyball = Yes, so it automatically becomes a leaf node.

Finding the internal nodes

We will calculate information gain for the rest of the features when
the Outlook is Sunny and when the Outlook is Rain:
Splitting on the Sunny attribute

Calculate the information gain for Temp.

Values (Temp) = Hot, Mild, Cool.

The entropy for Hot:

The entropy for Mild:

The entropy for Cool:

The Information gain for Temp:

Calculate the information gain for Humidity


Values (Humidity) = High, Normal.

The entropy for Sunny:

Entropy(S_Sunny) = 0.97

The entropy for High:

Entropy(S_High) = 0

The entropy for Normal:

Entropy(S_Normal) = 0

The Information gain for Humidity:

Gain(S_Sunny, Humidity) = 0.97 - (|S_High| / |S_Sunny|)(0) - (|S_Normal| / |S_Sunny|)(0) = 0.97

Calculate the information gain for Wind.

Values (Wind) = Strong, Weak.

The entropy for Sunny:

Entropy(S_Sunny) = 0.97

The entropy for Strong:

Entropy(S_Strong) = 1.0

The entropy for Weak:


The Information gain for Wind:


Humidity gives the highest information gain value (0.97) on the Sunny branch. So far, our tree will look like this:
Splitting on the Rain attribute

Calculate the information gain for Temp.

Values (Temp) = Mild, Cool

The entropy for Mild:

The entropy for Cool:

Entropy(S_Cool) = 1.0

The information gain for Temp:

Calculate the information gain for Humidity.

Values (Humidity) = High, Normal.

The entropy for High:

Entropy(S_High) = 1.0


The Entropy for Normal:

The Information gain for Humidity:

Calculate the information gain for Wind.

Values (Wind) = Strong, Weak.

The entropy for Strong:

Entropy(S_Strong) = 0

The entropy for Weak:

Entropy(S_Weak) = 0

The information gain for Wind:

Gain(S_Rain, Wind) = Entropy(S_Rain) - (|S_Strong| / |S_Rain|)(0) - (|S_Weak| / |S_Rain|)(0) = 0.97

Wind gives the highest information gain value (0.97). Now we can
complete our Decision Tree.
A complete decision tree with Entropy and Information
gain criteria:

Issues in Decision Tree Learning:


Decision tree learning is a popular machine learning algorithm used for both
classification and regression tasks. Like any algorithm, decision trees come with
their own set of challenges and issues. Here are some common issues associated
with decision tree learning:
1. Overfitting:
Decision trees are prone to overfitting, especially when they are deep and capture
noise in the training data. Overfitting occurs when a model learns the training data
too well, including the noise and outliers, and performs poorly on new, unseen
data.
2. High Variance:
Decision trees can have high variance, meaning that small changes in the training
data can result in significantly different tree structures. This can lead to instability
in the model.
3. Sensitivity to Small Variations in Data:
Small changes in the input data can lead to different tree structures. This sensitivity
can make decision trees less robust, especially when dealing with noisy or
imprecise data.
4. Bias towards Dominant Classes:
In classification tasks with imbalanced class distributions, decision trees may have
a bias towards the dominant class. They might perform well on the majority class
but struggle to accurately predict instances from the minority class.
5. Limited Expressiveness:
Decision trees may not be expressive enough to capture complex relationships in
the data. They are considered "weak learners" compared to some other algorithms.
6. Difficulty Handling Missing Data:
Decision trees can struggle with datasets that have missing values. The way
missing data is handled (or not handled) can affect the performance of the model.
7. Lack of Interpretability for Deep Trees:
While decision trees are generally interpretable, deep trees can become complex
and difficult to interpret. This can be a challenge when trying to explain the model
to non-experts.
8. Computational Complexity:
Building a decision tree involves recursively splitting the dataset, which can
become computationally expensive, especially for large datasets or deep trees. This
complexity can affect both training and prediction times.

Mitigation Strategies:
1. Pruning:
Pruning involves removing branches from the tree that do not provide significant
predictive power. This helps to reduce overfitting and make the tree more
generalizable.
2. Minimum Samples per Leaf or Split:
Setting a minimum number of samples required to make a split or form a leaf node
can help control the tree's depth and mitigate overfitting.
3. Feature Selection:
Carefully selecting relevant features and avoiding irrelevant ones can improve the
tree's ability to generalize to new data.
4. Ensemble Methods:
Using ensemble methods like Random Forests or Gradient Boosting can improve
the overall performance and robustness of decision trees by combining multiple
trees.
5. Handling Imbalanced Data:
Techniques like resampling, using different evaluation metrics, or using specialized
algorithms can address issues related to imbalanced class distributions.
6. Feature Engineering:
Preprocessing the data and engineering informative features can enhance the
performance of decision trees.
7. Cross-Validation:
Employing techniques like cross-validation helps to assess the model's
performance on different subsets of the data, reducing the risk of overfitting.
8. Hyperparameter Tuning:
Tuning the hyperparameters of the decision tree, such as the maximum depth,
minimum samples per leaf, and others, can significantly impact the model's
performance.
By carefully addressing these issues and applying appropriate mitigation strategies,
decision trees can be powerful and effective models in machine learning.
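The strategies above map directly onto common library options. Here is a hedged example, assuming scikit-learn (the dataset and parameter values are illustrative only):

```python
# Sketch of several mitigation strategies: limited depth, minimum samples
# per leaf, cost-complexity pruning, and cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,          # cap the tree's depth (hyperparameter tuning)
    min_samples_leaf=5,   # minimum samples per leaf (controls overfitting)
    ccp_alpha=0.01,       # cost-complexity pruning
    random_state=0,
)

# Cross-validation: assess performance on different subsets of the data.
scores = cross_val_score(tree, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())
```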

 Instance Based Learning:


Instance-based learning, often associated with k-Nearest Neighbors (k-NN)
algorithms, is a type of lazy learning approach. Instead of building a model
during the training phase, instance-based learning stores the training
instances and makes predictions based on the similarity between new
instances and the stored examples. In the case of k-NN, the "k" nearest
neighbors in the training data are used to determine the prediction for a new
instance.
While these approaches are typically used independently, you might use an
instance-based method to generate predictions for instances and then use
those predictions as features in a decision tree. However, this is more of an
ensemble learning approach rather than directly combining instance-based
learning with decision tree learning.

Here's a simplified example:

1. Use k-Nearest Neighbors to predict labels for instances in your dataset.

2. Use these predicted labels, along with the other original features, as input to a decision tree.
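A rough sketch of this two-step idea, assuming scikit-learn (the dataset and parameters are illustrative; out-of-fold predictions are used so the tree does not simply memorize the k-NN output):

```python
# Sketch: k-NN predictions appended as an extra feature for a decision tree.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: k-NN predictions, generated out-of-fold to avoid label leakage.
knn_preds = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

# Step 2: use the predictions, plus the original features, to fit a tree.
X_augmented = np.column_stack([X, knn_preds])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_augmented, y)
```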
Ensemble methods, such as Random Forests, can be considered as a
combination of decision trees, but they typically don't integrate instance-
based learning directly.
It's essential to carefully consider the nature of your data and the problem
you're trying to solve when choosing and combining different machine
learning techniques. Each method has its strengths and weaknesses, and the
effectiveness of the combination will depend on the specific characteristics
of your dataset and the goals of your machine learning task.

 Inductive Inference:

Inductive inference in machine learning is the process of learning patterns,


relationships, or rules from data to make predictions or decisions on new,
unseen data. The goal is to generalize from specific examples in the training
set to make accurate predictions on previously unseen instances. One
common approach for inductive inference is to use machine learning
algorithms that can automatically identify and capture patterns in the data.

In supervised learning, a prevalent form of inductive inference, the


algorithm learns from a labeled training dataset, where input features are
associated with corresponding output labels. The learned model aims to map
input features to the correct output labels, enabling predictions on new,
unseen data.

Decision trees, support vector machines, neural networks, and k-Nearest


Neighbors are examples of machine learning algorithms that engage in
inductive inference. The success of inductive inference depends on factors
like the quality and representativeness of the training data, the chosen
algorithm, and the appropriate tuning of hyperparameters.

 K- Nearest Neighbor’s (KNN algorithm):

o K-Nearest Neighbors is one of the simplest Machine Learning


algorithms based on Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.

Suppose there are two categories, Category A and Category B, and we have a new data point x1; we need to determine which of these categories this data point belongs to. To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

 Steps Of K-NN Algorithm:


The K-NN working can be explained based on the below algorithm:
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance of K number of neighbors.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of data points in
each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
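A from-scratch sketch of these six steps (illustrative names and toy data only):

```python
# Minimal K-NN: Euclidean distances, the K nearest neighbours, majority vote.
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=3):
    # Step-2: Euclidean distance from the new point to every stored point.
    distances = [
        (math.dist(new_point, p), label)
        for p, label in zip(train_points, train_labels)
    ]
    # Step-3: take the K nearest neighbours.
    nearest = sorted(distances)[:k]
    # Steps 4-5: count the categories and assign the majority category.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two categories, A and B, and a new data point x1.
train_points = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_points, train_labels, (2, 2), k=3))  # -> "A"
```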

 Locally Weighted Regression:

Locally Weighted Regression (LWR), also known as Locally Weighted


Scatterplot Smoothing (LOESS), is a non-parametric regression technique
used in machine learning and statistics. It is particularly useful when
dealing with non-linear relationships between variables. LWR gives more
weight to data points that are close to the target point during the training
process, allowing the model to focus on the local behavior of the data.

How Locally Weighted Regression works:

 Basic Idea:
For each prediction, LWR assigns different weights to the data points
based on their proximity to the point where the prediction is
being made. Points closer to the prediction point receive higher
weights, while points farther away receive lower weights.

 Weighting Function:
The weights are assigned using a weighting function, which is typically a Gaussian (bell-shaped) function:

w^i = exp( -(x^i - x)^2 / (2τ^2) )

where,
x^i is the feature value of the data point,
x is the feature value of the prediction point, and
τ is a bandwidth parameter that controls the width of the weighting function.
 Local Regression:
LWR fits a regression model locally for each prediction point using the
weighted data. The weights are incorporated into the regression
algorithm to give more importance to nearby points.
 Prediction:
To make a prediction at a new point, the model computes a weighted
least squares regression using only the data points close to the
prediction point.
 Bandwidth Parameter:
The bandwidth parameter (τ) is crucial in controlling the degree of
locality. A smaller bandwidth focuses more on local details, but it may
lead to overfitting, while a larger bandwidth considers more global
patterns.

Pros and Cons:

Pros: LWR is flexible and can capture complex, non-linear relationships in


the data. It is adaptive to local patterns.

Cons: LWR may be computationally expensive for large datasets. The


choice of the bandwidth parameter is critical and can affect the model's
performance.

LWR is often used in situations where the underlying relationship between


variables is expected to change across different regions of the input
space. It's worth noting that while LWR can provide accurate predictions
in certain scenarios, it may not be the best choice for all types of data.
The choice of the bandwidth parameter is an important consideration, and
it may require some tuning to achieve optimal performance.
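A short NumPy sketch of locally weighted regression for a single feature, using the Gaussian weighting function above and a weighted least-squares fit around each query point (the data and parameter values are illustrative):

```python
# Locally weighted regression: fit a weighted linear model per query point.
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=0.5):
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones_like(x_train), x_train])
    # Gaussian weights: points near x_query get weights close to 1.
    w = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted least squares: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y_train
    return theta[0] + theta[1] * x_query

# Toy example on a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
print(lwr_predict(x, y, x_query=3.0, tau=0.5))
```

A smaller tau makes the fit more local (and more prone to overfitting); a larger tau approaches an ordinary global linear regression.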
 Radial Basis Function networks:

A Radial Basis Function Network (RBFN) is a particular


type of neural network. In this article, I’ll be describing
its use as a non-linear classifier.

Generally, when people talk about neural networks or


“Artificial Neural Networks” they are referring to
the Multilayer Perceptron (MLP). Each neuron in an MLP
takes the weighted sum of its input values. That is, each
input value is multiplied by a coefficient, and the results
are all summed together. A single MLP neuron is a
simple linear classifier, but complex non-linear classifiers
can be built by combining these neurons into a network.

To me, the RBFN approach is more intuitive than the


MLP. An RBFN performs classification by measuring the
input’s similarity to examples from the training set. Each
RBFN neuron stores a “prototype”, which is just one of
the examples from the training set. When we want to
classify a new input, each neuron computes the
Euclidean distance between the input and its prototype.
Roughly speaking, if the input more closely resembles
the class A prototypes than the class B prototypes, it is
classified as class A.
 RBF Network Architecture

The above illustration shows the typical architecture of


an RBF Network. It consists of an input vector, a layer of
RBF neurons, and an output layer with one node per
category or class of data.

 The Input Vector

The input vector is the n-dimensional vector that you are


trying to classify. The entire input vector is shown to
each of the RBF neurons.

 The RBF Neurons

Each RBF neuron stores a “prototype” vector which is


just one of the vectors from the training set. Each RBF
neuron compares the input vector to its prototype, and
outputs a value between 0 and 1 which is a measure of
similarity. If the input is equal to the prototype, then the
output of that RBF neuron will be 1. As the distance
between the input and prototype grows, the response
falls off exponentially towards 0. The shape of the RBF
neuron’s response is a bell curve, as illustrated in the
network architecture diagram.

 The neuron’s response value is also called its


“activation” value.

 The prototype vector is also often called the neuron’s


“center” since it’s the value at the center of the bell
curve.

 The Output Nodes

The output of the network consists of a set of nodes, one


per category that we are trying to classify. Each output
node computes a sort of score for the associated
category. Typically, a classification decision is made by
assigning the input to the category with the highest
score.

 The score is computed by taking a weighted sum of the


activation values from every RBF neuron. By weighted
sum we mean that an output node associates a weight
value with each of the RBF neurons and multiplies the
neuron’s activation by this weight before adding it to the
total response.

 Because each output node is computing the score for a


different category, every output node has its own set of
weights. The output node will typically give a positive
weight to the RBF neurons that belong to its category,
and a negative weight to the others.

 RBF Neuron Activation Function

 Each RBF neuron computes a measure of the similarity


between the input and its prototype vector (taken from
the training set). Input vectors which are more similar to
the prototype return a result closer to 1. There are
different possible choices of similarity functions, but the
most popular is based on the Gaussian. Below is the equation for a Gaussian with a one-dimensional input:

f(x) = (1 / (sigma * sqrt(2 * pi))) * exp( -(x - mu)^2 / (2 * sigma^2) )

 Where x is the input, mu is the mean, and sigma is the standard deviation. This produces the familiar bell curve, which is centered at the mean mu (in the original plot, the mean is 5 and sigma is 1).

 The RBF neuron activation function is slightly different, and is typically written as:

phi(x) = exp( -beta * ||x - mu||^2 )

 In the Gaussian distribution, mu refers to the mean of the distribution. Here, it is the prototype vector which is at the center of the bell curve.
 For the activation function, phi, we aren’t directly
interested in the value of the standard deviation, sigma,
so we make a couple simplifying modifications.

 The first change is that we’ve removed the outer


coefficient, 1 / (sigma * sqrt (2 * pi)). This term normally
controls the height of the Gaussian. Here, though, it is
redundant with the weights applied by the output nodes.
During training, the output nodes will learn the correct
coefficient or “weight” to apply to the neuron’s
response.

 The second change is that we’ve replaced the inner


coefficient, 1 / (2 * sigma^2), with a single parameter
‘beta’. This beta coefficient controls the width of the bell
curve. Again, in this context, we don’t care about the
value of sigma, we just care that there’s some
coefficient which is controlling the width of the bell
curve. So, we simplify the equation by replacing the
term with a single variable.

 RBF Neuron activation for different values of beta


 There is also a slight change in notation here when we
apply the equation to n-dimensional vectors. The double
bar notation in the activation equation indicates that we
are taking the Euclidean distance between x and mu and
squaring the result. For the 1-dimensional Gaussian, this
simplifies to just (x - mu) ^2.

 It’s important to note that the underlying metric here for


evaluating the similarity between an input vector and a
prototype is the Euclidean distance between the two
vectors.

 Also, each RBF neuron will produce its largest response


when the input is equal to the prototype vector. This
allows us to take it as a measure of similarity and sum
up the results from all the RBF neurons.

 As we move out from the prototype vector, the response falls off exponentially. Recall from the RBFN architecture illustration that the output node for each category takes the weighted sum of every RBF neuron in the network; in other words, every neuron in the network will have some influence over the classification decision. The exponential fall-off of the activation function, however, means that the neurons whose prototypes are far from the input vector will contribute very little to the result.
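A small sketch of the activation function phi and the weighted-sum output scores described above, assuming NumPy (the prototypes, weights, and beta value are illustrative):

```python
# RBF neuron activation and per-category output scores.
import numpy as np

def rbf_activation(x, prototype, beta=1.0):
    # phi(x) = exp(-beta * ||x - mu||^2): 1 at the prototype, falling toward 0.
    return np.exp(-beta * np.sum((x - prototype) ** 2))

def output_scores(x, prototypes, output_weights, beta=1.0):
    # Each output node takes a weighted sum of all RBF neuron activations.
    activations = np.array([rbf_activation(x, p, beta) for p in prototypes])
    return output_weights @ activations  # one score per category

# Two prototypes per class, and one output node per class (A and B).
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
weights = np.array([[1.0, 1.0, -1.0, -1.0],   # class A output weights
                    [-1.0, -1.0, 1.0, 1.0]])  # class B output weights
scores = output_scores(np.array([0.5, 0.5]), prototypes, weights, beta=1.0)
print("Predicted class:", "A" if scores[0] > scores[1] else "B")
```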

 Case Based Learning:

Case based format encourages active learning and demonstrates how to apply
theoretical concepts to surgical practice.

1. Can be an element of curriculum.


2. Based on issue(s) that arise in a clinical case
3. Self-directed or structured
4. Structure depends on the level of the learner.
Case based learning instruction is one of the learner-oriented teaching approaches, since it promotes students' active participation so that they can construct their own learning. It helps students transfer knowledge and expectations from their learning.

It is often defined as a teaching method which requires students to actively


participate in real or hypothetical problem situations, reflecting the kinds of
experiences naturally encountered in the discipline under study.
Cases are stories with a message which students analyze and consider the
solutions of these stories.

Functions of case-based learning algorithm are as follows:

1. Preprocessor: This prepares the input for processing, e.g., normalizing the range of numeric-valued features to ensure that they are treated with equal importance by the similarity function, formatting the raw input into a set of cases, etc.
2. Similarity: This function assesses the similarity of a given case with the previously stored cases in the concept description. Assessment may involve explicit encoding or dynamic computation; most practical CBL similarity functions are a compromise along the continuum between these two extremes.
3. Prediction: This function inputs the similarity assessments and generates a prediction for the value of the given case's goal feature (i.e., a classification when the goal feature has symbolic values).
4. Memory Updating: This updates the stored case base, e.g., by modifying or abstracting previously stored cases, forgetting cases presumed to be noisy, or updating a feature's relevance-weight settings.
 Case-based learning cycle with different schemes of CBL:

1. Case retrieval: After the problem situation has been assessed, the best
matching case is searched in the case base and an approximate solution is
retrieved.
2. Case adaptation: The retrieved solution is adapted to better fit the new problem.
3. Solution evaluation: The adapted solution can be evaluated either before the
solution is applied to the problem or after the solution has been applied. In any
case, if the accomplished result is not satisfactory, the retrieved solution must be
adapted again or more cases should be retrieved.
4. Case-based updating: If the solution was verified as correct, the new case
may be added to the case base.
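An illustrative sketch of this retrieve / adapt / evaluate / update cycle, assuming each case is a (problem-features, solution) pair and plain Euclidean distance as the similarity measure (all names are hypothetical):

```python
# Toy case-based learning cycle.
import math

case_base = [
    ((1.0, 2.0), "solution-A"),
    ((5.0, 5.0), "solution-B"),
]

def retrieve(problem):
    # 1. Case retrieval: find the best-matching stored case.
    return min(case_base, key=lambda case: math.dist(case[0], problem))

def adapt(solution, problem):
    # 2. Case adaptation: tweak the retrieved solution (domain-specific;
    #    this placeholder simply reuses it unchanged).
    return solution

def solve(problem, is_satisfactory):
    best_case = retrieve(problem)
    solution = adapt(best_case[1], problem)
    # 3. Solution evaluation and 4. case-base updating if it worked.
    if is_satisfactory(solution):
        case_base.append((problem, solution))
    return solution

print(solve((1.5, 2.5), is_satisfactory=lambda s: True))  # -> "solution-A"
```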

Terms used in the cycle:

A new problem is matched against the cases in the case base, and one or more similar cases are Retrieved. A solution suggested by the matching cases is then Reused.

 The benefits of CBR as a lazy problem-solving method are:

 Ease of knowledge elicitation


 Absence of problem-solving bias
 Incremental learning
 Suitability for complex and not-fully formalized solution spaces
 Suitability for sequential problem solving.
 Ease of explanation
 Ease of maintenance

 The Limitations of Case Based Learning are as follows:

 Handling large case bases


 Dynamic problem domains
 Handling noisy data
 Fully automatic operation

 Applications of Case Based Learning are:


 Advising as a process of resolving diagnosed problems
 Design as a process of satisfying a number of posed constraints.
 Planning as a process of arranging a sequence of actions in time.
 Interpretation as a process of evaluating situations/problems in some
context.
 Classification as a process of explaining a number of encountered
symptoms.

 Major Paradigms of Machine Learning include:

 Rote Learning: It deals with One-to-one mapping from inputs to stored


representation. "Learning by memorization” Association-based storage and
retrieval.
 Induction: It uses specific examples to reach general conclusions
 Clustering: It involves automatically discovering natural grouping in data.
 Analogy: Helps to determine correspondence between two different
representations
 Discovery: It is a type of unsupervised learning in which a specific
goal/outcome is not provided.
 Genetic Algorithms: It is a method for solving both constrained and
unconstrained optimization problems based on a natural selection process
that mimics biological evolution.
 Reinforcement Learning: Only feedback (a positive or negative reward) is given at the end of a sequence of steps. This requires assigning reward to individual steps by solving the credit assignment problem, i.e., determining which steps should receive credit or blame for the final result.

 The Inductive Learning Problem:

 Extrapolate from a given set of examples so that we can make accurate


predictions about future examples.
 Supervised vs Unsupervised learning
Want to learn an unknown function f(x) = y, where x is an input example
and y is the desired output. Supervised learning implies we are given a set
of (x, y) pairs by a "teacher." Unsupervised learning means we are only
given the xs. In either case, the goal is to estimate f.
 Concept learning
Given a set of examples of some concept/class/category, determine if a
given example is an instance of the concept or not. If it is an instance, we
call it a positive example. If it is not, it is called a negative example.
 Problem Example
Supervised Concept Learning by Induction
Given a training set of positive and negative examples of a concept,
construct a description that will accurately classify whether future
examples are positive or negative. That is, learn some good estimate of
function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi
is either + (positive) or - (negative).
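A toy sketch of this setup, assuming scikit-learn and a made-up training set of positive (+) and negative (-) examples:

```python
# Supervised concept learning by induction: learn an estimate of f from
# labelled (x, y) pairs and classify a previously unseen example.
from sklearn.tree import DecisionTreeClassifier

# Training set {(x1, y1), ..., (xn, yn)}: each x is a feature vector,
# each y is "+" (positive example) or "-" (negative example).
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y_train = ["-", "-", "-", "-", "+", "+", "+", "+"]

# Induce a description of the concept from the examples.
concept = DecisionTreeClassifier().fit(X_train, y_train)

# Classify a future, unseen example.
print(concept.predict([[2.5, 2.5]]))  # expected: ["+"]
```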
