Unit 5
Supervised Learning:
Regression and Classification
Topics
• Regression: Introduction, Example of Regression, Common Regression Algorithms
• Simple Linear Regression, Multiple Linear Regression
• Classification: Introduction, Classification Model, Classification Learning Steps
• Classification Algorithms: kNN, Decision Tree, Random Forest, Support Vector Machine
Regression
Introduction
• In supervised learning, when we are trying to predict a real-valued variable such as ‘Price’ or ‘Weight’, the problem falls under the category of regression.
• A regression problem tries to forecast results as a continuous output.
• Dependent Variable (Y) is the value to be predicted. This variable is presumed
to be functionally related to the independent variable (X).
• In other words, the dependent variable(s) depend on the independent variable(s). The independent variable (X) is called the predictor; it is used in a regression model to estimate the value of the dependent variable (Y).
• Regression is essentially finding a relationship (or association) between the dependent variable (Y) and the independent variables (X).
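• As a minimal worked form of this idea (standard regression notation, not taken from the slides): the relationship is commonly written as Y = f(X) + e, where e is a random error term. In simple linear regression, f(X) takes the straight-line form a + bX, so the model becomes Y = a + bX + e.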
COMMON REGRESSION
ALGORITHMS
• Simple linear regression
• Multiple linear regression
• Polynomial regression
• Multivariate adaptive regression splines
• Logistic regression
• Maximum likelihood estimation (which reduces to least squares under Gaussian errors)
Simple linear regression
• If the regression involves only one independent variable, it is called
simple regression.
• Thus, if we take ‘Price of a used car’ as the dependent variable and the
‘Year of manufacturing of the car’ as the independent variable, we can
build a simple regression.
• Slope represents how much the line in a graph changes in the vertical direction (Y-axis) over a change in the horizontal direction (X-axis). Slope is also referred to as the rate of change in a graph.
• Maximum and minimum points on a graph are found where the slope of the curve is zero, i.e. where the slope changes sign from positive to negative or vice versa.
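A minimal sketch of fitting a simple linear regression by ordinary least squares in plain Python; the used-car ages and prices below are made up purely for illustration:

```python
# A minimal sketch of simple linear regression using ordinary least squares.
# The data (car age vs. price) is hypothetical, purely for illustration.

def fit_simple_linear(xs, ys):
    """Return (intercept a, slope b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    b = num / den
    a = mean_y - b * mean_x
    return a, b

# Hypothetical used-car data: age in years vs. price
ages   = [1, 2, 3, 5, 8]
prices = [8.5, 7.8, 6.9, 5.2, 3.1]
a, b = fit_simple_linear(ages, prices)
print(f"price = {a:.2f} + ({b:.2f}) * age")  # negative slope: price falls with age
```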
k-Nearest Neighbour (kNN) - Distance measure
• The distance (usually Euclidean) between two data elements d1 and d2, each described by two features f1 and f2, is
distance(d1, d2) = sqrt((f11 − f12)² + (f21 − f22)²)
• Where
• f11 = value of feature f1 for data element d1
• f12 = value of feature f1 for data element d2
• f21 = value of feature f2 for data element d1
• f22 = value of feature f2 for data element d2
k-Nearest Neighbour (kNN) - Example
k-Nearest Neighbour (kNN) - Example: 2-D representation of the student data set
k-Nearest Neighbour (kNN) - Example: distance calculation between test and training points
kNN algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
    Find the closest ‘k’ training data points, i.e. the training data points whose distances from the test data point are the least.
    If k = 1
        Then assign the class label of the nearest training data point to the test data point
    Else
        Assign the class label that is predominantly present among the ‘k’ nearest training data points to the test data point
End do
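Below is a minimal Python sketch of these steps, assuming Euclidean distance and a simple majority vote; the toy training points, labels, and test point are hypothetical.

```python
# A minimal sketch of the kNN steps above: compute distances, take the
# k nearest training points, and assign the majority class label.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(train, labels, test_point, k):
    """Assign the majority class among the k nearest training points."""
    # Distance of the test point from every training data point
    distances = sorted(
        (dist(p, test_point), lbl) for p, lbl in zip(train, labels)
    )
    # Class labels of the k closest training points
    nearest = [lbl for _, lbl in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

train  = [(2, 4), (4, 2), (4, 4), (6, 8), (8, 6)]
labels = ["Fail", "Fail", "Fail", "Pass", "Pass"]
print(knn_classify(train, labels, (6, 6), k=3))  # -> 'Pass'
```

With k = 1 the vote degenerates to the single nearest neighbour, which matches the If/Else branches in the algorithm above.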
Strengths of the kNN algorithm
• Extremely simple algorithm – easy to understand
• Very effective in certain situations, e.g. for recommender system
design
• Very fast – almost no time is required for the training phase
Weaknesses of the kNN algorithm
• Does not learn anything in the real sense. Classification is done
completely on the basis of the training data. So, it has a heavy
reliance on the training data. If the training data does not represent
the problem domain comprehensively, the algorithm fails to make an
effective classification.
• Because no model is trained in the real sense and classification is done entirely on the basis of the training data, the classification process is very slow.
• A large amount of computational space is required to load the
training data for classification.
Decision tree
• Decision tree learning is one of the most widely adopted algorithms
for classification.
• A decision tree is used for multi-dimensional analysis with multiple
classes.
• The goal of decision tree learning is to create a model (built from past data) that predicts the value of the output variable based on the input variables in the feature vector.
Decision tree
• Each internal node tests an attribute (represented as ‘A’/‘B’ in the accompanying figure).
• Each branch corresponds to an attribute value (T/F in that figure). Each leaf node assigns a classification.
• The topmost node is called the ‘Root’ node.
• A decision tree consists of three types of nodes:
• Root Node
• Branch Node
• Leaf Node
Decision Tree - Example
Entropy of a decision tree
• Let us say S is the sample set of training examples and p_i is the proportion of examples in S belonging to class i. Then Entropy(S), measuring the impurity of S, is defined as
Entropy(S) = −Σ_i p_i log2(p_i), the sum being taken over all classes i.
Calculate the entropy of the split on each candidate feature and identify Fmin, the feature whose split yields the minimum entropy (equivalently, the maximum information gain); a sketch follows after these steps.
Draw a decision tree node containing the attribute Fmin and split the data set into subsets using that attribute.
Repeat the above steps on each subset until the full tree is drawn, covering all the attributes of the original table.
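A minimal Python sketch of this selection rule, assuming a tiny hypothetical data set with two categorical features; Fmin is the feature whose split leaves the minimum weighted entropy:

```python
# A minimal sketch: compute Entropy(S) for each candidate split and pick
# the feature (Fmin) whose split leaves the least impurity.
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, feature):
    """Weighted entropy of the subsets produced by splitting on `feature`."""
    total = len(rows)
    subsets = {}
    for row, lbl in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(lbl)
    return sum(len(s) / total * entropy(s) for s in subsets.values())

# Hypothetical training table with two categorical features
rows = [
    {"CGPA": "high", "Interactive": "yes"},
    {"CGPA": "high", "Interactive": "no"},
    {"CGPA": "low",  "Interactive": "yes"},
    {"CGPA": "low",  "Interactive": "no"},
]
labels = ["Pass", "Pass", "Fail", "Fail"]

# Fmin = feature whose split yields the minimum weighted entropy
fmin = min(["CGPA", "Interactive"], key=lambda f: split_entropy(rows, labels, f))
print(fmin)  # -> 'CGPA' (its split separates the classes perfectly)
```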
Avoiding overfitting in decision trees – pruning
• The decision tree algorithm, unless a stopping criterion is applied,
may keep growing indefinitely – splitting for every feature and
dividing into smaller partitions till the point that the data is perfectly
classified. This, as is quite evident, results in an overfitting problem.
• To prevent a decision tree getting overfitted to the training data,
pruning of the decision tree is essential.
• Pruning a decision tree reduces the size of the tree such that the
model is more generalized and can classify unknown and unlabelled
data in a better way.
Avoiding overfitting in decision trees – pruning
• There are two approaches to pruning:
• Pre-pruning: Stop growing the tree before it reaches perfection.
• Post-pruning: Allow the tree to grow entirely and then post-prune some of
the branches from it.
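As an illustrative sketch (assuming scikit-learn is available; the hyperparameter values are arbitrary), pre-pruning corresponds to capping growth up front, while cost-complexity pruning (ccp_alpha) approximates post-pruning by collapsing weak branches of a fully grown tree:

```python
# Pre-pruning vs. post-pruning with scikit-learn decision trees.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop the tree growing before it fits the data perfectly
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning (cost-complexity): grow, then collapse weak branches
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```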
Strengths of decision tree
• It produces very simple, understandable rules. For smaller trees, not much mathematical or computational knowledge is required to understand the model.
• Works well for most problems.
• It can handle both numerical and categorical variables.
• Can work well both with small and large training data sets.
• Decision trees provide a definite clue of which features are more
useful for classification.
Weaknesses of decision tree
• Decision tree models are often biased towards features with a larger number of possible values, i.e. levels.
• This model gets overfitted or underfitted quite easily.
• Decision trees are prone to errors in classification problems with
many classes and relatively small number of training examples.
• A decision tree can be computationally expensive to train.
• Large trees are complex to understand.
Random forest model
• Random forest is an ensemble classifier, i.e. a combining classifier that
uses and combines many decision tree classifiers.
• Ensembling is usually done using the concept of bagging with
different feature sets.
• The reason for using a large number of trees in a random forest is to train the trees enough that each feature gets a chance to contribute to a number of models.
• After the random forest is generated by combining the trees, majority
vote is applied to combine the output of the different trees.
A simplified random forest model
How does random forest work?
1. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N) features at random out of the N features. The observations or data instances should also be picked randomly.
2. Use the best split principle on these ‘m’ features to calculate the number of
nodes ‘d’.
3. Keep splitting the nodes to child nodes till the tree is grown to the maximum
possible extent.
4. Select a different subset of the training data ‘with replacement’ to train another
decision tree following steps (1) to (3). Repeat this to build and train ‘n’ decision
trees.
5. Final class assignment is done on the basis of the majority vote from the ‘n’ trees, as in the library-based sketch below.
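A minimal sketch of these steps using scikit-learn's RandomForestClassifier (assumed available; the data set and hyperparameter values are illustrative): n_estimators plays the role of ‘n’, max_features the role of ‘m’, and bootstrap=True gives the ‘with replacement’ sampling of step 4.

```python
# Random forest: n trees, each grown on a bootstrap sample, each split
# considering only a random subset of m features; majority vote decides.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,  # n trees, each trained on a bootstrap sample
    max_features=2,    # m (< N = 4) features considered at each split
    bootstrap=True,    # sample the training data 'with replacement'
).fit(X, y)

# Final class assignment by majority vote across the n trees
print(forest.predict(X[:3]))
```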
Strengths of random forest
• It runs efficiently on large and expansive data sets.
• It has a robust method for estimating missing data and maintains precision when a
large proportion of the data is absent.
• It has powerful techniques for balancing errors in a class population of unbalanced
data sets.
• It gives estimates (or assessments) about which features are the most important ones
in the overall classification.
• It generates an internal unbiased estimate (gauge) of the generalisation error as the
forest generation progresses.
• Generated forests can be saved for future use on other data.
• Lastly, the random forest algorithm can be used to solve both classification and
regression problems.
Weaknesses of random forest
• This model, because it combines a number of decision tree models, is
not as easy to understand as a decision tree model.
• It is computationally much more expensive than a simpler model such as a single decision tree.
Support vector machines
• SVM is a model that can perform linear classification as well as regression.
• SVM is based on the concept of a surface, called a hyperplane, which
draws a boundary between data instances plotted in the multi-
dimensional feature space.
• The output prediction of an SVM is one of the two possible classes that are already defined in the training data.
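A minimal linear-SVM sketch using scikit-learn (assumed available; the blob data is synthetic and purely illustrative):

```python
# A linear SVM fits a hyperplane w.x + b = 0 separating the two classes.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two hypothetical, roughly linearly separable clusters of points
X, y = make_blobs(n_samples=40, centers=2, random_state=0)

clf = SVC(kernel="linear").fit(X, y)
print(clf.coef_, clf.intercept_)  # hyperplane parameters w and b
print(clf.predict(X[:5]))         # predictions: one of the two known classes
```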
Classification using hyperplanes