
Unit-2

Syllabus: Classification and Regression Models:


Linear separability and decision regions, linear discriminants, linear regression, logistic regression, decision trees (ID3 and C4.5), KNN.

Linear separability:

• Two sets of data points in a two-dimensional space are said to be linearly separable when they can be completely separated by a single straight line.
• For example, consider selling a house based on area and price: we have a number of data points, each labelled with the class Sold/Not Sold.

What are linearly separable and linearly non-separable data?
If you can draw a line (or, in higher dimensions, a hyperplane) that separates the points into two classes, the data is linearly separable. If not, the data is termed linearly non-separable.

Boolean AND and OR are linearly separable problems, while XOR is not linearly separable.

• The classic example of a linearly non-separable pattern is the logical exclusive-OR (XOR) function.
• The figure illustrates that the two XOR classes, 0 (red dots) and 1 (blue dots), cannot be separated by a single line.
• The patterns can, however, be classified correctly using two lines, L1 and L2.
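The intuition above can be checked with a tiny experiment. The sketch below (an assumption of this rewrite, not part of the original notes; it relies on NumPy and scikit-learn) trains a single-layer perceptron, i.e. a linear classifier, on the AND, OR, and XOR truth tables: it reaches 100% training accuracy on the two linearly separable problems but cannot do so on XOR.

# A minimal sketch, assuming scikit-learn is available: a linear classifier
# (perceptron) fits the linearly separable AND/OR truth tables perfectly,
# but no straight line can separate the XOR points.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

for name, y in targets.items():
    clf = Perceptron(max_iter=1000, tol=None).fit(X, y)
    acc = clf.score(X, y)          # 1.0 for AND and OR, below 1.0 for XOR
    print(name, "training accuracy:", acc)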

Decision Regions:

• When performing pattern recognition, a set of patterns can be represented in a pattern space, in which each pattern is represented as a point at a particular set of coordinates.
• A decision region is an area or volume of the pattern space, marked off by cuts in that space.
• All of the patterns (feature vectors) within a decision region are assigned to the same class.
• The decision regions are separated by surfaces called decision boundaries; these separating surfaces represent points where there are ties between two or more categories.
• The decision regions are often simply connected, but they can be multiply connected as well.

Linear discriminants:
 Linear discriminant analysis (LDA), or Fisher's linear discriminant, is a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
 It is mainly used to express one dependent variable as a linear combination of other features or measurements.
 LDA projects data from a D-dimensional feature space down to a D'-dimensional space (D' < D) in a way that maximizes the variability between the classes while reducing the variability within the classes.
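As an illustration of this projection, the sketch below (an assumption of this rewrite, using scikit-learn's LinearDiscriminantAnalysis on the standard Iris dataset rather than anything from the notes) projects 4-dimensional data down to 2 dimensions chosen to keep the classes well separated.

# A minimal sketch, assuming scikit-learn: Fisher's LDA projects D = 4
# features down to D' = 2 dimensions chosen to maximize class separation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)   # D' = 2 < D = 4
X_proj = lda.fit_transform(X, y)
print(X_proj.shape)                                # (150, 2)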

Regression:
Regression is a supervised learning technique that helps find the correlation among variables. A regression problem is one in which the output variable is a real or continuous value.

In regression, we plot a graph between the variables that best fits the given data points, and the resulting model can then deliver predictions for new data.

Linear Regression

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that
is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.

The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Because the relationship is linear, the model describes how the value of the dependent variable changes with the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent linear regression as:

y = a0 + a1·x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables are the training data used to fit the linear regression model.

Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and actual values should be minimized. The best-fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line. To do this we use a cost function.

Cost function:

o The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.

For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the linear equation above, MSE can be calculated as:

MSE = (1/N) ∑ᵢ (yᵢ − (a1·xᵢ + a0))²

Where,

N = total number of observations
yᵢ = actual value
(a1·xᵢ + a0) = predicted value
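The sketch below (hypothetical toy data chosen for illustration, not taken from the notes; it uses NumPy's polyfit as the least-squares solver) fits a0 and a1 and evaluates the MSE cost defined above.

# A minimal sketch with NumPy: fit a0 and a1 by least squares on toy data
# and compute the MSE cost between actual and predicted values.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

a1, a0 = np.polyfit(x, y, deg=1)          # slope (a1) and intercept (a0)
y_pred = a0 + a1 * x                      # predicted values a1*xi + a0
mse = np.mean((y - y_pred) ** 2)          # MSE cost function from above
print("a0 =", round(a0, 3), "a1 =", round(a1, 3), "MSE =", round(mse, 4))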

Applications (usecases) of linear regression:

o Sales forecasting
o Risk analysis
o Housing applications, to predict house prices and related factors
o Finance applications, to predict stock prices, evaluate investments, etc.

Types of Regression models:

Straight-line (simple) linear regression:

• It is the simplest form of regression and models y as a linear function of x: y = a + b·x.

• The variance of y is assumed to be constant, and a and b are regression coefficients specifying the Y-intercept and slope of the line, respectively.

• These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Nonlinear Regression:

“How can we model data that does not show a linear dependence? For example, what if a given response
variable and predictor variable have a relationship that may be modeled by a polynomial function?”

• Polynomial regression is often of interest when there is just one predictor variable.

• It can be modeled by adding polynomial terms to the basic linear model.

• By applying transformations to the variables, we can convert the nonlinear model into a linear one that can
then be solved by the method of least squares.

• Note that polynomial regression is a special case of multiple regression.

• That is, the addition of higher-order terms such as x², x³, and so on, which are simple functions of the single variable x, can be considered equivalent to adding new independent variables.
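To make the "polynomial terms as new variables" idea concrete, here is a small sketch (an illustration using scikit-learn and made-up data, not an example from the notes): x² and x³ are generated as extra columns and an ordinary linear regression is then fit by least squares.

# A minimal sketch, assuming scikit-learn: polynomial regression treated as
# multiple linear regression by adding x^2 and x^3 as new feature columns.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, 4, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2   # illustrative curved data

X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)           # solved by least squares
print(model.intercept_, model.coef_)                # approx. 1.0 and [2, -0.5, 0]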

Example-2

• The table shows a set of paired data where x is the number of years of work experience of a college graduate and y is the corresponding salary of the graduate.
• Predict the salary of a college graduate with, say, 10 years of experience: ___

Logistic Regression:

Logistic regression is a statistical method used for building machine learning models where the dependent variable is dichotomous, i.e. binary. Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.

The name “logistic regression” is derived from the concept of the logistic function that it uses. The logistic function
is also known as the sigmoid function. The value of this logistic function lies between zero and one.

The following is an example of a logistic function we can use to find the probability of a vehicle breaking down,
depending on how many years it has been since it was serviced last.

Here is how you can interpret the results from the graph to decide whether the vehicle will break down or not.

Advantages of the Logistic Regression Algorithm

 Logistic regression performs better when the data is linearly separable.
 It does not require many computational resources, and it is highly interpretable.
 There is no problem scaling the input features, and it does not require tuning.
 It is easy to implement and to train a model using logistic regression.
 It gives a measure of how relevant a predictor is (coefficient size) and its direction of association (positive or negative).

Applications of Logistic Regression

 Using the logistic regression algorithm, banks can predict whether a customer will default on a loan or not.
 To predict the weather conditions of a certain place (sunny, windy, rainy, humid, etc.).
 E-commerce companies can identify buyers who are likely to purchase a certain product.
 Companies can predict whether they will gain or lose money in the next quarter, year, or month based on their current performance.
 To classify objects based on their features and attributes.

How Does the Logistic Regression Algorithm Work?

Consider the following example: An organization wants to determine an employee’s salary increase based on their
performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by considering the
employee’s performance as the independent variable, and the salary increase as the dependent variable will make
their task easier.

Now, what if the organization wants to know whether an employee will get a promotion or not based on their performance? The linear graph above won't be suitable in this case. As such, we clip the line at zero and one and convert it into a sigmoid curve (S-curve).

Based on a threshold value, the organization can then decide whether an employee will get the promotion or not.
The equation of the sigmoid function is:

σ(x) = 1 / (1 + e^(−x))

The sigmoid curve obtained from the above equation is an S-shaped curve bounded between 0 and 1.
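A short sketch of the function itself (plain NumPy, with made-up scores; the 0.5 threshold is the usual convention rather than something fixed by the notes):

# A minimal sketch: the sigmoid squashes any real-valued score into (0, 1),
# so its output can be read as a probability and thresholded at 0.5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(scores)
print(probs)              # values strictly between 0 and 1
print(probs >= 0.5)       # class decision at the 0.5 threshold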

Example-2:

Differences between Linear and Logistic Regression:

o Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
o Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
o In linear regression, we find the best-fit line, with which we can easily predict the output; in logistic regression, we find the S-curve, with which we can classify the samples.
o Linear regression uses the least-squares estimation method; logistic regression uses the maximum likelihood estimation method.
o The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No.
o Linear regression requires the relationship between the dependent and independent variables to be linear; logistic regression does not require a linear relationship between the dependent and independent variables.

Decision Tree Algorithm:

Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
o It is a tree-structured classifier. The tree has three types of nodes:
– the root node, which has no incoming edges and zero or more outgoing edges;
– internal nodes, each of which has exactly one incoming edge and two or more outgoing edges;
– leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and the other internal nodes, contain attribute test conditions that separate records with different characteristics.

o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for obtaining all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, like a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

How does the Decision Tree algorithm Work?


In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain the possible values of the best attribute.
 Step-4: Generate the decision tree node that contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.

Example:
Consider a decision tree used to classify whether a person is Fit or Unfit. The decision nodes here are questions like 'Is the person less than 30 years of age?', 'Does the person eat junk food?', etc., and the leaves are one of the two possible outcomes, viz. Fit and Unfit. Looking at the decision tree we can make decisions such as: if a person is less than 30 years of age and doesn't eat junk food, then he is Fit; if a person is less than 30 years of age and eats junk food, then he is Unfit; and so on.
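The sketch below (an assumption of this rewrite: it uses scikit-learn's DecisionTreeClassifier on the standard Iris dataset, not the Fit/Unfit data above) shows the same idea in code: grow a tree from labelled records, print its attribute test conditions, and classify new records by walking from root to leaf.

# A minimal sketch, assuming scikit-learn: train a decision tree classifier
# and use it to classify records by following root-to-leaf paths.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree))        # the learned attribute test conditions
print(tree.predict(X[:5]))      # predicted class labels for five records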

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Attribute Selection Measures in decision tree:


While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. The popular ASM techniques are:
o Entropy
o Information Gain
o Gini Index
o Gain Ratio

i) ID3 Algorithm:
ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively (repeatedly) dichotomizes (divides) the features into two or more groups at each step.

 It is a classification algorithm that follows a greedy approach, selecting the best attribute as the one that yields maximum Information Gain (IG), or equivalently minimum entropy (H).

ID3 only works with discrete or nominal data, but C4.5 works with both discrete and continuous data.

The ID3 algorithm selects the best attribute based on the concepts of entropy and information gain for developing the tree. The C4.5 algorithm acts similarly to ID3 but improves on a few of ID3's behaviours: the metric (or heuristic) used in C4.5 to measure impurity is the gain ratio.

Denoting our dataset as S, entropy is calculated as:

Entropy(S) = − ∑ᵢ pᵢ · log₂(pᵢ) ,  i = 1 … n

where
n is the total number of classes in the target column (in our case n = 2, i.e. YES and NO), and
pᵢ is the probability of class i, i.e. the ratio of the "number of rows with class i in the target column" to the "total number of rows" in the dataset.

Information Gain for a feature column A is calculated as:

IG(S, A) = Entropy(S) − ∑ᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

where Sᵥ is the set of rows in S for which feature column A has value v, |Sᵥ| is the number of rows in Sᵥ, and likewise |S| is the number of rows in S.
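The two formulas translate directly into code. The sketch below (plain NumPy on a made-up six-row target column and one hypothetical feature column called "outlook"; none of this data comes from the notes) computes Entropy(S) and IG(S, A).

# A minimal sketch of the entropy and information-gain formulas above.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class probabilities p_i
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    gain = entropy(labels)                    # Entropy(S)
    for v in np.unique(feature):
        subset = labels[feature == v]         # S_v: rows where A = v
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

target  = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "rain", "sunny", "rain", "rain", "sunny"])
print(entropy(target), information_gain(outlook, target))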

ID3 Steps (Algorithm or procedure):

1. Calculate the Information Gain of each feature.
2. If the rows do not all belong to the same class, split the dataset S into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with that class as its label.
5. Repeat for the remaining features until we run out of features or the decision tree has only leaf nodes.
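A compact recursive version of these steps is sketched below. It reuses the entropy and information_gain helpers and the toy columns from the previous snippet, represents the dataset as a dictionary of attribute name → value array, and returns the tree as nested dictionaries; all of these representation choices are illustrative assumptions, not part of the notes.

# A minimal recursive ID3 sketch (assumes entropy / information_gain from the
# previous snippet are already defined).
import numpy as np

def id3(X, y, attributes):
    classes = np.unique(y)
    if len(classes) == 1:                     # step 4: pure node becomes a leaf
        return classes[0]
    if not attributes:                        # no features left: majority class
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    gains = [information_gain(X[a], y) for a in attributes]   # step 1
    best = attributes[int(np.argmax(gains))]                  # steps 2-3
    node = {best: {}}
    for v in np.unique(X[best]):                              # step 5: recurse
        mask = X[best] == v
        sub_X = {a: X[a][mask] for a in attributes if a != best}
        sub_attrs = [a for a in attributes if a != best]
        node[best][v] = id3(sub_X, y[mask], sub_attrs)
    return node

target  = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "rain", "sunny", "rain", "rain", "sunny"])
print(id3({"outlook": outlook}, target, ["outlook"]))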

Example:

Advantages of using ID3:

 Understandable prediction rules are created from the training data.
 Builds the tree quickly.
 Builds a short tree.
 Only needs to test enough attributes until all data is classified.
 Finding leaf nodes enables test data to be pruned, reducing the number of tests.
 The whole dataset is searched to create the tree.
Disadvantages of using ID3:

 Data may be over-fitted or over-classified if a small sample is tested.
 Only one attribute at a time is tested for making a decision.
 Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.

ii) C4.5 Algorithm:

ID3 only works with discrete or nominal data, but C4.5 works with both discrete and continuous data. C4.5 is a successor of ID3.

The ID3 algorithm selects the best attribute based on the concepts of entropy and information gain for developing the tree.

The C4.5 algorithm selects the best attribute based on the gain ratio.

The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier, such as product_ID. A split on product_ID would result in a large number of partitions (as many as there are values), each one containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.

C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a "split information" value, defined analogously with Info(D) as:

SplitInfo_A(D) = − ∑ⱼ (|Dⱼ| / |D|) · log₂(|Dⱼ| / |D|) ,  j = 1 … v

This value represents the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
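In code, the gain ratio is a small extension of the information-gain snippet from the ID3 section. The sketch below (same made-up toy columns, and it assumes the information_gain helper defined earlier) computes the split information and the gain ratio for one candidate attribute.

# A minimal sketch: split information and gain ratio for one attribute
# (assumes information_gain from the ID3 snippets is already defined).
import numpy as np

def split_info(feature):
    _, counts = np.unique(feature, return_counts=True)
    p = counts / counts.sum()                 # |D_j| / |D| for each outcome
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    si = split_info(feature)
    return information_gain(feature, labels) / si if si > 0 else 0.0

target  = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "rain", "sunny", "rain", "rain", "sunny"])
print(gain_ratio(outlook, target))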

C4.5 Steps (Algorithm or procedure):

1. Check for the base cases (listed below).
2. For each attribute A, find the normalized information gain ratio from splitting on A.
3. Let A_best be the attribute with the highest normalized information gain ratio.
4. Create a decision node that splits on A_best.
5. Recurse on the sublists obtained by splitting on A_best, and add those nodes as children of the current node.

The base cases are the following:

 All the examples in the training set belong to the same class (a tree leaf labelled with that class is returned).
 The training set is empty (a tree leaf marked "failure" is returned).
 The attribute list is empty (a leaf labelled with the most frequent class, or the disjunction of all the classes, is returned).

Advantages of C4.5:
 Can work with both categorical (discrete) and continuous values.
 Inherently employs a single-pass pruning process to mitigate overfitting.
 Handles incomplete (missing) data very well.
 Builds models that can be easily interpreted.
 Easy to implement.
 Deals with noise.

Disadvantages of C4.5:
 Small variations in the data can lead to different decision trees (especially when the variables are close to each other in value).
 Does not work very well on a small training set.

Comparison between ID3 and C4.5 algorithms:

K-Nearest Neighbour (KNN) Algorithm:

 K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.

 The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.

 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.

 The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.

Example:

How does K-NN work:

KNN Steps (Algorithm or procedure):

The K-NN working can be explained on the basis of the below algorithm:

 Step-1: Select the number K of neighbors.

 Step-2: Calculate the Euclidean distance from the new data point to the stored data points.

 Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

 Step-4: Among these K neighbors, count the number of data points in each category.

 Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
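The five steps map directly onto a few lines of NumPy. The sketch below (made-up two-dimensional points and labels, chosen only for illustration) computes Euclidean distances, picks the K nearest stored points, and assigns the majority class.

# A minimal K-NN sketch: distance to every stored point, take the K nearest,
# and assign the class with the most votes among them.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)    # step 2: Euclidean distances
    nearest = np.argsort(dists)[:k]                    # step 3: K nearest neighbours
    votes = Counter(y_train[nearest])                  # step 4: count per category
    return votes.most_common(1)[0][0]                  # step 5: majority category

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))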

Advantages of KNN Algorithm:

 It is simple to implement.
 It is robust to noisy training data.
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

 The value of K always needs to be determined, which may sometimes be complex.
 The computation cost is high because of calculating the distance between the new data point and all the training samples.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully:
Banking system
• KNN can be used in the banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.
Calculating credit ratings
• KNN algorithms can be used to find an individual's credit rating by comparing with persons having similar traits.
Politics
• With the help of KNN algorithms, we can classify a potential voter into classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", "Will Vote for Party 'BJP'".
• Other areas in which the KNN algorithm can be used are speech recognition, handwriting detection, image recognition, and video recognition.
