ML Unit-2 Final
Linear separability:
• Two sets of data points in a two-dimensional space are said to be linearly separable when they can be completely separated by a single straight line.
• For example, consider the case of selling a house based on area and price.
• We have a number of data points for this, each labelled with the class: house Sold/Not Sold.
What is linearly separable and linearly non-separable data?
If you can draw a line or hyperplane that separates the points into two classes, then the data is linearly separable. If not, the data is termed linearly non-separable.
Boolean AND and OR are linearly separable problems, while XOR is not linearly separable.
• The most classic example of a linearly inseparable pattern is the logical exclusive-OR (XOR) function.
• Shown in the figure is an illustration of the XOR function: the two classes, 0 (red dots) and 1 (blue dots), cannot be separated by a single line.
• The patterns can, however, be classified correctly using two lines, L1 and L2; the code sketch below also demonstrates this separability difference.
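As a minimal sketch (assuming NumPy and scikit-learn are available, which the notes do not require), a single-layer perceptron reaches perfect training accuracy on the linearly separable AND and OR truth tables but cannot do so on XOR:

import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

for name, y in targets.items():
    clf = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X, y)
    # AND and OR reach accuracy 1.0; XOR cannot, no matter how long we train,
    # because no single line separates its two classes.
    print(name, "training accuracy:", clf.score(X, y))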
Decision Regions:
• When performing pattern recognition, a set of patterns can be represented in a pattern space, in which
each pattern is represented as a point at a particular set of coordinates.
• The decision regions are separated by surfaces called the decision boundaries
• A decision region is an area or volume, marked by cuts in the pattern space.
• All of the patterns within a usable decision region belong to the same class.
All feature vectors in a decision region are assigned to the same category.
The decision regions are often simply connected, but they can be multiply connected as well.
These separating surfaces represent points where there are ties between two or more categories.
Linear Discriminants:
Linear discriminant analysis (LDA), or Fisher's linear discriminant, is a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
It is mainly used to express one dependent variable as a linear combination of other features or
measurements.
LDA projects data from a D-dimensional feature space down to a D'-dimensional space (D > D') in a way that maximizes the variability between the classes while reducing the variability within the classes.
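As a brief illustrative sketch (assuming scikit-learn is installed; the Iris dataset is used here only as an example and is not part of the notes), LDA can project D = 4 features down to D' = 2 discriminant axes:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)             # projection maximizing class separation
print(X.shape, "->", X_proj.shape)           # (150, 4) -> (150, 2)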
Regression:
Regression is a supervised learning technique that supports finding the correlation among variables. A regression
problem is when the output variable is a real or continuous value.
In regression, we plot a graph between the variables that best fits the given data points. Using this, the machine learning model can deliver predictions regarding the data.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that
is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Because the relationship is linear, the model describes how the value of the dependent variable changes with the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept, a1 is the linear regression coefficient (slope), and ε is the random error. The values for the x and y variables are the training datasets used for the linear regression model representation.
A linear line showing the relationship between the dependent and independent variables is called a regression
line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, such a relationship is termed a positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, such a relationship is called a negative linear relationship.
When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to do this, we use a cost function.
Cost function:
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable
to the output variable. This mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) ∑ᵢ (yᵢ − (a0 + a1xᵢ))²
Where N is the total number of observations, yᵢ is the actual value of the i-th observation, and a0 + a1xᵢ is the corresponding predicted value.
Some applications of linear regression are:
o Sales Forecasting
o Risk Analysis
o Housing Applications To Predict the prices and other factors
o Finance Applications To Predict Stock prices, investment evaluation, etc.
• With a straight-line model of the form y = a + bx, the variance of y is assumed to be constant, and a and b are regression coefficients specifying the Y-intercept and slope of the line, respectively.
• These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight
line as the one that minimizes the error between the actual data and the estimate of the line.
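A minimal NumPy sketch of this least-squares fit (the (x, y) values below are made-up illustrative data, not from the notes), together with the MSE cost defined earlier:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])     # dependent variable

# least-squares estimates of slope (b) and intercept (a)
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

y_pred = a + b * x
mse = np.mean((y - y_pred) ** 2)             # Mean Squared Error cost
print("a =", round(a, 3), "b =", round(b, 3), "MSE =", round(mse, 4))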
Nonlinear Regression:
“How can we model data that does not show a linear dependence? For example, what if a given response
variable and predictor variable have a relationship that may be modeled by a polynomial function?”
• Polynomial regression is often of interest when there is just one predictor variable.
• By applying transformations to the variables, we can convert the nonlinear model into a linear one that can
then be solved by the method of least squares.
• That is, the addition of high-order terms like x², x³, and so on, which are simple functions of the single variable x, can be considered equivalent to adding new independent variables.
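A short sketch of this transformation trick (the data points are invented for illustration; NumPy is assumed): treating x, x², and x³ as separate independent variables turns the polynomial model into an ordinary linear least-squares problem:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.9, 9.1, 28.5, 65.0, 126.3])     # roughly cubic in x

# design matrix with columns [1, x, x^2, x^3]; the higher-order terms act as
# new independent variables in a linear least-squares fit
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients:", np.round(coeffs, 3))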
Example-2
• Table shows a set of paired data where x is the number of years of work experience of a college graduate
and y is the corresponding salary of the graduate.
• Predict the salary of a college graduate with, say, 10 years of experience: ___
Logistic Regression:
Logistic regression is a statistical method that is used for building machine learning models where the dependent
variable is dichotomous: i.e. binary. Logistic regression is used to describe data and the relationship between one
dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or
of interval type.
The name “logistic regression” is derived from the concept of the logistic function that it uses. The logistic function
is also known as the sigmoid function. The value of this logistic function lies between zero and one.
The following is an example of a logistic function we can use to find the probability of a vehicle breaking down,
depending on how many years it has been since it was serviced last.
Here is how you can interpret the results from the graph to decide whether the vehicle will break down or not.
Using the logistic regression algorithm, banks can predict whether a customer would default on loans or not
To predict the weather conditions of a certain place (sunny, windy, rainy, humid, etc.)
E-commerce companies can identify buyers who are likely to purchase a certain product
Companies can predict whether they will gain or lose money in the next quarter, year, or month based on their
current performance
To classify objects based on their features and attributes
Consider the following example: An organization wants to determine an employee’s salary increase based on their
performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by considering the
employee’s performance as the independent variable, and the salary increase as the dependent variable will make
their task easier.
Now, what if the organization wants to know whether an employee would get a promotion or not based on their
performance? The above linear graph won’t be suitable in this case. As such, we clip the line at zero and one, and
convert it into a sigmoid curve (S curve)
Based on the threshold values, the organization can decide whether an employee will get a salary increase or not.
The equation of the sigmoid function is:
f(x) = 1 / (1 + e^(−x))
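As a hedged sketch (scikit-learn is assumed, and the dataset of "years since last service" versus breakdown below is invented, not from the notes), logistic regression turns the linear score into a probability between zero and one via the sigmoid:

import numpy as np
from sklearn.linear_model import LogisticRegression

years = np.array([[0.5], [1], [1.5], [2], [3], [4], [5], [6], [7], [8]])
broke_down = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # made-up labels

model = LogisticRegression().fit(years, broke_down)
p = model.predict_proba([[10]])[0, 1]          # probability of breakdown at 10 years
print("P(breakdown | 10 years since service) =", round(p, 2))

# the same probability obtained by applying the sigmoid to the linear score
z = model.intercept_[0] + model.coef_[0, 0] * 10
print("sigmoid check:", round(1 / (1 + np.exp(-z)), 2))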
Differences between Linear and Logistic Regression:
o Linear Regression is used to predict a continuous dependent variable using a given set of independent variables, whereas Logistic Regression is used to predict a categorical dependent variable using a given set of independent variables.
o Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Linear Regression, we find the best-fit line, by which we can easily predict the output; in Logistic Regression, we find the S-curve by which we can classify the samples.
o The least squares estimation method is used to estimate the coefficients in Linear Regression, whereas the maximum likelihood estimation method is used in Logistic Regression.
o The output of Linear Regression must be a continuous value, such as price, age, etc., whereas the output of Logistic Regression must be a categorical value such as 0 or 1, Yes or No, etc.
o In Linear Regression, the relationship between the dependent and independent variables must be linear; in Logistic Regression, a linear relationship is not required.
Decision Tree Algorithm:
Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems.
o It is a tree-structured classifier
The tree has three types of nodes:
– Root node that has no incoming edges and zero or more outgoing edges.
– Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
– Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions
to separate records that have different characteristics.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o The below diagram explains the general structure of a decision tree.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final node is then called a leaf node.
Example:
A decision tree can be used to classify whether a person is Fit or Unfit. The decision nodes here are questions like 'Is the person less than 30 years of age?', 'Does the person eat junk food?', etc., and the leaves are one of the two possible outcomes, viz. Fit and Unfit. Looking at the decision tree, we can make decisions such as: if a person is less than 30 years of age and doesn't eat junk food, then he is Fit; if a person is less than 30 years of age and eats junk food, then he is Unfit; and so on. An illustrative code sketch of such a tree is shown below.
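This sketch is only illustrative (scikit-learn is assumed, and the age/junk-food records are made up, so the learned tree need not match the one in the notes):

from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [age, eats_junk]; labels: 1 = Fit, 0 = Unfit (hypothetical data)
X = [[25, 0], [28, 1], [22, 0], [35, 0], [40, 1], [45, 0], [50, 1], [29, 1]]
y = [1, 0, 1, 1, 0, 1, 0, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "eats_junk"]))
print(tree.predict([[26, 1]]))    # classify a 26-year-old who eats junk food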
i) ID3 Algorithm :
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly)
dichotomizes(divides) features into two or more groups at each step.
It is a classification algorithm that follows a greedy approach by selecting the attribute that yields the maximum Information Gain (IG), i.e., the minimum Entropy (H).
ID3 only works with discrete or nominal data, while C4.5 works with both discrete and continuous data.
The ID3 algorithm selects the best attribute based on the concepts of entropy and information gain for developing the tree. The C4.5 algorithm acts similarly to ID3 but improves on some of ID3's behaviours. The metric (or heuristic) used in C4.5 to measure impurity is the Gain Ratio.
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
where,
n is the total number of classes in the target column (in our case n = 2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target column” to the “total number
of rows” in the dataset.
Example:
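As a minimal sketch of how the Entropy(S) and Information Gain defined above can be computed (the toy (Outlook, Play) records below are invented and are not the table from the original worked example):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes in the target column
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rain", "Yes"), ("Rain", "Yes"), ("Rain", "No"),
        ("Overcast", "Yes"), ("Sunny", "Yes")]

labels = [play for _, play in data]
base = entropy(labels)

# information gain of splitting on the attribute "Outlook"
values = {outlook for outlook, _ in data}
remainder = sum(
    (len(subset) / len(data)) * entropy(subset)
    for subset in ([play for o, play in data if o == v] for v in values))
print("Entropy(S) =", round(base, 3), " Gain(Outlook) =", round(base - remainder, 3))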
Advantages of using ID3 :
ID3 works only with discrete or nominal data; C4.5, a successor of ID3, works with both discrete and continuous data.
The ID3 algorithm selects the best attribute based on the concepts of entropy and information gain for developing the tree.
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes
having a large number of values. For example, consider an attribute that acts as a unique identifier such as
product_ID. A split on product_ID would result in a large number of partitions (as many as there are values), each
one containing just one tuple. Because each partition is pure, the information required to classify data set D based
on this partitioning would be Info_product_ID(D) = 0. Therefore, the information gained by partitioning on this attribute
is maximal. Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome
this bias. It applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as:
SplitInfo_A(D) = − ∑ⱼ (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|) ; j = 1 to v
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
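A small self-contained sketch of SplitInfo and the gain ratio (the partition sizes and the information gain value are hypothetical, chosen to match the toy ID3 sketch above):

import math

def split_info(sizes):
    # SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the v partitions
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes)

info_gain = 0.266            # Gain(Outlook) from the toy ID3 sketch above
sizes = [3, 2, 3]            # |D1|, |D2|, |D3| for the v = 3 outcomes of Outlook
ratio = info_gain / split_info(sizes)    # GainRatio(A) = Gain(A) / SplitInfo_A(D)
print("SplitInfo =", round(split_info(sizes), 3), " GainRatio =", round(ratio, 3))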
In general terms, the C4.5 algorithm proceeds as follows:
1. Check for the base cases (listed below).
2. For each attribute A, find the normalized information gain ratio from splitting on A.
3. Let A_best be the attribute with the highest normalized information gain ratio.
4. Create a decision node that splits on A_best.
5. Recurse on the sublists obtained by splitting on A_best, and add those nodes as children of the current node.
The base cases are the following:
All the examples from the training set belong to the same class ( a tree leaf labeled with that class is
returned ).
The training set is empty ( returns a tree leaf called failure ).
The attribute list is empty ( returns a leaf labeled with the most frequent class or the disjunction of all the
classes).
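As a simplified, hedged sketch of this recursive procedure (using plain information gain rather than the full gain ratio, and an invented list-of-dicts dataset format with a 'label' key; none of this is prescribed by the notes):

import math
from collections import Counter

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def build_tree(rows, attributes):
    if not rows:
        return "failure"                             # base case: empty training set
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                             # base case: single class -> leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # base case: most frequent class

    def gain(a):                                     # information gain of attribute a
        vals = {r[a] for r in rows}
        remainder = sum((len(sub) / len(rows)) * entropy(sub)
                        for sub in ([r for r in rows if r[a] == v] for v in vals))
        return entropy(rows) - remainder

    best = max(attributes, key=gain)                 # attribute with highest gain
    node = {}
    for v in {r[best] for r in rows}:                # one child per value of best
        subset = [r for r in rows if r[best] == v]
        node[(best, v)] = build_tree(subset, [a for a in attributes if a != best])
    return node

rows = [
    {"Outlook": "Sunny", "Windy": "No", "label": "No"},
    {"Outlook": "Sunny", "Windy": "Yes", "label": "No"},
    {"Outlook": "Overcast", "Windy": "No", "label": "Yes"},
    {"Outlook": "Rain", "Windy": "No", "label": "Yes"},
    {"Outlook": "Rain", "Windy": "Yes", "label": "No"},
]
print(build_tree(rows, ["Outlook", "Windy"]))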
Advantages of C4.5:
Can use both categorical and continuous values
The algorithm inherently employs a single-pass pruning process to mitigate overfitting.
It can work with both Discrete and Continuous Data
C4.5 can handle the issue of incomplete data very well.
Builds models that can be easily interpreted
Easy to implement
Deals with noise
Disadvantages of C4.5:
Small variation in data can lead to different decision trees (especially when the variables are close to each
other in value)
Does not work very well on a small training set
Comparison between ID3 and C4.5 algorithms:
o ID3 selects the splitting attribute using information gain, whereas C4.5 uses the gain ratio, which normalizes information gain and reduces its bias toward attributes with many values.
o ID3 works only with discrete or nominal attributes, whereas C4.5 works with both discrete and continuous attributes.
o C4.5, the successor of ID3, additionally handles missing (incomplete) data and employs pruning to mitigate overfitting.

K-Nearest Neighbour (K-NN) Algorithm:
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
How does K-NN work:
The working of K-NN can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance between the new data point and each of the training data points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Advantages of K-NN:
It is simple to implement.
It is robust to the noisy training data
It can be more effective if the training data is large; a minimal classification sketch is shown below.
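A minimal sketch (scikit-learn is assumed; the 2-D points and category labels are made up for illustration) of assigning a new point to the category most common among its K nearest neighbours by Euclidean distance:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1.5, 2], [2, 1.8], [6, 6], [6.5, 7], [7, 6.2]]   # training points
y = ["A", "A", "A", "B", "B", "B"]                              # their categories

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[2, 2], [6.8, 6.5]]))    # new points -> ['A' 'B']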