ML Module No. 02
Learning with Regression and Trees
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
More specifically, regression analysis helps us to understand how the value of the dependent variable changes corresponding to an independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose a marketing company does various advertisements every year and gets sales from them. Regression analysis helps in finding the correlation between these variables and enables us to predict the continuous output variable (sales) based on one or more predictor variables (the advertisements).
It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect relationship between variables. In regression, we plot a graph between the variables which best fits the given data points; using this plot, the machine learning model can make predictions about the data.
In simple words, "Regression shows a line or curve that passes through all the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum."
The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: The factors which affect the dependent variable, or which are used to predict the values of the dependent variable, are called independent variables, also called predictors.
Outliers: An outlier is an observation which contains either a very low value or a very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided.
Underfitting and Overfitting
Underfitting: the model is too simple to capture the underlying trend, so it performs poorly even on the training data.
Overfitting: the model fits the training data too closely, including its noise, so it works well on training data but poorly on unseen data.
By Mr. Sachin Balawant Takmare
ML Regression
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Some important types are:
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Support Vector Regression
5. Decision Tree Regression
6. Random Forest Regression
7. Ridge Regression
8. Lasso Regression
ML Linear Regression
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable. Mathematically, simple linear regression can be represented as y = a0 + a1x + ε, where a0 is the intercept, a1 is the regression coefficient, and ε is the random error. The values for the x and y variables are the training datasets used for the linear regression model representation.
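As a minimal sketch of how the a0 and a1 estimates are obtained, the least-squares formulas can be computed directly with NumPy. The advertisement/sales numbers below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical data: advertisement spend (x) vs. sales (y)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([25.0, 45.0, 65.0, 85.0, 105.0])

# Least-squares estimates of slope a1 and intercept a0 for y = a0 + a1*x
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(a0, a1)          # intercept 5.0 and slope 2.0 for this data
print(a0 + a1 * 60.0)  # predicted sales for a new x = 60 -> 125.0
```

The same fit could equally be obtained from a library such as scikit-learn's `LinearRegression`; the formulas above just make the estimation explicit.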
Types of Linear Regression
Linear regression can be further divided into two types of algorithm:
• Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called simple linear regression.
• Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called multiple linear regression.
Simple linear regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The key point is that the dependent variable must be a continuous/real value; the independent variable, however, can be measured on continuous or categorical values. It models the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
In simple linear regression, a single independent variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the multiple linear regression algorithm is used. Moreover, multiple linear regression is an extension of simple linear regression, as it takes more than one predictor variable to predict the response variable. We can define it as:
"Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable."
e.g. prediction of CO2 emission based on engine size and number of cylinders in a car.
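A multiple linear regression with the CO2 example's two predictors can be sketched as follows. The engine-size, cylinder, and CO2 values are made up for illustration; the coefficients are solved by ordinary least squares via NumPy:

```python
import numpy as np

# Hypothetical data: engine size (L) and number of cylinders -> CO2 emission
X = np.array([
    [1.5, 4],
    [2.0, 4],
    [2.5, 6],
    [3.0, 6],
    [3.5, 8],
], dtype=float)
y = np.array([150.0, 180.0, 220.0, 250.0, 290.0])

# Prepend a column of ones for the intercept and solve by least squares
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a0, a1, a2 = coeffs  # intercept, engine-size coefficient, cylinder coefficient

print(a0, a1, a2)
new_car = np.array([1.0, 2.2, 4.0])  # intercept term, engine size, cylinders
print(new_car @ coeffs)              # predicted CO2 emission for the new car
```

The only difference from simple linear regression is that the design matrix now has one column per predictor plus the intercept column.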
• Logistic regression is one of the most popular machine learning algorithms, which comes under the supervised learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
ML Logistic regression
• Logistic regression can be used to classify observations, for example whether a political candidate will win or lose an election, or whether a high school student will be admitted to a particular college.
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within the range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, and it cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function: σ(z) = 1 / (1 + e^(−z)).
• In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
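The sigmoid mapping and the threshold rule described above can be sketched in a few lines. The 0.5 threshold is the common default, an assumption rather than a fixed rule:

```python
import math

def sigmoid(z):
    """Map any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Apply the threshold to turn a probability into a 0/1 class label."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))      # 0.5, the midpoint of the S-curve
print(classify(2.0))   # 1 (sigmoid(2) is about 0.88, above the threshold)
print(classify(-2.0))  # 0 (sigmoid(-2) is about 0.12, below the threshold)
```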
On the basis of the categories, logistic regression can be classified into three types:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
• Example: A bank loan officer needs to analyze the data to learn which loan applicants are "safe" and which are "risky" for the bank before sanctioning a loan.
ML Classification
General approach
The general approach of classification is divided into two steps:
1. Learning step (training phase): a classification model is constructed by analyzing the training data.
2. Classification step: the model is used to predict class labels for new data, after estimating its accuracy on a test set.
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting the main tree.
• Pruning: Pruning is the process of removing the unwanted branches from the
tree.
• Parent/Child Node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
ML Decision Tree Classification
Example
• While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
▪ Information Gain
▪ Gini Index
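As an illustration of the first ASM technique, information gain can be computed as the parent's entropy minus the weighted entropy of the child subsets produced by a split. The labels below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# Hypothetical labels: a perfectly pure split recovers the full parent entropy
parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                           # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

The attribute whose split yields the highest information gain is chosen for the node.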
Rule Extraction from a Decision Tree
Generate classification rules for the given data by applying a decision tree classifier.
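One possible sketch of rule extraction: each root-to-leaf path of a tree yields one IF-THEN rule. The tree below is a small hand-built example; its attributes and values are illustrative, not the exercise's data:

```python
# A tiny hand-built decision tree (nested dicts); leaves are class labels.
# The attributes and values are hypothetical, for illustration only.
tree = {
    "attr": "age",
    "branches": {
        "youth": {"attr": "student",
                  "branches": {"yes": "buys=yes", "no": "buys=no"}},
        "middle": "buys=yes",
        "senior": {"attr": "credit",
                   "branches": {"fair": "buys=yes", "excellent": "buys=no"}},
    },
}

def extract_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path of the tree."""
    if isinstance(node, str):  # leaf: the accumulated conditions form a rule
        return [f"IF {' AND '.join(conditions)} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(extract_rules(child, conditions + (f"{node['attr']}={value}",)))
    return rules

for rule in extract_rules(tree):
    print(rule)  # e.g. IF age=youth AND student=yes THEN buys=yes
```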
A simple example from the stock market, involving only discrete ranges, has profit as the categorical class attribute with values (UP, DOWN); the training data is as follows:
First calculate the Gini Index of the attribute values.
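A sketch of that calculation: the Gini index of a set is 1 minus the sum of squared class proportions, and an attribute's split quality is the weighted Gini of its subsets (lower is better). The rows below are hypothetical stand-ins for the training data:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p^2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, attr, target):
    """Weighted Gini index after splitting the rows on one attribute."""
    n = len(rows)
    total = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        total += len(subset) / n * gini(subset)
    return total

# Hypothetical stock rows; "volume" is an invented attribute for illustration
rows = [
    {"volume": "high", "profit": "UP"},
    {"volume": "high", "profit": "UP"},
    {"volume": "low",  "profit": "DOWN"},
    {"volume": "low",  "profit": "UP"},
]
print(gini([r["profit"] for r in rows]))        # Gini of the whole set: 0.375
print(gini_of_split(rows, "volume", "profit"))  # weighted Gini of split: 0.25
```

The attribute with the lowest weighted Gini index is selected as the splitting attribute.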
Performance Metrics
• Positive tuples: tuples belonging to the positive class of the class attribute (in our last example, the positive tuples are those with buys_computer = yes).
• Negative tuples: tuples belonging to the negative class of the class attribute (in our last example, the negative tuples are those with buys_computer = no).
• Suppose we use our classifier on a test set of labeled tuples.
• P is the number of positive tuples and N is the number of negative
tuples.
• For each tuple, we compare the classifier’s class attribute prediction
with the tuple’s known class attribute value.
• True positives (TP): These refer to the positive tuples that were correctly
labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly
labeled by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly
labeled as positive (e.g., tuples of class buys_computer=no for which the
classifier predicted buys_computer=yes). Let FP be the number of false
positives.
• False negatives (FN): These are the positive tuples that were mislabeled as
negative (e.g., tuples of class buys_computer=yes for which the classifier
predicted buys_computer=no). Let FN be the number of false negatives.
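The four counts can be obtained by a straightforward comparison loop over the labeled test tuples. The labels below are hypothetical:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP, FN by comparing predictions with known labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1  # negative tuple incorrectly labeled as positive
        else:
            fn += 1  # positive tuple mislabeled as negative
    return tp, tn, fp, fn

# Hypothetical buys_computer labels for a small test set
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))  # (2, 2, 1, 1)
```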
ML Confusion Matrix
• The confusion matrix is a useful tool for analyzing how well your
classifier can recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right, while FP
and FN tell us when the classifier is getting things wrong.
• E.g. suppose in a dataset of customers who buy a computer there are 10000 tuples in total, out of which 7000 are positive and 3000 are negative, and our model has correctly predicted 6954 of the positives and 2588 of the negatives. Then TP = 6954, TN = 2588, FN = 7000 - 6954 = 46, and FP = 3000 - 2588 = 412, so the confusion matrix will be:

                   Predicted: yes    Predicted: no    Total
  Actual: yes      TP = 6954         FN = 46          7000
  Actual: no       FP = 412          TN = 2588        3000
• E.g. suppose in a cancer dataset there are 10000 tuples in total, out of which 300 are positive and 9700 are negative, and our model has correctly predicted 90 of the positives and 9560 of the negatives. Prepare the confusion matrix and find all the evaluation measures for it.
• We have a dataset where we are predicting the number of people who have more than Rs 1000 in their bank account. Consider a dataset with 200 observations, i.e. n = 200.
• Out of 200 cases, our classification model predicted "YES" 135 times and "NO" 65 times.
• Of the 135 "YES" predictions, 125 are actually YES (TP = 125) and 10 are actually NO (FP = 10).
• Of the 65 "NO" predictions, 60 are actually NO (TN = 60) and 5 are actually YES (FN = 5).
1. Accuracy: the proportion of tuples correctly classified. Accuracy = (TP + TN) / (P + N)
2. Error rate: the proportion of tuples misclassified. Error rate = (FP + FN) / (P + N) = 1 - Accuracy
3. Sensitivity (true positive rate, recall): ability to correctly label the positives as positive. Sensitivity = TP / P
4. Specificity (true negative rate): ability to correctly label the negatives as negative. Specificity = TN / N
5. Precision: the percentage of tuples labelled as positive that actually are positive. Precision = TP / (TP + FP)
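Using the counts from the buys_computer example (taking 6954 and 2588 as correctly predicted positives and negatives, so FP = 412 and FN = 46), the five measures can be computed as:

```python
def evaluation_measures(tp, tn, fp, fn):
    """Compute the five standard measures from the confusion-matrix counts."""
    p = tp + fn  # total actual positives
    n = tn + fp  # total actual negatives
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision":   tp / (tp + fp),
    }

# Counts from the buys_computer example: TP=6954, TN=2588, FP=412, FN=46
m = evaluation_measures(tp=6954, tn=2588, fp=412, fn=46)
for name, value in m.items():
    print(f"{name}: {value:.4f}")  # e.g. accuracy: 0.9542
```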
ROC Curves (Receiver Operating Characteristic curves)