Machine Learning
Supervised Learning - I
[Figure: Regression - interpolation vs. extrapolation]
Simple Linear Regression - Least Square Regression Line
• Least Squares method is used to determine the best-fitting line for the given data by reducing the
sum of the squares of the vertical deviations from each data point to the line.
• If a point lies exactly on the fitted line, its vertical deviation from the line is 0.
• Linear regression determines the straight line, known as the least-squares regression line or LSRL.
• Suppose y is a dependent variable and X is an independent variable; then the population
regression line is given by the equation:
Y = B0 + B1 X
• When a random sample of observations is given, the fitted regression line is expressed as:
ŷ = B0 + B1 X
• Where B0 is a constant (the intercept)
• B1 is the regression coefficient (the slope)
• X is the independent variable
• ŷ is known as the predicted value of the dependent variable.
Simple Linear Regression - Implementation
• Obtain the input dataset and identify independent and dependent features.
• Fit the regression line of the form: ŷ = B0 + B1 X
• Compute B0 and B1 using the least-squares estimates:
B1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and B0 = ȳ − B1 x̄
• Calculate the evaluation metrics to find whether the line obtained is best fit or not.
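A minimal sketch of these steps in Python follows; the data values and variable names are illustrative, not from the slides:

```python
import numpy as np

# Illustrative data: x is the independent feature, y the dependent feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: B1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean          # B0 = y_mean - B1 * x_mean

y_pred = b0 + b1 * x               # predicted values on the fitted line
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
```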
Regression - Performance Metrics
• The performance of a Regression model is reported as errors in the prediction.
• Following are the popular metrics that are used to evaluate the performance of Regression models.
• Mean Absolute Error: MAE is one of the simplest metrics; it measures the average absolute difference between
the actual and predicted values, where absolute means the difference is taken as positive.
MAE = (1/N) Σ |y − y'|
• y is the actual outcome, y' is the predicted outcome, and N is the total number of data points.
• Mean Squared Error: MSE measures the average of the squared differences between the values predicted by the model
and the actual values.
MSE = (1/N) Σ (y − y')²
• R2 Score: the R² score, also known as the Coefficient of Determination, is another popular metric used for
Regression model evaluation.
R² = 1 − (SSres / SStot), where SSres = Σ (y − y')² and SStot = Σ (y − ȳ)²
• The R-squared metric enables us to compare the model with a constant baseline to determine the performance of the
model.
• The constant baseline is obtained by taking the mean of the target values and drawing a horizontal line at that mean.
Regression - Performance Metrics
• Adjusted R2:
• Adjusted R2, as the name suggests, is the improved version of R2 error.
• R2 has the limitation that its score can increase as more terms are added to the model, even when the model is not
actually improving, which may mislead data scientists.
• To overcome this issue with R square, adjusted R squared is used; it is always lower than or equal to
R². This is because it penalises the addition of predictors and only increases when there is a real
improvement.
• We can calculate the adjusted R squared as follows:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)], where n is the number of observations and k is the number of predictors.
• Calculate R² = 1 − (SSres / SStot)
• Calculate the regression sums using SStot = Σ (y − ȳ)² and SSres = Σ (y − ŷ)²
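The metrics above can be computed directly; a minimal NumPy sketch (the function and argument names are illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """MAE, MSE, R^2 and adjusted R^2 for a model with k predictors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares (constant-mean baseline)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return mae, mse, r2, adj_r2

print(regression_metrics([3, 5, 7, 9], [2.8, 5.3, 6.9, 9.4], k=1))
```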
Multiple Linear Regression - Implementation
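The slide carries no further detail, so the following is only a hedged sketch of one common implementation of multiple linear regression, fitting the coefficients with a least-squares solve of the normal equations; the data and names are illustrative:

```python
import numpy as np

# Illustrative data: two independent features and one dependent variable.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 4.5, 11.0, 10.5, 15.0])

# Add a column of ones so the intercept B0 is estimated together with B1..Bk.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation: B = (X^T X)^-1 X^T y, solved with lstsq for numerical stability.
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_pred = X_design @ coeffs
print("intercept and coefficients:", coeffs)
```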
Gradient Descent
• The basic principle of gradient descent is to choose the step size (also called the learning rate)
appropriately so that we move steadily closer to the exact solution.
• Gradient descent stops when the step size is very close to zero or when the maximum number of steps we
want to perform has been reached.
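A minimal sketch of this stopping rule, using a simple one-dimensional objective chosen only for illustration:

```python
def gradient_descent(grad, w0, lr=0.1, tol=1e-6, max_steps=1000):
    """Iterate w <- w - lr * grad(w) until the step is ~0 or max_steps is reached."""
    w = w0
    for _ in range(max_steps):
        step = lr * grad(w)
        w -= step
        if abs(step) < tol:      # step size very close to zero -> stop
            break
    return w

# Example: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))   # converges to ~3.0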
Classification
• Information Gain calculates the reduction in the entropy and measures how well a given feature
separates or classifies the target classes.
• The feature with the highest Information Gain is selected as the best one.
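A small sketch of how entropy and Information Gain could be computed for a categorical feature; the toy labels and feature values are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """Reduction in entropy after splitting the labels by a feature."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy example: the feature separates the two classes perfectly -> gain = 1 bit.
print(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))
```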
Decision Tree – ID3
• Pros:
• Builds fastest tree
• Builds short tree
• Searches the whole dataset to build the decision tree
• Generates Multibranch tree
• Cons:
• Handles only Categorical data
• Cannot perform pruning
• Prioritizes attributes with more values
• Does not handle imbalanced data or missing values
Decision Tree – CART
• CART is an alternative decision tree building algorithm.
• It can handle both classification and regression tasks.
• It generates binary trees. This algorithm uses a new metric named gini index to create decision
points for classification tasks.
• The Gini index is the metric used for classification tasks in CART. It is based on the sum of the squared probabilities of
each class. We can formulate it as illustrated below.
Gini Index(class) = 1 − Σ (Pi)², for i = 1 to the number of classes
• Gini Index of a feature = weighted sum of the Gini Index of the subsets produced by that feature:
Gini Index(feature) = Σ (|Sv| / |S|) · Gini Index(Sv), summed over the subsets Sv induced by the feature's values
Decision Tree – CART
• Process:
1. Start: Begin with the full dataset.
2. Split: Find the best feature and split point to divide the data into two subsets.
3. Create Node: Make a node in the tree based on the chosen split, i.e., the split with the lowest Gini index.
4. Partition Data: Split the dataset into two subsets based on the chosen split.
• There are 2^n − 2 possible ways to form two partitions of the data,
• where n is the number of categories in the attribute.
5. Recursive Splitting: Repeat steps 2-4 for each subset until stopping criteria are met.
6. Stopping Criteria: Stop growing the tree when certain conditions are met, e.g., all instances in a node belong
to the same class.
7. Pruning (optional): Trim the tree to avoid overfitting.
8. Output: Use the resulting tree for making predictions on new data.
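A compact sketch of steps 2-4 for a single categorical feature, scoring candidate binary partitions by the weighted Gini index; the exhaustive subset search and the toy data are illustrative only and would not scale to attributes with many categories:

```python
from itertools import combinations
from collections import Counter

def gini(labels):
    """Gini index of one node: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try candidate ways of splitting the categories into two non-empty groups."""
    categories = sorted(set(values))
    best = (None, float("inf"))
    n = len(labels)
    # Enumerate candidate left-branch subsets (each split may be seen twice; harmless for a sketch).
    for r in range(1, len(categories)):
        for left_set in combinations(categories, r):
            left = [l for v, l in zip(values, labels) if v in left_set]
            right = [l for v, l in zip(values, labels) if v not in left_set]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            if weighted < best[1]:
                best = (set(left_set), weighted)
    return best

values = ["sunny", "sunny", "rain", "overcast", "rain"]
labels = ["no", "no", "yes", "yes", "yes"]
print(best_binary_split(values, labels))   # ({'sunny'}, 0.0) for this toy data
```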
Decision Tree – CART
• Pros:
• Handles both categorical and numerical values
• Handles Outliers
• Generates only binary trees
• Cons:
• Produces unstable decision trees
• Has a preference towards multivalued attributes
Naïve Bayes Algorithm
• Cons:
• Assumption of Independence
• Limited Expressiveness
• Requires Sufficient Training Data
• Cannot capture the correlation among features
KNN Algorithm
b. Manhattan distance: It is the sum of the absolute values of the differences between the coordinates of two points: d(x, y) = Σ |xi − yi|.
Types of Distance Metrics
c. Minkowski distance: It is a generalised distance between two points, d(x, y) = (Σ |xi − yi|^p)^(1/p). Based on the value of p,
it reduces to the Manhattan distance (when p = 1) or the Euclidean distance (when p = 2).
d. Hamming Distance: It is used for categorical variables. It counts the number of positions at which the corresponding
categorical values of two data points differ, i.e., whether each pair of values is the same or not.
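These distance metrics could be written as small functions; a sketch in NumPy (the sample vectors are illustrative):

```python
import numpy as np

def manhattan(a, b):
    """Sum of absolute differences (Minkowski with p = 1)."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def euclidean(a, b):
    """Square root of the sum of squared differences (Minkowski with p = 2)."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def minkowski(a, b, p):
    """General form: (sum |a_i - b_i|^p)^(1/p)."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)

def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

print(manhattan([1, 2], [4, 6]), euclidean([1, 2], [4, 6]),
      minkowski([1, 2], [4, 6], p=3), hamming(["red", "S"], ["red", "M"]))
```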
K-NN Algorithm
• Pros:
• k-NN is easy to understand and implement.
• No Training Phase is needed.
• Can handle categorical and numerical data.
• It has versatile distance measures.
• Incremental learning
• Cons:
• Slow and inefficient for large datasets, especially as the size of the training set grows.
• Choosing the right value of ‘k’ is difficult.
• Sensitive to imbalanced data.
Logistic Regression
Assumptions:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.
Cost Function:
• The cost function for logistic regression is called log loss (cross-entropy), which is derived from the maximum likelihood
estimation method: J = −(1/N) Σ [y log(ŷ) + (1 − y) log(1 − ŷ)].
Logistic Regression
Process using Gradient Descent:
1. Initialize parameters - Intercept and coefficient(s)
2. Substitute the values in the sigmoid function and obtain the predicted probabilities: ŷ = 1 / (1 + e^−(B0 + B1X))
3. Compute the log loss between the predicted probabilities and the actual labels
4. Compute the gradients of the loss with respect to the parameters
5. Update the parameters and repeat steps 2-4 until convergence
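A minimal sketch of this process for a single feature; the data, learning rate, and iteration count are illustrative:

```python
import numpy as np

# Illustrative data: one feature, binary labels.
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

b0, b1 = 0.0, 0.0                      # 1. initialise intercept and coefficient
lr = 0.1
for _ in range(5000):
    z = b0 + b1 * x
    y_hat = 1.0 / (1.0 + np.exp(-z))   # 2. sigmoid gives predicted probabilities
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # 3. log loss
    grad_b0 = np.mean(y_hat - y)       # 4. gradients of the log loss
    grad_b1 = np.mean((y_hat - y) * x)
    b0 -= lr * grad_b0                 # 5. update parameters
    b1 -= lr * grad_b1

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, final loss = {loss:.4f}")
```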
Multinomial Logistic Regression
Assumptions:
• The Dependent variable should be either nominal or ordinal variable.
• Set of one or more Independent variables can be continuous, ordinal or nominal.
• The Observations and dependent variables must be mutually exclusive and exhaustive.
• No Multicollinearity between Independent variables.
• There should be no Outliers in the data points.
Multinomial Logistic Regression
Process using Gradient Descent:
1. Initialize parameters - Intercept and coefficient(s)
2. Compute Logits: Calculate the logits for all samples in the batch for each class using the current model parameters (weights
and biases).
3. Compute Softmax Probabilities: Apply the softmax function to the logits to obtain the predicted probabilities for each class
for all samples in the batch.
4. Compute Loss: Calculate the cross-entropy loss between the predicted probabilities and the true class labels for all samples in
the batch.
5. Compute Gradients: Compute the gradients of the loss function with respect to the model parameters (weights and biases)
using the entire batch of samples. This involves taking the derivative of the loss function with respect to each parameter.
6. Update Parameters: Update the model parameters (weights and biases) using the gradients and the chosen optimization
algorithm (e.g., gradient descent, SGD, Adam).
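A minimal sketch of steps 1-6 for a small batch, i.e., softmax regression trained with plain gradient descent; the data, learning rate, and iteration count are illustrative:

```python
import numpy as np

# Illustrative batch: 4 samples, 2 features, 3 classes.
X = np.array([[0.2, 1.0], [1.5, 0.3], [2.0, 2.2], [0.1, 0.4]])
y = np.array([0, 1, 2, 0])                        # true class labels
n_samples, n_features = X.shape
n_classes = 3

W = np.zeros((n_features, n_classes))             # 1. initialise weights and biases
b = np.zeros(n_classes)
lr = 0.5

for _ in range(1000):
    logits = X @ W + b                             # 2. logits for every class
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)   # 3. softmax probabilities
    loss = -np.mean(np.log(probs[np.arange(n_samples), y]))  # 4. cross-entropy loss
    one_hot = np.eye(n_classes)[y]
    grad_logits = (probs - one_hot) / n_samples    # 5. gradients w.r.t. the parameters
    grad_W = X.T @ grad_logits
    grad_b = grad_logits.sum(axis=0)
    W -= lr * grad_W                               # 6. parameter update (plain gradient descent)
    b -= lr * grad_b

print("training loss:", round(loss, 4))
```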
Support Vector Machine
• Step 1: The SVM algorithm assigns numeric labels to the two classes: one class is identified as +1 while the other is identified as -1.
• Step 2: SVM classifier uses a loss function known as the hinge loss function to find the maximum margin.
• Step 3: There is a trade-off between maximizing margin and the loss generated if the margin is maximized to a very
large extent. A regularization parameter is used to handle this.
• Step 4: The weights are optimized by calculating the gradients.
• Step 5: When a point is classified correctly (no hinge loss), the gradients are computed from the regularization
term alone; when a misclassification (or margin violation) occurs, the gradient of the hinge loss is used as well.
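A minimal sketch of these steps for a linear SVM trained with (sub)gradient descent on the regularised hinge loss; the data, learning rate, and regularisation constant `lam` are illustrative:

```python
import numpy as np

# Illustrative 2-D data with labels encoded as +1 / -1 (step 1).
X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
lr, lam = 0.01, 0.01                          # learning rate and regularisation parameter (step 3)

for _ in range(2000):
    for xi, yi in zip(X, y):
        margin = yi * (np.dot(w, xi) + b)
        if margin >= 1:                       # correctly classified with margin (step 5):
            w -= lr * (2 * lam * w)           # update with the regularisation term only
        else:                                 # inside the margin / misclassified:
            w -= lr * (2 * lam * w - yi * xi) # hinge-loss gradient is also used
            b -= lr * (-yi)                   # (step 4: weights optimised from these gradients)

print("w =", w, "b =", b)
```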
Support Vector Machine
Steps:
• Compute the partial derivatives of L with respect to w and b.
Linear Kernel: K(x, y) = xᵀy
Polynomial Kernel: K(x, y) = (xᵀy + c)^d
Gaussian Kernel: K(x, y) = exp(−‖x − y‖² / (2σ²))
Exponential Kernel: K(x, y) = exp(−‖x − y‖ / (2σ²))
Laplacian Kernel: K(x, y) = exp(−‖x − y‖ / σ)
Hyperbolic (Sigmoid) Kernel: K(x, y) = tanh(κ xᵀy + c)
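These kernels can be written as small functions; a sketch with illustrative default parameter values (`c`, `degree`, `sigma`, `kappa` are assumptions, not values from the slides):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, degree=3):
    return (np.dot(x, y) + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def exponential_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) / (2 * sigma ** 2))

def laplacian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum(np.abs(np.asarray(x) - np.asarray(y))) / sigma)

def hyperbolic_kernel(x, y, kappa=0.1, c=-1.0):
    return np.tanh(kappa * np.dot(x, y) + c)

a, b = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b))
```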
Support Vector Machine
Advantages of SVM:
• Effective in high dimensional cases
• It is memory efficient, as it uses only a subset of the training points, called support vectors, in the decision function
• Different kernel functions can be specified for the decision function, and it is possible to specify custom
kernels
THANK YOU