DS Notes

Homoscedasticity

Homoscedasticity is one of the assumptions of linear regression, in which the
variance of the residuals is assumed to be constant. In simple words, the error
terms should show constant variance across the range of fitted values.
Confusion Matrix

● True Positive: you predicted positive, and it is true.

● True Negative: you predicted negative, and it is true.

● False Positive (Type 1 Error): you predicted positive, and it is false.

● False Negative (Type 2 Error): you predicted negative, and it is false.

● Accuracy: the proportion of all predictions that were correct.

● Positive Predictive Value or Precision: the proportion of positive class predictions that were actually correct.

● Negative Predictive Value: the proportion of negative class predictions that were actually correct.

● Sensitivity or Recall: the proportion of actual positive cases that were correctly identified.

● Specificity: the proportion of actual negative cases that were correctly identified.

● Rate: a measuring factor derived from the confusion matrix. There are four such rates: TPR, FPR, TNR, and FNR.

● Precision = TP / (TP + FP)

● Recall = TP / (TP + FN)
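The definitions above map directly onto a confusion matrix computed from predictions. Below is a minimal sketch, assuming scikit-learn is installed and using small hypothetical label arrays, that derives precision, recall, accuracy, and specificity from the matrix.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (hypothetical data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (hypothetical data)

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)
print(precision, recall, accuracy, specificity)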
F1 Score

We use the F1 score when we are trying to get the best precision and recall at
the same time. It is the harmonic mean of the two:
F1 = 2 · (Precision · Recall) / (Precision + Recall)

Area Under the ROC Curve (AUC – ROC)

The ROC curve is a graphical representation of the effectiveness of a binary
classification model. It plots the true positive rate (TPR) against the false
positive rate (FPR) at different classification thresholds.

TPR = Recall = Sensitivity
FPR = 1 - Specificity

So the ROC curve is the plot between sensitivity and (1 - specificity).

Since both TPR and FPR range between 0 and 1, the area under the curve always
lies between 0 and 1, and a greater AUC value denotes better model performance.

Basically, the ROC curve is a graph that shows the performance of a
classification model at all possible thresholds (a threshold is a particular
value beyond which you say a point belongs to a particular class). ROC is
nothing but the plot of TPR against FPR across all possible thresholds, and AUC
is the entire area beneath this ROC curve.
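A minimal sketch of how the ROC curve and AUC are obtained, assuming scikit-learn and hypothetical predicted probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # actual classes (hypothetical)
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, y_scores)                # area under the ROC curve
print(auc)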

Root Mean Squared Error (RMSE)

RMSE is the square root of the average of the squared differences between the
predicted and actual values: RMSE = √( Σ(yᵢ – ŷᵢ)² / n ). Lower values indicate
a better fit.

R-Squared/Adjusted R-Squared

○ R-squared measures the strength of the relationship between the dependent and
independent variables on a scale of 0–100%.
Adjusted R-Squared

The R-squared statistic isn't perfect. In fact, it suffers from a major flaw:
its value never decreases, no matter how many variables we add to our
regression model. The Adjusted R-squared takes into account the number of
independent variables used for predicting the target variable:

Adjusted R² = 1 – [ (1 – R²)(n – 1) / (n – k – 1) ]

● k: the number of features

● n: the number of data points in our dataset

● R²: the R-squared value determined by the model
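A minimal sketch of computing R-squared and Adjusted R-squared, assuming scikit-learn and hypothetical predictions from a model trained on k = 2 features:

from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.5, 10.0, 12.0]   # hypothetical target values
y_pred = [2.8, 5.3, 7.0, 9.8, 12.5]    # hypothetical predictions
n = len(y_true)                         # number of data points
k = 2                                   # number of features used (assumed)

r2 = r2_score(y_true, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adjusted_r2)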

Cross-validation
Cross-validation is a statistical method used to estimate the performance of
machine learning models. It is a method for assessing how the results of a
statistical analysis will generalize to an independent data set.

1. Hold Out method / train-test split method

In this technique, the whole dataset is randomly partitioned into a training set
and a validation set. Using a rule of thumb, nearly 70% of the whole dataset is
used as the training set and the remaining 30% is used as the validation set.

2. K-Fold Cross-Validation

In this technique, the whole dataset is partitioned into K parts of equal size.
Each partition is called a "Fold", so as we have K parts we call it K-Folds. One
fold is used as the validation set and the remaining K-1 folds are used as the
training set. The technique is repeated K times until each fold has been used as
the validation set and the remaining folds as the training set.

3. Stratified K-Fold Cross-Validation

Stratified K-Fold is an enhanced version of K-Fold cross-validation which is
mainly used for imbalanced datasets. Just like K-Fold, the whole dataset is
divided into K folds of equal size, but in this technique each fold has the same
ratio of instances of the target variable as the whole dataset.
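A minimal sketch of the three strategies above, assuming scikit-learn and hypothetical arrays X and y:

import numpy as np
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4)              # hypothetical feature matrix
y = np.random.randint(0, 2, size=100)   # hypothetical binary target

# 1. Hold-out: 70% train, 30% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()

# 2. K-Fold: 5 folds of equal size
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))

# 3. Stratified K-Fold: each fold keeps the class ratio of the whole dataset
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print(kfold_scores.mean(), strat_scores.mean())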

1. Performance Metrics for Classification

○ Accuracy

○ Confusion Matrix

○ Precision

○ Recall

○ F-Score

○ AUC(Area Under the Curve)-ROC

https://www.javatpoint.com/performance-metrics-in-machine-learning

feature encoding

Machine learning models can only work with numerical values. For this reason, it is

necessary to transform the categorical values of the relevant features into numerical

ones. This process is called feature encoding.

Categorical features are generally divided into 3 types:


A. Binary: either/or, e.g. Yes, No

B. Ordinal: specific ordered groups, e.g. low, medium, high

C. Nominal: unordered groups, e.g. cat, dog, tiger

1. Label Encoding, Ordinal Encoding

Label encoding means replacing the categories with digits from 1 to n (or 0 to
n-1, depending on the implementation), where n is the number of the variable's
distinct categories (its cardinality), and these numbers are assigned
arbitrarily.

2. One-Hot Encoding, Dummy Encoding


Label encoding implies a hierarchy among the categories, which can be misleading
for nominal features present in the data. To overcome this disadvantage, we can
use the One-Hot Encoding strategy.
One-hot encoding is processed in 2 steps:
1. Splitting the categories into different columns.

2. Putting '1' as an indicator in the appropriate column and '0' in the others.

Target Encoding

In target encoding, we calculate the mean of the target variable for each
category and replace the category variable with the mean value. In the
case of the categorical target variables, the posterior probability of the
target replaces each category.
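A minimal sketch of label, one-hot, and target encoding, assuming pandas and scikit-learn and a small hypothetical dataframe:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "size": ["low", "medium", "high", "low"],    # ordinal feature (hypothetical)
    "animal": ["cat", "dog", "tiger", "cat"],    # nominal feature (hypothetical)
    "target": [10, 20, 30, 14],
})

# Label / ordinal encoding: one arbitrary integer per category
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])

# One-hot / dummy encoding: one indicator column per category
df = pd.concat([df, pd.get_dummies(df["animal"], prefix="animal")], axis=1)

# Target encoding: replace each category with the mean of the target for that category
df["animal_target_enc"] = df.groupby("animal")["target"].transform("mean")

print(df)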
Feature Engineering: Scaling, Normalization, and Standardization
These techniques can help to improve model performance, reduce the
impact of outliers, and ensure that the data is on the same scale.

What is Feature Scaling?


Feature scaling is a data preprocessing technique used to transform the

values of features or variables in a dataset to a similar scale. The purpose

is to ensure that all features contribute equally to the model and to avoid

the domination of features with larger values.

1. Gradient Descent Based Algorithms


Machine learning algorithms like linear regression, logistic regression,

neural network, PCA (principal component analysis), etc., that use gradient

descent as an optimization technique require data to be scaled.

2. Distance-Based Algorithms
Distance algorithms like KNN, K-means clustering, and SVM(support vector

machines) are most affected by the range of features. This is because,

behind the scenes, they are using distances between data points to

determine their similarity.

TYPE OF SCALING:
1) Normalization
Normalization is a scaling technique in which values are shifted and
rescaled so that they end up ranging between 0 and 1. It is also known as
Min-Max scaling.

2) Standardization: Standardization scaling is also known as Z-score
normalization.
Standardization is another scaling method where the values are centered
around the mean with a unit standard deviation. This means that the mean
of the attribute becomes zero, and the resultant distribution has a unit
standard deviation.

The Big Question – Normalize or Standardize?


Normalization:
● Rescales values to a range between 0 and 1
● Useful when the distribution of the data is unknown or not Gaussian
● Sensitive to outliers
● Retains the shape of the original distribution
● May not preserve the relationships between the data points
● Equation: (x – min) / (max – min)

Standardization:
● Centers data around the mean and scales to a standard deviation of 1
● Useful when the distribution of the data is Gaussian or unknown
● Less sensitive to outliers
● Changes the shape of the original distribution
● Preserves the relationships between the data points
● Equation: (x – mean) / standard deviation
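A minimal sketch contrasting the two, assuming scikit-learn and a small hypothetical feature array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [100.0]])   # hypothetical feature values

X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min), range [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # (x - mean) / standard deviation

print(X_minmax.ravel())
print(X_zscore.ravel())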
Bias-Variance Trade Off
https://www.geeksforgeeks.org/ml-bias-variance-trade-off/

What is Bias?

Bias is the difference between the values predicted by the machine learning
model and the correct values.

High bias gives a large error on training as well as testing data. It is
recommended that an algorithm always be low-biased to avoid the problem of
underfitting. With high bias, the predicted data takes a straight-line form and
thus does not fit the data in the data set accurately. Such fitting is known as
underfitting of the data. This happens when the hypothesis is too simple or
linear in nature.

What is Variance?

The variability of model prediction for a given data point, which tells us the
spread of our predictions, is called the variance of the model. A model with
high variance has a very complex fit to the training data and thus is not able
to fit accurately on data it hasn't seen before. As a result, such models
perform very well on training data but have high error rates on test data. When
a model has high variance, it is said to be overfitting the data.


Bias Variance Tradeoff

High Bias = underfitting


High Variance = overfitting

When the bias is high, the error in both the test and training sets is high.
When the variance is high, the model performs well on our training set and
gives a low error, but the error on the test set is very high. In between,
there is a region where bias and variance are in balance with each other and
both training and testing errors are low.

If the algorithm is too simple (a hypothesis with a linear equation) then it may
be in a high-bias, low-variance condition and thus be error-prone. If the
algorithm fits too complex a hypothesis (one with a high-degree equation) then
it may be in a high-variance, low-bias condition, and in that case it will not
perform well on new entries. There is something between both of these
conditions, known as the Trade-off, or Bias-Variance Trade-off. This tradeoff in
complexity is why there is a tradeoff between bias and variance: an algorithm
can't be more complex and less complex at the same time.

We try to optimize the total error of the model by using the Bias-Variance
Tradeoff: Total Error = Bias² + Variance + Irreducible Error.

https://www.analyticsvidhya.com/blog/2022/08/regularization-in-machine-learning/

Regularization in Machine Learning


When training a machine learning model, the model can easily become overfitted
or underfitted. To avoid this, we use regularization in machine learning to
properly fit the model so that it generalizes well to unseen data.

1. Lasso Regularization – L1 Regularization


2. Ridge Regularization – L2 Regularization

3. Elastic Net Regularization – L1 and L2 Regularization

Ridge Regression

○ Ridge regression is one of the types of linear regression in which a small
amount of bias is introduced so that we can get better long-term predictions.

○ Ridge regression is a regularization technique which is used to reduce the
complexity of the model. It is also called L2 regularization.

○ In this technique, the cost function is altered by adding a penalty term to
it. The amount of bias added to the model is called the Ridge Regression
penalty. It is calculated by multiplying lambda (λ) by the squared weight of
each individual feature.

○ The equation for the cost function in ridge regression will be:

Cost = Σ(yᵢ – ŷᵢ)² + λ Σ βⱼ²

○ As we can see from the above equation, if the value of λ tends to zero, the
equation becomes the cost function of the linear regression model. Hence, for
the minimum value of λ, the model will resemble the linear regression model.

Lasso Regression:

○ Lasso regression is another regularization technique to reduce the complexity
of the model. It stands for Least Absolute Shrinkage and Selection Operator.

○ It is similar to Ridge Regression except that the penalty term contains only
the absolute weights instead of the square of the weights.

○ Since it takes absolute values, it can shrink the slope all the way to 0,
whereas Ridge Regression can only shrink it close to 0.

○ It is also called L1 regularization. The equation for the cost function of
Lasso regression will be:

Cost = Σ(yᵢ – ŷᵢ)² + λ Σ |βⱼ|

○ Some of the features in this technique are completely neglected for model
evaluation.

○ Hence, Lasso regression can help us to reduce overfitting in the model as
well as perform feature selection.

Key Difference between Ridge Regression and Lasso


Regression

○ Ridge regression is mostly used to reduce the overfitting in the model, and it
includes all the features present in the model. It reduces the complexity of the
model by shrinking the coefficients.

○ Lasso regression helps to reduce overfitting in the model and also performs
feature selection.
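A minimal sketch contrasting the two penalties, assuming scikit-learn and hypothetical data (scikit-learn calls the λ parameter alpha):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

X = np.random.rand(100, 5)                                      # hypothetical features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + np.random.randn(100) * 0.1    # hypothetical target

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can shrink some coefficients exactly to 0

print(ridge.coef_)   # all features keep non-zero weights
print(lasso.coef_)   # irrelevant features may get exactly zero weight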

Assumptions of Linear Regression

○ Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and
independent variables.

○ Small or no multicollinearity between the features:
The independent variables should not be highly correlated with each other.

○ Homoscedasticity assumption:
The error terms must have constant variance.

○ Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution.

Yi = β0 + β1Xi

What is the best fit line?


In simple terms, the best fit line is a line that fits the given scatter plot in the

best way. Mathematically, the best fit line is obtained by minimizing the

Residual Sum of Squares(RSS).

Cost Function for Linear Regression: in linear regression, the Mean Squared
Error (MSE) cost function is generally used:

MSE = (1/n) Σ (yᵢ – ŷᵢ)²

Evaluation Metrics for Linear Regression


Coefficient of Determination or R-Squared (R2)
R-Squared is a number that explains the amount of variation that is

explained/captured by the developed model. It always ranges between 0 &

1 . Overall, the higher the value of R-squared, the better the model fits the

data.

Mathematically it can be represented as:

R² = 1 – (RSS / TSS)

where RSS is the Residual Sum of Squares and TSS is the Total Sum of Squares.
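A minimal sketch of fitting the best-fit line and evaluating it with MSE and R-squared, assuming scikit-learn and hypothetical data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1], [2], [3], [4], [5]], dtype=float)   # hypothetical feature
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])                # hypothetical target

model = LinearRegression().fit(X, y)    # finds the best-fit line by minimizing RSS
y_pred = model.predict(X)

print(model.coef_, model.intercept_)    # slope (β1) and intercept (β0)
print(mean_squared_error(y, y_pred))    # MSE cost
print(r2_score(y, y_pred))              # R-squared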
Overfitting
There are several ways to prevent overfitting, which are stated below:

● Cross-validation

● If the training data is too small to train on, add more relevant and clean
data.

● If the training data is too large, do some feature selection and remove

unnecessary features.

● Regularization

Underfitting:
● Increase the model complexity

● Increase the number of features in the training data

● Remove noise from the data

Gradient Descent Algorithm

For y = mx + c, if we plot m and c against MSE, the cost surface acquires a bowl
shape.

For some combination of m and c, we will get the least error (MSE). That
combination of m and c gives us our best-fit line.

The algorithm starts with some values of m and c (usually m = 0, c = 0). We
calculate the MSE (cost) at m = 0, c = 0. Let's say the MSE (cost) at m = 0,
c = 0 is 100. Then we adjust the values of m and c by some amount (the learning
step) and notice a decrease in the MSE (cost). We continue doing the same until
our loss function is a very small value or ideally 0 (which means 0 error, or
100% accuracy).

Step by Step Algorithm:


1. Let m = 0 and c = 0. Let L be our learning rate. It could be a small value
like 0.01 for good accuracy.

2. Calculate the partial derivative of the cost function with respect to m. Let
the partial derivative of the cost function with respect to m be Dm (how much
the cost function changes with a little change in m):

Dm = (-2/n) Σ xᵢ (yᵢ – ŷᵢ)

Similarly, let the partial derivative of the cost function with respect to c be
Dc (how much the cost function changes with a little change in c):

Dc = (-2/n) Σ (yᵢ – ŷᵢ)

3. Now update the current values of m and c using the following equations:

m = m – L · Dm
c = c – L · Dc

4. We will repeat this process until our cost function is very small (ideally
0).

● If the slope is +ve: θj = θj – (+ve value). Hence the value of θj decreases.

● If the slope is -ve: θj = θj – (-ve value). Hence the value of θj increases.

How To Choose Learning Rate

The choice of the correct learning rate is very important, as it ensures that
gradient descent converges in a reasonable time:

● If we choose α to be very large, gradient descent can overshoot the minimum.
It may fail to converge or even diverge.

● If we choose α to be very small, gradient descent will take small steps to
reach the local minimum and will take a longer time to converge.
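A minimal sketch of the steps above for fitting y = mx + c with the MSE cost, assuming NumPy and hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical inputs
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9])   # hypothetical targets

m, c = 0.0, 0.0        # step 1: start at m = 0, c = 0
L = 0.01               # learning rate
n = len(x)

for _ in range(1000):                            # step 4: repeat until the cost is small
    y_pred = m * x + c
    D_m = (-2 / n) * np.sum(x * (y - y_pred))    # step 2: partial derivative w.r.t. m
    D_c = (-2 / n) * np.sum(y - y_pred)          # step 2: partial derivative w.r.t. c
    m = m - L * D_m                              # step 3: update m
    c = c - L * D_c                              # step 3: update c

print(m, c)   # approximate slope and intercept of the best-fit line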

Logistic Regression
○ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be
either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value
as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
○ In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).

○ Odds: the ratio of something occurring to something not occurring. It is
different from probability, as probability is the ratio of something occurring
to everything that could possibly occur.

Odds = p(x) / (1 – p(x))

○ Log-odds: the log-odds, also known as the logit function, is the natural
logarithm of the odds. In logistic regression, the log-odds of the dependent
variable are modeled as a linear combination of the independent variables and
the intercept:

log( p(x) / (1 – p(x)) ) = z

Applying steps in logistic regression modeling:


The following are the steps involved in logistic regression modeling:

● Define the problem: Identify the dependent variable and

independent variables and determine if the problem is a binary

classification problem.

● Data preparation: Clean and preprocess the data, and make sure

the data is suitable for logistic regression modeling.

● Exploratory Data Analysis (EDA): Visualize the relationships

between the dependent and independent variables, and identify any

outliers or anomalies in the data.


● Feature Selection: Choose the independent variables that have a

significant relationship with the dependent variable, and remove

any redundant or irrelevant features.

● Model Building: Train the logistic regression model on the selected

independent variables and estimate the coefficients of the model.

● Model Evaluation: Evaluate the performance of the logistic

regression model using appropriate metrics such as accuracy,

precision, recall, F1-score, or AUC-ROC.

● Model improvement: Based on the results of the evaluation, fine-

tune the model by adjusting the independent variables, adding new

features, or using regularization techniques to reduce overfitting.

● Model Deployment: Deploy the logistic regression model in a real-

world scenario and make predictions on new data.


Note: Logistic regression uses the same concept of predictive modeling as
regression, which is why it is called logistic regression; but it is used to
classify samples, and therefore it falls under classification algorithms.
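A minimal sketch covering the model building and evaluation steps listed above, assuming scikit-learn and a hypothetical binary dataset:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

X = np.random.rand(200, 3)                  # hypothetical independent variables
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # hypothetical binary dependent variable

# Data preparation: split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Model building
model = LogisticRegression().fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred),
      roc_auc_score(y_test, y_prob))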

Logistic Function (Sigmoid Function):

○ The sigmoid function is a mathematical function used to map the predicted


values to probabilities.

○ It maps any real value into another value within a range of 0 and 1.

○ The output value of logistic regression must lie between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the sigmoid function or the logistic function.

○ In logistic regression, we use the concept of a threshold value, which decides
between the classes 0 and 1: values above the threshold tend to 1, and values
below the threshold tend to 0.

Assumptions for Logistic Regression:

○ The dependent variable must be categorical in nature.

○ The independent variable should not have multi-collinearity.

Logistic Regression Equation:


Now we use the sigmoid function, where the input is z, to find the probability
between 0 and 1 (i.e. the predicted y):

z = mx + c
p = σ(z) = 1 / (1 + e^(-z))
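A minimal sketch of the sigmoid mapping from the linear score z = mx + c to a probability, plus the threshold rule, assuming NumPy and hypothetical coefficients:

import numpy as np

def sigmoid(z):
    # Map any real value z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

m, c = 2.0, -1.0                  # hypothetical learned coefficients
x = np.array([-2.0, 0.0, 0.5, 3.0])
z = m * x + c                     # linear combination
p = sigmoid(z)                    # predicted probabilities

threshold = 0.5
predicted_class = (p >= threshold).astype(int)   # values above the threshold -> class 1
print(p, predicted_class)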
1. Low Precision/High Recall: In applications where we want to

reduce the number of false negatives without necessarily reducing

the number of false positives, we choose a decision value that has a

low value of Precision or a high value of Recall. For example, in a

cancer diagnosis application, we do not want any affected patient to

be classified as not affected without giving much heed to if the

patient is being wrongfully diagnosed with cancer. This is because

the absence of cancer can be detected by further medical tests

but the presence of the disease cannot be detected in an already

rejected candidate.

2. High Precision/Low Recall: In applications where we want to

reduce the number of false positives without necessarily reducing

the number of false negatives, we choose a decision value that has

a high value of Precision or a low value of Recall. For example, if we

are classifying customers whether they will react positively or

negatively to a personalized advertisement, we want to be

absolutely sure that the customer will react positively to the


advertisement because otherwise, a negative reaction can cause a

loss of potential sales from the customer.

Decision Tree

Assumptions while creating Decision Tree

1. In the beginning, the whole training set is considered as the root.

2. Feature values are preferred to be categorical. If the values are continuous,
they are discretized prior to building the model.

3. Records are distributed recursively on the basis of attribute values.

4. The order of placing attributes as the root or internal nodes of the tree is
determined by using some statistical approach.

Attribute Selection Measures

1. Gini Index
2. Information Gain(ID3)

Entropy
Entropy is a measure of the randomness in the information being processed. For a
set with class probabilities pᵢ, Entropy = −Σ pᵢ log₂(pᵢ).
Information Gain
We can define information gain as a measure of how much

information a feature provides about a class. Information gain helps

to determine the order of attributes in the nodes of a decision tree.

Information Gain is calculated for a split by subtracting the

weighted entropies of each branch from the original entropy.

Gain=Eparent − Echildren

Entropy values can fall between 0 and 1. If all samples in data set, S, belong to one

class, then entropy will equal zero. If half of the samples are classified as one class

and the other half are in another class, entropy will be at its highest at 1.

https://www.section.io/engineering-education/entropy-information-gain-machine-learning/
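A minimal sketch of computing entropy and the information gain of a split, assuming NumPy and hypothetical class labels:

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # hypothetical parent node labels
left = np.array([1, 1, 1, 1])                  # left branch after a split
right = np.array([0, 0, 0, 0])                 # right branch after a split

# Information gain = parent entropy - weighted entropy of the children
n = len(parent)
gain = entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
print(entropy(parent), gain)   # parent entropy is 1.0 (50/50 classes); gain is 1.0 (pure children)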
Gini index

The Gini Index is a measure of the inequality or impurity of a distribution.


● It means an attribute with a lower Gini index should be preferred.

● It ranges from 0 to 1, where 0 represents perfect equality (all values

are the same) and 1 represents perfect inequality (all values are

different).

● A lower Gini Index indicates a more homogeneous or pure distribution,

while a higher Gini Index indicates a more heterogeneous or impure

distribution.

● In decision trees, the Gini Index is used to evaluate the quality of a split

by measuring the difference between the impurity of the parent node

and the weighted impurity of the child nodes.

Gini Index = 1 – Σ pᵢ², where pᵢ is the probability of an object being
classified into a particular class.
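A minimal sketch of computing the Gini index of a node from its class labels, assuming NumPy and hypothetical labels:

import numpy as np

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index([1, 1, 1, 1]))   # pure node -> 0.0
print(gini_index([1, 1, 0, 0]))   # 50/50 binary split -> 0.5 (maximum for two classes)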

Hyperparameter tuning for the decision tree


● criterion (“gini” or “entropy”) – the function (“gini” or
“entropy”) used to calculate the uncertainty on the
discrimination rule selected.
● splitter (“best” or “random”) – the strategy used to choose the
feature on which to create a discrimination rule. The default
value is “best”. That is, for each node, the algorithm takes into
account all features and chooses the one with the best
criterion. If “random”, a random feature will be taken. The
“random” technique is used to avoid overfitting
● max_depth (integer) – the maximum tree depth.
● min_samples_leaf (integer) – The minimum number of
samples required to be in a leaf. A leaf will not be allowed to
have a number of samples lower than this value. Ideal to
overcome overfitting.
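A minimal sketch of a decision tree using the hyperparameters described above, assuming scikit-learn and hypothetical data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 4)        # hypothetical features
y = (X[:, 0] > 0.5).astype(int)   # hypothetical binary target

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    splitter="best",        # or "random" to help avoid overfitting
    max_depth=4,            # maximum tree depth
    min_samples_leaf=5,     # minimum samples required in a leaf
    random_state=42,
)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))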

Grid Search*********
Advantages of Decision Tree

1. A decision tree model is very interpretable and can be easily represented

to senior management and stakeholders.

2. Preprocessing of data such as normalization and scaling is not required

which reduces the effort in building a model.

3. A decision tree algorithm can handle both categorical and numeric data
and is more efficient compared to other algorithms.

4. Any missing value present in the data does not affect a decision tree which

is why it is considered a flexible algorithm.

Disadvantages of Decision Tree

1. A decision tree works badly when it comes to regression as it fails to

perform if the data have too much variation.

2. A decision tree is sometimes unstable and cannot be relied on, as an
alteration in the data can cause the decision tree to take on a bad structure,
which may affect the accuracy of the model.

3. If the data are not properly discretized, then a decision tree algorithm can

give inaccurate results and will perform badly compared to other

algorithms.

4. Complexities arise in calculation if the outcomes are linked and it may

consume time while training a model.


****k-NN and Random Forest algorithms can also support missing values. The k-NN
algorithm handles missing values by taking the majority of the K nearest
values. Unfortunately, the scikit-learn implementations of k-NN and
RandomForest do not support the presence of missing values.***

RANDOM FOREST

Random Forest is a classifier that contains a number of decision trees on various


subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset.

The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
Assumptions for Random Forest

○ There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.

○ The predictions from each tree must have very low correlations.

Why use Random Forest?

○ It takes less training time as compared to other algorithms.

○ It predicts output with high accuracy, even for the large dataset it runs
efficiently.

○ It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
Advantages of Random Forest

○ Random Forest is capable of performing both Classification and Regression


tasks.

○ It is capable of handling large datasets with high dimensionality.

○ It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

○ Although random forest can be used for both classification and regression
tasks, it is not as suitable for regression tasks.

Random Forest Hyperparameters


1. n_estimators

2. max_features

3. max_depth

4. max_sample

Apart from the features, we have a large training dataset. max_sample determines
how much of the dataset is given to each individual tree.
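A minimal sketch of a random forest using the hyperparameters listed above, assuming scikit-learn and hypothetical data (scikit-learn spells the last parameter max_samples):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(300, 6)                # hypothetical features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # hypothetical binary target

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features="sqrt",  # features considered at each split
    max_depth=6,          # maximum depth of each tree
    max_samples=0.8,      # fraction of the dataset given to each tree
    random_state=42,
)
forest.fit(X, y)
print(forest.score(X, y))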
GridSearchCV
It runs through all the different parameters that are fed into the parameter
grid and produces the best combination of parameters, based on a scoring metric
of your choice (accuracy, f1, etc.).

1. estimator – A scikit-learn model

2. param_grid – A dictionary with parameter names as keys and lists of

parameter values.

3. scoring – The performance measure. For example, ‘r2’ for regression

models, ‘precision’ for classification models.

4. cv – An integer that is the number of folds for K-fold cross-validation.

grid = GridSearchCV(estimator, param_grid, scoring=scoring, cv=cv)

grid.fit(X_train, y_train)

print(grid.best_params_)

print(grid.best_score_)

Thus, grid.best_params_ gives the best combination of tuned

hyperparameters
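A minimal end-to-end sketch of GridSearchCV tuning the random forest hyperparameters discussed above, assuming scikit-learn and hypothetical data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.rand(300, 5)        # hypothetical features
y = (X[:, 0] > 0.5).astype(int)   # hypothetical binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {                    # parameter names as keys, lists of values as values
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",   # performance measure
    cv=5,                 # 5-fold cross-validation
)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best combination of tuned hyperparameters
print(grid.best_score_)    # mean cross-validated score of the best combination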
