DS Notes
● True Positive (TP): You predicted positive, and the positive case is correctly identified.
● True Negative (TN): You predicted negative, and the negative case is correctly identified.
● False Positive (Type 1 Error): You predicted positive, but it is actually negative.
● False Negative (Type 2 Error): You predicted negative, but it is actually positive.
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
F1 Score
We use the F1 score when we want the best precision and recall at the same time; it is the harmonic mean of the two:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
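A minimal sketch computing these three metrics from raw confusion-matrix counts; the counts (tp=40, fp=10, fn=20) are invented purely for illustration:

# Precision, recall, and F1 from confusion-matrix counts
tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)          # 40 / 50 = 0.80
recall = tp / (tp + fn)             # 40 / 60 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.73
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")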
R-Squared/Adjusted R-Squared
The R-squared statistic isn't perfect. In fact, it suffers from a major flaw: its value never decreases, no matter how many variables we add to our regression model.
The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable:
Adjusted R2 = 1 – [(1 – R2)(n – 1) / (n – k – 1)]
● k: number of features
● n: number of data points in our dataset
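A minimal sketch of this formula; the r2, n, and k values are invented to show how extra features pull the adjusted value down:

# Adjusted R-squared: penalizes adding features that don't help
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """r2: plain R-squared, n: number of data points, k: number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.85, n=100, k=5))   # ~0.842
print(adjusted_r2(0.85, n=100, k=30))  # ~0.785, same R2 but more features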
Cross-validation
Cross-validation is a statistical method used to estimate the performance of machine learning models on data they have not seen during training.
1. Hold-Out Validation
In this technique, the whole dataset is partitioned into a training set and a validation set. Using a rule of thumb, nearly 70% of the whole dataset is used as the training set and the remaining 30% as the validation set.
2. K-Fold Cross-Validation
In K-Fold cross-validation, the whole dataset is partitioned into K equal parts, which we call folds. One fold is used as the validation set and the remaining K − 1 folds as the training set. The technique is repeated K times, until each fold has been used as a validation set exactly once.
3. Stratified K-Fold Cross-Validation
This technique is mainly used for imbalanced datasets. Just like K-Fold, the whole dataset is divided into K folds, but in this technique each fold has the same ratio of instances of each target class as the complete dataset.
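A minimal sketch of all three schemes with scikit-learn; the breast-cancer dataset and the logistic-regression model are arbitrary illustrative choices:

# Hold-out, K-Fold, and Stratified K-Fold cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 1. Hold-out: ~70% train, ~30% validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print(model.fit(X_tr, y_tr).score(X_val, y_val))

# 2. K-Fold: each of the 5 folds serves once as the validation set
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)))

# 3. Stratified K-Fold: each fold keeps the class ratio of the full dataset
print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)))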
Common performance metrics:
○ Accuracy
○ Confusion Matrix
○ Precision
○ Recall
○ F-Score
https://www.javatpoint.com/performance-metrics-in-machine-learning
Feature Encoding
Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones.
Label Encoding
Label encoding replaces the categories with integers from 1 to n (or 0 to n − 1, depending on the implementation), where n is the number of the variable's distinct categories (the cardinality), and these numbers are assigned arbitrarily.
One-Hot Encoding
1. Create a separate indicator column for each distinct category.
2. Put '0' for others and '1' as an indicator in the appropriate column.
Target Encoding
In target encoding, we calculate the mean of the target variable for each
category and replace the category variable with the mean value. In the
case of the categorical target variables, the posterior probability of the
target replaces each category.
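A minimal sketch of label, one-hot, and target encoding using pandas; the tiny city/price frame is invented purely for illustration:

# Three encodings of the same categorical column
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA"],
                   "price": [10, 20, 14, 30, 22]})

# Label encoding: arbitrary integer per category (0 to n-1 here)
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, 1 marks the match
one_hot = pd.get_dummies(df["city"], prefix="city")

# Target encoding: replace each category with the mean of the target
df["city_target"] = df["city"].map(df.groupby("city")["price"].mean())

print(pd.concat([df, one_hot], axis=1))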
Feature Engineering: Scaling, Normalization, and Standardization
These techniques can help to improve model performance, reduce the impact of outliers, and ensure that the data is on the same scale. The goal is to ensure that all features contribute equally to the model and to avoid any single feature dominating because of its larger magnitude.
1. Gradient Descent Based Algorithms
Machine learning algorithms like linear regression, logistic regression, neural networks, PCA (principal component analysis), etc., that use gradient descent as an optimization technique require the data to be scaled.
2. Distance-Based Algorithms
Distance-based algorithms like KNN, K-Means clustering, and SVM (support vector machines) are most affected by the range of features because, behind the scenes, they use distances between data points to determine similarity.
Types of Scaling:
1) Normalization
Normalization is a scaling technique in which values are shifted and
rescaled so that they end up ranging between 0 and 1. It is also known as
Min-Max scaling.
2) Standardization
Standardization is a scaling technique in which values are centered around the mean with a unit standard deviation.
Normalization vs. Standardization:
● Normalization is useful when the distribution of the data is unknown or not Gaussian; standardization is useful when the distribution of the data is Gaussian or unknown.
● Normalization retains the shape of the original distribution; standardization changes the shape of the original distribution.
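A minimal sketch contrasting the two scalers with scikit-learn; the toy column (including an outlier of 100) is invented for illustration:

# Min-Max normalization vs. standardization
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Normalization: shifts/rescales values into the [0, 1] range
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: rescales to mean 0 and std 1 (unbounded range)
print(StandardScaler().fit_transform(x).ravel())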
What is Bias?
Bias is the difference between the average prediction of the model and the correct values we are trying to predict. A model with high bias gives a large error on both training and testing data: it oversimplifies the relationship, thus not fitting the data set accurately. Such fitting is known as the Underfitting of Data. This happens when the hypothesis is too simple or linear in nature.
What is Variance?
The variability of model prediction for a given data point, which tells us the
spread of our data is called the variance of the model. The model with high
variance has a very complex fit to the training data and thus is not able to fit
accurately on the data which it hasn’t seen before. As a result, such models
perform very well on training data but have high error rates on test data.
The typical bias-variance plot shows that when the bias is high, the error in both the test and training sets is also high. When the variance is high, the model performs well on our training set and gives a low error there, but the error on our test set is very high. In between these extremes, there is a region where bias and variance are in balance with each other, and both training and testing errors are low.
If the algorithm is too simple (a hypothesis with a linear equation), it may land in a high-bias, low-variance condition and thus be error-prone. If the algorithm fits too complex a model (a hypothesis with a high-degree equation), it may land in a high-variance, low-bias condition. In the latter condition, the model fits the training data closely but will not perform well on new entries.
We try to optimize the value of the total error for the model by using the
Bias-Variance Tradeoff.
https://www.analyticsvidhya.com/blog/2022/08/regularization-in-machine-learning/
Ridge Regression
○ In this technique, the cost function is altered by adding the penalty term to it.
The amount of bias added to the model is called Ridge Regression penalty.
We can calculate it by multiplying lambda by the squared weight of each individual feature.
○ The equation for the cost function in ridge regression will be:
Cost = Σ(yi – ŷi)² + λ Σ βj²
○ As we can see from the above equation, if the values of λ tend to zero, the
equation becomes the cost function of the linear regression model. Hence,
for the minimum value of λ, the model will resemble the linear regression
model.
Lasso Regression:
○ It is similar to the Ridge Regression except that the penalty term contains only
the absolute weights instead of a square of weights.
○ Since it takes absolute values, it can shrink coefficients all the way to 0, whereas Ridge Regression can only shrink them close to 0.
○ Some of the features in this technique are completely neglected for model
evaluation.
○ Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature selection.
○ Ridge regression is mostly used to reduce the overfitting in the model, and it
includes all the features present in the model. It reduces the complexity of the
model by shrinking the coefficients.
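A minimal sketch showing the difference on synthetic data: Lasso drives some coefficients exactly to 0 while Ridge only shrinks them. Here alpha plays the role of λ, and the dataset is randomly generated:

# Ridge vs. Lasso on the same synthetic regression problem
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zeros:", sum(c == 0 for c in ridge.coef_))  # typically 0
print("lasso zeros:", sum(c == 0 for c in lasso.coef_))  # several exact zeros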
Linear Regression
Yi = β0 + β1Xi
The best fit line is the one that fits the given scatter plot in the best way. Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares (RSS). R-squared ranges from 0 to 1. Overall, the higher the value of R-squared, the better the model fits the data.
R2 = 1 – (RSS/TSS)
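A minimal sketch fitting Yi = β0 + β1Xi by least squares and computing R-squared from RSS and TSS; the synthetic line and noise level are arbitrary:

# Least-squares fit and R-squared on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 + 2.0 * X.ravel() + rng.normal(0, 1.5, size=50)

model = LinearRegression().fit(X, y)   # minimizes RSS
rss = ((y - model.predict(X)) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
print("R^2 =", 1 - rss / tss)          # same as model.score(X, y)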
Overfitting
There are several ways to prevent overfitting, which are stated below:
● Cross-validation
● If the training data is too small to train add more relevant and clean
data.
● If the training data is too large, do some feature selection and remove
unnecessary features.
● Regularization
Underfitting:
● Increase the model complexity
Gradient Descent
For y = mx + c, if we plot m and c against MSE, the cost surface acquires a bowl shape.
1. The algorithm starts with some values of m and c (usually m = 0, c = 0). We calculate the MSE (cost) at the point m = 0, c = 0; say the MSE (cost) there is 100.
2. We find the partial derivative of the cost with respect to m, and similarly the partial derivative with respect to c.
3. We then reduce the values of m and c by a small amount (the learning rate times each partial derivative) so that the cost decreases, and keep doing the same until our loss function is a very small value, or ideally 0.
4. We will repeat this process until our cost function is very small (ideally 0).
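A minimal from-scratch sketch of this loop for y = mx + c; the learning rate and iteration count are arbitrary illustrative choices:

# Gradient descent for y = mx + c with MSE cost
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                     # true line: m=2, c=1

m, c, lr = 0.0, 0.0, 0.01             # start at m=0, c=0
for _ in range(10000):
    y_pred = m * x + c
    dm = (-2 / len(x)) * np.sum(x * (y - y_pred))  # dMSE/dm
    dc = (-2 / len(x)) * np.sum(y - y_pred)        # dMSE/dc
    m -= lr * dm                      # step against the gradient
    c -= lr * dc

print(m, c)                           # converges toward m=2, c=1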
Logistic Regression
○ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be
either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value
as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
○ In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
○ The odds of the positive class are p(x) / (1 − p(x)).
○ The logit is the natural logarithm of the odds. In logistic regression, the log odds of the positive class are modeled as a linear function of the inputs, which turns a regression output into a classification problem:
log(p(x) / (1 − p(x))) = z, where z = mx + c
● Data preparation: Clean and preprocess the data, and make sure the features are numeric and suitable for the model.
○ The sigmoid maps any real value into another value within a range of 0 and 1.
○ The value of the logistic regression must be between 0 and 1, and since it cannot go beyond this limit, it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.
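A minimal sketch of the sigmoid turning z = mx + c into a probability; the coefficients are invented for illustration:

# Sigmoid maps any real z into a probability in (0, 1)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, c = 1.5, -3.0                      # illustrative coefficients
for x in (-2.0, 0.0, 2.0, 4.0):
    p = sigmoid(m * x + c)            # probability of the positive class
    print(x, round(p, 3), "-> class", int(p >= 0.5))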
1. Low Precision / High Recall: In applications where we want to reduce the number of false negatives, we select a model with high recall even at the cost of low precision; a false positive then costs us relatively little, such as an extra review of an ultimately rejected candidate.
Decision Tree
1. Gini Index
2. Information Gain (ID3)
Entropy
Entropy is a measure of the randomness in the information being processed.
Information Gain
We can define information gain as a measure of how much uncertainty (entropy) is reduced by splitting the dataset on a particular feature:
Gain = E(parent) − E(children)
where E(children) is the size-weighted average entropy of the child nodes.
Entropy values can fall between 0 and 1. If all samples in data set, S, belong to one
class, then entropy will equal zero. If half of the samples are classified as one class
and the other half are in another class, entropy will be at its highest at 1.
https://www.section.io/engineering-education/entropy-information-gain-machine-learning/
Gini index
● The Gini index ranges between 0 and 1, where 0 represents perfect equality (all values are the same) and 1 represents perfect inequality (all values are different).
● It measures the impurity of a node's class distribution.
● In decision trees, the Gini Index is used to evaluate the quality of a split
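A minimal sketch computing entropy, Gini impurity, and an information gain by hand; the class proportions are invented for illustration:

# Entropy, Gini impurity, and Gain = E(parent) - E(children)
import numpy as np

def entropy(p):                 # p: array of class proportions
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

parent = np.array([0.5, 0.5])           # 50/50 split: maximum impurity
print(entropy(parent), gini(parent))    # 1.0, 0.5

# A split into two pure children (weighted by child sizes) gains 1 bit:
children = 0.5 * entropy(np.array([1.0])) + 0.5 * entropy(np.array([1.0]))
print("gain =", entropy(parent) - children)   # 1.0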
Advantages of Decision Tree
3. A decision tree algorithm can handle both categorical and numeric data.
4. Any missing value present in the data does not affect a decision tree to any considerable extent.
Disadvantages of Decision Tree
2. A small change in the data can cause the decision tree to grow into a bad structure, which may affect the accuracy of the model.
3. If the data are not properly discretized, then a decision tree algorithm can give inaccurate results and will perform badly compared to other algorithms.
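A minimal sketch of a decision tree trained with the Gini criterion (scikit-learn's default); the iris dataset and depth limit are arbitrary choices:

# Decision tree split using the Gini criterion
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # shows the thresholds chosen to minimize impurity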
Random Forest
A greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
Assumptions for Random Forest
○ There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
○ The predictions from each tree must have very low correlations.
○ It predicts output with high accuracy, and even for a large dataset it runs efficiently.
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
Advantages of Random Forest
○ It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
○ Although random forest can be used for both classification and regression tasks, it is not very suitable for regression tasks.
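A minimal sketch of the majority-vote idea with scikit-learn's RandomForestClassifier; the dataset and n_estimators value are arbitrary choices:

# Random forest: many trees, majority vote across them
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # each tree votes; majority wins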
Important Random Forest hyperparameters include:
2. max_features
3. max_depth
5. max_samples
Grid Search
Grid search exhaustively tries every combination of the specified hyperparameter values and keeps the best-performing one. A runnable version of the fragment below assumes a random forest estimator and a small illustrative parameter grid.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
Grid = GridSearchCV(RandomForestClassifier(), {"max_depth": [3, 5, None]}, cv=5)
Grid.fit(X_train, y_train)
print(Grid.best_params_)
print(Grid.best_score_)
best_params_ and best_score_ then report the best hyperparameters found and their cross-validated score.