ML Unit 3
Overfitting:- It has very low training error. Occurs when the model is too complex. Leads to high variance and low bias. Might occur when there are too many features. Perform more regularisation. For overfitting, the training error is much lower than the test error.
Underfitting:- It has a high training error. Occurs when the model is too simple. Leads to low variance and high bias. Might occur when there are too few features. Perform less regularisation. For underfitting, the training error is usually similar to the test error (both are high).
Q2) What is Underfitting & Overfitting? Techniques to reduce underfitting and overfitting
1) Underfitting- happens when a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data. 2) Overfitting- occurs when a model learns the training data too well, including its noise and details, resulting in poor generalization to unseen data. The model performs well on training data but poorly on test data.
Techniques to reduce Underfitting- 1) Increase model complexity- use a more complex algorithm (e.g. switching from linear regression to a decision tree). 2) Add more features- include relevant features that help the model learn better. 3) Train longer- allow the model to train for more epochs or iterations. 4) Tune hyperparameters- adjust the learning rate or other parameters to improve the model's capacity to learn.
Techniques to reduce Overfitting- 1) Use regularization- techniques like L1 (Lasso) or L2 (Ridge) regularization add a penalty for overly complex models (see the sketch after this list). 2) Simplify the model- use a less complex model so that it does not capture noise in the data. 3) Increase training data- provide more diverse examples so the model learns general patterns. 4) Apply dropout- in neural networks, randomly drop some neurons during training to prevent over-reliance on specific paths.
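A minimal sketch of the regularization remedy, assuming scikit-learn and NumPy are available (the synthetic data, the degree-9 model and the alpha value are illustrative choices, not from the notes): a very flexible polynomial model is fit with and without an L2 (Ridge) penalty, and the train/test errors are compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy non-linear data
X_tr, y_tr, X_te, y_te = X[::2], y[::2], X[1::2], y[1::2]    # simple train/test split

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 (Ridge) penalty", Ridge(alpha=0.01))]:
    model = make_pipeline(PolynomialFeatures(degree=9), reg)  # deliberately complex model
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With these illustrative settings the unregularized fit typically shows a much lower training error than test error (overfitting), while the Ridge penalty narrows the gap.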
Q3) Bias and Variance trade-off
Bias- refers to the error caused by a model being too simple and not capturing the underlying pattern in the data. E.g. a linear model trying to fit complex non-linear data will have high bias. Variance- refers to the error caused by a model being too complex and overly sensitive to small fluctuations in the training data. In this case the model performs well on training data but poorly on new data.
Trade-off- the trade-off between bias and variance occurs because reducing bias typically increases variance, and reducing variance typically increases bias. High bias, low variance: the model is simple, but it might not capture the complexity of the data. Low bias, high variance: the model is complex and may overfit the training data. Low bias, low variance: the ideal ML model, which is not practically possible. High bias, high variance: predictions are inconsistent and also inaccurate on average. The goal is to find a model that performs well on both the training data and new data.
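A small sketch of the trade-off, again assuming scikit-learn/NumPy (the data and the chosen degrees are illustrative): sweeping the polynomial degree moves the model from high bias (both errors high) towards high variance (training error small, test error large).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, y_tr, X_te, y_te = X[::2], y[::2], X[1::2], y[1::2]

for degree in (1, 4, 15):                    # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```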
Q4) Explain Lasso and Ridge Regression
Lasso and Ridge regression are techniques used in ML to improve the performance of linear regression models, especially when there are too many input features.
1) Lasso (Least Absolute Shrinkage & Selection Operator)- Lasso adds a penalty equal to the absolute value of the coefficients to the cost function. The penalty term, called L1 regularisation, not only reduces the size of the coefficients but can also shrink some coefficients to zero. This means Lasso can effectively perform feature selection by keeping only the most important features in the model while discarding irrelevant ones. Formula- Cost Function = MSE + λ Σ |B|. E.g. if a dataset has many features but only a few are relevant, Lasso will identify and keep only the key features.
2) Ridge Regression- Ridge adds a penalty proportional to the squared value of the coefficients. This penalty term, called L2 regularization, helps reduce the size of the coefficients, making the model less sensitive to small fluctuations in the training data. Ridge is useful when all features are important but need to be scaled down to avoid overfitting. However, it does not shrink any coefficient exactly to zero, so it keeps all features in the model. Formula- Cost Function = MSE + λ Σ B².
Q5) Difference between Lasso and Ridge Regression
1) Lasso Regression- 1) Adds an L1 penalty (sum of absolute values of coefficients). 2) Shrinks some coefficients to exactly zero (feature selection). 3) Produces sparse models (useful for interpretability). 4) Works well when only a few features are important. 5) Sensitive to small data changes, which can affect feature selection. 6) May underperform when features are highly correlated.
2) Ridge Regression- 1) Adds an L2 penalty (sum of squared values of coefficients). 2) Shrinks coefficients towards zero but never exactly zero. 3) Retains all features, avoiding feature elimination. 4) Works well when features are highly correlated. 5) Produces stable models, less sensitive to data changes. 6) Prefers models with all features contributing to the prediction.
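A short sketch of the key difference in the comparison above, assuming scikit-learn/NumPy (the synthetic data and the alpha values are illustrative): on data where only two of eight features matter, Lasso drives the irrelevant coefficients to exactly zero while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 8))
true_coef = np.array([5.0, -3.0, 0, 0, 0, 0, 0, 0])   # only the first two features matter
y = X @ true_coef + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty -> sparse coefficients (feature selection)
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty -> all coefficients kept, just shrunk

print("Lasso:", np.round(lasso.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
```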
Q6) Explain Regression- Regression is a type of supervised machine learning used to predict a continuous output or numerical value based on input data. It involves finding the relationship between input variables (independent variables) and an output variable (dependent variable). Regression models estimate the value of the output variable based on patterns learned from training data. Example- suppose you want to predict the price of a house based on factors like its size, location, number of bedrooms, etc. Using regression, you can train a model on historical data of houses, prices, and other associated features. The model will learn the patterns and relationships in the data and can predict the price of a new house based on similar inputs. Regression is widely used in fields such as predicting sales, stock prices, or even the weather. The two basic types of regression are linear regression and multiple linear regression.
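A minimal sketch of the house-price example, assuming scikit-learn (the tiny dataset of sizes, bedroom counts and prices is made up purely for illustration):

```python
from sklearn.linear_model import LinearRegression

# toy historical data: [size in sq. ft, number of bedrooms] -> price (illustrative units)
X = [[600, 1], [850, 2], [1000, 2], [1200, 3], [1500, 3], [1800, 4]]
y = [30, 45, 52, 65, 80, 95]

model = LinearRegression().fit(X, y)      # learn the pattern from historical houses
print(model.predict([[1100, 2]]))         # predict the price of a new, unseen house
```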
Q7) Linear Regression- Linear regression is a simple and widely used ML algorithm for predicting a continuous output based on one or more input variables. It assumes the relationship between the input variable and the output variable can be represented as a straight line. In simple linear regression there is one input variable and one output variable, and the relationship is expressed as y = mx + c. Here, y is the predicted value, x is the input variable, m is the slope of the line and c is the intercept (the value of y when x = 0).
Q8) Evaluation Metrics-
1) Mean Absolute Error (MAE)- MAE measures the average absolute difference between the predicted values and the actual values of a regression model. It tells us how much, on average, the predictions deviate from the true values. A lower MAE indicates better model performance.
2) Root Mean Squared Error (RMSE)- RMSE is a metric used to evaluate the accuracy of a regression model. It is the square root of the average of the squared differences between predicted and actual values. RMSE gives more weight to larger errors because it squares the errors before averaging, making it more sensitive to outliers compared to MAE. A lower RMSE indicates better model performance.
3) R² (coefficient of determination)- a metric that indicates how well the regression model explains the variability of the target variable. It represents the proportion of variance in the target explained by the features. The value ranges from 0 to 1 (or negative if the model performs worse than a simple mean prediction). 1 indicates a perfect fit; 0 means no explanatory power.
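A quick sketch of all three metrics on a handful of made-up predictions (NumPy and scikit-learn assumed; the y_true/y_pred values are illustrative only):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)             # average of |actual - predicted|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # square root of the average squared error
r2 = r2_score(y_true, y_pred)                         # proportion of variance explained

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```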
Q9) Gradient Descent Algorithm-
It is an optimisation algorithm used in ML to minimise the error of a model by finding optimal parameters (like the weights in a linear regression model). It works by adjusting the parameters in the direction that reduces the error. Working:- 1) Begin with random values for the model parameters (like the slope and intercept for linear regression). 2) Compute the cost function- the cost function measures how well the model predicts the data. 3) Calculate gradients- gradients are the partial derivatives of the cost function with respect to each parameter. They tell us the direction and rate of change of the cost. 4) Update parameters- adjust the parameters in the opposite direction of the gradient to reduce the cost. 5) Repeat steps 2-4 until the cost stops decreasing significantly or a predefined number of iterations is reached.
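A minimal NumPy sketch of these five steps for simple linear regression, using the small dataset from Q14 below (the learning rate and iteration count are illustrative; every update uses the whole dataset, i.e. this is the batch version):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 4.0, 5.0, 7.0])

m, c = 0.0, 0.0            # step 1: start from arbitrary initial parameters
lr, n = 0.05, len(x)       # learning rate and number of data points

for _ in range(2000):
    y_pred = m * x + c                              # step 2: predictions under current parameters
    grad_m = (-2 / n) * np.sum(x * (y - y_pred))    # step 3: dMSE/dm
    grad_c = (-2 / n) * np.sum(y - y_pred)          # step 3: dMSE/dc
    m -= lr * grad_m                                # step 4: move against the gradient
    c -= lr * grad_c                                # step 5 is the loop itself
print(round(m, 2), round(c, 2))    # approaches the least-squares line (about 1.3 and 1.5)
```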
Q10) Batch Gradient Descent vs Stochastic Gradient Descent
Batch Gradient Descent- It uses the entire dataset to compute the gradient at each step. It is slow because it processes the whole dataset at once. It requires more memory to hold the dataset. It is more accurate as it considers the full dataset for each update. It updates the parameters less frequently. It is better for small or medium-sized datasets.
Stochastic Gradient Descent- It uses a single data point to compute the gradient at each step. It is faster because it updates after each data point. It requires less memory as it processes one data point at a time. It is less accurate because each update is based on a single data point. It updates the parameters frequently. It is better for large datasets.
Q11) Define different regression models-
1) Linear Regression- models the relationship between features and the target as a straight line. Suitable for simple, linear relationships.
2) Logistic Regression- used for binary classification problems. Outputs probabilities using the sigmoid function.
3) Polynomial Regression- extends linear regression by adding polynomial terms of the features. Captures non-linear relationships.
4) Decision Tree Regression- splits the data into regions using decision rules. Captures complex non-linear patterns.
5) Ridge Regression- linear regression with L2 regularization to reduce overfitting. Penalizes large coefficients but retains all features.
6) Lasso Regression- linear regression with L1 regularization, shrinking some coefficients to zero.
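A compact sketch, assuming scikit-learn/NumPy (the synthetic quadratic data and hyperparameters are illustrative), fitting a few of these model types on the same non-linear data: the straight-line models score poorly while the decision tree captures the non-linear pattern.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.5, 300)      # quadratic (non-linear) relationship
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Decision tree": DecisionTreeRegressor(max_depth=4),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test R2:", round(r2_score(y_te, model.predict(X_te)), 2))
```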
Q12) Least squares method- It is a technique used to find the best-fitting line in linear regression. The goal is to minimize the sum of the squared differences between the observed values and the predicted values. This ensures that the line is as close as possible to all the data points. In the context of linear regression, the least squares method helps determine the values of the slope (m) and intercept (c) for the equation y = mx + c. For each data point (xi, yi) the predicted value is Ypredicted = m·xi + c. The difference between the actual value yi and the predicted value is called the error. To measure how well the line fits all the data points we calculate the sum of squared errors: Error = Σ (Yi - Ypredicted)². Example- if we have a dataset of house prices based on their sizes, the least squares method will find the line that best predicts house prices based on size.
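A small NumPy sketch of the "minimize the sum of squared errors" idea, using the same X and Y values as the worked example in Q14 below: the least-squares line has a lower SSE than a nearby alternative line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 4.0, 5.0, 7.0])

def sse(m, c):
    """Sum of squared errors for the line y = m*x + c."""
    return float(np.sum((y - (m * x + c)) ** 2))

m_best, c_best = np.polyfit(x, y, 1)   # least-squares fit of a straight line
print("least-squares line:", round(m_best, 2), round(c_best, 2), "SSE:", round(sse(m_best, c_best), 3))
print("nearby line       :", 1.5, 1.0, "SSE:", round(sse(1.5, 1.0), 3))
```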
Q13) Stochastic Gradient Descent algorithm- It is an optimization algorithm used to minimize the cost function in machine learning, especially for large datasets. It is widely used for training models like neural networks and linear regression. Key points: 1) Gradient descent: involves updating model parameters in the direction of the negative gradient of the cost function to reduce error. 2) Stochastic: unlike traditional gradient descent, which uses the entire dataset for each update, SGD updates the parameters using only one data point at a time. 3) Faster: because it updates after each data point, it converges faster, making it suitable for large datasets. 4) Noisy updates: the updates can be noisy and fluctuate because they are based on a single data point, but over time they can still converge to the optimal solution. 5) Learning rate: the step size (learning rate) controls how big each update is. 6) Advantages: more efficient for large datasets, faster than batch gradient descent.
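A minimal NumPy sketch of the per-sample update, in contrast to the batch version shown under Q9 (same illustrative data and hyperparameters; a realistic implementation would also shuffle the data each epoch and decay the learning rate):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 4.0, 5.0, 7.0])

m, c, lr = 0.0, 0.0, 0.01
for epoch in range(500):
    for xi, yi in zip(x, y):            # one (noisy) update per data point
        error = yi - (m * xi + c)
        m += lr * 2 * error * xi        # negative gradient of the squared error for this point
        c += lr * 2 * error
print(round(m, 2), round(c, 2))         # fluctuates around the least-squares values (~1.3, ~1.5)
```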
Q14) Find the equation of the linear regression line for X = 1, 2, 3, 4 and Y = 3, 4, 5, 7.
Answer- The equation of the linear regression line is represented as Y = mX + c, where m is the slope of the line and c is the intercept. They are calculated as:
m = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²)
c = (ΣY - m ΣX) / n,   where n = number of values.
Therefore we need to calculate ΣX, ΣY, ΣXY and ΣX². We prepare the table:
X    Y    XY    X²
1    3    3     1
2    4    8     4
3    5    15    9
4    7    28    16
ΣX = 10, ΣY = 19, ΣXY = 54, ΣX² = 30.
Substituting the values:
m = (4 × 54 - 10 × 19) / (4 × 30 - (10)²) = (216 - 190) / (120 - 100) = 26/20
m = 1.3
c = (19 - 1.3 × 10) / 4 = 6/4
c = 1.5
Equation of the linear regression line: y = 1.3x + 1.5
Q15) Linear regression: for X = 8, 9.5, 10, 6, 7, 4 and Y = 12, 138, 147, 88, 108, 62, find the value of Y for X = 12.
Answer- The table is prepared in the same way as above, giving:
ΣX = 44.5, ΣY = 555, ΣXY = 4409, ΣX² = 355.25, n = 6.
Substituting the values:
m = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²) = (6 × 4409 - 44.5 × 555) / (6 × 355.25 - 1980.25) = 1756.5 / 151.25
m = 11.61
c = (ΣY - m ΣX) / n = (555 - 11.61 × 44.5) / 6
c = 6.39
Equation of the linear regression line: y = 11.61x + 6.39
To find the value of y for X = 12: y = 11.61 × 12 + 6.39 ≈ 145.7 calories.
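A quick NumPy check of both worked examples, implementing the same m and c formulas directly (small differences in the last decimal place come from rounding m before computing c in the hand calculation):

```python
import numpy as np

def fit_line(x, y):
    """Slope and intercept from the least-squares formulas used above."""
    x, y, n = np.asarray(x, float), np.asarray(y, float), len(x)
    m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    c = (np.sum(y) - m * np.sum(x)) / n
    return m, c

m14, c14 = fit_line([1, 2, 3, 4], [3, 4, 5, 7])
print(round(m14, 2), round(c14, 2))                              # 1.3 1.5 (Q14)

m15, c15 = fit_line([8, 9.5, 10, 6, 7, 4], [12, 138, 147, 88, 108, 62])
print(round(m15, 2), round(c15, 2), round(m15 * 12 + c15, 1))    # about 11.61, 6.37 and 145.7 (Q15)
```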
Unit 6- Q) Obtain the output of neuron Y for the network shown in the figure, using the activation functions: i) binary sigmoidal, ii) bipolar sigmoidal.
Answer-
Inputs: X1 = 0.8, X2 = 0.6, X3 = 0.4
Weights: W1 = 0.1, W2 = 0.3, W3 = -0.2
Bias: b = 0.35 (its input is always 1)
Step 1) Net input to the output neuron:
yin = b + Σ xi wi = b + X1 W1 + X2 W2 + X3 W3
    = 0.35 + (0.8)(0.1) + (0.6)(0.3) + (0.4)(-0.2)
    = 0.53
Step 2) Apply the activation functions:
1) Binary sigmoidal: y = f(yin) = 1 / (1 + e^(-yin)) = 1 / (1 + e^(-0.53)) ≈ 0.629
2) Bipolar sigmoidal: y = f(yin) = 2 / (1 + e^(-yin)) - 1 = 2 / (1 + e^(-0.53)) - 1 ≈ 0.259
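A short NumPy check of this calculation (the inputs, weights and bias are taken from the problem above; W3 = -0.2 is what makes the net input come out to 0.53):

```python
import numpy as np

x = np.array([0.8, 0.6, 0.4])      # inputs X1, X2, X3
w = np.array([0.1, 0.3, -0.2])     # weights W1, W2, W3
b = 0.35                           # bias

y_in = b + np.dot(x, w)                     # net input to the output neuron
binary = 1 / (1 + np.exp(-y_in))            # binary sigmoidal activation
bipolar = 2 / (1 + np.exp(-y_in)) - 1       # bipolar sigmoidal activation

print(round(y_in, 2), round(binary, 3), round(bipolar, 3))   # 0.53 0.629 0.259
```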