Assignment 4
Linear Regression is one of the fundamental supervised learning algorithms used in machine
learning for predictive modeling. It establishes a relationship between an independent
variable (input) and a dependent variable (output) by fitting a straight line to the data. This
line, known as the regression line, is represented by the equation:
Y = mX + C
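where:
Y: the dependent variable (output)
X: the independent variable (input)
m: the slope of the line
C: the intercept (the value of Y when X = 0)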
Linear Regression is mainly used for predicting continuous values, such as stock prices,
house prices, or temperature trends. The model learns by minimizing the difference between
the actual and predicted values using techniques like Ordinary Least Squares (OLS) or
Gradient Descent.
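The two fitting techniques named above can be sketched in a few lines of NumPy. This is a minimal illustration for a single input variable, not a production implementation: the function names `ols_fit` and `gradient_descent_fit`, the learning rate, and the step count are assumptions chosen for this example.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares estimate of slope m and intercept C."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance(x, y) / variance(x)
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

def gradient_descent_fit(x, y, lr=0.05, steps=5000):
    """Iteratively minimize the mean squared error of Y = mX + C."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        residual = (m * x + c) - y              # predicted minus actual
        grad_m = (2.0 / n) * np.sum(residual * x)
        grad_c = (2.0 / n) * np.sum(residual)
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m_ols, c_ols = ols_fit(x, y)
m_gd, c_gd = gradient_descent_fit(x, y)
print(f"OLS:              m = {m_ols:.2f}, C = {c_ols:.2f}")
print(f"Gradient Descent: m = {m_gd:.2f}, C = {c_gd:.2f}")
```

On this small dataset both approaches should arrive at essentially the same line. OLS computes it in one step, while Gradient Descent approaches it gradually and depends on a suitable learning rate.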
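import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2, 4, 5, 4, 5])
# Splitting data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,
random_state=42)
# Training the model on the training set
model = LinearRegression()
model.fit(X_train, Y_train)
# Making predictions
predictions = model.predict(X_test)
print("Predictions:", predictions)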
This script trains a linear regression model on a small dataset and makes predictions on
unseen data.
These applications demonstrate how Linear Regression is an essential tool for making data-
driven decisions.
4. What are the key performance parameters used to evaluate a Linear Regression
model?
Mean Absolute Error (MAE): Measures the average absolute difference between
actual and predicted values.
Mean Squared Error (MSE): Penalizes larger errors by squaring the differences.
Root Mean Squared Error (RMSE): The square root of MSE, providing error in
original units.
R-Squared (R²): Represents the proportion of variance in the dependent variable
explained by the model. A value close to 1 indicates a good fit.
When computed on held-out test data, these metrics indicate how well the model generalizes to unseen data.
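The four metrics listed above can be computed directly from their definitions. The sketch below uses NumPy, and the actual and predicted values are hypothetical numbers chosen only to illustrate the arithmetic. The helper name `regression_metrics` is also an assumption of this example.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))       # average absolute difference
    mse = np.mean(errors ** 2)          # squaring penalizes large errors
    rmse = np.sqrt(mse)                 # back in the original units
    ss_res = np.sum(errors ** 2)                       # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical actual and predicted values, for illustration only
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

scikit-learn provides equivalent functions in `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`) if a library implementation is preferred.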
A Decision Tree Classifier is a supervised learning algorithm used for classification tasks. It
works by recursively splitting the dataset based on feature values to create a tree-like
structure of decision rules.
At each node, the algorithm selects the best feature to split the data by minimizing impurity
(measured using Gini Index or Entropy). The process continues until all samples in a node
belong to the same class or another stopping criterion is met.
Decision Trees are widely used because they are easy to interpret and handle both numerical
and categorical data effectively.
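To make the splitting process concrete, the following sketch trains scikit-learn's `DecisionTreeClassifier` on a tiny dataset and prints the learned decision rules. The data and the feature names (`hours_studied`, `classes_attended`) are hypothetical and exist only for this illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [hours_studied, classes_attended] -> fail (0) / pass (1)
X = [[1, 2], [2, 1], [3, 4], [6, 8], [7, 9], [8, 7]]
y = [0, 0, 0, 1, 1, 1]

# Gini impurity is used as the split criterion; max_depth is a stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Show the tree as a set of readable if/else rules
print(export_text(tree, feature_names=["hours_studied", "classes_attended"]))

# Classify two new, unseen samples
new_samples = [[2, 3], [7, 8]]
predicted = tree.predict(new_samples)
print("Predicted classes:", predicted)
```

Because the two classes in this toy data are cleanly separated, a single split is enough to produce pure leaf nodes, which is one of the stopping conditions described above.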
6. What are the key differences between Classification and Regression Trees?
While both trees use recursive partitioning, classification trees focus on predicting categories,
whereas regression trees predict continuous values.
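The contrast can be seen by fitting both kinds of tree to the same inputs. In this sketch the data is hypothetical, and both trees are limited to a single split so the difference in their outputs is easy to follow: the classification tree returns a class label, while the regression tree returns the mean target value of the training samples in the matching leaf.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical one-feature dataset with two clearly separated groups
X = [[1], [2], [3], [10], [11], [12]]
labels = ["low", "low", "low", "high", "high", "high"]   # categorical target
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]               # continuous target

# Same recursive partitioning idea, different kind of prediction
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, labels)
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, values)

class_prediction = clf.predict([[2.5]])[0]   # a category
value_prediction = reg.predict([[2.5]])[0]   # a number (mean of the leaf)

print("Classification tree:", class_prediction)
print("Regression tree:", value_prediction)
```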
The Gini Index is a measure of impurity used to split nodes in a Decision Tree. It is
calculated as:
Gini = 1 - \sum_i p_i^2
where p_i is the proportion of samples in the node that belong to class i.
A lower Gini Index indicates a purer node. The algorithm selects splits that minimize
impurity, leading to better classification performance.
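As a rough illustration of the formula, the sketch below computes the Gini Index of a node from a list of class labels, and the weighted Gini of a candidate split. The helper names `gini` and `weighted_gini` and the example labels are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini Index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

mixed_node = ["yes", "yes", "no", "no"]
pure_node = ["yes", "yes", "yes", "yes"]
print("Mixed node:", gini(mixed_node))
print("Pure node:", gini(pure_node))

# A split that separates the classes is preferred over one that does not
good_split = weighted_gini(["yes", "yes"], ["no", "no"])
bad_split = weighted_gini(["yes", "no"], ["yes", "no"])
print("Good split:", good_split, "Bad split:", bad_split)
```

For two classes the index ranges from 0 (pure node) to 0.5 (evenly mixed node), so the split with the lower weighted value is the one the algorithm would choose.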
8. What is the ID3 algorithm, and how does it use Information Gain?
The ID3 (Iterative Dichotomiser 3) algorithm is a Decision Tree learning algorithm that
builds trees using the concept of Information Gain. Information Gain measures the reduction
in entropy (randomness) after a dataset split.
The feature with the highest Information Gain is chosen for splitting, as it provides the most
informative split.
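The selection step can be sketched as follows. This is a simplified illustration of how Information Gain is computed for categorical features, not a full ID3 implementation: it does not build the tree recursively, and the weather-style data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Reduction in entropy after splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Entropy of the children, weighted by the share of samples in each
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

gains = [information_gain(rows, labels, i) for i in range(2)]
best_feature = max(range(2), key=lambda i: gains[i])
print("Information Gain per feature:", gains)
print("Feature chosen for the split:", best_feature)
```

Here the first feature separates the classes completely and the second tells us nothing about the label, so the first feature has the higher Information Gain and would be chosen for the split.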
A Random Forest Classifier is an ensemble learning method that combines multiple Decision
Trees to improve accuracy and reduce overfitting. It works by:
1. Creating multiple Decision Trees using different subsets of data and features.
2. Aggregating predictions from all trees (majority vote for classification, average for
regression).
This method improves robustness and generalization, and it reduces sensitivity to individual
noisy features.
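The two steps above can be sketched by hand using scikit-learn's `DecisionTreeClassifier` as the base learner. This is a simplified illustration under stated assumptions, not the library's implementation: the function names `fit_forest` and `predict_forest`, the number of trees, and the toy data are all hypothetical, and class labels are assumed to be non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    """Step 1: train each tree on a bootstrap sample with random feature subsets."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw len(X) row indices with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # max_features="sqrt" considers a random subset of features at each split
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(1_000_000))
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Step 2: aggregate the trees' predictions by majority vote."""
    votes = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])

# Hypothetical toy data: two well-separated clusters
X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
forest_prediction = predict_forest(forest, [[1.5, 1.5], [8.5, 8.5]])
print("Forest prediction:", forest_prediction)
```

In practice, scikit-learn's `RandomForestClassifier` packages this idea behind a single estimator, so a hand-written loop like this is mainly useful for understanding the method.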
10. Can you explain a real-world case study where regression and classification models
are used to solve a problem?
A real-world example of using both regression and classification is loan approval prediction
and risk assessment in banking.
1. Classification Model (Decision Tree/Random Forest):
o Used to classify loan applicants as "Approved" or "Rejected" based on factors
like credit score, income, and employment status.
o Helps automate loan processing, improving efficiency.
2. Regression Model (Linear Regression):
o Used to predict loan default probability based on factors like past loan history,
outstanding debts, and economic trends.
o Helps banks decide interest rates and loan limits.