
GOVERNMENT POLYTECHNIC, NAGPUR.

(An Autonomous Institute of Govt. of Maharashtra)


Second Progressive Test
Diploma in Computer Engineering
Course Code : CM303G Course Name : Machine Learning
Time : 1 Hour Max. Marks : 25

Instructions:
1. All questions are compulsory
2. Illustrate your answers with neat sketches wherever necessary
3. Figures to the right indicate full marks
4. Use of non-programmable calculator is permissible
5. Assume suitable data if necessary
6. Preferably, write the answers in sequential order.

Q1 Attempt any FIVE. 10 M COs


6R2 a) SVMs are particularly effective when the number of features CO6
(dimensions) is very large compared to the number of samples.
They work well in datasets with high dimensionality by
maximizing the margin between classes, ensuring good
generalization performance. By focusing on the points (support
vectors) that are closest to the decision boundary, SVMs minimize
the risk of overfitting, especially in scenarios where there is a clear
margin of separation between classes.
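As a brief illustrative sketch (not part of the model answer; the synthetic dataset and parameters below are assumed), a linear-kernel SVM on data with far more features than samples:

# Minimal sketch (assumed synthetic data): linear SVM where features (500)
# greatly outnumber samples (60), the setting described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=500, n_informative=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = SVC(kernel='linear', C=1.0)  # linear kernel suits high-dimensional data
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
print('Support vectors per class:', clf.n_support_)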
4R2 b) The two important steps of Feature Engineering are Feature CO4
Selection and Feature Transformation.
5R2 c) Supervised learning is a type of machine learning where a model is CO5
trained using a labeled dataset. The goal is to learn a mapping from
inputs (features) to outputs (labels or targets) so that the model can
predict the output for new, unseen data.
4R2 d) Data refers to raw, unprocessed facts, figures, or information that CO4
can be collected, analyzed, and used for decision-making. It is the
foundation of information and knowledge in any field, often
represented in numeric, textual, visual, or auditory forms.
Types of Data
Data can be broadly classified by form into Structured Data and Unstructured Data; by its nature, it is further classified as Qualitative (Categorical) or Quantitative (Numerical) data.
Qualitative Data (Categorical Data)
Describes attributes or characteristics.
Types:
Nominal Data: Categories without an inherent order (e.g., colors:
red, blue, green).
Ordinal Data: Categories with a specific order (e.g., ratings: poor,
average, excellent).
Quantitative Data (Numerical Data)
Represents measurable quantities.
Types:
Discrete Data: Countable values (e.g., number of students in a
class).
Continuous Data: Any value within a range (e.g., height, weight,
temperature).
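A small illustrative sketch (the DataFrame below is assumed, not from the question) showing the four subtypes as pandas columns:

# Illustrative sketch (assumed data): the four data subtypes as pandas columns.
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green'],           # nominal: categories, no inherent order
    'rating': ['poor', 'average', 'excellent'],  # ordinal: ordered categories
    'students': [30, 42, 28],                    # discrete: countable values
    'height_cm': [162.5, 175.0, 168.3],          # continuous: any value within a range
})
# Mark the ordinal column's ordering explicitly
df['rating'] = pd.Categorical(df['rating'], categories=['poor', 'average', 'excellent'], ordered=True)
print(df.dtypes)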
5R2 e) Bayes' Theorem is a mathematical formula used to determine the CO5
conditional probability of an event based on prior knowledge of
conditions that might be related to the event. In machine learning,
it plays a critical role in probabilistic models, particularly in
classification problems.

The theorem is expressed as:

P(A|B) = [P(B|A) × P(A)] / P(B)

where P(A|B) is the posterior probability of event A given B, P(B|A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the marginal probability (evidence) of B.
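As a short worked illustration (all numbers below are assumed): a diagnostic test with 99% sensitivity and a 5% false-positive rate, for a disease with 1% prevalence:

# Worked example with assumed numbers: P(disease | positive test) via Bayes' theorem.
p_disease = 0.01             # prior P(A): 1% prevalence
p_pos_given_disease = 0.99   # likelihood P(B|A): test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate P(B|not A)

# Evidence P(B) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f'P(disease | positive test) = {p_disease_given_pos:.3f}')  # about 0.167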
6R2 f) Regression analysis is a statistical and machine learning technique CO6


used to model the relationship between a dependent variable
(target) and one or more independent variables (features). Its
primary purpose is to predict continuous numerical values based on
the input data.
Nature of Target Variable:
Regression deals with continuous target variables.
Classification deals with categorical target variables.

Interpretation of Results:
Regression provides a quantitative prediction (e.g., temperature is
28.5°C).
Classification provides a qualitative label (e.g., "hot" or "cold").
Understanding whether the problem requires predicting a
continuous value or assigning categories helps determine whether
to use regression or classification techniques.
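A minimal sketch (synthetic data assumed) contrasting the two: a regressor predicts a continuous temperature, while a classifier predicts a "hot"/"cold" label:

# Minimal sketch (assumed toy data): regression vs. classification on the same feature.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])                     # feature, e.g., hour of day
temps = np.array([22.0, 24.5, 27.0, 29.5, 32.0])            # continuous target
labels = np.array(['cold', 'cold', 'cold', 'hot', 'hot'])   # categorical target

reg = LinearRegression().fit(X, temps)
clf = LogisticRegression().fit(X, labels)

print(reg.predict([[6]]))  # quantitative prediction, e.g., ~34.5
print(clf.predict([[6]]))  # qualitative label, e.g., 'hot'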
4R2 g) Features CO4
Input variables or attributes used to make predictions.
Serve as the independent variables for training.
Labels
Output variable or target value to be predicted.
Represent the dependent variable (ground truth).
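A tiny sketch (the column names and values are assumed) of how features and labels are typically separated:

# Tiny sketch (assumed data): separating features (X) from the label (y).
import pandas as pd

df = pd.DataFrame({'age': [25, 40], 'income': [30000, 80000], 'purchased': ['No', 'Yes']})
X = df[['age', 'income']]  # features: independent variables used for prediction
y = df['purchased']        # label: dependent variable (ground truth)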

Q2 Attempt any THREE. 09 M


4U3 a) CO4
Normalization vs. Standardization:

1. Normalization rescales the data to a fixed range, typically [0, 1] or [-1, 1]; standardization centers the data around the mean with unit variance.
2. Normalization is useful when features have different scales but no specific assumptions about the distribution; standardization is useful when data follows a normal (Gaussian) distribution or is expected to have similar spread.
3. Normalization is sensitive to outliers, as they can distort the min-max range; standardization is less sensitive to outliers because it uses the mean and standard deviation.
4. Normalization has a fixed output range, typically [0, 1] or [-1, 1]; standardization has no fixed range and is typically centered around 0 with unit variance.
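As a minimal sketch of both techniques (the toy array is assumed), using scikit-learn's scalers:

# Minimal sketch (assumed toy data): min-max normalization vs. standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # note the outlier at 100

print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]; the outlier distorts the range
print(StandardScaler().fit_transform(X).ravel())  # mean 0, unit variance; less distorted by the outlier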

5A3 b) Data preprocessing is a crucial step in machine learning because it CO5


ensures that the raw data is transformed into a suitable format for
model training and evaluation. Raw data is often incomplete, noisy,
or inconsistent, and preprocessing helps address these issues,
improving the model's performance and reliability.
1. Improves Data Quality
Handles missing values, duplicate entries, and noisy data to ensure
a cleaner dataset.
Better quality data leads to more accurate and reliable predictions.
2. Ensures Consistency in Data
Aligns data formats, units, or scales (e.g., normalization,
standardization) to prevent bias toward certain features during
model training.
3. Facilitates Faster Convergence of Models
By transforming data into a more manageable form, preprocessing
helps machine learning algorithms converge faster during training.
4. Reduces Overfitting and Underfitting
Techniques like feature selection or dimensionality reduction (e.g.,
PCA) can remove irrelevant or redundant features, reducing the
risk of overfitting or underfitting.
5. Handles Class Imbalance
Balancing datasets (e.g., using oversampling or undersampling)
ensures that the model does not favor the majority class over the
minority class.
6. Enables Compatibility with Algorithms
Some algorithms require specific input formats (e.g., numeric data
for many models). Preprocessing ensures data compatibility by
encoding categorical variables or scaling numerical data.

Steps in Data Preprocessing:


Data Cleaning: Handling missing values, noise, and
inconsistencies.
Data Transformation: Scaling, normalizing, or encoding data.
Feature Selection: Selecting the most relevant features for the
model.
Feature Extraction: Creating new, meaningful features from existing ones.
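A condensed sketch of these steps on an assumed toy dataset (column names and values are illustrative only):

# Condensed sketch (assumed toy data) of the preprocessing steps above.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, None, 35, 45],
    'city': ['Nagpur', 'Pune', 'Nagpur', 'Mumbai'],
    'income': [30000, 42000, 55000, None],
})

# Data cleaning: fill missing numeric values with the column median
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].median())

# Data transformation: encode the categorical column, scale the numeric ones
df = pd.get_dummies(df, columns=['city'])
df[['age', 'income']] = StandardScaler().fit_transform(df[['age', 'income']])
print(df.head())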
6U3 c) CO6
Simple Linear Regression vs. Multiple Linear Regression:

1. Simple linear regression involves only one independent variable (feature); multiple linear regression involves two or more independent variables (features).
2. Simple linear regression is less complex due to a single predictor variable; multiple linear regression is more complex as it considers multiple predictors.
3. Simple linear regression analyzes the relationship between one feature and the target variable; multiple linear regression analyzes the combined influence of multiple features on the target variable.
4. Simple linear regression predicts a dependent variable using a single predictor (e.g., predicting sales based on advertising budget); multiple linear regression predicts a dependent variable using multiple predictors (e.g., predicting house prices based on size, location, and age).
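A minimal sketch (toy numbers assumed) fitting both with scikit-learn:

# Minimal sketch (assumed toy data): simple vs. multiple linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple: one predictor (advertising budget -> sales)
budget = np.array([[10], [20], [30], [40]])
sales = np.array([110, 190, 310, 390])
simple = LinearRegression().fit(budget, sales)

# Multiple: several predictors (size, age -> house price)
X = np.array([[1000, 10], [1500, 5], [2000, 20], [2500, 2]])
price = np.array([200, 320, 350, 520])
multiple = LinearRegression().fit(X, price)

print(simple.coef_, multiple.coef_)  # one coefficient vs. one coefficient per feature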

5U3 d) A Decision Tree is a popular and interpretable classification CO5


technique used in machine learning. It works by recursively
partitioning the data into subsets based on the feature values,
creating a tree-like structure. Each internal node represents a
feature, each branch represents a decision rule based on that
feature, and each leaf node represents the predicted class label.

How Decision Trees Work:


Feature Selection:
At each node, the decision tree algorithm selects the feature that
best splits the dataset into distinct classes. It chooses the feature
that maximizes a specific criterion, such as Gini impurity or
information gain (for classification tasks).

Recursive Partitioning:
The dataset is split based on the chosen feature, and this process
continues recursively for each subset until a stopping criterion is
met (e.g., maximum depth, minimum samples in a node, or no
further improvement in the split).

Class Assignment:
Once the tree reaches the leaf nodes, a class label is assigned to
each leaf based on the majority class in the data points that fall into
that leaf.
Overfitting: Decision trees can easily overfit the training data if
they grow too deep. Techniques like pruning (cutting back
branches of the tree) can help prevent overfitting.

Advantages of Decision Trees:


Interpretability: Decision trees are easy to understand and
visualize, making them user-friendly for both data scientists and
stakeholders.
No Need for Feature Scaling: Decision trees do not require
normalization or standardization of data since they are not sensitive
to the scale of the features.
Works with Both Numerical and Categorical Data: Decision trees
can handle both continuous and categorical variables.

Disadvantages of Decision Trees:


Overfitting: If the tree is too deep, it can memorize the training
data, leading to poor generalization to unseen data.
Instability: Small changes in the data can lead to a completely
different tree.
Bias Toward Features with More Levels: Decision trees may favor
features with more levels or categories, leading to biased splits.

Applications of Decision Trees:


Customer Segmentation: Identifying customer segments based on
purchasing behavior.
Medical Diagnosis: Classifying diseases based on patient
symptoms and test results.
Credit Scoring: Classifying loan applicants as high or low risk.
Example:
Imagine you're trying to classify whether a customer will buy a
product based on their age and income:

Node 1: If age ≤ 30, go to Node 2; if age > 30, go to Node 3.


Node 2: If income ≤ $50,000, predict "No"; if income > $50,000,
predict "Yes."
Node 3: If income ≤ $75,000, predict "Yes"; if income > $75,000,
predict "No."
The tree structure visually represents the decision rules used to
classify new instances based on the input features.
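A brief sketch (the data is assumed for illustration) fitting a depth-2 tree on the age/income example and printing its learned rules:

# Brief sketch (assumed toy data): a depth-2 decision tree on age and income.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[25, 40000], [28, 60000], [35, 70000], [45, 90000], [22, 55000], [50, 80000]])
y = np.array(['No', 'Yes', 'Yes', 'No', 'Yes', 'No'])  # whether the customer buys

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=['age', 'income']))  # human-readable decision rules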

Q3 Attempt any ONE. 06 M


5A6 a) To build a Random Forest model to predict whether tennis will be CO5
played based on weather conditions, follow these steps. I will guide
you through the typical procedure of building and evaluating a
Random Forest classifier:

1. Dataset Understanding
Assume that you are provided with a dataset where each row
contains weather features and a target variable indicating whether
tennis was played. Here's an example of the dataset structure:

Day | Temperature | Humidity | Wind | Outlook | PlayTennis (Target)
1 | 30°C | 80% | No | Sunny | Yes
2 | 25°C | 70% | Yes | Overcast | Yes
3 | 20°C | 90% | No | Rainy | No
4 | 35°C | 60% | Yes | Sunny | No
... | ... | ... | ... | ... | ...
2. Data Preprocessing
You need to prepare the data before building the model:

Handle missing data: Check if there are any missing values and
handle them by removing or filling them.
Encode categorical variables: The Outlook and Wind columns are
categorical. You need to convert these into numerical values using
encoding techniques like One-Hot Encoding.
Feature scaling: Random Forest is not sensitive to feature scaling,
but if you are using other models, scaling might be necessary.

3. Splitting the Data


You’ll split the data into training and test sets. A typical split is
70-80% for training and 20-30% for testing.

4. Model Building
We will use the Random Forest Classifier for this task, which is an
ensemble of decision trees. Here's how you would build the model:

# Step 1: Import Libraries


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 2: Load the dataset


# Example dataset as a DataFrame
data = pd.DataFrame({
'Temperature': [30, 25, 20, 35, 28],
'Humidity': [80, 70, 90, 60, 75],
'Wind': ['No', 'Yes', 'No', 'Yes', 'No'],
'Outlook': ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Overcast'],
'PlayTennis': ['Yes', 'Yes', 'No', 'No', 'Yes']
})

# Step 3: Encode categorical features (Outlook, Wind)


le = LabelEncoder()
data['Wind'] = le.fit_transform(data['Wind'])
data['Outlook'] = le.fit_transform(data['Outlook'])

# Step 4: Split the dataset into features and target


X = data.drop('PlayTennis', axis=1) # Features
y = data['PlayTennis'] # Target

# Step 5: Split into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 6: Build the Random Forest Model


model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 7: Make predictions


y_pred = model.predict(X_test)

# Step 8: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:\n', conf_matrix)
print('Classification Report:\n', class_report)

5. Explanation of the Code:


Label Encoding: Categorical variables (Wind and Outlook) are
encoded into numerical values using LabelEncoder so they can be
used by the model.
Train-Test Split: The dataset is divided into training (70%) and test
(30%) sets using train_test_split.
Random Forest Model: A RandomForestClassifier with 100 trees
(n_estimators=100) is trained on the training data.
Model Evaluation: The model's accuracy is calculated, and other
metrics such as confusion matrix and classification report
(precision, recall, F1-score) are printed.

6. Model Evaluation
Accuracy gives you the overall performance of the model.
Confusion Matrix shows the number of true positives, true
negatives, false positives, and false negatives, which is helpful in
understanding model errors.
Classification Report provides precision, recall, and F1-score for
both classes (Yes/No in this case).

7. Advantages of Random Forest for This Task:


Ensemble Method: Random Forest is an ensemble method that
combines the output of multiple decision trees to improve
predictive performance and reduce overfitting.
Handles Non-linear Data: It can handle complex relationships in
the data without needing explicit feature engineering.
Feature Importance: Random Forest can give insights into the
importance of each feature, which can help in model
interpretability.
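Continuing the example above, the importances can be read directly off the fitted model (a brief illustrative snippet):

# Brief snippet: inspect feature importances of the fitted Random Forest above.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f'{name}: {importance:.3f}')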
Conclusion:
Using a Random Forest model for predicting whether tennis will be
played based on weather features can provide accurate predictions.
By preprocessing the data, encoding categorical variables, splitting
the data into training and testing sets, and evaluating the model
using appropriate metrics, you can effectively build and assess the
model's performance.
6A6 b) To implement Logistic Regression on the Iris dataset and classify CO6
whether a flower is of type "Setosa" or "Not Setosa," we can follow
these steps:

1. Import the Necessary Libraries


We'll use libraries like scikit-learn for dataset handling,
preprocessing, and model building.

2. Load the Iris Dataset


The Iris dataset is readily available in scikit-learn. We will classify
the flower as "Setosa" or "Not Setosa" based on the species
column.

3. Preprocess the Data


We'll create a binary classification problem by converting the target
variable into two classes: "Setosa" (label 1) and "Not Setosa" (label
0).

4. Build and Train the Logistic Regression Model


5. Evaluate the Model
Here’s how you can implement this:

# Step 1: Import necessary libraries


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 2: Load the Iris dataset


iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['Species'] = iris.target
# Step 3: Convert target variable to binary classification (Setosa vs Not Setosa)
df['IsSetosa'] = df['Species'].apply(lambda x: 1 if x == 0 else 0)  # 1: Setosa, 0: Not Setosa

# Step 4: Define the feature matrix (X) and target vector (y)
X = df[iris.feature_names]  # Features: sepal length, sepal width, petal length, petal width
y = df['IsSetosa'] # Target: 1 if Setosa, 0 otherwise

# Step 5: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 6: Initialize and train the Logistic Regression model


model = LogisticRegression()
model.fit(X_train, y_train)

# Step 7: Make predictions on the test set


y_pred = model.predict(X_test)

# Step 8: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print evaluation results


print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:\n', conf_matrix)
print('Classification Report:\n', class_report)

Explanation of the Code:


Data Loading:
We load the Iris dataset using load_iris() from sklearn.datasets.
We create a DataFrame (df) containing the Iris dataset's features (iris.data) and target labels (iris.target).

Target Variable Transformation:
We convert the Species column into a binary classification column (IsSetosa), where:
"Setosa" (species label 0) is represented as 1.
All other species (versicolor and virginica) are represented as 0.

Feature Matrix (X) and Target Vector (y):
X is the feature matrix containing the four measurements (sepal length, sepal width, petal length, and petal width).
y is the binary target variable (IsSetosa).

Train-Test Split:
We split the data into training and test sets using train_test_split() with 30% of the data reserved for testing.

Logistic Regression Model:
We initialize a Logistic Regression model and fit it on the training data (X_train, y_train).

Predictions and Evaluation:
We use the trained model to make predictions on the test set (X_test).
We evaluate the model's performance using accuracy_score, confusion_matrix, and classification_report.
Output:
The model will output the following:

Accuracy: The overall accuracy of the model.
Confusion Matrix: A matrix that shows the counts of true positives, false positives, true negatives, and false negatives.
Classification Report: Includes metrics such as precision, recall, and F1-score.
Key Takeaways:
The logistic regression model will predict whether a flower is
"Setosa" (1) or "Not Setosa" (0).
The output includes performance metrics such as accuracy,
precision, recall, and F1-score, providing insight into how well the
model is performing.

Course Outcomes
CO4 Apply feature engineering on dataset.
CO5 Apply classification algorithm on dataset.
CO6 Apply regression algorithm on dataset.
