Week 4 Q&A

Uploaded by

parkerupsc

Week 4

Question 1:
Explain the importance of the k value in the k-Nearest Neighbors (kNN) algorithm.
Answer:
The k value in the k-Nearest Neighbors (kNN) algorithm represents the number of nearest
neighbors used to make predictions. If k is too small, the model may become sensitive to noise
in the data, leading to overfitting. On the other hand, a large k value smooths the decision
boundary by considering more neighbors, which can improve generalization but might ignore
local data patterns. Selecting an appropriate k value is essential and is usually done through
cross-validation to achieve a balance between accuracy and robustness.
In other words, k controls the bias-variance trade-off: a small k gives a flexible, low-bias but
high-variance model that tracks local variations, while a large k gives a smoother, higher-bias
model with a more global perspective that may underfit.
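
The effect of k can be illustrated with a minimal pure-Python sketch. The toy points, the labels, and the helper name knn_predict are assumptions made for illustration only; they are not part of the original answer.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (assumed): a cluster of "A" points containing one noisy "B" point
# at (0.6, 0.6), plus a genuine "B" cluster far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.6, 0.6),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

query = (0.62, 0.62)
pred_k1 = knn_predict(points, labels, query, k=1)  # follows the single noisy neighbour
pred_k5 = knn_predict(points, labels, query, k=5)  # the noisy point is outvoted
print(pred_k1, pred_k5)
```

With k=1 the prediction follows the noisy point, while with k=5 the surrounding "A" points outvote it. In practice k would be chosen by cross-validation rather than by inspection.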

Question 2:
Explain the role of Python libraries in data science with examples.
Answer:
Python libraries play a crucial role in data science by providing tools for various tasks:
NumPy: Used for numerical computations and handling large datasets efficiently. Example: Creating multi-
dimensional arrays.
Pandas: Ideal for data manipulation and analysis. Example: Reading and cleaning datasets with DataFrame.
Matplotlib/Seaborn: Used for data visualization to create plots and graphs. Example: Plotting a histogram or
scatter plot.
Scikit-learn: Provides tools for machine learning tasks like classification, regression, and clustering. Example:
Building a decision tree model.
These libraries save time and simplify complex data science workflows.

Question 3:
Explain Regression types
Answer:
Classification of Regression Analysis
1. Univariate vs Multivariate
• Univariate: One dependent and one independent variable
• Multivariate: Multiple independent and multiple dependent variables
2. Linear vs Nonlinear
• Linear: Relationship is linear between dependent and independent
variables
• Nonlinear: Relationship is nonlinear between dependent and
independent variables
3. Simple vs Multiple
• Simple: One dependent and one independent variable (SISO)
• Multiple: One dependent and many independent variables (MISO)

Question 4:
Why do you use train_test_split() in machine learning models, and what is the purpose of the test_size
parameter?
Answer:
Purpose of train_test_split(): It splits the data into training and testing sets, ensuring the model is trained on
one portion of the data and evaluated on a separate portion to assess its generalization ability.
Role of test_size Parameter: The test_size parameter determines the proportion of data used for testing. For
example, test_size=0.3 means 30% of the data will be used for testing, and the rest will be for training.
Example: train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=0)
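
A self-contained version of the call above might look like the following sketch. It assumes scikit-learn is installed; the ten-row toy dataset is made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumed): 10 samples with one feature each, and a binary target.
x = [[i] for i in range(10)]
y = [0, 1] * 5

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
# so the same split is produced on every run.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.3, random_state=0
)
print(len(train_x), len(test_x))  # 7 3
```

Fixing random_state is useful for reproducible experiments; leaving it unset gives a different split on each run.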

Question 5:
Explain the correct steps to approach a Classification Problem in Machine Learning.
Answer:
To solve a classification problem:
1. Understand the Problem: Define the objective, identify the target variable (categorical), and gather
domain knowledge.

2. Data Preprocessing: Collect and preprocess data by handling missing values, encoding categorical
variables, scaling features, and addressing class imbalance.

3. Split the Data: Divide into training, validation, and test sets (e.g., 70-15-15), ensuring consistent class
proportions using stratified splits.

4. Model Selection and Training: Choose algorithms like Logistic Regression, Decision Trees, or Neural
Networks. Train models and tune hyperparameters using Grid Search or Random Search.

5. Evaluate the Model: Use metrics like accuracy, precision, recall, F1-score, and ROC-AUC. Visualize
confusion matrix and performance curves.

6. Deploy and Report: Deploy the best model and present findings with metrics and visualizations.
Continuously monitor performance in production.
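
Step 3 can be sketched in plain Python as follows. The helper name stratified_split and the toy labels are assumptions for illustration; libraries such as scikit-learn offer their own stratified splitting, which would normally be preferred.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.70, val_frac=0.15, seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        # Split each class separately, then pool the pieces.
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy target (assumed): 20% "yes", 80% "no".
labels = ["yes"] * 20 + ["no"] * 80
train_idx, val_idx, test_idx = stratified_split(labels)

yes_share_train = sum(labels[i] == "yes" for i in train_idx) / len(train_idx)
print(len(train_idx), len(val_idx), len(test_idx), yes_share_train)
```

Because each class is split separately, the 20% share of "yes" is preserved in all three subsets.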

Question 6:
Why is reducing the number of input features in a classification model useful?
Answer:
Reducing input features is useful because:
1. It makes the model faster and easier to use.
2. It reduces the risk of overfitting by removing unnecessary data.
3. It makes the model simpler and easier to understand.
4. It lowers the cost and effort of collecting data.
This helps maintain good performance while improving efficiency and practicality.

Question 7:
What are the steps involved in building a logistic regression classifier for classifying salary status in the
classification case study?
Answer:
1) Re-indexing and Encoding: The categorical outcome variable (salary status) is re-indexed and mapped to
integers for compatibility with the logistic regression algorithm.
2) Data Preparation: Categorical variables are converted into dummy variables using one-hot encoding to
ensure numerical compatibility.
3) Model Training: The logistic regression classifier is trained on a training dataset using the processed
features.
4) Validation and Testing: The trained model is validated and tested on a separate test dataset to evaluate its
accuracy and performance.
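
Steps 1 and 2 can be illustrated with a small pure-Python sketch. The label values ("low", "high"), the job_type column, and both helper names are hypothetical; the case study's actual category names are not given here.

```python
def encode_target(values, mapping):
    """Map each categorical outcome to an integer using an explicit mapping."""
    return [mapping[v] for v in values]

def one_hot(values):
    """Return (category names, rows of 0/1 dummy variables) for one categorical column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical data for illustration only.
salary_status = ["low", "high", "low"]
job_type = ["private", "government", "private"]

y = encode_target(salary_status, {"low": 0, "high": 1})
columns, X = one_hot(job_type)
print(y)        # [0, 1, 0]
print(columns)  # ['government', 'private']
print(X)        # [[0, 1], [1, 0], [0, 1]]
```

The resulting numeric X and y are what a logistic regression classifier would be trained on in step 3.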

Question 8:
Explain the concept of cross-validation and its importance in predictive modelling.
Answer:
Cross-validation is a technique used to assess the performance of a predictive model by testing it on multiple
subsets of the data. It helps ensure the model generalizes well to unseen data.
Process:
1. The dataset is split into k folds (subsets).

2. The model is trained on k−1 folds and tested on the remaining fold.

3. This process is repeated k times, with each fold used as the test set once.

4. The final performance is the average of the results across all k test folds.

Importance:
o Helps detect overfitting, because the model is evaluated on several train-test splits rather than just one.
o Gives a better estimate of the model's accuracy.

o Helps in hyperparameter tuning to optimize the model's performance.

A common type is k-fold cross-validation, with k = 5 or k = 10 being typical choices.
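
The fold mechanics described above can be sketched without any library. The function names and the placeholder scoring function are assumptions; a real score_fn would train a model on the training indices and return its score on the test indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, score_fn):
    """Average score_fn(train, test) over all k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)

# Placeholder score (assumed): just the held-out fraction, to show the averaging.
mean_score = cross_validate(10, 5, lambda train, test: len(test) / 10)
```

Each sample lands in the test set exactly once. The data is usually shuffled before the folds are formed, which this sketch leaves out.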

Question 9:
What is a decision tree? What are the types? Mention the different metrics used.
Answer:
A decision tree is a supervised machine learning algorithm used for classification and
regression tasks. It splits the dataset into subsets based on feature values, aiming to group data
into homogeneous sets. The final structure resembles a tree, where each node represents a
decision based on a feature, and each leaf node represents the predicted output.
Types of Decision Trees:
1) Classification Trees: Used for categorical outcomes (e.g., yes/no predictions).
2) Regression Trees: Used for continuous outcomes (e.g., predicting a numerical value).
Key Metrics:
• Gini Impurity: Measures how often a randomly selected element would be misclassified if it were
labeled at random according to the class distribution in the node. Lower Gini impurity indicates a purer split.
• Entropy: Measures disorder in the dataset. Lower entropy indicates more purity in the data
subset.
• Information Gain: The reduction in entropy due to a split; higher information gain indicates a
better feature for splitting.
• Mean Squared Error (MSE): Used for regression trees, measuring the average squared
difference between predicted and actual values.
Decision trees can be prone to overfitting, which is mitigated by pruning or limiting tree depth.
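
The classification metrics above can be computed directly from class counts. The following is a minimal sketch using the standard definitions; the toy labels and the split are assumed for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node (assumed): two "yes" and two "no", split perfectly into pure children.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]

print(gini(parent))                        # 0.5
print(entropy(parent))                     # 1.0
print(information_gain(parent, children))  # 1.0
```

A perfectly mixed two-class node has the maximum Gini impurity (0.5) and entropy (1 bit), and a split into pure children recovers all of that entropy as information gain.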

Question 10:
Storm Motors is an e-commerce company acting as a mediator in buying and selling pre-owned
cars. They aim to develop an algorithm to predict car prices using attributes such as
specification details, condition, seller details, registration details, advertisement details, make
and model information, and price.
Explain how regression analysis can be used to achieve this. Mention the type of data required
and two factors that could affect prediction accuracy.
Answer:
Regression analysis can be used to predict the price of pre-owned cars by identifying the
relationship between the car’s price (dependent variable) and its attributes (independent
variables). By fitting a model such as linear regression, the algorithm can estimate the price
based on input features like car specifications, condition, and other factors.
Type of Data Required:
• Structured data containing numerical (e.g., mileage, price) and categorical (e.g.,
make, model) variables for each car listing.
Two Factors Affecting Prediction Accuracy:
1. Data Quality: Inaccurate or incomplete data, such as missing entries for car
condition or incorrect price tags, can reduce accuracy.
2. Feature Selection: Including irrelevant or redundant features can lead to
overfitting, while omitting important ones can lead to underfitting; both reduce the
model’s predictive ability.

Question 11:
Define Multiple Linear Regression and explain it with an example.
Answer:
Multiple Linear Regression (MLR) is a statistical method that models the relationship
between one dependent variable and two or more independent variables. It helps in
predicting the outcome variable based on the influence of multiple predictors.
The general equation is:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ + ϵ
Example:
Scenario: A company wants to predict the monthly sales (Y) based on the advertising
budget spent on TV (X₁) and radio (X₂).
• The MLR model could be:
Sales = β₀ + β₁(TV Budget) + β₂(Radio Budget) + ϵ
• If the model estimates:
Sales = 50 + 5(TV Budget) + 3(Radio Budget)
Interpretation:
• For every additional unit spent on TV advertising, sales increase by 5 units, holding the radio budget constant.
• For every additional unit spent on radio advertising, sales increase by 3 units, holding the TV budget constant.
• The baseline sales without any advertising (intercept) are 50 units.
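
The interpretation can be checked numerically with a short sketch of the estimated model. The function name predict_sales and the budget values passed to it are assumptions for illustration; the coefficients are the ones from the example.

```python
def predict_sales(tv_budget, radio_budget, b0=50.0, b1=5.0, b2=3.0):
    """Estimated model from the example: Sales = 50 + 5*TV + 3*Radio."""
    return b0 + b1 * tv_budget + b2 * radio_budget

baseline = predict_sales(0, 0)          # intercept only
sales = predict_sales(10, 20)           # 50 + 5*10 + 3*20
tv_effect = predict_sales(11, 20) - predict_sales(10, 20)  # one more unit of TV budget
print(baseline, sales, tv_effect)       # 50.0 160.0 5.0
```

Raising the TV budget by one unit while keeping the radio budget fixed changes the prediction by exactly the TV coefficient.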

Question 12:
What are the different types of classification in machine learning, and how do they differ in terms of
definition, examples, characteristics?
Answer:
Binary Classification
Definition: Predict between two possible outcomes.
Characterization: Straightforward, outputs binary results like Yes/No or 0/1. Common in many practical
problems.
Example: Classifying emails as spam or not spam.
Multiclass Classification
Definition: Assign data to one of three or more classes.
Characterization: Outputs a single class label from multiple possible categories. Common metrics
include confusion matrix and F1-score.
Example: Classifying flowers into Setosa, Versicolor, or Virginica.

Multilabel Classification
Definition: A single instance can belong to multiple classes simultaneously.
Characterization: The output includes multiple labels for a single input. Used when categories are not
mutually exclusive.
Example: A movie categorized as action and comedy.

Imbalanced Classification
Definition: One class is significantly underrepresented compared to others.
Characterization: Imbalance in data requires specialized techniques like resampling or cost-sensitive
learning to avoid biased predictions.
Example: Fraud detection, where most transactions are legitimate.

Ordinal Classification
Definition: Classes have a natural order, but the differences between them are not defined.
Characterization: Predictions respect the class order. Often treated as a combination of classification
and regression.
Example: Customer satisfaction ratings like dissatisfied, neutral, or satisfied.

Question 13:
What is a confusion matrix and what is its importance in data science?
Answer:
A confusion matrix is a table used to evaluate the performance of a classification model. For a binary
classifier, it summarizes the model's predictions against the actual outcomes using four counts: True
Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Its importance lies in providing a detailed breakdown of how well the model distinguishes between classes,
which helps compute performance metrics like accuracy, precision, recall, and F1-score. This aids in
understanding the model's strengths, weaknesses, and suitability for specific tasks.
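
As a minimal sketch, the four counts and the metrics derived from them can be computed as follows for a binary problem. The actual and predicted labels are made up for illustration, and the helper name confusion_counts is an assumption.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# Toy labels (assumed): 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn)                   # 3 5 1 1
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

Precision and recall divide by zero when the model predicts no positives or the data contains none; a production version would need to handle those cases.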
