IML Module Answer

1. Classifying (3, 2) using K=2 and Manhattan Distance:

 Manhattan Distance: The Manhattan distance between two points (x1, y1)
and (x2, y2) is calculated as |x1 - x2| + |y1 - y2|.
 Calculations:
o Distance((3,2), (1,2)) = |3-1| + |2-2| = 2
o Distance((3,2), (2,3)) = |3-2| + |2-3| = 2
o Distance((3,2), (3,5)) = |3-3| + |2-5| = 3
o Distance((3,2), (4,4)) = |3-4| + |2-4| = 3
o Distance((3,2), (5,3)) = |3-5| + |2-3| = 3
 K=2 Nearest Neighbors: The two nearest neighbors to (3,2) are (1,2) and
(2,3), both with distance 2.
 Classification: Both nearest neighbors belong to class 'A'. Therefore, the
point (3, 2) is classified as A.
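
A minimal Python sketch of this K=2 classification. The class labels of the three
points other than (1,2) and (2,3) are not restated above, so the 'B' labels below
are placeholders:

points = [((1, 2), 'A'), ((2, 3), 'A'), ((3, 5), 'B'), ((4, 4), 'B'), ((5, 3), 'B')]
query, k = (3, 2), 2

def manhattan(p, q):
    # |x1 - x2| + |y1 - y2|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Sort the labelled points by distance to the query and keep the K closest.
nearest = sorted(points, key=lambda item: manhattan(item[0], query))[:k]
labels = [label for _, label in nearest]
prediction = max(set(labels), key=labels.count)   # majority vote
print(nearest, prediction)                        # both neighbours are 'A', so 'A'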

2. Random Forest Classification Model:

 Explanation: A random forest is an ensemble learning method that
operates by constructing multiple decision trees during training. Each tree is
built on a random subset of the training data and a random subset of the
features. For classification, the final prediction is determined by a majority
vote of all the trees.
 Bias: A random forest has roughly the same bias as a single fully grown decision
tree: each tree in the ensemble is a deep, low-bias (but high-variance) learner,
and training each tree on random subsets of the data and features adds only a
small amount of bias, so the ensemble does not underfit.
 Variance: Random forests have significantly lower variance than single
decision trees, as the process of building multiple trees on different random
subsets of data and features and then averaging the output reduces
sensitivity to small changes in the training data.
 Prediction Accuracy: Due to the combined reduction in both bias and
variance, random forests typically yield higher prediction accuracy and
improved generalization compared to a single decision tree.
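
A short scikit-learn sketch of this comparison; the synthetic dataset, estimator
settings, and cross-validated accuracy are illustrative choices, not part of the
question:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)

# Cross-validated accuracy usually favours the ensemble, reflecting its lower variance.
print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())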

3. Hypothesis Testing for Cholesterol Levels:


 Null Hypothesis (H0): The null hypothesis for this situation might be "The
mean cholesterol level of the individuals is equal to a certain value." You
could set it to a clinically accepted average like 200. So: H0: μ = 200 (where μ
is the population mean cholesterol level).
 Testing: To test this, you would:
a. Calculate the sample mean (x̄ ) and sample standard deviation (s) from
your data.
b. Choose a significance level (alpha), e.g., 0.05.
c. Perform a one-sample t-test (since the population standard deviation
is unknown).
d. Calculate the t-statistic using the formula: t = (x̄ - μ) / (s / sqrt(n)).
e. Compare the calculated t-statistic to the critical t-value from the t-
distribution table at the given degrees of freedom (n-1) and alpha. If the
calculated t-statistic falls in the critical region, reject the null
hypothesis, otherwise, fail to reject H0.
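
A hedged Python sketch of steps (a)-(e) using scipy; the cholesterol values are
invented for illustration, since the question does not provide the sample:

import numpy as np
from scipy import stats

cholesterol = np.array([198, 205, 212, 195, 220, 201, 210])   # hypothetical sample
alpha = 0.05

# One-sample t-test of H0: mu = 200 (population standard deviation unknown).
t_stat, p_value = stats.ttest_1samp(cholesterol, popmean=200)
print(t_stat, p_value)
if p_value < alpha:
    print("Reject H0: the mean cholesterol level differs from 200")
else:
    print("Fail to reject H0")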

4. Information Gain Calculation for Outlook:

 Entropy (Parent): First calculate the overall entropy of the "Play Tennis"
target variable. Count how many "Yes" and "No" values, and then use the
formula: -p(yes) * log2(p(yes)) - p(no) * log2(p(no))
 Entropy (Children): Calculate the entropy for each "Outlook" value (Sunny,
Overcast, Rainy).
o For each Outlook value, calculate the probability of "Yes" and "No"
then calculate the entropy for that Outlook
 Weighted Average of Child Entropy: Then, calculate the weighted average
of the child entropies, based on how many samples each Outlook contains.
 Information Gain: Finally, calculate the Information Gain as
Entropy(Parent) - Weighted average of child entropy
 Calculation
o Total Play Tennis: 9 yes , 5 no, 14 Total
o Parent Entropy = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940
o Sunny: 2 yes, 3 no, 5 total, entropy = 0.97
o Overcast: 4 yes, 0 no, 4 total, entropy= 0
o Rainy: 3 yes, 2 no, 5 total, entropy = 0.97
o Weighted Entropy = (5/14)*0.97 + (4/14)*0 + (5/14)*0.97 ≈ 0.693
o Information Gain = 0.940 - 0.693 ≈ 0.247
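
The same arithmetic as a small Python sketch (the counts come from the table above;
the entropy helper is illustrative):

from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                 # treat 0 * log2(0) as 0
            result -= p * log2(p)
    return result

parent = entropy(9, 5)                                    # ~0.940
outlook = {'Sunny': (2, 3), 'Overcast': (4, 0), 'Rainy': (3, 2)}
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in outlook.values())
print(parent - weighted)                                  # information gain ~0.247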

5. Linear Regression Equation:


 Calculate means: Calculate the mean of x (x̄ ) and mean of y (ȳ). x̄ =
(2+3+5+7+8)/5 = 5 , ȳ = (4+5+7+10+11)/5 = 7.4
 Calculate slope (b): b = Σ[(xi - x̄ )(yi - ȳ)] / Σ[(xi - x̄ )^2]
o Σ[(xi - x̄ )(yi - ȳ)] = (-3)(-3.4) + (-2)(-2.4) + (0)(-0.4) + (2)(2.6) + (3)(3.6) =
10.2+4.8+0+5.2+10.8= 31
o Σ[(xi - x̄ )^2] = (-3)^2 + (-2)^2 + (0)^2+ (2)^2 + (3)^2 = 9+4+0+4+9=26
o b = 31/26 = 1.19
 Calculate y-intercept (a): a = ȳ - b * x̄ = 7.4 - (1.19 * 5) = 1.45
 Regression Equation: y = 1.45 + 1.19x

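A short Python check of these least-squares formulas; with the unrounded slope the
intercept comes out near 1.44 (rounding b to 1.19 first gives the 1.45 above):

x = [2, 3, 5, 7, 8]
y = [4, 5, 7, 10, 11]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2), then a = y_bar - b * x_bar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
print(a, b)   # a ~ 1.44, b ~ 1.19
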
6. Null Hypothesis (H0) Importance:

 Definition: The null hypothesis is a statement of no effect, no difference, or
no relationship in a population. It is the default position we assume to be
true until evidence contradicts it.
 Importance:
o It provides a baseline against which we can assess the observed data.
o Hypothesis testing is based on the idea of trying to reject the null
hypothesis by providing sufficient evidence to the contrary. We do not
aim to "prove" the alternative hypothesis, but reject the null
hypothesis instead.
o It provides a structured approach for making objective decisions.

7. Type I and Type II Errors:

 Type I Error (False Positive): Rejecting the null hypothesis when it is
actually true. Example: concluding there is a statistically significant
difference when there isn't any.
 Type II Error (False Negative): Failing to reject the null hypothesis when it
is actually false. Example: concluding there is no statistically significant
difference when actually there is.

8. Overfitting in Decision Trees & Random Forest Solution:

 Overfitting in Decision Trees: Single decision trees can overfit the training
data, capturing noise, and leading to poor generalization on unseen data.
Complex trees can memorize the training data instead of learning general
patterns.
 Random Forest Solution:
o Random Subsampling of Training Data: By training each tree on a
random subset of data, each tree learns from different examples and
thereby reduces the chance of memorizing the noise in the training
data.
o Random Subset of Features: Each split in each tree is made
considering a random subset of features, which adds more variation.
o Ensemble Averaging: The predictions of multiple trees are then
averaged for regression, or a majority vote is taken for classification,
leading to a more robust prediction.
o Generalization: This combined approach leads to random forest
generalizing better than a single decision tree and is less prone to
overfitting.

9. K-Means Algorithm:

 Definition: K-means is an unsupervised clustering algorithm that aims to
partition n observations into k clusters, in which each observation belongs
to the cluster with the nearest mean (cluster centers).
 Process:
a. Initialization: Randomly choose k cluster centroids.
b. Assignment: Assign each data point to the closest centroid.
c. Update: Recompute the centroid of each cluster as the mean of all
the points assigned to it.
d. Repeat: Go back to the assignment step (b) and repeat until the cluster
assignments no longer change or a fixed number of iterations is reached.

10. K-Means Centroid Update (First Iteration):

 Initial Centroids (Random): Let's assume the initial centroids are (2,3) and
(6,6) (just pick any 2 data points from the dataset)
 Assignment Step:
o Distances to Centroid 1 (2,3):
 (2,3): 0
 (3,3): 1
 (6,6): sqrt(25)
 (8,8): sqrt(61)
 (5,8): sqrt(34)
 (1,2): sqrt(2)
o Distances to Centroid 2 (6,6):
 (2,3): sqrt(25)
 (3,3): sqrt(18)
 (6,6): 0
 (8,8): sqrt(8)
 (5,8): sqrt(5)
 (1,2): sqrt(41)
o Cluster Assignment:
 Cluster 1: (2,3) , (3,3), (1,2)
 Cluster 2: (6,6), (8,8), (5,8)
 Update Centroids:
o New Centroid 1: ((2+3+1)/3, (3+3+2)/3) = (2, 2.67)
o New Centroid 2: ((6+8+5)/3, (6+8+8)/3) = (6.33, 7.33)
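
A compact Python sketch of this single assignment/update iteration, assuming the
same two starting centroids:

points = [(2, 3), (3, 3), (6, 6), (8, 8), (5, 8), (1, 2)]
centroids = [(2, 3), (6, 6)]            # initial centroids picked from the data

def sq_dist(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

# Assignment step: attach each point to its nearest centroid.
clusters = {0: [], 1: []}
for p in points:
    nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
    clusters[nearest].append(p)

# Update step: each new centroid is the mean of the points assigned to it.
new_centroids = [(sum(px for px, _ in pts) / len(pts),
                  sum(py for _, py in pts) / len(pts)) for pts in clusters.values()]
print(new_centroids)                    # approximately [(2.0, 2.67), (6.33, 7.33)]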

11. Artificial Neuron Structure:

 Structure:
o Inputs: Multiple inputs (x1, x2, ... ,xn), each associated with a weight
(w1, w2, ... wn)
o Weighted Sum: Each input is multiplied by its corresponding weight,
and all results are summed up.
o Bias: A bias term (b) is added to the weighted sum.
o Activation Function: The sum (with bias) is passed through an
activation function (e.g., sigmoid, ReLU), which determines the
neuron's output.
 Similarity to Biological Neuron:
o Dendrites: Inputs are analogous to dendrites that receive signals.
o Synapses: Weights are analogous to the strength of connections
between neurons (synapses).
o Cell Body: The weighted sum with bias corresponds to the cell body
accumulating input signals.
o Axon: The activation function represents the neuron's firing behavior
(axon transmitting signals).
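
A minimal sketch of the forward pass described above; the inputs, weights, and bias
below are arbitrary illustrations:

import math

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias   # weighted sum plus bias
    return 1 / (1 + math.exp(-z))                            # sigmoid activation

print(neuron([0.5, -1.2, 3.0], [0.4, 0.7, -0.2], bias=0.1))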

12. Association Rule Mining Metrics:

 Support: The proportion of transactions in the dataset containing the
itemset.
o Formula: Support(A) = (Number of transactions containing A) / (Total
number of transactions).
o Used to identify frequent itemsets.
 Confidence: The probability of finding itemset B given that itemset A is
present in a transaction.
o Formula: Confidence(A → B) = Support(A ∪ B) / Support(A)
o Used to measure the reliability of association rules.
 Lift: The ratio of observed support for A and B together to the support
expected if A and B were independent.
o Formula: Lift(A → B) = Support(A ∪ B) / (Support(A) * Support(B)).
o Used to measure the strength of the association between A and B while
accounting for their individual supports. A lift of 1 implies that the itemsets
are independent, a lift greater than 1 implies a positive association, and a
lift less than 1 implies a negative association.
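
A small Python sketch of the three metrics over a hypothetical transaction list (the
items and the rule bread → milk are made up for illustration):

transactions = [
    {'bread', 'milk'}, {'bread', 'butter'},
    {'bread', 'milk', 'butter'}, {'milk'}, {'bread', 'milk'},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {'bread'}, {'milk'}
confidence = support(A | B) / support(A)
lift = support(A | B) / (support(A) * support(B))
print(support(A | B), confidence, lift)   # 0.6, 0.75, ~0.94 for this toy data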

13. Euclidean vs. Manhattan Distance:

 Euclidean Distance: The straight-line distance between two points,
calculated as the square root of the sum of squared differences between
coordinates. It calculates the shortest path between two points.
 Manhattan Distance: The sum of the absolute differences of their Cartesian
coordinates, equivalent to walking along a grid (or city blocks). It calculates
the distance only along the axes.
 Key Differences:
o Euclidean distance squares the coordinate differences, so it is more sensitive
to large differences in a single coordinate, while Manhattan distance weights
all coordinate differences linearly.
o Euclidean gives the length of the straight-line (shortest) path, whereas
Manhattan gives the length of an axis-aligned path, which is never shorter.
o Manhattan is generally used when movement must be along axes
(e.g., grid-like scenarios), while Euclidean is used when direction is not
constrained.

14. Support Vector Machine (SVM):

 Definition: SVM is a supervised learning model used for classification and
regression. It aims to find the optimal hyperplane that separates different
classes in the feature space.
 Key Concepts:
o Hyperplane: A decision boundary that separates different classes.
o Support Vectors: Data points that lie closest to the hyperplane, which
determine its orientation and position.
o Margin: The distance between the hyperplane and the nearest
support vectors, which SVM tries to maximize for better
generalization.
o Kernel Trick: SVMs can implicitly operate in high-dimensional spaces
using kernel functions (e.g., linear, polynomial, RBF), allowing them to
model complex non-linear relationships.
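
A hedged scikit-learn sketch showing these pieces in practice; the dataset and
hyperparameters are illustrative:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data; the RBF kernel handles the curved boundary.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)

print(clf.support_vectors_.shape)   # the support vectors that define the margin
print(clf.score(X, y))              # accuracy on the training data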

15. KNN Classification of (5, 6) with K=3:

 Euclidean Distance: The distance is calculated by sqrt((x2-x1)^2 + (y2-
y1)^2)
o Distance((5,6), (2,3)) = sqrt((5-2)^2 + (6-3)^2) = sqrt(18) ≈ 4.24
o Distance((5,6), (3,4)) = sqrt((5-3)^2 + (6-4)^2) = sqrt(8) ≈ 2.83
o Distance((5,6), (6,7)) = sqrt((5-6)^2 + (6-7)^2) = sqrt(2) ≈ 1.41
o Distance((5,6), (7,8)) = sqrt((5-7)^2 + (6-8)^2) = sqrt(8) ≈ 2.83
o Distance((5,6), (10,10)) = sqrt((5-10)^2 + (6-10)^2) = sqrt(41) ≈ 6.40
 K=3 Nearest Neighbors: The 3 nearest neighbors are (6,7) of class B, (3,4) of
class A, and (7,8) of class B.
 Classification: Since 2 of the 3 nearest neighbors are of class B, the point
(5,6) is classified as B.

16. Short Notes on Two Topics:

 Artificial Intelligence (AI): AI refers to the development of computer
systems that can perform tasks that typically require human intelligence,
such as learning, problem-solving, decision-making, and understanding
language. AI encompasses a wide range of techniques and applications,
including machine learning, deep learning, natural language processing,
computer vision, and robotics.

 Deep Learning: Deep learning is a subset of machine learning that uses
artificial neural networks with multiple layers (hence "deep") to extract
complex patterns and representations from data. Deep learning has been
particularly successful in areas such as image recognition, natural language
processing, and speech recognition due to its ability to automatically learn
hierarchical features from large amounts of data.

17. R² (Coefficient of Determination) Calculation:

 Calculate Mean of Actual y (ȳ): ȳ = (2.5 + 3.6 + 4.8 + 6.1 + 7.1) / 5 = 4.82
 Calculate Total Sum of Squares (TSS):
o TSS = Σ(yi - ȳ)^2 = (2.5-4.82)^2 + (3.6-4.82)^2 + (4.8-4.82)^2 + (6.1-
4.82)^2 + (7.1-4.82)^2
o TSS = 5.3824+1.4884+0.0004+1.6384+5.1984 = 13.708
 Calculate Residual Sum of Squares (RSS):
o RSS = Σ(yi - ŷi)^2 = (2.5-2.8)^2 + (3.6-3.4)^2 + (4.8-4.6)^2 + (6.1-5.9)^2 +
(7.1-7.1)^2 = 0.09+0.04+0.04+0.04+0 = 0.21
 Calculate R²: R² = 1 - (RSS / TSS) = 1 - (0.21 / 13.708) ≈ 0.985
o An R² of 0.985 indicates a very good fit: the model explains about 98.5% of
the variance in the actual y values.
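
The same computation as a short Python check, using the actual and predicted values
from the question:

actual    = [2.5, 3.6, 4.8, 6.1, 7.1]
predicted = [2.8, 3.4, 4.6, 5.9, 7.1]

y_bar = sum(actual) / len(actual)
tss = sum((y - y_bar) ** 2 for y in actual)                  # total sum of squares
rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))   # residual sum of squares
print(1 - rss / tss)                                         # ~0.985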

18. Short Notes on ROC Curve & PCA:

 ROC Curve (Receiver Operating Characteristic Curve):
o A graphical plot that illustrates the performance of a binary
classification model as its discrimination threshold is varied.
o It plots the true positive rate (TPR, or sensitivity) against the false
positive rate (FPR, or 1-specificity) at various threshold settings.
o The area under the ROC curve (AUC) provides a measure of the
model's ability to distinguish between classes. An AUC of 1 indicates
perfect classification while an AUC of 0.5 is equivalent to random
guessing.
 PCA (Principal Component Analysis):
o A dimensionality reduction technique that transforms high-
dimensional data into a lower-dimensional space by projecting data
onto the most significant features (principal components)
o Principal components are orthogonal directions in feature space, ordered so
that the variance captured decreases with each successive component
o It helps in reducing data complexity, visualizing data in lower
dimensions, and removing redundant information for further
analysis.
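
A brief scikit-learn sketch of both ideas on synthetic data; the dataset and the
logistic-regression scorer are illustrative choices, not part of the question:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# ROC: score each example, then sweep the decision threshold.
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, probs)
print(roc_auc_score(y, probs))          # area under the ROC curve

# PCA: project onto the two orthogonal directions of highest variance.
pca = PCA(n_components=2).fit(X)
print(pca.transform(X).shape, pca.explained_variance_ratio_)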

19. One-Hot Encoding Problems:

 Example: Consider a dataset with a categorical feature "Color" that has
three possible values: "Red", "Blue", and "Green".
 One-Hot Encoding: Create three new binary features, one for each value of
the "Color" feature. Each new feature will be "1" if it corresponds to the
record's actual "Color" and "0" otherwise.
o Before:
Color
Red
Blue
Green
Blue
Red
o After:
Red Blue Green
1 0 0
0 1 0
0 0 1
0 1 0
1 0 0
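
A one-line way to produce the same encoding with pandas, assuming the five-row
Color column above:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
print(pd.get_dummies(df, columns=['Color']))   # adds one 0/1 column per colour value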

20. Neural Network Training Iterations:

 Iterations per Epoch: Number of training samples / Batch size = 20000 /
400 = 50 iterations per epoch.
 Total Iterations: Iterations per epoch * Number of epochs = 50 * 30 = 1500
total iterations.

21. Short Notes on Artificial Neural Networks & Sigmoid Activation:

 Artificial Neural Networks (ANNs):
o Computational models that are inspired by the structure and function
of the human brain.
o They consist of interconnected nodes (neurons) organized in layers
(input, hidden, output) that learn non-linear relationships from the
input data.
o ANNs learn by adjusting the weights and biases of connections
between the neurons through training algorithms (e.g.,
backpropagation).
 Sigmoid Activation Function:
o A non-linear activation function that squashes the output of a neuron
to a range between 0 and 1, often used in the output layer of binary
classification problems.
o Formula: sigmoid(x) = 1 / (1 + exp(-x))
o It introduces non-linearity into the neural network, enabling it to learn
complex patterns. However, it can suffer from the vanishing gradient
problem in very deep networks, because its gradient is close to zero for
large positive or negative inputs.
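
A small numeric sketch of the sigmoid and its gradient, which is why very deep
sigmoid networks can suffer from vanishing gradients:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
s = sigmoid(x)
print(s)            # outputs squashed into (0, 1)
print(s * (1 - s))  # gradient sigma(x)(1 - sigma(x)): at most 0.25, tiny for large |x|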

22. Central Limit Theorem (CLT):

 Explanation: The Central Limit Theorem states that the distribution of
sample means (or sums) of a sufficiently large number of independent,
identically distributed (i.i.d) random variables, regardless of the original
population distribution, approaches a normal distribution as the sample size
increases.
 Importance:
o Allows using parametric statistical techniques (e.g., z-tests, t-tests)
even if the underlying population is not normally distributed.
o Fundamental for hypothesis testing, confidence interval estimation,
and many statistical inferences.
o It is a foundation for sampling and statistical analysis.
 Assumptions:
o Independence: The random variables must be independent.
o Identical Distribution: The random variables must be from the same
distribution.
o Sufficient Sample Size: Sample size should be large (typically, n ≥
30).
 Limitations:
o The theorem is asymptotic, meaning it holds for large sample sizes.
With small sample sizes, it may not be a good approximation.
o If the original population distribution is highly skewed, a larger
sample size might be needed for the sample means to closely follow a
normal distribution.
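
A quick simulation sketch of the theorem with numpy; the exponential population and
sample size of 30 are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
# Means of 10,000 samples of size 30 drawn from a skewed exponential population.
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)
print(sample_means.mean(), sample_means.std())
# The means cluster around 1 with spread ~ 1/sqrt(30), roughly normal despite the skew.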

23. Entropy and Information Gain for Decision Trees:

 Entropy: Measures the impurity or disorder of a set of data.
o For classification problems, it quantifies how mixed the class labels
are in a subset of data.
o Formula for binary classification: Entropy(S) = -p(yes) *
log2(p(yes)) - p(no) * log2(p(no))
o Where p(yes) and p(no) are the proportions of positive and negative
classes, respectively.
 Information Gain: Measures how much a particular feature reduces
entropy when splitting a dataset. It is the difference between the entropy of
the parent node and the weighted average of the entropies of the child
nodes.
o Formula : Information Gain(S, A) = Entropy(S) - Σ(|Sv|/|S|) *
Entropy(Sv)
o Where S is the set of data, A is the feature, and Sv is the subset of S
with a particular value of feature A.
 Use in Decision Trees:
o The algorithm chooses the feature with the highest information gain
at each node, as this split reduces disorder the most and leads to
purer subsets, making classification easier.
o The tree construction continues until all leaves are mostly pure (low
entropy).

24. Hyperplane in SVM:

 Definition: A hyperplane is a decision boundary that divides a dataset into
different classes in the feature space. In 2D, a hyperplane is a line; in 3D, it is
a plane; and in higher dimensions, it is a flat subspace with one dimension fewer
than the feature space.
 Classification: In an SVM, the hyperplane is chosen to maximize the margin
- the distance from the hyperplane to the nearest data points (support
vectors). By maximizing this margin, SVM aims to make the most robust
classification possible and improve generalization.
 Significance of the Margin:
o A larger margin increases the robustness of the classification, making
the model less sensitive to variations in new data, and therefore
generalizes better on unseen data.
o It reduces overfitting, helping to achieve a balance between model
complexity and generalization.

25. Linear Regression Model for Study Hours and Test Scores:

 Data:
o x (Study Hours): 1, 2, 3, 4, 5
o y (Test Score): 20, 30, 40, 50, 60
 Calculate means: Calculate the mean of x (x̄ ) and mean of y (ȳ). x̄ =
(1+2+3+4+5)/5 = 3 , ȳ = (20+30+40+50+60)/5 = 40
 Calculate slope (b): b = Σ[(xi - x̄ )(yi - ȳ)] / Σ[(xi - x̄ )^2]
o Σ[(xi - x̄ )(yi - ȳ)] = (-2)(-20) + (-1)(-10) + (0)(0) + (1)(10) + (2)(20) =
40+10+0+10+40= 100
o Σ[(xi - x̄ )^2] = (-2)^2 + (-1)^2 + (0)^2+ (1)^2 + (2)^2 = 4+1+0+1+4=10
o b = 100/10 = 10
 Calculate y-intercept (a): a = ȳ - b * x̄ = 40 - (10 * 3) = 10
 Regression Equation: y = 10 + 10x
 Prediction: For 4.5 hours: y = 10 + 10 * 4.5 = 55.
 Test Score Prediction: The predicted test score for a student who has
studied for 4.5 hours is 55.
