
ASSIGNMENT 1

MACHINE LEARNING
QUESTION 1
What is overfitting in machine learning? How can it be mitigated?
Overfitting happens when a model learns not only the underlying pattern in the training data but also the noise or random fluctuations. The model performs very well on the training data but poorly on new, unseen data: it fails to generalize.
How it can be mitigated
By using techniques that reduce model complexity and encourage better generalization to new data. These include:
i. Data augmentation - Creating additional training data by applying transformations
like rotation, flipping, or scaling to existing data, which helps the model learn
more robust features
ii. Regularization - Adding a penalty term to the loss function that discourages the model from assigning large weights to parameters, effectively limiting model complexity.
- L1 Regularization (Lasso): Encourages sparsity by pushing some coefficients to zero.
- L2 Regularization (Ridge): Penalizes large weights, preventing them from becoming too extreme.
iii. Cross-validation - Splitting the dataset into multiple folds, training the model on different subsets of the data, and evaluating its performance on the remaining folds to get a more robust estimate of generalization ability.
iv. Feature selection - Choosing only the most relevant features to train the model on, eliminating noise and reducing model complexity.

v. Early stopping - Monitoring the model's performance on a validation set during training and stopping the training process when performance on the validation set starts to decline, preventing the model from overfitting to the training data.
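To make the regularization idea concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed, with synthetic data invented purely for illustration) that fits L2 (Ridge) and L1 (Lasso) regularized linear models; the alpha parameter sets the strength of the penalty term:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: only the first feature carries signal;
# the remaining 19 features are pure noise a model could overfit to.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] + rng.normal(size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)  # L2: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X_train, y_train)  # L1: zeroes out some weights

print("Ridge test R^2:", ridge.score(X_test, y_test))
print("Lasso test R^2:", lasso.score(X_test, y_test))
print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))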
QUESTION 2
Differentiate between classification and regression tasks.
Classification is a supervised machine learning task where the goal is to predict a discrete label or category; input data is assigned to one of a set of classes, e.g. image recognition.
Regression, by contrast, is a supervised learning task where the goal is to predict a continuous output value, typically a number, e.g. house price prediction.
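A minimal sketch of the contrast, assuming scikit-learn is installed and using synthetic datasets generated only for illustration:

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the target is a discrete label (here 0 or 1).
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
print(clf.predict(X_c[:3]))  # e.g. [1 0 0] -- discrete class labels

# Regression: the target is a continuous number (e.g. a price).
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:3]))  # e.g. [12.7 -85.3 40.1] -- continuous values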

QUESTION 3
How does the k-means clustering algorithm work? Provide an example.
K-means groups similar data points into clusters by minimizing the distance between the points in a cluster and that cluster's centroid (its mean). The primary goal of the algorithm is to minimize the total distance between all points and their assigned cluster centroids.
Steps:
1. Initialize centroids: Choose K initial centroids, either picked at random from the dataset or selected with a method like k-means++.
2. Assign data points to the nearest centroid: For each data point in the dataset, compute
the distance to each of the K centroids (commonly using Euclidean distance), and assign
each point to the cluster whose centroid is closest.
3. Recompute centroids: After assigning the data points to the clusters, recalculate the
centroid of each cluster. The new centroid is the mean (average) of all the data points
assigned to that cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer change (or the change is below
a certain threshold) or a maximum number of iterations is reached.
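These four steps translate almost line-for-line into code. Below is a minimal NumPy sketch (illustrative only; it ignores the empty-cluster edge case, and in practice a library implementation such as scikit-learn's KMeans would be used):

import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids at random from the dataset.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels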
Example:
Imagine you have the following dataset of 2D points:

(1, 2), (1.5, 1.8), (5, 8), (8, 8), (1, 0.6), (9, 11)
Let's say we want to divide these points into 2 clusters (i.e., K = 2).
1. Initialize centroids: Suppose we randomly pick two points as the initial centroids:
o Centroid 1: (1, 2)
o Centroid 2: (5, 8)
2. Assign points to the nearest centroid:
o For each point, compute the distance to each centroid and assign the point to the nearest one.
o Points closest to (1, 2): (1, 2), (1.5, 1.8), (1, 0.6)
o Points closest to (5, 8): (5, 8), (8, 8), (9, 11)
3. Recompute centroids:
o The new centroid of the first cluster is the mean of its points: mean of (1, 2), (1.5, 1.8), (1, 0.6) = (1.17, 1.47)
o The new centroid of the second cluster is the mean of its points: mean of (5, 8), (8, 8), (9, 11) = (7.33, 9.0)


4. Repeat:
o Reassign each point to the new centroids: points closer to (1.17, 1.47) go to Cluster 1, points closer to (7.33, 9.0) go to Cluster 2.
o Recompute the centroids and repeat the process until they stabilize (no significant change).
Result:
After several iterations, the centroids will converge, and the dataset will be partitioned into two
clusters:
- Cluster 1: Points closer to centroid (1.17, 1.47).
- Cluster 2: Points closer to centroid (7.33, 9.0).
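A quick way to check this result is to run scikit-learn's KMeans on the six example points (assuming scikit-learn and NumPy are installed):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
# n_init=10 reruns the algorithm with 10 different initializations and
# keeps the best result, addressing the sensitivity noted below.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # e.g. [0 0 1 1 0 1] -- cluster of each point
print(km.cluster_centers_)  # approximately [[1.17, 1.47], [7.33, 9.0]]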
Key Points:
- K-means works best when clusters are spherical and roughly equal in size.
- The algorithm is sensitive to the initial choice of centroids; if the initial centroids are poorly chosen, the result may not be optimal.
- It's often recommended to run the algorithm multiple times with different initializations and choose the best result.
