Assignment 1ML
MACHINE LEARNING
QUESTION 1
What is overfitting in machine learning? How can it be mitigated?
Overfitting happens when a model learns not only the underlying pattern in the training data but
also the noise or random fluctuations. The model performs very well on the training data but
poorly on new, unseen data and fails to generalize.
How it can be Mitigated
Overfitting can be reduced by using techniques that limit model complexity and encourage
better generalization to new data. They include:
i. Data augmentation - Creating additional training data by applying transformations
like rotation, flipping, or scaling to existing data, which helps the model learn
more robust features
ii. Regularization - Adding a penalty term to the loss function that discourages the
model from assigning large weights to parameters, effectively limiting model
complexity.
L1 Regularization (Lasso): Encourages sparsity by pushing some
coefficients to zero.
L2 Regularization (Ridge): Penalizes large weights, preventing them
from becoming too extreme.
iii. Cross-validation - Splitting the dataset into multiple folds, training the model on
different subsets of the data, and evaluating its performance on the remaining folds to
get a more robust estimate of generalization ability.
iv. Feature selection - Choosing only the most relevant features to train the model on,
eliminating noise and reducing model complexity.
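To make the regularization idea concrete, here is a minimal NumPy sketch of ridge (L2) regression solved in closed form; the synthetic data, the `ridge_fit` helper, and the penalty strengths are all made up for this illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Solve the ridge normal equations (X^T X + alpha*I) w = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic regression problem: 50 samples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

w_no_reg = ridge_fit(X, y, alpha=0.0)    # ordinary least squares
w_strong = ridge_fit(X, y, alpha=100.0)  # heavy L2 penalty

# The penalty shrinks the weights toward zero, limiting model complexity.
print(np.linalg.norm(w_strong) < np.linalg.norm(w_no_reg))  # True
```

The same shrinking effect is what the L2 penalty term contributes inside gradient-based training; L1 (lasso) differs in that it can push individual weights exactly to zero.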
QUESTION 3
How does the k-means clustering algorithm work? Provide an example.
K-means groups similar data points into clusters by minimizing the distance between the
data points in a cluster and their centroid (the mean of the cluster). The primary goal of the
k-means algorithm is to minimize the total distance between points and their assigned cluster centroid.
Steps:
1. Initialize centroids: Choose K initial centroids randomly. These can either be chosen
randomly from the dataset or using some other method like K-means++.
2. Assign data points to the nearest centroid: For each data point in the dataset, compute
the distance to each of the K centroids (commonly using Euclidean distance), and assign
each point to the cluster whose centroid is closest.
3. Recompute centroids: After assigning the data points to the clusters, recalculate the
centroid of each cluster. The new centroid is the mean (average) of all the data points
assigned to that cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer change (or the change is below
a certain threshold) or a maximum number of iterations is reached.
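The four steps above can be sketched directly in NumPy (this is Lloyd's algorithm; the `kmeans` function and its plain random initialization are an illustrative simplification, not K-means++):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# The six 2-D points from the example below, clustered with K = 2.
points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
labels, centroids = kmeans(points, k=2)
# Points (1, 2), (1.5, 1.8), (1, 0.6) end up in one cluster;
# (5, 8), (8, 8), (9, 11) in the other.
```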
Example:
Imagine you have the following dataset of 2D points:
(1, 2), (1.5, 1.8), (5, 8), (8, 8), (1, 0.6), (9, 11)
Let's say we want to divide these points into 2 clusters (i.e., K = 2).
1. Initialize centroids: Suppose we randomly pick two points as the initial centroids:
o Centroid 1: (1, 2)
o Centroid 2: (5, 8)