ML TRW
To make this easier to conceptualize, we can think about comparing two photographs and wanting to figure out whether the object in the photos is the same one. Instead of checking every pixel, similarity learning algorithms find key characteristics and features of the object in each photo (for example, its shape) and compare them.
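As a minimal sketch of this idea, assuming the features of each photo have already been extracted into numeric vectors, the snippet below compares two such vectors with cosine similarity. The vectors and the choice of cosine similarity as the metric are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors extracted from two photos
# (e.g., shape or texture descriptors from some feature extractor).
photo_a = np.array([0.9, 0.1, 0.4, 0.7])
photo_b = np.array([0.8, 0.2, 0.5, 0.6])

score = cosine_similarity(photo_a, photo_b)
print(f"similarity score: {score:.3f}")  # a score close to 1.0 suggests the same object
```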
Similarity learning algorithms can be used for various tasks where the main goal is to
find similarities and relationships between items:
● Recommendation systems: to keep users spending time on the social media platform, similarity learning is used to find content similar to the items a user has already liked and to recommend it.
● Classification: to classify an item into a given class, we check whether the item is similar to the items already in that class.
● Face verification: similarity learning is also used to compare the facial features in an image against a database of faces, verifying and recognizing identities with high accuracy.
● Anomaly detection: by defining what "normal" data looks like, any data that deviates from it can be detected and reported to prevent possible issues (a rough sketch follows this list).
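For example, the anomaly-detection idea above can be acted on by measuring how far a new point lies from the centroid of the "normal" data and flagging it when that distance exceeds a threshold. The data points, the threshold, and the use of Euclidean distance here are illustrative assumptions in this rough sketch.

```python
import numpy as np

# Hypothetical "normal" observations (one data point per row).
normal_data = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.0]])
centroid = normal_data.mean(axis=0)

def is_anomaly(point, threshold: float = 1.0) -> bool:
    """Flag a point whose distance from the 'normal' centroid exceeds the threshold."""
    return float(np.linalg.norm(np.asarray(point) - centroid)) > threshold

print(is_anomaly([1.0, 2.05]))  # False: close to the normal data
print(is_anomaly([5.0, 8.00]))  # True: far from the normal data
```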
Instance-based learners are called "lazy learners" because they do not perform any significant computation or model building during the training phase. Instead, they simply store the training data and only compute predictions when a new data point needs to be classified, essentially "lazily" postponing the heavy lifting until prediction time (a minimal sketch of this pattern follows the list of key points below).
Key points about lazy learners:
● No upfront generalization:
Unlike other learning algorithms that build a general model during training,
instance-based learners directly compare new data points to the stored
training instances without creating a generalized representation.
● Query-based Learning:
When making a prediction, lazy learners simply look at the stored data and
make a decision based on the closest or most relevant instances. For
example, in k-nearest neighbors (k-NN), predictions are based on the k
closest data points to the query point.
● Delayed Computation:
Lazy learners postpone most of the computation until they receive a query.
The prediction process is computationally expensive because it requires
comparing the query with all stored instances, which can be inefficient for
large datasets.
● Example algorithm: K-Nearest Neighbors (KNN):
One of the most well-known lazy learners, KNN stores all the training data
and classifies a new data point based on the class labels of its closest
neighbors.
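As a rough sketch of this store-then-query pattern, assuming a made-up class name and toy data, training a lazy learner amounts to nothing more than keeping the examples around:

```python
import numpy as np

class LazyLearner:
    """Instance-based ("lazy") learner: the training step only memorizes the data."""

    def fit(self, X, y):
        # No model is built here; the raw training set is stored as-is,
        # and all real work is deferred until a query arrives.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

# Hypothetical data: "training" completes instantly because nothing is computed yet.
learner = LazyLearner().fit([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]], ["A", "A", "B"])
print(learner.X_train.shape)  # (3, 2) -- the data is simply held in memory
```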
1. Storage of Training Data: k-NN stores all the training examples and does not
abstract or build a generalized model from them. Instead, the raw data itself is used
to make predictions, which means the algorithm must keep all the training data in
memory.
2. Prediction by Instance Comparison: When a new query (or test instance) comes
in, the k-NN algorithm compares it to every instance in the stored training set to find
the closest k instances. This means that the algorithm's performance depends on
how quickly it can access and compute distances between the query and all stored
instances.
3. No Generalization: Since k-NN doesn’t build a generalized model, it "memorizes"
the training data and retrieves the relevant pieces of it when needed for predictions.
This makes the prediction process computationally expensive, as it must involve all
the data stored in memory during the query phase.
Therefore, the method is termed "memory-based" because it relies heavily on memory to store the entire dataset and uses it directly during the prediction stage, unlike model-based learners, which abstract the data into a model that does not need to reference the training set in full.
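To illustrate this trade-off, the sketch below contrasts the essentially free "training" step (just storing the data) with the per-query cost of computing a distance to every stored instance. The dataset size, dimensionality, and number of neighbors are arbitrary assumptions.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 16))        # hypothetical stored training set
y_train = rng.integers(0, 2, size=50_000)

t0 = time.perf_counter()
stored = (X_train, y_train)                    # "training" = storing the data
t1 = time.perf_counter()

query = rng.normal(size=16)
dists = np.linalg.norm(X_train - query, axis=1)   # compare the query to every stored instance
k_nearest_labels = y_train[np.argsort(dists)[:5]]
t2 = time.perf_counter()

print(f"training time: {t1 - t0:.6f} s (storage only)")
print(f"query time:    {t2 - t1:.6f} s (distances to all {len(X_train)} instances)")
```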
Now, the factory produces a new paper tissue that has acid durability = 3 and strength = 7.
Classify this new tissue as GOOD or BAD.
Step 1
First, determine the parameter K, the number of nearest neighbors.
Therefore, from the given data, K = 3.
Step 2
Calculate the distance between the query instance and all the training samples. Here the query instance is (3, 7), and the distance to each sample is calculated using the Euclidean distance formula.
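In two dimensions, the distance between the query q = (3, 7) and a training sample x = (x1, x2) is d(x, q) = sqrt((x1 - 3)^2 + (x2 - 7)^2). As a small sketch of how one row of the distance table would be produced (the sample point used here is a hypothetical placeholder, since the training table itself is given separately):

```python
import math

def euclidean_distance(x, q):
    """Euclidean distance between a training sample x and the query instance q."""
    return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q)))

query = (3, 7)                               # acid durability = 3, strength = 7
print(euclidean_distance((7, 7), query))     # hypothetical sample -> 4.0
```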
The table below shows the Euclidean distance of every paper from the query instance (3, 7):
Step 3
Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
The table below shows the sorted distances and the resulting rank, from which the nearest neighbors are decided for each paper:
Step 4
Collect the Quality of the nearest neighbors. Hence, in the table below, the Quality of Paper_2 is not included because its rank is greater than 3.
The table shows the Quality of each paper among the nearest neighbors:
Step 5
Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance.
Here, the nearest neighbors contribute 2 Good votes and 1 Bad vote for quality.
Hence, since 2 Good > 1 Bad, the conclusion is that the new sample, Paper_5, which passes the laboratory test with Acid durability = 3 and Strength = 7, belongs to the Good quality category.
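Putting the five steps together, a minimal end-to-end sketch of this classification could look like the following. The training values are illustrative placeholders standing in for the example's table (which is not reproduced in this section); only the query (3, 7) and K = 3 come directly from the example.

```python
import math
from collections import Counter

# Step 1: choose K, the number of nearest neighbors.
K = 3

# Hypothetical training samples: (acid durability, strength) -> quality.
training = {
    "Paper_1": ((7, 7), "Bad"),
    "Paper_2": ((7, 4), "Bad"),
    "Paper_3": ((3, 4), "Good"),
    "Paper_4": ((1, 4), "Good"),
}

query = (3, 7)  # the new paper tissue: acid durability = 3, strength = 7

# Step 2: Euclidean distance from the query to every training sample.
distances = {name: math.dist(point, query) for name, (point, _) in training.items()}

# Step 3: sort by distance and keep the K nearest neighbors.
nearest = sorted(distances, key=distances.get)[:K]

# Step 4: collect the quality labels of those neighbors.
votes = [training[name][1] for name in nearest]

# Step 5: the simple majority of the votes is the predicted quality.
prediction = Counter(votes).most_common(1)[0][0]
print(nearest, votes, prediction)  # e.g. ['Paper_3', 'Paper_4', 'Paper_1'] -> Good
```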