DS - Module 3
• Cosine Similarity: SC(x, y) = (x · y) / (||x|| · ||y||)
The Jaccard similarity will be 0 if the two sets share no values and 1 if the
two sets are identical. The sets may contain either numerical values or strings.
Additionally, this function can be used to find the dissimilarity
between two sets by calculating d(A, B) = 1 − J(A, B).
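A minimal Python sketch (the sets and function names are illustrative) of the Jaccard similarity and the corresponding dissimilarity:

def jaccard_similarity(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|: 0 for disjoint sets, 1 for identical sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_dissimilarity(a, b):
    # d(A, B) = 1 - J(A, B)
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))     # 2/4 = 0.5
print(jaccard_dissimilarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5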
• Mahalanobis Distance
• Can also be used between two real-valued vectors, and has the
advantage over Euclidean distance that it takes the correlation of the
data into account and is scale-invariant.
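A small numpy sketch (the data is illustrative) of the Mahalanobis distance d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)), where S is the covariance matrix estimated from the data:

import numpy as np

data = np.array([[1.0, 2.0], [2.0, 1.8], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
S_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse of the sample covariance matrix

def mahalanobis(x, y, S_inv):
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

print(mahalanobis(data[0], data[4], S_inv))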
Similarity or distance metrics
• Hamming Distance
• Can be used to find the distance between two strings or pairs of words or DNA
sequences of the same length.
• The distance between olive and ocean is 4 because aside from the “o” the other 4
letters are different.
O L I V E
O C E A N
0 1 1 1 1   (1 marks a position where the letters differ)
• Total = 4
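A short Python sketch of the Hamming distance for equal-length strings:

def hamming_distance(s, t):
    # number of positions at which two equal-length strings differ
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)

print(hamming_distance("olive", "ocean"))  # 4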
Similarity or distance metrics
• Manhattan
• This is also a distance between two real-valued k-dimensional vectors.
• The image to have in mind is that of a taxi having to travel the city streets of
Manhattan, which is laid out in a grid-like fashion (you can’t cut diagonally across
buildings).
• The distance is therefore defined as the sum of the absolute differences of the
corresponding coordinates: d(x, y) = |x1 − y1| + |x2 − y2| + … + |xk − yk|.
(Example data points from the slide: (40, 20), (50, 50), (60, 90), (10, 25), (70, 70), (60, 10), (25, 80).)
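A minimal Python sketch of the Manhattan (taxicab) distance between two k-dimensional vectors, using two of the example points above:

def manhattan_distance(x, y):
    # sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance((40, 20), (50, 50)))  # 10 + 30 = 40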
k-means (unsupervised)
• Goal is to segment data into clusters or strata
• Important for marketing research where you need to determine
your sample space.
• Assumptions:
• Labels are not known.
• You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (centers of data) and place them
near “clusters” of data.
• Assign each data point to a centroid.
• Move the centroids to the average location of the data points
assigned to it.
• Repeat the previous two steps until the data point
assignments don’t change.
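A minimal numpy sketch of these steps (initialization is simplified to the first k points and empty clusters are not handled; a real implementation would do both more carefully):

import numpy as np

def kmeans(points, k, n_iter=100):
    # Step 1: choose k initial centroids (here, simply the first k points)
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 1], [2, 1], [2, 3], [2, 3], [4, 3], [5, 5]], dtype=float)
print(kmeans(points, k=2))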
k-means (unsupervised)
• Let’s say you have users for whom you know:
• X: how many ads have been shown to each user (the number of impressions), and
• Y: how many times each user has clicked on an ad (the number of clicks).
K-Means Problem
• Given a data set of six objects characterised by two features, use the
k-means clustering algorithm to divide the following data into two
clusters, and find the updated cluster centres after two iterations.
Point: a1 a2 a3 a4 a5 a6
X:      1  2  2  2  4  5
Y:      1  1  3  3  3  5
• Step 1: With initial cluster centres v1 = (2, 1) and v2 = (2, 3), find the distance
from each data point to each centre and assign the point to the nearer one:
Point  (X, Y)   Distance to v1   Distance to v2   Cluster
a1     (1, 1)   1                2.236068         v1
a2     (2, 1)   0                2                v1
a3     (2, 3)   2                0                v2
a4     (2, 3)   2                0                v2
a5     (4, 3)   2.828427         2                v2
a6     (5, 5)   5                3.605551         v2
• Step 2: Cluster of v1: {a1, a2}
• Cluster of v2: {a3, a4, a5, a6}
• Step 3: Recalculate the cluster center v1 and v2
• v1 = 1/2 * [(1, 1) + (2, 1)] = (1.5, 1)
• v2 = 1/4 * [(2, 3) + (2, 3) + (4, 3) + (5, 5)] = (3.25, 3.5)
• Step 4: Repeat from Step 1 until the cluster centres (or, equivalently, the
cluster assignments) are the same as in the previous iteration.
• Second Iteration:
• Step 1: Find the distance between the updated cluster centres
and each data point.
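A small numpy sketch (not part of the original slides) that carries out this second iteration on the data above:

import numpy as np

points = np.array([[1, 1], [2, 1], [2, 3], [2, 3], [4, 3], [5, 5]], dtype=float)
centroids = np.array([[1.5, 1.0], [3.25, 3.5]])  # v1 and v2 after the first iteration

# distance from every point to every centre, then nearest-centre assignment
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# recompute the centres from the new assignments
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
print(dists.round(6))
print(labels)          # cluster membership after the second iteration
print(new_centroids)   # updated cluster centres after the second iteration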
• Spam Filters
• Naive Bayes
• Wrangling
Learning by Example (Spam)
• Consider a collection of rows of raw email text.
Contd…
• You may notice that several of the rows of text look like spam.
• How did you figure this out?
• Can you write code to automate the spam filter that your brain represents?
Contd…
• Rachel’s class had a few ideas about what might be clear signs of spam:
1. Any email is spam if it contains certain giveaway phrases (e.g., “Ad Credits”, “You
Have Won”, and so on). That’s a good rule to start with, but as you’ve likely seen in
your own email, people (spammers) figured out this spam filter rule and got
around it by modifying the spelling.
2. Maybe something about the length of the subject gives it away as spam, or perhaps
excessive use of exclamation points or other punctuation. But some words like
“Yahoo!” are authentic, so you don’t want to make your rule too simplistic.
And here are a few suggestions regarding
code you could write to identify spam:
• Try a probabilistic model. In other words, instead of a few simple rules, should you
have many rules of thumb that aggregate together to provide the probability of a
given email being spam? This is a great idea.
• What about k-nearest neighbors or linear regression? You learned about these
techniques in the previous chapter, but do they apply to this kind of problem? (Hint:
the answer is “No.”)
Why Won’t Linear Regression Work for Filtering Spam?
• Consider a dataset or matrix where each row represents a different email message (it
could be keyed by email_id).
• Now let’s make each word in the email a feature—this means that we create a
column called “You_Have_Won,” for example, and then for any message that has the
word “You_Have_Won” in it at least once, we put a 1 in; otherwise we assign a 0.
• Alternatively, we could put in the number of times the word appears. Then each column
represents the appearance of a different word (a small sketch of this matrix appears after this list).
• In order to use linear regression, we need to have a training set of emails where
the messages have already been labeled with some outcome variable. In this case,
the outcomes are either spam or not.
• Strictly speaking, this option really isn’t ideal; linear regression is aimed at
modeling a continuous output and this is binary.
• We have on the order of 10,000 emails with on the order of 100,000 words. This won’t work.
• Technically, this corresponds to the fact that the matrix in the equation for linear regression is
not invertible—in fact, it’s not even close.
• With carefully chosen feature selection and domain expertise, we could limit it to 100 words
(from 100,000 words) and that could be enough! But again, we’d still have the issue that
linear regression is not the appropriate model for a binary outcome.
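A tiny Python sketch (the example emails are illustrative) of the binary email-by-word matrix described above:

# rows = emails, columns = words; entry is 1 if the word occurs in the email, else 0
emails = [
    "you have won claim your ad credits now",    # spam-like text
    "meeting moved to tuesday please confirm",   # non-spam text
]
vocabulary = sorted({word for email in emails for word in email.split()})
matrix = [[1 if word in email.split() else 0 for word in vocabulary] for email in emails]

for email_id, row in enumerate(matrix):
    print(email_id, row)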
How About k-nearest Neighbors?
• To use k-nearest neighbors (k-NN) to create a spam filter:
• We would still need to choose features, probably corresponding to words,
and we’d likely define the value of those features to be 0 or 1, depending
on whether the word is present or not.
• Then, we’d need to define when two emails are “near” each other based on
which words they both contain.
Contd…
• Again, in this case with 10,000 emails and 100,000 words, we’ll
encounter a problem:
• Namely, the space we’d be working in has too many dimensions. Yes,
computing distances in a 100,000-dimensional space requires lots of
computational work. But that’s not the real problem.
• The real problem is even more basic: even our nearest neighbors are
really far away. This is called “the curse of dimensionality,” and it makes
k-NN a poor algorithm in this case.
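A small numpy experiment (random uniform data, purely illustrative) showing the curse of dimensionality: as the number of dimensions grows, the nearest neighbor of a query point is barely closer than the farthest one:

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 100, 10_000):
    points = rng.random((200, dim))   # 200 random points in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # a ratio near 1 means the "nearest" and "farthest" points are almost equally far away
    print(dim, round(dists.min() / dists.max(), 3))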
Naive Bayes
• Bayes Law
• Let’s say we’re testing for a rare disease, where 1% of the
population is infected. We have a highly sensitive and specific test,
which is not quite perfect:
• 99% of sick patients test positive.
• 99% of healthy patients test negative.
• Given that a patient tests positive, what is the probability
that the patient is actually sick?
• Imagine we have 10,000 perfectly representative people: 100 (1%) are sick
and 9,900 are healthy. Of the 100 sick people, 99 test positive; of the 9,900
healthy people, 1% (99) test positive anyway. So among the 198 positive tests,
only 99 come from sick patients: p(sick | positive) = 99 / 198 = 50%.
• Recall from your basic statistics course that, given events x and y,
• there’s a relationship between the probabilities of either event (denoted
p(x) and p(y)),
• the joint probability (both happen, denoted p(x, y)),
• and the conditional probability (event x happens given that y happens, denoted
p(x | y)), as follows:
• p(x, y) = p(x | y) p(y) = p(y | x) p(x), which rearranges to Bayes’ Law:
p(y | x) = p(x | y) p(y) / p(x).
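A tiny Python check that applies Bayes’ Law to the disease-testing example above:

# p(sick | positive) = p(positive | sick) * p(sick) / p(positive)
p_sick = 0.01
p_pos_given_sick = 0.99
p_pos_given_healthy = 0.01  # 99% of healthy patients test negative

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos_given_sick * p_sick / p_pos)  # 0.5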
• The term log(θj / (1 − θj)) doesn’t depend on a given email, just on the word,
• so let’s rename it wj.
• Same with the quantity Σj log(1 − θj) = w0.
• Compute these weights once; there is no need to compute them again and again for each email.
• The quantities that do vary by email are the xjs. We need to compute them separately
for each email, but that shouldn’t be too hard.
• We can put together what we know to compute p(x | c), and then use Bayes’ Law to
get an estimate of p(c | x), which is what we actually want.
• This helps us make the final decision about whether an email is spam or not.
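A minimal sketch of this scoring scheme (the word probabilities θj are illustrative), with binary features xj, per-word weights wj = log(θj / (1 − θj)), and the constant w0 = Σj log(1 − θj):

import math

theta = {"won": 0.6, "credits": 0.4, "meeting": 0.05}  # illustrative P(word appears | spam)

# precompute the weights once; they do not depend on any particular email
w = {word: math.log(t / (1 - t)) for word, t in theta.items()}
w0 = sum(math.log(1 - t) for t in theta.values())

def log_p_x_given_spam(email_words):
    # log p(x | spam) = sum_j x_j * w_j + w_0, with x_j = 1 if word j is present
    return sum(w[word] for word in theta if word in email_words) + w0

print(log_p_x_given_spam({"you", "have", "won", "credits"}))
print(log_p_x_given_spam({"meeting", "tomorrow"}))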
• Training and Performance:
• Naive Bayes is inexpensive to train if we have labeled data.
• It works by counting the occurrence of words in spam and non-spam emails.
• With more training data, we can improve the accuracy of the filter.
• Practical Implementation:
• In practice, there's a main model adjusted for individual users.
• Before using the main model, simple rules are applied quickly to sort emails.
Laplace Smoothing
• θj represents the probability of seeing a specific word (indexed by j) in a
spam email.
• It's calculated as a ratio of counts: θj = njc / nc.
• Here, njc is the number of spam emails in which word j appears, and
nc is the total number of spam emails.
• Laplace Smoothing is about refining the estimate of θj by adding a
smoothing factor.
• The formula becomes: θjc = (njc + α) / (nc + β).
• By choosing α and β (e.g., α=1 and β=10), we prevent extreme
probabilities like 0 or 1.
• Laplace Smoothing can be seen as incorporating prior knowledge
into the probability estimation.
• The maximum likelihood estimate (θML) seeks the values of θ
that make the observed data most probable.
• Assuming independent trials, we maximize the log-likelihood
function for each word separately.
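A small Python sketch (the counts are illustrative) contrasting the raw maximum likelihood estimate with the Laplace-smoothed estimate using α = 1 and β = 10:

def theta_ml(n_jc, n_c):
    # raw MLE: can hit the extremes 0 or 1
    return n_jc / n_c

def theta_smoothed(n_jc, n_c, alpha=1, beta=10):
    # Laplace smoothing pulls the estimate away from 0 and 1
    return (n_jc + alpha) / (n_c + beta)

n_c = 50               # illustrative: 50 spam emails in the training set
for n_jc in (0, 50):   # word never seen in spam vs. seen in every spam email
    print(n_jc, theta_ml(n_jc, n_c), round(theta_smoothed(n_jc, n_c), 3))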
• Often, you have to figure out how to actually go get the data you need in order
to ask a question, solve a problem, do some research, etc.
• One way to get beyond this is to use Yahoo’s YQL language, which allows you to go
to the Yahoo! Developer Network and write SQL-like queries that interact with the
APIs of many common sites.
• The output is standard, and you only have to parse it in Python or R once.
But what if you want data when there’s no API available?
• In this case you might want to use something like the Firebug extension for
Firefox. You can “inspect the element” on any web page, and Firebug allows
you to grab the field inside the HTML. In fact, it gives you access to the full
HTML document so you can interact and edit. In this way you can see the
HTML as a map of the page and Firebug as a kind of tour guide.
• After locating the stuff you want inside the HTML, you can use curl, wget,
grep, awk, perl, etc., to write a quick-and-dirty shell script to grab what you
want, especially for a one-off grab. If you want to be more systematic, you
can also do this using Python or R.
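A quick-and-dirty Python sketch of this kind of one-off grab (the URL and CSS selector are placeholders, and it assumes the third-party requests and beautifulsoup4 packages are installed; Beautiful Soup is discussed below):

import requests
from bs4 import BeautifulSoup

# placeholder URL: replace it with the page you inspected
url = "https://example.com/some-page"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
for cell in soup.select("td.price"):   # hypothetical CSS selector for the field you want
    print(cell.get_text(strip=True))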
Other parsing tools you might want to look into include:
• Beautiful Soup
• Robust but kind of slow.
• PostScript
• Image classification.