DS - Module 3

The document discusses the k-Nearest Neighbors (k-NN) algorithm, emphasizing the importance of defining similarity metrics and selecting an appropriate value for k to classify unlabeled items based on their labeled neighbors. It also covers various distance metrics such as Euclidean, Cosine, Jaccard, Mahalanobis, Hamming, and Manhattan distances, along with the process of training and testing datasets. Additionally, it touches on spam filtering using Naive Bayes and the challenges of applying linear regression to binary classification problems.


Exploratory Data Analysis, and the Data Science Process
The intuition behind k-NN
• Consider the other items most similar to the unassigned item, defined in terms of their attributes, look at their labels, and give the unassigned item the majority vote.
• If there’s a tie, randomly select among the labels that have tied for first.
• To automate this, two decisions must be made. First, how do you define similarity or closeness? Once you define it, for a given unlabeled item, you can say how similar all the labeled items are to it, take the most similar items, and call them neighbors, each of which gets a “vote.”
• Second, how many neighbors should you look at, or “let vote”? This value is k.
Overview of the process:

1. Decide on your similarity or distance metric.
2. Split the original labeled dataset into training and test data.
3. Pick an evaluation metric. (Misclassification rate is a good one; we’ll explain this more in a bit.)
4. Run k-NN a few times, changing k and checking the evaluation measure.
5. Optimize k by picking the one with the best evaluation measure.
6. Once you’ve chosen k, train on the same training set and build a new test set from the items (e.g., people’s ages and incomes) that have no labels and whose labels you want to predict. (A sketch of this pipeline is shown below.)
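A minimal sketch of steps 1–5 using scikit-learn, assuming the labeled data sits in a feature matrix X and label vector y; the placeholder data, the candidate k values, and the choice of KNeighborsClassifier are illustrative assumptions, not something the slides prescribe.

```python
# Rough sketch of the k-NN process (steps 1-5), assuming scikit-learn is available.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder labeled data; in practice X holds attributes (e.g., age, income) and y the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 2: split the labeled data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 3-5: run k-NN for several k values and keep the k with the lowest misclassification rate.
best_k, best_err = None, 1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    err = 1.0 - model.score(X_test, y_test)   # misclassification rate on the test set
    if err < best_err:
        best_k, best_err = k, err
print(best_k, best_err)
```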
Similarity or distance metrics
• Euclidean distance
• Cosine Similarity
• Jaccard Distance or
Similarity
• Mahalanobis Distance
• Hamming Distance
• Manhattan
Similarity or distance metrics
• Euclidean distance
• Euclidean distance is a good go-to distance
metric for attributes that are real-valued and
can be plotted on a plane or in multidimensional
space.
• Variables need to be on a similar scale.

d = √[(x₂ – x₁)² + (y₂ – y₁)²]


Calculate the distance between the points (4, 1) and (3, 0).

Using the Euclidean distance formula, where d is the Euclidean distance, (x₁, y₁) is the coordinate of the first point, and (x₂, y₂) is the coordinate of the second point:

⇒ d = √[(x₂ – x₁)² + (y₂ – y₁)²]
⇒ d = √[(3 – 4)² + (0 – 1)²]
⇒ d = √(1 + 1)
⇒ d = √2 ≈ 1.414 units
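A quick check of this worked example in Python; the euclidean function name is just an illustrative helper.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((4, 1), (3, 0)))  # ~1.414
```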
• Cosine Similarity
• Also can be used between two real-valued vectors, x and y , and
will yield a value between –1 (exact opposite) and 1 (exactly the
same) with 0 in between meaning independent.


SC(x, y) = (x · y) / (||x|| × ||y||)

• x · y = dot product of the vectors ‘x’ and ‘y’.
• ||x|| and ||y|| = length (magnitude) of the two vectors ‘x’ and ‘y’.
• ||x|| × ||y|| = ordinary product of the two magnitudes.
• Consider an example to find the similarity between two vectors
– ‘x’ and ‘y’, using Cosine Similarity. The ‘x’ vector has values, x =
{ 3, 2, 0, 5 } The ‘y’ vector has values, y = { 1, 0, 0, 0 }
• The formula for calculating the cosine similarity is:
• SC(x, y) = (x · y) / (||x|| × ||y||)

x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = √(3² + 2² + 0² + 5²) ≈ 6.16
||y|| = √(1² + 0² + 0² + 0²) = 1

∴ SC(x, y) = 3 / (6.16 × 1) ≈ 0.49

The dissimilarity between the two vectors ‘x’ and ‘y’ is given by:

∴ DC(x, y) = 1 – SC(x, y) = 1 – 0.49 = 0.51

The cosine similarity between two vectors is measured by the angle ‘θ’ between them.

• If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are similar.
• If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.
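The same calculation in Python, reproducing the 0.49 similarity; cosine_similarity here is an illustrative helper, not a library function.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

sc = cosine_similarity([3, 2, 0, 5], [1, 0, 0, 0])
print(round(sc, 2), round(1 - sc, 2))  # ~0.49 similarity, ~0.51 dissimilarity
```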
Similarity or distance metrics
• Jaccard Distance or Similarity
• This gives the distance between a set of objects.
• For example,
• Consider a list of Cathy’s friends A= {Kahn,Mark,Laura, . . .}
• And a list of Rachel’s friends B= {Mladen,Kahn,Mark, . . .}
• Similarity of these 2 sets can be measured with the Jaccard similarity:

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard similarity will be 0 if the two sets don’t share any values and 1 if the two sets are identical. The sets may contain either numerical values or strings. Additionally, the dissimilarity between two sets can be found by calculating d(A, B) = 1 – J(A, B).
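A minimal sketch computing Jaccard similarity for the friend-list example; the jaccard helper and the truncated friend sets are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

cathy = {"Kahn", "Mark", "Laura"}
rachel = {"Mladen", "Kahn", "Mark"}
print(jaccard(cathy, rachel))        # 2 shared / 4 total = 0.5
print(1 - jaccard(cathy, rachel))    # Jaccard distance
```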
• Mahalanobis Distance
• Also can be used between two real-valued vectors and has the
advantage over Euclidean distance that it takes into account
correlation and is scale-invariant.
Similarity or distance metrics
• Hamming Distance
• Can be used to find the distance between two strings or pairs of words or DNA
sequences of the same length.
• The distance between olive and ocean is 4 because aside from the “o” the other 4
letters are different.

O L I V E
O C E A N
0 1 1 1 1
• Total - 4
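A quick check of the olive/ocean example; hamming is an illustrative helper that assumes equal-length strings.

```python
def hamming(s, t):
    """Hamming distance between two equal-length strings."""
    assert len(s) == len(t), "strings must be the same length"
    return sum(a != b for a, b in zip(s, t))

print(hamming("olive", "ocean"))  # 4
```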
Similarity or distance metrics
• Manhattan
• This is also a distance between two real-valued k-dimensional vectors.
• The image to have in mind is that of a taxi having to travel the city streets of
Manhattan, which is laid out in a grid-like fashion (you can’t cut diagonally across
buildings).
• The distance is therefore defined as

d(x, y) = Σᵢ |xᵢ – yᵢ|

• where i indexes the elements of each of the vectors.
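A short illustrative helper for Manhattan (taxicab) distance.

```python
def manhattan(x, y):
    """Manhattan (taxicab) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan((4, 1), (3, 0)))  # 2 blocks: 1 east/west + 1 north/south
```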


Training and test sets
• Train Test split
Pick an evaluation metric
• When assessing the performance of a model, selecting the appropriate evaluation metric is
crucial. It determines how well the model's predictions align with the desired outcomes.
• Customized Evaluation: Evaluation metrics are not one-size-fits-all; they need to be
tailored to the specific problem at hand. Different scenarios may prioritize certain types of
errors over others, requiring collaboration with domain experts to define suitable metrics.
• For instance, in medical diagnosis, such as detecting cancer, minimizing false negatives
(misdiagnosing someone as not having cancer when they actually do) is often prioritized.
This emphasizes the importance of tuning evaluation metrics in consultation with medical
professionals.
• Trade-offs in Sensitivity and Specificity: Evaluation involves striking a balance between
sensitivity (correctly diagnosing ill patients as ill) and specificity (correctly diagnosing well
patients as well). Achieving perfect sensitivity may lead to overdiagnosis, highlighting the
need for a nuanced approach.
Pick an evaluation metric
• Accuracy and Misclassification Rate: Accuracy, the ratio of correct
labels to total labels, is a fundamental metric. Conversely, the
misclassification rate, which is the complement of accuracy, quantifies the
proportion of incorrect predictions. Minimizing the misclassification rate
ultimately maximizes accuracy.
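A tiny illustration of accuracy versus misclassification rate on hypothetical true and predicted labels.

```python
# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
misclassification_rate = 1 - accuracy
print(accuracy, misclassification_rate)  # 0.75 0.25
```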
Choosing k
• Run k-NN a few times, changing k, and checking the evaluation metric
each time.
k-Nearest Neighbor/k-NN (supervised)
• Used when you have many objects that are classified into categories but
have some unclassified objects (e.g. movie ratings).
k-Nearest Neighbor/k-NN (supervised)
• Pick a k value (usually a low odd number, but up to you to pick).
• Find the closest number of k points to the unclassified point (using various
distance measurement techniques).
• Assign the new point to the class where the majority of closest points lie.
• Run algorithm again and again using different k’s.
Brightness Saturation

40 20
50 50
60 90
10 25
70 70
60 10
25 80
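A sketch of the k-NN steps above applied to the brightness/saturation table; the class labels and the query point are hypothetical, since the slide does not list them, and k = 3 is an arbitrary choice.

```python
import math

# Brightness/saturation rows from the slide, paired with HYPOTHETICAL class labels.
points = [(40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"), (10, 25, "Red"),
          (70, 70, "Blue"), (60, 10, "Red"), (25, 80, "Blue")]
query = (20, 35)   # hypothetical unclassified point
k = 3

# Find the k labeled points closest to the query point.
nearest = sorted(points, key=lambda p: math.dist(query, p[:2]))[:k]

# Assign the new point to the class holding the majority among those neighbors.
votes = [label for _, _, label in nearest]
print(max(set(votes), key=votes.count))
```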
k-means (unsupervised)
• Goal is to segment data into clusters or strata
• Important for marketing research where you need to determine
your sample space.
• Assumptions:
• Labels are not known.
• You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (centers of data) and place them
near “clusters” of data.
• Assign each data point to a centroid.
• Move the centroids to the average location of the data points
assigned to it.
• Repeat the previous two steps until the data point
assignments don’t change.
k-means (unsupervised)
• Let’s say you have users where
you know
• X → how many ads have been shown to each user (the number of impressions), and
• Y → how many times each user has clicked on an ad (the number of clicks).
K-Means Problem
• Given a data set of six objects characterised by two features, use the k-means clustering algorithm to divide the following data into two clusters, and find the updated cluster centres after two iterations.

X:  1  2  2  2  4  5
Y:  1  1  3  3  3  5

Note: The initial cluster centres are C1 (2, 1) and C2 (2, 3).
• v1= (2, 1) and v2 = (2, 3)
• First Iteration
• Step 1: Find the distance between the cluster centers and each
data points
Data Point   Distance from v1 (2, 1)   Distance from v2 (2, 3)   Assigned Center
a1 (1, 1)    1                         2.236068                  v1
a2 (2, 1)    0                         2                         v1
a3 (2, 3)    2                         0                         v2
a4 (2, 3)    2                         0                         v2
a5 (4, 3)    2.828427                  2                         v2
a6 (5, 5)    5                         3.605551                  v2
• Step 2: Cluster of v1: {a1, a2}
• Cluster of v2: {a3, a4, a5, a6}
• Step 3: Recalculate the cluster center v1 and v2
• v1 = 1/2 * [(1, 1) + (2, 1)] = (1.5, 1)
• v2 = 1/4 * [(2, 3) + (2, 3) + (4, 3) + (5, 5)] = (3.25, 3.5)
• Step 4: Repeat from step 1 until we get same cluster center or
same cluster element as in the previous iteration.
• Second Iteration:
• Step 1: Find the distance between the updated cluster centers
and each data points

Data Point   Distance from v1 (1.5, 1)   Distance from v2 (3.25, 3.5)   Assigned Center
a1 (1, 1)    0.5                         3.363406                       v1
a2 (2, 1)    0.5                         2.795085                       v1
a3 (2, 3)    2.061553                    1.346291                       v2
a4 (2, 3)    2.061553                    1.346291                       v2
a5 (4, 3)    3.201562                    0.901388                       v2
a6 (5, 5)    5.315073                    2.304886                       v2
• Step 2: Cluster of v1 : {a1, a2}
• Cluster of v2: {a3, a4, a5, a6}
• Step 3: Recalculate the cluster center v1 and v2
• v1 = 1/2 * [(1, 1) + (2, 1)] = (1.5, 1)
• v2 = 1/4 * [(2, 3) + (2, 3) + (4, 3) + (5, 5)] = (3.25, 3.5)
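A minimal sketch of the k-means loop above, run on the six points with the given initial centres; it reproduces the cluster centres (1.5, 1) and (3.25, 3.5). Using plain Python lists rather than a library is just for illustration.

```python
import math

points = [(1, 1), (2, 1), (2, 3), (2, 3), (4, 3), (5, 5)]
centers = [(2, 1), (2, 3)]           # initial centres C1 and C2

for iteration in range(2):
    # Steps 1-2: assign each point to its nearest centre.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    # Step 3: move each centre to the mean of the points assigned to it.
    centers = [tuple(sum(c) / len(c) for c in zip(*members)) for members in clusters]

print(centers)  # [(1.5, 1.0), (3.25, 3.5)]
```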
Spam Filters,
Naive Bayes, and
Wrangling
OUTLINE

• Spam Filters
• Naive Bayes
• Wrangling
Learning by Example (Spam)
• Consider a collection of email text like the sample shown on the original slide.
Contd…
• You may notice that several of the rows of text look like spam.
• How did you figure this out?
• Can you write code to automate the spam filter that your brain represents?
Contd…
• Rachel’s class had a few ideas about what things might be clear signs of spam:
1. Any email is spam if it contains telltale phrases (e.g., “Ad Credits,” “You Have Won,” and so on). That’s a good rule to start with, but as you’ve likely seen in your own email, spammers figured out this filter rule and got around it by modifying the spelling.

2. Maybe something about the length of the subject gives it away as spam, or perhaps
excessive use of exclamation points or other punctuation. But some words like
“Yahoo!” are authentic, so you don’t want to make your rule too simplistic.
And here are a few suggestions regarding
code you could write to identify spam:
• Try a probabilistic model. In other words, should you not have simple rules, but
have many rules of thumb that aggregate together to provide the probability of a
given email being spam? This is a great idea.

• What about k-nearest neighbors or linear regression? You learned about these
techniques in the previous chapter, but do they apply to this kind of problem? (Hint:
the answer is “No.”)
Why Won’t Linear Regression Work for Filtering Spam?

• Consider a dataset or matrix where each row represents a different email message (it
could be keyed by email_id).

• Now let’s make each word in the email a feature—this means that we create a
column called “You_Have_Won,” for example, and then for any message that has the
word “You_Have_Won” in it at least once, we put a 1 in; otherwise we assign a 0.

• Alternatively, we could put the number of times the word appears. Then each column
represents the appearance of a different word.
• In order to use linear regression, we need to have a training set of emails where
the messages have already been labeled with some outcome variable. In this case,
the outcomes are either spam or not.

• We could do this by having human evaluators label messages “spam,” which is a reasonable, but time-intensive, solution.

• Another way to do it would be to take an existing spam filter, such as Gmail’s spam filter, and use those labels.
Contd…
• The first thing to consider is that your target is binary (0 if not spam, 1 if spam) -
you wouldn’t get a 0 or a 1 using linear regression; you’d get a number.

• Strictly speaking, this option really isn’t ideal; linear regression is aimed at
modeling a continuous output and this is binary.

• But this issue on its own is not the main obstacle.


Contd…
• Remember we should use a model appropriate for the data.
• But if we wanted to fit it in R, in theory it could still work.
• R doesn’t check for us whether the model is appropriate or not. We could go for
it, fit a linear model, and then use that to predict and then choose a critical value
(Threshold Value) so that above that predicted value we call it “1” and below we
call it “0.”
Contd…
• But if we went ahead and tried, it still wouldn’t work because there are too many variables
compared to observations!

• We have on the order of 10,000 emails with on the order of 100,000 words. This won’t work.

• Technically, this corresponds to the fact that the matrix in the equation for linear regression is
not invertible—in fact, it’s not even close.

• Moreover, maybe we can’t even store it because it’s so huge.

• With carefully chosen feature selection and domain expertise, we could limit it to 100 words
(from 100,000 words) and that could be enough! But again, we’d still have the issue that
linear regression is not the appropriate model for a binary outcome.
How About k-nearest
Neighbors?
• To use k-nearest neighbors (k-NN) to create a spam filter:
• We would still need to choose features, probably corresponding to words,
and we’d likely define the value of those features to be 0 or 1, depending
on whether the word is present or not.

• Then, we’d need to define when two emails are “near” each other based on
which words they both contain.
Contd…
• Again, in this case with 10,000 emails and 100,000 words, we’ll encounter a problem:
• Namely, the space we’d be working in has too many dimensions. Yes,
computing distances in a 100,000-dimensional space requires lots of
computational work. But that’s not the real problem.

• The real problem is even more basic: even our nearest neighbors are
really far away. This is called “the curse of dimensionality,” and it makes k-
NN a poor algorithm in this case.
Naive Bayes
• Bayes Law
• Let’s say we’re testing for a rare disease, where 1% of the
population is infected. We have a highly sensitive and specific test,
which is not quite perfect:
• 99% of sick patients test positive.
• 99% of healthy patients test negative.
• Given that a patient tests positive, what is the probability
that the patient is actually sick?
• Imagine we have 10,000 perfectly representative people. Then 100 are sick and 9,900 are healthy; 99 of the sick test positive, and 99 of the healthy also test positive. So of the 198 positive tests, only half come from people who are actually sick.
• Recall from your basic statistics course that, given events x and y,
• there’s a relationship between the probabilities of either event (denoted p(x) and p(y)),
• the joint probability (both happen, denoted p(x, y)),
• and the conditional probabilities (event x happens given y happens, denoted p(x | y)) as follows:

p(x, y) = p(y | x) p(x) = p(x | y) p(y)

• Using that, we solve for p(y | x) (assuming p(x) ≠ 0) to get what is called Bayes’ Law:

p(y | x) = p(x | y) p(y) / p(x)
• In our example:

p(sick | positive) = p(positive | sick) p(sick) / p(positive)
= (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 0.5

So a patient who tests positive is actually sick only 50% of the time.
A Spam Filter for Individual Words
(Spam Filtering)
• If the word “You have won” appears, this adds to the
probability that the email is spam.
• But it’s not conclusive, yet. We need to see what else is in
the email.
• Let’s first focus on just one word at a time, which we generically call “word.” Then, applying Bayes’ Law, we have:

p(spam | word) = p(word | spam) p(spam) / p(word)
An Example
• Let’s look at some basic statistics on a random Enron employee’s
email.
• We can count 1,500 spam versus 3,672 ham (not spam)
• so, we already know p(spam) and p(ham).
• Let's use the word “meeting” to decide a mail is spam or not.
• 16 spam emails contain the word – “meeting”,
• i.e p(meeting|spam) = 16/1500
• 153 ham emails contain the word – “meeting”,
• i.e p(meeting|ham) = 153/3672
• We can now compute the chance that an email is spam, given that it contains the word “meeting”:

p(spam | meeting) = p(meeting | spam) p(spam) / [p(meeting | spam) p(spam) + p(meeting | ham) p(ham)] = (16/5172) / (169/5172) ≈ 0.09
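A small check of this calculation in Python, using the counts above; the variable names are illustrative.

```python
n_spam, n_ham = 1500, 3672
total = n_spam + n_ham

p_spam, p_ham = n_spam / total, n_ham / total
p_meeting_given_spam = 16 / n_spam
p_meeting_given_ham = 153 / n_ham

# Bayes' Law, with the law of total probability in the denominator.
p_spam_given_meeting = (p_meeting_given_spam * p_spam) / (
    p_meeting_given_spam * p_spam + p_meeting_given_ham * p_ham
)
print(round(p_spam_given_meeting, 2))  # ~0.09
```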
A Spam Filter That Combines
Words: Naive Bayes
• Each email can be represented by a list of words.
• This list is like a big checklist where each word is either checked (1) or unchecked (0)
depending on if it appears in the email.
• Since there are many words, we represent this checklist efficiently using only the
words that are actually in the email.
• The model predicts the chance that an email is spam or not based on the words it
contains.
• Denote the email vector by x and its entries by xⱼ, where j indexes the words.
• For now we can denote “is spam” by c, and we have the following model for p(x | c):

p(x | c) = Πⱼ θⱼ^xⱼ (1 − θⱼ)^(1 − xⱼ)
• The θⱼ here is the probability that an individual word j is present in a spam email.
• To simplify calculations, we take the logarithm of the formula.
• This changes the multiplication into addition:

log p(x | c) = Σⱼ xⱼ log(θⱼ / (1 − θⱼ)) + Σⱼ log(1 − θⱼ)

• The term log(θⱼ / (1 − θⱼ)) doesn’t depend on a given email, just the word,
• so let’s rename it wⱼ.
• Same with the quantity Σⱼ log(1 − θⱼ) = w₀.
• Compute these weights once; there is no need to compute them again and again for each email.
• The quantities that vary by email are the xⱼs. We need to compute them separately for each email, but that shouldn’t be too hard.
• We can put together what we know to compute p(x|c), and then use Bayes’ Law to
get an estimate of p(c|x) , which is what we actually want
• This helps us make the final decision about whether an email is spam or not.
• Training and Performance:
• Naive Bayes is inexpensive to train if we have labeled data.
• It works by counting the occurrence of words in spam and non-spam emails.
• With more training data, we can improve the accuracy of the filter.
• Practical Implementation:
• In practice, there's a main model adjusted for individual users.
• Before using the main model, simple rules are applied quickly to sort emails.
Laplace Smoothing
• θj represents the probability of seeing a specific word (indexed by j) in a
spam email.
• It’s calculated as a ratio of counts: θⱼc = nⱼc / nc.
• Here, nⱼc is the number of class-c (e.g., spam) emails that contain word j, and nc is the total number of class-c emails.
• Laplace Smoothing is about refining the estimate of θj by adding a
smoothing factor.
• The formula becomes: θjc = (njc + α) / (nc + β).
• By choosing α and β (e.g., α=1 and β=10), we prevent extreme
probabilities like 0 or 1.
• Laplace Smoothing can be seen as incorporating prior knowledge
into the probability estimation.
• The maximal likelihood estimate (θML) seeks the values of θ
that make the observed data most probable.
• Assuming independent trials, we maximize the log-likelihood
function for each word separately.

• The derivative of the log-likelihood function leads to the original probability estimate θⱼ = nⱼc / nc.
• Adding a prior extends the estimation process to include prior
beliefs about the data.
• The maximum a posteriori likelihood (MAP) determines the
parameter θ most likely given the observed data.

• Applying Bayes’s Law transforms θMAP into a form proportional to the prior times the likelihood.
• The prior, denoted p(θ), assumes a specific distribution form, θ^α (1 − θ)^β, to achieve Laplace Smoothing’s result.
Scraping the Web: APIs and Other Tools
• As a data scientist, you’re not always just handed some data and
asked to go figure something out based on it.

• Often, you have to actually figure out how to go get some data you
need to ask a question, solve a problem, do some research, etc.

• One way you can do this is with an API.


• For the sake of this discussion, an API (application programming
interface) is something websites provide to developers so they can
download data from the website easily and in standard format.
• Usually, the developer has to register and receive a “key,” which is something like
a password.

• For example, the New York Times provides such an API.


• When you go this route, you often get back weird formats, sometimes in JSON, but
there’s no standardization to this standardization; i.e., different websites give you
different “standard” formats.

• One way to get beyond this is to use Yahoo’s YQL language, which allows you to go to the Yahoo! Developer Network and write SQL-like queries that interact with many of the APIs on common sites.

• The output is standard, and you only have to parse it in Python or R once.
But what if you want data when there’s no API available?

• In this case you might want to use something like the Firebug extension for
Firefox. You can “inspect the element” on any web page, and Firebug allows
you to grab the field inside the HTML. In fact, it gives you access to the full
HTML document so you can interact and edit. In this way you can see the
HTML as a map of the page and Firebug as a kind of tour guide.

• After locating the stuff you want inside the HTML, you can use curl, wget,
grep, awk, perl, etc., to write a quick-and-dirty shell script to grab what you
want, especially for a one-off grab. If you want to be more systematic, you
can also do this using Python or R.
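A hedged, quick-and-dirty example of the systematic route in Python, using requests and Beautiful Soup (mentioned below); the URL and the CSS class are hypothetical placeholders for whatever you located by inspecting the page’s HTML.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page and element located via "inspect the element".
url = "https://example.com/listings"          # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# "item-title" is a made-up class name; substitute whatever you found in the HTML.
for node in soup.find_all("div", class_="item-title"):
    print(node.get_text(strip=True))
```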
Other parsing tools you might want to look into include:

• lynx and lynx --dump


• Good if you pine for the 1970s. Oh wait, 1992. Whatever.

• Beautiful Soup
• Robust but kind of slow.

• Mechanize


• Super cool as well, but it doesn’t parse JavaScript.

• PostScript
• Image classification.
