DS - Module 3
• Cosine Similarity: SC(x, y) = (x · y) / (||x|| · ||y||)
The Jaccard similarity will be 0 if the two sets share no values and 1 if the
two sets are identical. The sets may contain either numerical values or strings.
Additionally, this function can be used to find the dissimilarity
between two sets by calculating d(A, B) = 1 − J(A, B).
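A minimal Python sketch (the sets and function names are illustrative) of the Jaccard similarity and the corresponding dissimilarity:

def jaccard_similarity(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|: 0 for disjoint sets, 1 for identical sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_dissimilarity(a, b):
    # d(A, B) = 1 - J(A, B)
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))     # 2/4 = 0.5
print(jaccard_dissimilarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5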
• Mahalanobis Distance
• Can also be used between two real-valued vectors, and has the
advantage over Euclidean distance that it takes the correlation of the
data into account and is scale-invariant.
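A small numpy sketch (the data is illustrative) of the Mahalanobis distance d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)), where S is the covariance matrix estimated from the data:

import numpy as np

data = np.array([[1.0, 2.0], [2.0, 1.8], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
S_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse of the sample covariance matrix

def mahalanobis(x, y, S_inv):
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

print(mahalanobis(data[0], data[4], S_inv))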
Similarity or distance metrics
• Hamming Distance
• Can be used to find the distance between two strings or pairs of words or DNA
sequences of the same length.
• The distance between olive and ocean is 4 because aside from the “o” the other 4
letters are different.
O L I V E
O C E A N
0 1 1 1 1   (1 marks a position where the letters differ)
• Total = 4
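A short Python sketch of the Hamming distance for equal-length strings:

def hamming_distance(s, t):
    # number of positions at which two equal-length strings differ
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)

print(hamming_distance("olive", "ocean"))  # 4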
Similarity or distance metrics
• Manhattan
• This is also a distance between two real-valued k-dimensional vectors.
• The image to have in mind is that of a taxi having to travel the city streets of
Manhattan, which is laid out in a grid-like fashion (you can’t cut diagonally across
buildings).
• The distance is therefore defined as the sum of the absolute differences of the
corresponding coordinates: d(x, y) = |x1 − y1| + |x2 − y2| + … + |xk − yk|.
(Example data points from the slide: (40, 20), (50, 50), (60, 90), (10, 25), (70, 70), (60, 10), (25, 80).)
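A minimal Python sketch of the Manhattan (taxicab) distance between two k-dimensional vectors, using two of the example points above:

def manhattan_distance(x, y):
    # sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance((40, 20), (50, 50)))  # 10 + 30 = 40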
k-means (unsupervised)
• Goal is to segment data into clusters or strata
• Important for marketing research where you need to determine
your sample space.
• Assumptions:
• Labels are not known.
• You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (centers of data) and place them
near “clusters” of data.
• Assign each data point to a centroid.
• Move the centroids to the average location of the data points
assigned to it.
• Repeat the previous two steps until the data point
assignments don’t change.
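A minimal numpy sketch of these steps (initialization is simplified to the first k points and empty clusters are not handled; a real implementation would do both more carefully):

import numpy as np

def kmeans(points, k, n_iter=100):
    # Step 1: choose k initial centroids (here, simply the first k points)
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 1], [2, 1], [2, 3], [2, 3], [4, 3], [5, 5]], dtype=float)
print(kmeans(points, k=2))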
k-means (unsupervised)
• Let’s say you have users for whom you know:
• X: how many ads have been shown to each user (the number of impressions), and
• Y: how many times each user has clicked on an ad (the number of clicks).
K-Means Problem
• Given a data set of six objects characterised by two features, use the
k-means clustering algorithm to divide the following data into two
clusters, and find the updated cluster centres after two iterations.
Point: a1 a2 a3 a4 a5 a6
X:      1  2  2  2  4  5
Y:      1  1  3  3  3  5
• Step 1: With initial cluster centres v1 = (2, 1) and v2 = (2, 3), find the distance
from each data point to each centre and assign the point to the nearer one:
Point  (X, Y)   Distance to v1   Distance to v2   Cluster
a1     (1, 1)   1                2.236068         v1
a2     (2, 1)   0                2                v1
a3     (2, 3)   2                0                v2
a4     (2, 3)   2                0                v2
a5     (4, 3)   2.828427         2                v2
a6     (5, 5)   5                3.605551         v2
• Step 2: Cluster of v1: {a1, a2}
• Cluster of v2: {a3, a4, a5, a6}
• Step 3: Recalculate the cluster center v1 and v2
• v1 = 1/2 * [(1, 1) + (2, 1)] = (1.5, 1)
• v2 = 1/4 * [(2, 3) + (2, 3) + (4, 3) + (5, 5)] = (3.25, 3.5)
• Step 4: Repeat from Step 1 until the cluster centres (or, equivalently, the
cluster assignments) are the same as in the previous iteration.
• Second Iteration:
• Step 1: Find the distance between the updated cluster centres
and each data point.
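A small numpy sketch (not part of the original slides) that carries out this second iteration on the data above:

import numpy as np

points = np.array([[1, 1], [2, 1], [2, 3], [2, 3], [4, 3], [5, 5]], dtype=float)
centroids = np.array([[1.5, 1.0], [3.25, 3.5]])  # v1 and v2 after the first iteration

# distance from every point to every centre, then nearest-centre assignment
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# recompute the centres from the new assignments
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
print(dists.round(6))
print(labels)          # cluster membership after the second iteration
print(new_centroids)   # updated cluster centres after the second iteration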
• Spam Filters
• Naive Bayes
• Wrangling
Learning by Example (Spam)
• Consider a collection of rows of raw email text.
Contd…
• You may notice that several of the rows of text look like spam.
• How did you figure this out?
• Can you write code to automate the spam filter that your brain represents?
Contd…
• Rachel’s class had a few ideas about what might be clear signs of spam:
1. Any email is spam if it contains certain giveaway phrases (e.g., “Ad Credits”, “You
Have Won”, and so on). That’s a good rule to start with, but as you’ve likely seen in
your own email, people (spammers) figured out this spam filter rule and got
around it by modifying the spelling.
2. Maybe something about the length of the subject gives it away as spam, or perhaps
excessive use of exclamation points or other punctuation. But some words like
“Yahoo!” are authentic, so you don’t want to make your rule too simplistic.
And here are a few suggestions regarding
code you could write to identify spam:
• Try a probabilistic model. In other words, instead of a few simple rules, should you
have many rules of thumb that aggregate together to provide the probability of a
given email being spam? This is a great idea.
• What about k-nearest neighbors or linear regression? You learned about these
techniques in the previous chapter, but do they apply to this kind of problem? (Hint:
the answer is “No.”)
Why Won’t Linear Regression Work for Filtering Spam?
• Consider a dataset or matrix where each row represents a different email message (it
could be keyed by email_id).
• Now let’s make each word in the email a feature—this means that we create a
column called “You_Have_Won,” for example, and then for any message that has the
word “You_Have_Won” in it at least once, we put a 1 in; otherwise we assign a 0.
• Alternatively, we could put in the number of times the word appears. Then each column
represents the appearance of a different word (a small sketch of this matrix appears after this list).
• In order to use linear regression, we need to have a training set of emails where
the messages have already been labeled with some outcome variable. In this case,
the outcomes are either spam or not.
• Strictly speaking, this option really isn’t ideal; linear regression is aimed at
modeling a continuous output and this is binary.
• We have on the order of 10,000 emails with on the order of 100,000 words. This won’t work.
• Technically, this corresponds to the fact that the matrix in the equation for linear regression is
not invertible—in fact, it’s not even close.
• With carefully chosen feature selection and domain expertise, we could limit it to 100 words
(from 100,000 words) and that could be enough! But again, we’d still have the issue that
linear regression is not the appropriate model for a binary outcome.
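A tiny Python sketch (the example emails are illustrative) of the binary email-by-word matrix described above:

# rows = emails, columns = words; entry is 1 if the word occurs in the email, else 0
emails = [
    "you have won claim your ad credits now",    # spam-like text
    "meeting moved to tuesday please confirm",   # non-spam text
]
vocabulary = sorted({word for email in emails for word in email.split()})
matrix = [[1 if word in email.split() else 0 for word in vocabulary] for email in emails]

for email_id, row in enumerate(matrix):
    print(email_id, row)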
How About k-nearest Neighbors?
• To use k-nearest neighbors (k-NN) to create a spam filter:
• We would still need to choose features, probably corresponding to words,
and we’d likely define the value of those features to be 0 or 1, depending
on whether the word is present or not.
• Then, we’d need to define when two emails are “near” each other based on
which words they both contain.
Contd…
• Again, in this case with 10,000 emails and 100,000 words, we’ll
encounter a problem:
• Namely, the space we’d be working in has too many dimensions. Yes,
computing distances in a 100,000-dimensional space requires lots of
computational work. But that’s not the real problem.
• The real problem is even more basic: even our nearest neighbors are
really far away. This is called “the curse of dimensionality,” and it makes
k-NN a poor algorithm in this case.
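A small numpy experiment (random uniform data, purely illustrative) showing the curse of dimensionality: as the number of dimensions grows, the nearest neighbor of a query point is barely closer than the farthest one:

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 100, 10_000):
    points = rng.random((200, dim))   # 200 random points in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # a ratio near 1 means the "nearest" and "farthest" points are almost equally far away
    print(dim, round(dists.min() / dists.max(), 3))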
Naive Bayes
• Bayes Law
• Let’s say we’re testing for a rare disease, where 1% of the
population is infected. We have a highly sensitive and specific test,
which is not quite perfect:
• 99% of sick patients test positive.
• 99% of healthy patients test negative.
• Given that a patient tests positive, what is the probability
that the patient is actually sick?
• Imagine we have 10,000 perfectly representative people: 100 (1%) are sick
and 9,900 are healthy. Of the 100 sick people, 99 test positive; of the 9,900
healthy people, 1% (99) test positive anyway. So among the 198 positive tests,
only 99 come from sick patients: p(sick | positive) = 99 / 198 = 50%.
• Recall from your basic statistics course that, given events x and y,
• there’s a relationship between the probabilities of either event (denoted
p(x) and p(y)),
• the joint probability (both happen, denoted p(x, y)),
• and the conditional probability (event x happens given that y happens, denoted
p(x | y)), as follows:
• p(x, y) = p(x | y) p(y) = p(y | x) p(x), which rearranges to Bayes’ Law:
p(y | x) = p(x | y) p(y) / p(x).
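A tiny Python check that applies Bayes’ Law to the disease-testing example above:

# p(sick | positive) = p(positive | sick) * p(sick) / p(positive)
p_sick = 0.01
p_pos_given_sick = 0.99
p_pos_given_healthy = 0.01  # 99% of healthy patients test negative

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos_given_sick * p_sick / p_pos)  # 0.5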
• The term log(θj / (1 − θj)) doesn’t depend on a given email, just on the word,
• so let’s rename it wj.
• Same with the quantity Σj log(1 − θj) = w0.
• Compute these weights once; there is no need to compute them again and again for each email.
• The quantities that do vary by email are the xjs. We need to compute them separately
for each email, but that shouldn’t be too hard.
• We can put together what we know to compute p(x | c), and then use Bayes’ Law to
get an estimate of p(c | x), which is what we actually want.
• This helps us make the final decision about whether an email is spam or not.
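A minimal sketch of this scoring scheme (the word probabilities θj are illustrative), with binary features xj, per-word weights wj = log(θj / (1 − θj)), and the constant w0 = Σj log(1 − θj):

import math

theta = {"won": 0.6, "credits": 0.4, "meeting": 0.05}  # illustrative P(word appears | spam)

# precompute the weights once; they do not depend on any particular email
w = {word: math.log(t / (1 - t)) for word, t in theta.items()}
w0 = sum(math.log(1 - t) for t in theta.values())

def log_p_x_given_spam(email_words):
    # log p(x | spam) = sum_j x_j * w_j + w_0, with x_j = 1 if word j is present
    return sum(w[word] for word in theta if word in email_words) + w0

print(log_p_x_given_spam({"you", "have", "won", "credits"}))
print(log_p_x_given_spam({"meeting", "tomorrow"}))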
• Training and Performance:
• Naive Bayes is inexpensive to train if we have labeled data.
• It works by counting the occurrence of words in spam and non-spam emails.
• With more training data, we can improve the accuracy of the filter.
• Practical Implementation:
• In practice, there's a main model adjusted for individual users.
• Before using the main model, simple rules are applied quickly to sort emails.
Laplace Smoothing
• θj represents the probability of seeing a specific word (indexed by j) in a
spam email.
• It's calculated as a ratio of counts: θj = njc / nc.
• Here, njc is the number of spam emails in which word j appears, and
nc is the total number of spam emails.
• Laplace Smoothing is about refining the estimate of θj by adding a
smoothing factor.
• The formula becomes: θjc = (njc + α) / (nc + β).
• By choosing α and β (e.g., α=1 and β=10), we prevent extreme
probabilities like 0 or 1.
• Laplace Smoothing can be seen as incorporating prior knowledge
into the probability estimation.
• The maximum likelihood estimate (θML) seeks the values of θ
that make the observed data most probable.
• Assuming independent trials, we maximize the log-likelihood
function for each word separately.
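A small Python sketch (the counts are illustrative) contrasting the raw maximum likelihood estimate with the Laplace-smoothed estimate using α = 1 and β = 10:

def theta_ml(n_jc, n_c):
    # raw MLE: can hit the extremes 0 or 1
    return n_jc / n_c

def theta_smoothed(n_jc, n_c, alpha=1, beta=10):
    # Laplace smoothing pulls the estimate away from 0 and 1
    return (n_jc + alpha) / (n_c + beta)

n_c = 50               # illustrative: 50 spam emails in the training set
for n_jc in (0, 50):   # word never seen in spam vs. seen in every spam email
    print(n_jc, theta_ml(n_jc, n_c), round(theta_smoothed(n_jc, n_c), 3))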
• Often, you have to figure out how to actually go get the data you need in order
to ask a question, solve a problem, do some research, etc.
• One way to get beyond this is to use Yahoo’s YQL language, which allows you to go
to the Yahoo! Developer Network and write SQL-like queries that interact with the
APIs of many common sites.
• The output is standard, and you only have to parse it in Python or R once.
But what if you want data when there’s no API available?
• In this case you might want to use something like the Firebug extension for
Firefox. You can “inspect the element” on any web page, and Firebug allows
you to grab the field inside the HTML. In fact, it gives you access to the full
HTML document so you can interact and edit. In this way you can see the
HTML as a map of the page and Firebug as a kind of tour guide.
• After locating the stuff you want inside the HTML, you can use curl, wget,
grep, awk, perl, etc., to write a quick-and-dirty shell script to grab what you
want, especially for a one-off grab. If you want to be more systematic, you
can also do this using Python or R.
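A quick-and-dirty Python sketch of this kind of one-off grab (the URL and CSS selector are placeholders, and it assumes the third-party requests and beautifulsoup4 packages are installed; Beautiful Soup is discussed below):

import requests
from bs4 import BeautifulSoup

# placeholder URL: replace it with the page you inspected
url = "https://example.com/some-page"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
for cell in soup.select("td.price"):   # hypothetical CSS selector for the field you want
    print(cell.get_text(strip=True))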
Other parsing tools you might want to look into include:
• Beautiful Soup
• Robust but kind of slow.
• PostScript
• Image classification.