
BANGALORE INSTITUTE OF TECHNOLOGY

Department of Information Science and Engineering


K.R.Road, V.V.Puram, Bangalore – 560 004

Data Science and Visualization (21CS644)

by

Dr. ROOPA. H
Associate Professor
Department of Information Science & Engineering,
Bangalore Institute of Technology

TEXT BOOKS:

1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013.
2. The Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN 9781800568112.

REFERENCE BOOKS:

1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University Press, 2010.
2. Data Science from Scratch, Joel Grus, Shroff Publishers / O'Reilly Media.
3. A Handbook for Data Driven Design, Andy Kirk.

Module - 2:
Exploratory Data Analysis and the Data Science Process:
Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA, The Data Science Process, Case Study: RealDirect (online real estate firm).
Three Basic Machine Learning Algorithms:
Linear Regression, k-Nearest Neighbors (k-NN), k-means.

09 Hours
Textbook 1: Chapter 2, Chapter 3


Exploratory Data Analysis:


“Exploratory data analysis” is an attitude, a state of flexibility, a willingness
to look for those things that we believe are not there, as well as those we
believe to be there.
— John Tukey
• Exploratory data analysis (EDA) is the first step toward building a
model.
• It’s traditionally presented as a bunch of histograms and stem-and-leaf
plots.
• EDA is a critical part of the data science process, and also represents a
philosophy or way of doing statistics practiced by a strain of statisticians
coming from the Bell Labs tradition.
• In EDA, there is no hypothesis and there is no model. The
“exploratory” aspect means that your understanding of the problem you
are solving, or might solve, is changing as you go.
Tools of EDA:
• The basic tools of EDA are plots, graphs and summary statistics.
• Generally speaking, it’s a method of systematically going through the
data,
 plotting distributions of all variables (using box plots),
 plotting time series of data,
 transforming variables,
 looking at all pairwise relationships between variables using
scatterplot matrices,
 and generating summary statistics for all of them.
• At the very least that would mean computing their mean, minimum,
maximum, the upper and lower quartiles, and identifying outliers.
• But as much as EDA is a set of tools, it’s also a mindset. And that
mindset is about your relationship with the data.
• EDA happens between you and the data and isn’t about proving
anything to anyone else yet.
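As a minimal sketch of these basic tools in R (the data frame and column names here are hypothetical, not from the textbook):

df <- data.frame(age = rnorm(100, mean = 40, sd = 10),
                 income = rexp(100, rate = 1/50),
                 visits = rpois(100, lambda = 5))
summary(df)         # mean, quartiles, min/max of every variable
boxplot(df$income)  # distribution of one variable, with outliers flagged
hist(df$age)        # shape of the age distribution
pairs(df)           # scatterplot matrix of all pairwise relationships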

Philosophy of Exploratory Data Analysis


Long before worrying about how to convince others, you first have to
understand what’s happening yourself.
— Andrew Gelman
In the context of data in an Internet/engineering company, EDA is done for
some of the same reasons it is done with smaller datasets, but there are
additional reasons to do it with data that has been generated from logs.
EDA should be done
• to gain intuition about the data;
• to make comparisons between distributions;
• for sanity checking (making sure the data is on the scale you expect,
in the format you thought it should be);
• to find out where data is missing or if there are outliers; and
• to summarize the data.
EDA also helps with debugging the logging process for data generated from
logs.
For example, “patterns” you find in the data could actually be something wrong in the
logging process that needs to be fixed. If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.
• In the end, EDA helps you make sure the product is performing as
intended.
• Difference between EDA and data visualization is that EDA is done
toward the beginning of analysis, and data visualization, as it’s used in our
vernacular, is done toward the end to communicate one’s findings.
• With EDA, the graphics are solely done for you to understand what’s
going on.
• With EDA, you can also use the understanding you get to inform and
improve the development of algorithms.
For example, suppose you are trying to develop a ranking algorithm that ranks content
that you are showing to users. To do this you might want to develop a notion of
“popular.”
Before you decide how to quantify popularity, you need to understand how the data is
behaving, and the best way to do that is to look at it and get your hands dirty.


The Data Science Process:

Figure 2.1 -The data science process

Steps in Data Science process:
• First we have the Real World.
• Start with raw data—logs, Olympics records, Enron employee emails, or
recorded genetic material.
• Process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call
it.
• Get the data down to a nice format, like something with columns:
• name | event | year | gender | event time
• Clean the dataset to handle duplicates, missing values, absurd outliers,
and data that wasn't actually logged or was incorrectly logged.
• Design a model that uses some algorithm like k-nearest neighbors (k-NN),
linear regression, Naive Bayes, or something else.
 The model chosen depends on the type of problem we’re trying to
solve, of course, which could be a classification problem, a prediction
problem, or a basic description problem.
• Interpret, visualize, report, or communicate our results.

A Data Scientist’s Role in This Process:


Someone has to make the decisions about what data to collect, and why.
That person needs to be formulating questions and hypotheses and making
a plan for how the problem will be attacked. And that someone is the data
scientist or our beloved data science team.

Figure 2.2 -The data scientist is involved in every part of this process

Case Study: RealDirect:
Doug Perlson, the CEO of RealDirect, has a background in real estate law,
startups, and online advertising. His goal with RealDirect is to use all the
data he can access about real estate to improve the way people sell and buy
houses.
Normally, people sell their homes about once every seven years, and they
do so with the help of professional brokers and current data. But there’s a
problem both with the broker system and the data quality. RealDirect
addresses both of them.
First, the brokers. They are typically “free agents” operating on their own—
think of them as home sales consultants. This means that they guard their
data aggressively, and the really good ones have lots of experience. But in
the grand scheme of things, that really means they have only slightly more
data than the inexperienced brokers.


RealDirect is addressing this problem by hiring a team of licensed real estate


agents who work together and pool their knowledge. To accomplish this, it
built an interface for sellers, giving them useful data driven tips on how to
sell their house. It also uses interaction data to give real-time
recommendations on what to do next.
The team of brokers also become data experts, learning to use information-
collecting tools to keep tabs on new and relevant data or to access publicly
available information. For example, you can now get data on co-op (a
certain kind of apartment in NYC) sales, but that’s a relatively recent
change.
One problem with publicly available data is that it’s old news—there’s a
three-month lag between a sale and when the data about that sale is
available. RealDirect is working on real-time feeds on things like when
people start searching for a home, what the initial offer is, the time between
offer and close, and how people search for a home online.
Ultimately, good information helps both the buyer and the seller. At least if
they’re honest.
How Does RealDirect Make Money?

• First, it offers a subscription to sellers—about $395 a month—to access the selling tools.
• Second, it allows sellers to use RealDirect's agents at a reduced commission, typically 2% of the sale instead of the usual 2.5% or 3%. This is where the magic of data pooling comes in: it allows RealDirect to take a smaller commission because it's more optimized, and therefore gets more volume.

There are some challenges they have to deal with as well, of course.
• RealDirect requires registration.
• RealDirect comprises licensed brokers in various established realtor associations.


Exercise: RealDirect Data Strategy

You have been hired as chief data scientist at realdirect.com, and report
directly to the CEO. The company (hypothetically) does not yet have its
data plan in place. It’s looking to you to come up with a data strategy.

Sample R code that takes the Brooklyn housing data in the preceding exercise, and cleans and
explores it a bit.
# Author: Benjamin Reddy

require(gdata)
bk <- read.xls("rollingsales_brooklyn.xls",pattern="BOROUGH")
head(bk)
summary(bk)
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
require(plyr)  # count() below comes from the plyr package
count(is.na(bk$SALE.PRICE.N))
names(bk) <- tolower(names(bk))

## clean/format the data with regular expressions


bk$gross.sqft <- as.numeric(gsub("[^[:digit:]]","",
bk$gross.square.feet))
bk$land.sqft <- as.numeric(gsub("[^[:digit:]]","",
bk$land.square.feet))
bk$sale.date <- as.Date(bk$sale.date)
bk$year.built <- as.numeric(as.character(bk$year.built))

## do a bit of exploration to make sure there's not anything weird going on with sale prices
attach(bk)
hist(sale.price.n)
hist(sale.price.n[sale.price.n>0])
hist(gross.sqft[sale.price.n==0])
detach(bk)

## keep only the actual sales


bk.sale <- bk[bk$sale.price.n!=0,]
plot(bk.sale$gross.sqft,bk.sale$sale.price.n)
plot(log(bk.sale$gross.sqft),log(bk.sale$sale.price.n))

## for now, let's look at 1-, 2-, and 3-family homes


bk.homes <- bk.sale[which(grepl("FAMILY", bk.sale$building.class.category)), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
## list the suspiciously low-priced homes, ordered by sale price
bk.homes[which(bk.homes$sale.price.n < 100000), ][
  order(bk.homes[which(bk.homes$sale.price.n < 100000), ]$sale.price.n), ]

## remove outliers that seem like they weren't actual sales
bk.homes$outliers <- (log(bk.homes$sale.price.n) <=5) + 0
bk.homes <- bk.homes[which(bk.homes$outliers==0),]
plot(log(bk.homes$gross.sqft),log(bk.homes$sale.price.n))


Algorithms:
• An algorithm is a procedure or set of steps or rules to accomplish a task.
Algorithms are one of the fundamental concepts in, or building blocks of,
computer science: the basis of the design of elegant and efficient code,
data preparation and processing, and software engineering.
• Some of the basic types of tasks that algorithms can solve are sorting,
searching, and graph-based computational problems.
• Efficient algorithms that work sequentially or in parallel are the basis of
pipelines to process and prepare data.
With respect to data science, there are at least three classes of
algorithms one should be aware of:
• Data munging, preparation, and processing algorithms, such as sorting,
MapReduce, or Pregel.
• Optimization algorithms for parameter estimation, including Stochastic
Gradient Descent, Newton’s Method, and Least Squares.
• Machine learning algorithms.
Statistical modeling came out of statistics departments, and machine learning algorithms came out of
computer science departments. Certain methods and techniques are considered to be part of both, and
you’ll see that we often use the words somewhat interchangeably.
There are some broad generalizations to consider:
1. Interpreting parameters
Statisticians think of the parameters in their linear regression models as having real-world
interpretations, and typically want to be able to find meaning in behavior or describe the real-
world phenomenon corresponding to those parameters. A software engineer or computer
scientist, by contrast, might want to build their linear regression algorithm into production-level code,
treating the predictive model as a black box; they don't generally focus on the
interpretation of the parameters, and if they do, it is with the goal of hand-tuning them in order to
optimize predictive power.
2. Confidence intervals
Statisticians provide confidence intervals and posterior distributions for parameters and
estimators, and are interested in capturing the variability or uncertainty of the parameters. Many
machine learning algorithms, such as k-means or k-nearest neighbors (which we cover a bit later in
this chapter), don’t have a notion of confidence intervals or uncertainty.
3. The role of explicit assumptions
Statistical models make explicit assumptions about data generating processes and distributions, and
you use the data to estimate parameters. Nonparametric solutions, like the ones we'll see later in this
chapter, don't make assumptions about probability distributions, or the assumptions are implicit.

Data scientist (noun): Person who is better at statistics than any software
engineer and better at software engineering than any statistician.
— Josh Wills
Three Basic Algorithms:
Many business or real-world problems that can be solved with data can be
thought of as classification and prediction problems when we express them
mathematically.
A whole host of models and algorithms can be used to classify and predict.
Three basic algorithms, linear regression, k-nearest neighbors (k-NN), and
k-means, are discussed here.


Linear Regression:

One of the most common statistical methods is linear regression.


It assumes a linear relationship between an outcome variable (sometimes also called the response variable, dependent variable, or label) and a predictor (sometimes also called an independent variable, explanatory variable, or feature); or between one variable and several other variables, in which case you're modeling the relationship as having a linear structure.

In the simplest case, the model is a line with a slope and an intercept:

y = f(x) = β0 + β1x

Example 1. Overly simplistic example to start. Suppose you run a social networking site that
charges a monthly subscription fee of $25, and that this is your only source of revenue. Each month
you collect data and count your number of users and total revenue. You’ve done this daily over the
course of two years, recording it all in a spreadsheet.
You could express this data as a series of points. Here are the first four:
S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}
There's a clear relationship enjoyed by all of these points, namely y = 25x.

• There's a linear pattern.
• The coefficient relating x and y is 25.
• It seems deterministic.

Figure 2.3 - An obvious linear pattern
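A quick check in R (a sketch, not from the textbook) recovers exactly this pattern: lm() returns an intercept of (numerically) zero and a slope of 25.

x <- c(1, 10, 100, 200)
y <- c(25, 250, 2500, 5000)
coef(lm(y ~ x))   # (Intercept) ~ 0, slope on x = 25, i.e. revenue = 25 * users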



Example 2. Looking at data at the user level. Say you have a dataset keyed by user
(meaning each row contains data for a single user), and the columns represent user behavior
on a social networking site over a period of a week. Let’s say you feel comfortable that the
data is clean at this stage and that you have on the order of hundreds of thousands of users.
The names of the columns are total_num_friends, total_new_friends_this_week,
num_visits, time_spent, number_ apps_downloaded, number_ads_shown, gender, age, and
so on.
During the course of your exploratory data analysis, you’ve randomly sampled 100 users to
keep it simple, and you plot pairs of these variables,
For example, x = total_new_friends and y = time_spent (in seconds).
The business context might be that eventually you want to be able to promise advertisers
who bid for space on your website in advance a certain number of users, so you want to be
able to forecast number of users several days or weeks in advance. But for now, you are
simply trying to build intuition and understand your dataset.
Eyeball the first few rows and see:

 x    y
 7   276
 3    43
 4    82
 6   136
10   417
 9   269
Now, your brain can’t figure out what’s going on by just looking at them (and your friend’s
brain probably can’t, either). They’re in no obvious particular order, and there are a lot of
them. So you try to plot it, as in Figure 2.4 below.
It looks like there’s kind of a linear
relationship here, and it makes sense; the
more new friends you have, the more time
you might spend on the site.

But how can you figure out how to describe that relationship?

Let's also point out that there is no perfectly deterministic relationship between number of new friends and time spent on the site, but it makes sense that there is an association between these two variables.

Figure 2.4 - Looking kind of linear


Start by writing something down

There are two things you want to capture in the model. The first is the trend and the second is the variation. We'll start first with the trend.

First, let's start by assuming there actually is a relationship and that it's linear. It's the best you can do at this point.

There are many lines that look more or less like they might work, as shown in Figure 2.5. So how do you pick which one?

Figure 2.5 - Which line is the best fit?


Assuming a linear relationship, start your model by assuming the functional
form to be:
y = β0 + β1x

Now your job is to find the best choices for β0 and β1 using the observed
data to estimate them: (x1, y1), (x2, y2), ..., (xn, yn).

Writing this with matrix notation results in this:

y = x · β
There you go: you’ve written down your model. Now the rest is fitting
the model.
Fitting the model:
The intuition behind linear regression is that you want to find the line that
minimizes the distance between all the points and the line. Many lines look
approximately correct, but your goal is to find the optimal one.
Linear regression seeks to find the line that minimizes the sum of the
squares of the vertical distances between the approximated or predicted yi’s
and the observed yi’s. You do this because you want to minimize your
prediction errors. This method is called least squares estimation.

To find this line, you'll define the "residual sum of squares" (RSS), denoted RSS(β), to be:

RSS(β) = Σi (yi − βxi)²

where i ranges over the various data points.


It is the sum of all the squared vertical
distances between the observed points and
any given line. Note this is a function of β
and you want to optimize with respect to β
to find the optimal line.
Figure 2.6 - The line closest to all the points

To minimize RSS(β) = (y − βx)ᵗ(y − βx), differentiate it with respect to β and set it equal to zero, then solve for β. This results in:

β̂ = (xᵗx)⁻¹ xᵗ y

Here the little "hat" symbol on top of the β is there to indicate that it's the estimator for β. You don't know the true value of β; all you have is the observed data, which you plug into the estimator to get an estimate.
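As a sketch (not from the textbook), this closed-form estimator can be computed directly in R and compared with lm(); here x and y are the observed vectors from the example, and a column of ones is appended so the intercept is estimated too:

X <- cbind(1, x)                       # design matrix: intercept column plus x
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat                               # matches the coefficients reported by lm(y ~ x)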
To actually fit this, to get the βs, all you need is one line of R code where you've got a column of y's and a (single) column of x's:

model <- lm(y ~ x)

So for the example where the first few rows of the data were:

 x    y
 7   276
 3    43
 4    82
 6   136
10   417
 9   269

the R code for this would be:

> model <- lm(y ~ x)
> model

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -32.08        45.92

> coefs <- coef(model)
> plot(x, y, pch=20, col="red", xlab="Number new friends",
       ylab="Time spent (seconds)")
> abline(coefs[1], coefs[2])

And the estimated line is y = −32.08 + 45.92x, which you're welcome to round to y = −32 + 46x, and the corresponding plot looks like the left-hand side of Figure 2.7 below.

Figure 2.7 - On the left is the fitted line. We can see that for any fixed value,
say 5, the values for y vary. For people with 5 new friends, we display their
time spent in the plot on the right.

You’ve so far modeled the trend, you haven’t yet modeled the variation.
Extending beyond least squares:
Now that you have a simple linear regression model down (one output, one
predictor) using least squares estimation to estimate your βs, you can build
upon that model in three primary ways, described in the upcoming sections:
1. Adding in modeling assumptions about the errors
2. Adding in more predictors
3. Transforming the predictors
Adding in modeling assumptions about the errors. If you use your model to
predict y for a given value of x, your prediction is deterministic and doesn’t
capture the variability in the observed data. See on the right hand side of
Figure 2.7 that for a fixed value of x =5, there is variability among the time
spent on the site. You want to capture this variability in your model, so you
extend your model to:
y = β0 +β1x+ϵ
where the new term ϵ is referred to as noise.
It’s also called the error term—ϵ represents the actual error, the difference between the
observations and the true regression line, which you’ll never know and can only estimate with
your
31
βs. Dr.Roopa H, Department of ISE, BIT

The noise is assumed to be normally distributed, which is denoted by:

ϵ ∼ N(0, σ²)

With the preceding assumption on the distribution of noise, this model is saying that, for
any given value of x, the conditional distribution of y given x is:

p(y|x) ∼ N(β0 + β1x, σ²)

Estimate the parameters β0, β1, σ from the data.

Once you have the estimated line, you can see how far away the observed data points are from
the line itself, and you can treat these differences, also known as observed errors or residuals,
as observations themselves, or estimates of the actual errors, the ϵs.

Define ei = yi − ŷi = yi − (β̂0 + β̂1xi) for i = 1, ..., n.

Then you estimate the variance (σ²) of ϵ as:

σ̂² = Σi ei² / (n − 2)
This is called the mean squared error and captures how much the predicted value varies from
the observed.
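A small simulation (a sketch with made-up parameter values, not from the textbook) shows how this estimate tracks the true σ²:

set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- -32 + 46 * x + rnorm(n, mean = 0, sd = 80)   # true beta0 = -32, beta1 = 46, sigma = 80
fit <- lm(y ~ x)
sum(resid(fit)^2) / (n - 2)   # estimate of sigma^2; compare with 80^2 = 6400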
Evaluation metrics:

summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-121.17  -52.63   -9.72   41.54  356.27

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -32.083     16.623   -1.93   0.0565 .
x             45.918      2.141   21.45   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 77.47 on 98 degrees of freedom
Multiple R-squared: 0.8244,  Adjusted R-squared: 0.8226
F-statistic: 460 on 1 and 98 DF,  p-value: < 2.2e-16

R-squared:

This can be interpreted as the proportion of variance explained by our model. The residual sum of squares is in there getting divided by the total sum of squares, which gives the proportion of variance unexplained; R-squared is 1 minus that:

R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
p-values:
Looking at the output, the estimated βs are in the column marked Estimate.
To see the p-values, look at Pr( >|t|) . We can interpret the values in this
column as follows: We are making a null hypothesis that the βs are zero. For
any given β, the p-value captures the probability of observing the data that
we observed, and obtaining the test-statistic that we obtained under the null
hypothesis.
This means that if we have a low p-value, it is highly unlikely to observe
such a test-statistic under the null hypothesis, and the coefficient is highly
likely to be nonzero and therefore significant.

Cross-validation:
Another approach to evaluating the model is as follows. Divide our data up
into a training set and a test set: 80% in the training and 20% in the test. Fit
the model on the training set, then look at the mean squared error on the test
set and compare it to that on the training set. Make this comparison across
sample size as well. If the mean squared errors are approximately the same,
then our model generalizes well and we’re not in danger of overfitting.
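A minimal train/test sketch in R, assuming x and y are the predictor and outcome vectors from before:

n <- length(y)
train.idx <- sample(1:n, floor(0.8 * n))            # 80% of rows for training
fit <- lm(y ~ x, subset = train.idx)
test.pred <- predict(fit, newdata = data.frame(x = x[-train.idx]))
mean((y[-train.idx] - test.pred)^2)                 # test-set mean squared error
mean(resid(fit)^2)                                  # training-set mean squared error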

Figure 2.8 - Comparing mean squared error in training and test set, taken from a slide of Professor Nando de Freitas; here, the ground truth is known because it came from a dataset with data simulated from a known distribution.


Other models for error terms:


The mean squared error is an example of what is called a loss function.
This is the standard one to use in linear regression because it gives us a pretty
nice measure of closeness of fit.
Adding other predictors:
What we just looked at was simple linear regression— one outcome or
dependent variable and one predictor. But we can extend this model by
building in other predictors, which is called multiple linear regression:
y = β0 + β1x1 + β2x2 + β3x3 + ϵ
model <- lm(y ~ x_1 + x_2 + x_3)
Or to add in interactions between variables:
model <- lm(y ~ x_1 + x_2 + x_3 + x_2*x_3)
One key here is to make scatterplots of y against each of the predictors as well
as between the predictors, and histograms of y|x for various values of each of
the predictors to help build intuition.

Transformations:
Going back to one x predicting one y, why did we assume a linear relationship? Instead,
maybe, a better model would be a polynomial relationship like this:
y = β0 + β1x + β2x² + β3x³
To think of it as linear, you transform or create new variables
—for example, z = x²—and build a regression model based on z.
Other common transformations are to take the log or to pick a threshold and turn it into a
binary predictor instead.
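For instance, a sketch of how these transformed models can be fit in R with the same lm() call (the threshold value 5 below is arbitrary, chosen only for illustration):

model <- lm(y ~ x + I(x^2) + I(x^3))   # cubic polynomial, still linear in the betas
z <- x^2                                # or create the transformed variable explicitly
model <- lm(y ~ x + z)
model <- lm(y ~ I(x > 5))               # a threshold turned into a binary predictor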
Review of LRM:
Let’s review the assumptions we made when we built and fit our model:
• Linearity
• Error terms normally distributed with mean 0
• Error terms independent of each other
• Error terms have constant variance across values of x
• The predictors we’re using are the right predictors
When and why do we perform linear regression? Mostly for two reasons:
• If we want to predict one variable knowing others
• If we want to explain or understand the relationship between two
or more things

k-Nearest Neighbors (k-NN):


k-NN is an algorithm that can be used when you have a bunch of objects
that have been classified or labeled in some way, and other similar objects
that haven’t gotten classified or labeled yet, and you want a way to
automatically label them.
The objects could be data scientists who have been classified as “sexy” or
“not sexy”; or people who have been labeled as “high credit” or “low credit”;
or restaurants that have been labeled “five star,” “four star,” “three star,” “two
star,” “one star,” or if they really suck, “zero stars.” More seriously, it could
be patients who have been classified as “high cancer risk” or “low cancer
risk.”
The intuition behind k-NN is to consider the most similar other items
defined in terms of their attributes, look at their labels, and give the
unassigned item the majority vote. If there’s a tie, you randomly select
among the labels that have tied for first.

To automate it, two decisions must be made:
First, how do you define similarity or closeness? Once you define it, for a
given unrated item, you can say how similar all the labeled items are to it,
and you can take the most similar items and call them neighbors, who each
have a “vote.”
This brings you to the second decision: how many neighbors should you
look at or “let vote”? This value is k, which ultimately you’ll choose as the
data scientist, and we’ll tell you how.

Example with credit scores:


Say you have the age, income, and a credit category of high or low for a
bunch of people and you want to use the age and income to predict the
credit label of “high” or “low” for a new person.

For example, here are the first few rows of a dataset, with income
represented in thousands:


age income credit


69 3 low
66 57 low
49 79 low
49 17 low
58 26 high
44 71 high
You can plot people as points on the plane and label people with an empty circle if they have low credit ratings, as shown in Figure 2.9.

Figure 2.9 - Credit rating as a function of age and income
What if a new guy comes in who is 57 years old and who makes $37,000?
What’s his likely credit rating label? Look at Figure 2.10. Based on the other
people near him, what credit score label do you think he should be given?
Let’s use k-NN to do it automatically.

Figure 2.10- What about that guy?



1) Decide on your similarity or distance metric.


2) Split the original labeled dataset into training and test data.
3) Pick an evaluation metric.
4) Run k-NN a few times, changing k and checking the evaluation
measure.
5) Optimize k by picking the one with the best evaluation measure.
6) Once you’ve chosen k, use the same training set and now create a
new test set with the people’s ages and incomes that you have no labels
for, and want to predict. In this case, your new test set only has one
lonely row, for the 57-year-old.

Similarity or distance metrics:
Definitions of “closeness” and similarity vary depending on the context:
Euclidean distance is a good go-to distance metric for attributes that are real-valued and
can be plotted on a plane or in multidimensional space.

Cosine Similarity
Also can be used between two real-valued vectors, x and y, and will yield a value between
–1 (exact opposite) and 1 (exactly the same), with 0 in between meaning independent.
Recall the definition:
cos(x, y) = (x · y) / (‖x‖ ‖y‖)

Jaccard Distance or Similarity
This gives the distance between two sets of objects—for example, a list of Cathy's friends A =
{Kahn, Mark, Laura, ...} and a list of Rachel's friends B = {Mladen, Kahn, Mark, ...}—and
says how similar those two sets are:
J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J(A, B)

Mahalanobis Distance
Also can be used between two real-valued vectors and has the advantage over Euclidean
distance that it takes into account correlation and is scale-invariant:
d(x, y) = sqrt((x − y)ᵗ S⁻¹ (x − y))
where S is the covariance matrix.



Hamming Distance

Can be used to find the distance between two strings or pairs of words or DNA sequences
of the same length. The distance between olive and ocean is 4 because aside from the “o”
the other 4 letters are different. The distance between shoe and hose is 3 because aside
from the “e” the other 3 letters are different. You just go through each position and check
whether the letters are the same in that position, and if not, increment your count by 1.

Manhattan
This is also a distance between two real-valued k-dimensional vectors. The image to have in
mind is that of a taxi having to travel the city streets of Manhattan, which is laid out in a
grid-like fashion (you can't cut diagonally across buildings).
The distance is therefore defined as:
d(x, y) = Σi |xi − yi|
where xi and yi are the ith elements of each of the vectors.
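Most of these are one-liners in R; a sketch with two made-up vectors and two small friend sets (base R's mahalanobis() additionally returns the squared Mahalanobis distance given a center and covariance matrix):

x <- c(1, 2, 3); y <- c(2, 2, 5)
sqrt(sum((x - y)^2))                                   # Euclidean distance
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))         # cosine similarity
sum(abs(x - y))                                        # Manhattan distance
A <- c("Kahn", "Mark", "Laura"); B <- c("Mladen", "Kahn", "Mark")
1 - length(intersect(A, B)) / length(union(A, B))      # Jaccard distance = 0.5
sum(strsplit("olive", "")[[1]] != strsplit("ocean", "")[[1]])   # Hamming distance = 4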

Training and Test sets:

For any machine learning algorithm, the general approach is to have a training phase, during which you create a model and "train it"; and then you have a testing phase, where you use new data to test how good the model is.

For k-NN, the training phase is straightforward: it's just reading in your data
with the "high" or "low" credit data points marked. In testing, you pretend
you don't know the true label and see how good you are at guessing using
the k-NN algorithm. To do this, you'll need to set aside some clean data from
the overall dataset for the testing phase.

R console might look like this:


> head(data)
age income credit
1 69 3 low
2 66 57 low
3 49 79 low
4 49 17 low
5 58 26 high
6 44 71 high
n.points <- 1000 # number of rows in the dataset
sampling.rate <- 0.8
# we need the number of points in the test set to calculate
# the misclassification rate
num.test.set.labels <- n.points * (1 - sampling.rate)
# randomly sample which rows will go in the training set
training <- sample(1:n.points, sampling.rate * n.points,
replace=FALSE)
train <- subset(data[training, ], select = c(age, income))
# define the training set to be those rows (column names as shown by head(data) above)
# the other rows are going into the test set
testing <- setdiff(1:n.points, training)
# define the test set to be the other rows
test <- subset(data[testing, ], select = c(age, income))
cl <- data$credit[training]
# this is the subset of labels for the training set
true.labels <- data$credit[testing]
# subset of labels for the test set; we're withholding these

Pick an evaluation metric:


- sensitivity and specificity
Sensitivity is here defined as the probability of correctly diagnosing an ill
patient as ill; Specificity is here defined as the probability of correctly
diagnosing a well patient as well.
Sensitivity is also called the true positive rate or recall; the name varies based on what
academic field you come from, but they all mean the same thing. And
specificity is also called the true negative rate.
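As a sketch, both can be read off a confusion table of predicted versus true labels (here using the true.labels vector defined earlier and the predicted.labels produced by knn() below, and treating "high" as the positive class):

conf <- table(predicted = predicted.labels, true = true.labels)
conf["high", "high"] / sum(conf[, "high"])   # sensitivity: truly "high" cases labeled "high"
conf["low", "low"] / sum(conf[, "low"])      # specificity: truly "low" cases labeled "low"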

For each test set, you’ll pretend you don’t know his label. Look at the labels of his three
nearest neighbors, say, and use the label of the majority vote to label him. You’ll label all the
members of the test set and then use the misclassification rate to see how well you did. All
this is done automatically in R, with just this single line of R code (knn() comes from the class package, so load it first with library(class)):
knn(train, test, cl, k=3)
Choosing k:
How do you choose k? This is a parameter you have control over. You might need to
understand your data pretty well to get a good guess, and then you can try a few different k’s
and see how your evaluation changes. So you’ll run k-nn a few times, changing k, and
checking the evaluation metric each time.
# we'll loop through and see what the misclassification rate
# is for different values of k
for (k in 1:20) {
  print(k)
  predicted.labels <- knn(train, test, cl, k)
  # We're using the R function knn()
  num.incorrect.labels <- sum(predicted.labels != true.labels)
  misclassification.rate <- num.incorrect.labels / num.test.set.labels
  print(misclassification.rate)
}
Here’s the output in the form (k, misclassification rate):
k misclassification.rate
1,0.28
2, 0.315
3, 0.26
4, 0.255
5, 0.23
6, 0.26
7, 0.25
8, 0.25
9, 0.235
10, 0.24
So let’s go with k =5 because it has the lowest misclassification rate, and now you can
apply it to your guy who is 57 with a $37,000 salary.
In the R console, it looks like:
> test <- c(57,37)
> knn(train,test,cl, k = 5)
[1] low
The output by majority vote is a low credit score when k = 5.


What are the k-NN modeling assumptions?


The k-NN algorithm is an example of a nonparametric approach. You
had no modeling assumptions about the underlying data-generating
distributions, and you weren’t attempting to estimate any parameters.
But you still made some assumptions, which were:
• Data is in some feature space where a notion of “distance” makes
sense.
•Training data has been labeled or classified into two or more
classes.
•You pick the number of neighbors to use, k.
•You’re assuming that the observed features and the labels are somehow
associated. They may not be, but ultimately your evaluation metric will
help you determine how good the algorithm is at labeling. You might want
to add more features and check how that alters the evaluation metric.
You’d then be tuning both which features you were using and k. But as
always, you’re in danger here of overfitting.
k-means:
k-means is the first unsupervised learning technique, where the goal of the
algorithm is to determine the definition of the right answer by finding
clusters of data for you.
Consider some kind of data at the user level, e.g., Google+
data, survey data, medical data, or SAT scores.
Start by adding structure to your data. Namely, assume each row of
your dataset corresponds to a user as follows:
age gender income state household size
Your goal is to segment the users. This process is known by various names:
besides being called segmenting, you could say that you’re going to stratify,
group, or cluster the data. They all mean finding similar types of users and
bunching them together.
Let’s say you have users where you know how many ads have been shown
to each user (the number of impressions) and how many times each has
clicked on an ad (number of clicks). Figure 2.11 shows a simplistic picture
that illustrates what this might look like.

Figure 2.11 - Clustering in two dimensions; look at the panels in the left column from top to bottom, and then the right column from top to bottom.
k-means algorithm looks for clusters in d dimensions, where d is the number
of features for each data point.
Here’s how the k-means algorithm illustrated in Figure 2.11 works:
1. Initially, you randomly pick k centroids (or points that will be the center of
your clusters) in d-space. Try to make them near the data but different from
one another.
2.Then assign each data point to the closest centroid.
3. Move the centroids to the average location of the data points (which
correspond to users in this example) assigned to it.
4. Repeat the preceding two steps until the assignments don’t change, or
change very little.
k-means has some known issues:
• Choosing k is more an art than a science, although there are bounds:
1≤k ≤n, where n is number of data points.
• There are convergence issues—the solution can fail to exist, if the algorithm
falls into a loop, for example, and keeps going back and forth between two
possible solutions, or in other words, there isn’t a single unique solution.
• Interpretability can be a problem—sometimes the answer isn’t at all useful.
Indeed that’s often the biggest problem.

In practice, this is just one line of code in R:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

Your dataset needs to be a matrix, x, each column of which is one of your features. You specify k by selecting centers. It defaults to a certain number of iterations, which is an argument you can change. You can also select the specific algorithm it uses to discover the clusters.
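A short usage sketch with simulated impressions/clicks data (all values made up for illustration):

set.seed(2)
impressions <- c(rnorm(50, mean = 20, sd = 3), rnorm(50, mean = 60, sd = 5))
clicks <- c(rnorm(50, mean = 2, sd = 1), rnorm(50, mean = 10, sd = 2))
x <- cbind(impressions, clicks)            # one row per user, one column per feature
fit <- kmeans(x, centers = 2, nstart = 10)
fit$centers                                # the two cluster centroids
plot(x, col = fit$cluster)                 # users colored by assigned cluster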
