Data Science and Visualization (21CS644) : Text Books
by
Dr. ROOPA. H
Associate Professor
Department of Information Science & Engineering,
Bangalore Institute of Technology
TEXT BOOKS:
1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013.
2. The Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN 9781800568112.
Module 2:
Exploratory Data Analysis and the Data Science Process:
Basic tools (plots, graphs, and summary statistics) of EDA, Philosophy of EDA, The Data Science Process, Case Study: RealDirect (online real estate firm).
Three Basic Machine Learning Algorithms:
Linear Regression, k-Nearest Neighbors (k-NN), k-means.
09-Hours
Textbook 1: Chapter 2, Chapter 3
Tools of EDA:
• The basic tools of EDA are plots, graphs and summary statistics.
• Generally speaking, it's a method of systematically going through the data:
plotting distributions of all variables (using box plots),
plotting time series of the data,
transforming variables,
looking at all pairwise relationships between variables using scatterplot matrices,
and generating summary statistics for all of them.
• At the very least, that would mean computing their mean, minimum, maximum, the upper and lower quartiles, and identifying outliers.
• But as much as EDA is a set of tools, it’s also a mindset. And that
mindset is about your relationship with the data.
• EDA happens between you and the data and isn’t about proving
anything to anyone else yet.
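To make these tools concrete, here is a minimal R sketch on a purely illustrative data frame (the column names x, y, and date are assumptions, not from the text): compute summary statistics, plot a distribution with a box plot, plot a time series, look at pairwise relationships, and transform a variable.
# minimal EDA sketch on an illustrative data frame
set.seed(1)
df <- data.frame(date = Sys.Date() - 99:0,
                 x    = rnorm(100, mean = 50, sd = 10),
                 y    = rexp(100, rate = 0.1))
summary(df$x)                      # mean, min, max, and quartiles
boxplot(df$x, main = "Distribution of x")              # distribution, outliers
plot(df$date, df$y, type = "l", main = "y over time")  # time series
pairs(df[, c("x", "y")])           # pairwise scatterplot matrix
log.y <- log(df$y)                 # a simple variable transformation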
For example, “patterns” you find in the data could actually be something wrong in the
logging process that needs to be fixed. If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.
• In the end, EDA helps you make sure the product is performing as
intended.
• The difference between EDA and data visualization is that EDA is done toward the beginning of the analysis, while data visualization, as it's used in our vernacular, is done toward the end to communicate one's findings.
• With EDA, the graphics are solely done for you to understand what’s
going on.
• With EDA, you can also use the understanding you get to inform and
improve the development of algorithms.
For example, suppose you are trying to develop a ranking algorithm that ranks content
that you are showing to users. To do this you might want to develop a notion of
“popular.”
Before you decide how to quantify popularity, you need to understand how the data is behaving, and the best way to do that is by looking at it and getting your hands dirty.
Steps in Data Science process:
• First, we have the real world.
• Start with raw data—logs, Olympics records, Enron employee emails, or
recorded genetic material.
• Process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call
it.
• Get the data down to a nice format, like something with columns:
• name | event | year | gender | event time
• Clean the dataset to deal with duplicates, missing values, absurd outliers, and data that wasn't actually logged or was incorrectly logged.
• Design a model that uses some algorithm such as k-nearest neighbors (k-NN), linear regression, Naive Bayes, or something else.
The model chosen depends on the type of problem we’re trying to
solve, of course, which could be a classification problem, a prediction
problem, or a basic description problem.
• Interpret, visualize, report, or communicate our results.
Figure 2.2 - The data scientist is involved in every part of this process.
Case Study: RealDirect:
Doug Perlson, the CEO of RealDirect, has a background in real estate law,
startups, and online advertising. His goal with RealDirect is to use all the
data he can access about real estate to improve the way people sell and buy
houses.
Normally, people sell their homes about once every seven years, and they
do so with the help of professional brokers and current data. But there’s a
problem both with the broker system and the data quality. RealDirect
addresses both of them.
First, the brokers. They are typically “free agents” operating on their own—
think of them as home sales consultants. This means that they guard their
data aggressively, and the really good ones have lots of experience. But in
the grand scheme of things, that really means they have only slightly more
data than the inexperienced brokers.
How Does RealDirect Make Money?
There are some challenges they have to deal with as well, of course:
• RealDirect requires registration.
• RealDirect is comprised of licensed brokers in various established realtor associations.
You have been hired as chief data scientist at realdirect.com, and report
directly to the CEO. The company (hypothetically) does not yet have its
data plan in place. It’s looking to you to come up with a data strategy.
Sample R code that takes the Brooklyn housing data in the preceding exercise, and cleans and
explores it a bit.
# Author: Benjamin Reddy
require(gdata)
require(plyr)   # provides count(), used below
bk <- read.xls("rollingsales_brooklyn.xls", pattern = "BOROUGH")
head(bk)
summary(bk)
## strip non-digit characters so the sale price becomes numeric
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
count(is.na(bk$SALE.PRICE.N))
names(bk) <- tolower(names(bk))
## do a bit of exploration to make sure there's not anything weird going on with sale prices
attach(bk)
hist(sale.price.n)
hist(sale.price.n[sale.price.n > 0])
hist(gross.sqft[sale.price.n == 0])
detach(bk)
## bk.homes is assumed to be the subset of bk restricted to actual home sales
## (with gross.sqft also converted to numeric), produced by cleaning steps not shown here
## remove outliers that seem like they weren't actual sales
bk.homes$outliers <- (log(bk.homes$sale.price.n) <= 5) + 0
bk.homes <- bk.homes[which(bk.homes$outliers == 0), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
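A natural next exploratory step (a sketch, not code from the text; the name model.bk is an assumption) is to regress log price on log size and overlay the fitted line on the plot above:
## sketch: simple log-log regression of price on size for the cleaned home sales
model.bk <- lm(log(sale.price.n) ~ log(gross.sqft), data = bk.homes)
summary(model.bk)
abline(model.bk, col = "blue")   # overlay the fitted line on the log-log scatterplot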
Algorithms:
• An algorithm is a procedure or set of steps or rules to accomplish a task.
Algorithms are one of the fundamental concepts in, or building blocks of,
computer science: the basis of the design of elegant and efficient code,
data preparation and processing, and software engineering.
• Some of the basic types of tasks that algorithms can solve are sorting,
searching, and graph-based computational problems.
• Efficient algorithms that work sequentially or in parallel are the basis of
pipelines to process and prepare data.
With respect to data science, there are at least three classes of
algorithms one should be aware of:
• Data munging, preparation, and processing algorithms, such as sorting,
MapReduce, or Pregel.
• Optimization algorithms for parameter estimation, including Stochastic
Gradient Descent, Newton’s Method, and Least Squares.
• Machine learning algorithms.
Statistical modeling came out of statistics departments, and machine learning algorithms came out of
computer science departments. Certain methods and techniques are considered to be part of both, and
you’ll see that we often use the words somewhat interchangeably.
There are some broad generalizations to consider:
1. Interpreting parameters
Statisticians think of the parameters in their linear regression models as having real-world interpretations, and typically want to be able to find meaning in behavior or describe the real-world phenomenon corresponding to those parameters. A software engineer or computer scientist, by contrast, might want to build a linear regression algorithm into production-level code, where the predictive model is treated as a black box; they don't generally focus on the interpretation of the parameters, and if they do, it is with the goal of hand-tuning them to optimize predictive power.
2. Confidence intervals
Statisticians provide confidence intervals and posterior distributions for parameters and
estimators, and are interested in capturing the variability or uncertainty of the parameters. Many
machine learning algorithms, such as k-means or k-nearest neighbors (which we cover a bit later in
this chapter), don’t have a notion of confidence intervals or uncertainty.
3. The role of explicit assumptions
Statistical models make explicit assumptions about data-generating processes and distributions, and you use the data to estimate parameters. Nonparametric solutions, as we'll see later in this chapter, don't make any assumptions about probability distributions, or the assumptions are left implicit.
Data scientist (noun): Person who is better at statistics than any software
engineer and better at software engineering than any statistician.
— Josh Wills
Three Basic Algorithms:
Many business or real-world problems that can be solved with data can be
thought of as classification and prediction problems when we express them
mathematically.
A whole host of models and algorithms can be used to classify and predict.
Three basic algorithms linear regression, k-nearest neighbors (k-NN), and
k-means are discussed.
Linear Regression:
y = f(x) = β0 + β1x
Example 1. Overly simplistic example to start. Suppose you run a social networking site that
charges a monthly subscription fee of $25, and that this is your only source of revenue. Each month
you collect data and count your number of users and total revenue. You’ve done this daily over the
course of two years, recording it all in a spreadsheet.
You could express this data as a series of points. Here are the first four:
S = {(x, y) = (1, 25), (10, 250), (100, 2500), (200, 5000)}
There’s a clear relationship enjoyed by all of these points, namely y =25x.
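As a quick sanity check (a sketch, not from the text), handing these four points to R's lm() recovers exactly that line:
x <- c(1, 10, 100, 200)
y <- c(25, 250, 2500, 5000)
lm(y ~ x)   # intercept ~ 0, slope 25, i.e., y = 25x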
Example 2. Looking at data at the user level. Say you have a dataset keyed by user
(meaning each row contains data for a single user), and the columns represent user behavior
on a social networking site over a period of a week. Let’s say you feel comfortable that the
data is clean at this stage and that you have on the order of hundreds of thousands of users.
The names of the columns are total_num_friends, total_new_friends_this_week,
num_visits, time_spent, number_ apps_downloaded, number_ads_shown, gender, age, and
so on.
During the course of your exploratory data analysis, you’ve randomly sampled 100 users to
keep it simple, and you plot pairs of these variables,
For example, x = total_new_friends and y = time_spent (in seconds).
The business context might be that eventually you want to be able to promise advertisers
who bid for space on your website in advance a certain number of users, so you want to be
able to forecast number of users several days or weeks in advance. But for now, you are
simply trying to build intuition and understand your dataset.
Eyeball the first few rows and see:
x (total_new_friends)   y (time_spent)
 7                      276
 3                       43
 4                       82
 6                      136
10                      417
 9                      269
Now, your brain can’t figure out what’s going on by just looking at them (and your friend’s
brain probably can’t, either). They’re in no obvious particular order, and there are a lot of
them. So you try to plot it, as in the figure below.
It looks like there’s kind of a linear
relationship here, and it makes sense; the
more new friends you have, the more time
you might spend on the site.
Assuming a linear relationship, start your model by assuming the functional
form to be:
y = β0 + β1x
Now your job is to find the best choices for β0 and β1, using the observed data (x1, y1), (x2, y2), ..., (xn, yn) to estimate them.
Writing this with matrix notation results in:
y = x · β
There you go: you’ve written down your model. Now the rest is fitting
the model.
Fitting the model:
The intuition behind linear regression is that you want to find the line that
minimizes the distance between all the points and the line. Many lines look
approximately correct, but your goal is to find the optimal one.
Linear regression seeks to find the line that minimizes the sum of the
squares of the vertical distances between the approximated or predicted yi’s
and the observed yi’s. You do this because you want to minimize your
prediction errors. This method is called least squares estimation.
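For intuition, here is a sketch (on simulated data roughly matching this example, not the book's dataset) that computes the least-squares estimates directly from the closed form β = (XᵀX)⁻¹Xᵀy and checks them against lm():
set.seed(2)
x <- rpois(100, lambda = 5)                 # e.g. number of new friends
y <- -32 + 46 * x + rnorm(100, sd = 75)     # noisy linear response
X <- cbind(1, x)                            # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
beta.hat
coef(lm(y ~ x))                             # same estimates from lm()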
To actually fit this, to get the βs, all you need is one line of R code where you've got a column of y's and a (single) column of x's. For the example where the first few rows of the data were:
x    y
7    276
3    43
4    82
6    136
10   417
9    269
the R code would be:
> model <- lm(y ~ x)
> model

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -32.08        45.92

> coefs <- coef(model)
> plot(x, y, pch = 20, col = "red", xlab = "Number new friends", ylab = "Time spent (seconds)")
> abline(coefs[1], coefs[2])
And the estimated line is y = −32.08 + 45.92x, which you're welcome to round to y = −32 + 46x; the corresponding plot looks like the left-hand side of Figure 2.7 below.
Figure 2.7 - On the left is the fitted line. We can see that for any fixed value,
say 5, the values for y vary. For people with 5 new friends, we display their
time spent in the plot on the right.
You've so far modeled the trend; you haven't yet modeled the variation.
Extending beyond least squares:
Now that you have a simple linear regression model down (one output, one
predictor) using least squares estimation to estimate your βs, you can build
upon that model in three primary ways, described in the upcoming sections:
1. Adding in modeling assumptions about the errors
2. Adding in more predictors
3. Transforming the predictors
Adding in modeling assumptions about the errors. If you use your model to
predict y for a given value of x, your prediction is deterministic and doesn’t
capture the variability in the observed data. See on the right hand side of
Figure 2.7 that for a fixed value of x =5, there is variability among the time
spent on the site. You want to capture this variability in your model, so you
extend your model to:
y = β0 + β1x + ϵ
where the new term ϵ is referred to as noise.
It's also called the error term: ϵ represents the actual error, the difference between the observations and the true regression line, which you'll never know and can only estimate with your βs.
ϵ ∼ N(0, σ²)
With the preceding assumption on the distribution of the noise, this model says that, for any given value of x, the conditional distribution of y given x is:
p(y | x) ∼ N(β0 + β1x, σ²)
Once you have the estimated line, you can see how far away the observed data points are from the line itself, and you can treat these differences, also known as observed errors or residuals, as observations themselves, or as estimates of the actual errors, the ϵs.
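To make the residuals concrete, here is a small sketch, assuming model is the lm() fit from earlier:
e.hat <- residuals(model)                 # observed errors: y - fitted(model)
summary(e.hat)                            # should be roughly centered around 0
sqrt(sum(e.hat^2) / df.residual(model))   # estimate of sigma (the residual standard error)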
Evaluation metrics:
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
 -121.17   -52.63    -9.72    41.54   356.27

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -32.083     16.623   -1.93   0.0565 .
x             45.918      2.141   21.45   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 77.47 on 98 degrees of freedom
Multiple R-squared: 0.8244, Adjusted R-squared: 0.8226
F-statistic: 460 on 1 and 98 DF, p-value: < 2.2e-16
R-squared:
R-squared is the proportion of the variability in y that is explained by the model:
R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
In the output above, Multiple R-squared: 0.8244 means that roughly 82% of the variation in time spent is accounted for by the number of new friends.
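As a quick check (a sketch assuming the same x, y, and model objects as above):
y.hat <- fitted(model)                          # predicted values
1 - sum((y - y.hat)^2) / sum((y - mean(y))^2)   # should match "Multiple R-squared"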
Cross-validation:
Another approach to evaluating the model is as follows. Divide our data up
into a training set and a test set: 80% in the training and 20% in the test. Fit
the model on the training set, then look at the mean squared error on the test
set and compare it to that on the training set. Make this comparison across
sample size as well. If the mean squared errors are approximately the same,
then our model generalizes well and we’re not in danger of overfitting.
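Here is a minimal sketch of that comparison in R, on simulated data standing in for the (x, y) pairs of the example (the data and column names are illustrative, not from the text):
set.seed(3)
dat <- data.frame(x = rpois(200, 5))
dat$y <- -32 + 46 * dat$x + rnorm(200, sd = 75)

n <- nrow(dat)
train.idx <- sample(1:n, size = floor(0.8 * n))   # 80% of rows for training
train <- dat[train.idx, ]
test  <- dat[-train.idx, ]

fit <- lm(y ~ x, data = train)
mse.train <- mean((train$y - predict(fit, newdata = train))^2)
mse.test  <- mean((test$y  - predict(fit, newdata = test))^2)
c(train = mse.train, test = mse.test)   # similar values suggest the model generalizes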
Transformations:
Going back to one x predicting one y, why did we assume a linear relationship? Instead,
maybe, a better model would be a polynomial relationship like this:
y = β0 + β1x + β2x² + β3x³
To think of it as linear, you transform or create new variables (for example, z = x²) and build a regression model based on z.
Other common transformations are to take the log or to pick a threshold and turn it into a
binary predictor instead.
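A minimal sketch of the polynomial case in R (illustrative data, not from the text): create z = x² as a new variable, or write the transformation directly in the model formula with I().
set.seed(4)
x <- runif(100, 0, 10)
y <- 3 + 2 * x - 0.5 * x^2 + rnorm(100)

z <- x^2                # new transformed variable
lm(y ~ x + z)           # linear in the betas, polynomial in x
lm(y ~ x + I(x^2))      # the same model written directly in the formula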
Review of the linear regression model (LRM):
Let’s review the assumptions we made when we built and fit our model:
• Linearity
• Error terms normally distributed with mean 0
• Error terms independent of each other
• Error terms have constant variance across values of x
• The predictors we’re using are the right predictors
When and why do we perform linear regression? Mostly for two reasons:
• If we want to predict one variable knowing others
• If we want to explain or understand the relationship between two
or more things
k-Nearest Neighbors (k-NN):
k-NN labels a new, unlabeled data point by looking at the labeled points nearest to it. To automate it, two decisions must be made:
First, how do you define similarity or closeness? Once you define it, for a
given unrated item, you can say how similar all the labeled items are to it,
and you can take the most similar items and call them neighbors, who each
have a “vote.”
This brings you to the second decision: how many neighbors should you
look at or “let vote”? This value is k, which ultimately you’ll choose as the
data scientist, and we’ll tell you how.
For example, here are the first few rows of a dataset, with income
represented in thousands:
What if a new guy comes in who is 57 years old and who makes $37,000?
What’s his likely credit rating label? Look at Figure 2.10. Based on the other
people near him, what credit score label do you think he should be given?
Let’s use k-NN to do it automatically.
Similarity or distance metrics:
Definitions of “closeness” and similarity vary depending on the context:
Euclidean distance is a good go-to distance metric for attributes that are real-valued and
can be plotted on a plane or in multidimensional space.
Cosine Similarity
Also can be used between two real-valued vectors, x and y, and will yield a value between −1 (exact opposite) and 1 (exactly the same), with 0 in between meaning independent. Recall the definition: cos(x, y) = (x · y) / (‖x‖ ‖y‖).
Mahalanobis Distance
Also can be used between two real-valued vectors and has the advantage over Euclidean
distance that it takes into account correlation and is scale-invariant.
Hamming Distance
Can be used to find the distance between two strings or pairs of words or DNA sequences
of the same length. The distance between olive and ocean is 4 because, aside from the "o", the other 4 letters are different. The distance between shoe and hose is 3 because, aside from the "e", the other 3 letters are different. You just go through each position and check whether the letters are the same in that position, and if not, increment your count by 1.
Manhattan
This is also a distance between two real-valued k-dimensional vectors. The image to have in
mind is that of a taxi having to travel the city streets of Manhattan, which is laid out in a
grid-like fashion (you can’t cut diagonally across buildings).
The distance is therefore defined as the sum of the absolute differences of the coordinates: d(x, y) = Σᵢ |xᵢ − yᵢ|.
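These metrics are easy to compute directly; here is a small sketch (the vectors and strings are chosen purely for illustration):
x <- c(1, 2, 3)
y <- c(4, 0, 3)
sqrt(sum((x - y)^2))                              # Euclidean distance
sum(abs(x - y))                                   # Manhattan distance
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))    # cosine similarity, in [-1, 1]
# Mahalanobis distance additionally needs a covariance matrix; see stats::mahalanobis()

# Hamming distance between two equal-length strings
a <- strsplit("olive", "")[[1]]
b <- strsplit("ocean", "")[[1]]
sum(a != b)                                       # 4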
Training and Test sets:
For k-NN, the training phase is straightforward: it's just reading in your data with the "high" or "low" credit data points marked. In testing, you pretend you don't know the true label and see how good you are at guessing using the k-NN algorithm. To do this, you'll need to hold back some clean data from the overall dataset for testing.
library(class)   # provides knn()
# assumed setup (not shown in this excerpt): 'data' has columns Age, Income, Credit
n.points <- nrow(data)
# randomly pick 80% of the rows for the training set
training <- sample(1:n.points, size = floor(0.8 * n.points))
train <- subset(data[training, ], select = c(Age, Income))
# the other rows are going into the test set
testing <- setdiff(1:n.points, training)
# define the test set to be the other rows
test <- subset(data[testing, ], select = c(Age, Income))
cl <- data$Credit[training]
# this is the subset of labels for the training set
true.labels <- data$Credit[testing]
# subset of labels for the test set, we're withholding these
For each member of the test set, you'll pretend you don't know his label. Look at the labels of his three nearest neighbors, say, and use the label of the majority vote to label him. You'll label all the members of the test set and then use the misclassification rate to see how well you did. All this is done automatically in R, with just this single line of R code:
knn(train, test, cl, k = 3)
Choosing k:
How do you choose k? This is a parameter you have control over. You might need to
understand your data pretty well to get a good guess, and then you can try a few different k’s
and see how your evaluation changes. So you’ll run k-nn a few times, changing k, and
checking the evaluation metric each time.
# we'll loop through and see what the misclassification rate
# is for different values of k
num.test.set.labels <- length(true.labels)   # number of held-out labels
for (k in 1:20) {
  print(k)
  predicted.labels <- knn(train, test, cl, k)   # the knn() function from the class package
  num.incorrect.labels <- sum(predicted.labels != true.labels)
  misclassification.rate <- num.incorrect.labels / num.test.set.labels
  print(misclassification.rate)
}
Here's the output in the form (k, misclassification rate):
k    misclassification.rate
1    0.28
2    0.315
3    0.26
4    0.255
5    0.23
6    0.26
7    0.25
8    0.25
9    0.235
10   0.24
So let's go with k = 5 because it has the lowest misclassification rate, and now you can apply it to your guy who is 57 with a $37,000 salary.
In the R console, it looks like:
> test <- c(57, 37)
> knn(train, test, cl, k = 5)
[1] low
The output by majority vote is a low credit score when k = 5.
k-means:
k-means is the first unsupervised learning technique we consider, where the goal of the algorithm is to determine the definition of the right answer by finding clusters in the data for you.
Consider some kind of data at the user level, e.g., Google+
data, survey data, medical data, or SAT scores.
Start by adding structure to your data. Namely, assume each row of
your dataset corresponds to a user as follows:
age | gender | income | state | household size
Your goal is to segment the users. This process is known by various names:
besides being called segmenting, you could say that you’re going to stratify,
group, or cluster the data. They all mean finding similar types of users and
bunching them together.
Let’s say you have users where you know how many ads have been shown
to each user (the number of impressions) and how many times each has
clicked on an ad (number of clicks). Figure 2.11 shows a simplistic picture
that illustrates what this might look like.
Figure 2.11 - Clustering in two dimensions; look at the panels in the left column from top to bottom, and then the right column from top to bottom.
k-means algorithm looks for clusters in d dimensions, where d is the number
of features for each data point.
Here’s how the k-means algorithm illustrated in Figure 2.11 works:
1. Initially, you randomly pick k centroids (or points that will be the center of
your clusters) in d-space. Try to make them near the data but different from
one another.
2. Then assign each data point to the closest centroid.
3. Move the centroids to the average location of the data points (which
correspond to users in this example) assigned to it.
4. Repeat the preceding two steps until the assignments don’t change, or
change very little.
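Here is a minimal sketch of these steps using R's built-in kmeans() on simulated impressions/clicks data (the data and the choice k = 2 are illustrative, not from the text):
set.seed(5)
## two synthetic groups of users: low-engagement and high-engagement
users <- rbind(
  data.frame(impressions = rpois(50, 20),  clicks = rpois(50, 1)),
  data.frame(impressions = rpois(50, 100), clicks = rpois(50, 15))
)

fit <- kmeans(users, centers = 2)     # runs steps 1-4 until assignments stabilize
fit$centers                           # final centroid locations
table(fit$cluster)                    # cluster sizes
plot(users, col = fit$cluster, pch = 20)
points(fit$centers, col = 1:2, pch = 4, cex = 2)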
k-means has some known issues:
• Choosing k is more an art than a science, although there are bounds: 1 ≤ k ≤ n, where n is the number of data points.
• There are convergence issues—the solution can fail to exist, if the algorithm
falls into a loop, for example, and keeps going back and forth between two
possible solutions, or in other words, there isn’t a single unique solution.
• Interpretability can be a problem—sometimes the answer isn’t at all useful.
Indeed that’s often the biggest problem.