Module 2
Handouts for Session 1: Exploratory Data Analysis and the Data Science Process: Basic
tools (plots, graphs and summary statistics) of EDA
2.1 Exploratory Data Analysis (EDA)
EDA is the first step towards building a model. Your understanding of the problem you are
working on changes as you go, hence "exploratory".
It is a method of systematically going through the data: plotting distributions of all
variables (using box plots), plotting time series of the data, transforming variables,
looking at all pairwise relationships between variables using scatterplot matrices, and
generating summary statistics for all of them.
At the very least that would mean computing their mean, minimum, maximum, the
upper and lower quartiles, and identifying outliers.
EDA is about understanding the data and gaining intuition, understanding the shape of
it and connecting the understanding of the process that generated the data to the data
itself.
Questions:
1.Define EDA.
In the context of data generated from logs, EDA helps with the debugging process.
Patterns found in the data could be something actually wrong with the logging process
that needs fixing. If you never go to the trouble of debugging, you’ll continue to think
your patterns are real.
The insights drawn from EDA can be used to improve the development of algorithms.
Example: Develop a ranking algorithm that ranks content shown to the users. – Develop
a notion of “Popular”
Before deciding how to quantify popularity (number of clicks, most commented, averages,
etc.), the behaviour of the data needs to be understood.
Exercise: EDA
There are 31 datasets named nyt1.csv, nyt2.csv, ..., nyt31.csv, which you can find here:
https://github.com/oreillymedia/doing_data_science.
Each one represents one (simulated) day's worth of ads shown and clicks recorded on
the New York Times home page in May 2012. Each row represents a single user. There
are five columns: age, gender (0 = female, 1 = male), number of impressions, number of
clicks, and logged-in.
We use R to handle these data. It's a programming language designed specifically for
data analysis, and it's fairly intuitive to start using. Code can be written based on the
following logic:
Summary Statistics: Generating summary statistics for the dataset and for
age categories.
Visualization: Creating histograms and boxplots to visualize data distribution.
Click-Through Rate (CTR): Calculating and visualizing the click-through
rate.
Creating Categories: Creating a new column 'scode' to categorize data based
on impressions and clicks.
Converting to Factor: Converting the newly created column into a factor.
Summary Table: Generating a summary table for impressions based on the
created categories.
# Loading required packages: doBy provides summaryBy(), ggplot2 provides the plots
library(doBy)
library(ggplot2)

# Reading Data
data1 <- read.csv(url("http://stat.columbia.edu/~rachel/datasets/nyt1.csv"))

# Displaying the first few rows of the dataset
head(data1)

# The cut() function is used to create age categories (agecat) based on the 'Age' variable
data1$agecat <- cut(data1$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))

# Creating a histogram to visualize the distribution of Impressions across age categories
ggplot(data1, aes(x = Impressions, fill = agecat)) + geom_histogram(binwidth = 1)

# Creating a boxplot to visualize the distribution of Impressions within each age category
ggplot(data1, aes(x = agecat, y = Impressions, fill = agecat)) + geom_boxplot()

# Helper functions used with summaryBy(): siterange() returns the count, min, mean and
# max of a variable; clen() returns just the count
siterange <- function(x) { c(length(x), min(x), mean(x), max(x)) }
clen <- function(x) { c(length(x)) }

# Creating a new column to indicate whether there are impressions or not
data1$hasimps <- cut(data1$Impressions, c(-Inf, 0, Inf))

# Generating summary statistics for Clicks grouped by the presence or absence of impressions
summaryBy(Clicks ~ hasimps, data = data1, FUN = siterange)

# Creating density plots to visualize the click-through rate distribution across age categories
ggplot(subset(data1, Impressions > 0), aes(x = Clicks / Impressions, colour = agecat)) +
  geom_density()

# Creating density plots for click-through rate, filtering out cases where there are no clicks
ggplot(subset(data1, Clicks > 0), aes(x = Clicks / Impressions, colour = agecat)) +
  geom_density()

# Generating a boxplot to visualize the distribution of Clicks within each age category
ggplot(subset(data1, Clicks > 0), aes(x = agecat, y = Clicks, fill = agecat)) +
  geom_boxplot()

# Creating a density plot to visualize the distribution of Clicks across age categories
ggplot(subset(data1, Clicks > 0), aes(x = Clicks, colour = agecat)) + geom_density()

# Creating a new column 'scode' to categorize each user based on impressions and clicks
data1$scode <- "NoImps"
data1$scode[data1$Impressions > 0] <- "Imps"
data1$scode[data1$Clicks > 0] <- "Clicks"

# Converting the newly created column into a factor
data1$scode <- factor(data1$scode)

# Generating a summary table for impressions based on the created categories and other variables
etable <- summaryBy(Impressions ~ scode + Gender + agecat, data = data1, FUN = clen)
Questions:
1.Explain EDA and explain the steps involved in EDA
2.Write R script for demonstrating EDA
Raw data is recorded. A lot of aspects of these real-world activities are lost even when we
have that raw data. Real-world data is not clean. The raw data is processed to make it
clean for analysis. We build and use data munging pipelines (joining, scraping,
wrangling). This is done with Python, R, SQL, and shell scripts.
The EDA process can now be started. During the course of the EDA we may find that
the data is not actually clean as there are missing values, outliers, incorrectly logged
data or data that was not logged.
In such a case, we may have to collect more data, or we can spend more time cleaning
the data (imputation). The model is designed to use some algorithm (k-NN, linear
regression, Naïve Bayes, decision tree, random forest, etc.). Model selection depends on
the type of problem being addressed: prediction, classification, or a basic description
problem.
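As a small aside on the cleaning step, here is a minimal sketch of mean imputation and outlier flagging in R; the data frame df, its column income, and the values are hypothetical, used only for illustration.
# Hypothetical data frame with missing values and one extreme value
df <- data.frame(income = c(35, 42, NA, 51, 380, NA, 47))

# Impute missing values with the column mean
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Flag outliers as points more than 1.5 * IQR beyond the quartiles
q <- quantile(df$income, c(0.25, 0.75))
df$outlier <- df$income < q[1] - 1.5 * IQR(df$income) |
              df$income > q[2] + 1.5 * IQR(df$income)
df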
Alternatively, our goal may be to build or prototype a "data product" such as a spam
classifier, search ranking algorithm, or recommendation system. The key difference that
distinguishes data science from statistics here is that the data product is incorporated
back into the real world; users interact with it, and that generates more data, which
creates a feedback loop.
A movie recommendation system generates evidence that lots of people love a movie.
This leads to more people watching the movie, creating a feedback loop.
Take this loop into account in any analysis you do by adjusting for any biases your
model caused. Your models are not just predicting the future, but causing it!
A Human Data Scientist has to make the decisions about what data to collect, and
why.
That person needs to be formulating questions and hypotheses and making a plan for
how the problem will be attacked.
Let’s revise or at least add an overlay to make clear that the data scientist needs to be
involved in this process throughout, meaning they are involved in the actual coding as
well as in the higher-level process, as shown in Figure 2.
Ask a question.
Do background research.
Construct a hypothesis.
Not every problem requires one to go through all the steps, but almost all problems can
be solved with some combination of the stages.
Questions:
1. Explain the process involved in Data Science.
2. Explain in detail the role of a Data Scientist.
Handouts for Session 4: Case Study, Three Basic Machine Learning Algorithms
2.5 Case Study: RealDirect
Goal: Use all the accessible real estate data to improve the way people buy/sell
houses
Problem Statement: Normally people sell their homes about once every 7 years with
the help of professional brokers and current data.
Brokers are typically free agents who guard their data aggressively, and the really good
ones have a lot of experience (i.e., slightly more data than the inexperienced brokers).
Solution by RealDirect:
Hire a team of licensed real estate agents who work together and pool their
knowledge.
Provide an interface for sellers, giving them useful data driven tips on how to
sell their house.
The team of brokers also become data experts learning to use information collecting
tools to keep tabs on new and relevant data or to access publicly available data.
Publicly available data is old and has a 3-month lag between a sale and when the data
about the sale is available.
RealDirect is working on real-time feeds of when people start searching for a home,
what the initial offer is, the time between offer and close, and how people search for a
home online.
2.6 Algorithms
An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one of
the fundamental concepts in, or building blocks of, computer science: the basis of the design
of elegant and efficient code, data preparation and processing, and software engineering.
With respect to data science, there are at least three classes of algorithms one should be aware
of:
Machine learning algorithms that are the basis of artificial intelligence (AI) such as image
recognition, speech recognition, recommendation systems, ranking and personalization of
content— often the basis of data products—are not usually part of a core statistics
curriculum or department.
Many business or real-world problems that can be solved with data are classification and
prediction problems when expressed mathematically. Those models and algorithms can be
used to classify and predict.
The key challenge for data scientists isn't just knowing how to implement statistical
methods, but rather understanding which methods are appropriate based on the problem
and underlying assumptions.
It's about knowing when and why to use certain techniques, considering factors like the
nature of the problem, data characteristics, and contextual requirements.
Questions:
1. Explain in Detail the process involved in the case study of Real Direct.
Example 1. Suppose you run a social networking site that charges a monthly
subscription fee of $25, and that this is your only source of revenue.
Each month you collect data and count your number of users and total revenue. You've
done this daily over the course of two years, recording it all in a spreadsheet. You could
express this data as a series of points, where x is the number of users and y is the total
revenue. Here are the first four:
From the given data it can be observed that y = 25x, which shows that:
• There's a linear pattern.
• The coefficient relating x and y is 25.
• It seems deterministic.
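For illustration, here is a quick check in R with hypothetical monthly records that follow this pattern; the specific numbers are assumed, not taken from the original spreadsheet.
users   <- c(1, 10, 100, 200)       # number of paying users (assumed values)
revenue <- c(25, 250, 2500, 5000)   # total revenue in dollars

revenue / users                     # the ratio is exactly 25 for every point
plot(users, revenue, pch = 20, xlab = "Users", ylab = "Revenue ($)")
abline(0, 25)                       # the deterministic line y = 25x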
Now consider a second example: there seems to be a linear relationship between the number
of new friends a user makes and the time spent on the social networking site, suggesting
that more new friends lead to more time spent on the site.
This relationship can be described using statistical methods like correlation and linear
regression. Although there is an association between these variables, it is not perfectly
deterministic, indicating that other factors also influence the time users spend on the
site.
Start by writing something down
To model the relationship, capture the trend and variation. Start by assuming a linear
relationship between variables. Focus on the trend first, using linear modeling to
describe how the number of new friends relates to time spent on the site.
There are many lines that look more or less like they might work, as shown in the figure
below.
To begin modelling the assumed linear relationship y = β0 + β1x, the task is to find the
optimal values for β0 (the intercept: the value of y when x = 0) and β1 (the slope of the
line: how much y changes for a unit change in x) using the observed data
(x1, y1), (x2, y2), ..., (xn, yn). This model can be expressed in matrix notation as y = x·β.
The next step involves fitting this model to the data.
To find the optimal line, the "residual sum of squares" (RSS), denoted RSS(β), is
defined as the sum of the squared vertical distances between the observed points and any
given line:

RSS(β) = Σi (yi − β0 − β1xi)²

Here, i ranges over the data points. This function of β needs to be minimized to find the
optimal line.
The "hat" symbol ( ) indicates that it's the estimator for 𝛽. Since the true value of 𝛽
is unknown, we use the observed data to compute an estimate using this estimator.
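A small sketch of the closed-form estimate on simulated data, checked against lm(); the data here is generated only for illustration and is not the data from the example above.
set.seed(1)
x <- runif(100, 0, 20)                     # hypothetical predictor values
y <- -30 + 45 * x + rnorm(100, sd = 50)    # hypothetical noisy response

X <- cbind(1, x)                           # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                                   # least squares estimates of beta_0 and beta_1

coef(lm(y ~ x))                            # lm() returns the same estimates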
To fit the linear regression model and obtain the β coefficients in R, you can use the
lm() function with a simple one-liner. For example:

model <- lm(y ~ x)

This line of code creates a linear regression model named model, where y is the response
variable and x is the predictor variable. The lm() function in R automatically calculates
the β coefficients for the linear model based on the provided data.
Here the fitted model has time_spent (y) as the response variable and total_new_friends
(x) as the predictor variable. The following code extracts the estimated coefficients and
overlays the fitted line on a scatterplot of the data, showing the estimated relationship
between the number of new friends and the time spent on the social networking site.
Code:
coefs <- coef(model)                 # extract the fitted intercept and slope
plot(x, y, pch = 20, xlab = "Number of new friends", ylab = "Time spent (seconds)")
abline(coefs[1], coefs[2])           # overlay the fitted regression line
Output:
When you display model, it gives the following:

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -32.08        45.92
A histogram of the time users spend on the site can be read as follows:
1. X-axis (Time spent): The x-axis represents the range of time spent by users on the
social networking site, divided into intervals or bins. Each bin represents a range of
time spent (e.g., 0-100 seconds, 101-200 seconds, and so on).
2. Y-axis (Frequency): The y-axis represents the frequency or count of users falling
within each bin. It shows how many users spent a particular amount of time on the site,
as indicated by the height of the bars.
If a new x-value of 5 came in, meaning a user had five new friends, how confident are
we in the output value of
−32.08 + 45.92 × 5 = 197.52 seconds?
To address the question of confidence, you need to extend your model to account for
variation. While you've modeled the trend between the number of new friends and time
spent on the site, you haven't yet modeled the variation. This means that you wouldn't
claim that everyone with five new friends spends exactly the same amount of time on
the site.
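One way to see this in R is with a prediction interval; the sketch below assumes the fitted object model from the earlier lm(y ~ x) call is still available.
new_point <- data.frame(x = 5)   # a new user with five new friends

# Point estimate plus a 95% prediction interval for an individual user:
# this interval reflects the person-to-person variability around the line
predict(model, newdata = new_point, interval = "prediction", level = 0.95)

# For comparison, the narrower confidence interval for the average time
# spent by users with five new friends
predict(model, newdata = new_point, interval = "confidence", level = 0.95)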
Now that you have a simple linear regression model down (one output, one predictor),
using least squares estimation to estimate your βs, you can build upon that model in
three primary ways:
1. Adding in modeling assumptions about the errors
2. Adding in more predictors
3. Transforming the predictors
When applying the model to predict y for a specific x value, the prediction lacks
the variability present in the observed data.
See on the right-hand side of Figure 7 that, for a fixed value of x = 5, there is
variability among the time spent on the site. This variability can be shown in
the model as:

y = β0 + β1x + ε
The new term ε, also known as noise or the error term, represents the
unaccounted-for variability in the data. It reflects the difference between the
observed data points and the true regression line, which can only be estimated.
The observed residuals are ei = yi − ŷi = yi − (β̂0 + β̂1xi) for i = 1, ..., n. The
variance of ε is estimated by Σi ei² / (n − 2); dividing by n − 2 produces an unbiased
estimator. This is called the Mean Squared Error (MSE): it captures how much the
predicted values vary from the actual values.
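A minimal sketch of this calculation in R, again assuming the fitted object model from the earlier lm(y ~ x) call.
e <- residuals(model)        # e_i = y_i - y_hat_i
n <- length(e)
mse <- sum(e^2) / (n - 2)    # divide by n - 2 for an unbiased estimate of the error variance
mse

# The same quantity is what summary() reports as the squared residual standard error
summary(model)$sigma^2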
Evaluation Metrics: Next, we need to assess how well the model we have built fits the
data, so we use different evaluation metrics to measure it: R-squared, p-values, and
cross-validation.
i) R-squared:

R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
iii)Cross-Validation: Divide the data into a training set and a test set: 80% in
the training and 20% in the test. Fit the model on the training set. Then look at
the mean squared error on the test set and compare it to that on the training set.
Make this comparison across sample size as well. If the mean squared errors are
approximately the same, then the model generalizes well and there is no
overfitting.
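A minimal sketch of this 80/20 hold-out check, using simulated data; the data frame d and its values are made up purely for illustration.
set.seed(42)
d <- data.frame(x = runif(200, 0, 20))
d$y <- -30 + 45 * d$x + rnorm(200, sd = 50)   # simulated response

# 80% of rows for training, 20% held out for testing
train_idx <- sample(seq_len(nrow(d)), size = round(0.8 * nrow(d)))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit <- lm(y ~ x, data = train)
mse_train <- mean((train$y - predict(fit, newdata = train))^2)
mse_test  <- mean((test$y  - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)   # similar values suggest the model generalizes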
What we just looked at was simple linear regression, one outcome or dependent
variable and one predictor. But we can extend this model by building in other
predictors, which is called multiple linear regression:
# Multiple linear regression with three predictors
model <- lm(y ~ x_1 + x_2 + x_3)
# The same model with an interaction term between x_2 and x_3
model <- lm(y ~ x_1 + x_2 + x_3 + x_2*x_3)
# Transforming the predictors, e.g. fitting a polynomial in x:
# y = (a0·x + a1·x² + a2·x³ + b) + e
Questions:
3. Explain the three primary ways to extend linear regression beyond least squares.
To automate the process, two key decisions are essential: defining the measure
of similarity or closeness between items and utilizing this measure to identify
the most similar items, known as neighbors, to an unrated item. These neighbors
contribute their "votes" towards the classification or labeling of the unrated
item.
Consider a dataset consisting of the age, income, and a credit category of "high" or
"low" for a group of people, and suppose you want to use age and income to predict the
credit label of "high" or "low" for a new person.
For example, here are the first few rows of a dataset, with income represented
in thousands:
age income credit
69 3 low
66 57 low
49 79 low
49 17 low
58 26 high
44 71 high
Plot people as points on the plane (age versus income) and label people with an empty
circle if they have low credit ratings.
Now a new person arrives, for example someone who is 57 years old with a $37,000
income. Given the credit labels of other individuals nearby, what label do you propose
should be assigned to this person? Let's use k-Nearest Neighbors (k-NN) to automate
this process.
Similarity or distance metrics can be employed to quantify the similarity between data
points. Definitions of “closeness” and similarity vary depending on the context. There
are many more distance metrics available to you depending on your type of data.
In our scenario, determine a metric (e.g., Euclidean distance) to measure the similarity
between individuals based on their age and income. Euclidean distance is a good go-to
distance metric for attributes that are real-valued.
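For example, the Euclidean distance between the first two people in the table above (age, income in $1000s) can be computed directly:
p1 <- c(age = 69, income = 3)    # first person in the sample rows above
p2 <- c(age = 66, income = 57)   # second person

sqrt(sum((p1 - p2)^2))           # Euclidean distance computed by hand
dist(rbind(p1, p2))              # the built-in dist() gives the same value
Because age and income sit on different scales, the attributes are often standardized (for example with scale()) before computing distances so that one attribute does not dominate.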
1. Cosine Similarity
Can also be used between two real-valued vectors, x and y, and will
yield a value between −1 (exact opposite) and 1 (exactly the same), with
0 in between meaning independent.
3. Mahalanobis Distance
Also can be used between two real-valued vectors and has the advantage
over Euclidean distance that it takes into account correlation and is scale-
invariant.
4. Hamming Distance
Can be used to find the distance between two strings or pairs of words
or DNA sequences of the same length. The distance between olive and
ocean is 4 because aside from the “o” the other 4 letters are different.
The distance between shoe and hose is 3 because aside from the “e” the
other 3 letters are different. You just go through each position and check
whether the letters are the same in that position, and if not, increment your
count by 1.
5. Manhattan Distance
This is also a distance between two real-valued k-dimensional vectors,
computed as the sum of the absolute differences of their coordinates.
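A short sketch of some of the other metrics listed above; the vectors are small made-up examples, and the strings come from the Hamming example.
x <- c(1, 2, 3)
y <- c(2, 4, 6)

# Cosine similarity: 1 = same direction, -1 = opposite, 0 = independent
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Hamming distance between two equal-length strings, e.g. "olive" vs "ocean"
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
hamming("olive", "ocean")   # 4, as in the example above

# Manhattan distance: sum of absolute coordinate differences
sum(abs(x - y))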
In machine learning, the typical process involves two phases: training and testing.
During training, a model is created and trained using labeled data to learn patterns
and relationships.
In the testing phase, the model's performance is evaluated using new, unseen data to
assess its effectiveness in making predictions or classifications.
In K-Nearest Neighbors (k-NN), the training phase involves reading the labeled data
with "high" or "low" credit points marked. During testing, the algorithm attempts to
predict the labels of unseen data points using the k-NN approach without prior
knowledge of the true labels, evaluating its accuracy in the process.
To accomplish this, a portion of clean data from the entire dataset needs to be
reserved for the testing phase. Typically, about 20% of the data is randomly selected
and set aside for testing purposes.
Sample R code to prepare the train and test sets:
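A minimal sketch follows, using a simulated stand-in for the credit data purely for illustration; in practice you would load the real age/income/credit dataset. The labelling rule below is hypothetical, used only so that both classes exist.
set.seed(1234)
n_people <- 200
credit_data <- data.frame(
  age    = sample(20:80, n_people, replace = TRUE),
  income = sample(1:100, n_people, replace = TRUE)   # income in $1000s
)
# Hypothetical labelling rule just to create "high"/"low" classes to work with
credit_data$credit <- factor(ifelse(credit_data$income + rnorm(n_people, sd = 20) > 50,
                                    "high", "low"))

# Randomly hold out about 20% of the rows for testing; the rest is for training
test_idx <- sample(seq_len(n_people), size = round(0.2 * n_people))
test  <- credit_data[test_idx, ]
train <- credit_data[-test_idx, ]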
To choose k, run k-nn a few times, changing k, and checking the evaluation
metric each time.
When you have binary classes like “high credit” or “low credit,” picking k to
be an odd number can be a good idea because there will always be a majority
vote, no ties. If there is a tie, the algorithm just randomly picks.
So let’s go with k =5 because it has the lowest misclassification rate, and now
k=5 can be applied to the guy who is 57 with a $37,000 salary.
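A minimal sketch of this procedure, assuming the train and test sets prepared above and using the knn() function from the class package:
library(class)   # provides knn()

# Try a few odd values of k and record the misclassification rate on the test set
for (k in c(1, 3, 5, 7, 9)) {
  pred <- knn(train[, c("age", "income")], test[, c("age", "income")],
              cl = train$credit, k = k)
  cat("k =", k, " misclassification rate =", mean(pred != test$credit), "\n")
}

# Apply the chosen k (here k = 5) to the new person: age 57, income $37k
knn(train[, c("age", "income")], data.frame(age = 57, income = 37),
    cl = train$credit, k = 5)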
Questions:
This is where k-means comes into the picture: k-means is a clustering algorithm where k
is the number of clusters (bins). The k-means algorithm looks for clusters in d
dimensions, where d is the number of features for each data point.
Algorithm
1. Randomly pick k centroids (points that will be the centers of your clusters) in d-space,
ensuring they are near the data but distinct from one another.
2. Assign each data point to its closest centroid.
3. Move each centroid to the average location of the data points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
One has to determine if there's a natural way to describe these groups once the
algorithm completes. At times, you may need to make slight adjustments to 𝑘 a few
times before obtaining natural groupings. This is an example of unsupervised
learning because the labels are not known and are instead discovered by the
algorithm.
However, k-means has some issues:
1. Choosing k is more art than science, bounded by 1 ≤ k ≤ n, where n is the number of
data points.
2. Convergence issues may arise, with the algorithm potentially falling into a loop and
failing to find a unique solution.
K-means advantages:
2D version:
Clustering in two dimensions; look at the panels in the left column from top to
bottom, and then the right column from top to bottom.
In practice, k-means is just one line of code in R:

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
nstart: The number of random sets of initial cluster centers. The algorithm will run
nstart times with different initial centers and return the best solution based on the total
within-cluster sum of squares. The default is 1.
algorithm: Specifies the algorithm to use for clustering. Options include:
"Hartigan-Wong" (default): The standard algorithm proposed by
Hartigan and Wong.
"Lloyd": Also known as the standard k-means algorithm.
"Forgy": Another version of the k-means algorithm.
"MacQueen": A variation of the k-means algorithm.
Example:
kmeans(x, centers = 3, iter.max = 10, nstart = 1, algorithm = "Hartigan-Wong")
This example runs the k-means clustering algorithm on the dataset x, aiming to create
3 clusters, with a maximum of 10 iterations, using 1 random start, and employing the
Hartigan-Wong algorithm.
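A small end-to-end sketch, run on simulated two-dimensional data; the data values are made up for illustration only.
set.seed(1)
# Three loosely separated groups of points in 2-D
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
colnames(x) <- c("feature1", "feature2")

fit <- kmeans(x, centers = 3, iter.max = 10, nstart = 25,
              algorithm = "Hartigan-Wong")

fit$size      # number of points assigned to each cluster
fit$centers   # coordinates of the three centroids
plot(x, col = fit$cluster, pch = 20)      # points coloured by cluster
points(fit$centers, pch = 4, cex = 2)     # centroids marked with crosses
Using nstart = 25 rather than the default of 1 runs the algorithm from 25 random starting configurations and keeps the best result, which helps with the convergence issues mentioned earlier.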
Questions: