
DATA SCIENCE AND VISUALIZATION(21CS644)

Module 2-Exploratory Data Analysis and the Data Science Process


Exploratory Data Analysis and the Data
Science Process: Basic tools (plots, graphs
and summary statistics) of EDA, Philosophy
Module 2 Syllabus of EDA, The Data Science Process, Case
Study: Real Direct (online real estate firm).
Three Basic Machine Learning
Algorithms: Linear Regression, k-Nearest
Neighbours (k-NN), k-means

Handouts for Session 1: Exploratory Data Analysis and the Data Science Process: Basic
tools (plots, graphs and summary statistics) of EDA
2.1 Exploratory Data Analysis (EDA)

 “Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look
for those things that we believe are not there, as well as those we believe to be there. –
John Tukey

 Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize
their main characteristics, often using statistical graphics and other data visualization
methods.

 It is the first step towards building a model. The understanding of the problem you are
working on changes as you go – thereby “exploratory”.

Basic Tools – Plots, Graphs, Summary Statistics

 EDA is a method of systematically going through the data: plotting distributions of all
variables (using box plots), plotting time series of the data, transforming variables,
looking at all pairwise relationships between variables using scatterplot matrices, and
generating summary statistics for all of them.
 At the very least that would mean computing their mean, minimum, maximum, the
upper and lower quartiles, and identifying outliers.
 EDA is about understanding the data and gaining intuition, understanding the shape of
it and connecting the understanding of the process that generated the data to the data
itself.
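
The checks listed above can be sketched in R. The following is a minimal illustration, not
taken from the handout, assuming the data have already been loaded into a data frame named
df with numeric columns (df and some_variable are hypothetical names):

# Summary statistics: mean, min, max, and quartiles for every column
summary(df)
# Distribution of a single variable and a quick check for outliers
boxplot(df$some_variable, main = "Boxplot of some_variable")
# All pairwise relationships between variables (scatterplot matrix)
pairs(df)
# Simple time series plot, assuming the rows are ordered in time
plot(df$some_variable, type = "l", main = "Time series of some_variable")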

Questions:
1.Define EDA.


2.Explain the basic tools involved in EDA

Handouts for Session 2: Philosophy of EDA


2.2 Philosophy of EDA
 EDA is used to gain intuition about the data, make comparisons between distributions,
perform sanity checking (ensuring the data are on the expected scale and in the expected
format), analyze missing data, analyze outliers, and summarize the data.

 In the context of data generated from logs, EDA helps with the debugging process.
Patterns found in the data could be due to something actually wrong with the logging process
that needs fixing, rather than real behaviour. If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.

 EDA helps to ensure that the product is performing as intended.

 The insights drawn from EDA can be used to improve the development of algorithms.

 Example: Develop a ranking algorithm that ranks content shown to the users. – Develop
a notion of “Popular”

 Before deciding how to quantify popularity (number of clicks, most commented, averages,
etc.), the behaviour of the data needs to be understood.

Exercise: EDA
There are 31 datasets named nyt1.csv, nyt2.csv,…,nyt31.csv, which you can find here:
https://github.com/oreillymedia/doing_data_science.

Each one represents one (simulated) day’s worth of ads shown and clicks recorded on
the New York Times home page in May 2012. Each row represents a single user. There
are five columns: age, gender (0=female, 1=male), number of impressions, number of clicks,
and loggedin.

We use R to handle these data. It’s a programming language designed specifically for
data analysis, and it’s pretty intuitive to start using. Code can be written based on the
following logic,

 Reading Data: Loading a dataset from a URL.


 Categorization: Creating age categories based on the 'Age' variable.


 Summary Statistics: Generating summary statistics for the dataset and for
age categories.
 Visualization: Creating histograms and boxplots to visualize data distribution.
 Click-Through Rate (CTR): Calculating and visualizing the click-through
rate.
 Creating Categories: Creating a new column 'scode' to categorize data based
on impressions and clicks.
 Converting to Factor: Converting the newly created column into a factor.
 Summary Table: Generating a summary table for impressions based on the
created categories.

# Reading Data
data1 <- read.csv(url("http://stat.columbia.edu/~rachel/datasets/nyt1.csv"))
# Displaying the first few rows of the dataset
head(data1)
# The cut() function is used to create age categories (agecat) based on the 'Age' variable.
data1$agecat <- cut(data1$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))

# Generating summary statistics for the dataset


summary(data1)
# Installing and loading the doBy package for data manipulation
install.packages("doBy")
library("doBy")
# The siterange function computes and returns a vector containing the following summary
# statistics of a given vector: length, min, mean, max
siterange <- function(x){c(length(x), min(x), mean(x), max(x))}
# Generating summary statistics for Age grouped by age categories
summaryBy(Age ~ agecat, data = data1, FUN = siterange)
# Generating summary statistics for Gender, Signed_In, Impressions, and Clicks
#grouped by age categories
summaryBy(Gender + Signed_In + Impressions + Clicks ~ agecat, data = data1)
# Installing and loading the ggplot2 package for data visualization
install.packages("ggplot2")

library(ggplot2)
# Creating a histogram to visualize the distribution of Impressions across age
#categories
ggplot(data1, aes(x = Impressions, fill = agecat)) + geom_histogram(binwidth = 1)
# Creating a boxplot to visualize the distribution of Impressions within each age
#category
ggplot(data1, aes(x = agecat, y = Impressions, fill = agecat)) + geom_boxplot()
# Creating a new column to indicate whether there are impressions or not
data1$hasimps <- cut(data1$Impressions, c(-Inf, 0, Inf))
# Generating summary statistics for Clicks grouped by the presence or absence of
#impressions
summaryBy(Clicks ~ hasimps, data = data1, FUN = siterange)
# Creating density plots to visualize the click-through rate distribution across age
#categories
ggplot(subset(data1, Impressions > 0), aes(x = Clicks / Impressions, colour = agecat)) +
geom_density()
# Creating density plots for click-through rate, filtering out cases where there are no
#clicks
ggplot(subset(data1, Clicks > 0), aes(x = Clicks / Impressions, colour = agecat)) +
geom_density()
# Generating a boxplot to visualize the distribution of Clicks within each age
#category
ggplot(subset(data1, Clicks > 0), aes(x = agecat, y = Clicks, fill = agecat)) +
geom_boxplot()
# Creating a density plot to visualize the distribution of Clicks across age categories
ggplot(subset(data1, Clicks > 0), aes(x = Clicks, colour = agecat)) + geom_density()
# Creating a new column to categorize the data based on impressions and clicks
data1$scode[data1$Impressions == 0] <- "NoImps"
data1$scode[data1$Impressions > 0] <- "Imps"
data1$scode[data1$Clicks > 0] <- "Clicks"
# Converting the newly created column into a factor
data1$scode <- factor(data1$scode)
# Helper that returns the number of observations in a group
clen <- function(x){length(x)}
# Generating a summary table for impressions based on the created categories and other
# variables
etable <- summaryBy(Impressions ~ scode + Gender + agecat, data = data1, FUN = clen)


Questions:
1.Explain EDA and explain the steps involved in EDA
2.Write R script for demonstrating EDA

Handouts for Session 3: Data Science Process


2.3 The Data Science Process
 The process starts with the real world, where different types of data are generated.
Inside the real world are lots of people busy at various activities: some people are using
Google+, others are competing in the Olympics; there are spammers sending spam, and there
are people getting their blood drawn. Say we have data on one of these things.

 Raw data is recorded. Many aspects of these real-world activities are lost even when we
have that raw data, and real-world data is not clean. The raw data is therefore processed
to make it clean for analysis: we build and use data munging pipelines (joining, scraping,
wrangling). This is done with tools such as Python, R, SQL, and shell scripts.

 Eventually data is brought into a format with columns.

 The EDA process can now be started. During the course of the EDA we may find that
the data is not actually clean as there are missing values, outliers, incorrectly logged
data or data that was not logged.

 In such a case, we may have to collect more data or spend more time cleaning the data
(imputation). A model is then designed using some algorithm (k-NN, linear regression,
Naïve Bayes, decision tree, random forest, etc.). Model selection depends on the type of
problem being addressed – prediction, classification, or a basic description problem.

 Alternatively, our goal may be to build or prototype a “data product” such as a spam
classifier, search ranking algorithm, or recommendation system. The key difference that
distinguishes data science from statistics is that the data product is incorporated back
into the real world; users interact with it, which generates more data and creates a
feedback loop.


 A Movie Recommendation system generates evidence that lots of people love a movie.
This will lead to more people watching the movie – feedback loop
 Take this loop into account in any analysis you do by adjusting for any biases your
model caused. Your models are not just predicting the future, but causing it!

Figure 1: The Data Science Process

2.4 A Data Scientist’s Role in this Process

Figure 2: The Data Scientist’s Role


 A Human Data Scientist has to make the decisions about what data to collect, and
why.

 That person needs to be formulating questions and hypotheses and making a plan for
how the problem will be attacked.

 Let’s revise or at least add an overlay to make clear that the data scientist needs to be
involved in this process throughout, meaning they are involved in the actual coding as
well as in the higher-level process, as shown in Figure 2.

 Connection with the Scientific Method:

 Ask a question.

 Do background research.

 Construct a hypothesis.

 Test your hypothesis by doing an experiment.

 Analyze your data and draw a conclusion.

 Communicate your results.

 Not every problem requires one to go through all the steps, but almost all problems can
be solved with some combination of the stages.

Questions:
1.Explain the process involved in Data Science.
2.Explain in details the role of Data Scientist.
Handouts for Session 4: Case Study, Three Basic Machine Learning Algorithms
2.5 Case Study: RealDirect

 Goal: Use all the accessible real estate data to improve the way people buy/sell
houses

 Problem Statement: Normally people sell their homes about once every 7 years with
the help of professional brokers and current data.

 Brokers are typically free agents who guard their data aggressively, and the really good
ones have a lot of experience (i.e., slightly more data than the inexperienced brokers).


 Solution by RealDirect:

 Hire a team of licensed real estate agents who work together and pool their
knowledge.

 Provide an interface for sellers, giving them useful data driven tips on how to
sell their house.

 Uses the interaction data to give real time recommendations on what to do


next.

 The team of brokers also become data experts, learning to use information-collecting
tools to keep tabs on new and relevant data and to access publicly available data.

 Publicly available data is old and has a 3-month lag between a sale and when the data
about the sale is available.

 RealDirect is working on real-time feeds of when people start searching for a home,
what the initial offer is, the time between offer and close, and how people search
for a home online.

 Good information helps both buyer and seller.

 How does RealDirect Make Profits?

 Subscription to sellers – about $395 a month to access the selling tools


 Sellers can use RealDirect’s agents at a reduced commission, typically 2% of the sale
instead of the usual 2.5% or 3%.
 The data pooling enables RealDirect to take a smaller commission because its process is
more optimized, and it therefore gets more volume.
 The site is a platform for buyers and sellers to manage their sale or purchase process.
 There are statuses for each person on site: active, offer made, offer rejected, showing,
in contract, etc.
 Based on Status different actions are suggested by the platform.
 Key issues that a buyer might care about include nearby parks, subway access, and
schools, as well as the comparison of prices per square foot of apartments sold in the
same building or block.
 This is the kind of data RealDirect increasingly wants to cover as part of its service.

2.6 Algorithms

An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one of
the fundamental concepts in, or building blocks of, computer science: the basis of the design
of elegant and efficient code, data preparation and processing, and software engineering.

With respect to data science, there are at least three classes of algorithms one should be aware
of:

1. Data munging, preparation, and processing algorithms, such as sorting, MapReduce, or


Pregel. These algorithms are characterized as data engineering.
2. Optimization algorithms for parameter estimation, including Stochastic Gradient
Descent, Newton’s Method, and Least Squares.
3. Machine learning algorithms, which are largely used to predict, classify, or cluster.

Machine Learning Algorithms

Machine learning algorithms that are the basis of artificial intelligence (AI) such as image
recognition, speech recognition, recommendation systems, ranking and personalization of
content— often the basis of data products—are not usually part of a core statistics
curriculum or department.

Three Basic Algorithms

Many business or real-world problems that can be solved with data are classification and
prediction problems when expressed mathematically. Those models and algorithms can be
used to classify and predict.

The key challenge for data scientists isn't just knowing how to implement statistical
methods, but rather understanding which methods are appropriate based on the problem
and underlying assumptions.

It's about knowing when and why to use certain techniques, considering factors like the
nature of the problem, data characteristics, and contextual requirements.

Questions:
1. Explain in Detail the process involved in the case study of Real Direct.


Handouts for Session 5: Linear Regression Algorithm


2.7 Linear Regression Algorithm

 Linear regression is a common statistical method used to show the mathematical


relationship between two variables.
 It assumes a linear connection between an outcome variable (also called the response
variable, dependent variable, or label, like sales) and a predictor variable (also called
an independent variable, explanatory variable, or feature, like advertising spend).
Essentially, it helps us understand how changes in one variable can predict changes in
another.
 Sometimes, it makes sense that changes in one variable correlate linearly with changes
in another variable. For example, it makes sense that the more umbrellas you sell, the
more money you make.

 Example 1. Suppose you run a social networking site that charges a monthly
subscription fee of $25, and that this is your only source of revenue.
 Each month you collect data and count your number of users and total revenue. You’ve
done this daily over the course of two years, recording it all in a spreadsheet. You could
express this data as a series of points.
 Here are the first four:

 From the given data it can be observed that y = 25x, which shows that:
• There’s a linear pattern.
• The coefficient relating x and y is 25.
• It seems deterministic.


Figure 3: An observed linear pattern


 Example 2 – User Level Data:The dataset, keyed by user, contains weekly behavior
data for hundreds of thousands of users on a social networking site, with columns like
total_num_friends, total_new_friends_this_week, num_visits, time_spent,
number_apps_downloaded, number_ads_shown, gender, and age. During exploratory
data analysis (EDA), a random sample of 100 users was used to plot pairs of variables,
such as total_new_friends vs. time_spent. The business goal is to forecast the number
of users to promise advertisers, but the current focus is on building intuition and
understanding the dataset. The first few rows are listed below :
total_new_friends time_spent
7 276
3 43
4 82
6 136
10 417
9 269

 When plotted, the graph looks like the figure below.


Figure 4: Dataset Plotted

 There seems to be a linear relationship between the number of new friends and the time
spent on the social networking site, suggesting that more new friends lead to more time
spent on the site.
 This relationship can be described using statistical methods like correlation and linear
regression. Although there is an association between these variables, it is not perfectly
deterministic, indicating that other factors also influence the time users spend on the
site.
 Start by writing something down

To model the relationship, capture the trend and variation. Start by assuming a linear
relationship between variables. Focus on the trend first, using linear modeling to
describe how the number of new friends relates to time spent on the site.

 There are many lines that look more or less like they might work, as shown in the figure
below.


Figure 5: Which line is the best fit?

 To begin modelling the assumed linear relationship y = β0 + β1x, the task is to find the
optimal values for β0 (the intercept: the value of y when x = 0) and β1 (the slope of the
line: how much y changes for a unit change in x) using the observed data
(x1, y1), (x2, y2), ..., (xn, yn). This model can be expressed in matrix notation as
y = x⋅β. The next step involves fitting this model to the data.

 Fitting the model: In linear regression, the goal is to calculate coefficients 𝛽 by


finding the line that minimizes the average distance between all data points and the line
itself. This is achieved by minimizing the sum of squared residuals, representing
the vertical distances between data points and the line.
 Linear regression aims to minimize the sum of the squares of the vertical distances

between predicted and observed y-values. This minimization reduces prediction


errors and is known as least squares estimation.


Figure 6: The line closest to all the points

 To find the optimal line, the “residual sum of squares” (RSS), denoted RSS(β), is
defined as the sum of the squared vertical distances between the observed points and any
given line:

RSS(β) = Σi (yi − β⋅xi)²

Here, i ranges over the various data points. This function of β needs to be minimized to
find the optimal line.

 To minimize RSS(β), differentiate it with respect to β, set the derivative equal to
zero, and solve for β. This results in:

β̂ = (xᵀx)⁻¹ xᵀy

 This formula gives the vector β̂ that minimizes the RSS.

The “hat” symbol (ˆ) indicates that it is the estimator for β. Since the true value of β
is unknown, we use the observed data to compute an estimate using this estimator.

 To fit the linear regression model and obtain the 𝛽 coefficients in R, you can use the
lm() function with a simple one-liner. For example:


model <- lm(y ~ x)

This line of code creates a linear regression model named model where y is the response
variable and x is the predictor variable. The lm() function in R automatically calculates
the 𝛽 coefficients for the linear model based on the provided data.

 The following line of code fits a linear regression model where time_spent (y) is the
response variable and total_new_friends (x) is the predictor variable, using the data
provided. This model will estimate the relationship between the number of new friends
and the time spent on the social networking site.

Code:

# x: number of new friends, y: time spent in seconds
# (assumed to have been extracted as numeric vectors from the user-level dataset,
#  which is referred to below as my_dataset)

# Perform linear regression
model <- lm(y ~ x)
model
coefs <- coef(model)

# Plot the data and the fitted regression line
plot(x, y, pch = 20, col = "red", xlab = "Number of new friends",
     ylab = "Time spent (seconds)")
abline(coefs[1], coefs[2])

# Plot histogram of time spent
hist(my_dataset$time_spent, breaks = 10, col = "blue", xlab = "Time spent (seconds)",
     ylab = "Frequency", main = "Histogram of Time Spent on the Site")

Output:
When you display model, it gives the following:
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x

-32.08 45.92

The estimated line is y = −32.08 + 45.92x, which can be rounded to y = −32 + 46x, and
the corresponding plot looks like the figure below.

Figure 7: Left: Model fitting. Right: Time spent (5 new friends)

The graph on the right shows,

1. X-axis (Time spent): The x-axis represents the range of time spent by users on the
social networking site, divided into intervals or bins. Each bin represents a range of
time spent (e.g., 0-100 seconds, 101-200 seconds, and so on).

2. Y-axis (Frequency): The y-axis represents the frequency or count of users falling
within each bin. It shows how many users spent a particular amount of time on the site,
as indicated by the height of the bars.

If a new x-value of 5 came in, meaning a user had five new friends, how confident are
we in the output value of

−32.08 + 45.92 × 5 ≈ 197.5 seconds?

 To address the question of confidence, you need to extend your model to account for
variation. While you've modeled the trend between the number of new friends and time
spent on the site, you haven't yet modeled the variation. This means that you wouldn't
claim that everyone with five new friends spends exactly the same amount of time on
the site.


2.8 Extending beyond least squares

 Now that you have a simple linear regression model down (one output, one predictor)
using least squares estimation to estimate your βs, you can build upon that model in
three primary ways,
1. Adding in modeling assumptions about the errors
2. Adding in more predictors
3. Transforming the predictors

1. Adding in modeling assumptions about the errors

 When applying the model to predict 𝑦 for a specific 𝑥 value, the prediction lacks
the variability present in the observed data.
 See on the right-hand side of Figure 7 that for a fixed value of x = 5, there is
variability among the time spent on the site. This variability can be shown in
the model as

y = β0 + β1x + ε
 The new term ε, also known as noise or the error term, represents the
unaccounted variability in the data. It is the difference between the observed
data points and the true regression line, which can only be estimated using the
regression coefficients β̂0 and β̂1. The assumption is that the noise is normally
distributed, ε ~ N(0, σ²), so that p(y|x) ~ N(β0 + β1x, σ²) is the conditional
distribution of y given x.
 E.g.: Among the set of people who had five new friends this week, the amount
of time they spent on the website has a normal distribution with a mean of
β0 + β1·5 and a variance of σ², and you're going to estimate the parameters
β0, β1, σ from the data.
 Measure the residuals, i.e., how far the observed points are from the estimated
line:

ei = yi − ŷi = yi − (β̂0 + β̂1xi),  for i = 1, ..., n.

The variance of e is estimated as Σi ei² / (n − 2). Dividing by n − 2 produces an
unbiased estimator. This is called the Mean Squared Error – it captures how much the
predicted value varies from the actual value.

 Evaluation Metrics: Next we need to assess how good the model we have built is,
so we use different evaluation metrics to measure it: R-squared, p-values, and
cross-validation (a short R sketch covering all three follows item iii below).
i) R-squared

R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²

This can be interpreted as the proportion of variance explained by our model.
The ratio of the residual sum of squares to the total sum of squares is the
proportion of variance unexplained by the model, and 1 minus the variance
unexplained gives the variance explained by the model.

ii) p-values: In statistical hypothesis testing, the p-value, sometimes called the
probability value, is the probability of obtaining test results at least as extreme
as the ones observed, assuming that the null hypothesis (H0) is true. The p-value can
also be used to determine the point of rejection: it is the smallest significance
level at which the null hypothesis would be rejected. It is expressed as a level of
significance that lies between 0 and 1, and the smaller the p-value, the stronger the
evidence to reject the null hypothesis. If the p-value is very small, the observed
output is unlikely under the null hypothesis conditions (H0). A p-value of 0.05 is
commonly used as the level of significance (α). Usually, two rules of thumb are
applied:

 If p-value > 0.05: the large p-value indicates that there is not enough
evidence to reject the null hypothesis.
 If p-value < 0.05: the small p-value indicates that the null hypothesis
should be rejected, and the result is declared statistically significant.

iii)Cross-Validation: Divide the data into a training set and a test set: 80% in
the training and 20% in the test. Fit the model on the training set. Then look at
the mean squared error on the test set and compare it to that on the training set.
Make this comparison across sample size as well. If the mean squared errors are
approximately the same, then the model generalizes well and there is no
overfitting.
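
A minimal R sketch of these three checks (an illustration, not from the original handout),
assuming the fitted model from before, model <- lm(y ~ x), and a data frame dat holding the
same x and y columns (dat is a hypothetical name):

# i) R-squared, from summary() and computed by hand
summary(model)$r.squared
1 - sum((y - fitted(model))^2) / sum((y - mean(y))^2)

# ii) p-values for the intercept and slope
summary(model)$coefficients[, "Pr(>|t|)"]

# iii) Cross-validation: 80/20 train/test split, then compare mean squared errors
set.seed(42)
train_idx <- sample(seq_len(nrow(dat)), size = round(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
fit <- lm(y ~ x, data = train)
mse_train <- mean((train$y - predict(fit, newdata = train))^2)
mse_test  <- mean((test$y  - predict(fit, newdata = test))^2)
c(mse_train, mse_test)  # similar values suggest the model generalizes well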

2. Other Models for Error Terms


 The mean squared error is an example of what is called a loss function. This is
the standard one to use in linear regression because it gives us a pretty nice
measure of closeness of fit. It has the additional desirable property that, by
assuming that the errors are normally distributed, we can rely on the maximum
likelihood principle. There are other loss functions, such as one that relies on
the absolute value rather than squaring. It’s also possible to build custom loss
functions specific to your particular problem or context.

3. Transforming the predictors

 What we just looked at was simple linear regression, one outcome or dependent
variable and one predictor. But we can extend this model by building in other
predictors, which is called multiple linear regression:

𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + 𝛽3𝑥3 + 𝜖.

 The R code would be

model<-lm(y~x_1+x_2+x_3)

model<-lm(y~x_1+x_2+x_3+ x_2*x_3)

 Make scatterplots of y against each of the predictors as well as between the


predictors, and histograms of y|x for various values of each of the predictors to
help build intuition. To evaluate the model we can use R-squared, p-values, and
cross-validation with training and test sets.
 Sometimes the relationship between the outcome and the predictors may not be linear;
the relationship may be polynomial in nature. It may happen that a linear relationship
is assumed, but the real relationship is quadratic or cubic. Acquiring more data can be
helpful in this regard. For example:

y = a0·x + a1·x² + a2·x³ + b + ε
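
In R, such a polynomial relationship can be fit with the same lm() function by adding
transformed terms; a minimal sketch (assuming x and y as before):

# Cubic polynomial in x; I() protects the arithmetic inside the formula
model_poly <- lm(y ~ x + I(x^2) + I(x^3))
summary(model_poly)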


Questions:

1.Explain the linear Regression with example.

2.Explain Linear Regression Model with R script.

3.Explain the three primary ways to extend the linear regression beyond the least squares

Handouts for Session 6: kNN Algorithm


2.9 k-Nearest Neighbors (k-NN) Algorithm
 K-Nearest Neighbors (K-NN) is an algorithm employed for automatically
labeling unclassified objects based on their similarity to already classified ones
in a dataset. For instance, it could be applied to classify data scientists as "best"
or "worst", individuals as "high credit" or "low credit," restaurants by star
ratings, or patients as "high cancer risk" or "low cancer risk," among various
other applications.

 The intuition behind K-Nearest Neighbors (K-NN) is to identify the most


similar items based on their attributes, examine their labels, and assign the
unclassified item the majority label. In the case of a tie, one of the tied labels is
randomly selected.

 In the context of movie ratings, K-Nearest Neighbors (K-NN) allows you to


predict the rating of an unrated movie, such as "Data Gone Wild," by analyzing
its attributes like length, genre, number of comedy scenes, number of Oscar-
winning actors, and budget. By comparing these attributes with those of already
rated movies, the algorithm identifies the most similar movies and assigns a
rating based on the collective ratings of its nearest neighbors, enabling
predictions without watching the movie.

 To automate the process, two key decisions are essential: defining the measure
of similarity or closeness between items and utilizing this measure to identify
the most similar items, known as neighbors, to an unrated item. These neighbors
contribute their "votes" towards the classification or labeling of the unrated
item.


 The second decision involves determining the number of neighbors to consider


for voting, denoted as "k." As a data scientist, you'll choose this value, which
dictates the extent of influence from neighboring items on the classification or
labeling of the unrated item.

 Consider a dataset consisting of the age, income, and a credit category of high
or low for a bunch of people and you want to use the age and income to predict
the credit label of “high” or “low” for a new person.
 For example, here are the first few rows of a dataset, with income represented
in thousands:
age income credit
69 3 low
66 57 low
49 79 low
49 17 low
58 26 high
44 71 high

Plot people as points on the plane, labelling a person with an empty circle if they
have a low credit rating (see the R sketch after Figure 8).

Figure 8: Credit rating as a function of age and income
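
A small R sketch of how such a plot could be produced (an assumption for illustration,
with the labeled data in a data frame credit_data with columns age, income, and credit;
credit_data is a hypothetical name):

# Empty circles (pch = 1) for low credit, filled circles (pch = 19) for high credit
plot(credit_data$age, credit_data$income,
     pch = ifelse(credit_data$credit == "low", 1, 19),
     xlab = "age", ylab = "income (in thousands)")
# The new, unlabeled person discussed next: 57 years old, $37,000 income
points(57, 37, pch = 4, col = "red", cex = 2)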


 What if a new guy comes in who is 57 years old and who makes $37,000?
What’s his likely credit rating label?


 Given the credit scores of other individuals nearby, what credit score label do
you propose should be assigned to him? Let's use K-Nearest Neighbors (K-NN)
to automate this process.

Figure 9: What about the guy?

2.9.1 Overview of the k-NN process:

1. Decide on your similarity or distance metric.


2. Split the original labeled dataset into training and test data.
3. Pick an evaluation metric.
4. Run k-NN a few times, changing k and checking the evaluation measure.
5. Optimize k by picking the one with the best evaluation measure.
6. Once you’ve chosen k, use the same training set and now create a new test set with
the people’s ages and incomes that you have no labels for, and want to predict. In this
case, your new test set only has one lonely row, for the 57-year-old.

1. Similarity or distance metrics

 Similarity or distance metrics can be employed to quantify the similarity between data
points. Definitions of “closeness” and similarity vary depending on the context. There
are many more distance metrics available to you depending on your type of data.
 In our scenario, determine a metric (e.g., Euclidean distance) to measure the similarity
between individuals based on their age and income. Euclidean distance is a good go-to
distance metric for attributes that are real-valued.


1. Cosine Similarity
Also can be used between two real-valued vectors, x and y, and will
yield a value between –1 (exact opposite) and 1 (exactly the same) with
0 in between meaning independent.

2. Jaccard Distance or Similarity


This gives the distance between a set of objects—for example, a list of
Cathy’s friends A= {Kahn,Mark,Laura, . . .} and a list of Rachel’s friends
B= {Mladen,Kahn,Mark, . . .} —and says how similar those two sets are

3. Mahalanobis Distance
Also can be used between two real-valued vectors and has the advantage
over Euclidean distance that it takes into account correlation and is scale-
invariant.

4. Hamming Distance
Can be used to find the distance between two strings or pairs of words
or DNA sequences of the same length. The distance between olive and
ocean is 4 because aside from the “o” the other 4 letters are different.
The distance between shoe and hose is 3 because aside from the “e” the
other 3 letters are different. You just go through each position and check
whether the letters the same in that position, and if not, increment your
count by 1.
5. Manhattan Distance
This is also a distance between two real-valued k-dimensional vectors x and y,
given by d(x, y) = Σi |xi − yi|, where xi and yi are the ith elements of the two
vectors.
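
A small sketch of how two of these metrics could be computed in R (an illustration, not
from the handout):

# Euclidean distance between two people described by (age, income)
p1 <- c(57, 37)
p2 <- c(58, 26)
sqrt(sum((p1 - p2)^2))   # computed by hand
dist(rbind(p1, p2))      # using the built-in dist(), which is Euclidean by default

# Hamming distance between two strings of the same length
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
hamming("olive", "ocean")   # 4
hamming("shoe", "hose")     # 3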

2.Training and test sets for k-NN


 In machine learning, the typical process involves two phases: training and testing.
 During training, a model is created and trained using labeled data to learn patterns
and relationships.
 In the testing phase, the model's performance is evaluated using new, unseen data to
assess its effectiveness in making predictions or classifications.

 In K-Nearest Neighbors (k-NN), the training phase involves reading the labeled data
with "high" or "low" credit points marked. During testing, the algorithm attempts to
predict the labels of unseen data points using the k-NN approach without prior
knowledge of the true labels, evaluating its accuracy in the process.
 To accomplish this, a portion of clean data from the entire dataset needs to be
reserved for the testing phase. Typically, about 20% of the data is randomly selected
and set aside for testing purposes.
 Sample R code to prepare the train and test sets is given below.
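
A minimal sketch of one common way to prepare the sets, assuming the labeled data are in a
data frame data1 with columns age, income, and credit:

set.seed(1)
n.points <- nrow(data1)
# Randomly set aside about 20% of the rows for testing
test_idx <- sample(seq_len(n.points), size = round(0.2 * n.points))
train <- data1[-test_idx, c("age", "income")]
test  <- data1[test_idx,  c("age", "income")]
train_labels <- data1$credit[-test_idx]
true_labels  <- data1$credit[test_idx]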

3. Pick an evaluation metric


 An evaluation metric is a measure used to assess the performance of a machine learning
model. Evaluation metrics are not always straightforward or universal, as different
scenarios may require prioritizing certain types of errors over others. For instance, false
negatives might be more critical than false positives in certain applications.


Collaborating with domain experts to design an evaluation metric tailored to the
specific requirements of the problem at hand can be essential.
 For example, if you were using a classification algorithm to predict whether someone
had cancer or not, you would want to minimize false negatives (misdiagnosing someone
as not having cancer when they actually do), so you could work with a doctor to tune
your evaluation metric.

 Accuracy: the ratio of the number of correct labels to the total number of
labels. The misclassification rate is just 1 – accuracy, so minimizing the
misclassification rate amounts to maximizing accuracy.
 Sensitivity (true positive rate or recall): sensitivity is here defined as the
probability of correctly diagnosing an ill patient as ill.
 Specificity (true negative rate): specificity is here defined as the probability
of correctly diagnosing a well patient as well. There is also the false positive
rate and the false negative rate, and these don’t get other special names.
 True Positive (TP): Instances that are actually positive and are correctly
classified as positive by the model.
 False Positive (FP): Instances that are actually negative but are incorrectly
classified as positive by the model.
 True Negative (TN): Instances that are actually negative and are correctly
classified as negative by the model.
 False Negative (FN): Instances that are actually positive but are incorrectly
classified as negative by the model.
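
These quantities can be read off a confusion matrix. A minimal R sketch (an illustration,
assuming vectors predicted and actual of class labels "high"/"low" for the test set, with
"high" treated as the positive class):

# Confusion matrix: rows = predicted labels, columns = actual labels
cm <- table(predicted, actual)

tp <- cm["high", "high"]   # true positives
fp <- cm["high", "low"]    # false positives
fn <- cm["low",  "high"]   # false negatives
tn <- cm["low",  "low"]    # true negatives

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # true positive rate (recall)
specificity <- tn / (tn + fp)   # true negative rate
misclassification_rate <- 1 - accuracy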

4. Run k-NN and checking the evaluation measure


 Once we have chosen a distance measure and an evaluation metric, apply k-Nearest
Neighbors to classify individuals in the test set based on the majority label among
their nearest neighbors.
 Calculate the misclassification rate to evaluate model performance. All this is done
automatically in R with a single line of code, as sketched below:
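
One common way to do this is the knn() function from the class package; a minimal sketch,
continuing with the train, test, and label objects prepared above:

library(class)
# Classify each test point by the majority label of its k = 3 nearest neighbors
predicted <- knn(train = train, test = test, cl = train_labels, k = 3)
# Misclassification rate on the test set
mean(predicted != true_labels)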


5. Optimize k by picking the one with the best evaluation measure.

 To choose k, run k-NN a few times, changing k, and checking the evaluation
metric each time (a sketch of such a loop is given below).
 When you have binary classes like “high credit” or “low credit,” picking k to
be an odd number can be a good idea because there will always be a majority
vote, no ties. If there is a tie, the algorithm just randomly picks.
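
A minimal sketch of this loop, continuing with the objects defined earlier:

ks <- c(1, 3, 5, 7, 9, 11)
misclass <- sapply(ks, function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  mean(pred != true_labels)
})
data.frame(k = ks, misclassification_rate = misclass)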

 So let’s go with k = 5 because it has the lowest misclassification rate, and now
k = 5 can be applied to the guy who is 57 with a $37,000 salary.

 The k-NN algorithm is an example of a nonparametric approach. It operates


without modeling assumptions about the underlying data-generating
distributions and does not involve estimating any parameters.
But there are still some assumptions:


 Data is in some feature space where a notion of “distance” makes sense.


 Training data has been labeled or classified into two or more classes.
 You pick the number of neighbors to use, k.
 The assumption is that the observed features and labels are associated.
The evaluation metric will show the algorithm's labeling performance.
Adding features and tuning 𝑘 can improve the model, but there's a risk
of overfitting.
 Both linear regression and k-NN are examples of “supervised learning,” where
you’ve observed both x and y, and you want to know the function that brings x
to y.

Questions:

1.Explain the kNN with example.

2.Explain the overview of kNN process in detail

3.Explain the evaluation metric process in kNN

4. Explain the Training and test Phases of kNN

Handouts for Session 8: k-means Algorithm


2.10 k-means Algorithm

 K-means is an unsupervised learning technique: there is no pre-labeled “correct
answer”; instead, the algorithm itself identifies clusters within the data. Consider
some user-level data and assume each row of your dataset corresponds to a user with
the following attributes: age, gender, income, state, household size.
 The goal is to segment users, a process known as segmenting, stratifying,
grouping, or clustering the data. All these terms refer to finding similar types of
users and grouping them together.
 To see why an algorithm like this might be useful, let’s bucket users using
handmade thresholds. You may have 10 age buckets, 2 gender buckets, and so
on, which would result in 10 × 2 × 50 × 10 × 3 = 30,000 possible bins, which is
big. Moreover, this data exists in a five-dimensional space, where each axis
corresponds to one attribute. Each user would then live in one of those 30,000
five-dimensional cells, which makes it impractical to build a different marketing
campaign for each bin.

 This is where k-means comes into the picture: k-means is a clustering algorithm,
where k is the number of clusters (bins). The k-means algorithm looks for clusters
in d dimensions, where d is the number of features for each data point.

2.10.1 k-means algorithm

Algorithm

 Randomly pick k centroids (points that will be the centers of your clusters) in d-space,
ensuring they are near the data but distinct from one another.
 Assign each data point to the closest centroid.
 Move the centroids to the average location of the data points assigned to them.
 Repeat the previous two steps until the assignments don’t change or change very little.

 One has to determine if there's a natural way to describe these groups once the
algorithm completes. At times, you may need to make slight adjustments to 𝑘 a few
times before obtaining natural groupings. This is an example of unsupervised
learning because the labels are not known and are instead discovered by the
algorithm.
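
The steps above can be written out directly. The following is a minimal from-scratch
sketch in R, for illustration only (in practice you would call the built-in kmeans()
function shown later); it assumes X is a numeric matrix with one row per data point,
and it does not handle clusters that become empty:

simple_kmeans <- function(X, k, max_iter = 100) {
  # Step 1: randomly pick k of the data points as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # Step 2: assign each point to the closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break  # assignments stopped changing
    assignment <- new_assignment
    # Step 3: move each centroid to the average of the points assigned to it
    for (j in 1:k) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(assignment = assignment, centroids = centroids)
}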

K-means has known issues:

1.Choosing 𝑘 is more art than science, bounded by 1≤𝑘≤𝑛, where 𝑛 is the number of data
points.

2.Convergence issues may arise, with the algorithm potentially falling into a loop and
failing to find a unique solution.

3. Interpretability can be problematic, with results sometimes being unhelpful or unclear.

K-means advantages:

1.k-means is pretty fast (compared to other clustering algorithms),

2.There are broad applications in marketing, computer vision (partitioning an image).

3.Can be a starting point for other models.

2D version:


 Consider a simpler example than the five-dimensional one previously discussed.


Suppose there is data on users, including the number of ads shown to each user
(impressions) and the number of times each user clicked on an ad (clicks).
 Clustering in two dimensions; look at the panels in the left column from top to
bottom, and then the right column from top to bottom.
 In practice, k-means is just one line of code in R:
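
A minimal sketch of that call on simulated impressions/clicks data (the simulated numbers
below are made up purely for illustration):

# Simulated user-level data: ads shown (impressions) and ad clicks per user
set.seed(10)
users <- data.frame(impressions = rpois(1000, lambda = 5),
                    clicks      = rpois(1000, lambda = 1))
# Run k-means with k = 4 clusters
fit <- kmeans(users, centers = 4)
fit$centers         # the cluster centers
table(fit$cluster)  # number of users assigned to each cluster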


 The kmeans function in R has several parameters:


 x: The dataset, which should be a numeric matrix or data frame where each row
represents an observation and each column represents a feature.
 centers: Specifies the number of clusters (k) to create or the initial cluster centers. If
an integer, it indicates the number of clusters; if a matrix, each row represents an
initial cluster center.
 iter.max: The maximum number of iterations allowed. The default is 10. This
parameter controls how long the algorithm will run before stopping.


 nstart: The number of random sets of initial cluster centers. The algorithm will run
nstart times with different initial centers and return the best solution based on the total
within-cluster sum of squares. The default is 1.
 algorithm: Specifies the algorithm to use for clustering. Options include:
 "Hartigan-Wong" (default): The standard algorithm proposed by
Hartigan and Wong.
 "Lloyd": Also known as the standard k-means algorithm.
 "Forgy": Another version of the k-means algorithm.
 "MacQueen": A variation of the k-means algorithm.
Example:
 kmeans(x, centers = 3, iter.max = 10, nstart = 1, algorithm = "Hartigan-Wong")
 This example runs the k-means clustering algorithm on the dataset x, aiming to create
3 clusters, with a maximum of 10 iterations, using 1 random start, and employing the
Hartigan-Wong algorithm.

Questions:

1.Explain K-means Algorithm in detail.

2.Explain the advantages and disadvantages of K-means
