

21CS644- DATA SCIENCE AND VISUALIZATION

CO 1. Understand the data in different forms

CO 2. Apply different techniques of Exploratory Data Analysis and the Data Science Process

CO 3. Analyze feature selection algorithms & design a recommender system.

CO 4. Evaluate data visualization tools and libraries and plot graphs.

CO 5. Develop different charts and include mathematical expressions.

Module-2

Exploratory Data Analysis and the Data Science Process

 Basic tools (plots, graphs and summary statistics) of EDA,


 Philosophy of EDA, The Data Science Process,
 Case Study: Real Direct (online real estate firm).
 Three Basic Machine Learning Algorithms:
1. Linear Regression,
2. k-Nearest Neighbours (k-NN),
3. k-means.




EXPLORATORY DATA ANALYSIS

 Exploratory Data Analysis (EDA) is the first step toward building a model.
 EDA is a critical part of the data science process, and also represents a philosophy or
way of doing statistics practiced by a strain of statisticians coming from the Bell Labs
tradition.
 In EDA, there is no hypothesis and there is no model. The “exploratory” aspect
means that your understanding of the problem you are solving, or might solve, is
changing as you go.
 The basic tools of EDA are plots, graphs, and summary statistics.
 It's a method of systematically going through the data: plotting distributions of all
variables, plotting time series of the data, transforming variables, looking at all pairwise
relationships between variables using scatterplot matrices, and generating summary
statistics for all of them.
 EDA may involve computing the mean, minimum, maximum, and upper and lower
quartiles, and identifying outliers (a short sketch follows this list).
 But as much as EDA is a set of tools, it’s also a mind-set. And that mind-set is about
your relationship with the data.
 You want to understand the data—gain intuition, understand the shape of it, and try
to connect your understanding of the process that generated the data to the data
itself.
 EDA happens between the person and the data and isn’t about proving anything to
anyone else yet.
 In the end, EDA helps to make sure the product is performing as intended. Although
there's lots of visualization involved in EDA, we distinguish between EDA and data
visualization in that EDA is done toward the beginning of analysis, and data
visualization is done toward the end to communicate one's findings.
 With EDA, the graphics are solely done for you to understand what’s going on.
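A minimal R sketch of these basic EDA summaries, using a made-up numeric variable x (the data here is purely illustrative, not from the NYT exercise below):

# Made-up numeric variable with two artificial outliers
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 10), 120, 135)

summary(x)                               # min, quartiles, median, mean, max
quantile(x, c(0.25, 0.75))               # lower and upper quartiles
hist(x, main = "Distribution of x")      # plot the distribution
boxplot(x, main = "Boxplot of x")        # points beyond the whiskers flag outliers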

Philosophy of Exploratory Data Analysis

• In the context of data in an Internet/engineering company, EDA is done for


some of the same reasons it’s done with smaller datasets, but there are additional
reasons to do it with data that has been generated from logs.
• There are important reasons anyone working with data should do EDA: to gain intuition
about the data; to make comparisons between distributions; to sanity-check the data
(finding out where data is missing or if there are outliers); and to summarize it.
• In the context of data generated from logs, EDA also helps with debugging the logging
process. For example, “patterns” you find in the data could actually be something wrong in
the logging process that needs to be fixed.



• If you never go to the trouble of debugging, you'll continue to think your patterns are
real. The engineers we've worked with are always grateful for help in this area.
• In the end, EDA helps you make sure the product is performing as intended.

Exercise: EDA

There are 31 datasets named nyt1.csv, nyt2.csv, …, nyt31.csv, which you can find here:
https://github.com/oreillymedia/doing_data_science.

Each one represents one (simulated) day’s worth of ads shown and clicks recorded on the
New York Times home page in May 2012.

Impressions: the number of ads shown on the website.

Click-through rate (CTR): the number of clicks divided by the number of impressions, i.e., the fraction of shown ads that were clicked.

Once the data is loaded, it's time for some EDA:

1. Create a new variable, age_group, that categorizes users as
"<18", "18–24", "25–34", "35–44", "45–54", "55–64", and "65+".

2. For a single day:

Plot the distributions of the number of impressions and the click-through rate (# clicks / # impressions)
for these age categories.

Define a new variable to segment or categorize users based on their click behavior.

Explore the data and make visual and quantitative comparisons across user
segments/demographics (e.g., under-18-year-old females, or logged-in versus not).

3. Now extend your analysis across days. Visualize some metrics and distributions over time.

4. Describe and interpret any patterns you find.

Sample Code:


# Author: Maura Fitzgerald
data1 <- read.csv(url("http://stat.columbia.edu/~rachel/datasets/nyt1.csv"))
# or
data1 <- read.csv("C:\\Users\\valarmathi\\downloads\\nyt1.csv")

# categorize
head(data1)
data1$agecat <-cut(data1$Age,c(-Inf,0,18,24,34,44,54,64,Inf))

# view
summary(data1)

# brackets
install.packages("doBy")
library("doBy")
siterange <- function(x){c(length(x), min(x), mean(x), max(x))}
summaryBy(Age~agecat, data =data1, FUN=siterange)

# so only signed in users have ages and genders


summaryBy(Gender+Signed_In+Impressions+Clicks~agecat,
data =data1)



# plot
install.packages("ggplot2")
library(ggplot2)
ggplot(data1, aes(x=Impressions, fill=agecat))+geom_histogram(binwidth=1)
ggplot(data1, aes(x=agecat, y=Impressions, fill=agecat))+geom_boxplot()

# create click thru rate

# we don't care about clicks if there are no impressions

# if there are clicks with no imps my assumptions about

# this data are wrong

data1$hasimps <-cut(data1$Impressions,c(-Inf,0,Inf))
summaryBy(Clicks~hasimps, data =data1, FUN=siterange)
ggplot(subset(data1, Impressions>0), aes(x=Clicks/Impressions,
colour=agecat)) + geom_density()
ggplot(subset(data1, Clicks>0), aes(x=Clicks/Impressions,
colour=agecat)) + geom_density()
ggplot(subset(data1, Clicks>0), aes(x=agecat, y=Clicks,

fill=agecat)) + geom_boxplot()

ggplot(subset(data1, Clicks>0), aes(x=Clicks, colour=agecat)) +
  geom_density()

# create categories

data1$scode[data1$Impressions==0] <- "NoImps"

data1$scode[data1$Impressions >0] <- "Imps"

data1$scode[data1$Clicks >0] <- "Clicks"

# Convert the column to a factor

data1$scode <- factor(data1$scode)

head(data1)

#look at levels

clen <- function(x){c(length(x))}

etable<-summaryBy(Impressions~scode+Gender+agecat,

data = data1, FUN=clen)



Data Science Process:

1. Data Collection: This involves gathering data from various sources such as databases,
websites, sensors, or other means.

First we have the Real World. Inside the Real World are lots of people busy at various
activities. Some people are using Google+, others are competing in the Olympics; there are
spammers sending spam, and there are people getting their blood drawn. Say we have data
on one of these things.

2. Data Cleaning and Pre-processing: Raw data often contains errors, missing values, or
inconsistencies. Data scientists clean and pre-process the data to ensure accuracy and
consistency.

We want to process this to make it clean for analysis. So we build and use pipelines of data
munging: joining, scraping, wrangling, or whatever you want to call it. To do this we use
tools such as Python, shell scripts, R, or SQL, or all of the above.

3. Exploratory Data Analysis: Data scientists explore the data using statistical techniques and
visualization tools to understand its characteristics, identify patterns, and detect anomalies.

Once we have this clean dataset, we should be doing some kind of EDA. In the course of
doing EDA, we may realize that it isn’t actually clean because of duplicates, missing values,
absurd outliers, and data that wasn’t actually logged or incorrectly logged. If that’s the case,
we may have to go back to collect more data, or spend more time cleaning the dataset.

4. Feature Engineering: This involves selecting, transforming, and creating new features
from the raw data to improve the performance of machine learning algorithms.

5. Machine Learning: Data scientists apply machine learning algorithms to build predictive
models or uncover hidden patterns in the data. This includes supervised learning,
unsupervised learning, and reinforcement learning techniques.

Next, we design our model to use some algorithm like k-nearest neighbors (k-NN), linear
regression, Naive Bayes, or something else. The model we choose depends on the type of
problem we're trying to solve, of course, which could be a classification problem, a
prediction problem, or a basic description problem.

6. Model Evaluation and Validation: Data scientists assess the performance of the machine
learning models using metrics and techniques to ensure they generalize well to unseen data.

We then can interpret, visualize, report, or communicate our results. This could take the
form of reporting the results up to our boss or coworkers, or publishing a paper in a journal
and going out and giving academic talks about it.

Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier,
or a search ranking algorithm, or a recommendation system. Now the key here that makes
data science special and distinct from statistics is that this data product then gets
incorporated back into the real world, and users interact with that product, and that
generates more data, which creates a feedback loop.

7. Deployment and Monitoring: Once a model is trained and evaluated, it is deployed into
production systems. Data scientists monitor the performance of deployed models and
update them as needed.

8. Domain Expertise: Understanding the specific domain or industry is crucial for interpreting
the results of data analysis and making informed decisions.

A Data Scientist’s Role in This Process

Someone has to make the decisions about what data to collect, and why; that someone is the
“data scientist.” That person needs to be formulating questions and hypotheses and making a
plan for how the problem will be attacked.

In the context of data science, a hypothesis refers to a tentative assumption or proposed
explanation for a phenomenon or relationship between variables that can be tested using
data and statistical analysis. It is a statement or claim about a population parameter or the
relationship between two or more variables.

Null Hypothesis Vs Alternate Hypothesis

Example:

H0: There is no significant difference in test scores between students who attended an
online course and those who attended a traditional classroom course.

Ha: There is a significant difference in test scores between students who attended an online
course and those who attended a traditional classroom course.
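A hypothesis like this can be tested with a two-sample t-test. Here is a minimal R sketch using made-up score vectors (online_scores and classroom_scores are hypothetical data, purely for illustration):

# Hypothetical test scores for the two groups
set.seed(42)
online_scores    <- rnorm(30, mean = 72, sd = 8)
classroom_scores <- rnorm(30, mean = 75, sd = 8)

# Two-sample t-test of H0: no difference in mean scores
result <- t.test(online_scores, classroom_scores)
print(result)
# A small p-value (e.g., < 0.05) would lead us to reject H0 in favour of Ha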

Connection to the Scientific Method

We can think of the data science process as an extension of or variation of the scientific
method:

• Ask a question.

• Do background research.

• Construct a hypothesis.

• Test your hypothesis by doing an experiment.

• Analyze your data and draw a conclusion.

• Communicate your results.

In both the data science process and the scientific method, not every problem requires one
to go through all the steps, but almost all problems can be solved with some combination of
the stages. For example,

if your end goal is a data visualization (which itself could be thought of as a data product),
it's possible you might not do any machine learning or statistical modelling, but you'd want
to get all the way to a clean dataset, do some exploratory analysis, and then create the
visualization.

Case Study: Real Direct

• Doug Perlson, the CEO of RealDirect, has a background in real estate law, startups,
and online advertising.

• His goal with RealDirect is to use all the data he can access about real estate to
improve the way people sell and buy houses.

• Normally, people sell their homes about once every seven years, and they do so with
the help of professional brokers and current data. But there's a problem both with the
broker system and the data quality. RealDirect addresses both of them.

• First, the brokers. They are typically “free agents” operating on their own—think of
them as home sales consultants. This means that they guard their data aggressively, and the
really good ones have lots of experience. But in the grand scheme of things, that really
means they have only slightly more data than the inexperienced brokers.

• RealDirect is addressing this problem by hiring a team of licensed real-estate agents
who work together and pool their knowledge. To accomplish this, it built an interface for
sellers, giving them useful data-driven tips on how to sell their house. It also uses interaction
data to give real-time recommendations on what to do next.

• The team of brokers also becomes data experts, learning to use information-
collecting tools to keep tabs on new and relevant data or to access publicly available
information.

• One problem with publicly available data is that it’s old news—there’s a three-month
lag between a sale and when the data about that sale is available.

• RealDirect is working on real-time feeds on things like when people start searching
for a home, what the initial offer is, the time between offer and close, and how people
search for a home online.

• Ultimately, good information helps both the buyer and the seller. At least if they’re
honest.

How Does Real Direct Make Money?

• First, it offers a subscription to sellers—about $395 a month—to access the selling


tools.

• Second, it allows sellers to use Real Direct’s agents at a reduced commission,


typically 2% of the sale instead of the usual 2.5% or 3%.

• This is where the magic of data pooling comes in: it allows Real Direct to take a
smaller commission because it’s more optimized, and therefore gets more volume.

• The site itself is best thought of as a platform for buyers and sellers to manage their
sale or purchase process.

• There are statuses for each person on site: active, offer made, offer rejected,
showing, in contract, etc. Based on your status, different actions are suggested by the
software.

Challenges:

• First off, there's a law in New York that says you can't show all the current housing
listings unless those listings reside behind a registration wall, so RealDirect requires
registration. On the one hand, this is an obstacle for buyers, but serious buyers are likely
willing to do it.

Sample R code

Here’s some sample R code that takes the Brooklyn housing data in the preceding exercise,

# Author: Benjamin Reddy

require(gdata)   # for read.xls()
require(plyr)    # for count()

bk <- read.xls("rollingsales_brooklyn.xls",pattern="BOROUGH")

head(bk)

summary(bk)

bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]","",

bk$SALE.PRICE))

count(is.na(bk$SALE.PRICE.N))

names(bk) <- tolower(names(bk))


## clean/format the data with regular expressions

bk$gross.sqft <- as.numeric(gsub("[^[:digit:]]","",

bk$gross.square.feet))

bk$land.sqft <- as.numeric(gsub("[^[:digit:]]","",

bk$land.square.feet))

bk$sale.date <- as.Date(bk$sale.date)

bk$year.built <- as.numeric(as.character(bk$year.built))

## do a bit of exploration to make sure there's not anything

## weird going on with sale prices

attach(bk)

hist(sale.price.n)

hist(sale.price.n[sale.price.n>0])

hist(gross.sqft[sale.price.n==0])

detach(bk)

## keep only the actual sales

bk.sale <- bk[bk$sale.price.n!=0,]

plot(bk.sale$gross.sqft,bk.sale$sale.price.n)

plot(log(bk.sale$gross.sqft),log(bk.sale$sale.price.n))

## for now, let's look at 1-, 2-, and 3-family homes

bk.homes <- bk.sale[which(grepl("FAMILY",

bk.sale$building.class.category)),]

plot(log(bk.homes$gross.sqft),log(bk.homes$sale.price.n))

bk.homes[which(bk.homes$sale.price.n<100000),][
  order(bk.homes[which(bk.homes$sale.price.n<100000),]$sale.price.n),]

## remove outliers that seem like they weren't actual sales

bk.homes$outliers <- (log(bk.homes$sale.price.n) <=5) + 0

bk.homes <- bk.homes[which(bk.homes$outliers==0),]

plot(log(bk.homes$gross.sqft),log(bk.homes$sale.price.n))


Machine Learning Algorithms

Machine learning algorithms are largely used to predict, classify, or cluster. Statistical
modelling came out of statistics departments, and machine learning algorithms came out of
computer science departments.

In general, machine learning algorithms that are the basis of artificial intelligence (AI), such
as image recognition, speech recognition, recommendation systems, and ranking and
personalization of content (often the basis of data products), are not usually part of a core
statistics curriculum or department. They aren't generally designed to infer the underlying
generative process (e.g., to model something), but rather to predict or classify with the most
accuracy.

Three Basic Algorithms

Many business or real-world problems that can be solved with data can be thought of as
classification and prediction problems when we express them mathematically.

“In the real world, how do I know that this algorithm is the solution to the problem I’m
trying to solve?”

• Linear Regression

• k-Nearest Neighbors (k-NN)

• k-means

1.Linear regression
Linear regression is a fundamental algorithm in machine learning and statistics used for
predictive modelling and data analysis. It aims to find the best-fitting straight line (or linear
equation) that describes the relationship between one or more independent variables
(features) and a dependent variable (target).

The linear regression algorithm works by minimizing the sum of squared differences
between the predicted values (obtained from the linear equation) and the actual values in
the given dataset. This process is known as Ordinary Least Squares (OLS) method.

When you use it, you are making the assumption that there is a linear relationship between
an outcome variable (sometimes also called the response variable, dependent variable, or
label) and a predictor (sometimes also called an independent variable, explanatory variable,
or feature); or between one variable and several other variables, in which case you’re
modeling the relationship as having a linear structure.

Model Representation: Assume a linear relationship between the independent variable(s)
and the dependent variable, represented by the equation:

y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε,  or, in the simple one-predictor case, y = f(x) = β₀ + β₁·x,

where β₀ is the intercept, β₁, β₂, ..., βₙ are the coefficients (weights) corresponding to the
independent variables X₁, X₂, ..., Xₙ, and ε is the error term (residual).
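As a sketch of what the OLS fit does, the code below estimates a simple one-predictor model two ways: via the closed-form least-squares solution β̂ = (XᵀX)⁻¹Xᵀy and via R's lm(). The x and y vectors are made up purely for illustration:

# Made-up data with a roughly linear relationship
set.seed(7)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)

# Closed-form OLS: beta_hat = (X'X)^(-1) X'y, with a column of 1s for the intercept
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta_hat)      # should be close to (3, 2)

# The same fit using lm()
coef(lm(y ~ x))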

Example1. Overly simplistic example to start

Suppose you run a social networking site that charges a monthly subscription fee of $25,
and that this is your only source of revenue. Each month you collect data and count your
number of users and total revenue. You've done this daily over the course of two years,
recording it all in a spreadsheet.

You could express this data as a series of points. Here are the first four:

When you showed this to someone else who didn’t even know how much you charged or
anything about your business model, they might notice that there’s a clear relationship

enjoyed by all of these points, namely y = 25x.

•There’s a linear pattern.

• The coefficient relating x and y is 25.

• It seems deterministic.

Example 2. Looking at data at the user level.

• You have a dataset keyed by user (meaning each row contains data for a single user),
and the columns represent user behavior on a social networking site over a period of a
week.

• The names of the columns are total_num_friends, total_new_friends_this_week,


num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, age, and
so on.

• During the course of your exploratory data analysis, you’ve randomly sampled 100
users to keep it simple, and you plot pairs of these variables,

for example, x = total_new_friends and y = time_spent (in seconds).

Question:

The business context might be that eventually you want to be able to promise advertisers
who bid for space on your website in advance a certain number of users, so you want to be
able to forecast number of users several days or weeks in advance.

New Friends    Time Spent (seconds)
7              276
3              43
4              82
6              136
10             471
9              269

It looks like there’s kind of a linear relationship here, and it makes sense; the more new
friends you have, the more time you might spend on the site.

There are two things you want to capture in the model. The first is the trend and the second

is the variation.

Example 3: Start by writing something down

• We'll start first with the trend.
• First, assume there actually is a relationship and that it's linear. Since a linear
relationship is assumed, start the model by assuming the functional form to be
y = β₀ + β₁·x.

• Writing this with matrix notation results in y = X·β + ε, where X holds the observed
predictor values (with a column of 1s for the intercept) and β is the vector of coefficients.

Fitting the model:

• The intuition behind linear regression is to find the line that minimizes the
distance between all the points and the line.
• To find this line, you'll define the "residual sum of squares" (RSS), denoted RSS(β), to
be:

RSS(β) = Σᵢ (yᵢ − β·xᵢ)²

where i ranges over the observations and each xᵢ includes the constant 1 for the intercept.

Linear regression seeks to find the line that minimizes the sum of the squares of the
vertical distances between the predicted ŷᵢ's and the observed yᵢ's.


The R code for this would be:


> model <- lm (y~x)
> model
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-32.08 45.92
> coefs <- coef(model)
> plot(x, y, pch=20,col="red", xlab="Number new friends",
ylab="Time spent (seconds)")
> abline(coefs[1],coefs[2])

R-squared

summary (model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-121.17 -52.63 -9.72 41.54 356.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.083 16.623 -1.93 0.0565 .
x 45.918 2.141 21.45 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 77.47 on 98 degrees of freedom
Multiple R-squared: 0.8244, Adjusted R-squared: 0.8226
F-statistic: 460 on 1 and 98 DF, p-value: < 2.2e-16



R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

This can be interpreted as the proportion of variance explained by our model.
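Assuming x and y are the new-friends and time-spent vectors used to fit model <- lm(y ~ x) above, R² can also be computed by hand and checked against the value reported by summary(model); a minimal sketch:

# R-squared by hand: 1 - RSS/TSS
y_hat <- fitted(model)
rss   <- sum((y - y_hat)^2)        # residual sum of squares
tss   <- sum((y - mean(y))^2)      # total sum of squares
r_squared <- 1 - rss / tss
r_squared                          # should match summary(model)$r.squared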

Evaluation metrics:

• R-squared: mean squared error is in there getting divided by total error

• p-values: For any given β, the p-value captures the probability of observing the data
that we observed, and obtaining the test-statistic that we obtained under the null
hypothesis


• Cross-validation (see the sketch below):
o Divide the data into a training set and a test set: 80% in the training set and
20% in the test set.
o Fit the model on the training set, then look at the mean squared error on the
test set and compare it to that on the training set.
o If the mean squared errors are approximately the same, the model
generalizes well and is not in danger of overfitting.
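A minimal sketch of that 80/20 split, on made-up data (the variable names are illustrative only):

# Hypothetical data with a linear relationship
set.seed(99)
n <- 200
x <- runif(n, 0, 20)
y <- 10 + 4 * x + rnorm(n, sd = 8)
df <- data.frame(x = x, y = y)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit <- lm(y ~ x, data = train)

mse <- function(actual, predicted) mean((actual - predicted)^2)
train_mse <- mse(train$y, predict(fit, train))
test_mse  <- mse(test$y,  predict(fit, test))
c(train = train_mse, test = test_mse)   # similar values suggest the model generalizes well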

 Other models for error terms


o The mean squared error is an example of what is called a loss function. This is
the standard one to use in linear regression because it gives us a pretty nice
measure of closeness of fit.
o There are other loss functions such as one that relies on absolute value rather
than squaring.

 Adding other predictors

 This was simple linear regression: one outcome or dependent variable and one
predictor. But we can extend this model by building in other predictors,
which is called multiple linear regression.

 Transformations
o A better model might be a polynomial relationship, for example
y = β₀ + β₁·x + β₂·x².
o To think of it as linear, you transform or create new variables (for example, z = x²)
and build a regression model based on z. Other common transformations are to
take the log, or to pick a threshold and turn the variable into a binary predictor instead.
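A short sketch of the z = x² idea, on made-up data where the true relationship is quadratic (the coefficients 1, 2, 3 are arbitrary illustrative values):

# Made-up data with a quadratic relationship
set.seed(5)
x <- seq(-3, 3, length.out = 100)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100, sd = 1)

# Create the transformed variable z = x^2 and fit a model that is linear in x and z
z <- x^2
fit_poly <- lm(y ~ x + z)      # equivalently: lm(y ~ x + I(x^2))
coef(fit_poly)                 # estimates should be close to (1, 2, 3)

# Example of a threshold (binary) transformation of a predictor
x_big <- as.numeric(x > 0)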

LINEAR V/S LOGISTIC REGRESSION

Aspect                   Linear Regression                  Logistic Regression
Outcome Variable Type    Continuous (numeric)               Binary (categorical)
Model Form               Linear relationship                Logistic (S-shaped) relationship
Output Range             −∞ to +∞                           0 to 1 (probability)
Usage                    Prediction                         Classification and probability estimation
Assumptions              Normally distributed residuals     Linear relationship with log odds
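To illustrate the contrast in the table, the sketch below fits both kinds of model on made-up data (the binary outcome and all coefficients are hypothetical):

set.seed(11)
x <- rnorm(200)

# Linear regression: continuous outcome
y_cont <- 2 + 3 * x + rnorm(200)
lin_fit <- lm(y_cont ~ x)

# Logistic regression: binary outcome, modelled on the log-odds scale
p <- 1 / (1 + exp(-(0.5 + 2 * x)))        # S-shaped (logistic) relationship
y_bin <- rbinom(200, size = 1, prob = p)
log_fit <- glm(y_bin ~ x, family = binomial)

predict(lin_fit, newdata = data.frame(x = 1))                     # any real number
predict(log_fit, newdata = data.frame(x = 1), type = "response")  # a probability in (0, 1)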

2. K-MEANS CLUSTERING

k-means is an unsupervised clustering algorithm: choose k initial centroids, assign every point
to its nearest centroid, recompute each centroid as the mean of the points assigned to it, and
repeat until the assignments stop changing. The example below walks through this on 15 points.

Point Coordinates

A1 (2,10)

A2 (2,6)

A3 (11,11)

A4 (6,9)

A5 (6,4)

A6 (1,2)

A7 (5,10)

A8 (4,9)

A9 (10,12)

A10 (7,5)

A11 (9,11)

A12 (4,6)

A13 (3,10)

A14 (3,8)

A15 (6,11)

EXAMPLE OF K-MEANS CLUSTERING

Consider the value of K=3.

Randomly choose three initial centroids:

Centroid 1 = (2,6), associated with cluster 1.
Centroid 2 = (5,10), associated with cluster 2.
Centroid 3 = (6,11), associated with cluster 3.

K-MEANS CLUSTERING (1ST ITERATION)

Point        Distance from Centroid 1 (2,6)   Distance from Centroid 2 (5,10)   Distance from Centroid 3 (6,11)   Assigned Cluster

A1 (2,10) 4 3 4.123106 Cluster 2

A2 (2,6) 0 5 6.403124 Cluster 1

A3 (11,11) 10.29563 6.082763 5 Cluster 3

A4 (6,9) 5 1.414214 2 Cluster 2

A5 (6,4) 4.472136 6.082763 7 Cluster 1

A6 (1,2) 4.123106 8.944272 10.29563 Cluster 1

A7 (5,10) 5 0 1.414214 Cluster 2

A8 (4,9) 3.605551 1.414214 2.828427 Cluster 2

A9 (10,12) 10 5.385165 4.123106 Cluster 3

A10 (7,5) 5.09902 5.385165 6.082763 Cluster 1

A11 (9,11) 8.602325 4.123106 3 Cluster 3

A12 (4,6) 2 4.123106 5.385165 Cluster 1

A13 (3,10) 4.123106 2 3.162278 Cluster 2

A14 (3,8) 2.236068 2.828427 4.242641 Cluster 1

A15 (6,11) 6.403124 1.414214 0 Cluster 3
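A short R sketch that reproduces the first-iteration distances and assignments above, and then computes the updated centroids used in the second iteration (the coordinates are taken directly from the tables):

# Points A1..A15 and the three initial centroids from the example
pts <- matrix(c(2,10, 2,6, 11,11, 6,9, 6,4, 1,2, 5,10, 4,9,
                10,12, 7,5, 9,11, 4,6, 3,10, 3,8, 6,11),
              ncol = 2, byrow = TRUE)
rownames(pts) <- paste0("A", 1:15)
centroids <- rbind(c(2,6), c(5,10), c(6,11))

# Euclidean distance from every point to every centroid
dist_to_centroids <- apply(centroids, 1,
                           function(ctr) sqrt(rowSums(sweep(pts, 2, ctr)^2)))
assigned <- apply(dist_to_centroids, 1, which.min)   # nearest centroid per point
cbind(round(dist_to_centroids, 3), cluster = assigned)

# New centroids for the next iteration = mean of the points in each cluster
new_centroids <- apply(pts, 2, function(col) tapply(col, assigned, mean))
new_centroids   # should be (3.833, 5.167), (4, 9.6), (9, 11.25)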



K-MEANS CLUSTERING (2ND ITERATION)

Point        Distance from Centroid 1 (3.833, 5.167)   Distance from Centroid 2 (4, 9.6)   Distance from Centroid 3 (9, 11.25)   Assigned Cluster

A1 (2,10) 5.169 2.040 7.111 Cluster 2

A2 (2,6) 2.013 4.118 8.750 Cluster 1

A3 (11,11) 9.241 7.139 2.016 Cluster 3

A4 (6,9) 4.403 2.088 3.750 Cluster 2

A5 (6,4) 2.461 5.946 7.846 Cluster 1

A6 (1,2) 4.249 8.171 12.230 Cluster 1

A7 (5,10) 4.972 1.077 4.191 Cluster 2

A8 (4,9) 3.837 0.600 5.483 Cluster 2

A9 (10,12) 9.204 6.462 1.250 Cluster 3

A10 (7,5) 3.171 5.492 6.562 Cluster 1

A11 (9,11) 7.792 5.192 0.250 Cluster 3

A12 (4,6) 0.850 3.600 7.250 Cluster 1

A13 (3,10) 4.904 1.077 6.129 Cluster 2

A14 (3,8) 2.953 1.887 6.824 Cluster 2

A15 (6,11) 6.223 2.441 3.010 Cluster 2


K-MEANS CLUSTERING (3RD ITERATION)

Point        Distance from Centroid 1 (4, 4.6)   Distance from Centroid 2 (4.143, 9.571)   Distance from Centroid 3 (10, 11.333)   Assigned Cluster

A1 (2,10) 5.758 2.186 8.110 Cluster 2

A2 (2,6) 2.441 4.165 9.615 Cluster 1

A3 (11,11) 9.485 7.004 1.054 Cluster 3

A4 (6,9) 4.833 1.943 4.631 Cluster 2

A5 (6,4) 2.088 5.872 8.353 Cluster 1

A6 (1,2) 3.970 8.197 12.966 Cluster 1

A7 (5,10) 5.492 0.958 5.175 Cluster 2

A8 (4,9) 4.400 0.589 6.438 Cluster 2

A9 (10,12) 9.527 6.341 0.667 Cluster 3

A10 (7,5) 3.027 5.390 7.008 Cluster 1

A11 (9,11) 8.122 5.063 1.054 Cluster 3

A12 (4,6) 1.400 3.574 8.028 Cluster 1

A13 (3,10) 5.492 1.221 7.126 Cluster 2

A14 (3,8) 3.544 1.943 7.753 Cluster 2

A15 (6,11) 6.705 2.343 4.014 Cluster 2

K-MEANS CLUSTERING GRAPH

(A scatter plot of the 15 points, coloured by final cluster assignment, appeared here.)

K-MEANS CLUSTERING APPLICATIONS

Document Classification: Using k-means clustering, we can divide documents into various
clusters based on their content, topics, and tags.

Customer segmentation: Supermarkets and e-commerce websites divide their customers


into various clusters based on their transaction data and demography. This helps the
business to target appropriate customers with relevant products to increase sales.

Cyber profiling: In cyber profiling, we collect data from individuals as well as groups to
identify their relationships. With k-means clustering, we can easily make clusters of people
based on their connection to each other to identify any available patterns.

Image segmentation: We can use k-means clustering to perform image segmentation by


grouping similar pixels into clusters.

Fraud detection in banking and insurance: By using historical data on frauds, banks and
insurance agencies can predict potential frauds by the application of k-means clustering.

K-MEANS CLUSTERING CASE STUDY

# Generate sample data points

set.seed(123) # Set seed for reproducibility

data <- matrix(rnorm(100), ncol = 2) # Generate 50 data points with 2 features


# Perform k-means clustering with k = 3
k <- 3

kmeans_result <- kmeans(data, centers = k)

# Print the cluster centers

print(kmeans_result$centers)

# Plot the data points with cluster assignments

plot(data, col = kmeans_result$cluster, main = "K-means Clustering", xlab = "Feature 1",


ylab = "Feature 2")

# Mark the cluster centers on the plot

points(kmeans_result$centers, col = 1:k, pch = 8, cex = 2)


3. K-NEAREST NEIGHBORS (K-NN)



 The K-Nearest Neighbours (KNN) algorithm is a supervised machine learning method
employed to tackle classification problems.

 It memorizes the training instances, which are subsequently used as "knowledge" for
the prediction phase.

Training Phase: During the training phase, it memorizes the training instances along with
their corresponding class labels (in the case of classification) or target values.

Prediction Phase:

Input: When presented with a new instance, the algorithm first calculates the distance between this
instance and all other instances in the training data.

Distance Metric: Commonly used distance metrics include Euclidean distance, Manhattan
distance, Minkowski distance, etc.

Selecting Neighbors: The algorithm then selects the K nearest neighbors (data points) to the
new instance based on the computed distances.

Classification: For a classification task, the class of the new instance is determined by a
majority vote among its K nearest neighbors. The most frequent class label among the K
neighbors is assigned to the new instance.
Choosing the Right Value of K:

The value of K is a hyperparameter that needs to be chosen carefully. A smaller value of K


can be sensitive to noise and outliers, leading to overfitting. On the other hand, a larger
value of K can result in a smoother decision boundary but may lead to underfitting.

Training dataset (the value of K is 5):

BRIGHTNESS SATURATION CLASS

40 20 Red

50 50 Blue

60 90 Blue

10 25 Red

70 70 Blue

60 10 Red

25 80 Blue

New data entry:

BRIGHTNESS SATURATION CLASS

20 35 ?

28
Page

Prepared by: C. VALARMATHI AP/CSE SSCE, ANEKAL


21CS644-DATA SCIENCE AND VISULAIZATION -2021 SCHEME

Compute the Euclidean distance from the new entry (20, 35) to every training point:

BRIGHTNESS SATURATION CLASS DISTANCE

40 20 Red 25

50 50 Blue 33.54

60 90 Blue 68.01

10 25 Red 14.14

70 70 Blue 61.03

60 10 Red 47.17

25 80 Blue 45

Rearrange the distances in ascending order:

BRIGHTNESS SATURATION CLASS DISTANCE

10 25 Red 14.14

40 20 Red 25

50 50 Blue 33.54

25 80 Blue 45

60 10 Red 47.17

70 70 Blue 61.03

60 90 Blue 68.01
Since we chose 5 as the value of K, we'll only consider the first five rows.

BRIGHTNESS SATURATION CLASS DISTANCE

10 25 Red 14.14

40 20 Red 25

50 50 Blue 33.54

25 80 Blue 45

60 10 Red 47.17

As we can see, the majority class within the 5 nearest neighbors to the new entry is Red.
Therefore, we'll classify the new entry as Red.
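A minimal R sketch that reproduces this classification with the class package (the training points and the new entry (20, 35) are taken from the tables above):

library(class)

# Training data from the table: brightness, saturation, class
train <- data.frame(brightness = c(40, 50, 60, 10, 70, 60, 25),
                    saturation = c(20, 50, 90, 25, 70, 10, 80),
                    class      = c("Red", "Blue", "Blue", "Red", "Blue", "Red", "Blue"))

# New entry to classify
new_entry <- data.frame(brightness = 20, saturation = 35)

# Euclidean distances from the new entry to every training point
sqrt((train$brightness - 20)^2 + (train$saturation - 35)^2)

# k-NN with k = 5: majority vote among the 5 nearest neighbours
knn(train[, 1:2], new_entry, cl = factor(train$class), k = 5)   # expected: Red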

Advantages of K-NN Algorithm

 It is simple to implement.

 No training is required before classification.

Disadvantages of K-NN Algorithm

 Can be cost-intensive when working with a large data set.

 A lot of memory is required for processing large data sets.

 Choosing the right value of K can be tricky.

Sample R code: k-NN classification on the Iris dataset.

# Load the Iris dataset

data(iris)

# View the structure and summary of the dataset

str(iris)

summary(iris)

# Set seed for reproducibility

set.seed(123)

# Split the dataset into training (80%) and testing (20%) sets

library(caTools) # Load caTools package for sample.split function

split <- sample.split(iris$Species, SplitRatio = 0.8)

train_data <- subset(iris, split == TRUE)

test_data <- subset(iris, split == FALSE)

# Normalize the feature variables (excluding the target variable 'Species')

normalize <- function(x) {

return((x - min(x)) / (max(x) - min(x)))

}

train_data[, 1:4] <- lapply(train_data[, 1:4], normalize)

test_data[, 1:4] <- lapply(test_data[, 1:4], normalize)

# Load the class package for KNN algorithm

library(class)

# Define the number of neighbors (k) for KNN

k <- 3

# Predict the species of iris flowers using KNN


predicted_species <- knn(train_data[, 1:4], test_data[, 1:4], train_data$Species, k = k)

# Compare predicted species with actual species

accuracy <- mean(predicted_species == test_data$Species)

print(paste("Accuracy of KNN (k =", k, "):", format(accuracy, digits = 4)))

EXAMPLE: USER RETENTION

 User retention refers to the ability of a product, service, or platform to retain its
users over a period of time.

 It is a critical metric for businesses as it directly impacts growth, profitability, and


overall success.

 It indicates customer satisfaction, loyalty, and the ability of a product to meet


ongoing needs.

 Metrics - Retention Rate & Churn Rate

Factors Influencing User Retention:

 User Experience

 Value Proposition
 Customer Support

 Product Updates

 Onboarding Process

 Engagement

 Personalization

Retention Strategies:

 Segmentation

 Onboarding Optimization

 Continuous Engagement

 Feedback and Improvement

 Reactivation Campaigns

Brainstorming, the role of domain expertise, and a place for imagination:

 Understanding Data

 Derived Variables

 Aggregated Statistics

 Identifying Relevant Features

 Experimentation

 Textual Features (Convert text data into numerical features)

FEATURE SELECTION ALGORITHMS

 Wrappers approach

 Filters approach

 Decision Trees approach

 Random Forests approach

 Subset Generation:

 Generate different subsets of features from the original feature set.


 Model Training and Evaluation:

 Train a machine learning model using each subset of features.

 Evaluate the model's performance (e.g., accuracy, error rate) using a chosen
performance metric on a validation set (or through cross-validation).

 Feature Selection Criteria:

 Use the model's performance as a criterion for selecting the best subset of features.



WRAPPER FEATURE SELECTION ALGORITHMS

Types of Wrapper Methods:

Forward Selection:

Starts with an empty set of features and gradually adds one feature at a time, selecting the
one that maximizes the model's performance.

Backward Elimination:

Begins with the full set of features and removes one feature at a time, evaluating the impact
on the model's performance.

Recursive Feature Elimination (RFE):

Iteratively removes less important features based on a model's coefficients or feature


importance scores until the optimal subset is achieved.
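A minimal sketch of the add-one / drop-one search described above, using base R's step() on made-up data (the data frame and variables x1..x4 are hypothetical; note that step() scores candidate models by AIC rather than by held-out error):

# Made-up data: y depends on x1 and x2; x3 and x4 are pure noise
set.seed(3)
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(n)

# Forward selection: start from the intercept-only model, add one feature at a time
null_model  <- lm(y ~ 1, data = df)
forward_fit <- step(null_model, scope = ~ x1 + x2 + x3 + x4, direction = "forward")

# Backward elimination: start from the full model and drop features one at a time
full_model   <- lm(y ~ x1 + x2 + x3 + x4, data = df)
backward_fit <- step(full_model, direction = "backward")

formula(forward_fit)   # typically keeps x1 and x2 only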

Pros and Cons of Wrapper Methods:

Pros:

 Wrapper methods consider interactions between features that are specific


to the chosen machine learning model.

 Can lead to more optimal feature subsets for the predictive task.

 Able to incorporate the impact of feature selection on the model's


performance directly.

Cons:

 Computationally expensive, especially for high-dimensional datasets and


complex models.

 Prone to overfitting if not used with caution, especially with small datasets.

 May be sensitive to the choice of the evaluation metric and the performance
of the underlying machine learning algorithm.