DSV-M2-lecture Notes - Valar New
CO 2. Apply different techniques to Exploratory Data Analysis and the Data Science Process
Module-2
Exploratory Data Analysis (EDA) is the first step toward building a model.
EDA is a critical part of the data science process, and also represents a philosophy or
way of doing statistics practiced by a strain of statisticians coming from the Bell Labs
tradition.
In EDA, there is no hypothesis and there is no model. The “exploratory” aspect
means that your understanding of the problem you are solving, or might solve, is
changing as you go.
The basic tools of EDA are plots, graphs, and summary statistics.
It’s a method of systematically going through the data: plotting distributions of all
variables, plotting time series of the data, transforming variables, looking at all pairwise
relationships between variables using scatterplot matrices, and generating summary
statistics for all of them.
EDA may involve computing the mean, minimum, maximum, and upper and lower
quartiles, and identifying outliers.
But as much as EDA is a set of tools, it’s also a mind-set. And that mind-set is about
your relationship with the data.
You want to understand the data—gain intuition, understand the shape of it, and try
to connect your understanding of the process that generated the data to the data
itself.
EDA happens between the person and the data and isn’t about proving anything to
anyone else yet.
In the end, EDA helps to make sure the product is performing as intended. Although
there’s lots of visualization involved in EDA, we distinguish between EDA and data
visualization in that EDA is done toward the beginning of analysis, and data
visualization is done toward the end to communicate one’s findings.
With EDA, the graphics are solely done for you to understand what’s going on.
For example, “patterns” you find in the data could actually be artifacts of the logging process.
Exercise: EDA
There are 31 datasets named nyt1.csv, nyt2.csv,…,nyt31.csv, which you can find here:
https://github.com/oreillymedia/doing_data_science.
Each one represents one (simulated) day’s worth of ads shown and clicks recorded on the
New York Times home page in May 2012.
1. Plot the distributions of number of impressions and click-through rate (# clicks / # impressions)
for the age categories <18, 18-24, 25-34, 35-44, 45-54, 55-64, and 65+.
2. Define a new variable to segment or categorize users based on their click behavior, then explore
the data and make visual and quantitative comparisons across user segments/demographics
(<18-year-old females versus males, or logged in versus not, for example).
3. Now extend your analysis across days. Visualize some metrics and distributions over time.
Sample Code:
# read in one day's data (e.g., nyt1.csv from the repository linked above)
data1 <- read.csv("nyt1.csv")
# categorize users into age groups
head(data1)
data1$agecat <- cut(data1$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))
# view
summary(data1)
# summary statistics (count, min, mean, max) within each age bracket
install.packages("doBy")
library("doBy")
siterange <- function(x){c(length(x), min(x), mean(x), max(x))}
summaryBy(Age~agecat, data=data1, FUN=siterange)
# flag rows that actually received impressions
data1$hasimps <- cut(data1$Impressions, c(-Inf, 0, Inf))
summaryBy(Clicks~hasimps, data=data1, FUN=siterange)
# plot click-through rate and click distributions by age category
library(ggplot2)
ggplot(subset(data1, Impressions>0), aes(x=Clicks/Impressions, colour=agecat)) + geom_density()
ggplot(subset(data1, Clicks>0), aes(x=Clicks/Impressions, colour=agecat)) + geom_density()
ggplot(subset(data1, Clicks>0), aes(x=agecat, y=Clicks, fill=agecat)) + geom_boxplot()
ggplot(subset(data1, Clicks>0), aes(x=Clicks, colour=agecat)) + geom_density()
# create categories that segment users by click behaviour
data1$scode[data1$Impressions==0] <- "NoImps"
data1$scode[data1$Impressions>0] <- "Imps"
data1$scode[data1$Clicks>0] <- "Clicks"
data1$scode <- factor(data1$scode)
head(data1)
# look at levels: count rows per segment, gender, and age category
clen <- function(x){c(length(x))}
etable <- summaryBy(Impressions~scode+Gender+agecat, data=data1, FUN=clen)
The Data Science Process
1. Data Collection: This involves gathering data from various sources such as databases,
websites, sensors, or other means.
First we have the Real World. Inside the Real World are lots of people busy at various
activities. Some people are using Google+, others are competing in the Olympics; there are
spammers sending spam, and there are people getting their blood drawn. Say we have data
on one of these things.
2. Data Cleaning and Pre-processing: Raw data often contains errors, missing values, or
inconsistencies. Data scientists clean and pre-process the data to ensure accuracy and
consistency.
We want to process this to make it clean for analysis. So we build and use pipelines of data
munging: joining, scraping, wrangling, or whatever you want to call it. To do this we use
tools such as Python, shell scripts, R, or SQL, or all of the above
3. Exploratory Data Analysis: Data scientists explore the data using statistical techniques and
visualization tools to understand its characteristics, identify patterns, and detect anomalies.
Once we have this clean dataset, we should be doing some kind of EDA. In the course of
doing EDA, we may realize that it isn’t actually clean because of duplicates, missing values,
absurd outliers, and data that wasn’t actually logged or incorrectly logged. If that’s the case,
we may have to go back to collect more data, or spend more time cleaning the dataset.
4. Feature Engineering: This involves selecting, transforming, and creating new features
from the raw data to improve the performance of machine learning algorithms.
5. Machine Learning: Data scientists apply machine learning algorithms to build predictive
models or uncover hidden patterns in the data. This includes supervised learning,
unsupervised learning, and reinforcement learning techniques.
Next, we design our model to use some algorithm like k-nearest neighbors (k-NN), linear
regression, Naive Bayes, or something else. The model we choose depends on the type of
problem we’re trying to solve, of course, which could be a classification problem, a
prediction problem, or a basic description problem.
6. Model Evaluation and Validation: Data scientists assess the performance of the machine
learning models using metrics and techniques to ensure they generalize well to unseen data.
We then can interpret, visualize, report, or communicate our results. This could take the
form of reporting the results up to our boss or coworkers, or publishing a paper in a journal
and going out and giving academic talks about it.
Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier,
or a search ranking algorithm, or a recommendation system. Now the key here that makes
data science special and distinct from statistics is that this data product then gets
incorporated back into the real world, and users interact with that product, and that
generates more data, which creates a feedback loop.
7. Deployment and Monitoring: Once a model is trained and evaluated, it is deployed into
production systems. Data scientists monitor the performance of deployed models and
update them as needed.
8. Domain Expertise: Understanding the specific domain or industry is crucial for interpreting
the results of data analysis and making informed decisions.
This process does not happen by itself; a human, the data scientist, has to be involved throughout.
Someone has to make the decisions about what data to collect, and why. That person needs
to be formulating questions and hypotheses and making a plan for how the problem will be
attacked.
Example:
H0: There is no significant difference in test scores between students who attended an
online course and those who attended a traditional classroom course.
Ha: There is a significant difference in test scores between students who attended an online
course and those who attended a traditional classroom course.
We can think of the data science process as an extension of or variation of the scientific
method:
• Ask a question.
• Do background research.
• Construct a hypothesis.
• Test your hypothesis by doing an experiment.
• Analyze your data and draw a conclusion.
• Communicate your results.
In both the data science process and the scientific method, not every problem requires one
to go through all the steps, but almost all problems can be solved with some combination of
the stages. For example,
if your end goal is a data visualization (which itself could be thought of as a data product),
it’s possible you might not do any machine learning or statistical modelling, but you’d want
to get all the way to a clean dataset, do some exploratory analysis, and then create the
visualization.
• Doug Perlson, the CEO of RealDirect, has a background in real estate law, startups,
and online advertising.
• His goal with RealDirect is to use all the data he can access about real estate to
improve the way people sell and buy houses.
• Normally, people sell their homes about once every seven years, and they do so with
the help of professional brokers and current data. But there’s a problem both with the
broker system and the data quality. RealDirect addresses both of them.
• First, the brokers. They are typically “free agents” operating on their own—think of
them as home sales consultants. This means that they guard their data aggressively, and the
really good ones have lots of experience. But in the grand scheme of things, that really
means they have only slightly more data than the inexperienced brokers.
• The team of brokers also becomes data experts, learning to use information-
collecting tools to keep tabs on new and relevant data or to access publicly available
information.
• One problem with publicly available data is that it’s old news—there’s a three-month
lag between a sale and when the data about that sale is available.
• RealDirect is working on real-time feeds on things like when people start searching
for a home, what the initial offer is, the time between offer and close, and how people
search for a home online.
• Ultimately, good information helps both the buyer and the seller. At least if they’re
honest.
• This is where the magic of data pooling comes in: it allows RealDirect to take a
smaller commission because it’s more optimized, and therefore gets more volume.
• The site itself is best thought of as a platform for buyers and sellers to manage their
sale or purchase process.
• There are statuses for each person on site: active, offer made, offer rejected,
showing, in contract, etc. Based on your status, different actions are suggested by the
software.
Challenges:
• First off, there’s a law in New York that says you can’t show all the current housing
listings unless those listings reside behind a registration wall, so RealDirect requires
registration. On the one hand, this is an obstacle for buyers, but serious buyers are likely
willing to do it.
Sample R code
Here’s some sample R code that takes the Brooklyn housing data in the preceding exercise and cleans and explores it a bit:
require(gdata)
bk <- read.xls("rollingsales_brooklyn.xls", pattern="BOROUGH")
head(bk)
summary(bk)
# strip non-numeric characters (e.g., "$" and ",") from the sale price
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
require(plyr) # for count()
count(is.na(bk$SALE.PRICE.N))
names(bk) <- tolower(names(bk))
# clean/format the square-footage columns the same way
bk$gross.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$gross.square.feet))
bk$land.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$land.square.feet))
# explore sale prices to make sure nothing weird is going on
attach(bk)
hist(sale.price.n)
hist(sale.price.n[sale.price.n>0])
hist(gross.sqft[sale.price.n==0])
detach(bk)
# keep only actual sales (nonzero sale price)
bk.sale <- bk[bk$sale.price.n!=0,]
plot(bk.sale$gross.sqft, bk.sale$sale.price.n)
plot(log(bk.sale$gross.sqft), log(bk.sale$sale.price.n))
# restrict to family homes
bk.homes <- bk.sale[which(grepl("FAMILY", bk.sale$building.class.category)),]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
# inspect suspiciously low sale prices, ordered by price
bk.homes[which(bk.homes$sale.price.n<100000),][order(bk.homes[which(bk.homes$sale.price.n<100000),]$sale.price.n),]
# remove outliers that don't look like actual sales, then re-plot
bk.homes$outliers <- (log(bk.homes$sale.price.n) <= 5) + 0
bk.homes <- bk.homes[which(bk.homes$outliers==0),]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
Machine learning algorithms are largely used to predict, classify, or cluster. Statistical
modelling came out of statistics departments, and machine learning algorithms came out of
computer science departments.
In general, machine learning algorithms that are the basis of artificial intelligence (AI) such
as image recognition, speech recognition, recommendation systems, ranking and
personalization of content—often the basis of data products—are not usually part of a core
statistics curriculum or department. They aren’t generally designed to infer the underlying
generative process (e.g., to model something), but rather to predict or classify with the most
accuracy.
Many business or real-world problems that can be solved with data can be thought of as
classification and prediction problems when we express them mathematically.
“In the real world, how do I know that this algorithm is the solution to the problem I’m
trying to solve?”
Three basic algorithms:
• Linear Regression
• k-Nearest Neighbors (k-NN)
• k-means
1. Linear Regression
Linear regression is a fundamental algorithm in machine learning and statistics used for
predictive modelling and data analysis. It aims to find the best-fitting straight line (or linear
equation) that describes the relationship between one or more independent variables
(features) and a dependent variable (target).
The linear regression algorithm works by minimizing the sum of squared differences
between the predicted values (obtained from the linear equation) and the actual values in
the given dataset. This process is known as the Ordinary Least Squares (OLS) method.
When you use it, you are making the assumption that there is a linear relationship between
an outcome variable (sometimes also called the response variable, dependent variable, or
label) and a predictor (sometimes also called an independent variable, explanatory variable,
or feature); or between one variable and several other variables, in which case you’re
modeling the relationship as having a linear structure.
The model has the form
y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε,
where β₀ is the intercept, β₁, β₂, ..., βₙ are the coefficients (weights) corresponding to the
independent variables X₁, X₂, ..., Xₙ, and ε is the error term (residual).
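As a quick illustration (added here, not part of the original notes), the following R sketch fits such a model by ordinary least squares with lm() on simulated data; the variable names and coefficient values are made up for the example.

# Simulate data that follows y = 2 + 3*x1 - 1.5*x2 + noise
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)
# lm() estimates the coefficients by minimizing the residual sum of squares (OLS)
fit <- lm(y ~ x1 + x2)
coef(fit)    # estimates of beta0, beta1, beta2
summary(fit) # standard errors, t-values, p-values, R-squared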
Suppose you run a social networking site that charges a monthly subscription fee of $25,
and that this is your only source of revenue. Each month you collect data and count your
number of users and total revenue. You’ve done this daily over the course of two years,
recording it all in a spreadsheet.
You could express this data as a series of points. Here are the first four: (1, 25), (10, 250),
(100, 2500), (200, 5000), where x is the number of users and y is the total revenue in dollars.
When you showed this to someone else who didn’t even know how much you charged or
anything about your business model, they might notice that there’s a clear relationship
between the two columns, namely y = 25x.
• It seems deterministic: total revenue is always exactly $25 times the number of users.
• you have a dataset keyed by user (meaning each row contains data for a single user),
and the columns represent user behavior on a social networking site over a period of a
week.
• During the course of your exploratory data analysis, you’ve randomly sampled 100
users to keep it simple, and you plot pairs of these variables, for example, number of new
friends versus time spent on the site (in seconds).
Question:
The business context might be that eventually you want to be able to promise advertisers
who bid for space on your website in advance a certain number of users, so you want to be
able to forecast number of users several days or weeks in advance.
New friends (x)   Time spent on site in seconds (y)
7                 276
3                 43
4                 82
6                 136
10                471
9                 269
It looks like there’s kind of a linear relationship here, and it makes sense; the more new
friends you have, the more time you might spend on the site.
• There are two things you want to capture in the model. The first is the trend and the
second is the variation. We’ll start first with the trend.
• First, let’s start by assuming there actually is a relationship and that it’s linear. Because a
linear relationship is assumed, start the model by assuming the functional form to be
y = β₀ + β₁x.
• The intuition behind linear regression is to find the line that minimizes the distance
between all the points and the line.
• To find this line, you’ll define the “residual sum of squares” (RSS), denoted RSS(β), to be
RSS(β) = Σᵢ (yᵢ − β₀ − β₁xᵢ)²,
where i ranges over the observed data points.
Linear regression seeks to find the line that minimizes the sum of the squares of the
vertical distances between the approximated or predicted ŷᵢ values and the observed yᵢ values.
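To make this concrete, here is a short R sketch (added for illustration) that fits the least-squares line to the six (new friends, time on site) pairs from the table above and reports the RSS; note that the summary output shown in the next subsection comes from the full simulated dataset in the source text, not from these six points.

# x = number of new friends, y = time spent on the site (seconds), from the table above
x <- c(7, 3, 4, 6, 10, 9)
y <- c(276, 43, 82, 136, 471, 269)
fit <- lm(y ~ x)             # least-squares fit of y = beta0 + beta1*x
sum(residuals(fit)^2)        # the residual sum of squares, RSS(beta)
plot(x, y, xlab = "new friends", ylab = "time on site (seconds)")
abline(fit)                  # draw the fitted line through the cloud of points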
R-squared
model <- lm(y ~ x)
summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-121.17 -52.63 -9.72 41.54 356.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.083 16.623 -1.93 0.0565 .
x 45.918 2.141 21.45 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
Evaluation metrics:
• p-values: For any given β, the p-value captures the probability of observing the data
that we observed, and obtaining the test statistic that we obtained, under the null
hypothesis.
• Cross-validation:
o Divide our data up into a training set and a test set: 80% in the training and
20% in the test
o Fit the model on the training set, then look at the mean squared error on the
test set and compare it to that on the training set.
o If the mean squared errors are approximately the same, then the model
generalizes well and is not in danger of overfitting, as in the sketch below.
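Here is a minimal R sketch of that 80/20 train/test check on simulated data (all names and numbers are illustrative):

set.seed(42)
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- 5 + 2 * dat$x + rnorm(n)
# 80% of rows for training, the remaining 20% for testing
train_idx <- sample(1:n, size = 0.8 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]
fit <- lm(y ~ x, data = train)
mse_train <- mean((train$y - predict(fit, train))^2)
mse_test <- mean((test$y - predict(fit, test))^2)
c(train = mse_train, test = mse_test) # similar values suggest the model generalizes well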
Transformations
o Sometimes the relationship between the outcome and the predictor isn’t linear; a
better model would be a polynomial relationship like y = β₀ + β₁x + β₂x² + β₃x³,
which can still be fit with linear regression after transforming the predictor, as
sketched below.
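As an illustration (added here, not from the source), such a polynomial can still be fit by ordinary least squares because the model remains linear in its coefficients:

set.seed(7)
x <- seq(-3, 3, length.out = 100)
y <- 1 + 2 * x - 0.5 * x^2 + 0.3 * x^3 + rnorm(100)
# poly() builds the transformed predictor columns x, x^2, x^3
fit_poly <- lm(y ~ poly(x, 3, raw = TRUE))
summary(fit_poly)$coefficients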
Comparison of linear and logistic regression:

Aspect       Linear Regression       Logistic Regression
Model form   Linear relationship     Logistic (S-shaped) relationship
Usage        Prediction              Classification and probability estimation
Example: k-means clustering. Consider the following 15 points, which we will group into k = 3 clusters.

Point   Coordinates
A1 (2,10)
A2 (2,6)
A3 (11,11)
A4 (6,9)
A5 (6,4)
A6 (1,2)
A7 (5,10)
A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)
A15 (6,11)
Randomly choose three initial centroids: (2,6), (5,10), and (6,11).

Iteration 1: for each point, compute its distance from Centroid 1 (2,6), Centroid 2 (5,10), and
Centroid 3 (6,11), and assign the point to the nearest centroid.
(Table: Point | Distance from Centroid 1 (2,6) | Distance from Centroid 2 (5,10) | Distance from Centroid 3 (6,11) | Assigned Cluster)

Recompute each centroid as the mean of the points assigned to it; the updated centroids are
(3.833, 5.167), (4, 9.6), and (9, 11.25).

Iteration 2: repeat the distance computation and assignment with the updated centroids, and
iterate until the assignments no longer change.
(Table: Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster)
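The following R sketch (added for illustration) reproduces the first assignment-and-update step on the 15 points above, using the same initial centroids (2,6), (5,10), and (6,11):

# The 15 points A1..A15 from the table above
pts <- data.frame(x = c(2, 2, 11, 6, 6, 1, 5, 4, 10, 7, 9, 4, 3, 3, 6),
                  y = c(10, 6, 11, 9, 4, 2, 10, 9, 12, 5, 11, 6, 10, 8, 11))
centroids <- data.frame(x = c(2, 5, 6), y = c(6, 10, 11)) # initial centroids
# Euclidean distance from every point to every centroid (15 x 3 matrix)
d <- as.matrix(dist(rbind(centroids, pts)))[-(1:3), 1:3]
# Assign each point to its nearest centroid
assigned <- apply(d, 1, which.min)
# Recompute each centroid as the mean of its assigned points:
# gives (3.833, 5.167), (4, 9.6), (9, 11.25), matching the updated centroids above
aggregate(pts, by = list(cluster = assigned), FUN = mean)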
Document Classification: Using k-means clustering, we can divide documents into various
clusters based on their content, topics, and tags.
Cyber profiling: In cyber profiling, we collect data from individuals as well as groups to
identify their relationships. With k-means clustering, we can easily make clusters of people
based on their connection to each other to identify any available patterns.
Fraud detection in banking and insurance: By using historical data on frauds, banks and
insurance agencies can predict potential frauds by the application of k-means clustering.
# Run R's built-in kmeans() on the same 15 points (pts, defined in the sketch above) with k = 3
k <- 3
kmeans_result <- kmeans(pts, centers = k)
print(kmeans_result$centers)
Aspect                  Linear Regression                 Logistic Regression
Outcome variable type   Continuous (numeric)              Binary (categorical)
Assumptions             Normally distributed residuals    Linear relationship with the log odds
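To make the contrast concrete, here is an illustrative sketch (not from the source notes): lm() fits a linear regression to a continuous outcome, while glm() with family = binomial fits a logistic regression to a binary outcome.

set.seed(3)
x <- rnorm(100)
y_numeric <- 1 + 2 * x + rnorm(100)                # continuous outcome
y_binary <- rbinom(100, 1, plogis(-0.5 + 2 * x))   # binary outcome
lin_fit <- lm(y_numeric ~ x)                       # linear regression
log_fit <- glm(y_binary ~ x, family = binomial)    # logistic regression: log odds linear in x
coef(lin_fit)
coef(log_fit)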
k-Nearest Neighbors (k-NN) is an instance-based ("lazy") learning algorithm: it memorizes the
training instances, which are subsequently used as "knowledge" for the prediction phase.
Training Phase: During the training phase, it memorizes the training instances along with
their corresponding class labels (in the case of classification) or target values (in the case of regression).
Prediction Phase:
Input: When presented with a new instance, the algorithm first calculates the distance between this
instance and all other instances in the training data.
Distance Metric: Commonly used distance metrics include Euclidean distance, Manhattan
distance, Minkowski distance, etc.
Selecting Neighbors: The algorithm then selects the K nearest neighbors (data points) to the
new instance based on the computed distances.
Classification: For a classification task, the class of the new instance is determined by a
majority vote among its K nearest neighbors. The most frequent class label among the K
neighbors is assigned to the new instance.
Example: consider the following labelled points, where each point has two feature values and a
class label (Red or Blue). A new entry with feature values (20, 35) is to be classified using K = 5.

Feature 1   Feature 2   Class
40          20          Red
50          50          Blue
60          90          Blue
10          25          Red
70          70          Blue
60          10          Red
25          80          Blue
20          35          ?
Compute the Euclidean distance between the new entry (20, 35) and each existing point:

Feature 1   Feature 2   Class   Distance from (20, 35)
40          20          Red     25
50          50          Blue    33.54
60          90          Blue    68.01
10          25          Red     14.14
70          70          Blue    61.03
60          10          Red     47.17
25          80          Blue    45.28
Next, sort the points by distance in ascending order:

Feature 1   Feature 2   Class   Distance from (20, 35)
10          25          Red     14.14
40          20          Red     25
50          50          Blue    33.54
25          80          Blue    45.28
60          10          Red     47.17
70          70          Blue    61.03
60          90          Blue    68.01
Since we chose 5 as the value of K, we'll only consider the first five rows:

Feature 1   Feature 2   Class   Distance from (20, 35)
10          25          Red     14.14
40          20          Red     25
50          50          Blue    33.54
25          80          Blue    45.28
60          10          Red     47.17
As we can see, the majority class within the 5 nearest neighbors to the new entry is Red.
Therefore, we'll classify the new entry as Red.
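The same calculation can be checked with a short R sketch (added for illustration; the feature names are placeholders):

# Existing points (feature 1, feature 2) with their classes, plus the new entry (20, 35)
f1 <- c(40, 50, 60, 10, 70, 60, 25)
f2 <- c(20, 50, 90, 25, 70, 10, 80)
class_label <- c("Red", "Blue", "Blue", "Red", "Blue", "Red", "Blue")
new_entry <- c(20, 35)
# Euclidean distances from the new entry to every labelled point
d <- sqrt((f1 - new_entry[1])^2 + (f2 - new_entry[2])^2)
round(d, 2)
# Majority vote among the K = 5 nearest neighbours
k <- 5
table(class_label[order(d)[1:k]]) # Red wins 3 to 2, so the new entry is classified Red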
One advantage of k-NN is that it is simple to implement.
data(iris)
str(iris)
summary(iris)
set.seed(123)
# Split the dataset into training (80%) and testing (20%) sets
train_idx <- sample(1:nrow(iris), 0.8 * nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]
library(class)
k <- 3
# Classify the test set from the four numeric features using the K nearest training points
pred <- knn(train = train[, 1:4], test = test[, 1:4], cl = train$Species, k = k)
# Proportion of correct predictions on the test set
mean(pred == test$Species)
User retention refers to the ability of a product, service, or platform to retain its
users over a period of time.
Factors that influence user retention include:
User Experience
Value Proposition
Customer Support
Product Updates
Onboarding Process
Engagement
Personalization
Retention Strategies:
Segmentation
Onboarding Optimization
Continuous Engagement
Reactivation Campaigns
Understanding Data
Derived Variables
Aggregated Statistics
Experimentation
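As a small illustration (the data frame and column names below are hypothetical), derived variables and aggregated statistics of this kind can be built in R like so:

# Hypothetical per-visit log: user id, clicks, impressions, seconds spent on the site
logs <- data.frame(user = c(1, 1, 2, 2, 3),
                   clicks = c(0, 2, 1, 0, 4),
                   impressions = c(5, 8, 3, 6, 10),
                   time_spent = c(30, 120, 45, 10, 300))
# Derived variable: click-through rate for each visit
logs$ctr <- ifelse(logs$impressions > 0, logs$clicks / logs$impressions, NA)
# Aggregated statistics: per-user totals and means
aggregate(cbind(clicks, impressions, time_spent) ~ user, data = logs, FUN = sum)
aggregate(ctr ~ user, data = logs, FUN = mean)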
Feature selection methods fall into two main families: the filters approach and the wrappers
approach. In the wrappers approach, candidate subsets of features are evaluated by actually
training a model on them, as follows.
Subset Generation:
Generate different subsets of features from the original feature set.
Model Training and Evaluation:
Train a machine learning model using each subset of features.
Evaluate the model's performance (e.g., accuracy, error rate) using a chosen performance
metric on a validation set (or through cross-validation).
Feature Selection Criteria:
Use the model's performance as a criterion for selecting the best subset of features.
Forward Selection:
Starts with an empty set of features and gradually adds one feature at a time, selecting the
one that maximizes the model's performance.
Backward Elimination:
Begins with the full set of features and removes one feature at a time, evaluating the impact
on the model's performance.
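A minimal sketch (added for illustration) of forward selection and backward elimination using R's built-in step() function, which searches by AIC rather than by a validation-set metric:

set.seed(11)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x3 + rnorm(n)  # only x1 and x3 actually matter
full <- lm(y ~ ., data = dat)   # model with all candidate features
null <- lm(y ~ 1, data = dat)   # intercept-only model
# Forward selection: start empty, add the feature that most improves the criterion
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)
# Backward elimination: start with everything, drop the least useful feature at a time
bwd <- step(full, direction = "backward", trace = 0)
formula(fwd)
formula(bwd)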
Pros:
Can lead to more optimal feature subsets for the predictive task.
Cons:
Prone to overfitting if not used with caution, especially with small datasets.
May be sensitive to the choice of the evaluation metric and the performance
of the underlying machine learning algorithm.