Data Exploration & Visualization

Exploratory data analysis is an important part of the data analytics process. It involves steps like data preparation, missing value treatment, outlier detection, variable transformation, and feature engineering. Visualization is key to exploratory data analysis, with techniques like univariate analysis using histograms and box plots to understand individual variables, and bivariate analysis using scatter plots, correlation, and chi-square tests to explore relationships between variables. The goal of exploratory data analysis is to understand the data, identify patterns and insights, and prepare the data for building predictive models.


Data Exploration &

Visualization
Data Science & Analytics
Introduction
 Exploratory data analysis is a concept developed by John Tukey (1977) that
offers a new perspective on statistics. Tukey's idea was that in traditional
statistics, the data was not being explored graphically; it was just being used to
test hypotheses. The first attempt to develop a tool was made at Stanford, in a
project called PRIM-9. The tool was able to visualize data in nine dimensions,
and could therefore provide a multivariate perspective of the data.

 These days, exploratory data analysis is a must and has been included in the
big data analytics life cycle. The ability to find insight and communicate it
effectively in an organization is fueled by strong EDA capabilities.

 Based on Tukey's ideas, Bell Labs developed the S programming language in
order to provide an interactive interface for doing statistics. The idea of S was to
provide extensive graphical capabilities with an easy-to-use language. In
today's world, in the context of Big Data, R, which is based on the S programming
language, is among the most popular software for analytics.
Table of Contents
 Steps of Data Exploration and Preparation
 Missing Value Treatment
 Why missing value treatment is required?
 Why data has missing values?
 Which are the methods to treat missing values?
 Techniques of Outlier Detection and Treatment
 What is an outlier?
 What are the types of outliers?
 What are the causes of outliers?
 What is the impact of outliers on a dataset?
 How to detect outliers?
 How to remove outliers?
 The Art of Feature Engineering
 What is Feature Engineering?
 What is the process of Feature Engineering?
 What is Variable Transformation?
 When should we use variable transformation?
 What are the common methods of variable transformation?
 What is feature variable creation and its benefits?
Steps for Data Exploration &
Visualization
 Remember: the quality of your inputs decides the quality of your output. So, once
you have your business hypothesis ready, it makes sense to spend a lot of time
and effort here. By my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.
 Below are the steps involved to understand, clean and prepare your data for
building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
 Finally, we will need to iterate over steps 4 – 7 multiple times before we come up
with our refined model.
 Let's now study each stage in detail:
Steps for Data Exploration &
Visualization (contd)
Variable Identification

 First, identify Predictor (Input) and Target (output) variables. Next, identify
the data type and category of the variables.
 Let’s understand this step more clearly by taking an example.
 Example:- Suppose we want to predict whether the students will play
cricket or not (refer to the data set below). Here you need to identify the predictor
variables, the target variable, the data type of the variables and the category of the variables.
Steps for Data Exploration &
Visualization (contd)
 Below, the variables have been classified into different categories:
Steps for Data Exploration &
Visualization (contd)
Univariate Analysis
 At this stage, we explore variables one by one. The method used to perform
univariate analysis depends on whether the variable type is categorical or
continuous. Let's look at these methods and statistical measures for
categorical and continuous variables individually:

 Continuous Variables:- In the case of continuous variables, we need to
understand the central tendency and spread of the variable. These are
measured using various statistical metrics and visualization methods, as shown
below:
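As a concrete illustration, the central-tendency and spread measures above can be computed directly with pandas; the data below is hypothetical, not the slide's dataset:

```python
import pandas as pd

# Hypothetical ages of students (illustrative data only)
ages = pd.Series([14, 15, 15, 16, 16, 16, 17, 18, 25])

# Central tendency
print(ages.mean())     # arithmetic mean
print(ages.median())   # middle value, robust to the outlier 25
print(ages.mode()[0])  # most frequent value

# Spread / dispersion
print(ages.std())                                  # standard deviation
print(ages.quantile(0.75) - ages.quantile(0.25))   # interquartile range
```

A histogram (`ages.plot.hist()`) or box plot (`ages.plot.box()`) would visualize the same distribution.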
Steps for Data Exploration &
Visualization (contd)
Univariate Analysis

 Note: Univariate analysis is also used to highlight missing and outlier values.
In the upcoming part of this series, we will look at methods to handle
missing and outlier values. To know more about these methods, you can
refer to the Descriptive Statistics course from Udacity.

 Categorical Variables:- For categorical variables, we use a frequency table
to understand the distribution of each category. We can also read it as the
percentage of values under each category. It can be measured using
two metrics, Count and Count%, against each category. A bar chart can be
used as the visualization.
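A frequency table with Count and Count% can be built with pandas `value_counts`; the gender data here is made up for illustration:

```python
import pandas as pd

# Hypothetical "Gender" column (illustrative, not the slide's dataset)
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"])

count = gender.value_counts()                           # Count per category
count_pct = gender.value_counts(normalize=True) * 100   # Count% per category

freq_table = pd.DataFrame({"Count": count, "Count%": count_pct})
print(freq_table)

# A bar chart visualizes the same table, e.g. count.plot.bar()
```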
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis

 Bi-variate analysis finds the relationship between two variables. Here,
we look for association and disassociation between variables at a pre-
defined significance level. We can perform bi-variate analysis for any
combination of categorical and continuous variables. The combination
can be: Categorical & Categorical, Categorical & Continuous, or
Continuous & Continuous. Different methods are used to tackle these
combinations during the analysis process.

 Let’s understand the possible combinations in detail:

 Continuous & Continuous: While doing bi-variate analysis between two
continuous variables, we should look at a scatter plot. It is a nifty way to find
the relationship between two variables. The pattern of the scatter plot
indicates the relationship between the variables, which can be linear
or non-linear.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Continuous & Continuous
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.

• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: no correlation

Correlation can be derived using the following formula:

Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
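The formula can be checked numerically; the sketch below (with hypothetical x/y values) computes the correlation from covariance and variance and compares it to NumPy's built-in Pearson correlation:

```python
import numpy as np

# Hypothetical paired observations (illustrative data only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
r = cov_xy / np.sqrt(x.var() * y.var())            # np.var is population variance

# Same value as NumPy's built-in Pearson correlation
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)
```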
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Continuous & Continuous
Various tools have functions to identify the correlation between variables. In
Excel, the function CORREL() returns the correlation between two variables,
and SAS uses the procedure PROC CORR. These functions return the Pearson
correlation value to identify the relationship between two variables:

 In the above example, we have a good positive relationship (0.65) between the two
variables X and Y.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical

Categorical & Categorical: To find the relationship between two categorical
variables, we can use the following methods:
• Two-way table: We can start analyzing the relationship by creating a two-way
table of count and count%. The rows represent the categories of one variable and
the columns represent the categories of the other variable. We show the count or
count% of observations available in each combination of row and column
categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
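Both the two-way table and its count% variant can be produced with pandas `crosstab`; the data below is hypothetical:

```python
import pandas as pd

# Hypothetical categorical data (illustrative only)
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Plays":  ["Yes",  "No",   "Yes",    "Yes",    "Yes",  "No"],
})

# Two-way table of counts
two_way = pd.crosstab(df["Gender"], df["Plays"])
print(two_way)

# Two-way table of count% (row-wise percentages)
two_way_pct = pd.crosstab(df["Gender"], df["Plays"], normalize="index") * 100
print(two_way_pct)

# A stacked column chart is the visual form: two_way.plot.bar(stacked=True)
```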
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical
Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables.
It also tests whether the evidence in the sample is strong enough to generalize the relationship to a
larger population. Chi-square is based on the difference between the expected and observed
frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-
square statistic with the given degrees of freedom.
Probability of 0: Indicates that both categorical variables are dependent.
Probability of 1: Shows that both variables are independent.
Probability less than 0.05: Indicates that the relationship between the variables is significant at 95%
confidence.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical

The chi-square test statistic for a test of independence of two categorical variables is
found by:

χ² = Σ (O − E)² / E

where O represents the observed frequency and E is the expected
frequency under the null hypothesis, computed by:

E = (row total × column total) / sample size

From the previous two-way table, the expected count for product category 1 to be of small size
is 0.22. It is derived by taking the row total for Size (9) times the column total for Product
category (2), then dividing by the sample size (81). This procedure is conducted for each cell.
Statistical measures used to analyze the strength of the relationship are:

 Cramer's V for nominal categorical variables
 Mantel-Haenszel chi-square for ordinal categorical variables.
Different data science languages and tools have specific methods to perform the chi-square test. In
SAS, we can use Chisq as an option with Proc freq to perform this test.
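Outside SAS, the same test can be run in Python with `scipy.stats.chi2_contingency`; the observed counts below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 two-way table of observed counts (illustrative only)
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)

# Expected count per cell = row total * column total / sample size,
# e.g. expected[0, 0] = 40 * 50 / 100 = 20
print(chi2, p, dof)
print(expected)

if p < 0.05:
    print("Relationship is significant at 95% confidence")
```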
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Continuous

 While exploring the relationship between categorical and continuous variables, we can
draw box plots for each level of the categorical variable. If the number of levels is small, the
plots will not show the statistical significance. To look at the statistical significance, we can
perform a Z-test, T-test or ANOVA.
 Z-Test / T-Test:- Either test assesses whether the means of two groups are statistically different
from each other or not.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Continuous

 If the probability of Z is small, then the difference of the two averages is more significant.
The T-test is very similar to the Z-test, but it is used when the number of observations for both
categories is less than 30.
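A two-sample t-test can be run with `scipy.stats.ttest_ind`; the group scores below are hypothetical:

```python
from scipy.stats import ttest_ind

# Hypothetical continuous measurements for two categories (illustrative only)
group_a = [82, 85, 88, 90, 79, 84]
group_b = [70, 72, 68, 75, 71, 69]

# Two-sample t-test: are the two group means statistically different?
t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)

if p_value < 0.05:
    print("Means differ significantly at 95% confidence")
```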
Missing Value Treatment

Why missing values treatment is required?

 Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
Missing Value Treatment (contd)
Why missing values treatment is required?

Notice the missing values in the image shown above: in the left scenario, we have not treated the missing values.
The inference from that data set is that the chances of playing cricket are higher for males than for females. On the
other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender),
we can see that females have a higher chance of playing cricket compared to males.
Missing Value Treatment (contd)
Why my data has missing values?

 We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons for the
occurrence of these missing values. They may occur at two stages:
 Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should
double-check for correct data with the data guardians. Some hashing procedures can also be used to make
sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be
corrected easily as well.
 Data collection: These errors occur at the time of data collection and are harder to correct. They can be
categorized into four types:

 Missing completely at random: This is the case when the probability of a value being missing is the same for all
observations. For example: respondents in a data collection process decide to declare their earnings
after tossing a fair coin. If a head occurs, the respondent declares his / her earnings, and vice versa. Here each
observation has an equal chance of a missing value.
 Missing at random: This is the case when a variable is missing at random and the missing ratio varies for different values
/ levels of other input variables. For example: we are collecting data for age, and females have a higher missing-value
rate compared to males.
 Missing that depends on unobserved predictors: This is the case when the missing values are not random and
are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic
causes discomfort, then there is a higher chance of dropout from the study. This missingness is not at random
unless we have included "discomfort" as an input variable for all patients.
 Missing that depends on the missing value itself: This is the case when the probability of a missing value is directly
correlated with the missing value itself. For example: people with higher or lower incomes are likely to provide a
non-response to questions about their earnings.
Missing Value Treatment (contd)
Which are the methods to treat missing values ?
 Deletion: It is of two types: list-wise deletion and pair-wise deletion.

 In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one
of the major advantages of this method, but it reduces the power of the model because it
reduces the sample size.
 In pair-wise deletion, we perform analysis with all cases in which the variables of interest are
present. The advantage of this method is that it keeps as many cases as possible available for
analysis. One disadvantage is that it uses different sample sizes for different variables.

Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
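Both deletion styles can be sketched with pandas; the small data frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with scattered missing values (illustrative only)
df = pd.DataFrame({
    "Age":      [25, 30, np.nan, 40, 35],
    "Income":   [50, np.nan, 60, 80, 70],
    "Spending": [20, 25, 22, np.nan, 28],
})

# List-wise deletion: drop any row with at least one missing value
listwise = df.dropna()
print(len(listwise))  # only rows complete across all variables remain

# Pair-wise deletion: each pairwise statistic uses all rows where that pair
# of variables is present; pandas' corr() behaves this way by default,
# so each correlation may be based on a different sample size
pairwise_corr = df.corr()
print(pairwise_corr)
```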
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with
estimated ones. The objective is to employ known relationships that can be identified in
the valid values of the data set to assist in estimating the missing values. Mean / mode /
median imputation is one of the most frequently used methods. It consists of replacing
the missing data for a given attribute with the mean or median (quantitative attribute) or
mode (qualitative attribute) of all known values of that variable. It can be of two types:-

 Generalized Imputation: In this case, we calculate the mean or median of all non-missing values
of the variable, then replace the missing values with it. As in the table above, the variable
"Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and
then replace the missing values with it.
 Similar case Imputation: In this case, we calculate the average for gender "Male" (29.75) and
"Female" (25) individually over non-missing values, then replace the missing values based on
gender. For "Male" we replace missing values of Manpower with 29.75, and for "Female"
with 25.
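Both imputation types can be expressed with pandas `fillna`; the Gender/Manpower values below are hypothetical, not the slide's table:

```python
import numpy as np
import pandas as pd

# Hypothetical "Manpower" column with missing values by Gender (illustrative)
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male"],
    "Manpower": [30.0, np.nan, 25.0, np.nan, 28.0],
})

# Generalized imputation: replace with the overall mean of non-missing values
overall = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: replace with the mean of the same gender
by_gender = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(overall.tolist())
print(by_gender.tolist())
```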
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 Prediction Model: A prediction model is one of the more sophisticated methods for handling
missing data. Here, we create a predictive model to estimate values that will substitute for
the missing data. In this case, we divide our data set into two sets: one with no
missing values for the variable and another with missing values. The first data set
becomes the training data set of the model, while the second data set with missing values becomes the test
data set, and the variable with missing values is treated as the target variable. Next, we create a
model to predict the target variable based on the other attributes of the training data set and
populate the missing values of the test data set. We can use regression, ANOVA, logistic
regression and various other modeling techniques for this. There are two drawbacks to this
approach:
 The model-estimated values are usually more well-behaved than the true values
 If there are no relationships between the other attributes in the data set and the attribute with missing values,
then the model will not be precise in estimating the missing values.
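A minimal sketch of this idea on hypothetical data, using a simple one-predictor linear fit via `np.polyfit` as a stand-in for the richer regression models mentioned above:

```python
import numpy as np

# Hypothetical data: fill missing `y` values from `x` (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, np.nan, 10.1, np.nan])

present = ~np.isnan(y)

# "Training set": rows where y is present; fit a simple linear model
slope, intercept = np.polyfit(x[present], y[present], deg=1)

# "Test set": rows where y is missing; predict and fill them in
y_filled = y.copy()
y_filled[~present] = slope * x[~present] + intercept
print(y_filled)
```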
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 KNN Imputation: In this method of imputation, the missing values of an attribute are imputed
using a given number of instances that are most similar to the instance whose values are
missing. The similarity of two instances is determined using a distance function. The method
has certain advantages and disadvantages.
 Advantages:
 k-nearest neighbour can predict both qualitative & quantitative attributes
 Creation of a predictive model for each attribute with missing data is not required
 Attributes with multiple missing values can be easily treated
 The correlation structure of the data is taken into consideration
 Disadvantages:
 The KNN algorithm is very time-consuming when analyzing a large database. It searches through the whole
dataset looking for the most similar instances.
 The choice of k-value is critical. A higher value of k would include neighbours that are significantly
different from what we need, whereas a lower value of k implies missing out on significant neighbours.
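A hand-rolled sketch of KNN imputation on hypothetical data, assuming Euclidean distance over the observed attributes and averaging the k nearest complete rows:

```python
import numpy as np

# Hypothetical data matrix with one missing entry (illustrative only);
# rows are observations, columns are attributes
X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, 8.0],
    [1.05, np.nan],  # value to impute
])

k = 2
row = X[3]                 # view into X: filling it updates X in place
missing = np.isnan(row)

# Distance to every complete row, using only the observed attributes
complete = X[~np.isnan(X).any(axis=1)]
dists = np.sqrt(((complete[:, ~missing] - row[~missing]) ** 2).sum(axis=1))

# Impute with the mean of the k nearest neighbours' values
nearest = complete[np.argsort(dists)[:k]]
row[missing] = nearest[:, missing].mean(axis=0)
print(X[3])  # missing value replaced by the neighbours' average
```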

 After dealing with missing values, the next task is to deal with outliers. Often, we tend to
neglect outliers while building models. This is a discouraging practice. Outliers tend to skew
your data and reduce accuracy. Let's learn more about outlier treatment.
