
DSE 3 Unit 4

Exploratory data analysis (EDA) is used to discover trends, patterns, and check assumptions in data through statistical summaries and graphical representations. There are four main types of EDA: univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical. Common tools for EDA include R and Python which are used to perform tasks like missing value analysis, clustering, predictive modeling, and multivariate visualization.

Uploaded by

Priyaranjan Soren


DSE-3: Data Science – (Unit-4)

Exploratory Data Analysis (EDA): Exploratory Data Analysis is an approach
to analyzing data using visual techniques. It is used to discover trends and
patterns, or to check assumptions, with the help of statistical summaries and
graphical representations.
Univariate, Bivariate and Multivariate data and its analysis
1. Univariate data – This type of data consists of only one variable. The analysis
of univariate data is the simplest form of analysis, since the information
deals with only one quantity that changes. It does not deal with causes or
relationships; the main purpose of the analysis is to describe the data and
find patterns that exist within it. An example of univariate data is height.
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186
Suppose that the heights of seven students in a class are recorded (figure 1).
There is only one variable, height, and it does not deal with any cause or
relationship. The patterns found in this type of data can be described
using measures of central tendency (mean, median and
mode), measures of dispersion or spread (range, minimum, maximum, quartiles,
variance and standard deviation), and by using frequency distribution tables,
histograms, pie charts, frequency polygons and bar charts.
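The summary measures listed above can be computed for the seven heights with a short sketch. It uses only Python's standard `statistics` module. The variable names are illustrative, and the sample (n - 1) versions of variance and standard deviation are an assumed choice, since the notes do not say which version is intended.

```python
import statistics

# Heights (in cm) of the seven students from the example above.
heights_cm = [164, 167.3, 170, 174.2, 178, 180, 186]

summary = {
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "minimum": min(heights_cm),
    "maximum": max(heights_cm),
    "range": max(heights_cm) - min(heights_cm),
    # Sample versions (divide by n - 1); an assumption, see the note above.
    "sample_variance": statistics.variance(heights_cm),
    "sample_std_dev": statistics.stdev(heights_cm),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Every height occurs only once here, so the mode is not informative for this particular sample.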
2. Bivariate data – This type of data involves two different variables. The
analysis of this type of data deals with causes and relationships, and it
is done to find out the relationship between the two variables. An example of
bivariate data is temperature and ice cream sales in the summer season.
Temperature (in Celsius)    Ice-cream Sales
20                          2000
25                          2500
35                          5000
73                          7800
Suppose the temperature and ice cream sales are the two variables of a
bivariate data set (figure 2). Here, the relationship is visible from the table:
as the temperature increases, the sales also increase, so the two variables
are positively related (although not in strict proportion). Thus bivariate
data analysis involves comparisons, relationships, causes and explanations.
These variables are often plotted on the X and Y axes of a graph for better
understanding of the data; one of the variables is independent while the other
is dependent.
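One way to put a number on the relationship in the table is the Pearson correlation coefficient, sketched below. The coefficient is a standard measure that the notes do not name, so treat its use here as an assumption. The function name `pearson_r` is illustrative.

```python
import math

# Values taken from the temperature / ice-cream sales table above.
temperature_c = [20, 25, 35, 73]
ice_cream_sales = [2000, 2500, 5000, 7800]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(temperature_c, ice_cream_sales)
print(f"r = {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship. It does not by itself show that one variable causes the other.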
3. Multivariate data – When the data involves three or more variables, it is
categorized as multivariate. For example, suppose an
advertiser wants to compare the popularity of four advertisements on a website.
Their click rates could be measured for both men and women, and the
relationships between the variables can then be examined.
It is similar to bivariate analysis but may contain more than one dependent variable. The
way to perform analysis on this data depends on the goals to be achieved.
M K Mishra, Asst. Prof. of Comp. Sc., FMAC, Bls. Page 1 of 8
Types of Exploratory Data Analysis:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
1. Univariate Non-graphical: This is the simplest form of data analysis, as it
uses just one variable to study the data. The standard goal of univariate
non-graphical EDA is to understand the underlying sample distribution and make
observations about the population. Outlier detection is also part of the
analysis. The characteristics of the population distribution include:
• Central tendency: The central tendency or location of a distribution has to
do with its typical or middle values. The commonly used measures of central
tendency are the mean, median, and mode, of which the
mean is the most common. For a skewed distribution, or when there is concern
about outliers, the median may be preferred.
• Spread: Spread is an indicator of how far from the center we
are likely to find the data values. The standard deviation and variance are two
useful measures of spread. The variance is the mean of the
squares of the individual deviations, and the standard deviation is the square
root of the variance.
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are
generally used to show the relationship between two or more variables in the form
of either cross-tabulation or statistics.
• For categorical data, an extension of tabulation called cross-tabulation is
extremely useful. For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one variable
and row headings that match the levels of the other variable, then
filling in the counts of all subjects that share the same pair of levels.
• For one categorical variable and one quantitative variable, we compute
statistics for the quantitative variable separately for each level of the
categorical variable, then compare the statistics across the levels of the
categorical variable.
• Comparing the means is an informal version of one-way ANOVA, and comparing
medians is a robust version of one-way ANOVA.
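The two non-graphical techniques above, a two-way cross-tabulation and a per-level statistic, can be sketched with the standard library alone. The records are hypothetical data invented for illustration (loosely following the advertisement example), and all names are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: (gender, advertisement clicked, seconds spent on page).
records = [
    ("F", "ad_A", 12.0),
    ("F", "ad_B", 30.0),
    ("M", "ad_A", 18.0),
    ("M", "ad_A", 22.0),
    ("F", "ad_A", 15.0),
    ("M", "ad_B", 40.0),
]

# Cross-tabulation: count the subjects for every (row level, column level) pair.
crosstab = Counter((gender, ad) for gender, ad, _ in records)
rows = sorted({gender for gender, _, _ in records})
cols = sorted({ad for _, ad, _ in records})

print("gender " + " ".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:<6} " + " ".join(f"{crosstab[(r, c)]:>5}" for c in cols))

# One categorical and one quantitative variable: one statistic per level.
seconds_by_ad = defaultdict(list)
for _, ad, seconds in records:
    seconds_by_ad[ad].append(seconds)
mean_seconds = {ad: mean(values) for ad, values in seconds_by_ad.items()}
print(mean_seconds)
```

Comparing the entries of `mean_seconds` across advertisements is the informal comparison of means described in the last bullet.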
3. Univariate graphical: Non-graphical methods are quantitative and objective,
but they do not give a complete picture of the data. Graphical
methods, which involve a degree of subjective analysis, are therefore also required.
Common types of univariate graphics are:
• Histogram: The most basic graph is the histogram, which is a barplot
in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values. Histograms are one of the
simplest ways to quickly learn a lot about your data, including central
tendency, spread, modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the
stem-and-leaf plot. It shows all data values and the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the
boxplot. Boxplots are excellent at presenting information about central
tendency and show robust measures of location and spread, as well as providing
information about symmetry and outliers, although they can be misleading
about aspects such as multimodality. One of the best uses of boxplots is
in the form of side-by-side boxplots.
• Quantile-normal plots: The final univariate graphical EDA technique is
the most intricate. It is called the quantile-normal or QN plot or, more
generally, the quantile-quantile or QQ plot. It is used to see how well a particular
sample follows a particular theoretical distribution. It allows detection of
non-normality and diagnosis of skewness and kurtosis.
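The numbers behind a histogram and a boxplot can be sketched without a plotting library. The block below bins the seven heights from the univariate example and computes the five-number summary that a boxplot draws. The 5 cm bin width is an assumed choice, and the quartiles use the default method of `statistics.quantiles`; other quartile conventions give slightly different values.

```python
from collections import Counter
from statistics import quantiles

heights_cm = [164, 167.3, 170, 174.2, 178, 180, 186]

# Histogram: count the values that fall into each bin of the assumed width.
bin_width = 5
bin_counts = Counter(int(h // bin_width) * bin_width for h in heights_cm)
for start in sorted(bin_counts):
    print(f"{start}-{start + bin_width} cm | " + "*" * bin_counts[start])

# Boxplot ingredients: minimum, first quartile, median, third quartile, maximum.
q1, q2, q3 = quantiles(heights_cm, n=4)
five_number = (min(heights_cm), q1, q2, q3, max(heights_cm))
print(five_number)
```

Each row of asterisks is one histogram bar drawn sideways; the tuple holds the values a boxplot would mark as whisker ends, box edges and the median line.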
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. The one most commonly
used is a grouped barplot, with each group representing one level of one of
the variables and each bar within a group representing the levels of the other
variable. Other common types of multivariate graphics are:
• Scatterplot: For two quantitative variables, the essential graphical EDA
technique is the scatterplot, which has one variable on the x-axis, one
on the y-axis, and a point for every case in your dataset.
• Run chart: A line graph of data plotted over time.
• Heat map: A graphical representation of data in which values are depicted
by color.
• Multivariate chart: A graphical representation of the relationships
between factors and a response.
• Bubble chart: A data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
Tools Required for Exploratory Data Analysis: Some of the most
common tools used to perform EDA are:
1. R: An open-source programming language and free software environment for
statistical computing and graphics, supported by the R Foundation for Statistical
Computing. The R language is widely used among statisticians for developing
statistical observations and for data analysis.
2. Python: An interpreted, object-oriented programming language with dynamic
semantics. Its high-level, built-in data structures, combined with dynamic binding,
make it very attractive for rapid application development, as well as for use as a
scripting or glue language to connect existing components together. Python is
often used in EDA to identify missing values in a data set, which is important.
Apart from the functions described above, EDA can also involve the following:
• K-means clustering: an unsupervised learning algorithm in which
the data points are assigned to k clusters (groups). K-means
clustering is commonly used in market segmentation, image compression, and
pattern recognition.
• Predictive models such as linear regression, which use statistics and data
to predict outcomes.
• Univariate, bivariate, and multivariate visualization for
summary statistics, establishing relationships between variables, and
understanding how different fields within the data interact with one another.
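The k-means idea mentioned above can be sketched for one-dimensional data in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the function name `k_means_1d`, the hypothetical spend values, and the explicitly supplied starting centroids are all choices made for this example. Real tools usually pick starting centroids at random and handle many dimensions.

```python
def k_means_1d(points, initial_centroids, max_iterations=100):
    """Cluster 1-D points with a basic k-means loop; returns (centroids, labels)."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:  # no change means the loop has converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical customer spend values with two visible groups.
spend = [1, 2, 3, 10, 11, 12]
centroids, labels = k_means_1d(spend, initial_centroids=[1, 2])
print(centroids, labels)
```

With these inputs the loop settles on one centroid for the low-spend group and one for the high-spend group, which is the kind of split used in market segmentation.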


Common multivariate statistical techniques used to visualize high-dimensional
data: Some of the common multivariate statistical techniques used
to visualize high-dimensional data are:
• Regression analysis
• Multivariate analysis of variance (MANOVA)
Regression analysis: In simple words, regression is the statistical technique used
to determine the relationship between a dependent variable and one or more
independent variables. This relationship is then used to fit a line to the
data and to forecast the dependent variable from the independent variable.
Regression has a wide variety of applications. An example is forming an
equation from the known stock prices of the previous 5 years in order to predict
the future price of the stock.
Types of Regression: There are mainly 7 types of regression.
1. Linear Regression

Linear Regression is used to establish a relationship between an independent
and a dependent variable by fitting a straight line to the data. The straight line
obtained from the best fit is called the regression line.

The objective in Linear Regression is to minimize the distance between the actual
data points and the predicted data points, i.e., to minimize the residuals
and find the best-fitting line.
Representation of Linear regression:

Dependent variable = Intercept + Slope * Independent Variable + Error
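The representation above can be made concrete with the closed-form least-squares estimates for a single predictor. The sketch below assumes ordinary least squares, with slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The data points are hypothetical values lying close to the line y = 1 + 2x, and the function name is illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data lying close to the line y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(x, y)
# Residual = actual value minus the value predicted by the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The residuals are the "Error" term of the representation; for a least-squares line with an intercept they sum to zero, up to rounding.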


2. Logistic Regression

When the dependent variable is binary (discrete) rather than continuous, Logistic
Regression is used in place of Linear Regression. Logistic Regression estimates the
parameters of a logistic model and is a type of binomial regression. It is therefore
used to handle data that has two possible outcomes. The relationship between the
outcome and the predictors is used to predict the probability of an event where
the outcome is binary, that is, either yes or no.

odds = p / (1-p) = probability of the event occurring / probability of the event
not occurring
ln(odds) = ln(p/(1-p))
Here, p is the probability of the occurrence of the event.
Logistic Regression requires a large sample size to produce reliable results.
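The odds and log-odds formulas above, together with the inverse mapping from log-odds back to a probability, can be sketched as follows. The coefficients `intercept` and `slope` in the last part are hypothetical values chosen for illustration, not fitted from any data, and the function names are assumptions.

```python
import math

def odds(p):
    """Odds of an event that occurs with probability p."""
    return p / (1 - p)

def log_odds(p):
    """Natural logarithm of the odds (the logit of p)."""
    return math.log(odds(p))

def probability_from_log_odds(z):
    """Inverse of log_odds: the logistic function."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print(f"odds = {odds(p):.3f}, ln(odds) = {log_odds(p):.3f}")

# A logistic model sets ln(odds) = intercept + slope * x.
intercept, slope = -4.0, 0.1  # hypothetical coefficients
predicted = probability_from_log_odds(intercept + slope * 50)
print(f"predicted probability at x = 50: {predicted:.3f}")
```

Because the model is linear in the log-odds, the predicted probability always stays between 0 and 1, which a straight-line fit to a yes/no outcome cannot guarantee.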


3. Polynomial Regression

When the relationship between the dependent and independent variable is nonlinear,
polynomial regression is used. The model is still fitted with the least-squares
method. In this type of regression, the power of the independent variable is
greater than one. In short, this type of regression is generally adopted for
curvilinear data.

The equation is of the form: y = a + b*x^2
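Because the equation is linear in the coefficients a and b, it can be fitted by ordinary least squares after replacing x with the transformed predictor z = x^2. The sketch below assumes exactly the two-coefficient form given above (no linear term). The data are hypothetical values generated from y = 1 + 2x^2, and the function name is illustrative.

```python
def fit_quadratic_term(x, y):
    """Least-squares fit of y = a + b * x**2; returns (a, b)."""
    z = [xi ** 2 for xi in x]  # transformed predictor
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    s_zy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    s_zz = sum((zi - z_mean) ** 2 for zi in z)
    b = s_zy / s_zz
    a = y_mean - b * z_mean
    return a, b

# Hypothetical curvilinear data generated from y = 1 + 2 * x**2.
x = [0, 1, 2, 3]
y = [1.0, 3.0, 9.0, 19.0]

a, b = fit_quadratic_term(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The fit recovers the coefficients used to generate the data, which shows that "nonlinear in x" does not prevent the use of the least-squares method.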

4. Stepwise Regression: This type of regression is used when we deal with
multiple independent variables. Here, the selection of independent variables
is done with the help of an automatic procedure, which involves no
human intervention.
The Stepwise Regression procedures follow three approaches –

• Firstly, forward selection, which involves repeatedly adding variables and
checking the improvement in the model, stopping when no further improvement
beyond a given extent is possible.
• Secondly, backward elimination, which involves deleting
variables one at a time until no more variables can be deleted without a
significant loss of fit.
• Thirdly, bidirectional elimination, which is a combination of the other two approaches.
With each step, a variable is added to or removed from the set
of explanatory variables.
The equation is of the form: y = a + b*x + e
Where ‘e’ is the error term.


5. Ridge Regression: Ridge Regression is a technique for analyzing multiple
regression data that suffer from multicollinearity. When multicollinearity occurs,
least-squares estimates are unbiased, but their variances are large. By adding a
degree of bias to the regression estimates, ridge regression reduces the standard
errors.
In other words, Ridge Regression is a method used when the data
suffers from multicollinearity (independent variables are highly correlated). In
multicollinearity, even though the ordinary least-squares (OLS) estimates are
unbiased, their variances are large, which can put the estimated value far
from the true value.

Often in regression problems, the model becomes too complex and
tends to overfit. It is therefore necessary to reduce the variance in the
model and prevent it from overfitting. Ridge Regression is one such technique,
which penalizes the size of the coefficients.
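The shrinking effect of the penalty is easiest to see with a single predictor. The sketch below assumes the usual ridge criterion, the sum of squared residuals plus penalty * b^2, with the intercept left unpenalized. Under that assumption, and after centering x and y, the slope has the closed form b = S_xy / (S_xx + penalty). The data and the penalty values are hypothetical.

```python
def ridge_slope(x, y, penalty):
    """Ridge estimate of the slope for one predictor (intercept unpenalized)."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    # Minimizing sum((y - b*x)^2) + penalty * b^2 on centered data
    # gives b = s_xy / (s_xx + penalty).
    return s_xy / (s_xx + penalty)

# Hypothetical data on the line y = 2x.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

slopes = {penalty: ridge_slope(x, y, penalty) for penalty in (0.0, 2.0, 10.0)}
for penalty, b in slopes.items():
    print(f"penalty = {penalty:>4}: slope = {b:.3f}")
```

A penalty of zero reproduces the ordinary least-squares slope; larger penalties pull the slope toward zero, trading a little bias for lower variance.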

6. Lasso Regression: In short, Lasso Regression is similar to Ridge Regression in
its use. The main difference is the penalty: lasso penalizes the absolute values of
the coefficients rather than their squares. The assumptions of Lasso regression
are the same as those of least-squares regression, except that normality is not
assumed. Lasso Regression can shrink some coefficients exactly to zero, which
helps in feature selection.
7. ElasticNet Regression: ElasticNet regression is used when there are many
correlated independent variables, more than one of which is dominant.
ElasticNet Regression is a combination of the Lasso Regression
and Ridge Regression methods.
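The three penalized methods can be compared through their objective functions. The formulation below is the standard textbook one and is offered as a hedged summary; the symbols (coefficients \beta_j, intercept \beta_0, penalty weights \lambda, \lambda_1, \lambda_2, with n observations and p predictors) are assumptions that do not appear in the notes.

```latex
\begin{aligned}
\text{Ridge:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{Lasso:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{ElasticNet:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
    + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
```

The squared penalty shrinks coefficients smoothly toward zero, the absolute-value penalty can set some of them exactly to zero, and ElasticNet applies both penalties at once.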
Multivariate analysis of variance (MANOVA): It is simply an ANOVA (analysis of
variance) with several dependent variables, and is an extension of the ANOVA. In
an ANOVA, we test for statistical differences on one continuous dependent
variable by an independent grouping variable. The MANOVA extends this
analysis by taking multiple continuous dependent variables and bundling them
together into a weighted linear composite variable. The MANOVA tests
whether or not the newly created combination differs across the levels, or
groups, of the independent variable. One can perform this MANOVA test in R
very easily. For example, suppose we conduct an experiment in which we give
two treatments to two groups of rats and record the weight and height of the
rats. In that case, the weight and height of the rats are two dependent variables, and
the hypothesis is that both together are affected by the difference in treatment. A
multivariate analysis of variance could be used to test this hypothesis.
Interpretation of MANOVA: If the global multivariate test is significant, we
conclude that the corresponding effect is significant. In that case, the next
question is whether the treatment affects only the heights, only the weights, or both.
In other words, we want to identify the particular dependent variables that
contributed to the significant global effect. To answer this question, use a one-way
ANOVA to test each dependent variable separately.
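The follow-up step described above, a separate one-way ANOVA for each dependent variable, can be sketched by computing the F statistic directly. The rat measurements are hypothetical numbers invented for illustration and the function name is an assumption. Turning F into a p-value needs the F distribution, which is not in Python's standard library, so the sketch stops at the statistic.

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group variation: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: spread of the observations around their group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for two treatment groups of three rats each.
weights = [[201, 202, 203], [204, 205, 206]]
heights = [[10, 11, 12], [10, 11, 12]]

f_weight = one_way_anova_f(weights)
f_height = one_way_anova_f(heights)
print(f"F (weight) = {f_weight:.2f}, F (height) = {f_height:.2f}")
```

In this made-up example the group means differ for weight but not for height, so only the weight test yields a large F statistic.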
Assumptions of MANOVA: MANOVA can be used under specific conditions such as:
• The dependent variables should be normally distributed within groups.
• Homogeneity of variances across the range of predictors.
• Linearity between all pairs of covariates, all pairs of dependent variables, and
all dependent variable-covariate pairs in every cell.
