
DSE 3 Unit 4

Exploratory data analysis (EDA) is used to discover trends, patterns, and check assumptions in data through statistical summaries and graphical representations. There are four main types of EDA: univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical. Common tools for EDA include R and Python which are used to perform tasks like missing value analysis, clustering, predictive modeling, and multivariate visualization.

DSE-3: Data Science – (Unit-4)

Exploratory Data Analysis (EDA): Exploratory Data Analysis is an approach to
analyzing data using visual techniques. It is used to discover trends and patterns,
or to check assumptions, with the help of statistical summaries and graphical
representations.
Univariate, Bivariate and Multivariate data and its analysis
1. Univariate data: This type of data consists of only one variable. The analysis
of univariate data is the simplest form of analysis, since the information deals
with only one quantity that changes. It does not deal with causes or
relationships; the main purpose of the analysis is to describe the data and
find patterns that exist within it. An example of univariate data is height.
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186
Suppose the heights of seven students in a class are recorded (figure 1). There
is only one variable, height, and it does not deal with any cause or
relationship. Patterns in this type of data can be described using measures of
central tendency (mean, median and mode), measures of dispersion or spread
(range, minimum, maximum, quartiles, variance and standard deviation), and
frequency distribution tables, histograms, pie charts, frequency polygons and
bar charts.
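As a small illustration of these summary measures, the sketch below describes the seven heights listed above using only Python's standard library. The quartiles use the default method of statistics.quantiles; other quartile conventions can give slightly different values.

```python
import statistics

# Heights (in cm) of the seven students, as listed in the notes.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]

# Central tendency.
mean_height = statistics.mean(heights)
median_height = statistics.median(heights)

# Spread: range and quartiles.
height_range = max(heights) - min(heights)
# quantiles(n=4) returns the three quartile cut points (default method).
q1, q2, q3 = statistics.quantiles(heights, n=4)

print(f"mean   = {mean_height:.2f} cm")
print(f"median = {median_height} cm")
print(f"range  = {height_range} cm")
print(f"Q1, Q2, Q3 = {q1}, {q2}, {q3}")
```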
2. Bivariate data: This type of data involves two different variables. The
analysis of this type of data deals with causes and relationships, and it is
done to find out the relationship between the two variables. An example of
bivariate data is temperature and ice cream sales in the summer season.
Temperature (in Celsius)    Ice-cream Sales
20                          2000
25                          2500
35                          5000
73                          7800
Suppose temperature and ice cream sales are the two variables of a bivariate
data set (figure 2). The table shows that the two variables are positively
related: as the temperature increases, the sales also increase. Thus bivariate
data analysis involves comparisons, relationships, causes and explanations.
The two variables are often plotted on the X and Y axes of a graph for a better
understanding of the data; one of the variables is independent while the other
is dependent.
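One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it for the table above, using the values exactly as printed in the notes; the helper name pearson_r is only illustrative.

```python
import math

# Values as printed in the table in the notes.
temperature = [20, 25, 35, 73]
sales = [2000, 2500, 5000, 7800]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(temperature, sales)
print(f"correlation between temperature and sales: {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship. Correlation alone does not establish which variable causes the other.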
3. Multivariate data: When the data involves three or more variables, it is
categorized as multivariate. For example, suppose an advertiser wants to
compare the popularity of four advertisements on a website. Their click rates
could be measured for both men and women, and the relationships between the
variables can then be examined.
It is similar to bivariate analysis but involves more than two variables. The
way the analysis is performed depends on the goals to be achieved.
M K Mishra, Asst. Prof. of Comp. Sc., FMAC, Bls. Page 1 of 8
Types of Exploratory Data Analysis:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
1. Univariate Non-graphical: This is the simplest form of data analysis, as it
uses just one variable to study the data. The standard goal of univariate
non-graphical EDA is to understand the underlying sample distribution and to make
observations about the population. Outlier detection is also part of the
analysis. The characteristics of a population distribution include:
• Central tendency: The central tendency or location of a distribution has to
do with typical or middle values. The commonly used measures of central
tendency are the mean, median, and mode, of which the mean is the most
common. For a skewed distribution, or when there is concern about outliers,
the median may be preferred.
• Spread: Spread is an indicator of how far from the center we are likely to
find the values. The standard deviation and the variance are two useful
measures of spread. The variance is the mean of the squares of the individual
deviations, and the standard deviation is the square root of the variance.
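The sketch below works through these definitions on the heights from the univariate example earlier in these notes. It uses the population form of the variance (dividing by n); the extra value 250 is a hypothetical outlier, not part of the notes, added only to show why the median can be preferred.

```python
import math
import statistics

# Heights (in cm) from the univariate example earlier in these notes.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]

# Population variance: the mean of the squared deviations from the mean.
mean_height = statistics.mean(heights)
variance = sum((h - mean_height) ** 2 for h in heights) / len(heights)
# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

# Hypothetical outlier (not in the notes) to compare mean and median.
with_outlier = heights + [250]
mean_shift = statistics.mean(with_outlier) - mean_height
median_shift = statistics.median(with_outlier) - statistics.median(heights)

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"mean moves by {mean_shift:.2f} cm, median by {median_shift:.2f} cm")
```

The mean moves much more than the median when the outlier is added, which is the reason given above for preferring the median in such cases.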
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are
generally used to show the relationship between two or more variables in the form
of either cross-tabulation or statistics.
• For categorical data, an extension of tabulation called cross-tabulation is
extremely useful. For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one variable
and row headings that match the levels of the other variable, then
filling in the counts of all subjects that share the same pair of levels.
• For one categorical variable and one quantitative variable, we compute
statistics for the quantitative variable separately for each level of the
categorical variable, then compare the statistics across the levels of the
categorical variable.
• Comparing the means is an informal version of one-way ANOVA, and comparing
medians is a robust version of one-way ANOVA.
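A two-way table of counts can be sketched with the standard library alone. The records below are hypothetical (invented for illustration, loosely following the advertisement example earlier in the notes).

```python
from collections import Counter

# Hypothetical records: (gender, advertisement clicked).
records = [
    ("men", "A"), ("men", "B"), ("women", "A"),
    ("women", "A"), ("men", "A"), ("women", "B"),
]

# Count how many subjects share each pair of levels.
counts = Counter(records)
row_levels = sorted({gender for gender, _ in records})
col_levels = sorted({ad for _, ad in records})

# Print the two-way table: rows are genders, columns are advertisements.
print("gender".ljust(8) + "".join(ad.rjust(4) for ad in col_levels))
for gender in row_levels:
    cells = "".join(str(counts[(gender, ad)]).rjust(4) for ad in col_levels)
    print(gender.ljust(8) + cells)
```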
3. Univariate graphical: Non-graphical methods are quantitative and objective,
but they do not give the complete picture of the data; therefore graphical
methods, which involve a degree of subjective analysis, are also required.
Common types of univariate graphics are:
• Histogram: The most basic graph is the histogram, which is a barplot
in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values. Histograms are one of the
simplest ways to quickly learn a lot about your data, including central
tendency, spread, modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf
plot. It shows all data values and the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the
boxplot. Boxplots are excellent at presenting information about central
tendency and show robust measures of location and spread, as well as providing
information about symmetry and outliers, although they can be misleading
about aspects like multimodality. One of the best uses of boxplots is
in the form of side-by-side boxplots.
• Quantile-normal plots: The final univariate graphical EDA technique is
the most intricate. It is called the quantile-normal (QN) plot or, more
generally, the quantile-quantile (QQ) plot. It is used to see how well a
particular sample follows a particular theoretical distribution. It allows
detection of non-normality and diagnosis of skewness and kurtosis.
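The points of a quantile-normal plot can be computed without any plotting library. The sketch below pairs each sorted sample value with the corresponding quantile of a normal distribution fitted to the sample. The plotting positions (i - 0.5)/n are one common convention and are an assumption here; the data are the heights from earlier in these notes.

```python
from statistics import NormalDist, mean, stdev

# Heights (in cm) from the univariate example earlier in these notes.
sample = sorted([164, 167.3, 170, 174.2, 178, 180, 186])
n = len(sample)

# Normal distribution with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(sample), stdev(sample))

# Theoretical quantiles at plotting positions (i - 0.5) / n (an assumed convention).
theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (theoretical, observed) pair is one point of the QN plot.
print("theoretical  observed")
for t, s in zip(theoretical, sample):
    print(f"{t:11.2f}  {s:8.2f}")
```

If the sample is roughly normal, the pairs fall close to a straight line; systematic curvature suggests skewness or heavy tails.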
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. The most commonly used is a
grouped barplot, with each group representing one level of one of the variables
and each bar within a group representing a level of the other variable. Other
common types of multivariate graphics are:
• Scatterplot: For two quantitative variables, the basic graphical EDA
technique is the scatterplot, which has one variable on the x-axis, one on
the y-axis, and a point for each case in the dataset.
• Run chart: A line graph of data plotted over time.
• Heat map: A graphical representation of data where values are depicted
by color.
• Multivariate chart: A graphical representation of the relationships
between factors and a response.
• Bubble chart: A data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
Tools Required for Exploratory Data Analysis: Some of the most
common tools used to perform EDA are:
1. R: An open-source programming language and free software environment for
statistical computing and graphics, supported by the R Foundation for Statistical
Computing. The R language is widely used among statisticians for developing
statistical observations and data analysis.
2. Python: An interpreted, object-oriented programming language with dynamic
semantics. Its high-level, built-in data structures, combined with dynamic binding,
make it very attractive for rapid application development, as well as for use as a
scripting or glue language to connect existing components together. Python is
often used in EDA to identify missing values in a data set, which is an important step.
Apart from the functions described above, EDA techniques can also be used for the following:
• Perform k-means clustering: an unsupervised learning algorithm in which
the data points are assigned to k groups (clusters). K-means clustering is
commonly used in market segmentation, image compression, and pattern
recognition.
• Predictive modeling: EDA is often used with predictive models such as linear
regression, which are used to predict outcomes.
• EDA is also used in univariate, bivariate, and multivariate visualization for
summary statistics, for establishing relationships between variables, and for
understanding how different fields in the data interact with one another.
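To make the k-means idea concrete, here is a minimal one-dimensional sketch in plain Python. The data points, the starting centroids and the fixed number of iterations are all assumptions chosen for illustration; practical implementations usually pick starting centroids randomly and stop when the assignments no longer change.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Very small k-means for one-dimensional data.

    Repeats two steps: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data with two obvious groups, and assumed starting centroids.
points = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(points, centroids=[1, 12])
print(centroids)
print(clusters)
```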



Common multivariate statistical techniques used to visualize high-dimensional
data: Some of the common multivariate statistical techniques used
to visualize high-dimensional data are:
• Regression analysis
• Multivariate analysis of variance (MANOVA)
Regression analysis: In simple words, regression is the statistical technique used
to determine the relationship between a dependent variable and an independent
variable. This relationship is used to fit a line to the data and to forecast the
dependent variable from the independent variable.
Regression has a wide variety of applications. For example, an equation formed
from the known stock prices of the previous 5 years can be used to predict
the future price of the stock.
Types of Regression: There are mainly 7 types of regression.
1. Linear Regression

Linear regression is used to establish a relationship between an independent
and a dependent variable by fitting a straight line to the data. The straight line
obtained from the best fit is called the regression line.

The objective in linear regression is to minimize the distance between the actual
data points and the predicted data points, i.e., to minimize the residuals
and find the best-fitting line.
Representation of Linear regression:

Dependent variable = Intercept + Slope * Independent Variable + Error
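The intercept and slope in this representation can be estimated by ordinary least squares. The sketch below uses the closed-form formulas for a single independent variable and fits them to the temperature and ice-cream sales values printed earlier in the notes; the function name fit_line is only illustrative.

```python
# Values as printed in the bivariate table earlier in the notes.
temperature = [20, 25, 35, 73]
sales = [2000, 2500, 5000, 7800]


def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(temperature, sales)
# Residual (error) for each point: actual value minus predicted value.
residuals = [y - (intercept + slope * x) for x, y in zip(temperature, sales)]

print(f"sales = {intercept:.1f} + {slope:.1f} * temperature")
print("residuals:", [round(r, 1) for r in residuals])
```

The residuals are the "Error" term of the representation above; for a least-squares fit with an intercept they sum to zero.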



2. Logistic Regression

When the dependent variable is discrete, logistic regression is used instead of
linear regression. Logistic regression estimates the parameters of a logistic
model and is a type of binomial regression. It is therefore used to handle data
that has two possible outcomes. The relationship between the outcome and the
predictors is used to predict the probability of an event whose outcome is
binary, that is, either yes or no.

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
Here, p is the probability of the occurrence of the event.
Logistic regression requires a large sample size to draw reliable conclusions.
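The two formulas above, together with the inverse of the second one, can be checked numerically. In this sketch the probability 0.8 is a hypothetical value and the function names are illustrative.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Inverse of log_odds: maps any real number back to a probability."""
    return 1 / (1 + math.exp(-z))


p = 0.8  # hypothetical probability of the event
event_odds = odds(p)
logit = log_odds(p)
recovered = probability_from_log_odds(logit)

print(f"odds = {event_odds:.2f}, ln(odds) = {logit:.3f}, back to p = {recovered:.2f}")
```

Logistic regression models ln(odds) as a linear function of the predictors, so the inverse function is what turns a fitted linear value into a predicted probability.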



3. Polynomial Regression

When the relationship between the dependent and independent variable is nonlinear,
polynomial regression is used. The model is fitted by the least-squares method. In
this type of regression, the power of the independent variable in the equation is
more than one. In short, this type of regression is generally adopted for
curvilinear data.

The equation is of the form: y = a + b*x^2
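A model of this quadratic form (y equal to a plus b times the square of x) is still linear in the coefficients a and b, so it can be fitted by ordinary least squares after squaring x. The sketch below does this on hypothetical data generated from a = 1 and b = 2, so the fit should recover those values.

```python
# Hypothetical curvilinear data generated from y = 1 + 2 * x**2.
xs = [0, 1, 2, 3]
ys = [1, 3, 9, 19]

# Transform the independent variable: regress y on z = x**2.
zs = [x ** 2 for x in xs]

n = len(zs)
mean_z = sum(zs) / n
mean_y = sum(ys) / n
szy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
szz = sum((z - mean_z) ** 2 for z in zs)

b = szy / szz            # coefficient of x**2
a = mean_y - b * mean_z  # intercept

print(f"y = {a:.1f} + {b:.1f} * x^2")
```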


4. Stepwise Regression: This type of regression is used when we deal with
multiple independent variables. Here, the selection of independent variables
is done by an automatic procedure, without human intervention.
The stepwise regression procedure follows one of three approaches:

• Forward selection, which repeatedly adds variables and checks the improvement,
stopping when no further improvement beyond a given threshold is possible.
• Backward elimination, which removes variables one at a time until no more
variables can be removed without a significant loss of fit.
• Bidirectional elimination, which is a combination of the other two approaches.
At each step, a variable is added to or removed from the set of explanatory
variables.
The equation is of the form: y = a + b*x + e
Where ‘e’ is the error term.
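Forward selection can be sketched in plain Python as below. This is a simplified illustration, not a standard implementation: the stopping rule (stop when the best candidate reduces the residual sum of squares by less than 1%) is an assumption, whereas statistical packages typically use F-tests or information criteria. The data and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_selection(features, y, min_improvement=0.01):
    """Add one variable at a time while the fit improves enough (assumed rule)."""
    selected = []
    current = rss([], y)
    while len(selected) < len(features):
        candidates = {
            name: rss([features[s] for s in selected] + [col], y)
            for name, col in features.items() if name not in selected
        }
        best = min(candidates, key=candidates.get)
        if current - candidates[best] <= min_improvement * current:
            break
        selected.append(best)
        current = candidates[best]
    return selected


# Hypothetical data: y depends on x1 only; x2 carries no extra information.
features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, -1, 1, -1, 1, -1],
}
y = [3.1, 6.0, 8.8, 12.0, 15.1, 18.0]

selected = forward_selection(features, y)
print(selected)
```

Backward elimination would start from all variables and remove the least useful one at each step using the same kind of comparison.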



5. Ridge Regression: Ridge regression is a technique for analyzing multiple
regression data that suffer from multicollinearity, that is, data in which the
independent variables are highly correlated. When multicollinearity occurs, the
least-squares (OLS) estimates are unbiased, but their variances are large, so
the estimated values may be far from the true values. By adding a degree of
bias to the regression estimates, ridge regression reduces the standard errors.

Often in regression problems the model becomes too complex and tends to
overfit. It is then necessary to reduce the variance of the model and keep it
from overfitting. Ridge regression is one such technique: it penalizes the
size of the coefficients.
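The shrinking effect of the penalty is easiest to see with a single predictor. In that special case, with an unpenalized intercept, the ridge slope is Sxy / (Sxx + penalty), where Sxy and Sxx are the usual centered sums. The sketch below uses hypothetical data; it illustrates the shrinkage only, not the multicollinearity setting, which needs several predictors.

```python
def ridge_slope(xs, ys, penalty):
    """Ridge slope for one predictor with an unpenalized intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # penalty = 0 gives the ordinary least-squares slope.
    return sxy / (sxx + penalty)


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

for penalty in (0, 10, 30):
    print(f"penalty = {penalty:2d} -> slope = {ridge_slope(xs, ys, penalty):.2f}")
```

The slope shrinks toward zero as the penalty grows but never reaches exactly zero.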

6. Lasso Regression: In short, lasso regression is like ridge regression in
its use. The main difference is the penalty: lasso penalizes the absolute values
of the coefficients rather than their squares. The assumptions of lasso
regression are the same as those of least-squares regression, except that
normality is not assumed. Lasso regression can shrink coefficients exactly to
zero, which helps in feature selection.
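For a single predictor, the lasso solution has a simple closed form known as soft thresholding, which shows how a coefficient can become exactly zero. The sketch assumes the objective "half the sum of squared residuals plus penalty times the absolute slope" with an unpenalized intercept; the data are hypothetical.

```python
def soft_threshold(value, penalty):
    """Move value toward zero by penalty, stopping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_slope(xs, ys, penalty):
    """Lasso slope for one predictor with an unpenalized intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return soft_threshold(sxy, penalty) / sxx


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

for penalty in (0, 5, 25):
    print(f"penalty = {penalty:2d} -> slope = {lasso_slope(xs, ys, penalty):.2f}")
```

With a large enough penalty the slope is exactly zero, which is what allows lasso to drop variables from a model.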
7. ElasticNet Regression: ElasticNet regression is used when there are many
correlated independent variables and more than one of them is important.
ElasticNet regression is a combination of the lasso regression and ridge
regression methods.
Multivariate analysis of variance (MANOVA): It is simply an ANOVA (analysis of
variance) with several dependent variables, and is an extension of the ANOVA. In
an ANOVA, we test for statistical differences in one continuous dependent
variable across the levels of an independent grouping variable. The MANOVA
extends this analysis by taking multiple continuous dependent variables and
bundling them together into a weighted linear composite variable. The MANOVA
tests whether this newly created combination differs across the levels, or
groups, of the independent variable. The MANOVA test can be performed in R
quite easily. For example, suppose we conduct an experiment in which we give
two treatments to two groups of rats and record the weight and height of the
rats. In that case, the weight and height of the rats are the two dependent
variables, and the hypothesis is that both together are affected by the
difference in treatment. A multivariate analysis of variance could be used to
test this hypothesis.
Interpretation of MANOVA: If the global multivariate test is significant, we
conclude that the corresponding effect is significant. The next question is
whether the treatment affects only the heights, only the weights, or both.
In other words, we want to identify the particular dependent variables that
contributed to the significant global effect. To answer this question, use a
one-way ANOVA to test each dependent variable separately.
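The follow-up step can be sketched as a one-way ANOVA F statistic computed separately for each dependent variable. The rat measurements below are hypothetical numbers invented for illustration, and the sketch stops at the F statistic: turning it into a p-value requires the F distribution, which is not included here.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements for two treatment groups of rats.
weights = [[200, 210, 220], [230, 240, 250]]
heights = [[10, 11, 12], [10.5, 11, 11.5]]

f_weight = one_way_f(weights)
f_height = one_way_f(heights)
print(f"F for weight = {f_weight:.2f}, F for height = {f_height:.2f}")
```

In this made-up example the treatment groups differ in weight but not in height, so only weight would be flagged as contributing to a global effect.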
Assumptions of MANOVA: MANOVA can be used under specific conditions, such as:
• The dependent variables should be normally distributed within groups.
• Homogeneity of variances across the range of predictors.
• Linearity between all pairs of covariates, all pairs of dependent variables, and
all dependent variable-covariate pairs in every cell.
