
Data Analysis & Exploratory Data Analysis (EDA)

This document discusses data analysis and exploratory data analysis (EDA). It defines data analysis as using statistics and probability to identify trends in data sets and distinguish real trends from noise. The document outlines some common techniques used in data analysis, including general linear models, generalized linear models, and structural equation modeling. It emphasizes that the correct technique must be used to avoid faulty conclusions. The document also discusses exploratory data analysis and its purpose of gaining initial insights into data.





Contents:


1. Data Analysis
 • Definition
 • Techniques
 • The Two Tools
 • Variation
 • The Three Rules
 • Issues with Data Analysis
2. Exploratory Data Analysis
 • Definition
 • Purpose
 • Types

How is wealth distributed in the United States? Which drugs work to cure cancer? Which stocks should I invest in?
All of these questions can be answered with data analysis.

Data Analysis Definition


Data analysis is basically where you use statistics and probability to figure out trends in a data set. It helps you to sort
out the “real” trends from the statistical noise. What is “noise”? A large amount of data that doesn’t seem to mean
anything at all (sometimes it can be impossible to see the forest for the trees!). If you’ve ever tried to make
sense of the figures and numbers in a copy of the Wall Street Journal, you’ll know what “noise” means.

Data analysis is about picking out trends from sets of data.

Techniques.
The type of data analysis you use depends on what kind of study you’re doing. For example, you would use a
different technique for data gathered from interviews than you would for an analysis of stock market trends. Some
techniques you might use are:

 • General linear model: Useful for assessing how several variables affect a continuous variable.
Example: ANOVA tests.
 • Generalized linear model: Extends the general linear model to outcomes that are not continuous, such as
binary or count variables. Example: logistic regression.
 • Structural equation modelling (SEM): Used for abstract variables like “Soap preference,” “Intelligence,” or
“Future goals.” SEM helps you to figure out if you have a valid model for your data.
 • Item response theory: A way to analyze results from tests, exams, and questionnaires.
It’s vital that you use the right technique; using the wrong one can lead to faulty claims about your data. There are
dozens of examples of faulty claims about data on the internet. Perhaps two of the most famous are the Cold
Fusion debacle and the now infamous data on women’s poor prospects of getting married over age 30.

The Two Tools of Data Analysis.


The two main tools that make up data analysis are lines and tables. For example, you might create a line graph with
a linear regression equation.

A high-leverage outlier. The point pulls the fitted line more strongly because it lies outside the range of the other data.

Or you could make a frequency distribution table to display data.


A frequency chart.
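
As a rough illustration of both tools, the sketch below fits a least-squares line with and without a point that sits far outside the range of the other x values, then builds a small frequency table. All of the numbers and the helper name `fit_line` are made up for this example; they do not come from the figures above.

```python
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: five points that lie exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope_without, intercept_without = fit_line(xs, ys)

# Add one point far outside the x range (a high-leverage point) and refit.
slope_with, intercept_with = fit_line(xs + [20], ys + [10])
print("slope without the outlier:", slope_without)
print("slope with the outlier:   ", slope_with)

# A frequency table: how often each value occurs in a (hypothetical) data set.
scores = [3, 1, 2, 3, 3, 2, 1, 3]
frequency = Counter(scores)
for value in sorted(frequency):
    print(value, frequency[value])
```

A single distant point is enough to flatten the fitted slope noticeably, which is why it is worth checking a plot before trusting a regression equation.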


Variation
If life were simple, we could make a chart or a graph for every situation. But in real life, things are never as simple
as they appear. Take a five-pound bag of sugar. Does it really weigh five pounds? Measure a hundred bags of sugar
and you’ll likely find a hundred different weights, from 5.0 pounds to 5.1 pounds and everything in between. That’s
what we call variation, and variation is one of the reasons we have to use probability distributions to evaluate data.
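
The sugar example can be simulated in a few lines. This is only a sketch: the normal distribution, the 5.05-pound average and the 0.02-pound standard deviation are assumptions chosen so that most weights land between roughly 5.0 and 5.1 pounds.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers on every run

# Assumed for illustration: weights scatter around 5.05 lb with a 0.02 lb standard deviation.
weights = [random.gauss(5.05, 0.02) for _ in range(100)]

lightest = min(weights)
heaviest = max(weights)
sample_variance = statistics.variance(weights)  # spread of the weights, in pounds squared
distinct_weights = len(set(weights))

print(f"lightest bag: {lightest:.3f} lb, heaviest bag: {heaviest:.3f} lb")
print(f"sample variance: {sample_variance:.6f}")
print(f"distinct weights among 100 bags: {distinct_weights}")
```

No single number describes "the" weight of a bag, so the sensible summary is a distribution: a typical value plus a measure of how far individual bags stray from it.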

The Three Rules of Data Analysis.


Using three basic rules of thumb can help you avoid incorrectly making claims about your data:

1. Look at your data and think about what it is you want to know. Do you want to prove that the
Earth is round? Or do you want to prove that the Earth has a circumference? Framing this question is
what we call stating the hypothesis.
2. Estimate a Central Tendency for your Data. Examples of measures of central tendency are
the mean and median. Which one you use will depend on your hypothesis in Step 1. For example, if
you wanted to prove the Earth was round, you might choose to look at the average volume, or the
average circumference.
3. Consider the exceptions to the central tendency. If you’ve measured the average, look at the figures
that are not average. If you’ve measured a median, look at the figures that don’t meet that expectation.
Exceptions can help you spot problems with your conclusion. A simple example: your child’s average
score in school is 70. Not bad, right? But if you look at the exceptions, you might find they are getting
100 in three classes (great!) and 40 in three other classes (uh oh). In this case, the average is
completely misleading.
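
The school-score example from rule 3 can be checked directly. The sketch below uses the six scores from the text; the 20-point threshold for calling a score an "exception" is an arbitrary choice made for this illustration.

```python
import statistics

# The six class scores from the example: three at 100 and three at 40.
scores = [100, 100, 100, 40, 40, 40]

mean_score = statistics.mean(scores)
spread = statistics.pstdev(scores)  # population standard deviation of the six scores

# Rule 3: list the scores that sit far from the central value.
# The 20-point threshold is an arbitrary choice for this sketch.
exceptions = [s for s in scores if abs(s - mean_score) >= 20]

print("mean:", mean_score)
print("standard deviation:", spread)
print("scores far from the mean:", exceptions)
```

Every one of the six scores is an exception here: the average of 70 describes none of the classes, which is exactly what a large spread around the central tendency signals.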

Issues with Data Analysis.


Why do so many cases of data analysis end with faulty claims? One of the main reasons is that analyzing data is a
complicated and tedious process. It’s never as easy as plugging numbers into a computer. Some issues that can lead
to faulty data analysis include:

1. Not having the right analysis skills.
2. Using the wrong tools to analyze data. For example, using a z score when your data doesn’t have
a normal distribution.
3. Letting bias influence your results.
4. Not figuring out statistical significance.
5. Incorrectly stating the null hypothesis and alternate hypothesis.
6. Using misleading graphs and charts.
Unintentional reporting of bad results is usually the result of a lack of proper training. More than one study
has found that physicians were very poorly trained in the proper management of clinical trials.
Physicians were also very poorly trained in reading statistics from good data obtained from valid setups! (See: Even
Physicians Don’t Understand Statistics). Why would highly educated people have so much trouble interpreting data
analysis? Take a very simple example: a word count.
Example problem: You’re given an e-book of Shakespeare’s Romeo and Juliet. Your task is to find out how many
times the word “love” appears in it. Easy, right? You run it through a word count in a word processor and you
report that it’s found 126 times.
Oops. The word “love” is only found 94 times. Why is the word count so wrong? You failed to take into account all
of the other words that contain the letters “love” (32 matches in total, which accounts for the gap between 126 and 94):

 • Loves (2).
 • Loved (3).
 • Loving (6).
 • Love’s (12).
 • Lover (4).
 • Lover’s (3).
 • Lovest (2).
Now imagine if you were analyzing a text on the results from blood analysis to see if a particular cancer drug
worked or not. Perhaps you were looking for a specific chemical to see if it showed up more frequently than another.
Typing in just part of the chemical name could lead you to a (possibly harmful) conclusion.
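
The same mistake is easy to reproduce in code. The sketch below uses a short made-up sentence rather than the text of the play, so its counts apply only to that sentence. It contrasts a naive substring search with a count of whole words.

```python
import re
from collections import Counter

# Hypothetical sample text; the counts below apply to this string only.
text = "Love is love. She loves him; he loved her. A lover's vow, Love's labour."

# Naive search: counts every occurrence of the letters "love",
# including those inside longer words.
substring_hits = text.lower().count("love")

# Whole-word search: split the text into words first, then count exact matches.
words = Counter(re.findall(r"[a-z']+", text.lower()))
whole_word_hits = words["love"]

# The words that merely contain "love" account for the difference.
lookalikes = {w: n for w, n in words.items() if "love" in w and w != "love"}

print("substring matches:", substring_hits)
print("whole-word matches:", whole_word_hits)
print("other words containing 'love':", lookalikes)
```

Deciding up front what counts as a match (here, whether “loves” or “love’s” should count as “love”) is part of stating the question properly; the tool will happily answer a different question if you let it.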

Introduction to Statistical Data Analysis

By Neelam Tyagi | Oct 29, 2020 | Statistics
“The number of people who think they understand statistics dangerously
dwarfs those who actually do, and maths can cause fundamental problems
when badly used.”― Rory Sutherland

 
In the information era, data is no longer scarce; on the contrary, it is
overwhelming. From sifting through the sheer quantity of data to interpreting
its complexity precisely in order to provide insights that drive progress for
organizations and businesses, all sorts of data and information are exploited
in their entirety, and this is where statistical data analysis plays a significant part.
 
“Statistics is the branch of science in which professionals can draw
distinct conclusions or inferences from the same data.”
 
Taking the discussion a step further, we shall look at the notion of
statistical data analysis and its types. After that, the four basic steps
required to complete a statistical data analysis will be explained.
 
 
What is Statistical Data Analysis?
 
As a branch of science, statistics covers data acquisition, data
interpretation, and data validation. Statistical data analysis is the
practice of carrying out statistical operations on data: quantitative
research that attempts to quantify data and applies some form of statistical
analysis. Here, quantitative data typically includes descriptive data such as
survey data and observational data.
 
In the context of business applications, it is a crucial technique for
business intelligence organizations that need to operate with large data
volumes. The basic goal of statistical data analysis is to identify trends. For
example, in the retail business, it can be applied to uncover patterns in
unstructured and semi-structured consumer data that can be used to make
better decisions for enhancing customer experience and increasing sales.
 
Apart from that, statistical data analysis has various applications in
market research, business intelligence (BI), data analytics in big data,
machine learning and deep learning, and financial and economic analysis.
 
In addition, the following points describe the role of data in statistical data analysis:
 
1. Data comprises variables and is univariate or multivariate depending on
the number of variables; the experts choose statistical techniques
accordingly. If the data has a single variable, univariate statistical data
analysis can be conducted, including the t-test for significance, z-test,
F-test, one-way ANOVA, etc. If the data has many variables, multivariate
techniques such as discriminant analysis can be performed instead.
2. Data is of two types, continuous data and discrete data. Continuous
data is measured rather than counted and can take any value within a
range, e.g. the intensity of light, the temperature of a room, etc. Discrete
data can be counted and takes only certain distinct values, e.g. the
number of bulbs, the number of people in a group, etc.
3. Under statistical data analysis, continuous data follows a continuous
distribution, described by a probability density function, and discrete
data follows a discrete distribution, described by a probability mass
function.
4. Data can either be quantitative or qualitative. Qualitative data are labels
or names used to identify a characteristic of each element, whereas
quantitative data are always in the form of numbers that indicate either
how much or how many.
5. Under statistical data analysis, cross-sectional and time-series data are
important. Cross-sectional data are data collected at the same, or
approximately the same, point in time, whereas time-series data are data
gathered over several time periods.
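
The difference between a probability density function and a probability mass function can be sketched with the standard library alone. The two models below (a normally distributed room temperature and a binomial count of working bulbs) and all of their parameters are assumptions picked for illustration; `binomial_pmf` is a helper written for this example.

```python
from math import comb
from statistics import NormalDist

# Continuous example (assumed model): room temperature ~ Normal(mean 21, sd 2).
temperature = NormalDist(mu=21.0, sigma=2.0)
density_at_21 = temperature.pdf(21.0)  # a density, not a probability
prob_20_to_22 = temperature.cdf(22.0) - temperature.cdf(20.0)  # area under the density


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Discrete example (assumed model): number of working bulbs out of 10,
# where each bulb works with probability 0.9.
bulb_pmf = [binomial_pmf(k, 10, 0.9) for k in range(11)]

print(f"density at 21 degrees: {density_at_21:.4f}")
print(f"P(20 <= temperature <= 22): {prob_20_to_22:.4f}")
print(f"P(exactly 10 bulbs work): {bulb_pmf[10]:.4f}")
print(f"sum of all pmf values: {sum(bulb_pmf):.4f}")
```

For continuous data a probability only makes sense over an interval, so it is obtained as an area under the density. For discrete data each possible count has its own probability, and those probabilities add up to 1.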
 
Statistical data analysis can be used to:

 
 • Present the essential findings revealed by a dataset.
 • Summarize and compile information.
 • Compute measures of cohesiveness, relevance, or diversity in data.
 • Make predictions on the basis of previously recorded data.
 • Test experimental predictions.
 
Statistical Data Analysis Tools
 
Generally, statistical data analysis relies on statistical analysis tools
that are difficult to use without statistical knowledge. Various software
programs are available to perform statistical data analysis, including
Statistical Analysis System (SAS), Statistical Package for the Social
Sciences (SPSS), StatSoft, and many more.
 
 “Machine learning, in the simplest terms, is the analysis of statistics to help
computers make decisions based on repeatable characteristics found in the
data.”― Vardhan Kishore Agrawal

 
These tools offer extensive data-handling capabilities and a range of
statistical analysis methods that can examine anything from a small chunk of
data to very comprehensive data sets. Though computers are an important
factor in statistical data analysis and can assist in the summarization of
data, statistical data analysis concentrates on the interpretation of the
results in order to draw inferences and make predictions.

 
What are the Types of Statistical Data Analysis?
 
There are two important components of a statistical study:
 
 • Population - the collection of all elements of interest in a study, and
 • Sample - a subset of the population.
 
There are also two widely used categories of statistical methods in
statistical data analysis:
 
1. Descriptive Statistics 
It is a form of data analysis that is used to describe, show or
summarize data from a sample in a meaningful way, for example through
the mean, median, standard deviation and variance. In other words,
descriptive statistics describes the variables in a sample or population,
and the relationships between them, and gives a summary in the form of
measures such as the mean, median and mode.
 
2. Inferential Statistics 
This method is used for drawing conclusions from a data sample by
testing null and alternative hypotheses in the presence of random
variation. Probability distributions, correlation testing and regression
analysis also fall into this category. In simple words, inferential statistics
uses a random sample of data, taken from a population, to make and
explain inferences about the whole population.
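
To make the contrast between the two categories concrete, the sketch below first summarizes two samples (descriptive) and then asks whether the gap between their means could plausibly be random variation (inferential). The data are made up, and the permutation test is just one simple way to test a null hypothesis of "no difference"; it is not a method named in the text, and `describe` and `permutation_p_value` are helper names invented here.

```python
import random
from statistics import mean, median, stdev

# Hypothetical measurements for two samples (made up for illustration).
treated = [58, 61, 63, 60, 62, 59, 64, 61]
control = [51, 49, 53, 50, 52, 48, 54, 50]


def describe(sample):
    """Descriptive statistics: summarize the sample itself."""
    return {"mean": mean(sample), "median": median(sample), "stdev": stdev(sample)}


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Inferential statistics: estimate how often a gap at least as large as
    the observed one appears when group labels are shuffled at random
    (the null hypothesis of no difference between the groups)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        first, second = pooled[:n_a], pooled[n_a:]
        gap = abs(sum(first) / len(first) - sum(second) / len(second))
        if gap >= observed:
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


print("treated:", describe(treated))
print("control:", describe(control))
p_value = permutation_p_value(treated, control)
print("estimated p-value:", p_value)
```

The summaries only describe the sixteen numbers at hand. The p-value is a statement about what would be expected beyond them, which is why it carries uncertainty that the summaries do not.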
 
The table below shows the differences between descriptive statistics
and inferential statistics:

| S.No | Descriptive Statistics | Inferential Statistics |
| 1 | Concerned with describing the target population. | Makes inferences from the sample and generalizes them to the population. |
| 2 | Arranges, analyzes and presents the data in a meaningful way. | Correlates, tests and anticipates future outcomes. |
| 3 | Final outcomes are represented in the form of charts, tables and graphs. | Final outcomes are probability scores. |
| 4 | Explains data that is already known. | Attempts to reach conclusions about the population that go beyond the data available. |
| 5 | Tools used: measures of central tendency (mean, median, mode) and spread of data (range, standard deviation, etc.). | Tools used: hypothesis testing, analysis of variance, etc. |

Table: Difference between Descriptive Statistics and Inferential Statistics

 
4 Basic Steps for Statistical Data Analysis
 
Analyzing any problem with statistical data analysis comprises four basic
steps:
 
1. Defining the problem
 
A precise and accurate definition of the problem is essential for obtaining
accurate data about it. It becomes extremely difficult to collect data
without knowing exactly what problem is being addressed.
 
2. Accumulating the data
 
After defining the specific problem, designing ways to collect the data is
an important task in statistical data analysis. Data can be collected from
existing sources or obtained through observational and experimental
research studies conducted to get new data.
 
 • In an experimental study, the important variable is identified according
to the defined problem, then one or more elements in the study are
controlled in order to obtain data on how these elements affect other
variables.
 • In an observational study, no attempt is made to control or influence
the important variable. For example, a survey is a common type of
observational study.
 
3. Analyzing the data
 
Under statistical data analysis, the methods of analysis are divided into two
categories:
 
 • Exploratory methods are used to determine what the data is revealing
by using simple arithmetic and easy-to-draw graphs or descriptions in
order to summarize the data.
 • Confirmatory methods adopt concepts and ideas from probability
theory to try to answer particular questions.
 
Probability is essential in decision-making, as it gives a procedure for
estimating, representing, and explaining the possibilities associated with
future events.
 
4. Reporting the outcomes
 
Through inference, an estimate or test of a characteristic of a population
can be derived from a sample. These results can be reported in the form of a
table, a graph or a set of percentages. Since only a small portion of the data
has been investigated, the reported result should convey its uncertainty
through probability statements and intervals of values.
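
One common way to report an estimate together with its uncertainty is a confidence interval for the mean. The sketch below uses made-up measurements and a normal approximation; both are assumptions for illustration, and `mean_confidence_interval` is a helper written for this example. For a sample this small, an interval based on the t-distribution would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical sample of ten measurements.
sample = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.0]

low_95, high_95 = mean_confidence_interval(sample, confidence=0.95)
low_99, high_99 = mean_confidence_interval(sample, confidence=0.99)

print(f"sample mean: {mean(sample):.3f}")
print(f"95% interval: {low_95:.3f} to {high_95:.3f}")
print(f"99% interval: {low_99:.3f} to {high_99:.3f}")
```

Reporting the interval alongside the point estimate tells the reader how much the result could move if a different sample had been drawn; asking for more confidence widens the interval.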
 
With the help of statistical data analysis, experts can forecast and anticipate
future outcomes from data. Understanding the information available and
using it effectively can lead to sound decision-making.

 
Conclusion
 
Statistical data analysis gives meaning to otherwise meaningless numbers,
thereby giving life to lifeless data. It is therefore imperative for a researcher
to have adequate knowledge of statistics and statistical methods in order to
perform any research study. This helps in conducting an appropriate and
well-designed study that leads to accurate and reliable results. Results and
inferences are valid only if proper statistical tests are used.
 
“Regression analysis is the hydrogen bomb of the statistics
arsenal.”― Charles Wheelan

 
To conclude, statistical data analysis is the compilation and interpretation
of data in order to reveal hidden patterns and trends. It can be applied in
situations such as compiling research analyses, statistical modelling, or
designing surveys and studies.
