Data Analysis & Exploratory Data Analysis (EDA)
How is wealth distributed in the United States? Which drugs work to cure cancer? Which stocks should I invest in?
All of these questions can be answered with data analysis.
Techniques.
The type of data analysis you use depends on what kind of study you’re doing. For example, you would use a
different technique for data gathered from interviews than you would for an analysis of stock market trends. Some
techniques you might use are:
General linear model: Useful for assessing how several variables affect continuous variables.
Example: ANOVA tests.
Generalized linear model: Used for discrete variables. Example: logistic regression.
Structural equation modelling: Used for abstract variables like “Soap preference,” “Intelligence,” or
“Future goals.” SEM helps you to figure out if you have a valid model for your data.
Item response theory: A way to analyze results from tests, exams, and questionnaires.
It’s vital that you use the right technique; using the wrong one can lead to faulty claims about your data. There are
dozens of examples of faulty claims about data on the internet. Perhaps two of the most famous are the Cold
Fusion debacle and the now infamous data on women’s poor prospects of getting married over age 30.
A high-leverage outlier. The point has moved the graph more because it is outside the range.
Variation
If life were simple, we could make a chart or a graph for every situation. But in real life, things are never as simple
as they appear. Take a five-pound bag of sugar. Does it really weigh five pounds? Measure a hundred bags of sugar
and you’ll likely find a hundred different weights, from 5.0 pounds to 5.1 pounds and everything in between. That’s
what we call variance, and variance is one of the reasons we have to use probability distributions to evaluate data.
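As a rough sketch of how this spread can be measured, the Python snippet below computes the mean, variance, and standard deviation of a handful of bag weights. The weights are invented for illustration (they are not measurements), and the variable names are assumptions of this example.

```python
import statistics

# Hypothetical weights (in pounds) of ten bags of sugar.
# These values are made up for illustration only.
weights = [5.02, 5.07, 5.00, 5.09, 5.04, 5.01, 5.10, 5.06, 5.03, 5.08]

mean_weight = statistics.mean(weights)
# Sample variance: average squared distance from the mean (n - 1 denominator).
variance = statistics.variance(weights)
# Standard deviation: square root of the variance, in the original units.
std_dev = statistics.stdev(weights)

print(f"mean     = {mean_weight:.3f} lb")
print(f"variance = {variance:.5f} lb^2")
print(f"std dev  = {std_dev:.3f} lb")
```

The standard deviation is usually the easier number to read, because it is expressed in pounds rather than pounds squared.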
1. Look at your data and think about what it is you want to know. Do you want to prove that the
Earth is round? Or do you want to prove that the Earth has a circumference? Framing this question is
what we call stating the hypothesis.
2. Estimate a Central Tendency for your Data. Examples of measures of central tendency are
the mean and median. Which one you use will depend on your hypothesis in Step 1. For example, if
you wanted to prove the Earth was round, you might choose to look at the average volume, or the
average circumference.
3. Consider the exceptions to the central tendency. If you’ve measured the average, look at the figures
that are not average. If you’ve measured a median, look at the figures that don’t meet that expectation.
Exceptions can help you spot problems with your conclusion. A simple example: your child’s average
score in school is 70. Not bad, right? But if you look at the exceptions, you might find they are getting
100 in three classes (great!) and 40 in three other classes (uh oh). In this case, the average is
completely misleading.
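The school-score example can be checked with a few lines of Python. This is only a sketch: the scores come from the example above, while the 20-point cutoff used to flag "exceptions" is an arbitrary assumption chosen for illustration.

```python
import statistics

# Scores from the example: 100 in three classes, 40 in three others.
scores = [100, 100, 100, 40, 40, 40]

average = statistics.mean(scores)
# Population standard deviation: how far scores typically sit from the average.
spread = statistics.pstdev(scores)

# "Exceptions" here: scores more than 20 points away from the average.
# The 20-point threshold is an arbitrary choice for this sketch.
exceptions = [s for s in scores if abs(s - average) > 20]

print(f"average = {average}")
print(f"spread  = {spread}")
print(f"scores far from the average: {exceptions}")
```

Every single score ends up in the exceptions list, which is another way of saying that nobody actually scored anywhere near the average.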
Loves (2).
Loved (3).
Loving (6).
Love’s (12).
Lover (4).
Lover’s (3).
Lovest (2).
Now imagine if you were analyzing a text on the results from blood analysis to see if a particular cancer drug
worked or not. Perhaps you were looking for a specific chemical to see if it showed up more frequently than another.
Typing in just part of the chemical name could lead you to a (possibly harmful) conclusion.
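The difference between searching for part of a word and searching for the whole word can be sketched in Python as follows. The sample sentence is made up for this example, and the regular expressions are one possible way to do the counting, not the only one.

```python
import re

# Short made-up text for illustration; not taken from any real document.
text = "Glove makers love gloves. The lover loved loving words. I love Clover."

# Naive substring search: counts "love" anywhere, even inside other words
# such as "Glove", "lover", or "Clover".
substring_hits = len(re.findall(r"love", text, flags=re.IGNORECASE))

# Whole-word search: \b marks a word boundary on both sides of "love".
whole_word_hits = len(re.findall(r"\blove\b", text, flags=re.IGNORECASE))

print(f"substring matches:  {substring_hits}")
print(f"whole-word matches: {whole_word_hits}")
```

The substring search reports far more matches than the whole-word search, so any frequency claim built on it would be inflated.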
In the information era, data is no longer scarce; on the contrary, it is
overwhelming. Organizations and businesses must dig through enormous
quantities of data and interpret its complexity accurately to gain insights
that drive growth. All sorts of data and information are exploited to the
full, and this is where statistical data analysis plays a significant part.
“Statistics is the specific branch of science from where the professionals
bring distinct conclusions/inferences under the same data”
Taking the discussion a step further, we shall cover the general notion of
statistical data analysis and its types, and then explain the four basic
steps required to complete a statistical data analysis.
What is Statistical Data Analysis?
As a branch of science, statistics covers data acquisition, data
interpretation, and data validation. Statistical data analysis is the
practice of performing statistical operations on data, i.e. thorough
quantitative research that attempts to quantify data and applies some form
of statistical analysis. Here, quantitative data typically includes
descriptive data such as survey data and observational data.
In business applications, it is a crucial technique for business
intelligence organizations that need to work with large data volumes. The
basic goal of statistical data analysis is to identify trends. In retail,
for example, it can be used to uncover patterns in unstructured and
semi-structured consumer data, which in turn support better decisions for
improving customer experience and increasing sales.
Apart from that, statistical data analysis is applied in market research,
business intelligence (BI), data analytics in big data, machine learning and
deep learning, and financial and economic analysis.
In addition, the following properties of data matter in statistical data analysis:
1. Data comprises variables and is either univariate or multivariate.
Depending on the number of variables, experts apply different statistical
techniques. If the data has a single variable, univariate statistical data
analysis can be conducted, including the t-test for significance, z-test,
F-test, one-way ANOVA, etc. If the data has many variables, multivariate
techniques such as discriminant analysis can be performed.
2. Data is of two types, continuous and discrete. Continuous data is
measured rather than counted and can take any value within a range, e.g.
the intensity of light, the temperature of a room, etc. Discrete data can
be counted and takes only certain distinct values, e.g. the number of
bulbs, the number of people in a group, etc.
3. Under statistical data analysis, continuous data is described by a
continuous distribution function, also known as the probability density
function, and discrete data is described by a discrete distribution
function, also termed the probability mass function.
4. Data can be either quantitative or qualitative. Qualitative data are
labels or names used to identify a characteristic of each element, whereas
quantitative data are always numbers that indicate either how much or how
many.
5. Under statistical data analysis, cross-sectional and time-series data are
both important. Cross-sectional data are collected at the same, or
approximately the same, point in time, whereas time-series data are
gathered over successive time periods.
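To make the distinction in point 3 concrete, here is a minimal Python sketch contrasting a probability mass function with a probability density function. The choice of a binomial and a normal distribution, and all the parameter values (bulb defect rate, room temperature), are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Discrete case. Hypothetical example: number of defective bulbs in a box
# of 10, assuming each bulb is defective with probability 0.1.
n, p = 10, 0.1
pmf_values = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(f"P(exactly 2 defective) = {pmf_values[2]:.4f}")
print(f"sum of all PMF values  = {sum(pmf_values):.4f}")

# Continuous case. Hypothetical room temperature: mean 21 degrees C,
# standard deviation 1.5 degrees C.
temperature = NormalDist(mu=21.0, sigma=1.5)
# A density is not a probability; probabilities come from areas under the
# curve, i.e. differences of the cumulative distribution function.
prob_between = temperature.cdf(22.0) - temperature.cdf(20.0)
print(f"density at 21.0        = {temperature.pdf(21.0):.4f}")
print(f"P(20 <= T <= 22)       = {prob_between:.4f}")
```

A mass function gives a probability for each countable value, and those probabilities add up to one. A density function only yields probabilities over intervals.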
Statistical data analysis can be used to:
Present the key findings revealed by a dataset.
Summarize and compile information.
Compute measures of cohesiveness, relevance, or diversity in data.
Make predictions about the future on the basis of previously recorded data.
Test experimental predictions.
Statistical Data Analysis Tools
Generally, statistical data analysis relies on statistical analysis tools
that a layperson cannot use without some statistical knowledge. Various
software programs are available for statistical data analysis, including
Statistical Analysis System (SAS), Statistical Package for the Social
Sciences (SPSS), StatSoft, and many more.
“Machine learning, in the simplest terms, is the analysis of statistics to help
computers make decisions based on repeatable characteristics found in the
data.”― Vardhan Kishore Agrawal
These tools offer extensive data-handling capabilities and a range of
statistical analysis methods that can examine anything from a small chunk of
data to very comprehensive datasets. Although computers play an important
part in statistical data analysis and can assist in summarizing data,
statistical data analysis concentrates on interpreting the results in order
to draw inferences and make predictions.
What are the Types of Statistical Data Analysis?
There are two important components of a statistical study:
Population - the collection of all elements of interest in a study, and
Sample - a subset of the population.
There are also two categories of widely used statistical methods in
statistical data analysis:
1. Descriptive Statistics
It is a form of data analysis used to describe, show, or summarize data
from a sample in a meaningful way, for example with the mean, median,
standard deviation, and variance. In other words, descriptive statistics
summarizes the variables in a sample or population through measures such
as the mean, median, and mode.
2. Inferential Statistics
This method is used to draw conclusions from a data sample by testing
null and alternative hypotheses in the presence of random variation.
Probability distributions, correlation testing, and regression analysis
also fall into this category. In simple words, inferential statistics
uses a random sample of data, taken from a population, to make and
explain inferences about the whole population.
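The two categories can be sketched side by side in Python. The sample below is invented for illustration, the hypothesized mean of 48 is an arbitrary assumption, and the test uses a normal approximation only so that the example stays within the standard library.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: daily sales (units) observed on 12 days.
# The numbers are invented for illustration.
sample = [52, 48, 55, 50, 47, 53, 49, 51, 54, 50, 46, 57]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)
print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")

# Inferential statistics: use the sample to say something about the population.
# Null hypothesis: the population mean is 48. Alternative: it is not 48.
hypothesized_mean = 48
standard_error = stdev / math.sqrt(len(sample))
z = (mean - hypothesized_mean) / standard_error
# Two-sided p-value from the standard normal distribution. With only 12
# observations a t distribution would be more appropriate; the normal
# approximation is a simplification made for this sketch.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z={z:.2f} p-value={p_value:.4f}")
```

The first half only describes the twelve observed days. The second half makes a claim about all days, which is why it comes with a p-value rather than a plain number.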
The table below shows the factual differences between descriptive statistics
and inferential statistics;
S.No | Descriptive Statistics | Inferential Statistics
4 | Explains the earlier acknowledged data. | Attempts in making conclusions regarding the population which is beyond the data available.
5 | Deployed tools: measures of central tendency (mean, median, mode), spread of data (range, standard deviation, etc.) | Deployed tools: hypothesis testing, analysis of variance, etc.
4 Basic Steps for Statistical Data Analysis
Analyzing any problem with statistical data analysis involves four basic
steps:
1. Defining the problem
A precise and accurate definition of the problem is imperative for obtaining
accurate data about it. It becomes extremely difficult to collect data
without knowing exactly what the problem is.
2. Accumulating the data
Once the specific problem has been defined, devising ways to collect the
data is an important task in statistical data analysis. Data can be
collected from existing sources or obtained through observational and
experimental research studies conducted to get new data.
In an experimental study, the variable of interest is identified according
to the defined problem, then one or more elements in the study are
controlled to obtain data on how these elements affect other variables.
In an observational study, no attempt is made to control or influence
the variable of interest. A survey, for example, is a common type of
observational study.
3. Analyzing the data
Under statistical data analysis, the methods of analysis fall into two
categories:
Exploratory methods determine what the data is revealing, using simple
arithmetic and easy-to-draw graphs or descriptions to summarize the data.
Confirmatory methods adopt concepts and ideas from probability theory to
try to answer particular questions.
Probability is extremely important in decision-making as it gives a procedure
for estimating, representing, and explaining the chances associated with
future events.
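The exploratory side of this step, simple arithmetic plus a quick picture, might look like the following Python sketch. The survey responses are made up for illustration, and the text histogram is just one convenient stand-in for a graph.

```python
import statistics
from collections import Counter

# Hypothetical survey responses: hours of sleep reported by 20 people.
hours = [7, 6, 8, 7, 5, 7, 9, 6, 7, 8, 6, 7, 4, 8, 7, 6, 10, 7, 8, 6]

# Simple arithmetic: smallest value, quartiles, and largest value.
q1, q2, q3 = statistics.quantiles(hours, n=4)
print(f"min={min(hours)} Q1={q1} median={q2} Q3={q3} max={max(hours)}")

# A quick text histogram: one '#' per observation.
# Counter returns 0 for values that never occur, so gaps print as empty rows.
counts = Counter(hours)
for value in range(min(hours), max(hours) + 1):
    print(f"{value:>2} | {'#' * counts[value]}")
```

A summary like this does not answer a specific question by itself, but it shows where the data clusters and which values stand apart before any confirmatory test is chosen.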
4. Reporting the outcomes
Through inference, an estimate or test of a population characteristic can be
derived from a sample. These results can be reported in the form of a table,
a graph, or a set of percentages. Since only a small portion of the data has
been investigated, the reported result should convey the uncertainty involved
by using probability statements and intervals of values.
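As one possible way to report a percentage together with its uncertainty, the Python sketch below computes a normal-approximation (Wald) confidence interval for a proportion. The function name `proportion_interval` and the survey figures are hypothetical, introduced only for this example.

```python
import math
from statistics import NormalDist


def proportion_interval(successes: int, n: int, confidence: float = 0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # z is the standard normal quantile for the chosen confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical survey: 130 of 200 sampled customers said they were satisfied.
p_hat, low, high = proportion_interval(130, 200)
print(f"Satisfied: {p_hat:.1%} (95% interval: {low:.1%} to {high:.1%})")
```

The Wald interval is simple but can behave poorly for small samples or proportions close to 0 or 1, so other interval methods may be preferable in those cases.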
With the help of statistical data analysis, experts can forecast and anticipate
future developments from data. Understanding the available information and
using it effectively can lead to sound decision-making.
Conclusion
Statistical data analysis makes sense of meaningless numbers, thereby
giving life to lifeless data. It is therefore imperative for a researcher
to have adequate knowledge of statistics and statistical methods to
perform any research study. This helps in conducting an appropriate and
well-designed study that leads to accurate and reliable results. Also, results
and inferences are sound only if proper statistical tests are used.
“Regression analysis is the hydrogen bomb of the statistics
arsenal.”― Charles Wheelan
In conclusion, statistical data analysis is the compilation and
interpretation of data in order to reveal hidden patterns and trends. It can
be applied to situations such as compiling research analyses, statistical
modelling, or designing surveys and studies.