
Data analysis

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.[1]

Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.[2] In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.

The process of data analysis

Figure: Data science process flowchart from Doing Data Science, by Schutt & O'Neil (2013).

Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Data are collected and analyzed to answer questions, test hypotheses or disprove theories.[3]

Statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."[4]
There are several phases that can be
distinguished, described below. The
phases are iterative, in that feedback from
later phases may result in additional work
in earlier phases.[5] The CRISP framework
used in data mining has similar steps.

Data requirements

The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or customers (who will use the finished product of the analysis). The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained. Data may be numerical or categorical (i.e., a text label for numbers).[5]

Data collection

Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews, downloads from online sources, or reading documentation.[5]

Data processing

Figure: The phases of the intelligence cycle used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.

Data initially obtained must be processed or organised for analysis. For instance, this may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software.[5]
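As a minimal illustration of this step, raw records might be loaded into a tabular structure with the pandas library (the file name survey.csv and the column names are hypothetical):

```python
import pandas as pd

# Hypothetical example: read raw records into a structured table
# (rows = observations, columns = variables).
df = pd.read_csv("survey.csv")                                 # assumed input file
df = df.rename(columns=str.lower)                              # normalise column names
df["income"] = pd.to_numeric(df["income"], errors="coerce")    # coerce text to numbers
print(df.head())                                               # inspect the first rows
```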

Data cleaning

Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracies, assessing the overall quality of existing data,[6] deduplication, and column segmentation.[7] Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.[8] Unusual amounts above or below pre-determined thresholds may also be reviewed. There are several types of data cleaning that depend on the type of data, such as phone numbers, email addresses, employers, etc. Quantitative data methods for outlier detection can be used to identify and remove data that were likely entered incorrectly. Textual data spell checkers can be used to reduce the number of mistyped words, but it is harder to tell if the words themselves are correct.[9]
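A minimal sketch of two common cleaning steps, removing duplicate records and reviewing amounts outside pre-determined thresholds, assuming pandas and made-up figures:

```python
import pandas as pd

# Assumed input: records with possible duplicates and a numeric "amount" column.
df = pd.DataFrame({"id": [1, 1, 2, 3, 4],
                   "amount": [10.0, 10.0, 12.5, 11.8, 950.0]})

df = df.drop_duplicates()                        # remove exact duplicate records

# Review unusual amounts above or below pre-determined thresholds.
lower, upper = 0.0, 100.0                        # thresholds chosen for illustration
suspects = df[(df["amount"] < lower) | (df["amount"] > upper)]
print(suspects)                                  # rows flagged for manual review
```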

Exploratory data analysis

Once the data are cleaned, they can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data.[10][11] The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. Descriptive statistics, such as the average or median, may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.[5]
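For example, a first exploratory pass might compute descriptive statistics and a histogram with pandas (made-up sales figures; plotting assumes matplotlib is installed):

```python
import pandas as pd

# Hypothetical example data: monthly sales figures.
df = pd.DataFrame({"sales": [120, 135, 128, 160, 155, 170, 310]})

print(df["sales"].describe())      # count, mean, std, min, quartiles, max
print(df["sales"].median())        # median as a robust measure of centre

# A histogram gives a first look at the distribution (requires matplotlib).
df["sales"].plot(kind="hist", bins=5, title="Distribution of sales")
```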

Modeling and algorithms

Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).[3]

Inferential statistics includes techniques to measure relationships between particular variables. For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y = aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X. Analysts may attempt to build models that are descriptive of the data to simplify analysis and communicate results.[3]
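A minimal sketch of fitting such a model by least squares with NumPy, using made-up advertising (X) and sales (Y) figures:

```python
import numpy as np

# Hypothetical data: advertising spend (X) and sales (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

# Fit Y = a*X + b by least squares; the residuals are the "error" term.
a, b = np.polyfit(X, Y, deg=1)
residuals = Y - (a * X + b)
print(f"a={a:.2f}, b={b:.2f}, residual variance={residuals.var():.3f}")
```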

Data product

A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.[5]

Communication

Figure: Data visualization to understand the results of a data analysis.[12]

Once the data are analyzed, they may be reported in many formats to the users of the analysis to support their requirements. The users may have feedback, which results in additional analysis. As such, much of the analytical cycle is iterative.[5]

When determining how to communicate the results, the analyst may consider data visualization techniques to help clearly and efficiently communicate the message to the audience. Data visualization uses information displays (such as tables and charts) to help communicate key messages contained in the data. Tables are helpful to a user who might look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.

Quantitative messages

Figure: A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.

Figure: A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.

Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message. Customers specifying requirements and analysts performing the data analysis may consider these messages during the course of the process.

1. Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
2. Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
3. Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
4. Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show the comparison of the actual versus the reference amount.
5. Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0–10%, 11–20%, etc. A histogram, a type of bar chart, may be used for this analysis.
6. Correlation: Comparison between observations represented by two variables (X, Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
7. Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
8. Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.[13][14]
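As a small illustration with hypothetical numbers and matplotlib, a time-series message maps naturally to a line chart and a ranking message to a bar chart:

```python
import matplotlib.pyplot as plt

# Hypothetical unemployment rate over ten years: a time-series message -> line chart.
years = list(range(2010, 2020))
rate = [9.6, 8.9, 8.1, 7.4, 6.2, 5.3, 4.9, 4.4, 3.9, 3.7]

# Hypothetical sales by salesperson: a ranking message -> bar chart.
people = ["Ann", "Bob", "Cho"]
sales = [120, 95, 87]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(years, rate)                 # line chart shows the trend over time
ax1.set_title("Unemployment rate")
ax2.bar(people, sales)                # bar chart compares categorical subdivisions
ax2.set_title("Sales by person")
plt.tight_layout()
plt.show()
```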

Techniques for analyzing quantitative data
Author Jonathan Koomey has
recommended a series of best practices
for understanding quantitative data. These
include:

Check raw data for anomalies prior to performing your analysis;
Re-perform important calculations, such as verifying columns of data that are formula driven;
Confirm main totals are the sum of subtotals;
Check relationships between numbers that should be related in a predictable way, such as ratios over time;
Normalize numbers to make comparisons easier, such as analyzing amounts per person or relative to GDP or as an index value relative to a base year;
Break problems into component parts by analyzing factors that led to the results, such as DuPont analysis of return on equity.[8]
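A minimal sketch of two of these checks, confirming that a main total equals the sum of its subtotals and normalizing amounts per person, using made-up regional figures:

```python
# Hypothetical regional figures used only for illustration.
subtotals = {"North": 1200.0, "South": 950.0, "West": 1850.0}
reported_total = 4000.0
population = {"North": 3.0, "South": 2.5, "West": 4.2}   # millions of people

# Confirm the main total is the sum of subtotals (allowing for rounding).
assert abs(sum(subtotals.values()) - reported_total) < 1e-6, "total does not match subtotals"

# Normalize to per-person amounts to make regions comparable.
per_capita = {region: subtotals[region] / population[region] for region in subtotals}
print(per_capita)
```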

For the variables under examination, analysts typically obtain descriptive statistics for them, such as the mean (average), median, and standard deviation. They may also analyze the distribution of the key variables to see how the individual values cluster around the mean.

Figure: An illustration of the MECE principle used for data analysis.

The consultants at McKinsey and Company named a technique for breaking a quantitative problem down into its component parts called the MECE principle. Each layer can be broken down into its components; each of the sub-components must be mutually exclusive of each other and collectively add up to the layer above them. The relationship is referred to as "Mutually Exclusive and Collectively Exhaustive" or MECE. For example, profit by definition can be broken down into total revenue and total cost. In turn, total revenue can be analyzed by its components, such as revenue of divisions A, B, and C (which are mutually exclusive of each other) and should add to the total revenue (collectively exhaustive).

Analysts may use robust statistical measurements to solve certain analytical problems. Hypothesis testing is used when a particular hypothesis about the true state of affairs is made by the analyst and data is gathered to determine whether that state of affairs is true or false. For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the Phillips Curve. Hypothesis testing involves considering the likelihood of Type I and Type II errors, which relate to whether the data supports accepting or rejecting the hypothesis.
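A minimal sketch of such a test with SciPy, using made-up monthly observations; the null hypothesis is that the slope relating unemployment to inflation is zero:

```python
from scipy import stats

# Hypothetical monthly observations of unemployment (%) and inflation (%).
unemployment = [4.1, 4.5, 5.0, 5.6, 6.2, 6.8, 7.3, 7.9]
inflation    = [3.9, 3.6, 3.1, 2.8, 2.3, 2.1, 1.8, 1.4]

# Null hypothesis: unemployment has no (linear) effect on inflation (slope = 0).
result = stats.linregress(unemployment, inflation)
print(f"slope={result.slope:.2f}, p-value={result.pvalue:.4f}")
# A small p-value would lead the analyst to reject the null hypothesis,
# keeping the risks of Type I and Type II errors in mind.
```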

Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?"). This is an attempt to model or fit an equation line or curve to the data, such that Y is a function of X.

Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?"). Whereas (multiple) regression analysis uses additive logic where each X-variable can produce the outcome and the X's can compensate for each other (they are sufficient but not necessary), necessary condition analysis (NCA) uses necessity logic, where one or more X-variables allow the outcome to exist, but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present and compensation is not possible.

Analytical activities of data users

Users may have particular data points of interest within a data set, as opposed to general messaging outlined above. Such low-level user analytic activities are presented in the following table. The taxonomy can also be organized by three poles of activities: retrieving values, finding data points, and arranging data points.[15][16][17][18]
For each task, the general description, the pro forma abstract, and examples are given.

1. Retrieve Value
General description: Given a set of specific cases, find attributes of those cases.
Pro forma abstract: What are the values of attributes {X, Y, Z, ...} in the data cases {A, B, C, ...}?
Examples: What is the mileage per gallon of the Ford Mondeo? How long is the movie Gone with the Wind?

2. Filter
General description: Given some concrete conditions on attribute values, find data cases satisfying those conditions.
Pro forma abstract: Which data cases satisfy conditions {A, B, C, ...}?
Examples: What Kellogg's cereals have high fiber? What comedies have won awards? Which funds underperformed the SP-500?

3. Compute Derived Value
General description: Given a set of data cases, compute an aggregate numeric representation of those data cases.
Pro forma abstract: What is the value of aggregation function F over a given set S of data cases?
Examples: What is the average calorie content of Post cereals? What is the gross income of all stores combined? How many manufacturers of cars are there?

4. Find Extremum
General description: Find data cases possessing an extreme value of an attribute over its range within the data set.
Pro forma abstract: What are the top/bottom N data cases with respect to attribute A?
Examples: What is the car with the highest MPG? What director/film has won the most awards? What Marvel Studios film has the most recent release date?

5. Sort
General description: Given a set of data cases, rank them according to some ordinal metric.
Pro forma abstract: What is the sorted order of a set S of data cases according to their value of attribute A?
Examples: Order the cars by weight. Rank the cereals by calories.

6. Determine Range
General description: Given a set of data cases and an attribute of interest, find the span of values within the set.
Pro forma abstract: What is the range of values of attribute A in a set S of data cases?
Examples: What is the range of film lengths? What is the range of car horsepowers? What actresses are in the data set?

7. Characterize Distribution
General description: Given a set of data cases and a quantitative attribute of interest, characterize the distribution of that attribute's values over the set.
Pro forma abstract: What is the distribution of values of attribute A in a set S of data cases?
Examples: What is the distribution of carbohydrates in cereals? What is the age distribution of shoppers?

8. Find Anomalies
General description: Identify any anomalies within a given set of data cases with respect to a given relationship or expectation, e.g. statistical outliers.
Pro forma abstract: Which data cases in a set S of data cases have unexpected/exceptional values?
Examples: Are there exceptions to the relationship between horsepower and acceleration? Are there any outliers in protein?

9. Cluster
General description: Given a set of data cases, find clusters of similar attribute values.
Pro forma abstract: Which data cases in a set S of data cases are similar in value for attributes {X, Y, Z, ...}?
Examples: Are there groups of cereals with similar fat/calories/sugar? Is there a cluster of typical film lengths?

10. Correlate
General description: Given a set of data cases and two attributes, determine useful relationships between the values of those attributes.
Pro forma abstract: What is the correlation between attributes X and Y over a given set S of data cases?
Examples: Is there a correlation between carbohydrates and fat? Is there a correlation between country of origin and MPG? Do different genders have a preferred payment method? Is there a trend of increasing film length over the years?

11. Contextualization[18]
General description: Given a set of data cases, find the contextual relevancy of the data to the users.
Pro forma abstract: Which data cases in a set S of data cases are relevant to the current users' context?
Examples: Are there groups of restaurants that have foods based on my current caloric intake?
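A minimal sketch of a few of these tasks expressed with pandas, using a small hypothetical car data set:

```python
import pandas as pd

# Hypothetical car data used to illustrate a few of the tasks above.
cars = pd.DataFrame({
    "model":  ["Mondeo", "Civic", "Golf", "Model 3"],
    "mpg":    [38.0, 42.0, 40.0, 130.0],
    "weight": [1500, 1250, 1300, 1750],
})

print(cars.loc[cars["model"] == "Mondeo", "mpg"])   # 1. Retrieve value
print(cars[cars["mpg"] > 40])                       # 2. Filter
print(cars["mpg"].mean())                           # 3. Compute derived value
print(cars.loc[cars["mpg"].idxmax(), "model"])      # 4. Find extremum
print(cars.sort_values("weight"))                   # 5. Sort
print(cars["mpg"].min(), cars["mpg"].max())         # 6. Determine range
```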

Barriers to effective analysis


Barriers to effective analysis may exist
among the analysts performing the data
analysis or among the audience.
Distinguishing fact from opinion, cognitive
biases, and innumeracy are all challenges
to sound data analysis.

Confusing fact and opinion

"You are entitled to your own opinion, but you are not entitled to your own facts." (Daniel Patrick Moynihan)

Effective analysis requires obtaining relevant facts to answer questions, support a conclusion or formal opinion, or test hypotheses. Facts by definition are irrefutable, meaning that any person involved in the analysis should be able to agree upon them. For example, in August 2010, the Congressional Budget Office (CBO) estimated that extending the Bush tax cuts of 2001 and 2003 for the 2011–2020 time period would add approximately $3.3 trillion to the national debt.[19] Everyone should be able to agree that indeed this is what CBO reported; they can all examine the report. This makes it a fact. Whether persons agree or disagree with the CBO is their own opinion.

As another example, the auditor of a public company must arrive at a formal opinion on whether financial statements of publicly traded corporations are "fairly stated, in all material respects." This requires extensive analysis of factual data and evidence to support their opinion. When making the leap from facts to opinions, there is always the possibility that the opinion is erroneous.

Cognitive biases

There are a variety of cognitive biases that can adversely affect analysis. For example, confirmation bias is the tendency to search for or interpret information in a way that confirms one's preconceptions. In addition, individuals may discredit information that does not support their views.

Analysts may be trained specifically to be aware of these biases and how to overcome them. In his book Psychology of Intelligence Analysis, retired CIA analyst Richards Heuer wrote that analysts should clearly delineate their assumptions and chains of inference and specify the degree and source of the uncertainty involved in the conclusions. He emphasized procedures to help surface and debate alternative points of view.[20]

Innumeracy
Effective analysts are generally adept with
a variety of numerical techniques.
However, audiences may not have such
literacy with numbers or numeracy; they
are said to be innumerate. Persons
communicating the data may also be
attempting to mislead or misinform,
deliberately using bad numerical
techniques.[21]

For example, whether a number is rising or falling may not be the key factor. More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in corporate financial statements. This numerical technique is referred to as normalization[8] or common-sizing. There are many such techniques employed by analysts, whether adjusting for inflation (i.e., comparing real vs. nominal data) or considering population increases, demographics, etc. Analysts apply a variety of techniques to address the various quantitative messages described in the section above.
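As a small illustration with made-up figures, comparing government spending across two countries is more meaningful after common-sizing by GDP:

```python
# Hypothetical figures, in billions of the local currency.
spending = {"A": 850.0, "B": 320.0}
gdp      = {"A": 2100.0, "B": 640.0}

# Raw spending suggests country A spends far more, but as a share of GDP
# the two are much closer: normalization makes the comparison meaningful.
share = {country: spending[country] / gdp[country] for country in spending}
print(share)   # roughly {'A': 0.40, 'B': 0.50}
```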

Analysts may also analyze data under different assumptions or scenarios. For example, when analysts perform financial statement analysis, they will often recast the financial statements under different assumptions to help arrive at an estimate of future cash flow, which they then discount to present value based on some interest rate, to determine the valuation of the company or its stock. Similarly, the CBO analyzes the effects of various policy options on the government's revenue, outlays and deficits, creating alternative future scenarios for key measures.

Other topics
Smart buildings
A data analytics approach can be used in order to predict energy consumption in buildings.[22] The different steps of the data analysis process are carried out in order to realise smart buildings, where the building management and control operations, including heating, ventilation, air conditioning, lighting and security, are realised automatically by mimicking the needs of the building users and optimising resources like energy and time.

Analytics and business intelligence

Analytics is the "extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions." It is a subset of business intelligence, which is a set of technologies and processes that use data to understand and analyze business performance.[23]

Education

In education, most educators have access
to a data system for the purpose of
analyzing student data.[24] These data
systems present data to educators in an
over-the-counter data format (embedding
labels, supplemental documentation, and a
help system and making key
package/display and content decisions) to
improve the accuracy of educators’ data
analyses.[25]

Practitioner notes
This section contains rather technical
explanations that may assist practitioners
but are beyond the typical scope of a
Wikipedia article.

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:[26]

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis:

Frequency counts
Descriptive statistics (mean, standard deviation, median)
Normality (skewness, kurtosis, frequency histograms)
Comparison of variables with coding schemes of variables external to the data set, and possible correction if coding schemes are not comparable
Test for common-method variance

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.[27]

Quality of measurements

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.

There are two ways to assess measurement quality [NOTE: only one is listed below]:

Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α when an item would be deleted from a scale.[28]
Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[29]
Possible transformations of variables are:[30]

Square root transformation (if the distribution differs moderately from normal)
Log transformation (if the distribution differs substantially from normal)
Inverse transformation (if the distribution differs severely from normal)
Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
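A minimal sketch of these transformations with NumPy, where x is a hypothetical strictly positive, right-skewed variable:

```python
import numpy as np

# Hypothetical right-skewed, strictly positive values.
x = np.array([1.2, 1.5, 2.0, 3.1, 4.8, 9.5, 25.0])

sqrt_x = np.sqrt(x)       # square root: moderate departure from normality
log_x = np.log(x)         # log: substantial departure from normality
inv_x = 1.0 / x           # inverse: severe departure from normality

# Or make the variable categorical, e.g. dichotomize at the median.
dichotomous = (x > np.median(x)).astype(int)
print(sqrt_x, log_x, inv_x, dichotomous, sep="\n")
```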
Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups. If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:

Dropout (this should be identified during the initial data analysis phase)
Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
Treatment quality (using manipulation checks)[31]
Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase. The characteristics of the data sample can be assessed by looking at:

Basic statistics of important variables
Scatter plots
Correlations and associations
Cross-tabulations[32]
Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken. Also, the original plan for the main data analyses can and should be specified in more detail or rewritten. In order to do this, several decisions about the main data analyses can and should be made:

In the case of non-normals: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
In the case of outliers: should one use robust analysis techniques?
In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?[33]

Analysis
Several analyses can be used during the initial data analysis phase:[34]

Univariate statistics (single variable)
Bivariate associations (correlations)
Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:[35]

Nominal and ordinal variables
  Frequency counts (numbers and percentages)
  Associations
    circumambulations (crosstabulations)
    hierarchical loglinear analysis (restricted to a maximum of 8 variables)
    loglinear analysis (to identify relevant/important variables and possible confounders)
  Exact tests or bootstrapping (in case subgroups are small)
  Computation of new variables
Continuous variables
  Distribution
    Statistics (M, SD, variance, skewness, kurtosis)
    Stem-and-leaf displays
    Box plots
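A minimal sketch of level-appropriate analyses with pandas, using hypothetical data with nominal and continuous variables:

```python
import pandas as pd

# Hypothetical data mixing nominal variables (gender, smoker) and a continuous one (age).
df = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f", "m"],
    "smoker": ["yes", "no", "no", "no", "yes", "yes"],
    "age":    [23, 35, 31, 40, 28, 52],
})

# Nominal variables: frequency counts and a crosstabulation.
print(df["gender"].value_counts())
print(pd.crosstab(df["gender"], df["smoker"]))

# Continuous variables: distribution statistics (mean, SD, skewness, kurtosis).
print(df["age"].agg(["mean", "std", "skew", "kurt"]))
```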
Nonlinear analysis

Nonlinear analysis is often necessary when the data is recorded from a nonlinear system. Nonlinear systems can exhibit complex dynamic effects including bifurcations, chaos, harmonics and subharmonics that cannot be analyzed using simple linear methods. Nonlinear data analysis is closely related to nonlinear system identification.[36]

Main data analysis


In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analysis needed to write the first draft of the research report.[37]

Exploratory and confirmatory approaches

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a Type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found through exploratory analysis in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type I error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[38]
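A minimal sketch of a Bonferroni adjustment, assuming hypothetical p-values from five models tested on the same data:

```python
# Hypothetical p-values from testing five models on the same dataset.
p_values = [0.012, 0.049, 0.003, 0.21, 0.08]
alpha = 0.05

# Bonferroni correction: divide the significance level by the number of tests.
adjusted_alpha = alpha / len(p_values)
significant = [p for p in p_values if p < adjusted_alpha]
print(f"adjusted alpha = {adjusted_alpha:.3f}; significant p-values: {significant}")
# Only 0.003 survives the correction; the nominally "significant" 0.012 and 0.049 do not.
```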

Stability of results

It is important to obtain some indication about how generalizable the results are.[39] While this is often difficult to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing that.

Cross-validation. By splitting the data into multiple parts, we can check if an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well. Cross-validation is generally inappropriate, though, if there are correlations within the data, e.g. with panel data. Hence other methods of validation sometimes need to be used. For more on this topic, see statistical model validation.
Sensitivity analysis. A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via bootstrapping.
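A minimal cross-validation sketch with scikit-learn, using synthetic data; the reported score is R² on each held-out fold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X[:, 0] + rng.normal(0, 1, size=100)

# 5-fold cross-validation: fit on four parts, score (R^2) on the held-out part.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores, scores.mean())
```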

Free software for data analysis

Notable free software for data analysis includes:

DevInfo – a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
ELKI – data mining framework in Java with data mining oriented visualization functions.
KNIME – the Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
Orange – a visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning.
Pandas – Python library for data analysis.
PAW – FORTRAN/C data analysis framework developed at CERN.
R – a programming language and software environment for statistical computing and graphics.
ROOT – C++ data analysis framework developed at CERN.
SciPy – Python library for data analysis.

International data analysis contests

Different companies or organizations hold data analysis contests to encourage researchers to utilize their data or to solve a particular question using data analysis. A few examples of well-known international data analysis contests are as follows.

Kaggle competition held by Kaggle[40]
LTPP data analysis contest held by FHWA and ASCE[41][42]

See also
Actuarial science
Analytics
Big data
Business intelligence
Censoring (statistics)
Computational physics
Data acquisition
Data blending
Data governance
Data mining
Data Presentation Architecture
Data science
Digital signal processing
Dimension reduction
Early case assessment
Exploratory data analysis
Fourier analysis
Machine learning
Multilinear PCA
Multilinear subspace learning
Multiway data analysis
Nearest neighbor search
Nonlinear system identification
Predictive analytics
Principal component analysis
Qualitative research
Scientific computing
Structured data analysis (statistics)
System identification
Test method
Text analytics
Unstructured data
Wavelet

References
Citations

1. Xia, B. S., & Gong, P. (2015). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0050
2. Exploring Data Analysis
3. Judd, Charles and, McCleland, Gary
(1989). Data Analysis. Harcourt Brace
Jovanovich. ISBN 0-15-516765-0.
4. John Tukey-The Future of Data
Analysis-July 1961
5. Schutt, Rachel; O'Neil, Cathy (2013).
Doing Data Science. O'Reilly Media.
ISBN 978-1-449-35865-5.
6. Clean Data in CRM: The Key to
Generate Sales-Ready Leads and
Boost Your Revenue Pool Retrieved
29th July, 2016
7. "Data Cleaning" . Microsoft Research.
Retrieved 26 October 2013.
8. Perceptual Edge-Jonathan Koomey-
Best practices for understanding
quantitative data-February 14, 2006
9. Hellerstein, Joseph (27 February
2008). "Quantitative Data Cleaning for
Large Databases" (PDF). EECS
Computer Science Division: 3.
Retrieved 26 October 2013.
10. Stephen Few-Perceptual Edge-
Selecting the Right Graph For Your
Message-September 2004
11. Behrens-Principles and Procedures of
Exploratory Data Analysis-American
Psychological Association-1997
12. Grandjean, Martin (2014). "La
connaissance est un réseau" (PDF).
Les Cahiers du Numérique. 10 (3): 37–
54. doi:10.3166/lcn.10.3.37-54 .
13. Stephen Few-Perceptual Edge-
Selecting the Right Graph for Your
Message-2004
14. Stephen Few-Perceptual Edge-Graph
Selection Matrix
15. Robert Amar, James Eagan, and John
Stasko (2005) "Low-Level Components
of Analytic Activity in Information
Visualization"
16. William Newman (1994) "A Preliminary
Analysis of the Products of HCI
Research, Using Pro Forma Abstracts"
17. Mary Shaw (2002) "What Makes Good
Research in Software Engineering?"
18. "ConTaaS: An Approach to Internet-
Scale Contextualisation for Developing
Efficient Internet of Things
Applications" . ScholarSpace.
HICSS50. Retrieved May 24, 2017.
19. "Congressional Budget Office-The
Budget and Economic Outlook-August
2010-Table 1.7 on Page 24" (PDF).
Retrieved 2011-03-31.
20. "Introduction" . cia.gov.
21. Bloomberg-Barry Ritholz-Bad Math
that Passes for Insight-October 28,
2014
22. González-Vidal, Aurora; Moreno-Cano,
Victoria (2016). "Towards energy
efficiency smart buildings models
based on intelligent data analytics".
Procedia Computer Science. 83
(Elsevier): 994–999.
doi:10.1016/j.procs.2016.04.213 .
23. Davenport, Thomas and, Harris,
Jeanne (2007). Competing on
Analytics. O'Reilly. ISBN 978-1-4221-
0332-6.
24. Aarons, D. (2009). Report finds states
on course to build pupil-data
systems. Education Week, 29(13), 6.
25. Rankin, J. (2013, March 28). How data
Systems & reports can either fight or
propagate the data analysis error
epidemic, and how educator leaders
can help. Presentation conducted
from Technology Information Center
for Administrative Leadership (TICAL)
School Leadership Summit.
26. Adèr 2008a, p. 337.
27. Adèr 2008a, pp. 338-341.
28. Adèr 2008a, pp. 341-342.
29. Adèr 2008a, p. 344.
30. Tabachnick & Fidell, 2007, p. 87-88.
31. Adèr 2008a, pp. 344-345.
32. Adèr 2008a, p. 345.
33. Adèr 2008a, pp. 345-346.
34. Adèr 2008a, pp. 346-347.
35. Adèr 2008a, pp. 349-353.
36. Billings S.A. "Nonlinear System
Identification: NARMAX Methods in
the Time, Frequency, and Spatio-
Temporal Domains". Wiley, 2013
37. Adèr 2008b, p. 363.
38. Adèr 2008b, pp. 361-362.
39. Adèr 2008b, pp. 361-371.
40. "The machine learning community
takes on the Higgs" . Symmetry
Magazine. July 15, 2014. Retrieved
14 January 2015.
41. Nehme, Jean (September 29, 2016).
"LTPP International Data Analysis
Contest" . Federal Highway
Administration. Retrieved October 22,
2017.
42. "Data.Gov:Long-Term Pavement
Performance (LTPP)" . May 26, 2016.
Retrieved November 10, 2017.
Bibliography

Adèr, Herman J. (2008a). "Chapter 14: Phases and initial steps in data analysis". In Adèr, Herman J.; Mellenbergh, Gideon J.; Hand, David J. (eds.). Advising on research methods: a consultant's companion. Huizen, Netherlands: Johannes van Kessel Pub. pp. 333–356. ISBN 9789079418015. OCLC 905799857.
Adèr, Herman J. (2008b). "Chapter 15:
The main analysis phase". In Adèr,
Herman J.; Mellenbergh, Gideon J.;
Hand, David J (eds.). Advising on
research methods : a consultant's
companion . Huizen, Netherlands:
Johannes van Kessel Pub. pp. 357–386.
ISBN 9789079418015.
OCLC 905799857 .
Tabachnick, B.G. & Fidell, L.S. (2007).
Chapter 4: Cleaning up your act.
Screening data prior to analysis. In B.G.
Tabachnick & L.S. Fidell (Eds.), Using
Multivariate Statistics, Fifth Edition
(pp. 60–116). Boston: Pearson
Education, Inc. / Allyn and Bacon.

Further reading

Wikiversity has learning resources about Data analysis.
Adèr, H.J. & Mellenbergh, G.J. (with
contributions by D.J. Hand) (2008).
Advising on Research Methods: A
Consultant's Companion. Huizen, the
Netherlands: Johannes van Kessel
Publishing.
Chambers, John M.; Cleveland, William
S.; Kleiner, Beat; Tukey, Paul A. (1983).
Graphical Methods for Data Analysis,
Wadsworth/Duxbury Press. ISBN 0-534-
98052-X
Fandango, Armando (2008). Python Data
Analysis, 2nd Edition. Packt Publishers.
Juran, Joseph M.; Godfrey, A. Blanton
(1999). Juran's Quality Handbook, 5th
Edition. New York: McGraw Hill. ISBN 0-
07-034003-X
Lewis-Beck, Michael S. (1995). Data
Analysis: an Introduction, Sage
Publications Inc, ISBN 0-8039-5772-6
NIST/SEMATECH (2008) Handbook of
Statistical Methods ,
Pyzdek, T, (2003). Quality Engineering
Handbook, ISBN 0-8247-4614-7
Richard Veryard (1984). Pragmatic Data
Analysis. Oxford : Blackwell Scientific
Publications. ISBN 0-632-01311-7
Tabachnick, B.G.; Fidell, L.S. (2007).
Using Multivariate Statistics, 5th Edition.
Boston: Pearson Education, Inc. / Allyn
and Bacon, ISBN 978-0-205-45938-4
