Data Analysis
The phases of the intelligence cycle used to convert raw information into
actionable intelligence or knowledge are conceptually similar to the
phases in data analysis.
Data initially obtained must be processed or organized for analysis. For
instance, this may involve placing data into rows and columns in a table
format for further analysis, such as within a spreadsheet or statistical
software.
Data cleaning
Once processed and organized, the data may be incomplete, contain
duplicates, or contain errors. The need for data cleaning arises from
problems in the way that data is entered and stored. Data cleaning is the
process of preventing and correcting these errors. Common tasks include
record matching, deduplication, and column segmentation.[4] Such data
problems can also be identified through a variety of analytical
techniques. For example, with financial information, the totals for
particular variables may be compared against separately published
numbers believed to be reliable.[5] Unusual amounts above or below
predetermined thresholds may also be reviewed. The appropriate type of
data cleaning depends on the type of data. For quantitative data, outlier
detection methods can be used to remove values that were likely entered
incorrectly. For textual data, spellcheckers can reduce the number of
mistyped words, but it is harder to tell whether the words themselves are
correct.
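The cleaning checks described above (deduplication, threshold review, and comparison against a published total) can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the threshold band, the 1% tolerance, and the function names are assumptions made for the example, not part of any particular tool.

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def flag_outliers(values, low, high):
    """Return values outside the predetermined [low, high] band for review."""
    return [v for v in values if v < low or v > high]


def totals_match(values, published_total, tolerance=0.01):
    """Compare a computed total with a separately published figure.

    The 1% relative tolerance is an assumption chosen for illustration.
    """
    return abs(sum(values) - published_total) <= tolerance * abs(published_total)


# Hypothetical records, invented for illustration only.
records = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 95.0},
    {"id": 2, "amount": 95.0},     # exact duplicate
    {"id": 3, "amount": 98000.0},  # likely data-entry error
]

clean = deduplicate(records)
amounts = [r["amount"] for r in clean]
suspicious = flag_outliers(amounts, low=0.0, high=10000.0)
```

Flagged values are returned for review rather than deleted automatically, since an unusual amount is not necessarily an incorrect one.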
Exploratory data analysis
Once the data is cleaned, it can be analyzed. Analysts may apply a
variety of techniques referred to as exploratory data analysis to begin
understanding the messages contained in the data.[7][8] The process of
exploration may result in additional data cleaning or additional requests
for data, so these activities may be iterative in nature. Descriptive
statistics such as the average or median may be generated to help
understand the data. Data visualization may also be used to examine the
data in graphical format, to obtain additional insight regarding the
messages within the data.
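As a small illustration of why both the average and the median are worth generating, the sketch below computes them for a sample containing one extreme value. The numbers are hypothetical and were chosen only to make the difference visible.

```python
from statistics import mean, median

# Hypothetical sample with one extreme value, invented for illustration.
sales = [12, 15, 14, 13, 16, 15, 140]

avg = mean(sales)    # pulled upward by the extreme value
mid = median(sales)  # largely unaffected by the extreme value

print(f"mean={avg:.2f}, median={mid}")
```

A large gap between the two figures is one signal that the data may need another round of cleaning or a closer look at its distribution.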
Modeling and algorithms
Mathematical formulas or models, called algorithms, may be applied to
the data to identify relationships among the variables, such
as correlation or causation. In general terms, models may be developed
to evaluate a particular variable in the data based on other variable(s) in
the data, with some residual error depending on model accuracy (i.e.,
Data = Model + Error).
Inferential statistics includes techniques to measure relationships
between particular variables. For example, regression analysis may be
used to model whether a change in advertising (independent variable X)
explains the variation in sales (dependent variable Y). In mathematical
terms, Y (sales) is a function of X (advertising). It may be described as
Y = aX + b + error, where the model is designed such that a and b
minimize the error when the model predicts Y for a given range of
values of X. Analysts may attempt to build models that are descriptive of
the data to simplify analysis and communicate results.[1]
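The relationship Y = aX + b + error can be made concrete with a short sketch. The code below assumes that "minimizing the error" means ordinary least squares (minimizing the sum of squared residuals), which is one common choice; the advertising and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit Y = a*X + b by ordinary least squares (closed-form solution)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx          # slope
    b = y_bar - a * x_bar  # intercept
    return a, b


# Hypothetical data, invented for illustration only.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]  # independent variable X
sales = [3.1, 4.9, 7.2, 8.8, 11.0]       # dependent variable Y

a, b = fit_line(advertising, sales)

# The "error" term: what the model leaves unexplained for each observation.
residuals = [y - (a * x + b) for x, y in zip(advertising, sales)]
```

With an intercept in the model, the least-squares residuals sum to zero (up to rounding), which is a quick sanity check on the fit.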
Data product
A data product is a computer application that takes data inputs and
generates outputs, feeding them back into the environment. It may be
based on a model or algorithm. An example is an application that
analyzes data about customer purchasing history and recommends other
purchases the customer might enjoy.[3]
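A toy version of such a recommendation application might look like the sketch below. The scoring rule (weighting each candidate item by how many purchases the other customer shares with the target customer) is an assumption chosen for simplicity, and the purchase histories and the `recommend` function are hypothetical.

```python
from collections import Counter


def recommend(history, customer, top_n=2):
    """Suggest items bought by customers whose purchases overlap with this one."""
    owned = history[customer]
    scores = Counter()
    for other, items in history.items():
        if other == customer:
            continue
        overlap = len(owned & items)  # shared purchases act as a similarity weight
        if overlap == 0:
            continue
        for item in items - owned:    # only suggest items not already bought
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase histories, invented for illustration only.
history = {
    "ann": {"tea", "mug"},
    "bob": {"tea", "mug", "kettle"},
    "cy": {"tea", "honey"},
    "dee": {"soap"},
}

suggestions = recommend(history, "ann")
```

Here the data input is the purchase history and the output, the list of suggestions, is fed back to the customer.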
Communication
1. Retrieve Value
Description: Given a set of specific cases, find attributes of those cases.
General question: What are the values of attributes {X, Y, Z, ...} in the
data cases {A, B, C, ...}?
Examples:
- What is the mileage per gallon of the Audi TT?
- How long is the movie Gone with the Wind?

2. Filter
Description: Given some concrete conditions on attribute values, find data
cases satisfying those conditions.
General question: Which data cases satisfy conditions {A, B, C...}?
Examples:
- What Kellogg's cereals have high fiber?
- What comedies have won awards?
- Which funds underperformed the SP-500?

3. Compute Derived Value
Description: Given a set of data cases, compute an aggregate numeric
representation of those data cases.
General question: What is the value of aggregation function F over a given
set S of data cases?
Examples:
- What is the average calorie content of Post cereals?
- What is the gross income of all stores combined?
- How many manufacturers of cars are there?

4. Find Extremum
Description: Find data cases possessing an extreme value of an attribute
over its range within the data set.
General question: What are the top/bottom N data cases with respect to
attribute A?
Examples:
- What is the car with the highest MPG?
- What director/film has won the most awards?
- What Robin Williams film has the most recent release date?

5. Sort
Description: Given a set of data cases, rank them according to some ordinal
metric.
General question: What is the sorted order of a set S of data cases
according to their value of attribute A?
Examples:
- Order the cars by weight.
- Rank the cereals by calories.

6. Determine Range
Description: Given a set of data cases and an attribute of interest, find
the span of values within the set.
General question: What is the range of values of attribute A in a set S of
data cases?
Examples:
- What is the range of film lengths?
- What is the range of car horsepowers?
- What actresses are in the data set?
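The six tasks listed above map onto very short operations in most programming languages. The following Python sketch runs each one against a small in-memory data set; the car records and the 30 MPG cutoff are hypothetical and exist only to give each task something to work on.

```python
# Hypothetical data set, invented for illustration only.
cars = [
    {"model": "A", "maker": "X", "mpg": 31, "weight": 1200},
    {"model": "B", "maker": "Y", "mpg": 24, "weight": 1500},
    {"model": "C", "maker": "X", "mpg": 38, "weight": 1100},
    {"model": "D", "maker": "Z", "mpg": 19, "weight": 1900},
]

# 1. Retrieve value: an attribute of one specific case.
mpg_of_b = next(c["mpg"] for c in cars if c["model"] == "B")

# 2. Filter: cases satisfying a condition on attribute values.
efficient = [c["model"] for c in cars if c["mpg"] >= 30]

# 3. Compute derived value: an aggregate over a set of cases.
average_mpg = sum(c["mpg"] for c in cars) / len(cars)
maker_count = len({c["maker"] for c in cars})

# 4. Find extremum: the case with the highest value of an attribute.
best = max(cars, key=lambda c: c["mpg"])["model"]

# 5. Sort: order the cases by an attribute.
by_weight = [c["model"] for c in sorted(cars, key=lambda c: c["weight"])]

# 6. Determine range: the span of values of an attribute.
mpg_range = (min(c["mpg"] for c in cars), max(c["mpg"] for c in cars))
```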
Continuous variables
Distribution
Statistics (M, SD, variance, skewness, kurtosis)
Stem-and-leaf displays
Box plots
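The statistics listed above for a continuous variable can be computed with Python's standard library, as in the sketch below. It assumes population (not sample) formulas, moment-based skewness, and non-excess kurtosis (a normal distribution scores 3); other conventions exist and give somewhat different numbers. The quartiles that a box plot draws come from `statistics.quantiles` with its default method. The sample values and the helper names are hypothetical.

```python
from statistics import mean, pstdev, pvariance, quantiles


def central_moment(values, k):
    """k-th central moment using the population (divide by n) convention."""
    m = mean(values)
    return sum((v - m) ** k for v in values) / len(values)


def describe(values):
    """Summary statistics for a continuous variable."""
    var = pvariance(values)
    q1, q2, q3 = quantiles(values, n=4)  # quartiles used by a box plot
    return {
        "M": mean(values),
        "SD": pstdev(values),
        "variance": var,
        "skewness": central_moment(values, 3) / var ** 1.5,
        "kurtosis": central_moment(values, 4) / var ** 2,
        "box": (min(values), q1, q2, q3, max(values)),
    }


# Hypothetical sample, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(scores)
```

A positive skewness, as in this sample, indicates a longer tail toward high values, which would also be visible as a longer upper whisker in a box plot.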
Nonlinear analysis
Nonlinear analysis will be necessary when the data is recorded from
a nonlinear system. Nonlinear systems can exhibit complex dynamic
effects including bifurcations, chaos, harmonics and subharmonics that
cannot be analyzed using simple linear methods. Nonlinear data analysis
is closely related to nonlinear system identification.
Main data analysis
In the main analysis phase, analyses aimed at answering the research
question are performed, as well as any other relevant analyses needed to
write the first draft of the research report.
Exploratory and confirmatory approaches
In the main analysis phase, either an exploratory or a confirmatory
approach can be adopted. Usually the approach is decided before data is
collected. In an exploratory analysis, no clear hypothesis is stated before
analyzing the data, and the data is searched for models that describe it
well. In a confirmatory analysis, clear hypotheses about the data are
tested.
Exploratory data analysis should be interpreted carefully. When testing
multiple models at once, there is a high chance of finding at least one of
them to be significant, but this can be due to a Type I error. When testing
multiple models, it is important to adjust the significance level with,
for example, a Bonferroni correction. Also, one should not
follow up an exploratory analysis with a confirmatory analysis in the
same dataset. An exploratory analysis is used to find ideas for a theory,
but not to test that theory as well. When a model is found through
exploratory analysis of a dataset, following up with a confirmatory
analysis in the same dataset could simply mean that the results of the
confirmatory analysis are due to the same Type I error that produced the
exploratory model in the first place. The confirmatory analysis therefore
will not be more informative than the original exploratory analysis.
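The effect of testing many models at once, and the Bonferroni adjustment mentioned above, can be illustrated numerically. In the sketch below, the family-wise error formula assumes the tests are independent, and the p-values and function names are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one Type I error across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Test each p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]


# Hypothetical p-values from five models, invented for illustration only.
p_values = [0.003, 0.020, 0.041, 0.300, 0.650]

unadjusted = [p <= 0.05 for p in p_values]
threshold, significant = bonferroni(p_values)
risk = familywise_error_rate(0.05, len(p_values))
```

Without the correction, three of the five hypothetical models would be called significant at the 0.05 level; with it, only the first survives. The correction is conservative, so it trades some power for protection against false positives.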
Stability of results
It is important to obtain some indication about how generalizable the
results are. While this is hard to check, one can look at the stability of the
results. Are the results reliable and reproducible? There are two main
ways of doing this: