Lesson 8 - Introduction To Data Analysis
INTRODUCTION TO RESEARCH METHODS
Introduction to Data Analysis
Overview
– Data Warehousing
– Data Mining
Overview
Data analytics:
– the processing of data to infer patterns, correlations, or
models for prediction
Association rules:
– bread ⇒ milk
– DB-Concepts, OS-Concepts ⇒ Networks
– Left hand side: antecedent, right hand side:
consequent
– An association rule must have an associated
population; the population consists of a set of
instances
• E.g. each transaction (sale) at a shop is an instance, and
the set of all transactions is the population
Association Rules (Cont.)
Rules have an associated support, as well as an associated
confidence.
Support is a measure of what fraction of the population
satisfies both the antecedent and the consequent of the
rule.
– E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
Confidence is a measure of how often the consequent is
true when the antecedent is true.
– E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
We omit further details, such as how to efficiently infer association rules.
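To make these measures concrete, here is a minimal Python sketch that computes support and confidence for a rule over a toy population of transactions; the items and transactions are made-up illustrative data, not from the lesson.

```python
# Minimal sketch: support and confidence of an association rule
# (antecedent => consequent) over a toy population of transactions.
# All items and transactions below are illustrative.

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "screwdrivers"},
    {"bread", "milk", "butter"},
]

def support(antecedent, consequent, population):
    """Fraction of instances that satisfy both sides of the rule."""
    both = sum(1 for t in population if antecedent <= t and consequent <= t)
    return both / len(population)

def confidence(antecedent, consequent, population):
    """Fraction of instances satisfying the antecedent that also
    satisfy the consequent."""
    with_antecedent = [t for t in population if antecedent <= t]
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)

# bread => milk: 3 of 5 transactions contain both (support 0.6),
# and 3 of the 4 bread transactions also contain milk (confidence 0.75).
print(support({"bread"}, {"milk"}, transactions))     # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```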
Improving Conclusion Validity
Good Reliability
Reliability is related to the idea of noise or “error” that
obscures your ability to see a relationship.
In general, you can improve reliability by:
– constructing better measurement instruments,
– increasing the number of questions on a scale, or
– reducing situational distractions in the measurement context.
Improving Conclusion Validity
Good Implementation
When you are studying the effects of interventions,
treatments or programs, you can improve conclusion
validity by assuring good implementation.
This can be accomplished by training program
operators and standardizing the protocols for
administering the program.
Statistical Power
There are four interrelated components that influence the conclusions you might reach
from a statistical test in a research project.
The logic of statistical inference with respect to these components is often difficult to
understand and explain.
The four components are:
– sample size;
– effect size: the salience of the treatment relative to the noise in measurement;
– alpha level (α, or significance level): the probability that the observed result is due to chance alone;
– statistical power (1−β): the probability of detecting a treatment effect when one actually exists.
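The following Python sketch shows how these components interact by estimating power through simulation of a two-sample t-test; the sample sizes, effect size, alpha level, and number of runs are arbitrary values chosen for the example.

```python
# Estimating statistical power by simulation: the fraction of simulated
# studies in which a real treatment effect is detected at the given
# alpha level. All numbers are arbitrary example values.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulated_power(n, effect_size, alpha=0.05, runs=2000):
    hits = 0
    for _ in range(runs):
        control = rng.normal(0.0, 1.0, n)           # no effect
        treated = rng.normal(effect_size, 1.0, n)   # true effect present
        _, p = ttest_ind(control, treated)
        if p < alpha:
            hits += 1
    return hits / runs

# Larger samples give higher power for the same effect size and alpha.
print(simulated_power(n=20, effect_size=0.5))   # roughly 0.3
print(simulated_power(n=100, effect_size=0.5))  # roughly 0.9
```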
Data Preparation
Data preparation involves:
– logging the data;
– checking the data for accuracy;
– developing and documenting a database structure that integrates the various measures.
Data Preparation
Logging the Data
In any research project you may have data coming from a number of different sources at different times:
– mail survey returns
– coded interview data
– pretest or posttest data
– observational data
In all but the simplest of studies, you need to set up a procedure for logging the information and keeping
track of it until you are ready to do a comprehensive data analysis.
In most cases, you will want to set up a database that enables you to assess at any time what data is
already in and what is still outstanding.
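A minimal sketch of such a logging database, using Python's built-in sqlite3 module; the table layout, column names, and example values are all hypothetical.

```python
# Hypothetical logging table: one row per expected piece of data,
# with a NULL received_on meaning the data is still outstanding.
import sqlite3

con = sqlite3.connect("study_log.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS data_log (
        participant_id TEXT,
        source         TEXT,   -- e.g. 'mail survey', 'pretest'
        received_on    TEXT,   -- NULL means still outstanding
        PRIMARY KEY (participant_id, source)
    )
""")

# Log one expected pretest and one survey that has come back.
con.execute("INSERT OR REPLACE INTO data_log VALUES ('P001', 'pretest', NULL)")
con.execute("INSERT OR REPLACE INTO data_log VALUES ('P001', 'mail survey', '2024-03-01')")
con.commit()

# At any time, list what is still outstanding.
for row in con.execute(
        "SELECT participant_id, source FROM data_log WHERE received_on IS NULL"):
    print("outstanding:", row)
```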
Data Preparation
Checking the Data For Accuracy
As soon as data is received you should screen it for accuracy. In some circumstances doing this
right away will allow you to go back to the sample to clarify any problems or errors. There are
several questions you should ask as part of this initial data screening:
– Are the responses legible/readable?
– Are all important questions answered?
– Are the responses complete?
– Is all relevant contextual information included (e.g., date, time, place, researcher)?
In most social research, quality of measurement is a major issue. Assuring that the data collection
process does not contribute inaccuracies will help assure the overall quality of subsequent analyses.
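As an illustration, here is a short pandas sketch of such an initial screen; the column names, the 1-5 scale, and the data itself are made up for the example.

```python
# Screening freshly received data: missing answers, out-of-range
# responses, and missing contextual information. All data is made up.
import pandas as pd

df = pd.DataFrame({
    "respondent": ["R1", "R2", "R3"],
    "age":        [34, None, 29],    # missing answer to catch
    "q1":         [5, 3, 9],         # 9 is outside the 1-5 scale
    "date":       ["2024-03-01", "2024-03-02", None],
})

# Are all important questions answered / responses complete?
print(df.isna().sum())

# Are responses within the expected range (here, a 1-5 scale)?
print(df[(df["q1"] < 1) | (df["q1"] > 5)])

# Is relevant contextual information (e.g. the date) included?
print(df[df["date"].isna()])
```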
Data Preparation
Developing a Database Structure
The database structure is the manner in which you intend to store the data for the study
so that it can be accessed in subsequent data analyses.
You might use the same structure you used for logging in the data or, in large complex
studies, you might have one structure for logging data and another for storing it.
There are generally two options for storing data on a computer –
database programs and statistical programs.
Usually database programs are the more complex of the two to learn and operate, but
they allow the analyst greater flexibility in manipulating the data.
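As a rough illustration of the database-program option, here is a minimal SQLite sketch of a documented structure that integrates several measures; the schema and every name in it are hypothetical.

```python
# Hypothetical two-table structure: stable participant information in
# one table, and one row per measurement in another, so pretest,
# posttest, and other instruments all share the same layout.
import sqlite3

con = sqlite3.connect("study.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS participants (
        participant_id TEXT PRIMARY KEY,
        enrolled_on    TEXT
    );
    CREATE TABLE IF NOT EXISTS measures (
        participant_id TEXT REFERENCES participants(participant_id),
        instrument     TEXT,   -- e.g. 'pretest', 'posttest'
        item           TEXT,   -- question or scale name
        value          REAL
    );
""")
con.commit()
```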
Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study.
They provide simple summaries about the sample and the measures.
Together with simple graphics analysis, they form the basis of virtually every
quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics.
With descriptive statistics you are simply describing what is or what the data shows.
With inferential statistics, you are trying to reach conclusions that extend beyond the
immediate data alone.
Descriptive Statistics
Univariate Analysis
Univariate analysis involves the examination across cases of one variable at a time. There are three major
characteristics of a single variable that we tend to look at:
– the distribution
– the central tendency
– the dispersion
In most situations, we would describe all three of these characteristics for each of the variables in our study.
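A short Python sketch, using only the standard library and made-up scores, that summarizes all three characteristics for a single variable.

```python
# Univariate summary of one made-up variable: distribution,
# central tendency, and dispersion.
import statistics
from collections import Counter

scores = [15, 20, 21, 20, 36, 15, 25, 15]  # illustrative data

# Distribution: how often each value occurs.
print(Counter(scores))

# Central tendency: mean, median, mode.
print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0
print(statistics.mode(scores))    # 15

# Dispersion: range and sample standard deviation.
print(max(scores) - min(scores))  # 21
print(statistics.stdev(scores))
```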
Descriptive Statistics
Correlation