
Lesson 11

INTRODUCTION TO RESEARCH METHODS
Introduction to Data Analysis

Introduction to Research Methods: Dr (Eng.) Musebe 2023


Chapter 11: Data Analytics

 Overview

 Data Warehousing

 Online Analytical Processing

 Data Mining
Overview

 Data analytics:
– the processing of data to infer patterns, correlations, or
models for prediction

 Primarily used to make business decisions


– Per individual customer
• E.g. what product to suggest for purchase

– Across all customers


• E.g. what products to manufacture/stock, in what quantity

 Critical for businesses today


Overview (Cont.)
 Common steps in data analytics
– Gather data from multiple sources into one location
• Data warehouses integrate data into a common schema
• Data often needs to be extracted from source formats,
transformed to the common schema, and loaded into the data
warehouse (a small sketch of this step follows the list below)
– Generate aggregates and reports summarizing data
• Dashboards showing graphical charts/reports
• Online analytical processing (OLAP) systems allow
interactive querying
• Statistical analysis using tools such as R/SAS/SPSS
– Including extensions for parallel processing of big data
– Build predictive models and use the models for
decision making
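
As a rough illustration of the extract-transform-load step above, the following Python/pandas sketch combines two hypothetical sources with different column names into one common schema and loads the result into a warehouse table. All source names, column names, and values are made up; in practice the sources would be read from files or operational databases (e.g. with pd.read_csv or pd.read_sql).

# Extract-transform-load sketch (hypothetical sources and column names).
import sqlite3
import pandas as pd

# "Extract": two sources that describe the same facts with different column names.
source_a = pd.DataFrame({"cust_id": [1, 2], "item": ["shirt", "dress"],
                         "qty": [3, 1], "price": [20.0, 45.0]})
source_b = pd.DataFrame({"customer": [3], "product": ["skirt"],
                         "quantity": [2], "unit_price": [30.0]})

# Transform: rename into one common schema and combine.
source_b = source_b.rename(columns={"customer": "cust_id", "product": "item",
                                    "quantity": "qty", "unit_price": "price"})
combined = pd.concat([source_a, source_b], ignore_index=True)

# Load: write the integrated table into a warehouse table (SQLite stands in here).
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("sales", conn, if_exists="append", index=False)
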
Overview (Cont.)
 Predictive models are widely used today
– E.g. use customer profile features (e.g. income, age,
gender, education, employment) and past history of a
customer to predict the likelihood of default on a loan
• and use the prediction to make the loan decision (see the sketch below)
– E.g. use past history of sales (by season) to predict
future sales
• And use it to decide what/how much to produce/stock
• And to target customers
 Other examples of business decisions:
– What items to stock?
– What insurance premium to charge?
– To whom to send advertisements?
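
To make the loan-default example above concrete, here is a minimal sketch of such a predictive model using scikit-learn's logistic regression. The features (income, age), the synthetic data, and the rule generating the labels are assumptions for illustration only, not the lesson's own data.

# Sketch of a predictive model for loan default (synthetic data, hypothetical features).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50_000, 15_000, n)
age = rng.integers(21, 70, n)
# Synthetic label: lower income loosely raises the probability of default.
default = (rng.random(n) < 1 / (1 + np.exp((income - 40_000) / 10_000))).astype(int)

X = np.column_stack([income / 1000.0, age])   # income in thousands to keep features on similar scales
X_train, X_test, y_train, y_test = train_test_split(X, default, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
# The predicted probability of default is what drives the loan decision.
print(model.predict_proba(X_test)[:5, 1])
print("test accuracy:", model.score(X_test, y_test))
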
Overview (Cont.)

 Machine learning techniques are key to finding


patterns in data and making predictions
 Data mining extends techniques developed by
machine-learning communities to run them on very
large datasets
 The term business intelligence (BI) is a synonym for
data analytics
 The term decision support focuses on reporting
and aggregation
Data Analysis

 In most social research the data analysis involves


three major steps, done in roughly this order:

– Cleaning and organizing the data for analysis (Data


Preparation)

– Describing the data (Descriptive Statistics)

– Testing Hypotheses and Models (Inferential Statistics)


Data Warehousing
Data Analysis and OLAP
 Online Analytical Processing (OLAP)
– Interactive analysis of data, allowing data to be summarized
and viewed in different ways in an online fashion (with
negligible delay)
 We use the following relation to illustrate OLAP
concepts
– sales (item_name, color, clothes_size, quantity)
This is a simplified version of the sales fact table joined
with the dimension tables, with many attributes
removed (and some renamed)
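
To give a feel for the kind of summary an OLAP tool produces, the sketch below cross-tabulates quantity by item_name and color over a made-up instance of the sales relation, using a pandas pivot table as a stand-in for an interactive OLAP system.

# Cross-tabulation of the sales relation, in the spirit of an OLAP pivot
# (pandas stands in for an OLAP tool; the data values are made up).
import pandas as pd

sales = pd.DataFrame({
    "item_name":    ["skirt", "skirt", "dress", "dress", "shirt", "shirt"],
    "color":        ["dark", "pastel", "dark", "pastel", "dark", "white"],
    "clothes_size": ["small", "medium", "small", "large", "medium", "large"],
    "quantity":     [2, 5, 3, 7, 8, 4],
})

# Summarize quantity by item_name and color, with totals (the "all" margins of a data cube).
pivot = pd.pivot_table(sales, values="quantity", index="item_name",
                       columns="color", aggfunc="sum", fill_value=0,
                       margins=True, margins_name="all")
print(pivot)
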
Regression
 Regression deals with the prediction of a value, rather than
a class.
– Given values for a set of variables, X1, X2, …, Xn, we wish to
predict the value of a variable Y.
 One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
 Finding such a linear polynomial is called linear regression.
– In general, the process of finding a curve that fits the data
is also called curve fitting.
 The fit may only be approximate
– because of noise in the data, or
– because the relationship is not exactly a polynomial
 Regression aims to find coefficients that give the best
possible fit.
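
A minimal sketch of linear regression by least squares, assuming two predictor variables and synthetic data (the coefficients and noise level are made up for illustration):

# Fitting Y = a0 + a1*X1 + a2*X2 by least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))                                               # X1, X2
Y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)  # noisy linear relationship

A = np.column_stack([np.ones(n), X])          # prepend a column of ones for the intercept a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coeffs)                                 # approximately [3.0, 1.5, -2.0]
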
Association Rules
 Retail shops are often interested in associations
between different items that people buy.
– Someone who buys bread is quite likely also to buy
milk
– A person who bought the book Database System
Concepts is quite likely also to buy the book Operating
System Concepts.
 Association information can be used in several
ways.
– E.g. when a customer buys a particular book, an online
shop may suggest associated books.
Association Rules

 Association rules:
bread ⇒ milk
DB-Concepts, OS-Concepts ⇒ Networks
– Left hand side: antecedent, right hand side:
consequent
– An association rule must have an associated
population; the population consists of a set of
instances
• E.g. each transaction (sale) at a shop is an instance, and
the set of all transactions is the population
Association Rules (Cont.)
 Rules have an associated support, as well as an associated
confidence.
 Support is a measure of what fraction of the population
satisfies both the antecedent and the consequent of the
rule.
– E.g. suppose only 0.001 percent of all purchases include
milk and screwdrivers. The support for the rule milk ⇒
screwdrivers is then low.
 Confidence is a measure of how often the consequent is
true when the antecedent is true.
– E.g. the rule bread ⇒ milk has a confidence of 80 percent if
80 percent of the purchases that include bread also include
milk.
 We omit further details, such as how to efficiently
infer association rules
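
The two measures are straightforward to compute directly from the definitions above. The sketch below (with a made-up population of transactions) counts how often antecedent and consequent occur together, and how often the consequent appears among transactions that contain the antecedent.

# Support and confidence of a rule over a population of transactions (made-up data).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "milk", "butter"},
]

def support(antecedent, consequent, population):
    # Fraction of the population containing both the antecedent and the consequent.
    both = sum(1 for t in population if antecedent <= t and consequent <= t)
    return both / len(population)

def confidence(antecedent, consequent, population):
    # Among transactions containing the antecedent, the fraction also containing the consequent.
    with_ante = [t for t in population if antecedent <= t]
    if not with_ante:
        return 0.0
    return sum(1 for t in with_ante if consequent <= t) / len(with_ante)

print(support({"bread"}, {"milk"}, transactions))     # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
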
Data Preparation

 Data Preparation involves:


– checking or logging the data in;
– checking the data for accuracy;
– entering the data into the computer;
– transforming the data; and
– developing and documenting a database
structure that integrates the various measures.
Descriptive Statistics

 Descriptive Statistics are used to describe the


basic features of the data in a study.
 They provide simple summaries about the sample
and the measures.
 Together with simple graphics analysis, they form
the basis of virtually every quantitative analysis of
data.
 With descriptive statistics you are simply
describing what is, what the data shows.
Inferential Statistics
 Inferential Statistics investigate questions, models and
hypotheses.
 In many cases, the conclusions from inferential statistics
extend beyond the immediate data alone.
 For instance, we use inferential statistics to try to infer
from the sample data what the population thinks.
– Or, we use inferential statistics to make judgments of the
probability that an observed difference between groups is
a dependable one or one that might have happened by
chance in this study.
 Thus, we use inferential statistics to make inferences
from our data to more general conditions; we use
descriptive statistics simply to describe what’s going on
in our data.
Conclusion Validity
 In many ways, conclusion validity is the most
important of the four validity types because it is
relevant whenever we are trying to decide if there is a
relationship in our observations (and that’s one of the
most basic aspects of any analysis).

 Perhaps we should start with an attempt at a


definition:

– Conclusion validity is the degree to which conclusions


we reach about relationships in our data are
reasonable.
Threats to Conclusion Validity
 A threat to conclusion validity is a factor that can lead
you to reach an incorrect conclusion about a
relationship in your observations.
 You can essentially make two kinds of errors about
relationships:
– Conclude that there is no relationship when in fact
there is (you missed the relationship or didn’t see it)
– Conclude that there is a relationship when in fact there
is not (you’re seeing things that aren’t there!)
 Most threats to conclusion validity have to do with the
first problem.
Finding no relationship when there is one
 When you’re looking for the needle in the haystack you
essentially have two basic problems:
– the tiny needle and too much hay.
 One important threat is low reliability of measures. This can be
due to many factors, including:
– poor question wording, bad instrument design or layout,
illegibility of field notes, and so on.
 In studies where you are evaluating a program you can
introduce noise through poor reliability of treatment
implementation.
 Random irrelevancies in the setting can also obscure your
ability to see a relationship.
 The types of people you have in your study can also make it
harder to see relationships.
Finding a relationship when there is not one
 In statistical analysis, we attempt to determine the
probability that the finding we get is a “real” one or
could have been a “chance” finding.
 In fact, we often use this probability to decide whether
to accept the statistical result as evidence that there is
a relationship.
 In the social sciences, researchers often use the rather
arbitrary value known as the 0.05 level of significance
to decide whether their result is credible or could be
considered a “fluke.”
 Essentially, the value 0.05 means that the result you got
could be expected to occur by chance at least 5 times
out of every 100 times you run the statistical analysis.
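
A small sketch of this decision rule, using a two-sample t-test from SciPy on synthetic group scores (the groups, means, and sample sizes are illustrative only):

# Applying the conventional 0.05 decision rule to a two-sample comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=100, scale=15, size=40)   # e.g. treatment group scores
group_b = rng.normal(loc=105, scale=15, size=40)   # e.g. comparison group scores

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p = {p_value:.3f}")
# Treat the observed difference as credible only if p < 0.05.
print("relationship judged credible" if p_value < 0.05 else "could be a chance finding")
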
Problems that can lead to either conclusion error
 Every analysis is based on a variety of assumptions about the
nature of the data, the procedures you use to conduct the
analysis, and the match between these two.
– If you are not sensitive to the assumptions behind your analysis
you are likely to draw erroneous conclusions about relationships.
 In quantitative research we refer to this threat as the violated
assumptions of statistical tests.
– For instance, many statistical analyses assume that the data are
distributed normally — that the population from which they are
drawn would be distributed according to a “normal” or “bell-
shaped” curve.
 If that assumption is not true for your data and you use that
statistical test, you are likely to get an incorrect estimate of the
true relationship.
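
One practical precaution is to check the normality assumption before relying on a test that requires it. The sketch below applies the Shapiro-Wilk test from SciPy to deliberately skewed synthetic data; the 0.05 cutoff and the follow-up advice are illustrative, not a fixed rule.

# Checking the normality assumption behind many statistical tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.exponential(scale=2.0, size=200)      # clearly non-normal data

stat, p = stats.shapiro(skewed)
print(f"Shapiro-Wilk p = {p:.4f}")
if p < 0.05:
    print("normality looks violated; consider a transformation or a nonparametric test")
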
Improving Conclusion Validity

 So you may have a problem assuring that you are


reaching credible conclusions about relationships in
your data.
 What can you do about it?

 Here are some general guidelines you can follow in


designing your study that will help improve conclusion
validity.
Improving Conclusion Validity
 Good Statistical Power
 The rule of thumb in social research is that you want statistical power to
be greater than 0.8 in value. That is, you want to have at least 80 chances
out of 100 of finding a relationship when there is one.
 As pointed out in the discussion of statistical power, there are several
factors that interact to affect power.
– One thing you can usually do is to collect more information — use a larger
sample size.
– The second thing you can do is to increase your risk of making a Type I error
— increase the chance that you will find a relationship when it’s not there.
– In practical terms you can do that statistically by raising the alpha level. For
instance, instead of using a 0.05 significance level, you might use 0.10 as
your cutoff point.
Improving Conclusion Validity

 Good Reliability
 Reliability is related to the idea of noise or “error” that
obscures your ability to see a relationship.
 In general, you can improve reliability by:
– constructing better measurement instruments,
– increasing the number of questions on a scale, or
– reducing situational distractions in the measurement
context.
Improving Conclusion Validity

 Good Implementation
 When you are studying the effects of interventions,
treatments or programs, you can improve conclusion
validity by assuring good implementation.
 This can be accomplished by training program
operators and standardizing the protocols for
administering the program.
Statistical Power

 There are four interrelated components that influence the conclusions you might reach
from a statistical test in a research project.
 The logic of statistical inference with respect to these components is often difficult to
understand and explain.
 The four components are:
– sample size;
– effect size is the salience of the treatment relative to the noise in measurement;
– alpha level (α, or significance level) is the odds that the observed result is due to chance;
– statistical power (1−β) is the odds that you will observe a treatment effect when it occurs.
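
As a rough illustration of how these components interact, the sketch below approximates power for a two-sample comparison using the usual normal approximation. The effect size and sample sizes are illustrative, and the formula ignores the negligible opposite tail of the two-sided test.

# Approximate power of a two-sample comparison (normal approximation, illustrative values).
from math import sqrt
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha=0.05):
    # power ~ Phi( d * sqrt(n/2) - z_(1 - alpha/2) ) for a two-sided test at level alpha
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)

print(approx_power(0.5, 64))              # medium effect, 64 per group: roughly 0.8
print(approx_power(0.5, 64, alpha=0.10))  # raising alpha raises power, at the cost of Type I error
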
Data Preparation

 Data Preparation involves:


– checking or logging the data in;

– checking the data for accuracy;

– entering the data into the computer;

– transforming the data, and

– developing and documenting a database structure that integrates the various measures.
Data Preparation
 Logging the Data
 In any research project you may have data coming from a number of different sources at different times:
– mail survey returns
– coded interview data
– pretest or posttest data
– observational data
 In all but the simplest of studies, you need to set up a procedure for logging the information and keeping
track of it until you are ready to do a comprehensive data analysis.
 In most cases, you will want to set up a database that enables you to assess at any time what data is
already in and what is still outstanding.
Data Preparation
 Checking the Data For Accuracy
 As soon as data is received you should screen it for accuracy. In some circumstances doing this
right away will allow you to go back to the sample to clarify any problems or errors. There are
several questions you should ask as part of this initial data screening:
– Are the responses legible/readable?
– Are all important questions answered?
– Are the responses complete?
– Is all relevant contextual information included (e.g., date, time, place, researcher)?
 In most social research, quality of measurement is a major issue. Assuring that the data collection
process does not contribute inaccuracies will help assure the overall quality of subsequent analyses.
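
A small sketch of such an initial accuracy screen, assuming the returns have been entered into a pandas DataFrame; the column names, scale range, and values are hypothetical.

# Initial accuracy screen on incoming survey data (tiny inline table stands in for real returns).
import pandas as pd

df = pd.DataFrame({
    "age":            [34, 29, None, 210],                               # one missing, one out of range
    "satisfaction":   [4, 7, 3, 2],                                      # scale is 1-5, so 7 is suspect
    "interview_date": ["2023-02-01", None, "2023-02-03", "2023-02-04"],
    "researcher":     ["AK", "AK", None, "JM"],
})

# Are all important questions answered?
print(df[["age", "satisfaction"]].isna().sum())

# Are the responses within legal ranges?
print(df[(df["age"] < 0) | (df["age"] > 120)])
print(df[~df["satisfaction"].between(1, 5)])

# Is the relevant contextual information (date, researcher) recorded?
print(df[["interview_date", "researcher"]].isna().sum())
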
Data Preparation
 Developing a Database Structure
 The database structure is the manner in which you intend to store the data for the study
so that it can be accessed in subsequent data analyses.
 You might use the same structure you used for logging in the data or, in large complex
studies, you might have one structure for logging data and another for storing it.
 As mentioned above, there are generally two options for storing data on computer –
database programs and statistical programs.
 Usually database programs are the more complex of the two to learn and operate, but
they allow the analyst greater flexibility in manipulating the data.
Descriptive Statistics
 Descriptive statistics are used to describe the basic features of the data in a study.
 They provide simple summaries about the sample and the measures.
 Together with simple graphics analysis, they form the basis of virtually every
quantitative analysis of data.
 Descriptive statistics are typically distinguished from inferential statistics.
 With descriptive statistics you are simply describing what is or what the data shows.
 With inferential statistics, you are trying to reach conclusions that extend beyond the
immediate data alone.
Descriptive Statistics
 Univariate Analysis

 Univariate analysis involves the examination across cases of one variable at a time. There are three major
characteristics of a single variable that we tend to look at:
– the distribution

– the central tendency

– the dispersion

 In most situations, we would describe all three of these characteristics for each of the variables in our study.
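
The sketch below describes a single synthetic variable along those three dimensions (distribution, central tendency, dispersion) using pandas; the variable and its values are made up for illustration.

# Univariate description of one variable: distribution, central tendency, dispersion.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
scores = pd.Series(rng.normal(loc=70, scale=10, size=300).round()).clip(0, 100)

# Distribution: frequency table over binned values.
print(pd.cut(scores, bins=[0, 50, 60, 70, 80, 90, 100]).value_counts().sort_index())

# Central tendency.
print("mean:", scores.mean(), "median:", scores.median(), "mode:", scores.mode().iloc[0])

# Dispersion.
print("range:", scores.max() - scores.min(), "std dev:", scores.std(), "variance:", scores.var())
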
Descriptive Statistics

 Correlation

 The correlation is one of the most common and most


useful statistics.
 A correlation is a single number that describes the
degree of relationship between two variables.
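
A minimal sketch computing a Pearson correlation between two synthetic variables with SciPy; the variables (hours studied, exam score) and the relationship between them are assumptions for illustration.

# A single correlation coefficient describing the relationship between two variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
hours_studied = rng.uniform(0, 20, size=100)
exam_score = 50 + 2.0 * hours_studied + rng.normal(scale=5, size=100)

r, p = stats.pearsonr(hours_studied, exam_score)
print(f"r = {r:.2f}, p = {p:.3g}")   # r close to +1 indicates a strong positive relationship
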
