Introduction To Data Viz Lecture 2
ECONOMETRICS
Email: [email protected]
1
CONTENT TO BE
COVERED
Data- Data sources
Types of variables
Qualitative and quantitative variables
Discrete and continuous variables
Levels of measurements
- Nominal, Ordinal, Interval, Ratio
Data Collection
Data structures- Cross sectional, Time series, Panel
Primary, Secondary
Collection Methods-Questionnaire, Content Analysis etc
Sample and Sampling Methods
2
Types of statistics-Descriptive, Inferential
Describing Data using Graphs- Two way scatter, Box and whisker
plots, pie charts etc
Describing Data using summaries
- measures of central tendency
- measures of variation
-measures of distribution
Confidence Interval – CI, p-value, Level of Significance test
3
Analysis of Variance (ANOVA)-t-test, comparison of means
Correlation Matrix-Pearson, Spearman, Kendall.
Data Dimension Reduction Techniques
- PCA, FA,DA
4
INTRODUCTION TO
STATISTICS
Definition: (Statistics)
Science of collection, presentation, analysis, and reasonable
interpretation of data.
5
6
DATA SIGHT
TAZVISHAYA 7
We are focusing on “quantitative analysis”
The general idea is to summarize and analyze data so that it is useful for
decision-making
We do this by calculating “measures of central tendency” and by looking for
relationships
(We will NOT cover formal tests of hypotheses)
8
DATA
Data may be continuous or discrete
Just looking at the data often does not enable one to ascertain what is
actually happening
Solution: Use appropriate descriptive statistics to summarize and
present results
9
A TAXONOMY OF
STATISTICS
10
TYPES OF STATISTICS
Techniques that summarize and describe characteristics of a group or
make comparisons of characteristics between groups are known as
descriptive statistics.
11
Three types of analysis
Univariate analysis
the examination of the distribution of cases on
only one variable at a time (e.g., college
graduation)
Bivariate analysis
the examination of two variables simultaneously
(e.g., the relation between gender and college
graduation)
Multivariate analysis
the examination of more than two variables
simultaneously (e.g., the relationship between
gender, race, and college graduation)
12
“Purpose”
Univariate analysis
Purpose: description
Bivariate analysis
Multivariate analysis
13
UNIVARIATE ANALYSIS
Involves examination of the distribution of cases on only
ONE variable at a time
14
Percentages are simply 100 × proportion
Or [100 × (frequency of criterion variable divided by N)]
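A minimal sketch of this calculation in Python. The category names and counts are made up for illustration, and the variable names (`counts`, `n_total`, `percentages`) are assumptions, not part of the lecture.

```python
# Hypothetical frequencies for three categories (illustrative values only).
counts = {"A": 15, "B": 25, "C": 20}

# N is the total number of cases.
n_total = sum(counts.values())

# Percentage = 100 x (frequency / N) for each category.
percentages = {cat: 100 * freq / n_total for cat, freq in counts.items()}

for cat, pct in percentages.items():
    print(f"{cat}: {pct:.1f}%")
```

The percentages of all categories should add up to 100, which is a quick check on the arithmetic.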
15
TYPES OF VARIABLES
16
BIVARIATE ANALYSIS
Bivariate analysis focuses on the
relationship between two variables
17
CONTINGENCY TABLES
Format: attributes of independent variable are used as column
headings and attributes of the dependent variable are used as
row headings
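The format above can be sketched with the Python standard library. The records below are invented for illustration (gender as the independent variable, college graduation as the dependent variable), and all names are hypothetical.

```python
from collections import Counter

# Hypothetical (gender, graduated) records; values are invented for illustration.
records = [
    ("female", "yes"), ("female", "yes"), ("female", "no"),
    ("male", "yes"), ("male", "no"), ("male", "no"),
]

# Count how often each (independent, dependent) combination occurs.
cell_counts = Counter(records)

columns = sorted({gender for gender, _ in records})      # independent variable: column headings
rows = sorted({graduated for _, graduated in records})   # dependent variable: row headings

# Print the table: one row per attribute of the dependent variable.
print("graduated".ljust(10) + "".join(col.ljust(8) for col in columns))
for row in rows:
    cells = "".join(str(cell_counts[(col, row)]).ljust(8) for col in columns)
    print(row.ljust(10) + cells)
```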
18
MULTIVARIATE
ANALYSIS
Multivariate analysis allows the separate and combined effects of the independent
variables to be examined
19
STATISTICAL DESCRIPTION
OF DATA
Statistics describes a numeric set of data by
its
– Center
– Variability
– Shape
Statistics describes a categorical set of data
by
– Frequency, percentage or proportion of
each category
20
Some Definitions
•Variable - any characteristic of an individual or entity. A variable can
take different values for different individuals. Variables can be
categorical or quantitative. Per S. S. Stevens…
• Nominal - Categorical variables with no inherent order or ranking
sequence such as names or classes (e.g., gender). Values may be written as
numerals, but carry no numerical meaning (e.g., I, II, III). The only
operation that can be applied to Nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild,
moderate, severe. Can be compared for equality, or greater or less, but
not how much greater or less.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.
21
SOME DEFINITIONS
22
DATA PRESENTATION
Two types of statistical presentation of data - graphical and numerical.
Graphical Presentation: We look for the overall pattern and for striking deviations
from that pattern. The overall pattern is usually described by the shape, center, and spread
of the data. An individual value that falls outside the overall pattern is called an
outlier.
Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
23
Data Presentation –Categorical
Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals
who fall in each category.
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          (15/60)=0.250    25.0
2                 25          (25/60)=0.417    41.7
3                 20          (20/60)=0.333    33.3
Total             60          1.00             100
[Figure: bar chart of frequency (vertical axis, 0 to 30) by Treatment Group (1, 2, 3)]
24
Data Presentation –Categorical
Variable
Pie Chart: Lists the categories and presents the percent or count of individuals
who fall in each category.
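Treatment Group   Frequency   Proportion       Percent (%)
1                 15          (15/60)=0.250    25.0
2                 25          (25/60)=0.417    41.7
3                 20          (20/60)=0.333    33.3
[Figure: pie chart of Treatment Groups 1, 2 and 3; visible slice labels: 25%, 33%]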
25
GRAPHICAL PRESENTATION –
NUMERICAL VARIABLE
Histogram: Overall pattern can be described by its shape, center, and spread.
The following age distribution is right skewed. The center lies between 80 and
100. No outliers.
Figure 3: Age Distribution (histogram; vertical axis: Number of Subjects)
Mean            90.41666667
Standard Error  3.902649518
Median          84
Mode            84
26
GRAPHICAL PRESENTATION –
NUMERICAL VARIABLE
Box-Plot: Describes the five-number summary (min, q1, median, q3, max).
[Figure: Box Plot; vertical axis from 0 to 160, with min, q1, median, q3 and max marked]
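As a rough sketch, the five-number summary behind a box plot can be computed with Python's `statistics` module. The ages below are hypothetical values, not the lecture's data set, and `statistics.quantiles` is used with its default "exclusive" method, which is only one of several quartile conventions.

```python
import statistics

# Hypothetical ages in months (illustrative values, not the lecture's data set).
ages = [48, 55, 60, 72, 84, 84, 90, 101, 120, 143]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
q1, median, q3 = statistics.quantiles(ages, n=4)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number = (min(ages), q1, median, q3, max(ages))
print(five_number)
```

Different software may report slightly different Q1 and Q3 values for the same data, because the quartile convention is a choice.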
27
A fundamental concept in summary statistics is that of a central value for a set
of observations and the extent to which the central value characterizes the
whole set of data. Measures of central value such as the mean or median must
be coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.
28
Methods of Center Measurement
Commonly used methods are mean, median, mode, geometric mean etc.
Mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
29
Methods of Center Measurement
Mode: The value that is observed most frequently. The mode is undefined
for sequences in which no observation is repeated.
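A small sketch of this definition in Python. The helper name `mode_or_none` is hypothetical; it returns `None` when no observation is repeated, and if several values tie for the highest frequency it simply returns the one seen first.

```python
from collections import Counter

def mode_or_none(values):
    """Return the most frequent value, or None if no value is repeated."""
    value, freq = Counter(values).most_common(1)[0]
    return value if freq > 1 else None

print(mode_or_none([84, 90, 84, 72]))  # 84 occurs twice
print(mode_or_none([20, 30, 40]))      # no repeats, so the mode is undefined
```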
30
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and is thus
a better measure than the mean for highly skewed distributions, e.g. family
income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The
median of these four observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40. So, the mean 270 really fails to give a realistic picture of
the major part of the data. It is influenced by the extreme value 990.
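The example can be reproduced with Python's `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# The four observations from the example above.
observations = [20, 30, 40, 990]

mean_value = statistics.mean(observations)      # pulled upward by 990
median_value = statistics.median(observations)  # average of the two middle values

print(mean_value, median_value)
```

Replacing 990 with an even larger value changes the mean but leaves the median at 35, which is the sense in which the median is less sensitive to outliers.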
31
Methods of Variability Measurement
Range: The difference between the largest and the smallest observations. The
range of 10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.
32
Methods of Variability Measurement
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Variance of 5, 7, 3? Mean is (5+7+3)/3 = 5 and the variance is
$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3 - 1} = 4$
Standard Deviation: Square root of the variance. The standard deviation of the
above example is 2.
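A short sketch of the range, sample variance and standard deviation in Python, using the numbers from the examples. `statistics.variance` and `statistics.stdev` divide by n - 1, matching the calculation shown; the variable names are illustrative.

```python
import statistics

# Range example: largest minus smallest observation.
observations = [10, 5, 2, 100]
data_range = max(observations) - min(observations)

# Variance and standard deviation example (sample versions, denominator n - 1).
sample = [5, 7, 3]
sample_variance = statistics.variance(sample)
sample_sd = statistics.stdev(sample)

print(data_range, sample_variance, sample_sd)
```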
33
Methods of Variability Measurement
Quartiles: Data can be divided into four regions that cover the total range of
observed values. Cut points for these regions are known as quartiles.
The first quartile (Q1) is the cut point below which the first 25% of the ordered data lie.
The second quartile (Q2) is the 50% cut point, which is the median.
The third quartile (Q3) is the 75% cut point, so 25% of the data lie between the
median and Q3.
Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.
34
Methods of Variability Measurement
In an example with 15 ordered observations, Q1 = ((15+1)/4)×1 = 4th observation of the data. The 4th
observation is 11. So Q1 of this data is 11.
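The position rule used here, k(n+1)/4 for the k-th quartile, can be sketched as follows. The 15 values are hypothetical, chosen only so that the 4th ordered observation is 11; the function name and the linear interpolation for non-integer positions are assumptions, not taken from the lecture.

```python
def quartile_by_position(sorted_values, k):
    """k-th quartile at 1-based position k*(n+1)/4, interpolating linearly."""
    n = len(sorted_values)
    position = k * (n + 1) / 4
    lower = int(position)          # 1-based index of the observation at or below
    fraction = position - lower    # how far to move toward the next observation
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

# Hypothetical ordered data with n = 15 (illustrative values only).
data = [3, 5, 7, 11, 13, 15, 18, 20, 22, 25, 28, 30, 33, 36, 40]

q1 = quartile_by_position(data, 1)  # position (15+1)/4 = 4, the 4th observation
q2 = quartile_by_position(data, 2)  # position 8, the median
q3 = quartile_by_position(data, 3)  # position 12
print(q1, q2, q3)
```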
35
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then cut points are called
Deciles
Percentiles: If data is ordered and divided into 100 parts, then cut points are
called Percentiles. 25th percentile is the Q1, 50th percentile is the Median (Q2)
and the 75th percentile of the data is Q3.
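A minimal sketch of these relationships using `statistics.quantiles`, which returns n - 1 cut points for n parts. The data are hypothetical, and the default "exclusive" method is one of several conventions for computing cut points.

```python
import statistics

# Hypothetical ordered data (illustrative values only).
data = [3, 5, 7, 11, 13, 15, 18, 20, 22, 25, 28, 30, 33, 36, 40]

quartiles = statistics.quantiles(data, n=4)      # 3 cut points
deciles = statistics.quantiles(data, n=10)       # 9 cut points
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

# percentiles[k - 1] holds the k-th percentile.
print(percentiles[24], quartiles[0])   # 25th percentile and Q1
print(percentiles[49], quartiles[1])   # 50th percentile and the median (Q2)
print(percentiles[74], quartiles[2])   # 75th percentile and Q3
```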
36
Skewness
37
Kurtosis
38
Summary of the Variable ‘Age’ in the
given data set
Median              84
Mode                84
Standard Deviation  30.22979318
Sample Variance     913.8403955
Kurtosis            -1.183899591
Skewness            0.389872725
Range               95
Minimum             48
Maximum             143
Sum                 5425
Count               60
[Figure: histogram of Age in Month (horizontal axis, 40 to 160) against Number of Subjects (vertical axis, 0 to 10)]
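Several of these summary values are tied together by simple arithmetic, which can be checked with a short sketch. The numbers are copied from the summary; the mean (90.41666667) and standard error (3.902649518) are the values reported with the age histogram earlier. The formula standard error = standard deviation / sqrt(count) is the usual definition and is assumed here.

```python
import math

# Values copied from the summary of the variable 'Age'.
total, count = 5425, 60
sd, variance = 30.22979318, 913.8403955
minimum, maximum = 48, 143

mean = total / count                    # Sum / Count
standard_error = sd / math.sqrt(count)  # assumed definition: SD / sqrt(n)
variance_from_sd = sd ** 2              # variance is the square of the SD
value_range = maximum - minimum         # Maximum - Minimum

print(mean, standard_error, variance_from_sd, value_range)
```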
39
ANALYSIS--
INTRODUCTION
The BIG Questions:
What are you trying to discover or show?
How will you present the results?
40
DATA COLLECTION
INSTRUMENTS
Questionnaires & surveys
Transaction logs
Experimental observation
Bills & invoices
Census forms & reports
Pre-packaged data sets
Content analysis
41
ISSUES IN RESEARCH
DESIGN
Case study vs. statistical sample
What is the universe? (uses, users, etc.)
Example: political debate over “average tax cut” vs. “tax cut for the average family”
42
SAMPLE SIZE & SAMPLING
METHODS
How large a sample is needed?
The larger the sample the more accurate the results
(unless the response rate becomes very low)
The larger the sample the more the cost/effort
Sample size does NOT depend on the size of the population
Rules of thumb
100 for 95% confidence, 5% tolerance, 90-10 expected split
400 for 95% confidence, 5% tolerance, 50-50 expected split
30 – 50 in each cell of an n x m table of discrete classes
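The first two rules of thumb can be compared with the textbook normal-approximation formula for estimating a proportion, n = z² p (1 - p) / e². This formula is swapped in here as an assumption; the lecture states only the rounded rules. The function name and parameters are hypothetical, with z = 1.96 for 95% confidence and e the tolerance.

```python
import math

def sample_size_for_proportion(p, margin, z=1.96):
    """Sample size from n = z^2 * p * (1 - p) / margin^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# 95% confidence (z = 1.96), 5% tolerance.
n_90_10 = sample_size_for_proportion(0.9, 0.05)  # 90-10 expected split
n_50_50 = sample_size_for_proportion(0.5, 0.05)  # 50-50 expected split

print(n_90_10, n_50_50)
```

Under these assumptions the formula gives 139 and 385, the same order of magnitude as the rounded figures of 100 and 400. The population size does not appear in the formula, which matches the point that sample size does not depend on it.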
43
SOURCES OF ERROR
The respondent
The investigator
Sampling error
Change in the system itself
Coding & analysis
Model specification (Oversimplification and Undersimplification)
44