Introduction To Data Viz Lecture 2

The document discusses statistical concepts and methods for describing and analyzing quantitative data, including different types of variables, levels of measurement, methods for collecting and sampling data, descriptive and inferential statistics, and techniques for summarizing data through graphs, measures of central tendency and variation, and contingency tables. It provides definitions and examples of key statistical terms and outlines topics to be covered in more depth, such as types of analyses, measures for univariate, bivariate and multivariate data, and data presentation methods.

Uploaded by

anderson
Copyright
© All Rights Reserved

STATISTICS &

ECONOMETRICS

Course Manager : T Tazvishaya


H.Acc, M.Acc, DipPharm, MsDA, (CTA & Law Student )

Contact Details : 0773610198

Email: [email protected]

1
CONTENT TO BE
COVERED
Data- Data sources
Types of variables
 Qualitative and quantitative variables
 Discrete and continuous variables
Levels of measurements
- Nominal, Ordinal, Interval, Ratio
Data Collection
 Data structures- Cross sectional, Time series, Panel
 Primary, Secondary
 Collection Methods-Questionnaire, Content Analysis etc
 Sample and Sampling Methods

2
CONTENT TO BE
COVERED
Types of statistics-Descriptive, Inferential
Describing Data using Graphs- Two way scatter, Box and whisker
plots, pie charts etc
Describing Data using summaries
- measures of central tendency
- measures of variation
-measures of distribution
Confidence Interval – CI, p-value, Level of Significance test

3
CONTENT TO BE
COVERED
Analysis of Variance (ANOVA)-t-test, comparison of means
Correlation Matrix-Pearson, Spearman, Kendall.
Data Dimension Reduction Techniques
- PCA, FA,DA

4
INTRODUCTION TO
STATISTICS
Definition: (Statistics)
Science of collection, presentation, analysis, and reasonable
interpretation of data.

Statistics presents a rigorous scientific method for gaining insight into
data. For example, suppose we measure the weight of 100 patients in
a study. With so many measurements, simply looking at the data fails
to provide an informative account. However, statistics can give an
instant overall picture of the data through graphical presentation or
numerical summarization, irrespective of the number of data points.
Besides data summarization, another important task of statistics is to
make inferences and predict relations between variables.

5
6
DATA SIGHT

TAZVISHAYA 7
We are focusing on “quantitative analysis”
The general idea is to summarize and analyze data so that it is useful for
decision-making
We do this by calculating “measures of central tendency” and by looking for
relationships
 (We will NOT cover formal tests of hypotheses)

Primary vs. secondary data sources


Data on uses (system) vs. data on users (people)

8
DATA
Data may be continuous or discrete
Just looking at the data often does not enable one to ascertain what is
actually happening
Solution: Use appropriate descriptive statistics to summarize and
present results

9
A TAXONOMY OF
STATISTICS

10
TYPES OF STATISTICS
Techniques that summarize and describe characteristics of a group or
make comparisons of characteristics between groups are known as
descriptive statistics.

Inferential statistics are used to make generalizations or inferences
about a population based on findings from a sample.

The choice of a type of analysis is based on the evaluation questions,
the type of data collected, and the audience who will receive the
results.

11
Three types of analysis

 Univariate analysis
 the examination of the distribution of cases on
only one variable at a time (e.g., college
graduation)
 Bivariate analysis
 the examination of two variables simultaneously
(e.g., the relation between gender and college
graduation)
 Multivariate analysis
 the examination of more than two variables
simultaneously (e.g., the relationship between
gender, race, and college graduation)
12
“Purpose”
- Univariate analysis
  - Purpose: description
- Bivariate analysis
  - Purpose: determining the empirical relationship between the two variables
- Multivariate analysis
  - Purpose: determining the empirical relationship among the variables

13
UNIVARIATE ANALYSIS
Involves examination of the distribution of cases on only
ONE variable at a time

Frequency distributions are listings of the number of cases
in each attribute of a variable
- Ungrouped frequency distribution
- Grouped frequency distribution

Proportions express the number of cases of the criterion
variable as a part of the total population: the frequency of the
criterion variable divided by N

14
Percentages are simply 100 × proportion
- Or [100 × (frequency of criterion variable divided by N)]

Rates make comparisons more meaningful by controlling for population differences
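
As a rough illustration of the three measures above, the following Python sketch computes a proportion, a percentage and a rate. The town names, the counts and the per-1,000 base are hypothetical values chosen for the example; they are not figures from the lecture.

```python
# Hypothetical counts, for illustration only (not from the lecture).
graduates = {"Town A": 150, "Town B": 300}        # frequency of the criterion variable
population = {"Town A": 1_000, "Town B": 4_000}   # N for each town


def proportion(frequency, n):
    """Frequency of the criterion variable divided by N."""
    return frequency / n


def percentage(frequency, n):
    """100 x proportion."""
    return 100 * proportion(frequency, n)


def rate_per_1000(frequency, n):
    """Cases per 1,000 population, so groups of different size can be compared."""
    return 1000 * frequency / n


for town in graduates:
    f, n = graduates[town], population[town]
    print(town, proportion(f, n), percentage(f, n), rate_per_1000(f, n))
```

In this made-up example Town B has more graduates in absolute terms (300 against 150) but a lower rate (75 against 150 per 1,000), which is the kind of population difference a rate controls for.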

15
TYPES OF VARIABLES

Continuous: values increase steadily in tiny fractions

Discrete: values jump from category to category

16
BIVARIATE ANALYSIS
Bivariate analysis focuses on the
relationship between two variables

17
CONTINGENCY TABLES
Format: attributes of independent variable are used as column
headings and attributes of the dependent variable are used as
row headings

Guidelines for presenting & interpreting contingency tables


 Contents of table described in title
 Attributes of each variable clearly described
 Base on which percentages are computed should be shown
 Norm is to percentage down & compare across
 Table should indicate # of cases omitted from analysis

18
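
The contingency-table guidelines (independent variable across the columns, percentages computed down each column, base shown) can be sketched in a few lines of Python. The gender and graduation counts below are hypothetical, and `column_percentages` is a helper name invented for this example.

```python
from collections import Counter

# Hypothetical cases as (independent variable, dependent variable) pairs.
cases = (
    [("Female", "Graduated")] * 30
    + [("Female", "Not graduated")] * 20
    + [("Male", "Graduated")] * 20
    + [("Male", "Not graduated")] * 30
)


def column_percentages(pairs):
    """Percentage down: express each cell as a percent of its column total."""
    cell_counts = Counter(pairs)
    column_totals = Counter(column for column, _ in pairs)
    percents = {
        cell: 100 * count / column_totals[cell[0]]
        for cell, count in cell_counts.items()
    }
    return percents, column_totals


percents, bases = column_percentages(cases)
columns = sorted(bases)                     # attributes of the independent variable
rows = sorted({row for _, row in cases})    # attributes of the dependent variable

print("Graduation by gender (hypothetical data, column percentages)")
print(f"{'':<15}" + "".join(f"{column:>10}" for column in columns))
for row in rows:
    cells = "".join(f"{percents.get((column, row), 0.0):>9.1f}%" for column in columns)
    print(f"{row:<15}{cells}")
# Show the base on which the percentages were computed.
print(f"{'Base (N)':<15}" + "".join(f"{bases[column]:>10}" for column in columns))
```

Reading across the "Graduated" row then compares 60% of the hypothetical women with 40% of the hypothetical men.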
MULTIVARIATE
ANALYSIS
Multivariate analysis allows the separate and combined effects of the independent
variables to be examined

19
STATISTICAL DESCRIPTION
OF DATA
 Statistics describes a numeric set of data by
its
– Center
– Variability
– Shape
 Statistics describes a categorical set of data
by
– Frequency, percentage or proportion of
each category

20
Some Definitions
• Variable - any characteristic of an individual or entity. A variable can
take different values for different individuals. Variables can be
categorical or quantitative. Per S. S. Stevens…
• Nominal - Categorical variables with no inherent order or ranking
sequence, such as names or classes (e.g., gender). Values may be
numerals, but they carry no numerical meaning (e.g., I, II, III). The only
operation that can be applied to nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild,
moderate, severe. Values can be compared for equality, or for greater or less, but
not for how much greater or less.
 Unimodal - having a single peak
 Bimodal - having two distinct peaks
 Symmetric - left and right half are mirror images.

21
SOME DEFINITIONS

• Interval - Values of the variable are ordered as in Ordinal, and
additionally, differences between values are meaningful; however,
the scale is not absolutely anchored. Calendar dates and temperatures
on the Fahrenheit scale are examples. Addition and subtraction, but
not multiplication and division, are meaningful operations.
• Ratio - Variables with all properties of Interval plus an absolute,
non-arbitrary zero point, e.g. age, weight, temperature (Kelvin).
Addition, subtraction, multiplication, and division are all meaningful
operations.
• Distribution - (of a variable) tells us what values the variable takes
and how often it takes these values.

22
DATA PRESENTATION
There are two types of statistical presentation of data: graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations
from that pattern. The overall pattern is usually described by the shape, center, and spread
of the data. An individual value that falls outside the overall pattern is called an
outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots and box plots are used for numerical variables.

23
Data Presentation –Categorical
Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals
who fall in each category.

Figure 1: Bar Chart of Subjects in Treatment Groups (x-axis: Treatment Group; y-axis: Number of Subjects)

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.00              100

24
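
The proportion and percent columns of the treatment-group table follow directly from the frequency column. A minimal Python sketch, using the three counts from the table (the variable names are chosen here for illustration):

```python
# Counts per treatment group, taken from the frequency column of the table.
frequencies = {1: 15, 2: 25, 3: 20}
total = sum(frequencies.values())

# Each row: (group, frequency, proportion, percent).
rows = [
    (group, count, count / total, 100 * count / total)
    for group, count in frequencies.items()
]

print(f"{'Group':<8}{'Frequency':>10}{'Proportion':>12}{'Percent':>9}")
for group, count, proportion, percent in rows:
    print(f"{group:<8}{count:>10}{proportion:>12.3f}{percent:>9.1f}")
print(f"{'Total':<8}{total:>10}{1.0:>12.2f}{100.0:>9.1f}")
```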
Data Presentation –Categorical
Variable
Pie Chart: Lists the categories and presents the percent or count of individuals
who fall in each category.

Figure 2: Pie Chart of Subjects in Treatment Groups (slices: group 1 = 25%, group 2 = 42%, group 3 = 33%)

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.00              100

25
GRAPHICAL PRESENTATION –
NUMERICAL VARIABLE
Histogram: Overall pattern can be described by its shape, center, and spread.
The following age distribution is right skewed. The center lies between 80 and
100. No outliers.

Figure 3: Age Distribution (histogram; x-axis: Age in Month, bins from 40 to 140 plus "More"; y-axis: Number of Subjects)

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

26
GRAPHICAL PRESENTATION –
NUMERICAL VARIABLE
Box-Plot: Describes the five-number summary (min, q1, median, q3, max)

Figure 3: Distribution of Age (box plot; vertical axis from 0 to 160, with min, q1, median, q3 and max marked)

27
A fundamental concept in summary statistics is that of a central value for a set
of observations and the extent to which the central value characterizes the
whole set of data. Measures of central value such as the mean or median must
be coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distance of the observations from
the mean in data set A is larger than in data set B. Thus, the mean of data
set B is a better representation of the data set than is the case for set A.
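
The comparison of sets A and B can be checked numerically. The sketch below uses the average distance from the mean as the dispersion measure, as suggested in the text; the function name `mean_abs_deviation` is chosen here for illustration.

```python
from statistics import mean

A = [30, 50, 70]
B = [40, 50, 60]


def mean_abs_deviation(values):
    """Average absolute distance of the observations from their mean."""
    centre = mean(values)
    return mean(abs(v - centre) for v in values)


for name, data in (("A", A), ("B", B)):
    print(name, "mean =", mean(data), "average distance =", round(mean_abs_deviation(data), 2))
```

Both sets have mean 50, but the average distance from the mean is about 13.33 for A and about 6.67 for B, so the mean describes B more closely.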

28
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset

Commonly used methods are mean, median, mode, geometric mean etc.

Mean: The sum of all the observations divided by the number of
observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let x1, x2, ..., xn be n observations of a variable
x. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

29
Methods of Center Measurement

Median: The middle value in an ordered sequence of observations. That is, to
find the median we need to order the data set and then find the middle
value. In case of an even number of observations the average of the two
middle most values is the median. For example, to find the median of {9, 3, 6,
7, 5}, we first sort the data giving {3, 5, 6, 7, 9}, then choose the middle value
6. If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median
is the average of the two middle values from the sorted sequence, in this
case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined
for sequences in which no observation is repeated.
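
The three measures of center can be reproduced with Python's standard library. This is a minimal sketch: `mode_or_none` is a helper written for this example so that the mode comes out as undefined (`None`) when no observation is repeated, and the list used for the mode is hypothetical.

```python
from collections import Counter
from statistics import mean, median


def mode_or_none(values):
    """Most frequent value, or None when no observation is repeated.

    If several values tie for the highest count, the first one seen is returned.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None


print(mean([20, 30, 40]))             # mean example from the text
print(median([9, 3, 6, 7, 5]))        # odd number of observations
print(median([9, 3, 6, 7, 5, 2]))     # even number of observations
print(mode_or_none([84, 90, 84, 70])) # hypothetical data with one repeated value
print(mode_or_none([9, 3, 6, 7, 5]))  # no repeats, so the mode is undefined
```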

30
Mean or Median

The median is less sensitive to outliers (extreme scores) than the mean and is thus
a better measure than the mean for highly skewed distributions, e.g. family
income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The
median of these four observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40. So the mean of 270 fails to give a realistic picture of
the major part of the data. It is influenced by the extreme value 990.
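
The effect of the extreme value can be seen by computing both measures with and without it. A small Python sketch using the four observations from the example:

```python
from statistics import mean, median

with_outlier = [20, 30, 40, 990]
without_outlier = [20, 30, 40]   # the same data with the extreme value 990 dropped

print("mean:  ", mean(with_outlier), "vs", mean(without_outlier))
print("median:", median(with_outlier), "vs", median(without_outlier))
```

Dropping 990 moves the mean from 270 to 30, while the median only moves from 35 to 30.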

31
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile range,
coefficient of variation etc.

Range: The difference between the largest and the smallest observations. The
range of 10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.

32
Methods of Variability Measurement

Variance: The variance of a set of observations is the average of the squares of
the deviations of the observations from their mean, with n − 1 rather than n as the
divisor for a sample. In symbols, the variance of the n observations x1, x2, ..., xn is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4$$

Standard Deviation: Square root of the variance. The standard deviation of the
above example is 2.
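
The worked example can be verified with a short Python sketch of the sample variance (divisor n − 1) and its square root. The function names are chosen here for illustration.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(values))


data = [5, 7, 3]
print(sample_variance(data), sample_std(data))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and should agree with these results.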

33
Methods of Variability Measurement

Quartiles: Ordered data can be divided into four regions that cover the total range of
observed values. The cut points for these regions are known as quartiles.

In notation, the q-th quartile of a data set is the (q(n+1)/4)-th observation of the ordered data, where
q is the desired quartile and n is the number of observations.

The first quartile (Q1) is the cut point below which the first 25% of the data lie. The second quartile (Q2) is
the cut point at 50%, i.e. the median, so the second quarter of the data lies between Q1 and Q2.
The third quartile (Q3) is the 75% cut point, so the third quarter of the data lies between the
median and Q3.

Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.

34
Methods of Variability Measurement

In the following example Q1 = ((15+1)/4) × 1 = 4th observation of the data. The 4th
observation is 11, so Q1 of this data is 11.

An example with 15 numbers

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
(Q1 = 11 is the 4th value, Q2 = 40 is the 8th value, Q3 = 61 is the 12th value)

The first quartile is Q1=11. The second quartile is Q2=40 (This is also the
Median.) The third quartile is Q3=61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of the
previous example is 61 − 11 = 50. The middle half of the ordered data lie between 11
and 61.
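
The position rule for quartiles can be sketched in Python on the 15 numbers of the example. The rule gives whole-number positions here; the linear interpolation used when a position falls between two observations is an assumption of this sketch, since the slides do not say how to handle that case.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at position q * (n + 1) / 4 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = q * (n + 1) / 4        # 1-based position in the ordered data
    lower = int(position)             # whole part of the position
    fraction = position - lower       # fractional part, used for interpolation
    if lower < 1:
        return float(ordered[0])
    if lower >= n:
        return float(ordered[-1])
    # Assumption: interpolate linearly between the two neighbouring observations.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For this data set the sketch gives Q1 = 11, Q2 = 40, Q3 = 61 and an inter-quartile range of 50.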

35
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then cut points are called
Deciles
Percentiles: If data is ordered and divided into 100 parts, then cut points are
called Percentiles. 25th percentile is the Q1, 50th percentile is the Median (Q2)
and the 75th percentile of the data is Q3.

In notation, the p-th percentile of a data set is the (p(n+1)/100)-th observation of the ordered data,
where p is the desired percentile and n is the number of observations.

Coefficient of Variation: The standard deviation of the data divided by its mean. It is
usually expressed in percent.

$$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$$
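
As a small illustration, the coefficient of variation can be computed in Python for the data 5, 7, 3 used in the variance example. The sketch assumes the sample standard deviation (divisor n − 1) is the one intended.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, expressed in percent."""
    return 100 * stdev(values) / mean(values)


print(coefficient_of_variation([5, 7, 3]))
```

With a standard deviation of 2 and a mean of 5, the coefficient of variation is 40%.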

36
Skewness

- Measures asymmetry of data
  - Positive or right skewed: Longer right tail
  - Negative or left skewed: Longer left tail

Let x1, x2, ..., xn be n observations. Then,

$$\text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}$$

37
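
A minimal Python sketch of the moment-based skewness coefficient, written as sqrt(n) · Σ(x − x̄)³ / (Σ(x − x̄)²)^(3/2). Spreadsheet and statistics packages often apply a small-sample correction, so their output may differ slightly from this formula; the function name is chosen here for illustration.

```python
def skewness(values):
    """Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2)) ** 1.5, with d = x - mean.

    Undefined (division by zero) when all observations are equal.
    """
    n = len(values)
    x_bar = sum(values) / n
    sum_squares = sum((x - x_bar) ** 2 for x in values)
    sum_cubes = sum((x - x_bar) ** 3 for x in values)
    return (n ** 0.5) * sum_cubes / sum_squares ** 1.5


print(skewness([30, 50, 70]))        # symmetric data
print(skewness([20, 30, 40, 990]))   # long right tail
```

Symmetric data give a skewness of 0, while the data with the extreme value 990 give a positive value, matching the "longer right tail" description.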
Kurtosis

- Measures peakedness of the distribution of data. The kurtosis of a normal distribution is 0.

Let x1, x2, ..., xn be n observations. Then,

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}} - 3$$

38
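
A minimal Python sketch of the kurtosis measure, written as n · Σ(x − x̄)⁴ / (Σ(x − x̄)²)² − 3 so that a normal distribution scores 0. As with skewness, software packages may apply a small-sample correction and report slightly different values; the function name and the test data are chosen here for illustration.

```python
def excess_kurtosis(values):
    """Kurtosis minus 3: n * sum(d^4) / (sum(d^2)) ** 2 - 3, with d = x - mean.

    Undefined (division by zero) when all observations are equal.
    """
    n = len(values)
    x_bar = sum(values) / n
    sum_squares = sum((x - x_bar) ** 2 for x in values)
    sum_fourth = sum((x - x_bar) ** 4 for x in values)
    return n * sum_fourth / sum_squares ** 2 - 3


print(excess_kurtosis([1, 2, 3, 4, 5]))   # flat, evenly spread data
```

For the evenly spread values 1 to 5 the result is about −1.3, i.e. flatter than a normal distribution.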
Summary of the Variable ‘Age’ in the
given data set

Histogram of Age (x-axis: Age in Month, 40 to 160; y-axis: Number of Subjects, 0 to 10)

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

39
ANALYSIS--
INTRODUCTION
The BIG Questions:
 What are you trying to discover or show?
 How will you present the results?

From survey to report


 Flow of information
 Sample surveys

Brief comparison of SAS & R

40
DATA COLLECTION
INSTRUMENTS
Questionnaires & surveys
Transactions logs
Experimental observation
Bills & invoices
Census forms & reports
Pre-packaged data sets
Content analysis

41
ISSUES IN RESEARCH
DESIGN
Case study vs. statistical sample
What is the universe ? (uses, users, etc.)
 Example: political debate over “average tax cut” vs. “tax cut for the average family”

Is the sample representative ?


 Volumes vs. titles in the library

Does correlation imply causality?


 Do we need to identify the pathogen?

Controlling for outside factors

42
SAMPLE SIZE & SAMPLING
METHODS
How large a sample is needed?
 The larger the sample the more accurate the results
(unless the response rate becomes very low)
 The larger the sample the more the cost/effort
Sample size does NOT depend on the size of the population
Rules of thumb
 100 for 95% confidence, 5% tolerance, 90-10 expected split
 400 for 95% confidence, 5% tolerance, 50-50 expected split
 30 – 50 in each cell on n x m discrete classes

43
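
The rules of thumb for sample size can be compared with the standard large-population formula for estimating a proportion, n = z² · p · (1 − p) / e². The slides do not say how their figures were derived, so this formula, the value z = 1.96 for 95% confidence and the function name are assumptions brought in for illustration.

```python
from math import ceil


def sample_size(p, margin=0.05, z=1.96):
    """Sample size for estimating a proportion p within +/- margin.

    Assumes a large population and z = 1.96 for 95% confidence.
    """
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


print(sample_size(0.5))   # 50-50 expected split
print(sample_size(0.1))   # 90-10 expected split
```

Under these assumptions the formula gives 385 for a 50-50 split and 139 for a 90-10 split, so the 400 and 100 quoted above read as coarser round figures. The population size does not appear in the formula, in line with the statement that sample size does not depend on it.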
SOURCES OF ERROR
The respondent
The investigator
Sampling error
Change in the system itself
Coding & analysis
Model specification (Oversimplification and Under simplification)

44
