Tian Statistics Lesson 3 Descriptive Statistics

The document covers fundamentals of data analytics and statistics, focusing on data processing, descriptive statistics, and measures of central tendency and dispersion. It provides examples of statistical calculations using SAS, explains the concepts of populations and samples, and discusses various types of variables. Additionally, it introduces key statistical measures such as mean, median, mode, variance, and standard deviation, along with their applications in data analysis.

Uploaded by

vineetpjoshi.71
Fundamentals of Data Analytics and Statistics

Topic 1: Various Cases of Data Processing (continued)

In statistics, aggregate data are data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations.

Example: Use SAS to roll up data to a higher level. The data set, named data1, is shown in the following image. FAC is the unique ID; the goal is to roll the data up to the BOR level, with X equal to the sum of X over all FAC records belonging to the same BOR.

Gender Income

M 50000
F 30000
M 54000
M 37000
F 48000
F 55000
M 90000
F 67000
M 110000
M 40000
F 20000
F 80000

Find three ways to calculate the average income by gender in SAS.
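
One possible set of three approaches is sketched below: PROC MEANS with a CLASS statement, PROC SQL with GROUP BY, and a sorted DATA step with BY-group processing. The data set names income_data, sorted and avg_by_gender, and the variables total, n and AvgIncome, are illustrative assumptions and do not come from the slides.

```sas
/* Read the Gender/Income table shown above (data set name is hypothetical) */
data income_data;
    input Gender $ Income;
    datalines;
M 50000
F 30000
M 54000
M 37000
F 48000
F 55000
M 90000
F 67000
M 110000
M 40000
F 20000
F 80000
;
run;

/* Way 1: PROC MEANS with a CLASS statement */
proc means data=income_data mean;
    class Gender;
    var Income;
run;

/* Way 2: PROC SQL with GROUP BY */
proc sql;
    select Gender, avg(Income) as AvgIncome
    from income_data
    group by Gender;
quit;

/* Way 3: sort, then accumulate totals within each BY group */
proc sort data=income_data out=sorted;
    by Gender;
run;

data avg_by_gender(keep=Gender AvgIncome);
    set sorted;
    by Gender;
    if first.Gender then do;   /* reset the accumulators for a new group */
        total = 0;
        n = 0;
    end;
    total + Income;            /* sum statements retain their values across rows */
    n + 1;
    if last.Gender then do;    /* write one row per gender */
        AvgIncome = total / n;
        output;
    end;
run;
```

By hand, the female incomes sum to 300000 and the male incomes to 381000, with six observations each, so each approach should report averages of 50000 (F) and 63500 (M).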

/* Roll DATA1 up to the BOR level: one output row per BOR */
PROC SQL;
    CREATE TABLE TEMP AS
    SELECT BOR,
           SUM(X) AS X   /* total of X over all FAC rows in the same BOR */
    FROM DATA1
    GROUP BY BOR;
QUIT;

Overview of Statistics

Statistics covers two activities: data collection and data analysis.

Statistical Data Analysis:
• Descriptive statistics
    • Numerical
    • Graphical
• Inferential statistics

Populations and Samples:
A population is the set of all items of interest in a statistical problem.

A sample is a set of data drawn from the population.

e.g.

a. A coin has two outcomes (heads and tails).
b. A die has six outcomes: 1, 2, 3, 4, 5 and 6.
c. Combining the coin and the die gives 12 outcomes:

H: 1 2 3 4 5 6    T: 1 2 3 4 5 6

H1 H2 H3 H4 H5 H6 T1 T2 T3 T4 T5 T6

Let us throw the die 10 times. The outcomes are:

2 3 4 5 1 3 6 4 3 4

Define the results as variable X. Variable X takes the data values above; it is a random sample of size 10 from the population (1-6).

Let us throw the coin 10 times. The results are:

H H T H T T H T H H

Define the results as variable Y. Variable Y takes the data values above; Y is a random variable.

Selling price for a product:

89.5 79.9 83.1 85.5 88.9

Variable: price

Random variables

A random variable is a variable whose value depends on the outcome of a random phenomenon.

Topic 2: Concepts of descriptive statistics, and calculation method

Variables
    Qualitative: Nominal, Ordinal
    Quantitative: Discrete, Continuous

• Continuous variable – a quantitative variable whose possible values form some interval of numbers.

• Discrete variable – a quantitative variable that is not continuous; its possible values can be counted.

• Nominal variable - a qualitative variable with no intrinsic ordering to its categories, e.g. gender.

• Ordinal variable - a qualitative variable with a clear ordering to its categories, e.g. temperature recorded in three ordered categories (low, medium and high).

SAS output data

Region  age  agecategory  salary  salarycategory
South   42   older        57000   33301+
North   36   middle       40200   33301+
North   65   older        21450   <=25500
East    28   young        21900   <=25500

True or false:

a. The variable region is an ordinal variable.
b. The variable age is a continuous variable.
c. The variable agecategory is an ordinal variable.
d. The variable salarycategory is a quantitative variable.

A descriptive measure of a sample is called a statistic.

Descriptive statistics involves arranging, summarizing and presenting a set of data so that the meaningful essentials of the data can be extracted and interpreted easily.

The following measures are commonly used to describe the observations:

• a measure of location, or central tendency, such as the arithmetic mean, median, mode or interquartile mean
• a measure of statistical dispersion, such as standard deviation, variance, range, interquartile range or the distance standard deviation
• a measure of the shape of the distribution, such as skewness or kurtosis
• if more than one variable is measured, a measure of statistical dependence, such as the Pearson correlation coefficient or Spearman’s correlation coefficient

Mean
Standard Error
Median
Mode
Standard Deviation
Variance
Kurtosis
Skewness
Range
Minimum
Maximum
Sum
Count
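
Most of the statistics listed above can be requested from PROC MEANS in a single step. The following is a minimal sketch; the built-in sashelp.class data set and its Height variable are illustrative stand-ins chosen here, not part of the original slides.

```sas
/* Request the listed summary statistics for one numeric variable. */
/* sashelp.class and Height are illustrative choices only.         */
proc means data=sashelp.class
           mean stderr median mode std var kurt skew range min max sum n;
    var Height;
run;
```

The keyword N corresponds to "Count" in the list; the remaining keywords map one to one onto the other entries.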


The following box plot shows several summary statistics. A box plot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes indicating variability outside the upper and lower quartiles. Outliers may be plotted as individual points. Box plots are non-parametric and display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.

Descriptive statistics is at the heart of all quantitative analysis. So how do we describe data? There are two ways: measures of central tendency and measures of variability, or dispersion.

Population mean – the mean, or average, is calculated by finding the sum of the study data and dividing it by the total number of data values.

Mean = sum of the observations / number of observations

Sample mean

The mean of a sample of n observations x1, x2, x3, …, xn is defined as

$\bar{x} = (x_1 + x_2 + \cdots + x_n)/n$

e.g. for 7, 3, 9, -2, 4, 6: $\bar{x} = (7+3+9-2+4+6)/6 = 27/6 = 4.5$

Die (population): 1 2 3 4 5 6

Throw 5 times (sample): 2 3 1 5 2

Median – is the middle value in a set of data when the n observations are arranged in order of magnitude.

When n is odd: median = middle value
When n is even: median = mean of the two middle values

7 3 9 4 6 median=?

7, 3, 9, -2, 4, 6 median=?

Mode – is the value that appears most frequently in the set of data.

Data set: 28 60 26 32 30 26 29

Mode = 26

Data set: 28 60 29 30 33

Mode: no mode (does not exist), since no value repeats.

• The mean and median can be used with numeric data. The mode can be used with both numerical and nominal data.

• The mean is the preferred measure of central tendency since it considers all of the numbers in a dataset; however, the mean is extremely sensitive to outliers.

• The median is preferred in cases where there are outliers, since the median only considers the middle values.

Example: calculate the mean, median and mode for the following data.

8, 4, 9, 3, 5, 8, 6, 6, 7, 8 and 10.

Mean: (8 + 4 + 9 + 3 + 5 + 8 + 6 + 6 + 7 + 8 + 10) / 11 = 74 / 11 ≈ 6.73. The mean is 6.73.

Median: In a data set of 11 values, the median is the value in the sixth place of the ordered data: 3, 4, 5, 6, 6, 7, 8, 8, 8, 9, 10. The median is 7.

Mode: The number 8 appears more often than any other number. The mode is 8.

Measures of Dispersion
Range, variance, standard deviation

Range – is the difference between the largest number and the smallest number.

2 5 7 9 10 11 15 range=?

0 25 7 9 11 15 138 range=?

The range is very sensitive to outliers.

Variance – is a measure of the average squared distance of a set of data from its mean. The higher the variance, the more spread out your data are.

The variance formula (for population):

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

The variance formula (for sample):

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Example: Find the sample variance for the following set of numbers:

12, 15, 18, 20, 30

Step 1: Compute the mean: (12+15+18+20+30)/5 = 19

Step 2: Subtract the mean from each value:
12-19  15-19  18-19  20-19  30-19
-7     -4     -1     1      11

Step 3: Square each deviation: 49 16 1 1 121

Step 4: Sum the squares: 49+16+1+1+121 = 188

Step 5: n-1 = 5 – 1 = 4

Step 6: Variance = 188 / 4 = 47
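
The hand calculation can be cross-checked in SAS. The sketch below assumes a small data set named nums with a single variable x (both names are illustrative); it uses the VARDEF= option of PROC MEANS to switch between the sample divisor (n-1) and the population divisor (n).

```sas
/* The five example values; data set and variable names are hypothetical */
data nums;
    input x @@;
    datalines;
12 15 18 20 30
;
run;

/* VARDEF=DF (the default) divides by n-1: sample variance */
proc means data=nums n mean var std vardef=df;
    var x;
run;

/* VARDEF=N divides by n: population variance */
proc means data=nums n mean var std vardef=n;
    var x;
run;
```

The first step should reproduce the sample variance of 47 found above; the second divides the same sum of squares, 188, by 5 instead of 4.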

Practice:

Consider two data sets with 5 observations:

A: 8, 9, 10, 11, 12

B: 4, 7, 10, 13, 16

Find the variance for them.

Solution:

Both sets have mean 10. Treating each set as a population (dividing by N = 5):

Va = 10/5 = 2
Vb = 90/5 = 18

Treating them as samples (dividing by n - 1 = 4) gives 2.5 and 22.5 instead.

Shortcut formula (sample variance):

$s^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n-1}$

Standard deviation – is the square root of variance.

Sample std: s = sqrt(s^2)

Population std: σ = sqrt(σ^2)

e.g. Variance = 47

std: s = sqrt(47) ≈ 6.86

The unit attached to the variance is the square of the unit attached to the original observations. Statisticians often want a measure of variability that is expressed in the same units as the original observations; the standard deviation is such a measure.

Coefficient of variation:
Defined as CV = s / $\bar{x}$ (standard deviation divided by the mean)

Mean=19 s=6.86

Then CV = 6.86/19 = .36

If mean=12 s=6.5

CV = s/mean = 6.5/12 = .54

The standard deviation is a measure of dispersion. According to Chebychev’s rule, for any data set:

• At least 75% of the data lie within two standard deviations to either side of the mean, that is, between mean - 2s and mean + 2s.

• At least 89% of the data lie within three standard deviations to either side of the mean, that is, between mean - 3s and mean + 3s.

For the example:

Data: 12, 15, 18, 20, 30

5*75% = 3.75, so at least 4 of the 5 data values lie in the range (mean-2s, mean+2s)

Mean=19 s=6.86

2s=13.72

Mean-2s=19-13.72=5.28
Mean+2s=19+13.72=32.72

(5.28, 32.72)

Here all five values lie in this interval.

• In general, for any number k > 1, at least (1 - 1/k^2) * 100% of the data lie within k standard deviations to either side of the mean, that is, between mean - k*s and mean + k*s.

Example:

The average bill amount at a local restaurant is $36.42 with a standard deviation of $8.15. What is the minimum percentage of bills with amounts between $15.23 and $57.61?

Solution:

First find k:

36.42-15.23=21.19
57.61-36.42=21.19

k*s=21.19

8.15k=21.19

k=2.6

Percentage = 1-(1/k^2) = 1-(1/6.76) = 1-.148 = .852 = 85.2%
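
The same Chebychev calculation can be scripted in a DATA step, which makes it easy to rerun with other means, standard deviations or limits. This is only a sketch; the variable names (mean, s, lower, upper, k, min_pct) are illustrative assumptions.

```sas
/* Chebychev's rule for the restaurant example; writes results to the log */
data _null_;
    mean  = 36.42;                     /* average bill */
    s     = 8.15;                      /* standard deviation */
    lower = 15.23;
    upper = 57.61;
    k = (upper - mean) / s;            /* number of standard deviations */
    min_pct = (1 - 1/k**2) * 100;      /* Chebychev lower bound, in percent */
    put k= min_pct=;
run;
```

The log should show k equal to 2.6 and a minimum percentage of about 85.2.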

Empirical Rule:

If a sample of observations has a mound-shaped distribution, the interval

(mean-s, mean+s) contains approximately 68% of the observations,
(mean-2s, mean+2s) contains approximately 95% of the observations,
(mean-3s, mean+3s) contains virtually all of the observations.

Z-score is a frequently used quantity in statistical analysis. The z-score, or standard score, for a data value is the number of standard deviations that the data value is away from the mean of the data set. The sample z-score is computed by using the formula

$z = (x - \bar{x})/s$

where $\bar{x}$ and s are, respectively, the mean and standard deviation of the sample data. A negative z-score indicates that a data value is smaller than the mean, whereas a positive z-score indicates that a data value is larger than the mean.
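
In SAS, one way to obtain z-scores is PROC STANDARD, which rescales a variable to a requested mean and standard deviation. The sketch below reuses the five values from the variance example; the data set names nums and zscores and the variable x are illustrative assumptions.

```sas
/* Example values from the variance example; names are hypothetical */
data nums;
    input x @@;
    datalines;
12 15 18 20 30
;
run;

/* Rescale x to mean 0 and standard deviation 1, i.e. z = (x - mean)/s */
proc standard data=nums mean=0 std=1 out=zscores;
    var x;
run;

proc print data=zscores;
run;
```

In the output data set, x holds the z-scores in place of the original values. For instance, the value 30 has z = (30 - 19)/6.86, which is about 1.60.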

Descriptive measures that indicate the relative position of a data value are called measures of relative standing. The z-score can be used as a measure of relative standing. If a data value has a large positive z-score, then it is larger than most of the other data values in the data set; if a data value has a large negative z-score, then it is smaller than most of the other data values in the data set; and if a data value has a z-score near 0, then it is located near the mean of the data set. We can use z-scores to compare the relative standings of two data values from different data sets.

Other descriptive measures:

• Percentiles – divide a data set into 100 equal parts.
• Deciles – divide a data set into 10 equal parts.
• Quartiles – divide a data set into 4 equal parts.

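Percentile formula: the position (rank) R of the P-th percentile in an ordered sample of n observations is

$R = \frac{P}{100}(n+1)$

e.g.

For a sample with 10 observations, after sorting the data, the position of the 25th percentile is

R=(25/100)*(10+1)=.25*11=2.75

that is, three quarters of the way from the 2nd to the 3rd ordered value.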

Quartiles definition:

Arrange the data in increasing order and determine the median.

• The first quartile (Q1) is the median of the data lying at or below the median of the entire data set.

• The second quartile (Q2) is the median of the entire data set.

• The third quartile (Q3) is the median of the data lying at or above the median of the entire data set.

Example:

1, 2, 3, 4, 5, 6, 7, 8, 9

Q1=?
Q2=5
Q3=?

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Q1=?
Q2=5.5
Q3=?

Practice:

The A.C. Nielsen Company publishes data on the TV viewing habits of Canadians. A sample of 20 people yields the following weekly viewing times, in hours. Determine and interpret the quartiles for these data.

25 41 27 32 43 66 35 31 15 5 34 26 32 38 16 30 38 30 20 21

Solution:

1. To find the quartiles, we arrange the data in increasing order as below:

5 15 16 20 21 25 26 27 30 30 31 32 32 34 35 38 38 41 43 66

2. The number of pieces of data is 20, so the position of the median is (20+1)/2 = 10.5, halfway between the 10th and 11th data values (30 and 31) in the ordered list. Thus the median of the entire data set is (30+31)/2 = 30.5.

3. The first quartile (Q1) is the median of the data lying at or below the median of the entire data set. The data lying at or below the median of the entire data set are

5 15 16 20 21 25 26 27 30 30

This data set has 10 pieces of data. Its median is at position (10+1)/2 = 5.5. Hence the first quartile is (21+25)/2 = 23; that is, Q1 = 23.

4. The third quartile (Q3) is the median of the data lying at or above the median of the entire data set. The data lying at or above the median of the entire data set are

31 32 32 34 35 38 38 41 43 66

This data set has 10 pieces of data. Its median is at position (10+1)/2 = 5.5. Hence the third quartile is (35+38)/2 = 36.5; that is, Q3 = 36.5.

5. We conclude that 25% of the viewing times are less than 23 hours, 25% are between 23 and 30.5 hours, 25% are between 30.5 and 36.5 hours, and 25% are greater than 36.5 hours.
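
The quartiles can also be requested from PROC MEANS. The sketch below assumes a data set named tv with a variable hours (both names are illustrative). SAS supports several quantile definitions through the QNTLDEF= option, so on other data sets its results may differ slightly from the median-of-halves method used in the solution.

```sas
/* Weekly TV viewing times from the practice problem; names are hypothetical */
data tv;
    input hours @@;
    datalines;
25 41 27 32 43 66 35 31 15 5 34 26 32 38 16 30 38 30 20 21
;
run;

/* Five-number summary plus the interquartile range (QRANGE) */
proc means data=tv n min q1 median q3 max qrange;
    var hours;
run;
```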

Inter-Quartile Range:

The interquartile range (IQR) is defined as the difference between the third and first quartiles; that is,

IQR = Q3 – Q1.

Thus, roughly speaking, the IQR gives the range of the middle 50% of the data.

We will define an outlier to be a value located at a distance of more than 1.5(IQR) from the box in a box plot.

Inner and Outer Fences: Possible and Probable Outliers

The inner fences and outer fences are defined as follows:

inner fences: Q1 – 1.5 * IQR and Q3 + 1.5 * IQR

outer fences: Q1 – 3 * IQR and Q3 + 3 * IQR

Data values that lie between the inner and outer fences are considered possible outliers; those that lie outside the outer fences are considered probable outliers.
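
As a worked illustration, the fences for the TV viewing data (Q1 = 23, Q3 = 36.5) can be computed in a short DATA step. This is a sketch only; the variable names are illustrative assumptions.

```sas
/* Inner and outer fences for the TV viewing example; results go to the log */
data _null_;
    q1 = 23;
    q3 = 36.5;
    iqr = q3 - q1;                  /* 13.5 */
    inner_low  = q1 - 1.5 * iqr;    /* 2.75 */
    inner_high = q3 + 1.5 * iqr;    /* 56.75 */
    outer_low  = q1 - 3 * iqr;      /* -17.5 */
    outer_high = q3 + 3 * iqr;      /* 77 */
    put iqr= inner_low= inner_high= outer_low= outer_high=;
run;
```

With these fences, the viewing time of 66 hours lies between the upper inner fence (56.75) and the upper outer fence (77), so it would be classed as a possible outlier; no value falls outside the outer fences.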

[Figure: a box plot with its interquartile range shown alongside the probability density function (pdf) of a normal N(0, σ²) population.]

The five-number summary of a data set consists of the minimum, maximum, and quartiles written in increasing order: Min, Q1, Q2, Q3, Max.

A boxplot, also called a box-and-whisker diagram, is based on the five-number summary and can be used to provide a graphical display of the center and variation of a dataset.

The plot marks the following values:

Largest
Upper quartile
Median
Lower quartile
Smallest

Cases with values more than 3 times the length of the box below or above the box are extreme values (extreme outliers).

Cases with values between 1.5 and 3 box lengths below or above the box are outliers.

/* Creating Box Plots */

Data bikerace;
    Input Division $ NumberLaps @@;
    Datalines;
Adult 44 Adult 33 Youth 33 Masters 38 Adult 40
Masters 32 Youth 32 Youth 38 Youth 33 Adult 24
Masters 33 Adult 44 Youth 35 Adult 49 Adult 38
Adult 39 Adult 42 Adult 32 Youth 42 Youth 70
Masters 33 Adult 33 Masters 32 Youth 37 Masters 40
;
Run;

/* Vertical box plot of all laps */
Proc sgplot data=bikerace;
    Vbox numberlaps;
Run;

/* Horizontal box plot of all laps */
Proc sgplot data=bikerace;
    hbox numberlaps;
Run;

/* Creating Box Plots */

/* One box per division */
Proc sgplot data=bikerace;
    Vbox numberlaps / category=division;
Run;

/* EXTREME extends the whiskers to the minimum and maximum; outliers are not marked */
Title 'Create a Box plot Using Bikerace dataset';
Proc sgplot data=bikerace;
    Vbox numberlaps / category=division extreme;
Run;

Exercise 1: Sorting - Ordering

A data file called auto (shown in the following image) has a duplicate record for the BMW. How would you sort the data file by foreign and rep78 and also remove the duplicate record for the BMW in SAS?

Exercise 2:

Given data like the following, how would you aggregate the data by region in SAS?

Exercise 3:

What does Chebychev’s rule say about the percentage of data in a dataset that lies within

a. 1.25 standard deviations to either side of the mean?

b. 3.5 standard deviations to either side of the mean?

c. 5 standard deviations to either side of the mean?

Exercise 4:

The table below contains data on the ages of the two teams involved in game 1 of the 2010 National League Division Series (NLDS). Is there a relationship between the ages of the players on the teams and the outcome of the NLDS?

(1) Determine the lower and upper quartiles of the ages for the Phillies. Then find the IQR of the Phillies’ ages.

(2) Determine the lower and upper quartiles of the ages for the Reds. Then find the IQR of the Reds’ ages.

(3) Which team has the greater age range? Which has the greater IQR?

Homework:

Create a box plot, a histogram and descriptive statistics from sashelp.cars.

