
"Probability and Statistics (For Engineering) 235 M: Summer Session 2019/2020

This document discusses descriptive statistics, which involves collecting, organizing, summarizing, and presenting data. It defines key terms like population, sample, variable, and parameter. It also covers types of data like qualitative vs quantitative. The main focus is on describing numerical data using tables, graphs like bar charts and pie charts, and grouped frequency distributions. Guidelines are provided for constructing grouped frequency distributions, including determining the number of classes, class width, lower and upper class limits, and tallying the frequencies.

Uploaded by

Maram Batayha

M235 “Probability and Statistics (for Engineering)”

Summer Session 2019/2020

Chapter 1

Descriptive Statistics
Introduction
Definitions:
1. Descriptive Statistics: The collection, organization,
summarization, and presentation of data. Descriptive
statistics are useful in data screening.
2. Population: A population is the entire collection of
subjects from which we may collect data. It is the entire
group we are interested in.
3. Sample: A sample is a group of subjects selected from a
population.
4. Variable: A variable is any characteristic of an
individual that takes different values.
5. Data: Data are a collection of numbers or facts used
as a basis for drawing conclusions.
6. Raw Data: Raw data are data collected in their original
form that have not been organized.
7. Parameter: A parameter is a measure computed from
population data.
8. Statistic: A statistic is a measure computed from
sample data.
9. Outlier: An outlier is an observation that is
distant from most of the other data values.
10. Inferential Statistics: Generalizing (inferring) from a sample to
the population in the form of estimation, hypothesis testing,
and determining relationships between variables.

Types of Data
Qualitative (categorical) data:
data that represent characteristics, such as a person’s gender or
blood type.
Quantitative (numerical) data:
data that assume numerical values, such as weight or blood
pressure.

Figure 1: Types of data

Quantitative (Numerical) variables

1. Discrete variables: variables which assume a finite or countable
number of possible values. Usually obtained by counting, e.g.
number of children per family.

2. Continuous variables: variables which can assume any value in an
interval, so the number of possible values is uncountably infinite.
Usually obtained by measurement, e.g. height, weight, strength.

Descriptive Statistics
Part 1: Describing Data with Tables and Graphs

Categorical or Qualitative Data

A table (frequency distribution) shows categories and
frequencies.
Example: The blood types of a sample of 12 patients are given as
follows:
A, B, B, O, AB, AB, A, A, B, O, AB, A

Table 1: Frequency Distribution of blood type

Blood Type   Tally   Frequency   Relative Frequency   Percentage
A            ////    4           0.33                 33%
B            ///     3           0.25                 25%
AB           ///     3           0.25                 25%
O            //      2           0.17                 17%
Total                12          1.00                 100%
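
The tallying behind Table 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the course material; the function name `frequency_table` is an assumption made for this example.

```python
from collections import Counter

# Blood types of the 12 patients from the example above.
blood_types = ["A", "B", "B", "O", "AB", "AB", "A", "A", "B", "O", "AB", "A"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)          # tally each category
    total = len(values)               # total frequency
    return {cat: (freq, freq / total) for cat, freq in counts.items()}

table = frequency_table(blood_types)
for category, (freq, rel) in table.items():
    # Columns: category, frequency, relative frequency, percentage.
    print(f"{category:<3} {freq:>2}  {rel:.2f}  {rel:.0%}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was missed.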

Graphs: Visual investigation of data is important and
very useful!
1. A bar chart is used to represent the frequency distribution of a
categorical variable.

Figure 2: Bar chart of blood type

2. A pie graph is a circle that is divided into sections. The
angle for each section is

$$\text{angle} = \frac{\text{frequency of the category}}{\text{total frequency}} \times 360^\circ$$
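
As a rough illustration of the angle formula, the following Python sketch computes the section angles for the blood type frequencies of Table 1. The helper name `pie_angles` is hypothetical.

```python
def pie_angles(frequencies):
    """Map each category to its pie-chart angle in degrees."""
    total = sum(frequencies.values())
    # angle = (frequency of the category / total frequency) * 360
    return {cat: freq * 360 / total for cat, freq in frequencies.items()}

# Frequencies taken from Table 1 (blood type).
angles = pie_angles({"A": 4, "B": 3, "AB": 3, "O": 2})
for category, angle in angles.items():
    print(category, angle)
```

The angles of all sections add up to 360 degrees.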

Figure 3: Pie chart of blood type

Describing Numerical Data with Tables and Graphs
If a discrete variable takes only a few different values, the
data can be summarized in the same way as qualitative data.
Example: Consider the following data: quiz scores out of 4
for 30 students.
0, 1, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0, 0,
4, 4, 4, 4, 4
The distinct values are: 0, 1, 2, 3, 4

Table 2: Frequency Distribution of Scores

Score   Frequency   Relative Frequency   Percentage
0       5           0.17                 17%
1       6           0.20                 20%
2       9           0.30                 30%
3       5           0.17                 17%
4       5           0.17                 17%
Total   30          1.00                 100%

Figure 4: Bar Chart of Scores (horizontal axis: Score; vertical axis: Count)

Figure 5: Pie Chart of Scores (sections for scores 0 to 4: 16.7%, 20.0%, 30.0%, 16.7%, 16.7%)

Dot Plot

Figure 6: Dot plot of Scores (horizontal axis: Score)

Frequency Distribution for Grouped Data
If discrete data have many different values, or the
variable is continuous, the data must be grouped into
classes before the frequency table can be formed.

Table 3: Example of grouped frequency distribution

Classes Frequency
100-104 2
105-109 8
110-114 18
115-119 13
120-124 7
125-129 1
130-134 1
Total 50

Frequency distribution:
*Class limits: the smallest and largest data values that can
belong to each class.
*Lower class limit: the smallest value that can belong to a
given class.
*Upper class limit: the largest value that can belong to the
class.
*Class width: the difference between the lower class limits of
two consecutive classes (equivalently, the upper class boundary
minus the lower class boundary). In Table 3 the width is
105 - 100 = 5.
*Class boundaries: the values that separate one class in a grouped
frequency distribution from the next. The boundaries have one more
decimal place than the raw data and therefore do not appear in
the data. There are no gaps between classes. If the observations
are given to the nearest integer, we subtract 0.5 from the lower
class limit to get the lower class boundary and add 0.5 to the
upper class limit to get the upper class boundary.
*Class midpoint: the number in the middle of the class. It is
found by adding the upper and lower limits and dividing by
two.
Guidelines for classes
1. The number of classes is usually between 5 and 20.
2. The class width should preferably be an odd number. For
integer class limits, this guarantees that the class midpoints
are integers instead of decimals.
3. The classes must be mutually exclusive. This means that
no data value can fall into two different classes.
4. The classes must be all inclusive. This means that all
data values must be included.
5. The classes must be continuous. There are no gaps in a
frequency distribution.
6. The classes must be equal in width. The exception is
the first and last class: it is possible to have a “below
…” or “… and above” class, which is often used with ages.
STEPS
1. Find the largest and smallest values.
2. Compute the Range = Maximum Value - Minimum Value.
3. Select the number of classes desired, usually between 5
and 20.
4. Find the class width by dividing the range by the number
of classes and rounding up.
5. Select a starting point for the lowest class limit (the smallest
value or any convenient number).
6. Add the class width to the lower limit of the first class to
get the lower limit of the second class. Keep adding the class
width until you have the lower limits of all the classes.
7. Subtract one from the lower limit of the second class to get
the upper limit of the first class. Then add the class width to
each upper limit to get all the upper limits.
8. Tally the data.
9. Find the numerical frequencies from the tallies.
Example: The following data represent the record
high temperatures in °F for each of the 50 states.
Construct a grouped frequency distribution for the data
using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 110 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
The procedure for constructing a grouped frequency
distribution for the above numerical data is as follows:
Step 1: Take the number of classes = 7.
Step 2: Largest value = 134, smallest value = 100.
range = 134 - 100 = 34
Step 3: class width = range / number of classes,
class width = 34/7 = 4.9 ≈ 5 (rounded up)
Step 4: Start with 100 as the lower class limit of the first class.
Step 5: Add the class width 5 to the lower limit of the first class
to get the lower limit of the second class, which is 105.
Keep adding the class width until there are 7
classes.
Step 6: Subtract one from the lower limit of the second class,
which is 105, to get the upper limit of the first class,
which is 104. Then add the class width 5 to each upper
limit to get all the upper limits.
Step 7: Tally the data.
Step 8: Find the numerical frequencies from the tallies.
(See Table 3.)
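
The class-limit construction in the steps above can be sketched in Python as follows. This is a minimal illustration that assumes integer-valued data (so consecutive limits differ by one); the function name `class_limits` and its `start` parameter are hypothetical.

```python
import math

def class_limits(minimum, maximum, num_classes, start=None):
    """Build (lower, upper) class limits for integer-valued data."""
    data_range = maximum - minimum
    # Class width = range / number of classes, rounded up.
    width = math.ceil(data_range / num_classes)
    # Starting point: the smallest value unless a convenient number is given.
    lower = minimum if start is None else start
    classes = []
    for _ in range(num_classes):
        # Upper limit is one less than the next class's lower limit.
        classes.append((lower, lower + width - 1))
        lower += width
    return width, classes

# Temperature example: smallest value 100, largest value 134, 7 classes.
width, classes = class_limits(100, 134, 7)
print(width)
for lower, upper in classes:
    # Class boundaries for integer data: subtract/add 0.5.
    print(f"{lower}-{upper}  boundaries {lower - 0.5}-{upper + 0.5}")
```

Note that if the range divides exactly by the number of classes, the last class built this way stops short of the maximum; in that case a larger width or one more class is needed.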

Graphical Representation
The frequency distribution of grouped data can be drawn like a
bar graph: it displays the data using vertical bars whose
heights represent the frequencies. If the class limits are used,
however, there are gaps between the rectangles on the horizontal axis.
Using the class boundaries instead leaves no gaps between the
rectangles on the horizontal axis. This graph is called a
histogram.
A histogram graphically summarizes center, spread, skewness, and
outliers.
Skewness (asymmetry) of data: how concentrated the data are at
the low or high end of the scale.
Kurtosis (peakedness) of data: how concentrated the data are
around a single value.
Example: The following table represents the frequency
distribution of the temperature data using class boundaries:

Table 4: frequency distribution


Class boundaries Frequency
99.5-104.5 2
104.5-109.5 8
109.5-114.5 18
114.5-119.5 13
119.5-124.5 7
124.5-129.5 1
129.5-134.5 1

Figure 7: Histogram of Temperature Data (horizontal axis: Temperature, labeled at the class midpoints 102 to 132; vertical axis: Frequency)

Relative Frequency Distribution: Relative frequency is the
frequency divided by the total frequency.

Table 5: Relative frequency

Class limits Frequency Relative Frequency Percentage


100-104 2 2/50=0.04 4%
105-109 8 8/50=0.16 16%
110-114 18 18/50=0.36 36%
115-119 13 13/50=0.26 26%
120-124 7 7/50=0.14 14%
125-129 1 1/50=0.02 2%
130-134 1 1/50=0.02 2%
Total 50 1.00 100%

Frequency Polygon: A line graph. The frequency is placed
along the vertical axis and the class midpoints are placed
along the horizontal axis. The points are connected with line
segments.
Example: Consider the following frequency table:

Table 6: The age of the best actresses

Class limits Class Boundaries Frequency Midpoint


20-29 19.5-29.5 23 24.5
30-39 29.5-39.5 21 34.5
40-49 39.5-49.5 21 44.5
50-59 49.5-59.5 4 54.5
60-69 59.5-69.5 1 64.5
70-79 69.5-79.5 1 74.5

Cumulative Frequency Polygon:
*The cumulative frequency of a class is the number of data values
up to and including those in that class.
*It is a line graph (rather than a bar graph): at each upper real
class limit, plot a point whose height is
the corresponding cumulative frequency.

Table 7: Cumulative frequency of the above example

Real Limits   Freq.   Upper Real Class Limit   Cum. Freq.   CF%
9.5-19.5      0       19.5                     0            0%
19.5-29.5     23      29.5                     23           32.4%
29.5-39.5     21      39.5                     44           62.0%
39.5-49.5     21      49.5                     65           91.5%
49.5-59.5     4       59.5                     69           97.2%
59.5-69.5     1       69.5                     70           98.6%
69.5-79.5     1       79.5                     71           100%
79.5-89.5     0       89.5                     71           100%
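
The cumulative column of Table 7 is a running total of the frequencies. A small Python sketch, using the class frequencies of Table 6 (variable names are illustrative only):

```python
from itertools import accumulate

# Class frequencies from Table 6 (ages 20-29 up to 70-79).
frequencies = [23, 21, 21, 4, 1, 1]
upper_real_limits = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5]

# Running total: each entry counts all values up to that upper real limit.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cum_percent = [round(100 * c / total, 1) for c in cumulative]

for limit, cf, pct in zip(upper_real_limits, cumulative, cum_percent):
    print(limit, cf, f"{pct}%")
```

The pairs (upper real class limit, cumulative frequency) are the points that the cumulative frequency polygon connects.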

Frequency Curve
A smooth curve which corresponds to the limiting case of a
histogram computed for a frequency distribution of a
continuous distribution as the number of data points becomes
very large.

A stem-and-leaf plot is a data plot that uses part of a data value
as the stem and the rest of the data value as the leaf to form groups
or classes. This is very useful for sorting data quickly. It is
similar to a histogram but preserves the original data points.

Example: Consider the following 11 numbers: 12, 13, 21, 27,
33, 34, 35, 37, 40, 40, 41.
The frequencies are given in Table 8.
Table 8: The frequency of the 11 numbers
Class limits Frequency
10-19 2
20-29 2
30-39 4
40-49 3

Stem   Leaf
1      23
2      17
3      3457
4      001
*The “stem” is the left-hand column, which contains the tens
digits.
*The “leaves” are the lists in the right-hand column, showing
all the ones digits for each of the tens, twenties, thirties, and
forties.

For example: “4|0” means 40
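
A stem-and-leaf plot for two-digit numbers can be sketched in Python by splitting each value into its tens digit and ones digit. The function name `stem_and_leaf` is an assumption for this illustration, and the sketch only handles non-negative integers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by tens digit (stem); the leaves are the ones digits."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 40 -> stem 4, leaf 0
        plot[stem].append(leaf)
    return dict(plot)

numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
plot = stem_and_leaf(numbers)
for stem, leaves in plot.items():
    print(stem, "|", "".join(str(leaf) for leaf in leaves))
```

The number of leaves on each stem equals the class frequency in Table 8.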

Shapes of Data Distributions:
Symmetric bell-shaped
Example: Men’s Heights


Skewness refers to asymmetry.

Right or Positive-Skewed: the bulk
of the data is at the left and the right tail
is longer (the tail extends to the right).
Example: Personal income in the U.S.


Left or Negative-Skewed: the bulk of the data
is at the right and the left tail is longer
(the tail extends to the left).
Example: Exam scores with a few
students doing poorly.

Uniform: all data values are equally
represented.

Descriptive Statistics
Part 2
Describing Data with
Numerical Measures
*Measures of Central Tendency
-Mean
-Median
-Mode
*Measures of Dispersion
-Range
-Variance
-Standard deviation
-Coefficient of variation (C.V)
*Measures of Position
-Percentiles
-Quartiles

Measures of Central Tendency

A measure of central tendency is a value
used to locate the center of a data set or
distribution.
1- Mean:
Mean (average): the sum of all data values
divided by the number of values in the data set.
The mean provides a measure of center for a data
set. The mean of a sample data set is denoted by
$\bar{x}$ (x-bar) and the mean of a population data set by
the Greek letter $\mu$ (mu).
Sample data set
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n = number of items in the sample.
Population data set
$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N = number of items in the population.

Example: Consider the following sample data:
46, 54, 42, 46, 32

$$\bar{x} = \frac{46 + 54 + 42 + 46 + 32}{5} = \frac{220}{5} = 44$$
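Weighted Mean:
In this case, we attach a numerical weight (w)
to each value and calculate the mean as
follows:
$$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$$
Trimmed mean: the mean computed after removing a
fixed percentage of the smallest and largest values,
which reduces the influence of outliers.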
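
The mean and weighted mean formulas translate directly into code. The sketch below is only illustrative: the function names are assumptions, and the scores and weights in the weighted example are hypothetical values, not data from the course.

```python
def mean(values):
    """Sum of all data values divided by the number of values."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Sample data from the example above.
sample = [46, 54, 42, 46, 32]
sample_mean = mean(sample)

# Hypothetical illustration: two scores weighted by credit hours 3 and 1.
scores = [80, 90]
credit_hours = [3, 1]
weighted = weighted_mean(scores, credit_hours)

print(sample_mean, weighted)
```

With equal weights the weighted mean reduces to the ordinary mean.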
2- Median:
Median: the value which separates the
largest 50% of data values from the lowest
50%.
To calculate the median, place the data values in
numerical order. If n is odd, the middle value is
the median. If n is even, the mean of the two
middle values is the median.

In general:
Arrange the data from smallest to largest.
If n is odd, the median is the observation in
position (n+1)/2.
If n is even, the median is the average of the
observations in positions (n/2) and (n/2)+1.
Example 1:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15,
15, 16, 18, 20
Determine the median.
Arrange the 16 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18,
20, 26
Since n=16 is even, the median is the average of the
observations in positions (n/2)=(16/2)=8
(which is 14) and (n/2)+1=(16/2)+1=9 (which
is 15).
$$\text{Median} = \frac{14 + 15}{2} = 14.5$$

Example 2:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18
Determine the median.
Arrange the 15 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18, 26
Since n=15 is odd, the median is the
observation in position (n+1)/2=(15+1)/2=8,
which is 14.
Median = 14
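
The position rule for the median can be sketched as a short Python function and checked against both examples. The function name `median` is chosen for this illustration; Python's standard library offers `statistics.median` with the same behavior.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n+1)/2 in 1-based counting is index n//2.
        return ordered[mid]
    # n even: average of positions n/2 and n/2 + 1 (indices mid-1 and mid).
    return (ordered[mid - 1] + ordered[mid]) / 2

example1 = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18, 20]
example2 = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18]
median1 = median(example1)
median2 = median(example2)
print(median1, median2)
```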

3- Mode:
The mode is the most frequent data value.
There may be no mode if no value appears
more often than any other. There may also be two
modes (bimodal), or more than two modes
(multi-modal).
The mode is useful for qualitative variables.
Example 1: 3, 7, 8, 8, 1
Mode = 8
Example 2: 1, 1, 9, 8, 6, 5, 10, 8
Modes = 1 and 8
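
A possible way to find all modes in Python is to count each value and keep those with the highest count. The function name `modes` is an assumption for this sketch; it returns an empty list when every value occurs equally often, matching the "no mode" case described above.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of most frequent values, or [] if there is no mode."""
    counts = Counter(values)
    # No mode when several distinct values all occur equally often.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

modes1 = modes([3, 7, 8, 8, 1])
modes2 = modes([1, 1, 9, 8, 6, 5, 10, 8])
print(modes1, modes2)
```

Because it only counts occurrences, the same function works for qualitative data such as blood types.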

Which measures to use?
*The mean is the most commonly used measure of
central tendency.
*One drawback of the mean is that it is
heavily influenced by a few very high or
very low data values (outliers or
skewness). In these cases it is more
common to use the median.
*The mode has the advantage that it can be
used even if the data set contains
only qualitative data. A disadvantage is that a
data set may not have a mode.
*For unimodal distributions which are
almost symmetric, the best measure of
location is the mean. Moreover, the
mean, median, and mode are almost the
same.
*For unimodal distributions which have
moderate skewness, the median is the most
appropriate measure of location.

Mean for
Grouped Data
Example:
Frequency Distribution for
Temperature Data

Classes   Frequency (f)   Midpoint (x)   f·x
100-104   2               102            204
105-109   8               107            856
110-114   18              112            2016
115-119   13              117            1521
120-124   7               122            854
125-129   1               127            127
130-134   1               132            132
Total     n=50                           5710

Here f is the frequency and x is the midpoint of each
class.

$$\text{mean} = \frac{\sum f \cdot x}{n}$$

Mean = 5710 / 50 = 114.2
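
The grouped-data mean can be reproduced with a short Python sketch that uses the class limits and frequencies from the temperature table. The function name `grouped_mean` is an assumption for this illustration.

```python
# (lower limit, upper limit, frequency) for each class of the temperature data.
temperature_classes = [
    (100, 104, 2),
    (105, 109, 8),
    (110, 114, 18),
    (115, 119, 13),
    (120, 124, 7),
    (125, 129, 1),
    (130, 134, 1),
]

def grouped_mean(classes):
    """Mean of grouped data: sum of f * midpoint divided by n."""
    n = sum(f for _, _, f in classes)
    # Midpoint = (lower limit + upper limit) / 2.
    total = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total / n

mean_temperature = grouped_mean(temperature_classes)
print(mean_temperature)
```

The result is an approximation of the mean of the raw data, because every value in a class is represented by the class midpoint.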
Measures of Dispersion or Variability
How spread out are the data?
Example: Quiz scores, group 1: 3, 3, 4, 4, 4, 4, 4,
4, 5, 5, 5
Example: Quiz scores, group 2: 1, 3, 4, 5, 6, 6, 7,
8, 8, 9, 10

Figure: Dotplot of score vs group (horizontal axis: score; one row of dots per group)

Range:
The range for a set of data is the difference
between the largest and smallest values.
Variance:
The variance is a measure of the amount that a
set of data varies about its mean.
The sum of deviations from the mean for
any given sample is always zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\,\frac{\sum_{i=1}^{n} x_i}{n} - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For this reason we use squared deviations.

Population Data Set
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Data Set
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Dividing by (n-1) makes $s^2$ an unbiased estimate of $\sigma^2$.
Sample Standard Deviation: $s = \sqrt{s^2}$

Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$

Standard deviation is the positive square root
of the variance. The standard deviation
is therefore a sort of “average distance
or deviation” of each point from the mean.
Result: computing formula
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
Example: Consider a sample of size n=5 with
data values as follows:
10, 20, 12, 17, 16
Compute the sample variance $s^2$.

Note: $\sum_{i=1}^{5} x_i = 75$, $\bar{x} = 15$

$x$        $x - \bar{x}$   $(x - \bar{x})^2$
10         10-15=-5        25
20         20-15=5         25
12         12-15=-3        9
17         17-15=2         4
16         16-15=1         1
Total=75   0               64

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{64}{5 - 1} = 16$$

The sample standard deviation is s=4.

Alternatively, use the computing formula with
$\sum_{i=1}^{5} x_i = 75$, $\bar{x} = 15$, $\sum_{i=1}^{5} x_i^2 = 1189$:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{1189 - \frac{(75)^2}{5}}{5 - 1} = \frac{64}{4} = 16$$

Sample Standard Deviation s=4
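
Both routes to the sample variance (the definition and the computing formula) can be checked with a small Python sketch. The function names are assumptions made for this illustration.

```python
import math

def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_variance_computing(values):
    """Computing formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

data = [10, 20, 12, 17, 16]
s2 = sample_variance(data)
s2_computing = sample_variance_computing(data)
s = math.sqrt(s2)          # sample standard deviation
print(s2, s2_computing, s)
```

The two functions agree for this sample; for large values with small spread, the definitional form is usually numerically safer than the computing formula.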

COEFFICIENT OF VARIATION (CV)
The coefficient of variation expresses the
standard deviation as a percentage of the mean:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\%$$
Remark: Because it is unit-free, the coefficient of
variation is good for comparing the variability of
data sets with different units or different means.

Example:

                     Sample 1     Sample 2
Age                  25 years     11 years
Mean weight          145 pounds   80 pounds
Standard deviation   10 pounds    10 pounds

Solution:
For the 25-year-olds:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\% = \frac{10}{145} \times 100\% = 6.9\%$$
For the 11-year-olds:
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100\% = \frac{10}{80} \times 100\% = 12.5\%$$
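
The two coefficients of variation in this example can be verified with a one-line function. The name `coefficient_of_variation` is an assumption for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_25_year_olds = coefficient_of_variation(10, 145)
cv_11_year_olds = coefficient_of_variation(10, 80)
print(round(cv_25_year_olds, 1), round(cv_11_year_olds, 1))
```

Although both samples have the same standard deviation, the 11-year-olds show more variability relative to their mean weight.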

Theorem Involving Standard Deviation:
Chebychev’s Theorem
* Applies to any data set.
* Given a number k greater than 1 and a set of n
observations, the proportion of data values that
must be within k standard deviations of the mean is
at least
$$1 - \frac{1}{k^2}$$
For example:
$$k = 2: \quad 1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75$$
At least 75% of the data lie in the interval
$(\bar{x} - 2s, \bar{x} + 2s)$.
$$k = 3: \quad 1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} \approx 0.89$$
At least 89% of the data lie in the interval
$(\bar{x} - 3s, \bar{x} + 3s)$.
Example: Given $\bar{x} = 70$, $s = 10$.
The percentage of the data in the interval
$(\bar{x} - 2s, \bar{x} + 2s) = (70 - 2(10), 70 + 2(10)) = (50, 90)$
is at least 75%.
The standard deviation of a data set is an
important quantity because it limits the
number of data values that can be very far
(high or low) from the average.
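
The Chebychev bound and the corresponding interval are easy to compute. The following Python sketch is illustrative only; the function names `chebyshev_bound` and `k_sigma_interval` are assumptions.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the theorem requires k > 1")
    return 1 - 1 / k ** 2

def k_sigma_interval(mean, std_dev, k):
    """Interval (mean - k*s, mean + k*s)."""
    return (mean - k * std_dev, mean + k * std_dev)

bound_2 = chebyshev_bound(2)
bound_3 = chebyshev_bound(3)
interval = k_sigma_interval(70, 10, 2)
print(bound_2, round(bound_3, 2), interval)
```

The bound holds for any data set, so it is usually much more conservative than what is observed for bell-shaped data.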

The Empirical Rule:
* Applies only to bell-shaped distributions.
* Approximately 68% of data values lie
within 1 standard deviation of the mean:
$(\bar{x} - s, \bar{x} + s)$
* Approximately 95% of data values lie
within 2 standard deviations of the mean:
$(\bar{x} - 2s, \bar{x} + 2s)$
* Approximately 99.7% of data values lie
within 3 standard deviations of the mean:
$(\bar{x} - 3s, \bar{x} + 3s)$
Example: Men’s heights have a bell-
shaped distribution with a mean of
69.2 inches and a standard deviation
of 2.9 inches. Between what heights
do approximately 95% of the male population lie?
Solution:
Approximately 95% of data values lie
within 2 standard deviations of the mean:
$(\bar{x} - 2s, \bar{x} + 2s) = (69.2 - 2(2.9), 69.2 + 2(2.9)) = (63.4, 75.0)$

Measures of Position
Percentiles
Percentiles divide a data set into 100 parts.
For example, the 36th percentile is the value
which separates the lowest 36% of data values
from the highest 64% of data values and is
denoted by P36.
A percentile is a numerical measure that also
locates values of interest in a data set.
The p-th percentile of a data set is a value such
that at least p% of the observations are at or below this
value, and at least (100-p)% are at or above
this value.

Note: There are 99 percentiles P1, P2, ..., P99.

Calculating the p-th percentile
The Data:
$X_1, X_2, X_3, \dots, X_n$
Step 1: Arrange the data values in
ascending order.
$X_{(1)}, X_{(2)}, X_{(3)}, \dots, X_{(n)}$
Step 2: Compute q as follows:
$$q = \left(\frac{p}{100}\right)(n + 1)$$
where p is the percentile of interest and n is
the number of data values.

Step 3: (a) If q is an integer,
p-th percentile = $X_{(q)}$.
(b) If q is not an integer,
p-th percentile = $X_{(q)} = X_{(L)} + (q - L)(X_{(L+1)} - X_{(L)})$
where L is the integer part of q.

Example:
Consider the following sample data:
15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15,
15, 16, 18, 20
Determine the 90th percentile.
Step 1: Arrange the 16 data values:
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18,
20, 26.
Step 2: Compute q as follows:
$$q = \left(\frac{90}{100}\right)(17) = 15.3$$
Step 3:
p-th percentile = $X_{(q)} = X_{(L)} + (q - L)(X_{(L+1)} - X_{(L)})$
90th percentile = $X_{(15.3)} = X_{(15)} + (15.3 - 15)(X_{(16)} - X_{(15)})$
= 20 + (0.3)(26 - 20)
= 20 + 1.8 = 21.8

Also find the 50th percentile.
$$q = \left(\frac{50}{100}\right)(17) = 8.5$$
50th percentile = $X_{(8.5)}$
= $X_{(8)} + (8.5 - 8)(X_{(9)} - X_{(8)})$
= 14 + (0.5)(15 - 14)
= 14.5
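
The three-step percentile procedure can be sketched in Python as below. The function name `percentile` is an assumption; clamping to the smallest or largest value when q falls outside the range 1 to n is also an assumption, since the notes do not cover that case. Other software often uses slightly different interpolation rules, so results may differ from library functions.

```python
def percentile(values, p):
    """p-th percentile using q = (p/100)(n+1) with linear interpolation."""
    ordered = sorted(values)            # Step 1: ascending order
    n = len(ordered)
    q = p * (n + 1) / 100               # Step 2: position q
    L = int(q)                          # integer part of q
    if L < 1:                           # assumption: clamp below position 1
        return ordered[0]
    if L >= n:                          # assumption: clamp above position n
        return ordered[-1]
    # Step 3: X(L) + (q - L)(X(L+1) - X(L)); positions are 1-based.
    return ordered[L - 1] + (q - L) * (ordered[L] - ordered[L - 1])

data = [15, 26, 17, 5, 4, 8, 6, 9, 10, 12, 14, 15, 15, 16, 18, 20]
p90 = percentile(data, 90)
p50 = percentile(data, 50)
print(round(p90, 1), p50)
```

When q is an integer, the interpolation term is zero and the function returns the observation at position q, as in Step 3(a).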
Quartiles
The quartiles Q1, Q2, and Q3 divide the
data into 4 equal parts.
Q1 = First quartile, or 25th percentile: a value
that has one fourth, or 25%, of the
observations below it.
Q2 = Second quartile, or 50th percentile (also the
median).
Q3 = Third quartile, or 75th percentile.

Ex: If your doctor tells you your 3 year old is
in the 50th percentile for height and the 35th
percentile for weight, what does that mean?

Box-Whisker Plot
A box-whisker plot is a graphical summary of five
numbers:
Minimum Value
Q1 : First Quartile
Median
Q3 : Third Quartile
Maximum Value

* Box plots are an excellent tool for
conveying location and variation
information in data sets, particularly for
detecting and illustrating location and
variation changes between different groups
of data.
* A box plot provides a quick visual summary that
easily shows center, spread, range,
skewness, and any outliers.
* Whiskers: the lines that mark
the range of the data by connecting
the smallest and largest values
(excluding outliers) to the box.

Another measure of variability:
Interquartile Range IQR = Q3 - Q1
*Note that the box (IQR) contains the middle
50% of the values.
Box plots and Skewness

Figure: Box plot of a negatively (left) skewed distribution

Box plots and Outliers
* Outliers: values above Q3+3×IQR or below
Q1-3×IQR are outliers.
* Suspected outliers are slightly more central
than outliers: values above Q3+1.5×IQR or
below Q1-1.5×IQR are suspected outliers.

Example: Assume that someone gave you the
following data, which give the strength of the left
hand (arm 1) and the right hand (arm 2) of a group
of persons.
Left Arm: 20, 23, 24, 30, 21, 22, 33, 44, 33, 22,
33, 43, 54, 34, 22, 11, 15, 23, 34, 22, 11, 12, 18,
19, 20, 42, 41
Right Arm: 12, 15, 23, 32, 43, 55, 54, 44, 48, 49,
50, 12, 55, 56, 57, 50, 49, 43, 48, 46, 30, 27, 33,
55, 59, 52, 51, 53, 54, 47

Figure: Side-by-side box plots of the left and right arm strength data (vertical axis: Data, from 10 to 60)
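
The five-number summary, IQR, and outlier fences for these data can be sketched in Python. The sketch assumes the (n+1)-position percentile rule from the Measures of Position section; the function names `percentile` and `box_summary` are illustrative, and plotting software may use a different quartile rule and give slightly different values.

```python
def percentile(values, p):
    """p-th percentile using q = (p/100)(n+1) with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    q = p * (n + 1) / 100
    L = int(q)
    if L < 1:                 # assumption: clamp to the smallest value
        return ordered[0]
    if L >= n:                # assumption: clamp to the largest value
        return ordered[-1]
    return ordered[L - 1] + (q - L) * (ordered[L] - ordered[L - 1])

def box_summary(values):
    """Five-number summary, IQR, and values beyond the 1.5 and 3 IQR fences."""
    q1, med, q3 = (percentile(values, p) for p in (25, 50, 75))
    iqr = q3 - q1
    # Values beyond the 3*IQR fences also lie beyond the 1.5*IQR fences.
    suspected = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    outliers = [v for v in values if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return {"min": min(values), "Q1": q1, "median": med, "Q3": q3,
            "max": max(values), "IQR": iqr,
            "suspected": suspected, "outliers": outliers}

left_arm = [20, 23, 24, 30, 21, 22, 33, 44, 33, 22, 33, 43, 54, 34, 22,
            11, 15, 23, 34, 22, 11, 12, 18, 19, 20, 42, 41]
right_arm = [12, 15, 23, 32, 43, 55, 54, 44, 48, 49, 50, 12, 55, 56, 57,
             50, 49, 43, 48, 46, 30, 27, 33, 55, 59, 52, 51, 53, 54, 47]

left = box_summary(left_arm)
right = box_summary(right_arm)
print(left)
print(right)
```

Comparing the two summaries side by side gives the same information that the side-by-side box plots convey graphically.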

Z-SCORES

A z-score gives the position of one
observation relative to the others in a set of data:
$$z_i = \frac{x_i - \bar{x}}{s}$$
where $z_i$ is the z-score for item $x_i$.
A z-score is the number of standard
deviations (s) that the value lies below or
above the mean.
Note: The mean of all z-scores is zero and
their variance is one.
Example: Men have a mean height of 69.2
inches with a standard deviation of 2.9
inches. Find the standard (z-)score of a man
who is 60 inches tall.
$$z\text{-score} = \frac{60 - 69.2}{2.9} = -3.17$$

The value 60 is 3.17
standard deviations below the mean.
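
The z-score formula is a one-line computation. A minimal Python sketch for the height example (the function name `z_score` is an assumption):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# A man who is 60 inches tall; mean 69.2 inches, standard deviation 2.9 inches.
z = z_score(60, 69.2, 2.9)
print(round(z, 2))
```

A negative z-score means the value is below the mean; a positive one means it is above.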
