Introduction To Statistics (Stat 2181)

Introduction to Statistics (Stat 2181)

Melkie C. (MSc.)
[email protected] /[email protected]

Addis Ababa University


College of Natural and Computational Sciences
Statistics Department

October, 2018
1. Introduction

The art of learning from data
Definition of Statistics
 Plural form
 numerical facts and figures collected for a certain purpose
 aggregates of numerically expressed facts (figures) collected in a systematic manner for a predetermined purpose
 Singular form
 the systematic collection and interpretation of numerical data to make a decision
 the science of collecting, organizing, presenting, analyzing and interpreting numerical data to make decisions on the basis of such analysis

Classification of Statistics
 Descriptive Statistics

 Mainly concerned with the methods and techniques used in collection,

organization, presentation, and analysis of a set of data without making


any conclusions or inferences.
 Gathering data

 Editing and classifying them

 Presenting data in tables

 drawing diagrams and graphs for them

 Calculating averages and measures of dispersions.

Remark: Descriptive statistics doesn’t go beyond describing the data


themselves.
Classification of Statistics …
 Descriptive Statistics (Example)

 Recording the effectiveness of a processor for a certain type of task and then finding the average of these performance tests (CPU time).
 In a sample, 40% of fourth-year Computer Science students expressed a positive attitude toward the delivery of lectures.
 Drawing graphs that show the difference in the scores of male and female fourth-year Computer Science students.

Classification of Statistics …
 Inferential Statistics

 Deals with methods of inferring or drawing conclusions about the characteristics of a population based on the results of a sample
 Uses sample data to make decisions about the entire data set
 Inferential Statistics (Example)
 There is a definitive relationship between smoking and lung cancer
 Females are more flexible than males
 From a sample study, the number of new computer accounts registered at Addis Ababa University (the number of concurrent users) is estimated to be 50.

Definition of Some Basic Statistical Terms
 Data

 a collection of related facts and figures from which conclusions may be

drawn

 a scientific term for facts, figures, information and measurement

 Population/target population

 the totality of things, objects, people, etc. about which information is being collected

 Often too large to sample in its entirety

 Example: population of athletes fed a certain type of diet

Definition of Some Basic Statistical Terms
 Sample

 part of a population selected to draw conclusions about the population

 Subset of a population

[Figure: the sample shown as a subset of the population]
 Census

 a complete enumeration of the population. In most real problems a census cannot be carried out, so we take a sample instead.
Definition of Some Basic Statistical Terms
 Statistic

 A value computed from the sample, used to describe the sample.

 Parameter

 A descriptive measure (value) computed from the population.

 Variable

 A characteristic whose value changes from object to object and from time to time, e.g. CPU time, Internet speed

 Sampling frame

 A list of people, items or units from which the sample is taken.

Stages in Statistical Investigation

 Statistical data must possess the following properties

 The data must be aggregate of facts

 They must be affected to a marked extent by a multiplicity of causes

 They must be estimated according to reasonable standards of accuracy

 The data must be collected in a systematic manner for predefined purpose

 The data should be placed in relation to each other

Stages in Statistical Investigation
1. Data Collection
 The processes of measuring, assembling and gathering data

 Data may be collected by the investigator directly using interview,

questionnaire, and observation or may be available from published or


unpublished sources.

 Data gathering is the basis (foundation) of any statistical work.

 Valid conclusions can only result from properly collected data.

Stages in Statistical Investigation …
2. Data Organization
 It is a stage where we edit our data

 The collected data involve irrelevant figures, incorrect facts, omission and

mistakes

 classify (arrange) according to their common characteristics

3. Data Presentation
 The organized data can now be presented in the form of tables, diagram and

graphs.

 The main purpose of data presentation is to facilitate statistical analysis

Stages in Statistical Investigation …
4. Data Analysis
 Study the data to draw conclusions about the population parameter

 Dig out information useful for decision making

 Calculations of averages, the computation of measures of dispersion,

regression and correlation analysis

5. Data Interpretation
 Draw valid conclusions from the results obtained through data analysis

 Making inference about general population from sample results

Uses and Limitations of Statistics
 Uses of Statistics

 Condenses and summarizes complex data

 Facilitates comparison of data

 Helps to measure variability in data

 Used to create relationship between variables

 Helps in predicting future trends

 Influences the policies of government

 Helpful in formulating and testing hypothesis and to develop new theories

Uses and Limitations of Statistics …
 Limitations of Statistics

 Statistics doesn’t deal with single (individual) values rather it deals with

aggregate values

 Statistics can’t deal with qualitative characteristics

 Statistical conclusions are not universally true

 Statistical interpretations require a high degree of skill and understanding of

the subject

 Statistics can be misused

Scales of Measurement
 A variable in statistics is any characteristic, which can take on different
values for different elements when data are collected

 Variable can be qualitative or quantitative

 Qualitative variables are nonnumeric variables that cannot be measured numerically, e.g. name of a computer brand, rating of connection speed.
 Quantitative variables are numeric variables and can be quantified
 Quantitative variables can be discrete (always take whole-number values) or continuous (can take any value in an interval, including decimal values)

Scales of Measurement
 Measurement “is assigning numbers to objects, events, or abstract
concepts according to a known set of rules”

 This permits data to be categorised, quantified and/or analysed in order


that meaningful conclusions can be drawn.

 Four scales of measurement are identified

 Nominal Scale (lowest level)
 Ordinal Scale
 Interval Scale
 Ratio Scale (highest level)


Scales of Measurement
 Nominal Scales of Measurement
 A measure of identity or category into mutually exclusive, exhaustive classes

 Useful for quantifying qualitative data

 Provides no information regarding either order or magnitude


 Example: blood type (A, B, AB and O), name of a student, identification number of a student

 Ordinal Scales of Measurement


 A measure of order or rank

 Used to arrange data into series

 Provides no information regarding magnitude


 Example: internet speed (weak, fair, excellent)

Scales of Measurement …
 Interval Scales of Measurement

 A measure of order and quantity

 Difference between values can be calculated

 Cannot establish ‘x-fold’ increase


 Example: temperature. The difference between 10°C (50°F) and 20°C (68°F) is the same as the difference between 25°C (77°F) and 35°C (95°F)

 Ratio Scales of Measurement

 Highest level of measurement

 An interval scale with an absolute zero point


 Example: time to complete a certain task

2. Methods of Data Collection and Presentation

Sources of Data
 Primary data
 data measured or collected by the investigator or the user directly from the source
 the data you collect is unique to you and your research and, until you publish, no one else has access to it
 The primary sources of data are the objects or persons from which we collect the figures first hand.

 Secondary data
 second-hand data that was gathered by someone else

 The secondary sources are either published or unpublished materials or records.

Sources of Data
 A few sources of secondary data are:

 Government reports

 Official statistics

 Journals

 Reference books

 Library search engines

 Computerized database

 Universities

 Research institutes

 Hospitals

 World wide web (www)

Methods of Data Collection
 Two activities involved in primary data collection

 Planning and measuring

 Planning for data collection requires

 Identify source and elements of the data

 Decide whether to consider sample or census

 If sampling is preferred, decide on sample size, selection method, etc

 Decide measurement procedure

 Set up the necessary organizational structure: logistics, time table

 Measuring: Collect data using different (appropriate) techniques

Methods of Data Collection…
1. Observation
 Recording the behavioral patterns of people, objects and events in a

systematic manner.

 Ranges from single visual observation to those requiring special skills like

direct observation/examination.

 It may include laboratory experiment; conducting laboratory experiments on

fields of chemical, biological sciences and so on.

 Example:- measuring height, weight, temperature, chemical component in

water, performance of a computer, internet speed etc.

Methods of Data Collection…
2. Questionnaire
 It is a popular means of collecting data,
 but it is difficult to design and often requires many rewrites before an acceptable questionnaire is produced.
 A set of questions is administered (provided) to the respondent either physically or by mail (email, postal, etc.).
 Schedule through enumeration is the method in which the investigator approaches the informant with a prepared questionnaire and records the replies to the questions.

Methods of Data Collection…
 Advantages of Questionnaire
 Can be used as a method in its own right or as a basis for interviewing or a

telephone survey.

 Can be posted, e-mailed or faxed.

 Can cover a large number of people or organizations and wide geographic

coverage.

 Relatively cheap and avoids embarrassment on the part of the respondent.

 Respondent can consider responses, and there is no interviewer bias.

Methods of Data Collection…
 Disadvantage of Questionnaire
 Historically low response rate (although inducements may help).

 Time delay whilst waiting for responses to be returned

 Several reminders may be required and it assumes no literacy problems.

 No control over who completes, and it is not possible to give assistance if

required.

 Respondent can read all questions beforehand and then decide whether to

complete or not.

Methods of Data Collection…
3. Interview
 A technique that is primarily used to gain an understanding of the underlying

reasons and motivations for people’s attitudes, preferences or behavior

 Interviews can be conducted in person (face to face) or by telephone (indirect method)

 They can be conducted at work, at home, in the street or in a shopping

center, or some other agreed location.


 Mall intercept - shoppers

 Personal interview

 Telephone interview

Methods of Data Collection…
 Advantages of Interviewing
 Serious approach by respondent resulting in accurate information and good response

rate.

 Characteristics (motives and feelings) of respondent assessed – tone of voice, facial

expression, hesitation, etc.

 Interviewer in control and can give help if there is a problem

 If one interviewer used, uniformity of approach.

 Completed and immediate.

 Possible in-depth questions.

 Can use recording equipment.

 Can be used to pilot other methods.

Methods of Data Collection…
 Disadvantages of Interviewing
 Need to set up interviews.

 If many interviewers, training required.

 Time consuming.

 Geographic limitations.

 Can be expensive.

 Normally need a set of questions.

 Respondent bias: tendency to please or impress, create a false personal image, or end the interview quickly. Embarrassment is possible if the questions are personal.

 Transcription and analysis can present problems (subjectivity).

Methods of Data Collection…
4. Extract from Records/Documentary Sources
 It is a method of collecting information (secondary data) from published or unpublished sources
 Secondary data can also be collected from diaries

 Advantage of secondary data


 Secondary data may help to clarify or redefine the definition of the problem as part of

the exploratory research process.

 Provides a larger database as compared to primary data

 Time saving

 Does not involve collection of data

Methods of Data Collection…
 Disadvantages of secondary data
 Lack of availability

 Lack of relevance

 Inaccurate data

 Insufficient data

5. Focus Group Discussion: discussion with selected individuals

Methods of Data Presentation
 The major objectives of data presentation are

 To present data in a visual and more understandable form
 To draw attention to the data

 To facilitate quick comparisons using measures of location and dispersion.

 To enable the reader to determine the shape and nature of distribution to

make statistical inference, and to facilitate further statistical analysis.


 There are three methods of data presentation
 Tables,

 Diagrams, and

 Graphs

Methods of Data Presentation …
 Tabular presentation of data

 Tables are important to summarize large volume of data in more

understandable way.

 Tables can be

 Simple (one-way) table: presents one characteristic, for example age distribution.
 Two-way table: presents two characteristics in columns and rows, for example age versus sex.
 Higher-order table: presents three or more characteristics in one table.

Methods of Data Presentation …
 Frequency Distribution

 It is the organization of raw data in table form, using classes and

frequencies.

 Frequency is the number of values in a specific class of the distribution.

 There are three basic types of frequency distributions

 Categorical frequency distribution

 Ungrouped frequency distribution

 Grouped frequency distribution

Methods of Data Presentation …
 Categorical Frequency Distribution

 The categorical frequency distribution is used for data which can be placed

in specific categories such as nominal or ordinal level data


 The major components of categorical frequency distribution are class, tally and

frequency (or proportion).


 Percentages are also usable

 Form of a categorical frequency distribution

(A) Class    (B) Tally    (C) Frequency    (D) Percent

Methods of Data Presentation …
 Example 2.1: A sample of twenty-five computers in a computer lab was identified by brand: Dell (D), HP (H) and Apple (A). The data set is as follows.

A H H D D
D D H A H
H H D H D
A D H D H
D D D H D

Construct a frequency distribution for the above data.

Class   Tally           Frequency   Percent
A       ///             3           12
D       //// //// //    12          48
H       //// ////       10          40
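The tally in Example 2.1 can be checked with a short script. This is a minimal sketch, not part of the original slides; the variable names (`computers`, `freq_table`) are illustrative assumptions.

```python
from collections import Counter

# Data from Example 2.1, row by row: D = Dell, H = HP, A = Apple
computers = list("AHHDD" "DDHAH" "HHDHD" "ADHDH" "DDDHD")

counts = Counter(computers)   # frequency of each class
n = len(computers)            # total number of observations

# class -> (frequency, percent of total)
freq_table = {brand: (f, 100 * f / n) for brand, f in sorted(counts.items())}

for brand, (f, pct) in freq_table.items():
    print(f"{brand}  {f:2d}  {pct:.0f}%")
```

The percentages are simply frequency divided by the total, times 100, so they must add up to 100.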

Methods of Data Presentation …
 Ungrouped Frequency Distribution

 It is the distribution that uses individual data values along with their frequencies.
 Often constructed for a small set of data on a discrete variable (when data are numerical), and when the range of the data is small.
 It becomes cumbersome for a large mass of data; in that case we use a grouped frequency distribution.

 The major components of this type of frequency distributions are class, tally,

frequency, and cumulative frequency (less than/more than).

Methods of Data Presentation …
 Ungrouped Frequency Distribution
 Example 2.2: A person interested in the number of laptops a family may have took a sample of 30 families and obtained the following observations (number of laptops per family).

2, 0, 2, 1, 0, 4, 1, 2, 2, 0, 0, 3, 2, 1, 2, 2, 2, 2, 2, 1, 2, 0, 3,
1, 1, 3, 3, 3, 4, 2.

 Construct a frequency distribution for this data.

Number of laptops   Frequency
0                   5
1                   6
2                   12
3                   5
4                   2
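An ungrouped frequency distribution with both cumulative frequencies can be sketched as follows. The script counts the 30 observations exactly as they are listed in Example 2.2; the names `laptops`, `less_than_cf` and `more_than_cf` are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate

# The 30 observations as listed in Example 2.2
laptops = [2, 0, 2, 1, 0, 4, 1, 2, 2, 0, 0, 3, 2, 1, 2, 2, 2, 2, 2, 1,
           2, 0, 3, 1, 1, 3, 3, 3, 4, 2]

counts = Counter(laptops)
values = sorted(counts)                      # the classes: 0, 1, 2, 3, 4
freqs = [counts[v] for v in values]          # frequency of each class

# Less-than type: running total from the smallest class upward
less_than_cf = list(accumulate(freqs))
# More-than type: running total from the largest class downward
more_than_cf = list(accumulate(reversed(freqs)))[::-1]

for row in zip(values, freqs, less_than_cf, more_than_cf):
    print(*row)
```

The last less-than value and the first more-than value both equal the sample size, which is a quick consistency check on any hand tally.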
Methods of Data Presentation …

 Grouped Frequency Distribution

 It is a frequency distribution when several numbers are grouped in one class

 The data must be grouped so that each class is more than one unit in width.
 Used when the range of the data is large, and for data from a continuous variable.

 Sometimes used for large volume of discrete data

Methods of Data Presentation …
 Class limit (CL)

 It separates one class from another.

 The limits could actually appear in the data

 There are gaps between the upper limit of one class and the lower limit of the next class.
 Each class has a lower class limit and an upper class limit

 Class boundary(CB)

 Separate one class in a grouped frequency distribution from the other.

 The boundary has one more decimal place than the raw data.

 There is no gap between the upper boundaries of one class and the lower boundaries

of the succeeding class.

 Has lower class boundary and upper class boundary


Methods of Data Presentation …
 Unit of measurement (U)

 This is the smallest possible difference between two successive values, e.g. 1, 0.1, 0.01 …
 Class width (W)
 The difference between the upper and lower class boundaries of any class.
 The class width is also the difference between the lower limits (or the upper limits) of two consecutive classes.
 Class mark (Midpoint)
 It is found by adding the lower and upper class limits (or boundaries) and dividing the sum by two.

Methods of Data Presentation …
 Guidelines for classes

 There should be 5 to 20 classes. Determine the number of classes K using Sturges' rule, where n is the number of observations:

$$K = 1 + 3.32 \log_{10} n$$

 Classes should be continuous.
 Classes must be mutually exclusive.
 Classes should be exhaustive.
 Classes should have the same width W (except open-ended classes):

$$W = \frac{\text{Range}}{\text{Number of classes}} = \frac{R}{K}$$
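The two formulas above can be written as small helper functions. This is a sketch under the assumption that the data are recorded in whole units, so that rounding the width up to the next integer is appropriate; the function names are hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.32 * log10(n), rounded up."""
    return math.ceil(1 + 3.32 * math.log10(n))

def class_width(data, k):
    """Class width W = R / K, rounded up (assumes whole-unit data)."""
    r = max(data) - min(data)   # range R = largest - smallest
    return math.ceil(r / k)

k = sturges_classes(20)          # 1 + 3.32 * 1.301 = 5.32, rounded up
w = class_width([8, 43], k)      # R = 35, so 35 / 6 = 5.83, rounded up
print(k, w)
```

Rounding up rather than to the nearest integer guarantees that K classes of width W cover the whole range.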

Methods of Data Presentation …
 Steps to construct grouped frequency distribution

 Find the smallest (S) and largest (L) values in the data.
 Compute the range, R = L − S.
 Determine the number of classes K using Sturges' rule; round up.
 Determine the class width W = R/K; round up.
 Take the smallest value as the lower class limit of the first class, and add the class width repeatedly to get the lower class limits of the following classes.
 To get the first upper class limit, subtract the unit of measurement from the lower class limit of the second class; add the class width repeatedly to get the remaining upper class limits.
 Subtract half the unit of measurement from each lower class limit to get the lower class boundary, and add half the unit of measurement to each upper class limit to get the upper class boundary.
 Tally the data.
 Count the tallies in each class to get the frequency.
 Find the cumulative frequencies.


Methods of Data Presentation …
 Grouped Frequency Distribution
 Example 2.3: To evaluate the effectiveness of a processor for a certain type of task, the CPU time (in seconds) was recorded for n = 20 randomly chosen jobs.

15 31 8 36 29 23 33 18 24 43 20 12 25
21 24 29 28 40 35 24

Construct the grouped frequency distribution.

Methods of Data Presentation …
 Grouped frequency distribution

Class   Class limit   Class boundary   Tally     Frequency   LCF   MCF   Class mark
1       8 - 13        7.5 – 13.5       //        2           2     20    10.5
2       14 - 19       13.5 – 19.5      //        2           4     18    16.5
3       20 - 25       19.5 – 25.5      //// //   7           11    16    22.5
4       26 - 31       25.5 – 31.5      ////      4           15    9     28.5
5       32 - 37       31.5 – 37.5      ///       3           18    5     34.5
6       38 - 43       37.5 – 43.5      //        2           20    2     40.5

LCF = less-than cumulative frequency, MCF = more-than cumulative frequency.
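The construction steps can be followed mechanically for the CPU-time data of Example 2.3. The sketch below assumes a unit of measurement of 1 second (the data are whole seconds); the dictionary keys are illustrative names, not terminology from the slides.

```python
import math

cpu_times = [15, 31, 8, 36, 29, 23, 33, 18, 24, 43,
             20, 12, 25, 21, 24, 29, 28, 40, 35, 24]
unit = 1  # assumed unit of measurement: whole seconds

n = len(cpu_times)
k = math.ceil(1 + 3.32 * math.log10(n))                    # Sturges' rule
width = math.ceil((max(cpu_times) - min(cpu_times)) / k)   # W = R / K

rows = []
lower = min(cpu_times)   # first lower class limit = smallest value
cum = 0
for _ in range(k):
    upper = lower + width - unit                 # upper class limit
    freq = sum(lower <= x <= upper for x in cpu_times)
    cum += freq
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - unit / 2, upper + unit / 2),
        "frequency": freq,
        "less_than_cf": cum,
        "more_than_cf": n - cum + freq,
        "class_mark": (lower + upper) / 2,
    })
    lower += width                               # next lower class limit

for r in rows:
    print(r)
```

Each class boundary sits half a unit outside the class limits, so the upper boundary of one class equals the lower boundary of the next.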

Methods of Data Presentation …
 Diagrammatic and Graphic presentation of the data

 One of the most effective and interesting ways to present statistical data is through diagrams and graphs.
 Statistical data may be displayed pictorially in several ways, such as:


 Pie chart

 Bar chart

 Histogram

Methods of Data Presentation …
 Pie Chart

 A pie chart is a circular diagram in which the area of each sector represents one component of the total.
 To construct a pie chart (sector diagram), draw a circle (which measures 360°).
 The angle of each component is calculated by the formula

$$\text{Angle of sector} = \frac{\text{Component part}}{\text{Total}} \times 360^\circ$$

 These angles are marked in the circle by means of a protractor to show the different components.
 The sectors are usually arranged anti-clockwise.

Methods of Data Presentation …
 Pie Chart (Example)
 Example 2.4: The following table gives the quarterly profit of a Microsoft company (in millions of dollars) in the four quarters of a year.

Quarter       Profit ($ millions)
1st quarter   100
2nd quarter   300
3rd quarter   500
4th quarter   600
Total         1500

 Construct a pie chart

Methods of Data Presentation …
 Pie Chart (Example)
Quarter       Profit ($ millions)   Angle of sector (degrees)   Percent (%)
1st quarter   100                   24                          7
2nd quarter   300                   72                          20
3rd quarter   500                   120                         33
4th quarter   600                   144                         40
Total         1500                  360                         100

[Figure: pie chart of quarterly profit: 1st quarter 7%, 2nd quarter 20%, 3rd quarter 33%, 4th quarter 40%]
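The sector angles and percentages for Example 2.4 follow directly from the formula (component part / total) × 360°. A minimal sketch, with illustrative variable names:

```python
# Quarterly profit in millions of dollars (Example 2.4)
profits = {"1st quarter": 100, "2nd quarter": 300,
           "3rd quarter": 500, "4th quarter": 600}

total = sum(profits.values())

# Angle of sector = component part / total * 360 degrees
angles = {q: p * 360 / total for q, p in profits.items()}
# Percent share = component part / total * 100
percents = {q: p * 100 / total for q, p in profits.items()}

for q in profits:
    print(f"{q}: {angles[q]:.0f} degrees, {percents[q]:.0f}%")
```

The angles must add up to 360 degrees, which is a convenient check before drawing the sectors with a protractor.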

Methods of Data Presentation …
 Bar Chart

 Uses vertical or horizontal bars to represent the frequencies of a distribution.

 While we draw bar chart, we have to consider the following two points.

 Make the bars the same width

 Make the units on the axis that are used for the frequency equal in size

 Bar charts can be

 Simple bar chart,

 Multiple bar charts,

 Stratified or stacked bar chart

 Deviation bar chart

Methods of Data Presentation …
 Simple Bar Chart

 Used to represent data involving only one variable classified on a spatial, quantitative or temporal basis
 Make bars of equal width but variable length
 Example 2.5: a simple bar chart of the quarterly profit data from Example 2.4

Methods of Data Presentation …
 Multiple Bar Chart

 When two or more interrelated series of data are depicted by a bar diagram

 Make bars of equal width but variable length

 Example 2.6: Suppose we have export and import figures (in millions) for a mineral company over a few years.

[Figure: multiple bar chart of Export and Import for 2010, 2011 and 2012]

Methods of Data Presentation …
 Stratified/Stacked Bar Chart

 Used to represent data in which the total magnitude is divided into different parts or components.
 First draw a simple bar for each class using the total magnitude of that class, and then divide each bar into parts in the ratio of its components.
 Shows the variation in the different components within each class as well as between classes.

 Stratified bar diagram is also known as component bar chart.

Methods of Data Presentation …
 Stratified/Stacked Bar Chart

 Example 2.7: The table below shows the profit of three companies ($ millions) from sales of different items in the 1st quarter of the year. Draw a stratified/stacked bar chart.

Company   Desktop   Laptop   Accessory   Total
HP        30        50       40          120
DELL      33        16       27          76
TOSHIBA   37        13       37          87

[Figure: stacked bar chart of sales ($ millions) by company, each bar divided into its three item components]
Methods of Data Presentation …
 Deviation Bar Chart

 Used when the data contain both positive and negative values, such as data on net profit, net expense, percent change, etc.
 Example 2.8: Suppose we have the following data on the net profit (percent) of three commodities.

Commodity   Net profit
Soap        80
Sugar       -95
Coffee      125

[Figure: deviation bar chart of net profit for Soap, Sugar and Coffee, vertical axis from -150 to 150]

Methods of Data Presentation …
 Histogram

 Histogram is a special type of bar graph in which the horizontal scale

represents classes of data values and the vertical scale represents frequencies.

 The heights of the bars correspond to the frequencies, and the bars are drawn adjacent to each other (without gaps).

 A graph which displays data by using vertical bars of various heights to

represent frequencies.

 Class boundaries are placed along the horizontal axes.

Methods of Data Presentation …
 Histogram
 A histogram shows the shape of the distribution of the data (its probability mass function, pmf, or probability density function, pdf), helps check for homogeneity, and suggests possible outliers.

 To construct a histogram, we split the range of data into equal intervals, “bins,”

and count how many observations fall into each bin.

 Example 2.9: construct histogram from example 2.3.

Methods of Data Presentation …
 Frequency polygon
 It is a graphic form of a frequency distribution.

 It can be constructed by plotting the class frequencies against class marks

and joining them by a set of line segments.

 Remark: we should add two classes with zero frequencies at the two ends of

the frequency distribution to complete the polygon.

 Example 2.10: Construct frequency polygon from Example 2.3.

Methods of Data Presentation …
 Ogive
 A graph showing the cumulative frequency (less than or more than type)

plotted against upper or lower class boundaries respectively.

 class boundaries are plotted along the horizontal axis and the corresponding

cumulative frequencies are plotted along the vertical axis.

 The points are joined by a free hand curve.

 Example 2.11: Construct the ogive (cumulative frequency curve) from Example 2.3.

3. Measures of Central Tendency

Introduction
 A measure of central tendency is a descriptive statistic that describes the
average, or typical value of a set of scores.

 It is also defined as a single value that is used to describe the “center” (typical value) of the data
 The three major objectives of measures of central tendency are
 To summarize a set of data by single value

 To facilitate comparison among different data sets

 To use for further statistical analysis or manipulation

Introduction
 Properties of a good average
 Its computation should be based on all the observed values.
 It should be simple to understand and easy to interpret.
 It should be affected as little as possible by fluctuations of sampling.
 It should not be unduly influenced by extreme values.
 It should be rigidly defined, which means that it should have a definite value.

 There are three common measures of central tendency

 Mean

 Median

 Mode

The Summation Notation
 Also called Sigma notation

 Sigma is a Greek letter ∑ meaning “sum”

 Let X be a variable with values X_1, X_2, …, X_n. The sum of these values is written

$$\sum_{i=1}^{n} X_i$$

where
 ∑ is the summation notation,
 i is the index of the summation, and i = 1 is its starting point (lower limit of the summation),
 n is the ending point (upper limit of the summation),
 X_i is each term of the sum.

The Summation Notation..
 Properties of summation notation
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$$

$$\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \cdots + X_n Y_n$$

$$\sum_{i=1}^{n} X_i^2 = X_1^2 + X_2^2 + \cdots + X_n^2$$

$$\sum_{i=1}^{n} C X_i = C \sum_{i=1}^{n} X_i = C X_1 + C X_2 + \cdots + C X_n$$

The Mean
 Mean is the most commonly used measure of central tendency. There are
different types of mean
 Arithmetic mean,

 Weighted mean,

 Geometric mean (GM) and

 Harmonic mean (HM)

 If mentioned without an adjective (as mean), it generally refers to the


arithmetic mean.

The Arithmetic Mean
 It is computed by adding all the values in the data set and dividing the sum by the number of observations.
 For raw data, the mean is given by the formula

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

 For an ungrouped frequency distribution with k distinct values X_i, frequencies f_i and n = ∑ f_i, the mean is given by the formula

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{n}$$

 For a grouped frequency distribution with k classes, the mean is given by the formula

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i m_i}{n}, \quad \text{where } m_i = \frac{LCB_i + UCB_i}{2}$$

and LCB/UCB is the lower/upper class boundary, so m_i is the class mark.
The Arithmetic Mean …
 Example 3.1: To evaluate the effectiveness of a processor for a certain type of task, the CPU time (in seconds) was recorded for n = 10 randomly chosen jobs:
70 36 43 69 82 48 34 62 35 15. (Ans: 494/10 = 49.4 seconds)

 Example 3.2: Twenty first-year computer science students were asked to write a program in C++. The number of trials needed until the program compiled without an error message was recorded. (Ans: 103/20 = 5.15 trials for the table as shown)

Number of trials   Number of students
4                  8
5                  4
6                  6
7                  1
8                  1
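Both mean formulas (raw data and ungrouped frequency distribution) can be checked numerically. The sketch below uses the values exactly as listed in Examples 3.1 and 3.2; the variable names are illustrative assumptions.

```python
# Example 3.1: raw data, mean = sum of values / number of values
cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15]
mean_raw = sum(cpu_times) / len(cpu_times)

# Example 3.2: ungrouped frequency distribution, as listed
# number of trials -> number of students
trials = {4: 8, 5: 4, 6: 6, 7: 1, 8: 1}
n = sum(trials.values())                          # total number of students
mean_ungrouped = sum(x * f for x, f in trials.items()) / n

print(mean_raw, mean_ungrouped)
```

Weighting each value by its frequency gives the same result as writing out all 20 observations and averaging them.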

The Arithmetic Mean …
 Example 3.3: Twenty-nine tasks were performed by a computer and the time elapsed was measured to evaluate the effectiveness of a processor. The time is recorded in seconds, and the result is summarized in the table. What is the mean CPU time of this processor? (Ans: 46.1 secs)

Time (in seconds)   Number of tasks
9-28                6
29-48               12
49-68               6
69-88               4
89-108              1
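For grouped data the mean uses the class marks in place of the individual values. A minimal sketch for Example 3.3, with illustrative names; note that the class mark computed from the class limits equals the one computed from the class boundaries.

```python
# Example 3.3: ((lower limit, upper limit), frequency)
classes = [((9, 28), 6), ((29, 48), 12), ((49, 68), 6),
           ((69, 88), 4), ((89, 108), 1)]

n = sum(f for _, f in classes)   # total number of tasks

# Mean = sum of (frequency * class mark) / n
mean_grouped = sum(f * (lo + hi) / 2 for (lo, hi), f in classes) / n

print(round(mean_grouped, 1))
```

The result is an approximation of the true mean, since every observation in a class is treated as if it sat at the class mark.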

Properties of Arithmetic Mean …
 It can be computed for any set of numerical data; it always exists and is unique.
 It depends on all observations.
 The sum of the deviations of the observations about the mean is zero, i.e. $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$

 It is greatly affected by extreme values.

 It lends itself to further statistical treatment, for instance, combinations of means.

 It is relatively reliable, i.e. it is not greatly affected by fluctuations in sampling.

 The sum of squares of deviations of all observations about the mean is the minimum

 If a constant is added to all observations, the new mean is old mean plus constant

 If all observations are multiplied by a constant, the new mean is the old mean multiplied by that constant
 If a wrong value is recorded and later discovered, the corrected mean is

$$\bar{X}_{corr} = \bar{X}_{wrong} + \frac{X_{corr} - X_{wrong}}{n}$$
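The correction formula can be illustrated with a tiny hypothetical data set (the numbers below are invented for illustration and do not come from the slides): the values 10, 20 and 30 are recorded as 10, 20 and 60.

```python
def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Corrected mean = wrong mean + (correct value - wrong value) / n."""
    return wrong_mean + (correct_value - wrong_value) / n

# Hypothetical data: 30 was wrongly recorded as 60
recorded = [10, 20, 60]
wrong_mean = sum(recorded) / len(recorded)            # 90 / 3 = 30

fixed = corrected_mean(wrong_mean, len(recorded), wrong_value=60, correct_value=30)
print(fixed)
```

The adjustment avoids recomputing the whole sum: only the error in the single value, spread over n observations, is added back.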
Weighted Mean
 Weighted mean is calculated when certain values in a data set are more
important than the others.

 A weight wi is attached to each of the values xi to reflect this importance.

 The weighted mean is computed as

$$\bar{X}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

 Example: the CGPA of a student (each result is weighted by the credit of the course) [Ans: 2.88]
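A weighted mean in the style of a CGPA calculation might be sketched as below. The grade points and credit hours are hypothetical, chosen only to show the mechanics; they are not the data behind the answer quoted on the slide, which the slide does not list.

```python
def weighted_mean(values, weights):
    """Weighted mean = sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical grade points and course credits (illustration only)
grade_points = [4, 3, 2]
credits = [3, 4, 3]

cgpa = weighted_mean(grade_points, credits)   # (12 + 12 + 6) / 10
print(cgpa)
```

When all weights are equal, the weighted mean reduces to the ordinary arithmetic mean.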

Geometric Mean
 It is defined as the arithmetic mean of the values taken on a log scale, transformed back to the original scale.
 It is also expressed as the nth root of the product of the n observations.

$$GM = \sqrt[n]{X_1 \cdot X_2 \cdots X_n} = \sqrt[n]{\prod_{i=1}^{n} X_i}$$

 GM is an appropriate measure when values change exponentially and in case of


skewed distribution that can be made symmetrical by a log transformation.

 Note: The geometric mean is useful in finding the average of percentages,


ratios, indexes, or growth rates.

 One important disadvantage of GM is that it cannot be used if any of the values


are zero or negative.

Geometric Mean…

 Example 3.4: The price of a computer increased by 10%, 15% and 20% in three consecutive years. What is the average percentage increase in the price of computers over the past three years? (Ans: 14.4%, the geometric mean of 10, 15 and 20)
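The geometric mean in Example 3.4 can be computed through logarithms, which matches the "mean on a log scale" definition. The function name is an illustrative assumption. As a side note that goes beyond the slide: applying the geometric mean to the growth factors 1.10, 1.15 and 1.20 instead of the raw percentages gives an average growth rate of about 14.9%.

```python
import math

def geometric_mean(values):
    """nth root of the product, computed as exp(mean of logs)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# As on the slide: GM of the percentages themselves
gm_percent = geometric_mean([10, 15, 20])

# Alternative (not on the slide): GM of the growth factors, minus 1
gm_growth = (geometric_mean([1.10, 1.15, 1.20]) - 1) * 100

print(round(gm_percent, 1), round(gm_growth, 1))
```

Working with logs avoids overflow when many values are multiplied together, and the positivity check reflects the restriction that the GM cannot be used with zero or negative values.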

Harmonic Mean
 It is the reciprocal of the arithmetic mean of the reciprocals of the observations.
 The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, for example speed (distance per unit of time).

$$HM = \frac{1}{\dfrac{\sum_{i=1}^{n} \frac{1}{x_i}}{n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Harmonic Mean
 Example 3.5: A student downloaded a 2.5 MB file at 25 kb/s; on another day he downloaded a file of the same size at 35 kb/s. What is the average download rate of the computer? (Ans: 29.2 kb/s)
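Because the same amount of data is downloaded at each rate, the average rate in Example 3.5 is the harmonic mean of the two rates. A minimal sketch, with a hypothetical function name:

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

rates = [25, 35]                 # download rates in kb/s
avg_rate = harmonic_mean(rates)  # 2 / (1/25 + 1/35)

print(round(avg_rate, 1))
```

The arithmetic mean (30 kb/s) would overstate the rate, since more time is spent downloading at the slower speed.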

Relation between AM, GM, and HM

 If all the values in a data set are the same, then all the three means (arithmetic
mean, GM and HM) will be identical.

 As the variability in the data increases, the difference among these means also
increases.

 For positive values, the arithmetic mean is always greater than or equal to the GM, which in turn is always greater than or equal to the HM; equality holds only when all the values are the same.
 AM ≥ GM ≥ HM

77
Median
 If the sample data are arranged in increasing order, the median is
 if n is an odd number, median is middle value
$$\tilde{X} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ positioned observation}$$
 Example 3.6: the number of new computer accounts registered during nine consecutive
days are 43, 37, 50, 52, 58, 105, 52, 45, and 45. what is the median number of new
accounts registered? (Ans: 50)

 if n is an even number, midway between the two middle values


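$$\tilde{X} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ item} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ item}}{2}$$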
 Example 3.7: the number of new computer accounts registered during ten
consecutive days are 43, 37, 50, 52, 58, 105, 52, 45, 52 and 45. what is the median
number of new accounts registered? (Ans: 51)
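
Examples 3.6 and 3.7 can be checked with a few lines, assuming Python's standard `statistics.median`, which sorts the data and applies the odd/even rule described above.

```python
from statistics import median

nine_days = [43, 37, 50, 52, 58, 105, 52, 45, 45]       # n is odd
ten_days = [43, 37, 50, 52, 58, 105, 52, 45, 52, 45]    # n is even

print(median(nine_days))   # 50
print(median(ten_days))    # 51.0
```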
78
Median …
 If the data are in an ungrouped frequency distribution, the median is the value whose less than
cumulative frequency is the first to reach or exceed half of the total number of observations

 Example 3.8: Forty five computers were used to execute a program and their performance was
evaluated using CPU time. The time is recorded in seconds, and the result is
summarized in the table. What is the median performance (time) of these computers?
(Ans: 19 secs, since the 23rd of the 45 ordered observations falls at 19 seconds)

Time (in seconds)   Number of computers   Less than cumulative frequency
15                  4                     4
16                  9                     13
18                  8                     21
19                  14                    35
20                  10                    45

79
Median …
 If the data is in grouped frequency distribution, median is

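$$\tilde{X} = LCB_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - cf\right)$$

where LCB_med is the lower class boundary of the median class, w is the class width, f_med is the
frequency of the median class and cf is the cumulative frequency of the class preceding it.

 Example 3.9: Fifty computers were used to execute a program and their performance was
evaluated using CPU time. The time is recorded in seconds, and the result is
summarized in the table. What is the median performance (time) of these computers?
(Ans: 20.81 secs)

Time (in seconds)   Number of computers
14-16               6
17-19               12
20-22               16
23-25               9
26-28               7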
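
A hedged sketch of the grouped-median formula applied to the table of Example 3.9 (the listed frequencies sum to 50). The function name `grouped_median` and the tuple layout are assumptions made for illustration; class boundaries are taken halfway between consecutive class limits.

```python
def grouped_median(classes):
    """classes: (lower_limit, upper_limit, frequency) tuples in ascending order."""
    n = sum(f for _, _, f in classes)
    gap = classes[1][0] - classes[0][1]     # distance between consecutive class limits
    cf = 0                                   # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cf + f >= n / 2:                  # first class whose cumulative frequency reaches n/2
            lcb = lower - gap / 2            # lower class boundary
            width = upper - lower + gap      # class width w
            return lcb + width * (n / 2 - cf) / f
        cf += f

cpu_times = [(14, 16, 6), (17, 19, 12), (20, 22, 16), (23, 25, 9), (26, 28, 7)]
median_time = grouped_median(cpu_times)
print(median_time)   # 20.8125
```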
Mode
 The most frequent observation (value) in a data

 An observation with the largest frequency


 There can be no mode Eg: 25, 27, 22, 18

 There can be only one mode-unimodal Eg: 25, 27, 22, 25,18

 There can be two mode-bimodal Eg: 25, 27, 22, 27, 25, 18, 20

 There can be more than two mode-multimodal Eg: 25, 27, 22, 27, 25, 18, 20, 19, 22, 17

 Mode grouped frequency distribution

$$\hat{X} = LCB_{mod} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right), \quad \Delta_1 = f_1 - f_0, \quad \Delta_2 = f_1 - f_2$$

 f1 = frequency of the modal class


 f0 = frequency of the class preceding the modal class
 f2 = frequency of the class next to the modal class
81
Mode…
 The most frequent observation (value) in a data
 Example 3.10: Twenty five amateur cyclists were taken to field and their time
is recorded to complete a given distance. The time is recorded in seconds, and
the result is summarized in the table. What is the modal time to complete the
distance. (Ans: 29.5 secs)

Time (in seconds)   Number of athletes
15.5-21.5           3
21.5-27.5           6
27.5-33.5           8
33.5-39.5           4
39.5-45.5           3
45.5-51.5           1
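
The grouped-mode formula can be sketched as follows for Example 3.10. The function name `grouped_mode` is an assumption, and a missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """boundaries: (lower, upper) class boundaries; freqs: matching frequencies."""
    i = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                   # class preceding the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0      # class following the modal class
    lcb, ucb = boundaries[i]
    d1, d2 = f1 - f0, f1 - f2
    return lcb + (ucb - lcb) * d1 / (d1 + d2)

boundaries = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
              (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
freqs = [3, 6, 8, 4, 3, 1]
modal_time = grouped_mode(boundaries, freqs)
print(modal_time)   # 29.5
```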

82
Quantiles

 Quartiles are three points which divide an array into four parts in
such a way that each portion contains an equal number of
elements.
 First quartile (Q1) 25% of the observations lies below or equal to it

 Second quartile (Q2) 50 % of the observations lies below or equal to it and

 Third quartile (Q3) 75% of the observations lies below or equal to it

 The ith quartile for raw data is


in  1
Qi 
4
 If there is an even number of data items, then we need to get the average
of the middle numbers.
83
Quantiles
 Example 3.11: Find the median, lower quartile and upper quartile of the
following numbers.
a) 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25

b) 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25, 65

 Solution: first arrange data from smallest to largest

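a) 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53: lower quartile = 12, median = 22, upper quartile = 36

b) 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53, 65: lower quartile = 13, median = 23.5, upper quartile = 39
(each quartile is the average of the two middle values of its half of the data)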
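
A small sketch of Example 3.11, assuming the "median of each half" convention, which reproduces the values 13, 23.5 and 39 for part (b). Other quartile conventions exist and can give slightly different results for even n. The function name is hypothetical.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, median, Q3); the middle value is left out of both halves when n is odd."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered), median(ordered[-half:])

a = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
b = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25, 65]

print(quartiles_by_halves(a))   # (12, 22, 36)
print(quartiles_by_halves(b))   # (13.0, 23.5, 39.0)
```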
84
Quantiles
 The ith quartile for grouped frequency distribution is

$$Q_i = LCB_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - cf\right), \quad i = 1, 2, 3$$

85
Quantiles …

 Deciles are nine points which divide an array into 10 parts in such
a way that each part contains equal number of elements.
 The nine deciles are denoted by D1, D2, …, D9

 First decile (D1) 10% of the observations lies below or equal to it

 Second decile (D2) 20% of the observations lies below or equal to it etc

 The ith decile for grouped frequency distribution is

$$D_i = LCB_{D_i} + \frac{w}{f_{D_i}}\left(\frac{in}{10} - cf\right), \quad i = 1, 2, \ldots, 9$$

86
Quantiles …

 Percentiles are 99 points which divide an array into 100 parts in


such a way that each part consists of equal number of elements.
 The ninty nine percentiles are denoted by P1, P2, …, P99

 First percentile (P1) 1% of the observations lies below or equal to it

 Second percentile (P2) 2% of the observations lies below or equal to it etc

 The ith percentile for grouped frequency distribution is

$$P_i = LCB_{P_i} + \frac{w}{f_{P_i}}\left(\frac{in}{100} - cf\right), \quad i = 1, 2, \ldots, 99$$

87
Quantiles …
 Example 3.12: The following frequency distribution is the score of 25

students.

Score    Number of students
25-29    1
30-34    1
35-39    1
40-44    3
45-49    3
50-54    6
55-59    4
60-64    3
65-69    2
70-74    1

Compute the following quantities
● First quartile (Ans: 44.92)
● Ninth decile (Ans: 65.75)
● Forty-fifth percentile (Ans: 51.38)

Remark: $Q_1 = P_{25}$
$Q_2 = D_5 = P_{50} = \text{Median}$
$Q_3 = P_{75}$
$D_1 = P_{10};\ D_2 = P_{20};\ \ldots;\ D_9 = P_{90}$
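
The quartile, decile and percentile formulas for grouped data share one form, so a single helper can cover Example 3.12. This is a sketch under stated assumptions: the name `grouped_quantile`, the tuple layout, and class boundaries taken halfway between consecutive limits are all illustrative choices.

```python
def grouped_quantile(classes, fraction):
    """Interpolated quantile of a grouped frequency distribution.

    classes  -- (lower_limit, upper_limit, frequency) tuples in ascending order
    fraction -- i/4 for quartiles, i/10 for deciles, i/100 for percentiles
    """
    n = sum(f for _, _, f in classes)
    gap = classes[1][0] - classes[0][1]   # distance between consecutive class limits
    target = fraction * n                 # the in/4, in/10 or in/100 term
    cf = 0                                # cumulative frequency of the preceding classes
    for lower, upper, f in classes:
        if f > 0 and cf + f >= target:
            lcb = lower - gap / 2         # lower class boundary
            width = upper - lower + gap   # class width w
            return lcb + width * (target - cf) / f
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

scores = [(25, 29, 1), (30, 34, 1), (35, 39, 1), (40, 44, 3), (45, 49, 3),
          (50, 54, 6), (55, 59, 4), (60, 64, 3), (65, 69, 2), (70, 74, 1)]

q1 = grouped_quantile(scores, 1 / 4)       # first quartile
d9 = grouped_quantile(scores, 9 / 10)      # ninth decile
p45 = grouped_quantile(scores, 45 / 100)   # forty-fifth percentile
print(f"{q1:.2f} {d9:.2f} {p45:.3f}")
```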
88
4. Measures of Dispersion

Number of pages written by two type writers for five consecutive days

89
Introduction
 Central tendency measures do not reveal the variability present in the data.

 Dispersion is the scatteredness of the data series around its average.

 Dispersion is the extent to which values in a distribution differ from the

average of the distribution

 A measure of statistical dispersion is a nonnegative real number that is zero if

all the data are the same and increases as the data become more diverse.

 Why we need measures of dispersion?


 Determine the reliability of an average

 Serve as a basis for the control of the variability

 To compare the variability of two or more series and

 Facilitate the use of other statistical measures.

90
Introduction…
 Properties of a good measures of dispersion

 It should be rigidly defined

 It should be easy to understand and to calculate

 It should be based on all observations of data

 It should be easily subjected to further mathematical treatment

 It should be least affected by sampling fluctuation

 It shouldn’t be unduly affected by extreme values

91
Introduction…
 There are many types of dispersion measures
 Range /Relative Range (Coefficient of range)

 Inter Quartile Range/ coefficient of quartile deviation

 Mean Absolute Deviation /Coefficient of mean deviation

 Variance/Standard Deviation/ coefficient of variation

 Measures of dispersion cane be absolute or relative.

 When measurements are observed with different units, or have different

averages use relative measures of dispersion.

92
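Range (R)
 Range is the difference between the two extreme values in a data set

 Denoted by R

 For raw data: R = max − min

 For grouped frequency distribution: $R = UCL_k - LCL_1$ or $R = m_k - m_1$, where k is the number of classes, UCL and LCL are the upper and lower class limits, and m is a class mark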

 It is easy to compute and understand.

 Only two values are used in its calculation.

 It is influenced by an extreme value (non-robust).

 It is a very unstable or unreliable (sampling fluctuation, extreme values)

 It is used in statistical quality control

 Used when extreme values are important


93
Relative Range (RR)

 Relative range is the ratio of the difference and sum of the two
extreme values in a data

 Denoted by RR/CR

$$RR = \frac{\max - \min}{\max + \min}$$

 Example 4.1: Consider 2.4, 2.5, 3.0, 1.5, 4.7, 4.3 and 3.5 as time
(in minutes) of installing seven software packages on your personal computer. Compute the range and the relative range.

(Ans: R=3.2, RR=0.52)

94
Inter Quartile Range
 Measures the range of the middle 50% of the values only

 Is defined as the difference between the upper and lower quartiles

 Interquartile range = upper quartile - lower quartile

IQR= Q3 - Q1

 The semi-interquartile range (or SIR) is defined as the difference of the first
and third quartiles divided by two

SIR = (Q3 - Q1) / 2

 The SIR is often used with skewed data as it is insensitive to the extreme scores

 It gives the average amount by which the two quartiles differ from the median

95
Coefficient of Quartile Deviation
 The ratio of the difference to the sum of the upper and lower quartiles of a data set

 Denoted by CQD

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
 Example: A basketball coach has a team of 20 players. He recorded the number of free throw
success out of 10 trials. The following are recorded: 9, 7, 3, 7, 1, 2, 5, 4, 5, 10, 10, 2, 2, 2, 6, 7, 9, 8,
5, 6. What are the SIR and CQD for the free throw success?

 Solution: put in ascending order: 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 9, 9, 10, 10.

(Ans: SIR=2.5, CQD=0.5)
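
A brief sketch of the free-throw example, assuming the quartiles are taken as the medians of the lower and upper halves of the ordered data (the convention that reproduces the quoted answers). Variable names are illustrative.

```python
from statistics import median

throws = [9, 7, 3, 7, 1, 2, 5, 4, 5, 10, 10, 2, 2, 2, 6, 7, 9, 8, 5, 6]
ordered = sorted(throws)
half = len(ordered) // 2

q1 = median(ordered[:half])      # lower quartile
q3 = median(ordered[-half:])     # upper quartile

iqr = q3 - q1                    # interquartile range
sir = iqr / 2                    # semi-interquartile range
cqd = iqr / (q3 + q1)            # coefficient of quartile deviation
print(q1, q3, sir, cqd)          # 2.5 7.5 2.5 0.5
```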

96
Mean Absolute Deviation (MAD)
 Measures the ‘average’ distance of each observation away from the mean of the
data

 Gives an equal weight to each observation

 Generally more sensitive than the range or interquartile range, since a change in
any value will affect it

 The Mean Absolute Deviation of a set of n numbers is

$$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
 All values are used in the calculation.

 It is not unduly influenced by large or small values (robust)

 The absolute values are difficult to manipulate.


97
Coefficient of Mean Deviation (CMD)
$$CMD = \frac{MAD}{\bar{x}}$$
 All values are used in the calculation.

 It is not unduly influenced by large or small values (robust)

 The absolute values are difficult to manipulate.

 Example: VO2 max, or maximal oxygen uptake, is a common measurement linked to


aerobic endurance that many athletes use to determine their overall fitness. The
following VO2 max measurements (ml/kg/min) were recorded from five men:

52.5, 46.8, 38.8, 37.6, 32.3.

 Compute MAD and CMD.

 Solution: (Ans: MAD=6.44, CMD=0.15)

98
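
A minimal sketch of the mean absolute deviation and the coefficient of mean deviation for the VO2 max measurements, written directly from the two definitions. Variable names are illustrative.

```python
vo2 = [52.5, 46.8, 38.8, 37.6, 32.3]                   # ml/kg/min, five men

mean = sum(vo2) / len(vo2)
mad = sum(abs(x - mean) for x in vo2) / len(vo2)       # mean absolute deviation
cmd = mad / mean                                       # coefficient of mean deviation

print(f"mean={mean:.1f}  MAD={mad:.2f}  CMD={cmd:.3f}")
```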
Variance
 Variance is the mean of squared deviation of observations from their
arithmetic mean

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

 All values are used in the calculation.

 It is strongly influenced by outliers (non-robust).

 The units of variance are awkward: the square of the original units.

 Therefore standard deviation is more natural since it recovers the original units.

99
Standard Deviation
 One of the most useful measures of dispersion is the standard deviation.

 It is based on deviations from the mean of the data.

 The sample standard deviation is found by calculating the square root of
the variance.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

 To calculate the standard deviation, follow these steps:
1. Calculate the mean of the numbers

2. Find the deviations from the mean.

3. Square each deviation

4. Sum the squared deviations

5. Divide the sum in Step 4 by n – 1

6. Take the square root of the quotient in Step 5


100
Coefficient of Variation
 The Coefficient of Variation (CV) for a data set defined as the ratio of the standard
deviation to the mean

 It shows the extent of variability in relation to mean of the population.

 It is a normalized measure of dispersion of a probability distribution or frequency


distribution.

$$CV = \frac{s}{\bar{x}} \times 100\%$$
 All values are used in the calculation.

 The actual value of the CV is independent of the unit in which the measurement has been

taken, so it is a dimensionless number.

 For comparison between data sets with different units or widely different means, one

should use the coefficient of variation instead of the standard deviation.


101
Coefficient of Variation
 Example: Ten amateurs play Dart game and their score is recorded as follows.

4, 40, 25, 50, 42, 28, 10, 40, 9, and 17.


 Compute mean, variance, standard deviation and
coefficient of variation for their score.
(Ans: mean=26.5, variance=259.61, sd=16.11,
and cv=60.80%)
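
The dart-score example can be reproduced both step by step (the six steps listed for the standard deviation) and with Python's standard `statistics` module, whose `variance` and `stdev` use the n − 1 divisor. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

scores = [4, 40, 25, 50, 42, 28, 10, 40, 9, 17]

# Step by step
m = sum(scores) / len(scores)                      # 1. mean
squared = [(x - m) ** 2 for x in scores]           # 2-3. squared deviations
var_manual = sum(squared) / (len(scores) - 1)      # 4-5. divide the sum by n - 1
sd_manual = sqrt(var_manual)                       # 6. square root

# Library equivalents
var_lib, sd_lib = variance(scores), stdev(scores)
cv = sd_lib / mean(scores) * 100                   # coefficient of variation in percent

print(f"mean={m}  variance={var_lib:.2f}  sd={sd_lib:.2f}  cv={cv:.2f}%")
```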

102
Standard Score
 If X is a measurement from a distribution with mean X and standard
deviation S, then its value in standard units is
X X
Z
S
 Z gives the deviations from the mean in units of standard deviation

 Z gives the number of standard deviation a particular observation lie


above or below the mean.

 It is used to compare two observations coming from different groups

103
Standard Score
 Example: Two groups of people were trained to perform a certain task
and tested to find out which group is faster to learn the task. For the two
groups the following information was given:
Value Group one Group two
Mean 10.4 min 11.9 min
Stan.dev. 1.2 min 1.3 min

 Relatively speaking:

a) Which group is more consistent in its performance? (Ans: Group 2)


b) Suppose person A from group one takes 9.2 minutes while person B from group
two takes 9.3 minutes. Who was faster in performing the task, and why? (Ans: person B is
faster, with a standard score of −2 against −1 for person A)
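
A short sketch of both parts of the example: consistency is compared through the coefficient of variation, and the two persons through their standard scores. The dictionary layout and function name are assumptions for illustration.

```python
groups = {"one": (10.4, 1.2), "two": (11.9, 1.3)}     # (mean, standard deviation) in minutes

# a) relative variability of each group
cv = {name: sd / mean * 100 for name, (mean, sd) in groups.items()}

def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# b) a lower z means a shorter time relative to the person's own group
z_a = z_score(9.2, *groups["one"])
z_b = z_score(9.3, *groups["two"])

print({k: round(v, 2) for k, v in cv.items()}, round(z_a, 2), round(z_b, 2))
```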

104
Skewness
 Skewness is the degree of asymmetry or departure from symmetry of a
distribution.

 A skewed frequency distribution is one that is not symmetrical.

 Skewness is concerned with the shape of the curve not size.

 If the frequency curve (smoothed frequency polygon) of a distribution has a longer

tail to the right of the central maximum than to the left, the distribution is said to be
skewed to the right or said to have positive skewness.

 If it has a longer tail to the left of the central maximum than to the right, it is said to
be skewed to the left or said to have negative skewness.

105
Skewness
 For moderately skewed distribution, the following relation holds among the three

commonly used measures of central tendency

mean − mode = 3(mean − median)

 Pearsonian measures of skewness

$$\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \quad \text{or} \quad \text{skewness} = \frac{m_3}{m_2^{3/2}},$$

$$\text{where } m_2 = \frac{\sum (x - \bar{x})^2}{n-1} \text{ and } m_3 = \frac{\sum (x - \bar{x})^3}{n-1}$$

 If skewness > 0, then the distribution is positively skewed

 If skewness = 0, then the distribution is symmetric

 If skewness < 0, then the distribution is negatively skewed

106
Kurtosis
 Kurtosis is the degree of peakedness of a distribution, usually taken relative to a
normal distribution.
 A distribution having relatively high peak is called leptokurtic.

 If a curve representing a distribution is flat topped, it is called platykurtic.

 The normal distribution which is not very high peaked or flat topped is called

mesokurtic.

107
Kurtosis
 Measures of kurtosis

$$\text{kurtosis} = \frac{m_4}{m_2^{2}}, \quad \text{where } m_4 = \frac{\sum (x - \bar{x})^4}{n-1} \text{ and } m_2 = \frac{\sum (x - \bar{x})^2}{n-1}$$
 If kurtosis > 3, then the curve is leptokurtic

 If kurtosis = 3, then the curve is mesokurtic

 If kurtosis <3, then the curve is platykurtic

108
Skewness and Kurtosis (Example)
 Example: the following observations are score of 100 students.

Score 61 64 67 70 73
Number of students 5 18 42 27 8

Can we say the distribution is skewed? What is the shape of the distribution?

(Ans: mean=67.45, m2=8.61, m3=-2.72, m4=201.39)
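
The moments quoted above can be reproduced from the frequency table with a short sketch. It assumes the n − 1 divisor used in the definitions of m2, m3 and m4; the helper name `moment` is illustrative.

```python
scores = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(x * f for x, f in zip(scores, freqs)) / n

def moment(k):
    """k-th moment about the mean for the frequency table, with divisor n - 1."""
    return sum(f * (x - mean) ** k for x, f in zip(scores, freqs)) / (n - 1)

m2, m3, m4 = moment(2), moment(3), moment(4)
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2

print(f"mean={mean:.2f} m2={m2:.2f} m3={m3:.2f} m4={m4:.2f}")
print(f"skewness={skewness:.2f} kurtosis={kurtosis:.2f}")
```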

$$\text{skewness} = \frac{m_3}{m_2^{3/2}} = \frac{-2.72}{8.61^{3/2}} = -0.11$$

 The distribution is negatively skewed

$$\text{kurtosis} = \frac{m_4}{m_2^{2}} = \frac{201.39}{8.61^{2}} = 2.71$$

 The shape is platykurtic


5. Elementary Probability

110
Introduction

 A newly released software contains an uncertain number of defects.

 When a computer program is executed, the amount of required memory may

be uncertain.

 When a job is sent to a printer, it takes uncertain time to print, and there is

always a different number of jobs in a queue ahead of it.

 Electronic components fail at uncertain times, and the order of their failures

cannot be predicted exactly.

 Viruses attack a system

111
Introduction…
 Experiment

 Any process of observation or measurement or any process which generates well

defined outcome
 Example 5.1: Race competition, tossing a coin, identifying sex of new born baby, A ball is
manufactured, it is tested whether defective or non-defective.

 Probability Experiment

 It is an experiment that can be repeated any number of times under similar


conditions and for which it is possible to enumerate all the possible outcomes without
being able to predict an individual outcome.

 It is also called random experiment.

 Example 5.2: Rolling a fair die, tossing a fair coin

112
Introduction…
 Outcome

 The result of a single trial of a random experiment

 Example 5.3: in tossing a fair coin {T}, {H}

 Sample Space

 Set of all possible outcomes of a probability experiment


 Example 5.4: Tossing a fair coin S={T, H}

 Event

 It is a subset of sample space.

 It is a statement about one or more outcomes of a random experiment.

 They are denoted by capital letters.

 Example 5.5: Roll a fair six sided die, A be observing even number A={2, 4, 6}

113
Introduction…
 Equally Likely Events

 Events which have the same chance of occurring.

 Roll a die, let A be observing a number less than 4 and B be observing a number
greater than 3.
 A and B are equally likely events

 Complement of an Event

 the complement of an event A means non-occurrence of A and contains those points

of the sample space which don’t belong to A.


 Example 5.6: Toss a fair die, let A be observing an odd number, then

A={1, 3, 5}

AC ={2, 4, 6}

114
Introduction…
 Elementary Event

 an event having only a single element or sample point.

 Mutually Exclusive Events

 Two events which cannot happen at the same time.

 Independent Events

 Two events are independent if the occurrence of one does not affect the probability of

the other occurring.

 Dependent Events

 Two events are dependent if the first event affects the outcome or occurrence of the

second event in a way the probability is changed

115
Counting Techniques
 In order to calculate probabilities, we have to know

 The number of elements of an event

 The number of elements of the sample space.

 That is in order to judge what is probable, we have to know what is possible.

 In order to determine the number of outcomes, one can use several rules of
counting.
 Addition rule

 Multiplication rule

 Permutation rule

 Combination rule

116
Counting Techniques…
 Multiplication Rule

 If an operation consists of k steps and

 the 1st step can be performed in n1 ways,

 the 2nd step can be performed in n2 ways (regardless of how the 1st step was performed) ,

….

 The kth step can be performed in nk ways (regardless of how the preceding steps were
performed) ,

then the entire operation can be performed in n1 ∙ n2 ∙… ∙ nk ways.


 Example 5.7: If we have 6 different shirts, 4 different pants, 5 different pairs of socks and 3

different pairs of shoes, how many different outfits could we wear? (Ans: 360)

 Exercise: How many 7-character license plates are possible if the first three characters must be

letters, the last four must be digits 0-9, and repeated characters are allowed?
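
A small check of the multiplication rule for Example 5.7: the product of the number of choices at each step is compared with a brute-force enumeration of every outfit. The dictionary is an illustrative assumption.

```python
from itertools import product
from math import prod

choices = {"shirts": 6, "pants": 4, "socks": 5, "shoes": 3}

by_rule = prod(choices.values())                       # n1 * n2 * ... * nk
by_enumeration = sum(1 for _ in product(*(range(k) for k in choices.values())))

print(by_rule, by_enumeration)   # 360 360
```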

117
Counting Techniques…
 Addition Rule

 If an operation can be carried out by exactly one of k alternative procedures and

 the 1st procedure can be performed in n1 ways,

 the 2nd procedure can be performed in n2 ways,

….

 the kth procedure can be performed in nk ways,

and no two procedures can be performed simultaneously, then the entire operation can be
performed in n1 + n2 + … + nk ways.
 Example 5.8: If a person plans to travel from place A to place B and there are 2 different plane
routes, 3 different train routes, and 5 different bus routes, how many different alternatives does the
person have to travel from place A to place B? (Ans: 10)

118
Counting Techniques…
 Permutation Rule

 A permutation is an ordered arrangement of r objects chosen from n

objects.

 The number of ways of selecting r distinct objects from n distinct objects

and rearranging those r objects is given by the formula

$${}_nP_r = \frac{n!}{(n-r)!}$$
 Example 5.9: How many ways can we pick a Gold, Silver, and Bronze medal for
8 competitors in a game? (Ans: 336)

119
Counting Techniques…
 Permutation Rule

 The number of permutations of n objects of which n1 are alike, n2 are another

alike … nk are alike is


$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$

 Example 5.10: In how many ways can you permute the word “STATISTICS”?
(Ans: 50400 ways)

 Example 5.11: in a computer lab there are three identical HP desktop computers
and four identical DELL desktop computers. In how many ways can you arrange
these computers. (Ans: 35)
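
Examples 5.9 to 5.11 can be verified with the standard `math` module. `math.perm` gives n!/(n − r)!, and the helper below (a hypothetical name) divides n! by the factorial of each group of alike objects.

```python
from collections import Counter
from math import factorial, perm

def arrangements_with_repeats(items):
    """n! / (n1! n2! ... nk!) for a collection containing alike objects."""
    total = factorial(len(items))
    for count in Counter(items).values():
        total //= factorial(count)
    return total

medals = perm(8, 3)                                        # Example 5.9
word = arrangements_with_repeats("STATISTICS")             # Example 5.10
lab = arrangements_with_repeats(["HP"] * 3 + ["DELL"] * 4) # Example 5.11

print(medals, word, lab)   # 336 50400 35
```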

120
Counting Techniques…
 Combination Rule

 A combination is selection of r objects from n objects.

 The number of ways of choosing r distinct objects from n distinct objects is

given by the formula


$${}_nC_r = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}, \quad \text{where } n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1 \text{ and } 0! = 1$$
 Example 5.12: An antivirus software reports that 3 folders out of 10 are infected.

How many possibilities are there? (Ans: 120 ways)

 Example 5.13: A committee of two people must be chosen from a group of five

people. How many different committees can be formed? (Ans: 10 ways)
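
A quick check of Examples 5.12 and 5.13, assuming Python's `math.comb`. The committee count is also confirmed by listing the pairs explicitly; the member labels are made up for illustration.

```python
from itertools import combinations
from math import comb

infected_folders = comb(10, 3)                  # Example 5.12
committees = comb(5, 2)                         # Example 5.13

people = ["A", "B", "C", "D", "E"]              # hypothetical labels
listed = list(combinations(people, 2))          # order inside a committee does not matter

print(infected_folders, committees, len(listed))   # 120 10 10
```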

121
Approaches in Probability Definition
 Classical/Mathematical Approach
 If an event A occurs in n times out of a total of N exhaustive, mutually exclusive and

equally likely experiments, then the probability of the occurrence of event A is


$$P(A) = \frac{n}{N}$$
 Example 5.14: Determine the probability of getting an even number when rolling a

six sided die? (Ans: 0.5)


 Exercise: There are 20 computers in a store. Among them, 15 are brand new and 5 are

refurbished. Six computers are purchased for a student lab. From the first look, they are
indistinguishable, so the six computers are selected at random. Compute the probability that
among the chosen computers, two are refurbished. (Ans: 0.3522)
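
The exercise combines the classical definition with the combination rule: favourable selections divided by all equally likely selections. A minimal sketch, with a hypothetical function name and default arguments taken from the exercise:

```python
from math import comb

def prob_refurbished(k, refurbished=5, new=15, chosen=6):
    """P(exactly k refurbished computers among the chosen ones)."""
    favourable = comb(refurbished, k) * comb(new, chosen - k)
    possible = comb(refurbished + new, chosen)
    return favourable / possible

p_two = prob_refurbished(2)
print(round(p_two, 4))   # 0.3522
```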

122
Approaches in Probability Definition…
 Frequentist Approach
 This is based on the relative frequencies of outcomes belonging to an event.

 The probability of an event A is the proportion of outcomes favorable to A in the

long run when the experiment is repeated under same condition.

$$P(A) = \lim_{N \to \infty} \frac{n}{N}$$

 Example 5.15: If records show that 60 out of 100,000 bulbs produced are defective. What is the

probability of a newly produced bulb to be defective? (Ans: 0.0006)

123
Approaches in Probability Definition…
 Axiomatic Approach
 Let E be a random experiment and S be a sample space associated with E. With each

event A a real number called the probability of A (p(A)) satisfies the following
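properties called axioms of probability or postulates of probability
 P(A) ≥ 0

 P(S) = 1
 If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B)
 The following results follow from the axioms:
 P(Aᶜ) = 1 − P(A)
 P(∅) = 0
 P(A ∩ Bᶜ) = P(A) − P(A ∩ B)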

124
Approaches in Probability Definition…
 Axiomatic Approach
 Example 5.16: Sixty percent of the families in a certain community own

their own car, thirty percent own their own home, and twenty percent own
both their own car and their own home. If a family is randomly chosen,
a) what is the probability that this family do not have a car?

b) what is the probability that this family owns a car or a house?

c) what is the probability that this family owns a car or a house but not both?

 Solution: Let A represents that the family owns a car and B represents that the family owns a

house. P(A) = 0.6, P(B) = 0.3, and P(A ∩ B) = 0.2.

a) P(Aᶜ) = 1 − P(A) = 0.4

b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.7

c) P((A ∩ Bᶜ) ∪ (Aᶜ ∩ B)) = P(A ∪ B) − P(A ∩ B) = 0.5
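
The three answers of Example 5.16 follow from the complement and addition rules. A sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

p_car = Fraction(6, 10)     # P(A)
p_home = Fraction(3, 10)    # P(B)
p_both = Fraction(2, 10)    # P(A and B)

p_no_car = 1 - p_car                    # a) complement rule
p_either = p_car + p_home - p_both      # b) addition rule
p_exactly_one = p_either - p_both       # c) car or home, but not both

print(p_no_car, p_either, p_exactly_one)   # 2/5 7/10 1/2
```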

125
Approaches in Probability Definition …
 Subjective Approach

 Subjective probability is a prediction that is based on an individual's personal

judgment, not on mathematical calculations.

 Subjective probabilities, like the name suggests, are probabilities that come

from an individual's personal judgment of an event happening.

 Subjective probability differ from person to person, and because they are

subjective, they can be based on a person's beliefs or other factors.

 Used when no historical data.

126
Conditional probability and Independence
 Conditional probability
 Conditional probability provides us with a way to reason about the outcome of an

experiment, based on partial information.

 If the occurrence of one event has an effect on the next occurrence of the other event

then the two events are conditional or dependent events.

 The conditional probability of an event A given that B has already occurred, denoted

P(A|B), is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

 Remark: $P(A^c \mid B) = 1 - P(A \mid B)$

127
Conditional probability and Independence…
 Conditional probability
 Example 5.17: A family has two children. What is the conditional probability that

both are boys given that at least one of them is a boy? Assume that the sample space
S is given by S = {(b, b), (b, g), (g, b), (g, g)}, and all outcomes are equally likely. (b,
g) means, for instance, that the older child is a boy and the younger child is a girl.

 Solution :

 Let A be both are boys and

 B be at least one of them is a boy

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/4}{3/4} = \frac{1}{3}$$
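
Example 5.17 can be checked by listing the four equally likely outcomes and counting. Conditioning on B simply restricts the sample space to the outcomes in B. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product("bg", repeat=2))          # (older child, younger child)

B = [o for o in sample_space if "b" in o]             # at least one boy
A_and_B = [o for o in B if o == ("b", "b")]           # both boys

p_A_given_B = Fraction(len(A_and_B), len(B))
print(p_A_given_B)   # 1/3
```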
 Exercise: Out of six computer chips, two are defective. If two chips are randomly chosen for

testing (without replacement), compute the probability that both of them are defective.

128
Conditional probability and Independence…
 Conditional probability
 Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B)

 Independence is equivalent to the condition P(A|B) = P(A), provided P(B) > 0.

 Example 5.18: Toss a fair coin and a fair die together. What is the probability of getting a
head on the coin if the die shows an even number? (Ans: 0.5)

 Exercise: A box contains four black and six white balls. What is the probability of

getting two black balls in drawing one after the other under the following conditions?
 The first ball drawn is not replaced?

 The first ball drawn is replaced?

129
Exercises on probability
 There is a 1% probability for a hard drive to crash. Therefore, it has two backups, each having a

2% probability to crash, and all three components are independent of each other. The stored
information is lost only in an unfortunate situation when all three devices crash. What is the
probability that the information is saved?

 A new computer virus can enter the system through e-mail or through the internet. There is a

30% chance of receiving this virus through e-mail. There is a 40% chance of receiving it
through the internet. Also, the virus enters the system simultaneously through e-mail and the
internet with probability 0.15. What is the probability that the virus does not enter the system at
all?

 Suppose that after 10 years of service, 40% of computers have problems with motherboards

(MB), 30% have problems with hard drives (HD), and 15% have problems with both MB and
HD. What is the probability that a 10-year old computer still has fully functioning MB and HD?

130
6. Random Variables and Probability
Distributions

131
Introduction
Random variable
 a numerical description of the outcomes of the experiment or

 a numerical valued function defined on sample space, usually denoted by

capital letters.

 If X is a random variable, then it is a function from the elements of the

sample space to the set of real numbers, i.e.,

 X is a function X: S → ℝ

 A random variable takes a possible outcome and assigns a number to it.

 Example 6.1: Flip a coin twice, let X be the number of heads in two tosses

 X={0, 1, 2}

132
Introduction…
Random variables can be
 Discrete random variables: are variables which can assume only a specific number

of values. They have values that can be counted.


 Example 6.2
 the number of jobs submitted to a printer,
 the number of errors in a program,
 the number of failed components of a computer

 Continuous random variable: are variables that can assume all values between any

two given values.


 Example 6.3:
 software installation time
 code execution time
 connection time
 waiting time

133
Introduction…
Probability Distribution
 A probability distribution consists of a value a random variable can assume and
the corresponding probabilities of the values.

 Example 6.4: Consider the experiment of tossing a coin twice. Let X is the
number of heads. Construct the probability distribution of X
 First identify the possible value that X can assume.

 Calculate the probability of each possible distinct value of X and express X in the

form of frequency distribution

X 0 1 2
P(X=x) 1/4 2/4 1/4

134
Introduction…
Mean and variance of Random Variable
$$E(X) = \sum x\, P(X = x)$$

$$Var(X) = E(X^2) - [E(X)]^2, \quad \text{where } E(X^2) = \sum x^2\, P(X = x)$$

 Example 6.5: Consider the experiment of tossing a coin twice. Let X is the
number of heads.

X 0 1 2
P(X=x) 1/4 2/4 1/4

 Find mean and variance of X

 Solution: E(X) = 0(1/4) + 1(2/4) + 2(1/4) = 1 and E(X²) = 0(1/4) + 1(2/4) + 4(1/4) = 1.5, so Var(X) = 1.5 − 1² = 0.5

135
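
The distribution of the number of heads in two tosses, together with its mean and variance, can be built by enumerating the sample space. A sketch with exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))                 # HH, HT, TH, TT
counts = Counter(o.count("H") for o in outcomes)         # value of X for each outcome
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())                # E(X)
second_moment = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2)
var = second_moment - mean ** 2                          # Var(X) = E(X^2) - [E(X)]^2

print(pmf)
print(mean, second_moment, var)   # 1 3/2 1/2
```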
Binomial Distribution
 A binomial experiment is a probability experiment that satisfies the
following four requirements called assumptions of a binomial
distribution.
1. The experiment consists of n identical trials.

2. Each trial has only one of the two possible mutually exclusive outcomes,
success or a failure.

3. The probability of each outcome does not change from trial to trial, and

4. The trials are independent, thus we must sample with replacement

 Example 6.6
 Scan computer for a certain virus(present, absent)

 Surgery from a certain injury (successful, fail)


136
Binomial Distribution…
 The outcomes of the binomial experiment and the corresponding probabilities of
these outcomes are called Binomial Distribution.

 Let p be probability of success and q=1-p be probability of failure, then

$X \sim Bin(n, p)$

 the probability of getting x successes out of n trials is

$$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n.$$

 The mean and variance are

 E(X)=np

 Var(X)=np(1-p)

137
Binomial Distribution…
 Example 6.7: A quality control engineer tests the quality of produced computers.
Suppose that 5% of computers have defects, and defects occur independently of each
other. Find the probability of exactly 3 defective computers in a shipment of twenty.
(Ans: 0.06)
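
A minimal sketch of the binomial probability for the defective-computer example, written directly from the formula with `math.comb`. The function name `binomial_pmf` is an assumption.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

p_three_defective = binomial_pmf(3, n=20, p=0.05)
print(round(p_three_defective, 2))   # 0.06

# Sanity check: the probabilities over x = 0..n add up to 1
total = sum(binomial_pmf(x, 20, 0.05) for x in range(21))
```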

 Exercise: A lab network consisting of 20 computers was attacked by a computer virus.


This virus enters each computer with probability 0.4, independently of other computers.
Find the probability that it entered at least 10 computers.

138
Poisson Distribution
 A random variable X is said to have a Poisson distribution if its probability
distribution is given by:
$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$
 The Poisson distribution depends only on the average number of occurrences per
unit of time or space.

 The Poisson distribution is used as a distribution of rare events, such as

 Number of misprints.

 Natural disasters like earth quake.

 Accidents.

 Mean and variance: E(X)=λ, variance(X)= λ

139
Poisson Distribution…
 Example 6.8: Customers of an internet service provider initiate new accounts at the
average rate of 10 accounts per day.

(a) What is the probability that 8 new accounts will be initiated today? (Ans: 0.1126)

(b) What is the probability that more than 1 new account will be initiated today? (Ans:
0.9995)

(c) What is the probability that more than 10 new accounts will be initiated within 2 days?
(Ans: 0.9892)
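
The three parts of the Poisson example can be sketched from the probability formula. Part (c) uses λ = 20, since the rate of 10 accounts per day is applied over 2 days. Function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), summing the probabilities from 0 to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

p_a = poisson_pmf(8, 10)            # exactly 8 accounts today
p_b = 1 - poisson_cdf(1, 10)        # more than 1 account today
p_c = 1 - poisson_cdf(10, 20)       # more than 10 accounts in 2 days

print(f"{p_a:.4f} {p_b:.4f} {p_c:.4f}")
```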

140
Normal Probability Distribution
 A random variable X is said to have a normal distribution if its probability
density function is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$

 It is bell shaped and is symmetrical about its mean and it is mesokurtic.

 It is asymptotic to the axis, i.e., it extends indefinitely in either direction from the mean.

 It is a continuous distribution.

 It is a family of curves, i.e., every unique pair of mean and standard deviation defines a different

normal distribution.

 Total area under the curve sums to 1, i.e., the area of the distribution on each side of the mean is

0.5

 It is unimodal, i.e., values mound up only in the center of the curve.


141
Normal Probability Distribution…
 Note: To facilitate the use of normal distribution, the following distribution
known as the standard normal distribution was derived by using the
transformation $Z = \frac{X - \mu}{\sigma}$:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}}$$
 Properties of the Standard Normal Distribution:

 Same as a normal distribution, but also...

 Mean is zero

 Variance is one

 Standard Deviation is one

 Areas under the standard normal distribution curve have been tabulated in
various ways. The most common ones are the areas between 0 and Z.

142
Normal Probability Distribution…
 Example 6.9: (read from table); determine the following probabilities
 P(0<Z<1.43)=? (Ans: 0.4236 )

 P(-1.2<Z<0)=? (Ans: 0.3849)

 P(Z<-1.43)=? (Ans: 0.0764 )

 P(-1.43≤Z<1.2)=? (Ans: 0.8085)

 P(Z≥1.52)=? (Ans: 0.0643 )

 P(Z≥-1.52)=? (Ans: 0.9357 )

143
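
The table look-ups for the standard normal probabilities can be cross-checked with `statistics.NormalDist` from the Python standard library, whose `cdf` method returns P(Z ≤ z). Small differences in the fourth decimal come from rounding in printed tables. The dictionary keys are illustrative labels.

```python
from statistics import NormalDist

phi = NormalDist(mu=0, sigma=1).cdf      # cumulative distribution function of Z

answers = {
    "P(0<Z<1.43)": phi(1.43) - phi(0),
    "P(-1.2<Z<0)": phi(0) - phi(-1.2),
    "P(Z<-1.43)": phi(-1.43),
    "P(-1.43<=Z<1.2)": phi(1.2) - phi(-1.43),
    "P(Z>=1.52)": 1 - phi(1.52),
    "P(Z>=-1.52)": 1 - phi(-1.52),
}

for label, value in answers.items():
    print(f"{label} = {value:.4f}")
```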
Normal Probability Distribution…
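 Remark:
$$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) \quad \text{for a population}$$
$$P(a < X < b) = P\left(\frac{a-\bar{x}}{s} < \frac{X-\bar{x}}{s} < \frac{b-\bar{x}}{s}\right) = P\left(\frac{a-\bar{x}}{s} < Z < \frac{b-\bar{x}}{s}\right) \quad \text{for a sample}$$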
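
The standardization in the remark above can be sketched as a short function. The function name `normal_interval_probability` and the numbers used (mean 50, standard deviation 10, interval 40 to 60) are hypothetical and chosen only to illustrate the steps; they do not come from the course examples.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_interval_probability(a, b, mean, sd):
    """P(a < X < b) for a normal X, obtained by converting a and b to z-scores."""
    z_a = (a - mean) / sd   # lower limit on the standard normal scale
    z_b = (b - mean) / sd   # upper limit on the standard normal scale
    return phi(z_b) - phi(z_a)

# Hypothetical illustration: X is normal with mean 50 and standard deviation 10.
p = normal_interval_probability(40, 60, mean=50, sd=10)
print(round(p, 4))
```

For sample data the same function applies with the sample mean and sample standard deviation passed as `mean` and `sd`.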

 Example 5.10: Installation of some software package requires downloading 82 files.

On average, it takes 15 sec to download one file, with a variance of 16 sec².

A. What is the probability that the software is installed in between 15 and 20 seconds?

Normal Probability Distribution…
B. What is the probability that the software is installed in less than 20 seconds?

C. What is the probability that the software is installed in more than 20 seconds?

 Exercise: An average scanned image occupies 0.6 megabytes of memory with a standard deviation
of 0.4 megabytes. If you plan to publish 80 images on your web site, what is the probability that
their total size is between 47 megabytes and 50 megabytes?

7. Introduction to Sampling

Introduction
 When secondary data are not available for the problem under study, a
decision may be taken to collect primary data by using any of the
methods discussed in the previous chapter.

 The required information may be obtained by following either the census method or the sample method.

 Why Sample?

 Speed

 Less costly

 Less manpower/ labour

Definition of Terms
 An element is an object on which a measurement is taken.

 A population is a collection of elements about which we wish to make an inference.

 Sampling units are non-overlapping collections of elements from the population that cover the entire population.

 A sampling frame is a list of sampling units.

 A sample is a collection of sampling units drawn from a sampling frame.

 The deviation between an estimate from an ideal sample and the true population
value is the sampling error.

 Almost always, the sampling frame does not match up perfectly with the target
population, leading to errors of coverage.
Essentials of Sampling
 Representativeness

 A sample should be selected so that it truly represents the population; otherwise the
results obtained may be misleading.

 To ensure representativeness, the random method of selection should be used.

 Adequacy

 The size of sample should be adequate; otherwise it may not represent the
characteristics of the population.

 Independence

 All items of the sample should be selected independently of one another and all
items of the population should have the same chance of being selected in the
sample.
 By independence of selection we mean that the selection of a particular item in one draw
has no influence on the probabilities of selection in any other draw.
Essentials of Sampling…

 Homogeneity

 When we talk of homogeneity, we mean that there is no basic difference between the nature of the units of the population and that of the units of the sample.

 If two samples are taken from the same population, they should give more or less the same kind of units.

Types of Sampling
 Methods of sampling

Simple Random Sampling

 A method of probability sampling in which a sample of n elements is randomly chosen from a population of N elements.

 Simple random sampling is the simplest of the probability sampling techniques.
 It requires a complete sampling frame, which may not be available or feasible to construct for large populations.

 Even if a complete frame is available, more efficient approaches may be possible if other useful information is available about the units in the population.

Simple Random Sampling …

 Its advantages are that it is free of classification error and requires minimal advance knowledge of the population.

 It best suits situations where the population is fairly homogeneous and not much information is available about the population.
 If these conditions are not true, stratified sampling may be a better choice.

 Simple random sampling also reduces the chance of bias occurring in the
sample.

 When using simple random sampling every unit in the population has an
equal chance of being included in the sample.

Simple Random Sampling …

 Simple random sampling is usually done by assigning each unit in the population a number and then selecting n random numbers; the corresponding units form the sample.
 Number the elements in the population (i.e., the sampling frame) from 1 to N.

 The selection for simple random sampling can be made using:

 The lottery method

 A table of random numbers

Simple Random Sampling …

 Lottery Method

 Here, each member or item of the population at hand is assigned a unique number.

 The numbers are then thoroughly mixed, as if they were placed in a bowl or jar and shaken.

 Then, without looking, select n numbers.

 The population members or items that are assigned those numbers are then included in the sample.
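
The lottery method can be imitated in code. The sketch below assumes the population is held in a list and uses shuffling to stand in for mixing the tickets in a bowl; the function name `lottery_sample`, the seed, and the list of students are hypothetical.

```python
import random

def lottery_sample(population, n, seed=None):
    """Draw n members by numbering them, mixing the numbers, and picking n."""
    rng = random.Random(seed)
    tickets = list(range(len(population)))  # one unique number per member
    rng.shuffle(tickets)                    # mix the numbers thoroughly
    drawn = tickets[:n]                     # pick n tickets "without looking"
    return [population[t] for t in drawn]

# Hypothetical population of 20 students.
students = [f"student_{i}" for i in range(1, 21)]
sample = lottery_sample(students, 5, seed=1)
print(sample)
```

Because the tickets are drawn from a shuffled list, no member can be selected twice, which matches drawing without replacement.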

Simple Random Sampling …
 Table of random numbers

 A random number table typically contains random digits between 0 and 9 that are arranged in groups of 4, 5, or 10 and displayed in rows.

 In the table, all digits are equally probable and the probability of any given digit is unaffected by the digits that precede it.


 Number the elements in the population (i.e., sampling frame) from 1 to N.

 Using a table of random numbers, select and record a random number between 1 and N.

 Select a second random number between 1 and N. If the second number is the same as the first selected
number, discard it and go to the next step. If the second number is not the same as the first number, record it.

 Select a third random number between 1 and N. If this number is the same as either one of the previous
numbers, discard it and go to the next step. If the number is not the same as the previous numbers, record it.

 Continue in this manner until n different numbers between 1 and N have been chosen.

 Population elements corresponding to selected numbers are an SRS sample of size n.
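
The steps above can be sketched as a loop that keeps drawing numbers and discards repeats. In this sketch a pseudo-random generator stands in for the printed random number table, which is an assumption; the function name `srs_by_random_numbers` and the values N = 50 and n = 10 are illustrative only.

```python
import random

def srs_by_random_numbers(N, n, seed=None):
    """Select n distinct numbers between 1 and N, discarding any repeats."""
    if not 0 <= n <= N:
        raise ValueError("n must be between 0 and N")
    rng = random.Random(seed)
    chosen = []      # numbers recorded so far, in the order drawn
    seen = set()     # fast lookup for numbers already recorded
    while len(chosen) < n:
        number = rng.randint(1, N)   # stands in for reading the table
        if number in seen:
            continue                 # repeated number: discard and draw again
        seen.add(number)
        chosen.append(number)        # new number: record it
    return chosen

selected = srs_by_random_numbers(N=50, n=10, seed=42)
print(selected)
```

The population elements numbered in `selected` would then form the simple random sample of size n.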

Systematic Sampling
 Systematic sampling is the selection of every kth element from a sampling frame, where k, the sampling interval, is calculated as:

k = Number in population / Number in sample

 A method of probability sampling in which elements on an ordered list of population members are chosen by applying an interval of constant length after a random start.

 Using this procedure, each element in the population has a known and equal probability of selection.

 It is, however, usually more efficient and less expensive to carry out than simple random sampling.

Systematic Sampling…
 A sampling interval (denoted by the symbol, k) is chosen.
 If a sample of about n out of N elements is desired, k is usually the ratio, N/n, rounded to the
nearest integer.

 A random number between 1 and k is chosen.


 This number is called the random start and will be denoted by the symbol, j.

 The elements selected for the sample are element number j and every kth element after it for the
remainder of the list; i.e., j, j+k, j+2k, etc.
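
The procedure can be sketched in a few lines. This sketch assumes the sampling frame is an ordered list; the function name `systematic_sample` and the frame of 100 numbered elements are hypothetical. Note that Python's `round` sends exact .5 ties to the nearest even integer, which may differ slightly from hand rounding.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Return (k, j, sample): the interval, the random start, and the sample."""
    N = len(frame)
    k = max(1, round(N / n))        # sampling interval, N/n rounded to an integer
    rng = random.Random(seed)
    j = rng.randint(1, k)           # random start between 1 and k
    positions = range(j, N + 1, k)  # j, j+k, j+2k, ... up to N (1-based)
    return k, j, [frame[p - 1] for p in positions]

# Hypothetical frame of 100 elements numbered 1 to 100.
frame = list(range(1, 101))
k, j, sample = systematic_sample(frame, 10, seed=3)
print(k, j, sample)
```

When N/n is not a whole number, the realized sample size can be one more or one less than n, which is why the procedure speaks of a sample of about n elements.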

Stratified Sampling
 Stratified sampling begins by dividing a population of elements into distinct subpopulations
called strata.

 Strata are formed so that each population element is assigned to only one
stratum.

 Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling.

 When subpopulations vary considerably, it is advantageous to sample each subpopulation (stratum) independently.

Stratified Sampling…
 The strata should be mutually exclusive: every element in the population must be
assigned to only one stratum.

 The strata should also be collectively exhaustive: no population element can be excluded.

 Then random or systematic sampling is applied within each stratum. This often
improves the representativeness of the sample by reducing sampling error.
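
A stratified sample can be sketched as follows. The slides do not say how many units to take from each stratum, so this sketch assumes proportional allocation (each stratum contributes in proportion to its size) and simple random sampling within strata; the function name `stratified_sample` and the student data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n, seed=None):
    """Group units into strata, then draw a random sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # each unit falls in exactly one stratum
    total = len(units)
    sample = {}
    for name, members in strata.items():
        # Proportional allocation (an assumption); rounding may shift the total slightly.
        size = min(len(members), max(1, round(n * len(members) / total)))
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population: 60 first-year and 40 second-year students.
units = [("first_year", i) for i in range(60)] + [("second_year", i) for i in range(40)]
sample = stratified_sample(units, lambda u: u[0], n=10, seed=7)
print({name: len(members) for name, members in sample.items()})
```

Using a function to assign strata guarantees the mutually exclusive and collectively exhaustive conditions, since every unit receives exactly one stratum label.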

Cluster Sampling
 Cluster sampling is a probability sampling method in which the sampling units at some point in the selection
process are collections, or clusters, of population elements.

 Cluster sampling is a sampling technique used when "natural" groupings are evident in the population.

 The total population is divided into these groups (or clusters), and a sample of
the groups is selected.

 Then the required information is collected from the elements within each
selected group.

 This may be done for every element in these groups, or a subsample of elements
may be selected within each of these groups.
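
Both variants described above, observing every element of the selected clusters or subsampling within them, can be sketched with one function. The function name `cluster_sample`, the parameter `per_cluster`, and the school data are hypothetical and serve only as an illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Select clusters at random; take all elements or a subsample from each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample of the groups
    result = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            result[name] = list(members)                # every element in the group
        else:
            size = min(per_cluster, len(members))
            result[name] = rng.sample(members, size)    # subsample within the group
    return result

# Hypothetical population: 5 schools (clusters) with 8 pupils each.
clusters = {
    f"school_{c}": [f"school_{c}_pupil_{i}" for i in range(1, 9)]
    for c in "ABCDE"
}
all_elements = cluster_sample(clusters, 2, seed=5)
subsampled = cluster_sample(clusters, 2, per_cluster=3, seed=5)
print(sorted(all_elements), [len(v) for v in subsampled.values()])
```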

Cluster Sampling…
 Cluster sampling can be single-stage or multistage.

Judgment Sampling
 Judgmental sampling is a non-probability sampling technique where the researcher
selects units to be sampled based on their knowledge and professional judgment.

 It is also called purposive sampling.

 Judgment sampling is used in cases where an authority's specialist knowledge can select a
more representative sample, and so bring more accurate results, than probability
sampling techniques would.

 The process involves nothing but purposely handpicking individuals from the
population based on the authority's or the researcher's knowledge and judgment.

Quota Sampling
 Quota sampling is a non-probability sampling technique wherein the assembled
sample has the same proportions of individuals as the entire population with
respect to known characteristics, traits or focused phenomenon.

 The main reason why researchers choose quota samples is that it allows the
researchers to sample a subgroup that is of great interest to the study.

 Quota sampling also allows the researchers to observe relationships between subgroups.

Quota Sampling…

 Step-by-step Quota Sampling


 The first step in non-probability quota sampling is to divide the population into mutually exclusive subgroups.

 Then, the researcher must identify the proportions of these subgroups in the population; the same proportions will be applied in the sampling process.

 Finally, the researcher selects subjects from the various subgroups while taking into consideration the proportions noted in the previous step.

 The final step ensures that the sample mirrors the population with respect to the chosen characteristics.
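
The three steps can be sketched in code. Because quota sampling is non-probabilistic, the sketch simply accepts subjects in the order they are encountered until each quota is full; rounding the quotas to whole numbers is an assumption, and the function name `quota_sample`, the subgroup labels, and the proportions are hypothetical.

```python
def quota_sample(subjects, group_of, proportions, n):
    """Fill per-subgroup quotas with subjects in the order they are encountered."""
    # Step 2: turn the population proportions into quotas (rounded; an assumption).
    quotas = {group: round(n * share) for group, share in proportions.items()}
    filled = {group: [] for group in quotas}
    # Step 3: accept subjects as they come until every quota is met (no randomness).
    for subject in subjects:
        group = group_of(subject)   # Step 1: each subject belongs to one subgroup
        if group in filled and len(filled[group]) < quotas[group]:
            filled[group].append(subject)
        if all(len(filled[g]) == quotas[g] for g in quotas):
            break
    return quotas, filled

# Hypothetical stream of 60 available subjects and assumed population proportions.
subjects = [("male", i) if i % 3 == 0 else ("female", i) for i in range(60)]
proportions = {"female": 0.6, "male": 0.4}
quotas, filled = quota_sample(subjects, lambda s: s[0], proportions, n=10)
print(quotas, {g: len(v) for g, v in filled.items()})
```

The sample reproduces the target proportions, but which individuals fill each quota depends on who happened to be available first, which is the source of potential bias in this method.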

Convenience Sampling
 Convenience sampling is a non-probability sampling technique where subjects are
selected because of their convenient accessibility and proximity to the researcher.

 The subjects are selected just because they are easiest to recruit for the study, and the
researcher does not consider selecting subjects that are representative of the entire
population.

 In all forms of research, it would be ideal to test the entire population, but in most cases,
the population is so large that it is impossible to include every individual.

 This is the reason why most researchers rely on sampling techniques like convenience
sampling, the most common of all sampling techniques.

 Many researchers prefer this sampling technique because it is fast, inexpensive, easy and
the subjects are readily available.

Snowball/Chain Sampling
 Snowball sampling is a non-probability sampling technique that is used by
researchers to identify potential subjects in studies where subjects are hard to
locate.

 Researchers use this sampling method if the trait under study is very rare or
is limited to a very small subgroup of the population.

 This type of sampling technique works like chain referral. After observing the
initial subject, the researcher asks for assistance from the subject to help identify
people with a similar trait of interest.

Snowball/Chain Sampling
 The process of snowball sampling is much like asking your subjects to nominate another
person with the same trait as your next subject.

 The researcher then observes the nominated subjects and continues in the same way until
a sufficient number of subjects has been obtained.
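
The chain-referral process can be sketched as a queue of nominated subjects. In this sketch the referrals are given as a dictionary mapping each subject to the people they nominate, which is an assumption made so that the example is self-contained; the function name `snowball_sample` and all names in the data are hypothetical.

```python
from collections import deque

def snowball_sample(seeds, referrals, target_size):
    """Observe initial subjects, then follow their nominations until enough are found."""
    sample = []
    seen = set(seeds)        # people already nominated or observed
    queue = deque(seeds)     # subjects waiting to be observed
    while queue and len(sample) < target_size:
        subject = queue.popleft()
        sample.append(subject)                      # observe this subject
        for nominee in referrals.get(subject, []):  # ask the subject for referrals
            if nominee not in seen:
                seen.add(nominee)
                queue.append(nominee)
    return sample

# Hypothetical referral chains starting from one initial subject.
referrals = {
    "Abel": ["Bekele", "Chaltu"],
    "Bekele": ["Dawit"],
    "Chaltu": ["Dawit", "Eden"],
    "Dawit": [],
    "Eden": ["Abel"],
}
sample = snowball_sample(["Abel"], referrals, target_size=4)
print(sample)
```

The sampling stops either when the target size is reached or when the referral chains run out, so the final sample may be smaller than planned.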
