
Statistics

Statistics is the science of collecting, organizing, analyzing, and interpreting quantitative data. It involves collecting data from a population and organizing it into tables, graphs, or charts. Common statistical terms include population, sample, variable, measurement, parameter, and statistic. Statistics is used to estimate characteristics of populations from samples. For example, to estimate the average value of cars in the Philippines, a random sample of 200 cars could be selected and their values averaged to estimate the population parameter.

Uploaded by

myeonnie
Copyright
© All Rights Reserved
STATISTICS

Statistics

• A branch of science that deals with the collection, presentation, organization, analysis, and interpretation of quantitative data.
• Collection of data is the process of gathering relevant information from the population.
• Organization of data is the systematic arrangement of data into tables, graphs, or charts.
Review of Terminologies:
POPULATION
• the collection of all elements under consideration in a statistical inquiry
SAMPLE
• a subset of the population
VARIABLE
• a characteristic or attribute of the elements in a collection that can assume different values for the different elements
MEASUREMENT
• a number or attribute computed for each member of a population or of a sample
SAMPLE DATA
• the measurements of the sample elements
PARAMETER
• a summary measure describing a specific characteristic of the population
STATISTIC
• a summary measure describing a specific characteristic of the sample
Grand Picture of Statistics
ILLUSTRATIVE EXAMPLE:
There are millions of passenger automobiles in the Philippines. What is their average value? It is obviously impractical to solve this problem directly by assessing the value of every single car in the country, adding up all those numbers, and then dividing by however many numbers there are. Instead, the best we can do is estimate the average. One natural way to do so is to randomly select some of the cars, say 200 of them, ascertain the value of each of those cars, and find the average of those 200 numbers.
POPULATION
• the set of all those millions of vehicles
SAMPLE
• the set of 200 cars selected
SAMPLE DATA
• the 200 numbers
• the monetary values of the cars selected
MEASUREMENT
• the number attached to each car, its value
VARIABLE
• the value of a car (other attributes of the same cars, such as vehicle type/model and color, are also variables)
PARAMETER
• the average value of all cars in the population
STATISTIC
• the average of the 200 sample values
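The relationship between the parameter and the statistic in the example above can be sketched in Python. The car values below are made up (drawn uniformly between two arbitrary amounts) purely for illustration; only the sample size of 200 comes from the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: one made-up monetary value per car.
population = [random.uniform(50_000, 2_000_000) for _ in range(100_000)]
parameter = statistics.mean(population)   # population average (unknown in practice)

sample = random.sample(population, 200)   # the 200 randomly selected cars
statistic = statistics.mean(sample)       # average of the sample data

print(f"parameter (population mean): {parameter:,.0f}")
print(f"statistic (sample mean):     {statistic:,.0f}")
```

The statistic changes from sample to sample, while the parameter stays fixed; the sample mean is used as an estimate of the population mean.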
EXERCISES: Identify each of the following data sets as either a population or a sample:

1. The grade point averages (GPAs) of all students at a college.
2. The GPAs of a randomly selected group of students on a college campus.
3. The ages of the nine Supreme Court Justices of the United States on January 1, 1842.
4. The gender of every second customer who enters a movie theater.
5. The lengths of Atlantic croakers caught on a fishing trip to the beach.

A researcher wishes to estimate the average amount spent per person by visitors to Enchanted Kingdom. He takes a random sample of forty visitors and obtains an average of Php 5,000.00 per person.

1. What is the population of interest?
2. What is the parameter of interest?
3. Based on this sample, do we know the average amount spent per person by visitors to the park? Explain fully.
Areas in Applied Statistics

1. Descriptive statistics includes all the techniques used in organizing, summarizing, and presenting the data on hand.
2. Inferential statistics includes all the techniques used in analyzing sample data that lead to generalizations about the population from which the sample came.
Types of Data

QUALITATIVE DATA
• measurements for which there is no natural numerical scale but which consist of attributes, labels, or other nonnumerical characteristics

QUANTITATIVE DATA
• numerical measurements that arise from a natural numerical scale
CLASSIFICATION OF QUANTITATIVE DATA
DISCRETE DATA
• data that are obtained by counting and can take only distinct, separate values

CONTINUOUS DATA
• data that can assume an infinite number of values between any two specific values
• often include fractions and decimals
EXERCISES: Identify the following measures as qualitative or quantitative.
1. The genders of the first 40 newborns in a hospital in one year.
2. The natural hair color of 20 randomly selected fashion models.
3. The ages of 20 randomly selected fashion models.
4. The fuel economy in miles per gallon of 20 new cars purchased last month.
5. The political affiliation of 500 randomly selected voters.
Collection of Data

• Measurement is the process of determining the value or label of the variable based on what has been observed.
Levels of Measurement
1. The ratio level of measurement has all of the following properties:
a) The numbers in the system are used to classify a person/object into distinct and non-overlapping categories.
b) The system arranges the categories according to magnitude.
c) The system has a fixed unit of measurement representing a set size throughout the scale.
d) The system has an absolute zero.
2. The interval level of measurement satisfies only the first three properties of the ratio level.
3. The ordinal level of measurement satisfies only the first two properties of the ratio level.
4. The nominal level of measurement satisfies only the first property of the ratio level.
Examples of Measurement Scales

NOMINAL-LEVEL
• Zip code
• Gender (Male, Female)
• Eye color
• Political affiliation
• Religious affiliation
• Major field
• Nationality
ORDINAL-LEVEL
• Grade (A, B, C)
• Judging (1st place, 2nd place)
• Rating scale (poor, good, excellent)
• Team ranking
INTERVAL-LEVEL
• SAT score
• IQ
• Temperature
RATIO-LEVEL
• Height
• Weight
• Time
• Salary
• Age
Classification of data according to source

• Primary data are data documented by the primary source. The data collectors themselves documented these data.
• Secondary data are data documented by a secondary source. An individual or agency other than the data collectors documented these data.
Methods of collecting data

• A survey is a method of collecting data on the variable of interest by asking people questions.
• An experiment is a method of collecting data in which there is direct human intervention on the conditions that may affect the values of the variables of interest.
• The observation method is a method of collecting data on the phenomenon of interest by recording observations about the phenomenon as it actually happens.
Sampling and Sampling Techniques

• The target population is the population we want to study, while the sampled population is the population from which we actually select the sample.
• Probability sampling is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling.
Methods of Probability Sampling
1. Stratified sampling
2. Simple random sampling
3. Systematic sampling
4. Cluster sampling
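The four probability sampling methods listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the population of 100 numbered elements, the two strata, and the ten clusters are all hypothetical.

```python
import random

random.seed(1)
population = list(range(1, 101))   # hypothetical element IDs 1..100
n = 10                             # desired sample size

# Simple random sampling: every subset of size n is equally likely.
simple_random = random.sample(population, n)

# Systematic sampling: random start, then every k-th element.
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: draw a separate random sample within each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values()
              for x in random.sample(group, n // 2)]

# Cluster sampling: randomly pick whole clusters and keep all their elements.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [x for c in random.sample(clusters, 2) for x in c]

print(simple_random, systematic, stratified, cluster_sample, sep="\n")
```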
Methods of Nonprobability Sampling
1. Haphazard or convenience sampling
2. Judgement or purposive sampling
3. Quota sampling
Presentation of Data

• Textual presentation of data incorporates important figures in a paragraph of text.
• Tabular presentation of data arranges figures in a systematic manner in rows and columns.
• Graphical presentation of data portrays numerical figures or relationships among variables in pictorial form.
Organization of Data

• Raw data are data in their original form.
• The array is an ordered arrangement of data according to magnitude.
• The frequency distribution is a way of summarizing data by showing the number of observations that belong in the different categories or classes.
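The three forms of organization above can be illustrated with a short Python sketch. The raw values are hypothetical and chosen only to keep the output small.

```python
from collections import Counter

raw = [18, 17, 18, 19, 18, 17, 20]   # hypothetical raw data, in original order
array = sorted(raw)                  # the array: ordered by magnitude
freq = Counter(array)                # frequency of each distinct value

# Print the frequency distribution as value / frequency pairs.
for value in sorted(freq):
    print(value, freq[value])
```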
Measures of Central Tendency

Measures of central tendency are descriptive measures used to describe the center of a set of numerical data.
1. The arithmetic mean is the most common type of average. It is the sum of all the observed values divided by the number of observations.
2. The median is the value that divides the array into two equal parts.
3. The mode is the observed value that occurs with the greatest frequency in a data set.
Ungrouped Data:

Simple Mean:
To compute the mean of ungrouped data, we use the formula:

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$$

Example 1: Mean for Ungrouped Data

1. The ages of five contestants in a Statistics Quiz Bee are the following: 18, 17, 18, 19, and 18. Find their average age.

$$\bar{X} = \frac{18 + 17 + 18 + 19 + 18}{5} = \frac{90}{5} = 18$$

Add the values (ages), then divide the sum by 5.
The mean age of the contestants is therefore 18.
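The same computation can be checked with a few lines of Python; the variable names are illustrative only.

```python
ages = [18, 17, 18, 19, 18]        # ages of the five contestants

mean_age = sum(ages) / len(ages)   # sum of values divided by number of values
print(mean_age)
```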
Example 2: Median for Ungrouped Data

A college professor at a certain university assigns statistics practice problems to be worked on via the net. Students must use a secret code to access the problems, and the times of log-in and log-out are automatically recorded for the professor. At the end of the week, the professor examines the amount of time each student spent solving the assigned problems. Find the median. The data, in minutes, are provided below.
15, 28, 25, 48, 22, 43, 39, 44, 43, 49, 34, 22, 33, 27, 25, 22, 30
Solution
a. First, arrange the data in ascending or descending order:
15, 22, 22, 22, 25, 25, 27, 28, 30, 33, 34, 39, 43, 43, 44, 48, 49
b. Next, divide the data set into two equal parts:

15, 22, 22, 22, 25, 25, 27, 28 | 30 | 33, 34, 39, 43, 43, 44, 48, 49
Since there is an odd number of values in the data (17), we take the middle value, the 9th, which is 30, as the median of the data set.
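The two steps of the solution (sort, then take the middle value) can be sketched in Python. The standard-library `statistics.median` is used only as a cross-check.

```python
import statistics

minutes = [15, 28, 25, 48, 22, 43, 39, 44, 43,
           49, 34, 22, 33, 27, 25, 22, 30]

ordered = sorted(minutes)      # step a: arrange in ascending order
middle = len(ordered) // 2     # step b: index of the middle value (17 values -> index 8)
median = ordered[middle]       # valid as written only for an odd number of values

print(median)
print(statistics.median(minutes))  # library cross-check
```

For an even number of values, the usual convention (also followed by `statistics.median`) is to average the two middle values.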
Example 3: Mode

The mode is the number/value/observation in a data set that appears the greatest number of times. If no number in the list is repeated, then there is no mode for the list. However, it is also possible to have more than one mode for the same distribution of data (bimodal, trimodal, or multimodal).
Ungrouped Data:

To find the mode of ungrouped data, find the frequency of each number/value/observation; the one with the highest frequency is the mode.
MODE = number/value/observation with the highest frequency
Example for Mode: Ungrouped Data

Find the mode of the given data set:
15, 28, 25, 48, 22, 43, 39, 44, 43, 49, 34, 22, 33, 27, 25, 22, 30

Solution: First, arrange the data set in ascending or descending order:
15, 22, 22, 22, 25, 25, 27, 28, 30, 33, 34, 39, 43, 43, 44, 48, 49
Next, determine the number that appears the greatest number of times.

In the given data, the number that appears the greatest number of times is 22 (three times). Thus, the mode of the data set is 22. The data set is said to be unimodal.
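Counting frequencies and picking the highest one can be sketched with `collections.Counter`; `statistics.multimode` returns every value tied for the highest frequency, which covers the bimodal and multimodal cases mentioned earlier.

```python
from collections import Counter
import statistics

minutes = [15, 28, 25, 48, 22, 43, 39, 44, 43,
           49, 34, 22, 33, 27, 25, 22, 30]

freq = Counter(minutes)                 # frequency of each value
mode, count = freq.most_common(1)[0]    # value with the highest frequency
modes = statistics.multimode(minutes)   # all values tied for the highest frequency

print(mode, count)
print(modes)
```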
Measures of Location

Measures of location, or position, also called fractiles, are used to specify the location of a specific value in relation to the rest of the sample. They divide the distribution into an equal number of parts.
1. The percentiles divide the ordered observations into 100 equal parts.
2. The quartiles divide the ordered observations into 4 equal parts.
3. The deciles divide the ordered observations into 10 equal parts.
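As a rough sketch, the cut points for these fractiles can be obtained with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data are the 17 study times from the median example. Textbooks differ on how to interpolate between observations, so the library's default "exclusive" method is an assumption here and may not match a particular textbook's rule exactly.

```python
import statistics

minutes = [15, 22, 22, 22, 25, 25, 27, 28, 30,
           33, 34, 39, 43, 43, 44, 48, 49]

quartiles = statistics.quantiles(minutes, n=4)      # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(minutes, n=10)       # 9 cut points: D1..D9
percentiles = statistics.quantiles(minutes, n=100)  # 99 cut points: P1..P99

print(quartiles)   # Q2 coincides with the median
```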
Seatwork: Ungrouped Data

Consider the given sets of data:
Set A: 9, 12, 13, 15, 15, 17, 24
Set B: 7, 11, 15, 15, 17, 19, 21
Set C: 11, 11, 15, 15, 15, 18, 20
Weighted Mean:
• It is a mean calculated by giving some values in a data set more influence according to some attribute of the data.
• It is an average in which each quantity to be averaged is assigned a weight, and these weights determine the relative importance of each quantity in the average.
• A weight is equivalent to having that many like items with the same value involved in the average.
The formula for the weighted mean is:

$$WM = \frac{\sum wx}{\sum w}$$

where w is the weight of each value and x is the matching value.
Example:

1. Xandra bought fruits for New Year. She bought 3 apples at 10 pesos each, 5 ponkans at 5 pesos each, 3 pears at 15 pesos each, and 4 pieces of chico at 25 pesos each. What is the average price of each fruit that Xandra bought?
Solution:

$$WM = \frac{\sum wx}{\sum w} = \frac{(3)(10) + (5)(5) + (3)(15) + (4)(25)}{3 + 5 + 3 + 4} = \frac{30 + 25 + 45 + 100}{15} = \frac{200}{15} \approx 13.33$$

Thus, the average price of each fruit bought by Xandra is 13.33 pesos.
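The weighted-mean formula translates directly into Python. Here the quantities bought act as the weights and the prices as the values; the variable names are illustrative.

```python
prices = [10, 5, 15, 25]     # x: price per piece (apple, ponkan, pear, chico)
quantities = [3, 5, 3, 4]    # w: number of pieces bought at each price

weighted_sum = sum(w * x for w, x in zip(quantities, prices))  # sum of w*x
total_weight = sum(quantities)                                 # sum of w
weighted_mean = weighted_sum / total_weight

print(round(weighted_mean, 2))
```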
Grouped Data:
Class Mark or Midpoint Method:
In this method, the class mark of each interval has to be known; it is then multiplied by the corresponding frequency of that class interval.
The formula for the mean using this method is:

$$\bar{X} = \frac{\sum cf \cdot X}{n}$$

where: cf = frequency
X = class mark
n = total number of observations
Example:

1. Consider the frequency distribution below:

cl       cf
75-79     5
70-74     7
65-69     8
60-64    10
55-59     7
50-54     9
45-49     4
        n = 50
Determine the mean of the distribution.

Solution:
First, get the midpoint or class mark of each class interval.
Second, multiply the frequency of each class by the corresponding midpoint or class mark.
Last, get the sum of the products.
Table
cl       cf     X    cfX
75-79     5    77    385
70-74     7    72    504
65-69     8    67    536
60-64    10    62    620
55-59     7    57    399
50-54     9    52    468
45-49     4    47    188
        n = 50       Σ cfX = 3,100
From the values in the table, we can now compute the mean by substituting n = 50 and Σ cfX = 3,100:

$$\bar{X} = \frac{\sum cf \cdot X}{n} = \frac{3{,}100}{50} = 62$$

Therefore, the mean of the data is 62.
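The class-mark method can be sketched in Python using the frequencies from the solution table above (the ones that add up to n = 50 and give Σ cfX = 3,100). The tuple layout is just one convenient, assumed representation of a frequency table.

```python
# (lower limit, upper limit, frequency) for each class interval
classes = [
    (75, 79, 5), (70, 74, 7), (65, 69, 8), (60, 64, 10),
    (55, 59, 7), (50, 54, 9), (45, 49, 4),
]

n = sum(f for _, _, f in classes)                 # total number of observations
# class mark = midpoint of the interval; multiply by the class frequency
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
grouped_mean = sum_fx / n

print(n, sum_fx, grouped_mean)
```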
Grouped Data:
The formula for the median of grouped data is:

$$\tilde{X} = LB_{MC} + \left(\frac{\frac{N}{2} - {<}CF_b}{cf_{MC}}\right) i$$

where: $\tilde{X}$ = median
LB_MC = exact lower class boundary of the median class
<CF_b = less-than cumulative frequency of the class before the median class
i = class size
cf_MC = frequency of the median class
Example:

1. The record of 21 people in a 100 m race is summarized in the given frequency table:
Time (in seconds)    Frequency
51-55                 2
56-60                 7
61-65                 8
66-70                 4
                     21
Determine the median of the given data.
Solution:
a. Compute the less-than cumulative frequency (<CF) of the data.
Time (in seconds)    Frequency    <CF
51-55                 2            2
56-60                 7            9
61-65                 8           17
66-70                 4           21
                     21
b. Determine the median class by computing the value of N/2.

N/2 = 21/2 = 10.5

c. Locate the computed value of N/2 in the <CF column: find the first class whose <CF is greater than or equal to N/2.
Note:

Looking at the <CF column, 10.5 is greater than 9 but not greater than 17, so it falls within the class whose <CF is 17. The interval that corresponds to 17 is 61-65.

Therefore, the median class is the interval 61-65.
d. Look at the <CF corresponding to the median class, then get the <CF of the class before it.

The <CF_b (<CF before the median class) is 9.

e. Subtract <CF_b from N/2.

N/2 - <CF_b = 10.5 - 9 = 1.5
f. Divide the answer in step e by the frequency of the median class.

1.5 / 8 = 0.1875

g. Multiply the answer in step f by the value of i. To determine i, subtract the lower limit from the upper limit of any class interval, then add 1.
i = (65 - 61) + 1 = 5
0.1875 x 5 = 0.9375
h. Add the answer in step g to the exact lower boundary (LB_MC) of the median class. The result is the median of the data set.
The exact lower boundary (LB_MC) of the median class is 60.5.
60.5 + 0.9375 = 61.4375, or about 61.44
Using the formula directly:

$$\tilde{X} = LB_{MC} + \left(\frac{\frac{N}{2} - {<}CF_b}{cf_{MC}}\right) i = 60.5 + \left(\frac{\frac{21}{2} - 9}{8}\right)(5) = 60.5 + \left(\frac{10.5 - 9}{8}\right)(5) = 60.5 + 0.9375 = 61.4375 \approx 61.44$$

Hence, the median of the data set in the problem is 61.44.
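Steps a to h can be condensed into a short Python sketch. It assumes integer class limits, so the exact lower boundary is the lower limit minus 0.5 and the class size is the upper limit minus the lower limit plus 1, as in the worked example.

```python
# (lower limit, upper limit, frequency), in ascending order of class
classes = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]

n = sum(f for _, _, f in classes)
half = n / 2                      # N/2 = 10.5

cf_before = 0                     # <CF of the classes below the current one
for lower, upper, f in classes:
    if cf_before + f >= half:     # first class whose <CF reaches N/2: the median class
        lb = lower - 0.5          # exact lower boundary of the median class
        i = upper - lower + 1     # class size
        median = lb + (half - cf_before) / f * i
        break
    cf_before += f

print(median)
```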
Measures of Dispersion

Measures of dispersion, or variability, describe the spread or scatter of the values around the mean.
1. The range is the distance between the maximum value and the minimum value.
2. The variance is the average squared difference of each observation from the mean.
3. The standard deviation is the positive square root of the variance.
4. The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage.
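The four measures can be computed as follows, using the contestants' ages from the mean example. Because the definition above describes the variance as an average of squared differences, this sketch divides by N (the population form, `statistics.pvariance`); a sample variance would divide by N - 1 instead (`statistics.variance`).

```python
import statistics

ages = [18, 17, 18, 19, 18]

data_range = max(ages) - min(ages)        # maximum minus minimum
variance = statistics.pvariance(ages)     # average squared deviation from the mean
std_dev = statistics.pstdev(ages)         # positive square root of the variance
cv = std_dev / statistics.mean(ages) * 100  # coefficient of variation, in percent

print(data_range, variance, round(std_dev, 4), round(cv, 2))
```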
Measures of Skewness and Kurtosis

A measure of skewness describes the degree to which a distribution departs from symmetry. A distribution may be:
• symmetric
• positively skewed
• negatively skewed
Measures of Skewness
1. Symmetrical or Normal Distribution
In a symmetrical, unimodal distribution, the mean, median, and mode all fall at the same point.
2. Positively Skewed Distribution
In a positively skewed distribution, the extreme scores are at the high end, so the mean is larger than the median, which is larger than the mode.
3. Negatively Skewed Distribution
The order of the measures of central tendency is the opposite of that in the positively skewed distribution: the mean is smaller than the median, which is smaller than the mode.
Measure of Kurtosis

A measure of kurtosis describes the peakedness or flatness of the curve of the distribution.

$$K = \frac{\sum (x - \bar{x})^4}{n s^4}$$

i. when K > 3, the distribution is leptokurtic
ii. when K = 3, the distribution is mesokurtic
iii. when K < 3, the distribution is platykurtic
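A minimal sketch of the kurtosis formula above is shown below. The slide does not say whether s is the sample or the population standard deviation; this sketch assumes the population form (dividing by n), and the data set is hypothetical.

```python
def kurtosis(data):
    """K = sum((x - mean)^4) / (n * s^4), with s taken as the population SD (assumption)."""
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / n      # population variance, s^2
    fourth = sum((x - mean) ** 4 for x in data)      # sum of fourth-power deviations
    return fourth / (n * s2 ** 2)                    # s^4 = (s^2)^2


def classify(k):
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"


k = kurtosis([1, 2, 3, 4, 5])   # hypothetical, evenly spread data
print(k, classify(k))
```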
Measure of Kurtosis

• Leptokurtic. The curve is more peaked and the hump is narrower or sharper than that of the normal curve.
• Platykurtic. The curve is less peaked and the hump is flatter than that of the normal curve.
• Mesokurtic. The hump is the same as that of the normal curve. It is neither too flat nor too peaked.
Normal Distribution

The normal distribution is a model for the distribution of a set of data that follows a bell-shaped curve.

The graph of a normal distribution is called a normal curve.
Properties of Normal Distribution

1. The normal curve is bell-shaped.
2. The mean, median, and mode are located at the center of the distribution, and the distribution is unimodal.
3. It is symmetrical about the mean.
4. It is continuous and asymptotic with respect to the x-axis.
5. The total area under the curve is 1.00 or 100%.
Many Normal Distributions

There is an infinite number of normal curves.

By varying the mean and the standard deviation, we obtain different normal distributions.

The Standard Normal Distribution

• A normal distribution with a mean of 0 and a standard deviation of 1 is called the standard normal distribution.
• The z-score measures how many standard deviations an observed value is above or below the mean.
• The sample z-score is given by the formula $z = \frac{x - \bar{x}}{s}$, where x is the observed value, $\bar{x}$ is the sample mean, and s is the sample standard deviation.
• The standard score is useful when we want to compare two or more observed values from different data sets.
Area under the Standard Normal Curve

Given, followed by Steps:
• Between zero and any number: look up the area in the table.
• Between two positives, or between two negatives: look up both areas in the table and subtract the smaller from the larger.
• Between a negative and a positive: look up both areas in the table and add them together.
• Less than a negative, or greater than a positive: look up the area in the table and subtract it from 0.5000.
• Greater than a negative, or less than a positive: look up the area in the table and add it to 0.5000.
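In place of a printed table, the areas can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch only; the observed value 85, mean 70, and standard deviation 10 are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the standard normal distribution


def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def area_between(a, b):
    """Area under the standard normal curve between z = a and z = b (a <= b)."""
    return std_normal.cdf(b) - std_normal.cdf(a)


z = z_score(85, 70, 10)                     # hypothetical observed value
print(z)
print(round(area_between(0, z), 4))         # between zero and z (the table value)
print(round(1 - std_normal.cdf(z), 4))      # greater than a positive: 0.5 - table value
print(round(area_between(-1.96, 1.96), 4))  # between a negative and a positive
```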
Test of Hypothesis

• A one-tailed test of hypothesis is a test where the alternative hypothesis specifies a one-directional difference for the parameter of interest.
• A two-tailed test of hypothesis is a test where the alternative hypothesis does not specify a directional difference for the parameter of interest.
Test of Hypothesis

• A test statistic is a statistic whose value is calculated from sample data and serves as the basis for deciding whether or not to reject H_0 in a test of hypothesis.
• The critical region is the set of values of the test statistic for which we reject the null hypothesis. The acceptance region is the set of values of the test statistic for which we do not reject the null hypothesis.
• These two regions are separated by the critical value of the test statistic.
Critical Value

• The critical value, or tabular value, for the hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.
• We reject the null hypothesis if the computed value is greater than or equal to the critical value. (For a two-tailed test, the absolute value of the computed value is compared with the critical value.)
Types of Error

• The Type I error is the error committed when we decide to reject the null hypothesis when in reality the null hypothesis is true.
• The Type II error is the error committed when we decide not to reject the null hypothesis when in reality the null hypothesis is false.
The Level of Significance

• The level of significance, denoted by α, is the maximum probability of committing a Type I error that the researcher is willing to accept.
• The .05 and .01 levels of significance are very frequently used.
• Note: a 0.05 level of significance implies that we are willing to accept a 5% chance of committing a Type I error, which corresponds to a confidence level of 95%.
p-value

The p-value is the probability of selecting a sample whose computed value for the test statistic is equal to or more extreme than the realized value computed from the sample data, given that the null hypothesis is true.
As a rule, if the p-value is greater than the level of significance, then we do not reject the null hypothesis. On the other hand, if the p-value is less than or equal to the level of significance, then we reject the null hypothesis.
Steps in Hypothesis Testing

1. State the null and alternative hypotheses.
2. Choose the level of significance.
3. Determine the appropriate statistical technique and the corresponding test statistic to use.
4. Perform the computation. Compare the computed value with the critical value (others use the p-value instead).
5. Make the decision (reject the null hypothesis or fail to reject it).
Decision Rule

• Reject H_0 if the value of the test statistic falls in the region of rejection (for an upper-tailed test, when the test statistic is greater than or equal to the critical value).
• Reject H_0 if the p-value is less than or equal to the level of significance.
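The steps and the two decision rules can be sketched with a two-tailed one-sample z-test. Everything numeric below is hypothetical (a hypothesized mean of 4,800, a sample mean of 5,000, a known population standard deviation of 500, and a sample of 40); the function name is also illustrative and not from the source.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test of H_0: mu = mu0 against H_1: mu != mu0."""
    std_normal = NormalDist()
    z = (sample_mean - mu0) / (sigma / sqrt(n))     # test statistic
    critical = std_normal.inv_cdf(1 - alpha / 2)    # two-tailed critical value
    p_value = 2 * (1 - std_normal.cdf(abs(z)))      # probability of a value this extreme
    reject = p_value <= alpha                       # p-value decision rule
    return z, critical, p_value, reject


z, critical, p_value, reject = one_sample_z_test(5000, 4800, 500, 40)
print(round(z, 4), round(critical, 4), round(p_value, 4), reject)
```

For this test, the critical-value rule (reject when |z| is at least the critical value) and the p-value rule lead to the same decision.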
Test Statistic
FREQUENTLY USED INFERENTIAL STATISTICAL TOOLS

PARAMETRIC (level of measurement: INTERVAL/RATIO)
• Single sample: t test for single sample; z test
• Two related samples: paired t test
• Two independent samples: t test for independent samples
• More than two related samples: ANOVA for repeated measures
• More than two independent samples: ANOVA F-test
• Correlational measures: Pearson r

NON-PARAMETRIC (level of measurement: ORDINAL)
• Single sample: Kolmogorov-Smirnov one-sample test
• Two related samples: sign test; Wilcoxon matched-pairs signed-ranks test
• Two independent samples: Mann-Whitney U test; Wald-Wolfowitz runs test
• More than two related samples: Friedman rank test
• More than two independent samples: Kruskal-Wallis H test
• Correlational measures: Spearman rank-order correlation

NON-PARAMETRIC (level of measurement: NOMINAL)
• Single sample: chi-square one-sample test
• Two related samples: McNemar test
• Two independent samples: chi-square test for independent samples with two subclasses
• More than two independent samples: chi-square test for independent samples with more than two subclasses
• Correlational measures: phi coefficient; Yule's Q
Parametric Test

Parametric tests are tests applied to data that are normally distributed and measured at the interval or ratio level.
t-test for Dependent Samples (paired)

• A parametric test applied to one group of samples.
• It can be used in the evaluation of a certain program or treatment.
• It is applied when the mean before and the mean after are being compared.
t-test for Independent Samples (unpaired)
• Used when we compare the means of two independent groups.
• Used when the sample size is less than 30.
z-test

• Used to compare two means: the sample mean and the perceived population mean.
• Used to compare two sample means taken from the same population.
• Used when the sample size is equal to or greater than 30.
• It can be applied in two ways: the one-sample mean test and the two-sample mean test.
F-test

• Another parametric test used to compare the means of two or more independent groups.
• Also known as the analysis of variance (ANOVA).
• Kinds of ANOVA: one-way, two-way, three-way.
• We use ANOVA to find out if there is a significant difference between and among the means of two or more independent groups.
The Pearson Product-Moment Correlation Coefficient, r
• It is used to analyze whether a relationship exists between two variables (measured on the interval or ratio scale), say variables x and y.
• It was developed by Karl Pearson, which is why the correlation coefficient is sometimes called "Pearson's r." The formula is defined by:

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
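The computational formula for r can be written out term by term in Python. The paired data below are hypothetical and serve only to exercise the formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw-score) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x = [1, 2, 3, 4, 5]   # hypothetical paired observations
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```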
Basic properties of r

The range of the correlation coefficient is from -1 to +1.

A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation.

If the value is equal to 0.00, there is no linear relationship between the variables.
Simple Linear Regression Analysis

• Simple linear regression analysis predicts the value of y given the value of x.
• It is used when there is a relationship between the independent variable x and the dependent variable y.
• The formula for simple linear regression is y = a + bx, where y = dependent variable, x = independent variable, a = y-intercept, and b = slope of the line.
• This statistical procedure is concerned with prediction or forecasting.
• It is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
• In simple linear regression, we predict scores on one variable (dependent) from the scores on a second variable (independent).
• The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X.
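A least-squares fit of the line y = a + bx can be sketched as follows. The slide does not give formulas for a and b; the standard least-squares expressions are used here as an assumption, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                  # y-intercept
    return a, b


x = [1, 2, 3, 4, 5]   # predictor (independent) variable, hypothetical
y = [2, 4, 5, 4, 5]   # criterion (dependent) variable, hypothetical

a, b = fit_line(x, y)
predicted = a + b * 6   # predicted y for a new x value of 6
print(round(a, 2), round(b, 2), round(predicted, 2))
```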
Nonparametric Test

Nonparametric tests are tests that do not require a normal distribution. They utilize both nominal and ordinal data.
Chi-Square Test

• This is a test of the difference between the observed and expected frequencies.
• The test for goodness of fit determines if the sample under analysis was drawn from a population that follows some specified distribution.
• The test for homogeneity answers the proposition that several populations are homogeneous with respect to some characteristic.
• The test for independence (one of the most frequent uses of chi-square) tests the null hypothesis that two criteria of classification, when applied to a population of subjects, are independent. If they are not independent, then there is an association between them.
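The slides do not state the chi-square formula; the standard statistic, the sum of (observed - expected)² / expected over all categories, is assumed in this sketch. The observed and expected frequencies are hypothetical.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [30, 20]   # hypothetical observed frequencies
expected = [25, 25]   # frequencies expected under the null hypothesis

statistic = chi_square(observed, expected)
print(statistic)
```

The computed value would then be compared with the tabular chi-square value for the chosen level of significance and the appropriate degrees of freedom.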
