STAT1100
Definition of Statistics
• In plural sense, statistics is any set of numerical data (e.g. vital statistics, monthly sales)
• In singular sense, statistics is a branch of science that deals with the collection, presentation, analysis
and interpretation of data
“Statistics is all about the study of the theory and applications of the scientific methods dealing with the
process of collecting, handling and using data for sound decision making.” Almeda, et al., 2010
“Statistics is the business of using the scientific method to answer research questions about the world.” -
Rumsey, 2003
INFORMATION EMPOWERS!
Statistics provides us with the tools we need to convert massive data volumes into pertinent information
that we can use to make better and more sensible decisions.
Any value generated from the population is a parameter - a numerical value that describes a
population.
Any value derived from the sample is a statistic - a numerical value that describes the sample.
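The distinction between a parameter and a statistic can be sketched in a few lines of Python. The ages below are made-up values used only for illustration (they are not data from this course), and the seed is an arbitrary choice so that the run can be repeated.

```python
import random
from statistics import mean

# Hypothetical population: ages of 10 people (illustrative values only).
population = [18, 19, 19, 20, 20, 20, 21, 22, 23, 28]

# Parameter: computed from every element of the population.
population_mean = mean(population)

# Statistic: computed from a sample drawn from that population.
rng = random.Random(1100)           # fixed seed, an arbitrary choice
sample = rng.sample(population, 4)  # 4 units drawn without replacement
sample_mean = mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The parameter stays the same no matter how often the code is run, while the statistic changes whenever a different sample is drawn.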
• In 1662, John Graunt published statistical information about births and deaths. Graunt’s work was
followed by studies of mortality and disease rates, population sizes, incomes, and unemployment rates.
• Gottfried Achenwall used the word “statistik” at a German university in 1749 to mean the political
science of different countries.
Fields of Statistics
1. Statistical Theory
- Deals with the development and exposition of theories that serve as bases of statistical methods. Also
referred to as Theoretical Statistics or Mathematical Statistics.
2. Statistical Methods
- These are procedures and techniques used in the collection, presentation, analysis, and interpretation
of data. Also referred to as Applied Statistics.
• Procedure – a fixed, step-by-step sequence of activities that must be followed in the same order to
correctly perform a task.
• Technique – a practical method, skill, or art applied to a particular task.
Statistical methods can be further divided into two areas: Descriptive and Inferential Statistics.
Areas of Statistics
1. Descriptive Statistics
• Includes all the techniques used in collecting, organizing, summarizing, and presenting the data.
• It only allows us to summarize or describe relevant characteristics of data without drawing conclusions
or inferences about a larger set.
• It utilizes numerical, tabular, and graphical methods to look for patterns in a data set, summarize the
information revealed in the data set, and present that information in a convenient form.
2. Inferential Statistics
• Includes all the techniques used in analyzing sample data that lead to generalizations about the
population from which the sample came.
• It utilizes sample data to make conclusions or generalizations about the population.
• Includes interpreting, making inferences, estimating, hypothesis testing, determining relationships,
and making predictions.
• In government, policy makers use statistics as the basis of their decisions. The preparation of local and
national budgets depends on statistics.
• Statistics-based research boosts the growth of natural science fields such as Biology, Chemistry,
Environmental Science, and Physics.
• Large business companies invest in the acquisition of big data. They analyze it to learn more about the
demographics, location, and personalities of their clients, primarily to reach more target clients.
• Investors monitor the past patterns of the stock market to aid them in their investing decisions.
• Sport statistics are collected every game and broken down by team, by quarter, and even by player to
monitor the performance of the team and the player.
• In education, a researcher tests whether new methods of teaching are better than old ones using
statistical methods.
• Many people look at weather forecasts, such as temperature and rainfall, before going out or doing
certain tasks.
Partial list of general research objectives that can be accomplished by performing a statistical inquiry
1. Describe the characteristics of the elements in the population under study through the computation
or estimation of a parameter such as the proportion, total, and average.
2. Compare the characteristics of the elements in the different subgroups in the population through
contrasts of their respective summary measures.
3. Justify an assertion made by the researcher about a particular characteristic of the population or
samples.
4. Determine the nature and strength of relationships among the different variables of interest.
5. Identify the different groups of interrelated variables under study.
6. Reveal the natural groupings of the elements in the population based on the values of a set of
variables.
7. Determine the effects of one or more variables on a response variable.
8. Clarify patterns and trends in the values of a variable over time or space.
9. Predict the value of a variable based upon its relationship with another variable.
10. Forecast future values of a variable using a sequence of observations on the same variable taken
over time.
Variables
• Variable – a characteristic or attribute of persons or objects which can assume different values or
labels under statistical study
Examples: sex, age, educational attainment, income, level of satisfaction
• Observation – realized value of a variable
• Data – collection of observations
Example 1: Below are illustrations of variables together with their possible observations
a. The CLSU administration is interested in determining the COVID-19 vaccination status of their
students for the academic year 2022-2023
Population: set of all CLSU students for AY 2022-2023
Variable of Interest: vaccination status
b. The research division of a certain pharmaceutical company is investigating the effectiveness of a new diet
pill in reducing the weight of male adults.
Population: set of all male adults who will take the diet pill
Variables of Interest: weight before taking the pill, weight after taking the pill
Example 3:
Let us define the variable of interest as the age of the thirty CLSU BS Stat 1 students for AY 2022 – 2023.
Suppose we determine and record the ages of all 30 students as:
In a statistical inquiry, it is necessary to describe what we observe about the variable of interest in a
compact manner, such as a single number or a short label. This is done through determining the value or
label of our variables.
• Measurement – process of determining the value or label of the variable for a particular observational
unit.
• Observational Unit – individual persons, objects, places or events on which a variable is measured.
Types of Variables
1. Quantitative Variable – a variable whose observations are expressed as meaningful numbers that
indicate some sort of amount. It contains counts or measurements.
Example: age, allowance, number of students, weight, height, etc.
1. Discrete Variable - a variable which can assume a finite or countably infinite number of values, usually
measured by counting or enumeration (that is, 0, 1, 2, 3, and so on). It answers the question “How many?”
Examples: number of students, number of pets, number of chairs, etc.
2. Continuous Variable - a variable which can assume infinitely many values corresponding to a line
interval. It gives rise to measurement and answers the question “How much?”
Examples: height, weight, width, length, allowance, etc.
Levels of Measurement
1. Nominal - qualitative data, classificatory scale, weakest level of measurement, numbers or symbols
are used simply for categorizing subjects into different (nonoverlapping) groups
Examples: Automobile, Color, Marital Status, Sex, ID Number
2. Ordinal - qualitative data, classificatory with ordering scale, classifies data into categories that can be
ranked
4. Ratio - quantitative data, highest level of measurement, has the properties of nominal, ordinal and
interval levels, has absolute zero or true zero
SUMMATION NOTATION
• In Statistics, we frequently work with sums of numerical values.
• Given a set of n observations represented by X1 as the first value, X2 as the
second value, and so on up to Xn as the nth value, the sum can be expressed
as:
$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$
• The left-hand side is read as “the summation of X sub i where i ranges from 1 to n”
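As a small sketch of how the summation is evaluated, the Python snippet below adds the observations one index at a time and compares the result with the built-in `sum`. The five values are hypothetical and serve only as an illustration.

```python
# Hypothetical observations X_1, ..., X_n (illustrative values only).
X = [4, 7, 1, 9, 3]
n = len(X)

# Sum written out term by term: X_1 + X_2 + ... + X_n.
# Python lists are 0-indexed, so X_i is stored at X[i - 1].
total = 0
for i in range(1, n + 1):
    total += X[i - 1]

# The built-in sum() computes the same quantity.
print(total, sum(X))
```

Here both values printed are 24, since 4 + 7 + 1 + 9 + 3 = 24.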
Data Collection Methods
2.2 Sampling
• Population is the totality of all elements/units in the study
• Population size (𝑵) is the total number of all elements/units in the population.
Example:
Population: All BSStat 1 students at CLSU AY 2022-2023
The population size is 𝑵 = 𝟑𝟐.
In this example, our sample is composed of Angelica, Janna, Ana Mariz, Alexis,
Sheen Elleine, Karl Roy, Algheirine, and Jean Claude.
The sample size is 𝒏 = 𝟖.
• Sampling Frame is the complete list of elements/units in the population from
which a sample is drawn. This is very important before performing any sampling
technique.
• Sampling is the process of selecting a sample from the population. The sampling
method can either be probability sampling method or nonprobability sampling
method.
Types of Sampling Method
1. Probability Sampling
each unit in the population has a known, non-zero probability of
selection; in simple random sampling, these chances are equal for all units
if a probability sampling design is implemented well, an investigator can
use a relatively small sample to make inferences about an arbitrarily large
population
2. Nonprobability Sampling
elements of the population are taken depending to a large extent on
the personal feelings or purpose of the researcher and without
regard for some chance mechanism for choosing an element
the chances of selection of the elements in the population are unknown
and are generally not equal
Remarks:
Whenever possible, probability sampling is used because there is no
objective way of assessing the reliability of inferences under
nonprobability sampling
What we want is a sample that is representative of the population
The sampling frame is required in the execution of probability
sampling methods
Probability Sampling Methods
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling
5. Multistage Sampling
SRS Techniques
1. Lottery Technique / Draw Lots
i. All names in the sampling frame are written on separate papers
which are physically similar in form (to avoid bias).
ii. These are placed in a bowl or small box and thoroughly mixed.
iii. Draw the number of sheets or papers corresponding to the sample
size randomly, without looking at them.
DISADVANTAGE
If the population is big, the lottery technique is not recommended since it requires the preparation of a
large number of individual sheets, one for each unit in the population.
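The lottery technique can be imitated on a computer, which also sidesteps the disadvantage above. The following is a minimal sketch, assuming a sampling frame of N = 32 units and a sample size of n = 8 as in the earlier example; the names are placeholders and the seed is an arbitrary choice.

```python
import random

# Hypothetical sampling frame (placeholder labels, not real students).
sampling_frame = [f"Student {i}" for i in range(1, 33)]  # N = 32
n = 8                                                    # sample size

# random.sample draws n distinct units without replacement,
# which mirrors drawing n slips of paper from a well-mixed bowl.
rng = random.Random(2022)  # fixed seed so the draw can be repeated
sample = rng.sample(sampling_frame, n)
print(sample)
```

Every unit appears at most once in the result, just as a slip of paper that has been drawn is not returned to the bowl.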
2. Use of Random Numbers
i. Each name in the sampling frame is numbered from 1 to 𝑁.
Example:
Selection Procedure:
a. Make a list of the sampling units and number them from 1 to N.
b. Determine k using the formula 𝑘 = N⁄𝑛 and round off to the nearest whole number.
c. Select a random start r, where 1 ≤ 𝑟 ≤ 𝑘. Unit r is the first unit of the sample.
d. The other units of the sample correspond to 𝑟 + 𝑘, 𝑟 + 2𝑘, … and so on.
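The selection procedure above (a random start followed by every k-th unit) can be sketched in Python as follows. The function name `systematic_sample` is hypothetical, and N = 32 and n = 8 are assumed values. Note that Python's `round` rounds exact halves to the nearest even number, which may differ from the rounding rule used in class, and that when N⁄n is not a whole number the resulting sample size can differ slightly from n.

```python
import random

def systematic_sample(N, n, rng):
    """Return the 1-based unit numbers chosen by a random start and interval k."""
    k = round(N / n)        # sampling interval, rounded to a whole number
    r = rng.randint(1, k)   # random start with 1 <= r <= k (both ends included)
    # r, r + k, r + 2k, ... as long as the unit number does not exceed N
    return list(range(r, N + 1, k))

rng = random.Random(7)      # fixed seed, an arbitrary choice
units = systematic_sample(N=32, n=8, rng=rng)
print(units)
```

With N = 32 and n = 8 the interval is k = 4, so any random start from 1 to 4 yields exactly 8 units spaced 4 apart.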
3. Stratified Sampling
The population is divided into two or more non-overlapping groups
called strata, according to some criterion such as geographic location,
grade level, age, or income. All members of a stratum share the specific
characteristics.
Random samples are drawn from each stratum.
This can be performed in two ways
▪ Equal allocation
▪ Proportional allocation
- the percentage of the sample taken from each stratum is
proportionate to the percentage of the population that the stratum comprises.
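Proportional allocation can be sketched as follows. The strata, their sizes, and the total sample size n = 20 are all hypothetical. Each stratum receives a share of the sample equal to its share of the population, and a simple random sample is then drawn within each stratum. In general the rounded stratum sample sizes may not add up to exactly n, so an adjustment may be needed; the numbers here were chosen so that no rounding occurs.

```python
import random

# Hypothetical strata: year level -> list of unit labels (placeholder data).
strata = {
    "Year 1": [f"Y1-{i}" for i in range(1, 41)],  # 40 units
    "Year 2": [f"Y2-{i}" for i in range(1, 31)],  # 30 units
    "Year 3": [f"Y3-{i}" for i in range(1, 21)],  # 20 units
    "Year 4": [f"Y4-{i}" for i in range(1, 11)],  # 10 units
}
N = sum(len(units) for units in strata.values())  # population size (100)
n = 20                                            # total sample size

rng = random.Random(42)  # fixed seed, an arbitrary choice
sample = {}
for name, units in strata.items():
    # Proportional allocation: stratum sample size = n * (stratum size / N).
    n_h = round(n * len(units) / N)
    sample[name] = rng.sample(units, n_h)         # SRS within the stratum

for name, chosen in sample.items():
    print(name, len(chosen))
```

The strata make up 40%, 30%, 20%, and 10% of the population, so they contribute 8, 6, 4, and 2 units to the sample of 20.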
4. Cluster Sampling
The entire population is divided into pre-existing segments called clusters. Then
clusters will be randomly selected using simple random or systematic sampling,
and every member of each selected cluster is included in the sample
Example: A researcher wants to survey CLSU faculty members. She uses cluster
sampling as her sampling method with the colleges as clusters. She decides to
randomly choose 5 clusters (colleges) from the 9 colleges.
After selecting these 5 colleges, all the faculty members from the selected
clusters are included in the sample.
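The example above can be sketched in Python. The college names, the faculty labels, and the assumption of 5 faculty members per college are placeholders made up for illustration; only the choice of 5 clusters out of 9 comes from the example.

```python
import random

# Hypothetical clusters: 9 colleges, each with a placeholder faculty list.
colleges = {
    f"College {c}": [f"Faculty {c}-{i}" for i in range(1, 6)]
    for c in "ABCDEFGHI"
}

rng = random.Random(9)  # fixed seed, an arbitrary choice
# Randomly select 5 of the 9 clusters by simple random sampling.
selected = rng.sample(sorted(colleges), 5)

# Every member of each selected cluster enters the sample.
sample = [member for college in selected for member in colleges[college]]
print(selected)
print(len(sample))
```

The randomness enters only when the clusters are chosen; within a selected cluster no further selection takes place.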
5. Multistage Sampling
Multistage sampling is a sampling technique wherein sampling is done at two or
more hierarchical stages.
An example of hierarchical stages arises when sampling from different
geographical areas: in the Philippines, there are the regional level, provincial level, city
level, barangay level, etc.
Since there is more than one stage of sampling, you can use a different probability
sampling method at each stage.
Hierarchical Stages: university level – college level – program level – student level
Stage 1: Use SRS to choose which colleges to consider.
Stage 2: From the selected colleges, use stratified sampling with equal allocation to get
two programs or courses in each selected college.
Stage 3: In each selected program or course, use SRS again to get the sample
students.
Further stages may also be performed, considering the year level and blocks (or
sections) of students.
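A three-stage design along these lines can be sketched as follows. The university structure (6 colleges, 3 programs each, 20 students per program) and the numbers drawn at each stage are hypothetical. As a simplification, Stage 2 here draws two programs per college by simple random sampling.

```python
import random

# Hypothetical university: college -> program -> students (placeholder labels).
university = {
    f"College {c}": {
        f"Program {c}{p}": [f"{c}{p}-student{i}" for i in range(1, 21)]
        for p in range(1, 4)
    }
    for c in "ABCDEF"
}

rng = random.Random(3)  # fixed seed, an arbitrary choice
sample = []
# Stage 1: SRS of 3 colleges.
for college in rng.sample(sorted(university), 3):
    programs = university[college]
    # Stage 2: two programs from each selected college.
    for program in rng.sample(sorted(programs), 2):
        # Stage 3: SRS of 5 students within each selected program.
        sample.extend(rng.sample(programs[program], 5))

print(len(sample))
```

The final sample has 3 × 2 × 5 = 30 students, all coming from the colleges and programs selected in the earlier stages.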
1. Textual Presentation
Collected data may be organized and presented in a narrative or textual form. It is
the simplest method of presenting data.
Advantages:
o This presentation gives emphasis to significant figures and comparisons.
o It is the most appropriate approach when there are only a few significant
figures or information to be presented.
Example:
“As of August 14, 2022, 64.6% of the total Philippine populations were
vaccinated with the last dose of primary series.”
“Between May 1 and August 14, 2022, there were 280 COVID-19 related deaths
reported in the Philippines.”
2. Tabular Presentation
Tables are designed to summarize facts revealed by enquiry and to present
them in such a way that all the important factors contained in the data
under review are displayed.
This method takes the form of arranging statistical data in columns and
rows
Tabulation is the process of condensation of data into tables.
Parts of a Statistical Table
1. Heading – consists of table number,
title, and headnote
2. Box Head / Caption – contains the
column heads which describe the data
3. Stub / Classes – the portion of the
table comprising the first column on the
left
4. Field / Body – main part of the table
that contains the substance or figures of
the data
3. Graphical Presentation
A graph or chart is a device showing numerical values or relationships in
pictorial form
In a graph, the main features and implications of a body of data can be
grasped at a glance
It can simplify a concept that would otherwise have been expressed in so
many words.
Bar graph uses bars to compare data across categories of a variable. The
size of the bar represents the frequency or percentage of a particular
category.
A line graph is used to display the trend or
pattern of values of a variable over time.
A pie chart is used to show how each part of the data compares in size to the
whole. The circle is divided proportionally to the relative
frequencies (percentages), and portions of the circle are allocated to the
different groups or categories.
Infographic
Ungrouped FDT
also referred to as single-value grouping or qualitative FDT, since it is usually
used for qualitative data
the classes are the distinct categories or values of the variable
an organized tabulation of a variable that usually contains three columns –
listing of classes of the variable, frequency, and percentage (optional)
constructed for:
✓ qualitative data
✓ quantitative data with only a few unique values
Example 2: Suppose the following data set shows the favorite color of 12 students
Yellow Black Blue Blue Red Blue
Black Green Red Yellow Black Black
Table 2. Distribution of Students’ Favorite Color
Favorite Color    Frequency    Percentage
Yellow            2            16.67%
Black             4            33.33%
Blue              3            25.00%
Red               2            16.67%
Green             1            8.33%
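The tally behind an ungrouped FDT can be checked with a short Python sketch. It uses the twelve favorite colors listed in Example 2; the column widths in the printout are an arbitrary formatting choice.

```python
from collections import Counter

# The 12 observations from Example 2.
colors = ["Yellow", "Black", "Blue", "Blue", "Red", "Blue",
          "Black", "Green", "Red", "Yellow", "Black", "Black"]

freq = Counter(colors)  # class (color) -> frequency
n = len(colors)         # total number of observations

print(f"{'Favorite Color':<16}{'Frequency':>10}{'Percentage':>12}")
for color, f in freq.items():
    pct = 100 * f / n   # percentage = frequency / n * 100
    print(f"{color:<16}{f:>10}{pct:>11.2f}%")
```

The frequencies add up to 12 and the percentages to 100%, which is a quick check that no observation was missed or counted twice.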
Answer:
Table 3. Distribution of Families by Number of Children
No. of Children    Frequency    Percentage
0                  7            14%
1                  8            16%
2                  11           22%
3                  14           28%
4                  8            16%
5                  2            4%
Grouped FDT
also referred to as grouping by class intervals or quantitative FDT
the classes are intervals of values of the variable
constructed for:
✓ quantitative data with many values
it has two necessary columns – list of class intervals and frequency
additional columns can also be added for other values such as class mark,
true class boundary, relative frequency, cumulative frequencies, and
relative cumulative frequencies
Step 4. Determine the class size (𝒄). Round off with the same decimal places as
the raw data.
Step 5. Determine and enumerate the class intervals. Each class interval is defined
by its class limits – lower limit (𝐿𝐿) and upper limit (𝑈𝐿). We use the following
rules:
Step 5.1. The lowest data or the minimum is always set as the lower limit of the
first class, that is 𝐿𝐿1 = minimum. In our example, 16 is the lowest score.
Step 5.2. The next lower limit is obtained by 𝐿𝐿next = 𝐿𝐿previous + 𝑐. For example, for
the 2nd class interval, 𝐿𝐿2 = 𝐿𝐿1 + 𝑐 = 16 + 8 = 24. The pattern continues until you
create 𝑘 classes.
Step 5.3. Set the upper limits using the formula 𝑈𝐿 = 𝐿𝐿 + 𝑐 − 1 unit of measure (uom).
Since our data are all whole numbers, 1 unit of measure = 1. So, for the first
class interval, 𝑈𝐿1 = 𝐿𝐿1 + 𝑐 − 1uom = 16 + 8 − 1 = 23. Continue this until you get
the upper limit of each class interval.
Check whether the last class interval covers the maximum value. If not, add another class.
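Steps 5.1 to 5.3 can be sketched as a small Python function. The function name `class_limits` is hypothetical. The minimum of 16 and the class size of 8 come from the example, while k = 5 is an assumed number of classes used only to produce output.

```python
def class_limits(minimum, c, k, uom=1):
    """Return k (LL, UL) pairs following Steps 5.1 to 5.3."""
    limits = []
    ll = minimum               # Step 5.1: LL of the first class is the minimum
    for _ in range(k):
        ul = ll + c - uom      # Step 5.3: UL = LL + c - 1 uom
        limits.append((ll, ul))
        ll = ll + c            # Step 5.2: next LL = previous LL + c
    return limits

# minimum = 16 and c = 8 come from the example; k = 5 is an assumed value.
for ll, ul in class_limits(minimum=16, c=8, k=5):
    print(f"{ll} - {ul}")
```

Under these assumptions the class intervals are 16 - 23, 24 - 31, 32 - 39, 40 - 47, and 48 - 55. If the maximum value were larger than 55, another class would be needed.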
Step 6. Count the frequency (f), or the number of observations that fall in each
class interval.
Step 7. Class Mark (CM). The class mark is the midpoint of a class interval, that is, 𝐶𝑀 = (𝐿𝐿 + 𝑈𝐿) ⁄ 2.
Step 8. True Class Boundary (TCB). The TCBs reflect the continuous property of the data. We have
upper and lower TCB.
• 𝑳𝑻𝑪𝑩 = 𝑳𝑳 − 𝟎.𝟓(𝟏 𝐮𝐨𝐦)
• 𝑼𝑻𝑪𝑩 = 𝑼𝑳 + 𝟎.𝟓(𝟏 𝐮𝐨𝐦)
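The class mark and the true class boundaries can be computed with two short helper functions. The function names are hypothetical, and the first class interval 16 - 23 with 1 uom = 1 is taken from the running example. The class mark is computed as the midpoint (LL + UL) / 2.

```python
def class_mark(ll, ul):
    """Midpoint of a class interval."""
    return (ll + ul) / 2

def true_class_boundaries(ll, ul, uom=1):
    """Return (LTCB, UTCB) for a class interval."""
    return ll - 0.5 * uom, ul + 0.5 * uom

print(class_mark(16, 23))             # class mark of the first class
print(true_class_boundaries(16, 23))  # (LTCB, UTCB) of the first class
```

For the first class this gives a class mark of 19.5 and true class boundaries of 15.5 and 23.5. The UTCB of one class equals the LTCB of the next, so the boundaries leave no gaps between classes.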
Step 9. Relative Frequency (RF). The relative frequency of a class is the ratio of
the class frequency to the total number of observations, 𝑛, and is expressed as a
percentage: 𝑅𝐹 = (𝑓 ⁄ 𝑛) × 100%.
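As a brief sketch, relative frequencies can be computed directly from the list of class frequencies. The frequencies below are borrowed from Table 3 purely as convenient numbers; the same computation applies to the classes of a grouped FDT.

```python
# Frequencies borrowed from Table 3 (used here only as convenient numbers).
frequencies = [7, 8, 11, 14, 8, 2]
n = sum(frequencies)  # total number of observations

# Relative frequency of each class, expressed as a percentage.
rf = [100 * f / n for f in frequencies]
print(rf)
print(sum(rf))
```

The relative frequencies come out as 14, 16, 22, 28, 16, and 4 percent, and they add up to 100 percent.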
2. Frequency Polygon
- A special type of line graph that plots the class frequencies at the midpoints of the
classes and connects the plotted points by means of straight lines.
- Place the class frequencies on the 𝑦 − axis and the class marks on the 𝑥 − axis.
- The term “polygon” implies a closed shape with several sides. Thus, we need to
close our frequency polygon. To close the frequency polygon, add an additional
class mark at both ends of the classes.
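Closing the polygon can be sketched as follows. The class marks and frequencies are hypothetical, with class size c = 8 assumed. The two added class marks lie one class size before the first and one class size after the last class mark, and each is given a frequency of 0 so that the polygon touches the 𝑥-axis at both ends.

```python
# Hypothetical class marks and frequencies (illustrative only); class size c = 8.
class_marks = [19.5, 27.5, 35.5, 43.5, 51.5]
freqs = [3, 7, 12, 6, 2]
c = 8

# Add one class mark before the first class and one after the last,
# each with frequency 0, so that the polygon is closed.
x = [class_marks[0] - c] + class_marks + [class_marks[-1] + c]
y = [0] + freqs + [0]

points = list(zip(x, y))  # (class mark, frequency) pairs to plot and connect
print(points)
```

Plotting these points in order and joining them with straight lines gives the closed frequency polygon.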
3. Ogive
- Plot of the cumulative frequency distribution
- Use this to determine the number of observations below or above a particular
class boundary.
- The less than ogive, or < ogive, is the plot of < CF against the UTCB.
- The greater than ogive, or > ogive, is the plot of > CF against the LTCB.
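The two cumulative frequency columns plotted in the ogives can be sketched with `itertools.accumulate`. The class frequencies below are hypothetical and used only for illustration.

```python
from itertools import accumulate

# Hypothetical class frequencies for five classes (illustrative only).
freqs = [3, 7, 12, 6, 2]

# < CF: running total from the first class down; plotted against each UTCB.
less_than_cf = list(accumulate(freqs))

# > CF: running total from the last class up; plotted against each LTCB.
greater_than_cf = list(accumulate(reversed(freqs)))[::-1]

print(less_than_cf)
print(greater_than_cf)
```

For these frequencies the < CF column is 3, 10, 22, 28, 30 and the > CF column is 30, 27, 20, 8, 2. The last < CF and the first > CF both equal the total number of observations.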
Illustration
The following data represent the height of trees in meters, measured to the
nearest tenth, of a sample of 50 trees in a certain region. Set up a (complete)
frequency distribution.
Solution:
Make an Array
Determine the upper limits of each class interval. Add an additional class, if
necessary.
Since the data are measured to the nearest tenth, 1uom = 0.1.
𝑈𝐿 = 𝐿𝐿 + 𝑐 − 1uom → 𝑈𝐿1 = 𝐿𝐿1 + 𝑐 − 0.1 = 4.3 + 0.8 − 0.1 = 5.0
Count the frequency for each class.
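The upper-limit computation for data measured to the nearest tenth can be sketched in Python. The values LL1 = 4.3, c = 0.8, and 1 uom = 0.1 come from the illustration, while k = 3 is an assumed number of classes used only to keep the output short. `Decimal` is used instead of ordinary floating-point numbers so that the tenths are represented exactly.

```python
from decimal import Decimal

# LL1 = 4.3, c = 0.8, and 1 uom = 0.1 come from the illustration.
# k = 3 is an assumed number of classes, used only to keep the output short.
ll1, c, uom = Decimal("4.3"), Decimal("0.8"), Decimal("0.1")
k = 3

intervals = []
ll = ll1
for _ in range(k):
    ul = ll + c - uom  # UL = LL + c - 1 uom
    intervals.append((ll, ul))
    ll += c            # next lower limit = previous lower limit + c

for lower, upper in intervals:
    print(f"{lower} - {upper}")
```

The first three class intervals under these assumptions are 4.3 - 5.0, 5.1 - 5.8, and 5.9 - 6.6.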