Statistical Reviewer Midterm 1


WEEK-1-INTRODUCTION-AND-CONCEPTS-OF-STATISTICS

Statistics is a branch of mathematics that deals with the collection, organization, presentation,
analysis, and interpretation of data. The word “statistics” is derived from the Latin word “status,” which means
state.

In addition, statistics is about providing a measure of confidence in any conclusions.


1. The first part states that statistics involves the collection of information.
2. The second refers to the organization and summarization of information.
3. The third states that the information is analyzed to draw conclusions or answer
specific questions.
4. The fourth part states that results should be reported using some measure that
represents how convinced we are that our conclusions reflect reality.

➢ Statistics is important because it enables people to make decisions based on empirical
evidence.

➢ Statistics provides us with tools needed to convert massive data into pertinent
information that can be used in decision making.
➢ Statistics can provide us information that we can use to make sensible decisions.

What information is referred to in the definition?

The information referred to in the definition is the data.

DATA

Data are “factual information used as a basis for reasoning, discussion, or calculation”.
Data can be numerical, as in height, or nonnumerical, as in gender. In either case, data describe
characteristics of an individual.

Field of Statistics

A. Mathematical Statistics

The study and development of statistical theory and methods in the abstract.

B. Applied Statistics
The application of statistical methods to solve real problems involving randomly
generated data and the development of new statistical methodology motivated by real
problems.

Example branches of Applied Statistics:


Psychometrics, econometrics, and biostatistics.

Limitation of Statistics

1. Statistics is not suitable for the study of qualitative phenomena.

2. Statistics does not study individuals.

3. Statistical laws are not exact.

4. Statistical tables may be misused.

5. Statistics is only one of the methods of studying a problem.

DEFINITIONS
Universe is the set of all entities under study.

Population is the total or entire group of individuals or observations from which information is
desired by a researcher.
Apart from persons, a population may consist of mosquitoes, villages, institutions, etc.

An individual is a person or object that is a member of the population being studied.


Parameter is a numerical summary of a population.
Statistic is a numerical summary of a sample.

Sample is a subset of the population.

Classification of Statistics

➢ Descriptive statistics consist of organizing and summarizing data.


Descriptive statistics describe data through numerical summaries, tables, and graphs.

➢ Inferential statistics uses methods that take a result from a sample, extend it to the
population, and measure the reliability of the result.

EXAMPLE
You are walking down the street and notice that a person walking in front of you drops PHP100.
Nobody seems to notice the PHP100 except you. Since you could keep the money without anyone
knowing, would you keep the money or return it to the owner?

1. Present the scenario to 50 students and use the results to make a statement about all the
students at the school.
2. Collect the information needed to answer the questions.

Suppose 39 of the 50 students stated that they would return the money to the owner.
We could present this result by saying that the percent of students in the survey who
would return the money to the owner is 78%. This is an example of descriptive statistics.

3. If we extend the results of our sample to the population, we are performing inferential
statistics. The generalization contains uncertainty because a sample cannot tell us
everything about a population. Therefore, inferential statistics includes a level of
confidence in the results. So rather than saying that 78% of all students would return the
money, we might say that we are 95% confident that between 74% and 82% of all
students would return the money. Notice how this inferential statement includes a level
of confidence (measure of reliability) in our results.

PROCESS OF STATISTICS
1. Identify the research objective.

A researcher must determine the question(s) he or she wants answered. The
question(s) must clearly identify the population that is to be studied.
Example: A research objective is presented. For each research objective, identify
the population and sample in the study.

1. The Philippine Mental Health Association contacts 1,028 teenagers who
are 13 to 17 years of age and live in Antipolo City and asks whether or
not they have been prescribed medications for any mental disorders, such
as depression or anxiety.

2. A farmer wanted to learn about the weight of his soybean crop. He
randomly sampled 100 plants and weighed the soybeans on each plant.

2. Organize and summarize the information. Descriptive statistics allow the researcher to
obtain an overview of the data and can help determine the type of statistical methods the
researcher should use.

3. Draw conclusions from the information. In this step, the information collected from the
sample is generalized to the population.

Example: For each of the following statements, decide whether it belongs to the field of descriptive
statistics or inferential statistics.
1. A badminton player wants to know his average score for the past 10 games.
(Descriptive Statistics)
2. A car manufacturer wishes to estimate the average lifetime of batteries by testing a
sample of 50 batteries. (Inferential Statistics)
3. Janine wants to determine the variability of her six exam scores in Algebra.
(Descriptive Statistics)
4. A shipping company wishes to estimate the number of passengers traveling via their ships
next year using their data on the number of passengers in the past three years.
(Inferential Statistics)
5. A politician wants to determine the total number of votes his rival obtained in the past
election based on his copies of the tally sheet of electoral returns. (Descriptive Statistics)

SOURCES OF DATA
After determining the number of samples needed, the next step is to choose how
you are going to collect the data.
Data can be collected through observation, experimentation, or conducting censuses or surveys.
These data are called primary data.

Data obtained from those already published by the government, industries, or individual sources
are called secondary data.

Data sets can consist of two types of data: qualitative data and quantitative data.

Qualitative data: Consists of attributes, labels, or nonnumerical entries.

Quantitative data: Consists of numerical measurements or counts.

Example: Determine whether the following variables are qualitative or quantitative.

Hair color

Temperature
Stages of breast cancer

Number of hamburgers sold


Number of children
Zip code

Types of Quantitative Data

1. If you count to get the value of a quantitative variable, it is discrete.

2. A continuous variable is a quantitative variable that has an infinite number of possible
values that are not countable. If you measure to get the value of a quantitative variable,
it is continuous.

Example: Determine whether the following quantitative variables are discrete or continuous.

1. The number of heads obtained after flipping a coin five times.

2. The number of cars that arrive at a McDonald’s drive-through between 12:00 P.M. and
1:00 P.M.

3. The distance a 2005 Toyota Prius can travel in city conditions with a full tank of gas.

4. Number of words correctly spelled.

5. Time of a runner to finish one lap.

WEEK 2: GRAPHICAL PRESENTATION OF DATA

GRAPHS AND DIAGRAMS

GRAPHS

INTRODUCTION TO THE STATISTICAL CONCEPTS

Levels of Measurement

It is important to know which type of scale is represented by your data since different statistics
are appropriate for different scales of measurement. A characteristic may be measured using
nominal, ordinal, interval, and ratio scales.

1. Nominal Level - They are sometimes called categorical scales or categorical data. Such a
scale classifies persons or objects into two or more categories. Whatever the basis for
classification, a person can only be in one category, and members of a given category
have a common set of characteristics.
Example: - Method of payment (cash, check, debit card, credit card)
- Type of school (public vs. private)
- Eye Color (Blue, Green, Brown)

2. Ordinal Level - This involves data that may be arranged in some order, but differences
between data values either cannot be determined or are meaningless. An ordinal scale not
only classifies subjects but also ranks them in terms of the degree to which they possess
a characteristic of interest. In other words, an ordinal scale puts the subjects in order from
highest to lowest, from most to least. Although ordinal scales indicate that some subjects
are higher, or lower than others, they do not indicate how much higher or how much
better.
Examples: - Food preferences
- Stage of disease
- Socioeconomic class (upper, middle, lower)
- Severity of pain

3. Interval Level - This measurement level not only classifies and orders the
measurements, but it also specifies that the distances between each interval on the scale
are equivalent along the scale from low interval to high interval. A value of zero does not
mean the absence of the quantity. Arithmetic operations such as addition and subtraction
can be performed on values of the variable.

Example: - Temperature on Fahrenheit / Celsius Thermometer
- Trait anxiety (e.g., high anxious vs. low anxious)
- IQ (e.g., high IQ vs. average IQ vs. low IQ)

4. Ratio Level - A ratio scale represents the highest, most precise, level of measurement. It
has the properties of the interval level of measurement and the ratios of the values of the
variable have meaning. A value of zero means the absence of the quantity. Arithmetic
operations such as multiplication and division can be performed on the values of the
variable.

Example: - Height and weight
- Time
- Time until death

[Table: Operations that make sense for variables of different scales.]

Both interval and ratio data involve measurement. Most data analysis techniques that apply to
ratio data also apply to interval data. Therefore, in most practical aspects, these types of data
(interval and ratio) are grouped under metric data. In some other instances, these types of data
are also known as numerical discrete and numerical continuous.
Example: Categorize each of the following as nominal, ordinal, interval or ratio measurement.

1. Ranking of college athletic teams.


2. Employee number.

3. Number of vehicles registered.
4. Brands of soft drinks.
5. Number of car passers along C5 on a given day.
6. Zip code
7. Degree of pain

DATA COLLECTION AND BASIC CONCEPTS IN SAMPLING DESIGN

Data Collection

Everybody collects, interprets, and uses information, much of it in numerical or
statistical form, in day-to-day life. People commonly receive large quantities of
information every day through conversations, television, computers, radio, newspapers,
posters, notices, and instructions.

Data collection is the process of gathering and measuring information on variables of interest, in
an established systematic fashion that enables one to answer stated research questions, test
hypotheses, and evaluate outcomes. Without proper planning for data collection, a number of
problems can occur. If the data collection steps and processes are not properly planned, the
research project can ultimately end up with a data set that does not serve the purpose for which
it was intended.

STEPS IN GATHERING DATA

1. Set the objectives for collecting data


2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data
collection points.
4. Design data gathering forms to be used.
5. Collect data

Methods of Collecting Data

The primary data can be collected by the following five methods:


1. Direct personal interviews - The researcher has direct contact with the interviewee. The
researcher gathers information by asking the interviewee questions.

2. Indirect/Questionnaire Method - This method of data collection involves sourcing and
accessing existing data that were originally collected for the purpose of the study.
Designing good “questioning tools” forms an important and time-consuming phase in the
development of most research proposals. Once the decision has been made to use these
techniques, the following questions should be considered before designing the tools.

3. Observation Method - In this method, the researcher observes the behavior of the
participants.

The investigator is the person gathering the data, while the subject is the person being
observed.
- It makes use of the human senses.

4. Experimentation Method - It is used to determine the cause-and-effect relationship of
certain phenomena under controlled conditions.

5. Registration Method - In this method, information can be gathered from documents or
reports which are released by the government or any institution.

Key Design Principles of a Good Questionnaire

1. Keep the questionnaire as short as possible.


2. Decide on the type of questionnaire (Open Ended or Closed Ended).

3. Write the questions properly.


4. Order the questions appropriately.
5. Avoid questions that prompt or motivate the respondent to say what you would like to
hear.

6. Write an introductory letter or an introduction.

Sampling Techniques
When conducting research, it is often impractical to collect data from every individual in a
population. So, instead of taking data from a whole population, a small group that will represent
the population is used. Getting only a sample from the population can lessen the expenses and
increase the speed of data gathering. Various sampling techniques can be used in determining
the samples.

Sampling is the process of selecting a group from the population from which data will be
collected. For example, if you would like to get the opinion of the students in your school
regarding the “Education in the New Normal”, then you could survey a sample of 100 students.

Types of Sampling Techniques

A. Probability Sampling. It involves random selection, wherein every member of the
population has an equal chance of being chosen.

B. Non-probability Sampling. It does not involve random selection, wherein members of the
population may not have an equal chance of being selected or they may not have a chance
at all. It allows you to collect data easily.

Types of Probability Sampling

1. Simple Random Sampling

➢ It is sometimes called the lottery or fishbowl method.

➢ A sample may be selected with or without replacement.

➢ Every member of the population has an equal chance of being selected.

Example: Mr. Aquino announces that he has a graded recitation in class today. He has written each of
his students’ names on a chip and placed the chips inside a box. Without looking, he picks a chip
from the box, calls on the student, and starts his graded recitation.
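
The fishbowl draw in this example can be sketched in Python with the standard `random` module. The student names and the sample size of three are hypothetical, chosen only for illustration, and the fixed seed is there just to make the draw repeatable.

```python
import random

# Hypothetical class list; the names are placeholders, not from the notes.
students = ["Ana", "Ben", "Carla", "Dino", "Ella", "Franz"]

rng = random.Random(2024)  # fixed seed so the illustration is repeatable

# One chip drawn from the box: every student has the same chance.
called = rng.choice(students)

# Three chips drawn WITHOUT replacement: no student can be called twice.
without_replacement = rng.sample(students, k=3)

# Three chips drawn WITH replacement: a student may be called again.
with_replacement = rng.choices(students, k=3)

print(called)
print(without_replacement)
print(with_replacement)
```

Sampling without replacement is the usual choice when each individual should appear in the sample at most once.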

2. Systematic Sampling
➢ Every member of the population is listed in order and individuals are
chosen at regular intervals.
➢ The sampling starts by choosing the first element from the list and then
selecting every nth element and so on.

Example: The principal wanted to survey the feedback of the self-learning modules. The parents
were asked to fall in line outside the gate. The principal then started the survey with the first
parent who entered the school and selected every 5th person in the line.
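
A minimal sketch of the principal's procedure, assuming a hypothetical line of 23 parents. The interval of 5 comes from the example; the size of the line is an assumption.

```python
# Hypothetical queue of 23 parents, numbered in the order they lined up.
parents = [f"Parent {i}" for i in range(1, 24)]

interval = 5  # take every 5th person, as in the example

# Start with the first parent in line, then step through the list
# at a regular interval.
selected = parents[0::interval]

print(selected)
# ['Parent 1', 'Parent 6', 'Parent 11', 'Parent 16', 'Parent 21']
```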

3. Stratified Sampling
➢ It involves dividing the population into subpopulations and then
randomly selecting samples from each subgroup.

Example: A researcher would like to know the effects of the COVID-19
vaccine on health workers who are not COVID patients in a certain hospital. He then
divides the health workers into subgroups (e.g., by gender or by age range) and then, using
random sampling, selects a number of individuals from each subgroup to be included
in the research.
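
The two steps of stratified sampling (divide into subgroups, then sample randomly within each) can be sketched as follows. The worker IDs, the age ranges, and the choice of two workers per subgroup are all hypothetical.

```python
import random

# Hypothetical health workers, each tagged with an age range (the stratum).
workers = [
    ("W01", "20-39"), ("W02", "20-39"), ("W03", "20-39"), ("W04", "20-39"),
    ("W05", "40-59"), ("W06", "40-59"), ("W07", "40-59"),
    ("W08", "60+"), ("W09", "60+"),
]

# Step 1: divide the population into subgroups (strata).
strata = {}
for worker_id, age_range in workers:
    strata.setdefault(age_range, []).append(worker_id)

rng = random.Random(7)  # fixed seed for a repeatable illustration
per_stratum = 2         # assumed number of workers taken from each stratum

# Step 2: randomly select members from EVERY subgroup.
sample = {group: rng.sample(members, k=per_stratum)
          for group, members in strata.items()}

print(sample)
```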

4. Cluster Sampling

➢ It involves dividing the population into subgroups and randomly selecting
entire subgroups instead of choosing individuals from each
subgroup.
Example: The branch managers of a certain fast-food chain want to survey the quality of the food
and the overall performance of their restaurants located in Paranaque City. They randomly select
5 restaurants in the area and gather data from all of their customers who are eating there.
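
For contrast with stratified sampling, here is a small sketch of cluster sampling. The branch names, the customer labels, and the choice of two branches out of six are hypothetical (the example above selects five restaurants).

```python
import random

# Hypothetical branches (clusters) and the customers eating in each one.
branches = {
    "Branch A": ["a1", "a2", "a3"],
    "Branch B": ["b1", "b2"],
    "Branch C": ["c1", "c2", "c3", "c4"],
    "Branch D": ["d1"],
    "Branch E": ["e1", "e2"],
    "Branch F": ["f1", "f2", "f3"],
}

rng = random.Random(11)  # fixed seed for a repeatable illustration

# Randomly pick whole clusters, not individual customers.
chosen = rng.sample(sorted(branches), k=2)

# Every customer in a chosen branch is surveyed.
respondents = [customer for name in chosen for customer in branches[name]]

print(chosen)
print(respondents)
```

In stratified sampling every subgroup contributes some members; in cluster sampling only the selected subgroups contribute, but they contribute all of their members.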

Types of Non-Probability Sampling

1. Convenience Sampling

➢ It is the easiest method of sampling and is also known as
accidental sampling.

➢ A sample may be selected because of convenience, or because its members are
readily available and willing to participate in the survey.

Example: A tourist, who is visiting the Philippines for the first time, wants to know the best tourist
spot in the country. He decides to ask the 2 passengers seated beside him.

2. Purposive Sampling
➢ This is also known as judgment sampling.
➢ In this method, the researcher selects a sample that is useful for
the study.
Example: A company of a certain adult milk drink wants to do market research of their product.
A lady with a clipboard is assigned to a grocery store to do the job. She starts by looking for
individuals whom she thinks may meet the criteria for their needed samples. After the
verification, she then asks these persons if they can be interviewed and participate in the survey.

3. Quota Sampling
➢ It is a non-probability sampling technique similar to stratified sampling.
In this method, the population is split into segments (strata) and you
have to fill a quota based on people who match the characteristics of
each stratum.

• Proportional quota sampling gives proportional numbers that represent segments in the
wider population. For this, the population frame must be known.

• Non-proportional quota sampling uses strata to divide a population, though only the
minimum sample size per stratum is decided.

4. Snowball Sampling
➢ It is a non-probability sampling type that mimics a pyramid system in its selection
pattern. You choose early sample participants, who then go on to recruit further
sample participants until the sample size has been reached. This ongoing pattern
can be perfectly described by a snowball rolling downhill: increasing in size as it
collects more snow (in this case, participants).

GRAPHICAL SUMMARIES AND FREQUENCY DISTRIBUTION OF DATA

Frequency Distribution. The frequency of a category is the number of times it occurs in the data set.

• A frequency distribution is a table that presents the frequency for each category.

The relative frequency of a category is the frequency of the category divided by the sum of all frequencies.

• A relative frequency distribution is a frequency distribution that includes the relative
frequencies.

Bar Graph

A bar graph is a graphical representation of a frequency distribution.
A bar graph consists of rectangles of equal width, with one rectangle for each category. The
heights of the rectangles represent the frequencies or relative frequencies of the categories.
The default bar graph is a vertical bar graph.
• A Pareto chart is a bar graph in which the categories are presented in order of frequency
or relative frequency, with the largest frequency or relative frequency on the left and the
smallest one on the right.

• A horizontal bar graph is a bar graph in which the bars are horizontal. It is more convenient
to use when the categories have long names.

• Side-by-side bar graphs are used when we want to compare two bar graphs that
have the same categories. Both bar graphs are on the same axes, putting the bars that
correspond to the same category next to each other.

Pie Chart
• A pie chart is an alternative to the bar graph for displaying
relative frequency information. A pie chart is a circle. The
circle is divided into sectors, one for each category. The
relative sizes of the sectors match the
relative frequencies of the categories.

Frequency Distribution
The frequency of a category is the number of times it occurs in the data set.

• A frequency distribution is a table that presents the frequency for each category.

Credit Cards     Frequency
Discoverer        7
Visa             23
Am. Express       9
Master Card      11
Total            50

• Classes – intervals of equal width that cover all the values that are observed.

• Lower class limit – smallest value in that class
• Upper class limit – largest value in that class
• Class width – difference between consecutive lower-class limits

• Not the same as difference between lower and upper limit of a class
• Should be expressed with the same number of decimal places as the data
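
As a quick check on the definitions of frequency and relative frequency, the credit card counts from the table above can be turned into a relative frequency distribution. This is only a sketch; the variable names are illustrative.

```python
# Frequencies taken from the credit card table in these notes.
frequencies = {"Discoverer": 7, "Visa": 23, "Am. Express": 9, "Master Card": 11}

total = sum(frequencies.values())  # sum of all frequencies (50)

# Relative frequency = frequency of the category / sum of all frequencies.
relative = {card: count / total for card, count in frequencies.items()}

for card, count in frequencies.items():
    print(f"{card:<12} {count:>3}  {relative[card]:.2f}")
```

The relative frequencies (0.14, 0.46, 0.18, 0.22) add up to 1, which is also what the sector sizes of a pie chart represent.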

Frequency Distribution

However, for quantitative data, there are no natural categories.

Q: How do we create a frequency distribution for quantitative data?


A: We divide the data into classes.

• Classes – intervals of equal width that cover all the values that are observed.
• We then count the number of observations that fall into each class to
obtain the class frequencies.

• The ideal number of classes is at least 5 but not more than 20

Frequency Distribution
How to make frequency distribution table for quantitative data:
1. List classes

• Note the smallest and largest value
• Choose class width
• Choose lower class limit for the first class; it must be a convenient
number slightly less than the smallest value

• Compute the lower-class limit for the second class by adding the class width to the
lower-class limit of the first class

• Continue creating classes until you reach the largest value

2. Count the number of observations for each class


• Can use the COUNTIFS or FREQUENCY functions in Excel
• Similar to frequency distribution tables for qualitative data aside from the need to
make your own classes
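
The two steps above (list the classes, then count) can be sketched in Python instead of Excel. The scores, the class width of 10, and the starting limit of 50 are hypothetical choices made for illustration.

```python
# Hypothetical whole-number exam scores; not from the notes.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 80, 84, 88, 93]

class_width = 10  # chosen class width
first_lower = 50  # convenient number slightly less than the smallest value (52)

# Step 1: keep adding the class width to get each new lower class limit
# until the largest value is covered.
lower_limits = []
lower = first_lower
while lower <= max(scores):
    lower_limits.append(lower)
    lower += class_width

# Step 2: count the observations that fall into each class.
table = []
for lower in lower_limits:
    upper = lower + class_width - 1  # upper class limit for whole-number data
    count = sum(lower <= s <= upper for s in scores)
    table.append((lower, upper, count))

for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

With these choices the table has five classes (50-59 through 90-99) with frequencies 2, 4, 5, 3, and 1.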


Histogram

A histogram is a graphical representation of a frequency distribution. Histograms based on
frequency distributions are called frequency histograms, and histograms based on relative
frequency distributions are called relative frequency histograms. Histograms are related to bar
graphs and are appropriate for quantitative data.

Histograms give a visual impression of the “shape” of a data set.

• Skewed – if one side, or tail, is longer than the other

• Bell-shaped – if it has a peak in the middle

• Uniformly distributed – if classes have relatively the
same frequencies

A peak, or high point, of a histogram is referred to as a mode.

• Unimodal – one mode

• Bimodal – two modes


A histogram can have more than two modes, but this does not happen often.

Stem-and-Leaf Plot

➢ Stem-and-leaf plots are a simple way to
display small data sets.

Dot Plot
• Dot plots are graphs that can be used to give a rough
impression of the shape of a data set. They are useful
when the data set is not too large, and when there
are some repeated values.

Time-Series Plot
Time-series plots may be used when the data consist
of values of a variable measured at different points
in time.

MEASURES OF CENTRAL TENDENCY OR MEASURES OF LOCATION
OR MEASURES OF AVERAGES

Descriptive Statistics
The goal of descriptive statistics is to summarize a collection of data in a clear and
understandable way.

Central Tendency

Measure of Central Tendency:


◦ A single summary score that best describes the central location of an entire distribution
of scores.

◦ The typical score.


◦ The center of the distribution.

One distribution can have multiple locations where scores cluster.

◦ Must decide which measure is best for a given situation.

Measures of Central Tendency:


◦ Mean

The sum of all scores divided by the number of scores.


◦ Median

The value that divides the distribution in half when observations are ordered.

◦ Mode
The most frequent score.
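
The three measures can be computed with Python's standard `statistics` module. The scores below are hypothetical and only serve to show the definitions at work.

```python
import statistics

# Hypothetical scores, already in order.
scores = [3, 5, 5, 6, 8, 9]

mean = statistics.mean(scores)      # sum of scores / number of scores = 36 / 6
median = statistics.median(scores)  # middle of the ordered data: (5 + 6) / 2
mode = statistics.mode(scores)      # most frequent score

print(mean, median, mode)
```

Here the mean is 6, the median is 5.5, and the mode is 5, so the three measures need not agree even for a small data set.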

Arithmetic Mean (Mean)


Definition:

The sum of all the observations divided by the number of observations.

The arithmetic mean is the most common measure of the
central location of a sample.

Mean
Is the balance point of a distribution.

The sum of the negative deviations from the mean exactly equals, in absolute value,
the sum of the positive deviations from the mean.
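
The balance-point property follows directly from the definition of the mean. As a short derivation, writing the observations as x_1, ..., x_n and the mean as x̄ = (1/n) Σ x_i (notation assumed here for illustration):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Since all deviations sum to zero, the positive deviations and the negative deviations must cancel each other exactly.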

Some Important Properties of the Mean

Interval-Ratio Level of Measurement

Center of Gravity (the mean balances all the scores).


Sensitivity to Extremes

Median
Definition:

The value that is larger than half the population and smaller than half the population.

Pros and Cons of Median

Pros

◦ Not influenced by extreme scores or skewed distributions.


◦ Good with ordinal data.
◦ Easier to compute than the mean.

Cons
◦ May not exist in the data.

◦ Doesn’t take actual values into account.

Pros and Cons of the Mode
Pros

◦ Good for nominal data.


◦ Easiest to compute and understand.

◦ The score comes from the data set.


Cons

◦ Ignores most of the information in a distribution.

◦ Small samples may not have a mode.

Comparison of Mean and Median


• Mean is sensitive to a few very large (or small) values (“outliers”), so sometimes the mean does
not reflect the quantity desired.

• Median is “resistant” to outliers.

• Mean is attractive mathematically.

• 50% of the sample is above the median, and 50% of the sample is
below the median.

Suppose the next patient enrolls and their age is 97 years. How do the mean and median
change?

To get the median, order the data:

21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97

If the age were recorded incorrectly as 977 instead of 97,

What would the new median be?


What would the new mean be?
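
These questions can be checked with a short sketch. The first ten ages are an assumption: they are inferred from the ordered list above by leaving out the new patient's age of 97.

```python
import statistics

# Assumed original ages: the ordered list above without the new patient (97).
ages = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64]

with_97 = ages + [97]    # the new patient enrolls
with_977 = ages + [977]  # the same age recorded incorrectly

for label, data in [("original", ages), ("with 97", with_97), ("with 977", with_977)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label:<9} mean = {mean:.1f}  median = {median}")
```

Under that assumption, the mean moves from 41.7 to 46.7 with the new patient and jumps to 126.7 with the recording error, while the median moves from 43 to 44 and then stays at 44. This is the sensitivity of the mean and the resistance of the median described above.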

DESCRIPTIVE STATISTICS MEASURES OF CENTRAL TENDENCY GROUPED DATA

Frequency Distribution Table

Class limits
- The smallest and the largest values that fall within the class interval (class)
- Taken with equal number of significant figures as the given data.

Class boundaries (true class limits)

- More precise expression of the class interval
- It is usually one significant digit more than the class limit.
- Acquired as the midpoint of the upper limit of the lower class and the
lower limit of the upper class

Frequency
- The number of observations falling within a particular class.

- Counting and tallying

Class width (class size)

- Numerical difference between the upper and lower class boundaries of a class
interval.

Class mark (class midpoint)


- Middle element of the class
- It represents the entire class and it is usually symbolized by x.

Cumulative Frequency Distribution

- can be derived from the frequency distribution and can be also obtained by simply
adding the class frequencies

- Partial sums

Relative Frequency
- Percentage frequency of the class with respect to the total population

- For presenting pie charts

Relative Frequency (%rf) Distribution


- The proportion, in percent, of the frequency of each class to the total frequency

- Obtained by dividing the class frequency by the total frequency, and multiplying the
answer by 100
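
The columns of a frequency distribution table defined above can be derived from the class limits and frequencies alone. The sketch below uses hypothetical classes and assumes whole-number data, so neighboring class limits are one unit apart.

```python
# Hypothetical classes as (lower limit, upper limit, frequency).
classes = [(50, 59, 2), (60, 69, 4), (70, 79, 5), (80, 89, 3), (90, 99, 1)]

total = sum(freq for _, _, freq in classes)
unit = 1  # assumed gap between the upper limit of one class and the next lower limit

rows = []
cumulative = 0
for lower, upper, freq in classes:
    lower_boundary = lower - unit / 2        # e.g. midpoint of 49 and 50
    upper_boundary = upper + unit / 2        # e.g. midpoint of 59 and 60
    width = upper_boundary - lower_boundary  # class width (class size)
    mark = (lower + upper) / 2               # class mark (midpoint)
    cumulative += freq                       # running partial sum
    percent_rf = freq / total * 100          # relative frequency in percent
    rows.append((lower_boundary, upper_boundary, width, mark, cumulative, percent_rf))

for row in rows:
    print(row)
```

For the first class this gives boundaries 49.5 and 59.5, a class width of 10, and a class mark of 54.5; the last cumulative frequency equals the total number of observations.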

Steps in constructing a frequency distribution table (FDT)

1. Get the lowest and the highest value in the distribution. We shall mark the highest and
lowest value in the distribution.

2. Get the value of the range. The range, denoted by R, refers to the difference between the
highest and the lowest value in the distribution. Thus, R = H - L.

3. Determine the number of classes. In the determination of the number of classes, it should
be noted that there is no standard method to follow. Generally, the number of
classes must not be less than 5 and should not be more than 15.
4. Determine the size of the class interval. The value of C can be obtained by dividing the
range by the desired number of classes. Hence, C = R/k.

5. Construct the classes. In constructing the classes, we first determine the lower limit of the
distribution. The value of this lower limit can be chosen arbitrarily as long as the lowest
value falls in the first interval and the highest value falls in the last interval.
6. Determine the frequency of each class. The determination of the number of
frequencies is done by counting the number of items that shall fall in each interval.
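
The six steps can be traced on a small data set. The data and the choice of k = 5 classes are hypothetical, and rounding C up to a whole number is an assumption made here so that the last class still reaches the highest value.

```python
import math

# Hypothetical whole-number data; not from the notes.
data = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 80, 84, 88, 93]

# Steps 1-2: highest value, lowest value, and the range R = H - L.
highest, lowest = max(data), min(data)
R = highest - lowest

# Step 3: number of classes (chosen; between 5 and 15).
k = 5

# Step 4: size of the class interval, C = R / k, rounded up (assumption).
C = math.ceil(R / k)

# Step 5: construct the classes, starting from the lowest value.
classes = [(lowest + i * C, lowest + (i + 1) * C - 1) for i in range(k)]

# Step 6: count the items that fall in each interval.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

print(R, C)
print(list(zip(classes, frequencies)))
```

Here R = 93 - 52 = 41 and C = 41/5 = 8.2, rounded up to 9, which gives the classes 52-60, 61-69, 70-78, 79-87, and 88-96.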

MEASURES OF VARIABILITY (For Ungrouped Data)


Definition

Measures of Variability or Dispersion are measures of the average distance of each observation
from the center of the distribution. They measure homogeneity or heterogeneity of a particular
group.
While the measures of central tendency convey information about the commonalities of
measured properties, the measures of variability quantify the degree to which
they differ.
Measures of variability are lengths between various points within the distribution. The spread of
these data points tells you about variability.

A small measure of variability would indicate that the data are:

- clustered closely around the mean

- more homogeneous;

- less variable
- more consistent and;

- more uniformly distributed.

Range
- difference between the highest value and lowest value.

R = HV – LV

The Mean Absolute Deviation (MAD)


A more reliable measure of variability considers all the data in the given distribution. One of
them is MAD.
MAD - is the average of the absolute deviations of each observation from the
mean. The formula is

MAD = Σ|x - x̄| / n

where x is a value from the raw data, x̄ is the mean, and n is the number of observations.

Example 2. Find the absolute deviation of the male group in Example 1.

Example 3. Find the absolute deviation of the female group in Example 1.
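
The data for Example 1 do not appear in these notes, so the sketch below uses hypothetical grades to show how the range and the mean absolute deviation are computed from their definitions.

```python
# Hypothetical Math grades; the Example 1 data are not reproduced in the notes.
grades = [80, 85, 90, 75, 70]

# Range: highest value minus lowest value.
value_range = max(grades) - min(grades)

# Mean absolute deviation: average of |x - mean| over all observations.
mean = sum(grades) / len(grades)
absolute_deviations = [abs(x - mean) for x in grades]
mad = sum(absolute_deviations) / len(grades)

print(value_range, mean, mad)
```

For these grades the range is 20, the mean is 80, and the absolute deviations 0, 5, 10, 5, 10 average to a MAD of 6.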

Variance - it is the average of the squared deviation from the mean.

Formula:

where: σ² - sample variance
x - value from the raw data
x̄ - mean
n - number of samples

Example. Compute for the variance of the grades in Math of the two groups in example 1.
Male group:

Example. Compute for the variance of the grades in Math of the two groups in example 1.

Female group:

Standard Deviation – is the square root of the average squared deviation from the mean, or simply the square
root of the variance.

We see that the scores of the males are more spread out than those of
the females.
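
Variance and standard deviation can be computed with Python's `statistics` module. The grades are hypothetical, since the Example 1 data are not reproduced here. Because the variance formula itself is missing from these notes, both common conventions are shown: dividing by n (`pvariance`, `pstdev`) and dividing by n - 1 (`variance`, `stdev`).

```python
import math
import statistics

# Hypothetical Math grades; not the Example 1 data.
grades = [80, 85, 90, 75, 70]

# Average squared deviation from the mean, dividing by n.
population_variance = statistics.pvariance(grades)  # 250 / 5
population_sd = statistics.pstdev(grades)           # square root of the above

# Same sum of squared deviations, dividing by n - 1.
sample_variance = statistics.variance(grades)       # 250 / 4
sample_sd = statistics.stdev(grades)

print(population_variance, round(population_sd, 3))
print(sample_variance, round(sample_sd, 3))
```

The squared deviations here are 0, 25, 100, 25, and 100, which sum to 250. Whichever divisor is used, the standard deviation is the square root of the corresponding variance, and a larger value means the scores are more spread out.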
