STATISTICAL ANALYSIS FOR IE
STATISTICS- is the science of conducting studies to collect, organize, summarize, analyse and draw
conclusions from data.
Students study statistics for several reasons:
1. To be knowledgeable about the vocabulary, symbols, concepts and statistical procedures used in their
studies.
2. To be able to design experiments; collect, organize, analyse and summarize data; and possibly make
reliable predictions or forecasts for future use.
3. To use the knowledge gained from studying statistics to become better consumers and
citizens.
Variable- is a characteristic or attribute that can assume different values.
Data- are the values (measurements or observations) that the variables can assume.
Random Variables- variables whose values are determined by chance.
Data Set- a collection of data values forms a data set. Each value in the data set is called a
data value or datum.
Branches of Statistics
Descriptive Statistics- consists of the collection, organization, summarization and presentation of data.
Inferential Statistics- consists of generalizing from samples to populations, performing estimations and
hypothesis tests, determining relationships among variables, and making predictions.
Population- consists of all subjects (human or otherwise) that are being studied.
Sample- a group of subjects selected from a population.
Applying the concepts
Attendance and Grades
A study conducted at First Asia Institute of Technology and Humanities revealed that students who
attended class 95 to 100% of the time usually received an A in the class. Students who attended class
80 to 90% of the time usually received a B or C in the class. Students who attended class less than
80% of the time usually received a D or an F or eventually withdrew from the class. Based on this
information, attendance and grades are related. The more you attend class the more likely you will
receive a higher grade. If you improve your attendance, your grades will probably improve. Many factors
affect your grade in a course. One factor over which you have considerable control is attendance. You can
increase your opportunities for learning by attending class more often.
Questions:
1. What are the variables under study? The variables are grades and attendance.
2. What are the data in the study? The data consist of specific grades and attendance numbers.
3. Are descriptive, inferential, or both types of statistics used? These are descriptive statistics.
4. What is the population under study? The population under study is students at FAITH.
5. Was a sample collected? If so, from where? While not specified, we probably have data from a sample
of FAITH Students.
6. From the information given, comment on the relationship between the variables. Based on the data,
it appears that in general, the better your attendance the higher your grade.
Variables and Types of Data
Variables can be classified as qualitative or quantitative.
Qualitative Variables- variables that can be placed into distinct categories, according to some
characteristic or attribute. (Ex: gender, religious preference and geographic locations)
Quantitative Variables- variables that are numerical and can be ordered or ranked. (Ex: The variable age
is numerical and people can be ranked in order according to the value of their ages. Other examples are
heights, weights and body temperature.)
Types of Quantitative Variables
Discrete Variables- assume values that can be counted. (Ex: Number of children in a family. Number of
students in a classroom. Number of calls received by a switchboard operator each day for a month.)
Continuous Variable- can assume an infinite number of values between any two specific values. They
are obtained by measuring. They often include fractions and decimals.
Measurement Scales- a classification of how variables are categorized, counted and measured.
Four Types of Measurement Scales
1. Nominal Level of Measurement
Classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or
ranking can be imposed on the data.
Ex: Zip Code; Gender (male or female); Eye Color (brown, blue, hazel); Political and Religious
Affiliation; Major Field (Mathematics, Computer, etc); and Nationality
2. Ordinal Level of Measurement
Classifies data into categories that can be ranked; however, precise differences between the ranks do
not exist. Ex: Grade (A, B, C, D, F); Judging (first place, second place, etc.); Rating Scale (poor, good,
excellent); Ranking of tennis players.
3. Interval Level of Measurement
Ranks data and precise differences between units of measure do exist; however, there is no
meaningful zero. Ex: SAT score; IQ; Temperature
4. Ratio Level of Measurement
Possesses all the characteristics of interval measurement and there exists a true zero. In addition, true
ratios exist when the same variable is measured on two different members of the population. Ex:
Height; Weight; Time; Salary and Age
Data Collection and Sampling Techniques
Data Collection
Data can be collected in a variety of ways. One of the most common methods is through the use of
surveys. Surveys can be done by using a variety of methods. Three of the most common methods are
the telephone survey, the mailed questionnaire and the personal interview.
a. Telephone surveys have an advantage over personal interview surveys in that they are less costly.
Also, people may be more candid in their opinions since there is no face-to-face contact. A major
drawback to the telephone survey is that some people in the population will not have phones or will not
answer when the calls are made; hence not all people have a chance of being surveyed. Also, many
people now have unlisted numbers and cell phones, so they cannot be surveyed. Finally, even the tone
of voice of the interviewer might influence the response of the person who is being interviewed.
b. Mailed questionnaire surveys can be used to cover a wider geographic area than telephone surveys or
personal interviews since mailed questionnaire surveys are less expensive to conduct. Also,
respondents can remain anonymous if they desire. Disadvantages of mailed questionnaire surveys
include a low number of responses and inappropriate answers to questions. Another drawback is that
some people may have difficulty reading or understanding the questions.
c. Personal interview surveys have the advantage of obtaining in-depth responses to questions from the
person being interviewed. One disadvantage is that interviewers must be trained in asking questions
and recording responses, which makes the personal interview survey more costly than the other two
survey methods. Another disadvantage is that the interviewer may be biased in his or her selection of
respondents.
Four Basic Methods of Sampling
To obtain samples that are unbiased, statisticians use four basic methods of sampling:
1. Random Sampling - subjects are selected by using chance methods or random numbers. One such method is to
number each subject in the population and then draw the required number of subjects at random.
2. Systematic Sampling - is done by numbering each subject of the population and then selecting every kth
subject.
3. Stratified Sampling - is obtained by dividing the population into groups (called strata) according to
some characteristic that is important to the study, then sampling from each group. Samples within the
strata should be randomly selected.
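4. Cluster Sampling - the population is divided into groups (called clusters), for example by geographic
area; some of the clusters are then randomly selected and all members of the selected clusters are used.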
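The four sampling methods listed above can be sketched with Python's standard random module. This is only an illustration under stated assumptions: the population of 100 numbered subjects, the sample sizes, the four strata, and the five clusters are all made up for the example.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: subjects numbered 1 to 100
population = list(range(1, 101))

# 1. Random sampling: every subject has the same chance of being selected
random_sample = random.sample(population, 10)

# 2. Systematic sampling: choose a random start among the first k subjects,
#    then take every kth subject after it
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

# 3. Stratified sampling: divide into strata, then sample randomly within each
#    (assumed strata: four groups of 25 subjects, e.g. year levels)
strata = [population[i:i + 25] for i in range(0, 100, 25)]
stratified_sample = []
for stratum in strata:
    stratified_sample.extend(random.sample(stratum, 3))

# 4. Cluster sampling: divide into clusters, randomly choose whole clusters,
#    and use every subject in the chosen clusters
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [s for cluster in chosen_clusters for s in cluster]

print(random_sample)
print(systematic_sample)
print(stratified_sample)
print(cluster_sample)
```

Note the difference between the last two: stratified sampling takes a few subjects from every group, while cluster sampling takes every subject from a few groups.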
Observational study- the researcher merely observes what is happening or what has happened in the
past and tries to draw conclusions based on these observations.
Experimental study- the researcher manipulates one of the variables and tries to determine how the
manipulation influences other variables.
Independent Variable- in an experimental study, the variable that is being manipulated by the researcher.
This variable is also called the explanatory variable.
Dependent Variable- the variable that is studied to see if it has changed significantly due to the manipulation
of the independent variable. This variable is also called the outcome variable.
Confounding Variable- is the one that influences the dependent or outcome variable but cannot be
separated from the independent variable.
Uses and Misuses of Statistics
1. Statistical techniques can be used to describe data, compare two or more data sets, determine if a
relationship exists between variables, test hypotheses, and make estimates about population
characteristics.
2. Statistical techniques can be misused to sell products that don’t work properly, to attempt to prove
something true that is really not true, or to get our attention by using statistics to evoke fear, shock and
outrage.
Some Ways that Statistics can be Misrepresented
Suspect samples
Ambiguous averages
Changing the subject
Detached statistics
Implied connections
Misleading graphs
Faulty survey questions
A table does not have the same impact as presenting numbers in a well-drawn chart or graph.
Statistics: Gather Data, Analyze Data, Draw Conclusions, Summarize
Statistical Charts and Graphs:
• Histograms
• Frequency Polygons
• Ogives
• Pie graphs
• Pareto
• Time series graphs
• Stem and leaf plot
• Scatter plot
Frequency Distribution- the organization of raw data in table form, using classes and frequencies. It is
the most convenient method of organizing data.
Raw data- data collected in original form.
Frequency Distribution
Classes- Each raw data value is placed into a quantitative or qualitative category called a class.
Frequencies- The number of data values contained in a specific class.
Two Types of Frequency Distribution
1. Categorical Frequency Distribution- is used for data that can be placed in specific categories, such as
nominal or ordinal level data.
For example: data such as political affiliation, religious affiliation, or major field of study would use
categorical frequency distributions.
Step 1. Make a table with four columns: A (class), B (tally), C (frequency), and D (percent).
Step 2. Tally the data and place the results in column B
Step 3. Count the tallies and place the results in column C
Step 4. Find the percentage of values in each class by using the formula
% = (f / n) • 100
where f = frequency of the class and n = total number of values.
Percentages are not normally part of a frequency distribution, but they can be added since they are used in
certain types of graphs such as pie graphs.
Relative frequency – the decimal equivalent of a percent.
Step 5. Find the totals for columns C (frequency) and D (percent). The completed table is shown.
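The steps for a categorical frequency distribution can be sketched in Python as follows. The raw data (blood types of 25 people) are a made-up example, and the percentage is computed as (f / n) • 100, where f is the class frequency and n is the total number of values.

```python
from collections import Counter

# Hypothetical raw data: blood types of 25 people
raw_data = ["A", "B", "B", "AB", "O", "O", "O", "B", "AB", "B",
            "B", "B", "O", "A", "O", "A", "O", "O", "O", "AB",
            "AB", "A", "O", "B", "A"]

# Steps 2 and 3: tally the data and count the tallies for each class
frequency = Counter(raw_data)
n = sum(frequency.values())  # total number of values

# Step 4: percentage for each class, % = (f / n) * 100
table = []
for cls, f in sorted(frequency.items()):
    table.append((cls, f, f / n * 100))

# Step 5: print the table with totals for the frequency and percent columns
print(f"{'Class':<8}{'Freq':>6}{'Percent':>10}")
for cls, f, percent in table:
    print(f"{cls:<8}{f:>6}{percent:>10.1f}")
total_percent = sum(percent for _, _, percent in table)
print(f"{'Total':<8}{n:>6}{total_percent:>10.1f}")
```

Dividing each frequency by n without multiplying by 100 gives the relative frequency instead of the percent.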
2. Grouped Frequency Distribution- is used when the range of data is large; the data must be grouped into
classes that are more than one unit in width.
Rules to follow to construct a Grouped Frequency Distribution:
1. There should be between 5 and 20 classes.
2. It is preferable but not absolutely necessary that the class width be an odd number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be exhaustive.
6. The classes must be equal in width (an exception is an open-ended class, such as "54 and above" or
"below 110").
Step 1. Determine the classes
Step 2. Tally the data.
Step 3. Find the numerical frequencies from the tallies.
Step 4. Find the cumulative frequencies.
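A minimal sketch of these steps in Python is shown below. It assumes whole-number data and finds the class width by dividing the range by a chosen number of classes and rounding up; the data values and the choice of 7 classes are made up for illustration.

```python
import math

# Hypothetical raw data (whole numbers)
data = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
        110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
        107, 112, 114, 115, 118, 117, 118, 122, 106, 110]

def grouped_frequency(data, num_classes):
    """Return rows of (lower limit, upper limit, frequency, cumulative frequency)."""
    low, high = min(data), max(data)
    # Class width: range divided by the number of classes, rounded up
    width = max(1, math.ceil((high - low) / num_classes))

    # Step 1: determine the classes (equal width, no gaps, no overlap)
    classes = []
    lower = low
    while lower <= high:  # keep adding classes until the maximum is covered
        classes.append((lower, lower + width - 1))
        lower += width

    # Steps 2 to 4: tally, count, and accumulate the frequencies
    rows = []
    cumulative = 0
    for lo, hi in classes:
        f = sum(1 for x in data if lo <= x <= hi)
        cumulative += f
        rows.append((lo, hi, f, cumulative))
    return rows

rows = grouped_frequency(data, 7)
for lo, hi, f, cf in rows:
    print(f"{lo}-{hi}: frequency {f}, cumulative {cf}")
```

Because every value falls in exactly one class, the last cumulative frequency always equals the number of data values, which is a quick check on the tally.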
PROBABILITY AND COUNTING RULES
Sample Spaces and Probability
Probability experiment – is a chance process that leads to well-defined results called outcomes.
Outcome – is the result of a single trial of a probability experiment.
Example: When a coin is tossed, there are two possible outcomes: Head or Tail
In the roll of a single die, there are six possible outcomes: 1, 2, 3, 4, 5, and 6
Sample space – is the set of all possible outcomes of a probability experiment.
Tree Diagram - is a device consisting of line segments emanating from a starting point and also from the
outcome points. It is used to determine all possible outcomes of a probability experiment.
Event- an event consists of a set of outcomes of a probability experiment.
Three Basic Interpretations of Probability
1. Classical Probability
2. Empirical Probability or Relative Frequency Probability
3. Subjective Probability
Classical Probability
It uses sample spaces to determine the numerical probability that an event will happen.
It is so named because it was the first type of probability studied formally by mathematicians in
the 17th and 18th centuries.
It assumes that all outcomes in the sample space are equally likely to occur.
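Under the equally likely assumption, classical probability is usually computed as P (E) = n(E) / n(S), the number of outcomes in the event divided by the number of outcomes in the sample space. The sketch below lists the sample space for rolling two dice and applies that formula; the event "sum of 7" and the helper name classical_probability are assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice: all ordered pairs (first die, second die)
sample_space = list(product(range(1, 7), repeat=2))

def classical_probability(event, sample_space):
    """P(E) = n(E) / n(S), assuming every outcome is equally likely."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

def sum_is_seven(outcome):
    return sum(outcome) == 7

p_sum_7 = classical_probability(sum_is_seven, sample_space)
print(len(sample_space))  # number of outcomes in the sample space
print(p_sum_7)            # probability as a reduced fraction
```

Fraction keeps the result as a reduced fraction, which matches the usual way classical probabilities are reported.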
Rounding Rules for probabilities
Probabilities should be expressed as reduced fractions or rounded to two or three decimal places.
When the probability of an event is an extremely small decimal, it is permissible to round the decimal
to the first nonzero digit after the point.
Example: 0.0000587 ≈ 0.00006
And and Or
And means “at the same time” (multiply, ×)
Or has two meanings (add, +)
(1) inclusive or
(2) exclusive or
CHARTS AND GRAPHS
Histograms, Frequency Polygons, and Ogives
Purpose: To convey the data to the viewers in pictorial form. It is easier for most people to
comprehend the meaning of the data presented graphically than data presented numerically in tables or
frequency distributions.
Use:
Statistical graphs can be used to describe the data set or to analyze it. Graphs are also useful in getting
the audience’s attention in a publication or a speaking presentation. They can be used to discuss an
issue, reinforce a critical point, or summarize a data set. They can also be used to discover a trend or
pattern in a situation over a period of time.
The 3 most commonly used graphs in research are as follows:
Histogram
Frequency Polygons
Cumulative Frequency Graph (Ogive)
Histogram- is a graph that displays the data by using contiguous vertical bars (unless the frequency of a
class is 0) of various heights to represent the frequencies of the classes.
Step 1. Draw and label the x and y axes. The x axis is always the horizontal axis, and the y axis is always
the vertical axis.
Step 2. Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3. Using the frequency as the heights, draw vertical bars for each class.
Frequency Polygon- is a graph that displays the data by using lines that connect points plotted for the
frequencies at the midpoints of the classes. The frequencies are represented by the heights of the
points.
Step 1. Find the midpoints of each class.
Step 2. Draw the x and y axes. Label the x axis with the midpoint of each class, and then use a suitable
scale on the y axis for the frequencies.
Step 3. Using the midpoints for the x values and the frequencies as the y values, plot the points.
Step 4. Connect adjacent points with line segments. Draw a line back to the x axis at the beginning and
end of the graph, at the same distance that the previous and next midpoints would be located.
Ogive- is a graph that represents the cumulative frequencies for the classes in a frequency
distribution.
Step 1. Find the cumulative frequency for each class.
Step 2. Draw the x and y axes. Label the x axis with the class boundaries. Use an appropriate scale for
the y axis to represent the cumulative frequencies.
Step 3. Plot the cumulative frequency at each upper class boundary. Upper boundaries are used since
cumulative frequencies represent the number of data values accumulated up to the upper boundary of
each class.
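Step 4. Starting with the first upper class boundary, connect adjacent points with line segments. Then
extend the graph to the first lower class boundary on the x axis.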
Relative Frequency Graphs- the distributions used for the graphs above can be converted to distributions
using proportions instead of raw data as frequencies.
Step 1. Convert each frequency to a proportion or relative frequency by dividing the frequency for each
class by the total number of observations.
Step 2. Find the cumulative relative frequencies.
Step 3. Draw the graph.
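The quantities plotted in these graphs (midpoints for the frequency polygon, upper class boundaries and cumulative frequencies for the ogive, and proportions for the relative frequency versions) can be computed as in the sketch below. The grouped distribution is a made-up example, and the upper boundary is taken as the upper class limit plus 0.5, which assumes whole-number data.

```python
# Hypothetical grouped distribution: (lower limit, upper limit, frequency)
classes = [(100, 104, 2), (105, 109, 8), (110, 114, 18),
           (115, 119, 13), (120, 124, 7), (125, 129, 1), (130, 134, 1)]

n = sum(f for _, _, f in classes)  # total number of observations

rows = []
cumulative = 0
for lower, upper, f in classes:
    cumulative += f
    rows.append({
        "midpoint": (lower + upper) / 2,   # x value for the frequency polygon
        "upper_boundary": upper + 0.5,     # x value for the ogive (assumes whole-number data)
        "frequency": f,
        "cumulative": cumulative,          # y value for the ogive
        "relative": f / n,                 # proportion for this class
        "cum_relative": cumulative / n,    # cumulative relative frequency
    })

for row in rows:
    print(row)
```

The relative frequencies sum to 1 and the last cumulative relative frequency equals 1, so the relative frequency graphs have the same shape as the originals with a rescaled y axis.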
Pareto Chart- it is used to represent a frequency distribution for a categorical variable, and the
frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to
lowest.
Step 1. Make the bars the same width.
Step 2. Arrange the data from largest to smallest according to frequency.
Step 3. Make the units that are used for the frequency equal in size.
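The ordering step for a Pareto chart can be sketched in Python as follows. The defect categories and counts are made up, and the bars are drawn as horizontal text bars only to keep the example free of plotting libraries; a real Pareto chart uses vertical bars.

```python
# Hypothetical categorical data: number of defects by cause
defects = {"Scratches": 12, "Dents": 30, "Misalignment": 7, "Paint": 21}

# Step 2: arrange the categories from largest to smallest frequency
pareto_order = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Steps 1 and 3: same-width bars, with one '#' per unit of frequency
for cause, f in pareto_order:
    print(f"{cause:<14}{'#' * f} {f}")
```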
Pie Chart- is a circle that is divided into sections or wedges according to the percentage of frequencies in
each category of the distribution.
Degrees = (f / n) • 360°
% = (f / n) • 100
where
f = frequency for each class
n = sum of the frequencies
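Each wedge of a pie chart is conventionally sized in proportion to its frequency: Degrees = (f / n) • 360°, with the matching percentage % = (f / n) • 100. A minimal sketch, using made-up frequencies:

```python
# Hypothetical class frequencies
frequencies = {"A": 5, "B": 7, "O": 9, "AB": 4}
n = sum(frequencies.values())  # sum of the frequencies

slices = {}
for cls, f in frequencies.items():
    slices[cls] = {
        "degrees": f / n * 360,   # size of the wedge
        "percent": f / n * 100,   # label for the wedge
    }

for cls, s in slices.items():
    print(f"{cls}: {s['degrees']:.1f} degrees, {s['percent']:.0f}%")
```

The degrees should total 360 and the percentages 100, apart from small rounding differences.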
Stem and Leaf Plot- it is a data plot that uses part of the data value as the stem and part of the data
value as the leaf to form groups or classes.
Step 1. Arrange the data in order (smallest to largest).
Step 2. Separate the data according to the first digit as shown.
Step 3. Leading digit (stem); Trailing digit (leaf)
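The stem and leaf steps can be sketched in Python as follows. The data are made up, and the sketch assumes non-negative whole numbers so that the last digit is the leaf and the remaining leading digits are the stem.

```python
from collections import defaultdict

# Hypothetical data (non-negative whole numbers)
data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
        36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

def stem_and_leaf(data):
    plot = defaultdict(list)
    for value in sorted(data):           # Step 1: arrange the data in order
        stem, leaf = divmod(value, 10)   # Steps 2 and 3: leading digit(s) and trailing digit
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

A single-digit value such as 2 gets a stem of 0, so it appears on the first row of the plot.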
Scatter Plot- it is a graph of ordered pairs of data values that is used to determine if a relationship exists
between two variables.
Analyzing the Scatter Plot
There are several types of relationships that can exist between the x values and the y values. These
relationships can be identified by looking at the pattern of the points on the graphs. The types of
patterns and corresponding relationships are given next.
1. A positive linear relationship exists when the points fall approximately in an ascending straight line
and both the x and y values increase at the same time. The relationship then is that as the values for the
x variable increase, the values for the y variable are increasing.
2. A negative linear relationship exists when the points fall approximately in a descending straight line
from left to right. The relationship then is that as the x values are increasing, the y values are decreasing,
or vice versa.
3. A nonlinear relationship exists when the points fall along a curved line. The relationship is described by the
nature of the curve.
4. No relationship exists when there is no discernable pattern of the points.
Four Basic Probability Rules
Probability Rule 1 The probability of any event E is a number (either a fraction or decimal) between and
including 0 and 1. This is denoted by 0 ≤ P (E) ≤ 1.
Probability Rule 2 If an event E cannot occur (i.e., the event contains no members in the sample space),
its probability is 0.
Probability Rule 3 If an event E is certain, then the probability of E is 1.
Probability Rule 4 The sum of the probabilities of all the outcomes in the sample space is 1.
Complementary Events – The complement of an event E is the set of outcomes in the sample space that
are not included in the outcomes of event E. The complement of E is denoted by Ē (read “E bar”).
Empirical Probability - The difference between classical and empirical is that classical probability
assumes that certain outcomes are equally likely (such as the outcomes when a die is rolled), while
empirical probability relies on actual experience to determine the likelihood of outcomes.
The Addition Rule for Probability
Two events are mutually exclusive events if they cannot occur at the same time.
Addition Rule 1 - When two events A and B are mutually exclusive, the probability that A or B will occur
is
P (A or B) = P (A) + P (B)
P (A or B or C) = P (A) + P (B) + P (C)
Addition Rule 2
If A and B are not mutually exclusive, then
P (A or B) = P (A) + P (B) – P (A and B)
P (A or B or C) = P (A) + P (B) + P (C) - P (A and B) - P (A and C) - P (B and C) + P (A and B and C)
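Both addition rules can be checked by counting outcomes in a standard 52-card deck. The sketch below is an illustration under that assumption; the events (king, queen, heart) and the helper name prob are chosen only for the example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) outcomes

def prob(event):
    """Classical probability of an event when one card is drawn from the deck."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_queen(card):
    return card[0] == "Q"

def is_heart(card):
    return card[1] == "hearts"

# Addition Rule 1: a card cannot be both a king and a queen (mutually exclusive)
p_king_or_queen = prob(is_king) + prob(is_queen)

# Addition Rule 2: the king of hearts is both a king and a heart, so subtract the overlap
p_king_and_heart = prob(lambda c: is_king(c) and is_heart(c))
p_king_or_heart = prob(is_king) + prob(is_heart) - p_king_and_heart

print(p_king_or_queen)
print(p_king_or_heart)
```

Without the subtraction in Rule 2, the king of hearts would be counted twice, once as a king and once as a heart.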
Multiplication Rule 1
Two events A and B are independent events if the fact that A occurs does not affect the probability of B
occurring.
P (A and B) = P (A) • P (B)
Multiplication Rule 2
When the outcome or occurrence of the first event affects the outcome or occurrence of the second
event in such a way that the probability is changed, the events are said to be dependent events.
P (A and B) = P (A) • P (B | A)
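The two multiplication rules can be compared on one assumed example: drawing two aces from a standard 52-card deck, first with replacement (independent events) and then without replacement (dependent events).

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)  # P(A): four aces among 52 cards

# Multiplication Rule 1 (independent): the first card is replaced,
# so the second draw has the same probability as the first
p_two_aces_with_replacement = p_first_ace * Fraction(4, 52)

# Multiplication Rule 2 (dependent): the first ace is not replaced,
# so P(B | A) is three remaining aces among 51 remaining cards
p_second_ace_given_first = Fraction(3, 51)
p_two_aces_without_replacement = p_first_ace * p_second_ace_given_first

print(p_two_aces_with_replacement)
print(p_two_aces_without_replacement)
```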
Counting Rules
The Fundamental Counting Rule- In a sequence of n events in which the first event has k1 possibilities, the
second event has k2, the third has k3, and so forth, the total number of possibilities of the sequence
will be
k1 • k2 • k3 • • • kn
Note: In this case and means to multiply
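The fundamental counting rule can be checked by listing every possible sequence and comparing the count with the product k1 • k2 • k3. The three events below (paint color, paint type, and use) and their options are made up for the example.

```python
import math
from itertools import product

# Hypothetical sequence of three events and the possibilities for each
choices = [
    ["red", "blue", "white"],   # k1 = 3 colors
    ["latex", "oil"],           # k2 = 2 types
    ["indoor", "outdoor"],      # k3 = 2 uses
]

# Counting rule: multiply the number of possibilities for each event
total = math.prod(len(options) for options in choices)

# Cross-check by listing every possible sequence
all_sequences = list(product(*choices))

print(total)
print(len(all_sequences))
```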
Data Description
Statistical methods can be used to summarize data. The most familiar of these methods is finding an
average.
Parameter – is a characteristic or measure obtained by using all the data values in the population.
Statistic – is a characteristic or measure obtained by using the data values from a sample.
Mean – it is the sum of the values, divided by the total number of values. It is also known as the
arithmetic average. The symbol X̄ represents the sample mean:
X̄ = ΣX / n
Where:
n = represents the total number of values in the sample
The Greek letter μ (mu) represents the population mean:
μ = ΣX / N
Where:
N = represents the total number of values in the population
Greek letters are used to denote parameters
Roman letters are used to denote statistics
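*Rounding rule for the Mean:
The mean should be rounded to one more decimal place than occurs in the raw data.
e.g. raw data in whole numbers (70) → mean to one decimal place (70.1)
raw data to one decimal place (70.1) → mean to two decimal places (70.11)
raw data to two decimal places (70.11) → mean to three decimal places (70.111)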
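The sample mean, X̄ = ΣX / n, and the rounding rule can be illustrated with a short Python sketch. The data are made-up whole numbers, so the mean is reported to one decimal place.

```python
# Hypothetical sample of whole-number data values
data = [20, 26, 40, 36, 23, 42, 35, 24, 30]

n = len(data)               # total number of values in the sample
mean = sum(data) / n        # sum of the values divided by n

# Rounding rule: one more decimal place than the raw data (whole numbers here)
rounded_mean = round(mean, 1)

print(mean)
print(rounded_mean)
```

Rounding should be done only on the final answer, not on intermediate sums.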
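Mean for Grouped Data
Step 1. Make a table with columns A (class), B (frequency f), C (midpoint Xm), and D (f • Xm).
Step 2. Find the midpoint Xm of each class and place it in column C.
Step 3. Multiply each frequency by its midpoint (f • Xm) and place the product in column D.
Step 4. Find the sum of column D.
Step 5. Divide the sum by n, the total of the frequencies, to get the mean: X̄ = Σ(f • Xm) / n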
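The grouped-data procedure amounts to X̄ = Σ(f • Xm) / n, where Xm is the class midpoint, f is the class frequency, and n is the sum of the frequencies. A minimal sketch, using a made-up distribution given by class boundaries:

```python
# Hypothetical grouped distribution: (lower boundary, upper boundary, frequency)
classes = [(5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
           (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2)]

table = []
for lower, upper, f in classes:
    midpoint = (lower + upper) / 2            # Step 2: midpoint Xm
    table.append((f, midpoint, f * midpoint)) # Step 3: f * Xm

n = sum(f for f, _, _ in table)               # total of the frequencies
total = sum(fx for _, _, fx in table)         # Step 4: sum of the f * Xm column
grouped_mean = total / n                      # Step 5: divide the sum by n

print(n, total, grouped_mean)
```

The result is an approximation of the true mean, since each value in a class is treated as if it were equal to the class midpoint.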
Median – is the midpoint of the data array. The symbol is MD.
Step 1. Arrange the data in order
Step 2. Select the middle point. If the number of values is even, the median is the mean of the two
middle values.
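Mode – the value that occurs most often in a data set.
Unimodal – one value occurs with the greatest frequency.
Bimodal – two values occur with the same greatest frequency.
Multimodal – more than two values occur with the same greatest frequency.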
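The median and mode can be found with Python's standard statistics module, as sketched below. The data sets and the helper name describe_modes are made up for illustration. With an even number of values, statistics.median averages the two middle values, which is the usual convention. Treating a data set in which no value repeats as having "no mode" is a common textbook convention and is an assumption here.

```python
import statistics

# Hypothetical data sets
odd_data = [180, 201, 220, 191, 219, 209, 186]             # 7 values
even_data = [684, 764, 656, 702, 856, 1133, 1132, 1303]    # 8 values

median_odd = statistics.median(odd_data)    # middle value of the ordered data
median_even = statistics.median(even_data)  # mean of the two middle values

def describe_modes(data):
    """Return the mode(s) and a label: unimodal, bimodal, multimodal, or no mode."""
    modes = statistics.multimode(data)  # every value tied for the highest count
    if not modes or (len(data) > 1 and data.count(modes[0]) == 1):
        return [], "no mode"            # assumed convention when no value repeats
    labels = {1: "unimodal", 2: "bimodal"}
    return modes, labels.get(len(modes), "multimodal")

print(median_odd, median_even)
print(describe_modes([1, 2, 2, 3]))
print(describe_modes([1, 1, 2, 2, 3]))
print(describe_modes([1, 1, 2, 2, 3, 3, 4]))
```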