P299 Module 8 Notes

Uploaded by

jbcruz2

MODULE 8: QUANTITATIVE DATA ANALYSIS – INFERENTIAL AND DESCRIPTIVE STATISTICS

Source: Inferential Statistics

Inference - method of making judgments about an unknown, drawing on what is already known to be true.

Statistical Inference - science of characterizing or making decisions about a population by using information from a sample drawn from that population.

Descriptive Statistics - appropriate if the cases you are studying represent the entire population of interest, and you do not wish to generalize beyond those cases

Inferential Statistics - appropriate if the cases you are studying do not represent the entire population of interest, and you do wish to generalize beyond those cases

Theoretical Probability Distributions
• defined by a formula that specifies what values can be taken by data points within the distribution and how common each value will be
• often presented in graphical form
• useful in inferential statistics because their properties and characteristics are known

Classifications of Probability Distribution
1. Continuous – the data can take any value within a specified range
2. Discrete - the data can take only certain values

Normal Distribution
• a reasonable description of how many continuous variables are distributed in reality, from industrial process variation to intelligence test scores
• under specified conditions, we may assume that sampling distributions of statistics such as the sample mean are normally distributed, even if the samples are drawn from populations that are not normally distributed
• also referred to as the bell curve due to its characteristic shape
• also referred to as the Gaussian distribution in honor of the physicist and mathematician Carl Friedrich (Karl) Gauss (1777–1855), who used this distribution to analyze astronomical data

Standard Normal Distribution (Z Distribution)
• normal distribution with a mean of 0 and standard deviation of 1
• Any normal distribution can be transformed to the standard normal distribution by converting the original values to standardized scores

Characteristics of Normal Distribution
• Symmetry
• Unimodality (a single most common value)
• A continuous range from −∞ to +∞ (from negative infinity to positive infinity)
• A total area under the curve of 1
• A common value for the mean, median, and mode

Empirical Rule for Any Normal Distribution
• About 68% of the data will fall within one standard deviation of the mean.
• About 95% of the data will fall within two standard deviations of the mean.
• About 99.7% of the data will fall within three standard deviations of the mean.

Z-Score
• Comparing scores from different populations is facilitated by converting raw scores (scores in their natural metric, for instance, weight measured in pounds or kilograms) into Z-scores, which express the value of the score in terms of units of the standard deviation.
• sometimes referred to as normalized scores
• facilitate comparison of scores from populations with different means and standard deviations.
• distance of a data point from the mean, expressed in units of standard deviation:

Z = (X − µ) / σ

where:
X – observed value
µ – mean
σ – standard deviation

Binomial Distribution
• applies to many types of real-life data with dichotomous outcomes (outcomes that can take only two values), from machine parts that are either defective or acceptable to students who either pass or fail a class.
• Events in a binomial distribution are generated by a Bernoulli process

Requirements for Data Represented by Binomial Distribution
1. The outcome of each trial is one of two mutually exclusive outcomes.
2. Each trial is independent, so the result of one trial has no influence on the result of any other trial.
3. The probability of success, denoted as p, is constant for every trial.
4. There is a fixed number of trials, denoted as n.

Formula for Binomial Distribution

P(k) = C(n, k) × p^k × (1 − p)^(n − k), with C(n, k) = n! / (k! (n − k)!)

where:
k – number of successes
n – number of trials
p – probability of success on each trial

Variables
1. Independent - presumed to influence the value of the dependent variable
2. Dependent - represent an outcome of the study
3. Control - might influence the dependent variable but are not the main focus of interest

Population - consists of all the people or other entities that the researchers would like to study if they had infinite resources

Non-Probability Sampling
• there is a high probability that the sample drawn using a nonprobability method will not be representative of the population of interest, and there is no way to correct the sample statistically
• popular because the researcher can bypass the more cumbersome process of drawing a probability sample, but a price is paid for this convenience
• Conclusions based on data using nonprobability sampling methods are of limited usefulness in generalizing to a larger population because there is no way to know how the sample relates to the population of interest

Types of Non-Probability Sampling
1. Volunteer Sampling
   • Use of volunteer samples is best reserved for circumstances in which it would be difficult to select a sample randomly from a population, for instance in a study about people who use illegal drugs.
   • Even with limited ability to generalize, useful information can be gained from volunteer samples, particularly in the early stages of a project
   • Results from volunteer samples have limited usefulness if the goal is to generalize beyond the sample.

2. Convenience Sampling
   • can be used to collect information in the early stages of a study but has limited usefulness if the goal is to generalize beyond the sample
   • if you survey, say, 50 people who happen to be available, those 50 people are not a random selection of area residents, so it would not be valid to conclude that their opinions reflect those of the area as a whole
   • you might use the information gained from a survey administered to a convenience sample to construct a questionnaire for a more scientific sample of the area’s population.

3. Quota Sampling
   • the data collector is instructed to get responses from a certain number or proportion of subjects within broad classifications.
   • a slight improvement over convenience sampling because it can ensure representation of different demographic groups within the sample
   • you still have no way of knowing whether the people in the sample are representative of the population of interest
   • the data collector might approach people who seem most like themselves (for instance in age) or who seem the friendliest or most approachable, rendering the sample even less useful as a means to acquire information about a larger population.

Probability Sampling
• every member of the population has a known probability of being selected for the sample
• preferred because the researcher can generalize the results obtained from the sample to the population of interest.

Types of Probability Sampling
1. Simple Random Sampling
   • all samples of a given size have an equal probability of being selected
   • has the most desirable statistical properties of any kind of sampling
   • can be impossible or prohibitively expensive to execute in some contexts

2. Systematic Sampling
   • you need a list or other enumeration of your population
   • You then choose a start number at random between 1 and n (the sampling interval) and include in your sample the object representing the start number and every nth object following
   • particularly useful when the population accrues over time and there is no predetermined list of population members
   • you must ensure that the data is not cyclic in a way that corresponds with your random starting point and value of n.

3. Stratified Sampling
   • the population of interest is divided into nonoverlapping groups or strata based on common characteristics
   • If comparing different strata or making estimates of the characteristics of subgroups is a primary goal of the study, stratified sampling is a good choice because it can be designed to ensure adequate sampling from each stratum of interest.

4. Cluster Sampling
   • population is sampled by using preexisting groups
   • often used in national surveys that require in-person interviews or the collection of physical specimens

Central Limit Theorem
• states that the sampling distribution of the sample mean approximates the normal distribution, regardless of the distribution of the population from which the samples are drawn, if the sample size is sufficiently large
• enables us to make statistical inferences based on the properties of the normal distribution, even if the sample is drawn from a population that is not normally distributed.

Steps in Hypothesis Testing
1. Develop a research hypothesis that can be tested mathematically.
2. Formally state the null and alternative hypotheses.
3. Decide on an appropriate statistical test, gather data, and do the calculations.
4. Make your decision based on the results.

Types of Hypotheses
1. Null Hypothesis - always predicts no effect or no relationship between variables
2. Alternative Hypothesis - states your research prediction of an effect or relationship

One-Tailed vs Two-Tailed Tests
1. One-Tailed Test - allows for the possibility of an effect in one direction
2. Two-Tailed Test - allows for the possibility of an effect in two directions, positive and negative

Type I and Type II Errors
1. Type I Error - rejecting the null hypothesis when it is in fact true
2. Type II Error - failing to reject the null hypothesis when it is in fact false

Point Estimate vs Interval Estimate
1. Point Estimate – calculating a single statistic that represents a single point on the number line
2. Interval Estimate – a range of numbers that is likely to contain the population value being estimated
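
The difference between a point estimate and an interval estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `scores` data and the function name `mean_with_interval` are hypothetical, and the interval uses the normal (Z) critical value with the sample standard deviation, which is an approximation that works best for reasonably large samples.

```python
import math
import statistics


def mean_with_interval(sample, confidence=0.95):
    """Return (point_estimate, lower, upper) for the population mean.

    Assumption: a normal-based interval, i.e. mean +/- z * standard error.
    """
    n = len(sample)
    point = statistics.mean(sample)                       # point estimate
    std_err = statistics.stdev(sample) / math.sqrt(n)     # standard error of the mean
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return point, point - z * std_err, point + z * std_err


# Hypothetical test scores, used only for illustration
scores = [72, 85, 90, 66, 78, 88, 95, 70, 81, 75]
point, lower, upper = mean_with_interval(scores)
print(f"point estimate: {point:.2f}")
print(f"interval estimate: ({lower:.2f}, {upper:.2f})")
```

The point estimate is a single number, while the interval estimate brackets it symmetrically; a larger sample or a lower confidence level narrows the interval.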
Confidence Interval
• interval between two values that represent the upper and lower confidence limits or confidence bounds for a statistic
• formula used to calculate the confidence interval depends on the statistic being used
• conveys important information about the precision of a point estimate
• if our test statistic is the mean and we are using a 95% confidence interval, over an infinite number of repetitions of drawing a sample and computing its mean, 95% of the time the confidence interval thus constructed would contain the true mean of the population.

P-Value
• expresses the probability of obtaining results at least as extreme as those obtained in an analysis of sample data if chance alone were at work (that is, if the null hypothesis were true)
• commonly reported for most research results involving statistical calculations, in part because intuition is a poor guide to how unusual a particular result is.

Z-Statistic
• instead of asking what the probability of a particular score is, we are now interested in the probability of a particular sample mean.
• an important example of the application of the central limit theorem, which allows us to compute the probability of a sample result by using the normal distribution, even if we don’t know the distribution of the population from which the sample was drawn

Source: Descriptive Statistics

Descriptive Statistics
• the use of statistical and graphic techniques to present information about the data set being studied
• it is a common practice to begin an analysis by examining graphical displays of a data set and to compute some basic descriptive statistics to get a better sense of the data to be analyzed

Measures of Central Tendency
1. Mean
   • average of a set of values
   • appropriate for interval and ratio data
   • not an appropriate summary measure for every data set because it is sensitive to extreme values, also known as outliers, and can also be misleading for skewed (nonsymmetrical) data.
   • Trimmed Mean – calculated by trimming or discarding a certain percentage of the extreme values in a distribution and then calculating the mean of the remaining values (the related Winsorized mean replaces the extreme values instead of discarding them)

2. Median
   • the middle value when the values are ranked in ascending or descending order.
   • a better measure of central tendency than the mean for data that is asymmetrical or contains outliers
   • it does not matter whether the data set contains some extremely large or small values because they will not affect the median more than less extreme values.

3. Mode
   • refers to the most frequently occurring value.
   • most often useful in describing ordinal or categorical data.

Dispersion - refers to how variable or spread out data values are.

Measures of Dispersion
1. Range
   • the difference between the highest and lowest values.
   • If there are one or a few outliers in the data set, the range might not be a useful summary measure.

2. Interquartile Range
   • alternative measure of dispersion that is less influenced than the range by extreme values
   • the range of the middle 50% of the values in a data set, which is calculated as the difference between the 75th and 25th percentile values.

3. Variance
   • average of the squared deviations from the mean

4. Standard Deviation
   • square root of the variance

5. Coefficient of Variation
   • a measure of relative variability that makes it possible to compare variability across variables measured in different units
   • calculated as the standard deviation divided by the mean, often multiplied by 100 to express it as a percentage

Outliers
• a data point or observation whose value is quite different from the others in the data set being analyzed.
• a data point that seems to come from a different population or is outside the typical pattern of the other data points

Graphic Methods
1. Frequency Tables
   • appropriate when the actual values of the numbers in different categories, rather than the general pattern among the categories, are of primary interest.
   • an efficient way to present large quantities of data and represent a middle ground between text (paragraphs describing the data values) and pure graphics (such as a histogram).
   • Absolute Frequency - raw numbers or counts for each category
   • Relative Frequency - displays the percent of the total represented by each category
   • Cumulative Frequency - shows the relative frequency for each category and those below it

2. Bar Chart
   • particularly appropriate for displaying discrete data with only a few categories

3. Pie Chart
   • shows graphically what proportion each part occupies of the whole
   • most useful when there are only a few categories of information and the differences among those categories are fairly large

4. Pareto Chart/Diagram
   • combines the properties of a bar chart and a line chart; the bars display frequency and relative frequency, whereas the line displays cumulative frequency.
   • it is easy to see which factors are most important in a situation and, therefore, to which factors most attention should be directed
   • the bars are ordered in descending frequency from left to right (so the most common cause is the furthest to the left and the least common the furthest to the right), and a cumulative frequency line is superimposed over the bars

5. Stem and Leaf Plot
   • divide your data into intervals (using your common sense and the level of detail appropriate to your purpose) and display each data point by using two columns.
   • The stem is the leftmost column and contains one value per row, and the leaf is the rightmost column and contains one digit for each case belonging to that row.
   • plot that displays the actual values of the data set but also assumes a shape indicating which ranges of values are most common.
   • not only tells us the actual values of the scores and their range but the basic shape of their distribution as well.

6. Box Plot
   • also known as the hinge plot or the box-and-whiskers plot
   • a compact way to summarize and display the distribution of a set of continuous data
   • always constructed to highlight five important characteristics of a data set: the median, the first and third quartiles (and hence the interquartile range as well), and the minimum and maximum.

7. Histogram
   • the bars (also known as bins because you can think of them as bins into which values from a continuous distribution are sorted) touch each other, unlike the bars in a bar chart.
   • The x-axis (horizontal axis) in a histogram represents a scale rather than simply a series of labels, and the area of each bar represents the proportion of values that are contained in that range.
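
The idea of sorting continuous values into bins can be sketched as follows. This is a minimal illustration, not a plotting routine: the function name `histogram_bins`, the choice of equal-width bins, and the `heights` data are all assumptions made for the example.

```python
def histogram_bins(values, num_bins):
    """Sort values into equal-width bins.

    Returns a list of (bin_start, bin_end, count, proportion) tuples.
    Assumption: the last bin includes the maximum value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a nonzero range")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    return [
        (lo + i * width, lo + (i + 1) * width, c, c / total)
        for i, c in enumerate(counts)
    ]


# Hypothetical heights in centimeters, used only for illustration
heights = [150, 152, 155, 158, 160, 161, 163, 165, 168,
           170, 172, 175, 178, 180, 190]
bins = histogram_bins(heights, 4)
for start, end, count, proportion in bins:
    print(f"{start:.0f}-{end:.0f}: {'#' * count} ({proportion:.0%})")
```

Because the bins here have equal width, the count in each bin is proportional to the area of the corresponding bar, and the proportions across all bins sum to 1.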

8. Bivariate Charts
   • Charts that display information about the relationship between two variables
   • Scatterplot - defines each point in a data set by two values, commonly referred to as x and y, and plots each point on a pair of axes
