
Statistics

Advance Statistics

Uploaded by

matienzo.jc0405
Copyright
© © All Rights Reserved

Terminologies in Statistics

Statistics - numbers measured for some purpose
- a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty
Raw Data - numbers and category labels that have been collected or measured
- not yet processed in any way
- information about a group of items
Universe - set of entities under study
Variable - a characteristic that differs from one entity to the next
Population - data collected when all individuals in a universe are measured
Sample - subset of a universe or population
Statistic - summary measure of sample data
Parameter - a summary measure of population data

SCALE OF MEASUREMENT and Operations

Scale of measurement - dictates what statistical analysis can be done on data

Nominal
• Lowest level
• Categories have no order
• Count the incidence per category or percentages
Ordinal
• Categories have order
• Ranking can be done
Interval
• Zero point is arbitrary
• Addition and subtraction can be done
• Not multiplication nor division
Ratio
• Highest level
• Zero point is absolute
• All mathematical operations can be done

CLASSIFICATIONS OF VARIABLES

Qualitative - places an individual or item into one of several groups or categories
Quantitative - takes numerical values; arithmetic operations such as adding and averaging can be performed
- can be discrete (its values are countable) or continuous (it can take any value in an interval or collection of intervals)

LEVELS OF MEASUREMENT

Nominal - takes values that give names or labels to various categories with no particular ordering. Information that can be obtained from processing data on these variables is limited to frequency counts and percentages.
Examples: gender, place of origin, religion etc.

Ordinal – basically nominal with categories having an inherent ordering. The difference between categories cannot be measured and has no meaning. Information that can be obtained from processing data on these variables is limited to frequency counts with conditional insight on the rank or order of the categories specified.
Examples: social class (lower, middle and upper class), satisfaction rating (very dissatisfied, dissatisfied, satisfied, very satisfied)

Interval – basically quantitative variables with differences between two consecutive quantities being constant. Intervals between categories can be quantified and have meaning; however, the scale is distinguished as having no true starting or zero point.
Examples: room temperature (in Celsius) and IQ

Ratio – all characteristics of interval scale variable in addition to having an absolute zero point.
Examples: weekly allowance and class standing

Methods of Collecting Data

• Objective Method – data are collected by measuring or observing the characteristics of interest directly on the units.
• Subjective Method – information is collected through interviews not necessarily requiring the presence of the units under study.
• Use of Existing Records – if the data or part of data needed by the researcher have already been collected by another researcher or institution, perhaps for some other purposes, then this method can be convenient. The researcher should remember to properly acknowledge the source of data.

Classification of Data Collected

• Primary Data – data collected directly from the source and obtained through objective or subjective methods.
• Secondary Data – data acquired with the use of existing records

METHODS OF ORGANIZING AND PRESENTING DATA

Textual Presentation – provides a concise narrative description highlighting a few but the most important results of the study.

Example
Data collected consist of the gender, quiz score and address (town) of 10 respondents. It was observed that 6 out of 10 respondents are female. Four respondents were currently residing in Sta Cruz, 2 each in Pila and Bay, and 1 each in Los Banos and Pagsanjan.

Tabular Presentation – used if it is necessary to present more details or numerical information
Respondent   Gender   Quiz Score   Address (Town)
1            F        2            Sta Cruz
2            M        6            Sta Cruz
3            F        0            Sta Cruz
4            M        9            Pila
5            F        3            Bay
6            M        3            Los Banos
7            M        3            Bay
8            F        7            Pagsanjan
9            F        5            Pila
10           F        5            Sta Cruz

Quiz Score    Count
0             1
2             1
3             3
5             2
6             1
7             1
9             1
Grand Total   10

Gender        Count
F             6
M             4
Grand Total   10

Graphical Presentation – for visual presentation of the data; Graph allows for a quick overview of the distributional properties of and trends in the data set.

[Figure: Pie Chart of Respondent’s Sex (categories: M, F)]
[Figure: Bar Chart for the Quiz Score of the Respondents]

NUMERICAL DESCRIPTIVE MEASURES

MEASURE OF LOCATION
- is a value within the range of the data that describes its specific location or position relative to the entire data set
Minimum (MIN) - lowest value in the data set
Maximum (MAX) - highest value in the data set

MEASURE OF CENTRAL TENDENCY – represent the value(s) where the data observations tend to concentrate/cluster.
Mean (or the Arithmetic Mean) – defined as the sum of the data values divided by the total number of data values
Median – a single value at the middle of an array of data observations, denoted by Md.
Mode – refers to the most frequent value in the data set

Generalized Formula for Quantiles
$Q_k = k\left(\dfrac{N+1}{4}\right)^{\text{th}}$ item
$D_k = k\left(\dfrac{N+1}{10}\right)^{\text{th}}$ item
$P_k = k\left(\dfrac{N+1}{100}\right)^{\text{th}}$ item
**FDT

MEASURE OF DISPERSION
- are quantities that describe the spread or variability of the values in a data set.

Range – difference between MAX and MIN values of a data set. R = MAX – MIN.
Standard Deviation – measure which indicates the average distance of the observations from the mean of the data set.
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Variance – defined as the average squared differences of the observations from the mean of the data set. *Square of SD
Coefficient of variation – relative measure of variability that indicates the magnitude of variation relative to the magnitude of the mean. It is denoted as CV and expressed in percentage as shown in the formula, $CV = \left(\dfrac{\sigma}{\mu}\right) \times 100\%$.
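The frequency counts and descriptive measures for the 10 respondents can be checked with a short Python sketch. It uses only the standard library and the quiz scores and genders from the respondent table. The helper `quantile_item` is a hypothetical name, and its handling of fractional positions (linear interpolation between neighbouring items, with the position clamped to the range 1 to N) is an assumption, since the notes give only the position formula.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev, pvariance

# Quiz scores and genders of respondents 1 to 10, copied from the table.
scores = [2, 6, 0, 9, 3, 3, 3, 7, 5, 5]
genders = ["F", "M", "F", "M", "F", "M", "M", "F", "F", "F"]

# Tabular presentation: frequency count per category.
score_counts = Counter(scores)
gender_counts = Counter(genders)


def quantile_item(data, k, parts):
    """Value at position k * (N + 1) / parts in the sorted data (1-based).

    Assumption: a fractional position is resolved by linear interpolation,
    and positions outside 1..N are clamped to the nearest end.
    """
    ordered = sorted(data)
    n = len(ordered)
    pos = min(max(k * (n + 1) / parts, 1), n)
    lower = int(pos)          # 1-based index of the item just below pos
    frac = pos - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


mu = mean(scores)                          # arithmetic mean
md = median(scores)                        # middle of the sorted array
mo = mode(scores)                          # most frequent value
data_range = max(scores) - min(scores)     # R = MAX - MIN
sigma = pstdev(scores)                     # divides by N, as in the SD formula
cv = sigma / mu * 100                      # coefficient of variation, in percent

print(dict(score_counts), dict(gender_counts))
print(mu, md, mo, data_range, round(sigma, 4), round(cv, 2))
print(quantile_item(scores, 1, 4), quantile_item(scores, 3, 4))  # Q1 and Q3
```

`pstdev` and `pvariance` divide by N and so match the standard deviation formula in these notes; `stdev` and `variance` from the same module divide by N - 1 and would be the choice for sample data.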
METHODS IN SAMPLING

Methods of drawing a sample are classified into two categories:
Probability Sampling – refers to methods whereby some form of random selection is used.
Non-probability Sampling – refers to all methods whereby the elements of a sample are chosen according to the personal judgment or purpose of the researcher, without the use of a chance mechanism in the selection process.

Probability Sampling Techniques
1. Simple Random Sampling (SRS)
2. Stratified Random Sampling (StRS)
3. Systematic Sampling (SYS)
4. Cluster Sampling (CL)

PROBABILITIES
• always 0 for X equal to a single value
• defined on values in an interval
• associated with areas under a curve called the probability density function of the random variable

PROBABILITY DENSITY CURVE
1. It lies on or above the horizontal axis
2. Total area under the curve is equal to 1.

NORMAL DISTRIBUTIONS
• an important family of densities
• many variables have approximately this shape and form
• Many statistics used in inference are based on sums or averages, which generally have (approximately) this distribution.

NORMAL CURVE
• Symmetric
• Bell-shaped
• Centered at the mean
• Spread is determined by the standard deviation

RULES IN FINDING PROBABILITIES
• P(Z > z) = 1 – P(Z < z)
• P(Z < -z) = P(Z > z)
• P(Z > -z) = P(Z < z)
• P(a < Z < b) = P(Z < b) – P(Z < a)

TRANSFORMATION THEOREM
If $X \sim N(\mu, \sigma^2)$, then $Z = \dfrac{X - \mu}{\sigma}$ follows the standard normal distribution, N(0,1).

Example
Schools with NAT scores in the top 20% are labeled excellent. Schools in the bottom 25% are labeled "in danger" and schools in the bottom 5% are designated as failing. Previous data suggest that NAT scores are approximately normal with a mean of 75 and a standard deviation of 5.
1. What is the probability that a randomly selected school will score below 70?
$P(X < 70) = P\left(Z < \dfrac{70 - 75}{5}\right) = P(Z < -1) = 0.1587$
2. What is the score cut-off required for schools to be labeled excellent?
$P(X > x) = 0.2$

HYPOTHESIS TESTING

Hypothesis Testing – one of the major concerns in inferential statistics; it is a technique used to determine whether a specific conjecture about the parameter(s) of the population under study will be accepted or rejected.

TYPES OF HYPOTHESIS

Statistical Hypothesis is an assertion about the value of the population parameter or the form of the distribution.

Null Hypothesis (Ho) – the hypothesis being tested; is usually a statement of equality signifying no difference, no change, no relationship or effect.

Alternative Hypothesis (Ha) – is a contradicting statement which is accepted if the sample data provide sufficient evidence against the null hypothesis; it asserts that there is a difference, a change, a relationship or an effect. It embodies what the researcher wants to prove and is aptly called the researcher’s hypothesis.

An alternative hypothesis may be either one-sided or two-sided. An alternative hypothesis is one-sided if the direction of the change, effect or relationship is specified. Otherwise, it is two-sided. The statement Ha: $\mu_1 \neq \mu_2$ is an example of a two-sided alternative hypothesis since it would be true if $\mu_1 < \mu_2$ or if $\mu_1 > \mu_2$, while Ha: $\mu_1 > \mu_2$ is an example of a one-sided hypothesis.

Test Statistic is a numerical characteristic of the sample that will serve as a basis for rejecting or failing to reject Ho. On the other hand, a decision rule specifies the range of values of the test statistic which leads to rejection of Ho in favor of Ha. The sampling distribution of the test statistic, under the assumption that Ho holds, is used to specify the critical region. The critical or rejection region defines the range of values of the test statistic that are very unlikely to be obtained when Ho is true, and hence will lead to the rejection of Ho. The value(s) at the cut-off point(s) is (are) called the critical value(s) of the test statistic.
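The two questions in the NAT example (mean 75, standard deviation 5) can be worked through numerically. The sketch below is one way to do it with Python's standard-library `statistics.NormalDist`. It standardizes 70 with the transformation theorem for question 1 and inverts the cumulative distribution for question 2. The cut-offs for the "in danger" and "failing" labels are computed the same way; the notes do not ask for them, so they are included only as an illustration.

```python
from statistics import NormalDist

# NAT scores: approximately normal with mean 75 and standard deviation 5.
nat = NormalDist(mu=75, sigma=5)
std_normal = NormalDist()  # Z ~ N(0, 1)

# 1. P(X < 70): standardize, then read the area to the left of z.
z = (70 - nat.mean) / nat.stdev          # (70 - 75) / 5 = -1
p_below_70 = std_normal.cdf(z)

# 2. Top 20% means P(X > x) = 0.2, i.e. P(X < x) = 0.8.
excellent_cutoff = nat.inv_cdf(0.80)

# Same idea for the other labels: bottom 25% and bottom 5%.
in_danger_cutoff = nat.inv_cdf(0.25)
failing_cutoff = nat.inv_cdf(0.05)

print(round(p_below_70, 4))        # 0.1587
print(round(excellent_cutoff, 2))  # 79.21
print(round(in_danger_cutoff, 2), round(failing_cutoff, 2))
```

Under these assumptions a school needs a score of about 79.2 to be labeled excellent.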
Test of Statistical Hypothesis on One Population

Hypothesis testing is a systematic method used to evaluate data and is used as an aid in a decision-making process. In this process, decisions are made concerning populations on the basis of sample information and statistical tests are used in arriving at such decisions. The typical series of steps involved in hypothesis testing is as follows:

1. State the null (Ho) and the alternative (Ha) hypotheses.
2. Determine the appropriate test statistic to use and its distribution under the assumption that Ho is true.
3. Choose a level of significance (α) and determine the critical or rejection region of the test. Formulate the decision rule that will be used for rejecting or failing to reject Ho based on the value of the test statistic.
4. Calculate the value of the test statistic using the sample data.
5. Make a decision on whether to reject or fail to reject Ho in accordance with the decision rule constructed in Step 3 and results of the computation.
6. Make the appropriate conclusion in relation to the objective of the problem.

TEST ON ONE POPULATION MEAN (µ) – Unknown Variance (σ²)

The test of hypothesis on one population mean requires n observations $x_1, x_2, x_3, \ldots, x_n$ that are independently drawn from the target population using SIMPLE RANDOM SAMPLING (SRS). The test further requires that X follow a NORMAL DISTRIBUTION with mean µ and population variance σ², that is $X \sim N(\mu, \sigma^2)$. The test is applicable to variables that are measured in AT LEAST THE INTERVAL SCALE.

The main objective of the test of hypothesis on one population mean is to test, on the basis of the sample, if the null hypothesis Ho: $\mu = \mu_0$, where $\mu_0$ is the hypothesized value of the population mean, can be rejected in favor of its alternative hypothesis that takes any one of the three forms, namely: Ha: $\mu < \mu_0$, Ha: $\mu > \mu_0$, or Ha: $\mu \neq \mu_0$. The appropriate test statistic is given by $t_c = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$, which is distributed as the Student’s t-distribution with (n-1) degrees of freedom, but approximately distributed as N(0,1) for large sample size.

Example
A medical investigation claims that the average number of new patients per week at a local hospital is 15. A random sample of 30 weeks yielded a mean of 16.1 patients with a standard deviation of 4.7. Test this claim using α=5%.

TEST ON ONE POPULATION PROPORTION – for Large Samples

The test of hypothesis on one population proportion requires n observations $x_1, x_2, x_3, \ldots, x_n$ that are drawn independently from the population using the SIMPLE RANDOM SAMPLING (SRS) scheme. These observations are assumed further to have come from a sequence of INDEPENDENT and IDENTICAL BERNOULLI TRIALS where the probability, P, of observing the attribute of interest in an element drawn from the population is the same for all elements of the population, that is, P remains constant from trial to trial.

The test described is applicable for large sample cases, that is, for sample size n such that $nP(1 - P) \geq 3$. The appropriate test statistic is given by $Z_c = \dfrac{\hat{P} - P_0}{\sqrt{\dfrac{P_0 (1 - P_0)}{n}}}$, where $\hat{P}$ is the sample proportion, which is approximately distributed as standard normal.

The objective of the test is to determine if the null hypothesis Ho: $P = P_0$ (where $P_0$ is the hypothesized value of the population proportion) can be rejected in favor of the formulated alternative hypothesis. This may be any of the three possible alternatives: Ha: $P < P_0$, Ha: $P > P_0$, or Ha: $P \neq P_0$. The complete procedure of the test is presented in the table below.

Example
The US Census reports that in 2010, 48% of households have no children. A random sample of 500 households was taken to assess if the population proportion has changed from the census value of 0.48. Of the 500 households, 220 have no children. Use a 10% significance level.

TESTS ON TWO POPULATION MEANS

TYPES OF SAMPLES
Related samples are obtained by matching of similar units with respect to some important characteristics or by self-pairing, in which two measurements are taken from the same unit.
Independent samples are obtained when two unrelated sets of units are measured for a specific variable.

PAIRED DATA SCENARIO
• Each person or unit is measured twice under different conditions.
• Similar individuals or units are paired prior to an experiment, where each member of a pair receives a different treatment.

INDEPENDENT SAMPLES SCENARIO
• Random samples are taken separately from two populations
• One random sample is taken but the units are categorized as belonging to one population or another, e.g. male/female.
• Participants are randomly assigned to one of two treatment conditions

Case 1: Related Sample

A simple random sample of n related or paired observations is drawn. A characteristic of interest X is measured for each member of a pair. The data obtained may be presented in this form:

A similar data representation can be used when the observations are taken using the self-pairing method. One important assumption for this case is that the observed differences follow a normal distribution.

In general, the null hypothesis is in this form Ho: $\mu_D = D_0$ ($\mu_1 - \mu_2 = D_0$), where $D_0$ is the hypothesized population mean difference.

The test statistic is given by $t_c = \dfrac{\bar{d} - D_0}{s_d / \sqrt{n}}$, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the n observed differences, which follows the Student’s t distribution with (n-1) df.

Example
A study involved n=25 right-handed students who each turned two different knobs (right-hand thread and left-hand thread). The time it takes to move the knob indicator a fixed distance was measured for all individuals. It is of interest to assess if right-hand threads are easier to turn, on average. Use a 5% significance level.
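The two one-population examples (new hospital patients per week, and households with no children) can be followed through steps 1 to 5 with a short sketch. The function names are hypothetical. Both tests are treated as two-sided, which is an assumption for the hospital claim and follows from the word "changed" in the census example. The critical t value 2.045 (29 df, α = 5%, two-sided) is read from a standard t table, because the Python standard library has no Student's t distribution.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_t(xbar, mu0, s, n):
    """t_c = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / sqrt(n))


def one_proportion_z(successes, n, p0):
    """Z_c = (p_hat - p0) / sqrt(p0 * (1 - p0) / n) for large samples."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Hospital example: Ho: mu = 15 vs Ha: mu != 15 (two-sided is an assumption).
t_c = one_sample_t(xbar=16.1, mu0=15, s=4.7, n=30)
T_CRIT = 2.045  # t table value for 29 df, alpha = 0.05 two-sided
reject_mean = abs(t_c) > T_CRIT

# Census example: Ho: P = 0.48 vs Ha: P != 0.48, alpha = 0.10.
large_sample_check = 500 * 0.48 * (1 - 0.48)   # n * P0 * (1 - P0)
z_c = one_proportion_z(successes=220, n=500, p0=0.48)
z_crit = NormalDist().inv_cdf(1 - 0.10 / 2)
reject_prop = abs(z_c) > z_crit
p_value = 2 * NormalDist().cdf(-abs(z_c))

print(round(t_c, 3), reject_mean)
print(round(z_c, 3), round(z_crit, 3), reject_prop, round(p_value, 4))
```

With these inputs the t statistic is about 1.28, below 2.045, so Ho: µ = 15 is not rejected at the 5% level. The z statistic is about -1.79, beyond ±1.645, so Ho: P = 0.48 is rejected at the 10% level.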
The table shows the difference between the posttest and pretest among the components: body parts, that is eyes, ears, nose, etc. (PREBODY, POSTBODY); letters of the alphabet (PRELET, POSTLET); forms or various muppets (PREFORM, POSTFORM); friendships (PRERELATE, POSTREL); classification of animals (PRECLASF, POSTCLAS).

The mean and standard deviation for the PREBODY are 12.26 and 4.72 respectively, whereas those for the POSTBODY are 25.26 and 5.41. The mean difference between PREBODY and POSTBODY is 13.000. The test indicates that there is a significant difference between the pretest and posttest with a t-value of 39.721 and a p-value of 0.000 (p<0.05).

Pair 2 is a pretest and posttest between the alphabetic characters, namely PRELET and POSTLET. While the POSTLET has a mean of 26.70 and standard deviation of 13.27, the PRELET has a mean of 15.78 and standard deviation of 5.16. The mean difference between PRELET and POSTLET is 10.917. The test indicates that there is a significant difference between the pretest and posttest with a t-value of 16.255 and a p-value of 0.000 (p<0.05).

Case 2: Independent Sample

Independent simple random samples of size n1 and n2 are drawn. A characteristic of interest X is measured on each unit. The data obtained may be presented in the table below.

In addition to independence of samples, populations 1 and 2 are assumed to be normally distributed with unknown but assumed equal variances.

In general, the null hypothesis is in this form Ho: $\mu_D = D_0$ ($\mu_1 - \mu_2 = D_0$), where $D_0$ is the hypothesized population mean difference.

The test statistic is given by $t_c = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, which follows the Student’s t distribution with (n1+n2-2) df, where

$s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

Example
In Stroop’s Word Color Test, words that are color names are shown in colors different from the word. The task is to correctly identify the display color of each word. The time needed to complete the test for 16 individuals after they had consumed alcohol and for 16 other individuals after they had consumed a placebo drink, flavored to taste as if it contained alcohol, will be compared. Each group was balanced with 8 men and 8 women.
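The two test statistics for two population means can be sketched in Python as follows. The first is the paired statistic of Case 1, computed on the differences. The second is the pooled-variance statistic of Case 2. The function names and all data values are hypothetical and serve only to illustrate the arithmetic; the measurements behind the knob, Stroop and pretest/posttest examples are not reproduced in these notes.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(first, second, d0=0.0):
    """Case 1: t_c = (d_bar - D0) / (s_d / sqrt(n)) with n - 1 df."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t_c = (mean(diffs) - d0) / (stdev(diffs) / sqrt(n))
    return t_c, n - 1


def pooled_t(sample1, sample2, d0=0.0):
    """Case 2: pooled-variance t statistic with n1 + n2 - 2 df."""
    n1, n2 = len(sample1), len(sample2)
    # s_p^2 weights each sample variance by its degrees of freedom.
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    t_c = (mean(sample1) - mean(sample2) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_c, n1 + n2 - 2


# Hypothetical paired data: posttest and pretest scores of the same 4 units.
posttest = [12, 15, 10, 14]
pretest = [10, 12, 9, 11]
t_paired, df_paired = paired_t(posttest, pretest)

# Hypothetical independent samples from two groups.
group1 = [1, 2, 3]
group2 = [2, 4, 6]
t_pooled, df_pooled = pooled_t(group1, group2)

print(round(t_paired, 3), df_paired)
print(round(t_pooled, 3), df_pooled)
```

Each computed t is then compared with the critical value of the Student's t distribution for the stated degrees of freedom, as in steps 3 to 5 of the general procedure.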
The data in Table 2 illustrate the difference in POSTAVE between settings. POSTAVE's morning setting has a mean of 21.1915 and a standard deviation of 6.6261, while the evening setting has a mean of 20.0959 and a standard deviation of 6.1769. The mean difference between morning and evening is 1.096. A t-value of 1.303 and a p-value of 0.194 (p>0.05) for the test of difference indicate that there is no significant difference between the morning and evening settings.

MORE THAN TWO MEANS

*F-test or ANOVA

The data in Table 1 above display the test of POSTFORM score differences between age categories. The F-statistic of the test of difference is 28.120 and its p-value is 0.000 (p<0.05), indicating that there is a statistically significant difference in the POSTFORM scores for the various age groups.

With the exception of Adult-Teenager, all pairs of age categories have a significant mean difference according to the Scheffe post-hoc test (Table 2); the Adult-Teenager pair has a p-value of 0.208, which is more than the level of significance (0.05). The post hoc test also reveals that among the age groups, the Middle Age-Children pair had the greatest mean difference, while the Adult-Teenager pair had the lowest mean difference.

RELATIONSHIPS BETWEEN VARIABLES

CORRELATION - measures the strength and direction of linear relationship between two variables

CORRELATION COEFFICIENT, r

$r = \dfrac{s_{xy}}{s_x s_y}$, where $s_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{n - 1}$

OR

$r = \dfrac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right] \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}}$

The relationship between POSTBODY, POSTLET, POSTREL, POSTFORM, and POSTCLAS is shown in Table 3. The data show significant relationships between POSTBODY, POSTLET, POSTREL, POSTFORM, and POSTCLAS. A high positive correlation is indicated between POSTFORM and POSTCLAS, which have a correlation coefficient of 0.786.
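The two forms of the correlation coefficient give the same value, which a short Python sketch can confirm. The function names and the data are hypothetical; the POSTBODY to POSTCLAS scores behind Table 3 are not reproduced in these notes.

```python
from math import sqrt
from statistics import mean, stdev


def r_from_covariance(x, y):
    """r = s_xy / (s_x * s_y), with the sample covariance dividing by n - 1."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    return s_xy / (stdev(x) * stdev(y))


def r_computational(x, y):
    """r from raw sums: n, sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired observations on two variables.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r1 = r_from_covariance(x, y)
r2 = r_computational(x, y)
print(round(r1, 4), round(r2, 4))  # both forms agree
```

For these values the sums are Σx = 15, Σy = 20, Σxy = 66, Σx² = 55 and Σy² = 86, so r = 30 / √1500 ≈ 0.7746, a fairly strong positive linear relationship.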


