
Notes Stats Quiz 2

This document covers statistical concepts including measures of central tendency, variation, skewness, and kurtosis, along with their calculations and applications using software like Excel. It explains different measures such as mean, median, mode, variance, and standard deviation, and introduces the normal distribution and its properties. Additionally, it emphasizes ethical considerations in presenting statistical data.

Uploaded by

Airyn Francisco
Copyright
© All Rights Reserved

CA51018 - Statistics Analysis with Software Application

Module 3: Measures of Central Tendency, Position and Variability


Notes_by_ai

Summary Definition
Central tendency – the extent to which all the data values group around a typical or central value.
● Where are the data values concentrated? What seem to be typical or middle data values?
Variation – the amount of dispersion, or scattering, of values
● How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape – the pattern of the distribution of values from the lowest value to the highest value.
● Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

CENTRAL TENDENCY
● This is a statistical measure which describes where the center of a frequency distribution lies.
● The three measures commonly used are the mean, the mode and the median.
● Some variations of the mean are the arithmetic mean, geometric mean, weighted mean and the trimmed mean.

Five Measures of Central Tendency


Mean (arithmetic mean)
● Formula: $\bar{x} = \dfrac{\text{sum of the terms}}{\text{number of terms}}$
● Excel formula: =AVERAGE(Data)
● Pro: Familiar and uses all the sample information
● Con: Influenced by extreme values

Median
● Formula: Middle value in the sorted array
● Excel formula: =MEDIAN(Data)
● Pro: Robust when extreme data values exist
● Con: Ignores extremes and can be affected by gaps in data values

Mode
● Formula: Most frequently occurring data value
● Excel formula: =MODE(Data)
● Pro: Useful for attribute data or discrete data with a small range
● Con: May not be unique, and is not helpful for continuous data

Trimmed mean
● Formula: Same as the mean, except omit the highest and lowest k% of data values (e.g., 5%)
● Excel formula: =TRIMMEAN(Data, Percent), where Percent is the total fraction of data values to exclude, split equally between the two ends
● Pro: Mitigates effects of extreme values
● Con: Excludes some data values that could be relevant

Geometric mean
● Formula: $\sqrt[n]{x_1 x_2 \cdots x_n}$
● Excel formula: =GEOMEAN(Data)
● Pro: Useful for growth rates and mitigates high extremes
● Con: Less familiar and requires positive data

A. Arithmetic Mean = sum of values divided by the number of values


○ The most common measure of central tendency
○ Affected by extreme values (outliers)
○ The mean of a set of numerical data is unique
○ It is the only measure of central tendency where the sum of the deviations of each value from the mean will always be zero
○ It includes precise information from every score; hence, it is affected by a change in any score
○ The means of separate distributions can be combined to get the mean of the total distribution

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

a. Weighted Mean – Find the weighted mean of variable X by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights, where $w_1, w_2, \ldots, w_n$ are the weights and $x_1, x_2, \ldots, x_n$ are the values.

$$\bar{x} = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n}$$

B. Geometric Mean – used to measure the rate of change of a variable over time. It is useful for growth rates and mitigates high extremes. It is, however, less familiar, and it requires that the data be positive.
$$x_g = (x_1 \cdot x_2 \cdots x_n)^{1/n}$$
a. Geometric mean rate of return – measures the status of an investment over time as an average rate of return per period, where $R_i$ is the rate of return in time period i.
$$R_g = [(1 + R_1)(1 + R_2) \cdots (1 + R_n)]^{1/n} - 1$$
b. Growth rates – a variation on the geometric mean used to find the average growth rate for a time series.
$$GR = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$$
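The two geometric-mean formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures are made up for the example and are not part of the module.

```python
import math


def geometric_mean_return(returns):
    """R_g = [(1 + R_1)(1 + R_2)...(1 + R_n)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


def average_growth_rate(series):
    """GR = (x_n / x_1)^(1/(n - 1)) - 1 for a time series x_1..x_n."""
    n = len(series)
    return (series[-1] / series[0]) ** (1 / (n - 1)) - 1


# Hypothetical investment: +50% in year 1, then -50% in year 2.
returns = [0.50, -0.50]
rg = geometric_mean_return(returns)

# Hypothetical sales series growing 10% per period.
sales = [100, 110, 121]
gr = average_growth_rate(sales)

print(round(rg, 4), round(gr, 4))
```

The arithmetic mean of the two returns is 0, which hides the fact that the investment lost value; the geometric mean rate of return comes out negative, which is why it is preferred for rates of change.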
C. Median – the middle number (50% above and 50% below)
● It is not affected by extreme values
● Locating the median
○ The median of an ordered set of data is located at the $\frac{n+1}{2}$ ranked value. Note that $\frac{n+1}{2}$ is NOT the value of the median, only the position of the median in the ranked data.
○ If the number of values is odd, the median is the middle number
○ If the number of values is even, the median is the average of the two middle numbers

D. Mode – value that occurs most often


● It is not affected by extreme values
● It is used for categorical data, and for numerical data primarily when the data are grouped
● There may be no mode or there may be several modes

E. Trimmed Mean – to calculate the trimmed mean, first remove the highest and lowest k percent of the observations, then average the remaining values.
● To determine how many observations to trim, multiply k by n and round off the result
● Example: Let us say that k × n = 3.4, which rounds to 3. So, we would remove the three smallest and three largest
observations before averaging the remaining values.
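As a rough companion to the Excel formulas, the measures of central tendency described above can be computed with Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`). The helper names `weighted_mean` and `trimmed_mean` and the sample data are hypothetical. Note that Python's `round` sends exact halves to the nearest even number, which may differ from rounding by hand.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction of the values."""
    data = sorted(values)
    n = len(data)
    cut = round(k * n)  # observations to drop from EACH end
    return statistics.mean(data[cut:n - cut])


# Hypothetical sample with one extreme value (53).
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

mean = statistics.mean(data)        # pulled upward by the extreme value
median = statistics.median(data)    # average of the two middle values
mode = statistics.mode(data)        # most frequent value
trimmed = trimmed_mean(data, 0.10)  # drops 2 and 53 before averaging
geometric = statistics.geometric_mean(data)

# Hypothetical weights: the second score counts three times as much.
weighted = weighted_mean([80, 90], [1, 3])

print(mean, median, mode, trimmed, round(geometric, 3), weighted)
```

With this sample the mean is 10 while the median is 5.5, which shows how a single extreme value moves the mean but not the median.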

Locating Extreme Outliers: Z-score


● To compute the Z-score of a data value, subtract the mean and divide by the standard deviation.
● Z-score – the number of standard deviations a data value is from the mean (used to determine whether it is an outlier).
● Extreme outlier – A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
● The larger the absolute value of the Z-score, the farther the data value is from the mean

$$Z = \frac{X - \bar{X}}{S}$$

Where X represents the data value
$\bar{X}$ is the sample mean
S is the sample standard deviation
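A minimal sketch of the Z-score rule, assuming the sample standard deviation (n − 1 divisor) and the ±3.0 cutoff given above. The function names and the data are invented for the illustration.

```python
import statistics


def z_scores(values):
    """Z = (X - mean) / S for every value, using the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def extreme_outliers(values, threshold=3.0):
    """Values whose Z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sample of 20 observations with one suspicious value (95).
data = [12, 14, 15, 15, 16, 16, 17, 17, 18, 18,
        19, 19, 20, 20, 21, 21, 22, 23, 24, 95]

print(extreme_outliers(data))
```

One design point worth knowing: with the sample standard deviation, no Z-score can exceed (n − 1)/√n in absolute value, so the ±3.0 rule can never flag anything in samples of 10 or fewer observations.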
QUARTILE MEASUREMENT
Quantiles – values that divide the ranked data into equal-sized parts, giving an approximate summary or description of the data
A. Quartiles – describing data by four parts
B. Deciles – describing data by ten parts
C. Percentiles – describing data by 100 parts

Quartile Measures
● Quartiles split the ranked data into 4 segments with an equal number of values per segment
● The first quartile (Q1) is the value for which 25% of the observations are smaller and 75% are larger.
● Q2 is the same as the median (50% are smaller and 50% are larger)
● Only 25% of the values are greater than the third quartile (Q3)

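Quartile Measures Guidelines


First compute the position of the quartile in the ranked data (commonly $\frac{n+1}{4}$ for Q1 and $\frac{3(n+1)}{4}$ for Q3, in line with the $\frac{n+1}{2}$ position of the median), then apply these rules to the result:
1. Rule 1: If the result is a whole number, then the quartile is equal to that ranked value.
2. Rule 2: If the result is a fractional half (2.5, 3.5, etc.), then the quartile is equal to the average of the corresponding
ranked values.
3. Rule 3: If the result is neither a whole number nor a fractional half, round the result to the nearest integer
and select that ranked value.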
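The three rules can be sketched as a small Python function. The positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 used here are an assumption based on the common textbook convention; other software (including Excel) may use different quartile definitions and give slightly different answers. The function names and data are hypothetical.

```python
def ranked_value(sorted_data, position):
    """Apply the three quartile rules to a 1-based position in ranked data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        # Rule 1: whole number, take that ranked value.
        return sorted_data[whole - 1]
    if fraction == 0.5:
        # Rule 2: fractional half, average the two neighbouring ranked values.
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # Rule 3: otherwise round to the nearest integer position.
    return sorted_data[round(position) - 1]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the assumed (n + 1)-based positions."""
    data = sorted(values)
    n = len(data)
    q1 = ranked_value(data, (n + 1) / 4)
    q2 = ranked_value(data, (n + 1) / 2)
    q3 = ranked_value(data, 3 * (n + 1) / 4)
    return q1, q2, q3


# Hypothetical sample of nine values: positions are 2.5, 5 and 7.5.
sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartiles(sample))
```

For the nine-value sample, Q1 and Q3 fall under Rule 2 and Q2 under Rule 1; a ten-value sample would exercise Rule 3 for Q1 and Q3 (positions 2.75 and 8.25).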

DISPERSION AND VARIATION


Variation – the spread of data points about the center of the distribution in a sample. It is affected by extreme values.

Measures of Variation
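Range
● Formula: $x_{max} - x_{min}$
● Excel: =MAX(Data) - MIN(Data)
● Pro: Easy to calculate
● Con: Sensitive to extreme data values. It ignores the way in which data are distributed.

Variance ($S^2$)
● Formula (population): $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
● Formula (sample): $S^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
● Excel: =VAR(Data) (sample variance)
● Pro: Plays a key role in mathematical statistics
● Con: Non-intuitive meaning

Standard deviation (S)
● Formula (population): $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} = \sqrt{\sigma^2}$
● Formula (sample): $S = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{S^2}$
● Excel: =STDEV(Data) (sample standard deviation)
● Pro: Most common measure that uses same units as the raw data
● Con: Non-intuitive meaning

Coefficient of Variation (CV)
● Formula: $CV = 100 \times \dfrac{S}{\bar{x}}$
● Excel: None
● Pro: Measures relative variation in percent, so data sets can be compared
● Con: Requires non-negative data

Mean Absolute Deviation (MAD)
● Formula: $MAD = \dfrac{\sum |x - \bar{x}|}{n}$
● Excel: =AVEDEV(Data)
● Pro: Easy to understand
● Con: Lacks “nice” theoretical properties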
A. Range – simplest measure of variation. This is the difference between the largest and the smallest values.

$$\text{Range} = x_{max} - x_{min}$$

B. Variance
○ Population variance – the sum of squared deviations around the mean divided by the population size.

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

○ Sample variance – the sum of squared deviations around the sample mean divided by n − 1 (instead of n); otherwise the sample variance would tend to
underestimate the unknown population variance.
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

C. Standard deviation – the square root of the variance; it describes how individual values in a data set vary from
the mean. The units of measure are the same as X.
○ Population standard deviation

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\sigma^2}$$
○ Sample standard deviation
$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{S^2}$$

D. Coefficient of variation – useful for comparing variables measured in different units or with different means.
○ A unit-free measure of dispersion
○ Expressed as a percent of the mean
○ Only appropriate for nonnegative data. It is undefined if the mean is zero or negative

$$CV = 100 \times \frac{S}{\bar{x}}$$

E. Mean absolute deviation – reveals the average distance from an individual data point to the mean (center of
the distribution). It uses absolute values of the deviations around the mean.

$$MAD = \frac{\sum |x - \bar{x}|}{n}$$
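The measures of variation above can be checked with a short Python sketch using the standard `statistics` module. The data set is hypothetical, and the variable names are chosen only for this example.

```python
import statistics

# Hypothetical sample: six observations with mean 6.
data = [4, 8, 6, 5, 3, 10]
n = len(data)
mean = statistics.mean(data)

data_range = max(data) - min(data)               # x_max - x_min
pop_variance = statistics.pvariance(data)        # divides by N
sample_variance = statistics.variance(data)      # divides by n - 1
sample_sd = statistics.stdev(data)               # square root of sample variance
cv = 100 * sample_sd / mean                      # percent of the mean
mad = sum(abs(x - mean) for x in data) / n       # mean absolute deviation

print(data_range, round(pop_variance, 3), round(sample_variance, 3),
      round(sample_sd, 3), round(cv, 1), mad)
```

The sample variance (divisor n − 1) is always larger than the population formula applied to the same numbers (divisor N), which reflects the correction for underestimation described above.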

CENTRAL TENDENCY VS DISPERSION = the lower the coefficient of variation, the less the data vary relative to the mean, and the better the mean represents the data.

SKEWNESS AND KURTOSIS


A. Skewness – a unit-free statistic that describes the asymmetry of a distribution
○ The coefficient can be used to compare two samples measured in different units, or one sample with a known
reference distribution (e.g., the symmetrical normal distribution)
■ In Excel, use Data Analysis/Descriptive Statistics or the function “=SKEW(array)”
■ As “n” increases, the range of chance variation narrows
○ Skewness may be indicated by comparing the mean and median (mean > median usually suggests right skew; mean < median usually suggests left skew)
○ The larger the sample size, the narrower the range of skewness values that can arise by chance

B. Kurtosis – the relative length of the tails and the degree of concentration in the center.

○ Consider three kurtosis prototype shapes


■ Platykurtic – flatter than the normal curve (kurtosis < 0)
■ Mesokurtic – normal peak (kurtosis = 0)
■ Leptokurtic – more sharply peaked than the normal curve (kurtosis > 0)

○ A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ
○ As sample size increases, the range of kurtosis values that can arise by chance narrows
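For readers who want to see what a skewness or kurtosis coefficient actually computes, here is a hedged sketch of the bias-adjusted sample formulas. They are assumed here to correspond to what Excel's SKEW and KURT functions report (KURT gives excess kurtosis, so a normal shape scores about 0); the function names and data are made up for the example.

```python
import statistics


def skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(z^3). Needs n >= 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total


def excess_kurtosis(values):
    """Adjusted sample excess kurtosis (0 for a normal shape). Needs n >= 4."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * total - correction


symmetric = [1, 2, 3, 4, 5]      # evenly spread, no tail on either side
right_skewed = [1, 2, 3, 4, 20]  # one large value stretches the right tail

print(round(skewness(symmetric), 4), round(excess_kurtosis(symmetric), 4))
print(round(skewness(right_skewed), 4))
```

The evenly spread sample has skewness 0 and negative kurtosis (platykurtic), while the sample with one large value has positive skewness.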

GENERAL DESCRIPTIVE STATISTICS USING MICROSOFT EXCEL


1. Select data tab
2. Select data analysis
3. Select Descriptive Statistics and click OK.
4. Enter the cell range
5. Check the summary statistics box
6. Click OK
BIVARIATE DATA
Sample Covariance
● Sample covariance – measures the strength of the linear relationship between two numerical variables
○ The sample covariance:
$$cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
○ The covariance is only concerned with the strength of the relationship
○ No causal effect is implied

● Covariance between two random variables


○ In Excel, use the statistical function COVAR; a Covariance tool is also available in Data Analysis
■ Cov (X,Y) > 0 : X and Y tend to move in the same direction
■ Cov (X,Y) < 0 : X and Y tend to move in opposite directions
■ Cov (X,Y) = 0 : X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not prove independence)

The Correlation Coefficient


● Unit free
● Ranges between -1 and 1
○ The closer to -1, the stronger the negative linear relationship
○ The closer to 1, the stronger the positive linear relationship
○ The closer to 0, the weaker any linear relationship
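A small Python sketch of the sample covariance and the correlation coefficient, assuming the usual n − 1 divisor for the covariance and r = cov(X, Y) / (S_X · S_Y). The function names and both data sets are hypothetical.

```python
import statistics


def sample_covariance(xs, ys):
    """Sum of (X - mean_X)(Y - mean_Y) divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it is unit free."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical paired data that tend to rise together.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_xy = sample_covariance(x, y)
r_xy = correlation(x, y)

# Hypothetical data where Y depends entirely on X (Y = X squared),
# yet the covariance is zero because the relationship is not linear.
u = [-2, -1, 0, 1, 2]
v = [4, 1, 0, 1, 4]
cov_uv = sample_covariance(u, v)

print(cov_xy, round(r_xy, 4), cov_uv)
```

The second data set is a reminder that a covariance or correlation near zero only rules out a linear relationship, not every kind of relationship.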

THE CORRELATION COEFFICIENT USING MICROSOFT EXCEL


1. Select Data tab/Data Analysis
2. Choose correlation from the selection menu
3. Click OK
4. Enter input range
5. Check labels in first row if data had labels
6. Click OK to get output

ETHICAL CONSIDERATIONS
Numerical Descriptive measures:
1. Should document both good and bad results
2. Should be presented in a fair, objective, and neutral manner
3. Should not use inappropriate summary measures to distort facts
CA51018 - Statistics Analysis with Software Application
Module 3: Normal Distribution and Test of Normality
Notes_by_ai

Continuous Random Variable – a variable that can assume any value on a continuum (can assume an uncountable
number of values)
● Examples: Thickness of an item, time required to complete a task, temperature of a solution, and height.

Normal Distribution – the most common continuous distribution.


● Also known as the “Gaussian distribution” or the “bell curve”
● In this distribution, the probability that various values occur within certain ranges or intervals can be
calculated.

Properties of the Normal distribution


1. Bell shaped
2. Symmetrical
3. Mean, median, and mode are equal
4. Location is characterized by the mean, $\mu$
5. Spread is characterized by the standard deviation, $\sigma$
6. The random variable has an infinite theoretical range: $-\infty$ to $+\infty$

Shape of the Normal Distribution


● Changing the mean shifts the distribution left or right
● Changing the standard deviation increases or decreases the
spread

Standardized Normal Distribution – also known as the “Z” distribution


● Mean is always 0
○ Values above the mean have positive Z-values
○ Values below the mean have negative Z-values
● Standard deviation is always 1

Calculating Z values

$$Z = \frac{X - \mu}{\sigma}$$
Normal probabilities

Probability – measured by the area under the curve

● The total area under the curve is 1.0, and the curve is symmetric, so half is above the mean, half is below
Normal Probability Tables
● Row – shows the value of Z to the first decimal point
● Column – gives the value of Z to the second decimal point
● Value within – gives the probability from Z = (−∞ ) up to the desired Z value.
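Instead of reading a printed table, the same cumulative probabilities can be obtained in Python with `statistics.NormalDist` (available from Python 3.8). This is only a sketch; the mean of 100 and standard deviation of 15 are hypothetical values chosen for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Area from minus infinity up to Z = 1.00, as a normal table would list it.
p_below_z1 = standard_normal.cdf(1.00)

# Hypothetical variable X with mean 100 and standard deviation 15.
mu, sigma = 100, 15
x = 115
z = (x - mu) / sigma                     # Z = (X - mu) / sigma
p_below_x = NormalDist(mu, sigma).cdf(x)

# Area between two Z values is the difference of two cumulative areas.
p_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(round(p_below_z1, 4), z, round(p_below_x, 4), round(p_within_one_sd, 4))
```

Because X = 115 is exactly one standard deviation above the mean, its cumulative probability matches the table entry for Z = 1.00.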

Solution using Excel (finding X)


1. First, find the Z value for a given cumulative area under the Standard Normal Distribution (SND). In a blank cell type:
“=NORMSINV(probability)”, where probability is the area to the left of Z
2. Then, calculate the x value using: $x = z\sigma + \mu$
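The two steps for finding X can be mirrored in Python, where `NormalDist().inv_cdf` plays the role of the inverse standard normal lookup (it takes a cumulative area and returns Z). The function name, the area of 0.975, and the mean and standard deviation are assumptions made for the illustration.

```python
from statistics import NormalDist


def x_for_cumulative_area(area, mu, sigma):
    """Step 1: area to the left -> Z. Step 2: x = z * sigma + mu."""
    z = NormalDist().inv_cdf(area)
    return z * sigma + mu


# Hypothetical question: below which value do 97.5% of observations fall
# when the mean is 100 and the standard deviation is 15?
z_value = NormalDist().inv_cdf(0.975)
x_value = x_for_cumulative_area(0.975, mu=100, sigma=15)

print(round(z_value, 2), round(x_value, 1))
```

An area of exactly 0.5 returns the mean itself, since half of the area under the curve lies below the mean.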

Assessing Normality
● It is important to evaluate how well the data set is approximated by a normal distribution.
● Normally distributed data should approximate the theoretical normal distribution:
○ The normal distribution is bell shaped (symmetrical) where the mean is equal to the median.
○ The empirical rule applies to the normal distribution.
○ The interquartile range of a normal distribution is approximately 1.33 standard deviations.

The Empirical Rule as applied to the Normal Distribution


● This rule states that for symmetrical bell-shaped data sets, roughly two out of every three
observations are contained within a distance of 1 standard deviation around the mean, and roughly 95% are contained within a distance of 2 standard deviations around the mean.

Normal Probability Plot – a normal probability plot for data from a normal distribution will be approximately linear.
● Non-linear plots indicate a deviation from normality.

Exploratory Data Analysis


The Five Number Summary
1. Minimum
2. First Quartile (Q1)
3. Median (Q2)
4. Third Quartile (Q3)
5. Maximum

The Box-and-Whisker Plot – a graphical display of the five number summary


● The box and central line are centered between the endpoints if data are symmetric around the median.
● A box-and-whisker plot can be shown in either vertical or horizontal
format

Other ways of assessing normality of data include:


1. Checking for skewness with the Pearson coefficient (PC) of skewness:

$$PC = \frac{3(\bar{X} - \text{median})}{s}$$

Nota bene: The data is considered significantly skewed when PC is greater than or equal to +1 or less than or
equal to -1

2. Checking for outliers – an outlier is a data value that lies more than 1.5 × IQR below Quartile 1 or more than
1.5 × IQR above Quartile 3, where IQR = Q3 − Q1 is the interquartile range.
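Both checks can be sketched in Python. One assumption to flag: the quartiles here come from `statistics.quantiles`, whose default method interpolates between ranked values, so Q1 and Q3 may differ slightly from quartiles found with rounding rules or with Excel. The function names and the sample are hypothetical.

```python
import statistics


def pearson_skewness(values):
    """PC = 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Hypothetical sample with one unusually large value (45).
sample = [11, 12, 13, 16, 16, 17, 18, 21, 22, 45]

pc = pearson_skewness(sample)
outliers = iqr_outliers(sample)

print(round(pc, 2), outliers)
```

In this sample the coefficient is positive but below +1, so the data would not be called significantly skewed under the rule above, even though the IQR check flags 45 as an outlier.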
CA51018 - Statistics Analysis with Software Application
Module 4: Principles of Constructing Research Instruments
Notes_by_ai

Basic Steps of Survey Research


Step 1: State the goals of the research.
Step 2: Develop the budget (time, money, staff).
Step 3: Create a research design (target population, frame, sample size).
Step 4: Choose a survey type and method of administration.
Step 5: Design a data collection instrument (questionnaire).
Step 6: Pretest the survey instrument and revise as needed.
Step 7: Administer the survey (follow up if needed).
Step 8: Code the data and analyze it.

Survey Types
a. Mail – You need a well-targeted and current mailing list (people move a lot). Low response rates are typical
and nonresponse bias is expected (nonrespondents differ from those who respond). Zip code lists (often costly)
are an attractive option to define strata of similar income, education, and attitudes. To encourage participation,
a cover letter should clearly explain the uses to which the data will be put. Plan for follow-up mailings.
b. Telephone – Random dialing yields very low response and is poorly targeted. Purchased phone lists help reach
the target population, though a low response rate still is typical (disconnected phones, caller screening,
answering machines, work hours, no-call lists). Other sources of nonresponse bias include the growing number
of non-English speakers and distrust caused by scams and spams.
c. Interviews – Interviewing is expensive and time consuming, yet trading a smaller sample size for high-quality
results may still be worth it. Interviews must be carefully handled, so interviewers must be well-trained
– an added cost. But you can obtain information on complex or sensitive topics (e.g., gender discrimination in
companies, birth control practices, diet and exercise habits).
d. Web – Web surveys are growing in popularity, but are subject to nonresponse bias because those who participate
may differ from those who feel too busy, don’t own computers or distrust your motives (scams and spam are
again to blame). This type of survey works best when targeted to a well-defined interest group on a question of
self-interest (e.g., views of CPAs on new proposed accounting rules, frequent flyer views on airline security).
e. Direct Observation – This can be done in a controlled setting (e.g., psychology lab) but requires informed
consent, which can change behavior. Unobtrusive observation is possible in some non-lab settings (e.g., what
percentage of airline passengers carry on more than two bags, what percentage of SUVs carry no passengers,
what percentage of drivers wear seat belts).

Survey Guidelines
1. Planning – What is the purpose of the survey? Consider staff expertise, needed skills, degree of precision,
budget.
2. Design – Invest time and money in designing the survey. Use books and references to avoid unnecessary errors
3. Quality – Take care in preparing a quality survey so that people will take you seriously.
4. Pilot Test – Pretest on friends or co-workers to make sure the survey is clear.
a. Adopt vs adapt technique
i. Adopt – use a standardized research instrument as is. No need to pilot test, but this needs to be reported
ii. Adapt – modify an existing instrument, which raises the issue of reliability; this leads to pilot testing
5. Buy-in – Improve response rates by stating the purpose of the survey, offering a token of appreciation or paving
the way with endorsements.
6. Expertise – Work with a consultant early on.
Questionnaire Design
[KISS – keep it short and simple]
1. Use a lot of white space in layout
2. Begin with short, clear instructions
3. State the survey purpose
4. Assure anonymity
5. Instruct on how to submit the completed survey.
6. Break survey into naturally occurring sections
7. Let respondents bypass sections that are not applicable (e.g., “if you answer no to #7, skip to 5”)
8. Pretest and revise as needed
9. Keep as short as possible

Type of Questions in the Questionnaire Design


1. Ranked Choices – “please evaluate your dining experience”
○ Excellent or good or fair or poor
2. Pictogram – enables one to select an answer based on an illustration (emotions, etc.)
3. Likert scale – strongly agree, slightly agree, neither, slightly disagree, strongly disagree

Question Wording
● The way a question is asked has a profound influence on the responses. For example,
○ Shall taxes be cut?
○ Shall taxes be cut, if it means reducing highway maintenance?
○ Shall state taxes be cut, if it means firing teachers and police?
● Make sure you have covered all the possibilities.
○ For example, “Are you married? ❑ Yes ❑ No” leaves out possibilities such as widowed, divorced, or separated.
● Overlapping classes or unclear categories are a problem.
○ For example, How old is your father? ❑ 35 – 45 ❑ 45 – 55 ❑ 55 – 65 ❑ 65 or older

Coding and Data Screening


● Responses are usually coded numerically (e.g., 1 = male 2 = female).
● Missing values are typically denoted by special characters (e.g., blank, “.” or “*”).
● Discard questionnaires that are flawed or missing many responses.
● Watch for multiple responses, outrageous or inconsistent replies or range answers.
● Follow-up if necessary and always document your data-coding decisions.

Data File Format – Enter data into a spreadsheet or database as a “flat file” (n subjects x m variables matrix).

Advice on Copying Data


● Using commas (,), dollar signs ($), or percent (%) as part of the values may result in your data being treated as
text values.
● A numerical variable may only contain the digits 0-9, a decimal point, and a minus sign.
● To avoid round-off errors, format the data column as plain numbers with the desired number of decimal places
before you copy the data to a statistical package.
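To illustrate why commas, dollar signs, and percent signs cause trouble, here is a minimal Python sketch that turns such text values back into plain numbers. The function name `to_number` is hypothetical, and converting a percent to a fraction (12% becomes 0.12) is a design choice for this example, not a rule from the notes.

```python
def to_number(text):
    """Convert a formatted text value such as '$1,234.50' or '12%' to a float."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    if cleaned.endswith("%"):
        # Treat a percent as a fraction: '12%' -> 0.12 (an assumption).
        return float(cleaned[:-1]) / 100
    return float(cleaned)


# Hypothetical column copied from a spreadsheet as text.
raw_column = ["$1,234.50", "12%", "-3.2", " 7 "]
numbers = [to_number(value) for value in raw_column]

print(numbers)
```

Values that still contain other characters raise a `ValueError`, which is a useful signal that the cell needs to be inspected by hand rather than silently guessed.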
