Notes Stats Quiz 2
Summary Definition
Central tendency – the extent to which all the data values group around a typical or central value.
● Where are the data values concentrated? What seem to be typical or middle data values?
Variation – the amount of dispersion, or scattering, of values
● How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape – the pattern of the distribution of values from the lowest value to the highest value.
● Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
CENTRAL TENDENCY
● This is a statistical measure which describes where the center of a frequency distribution lies.
● The three measures commonly used are the mean, the mode and the median.
● Some variations of the mean are the arithmetic mean, geometric mean, weighted mean and the trimmed mean.
Statistic | Formula | Excel | Pro | Con
Mean (arithmetic mean) | $\bar{x} = \frac{\text{sum of the terms}}{\text{number of terms}}$ | =AVERAGE(Data) | Familiar and uses all the sample information | Influenced by extreme values
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist | Ignores extremes and can be affected by gaps in data values
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range | May not be unique, and is not helpful for continuous data
Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data, Percent) | Mitigates effects of extreme values | Excludes some data values that could be relevant
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
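As a quick illustration of the three common measures, here is a minimal Python sketch using the standard `statistics` module. The data values are made up for this example and are not from the notes; the extreme value 40 is included to show how the mean reacts while the median and mode do not.

```python
import statistics

# Hypothetical sample, chosen only for illustration (40 is an extreme value).
data = [2, 3, 3, 5, 7, 40]

mean_value = statistics.mean(data)      # sum of the terms / number of terms
median_value = statistics.median(data)  # middle of the sorted values
mode_value = statistics.mode(data)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

The mean is pulled toward the extreme value, while the median stays near the bulk of the data.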
a. Weighted Mean – Find the weighted mean of variable X by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights. Here w1, w2, ..., wn are the weights and x1, x2, ..., xn are the values.
$\bar{x} = \frac{x_1 w_1 + x_2 w_2 + \dots + x_n w_n}{w_1 + w_2 + \dots + w_n}$
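The weighted mean formula above can be sketched in a few lines of Python. The function name `weighted_mean` and the scores and weights are hypothetical, chosen only to show the arithmetic.

```python
def weighted_mean(values, weights):
    """Sum of value*weight products divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

# Hypothetical example: three scores weighted 5, 3 and 2.
scores = [80, 90, 70]
weights = [5, 3, 2]
result = weighted_mean(scores, weights)  # (400 + 270 + 140) / 10
print(result)
```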
B. Geometric Mean – used to measure the rate of change of a variable over time. It is useful for growth rates and mitigates the effect of high extremes. It is, however, less familiar, and it requires that all data values be positive.
$x_g = (x_1 \times x_2 \times \dots \times x_n)^{1/n}$
a. Geometric mean rate of return – measures the status of an investment over time as an average return per period, where Ri is the rate of return in time period i.
$R_g = [(1 + R_1) \times (1 + R_2) \times \dots \times (1 + R_n)]^{1/n} - 1$
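Both geometric formulas can be tried out with a short Python sketch. The function names and the sample returns are assumptions made for illustration. The example uses a gain of 50% followed by a loss of 50%: the arithmetic mean return is 0, yet the investment has actually lost value, which the geometric mean rate of return reflects.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive data")
    return math.prod(values) ** (1 / len(values))

def geometric_mean_return(returns):
    """Average per-period return from a list of period returns R_i."""
    if not returns:
        raise ValueError("at least one return is required")
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

# Hypothetical returns: +50% in period 1, then -50% in period 2.
returns = [0.50, -0.50]
rg = geometric_mean_return(returns)
print(geometric_mean([2, 8]))  # square root of 16
print(rg)                      # negative: the investment shrank overall
```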
b. Growth rates – a variation on the geometric mean used to find the average growth rate for a time series.
$GR = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
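A minimal sketch of the growth rate formula, assuming a time series stored as a Python list with the oldest value first. The function name `average_growth_rate` and the revenue figures are hypothetical.

```python
def average_growth_rate(series):
    """(x_n / x_1) ** (1 / (n - 1)) - 1 for a series of n positive values."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    if series[0] <= 0 or series[-1] <= 0:
        raise ValueError("first and last values must be positive")
    return (series[-1] / series[0]) ** (1 / (n - 1)) - 1

# Hypothetical revenue over three years: two growth periods of 10% each.
revenue = [100, 110, 121]
gr = average_growth_rate(revenue)
print(round(gr, 4))
```

Only the first and last values enter the formula; n observations span n − 1 growth periods, which is why the root is of order n − 1.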
C. Median – the middle number (50% above and 50% below)
● It is not affected by extreme values
● Locating the median
○ The median of an ordered set of data is located at the $\frac{n+1}{2}$ ranked value. Note that $\frac{n+1}{2}$ is NOT the value of the median, only the position of the median in the ranked data.
○ If the number of values is odd, the median is the middle number
○ If the number of values is even, the median is the average of the two middle numbers
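The steps for locating the median can be sketched as follows. The function name `median_with_position` is hypothetical; it returns both the 1-based position (n + 1)/2 and the median itself, to stress that the two are different things.

```python
def median_with_position(values):
    """Return (position, median), where position = (n + 1) / 2 is 1-based."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: the position is a whole number and points at the middle value.
        median = ordered[int(position) - 1]
    else:
        # Even n: the position falls between two ranks; average those values.
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return position, median

print(median_with_position([7, 1, 5]))     # odd number of values
print(median_with_position([4, 1, 3, 2]))  # even number of values
```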
E. Trimmed Mean – to calculate the trimmed mean, first remove the highest and lowest k percent of the observations, then average the rest.
● To determine how many observations to trim, multiply k by n and round off the result.
● Example: Let us say that k × n = 3.4, which rounds to 3. So, we would remove the three smallest and three largest observations before averaging the remaining values.
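The trimming procedure described above can be sketched in Python. The function name `trimmed_mean` and the data are hypothetical, and k is given as a fraction (0.10 for 10%) removed from each end.

```python
def trimmed_mean(values, k):
    """Mean after dropping round(k * n) observations from each end."""
    ordered = sorted(values)
    n = len(ordered)
    trim = round(k * n)  # number of observations removed per end
    if 2 * trim >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[trim:n - trim]
    return sum(kept) / len(kept)

# Hypothetical data with one low and one high extreme value.
data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 100]
tm = trimmed_mean(data, 0.10)  # k * n = 1, so drop 1 value from each end
print(tm)
```

Two caveats: Python's `round` rounds exact halves to the nearest even integer, and Excel's TRIMMEAN takes the total fraction to exclude (split between both ends), so its percent argument is not the same as k here.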
$Z = \frac{X - \bar{x}}{S}$
Quartile Measures
● Quartiles split the ranked data into 4 segments with an equal number of values per segment
● The first quartile (Q1) is the value for which 25% of the observations are smaller and 75% are larger.
● Q2 is the same as the median (50% are smaller and 50% are larger)
● Only 25% of the values are greater than the third quartile (Q3)
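A small sketch of the quartile definitions using `statistics.quantiles` from the Python standard library. The data are hypothetical. Note that software packages use slightly different conventions for interpolating quartiles, so results from Excel or other tools may differ a little from the default ("exclusive") method used here.

```python
import statistics

# Hypothetical ranked data: the integers 1 through 11.
data = list(range(1, 12))

# n=4 asks for the three cut points that split the data into 4 segments.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)
print(q2 == statistics.median(data))  # Q2 is the median
```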
Measures of Variation
Range: $x_{\max} - x_{\min}$
B. Variance
○ Population variance – the sum of squared deviations around the mean divided by the population size.
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
○ Sample variance – we divide by n-1 (instead of n), otherwise sample variance would tend to
underestimate the unknown population variance.
$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
C. Standard deviation – the square root of the variance that explains how individual values in a data set vary from
the mean. The units of measure are the same as X.
○ Population standard deviation
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\sigma^2}$
○ Sample standard deviation
$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{S^2}$
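The difference between the population and sample versions of the variance and standard deviation can be seen with the standard `statistics` module. The data are hypothetical; the same eight values are treated first as a whole population (divide by N) and then as a sample (divide by n − 1).

```python
import statistics

# Hypothetical data: mean is 5, sum of squared deviations is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)  # 32 / 8
pop_sd = statistics.pstdev(data)      # square root of the population variance
samp_var = statistics.variance(data)  # 32 / 7
samp_sd = statistics.stdev(data)      # square root of the sample variance

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

The sample figures are slightly larger because the divisor n − 1 is smaller than n.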
D. Coefficient of variation – useful for comparing variables measured in different units or with different means.
○ A unit-free measure of dispersion
○ Expressed as a percent of the mean
○ Only appropriate for nonnegative data. It is undefined if the mean is zero or negative
$CV = 100 \times \left(\frac{S}{\bar{x}}\right)$
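A short sketch of the coefficient of variation, comparing two variables measured in different units. The function name and the height and weight figures are hypothetical. Both samples have the same standard deviation (10), but relative to their means the weights are far more dispersed.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * statistics.stdev(values) / mean

# Hypothetical measurements in different units.
heights_cm = [160, 170, 180]
weights_kg = [50, 60, 70]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 2), round(cv_weight, 2))
```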
E. Mean absolute deviation – reveals the average distance from an individual data point to the mean (center of
the distribution). It uses absolute values of the deviations around the mean.
$MD = \frac{\sum |x - \bar{x}|}{n}$
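The mean absolute deviation is simple enough to compute by hand in Python. The function name and data below are hypothetical.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    if not values:
        raise ValueError("need at least one value")
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical data with mean 5; absolute deviations sum to 12.
data = [2, 4, 4, 4, 5, 5, 7, 9]
md = mean_absolute_deviation(data)
print(md)
```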
CENTRAL TENDENCY VS DISPERSION = the lower the coefficient of variation, the better (less dispersion relative to the mean).
B. Kurtosis – the relative length of the tails and the degree of concentration in the center.
○ A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ
○ As sample size increases, the range of kurtosis values expected by chance narrows
ETHICAL CONSIDERATIONS
Numerical Descriptive measures:
1. Should document both good and bad results
2. Should be presented in a fair, objective, and neutral manner
3. Should not use inappropriate summary measures to distort facts
CA51018 - Statistics Analysis with Software Application
Module 3: Normal Distribution and Test of Normality
Notes_by_ai
Continuous Random Variable – a variable that can assume any value on a continuum (can assume an uncountable
number of values)
● Examples: Thickness of an item, time required to complete a task, temperature of a solution, and height.
Calculating Z values
$Z = \frac{X - \mu}{\sigma}$
Normal probabilities
● The total area under the curve is 1.0, and the curve is symmetric, so half is above the mean, half is below
Normal Probability Tables
● Row – shows the value of Z to the first decimal point
● Column – gives the value of Z to the second decimal point
● Value within – gives the probability from Z = (−∞ ) up to the desired Z value.
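The Z calculation and the table lookup can be reproduced with `statistics.NormalDist` from the Python standard library, whose `cdf` method gives the same cumulative probability (from −∞ up to Z) that the table lists. The mean, standard deviation, and X value below are hypothetical.

```python
from statistics import NormalDist

# Hypothetical population parameters and an observed value.
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma                         # standardize X
p_below = NormalDist().cdf(z)                # area from -infinity up to z
p_direct = NormalDist(mu, sigma).cdf(x)      # same area without standardizing
p_above = 1 - p_below                        # total area under the curve is 1

print(z)
print(round(p_below, 4), round(p_above, 4))
```

In a printed table, this lookup corresponds to row 2.0 and column .00.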
Assessing Normality
● It is important to evaluate how well the data set is approximated by a normal distribution.
● Normally distributed data should approximate the theoretical normal distribution:
○ The normal distribution is bell shaped (symmetrical) where the mean is equal to the median.
○ The empirical rule applies to the normal distribution.
○ The interquartile range of a normal distribution is approximately 1.33 standard deviations (more precisely, about 1.35).
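The properties listed above can be checked numerically for the standard normal distribution. This is a sketch using `statistics.NormalDist`; the variable names are arbitrary. It computes the share of values within 1, 2, and 3 standard deviations of the mean (the empirical rule) and the interquartile range in units of the standard deviation.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Empirical rule: probability of falling within k standard deviations.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr_in_sigmas = q3 - q1

for k, p in within.items():
    print(k, round(p, 4))
print(round(iqr_in_sigmas, 3))  # about 1.35, close to the 1.33 quoted above
```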
Normal Probability Plot – a normal probability plot for data from a normal distribution will be approximately linear.
● Non-linear plots indicate a deviation from normality.
$PC = \frac{3(\bar{X} - \text{median})}{s}$
Nota bene: The data is considered significantly skewed when PC is greater than or equal to +1 or less than or
equal to -1
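A minimal sketch of the PC calculation and the ±1 rule of thumb stated above. The function name and data are hypothetical; the sample standard deviation is used for s.

```python
import statistics

def pearson_skewness(values):
    """PC = 3 * (mean - median) / s, with s the sample standard deviation."""
    s = statistics.stdev(values)
    if s == 0:
        raise ValueError("PC is undefined when all values are equal")
    return 3 * (statistics.mean(values) - statistics.median(values)) / s

# Hypothetical right-skewed data: one large value pulls the mean above the median.
data = [2, 3, 3, 5, 7, 40]
pc = pearson_skewness(data)
significantly_skewed = pc >= 1 or pc <= -1

print(round(pc, 3), significantly_skewed)
```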
2. Checking for outliers – an outlier is a data value that lies more than 1.5 (IQR) units below Quartile 1 or
1.5(IQR) units above Quartile 3.
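The outlier rule can be sketched as follows. The function name `find_outliers` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, so the fences may differ slightly from those produced by other software.

```python
import statistics

def find_outliers(values):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return outliers, (lower_fence, upper_fence)

# Hypothetical data: ten ordinary values and one suspiciously large one.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
outliers, fences = find_outliers(data)
print(outliers, fences)
```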
Module 4: Principles of Constructing Research Instruments
Survey Types
a. Mail – You need a well-targeted and current mailing list (people move a lot). Low response rates are typical
and nonresponse bias is expected (nonrespondents differ from those who respond). Zip code lists (often costly)
are an attractive option to define strata of similar income, education, and attitudes. To encourage participation,
a cover letter should clearly explain the uses to which the data will be put. Plan for follow-up mailings.
b. Telephone – Random dialing yields very low response and is poorly targeted. Purchased phone lists help reach
the target population, though a low response rate still is typical (disconnected phones, caller screening,
answering machines, work hours, no-call lists). Other sources of nonresponse bias include the growing number
of non-English speakers and distrust caused by scams and spam.
c. Interviews – Interviewing is expensive and time consuming, yet trading sample size for high-quality
results may still be worth it. Interviews must be handled carefully, so interviewers must be well-trained,
which is an added cost. But you can obtain information on complex or sensitive topics (e.g., gender discrimination in
companies, birth control practices, diet and exercise habits).
d. Web – Web surveys are growing in popularity, but are subject to nonresponse bias because those who participate
may differ from those who feel too busy, don’t own computers or distrust your motives (scams and spam are
again to blame). This type of survey works best when targeted to a well-defined interest group on a question of
self-interest (e.g., views of CPAs on new proposed accounting rules, frequent flyer views on airline security).
e. Direct Observation – This can be done in a controlled setting (e.g., psychology lab) but requires informed
consent, which can change behavior. Unobtrusive observation is possible in some non-lab settings (e.g., what
percentage of airline passengers carry on more than two bags, what percentage of SUVs carry no passengers,
what percentage of drivers wear seat belts).
Survey Guidelines
1. Planning – What is the purpose of the survey? Consider staff expertise, needed skills, degree of precision,
budget.
2. Design – Invest time and money in designing the survey. Use books and references to avoid unnecessary errors.
3. Quality – Take care in preparing a quality survey so that people will take you seriously.
4. Pilot Test – Pretest on friends or co-workers to make sure the survey is clear.
a. Adapt vs adopt technique
i. Adopt – uses a standardized research instrument as is. No need to pilot test, but this needs to be reported
ii. Adapt – modifies an existing instrument, which raises the issue of reliability; this leads to pilot testing
5. Buy-in – Improve response rates by stating the purpose of the survey, offering a token of appreciation or paving
the way with endorsements.
6. Expertise – Work with a consultant early on.
Questionnaire Design
[KISS – keep it short and simple]
1. Use a lot of white space in layout
2. Begin with short, clear instructions
3. State the survey purpose
4. Assure anonymity
5. Instruct on how to submit the completed survey.
6. Break survey into naturally occurring sections
7. Let respondents bypass sections that are not applicable (e.g., “if you answer no to #7, skip to 5”)
8. Pretest and revise as needed
9. Keep as short as possible
Question Wording
● The way a question is asked has a profound influence on the response. For example,
○ Shall taxes be cut?
○ Shall taxes be cut, if it means reducing highway maintenance?
○ Shall state taxes be cut, if it means firing teachers and police?
● Make sure you have covered all the possibilities.
○ For example, “Are you married? ❑ Yes ❑ No” leaves no clear choice for respondents who are divorced, widowed, or separated.
● Overlapping classes or unclear categories are a problem.
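○ For example, “How old is your father? ❑ 35 – 45 ❑ 45 – 55 ❑ 55 – 65 ❑ 65 or older” lets ages 45, 55, and 65 each fall into two categories, and offers no category for a father younger than 35.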
Data File Format – Enter data into a spreadsheet or database as a “flat file” (n subjects x m variables matrix).
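To make the flat-file idea concrete, here is a small sketch that reads a tiny comma-separated file with Python's `csv` module. The column names and responses are invented for illustration. Each row is one subject and each column is one variable, giving an n × m layout.

```python
import csv
import io

# Hypothetical survey responses: one row per subject, one column per variable.
flat_file = """age,married,satisfaction
34,yes,4
51,no,5
28,no,3
"""

rows = list(csv.DictReader(io.StringIO(flat_file)))
n_subjects = len(rows)      # number of data rows
m_variables = len(rows[0])  # number of columns

print(n_subjects, "subjects x", m_variables, "variables")
print(rows[0])
```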