Statistics Notes
Definitions
Statistics: Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
Variable: Characteristic or attribute that can assume different values.
Random Variable: A variable whose values are determined by chance.
Population: All subjects possessing a common characteristic that is being studied.
Sample: A subgroup or subset of the population.
Parameter: Characteristic or measure obtained from a population.
Statistic (not to be confused with Statistics): Characteristic or measure obtained from a sample.
Descriptive Statistics: Collection, organization, summarization, and presentation of data.
Inferential Statistics: Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining relationships between variables, and making predictions.
Qualitative Variables: Variables which assume non-numerical values.
Quantitative Variables: Variables which assume numerical values.
Discrete Variables: Variables which assume a finite or countable number of possible values. Usually obtained by counting.
Continuous Variables: Variables which assume an infinite number of possible values. Usually obtained by measurement.
Nominal Level: Level of measurement which classifies data into mutually exclusive, all-inclusive categories in which no order or ranking can be imposed on the data.
Ordinal Level: Level of measurement which classifies data into categories that can be ranked. Differences between the ranks do not exist.
Interval Level: Level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.
Ratio Level: Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure.
Random Sampling: Sampling in which the data is collected using chance methods or random numbers.
Systematic Sampling: Sampling in which data is obtained by selecting every kth object.
Convenience Sampling: Sampling in which data which is readily available is used.
Stratified Sampling: Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.
Cluster Sampling: Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.
Population vs Sample
The population includes all objects of interest, whereas the sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (mu, sigma), while statistics are usually denoted using Roman letters (x-bar, s). There are several reasons why we don't work with populations: they are usually large, and it is often impossible to get data for every object we're studying. Sampling does not usually occur without cost, and the more items surveyed, the larger the cost. We compute statistics and use them to estimate parameters. The computation is the first part of the course (Descriptive Statistics) and the estimation is the second part (Inferential Statistics).
Discrete vs Continuous
Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room. Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. Since continuous variables are real numbers, we usually round them. This implies a boundary depending on the number of decimal places. For example: 64 is really anything 63.5 <= x < 64.5. Likewise, if there are two decimal places, then 64.03 is really anything 64.025 <= x < 64.035. Boundaries always have one more decimal place than the data and end in a 5.
Levels of Measurement
There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. Data is classified according to the highest level into which it fits. Each additional level adds something the previous level didn't have. Nominal is the lowest level; only names are meaningful here. Ordinal adds an order to the names. Interval adds meaningful differences. Ratio adds a zero so that ratios are meaningful.
Types of Sampling
There are five types of sampling: Random, Systematic, Convenience, Cluster, and Stratified.
Random sampling is analogous to putting everyone's name into a hat and drawing out several names. Each element in the population has an equal chance of occurring. While this is the preferred way of sampling, it is often difficult to do. It requires that a complete list of every element in the population be obtained. Computer-generated lists are often used with random sampling. You can generate random numbers using the TI-82 calculator.
Systematic sampling is easier to do than random sampling. In systematic sampling, the list of elements is "counted off". That is, every kth element is taken. This is similar to lining everyone up and numbering off "1, 2, 3, 4; 1, 2, 3, 4; etc." When done numbering, all people numbered 4 would be used.
Convenience sampling is very easy to do, but it's probably the worst technique to use. In convenience sampling, readily available data is used; that is, the first people the surveyor runs into.
Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and each element in the selected clusters is used.
Stratified sampling also divides the population into groups, called strata. However, this time it is by some characteristic, not geographically. For instance, the population might be separated into males and females. A sample is taken from each of these strata using either random, systematic, or convenience sampling.
Frequency: The number of times a certain value or class of values occurs.
Frequency Distribution: The organization of raw data in table form with classes and frequencies.
Categorical Frequency Distribution: A frequency distribution in which the data is only nominal or ordinal.
Ungrouped Frequency Distribution: A frequency distribution of numerical data in which the raw data is not grouped.
Grouped Frequency Distribution: A frequency distribution where several numbers are grouped into one class.
Class Limits: Separate one class in a grouped frequency distribution from another. The limits could actually appear in the data and have gaps between the upper limit of one class and the lower limit of the next.
Class Boundaries: Separate one class in a grouped frequency distribution from another. The boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is no gap between the upper boundary of one class and the lower boundary of the next class. The lower class boundary is found by subtracting 0.5 units from the lower class limit, and the upper class boundary is found by adding 0.5 units to the upper class limit.
Class Width: The difference between the upper and lower boundaries of any class. The class width is also the difference between the lower limits of two consecutive classes or the upper limits of two consecutive classes. It is not the difference between the upper and lower limits of the same class.
Class Mark (Midpoint): The number in the middle of the class. It is found by adding the upper and lower limits and dividing by two. It can also be found by adding the upper and lower boundaries and dividing by two.
Cumulative Frequency: The number of values less than the upper class boundary for the current class. This is a running total of the frequencies.
Relative Frequency: The frequency divided by the total frequency. This gives the percent of values falling in that class.
Cumulative Relative Frequency (Relative Cumulative Frequency): The running total of the relative frequencies, or the cumulative frequency divided by the total frequency. Gives the percent of the values which are less than the upper class boundary.
Histogram: A graph which displays the data by using vertical bars of various heights to represent frequencies. The horizontal axis can be either the class boundaries, the class marks, or the class limits.
Frequency Polygon: A line graph. The frequency is placed along the vertical axis and the class midpoints are placed along the horizontal axis. These points are connected with lines.
Ogive: A frequency polygon of the cumulative frequency or the relative cumulative frequency. The vertical axis is the cumulative frequency or relative cumulative frequency. The horizontal axis is the class boundaries. The graph always starts at zero at the lowest class boundary and will end up at the total frequency (for a cumulative frequency) or 1.00 (for a relative cumulative frequency).
Pareto Chart: A bar graph for qualitative data with the bars arranged according to frequency.
Pie Chart: Graphical depiction of data as slices of a pie. The frequency determines the size of the slice. The number of degrees in any slice is the relative frequency times 360 degrees.
Pictograph: A graph that uses pictures to represent data.
Stem and Leaf Plot: A data plot which uses part of the data value as the stem and the rest of the data value (the leaf) to form groups or classes. This is very useful for sorting data quickly.
1. Find the largest and smallest values.
2. Compute the Range = Maximum - Minimum.
3. Select the number of classes desired. This is usually between 5 and 20.
4. Find the class width by dividing the range by the number of classes and rounding up. There are two things to be careful of here. You must round up, not off: normally 3.2 would round to be 3, but in rounding up it becomes 4. If the range divided by the number of classes gives an integer value (no remainder), then you can either add one to the number of classes or add one to the class width. Sometimes you're locked into a certain number of classes because of the instructions. The Bluman text fails to mention the case when there is no remainder.
5. Pick a suitable starting point less than or equal to the minimum value. You will be able to cover "the class width times the number of classes" values. You need to cover one more value than the range. Follow this rule and you'll be okay: the starting point plus the number of classes times the class width must be greater than the maximum value. Your starting point is the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the lower limits.
6. To find the upper limit of the first class, subtract one from the lower limit of the second class. Then continue to add the class width to this upper limit to find the rest of the upper limits.
7. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to the upper limits. The boundaries are also halfway between the upper limit of one class and the lower limit of the next class. Depending on what you're trying to accomplish, it may not be necessary to find the boundaries.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not be necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.
A sketch of this procedure in code follows.
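To make steps 1 through 9 concrete, here is a minimal Python sketch (the function name, the sample data, and the choice of starting point are mine, not from the notes; it assumes integer data so that the upper limit is the next lower limit minus one):

```python
import math

def grouped_frequency_table(data, num_classes):
    """Steps 1-9: compute the range, class width, limits, and frequencies."""
    low, high = min(data), max(data)
    width = math.ceil((high - low) / num_classes)  # step 4: round UP
    if (high - low) % num_classes == 0:
        width += 1        # no remainder: widen the classes instead
    start = low           # step 5: any starting point <= the minimum
    classes = []
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width - 1      # step 6: next lower limit minus one
        freq = sum(1 for x in data if lower <= x <= upper)  # steps 8-9
        classes.append((lower, upper, freq))
    return classes

# Example with made-up integer data and 5 classes
data = [12, 15, 17, 21, 22, 25, 28, 31, 34, 40]
for lo, hi, f in grouped_frequency_table(data, 5):
    print(f"{lo}-{hi}: {f}")
```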
Definitions
Statistic: Characteristic or measure obtained from a sample.
Parameter: Characteristic or measure obtained from a population.
Mean: Sum of all the values divided by the number of values. This can either be a population mean (denoted by mu) or a sample mean (denoted by x-bar).
Median: The midpoint of the data after being ranked (sorted in ascending order). There are as many numbers below the median as above the median.
Mode: The most frequent number.
Skewed Distribution: The majority of the values lie together on one side with a very few values (the tail) to the other side. In a positively skewed distribution, the tail is to the right and the mean is larger than the median. In a negatively skewed distribution, the tail is to the left and the mean is smaller than the median.
Symmetric Distribution: The data values are evenly distributed on both sides of the mean. In a symmetric distribution, the mean is the median.
Weighted Mean: The mean when each value is multiplied by its weight and summed. This sum is divided by the total of the weights.
Midrange: The mean of the highest and lowest values: (Max + Min) / 2.
Range: The difference between the highest and lowest values: Max - Min.
Population Variance: The average of the squares of the distances from the population mean. It is the sum of the squares of the deviations from the mean divided by the population size. The units on the variance are the units of the population squared.
Sample Variance: Unbiased estimator of a population variance. Instead of dividing by the population size, the sum of the squares of the deviations from the sample mean is divided by one less than the sample size. The units on the variance are the units of the population squared.
Standard Deviation: The square root of the variance. The population standard deviation is the square root of the population variance, and the sample standard deviation is the square root of the sample variance. The sample standard deviation is not the unbiased estimator for the population standard deviation. The units on the standard deviation are the same as the units of the population/sample.
Coefficient of Variation: Standard deviation divided by the mean, expressed as a percentage. We won't work with the Coefficient of Variation in this course.
Chebyshev's Theorem: The proportion of the values that fall within k standard deviations of the mean is at least 1 - 1/k^2, where k > 1. Chebyshev's theorem can be applied to any distribution regardless of its shape.
Empirical or Normal Rule: Only valid when a distribution is bell-shaped (normal). Approximately 68% lies within 1 standard deviation of the mean; 95% within 2 standard deviations; and 99.7% within 3 standard deviations of the mean.
Standard Score or Z-Score: The value obtained by subtracting the mean and dividing by the standard deviation. When all values are transformed to their standard scores, the new mean (for Z) will be zero and the standard deviation will be one.
Percentile: The percent of the population which lies below that value. The data must be ranked to find percentiles.
Quartile: Either the 25th, 50th, or 75th percentile. The 50th percentile is also called the median.
Decile: Either the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, or 90th percentile.
Lower Hinge: The median of the lower half of the numbers (up to and including the median). The lower hinge is the first quartile unless the remainder when dividing the sample size by four is 3.
Upper Hinge: The median of the upper half of the numbers (including the median). The upper hinge is the third quartile unless the remainder when dividing the sample size by four is 3.
Box and Whiskers Plot (Box Plot): A graphical representation of the minimum value, lower hinge, median, upper hinge, and maximum. Some textbooks, and the TI-82 calculator, define the five values as the minimum, first quartile, median, third quartile, and maximum.
Five Number Summary: Minimum value, lower hinge, median, upper hinge, and maximum.
InterQuartile Range (IQR): The difference between the 3rd and 1st quartiles.
Outlier: An extremely high or low value when compared to the rest of the values.
Mild Outliers: Values which lie between 1.5 and 3.0 times the InterQuartile Range below the 1st quartile or above the 3rd quartile. Note: some texts use hinges instead of quartiles.
Extreme Outliers: Values which lie more than 3.0 times the InterQuartile Range below the 1st quartile or above the 3rd quartile. Note: some texts use hinges instead of quartiles.
Mean
This is what people usually intend when they say "average": the population mean is mu = (sum of the x values) / N, and the sample mean is x-bar = (sum of the x values) / n.
Frequency Distribution: The mean of a frequency distribution is also the weighted mean.
Median
The data must be ranked (sorted in ascending order) first. The median is the number in the middle. To find the depth of the median, there are several formulas that could be used; the one that we will use is:
Depth of median = 0.5 * (n + 1)
Raw Data: The median is the number in the "depth of the median" position. If the sample size is even, the depth of the median will be a decimal -- you need to find the midpoint between the numbers on either side of the depth of the median.
Ungrouped Frequency Distribution: Find the cumulative frequencies for the data. The first value with a cumulative frequency greater than the depth of the median is the median. If the depth of the median is exactly 0.5 more than the cumulative frequency of the previous class, then the median is the midpoint between the two classes.
Grouped Frequency Distribution: This is the tough one. Since the data is grouped, you have lost all original information. Some textbooks have you simply take the midpoint of the class. This is an over-simplification which isn't the true value (but is much easier to do). The correct process is to interpolate. Find out what proportion of the distance into the median class the median lies by dividing the sample size by 2, subtracting the cumulative frequency of the previous class, and then dividing all that by the frequency of the median class. Multiply this proportion by the class width and add it to the lower boundary of the median class.
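In symbols, the interpolation just described is median = L + ((n/2 - CF) / f) * w, where L is the lower boundary of the median class, CF the cumulative frequency of the previous class, f the frequency of the median class, and w the class width. A minimal Python sketch (the class data is invented for illustration):

```python
def grouped_median(classes):
    """classes: list of (lower_boundary, upper_boundary, frequency) tuples."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cum = 0                      # cumulative frequency of previous classes
    for lower, upper, f in classes:
        if cum + f >= half:      # this is the median class
            width = upper - lower
            return lower + (half - cum) / f * width
        cum += f

# Boundaries 0.5-5.5, 5.5-10.5, 10.5-15.5 with frequencies 4, 7, 9
print(grouped_median([(0.5, 5.5, 4), (5.5, 10.5, 7), (10.5, 15.5, 9)]))
```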
Mode
The mode is the most frequent data value. There may be no mode if no one value appears more than any other. There may also be two modes (bimodal), three modes (trimodal), or more than three modes (multi-modal). For grouped frequency distributions, the modal class is the class with the largest frequency.
Midrange
The midrange is simply the midpoint between the highest and lowest values.
Summary
The Mean is used in computing other statistics (such as the variance) and does not exist for open-ended grouped frequency distributions (1). It is often not appropriate for skewed distributions such as salary information.
The Median is the center number and is good for skewed distributions because it is resistant to change.
The Mode is used to describe the most typical case. The mode can be used with nominal data whereas the others can't. The mode may or may not exist, and there may be more than one value for the mode (2).
The Midrange is not used very often. It is a very rough estimate of the average and is greatly affected by extreme values (even more so than the mean).
Property                      Mean     Median   Mode     Midrange
Always exists                 No (1)   Yes      No (2)   Yes
Uses all data values          Yes      No       No       No
Affected by extreme values    Yes      No       No       Yes
Since the range only uses the largest and smallest values, it is greatly affected by extreme values, that is - it is not resistant to change.
Variance
"Average Deviation" The range only involves the smallest and largest numbers, and it would be desirable to have a statistic which involved all of the data values. The first attempt one might make at this is something they might call the average deviation from the mean and define it as:
The problem is that this summation is always zero -- the positive and negative deviations cancel exactly. So, the average deviation will always be zero. That is why the average deviation is never used.
Population Variance
So, to keep it from being zero, the deviation from the mean is squared and called the "squared deviation from the mean". This "average squared deviation from the mean" is called the variance:
sigma^2 = ( sum of (x - mu)^2 ) / N
Unbiased Estimate of the Population Variance
One would expect the sample variance to simply be the population variance with the population mean replaced by the sample mean. However, one of the major uses of statistics is to estimate the corresponding parameter, and that formula gives estimates which are, on average, too small. To counteract this, the sum of the squares of the deviations is divided by one less than the sample size:
s^2 = ( sum of (x - x-bar)^2 ) / (n - 1)
Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that the units were also squared. To get the units back the same as the original data values, the square root must be taken.
The sample standard deviation is not the unbiased estimator for the population standard deviation. The calculator does not have a variance key on it. It does have a standard deviation key. You will have to square the standard deviation to find the variance.
What's wrong with the first formula, you ask? Consider the following example; the last row contains the totals for the columns.
1. Total the data values: 23.
2. Divide by the number of values to get the mean: 23 / 5 = 4.6.
3. Subtract the mean from each value to get the numbers in the second column.
4. Square each number in the second column to get the values in the third column.
5. Total the numbers in the third column: 5.2.
6. Divide this total by one less than the sample size to get the variance: 5.2 / 4 = 1.3.
x     x - mean          (x - mean)^2
4     4 - 4.6 = -0.6    (-0.6)^2 = 0.36
5     5 - 4.6 =  0.4    ( 0.4)^2 = 0.16
3     3 - 4.6 = -1.6    (-1.6)^2 = 2.56
6     6 - 4.6 =  1.4    ( 1.4)^2 = 1.96
5     5 - 4.6 =  0.4    ( 0.4)^2 = 0.16
23    0.00 (always)     5.20
Not too bad, you think. But this can get pretty bad if the sample mean doesn't happen to be a "nice" rational number. Think about having a mean of 19/7 = 2.714285714285... Those subtractions get nasty, and when you square them, they're really bad. Another problem with the first formula is that it requires you to know the mean ahead of time. For a calculator, this would mean that you have to save all of the numbers that were entered. The TI-82 does this, but most scientific calculators don't.
Now, let's consider the shortcut formula:
sum of squares: SS = ( sum of x^2 ) - ( sum of x )^2 / n, and then s^2 = SS / (n - 1)
The only things that you need to find are the sum of the values and the sum of the values squared. There is no subtraction and no decimals or fractions until the end. The last row contains the sums of the columns, just like before.
1. Record each number in the first column and the square of each number in the second column.
2. Total the first column: 23.
3. Total the second column: 111.
4. Compute the sum of squares: 111 - 23*23/5 = 111 - 105.8 = 5.2.
5. Divide the sum of squares by one less than the sample size to get the variance: 5.2 / 4 = 1.3.
x     x^2
4     16
5     25
3      9
6     36
5     25
23    111
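Both formulas are easy to check in Python (a sketch using the data from the tables above; the standard library's statistics.variance returns the same sample variance):

```python
def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Shortcut: SS = sum(x^2) - (sum x)^2 / n, then divide by n - 1."""
    n = len(xs)
    ss = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return ss / (n - 1)

data = [4, 5, 3, 6, 5]
print(variance_definition(data))  # 1.3
print(variance_shortcut(data))    # 1.3
```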
Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean will be at least 1 - 1/k^2, where k is a number greater than 1. "Within k standard deviations" interprets as the interval mu - k*sigma to mu + k*sigma. Chebyshev's Theorem is true for any data set, no matter what the distribution. For example, with k = 2, at least 1 - 1/4 = 75% of the values lie within two standard deviations of the mean.
Empirical Rule
The empirical rule is only valid for bell-shaped (normal) distributions. The following statements are true. Approximately 68% of the data values fall within one standard deviation of the mean. Approximately 95% of the data values fall within two standard deviations of the mean. Approximately 99.7% of the data values fall within three standard deviations of the mean. The empirical rule will be revisited later in the chapter on normal probabilities.
The standard score is z = (x - mean) / (standard deviation). The mean of the standard scores is zero and the standard deviation is 1. This is the nice feature of the standard score -- no matter what the original scale was, when the data is converted to its standard score, the mean is zero and the standard deviation is 1.
The kth percentile is the number which has k% of the values below it. The data must be ranked.
1. Rank the data.
2. Find k% (k/100) of the sample size, n.
3. If this is an integer, add 0.5. If it isn't an integer, round up.
4. Find the number in this position. If your depth ends in 0.5, then take the midpoint between the two numbers.
It is sometimes easier to count from the high end rather than counting from the low end. For example, the 80th percentile is the number which has 80% below it and 20% above it. Rather than counting 80% from the bottom, count 20% from the top. Note: the 50th percentile is the median.
If you wish to find the percentile for a number (rather than locating the kth percentile), then:
1. Take the number of values below the number.
2. Add 0.5.
3. Divide by the total number of values.
4. Convert it to a percent.
A sketch of the percentile-locating procedure in code follows.
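A minimal Python sketch of the depth rule above (the function name and test data are mine; this is not a standard-library percentile, which uses different conventions):

```python
import math

def kth_percentile(data, k):
    xs = sorted(data)              # step 1: rank the data
    depth = k / 100 * len(xs)      # step 2: k% of the sample size
    if depth == int(depth):
        depth += 0.5               # step 3: integer, so add 0.5
    else:
        depth = math.ceil(depth)   # step 3: otherwise round up
    if depth == int(depth):
        return xs[int(depth) - 1]  # step 4: positions are 1-based
    lo = xs[int(depth - 0.5) - 1]  # depth ends in .5: take the midpoint
    hi = xs[int(depth + 0.5) - 1]  # between the two neighboring values
    return (lo + hi) / 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(kth_percentile(data, 50))  # 5.5 -- the median
print(kth_percentile(data, 37))  # depth 3.7 rounds up to 4, giving 4
```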
Deciles (10 regions)
The percentiles divide the data into 100 equal regions. The deciles divide the data into 10 equal regions. The instructions are the same for finding a percentile, except instead of dividing by 100 in step 2, divide by 10.
Quartiles (4 regions)
The quartiles divide the data into 4 equal regions. Instead of dividing by 100 in step 2, divide by 4. Note: The 2nd quartile is the same as the median. The 1st quartile is the 25th percentile, the 3rd quartile is the 75th percentile. The quartiles are commonly used (much more so than the percentiles or deciles). The TI-82 calculator will find the quartiles for you. Some textbooks include the quartiles in the five number summary.
Hinges
The lower hinge is the median of the lower half of the data up to and including the median. The upper hinge is the median of the upper half of the data up to and including the median. The hinges are the same as the quartiles unless the remainder when dividing the sample size by four is three (like 39 / 4 = 9 R 3). The statement about the lower half or upper half including the median tends to be confusing to some students. If the median is split between two values (which happens whenever the sample size is even), the median isn't included in either since the median isn't actually part of the data. Example 1: sample size of 20 The median will be in position 10.5. The lower half is positions 1 - 10 and the upper half is positions 11 - 20. The lower hinge is the median of the lower half and would be in position 5.5. The upper hinge is the median of the upper half and would be in position 5.5 starting with original position 11 as position 1 -- this is the original position 15.5.
Example 2: sample size of 21 The median is in position 11. The lower half is positions 1 - 11 and the upper half is positions 11 - 21. The lower hinge is the median of the lower half and would be in position 6. The upper hinge is the median of the upper half and would be in position 6 when starting at position 11 -- this is original position 16.
Outliers
Outliers are extreme values. There are mild outliers and extreme outliers. The Bluman text does not distinguish between mild outliers and extreme outliers and just treats either as an outlier.
Extreme Outliers
Extreme outliers are any data values which lie more than 3.0 times the interquartile range below the first quartile or above the third quartile. x is an extreme outlier if ...
x < Q1 - 3 * IQR
or
x > Q3 + 3 * IQR
Mild Outliers
Mild outliers are any data values which lie between 1.5 times and 3.0 times the interquartile range below the first quartile or above the third quartile. x is a mild outlier if ...
Q1 - 3 * IQR <= x < Q1 - 1.5 * IQR
or
Q3 + 1.5 * IQR < x <= Q3 + 3 * IQR
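A small Python sketch of the fences (the quartile values are supplied directly rather than computed, since texts differ on how to find Q1 and Q3):

```python
def classify(x, q1, q3):
    """Label a value using the 1.5*IQR and 3.0*IQR fences above."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Example with Q1 = 10, Q3 = 20 (so IQR = 10)
for x in [15, -7, 42, 60]:
    print(x, classify(x, 10, 20))
```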
Permutation: An arrangement of objects in a specific order.
Combination: A selection of objects without regard to order.
Tree Diagram: A graphical device used to list all possibilities of a sequence of events in a systematic way.
Factorials
If n is a positive integer, then
n! = n (n-1) (n-2) ... (3)(2)(1)
n! = n (n-1)!
A special case is 0!
0! = 1
Permutations
A permutation is an arrangement of objects without repetition where order is important. Permutations using all the objects A permutation of n objects, arranged into one group of size n, without repetition, and order being important is:
nPn = P(n,n) = n!
Example: the permutations of the letters A, B, C are:
ABC ACB BAC BCA CAB CBA
Permutations of some of the objects A permutation of n objects, arranged in groups of size r, without repetition, and order being important is:
nPr = P(n,r) = n! / (n-r)!
Example: the two-letter permutations of A, B, C are:
AB AC BA BC CA CB
Assuming that you start at n and count down to 1 in your factorials ... P(n,r) = the first r factors of n factorial.
Distinguishable Permutations
Sometimes letters are repeated and all of the permutations aren't distinguishable from each other. Example: find all permutations of the letters "BOB". To help you distinguish, I'll write the second "B" as "b":
BOb BbO OBb ObB bBO bOB
If you just write "b" as "B", however, these become BOB, BBO, OBB, OBB, BBO, BOB -- there are really only three distinguishable permutations here. If a word has N letters, k of which are unique, and you let n1, n2, n3, ..., nk be the frequency of each of the k unique letters, then the total number of distinguishable permutations is given by:
N! / ( n1! n2! ... nk! )
Consider the word "STATISTICS": Here are the frequency of each letter: S=3, T=3, A=1, I=2, C=1, there are 10 letters total
10! 10*9*8*7*6*5*4*3*2*1 Permutations = ------------- = -------------------- = 50400 3! 3! 1! 2! 1! 6 * 6 * 1 * 2 * 1
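This count is easy to verify in Python (a sketch; math.factorial is in the standard library):

```python
from math import factorial

def distinguishable_permutations(word):
    """N! divided by the product of the factorials of each letter's count."""
    counts = {}
    for letter in word:
        counts[letter] = counts.get(letter, 0) + 1
    result = factorial(len(word))
    for c in counts.values():
        result //= factorial(c)
    return result

print(distinguishable_permutations("BOB"))         # 3
print(distinguishable_permutations("STATISTICS"))  # 50400
```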
Combinations
A combination is an arrangement of objects without repetition where order is not important. Note: the difference between a permutation and a combination is not whether there is repetition or not -- there must not be repetition with either, and if there is repetition, you can not use the formulas for permutations or combinations. The only difference in the definition of a permutation and a combination is whether order is important. A combination of n objects, arranged in groups of size r, without repetition, and order not being important is:
nCr = C(n,r) = n! / ( (n-r)! * r! )
Another way to write a combination of n things, r at a time, is using the binomial notation "n choose r". Example: find all two-letter combinations of the letters "ABC":
AB = BA
AC = CA
BC = CB
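Python 3.8+ has both counting formulas built in, and itertools can list the actual arrangements (a sketch for checking small cases by hand):

```python
from math import comb, perm
from itertools import combinations, permutations

print(perm(3, 2))  # P(3,2) = 6 ordered pairs
print(comb(3, 2))  # C(3,2) = 3 unordered pairs

letters = "ABC"
print(["".join(p) for p in permutations(letters, 2)])  # AB AC BA BC CA CB
print(["".join(c) for c in combinations(letters, 2)])  # AB AC BC
```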
Assuming that you start at n and count down to 1 in your factorials ... C(n,r) = the first r factors of n factorial divided by the last r factors of n factorial.
Pascal's Triangle
Combinations are used in the binomial expansion theorem from algebra to give the coefficients of the expansion (a+b)^n. They also form a pattern known as Pascal's Triangle.
1
1  1
1  2  1
1  3  3  1
1  4  6  4  1
1  5  10  10  5  1
1  6  15  20  15  6  1
1  7  21  35  35  21  7  1
Each element in the table is the sum of the two elements directly above it. Each element is also a combination. The n value is the number of the row (start counting at zero) and the r value is the element in the row (start counting at zero). That would make the 20 in the next-to-last row C(6,3) -- it's in row #6 (7th row) and position #3 (4th element).
Symmetry
Pascal's Triangle illustrates the symmetric nature of a combination: C(n,r) = C(n,n-r)
Since combinations are symmetric, if n-r is smaller than r, then switch the combination to its alternative form and then use the shortcut given above. C(n,r) = first r factors of n factorial divided by the last r factors of n factorial
TI-82
You can use the TI-82 graphing calculator to find factorials, permutations, and combinations.
Tree Diagrams
Tree diagrams are a graphical way of listing all the possible outcomes. The outcomes are listed in an orderly fashion, so listing all of the possible outcomes is easier than just trying to make sure that you have them all listed. It is called a tree diagram because of the way it looks. The first event appears on the left, and then each sequential event is represented as branches off of the first event. A tree diagram for flipping two coins would show the final outcomes obtained by following each branch to its conclusion. They are, from top to bottom:
HH HT TH TT
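The same outcomes can be generated systematically (a sketch; itertools.product walks the branches in the same top-to-bottom order as the tree):

```python
from itertools import product

outcomes = ["".join(flips) for flips in product("HT", repeat=2)]
print(outcomes)  # ['HH', 'HT', 'TH', 'TT']
```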
Stats: Probability
Definitions
Probability Experiment: Process which leads to well-defined results called outcomes.
Outcome: The result of a single trial of a probability experiment.
Sample Space: Set of all possible outcomes of a probability experiment.
Event: One or more outcomes of a probability experiment.
Classical Probability: Uses the sample space to determine the numerical probability that an event will happen. Also called theoretical probability.
Equally Likely Events: Events which have the same probability of occurring.
Complement of an Event: All the events in the sample space except the given events.
Empirical Probability: Uses a frequency distribution to determine the numerical probability. An empirical probability is a relative frequency.
Subjective Probability: Uses probability values based on an educated guess or estimate. It employs opinions and inexact information.
Mutually Exclusive Events: Two events which cannot happen at the same time.
Disjoint Events: Another name for mutually exclusive events.
Independent Events: Two events are independent if the occurrence of one does not affect the probability of the other occurring.
Dependent Events: Two events are dependent if the first event affects the outcome or occurrence of the second event in such a way that the probability is changed.
Conditional Probability: The probability of an event occurring given that another event has already occurred.
Bayes' Theorem: A formula which allows one to find the probability that an event occurred as the result of a particular previous event.
Sample Spaces
A sample space is the set of all possible outcomes. However, some sample spaces are better than others. Consider the experiment of flipping two coins. It is possible to get 0 heads, 1 head, or 2 heads. Thus, the sample space could be {0, 1, 2}. Another way to look at it is { HH, HT, TH, TT }. The second way is better because each event is as equally likely to occur as any other. When writing the sample space, it is highly desirable to have events which are equally likely. Another example is rolling two dice. The sums are { 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }. However, these aren't equally likely. The only way to get a sum of 2 is to roll a 1 on both dice, but you can get a sum of 4 by rolling a 1-3, 2-2, or 3-1. The following table illustrates a better sample space for the sum obtained when rolling two dice.
              Second Die
First Die   1   2   3   4   5   6
    1       2   3   4   5   6   7
    2       3   4   5   6   7   8
    3       4   5   6   7   8   9
    4       5   6   7   8   9  10
    5       6   7   8   9  10  11
    6       7   8   9  10  11  12
Classical Probability
The above table lends itself to describing data another way -- using a probability distribution. Let's consider the frequency distribution for the above sums.
Sum   Frequency   Relative Frequency
 2        1            1/36
 3        2            2/36
 4        3            3/36
 5        4            4/36
 6        5            5/36
 7        6            6/36
 8        5            5/36
 9        4            4/36
10        3            3/36
11        2            2/36
12        1            1/36
If just the first and last columns were written, we would have a probability distribution. The relative frequency of a frequency distribution is the probability of the event occurring. This is only true, however, if the events are equally likely. This gives us the formula for classical probability. The probability of an event occurring is the number in the event divided by the number in the sample space. Again, this is only true when the events are equally likely. A classical probability is the relative frequency of each event in the sample space when each event is equally likely. P(E) = n(E) / n(S)
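A quick Python sketch of classical probability using the two-dice sample space (the helper name is mine):

```python
from itertools import product

space = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def p_sum(total):
    """Classical probability: n(E) / n(S)."""
    event = [roll for roll in space if sum(roll) == total]
    return len(event) / len(space)

print(p_sum(2))  # 1/36, only 1-1 works
print(p_sum(7))  # 6/36, the most likely sum
```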
Empirical Probability
Empirical probability is based on observation. The empirical probability of an event is the relative frequency of a frequency distribution based upon observation. P(E) = f / n
Probability Rules
There are two rules which are very important.
All probabilities are between 0 and 1 inclusive: 0 <= P(E) <= 1.
The sum of all the probabilities in the sample space is 1.
There are some other rules which are also important.
The probability of an event which cannot occur is 0. The probability of any event which is not in the sample space is zero.
The probability of an event which must occur is 1. The probability of the sample space is 1.
The probability of an event not occurring is one minus the probability of it occurring.
P(E') = 1 - P(E)
Two events are mutually exclusive if they cannot occur at the same time. Another word that means mutually exclusive is disjoint. If two events are disjoint, then the probability of them both occurring at the same time is 0.
Disjoint: P(A and B) = 0
If two events are mutually exclusive, then the probability of either occurring is the sum of the probabilities of each occurring. Specific Addition Rule Only valid when the events are mutually exclusive.
P(A or B) = P(A) + P(B)
Example 1: Given P(A) = 0.20, P(B) = 0.70, and A and B are disjoint. I like to use what's called a joint probability distribution. (Since disjoint means nothing in common, joint is what they have in common -- so the values that go on the inside portion of the table are the intersections or "and"s of each pair of events.) "Marginal" is another word for totals -- it's called marginal because they appear in the margins.
           A       A'      Marginal
B          0.00    0.70    0.70
B'         0.20    0.10    0.30
Marginal   0.20    0.80    1.00
The values given in the problem are P(A) = 0.20 and P(B) = 0.70, and the grand total is always 1.00. The rest of the values are obtained by addition and subtraction.
Non-Mutually Exclusive Events
In events which aren't mutually exclusive, there is some overlap. When P(A) and P(B) are added, the probability of the intersection (and) is added twice.
To compensate for that double addition, the intersection needs to be subtracted. General Addition Rule Always valid.
P(A or B) = P(A) + P(B) - P(A and B)
Certain things can be determined from the joint probability distribution. For mutually exclusive events, the intersection cell will be zero. All-inclusive events will have a zero opposite the intersection. All-inclusive means that there is nothing outside of those two events: P(A or B) = 1.
In the joint probability table, A and B are mutually exclusive exactly when the (A and B) cell is 0, and the grand total in the lower-right corner of the marginals is always 1.00.
"AND" or Intersections
Independent Events
Two events are independent if the occurrence of one does not change the probability of the other occurring. An example would be rolling a 2 on a die and flipping a head on a coin. Rolling the 2 does not affect the probability of flipping the head. If events are independent, then the probability of them both occurring is the product of the probabilities of each occurring. Specific Multiplication Rule Only valid for independent events
P(A and B) = P(A) * P(B)
Example 3: P(A) = 0.20, P(B) = 0.70, A and B independent. The intersection entry in the joint probability table is 0.14, because the probability of A and B is the probability of A times the probability of B: 0.20 * 0.70 = 0.14.
Dependent Events
If the occurrence of one event does affect the probability of the other occurring, then the events are dependent.
Conditional Probability
The probability of event B occurring, given that event A has already occurred, is read "the probability of B given A" and is written: P(B|A)
General Multiplication Rule
Always works.
P(A and B) = P(A) * P(B|A)
Example 4: P(A) = 0.20, P(B) = 0.70, P(B|A) = 0.40 A good way to think of P(B|A) is that 40% of A is B. 40% of the 20% which was in event A is 8%, thus the intersection is 0.08.
           A       A'      Marginal
B          0.08    0.62    0.70
B'         0.12    0.18    0.30
Marginal   0.20    0.80    1.00
Independence Revisited
The following four statements are equivalent 1. A and B are independent events 2. P(A and B) = P(A) * P(B) 3. P(A|B) = P(A) 4. P(B|A) = P(B) The last two are because if two events are independent, the occurrence of one doesn't change the probability of the occurrence of the other. This
means that the probability of B occurring, whether A has happened or not, is simply the probability of B occurring.
The conditional probability formula is P(B|A) = P(A and B) / P(A). This formula comes from the general multiplication principle and a little bit of algebra. Since we are given that event A has occurred, we have a reduced sample space. Instead of the entire sample space S, we now have a sample space of A, since we know A has occurred. So the old rule about being the number in the event divided by the number in the sample space still applies: it is the number in A and B (it must be in A since A has occurred) divided by the number in A. If you then divide the numerator and denominator of the right-hand side by the number in the sample space S, you have the probability of A and B divided by the probability of A.
Examples
Example 1: The question, "Do you smoke?" was asked of 100 people. Results are shown in the table.
          Male   Female   Total
Yes        19      12      31
No         41      28      69
Total      60      40     100
What is the probability of a randomly selected individual being a male who smokes? This is just a joint probability: the number of "Male and Smoke" divided by the total = 19/100 = 0.19.
What is the probability of a randomly selected individual being a male? This is the total for male divided by the total = 60/100 = 0.60. Since no mention is made of smoking or not smoking, it includes all the cases.
What is the probability of a randomly selected individual smoking? Again, since no mention is made of gender, this is a marginal probability: the total who smoke divided by the total = 31/100 = 0.31.
What is the probability of a randomly selected male smoking? This time, you're told that you have a male -- think of stratified sampling. What is the probability that the male smokes? Well, 19 males smoke out of 60 males, so 19/60 = 0.31666...
What is the probability that a randomly selected smoker is male? This time, you're told that you have a smoker and asked to find the probability that the smoker is also male. There are 19 male smokers out of 31 total smokers, so 19/31 = 0.6129 (approx).
After that last part, you have just worked a Bayes' Theorem problem. I know you didn't realize it -- that's the beauty of it. A Bayes' problem can be set up so it appears to be just another conditional probability. In this class we will treat Bayes' problems as another conditional probability and not involve the large messy formula given in the text (and every other text).
Example 2: There are three major manufacturing companies that make a product: Aberations, Brochmailians, and Chompieliens. Aberations has a 50% market share, and Brochmailians has a 30% market share. 5% of Aberations' product is defective, 7% of Brochmailians' product is defective, and 10% of Chompieliens' product is defective. This information can be placed into a joint probability distribution.
Company     Aberations            Brochmailians         Chompieliens          Total
Good        0.50-0.025 = 0.475    0.30-0.021 = 0.279    0.20-0.020 = 0.180    0.934
Defective   0.05(0.50) = 0.025    0.07(0.30) = 0.021    0.10(0.20) = 0.020    0.066
Total       0.50                  0.30                  0.20                  1.00
The percent of the market share for Chompieliens wasn't given, but since the marginals must add to be 1.00, they have a 20% market share. Notice that the 5%, 7%, and 10% defective rates don't go into the table directly. This is because they are conditional probabilities and the table is a joint probability table. These defective probabilities are conditional upon which company was given. That is, the 7% is not P(Defective), but P(Defective|Brochmailians). The joint probability P(Defective and Brochmailians) = P(Defective|Brochmailians) * P(Brochmailians). The "good" probabilities can be found by subtraction as shown above, or by multiplication using conditional probabilities. If 7% of Brochmailians' product is defective, then 93% is good. 0.93(0.30)=0.279.
What is the probability a randomly selected product is defective? P(Defective) = 0.066.
What is the probability that a defective product came from Brochmailians? P(Brochmailians|Defective) = P(Brochmailians and Defective) / P(Defective) = 0.021/0.066 = 7/22 = 0.318 (approx).
Are these events independent? No. If they were, then P(Brochmailians|Defective) = 0.318 would have to equal P(Brochmailians) = 0.30, but it doesn't. Also, P(Aberations and Defective) = 0.025 would have to be P(Aberations)*P(Defective) = 0.50*0.066 = 0.033, and it isn't.
The second question asked above is a Bayes' problem. Again, my point is, you don't have to know Bayes' formula just to work a Bayes' problem. A sketch in code follows.
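The joint-table approach translates directly into a few lines of Python (a sketch of this example; the variable names are mine):

```python
share = {"A": 0.50, "B": 0.30, "C": 0.20}            # P(company)
defective_given = {"A": 0.05, "B": 0.07, "C": 0.10}  # P(Defective | company)

# Joint probabilities P(company and Defective), then the marginal
joint = {c: share[c] * defective_given[c] for c in share}
p_defective = sum(joint.values())                    # 0.066

# Bayes as a conditional probability: joint divided by marginal
print(joint["B"] / p_defective)  # 0.021 / 0.066 = 0.318 (approx)
```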
Bayes' Theorem
However, just for the sake of argument, let's say that you want to know what Bayes' formula is. Let's use the same example, but shorten each event to its one-letter initial, i.e., A, B, C, and D instead of Aberations, Brochmailians, Chompieliens, and Defective. P(D|B) is not a Bayes problem; this is given in the problem. Bayes' formula finds the reverse conditional probability P(B|D). It is based on the fact that the given event (D) is made up of three parts: the part of D in A, the part of D in B, and the part of D in C.
P(B|D) = P(B and D) / ( P(A and D) + P(B and D) + P(C and D) )
Inserting the multiplication rule for each of these joint probabilities gives
P(B|D) = P(D|B)*P(B) / ( P(D|A)*P(A) + P(D|B)*P(B) + P(D|C)*P(C) )
However, and I hope you agree, it is much easier to take the joint probability divided by the marginal probability. The table does the adding for you and makes the problems doable without having to memorize the formulas.
Definitions
Random Variable: Variable whose values are determined by chance.
Probability Distribution: The values a random variable can assume and the corresponding probabilities of each.
Expected Value: The theoretical mean of the variable.
Binomial Experiment: An experiment with a fixed number of independent trials. Each trial can only have two outcomes, or outcomes which can be reduced to two outcomes. The probability of each outcome must remain constant from trial to trial.
Binomial Distribution: The outcomes of a binomial experiment with their corresponding probabilities.
Multinomial Distribution: A probability distribution resulting from an experiment with a fixed number of independent trials. Each trial has two or more mutually exclusive outcomes. The probability of each outcome must remain constant from trial to trial.
Poisson Distribution: A probability distribution used when a density of items is distributed over a period of time. The sample size needs to be large and the probability of success to be small.
Hypergeometric Distribution: A probability distribution of a variable with two outcomes when sampling is done without replacement.
Probability Distributions
A listing of all the values the random variable can assume with their corresponding probabilities make a probability distribution. A note about random variables. A random variable does not mean that the values can be anything (a random number). Random variables have a well defined set of outcomes and well defined probabilities for the occurrence of each outcome. The random refers to the fact that the outcomes happen by chance -- that is, you don't know which outcome will occur next. Here's an example probability distribution that results from the rolling of a single fair die.
x     p(x)
1     1/6
2     1/6
3     1/6
4     1/6
5     1/6
6     1/6
sum   6/6 = 1
The definitions for population mean and variance used with an ungrouped frequency distribution were:
mu = ( sum of x*f ) / N and sigma^2 = ( sum of (x - mu)^2 * f ) / N
Some of you might be confused by only dividing by N. Recall that this is the population variance; the sample variance, which was the unbiased estimator for the population variance, was the one divided by n - 1. Using algebra, the variance is equivalent to:
sigma^2 = ( sum of x^2 * f ) / N - mu^2
Recall that a probability is a long-term relative frequency, so every f/N can be replaced by p(x). What's even better is that the last portion of the variance is the mean squared. So, the two formulas that we will be using are:
mu = sum of [ x * p(x) ]
sigma^2 = sum of [ x^2 * p(x) ] - mu^2
For the die example above, the mean is 7/2 = 3.5. The variance is 91/6 - (7/2)^2 = 35/12 = 2.916666..., and the standard deviation is the square root of the variance, 1.7078 (approx). Do not use rounded-off values in the intermediate calculations; only round off the final answer.
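A sketch of the two formulas in Python, using exact fractions so the results match 7/2 and 35/12:

```python
from fractions import Fraction

dist = {x: Fraction(1, 6) for x in range(1, 7)}  # one fair die

mu = sum(x * p for x, p in dist.items())              # sum of x * p(x)
var = sum(x**2 * p for x, p in dist.items()) - mu**2  # sum of x^2 p(x) - mu^2

print(mu, var)  # 7/2 35/12
```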
Binomial Experiment
A binomial experiment is an experiment which satisfies these four conditions:
1. A fixed number of trials.
2. Each trial is independent of the others.
3. There are only two outcomes.
4. The probability of each outcome remains constant from trial to trial.
These can be summarized as: an experiment with a fixed number of independent trials, each of which can only have two possible outcomes. (The fact that each trial is independent actually means that the probabilities remain constant.)
Examples of binomial experiments:
Tossing a coin 20 times to see how many tails occur.
Asking 200 people if they watch ABC news.
Rolling a die to see if a 5 appears.
Examples which aren't binomial experiments:
Rolling a die until a 6 appears (not a fixed number of trials).
Asking 20 people how old they are (not two outcomes).
Drawing 5 cards from a deck for a poker hand (done without replacement, so not independent).
[Illustration: with a success probability of 1/6 and 6 trials, the 15 possible orderings of 2 successes and 4 failures were listed, each with probability (1/6)^2 * (5/6)^4.]
Notice that each of the 15 probabilities is exactly the same: (1/6)^2 * (5/6)^4. Also, note that the 1/6 is the probability of success and you needed 2 successes. The 5/6 is the probability of failure, and if 2 of the 6 trials were successes, then 4 of the 6 must be failures. Note that 2 is the value of x and 4 is the value of n-x. Further note that there are fifteen ways this can occur: the number of ways 2 successes can occur in 6 trials without repetition and order not being important, or a combination of 6 things, 2 at a time. The probability of getting exactly x successes in n trials, with the probability of success on a single trial being p, is:
P(X=x) = nCx * p^x * q^(n-x), where q = 1 - p is the probability of failure
Example: A coin is tossed 10 times. What is the probability that exactly 6 heads will occur?
1. Success = "a head is flipped on a single coin"
2. p = 0.5
3. q = 0.5
4. n = 10
5. x = 6
P(x=6) = 10C6 * 0.5^6 * 0.5^4 = 210 * 0.015625 * 0.0625 = 0.205078125
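The same computation in Python (a sketch; math.comb supplies nCx):

```python
from math import comb

def binomial_pmf(n, x, p):
    """P(X = x) = nCx * p^x * q^(n-x) with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binomial_pmf(10, 6, 0.5))  # 0.205078125
```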
The mean of a binomial distribution is mu = n*p and the variance is sigma^2 = n*p*q. Another way to remember the variance is mu*q (since np is mu). Example: find the mean, variance, and standard deviation for the number of sixes that appear when rolling 30 dice. Success = "a six is rolled on a single die", p = 1/6, q = 5/6. The mean is 30 * (1/6) = 5. The variance is 30 * (1/6) * (5/6) = 25/6. The standard deviation is the square root of the variance, 2.041241452 (approx).
Multinomial Probabilities
A multinomial experiment is an extended binomial probability. The difference is that in a multinomial experiment, there are more than two possible outcomes. However, there are still a fixed number of independent trials, and the probability of each outcome must remain constant from trial to trial. Instead of using a combination, as in the case of the binomial probability, the number of ways the outcomes can occur is done using distinguishable permutations. An example here will be much more useful than a formula. The probability that a person will pass a College Algebra class is 0.55, the probability that a person will withdraw before the class is completed is 0.40, and the probability that a person will fail the class is 0.05. Find the probability that in a class of 30 students, exactly 16 pass, 12 withdraw, and 2 fail.
Outcome      Pass    Withdraw   Fail    Total
x            16      12         2       30
p(outcome)   0.55    0.40       0.05    1.00
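The probability is the number of distinguishable permutations times the product of the probability factors: 30! / (16! 12! 2!) * 0.55^16 * 0.40^12 * 0.05^2. A Python sketch of that computation (the function name is mine):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Distinguishable-permutation count times the product of p^x factors."""
    ways = factorial(sum(counts))
    for x in counts:
        ways //= factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return ways * prob

# 16 pass, 12 withdraw, 2 fail out of 30 students
print(multinomial_pmf([16, 12, 2], [0.55, 0.40, 0.05]))
```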
Poisson Probabilities
Named after the French mathematician Simeon Poisson, Poisson probabilities are useful when there are a large number of independent trials with a small probability of success on a single trial and the variables occur over a period of time. It can also be used when a density of items is distributed over a given area or volume.
The Poisson probability formula is p(x; lambda) = lambda^x * e^(-lambda) / x!, where lambda is the mean number of occurrences. If you're approximating a binomial probability using the Poisson, then lambda is the same as mu, or n * p. Example: if there are 500 customers per eight-hour day in a check-out lane, what is the probability that there will be exactly 3 in line during any five-minute period? The expected value during any one five-minute period would be 500 / 96 = 5.2083333. The 96 is because there are 96 five-minute periods in eight hours. So, you expect about 5.2 customers in 5 minutes and want to know the probability of getting exactly 3.
p(3;500/96) = e^(-500/96) * (500/96)^3 / 3! = 0.1288 (approx)
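The same computation in Python (a sketch):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """p(x; lambda) = lambda^x * e^(-lambda) / x!"""
    return lam ** x * exp(-lam) / factorial(x)

print(poisson_pmf(3, 500 / 96))  # 0.1288 (approx)
```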
Hypergeometric Probabilities
Hypergeometric experiments occur when the trials are not independent of each other and occur due to sampling without replacement -- as in a five-card poker hand. Hypergeometric probabilities involve the multiplication of two combinations together and then division by the total number of combinations. Example: what is the probability of selecting 3 men and 4 women when 7 people are chosen from a group of 7 men and 10 women?
The answer is C(7,3) * C(10,4) / C(17,7) = (35 * 210) / 19448 = 7350/19448 = 0.3779 (approx). Note that the sums of the numbers in the numerator (7 and 10 people; 3 and 4 chosen) give the numbers used in the combination in the denominator (17 people, 7 chosen). This can be extended to more than two groups, and is then called an extended hypergeometric problem.
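In Python (a sketch using math.comb):

```python
from math import comb

# P(3 men and 4 women) when 7 people are chosen from 7 men and 10 women
p = comb(7, 3) * comb(10, 4) / comb(17, 7)
print(p)  # 7350 / 19448 = 0.3779 (approx)
```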
Continuity Correction Factor: A correction applied to convert a discrete distribution to a continuous distribution.
Finite Population Correction Factor: A correction applied to the standard error of the means when the sample size is more than 5% of the population size and the sampling is done without replacement.
Sampling Distribution of the Sample Means: Distribution obtained by using the means computed from random samples of a specific size.
Sampling Error: Difference which occurs between the sample statistic and the population parameter due to the fact that the sample isn't a perfect representation of the population.
Standard Error of the Mean: The standard deviation of the sampling distribution of the sample means. It is equal to the standard deviation of the population divided by the square root of the sample size.
Standard Normal Distribution: A normal distribution in which the mean is 0 and the standard deviation is 1. It is denoted by z.
Z-score: Also known as z-value. A standardized score in which the mean is zero and the standard deviation is 1. The Z-score is used to represent the standard normal distribution.
Properties of the normal distribution:
Bell-shaped.
Symmetric about the mean.
Continuous.
Never touches the x-axis.
Total area under the curve is 1.00.
Approximately 68% lies within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations of the mean. This is the Empirical Rule mentioned earlier.
Data values are represented by x, which has mean mu and standard deviation sigma.
Normal Probabilities
Comprehension of this table is vital to success in the course!
There is a table which must be used to look up standard normal probabilities. The z-score is broken into two parts, the whole number and tenth are looked up along the left side and the hundredth is looked up across the top. The value in the intersection of the row and column is the area under the curve between zero and the z-score looked up. Because of the symmetry of the normal distribution, look up the absolute value of any z-score.
Computing Normal Probabilities
There are several different situations that can arise when asked to find normal probabilities.
Situation                                           Instructions
Between zero and any number                         Look up the area in the table.
Between two positives, or between two negatives     Look up both areas in the table and subtract the smaller from the larger.
Between a negative and a positive                   Look up both areas in the table and add them together.
Less than a negative, or greater than a positive    Look up the area in the table and subtract from 0.5000.
Greater than a negative, or less than a positive    Look up the area in the table and add to 0.5000.
This can be shortened into two rules.
1. If there is only one z-score given, use 0.5000 for the second area; otherwise look up both z-scores in the table.
2. If the two numbers are the same sign, then subtract; if they are different signs, then add. If there is only one z-score, then use the inequality to determine the second sign (< is negative, and > is positive).
A sketch in code follows.
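If you want to check table lookups, Python's standard library has the cumulative normal distribution built in (a sketch; statistics.NormalDist is available in Python 3.8+):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

print(z.cdf(1.25) - z.cdf(0))     # area between 0 and 1.25: about 0.3944
print(z.cdf(1.00) - z.cdf(-0.5))  # negative to positive: about 0.5328
print(1 - z.cdf(1.96))            # greater than a positive: about 0.0250
```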
Finding z-scores from probabilities
This is more difficult, and requires you to use the table inversely. You must look up the area between zero and the value on the inside part of the table, and then read the z-score from the outside. Finally, decide if the z-score should be positive or negative, based on whether it was on the left side or the right side of the mean. Remember, z-scores can be negative, but areas or probabilities cannot be.
Situation                                             Instructions
Area between 0 and a value                            Look up the area in the table; make negative if on the left side.
Area in one tail                                      Subtract the area from 0.5000; look up the difference in the table; make negative if in the left tail.
Area including one complete half                      Subtract 0.5000 from the area; look up the difference in the table; make negative if on the left side.
(less than a positive or greater than a negative)
Within z units of the mean                            Divide the area by 2; look up the quotient in the table; use both the positive and negative z-scores.
Two tails with equal area                             Subtract the area from 1.000; divide the area by 2; look up the quotient in the table; use both the positive and negative z-scores.
(more than z units from the mean)
You will become proficient with the table through practice -- work lots of the normal probability problems!
[Standard normal table omitted: the rows give z to one decimal place, the columns give the hundredths digit (0.00 through 0.09), and each entry is the area under the curve between 0 and that z-score.]
The values in the table are the areas between zero and the z-score. That is, P(0<Z<z-score)
If the sample size is more than 5% of the population size and the sampling is done without replacement, then a correction needs to be made to the standard error of the means. In the following, N is the population size and n is the sample size. The adjustment is to multiply the standard error by sqrt((N - n) / (N - 1)), the square root of the quotient of the difference between the population and sample sizes and one less than the population size. For the most part, we will be ignoring this in class.
Discrete (binomial)    Continuous (normal)
x = 6                  5.5 < x < 6.5
x > 6                  x > 6.5
x >= 6                 x > 5.5
x < 6                  x < 5.5
x <= 6                 x < 6.5
As you can see, whether or not the "equal to" is included makes a big difference in the discrete distribution and the way the conversion is performed. However, for a continuous distribution, equality makes no difference. Steps to working a normal approximation to the binomial distribution (a sketch of the computation appears after this list):
1. Identify success, the probability of success, the number of trials, and the desired number of successes. Since this is a binomial problem, these are the same things which were identified when working a binomial problem.
2. Convert the discrete x to a continuous x. Some people would argue that step 3 should be done before this step, but go ahead and convert the x before you forget about it and miss the problem.
3. Find the smaller of np or nq. If the smaller one is at least five, then the larger must also be, so the approximation will be considered good. When you find np, you're actually finding the mean, mu, so denote it as such.
4. Find the standard deviation, sigma = sqrt(npq). It might be easier to find the variance and just stick the square root in the final calculation - that way you don't have to work with all of the decimal places.
5. Compute the z-score using the standard formula for an individual score (not the one for a sample mean).
6. Calculate the probability desired.
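Here is a minimal Python sketch of those steps; the numbers (n = 30 trials, p = 0.5, and "at least 18 successes") are made up for illustration and are not from the notes.

from math import sqrt
from statistics import NormalDist

n, p = 30, 0.5          # number of trials, probability of success
q = 1 - p

# Step 2: continuity correction -- "at least 18" becomes x > 17.5
x_cont = 17.5
# Step 3: check that np and nq are both at least 5; np is the mean
assert min(n * p, n * q) >= 5
mu = n * p              # 15
# Step 4: standard deviation
sigma = sqrt(n * p * q)
# Steps 5-6: z-score for an individual value, then the probability
z = (x_cont - mu) / sigma
prob = 1 - NormalDist().cdf(z)
print(round(z, 2), round(prob, 4))   # roughly z = 0.91, prob = 0.18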
Stats: Estimation
Definitions
Confidence Interval An interval estimate with a specific level of confidence Confidence Level The percent of the time the true mean will lie in the interval estimate given. Consistent Estimator
An estimator which gets closer to the value of the parameter as the sample size increases. Degrees of Freedom The number of data values which are allowed to vary once a statistic has been determined. Estimator A sample statistic which is used to estimate a population parameter. It must be unbiased, consistent, and relatively efficient. Interval Estimate A range of values used to estimate a parameter. Maximum Error of the Estimate The maximum difference between the point estimate and the actual parameter. The Maximum Error of the Estimate is one-half the width of the confidence interval for means and proportions. Point Estimate A single value used to estimate a parameter. Relatively Efficient Estimator The estimator for a parameter with the smallest variance. T distribution A distribution used when the population variance is unknown. Unbiased Estimator An estimator whose expected value is the mean of the parameter being estimated.
Point Estimates
There are two types of estimates we will find: Point Estimates and Interval Estimates. The point estimate is the single best value. A good estimator must satisfy three conditions: Unbiased: The expected value of the estimator must be equal to the mean of the parameter
Consistent: The value of the estimator approaches the value of the parameter as the sample size increases Relatively Efficient: The estimator has the smallest variance of all estimators which could be used
Confidence Intervals
The point estimate is going to be different from the population parameter because of sampling error, and there is no way to know how close it is to the actual parameter. For this reason, statisticians like to give an interval estimate, which is a range of values used to estimate the parameter. A confidence interval is an interval estimate with a specific level of confidence. A level of confidence is the probability that the interval estimate will contain the parameter. The level of confidence is 1 - alpha, and 1 - alpha of the area lies within the confidence interval.
Maximum Error of the Estimate
The maximum error of the estimate is denoted by E and is one-half the width of the confidence interval. The basic confidence interval for a symmetric distribution is set up to be the point estimate minus the maximum error of the estimate is less than the true population parameter which is less than the point estimate plus the maximum error of the estimate. This formula will work for means and proportions because they will use the Z or T distributions which are symmetric. Later, we will talk about variances, which don't use a symmetric distribution, and the formula will be different.
Area in Tails
Since the level of confidence is 1-alpha, the amount in the tails is alpha. There is a notation in statistics which means the score which has the specified area in the right tail. Examples: Z(0.05) = 1.645 (the z-score which has 0.05 to the right, and 0.4500 between 0 and it); Z(0.10) = 1.282 (the z-score which has 0.10 to the right, and 0.4000 between 0 and it). As a shorthand notation, the () are usually dropped, and the probability written as a subscript. The Greek letter alpha is used to represent the area in both tails for a confidence interval, and so alpha/2 will be the area in one tail. Here are some common values:
Confidence Level    Area between 0 and z-score    Area in one tail (alpha/2)    z-score
50%                 0.2500                        0.2500                        0.674
80%                 0.4000                        0.1000                        1.282
Notice in the above table, that the area between 0 and the z-score is simply one-half of the confidence level. So, if there is a confidence level which isn't given above, all you need to do to find it is divide the confidence level by two, and then look up the area in the inside part of the Z-table and look up the z-score on the outside. Also notice - if you look at the student's t distribution, the top row is a level of confidence, and the bottom row is the z-score. In fact, this is where I got the extra digit of accuracy from.
Student's t Distribution
When the population standard deviation is unknown, the mean has a Student's t distribution. The Student's t distribution was created by William S. Gosset, an Irish brewery worker. The brewery wouldn't allow him to publish his work under his name, so he used the pseudonym "Student".
The Student's t distribution is very similar to the standard normal distribution. It is symmetric about its mean. It has a mean of zero. It has a standard deviation and variance greater than 1. There are actually many t distributions, one for each degree of freedom. As the sample size increases, the t distribution approaches the normal distribution. It is bell shaped. The t-scores can be negative or positive, but the probabilities are always positive.
Degrees of Freedom
A degree of freedom occurs for every data value which is allowed to vary once a statistic has been fixed. For a single mean, there are n-1 degrees of freedom. This value will change depending on the statistic being used.
Notice the confidence interval, xbar - E < mu < xbar + E, is the same form as for a population mean when the population standard deviation is known. The only thing that has changed is the formula for the maximum error of the estimate: E = t(alpha/2) * s / sqrt(n), using the t-score and the sample standard deviation.
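A sketch of the computation in Python; the scipy t quantile supplies the critical value, and the sample values are made up for illustration.

from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

data = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 5.4]   # illustrative sample
n = len(data)
xbar, s = mean(data), stdev(data)

alpha = 0.05                            # for 95% confidence
t_crit = t.ppf(1 - alpha / 2, n - 1)    # t(alpha/2) with n - 1 df
E = t_crit * s / sqrt(n)                # maximum error of the estimate
print(xbar - E, xbar + E)               # xbar - E < mu < xbar + E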
Area in one tail     0.250   0.100   0.050   0.025   0.010   0.005
Area in two tails    0.500   0.200   0.100   0.050   0.020   0.010
Confidence Level     50%     80%     90%     95%     98%     99%
df
1      1.000   3.078   6.314   12.706  31.821  63.657
2      0.816   1.886   2.920   4.303   6.965   9.925
3      0.765   1.638   2.353   3.182   4.541   5.841
4      0.741   1.533   2.132   2.776   3.747   4.604
5      0.727   1.476   2.015   2.571   3.365   4.032
6      0.718   1.440   1.943   2.447   3.143   3.707
7      0.711   1.415   1.895   2.365   2.998   3.499
8      0.706   1.397   1.860   2.306   2.896   3.355
9      0.703   1.383   1.833   2.262   2.821   3.250
10     0.700   1.372   1.812   2.228   2.764   3.169
11     0.697   1.363   1.796   2.201   2.718   3.106
12     0.695   1.356   1.782   2.179   2.681   3.055
13     0.694   1.350   1.771   2.160   2.650   3.012
14     0.692   1.345   1.761   2.145   2.624   2.977
15     0.691   1.341   1.753   2.131   2.602   2.947
16     0.690   1.337   1.746   2.120   2.583   2.921
17     0.689   1.333   1.740   2.110   2.567   2.898
18     0.688   1.330   1.734   2.101   2.552   2.878
19     0.688   1.328   1.729   2.093   2.539   2.861
20     0.687   1.325   1.725   2.086   2.528   2.845
21     0.686   1.323   1.721   2.080   2.518   2.831
22     0.686   1.321   1.717   2.074   2.508   2.819
23     0.685   1.319   1.714   2.069   2.500   2.807
24     0.685   1.318   1.711   2.064   2.492   2.797
25     0.684   1.316   1.708   2.060   2.485   2.787
26     0.684   1.315   1.706   2.056   2.479   2.779
27     0.684   1.314   1.703   2.052   2.473   2.771
28     0.683   1.313   1.701   2.048   2.467   2.763
29     0.683   1.311   1.699   2.045   2.462   2.756
30     0.683   1.310   1.697   2.042   2.457   2.750
40     0.681   1.303   1.684   2.021   2.423   2.704
50     0.679   1.299   1.676   2.009   2.403   2.678
60     0.679   1.296   1.671   2.000   2.390   2.660
70     0.678   1.294   1.667   1.994   2.381   2.648
80     0.678   1.292   1.664   1.990   2.374   2.639
90     0.677   1.291   1.662   1.987   2.368   2.632
100    0.677   1.290   1.660   1.984   2.364   2.626
z      0.674   1.282   1.645   1.960   2.326   2.576
The values in the table are the critical values for the given areas in the right tail or in both tails.
Recall:
The best point estimate for p is p hat, the sample proportion: p hat = x / n. The z-score for the binomial is z = (x - np) / sqrt(npq); if the formula for z is divided by n in both the numerator and the denominator, then the formula for z becomes z = (p hat - p) / sqrt(pq / n). Solving this for p to come up with a confidence interval gives the maximum error of the estimate as E = z(alpha/2) * sqrt(pq / n). This is not, however, the formula that we will use. The problem with estimation is that you don't know the value of the parameter (in this case p), so you can't use it to estimate itself - if you knew it, then there would be no problem to work out. So we will replace the parameter by the statistic in the formula for the maximum error of the estimate: E = z(alpha/2) * sqrt(p hat * q hat / n). The z here is the z-score obtained from the normal table, or the bottom of the t-table, as explained in the introduction to estimation. The z-score is determined by the level of confidence, so you may get in the habit of writing it next to the level of confidence.
When you're computing E, I suggest that you find the sample proportion, p hat, and save it to P on the calculator. This way, you can find q as (1-p). Do NOT round the value for p hat and use the rounded value in the calculations. This will lead to error. Once you have computed E, I suggest you save it to the memory on your calculator. On the TI-82, a good choice would be the letter E. The reason for this is that the limits for the confidence interval are now found by subtracting and adding the maximum error of the estimate from/to the sample proportion.
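The same bookkeeping can be done in Python rather than on the TI-82. A sketch; the x and n values are made up for illustration.

from math import sqrt
from statistics import NormalDist

x, n = 42, 150                 # illustrative: 42 successes in 150 trials
p_hat = x / n                  # don't round this before using it
q_hat = 1 - p_hat

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)     # 1.960 for 95% confidence
E = z * sqrt(p_hat * q_hat / n)             # maximum error of the estimate
print(p_hat - E, p_hat + E)                 # p_hat - E < p < p_hat + E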
Population Mean
Here is the formula for the sample size, obtained by solving the maximum error of the estimate formula for the population mean for n: n = (z(alpha/2) * sigma / E)^2. Since n must be a whole number, always round the result up.
Population Proportion
Here is the formula for the sample size, obtained by solving the maximum error of the estimate formula for the population proportion for n: n = p * q * (z(alpha/2) / E)^2, again rounded up to a whole number. Some texts use p hat and q hat, but since the sample hasn't been taken, there is no value for the sample proportion. p and q are taken from a previous study, if one is available. If there is no previous study or estimate available, then use 0.5 for p and q, as these are the values which will give the largest sample size, and it is better to have too large of a sample size and come under the maximum error of the estimate than to have too small of a sample size and exceed the maximum error of the estimate.
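A quick sketch of both sample-size formulas in Python; the E, sigma, and p values below are illustrative, not from the notes.

from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)        # 95% confidence -> 1.960

# Population mean: n = (z * sigma / E)^2, rounded up
sigma, E = 15, 2                       # illustrative values
n_mean = ceil((z * sigma / E) ** 2)
print(n_mean)                          # about 217

# Population proportion: n = p * q * (z / E)^2; use p = q = 0.5
# when no previous estimate is available (largest possible n)
p, E = 0.5, 0.03
n_prop = ceil(p * (1 - p) * (z / E) ** 2)
print(n_prop)                          # about 1068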
A statement which indicates the level of evidence (sufficient or insufficient), at what level of significance, and whether the original claim is rejected (null) or supported (alternative).
Example
"He's dead, Jim," said Dr. McCoy to Captain Kirk. Mr. Spock, as the science officer, is put in charge of statistically determining the correctness of Bones' statement and deciding the fate of the crew member (to vaporize or try to revive) His first step is to arrive at the hypothesis to be tested. Does the statement represent a change in previous condition? Yes, there is change, thus it is the alternative hypothesis, H1 No, there is no change, therefore is the null hypothesis, H0 The correct answer is that there is change. Dead represents a change from the accepted state of alive. The null hypothesis always represents no change. Therefore, the hypotheses are: H0 : Patient is alive. H1 : Patient is not alive (dead). States of nature are something that you, as a statistician have no control over. Either it is, or it isn't. This represents the true nature of things. Possible states of nature (Based on H0) Patient is alive (H0 true - H1 false ) Patient is dead (H0 false - H1 true) Decisions are something that you have control over. You may make a correct decision or an incorrect decision. It depends on the state of nature as to whether your decision is correct or in error. Possible decisions (Based on H0 ) / conclusions (Based on claim ) Reject H0 / "Sufficient evidence to say patient is dead" Fail to Reject H0 / "Insufficient evidence to say patient is dead" There are four possibilities that can occur based on the two possible states of nature and the two decisions which we can make.
Statisticians will never accept the null hypothesis, we will fail to reject. In other words, we'll say that it isn't, or that we don't have enough evidence to say that it isn't, but we'll never say that it is, because someone else might come along with another sample which shows that it isn't and we don't want to be wrong.
Statistically (double) speaking ...

                                    State of Nature
Decision                            H0 True (patient is alive)      H0 False (patient is dead)
Reject H0                           Type I error:                   Correct assessment:
(sufficient evidence of death)      vaporize a live person          vaporize a dead person
Fail to reject H0                   Correct assessment:             Type II error:
(insufficient evidence of death)    try to revive a live person     try to revive a dead person
Which of the two errors is more serious? Type I or Type II ? Since Type I is the more serious error (usually), that is the one we concentrate on. We usually pick alpha to be very small (0.05, 0.01). Note: alpha is not a Type I error. Alpha is the probability of committing a Type I error. Likewise beta is the probability of committing a Type II error.
Conclusions
Conclusions are sentence answers which include whether there is enough evidence or not (based on the decision), the level of significance, and whether the original claim is supported or rejected. Conclusions are based on the original claim, which may be the null or alternative hypotheses. The decisions are always based on the null hypothesis
Original Claim   Decision: Reject H0                     Decision: Fail to reject H0
H0 "REJECT"      There is sufficient evidence at the     There is insufficient evidence at the
                 alpha level of significance to reject   alpha level of significance to reject
                 the claim that (insert original         the claim that (insert original
                 claim here)                             claim here)
H1 "SUPPORT"     There is sufficient evidence at the     There is insufficient evidence at the
                 alpha level of significance to support  alpha level of significance to support
                 the claim that (insert original         the claim that (insert original
                 claim here)                             claim here)
Decision Rule: Reject H0 if t.s. < c.v. (left) or t.s. > c.v. (right)
If the hypothesized value of the parameter lies outside the confidence interval with a 1-alpha level of confidence, then the decision at an alpha level of significance is to reject the null hypothesis. Sounds simple enough, right? It is. However, it has a couple of problems. It only works with two-tail hypothesis tests. It requires that you compute the confidence interval first. This involves taking a z-score or t-score and converting it into an x-score, which is more difficult than standardizing an x-score.
All hypothesis testing is done under the assumption the null hypothesis is true!
I can't emphasize this enough. The values for all population parameters in the test statistics come from the null hypothesis. This is true not only for means, but all of the testing we're going to be doing.
The test statistic is the same z-score (or t-score) for a sample mean that we have seen before: z = (xbar - mu) / (sigma / sqrt(n)). The critical value is obtained from the normal table, or the bottom line from the t-table.
General Pattern
Notice the general pattern of these test statistics is (observed - expected) / standard deviation.
To test the claim, we're going to generate a whole bunch of values for pi, and then test to see if the mean is 3.2. H0 : mu = 3.2 (original claim) H1 : mu <> 3.2 (two tail test)
Procedure:
The area of the unit circle is pi. The area of the unit circle in the first quadrant is pi/4. The calculator generates random numbers between 0 and 1. What we're going to do is generate two random numbers which will simulate a randomly selected point in a unit square in the first quadrant. If the point is within the circle, then the distance from (0,0) will be less than or
equal to 1, if the point is outside the circle, the distance will be greater than 1. Have the calculator generate a squared distance from zero (the square of the distance illustrates the same properties as far as being less than 1 or greater than 1). Do this 25 times. Each time, record whether the point is inside the circle (<1) or outside the circle (>1).
RAND^2 + RAND^2
Pi/4 is approximately equal to the ratio of the points inside the circle to the total number of points. Therefore, pi will be 4 times the ratio of the points inside the circle to the total number of points. This whole process is repeated several times, and the mean and standard deviation is recorded. The hypothesis test is then conducted using the t-test to see if the true mean is 3.2 (based on the sample mean).
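Here is a sketch of the whole simulation in Python instead of on the TI-82; the loop sizes match the procedure above, and the function name is made up for illustration.

import random
from statistics import mean, stdev

def estimate_pi(points=25):
    """One estimate of pi from `points` random points in the unit square."""
    inside = sum(random.random() ** 2 + random.random() ** 2 <= 1
                 for _ in range(points))
    return 4 * inside / points   # pi is about 4 * (fraction inside circle)

estimates = [estimate_pi() for _ in range(20)]   # 20 values for pi
xbar, s, n = mean(estimates), stdev(estimates), len(estimates)
t = (xbar - 3.2) / (s / n ** 0.5)   # t-test of H0: mu = 3.2
print(xbar, s, t)                   # compare |t| to the critical value, df = 19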
Example:
20 values for pi were generated by generating 25 pairs of random numbers and checking to see if they were inside or outside the circle as illustrated above.
3.68 3.36 3.52 3.52 3.20 3.36 3.36 2.88 3.04 3.52 3.04 2.88 2.56 3.04 2.72 3.68 3.36 3.20 3.36 2.60
The mean of the sample is 3.194, the standard deviation is 0.3384857923. The test statistic is t = (3.194 - 3.2) / (0.3384857923 / sqrt(20)) = -0.0793. The critical value, with an 0.05 level of significance since none was stated, for a two-tail test with 19 degrees of freedom is t = +/- 2.093. Since the test statistic is not in the critical region, the decision is to fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level of significance to reject the claim that pi is 3.2. Note the double speak, but it serves to illustrate the point. We would not dare to claim that pi was 3.2, even though this sample seems to illustrate this. The sample doesn't provide enough evidence to show it's not 3.2, but there may be another sample somewhere which does provide enough evidence (let's hope so). So, we won't say it is 3.2, just that we don't have enough evidence to prove it isn't 3.2.
You are testing p, you are not testing p hat. If you knew the value of p, then there would be nothing to test.
All hypothesis testing is done under the assumption the null hypothesis is true!
I can't emphasize this enough. The values for all population parameters in the test statistics come from the null hypothesis. This is true not only for proportions, but all of the testing we're going to be doing. The population proportion has an approximately normal distribution if np and nq are both at least 5. Remember that we are approximating the binomial using the normal, and that the p we're talking about is the probability of success on a single trial. The test statistic is z = (p hat - p) / sqrt(pq / n), where p and q come from the null hypothesis. The critical value is found from the normal table, or from the bottom row of the t-table. The steps involved in the hypothesis testing remain the same. The only thing that changes is the formula for calculating the test statistic and perhaps the distribution which is used.
General Pattern
Notice the general pattern of these test statistics is (observed - expected) / standard deviation.
P-Value Approach
The P-Value Approach, short for Probability Value, approaches hypothesis testing from a different manner. Instead of comparing z-scores or t-scores as in the classical approach, you're comparing probabilities, or areas.
The level of significance (alpha) is the area in the critical region; that is, the area in the tails to the right or left of the critical values. The p-value is the area to the right or left of the test statistic. If it is a two-tail test, then look up the probability in one tail and double it.

If the test statistic is in the critical region, then the p-value will be less than the level of significance. It does not matter whether it is a left tail, right tail, or two-tail test: this rule always holds. Reject the null hypothesis if the p-value is less than the level of significance. You will fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance.

The p-value approach is best suited for the normal distribution when doing calculations by hand. However, many statistical packages will give the p-value but not the critical value. This is because it is easier for a computer or calculator to find the probability than it is to find the critical value. Another benefit of the p-value is that the statistician immediately knows at what level the testing becomes significant. That is, a p-value of 0.06 would be rejected at an 0.10 level of significance, but it would fail to reject at an 0.05 level of significance. Warning: Do not decide on the level of significance after calculating the test statistic and finding the p-value.

Here is a proportion to help you keep the order straight. Any proportion equivalent to the following statement is correct.
The test statistic is to the p-value as the critical value is to the level of significance.
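A tiny sketch of the p-value computation for a z test statistic in Python (the test statistic value is illustrative):

from statistics import NormalDist

z = 1.88                              # illustrative test statistic
tail = 1 - NormalDist().cdf(abs(z))   # area beyond the test statistic
p_value = 2 * tail                    # double it for a two-tail test
print(round(p_value, 4))              # about 0.06: reject at alpha = 0.10,
                                      # fail to reject at alpha = 0.05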
There are two possible cases when testing two population means, the dependent case and the independent case. Most books treat the independent case first, but I'm putting the dependent case first because it follows immediately from the test for a single population mean in the previous chapter.
The mean of a difference is the difference of the means. The variance of a sum of independent variables is the sum of the variances.
When the population variances are known, the difference of the means has a normal distribution. The variance of the difference is the sum of the variances divided by the sample sizes. This makes sense, hopefully, because according to the central limit theorem, the variance of the sampling distribution of the sample means is the variance divided by the sample size, so what we are doing is adding the variance of each mean together. The test statistic is z = ((xbar1 - xbar2) - (mu1 - mu2)) / sqrt(sigma1^2 / n1 + sigma2^2 / n2).
Population Variances Unknown, but both sample sizes large
When the population variances aren't known, the difference of the means has a Student's t distribution. However, if both sample sizes are large enough, then you will be using the normal row from the t-table, so
your book lumps this under the normal distribution, rather than the t-distribution. This gives us the chance to work the problem without knowing if the population variances are equal or not. The test statistic is identical to the one above, except the sample variances are used instead of the population variances.
Population Variances Unknown, unequal with small sample sizes
Ok, you're probably wondering how you know whether the variances are equal if you don't know what they are. Some books teach the F-test to test the equality of two variances, and if your book does that, then you should use the F-test to see. Other books (statisticians) argue that if you do the F-test first to see if the variances are equal, and then use the same level of significance to perform the t-test to test the difference of the means, the overall level of significance isn't the same. So, the Bluman text tells the student whether or not the variances are equal, and so does the Triola text. Since you don't know the population variances, you're going to be using a Student's t distribution. Since the variances are unequal, there is no attempt made to average them together as we will in the next situation. The degrees of freedom are the smaller of the two degrees of freedom (n-1 for each). The "min" function means take the minimum or smaller of the two values. Otherwise, the formula is the same as we used with large sample sizes.
Population Variances Unknown but equal with small sample sizes
If the variances are equal, then an effort is made to average them together. Now, equal does not mean identical. It is possible for two variances to be statistically equal but numerically different. We will find a pooled estimate of the variance, which is simply the weighted mean of the variances, with the degrees of freedom as the weighting factors: sp^2 = ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2). Once the pooled estimate of the variance is computed, this mean (average) variance is used in the place of the individual sample variances. Otherwise, the formula is the same as before. The degrees of freedom are the sum of the individual degrees of freedom (see the sketch below).
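A sketch of the pooled computation in Python; the two samples are made up for illustration.

from math import sqrt

x1 = [12.1, 11.8, 12.6, 12.3, 11.9]       # illustrative samples
x2 = [11.2, 11.9, 11.5, 11.0, 11.6, 11.4]

def var(xs):
    """Sample variance (divide by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(x1), len(x2)
# Pooled variance: weighted mean of the variances, weighted by df
sp2 = ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
xbar1, xbar2 = sum(x1) / n1, sum(x2) / n2
# Test statistic for H0: mu1 - mu2 = 0, with df = n1 + n2 - 2
t = (xbar1 - xbar2) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(t)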
Scatter Plots
1. Enter the x values into L1 and the y values into L2.
2. Go to Stat Plot (2nd y=)
3. Turn Plot 1 on
4. Choose the type to be scatter plot (1st type)
5. Set Xlist to L1
6. Set Ylist to L2
7. Set the Mark to any of the three choices
8. Zoom to the Stat setting (#9)
Note: the Ylist and Mark won't show up until you select a scatter plot.
Regression Lines
1. Setup the scatter plot as instructed above
2. Go into the Stats, Calc, Setup screen
3. Setup the 2-Var Stats so that: Xlist = L1, Ylist = L2, Freq = 1
4. Calculate the Linear Regression (ax+b) (#5)
5. Go into the Plot screen
6. Position the cursor on the Y1 plot and hit CLEAR to erase it
7. While still in the Y1 data entry field, go to the VARS, STATS, EQ screen and choose option 7, which is the regression equation
8. Hit GRAPH
1. Setup the scatter plot as instructed above
2. Go into the Plot screen
3. Position the cursor on the Y1 plot and hit CLEAR to erase it
4. Enter a*x+b into the function. The a and b can be found under the VARS, STATS, EQ screen
1. Go into the Stats, Calc, Setup screen
2. Setup the 2-Var Stats so that: Xlist = L1, Ylist = L2, Freq = 1
3. Calculate the Linear Regression (ax+b) (#5)
4. Hit the GRAPH key
It is important that you calculate the linear regression variables before trying to graph the regression line. If you change the data in the lists or have not calculated the linear regression equations, then you will get an "ERR: Undefined" when you try to graph the data. Be sure to turn off the stats plots and/or the Y1 plot when you need to graph other data.
Stats: Correlation
Sum of Squares
We introduced a notation earlier in the course called the sum of squares. This notation was the SS notation, and will make these formulas much easier to work with.
Notice these all follow the same pattern. SS(x) could be written as SS(x) = sum(x^2) - (sum x)^2 / n; likewise SS(y) = sum(y^2) - (sum y)^2 / n and SS(xy) = sum(xy) - (sum x)(sum y) / n. If you divide the numerator and denominator by n, then you get something which is starting to hopefully look familiar. Each of these values has been seen before in the Sum of Squares notation section. So, the linear correlation coefficient can be written in terms of sums of squares: r = SS(xy) / sqrt(SS(x) * SS(y)). This is the formula that we would be using for calculating the linear correlation coefficient if we were doing it by hand. Luckily for us, the TI-82 has this calculation built into it, and we won't have to do it by hand at all.
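A sketch of the hand formula in Python; the paired data below is made up for illustration.

from math import sqrt

xs = [1, 2, 3, 4, 5]          # illustrative paired data
ys = [2, 4, 5, 4, 5]
n = len(xs)

ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

r = ss_xy / sqrt(ss_x * ss_y)   # linear correlation coefficient
print(r)                        # about 0.7746 for this data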
Hypothesis Testing
The claim we will be testing is "There is significant linear correlation" The Greek letter for r is rho, so the parameter used for linear correlation is rho H0: rho = 0 H1: rho <> 0 r has a t distribution with n-2 degrees of freedom, and the test statistic is
given by t = r * sqrt((n - 2) / (1 - r^2)). Now, there are n-2 degrees of freedom this time. This is a difference from before. As an over-simplification, you subtract one degree of freedom for each variable, and since there are 2 variables, the degrees of freedom are n-2. This doesn't look like the (observed - expected) / standard deviation pattern we're looking for, but remember that the formula for the test statistic can also be written as t = (r - rho) / sqrt((1 - r^2) / (n - 2)).
Hypothesis testing is always done under the assumption that the null hypothesis is true.
Since H0 is rho = 0, this formula is equivalent to the one given in the book. Additional note: 1 - r^2 is later identified as the coefficient of non-determination.
Causation
If there is a significant linear correlation between two variables, then one of five situations can be true. There is a direct cause and effect relationship There is a reverse cause and effect relationship The relationship may be caused by a third variable The relationship may be caused by complex interactions of several variables The relationship may be coincidental
Stats: Regression
The idea behind regression is that when there is significant linear correlation, you can use a line to estimate the value of the dependent variable for certain values of the independent variable. The regression equation should only be used
1. When there is significant linear correlation. That is, when you reject the null hypothesis that rho = 0 in a correlation hypothesis test.
2. When the value of the independent variable being used in the estimation is close to the original values. That is, you should not use a regression equation obtained using x's between 10 and 20 to estimate y when x is 200.
3. With the same population. That is, if x is the height of a male, and y is the weight of a male, then you shouldn't use the regression equation to estimate the weight of a female.
4. Within the same time frame. If data is from the 1960's, it probably isn't valid in the 1990's.
Assuming that you've decided that you can have a regression equation because there is significant linear correlation between the two variables, the equation becomes: y' = ax + b or y' = a + bx (some books use y-hat instead of y-prime). The Bluman text uses the second form; however, more people are familiar with the notion of y = mx + b, so I will use the first.
The regression line is sometimes called the "line of best fit" or the "best fit line". Since it "best fits" the data, it makes sense that the line passes through the means. The regression equation is the line with slope a passing through the point (xbar, ybar). Another way to write the equation would be y - ybar = a(x - xbar). Apply just a little algebra, and we have the formulas for a and b that we would use (if we were stranded on a desert island without the TI-82): a = SS(xy) / SS(x) and b = ybar - a * xbar. It also turns out that the slope of the regression line can be written as a = r * (s_y / s_x). Since the standard deviations can't be negative, the sign of the slope is determined by the sign of the correlation coefficient. This agrees with the statement made earlier that the slope of the regression line will have the same sign as the correlation coefficient.
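Continuing the sum-of-squares sketch from the correlation section, the slope and intercept fall out in a couple of lines (same made-up data):

xs = [1, 2, 3, 4, 5]          # same illustrative data as before
ys = [2, 4, 5, 4, 5]
n = len(xs)
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

a = ss_xy / ss_x                      # slope: a = SS(xy) / SS(x)
b = sum(ys) / n - a * sum(xs) / n     # intercept: b = ybar - a * xbar
print(f"y' = {a}x + {b}")             # y' = 0.6x + 2.2 for this data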
TI-82
Luckily, the TI-82 will find these values for us (isn't it a wonderful calculator?). We can also use the TI-82 to plot the regression line on the scatter plot.
Calculating Values
1. Enter the data. Put the x-values into list 1 and the y-values into list 2.
2. Go into the Stats, Calc, Setup screen
3. Setup the 2-Var Stats so that: Xlist = L1, Ylist = L2, Freq = 1
4. Calculate the Linear Regression (ax+b) (#5)
This screen will give you the sample linear correlation coefficient, r; the slope of the regression equation, a; and the y-intercept of the regression equation, b. Just record the value of r. To write the regression equation, replace the values of a and b into the equation "y-hat = ax+b". To find the coefficient of determination, square r. You can find the variable r under VARS, STATS, EQ, r (#6).
Well, the ratio of the explained variation to the total variation is a measure of how good the regression line is; this ratio is the coefficient of determination, r^2. If the regression line passed through every point on the scatter plot exactly, it would be able to explain all of the variation. The further the line is from the points, the less it is able to explain.
Coefficient of Non-Determination
The coefficient of non-determination is the percent of variation which is unexplained by the regression equation: the unexplained variation divided by the total variation, which is 1 - r^2.
Stats: Chi-Square
Definitions
Chi-square distribution
A distribution obtained by multiplying the ratio of sample variance to population variance by the degrees of freedom when random samples are selected from a normally distributed population. Contingency Table Data arranged in table form for the chi-square independence test. Expected Frequency The frequencies obtained by calculation. Goodness-of-fit Test A test to see if a sample comes from a population with the given distribution. Independence Test A test to see if the row and column variables are independent. Observed Frequency The frequencies obtained by observation. These are the sample frequencies.
Chi-square is non-negative: it is the ratio of two non-negative values, and therefore must be non-negative itself. Chi-square is non-symmetric. There are many different chi-square distributions, one for each degree of freedom. The degrees of freedom when working with a single population variance is n-1.
Chi-Square Probabilities
Since the chi-square distribution isn't symmetric, the method for looking up left-tail values is different from the method for looking up right tail values. Area to the right - just use the area given. Area to the left - the table requires the area to the right, so subtract the given area from one and look this area up in the table. Area in both tails - divide the area by two. Look up this area for the right critical value and one minus this area for the left critical value.
You can interpolate. This is probably the more accurate way. Interpolation involves estimating the critical value by figuring how far the given degrees of freedom are between the two df in the table and going that far between the critical values in the table. Most people born in the 70's didn't have to learn interpolation in high school because they had calculators which would do logarithms (we had to use tables in the "good old" days). You can go with the critical value which is less likely to cause you to reject in error (type I error). For a right tail test, this is the critical value further to the right (larger). For a left tail test, it is the value further to the left (smaller). For a two-tail test, it's the value further to the left and the value further to the right. Note, it is not the column with the degrees of freedom further to the right, it's the critical value which is further to the right.
The population has a normal distribution. The data is from a random sample. The observations must be independent of each other. The test statistic has a chi-square distribution with n-1 degrees of freedom and is given by chi-square = (n - 1) s^2 / sigma^2. Testing is done in the same manner as before. Remember, all hypothesis testing is done under the assumption the null hypothesis is true.
Confidence Intervals
If you solve the test statistic formula for the population variance, you get sigma^2 = (n - 1) s^2 / chi-square.
1. Find the two critical values (alpha/2 and 1-alpha/2).
2. Compute the value for the population variance given above for each critical value.
3. Place the population variance between the two values calculated in step 2 (put the smaller one first).
Note, the left-hand endpoint of the confidence interval comes when the right critical value is used and the right-hand endpoint of the confidence interval comes when the left critical value is used. This is because the critical values are in the denominator and so dividing by the larger critical value (right tail) gives the smaller endpoint.
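A sketch of the interval in Python; scipy's chi-square quantile function supplies the critical values, and the n and s^2 values are made up for illustration.

from scipy.stats import chi2

n, s2 = 20, 9.4        # illustrative sample size and sample variance
alpha = 0.05
df = n - 1

right = chi2.ppf(1 - alpha / 2, df)   # larger critical value
left = chi2.ppf(alpha / 2, df)        # smaller critical value

# Dividing by the larger critical value gives the smaller endpoint
lo = df * s2 / right
hi = df * s2 / left
print(lo, hi)          # lo < sigma^2 < hi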
TI-82: Goodness-of-Fit
You can perform a chi-square goodness-of-fit test using the TI-82. Here are the steps.
1. Enter the observed frequencies into List 1.
2. Get the expected frequencies into List 2.
   a. If you're given the expected frequencies, enter them into List 2.
   b. If you're given probabilities, enter the probabilities into List 2, then multiply List 2 by the sum of List 1 and replace List 2 with that product: sum L1 * L2 -> L2
   c. If you're testing that all categories appear with equal frequency, you can either enter the common expected frequency directly into each value of List 2; or enter the total frequency into each value of List 2 and then divide the list by the number of categories: L2 / k -> L2 (replace k by the number of categories); or enter 1 for each value in List 2 and then multiply the list by the common expected frequency: L2 * E -> L2 (replace E by the expected frequency).
3. Calculate the test statistic: sum ((L1 - L2)^2 / L2)
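The same test statistic in Python; the observed counts and claimed probabilities are made up for illustration.

from scipy.stats import chi2

observed = [22, 18, 28, 32]            # illustrative counts
probs = [0.25, 0.25, 0.25, 0.25]       # claimed distribution
expected = [p * sum(observed) for p in probs]

# sum((L1 - L2)^2 / L2) on the TI-82
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2.sf(stat, df)            # area to the right of the statistic
print(stat, p_value)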
CONTING Program
This program completes a test for independence using a contingency table. The observed frequencies must be contained in matrix [A], and the result is a test statistic having a chi-square distribution. When the program is done running, the following variables are defined: List 1 contains the observed frequencies, List 2 contains the expected frequencies, List 3 contains the row totals, and List 4 contains the column totals.
Stats: F-Test
Definitions
F-distribution The ratio of two independent chi-square variables divided by their respective degrees of freedom. If the population variances are equal, this simplifies to be the ratio of the sample variances. Analysis of Variance (ANOVA) A technique used to test a hypothesis concerning the means of three or more populations. One-Way Analysis of Variance Analysis of Variance when there is only one independent variable. The null hypothesis will be that all population means are equal, the alternative hypothesis is that at least one mean is different. Between Group Variation The variation due to the interaction between the samples, denoted SS(B) for Sum of Squares Between groups. If the sample means are close to each other (and therefore the Grand Mean) this will be small. There are k samples involved with one data value for each sample (the sample mean), so there are k-1 degrees of freedom. Between Group Variance The variance due to the interaction between the samples, denoted MS(B) for Mean Square Between groups. This is the between group variation divided by its degrees of freedom. Within Group Variation The variation due to differences within individual samples, denoted SS(W) for Sum of Squares Within groups. Each sample is considered independently, no interaction between samples is involved. The degrees of freedom is equal to the sum of the individual degrees of freedom for each sample. Since each sample has degrees of freedom equal to one less than their sample sizes, and there are k samples, the total degrees of freedom is k less than the total sample size: df = N - k. Within Group Variance The variance due to the differences within individual samples, denoted MS(W) for Mean Square Within groups. This is the within group variation divided by its degrees of freedom. Scheffe' Test A test used to find where the differences between means lie when the Analysis of Variance indicates the means are not all equal. The Scheffe' test is generally used when the sample sizes are different. Tukey Test A test used to find where the differences between the means lie when the Analysis of Variance indicates the means are not all equal. The Tukey test is generally used when the sample sizes are all the same. Two-Way Analysis of Variance An extension to the one-way analysis of variance. There are two independent variables. There are three sets of hypotheses with the two-way ANOVA. The first null hypothesis is that there is no interaction between the two factors. The second null hypothesis is that the population means of the first factor are equal. The third null hypothesis is that the population means of the second factor are equal. Factors The two independent variables in a two-way ANOVA. Treatment Groups Groups formed by making all possible combinations of the two factors. For example, if the first factor has 3 levels and the second factor has 2 levels, then there will be 3x2=6 different treatment groups. Interaction Effect The effect one factor has on the other factor. Main Effect The effects of the independent variables.
Stats: F-Test
The F-distribution is formed by the ratio of two independent chi-square variables divided by their respective degrees of freedom. Since F is formed by chi-square, many of the chi-square properties carry over to the F distribution. The F-values are all non-negative. The distribution is non-symmetric. The mean is approximately 1. There are two independent degrees of freedom, one for the numerator, and one for the denominator. There are many different F distributions, one for each pair of degrees of freedom.
F-Test
The F-test is designed to test if two population variances are equal. It does this by comparing the ratio of two variances. So, if the variances are equal, the ratio of the variances will be 1.
All hypothesis testing is done under the assumption the null hypothesis is true
If the null hypothesis is true, then the F test-statistic given above can be simplified (dramatically). This ratio of sample variances will be the test statistic used. If the null hypothesis is false, then we will reject the null hypothesis that the ratio was equal to 1 and our assumption that they were equal. There are several different F-tables. Each one has a different level of significance. So, find the correct level of significance first, and then look up the numerator degrees of freedom and the denominator degrees of freedom to find the critical value. You will notice that all of the tables only give levels of significance for right tail tests. Because the F distribution is not symmetric, and there are no negative values, you may not simply take the opposite of the right critical value to find the left critical value. The way to find a left critical value is to reverse the degrees of freedom, look up the right critical value, and then take the reciprocal of this value. For example, the critical value with 0.05 on the left with 12 numerator and 15 denominator degrees of freedom is found by taking the reciprocal of the critical value with 0.05 on the right with 15 numerator and 12 denominator degrees of freedom.
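The reciprocal trick is easy to check with scipy's F quantile function; this sketch uses the same degrees of freedom as the example above.

from scipy.stats import f

# Right critical value: 0.05 in the right tail, 15 and 12 df
right = f.ppf(0.95, 15, 12)

# Left critical value: 0.05 in the left tail, 12 and 15 df, obtained
# by reversing the degrees of freedom and taking the reciprocal
left = 1 / right
print(left, f.ppf(0.05, 12, 15))   # the two values agree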
Avoiding Left Critical Values
Since the left critical values are a pain to calculate, they are often avoided altogether. This is the procedure followed in the textbook. You can force the F test into a right tail test by placing the sample with the large variance in the numerator and the smaller variance in the denominator. It does not matter which sample has the larger sample size, only which sample has the larger variance.
The numerator degrees of freedom will be the degrees of freedom for whichever sample has the larger variance (since it is in the numerator) and the denominator degrees of freedom will be the degrees of freedom for whichever sample has the smaller variance (since it is in the denominator). If a two-tail test is being conducted, you still have to divide alpha by 2, but you only look up and compare the right critical value. Assumptions / Notes: The larger variance should always be placed in the numerator. The test statistic is F = s1^2 / s2^2 where s1^2 > s2^2. Divide alpha by 2 for a two-tail test and then find the right critical value. If standard deviations are given instead of variances, they must be squared. When the degrees of freedom aren't given in the table, go with the value with the larger critical value (this happens to be the smaller degrees of freedom); this is so that you are less likely to reject in error (type I error). The populations from which the samples were obtained must be normal. The samples must be independent.
The populations from which the samples were obtained must be normally or approximately normally distributed. The samples must be independent. The variances of the populations must be equal.
Hypotheses
The null hypothesis will be that all population means are equal, the alternative hypothesis is that at least one mean is different. In the following, lower case letters apply to the individual samples and capital letters apply to the entire set collectively. That is, n is one of many sample sizes, but N is the total sample size.
Grand Mean
The grand mean of a set of samples is the total of all the data values divided by the total sample size. This requires that you have all of the sample data available to you, which is usually the case, but not always. It turns out that all that is necessary to perform a one-way analysis of variance are the number of samples, the sample means, the sample variances, and the sample sizes.
Another way to find the grand mean is to find the weighted average of the sample means, where the weight applied is the sample size: grand mean = sum(n * xbar) / N, summed over the samples.
Total Variation
The total variation (not variance) is the sum of the squares of the differences of each data value from the grand mean. It is made up of the between group variation and the within group variation: SS(T) = SS(B) + SS(W). The whole idea behind the analysis of variance is to compare the ratio of between group variance to within group variance. If the variance caused by the interaction between the samples is much larger when compared to the variance that appears within each group, then it is because the means aren't the same.
Between Group Variation
The variation due to the interaction between the samples is denoted SS(B) for Sum of Squares Between groups. If the sample means are close to each other (and therefore the Grand Mean) this will be small. There are k samples involved with one data value for each sample (the sample mean), so there are k-1 degrees of freedom. The variance due to the interaction between the samples is denoted MS(B) for Mean Square Between groups. This is the between group variation divided by its degrees of freedom. It is also denoted by s_b^2.
Within Group Variation
The variation due to differences within individual samples is denoted SS(W) for Sum of Squares Within groups. Each sample is considered independently; no interaction between samples is involved. The degrees of freedom is equal to the sum of the individual degrees of freedom for each sample. Since each sample has degrees of freedom equal to one less than their sample sizes, and there are k samples, the total degrees of freedom is k less than the total sample size: df = N - k. The variance due to the differences within individual samples is denoted MS(W) for Mean Square Within groups. This is the within group variation divided by its degrees of freedom. It is also denoted by s_w^2. It is the weighted average of the variances (weighted with the degrees of freedom).
F test statistic
Recall that a F variable is the ratio of two independent chisquare variables divided by their respective degrees of freedom. Also recall that the F test statistic is the ratio of two sample variances, well, it turns out that's exactly what we have here. The F test statistic is found by dividing the between group variance by the within group variance. The degrees of freedom for the numerator are the degrees of freedom for the between group (k-1) and the degrees of freedom for the denominator are the degrees of freedom for the within group (N-k).
Summary Table
All of this sounds like a lot to remember, and it is. However, there is a table which makes things really nice.
Source    SS              df    MS               F
Between   SS(B)           k-1   SS(B) / (k-1)    MS(B) / MS(W)
Within    SS(W)           N-k   SS(W) / (N-k)
Total     SS(W) + SS(B)   N-1
Notice that each Mean Square is just the Sum of Squares divided by its degrees of freedom, and the F value is the ratio of the mean squares. Do not put the largest variance in the numerator, always divide the between variance by the within variance. If the between variance is smaller than the within variance, then the means are really close to each other and you will fail to reject the claim that they are all equal. The degrees of freedom of the F-test are in the same order they appear in the table (nifty, eh?).
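For completeness, here is a sketch of the same computation in Python from summary statistics; the means, variances, and sizes below are made up, and they are the same three lists the TI-82 ANOVA program described later needs.

means = [5.0, 6.2, 5.8]        # illustrative sample means
variances = [1.1, 0.9, 1.3]    # illustrative sample variances
sizes = [10, 12, 11]           # illustrative sample sizes

k, N = len(means), sum(sizes)
grand_mean = sum(n * m for n, m in zip(sizes, means)) / N

# Between group: weighted squared deviations of means from the grand mean
ss_b = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
# Within group: sample variances weighted by their degrees of freedom
ss_w = sum((n - 1) * v for n, v in zip(sizes, variances))

ms_b = ss_b / (k - 1)          # between group variance
ms_w = ss_w / (N - k)          # within group variance
F = ms_b / ms_w                # compare to F(k-1, N-k) critical value
print(F)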
Decision Rule
The decision will be to reject the null hypothesis if the test statistic from the table is greater than the F critical value with k-1 numerator and N-k denominator degrees of freedom. If the decision is to reject the null, then at least one of the means is different. However, the ANOVA does not tell you where the difference lies. For this, you need another test, either the Scheffe' or Tukey test.
TI-82
Ok, now for the really good news. There's a program called ANOVA for the TI-82 calculator which will do all of the calculations and give you the values that go into the table for you. You must have the sample means, sample variances, and sample sizes to use the program. If you have the sum of
squares, then it is much easier to finish the table by hand (this is what we'll do with the two-way analysis of variance)
ANOVA Program
Performs a one-way Analysis of Variance. List 1 must contain the means of the samples, list 2 must contain the sample variances, and list 3 must contain the sample sizes. Note that the three lists must be the same size. The user is reminded of these requirements when running the program. The grand mean is displayed, followed by the sum of squares, degrees of freedom, and mean sum of squares for the between group and within group. The total sum of squares and degrees of freedom, along with the F test statistic, are also shown. Upon completion, the program will give the user the chance to run the Scheffe test if the sample sizes are different or the Tukey test if the sample sizes are the same. All possible pairs are compared.
Hypotheses
Both tests are set up to test if pairs of means are different. The formulas refer to mean i and mean j. The values of i and j vary, and the total number of tests will be equal to a combination of k objects, 2 at a time C(k,2), where k is the number of samples.
Scheffe' Test
The Scheffe' test is customarily used with unequal sample sizes, although it could be used with equal sample sizes. The critical value for the Scheffe' test is the degrees of freedom for the between variance times the critical value for the one-way ANOVA. This simplifies to be:
CV = (k-1) F(k-1,N-k,alpha)
The test statistic is a little bit harder to compute: F_s = (xbar_i - xbar_j)^2 / (MS(W) * (1/n_i + 1/n_j)). Pure mathematicians will argue that this shouldn't be called F because it doesn't have an F distribution (it's the degrees of freedom times an F), but we'll live with it. Reject H0 if the test statistic is greater than the critical value. Note, this is a right tail test. If there is no difference between the means, the numerator will be close to zero, and so performing a left tail test wouldn't show anything.
Tukey Test
The Tukey test is only usable when the sample sizes are the same. The critical value is looked up in a table. It is Table N in the Bluman text. There are actually several different tables, one for each level of significance. The number of samples, k, is used as an index along the top, and the degrees of freedom for the within group variance, v = N-k, are used as an index along the left side. The test statistic is found by dividing the difference between the means by the square root of the ratio of the within group variance and the sample size: q = (xbar_i - xbar_j) / sqrt(MS(W) / n). Reject the null hypothesis if the absolute value of the test statistic is greater than the critical value (just like the linear correlation coefficient critical values).
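A sketch of both pairwise test statistics in Python; the means, sizes, and MS(W) values are made up, and the critical values still come from the tables.

from math import sqrt

ms_w = 1.1                 # illustrative within group variance, MS(W)
xbar_i, xbar_j = 6.2, 5.0  # illustrative pair of sample means
n_i, n_j = 12, 10

# Scheffe' test statistic (used with unequal sample sizes)
F_s = (xbar_i - xbar_j) ** 2 / (ms_w * (1 / n_i + 1 / n_j))

# Tukey test statistic (used with equal sample sizes, say n = 10 each)
n = 10
q = (xbar_i - xbar_j) / sqrt(ms_w / n)

print(F_s, q)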
TI-82
The ANOVA program for the TI-82 will do all of the pairwise comparisons for you after it has given the ANOVA summary table. You will need to know how to find the critical values and make the comparisons.
The populations from which the samples were obtained must be normally or approximately normally distributed. The samples must be independent. The variances of the populations must be equal. The groups must have the same sample size.
Hypotheses
There are three sets of hypotheses with the two-way ANOVA. The null hypotheses for each of the sets are given below.
1. The population means of the first factor are equal. This is like the one-way ANOVA for the row factor.
2. The population means of the second factor are equal. This is like the one-way ANOVA for the column factor.
3. There is no interaction between the two factors. This is similar to performing a test for independence with contingency tables.
Factors
The two independent variables in a two-way ANOVA are called factors. The idea is that there are two variables, factors, which affect the dependent variable. Each factor will have two or more levels within it, and the degrees of freedom for each factor is one less than the number of levels.
Treatment Groups
Treatment groups are formed by making all possible combinations of the two factors. For example, if the first factor has 3 levels and the second factor has 2 levels, then there will be 3x2=6 different treatment groups. As an example, let's assume we're planting corn. The type of seed and type of fertilizer are the two factors we're considering in this example. This example has 3x5=15 treatment groups. There are 3-1=2 degrees of freedom for the type of seed, and 5-1=4 degrees of freedom for the type of fertilizer. There are 2*4 = 8 degrees of freedom for the interaction between the type of seed and type of fertilizer. The data that actually appears in the table are samples. In this case, 2 samples from each treatment group were taken.
              Fert I      Fert II    Fert III    Fert IV     Fert V
Seed A-402    106, 110    95, 100    94, 107     103, 104    100, 102
Seed B-894    110, 112    98, 99     100, 101    108, 112    105, 107
Seed C-952    94, 97      86, 87     98, 99      99, 101     94, 98
The main effect involves the independent variables one at a time. The interaction is ignored for this part. Just the rows or just the columns are used, not mixed. This is the part which is similar to the one-way analysis of variance. Each of the variances calculated to analyze the main effects are like the between variances
Interaction Effect
The interaction effect is the effect that one factor has on the other factor. The degrees of freedom here is the product of the two degrees of freedom for each factor.
Within Variation
The Within variation is the sum of squares within each treatment group. You have one less than the sample size (remember all treatment groups must have the same sample size for a two-way ANOVA) for each treatment group. The total number of treatment groups is the product of the number of levels for each factor. The within variance is the within variation divided by its degrees of freedom. The within group is also called the error.
F-Tests
There is an F-test for each of the hypotheses, and the F-test is the mean square for each main effect and the interaction effect divided by the within variance. The numerator degrees of freedom come from each effect, and the denominator degrees of freedom is the degrees of freedom for the within variance in each case.
Two-Way ANOVA Table
It is assumed that main effect A has a levels (and A = a-1 df), main effect B has b levels (and B = b-1 df), n is the sample size of each treatment, and N = abn is the total sample size. Notice the overall degrees of freedom is once again one less than the total sample size.
Source               SS              df                  MS        F
Main Effect A        given           A: a-1              SS / df   MS(A) / MS(W)
Main Effect B        given           B: b-1              SS / df   MS(B) / MS(W)
Interaction Effect   given           A*B: (a-1)(b-1)     SS / df   MS(A*B) / MS(W)
Within               given           N - ab = ab(n-1)    SS / df
Total                sum of others   N - 1 = abn - 1
Summary
The following results were calculated using the Quattro Pro spreadsheet. It provides the p-values, and the critical values shown are for alpha = 0.05.
Source of Variation   SS          df   MS         F        F crit
Seed                  512.8667    2    256.4333   28.283   3.68
Fertilizer            449.4667    4    112.3667   12.393   3.06
Interaction           143.1333    8    17.8917    1.973    2.64
Within                136.0000    15   9.0667
Total                 1241.4667   29
From the above results, we can see that the main effects are both significant, but the interaction between them isn't. That is, the types of seed aren't all equal, and the types of fertilizer aren't all equal, but the type of seed doesn't interact with the type of fertilizer.
The two-way ANOVA, Example 13-9, in the Bluman text has the incorrect values in it. The student would have no way of knowing this because the book doesn't explain how to calculate the values. Here is the correct table:
Source of Variation   SS       df   MS       F
Sample                3.920    1    3.920    4.752
Column                9.680    1    9.680    11.733
Interaction           54.080   1    54.080   65.552
Within                3.300    4    0.825
Total                 70.980   7
The student will be responsible for finishing the table, not for coming up with the sum of squares which go into the table in the first place.