18-57405
BSEE-2102
Engineering Data Analysis
Probability
Probability is a measure quantifying the likelihood that events will occur.
Sample Space
The set of all possible outcomes of a statistical experiment is called the sample space and
is represented by the symbol S. Each outcome in a sample space is called an element or a
member of the sample space, or simply a sample point. If the sample space has a finite number of
elements, we may list the members, separated by commas and enclosed in braces. Thus, the sample space S of possible outcomes when a coin is flipped may be written S = {H, T}, where H and T correspond to heads and tails, respectively.
Event
An event is a subset of a sample space.
Probability of an event
The probability of an event A is the sum of the weights of all sample points in A.
Therefore, 0 ≤ P(A) ≤ 1, P(∅) = 0, and P(S) = 1. Furthermore, if A1, A2, A3, ... is a sequence of
mutually exclusive events, then P(A1 ∪ A2 ∪ A3 ∪ ···) = P(A1) + P(A2) + P(A3) + ···.
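As an illustration of this definition, here is a minimal Python sketch using an assumed fair six-sided die (not part of the original notes), in which each sample point carries weight 1/6 and P(A) is the sum of the weights of the points in A.

    # Minimal sketch (assumed example): a fair six-sided die.
    # Each sample point carries weight 1/6; P(A) is the sum of the weights of the points in A.
    weights = {outcome: 1/6 for outcome in range(1, 7)}   # sample space S = {1, 2, ..., 6}

    def prob(event):
        """Probability of an event = sum of the weights of its sample points."""
        return sum(weights[point] for point in event)

    print(prob({2, 4, 6}))        # P("roll is even") = 0.5
    print(prob(set()))            # P(empty set) = 0
    print(prob(set(weights)))     # P(S) = 1, up to floating-point rounding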
Conditional probability
The probability of an event B occurring when it is known that some event A has occurred
is called a conditional probability and is denoted by P(B|A). The symbol P(B|A) is usually read
“the probability that B occurs given that A occurs” or simply “the probability of B, given A.”
The conditional probability of B, given A, denoted by P(B|A), is defined by P(B|A) =
P(A∩B)/P(A), provided P(A) > 0.
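A minimal Python sketch of the definition P(B|A) = P(A∩B)/P(A); the numerical values are assumed for illustration only.

    # Sketch of P(B|A) = P(A and B) / P(A), defined only when P(A) > 0.
    def conditional(p_a_and_b, p_a):
        if p_a <= 0:
            raise ValueError("P(A) must be positive")
        return p_a_and_b / p_a

    # Assumed example: P(A) = 0.4 and P(A and B) = 0.1, so P(B|A) = 0.25.
    print(conditional(0.1, 0.4))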
Random variable
A random variable is a function that associates a real number with each element in the
sample space.
Joint probability distribution
If X and Y are two discrete random variables, the probability distribution for their
simultaneous occurrence can be represented by a function with values f(x,y) for any pair of
values (x,y) within the range of the random variables X and Y . It is customary to refer to this
function as the joint probability distribution of X and Y . Hence, in the discrete case,
f(x,y)=P(X = x,Y = y); that is, the values f(x,y) give the probability that outcomes x and y occur
at the same time.
The function f(x,y) is a joint probability distribution or probability mass function of the
discrete random variables X and Y.
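A minimal Python sketch of a discrete joint probability mass function f(x, y) = P(X = x, Y = y); the table of probabilities below is assumed, not taken from the notes.

    # Joint pmf of two discrete random variables X and Y, stored as a dictionary.
    # The probabilities are an assumed example; a valid joint pmf must sum to 1.
    joint = {(0, 0): 0.10, (0, 1): 0.20,
             (1, 0): 0.30, (1, 1): 0.40}

    # Marginal distribution of X: g(x) = sum of f(x, y) over all y.
    g = {}
    for (x, y), p in joint.items():
        g[x] = g.get(x, 0.0) + p

    print(sum(joint.values()))   # 1.0
    print(g)                     # {0: 0.3, 1: 0.7}, up to floating-point rounding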
Binomial distribution
The number X of successes in n Bernoulli trials is called a binomial random variable.
The probability distribution of this discrete random variable is called the binomial distribution,
and its values will be denoted by b(x;n,p) since they depend on the number of trials and the
probability of a success on a given trial. For example, if X is the number of defective items obtained in n = 3 trials with probability p = 1/4 of a defective on each trial, then P(X = 2) = f(2) = b(2; 3, 1/4) = 9/64.
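The value b(2; 3, 1/4) = 9/64 follows from the standard binomial formula b(x; n, p) = C(n, x) p^x (1 − p)^(n − x), where C(n, x) = n!/(x!(n − x)!) is the number of ways of choosing x successes from n trials. A minimal Python check:

    from math import comb

    def binomial_pmf(x, n, p):
        """b(x; n, p) = C(n, x) * p**x * (1 - p)**(n - x)"""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    print(binomial_pmf(2, 3, 0.25))   # 0.140625
    print(9 / 64)                     # 0.140625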
Hypergeometric distribution
Hypergeometric distribution does not require independence and is based on sampling
done without replacement.
Probability distribution of the hypergeometric random variable X, the number of
successes in a random sample of size n selected from N items of which k are labeled success and
N − k labeled failure, is
h(x; N, n, k) = C(k, x) C(N − k, n − x) / C(N, n),   max{0, n − (N − k)} ≤ x ≤ min{n, k},
where C(a, b) denotes the binomial coefficient, the number of ways of choosing b items from a. The range of x is determined by the three binomial coefficients in the definition: x and n − x can be no larger than k and N − k, respectively, and neither can be less than 0. Usually, when both k (the number of successes) and N − k (the number of failures) are larger than the sample size n, the range of the hypergeometric random variable is x = 0, 1, 2, ..., n.
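A minimal Python sketch of the hypergeometric probability mass function; the values N = 20, n = 4, and k = 5 are assumed for illustration.

    from math import comb

    def hypergeometric_pmf(x, N, n, k):
        """h(x; N, n, k) = C(k, x) * C(N - k, n - x) / C(N, n)"""
        return comb(k, x) * comb(N - k, n - x) / comb(N, n)

    # Assumed example: N = 20 items, k = 5 labeled success, sample of size n = 4.
    print(hypergeometric_pmf(1, 20, 4, 5))                         # P(X = 1)
    print(sum(hypergeometric_pmf(x, 20, 4, 5) for x in range(5)))  # sums to 1 over x = 0, ..., 4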
Poisson distribution
The probability distribution of the Poisson random variable X, representing the number
of outcomes occurring in a given time interval or specified region denoted by t, is
p(x; λt) = e^(−λt) (λt)^x / x!,   x = 0, 1, 2, ...,
where λ is the average number of outcomes per unit time, distance, area, or volume.
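A minimal Python sketch of the Poisson probabilities p(x; λt); the rate λ = 2 per hour and the interval t = 3 hours are assumed for illustration.

    from math import exp, factorial

    def poisson_pmf(x, lam_t):
        """p(x; lambda*t) = exp(-lambda*t) * (lambda*t)**x / x!"""
        return exp(-lam_t) * lam_t**x / factorial(x)

    # Assumed example: lambda = 2 outcomes per hour over t = 3 hours, so lambda*t = 6.
    print(poisson_pmf(4, 6))                          # P(X = 4)
    print(sum(poisson_pmf(x, 6) for x in range(60)))  # approximately 1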
Hypothesis
A hypothesis is a proposed explanation for a phenomenon. For a hypothesis to be a
scientific hypothesis, the scientific method requires that one can test it.
Statistical Hypothesis
A statistical hypothesis is an assertion or conjecture concerning one or more populations.
Hypothesis Testing
Hypothesis testing is a way to test the results of a survey or experiment to see whether they are meaningful. It also tests whether your results are valid by figuring out the odds that they could have happened by chance.
Two types of hypothesis and level of significance
The structure of hypothesis testing will be formulated with the use of the term null
hypothesis, which refers to any hypothesis we wish to test and is denoted by H0. The rejection of
H0 leads to the acceptance of an alternative hypothesis, denoted by H1. An understanding of the
different roles played by the null hypothesis (H0) and the alternative hypothesis (H1) is crucial to
one’s understanding of the rudiments of hypothesis testing. The alternative hypothesis H1
usually represents the question to be answered or the theory to be tested, and thus its
specification is crucial. The null hypothesis H0 nullifies or opposes H1 and is often the logical
complement to H1. As the reader gains more understanding of hypothesis testing, he or she
should note that the analyst arrives at one of the two following conclusions:
reject H0 in favor of H1 because of sufficient evidence in the data or
fail to reject H0 because of insufficient evidence in the data.
The probability of committing a type I error, also called the level of significance, is denoted
by the Greek letter α.
Hypothesis testing approach
Testing a hypothesis is an important part of the scientific method that allows you to
evaluate the validity of an educated guess. Common approaches include the following:
Z-test
F-Test
Normality
Chi-square test for independence
Analysis of variance (ANOVA)
Mood’s median
Welch’s T-test
Kruskal-Wallis H test
Box-Cox Power Transformation
Test statistics and steps to hypothesis testing
Hypothesis testing can be one of the most confusing aspects for students, mostly because
before you can even perform a test, you have to know what your null hypothesis is. Often, those
tricky word problems that you are faced with can be difficult to decipher. But it’s easier than you
think; all you need to do is:
Step 1: State the Null hypothesis.
Step 2: State the Alternate Hypothesis.
Step 3: Draw a picture to help you visualize the problem.
Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above) from the z-table.
When the hypothesis concerns a correlation coefficient, the test statistic is t = r/s_r, where the numerator r is the sample estimate of the population correlation ρ and s_r = √((1 − r²)/(n − 2)) is the standard error of r.
Observation: If we solve this equation for r, we get r = t/√(n − 2 + t²).
Observation: The statistic can be used to test the hypothesis that the population random variables x and y are independent, i.e., ρ = 0.
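A minimal Python sketch of this correlation test statistic; the sample values r = 0.6 and n = 20 are assumed for illustration.

    from math import sqrt

    def correlation_t(r, n):
        """t = r / s_r with s_r = sqrt((1 - r**2) / (n - 2));
        under H0: rho = 0 this follows a t-distribution with n - 2 degrees of freedom."""
        s_r = sqrt((1 - r**2) / (n - 2))
        return r / s_r

    # Assumed example: sample correlation r = 0.6 computed from n = 20 pairs.
    print(correlation_t(0.6, 20))   # about 3.18; compare with the critical t value for 18 df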
Z-test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. Because of the central limit theorem,
many test statistics are approximately normally distributed for large samples. For each
significance level, the Z-test has a single critical value (for example, 1.96 for 5% two-tailed),
which makes it more convenient than the Student's t-test, which has separate critical values for
each sample size. Therefore, many statistical tests can be conveniently performed as
approximate Z-tests if the sample size is large or the population variance is known. If the
population variance is unknown (and therefore has to be estimated from the sample itself) and
the sample size is not large (n < 30), the Student's t-test may be more appropriate.
Z-test for one sample group and one group proportion
A one-proportion z test is a hypothesis test for comparing a group proportion to a specified population proportion. The test requires the analyst to state a null hypothesis and an alternative hypothesis. The two hypotheses are mutually exclusive: if one is true, the other must be false, and vice versa. The value of the z test statistic is computed from the observed proportion, the sample size, and the null-hypothesis value.
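A minimal Python sketch of the one-proportion z statistic, z = (p̂ − p0)/√(p0(1 − p0)/n); the observed proportion, sample size, and null value are assumed for illustration.

    from math import sqrt

    def one_proportion_z(p_hat, p0, n):
        """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)"""
        return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

    # Assumed example: 60 successes in n = 100 trials tested against p0 = 0.5.
    print(one_proportion_z(0.60, 0.50, 100))   # 2.0; compare with ±1.96 at the 5% level (two-tailed)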
Z-test for one sample means
A one sample z test is one of the most basic types of hypothesis test. In order to run a one
sample z test, you work through several steps:
Step 1: State the Null Hypothesis. This is one of the common stumbling blocks: in order
to make sense of your sample and have the one sample z test give you the right information, you
must make sure you have written the null hypothesis and alternate hypothesis correctly. For
example, you might be asked to test the hypothesis that the mean weight gain of pregnant women
was more than 30 pounds. Your null hypothesis would be H0: μ = 30 and your alternate
hypothesis would be H1: μ > 30.
Step 2: Use the z-formula to find a z-score.
All you do is put the values you are given into the formula. Your question should give
you the sample mean (x̄), the standard deviation (σ), and the number of items in the sample (n).
Your hypothesized mean (in other words, the mean you are testing the hypothesis for, or
your null hypothesis) is μ0.
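The z-formula referred to in Step 2 is z = (x̄ − μ0)/(σ/√n). A minimal Python sketch with assumed numbers for the weight-gain illustration (x̄ = 32, σ = 6, n = 36, μ0 = 30):

    from math import sqrt

    def one_sample_z(x_bar, mu0, sigma, n):
        """z = (x_bar - mu0) / (sigma / sqrt(n))"""
        return (x_bar - mu0) / (sigma / sqrt(n))

    # Assumed numbers: sample mean 32 lb, sigma = 6 lb, n = 36 women, null value mu0 = 30 lb.
    print(one_sample_z(32, 30, 6, 36))   # 2.0; for H1: mu > 30 compare with z = 1.645 at alpha = 0.05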
F-test
An F-test is any statistical test in which the test statistic has an F-distribution under
the null hypothesis. It is most often used when comparing statistical models that have been fitted
to a data set, in order to identify the model that best fits the population from which the data were
sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least
squares.
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation
coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or
negative, respectively. Achieving a value of +1 or -1 means that all your data points are included
on the line of best fit – there are no data points that show any variation away from this line.
Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation
around the line of best fit. The closer the value of r is to 0, the greater the variation around the
line of best fit. Different forms of relationship produce correspondingly different correlation coefficients.
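A minimal Python sketch that computes the Pearson correlation coefficient r from paired data; the data values are assumed for illustration.

    from math import sqrt

    def pearson_r(xs, ys):
        """r = Sxy / sqrt(Sxx * Syy), using sums of cross-products and squared deviations."""
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sxx = sum((x - x_bar) ** 2 for x in xs)
        syy = sum((y - y_bar) ** 2 for y in ys)
        return sxy / sqrt(sxx * syy)

    # Assumed example data:
    xs = [1, 2, 3, 4, 5]
    ys = [2, 4, 5, 4, 6]
    print(pearson_r(xs, ys))   # about 0.85, a fairly strong positive association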
Linear regression
Linear regression is a basic and commonly used type of predictive analysis. The overall
idea of regression is to examine two things: (1) does a set of predictor variables do a good job in
predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign
of the beta estimates–impact the outcome variable? These regression estimates are used to
explain the relationship between one dependent variable and one or more independent
variables. The simplest form of the regression equation with one dependent and one independent
variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c =
constant, b = regression coefficient, and x = score on the independent variable.
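A minimal least-squares sketch of the simple regression equation y = c + b*x; the data points are assumed for illustration.

    def simple_linear_regression(xs, ys):
        """Least-squares estimates for y = c + b*x:
        b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and c = y_bar - b * x_bar."""
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
        c = y_bar - b * x_bar
        return c, b

    # Assumed example data:
    xs = [1, 2, 3, 4, 5]
    ys = [2, 4, 5, 4, 6]
    c, b = simple_linear_regression(xs, ys)
    print(c, b)        # intercept c = 1.8, slope (regression coefficient) b = 0.8
    print(c + b * 6)   # predicted y at x = 6: 6.6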
Chi-square test for goodness-of-fit
A goodness-of-fit test between observed and expected frequencies is based on the quantity
χ² = Σ (oi − ei)²/ei,
where the sum extends over all cells and oi and ei are the observed and expected frequencies of the ith cell.
The expected frequencies are obtained by multiplying each cell probability by the total
number of observations. As before, we round these frequencies to one decimal. Thus, the
expected number of low-income voters in our sample who favor the tax reform is estimated to be
(336/1000)(598/1000)(1000) = (336)(598)/1000 = 200.9
when H0 is true. The general rule for obtaining the expected frequency of any cell is
given by the following formula:
expected frequency = [(column total) × (row total)] / (grand total).
The expected frequency for each cell is recorded in parentheses beside the actual
observed value in Table 10.7. Note that the expected frequencies in any row or column add up to
the appropriate marginal total. In our example, we need to compute only two expected
frequencies in the top row of Table 10.7 and then find the others by subtraction. The number of
degrees of freedom associated with the chi-squared test used here is equal to the number of cell
frequencies that may be filled in freely when we are given the marginal totals and the grand total,
and in this illustration that number is 2. A simple formula providing the correct number of
degrees of freedom is v = (r − 1)(c − 1).
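A minimal Python sketch of the expected-frequency rule and the chi-square statistic for a contingency table; the observed counts below are assumed and are not the Table 10.7 data.

    def chi_square_independence(table):
        """For each cell: expected = (row total * column total) / grand total.
        Chi-square = sum of (observed - expected)^2 / expected over all cells,
        with (r - 1)(c - 1) degrees of freedom."""
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        grand = sum(row_totals)
        chi_sq = 0.0
        for i, row in enumerate(table):
            for j, observed in enumerate(row):
                expected = row_totals[i] * col_totals[j] / grand
                chi_sq += (observed - expected) ** 2 / expected
        df = (len(table) - 1) * (len(table[0]) - 1)
        return chi_sq, df

    # Assumed 2 x 3 table of observed counts:
    observed = [[30, 20, 10],
                [20, 30, 40]]
    print(chi_square_independence(observed))   # (chi-square statistic, degrees of freedom)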