Unit IV - Analytics Tasks (Students)
DR RAMESH CHATURVEDI
SUCHITRA PANDEY (FPM)
Exploratory Data Analysis - Data Tabulation
Tabulation of data is the process of organizing data into a table, which is a structured display of
numbers in rows and columns.
It makes data easier to understand and compare
It allows for statistical analysis and interpretation
It helps to reduce confusion when analyzing large data sets
Data is classified into groups based on characteristics like gender, nationality, profession, age,
salary, or test scores.
The data is then organized into a table with rows and columns.
Frequency table
A frequency table is a table that shows how many times a value or characteristic appears in a
data set. It's a way to summarize data by organizing it into categories and showing how often
each category appears.
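As a minimal sketch of how a frequency table can be built programmatically, here is a small Python example using only the standard library; the grade data is made up for illustration:

```python
# A minimal sketch of a frequency table, using made-up test-grade data.
from collections import Counter

grades = ["A", "B", "B", "C", "A", "B", "C", "C", "C", "A"]
freq = Counter(grades)  # counts how often each category appears

print(f"{'Grade':<8}{'Frequency':<12}{'Relative frequency'}")
for grade, count in sorted(freq.items()):
    print(f"{grade:<8}{count:<12}{count / len(grades):.2f}")
```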
Point Estimates and Interval Estimates
In statistics, a "point estimate" is a single value used to estimate a population parameter based on
a sample, while an "interval estimate" provides a range of values within which the true population
parameter is likely to fall, offering a more comprehensive picture with a margin of error; essentially,
a point estimate is a best guess, while an interval estimate gives a range of plausible values around
that guess.
A point estimate is defined as a calculation in which a sample statistic is used to estimate or approximate an unknown population parameter. For example, the average height of a random sample can be used to estimate the average height of a larger population.
An "interval estimate" is simply a range of values that is likely to contain the true population
parameter based on a sample, while a "confidence interval" is a specific type of interval estimate
where you also specify a level of confidence (like 95%) that the true population parameter falls
within that range; essentially, a confidence interval adds a probability statement to the interval
estimate.
Interval estimate: "Based on our sample data, the average height of adults in this city is likely
between 5.5 and 6.5 feet."
Confidence interval: "With 95% confidence, we can say that the average height of adults in this city
is between 5.6 and 6.4 feet."
All confidence intervals are interval estimates, but not all interval estimates are confidence intervals.
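A minimal sketch of both ideas in Python, assuming the small made-up height sample below and using scipy for the t-based interval:

```python
import numpy as np
from scipy import stats

# Made-up sample of adult heights (feet), for illustration only.
heights = np.array([5.8, 6.1, 5.9, 6.3, 5.7, 6.0, 5.6, 6.2])

point_estimate = heights.mean()  # single best guess for the population mean
sem = stats.sem(heights)         # standard error of the mean

# 95% confidence interval for the population mean (t-distribution, n - 1 df)
ci_low, ci_high = stats.t.interval(0.95, len(heights) - 1,
                                   loc=point_estimate, scale=sem)
print(f"Point estimate: {point_estimate:.2f} ft")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f}) ft")
```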
Significance level (α): the level at which the researcher wants to test the hypothesis (0.1%, 1%, or 5%).
(1 - α) is the confidence level.
In a two-tailed test, alpha is split into two parts (α/2).
For a 5% level of significance in a two-tailed test, α/2 = 2.5% goes to the left tail and 2.5% to the right. For a one-tailed test, the full 5% significance level stays in the specific tail.
If the alternative hypothesis states that the population mean is less than the hypothesized population mean, it is a left-tailed test.
If the alternative hypothesis states that the population mean is more than the hypothesized population mean, it is a right-tailed test.
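A minimal sketch of how these tail areas translate into critical values, assuming a z-based test and using scipy's normal quantile function:

```python
from scipy.stats import norm

alpha = 0.05
z_two_tailed = norm.ppf(1 - alpha / 2)  # 2.5% in each tail -> about 1.96
z_right_tail = norm.ppf(1 - alpha)      # full 5% in the right tail -> about 1.645
z_left_tail = norm.ppf(alpha)           # full 5% in the left tail -> about -1.645
print(z_two_tailed, z_right_tail, z_left_tail)
```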
Type I error (denoted by α): the null hypothesis is rejected when it is true (the effect is actually not there, but the researcher concludes that it is).
If we use a 95% confidence level, then the level of significance is 5%. There is a 95% chance that the calculated statistic falls in the acceptance region when the null hypothesis is true, and only a 5% chance that we reject a true null hypothesis.
Type II error (denoted by β): the null hypothesis is not rejected even when it is false.
The null hypothesis should have been rejected, but the observed test statistic lands in the acceptance region, so the null hypothesis is still accepted.
Power of the test (1 - β): The power of a statistical test is the probability that the test will correctly reject a false null hypothesis. In other words, it measures the test's ability to detect an effect or difference when one actually exists.
Test statistic: it represents how many standard errors the sample mean lies from the hypothesized population mean.
Beta (β):
Represents the probability of making a Type II error (failing to reject a false null
hypothesis).
Power (1 - β):
When you subtract beta from 1, you get the power of the test, which is the
probability of correctly rejecting a false null hypothesis.
Example: If a test has a beta of 0.2, then its power is 0.8 (1 - 0.2), meaning
there is an 80% chance of detecting a true effect if it exists.
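A minimal sketch of how β and power can be computed, assuming a right-tailed z-test with known σ and the made-up values below:

```python
from scipy.stats import norm

mu0, mu1 = 100, 105  # hypothesized mean and assumed true mean (made up)
sigma, n = 15, 50    # assumed population SD and sample size
alpha = 0.05

se = sigma / n ** 0.5
x_crit = mu0 + norm.ppf(1 - alpha) * se     # sample-mean cutoff for rejecting H0
beta = norm.cdf(x_crit, loc=mu1, scale=se)  # P(fail to reject H0 | H0 is false)
power = 1 - beta                            # P(correctly reject a false H0)
print(f"beta = {beta:.3f}, power = {power:.3f}")
```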
Test Statistic
Test Statistic: The test statistic measures how far the sample data deviates
from what would be expected under the null hypothesis. It essentially
quantifies the difference between the observed data and the null
hypothesis prediction.
How it Works:
1. Calculate the Test Statistic: Different statistical tests have different formulas for calculating the test statistic. The formula depends on the type of data, the type of hypothesis being tested, and the assumptions of the test.
2. Compare to a Distribution: The calculated test statistic is then compared to a known probability distribution (like the t-distribution, z-distribution, chi-square distribution, etc.). This distribution represents the range of values the test statistic could take if the null hypothesis were true.
3. Determine the P-value: The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.
4. Make a Decision:
• If the p-value is small (typically less than a predetermined significance level, often 0.05), it suggests that the observed data is unlikely to have occurred if the null hypothesis were true. In this case, you reject the null hypothesis.
• If the p-value is large, it suggests that the observed data is reasonably likely to have occurred if the null hypothesis were true. In this case, you fail to reject the null hypothesis.
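A minimal sketch of the four steps above, using a one-sample t-test on made-up data against a hypothesized population mean of 50:

```python
import numpy as np
from scipy import stats

sample = np.array([52, 48, 55, 51, 49, 53, 50, 54])  # made-up data
mu0 = 50                                             # hypothesized population mean

# Steps 1-3: scipy computes the test statistic and its p-value.
t_stat, p_value = stats.ttest_1samp(sample, mu0)

# Step 4: compare the p-value to the significance level.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")
```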
Correlation
In correlation analysis, we're interested in determining if there's a linear relationship between two variables.
The null and alternative hypotheses are used to formally test this relationship.
Let's consider the population correlation coefficient, denoted by ρ (rho). It measures the strength and
direction of the linear relationship between two variables in the population. The sample correlation
coefficient, denoted by r, is an estimate of ρ based on sample data.
Note: It is very important to remember that correlation does not equal causation. Even if a strong
correlation is found, it does not mean that one variable causes the other.
• Null Hypothesis (H₀): ρ = 0
• This states that there is no linear correlation between the two variables in the population. In other words,
they are not linearly related.
• Alternative Hypothesis (H₁): ρ ≠ 0
• This states that there is a linear correlation between the two variables in the population. They are linearly
related.
• This is a two-tailed test, as it tests for any linear relationship, whether positive or negative.
One-Tailed Alternatives (Less Common but Possible):
In some specific cases, you might use one-tailed hypotheses:
• H₁: ρ > 0
• This states that there is a positive linear correlation between the two variables.
• H₁: ρ < 0
• This states that there is a negative linear correlation between the two variables.
1. Calculate the Sample Correlation Coefficient (r): You calculate the correlation coefficient (r) from your sample data.
2. Calculate the Test Statistic: Based on the sample correlation coefficient (r) and the sample size (n), a test statistic (often a t-statistic) is calculated.
3. Determine the P-Value: The p-value is calculated based on the test statistic and the appropriate distribution (usually the t-distribution).
4. Compare the P-Value to Alpha:
• If the p-value is less than or equal to the significance level (alpha, typically 0.05), you reject the null hypothesis. This suggests that there is a statistically significant correlation between the variables.
• If the p-value is greater than alpha, you fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there is a statistically significant correlation.
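A minimal sketch of these four steps for the correlation test, using scipy's pearsonr on made-up data:

```python
import numpy as np
from scipy import stats

# Made-up data chosen to be positively related, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

r, p_value = stats.pearsonr(x, y)           # sample r and two-tailed p-value
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))  # the underlying t-statistic

alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"r = {r:.3f}, t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```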
Regression
Simple Linear Regression
In the context of regression analysis, the null and alternative hypotheses are used to test whether there
is a statistically significant relationship between the independent variable(s) and the dependent
variable. Here's a breakdown:
Simple Linear Regression (One Independent Variable):
Let's consider a simple linear regression model:
Y = β₀ + β₁X + ε
Where:
• Y is the dependent variable.
• X is the independent variable.
• β₀ is the y-intercept.
• β₁ is the slope of the line (the coefficient of X).
• ε is the error term.
The line of best fit and the least squares method are intrinsically linked. The
least squares method is the mathematical process used to determine the
equation of the line of best fit. Here's how they relate:
Line of Best Fit:
• Visual Representation: The line of best fit is the visual representation on a
scatter plot of the linear relationship between two variables. It's the line
that appears to best approximate the trend of the data points.
• Goal: The goal is to draw a line that minimizes the overall "error" or distance
between the line and the data points.
Least Squares Method:
• Mathematical Technique: The least squares method is the mathematical
technique used to calculate the precise equation of the line of best fit.
• Minimizing Squared Errors: It works by minimizing the sum of the squared
vertical distances (residuals) between each data point and the line.
• Equation Calculation: The method produces the slope (m) and y-intercept
(b) of the line in the equation y = mx + b.
• Precision: It provides the most statistically accurate line of best fit for a
given set of data points.
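A minimal sketch of the least squares calculation itself, applying the textbook slope and intercept formulas to made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.2, 2.8, 4.5, 3.7, 5.5])

# Slope m minimizes the sum of squared vertical distances (residuals).
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()  # the line of best fit passes through the means

residuals = y - (m * x + b)
print(f"y = {m:.3f}x + {b:.3f}")
print(f"sum of squared residuals = {(residuals ** 2).sum():.3f}")
```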
Null and alternate hypothesis in simple linear regression
In simple linear regression, the hypotheses concern the slope β₁:
• Null Hypothesis (H₀): β₁ = 0. The independent variable X has no linear effect on Y.
• Alternative Hypothesis (H₁): β₁ ≠ 0. X has a statistically significant linear effect on Y.
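A minimal sketch of this slope test with scipy's linregress, reusing the made-up data from the least squares sketch above:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.2, 2.8, 4.5, 3.7, 5.5])

result = stats.linregress(x, y)  # least squares fit plus the test of H0: beta1 = 0
print(f"slope (beta1) = {result.slope:.3f}, intercept (beta0) = {result.intercept:.3f}")
print(f"p-value for H0: beta1 = 0 -> {result.pvalue:.4f}")
```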
Independent t-test
The independent-samples t-test (or independent t-test, for short) compares the means between two unrelated groups on the same continuous, dependent variable.
Assumptions:
Assumption #1: Your dependent variable should be measured on a continuous scale.
Assumption #2: Your independent variable should consist of two categorical, independent groups.
Assumption #3: You should have independence of observations
Assumption #4: There should be no significant outliers.
Assumption #5: Your dependent variable should be approximately normally distributed for each
group of the independent variable.
Assumption #6: There needs to be homogeneity of variances.
The concentration of cholesterol (a type of fat) in the blood is associated with the risk of
developing heart disease, such that higher concentrations of cholesterol indicate a higher level
of risk, and lower concentrations indicate a lower level of risk. If you lower the concentration of
cholesterol in the blood, your risk of developing heart disease can be reduced. Being
overweight and/or physically inactive increases the concentration of cholesterol in your blood.
Both exercise and weight loss can reduce cholesterol concentration. However, it is not known
whether exercise or weight loss is best for lowering cholesterol concentration.
Therefore, a researcher decided to investigate whether an exercise or weight loss intervention is
more effective in lowering cholesterol levels. To this end, the researcher recruited a random
sample of inactive males that were classified as overweight.
This sample was then randomly split into two groups: Group 1 underwent a calorie-controlled
diet and Group 2 undertook the exercise-training programme. In order to determine which
treatment programme was more effective, the mean cholesterol concentrations were
compared between the two groups at the end of the treatment programmes.
In SPSS Statistics, we separated the groups for analysis by creating a grouping
variable called Treatment (i.e., the independent variable), and gave the "diet
group" a value of "1" and the "exercise group" a value of "2" (i.e., the two
groups of the independent variable). Cholesterol concentrations were
entered under the variable name Cholesterol (i.e., the dependent variable).
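Outside SPSS, the same comparison can be sketched in Python with scipy; the cholesterol values below are made up for illustration:

```python
from scipy import stats

diet = [5.8, 6.1, 5.5, 6.0, 5.7, 5.9, 6.2, 5.6]      # Treatment = 1 (made-up data)
exercise = [5.2, 5.6, 5.1, 5.4, 5.0, 5.5, 5.3, 5.7]  # Treatment = 2 (made-up data)

# equal_var=True assumes homogeneity of variances (Assumption #6);
# use equal_var=False (Welch's t-test) if that assumption fails.
t_stat, p_value = stats.ttest_ind(diet, exercise, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```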
Dependent t-test
The dependent t-test (called the paired-samples t-test in SPSS Statistics) compares
the means between two related groups on the same continuous, dependent
variable.
Assumption #1: Your dependent variable should be measured on
a continuous scale
Assumption #2: Your independent variable should consist of two categorical,
"related groups" or "matched pairs".
Assumption #3: There should be no significant outliers in the differences between
the two related groups.
Assumption #4: The distribution of the differences in the dependent
variable between the two related groups should be approximately normally
distributed.
A group of Sports Science students (n = 20) are selected from the population to investigate
whether a 12-week plyometric-training programme improves their standing long jump
performance. In order to test whether this training improves performance, the students are tested
for their long jump performance before they undertake a plyometric-training programme and then
again at the end of the programme (i.e., the dependent variable is "standing long jump
performance", and the two related groups are the standing long jump values "before" and "after"
the 12-week plyometric-training programme).
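A minimal sketch of the equivalent paired analysis in Python, with made-up before/after standing long jump distances (cm):

```python
from scipy import stats

before = [190, 205, 198, 210, 185, 200, 195, 202]  # made-up data
after = [198, 212, 205, 215, 190, 208, 201, 210]   # made-up data

# The paired t-test examines the mean of the within-student differences.
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```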
THANK YOU