This document covers various analytics tasks, focusing on exploratory data analysis, data tabulation, and statistical methods such as frequency tables, descriptive statistics, and regression analysis. It explains concepts like the normal distribution, hypothesis testing, and the significance of correlation and regression in understanding relationships between variables. Additionally, it discusses the differences between point and interval estimates, as well as Type I and Type II errors in hypothesis testing.

Unit IV - Analytics Tasks

DR RAMESH CHATURVEDI
SUCHITRA PANDEY (FPM)
Exploratory Data Analysis - Data Tabulation

Tabulation of data is the process of organizing data into a table, which is a structured display of
numbers in rows and columns.
 It makes data easier to understand and compare
 It allows for statistical analysis and interpretation
 It helps to reduce confusion when analyzing large data sets
Data is classified into groups based on characteristics like gender, nationality, profession, age,
salary, or test scores.
The data is then organized into a table with rows and columns.
Frequency table

 A frequency table is a table that shows how many times a value or characteristic appears in a
data set. It's a way to summarize data by organizing it into categories and showing how often
each category appears.
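As a minimal sketch (the survey responses below are hypothetical, not from the slides), a frequency table can be produced in Python with pandas:

```python
import pandas as pd

# Hypothetical survey responses
responses = pd.Series(["Yes", "No", "Yes", "Yes", "No", "Maybe", "Yes"])

# Frequency table: how often each category appears in the data set
print(responses.value_counts())
```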
Descriptive Statistics

 Descriptive statistics are used to summarize and describe the main features of a dataset. They provide a concise overview of the data, helping us to understand its central tendency, variability, and distribution. Here are some commonly used descriptive statistics:
Measures of Central Tendency: Mean, Median, and Mode
Measures of Variability
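A minimal sketch of the central tendency measures in Python, using the standard `statistics` module on a hypothetical sample of test scores:

```python
import statistics

# Hypothetical sample: test scores of 10 students
scores = [56, 61, 61, 64, 68, 70, 72, 75, 81, 92]

print("Mean:  ", statistics.mean(scores))    # arithmetic average
print("Median:", statistics.median(scores))  # middle value
print("Mode:  ", statistics.mode(scores))    # most frequent value (61)
```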
Normal Distribution

 Data are assumed to follow a normal distribution.
 Bell-shaped (symmetric) distribution.
 Mean, median, and mode lie at the same point.
 Central tendency: any random variable tends to lie close to the three parameters (mean, median, mode), i.e., it centers around the mean (average), median (middle point), or mode (highest frequency).
Note: A random variable is a mathematical formalization of a quantity or object which depends on random events.
 Range: the difference between the highest and the lowest value.
 Standard deviation: how far the data points are, on average, from the mean.
 Variance: the square of the standard deviation.
 Interquartile range (IQR): the difference between the third quartile (Q3) and the first quartile (Q1).
 Quartile deviation: half of the interquartile range, i.e., (Q3 − Q1)/2.
The greater the spread, the greater the variability in the data. We have to check this variability.
Kurtosis: how the data are concentrated around the mean relative to the tails of the distribution.
Standard score (z-score): standardising the scores, z = (x − mean) / standard deviation, helps in comparison.
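These variability measures can be sketched in Python with NumPy and SciPy; the sample below is hypothetical:

```python
import numpy as np
from scipy import stats

scores = np.array([56, 61, 61, 64, 68, 70, 72, 75, 81, 92])

data_range = scores.max() - scores.min()          # range
std_dev    = scores.std(ddof=1)                   # sample standard deviation
variance   = scores.var(ddof=1)                   # variance = std dev squared
q1, q3     = np.percentile(scores, [25, 75])
iqr        = q3 - q1                              # interquartile range
quart_dev  = iqr / 2                              # quartile deviation
kurt       = stats.kurtosis(scores)               # excess kurtosis (0 for normal)
z_scores   = (scores - scores.mean()) / std_dev   # standard scores for comparison

print(data_range, std_dev, variance, iqr, quart_dev, kurt)
```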
Example: the 4-in-1 Maggi pack is the most sold (the mode).
 What is the difference in sales between the first and the third quarter? If this difference is very high, it may be due to a seasonality factor, a change in consumer behaviour, some technology that has recently come up in the market, or a substitute that might have emerged (i.e., something has happened in the market). Research has to be done to explain this variability.
 Parameter: a characteristic that describes a population. The population represents everyone or everything in a study.
 Statistic: a characteristic that describes a sample. A sample is a small subset of the population.
Description and Estimation (Point & Interval)

In statistics, a "point estimate" is a single value used to estimate a population parameter based on
a sample, while an "interval estimate" provides a range of values within which the true population
parameter is likely to fall, offering a more comprehensive picture with a margin of error; essentially,
a point estimate is a best guess, while an interval estimate gives a range of plausible values around
that guess.
A point estimate definition is a calculation where a sample statistic is used to estimate or
approximate an unknown population parameter. For example, the average height of a random
sample can be used to estimate the average height of a larger population.
An "interval estimate" is simply a range of values that is likely to contain the true population
parameter based on a sample, while a "confidence interval" is a specific type of interval estimate
where you also specify a level of confidence (like 95%) that the true population parameter falls
within that range; essentially, a confidence interval adds a probability statement to the interval
estimate.
Interval estimate: "Based on our sample data, the average height of adults in this city is likely
between 5.5 and 6.5 feet."
Confidence interval: "With 95% confidence, we can say that the average height of adults in this city
is between 5.6 and 6.4 feet."

All confidence intervals are interval estimates, but not all interval estimates are confidence intervals.
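A sketch of a point estimate and a 95% confidence interval in Python with SciPy, assuming a small hypothetical sample of heights:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of adult heights (feet)
heights = np.array([5.6, 5.9, 6.1, 5.8, 6.3, 5.7, 6.0, 6.2, 5.9, 6.1])

point_estimate = heights.mean()   # single best guess for the population mean
sem = stats.sem(heights)          # standard error of the mean

# 95% confidence interval using the t-distribution
low, high = stats.t.interval(0.95, df=len(heights) - 1,
                             loc=point_estimate, scale=sem)
print(f"Point estimate: {point_estimate:.2f} ft")
print(f"95% CI: ({low:.2f}, {high:.2f}) ft")
```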
Significance level (α): the level of significance at which the researcher wants to test the hypothesis (0.1%, 1%, 5%).
(1 − α) is the confidence level.
In a two-tailed test, alpha is split into two parts (α/2).
For a 5% level of significance in a two-tailed test, α is split as 5/2 (so 2.5% to the left and 2.5% to the right). For a one-tailed test, the full 5% significance level lies in the specific tail.
If the alternative states that the population mean is less than the hypothesized population mean, it is a left-tailed test.
If the alternative states that the population mean is more than the hypothesized population mean, it is a right-tailed test.
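The α/2 split can be illustrated by computing critical z-values with SciPy; a minimal sketch:

```python
from scipy import stats

alpha = 0.05

# Two-tailed test: alpha/2 in each tail
z_two = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96
# One-tailed test: all of alpha in one tail
z_one = stats.norm.ppf(1 - alpha)       # ≈ 1.645

print(f"Two-tailed critical values: ±{z_two:.3f}")
print(f"One-tailed critical value (right tail): {z_one:.3f}")
```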
Type I error (denoted by α): the null hypothesis is rejected when it is true (the effect is actually not there, but the researcher concludes there is an effect).
If we take a 95% confidence level, then the level of significance is 5%. 95% is the confidence that our calculated statistic will fall in the acceptance region for the null hypothesis; there is only a 5% chance that we reject a true null hypothesis.
Type II error (denoted by β): the null hypothesis is not rejected even when it is false.
That is, the null hypothesis should have been rejected, but the test still accepts (fails to reject) it.
Power of the test (1 − β): the power of a statistical test is the probability that the test will correctly reject a false null hypothesis. In other words, it measures the test's ability to detect an effect or difference when one actually exists.
Test statistic: it represents how many standard errors the sample mean lies from the hypothesized population mean.
Beta (β): represents the probability of making a Type II error (failing to reject a false null hypothesis).
Power (1 − β): when you subtract beta from 1, you get the power of the test, which is the probability of correctly rejecting a false null hypothesis.
Example: if a test has a beta of 0.2, then its power is 0.8 (1 − 0.2), meaning there is an 80% chance of detecting a true effect if it exists.
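One way to see the relationship between β and power numerically is with statsmodels' power calculator; a sketch assuming an independent-samples t-test, a hypothetical medium effect size of 0.5, and 50 observations per group:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical scenario: medium effect size, alpha = 0.05, 50 per group
analysis = TTestIndPower()
power = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)

beta = 1 - power  # probability of a Type II error
print(f"Power (1 - beta) = {power:.2f}, beta = {beta:.2f}")
```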
Test Statistic

 Test Statistic: The test statistic measures how far the sample data deviates
from what would be expected under the null hypothesis. It essentially
quantifies the difference between the observed data and the null
hypothesis prediction.
How it Works:
1. Calculate the Test Statistic: Different statistical tests have different formulas for calculating the test statistic. The formula depends on the type of data, the type of hypothesis being tested, and the assumptions of the test.
2. Compare to a Distribution: The calculated test statistic is then compared to a known probability distribution (like the t-distribution, z-distribution, chi-square distribution, etc.). This distribution represents the range of values the test statistic could take if the null hypothesis were true.
3. Determine the P-value: The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.
4. Make a Decision:
• If the p-value is small (typically less than a predetermined significance level, often 0.05), it suggests that the observed data is unlikely to have occurred if the null hypothesis were true. In this case, you reject the null hypothesis.
• If the p-value is large, it suggests that the observed data is reasonably likely to have occurred if the null hypothesis were true. In this case, you fail to reject the null hypothesis.
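These four steps can be traced with a one-sample t-test in SciPy; the sample values and the hypothesized mean of 70 below are invented for illustration:

```python
from scipy import stats

# Hypothetical sample; H0: population mean = 70
sample = [72, 75, 68, 74, 77, 71, 73, 76, 70, 78]

# Steps 1-3: compute the test statistic and its p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Step 4: decide at alpha = 0.05
alpha = 0.05
if p_value < alpha:
    print("Reject H0: the data are unlikely under the null hypothesis.")
else:
    print("Fail to reject H0.")
```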
Correlation

In correlation analysis, we're interested in determining if there's a linear relationship between two variables.
The null and alternative hypotheses are used to formally test this relationship.
Let's consider the population correlation coefficient, denoted by ρ (rho). It measures the strength and
direction of the linear relationship between two variables in the population. The sample correlation
coefficient, denoted by r, is an estimate of ρ based on sample data.

Note: It is very important to remember that correlation does not equal causation. Even if a strong
correlation is found, it does not mean that one variable causes the other.
• Null Hypothesis (H₀): ρ = 0
• This states that there is no linear correlation between the two variables in the population. In other words,
they are not linearly related.
• Alternative Hypothesis (H₁): ρ ≠ 0
• This states that there is a linear correlation between the two variables in the population. They are linearly
related.
• This is a two-tailed test, as it tests for any linear relationship, whether positive or negative.
One-Tailed Alternatives (Less Common but Possible):
In some specific cases, you might use one-tailed hypotheses:
• H₁: ρ > 0
• This states that there is a positive linear correlation between the two variables.
• H₁: ρ < 0
• This states that there is a negative linear correlation between the two variables.
1. Calculate the Sample Correlation Coefficient (r): You calculate the correlation coefficient (r) from your sample data.
2. Calculate the Test Statistic: Based on the sample correlation coefficient (r) and the sample size (n), a test statistic (often a t-statistic) is calculated.
3. Determine the P-Value: The p-value is calculated based on the test statistic and the appropriate distribution (usually the t-distribution).
4. Compare the P-Value to Alpha:
• If the p-value is less than or equal to the significance level (alpha, typically 0.05), you reject the null hypothesis. This suggests that there is a statistically significant correlation between the variables.
• If the p-value is greater than alpha, you fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there is a statistically significant correlation.
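SciPy's `pearsonr` performs steps 1-3 in a single call, returning r and the two-tailed p-value; the paired data below are hypothetical:

```python
from scipy import stats

# Hypothetical paired observations (e.g., hours studied vs. exam score)
x = [2, 4, 5, 7, 8, 10, 11, 13]
y = [50, 55, 60, 64, 66, 73, 75, 82]

r, p_value = stats.pearsonr(x, y)   # sample r and two-tailed p-value
print(f"r = {r:.3f}, p = {p_value:.4f}")

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: statistically significant linear correlation.")
else:
    print("Fail to reject H0.")
```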
Regression
Simple Linear Regression

In the context of regression analysis, the null and alternative hypotheses are used to test whether there
is a statistically significant relationship between the independent variable(s) and the dependent
variable. Here's a breakdown:
Simple Linear Regression (One Independent Variable):
Let's consider a simple linear regression model:
Y = β₀ + β₁X + ε
Where:
•Y is the dependent variable.
•X is the independent variable.
•β₀ is the y-intercept.
•β₁ is the slope of the line (the coefficient of X).
•ε is the error term.
 The line of best fit and the least squares method are intrinsically linked. The
least squares method is the mathematical process used to determine the
equation of the line of best fit. Here's how they relate:
 Line of Best Fit:
• Visual Representation: The line of best fit is the visual representation on a
scatter plot of the linear relationship between two variables. It's the line
that appears to best approximate the trend of the data points.
• Goal: The goal is to draw a line that minimizes the overall "error" or distance
between the line and the data points.
Least Squares Method:
• Mathematical Technique: The least squares method is the mathematical
technique used to calculate the precise equation of the line of best fit.
• Minimizing Squared Errors: It works by minimizing the sum of the squared
vertical distances (residuals) between each data point and the line.
• Equation Calculation: The method produces the slope (m) and y-intercept
(b) of the line in the equation y = mx + b.
• Precision: It provides the most statistically accurate line of best fit for a
given set of data points.
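The closed-form least squares solution can be written out directly; a NumPy sketch on hypothetical data points:

```python
import numpy as np

# Hypothetical data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares: minimize the sum of squared vertical residuals
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(f"Line of best fit: y = {m:.3f}x + {b:.3f}")
residuals = y - (m * x + b)
print("Sum of squared residuals:", np.sum(residuals ** 2))
```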
Null and alternative hypotheses in simple linear regression

 The hypotheses typically tested are:
•Null Hypothesis (H₀): β₁ = 0
•This means that there is no linear relationship between X and Y. In other words, changes in X do not
predict changes in Y. The slope of the regression line is zero.
•Alternative Hypothesis (H₁): β₁ ≠ 0
•This means that there is a linear relationship between X and Y. Changes in X do predict changes in Y.
The slope of the regression line is not zero.
•This is a two-tailed test, meaning we're looking for evidence of a relationship, regardless of whether it's
positive or negative.
You can also have one-tailed alternative hypotheses:
•H₁: β₁ > 0 (positive relationship)
•H₁: β₁ < 0 (negative relationship)
Multiple Regression
The researcher's goal is to be able to predict VO2max (an indicator of fitness
and health) based on these four attributes: age, weight, heart rate and
gender.
Running multiple regression in SPSS
Null and alternative hypotheses in multiple regression
In multiple regression, the null and alternative hypotheses are used to assess the significance of the overall
model and the individual predictor variables. Here's a breakdown:
Overall Model Significance (F-test):
•Null Hypothesis (H₀): β₁ = β₂ = ... = βₚ = 0
•This hypothesis states that all the regression coefficients (β₁, β₂, ..., βₚ) are equal to zero. In simpler
terms, it means that none of the predictor variables (independent variables) in the model are linearly
related to the dependent variable. The model as a whole does not explain any significant variance in
the dependent variable.
•Alternative Hypothesis (H₁): At least one βᵢ ≠ 0 (where i = 1, 2, ..., p)
•This hypothesis states that at least one of the regression coefficients is not equal to zero. In other words,
at least one predictor variable is linearly related to the dependent variable. The model as a whole is
statistically significant and explains a portion of the variance in the dependent variable.
•This test uses the F-statistic to determine the overall significance of the regression model.
Significance of Individual Predictor Variables (t-tests):
For each individual predictor variable in the model, we test the following hypotheses:
•Null Hypothesis (H₀): βᵢ = 0
•This hypothesis states that the coefficient for the specific predictor variable (Xᵢ) is equal to zero. This
means that, holding all other predictor variables constant, the predictor variable Xᵢ does not have a
linear relationship with the dependent variable.
•Alternative Hypothesis (H₁): βᵢ ≠ 0
•This hypothesis states that the coefficient for the specific predictor variable (Xᵢ) is not equal to zero.
This means that, holding all other predictor variables constant, the predictor variable Xᵢ has a linear
relationship with the dependent variable.
•This test uses the t-statistic to determine the significance of each individual predictor variable.
• The F-test tells if the model as a whole is useful.
• The t-tests tell which individual predictors are contributing significantly to
the model.
By examining the p-values associated with the F-statistic and t-statistics, one
can determine which hypotheses to reject and draw conclusions about the
significance of one’s multiple regression model.
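A sketch of the F-test and individual t-tests in Python with statsmodels, using simulated data that mimics the VO2max example (the variable names, coefficients, and generated values are assumptions, not the study's actual data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data mimicking the VO2max example (all values invented)
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age":        rng.integers(20, 60, n),
    "weight":     rng.normal(75, 12, n),
    "heart_rate": rng.normal(140, 10, n),
    "gender":     rng.integers(0, 2, n),      # 0 = female, 1 = male
})
df["vo2max"] = (60 - 0.2 * df["age"] - 0.1 * df["weight"]
                - 0.05 * df["heart_rate"] + 2 * df["gender"]
                + rng.normal(0, 3, n))

X = sm.add_constant(df[["age", "weight", "heart_rate", "gender"]])
model = sm.OLS(df["vo2max"], X).fit()

print(f"F = {model.fvalue:.2f}, p = {model.f_pvalue:.4g}")  # overall model
print(model.pvalues)                                        # individual t-tests
```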
Logistic regression

 Null hypothesis: assumes the beta coefficient is equal to zero.
 Alternative hypothesis: assumes the beta coefficient is not equal to zero.
Ordinal logistic regression (often just called 'ordinal regression') is used to predict an ordinal
dependent variable given one or more independent variables. It can be considered as either a
generalisation of multiple linear regression (multiple independent variables and a single dependent
variable) or as a generalisation of binomial logistic regression. As with other types of regression,
ordinal regression can also use interactions between independent variables to predict the
dependent variable.
Binomial logistic regression, also known as logistic regression, is a statistical model that predicts the
likelihood of an event happening in one of two categories (outcome variable is dichotomous,
meaning it has only two possible values, such as pass/fail or yes/no).
Multinomial logistic regression: dependent variable has more than two categories.
Logistic regression follows the maximum likelihood method, not the method of least squares: it estimates the coefficients that maximize the likelihood of the observed outcomes.
 Credit card defaulter:
0= not a defaulter
1= yes a defaulter
 Gender:
0= female
1= male
 There are 18 cases wherein the model predicted they would be defaulters, but they did not turn out to be defaulters. Owing to this, the model's accuracy has come down to 88.7%.
 Hosmer-Lemeshow test: a chi-square-based goodness-of-fit test, i.e., it tests whether the independent variables fit well to explain the model (dependent variable). Note that for this test a non-significant result (p > 0.05) is the desirable outcome, indicating that the model fits the data well.
Both gender and monthly salary help predict whether a person will be a
defaulter or not.
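A sketch of a binomial logistic regression fitted by maximum likelihood in Python with statsmodels, using simulated defaulter data that follows the 0/1 coding above (the salary effect and all values are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: gender (0 = female, 1 = male), monthly salary,
# defaulter (0 = not a defaulter, 1 = defaulter); all values invented
rng = np.random.default_rng(1)
n = 200
gender = rng.integers(0, 2, n)
salary = rng.normal(50_000, 15_000, n)
log_odds = -1.0 + 0.8 * gender - 0.00005 * (salary - 50_000)
defaulter = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit by maximum likelihood (not least squares)
X = sm.add_constant(pd.DataFrame({"gender": gender, "salary": salary}))
model = sm.Logit(defaulter, X).fit(disp=0)

print(model.params)    # estimated beta coefficients
print(model.pvalues)   # test of H0: beta = 0 for each predictor
```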
Independent t-test

 The independent-samples t-test (or independent t-test, for short) compares the means between
two unrelated groups on the same continuous, dependent variable.
Assumptions:
Assumption #1: Your dependent variable should be measured on a continuous scale.
Assumption #2: Your independent variable should consist of two categorical, independent groups.
Assumption #3: You should have independence of observations.
Assumption #4: There should be no significant outliers.
Assumption #5: Your dependent variable should be approximately normally distributed for each
group of the independent variable.
Assumption #6: There needs to be homogeneity of variances.
 The concentration of cholesterol (a type of fat) in the blood is associated with the risk of
developing heart disease, such that higher concentrations of cholesterol indicate a higher level
of risk, and lower concentrations indicate a lower level of risk. If you lower the concentration of
cholesterol in the blood, your risk of developing heart disease can be reduced. Being
overweight and/or physically inactive increases the concentration of cholesterol in your blood.
Both exercise and weight loss can reduce cholesterol concentration. However, it is not known
whether exercise or weight loss is best for lowering cholesterol concentration.
 Therefore, a researcher decided to investigate whether an exercise or weight loss intervention is
more effective in lowering cholesterol levels. To this end, the researcher recruited a random
sample of inactive males that were classified as overweight.
 This sample was then randomly split into two groups: Group 1 underwent a calorie-controlled
diet and Group 2 undertook the exercise-training programme. In order to determine which
treatment programme was more effective, the mean cholesterol concentrations were
compared between the two groups at the end of the treatment programmes.
In SPSS Statistics, we separated the groups for analysis by creating a grouping
variable called Treatment (i.e., the independent variable), and gave the "diet
group" a value of "1" and the "exercise group" a value of "2" (i.e., the two
groups of the independent variable). Cholesterol concentrations were
entered under the variable name Cholesterol (i.e., the dependent variable).
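The same comparison can be sketched with SciPy's independent-samples t-test (standing in for the SPSS output, which is not reproduced here); the cholesterol values below are invented:

```python
from scipy import stats

# Invented cholesterol concentrations (mmol/L) after treatment
diet_group     = [5.1, 5.4, 4.9, 5.6, 5.2, 5.0, 5.3, 4.8]
exercise_group = [4.6, 4.9, 4.4, 4.8, 4.7, 4.5, 5.0, 4.3]

# Independent-samples t-test (equal variances assumed, per Assumption #6)
t_stat, p_value = stats.ttest_ind(diet_group, exercise_group)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```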
Dependent t-test

The dependent t-test (called the paired-samples t-test in SPSS Statistics) compares
the means between two related groups on the same continuous, dependent
variable.
Assumption #1: Your dependent variable should be measured on a continuous scale.
Assumption #2: Your independent variable should consist of two categorical,
"related groups" or "matched pairs".
Assumption #3: There should be no significant outliers in the differences between
the two related groups.
Assumption #4: The distribution of the differences in the dependent
variable between the two related groups should be approximately normally
distributed.
A group of Sports Science students (n = 20) are selected from the population to investigate
whether a 12-week plyometric-training programme improves their standing long jump
performance. In order to test whether this training improves performance, the students are tested
for their long jump performance before they undertake a plyometric-training programme and then
again at the end of the programme (i.e., the dependent variable is "standing long jump
performance", and the two related groups are the standing long jump values "before" and "after"
the 12-week plyometric-training programme).
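A sketch of this paired comparison with SciPy; the before/after jump distances are invented for illustration:

```python
from scipy import stats

# Invented standing long jump distances (cm) for the same 10 students
before = [190, 205, 185, 210, 198, 202, 188, 195, 207, 192]
after  = [196, 211, 189, 214, 205, 209, 190, 201, 215, 198]

# Paired-samples t-test on the before/after measurements
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```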
THANK YOU
