Lesson 08 Data Analysis Using Statistics
Along with building prediction models, you need to correlate existing data and test hypotheses.
This lesson will help you understand how to use statistics for data analytics and predictions.
Introduction to Statistical Analysis
Statistical Analysis
Statistical analysis is useful when you have a data set and want to see a summary of it.
Statistical Analysis: Example
ABC LLC is a financial analytics and research organization that needs to determine how stock
prices are fluctuating in various emerging economies.
Statistical Analysis: Example
Generates charts
Data Analysis on Command
Data analysis tools are available through the Data Analysis command on the Data tab.
The Analysis ToolPak add-in must be loaded if the Data Analysis command is not available.
Moving Average: Introduction
Moving Average
[Chart: Moving Average, showing the Actual and Forecast series over periods 1 to 31; vertical axis from 0 to 16,000]
A moving average is used to smooth out irregularities and easily recognize trends.
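The smoothing idea can be sketched in a few lines of Python. This is only an illustration of a simple (unweighted) moving average; the function name `moving_average`, the window of 3 periods, and the sales figures are assumptions made for the example and do not come from the lesson.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` periods."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]      # the most recent `window` values
        averages.append(sum(chunk) / window)  # their plain average
    return averages


# Hypothetical monthly sales figures, used only for illustration.
sales = [10, 12, 11, 15, 14, 18, 17]
smoothed = moving_average(sales, window=3)
print(smoothed)
```

Each output value averages three consecutive inputs, so the smoothed series is two items shorter than the original and shows the upward trend with less month-to-month noise.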
Moving Average
Problem statement:
Steps to follow:
Hypothesis testing is used to determine whether there is enough evidence in a data sample to infer that a certain
condition is true for the entire population.
Hypothesis Testing
1. Take a random sample.
2. Analyze the properties of the sample.
3. Test whether the identified conclusions represent the population correctly or not.
Hypothesis Testing
The Hypothesis Test (t-test) is used to test the null hypothesis (H0), which assumes that the means
(averages) of two populations are equal.
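As a rough illustration, the t statistic for two samples can be computed with the Python standard library. The sketch below uses the unequal-variance (Welch) form, which is one common variant; the function name `welch_t_statistic` and both samples are hypothetical. It stops at the statistic itself: turning it into a p-value requires a t-distribution table or a statistics package.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Return Welch's t statistic for the difference between two sample means."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical samples, used only for illustration.
group_a = [20, 22, 19, 24, 25]
group_b = [28, 30, 27, 31, 29]
t_stat = welch_t_statistic(group_a, group_b)
print(round(t_stat, 3))
```

A t statistic far from zero, as here, is evidence against the null hypothesis that the two population means are equal; a value near zero is consistent with it.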
Assisted Practice: How to use Hypothesis Testing
Problem statement:
Demonstrate how to use Hypothesis Testing to test the Null Hypothesis for two variables.
Assisted Practice Guidelines
Steps to follow:
Problem statement:
Demonstrate how to use ANOVA to test the Null Hypothesis for two or more variables.
Assisted Practice Guidelines
Steps to follow:
Covariance: Types
Positive Covariance
If variable X increases as Y increases or X decreases as Y decreases, then covariance is positive.
[Chart: X and Y values for observations 1 to 10]
X 12 13 1 2 10 1 4 10 8 10
Y 48 52 4 8 40 4 16 40 32 40
Covariance: Types
Negative Covariance
If variable X decreases as Y increases or X increases as Y decreases, then covariance is negative.
[Chart: X and Y values for observations 1 to 10]
X 15 16 4 5 13 4 7 13 11 13
Y $38 $20 $85 $82 $46 $85 $70 $46 $65 $46
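The sign of the covariance for the two X and Y data sets listed above can be checked with a short Python sketch. The function name `sample_covariance` is an assumption made for this example, and the sample form (dividing by n - 1) is used; dividing by n instead gives the population covariance, and the sign is the same either way.

```python
def sample_covariance(xs, ys):
    """Return the sample covariance of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each pair's deviations from the two means.
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# X and Y values from the positive covariance example.
x_pos = [12, 13, 1, 2, 10, 1, 4, 10, 8, 10]
y_pos = [48, 52, 4, 8, 40, 4, 16, 40, 32, 40]

# X and Y values from the negative covariance example (Y in dollars).
x_neg = [15, 16, 4, 5, 13, 4, 7, 13, 11, 13]
y_neg = [38, 20, 85, 82, 46, 85, 70, 46, 65, 46]

cov_pos = sample_covariance(x_pos, y_pos)
cov_neg = sample_covariance(x_neg, y_neg)
print(cov_pos > 0, cov_neg < 0)
```

In the first data set every Y value is four times its X value, so the two variables always move together; in the second, high X values are paired with low Y values.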
Assisted Practice: How to use Covariance
Problem statement:
Steps to follow:
The correlation coefficient tells us how strongly two variables are related to
each other and it has a value between -1 and +1.
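A minimal sketch of the Pearson correlation coefficient, the most common form of this measure, is shown below. The function name `pearson_correlation` and the sample values are hypothetical; the two score lists were chosen so that the result lands on the extremes of the range.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    denominator = math.sqrt(sum(dx * dx for dx in dev_x) * sum(dy * dy for dy in dev_y))
    return numerator / denominator


# Hypothetical values, used only for illustration.
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises in step with hours
score_down = [10, 8, 6, 4, 2]  # falls in step with hours

print(round(pearson_correlation(hours, score_up), 6))
print(round(pearson_correlation(hours, score_down), 6))
```

A perfectly increasing linear relationship gives +1 and a perfectly decreasing one gives -1; real data usually falls somewhere in between.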
Problem statement:
Steps to follow:
Regression is a statistical method for determining the strength of a relationship between one
dependent variable and a set of independent variables that change over time.
Assisted Practice: How to use Regression
Problem statement:
Steps to follow:
Simple Linear Regression (SLR) tries to find a linear relationship between two variables, x and y.
y = function(x)
Simple Linear Regression
A linear relation of the temperature and number of ice creams sold can be observed using a
scatter plot.
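The line behind such a scatter plot can be estimated with ordinary least squares. The sketch below is an illustration only: the function name `fit_simple_linear_regression` and the temperature and sales figures are hypothetical, and the figures were constructed to lie exactly on the line y = 10 + 5x so the fit is easy to verify by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical temperatures (degrees Celsius) and ice creams sold.
temperature = [20, 22, 25, 27, 30]
ice_creams_sold = [110, 120, 135, 145, 160]

intercept, slope = fit_simple_linear_regression(temperature, ice_creams_sold)
predicted = intercept + slope * 28  # predicted sales at 28 degrees
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```

The slope says how many additional ice creams are sold per extra degree, and the fitted line can then be used to predict sales at a temperature that was not observed.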
Multiple Linear Regression
Multiple Linear Regression (MLR) tries to find the relationship between multiple independent x’s and
a single dependent y.
Source: https://medium.com/analytics-vidhya/new-aspects-to-consider-while-moving-from-simple-linear-regression-to-multiple-linear-regression-dad06b3449ff
Multiple Linear Regression
y = β0 + β1x1 + β2x2 + … + βixi + e
Where:
• y: dependent or resultant variable
• x1, x2, x3, …, xi: independent variables
• β0: constant term in the equation
• βi: slope coefficient for each independent variable
• e: error term
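Once the constant term and slope coefficients are known, predicting a new value is just a matter of evaluating the equation. The sketch below shows that step only (it does not fit the model); the function name `predict` and all numbers are hypothetical.

```python
def predict(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + b2*x2 + ... + bi*xi for one row of features."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical fitted values, used only for illustration.
b0 = 5.0                  # constant term
betas = [2.0, -1.0, 0.5]  # slope coefficients for x1, x2, x3
row = [3.0, 4.0, 10.0]    # one new observation (x1, x2, x3)

print(predict(b0, betas, row))  # 5 + 6 - 4 + 5 = 12.0
```

The error term e is not part of the prediction; it represents the difference between the predicted and the observed value.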
Multiple Linear Regression
A multiple linear regression model can be built using Excel with at least 30 data points.
The mathematical equation with the coefficients is derived instantly and used to predict new
values.
Multiple Linear Regression
boston_housing.csv
Multiple Linear Regression
The data set contains 13 independent variables that are used to predict the dependent variable MEDV.
A model built using this data can be used to predict the median value of a new house from the
attributes of the house.
Multiple Linear Regression
Choose the complete data after checking for any junk data
The results appear in a new worksheet, showing the regression data for the chosen data set.
Linear Regression Model
Although its name contains the word regression, logistic regression is a classification algorithm, not a regression algorithm.
Logistic Regression
• By converting the y value in the classification problem to an ‘ln odds’ (log odds) value of the event
The odds of an event E are defined as the probability of E happening divided by the probability of E not
happening.
odds(E) = P(E) / (1 - P(E))
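The odds formula, and the natural log of the odds that logistic regression works with, can be sketched as follows. The function names `odds` and `log_odds` are chosen for this example only.

```python
import math


def odds(p):
    """Return the odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)


def log_odds(p):
    """Return the natural log of the odds (the logit) for 0 < p < 1."""
    return math.log(odds(p))


print(odds(0.5))      # an even chance gives odds of 1.0
print(log_odds(0.5))  # and log odds of 0.0
print(round(odds(0.8), 6))
```

A probability of 0.8 gives odds of about 4, meaning the event is four times as likely to happen as not. Probabilities above 0.5 give positive log odds, and probabilities below 0.5 give negative log odds.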
To perform logistic regression in Excel, a multiple regression equation is used, which is created with the Data
Analysis add-in.
When new data is given to the model, P(E) is calculated and the target value is derived.
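One way to picture this last step is sketched below, assuming the regression equation's output (called y here) is interpreted as the log odds of the event. The function names and the 0.5 cut-off are assumptions for illustration; the lesson does not state which threshold is used.

```python
import math


def probability_from_log_odds(y):
    """Convert a log-odds value y into a probability P(E) = 1 / (1 + e^(-y))."""
    return 1 / (1 + math.exp(-y))


def target_value(y, threshold=0.5):
    """Return 1 if P(E) reaches the threshold, otherwise 0 (threshold is assumed)."""
    return 1 if probability_from_log_odds(y) >= threshold else 0


# Hypothetical regression outputs for three new rows.
for y in (-2.0, 0.0, 3.0):
    print(y, round(probability_from_log_odds(y), 3), target_value(y))
```

A y of 0 corresponds to P(E) = 0.5; larger values push the probability toward 1 and smaller values toward 0, which is what allows a regression equation to drive a two-class decision.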
Steps to Derive Target Value
Step 3: Use the Data Analysis add-in to calculate the intercept and coefficients.
Step 4: The linear regression equation is evaluated for each data row. The result can be called y.
Normal distribution helps find the probability distribution for various variables such as rainfall,
height, weight, manufacturing error, weight error, and test scores.
Normal Distribution Curve
• The mean, where the peak of the density occurs
• The standard deviation, which indicates the spread of the bell curve
Normal Distribution: Empirical Rule
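All normal density curves satisfy the Empirical Rule (the 68-95-99.7% rule): about 68% of values lie within one
standard deviation of the mean, about 95% within two, and about 99.7% within three.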
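The 68-95-99.7 figures can be checked with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The mean of 70, the standard deviation of 10, and the helper name `share_within` are hypothetical choices for this sketch; the percentages come out the same for any mean and standard deviation.

```python
from statistics import NormalDist

# Hypothetical test scores with mean 70 and standard deviation 10.
scores = NormalDist(mu=70, sigma=10)


def share_within(dist, k):
    """Return the probability of falling within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    print(k, round(share_within(scores, k), 4))
```

The three printed shares are approximately 0.6827, 0.9545, and 0.9973.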
Problem statement:
Steps to follow:
Knowledge Check 1
Which of the following statistical methods is used to analyze variance between more than two groups?
A. Hypothesis Testing
B. Histogram
C. ANOVA
D. Covariance
Knowledge Check 3
The Null Hypothesis means that the mean/average of two populations is equal.
A. True
B. False
The Null Hypothesis (H0) means that the mean/average of two populations is equal.
Knowledge Check 4
Which of the following is indicated if the Correlation Coefficient value is +1?
B. Zero Correlation
D. No Correlation
Knowledge Check 5
Which statistical measure determines the strength between a dependent variable and an independent variable?
A. Histogram
B. Hypothesis Testing
C. Moving Average
D. Regression
Regression determines the strength between a dependent variable and an independent variable.
Knowledge Check 6
What are the mandatory fields required while creating a Normal Distribution curve?
To create a Normal Distribution curve, we need to specify two quantities: the mean, where the peak of the density
occurs, and the standard deviation, which indicates the spread of the bell curve.