Lesson 08 Data Analysis Using Statistics

Uploaded by

sbicapsec.ambala
Copyright: © All Rights Reserved

Business Analytics with Excel

Data Analysis Using Statistics


Learning Objectives

By the end of this lesson, you will be able to:

Create a moving average chart

Perform ANOVA to compare means of different groups

Identify relationships between variables using covariance and correlation

Calculate regression for the given data

Create normal distribution for the given data


A Day in the Life of a Business Analyst

As a business analyst of an organization:

You are required to do forecasting and planning for sales data

Along with building prediction models, you need to correlate existing data and test
hypotheses.

This lesson will help you understand the usage of statistics for data analytics and
predictions.
Introduction to Statistical Analysis
Statistical Analysis

It involves the collection, examination, summarization, manipulation, and interpretation of
quantitative data to discover underlying causes, patterns, relationships, and trends.
Need for Statistical Analysis

It reveals the overall pattern and behaviour of the data.

It is useful when you have a set of data and want to see a summary of that data set.
Statistical Analysis: Example

ABC LLC is a financial analytics and research organization that needs to determine how stock
prices are fluctuating in various emerging economies.
Statistical Analysis: Example

The firm can use the moving average tool based on the historical records and stock market data.

This tool forecasts the price trends for any number of days.

It predicts the trends for the upcoming month by creating a moving average chart.
Statistical Analysis: Tools

Moving Average

ANOVA

Correlation

Normal Distribution

Hypothesis Testing

Covariance

Regression


Statistical Analysis in Excel

Excel is widely used to understand statistical concepts and perform calculations.

Provides data and parameters for each tool

Uses appropriate statistical macro functions

Calculates and displays results in an output table

Generates charts
Data Analysis on Command

Data analysis tools are available through the Data Analysis command on the Data tab.

The Analysis ToolPak add-in needs to be loaded if the Data Analysis command is not available.
Moving Average: Introduction
Moving Average

It evaluates data points by creating a series of averages of different subsets of the
complete dataset.

[Chart: Actual and Forecast series plotted over periods 1 to 31]

A moving average is used to smooth out irregularities and easily recognize trends.
Moving Average

It is mainly used to forecast long-term trends in the data.

Moving Average can be calculated for any period of time.
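
As a rough illustration of the idea above, the sketch below computes a simple moving average in Python. The function name, the interval of 3, and the price values are all invented for this example; it is not the exact routine the Excel tool runs.

```python
def moving_average(values, interval):
    """Return the simple moving average of `values` over windows of `interval` points."""
    if interval < 1 or interval > len(values):
        raise ValueError("interval must be between 1 and len(values)")
    averages = []
    for end in range(interval, len(values) + 1):
        window = values[end - interval:end]   # the most recent `interval` points
        averages.append(sum(window) / interval)
    return averages


# Hypothetical closing prices, invented for illustration only.
prices = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
smoothed = moving_average(prices, 3)
print(smoothed)  # [11.0, 12.0, 13.0, 14.0]
```

A larger interval smooths the series more strongly but reacts more slowly to recent changes.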


Assisted Practice: Create Moving Average Chart

Problem statement:

Demonstrate how to create a Moving Average chart in Excel.


Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Moving average
Hypothesis Testing: Introduction
Hypothesis Testing

It is used to determine whether there is enough evidence in a data sample to infer that a certain
condition is true for the entire population.
Hypothesis Testing

To understand the characteristics of general population:

Take a random sample.

Analyze the properties of the sample.

Test whether the identified conclusions represent the population correctly or not.
Hypothesis Testing

A hypothesis about a population parameter is generated.

Sample statistics are used to assess the likelihood that the hypothesis is true.
Hypothesis Testing

It is formulated in terms of two hypotheses:

Null Hypothesis, which is referred to as H0, is assumed to be true unless there is
strong evidence to the contrary.

Alternate Hypothesis, which is referred to as H1, is assumed to be true when the null
hypothesis is false.
Hypothesis Testing

The t-test is a hypothesis test of the null hypothesis (H0), which assumes that the means
(averages) of two populations are equal.
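
The following is a minimal sketch of a two-sample t-test, assuming equal variances in both groups (the pooled-variance form). The sample values are invented, and the critical value 2.306 is the two-tailed 5% value for 8 degrees of freedom taken from a standard t table; neither comes from the lesson's workbook.

```python
import math
import statistics


def two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical measurements for two groups (illustration only).
group_a = [20, 22, 19, 24, 25]
group_b = [28, 30, 27, 31, 29]

t_stat, df = two_sample_t(group_a, group_b)
T_CRITICAL = 2.306  # assumed: two-tailed, alpha = 0.05, df = 8, from a t table
reject_null = abs(t_stat) > T_CRITICAL
print(round(t_stat, 4), df, reject_null)
```

If the absolute t statistic exceeds the critical value, the null hypothesis of equal means is rejected at the chosen significance level.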
Assisted Practice: How to use Hypothesis Testing

Problem statement:

Demonstrate how to use Hypothesis Testing to test the Null Hypothesis for two variables.
Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Hypothesis testing
ANOVA
ANOVA

ANOVA stands for analysis of variance.

The logic behind this analysis is to identify variance in the population.

ANOVA is a collection of statistical methods used to compare the means of different
groups.
T-Test

The t-test helps analyze variance between two groups only.

ANOVA helps test the Null Hypothesis of two or more groups.
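
To make the comparison of group means concrete, here is a hedged sketch of the one-way ANOVA F statistic (variance between groups divided by variance within groups). The three groups are invented, and the F crit value of 5.14 is the 5% critical value for (2, 6) degrees of freedom from a standard F table, stated here as an assumption.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of each value around its own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical observations for three groups (illustration only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
F_CRIT = 5.14  # assumed: alpha = 0.05, df = (2, 6), from an F table
print(round(f_stat, 2), f_stat > F_CRIT)
```

When F is greater than F crit, the null hypothesis that all group means are equal is rejected.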
Assisted Practice: How to use ANOVA

Problem statement:

Demonstrate how to use ANOVA to test the Null Hypothesis for two or more groups.
Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: ANOVA testing
Covariance
Covariance: Introduction

Covariance determines the relationship between two random variables and how they change
together.

Covariance: Types

Let us suppose that X and Y are two random variables.

Positive Covariance

If variable X increases as Y increases or X decreases as Y decreases, then covariance is
positive.

Observation   1   2   3   4   5   6   7   8   9  10
Y            48  52   4   8  40   4  16  40  32  40
X            12  13   1   2  10   1   4  10   8  10
Covariance: Types

Negative Covariance
If variable X decreases as Y increases or X increases as Y decreases, then covariance is
negative.

Observation    1    2    3    4    5    6    7    8    9   10
X             15   16    4    5   13    4    7   13   11   13
Y            $38  $20  $85  $82  $46  $85  $70  $46  $65  $46
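
The sign of the covariance for the two data sets on these slides can be checked with a short calculation. The sketch below uses the population form (divide by n); the function name and the `population` flag are choices made for this example.

```python
def covariance(xs, ys, population=True):
    """Return the covariance of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n if population else n - 1)


# Values from the Positive Covariance slide.
x_pos = [12, 13, 1, 2, 10, 1, 4, 10, 8, 10]
y_pos = [48, 52, 4, 8, 40, 4, 16, 40, 32, 40]

# Values from the Negative Covariance slide (Y in dollars).
x_neg = [15, 16, 4, 5, 13, 4, 7, 13, 11, 13]
y_neg = [38, 20, 85, 82, 46, 85, 70, 46, 65, 46]

cov_positive = covariance(x_pos, y_pos)
cov_negative = covariance(x_neg, y_neg)
print(cov_positive > 0, cov_negative < 0)  # True True
```

The magnitude of a covariance depends on the units of X and Y, so only its sign is easy to interpret directly.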
Assisted Practice: How to use Covariance

Problem statement:

Demonstrate how to use Covariance in Excel.


Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Use Covariance
Correlation
Correlation: Introduction

Correlation is a statistical measure that indicates the extent to which two or more
variables fluctuate together.
Correlation Coefficient

The correlation coefficient tells us how strongly two variables are related to
each other and it has a value between -1 and +1.

A correlation coefficient with value +1 indicates a perfect positive correlation.
Correlation Coefficient

In Excel, the CORREL function is used to calculate correlation.

A correlation coefficient with value -1 indicates a perfect negative correlation.

A correlation coefficient with value 0 indicates no correlation.
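
As an illustration of the coefficient's range, the sketch below computes the Pearson correlation coefficient by hand. The function name and the small data sets are invented for this example and are chosen so that the results sit at the two extremes.

```python
import math


def correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves exactly with xs
falling = [10, 8, 6, 4, 2]   # moves exactly against xs

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
print(round(r_rising, 6), round(r_falling, 6))
```

Unlike covariance, the coefficient is unit-free, which is why it can be compared across different pairs of variables.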
Assisted Practice: How to use Correlation

Problem statement:

Demonstrate how to use Correlation in Excel.


Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Use Correlation
Regression
Regression: Introduction

Regression is a statistical method for determining the strength of the relationship between one
dependent variable and one or more independent variables.
Assisted Practice: How to use Regression

Problem statement:

Demonstrate how to use Regression to determine relationships between variables.


Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Use Regression
Multiple Linear Regression
Simple Linear Regression

Simple Linear Regression (SLR) tries to find a linear relationship between two variables x and y.

y = f(x)
Simple Linear Regression

A linear relation of the temperature and number of ice creams sold can be observed using a
scatter plot.
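
A least-squares line through such a scatter plot can be sketched as follows. The temperature and sales figures are invented and deliberately lie on an exact straight line so the result is easy to check; real data would scatter around the fitted line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: temperature in degrees and ice creams sold (illustration only).
temperatures = [20, 22, 24, 26, 28, 30]
ice_creams_sold = [40, 50, 60, 70, 80, 90]

slope, intercept = fit_line(temperatures, ice_creams_sold)
predicted_at_25 = intercept + slope * 25
print(slope, intercept, predicted_at_25)
```

The slope estimates how many additional ice creams are sold per extra degree in this made-up data set.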
Multiple Linear Regression

Multiple Linear Regression (MLR) tries to find the relationship between multiple independent x’s and
a single dependent y.

Source: https://medium.com/analytics-vidhya/new-aspects-to-consider-while-moving-from-simple-linear-regression-to-multiple-linear-regression-dad06b3449ff
Multiple Linear Regression

The approach is to build a fitting line in n-dimensional space to:

• Explain the effects of the independent variables on the y variable.

• Predict the y value for a new set of x values.
Multiple Linear Regression

The data is fit into the following equation:

y = β0 + β1x1 + β2x2 + … + βixi + e

Where:
• y: dependent or resultant variable
• x1, x2, x3, …, xi: independent variables
• β0: constant term in the equation
• βi: slope coefficient of each independent variable
• e: error term
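
Once the coefficients are known, the equation above can be evaluated for a new row of x values. The sketch below does this with invented coefficients; the error term e is left out because it is unknown for a new observation.

```python
def predict(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + ... + bi*xi for one row of feature values."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical coefficients and feature values (illustration only).
beta_0 = 2.0
betas = [0.5, -1.0, 3.0]
new_row = [4.0, 1.0, 2.0]

y_hat = predict(beta_0, betas, new_row)
print(y_hat)  # 9.0
```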
Multiple Linear Regression

A multiple linear regression model can be built using Excel with at least 30 data points.

The mathematical equation with the coefficients is derived instantly and used to predict new
values.
Multiple Linear Regression

Consider the boston_housing.csv as the input data to build our model.

boston_housing.csv
Multiple Linear Regression

The data set contains 13 independent variables which define the dependent variable MEDV.

MEDV is the median value of a house in Boston according to the data provided.
Multiple Linear Regression

A model built using this data can be used to predict the median value of a new house from the
attributes of the house.
Multiple Linear Regression

The meaning of each attribute is given in the Column description tab.


Create a Linear Regression Model

Select the complete data after checking for any junk data.

Click on Data Analysis in the Data tab.

If this does not appear, click on File -> Options -> Add-ins, select Excel Add-ins in the Manage box, and click Go.
Create a Linear Regression Model

Select the Analysis ToolPak check box to enable Data Analysis on the Data tab.


Create a Linear Regression Model

Choose Regression from the Data Analysis dialog box


Create a Linear Regression Model

• Under Regression, choose the rows and columns for the X range and the column for the Y range.
• Select the Labels check box if the ranges include headers, and set the Confidence Level to 95%.
Create a Linear Regression Model

The results appear in a new worksheet, showing the regression data for the chosen data set.
Linear Regression Model

R-squared indicates how much of the variance of y is explained by all the x’s. The closer it is
to 1.0, the better the model fit.

The intercept coefficient is β0 in the multiple regression equation.

The other coefficients are the βi in the multiple regression equation.
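
For readers who want to see where these numbers come from, here is a rough pure-Python sketch that fits the coefficients by solving the normal equations and then computes R-squared. It is one possible approach under simplifying assumptions, not a description of how Excel computes its output, and the data is invented so that y = 1 + 2·x1 + 3·x2 holds exactly.

```python
def fit_linear_regression(rows, targets):
    """Return [b0, b1, ..., bk] that minimise squared error (normal equations)."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 = intercept
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def r_squared(rows, targets, betas):
    """Return 1 - SS_residual / SS_total for the fitted coefficients."""
    predictions = [
        betas[0] + sum(b * x for b, x in zip(betas[1:], row)) for row in rows
    ]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1 - ss_res / ss_tot


# Hypothetical data built from y = 1 + 2*x1 + 3*x2 (illustration only).
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
targets = [9, 8, 19, 18, 26]

betas = fit_linear_regression(rows, targets)
fit_quality = r_squared(rows, targets, betas)
print([round(b, 6) for b in betas], round(fit_quality, 6))
```

Because the invented data has no noise, R-squared comes out at essentially 1.0; real data would give a lower value.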
Linear Regression Model

Standard error measures how far the actual values deviate, on average, from the values on the
line of best fit.

The p-value indicates the statistical significance of each feature's effect on the dependent
variable.
Linear Regression Model

From the results it is understood that:

• The most and least important features in determining the median price of the house can be
identified.
• The value of y can be determined by using the equation with a new set of x values.
Logistic Regression
Logistic Regression

It is an algorithm for classification problems.

Though the name has the word regression, it is not a regression algorithm.
Logistic Regression

We have seen the following equation in linear regression:

y=β0 + β1x1+ β2x2 +… + βixi + e

This equation cannot be used directly because:

• The value of y is not a ln(odds) value
• The dependent variable y represents classes
• y is no longer a continuous variable, unlike in regression
• log(odds) can instead help to arrive at a similar equation
Logistic Regression

The linear regression equation can be reused for logistic regression.

• By converting the y value in the classification problem to the ‘ln odds’ (log-odds) value of the event

• ln(odds(E)) = β0 + β1x1 + β2x2 + … + βixi + e


Odds of Event

Odds of event (E) is defined as the probability of E happening divided by the probability of E not
happening.

odds(E) = P(E) / (1 − P(E))

• The result of odds(E) is then converted to categorical values.


• Example: If y<= 0.5, then it is negative, or else it is positive.
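
The definition of odds, and the ln(odds) value used on the previous slide, can be checked with a few lines of Python. The function names and the example probabilities are chosen for this sketch only.

```python
import math


def odds(p):
    """Return P(E) / (1 - P(E)) for a probability 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def log_odds(p):
    """Return ln(odds(E)); p must be strictly between 0 and 1."""
    return math.log(odds(p))


even_odds = odds(0.5)        # an event as likely to happen as not
high_odds = odds(0.8)        # roughly 4 to 1 in favour
even_log_odds = log_odds(0.5)
print(even_odds, round(high_odds, 6), even_log_odds)
```

Probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive log-odds, so the log-odds scale covers the whole real line as a linear equation requires.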
Sigmoid Equation

If we solve for P(E) using the two odds equations, we get:

• P(E) = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + … + βixi + e)))

• The equation in this form is called the sigmoid equation.

• Example: If you take a numeric value of y, it converts it into a probability value between 0 and 1.
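
A small sketch of the sigmoid function follows. It also checks, for one assumed value, that the sigmoid undoes the ln(odds) transformation: log-odds of ln(4) correspond to a probability of 0.8.

```python
import math


def sigmoid(y):
    """Map any moderate real value y to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


p_at_zero = sigmoid(0.0)             # log-odds of 0 mean even odds
p_from_log_odds = sigmoid(math.log(4))  # odds of 4 to 1
print(p_at_zero, round(p_from_log_odds, 6))
```

Very large negative inputs would overflow `math.exp` in this simple form, so a production version would need to guard against that case.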
Logistic Regression in Excel

To perform logistic regression in Excel, a multiple regression equation is created using the Data
Analysis add-in.

• It forms the equation of P(E), and

• Segregates the target values based on P(E)
Logistic Regression in Excel

When new data is given to the model, P(E) is calculated and the target value is derived.
Steps to Derive Target Value

These are the steps to derive target values.

Step 1: Data items are encoded to numeric values


Steps to Derive Target Value

Step 2: The target values are encoded to numeric values


Steps to Derive Target Value

Step 3: Use add-ins of Data Analysis, to calculate the intercept and coefficients
Steps to Derive Target Value

Step 4: The linear regression equation is evaluated for each data row. The result can be called y.
Steps to Derive Target Value

Step 5: P(E) is calculated as 1 / (1 + e^(−y))


Steps to Derive Target Value

Step 6: A rule is applied on P(E) to get the target values
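
The steps above can be sketched end to end as follows. Every name and number here is hypothetical: the column names, the yes/no encoding, the intercept and coefficients (stand-ins for what the Regression tool would report in Step 3), and the 0.5 threshold used as the rule in Step 6.

```python
import math

# Step 3 stand-ins: hypothetical intercept and coefficients.
INTERCEPT = -4.0
COEFFICIENTS = {"age": 0.05, "owns_home": 1.5}

# Steps 1 and 2: categorical values are encoded as numbers (target: 1 = yes, 0 = no).
HOME_CODES = {"no": 0, "yes": 1}
THRESHOLD = 0.5  # assumed rule for Step 6


def encode(row):
    """Step 1: turn one data row into numeric feature values."""
    return {
        "age": float(row["age"]),
        "owns_home": float(HOME_CODES[row["owns_home"]]),
    }


def probability(row):
    """Steps 4 and 5: evaluate the linear equation y, then P(E) = 1 / (1 + e^(-y))."""
    features = encode(row)
    y = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-y))


def target_value(row):
    """Step 6: apply the threshold rule to P(E)."""
    return 1 if probability(row) > THRESHOLD else 0


# Hypothetical new data rows (illustration only).
customers = [
    {"age": 30, "owns_home": "no"},
    {"age": 60, "owns_home": "yes"},
]
targets = [target_value(customer) for customer in customers]
print(targets)  # [0, 1]
```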


Normal Distribution
Normal Distribution: Introduction

All normal distributions are symmetric and have bell-shaped curves with a single peak.
Create Normal Distribution

Normal distribution helps find the probability distribution for various variables such as rainfall,
height, weight, manufacturing error, weight error, and test scores.

A Normal Distribution Curve is defined by two quantities:

• The mean, where the peak of the density occurs

• The standard deviation, which indicates the spread of the bell curve
Normal Distribution: Empirical Rule

All normal density curves satisfy the Empirical Rule (68-95-99.7% Rule) in statistics:

• 68% of the observations fall within 1 standard deviation of the mean, i.e. between
Mean - Standard Deviation and Mean + Standard Deviation.

• 95% of the observations fall within 2 standard deviations of the mean, i.e. between
Mean - 2*Standard Deviation and Mean + 2*Standard Deviation.

• 99.7% of the observations fall within 3 standard deviations of the mean, i.e. between
Mean - 3*Standard Deviation and Mean + 3*Standard Deviation.
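
The Empirical Rule can be verified numerically with Python's standard library. The helper name and the example mean and standard deviation are assumptions made for this sketch; `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, std_dev=1.0):
    """Return the share of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mean, std_dev)
    return dist.cdf(mean + k * std_dev) - dist.cdf(mean - k * std_dev)


# Hypothetical test scores with mean 70 and standard deviation 10 (illustration only).
shares = [share_within(k, mean=70.0, std_dev=10.0) for k in (1, 2, 3)]
print([round(share, 4) for share in shares])
```

The shares do not depend on the particular mean and standard deviation chosen, which is why the rule holds for every normal curve.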
Assisted Practice: Create Normal Distribution graph

Problem statement:

Demonstrate how to create a Normal Distribution graph in Excel.


Assisted Practice Guidelines

Steps to follow:

Step 1: Open the Excel file


Step 2: Create Normal Distribution
Key Takeaways

A Moving Average evaluates data points by creating a series of averages of different
subsets of the complete dataset.

Hypothesis Testing is used to test the null hypothesis.

ANOVA is a collection of statistical methods used to compare the means of different
groups.

Covariance determines the relationship between two random variables, that is, how they
change together.
Key Takeaways

Correlation is a statistical measure that indicates the extent to which two or more
variables fluctuate together.

Regression is a statistical method that determines the strength of the relationship
between one dependent variable and a series of other changing variables.

All Normal Distributions are symmetric and have bell-shaped curves with a single peak.
Knowledge Check
Knowledge Check 1
Which of the following statistical methods is used to analyze variance between more than
two groups?

A. Hypothesis Testing

B. Histogram

C. ANOVA

D. Covariance

The correct answer is C

ANOVA is used to analyze variance between more than two groups.


Knowledge Check 2
What conclusion will you derive for the Null Hypothesis if “F > F crit” in ANOVA testing?

A. The Null Hypothesis is not rejected

B. The Null Hypothesis is rejected

C. There is no relationship with Hypothesis Testing

D. None of the above is correct

The correct answer is B

In ANOVA testing if “F > F crit,” then the Null Hypothesis is rejected.


Knowledge Check 3
The Null Hypothesis means that the mean/average of two populations is equal.

A. True

B. False

The correct answer is A

The Null Hypothesis(H0) means that the mean/average of two populations is equal.
Knowledge Check 4
Which of the following is indicated if the Correlation Coefficient value is +1?

A. Perfect Positive Correlation

B. Zero Correlation

C. Perfect Negative Correlation

D. No Correlation

The correct answer is A

The Correlation Coefficient value of +1 indicates Perfect Positive Correlation.


Knowledge Check 5
Which statistical measure determines the strength between a dependent variable and an
independent variable?

A. Histogram

B. Hypothesis Testing

C. Moving Average

D. Regression

The correct answer is D

Regression determines the strength between a dependent variable and an independent variable.
Knowledge Check 6
What are the mandatory fields required while creating a Normal Distribution curve?

A. Mean and Standard Deviation

B. Mean and Maximum value

C. Maximum and Minimum value

D. Standard Deviation and Minimum Value

The correct answer is A

To create a Normal Distribution curve, we need to specify two quantities: the mean, where the peak of the density
occurs, and the standard deviation, which indicates the spread of the bell curve.
