BUSINESS ANALYTICS - II

Analytics
Analytics is a data-driven approach to decision-making
for business problems.

WHAT IS BA
Data analysis includes data description, data
inference, and the search for relationships in data.

• Business analytics (BA) is a set of disciplines and technologies for
solving business problems using data analysis and statistical
models.
• It takes in and processes historical business data, analyses that
data to identify trends, patterns, and root causes, and makes
data-driven business decisions based on those insights.
• It is the science of analysing data to find patterns that are helpful in
developing strategies.
BUSINESS ANALYSIS PROCESS
TYPES OF BUSINESS ANALYTICS
ANALYSIS OF DATA

Univariate analysis
Bivariate analysis
Multivariate Analysis
• The term univariate analysis refers to the analysis of one variable. You
can remember this because the prefix “uni” means “one.”
• The purpose of univariate analysis is to understand the distribution
of values for a single variable.
• Bivariate analysis: the analysis of two variables.
• Multivariate analysis: the analysis of two or more variables.
UNIVARIATE ANALYSIS

• There are three common ways to perform univariate analysis:
1. Summary Statistics
2. Frequency Distributions
3. Charts
BIVARIATE ANALYSIS

• The purpose of bivariate analysis is to understand the relationship
between two variables.
• There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
MULTIVARIATE ANALYSIS

• The term multivariate analysis refers to the analysis of more than one
variable. You can remember this because the prefix “multi” means “more
than one.”
• There are two common ways to perform multivariate analysis:
1. Scatterplot Matrix
2. Multiple Linear Regression
Univariate Analysis

• Suppose we choose to perform univariate analysis on the variable Household Size:

1. Summary Statistics
• Measures of central tendency: these numbers describe where the centre
of a dataset is located. Examples include the mean and the median.
• Mean (the average value): 3.8
• Median (the middle value): 4
2. Frequency Distributions
• A frequency table allows us to quickly see that the
most frequent household size is 4.

3. Charts
• Boxplot
• Histogram
• Pie Chart
BIVARIATE ANALYSIS

• Suppose we have a dataset of (X, Y) pairs.

• If we plotted these (X, Y) pairs on a scatterplot, we could see that
there is a positive association between variables X and Y:
when X increases, Y tends to increase as well.
BIVARIATE ANALYSIS

• Consider two variables: (1) hours spent studying and (2) exam score
received by 20 different students.
• 1. Scatterplots
• A scatterplot offers a visual way to perform bivariate analysis. It
allows us to visualize the relationship between two variables by
placing the value of one variable on the x-axis and the value of
the other variable on the y-axis.
• Interpreting scatterplots, the relationship may be:
• Strong, positive relationship
• Weak, positive relationship
• No relation
• Strong, negative relationship
• Weak, negative relationship
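A scatterplot makes the direction visible by eye; the same direction can be checked numerically from the sign of the sample covariance. A minimal sketch with hypothetical hours/score data (with matplotlib one would simply call `plt.scatter(x, y)`):

```python
# Direction of association between two variables via the sample covariance.
from statistics import mean

x = [1, 2, 3, 4, 5]        # hours spent studying (hypothetical)
y = [55, 60, 68, 74, 80]   # exam score (hypothetical)

mx, my = mean(x), mean(y)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

direction = "positive" if cov > 0 else "negative" if cov < 0 else "none"
print(direction)  # positive: as x increases, y tends to increase
```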
CORRELATION

• Correlation is a statistical tool used to measure the relationship
between two or more variables, i.e. the degree to which the variables are
associated with each other. In simpler words, it measures the closeness of the
relationship. For example, price and supply, demand and supply, and income and
expenditure are correlated.
TYPES OF CORRELATION

Positive Correlation – When the variables change in the same direction (both increase or both
decrease), we call them positively correlated, e.g. the price of a good and its supply,
hot weather and cold-drink consumption.

Negative Correlation – When the variables change in opposite directions (one increases while
the other decreases), we call them negatively correlated, e.g. alcohol
consumption and lifespan, smartphone usage and battery life.

Zero Correlation – We call the variables uncorrelated when there is no relationship between them
(correlation = 0), e.g. HR recruits and temperature, paper production and beverages.
STANDARD RANGE OF CORRELATION
COEFFICIENT
• r = -1: Perfect negative correlation
• -0.99 to -0.76: Strong negative correlation
• -0.75 to -0.26: Intermediate (moderate) negative correlation
• -0.25 to 0: Weak negative correlation
• r = 0: Zero correlation
• 0 to 0.25: Weak positive correlation
• 0.26 to 0.75: Intermediate (moderate) positive correlation
• 0.76 to 0.99: Strong positive correlation
• r = 1: Perfect positive correlation
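The range table above maps directly to a small helper function. A sketch, noting that the 0.25 / 0.75 cut-offs follow this slide's convention rather than a universal standard:

```python
# Label a correlation coefficient using the slide's strength ranges.
def correlation_strength(r: float) -> str:
    sign = "negative" if r < 0 else "positive"
    a = abs(r)
    if a == 0:
        return "zero correlation"
    if a == 1:
        return f"perfect {sign} correlation"
    if a <= 0.25:
        return f"weak {sign} correlation"
    if a <= 0.75:
        return f"moderate {sign} correlation"
    return f"strong {sign} correlation"

print(correlation_strength(0.85))  # strong positive correlation
print(correlation_strength(-0.4))  # moderate negative correlation
```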
LINEAR CORRELATION - PEARSON'S
CORRELATION COEFFICIENT

 Used to measure the strength of association between two continuous features.
 Both positive and negative correlations are useful.

• Steps:
 Compute Pearson’s correlation coefficient for each feature.
 Sort the features according to the score.
 Retain the highest-ranked features; discard the lowest ranked.

• Limitations:
 Pearson assumes all features are independent.
 Pearson identifies only linear correlations.
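Pearson's r can be computed directly from its definition, r = cov(x, y) / (sd_x · sd_y). A self-contained sketch with illustrative data:

```python
# Pearson's correlation coefficient from first principles.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sqrt(sum((xi - mx) ** 2 for xi in x))
           * sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]       # y = 2x: a perfectly linear relationship
print(pearson_r(x, y))     # 1.0 (perfect positive correlation)
```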
CORRELATIONAL ANALYSIS - SCATTER DIAGRAM

• A scatter plot is a simple graph in which the data of two continuous variables are plotted
against each other.
• It examines the relationship between two variables and checks the degree of
association between them.
• One variable is called the independent variable and the other is called the
dependent variable.
• The degree of association between the variables is known as correlation.

• A scatter diagram is one way of finding the extent of the relationship between two
quantitative variables.
• However, this method only indicates that there is a relationship between two variables,
not the extent to which they are related.
REGRESSION - LINEAR REGRESSION

• Regression is a statistical measurement that attempts to determine the strength of the relationship between a
dependent variable and a series of independent variables.

Linear regression is a quick and simple statistical regression method used for predictive
analysis; it shows the relationship between continuous variables. Linear regression shows
the linear relationship between the independent variable (X-axis) and the dependent variable
(Y-axis).

If there is a single input variable (x), such linear regression is called simple linear regression.
If there is more than one input variable, it is called multiple linear regression.

Linear regression always uses a linear equation,
so the hypothesis function for linear regression is:
y = mx + c
where x is the explanatory variable and y is the dependent variable.

It finds a linear relationship between x (input) and y (output).

y = dependent variable.

x = independent variable.

c = intercept of the line.

m = slope.
INTERCEPT AND SLOPE

• The intercept (often labelled the constant) is the expected mean value of
y when all x = 0.
• Start with a regression equation with one predictor, x.
• If x sometimes equals 0, the intercept is simply the expected mean
value of y at that value.
• If x never equals 0, the intercept has no practical meaning.
• The slope indicates the steepness of the line.
• m is the slope of the regression line: the rate of change
in y as x changes.
• Example: if the slope is 5 and the y-intercept is 2 (y = 5x + 2),
then when x increases by 1, y increases by 5.
• Example: if the slope is -0.4 and the y-intercept is 7.2 (y = -0.4x + 7.2),
then when x increases by 1, y decreases by 0.4.
Represent in simple variables: y = mx + c
• Example: estimate the salary of an employee based
on years of experience.

• Here, years of experience is the independent variable,
and the salary of the employee is the dependent variable.

• So y = salary and x = experience.
Rearranging y = mx + c gives c = y - mx. The least-squares estimates
accumulate over all data points i:

n = n + (X[i] - mean_x) * (Y[i] - mean_y)
d = d + (X[i] - mean_x) ** 2
m = n / d
c = mean_y - (m * mean_x)
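The steps above can be collected into a runnable sketch. The salary/experience numbers below are hypothetical, chosen so the fit comes out to round values:

```python
# Least-squares slope and intercept for salary vs. years of experience.
X = [1, 2, 3, 4, 5]        # years of experience (hypothetical)
Y = [30, 35, 40, 45, 50]   # salary in thousands (hypothetical)

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

n = sum((X[i] - mean_x) * (Y[i] - mean_y) for i in range(len(X)))
d = sum((X[i] - mean_x) ** 2 for i in range(len(X)))

m = n / d                  # slope
c = mean_y - (m * mean_x)  # intercept
print(m, c)                # 5.0 25.0, i.e. salary = 5 * experience + 25
```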
FIND THE BEST FIT LINE

• When working with linear regression, our main goal is to find the best-fit line,
meaning the error between predicted and actual values should be minimized.
The best-fit line will have the least error.

• For linear regression, we use the Mean Squared Error (MSE) cost function,
which is the average squared error between the predicted and actual values.
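The MSE cost function named above is a one-liner. A sketch with hypothetical predicted and actual values:

```python
# Mean Squared Error: the average squared difference between
# actual and predicted values (hypothetical numbers).
y_actual = [3.0, 5.0, 7.0, 9.0]
y_pred   = [2.5, 5.5, 6.5, 9.5]

mse = sum((a - p) ** 2 for a, p in zip(y_actual, y_pred)) / len(y_actual)
print(mse)  # 0.25: every prediction is off by 0.5, and 0.5 ** 2 = 0.25
```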
ACTIVITY TIME

ICE BREAKERS
LEAST SQUARES ESTIMATION

• When fitting a straight line through a scatterplot, choose the line that makes the
vertical distances from the points to the line as small as possible.
• A fitted value is the predicted value of the dependent variable.
• The residual is the difference between the actual and
fitted values of the dependent variable.

Fundamental equation for regression:
Observed Value = Fitted Value + Residual

For data points above the line, the residual is
positive, and for data points below the line, the
residual is negative.
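The observed = fitted + residual decomposition can be checked numerically. A sketch using a hypothetical fitted line y = 2x + 1:

```python
# Fitted values and residuals for a hypothetical line y = 2x + 1.
X = [1, 2, 3]
Y = [3.5, 4.5, 7.0]              # observed values (hypothetical)

fitted = [2 * x + 1 for x in X]  # [3, 5, 7]
residuals = [y - f for y, f in zip(Y, fitted)]
print(residuals)  # [0.5, -0.5, 0.0]: above the line, below it, and on it
```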
DATA ANALYSIS TOOL FOR
REGRESSION

• 1. On the Data tab, in the Analysis group, click Data Analysis.
• 2. Select Regression and click OK.
• 3. Select the Y Range. This is the response variable (also called the
dependent variable).
• 4. Select the X Range. These are the explanatory variables (also
called independent variables). These columns must be adjacent to
each other.
• 5. Check Labels.
• 6. Click in the Output Range box and select a cell.
• 7. Check Residuals.
• 8. Click OK.
• Excel produces the Summary Output (rounded to 3 decimal places).
INFERENCE OF REGRESSION MODEL

• Multiple R. The correlation coefficient, which measures the strength of the linear relationship
between two variables. The correlation coefficient can be any value between -1 and 1, and
its absolute value indicates the strength of the relationship: the larger the absolute value, the
stronger the relationship.
• R Square. The coefficient of determination, used as an indicator of goodness of
fit. It shows how closely the data points fall to the regression line. The R² value is
calculated from the total sum of squares; more precisely, the total sum of squares is the
sum of the squared deviations of the original data from the mean.
• Adjusted R Square. The R square adjusted for the number of independent variables in
the model. You will want to use this value instead of R square for multiple regression analysis.
• R-squared increases even if an added independent variable is insignificant. Adjusted R-squared
increases only when the independent variable is significant and affects the dependent
variable. Adjusted R² is always less than or equal to R².
• As the sample size increases, the difference between adjusted R-squared and R-squared reduces.
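Both measures follow directly from their definitions: R² = 1 - SS_res / SS_tot, and adjusted R² penalises for the number of independent variables k given n observations. A sketch with illustrative values:

```python
# R-squared and adjusted R-squared from their definitions.
def r_squared(y, y_pred):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    # n = observations, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

y      = [2.0, 4.0, 6.0, 8.0, 10.0]   # actual (hypothetical)
y_pred = [2.2, 3.8, 6.0, 8.2, 9.8]    # predicted (hypothetical)

r2 = r_squared(y, y_pred)
adj = adjusted_r_squared(r2, n=5, k=1)
print(r2, adj)  # adjusted R² is always <= R²
```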
• Standard Error. Another goodness-of-fit measure that shows the precision
of your regression analysis: the smaller the number, the more certain you can
be about your regression equation. While R² represents the percentage of the
variance of the dependent variable that is explained by the model, the standard
error is an absolute measure that shows the average distance that the data
points fall from the regression line.
• Observations. Simply the number of observations in your model.
EXCEL REGRESSION ANALYSIS OUTPUT: ANOVA

1. SS = Sum of Squares.
2. Regression MS = Regression SS / Regression degrees of freedom.
3. Residual MS = mean squared error (Residual SS / Residual degrees of freedom).
4. F: the overall F statistic for the null hypothesis.
5. Significance F: the P-value associated with the overall F test.
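The ANOVA quantities above are related by simple arithmetic. A sketch with hypothetical sums of squares and degrees of freedom (Excel computes these for you):

```python
# How the ANOVA table's numbers fit together (hypothetical values).
regression_ss, regression_df = 120.0, 1
residual_ss, residual_df = 30.0, 10

regression_ms = regression_ss / regression_df  # 120.0
residual_ms = residual_ss / residual_df        # 3.0 (mean squared error)
f_stat = regression_ms / residual_ms           # overall F statistic
print(f_stat)  # 40.0
```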
REGRESSION ANALYSIS: INTERPRET REGRESSION
COEFFICIENTS

1. Coefficient: gives you the least squares estimate.
2. Standard Error: the standard error of the least squares estimate.
3. t Statistic: the t statistic for testing the null hypothesis against the
alternate hypothesis.
4. P Value: the p-value for the hypothesis test.
5. Lower 95%: the lower boundary of the confidence interval.
6. Upper 95%: the upper boundary of the confidence interval.
