Unit-5 Concept of Business Analytics
From manual effort to machines, there has been no looking back for humans. In came the
digital age and out went the last iota of doubt anyone had regarding the future of mankind.
Business Analytics, Machine Learning, AI, Deep Learning, Robotics, and Cloud have
revolutionized the way we look at, absorb, and process information. While developments are
still ongoing in several of these advanced fields, business analytics has become
all-pervasive across functions and domains. Hardly any aspect of our lives is
untouched by analytics: it shapes everything from how we buy our
toothpaste to how we choose dating partners and how we lead our lives. Read on to learn what
Business Analytics is.
Business Analytics is often used interchangeably with data analytics. The difference is that,
while data analytics is the brainchild of the data boom, business analytics represents a coming
of age that places data insights at the heart of business transactions. Nearly 90% of all small,
mid-size, and large organizations have set up analytical capabilities over the last 5 years in an
effort to stay relevant in the market and draw value out of the insights that large volumes of
data recorded in the digital age can provide. Now that you know the definition of business
analytics, let us take a look at a more comprehensive understanding of the business analytics
process.
Professionals, on the other hand, are also in a rush to bag analytical roles for career success.
So, what does it mean for aspiring analytics professionals?
Nearly every domain has seen a rise in the number of opportunities in the analytics
segment, but there is still a large gap between supply and demand when filling these positions.
This is because of a lack of relevant, quality education at the undergraduate level (which still
follows an archaic curriculum) and a lack of enthusiasm to upskill, especially among
seasoned professionals with more than 5 years of experience. Slowly, this trend is changing:
freshers are taking up business analytics and data science courses before entering the
workforce, and seasoned professionals are recognizing that they may render
themselves jobless unless they acquire the skills required in the digital economy. To make the
switch:
· Find opportunities within your own firm to move – Every mid to large organization is
establishing its analytical capabilities and there are ample opportunities out there for
people to switch. If you have experience with reporting, analysis, statistics or
advanced Excel, chances are that your leaders will be open to you moving on to a more
complex role. In the beginning, you may have to juggle your regular work and
new analytical initiatives, but this is one of the easiest ways to get started.
· Take up an Analytics Course – Learning things scientifically in a structured format helps
you scale faster. Several options are available when it comes to Analytics courses, ranging
from MOOCs, weekend programs and hybrid courses (classroom + online) to full-time
programs. While traditional full-time programs tend to promise the best results, hybrid
courses and MOOCs are more suited to the learning needs of working professionals.
· Get Hands-On Experience – A certificate, or merely stating on your CV that you know
analytics tools and techniques, will not help you get through job interviews. What you
need are completed projects on your resume to make an impression. Participating in online
hackathons, doing free projects with public data, or solving analytics challenges on Kaggle or
Analytics Vidhya will go a long way in giving you the confidence to make this switch.
Do you know what the main components of a Business Analytics dashboard are? Let us take a
look at them.
· Data Aggregation: Before you start the process of analysis, you are required to gather,
organise and filter data, for example from transactional records.
· Data Mining: The process of data mining refers to sorting through large volumes of
data using statistics and machine learning. This helps in identifying trends and
establishing relationships.
· Association and Sequence Identification: We must then identify actions that are
performed in relation to other actions or in a particular sequence.
· Text Mining: This allows us to explore large volumes of unstructured text. This
is done for qualitative and quantitative analysis of the data.
· Forecasting: Forecasting analyzes historical data, which can be
from a specific time period, in order to make informed estimates and determine
future behaviour.
· Predictive Analytics: This allows us to use various statistical tools and techniques to
create a predictive model. This model extracts information from different datasets and
provides information regarding patterns.
· Optimization: After all the trends have been identified, and once all the predictions have
been made, businesses can use simulation techniques that allow them to test the
best-case scenarios.
· Data Visualization: This provides visual representations in the form of charts or graphs, which
allow quick data analysis.
There are four main types of Business Analytics, each more complex than the last.
Each step brings us closer to applying insights about real-time and future situations. Each
of these types is discussed below.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive Analytics
It summarizes an organisation’s existing data to understand what has happened in the past or is
happening currently. Descriptive Analytics is the simplest form of analytics as it employs data
aggregation and mining techniques. It makes data more accessible to members of an
organisation such as the investors, shareholders, marketing executives, and sales managers.
It can help identify strengths and weaknesses and provides an insight into customer behavior
too. This helps in forming strategies that can be developed in the area of targeted marketing.
2. Diagnostic Analytics
This type of analytics helps shift focus from past performance to current events and
determines which factors are influencing trends. To uncover the root cause of events,
techniques such as data discovery, data mining and drill-down are employed. Diagnostic
analytics makes use of probabilities and likelihoods to understand why events may occur.
Techniques such as sensitivity analysis and training algorithms are employed for classification
and regression.
3. Predictive Analytics
This type of analytics is used to forecast the possibility of a future event with the help of
statistical models and ML techniques. It builds on the results of descriptive analytics to devise
models that extrapolate the likelihood of future outcomes. Predictive analysis is usually run by Machine Learning
experts, who can achieve a higher level of accuracy than is possible with business intelligence
alone.
One of the most common applications is sentiment analysis. Here, existing data collected from
social media is used to provide a comprehensive picture of a user's opinion. This data is
analysed to predict their sentiment (positive, neutral or negative).
4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides recommendations for the next best action
to be taken. It suggests all favorable outcomes according to a specific course of action and also
recommends the specific actions needed to deliver the most desired result. It mainly relies on
two things, a strong feedback system and a constant iterative analysis. It learns the relation
between actions and their outcomes. One common use of this type of analytics is to create
recommendation systems.
Although business analytics is being leveraged in most commercial sectors and industries, the
following applications are the most common.
1. Banking
Credit and debit cards are an everyday part of consumer spending, and they are an ideal way of
gathering information about a purchaser’s spending habits, financial situation, behavior trends,
demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations are critical for any company that wants to retain customer loyalty and
stay in business for the long haul. CRM systems analyze important performance indicators such
as demographics, buying patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that help
organizations maneuver their way through tricky terrain. Corporations turn to business analysts
to optimize budgeting, banking, financial planning, forecasting, and portfolio management.
4. Human Resources
Although HR is often the punchline of many office jokes, its value in keeping a company
successful is not to be underestimated. Great businesses are composed of a great staff, and it’s
HR’s job to not only find the ideal candidates but keep them on board. Business analysts help
the process by poring over data that characterizes high-performing candidates, such as
educational background, attrition rate, the average length of employment, etc. By working with
this information, business analysts help HR by forecasting the best fits between the company
and candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect
operations and the bottom line. Identifying things like equipment downtime, inventory levels,
and maintenance costs helps companies streamline inventory management, risks, and supply
chain management to create maximum efficiency.
6. Marketing
Which advertising campaigns are the most effective? How much social media penetration
should a business attempt? What sort of things do viewers like/dislike in commercials? Business
analysts help answer these questions and so many more, by measuring marketing and
advertising metrics, identifying consumer behavior and the target audience, and analyzing
market trends.
As you can see, business analytics plays a valuable role in many different industries. You may
also notice that some of the applications merge into each other, but that’s hardly surprising. By
leveraging business analytics, multiple departments and teams can coordinate their efforts
based on the information gathered and processed. It’s up to the business analyst to identify
roadblocks and areas that need improvement, helping different departments to work together
to achieve a common goal.
Business Analytics Applications
1. Customer Segmentation
Customer segmentation is a vital business analytics application that helps companies group
their customers based on shared characteristics such as demographics, buying behavior, and
preferences. By analyzing customer data, businesses can tailor their marketing strategies,
product offerings, and customer service to target specific segments effectively, increasing
customer satisfaction and overall profitability.
2. Predictive Analytics
Predictive analytics leverages historical and real-time data to forecast future trends and events.
This application is used extensively in industries like finance, healthcare, and e-commerce for
tasks such as predicting stock prices, patient outcomes, and product demand. It enables
proactive decision-making, risk mitigation, and optimization of business operations.
3. Supply Chain Optimization
Businesses utilize analytics to optimize their supply chains by analyzing data related to
inventory levels, supplier performance, transportation logistics, and demand forecasting. By
identifying inefficiencies and bottlenecks in the supply chain, companies can reduce costs,
improve product availability, and enhance overall operational efficiency.
4. Fraud Detection
Fraud detection analytics employs advanced algorithms and machine learning models to
identify and prevent fraudulent activities, such as credit card fraud, insurance fraud, and cyber-
attacks. By analyzing transactional data patterns and anomalies, organizations can minimize
financial losses and maintain the trust of their customers.
5. Market Basket Analysis
Market basket analysis involves examining customer purchase history to discover patterns in
product co-purchases. Retailers use this application to optimize product placement, cross-
selling, and promotional strategies. By understanding which products are frequently bought
together, businesses can increase sales and enhance the customer shopping experience.
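The idea of finding products that are frequently bought together can be sketched in a few lines of Python. This is only a minimal illustration: the baskets are made up, and the names pair_counts and support are hypothetical helpers, not part of any analytics package.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase data: each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted() makes ("bread", "butter") and ("butter", "bread") the same key
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
# Support: share of all baskets that contain the pair
support = top_count / len(baskets)
print(top_pair, top_count, support)
```

In this made-up data, bread and butter appear together more often than any other pair, which is the kind of pattern a retailer would use for product placement or cross-selling.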
6. Churn Analysis
Churn analysis focuses on identifying and reducing customer churn, which is the rate at which
customers stop using a company's products or services. By analyzing customer behavior and
feedback, businesses can implement retention strategies to retain valuable customers and
reduce revenue loss.
7. A/B Testing
A/B testing is a fundamental analytics application for optimizing digital marketing campaigns
and website performance. It involves conducting controlled experiments by randomly assigning
users to different versions of a webpage or marketing content. By comparing the performance
of these versions, companies can make data-driven decisions to improve conversion rates and
user engagement.
8. Employee Performance Analytics
Employee performance analytics helps organizations evaluate the productivity and engagement
of their workforce. By analyzing data on key performance indicators (KPIs), attendance, and
employee feedback, companies can make informed decisions about talent management,
training, and workforce optimization.
9. Sentiment Analysis
Sentiment analysis, also known as opinion mining, uses natural language processing and
machine learning techniques to assess public sentiment and opinions from sources like social
media, customer reviews, and surveys. Companies can gain insights into how their brand is
perceived and use this information to shape marketing strategies and product development.
Usage of Business Analytics
Business analytics helps organizations run more efficiently and profitably. Here are six cases
where business analytics proves its worth in the commercial sector.
1. Churn Prevention
Churn is the customer attrition rate, a percentage of subscribers, or customers who stop doing
business with a company. Successful companies must keep the churn rate low and replace any
customer losses that inevitably occur. Furthermore, it’s more expensive to acquire new
customers than it is to retain existing ones. By using predictive analysis, a business analyst helps
identify customer dissatisfaction and the most likely risks of departure.
2. E-Commerce Personalization
Online businesses, like Amazon, collect, process, and analyze customer data to personalize their
customers’ shopping experiences. By customizing the experience, vendors can make
recommendations and increase the likelihood of further sales.
3. Predictive Maintenance
Companies must face the inevitability of equipment maintenance, both scheduled and
unplanned. Business analysts work with data to create metrics about maintenance lifecycles to
predict future maintenance needs and avoid costly unplanned downtime.
4. Insurance Fraud Detection
Insurance fraud is costly to companies and their customers alike. This is especially true in the
medical insurance industry, where fraud costs organizations in the US approximately $68 billion
a year. Business analysts use big data to process billions of claims and billing records, enabling
investigators to identify and mitigate any fraudulent activity.
5. Recruitment
As mentioned earlier, hiring new staff comes with its share of risks and uncertainty. Business
analysts leverage data-driven recruitment platforms to get a better picture of any given
candidate—improving the likelihood of a successful job match much faster. In some cases, the
information can even help anticipate job needs before a position is posted.
These notes are meant to provide a general overview on how to input data in Excel and Stata
and how to perform basic data analysis by looking at some descriptive statistics using both
programs.
Excel
When Excel opens you will see a blank worksheet, which consists of alphabetically titled columns
and numbered rows. Each cell is referenced by its column and row coordinates; for
example, A1 is the cell located in column A and row 1, and B7 is the cell in column B and row 7. You
can reference a range of cells: for example, C1:C5 refers to the cells in column C and rows 1 to 5. You can
also reference a matrix: A10:C15 refers to the cells in columns A, B and C and rows 10 to 15.
To select cells:
· Click on a cell (e.g. A10), hold the shift key, and click on another cell (C15) to select the cells
between A10 and C15.
· You can also click on a cell and drag the mouse over the desired range.
· To select non-adjacent cells, click on a cell, press ctrl and select another cell or range of
cells.
Excel stores your work in a workbook; each workbook has one or more worksheets (and/or
charts), which you can view by clicking on the sheet tab (lower left corner of the active (current)
sheet).
Entering data
You can type anything in a cell. In general you can enter text (or labels), numbers, formulas
(starting with the "=" sign), and logical values (as in "true" or "false").
Click on a cell and start typing. Once you finish typing, press "enter" (to move to the next cell
below) or "tab" (to move to the next cell to the right).
You can write long sentences in a single cell, but you may see them only partially depending on the
column width of the cell (and whether the adjacent column is full). To adjust the width of a
column go to Format -- Column -- Width or select "AutoFit Selection".
Numbers are assumed to be positive. If you need to enter a negative value, use the minus sign
("-") or enclose the number in parentheses ("(number)").
If you need to enter percentages, dollar signs, or any other symbol to identify the number, just
add the "%" or "$". You can also enter the number and change its format using the menu:
Format -- Cell and select the "number" tab, which has all the different formats.
Dates are automatically stored as mm/dd/yyyy (or the default format if changed), but there is
some flexibility here. Enter the month and day and Excel will enter the date in the default
format. If you press "ctrl" and ";" (Ctrl-;) Excel will enter the current date.
Time is also entered in a default format. Enter "5 pm" and Excel will write "5:00 PM". To enter the
current time press "ctrl" and ":" (Ctrl-:).
To practice enter the following table (these data are made-up, not real)
Each column has a list of items. Column A has IDs, column B has last names of students and so
on.
Let's say, for example, you do not want capital letters in the columns "Last Name" and "First
Name": you want "Smith" rather than "SMITH". You have two options: re-type all the
names, or use Excel's PROPER function, for example =PROPER(B2) for a name stored in cell B2
(IMPORTANT: All formulas start with the equal "=" sign).
The full table should look like this. This is a made up table, it is just a collection of random info
and data.
Exploring data in excel
Generally one of the first things to do with new data is to get to know it by asking some general
questions like but not limited to the following:
You can start answering some of these questions by looking directly at the table; for other
questions you may have to do some calculations by obtaining a set of descriptive statistics.
These statistics are a collection of measurements of two things: location and variability.
Location tells you the central value of your variables (the mean is the most common measure of
this). Variability refers to the spread of the data around the central value (e.g. variance,
standard deviation). Statistics is basically the study of what causes such variability.
Location      Variability
Mean          Variance
Mode          Standard deviation
Median        Range
Let's get some descriptive statistics for this data. In Excel go to Tools -- Data Analysis. If you do
not see the "Data Analysis" option you need to install it: go to Tools -- Add-Ins and, in the window
that pops up, check the "Analysis ToolPak" option, then press OK. Try running Data Analysis again.
In the pop-up window select "Descriptive Statistics" and click OK.
Check "Summary statistics" and then press OK. You will get the following:
While all the descriptive statistics cells are selected, go to Format--Cells to change all
numbers to have one decimal place. When you get the "format cells" window, select the
following:
Click OK. All numbers should now have one decimal place, as follows:
Now we know something about our data.
The average student in this sample is 25.2 years old, has an SAT score of 1848.9, got a grade of 80.4,
is 66.4 inches tall and reads the newspaper 4.9 times a week. We know this by looking at the
"mean" value of each variable.
The mean is the sum of the observations divided by the total number of observations. It is the
most common indicator of central tendency of a variable. If you look at the last two rows,
"Sum" and "Count", you can calculate the mean by dividing "Sum" by "Count" (sum/count). You can
also calculate the mean using the function below (IMPORTANT: All functions start with the
equal "=" sign):
For "age"
=AVERAGE(J2:J31)
"Sum" refers to the sum of all the values in a range. For age, it is the sum of the
ages of all students. The Excel function for sum is:
=SUM(J2:J31)
"Count" refers to the number of cells that contain values (numbers). The function is:
=COUNT(J2:J31)
The "Standard Error" (SE) indicates how close the sample mean is to the "true" population
mean. The average age of 25.2 years is just an estimate based on this sample of students, and it would
vary had you used a different set of students. The standard error is calculated by dividing the
standard deviation of the population (or the sample) by the square root of the total number of
observations. The SE can be used to roughly define a range of certainty for the mean (the mean
plus or minus roughly 1, 2 and 3 standard errors). Using
"age":
· You are 68% certain that the average age is between 23.9 and 26.5 years old
· You are 95% certain that the average age is between 22.7 and 27.7 years old
· You are 99% certain that the average age is between 21.4 and 29.0 years old
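The ranges above can be reproduced outside Excel. The following Python sketch uses a made-up list of ages (not the table from these notes) and only illustrates the arithmetic: SE = standard deviation / square root of n, and rough ranges of the mean plus or minus 1, 2 and 3 SE.

```python
import math
import statistics

# Hypothetical ages, made up for illustration (not the table from these notes).
ages = [19, 19, 21, 24, 25, 27, 30, 33, 38]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)        # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(len(ages))     # standard error of the mean

# Rough ranges of certainty: mean plus or minus 1, 2 and 3 standard errors.
ranges = {k: (mean - k * se, mean + k * se) for k in (1, 2, 3)}

for k, (low, high) in ranges.items():
    print(f"mean +/- {k} SE: {low:.1f} to {high:.1f}")
```

Each wider range contains the narrower one, which is why the 99% range is the widest of the three.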
The median is another measure of central tendency. To get the median you have to order the
data from lowest to highest. The median is the number in the middle. If the number of cases is
odd, the median is the single middle value; for an even number of cases, the median is the average of
the two numbers in the middle. The Excel function is:
=MEDIAN(J2:J31)
The mode refers to the most frequent, repeated or common number in the data. For age, there
are more students aged 19 in the sample than of any other age. In the SAT scores the mode
is "#N/A", which means that all values are unique. The Excel function is:
=MODE(J2:J31)
The sample variance measures the dispersion of the data from the mean. It is roughly the average
of the squared distances from the mean, dividing by n - 1 rather than n. It is calculated by:
s^2 = sum of (x_i - mean)^2 / (n - 1)
Higher variance means more dispersion from the mean. The Excel function is:
=VAR(J2:J31)
The standard deviation is the square root of the variance. It indicates how close the data are to
the mean. Assuming a normal distribution, about 68% of the values are within 1 sd of the mean,
95% within 2 sd and 99.7% within 3 sd. The Excel formula is:
=STDEV(J2:J31)
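As a check on the definitions, the sample variance and standard deviation can be computed step by step. The sketch below uses a small made-up list of values and compares the hand calculation with Python's statistics module; it is an illustration, not the data from these notes.

```python
import math
import statistics

# Hypothetical values, chosen so the arithmetic is easy to follow by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n                                # 40 / 8 = 5.0
squared_deviations = [(x - mean) ** 2 for x in values]
variance = sum(squared_deviations) / (n - 1)          # sample variance: divide by n - 1
sd = math.sqrt(variance)                              # standard deviation: square root of the variance

print(round(variance, 3), round(sd, 3))
# The statistics module uses the same n - 1 definition.
print(round(statistics.variance(values), 3), round(statistics.stdev(values), 3))
```

Dividing by n - 1 instead of n is what makes this the sample (rather than population) variance.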
Skewness measures the asymmetry of the data, that is, when in an otherwise normal curve one of the
tails is longer than the other. It gives a rough test for normality in the data (by dividing it by its
SE). If it is positive, the mass of the data is on the left side of the curve and the right tail is
longer (right skewed; the median and the mode are typically lower than the mean). A negative value
indicates that the mass of the data is
concentrated on the right of the curve (left tail is longer, left skewed; the median and the mode
are typically higher than the mean). A normal distribution has a skew of 0. Skewness can also be
estimated with the following function:
=SKEW(J2:J31)
Kurtosis. The conventional view of kurtosis holds that it measures the peak of a
distribution. According to Peter Westfall, that view is not quite correct. His article "Kurtosis as
Peakedness, 1905--2014. R.I.P." (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/)
makes a compelling case against this perception. In Westfall's view, the peak, or lack
thereof, is a symptom rather than the defining characteristic: high
kurtosis mainly suggests the presence of outliers. Technically speaking, kurtosis is driven by
the tails of the distribution rather than the peak, so positive kurtosis indicates heavier tails
than a normal distribution (leptokurtic) and negative kurtosis indicates lighter tails
(platykurtic). A normal distribution has a kurtosis of 0 (given a correction of -3;
otherwise it will have a kurtosis of 3). The Excel function for kurtosis is:
=KURT(J2:J31)
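For readers who want to see what goes into these two statistics, here is a sketch in Python. It assumes the small-sample adjusted formulas that, to my understanding, Excel documents for its SKEW and KURT functions (kurtosis already corrected by -3); the data and the function names are made up for illustration.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def skewness(values):
    """Adjusted sample skewness; needs at least 3 values that are not all equal."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def excess_kurtosis(values):
    """Adjusted sample kurtosis with the -3 correction; needs at least 4 values."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]          # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large value in the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

The symmetric sample has a skewness of essentially 0, while the sample with one large value on the right has a positive skewness, matching the description above.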
To explore the data by groups you can sort the columns for the variables you want (for example
gender, major or country) and obtain descriptive statistics by selecting only the range of
values that covers a particular group. You can also use pivot tables.
Let's say you are interested in looking at the average SAT score by gender and student's
major. Let's make the following crosstabulation:
The pivot wizard will walk you through the process. This is the first window:
Press "Next". In step 2 select the range that covers all the values, as in the following picture:
On the right side of the wizard layout you can see the list of all variables in the data. Click and
drag "Gender" into the "ROW" area. Click and drag "Major" into the "COLUMN" area, and click
and drag "Sat score" into the "DATA" area. The wizard layout should look like this:
In the "DATA" area double-click on "Sum of Sat score". A new window will pop up; select
"Average" and click OK.
The wizard layout should look like this. Click OK and, in step 3 of the wizard window, click "Finish".
In a new worksheet you will see the following (the pivot table window was moved to save some
space).
This is a crosstabulation between gender and major. Each cell represents the average SAT score
for a student according to gender and major. For example a female student with an econ major
has an average SAT score of 1952 (cell B5 in the picture) while a male student also with an econ
major has 1743 (B6). Overall, econ major students have an average SAT score of 1806 (B7). In
general, female students in this sample have an average SAT score of 1871.8 (E5) while male
students average 1826 (E6).
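The same kind of crosstabulation can be sketched in Python. The records below are made up (they are not the students from these notes), and average_by is a hypothetical helper written for this illustration; it groups records by a key and averages the score in each group, which is what the pivot table does with "Average of Sat score".

```python
from collections import defaultdict

# Hypothetical records: (gender, major, SAT score).
students = [
    ("Female", "Econ", 1900),
    ("Female", "Econ", 2000),
    ("Male", "Econ", 1700),
    ("Female", "Math", 1850),
    ("Male", "Math", 1800),
    ("Male", "Math", 1900),
]


def average_by(records, key):
    """Average the score (third field) of the records grouped by key(record)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += record[2]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


cells = average_by(students, lambda r: (r[0], r[1]))   # gender x major cells
by_gender = average_by(students, lambda r: r[0])       # row totals
by_major = average_by(students, lambda r: r[1])        # column totals

for (gender, major), avg in sorted(cells.items()):
    print(gender, major, round(avg, 1))
```

Note that the row and column averages are computed from the individual records, not by averaging the cell averages, which matters when the groups have different sizes.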
Let's say you want to explore whether there is a relationship between the average score
(grade) of each student and his/her major. In the sample we have three majors: Econ, Math and
Politics. The grades are the final grades for the entire academic year.
To do this we use one-way ANOVA, which stands for "analysis of variance". ANOVA "is a broad
class of techniques for identifying and measuring the various sources of variation within a
collection of data" (Kachigan, p. 273, 1986). It is closely related to regression analysis but with
the following difference: "[w]e can think of the analysis of variance technique as testing
hypotheses about the presence of relationships between predictor and criterion variables,
regression analysis as describing the nature of those relationships, and r² as measuring
the strength of the relationships" (ibid.). In other words, ANOVA "tests whether the means
of y [grades in this example] differ across categories of x [majors]" (Hamilton, p. 149).
With the above in mind, let's see if there is a relationship between students' majors and
their final grades. First we need to rearrange the data so Excel can run the ANOVA, using
only the columns "major" and "average score (grade)". Copy and paste both columns into a new
sheet, sort by major (Data--Sort, select the column for major and sort ascending) and separate
the grades by group, one column per major. The final table should look like this:
Go to Tools -- Data Analysis and, in the pop-up window, select "Anova: Single Factor". The following
screen will pop up:
It looks similar to the one we got when we obtained "descriptive statistics". Select the input
range, check "Labels in First Row", select "D1" as the output range, and click OK. You'll get the
following:
By now you should be familiar with the summary statistics presented in the first table. You may
notice that the "sum" column has decimals while the data seems to be integers. The sum has
decimals because some of the scores have decimals; they are just rounded to the nearest
integer.
In the ANOVA table:
· Sources of variation. The analysis of variance requires the estimation of two variances:
between groups (econ, math and politics) and within groups (students).
· df. Degrees of freedom. Between groups it is 2 (number of majors minus 1) and within
groups it is 27 (number of students minus number of majors).
· MS. Mean square of deviations (variance estimates), which is equal to SS/df: roughly 411/2
and 2549/27.
· P-value. This is the value that answers your question. We wanted to know whether there is
some sort of relationship between majors and grades. ANOVA assumes by default (the null
hypothesis) that there is no relationship. As a general rule, a p-value greater than 0.05 means
we cannot reject ANOVA's assumption. We got a p-value of 0.13, which is greater than 0.05, so we
find no evidence of a relation between a student's major and his/her final grade. Had the p-value
been lower than 0.05, we would have concluded that there is some kind of relationship between majors
and grades.
· F-crit. It is the critical value used to check whether we reject or fail to reject ANOVA's
assumption at the 0.05 significance level: if F is larger than F-crit, we reject it.
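To see where the numbers in the ANOVA table come from, the calculation can be sketched in Python. The grades below are made up (three students per major, not the 30 students in these notes), so the results will not match the Excel output; the point is only to show how SS, df, MS and F relate to each other. The p-value and F-crit require the F distribution and are left to Excel here.

```python
# Hypothetical final grades by major (made up for illustration).
groups = {
    "Econ": [80, 85, 90],
    "Math": [70, 75, 80],
    "Politics": [85, 90, 95],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: distance of each group mean from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)
# Within-groups sum of squares: distance of each grade from its own group mean.
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

df_between = len(groups) - 1                  # number of groups minus 1
df_within = len(all_values) - len(groups)     # number of observations minus number of groups

ms_between = ss_between / df_between          # MS = SS / df
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print(df_between, df_within, round(ms_between, 2), round(ms_within, 2), round(f_statistic, 2))
```

A large F means the group means differ by more than the variation within groups would suggest; Excel turns that F into the p-value discussed above.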
Here is a general overview of how some of the numbers were calculated. Follow the coordinates by
columns and rows.
STATA
Stata is a statistical package to help you perform data analysis, data manipulation and graphics.
To open Stata go to Start -- Programs -- Stata[ver.*] -- Stata[*]. For cluster computers contact
OIT for instructions. When you open Stata this is what you will see:
Here are some brief explanations.
You can always use the "point-and-click" method by using the menu. However, for most
procedures we recommend using the command line.
When you work with Stata there are three basic procedures you may want to do first: create a
log file, set your working directory, and set the correct memory allocation for your data.
The log file records everything you type and get while working in Stata. Commands and output
are sent to a text file for you to review later. Think of it as a "tape recorder" for your Stata
session. To create a log file go to File -- Log -- Begin.
Select the working directory. In this case it will be H:\statadata\. Name the log and select the type
"Log (*.log)".
In the results window you will see something like:
The second thing to check is your working directory. To do this, type the following in the
command window:
pwd
This stands for "print working directory" and shows your working directory, which
in this example is H:\statadata. To change the working directory, use the cd command:
cd H:\statadata\
You can also check your current directory by looking at the lower left of the Stata screen.
The third initial step is to set the necessary memory allocation. In the picture above you can see
in green letters after "Notes:" that the memory allocation is 10 MB. This will be enough for a
medium-sized dataset, but sometimes you may need more memory to store your data.
To estimate the size of your dataset, use the formula:
Size (in bytes) = (8*Number of cases or rows*(Number of variables + 8))
Depending on your Stata version and computer power, you can allocate up to around 2
gigabytes. To allocate 1 GB you can type:
set mem 1g
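The size formula above is simple arithmetic, and a quick calculation helps decide how much memory to request. The Python sketch below applies it to a hypothetical dataset of 30 rows and 14 variables; the function name and the numbers are made up for illustration.

```python
def stata_size_bytes(rows, variables):
    """Rough dataset size from the formula in these notes: 8 * rows * (variables + 8)."""
    return 8 * rows * (variables + 8)


# Hypothetical dataset: 30 rows and 14 variables.
size = stata_size_bytes(rows=30, variables=14)
size_mb = size / (1024 ** 2)

print(size, "bytes, about", round(size_mb, 4), "MB")
```

A dataset this small is far below the 10 MB default, so a larger allocation would only be needed for much bigger files.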
From Excel to Stata
To move Excel data into Stata you can simply copy and paste.
NOTE: This is not recommended for really big datasets or datasets with long string variables and lots
of special characters (like ";", ",", "#", "%", etc.)
Go to Stata and click on the "Data editor" icon.
A new window will pop up: the data editor window, where you can input data or simply paste
it.
In Excel, select the whole table (A1:N31). Press Ctrl-C. Go to the "Data Editor" in Stata and paste
the table (Ctrl-V).
Numbers are always black. Red normally indicates an error; in the editor it indicates that values are not
numbers, in this case letters or string characters. Close the data editor by clicking on the "X" in
the upper right corner.
The variable window will be populated with all the variables in your data.
Stata automatically removes the spaces from your original titles but keeps the original format in the
"Label" column. "Type" refers to whether the data is a number or a string (str*). "Format" shows
the length of the variable. In the command window type help format for details.
The whole screen will look like this
PREDICTIVE STATISTICS USING EXCEL
Predictive modelling in Excel? Usually, disbelief is the first reaction I get when I bring the
subject up. When I show how we can exploit Excel's versatile nature to create predictive models
for our data science and analytics ventures, this is followed by an incredulous look.
Let me ask you a question. If the stores around you started gathering consumer data, could
they follow a data-based approach to sell their goods? Could their revenue or sales be
predicted, or the number of goods sold estimated?
Now you may wonder how in the world they are going to construct a complex mathematical
model that can predict these things. It may also be beyond their reach to study analytics or
recruit an analyst. Here is the good news: they don't need to.
Without having to write complicated code that flies over most people's heads, Microsoft Excel
gives us the opportunity to conjure predictive models.
In MS Excel, we can easily construct a simple model such as linear regression that can help us
perform analysis in a few simple steps.
Linear regression is a linear approach to statistically modelling the relationship between the
dependent variable (the variable you want to predict) and the independent variables (the
factors used for predicting). Linear regression gives us an equation like this:
Y = M1*X1 + M2*X2 + ... + Mn*Xn + C
Here, Y is our dependent variable, the X’s are the independent variables and the M’s are the
coefficients. Coefficients are the weights assigned to the features based on their importance,
and C is the constant, that is, the intercept.
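To make the equation concrete, the following Python sketch fits the coefficients (the M's) and the intercept C by ordinary least squares, which is the same technique Excel's Regression tool uses. Everything here is an illustrative assumption: the function names are made up, the data is generated from a known line rather than taken from the iPad example, and the normal equations are solved with a small hand-written Gaussian elimination instead of a statistics library.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(
    xs: Sequence[Sequence[float]], ys: Sequence[float]
) -> Tuple[List[float], float]:
    """Return (coefficients M1..Mn, intercept C) by ordinary least squares."""
    # Append a constant 1 to every row so the intercept is fitted as the last weight.
    rows = [list(x) + [1.0] for x in xs]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    w = solve_linear_system(xtx, xty)
    return w[:-1], w[-1]


def predict(coefficients: Sequence[float], intercept: float, x: Sequence[float]) -> float:
    """Evaluate Y = M1*X1 + ... + Mn*Xn + C for one observation."""
    return sum(m * xi for m, xi in zip(coefficients, x)) + intercept


# Made-up data generated from Y = 2*X1 + 3*X2 + 5, so the fit should recover 2, 3 and 5.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [2 * x1 + 3 * x2 + 5 for x1, x2 in xs]
coefficients, intercept = fit_linear_regression(xs, ys)
print([round(m, 6) for m in coefficients], round(intercept, 6))
```

Once the coefficients are known, predict(coefficients, intercept, x) simply plugs a new observation into the equation.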
To perform a regression analysis in Excel, we first need to enable Excel’s Analysis ToolPak Add-
in. The Analysis ToolPak in Excel is an add-in program that provides data analysis tools for
statistical and engineering analysis.
Go to File -> Options -> Add-ins on the left panel -> Manage: Excel Add-ins -> Go, then tick
Analysis ToolPak and press OK.
"The company Apple wants to predict the price of the iPad by considering the following factors:
Screen (type), Storage capacity, Connectivity (type) and Gen."
-> Encode the data in order to perform the regression analysis by assigning a numeric value to
each category in the categorical columns.
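The encoding step can be sketched in Python as follows. The category values ("Wi-Fi", "Wi-Fi + Cellular") and the function names are hypothetical and chosen only for illustration; the actual codes used in the worksheet are not given in this text.

```python
from typing import Dict, List, Sequence


def build_encoding(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct category to an integer code, in order of first appearance."""
    codes: Dict[str, int] = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes


def encode_column(values: Sequence[str], codes: Dict[str, int]) -> List[int]:
    """Replace every category in the column with its numeric code."""
    return [codes[value] for value in values]


# Hypothetical "Connectivity" column.
connectivity = ["Wi-Fi", "Wi-Fi + Cellular", "Wi-Fi", "Wi-Fi + Cellular"]
codes = build_encoding(connectivity)
encoded = encode_column(connectivity, codes)
print(codes)
print(encoded)
```

Note that plain integer codes imply an order between categories, which is harmless for a two-level column like this one but can mislead the regression when a column has three or more unordered levels.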
Step 1 – Select Regression
Go to Data Analysis on the Data tab, select Regression and press OK:
In this step, we will select some of the options necessary for our analysis, such as:
Output range – The range of cells where you want to display the results
In the summary, we have 3 types of output and we will cover them one-by-one:
Regression statistics table
Coefficient table
Residual table
The regression statistics table tells us how well the line of best fit defines the linear relationship
between the independent and dependent variables. Two of the most important measures are
the R-squared and Adjusted R-squared values.
The R-squared statistic is an indicator of goodness of fit which tells us how much of the variance
is explained by the line of best fit. The R-squared value ranges from 0 to 1. In our case, we have
an R-squared value of 0.93, which means that our line is able to explain 93% of the variance - a
good sign.
But there is a problem - as we keep adding more variables, our R-squared value will keep
increasing (or at least never decrease) even if the added variables have no real effect. Adjusted
R-squared solves this problem by penalizing extra variables and is a much more reliable metric.
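The two measures can be computed with the standard formulas, sketched below in Python. The R-squared value of 0.93 comes from the text; the sample size of 30 observations is an assumption for illustration only, and the 4 predictors correspond to the four factors listed in the example (Screen, Storage capacity, Connectivity and Gen).

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2: float, n_observations: int, n_predictors: int) -> float:
    """Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# R-squared of 0.93 as in the text; 30 observations is an assumed sample size.
print(round(adjusted_r_squared(0.93, n_observations=30, n_predictors=4), 4))
```

Because the penalty grows with the number of predictors, Adjusted R-squared only rises when a new variable improves the fit by more than chance alone would.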
The Coefficient table breaks down the components of the regression line in the form of
coefficients.
Residual Table
The residual table reflects how much the predicted value varies from the actual value. It
consists of the values predicted by our model:
--> Our model has predicted the price range as per the specifications that were given as input.
The detailed model is shown in the attached video.
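The residual table described above can be sketched in Python as follows: each row pairs an actual value with the model's prediction and their difference (actual minus predicted). The prices below are invented for illustration and are not the values from the worksheet.

```python
from typing import List, Sequence, Tuple


def residual_table(
    actual: Sequence[float], predicted: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return rows of (actual, predicted, residual), where residual = actual - predicted."""
    return [(a, p, a - p) for a, p in zip(actual, predicted)]


# Hypothetical actual and predicted prices.
actual_prices = [500, 650, 800]
predicted_prices = [520, 640, 790]
for row in residual_table(actual_prices, predicted_prices):
    print(row)
```

Residuals close to zero mean the prediction is close to the actual value; large positive or negative residuals point to observations the model fits poorly.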