Market Analytics at YouGo Cabs (IIMC-CRC-2019-09)
He joined the company in the aftermath of the adoption of a standardized solution that the
consulting company implemented, which led to a steep crash in the company's market share
and in the reported level of customer satisfaction. A large proportion of the driver-partners
had also joined competing firms, citing a reduction in job flexibility along with a
reduced percentage of profits as reasons for the change.
After this, the company had decided to have the problem handled by the in-house analytics
team, following which they had hired Ravi Kumar who was considered a specialist in emerging
market operations and strategy. Before this job, he was working at XYZ Corp., where he had
been recognised for his success in the rollout of a Europe-based self-care products brand.
Before that, he had worked with various other companies across India and South East Asia.
___________________________________________________________________________
This case was written by Sahadeb Sarkar and Shivanee Pethe of the Indian Institute of Management
Calcutta. The case was prepared solely to provide material for classroom discussion. The authors do
not intend to illustrate either effective or ineffective handling of a managerial situation and they
cannot be held liable for any loss or profit resulting from the use of the concepts highlighted in the
case.
This case study is meant for use in PGP for a course titled “Statistics for Management” taught by Prof.
Sahadeb Sarkar of IIM Calcutta starting 15th July, 2020. Beyond limited printing rights, copying,
distributing or posting of this case study in any form on any media is strictly prohibited. The limited
right to use this case is valid only for the duration of the program.
Since he joined the company, Ravi Kumar had created various teams directly under his
supervision, and had studied not only the business model and the problem itself but also its
various aspects. It was up to him and his team, now, to analyse the data they had collected
and to draw pointers that the company could focus on, for increased profits.
He was concerned because the time to present these findings and pointers to the Board
members was fast approaching. Only after that would he start projects to study each
problem in detail.
The algorithms are designed in such a way that an increase in the availability of cabs
results in a drop in rates, and an increase in demand results in a surge amount.
Additionally, if the availability of cabs is much larger than the demand for cabs, then the app
may also provide a discount on the flat rate per kilometre to encourage a customer to book
the ride. This is counted as a negative surge amount. Also, to influence customers to use
the services of YouGo Cabs instead of their competitors or local yellow cabs, the company
declares various kinds of offers such as discounts based on the purchase of a special pass or
incentives to introduce new members to the application. These are called “upgrades”.
After the trip is over, the application collects detailed feedback from the customer as well as
the driver-partner. These are then used to give a rating to the customer as well as the
driver-partner. A customer with a high rating is given the benefit of additional upgrades.
Driver-partners with a high rating are rewarded with bonuses and perks.
THE DRIVER-PARTNERS
Unlike local cab rental companies, YouGo Cabs does not own cabs. They partner with various
cab owners via profit-sharing contracts. The contract that YouGo Cabs has with their driver-
partners is a flat fee per ride along with a percentage of the profit. This means that for a ride
that costs ₹100 to the customer, the cab-driver will earn a fee of ₹35, and after deducting
maintenance and fuel expenses, the remaining profit is divided between the driver and YouGo
Cabs.
YouGo Cabs also gives financial help to the underprivileged as a part of its CSR initiatives to
encourage employment. The “Gaadi Chalao, Ghar Chalao” (Drive a car, run your home) initiative
is YouGo Cabs’ largest CSR investment, under which they aim to encourage entrepreneurial
spirit in underprivileged youth. Under this initiative, the company has given low-interest
loans to many male and female partners to purchase cars, with which they can become
driver-partners of YouGo Cabs and create a source of livelihood for their families. The driver-partners
can repay the loan to YouGo Cabs from their share of the profit of the trips that they
undertake.
Thus, YouGo Cabs is invested in the welfare of the driver-partners, encourages employment
and has a role to play in their success. Due to the firm’s investment, the driver-partners also
have a deep sense of loyalty and commitment toward the company and its vision and ensure
customer satisfaction. Additionally, through the feedback that the company obtains via the
application and the resulting overall rating of the driver-partner, YouGo Cabs rewards driver-
partners that show a consistently high rating on customer satisfaction. This further
incentivises the driver-partner to maintain a high level of customer satisfaction. This has led
to YouGo Cabs having the highest growth amongst its peers in terms of market reach in recent
years.
THE PROBLEM
YouGo Cabs began its operations in the United States in 2005. By 2008, they were the most
used cab company in the US and by 2009, they were the most widely used mode of private
transport. After their success in the United States, they expanded their operations to Europe
where they were equally successful. After this, they looked to Asia to expand
operations. Since it began operations in India in 2012, YouGo Cabs has seen tremendous
growth in this market. In two years, they doubled the number of driver-partners and
expanded from one city to three cities. By 2016, YouGo Cabs was functional in all
Metropolitan Cities and a majority of Tier-I cities.
With the increasing popularity of ride-sharing companies in Asia, in late 2016, YouGo Cabs was
considering expansion into the China market. It was at that time that they realised that there
was a problem in the algorithm, as it was implemented in the Indian market. The patterns of
growth of their key indicators were not functioning as anticipated.
They noticed a mismatch between the reach of the firm and the profits reported. The reach
of the company, measured by the areas it serviced, the number of drivers and the number of
bookings, showed a significant increase. The profits of the company, however, did not show a
comparable increase. This meant that while the company was expanding the market and
possibly gaining a bigger share in the existing market, this growth was not translating into
profits for the company. Since the share of the driver-partners in every ride was fixed, this
hinted at a problem of allocation that the current algorithm was creating.
The top management at YouGo Cabs wants to update the algorithm to suit the Indian market
and ensure that the profit gained by the company grows with the growing demand. They want
to conduct a study of the impact of various factors that affect the profit and the expected
quantum of impact so that the allocation algorithm can be updated to incorporate these
effects.
When these problems were first noticed, YouGo Cabs had given the project to a consulting
company, ABC Corp. This firm had been a consultant to YouGo Cabs headquarters across the
Americas and Europe. ABC Corp. had tried to implement a method and solution that had shown
tremendous success in the United States and across Europe. However, given the unique
nature of the Indian market and the peculiar behavioural traits of the Indian customer, this
solution backfired, leading to losses in the millions and a reduction in the number of
driver-partners. YouGo Cabs lost market share and the ‘customer satisfaction’ parameters showed
a steep decline.
It was in early 2018 that YouGo Cabs had hired Ravi Kumar, a hotshot in the business of
emerging markets. His first task was to look at the problem at hand, break it down into smaller
tasks and present an analysis of the factors that were impacting profit.
Over the last year and a half, Kumar had worked with several teams and studied, in-depth,
the various departments in the company, working closely with everybody from driver-
partners to back-end programmers. He had a team of market researchers working on
customer feedback, a team of managers understanding the details of the algorithm and a
team of experts consulting on various factors. The reports of these teams were sent to a team
of analysts working on the analysis that Ravi Kumar has to present to the Board. The
presentation is to set out his understanding of the problem and to highlight the firm's
various pain points and the factors that are important for enhancing the profit, following
which he would open further projects into each of them.
EXPLORATORY STORY
On his arrival at YouGo Cabs, Ravi Kumar formed a team of analysts, experts in various
verticals at the company and started an exploratory search. With his team, he conducted a
series of interviews with all stakeholders including Board members, middle managers,
application programmers and driver-partners. He also personally headed a customer survey,
backed by application data as well as customer interviews. All interviews were recorded with
the permission of the respondent and then transcribed by the team. The team then read and
analysed these interviews multiple times to find indications of information on potential
factors that would affect the outcomes on profit.
These interviews were conducted with the objective of understanding the various aspects of
the YouGo Cabs experience. From customers, they tried to gauge the factors that impacted
the preference of the customer to use YouGo Cabs over its competitors. Through their
interviews with the driver-partners, they attempted to understand the ground realities of the
functioning of the application that the algorithm may be missing out on. Interviews with
programmers and managers were focused on understanding the algorithm behind the
working of the YouGo Cabs application.
“...it [problem] is that I look for a cab in peak times and there
are never enough cabs available. Sometimes I wait for a cab
to accept at another price or I just use another transport…”
“...[I] had just looked for a car and the app had shown
available cabs, however, when I tried to book [the app] didn’t
allow…”
These comments led the team to consider the impact of cab availability in an area on the
decision of the customer to use the YouGo Cabs service over any other service. Their analysis
showed that in times where the number of cab requests was much higher than the number
of driver-partners available in the area, the prices of the ride surged to very high values and
customers were more likely to prefer other cabs.
The driver-partners, on the other hand, were bothered by a different set of issues:
“We always get allotted to areas where there are many cabs
and the price of the ride goes down.”
Through their discussion with the programming team and managers, Ravi Kumar’s team had
understood the workings of the algorithm (specifically the relationship between the amount
charged and the number of driver-partners in the location). Combining that understanding
with the insights from their discussions with the driver-partners hinted at a potential
flaw in the allocation mechanism, which allocated too many drivers to a specific area, thereby
reducing the amount charged and the subsequent profits.
Through this process of interviews, they also got peculiar insights into the behaviour of
customers and driver-partners, which they hoped they would be able to incorporate into their
analysis. They found that customers preferred paying a higher price for the trip over waiting
for a long time for the cab to arrive. They also saw that driver-partners reduced the number
of rides accepted when driving conditions became harsh such as in peak hours or when there
was heavy rainfall. This could potentially result in a reduction in the number of cars per trip,
thereby leading to higher charges for the customer.
Thus they came up with a list of factors that they believed would impact profit as shown
below.
Once the team had created this list, they followed an approach similar to the Delphi approach
to establish the utility of these variables. They sent a list of these variables to various
academics and experts within the company and asked them questions reviewing the specific
information provided by each variable and the utility of the variable. If any two reviewers
disagreed, then their responses were exchanged and they were given an opportunity to
update their review. This process was repeated until all the reviewers agreed on the
final list of variables (Table 2).
The variables of ‘Driving conditions during ride-booking’ and ‘Customer cancellations due to
time of arrival’ were removed by the panel of experts. The variable ‘Driving conditions during
ride-booking’ cannot be objectively captured by the application. The variable ‘Customer
cancellations due to time of arrival’ would note if the customer had cancelled the booking
due to long delay in the arrival of a cab. However, this was removed as it was deemed a very
subjective characteristic.
They removed the factor capturing ‘Rejection of trip by driver-partner’ because it may obfuscate
the model: a driver-partner may reject a trip for reasons such as lunch-breaks and
tiredness, which would not have a bearing on the algorithm design.
As a primary analysis, Ravi Kumar wanted the team to establish basic hypotheses about the
relationships between profit and the factors they had identified. He addressed the team and
asked them to look for various methods of analysis that would allow them to not only
establish these relationships but also enable them to comment on the quantum of the
relationship and the significance of the relationship.
To begin this, Kumar has asked the team to create an introductory report on the variables
that they would include into the analysis so that they would have some working hypotheses
about the directionality of the relationships. They would then use the hypotheses that were
so constructed to check the logical standing of the model that they would create.
Based on their exploratory method, the team had an understanding that the most important
variable of their analysis was the amount charged. However, while creating and writing the
introductory report, the team realised that the variable “Amount Charged” was vaguely
defined. Thus they corrected the variable name to “Amount Charged per Ride”.
• Zone of Ride:
YouGo Cabs divides the serviced area into zones called Metro, Metro-Suburban and
Outskirts. The cab company designates various locations into these categories
depending on the total number of customers registered in the area based on the
registered mobile numbers. They also include the classification as given by the local
governing body into consideration while creating zones. The expectation is that the
number of potential customers in the area increases as the zone changes
from Outskirts to Metro-Suburban to Metro, which will increase the surge amount, thereby
increasing profit.
• Upgrade Given:
An upgrade is when a customer is given a benefit while booking a trip. An upgrade
may mean availability of a monthly discount pass at reduced rates, giving the customer
a premium car booking while keeping the cost same as an economy car or any other
benefit that may encourage the customer to choose YouGo Cabs over its competitors.
A customer that has been given an upgrade is more likely to book a ride on YouGo
Cabs, thus affecting the number of customers booking in the zone at the given time.
The increased demand should lead to a higher surge, thus increasing the amount
charged per ride and the resulting profit. It is expected that the profit would increase
with upgrades given.
• Number of Drivers:
The algorithm assigns the amount to be charged for the ride based on the availability
of cars in the area. The amount is decided in a manner similar to the pricing
mechanism of a basic supply-demand model in micro-economics. If too many
driver-partners are available in the area of ride-booking relative to demand, the surge
amount charged for the ride reduces thus reducing the overall expected profit from
that ride. On the other hand, if there are very few driver-partners available compared
to demand for cabs, the surge price rises and hence, increases the expected profit for
• Number of Customers:
Along with the number of drivers available in the area where the ride is being booked,
the relative demand for cars is also an important factor that affects the profit. A high
number of customers compared to the availability of cabs increases the surge amount
charged, thereby increasing the profit of YouGo Cabs. Conversely, if the number of
customers booking cabs in a particular area is low in relation to the availability of cabs,
the surge amount charged reduces and thereby reduces the profit. Thus, one would
expect the profit to increase with an increase in the number of customers and the
corresponding model coefficient to be positive.
Based on this report the team decided to collect data for the variables under focus based on
the data sources available to them. The data on most of the variables were taken from what
the app collected from each user. These variables were Amount Charged per Ride, Type of
Zone and Upgrade Given. Since Amount Charged was measured for every ride,
the data for Profit was also collected for each ride. The data for Profit
per Ride was collected based on a Unique Id comparison between the individual ride data
collected by the application and the profit calculation algorithm implemented by the
company. Using a similar comparison of the day, time and zone of the ride, the aggregated
data of the Number of Drivers in the area and the Number of Customers booking rides in that
area at that time were collected from the master dataset.
In order to avoid introducing additional bias into the data, the team randomly selected one
unique trip in the day and collected data for the last 110 days. The members of the team who
were looking for ideal models to run have thought of implementing statistical modelling
techniques to create a model using a large dataset.
“If your aim is to open more detailed projects into each area,
then you must ensure that you have models that will tell you
not only where there are problems but also the extent of these
problems.”
The team decided to run various regression models to look for the kind of relationships that
all these variables have with the profit and to try and build the best predictive model that
explains the derived profit based on the factors that significantly affect it and then test its
abilities before drawing insights.
Thus, following this principle, they segregated the data into two groups. The first group, called
the ‘training dataset’, consisting of roughly 90% of the collected data (100 of the 110 sample
points) [Appendix 1: Exhibit 1], will be used to conduct the analysis and create the models. The
second group, called the ‘validation/test dataset’, consisting of the remaining roughly 10% of
collected data (10 sample points) [Appendix 1: Exhibit 2], will be used to test the performance of the model.
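The hold-out split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_train_test`, the placeholder ride IDs and the random seed are hypothetical, and the case does not say how the team actually chose which 10 rides went into the test set.

```python
import random

def split_train_test(rows, n_test=10, seed=0):
    """Randomly hold out n_test rows; return (training, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    return shuffled[n_test:], shuffled[:n_test]

# 110 placeholder ride IDs stand in for the 110 sampled rides (hypothetical data).
ride_ids = list(range(1, 111))
train, test = split_train_test(ride_ids, n_test=10)
print(len(train), len(test))  # 100 10
```

Keeping the test rides out of model fitting is what allows the later model comparison to be made on data the model has not seen.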
WHAT NEXT?
After going through the introductory report and having a detailed discussion of the potential
models that can be created with his team, Ravi Kumar has given the go-ahead to his team of
analysts to start developing the regression models based on the collected data. Once these
are created, Ravi Kumar would then personally analyse each model and decide the best
models that would be used to draw further insights into the problems that the company is
facing.
This project is intended to point out the problem areas that the company needs to focus on
and based on the results of this study, a number of projects will be launched to analyse each
problem in detail and a possible overhaul of the entire algorithm and systems may be
conducted.
While overseeing his team’s work to finish the analysis, Ravi Kumar wonders about the
outcomes of the models that his team is currently creating. It is important to him that these
models be thoughtfully constructed and the insights be carefully obtained as these insights
will determine the future of the company in the Indian market, as well as that of the countless
families that depend on YouGo Cabs for employment. The enormity and importance of the
task at hand have overwhelmed him. The YouGo Cabs Board have pinned their faith on him.
Can Ravi Kumar deliver?
Suppose we have a bivariate data set $(x_i, y_i)$, $i = 1, \dots, n$, where Y = sales revenue of a product and
X = expenditure on advertising the product. The following questions are pertinent:
For the second question, one can compute what is called the (sample) correlation coefficient
between X and Y, which is a measure of the linear relationship between x and y. If the
correlation coefficient, in absolute value, is “close” to 1 (on a scale of 0 to 1), say 0.8 or more,
then one may conclude that the linear relationship is strong.
Regression of Y on X can be used to answer the third question. Note that upper-case letters Y
and X are being used to mean they are random variables before their values in the sample are
observed, and lower-case letters x and y represent the observed or realized values of the
random variables X and Y in the sample.
Suppose, y increases with x, then quite often a value of x greater than (less than) its average
tends to be accompanied by a value of y greater than (less than) its average. This means that
if on the whole y increases with x, then $(x_i - \bar{x})(y_i - \bar{y})$ should be positive for most values
of i. Consequently, the sample covariance between x and y (defined below) should be
positive:
$$\mathrm{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
On the other hand, if on the whole, y decreases with x, then a parallel argument shows that
covariance should be negative. Observe that the covariance is not a unit-free measure. By
dividing covariance by $s_x s_y$ we get a unit-free measure called the (product-moment)
correlation coefficient, defined by
$$r_{xy} = r = \frac{\mathrm{Cov}(x, y)}{s_x s_y},$$
where $s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}$.
The covariance and correlation are measures of the linear relationship between x and
y. Because covariance, as a measure of strength of linear relationship, does not have a finite
benchmark (since it ranges from minus infinity to plus infinity), one prefers a correlation
coefficient, as it has a finite range (from minus one to plus one) and can be used to compare
the strengths of linear relationship in different situations. However, both of them may fail to
capture non-linear relationships. The short-cut formulae used for computation are:
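$$\mathrm{Cov}(x, y) = \frac{1}{n-1} \left( \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} \right), \qquad r_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} \, \sqrt{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}}$$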
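As a check on the definitions and the short-cut formula above, the following minimal Python sketch computes the sample covariance both ways and then the correlation coefficient. The function names and the five-point data set are illustrative assumptions, not data from the case.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance from the definition (divisor n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def covariance_shortcut(x, y):
    """Sample covariance from the short-cut formula using raw sums."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)

def correlation(x, y):
    """Product-moment correlation: Cov(x, y) / (s_x * s_y)."""
    s_x = sqrt(covariance(x, x))  # Cov(x, x) is the sample variance of x
    s_y = sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical advertising (x) and sales (y) figures, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(covariance(x, y), 4), round(covariance_shortcut(x, y), 4))  # 1.5 1.5
print(round(correlation(x, y), 4))  # 0.7746
```

Both covariance routines agree, which is the point of the short-cut formula: it needs only the running sums of x, y, xy and the squares.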
(i) Expected value (“average”) of Y given X=x, denoted by “E(Y|X=x)”, is a linear function of x,
i.e., E(Y|X=x) = a+bx. Thus, Y = E(Y|X=x) + (Y − E(Y|X=x)). Denoting the deviation (Y − E(Y|X=x))
by e and treating it as unobservable error, for the observed data $(x_i, y_i)$, i=1,…,n, one can write
$$y_i = a + b x_i + e_i, \quad i = 1, 2, \dots, n.$$
(iii) The error term $e_i$ is a random variable and normally distributed with mean 0 and variance
$\sigma_e^2$ (constant, not depending on i).
(iv) The errors $e_1, e_2, \dots, e_n$ are independent of one another (at least pairwise uncorrelated).
It is natural to think that the sales (Y) of a product is related to the advertising expense (X). If
the sample correlation $r_{xy}$ happens to be 1 (or −1), then all the (x,y)-observations fall on a straight
line defined by the equation y = a + b*x. However, in practice, $r_{xy}$ will rarely be exactly equal
to 1 or −1. If $r_{xy} \neq \pm 1$ but is high in magnitude, then there does not exist a line on which all the
(x,y)-observations lie, although one may find a line around which the observations are pretty
“tightly clustered”. Such a line will define an estimated linear relationship of Y on X based on
the observed data. Using this estimated linear relationship, one can predict the value of Y
when X takes a particular value.
Sales of a product depend on many variables, besides advertising expenditure, such as price
(X2), time of year (X3), state of the economy (X4), state of competition (X5) etc. There could
also be some unobservable, unexplainable factors that contribute to variation in sales.
Therefore, it is unlikely that we can exactly predict the value of sales revenue given the values
of other observable variables. That is, it is unlikely that we can find an exact relationship
between sales (Y) and advertising expense. This can be modelled as
$$y_i = a + b x_i + e_i, \quad i = 1, 2, \dots, n,$$
where errors $e_i$ are assumed to be independently arising from a population with mean 0 and
standard deviation $\sigma_e$. Note $e_i = y_i - (a + b x_i)$ represents the part of Y that remains
unexplained because there might be a nonlinear relationship between Y and X or because Y
depends not just on X but other un-included (observable or unobservable) variables. We
minimise the sum of squares of errors
$$\sum_{i=1}^{n} (y_i - a - b x_i)^2$$
with respect to a and b, which gives the estimates
$$\hat{b} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = r_{xy} \frac{s_y}{s_x} \quad \text{and} \quad \hat{a} = \bar{y} - \hat{b}\bar{x}.$$
This method is called the least squares method because it minimises the sum of squares of
errors. Note that $\hat{b}$, called “b-hat”, has the ‘hat’ notation signifying that it is an estimate of the
slope b of the true but unknown line giving the “best” fit when the sample is the whole
population. The same is true for $\hat{a}$, “a-hat”, as an estimate of the y-intercept “a” of the true
but unknown line.
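The least squares estimates can be computed directly from the raw sums. The sketch below is a minimal illustration; the function name `fit_line` and the small data set are assumptions made for the example and are not taken from the case exhibits.

```python
def fit_line(x, y):
    """Return (a_hat, b_hat) for the least squares line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    # Slope: short-cut numerator over short-cut denominator.
    b_hat = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Intercept: the fitted line passes through the point of means.
    a_hat = sum_y / n - b_hat * sum_x / n
    return a_hat, b_hat

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a_hat, b_hat = fit_line(x, y)
print(round(a_hat, 3), round(b_hat, 3))  # 2.2 0.6
```

With these numbers the slope also equals $r_{xy} s_y / s_x$, so either form of the formula can be used as a cross-check.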
The “95% Confidence Intervals”: Excel Regression Analysis reports “Lower 95%” and “Upper
95%” numbers corresponding to coefficients like the slope b. The number under the heading
“Coefficients” gives a single-number “point estimate” $\hat{b}$ for slope b. But the probability that
this number $\hat{b}$ will be exactly equal to the true value of b is negligible (if not zero), whereas
the probability that the interval defined by [“Lower 95%”, “Upper 95%”] will contain the true
value of b is 0.95 or 95%. Here we have considered the level of confidence $(1-\alpha)$ as 95%. The
confidence intervals (C.I.) are computed as follows:
$$C.I.(\hat{b}, 1-\alpha) = \left[ \hat{b} - t_{\alpha/2,\,n-k-1} \sqrt{\widehat{Var}(\hat{b})}, \;\; \hat{b} + t_{\alpha/2,\,n-k-1} \sqrt{\widehat{Var}(\hat{b})} \right]$$
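where
$$Var(\hat{b}) = \sum c_i^2 \, V(Y_i) = \left( \sum c_i^2 \right) \sigma^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2},$$
$$Var(\hat{a}) = \sum d_i^2 \, \sigma^2 = \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right) \sigma^2,$$
$$c_i = \frac{(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}, \qquad d_i = \left( \frac{1}{n} - c_i \bar{x} \right),$$
and $\sigma^2 = \sigma_e^2$ is estimated by $\hat{\sigma}^2 = \hat{\sigma}_e^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2}$ = Mean Squared Error (MSE), and $t_{\alpha/2,\,n-k-1}$ is
the $100(1-\alpha/2)$-th percentile of the t distribution with (n−k−1) degrees of freedom (d.f.).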
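The interval for the slope can be sketched as follows for simple regression (k = 1, so n − 2 degrees of freedom). The function name and data are hypothetical, and the critical value is passed in as an argument because the Python standard library has no t-distribution quantile function; 3.182 is the tabulated $t_{0.025,3}$.

```python
from math import sqrt

def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) for the slope b, given the t critical value."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b_hat = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / s_xx
    a_hat = mean_y - b_hat * mean_x
    # MSE estimates the error variance; n - 2 d.f. in simple regression.
    sse = sum((b - (a_hat + b_hat * a)) ** 2 for a, b in zip(x, y))
    mse = sse / (n - 2)
    se_b = sqrt(mse / s_xx)  # estimated standard deviation of b_hat
    return b_hat - t_crit * se_b, b_hat + t_crit * se_b

# Hypothetical data with n = 5, so 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(round(lower, 2), round(upper, 2))  # -0.3 1.5
```

Here the interval contains zero, so with this tiny sample the slope would not be judged significantly different from zero at the 5% level.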
If X = father’s height and Y = adult son’s height, then $s_y = s_x$ should hold true, and if the
correlation r is less than 1, say r = 0.75, then the fitted line gives $(\hat{y} - \bar{y}) = 0.75\,(x - \bar{x})$. This implies that if
the height of a father deviates from the average male adult height (that is, a father is either
taller or shorter than the average adult male) in his generation, his adult son’s height will tend
to deviate less from the average height of adult males in his son’s generation. That is, the
height of an adult son regresses to the average height of adult males. Suppose $\bar{y}$ = 5 feet 6
inches = $\bar{x}$; then 6 feet tall fathers’ sons would be on average about 5 feet 10.5 inches tall
and 5 feet tall fathers’ sons would be on average about 5 feet 1.5 inches tall. In both cases
the average height of sons of tall or short fathers goes back (1.5 inches) towards the average
height of 5 feet 6 inches of the population of adult males.
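The arithmetic of this example can be reproduced with a short sketch, working in inches (5 feet 6 inches = 66 inches). The function name is illustrative; the mean and correlation are the values assumed in the text.

```python
def predicted_son_height(father_in, mean_in=66.0, r=0.75):
    """Predicted son's height (inches) when s_y = s_x, so the slope equals r."""
    return mean_in + r * (father_in - mean_in)

print(predicted_son_height(72))  # 70.5, i.e. 5 feet 10.5 inches
print(predicted_son_height(60))  # 61.5, i.e. 5 feet 1.5 inches
```

A father 6 inches from the mean is paired with a predicted son only 4.5 inches from the mean, which is the 1.5-inch pull toward the average described above.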
ANOVA: For a fitted regression model with a constant term, one can decompose the total
variation in the y-values into two parts as follows:
$$TSS = RSS + SSE, \quad \text{i.e.,} \quad \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Remember there are (n−1) independent pieces of information among the deviations
$(y_i - \bar{y})$ since $\sum_i (y_i - \bar{y}) = 0$. Similarly, among the errors $e_i = (y_i - \hat{y}_i)$, i=1,…,n, there are
Criterion 1: How well has the model $y = \bar{y} + \hat{b}(x - \bar{x})$ fitted the data? How close are the
predicted values $\hat{y}_i = \hat{a} + \hat{b} x_i = \bar{y} + \hat{b}(x_i - \bar{x})$ to the observed values $y_i$? It may be
measured by the “coefficient of determination”, denoted by $R^2$, the proportion of variation in
sales (y) explained by the fluctuations in the advertising expense (x):
$$R^2 = \frac{RSS}{TSS} = 1 - \frac{SSE}{TSS}, \quad \text{where } SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2.$$
To compare two different regression models that use different numbers (k) of explanatory
variables and are fitted to datasets of possibly different size (n), $R^2$ is modified to define
Adjusted $R^2$ (or Adj $R^2$) as
$$\text{Adj } R^2 = 1 - \frac{SSE/(n-k-1)}{TSS/(n-1)} = 1 - [1 - R^2] \, \frac{n-1}{n-k-1}$$
One may declare the model as ‘fair’ if Adj $R^2$ is around 0.7, ‘good’ if Adj $R^2$ is 0.8 or more, ‘very
good’ if Adj $R^2$ is 0.9 or more, ‘excellent’ if Adj $R^2$ is 0.95 or more.
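Both measures follow directly from the observed and predicted values. The sketch below is illustrative only: the function names are assumptions, and the predicted values are those of a hypothetical fitted line y = 2.2 + 0.6x on a five-point data set.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / TSS."""
    mean_y = sum(y) / len(y)
    tss = sum((v - mean_y) ** 2 for v in y)            # total variation
    sse = sum((v - p) ** 2 for v, p in zip(y, y_hat))  # unexplained variation
    return 1 - sse / tss

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and predictions from y = 2.2 + 0.6x, x = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y, y_hat)
print(round(r2, 3), round(adjusted_r_squared(r2, n=5, k=1), 3))  # 0.6 0.467
```

The adjustment always lowers the value when k is at least 1, and the penalty is larger for small n, as the gap between 0.6 and 0.467 shows.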
On the Term $R^2$: The coefficient of determination $R^2$ is called so because it is the square of the
correlation coefficient of Y and X in the case of the simple linear regression (Y = a*1 + b*X) model
with a single predictor. In the case of multiple linear regression ($Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$), $R^2$ is
the square of the “multiple correlation coefficient”, which measures the linear relationship
between Y and the set of predictors $X_1, X_2, \dots, X_k$. The “multiple correlation coefficient” can be
calculated as the correlation between Y and $\hat{Y}$, where $\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \dots + \hat{b}_k X_k$
is the predicted value of Y from the estimated multiple linear regression model.
On the F and t Tests: If Adj $R^2$ is satisfactory, one can formally test the “hypothesis” that the
whole set of explanatory variables in the model is redundant through an “F” test. If it is
not, then one can check if any of the explanatory variables in the model is redundant by testing
if the corresponding coefficient parameter can be taken to be zero. This is done using a “t”
statistic that follows a “t” distribution.
The reported “Significance F” in Excel Regression Analysis output is known as the “P-value”,
which is calculated as the probability that the $F_{k,\,n-k-1}$ random variable is greater than or equal to
the value of the statistic (the term “statistic” is used for any function of sampled observations).
If “Significance F” for the “F” statistic is less than a benchmark value called a “level of
significance” or just “level”, then we reject the “hypothesis” that the whole set of explanatory
variables in the model is redundant. Usually, the “significance level” or “level” value is taken
to be either 1% or 5% or 10%. When in doubt, one may use a 5% level.
The reported “t Stat” (or t statistic) value in Excel Regression Analysis output is calculated as
where “standard error” represents the estimated standard deviation of the “coefficient”
estimate (“coefficient” being the slope b or the intercept a in a regression model such as Y =
a*1 + b*X). The “P-value”, which is calculated as the probability that the absolute value of a
tn-k-1 random variable is greater than equal to the absolute value of the “t Stat” (remember a
statistic is a function of sampled observations). In Excel, one can calculate it as
Again, if the P-value for the t Stat is less than a benchmark value called a “significance level”,
then we reject the hypothesis that a particular explanatory variable in the model is redundant.
Usually, the value of the “significance level” is taken to be either 1% or 5% or 10%.
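The t test described above can be sketched in Python using only the standard library. This is a minimal illustration, not the Excel implementation: the data are hypothetical, the function names are made up for this example, and the two-sided P-value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistical library.

```python
from math import gamma, pi, sqrt

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), using Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3          # P(0 <= T <= |t_stat|)
    return max(0.0, 1.0 - 2.0 * central_area)

def slope_t_stat(x, y):
    """t Stat = slope / standard error for y = a + b*x, with df = n - k - 1, k = 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    standard_error = sqrt(sse / df / sxx)  # estimated std. deviation of the slope
    return b / standard_error, df

# Hypothetical data: x = advertising expense, y = sales.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.1]

t_stat, df = slope_t_stat(x, y)
p_value = two_sided_p_value(t_stat, df)
# Reject "the slope is zero" at the 5% level when the P-value is below 0.05.
print(t_stat, df, p_value, p_value < 0.05)
```

For large degrees of freedom the gamma-function ratio in this sketch can overflow, so a statistical library would be the safer choice in practice.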
Criterion 2: To check the “independence” of regression errors, one can inspect the scatter-
plot of residuals against predicted values. This scatter plot should look ‘random’ (horizontal
rectangular shape), without any discernible patterns. For business data observed over time,
one can calculate the Durbin-Watson statistic (DW) value, defined as the sum of squares of
successive differences in errors divided by the sum of squares of errors
$$\mathrm{DW} = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
Thus, the range for DW is 0 to 4. If DW ≈ 2, then one may conclude that the regression errors
are independent. If the DW is “far away” from 2, it means one may improve the model by
including some other explanatory variables such as (i) square of an explanatory variable that
is already there or (ii) lagged values of the response variable Y [e.g. (Lag1)_t = Y_{t-1} = value
of Y from the last period]. The DW statistic is ideally used for testing the independence of
errors in time series data.
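The DW definition above translates directly into a few lines of Python. The sketch below is illustrative only: the function name and the two residual series are hypothetical, chosen to show a value well above 2 (residuals that alternate in sign) and a value well below 2 (residuals that stay on one side for several periods).

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, ordered in time.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every period
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # long runs of the same sign

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(dw_alternating, dw_persistent)
```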
If the values of the dependent variable Y are not collected over time, one should not calculate
the DW value. But one can still inspect the plot of residuals versus predicted values to check
if it looks ‘random’. If the plot does not look random one needs to modify the model by
including other explanatory variable(s).
Criterion 3: One may fit different models to the data. For example, with y = Sales and x =
advertising expense (AE): (i) Sales = a + b*AE, (ii) Sales = a + b1*AE + b2*Price, (iii) Sales = a +
b1*AE + b2*(AE)^2, etc. Based on Criteria 1 and 2, suppose one shortlists a few models. To decide
the best among the shortlisted models one may use a third criterion called MAPE (Mean
Absolute Percentage Error) defined by
$$\mathrm{MAPE} = \frac{1}{m} \sum_{j=1}^{m} \frac{|e_j|}{y_j} \times 100\%.$$
This measures how small the absolute (prediction) errors committed by a model are in
relation to the observed values when the model predicts Y for m “new” sets of values of the
explanatory variable(s) (as in a ‘validation’ sample or ‘hold-out’ sample) which were not used to
build the models. If the MAPE value is about 5% or less, one can infer that the chosen model is
‘very good’.
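A minimal Python sketch of the MAPE calculation on a hold-out sample follows. The observed and predicted values are hypothetical and the function name is made up for this example. Like the formula above, the sketch divides by the observed values, so it assumes they are positive (as sales figures would be).

```python
def mape(observed, predicted):
    """Mean Absolute Percentage Error, in percent, over a hold-out sample."""
    m = len(observed)
    total = sum(abs(y - f) / y for y, f in zip(observed, predicted))
    return 100.0 * total / m

# Hypothetical hold-out sample: observed sales and a model's predictions.
observed_sales = [100.0, 200.0, 400.0]
predicted_sales = [110.0, 190.0, 400.0]

holdout_mape = mape(observed_sales, predicted_sales)
print(holdout_mape)  # percentage errors of 10%, 5% and 0% average to about 5%
```

To choose among shortlisted models, one would compute this value for each model on the same hold-out sample and prefer the model with the smallest MAPE.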
Suppose we have “n” observation-vectors (x11, x21, …, xk1, y1), (x12, x22, …, xk2, y2), …, (x1n, x2n, …,
xkn, yn) on k explanatory variables X1, X2, …, Xk and one response variable Y. For example, Sales
(Y) of a product depend on many variables besides advertising expenditure (X1), such as price
(X2), time of year (X3), state of the economy (X4), and price of competitors' products (X5). Then
one may postulate the model:
$$y_i = a + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + e_i, \quad i = 1, \ldots, n.$$
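Again, we can estimate (a, b1, b2, …, bk) using the least squares method by minimising the SSE:
$$\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - a - b_1 x_{1i} - b_2 x_{2i} - \cdots - b_k x_{ki} \right)^2$$
with respect to a, b1, …, bk. As in the case of simple linear regression, one can compute Adj R2,
DW (if appropriate) and MAPE.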
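The least squares estimates can be obtained by solving the normal equations (X'X) b = X'y. The Python sketch below does this with plain Gaussian elimination for k = 2 predictors. It is one possible approach under stated assumptions, not the procedure Excel uses: the data are hypothetical and generated from a known relationship (Sales = 5 + 2*AE - 3*Price) so that the recovered coefficients can be checked, and all function names are made up for this example.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def fit_ols(rows, y):
    """Return [a, b1, ..., bk] minimising the SSE; rows holds the predictor values."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept a
    p = len(X[0])
    XtX = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)] for i in range(p)]
    Xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

def sse(coefficients, rows, y):
    """Sum of squared errors for the given coefficients."""
    total = 0.0
    for r, yi in zip(rows, y):
        fitted = coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], r))
        total += (yi - fitted) ** 2
    return total

# Hypothetical data: each row is (AE, Price); Sales follows a known relationship.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
sales = [5.0 + 2.0 * ae - 3.0 * price for ae, price in rows]

coefficients = fit_ols(rows, sales)
fit_sse = sse(coefficients, rows, sales)
print(coefficients, fit_sse)
```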
There are two special cases of Multiple Linear Regression, namely, Polynomial regression and
Dummy Variable Regression.
(a) Polynomial Regression
Suppose we have “n” pairs of observations (x1, y1), (x2, y2), …, (xn, yn) on two variables, say X =
advertising expense and Y = sales, and the scatter plot of y versus x indicates a polynomial
(non-linear) relationship between Y and X:
$$Y = a + b_1 X + b_2 X^2 + \cdots + b_k X^k + e.$$
Then, we can estimate (a, b1, b2, …, bk) using the least squares method by minimising, with
respect to a, b1, …, bk, the following:
$$\sum_{i=1}^{n} \left( y_i - a - b_1 x_i - b_2 x_i^2 - \cdots - b_k x_i^k \right)^2.$$
Example: If one observes the behaviour of Y = sales of a product over a long duration by
increasing X = advertising expense (AE) without changing other factors such as price, etc.,
then the plot of sales versus advertising expense may show a linear growth pattern in the short
term, but after that the sales growth will slow down and sales will not increase any more. For
such data, collected over a long period, a quadratic (in AE) regression model, Sales = a + b1*AE +
b2*(AE)^2, will be more appropriate than the simple linear regression model, Sales = a + b1*AE.
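The comparison between the linear and the quadratic model can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the data are hypothetical and generated from Sales = 10 + 8*AE - 0.5*(AE)^2 so that sales growth flattens as AE increases, the function names are made up for this example, and the fit is obtained by solving the normal equations with Gaussian elimination.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [a, b1, ..., b_degree] of a polynomial in x."""
    X = [[xi ** p for p in range(degree + 1)] for xi in x]  # columns 1, x, x^2, ...
    size = degree + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(size)] for i in range(size)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(size)]
    return solve_linear_system(XtX, Xty)

def polynomial_sse(coefficients, x, y):
    """Sum of squared errors of the fitted polynomial."""
    total = 0.0
    for xi, yi in zip(x, y):
        fitted = sum(b * xi ** p for p, b in enumerate(coefficients))
        total += (yi - fitted) ** 2
    return total

# Hypothetical data: sales growth slows down as advertising expense (AE) rises.
ae = [float(v) for v in range(9)]
sales = [10.0 + 8.0 * v - 0.5 * v ** 2 for v in ae]

linear = fit_polynomial(ae, sales, degree=1)
quadratic = fit_polynomial(ae, sales, degree=2)
sse_linear = polynomial_sse(linear, ae, sales)
sse_quadratic = polynomial_sse(quadratic, ae, sales)
print(quadratic, sse_linear, sse_quadratic)
```

On these data the quadratic model has a much smaller SSE than the straight line. With real data one would compare the two models using Adj R2 and MAPE rather than SSE alone, since adding a term never increases the in-sample SSE.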
(b) Dummy Variable Regression
Example: Y = Sales, X = AdvExp, over n = 20 quarters. Then we can try to model the quarterly
effect on sales by defining four indicator variables: D_{c,i} = 1 if the i-th observation belongs
to quarter c and D_{c,i} = 0 otherwise, c = 1, 2, 3, 4. To explain the seasonal (quarterly) effect
on sales one can then fit the model:
(i) Y = βX + γ1D1+ ...+ γ4D4 + e
Or, the equivalent model,
(ii) Y = α + βX + γ1D1+ ...+ γ3D3 + e
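The indicator variables for the two models can be constructed as in the Python sketch below. The function name and the quarter labels are hypothetical. For every observation D1 + D2 + D3 + D4 = 1, which duplicates the column of ones that carries the constant term α. This is why model (ii), which includes α, keeps only D1, D2 and D3 and treats quarter 4 as the baseline.

```python
def quarter_dummies(quarters, include_all_four):
    """Return one row of 0/1 indicators per observation.

    quarters: quarter label (1, 2, 3 or 4) of each observation.
    include_all_four: True gives D1..D4 (model (i), no constant term);
                      False gives D1..D3 (model (ii), quarter 4 is the baseline).
    """
    last = 4 if include_all_four else 3
    return [[1 if q == c else 0 for c in range(1, last + 1)] for q in quarters]

# Hypothetical quarter labels for n = 20 consecutive quarters: 1, 2, 3, 4, 1, 2, ...
quarters = [(i % 4) + 1 for i in range(20)]

dummies_model_i = quarter_dummies(quarters, include_all_four=True)
dummies_model_ii = quarter_dummies(quarters, include_all_four=False)

print(dummies_model_i[:4])   # first year, columns D1..D4
print(dummies_model_ii[:4])  # first year, columns D1..D3; quarter 4 is all zeros
```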
In the regression model, dummy variables are used as explanatory variables to represent
different categories of a qualitative explanatory variable – hence the name dummy variable
regression. Usually, one fits the model (ii) above, i.e., Y = α + βX + γ1D1+ ...+ γ3D3 + e, and
reports the corresponding R2 and Adj R2 values. If one fits model (i) above, one should not
report the R2 and Adj R2 values produced by the Excel Add-in ‘Data Analysis’, since it defines R2
as
$$\frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2} \quad \text{instead of} \quad \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$
The latter is used when the model has the constant term α. Consequently, the ‘goodness’ of fit
of model (i) cannot be compared to that of other models that contain the constant term using
the Excel-reported R2 for model (i).
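To see that the two definitions can give very different numbers, the Python sketch below evaluates both ratios for a model fitted without a constant term. It is a simplified illustration under stated assumptions: instead of model (i) it uses a regression through the origin, Y = b*X, with hypothetical data, and the function names are made up for this example.

```python
def fit_through_origin(x, y):
    """Least-squares slope b for the model y = b*x (no constant term)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def r2_without_mean(y, y_hat):
    """Ratio of sums of squares about zero: sum(y_hat^2) / sum(y^2)."""
    return sum(f * f for f in y_hat) / sum(v * v for v in y)

def r2_about_mean(y, y_hat):
    """Ratio of sums of squares about the mean of y, as used with a constant term."""
    y_bar = sum(y) / len(y)
    explained = sum((f - y_bar) ** 2 for f in y_hat)
    total = sum((v - y_bar) ** 2 for v in y)
    return explained / total

# Hypothetical data in which Y is far from zero when X is small.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 13.0, 15.0, 14.0, 17.0]

b = fit_through_origin(x, y)
y_hat = [b * xi for xi in x]

ratio_without_mean = r2_without_mean(y, y_hat)
ratio_about_mean = r2_about_mean(y, y_hat)
print(ratio_without_mean, ratio_about_mean)
```

On these data the first ratio lies between 0 and 1, while the second exceeds 1, so the two are not on a common scale and should not be compared across models.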
References: