BC2406 S01 G02 Final Report
Semester 1, AY 2016/17
Seminar Number: 01
Group Number: 02
There are several factors that affect total app revenue which are unique to the mobile app
market. Unlike physical products, apps can be listed on the app store for free or paid download.
Additionally, while physical products often differentiate themselves with their strong branding,
there are many apps in the market that share similar functions and lack significant brand names.
Hence, it is crucial for the app developers to find other ways and sales strategies to improve
their app revenues.
Therefore, we have identified the business problem: to investigate whether an App’s Pricing
Model has any significant impact on its Sales Revenue.
This business problem is then further decomposed into 3 more-specific tasks, resulting in 3
models built, each targeting different key predictors and therefore targeting different aspects
of the overall analysis.
Regression and text mining were two data mining techniques used to attempt to answer the
business problem. Using data collected on 15 March 2013, variables were identified for
regression analysis while app descriptions were used for text mining to identify keywords.
Based on the 3 models built from the subtasks, this report has identified multiple strategies to
increase sales. The first is for developers to launch an app as a “Free” version, on the
assumption that a Freemium model is incorporated. In addition, apps in some categories were
shown to have better sales than others. Moreover, using frequently used terms in an app’s
product description may cushion the negative effect on an app’s sales if the app is launched
as a Paid version. Nevertheless, in this study, the Freemium model is found to be more effective
in increasing sales than the Paid model. This is especially true for apps in the LifeStyle,
Travel and Books categories.
Some suggestions for new app developers can be interpreted directly from our model outcomes.
In order to boost an app’s sales, new entrants should first try to adopt the Freemium model
before even considering the Paid model - especially if the app is launched in the LifeStyle,
Travel, or Books categories. Such a strategy is useful and applicable to existing apps too.
Therefore, existing app developers can consider adopting the Freemium model. However,
should they still wish to adopt the Paid model, they need to devise successful and strategic
product descriptions and keywords to mitigate any possible negative effects on an app’s sales
due to its high price.
2. BUSINESS UNDERSTANDING
2.1 Background
Market growth in terms of cumulative app downloads from Apple’s AppStore has been
exponentially increasing from July 2008 to September 2016 (Appendix A). The AppStore
today contains up to 2,685,676 mobile applications (Steel Media Ltd, 2016), with total
cumulative downloads of up to 140 billion (Statista, 2016).
To answer the business problem, we first decomposed the problem into subtasks that will use
data-mining techniques that can support our analysis. The subtasks are as follows:
2.4 Hypothesis
We hypothesize that the revenue generated from an app will be affected by its pricing model
(Paid or Free), the category it belongs to and its product description.
2.6 Key Predictors
Variable: Paid (a binary variable that we will create based on the variable Price, where
Paid=0 for Price=0 and Paid=1 for Price>0)
Rationale: We predict that the pricing model that the app adopts will affect the revenue it
generates. This is because consumers may be price sensitive and may not be willing to spend
money on apps based on just the available product information, description and user reviews.
They might only be willing to spend money on an app after trying it out for themselves.
Therefore we expect that setting a price for the initial download will impact the revenue for
the app.
Variable: Category
Rationale: We expect that the category of the app will affect the revenue it generates.
Consumers may be willing to pay a premium for a Business app that they use for important
matters because they would place more value on the functions and quality of the applications.
However, for a Games app, consumers may not see a need to pay for it as it is only for
entertainment. For this study, we will be examining apps from the Games, Business, Education,
Lifestyle, Entertainment, Travel, Books, Health & Fitness, Food & Drink and Utilities
categories.
Variable: Description
Rationale: We predict that an app’s description in the app store will affect the revenue it
generates. For instance, having certain keywords like “best” or “free” in the description might
give the app a higher chance of appearing in search results when the consumer looks for apps.
Also, the description allows the consumer to get an idea of what the app is like. Therefore, a
meaningful description would increase the number of app downloads and thus increase revenue.
Text Mining
Based on descriptions from the files in the US_24 Category_Detailed folder, commonly used
words are extracted based on the different categories. From these words, the report will then
attempt to hypothesize a relationship between keywords used in Descriptions and Sales in the
respective categories. Developers should use these keywords more often, should there be
statistical proof that they exert a significant impact on an app’s sales revenue.
Linear Regression
Using the App Gross Rankings, transformed into an indicator of sales revenue, as the
dependent variable, our report will use the App Pricing Model, Categories, and Descriptions
as independent variables to form a regression model. The model will then be evaluated for its
explanatory power and predictive power.
“Paid Model”. This part of the Data-Mining aims to find out if the interactions between Pricing
Model and Categories, Pricing Model and Frequently Used Descriptions Terms, have any
impact on the App Sales.
3. DATA PREPARATION
Missing Values
We have identified that there are a small number of records with missing (NA) values, hence
we have chosen to eliminate them from our data.
Outliers
In general, outliers are observations whose values lie more than 3 standard deviations from the
mean. We analysed the relevant variables and found that Price, Screenshot, Size,
StarsAllVersions, RatingsAllVersions, StarsCurrentVersion and RatingsCurrentVersion contain
outliers. Based on our domain knowledge of the app market, we determined that these values
are “extreme”. Hence, the outliers are removed because they may have a disproportionate
influence on our model.
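The 3-standard-deviation rule described above can be sketched as follows. This is a minimal illustration only: the function name, the use of the sample standard deviation and the made-up price list are assumptions, not the procedure actually run for this report.

```python
import statistics


def remove_outliers(values, threshold=3.0):
    """Keep only values within `threshold` sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mean) <= threshold * sd]


# Made-up prices: twenty ordinary apps and one extreme price.
prices = [0.99] * 20 + [999.99]
cleaned = remove_outliers(prices)
print(len(prices), len(cleaned))  # 21 20
```

In practice the same filter would be applied to each variable listed above, dropping any record that is flagged on at least one of them.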
3.2.3 Data Pre-Processing
Normalization
From the summary statistics, we identified a few variables (Size, RatingsAllVersions) with
large scales, whose means are lower than their standard deviations. Hence, we normalized
these variables by applying a log-transformation, to prevent them from dominating and
skewing the results.
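As a small illustration of why the log-transformation helps, the sketch below checks the "mean lower than standard deviation" symptom before and after transforming. The rating counts are made up, and the natural log is an assumption; values of zero would need separate handling (for example log(x + 1)), which the report does not specify.

```python
import math
import statistics

# Made-up rating counts with one very large value, mimicking a heavy right tail.
ratings = [10, 20, 30, 40, 5000]
log_ratings = [math.log(v) for v in ratings]  # natural log; requires v > 0


def mean_below_sd(values):
    """True when the mean is smaller than the sample standard deviation."""
    return statistics.mean(values) < statistics.stdev(values)


print(mean_below_sd(ratings))      # True
print(mean_below_sd(log_ratings))  # False
```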
Variable Transformation
Sales is created by log-transforming the app’s top gross rank. For this analysis, it is assumed
that the lower the rank number (i.e. the closer to Rank 1), the higher the sales revenue volume.
In addition, an app ranked 1 in sales revenue in one category is assumed to have the same sales
revenue volume as an app ranked 1 in sales revenue in another category.
Dummy variables of the app categories were created to facilitate the computation of the
regression analysis.
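The Paid indicator and the category dummies can be sketched as below. The function name and the sample inputs are hypothetical; treating Utilities as the omitted baseline follows the appendices, where Utilities is listed as "Baseline".

```python
# Category names as used in the regression models of this report.
CATEGORIES = ["Games", "Business", "Education", "Lifestyle", "Entertainment",
              "Travel", "Books", "Health", "Food", "Utilities"]
BASELINE = "Utilities"  # assumption: omitted dummy, as the appendices suggest


def encode_app(price, category):
    """Return Paid (0/1) plus one 0/1 dummy per non-baseline category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    row = {"Paid": 1 if price > 0 else 0}
    for name in CATEGORIES:
        if name != BASELINE:
            row[name] = 1 if category == name else 0
    return row


free_game = encode_app(0.0, "Games")
paid_utility = encode_app(2.99, "Utilities")
print(free_game["Paid"], free_game["Games"])             # 0 1
print(paid_utility["Paid"], sum(paid_utility.values()))  # 1 1
```

A baseline app has all category dummies equal to 0, so every category coefficient is read relative to Utilities.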
terms_score is the variable created by counting how many of the 20 most frequent description
terms are used in a particular app description. The 20 most frequent description terms were
determined through text mining. A score of 1 is added to an app’s total score for each
distinct term among the 20 that appears in its description. Prior to the calculation of
terms_score, the app descriptions were converted into a corpus and subsequently a
document-term matrix (DTM) for processing. The app descriptions were first converted to
lower case. Then, parsing was done to remove HTML tags, frequently appearing but less
important device-related terms, less informative terms, stopwords, numbers, white space,
punctuation and App Store-related terms (see Appendix B). In addition, meaningful numbers
were converted to characters for the purpose of this text mining. Finally, stemming was done
to reduce the terms to their root form. From this result, we obtained the 20 most frequently
used terms in app descriptions for the calculation of terms_score (see Appendix C).
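The scoring logic can be sketched in a few lines. This is a simplified illustration: the stopword set, the sample descriptions and the function names are made up, stemming is omitted, and the top 3 terms stand in for the report's top 20.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the report's full removal list is in Appendix B.
STOPWORDS = {"and", "for", "the", "to", "in", "a", "with"}


def tokenize(description):
    """Lower-case, keep alphabetic tokens only, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", description.lower())
            if t not in STOPWORDS]


def most_frequent_terms(descriptions, n):
    """Return the n most frequent terms across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return [term for term, _ in counts.most_common(n)]


def terms_score(description, top_terms):
    """Count how many distinct top terms appear in one description."""
    present = set(tokenize(description))
    return sum(1 for term in top_terms if term in present)


descriptions = [
    "Track time and share time with friends",
    "Share photos with friends",
    "Puzzle fun for the family",
]
top_terms = most_frequent_terms(descriptions, 3)
scores = [terms_score(d, top_terms) for d in descriptions]
print(scores)  # [3, 2, 0]
```

Note that "time" appears twice in the first description but contributes only 1 to its score, matching the rule that each type of term is counted once.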
3.3 Summary Statistics of Variables
The summary statistics of the variables that will be used in the analysis are as follows (after
cleaning and pre-processing):
3.4 Visualize the Associations Among the Key Variables
Using the correlations, we can identify the associations among the key variables. This helps us
identify any variables that are highly correlated with each other; if there are any, we will
have to remove one of them from our data to prevent skewing or biases. In our case, we can
see that the variables do not have very high correlations with each other. Hence, we can proceed
with our analysis.
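A pairwise correlation screen of this kind can be sketched as follows. The 0.8 cut-off, the variable names and the data are assumptions for illustration; the report does not state the threshold it used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair with |r| at or above threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up columns: var_a and var_b move together, var_c does not.
columns = {
    "var_a": [1, 2, 3, 4],
    "var_b": [2, 4, 6, 8.5],
    "var_c": [1, -1, 1, -1],
}
flagged = highly_correlated_pairs(columns)
print([(a, b) for a, b, _ in flagged])  # [('var_a', 'var_b')]
```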
[Figure: model diagram. Recoverable labels: (Paid * Term_Score), (Paid * Categories), Control Variables]
4.1.1 Model 1: Linear Regression with Descriptors
Model 1 aims to find out how the different independent variables affect the dependent
variable (Sales). The variables used to predict Sales are: Paid, Games,
Business, Education, Lifestyle, Entertainment, Travel, Books, Health, Food, Utilities, and
terms_score. 𝛽₀ (intercept) in this case does not have an interpretation because there will not
be sales if none of the variables are present. Lastly, the regression model includes an error
term.
4.1.2 Model 2: Linear Regression with Associated Descriptors (Paid interact Score)
To improve on the R-squared of Model 1, as well as to identify whether there is any relationship
between the payment model and app descriptions, Model 2 includes an interaction variable
between Paid and terms_score to investigate the interaction between an App’s Pricing Model and
its Description. Interaction variables give us new insights into how different factors may
interact with each other to affect sales. Therefore, Model 2 aims to find out how an app’s
textual product description (i.e. terms_score) interacts with Paid, should terms_score be found
to have a positive and significant impact on an app’s sales. The proposed regression
model is as follows:
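One way to write a specification consistent with this description is sketched below. The coefficient symbols, the sum over the non-baseline category dummies and the generic control-variable vector are notational assumptions, not the report's own notation.

```latex
\[
\text{Sales}_i = \beta_0 + \beta_1\,\text{Paid}_i
  + \sum_{c} \beta_c\,\text{Category}_{c,i}
  + \beta_t\,\text{terms\_score}_i
  + \beta_{pt}\,(\text{Paid}_i \times \text{terms\_score}_i)
  + \boldsymbol{\gamma}^{\top}\mathbf{Controls}_i
  + \varepsilon_i
\]
```

Under this form, the effect of terms_score on Sales is \(\beta_t\) for a Free app and \(\beta_t + \beta_{pt}\) for a Paid app, which is why a positive interaction coefficient is read as descriptions mattering more for Paid apps.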
4.1.3 Model 3: Linear Regression with Associated Descriptors (Paid interact Categories)
While Model 2 provides useful insights into the effect of a mobile app’s product description on
its sales, we would also like to investigate whether the different Categories, when interacted
with Paid, have any significant impact on an app’s sales. This will give developers a better
guide on whether using a Paid Model or a Freemium Model in their respective app
categories will significantly boost app sales.
In Model 3, ten additional interaction variables were added to Model 1. These variables are
derived by interacting (i.e. multiplying) Paid with each binary app Category variable; the
Paid-Utilities interaction serves as the baseline (Appendix F).
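Constructing interaction variables by multiplication can be sketched as below. The variable names follow the pattern seen in the appendices (for example interaction_Paid_Education and interaction_Tscore_Paid); the function name and the sample rows are made up for illustration.

```python
def add_paid_interactions(row, categories):
    """Return a copy of row with Paid x terms_score and Paid x category products."""
    out = dict(row)
    # Model 2 term: Paid interacted with terms_score.
    out["interaction_Tscore_Paid"] = row["Paid"] * row["terms_score"]
    # Model 3 terms: Paid interacted with each binary category dummy.
    for name in categories:
        out[f"interaction_Paid_{name}"] = row["Paid"] * row[name]
    return out


paid_travel_app = {"Paid": 1, "terms_score": 7, "Travel": 1, "Books": 0}
free_travel_app = {"Paid": 0, "terms_score": 7, "Travel": 1, "Books": 0}

paid_row = add_paid_interactions(paid_travel_app, ["Travel", "Books"])
free_row = add_paid_interactions(free_travel_app, ["Travel", "Books"])
print(paid_row["interaction_Tscore_Paid"], paid_row["interaction_Paid_Travel"])  # 7 1
print(free_row["interaction_Tscore_Paid"], free_row["interaction_Paid_Travel"])  # 0 0
```

Because Paid is 0 for Free apps, every interaction term vanishes for them, so the interaction coefficients capture only the additional effect for Paid apps.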
4.2 Outcome
The outcomes of the models are as follows (our focus is highlighted in green):
4.3 Interpretation of Outcomes: Estimated Coefficients
Only key variables that are unique to each model, and that are significant, are interpreted
here. (For a full list, please refer to Appendices D, E and F.)
Explanations
Our terms_score was found to be insignificant in Model 1. However, on further analysis in
Model 2, we found that there is a complementary relationship between Paid and
terms_score. This means that while, in general, app descriptions do not affect app revenue,
they do become more important when the app adopts a Paid model, at the 5% significance level.
This is likely because, for Paid apps, the app description is one of the few
sources from which a user can gain information about the app. Hence, the user will take into
account what the app description promises when purchasing the app. In the case of a Free
app, by contrast, users can download the app first to experience it for themselves.
Analysing the interactions between an App’s pricing model and categories, we found
that in general, when an app in a category follows a Paid Model, its revenue will decrease.
This is consistent with our findings from Model 1.
4.5 Model Evaluation (Diagnostic Test for Models)
Regarding our group’s regression analysis, we have derived the following results.
Null Hypothesis & F-Test
The null hypothesis states that all of the predictor variables used in our regression analysis
jointly have zero effect on a mobile app’s Sales. Our analysis shows a p-value of < 2.2e-16.
We therefore reject the null hypothesis at the 0.1% significance level and conclude that the
predictor variables are jointly significant, i.e. at least one of them has a non-zero effect
on Sales.
As seen above, the model’s explanatory power (adjusted R-squared), although the change is not
drastic, improved slightly from Model 1 (0.1653) to Model 2 (0.1668) and Model 3
(0.1673). Model 2 and Model 3 explain about 16.68% and 16.73% of the variation respectively
in the dependent variable (i.e., Sales). The F-statistic indicates that the null hypothesis
should be rejected and that the predictors do have effects on Sales.
Comparison of Errors
The maximum error of the 3 models is roughly 4.4 on the log scale, suggesting that the model
under-predicted an app’s ranking by a factor of about e^4.4471 ≈ 85 for at least one
observation. On the other hand, 50% of the errors fall between the 1Q and 3Q values. Therefore,
the middle half of the predictions were between a factor of e^(-0.5977) ≈ 0.55 (over-prediction)
and e^0.3913 ≈ 1.48 (under-prediction) of an app’s true ranking. Overall, the error has improved
in Model 3 compared to Models 1 and 2.
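The back-transformation of the residual quantiles quoted above can be reproduced with a few lines of arithmetic. The dictionary layout is an illustrative choice, and the text does not state which of the three models these particular values come from.

```python
import math

# Residual summary values quoted in the text (log scale).
residual_summary = {"1Q": -0.5977, "3Q": 0.3913, "Max": 4.4471}

# Exponentiating a log-scale residual gives a multiplicative factor.
factors = {name: math.exp(value) for name, value in residual_summary.items()}

for name, factor in factors.items():
    print(f"{name}: e^{residual_summary[name]} is about {factor:.2f}")
```

Because the dependent variable is on a log scale, each exponentiated residual is best read as a ratio between the actual and predicted values rather than as an absolute difference.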
2 | Adopt the Paid Model with a greater number of frequently used description terms | Yes
3 | Adopt the Freemium Model with Apps of a certain category | All, with LifeStyle, Travel & Books being most significant.
Model | Interpretation
1 | Firstly, apps that adopt the Freemium Model are more likely to top the grossing chart. Secondly, apps that have more frequently used description terms in their descriptions are more likely to top the grossing chart. Lastly, we have also identified popular categories that are in the top grossing chart, namely: Games, Business, Travel, Books, and Food.
2 | Our findings suggest that developers who persist in adopting the Paid Model could mitigate its negative effects by introducing more frequently used description terms. As the number of frequently used description terms increases, it tends to cancel out the negative effects of the Paid Model.
3 | This suggests that no matter which category a particular App belongs to, it should always adopt the Freemium Model to top the grossing chart, especially in LifeStyle, Travel and Books.
Overall | We therefore conclude that our hypothesis “Revenue generated from an app will be affected by its Pricing Model, Category and Product Description” is true.
With reference to Appendix D, in Model 1, the following variables are highly significant:
Log_RatingsAllVersions, Paid, Games, Travel, and Food. These independent variables have
p-values smaller than 0.001 and therefore have an impact on the dependent variable (Sales).
In contrast, an app’s Screenshot, StarsAllVersions, Education, Lifestyle, Entertainment, Books,
Health, and terms_score are not significantly associated with its Sales. Therefore, these
variables do not predict the dependent variable (Sales).
With reference to Appendix E, the predictor variables in Model 2 can only explain approximately
17% of the variation in an app’s Sales. This result shows that adding the interaction variable
between Product Descriptions and Paid (i.e. interaction_Tscore_Paid) did not materially
improve the model’s explanatory power for an app’s sales. Nevertheless, the Paid version of an
app has a complementary relationship with its description. A Paid version of an app whose
product description contains the most frequent terms identified in our analysis may
significantly increase the app’s sales revenue by 2.89%, at the 5% significance level. In other
words, to mitigate the negative effects on an app’s sales exerted by the higher app price in
the Paid category, developers need to write strategically useful and impactful product
descriptions so as to prevent any drop in sales, or even boost sales.
With reference to Appendix F, despite having 10 additional variables, the predictor variables
in Model 3 still leave about 83% of the variation in Sales unexplained. All of these newly
introduced variables have negative coefficients. This implies that an app’s product Category
has a supplementary relationship with its Paid version. We therefore infer that launching an
app as a Paid version, regardless of its product category, will not improve its sales. In fact,
these new variables exert a negative impact on sales, meaning that launching an app (regardless
of its product category) as a Paid version will most likely result in a decline in sales as
compared to launching it as a Free version. Moreover, only 3 of these newly introduced variables
are statistically significant in their influence on an app’s sales. We can conclude that the
Travel, Lifestyle and Books product categories exert a negative impact on an app’s sales more
significantly than the other product categories when the app is launched as a Paid version. For
example, a Paid version of an app belonging to the Lifestyle category may significantly
decrease the app’s revenue by 35.8%. Therefore, it is recommended that apps belonging to such
product categories are launched in the AppStore as Free versions.
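The percentage effects quoted in this section read a coefficient b as roughly 100·b percent. As a hedged side note, assuming the dependent variable is on a natural-log scale, the exact percentage change implied by a dummy coefficient is 100·(e^b − 1), and the two readings drift apart as |b| grows. The sketch below compares them for coefficients quoted in the appendices; the function name is illustrative.

```python
import math


def exact_pct_change(coef):
    """Exact percentage change implied by a coefficient in a log-linear model."""
    return (math.exp(coef) - 1) * 100


# Coefficient values quoted in the appendix tables.
coefficients = {
    "Lifestyle (Model 1)": 0.141442,
    "Entertainment (Model 2)": -0.150740,
    "Entertainment (Model 3)": 0.008207,
}

for name, coef in coefficients.items():
    approx = coef * 100
    exact = exact_pct_change(coef)
    print(f"{name}: approx {approx:.2f}%, exact {exact:.2f}%")
```

For small coefficients such as 0.008207 the two readings are nearly identical, so the approximation matters mainly for the larger effects.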
Firstly, new entrants may use our findings as guidelines to help improve initial sales
performance. Assuming these new entrants can develop apps for any category, based on our
regression analysis, it is suggested that they develop a Free version of an app in the
Games or Travel product category that is able to capture as many review
ratings as possible, so as to significantly boost the app’s sales.
For example, to increase the number of review ratings (as supported by Model 1),
app developers may consider using an app review plugin such as Appirater, which will
prompt users to review the app after they have used it a certain number of times or after a set
time period. If the user taps on the “Rate” button, they are taken right to the AppStore where
they can pen their reviews (Kissmetrics, 2016). Alternatively, app developers may incentivize
their users to review the app, such as by rewarding users of a game app with a certain amount
of EXP/rewards/points in exchange for their reviews.
Furthermore, app developers who design apps in the Travel, Lifestyle and Books product
categories are encouraged to launch them as a Free version, instead of launching them as a
Paid version. This is because doing the latter may significantly decrease app sales. Should app
developers find sales success in launching their apps as a Free version, and wish to further
diversify their portfolios by launching a Paid version as the next strategic step, they need
to devise successful and strategic product descriptions and keyword presentation to mitigate
any possible negative effects on an app’s sales due to its high price, given the complementary
relationship between an app’s product description and its Paid version. This suggestion is also
corroborated by the fact that the interaction between an app’s Paid version and its product
description exerts a statistically significant impact on its sales.
A creative way for app developers to experiment with both the Free and Paid versions is to
decrease the price of a mobile app temporarily to Free for a limited period of time that
coincides with a certain season, say from mid-to-end December to take advantage of the
Christmas season (Rajput, 2016). App developers may use websites specializing in tracking
app price reductions, such as 148Apps & AppShopper, to analyse and determine the optimal
time period to keep the app as a Free version before restoring it to the original Paid
version. Further research has shown that mobile apps that adopt this method continue to
attract high download frequency even after the apps are converted back to their Paid version.
In this way, app developers may also indirectly mitigate the negative effects that a high app
price has on sales, and sustain high sales revenue over a longer term.
5.3 Limitations of Our Research & Analysis
Our analysis above has largely focused on determining which app-specific attribute(s)
exert a significant impact on a mobile app’s sales performance. However, we have not considered
the possibility of a producer diversifying its product portfolio and selling its products
across different categories. Prior research found that such diversification is an
important determinant of a mobile app’s probability of surviving in the
AppStore’s Top Charts, which in turn contributes significantly to the app’s sales
performance (Lee, 2015).
Furthermore, another limitation of our research is that we focused only on the revenue
generated by an app through paid downloads and in-app purchases and ignored other possible
sources of revenue. For example, there are many free apps in the market that generate revenue
through advertisements. However, our study did not take into account advertising revenue
generated by an app as that information is not available in the calculation of gross rankings of
the apps in the Apple app store. Hence we have not investigated these other sales strategies that
developers can use to create a successful app.
Next, the analysis and findings of our research are based on a mobile app’s ranking information.
There are, however, several alternative methods to estimate an app’s sales revenue performance.
Additionally, the top-performing apps which appear at the top of the charts may help users make
their purchase decisions faster and more easily, because these apps are promoted and shown to
users first when they search for the apps they are looking for. Unfortunately, the
limited availability of datasets prevented us from analysing users’
“potential preferential attachment mechanisms” (Lee, 2015). Therefore, a longer
monitoring period is necessary to evaluate whether results change or vary over a longer
time frame.
Lastly, this dataset is limited to Apple’s AppStore in the U.S. Future studies should include
analysis of mobile apps’ sales performance on other mobile app distribution platforms, such as
Google Play Store. This is because a mobile app’s sales performance may vary across
platforms, due to factors such as: different types and numbers of categories
available, the different customer profiles each mobile app distribution platform caters to
(for example, Google Play caters more to less affluent customers in Less Developed Countries
such as Indonesia and Brazil, whereas the Apple AppStore caters more to more affluent
customers in More Developed Countries such as Singapore & the U.S.), different App Store
Optimization (ASO) requirements, and more (Lee, 2015).
6. APPENDICES
Appendix B: Text Mining Parsing
Encoding and markup fragments removed: u’, u”, u201d, u2011, u2013, u2014, u2022, u2122, u2026, u2028, u2729, u20ac, amp, xae, xa0, xa3, don, won, ing, ‘ll, www, com
Terms removed: iphone, touch, 3rd, 2nd, 4th, app, store, game, play, mobile, free, new, world, and, for, the, to, in, when, then, he, she, than, can, get, one, also, just, need
Also removed: Stopwords, Punctuations
Appendix D: Estimation Output of Model 1
Lifestyle | 0.141442 | Apps in Lifestyle category improved app revenue by 14.14% compared to apps in Utilities category | Positive, Not Significant
Utilities | Baseline
Fitted Model
Appendix E: Estimation Output of Model 2
Entertainment | -0.150740 | Apps in Entertainment category performed worse than apps in Utilities category by decreasing app revenue by 15.07% | Negative, Not Significant
Utilities | Baseline
Fitted Model
Appendix F: Estimation Output of Model 3
Entertainment | 0.008207 | Apps in Entertainment category improved app revenue by 0.82% compared to apps in Utilities category | Positive, Not Significant
Utilities | Baseline
interaction_Paid_Education | -0.045050 | Apps in Paid version have supplementary relationships with Education category. A Paid version of an app belonging to the Education category does not exert any significant impact on an app’s revenue. | Negative, Not Significant
interaction_Paid_Health | -0.185283 | Apps in Paid version have supplementary relationships with Health category. A Paid version of an app belonging to the Health category does not exert any significant impact on an app’s revenue. | Negative, Not Significant
interaction_Paid_Utilities | Baseline
Fitted Model
Appendix G: Residual Plots
Model 1
Model 2
Model 3
6. REFERENCES
Kissmetrics. (2016). 5 Clever Ways to Increase Mobile App Reviews. Kissmetrics Blog: A
Blog About Analytics, Marketing And Testing. Retrieved November 6, 2016, from
https://blog.kissmetrics.com/increase-mobile-app-reviews/
Perez, S. (2014a). The App Store, Six Years Later. Retrieved November 6, 2016,
from https://techcrunch.com/2014/07/10/the-app-store-six-years-later/
Rajput, M. (2016, June 3). Ways to Determine The Best Pricing Model For Your App.
Entrepreneur India. Retrieved November 7, 2016, from
https://www.entrepreneur.com/article/276897
Statista. (2016). Most popular Apple App Store categories in September 2016, by share of
available apps. Retrieved October 27, 2016, from
https://www.statista.com/statistics/270291/popular-categories-in-the-app-store/
Steel Media Ltd. (2016). Count of Active Applications in the App Store. Retrieved October 31,
2016, from http://www.pocketgamer.biz/metrics/app-store/app-count/
Walz, A. (2015, May 27). Deconstructing the App Store Rankings Formula with a Little
Mad Science. Moz, Inc. Retrieved October 30, 2016, from
https://moz.com/blog/app-store-rankings-formula-deconstructed-in-5-mad-science-experiments
Statwing. Interpreting residual plots to improve your regression. Retrieved November
8, 2016, from http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/