Apps Rating Prediction
1 Problem statement
1.1 Description of the problem
The goal of this project is to predict the ratings of apps available on the Google Play Store. Given a
dataset that includes various attributes such as app category, number of installs, and user reviews,
the task is to develop a machine learning model that can accurately estimate the rating of an app.
This predictive model aims to help developers and stakeholders understand key factors influencing
app ratings and improve app quality and user satisfaction.
1.2 Inputs
App: The name of the app
Category: The category of the app
Rating: The rating of the app in the Play Store
Reviews: The number of reviews of the app
Size: The size of the app
Installs: The number of installs of the app
Type: The type of the app (Free/Paid)
Price: The price of the app (0 if it is Free)
Content Rating: The appropriate target audience of the app
Genres: The genre of the app
Last Updated: The date when the app was last updated
Current Ver: The current version of the app
Android Ver: The minimum Android version required to run the app
1.3 Outputs
A regression machine learning model that can predict the ratings of apps based on various features.
1.4 Processing requirements
Data collection: Import Necessary Libraries
First, I import the necessary libraries. numpy and pandas are used for numerical operations and data manipulation, and sklearn provides the functions for splitting the dataset and implementing the regression models.
EDA: explore the dataset and draw charts
Data preprocessing:
• Convert each feature’s data type into a suitable format
• Clean data: handle missing values, remove duplicate rows, and check for outliers
Build model:
• Split Data into Training and Testing Sets
• Regression (Linear Regression, KNeighbors Regression, Random Forest Regression)
Evaluation: Finally, we make predictions on the test set and evaluate each model’s performance using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R² score).
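For reference, with $y_i$ the true ratings, $\hat{y}_i$ the predicted ratings and $\bar{y}$ the mean rating, these metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$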
2 Data processing
2.1 Data collection
First of all, I will import all the necessary libraries that we will use throughout the project. This
generally includes libraries for data manipulation, data visualization, and others based on the
specific needs of the project:
[631]: # Data
import numpy as np
import pandas as pd
from collections import defaultdict
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud
# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# Hide warnings
import warnings
warnings.filterwarnings('ignore')
Next, I will load the dataset into a pandas DataFrame which will facilitate easy manipulation and
analysis:
[633]: df = pd.read_csv('googleplaystore.csv')
Afterward, I am going to gain a thorough understanding of the dataset before proceeding to the
data cleaning and transformation stages.
Dataset overview First I will perform a preliminary analysis to understand the structure and
types of data columns:
[637]: df.sample(5)
[638]: df.shape
[641]: df.describe()
[641]: Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000
Here we can see that only the Rating column is numeric (float); the remaining numerical columns are stored as strings, so we need to convert them into int or float.
[643]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
We could have converted Reviews to an integer directly, but the data for one app looks different: its entries were shifted by one column when they were entered (which also explains the impossible maximum Rating of 19 seen above). We could fix it by setting Category to NaN and shifting the values back, but for now we simply delete this sample.
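The inspection cell is not reproduced here; one quick way to surface such a row (a sketch that relies on the impossible rating value) would be:

df[df['Rating'] > 5]  # only the shifted row can exceed the 5-star maximum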
[650]: df=df.drop(df.index[10472])
[653]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null object
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), int32(1), object(11)
memory usage: 1.1+ MB
Step 2: Size
• The data has metric prefixes (kilo and Mega) along with another string ('Varies with device'). We replace k and M with their values to convert the entries to numbers.
• The feature Size must be of floating type.
• The suffix, which is a size unit, must be removed. Example: ‘19.2M’ to 19.2
• If Size is given as ‘Varies with device’, we replace it with NaN (these values are imputed later).
• The converted floating values of Size are expressed in megabytes.
[656]: df['Size'].unique()
[656]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
'5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
'1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
'3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
'4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
'4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
'23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
'8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
'5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
'6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
'45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
'10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
'5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
'100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
'99M', '624k', '95M', '8.5k', '41k', '292k', '11k', '80M', '1.7M',
'74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M',
'71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k',
'899k', '378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M',
'89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k',
'713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k',
'953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k',
'26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k',
'293k', '17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k',
'81k', '939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',
'283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
'976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
'350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
'417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
'429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
'506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
'319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
'716k', '585k', '982k', '222k', '219k', '55k', '948k', '323k',
'691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k',
'82k', '208k', '913k', '514k', '551k', '29k', '103k', '898k',
'743k', '116k', '153k', '209k', '353k', '499k', '173k', '597k',
'809k', '122k', '411k', '400k', '801k', '787k', '237k', '50k',
'643k', '986k', '97k', '516k', '837k', '780k', '961k', '269k',
'20k', '498k', '600k', '749k', '642k', '881k', '72k', '656k',
'601k', '221k', '228k', '108k', '940k', '176k', '33k', '663k',
'34k', '942k', '259k', '164k', '458k', '245k', '629k', '28k',
'288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k',
'914k', '903k', '608k', '500k', '54k', '562k', '847k', '957k',
'688k', '811k', '270k', '48k', '329k', '523k', '921k', '874k',
'981k', '784k', '280k', '24k', '518k', '754k', '892k', '154k',
'860k', '364k', '387k', '626k', '161k', '879k', '39k', '970k',
'170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k',
'246k', '73k', '658k', '992k', '253k', '420k', '404k', '470k',
'226k', '240k', '89k', '234k', '257k', '861k', '467k', '157k',
'44k', '676k', '67k', '552k', '885k', '1020k', '582k', '619k'],
dtype=object)
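The cell that strips these suffixes is not reproduced in this export. A minimal sketch of the replacement logic, assuming 'M' is mapped to '000' (i.e. the value in kilobytes), 'k' is simply dropped and 'Varies with device' becomes NaN, which matches the output of cell [658] below:

df['Size'] = (df['Size']
              .replace('Varies with device', np.nan)  # unknown sizes become NaN
              .str.replace('M', '000')                # e.g. '19M' -> '19000' (kB), '8.7M' -> '8.7000'
              .str.replace('k', '')                   # e.g. '201k' -> '201' (kB)
              .astype(float))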
[658]: 0 19000.0
1 14000.0
2 8.7
3 25000.0
4 2.8
…
10836 53000.0
10837 3.6
10838 9.5
10839 NaN
10840 19000.0
Name: Size, Length: 10840, dtype: float64
• There is a problem: some application sizes are given in megabytes and some in kilobytes.
[660]: ###### Convert mega to kilo then convert all to mega
for i in df['Size']:
    if i < 10: # values below 10 are already in megabytes
        df['Size'] = df['Size'].replace(i, i*1000) # scale them up to kilobytes first
df['Size'] = df['Size']/1000 # then convert everything back down to megabytes
df['Size']
[660]: 0 19.0
1 14.0
2 8.7
3 25.0
4 2.8
…
10836 53.0
10837 3.6
10838 9.5
10839 NaN
10840 19.0
Name: Size, Length: 10840, dtype: float64
[661]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(2), int32(1), object(10)
memory usage: 1.1+ MB
Step 3: Installs and Price
• The feature Installs must be of integer type.
• The characters ‘,’ and ‘+’ must be removed. Example: ‘10,000+’ to 10000
• The feature Price must be of floating type.
• The dollar sign ‘$’ must be removed if Price is non-zero. Example: ‘$4.99’ to 4.99
[664]: df['Installs'].unique()
[665]: df['Price'].unique()
[666]: items_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for item in items_to_remove:
    for col in cols_to_clean:
        df[col] = df[col].str.replace(item, '')
df.head()
(Truncated df.head() output: the Installs and Price columns now contain plain numeric strings, e.g. 500000 and 0.)
[667]: df.Installs.unique()
[668]: df['Price'].unique()
[669]: df[df['Price']=='Everyone']
[670]: df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(3), int32(2), object(8)
memory usage: 1.1+ MB
• Updating the Last Updated column’s datatype from string to pandas datetime.
• Extracting new columns Updated_Year and Updated_Month.
[672]: #### Change Last update into a datetime column
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['Last Updated']
[672]: 0 2018-01-07
1 2018-01-15
2 2018-08-01
3 2018-06-08
4 2018-06-20
…
10836 2017-07-25
10837 2018-07-06
10838 2017-01-20
10839 2015-01-19
10840 2018-07-25
Name: Last Updated, Length: 10840, dtype: datetime64[ns]
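The cell that extracts the new columns is not reproduced here; a minimal sketch using the pandas .dt accessor (the column names match the info() output further below) could be:

df['Updated_Month'] = df['Last Updated'].dt.month
df['Updated_Year'] = df['Last Updated'].dt.year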
[674]: df.drop('Last Updated', axis=1, inplace=True)
[675]: df.head()
(Truncated df.head() output: the new Updated_Year column shows 2018 for the first rows.)
[676]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
Data cleaning
[678]: null = pd.DataFrame({'Null Values': df.isna().sum().sort_values(ascending=False),
                     'Percentage Null Values': (df.isna().sum().sort_values(ascending=False) / len(df)) * 100})
null
It’s clear that we have missing values in Rating, Size, Type, Current Ver and Android Ver.
Step 1: Handle missing values. I impute the missing numerical values with the median of each feature; the median is robust to outliers and keeps each feature’s distribution largely intact.
[683]: def impute_median(series):
    return series.fillna(series.median())
df['Rating'] = df['Rating'].transform(impute_median)
[684]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
df['Size'] = df['Size'].transform(impute_median)
[686]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
[687]: df.isnull().sum()
[687]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64
[688]: df['Type'].fillna(str(df['Type'].mode().values[0]),inplace=True)
[689]: df.isnull().sum()
[689]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64
The dataset contains 484 duplicated rows, so I drop them:
[692]: df.drop_duplicates(inplace=True)
Re-checking the duplicate count afterwards returns 0.
Extract Numerical and categorical features
[695]: num_features=[col for col in df.columns if df[col].dtype!='O']
num_features
[695]: ['Rating',
'Reviews',
'Size',
'Installs',
'Price',
'Updated_Month',
'Updated_Year']
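The companion cell that builds the categorical feature list is not shown; it presumably mirrors the comprehension above, keeping the object-dtype columns (the variable name cat_features is an assumption):

cat_features = [col for col in df.columns if df[col].dtype == 'O']
cat_features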
[696]: ['App',
'Category',
'Type',
'Content Rating',
'Genres',
'Current Ver',
'Android Ver']
[699]: sns.boxplot(df["Rating"])
[700]: sns.boxplot(df["Size"])
[701]: sns.boxplot(df["Installs"])
[702]: sns.boxplot(df["Rating"])
[703]: sns.boxplot(df["Price"])
2.3 Exploratory Data Analysis (EDA)
Category Column
[706]: df['Category'].value_counts()
[706]: Category
FAMILY 1943
GAME 1121
TOOLS 842
BUSINESS 427
MEDICAL 408
PRODUCTIVITY 407
PERSONALIZATION 388
LIFESTYLE 373
COMMUNICATION 366
FINANCE 360
SPORTS 351
PHOTOGRAPHY 322
HEALTH_AND_FITNESS 306
SOCIAL 280
NEWS_AND_MAGAZINES 264
TRAVEL_AND_LOCAL 237
BOOKS_AND_REFERENCE 230
SHOPPING 224
DATING 196
VIDEO_PLAYERS 175
MAPS_AND_NAVIGATION 137
EDUCATION 130
FOOD_AND_DRINK 124
ENTERTAINMENT 111
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 85
WEATHER 82
HOUSE_AND_HOME 80
ART_AND_DESIGN 65
EVENTS 64
PARENTING 60
COMICS 60
BEAUTY 53
Name: count, dtype: int64
(Chart of the number of apps per Category with rotated x-axis labels; matplotlib tick-label output omitted.)
[708]: plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
    background_color='black',
    width=1920,
    height=1080
).generate(" ".join(df.Category))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Category vs Rating Analysis
[710]: plt.figure(figsize=(20,15))
sns.boxplot(y='Rating', x='Category', data=df.sort_values('Rating', ascending=False))
plt.xticks(rotation=80)
Type Column
[712]: df['Type'].value_counts()
[712]: Type
Free 9591
Paid 765
Name: count, dtype: int64
Type vs Rating Analysis
[716]: plt.figure(figsize=(15,8))
sns.catplot(y='Rating', x='Type', data=df.sort_values('Rating', ascending=False), kind='boxen')
Content Rating Column
[718]: df['Content Rating'].value_counts()
(Count plot of the Content Rating column with rotated x-axis labels: Everyone, Teen, Mature 17+, Everyone 10+, Adults only 18+, Unrated.)
[721]: plt.figure(figsize=(12,8))
sns.barplot(x="Content Rating", y="Installs", hue="Type", data=df)
Genres Column
[723]: df['Genres'].value_counts()
[723]: Genres
Tools 841
Entertainment 588
Education 527
Business 427
Medical 408
…
Parenting;Brain Games 1
Travel & Local;Action & Adventure 1
Lifestyle;Pretend Play 1
Tools;Education 1
Strategy;Creativity 1
Name: count, Length: 119, dtype: int64
[725]: Current Ver
Varies with device 1302
1.0 802
1.1 260
1.2 177
2.0 149
…
3.18.5 1
1.3.A.2.9 1
9.9.1.1910 1
7.1.34.28 1
2.0.148.0 1
Name: count, Length: 2831, dtype: int64
(Truncated Android Ver value counts: several version ranges such as '7.0 - 7.1.1', '4.1 - 7.1.1' and '5.0 - 6.0' appear only once.)
kde-Plot Analysis
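The helper kde_plot used below is defined in a cell that is not reproduced here; a minimal sketch of such a helper (the exact figure styling is an assumption) could be:

def kde_plot(col):
    # Draw a kernel density estimate for a single numeric column
    plt.figure(figsize=(10, 5))
    sns.kdeplot(data=df, x=col, fill=True)
    plt.title(f'KDE plot of {col}')
    plt.show()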
[730]: kde_plot('Rating')
[731]: kde_plot('Size')
[732]: kde_plot('Updated_Month')
[733]: kde_plot('Price')
[734]: kde_plot('Updated_Year')
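The scatter plots below use another helper, scatters, whose definition is likewise not shown; it presumably wraps a scatter plot of one column against another, for example:

def scatters(x_col, y_col):
    # Scatter plot of one numeric column against another
    plt.figure(figsize=(10, 5))
    sns.scatterplot(data=df, x=x_col, y=y_col, alpha=0.5)
    plt.title(f'{y_col} vs {x_col}')
    plt.show()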
[737]: scatters('Size', 'Rating')
[738]: scatters('Size', 'Installs')
[740]: scatters('Reviews', 'Rating')
Further Analysis
Apps with a 5.0 Rating
[744]: df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')
Despite the full ratings, the number of installations for the majority of the apps is low. Hence,
those apps cannot be considered the best products.
- Reviews
[749]: sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()
The distribution is right-skewed, showing that most applications with a 5.0 rating have only a handful of reviews, which makes these ratings misleading.
- Category
[752]: df_rating_5_cat = df_rating_5['Category'].value_counts().reset_index()
Family, Lifestyle and Medical apps receive the most 5.0 ratings on the Google Play Store, with Family representing about a quarter of the total.
- Type
[756]: df_rating_5_type = df_rating_5['Type'].value_counts().reset_index()
plt.pie(df_rating_5_type['count'], labels=df_rating_5_type['Type'], autopct='%1.1f%%')
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Title
plt.title('Pie chart of App Types with 5.0 Rating')
Almost 90% of the 5.0-rated apps on the Google Play Store are free.
[759]: freq = df['Updated_Year'].value_counts()
freq.plot()
plt.xlabel("Dates")
plt.ylabel("Number of updates")
plt.title("Time series plot of Last Updates")
Dependent variable
• Type of model: Regression
• Dependent variable: numeric (e.g. test scores, blood pressure, stock prices, how fast somebody can run a mile, how many potatoes somebody will purchase at the store)
• Example methods: linear regression, k-nearest neighbors regression, random forest regression, neural network regression, many others
I will use three different regression techniques: linear regression, k-nearest neighbors (KNN), and
random forest.
• Linear regression is one of the simplest and most widely used statistical models. It assumes a linear relationship between the independent and dependent variables, meaning that a change in the dependent variable is proportional to changes in the independent variables.
• Random forest regression is an ensemble method that combines multiple decision trees
to predict the target value. Ensemble methods are a type of machine learning algorithm that
combines multiple models to improve the performance of the overall model. Random forest
regression works by building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging the predictions
of all of the trees.
• k-nearest neighbors regression is a simple and intuitive algorithm that makes predictions by finding the k nearest data points to a given input and averaging their target values.
Data Splitting for Modeling As always, we will divide our data into a training dataset that
we can use to train a predictive model and then a testing dataset that we can use to determine if
our predictive model is useful or not.
We split the dataset into 80% train and 20% test.
[764]: target = 'Rating'
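The splitting cell itself is not reproduced here; a minimal sketch (dropping only the target, and the random_state value, are assumptions) could be:

X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)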
Label Encoding
[768]: le_dict = defaultdict()
for col in features_to_encode:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col]) # Fitting and transforming the Train data
    X_train[col] = X_train[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas
    le_dict[col] = le # Storing the fitted encoder for each column
Standardization
[771]: # Converting and adding "Last Updated Month" to categorical features
categorical_features = features_to_encode + ['Updated_Month']
X_train['Updated_Month'] = X_train['Updated_Month'].astype('category')
X_test['Updated_Month'] = X_test['Updated_Month'].astype('category')
[772]: numeric_features
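The standardization cell is not shown either; a sketch using the StandardScaler imported earlier, fitted on the training split only and applied to both splits (assuming numeric_features excludes the target), could be:

scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])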
multi_index = pd.MultiIndex.from_product(
    [['Linear', 'KNN', 'Random Forest'], ['train', 'test'], ['RMSE', 'MAE', 'R2']],
    names=['model', 'dataset', 'metric'])
df_metrics_reg = pd.DataFrame(index=multi_index, columns=['value'])
[878]: df_metrics_reg
[878]: value
model dataset metric
Linear train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
KNN train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
Random Forest train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
Linear Regression
[881]: lr = LinearRegression()
lr.fit(X_train, y_train)
[881]: LinearRegression()
KNeighbors Regression
[888]: knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
[888]: KNeighborsRegressor()
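A random forest model is fitted in the same way; that cell is not reproduced here, but it presumably mirrors the two above (the variable name rf is an assumption):

rf = RandomForestRegressor()
rf.fit(X_train, y_train)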
3.3 Regression Evaluation
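The cells that compute and store these scores are not included in this export. A minimal sketch of how the table could be filled, assuming the fitted estimators lr, knn and rf and the train/test splits from above (r2_score is an additional import not shown earlier):

from sklearn.metrics import r2_score

def store_metrics(name, model):
    # Evaluate one fitted model on both splits and write its scores into df_metrics_reg
    for split, X_, y_ in [('train', X_train, y_train), ('test', X_test, y_test)]:
        pred = model.predict(X_)
        df_metrics_reg.loc[(name, split, 'RMSE'), 'value'] = mean_squared_error(y_, pred) ** 0.5
        df_metrics_reg.loc[(name, split, 'MAE'), 'value'] = mean_absolute_error(y_, pred)
        df_metrics_reg.loc[(name, split, 'R2'), 'value'] = r2_score(y_, pred)

for name, model in [('Linear', lr), ('KNN', knn), ('Random Forest', rf)]:
    store_metrics(name, model)

df_metrics_reg['value'] = df_metrics_reg['value'].astype(float).round(3)  # rounding the values, as in the cell below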
[902]: # Rounding the values
df_metrics_reg
[902]: value
model dataset metric
Linear train RMSE 0.478
MAE 0.319
R2 0.023
test RMSE 0.483
MAE 0.327
R2 0.037
KNN train RMSE 0.409
MAE 0.280
R2 0.286
test RMSE 0.510
MAE 0.349
R2 -0.072
Random Forest train RMSE 0.468
MAE 0.309
R2 0.063
test RMSE 0.472
MAE 0.314
R2 0.081
Linear Regression:
• The model has high error (RMSE and MAE) and a low R², indicating poor predictive power.
KNN:
• KNN has the lowest training RMSE (0.409) but the highest test RMSE (0.510). This suggests that KNN may be overfitting the training data.
Random Forest:
• Slightly better than Linear Regression, but still with a low R², indicating limited improvement.
• Overall, the regression predictions do not hold up very well.
• We can conclude that, with these features, the dataset is not well suited to a regression formulation of the problem.
4 Model application
4.1 Describe the model application
Real Estate Valuation:
• A regression model can predict house prices by analyzing factors like square footage, neighborhood quality, and number of bedrooms and bathrooms. This helps buyers, sellers, and real estate agents make informed decisions.
Healthcare:
• In healthcare, regression models can be used to predict patient outcomes based on various
health indicators, aiding in early diagnosis and personalized treatment plans.
Finance:
• Financial institutions use regression models to forecast market trends and stock prices, helping
investors make strategic decisions. They also predict credit risk, aiding in loan approval
processes.
Marketing:
• Marketers use regression models to predict sales and customer behavior, optimizing marketing
strategies and budget allocations for maximum return on investment.
Weather Forecasting:
• Regression models analyze historical weather data to predict future conditions, assisting in
agriculture planning, disaster preparedness, and daily life activities.
Manufacturing:
• In manufacturing, regression models predict equipment failure times, enabling proactive maintenance and reducing downtime. They also help estimate production costs based on various input factors, optimizing resource allocation.
Overall, regression models provide valuable predictive insights across various industries, helping to
make data-driven decisions and improve efficiency.
5 Conclusion
5.1 Advantages and Disadvantages of the approach
Advantages of Regression
• Easy to understand and interpret
• Tree-based methods such as random forest are relatively robust to outliers
• Can handle both linear and nonlinear relationships, depending on the method chosen
Disadvantages of Regression
• Assumes linearity
• Sensitive to multicollinearity
• May not be suitable for highly complex relationships
5.2 Applicability of research results in the future
The results of the regression model research for predicting Google Play Store app ratings have
several future applications:
App Development:
• Improvement Prioritization: Developers can focus on improving features that significantly
impact ratings.
• Pre-release Testing: Predict potential ratings for new apps before launch to make necessary
adjustments.
Market Analysis:
• Competitive Benchmarking: Compare predicted ratings with competitor apps to identify
market position.
• Trend Analysis: Track predicted rating trends over time to adjust development strategies.
User Engagement:
• Personalized Recommendations: Use predicted ratings to suggest apps to users based on
their preferences.
• Feedback Loop: Incorporate user feedback into the model to continuously improve predictions.
Investment Decisions:
• Funding Allocation: Investors can predict app success and allocate funds to high-potential
projects.
• Risk Assessment: Evaluate the risk associated with new app ventures by predicting potential ratings.
Quality Assurance:
• Automated Testing: Implement regression models in automated testing frameworks to
predict the impact of changes on app ratings.
• Feature Optimization: Identify and optimize features that contribute to higher ratings.
By leveraging these research results, stakeholders can make informed decisions, improve app quality,
and enhance user satisfaction, ultimately leading to greater success in the competitive app market.