
Apps Rating Prediction

analysis on google data - classwork

Uploaded by

Hà My

apps-rating-prediction

June 28, 2024

1 Problem statement
1.1 Description of the problem
The goal of this project is to predict the ratings of apps available on the Google Play Store. Given a
dataset that includes various attributes such as app category, number of installs, and user reviews,
the task is to develop a machine learning model that can accurately estimate the rating of an app.
This predictive model aims to help developers and stakeholders understand key factors influencing
app ratings and improve app quality and user satisfaction.

1.2 Inputs
App: The name of the app
Category: The category of the app
Rating: The rating of the app in the Play Store
Reviews: The number of reviews of the app
Size: The size of the app
Install: The number of installs of the app
Type: The type of the app (Free/Paid)
Price: The price of the app (0 if it is Free)
Content Rating: The appropriate target audience of the app
Genres: The genre of the app
Last Updated: The date when the app was last updated
Current Ver: The current version of the app
Android Ver: The minimum Android version required to run the app

1.3 Outputs
A regression machine learning model that can predict the ratings of apps based on various features.

1.4 Processing requirements
Data collection: Import Necessary Libraries
First, I import the necessary libraries. numpy is used for numerical operations, and sklearn provides
necessary functions for splitting the dataset and implementing linear regression.
EDA and draw charts
Data preprocessing:
• Convert each feature’s data type into a suitable format
• Clean data: handle missing values, delete duplicated data and check outliers
Build model:
• Split Data into Training and Testing Sets
• Regression (Linear Regression, KNeighbors Regression, Random Forest Regression)
Evaluation: Finally, we make predictions on the test set and evaluate each model’s performance
using metrics such as Mean Squared Error (MSE) and the Coefficient of Determination (R² score).
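The split/fit/evaluate workflow outlined above can be sketched end to end on synthetic data (the feature values below are illustrative stand-ins, not the Play Store dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in features (e.g. reviews, installs, size) and ratings
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 4.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.05, 200)

# Split, fit, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R2 :", r2_score(y_test, pred))
```

The same fit/predict/score pattern carries over to KNeighborsRegressor and RandomForestRegressor.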

2 Data processing
2.1 Data collection
First of all, I will import all the necessary libraries that we will use throughout the project. This
generally includes libraries for data manipulation, data visualization, and others based on the
specific needs of the project:
[631]: # Data
import numpy as np
import pandas as pd
from collections import defaultdict

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Hide warnings
import warnings
warnings.filterwarnings('ignore')

Next, I will load the dataset into a pandas DataFrame which will facilitate easy manipulation and
analysis:
[633]: df = pd.read_csv('googleplaystore.csv')

Afterward, I am going to gain a thorough understanding of the dataset before proceeding to the
data cleaning and transformation stages.

Dataset overview. First, I will perform a preliminary analysis to understand the structure and
types of the data columns:
[637]: df.sample(5)

[637]: App Category Rating Reviews \


5354 I am Rich Plus FAMILY 4.0 856
10716 Free Slideshow Maker & Video Editor PHOTOGRAPHY 4.2 162564
7769 Clash Soundboard For CR & COC FAMILY 4.3 13
10131 EZ Display PRODUCTIVITY 2.9 280
8867 DT Simple Interval Timer SPORTS 4.1 10

Size Installs Type Price Content Rating Genres \


5354 8.7M 10,000+ Paid $399.99 Everyone Entertainment
10716 11M 10,000,000+ Free 0 Everyone Photography
7769 26M 500+ Free 0 Everyone Entertainment
10131 10.0M 50,000+ Free 0 Everyone Productivity
8867 1.5M 1,000+ Free 0 Everyone Sports

Last Updated Current Ver Android Ver


5354 May 19, 2018 3.0 4.4 and up
10716 August 5, 2018 5.2 4.0 and up
7769 January 3, 2018 1.1 4.0.3 and up
10131 November 14, 2014 1.0.0.457 4.0 and up
8867 February 17, 2014 1.01 2.1 and up

[638]: df.shape

[638]: (10841, 13)

As we can see, we have data on 10,841 applications described by 13 attributes.


[640]: df.columns

[640]: Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',


'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'],
dtype='object')

[641]: df.describe()

[641]: Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

Here we can see that Rating is the only column stored as a float; the other numerical features are
stored as strings, so we need to convert them to int or float. Note also the impossible maximum
rating of 19.0, which points to a corrupted row.
[643]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object

dtypes: float64(1), object(12)
memory usage: 1.1+ MB

2.2 Data preprocessing


As most of the features are set to data type object and have suffixes, each feature’s
data type must be converted into a suitable format for analysis.

Step 1: Reviews. Checking whether all values in the Reviews column are numeric:


[648]: df[~df.Reviews.str.isnumeric()]

[648]: App Category Rating Reviews \


10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M

Size Installs Type Price Content Rating Genres \


10472 1,000+ Free 0 Everyone NaN February 11, 2018

Last Updated Current Ver Android Ver


10472 1.0.19 4.0 and up NaN

We could have converted it to an integer as we do for the other columns, but the data for this app
looks different: the Category value is missing, so every subsequent field has shifted one column to
the left. We could fix it by setting Category to NaN and shifting all the values one column to the
right, but for now we simply delete the sample.
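The shift-based fix, had we chosen it, could look like this sketch on a toy frame that reproduces the misalignment (column names abbreviated; the real row index is 10472):

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the misalignment: the Category value is missing,
# so every later field landed one column too far to the left (as in row 10472).
cols = ['App', 'Category', 'Rating', 'Reviews', 'Size']
df = pd.DataFrame([['Good App', 'FAMILY', 4.0, 100, '10M'],
                   ['Broken App', 1.9, 19.0, '3.0M', np.nan]], columns=cols)

bad = 1                                           # index of the misaligned row
shifted = df.loc[bad, cols[1:-1]].tolist()        # values that belong one column to the right
df.loc[bad, cols[2:]] = shifted                   # move them into place
df.loc[bad, 'Category'] = np.nan                  # the truly missing field

print(df.loc[bad].tolist())
```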
[650]: df=df.drop(df.index[10472])

The feature Reviews must be of integer type.


[652]: df["Reviews"] = df["Reviews"].astype(int)

[653]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null object
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object

11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), int32(1), object(11)
memory usage: 1.1+ MB

Step 2: Size
• The data has metric prefixes (kilo and mega) attached to the numbers. We replace ‘k’ and ‘M’
to convert the values to numeric.
• The feature Size must be of floating type.
• The suffix, which is a size unit, must be removed. Example: ‘19.2M’ to 19.2.
• If the size is given as ‘Varies with device’, we replace it with NaN.
• The converted floating values of Size are expressed in megabyte units.
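As an alternative to the string replacements below, a small parser makes the unit handling explicit (the helper name `to_mb` is my own, not from the notebook; it keeps the notebook's k/1000 scaling):

```python
import numpy as np

def to_mb(size):
    """Convert a Play Store size string to megabytes (hypothetical helper)."""
    if size == 'Varies with device':
        return np.nan
    if size.endswith('M'):
        return float(size[:-1])
    if size.endswith('k'):
        return float(size[:-1]) / 1000.0   # kB to MB, matching the notebook's scaling
    return np.nan

print(to_mb('19M'), to_mb('8.7M'), to_mb('201k'), to_mb('Varies with device'))
```

It could then be applied in one step with `df['Size'] = df['Size'].map(to_mb)`.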
[656]: df['Size'].unique()

[656]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
'5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
'1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
'3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
'4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
'4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
'23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
'8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
'5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
'6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
'45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
'10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
'5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
'100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
'99M', '624k', '95M', '8.5k', '41k', '292k', '11k', '80M', '1.7M',
'74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M',
'71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k',
'899k', '378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M',
'89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k',
'713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k',
'953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k',
'26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k',
'293k', '17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k',
'81k', '939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',

'283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
'976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
'350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
'417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
'429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
'506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
'319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
'716k', '585k', '982k', '222k', '219k', '55k', '948k', '323k',
'691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k',
'82k', '208k', '913k', '514k', '551k', '29k', '103k', '898k',
'743k', '116k', '153k', '209k', '353k', '499k', '173k', '597k',
'809k', '122k', '411k', '400k', '801k', '787k', '237k', '50k',
'643k', '986k', '97k', '516k', '837k', '780k', '961k', '269k',
'20k', '498k', '600k', '749k', '642k', '881k', '72k', '656k',
'601k', '221k', '228k', '108k', '940k', '176k', '33k', '663k',
'34k', '942k', '259k', '164k', '458k', '245k', '629k', '28k',
'288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k',
'914k', '903k', '608k', '500k', '54k', '562k', '847k', '957k',
'688k', '811k', '270k', '48k', '329k', '523k', '921k', '874k',
'981k', '784k', '280k', '24k', '518k', '754k', '892k', '154k',
'860k', '364k', '387k', '626k', '161k', '879k', '39k', '970k',
'170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k',
'246k', '73k', '658k', '992k', '253k', '420k', '404k', '470k',
'226k', '240k', '89k', '234k', '257k', '861k', '467k', '157k',
'44k', '676k', '67k', '552k', '885k', '1020k', '582k', '619k'],
dtype=object)

• Remove the unit characters from Size and convert it to float.


[658]: df['Size']=df['Size'].str.replace('M','000')
df['Size']=df['Size'].str.replace('k','')
#apps['size']=apps['size'].str.replace('.','')
df['Size']=df['Size'].replace("Varies with device",np.nan)
df['Size']=df['Size'].astype('float')
df['Size']

[658]: 0 19000.0
1 14000.0
2 8.7
3 25000.0
4 2.8

10836 53000.0
10837 3.6
10838 9.5
10839 NaN

10840 19000.0
Name: Size, Length: 10840, dtype: float64

• There is a problem: replacing ‘M’ with ‘000’ leaves sizes written with a decimal point
(e.g. ‘8.7M’ → 8.7) a thousand times smaller than whole-number megabyte sizes (e.g. ‘19M’ → 19000).
[660]: # Rescale the decimal-megabyte values, then bring everything back to megabytes.
# Note: this heuristic also scales up small kilobyte sizes such as '8.5k', so it is approximate.
for i in df['Size']:
    if i < 10:
        df['Size'] = df['Size'].replace(i, i*1000)
df['Size'] = df['Size']/1000
df['Size']

[660]: 0 19.0
1 14.0
2 8.7
3 25.0
4 2.8

10836 53.0
10837 3.6
10838 9.5
10839 NaN
10840 19.0
Name: Size, Length: 10840, dtype: float64

[661]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(2), int32(1), object(10)
memory usage: 1.1+ MB

Step 3: Installs and Price
• The feature Installs must be of integer type.
• The characters ‘,’ and ‘+’ must be removed. Example: ‘10,000+’ to 10000
• The feature Price must be of floating type.
• The suffix ‘dollar’ must be removed if Price is non-zero. Example: ‘$4.99’ to 4.99
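These two cleanups can also be done in one vectorized pass with a regular expression (a sketch on a toy frame, not the notebook's code):

```python
import pandas as pd

# Toy columns with the same formatting as the dataset
df = pd.DataFrame({'Installs': ['10,000+', '500+', '0'],
                   'Price': ['0', '$4.99', '$399.99']})

# Strip '+', ',' and '$' in one regex pass, then cast to numeric types
for col in ['Installs', 'Price']:
    df[col] = df[col].str.replace(r'[+,$]', '', regex=True)
df['Installs'] = df['Installs'].astype(int)
df['Price'] = df['Price'].astype(float)

print(df['Installs'].tolist(), df['Price'].tolist())
```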
[664]: df['Installs'].unique()

[664]: array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',


'50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
'1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
'10+', '1+', '5+', '0+', '0'], dtype=object)

[665]: df['Price'].unique()

[665]: array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',


'$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
'$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
'$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
'$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
'$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
'$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
'$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
'$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
'$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
'$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
'$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
'$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
'$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)

[666]: items_to_remove=['+',',','$']
cols_to_clean=['Installs','Price']
for item in items_to_remove:
for col in cols_to_clean:
df[col]=df[col].str.replace(item,'')
df.head()

[666]: App Category Rating \


0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
1 Coloring book moana ART_AND_DESIGN 3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide … ART_AND_DESIGN 4.7
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3

Reviews Size Installs Type Price Content Rating \


0 159 19.0 10000 Free 0 Everyone

1 967 14.0 500000 Free 0 Everyone
2 87510 8.7 5000000 Free 0 Everyone
3 215644 25.0 50000000 Free 0 Teen
4 967 2.8 100000 Free 0 Everyone

Genres Last Updated Current Ver \


0 Art & Design January 7, 2018 1.0.0
1 Art & Design;Pretend Play January 15, 2018 2.0.0
2 Art & Design August 1, 2018 1.2.4
3 Art & Design June 8, 2018 Varies with device
4 Art & Design;Creativity June 20, 2018 1.1

Android Ver
0 4.0.3 and up
1 4.0.3 and up
2 4.0.3 and up
3 4.2 and up
4 4.4 and up

[667]: df.Installs.unique()

[667]: array(['10000', '500000', '5000000', '50000000', '100000', '50000',


'1000000', '10000000', '5000', '100000000', '1000000000', '1000',
'500000000', '50', '100', '500', '10', '1', '5', '0'], dtype=object)

[668]: df['Price'].unique()

[668]: array(['0', '4.99', '3.99', '6.99', '1.49', '2.99', '7.99', '5.99',


'3.49', '1.99', '9.99', '7.49', '0.99', '9.00', '5.49', '10.00',
'24.99', '11.99', '79.99', '16.99', '14.99', '1.00', '29.99',
'12.99', '2.49', '10.99', '1.50', '19.99', '15.99', '33.99',
'74.99', '39.99', '3.95', '4.49', '1.70', '8.99', '2.00', '3.88',
'25.99', '399.99', '17.99', '400.00', '3.02', '1.76', '4.84',
'4.77', '1.61', '2.50', '1.59', '6.49', '1.29', '5.00', '13.99',
'299.99', '379.99', '37.99', '18.99', '389.99', '19.90', '8.49',
'1.75', '14.00', '4.85', '46.99', '109.99', '154.99', '3.08',
'2.59', '4.80', '1.96', '19.40', '3.90', '4.59', '15.46', '3.04',
'4.29', '2.60', '3.28', '4.60', '28.99', '2.95', '2.90', '1.97',
'200.00', '89.99', '2.56', '30.99', '3.61', '394.99', '1.26',
'1.20', '1.04'], dtype=object)

[669]: df[df['Price']=='Everyone']

[669]: Empty DataFrame


Columns: [App, Category, Rating, Reviews, Size, Installs, Type, Price, Content
Rating, Genres, Last Updated, Current Ver, Android Ver]
Index: []

[670]: df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(3), int32(2), object(8)
memory usage: 1.1+ MB
• Updating the Last Updated column’s datatype from string to pandas datetime.
• Extracting new columns Updated_Month and Updated_Year.
[672]: #### Change Last update into a datetime column
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['Last Updated']

[672]: 0 2018-01-07
1 2018-01-15
2 2018-08-01
3 2018-06-08
4 2018-06-20

10836 2017-07-25
10837 2018-07-06
10838 2017-01-20
10839 2015-01-19
10840 2018-07-25
Name: Last Updated, Length: 10840, dtype: datetime64[ns]

[673]: df['Updated_Month']=df['Last Updated'].dt.month


df['Updated_Year']=df['Last Updated'].dt.year

[674]: df.drop('Last Updated', axis=1, inplace=True)

[675]: df.head()

[675]: App Category Rating \


0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
1 Coloring book moana ART_AND_DESIGN 3.9
2 U Launcher Lite – FREE Live Cool Themes, Hide … ART_AND_DESIGN 4.7
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3

Reviews Size Installs Type Price Content Rating \


0 159 19.0 10000 Free 0.0 Everyone
1 967 14.0 500000 Free 0.0 Everyone
2 87510 8.7 5000000 Free 0.0 Everyone
3 215644 25.0 50000000 Free 0.0 Teen
4 967 2.8 100000 Free 0.0 Everyone

Genres Current Ver Android Ver Updated_Month \


0 Art & Design 1.0.0 4.0.3 and up 1
1 Art & Design;Pretend Play 2.0.0 4.0.3 and up 1
2 Art & Design 1.2.4 4.0.3 and up 8
3 Art & Design Varies with device 4.2 and up 6
4 Art & Design;Creativity 1.1 4.4 and up 6

Updated_Year
0 2018
1 2018
2 2018
3 2018
4 2018

[676]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64

8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB

Data cleaning
[678]: null = pd.DataFrame({
    'Null Values': df.isna().sum().sort_values(ascending=False),
    'Percentage Null Values': df.isna().sum().sort_values(ascending=False) / df.shape[0] * 100
})
null

[678]: Null Values Percentage Null Values


Size 1695 15.636531
Rating 1474 13.597786
Current Ver 8 0.073801
Android Ver 2 0.018450
Type 1 0.009225
App 0 0.000000
Category 0 0.000000
Reviews 0 0.000000
Installs 0 0.000000
Price 0 0.000000
Content Rating 0 0.000000
Genres 0 0.000000
Updated_Month 0 0.000000
Updated_Year 0 0.000000

[679]: null_counts = df.isna().sum().sort_values(ascending=False) / len(df)

plt.figure(figsize=(16, 8))
plt.xticks(np.arange(len(null_counts)) + 0.5, null_counts.index, rotation='vertical')
plt.ylabel('fraction of rows with missing data')
plt.bar(np.arange(len(null_counts)), null_counts)

[679]: <BarContainer object of 14 artists>

It’s clear that we have missing values in Size, Rating, Current Ver, Android Ver
and Type.

Step 1: Handle missing values. I impute the missing numerical values with each column’s
median, which is robust to outliers and preserves the feature’s central tendency.
[683]: def impute_median(series):
return series.fillna(series.median())

df['Rating'] = df['Rating'].transform(impute_median)

[684]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object

9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB

[685]: # Reuse the median imputation helper defined above for the Size column
df['Size'] = df['Size'].transform(impute_median)
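The same median imputation can be expressed with scikit-learn's SimpleImputer (a sketch on a toy column, shown only as an equivalent alternative to the helper above):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with gaps (illustrative values, not the dataset)
df = pd.DataFrame({'Rating': [4.1, np.nan, 4.5, 3.9, np.nan]})

imputer = SimpleImputer(strategy='median')          # median of the observed values
df['Rating'] = imputer.fit_transform(df[['Rating']]).ravel()
print(df['Rating'].tolist())
```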

[686]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB

[687]: df.isnull().sum()

[687]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 1
Price 0

Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64

[688]: df['Type'].fillna(str(df['Type'].mode().values[0]),inplace=True)

[689]: df.isnull().sum()

[689]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64

Step 2: Delete duplicated data


[691]: duplicate = df.duplicated()
print(duplicate.sum())

484

[692]: df.drop_duplicates(inplace=True)

[693]: duplicate = df.duplicated()


print(duplicate.sum())

0
Extract Numerical and categorical features
[695]: num_features=[col for col in df.columns if df[col].dtype!='O']
num_features

[695]: ['Rating',
'Reviews',
'Size',
'Installs',
'Price',
'Updated_Month',
'Updated_Year']

[696]: cat_features=[col for col in df.columns if df[col].dtype=='O']


cat_features

[696]: ['App',
'Category',
'Type',
'Content Rating',
'Genres',
'Current Ver',
'Android Ver']

Step 3: Check outliers


[698]: sns.boxplot(df["Rating"])

[698]: <Axes: >


[700]: sns.boxplot(df["Size"])

[700]: <Axes: >

[701]: sns.boxplot(df["Installs"])

[701]: <Axes: >


[703]: sns.boxplot(df["Price"])

[703]: <Axes: >
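As a numeric complement to the boxplots above, the 1.5×IQR whisker rule they draw can be applied explicitly; this sketch counts outliers in a stand-in sample (the helper `iqr_outlier_count` is my own):

```python
import pandas as pd

def iqr_outlier_count(s):
    """Count values outside the 1.5*IQR whiskers used by the boxplots."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lo) | (s > hi)).sum())

# Stand-in ratings: mostly around 4, with a couple of extreme low values
ratings = pd.Series([4.1, 4.3, 4.0, 4.5, 4.2, 4.4, 1.0, 4.3, 4.1, 2.0])
print(iqr_outlier_count(ratings))
```

Applied to the real columns (e.g. `iqr_outlier_count(df['Rating'])`), this gives the counts behind the points plotted beyond the whiskers.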

2.3 Exploratory Data Analysis (EDA)
Category Column
[706]: df['Category'].value_counts()

[706]: Category
FAMILY 1943
GAME 1121
TOOLS 842
BUSINESS 427
MEDICAL 408
PRODUCTIVITY 407
PERSONALIZATION 388
LIFESTYLE 373
COMMUNICATION 366
FINANCE 360
SPORTS 351
PHOTOGRAPHY 322
HEALTH_AND_FITNESS 306
SOCIAL 280
NEWS_AND_MAGAZINES 264
TRAVEL_AND_LOCAL 237
BOOKS_AND_REFERENCE 230
SHOPPING 224

DATING 196
VIDEO_PLAYERS 175
MAPS_AND_NAVIGATION 137
EDUCATION 130
FOOD_AND_DRINK 124
ENTERTAINMENT 111
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 85
WEATHER 82
HOUSE_AND_HOME 80
ART_AND_DESIGN 65
EVENTS 64
PARENTING 60
COMICS 60
BEAUTY 53
Name: count, dtype: int64

[707]: plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(x='Category', data=df)
plt.xticks(rotation=70)

[707]: (xtick positions 0 to 32 and the 33 Category labels; bar chart figure not reproduced in this export)

[708]: plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
background_color='black',
width=1920,
height=1080
).generate(" ".join(df.Category))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Category vs Rating Analysis
[710]: plt.figure(figsize=(20, 15))
sns.boxplot(y='Rating', x='Category', data=df.sort_values('Rating', ascending=False))
plt.xticks(rotation=80)

[710]: (xtick positions 0 to 32 and the 33 Category labels; boxplot figure not reproduced in this export)

Type Column
[712]: df['Type'].value_counts()

[712]: Type
Free 9591
Paid 765
Name: count, dtype: int64

[713]: plt.rcParams['figure.figsize'] = (8,5)


sns.countplot(x='Type',data=df)
plt.xticks(rotation=70)

[713]: (array([0, 1]), [Text(0, 0, 'Free'), Text(1, 0, 'Paid')])

[714]: df["Type"].value_counts().plot.pie(autopct = "%1.1f%%")

[714]: <Axes: ylabel='count'>

Type vs Rating Analysis
[716]: plt.figure(figsize=(15, 8))
sns.catplot(y='Rating', x='Type', data=df.sort_values('Rating', ascending=False), kind='boxen')

[716]: <seaborn.axisgrid.FacetGrid at 0x17e9fe14910>

<Figure size 1500x800 with 0 Axes>

Content Rating Column
[718]: df['Content Rating'].value_counts()

[718]: Content Rating


Everyone 8381
Teen 1146
Mature 17+ 447
Everyone 10+ 377
Adults only 18+ 3
Unrated 2
Name: count, dtype: int64

Content Rating vs Rating Analysis


[720]: plt.figure(figsize=(12, 8))
sns.boxplot(y='Rating', x='Content Rating', data=df.sort_values('Rating', ascending=False))
plt.xticks(rotation=90)

[720]: (array([0, 1, 2, 3, 4, 5]),
[Text(0, 0, 'Everyone'),
Text(1, 0, 'Teen'),
Text(2, 0, 'Mature 17+'),
Text(3, 0, 'Everyone 10+'),
Text(4, 0, 'Adults only 18+'),
Text(5, 0, 'Unrated')])

[721]: plt.figure(figsize=(12,8))
sns.barplot(x="Content Rating", y="Installs", hue="Type", data=df)

[721]: <Axes: xlabel='Content Rating', ylabel='Installs'>

Genres Column
[723]: df['Genres'].value_counts()

[723]: Genres
Tools 841
Entertainment 588
Education 527
Business 427
Medical 408

Parenting;Brain Games 1
Travel & Local;Action & Adventure 1
Lifestyle;Pretend Play 1
Tools;Education 1
Strategy;Creativity 1
Name: count, Length: 119, dtype: int64

Current ver Column


[725]: df['Current Ver'].value_counts()

[725]: Current Ver
Varies with device 1302
1.0 802
1.1 260
1.2 177
2.0 149

3.18.5 1
1.3.A.2.9 1
9.9.1.1910 1
7.1.34.28 1
2.0.148.0 1
Name: count, Length: 2831, dtype: int64

Android Ver Column


[727]: df['Android Ver'].value_counts()

[727]: Android Ver


4.1 and up 2379
4.0.3 and up 1451
4.0 and up 1337
Varies with device 1221
4.4 and up 893
2.3 and up 643
5.0 and up 546
4.2 and up 387
2.3.3 and up 279
2.2 and up 239
3.0 and up 237
4.3 and up 235
2.1 and up 133
1.6 and up 116
6.0 and up 58
7.0 and up 42
3.2 and up 36
2.0 and up 32
5.1 and up 22
1.5 and up 20
4.4W and up 11
3.1 and up 10
2.0.1 and up 7
8.0 and up 6
7.1 and up 3
4.0.3 - 7.1.1 2
5.0 - 8.0 2
1.0 and up 2

7.0 - 7.1.1 1
4.1 - 7.1.1 1
5.0 - 6.0 1
2.2 - 7.1.1 1
5.0 - 7.1.1 1
Name: count, dtype: int64

[728]: # Function to create a scatter plot
def scatters(col1, col2):
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    sns.scatterplot(data=df, x=col1, y=col2, hue="Type")
    plt.title(f'Scatter Plot of {col1} vs {col2}')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

# Function to create a KDE plot
def kde_plot(feature):
    # Create a FacetGrid for KDE plots using Seaborn
    grid = sns.FacetGrid(df, hue="Type", aspect=2)
    # Map KDE plots for the specified feature
    grid.map(sns.kdeplot, feature)
    # Add a legend to distinguish between categories
    grid.add_legend()

KDE Plot Analysis
[730]: kde_plot('Rating')

[731]: kde_plot('Size')

[732]: kde_plot('Updated_Month')

[733]: kde_plot('Price')

[734]: kde_plot('Updated_Year')

Scatter plot Analysis


[736]: scatters('Price', 'Updated_Year')

[737]: scatters('Size', 'Rating')

[738]: scatters('Size', 'Installs')

[739]: scatters('Updated_Month', 'Installs')

[740]: scatters('Reviews', 'Rating')

[741]: scatters('Rating', 'Price')

Further Analysis: Apps with a 5.0 Rating
[744]: df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')

There are 271 apps having rating of 5.0


- Installs
[746]: sns.histplot(data=df_rating_5, x='Installs', kde=True, bins=50)
plt.title('Distribution of Installs with 5.0 Rating Apps')
plt.show()

Despite the full ratings, the number of installations for the majority of the apps is low. Hence,
those apps cannot be considered the best products.
- Reviews
[749]: sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()

The distribution is right-skewed: most of the 5.0-rated apps have very few reviews, so their
perfect ratings can be misleading.
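One common remedy for this bias, sketched below with purely illustrative numbers (the `weighted_rating` function and the prior weight `m=100` are assumptions, not part of this analysis), is a Bayesian-style weighted rating that shrinks the rating of apps with few reviews toward the global mean:

```python
# Hypothetical sketch: a Bayesian-style weighted rating.
# C is the global mean rating; m acts as a prior review count.
def weighted_rating(rating, n_reviews, C, m=100):
    """Shrink ratings with few reviews toward the global mean C."""
    return (n_reviews * rating + m * C) / (n_reviews + m)

C = 4.2  # assumed global mean rating, for illustration only

# A 5.0-rated app with only 3 reviews is pulled close to the global mean,
# while a 4.5-rated app with 10,000 reviews keeps essentially its own rating.
print(round(weighted_rating(5.0, 3, C), 2))       # -> 4.22
print(round(weighted_rating(4.5, 10_000, C), 2))  # -> 4.5
```

Ranking apps by this adjusted score instead of the raw rating would prevent low-volume 5.0 apps from dominating the top of the list.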
- Category
[752]: df_rating_5_cat = df_rating_5['Category'].value_counts().reset_index()

[753]: # Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")
plt.pie(df_rating_5_cat.iloc[:, 1], labels=df_rating_5_cat.iloc[:, 0],
        autopct='%1.1f%%')
plt.title('Pie chart of App Categories with 5.0 Rating')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Show the pie chart
plt.show()

Family, Lifestyle and Medical apps receive the most 5.0 ratings on the Google Play Store, with
Family representing about a quarter of the whole.
- Type
[756]: df_rating_5_type = df_rating_5['Type'].value_counts().reset_index()

[757]: # Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")

# Data for the pie chart
sizes = df_rating_5_type.iloc[:, 1]
labels = df_rating_5_type.iloc[:, 0]

# Pull a slice out by exploding it
explode = (0, 0.1)  # Adjust the second value to control the pull-out distance

# Create the pie chart with default colors
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140,
        pctdistance=0.85, explode=explode)

# Draw a circle in the center to make it look like a donut chart
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Equal aspect ratio ensures that pie is drawn as a circle.
plt.axis('equal')

# Title
plt.title('Pie chart of App Types with 5.0 Rating')

# Show the pie chart
plt.show()

Almost 90% of the 5.0-rated apps on the Google Play Store are free.
[759]: freq = df['Updated_Year'].value_counts()
freq.plot()
plt.xlabel("Dates")
plt.ylabel("Number of updates")
plt.title("Time series plot of Last Updates")

[759]: Text(0.5, 1.0, 'Time series plot of Last Updates')

3 Build, train and evaluate models


3.1 Building model

Dependent variable

Type of model: Regression
Dependent variable: numeric
Dependent variable examples: test scores, blood pressure, stock prices, how fast somebody can run a mile, how many potatoes somebody will purchase at the store
Methods examples: linear regression, k-nearest neighbors regression, random forest regression, neural network regression, many others

I will use three different regression techniques: linear regression, k-nearest neighbors (KNN), and
random forest.

• Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent variables.
This means that the change in the dependent variable is proportional to the change in the
independent variables.
• Random forest regression is an ensemble method that combines multiple decision trees
to predict the target value. Ensemble methods are a type of machine learning algorithm that
combines multiple models to improve the performance of the overall model. Random forest
regression works by building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging the predictions
of all of the trees.
• k-nearest neighbors regression is a simple and intuitive algorithm that makes predictions
by finding the k nearest data points to a given input and averaging their target values.
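The three techniques above can be sketched with a minimal, self-contained example (on synthetic data rather than the Play Store dataset; the data-generating function is an assumption for illustration), showing that all three share the same scikit-learn fit/score interface:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: a linear term plus a mild nonlinearity
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

models = {
    'Linear': LinearRegression(),
    'KNN': KNeighborsRegressor(n_neighbors=5),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)  # identical API for all three estimators
    print(f'{name}: train R2 = {model.score(X, y):.3f}')
```

Because the interface is uniform, swapping one regressor for another later in the notebook requires changing only the constructor call.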

Feature Pruning We decide to prune the following features:


• App: App names are of no value for the model
• Genres: The information it stores is the same as the Category feature
• Current Ver: The current version of an app doesn't hold significant value
• Android Ver: The Android version of an app doesn't hold significant value
[762]: pruned_features = ['App', 'Genres', 'Current Ver', 'Android Ver']

Data Splitting for Modeling As always, we will divide our data into a training dataset that
we can use to train a predictive model and then a testing dataset that we can use to determine if
our predictive model is useful or not.
We split the dataset into 80% train and 20% test.
[764]: target = 'Rating'

[765]: X = df.copy().drop(pruned_features + [target], axis=1)
y = df.copy()[target]

[766]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,
                                                           random_state=42)

Label Encoding
[768]: le_dict = {}  # a plain dict suffices to store one LabelEncoder per feature

[769]: features_to_encode = X_train.select_dtypes(include=['category', 'object']).columns

for col in features_to_encode:
    le = LabelEncoder()

    # Fitting and transforming the train data
    X_train[col] = le.fit_transform(X_train[col])
    # Converting the label encoded feature from numerical back to categorical dtype in pandas
    X_train[col] = X_train[col].astype('category')

    # Only transforming the test data (note: transform raises on labels unseen during fit)
    X_test[col] = le.transform(X_test[col])
    X_test[col] = X_test[col].astype('category')

    le_dict[col] = le  # Saving the label encoder for individual features

Standardization
[771]: # Converting and adding "Updated_Month" to categorical features
# Note: features_to_encode is a pandas Index, so convert it to a list before concatenating
categorical_features = list(features_to_encode) + ['Updated_Month']
X_train['Updated_Month'] = X_train['Updated_Month'].astype('category')
X_test['Updated_Month'] = X_test['Updated_Month'].astype('category')

# Listing numeric features to scale
numeric_features = X_train.select_dtypes(exclude=['category', 'object']).columns

[772]: numeric_features

[772]: Index(['Reviews', 'Size', 'Installs', 'Price', 'Updated_Year'], dtype='object')

[773]: scaler = StandardScaler()

# Fitting and transforming the training data
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])

# Only transforming the test data
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

3.2 Running model


Regression Creating dataframe for metrics
[876]: models = ['Linear', 'KNN', 'Random Forest']
datasets = ['train', 'test']
metrics = ['RMSE', 'MAE', 'R2']

multi_index = pd.MultiIndex.from_product([models, datasets, metrics],
                                         names=['model', 'dataset', 'metric'])

df_metrics_reg = pd.DataFrame(index=multi_index, columns=['value'])

[878]: df_metrics_reg

[878]: value
model dataset metric
Linear train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
KNN train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
Random Forest train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN

Linear Regression
[881]: lr = LinearRegression()
lr.fit(X_train, y_train)

[881]: LinearRegression()

[883]: df_metrics_reg.loc['Linear', 'train', 'R2'] = lr.score(X_train, y_train)
df_metrics_reg.loc['Linear', 'test', 'R2'] = lr.score(X_test, y_test)

[885]: y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

df_metrics_reg.loc['Linear', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Linear', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

# squared=False makes mean_squared_error return the RMSE
# (newer scikit-learn versions provide root_mean_squared_error instead)
df_metrics_reg.loc['Linear', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['Linear', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)

KNeighbors Regression
[888]: knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

[888]: KNeighborsRegressor()

[890]: df_metrics_reg.loc['KNN', 'train', 'R2'] = knn.score(X_train, y_train)
df_metrics_reg.loc['KNN', 'test', 'R2'] = knn.score(X_test, y_test)

[892]: y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

df_metrics_reg.loc['KNN', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['KNN', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)
df_metrics_reg.loc['KNN', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['KNN', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)

Random Forest Regression


[895]: rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

[895]: RandomForestRegressor(max_depth=2, random_state=0)

[897]: df_metrics_reg.loc['Random Forest', 'train', 'R2'] = rf.score(X_train, y_train)
df_metrics_reg.loc['Random Forest', 'test', 'R2'] = rf.score(X_test, y_test)

[899]: y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

df_metrics_reg.loc['Random Forest', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Random Forest', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)
df_metrics_reg.loc['Random Forest', 'train', 'RMSE'] = mean_squared_error(y_train, y_train_pred, squared=False)
df_metrics_reg.loc['Random Forest', 'test', 'RMSE'] = mean_squared_error(y_test, y_test_pred, squared=False)

3.3 Regression Evaluation
[902]: # Rounding the values
df_metrics_reg['value'] = df_metrics_reg['value'].apply(lambda v: round(v, ndigits=3))
df_metrics_reg

[902]: value
model dataset metric
Linear train RMSE 0.478
MAE 0.319
R2 0.023
test RMSE 0.483
MAE 0.327
R2 0.037
KNN train RMSE 0.409
MAE 0.280
R2 0.286
test RMSE 0.510
MAE 0.349
R2 -0.072
Random Forest train RMSE 0.468
MAE 0.309
R2 0.063
test RMSE 0.472
MAE 0.314
R2 0.081

[904]: data = df_metrics_reg.reset_index()

g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='value', hue='metric')

# Adding annotations to bars

# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')

    ax.margins(y=0.2)

plt.show()

Linear Regression:
• The model has high error (RMSE and MAE) and near-zero R², indicating poor predictive power.
KNN:
• KNN has the lowest training RMSE (0.409) but the highest test RMSE (0.510), which suggests
that KNN is overfitting the training data.
Random Forest:
• Slightly better than Linear Regression, but still with low R², indicating limited improvement.
• Overall, the regression predictions don't hold up very well.
• This suggests that the available features carry too little signal for this regression problem.
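A natural follow-up to the KNN overfitting observation, sketched here on synthetic stand-in data (the candidate grid and the data-generating function are assumptions, not results from this notebook), is to tune n_neighbors with cross-validation, since larger neighborhoods smooth predictions and reduce variance:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the Play Store features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

# 5-fold cross-validated grid search over the neighborhood size
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [3, 5, 10, 20, 40]},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
grid.fit(X, y)
print(grid.best_params_)
```

Selecting the neighborhood size on held-out folds, rather than accepting the default of 5, would give a fairer picture of KNN's true generalization error on the app data.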

4 Model application
4.1 Describe the model application
Real Estate Valuation:
• A regression model can predict house prices by analyzing factors like square footage, neigh-
borhood quality, and number of bedrooms and bathrooms. This helps buyers, sellers, and
real estate agents make informed decisions.
Healthcare:
• In healthcare, regression models can be used to predict patient outcomes based on various
health indicators, aiding in early diagnosis and personalized treatment plans.
Finance:
• Financial institutions use regression models to forecast market trends and stock prices, helping
investors make strategic decisions. They also predict credit risk, aiding in loan approval
processes.

Marketing:
• Marketers use regression models to predict sales and customer behavior, optimizing marketing
strategies and budget allocations for maximum return on investment.
Weather Forecasting:
• Regression models analyze historical weather data to predict future conditions, assisting in
agriculture planning, disaster preparedness, and daily life activities.
Manufacturing:
• In manufacturing, regression models predict equipment failure times, enabling proactive main-
tenance and reducing downtime. They also help estimate production costs based on various
input factors, optimizing resource allocation.
Overall, regression models provide valuable predictive insights across various industries, helping to
make data-driven decisions and improve efficiency.

4.2 Interpretation of the results


• In conclusion, the dataset from Google Play Store apps has been explored and analyzed
using various data visualization techniques with the help of Matplotlib, Seaborn and Plotly
libraries.
• The preliminary analysis, visualization methods and EDA provided insights into the data and
helped in understanding the underlying patterns and relationships among the variables.
• The analysis of the Google Play Store dataset has shown that the rating is only weakly
correlated with other app attributes such as size, installs, reviews, and price. The number of
installs shows a slight positive correlation with the rating, suggesting that higher-rated apps
tend to have somewhat more installs.
• We also observed that free apps tend to have slightly higher ratings than paid apps, and that
app size does not seem to have a significant impact on rating.
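The weak-correlation claim can be checked directly with a correlation matrix; the snippet below uses a toy stand-in dataframe for illustration (in the notebook the call would operate on df itself):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned Play Store dataframe (columns only, random values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Rating': rng.uniform(1, 5, size=500),
    'Reviews': rng.integers(0, 10_000, size=500),
    'Size': rng.uniform(1, 100, size=500),
    'Installs': rng.integers(0, 1_000_000, size=500),
    'Price': rng.choice([0.0, 0.0, 0.0, 0.99, 4.99], size=500),
})

# Pairwise Pearson correlations of Rating with the other numeric attributes
print(toy.corr()['Rating'].round(2))
```

Values near zero in the Rating row of the real dataset's matrix would confirm the weak-correlation conclusion quantitatively.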

5 Conclusion
5.1 Advantages and Disadvantages of the approach
Advantages of Regression
• Easy to understand and interpret (especially linear regression)
• Fast to train and evaluate
• Tree-based and neighbor-based regressors can also capture nonlinear relationships
Disadvantages of Regression
• Linear regression assumes linearity and is sensitive to outliers and multicollinearity
• May not be suitable for highly complex relationships
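Ordinary least squares is in fact quite sensitive to outliers, which a small sketch can demonstrate (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0              # exact line: slope 2, intercept 1

clean = LinearRegression().fit(X, y)

y_out = y.copy()
y_out[-1] += 50.0                      # corrupt a single observation
dirty = LinearRegression().fit(X, y_out)

print(round(clean.coef_[0], 2))        # -> 2.0
print(round(dirty.coef_[0], 2))        # -> 4.73, one outlier more than doubles the slope
```

This is one reason robust alternatives (such as Huber regression) or outlier screening are often considered before fitting a linear model.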

5.2 Applicability of research results in the future
The results of the regression model research for predicting Google Play Store app ratings have
several future applications:
App Development:
• Improvement Prioritization: Developers can focus on improving features that significantly
impact ratings.
• Pre-release Testing: Predict potential ratings for new apps before launch to make necessary
adjustments.
Market Analysis:
• Competitive Benchmarking: Compare predicted ratings with competitor apps to identify
market position.
• Trend Analysis: Track predicted rating trends over time to adjust development strategies.
User Engagement:
• Personalized Recommendations: Use predicted ratings to suggest apps to users based on
their preferences.
• Feedback Loop: Incorporate user feedback into the model to continuously improve predic-
tions.
Investment Decisions:
• Funding Allocation: Investors can predict app success and allocate funds to high-potential
projects.
• Risk Assessment: Evaluate the risk associated with new app ventures by predicting poten-
tial ratings.
Quality Assurance:
• Automated Testing: Implement regression models in automated testing frameworks to
predict the impact of changes on app ratings.
• Feature Optimization: Identify and optimize features that contribute to higher ratings.
By leveraging these research results, stakeholders can make informed decisions, improve app quality,
and enhance user satisfaction, ultimately leading to greater success in the competitive app market.
