Apps Rating Prediction
1 Problem statement
1.1 Description of the problem
The goal of this project is to predict the ratings of apps available on the Google Play Store. Given a
dataset that includes various attributes such as app category, number of installs, and user reviews,
the task is to develop a machine learning model that can accurately estimate the rating of an app.
This predictive model aims to help developers and stakeholders understand key factors influencing
app ratings and improve app quality and user satisfaction.
1.2 Inputs
App: The name of the app
Category: The category of the app
Rating: The rating of the app in the Play Store
Reviews: The number of reviews of the app
Size: The size of the app
Installs: The number of installs of the app
Type: The type of the app (Free/Paid)
Price: The price of the app (0 if it is Free)
Content Rating: The appropriate target audience of the app
Genres: The genre of the app
Last Updated: The date when the app was last updated
Current Ver: The current version of the app
Android Ver: The minimum Android version required to run the app
1.3 Outputs
A regression machine learning model that can predict the ratings of apps based on various features.
1.4 Processing requirements
Data collection: Import Necessary Libraries
First, I import the necessary libraries. numpy and pandas are used for numerical operations and data manipulation, and sklearn provides the functions for splitting the dataset and implementing the regression models.
EDA: explore the dataset and draw charts
Data preprocessing:
• Convert each feature’s data type into a suitable format
• Clean data: handle missing values, remove duplicate rows, and check for outliers
Build model:
• Split Data into Training and Testing Sets
• Regression (Linear Regression, KNeighbors Regression, Random Forest Regression)
Evaluation: Finally, we make predictions on the test set and evaluate each model’s performance using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R² score).
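For reference, with $y_i$ the true ratings, $\hat{y}_i$ the predicted ratings and $\bar{y}$ the mean rating, these metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$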
2 Data processing
2.1 Data collection
First of all, I will import all the necessary libraries that we will use throughout the project. This
generally includes libraries for data manipulation, data visualization, and others based on the
specific needs of the project:
[631]: # Data
import numpy as np
import pandas as pd
from collections import defaultdict
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud
# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# Hide warnings
import warnings
warnings.filterwarnings('ignore')
Next, I will load the dataset into a pandas DataFrame which will facilitate easy manipulation and
analysis:
[633]: df = pd.read_csv('googleplaystore.csv')
Afterward, I am going to gain a thorough understanding of the dataset before proceeding to the
data cleaning and transformation stages.
Dataset overview First I will perform a preliminary analysis to understand the structure and
types of data columns:
[637]: df.sample(5)
[638]: df.shape
[641]: df.describe()
[641]: Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000
Here we can see that only the Rating column is numeric (float); the remaining numerical columns are stored as strings, so we need to convert them into int or float.
[643]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
We could have converted Reviews to an integer directly, but the data for one app looks different: its entries were shifted by one column when they were entered (which also explains the impossible maximum Rating of 19 seen above). We could fix it by setting Category to NaN and shifting the values back, but for now we simply delete this sample.
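The inspection cell is not reproduced here; one quick way to surface such a row (a sketch that relies on the impossible rating value) would be:

df[df['Rating'] > 5]  # only the shifted row can exceed the 5-star maximum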
[650]: df=df.drop(df.index[10472])
[653]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null object
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), int32(1), object(11)
memory usage: 1.1+ MB
Step 2: Size
• The data has metric prefixes (kilo and Mega) along with another string ('Varies with device'). We replace k and M with their values to convert the entries to numbers.
• The feature Size must be of floating type.
• The suffix, which is a size unit, must be removed. Example: ‘19.2M’ to 19.2
• If Size is given as ‘Varies with device’, we replace it with NaN (these values are imputed later).
• The converted floating values of Size are expressed in megabytes.
[656]: df['Size'].unique()
[656]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
'5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
'1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
'3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
'4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
'4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
'23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
'8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
'5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
'6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
'45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
'10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
'5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
'100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
'99M', '624k', '95M', '8.5k', '41k', '292k', '11k', '80M', '1.7M',
'74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M',
'71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k',
'899k', '378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M',
'89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k',
'713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k',
'953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k',
'26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k',
'293k', '17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k',
'81k', '939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',
'283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
'976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
'350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
'417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
'429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
'506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
'319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
'716k', '585k', '982k', '222k', '219k', '55k', '948k', '323k',
'691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k',
'82k', '208k', '913k', '514k', '551k', '29k', '103k', '898k',
'743k', '116k', '153k', '209k', '353k', '499k', '173k', '597k',
'809k', '122k', '411k', '400k', '801k', '787k', '237k', '50k',
'643k', '986k', '97k', '516k', '837k', '780k', '961k', '269k',
'20k', '498k', '600k', '749k', '642k', '881k', '72k', '656k',
'601k', '221k', '228k', '108k', '940k', '176k', '33k', '663k',
'34k', '942k', '259k', '164k', '458k', '245k', '629k', '28k',
'288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k',
'914k', '903k', '608k', '500k', '54k', '562k', '847k', '957k',
'688k', '811k', '270k', '48k', '329k', '523k', '921k', '874k',
'981k', '784k', '280k', '24k', '518k', '754k', '892k', '154k',
'860k', '364k', '387k', '626k', '161k', '879k', '39k', '970k',
'170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k',
'246k', '73k', '658k', '992k', '253k', '420k', '404k', '470k',
'226k', '240k', '89k', '234k', '257k', '861k', '467k', '157k',
'44k', '676k', '67k', '552k', '885k', '1020k', '582k', '619k'],
dtype=object)
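The cell that strips these suffixes is not reproduced in this export. A minimal sketch of the replacement logic, assuming 'M' is mapped to '000' (i.e. the value in kilobytes), 'k' is simply dropped and 'Varies with device' becomes NaN, which matches the output of cell [658] below:

df['Size'] = (df['Size']
              .replace('Varies with device', np.nan)  # unknown sizes become NaN
              .str.replace('M', '000')                # e.g. '19M' -> '19000' (kB), '8.7M' -> '8.7000'
              .str.replace('k', '')                   # e.g. '201k' -> '201' (kB)
              .astype(float))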
[658]: 0 19000.0
1 14000.0
2 8.7
3 25000.0
4 2.8
…
10836 53000.0
10837 3.6
10838 9.5
10839 NaN
10840 19000.0
Name: Size, Length: 10840, dtype: float64
• There is a problem: some application sizes are given in megabytes and some in kilobytes.
[660]: ###### Convert mega to kilo then convert all to mega
for i in df['Size']:
    if i < 10: # values below 10 are already in megabytes
        df['Size'] = df['Size'].replace(i, i*1000) # scale them up to kilobytes first
df['Size'] = df['Size']/1000 # then convert everything back down to megabytes
df['Size']
[660]: 0 19.0
1 14.0
2 8.7
3 25.0
4 2.8
…
10836 53.0
10837 3.6
10838 9.5
10839 NaN
10840 19.0
Name: Size, Length: 10840, dtype: float64
[661]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null object
6 Type 10839 non-null object
7 Price 10840 non-null object
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(2), int32(1), object(10)
memory usage: 1.1+ MB
Step 3: Installs and Price
• The feature Installs must be of integer type.
• The characters ‘,’ and ‘+’ must be removed. Example: ‘10,000+’ to 10000
• The feature Price must be of floating type.
• The dollar sign ‘$’ must be removed if Price is non-zero. Example: ‘$4.99’ to 4.99
[664]: df['Installs'].unique()
[665]: df['Price'].unique()
[666]: items_to_remove = ['+', ',', '$']
cols_to_clean = ['Installs', 'Price']
for item in items_to_remove:
    for col in cols_to_clean:
        df[col] = df[col].str.replace(item, '')
df.head()
(Truncated df.head() output: the Installs and Price columns now contain plain numeric strings, e.g. 500000 and 0.)
[667]: df.Installs.unique()
[668]: df['Price'].unique()
[669]: df[df['Price']=='Everyone']
[670]: df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Last Updated 10840 non-null object
11 Current Ver 10832 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(3), int32(2), object(8)
memory usage: 1.1+ MB
• Updating the Last Updated column’s datatype from string to pandas datetime.
• Extracting new columns Updated_Year and Updated_Month.
[672]: #### Change Last update into a datetime column
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['Last Updated']
[672]: 0 2018-01-07
1 2018-01-15
2 2018-08-01
3 2018-06-08
4 2018-06-20
…
10836 2017-07-25
10837 2018-07-06
10838 2017-01-20
10839 2015-01-19
10840 2018-07-25
Name: Last Updated, Length: 10840, dtype: datetime64[ns]
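The cell that extracts the new columns is not reproduced here; a minimal sketch using the pandas .dt accessor (the column names match the info() output further below) could be:

df['Updated_Month'] = df['Last Updated'].dt.month
df['Updated_Year'] = df['Last Updated'].dt.year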
[674]: df.drop('Last Updated', axis=1, inplace=True)
[675]: df.head()
(Truncated df.head() output: the new Updated_Year column shows 2018 for the first rows.)
[676]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 9366 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
Data cleaning
[678]: null = pd.DataFrame({'Null Values': df.isna().sum().sort_values(ascending=False),
                     'Percentage Null Values': (df.isna().sum().sort_values(ascending=False) / len(df)) * 100})
null
It’s clear that we have missing values in Rating, Size, Type, Current Ver and Android Ver.
Step 1: Handle missing values. I impute the missing numerical values with the median of each feature; the median is robust to outliers and keeps each feature’s distribution largely intact.
[683]: def impute_median(series):
    return series.fillna(series.median())
df['Rating'] = df['Rating'].transform(impute_median)
[684]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 9145 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
df['Size'] = df['Size'].transform(impute_median)
[686]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10840 non-null object
1 Category 10840 non-null object
2 Rating 10840 non-null float64
3 Reviews 10840 non-null int32
4 Size 10840 non-null float64
5 Installs 10840 non-null int32
6 Type 10839 non-null object
7 Price 10840 non-null float64
8 Content Rating 10840 non-null object
9 Genres 10840 non-null object
10 Current Ver 10832 non-null object
11 Android Ver 10838 non-null object
12 Updated_Month 10840 non-null int32
13 Updated_Year 10840 non-null int32
dtypes: float64(3), int32(4), object(7)
memory usage: 1.1+ MB
[687]: df.isnull().sum()
[687]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64
[688]: df['Type'].fillna(str(df['Type'].mode().values[0]),inplace=True)
[689]: df.isnull().sum()
[689]: App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Current Ver 8
Android Ver 2
Updated_Month 0
Updated_Year 0
dtype: int64
The dataset contains 484 duplicated rows, so I drop them:
[692]: df.drop_duplicates(inplace=True)
Re-checking the duplicate count afterwards returns 0.
Extract Numerical and categorical features
[695]: num_features=[col for col in df.columns if df[col].dtype!='O']
num_features
[695]: ['Rating',
'Reviews',
'Size',
'Installs',
'Price',
'Updated_Month',
'Updated_Year']
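The companion cell that builds the categorical feature list is not shown; it presumably mirrors the comprehension above, keeping the object-dtype columns (the variable name cat_features is an assumption):

cat_features = [col for col in df.columns if df[col].dtype == 'O']
cat_features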
[696]: ['App',
'Category',
'Type',
'Content Rating',
'Genres',
'Current Ver',
'Android Ver']
[699]: sns.boxplot(df["Rating"])
[700]: sns.boxplot(df["Size"])
[701]: sns.boxplot(df["Installs"])
[702]: sns.boxplot(df["Rating"])
[703]: sns.boxplot(df["Price"])
2.3 Exploratory Data Analysis (EDA)
Category Column
[706]: df['Category'].value_counts()
[706]: Category
FAMILY 1943
GAME 1121
TOOLS 842
BUSINESS 427
MEDICAL 408
PRODUCTIVITY 407
PERSONALIZATION 388
LIFESTYLE 373
COMMUNICATION 366
FINANCE 360
SPORTS 351
PHOTOGRAPHY 322
HEALTH_AND_FITNESS 306
SOCIAL 280
NEWS_AND_MAGAZINES 264
TRAVEL_AND_LOCAL 237
BOOKS_AND_REFERENCE 230
SHOPPING 224
DATING 196
VIDEO_PLAYERS 175
MAPS_AND_NAVIGATION 137
EDUCATION 130
FOOD_AND_DRINK 124
ENTERTAINMENT 111
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 85
WEATHER 82
HOUSE_AND_HOME 80
ART_AND_DESIGN 65
EVENTS 64
PARENTING 60
COMICS 60
BEAUTY 53
Name: count, dtype: int64
(Chart of the number of apps per Category with rotated x-axis labels; matplotlib tick-label output omitted.)
[708]: plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
    background_color='black',
    width=1920,
    height=1080
).generate(" ".join(df.Category))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Category vs Rating Analysis
[710]: plt.figure(figsize=(20,15))
sns.boxplot(y='Rating', x='Category', data=df.sort_values('Rating', ascending=False))
plt.xticks(rotation=80)
Type Column
[712]: df['Type'].value_counts()
[712]: Type
Free 9591
Paid 765
Name: count, dtype: int64
Type vs Rating Analysis
[716]: plt.figure(figsize=(15,8))
sns.catplot(y='Rating', x='Type', data=df.sort_values('Rating', ascending=False), kind='boxen')
Content Rating Column
[718]: df['Content Rating'].value_counts()
(Count plot of the Content Rating column with rotated x-axis labels: Everyone, Teen, Mature 17+, Everyone 10+, Adults only 18+, Unrated.)
[721]: plt.figure(figsize=(12,8))
sns.barplot(x="Content Rating", y="Installs", hue="Type", data=df)
Genres Column
[723]: df['Genres'].value_counts()
[723]: Genres
Tools 841
Entertainment 588
Education 527
Business 427
Medical 408
…
Parenting;Brain Games 1
Travel & Local;Action & Adventure 1
Lifestyle;Pretend Play 1
Tools;Education 1
Strategy;Creativity 1
Name: count, Length: 119, dtype: int64
[725]: Current Ver
Varies with device 1302
1.0 802
1.1 260
1.2 177
2.0 149
…
3.18.5 1
1.3.A.2.9 1
9.9.1.1910 1
7.1.34.28 1
2.0.148.0 1
Name: count, Length: 2831, dtype: int64
(Truncated Android Ver value counts: several version ranges such as '7.0 - 7.1.1', '4.1 - 7.1.1' and '5.0 - 6.0' appear only once.)
kde-Plot Analysis
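The helper kde_plot used below is defined in a cell that is not reproduced here; a minimal sketch of such a helper (the exact figure styling is an assumption) could be:

def kde_plot(col):
    # Draw a kernel density estimate for a single numeric column
    plt.figure(figsize=(10, 5))
    sns.kdeplot(data=df, x=col, fill=True)
    plt.title(f'KDE plot of {col}')
    plt.show()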
[730]: kde_plot('Rating')
[731]: kde_plot('Size')
[732]: kde_plot('Updated_Month')
[733]: kde_plot('Price')
[734]: kde_plot('Updated_Year')
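The scatter plots below use another helper, scatters, whose definition is likewise not shown; it presumably wraps a scatter plot of one column against another, for example:

def scatters(x_col, y_col):
    # Scatter plot of one numeric column against another
    plt.figure(figsize=(10, 5))
    sns.scatterplot(data=df, x=x_col, y=y_col, alpha=0.5)
    plt.title(f'{y_col} vs {x_col}')
    plt.show()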
[737]: scatters('Size', 'Rating')
[738]: scatters('Size', 'Installs')
[740]: scatters('Reviews', 'Rating')
Further Analysis
Apps with a 5.0 Rating
[744]: df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')
Despite the full ratings, the number of installations for the majority of the apps is low. Hence,
those apps cannot be considered the best products.
- Reviews
[749]: sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()
The distribution is right-skewed, showing that most applications with a 5.0 rating have only a handful of reviews, which makes these ratings misleading.
- Category
[752]: df_rating_5_cat = df_rating_5['Category'].value_counts().reset_index()
Family, Lifestyle and Medical apps receive the most 5.0 ratings on the Google Play Store, with Family representing about a quarter of the total.
- Type
[756]: df_rating_5_type = df_rating_5['Type'].value_counts().reset_index()
plt.pie(df_rating_5_type['count'], labels=df_rating_5_type['Type'], autopct='%1.1f%%')
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Title
plt.title('Pie chart of App Types with 5.0 Rating')
Almost 90% of the 5.0-rated apps on the Google Play Store are free.
[759]: freq = df['Updated_Year'].value_counts()
freq.plot()
plt.xlabel("Dates")
plt.ylabel("Number of updates")
plt.title("Time series plot of Last Updates")
Dependent variable
• Type of model: Regression
• Dependent variable: numeric (e.g. test scores, blood pressure, stock prices, how fast somebody can run a mile, how many potatoes somebody will purchase at the store)
• Example methods: linear regression, k-nearest neighbors regression, random forest regression, neural network regression, many others
I will use three different regression techniques: linear regression, k-nearest neighbors (KNN), and
random forest.
• Linear regression is one of the simplest and most widely used statistical models. It assumes a linear relationship between the independent and dependent variables, meaning that a change in the dependent variable is proportional to changes in the independent variables.
• Random forest regression is an ensemble method that combines multiple decision trees
to predict the target value. Ensemble methods are a type of machine learning algorithm that
combines multiple models to improve the performance of the overall model. Random forest
regression works by building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging the predictions
of all of the trees.
• k-nearest neighbors regression is a simple and intuitive algorithm that makes predictions by finding the k nearest data points to a given input and averaging their target values.
Data Splitting for Modeling As always, we will divide our data into a training dataset that
we can use to train a predictive model and then a testing dataset that we can use to determine if
our predictive model is useful or not.
We split the dataset into 80% train and 20% test.
[764]: target = 'Rating'
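The splitting cell itself is not reproduced here; a minimal sketch (dropping only the target, and the random_state value, are assumptions) could be:

X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)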
Label Encoding
[768]: le_dict = defaultdict()
for col in features_to_encode:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col]) # Fitting and transforming the Train data
    X_train[col] = X_train[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas
    le_dict[col] = le # Storing the fitted encoder for each column
Standardization
[771]: # Converting and adding "Last Updated Month" to categorical features
categorical_features = features_to_encode + ['Updated_Month']
X_train['Updated_Month'] = X_train['Updated_Month'].astype('category')
X_test['Updated_Month'] = X_test['Updated_Month'].astype('category')
[772]: numeric_features
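The standardization cell is not shown either; a sketch using the StandardScaler imported earlier, fitted on the training split only and applied to both splits (assuming numeric_features excludes the target), could be:

scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])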
multi_index = pd.MultiIndex.from_product(
    [['Linear', 'KNN', 'Random Forest'], ['train', 'test'], ['RMSE', 'MAE', 'R2']],
    names=['model', 'dataset', 'metric'])
df_metrics_reg = pd.DataFrame(index=multi_index, columns=['value'])
[878]: df_metrics_reg
[878]: value
model dataset metric
Linear train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
KNN train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
Random Forest train RMSE NaN
MAE NaN
R2 NaN
test RMSE NaN
MAE NaN
R2 NaN
Linear Regression
[881]: lr = LinearRegression()
lr.fit(X_train, y_train)
[881]: LinearRegression()
KNeighbors Regression
[888]: knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
[888]: KNeighborsRegressor()
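A random forest model is fitted in the same way; that cell is not reproduced here, but it presumably mirrors the two above (the variable name rf is an assumption):

rf = RandomForestRegressor()
rf.fit(X_train, y_train)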
3.3 Regression Evaluation
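The cells that compute and store these scores are not included in this export. A minimal sketch of how the table could be filled, assuming the fitted estimators lr, knn and rf and the train/test splits from above (r2_score is an additional import not shown earlier):

from sklearn.metrics import r2_score

def store_metrics(name, model):
    # Evaluate one fitted model on both splits and write its scores into df_metrics_reg
    for split, X_, y_ in [('train', X_train, y_train), ('test', X_test, y_test)]:
        pred = model.predict(X_)
        df_metrics_reg.loc[(name, split, 'RMSE'), 'value'] = mean_squared_error(y_, pred) ** 0.5
        df_metrics_reg.loc[(name, split, 'MAE'), 'value'] = mean_absolute_error(y_, pred)
        df_metrics_reg.loc[(name, split, 'R2'), 'value'] = r2_score(y_, pred)

for name, model in [('Linear', lr), ('KNN', knn), ('Random Forest', rf)]:
    store_metrics(name, model)

df_metrics_reg['value'] = df_metrics_reg['value'].astype(float).round(3)  # rounding the values, as in the cell below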
[902]: # Rounding the values
df_metrics_reg
[902]: value
model dataset metric
Linear train RMSE 0.478
MAE 0.319
R2 0.023
test RMSE 0.483
MAE 0.327
R2 0.037
KNN train RMSE 0.409
MAE 0.280
R2 0.286
test RMSE 0.510
MAE 0.349
R2 -0.072
Random Forest train RMSE 0.468
MAE 0.309
R2 0.063
test RMSE 0.472
MAE 0.314
R2 0.081
Linear Regression:
• The model has high error (RMSE and MAE) and a low R², indicating poor predictive power.
KNN:
• KNN has the lowest training RMSE (0.409) but the highest test RMSE (0.510). This suggests that KNN may be overfitting the training data.
Random Forest:
• Slightly better than Linear Regression, but still with a low R², indicating limited improvement.
• Overall, the regression predictions do not hold up very well.
• We can conclude that, with these features, the dataset is not well suited to a regression formulation of the problem.
4 Model application
4.1 Describe the model application
Real Estate Valuation:
• A regression model can predict house prices by analyzing factors like square footage, neighborhood quality, and number of bedrooms and bathrooms. This helps buyers, sellers, and real estate agents make informed decisions.
Healthcare:
• In healthcare, regression models can be used to predict patient outcomes based on various
health indicators, aiding in early diagnosis and personalized treatment plans.
Finance:
• Financial institutions use regression models to forecast market trends and stock prices, helping
investors make strategic decisions. They also predict credit risk, aiding in loan approval
processes.
Marketing:
• Marketers use regression models to predict sales and customer behavior, optimizing marketing
strategies and budget allocations for maximum return on investment.
Weather Forecasting:
• Regression models analyze historical weather data to predict future conditions, assisting in
agriculture planning, disaster preparedness, and daily life activities.
Manufacturing:
• In manufacturing, regression models predict equipment failure times, enabling proactive maintenance and reducing downtime. They also help estimate production costs based on various input factors, optimizing resource allocation.
Overall, regression models provide valuable predictive insights across various industries, helping to
make data-driven decisions and improve efficiency.
5 Conclusion
5.1 Advantages and Disadvantages of the approach
Advantages of Regression
• Easy to understand and interpret
• Tree-based methods such as random forest are relatively robust to outliers
• Can handle both linear and nonlinear relationships, depending on the method chosen
Disadvantages of Regression
• Assumes linearity
• Sensitive to multicollinearity
• May not be suitable for highly complex relationships
5.2 Applicability of research results in the future
The results of the regression model research for predicting Google Play Store app ratings have
several future applications:
App Development:
• Improvement Prioritization: Developers can focus on improving features that significantly
impact ratings.
• Pre-release Testing: Predict potential ratings for new apps before launch to make necessary
adjustments.
Market Analysis:
• Competitive Benchmarking: Compare predicted ratings with competitor apps to identify
market position.
• Trend Analysis: Track predicted rating trends over time to adjust development strategies.
User Engagement:
• Personalized Recommendations: Use predicted ratings to suggest apps to users based on
their preferences.
• Feedback Loop: Incorporate user feedback into the model to continuously improve predictions.
Investment Decisions:
• Funding Allocation: Investors can predict app success and allocate funds to high-potential
projects.
• Risk Assessment: Evaluate the risk associated with new app ventures by predicting potential ratings.
Quality Assurance:
• Automated Testing: Implement regression models in automated testing frameworks to
predict the impact of changes on app ratings.
• Feature Optimization: Identify and optimize features that contribute to higher ratings.
By leveraging these research results, stakeholders can make informed decisions, improve app quality,
and enhance user satisfaction, ultimately leading to greater success in the competitive app market.