Balaji 1

This document discusses building a machine learning model to predict mobile phone prices. It begins by introducing the motivation and overview of the project, which is to predict prices during the lockdown period using web scraped mobile data. Various preprocessing steps are applied to the dataset, including handling missing values. Exploratory data analysis is performed. Regression algorithms like Random Forest and Support Vector Regression are used to predict prices based on features like brand, RAM, storage etc.

Overview

In this blog we are going to implement a scalable model for mobile price prediction, using regression techniques on the features of a dataset called Mobile Price Prediction. Several processing steps go into creating the model. For this project I used web scraping to collect the mobile data from an E-Commerce website; we will look at that in the upcoming parts.

Motivation

The motivation behind this project is simple: I wanted to know how mobile prices varied during the lockdown period, since nowadays most E-Commerce websites focus on selling mobiles to consumers.

Like many students, I was attending online classes to continue my education, so I wanted to do something useful with the lockdown period. That is why I decided to take up this project.

One of my brothers also asked me, "Bro, why shouldn't we do this mobile price prediction end to end? Like, not getting the data from Kaggle for this project." So I decided to build it that way.

Introduction

Price is the most influential attribute in marketing and business. The very first question a customer asks is about the price of an item; every customer first wonders, "Would I be able to purchase something with these specifications?" Estimating the price at home is therefore the basic purpose of this work, and this post is only a first step toward that goal. Artificial Intelligence, which makes machines capable of answering questions intelligently, is nowadays a very broad engineering field. Machine learning provides the best techniques for it, such as classification, regression, supervised learning, unsupervised learning, and many more. Different tools are available for machine learning tasks, such as Python, MATLAB, and WEKA, and any of several classifiers can be used, such as Decision Trees and Naive Bayes. Different kinds of feature selection algorithms are available to keep only the best features and shrink the dataset, which reduces the computational complexity of the problem; since this is an optimization problem, optimization techniques are also used to reduce the dimensionality of the dataset. The mobile phone is nowadays one of the most bought and sold devices: new models with new versions and more features are launched every day, and hundreds of thousands of phones are sold and purchased daily. So mobile price prediction is a case study for this type of problem, i.e. finding the optimal product. The same approach can be used to estimate the real price of other products such as cars, bikes, generators, motors, food items, or medicine.

Understand the Problem Statement

Mobile prices reflect what people value, and some price ranges are of great interest to both buyers and sellers: ask any mobile buyer to describe their dream phone or favorite branded model. So in this blog we will look at how prices segregate based on a few features, and then predict the target feature from those same features.

About the Dataset

I did not download this dataset from Kaggle or any other data-collection website; I created it myself using one of the web scraping tools, which I will describe in an upcoming part. First, a brief overview of the data and its features.

# Let's understand the features of this dataset
df.columns

Index(['Brand me', 'Ratings', 'RAM', 'ROM', 'Mobile_Size', 'Primary_Cam',
       'Selfi_Cam', 'Battery_Power', 'Price'], dtype='object')

Data Overview

Overview of the dataset:

• 1. Brand me — the first feature of the dataset: the name and brand of the mobile phone.

• 2. Ratings — the rating given by consumers for each mobile.

• 3. RAM — the RAM size of the phone.

• 4. ROM — the ROM (internal storage) size of the phone.

• 5. Mobile_Size — the screen size of the phone; all values are given in inches.

• 6. Primary_Cam — the number of pixels of the primary (back) camera for each mobile.

• 7. Selfi_Cam — the number of pixels of the selfie (front) camera for each mobile.

• 8. Battery_Power — the battery capacity of each phone, in mAh.

• 9. Price — the dependent feature of the dataset: the price of each mobile.
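To make the schema concrete, here is a tiny hand-made frame with the same nine columns; the two rows are invented for illustration, not taken from the scraped data:

```python
import pandas as pd

# Toy rows with the same columns as the scraped dataset (values are made up)
df = pd.DataFrame({
    "Brand me": ["Phone A", "Phone B"],
    "Ratings": [4.4, 4.3],
    "RAM": [4, 6],                  # GB
    "ROM": [64, 128],               # GB, internal storage
    "Mobile_Size": [6.3, 6.4],      # inches
    "Primary_Cam": [48, 64],        # megapixels, back camera
    "Selfi_Cam": [13, 32],          # megapixels, front camera
    "Battery_Power": [4000, 6000],  # mAh
    "Price": [9999, 16499],         # target column
})
```

Everything except Brand me is numeric, which is what makes the regression setup below straightforward.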

About Web Scraping

For this project I did not get the dataset from Kaggle; instead I got the idea of downloading the data from mobile-selling websites, so I decided to use a web scraping method. First we should understand one thing: what is web scraping? Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. If you want to know more about web scraping, just click here.
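As a toy illustration of the idea, extracting names and prices from a listing can be sketched with the standard library alone. The HTML snippet and class names below are invented stand-ins for a product page; a real scraper would first fetch the page over HTTP (e.g. with urllib.request):

```python
from html.parser import HTMLParser

# Made-up stand-in for a fetched product listing page
SAMPLE_HTML = """
<div class="product"><span class="name">Phone A</span><span class="price">9999</span></div>
<div class="product"><span class="name">Phone B</span><span class="price">16499</span></div>
"""

class ProductParser(HTMLParser):
    """Collects (name, price) pairs from spans tagged with the assumed classes."""
    def __init__(self):
        super().__init__()
        self.field = None   # which labelled span we are currently inside
        self.rows = []      # collected [name, price] pairs

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field == "name":
            self.rows.append([data, None])
        elif self.field == "price":
            self.rows[-1][1] = int(data)
        self.field = None   # reset after consuming the span's text

parser = ProductParser()
parser.feed(SAMPLE_HTML)
```

Scraped rows like these would then be written out as the CSV that the rest of this post works with.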

The tool for getting data from websites

I used a tool to get the data from one of the E-Commerce websites. It is a tool for scraping any data from any website; you can also write Python code for web scraping. I am still a beginner at web scraping, which is why I used a tool to get the data; in the coming days I will develop my web scraping coding skills. If you want this tool for getting your own data, just click here to download it.

About the Algorithms Used

The major aim of this project is to predict mobile prices from the features using some regression techniques and algorithms:

1. Random Forest Regressor

2. Support Vector Regressor

Machine Learning Packages Used in this Project

Data Collection

If you search for this dataset on Kaggle you will not find the same one, only similar datasets. As mentioned, for the data collection part I used the web scraping method to collect the data from the mobile section of one of the E-Commerce websites. If you want the dataset, just click here.

After downloading, the dataset looks like this:

Dataset before dropping the first column

Note: you should first drop the Unnamed: 0 column.
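The drop is a one-liner; the small frame below is a stand-in for the downloaded CSV:

```python
import pandas as pd

# Stand-in for the downloaded CSV, which carries the old index as "Unnamed: 0"
df = pd.DataFrame({"Unnamed: 0": [0, 1],
                   "Brand me": ["Phone A", "Phone B"],
                   "Price": [9999, 16499]})

# Drop the leftover index column before any further processing
df = df.drop(columns=["Unnamed: 0"])
```

Alternatively, passing index_col=0 to pd.read_csv avoids the extra column in the first place.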

Data Preprocessing

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc.

A project like this would normally need a lot of preprocessing, because the dataset was not downloaded from Kaggle or any other data-source website but retrieved from an E-Commerce website. However, after getting the data I cleaned it into a dataset fit for model prediction, so no preprocessing steps are needed except handling the missing values.

# Shape of the dataset
print("Shape of Training dataset:", df.shape)

Shape of Training dataset: (836, 9)

# Checking null values for the training dataset
df.isnull().sum()

Checking Null or Missing Values

Handling the missing values


# Fill the missing values of each column with that column's mean
df['Ratings'] = df['Ratings'].fillna(df['Ratings'].mean())
df['RAM'] = df['RAM'].fillna(df['RAM'].mean())
df['ROM'] = df['ROM'].fillna(df['ROM'].mean())
df['Mobile_Size'] = df['Mobile_Size'].fillna(df['Mobile_Size'].mean())
df['Selfi_Cam'] = df['Selfi_Cam'].fillna(df['Selfi_Cam'].mean())

Note: we do not need the Brand me feature for prediction, since it is just the mobile's name, so we should drop that column as well.

After handling all the null or missing values, the dataset looks like this:

After handling all the null values

Data Types Changing

# Data types of each feature
df.dtypes

Data types of the Features

Note: some of the data types are floating point values; we need to change them into integers, except the Ratings feature.


# Changing the data types
df['RAM'] = df['RAM'].astype('int64')
df['ROM'] = df['ROM'].astype('int64')
df['Selfi_Cam'] = df['Selfi_Cam'].astype('int64')

Note: after changing the data types, the dataset and its dtypes look like this.

After Changing the Data Types

Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.


# Information about the dataset features
df.info()

Summary of the dataset


# Describe the dataset
df.describe()

Description of the Dataset

Feature Observation

# Finding the correlation between the features
corr = df.corr()
corr.shape

First, understand the correlation between the target and the other features.

# Plotting the heatmap of correlation between features
plt.figure(figsize=(14,14))
sns.heatmap(corr, cbar=False, square=True, fmt='.2%', annot=True, cmap='Greens')

Correlation heatmap


# Checking for null values using a heatmap
# (any remaining null values would show up here)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

There are no null or missing values left.

plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.countplot(x='Ratings', data=df)

Rating Frequency

plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.countplot(x='RAM', data=df)

RAM Frequency

plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.countplot(x='ROM', data=df)

ROM Frequency

plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.countplot(x='Primary_Cam', data=df)

Primary Camera Frequency

plt.figure(figsize=(15,10))
sns.set_style('whitegrid')
sns.countplot(x='Selfi_Cam', data=df)

Selfie Camera Frequency

sns.distplot(df['RAM'].dropna(), kde=False, color='darkred', bins=10)

RAM Distribution

sns.distplot(df['Battery_Power'].dropna(), kde=False, color='green', bins=10)

Battery Power Distribution

sns.distplot(df['Price'].dropna(), kde=False, color='darkblue', bins=15)

Price Distribution

sns.distplot(df['Battery_Power'].dropna(), kde=False, color='darkblue', bins=15)

Range of Battery Power

plt.figure(figsize=(10,10))
sns.pairplot(data=df)

Pair plot of all the features

Feature Selection

Feature selection is the process of automatically or manually selecting those features which contribute most to the prediction variable or output you are interested in. Having irrelevant features in your data can decrease the accuracy of the models and make your model learn from irrelevant features. (Note: the chi2 scorer used below expects non-negative features and is designed for classification targets, so treat this as a rough ranking here.)

# Let's find the most important features of this dataset
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

Importing Libraries

X = df.iloc[:, 1:7]    # independent columns
y = df.iloc[:, [-1]]   # target column, i.e. price

Assigning Values

# Apply the SelectKBest class to extract the top best features
bestfeatures = SelectKBest(score_func=chi2, k=4)
fit = bestfeatures.fit(X, y)

Fitting Method

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# Concatenate the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
featureScores

Best Features

print(featureScores.nlargest(4, 'Score'))  # print the 4 best features

Top 4 Features

Feature Importance

from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

model = ExtraTreesClassifier()
model.fit(X, y)

# Use the inbuilt feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_)

[0.12253721 0.109504   0.26739755 0.09270551 0.20932722 0.19852852]

# Plot a graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Feature Importances

Model Building

Random Forest Regressor

Support Vector Regressor
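The code screenshots for this step did not survive the export, so here is a minimal sketch of what the two fits plausibly looked like. The synthetic data, column split, and hyperparameters are my assumptions, not the original code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scraped features and the price target
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 6))                          # 6 numeric features
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, 200)   # price-like target

# Hold out a test split, as the performance numbers below imply was done
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit both regressors on the training split
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
svr = SVR(kernel="rbf", C=1000).fit(X_train, y_train)
```

With the real dataset, X would be the preprocessed feature columns and y the Price column.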

Model Performance

Random Forest Regressor

Support Vector Regressor

Methodology

(Flow diagram: Data Collection → Pre-Processing → Apply Algorithm → Accuracy of Result)
Data Flow Diagrams

Prediction and Final Score

Finally, we made it!

Random Forest Regressor

Training accuracy: 96.2%

Testing accuracy: 95.3%

Support Vector Regressor

Training accuracy: 96.2%

Testing accuracy: 95.8%
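For a regressor, scikit-learn's .score() returns the coefficient of determination R², so the "accuracy" percentages above are presumably R² values expressed as percentages. A sketch of that computation on synthetic stand-in data (the data and seed are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the mobile dataset
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(200, 6))
y = 800 * X[:, 2] + 300 * X[:, 4] + rng.normal(0, 50, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
model = RandomForestRegressor(random_state=1).fit(X_train, y_train)

# .score() on a regressor returns R^2, not classification accuracy
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```

Multiplying these by 100 gives percentage figures like the ones reported above.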

Output & Conclusion

Both algorithms gave about the same level of prediction accuracy.

I hope all of you like this project.

This work can be concluded with comparable results for both feature selection algorithms and classifiers, except the combination of WrapperattributEval and the Decision Tree (J48) classifier: that combination achieved the maximum accuracy while selecting the minimum but most appropriate features. It is important to note that in forward selection, adding irrelevant or redundant features to the dataset decreases the efficiency of both classifiers, while in backward selection, removing any important feature from the dataset decreases its efficiency. The main reason for the low accuracy rate is the low number of instances in the dataset. One more thing should also be considered while working: converting a regression problem into a classification problem introduces more error.


Details:

Name: V. Sai Balaji

Mobile Number: 9100968754

Email: [email protected]

Project: Mobile Price Prediction using Machine Learning
