
FOURTH SEMESTER

MINI PROJECT REPORT

DIAMOND PRICE PREDICTION USING


MACHINE LEARNING

Submitted By

Name: HARRIS WILSON

Reg. No: THAUBOA013

Department of Data Science


St.Thomas’ College (Autonomous), Thrissur

Under the Guidance of

Mr. Rejin Varghese


Assistant Professor & HOD
Department of Data Science, St. Thomas' College, Thrissur

Department of Data Science
St. Thomas' College (Autonomous), Thrissur
Affiliated to the University of Calicut & Reaccredited by NAAC with "A" Grade & College with Potential for Excellence

CERTIFICATE

This is to certify that the Mini Project Report titled "Diamond Price Prediction Using
Machine Learning" is a bonafide record of the work carried out by Harris Wilson
(THAUBOA013) of St. Thomas' College (Autonomous), Thrissur - 680005, in partial
fulfillment of the requirements for the award of the Degree of B.Voc Data Science of the
University of Calicut, during the academic year 2020-2023. The fourth semester Mini
Project report has been approved as it satisfies the academic requirements in respect
of the mini project work prescribed for the said degree.

HOD Principal Faculty In-Charge

Valued On: ……………………………………….

Examiners:

1.………………………….……….. 2.…………………………………

Internal Examiner External Examiner


DECLARATION

I hereby declare that the project report entitled "Diamond Price Prediction Using Machine
Learning", which is being submitted in partial fulfillment of the requirements for the award of the
Degree of Bachelor of Vocational Studies in Data Science, is the result of the project carried out
by me under the guidance and supervision of Mr. Rejin Varghese, Assistant Professor & HOD,
Department of Data Science.

I further declare that this project report has not previously been submitted by me or any other
person to any other institution or university for any other degree or diploma.

Place : Thrissur HARRIS WILSON

Date : (Signature)
DIAMOND PRICE PREDICTION USING
MACHINE LEARNING

CONTENTS

1. Introduction
2. Chapter 1
   2.1 Data Description
   2.2 Methodology
3. Chapter 2
   3.1 Algorithms Used in Our Project
      3.1.1 Random Forest Regression Algorithm
      3.1.2 Comparison of Models
   3.2 Program Code
      3.2.1 Dataset Pre-Processing Code
      3.2.2 Training of Model
      3.2.3 Accuracy of Algorithms
      3.2.4 Prediction Part
4. Chapter 3
   4.1 Results of Data Analysis and Interpretations
5. Conclusion
6. References
1. INTRODUCTION
Nowadays, machine learning is one of the most significant driving forces of Artificial
Intelligence (AI), whereby computers can learn without being explicitly programmed to perform
specific tasks. Machine learning algorithms learn from previous cases to produce the required
results quickly and accurately. Machine learning is a resource that can be exploited to solve a
wide range of problems at all levels: social, economic, professional and others.

Diamond price prediction using machine learning in Python predicts the price of a diamond
from its key features. It tells us the likely price value of a diamond and also helps us understand
the relationships between the variables. The analysis centres on five variables for each diamond:
the diamond's cut, color, clarity, carat, and price. Each of these variables, excluding price, has its
own rating system. In the diamond trading sector, buyers and investors face several difficulties
in predicting the prices of diamond stones. This difficulty is due to differences in the stones'
shapes, sizes and purity. In order to ease this problem and to aid diamond traders, this report
discusses the application of machine learning algorithms as an approach to predicting diamond
prices from their features. This is done using a dataset from Kaggle.

The purpose of this report is to explain how each of these variables determines the price of a
diamond. In a real-world application this information could be used to inform a company,
investor, etc. how diamonds should be priced in a competitive market. The objective of this
project is to use the diamond data variables to predict diamond price values. This is a regression
problem. The project mainly uses four algorithms: (1) Linear Regression, (2) Decision Tree
Regression, (3) Random Forest Regression and (4) KNN Regression. The most efficient of these
is Random Forest Regression, which gives an accuracy of 97.7%. We predict the price of a
diamond from its cut, clarity, colour and carat using Random Forest Regression, and finally
obtain the predicted price value of the diamond.
2. CHAPTER 1

2.1 DATA SET DESCRIPTION

The original dataset is available on the Kaggle website. The dataset has 10 columns and
53,940 rows. The value of diamonds makes them very useful for industrial applications and
desirable as jewelry. Multiple organizations have been created for grading and certifying
diamonds based on the "four Cs": color, cut, clarity, and carat. The diamonds dataset used in
this work contains the prices and other attributes of almost 54,000 diamonds.

The columns are:

1. Carat: carat weight of the diamond. [0.2 - 5.01] (Continuous data)

2. Cut: cut quality of the diamond, in increasing order of quality: Fair, Good, Very Good,
Premium, Ideal. (Ordinal data)

3. Color: color of the diamond, with D being the best and J the worst. (Ordinal data)

4. Clarity: the absence of inclusions and blemishes, in order from best to worst (FL = flawless,
I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3. (Ordinal data)

5. Depth: the height of the diamond, measured from the culet to the table, divided by its
average girdle diameter: depth = z / mean(x, y) = 2 * z / (x + y). [43 - 79]

6. Table: the width of the diamond's table expressed as a percentage of its average diameter.

7. Price: the price of the diamond in US dollars. (Continuous data; this is the target variable)

8. X: length of the diamond in mm. [0 - 10.74]

9. Y: width of the diamond in mm. [0 - 58.9]

10. Z: height of the diamond in mm. [0 - 31.8]

Continuous data: quantitative data that can be measured.
Ordinal data: categorical data that has an order to it (0, 1, 2, 3, etc.).

After downloading the dataset from Kaggle, we saved it to our working directory with the name
123.csv. We used read_csv() to read the dataset and save it to the dataset variable.
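A minimal sketch of this loading step (the file name follows the text above; the printed shape is indicative of the Kaggle diamonds data, not guaranteed for every saved copy):

import pandas as pd

# load the diamonds CSV saved in the working directory
dataset = pd.read_csv('123.csv')

# quick sanity checks on what was read
print(dataset.shape)    # roughly (53940, 10)
print(dataset.head())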

2.2 METHODOLOGY

The project follows the workflow below:

START → Collect Diamond Dataset → Analyse Dataset → Extract Variables → Data Pre-processing → Splitting Data (Training data / Testing data) → Training (LinearRegression, DecisionTreeRegressor, RandomForestRegressor, KNeighborsRegressor) → Analyse → Predicting the value
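This flow maps directly onto a short scikit-learn skeleton. A sketch, assuming the feature matrix x and target y have already been extracted and pre-processed as described later:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# splitting data into training data and testing data (80/20, as in this project)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# training each candidate model, then analysing it on the held-out test data
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'KNeighborsRegressor': KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(name, r2_score(y_test, model.predict(x_test)) * 100)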

3. CHAPTER 2

3.1 ALGORITHMS USED IN OUR PROJECT

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
models to solve a complex problem and to improve the performance of the model.
"Random Forest is a model that builds a number of decision trees on various subsets
of the given dataset and aggregates them to improve the predictive accuracy on that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from
each tree and, based on the majority vote of predictions, predicts the final output. A
greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

Why Use Random Forest ?
It takes less training time compared with other algorithms. It predicts output with high
accuracy, and even for a large dataset it runs efficiently. It can also maintain accuracy when a
large proportion of the data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to build the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase
and aggregate them.

The working process can be explained by the following steps (a from-scratch sketch follows the list):

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
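A minimal from-scratch sketch of this two-phase idea, assuming NumPy arrays. Each tree is fit on a random subset of the data; since this project is a regression task the trees' outputs are averaged (for classification they would be majority-voted, as in Step 5 above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_sketch(x_train, y_train, x_new, n_trees=10, k=1000, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):                     # Steps 3-4: repeat for N trees
        # Step 1: select K random data points from the training set
        idx = rng.choice(len(x_train), size=min(k, len(x_train)), replace=True)
        # Step 2: build a decision tree on that subset
        trees.append(DecisionTreeRegressor().fit(x_train[idx], y_train[idx]))
    # Step 5: combine the predictions of all trees (average for regression)
    return np.mean([t.predict(x_new) for t in trees], axis=0)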

Advantages of Random Forest

Random Forest is capable of performing both classification and regression tasks. It is
capable of handling large datasets with high dimensionality. It enhances the accuracy of the
model and prevents the overfitting issue.

Disadvantages of Random Forest

Although random forest can be used for both classification and regression tasks, it is less
well suited to regression tasks than to classification.

General code for Random Forest Regression

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the dataset
data_set = pd.read_csv('')   # path to the CSV file

# extracting independent and dependent variables
x = data_set.iloc[:, :-1].values   # feature columns (adjust to the dataset)
y = data_set.iloc[:, -1].values    # target column (adjust to the dataset)

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# fitting a Random Forest regressor to the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10)
regressor.fit(x_train, y_train)

• n_estimators = the required number of trees in the random forest. The default value was 10
in older versions of scikit-learn (recent versions default to 100). We can choose any number
but need to take care of the overfitting issue.
• criterion = the function used to measure the quality of a split. For the regressor, scikit-learn's
default is the squared error; "entropy" (information gain) applies only to the classifier variant.

# predicting the test set result
y_pred = regressor.predict(x_test)

Comparing the Algorithms
Linear Regression

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variables.

The linear model makes its prediction using the following equation (reconstructed here in its
standard form):

y = θ0 + θ1x1 + θ2x2 + … + θnxn

where y is the predicted value, θ0 the intercept, and θ1 … θn the coefficients of the features
x1 … xn.

Advantage of Linear Regression


Linear regression is simple to implement, and the output coefficients are easy to interpret.
When you know that the independent and dependent variables have a linear relationship,
this algorithm is the best to use because of its low complexity compared to other algorithms.

Disadvantage of Linear Regression

On the other hand, in the linear regression technique outliers can have huge effects on the
regression, and boundaries are linear in this technique. Moreover, linear regression assumes a
linear relationship between the dependent and independent variables, that is, a straight-line
relationship between them. It also assumes independence between attributes. Furthermore,
linear regression only looks at a relationship between the mean of the dependent variable and
the independent variables. Just as the mean is not a complete description of a single variable,
linear regression is not a complete description of relationships among variables.

General code of Linear Regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

The following command imports the CSV dataset using pandas:

dataset = pd.read_csv('')

The test_size variable is where we actually specify the proportion of the test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

To train the model, we import the LinearRegression class, instantiate it, and call the fit()
method with our training data:

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # training the algorithm

To make predictions on the test data, execute the following script:

y_pred = regressor.predict(X_test)

Decision Tree Regression
A decision tree is a supervised learning technique that can be used for both classification and
regression problems, but mostly it is preferred for solving classification problems. It is a
tree-structured model where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.

In a decision tree there are two kinds of node: decision nodes and leaf nodes. Decision nodes
are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of
those decisions and do not contain any further branches. The decisions or tests are performed
on the basis of the features of the given dataset.

• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands into further branches and constructs a tree-like structure.

Advantages of the Decision Tree


It is simple to understand, as it follows the same process a human follows when making a
decision in real life. It can be very useful for solving decision-related problems. It helps us
think about all the possible outcomes of a problem. There is less requirement for data
cleaning compared to other algorithms.

Disadvantages of the Decision Tree

A decision tree can contain many layers, which makes it complex. It may have an overfitting
issue, which can be resolved using the random forest algorithm. For more class labels, the
computational complexity of the decision tree may increase.

General Code Decision Tree Regression

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('')

# select the feature columns and the target column (adjust to the dataset)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# fit the model with a Decision Tree regressor
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor = regressor.fit(X_train, y_train)

# prediction
y_pred = regressor.predict(X_test)

# accuracy (R-squared score, since this is a regression task)
from sklearn.metrics import r2_score
print('R-squared Score:', r2_score(y_test, y_pred))

K-Nearest Neighbor(KNN) Algorithm
K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the
supervised learning technique. The K-NN algorithm assumes similarity between the new
case/data and the available cases, and puts the new case into the category that is most similar
to the available categories. K-NN stores all the available data and classifies a new data point
based on similarity; this means that when new data appears, it can easily be classified into a
well-suited category.

K-NN can be used for regression as well as for classification, but mostly it is used for
classification problems. K-NN is a non-parametric algorithm, which means it does not make
any assumption about the underlying data. It is also called a lazy learner algorithm because it
does not learn from the training set immediately; instead it stores the dataset and performs an
action on it at the time of classification. At the training phase the KNN algorithm just stores
the dataset, and when it gets new data it classifies that data into the category that is most
similar to the new data.

Advantages of KNN Algorithm:


It is simple to implement. It is robust to noisy training data. It can be more effective when
the training data is large.

Disadvantages of KNN Algorithm:

The value of K always needs to be determined, which may sometimes be complex. The
computation cost is high because the distance between the new data point and all the
training samples must be calculated.

General Code for K-Nearest Neighbors
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# read the dataset in
dataset = pd.read_csv('')
dataset

# pre-processing the data: select features and target (adjust to the dataset)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# fit the model with KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# accuracy of the algorithm
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
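The template above is the classification variant of KNN. Since this project treats diamond price prediction as regression, the regressor form differs only in the estimator and the metric; a sketch reusing the scaled arrays from above:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# KNN regression: predict by averaging the targets of the k nearest neighbours
regressor = KNeighborsRegressor(n_neighbors=7)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print('R-squared:', r2_score(y_test, y_pred))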

Random Forest Regression vs Linear Regression
Linear models have very few parameters; random forests have many more, which means a
random forest will overfit more easily than a linear regression. If the dataset contains a mix of
categorical and continuous features, a tree-based model is better than linear regression, since
trees can accurately split the data on categorical variables. Comparing accuracy, linear
regression reaches about 87% while random forest regression reaches about 97%. Thus it is
more suitable to use random forest regression.
Random Forest Regression vs Decision tree regression
A decision tree combines some decisions, whereas a random forest combines several decision
trees. Building a random forest is therefore a longer, slower process that needs rigorous
training, whereas a single decision tree is fast and operates easily on large data sets,
especially linear ones. The random forest algorithm avoids and prevents overfitting by using
multiple trees. Comparing accuracy, the decision tree reaches 97.1% while random forest
regression reaches 97.7%. Thus it is more suitable to use random forest regression.
Random Forest Regression vs KNN Algorithm
A random forest produces good predictions that can be understood easily, and it can handle
large datasets efficiently. The random forest algorithm provides a higher level of accuracy in
predicting outcomes than KNN, which is non-parametric, can only output a label, and always
needs a value of k to be determined, which can be complex. Comparing accuracy, KNN
reaches 97.5% while random forest regression reaches 97.7%. Thus it is more suitable to use
random forest regression.

R-squared = explained variance / total variance

The algorithm with the higher R-squared gives the best model output and indicates that the
model deviates less. An R-squared score of 1 means the model fits perfectly. R-squared is
often called the coefficient of determination: it is the proportion of the variance in the
dependent variable that is explained by the independent variables. In this project we compare
the Linear Regression, Decision Tree Regression, Random Forest Regression and KNeighbors
Regression models. The models are trained and tested on the training data and test data
respectively.

The R-squared (as a percentage) found for each model is given below:

linear regression = 87.76517206528275
decision tree regression = 97.1226955763902
random forest regression = 97.76376847193016
knn regression = 97.53654026540606

From these we find that the best model is random forest regression.
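These values come from scikit-learn's r2_score; the same quantity can be computed directly from the definition above. A sketch, assuming y_test holds the held-out prices and pred a model's predictions:

import numpy as np
from sklearn.metrics import r2_score

ss_res = np.sum((y_test - pred) ** 2)             # unexplained (residual) variance
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total variance
r2_manual = 1 - ss_res / ss_tot

# both reported as percentages; the two numbers should match
print(r2_manual * 100, r2_score(y_test, pred) * 100)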

Program Code

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import pandas.util.testing as tm   # unused here; deprecated in recent pandas versions

# loading the data from CSV file to a pandas DataFrame


data=pd.read_csv('/content/123.csv')

#first 2 rows of the dataFrame


data.head(2)

#Description of the dataframe


data.describe()

#finding null in dataframe


data.isnull().sum()

#information about the dataframe


data.info()

#dropping unwanted columns in the dataframe


data=data.drop(['depth','table','x','y','z'],axis=1)

#first 2 rows of the dataFrame


data.head(2)

#datatype of the dataframe


data.dtypes

#changing the datatype


data['price']=data.price.astype(float)
data.dtypes

#Visualisation Of All Features

#Visualize via kde plots

sns.kdeplot(data['carat'], shade=True , color='r')

#Carat vs Price

sns.jointplot(x='carat' , y='price' , data=data , size=5)

It seems that Carat varies with Price Exponentially.

#CUT

sns.factorplot(x='cut', data=data , kind='count',aspect=2.5 )

#price value of different cut


sns.factorplot(x='cut', y='price', data=data, kind='box' ,aspect=2.5 )

Premium-cut diamonds, as we can see, are the most expensive, followed by Very Good cut.

#clarity relation with diamond price


sns.boxplot(x='clarity', y='price', data=data )

It seems that VS1 and VS2 affect the diamond's price equally, both having quite a high price
margin.

# first 1 rows of the dataFrame


data.head(1)

#changing from cutname to numerical


from sklearn.preprocessing import LabelEncoder
l1=LabelEncoder()
label=l1.fit_transform(data['cut'])
l1.classes_

label

data['cut_label']=label
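One caveat worth noting here: LabelEncoder assigns integer codes in alphabetical order of the class names, so for this column l1.classes_ comes out as ['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'] and cut_label does not follow the quality order Fair < Good < Very Good < Premium < Ideal. Tree-based models are fairly insensitive to this, but a quality-ordered encoding could be produced with an explicit map instead; a sketch:

# encode cut in its actual quality order rather than alphabetically
cut_order = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
data['cut_label'] = data['cut'].map(cut_order)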

#first 2 rows of the dataFrame
data.head(2)

#changing from clarity name to numerical


l2=LabelEncoder()
Label1=l2.fit_transform(data['clarity'])
data['clarity_label']=Label1
data.head(2)

#changing colourcode to numerical


data['color']=data['color'].map({'D':1,'E':2,'F':3,'G':4,'H':5,'I':6,'J':7,'NA':8})

data['color'].fillna(0)   # note: not assigned back, so this line has no effect; the ffill below does the filling

data['color'].isnull().sum()

data['color']=data['color'].fillna(method='ffill')

data['color'].isnull().sum()

#first 2 rows of the dataFrame


data.head(2)

#taking price as y
y=data['price']
y.head(1)

#dropping unwanted columns and rest of added to x


x=data.drop(['price','cut','clarity'],axis=1)
x.head(1)

#train_test_split of x and y
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8,random_state=42)

len(x_train)

len(y_test)

len(data)

data.head()

data.tail()

from sklearn.preprocessing import StandardScaler


scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)   # use the scaler fitted on the training data only

# LinearRegression
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
linreg.fit(x_train,y_train)
pred=linreg.predict(x_test)

from sklearn.metrics import r2_score


lr=r2_score(y_test,pred)*100
print(lr)

#DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
reg=DecisionTreeRegressor()
reg.fit(x_train,y_train)
pred1=reg.predict(x_test)

from sklearn.metrics import r2_score


dtr=r2_score(y_test,pred1)*100
print(dtr)

# RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(n_estimators=50)
rf.fit(x_train,y_train)
pred2=rf.predict(x_test)

from sklearn.metrics import r2_score


rfr=r2_score(y_test,pred2)*100
print(rfr)

#KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor
knn=KNeighborsRegressor(n_neighbors=5)
knn.fit(x_train,y_train)
pred3=knn.predict(x_test)

from sklearn.metrics import r2_score


knr=r2_score(y_test,pred3)*100
print(knr)

#printing the accuracy of algorithms


print('linear regression',lr)
print('decision tree',dtr)
print('random forest regressor',rfr)
print('knn regressor',knr)

#prediction part
def prediction():
    carat = float(input('enter the carat value of the diamond:'))
    cut = int(input('enter the value of cut:'))
    clarity = int(input('enter the clarity:'))
    color = int(input('enter the value of color:'))

    price = rf.predict([[carat, cut, clarity, color]])

    print('approximate price of the diamond is:', price, '$')

predi = prediction()

predi
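One caveat with the function above: the models were trained on features scaled by StandardScaler and ordered as in x (carat, color, cut_label, clarity_label), so raw inputs in a different order will not line up with the training data. A hedged sketch of a corrected helper:

def prediction_scaled():
    carat = float(input('enter the carat value of the diamond:'))
    color = int(input('enter the value of color:'))
    cut = int(input('enter the value of cut:'))
    clarity = int(input('enter the clarity:'))

    # same column order as x, passed through the scaler fitted on x_train
    features = scaler.transform([[carat, color, cut, clarity]])
    print('approximate price of the diamond is:', rf.predict(features)[0], '$')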

4. Chapter 3
4.1 RESULTS OF DATA ANALYSIS AND INTERPRETATION

Figure 1.1

A kernel density estimate (KDE) plot is a method for visualizing the distribution of
observations in a dataset; here it represents carat, analogous to a histogram.
Figure 1.2

A jointplot is seaborn-specific and can be used to quickly visualize and analyse the
relationship between two variables and describe their individual distributions on the same
plot. Here it seems that carat varies with price exponentially.
Figure 1.3

A factorplot is used to draw different types of categorical plots; this one shows the count of
the different types of cut.
Figure 1.4

A factorplot is used to draw different types of categorical plots. From this plot we can infer
that Premium-cut diamonds are the most expensive, followed by Very Good cut.
Figure 1.5

The seaborn boxplot is a very basic plot. Boxplots are used to visualize distributions, which
is very useful when you want to compare data between groups. In this plot it seems that
VS1 and VS2 affect the diamond's price equally, both having quite a high price margin.
Figure 2.1

LabelEncoder can be used to normalize labels and to transform non-numerical labels to
numerical labels. Here the non-numerical labels of cut are changed to numerical labels.
Figure 2.2

LabelEncoder can be used to normalize labels and to transform non-numerical labels to
numerical labels. Here the non-numerical labels of clarity are changed to numerical labels.
Figure 2.3

Changing the colour code of the diamond to a numerical code.
Figure 3.1

Defining y as the dependent variable: here the dependent variable is price, because price
depends on the other factors.

Defining x as the independent variables: we drop price, cut and clarity from the dataframe,
because cut and clarity were label-encoded into cut_label and clarity_label, and the rest of
the columns are taken as x.
Figure 3.2

x and y go to the train/test split. The train size is 0.8, so 80% of x is taken as x_train
(training data) and the remaining 20% as x_test (test data); likewise 80% of y is taken as
y_train and the remaining 20% as y_test.
Figure 4.1
StandardScaler follows the standard normal distribution (SND): in machine learning,
StandardScaler is used to resize the distribution of values so that the mean of the observed
values is 0 and the standard deviation is 1.

Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis, making predictions for continuous/real or
numeric variables. The accuracy (R-squared) of linear regression in this model is
87.76517206528275.
Figure 4.2

In a decision tree there are two kinds of node: decision nodes and leaf nodes. Decision nodes
are used to make a decision and have multiple branches, whereas leaf nodes are the outputs
of those decisions and do not contain any further branches. The decisions or tests are
performed on the basis of the features of the given dataset. The accuracy of Decision Tree
Regression in this model is 97.1226955763902.
Figure 4.3

Random forest is an ensemble that builds a number of decision trees on various subsets of
the given dataset and averages them to improve the predictive accuracy on that dataset. The
accuracy of Random Forest Regression in this model is 97.76376847193016.
Figure 4.4

The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories. K-NN
stores all the available data and classifies a new data point based on similarity, which means
new data can easily be assigned to a well-suited category. The accuracy of K-NN in this
model is 97.53654026540606.
Figure 4.5

From these regressions, the model with the highest accuracy was the random forest regressor.

Figure 4.6
5. Conclusion
By experimenting and analyzing, it is feasible to conclude that supervised learning methods
such as the linear regression, decision tree, random forest and KNN regression models can be
utilized to evaluate diamond prices. The random forest regression model shows about 97.7%
accuracy; it achieves such high accuracy because it has a strong capacity to model continuous
numerical values. The price of a diamond is decided by multiple factors, so price
determination seems difficult. An accurate predictive model can be valuable to businesses and
consumers in determining the fair price of a diamond; here we built a model that can predict
the price of a diamond from its cut, clarity and colour. According to industry players, diamonds
offer good returns. Most importantly, diamond investment has its own pros and cons, which
one should be aware of. Future work shall include unsupervised models in order to extend the
accuracy of prediction on the diamond dataset and also its robustness.
6. REFERENCES
1. Machine Learning Algorithms for Diamond Price Prediction, Waad Alsuraihi, et al. (reproduced below)

2. Comparative Analysis of Supervised Models for Diamond Price Prediction, Garima Sharma, et al.

3. Gold and Diamond Price Prediction Using Enhanced Ensemble Learning, Avinash Chandra Pandey, et al. (reproduced below)
Machine Learning Algorithms for Diamond Price Prediction

Waad Alsuraihi, Kholoud Bawazeer, Ekram Al-hazmi, Hanan Alghamdi
Information Systems Department, King Abdulaziz University, Jeddah, Saudi Arabia

ABSTRACT

Precious stones like diamond are in high demand in the investment market due to their monetary rewards. Thus, it is of utmost importance to the diamond dealers to predict the accurate price. However, the prediction process is difficult due to the wide variation in the diamond stones' sizes and characteristics. In this paper, several machine learning algorithms were used to help in predicting diamond price, among them linear regression, random forest regression, polynomial regression, gradient descent and neural networks. After training several models, testing their accuracy and analyzing their results, it turns out that the best of them is the random forest regression.

CCS Concepts
• Information systems ➝ Decision support systems ➝ Data analytics

Keywords
Machine Learning; Predictive analysis; Diamond price; Supervised; Regression; Random forest

1. INTRODUCTION

Nowadays, Machine Learning is one of the most significant driving forces of Artificial Intelligence (AI), whereby computers can learn without being programmed to perform specific tasks. Machine learning algorithms learn from previous cases to produce the required results quickly and accurately. We can say that machine learning is a treasure that must be exploited to solve a wide range of problems at all levels: social, economic, professional and others.

In the diamond trading sector, buyers and investors face several difficulties in predicting diamond stone prices. This difficulty is due to the differences in the stones' shapes, sizes and purity. In order to ease this problem and to aid diamond traders, this paper discusses the application of machine learning algorithms as an approach to predict diamond price through employing their features. This is done by using a dataset from Kaggle. In this paper we used the Diamonds Dataset [1], formulating the problem as a regression problem and comparing the performance of these algorithms. We found that random forest regression achieved the best result. All the algorithms are trained as predictive supervised learning models. Predictive modelling is the task of building a model using historical data to estimate the labels of new data samples [2]. As we consider predicting the price label as a numeric value, regression predictive modelling is utilized in this paper. Regression is the task of predicting a continuous output, such as an integer or floating-point value, for a data sample given its attributes [3].

This paper is structured as follows. Section II presents some related works. Section III discusses the proposed approach, describes the diamond dataset and explains the pre-processing steps. Section IV presents the results. Section V discusses these results. Section VI discusses the possible future work and concludes the paper.

(IVSP 2020, March 20–22, 2020, Singapore. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7695-2/20/03. DOI: https://doi.org/10.1145/3388818.3393715)

2. RELATED WORK

In this section we talk briefly about how others have analyzed the same dataset and their results. We pick two analyses from Kaggle. In [4], the authors applied several models and report their performance in terms of the R2-score, a statistical metric of how close the data are to the fitted model; the higher the R2-score, the better the model fits the data. Random Forest Regression achieved a score of 0.982066, K-Neighbours Regression 0.959033, Gradient Boosting Regression 0.905833, AdaBoost Regression 0.882499, Linear Regression 0.881432, Lasso Regression 0.865866 and Ridge Regression 0.753705. The random forest algorithm gives the highest R2-score, "98%" [4]. In [5], the authors trained other models, including Decision Tree Regression with accuracy 100.00, Random Forest Regression with accuracy 99.50, Linear Regression with accuracy 91.87, Gradient Boosting Regression with accuracy 90.38, Lasso Regression with accuracy 90.17, AdaBoost Regression with accuracy 85.10, Elastic Net Regression with accuracy 81.22 and Ridge Regression with accuracy 80.12. The model that gives the highest accuracy is Decision Tree Regression [5].

3. PROPOSED METHOD

3.1 Dataset Description

Since ancient times, diamonds have been used as ornamental items. The value of diamonds makes them very useful for industrial applications and desirable jewelries [6]. Multiple organizations have been created for grading and certifying diamonds based on the "four Cs", which are color, cut, clarity, and carat. In this work, the diamonds dataset contains the prices and other attributes of almost 54,000 diamonds [1].

The attributes are:
• Data frame [53940 instances & 10 features]
• Price: in US dollars [$326 - $18,823]
• x: length in mm [0 - 10.74]
• y: width in mm [0 - 58.9] [See Figure 1]
• z: depth in mm [0 - 31.8]
• Carat: weight of the diamond [0.2 - 5.01]
• Cut: quality of the cut [Fair, Good, Very Good, Premium, Ideal]
• Color: diamond colour [from J (worst) to D (best)]
• Clarity: how clear the diamond is [I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)]
• Table: width of the top of the diamond relative to its widest point [43 - 95] [See Figure 1]
• Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) [43 - 79] [See Figure 1]

Figure 1. Diamond stone features: table, depth and width

3.2 Feature Correlation

We use correlation coefficient measures to measure the strength of the relationship between two features. A correlation coefficient is a statistical metric of the degree to which the changes of one variable can predict the change of another one [7]. The strength of the relationship between the characteristics of the diamond data set was measured [See Figure 2]. As can be noticed, X, Y, and Z have a strong relation with price. Carat also has a significant relation with price.

Figure 2. Correlation coefficients of the diamond dataset features
3.3 Pre-processing

Pre-processing means applying a transformation to a dataset before feeding it to the algorithm. Data pre-processing means performing some cleansing techniques to convert the raw data into a clean data set [8], in order to achieve better results when training the machine learning models. For example, some algorithms do not support null values or categorical data. In this work, the dataset has no null values (or missing values), but there are three categorical features that need transformation. Thus, before manipulating the dataset, we convert the nominal values (object type) in the color, cut and clarity attributes to numeric values by using one-hot encoding. We split the data set into two parts: the first part is the training set (80%) used to create the model, and the other is the test set (20%) used to validate the model.
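A minimal sketch of the pre-processing just described, using pandas' get_dummies for the one-hot encoding (the file name is illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('diamonds.csv')

# one-hot encode the three nominal (object-type) attributes
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'])

X = df.drop('price', axis=1)
y = df['price']

# 80% training set to create the model, 20% test set to validate it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)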

3.4 Models Descriptions

In this paper, we trained several models to predict the price label. Following are brief descriptions of these models.

3.4.1 Linear Regression
Linear Regression is a supervised machine learning algorithm [9]. The linear model makes its prediction by the following equation (reconstructed here in its standard form):

y = θ0 + θ1 x1 + θ2 x2 + … + θn xn

where x: input from the training data; y: label; θ0: intercept; θ1…n: coefficients of x [10].

3.4.2 Random forest regression
Random forest regression is a supervised learning algorithm which uses the ensemble learning method for both classification and regression problems. It is a kind of ensemble learning, which means it takes the result of predictions from a group of models [11]. A random forest contains several decision trees, which are based on the following Gini impurity equation [10]:

G_i = 1 − Σ_k p_{i,k}²

where p_{i,k} is "the ratio of class k instances among the training instances in the ith node" [10].

3.4.3 Gradient Boosting Regressor
The second model is the Gradient Boosting Regressor. It is a machine learning algorithm for regression and classification problems, which is based on an ensemble of weak learners, typically decision trees [12]. This model is learned by taking a weighted sum of M base learners (additive modelling):

F_m(x) = F_{m−1}(x) + γ h_m(x)

where F_m(x): "model obtained by adding m (= 1, 2, …, M) base learners" [13]; h_m(x): "represents the direction in which the loss function decreases w.r.t. F_{m−1}(x)" [13]; γ: "corresponds to the hyperparameter α in terms of the utility" [13].

3.4.4 Polynomial regression
Polynomial regression is a regression analysis model in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x [14]. To generate the polynomial regression equation from the linear regression equation, we only add powers of the original features as new features [10].

3.4.5 Neural network
A neural network is a machine learning model that makes use of an architecture inspired by the neurons in the brain [15]. It uses the Perceptron learning rule to make the prediction. The rule is:

w_{i,j} ← w_{i,j} + η (y_j − ŷ_j) x_i

where w_{i,j} is the weight of the ith input and the jth output neuron [10]; x_i is the ith input of the current instance [10]; ŷ_j is the output of the jth output neuron for the current instance [10]; y_j is the target output of the jth output neuron for the current instance [10]; η is the learning rate [10].

Figure 3. Layers of deep neural networks (simple model) [10]

Deep learning is neural networks composed of several layers [see Figure 3] [10]. Neural networks have three types of layers: input, hidden and output. Each neuron is connected to the next layer [10]. Neural networks have many parameters whose values we can change to fit the data, but in this paper we focused on the layers and their neurons (units). We tried to change the number of layers and the neurons (units), because they are among the factors affecting the accuracy of the prediction [10].
4. PERFORMANCE AND VALIDATION METRICS

To check the rate of error we used MSE and RMSE. The MSE is an estimator that measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values [16]. The RMSE is a quadratic score which measures the average magnitude of the error between predictions and the corresponding actual values. RMSE gives a relatively high weight to large errors; thus, it is preferred when large errors are particularly undesirable [17]. We also use cross validation to estimate the generalization of the models on unseen data [18].
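These error metrics can be computed directly from their definitions; a short sketch, assuming y_true and y_pred are arrays of actual and predicted prices:

import numpy as np

def mse(y_true, y_pred):
    # mean squared error: average of the squared differences
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    # root mean squared error: weights large errors relatively heavily
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    # mean absolute error: average magnitude of the errors
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))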
5. RESULTS AND DISCUSSIONS

All the algorithms (models) mentioned earlier in the model descriptions were trained on the Diamonds dataset.

The results were as follows. When we applied linear regression, we did some testing, but the result showed underfitting: one of its predictions was 4784 while the actual value was 16231, etc. So there is a high error between the actual and predicted values. In addition, when we used cost functions such as RMSE and MAE, their results showed that there was a high error.

For the Gradient Boosting Regressor, there was a low error between the actual and predicted values compared with other models.

After that, we used polynomial regression with two different degrees, first degree 1 and then degree 2. When the degree is 1, there is a high error between the actual and predicted values compared to other models; we notice underfitting in the model, so we need to increase the complexity by increasing the degree of the polynomial equation. We then built the model with degree 2 and tested the result. We found a low error between the actual and predicted values compared with other models, but it is close to overfitting, so we should not increase the degree further, or we should use a regularizing method.

About the neural network: in the first case, we used three layers, the first one for the input with 26 inputs (features); the second layer has 26 units (neurons); the last one is for the output, which is the labeled feature (price). As shown in Figure 4 there is a good alignment between predicted values and actual values. Then, we tested using cost functions and the resulting error was high. The actual values are [16231, 4540, 5729, ..., 1014, 2871, 6320] and the predicted results are [12755.024, 5089.3423, 6623.827, ..., 1501.631, 2848.1982, 7164.5537]. So, there is also underfitting in this model.

Figure 4. The scatter plot for neural network results

In the second case, we used four layers: the first one for the input with 26 inputs (features); the second and third layers have 10 units (neurons) each; the last one is for the output, the labeled feature (price). There was not much difference from the first case. In the third case, we used six layers: the first one for the input with 26 inputs (features); the second layer has 10 units; the third layer has 40 units; the fourth layer has 30 units; the fifth layer has 10 units; and the last one is for the output, the labeled feature (price). This case also had a result very close to the second case.

The forest regression has the lowest error between the actual and predicted values compared with the other models. We did some testing; the actual results were [15128.7, 4681.5, 5731.7, ..., 1003.4, 2874.6, 6300.7] and the predicted results were [16231, 4540, 5729, ..., 1014, 2871, 6320]. There is very little difference. This makes the random forest model the best model for predicting the price values, compared with the other models.

Table 1 summarizes the results of all trained algorithms.

Table 1. A comparison of accuracy results for all methods trained with the diamond dataset

| Model | Parameters | MAE | RMSE |
| Linear regression | - | 742 | 1128.5 |
| Forest regression | - | 112.93 | 241.97 |
| Gradient Boosting Regressor | - | 938 | 1406 |
| Polynomial regression | Degree = 1 | 742.256 | 1128.569 |
| Polynomial regression | Degree = 2 | 401.497 | 672.398 |
| Neural network | Layer 1 = 26 inputs; Layer 2 = 26 units; Layer 3 = 1 output | 3103, falling to 992 at epochs = 100 | - |
| Neural network | Layer 1 = 26 inputs; Layer 2 = 10 units; Layer 3 = 10 units; Layer 4 = 1 output | 2803.8015, falling to 764.1069 at epochs = 100 | - |
| Neural network | Layer 1 = 26 inputs; Layer 2 = 10 units; Layer 3 = 40 units; Layer 4 = 20 units; Layer 5 = 10 units; Layer 6 = 1 output | 1921.5269, falling to 769.9570 at epochs = 100 | - |

We compared the results of the algorithms that we trained with the algorithms trained by the authors mentioned previously in the Related Work section. Table 2 summarizes the results.
Table 2. A comparison of the methods used in this paper with methods used in other related works

| Models | Proposed RMSE | Proposed MAE | Work in [5] RMSE | Work in [5] RMSE mean | Work in [4] RMSE | Work in [4] MAE |
| Linear regression | 1128.57 | 742.26 | 1142.27 | 1126.90 | 1382.53 | 926.72 |
| Gradient boosting regressor | 1406.26 | 938.02 | 1242.20 | 1235.61 | 1232.08 | 720.72 |
| Random forest | 241.98 | 112.94 | 282.70 | 577.40 | 559.57 | 283.57 |

6. CONCLUSION AND FUTURE WORK

This paper demonstrates that machine learning can contribute to the economic field and help traders solve their problems, such as forecasting prices. In this paper, we applied several models and conducted accuracy tests on them. We saw that the random forest has the best result despite the noise in this dataset. Therefore, we recommend using a random forest or building ensemble models. Ensemble learning helps improve machine learning results and reduces variance (bagging) and bias (boosting) by combining several models [19]. In future, we aim to enhance our results by building combinations of ensemble models.

7. REFERENCES
[1] Diamonds | Kaggle. Retrieved 5 December 2019, from https://www.kaggle.com/shivam2503/diamonds
[2] Difference Between Supervised, Unsupervised, & Reinforcement Learning | NVIDIA Blog. Retrieved 21 October 2019, from https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
[3] Brownlee, J. (2017, December 10). Difference Between Classification and Regression in Machine Learning. Retrieved 21 October 2019, from Machine Learning Mastery: https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
[4] Diamonds In-Depth Analysis. Retrieved 5 December 2019, from https://kaggle.com/fuzzywizard/diamonds-in-depth-analysis
[5] Diamond Price Modelling. Retrieved 5 December 2019, from https://kaggle.com/tobby1177/diamond-price-modelling
[6] Diamond (gemstone). (2019). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Diamond_(gemstone)&oldid=918934528
[7] What is correlation coefficient? - Definition from WhatIs.com. Retrieved 5 December 2019, from https://whatis.techtarget.com/definition/correlation-coefficient
[8] Data Preprocessing for Machine learning in Python. (2017, October 29). Retrieved 5 December 2019, from GeeksforGeeks: https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/
[9] ML | Linear Regression. (2018, September 13). Retrieved 5 December 2019, from GeeksforGeeks: https://www.geeksforgeeks.org/ml-linear-regression/
[10] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc.
[11] Chakure, A. (2019, July 7). Random Forest and its Implementation. Retrieved 5 December 2019, from Medium: https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f
[12] Grover, P. (2019, August 1). Gradient Boosting from scratch. Retrieved 5 December 2019, from Medium: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
[13] Mahto, K. K. (2019, February 25). Demystifying Maths of Gradient Boosting. Retrieved 15 December 2019, from Medium: https://towardsdatascience.com/demystifying-maths-of-gradient-boosting-bd5715e82b7c
[14] Polynomial regression. (2019). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Polynomial_regression&oldid=928707881
[15] Simpson, M. (n.d.). Machine Learning Algorithms: What is a Neural Network? Retrieved 5 December 2019, from https://www.verypossible.com/blog/machine-learning-algorithms-what-is-a-neural-network
[16] Python | Mean Squared Error. GeeksforGeeks. Retrieved 21 October 2019, from https://www.geeksforgeeks.org/python-mean-squared-error/
[17] Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Retrieved 21 October 2019, from http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm
[18] Brownlee, J. (2018, May 22). A Gentle Introduction to k-fold Cross-Validation. Retrieved 21 October 2019, from Machine Learning Mastery: https://machinelearningmastery.com/k-fold-cross-validation/
[19] Smolyakov, V. (2019, March 7). Ensemble Learning to Improve Machine Learning Results. Retrieved 5 December 2019, from Medium: https://blog.statsbot.co/ensemble-learning-d1dcd548e936
Gold and Diamond Price Prediction Using Enhanced Ensemble Learning

Avinash Chandra Pandey, Shubhangi Misra, Mridul Saxena
Jaypee Institute of Information Technology, Noida
[email protected], [email protected], [email protected]

Abstract—Precious metals like diamond and gold are in high demand due to their monetary rewards. Therefore, various techniques are generally employed to forecast prices of diamonds and precious metals with the aim of fast and accurate results. The prices fluctuate daily, making it difficult to predict the next future value. Hence, by examining the pattern of previous prices we can apply regression models for future prediction. This paper aims at forecasting the future prices of precious metals like gold and precious stones like diamond, using ensemble techniques, aiming to get the most accurate result of all. Ensemble models are used for increasing the accuracy of prices. Later, feature selection methods are also used and the results are compared.

Index Terms—Ensemble models; Feature Selection; Boosting; Forecasting

I. INTRODUCTION

Price forecasting is the process of using historical data on a given product to predict the long-term trends of the market. Historically, Gold and Diamond have found their applications in almost every field. Most countries use Gold in the form of currency and for investments also. Banks prefer investing in precious metals due to their unique properties and high demand in the market. As a customer, there is always ambiguity about the correct time to invest, purchase or sell precious items like gold and diamond. It is of utmost significance to buyers and investors when it comes to making maximum profit out of an investment and least expense out of a purchase made for the above-mentioned items. There are numerous models and applications currently in the market which are used for predicting the future price of these metals [1]. However, the prices of these metals show non-linear and dynamic time-series behavior; therefore, forecasting their prices is a difficult process. Many researchers have tried predicting future prices of these metals using various machine learning algorithms [2].

Machine learning algorithms are generally classified into two main categories, namely supervised and unsupervised machine learning methods. Supervised machine learning generally outperforms the other methods [3]. Some of the popular machine learning models which have been used for forecasting the future prices of Gold and Diamond are regression models like the linear regressor, Random Forest, and ensemble techniques [4]. Moreover, some of the models, namely the AdaBoost, LightGBM and XGBoost regressors, are used to enhance the accuracy of basic models [5], [6]. Accuracy of prediction also depends on the decency of the features; therefore, several feature selection methods are used for improving the efficiency and accuracy of the state-of-the-art approaches [7], [8], [9]. However, supervised models generally suffer from over-fitting and under-fitting problems and also show poor performance for imbalanced datasets [10], [11]. Therefore, to avoid these issues, this paper introduces a hybrid model based on the strength of random forest and principal component analysis (PCA). This paper attempts to first equalize, then better the paradigms in order to obtain a more reliable outcome.

The main contributions of this study can be encapsulated as follows.
• First, PCA, recursive feature elimination, and the Chi-square test have been used to eliminate the correlation among features and obtain the best subset of features.
• Second, ensemble models based on the strength of random forest and linear regression are used for predicting the future prices of Gold and Diamond.

The remainder of this paper is arranged as follows. Section 2 briefs the related work in the field of Gold and Diamond price prediction. Preliminaries are discussed in Section 3 and the proposed method is given in Section 4. Experimental results are presented in Section 5, followed by the conclusion in Section 6.
II. BACKGROUND STUDY

According to Pradeep [4], forecasting prices of Gold and Diamond is a complicated task due to their non-linearity and fluctuating time-series behavior, constrained by many factors, economic, financial etc. That paper examined different ensemble models for determining the future momentum of gold and silver stock prices for the upcoming days relative to the current day's stock price. Machine learning algorithms like linear regression, logistic regression and random forest regression were used for predicting the gold and silver price. Ling et al. [12] have used different ensemble models, explaining the types of classifiers contained in them, and used the model for prediction. Ensemble models use the notions of voting, bagging and stacking to enhance the prediction accuracy. Hafezi and Akhavan [13] analyzed the performance of various ensemble models and found that these models show better classification accuracy than other models; moreover, these models are also used for prediction of stock prices. Ensemble learning strategies for classification, especially boosting and bagging of decision trees, have improved the prediction accuracy of base learning algorithms [14]. Webb [14] investigated that the improvement in accuracy of multi-strategy approaches to ensemble learning is due to a rise in the diversification of the ensemble members that are formed. From the experimentation it is found that multi-strategy ensemble learning techniques are more accurate than plain ensemble learning approaches. In the multi-strategy ensemble learning technique, base learning algorithms are repeatedly used with random training data sets [14]. Multi-strategy ensemble learning techniques reduce the test error on the data set by investigating the link between multi-strategy ensemble learning and the generation of diversity in ensemble membership.

Many researchers believe that before applying any machine learning algorithm the data should be pre-processed and noisy data should be removed; using data without any pre-processing may lead to inaccurate results. Many factors, such as redundant and irrelevant features, affect the success of machine learning on a given task. If irrelevant features are used for training machine learning models, the models may suffer from the underfitting problem [15]. Hence, before training a machine learning model, feature selection methods can be used to obtain the optimal features from the dataset. Feature selection is the process of identifying and removing irrelevant features from the training data set; it is used to reduce the dimensionality of the data set and to enable regression algorithms to operate faster.

Mahato and Attar [4] proposed a machine learning model to predict the future price of Gold and to identify the correct time of investment to gain more profit. Since there are still plenty of ignored issues that need to be handled for improving the prediction accuracy, a novel forecasting model has been introduced to enhance the forecasting accuracy [16]. Sivalingam et al. [17] used various forecasting techniques in selection methods.

III. PRELIMINARIES

A. Feature Selection

The feature selection method calculates the correlation value between different groups of features, and the features with lower values are removed. The correlation coefficient value ranges between -1 and 1. The correlation between each attribute was obtained, with color, depth, and clarity having a negative relationship with the price attribute; these three attributes were dropped from the dataset. In the proposed model, recursive feature elimination methods, together with the Chi-square test and Principal Component Analysis, are used for feature selection. The Recursive Feature Elimination (RFE) technique removes the weak features recursively and finds the best features for predicting the results: after each iteration it removes the weak features and ranks the rest according to their ability to predict. The proposed method employs linear regression, the Chi-squared test, and Principal Component Analysis (PCA) to select the five best features from the dataset. The Chi-squared test is a statistical test mainly used on categorical data for finding the dependency of two features. Principal Component Analysis is a feature selection method which extracts important features as a set of uncorrelated features from the dataset. PCA constructs new features from the original data set and looks for features that show as much variation as possible, resulting in the removal of redundant properties/features. The new features would have "maximum variance" and "minimum error".
to operate faster. Random Forest Regressor, Bagging, Adaboost, Lightgbm,
Mahato and Attar [4] proposed a machine learning model Xgboost are ensemble which has been incorporated in the
to predict the future price of Gold and to identify the correct model for training and prediction [19]. Random Forest algo-
time of investment to gain more profit. Since, there are still rithm is an ensemble model that uses multiple decision trees
plenty of ignored issues that need to be handled for improving for final prediction. In this, each tree is trained on a subset of
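To make the three selection strategies concrete, the following is a minimal sketch using scikit-learn on the Kaggle diamonds dataset. The file name, the ordinal encoding of the categorical columns, and the binarized target handed to the chi-square test are assumptions for illustration; only the choice of five features follows the description above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LinearRegression

df = pd.read_csv("diamonds.csv")  # hypothetical local copy of the Kaggle data

# Ordinal-encode the categorical columns (assumed encoding).
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns=["price"])
y = df["price"]

# 1) Chi-square test: needs non-negative inputs and a class label,
#    so the price is binarized around its median here (an assumption).
chi2_sel = SelectKBest(chi2, k=5).fit(X.abs(), (y > y.median()).astype(int))
print("Chi-square picks:", list(X.columns[chi2_sel.get_support()]))

# 2) Recursive Feature Elimination driven by a linear model, as described.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print("RFE picks:", list(X.columns[rfe.support_]))

# 3) PCA: five new uncorrelated components ordered by explained variance.
pca = PCA(n_components=5).fit(X)
print("PCA explained variance:", pca.explained_variance_ratio_.round(3))
```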
B. Ensemble Model

Random Forest Regressor, Bagging, AdaBoost, LightGBM, and XGBoost are the ensemble models which have been incorporated in the model for training and prediction [19]. The Random Forest algorithm is an ensemble model that uses multiple decision trees for the final prediction. In it, each tree is trained on an arbitrarily selected subset of the training data, obtained by randomly selecting features. Each tree predicts its own result, and the final result is then obtained by majority voting of the trees; the forest it builds is an ensemble of decision trees trained with the "bagging" method. An and Meng [20] used a number of trees for price prediction. In bagging, also known as bootstrap aggregating, different regression models are trained and the result is then calculated by simple majority voting over their outputs [21]. For the same, a decision tree was used as an input parameter, where each node represents a decision and each leaf gives an outcome.

The AdaBoost Regressor, also known as adaptive boosting, uses an instance-weighting technique to predict the result. Initially it gives equal weight to every training instance and predicts the result; in the next iteration, it gives higher weight to the observations which have been incorrectly predicted. This process is iterated until the desired accuracy or the maximum number of iterations is reached. LightGBM is a tree-based learning algorithm which grows trees vertically (leaf-wise), as compared to other models which grow them horizontally (level-wise); it mainly increases the accuracy. The Gradient Boosting Regressor ensembles weaker prediction models, and each newly generated model minimizes the loss function using gradient descent; it combines weak models into a single strong model in an iterative manner. XGBoost delivers improved prediction accuracy and is said to handle missing values efficiently. It also gives results up to 10 times faster than the other gradient boosting algorithms.
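As a rough illustration of how these ensembles differ in construction, the sketch below fits the scikit-learn counterparts of the models named above on synthetic data; it is a minimal example under assumed hyperparameters, not the exact configuration used in the paper.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the price data, just to make the sketch runnable.
X, y = make_regression(n_samples=1000, n_features=9, noise=10.0, random_state=0)

models = {
    # Many trees, each grown on a bootstrap sample with random feature subsets.
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Bagging: bootstrap-aggregated copies of a base decision tree.
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=0),
    # AdaBoost: re-weights the worst-predicted instances each round.
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
    # Gradient boosting: each new tree fits the negative gradient of the loss.
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    print(name, round(model.fit(X, y).score(X, y), 4))  # R^2 on training data
```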
IV. PROPOSED WORK

The proposed model combines the strength of ensemble models with different feature selection methods to enhance the prediction accuracy. The proposed model uses the following three phases:
1) First, the dataset is preprocessed to remove noise and blank entries.
2) Second, feature selection methods such as the Chi-square test, principal component analysis, and recursive feature elimination are used for selecting the best features from the data.
3) Third, regression models, namely linear and random forest, are used for prediction.

Fig. 1: Flow chart of the proposed model

The complete flow of the proposed model is depicted in Fig. 1. From the figure it can be perceived that the proposed model employs feature selection before training the ensemble model; hence, it reduces the possibility of underfitting and overfitting problems. The proposed model uses linear regression to check the extent to which the dependent and independent variables are related. This method provides results for binary variables, where there is one dependent variable and one independent variable; hence, working with multivariate variables may not provide the exact expected results. The Random Forest regressor solves the problem of accuracy by building decision trees according to the input given by the user and provides the most accurate results. Bagging can also be used for improving the accuracy on the dataset; however, it requires other regression models as input for prediction, and it later combines the results of all the models used to predict the result. Similarly, other ensemble models like LightGBM and AdaBoost improve the accuracy but cannot handle missing data properly. XGBoost, known as extreme gradient boosting, handles missing values efficiently and also gives results up to 10 times faster than the other gradient boosting algorithms.

V. EXPERIMENTAL RESULTS

The performance of the proposed model has been evaluated on a benchmark dataset taken from Kaggle [22]. It consists of a total of 10 features (carat, cut, color, clarity, depth, table, price, x, y, and z) with a total of 53940 rows. Initially, all the features were used for prediction. Later, three feature selection techniques were applied in order to reduce the redundancy and the size of the dataset: the Chi-square test, RFE, and PCA select 9, 5, and 6 features respectively. To analyze the performance of the proposed model, k-fold cross-validation is used, and the value of k is chosen to be 5.

The performance of the proposed model is compared in terms of mean, best, and worst accuracy, as depicted in Table I. From the table it can be seen that the random forest model outperforms the linear regression model. Further, to assess the effect of feature selection, the proposed models have been used along with different feature selection methods, namely Chi-Square, RFE, and PCA. The accuracies of the proposed ensemble model with the different feature selection methods are depicted in Table II. From the table it can be seen that, of the two regression models, linear and random forest, the latter showed greater accuracy. Random forest regression with Chi-Square feature selection showed the best accuracy with the 5 best features. There is a decrease in accuracy after the application of Recursive Feature Elimination, due to the removal of important features from the data set when providing the five best features using a linear model. Among the ensemble models, the Bagging regressor gave the highest accuracy whereas the AdaBoost Regressor gave the least accuracy; the decrease in the AdaBoost Regressor model's accuracy is due to overfitting of the training dataset.
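The mean, best, and worst figures reported in Table I come from the 5-fold setting described above; a minimal sketch of that evaluation, assuming scikit-learn and a local copy of the dataset, could look as follows.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("diamonds.csv")  # hypothetical local copy of the Kaggle data
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes  # assumed encoding
X, y = df.drop(columns=["price"]), df["price"]

for name, model in [("Linear", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(n_estimators=100))]:
    # One R^2 score per fold, with k = 5 as in the paper.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean={scores.mean():.4f} "
          f"best={scores.max():.4f} worst={scores.min():.4f}")
```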

TABLE I: Comparison of regression models in terms of mean, best, and worst accuracy

Regression model   Mean Accuracy   Best Accuracy   Worst Accuracy
Linear             0.8695          0.8745          0.8456
Random Forest      0.9730          0.9821          0.9614

Moreover, the performance of the proposed model has also been compared with state-of-the-art approaches including gradient boosting, AdaBoost, and bagging. The results of all the considered methods are presented in Table III.
TABLE II: Models with Feature Extraction

Regression Model   Feature Selection Method   Mean Accuracy   Best Accuracy   Worst Accuracy
Linear             Chi-Square Test            0.8663          0.8812          0.8318
Linear             PCA                        0.8663          0.8832          0.8414
Linear             RFE                        0.8538          0.8794          0.8403
Random Forest      Chi-Square Test            0.9754          0.9803          0.9702
Random Forest      PCA                        0.9754          0.9781          0.9604
Random Forest      RFE                        0.9306          0.9713          0.9342
If the results of Table I, II, and III are compared, it is observed that the performance of existing ensemble models can be enhanced by incorporating the feature selection and preprocessing steps on the training dataset.

TABLE III: Ensemble models

Regressor model         Accuracy
Gradient Boosting       0.9584
Ada Boost               0.8814
Bagging                 0.9658
Proposed Hybrid Model   0.9754

VI. CONCLUSION

Price prediction of precious metals is always a complex task, but machine learning algorithms have made this task easier and more efficient for data analysts and researchers. In this paper, the performance of various ensemble models has been compared. Further, various feature selection methods like PCA and RFE, which play significant roles in determining the efficiency of the algorithm, are also employed to reduce the chances of underfitting and to enhance the prediction accuracy.

REFERENCES
[1] M. M. A. Khan, "Forecasting of gold prices (Box Jenkins approach)," International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 3, pp. 662–670, 2013.
[2] I. ul Sami and K. N. Junejo, "Predicting future gold rates using machine learning approach."
[3] Y. Zhu and C. Zhang, "Gold price prediction based on PCA-GA-BP neural network," Journal of Computer and Communications, vol. 6, no. 07, p. 22, 2018.
[4] P. K. Mahato and V. Attar, "Prediction of gold and silver stock price using ensemble models," in Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on. IEEE, 2014, pp. 1–4.
[5] C.-F. Tsai, Y.-C. Lin, D. C. Yen, and Y.-M. Chen, "Predicting stock returns by classifier ensembles," Applied Soft Computing, vol. 11, no. 2, pp. 2452–2459, 2011.
[6] G. I. Webb, "Multiboosting: A technique for combining boosting and wagging," Machine Learning, vol. 40, no. 2, pp. 159–196, 2000.
[7] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 121–129.
[8] D. Banerjee, A. Ghosal, and I. Mukherjee, "Prediction of gold price movement using discretization procedure," in Computational Intelligence in Data Mining. Springer, 2019, pp. 345–356.
[9] A. C. Pandey and D. S. Rajpoot, "Feature selection method based on grey wolf optimization and simulated annealing," Recent Patents on Computer Science, vol. XX, pp. XX–XX, 2019.
[10] S. Santiso, A. Casillas, and A. Pérez, "The class imbalance problem detecting adverse drug reactions in electronic health records," Health Informatics Journal, p. 1460458218799470, 2018.
[11] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1-2, pp. 245–271, 1997.
[12] X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, and F. Sun, "Model ensemble for click prediction in Bing search ads," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 689–698.
[13] R. Hafezi and A. Akhavan, "Forecasting gold price changes: Application of an equipped artificial neural network," AUT Journal of Modeling and Simulation, vol. 50, no. 1, pp. 71–82, 2018.
[14] G. I. Webb and Z. Zheng, "Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 980–991, 2004.
[15] M. Karagiannopoulos, D. Anyfantis, S. Kotsiantis, and P. Pintelas, "Feature selection for regression problems," Proceedings of the 8th Hellenic European Research on Computer Mathematics & its Applications, Athens, Greece, 2007.
[16] Z. Ismail, A. Yahya, and A. Shabri, "Forecasting gold prices using multiple linear regression method," American Journal of Applied Sciences, vol. 6, no. 8, p. 1509, 2009.
[17] K. C. Sivalingam, S. Mahendran, and S. Natarajan, "Forecasting gold prices based on extreme learning machine," International Journal of Computers Communications & Control, vol. 11, no. 3, pp. 372–380, 2016.
[18] B. Guha and G. Bandyopadhyay, "Gold price forecasting using ARIMA model," Journal of Advanced Management Science, vol. 4, no. 2, 2016.
[19] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[20] K. An and J. Meng, "Optimal-weight selection for regressor ensemble," in Computational Intelligence and Software Engineering (CiSE), 2009 International Conference on. IEEE, 2009, pp. 1–4.
[21] P. Langley et al., "Selection of relevant features in machine learning," in Proceedings of the AAAI Fall Symposium on Relevance, vol. 184, 1994, pp. 245–271.
[22] Kaggle, "Data Set," https://www.kaggle.com/omdatas/historic-gold-prices/version/1, 2018 [Online; accessed 19-July-2018].
Comparative Analysis of Supervised Models for Diamond Price Prediction

2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence)

Garima Sharma, Manish Mahajan, Vikas Tripathi
Computer Science and Engineering, Graphic Era (Deemed to be) University, Dehra Dun, Uttarakhand

Awadhesh Kumar Srivastava
Computer Science and Engineering, KIET Group of Institutions, Delhi NCR, India
Abstract— Diamond is one of the most expensive gemstones on the planet, occurring naturally in the form of minerals made of carbon. Precious stones like diamonds are always in high demand due to their monetary rewards, and the price of such stones varies according to their features. Given this, we carried out a comparative analysis and implementation of various supervised models for predicting the price of a diamond. In our work, we evaluated eight different supervised models, namely linear regression, lasso regression, ridge regression, decision tree, random forest, ElasticNet, AdaBoost Regressor, and Gradient Boosting Regressor, and showcase the most suitable model with the most accurate results of all. This paper ranges from data preprocessing and finding the correlation between the dataset attributes to training the above-given models, testing their accuracy, and analyzing their outcomes, in turn finding that the best of them is the Random Forest Regression model.

Keywords— Supervised Machine Learning Models; Correlation Matrix; Model Selection

I. INTRODUCTION

Diamond is one of the rarest naturally occurring minerals and is composed of carbon. It is the hardest substance known today; it has the highest thermal conductivity and is also chemically resistant. Diamonds are the world's most popular gemstones, and more money is spent on them than on all other gemstones combined. The diamond gains its popularity because of its optical properties; other factors include its durability, custom, fashion, and aggressive marketing by diamond producers [1]. Diamonds have the highest non-metallic luster, known as "adamantine": a high percentage of the light that strikes the surface of a diamond is reflected due to this luster property, and this is the property that gives diamond gemstones their "sparkle". Because diamond is extremely hard (ten on the Mohs scale), it is often used as an abrasive, and most industrial diamonds are used for this purpose: saw blades, grinding wheels, and drill bits are embedded with small particles of diamond.

The principal motive of this research paper is to introduce supervised machine learning methods to predict the price of diamonds (given in US dollars), by using a diamond dataset from Kaggle and supervised machine learning methods to detect a precise outcome. Also, a comparison of the outcomes of the linear regression, decision tree, lasso regression, random forest, ridge regression, ElasticNet, AdaBoost Regressor, and Gradient Boosting Regressor models [2] is performed to detect which one among them performs best for the task. The work is arranged as follows: Section 2 showcases the review of previously done related work; Section 3 presents the experiments conducted for the analysis; Section 4 consists of the results obtained from the experiments performed in the previous section; and the conclusion of the paper is in the last section, where the accuracy of other models of the same domain is compared.

II. BACKGROUND STUDY

Many studies have been proposed to predict the diamond price using different techniques. José [2] applied data mining techniques like M5P, linear regression, and neural networks; the results show that the M5P data mining model possesses a high capacity to predict the diamond price, and using dimensionality reduction by high correlation proves that the M5P model is a useful technique to apply to a diamond dataset. A study implemented by Singfat [3] used Multiple Linear Regression (MLR) to find the relationship between the diamond price and the diamond 4C's; given the information in the dataset, an MLR data mining model is the right and usual path for inspecting the price of a diamond. Various machine learning algorithms like linear regression, logistic regression, and random forest regression have been used for predicting gold and diamond prices, but the gap lies in finding the best one using a preprocessing and correlation approach [4].

More often, one would predict the price, generally represented in US dollars, of a stone. The relationship shall not be linear, though, because heavy stones are more expensive than lighter stones. To understand this better, a graphical representation examining the Kaggle diamond dataset is achieved through scatter-plot data visualization [5]. In Fig. 1 the relation of carat against price is shown.
Fig. 1. Scatterplot of carat vs price

The clear belief is that there is a strong relationship between carat and price; nevertheless, it can be recognized that this trend seems to fade away, which means that heavy diamonds have higher price volatility.

Nonetheless, in the present study a broad range of data-analysis methods and some additional features are considered. Analyzing the dataset of 53,940 records adds more accuracy and robustness to this research.

III. PROPOSED METHODOLOGY

This methodology section is divided into four subparts: Sections A, B, C, and D. Section A describes the tools used; Section B shows how the data is acquired; Section C examines the correlation between the different variables; and Section D shows what can be evaluated with statistics [5].

We use supervised learning techniques for analyzing the dataset. Supervised learning provides a powerful tool for processing and classifying data using machine learning. In supervised learning we use labeled data, that is, a dataset that has already been classified, for the learning algorithm; for the classification of other, unlabeled data, this dataset is used as a base to predict data using machine learning algorithms [7].

Linear Regression is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables.

In Lasso regression, the prediction error for a quantitative response variable is minimized by obtaining a subset of the predictors. The regression coefficients of some variables shrink toward zero because lasso imposes a constraint on the model parameters.

Ridge regression is a variation of linear regression. It is a method to generate a parsimonious model when the number of predictor variables in a single set exceeds the number of observations [8].

Decision trees are mostly used to identify a strategy that is most likely to reach a particular goal. They are mainly used for decision analysis and are also one of the popular tools in machine learning.

A Random Forest consists of a group of decision trees. The final result of the random forest is found by aggregating the decision tree results. This model is powerful because it limits overfitting without increasing the error substantially [9].

ElasticNet uses penalties from both ridge and lasso regression for regularizing regression models. This technique combines the ridge and lasso methods, learning from their shortcomings to improve the regularization of statistical models.

The AdaBoost Regressor is a meta-estimator. It begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, but with the weights of the instances adjusted according to the error of the current prediction.

The Gradient Boosting Regressor builds an additive model in a forward stage-wise fashion and allows the optimization of arbitrary differentiable loss functions. At each stage, a regression tree is fit on the negative gradient of the given loss function [10].
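To make the distinctions within the linear family above concrete, the following is a minimal sketch assuming scikit-learn and synthetic data; the alpha and l1_ratio values are illustrative assumptions, not settings from the paper.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic regression data, just to make the comparison runnable.
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

models = {
    "Linear": LinearRegression(),
    # Lasso: the L1 penalty shrinks some coefficients exactly to zero.
    "Lasso": Lasso(alpha=1.0),
    # Ridge: the L2 penalty shrinks coefficients without zeroing them.
    "Ridge": Ridge(alpha=1.0),
    # ElasticNet: a weighted mix of the L1 and L2 penalties.
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    zeroed = int((abs(model.coef_) < 1e-8).sum())
    print(f"{name}: R^2 = {model.score(X, y):.3f}, zeroed coefficients = {zeroed}")
```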
A. Tools Used

For analyzing the dataset, Python 3 is used. It is an open-source, interpreted, high-level language and gives a great approach to object-oriented programming. For a data scientist, it is one of the best languages to use for data science projects or applications. One of the main reasons for choosing Python is that it is widely used in research and by scientific communities, by reason of its easy-to-use and easily adaptable language syntax.

B. Data Acquisition

The Kaggle repository, a data source for thousands of datasets, is used for the analysis. It is an online community for machine learning practitioners and data scientists, and a vigorous, attested, and adequate resource for investigating different data sources. Users can find and publish different datasets on Kaggle, and they can explore datasets and build models in a web-based data science environment [12]. The diamond dataset from Kaggle provides the essential features of the diamonds, which are shown in Table I and Fig. 2; the four features of a diamond are represented to make the terms easy to understand.
TABLE I. DATA FEATURES

Name               Values                                    Data Type
Price (USD)        326 … 18823                               Numerical
Carat              0.2 … 5.01                                Numerical
Cut quality        Fair, Good, Very Good, Premium, Ideal     Categorical
Color              J, I, H, G, F, E, D                       Categorical
Clarity            I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF    Categorical
X (length in mm)   0 … 10.74                                 Numerical
Y (width in mm)    0 … 58.9                                  Numerical
Z (depth in mm)    0 … 31.8                                  Numerical
Depth              depth percentage = z/mean(x,y) = 2*z/(x+y), 43 … 79   Numerical
Table              width of the top of the diamond, 43 … 95  Numerical

Fig. 2. Diamond culet, width, depth, and table features
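As a quick sanity check on these features, the dataset can be loaded and inspected with pandas; this is a minimal sketch assuming a local copy of the Kaggle file, here called diamonds.csv.

```python
import pandas as pd

# Load the Kaggle diamonds dataset (hypothetical local file name).
df = pd.read_csv("diamonds.csv")

print(df.shape)   # expected: (53940, 10)
print(df.dtypes)  # carat/depth/table/price/x/y/z numerical; cut/color/clarity text

# Ranges should match the Values column of Table I.
print(df[["price", "carat", "x", "y", "z"]].describe())

# The ordered categories from Table I can be made explicit for later encoding.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
df["cut"] = pd.Categorical(df["cut"], categories=cut_order, ordered=True)
```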

C. Correlation Matrix

The correlation matrix shows the correlation between the distinct variables; each cell shows the correlation between two variables. For representing this correlation, a heat map is used to analyze these relationships better. Fig. 3 shows the correlation between the distinct fields.

Fig. 3. High-correlation heat map

From the heat map [13] we can find that x, y, and z are correlated with price, and that the price of a diamond and the carat (weight) of a diamond are highly correlated. Hence we can readily say that carat is one of the main features for predicting the price of a diamond.
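A heat map of this kind can be reproduced in a few lines; the following is a minimal sketch assuming pandas, seaborn, and a local copy of the dataset, not the authors' exact plotting code.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diamonds.csv")  # hypothetical local copy of the Kaggle data

# Pairwise correlations over the numerical columns only.
corr = df[["carat", "depth", "table", "price", "x", "y", "z"]].corr()

# Heat map of the correlation matrix; carat, x, y, and z should
# show a strong positive correlation with price.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heat map")
plt.tight_layout()
plt.show()
```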
D. Evaluating the Statistics

To start evaluating the statistics, we split the dataset into a train set (80%) and a test set (20%). The test set allows our model to make predictions on values that it has never seen before. However, taking random samples from our dataset can introduce significant sampling bias. Therefore, to avoid sampling bias, the data is divided into different homogeneous subgroups called strata; this is called stratified sampling. Since carat is the most important parameter for predicting the price of diamonds, we use it for the stratified sampling. We have used the root mean square error (RMSE) to manage undesirably large errors in this huge dataset, as the mean absolute error will not work efficiently for such a huge, wild dataset; RMSE is most useful where large errors are particularly undesirable [14]. We first compute the RMSE and the cross-validation scores (CV scores) to check the performance, and plot a graph to show how well our algorithm has predicted the data.

Cross-validation starts by shuffling the data and splitting it into k folds. Then k models are fit, each on (k-1)/k of the data, and evaluated on the remaining 1/k of the data. The results obtained from each evaluation are averaged together to compute a final score, after which the final model is fit on the entire dataset [15]. The k folds are evaluated as shown in Fig. 4.

Fig. 4. K-fold evaluation
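A minimal sketch of this evaluation procedure, assuming scikit-learn and a local copy of the dataset; the carat bin edges used to build the strata are an assumption, since the paper does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("diamonds.csv")  # hypothetical local copy of the Kaggle data
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes  # assumed encoding
X, y = df.drop(columns=["price"]), df["price"]

# Stratified 80/20 split on binned carat (assumed bin edges).
strata = pd.cut(df["carat"], bins=[0, 0.5, 1.0, 1.5, 2.0, np.inf], labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=strata, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# RMSE on the held-out 20% test set.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("Test RMSE:", round(rmse, 2))

# 5-fold cross-validation scores, converted from negative MSE to RMSE.
cv = cross_val_score(model, X_train, y_train, cv=5,
                     scoring="neg_mean_squared_error")
print("CV RMSE per fold:", np.sqrt(-cv).round(2))
```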
IV. RESULT ANALYSIS

The accuracy, along with the RMSE and CV score, of linear regression, random forest, lasso regression, decision tree, ridge regression, ElasticNet, AdaBoost Regressor, and Gradient Boosting Regressor is shown in Fig. 5.

Fig. 5. Model accuracies

The Random Forest produced a better overall result than any other supervised learning algorithm. In order to acquire these results, we had to look in depth into the data. This led to the finding of a small number of outliers that were harming the performance of several models; by eliminating these outliers, a decline in errors can be seen.
V. COMPARATIVE ANALYSIS

The Random Forest Regression and decision tree models show remarkably better accuracy than any other model with respect to root mean squared error; nevertheless, the Random Forest Regression model had the top performance overall, with the highest accuracy and the least errors. The comparison of the seven models used is shown in Fig. 6.

Fig. 6. Accuracy comparison of models

VI. CONCLUSION AND FUTURE WORK

By experimenting and analyzing, it is feasible to conclude that supervised learning methods such as the AdaBoost Regressor, Gradient Boosting Regressor, linear regression, decision tree, ElasticNet, ridge regression, lasso regression, and the random forest model can be utilized to evaluate diamond prices. The Random Forest Regression model shows 97% accuracy; it gives such high accuracy because it possesses a high capacity to determine continuous numerical values. Future work shall include unsupervised models in order to extend the accuracy of the predictions on the diamond dataset and also their robustness.

REFERENCES
[1] C.-F. Tsai, Y.-C. Lin, D. C. Yen, and Y.-M. Chen, "Predicting stock returns by classifier ensembles," Applied Soft Computing, vol. 11, no. 2, pp. 2452–2459, 2011.
[2] José M. Peña Marmolejos, "Implementing data mining methods to predict diamond prices," Graduate School of Arts and Sciences, Fordham University, Int'l Conf. Data Science (ICDATA'18). https://csce.ucmss.com/cr/books/2018/LFS/CSREA2018/ICD8070.pdf
[3] A. C. Pandey, S. Misra, and M. Saxena, "Gold and diamond price prediction using enhanced ensemble learning," 2019 Twelfth International Conference on Contemporary Computing (IC3), Noida, India, 2019, pp. 1–4, doi: 10.1109/IC3.2019.8844910.
[4] Singfat Chu, "Pricing the C's of diamond stones," National University of Singapore, Journal of Statistics Education. https://www.tandfonline.com/doi/full/10.1080/10691898.2001.11910659
[5] Diamond: the most popular gemstone [online]. https://geology.com/minerals/diamond.shtml
[6] Waad Alsuraihi, Ekram Al-hazmi, Kholoud Bawazeer, and Hanan AlGhamdi, "Machine learning algorithms for diamond price prediction," in IVSP '20: Proceedings of the 2020 2nd International Conference on Image, Video and Signal Processing, March 2020.
[7] Alexandru Niculescu-Mizil and Rich Caruana, "Predicting good probabilities with supervised learning," in Proceedings of the 22nd International Conference on Machine Learning (ICML), August 2005.
[8] Supervised machine learning models with scikit-learn [online]. https://scikitlearn.org/stable/supervised_learning.html
[9] Linear, ridge and lasso regression with scikit-learn [online]. https://www.pluralsight.com/guides/linear-lasso-ridge-regression-scikit-learn
[10] Decision tree and random forest regression [online]. https://towardsdatascience.com/decision-trees-and-random-forests
[11] GradientBoostingRegressor with scikit-learn [online]. https://scikitlearn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
[12] Diamonds dataset, Kaggle datasets repository [online]. https://www.kaggle.com/shivam2503/diamonds
[13] Tovi Grossman and George Fitzmaurice, "Patina: Dynamic heatmaps for visualizing application usage," in Proceedings of CHI, April 2013. https://dl.acm.org/doi/abs/10.1145/2470654.2466442
[14] T. Chai, "Root mean square error (RMSE) or mean absolute error (MAE)?," NOAA Air Resources Laboratory (ARL), NOAA Center for Weather and Climate Prediction, College Park, MD, USA. https://ui.adsabs.harvard.edu/abs/2014GMDD....7.1525C/abstract
[15] J. Brownlee, "A gentle introduction to k-fold cross-validation," May 22, 2018 [online]. https://machinelearningmastery.com/k-fold-cross-validation/ (retrieved 21 October 2019).