Data Ingestion
import pandas as pd
import seaborn as sns
Features:
# Forward slashes work on every platform; the raw string with doubled backslashes did not
df = pd.read_csv("../notebooks/data/gemstone.csv")
df.head()
df.shape
(193573, 11)
Inference:
• Rows - 193,573
• Columns - 11 (including id)
### Missing values
df.isnull().sum()
id 0
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
x 0
y 0
z 0
price 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193573 entries, 0 to 193572
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 193573 non-null int64
1 carat 193573 non-null float64
2 cut 193573 non-null object
3 color 193573 non-null object
4 clarity 193573 non-null object
5 depth 193573 non-null float64
6 table 193573 non-null float64
7 x 193573 non-null float64
8 y 193573 non-null float64
9 z 193573 non-null float64
10 price 193573 non-null int64
dtypes: float64(6), int64(2), object(3)
memory usage: 16.2+ MB
### Dropping unwanted feature
# columns= already selects the axis, so axis=1 is redundant
df.drop(columns=['id'], inplace=True)
df.head()
df.duplicated().sum()
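The cells that follow rely on `cat_features` and `num_features`, whose definitions do not appear in this export. A minimal sketch of one common way to build them, assuming that every non-numeric column is categorical (which matches the `df.info()` output above). The small `demo` frame is a hypothetical stand-in for `df`, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for df, mirroring a few of its columns and dtypes.
demo = pd.DataFrame({
    "carat": [1.52, 2.03, 0.70],
    "cut": ["Premium", "Very Good", "Ideal"],
    "color": ["F", "J", "G"],
    "price": [13619, 13387, 2772],
})

# Assumption: numeric dtypes are numerical features, everything else is categorical.
num_features = demo.select_dtypes(include="number").columns.tolist()
cat_features = [col for col in demo.columns if col not in num_features]

print(cat_features)  # ['cut', 'color']
print(num_features)  # ['carat', 'price']
```

Selecting on `include="number"` and treating the remainder as categorical avoids depending on how a particular pandas version stores strings.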
df[cat_features]
df[num_features]
df[cat_features].describe()
df[num_features].describe().T
max
carat 3.50
depth 71.60
table 79.00
x 9.65
y 10.01
z 31.30
price 18818.00
df['color'].value_counts()
color
G 44391
E 35869
F 34258
H 30799
D 24286
I 17514
J 6456
Name: count, dtype: int64
df['clarity'].value_counts()
clarity
SI1 53272
VS2 48027
VS1 30669
SI2 30484
VVS2 15762
VVS1 10628
IF 4219
I1 512
Name: count, dtype: int64
sns.heatmap(data=df[num_features].corr(), annot=True)
<Axes: >
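A heatmap is easy to scan but hard to quote from. As a complement, the sketch below ranks features by the absolute value of their correlation with `price`. The `demo` frame is a hypothetical five-row sample used only to keep the example self-contained; its coefficients say nothing about the full dataset:

```python
import pandas as pd

# Hypothetical five-row sample; replace with df[num_features] on the real data.
demo = pd.DataFrame({
    "carat": [1.52, 2.03, 0.70, 0.32, 1.70],
    "depth": [62.2, 62.0, 61.2, 61.6, 62.6],
    "price": [13619, 13387, 2772, 666, 14453],
})

corr = demo.corr()

# Correlation of every other column with price, strongest (by magnitude) first.
price_corr = corr["price"].drop("price").sort_values(key=abs, ascending=False)
print(price_corr)
```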
Feature Engineering
# encoding
Feature: cut, has unique values: ['Premium' 'Very Good' 'Ideal' 'Good' 'Fair']
Feature: color, has unique values: ['F' 'J' 'G' 'E' 'D' 'H' 'I']
Feature: clarity, has unique values: ['VS2' 'SI2' 'VS1' 'SI1' 'IF' 'VVS2' 'VVS1' 'I1']
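The cell that printed the lines above is not part of this export. A loop along the following lines would produce output of that shape. This is a sketch: the `demo` frame and the explicit `cat_features` list are hypothetical stand-ins for the notebook's own objects:

```python
import pandas as pd

# Hypothetical stand-in for the categorical columns of df.
demo = pd.DataFrame({
    "cut": ["Premium", "Very Good", "Ideal", "Premium"],
    "color": ["F", "J", "G", "G"],
    "clarity": ["VS2", "SI2", "VS1", "VS2"],
})
cat_features = ["cut", "color", "clarity"]

# unique() keeps values in order of first appearance.
unique_values = {feature: demo[feature].unique().tolist() for feature in cat_features}

for feature, values in unique_values.items():
    print(f"Feature: {feature}, has unique values: {values}")
```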
# Ordinal encoding
cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_map = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_map = {"D": 1, "E": 2, "F": 3, "G": 4, "H": 5, "I": 6, "J": 7}
df['cut'] = df['cut'].map(cut_map)
df['clarity'] = df['clarity'].map(clarity_map)
df['color'] = df['color'].map(color_map)
df.head()
   carat  cut  color  clarity  depth  table     x     y     z  price
0   1.52    4      3        4   62.2   58.0  7.27  7.33  4.55  13619
1   2.03    3      7        2   62.0   58.0  8.06  8.12  5.05  13387
2   0.70    5      4        5   61.2   57.0  5.69  5.73  3.50   2772
3   0.32    5      4        5   61.6   56.0  4.38  4.41  2.71    666
4   1.70    4      4        4   62.6   59.0  7.65  7.61  4.77  14453
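`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a category missing from a map would silently become a missing value. A minimal sketch of a check for that case, reusing `cut_map` from above. The `demo` frame and its `"Excellent"` label are hypothetical, included only to trigger the check:

```python
import pandas as pd

cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# Hypothetical sample; "Excellent" is deliberately absent from cut_map.
demo = pd.DataFrame({"cut": ["Premium", "Ideal", "Excellent"]})

encoded = demo["cut"].map(cut_map)

# Original labels whose encoded value came back as NaN.
unmapped = demo.loc[encoded.isna(), "cut"].unique().tolist()
print(unmapped)  # ['Excellent']
```

Running the same check on each encoded column and confirming that `unmapped` is empty is a cheap way to catch typos in the mapping dictionaries.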