Data Ingestion: Import As Import As Import As

Uploaded by

drsaheb422

Data Ingestion

Features:

• id - an identifier with no predictive value, so this feature can be dropped.
• carat - the physical weight of the diamond, measured in carats.
• cut - the diamond's cut grade. The standard scale runs from Excellent (the best grade) through Very Good, Good and Fair to Poor (the worst grade); this dataset instead uses the labels Fair, Good, Very Good, Premium and Ideal.
• color - how colorless the diamond is. The diamond color scale ranges from D (entirely colorless) to Z (a yellowish tint):
– Colorless: D, E, F.
– Near Colorless: G, H, I, J.
– Faint Yellow: K, L, M.
– Very Light Yellow: N, O, P, Q, R.
– Light Yellow: S, T, U, V, W, X, Y, Z.
• clarity - a qualitative grade of the diamond's visual appearance, based on its inclusions and blemishes:
– Flawless (FL): no inclusions and no blemishes visible under 10x magnification.
– Internally Flawless (IF): no inclusions visible under 10x magnification.
– Very, Very Slightly Included (VVS1 and VVS2): inclusions so slight that they are difficult for a skilled grader to see under 10x magnification.
– Very Slightly Included (VS1 and VS2): inclusions are observed with effort under 10x magnification, but can be characterized as minor.
– Slightly Included (SI1 and SI2): inclusions are noticeable under 10x magnification.
– Included (I1, I2, and I3): inclusions are obvious under 10x magnification and may affect transparency and brilliance.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Forward slashes work on every OS; the original raw string doubled each backslash
df = pd.read_csv("../notebooks/data/gemstone.csv")

df.head()

   id  carat        cut color clarity  depth  table     x     y     z  price
0   0   1.52    Premium     F     VS2   62.2   58.0  7.27  7.33  4.55  13619
1   1   2.03  Very Good     J     SI2   62.0   58.0  8.06  8.12  5.05  13387
2   2   0.70      Ideal     G     VS1   61.2   57.0  5.69  5.73  3.50   2772
3   3   0.32      Ideal     G     VS1   61.6   56.0  4.38  4.41  2.71    666
4   4   1.70    Premium     G     VS2   62.6   59.0  7.65  7.61  4.77  14453

### Shape of dataset
df.shape

(193573, 10)

Inference:

• Rows - 193,573
• Columns - 10 (this count excludes id; df.info() below reports 11 columns with id included)
### Missing values

df.isnull().sum()

id 0
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
x 0
y 0
z 0
price 0
dtype: int64

Inference: There are no missing values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193573 entries, 0 to 193572
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 193573 non-null int64
1 carat 193573 non-null float64
2 cut 193573 non-null object
3 color 193573 non-null object
4 clarity 193573 non-null object
5 depth 193573 non-null float64
6 table 193573 non-null float64
7 x 193573 non-null float64
8 y 193573 non-null float64
9 z 193573 non-null float64
10 price 193573 non-null int64
dtypes: float64(6), int64(2), object(3)
memory usage: 16.2+ MB
### Dropping unwanted feature
# columns= already selects the column axis, so axis=1 is redundant
df.drop(columns=['id'], inplace=True)
df.head()

   carat        cut color clarity  depth  table     x     y     z  price
0   1.52    Premium     F     VS2   62.2   58.0  7.27  7.33  4.55  13619
1   2.03  Very Good     J     SI2   62.0   58.0  8.06  8.12  5.05  13387
2   0.70      Ideal     G     VS1   61.2   57.0  5.69  5.73  3.50   2772
3   0.32      Ideal     G     VS1   61.6   56.0  4.38  4.41  2.71    666
4   1.70    Premium     G     VS2   62.6   59.0  7.65  7.61  4.77  14453

### Check for Duplicate values

df.duplicated().sum()

Inference: There are no duplicate rows
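
To make clear what the duplicate check counts, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and not taken from gemstone.csv). `duplicated()` flags every row that repeats an earlier row, and `drop_duplicates()` is one way to remove such rows if any turn up.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({
    "carat": [0.70, 0.32, 0.70],
    "cut": ["Ideal", "Premium", "Ideal"],
    "price": [2772, 666, 2772],
})

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)       # 1
print(len(deduplicated))  # 2
```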

### Segregating Numerical & Categorical features

cat_features = df.columns[df.dtypes == 'object']

num_features = df.columns[df.dtypes != 'object']

df[cat_features]

cut color clarity

0 Premium F VS2
1 Very Good J SI2
2 Ideal G VS1
3 Ideal G VS1
4 Premium G VS2
... ... ... ...
193568 Ideal D VVS2
193569 Premium G VVS2
193570 Very Good F SI1
193571 Very Good D SI1
193572 Good E SI2

[193573 rows x 3 columns]

df[num_features]

carat depth table x y z price

0 1.52 62.2 58.0 7.27 7.33 4.55 13619
1 2.03 62.0 58.0 8.06 8.12 5.05 13387
2 0.70 61.2 57.0 5.69 5.73 3.50 2772
3 0.32 61.6 56.0 4.38 4.41 2.71 666
4 1.70 62.6 59.0 7.65 7.61 4.77 14453
... ... ... ... ... ... ... ...
193568 0.31 61.1 56.0 4.35 4.39 2.67 1130
193569 0.70 60.3 58.0 5.75 5.77 3.47 2874
193570 0.73 63.1 57.0 5.72 5.75 3.62 3036
193571 0.34 62.9 55.0 4.45 4.49 2.81 681
193572 0.71 60.8 64.0 5.73 5.71 3.48 2258

[193573 rows x 7 columns]
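
The same split can also be written with `select_dtypes`, which avoids comparing dtypes by hand. This is only a sketch on a hypothetical two-row frame (`toy`, `cat_cols` and `num_cols` are illustrative names, not part of the notebook); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few of the dataset's column types
toy = pd.DataFrame({
    "carat": [1.52, 2.03],
    "cut": ["Premium", "Very Good"],
    "color": ["F", "J"],
    "price": [13619, 13387],
})

# Numeric columns (ints and floats) versus everything else
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)  # ['carat', 'price']
print(cat_cols)  # ['cut', 'color']
```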

df[cat_features].describe()

cut color clarity

count 193573 193573 193573
unique 5 7 8
top Ideal G SI1
freq 92454 44391 53272

df[num_features].describe().T

          count         mean          std    min     25%      50%      75%       max
carat  193573.0     0.790688     0.462688    0.2    0.40     0.70     1.03      3.50
depth  193573.0    61.820574     1.081704   52.1   61.30    61.90    62.40     71.60
table  193573.0    57.227675     1.918844   49.0   56.00    57.00    58.00     79.00
x      193573.0     5.715312     1.109422    0.0    4.70     5.70     6.51      9.65
y      193573.0     5.720094     1.102333    0.0    4.71     5.72     6.51     10.01
z      193573.0     3.534246     0.688922    0.0    2.90     3.53     4.03     31.30
price  193573.0  3969.155414  4034.374138  326.0  951.00  2401.00  5408.00  18818.00
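
The summary reports a minimum of 0.0 for x, y and z. Assuming these columns hold the stone's physical dimensions (the notebook does not say so explicitly), a zero is not a plausible measurement. The sketch below, on hypothetical toy values, shows one way to flag such rows; whether to drop or impute them is a judgment call the notebook does not make.

```python
import pandas as pd

# Hypothetical toy measurements; two rows contain an implausible zero
toy = pd.DataFrame({
    "x": [7.27, 0.0, 5.69],
    "y": [7.33, 8.12, 5.73],
    "z": [4.55, 5.05, 0.0],
})

dims = ["x", "y", "z"]

# True for any row where at least one dimension is exactly zero
zero_mask = (toy[dims] == 0).any(axis=1)
n_suspect = int(zero_mask.sum())

# One possible treatment: keep only rows with all dimensions non-zero
cleaned = toy.loc[~zero_mask]

print(n_suspect)     # 2
print(len(cleaned))  # 1
```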

df[cat_features]['color'].value_counts()

color
G 44391
E 35869
F 34258
H 30799
D 24286
I 17514
J 6456
Name: count, dtype: int64
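
Raw counts are easier to compare as shares of the whole. The sketch below re-enters the color counts printed above by hand and divides by their total; on the real DataFrame, `df['color'].value_counts(normalize=True)` should give the same proportions directly. The variable names are illustrative.

```python
import pandas as pd

# Color counts copied from the value_counts() output above
counts = pd.Series(
    {"G": 44391, "E": 35869, "F": 34258, "H": 30799,
     "D": 24286, "I": 17514, "J": 6456},
    name="count",
)

total = int(counts.sum())          # should equal the number of rows
shares = (counts / total).round(4)  # fraction of stones per color grade

print(total)
print(shares)
```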

df[cat_features]['clarity'].value_counts()

clarity
SI1 53272
VS2 48027
VS1 30669
SI2 30484
VVS2 15762
VVS1 10628
IF 4219
I1 512
Name: count, dtype: int64

### Numerical features

for col in num_features:
    # Open a new figure per feature so every histogram gets the same size
    plt.figure(figsize=(4, 3))
    sns.histplot(data=df, x=col, kde=True)
    print("\n")
    plt.show()
### Categorical features
for col in cat_features:
    # Open a new figure per feature so every count plot gets the same size
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df, x=col)
    print("\n")
    plt.show()
### Correlation

sns.heatmap(data=df[num_features].corr(), annot=True)

<Axes: >
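
A heatmap shows every pairwise correlation at once; when the target is price, it can help to rank the features by their correlation with it. The following is a minimal sketch on hypothetical toy values (not the real dataset), so the numbers it prints say nothing about gemstone.csv.

```python
import pandas as pd

# Hypothetical toy values: price rises with carat, depth varies little
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.0, 1.5, 2.0],
    "depth": [61.0, 62.5, 60.8, 62.0, 61.5],
    "price": [600, 2800, 5000, 11000, 15000],
})

# Pearson correlation of every feature with price, strongest first
price_corr = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)

print(price_corr)
```

On the real data the same expression would be applied to `df[num_features]`.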
Feature Engineering
# encoding

for col in cat_features:
    print("Feature: {}, has unique values: {}".format(col, df[col].unique()))

Feature: cut, has unique values: ['Premium' 'Very Good' 'Ideal' 'Good'
'Fair']
Feature: color, has unique values: ['F' 'J' 'G' 'E' 'D' 'H' 'I']
Feature: clarity, has unique values: ['VS2' 'SI2' 'VS1' 'SI1' 'IF'
'VVS2' 'VVS1' 'I1']

# Ordinal encoding
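# cut and clarity are coded worst-to-best; color is coded from D (most colorless) to J
cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_map = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_map = {"D": 1, "E": 2, "F": 3, "G": 4, "H": 5, "I": 6, "J": 7}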

df['cut'] = df['cut'].map(cut_map)
df['clarity'] = df['clarity'].map(clarity_map)
df['color'] = df['color'].map(color_map)

df.head()
carat cut color clarity depth table x y z price
0 1.52 4 3 4 62.2 58.0 7.27 7.33 4.55 13619
1 2.03 3 7 2 62.0 58.0 8.06 8.12 5.05 13387
2 0.70 5 4 5 61.2 57.0 5.69 5.73 3.50 2772
3 0.32 5 4 5 61.6 56.0 4.38 4.41 2.71 666
4 1.70 4 4 4 62.6 59.0 7.65 7.61 4.77 14453
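
One thing to watch with `map`: any label missing from the dictionary silently becomes NaN. The sketch below reuses `cut_map` from above on a hypothetical toy column that deliberately contains an unmapped label ("Excellent"), and shows one way to list such labels before training a model.

```python
import pandas as pd

cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# Hypothetical toy column; "Excellent" is deliberately absent from cut_map
cut = pd.Series(["Premium", "Ideal", "Excellent"], name="cut")

# Labels without an entry in the dictionary are encoded as NaN
encoded = cut.map(cut_map)

# Collect the original labels that failed to map
unmapped = cut[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)  # ['Excellent']
```

An empty `unmapped` list on the real columns would confirm that the three dictionaries cover every category.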
