Step 16 Chapter4
Categorical data
EXPLORATORY DATA ANALYSIS IN PYTHON
George Boorman
Curriculum Manager, DataCamp
Why perform EDA?
Detecting patterns and relationships
For example: Married, Divorced
planes["Destination"].value_counts()
Cochin       4391
Banglore     2773
Delhi        1219
New Delhi     888
Hyderabad     673
Kolkata       369
Name: Destination, dtype: int64
planes["Destination"].value_counts(normalize=True)
Cochin       0.425773
Banglore     0.268884
Delhi        0.118200
New Delhi    0.086105
Hyderabad    0.065257
Kolkata      0.035780
Name: Destination, dtype: float64
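As a minimal sketch of the two calls above, the following builds a small stand-in for `planes` (the four rows are hypothetical toy data, not the course dataset) and compares absolute class frequencies with relative ones.

```python
import pandas as pd

# Hypothetical toy data standing in for the planes DataFrame
planes = pd.DataFrame(
    {"Destination": ["Cochin", "Cochin", "Delhi", "Kolkata"]}
)

# Absolute class frequency: number of rows per destination
counts = planes["Destination"].value_counts()

# Relative class frequency: share of rows per destination (sums to 1)
shares = planes["Destination"].value_counts(normalize=True)

print(counts)
print(shares)
```

Relative frequencies are usually the more convenient form when checking whether a sample reflects the expected mix of classes.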
pd.crosstab(planes["Source"], planes["Destination"])
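A short sketch of what the cross-tabulation returns, using hypothetical toy rows rather than the course data. The second call, with `values` and `aggfunc`, is an assumption about how one might extend the slide's example to aggregate a numeric column per combination.

```python
import pandas as pd

# Hypothetical toy data; column names follow the slides
planes = pd.DataFrame(
    {
        "Source": ["Delhi", "Delhi", "Kolkata", "Banglore"],
        "Destination": ["Cochin", "Cochin", "Banglore", "Delhi"],
        "Price": [10000.0, 12000.0, 8000.0, 5000.0],
    }
)

# Count how many rows fall into each Source/Destination combination
route_counts = pd.crosstab(planes["Source"], planes["Destination"])

# Aggregate a numeric column per combination instead of counting rows
median_price = pd.crosstab(
    planes["Source"],
    planes["Destination"],
    values=planes["Price"],
    aggfunc="median",
)

print(route_counts)
print(median_price)
```

Combinations that never occur show up as 0 in the count table and as NaN in the aggregated one.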
Correlation
sns.heatmap(planes.corr(numeric_only=True), annot=True)
plt.show()
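The heatmap is drawn from a correlation matrix, so it can help to inspect that matrix directly. This is a sketch on hypothetical toy values; it assumes pandas 1.5 or later, where `numeric_only=True` tells `corr()` to skip text columns such as `Airline`.

```python
import pandas as pd

# Hypothetical toy data mixing one text column with two numeric columns
planes = pd.DataFrame(
    {
        "Airline": ["Jet Airways", "IndiGo", "Air India", "IndiGo"],
        "Duration": [19.0, 5.4, 4.8, 2.5],
        "Price": [13882.0, 6218.0, 13302.0, 3873.0],
    }
)

# Pairwise Pearson correlation between numeric columns only
corr_matrix = planes.corr(numeric_only=True)

print(corr_matrix)
```

The resulting square DataFrame is what `sns.heatmap` receives in the call above.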
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                object
Additional_Info            object
Price                     float64
dtype: object
1 stop      4107
non-stop    2584
2 stops     1127
3 stops       29
4 stops        1
Name: Total_Stops, dtype: int64
Airline                    object
Date_of_Journey    datetime64[ns]
Source                     object
Destination                object
Route                      object
Dep_Time           datetime64[ns]
Arrival_Time       datetime64[ns]
Duration                  float64
Total_Stops                 int64
Additional_Info            object
Price                     float64
dtype: object
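The listings above show `Total_Stops` changing from `object` to `int64`, but the conversion step itself does not appear here. One possible way to do it, sketched on hypothetical toy rows, is an explicit mapping from each label to its number of stops; the `stops_map` dictionary is an assumption for illustration and may differ from the approach used in the course.

```python
import pandas as pd

# Hypothetical toy data using the labels seen in the value counts above
planes = pd.DataFrame(
    {"Total_Stops": ["1 stop", "non-stop", "2 stops", "4 stops"]}
)

# Assumed mapping from text label to number of stops
stops_map = {
    "non-stop": 0,
    "1 stop": 1,
    "2 stops": 2,
    "3 stops": 3,
    "4 stops": 4,
}

# Replace each label with its number, then store the column as integers
planes["Total_Stops"] = planes["Total_Stops"].map(stops_map).astype(int)

print(planes.dtypes)
```

An explicit mapping fails loudly on an unexpected label, because the resulting NaN cannot be cast to an integer.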
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
     Price   Price_Category
0  13882.0      First Class
1   6218.0  Premium Economy
2  13302.0      First Class
3   3873.0          Economy
4  11087.0   Business Class
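The definitions of `labels` and `bins` are not shown above. The sketch below fills them in so the call can be run end to end: the label names and the five prices come from the output table, while the bin edges are hypothetical values chosen only so that those five prices land in the categories shown.

```python
import numpy as np
import pandas as pd

# The five prices from the output table above
planes = pd.DataFrame({"Price": [13882.0, 6218.0, 13302.0, 3873.0, 11087.0]})

# Category names as they appear in the output
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]

# Hypothetical bin edges (assumption): n labels need n + 1 edges
bins = [0, 5000, 9000, 12000, np.inf]

# Assign each price to the interval it falls in
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)

print(planes)
```

In practice the edges are often derived from the data, for example from quantiles of `Price`, rather than hard-coded.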
What do we know?
Inspection and validation
books["year"] = books["year"].astype(int)
books.dtypes
name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
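A minimal sketch of the type conversion above, assuming `year` was loaded as a float. The rows are hypothetical placeholders, not the course's `books` data.

```python
import pandas as pd

# Hypothetical toy data where year was read in as float
books = pd.DataFrame(
    {
        "name": ["Book A", "Book B", "Book C"],
        "rating": [4.7, 4.6, 4.8],
        "year": [2016.0, 2011.0, 2018.0],
    }
)

print(books.dtypes)  # inspect the types before converting

# Convert year to an integer type, then validate the result
books["year"] = books["year"].astype(int)

print(books.dtypes)
```

Note that `astype(int)` raises an error if the column still contains missing values, so those need handling first.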
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64
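Assuming the listing above reports the number of missing values per column, a per-column count of that shape can be produced as sketched below. The toy `salaries` rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
salaries = pd.DataFrame(
    {
        "Experience": ["Entry", "Senior", None, "Senior"],
        "Salary_USD": [50000.0, np.nan, 72000.0, np.nan],
    }
)

# isna() marks each missing cell as True; sum() counts them per column
missing_counts = salaries.isna().sum()

print(missing_counts)
```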
Impute by sub-group
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
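The two lines above can be traced on a small example. This sketch uses hypothetical toy values: it computes the median salary per experience level, then fills each missing salary with the median of that row's own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing salary in each experience group
salaries = pd.DataFrame(
    {
        "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
        "Salary_USD": [50000.0, 60000.0, np.nan, 110000.0, np.nan, 130000.0],
    }
)

# Median salary per experience level (NaN values are skipped)
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()

# Look up each row's group median and use it only where the salary is missing
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(
    salaries["Experience"].map(salaries_dict)
)

print(salaries)
```

Filling by sub-group keeps imputed values closer to plausible salaries than a single overall median would when groups differ widely.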
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
Sampling in Python