Data Frames and Charts 2: 2.1 Dealing With Missing Values

The document discusses exploring and visualizing data using Pandas and Seaborn in Python. It loads automobile mileage data, cleans missing values, and explores the schema. It then demonstrates various plots - bar plots to compare average sale prices by age and role, histograms and density plots of sale price distributions, box plots to identify outliers, scatter plots to show relationships between variables, pair plots to visualize multivariate relationships, and heatmaps to view correlations. The goal is to gain insights from data through visualization.

Uploaded by

Pratyush Barua

Data Frames and Charts 2

2.1 Dealing With Missing Values


import pandas as pd
autos = pd.read_csv( 'auto-mpg.data', sep = r'\s+', header = None )
autos.head( 5 )

0 1 2 ... 6 7 8

0 18.000 8 307.000 ... 70 1 chevrolet chevelle malibu

1 15.000 8 350.000 ... 70 1 buick skylark 320

2 18.000 8 318.000 ... 70 1 plymouth satellite

3 16.000 8 304.000 ... 70 1 amc rebel sst

4 17.000 8 302.000 ... 70 1 ford torino

5 rows × 9 columns

autos.columns = ['mpg', 'cylinders', 'displacement',
                 'horsepower', 'weight', 'acceleration',
                 'year', 'origin', 'name']

autos.head( 5 )

mpg cylinders displacement ... year origin name

0 18.000 8 307.000 ... 70 1 chevrolet chevelle malibu

1 15.000 8 350.000 ... 70 1 buick skylark 320

2 18.000 8 318.000 ... 70 1 plymouth satellite

3 16.000 8 304.000 ... 70 1 amc rebel sst

4 17.000 8 302.000 ... 70 1 ford torino

5 rows × 9 columns

Now, we will look at the schema of the DataFrame.


autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 398 non-null object
weight 398 non-null float64
acceleration 398 non-null float64
year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB

autos["horsepower"] = pd.to_numeric( autos["horsepower"], errors = 'coerce' )


autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg 398 non-null float64
cylinders 398 non-null int64
displacement 398 non-null float64
horsepower 392 non-null float64
weight 398 non-null float64
acceleration 398 non-null float64
year 398 non-null int64
origin 398 non-null int64
name 398 non-null object
dtypes: float64(5), int64(3), object(1)
memory usage: 28.1+ KB
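
The change in the horsepower column above (from object to float64, with 392 non-null values instead of 398) is the effect of errors = 'coerce': any entry that cannot be parsed as a number becomes NaN. A minimal sketch on made-up values, assuming the unparseable entries are placeholder strings such as '?' (the source does not show the raw entries, so the marker is an assumption):

```python
import pandas as pd

# Hypothetical sample: numbers read in as strings, with '?' standing in for a missing value
raw = pd.Series(['130.0', '165.0', '?', '150.0'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)           # dtype after conversion
print(converted.isnull().sum())  # number of entries that became NaN
```

With the default errors = 'raise', the same call would stop with an error at the first unparseable entry, which is why the coerce option is convenient when the bad values are to be treated as missing.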

autos[autos.horsepower.isnull()]

mpg cylinders displacement ... year origin name

32 25.000 4 98.000 ... 71 1 ford pinto

126 21.000 6 200.000 ... 74 1 ford maverick

330 40.900 4 85.000 ... 80 2 renault lecar deluxe

336 23.600 4 140.000 ... 80 1 ford mustang cobra

354 34.500 4 100.000 ... 81 2 renault 18i

374 23.000 4 151.000 ... 82 1 amc concord dl

6 rows × 9 columns

autos = autos.dropna(subset = ['horsepower'])


autos[autos.horsepower.isnull()]

mpg cylinders displacement ... year origin name

0 rows × 9 columns
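
The subset argument limits the check to the listed columns, so rows are dropped only when horsepower is missing. A small sketch on a hypothetical frame (values invented for illustration) shows that missing values in other columns are left in place:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two different columns
df = pd.DataFrame({'horsepower': [130.0, np.nan, 150.0],
                   'weight': [3504.0, 2046.0, np.nan]})

# Only the row missing 'horsepower' is dropped; the NaN in 'weight' survives
cleaned = df.dropna(subset=['horsepower'])

print(cleaned)
```

Note that the original row labels are kept, so the index of the result has gaps; reset_index(drop = True) renumbers it if needed.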

2.2 Exploration using Visualization Plots

2.2.1 Drawing Plots

import matplotlib.pyplot as plt


import seaborn as sn
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

2.2.2 Bar Plot

import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )

# Average sold price for each age group
soldprice_by_age = ipl_auction_df.groupby('AGE')['SOLD PRICE'].mean().reset_index()
sn.barplot(x = 'AGE', y = 'SOLD PRICE', data = soldprice_by_age);

ipl_auction_df.groupby('AGE')['SOLD PRICE'].mean()

# Average sold price for each combination of age group and playing role
soldprice_by_age_role = ipl_auction_df.groupby(['AGE', 'PLAYING ROLE'])['SOLD PRICE'].mean().reset_index()

# Both frames have a 'SOLD PRICE' column, so merge() suffixes them with _x and _y
soldprice_comparison = soldprice_by_age_role.merge(soldprice_by_age, on = 'AGE', how = 'outer')
soldprice_comparison.rename( columns = { 'SOLD PRICE_x': 'SOLD_PRICE_AGE_ROLE',
                                         'SOLD PRICE_y': 'SOLD_PRICE_AGE' }, inplace = True )
sn.barplot(x = 'AGE', y = 'SOLD_PRICE_AGE_ROLE', hue = 'PLAYING ROLE', data = soldprice_comparison);
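
The rename step above relies on how merge handles clashing column names: when both frames carry a non-key column with the same name, pandas appends the default suffixes _x (left frame) and _y (right frame). A minimal sketch with hypothetical toy frames that reuse the column names from the auction data:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the auction data set
by_age_role = pd.DataFrame({'AGE': [1, 1, 2],
                            'PLAYING ROLE': ['Batsman', 'Bowler', 'Batsman'],
                            'SOLD PRICE': [100.0, 200.0, 300.0]})
by_age = pd.DataFrame({'AGE': [1, 2],
                       'SOLD PRICE': [150.0, 300.0]})

# 'SOLD PRICE' exists on both sides, so the result holds 'SOLD PRICE_x' and 'SOLD PRICE_y'
merged = by_age_role.merge(by_age, on='AGE', how='outer')

print(list(merged.columns))
```

Passing suffixes = ('_AGE_ROLE', '_AGE') to merge would be an alternative way to get descriptive names without a separate rename.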

2.2.3 Histogram

plt.hist( ipl_auction_df['SOLD PRICE'] );


plt.hist( ipl_auction_df['SOLD PRICE'], bins = 20 );

2.2.4 Distribution or Density plot

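# Note: distplot is deprecated since seaborn 0.11; histplot(..., kde = True) is the closest replacement.
sn.distplot( ipl_auction_df['SOLD PRICE']);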

2.2.5 Box Plot


box = sn.boxplot(x = ipl_auction_df['SOLD PRICE']);

box = plt.boxplot(ipl_auction_df['SOLD PRICE']);

[item.get_ydata()[0] for item in box['caps']]


[20000.0, 1350000.0]

[item.get_ydata()[0] for item in box['whiskers']]


[225000.0, 700000.0]

[item.get_ydata()[0] for item in box['medians']]


[437500.0]

Which players are the outliers?

ipl_auction_df[ipl_auction_df['SOLD PRICE'] > 1350000.0][['PLAYER NAME',
                                                          'PLAYING ROLE',
                                                          'SOLD PRICE']]

PLAYER NAME PLAYING ROLE SOLD PRICE

15 Dhoni, MS W. Keeper 1500000

23 Flintoff, A Allrounder 1550000

50 Kohli, V Batsman 1800000

83 Pietersen, KP Batsman 1550000

93 Sehwag, V Batsman 1800000

111 Tendulkar, SR Batsman 1800000

113 Tiwary, SS Batsman 1600000

127 Yuvraj Singh Batsman 1800000
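
The cap values used for the outlier filter follow from matplotlib's default whisker rule (whis = 1.5): each whisker extends to the most extreme observation within 1.5 times the interquartile range (IQR) of the nearest quartile. With the quartiles reported above (225000 and 700000), the IQR is 475000 and the upper limit is 700000 + 1.5 × 475000 = 1412500, which is consistent with an upper cap of 1350000 and outliers starting at 1500000. The sketch below reproduces the rule with NumPy on hypothetical prices (in thousands), not on the auction data:

```python
import numpy as np

# Hypothetical prices in thousands, chosen only to illustrate the rule
prices = np.array([20, 200, 250, 400, 450, 500, 700, 750, 1350, 1800], dtype=float)

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Caps sit at the most extreme observations that still lie inside the limits
lower_cap = prices[prices >= lower_limit].min()
upper_cap = prices[prices <= upper_limit].max()

# Everything beyond the limits is drawn as an individual outlier point
outliers = prices[(prices < lower_limit) | (prices > upper_limit)]

print(lower_cap, upper_cap, outliers)
```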

2.2.6 Comparing Distributions

Using distribution plots

sn.distplot( ipl_auction_df[ipl_auction_df['CAPTAINCY EXP'] == 1]['SOLD PRICE'],
             color = 'y',
             label = 'Captaincy Experience')
sn.distplot( ipl_auction_df[ipl_auction_df['CAPTAINCY EXP'] == 0]['SOLD PRICE'],
             color = 'r',
             label = 'No Captaincy Experience');
plt.legend();

Using box plots


sn.boxplot(x = 'PLAYING ROLE', y = 'SOLD PRICE', data = ipl_auction_df);

2.2.7 Scatter Plot

ipl_batsman_df = ipl_auction_df[ipl_auction_df['PLAYING ROLE'] == 'Batsman']

plt.scatter(x = ipl_batsman_df.SIXERS,
y = ipl_batsman_df['SOLD PRICE']);
plt.xlabel('SIXERS')
plt.ylabel('SOLD PRICE');
sn.regplot( x = 'SIXERS',
y = 'SOLD PRICE',
data = ipl_batsman_df );

2.2.8 Pair Plot

influential_features = ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE']


# The 'size' argument was renamed to 'height' in seaborn 0.9
sn.pairplot(ipl_auction_df[influential_features], height = 2)
<seaborn.axisgrid.PairGrid at 0x1a1b188860>
2.2.9 Correlations and Heatmaps

ipl_auction_df[influential_features].corr()

SR-B AVE SIXERS SOLD PRICE

SR-B 1.000 0.584 0.425 0.184

AVE 0.584 1.000 0.705 0.397

SIXERS 0.425 0.705 1.000 0.451

SOLD PRICE 0.184 0.397 0.451 1.000

sn.heatmap(ipl_auction_df[influential_features].corr(), annot=True);
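
Each cell in the correlation table and heatmap above is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a rough check of what the number means, the sketch below computes it by hand for two columns and compares it with the pandas result; the values are hypothetical and only the column names are taken from the auction data:

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names mirror the features used above
df = pd.DataFrame({'SIXERS': [2.0, 10.0, 25.0, 40.0, 60.0],
                   'SOLD PRICE': [50.0, 200.0, 150.0, 450.0, 800.0]})

x, y = df['SIXERS'], df['SOLD PRICE']
dx, dy = x - x.mean(), y - y.mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity taken from the correlation matrix
r_pandas = df.corr().loc['SIXERS', 'SOLD PRICE']

print(r_manual, r_pandas)
```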
