Exploratory Data Analysis Basics
[3]: Row ID Order ID Order Date Order Priority Order Quantity Sales \
0 1 3 2010-10-13 Low 6 261.5400
1 2 6 2012-02-20 Not Specified 2 6.9300
2 3 32 2011-07-15 High 26 2808.0800
3 4 32 2011-07-15 High 24 1761.4000
4 5 32 2011-07-15 High 23 160.2335
Product Sub-Category \
0 Storage & Organization
1 Scissors, Rulers and Trimmers
2 Office Furnishings
3 Tables
4 Telephones and Communication
[5 rows x 21 columns]
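The cell that drops the Row ID column (referenced in the comment of the next cell) is not included in this export; a minimal sketch of that step, assuming a plain column drop:
    # drop the Row ID column (assumed step, not shown in the source)
    orders = orders.drop(columns=['Row ID'])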
[5]: #viewing the top 5 records after dropping the Row ID column
orders.head()
[5]: Order ID Order Date Order Priority Order Quantity Sales Discount \
0 3 2010-10-13 Low 6 261.5400 0.04
1 6 2012-02-20 Not Specified 2 6.9300 0.01
2 32 2011-07-15 High 26 2808.0800 0.07
3 32 2011-07-15 High 24 1761.4000 0.09
4 32 2011-07-15 High 23 160.2335 0.04
Product Sub-Category \
0 Storage & Organization
1 Scissors, Rulers and Trimmers
2 Office Furnishings
3 Tables
4 Telephones and Communication
[7]: orders.head()
0 Eldon Base for stackable storage shelf, platinum Large Box
1 Kleencut® Forged Office Shears by Acme United … Small Pack
2 Tenex Contemporary Contur Chairmats for Low an… Medium Box
3 KI Conference Tables Jumbo Box
4 Bell Sonecor JB700 Caller ID Medium Box
[9]: orders.head()
0 0.80 2010-10-20
2 0.65 2011-07-17
3 0.72 2011-07-16
4 0.60 2011-07-17
5 0.79 2011-07-16
[11]: orders.head()
1.1 Duplicate Values
[12]: raw_data = {
          "city": ["Faridabad","Delhi","Faridabad","Noida","Noida","Faridabad","Noida","Delhi","Delhi"],
          "rank": ["1st","2nd","1st","2nd","1st","2nd","1st","2nd","1st"],
          "score1": [44,48,39,41,38,44,38,53,61],
          "score2": [67,63,55,70,64,77,45,66,72]
      }
df=pd.DataFrame(raw_data,columns=["city","rank","score1","score2"])
df
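The inputs for cells [13] and [14] are not shown in this export; judging from the outputs (an inference), they check for full-row duplicates and for duplicates in the city column:
    df.duplicated()          # all False: no identical rows          -> output [13]
    df.duplicated(['city'])  # True where a city has appeared before -> output [14]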
[13]: 0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
dtype: bool
[14]: 0 False
1 False
2 True
3 False
4 True
5 True
6 True
7 True
8 True
dtype: bool
[15]: df.duplicated(['rank'])
# the first occurrence of each value is treated as False (not a duplicate)
[15]: 0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
dtype: bool
[16]: df.duplicated(['rank'],keep='last')
# with keep='last', the last occurrence is kept (marked False) and all earlier occurrences are marked True
[16]: 0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
dtype: bool
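The input for cell [17] is likewise missing; the True/False pattern below matches duplicate checking on the city and rank columns together (an inference from the output):
    df.duplicated(['city','rank'])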
[17]: 0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 True
8 False
dtype: bool
[18]: df.drop_duplicates()
# no rows are dropped because there are no fully duplicate rows
[19]: df.drop_duplicates(['city'])
# rows with a repeated city are dropped; the first occurrence of each city is kept
[20]: df.drop_duplicates(['city','rank'])
# duplicate rows are dropped based on the combination of the city and rank columns
[21]: orders.describe()
75% 44609.000000 38.000000 1705.432500 0.080000 162.707000
max 59973.000000 50.000000 89061.050000 0.250000 27220.690000
[23]: data['SalePrice'].describe()
[25]: sns.boxplot(x=data['SalePrice'])
[25]: <AxesSubplot:xlabel='SalePrice'>
[26]: # Checking the shape of the data
data.shape
first_quartile = data['SalePrice'].quantile(.25)
third_quartile = data['SalePrice'].quantile(.75)
IQR = third_quartile - first_quartile
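The filtering step that produces new_data (used in cell [31] below) is not shown here; a typical IQR-based filter would be (a sketch, assuming the usual 1.5·IQR fences):
    lower_bound = first_quartile - 1.5 * IQR
    upper_bound = third_quartile + 1.5 * IQR
    new_data = data[(data['SalePrice'] >= lower_bound) & (data['SalePrice'] <= upper_bound)]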
[31]: sns.boxplot(x=new_data['SalePrice'])
[31]: <AxesSubplot:xlabel='SalePrice'>
1.2 Outlier detection using IQR
[32]: # heights of a group of people
      raw = {'name': ['mohan','maria','deepak','kunal','piyush','avinash','lisa','smita','tanu',
                      'khusboo','nishant','johnson','donald','rakesh','pritvi','roy','ashish','abhishek','jassi','puneet'],
             'height': [1.2,2.3,4.9,5.1,5.2,5.4,5.5,5.5,5.6,5.6,5.8,5.9,6,6.1,6.2,6.5,7.1,14.5,23.2,40.2]
            }
      df = pd.DataFrame(raw)
[33]: df
9 khusboo 5.6
10 nishant 5.8
11 johnson 5.9
12 donald 6.0
13 rakesh 6.1
14 pritvi 6.2
15 roy 6.5
16 ashish 7.1
17 abhishek 14.5
18 jassi 23.2
19 puneet 40.2
[34]: df.describe()
[34]: height
count 20.000000
mean 8.390000
std 8.782812
min 1.200000
25% 5.350000
50% 5.700000
75% 6.275000
max 40.200000
[36]: 0.9249999999999998
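The cells that compute this IQR and the limits used below are not included; a reconstruction consistent with the describe() output above (25% = 5.35, 75% = 6.275) is:
    Q1 = df.height.quantile(0.25)
    Q3 = df.height.quantile(0.75)
    IQR = Q3 - Q1                  # 0.925, the value shown in [36]
    lower_limit = Q1 - 1.5 * IQR   # about 3.96
    upper_limit = Q3 + 1.5 * IQR   # about 7.66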
[38]: #outliers
df[(df.height<lower_limit)|(df.height>upper_limit)]
[38]: name height
0 mohan 1.2
1 maria 2.3
17 abhishek 14.5
18 jassi 23.2
19 puneet 40.2
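df_no_outliers, displayed in cell [40], is not constructed in any cell shown here; a sketch that keeps only the rows inside the IQR limits:
    df_no_outliers = df[(df.height >= lower_limit) & (df.height <= upper_limit)]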
[40]: df_no_outliers
[ ]:
[41]: df =pd.read_csv("heights.csv")
[42]: df.sample(5)
[43]: plt.hist(df.Height,bins=20,rwidth=0.8)
plt.xlabel('Height (inches)')
plt.ylabel('Count')
plt.show()
Refer to https://www.mathsisfun.com/data/standard-normal-distribution.html for more details
plt.hist(df.Height,bins=20,rwidth=0.8,density=True)
plt.xlabel('Height (inches)')
plt.ylabel('Count')
rng= np.arange(df.Height.min(),df.Height.max(),0.1)
plt.plot(rng,norm.pdf(rng,df.Height.mean(),df.Height.std()))
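The overlay above uses the normal probability density from SciPy; it is assumed (not shown in this export) that the following imports were run earlier:
    import numpy as np
    from scipy.stats import norm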
[45]: #Calculating Mean
df.Height.mean()
[45]: 66.3675597548656
[46]: 3.847528120795573
[47]: 77.91014411725232
[48]: 54.824975392478876
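The inputs for cells [46]-[48] are missing; the values shown are the standard deviation and the mean ± 3·std limits, which (an inference from the numbers) correspond to:
    df.Height.std()                          # 3.8475 -> [46]
    df.Height.mean() + 3 * df.Height.std()   # 77.91  -> [47]
    df.Height.mean() - 3 * df.Height.std()   # 54.82  -> [48]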
[49]: Gender Height
994 Male 78.095867
1317 Male 78.462053
2014 Male 78.998742
3285 Male 78.528210
3757 Male 78.621374
6624 Female 54.616858
9285 Female 54.263133
[50]: (9993, 2)
[51]: 7
For example, in our case the mean is 66.37 and the standard deviation is 3.84.
If a data point has the value 77.91, its z-score is 3 because it is 3 standard
deviations away from the mean (77.91 = 66.37 + 3 * 3.84).
1.4 z = (x – μ) / σ
[52]: df['zscore'] = (df.Height - df.Height.mean())/df.Height.std()
[53]: df.head(5)
[54]: #outliers
df[(df['zscore']>3) | (df['zscore']<-3)]
1317 Male 78.462053 3.143445
2014 Male 78.998742 3.282934
3285 Male 78.528210 3.160640
3757 Male 78.621374 3.184854
6624 Female 54.616858 -3.054091
9285 Female 54.263133 -3.146027
[56]: 7
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
[59]: df['genre']
[59]: 0 Crime
1 Crime
2 Crime
3 Action
4 Crime
…
974 Comedy
975 Adventure
976 Action
977 Horror
978 Crime
Name: genre, Length: 979, dtype: object
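Counter in the next cell comes from the Python standard library; it is assumed that it was imported in an earlier cell not shown here:
    from collections import Counter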
[61]: Counter(df.genre)
Film-Noir 3
Family 2
History 1
Fantasy 1
Name: genre, dtype: int64
[63]: pandas.core.series.Series
[64]: gc = df['genre'].value_counts()
[65]: gc.plot()
[65]: <AxesSubplot:>
[66]: <AxesSubplot:>
[67]: #Visualizing through pie chart
gc.plot(kind='pie')
[67]: <AxesSubplot:ylabel='genre'>
Refer to https://medium.com/analytics-vidhya/intro-to-univariate-analysis-de75454b4719 for more details
[68]: titanic_df = pd.read_csv('titanic_df.csv')
[69]: titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
[72]: # Let's check how many males and females survived
sns.countplot(x=titanic_df.Survived,hue=titanic_df.Sex)
plt.show()
[73]: # Let's distinguish the data by passenger class
sns.countplot(x=titanic_df.Survived,hue=titanic_df.Pclass)
plt.show()
1.4.1 Numerical Data
[74]: tips_df = pd.read_csv('tips.csv')
[75]: tips_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
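The input for cell [77] is not shown; given the Density label on the y-axis, it was most likely a distribution plot of the tip column, for example (an assumption):
    sns.distplot(tips_df.tip)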
[77]: <AxesSubplot:ylabel='Density'>
[78]: # to find out the range
sns.boxplot(tips_df.tip, color = 'b')
[78]: <AxesSubplot:xlabel='tip'>
[79]: fig, axes = plt.subplots(1,2, figsize =(12,6))
#distribution of tips
sns.distplot(x=tips_df.tip, hist=True, kde=True, color='r',ax=axes[0])
1.5 Subplots
[80]: x = np.linspace(-1,1,101)
y1 = x
y2 = x**2
y3 = x**3
y4 = x**4
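The plotting cell that produces output [81] is missing; a minimal sketch (assuming the four curves are drawn on a 2×2 grid of subplots):
    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    axes[0, 0].plot(x, y1)
    axes[0, 1].plot(x, y2)
    axes[1, 0].plot(x, y3)
    axes[1, 1].plot(x, y4)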
[81]: [<matplotlib.lines.Line2D at 0x1b56565bcd0>]
[83]: #Load the dataset
diamonds = pd.read_csv('diamonds.csv')
[84]: diamonds.head()
z
0 2.43
1 2.31
2 2.31
3 2.63
4 2.75
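The cells that filter diamonds into the diamond DataFrame used below are missing, and the exact condition cannot be recovered from this export; filtering takes the general form shown here (the condition is purely a placeholder, not the author's filter):
    diamond = diamonds[diamonds['carat'] < 1.0]   # hypothetical condition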
[87]: # checking the shape after filtering
diamond.shape
Countplots
[88]: sns.set_style('darkgrid')
[89]: sns.countplot(x='color',data=diamond)
[90]: E 4896
F 4332
G 4323
H 3918
D 3780
I 2593
J 1481
Name: color, dtype: int64
[91]: # use y to change the orientation instead of x
sns.countplot(y='cut',data=diamond)
1.6 Order Ascending or Descending
[94]: #checking the value counts
diamond.color.value_counts()
[94]: E 4896
F 4332
G 4323
H 3918
D 3780
I 2593
J 1481
Name: color, dtype: int64
[97]: # plotting in ascending order
      sns.countplot(x='color', data=diamond, order=diamond.color.value_counts().index[::-1]);
[98]: # setting all bars to a single color (light blue)
      sns.countplot(x='color', data=diamond, order=diamond.color.value_counts().index[::-1], color='lightblue');
[100]: #line width and edge color
sns.countplot(x='color',data=diamond, lw=4, ec='black')
[101]: #hatching
sns.countplot(x='color',data=diamond, lw=4, ec='black',hatch='/')
[103]: ## adjust axis, if required
y1 = np.random.randn(70)
y2 = np.random.randn(70)
plt.scatter(y1,y2)
plt.axis([-3,3,-3,3]) #plt.axis([Xmin,Xmax,Ymin,Ymax])
plt.show()
[104]: # using the scatter plot with a common x-axis
x = np.arange(70)
y1 = np.random.randn(70)
y2 = np.random.randn(70)
plt.scatter(x,y1, marker='o', label ='y1 with x')
plt.scatter(x,y2, marker='v', label ='y2 with x')
plt.legend(loc='upper left')
1.8 Line Plot
[106]: # extract data from various Internet sources into a pandas DataFrame
import pandas_datareader as pdr
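The cells that download and plot the two price series are missing from this export; a rough sketch (tickers taken from the legend below, date range and data source assumed; the yahoo source may need a workaround in recent pandas-datareader versions):
    goog = pdr.DataReader('GOOG', 'yahoo', '2020-01-01', '2020-12-31')
    amzn = pdr.DataReader('AMZN', 'yahoo', '2020-01-01', '2020-12-31')
    plt.plot(goog['Adj Close'])
    plt.plot(amzn['Adj Close'])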
plt.legend(['GOOG','AMZN'],loc =2)
1.9 Boxplot
[111]: sns.boxplot(tips['size'])
[111]: <AxesSubplot:xlabel='size'>
[112]: sns.boxplot(tips['total_bill'])
[112]: <AxesSubplot:xlabel='total_bill'>
[113]: sns.boxplot(x='sex', y ='total_bill', data=tips)
[114]: sns.boxplot(x='day', y ='total_bill', data=tips)
[116]: sns.boxplot(x='day', y ='total_bill', data=tips,hue='sex',palette ='husl')
[116]: sns.boxplot(x='day', y='total_bill', data=tips, hue='smoker', palette='coolwarm')
1.10 Joint Distribution
[119]: iris = pd.read_csv('Iris.csv')
[120]: iris.head()
[122]: sns.jointplot(x='SepalLengthCm', y='SepalWidthCm', data=iris)
[123]: # adding regression lines
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')
[124]: sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg',color='green')
1.11 Bar Plots
[125]: sns.barplot(x='day', y ='tip',data=tips)
[126]: sns.barplot(x='day', y ='total_bill',data=tips)
[127]: sns.barplot(x='day', y ='total_bill',data=tips,hue='sex')
[129]: sns.barplot(x='day', y ='total_bill',data=tips,hue='smoker')
[130]: sns.barplot(x='total_bill', y ='day' , data = tips, palette ='spring')
[132]: sns.barplot(x='smoker', y ='tip' , data = tips,hue='sex')
[133]: # Load the dataset
df = pd.read_csv('mtcars.csv')
carb
0 4
1 4
2 1
3 1
4 2
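The input for cell [136] is not shown; from the axis labels it was most likely a scatter of weight against mileage, for example (an assumption):
    sns.scatterplot(x='wt', y='mpg', data=df)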
[136]: <AxesSubplot:xlabel='wt', ylabel='mpg'>
[138]: sns.lmplot(x='wt', y='mpg',data=df)
[139]: sns.lmplot(x='wt', y='mpg',hue='vs', data=df)
[140]: sns.lmplot(x='wt', y='mpg',hue='vs',palette='Set1', data=df)
[141]: sns.lmplot(x='wt', y='mpg',hue='vs',markers=['+','o'],palette='Set1', data=df)
[142]: # Load the dataset
iris = pd.read_csv('Iris.csv')
[144]: sns.pairplot(iris)
[145]: sns.pairplot(iris,hue='Species')
[146]: ## Parallel coordinates
from pandas.plotting import parallel_coordinates
## Loading Dataset
df = pd.read_csv('NHANES.csv')
0 1.0 5.0 … 124.0 64.0 94.8 184.5 27.8 43.3
1 2.0 3.0 … 140.0 88.0 90.4 171.4 30.8 38.0
2 1.0 3.0 … 132.0 44.0 83.4 170.1 28.8 35.6
3 1.0 5.0 … 134.0 68.0 109.8 160.9 42.4 38.5
4 1.0 4.0 … 114.0 54.0 55.2 164.9 20.3 37.4
[5 rows x 28 columns]
[148]: #columns
df.columns
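The cells that build the subset d and draw the parallel-coordinates plot (cell [149] and the input of [151]) are missing from this export. A sketch with hypothetical NHANES column names (placeholders, not necessarily the columns the author used):
    # hypothetical subset: one categorical class column plus a few numeric columns
    d = df[['RIAGENDR', 'BPXSY1', 'BPXDI1', 'BMXBMI']].dropna().head(100)
    # the plot in [151] was presumably drawn with something like:
    # parallel_coordinates(d, 'RIAGENDR')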
[150]: d.head()
[151]: <AxesSubplot:>
[152]: #Loading Dataset
df = pd.read_csv('avocado.csv')
[153]: df.columns
[154]: df.head()
4770 Total Bags Small Bags Large Bags XLarge Bags type \
0 48.16 8696.87 8603.62 93.25 0.0 conventional
1 58.33 9505.56 9408.07 97.49 0.0 conventional
2 130.50 8145.35 8042.21 103.14 0.0 conventional
3 72.58 5811.16 5677.40 133.76 0.0 conventional
4 75.78 6183.95 5986.26 197.69 0.0 conventional
year region
0 2015 Albany
1 2015 Albany
2 2015 Albany
3 2015 Albany
4 2015 Albany
[155]: df.dtypes
[156]: sns.heatmap(df.corr())
[156]: <AxesSubplot:>
[157]: sns.heatmap(df.corr(), annot =True)
[157]: <AxesSubplot:>
[158]: plt.figure(figsize=(10,5))
sns.heatmap(df.corr(), annot =True)
[158]: <AxesSubplot:>
[159]: plt.figure(figsize=(10,5))
sns.heatmap(df.corr(), annot =True,linewidth = 0.5)
[159]: <AxesSubplot:>
[160]: plt.figure(figsize=(10,5))
sns.heatmap(df.corr(), annot =True,linewidth = 0.5, cmap='Blues')
[160]: <AxesSubplot:>
1.13 Refer to the link below for "How to choose the right chart"
https://towardsdatascience.com/data-visualization-how-to-choose-the-right-chart-part-1-d4c550085ea7
[ ]: