Pandas - Data Analysis - Deep Dive - Jupyter Notebook
Pandas - Data Analysis - Deep Dive - Jupyter Notebook
In [1]:
Note: you may need to restart the kernel to use updated packages.
In [2]:
In [3]:
In [4]:
pd.read_csv('drinks.csv')
Out[4]:
0 Afghanistan 0 0 0 0.0
2 Algeria 25 0 14 0.7
In [5]:
pandas.read_csv('drinks.csv',sep=',')
---------------------------------------------------------------------------
<ipython-input-5-c5ee57efda1b> in <module>
----> 1 pandas.read_csv('drinks.csv',sep=',')
In [6]:
pandas.read_csv('insurance.csv',sep = ',')
---------------------------------------------------------------------------
<ipython-input-6-a4d86e110cf2> in <module>
In [ ]:
pd.read_csv('insurance.csv')
In [ ]:
In [ ]:
pandas.read_csv('u.user')#,sep = '|')
In [ ]:
pd.read_csv('u.user',sep = '|')
In [ ]:
In [ ]:
pandas.crosstab()
In [ ]:
pandas.concat
In [ ]:
In [7]:
Out[7]:
1. Initial Analysis
In [8]:
purchase_order_data.shape #Attribute
Out[8]:
(4622, 5)
In [9]:
purchase_order_data.shape[0]
Out[9]:
4622
In [10]:
purchase_order_data.shape[1]
Out[10]:
Terminology Alert
Rows - Observations/Records/Datapoints
Columns - Features/Parameters
In [11]:
Total No of Parameters : 5
In [12]:
purchase_order_data.dtypes #Attribute
Out[12]:
order_id int64
quantity int64
item_name object
choice_description object
item_price object
dtype: object
In [13]:
purchase_order_data.info()
<class 'pandas.core.frame.DataFrame'>
In [14]:
purchase_order_data.isna().sum()
Out[14]:
order_id 0
quantity 0
item_name 0
choice_description 1246
item_price 0
dtype: int64
In [15]:
purchase_order_data.describe(include='all')
Out[15]:
purchase_order_data['item_name'].nunique()
Out[16]:
50
In [17]:
print(purchase_order_data['item_name'].unique())
'6 Pack Soft Drink' 'Chips and Tomatillo-Red Chili Salsa' 'Bowl'
'Carnitas Salad Bowl' 'Barbacoa Salad Bowl' 'Salad' 'Veggie Crispy Tacos'
purchase_order_data
Out[18]:
In [19]:
purchase_order_data.head(30)
Out[19]:
In [ ]:
purchase_order_data.tail(30)
In [20]:
purchase_order_data.groupby(by='item_name')['quantity'].sum().sort_values(ascending = False
Out[20]:
item_name
In [21]:
purchase_order_data['item_name'].value_counts() #Fre
Out[21]:
Chips 211
Veggie Burrito 95
Barbacoa Burrito 91
Veggie Bowl 85
Carnitas Bowl 68
Barbacoa Bowl 66
Carnitas Burrito 59
Nantucket Nectar 27
Izze 20
Chicken Salad 9
Burrito 6
Veggie Salad 6
Steak Salad 4
Crispy Tacos 2
Salad 2
Bowl 2
Carnitas Salad 1
In [24]:
Out[24]:
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
In [23]:
user_details.head(30)
Out[23]:
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
5 6 42 M executive 98101
6 7 57 M administrator 91344
7 8 36 M administrator 05201
8 9 29 M student 01002
9 10 53 M lawyer 90703
10 11 39 F other 30329
11 12 28 F other 06405
12 13 47 M educator 29206
13 14 45 M scientist 55106
14 15 49 F educator 97301
15 16 21 M entertainment 10309
16 17 30 M programmer 06355
17 18 35 F other 37212
18 19 40 M librarian 02138
19 20 42 F homemaker 95660
20 21 26 M writer 30068
21 22 25 M writer 40206
22 23 30 F artist 48197
23 24 21 F artist 94533
24 25 39 M engineer 55107
25 26 49 M engineer 21044
26 27 40 F librarian 30030
27 28 32 M writer 55369
28 29 41 M programmer 94043
29 30 7 M student 55436
1. Initial Investigation
In [ ]:
user_details.shape
In [ ]:
user_details.isna().sum()
In [ ]:
user_details.dtypes
In [ ]:
user_details.describe(include = 'all')
In [ ]:
print(user_details['occupation'].unique())
In [ ]:
user_details.head()
NOTE:
Discrete Data - it is not measurable(No units)/it is just a count || we cannot
seggregate it.
In [ ]:
user_details.groupby(by = 'gender')['age'].mean()
In [ ]:
round(user_details.groupby(by = ['occupation'])['age'].mean())
In [ ]:
user_details.groupby(by = ['occupation'])['user_id'].count()
In [ ]:
user_details.groupby(by = ['gender','occupation'])['user_id'].count()
In [ ]:
In [ ]:
salesman_data.describe()
Out[27]:
% Total
Shots Shots Penaltie
Shooting Goals- shots Hit Penalty
Team Goals on off n
Accuracy to- (inc. Woodwork goals
target target score
shots Blocked)
Czech
1 4 13 18 41.9% 12.9% 39 0 0
Republic
Republic of
11 1 7 12 36.8% 5.2% 28 0 0
Ireland
16 rows × 35 columns
In [ ]:
euro_2012.columns
CONFIGURATION
In [ ]:
pd.set_option('max_columns',None)
#pd.set_option('max_rows',None)
In [ ]:
euro_2012
1. Initial Analysis
In [ ]:
euro_2012.shape
In [ ]:
euro_2012.describe(include='all')
2. Filter Team, Goals, Shooting Accuracy, Yellow Cards and Red Cards
In [28]:
Out[28]:
0 Croatia 4 51.9% 9 0
2 Denmark 4 50.0% 4 0
3 England 5 50.0% 5 0
4 France 3 37.9% 6 0
5 Germany 10 47.8% 4 0
6 Greece 5 30.7% 9 1
7 Italy 6 43.0% 16 0
8 Netherlands 2 25.0% 5 0
9 Poland 2 39.4% 7 1
10 Portugal 6 34.3% 12 0
12 Russia 5 22.5% 6 0
13 Spain 12 55.9% 11 0
14 Sweden 5 47.2% 7 0
15 Ukraine 2 21.2% 5 0
In [ ]:
In [ ]:
euro_filtered_data
In [ ]:
import warnings
warnings.filterwarnings(action = 'ignore')
In [ ]:
In [ ]:
euro_filtered_data
In [ ]:
euro_filtered_data[euro_filtered_data['Goals'] > 4]
In [ ]:
euro_filtered_data[euro_filtered_data['Goals'] > 4]
In [ ]:
euro_filtered_data[euro_filtered_data['Red Cards'] == 1]
In [29]:
---------------------------------------------------------------------------
<ipython-input-29-9a66c94aa3ba> in <module>
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonze
ro__(self)
1440 @final
In [34]:
Out[34]:
6 Greece 5 30.7% 9 1
Chapter_3 Indexing
In [ ]:
In [ ]:
user_details.head()
In [ ]:
#Index locator
user_details.iloc[0:5,0:3]
In [ ]:
In [ ]:
In [ ]:
#Locator
euro_2012.loc[:5,['Shooting Accuracy','% Goals-to-shots']] #Pass the feature names
In [ ]:
euro_2012.iloc[[0,2,5],[1,5,9]]
In [ ]:
Series - 1D of pandas
Dataframe - 2D of pandas
In [ ]:
a = [1,2,3,4,5]
type(a)
In [ ]:
pandas_1D = pd.Series([1,2,3,4,5])
type(pandas_1D)
In [ ]:
Chapter_5 Deleting
In [35]:
Out[35]:
% Total
Shots Shots Penaltie
Shooting Goals- shots Hit Penalty
Team Goals on off n
Accuracy to- (inc. Woodwork goals
target target score
shots Blocked)
Czech
1 4 13 18 41.9% 12.9% 39 0 0
Republic
Republic of
11 1 7 12 36.8% 5.2% 28 0 0
Ireland
16 rows × 35 columns
In [36]:
del euro_2012['Goals']
In [37]:
euro_2012
Out[37]:
% Total
Shots Shots Penalties
Shooting Goals- shots Hit Penalty Hea
Team on off not
Accuracy to- (inc. Woodwork goals go
target target scored
shots Blocked)
Czech
1 13 18 41.9% 12.9% 39 0 0 0
Republic
Republic of
11 7 12 36.8% 5.2% 28 0 0 0
Ireland
16 rows × 34 columns
In [41]:
euro_2012[['Team','Shots on target']]
Out[41]:
0 Croatia 13
1 Czech Republic 13
2 Denmark 10
3 England 11
4 France 22
5 Germany 32
6 Greece 8
7 Italy 34
8 Netherlands 12
9 Poland 15
10 Portugal 22
11 Republic of Ireland 7
12 Russia 9
13 Spain 42
14 Sweden 17
15 Ukraine 7
In [42]:
euro_2012.Team
Out[42]:
0 Croatia
1 Czech Republic
2 Denmark
3 England
4 France
5 Germany
6 Greece
7 Italy
8 Netherlands
9 Poland
10 Portugal
11 Republic of Ireland
12 Russia
13 Spain
14 Sweden
15 Ukraine
In [43]:
---------------------------------------------------------------------------
<ipython-input-43-1b67689e46a4> in <module>
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __delit
em__(self, key)
3964 # there was no match, this call should raise the appropr
iate
3965 # exception:
3967 self._mgr.idelete(loc)
3968
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in ge
t_loc(self, key, method, tolerance)
3079 try:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
In [53]:
In [54]:
euro_2012
Out[54]:
% Total
Shots Shots Penalties
Goals- shots Hit Headed Passes Passi
on off not Passes
to- (inc. Woodwork goals completed Accura
target target scored
shots Blocked)
16 rows × 31 columns
In [58]:
student_details = pd.read_csv('student-mat.csv')
student_details
Out[58]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu
... ... ... ... ... ... ... ... ... ... ... ...
In [57]:
pd.set_option('max_columns',None)
In [62]:
In [63]:
student_details
Out[63]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu
... ... ... ... ... ... ... ... ... ... ... ...
In [67]:
Out[67]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu
... ... ... ... ... ... ... ... ... ... ... ...
Inclass Exercise - 2
Create a new column with a name 'Eligibility_Criteria' which returns 1 if age>17 and 0 if age < 17.
In [68]:
def get_age(x):
if x>17:
return 1
else:
return 0
In [70]:
Out[70]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu
... ... ... ... ... ... ... ... ... ... ... ...
In [71]:
Out[71]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu
... ... ... ... ... ... ... ... ... ... ... ...
Out[75]:
SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380
PLASTICS 9100
SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720
PLASTICS FOAM(1200)
VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500
TRADERS
WINE
SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720
TRADERS FOAM(1200)
SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450
TRADERS 9100
10*10
47285 31/03/2018 Sal:10042 Vkp 25 137 3,425
SHEET
In [76]:
Out[76]:
SILVER
0 1/4/2018 Sal:146 TP13 POUCH 50 85 4,250.00
9*12
DURGA
2 1/4/2018 Sal:146 TP13 10*12 1,600.00 5.5 8,800.00
Blue
DURGA
3 1/4/2018 Sal:146 TP13 13*16 400 11 4,400.00
BLUE
10*12
4 1/4/2018 Sal:146 TP13 SARAS- 600 8.1 4,860.00
NAT
HAMPI SPOON
44735 31/03/2019 Sal:9610 200 40 8,000.00
FOODS SOOFY
In [77]:
Out[77]:
BALAJI DONA-
0 1/4/2019 Sal:687 1 1,730.00 1,730.00
PLASTICS VAI-9100
BALAJI SMART
1 1/4/2019 Sal:687 1 1,730.00 1,730.00
PLASTICS BOUL(48)
BALAJI Vishnu
2 1/4/2019 Sal:688 110 18.5 2,035.00
PLASTICS Ice
3 28/3 0 0
BALAJI 100LEAF
4 1/4/2019 Sal:689 3 585 1,755.00
PLASTICS -SP
13*16
19171 10/10/2019 Sal:4935 K.SRIHARI WHITE 400 16 6,400.00
RK
In [80]:
sales_2017.shape,sales_2018.shape,sales_2019.shape
Out[80]:
In [81]:
sales_full_data = pd.concat([sales_2017,sales_2018,sales_2019])
sales_full_data
Out[81]:
SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380.0
PLASTICS 9100
SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720.0
PLASTICS FOAM(1200)
VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500.0
TRADERS
WINE
SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720.0
TRADERS FOAM(1200)
SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450.0
TRADERS 9100
13*16
19171 10/10/2019 Sal:4935 K.SRIHARI 400 16 6,400.0
WHITE RK
In [82]:
sales_2017.append([sales_2018,sales_2019])
Out[82]:
SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380.0
PLASTICS 9100
SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720.0
PLASTICS FOAM(1200)
VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500.0
TRADERS
WINE
SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720.0
TRADERS FOAM(1200)
SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450.0
TRADERS 9100
13*16
19171 10/10/2019 Sal:4935 K.SRIHARI 400 16 6,400.0
WHITE RK
In [86]:
sales_full_data.dtypes
Out[86]:
Date object
Voucher object
Party object
Product object
Qty object
Rate object
Gross object
Disc object
dtype: object
In [88]:
sales_full_data.isna().sum()
Out[88]:
Date 12591
Voucher 12557
Party 40
Product 12591
Qty 12557
Rate 12558
Gross 12558
Disc 105609
dtype: int64
TASK 2 - Explore Merge function - Left join, Right Join, Inner join,
Outer join
sales_cleaned_data = pd.read_csv('Sales-Transactions-Edited.csv')
sales_cleaned_data
Out[84]:
In [85]:
sales_cleaned_data.dtypes
Out[85]:
Date object
Voucher int64
Party object
Product object
Qty int64
Rate float64
dtype: object
In [87]:
sales_cleaned_data.isna().sum()
Out[87]:
Date 0
Voucher 0
Party 0
Product 0
Qty 0
Rate 1
dtype: int64
In [91]:
insurance_data = pd.read_csv('insurance.csv')
insurance_data
Out[91]:
In [92]:
insurance_data.describe(include = 'all')
Out[92]:
In [93]:
insurance_data['region'].unique()
Out[93]:
In [96]:
round(insurance_data.groupby(by='region')['charges'].mean().sort_values(ascending = False))
Out[96]:
region
southeast 14735.0
northeast 13406.0
northwest 12418.0
southwest 12347.0
In [98]:
round(insurance_data.groupby(by=['region','sex'])['charges'].mean().sort_values(ascending =
Out[98]:
region sex
male 12354.0
In [101]:
Out[101]:
region
In [103]:
Out[103]:
smoker
In [104]:
Out[104]:
children 0 1 2 3 4 5 All
region
In [ ]: