Pandas - Data Analysis - Deep Dive - Jupyter Notebook

This document provides an overview of analyzing data with Pandas by reading in different data files, performing initial analysis on the data such as checking the shape and dtypes, and exploring different attributes and methods. It demonstrates how to load data from CSV and TSV files, check the number of observations and features, and get information about the data types of each column.


10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

Read the files

In [1]:

pip install pandas

Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-packages (1.2.4)

Requirement already satisfied: numpy>=1.16.5 in c:\users\x1 yoga\appdata\roaming\python\python38\site-packages (from pandas) (1.19.5)

Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib\site-packages (from pandas) (2021.1)

Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anaconda3\lib\site-packages (from pandas) (2.8.1)

Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

Note: you may need to restart the kernel to use updated packages.

In [2]:

!pip install pandas

Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-packages (1.2.4)

Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anaconda3\lib\site-packages (from pandas) (2.8.1)

Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib\site-packages (from pandas) (2021.1)

Requirement already satisfied: numpy>=1.16.5 in c:\users\x1 yoga\appdata\roaming\python\python38\site-packages (from pandas) (1.19.5)

Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

In [3]:

import pandas as pd #Alias

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 1/37



In [4]:

pd.read_csv('drinks.csv')

Out[4]:

country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent

0 Afghanistan 0 0 0 0.0

1 Albania 89 132 54 4.9

2 Algeria 25 0 14 0.7

3 Andorra 245 138 312 12.4

4 Angola 217 57 45 5.9

... ... ... ... ... ...

188 Venezuela 333 100 3 7.7

189 Vietnam 111 2 1 2.0

190 Yemen 6 0 0 0.1

191 Zambia 32 19 4 2.5

192 Zimbabwe 64 18 4 4.7

193 rows × 6 columns
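The same `read_csv` call works on any delimited text; a minimal sketch of the `sep` parameter, using an in-memory buffer (hypothetical data) in place of drinks.csv so it runs self-contained:

```python
import io
import pandas as pd

# A tiny CSV in memory stands in for drinks.csv (hypothetical data).
csv_text = "country,beer_servings\nAlbania,89\nAlgeria,25\n"

# sep defaults to ',', so passing sep=',' is equivalent for a CSV.
df = pd.read_csv(io.StringIO(csv_text), sep=",")
print(df.shape)  # (2, 2)
```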

In [5]:

pandas.read_csv('drinks.csv',sep=',')

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-5-c5ee57efda1b> in <module>

----> 1 pandas.read_csv('drinks.csv',sep=',')

NameError: name 'pandas' is not defined

In [6]:

pandas.read_csv('insurance.csv',sep = ',')

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-6-a4d86e110cf2> in <module>

----> 1 pandas.read_csv('insurance.csv',sep = ',')

NameError: name 'pandas' is not defined

In [ ]:

pd.read_csv('insurance.csv')


In [ ]:

pd.read_csv('data.tsv.txt',sep = '\t') #tab-separated file

In [ ]:

pd.read_csv('u.user')#,sep = '|')

In [ ]:

pd.read_csv('u.user',sep = '|')

In [ ]:

pandas.read_csv(r'D:\My_Materials\Reference Notes\AI Course\Day_06_Pandas Exercise\01_Getti

In [ ]:

pd.crosstab()

In [ ]:

pd.concat

In [ ]:

pd.read_csv('D:/My_Materials/Reference Notes/AI Course/Day_06_Pandas Exercise/01_Getting &

Chapter 1 - Getting and Knowing your data


In [7]:

purchase_order_data = pd.read_csv('data.tsv.txt',sep = '\t') #purchase order details


purchase_order_data

Out[7]:

      order_id  quantity  item_name                              choice_description                                 item_price
0     1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
1     1         1         Izze                                   [Clementine]                                       $3.39
2     1         1         Nantucket Nectar                       [Apple]                                            $3.39
3     1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
4     2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
...   ...       ...       ...                                    ...                                                ...
4617  1833      1         Steak Burrito                          [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...  $11.75
4618  1833      1         Steak Burrito                          [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...  $11.75
4619  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $11.25
4620  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...  $8.75
4621  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $8.75

4622 rows × 5 columns

1. Initial Analysis

In [8]:

purchase_order_data.shape #Attribute

Out[8]:

(4622, 5)

In [9]:

purchase_order_data.shape[0]

Out[9]:

4622


In [10]:

purchase_order_data.shape[1]

Out[10]:

5

Terminology Alert
Rows - Observations/Records/Datapoints
Columns - Features/Parameters
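The terminology maps directly onto `shape`; a sketch with a hypothetical three-observation frame:

```python
import pandas as pd

# Hypothetical toy frame: 3 observations (rows) and 2 features (columns).
df = pd.DataFrame({"item": ["a", "b", "c"], "qty": [1, 2, 3]})

rows, cols = df.shape  # shape is a tuple attribute, not a method
print(rows, cols)      # 3 2
```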

In [11]:

print('Total No of Observations : {}\nTotal No of Parameters : {}'.format(purchase_order_data.shape[0], purchase_order_data.shape[1]))

Total No of Observations : 4622

Total No of Parameters : 5

In [12]:

purchase_order_data.dtypes #Attribute

Out[12]:

order_id int64

quantity int64

item_name object

choice_description object

item_price object

dtype: object

In [13]:

purchase_order_data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 3376 non-null object

4 item_price 4622 non-null object

dtypes: int64(2), object(3)

memory usage: 180.7+ KB


In [14]:

purchase_order_data.isna().sum()

Out[14]:

order_id 0

quantity 0

item_name 0

choice_description 1246

item_price 0

dtype: int64
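Once the missing values are counted, a common next step is to fill or drop them; a sketch with a hypothetical mini order table:

```python
import pandas as pd

# Hypothetical mini order table with one missing choice_description.
orders = pd.DataFrame({
    "item_name": ["Izze", "Chips", "Chicken Bowl"],
    "choice_description": ["[Clementine]", None, "[Black Beans]"],
})

# fillna replaces NaN with a placeholder value without touching other rows.
filled = orders["choice_description"].fillna("[None]")
print(orders["choice_description"].isna().sum(), filled.isna().sum())  # 1 0
```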

In [15]:

purchase_order_data.describe(include='all')

Out[15]:

order_id quantity item_name choice_description item_price

count 4622.000000 4622.000000 4622 3376 4622

unique NaN NaN 50 1043 78

top NaN NaN Chicken Bowl [Diet Coke] $8.75

freq NaN NaN 726 134 730

mean 927.254868 1.075725 NaN NaN NaN

std 528.890796 0.410186 NaN NaN NaN

min 1.000000 1.000000 NaN NaN NaN

25% 477.250000 1.000000 NaN NaN NaN

50% 926.000000 1.000000 NaN NaN NaN

75% 1393.000000 1.000000 NaN NaN NaN

max 1834.000000 15.000000 NaN NaN NaN

1. How many items are available in the restaurant?


In [16]:

purchase_order_data['item_name'].nunique()

Out[16]:

50


In [17]:

print(purchase_order_data['item_name'].unique())

['Chips and Fresh Tomato Salsa' 'Izze' 'Nantucket Nectar'

'Chips and Tomatillo-Green Chili Salsa' 'Chicken Bowl' 'Side of Chips'

'Steak Burrito' 'Steak Soft Tacos' 'Chips and Guacamole'

'Chicken Crispy Tacos' 'Chicken Soft Tacos' 'Chicken Burrito'

'Canned Soda' 'Barbacoa Burrito' 'Carnitas Burrito' 'Carnitas Bowl'


'Bottled Water' 'Chips and Tomatillo Green Chili Salsa' 'Barbacoa Bowl'

'Chips' 'Chicken Salad Bowl' 'Steak Bowl' 'Barbacoa Soft Tacos'

'Veggie Burrito' 'Veggie Bowl' 'Steak Crispy Tacos'

'Chips and Tomatillo Red Chili Salsa' 'Barbacoa Crispy Tacos'

'Veggie Salad Bowl' 'Chips and Roasted Chili-Corn Salsa'

'Chips and Roasted Chili Corn Salsa' 'Carnitas Soft Tacos'

'Chicken Salad' 'Canned Soft Drink' 'Steak Salad Bowl'

'6 Pack Soft Drink' 'Chips and Tomatillo-Red Chili Salsa' 'Bowl'

'Burrito' 'Crispy Tacos' 'Carnitas Crispy Tacos' 'Steak Salad'

'Chips and Mild Fresh Tomato Salsa' 'Veggie Soft Tacos'

'Carnitas Salad Bowl' 'Barbacoa Salad Bowl' 'Salad' 'Veggie Crispy Tacos'

'Veggie Salad' 'Carnitas Salad']

2. Which was the most ordered item?


In [18]:

purchase_order_data

Out[18]:

      order_id  quantity  item_name                              choice_description                                 item_price
0     1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
1     1         1         Izze                                   [Clementine]                                       $3.39
2     1         1         Nantucket Nectar                       [Apple]                                            $3.39
3     1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
4     2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
...   ...       ...       ...                                    ...                                                ...
4617  1833      1         Steak Burrito                          [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...  $11.75
4618  1833      1         Steak Burrito                          [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...  $11.75
4619  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $11.25
4620  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...  $8.75
4621  1834      1         Chicken Salad Bowl                     [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $8.75

4622 rows × 5 columns


In [19]:

purchase_order_data.head(30)

Out[19]:

    order_id  quantity  item_name                              choice_description                                 item_price
0   1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
1   1         1         Izze                                   [Clementine]                                       $3.39
2   1         1         Nantucket Nectar                       [Apple]                                            $3.39
3   1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
4   2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
5   3         1         Chicken Bowl                           [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...  $10.98
6   3         1         Side of Chips                          NaN                                                $1.69
7   4         1         Steak Burrito                          [Tomatillo Red Chili Salsa, [Fajita Vegetables...  $11.75
8   4         1         Steak Soft Tacos                       [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...  $9.25
9   5         1         Steak Burrito                          [Fresh Tomato Salsa, [Rice, Black Beans, Pinto...  $9.25
10  5         1         Chips and Guacamole                    NaN                                                $4.45
11  6         1         Chicken Crispy Tacos                   [Roasted Chili Corn Salsa, [Fajita Vegetables,...  $8.75
12  6         1         Chicken Soft Tacos                     [Roasted Chili Corn Salsa, [Rice, Black Beans,...  $8.75
13  7         1         Chicken Bowl                           [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...  $11.25
14  7         1         Chips and Guacamole                    NaN                                                $4.45
15  8         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
16  8         1         Chicken Burrito                        [Tomatillo-Green Chili Salsa (Medium), [Pinto ...  $8.49
17  9         1         Chicken Burrito                        [Fresh Tomato Salsa (Mild), [Black Beans, Rice...  $8.49
18  9         2         Canned Soda                            [Sprite]                                           $2.18
19  10        1         Chicken Bowl                           [Tomatillo Red Chili Salsa, [Fajita Vegetables...  $8.75
20  10        1         Chips and Guacamole                    NaN                                                $4.45
21  11        1         Barbacoa Burrito                       [[Fresh Tomato Salsa (Mild), Tomatillo-Green C...  $8.99
22  11        1         Nantucket Nectar                       [Pomegranate Cherry]                               $3.39
23  12        1         Chicken Burrito                        [[Tomatillo-Green Chili Salsa (Medium), Tomati...  $10.98
24  12        1         Izze                                   [Grapefruit]                                       $3.39
25  13        1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
26  13        1         Chicken Bowl                           [Roasted Chili Corn Salsa (Medium), [Pinto Bea...  $8.49
27  14        1         Carnitas Burrito                       [[Tomatillo-Green Chili Salsa (Medium), Roaste...  $8.99
28  14        1         Canned Soda                            [Dr. Pepper]                                       $1.09
29  15        1         Chicken Burrito                        [Tomatillo-Green Chili Salsa (Medium), [Pinto ...  $8.49

In [ ]:

purchase_order_data.tail(30)

In [20]:

purchase_order_data.groupby(by='item_name')['quantity'].sum().sort_values(ascending = False).head()

Out[20]:

item_name

Chicken Bowl 761

Chicken Burrito 591

Chips and Guacamole 506

Steak Burrito 386

Canned Soft Drink 351

Name: quantity, dtype: int64


In [21]:

purchase_order_data['item_name'].value_counts() #Frequency of each item

Out[21]:

Chicken Bowl 726

Chicken Burrito 553

Chips and Guacamole 479

Steak Burrito 368

Canned Soft Drink 301

Chips 211

Steak Bowl 211

Bottled Water 162

Chicken Soft Tacos 115

Chips and Fresh Tomato Salsa 110

Chicken Salad Bowl 110

Canned Soda 104

Side of Chips 101

Veggie Burrito 95

Barbacoa Burrito 91

Veggie Bowl 85

Carnitas Bowl 68

Barbacoa Bowl 66

Carnitas Burrito 59

Steak Soft Tacos 55

6 Pack Soft Drink 54

Chips and Tomatillo Red Chili Salsa 48

Chicken Crispy Tacos 47

Chips and Tomatillo Green Chili Salsa 43

Carnitas Soft Tacos 40

Steak Crispy Tacos 35

Chips and Tomatillo-Green Chili Salsa 31

Steak Salad Bowl 29

Nantucket Nectar 27

Barbacoa Soft Tacos 25

Chips and Roasted Chili Corn Salsa 22

Chips and Tomatillo-Red Chili Salsa 20

Izze 20

Chips and Roasted Chili-Corn Salsa 18

Veggie Salad Bowl 18

Barbacoa Crispy Tacos 11

Barbacoa Salad Bowl 10

Chicken Salad 9

Veggie Soft Tacos 7

Carnitas Crispy Tacos 7

Burrito 6

Carnitas Salad Bowl 6

Veggie Salad 6

Steak Salad 4

Crispy Tacos 2

Salad 2

Bowl 2

Veggie Crispy Tacos 1

Chips and Mild Fresh Tomato Salsa 1

Carnitas Salad 1

Name: item_name, dtype: int64
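Note the difference between the two rankings above: `value_counts()` counts rows, while the groupby sums the `quantity` column, so an item ordered in bulk ranks higher in the second. A sketch with hypothetical data:

```python
import pandas as pd

# 'Bowl' appears on 2 rows but 3 units were ordered (hypothetical data).
df = pd.DataFrame({"item_name": ["Bowl", "Bowl", "Taco"],
                   "quantity": [1, 2, 1]})

row_counts = df["item_name"].value_counts()              # counts rows
unit_totals = df.groupby("item_name")["quantity"].sum()  # sums units
print(row_counts["Bowl"], unit_totals["Bowl"])  # 2 3
```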

Chapter 1[b] Getting and Knowing your data



In [24]:

user_details = pd.read_csv('u.user',sep = '|') #user_details


user_details

Out[24]:

user_id age gender occupation zip_code

0 1 24 M technician 85711

1 2 53 F other 94043

2 3 23 M writer 32067

3 4 24 M technician 43537

4 5 33 F other 15213

... ... ... ... ... ...

938 939 26 F student 33319

939 940 32 M administrator 02215

940 941 20 M student 97229

941 942 48 F librarian 78209

942 943 22 M student 77841

943 rows × 5 columns


In [23]:

user_details.head(30)

Out[23]:

user_id age gender occupation zip_code

0 1 24 M technician 85711

1 2 53 F other 94043

2 3 23 M writer 32067

3 4 24 M technician 43537

4 5 33 F other 15213

5 6 42 M executive 98101

6 7 57 M administrator 91344

7 8 36 M administrator 05201

8 9 29 M student 01002

9 10 53 M lawyer 90703

10 11 39 F other 30329

11 12 28 F other 06405

12 13 47 M educator 29206

13 14 45 M scientist 55106

14 15 49 F educator 97301

15 16 21 M entertainment 10309

16 17 30 M programmer 06355

17 18 35 F other 37212

18 19 40 M librarian 02138

19 20 42 F homemaker 95660

20 21 26 M writer 30068

21 22 25 M writer 40206

22 23 30 F artist 48197

23 24 21 F artist 94533

24 25 39 M engineer 55107

25 26 49 M engineer 21044

26 27 40 F librarian 30030

27 28 32 M writer 55369

28 29 41 M programmer 94043

29 30 7 M student 55436

1. Initial Investigation


In [ ]:

user_details.shape

In [ ]:

user_details.isna().sum()

In [ ]:

user_details.dtypes

In [ ]:

user_details.describe(include = 'all')

1. What are the different occupations that users are doing?

In [ ]:

print(user_details['occupation'].unique())

2. How many male technicians and female technicians?

In [ ]:

user_details.head()

NOTE:
Discrete Data - not measurable (no units); it is just a count and cannot be subdivided.

Continuous Data - measurable (it has units) and can be subdivided.

3. Male and Female average age.

In [ ]:

user_details.groupby(by = 'gender')['age'].mean()

In [ ]:

round(user_details.groupby(by = ['occupation'])['age'].mean())

In [ ]:

user_details.groupby(by = ['occupation'])['user_id'].count()


In [ ]:

user_details.groupby(by = ['gender','occupation'])['user_id'].count()
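The `pd.crosstab` touched on earlier gives the same gender-by-occupation counts laid out as a 2-D table; a sketch with hypothetical users:

```python
import pandas as pd

# Hypothetical users; crosstab tabulates one categorical against another.
users = pd.DataFrame({"gender": ["M", "F", "M", "M"],
                      "occupation": ["technician", "writer", "writer", "technician"]})

table = pd.crosstab(users["gender"], users["occupation"])
print(table.loc["M", "technician"])  # 2
```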

In [ ]:

salesman_data = user_details[user_details['occupation'] == 'salesman']


salesman_data

In [ ]:

salesman_data.describe()

Chapter_02 - Filtering & Sorting


In [27]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics


euro_2012

Out[27]:

    Team                 Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals
0   Croatia              4      13               12                51.9%              16.0%             32                          0             0
1   Czech Republic       4      13               18                41.9%              12.9%             39                          0             0
2   Denmark              4      10               10                50.0%              20.0%             27                          1             0
3   England              5      11               18                50.0%              17.2%             40                          0             0
4   France               3      22               24                37.9%              6.5%              65                          1             0
5   Germany              10     32               32                47.8%              15.6%             80                          2             1
6   Greece               5      8                18                30.7%              19.2%             32                          1             1
7   Italy                6      34               45                43.0%              7.5%              110                         2             0
8   Netherlands          2      12               36                25.0%              4.1%              60                          2             0
9   Poland               2      15               23                39.4%              5.2%              48                          0             0
10  Portugal             6      22               42                34.3%              9.3%              82                          6             0
11  Republic of Ireland  1      7                12                36.8%              5.2%              28                          0             0
12  Russia               5      9                31                22.5%              12.5%             59                          2             0
13  Spain                12     42               33                55.9%              16.0%             100                         0             1
14  Sweden               5      17               19                47.2%              13.8%             39                          3             0
15  Ukraine              2      7                26                21.2%              6.0%              38                          0             0

16 rows × 35 columns


In [ ]:

euro_2012.columns

CONFIGURATION

In [ ]:

pd.set_option('max_columns',None)
#pd.set_option('max_rows',None)

In [ ]:

euro_2012

1. Initial Analysis

In [ ]:

euro_2012.shape

In [ ]:

euro_2012.describe(include='all')

2. Filter Team, Goals, Shooting Accuracy, Yellow Cards and Red Cards


In [28]:

euro_filtered_data = euro_2012[['Team','Goals','Shooting Accuracy','Yellow Cards','Red Cards']]


euro_filtered_data

Out[28]:

Team Goals Shooting Accuracy Yellow Cards Red Cards

0 Croatia 4 51.9% 9 0

1 Czech Republic 4 41.9% 7 0

2 Denmark 4 50.0% 4 0

3 England 5 50.0% 5 0

4 France 3 37.9% 6 0

5 Germany 10 47.8% 4 0

6 Greece 5 30.7% 9 1

7 Italy 6 43.0% 16 0

8 Netherlands 2 25.0% 5 0

9 Poland 2 39.4% 7 1

10 Portugal 6 34.3% 12 0

11 Republic of Ireland 1 36.8% 6 1

12 Russia 5 22.5% 6 0

13 Spain 12 55.9% 11 0

14 Sweden 5 47.2% 7 0

15 Ukraine 2 21.2% 5 0

In [ ]:

euro_filtered_data.sort_values(by = 'Red Cards',ascending=False,inplace=False)

In [ ]:

euro_filtered_data

In [ ]:

import warnings
warnings.filterwarnings(action = 'ignore')

In [ ]:

euro_filtered_data.sort_values(by = 'Red Cards',ascending=False)#,inplace=True)

In [ ]:

euro_filtered_data

In [ ]:

euro_filtered_data[euro_filtered_data['Goals'] > 4]


In [ ]:

euro_filtered_data[euro_filtered_data['Goals'] > 4]

In [ ]:

euro_filtered_data[euro_filtered_data['Red Cards'] == 1]

In [29]:

#Try to rectify the error below


(euro_filtered_data[euro_filtered_data['Goals'] > 4]) and (euro_filtered_data[euro_filtered_data['Red Cards'] == 1])

---------------------------------------------------------------------------

ValueError Traceback (most recent call last)

<ipython-input-29-9a66c94aa3ba> in <module>

1 #Try to rectify the

----> 2 (euro_filtered_data[euro_filtered_data['Goals'] > 4]) and (euro_filtered_data[euro_filtered_data['Red Cards'] == 1])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonze
ro__(self)

1440 @final

1441 def __nonzero__(self):

-> 1442             raise ValueError(

   1443                 f"The truth value of a {type(self).__name__} is ambiguous. "

   1444                 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [34]:

euro_filtered_data[(euro_filtered_data['Goals'] > 4) & (euro_filtered_data['Red Cards'] == 1)]

Out[34]:

Team Goals Shooting Accuracy Yellow Cards Red Cards

6 Greece 5 30.7% 9 1
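The ValueError above arises because Python's `and`/`or` try to reduce a whole Series to a single bool; element-wise masks use `&`, `|`, and `~` instead, with parentheses around each comparison because `&` binds tighter than `>` and `==`. A sketch on a hypothetical mini frame:

```python
import pandas as pd

# Hypothetical mini version of euro_filtered_data.
df = pd.DataFrame({"Goals": [5, 2, 12], "Red Cards": [1, 0, 0]})

# & and | operate element-wise on boolean Series; Python's and/or would
# raise the ambiguous-truth-value ValueError shown above.
both = df[(df["Goals"] > 4) & (df["Red Cards"] == 1)]
either = df[(df["Goals"] > 4) | (df["Red Cards"] == 1)]
print(len(both), len(either))  # 1 2
```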

Chapter_3 Indexing
In [ ]:

user_details = pd.read_csv('u.user',sep = '|') #user_details


user_details

1. Display the first 5 observations of first 3 features.

In [ ]:

user_details.head()


In [ ]:

#Index locator
user_details.iloc[0:5,0:3]

In [ ]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics


euro_2012

In [ ]:

euro_2012.iloc[:5,4:6] #Pass the index number of the columns

In [ ]:

#Locator
euro_2012.loc[:5,['Shooting Accuracy','% Goals-to-shots']] #Pass the feature names

In [ ]:

euro_2012.iloc[[0,2,5],[1,5,9]]

In [ ]:

euro_2012.loc[[0,2,5],['Shooting Accuracy','% Goals-to-shots']]
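The difference between the two accessors: `iloc` is purely positional (slices are end-exclusive), while `loc` works on index labels (slices are end-inclusive). A sketch with a hypothetical non-default index:

```python
import pandas as pd

# Hypothetical frame with non-default index labels 10, 20, 30.
df = pd.DataFrame({"Team": ["Croatia", "Denmark", "Spain"]}, index=[10, 20, 30])

print(df.iloc[0]["Team"])  # position 0 -> Croatia
print(df.loc[30]["Team"])  # label 30  -> Spain
# iloc slices exclude the end, loc label slices include it:
print(len(df.iloc[0:2]), len(df.loc[10:20]))  # 2 2
```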

Chapter_4 Creating Series and DataFrames


Pandas provides two core data structures:

Series - the 1-dimensional data structure
DataFrame - the 2-dimensional data structure

In [ ]:

a = [1,2,3,4,5]
type(a)

In [ ]:

pandas_1D = pd.Series([1,2,3,4,5])
type(pandas_1D)

In [ ]:

pandas_2D = pd.DataFrame(data = {'Fname' : ['Alka','Barun'],
                                 'Lname' : ['Goutam','Das']})
pandas_2D

Chapter_5 Deleting


In [35]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics


euro_2012

Out[35]:

    Team                 Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals
0   Croatia              4      13               12                51.9%              16.0%             32                          0             0
1   Czech Republic       4      13               18                41.9%              12.9%             39                          0             0
2   Denmark              4      10               10                50.0%              20.0%             27                          1             0
3   England              5      11               18                50.0%              17.2%             40                          0             0
4   France               3      22               24                37.9%              6.5%              65                          1             0
5   Germany              10     32               32                47.8%              15.6%             80                          2             1
6   Greece               5      8                18                30.7%              19.2%             32                          1             1
7   Italy                6      34               45                43.0%              7.5%              110                         2             0
8   Netherlands          2      12               36                25.0%              4.1%              60                          2             0
9   Poland               2      15               23                39.4%              5.2%              48                          0             0
10  Portugal             6      22               42                34.3%              9.3%              82                          6             0
11  Republic of Ireland  1      7                12                36.8%              5.2%              28                          0             0
12  Russia               5      9                31                22.5%              12.5%             59                          2             0
13  Spain                12     42               33                55.9%              16.0%             100                         0             1
14  Sweden               5      17               19                47.2%              13.8%             39                          3             0
15  Ukraine              2      7                26                21.2%              6.0%              38                          0             0

16 rows × 35 columns

In [36]:

del euro_2012['Goals']


In [37]:

euro_2012

Out[37]:

    Team                 Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored
0   Croatia              13               12                51.9%              16.0%             32                          0             0              0
1   Czech Republic       13               18                41.9%              12.9%             39                          0             0              0
2   Denmark              10               10                50.0%              20.0%             27                          1             0              0
3   England              11               18                50.0%              17.2%             40                          0             0              0
4   France               22               24                37.9%              6.5%              65                          1             0              0
5   Germany              32               32                47.8%              15.6%             80                          2             1              0
6   Greece               8                18                30.7%              19.2%             32                          1             1              1
7   Italy                34               45                43.0%              7.5%              110                         2             0              0
8   Netherlands          12               36                25.0%              4.1%              60                          2             0              0
9   Poland               15               23                39.4%              5.2%              48                          0             0              0
10  Portugal             22               42                34.3%              9.3%              82                          6             0              0
11  Republic of Ireland  7                12                36.8%              5.2%              28                          0             0              0
12  Russia               9                31                22.5%              12.5%             59                          2             0              0
13  Spain                42               33                55.9%              16.0%             100                         0             1              0
14  Sweden               17               19                47.2%              13.8%             39                          3             0              0
15  Ukraine              7                26                21.2%              6.0%              38                          0             0              0

16 rows × 34 columns


In [41]:

euro_2012[['Team','Shots on target']]

Out[41]:

Team Shots on target

0 Croatia 13

1 Czech Republic 13

2 Denmark 10

3 England 11

4 France 22

5 Germany 32

6 Greece 8

7 Italy 34

8 Netherlands 12

9 Poland 15

10 Portugal 22

11 Republic of Ireland 7

12 Russia 9

13 Spain 42

14 Sweden 17

15 Ukraine 7

In [42]:

euro_2012.Team

Out[42]:

0 Croatia

1 Czech Republic

2 Denmark

3 England

4 France

5 Germany

6 Greece

7 Italy

8 Netherlands

9 Poland

10 Portugal

11 Republic of Ireland

12 Russia

13 Spain

14 Sweden

15 Ukraine

Name: Team, dtype: object


In [43]:

del euro_2012[['Shots on target','Shots off target']]

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-43-1b67689e46a4> in <module>

----> 1 del euro_2012[['Shots on target','Shots off target']]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __delitem__(self, key)

   3964             # there was no match, this call should raise the appropriate

   3965             # exception:

-> 3966             loc = self.axes[-1].get_loc(key)

   3967             self._mgr.idelete(loc)

   3968

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)

   3078         casted_key = self._maybe_cast_indexer(key)

   3079         try:

-> 3080             return self._engine.get_loc(casted_key)

   3081         except KeyError as err:

   3082             raise KeyError(key) from err

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '['Shots on target', 'Shots off target']' is an invalid key

In [53]:

euro_2012 = euro_2012.drop(labels=['Penalty goals','Team'],axis=1)


In [54]:

euro_2012

Out[54]:

    Shots on target  Shots off target  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalties not scored  Headed goals  Passes  Passes completed  Passing Accuracy
0   13               12                16.0%             32                          0             0                     2             1076    828               76.9
1   13               18                12.9%             39                          0             0                     0             1565    1223              78.
2   10               10                20.0%             27                          1             0                     3             1298    1082              83.3
3   11               18                17.2%             40                          0             0                     3             1488    1200              80.6
4   22               24                6.5%              65                          1             0                     0             2066    1803              87.2
5   32               32                15.6%             80                          2             0                     2             2774    2427              87.4
6   8                18                19.2%             32                          1             1                     0             1187    911               76.7
7   34               45                7.5%              110                         2             0                     2             3016    2531              83.9
8   12               36                4.1%              60                          2             0                     0             1556    1381              88.7
9   15               23                5.2%              48                          0             0                     1             1059    852               80.4
10  22               42                9.3%              82                          6             0                     2             1891    1461              77.2
11  7                12                5.2%              28                          0             0                     1             851     606               71.2
12  9                31                12.5%             59                          2             0                     1             1602    1345              83.9
13  42               33                16.0%             100                         0             0                     2             4317    3820              88.4
14  17               19                13.8%             39                          3             0                     1             1192    965               80.9
15  7                26                6.0%              38                          0             0                     2             1276    1043              81.7

16 rows × 31 columns

TASK 1 - Explore List Comprehension
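For TASK 1, a list comprehension builds a list by transforming and filtering an iterable in one expression; a sketch with hypothetical column names:

```python
# Filter names containing 'Shots' (hypothetical list) and build squares.
columns = ["Team", "Shots on target", "Shots off target", "Goals"]

shot_columns = [c for c in columns if "Shots" in c]  # filter clause
squares = [n * n for n in range(5)]                  # transform clause

print(shot_columns)  # ['Shots on target', 'Shots off target']
print(squares)       # [0, 1, 4, 9, 16]
```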

Chapter 6 - Apply Function


In [58]:

student_details = pd.read_csv('student-mat.csv')
student_details

Out[58]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian

0 GP F 18 U GT3 A 4 4 at_home teacher course

1 GP F 17 U GT3 T 1 1 at_home other course

2 GP F 15 U LE3 T 1 1 at_home other other

3 GP F 15 U GT3 T 4 2 health services home

4 GP F 16 U GT3 T 3 3 other other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 2 2 services services course

391 MS M 17 U LE3 T 3 1 services services course

392 MS M 21 R GT3 T 1 1 other other course

393 MS M 18 R LE3 T 3 2 services other course

394 MS M 19 U LE3 T 1 1 other at_home course

395 rows × 33 columns

In [57]:

pd.set_option('max_columns',None)

In [62]:

student_details['Medu'] = student_details['Medu'].apply(func = lambda x:x+2)


In [63]:

student_details

Out[63]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian

0 GP F 18 U GT3 A 6 4 at_home teacher course

1 GP F 17 U GT3 T 3 1 at_home other course

2 GP F 15 U LE3 T 3 1 at_home other other

3 GP F 15 U GT3 T 6 2 health services home

4 GP F 16 U GT3 T 5 3 other other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services services course

391 MS M 17 U LE3 T 5 1 services services course

392 MS M 21 R GT3 T 3 1 other other course

393 MS M 18 R LE3 T 5 2 services other course

394 MS M 19 U LE3 T 3 1 other at_home course

395 rows × 33 columns

In-class exercise - Capitalize the first letter of each Fjob value.


In [67]:

student_details['Fjob'] = student_details.Fjob.apply(lambda x:x.capitalize())


student_details

Out[67]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 33 columns

In-class Exercise - 2

Create a new column named 'Eligibility_Criteria' that returns 1 if age > 17, else 0.

In [68]:

def get_age(x):
if x>17:
return 1
else:
return 0


In [70]:

student_details['Eligibity_Criteria'] = student_details['age'].apply(func = get_age)


student_details

Out[70]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 34 columns

In [71]:

student_details['Eligibity_Criteria_2'] = student_details['age'].apply(func = lambda x: 1 i


student_details

Out[71]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 35 columns
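The same flag can also be built without a row-by-row `apply`, since boolean comparisons on a Series are vectorized. A minimal sketch on a made-up frame (the real `student_details` comes from the course CSV):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for student_details
students = pd.DataFrame({'age': [18, 17, 15, 20]})

# Boolean comparison -> int avoids a Python-level function call per row
students['Eligibility_Criteria'] = (students['age'] > 17).astype(int)

# np.where reads well when both branches matter
students['Eligibility_Criteria_2'] = np.where(students['age'] > 17, 1, 0)
```

Both approaches produce the same 0/1 column; the vectorized forms are noticeably faster on large frames.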


Chapter_07 Concat || Merge || Append Function


In [75]:

sales_2017 = pd.read_csv('Sales Transactions-2017.csv')


sales_2017

Out[75]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount

[Preview of the wide DataFrame; the cell text wrapped in the PDF export. The first rows are 1/4/2017 sales to SOLANKI PLASTICS and SARNESWARA TRADERS; the last four rows are two blank spacer rows plus two "Total" summary rows.]

47290 rows × 9 columns


In [76]:

sales_2018 = pd.read_csv('Sales Transactions-2018.csv')


sales_2018

Out[76]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount

[Preview of the wide DataFrame; the cell text wrapped in the PDF export. The first rows are 1/4/2018 sales to party TP13; the last four rows are two blank spacer rows plus two "Total" summary rows.]

44740 rows × 9 columns


In [77]:

sales_2019 = pd.read_csv('Sales Transactions-2019.csv')


sales_2019

Out[77]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount

[Preview of the wide DataFrame; the cell text wrapped in the PDF export. The first rows are 1/4/2019 sales to BALAJI PLASTICS; the last four rows are two blank spacer rows plus two "Total" summary rows.]

19176 rows × 9 columns

In [80]:

sales_2017.shape,sales_2018.shape,sales_2019.shape

Out[80]:

((47290, 9), (44740, 9), (19176, 9))


In [81]:

sales_full_data = pd.concat([sales_2017,sales_2018,sales_2019])
sales_full_data

Out[81]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount

[Preview of the concatenated DataFrame; the cell text wrapped in the PDF export. The 2017 rows come first and the 2019 rows last, with each frame's original index labels preserved (the final label is 19175).]

111206 rows × 9 columns

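Note that `concat` keeps each frame's original index by default, so labels repeat across years. `ignore_index=True` renumbers the result, and `keys=` tags each row with its source frame instead — a sketch on toy frames:

```python
import pandas as pd

# Tiny stand-ins for the yearly sales frames
a = pd.DataFrame({'Qty': [1, 2]})
b = pd.DataFrame({'Qty': [3, 4]})

# Default: index labels repeat (0, 1, 0, 1)
stacked = pd.concat([a, b])

# ignore_index=True renumbers 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)

# keys= builds a MultiIndex marking which frame each row came from
tagged = pd.concat([a, b], keys=['2017', '2018'])
```

With `keys=`, `tagged.loc['2018']` recovers just that year's rows, which is handy once the yearly files are stacked.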

In [82]:

sales_2017.append([sales_2018,sales_2019])

Out[82]:

Date Voucher Party Product Qty Rate Gross Disc Voucher Amount

[Preview identical to the concat result above: the 2017 rows first, the 2019 rows last, original index labels preserved.]

111206 rows × 9 columns

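A caveat on `append`: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current versions only the `concat` form works. A minimal sketch of the replacement:

```python
import pandas as pd

a = pd.DataFrame({'Qty': [1, 2]})
b = pd.DataFrame({'Qty': [3, 4]})

# a.append(b) no longer exists in pandas >= 2.0; this is the equivalent
combined = pd.concat([a, b], ignore_index=True)
```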
In [86]:

sales_full_data.dtypes

Out[86]:

Date object

Voucher object

Party object

Product object

Qty object

Rate object

Gross object

Disc object

Voucher Amount object

dtype: object


In [88]:

sales_full_data.isna().sum()

Out[88]:

Date 12591

Voucher 12557

Party 40

Product 12591

Qty 12557

Rate 12558

Gross 12558

Disc 105609

Voucher Amount 83646

dtype: int64
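Most of these missing values come from the blank spacer rows and the "Total" summary rows at the bottom of each yearly file. One cleanup option is to keep only real transaction rows — a sketch on a made-up frame mimicking that layout:

```python
import pandas as pd

# Toy stand-in: two data rows, one blank spacer row, one 'Total' summary row
sales = pd.DataFrame({
    'Date':    ['1/4/2017', '1/4/2017', None, None],
    'Voucher': ['Sal:1', 'Sal:2', None, 'Total'],
    'Qty':     ['2', '6', None, '607,734.60'],
})

# Keep rows that have a Date and are not 'Total' summary lines
clean = sales[sales['Date'].notna() & (sales['Voucher'] != 'Total')]
```

After this filter the remaining NaN counts reflect genuinely missing fields rather than report formatting.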

TASK 2 - Explore Merge function - Left join, Right join, Inner join, Outer join
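The four join types can be sketched on two made-up frames (the party names are borrowed from the data; the City column is invented for illustration):

```python
import pandas as pd

parties = pd.DataFrame({'Party': ['SOLANKI PLASTICS', 'TP13'],
                        'City':  ['Hyderabad', 'Vizag']})  # cities are made up
orders = pd.DataFrame({'Party': ['TP13', 'K.SRIHARI'],
                       'Qty':   [50, 400]})

inner = parties.merge(orders, on='Party', how='inner')  # only keys in both (TP13)
left  = parties.merge(orders, on='Party', how='left')   # all parties, NaN Qty if no order
right = parties.merge(orders, on='Party', how='right')  # all orders, NaN City if unknown party
outer = parties.merge(orders, on='Party', how='outer')  # union of keys from both sides
```

Unmatched keys show up as NaN on the side that lacks them, which is the quickest way to see which join you actually need.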

TASK 3 - Datatype conversion and Data Cleaning on Sales data

In [84]:

sales_cleaned_data = pd.read_csv('Sales-Transactions-Edited.csv')
sales_cleaned_data

Out[84]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1690.0

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1620.0

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23.0

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1620.0

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1690.0

... ... ... ... ... ... ...

95557 12/9/2019 4265 TP13 SPOON MED M.W 20 11.0

95558 12/9/2019 4266 K.SRIHARI SMART BOUL(48) 1 1830.0

95559 12/9/2019 4267 SMS SMARTBOUL GLA(4000) 1 1520.0

95560 12/9/2019 4268 ANILFANCY RR WINEGLASS 100 20.0

95561 12/9/2019 4268 ANILFANCY RR WATER GLASS 100 20.0

95562 rows × 6 columns


In [85]:

sales_cleaned_data.dtypes

# Date should be converted to a datetime64 dtype

Out[85]:

Date object

Voucher int64

Party object

Product object

Qty int64

Rate float64

dtype: object
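The remaining conversion is the Date column. The raw files use day-first dates and comma-separated thousands, so a sketch of the conversion on a toy frame looks like this:

```python
import pandas as pd

raw = pd.DataFrame({'Date': ['1/4/2017', '31/03/2018'],
                    'Rate': ['1,690.00', '23']})

# Day-first strings -> datetime64[ns]
raw['Date'] = pd.to_datetime(raw['Date'], dayfirst=True)

# Strip thousands separators, then convert to float
raw['Rate'] = pd.to_numeric(raw['Rate'].str.replace(',', '', regex=False))
```

With `dayfirst=True`, '1/4/2017' parses as 1 April 2017 rather than 4 January.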

In [87]:

sales_cleaned_data.isna().sum()

Out[87]:

Date 0

Voucher 0

Party 0

Product 0

Qty 0

Rate 1

dtype: int64

Chapter_08 Grouping Vs Pivot table Vs Crosstab


In [91]:

insurance_data = pd.read_csv('insurance.csv')
insurance_data

Out[91]:

age sex bmi children smoker region charges

0 19 female 27.900 0 yes southwest 16884.92400

1 18 male 33.770 1 no southeast 1725.55230

2 28 male 33.000 3 no southeast 4449.46200

3 33 male 22.705 0 no northwest 21984.47061

4 32 male 28.880 0 no northwest 3866.85520

... ... ... ... ... ... ... ...

1333 50 male 30.970 3 no northwest 10600.54830

1334 18 female 31.920 0 no northeast 2205.98080

1335 18 female 36.850 0 no southeast 1629.83350

1336 21 female 25.800 0 no southwest 2007.94500

1337 61 female 29.070 0 yes northwest 29141.36030

1338 rows × 7 columns

In [92]:

insurance_data.describe(include = 'all')

Out[92]:

age sex bmi children smoker region charges

count 1338.000000 1338 1338.000000 1338.000000 1338 1338 1338.000000

unique NaN 2 NaN NaN 2 4 NaN

top NaN male NaN NaN no southeast NaN

freq NaN 676 NaN NaN 1064 364 NaN

mean 39.207025 NaN 30.663397 1.094918 NaN NaN 13270.422265

std 14.049960 NaN 6.098187 1.205493 NaN NaN 12110.011237

min 18.000000 NaN 15.960000 0.000000 NaN NaN 1121.873900

25% 27.000000 NaN 26.296250 0.000000 NaN NaN 4740.287150

50% 39.000000 NaN 30.400000 1.000000 NaN NaN 9382.033000

75% 51.000000 NaN 34.693750 2.000000 NaN NaN 16639.912515

max 64.000000 NaN 53.130000 5.000000 NaN NaN 63770.428010

1. What is the average insurance charge per region?


In [93]:

insurance_data['region'].unique()

Out[93]:

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [96]:

round(insurance_data.groupby(by='region')['charges'].mean().sort_values(ascending = False))

Out[96]:

region

southeast 14735.0

northeast 13406.0

northwest 12418.0

southwest 12347.0

Name: charges, dtype: float64
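`groupby` is not limited to a single statistic; `.agg` computes several at once. A sketch on made-up numbers:

```python
import pandas as pd

# Toy stand-in for insurance_data
ins = pd.DataFrame({'region':  ['southeast', 'southeast', 'southwest'],
                    'charges': [100.0, 300.0, 50.0]})

# One row per region, one column per statistic
stats = ins.groupby('region')['charges'].agg(['mean', 'min', 'max', 'count'])
```

Applied to the real data, this gives the region-wise mean alongside spread and sample size in a single table.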

2. What is the average insurance charge per region and gender?

In [98]:

round(insurance_data.groupby(by=['region','sex'])['charges'].mean().sort_values(ascending = False))

Out[98]:

region sex

southeast male 15880.0

northeast male 13854.0

southeast female 13500.0

southwest male 13413.0

northeast female 12953.0

northwest female 12480.0

male 12354.0

southwest female 11274.0

Name: charges, dtype: float64

In [101]:

round(pd.pivot_table(data = insurance_data, values='charges', index='region', columns=['sex','smoker']))

Out[101]:

sex female male

smoker no yes no yes

region

northeast 9640.0 28032.0 8664.0 30926.0

northwest 8787.0 29671.0 8321.0 30713.0

southeast 8440.0 33035.0 7609.0 36030.0

southwest 8234.0 31688.0 7779.0 32599.0

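`pivot_table` also takes `aggfunc` for the statistic and `margins=True` for row/column totals — a sketch on made-up numbers:

```python
import pandas as pd

# Toy stand-in for insurance_data
ins = pd.DataFrame({'region':  ['northeast', 'northeast', 'southwest'],
                    'smoker':  ['no', 'yes', 'no'],
                    'charges': [100.0, 300.0, 50.0]})

# margins=True appends an 'All' row and column of overall means
pt = pd.pivot_table(ins, values='charges', index='region',
                    columns='smoker', aggfunc='mean', margins=True)
```

The 'All' row/column is computed with the same `aggfunc`, so here it holds grand means rather than sums.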

In [103]:

pd.crosstab(index = insurance_data['smoker'] ,columns = insurance_data['sex'],margins=True)

Out[103]:

sex female male All

smoker

no 547 517 1064

yes 115 159 274

All 662 676 1338

In [104]:

pd.crosstab(index = insurance_data['region'], columns = insurance_data['children'], margins=True)

Out[104]:

children 0 1 2 3 4 5 All

region

northeast 147 77 51 39 7 3 324

northwest 132 74 66 46 6 1 325

southeast 157 95 66 35 5 6 364

southwest 138 78 57 37 7 8 325

All 574 324 240 157 25 18 1338
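Beyond raw counts, `crosstab` can normalize to proportions with the `normalize` argument — a sketch on made-up rows:

```python
import pandas as pd

# Toy stand-in for insurance_data
ins = pd.DataFrame({'smoker': ['no', 'no', 'yes', 'no'],
                    'sex':    ['female', 'male', 'male', 'female']})

# normalize='index' turns each row's counts into proportions summing to 1
shares = pd.crosstab(ins['smoker'], ins['sex'], normalize='index')
```

`normalize='columns'` and `normalize='all'` divide by column totals and the grand total instead.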

In [ ]:
