0% found this document useful (0 votes)

60 views37 pages

Pandas - Data Analysis - Deep Dive - Jupyter Notebook

This document provides an overview of analyzing data with Pandas by reading in different data files, performing initial analysis on the data such as checking the shape and dtypes, and exploring different attributes and methods. It demonstrates how to load data from CSV and TSV files, check the number of observations and features, and get information about the data types of each column.

Uploaded by

dimple mahadule

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

60 views37 pages

Pandas - Data Analysis - Deep Dive - Jupyter Notebook

Uploaded by

dimple mahadule

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

Read the files

In [1]:

pip install pandas

Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-p

ackages (1.2.4)

Requirement already satisfied: numpy>=1.16.5 in c:\users\x1 yoga\appdata\roa

ming\python\python38\site-packages (from pandas) (1.19.5)

Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib

\site-packages (from pandas) (2021.1)

Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anac

onda3\lib\site-packages (from pandas) (2.8.1)

Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site

-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

Note: you may need to restart the kernel to use updated packages.

In [2]:

!pip install pandas

Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-p

ackages (1.2.4)

Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anac

onda3\lib\site-packages (from pandas) (2.8.1)

Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib

\site-packages (from pandas) (2021.1)

Requirement already satisfied: numpy>=1.16.5 in c:\users\x1 yoga\appdata\roa

ming\python\python38\site-packages (from pandas) (1.19.5)

Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site

-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

In [3]:

import pandas as pd #Alias

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 1/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [4]:

pd.read_csv('drinks.csv')

Out[4]:

country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol cont

0 Afghanistan 0 0 0 0.0

1 Albania 89 132 54 4.9

2 Algeria 25 0 14 0.7

3 Andorra 245 138 312 12.4

4 Angola 217 57 45 5.9

... ... ... ... ... ...

188 Venezuela 333 100 3 7.7

189 Vietnam 111 2 1 2.0

190 Yemen 6 0 0 0.1

191 Zambia 32 19 4 2.5

192 Zimbabwe 64 18 4 4.7

193 rows × 6 columns

In [5]:

pandas.read_csv('drinks.csv',sep=',')

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-5-c5ee57efda1b> in <module>

----> 1 pandas.read_csv('drinks.csv',sep=',')

NameError: name 'pandas' is not defined

In [6]:

pandas.read_csv('insurance.csv',sep = ',')

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-6-a4d86e110cf2> in <module>

----> 1 pandas.read_csv('insurance.csv',sep = ',')

NameError: name 'pandas' is not defined

In [ ]:

pd.read_csv('insurance.csv')

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 2/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

pandas.read_csv('data.tsv.txt',sep = '\t') #feature

In [ ]:

pandas.read_csv('u.user')#,sep = '|')

In [ ]:

pd.read_csv('u.user',sep = '|')

In [ ]:

pandas.read_csv(r'D:\My_Materials\Reference Notes\AI Course\Day_06_Pandas Exercise\01_Getti

In [ ]:

pandas.crosstab()

In [ ]:

pandas.concat

In [ ]:

pd.read_csv('D:/My_Materials/Reference Notes/AI Course/Day_06_Pandas Exercise/01_Getting &

Chapter 1 - Getting and Knowing your data

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 3/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [7]:

purchase_order_data = pd.read_csv('data.tsv.txt',sep = '\t') #purchase order details

purchase_order_data

Out[7]:

order_id quantity item_name choice_description item_price

Chips and Fresh Tomato

0 1 1 NaN $2.39
Salsa

1 1 1 Izze [Clementine] $3.39

2 1 1 Nantucket Nectar [Apple] $3.39

Chips and Tomatillo-

3 1 1 NaN $2.39
Green Chili Salsa

[Tomatillo-Red Chili Salsa (Hot),

4 2 2 Chicken Bowl $16.98
[Black Beans...

... ... ... ... ... ...

[Fresh Tomato Salsa, [Rice, Black

4617 1833 1 Steak Burrito $11.75
Beans, Sour ...

[Fresh Tomato Salsa, [Rice, Sour

4618 1833 1 Steak Burrito $11.75
Cream, Cheese...

[Fresh Tomato Salsa, [Fajita

4619 1834 1 Chicken Salad Bowl $11.25
Vegetables, Pinto...

[Fresh Tomato Salsa, [Fajita

4620 1834 1 Chicken Salad Bowl $8.75
Vegetables, Lettu...

[Fresh Tomato Salsa, [Fajita

4621 1834 1 Chicken Salad Bowl $8.75
Vegetables, Pinto...

4622 rows × 5 columns

1. Initial Analysis

In [8]:

purchase_order_data.shape #Attribute

Out[8]:

(4622, 5)

In [9]:

purchase_order_data.shape[0]

Out[9]:

4622

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 4/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [10]:

purchase_order_data.shape[1]

Out[10]:

Terminology Alert
Rows - Observations/Records/Datapoints
Columns - Features/Parameters

In [11]:

print('Total No of Observations : {}\nTotal No of Parameters : {}'.format(purchase_order_da

Total No of Observations : 4622

Total No of Parameters : 5

In [12]:

purchase_order_data.dtypes #Attribute

Out[12]:

order_id int64

quantity int64

item_name object

choice_description object

item_price object

dtype: object

In [13]:

purchase_order_data.info()

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 3376 non-null object

4 item_price 4622 non-null object

dtypes: int64(2), object(3)

memory usage: 180.7+ KB

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 5/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [14]:

purchase_order_data.isna().sum()

Out[14]:

order_id 0

quantity 0

item_name 0

choice_description 1246

item_price 0

dtype: int64

In [15]:

purchase_order_data.describe(include='all')

Out[15]:

order_id quantity item_name choice_description item_price

count 4622.000000 4622.000000 4622 3376 4622

unique NaN NaN 50 1043 78

top NaN NaN Chicken Bowl [Diet Coke] $8.75

freq NaN NaN 726 134 730

mean 927.254868 1.075725 NaN NaN NaN

std 528.890796 0.410186 NaN NaN NaN

min 1.000000 1.000000 NaN NaN NaN

25% 477.250000 1.000000 NaN NaN NaN

50% 926.000000 1.000000 NaN NaN NaN

75% 1393.000000 1.000000 NaN NaN NaN

max 1834.000000 15.000000 NaN NaN NaN

1. How many items are available in the restaurent?

In [16]:

purchase_order_data['item_name'].nunique()

Out[16]:

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 6/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [17]:

print(purchase_order_data['item_name'].unique())

['Chips and Fresh Tomato Salsa' 'Izze' 'Nantucket Nectar'

'Chips and Tomatillo-Green Chili Salsa' 'Chicken Bowl' 'Side of Chips'

'Steak Burrito' 'Steak Soft Tacos' 'Chips and Guacamole'

'Chicken Crispy Tacos' 'Chicken Soft Tacos' 'Chicken Burrito'

'Canned Soda' 'Barbacoa Burrito' 'Carnitas Burrito' 'Carnitas Bowl'

'Bottled Water' 'Chips and Tomatillo Green Chili Salsa' 'Barbacoa Bowl'

'Chips' 'Chicken Salad Bowl' 'Steak Bowl' 'Barbacoa Soft Tacos'

'Veggie Burrito' 'Veggie Bowl' 'Steak Crispy Tacos'

'Chips and Tomatillo Red Chili Salsa' 'Barbacoa Crispy Tacos'

'Veggie Salad Bowl' 'Chips and Roasted Chili-Corn Salsa'

'Chips and Roasted Chili Corn Salsa' 'Carnitas Soft Tacos'

'Chicken Salad' 'Canned Soft Drink' 'Steak Salad Bowl'

'6 Pack Soft Drink' 'Chips and Tomatillo-Red Chili Salsa' 'Bowl'

'Burrito' 'Crispy Tacos' 'Carnitas Crispy Tacos' 'Steak Salad'

'Chips and Mild Fresh Tomato Salsa' 'Veggie Soft Tacos'

'Carnitas Salad Bowl' 'Barbacoa Salad Bowl' 'Salad' 'Veggie Crispy Tacos'

'Veggie Salad' 'Carnitas Salad']

2. Which was the most ordered item?

In [18]:

purchase_order_data

Out[18]:

order_id quantity item_name choice_description item_price

Chips and Fresh Tomato

0 1 1 NaN $2.39
Salsa

1 1 1 Izze [Clementine] $3.39

2 1 1 Nantucket Nectar [Apple] $3.39

Chips and Tomatillo-

3 1 1 NaN $2.39
Green Chili Salsa

[Tomatillo-Red Chili Salsa (Hot),

4 2 2 Chicken Bowl $16.98
[Black Beans...

... ... ... ... ... ...

[Fresh Tomato Salsa, [Rice, Black

4617 1833 1 Steak Burrito $11.75
Beans, Sour ...

[Fresh Tomato Salsa, [Rice, Sour

4618 1833 1 Steak Burrito $11.75
Cream, Cheese...

[Fresh Tomato Salsa, [Fajita

4619 1834 1 Chicken Salad Bowl $11.25
Vegetables, Pinto...

[Fresh Tomato Salsa, [Fajita

4620 1834 1 Chicken Salad Bowl $8.75
Vegetables, Lettu...

[Fresh Tomato Salsa, [Fajita

4621 1834 1 Chicken Salad Bowl $8.75
Vegetables, Pinto...

4622 rows × 5 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 7/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [19]:

purchase_order_data.head(30)

Out[19]:

order_id quantity item_name choice_description item_price

Chips and Fresh Tomato

0 1 1 NaN $2.39
Salsa

1 1 1 Izze [Clementine] $3.39

2 1 1 Nantucket Nectar [Apple] $3.39

Chips and Tomatillo-Green

3 1 1 NaN $2.39
Chili Salsa

[Tomatillo-Red Chili Salsa (Hot),

4 2 2 Chicken Bowl $16.98
[Black Beans...

[Fresh Tomato Salsa (Mild), [Rice,

5 3 1 Chicken Bowl $10.98
Cheese, Sou...

6 3 1 Side of Chips NaN $1.69

[Tomatillo Red Chili Salsa, [Fajita

7 4 1 Steak Burrito $11.75
Vegetables...

[Tomatillo Green Chili Salsa, [Pinto

8 4 1 Steak Soft Tacos $9.25
Beans, Ch...

[Fresh Tomato Salsa, [Rice, Black

9 5 1 Steak Burrito $9.25
Beans, Pinto...

10 5 1 Chips and Guacamole NaN $4.45

[Roasted Chili Corn Salsa, [Fajita

11 6 1 Chicken Crispy Tacos $8.75
Vegetables,...

[Roasted Chili Corn Salsa, [Rice,

12 6 1 Chicken Soft Tacos $8.75
Black Beans,...

[Fresh Tomato Salsa, [Fajita

13 7 1 Chicken Bowl $11.25
Vegetables, Rice,...

14 7 1 Chips and Guacamole NaN $4.45

Chips and Tomatillo-Green

15 8 1 NaN $2.39
Chili Salsa

[Tomatillo-Green Chili Salsa

16 8 1 Chicken Burrito $8.49
(Medium), [Pinto ...

[Fresh Tomato Salsa (Mild), [Black

17 9 1 Chicken Burrito $8.49
Beans, Rice...

18 9 2 Canned Soda [Sprite] $2.18

[Tomatillo Red Chili Salsa, [Fajita

19 10 1 Chicken Bowl $8.75
Vegetables...

20 10 1 Chips and Guacamole NaN $4.45

[[Fresh Tomato Salsa (Mild),

21 11 1 Barbacoa Burrito $8.99
Tomatillo-Green C...

22 11 1 Nantucket Nectar [Pomegranate Cherry] $3.39

[[Tomatillo-Green Chili Salsa

23 12 1 Chicken Burrito $10.98
(Medium), Tomati...

24 12 1 Izze [Grapefruit] $3.39

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 8/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

order_id quantity item_name choice_description item_price

Chips and Fresh Tomato

25 13 1 NaN $2.39
Salsa

[Roasted Chili Corn Salsa (Medium),

26 13 1 Chicken Bowl $8.49
[Pinto Bea...

[[Tomatillo-Green Chili Salsa

27 14 1 Carnitas Burrito $8.99
(Medium), Roaste...

28 14 1 Canned Soda [Dr. Pepper] $1.09

[Tomatillo-Green Chili Salsa

29 15 1 Chicken Burrito $8.49
(Medium), [Pinto ...

In [ ]:

purchase_order_data.tail(30)

In [20]:

purchase_order_data.groupby(by='item_name')['quantity'].sum().sort_values(ascending = False

Out[20]:

item_name

Chicken Bowl 761

Chicken Burrito 591

Chips and Guacamole 506

Steak Burrito 386

Canned Soft Drink 351

Name: quantity, dtype: int64

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 9/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [21]:

purchase_order_data['item_name'].value_counts() #Fre

Out[21]:

Chicken Bowl 726

Chicken Burrito 553

Chips and Guacamole 479

Steak Burrito 368

Canned Soft Drink 301

Chips 211

Steak Bowl 211

Bottled Water 162

Chicken Soft Tacos 115

Chips and Fresh Tomato Salsa 110

Chicken Salad Bowl 110

Canned Soda 104

Side of Chips 101

Veggie Burrito 95

Barbacoa Burrito 91

Veggie Bowl 85

Carnitas Bowl 68

Barbacoa Bowl 66

Carnitas Burrito 59

Steak Soft Tacos 55

6 Pack Soft Drink 54

Chips and Tomatillo Red Chili Salsa 48

Chicken Crispy Tacos 47

Chips and Tomatillo Green Chili Salsa 43

Carnitas Soft Tacos 40

Steak Crispy Tacos 35

Chips and Tomatillo-Green Chili Salsa 31

Steak Salad Bowl 29

Nantucket Nectar 27

Barbacoa Soft Tacos 25

Chips and Roasted Chili Corn Salsa 22

Chips and Tomatillo-Red Chili Salsa 20

Izze 20

Chips and Roasted Chili-Corn Salsa 18

Veggie Salad Bowl 18

Barbacoa Crispy Tacos 11

Barbacoa Salad Bowl 10

Chicken Salad 9

Veggie Soft Tacos 7

Carnitas Crispy Tacos 7

Burrito 6

Carnitas Salad Bowl 6

Veggie Salad 6

Steak Salad 4

Crispy Tacos 2

Salad 2

Bowl 2

Veggie Crispy Tacos 1

Chips and Mild Fresh Tomato Salsa 1

Carnitas Salad 1

Name: item_name, dtype: int64

Chapter 1[b] Getting and Knowing your data

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 10/37
10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [24]:

user_details = pd.read_csv('u.user',sep = '|') #user_details

user_details

Out[24]:

user_id age gender occupation zip_code

0 1 24 M technician 85711

1 2 53 F other 94043

2 3 23 M writer 32067

3 4 24 M technician 43537

4 5 33 F other 15213

... ... ... ... ... ...

938 939 26 F student 33319

939 940 32 M administrator 02215

940 941 20 M student 97229

941 942 48 F librarian 78209

942 943 22 M student 77841

943 rows × 5 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 11/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [23]:

user_details.head(30)

Out[23]:

user_id age gender occupation zip_code

0 1 24 M technician 85711

1 2 53 F other 94043

2 3 23 M writer 32067

3 4 24 M technician 43537

4 5 33 F other 15213

5 6 42 M executive 98101

6 7 57 M administrator 91344

7 8 36 M administrator 05201

8 9 29 M student 01002

9 10 53 M lawyer 90703

10 11 39 F other 30329

11 12 28 F other 06405

12 13 47 M educator 29206

13 14 45 M scientist 55106

14 15 49 F educator 97301

15 16 21 M entertainment 10309

16 17 30 M programmer 06355

17 18 35 F other 37212

18 19 40 M librarian 02138

19 20 42 F homemaker 95660

20 21 26 M writer 30068

21 22 25 M writer 40206

22 23 30 F artist 48197

23 24 21 F artist 94533

24 25 39 M engineer 55107

25 26 49 M engineer 21044

26 27 40 F librarian 30030

27 28 32 M writer 55369

28 29 41 M programmer 94043

29 30 7 M student 55436

1. Initial Investigation

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 12/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

user_details.shape

In [ ]:

user_details.isna().sum()

In [ ]:

user_details.dtypes

In [ ]:

user_details.describe(include = 'all')

1. What are the different occupations that users are doing?

In [ ]:

print(user_details['occupation'].unique())

2. How many male technicians and female technicians?

In [ ]:

user_details.head()

NOTE:
Discrete Data - it is not measurable(No units)/it is just a count || we cannot
seggregate it.

Continous Data - it is measurable(It has units) || It can be seggregated.

3. Male and Female average age.

In [ ]:

user_details.groupby(by = 'gender')['age'].mean()

In [ ]:

round(user_details.groupby(by = ['occupation'])['age'].mean())

In [ ]:

user_details.groupby(by = ['occupation'])['user_id'].count()

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 13/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

user_details.groupby(by = ['gender','occupation'])['user_id'].count()

In [ ]:

salesman_data = user_details[user_details['occupation'] == 'salesman']

salesman_data

In [ ]:

salesman_data.describe()

Chapter_02 - Filtering & Sorting

In [27]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics

euro_2012

Out[27]:

% Total
Shots Shots Penaltie
Shooting Goals- shots Hit Penalty
Team Goals on off n
Accuracy to- (inc. Woodwork goals
target target score
shots Blocked)

0 Croatia 4 13 12 51.9% 16.0% 32 0 0

Czech
1 4 13 18 41.9% 12.9% 39 0 0
Republic

2 Denmark 4 10 10 50.0% 20.0% 27 1 0

3 England 5 11 18 50.0% 17.2% 40 0 0

4 France 3 22 24 37.9% 6.5% 65 1 0

5 Germany 10 32 32 47.8% 15.6% 80 2 1

6 Greece 5 8 18 30.7% 19.2% 32 1 1

7 Italy 6 34 45 43.0% 7.5% 110 2 0

8 Netherlands 2 12 36 25.0% 4.1% 60 2 0

9 Poland 2 15 23 39.4% 5.2% 48 0 0

10 Portugal 6 22 42 34.3% 9.3% 82 6 0

Republic of
11 1 7 12 36.8% 5.2% 28 0 0
Ireland

12 Russia 5 9 31 22.5% 12.5% 59 2 0

13 Spain 12 42 33 55.9% 16.0% 100 0 1

14 Sweden 5 17 19 47.2% 13.8% 39 3 0

15 Ukraine 2 7 26 21.2% 6.0% 38 0 0

16 rows × 35 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 14/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

euro_2012.columns

CONFIGURATION

In [ ]:

pd.set_option('max_columns',None)
#pd.set_option('max_rows',None)

In [ ]:

euro_2012

1. Initial Analysis

In [ ]:

euro_2012.shape

In [ ]:

euro_2012.describe(include='all')

2. Filter Team, Goals, Shooting Accuracy, Yellow Cards and Red Cards

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 15/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [28]:

euro_filtered_data = euro_2012[['Team','Goals','Shooting Accuracy','Yellow Cards','Red Card

euro_filtered_data

Out[28]:

Team Goals Shooting Accuracy Yellow Cards Red Cards

0 Croatia 4 51.9% 9 0

1 Czech Republic 4 41.9% 7 0

2 Denmark 4 50.0% 4 0

3 England 5 50.0% 5 0

4 France 3 37.9% 6 0

5 Germany 10 47.8% 4 0

6 Greece 5 30.7% 9 1

7 Italy 6 43.0% 16 0

8 Netherlands 2 25.0% 5 0

9 Poland 2 39.4% 7 1

10 Portugal 6 34.3% 12 0

11 Republic of Ireland 1 36.8% 6 1

12 Russia 5 22.5% 6 0

13 Spain 12 55.9% 11 0

14 Sweden 5 47.2% 7 0

15 Ukraine 2 21.2% 5 0

In [ ]:

euro_filtered_data.sort_values(by = 'Red Cards',ascending=False,inplace=False)

In [ ]:

euro_filtered_data

In [ ]:

import warnings
warnings.filterwarnings(action = 'ignore')

In [ ]:

euro_filtered_data.sort_values(by = 'Red Cards',ascending=False)#,inplace=True)

In [ ]:

euro_filtered_data

In [ ]:

euro_filtered_data[euro_filtered_data['Goals'] > 4]

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 16/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

euro_filtered_data[euro_filtered_data['Goals'] > 4]

In [ ]:

euro_filtered_data[euro_filtered_data['Red Cards'] == 1]

In [29]:

#Try to rectify the

(euro_filtered_data[euro_filtered_data['Goals'] > 4]) and (euro_filtered_data[euro_filtered

---------------------------------------------------------------------------

ValueError Traceback (most recent call last)

<ipython-input-29-9a66c94aa3ba> in <module>

1 #Try to rectify the

----> 2 (euro_filtered_data[euro_filtered_data['Goals'] > 4]) and (euro_filt

ered_data[euro_filtered_data['Red Cards'] == 1])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonze
ro__(self)

1440 @final

1441 def nonzero(self):

-> 1442 raise ValueError(

1443 f"The truth value of a {type(self).name} is ambiguou

s. "

1444 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool

(), a.item(), a.any() or a.all().

In [34]:

euro_filtered_data[(euro_filtered_data['Goals'] > 4) & (euro_filtered_data['Red Cards'] ==

Out[34]:

Team Goals Shooting Accuracy Yellow Cards Red Cards

6 Greece 5 30.7% 9 1

Chapter_3 Indexing
In [ ]:

user_details = pd.read_csv('u.user',sep = '|') #user_details

user_details

1. Display the first 5 observations of first 3 features.

In [ ]:

user_details.head()

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 17/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [ ]:

#Index locator
user_details.iloc[0:5,0:3]

In [ ]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics

euro_2012

In [ ]:

euro_2012.iloc[:5,4:6] #Pass the index number of the columns

In [ ]:

#Locator
euro_2012.loc[:5,['Shooting Accuracy','% Goals-to-shots']] #Pass the feature names

In [ ]:

euro_2012.iloc[[0,2,5],[1,5,9]]

In [ ]:

euro_2012.loc[[0,2,5],['Shooting Accuracy','% Goals-to-shots']]

Chapter_4 Creating Series and DataFrames

Within Pandas we are having 2 data structures:

Series - 1D of pandas
Dataframe - 2D of pandas

In [ ]:

a = [1,2,3,4,5]
type(a)

In [ ]:

pandas_1D = pd.Series([1,2,3,4,5])
type(pandas_1D)

In [ ]:

pandas_2D = pd.DataFrame(data = {'Fname' : ['Alka','Barun'],

'Lname' : ['Goutam','Das']})
pandas_2D

Chapter_5 Deleting

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 18/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [35]:

euro_2012 = pd.read_csv('Euro_2012_stats_TEAM.csv') #euro_2012_statistics

euro_2012

Out[35]:

% Total
Shots Shots Penaltie
Shooting Goals- shots Hit Penalty
Team Goals on off n
Accuracy to- (inc. Woodwork goals
target target score
shots Blocked)

0 Croatia 4 13 12 51.9% 16.0% 32 0 0

Czech
1 4 13 18 41.9% 12.9% 39 0 0
Republic

2 Denmark 4 10 10 50.0% 20.0% 27 1 0

3 England 5 11 18 50.0% 17.2% 40 0 0

4 France 3 22 24 37.9% 6.5% 65 1 0

5 Germany 10 32 32 47.8% 15.6% 80 2 1

6 Greece 5 8 18 30.7% 19.2% 32 1 1

7 Italy 6 34 45 43.0% 7.5% 110 2 0

8 Netherlands 2 12 36 25.0% 4.1% 60 2 0

9 Poland 2 15 23 39.4% 5.2% 48 0 0

10 Portugal 6 22 42 34.3% 9.3% 82 6 0

Republic of
11 1 7 12 36.8% 5.2% 28 0 0
Ireland

12 Russia 5 9 31 22.5% 12.5% 59 2 0

13 Spain 12 42 33 55.9% 16.0% 100 0 1

14 Sweden 5 17 19 47.2% 13.8% 39 3 0

15 Ukraine 2 7 26 21.2% 6.0% 38 0 0

16 rows × 35 columns

In [36]:

del euro_2012['Goals']

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 19/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [37]:

euro_2012

Out[37]:

% Total
Shots Shots Penalties
Shooting Goals- shots Hit Penalty Hea
Team on off not
Accuracy to- (inc. Woodwork goals go
target target scored
shots Blocked)

0 Croatia 13 12 51.9% 16.0% 32 0 0 0

Czech
1 13 18 41.9% 12.9% 39 0 0 0
Republic

2 Denmark 10 10 50.0% 20.0% 27 1 0 0

3 England 11 18 50.0% 17.2% 40 0 0 0

4 France 22 24 37.9% 6.5% 65 1 0 0

5 Germany 32 32 47.8% 15.6% 80 2 1 0

6 Greece 8 18 30.7% 19.2% 32 1 1 1

7 Italy 34 45 43.0% 7.5% 110 2 0 0

8 Netherlands 12 36 25.0% 4.1% 60 2 0 0

9 Poland 15 23 39.4% 5.2% 48 0 0 0

10 Portugal 22 42 34.3% 9.3% 82 6 0 0

Republic of
11 7 12 36.8% 5.2% 28 0 0 0
Ireland

12 Russia 9 31 22.5% 12.5% 59 2 0 0

13 Spain 42 33 55.9% 16.0% 100 0 1 0

14 Sweden 17 19 47.2% 13.8% 39 3 0 0

15 Ukraine 7 26 21.2% 6.0% 38 0 0 0

16 rows × 34 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 20/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [41]:

euro_2012[['Team','Shots on target']]

Out[41]:

Team Shots on target

0 Croatia 13

1 Czech Republic 13

2 Denmark 10

3 England 11

4 France 22

5 Germany 32

6 Greece 8

7 Italy 34

8 Netherlands 12

9 Poland 15

10 Portugal 22

11 Republic of Ireland 7

12 Russia 9

13 Spain 42

14 Sweden 17

15 Ukraine 7

In [42]:

euro_2012.Team

Out[42]:

0 Croatia

1 Czech Republic

2 Denmark

3 England

4 France

5 Germany

6 Greece

7 Italy

8 Netherlands

9 Poland

10 Portugal

11 Republic of Ireland

12 Russia

13 Spain

14 Sweden

15 Ukraine

Name: Team, dtype: object

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 21/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [43]:

del euro_2012[['Shots on target','Shots off target']]

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-43-1b67689e46a4> in <module>

----> 1 del euro_2012[['Shots on target','Shots off target']]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __delit
em__(self, key)

3964 # there was no match, this call should raise the appropr
iate

3965 # exception:

-> 3966 loc = self.axes[-1].get_loc(key)

3967 self._mgr.idelete(loc)

3968

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in ge
t_loc(self, key, method, tolerance)

3078 casted_key = self._maybe_cast_indexer(key)

3079 try:

-> 3080 return self._engine.get_loc(casted_key)

3081 except KeyError as err:

3082 raise KeyError(key) from err

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '['Shots on target', 'Shots off target']' is an invalid key

In [53]:

euro_2012 = euro_2012.drop(labels=['Penalty goals','Team'],axis=1,)

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 22/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [54]:

euro_2012

Out[54]:

% Total
Shots Shots Penalties
Goals- shots Hit Headed Passes Passi
on off not Passes
to- (inc. Woodwork goals completed Accura
target target scored
shots Blocked)

0 13 12 16.0% 32 0 0 2 1076 828 76.9

1 13 18 12.9% 39 0 0 0 1565 1223 78.

2 10 10 20.0% 27 1 0 3 1298 1082 83.3

3 11 18 17.2% 40 0 0 3 1488 1200 80.6

4 22 24 6.5% 65 1 0 0 2066 1803 87.2

5 32 32 15.6% 80 2 0 2 2774 2427 87.4

6 8 18 19.2% 32 1 1 0 1187 911 76.7

7 34 45 7.5% 110 2 0 2 3016 2531 83.9

8 12 36 4.1% 60 2 0 0 1556 1381 88.7

9 15 23 5.2% 48 0 0 1 1059 852 80.4

10 22 42 9.3% 82 6 0 2 1891 1461 77.2

11 7 12 5.2% 28 0 0 1 851 606 71.2

12 9 31 12.5% 59 2 0 1 1602 1345 83.9

13 42 33 16.0% 100 0 0 2 4317 3820 88.4

14 17 19 13.8% 39 3 0 1 1192 965 80.9

15 7 26 6.0% 38 0 0 2 1276 1043 81.7

16 rows × 31 columns

TASK 1 - Explore List Comprehension

Chapter 6 - Apply Function

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 23/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [58]:

student_details = pd.read_csv('student-mat.csv')
student_details

Out[58]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu

0 GP F 18 U GT3 A 4 4 at_home teacher course

1 GP F 17 U GT3 T 1 1 at_home other course

2 GP F 15 U LE3 T 1 1 at_home other other

3 GP F 15 U GT3 T 4 2 health services home

4 GP F 16 U GT3 T 3 3 other other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 2 2 services services course

391 MS M 17 U LE3 T 3 1 services services course

392 MS M 21 R GT3 T 1 1 other other course

393 MS M 18 R LE3 T 3 2 services other course

394 MS M 19 U LE3 T 1 1 other at_home course

395 rows × 33 columns

In [57]:

pd.set_option('max_columns',None)

In [62]:

student_details['Medu'] = student_details['Medu'].apply(func = lambda x:x+2)

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 24/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [63]:

student_details

Out[63]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu

0 GP F 18 U GT3 A 6 4 at_home teacher course

1 GP F 17 U GT3 T 3 1 at_home other course

2 GP F 15 U LE3 T 3 1 at_home other other

3 GP F 15 U GT3 T 6 2 health services home

4 GP F 16 U GT3 T 5 3 other other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services services course

391 MS M 17 U LE3 T 5 1 services services course

392 MS M 21 R GT3 T 3 1 other other course

393 MS M 18 R LE3 T 5 2 services other course

394 MS M 19 U LE3 T 3 1 other at_home course

395 rows × 33 columns

Inclass exercise - Make the first letter uppercase of the Fjob.

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 25/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [67]:

student_details['Fjob'] = student_details.Fjob.apply(lambda x:x.capitalize())

student_details

Out[67]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 33 columns

Inclass Exercise - 2

Create a new column with a name 'Eligibility_Criteria' which returns 1 if age>17 and 0 if age < 17.

In [68]:

def get_age(x):
if x>17:
return 1
else:
return 0

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 26/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [70]:

student_details['Eligibity_Criteria'] = student_details['age'].apply(func = get_age)

student_details

Out[70]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 34 columns

In [71]:

student_details['Eligibity_Criteria_2'] = student_details['age'].apply(func = lambda x: 1 i

student_details

Out[71]:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason gu

0 GP F 18 U GT3 A 6 4 at_home Teacher course

1 GP F 17 U GT3 T 3 1 at_home Other course

2 GP F 15 U LE3 T 3 1 at_home Other other

3 GP F 15 U GT3 T 6 2 health Services home

4 GP F 16 U GT3 T 5 3 other Other home

... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 4 2 services Services course

391 MS M 17 U LE3 T 5 1 services Services course

392 MS M 21 R GT3 T 3 1 other Other course

393 MS M 18 R LE3 T 5 2 services Other course

394 MS M 19 U LE3 T 3 1 other At_home course

395 rows × 35 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 27/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

Chapter_07 Concat || Merge || Append Function

In [75]:

sales_2017 = pd.read_csv('Sales Transactions-2017.csv')

sales_2017

Out[75]:

Date Voucher Party Product Qty Rate Gr

SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380
PLASTICS 9100

SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720
PLASTICS FOAM(1200)

VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500
TRADERS
WINE

SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720
TRADERS FOAM(1200)

SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450
TRADERS 9100

... ... ... ... ... ... ...

10*10
47285 31/03/2018 Sal:10042 Vkp 25 137 3,425
SHEET

47286 NaN NaN NaN NaN NaN NaN N

47287 NaN NaN NaN NaN NaN NaN N

47288 NaN Total NaN NaN 607,734.60 669,300.49 9,953,816

47289 NaN Total NaN NaN 7,593,062.00 8,309,116.00 115,778,725

47290 rows × 9 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 28/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [76]:

sales_2018 = pd.read_csv('Sales Transactions-2018.csv')

sales_2018

Out[76]:

Date Voucher Party Product Qty Rate Gross

SILVER
0 1/4/2018 Sal:146 TP13 POUCH 50 85 4,250.00
9*12

1 1/4/2018 Sal:146 TP13 RUBBER 5 290 1,450.00

DURGA
2 1/4/2018 Sal:146 TP13 10*12 1,600.00 5.5 8,800.00
Blue

DURGA
3 1/4/2018 Sal:146 TP13 13*16 400 11 4,400.00
BLUE

10*12
4 1/4/2018 Sal:146 TP13 SARAS- 600 8.1 4,860.00
NAT

... ... ... ... ... ... ... ...

HAMPI SPOON
44735 31/03/2019 Sal:9610 200 40 8,000.00
FOODS SOOFY

44736 NaN NaN NaN NaN NaN NaN NaN

44737 NaN NaN NaN NaN NaN NaN NaN

44738 NaN Total NaN NaN 666,056.00 1,067,808.80 10,796,991.30 29,9

44739 NaN Total NaN NaN 7,097,803.00 10,024,197.00 117,897,671.80 720,2

44740 rows × 9 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 29/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [77]:

sales_2019 = pd.read_csv('Sales Transactions-2019.csv')

sales_2019

Out[77]:

Date Voucher Party Product Qty Rate Gross

BALAJI DONA-
0 1/4/2019 Sal:687 1 1,730.00 1,730.00
PLASTICS VAI-9100

BALAJI SMART
1 1/4/2019 Sal:687 1 1,730.00 1,730.00
PLASTICS BOUL(48)

BALAJI Vishnu
2 1/4/2019 Sal:688 110 18.5 2,035.00
PLASTICS Ice

3 28/3 0 0

BALAJI 100LEAF
4 1/4/2019 Sal:689 3 585 1,755.00
PLASTICS -SP

... ... ... ... ... ... ... ...

13*16
19171 10/10/2019 Sal:4935 K.SRIHARI WHITE 400 16 6,400.00
RK

19172 NaN NaN NaN NaN NaN NaN NaN

19173 NaN NaN NaN NaN NaN NaN NaN

19174 NaN Total NaN NaN 99,284.90 175,381.65 2,203,649.50 20

19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40 53,360,791.40 672

19176 rows × 9 columns

In [80]:

sales_2017.shape,sales_2018.shape,sales_2019.shape

Out[80]:

((47290, 9), (44740, 9), (19176, 9))

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 30/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [81]:

sales_full_data = pd.concat([sales_2017,sales_2018,sales_2019])
sales_full_data

Out[81]:

Date Voucher Party Product Qty Rate Gros

SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380.0
PLASTICS 9100

SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720.0
PLASTICS FOAM(1200)

VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500.0
TRADERS
WINE

SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720.0
TRADERS FOAM(1200)

SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450.0
TRADERS 9100

... ... ... ... ... ... ...

13*16
19171 10/10/2019 Sal:4935 K.SRIHARI 400 16 6,400.0
WHITE RK

19172 NaN NaN NaN NaN NaN NaN Na

19173 NaN NaN NaN NaN NaN NaN Na

19174 NaN Total NaN NaN 99,284.90 175,381.65 2,203,649.5

19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40 53,360,791.4

111206 rows × 9 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 31/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [82]:

sales_2017.append([sales_2018,sales_2019])

Out[82]:

Date Voucher Party Product Qty Rate Gros

SOLANKI DONA-VAI-
0 1/4/2017 Sal:1 2 1,690.00 3,380.0
PLASTICS 9100

SOLANKI LITE
1 1/4/2017 Sal:1 6 1,620.00 9,720.0
PLASTICS FOAM(1200)

VISHNU
SARNESWARA
2 1/4/2017 Sal:2 CHOTA 500 23 11,500.0
TRADERS
WINE

SARNESWARA LITE
3 1/4/2017 Sal:2 6 1,620.00 9,720.0
TRADERS FOAM(1200)

SARNESWARA DONA-VAI-
4 1/4/2017 Sal:2 5 1,690.00 8,450.0
TRADERS 9100

... ... ... ... ... ... ...

13*16
19171 10/10/2019 Sal:4935 K.SRIHARI 400 16 6,400.0
WHITE RK

19172 NaN NaN NaN NaN NaN NaN Na

19173 NaN NaN NaN NaN NaN NaN Na

19174 NaN Total NaN NaN 99,284.90 175,381.65 2,203,649.5

19175 NaN Total NaN NaN 2,710,193.00 5,519,888.40 53,360,791.4

111206 rows × 9 columns

In [86]:

sales_full_data.dtypes

Out[86]:

Date object

Voucher object

Party object

Product object

Qty object

Rate object

Gross object

Disc object

Voucher Amount object

dtype: object

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 32/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [88]:

sales_full_data.isna().sum()

Out[88]:

Date 12591

Voucher 12557

Party 40

Product 12591

Qty 12557

Rate 12558

Gross 12558

Disc 105609

Voucher Amount 83646

dtype: int64

TASK 2 - Explore Merge function - Left join, Right Join, Inner join,
Outer join

TASK 3 - Datatype conversion and Data Cleaning on Sales

data
In [84]:

sales_cleaned_data = pd.read_csv('Sales-Transactions-Edited.csv')
sales_cleaned_data

Out[84]:

Date Voucher Party Product Qty Rate

0 1/4/2017 1 SOLANKI PLASTICS DONA-VAI-9100 2 1690.0

1 1/4/2017 1 SOLANKI PLASTICS LITE FOAM(1200) 6 1620.0

2 1/4/2017 2 SARNESWARA TRADERS VISHNU CHOTA WINE 500 23.0

3 1/4/2017 2 SARNESWARA TRADERS LITE FOAM(1200) 6 1620.0

4 1/4/2017 2 SARNESWARA TRADERS DONA-VAI-9100 5 1690.0

... ... ... ... ... ... ...

95557 12/9/2019 4265 TP13 SPOON MED M.W 20 11.0

95558 12/9/2019 4266 K.SRIHARI SMART BOUL(48) 1 1830.0

95559 12/9/2019 4267 SMS SMARTBOUL GLA(4000) 1 1520.0

95560 12/9/2019 4268 ANILFANCY RR WINEGLASS 100 20.0

95561 12/9/2019 4268 ANILFANCY RR WATER GLASS 100 20.0

95562 rows × 6 columns

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 33/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [85]:

sales_cleaned_data.dtypes

#Date - Datetime64/datetime32 dataypr

Out[85]:

Date object

Voucher int64

Party object

Product object

Qty int64

Rate float64

dtype: object

In [87]:

sales_cleaned_data.isna().sum()

Out[87]:

Date 0

Voucher 0

Party 0

Product 0

Qty 0

Rate 1

dtype: int64

Chapter_8 Grouping Vs Pivottable Vs Crosstab

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 34/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [91]:

insurance_data = pd.read_csv('insurance.csv')
insurance_data

Out[91]:

age sex bmi children smoker region charges

0 19 female 27.900 0 yes southwest 16884.92400

1 18 male 33.770 1 no southeast 1725.55230

2 28 male 33.000 3 no southeast 4449.46200

3 33 male 22.705 0 no northwest 21984.47061

4 32 male 28.880 0 no northwest 3866.85520

... ... ... ... ... ... ... ...

1333 50 male 30.970 3 no northwest 10600.54830

1334 18 female 31.920 0 no northeast 2205.98080

1335 18 female 36.850 0 no southeast 1629.83350

1336 21 female 25.800 0 no southwest 2007.94500

1337 61 female 29.070 0 yes northwest 29141.36030

1338 rows × 7 columns

In [92]:

insurance_data.describe(include = 'all')

Out[92]:

age sex bmi children smoker region charges

count 1338.000000 1338 1338.000000 1338.000000 1338 1338 1338.000000

unique NaN 2 NaN NaN 2 4 NaN

top NaN male NaN NaN no southeast NaN

freq NaN 676 NaN NaN 1064 364 NaN

mean 39.207025 NaN 30.663397 1.094918 NaN NaN 13270.422265

std 14.049960 NaN 6.098187 1.205493 NaN NaN 12110.011237

min 18.000000 NaN 15.960000 0.000000 NaN NaN 1121.873900

25% 27.000000 NaN 26.296250 0.000000 NaN NaN 4740.287150

50% 39.000000 NaN 30.400000 1.000000 NaN NaN 9382.033000

75% 51.000000 NaN 34.693750 2.000000 NaN NaN 16639.912515

max 64.000000 NaN 53.130000 5.000000 NaN NaN 63770.428010

1. What is average insurance charges based on region?

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 35/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [93]:

insurance_data['region'].unique()

Out[93]:

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [96]:

round(insurance_data.groupby(by='region')['charges'].mean().sort_values(ascending = False))

Out[96]:

region

southeast 14735.0

northeast 13406.0

northwest 12418.0

southwest 12347.0

Name: charges, dtype: float64

2. What is average insurance charges based on region and gender?

In [98]:

round(insurance_data.groupby(by=['region','sex'])['charges'].mean().sort_values(ascending =

Out[98]:

region sex

southeast male 15880.0

northeast male 13854.0

southeast female 13500.0

southwest male 13413.0

northeast female 12953.0

northwest female 12480.0

male 12354.0

southwest female 11274.0

Name: charges, dtype: float64

In [101]:

round(pd.pivot_table( data = insurance_data,values='charges',index='region',columns=['sex',

Out[101]:

sex female male

smoker no yes no yes

region

northeast 9640.0 28032.0 8664.0 30926.0

northwest 8787.0 29671.0 8321.0 30713.0

southeast 8440.0 33035.0 7609.0 36030.0

southwest 8234.0 31688.0 7779.0 32599.0

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 36/37

10/7/21, 12:10 AM Pandas - Data Analysis - Deep dive - Jupyter Notebook

In [103]:

pd.crosstab(index = insurance_data['smoker'] ,columns = insurance_data['sex'],margins=True)

Out[103]:

sex female male All

smoker

no 547 517 1064

yes 115 159 274

All 662 676 1338

In [104]:

pd.crosstab(index = insurance_data['region'] ,columns = insurance_data['children'],margins=

Out[104]:

children 0 1 2 3 4 5 All

region

northeast 147 77 51 39 7 3 324

northwest 132 74 66 46 6 1 325

southeast 157 95 66 35 5 6 364

southwest 138 78 57 37 7 8 325

All 574 324 240 157 25 18 1338

In [ ]:

localhost:8888/notebooks/Data science/pandas/Pandas - Data Analysis - Deep dive.ipynb 37/37

Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
100% (3)
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
9 pages
Final Fraft Hermit Crab Essay
No ratings yet
Final Fraft Hermit Crab Essay
6 pages
LESSON 6 (History of Appetizer0
No ratings yet
LESSON 6 (History of Appetizer0
22 pages
Class 6 Pandas
No ratings yet
Class 6 Pandas
13 pages
Pandas Cheat Sheet
No ratings yet
Pandas Cheat Sheet
20 pages
2,3. Introduction Pandas & Matplotlib
No ratings yet
2,3. Introduction Pandas & Matplotlib
32 pages
Pandas Presentation
No ratings yet
Pandas Presentation
10 pages
Pandas Tutorial 1: Pandas Basics (Reading Data Files, Dataframes, Data Selection)
No ratings yet
Pandas Tutorial 1: Pandas Basics (Reading Data Files, Dataframes, Data Selection)
15 pages
20 Pandas Codes To Master Data Analysis
No ratings yet
20 Pandas Codes To Master Data Analysis
3 pages
PJT Explanation of Code Line by Line
No ratings yet
PJT Explanation of Code Line by Line
2 pages
Lab-3 Pandas Library
No ratings yet
Lab-3 Pandas Library
14 pages
Intro Pandas
No ratings yet
Intro Pandas
18 pages
Pandas Introduction: What Is Python Pandas Used For?
No ratings yet
Pandas Introduction: What Is Python Pandas Used For?
28 pages
Introduction To Pandas
No ratings yet
Introduction To Pandas
26 pages
ML Lab1 Python Panda
No ratings yet
ML Lab1 Python Panda
9 pages
ANL252 SU4 Jul2022
No ratings yet
ANL252 SU4 Jul2022
55 pages
Data Frame
No ratings yet
Data Frame
95 pages
3rd Semester DDM AI DAA DEV Print Pages For Spiral Record 25-1-24 - Removed
No ratings yet
3rd Semester DDM AI DAA DEV Print Pages For Spiral Record 25-1-24 - Removed
28 pages
Jupyter Notebook Viewer1
No ratings yet
Jupyter Notebook Viewer1
17 pages
Python Pandas
No ratings yet
Python Pandas
2 pages
Important Pandas Operations 1697910759
No ratings yet
Important Pandas Operations 1697910759
6 pages
Exp3 Python
No ratings yet
Exp3 Python
15 pages
Pandas Notes
No ratings yet
Pandas Notes
2 pages
Pandas
No ratings yet
Pandas
25 pages
Pandas - Cheat - Sheet
No ratings yet
Pandas - Cheat - Sheet
6 pages
CO3 - 1 - Pandas Series and Data Frame
No ratings yet
CO3 - 1 - Pandas Series and Data Frame
37 pages
Summary: Introduction To Data Visualization Tools
No ratings yet
Summary: Introduction To Data Visualization Tools
13 pages
Pandas Complete + Visualisation Summary of IBM Visualization
No ratings yet
Pandas Complete + Visualisation Summary of IBM Visualization
21 pages
7 Days Analytics Course 3feiz7 4
No ratings yet
7 Days Analytics Course 3feiz7 4
8 pages
Justenoughpython Pandas 220915 175329
No ratings yet
Justenoughpython Pandas 220915 175329
64 pages
Dev Lab Record
No ratings yet
Dev Lab Record
21 pages
Pandas
No ratings yet
Pandas
4 pages
Introduction To Pandas For Data Analysis
No ratings yet
Introduction To Pandas For Data Analysis
6 pages
Murali Internship
No ratings yet
Murali Internship
34 pages
Pandas 1705297450
No ratings yet
Pandas 1705297450
21 pages
Introduction To Pandas - Ipynb - Colaboratory
No ratings yet
Introduction To Pandas - Ipynb - Colaboratory
7 pages
Data Engineer Interview 1740985064
No ratings yet
Data Engineer Interview 1740985064
14 pages
Python2 Master
No ratings yet
Python2 Master
12 pages
Dev Record Final
No ratings yet
Dev Record Final
34 pages
Exercise 3
No ratings yet
Exercise 3
12 pages
Pandas in Python
No ratings yet
Pandas in Python
59 pages
13-007 Datasets and DataFrames
No ratings yet
13-007 Datasets and DataFrames
10 pages
CH-6 Data Loading, Storage, and File Formats
No ratings yet
CH-6 Data Loading, Storage, and File Formats
163 pages
Python Data Science 101
100% (1)
Python Data Science 101
41 pages
Final Dev Record
No ratings yet
Final Dev Record
49 pages
Python For Data Analysis
No ratings yet
Python For Data Analysis
96 pages
Python Libraries - 2025 (1) Python Libraries - 2025 (1) Python Libraries - 2025
No ratings yet
Python Libraries - 2025 (1) Python Libraries - 2025 (1) Python Libraries - 2025
9 pages
1 Import and Handling Data - Jupyter Notebook
No ratings yet
1 Import and Handling Data - Jupyter Notebook
9 pages
Unit3 - 3) Pandas - Ipynb - Colab
No ratings yet
Unit3 - 3) Pandas - Ipynb - Colab
11 pages
DevOps Session 3 Pandas
No ratings yet
DevOps Session 3 Pandas
33 pages
CSE445 NSU Week - 3
No ratings yet
CSE445 NSU Week - 3
48 pages
IP
No ratings yet
IP
10 pages
Pandas
No ratings yet
Pandas
2 pages
Data Handling Using Pandas-1
No ratings yet
Data Handling Using Pandas-1
60 pages
Numpy Boolean Indexing: Filter
No ratings yet
Numpy Boolean Indexing: Filter
39 pages
Loki Temp PPT Pandas 2
No ratings yet
Loki Temp PPT Pandas 2
31 pages
12 Pandas
100% (1)
12 Pandas
21 pages
Pandas Dataframe Export The CSV File
No ratings yet
Pandas Dataframe Export The CSV File
9 pages
Pandas
No ratings yet
Pandas
41 pages
Data Preprocessing Tasks in Pandas PYTHON
No ratings yet
Data Preprocessing Tasks in Pandas PYTHON
2 pages
Interactive Data Analysis With Jupyter Cheatsheet 1731972443
No ratings yet
Interactive Data Analysis With Jupyter Cheatsheet 1731972443
10 pages
RN Reassessment
No ratings yet
RN Reassessment
5 pages
Grade 1 EVS States of Matter
No ratings yet
Grade 1 EVS States of Matter
7 pages
How Much Meat Should A Lamb Yield
No ratings yet
How Much Meat Should A Lamb Yield
2 pages
Soal Pts Bahasa Inggris Kelas 8
No ratings yet
Soal Pts Bahasa Inggris Kelas 8
5 pages
Actual Test 2019 Listening
No ratings yet
Actual Test 2019 Listening
51 pages
PLA
No ratings yet
PLA
55 pages
America Is in The Heart
No ratings yet
America Is in The Heart
12 pages
#1HEART DISEASE - Specifically Heart Failure.#1
100% (1)
#1HEART DISEASE - Specifically Heart Failure.#1
18 pages
3rd Year Bac Samples For Paragraphs
No ratings yet
3rd Year Bac Samples For Paragraphs
3 pages
Fruit and Vegetables Riddles With Key Fun Activities Games Games Icebreakers Picture Des 55125
No ratings yet
Fruit and Vegetables Riddles With Key Fun Activities Games Games Icebreakers Picture Des 55125
2 pages
Just Jorie - Robin Alexander
No ratings yet
Just Jorie - Robin Alexander
207 pages
Intralox Letter of Conformity PUR 657D
No ratings yet
Intralox Letter of Conformity PUR 657D
4 pages
The History of The First Thanksgiving Day +key
No ratings yet
The History of The First Thanksgiving Day +key
2 pages
Havells Home Appliance List Price FEB 21 LOW
No ratings yet
Havells Home Appliance List Price FEB 21 LOW
7 pages
Klasa 4 Unit 6
No ratings yet
Klasa 4 Unit 6
6 pages
PHP 4 F VHNC
No ratings yet
PHP 4 F VHNC
9 pages
Fields and Flowers Combined Sample 08.2023
100% (2)
Fields and Flowers Combined Sample 08.2023
92 pages
Characteristics of Society
100% (2)
Characteristics of Society
11 pages
The Trauma of Burnout How To Manage Your Nervous System Before It Manages You Claire Plumbly Download
No ratings yet
The Trauma of Burnout How To Manage Your Nervous System Before It Manages You Claire Plumbly Download
26 pages
Brasserie Menu June 24
No ratings yet
Brasserie Menu June 24
2 pages
Data Buyer (AutoRecovered)
No ratings yet
Data Buyer (AutoRecovered)
26 pages
Konjac Companies in China
No ratings yet
Konjac Companies in China
7 pages
To Trim An Artichoke: Don't Do It If There Are Distractions
No ratings yet
To Trim An Artichoke: Don't Do It If There Are Distractions
3 pages
CID Beverage Bases (Powdered)
No ratings yet
CID Beverage Bases (Powdered)
11 pages
FOOD (FISH) PROCESSING TLE 8 Q1 With Tos
No ratings yet
FOOD (FISH) PROCESSING TLE 8 Q1 With Tos
8 pages
More Smiles With Every Sip and Every Bite Which They Tend To Produce Delicious, Affordable
No ratings yet
More Smiles With Every Sip and Every Bite Which They Tend To Produce Delicious, Affordable
2 pages
BPP Q2 Week 4
No ratings yet
BPP Q2 Week 4
9 pages
Chicken Sotanghon Soup With Malunggay at Sayote: Ingredients
No ratings yet
Chicken Sotanghon Soup With Malunggay at Sayote: Ingredients
23 pages