Python Pandas Tutorial For Beginners 2019 (On The Go)

The document is a tutorial on the Python Pandas library, which is used for data manipulation and analysis, featuring key data structures like Series and DataFrame. It provides an overview of creating and manipulating DataFrames, including loading CSV files, filtering, sorting, and aggregating data. The tutorial is structured into videos covering various topics and includes code examples for practical understanding.

Uploaded by

kart238
Copyright
© All Rights Reserved

Python Pandas Tutorial for Beginners

What is Pandas?
• Pandas is a Python package / library
• Pandas is a library for data manipulation and analysis
• Two main data structures: the Series and DataFrame
# Visual representation of Pandas

import os
from IPython.display import Image
PATH = "F:\\Github\\Python tutorials\\Introduction to Pandas\\"
Image(filename = PATH + "pandas.png", width=300, height=300)

Tutorial Overview
Video 1
1. What is a Pandas Series
2. What is a Dataframe
3. Creating DataFrames
4. Loading a CSV into a DataFrame
5. Basic methods for investigating/viewing a DataFrame

Video 2
1. Column Filtering
2. Row Filtering
3. Filtering / Slicing
4. Sorting the DF
5. Summarising / Aggregating Data
6. Creating New Calculated Fields

Importing / Installing packages


# Packages / libraries
import os  # provides functions for interacting with the operating system
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# To install Pandas, type "pip install pandas" in the Anaconda terminal

1. What is a Pandas Series


A Series is a one-dimensional labelled array (similar to a vector, or a single column of a table)

# Creating a Series by passing in a list without the Index


s = pd.Series(['28', '1.79', 'Yiannis'])
s

0 28
1 1.79
2 Yiannis
dtype: object

# Creating a Series by passing in a list & the index labels

s = pd.Series(['28', '1.79 cm', 'Yiannis'], index=['Age', 'Height', 'Name'])
s

Age 28
Height 1.79 cm
Name Yiannis
dtype: object

# Indexing / filtering the Series


s['Name']

'Yiannis'
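
A Series can be read by position as well as by label. The sketch below is a minimal illustration, not part of the original notebook (the variable names `person`, `by_label` and `by_position` are made up for this example): `.loc` looks a value up by its index label, `.iloc` by its integer position.

```python
import pandas as pd

# Same values as the Series above, stored under a hypothetical name
person = pd.Series(['28', '1.79 cm', 'Yiannis'], index=['Age', 'Height', 'Name'])

by_label = person.loc['Name']    # label-based lookup
by_position = person.iloc[2]     # position-based lookup (third element)

print(by_label, by_position)
print(list(person.index))
```

Both lookups return the same element here, because 'Name' is the third label in the index.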
2. What is a Dataframe
A DataFrame is similar to an Excel table or PivotTable: a two-dimensional, tabular data structure made up of rows and columns.
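
One way to picture this is that every column of a DataFrame is itself a Series, and all columns share the same row index. The following is a small sketch under that reading (the name `people` is hypothetical; the values are the first two people used later in this tutorial):

```python
import pandas as pd

# Column-oriented construction: each dict key becomes a column
people = pd.DataFrame({
    'Name': ['Joe', 'Tom'],
    'Age': [23, 50],
    'Height': [1.70, 1.80],
})

age_column = people['Age']          # selecting one column gives a Series
print(type(age_column).__name__)
print(people.shape)                 # (rows, columns)
```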

3. Creating a DataFrame
# Manually

df = pd.DataFrame(columns = ["Name", 'Gender','Age','Height'])


df

Empty DataFrame
Columns: [Name, Gender, Age, Height]
Index: []

# Passing Values in the DataFrame


df.loc[0] = ["Joe", "Male", 23, 1.70]
df.loc[1] = ["Tom", "Male", 50, 1.80]
df.loc[2] = ["Tine", "Female", 93, 1.79]
df

   Name  Gender Age  Height
0   Joe    Male  23    1.70
1   Tom    Male  50    1.80
2  Tine  Female  93    1.79

# Creating DataFrame from a List


a_list = [["Joe", "Male", 23, 1.70], ["Tom", "Male", 50, 1.80]]
a_list

df = pd.DataFrame(a_list, columns = ["Name", 'Gender','Age','Height'])


df

  Name Gender  Age  Height
0  Joe   Male   23     1.7
1  Tom   Male   50     1.8

# Creating DataFrame from an Array


a = np.arange(15).reshape(5,3)
print(a)

df = pd.DataFrame(a, columns = ["A", 'B','C'])


df

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]

    A   B   C
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

# Transpose a DataFrame

#method 1
df.T

# method 2
np.transpose(df)

   0  1  2   3   4
A  0  3  6   9  12
B  1  4  7  10  13
C  2  5  8  11  14

4. Loading a CSV file into a DataFrame


# This is to find out your current directory
cwd = os.getcwd()
cwd

'C:\\Users\\pitsi\\Desktop\\Python Tutorials'

# Loading the data

raw_data = pd.read_csv('F:\\Github\\Python tutorials\\Introduction to Pandas\\Marketing Raw Data.csv')

# displays the whole DataFrame
raw_data

# displays only the first row; head() with no argument shows the first 5 rows
raw_data.head(1)

# displays the number of rows you pass in
raw_data.head(5)

         Date   Day_Name  Visitors  Revenue  Marketing Spend           Promo
0  09/11/2020     Monday       707     5211          651.375        No Promo
1  10/11/2020    Tuesday      1455    10386         1298.250   Promotion Red
2  11/11/2020  Wednesday      1520    12475         1559.375  Promotion Blue
3  12/11/2020   Thursday      1726    14414         1801.750        No Promo
4  13/11/2020     Friday      2134    20916         2614.500        No Promo
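
If you do not have the CSV file to hand, the loading step can still be tried out on in-memory text. The sketch below is only an illustration: it assumes the file is comma-separated with a header row (the tutorial does not show the raw file), and it reuses the five rows printed above. `io.StringIO` lets `pd.read_csv` read from a string as if it were a file.

```python
import io
import pandas as pd

# First five rows of the marketing data, held in memory as CSV text.
# The comma-separated layout is an assumption about the original file.
csv_text = """Date,Day_Name,Visitors,Revenue,Marketing Spend,Promo
09/11/2020,Monday,707,5211,651.375,No Promo
10/11/2020,Tuesday,1455,10386,1298.250,Promotion Red
11/11/2020,Wednesday,1520,12475,1559.375,Promotion Blue
12/11/2020,Thursday,1726,14414,1801.750,No Promo
13/11/2020,Friday,2134,20916,2614.500,No Promo
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)     # (5, 6)
print(sample.dtypes)    # column types inferred by read_csv
```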

5. Basic methods for investigating/viewing a DataFrame


# Display the last N rows
raw_data.tail(5)

           Date   Day_Name  Visitors  Revenue  Marketing Spend          Promo
177  05/05/2021  Wednesday      1400    11196      1119.600000       No Promo
178  06/05/2021   Thursday      2244    18611      2067.888889  Promotion Red
179  07/05/2021     Friday      2023    14502      1450.200000       No Promo
180  08/05/2021   Saturday      1483     8975      1121.875000       No Promo
181  09/05/2021     Sunday      1303     6968       871.000000       No Promo

# Shape
raw_data.shape

(182, 6)

# Displays the information for our DF


raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 6 columns):
Date 182 non-null object
Day_Name 182 non-null object
Visitors 182 non-null int64
Revenue 182 non-null int64
Marketing Spend 182 non-null float64
Promo 182 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 8.6+ KB

# Converting the Date column from text (object) to datetime


raw_data['Date'] = pd.to_datetime(raw_data['Date'], format='%d/%m/%Y')
raw_data
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 6 columns):
Date 182 non-null datetime64[ns]
Day_Name 182 non-null object
Visitors 182 non-null int64
Revenue 182 non-null int64
Marketing Spend 182 non-null float64
Promo 182 non-null object
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 8.6+ KB
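
The explicit `format='%d/%m/%Y'` matters because these dates are written day first. A minimal sketch of the idea, using three date strings from the data above (the variable names are hypothetical): once the column is a datetime, the `.dt` accessor exposes parts such as the month and the weekday name.

```python
import pandas as pd

dates = pd.Series(['09/11/2020', '10/11/2020', '11/11/2020'])

# Explicit day-first format: 09/11/2020 is 9 November, not 11 September
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

months = parsed.dt.month.tolist()
weekdays = parsed.dt.day_name().tolist()
print(months)
print(weekdays)
```

The weekday names derived this way agree with the Day_Name column in the source data, which is a quick sanity check that the format string is right.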

# Displays the summary statistics for all numeric columns


raw_data.describe()

          Visitors       Revenue  Marketing Spend
count   182.000000    182.000000       182.000000
mean   1666.730769  12991.840659      1396.356564
std     503.528049   5883.117597       691.867416
min     488.000000   2898.000000       322.000000
25%    1339.000000   8808.500000       880.431250
50%    1546.000000  11547.500000      1223.900000
75%    2027.500000  15816.500000      1676.450000
max    4139.000000  36283.000000      4535.375000

6. Column Filtering
# Exmp 1 - Series
raw_data['Visitors'].head()

0 707
1 1455
2 1520
3 1726
4 2134
Name: Visitors, dtype: int64

# Exmp 2 - Series
raw_data.Visitors.head()

0 707
1 1455
2 1520
3 1726
4 2134
Name: Visitors, dtype: int64

# Exmp 3 - Data Frame


type(raw_data[['Visitors']])

# Exmp 4 - Data Frame 2 columns +


raw_data[['Visitors', 'Revenue','Marketing Spend']].head()

   Visitors  Revenue  Marketing Spend
0       707     5211          651.375
1      1455    10386         1298.250
2      1520    12475         1559.375
3      1726    14414         1801.750
4      2134    20916         2614.500

# Creating an Array from a df column

# exm 1
np.array(raw_data[['Visitors']])

# exm 2
raw_data[['Visitors']].values

raw_data[['Visitors']].sum()

np.sum(raw_data[['Visitors']])

Visitors 303345
dtype: int64
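
The difference between single and double brackets shows up in the shape of what comes back. This is a small sketch on three rows of the data (the names `sample`, `as_series` and `as_frame` are made up for the example): single brackets give a one-dimensional Series, double brackets give a one-column DataFrame.

```python
import pandas as pd

sample = pd.DataFrame({'Visitors': [707, 1455, 1520],
                       'Revenue': [5211, 10386, 12475]})

as_series = sample['Visitors']      # single brackets -> Series
as_frame = sample[['Visitors']]     # double brackets -> one-column DataFrame

print(as_series.to_numpy().shape)   # one-dimensional array
print(as_frame.to_numpy().shape)    # two-dimensional array with one column
print(as_series.sum())              # a single number
print(as_frame.sum())               # a Series with one entry per column
```

This is why the sum above is printed with the column label `Visitors`: it was computed on a DataFrame, so the result is a Series indexed by column name.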

7. Row Filtering
# Exmp 1
raw_data[0:5]

         Date   Day_Name  Visitors  Revenue  Marketing Spend           Promo
0  09/11/2020     Monday       707     5211          651.375        No Promo
1  10/11/2020    Tuesday      1455    10386         1298.250   Promotion Red
2  11/11/2020  Wednesday      1520    12475         1559.375  Promotion Blue
3  12/11/2020   Thursday      1726    14414         1801.750        No Promo
4  13/11/2020     Friday      2134    20916         2614.500        No Promo

# Exmp 2 - .loc slices by index label and includes the end label, so 0:5 returns 6 rows
raw_data.loc[0:5]

         Date   Day_Name  Visitors  Revenue  Marketing Spend           Promo
0  09/11/2020     Monday       707     5211          651.375        No Promo
1  10/11/2020    Tuesday      1455    10386         1298.250   Promotion Red
2  11/11/2020  Wednesday      1520    12475         1559.375  Promotion Blue
3  12/11/2020   Thursday      1726    14414         1801.750        No Promo
4  13/11/2020     Friday      2134    20916         2614.500        No Promo
5  14/11/2020   Saturday      1316    12996         1444.000  Promotion Blue
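
The two examples return a different number of rows because they slice in different ways. A minimal sketch, using the Visitors values of the six rows shown above (the frame name `sample` is hypothetical): plain `[0:5]` and `.iloc[0:5]` slice by position and exclude the end, while `.loc[0:5]` slices by label and includes it.

```python
import pandas as pd

sample = pd.DataFrame({'Visitors': [707, 1455, 1520, 1726, 2134, 1316]})

by_brackets = sample[0:5]        # positional slice, end excluded
by_loc = sample.loc[0:5]         # label slice, end label included
by_iloc = sample.iloc[0:5]       # positional slice, end excluded

print(len(by_brackets), len(by_loc), len(by_iloc))
```

With the default 0, 1, 2, ... index the labels and positions coincide, so the only visible difference is the extra row from `.loc`. After sorting or filtering, labels and positions no longer line up and the choice matters more.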

8. Filtering / Slicing
# Select all the data when the day is Monday

# exmp 1
raw_data[raw_data['Day_Name'] == 'Monday']

# exmp 2
raw_data[raw_data.Day_Name == 'Monday'].head()

          Date Day_Name  Visitors  Revenue  Marketing Spend           Promo
0   09/11/2020   Monday       707     5211       651.375000        No Promo
7   16/11/2020   Monday      1548    10072      1119.111111        No Promo
14  23/11/2020   Monday      2632    34278      4284.750000  Promotion Blue
21  30/11/2020   Monday      1541     9144      1016.000000        No Promo
28  07/12/2020   Monday      1584     9612       961.200000   Promotion Red

# Select all the rows where Marketing Spend > 1000

# exmp 1
raw_data[raw_data['Marketing Spend'] > 1000].head()

#raw_data.shape

         Date   Day_Name  Visitors  Revenue  Marketing Spend           Promo
1  10/11/2020    Tuesday      1455    10386         1298.250   Promotion Red
2  11/11/2020  Wednesday      1520    12475         1559.375  Promotion Blue
3  12/11/2020   Thursday      1726    14414         1801.750        No Promo
4  13/11/2020     Friday      2134    20916         2614.500        No Promo
5  14/11/2020   Saturday      1316    12996         1444.000  Promotion Blue
# Select all the rows where Marketing Spend > 1000 & Promo = "No Promo"

# exmp 1 - 2 conditions
raw_data[(raw_data['Marketing Spend'] > 1000) & (raw_data['Promo'] == 'No Promo')].head()

         Date  Day_Name  Visitors  Revenue  Marketing Spend     Promo
3  2020-11-12  Thursday      1726    14414      1801.750000  No Promo
4  2020-11-13    Friday      2134    20916      2614.500000  No Promo
7  2020-11-16    Monday      1548    10072      1119.111111  No Promo
10 2020-11-19  Thursday      2321    17660      1605.454545  No Promo
21 2020-11-30    Monday      1541     9144      1016.000000  No Promo
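
Each condition produces a boolean Series, and those Series are combined element by element with `&` (and), `|` (or) and `~` (not). Every condition needs its own parentheses because these operators bind more tightly than comparisons such as `>` and `==`. The sketch below is an illustration on the first six rows of the data (the frame and variable names are hypothetical); `isin` is a convenient way to test a column against several values at once.

```python
import pandas as pd

sample = pd.DataFrame({
    'Marketing Spend': [651.375, 1298.25, 1559.375, 1801.75, 2614.5, 1444.0],
    'Promo': ['No Promo', 'Promotion Red', 'Promotion Blue',
              'No Promo', 'No Promo', 'Promotion Blue'],
})

# AND: both conditions must hold
both = sample[(sample['Marketing Spend'] > 1000) & (sample['Promo'] == 'No Promo')]

# OR: at least one condition must hold
either = sample[(sample['Marketing Spend'] > 2000) | (sample['Promo'] == 'Promotion Red')]

# isin: match any value in a list; ~ negates the mask
is_promo = sample['Promo'].isin(['Promotion Red', 'Promotion Blue'])
any_promo = sample[is_promo]
no_promo = sample[~is_promo]

print(both.index.tolist())
print(either.index.tolist())
print(any_promo.index.tolist())
print(no_promo.index.tolist())
```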

9. Sorting the DF
# Exmp 1 - by Date
raw_data.sort_values(by = 'Date')

# Exmp 2 - passing it back v1


raw_data = raw_data.sort_values(by = 'Date')

# Exmp 3 - passing it back v2


raw_data.sort_values(by = 'Date', inplace = True)

# Exmp 4 - by Rev asc


raw_data.sort_values(by = 'Revenue')

# Exmp 5 - by Rev desc


raw_data.sort_values(by = 'Revenue', ascending = False, inplace = True)

raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue
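
Two details of `sort_values` are easy to trip over: it can sort by several columns at once, and with `inplace=True` it returns `None` rather than the sorted frame. The following is a small sketch on the first six rows of the data (the names `sample`, `ordered`, `returned` and `tidy` are made up for the example).

```python
import pandas as pd

sample = pd.DataFrame({
    'Promo': ['No Promo', 'Promotion Red', 'Promotion Blue',
              'No Promo', 'No Promo', 'Promotion Blue'],
    'Revenue': [5211, 10386, 12475, 14414, 20916, 12996],
})

# Sort by Promo (A to Z), then by Revenue (largest first) within each Promo
ordered = sample.sort_values(by=['Promo', 'Revenue'], ascending=[True, False])
print(ordered.index.tolist())    # original row labels travel with the rows

# inplace=True modifies `sample` itself and returns None
returned = sample.sort_values(by='Revenue', inplace=True)
print(returned)

# reset_index(drop=True) replaces the shuffled labels with 0, 1, 2, ...
tidy = ordered.reset_index(drop=True)
print(tidy.index.tolist())
```

This is also why the row labels in the output above (47, 14, 59, ...) are out of order: sorting moves the rows but keeps their original index labels.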
10. Aggregating the DF
# Exmp 1 - Total Revenue
raw_data['Revenue'].sum()
np.sum(raw_data['Revenue'])

# Exmp 2 - Average Visitors


raw_data['Visitors'].mean()

# Exmp 3 - Min Marketing Spend


raw_data['Marketing Spend'].min()

322.0

# Sum of Revenue by Day Name

# option 1 - Series
raw_data.groupby('Day_Name')['Revenue'].sum()

# option 2 - df
a = raw_data.groupby('Day_Name', as_index = False).agg({'Revenue':'sum'})
a

Day_Name Revenue
0 Friday 447823
1 Monday 270080
2 Saturday 285171
3 Sunday 286458
4 Thursday 491323
5 Tuesday 261206
6 Wednesday 322454

# Sum of Revenue by Day Name & number of data points
a = raw_data.groupby('Day_Name', as_index = False).agg({'Revenue':'sum', 'Date':'count'})

# Renaming columns v1
#a.columns = ['Day_Name','Revenue','Count of Datapoints']
#a

# Renaming columns v2
a.columns = a.columns.str.replace('Date', 'Count of Datapoints')

raw_data.columns

Index(['Date', 'Day_Name', 'Visitors', 'Revenue', 'Marketing Spend', 'Promo'], dtype='object')
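
As an alternative to renaming columns after the fact, recent pandas versions (0.25 and later) support named aggregation, where the output column names are chosen inside `agg` itself. This is a sketch of that option, not the approach used in the tutorial; the output names `total_revenue` and `n_days` are hypothetical, and the data is the first six rows shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    'Date': ['09/11/2020', '10/11/2020', '11/11/2020',
             '12/11/2020', '13/11/2020', '14/11/2020'],
    'Promo': ['No Promo', 'Promotion Red', 'Promotion Blue',
              'No Promo', 'No Promo', 'Promotion Blue'],
    'Revenue': [5211, 10386, 12475, 14414, 20916, 12996],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = sample.groupby('Promo', as_index=False).agg(
    total_revenue=('Revenue', 'sum'),
    n_days=('Date', 'count'),
)
print(summary)
```

The result has one row per Promo value, with the columns already carrying their final names.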

# Average summary by Promo
raw_data.groupby('Promo', as_index = False).agg({'Revenue':'mean', 'Visitors':'mean', 'Marketing Spend':'mean'})

            Promo       Revenue     Visitors  Marketing Spend
0        No Promo  10945.229730  1627.648649      1176.664264
1  Promotion Blue  16754.185185  1790.907407      1829.731425
2   Promotion Red  12034.111111  1596.111111      1264.041522

# Average Revenue, Visitors and Marketing Spend by Promo when the Day Name = 'Saturday'
raw_data[raw_data['Day_Name'] == 'Saturday'].groupby('Promo', as_index = False).agg({'Revenue':'mean', 'Visitors':'mean', 'Marketing Spend':'mean'})

            Promo       Revenue     Visitors  Marketing Spend
0        No Promo   8128.000000  1302.818182       839.546511
1  Promotion Blue  16429.000000  1666.125000      1940.668056
2   Promotion Red   9190.142857  1285.857143       953.681061

raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue

11. Creating New Calculated Fields


# Revenue per Visitor
raw_data['Revenue per Visitor'] = raw_data['Revenue'] / raw_data['Visitors']
raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo  \
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue

     Revenue per Visitor
47             13.548544
14             13.023556
59              7.283402
124            10.320644
109            10.422248
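
The division above works row by row: pandas lines the two columns up on their index and divides each pair of values. Assigning with `raw_data[...] = ...` changes the DataFrame in place. If you would rather leave the original untouched, `assign` returns a new DataFrame with the extra column. The sketch below shows that alternative on two rows of the data (the names `sample` and `with_ratio` are hypothetical; the `**{...}` form is only needed because the new column name contains spaces).

```python
import pandas as pd

sample = pd.DataFrame({'Visitors': [707, 1455], 'Revenue': [5211, 10386]})

# assign() returns a copy with the new column; `sample` itself is not modified
with_ratio = sample.assign(
    **{'Revenue per Visitor': sample['Revenue'] / sample['Visitors']}
)

print(sample.columns.tolist())
print(with_ratio.columns.tolist())
print(with_ratio['Revenue per Visitor'].round(2).tolist())
```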

# Revenue per Spend
raw_data['Revenue per Spend'] = raw_data['Revenue'] / raw_data['Marketing Spend']
raw_data.head()

# A second column with the same values (it is deleted again in the next step)
raw_data['Aasdfsad'] = raw_data['Revenue'] / raw_data['Marketing Spend']
raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo  \
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue

     Revenue per Visitor  Revenue per Spend  Aasdfsad
47             13.548544                8.0       8.0
14             13.023556                8.0       8.0
59              7.283402               10.0      10.0
124            10.320644                8.0       8.0
109            10.422248               11.0      11.0

# Deleting a column
del raw_data['Aasdfsad']
raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo  \
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue

     Revenue per Visitor  Revenue per Spend
47             13.548544                8.0
14             13.023556                8.0
59              7.283402               10.0
124            10.320644                8.0
109            10.422248               11.0

# Spend per visitor
raw_data['Spend per visitor'] = raw_data['Marketing Spend'] / raw_data['Visitors']

raw_data.head()

          Date  Day_Name  Visitors  Revenue  Marketing Spend           Promo  \
47  2020-12-26  Saturday      2678    36283      4535.375000  Promotion Blue
14  2020-11-23    Monday      2632    34278      4284.750000  Promotion Blue
59  2021-01-07  Thursday      4139    30146      3014.600000        No Promo
124 2021-03-13  Saturday      2732    28196      3524.500000  Promotion Blue
109 2021-02-26    Friday      2553    26608      2418.909091  Promotion Blue

     Revenue per Visitor  Revenue per Spend  Spend per visitor
47             13.548544                8.0           1.693568
14             13.023556                8.0           1.627945
59              7.283402               10.0           0.728340
124            10.320644                8.0           1.290081
109            10.422248               11.0           0.947477

# Sum of the per-row Revenue per Spend ratio by Promo
raw_data.groupby('Promo', as_index = False).agg({'Revenue per Spend':'sum'})

            Promo  Revenue per Spend
0        No Promo              694.0
1  Promotion Blue              504.0
2   Promotion Red              526.0

# Revenue per Spend by Promo, computed from the group totals
a = raw_data.groupby('Promo', as_index = False).agg({'Revenue':'sum', 'Marketing Spend':'sum'})
a
a['Revenue per Spend'] = a['Revenue'] / a['Marketing Spend']
a

            Promo  Revenue  Marketing Spend  Revenue per Spend
0        No Promo   809947     87073.155556           9.301914
1  Promotion Blue   904726     98805.496969           9.156636
2   Promotion Red   649842     68258.242173           9.520345
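
Aggregating a ratio column and taking the ratio of aggregated totals generally give different answers, which is why the two results above do not match. The sketch below illustrates the effect on the first six rows of the data (the variable names are hypothetical): for Promotion Blue, the average of the two per-row ratios is 8.5, while total revenue divided by total spend is slightly lower, because the day with the larger spend carries more weight in the totals.

```python
import pandas as pd

sample = pd.DataFrame({
    'Promo': ['No Promo', 'Promotion Red', 'Promotion Blue',
              'No Promo', 'No Promo', 'Promotion Blue'],
    'Revenue': [5211, 10386, 12475, 14414, 20916, 12996],
    'Marketing Spend': [651.375, 1298.25, 1559.375, 1801.75, 2614.5, 1444.0],
})

# Option A: compute the ratio per row, then average it within each Promo
per_row = sample['Revenue'] / sample['Marketing Spend']
mean_of_ratios = per_row.groupby(sample['Promo']).mean()

# Option B: total the inputs per Promo first, then divide the totals
totals = sample.groupby('Promo')[['Revenue', 'Marketing Spend']].sum()
ratio_of_totals = totals['Revenue'] / totals['Marketing Spend']

print(mean_of_ratios)
print(ratio_of_totals)
```

Which of the two is appropriate depends on the question being asked; for an overall return on spend per promotion, dividing the totals is usually the intended figure.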

More details here: https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html
