Python Pandas Tutorial For Beginners 2019 (On The Go)
What is Pandas?
• Pandas is a Python package / library
• Pandas is a library for data manipulation and analysis
• Two main data structures: the Series and DataFrame
# Visual representation of Pandas
import os
from IPython.display import Image
PATH = "F:\\Github\\Python tutorials\\Introduction to Pandas\\"
Image(filename = PATH + "pandas.png", width=300, height=300)
Tutorial Overview
Video 1
1. What is a Pandas Series
2. What is a DataFrame
3. Creating DataFrames
4. Loading a CSV into a DataFrame
5. Basic methods for investigating/viewing a DataFrame
Video 2
1. Column Filtering
2. Row Filtering
3. Filtering / Slicing
4. Sorting the DF
5. Summarising / Aggregating Data
6. Creating New Calculated Fields
1. What is a Pandas Series
A Series is a one-dimensional labelled array; each value has an index label.
0 28
1 1.79
2 Yiannis
dtype: object
Age 28
Height 1.79 cm
Name Yiannis
dtype: object
'Yiannis'
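The outputs above look like a Series built first from a plain list and then with labels. A minimal sketch that produces similar output, assuming the values shown; the variable names `s` and `person` are illustrative and not taken from the original notebook:

```python
import pandas as pd

# A Series built from a plain list gets a default integer index (0, 1, 2)
s = pd.Series([28, 1.79, "Yiannis"])
print(s)

# Building it from a dict gives every value a label instead
person = pd.Series({"Age": 28, "Height": "1.79 cm", "Name": "Yiannis"})
print(person)

# Values can then be looked up by label
print(person["Name"])
```

Because the values have mixed types (int, float, text), pandas stores them with the generic `object` dtype.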
2. What is a DataFrame
A DataFrame is like an Excel table / PivotTable: a two-dimensional tabular data structure made up of rows and columns. Each column of a DataFrame is a Series.
3. Creating a DataFrame
# Manually
Empty DataFrame
Columns: [Name, Gender, Age, Height]
Index: []
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
    A   B   C
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
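The cells that produced these outputs are not shown. A minimal sketch that gives the same three results, assuming the array was built with `np.arange`; the names `empty_df`, `arr` and `df` are illustrative:

```python
import numpy as np
import pandas as pd

# An empty DataFrame that only defines the column names
empty_df = pd.DataFrame(columns=["Name", "Gender", "Age", "Height"])
print(empty_df)

# A 5x3 NumPy array holding the integers 0 to 14
arr = np.arange(15).reshape(5, 3)
print(arr)

# Wrap the array in a DataFrame and name its columns
df = pd.DataFrame(arr, columns=["A", "B", "C"])
print(df)
```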
# Transpose a DataFrame
# Method 1
df.T
# Method 2
np.transpose(df)
   0  1  2   3   4
A  0  3  6   9  12
B  1  4  7  10  13
C  2  5  8  11  14
4. Loading a CSV into a DataFrame
# Working directory of the notebook (cell output)
'C:\\Users\\pitsi\\Desktop\\Python Tutorials'
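The cell that loads the CSV is not shown, so here is a hedged sketch of the usual approach with `pd.read_csv`. The CSV text below is made up and only borrows the column names that appear later in the tutorial; it stands in for the real file, whose name is not given in the source:

```python
import io
import pandas as pd

# Stand-in for the real CSV file: invented rows, tutorial column names
csv_text = """Date,Day_Name,Visitors,Revenue,Marketing Spend,Promo
2020-01-01,Wednesday,100,1000,250.5,No Promo
2020-01-02,Thursday,150,1800,300.0,Promotion
2020-01-03,Friday,120,1500,275.25,No Promo
"""

# read_csv accepts a file path or any file-like object
raw_data = pd.read_csv(io.StringIO(csv_text))
print(raw_data.head())
```

With a file on disk, the same call takes the path instead of the `StringIO` object.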
5. Basic methods for investigating/viewing a DataFrame
# Shape
raw_data.shape
(182, 6)
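Besides `shape`, a few other methods are commonly used to get a first look at a DataFrame. A small sketch on invented data (the values are assumptions, only the column names follow the tutorial):

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Visitors": [100, 150, 120, 130],
    "Revenue": [1000, 1800, 1500, 1200],
})

print(raw_data.shape)       # (number of rows, number of columns)
print(raw_data.head(2))     # first 2 rows (default is 5)
print(raw_data.tail(2))     # last 2 rows
print(raw_data.dtypes)      # data type of every column
print(raw_data.describe())  # summary statistics for the numeric columns
```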
# Info
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 6 columns):
Date               182 non-null object
Day_Name           182 non-null object
Visitors           182 non-null int64
Revenue            182 non-null int64
Marketing Spend    182 non-null float64
Promo              182 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 8.6+ KB
# Info after converting the Date column from text to datetime
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 6 columns):
Date               182 non-null datetime64[ns]
Day_Name           182 non-null object
Visitors           182 non-null int64
Revenue            182 non-null int64
Marketing Spend    182 non-null float64
Promo              182 non-null object
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 8.6+ KB
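Between the two `info()` outputs the Date column changes from text to `datetime64[ns]`. The cell that does this is not shown; a likely way is `pd.to_datetime`, sketched here on invented data:

```python
import pandas as pd

# Invented sample: dates stored as text, as in the first info() output
raw_data = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "Visitors": [100, 150, 120],
})

# Convert the text column to real datetimes
raw_data["Date"] = pd.to_datetime(raw_data["Date"])
print(raw_data["Date"].dtype)

# Datetime columns expose date parts through the .dt accessor
print(raw_data["Date"].dt.day_name())
```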
6. Column Filtering
# Example 1 - Series (bracket notation)
raw_data['Visitors'].head()
0 707
1 1455
2 1520
3 1726
4 2134
Name: Visitors, dtype: int64
# Example 2 - Series (dot notation; only works for column names without spaces)
raw_data.Visitors.head()
0 707
1 1455
2 1520
3 1726
4 2134
Name: Visitors, dtype: int64
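# Example 1 - column as a NumPy array
np.array(raw_data[['Visitors']])
# Example 2 - column as a NumPy array via .values
raw_data[['Visitors']].values
# Sum of the column - two ways
raw_data[['Visitors']].sum()
np.sum(raw_data[['Visitors']])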
Visitors 303345
dtype: int64
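The examples above mix single and double brackets, which return different types. A short sketch on invented data to make the difference explicit:

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday", "Wednesday"],
    "Visitors": [100, 150, 120],
    "Revenue": [1000, 1800, 1500],
})

# Single brackets return a one-dimensional Series
visitors_series = raw_data["Visitors"]
print(type(visitors_series))

# Double brackets (a list of names) return a two-dimensional DataFrame
visitors_df = raw_data[["Visitors"]]
print(type(visitors_df))

# The list can hold several column names
subset = raw_data[["Day_Name", "Revenue"]]
print(subset)

# sum() gives a scalar for a Series and one value per column for a DataFrame
print(visitors_series.sum())
print(visitors_df.sum())
```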
7. Row Filtering
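# Example 1 - positional slice: rows 0 to 4 (end excluded)
raw_data[0:5]
# Example 2 - label-based slice: rows 0 to 5 (end label included)
raw_data.loc[0:5]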
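The two row slices look alike but do not return the same number of rows. A small sketch on invented data, with `iloc` added for comparison:

```python
import pandas as pd

# Invented data with the default integer index 0..7
raw_data = pd.DataFrame({"Visitors": [10, 20, 30, 40, 50, 60, 70, 80]})

first_five = raw_data[0:5]        # positional slice, end excluded: 5 rows
by_label = raw_data.loc[0:5]      # label slice, end included: 6 rows
by_position = raw_data.iloc[0:5]  # explicit positional slice: 5 rows

print(len(first_five), len(by_label), len(by_position))
```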
8. Filtering / Slicing
# Select all the rows where the day is Monday
# Example 1
raw_data[raw_data['Day_Name'] == 'Monday']
# Example 2
raw_data[raw_data.Day_Name == 'Monday'].head()
# Select all the rows where Marketing Spend > 1000
# Example 1
raw_data[raw_data['Marketing Spend'] > 1000].head()
#raw_data.shape
# Example 2 - two conditions
raw_data[(raw_data['Marketing Spend'] > 1000) & (raw_data['Promo'] == 'No Promo')].head()
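Each condition above is a boolean Series (a mask), and masks can be combined. A sketch on invented data; the mask names `high_spend` and `no_promo` are illustrative:

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday", "Monday", "Friday"],
    "Marketing Spend": [800.0, 1200.0, 1500.0, 2000.0],
    "Promo": ["No Promo", "No Promo", "Promotion", "No Promo"],
})

# Each comparison yields one True/False value per row
high_spend = raw_data["Marketing Spend"] > 1000
no_promo = raw_data["Promo"] == "No Promo"

both = raw_data[high_spend & no_promo]    # AND: both conditions hold
either = raw_data[high_spend | no_promo]  # OR: at least one holds
with_promo = raw_data[~no_promo]          # NOT: invert a mask

# isin() checks membership in a list of values
mon_or_fri = raw_data[raw_data["Day_Name"].isin(["Monday", "Friday"])]
print(both)
```

The parentheses around each condition in the inline form are required, because `&` and `|` bind more tightly than `>` and `==`.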
9. Sorting the DF
# Example 1 - by Date
# sort_values returns a new sorted DataFrame; raw_data itself is not modified
raw_data.sort_values(by = 'Date')
raw_data.head()
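A short sketch of the common sorting options on invented data, including how to keep the sorted result:

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday", "Wednesday"],
    "Revenue": [1500, 1000, 1800],
})

# Ascending order is the default
by_revenue = raw_data.sort_values(by="Revenue")

# Descending order
by_revenue_desc = raw_data.sort_values(by="Revenue", ascending=False)

# To keep the sorted order, assign the result; reset_index renumbers the rows
raw_data_sorted = raw_data.sort_values(by="Revenue").reset_index(drop=True)
print(raw_data_sorted)
```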
10. Summarising / Aggregating Data
322.0
# Option 1 - returns a Series
raw_data.groupby('Day_Name')['Revenue'].sum()
# Option 2 - returns a DataFrame
a = raw_data.groupby('Day_Name', as_index=False).agg({'Revenue': 'sum'})
a
    Day_Name  Revenue
0     Friday   447823
1     Monday   270080
2   Saturday   285171
3     Sunday   286458
4   Thursday   491323
5    Tuesday   261206
6  Wednesday   322454
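The same `groupby` pattern extends to several aggregations at once. A sketch on invented data; the output column names `total_revenue` and `n_rows` are illustrative choices:

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday", "Monday", "Tuesday"],
    "Revenue": [1000, 1800, 1500, 1200],
    "Visitors": [100, 150, 120, 130],
})

# A dict maps each column to the aggregation applied to it
summary = raw_data.groupby("Day_Name", as_index=False).agg(
    {"Revenue": "sum", "Visitors": "mean"}
)
print(summary)

# Named aggregation sets the output column names directly
named = raw_data.groupby("Day_Name", as_index=False).agg(
    total_revenue=("Revenue", "sum"),
    n_rows=("Revenue", "count"),
)
print(named)
```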
# Renaming columns v1
#a.columns = ['Day_Name','Revenue','Count of Datapoints']
#a
# Renaming columns v2
a.columns = a.columns.str.replace('Date', 'Count of Datapoints')
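Another option is `DataFrame.rename`, which changes only the columns named in the mapping. A sketch assuming `a` also holds a `Date` count column, as the commented-out v1 line suggests; the values are invented:

```python
import pandas as pd

# Invented stand-in for the aggregated DataFrame `a`
a = pd.DataFrame({
    "Day_Name": ["Monday", "Tuesday"],
    "Revenue": [2500, 3000],
    "Date": [2, 2],
})

# rename returns a new DataFrame; columns not in the mapping are untouched
a = a.rename(columns={"Date": "Count of Datapoints"})
print(a.columns.tolist())
```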
raw_data.columns
raw_data.head()
# Deleting a column
del raw_data['Aasdfsad']
raw_data.head()
raw_data.head()
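The overview lists "Creating New Calculated Fields", but no cell for it is shown. A minimal sketch on invented data; the column name `Revenue per Visitor` is a hypothetical example, not taken from the original notebook:

```python
import pandas as pd

# Invented sample data using some of the tutorial's column names
raw_data = pd.DataFrame({
    "Visitors": [100, 150, 120],
    "Revenue": [1000, 1800, 1500],
})

# A new column is created by assigning the result of a column-wise calculation
raw_data["Revenue per Visitor"] = raw_data["Revenue"] / raw_data["Visitors"]
print(raw_data)

# drop() is an alternative to `del`; it returns a new DataFrame without the column
trimmed = raw_data.drop(columns=["Revenue per Visitor"])
print(trimmed.columns.tolist())
```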