Introduction to Pandas Library in Python
1. Objective
• Getting started with pandas
• Introduction to pandas data structures
• NumPy vs pandas
• Series vs DataFrames
• Creating, reading, and writing data
• Indexing, selecting, and assigning data in DataFrames
• Grouping and sorting
• Handling missing values
• Using pandas DataFrames for fundamental data-science tasks
2. Getting Started with pandas:
In the previous lab, we covered NumPy in detail; it provides efficient storage and manipulation of
dense, typed arrays in Python. Here we will build on this knowledge by looking in detail at the data
structures provided by the pandas library. Pandas is a package built on top of NumPy that provides
an efficient implementation of a Series and DataFrame. In this lab, we will focus on the mechanics
of using Series, DataFrame, and related structures effectively.
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
import pandas as pd
This important convention will be used throughout the lab.
2.1 Introduction to pandas Data Structures
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy
structured arrays in which the rows and columns are identified with labels rather than simple
integer indices. To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every problem,
they provide a solid, easy-to-use basis for most applications.
Installing the pandas library
pip install pandas    # for regular Python environments
conda install pandas  # for Anaconda environments
!pip install pandas   # for Jupyter notebooks (applicable in our case)
2.2 Series
❖ A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Series wraps both a sequence of values and a sequence of indices, which we can access with the
values and index attributes.
print(data.values)  # output: [0.25 0.5 0.75 1. ]
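Similarly, the index attribute returns the index object; for the Series above it is a simple RangeIndex:
print(data.index)  # output: RangeIndex(start=0, stop=4, step=1)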
Like with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:
print(data[1])  # output: 0.5
For example, if we wish, we can use strings as an index:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)
output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
and item access works as expected: data['b'] = ?
Important: We can even use noncontiguous or nonsequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
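With this index, item access uses the labels we supplied rather than positions:
print(data[5])  # output: 0.5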
2.3 DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a
dict of Series.
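As a quick illustration (the column names and values below are made up purely for this sketch), a DataFrame can be built from a dict of Series:
names = pd.Series(['Ali', 'Sara', 'Bilal'])
marks = pd.Series([85, 92, 78])
df = pd.DataFrame({'Name': names, 'Marks': marks})
print(df)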
2.4 Reading Data files
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we
won't actually be creating our own data by hand. Instead, we'll be working with data that
already exists.
Data can be stored in any of a number of different forms and formats. By far the most
basic of these is the humble CSV file. A CSV file is a table of values separated by commas,
hence the name: "Comma-Separated Values", or CSV.
Loading a file from the current working folder (e.g. C:\Users\<username>):
data = pd.read_csv("covid19.csv")
Loading a file from Google Drive if using Colab:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
dataset = pd.read_csv('/content/drive/MyDrive/covid19.csv')
To load a file from any other specific location on your computer, use double backslashes in the path:
data = pd.read_csv("C:\\Users\\ABPKSUP\\Desktop\\AI Enabled Data Analytics -LUMS\\Class 4, Pandas Library\\covid19.csv")
We can use the shape attribute to check how large the resulting DataFrame is:
print(data.shape)
We can examine the contents of the resultant DataFrame using the head() command,
which grabs the first five rows:
print(data.head())
3. Accessing Data in a DataFrame
In Python, we can access a property of an object as an attribute. A book object, for
example, might have a title property, which we can access as book.title. Columns in a pandas
DataFrame work in much the same way:
print(data.Country)
Moreover, we can also access a specific value in the DataFrame:
print(data['Country'][2])
4. Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest
of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas
has its own accessor operators, loc and iloc. For more advanced operations, these are the ones
you're supposed to be using:
4.1 Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection:
selecting data based on its numerical position in the data. iloc follows this paradigm.
To select the first row of data in a DataFrame, we may use the following:
print(data.iloc[0])
To get a column with iloc, we can do the following:
print(data.iloc[:,0])
To select the first column from just the first, second, and third rows, we would do:
print(data.iloc[:3,0])
It's also possible to pass a list:
print(data.iloc[[0, 2, 5], 0])
Finally, it's worth knowing that negative numbers can be used in selection. This will start counting
from the end of the values. So, for example, here are the last five rows of the dataset:
print(data.iloc[-5:])
4.2 Label-based selection
The second paradigm for attribute selection is the one followed by the loc operator: label-based
selection. In this paradigm, it's the data index value, not its position, that matters.
loc is label-based, meaning that you specify rows and columns by their row and column labels.
iloc is integer-position based, so you specify rows and columns by their integer positions
(0-based). Since your dataset usually has meaningful indices, it's usually easier to do things
using loc instead. For example, here's one operation that's much easier using loc:
print(data.loc[:, ['ObservationDate', 'Province/State', 'Country']])
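A small point worth remembering when using both accessors on the same covid19.csv DataFrame: loc slices are inclusive of the end label, while iloc slices exclude the end position:
print(data.iloc[0:3])  # rows at positions 0, 1 and 2 (three rows)
print(data.loc[0:3])   # rows with labels 0, 1, 2 and 3 (four rows, because loc is inclusive)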
5. Conditional Selection
So far we've been indexing various strides of data, using structural properties of the DataFrame
itself. To do interesting things with the data, however, we often need to ask questions based on
conditions.
For example, suppose that we are interested specifically in Anhui province cases:
print(data.loc[data['Province/State'] == 'Anhui'])
This returns a total of 494 records.
We can use OR (|) to combine two conditions (for AND you can use &, as shown in the sketch after this example):
print(data.loc[(data['Province/State'] == 'Anhui') | (data['Deaths'] == 0)])
This returns a total of 32,893 records.
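A minimal sketch of the corresponding AND condition using & on the same columns (the exact record count will of course depend on the dataset):
print(data.loc[(data['Province/State'] == 'Anhui') & (data['Deaths'] == 0)])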
Pandas comes with a few built-in conditional selectors, two of which we will highlight here.
The first is isin, which lets you select data whose value "is in" a list of values. For example,
cases in Peru or Spain:
print(data.loc[(data.Country.isin(['Peru','Spain']))])
The second is isnull or isna (and its companion notnull). These methods let you highlight
values which are (or are not) empty (NaN).
print(data.loc[(data.Country.notnull())])
print(data.loc[(data['Province/State'].notnull())])
To find records with null values:
print(data.loc[(data['Province/State'].isnull())])
print(data.loc[(data['Province/State'].isna())])
Assigning data:
Going the other way, assigning data to a DataFrame is easy.
Adding a new column to the DataFrame, here filled with a constant value:
data['New Column'] = 'dummy_values'
print(data['New Column'])
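Going further, we can also assign a list of values, as long as its length matches the number of rows; a small sketch (the column name here is made up):
data['Row Number'] = list(range(len(data)))
print(data['Row Number'])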
6. Functions and maps
To see a list of unique values we can use the unique() function:
print(data.Country.unique())
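Since this section is about functions and maps, here is a minimal sketch of map, which applies a function to every value in a Series (the uppercase transformation is chosen purely for illustration):
print(data.Country.map(lambda c: str(c).upper()))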
More functions are available with pandas; the most commonly used are listed in the table below
(there, pd means the pandas library, while df is the name of our dataset, like the name 'data' we
used above; it can be any name that we assign to our data).
Important: Writing to files
Just as read_csv loads a CSV file into a DataFrame, the to_csv method writes a DataFrame back out to a CSV file.
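A minimal sketch (the output file name here is just an example):
data.to_csv('covid19_copy.csv', index=False)  # index=False skips writing the row index as an extra column
Returning to the general-purpose functions mentioned above: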
Function | Description
df.describe() | Summary statistics for numerical columns
df.mean() | Returns the mean of all columns
df.corr() | Returns the correlation between columns in a DataFrame
df.count() | Returns the number of non-null values in each DataFrame column
df.max() | Returns the highest value in each column
df.min() | Returns the lowest value in each column
df.median() | Returns the median of each column
df.std() | Returns the standard deviation of each column
pd.to_datetime() | Converts a column to datetime format
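For instance, a quick sketch applying two of these functions to our covid19.csv DataFrame (this assumes the ObservationDate column holds date strings):
print(data.describe())
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])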
Combining:
Concatenate:
The concat function is used to concatenate two or more DataFrames along a particular axis (either rows or
columns).
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate along rows (axis=0)
result = pd.concat([df1, df2], axis=0)
print(result)
Output
A B
0 1 3
1 2 4
0 5 7
1 6 8
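Notice that the original row labels (0, 1, 0, 1) are kept. To get a fresh 0..3 index instead, pass ignore_index=True:
result = pd.concat([df1, df2], axis=0, ignore_index=True)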
# Concatenate along columns
result2 = pd.concat([df1, df2], axis=1)
print(result2)
Output
A B A B
0 1 3 5 7
1 2 4 6 8
Merge:
The merge function is used to merge two DataFrames based on a common column or index.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})
# Merge based on a common column 'key'
result = pd.merge(df1, df2, on='key')
print(result)
Output
key value_x value_y
0 A 1 3
1 B 2 4
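By default, merge performs an inner join, keeping only keys present in both DataFrames; the how parameter ('left', 'right', 'outer', 'inner') controls this. A small sketch with made-up data:
df3 = pd.DataFrame({'key': ['B', 'C'], 'value': [5, 6]})
result2 = pd.merge(df1, df3, on='key', how='outer')
print(result2)  # rows for keys A, B and C; missing values appear as NaN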
7. Practice Tasks
Practice Task 1
Create a DataFrame fruits that looks like this:
   Bananas  Apples
0       20      17
Practice Task 2
Create a DataFrame sales that looks like below:
           Bananas  Apples
September      200     120
October        165      90
Practice Task 3
Create a series marks that looks like:
Practice Task 4
Read the CSV dataset of Groceries given to you into a DataFrame called grocer.
Practice Task 5
Select the item description column from grocer and assign the result to the variable item_desc.
Practice Task 6
Select the first value from the member_number column of grocer and assign it to variable
first_member.
Practice Task 7
Select the first 20 rows of data and save them to the variable first_rows.
Practice Task 8
Create a variable selected_grocer containing the values of all columns, but only for purchases made in 2015.
Practice Task 9
Create a variable containing the member_number and date columns of the first 500 records.
Practice Task 10
Create a DataFrame containing purchases made in both 2014 and 2015.