Lab-3 Pandas Library

The document provides an introduction to the Pandas library in Python, focusing on its data structures, specifically Series and DataFrames, and their functionalities. It covers installation, data manipulation techniques such as indexing, conditional selection, and merging, as well as practical tasks for hands-on learning. The content is aimed at helping users effectively utilize Pandas for data science applications.

Introduction to Pandas Library in Python


1. Objective

• Getting started with pandas


• Introduction to pandas data structures
• NumPy vs pandas
• Series vs DataFrames
• Creating, Reading and Writing
• Indexing, Selecting and assigning data to DataFrames
• Grouping and sorting
• Handling missing values
• Using pandas DataFrames for data science

2. Getting Started with pandas:

In the previous lab, we dove into detail on NumPy, which provides efficient storage and
manipulation of dense typed arrays in Python. Here we'll build on this knowledge by looking in
detail at the data structures provided by the Pandas library. Pandas is a package built on top of
NumPy, and provides efficient implementations of the Series and DataFrame structures. In this
lab, we will focus on the mechanics of using Series, DataFrame, and related structures effectively.

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
import pandas as pd

This important convention will be used throughout the lab.

2.1 Introduction to pandas Data Structures


At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy
structured arrays in which the rows and columns are identified with labels rather than simple
integer indices. To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every problem,
they provide a solid, easy-to-use basis for most applications.

Installing Pandas library


pip install pandas # for regular python environments
conda install pandas # for anaconda environment
!pip install pandas # for Jupyter notebooks (applicable in our case)


2.2 Series
❖ A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

A Series wraps both a sequence of values and a sequence of indices, which we can access with
the values and index attributes.
print(data.values)   # output: [0.25 0.5 0.75 1. ]
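
For completeness, the index can be inspected the same way; a Series created without an explicit index gets a default RangeIndex:

print(data.index)   # output: RangeIndex(start=0, stop=4, step=1)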

Like with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:
print(data[1])   # output: 0.5

For example, if we wish, we can use strings as an index:


data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)

output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

and the item access works as expected: data['b'] = ?

Important: We can even use noncontiguous or nonsequential indices:


data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
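
With such an index, item access uses these labels rather than positions, for example:

print(data[5])   # output: 0.5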


2.3 DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a
dict of Series.
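
As a small illustrative sketch of the "dict of Series" idea (the names and values below are made up for illustration, not taken from the lab dataset):

marks = pd.Series([85, 92, 78], index=['Ali', 'Sara', 'Omar'])
attendance = pd.Series([0.90, 0.80, 0.95], index=['Ali', 'Sara', 'Omar'])
students = pd.DataFrame({'marks': marks, 'attendance': attendance})
print(students)   # one row per student, one column per Series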

2.4 Reading Data files


Being able to create a DataFrame or Series by hand is handy. But, most of the time, we
won't actually be creating our own data by hand. Instead, we'll be working with data that
already exists.

Data can be stored in any of a number of different forms and formats. By far the most
basic of these is the humble CSV file. A CSV file is a table of values separated by commas,
hence the name: "Comma-Separated Values", or CSV.

Loading a file from the current working directory:


data = pd.read_csv("covid19.csv")

Loading a file from Google Drive if using Colab:


from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
dataset = pd.read_csv('/content/drive/MyDrive/covid19.csv')

To load from any other specific location on the computer, use escaped backslashes in the path
(or a raw string such as r"C:\..."):

data = pd.read_csv("C:\\Users\\ABPKSUP\\Desktop\\AI Enabled Data Analytics -LUMS\\Class 4, Pandas Library\\covid19.csv")

We can use the shape attribute to check how large the resulting DataFrame is:
print(data.shape)

We can examine the contents of the resultant DataFrame using the head() command,
which grabs the first five rows:

print(data.head())
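
head() also accepts a number of rows if we want more (or fewer) than the default five:

print(data.head(10))   # first ten rows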


3. Accessing Data in a DataFrame


In Python, we can access a property of an object as an attribute. A book object, for
example, might have a title property, which we can access by calling book.title. Columns in a
pandas DataFrame work in much the same way.

print(data.Country)
Moreover, we can also access data frame specific value as:
print(data['Country'][2])
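
Note that attribute-style access only works when the column name is a valid Python identifier; a column such as 'Province/State', which contains a slash, has to be selected with square brackets:

print(data['Province/State'])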

4. Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest
of the Python ecosystem. For a novice, this makes them easy to pick up and use. However, pandas
has its own accessor operators, loc and iloc. For more advanced operations, these are the ones
you're supposed to be using.

4.1 Index-based selection


Pandas indexing works in one of two paradigms. The first is index-based selection:
selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:
print(data.iloc[0])

To get a column with iloc, we can do the following:

print(data.iloc[:,0])


To select the column from just the first, second, and third row, we would do:

print(data.iloc[:3,0])

It's also possible to pass a list: print(data.iloc[[0,2,5],0])

Finally, it's worth knowing that negative numbers can be used in selection. This starts counting
from the end of the values. So, for example, here are the last five rows of the dataset:

print(data.iloc[-5:])


4.2 Label-based selection

The second paradigm for selection is the one followed by the loc operator: label-based
selection. In this paradigm, it's the data index value, not its position, that matters.

loc is label-based, meaning that you have to specify rows and columns based on their row
and column labels. iloc is integer position-based, so you have to specify rows and columns by
their integer position values (0-based integer positions). Since your dataset usually has
meaningful indices, it's usually easier to do things using loc instead. For example, here's one
operation that's much easier using loc:

print(data.loc[:, ['ObservationDate', 'Province/State', 'Country']])
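
A further note: loc can select rows and columns together, and unlike iloc its slices include the end label, so the following returns the rows labelled 0, 1 and 2 (assuming the default integer index produced by read_csv):

print(data.loc[0:2, ['ObservationDate', 'Country']])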

5. Conditional Selection

So far we've been indexing various strides of data, using structural properties of the DataFrame
itself. To do interesting things with the data, however, we often need to ask questions based on
conditions.

For example, suppose that we're interested specifically in Anhui province cases

print(data.loc[data['Province/State'] == 'Anhui'])

# total 494 records

We can use OR (|) to bring two conditions together (for AND you can use &):

print(data.loc[(data['Province/State'] == 'Anhui') | (data['Deaths'] == 0)])

total 32,893 records found.

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin, which lets you select data whose value "is in" a list of values. For example,
cases in Peru or Spain:

print(data.loc[(data.Country.isin(['Peru','Spain']))])

The second is isnull or isna (and its companion notnull). These methods let you highlight
values which are (or are not) empty (NaN).

print(data.loc[(data.Country.notnull())])

print(data.loc[(data['Province/State'].notnull())])


To find records with null values


print(data.loc[(data['Province/State'].isnull())])
print(data.loc[(data['Province/State'].isna())])
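
As a related sketch (not part of the lab text), chaining isnull() with sum() gives a quick count of missing values per column:

print(data.isnull().sum())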


Assigning data:

Going the other way, assigning data to a DataFrame is easy.


Adding a new column to the DataFrame with a constant value:

data['New Column'] = 'dummy_values'


print(data['New Column'])
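
A column can also be assigned from any iterable of the right length; for example (an illustrative sketch with a made-up column name), a reverse index:

data['index_backwards'] = range(len(data), 0, -1)
print(data['index_backwards'])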

6. Functions and maps

To see a list of unique values in a column, we can use the unique() method:

print(data.Country.unique())
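
The section title also mentions maps; as a brief sketch, the map() method applies a function to every value of a Series, here upper-casing the country names (assuming the Country column has no missing values):

print(data['Country'].map(lambda c: c.upper()).head())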

More functions are available in pandas (here pd means the pandas library, while df is the
name of our DataFrame, like the name 'data' we used above; it can be any name that we assign
to our data):


df.describe() | Summary statistics for numerical columns
df.mean() | Returns the mean of all columns
df.corr() | Returns the correlation between columns in a DataFrame
df.count() | Returns the number of non-null values in each DataFrame column
df.max() | Returns the highest value in each column
df.min() | Returns the lowest value in each column
df.median() | Returns the median of each column
df.std() | Returns the standard deviation of each column
pd.to_datetime() | Converts a column to datetime format

Important: Writing to files
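
A minimal sketch of writing a DataFrame back to a CSV file with to_csv (the output filename here is just an example, not from the lab):

data.to_csv('covid19_output.csv', index=False)   # index=False leaves out the row-index column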

Combining:

Concatenate:
The concat function is used to concatenate two or more DataFrames along a particular axis (either rows or
columns).

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate along rows (axis=0)
result = pd.concat([df1, df2], axis=0)
print(result)

Output
A B
0 1 3
1 2 4
0 5 7
1 6 8
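
Note that the row labels 0 and 1 repeat in the result above; pass ignore_index=True if a fresh index is wanted instead:

result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)   # rows re-labelled 0, 1, 2, 3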


# Concatenate along columns (axis=1)
result2 = pd.concat([df1, df2], axis=1)
print(result2)

Output
A B A B
0 1 3 5 7
1 2 4 6 8

Merge:
The merge function is used to merge two DataFrames based on a common column or index.

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})

# Merge based on a common column 'key'
result = pd.merge(df1, df2, on='key')
print(result)

Output
key value_x value_y
0 A 1 3
1 B 2 4
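
By default merge performs an inner join on the key column; the how parameter ('left', 'right', 'outer' or 'inner') controls which keys are kept. A quick sketch:

result_outer = pd.merge(df1, df2, on='key', how='outer')
print(result_outer)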


Practice Task 1

Create a DataFrame fruits that looks like this:


Bananas Apples
0 20 17

Practice Task 2

Create a DataFrame sales that looks like below:


Bananas Apples
September 200 120
October 165 90

Practice Task 3

Create a series marks that looks like:

7.1 Practice Task 4

Read the following CSV dataset of Groceries given to you into a DataFrame called grocer.

7.2 Practice Task 5


Select the item description column from grocer and assign the result to the variable item_desc.

7.3 Practice Task 6

Select the first value from the member_number column of grocer and assign it to variable
first_member.

7.4 Practice Task 7


Select the first 20 rows of data and save them to the variable first_rows.


7.5 Practice Task 8


Create a variable selected_grocer containing the values of all columns, but only for purchases made
in 2015.

7.6 Practice Task 9

Create a variable containing the member_number and date columns of the first 500 records.

7.7 Practice Task 10

Create a DataFrame containing purchases made in both 2014 and 2015.
