Lab-3 Pandas Library
Introduction to Pandas
1. Objective
In the previous lab, we looked in detail at NumPy, which provides efficient storage and
manipulation of dense typed arrays in Python. Here we’ll build on this knowledge by looking
in detail at the data structures provided by the Pandas library. Pandas is a package built on
top of NumPy that provides efficient implementations of the Series and DataFrame structures.
In this lab, we will focus on the mechanics of using Series, DataFrame, and related
structures effectively.
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
import pandas as pd
2.2 Series
❖ A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Series wraps both a sequence of values and a sequence of indices, which we can access with the
values and index attributes.
print(data.values)  # output: [0.25 0.5  0.75 1.  ]
print(data.index)   # output: RangeIndex(start=0, stop=4, step=1)
As with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:
print(data[1])  # output: 0.5
When the index consists of string labels (for example 'a' to 'd') instead of integers,
item access works as expected: data['b'] = ?
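The item access data['b'] above presumes a Series whose index holds string labels. A minimal sketch, assuming the same four values as before and the labels 'a' through 'd' (the labels and the variable name labeled are assumptions, since the lab text does not show them):

```python
import pandas as pd

# Assumed labels 'a'..'d'; the lab text does not show the original index.
labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

print(labeled.index)   # the string labels
print(labeled.values)  # the underlying NumPy array of values
print(labeled['b'])    # look up a value by its label
```

Running this shows which value is paired with the label 'b'.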
2.3 DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a
dict of Series.
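To make the "dict of Series" picture concrete, here is a small sketch. All names and values (count, label, flag, r1 to r3) are invented purely for illustration:

```python
import pandas as pd

# Three hypothetical Series sharing the same row labels, each with its own value type.
rows = ['r1', 'r2', 'r3']
count = pd.Series([10, 20, 30], index=rows)           # numeric
label = pd.Series(['low', 'mid', 'high'], index=rows)  # string
flag = pd.Series([True, False, True], index=rows)      # Boolean

# Each dict key becomes a column name; the shared Series index becomes the row index.
table = pd.DataFrame({'count': count, 'label': label, 'flag': flag})

print(table)
print(table.index)    # row index
print(table.columns)  # column index
print(table.dtypes)   # each column keeps its own value type
```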
Data can be stored in any of a number of different forms and formats. By far the most
basic of these is the humble CSV file. A CSV file is a table of values separated by commas,
hence the name: "Comma-Separated Values", or CSV.
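A CSV file is typically loaded into a DataFrame with pd.read_csv. The lab's own dataset file is not named in this text, so the sketch below reads a small made-up CSV from an in-memory string instead. The column names Province/State and Country mirror those used later in this lab; the Confirmed column, every row, and the file name in the comment are assumptions.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the rows are invented for illustration.
csv_text = """Province/State,Country,Confirmed
Anhui,China,1
,Peru,2
,Spain,3
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# With a real file on disk the call would look like (hypothetical file name):
# data = pd.read_csv("your_dataset.csv")
```

Empty fields, such as the missing Province/State entries here, are read as NaN.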
We can use the shape attribute to check how large the resulting DataFrame is:
print(data.shape)
We can examine the contents of the resultant DataFrame using the head() command,
which grabs the first five rows:
print(data.head())
A single column can be accessed as an attribute of the DataFrame:
print(data.Country)
Moreover, we can access a specific value within a column by adding its row label:
print(data['Country'][2])
4. Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest
of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas
has its own accessor operators, loc and iloc. For more advanced operations, these are the ones
you're supposed to be using:
To select the first row of data in a DataFrame, we may use the following:
print(data.iloc[0])
To select the first column instead, we use : for the rows and 0 for the column:
print(data.iloc[:,0])
To select the first column from just the first, second, and third rows, we would do:
print(data.iloc[:3,0])
Finally, it's worth knowing that negative numbers can be used in selection. They count
backwards from the end of the values. So, for example, here are the last five rows of the dataset:
print(data.iloc[-5:])
The second paradigm for attribute selection is the one followed by the loc operator:
label-based selection. In this paradigm, it's the data index value, not its position, which matters.
loc is label-based, which means that you specify rows and columns by their row and column
labels. iloc is integer position-based, so you specify rows and columns by their integer
position values (0-based). Since your dataset usually has meaningful indices, it's usually
easier to do things using loc instead. For example, here's one operation that's much easier
using loc.
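As a sketch of such an operation, the example below picks rows and columns by name with loc and then repeats the selection by position with iloc. The DataFrame, its row labels, and the Confirmed column are hypothetical and exist only to show the contrast.

```python
import pandas as pd

# Hypothetical data with a meaningful (label) index.
df = pd.DataFrame(
    {'Country': ['China', 'Peru', 'Spain'], 'Confirmed': [1, 2, 3]},
    index=['row_a', 'row_b', 'row_c'],
)

# Label-based: rows 'row_a' through 'row_b' (both ends included), columns by name.
by_label = df.loc['row_a':'row_b', ['Country', 'Confirmed']]

# Position-based equivalent: the end position 2 is excluded.
by_position = df.iloc[0:2, [0, 1]]

print(by_label)
print(by_label.equals(by_position))
```

Note the difference in slicing: loc includes the end label, while iloc excludes the end position, as ordinary Python slicing does.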
5. Conditional Selection
So far we've been indexing various strides of data, using structural properties of the DataFrame
itself. To do interesting things with the data, however, we often need to ask questions based on
conditions.
For example, suppose that we're interested specifically in cases from Anhui province:
print(data.loc[data['Province/State'] == 'Anhui'])
We can use the OR operator (|) to combine two conditions; for AND, use &. Each condition
must be wrapped in its own parentheses.
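A minimal sketch of combining conditions, using a small made-up DataFrame (the rows are assumptions; only the column names come from this lab):

```python
import pandas as pd

# Hypothetical rows for illustration.
df = pd.DataFrame({
    'Province/State': ['Anhui', 'Hubei', None, None],
    'Country': ['China', 'China', 'Peru', 'Spain'],
})

# OR: keep rows that satisfy either condition.
either = df.loc[(df['Province/State'] == 'Anhui') | (df['Country'] == 'Peru')]

# AND: keep rows that satisfy both conditions.
both = df.loc[(df['Country'] == 'China') & (df['Province/State'] == 'Anhui')]

print(either)
print(both)
```

The parentheses matter because | and & bind more tightly than == in Python.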
Pandas comes with a few built-in conditional selectors, two of which we will highlight here.
The first is isin, which lets you select data whose value "is in" a list of values. For example,
here is how to select cases in Peru or Spain:
print(data.loc[(data.Country.isin(['Peru','Spain']))])
The second is isnull or isna (and its companion notnull). These methods let you highlight
values which are (or are not) empty (NaN).
print(data.loc[(data.Country.notnull())])
print(data.loc[(data['Province/State'].notnull())])
Assigning data:
print(data.Country.unique())
More functions are available in pandas (here pd refers to the pandas library, while df is
the name of our DataFrame, like the name 'data' used above; it can be any name that we
assign to our data).
Important:
Writing to files
Combining:
Concatenate:
The concat function is used to concatenate two or more DataFrames along a particular axis (either rows or
columns).
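The code for this example is not shown in the lab text. The sketch below is one way to produce the outputs printed in this section, assuming two small DataFrames named df1 and df2 (hypothetical names, with values inferred from those outputs):

```python
import pandas as pd

# Two frames with the same columns; values inferred from the printed outputs.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# axis=0 (the default) stacks the rows; the original row labels are kept.
result1 = pd.concat([df1, df2])

# axis=1 places the frames side by side, aligned on the row index.
result2 = pd.concat([df1, df2], axis=1)

print(result1)
```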
Output
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
print(result2)
Output
   A  B  A  B
0  1  3  5  7
1  2  4  6  8
Merge:
The merge function is used to merge two DataFrames based on a common column or index.
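The code for this example is also missing from the lab text. A minimal sketch that matches the output shown for this section, assuming two DataFrames df1 and df2 that share a key column (hypothetical names, with values inferred from that output):

```python
import pandas as pd

# Both frames share the 'key' column and each has its own 'value' column.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})

# Rows are matched on 'key'. Other columns with the same name in both
# frames receive the default suffixes _x (left) and _y (right).
result = pd.merge(df1, df2, on='key')

print(result)
```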
Output
  key  value_x  value_y
0   A        1        3
1   B        2        4
Practice Task 1
Read the Groceries CSV dataset given to you into a DataFrame called grocer.
Practice Task 2
Select the first value from the member_number column of grocer and assign it to the variable
first_member.
Practice Task 3
Create a variable containing the member_number and date columns of the first 500 records.