Unit-4 Introduction To Pandas
Pandas is a powerful and open-source Python library. The Pandas library is used
for data manipulation and analysis. Pandas consists of data structures and
functions to perform efficient operations on data.
What is the Pandas Library in Python?
Pandas is a powerful and versatile library that simplifies the tasks of data
manipulation in Python. Pandas is well-suited for working with tabular data,
such as spreadsheets or SQL tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers
working with structured data in Python.
Did you know?
The name Pandas is derived from “panel data” and is also referred to as “Python Data
Analysis“.
What is Python Pandas used for?
The Pandas library is generally used for data science, but have you wondered why? This is
because Pandas works in conjunction with the other libraries that are commonly used for
data science.
It is built on top of the NumPy library which means that a lot of the structures of NumPy
are used or replicated in Pandas.
The data produced by Pandas is often used as input for plotting functions in Matplotlib,
statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
You must be wondering why you should use the Pandas library. Python’s Pandas library
is the best tool to analyze, clean, and manipulate data.
Here is a list of things that we can do using Pandas.
Data set cleaning, merging, and joining.
Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
Columns can be inserted and deleted from DataFrame and higher-dimensional
objects.
Powerful group by functionality for performing split-apply-combine operations
on data sets.
Data Visualization.
Getting Started with Pandas
Let’s see how to start working with the Python Pandas library:
Installing Pandas
The first step in working with Pandas is to check whether it is installed on the system.
If not, we need to install it using the pip command.
Follow these steps to install Pandas:
Step 1: Type ‘cmd’ in the search box and open it.
Step 2: Use the cd command to navigate to the folder where Python and pip are
installed.
Step 3: After locating it, type the command:
pip install pandas
Importing Pandas
After Pandas has been installed on the system, you need to import the library. This
module is generally imported as follows:
import pandas as pd
Note: Here, pd is an alias for Pandas. However, it is not necessary to import the
library using an alias; it just helps in writing less code every time a method or
property is called.
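As a quick sanity check that the installation and the import both work, you can print the installed version (a minimal sketch):
import pandas as pd
# print the version of the installed Pandas library
print(pd.__version__)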
Data Structures in Pandas Library
Pandas generally provides two data structures for manipulating data. They are:
Series
DataFrame
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called indexes.
The Pandas Series is nothing but a column in an Excel sheet. Labels need not be unique but
must be of a hashable type.
The object supports both integer and label-based indexing and provides a host of methods
for performing operations involving the index.
Creating a Series
Pandas Series is created by loading the datasets from existing storage (which can be a SQL
database, a CSV file, or an Excel file).
Pandas Series can be created from lists, dictionaries, scalar values, etc.
Example: Creating a series using the Pandas Library.
import pandas as pd
import numpy as np
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)
Output
Pandas Series:
0 g
1 e
2 e
3 k
4 s
dtype: object
For more information, refer to Creating a Pandas Series
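As a small, hedged illustration of the other creation paths mentioned above, a Series can also be built from a dictionary or from a scalar value (the labels and values here are only examples, not from the original text):
import pandas as pd
# Series from a dictionary: the keys become the index labels
marks = pd.Series({'maths': 90, 'science': 85, 'english': 78})
print(marks)
# Series from a scalar: the value is repeated for every index label
ser_scalar = pd.Series(5, index=['a', 'b', 'c'])
print(ser_scalar)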
Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and
columns).
Creating DataFrame
Pandas DataFrame is created by loading the datasets from existing storage (which can be a
SQL database, a CSV file, or an Excel file).
Pandas DataFrame can be created from lists, dictionaries, a list of dictionaries, etc.
Example: Creating a DataFrame Using the Pandas Library
import pandas as pd
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# calling the DataFrame constructor on the list
df = pd.DataFrame(lst)
print(df)
Output:
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
Selecting a single column or a single row from a DataFrame returns a Pandas Series, as shown in the sketch below.
DataFrame.get() : Gets an item from an object for a given key (e.g. a DataFrame column).
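A minimal sketch of these selections, assuming a small illustrative DataFrame (the column names, row labels, and values are not from the original example):
import pandas as pd
data = pd.DataFrame({'Name': ['Avery', 'Blake', 'Casey'],
                     'Age': [25, 30, 35]},
                    index=['row1', 'row2', 'row3'])
first = data['Age']       # selecting one column returns a Series
row2 = data.loc['row2']   # selecting one row by label also returns a Series
print(first)
print(row2)
print(data.get('Age'))    # DataFrame.get() fetches a column by key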
The following examples assume a dataset loaded into a DataFrame df with the columns shown below:
print(df.columns)
Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members', 'state_pop'],
dtype='object')
To make a column the index, we use the set_index() function of Pandas. If we want to make
one column the index, we can simply pass the name of the column as a string to set_index(). If
we want to do multi-indexing or hierarchical indexing, we pass a list of column names to
set_index().
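A minimal sketch of set_index(), assuming the df with the columns listed above (the variable name df_ind3 mirrors the one used below):
# make a single column the index
df_ind = df.set_index('state')
# hierarchical (multi-level) indexing: pass a list of column names
df_ind3 = df.set_index(['region', 'state'])
print(df_ind3.head(10))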
We cannot use only the level 1 (inner) index to get data from the DataFrame; doing so gives
an error. The inner indexes can only be used together with the level 0 (outer, main) index,
with the help of a list of tuples, as sketched below.
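A hedged sketch of selecting rows from the hierarchical index with a list of (level 0, level 1) tuples; the region and state labels used here are placeholders, not values from the original dataset:
# outer (level 0) labels can be used directly
df_ind3_region = df_ind3.loc[['RegionA', 'RegionB']]
print(df_ind3_region.head(10))
# inner (level 1) labels must be paired with the outer label as tuples
df_ind3_state = df_ind3.loc[[('RegionA', 'StateX'), ('RegionB', 'StateY')]]
print(df_ind3_state.head(10))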
# import module
import pandas as pd
# sample marks dataset (hypothetical values; the original data is not shown)
df = pd.DataFrame({'Maths': [8, 5, 8, 9],
                   'Science': [7, 9, 7, 6]})
# display dataset
print(df)
Output:
   Maths  Science
0      8        7
1      5        9
2      8        7
3      9        6
Aggregation in Pandas
Aggregation in Pandas provides various functions that perform a mathematical or logical
operation on our dataset and return a summary of the result. Aggregation can be used to
get a summary of the columns in our dataset, such as the sum, minimum, or maximum of a
particular column. The function used for aggregation is agg(); its parameter is the
function we want to apply.
Some functions used in aggregation are:
Function : Description
sum() : Compute sum of column values
min() : Compute min of column values
max() : Compute max of column values
mean() : Compute mean of column
size() : Compute column sizes
describe() : Generates descriptive statistics
first() : Compute first of group values
last() : Compute last of group values
count() : Compute count of column values
std() : Standard deviation of column
var() : Compute variance of column
sem() : Standard error of the mean of column
Examples:
The sum() function is used to calculate the sum of every value.
df.sum()
Output:
The describe() function generates descriptive statistics of the dataset.
df.describe()
Output:
We can use the agg() function to calculate the sum, min, and max of each column in our
dataset.
df.agg(['sum', 'min', 'max'])
Output:
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows the
split-apply-combine strategy:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the “Maths” column. It returns a DataFrameGroupBy object as the result.
df.groupby(by=['Maths'])
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function to group the data on the “Maths” column. To view the result
of the formed groups, use the first() function.
a = df.groupby('Maths')
a.first()
Output:
First we group based on “Maths”; then, within each group, we group based on “Science”.
b = df.groupby(['Maths', 'Science'])
b.first()
Output:
Implementation on a Dataset
Here we are using a dataset of diamond information.
# import modules
import numpy as np
import pandas as pd
# load the diamonds dataset (file name assumed; the original source is not shown)
dataset = pd.read_csv('diamonds.csv')
print(dataset.head())
Output:
We group by using cut and get the sum of all columns.
dataset.groupby('cut').sum()
Output:
Here we are grouping by cut and color and getting the minimum value of every other
column for each group.
dataset.groupby(['cut', 'color']).agg('min')
Output:
Here we are grouping by color and getting aggregate values such as sum, mean, min and
prod for the price column.
# assumed definition of the aggregations applied to the price column
agg_functions = {'price': ['sum', 'mean', 'min', 'prod']}
dataset.groupby(['color']).agg(agg_functions)
Output:
We can see that in the prod (product, i.e. multiplication) column all values are inf; inf appears
when a numerical calculation exceeds the largest value a floating-point number can represent,
so the product of so many prices overflows to infinity.
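A tiny illustration of that overflow behaviour: once a float64 result exceeds the largest representable value (roughly 1.8e308), it becomes inf.
import numpy as np
x = np.float64(1e308)
# the true result 1e309 cannot be represented as a float64, so it overflows to inf
print(x * 10)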
Python | Pandas.pivot_table()
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All') creates a
spreadsheet-style pivot table as a DataFrame.
Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the
index and columns of the result DataFrame.
Parameters:
data : DataFrame
values : column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
aggfunc: function, list of functions, dict, default numpy.mean
-> If list of functions passed, the resulting pivot table will have hierarchical columns whose
top level are the function names.
-> If dict is passed, the key is column to aggregate and value is function or list of functions
fill_value[scalar, default None] : Value to replace missing values with
margins[boolean, default False] : Add all row / columns (e.g. for subtotal / grand totals)
dropna[boolean, default True] : Do not include columns whose entries are all NaN
margins_name[string, default ‘All’] : Name of the row / column that will contain the totals
when margins is True.
Returns: DataFrame
Code:
# importing pandas as pd
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]})
df
# Simplest pivot table must have a dataframe
# and an index/list of index.
table = pd.pivot_table(df, index =['A', 'B'])
table
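As a further, purely illustrative sketch of the values, aggfunc and fill_value parameters described above (building on the same df; the choice of aggregations is an assumption, not part of the original example):
# aggregate column 'C' per value of 'B', computing both min and max
table2 = pd.pivot_table(df, values='C', index=['B'],
                        aggfunc=['min', 'max'], fill_value=0)
table2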
String Operations –
numpy.char.lower() : This function returns the lowercase version of the given string(s).
# Python program explaining
# numpy.char.lower() function
import numpy as np
# converting an array of strings to lowercase
print(np.char.lower(['GEEKS', 'FOR']))
# converting a single string to lowercase
print(np.char.lower('GEEKS'))
Output :
['geeks' 'for']
geeks
numpy.char.split() : This function returns a list of strings after breaking the given string by the
specified separator.
# Python program explaining
# numpy.char.split() function
import numpy as np
# splitting a string
print(np.char.split('geeks for geeks'))
# splitting a string
print(np.char.split('geeks, for, geeks', sep = ','))
Output :
['geeks', 'for', 'geeks']
['geeks', ' for', ' geeks']
numpy.char.join() : This function returns a string in which the elements of the sequence have
been joined by the given separator.
# Python program explaining
# numpy.char.join() function
import numpy as np
# joining the characters of a string with '-'
print(np.char.join('-', 'geeks'))
# joining elementwise with different separators
print(np.char.join(['-', ':'], ['geeks', 'for']))
Output :
g-e-e-k-s
['g-e-e-k-s' 'f:o:r']
FUNCTION : DESCRIPTION
numpy.char.strip() : For each element, returns a copy with the leading and trailing characters (whitespace by default) removed.
numpy.char.capitalize() : Converts the first character of a string to a capital (uppercase) letter. If the string already starts with a capital, it returns the original string.
numpy.char.center() : Creates and returns a new string of the given width, centered and padded with the specified character.
numpy.char.rjust() : For each element, returns a right-justified copy in a string of the given width.
numpy.char.rstrip() : For each element, returns a copy with the trailing characters removed.
numpy.char.rsplit() : For each element, returns a list of the words in the string, splitting from the right and using sep as the delimiter string.
numpy.char.upper() : Returns the uppercased string from the given string. It converts all lowercase characters to uppercase; if no lowercase characters exist, it returns the original string.
String Information –
numpy.char.count() : This function returns the number of occurrences of a substring in the given
string.
# Python program explaining
# numpy.char.count() function
import numpy as np
# example array of strings (assumed; also used by the examples below)
a = np.array(['geeks', 'for', 'geeks'])
# counting occurrences of substrings
print(np.char.count(a, 'geek'))
print(np.char.count(a, 'fo'))
Output :
[1 0 1]
[0 1 0]
numpy.char.rfind() : This function returns the highest index of the substring if it is found in the
given string. If not found, it returns -1.
# Python program explaining
# numpy.char.rfind() function
import numpy as np
# finding the highest index of a substring
print(np.char.rfind(a, 'geek'))
print(np.char.rfind(a, 'fo'))
Output :
[ 0 -1  0]
[-1  0 -1]
numpy.char.isnumeric() : This function returns “True” if all characters in the string are numeric
characters; otherwise, it returns “False”.
# Python program explaining
# numpy.char.isnumeric() function
import numpy as np
# checking whether the strings are numeric
print(np.char.isnumeric('geeks'))
print(np.char.isnumeric('12geeks'))
Output :
False
False
FUNCTION : DESCRIPTION
numpy.char.find() : Returns the lowest index of the substring if it is found in the given string; if it is not found, it returns -1.
numpy.char.isalpha() : Returns “True” if all characters in the string are alphabetic; otherwise, it returns “False”.
numpy.char.isdecimal() : Returns true if all characters in the string are decimal; if not, it returns false.
numpy.char.isdigit() : Returns “True” if all characters in the string are digits; otherwise, it returns “False”.
numpy.char.islower() : Returns “True” if all characters in the string are lowercase; otherwise, it returns “False”.
numpy.char.isspace() : Returns true for each element if there are only whitespace characters in the string and it has at least one character; false otherwise.
numpy.char.istitle() : Returns true for each element if the element is a titlecased string and there is at least one character; false otherwise.
numpy.char.isupper() : Returns true for each element if all cased characters in the string are uppercase and there is at least one character; false otherwise.
numpy.char.rindex() : Returns the highest index of the substring inside the string if the substring is found; otherwise it raises an exception.
numpy.char.startswith() : Returns True if a string starts with the given prefix, otherwise returns False.
String Comparison –
numpy.char.equal() : This function checks for string1 == string2 elementwise (the example strings below are assumed).
# Python program explaining
# numpy.char.equal() function
import numpy as np
# comparing two different strings (example strings assumed)
a = np.char.equal('geeks', 'for')
print(a)
Output :
False
import numpy as np
# comparing two identical strings (example strings assumed)
a = np.char.equal('geeks', 'geeks')
print(a)
Output :
True
numpy.char.greater() : This function checks whether string1 is greater than string2, elementwise.
# Python program explaining
# numpy.char.greater() function
import numpy as np
# checking whether one string is greater than another (example strings assumed)
a = np.char.greater('geeks', 'for')
print(a)
Output :
True
Time Series Analysis with Pandas
Although time series support is also available in other libraries such as Scikit-learn, data
science professionals use the Pandas library as it has more features for working with a
DateTime series. We can include the date and time for every record and can fetch the
records of a DataFrame by time.
We can find the data within a certain range of dates and times by using the DateTime
features of the Pandas library.
Let's discuss some major objectives of time series analysis using the Pandas library.
Objectives of Time Series Analysis
Create a series of dates
Work with timestamped data
Convert string data to timestamps
Slice data using timestamps
Resample the time series for different time-period aggregates/summary statistics (a small sketch follows this list)
Work with missing data
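A minimal sketch of resampling, assuming a DataFrame df indexed by a DatetimeIndex (as built later in this section); the hourly frequency and the mean aggregation are only examples:
# downsample minute-level data to hourly means
hourly_mean = df.resample('H').mean()
print(hourly_mean.head())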
Now, let’s do some practical analysis of some data to demonstrate the use of Pandas’ time
series.
Create DateTime Values with Pandas
To create a DateTime series using Pandas, we need the DateTime module and then we can
create a DateTime range with the date_range method.
Example
import pandas as pd
from datetime import datetime
import numpy as np
# create a range of one-minute timestamps from 1/1/2019 to 8/1/2019
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')
print(range_date)
Output
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:01:00',
'2019-01-01 00:02:00', '2019-01-01 00:03:00',
'2019-01-01 00:04:00', '2019-01-01 00:05:00',
'2019-01-01 00:06:00', '2019-01-01 00:07:00',
'2019-01-01 00:08:00', '2019-01-01 00:09:00',
...
'2019-01-07 23:51:00', '2019-01-07 23:52:00',
'2019-01-07 23:53:00', '2019-01-07 23:54:00',
'2019-01-07 23:55:00', '2019-01-07 23:56:00',
'2019-01-07 23:57:00', '2019-01-07 23:58:00',
'2019-01-07 23:59:00', '2019-01-08 00:00:00'],
dtype='datetime64[ns]', length=10081, freq='T')
Explanation:
Here, we have created minute-based timestamps for the date range from 1/1/2019 to
8/1/2019.
We can vary the frequency from minutes to hours or seconds.
This helps to track records of data stored per minute. As we can see in the output, the
length of the DatetimeIndex is 10081.
Note: Remember that Pandas uses the datetime64[ns] data type.
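For instance, switching the frequency from minutes to hours gives a much shorter index; a small sketch (the same start and end dates are assumed):
# hourly timestamps over the same period
hourly = pd.date_range(start='1/1/2019', end='1/08/2019', freq='H')
print(len(hourly))  # 169 timestamps: 7 days * 24 hours + 1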
Determine the Data Type of an Element in the DateTime Range
To determine the type of an element in the DateTime range, we use indexing to fetch the
element and then use the type function to know its data type.
import pandas as pd
from datetime import datetime
import numpy as np
# create the same one-minute range and check the type of one element
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')
print(type(range_date[0]))
Output
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Explanation:
We are checking the type of an element of our object named range_date.
Create DataFrame with DateTime Index
To create a DataFrame with a DateTime index, we first need to create a DateTime range and
then pass it to pandas.DataFrame method.
import pandas as pd
from datetime import datetime
import numpy as np
# build a DataFrame with a 'date' column of the timestamps and a column of random data
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')
df = pd.DataFrame(range_date, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(range_date)))
print(df.head(10))
Output
date data
0 2019-01-01 00:00:00 49
1 2019-01-01 00:01:00 58
2 2019-01-01 00:02:00 48
3 2019-01-01 00:03:00 96
4 2019-01-01 00:04:00 42
5 2019-01-01 00:05:00 8
6 2019-01-01 00:06:00 20
7 2019-01-01 00:07:00 96
8 2019-01-01 00:08:00 48
9 2019-01-01 00:09:00 78
Explanation:
We first created a time series then converted this data into DataFrame and used the random
function to generate the random data and map over the dataframe. Then to check the result we
use the print function.
To do time series manipulation, we need to have a DateTime index so that DataFrame is
indexed on the timestamp. Here, we are adding one more new column in the Pandas
DataFrame.
Convert DateTime elements to String format
The below example demonstrates how we can convert the DateTime elements of DateTime
object to string format.
import pandas as pd
from datetime import datetime
import numpy as np
# convert each timestamp in the range to a string
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')
string_data = [str(x) for x in range_date]
print(string_data[1:11])
Output:
['2019-01-01 00:01:00', '2019-01-01 00:02:00', '2019-01-01 00:03:00', '2019-01-01 00:04:00',
'2019-01-01 00:05:00', '2019-01-01 00:06:00', '2019-01-01 00:07:00', '2019-01-01 00:08:00',
'2019-01-01 00:09:00', '2019-01-01 00:10:00']
Explanation:
This code takes the elements of range_date and converts them to strings; because there is a
lot of data, we slice the list and print ten values of string_data.
By looping over the list, we get all the values that are in the range range_date.
When we are using date_range we always have to specify the start and end date.
Accessing Specific DateTime Element
The below example demonstrates how we access specific DateTime element of DateTime
object.
import pandas as pd
from datetime import datetime
import numpy as np
# recreate the DataFrame with a 'date' column and random data
range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')
df = pd.DataFrame(range_date, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(range_date)))
# use the timestamps as the index, then select one day by partial string
df['datetime'] = pd.to_datetime(df['date'])
df = df.set_index('datetime')
df.drop(['date'], axis=1, inplace=True)
print(df.loc['2019-01-05'][1:11])
Output
data
datetime
2019-01-05 00:01:00 99
2019-01-05 00:02:00 21
2019-01-05 00:03:00 29
2019-01-05 00:04:00 98
2019-01-05 00:05:00 0
2019-01-05 00:06:00 72
2019-01-05 00:07:00 69
2019-01-05 00:08:00 53
2019-01-05 00:09:00 3
2019-01-05 00:10:00 37
Python | Pandas DataFrame.eval()
The eval() method evaluates a string expression over the columns of a DataFrame.
# importing pandas as pd
import pandas as pd
# sample dataframe with numeric columns A, B and C (hypothetical values)
df = pd.DataFrame({'A': [2, 4, 6], 'B': [1, 3, 5], 'C': [10, 20, 30]})
Let’s evaluate the sum over all the columns and add the resultant column to the dataframe.
# To evaluate the sum over all the columns
df.eval('D = A + B + C', inplace=True)
print(df)
Output :
Example #2: Use the eval() function to evaluate the sum of the column elements in the
dataframe and insert the resulting column into the dataframe. The dataframe has a NaN value.
Note : No expression can be evaluated over NaN values, so the corresponding result cells will
be NaN too.
# importing pandas as pd and numpy as np
import pandas as pd
import numpy as np
# sample dataframe with a NaN in the last row of column 'C' (hypothetical values)
df = pd.DataFrame({'A': [2, 4, 6], 'B': [1, 3, 5], 'C': [10, 20, np.nan]})
df.eval('D = A + B + C', inplace=True)
print(df)
Output :
Notice, the resulting column ‘D’ has NaN value in the last row as the corresponding cell used
in evaluation was a NaN cell.
Filtering a DataFrame with query()
The query() method filters the rows of a DataFrame using a boolean expression written as a
string.
Example 1: Single condition filtering. As shown in the output, the data now only has rows
where Senior Management is True. A small sketch follows.
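A small sketch of this single-condition filter, assuming the data comes from an employees CSV file with a 'Senior Management' column (the file name is an assumption):
import pandas as pd
# hypothetical source file
data = pd.read_csv("employees.csv")
# backticks let query() refer to a column name that contains a space
data = data.query('`Senior Management` == True')
print(data.head())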
Example 2: Multiple conditions filtering. In this example, the DataFrame has been filtered on
multiple conditions. Before applying the query() method, the spaces in the column names have
been replaced with '_'.
As shown in the output, only two rows are returned after both conditions are applied. A small
sketch follows.
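A small sketch of the multiple-condition filter; the second column name and the threshold are purely illustrative, and the spaces in the column names are replaced with '_' before querying, as described above:
import pandas as pd
data = pd.read_csv("employees.csv")  # hypothetical source file
# replace spaces in column names with '_' so query() can reference them
data.columns = [col.replace(' ', '_') for col in data.columns]
# multiple conditions combined with 'and' (illustrative column name and value)
filtered = data.query('Senior_Management == True and Salary > 100000')
print(filtered.head())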