Unit-4: Introduction To Pandas

Introduction to Pandas

Pandas is a powerful, open-source Python library used for data manipulation
and analysis. It provides data structures and functions for performing
efficient operations on data.
What is the Pandas Library in Python?
Pandas is a powerful and versatile library that simplifies the tasks of data
manipulation in Python. Pandas is well-suited for working with tabular data,
such as spreadsheets or SQL tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers
working with structured data in Python.
Did you know?
The name Pandas is derived from “panel data”, and it is also referred to as “Python Data
Analysis“.
What is Python Pandas used for?
The Pandas library is generally used for data science, but have you ever wondered why? This is
because Pandas works in conjunction with the other libraries that make up the data-science
ecosystem.
It is built on top of the NumPy library which means that a lot of the structures of NumPy
are used or replicated in Pandas.
The data produced by Pandas is often used as input for plotting functions in Matplotlib,
statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
You must be wondering: why should you use the Pandas library? Python’s Pandas library
is one of the best tools to analyze, clean, and manipulate data.
Here is a list of things that we can do using Pandas.
 Data set cleaning, merging, and joining.
 Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
 Columns can be inserted and deleted from DataFrame and higher-dimensional
objects.
 Powerful group by functionality for performing split-apply-combine operations
on data sets.
 Data Visualization.
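For instance, the missing-data handling mentioned in the list above can be sketched on a tiny, made-up DataFrame (the column names and values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# A tiny frame with one missing score (NaN)
df = pd.DataFrame({"name": ["A", "B", "C"],
                   "score": [10.0, np.nan, 30.0]})

# count missing values in a column
print(df["score"].isna().sum())          # 1

# fill missing values with a default
print(df["score"].fillna(0).tolist())    # [10.0, 0.0, 30.0]

# or drop the incomplete rows entirely
print(df.dropna().shape)                 # (2, 2)
```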
Getting Started with Pandas
Let’s see how to start working with the Python Pandas library:
Installing Pandas
The first step in working with Pandas is to ensure whether it is installed in the system or
not. If not, then we need to install it on our system using the pip command.
Follow these steps to install Pandas:
Step 1: Type ‘cmd’ in the search box and open it.
Step 2: Locate the folder using the cd command where the python-pip file has been
installed.
Step 3: After locating it, type the command:
pip install pandas
For more reference, take a look at an article on installing Pandas.
Importing Pandas
After Pandas has been installed on the system, you need to import the library. This
module is generally imported as follows:
import pandas as pd
Note: Here, pd is an alias for Pandas. It is not necessary to import the library using an
alias; it just helps in writing less code every time a method or property is called.
Data Structures in Pandas Library
Pandas generally provide two data structures for manipulating data. They are:
 Series
 DataFrame
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called indexes.
A Pandas Series is essentially a single column of an Excel sheet. Labels need not be unique but
must be of a hashable type.
The object supports both integer and label-based indexing and provides a host of methods
for performing operations involving the index.

Pandas Series

Creating a Series
Pandas Series is created by loading the datasets from existing storage (which can be a SQL
database, a CSV file, or an Excel file).
Pandas Series can be created from lists, dictionaries, scalar values, etc.
Example: Creating a series using the Pandas Library.

import pandas as pd
import numpy as np

# Creating an empty Series (dtype given explicitly to avoid a warning)
ser = pd.Series(dtype='float64')
print("Pandas Series: ", ser)

# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)

Output
Pandas Series: Series([], dtype: float64)
Pandas Series:
0 g
1 e
2 e
3 k
4 s
dtype: object
For more information, refer to Creating a Pandas Series
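As mentioned above, a Series can also be built from a dictionary or a scalar value. A minimal sketch (the subject names and values are invented for illustration):

```python
import pandas as pd

# From a dictionary: the keys become the index labels
marks = pd.Series({"Maths": 90, "Science": 85})
print(marks["Maths"])        # 90

# From a scalar: the value is repeated for every index label
fives = pd.Series(5, index=["a", "b", "c"])
print(fives.tolist())        # [5, 5, 5]
```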
Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and
columns).
Creating DataFrame
Pandas DataFrame is created by loading the datasets from existing storage (which can be a
SQL database, a CSV file, or an Excel file).
Pandas DataFrame can be created from lists, dictionaries, a list of dictionaries, etc.
Example: Creating a DataFrame Using the Pandas Library

import pandas as pd

# Calling the DataFrame constructor to create an empty DataFrame
df = pd.DataFrame()
print(df)

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']

# Calling the DataFrame constructor on the list
df = pd.DataFrame(lst)
print(df)

Output:
Empty DataFrame
Columns: []
Index: []
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
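As mentioned above, a DataFrame can also be created from a dictionary, where the keys become column names. A short sketch with invented names and ages:

```python
import pandas as pd

# dict of lists: keys become column names
data = {"Name": ["Tom", "Jerry"], "Age": [20, 21]}
df = pd.DataFrame(data)

print(df.shape)           # (2, 2)
print(list(df.columns))   # ['Name', 'Age']
```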

Pandas Indexing using [ ], .loc[], .iloc[ ], .ix[ ]


There are many ways to pull elements, rows, and columns from a
DataFrame. Pandas provides several indexing methods that help in getting an
element from a DataFrame. These indexing methods appear very similar but
behave very differently. Pandas supports four types of multi-axes indexing:
 Dataframe[ ] : this is also known as the indexing operator
 Dataframe.loc[ ] : this function is used for label-based indexing
 Dataframe.iloc[ ] : this function is used for position- or integer-based indexing
 Dataframe.ix[ ] : this function is used for both label- and integer-based indexing (now deprecated)
Collectively, they are called the indexers. These are by far the most common
ways to index data. These four functions help in getting the elements,
rows, and columns from a DataFrame.
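Because the nba.csv examples that follow need a file on disk, here is a self-contained sketch of the non-deprecated indexers on a tiny hand-made DataFrame (the player names and values are illustrative, not real data):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 22],
                   "Team": ["Boston", "Houston"]},
                  index=["Avery Bradley", "R.J. Hunter"])

# [] pulls a column by name
print(df["Age"].tolist())               # [25, 22]

# .loc selects by label
print(df.loc["Avery Bradley", "Team"])  # Boston

# .iloc selects by integer position
print(df.iloc[1, 0])                    # 22
```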

Indexing a Dataframe using indexing operator [] :


The indexing operator refers to the square brackets following an object.
The .loc and .iloc indexers also use the indexing operator to make selections. In
this section, the indexing operator refers to df[].
Selecting a single column
In order to select a single column, we simply put the name of the column in
between the brackets.

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving a column by the indexing operator
first = data["Age"]

print(first)
Output:

Selecting multiple columns


In order to select multiple columns, we pass a list of column names to the
indexing operator.

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving multiple columns by the indexing operator
first = data[["Age", "College", "Salary"]]

first
Output:

Indexing a DataFrame using .loc[ ] :


This function selects data by the label of the rows and columns. The df.loc indexer
selects data in a different way than just the indexing operator. It can select subsets
of rows or columns. It can also simultaneously select subsets of rows and
columns.
Selecting a single row
In order to select a single row using .loc[], we put a single row label in
the .loc function.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving rows by the loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Output:
As shown in the output image, two Series were returned, since a single label
was passed each time.

Selecting multiple rows


In order to select multiple rows, we put all the row labels in a list and pass it
to the .loc function.
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving multiple rows by the loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)

Output:

Selecting two rows and three columns


In order to select two rows and three columns, we put the desired row labels
and column labels in two separate lists, like this:
Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]

import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving two rows and three columns by the loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"],
                 ["Team", "Number", "Position"]]

print(first)

Output:

Selecting all of the rows and some columns


In order to select all of the rows and some columns, we use a single colon [:] to
select all rows, together with a list of the columns we want, like this:
Dataframe.loc[:, ["column1", "column2", "column3"]]

import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving all rows and some columns by the loc method
first = data.loc[:, ["Team", "Number", "Position"]]

print(first)

Output:

Indexing a DataFrame using .iloc[ ] :


This function allows us to retrieve rows and columns by position. In order to do
that, we’ll need to specify the positions of the rows that we want, and the
positions of the columns that we want as well. The df.iloc indexer is very similar
to df.loc but only uses integer locations to make its selections.
Selecting a single row
In order to select a single row using .iloc[], we can pass a single integer
to the .iloc[] function.
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving a row by the iloc method
row2 = data.iloc[3]

print(row2)

Output:

Selecting multiple rows


In order to select multiple rows, we can pass a list of integers to the .iloc[] function.
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving multiple rows by the iloc method
row2 = data.iloc[[3, 5, 7]]

row2
Output:

Selecting two rows and two columns


In order to select two rows and two columns, we create a list of two integers for the rows
and a list of two integers for the columns, then pass them to the .iloc[] function.
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving two rows and two columns by the iloc method
row2 = data.iloc[[3, 4], [1, 2]]

print(row2)

Output:

Selecting all the rows and some columns


In order to select all rows and some columns, we use a single colon [:] to select all
rows, and for the columns we make a list of integers, then pass both to the .iloc[] function.
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving all rows and some columns by the iloc method
row2 = data.iloc[:, [1, 2]]
print(row2)

Output:

Indexing a DataFrame using Dataframe.ix[ ] :


Early in the development of pandas, there existed another indexer, ix. This
indexer was capable of selecting both by label and by integer location. While it
was versatile, it caused lots of confusion because it’s not explicit. Sometimes
integers can also be labels for rows or columns. Thus there were instances where
it was ambiguous. Generally, ix is label-based and acts just like the .loc indexer.
However, .ix also supports integer-type selections (as in .iloc) when passed an
integer; this only works where the index of the DataFrame is not integer-based.
.ix will accept any of the inputs of .loc and .iloc.
Note: The .ix indexer was deprecated and has been removed in recent versions of Pandas; use .loc or .iloc instead.
Selecting a single row using .ix[] as .loc[]
In order to select a single row, we put a single row label in the .ix function. This
function acts like .loc[] if we pass a row label as the argument.
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving a row by the ix method
first = data.ix["Avery Bradley"]

print(first)

Output:

Selecting a single row using .ix[] as .iloc[]


In order to select a single row, we can pass a single integer to the .ix[] function. This
behaves like the .iloc[] function when we pass an integer to .ix[].
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving a row by the ix method
first = data.ix[1]

print(first)
Output:

Methods for indexing in DataFrame


Function : Description
Dataframe.head() : Return the top n rows of a data frame.
Dataframe.tail() : Return the bottom n rows of a data frame.
Dataframe.at[] : Access a single value for a row/column label pair.
Dataframe.iat[] : Access a single value for a row/column pair by integer position.
Dataframe.iloc[] : Purely integer-location based indexing for selection by position.
DataFrame.lookup() : Label-based “fancy indexing” function for DataFrame.
DataFrame.pop() : Return item and drop from frame.
DataFrame.xs() : Returns a cross-section (row(s) or column(s)) from the DataFrame.
DataFrame.get() : Get item from object for given key (DataFrame column, Panel slice, etc.).
DataFrame.isin() : Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
DataFrame.where() : Return an object of same shape as self whose corresponding entries are from self where cond is True and otherwise are from other.
DataFrame.mask() : Return an object of same shape as self whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.query() : Query the columns of a frame with a boolean expression.
DataFrame.insert() : Insert column into DataFrame at specified location.
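A few of the methods in the table above can be sketched on a small, made-up DataFrame (the column names x and y are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

print(df.at[0, "x"])                   # 1: single value by label pair
print(df.iat[2, 1])                    # 30: single value by position
print(df["x"].isin([2, 3]).tolist())   # [False, True, True]
print(df.query("x > 1").shape)         # (2, 2): boolean-expression filter
```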

How to use Hierarchical Indexes with Pandas ?





The index is like an address: it is how any data point across the data frame or series can be
accessed. Rows and columns both have indexes; row indices are called the index, and for
columns, the indices are the column names.
Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more than one column
as the index. In this article, we are going to use the homelessness.csv file.

# importing pandas library as alias pd
import pandas as pd

# calling the pandas read_csv() function
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')

print(df.head())

Output:

In the following data frame, there is no indexing.


Columns in the Dataframe:

# using the pandas columns attribute
col = df.columns
print(col)

Output:
Index([‘Unnamed: 0’, ‘region’, ‘state’, ‘individuals’, ‘family_members’,
‘state_pop’],
dtype=’object’)
To make a column the index, we use the set_index() function of pandas. If we want to make
one column the index, we can simply pass the name of the column as a string to set_index(). If
we want to do multi-indexing or hierarchical indexing, we pass the list of column names to
set_index().

Below Code demonstrates Hierarchical Indexing in pandas:

# using the pandas set_index() function
df_ind3 = df.set_index(['region', 'state', 'individuals'])

# sort_index() returns a new sorted DataFrame, so assign the result
df_ind3 = df_ind3.sort_index()

print(df_ind3.head(10))

Output:

Now the dataframe is using Hierarchical Indexing or multi-indexing.


Note that here we have made 3 columns the index (‘region’, ‘state’, ‘individuals’). The
first index ‘region’ is called the level(0) index, which is at the top of the hierarchy of indexes;
the next index ‘state’ is the level(1) index, which is below the main or level(0) index, and so
on. A hierarchy of indexes is formed, which is why this is called hierarchical indexing.
We may sometimes need to make a column an index, or to convert an index
column back into a normal column; for the latter, there is the pandas reset_index(inplace=True)
function, which makes the index column a normal column again.
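The set_index()/reset_index() round trip described above can be sketched as follows (the region names and populations are invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["West", "East"], "pop": [100, 200]})

df_ind = df.set_index("region")    # 'region' becomes the index
df_back = df_ind.reset_index()     # the index returns to a normal column

print(list(df_ind.index))      # ['West', 'East']
print(list(df_back.columns))   # ['region', 'pop']
```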
Selecting Data in a Hierarchical Index or using the Hierarchical Indexing:
For selecting data from the dataframe using the .loc[] indexer, we pass the names of
the indexes in a list.

# selecting the 'Pacific' and 'Mountain'
# regions from the dataframe
# selecting data using the level(0) or main index
df_ind3_region = df_ind3.loc[['Pacific', 'Mountain']]

print(df_ind3_region.head(10))

Output:
We cannot use only a level(1) index for getting data from the dataframe; if we do so, it will
give an error. We can use a level(1) index, or the other inner indexes, only together with the
level(0) or main index, with the help of a list of tuples.

# using only the inner index 'state' for getting data (this raises an error)
df_ind3_state = df_ind3.loc[['Alaska', 'California', 'Idaho']]

print(df_ind3_state.head(10))

Output:

Using inner levels indexes with the help of a list of tuples:


Syntax:
df.loc[[ ( level( 0 ) , level( 1 ) , level( 2 ) ) ]]

# selecting data by passing indexes for all levels
df_ind3_region_state = df_ind3.loc[[("Pacific", "Alaska", 1434),
                                    ("Pacific", "Hawaii", 4131),
                                    ("Mountain", "Arizona", 7259),
                                    ("Mountain", "Idaho", 1297)]]
df_ind3_region_state
Output:
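Since homelessness.csv may not be at hand, the selection rules above can be recapped on a tiny hand-made frame (the values are borrowed from the snippets above and should be treated as illustrative):

```python
import pandas as pd

df = pd.DataFrame({"region": ["Pacific", "Pacific", "Mountain"],
                   "state": ["Alaska", "Hawaii", "Idaho"],
                   "individuals": [1434, 4131, 1297]})
df3 = df.set_index(["region", "state"]).sort_index()

# a list of level(0) labels works on its own
print(df3.loc[["Pacific"]].shape)   # (2, 1)

# inner levels are reached with full tuples
rows = df3.loc[[("Pacific", "Alaska"), ("Mountain", "Idaho")]]
print(rows.shape)                   # (2, 1)
```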

Grouping and Aggregating with Pandas





In this article, we are going to see grouping and aggregating using pandas. Grouping and
aggregating help us achieve data analysis easily using various functions. These methods
help us group and summarize our data and make complex analysis comparatively
easy.
Creating a sample dataset of marks of various subjects.

# import module
import pandas as pd

# Creating our dataset
df = pd.DataFrame([[9, 4, 8, 9],
                   [8, 10, 7, 6],
                   [7, 6, 8, 5]],
                  columns=['Maths', 'English', 'Science', 'History'])

# display dataset
print(df)

Output:
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be used to
get a summary of columns in our dataset like getting sum, minimum, maximum, etc. from a
particular column of our dataset. The function used for aggregation is agg(), the parameter is
the function we want to perform.
Some functions used in the aggregation are:
Function Description:
 sum() :Compute sum of column values
 min() :Compute min of column values
 max() :Compute max of column values
 mean() :Compute mean of column
 size() :Compute column sizes
 describe() :Generates descriptive statistics
 first() :Compute first of group values
 last() :Compute last of group values
 count() :Compute count of column values
 std() :Standard deviation of column
 var() :Compute variance of column
 sem() :Standard error of the mean of column
Examples:
 The sum() function is used to calculate the sum of every value.

df.sum()

Output:

 The describe() function is used to get a summary of our dataset

df.describe()
Output:

 We used agg() function to calculate the sum, min, and max of each column in our
dataset.

df.agg(['sum', 'min', 'max'])

Output:

Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows the split-apply-
combine strategy:
 Splitting the data into groups based on some criteria.
 Applying a function to each group independently.
 Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the “Maths” column. It returns a DataFrameGroupBy object as the result.

df.groupby(by=['Maths'])

Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function to group the data on the “Maths” value. To view the result
of the formed groups, use the first() function.

a = df.groupby('Maths')
a.first()

Output:
First we group based on “Maths”; then, within each group, we group based on “Science”.

b = df.groupby(['Maths', 'Science'])
b.first()

Output:

Implementation on a Dataset
Here we are using a dataset of diamond information.

# import module
import numpy as np
import pandas as pd

# reading the csv file
dataset = pd.read_csv("diamonds.csv")

# printing first 5 rows
print(dataset.head(5))

Output:
 We group by using cut and get the sum of all columns.

dataset.groupby('cut').sum()

Output:

 Here we are grouping using cut and color and getting the minimum value of all other
columns within each group.

dataset.groupby(['cut', 'color']).agg('min')

Output:
 Here we are grouping using color and getting aggregate values like sum, mean,
min, etc. for the price column.

# dictionary with the column name ('price') as key and
# the list of aggregation functions we want
# to perform on that column as value
agg_functions = {
    'price': ['sum', 'mean', 'median', 'min', 'max', 'prod']
}

dataset.groupby(['color']).agg(agg_functions)

Output:

We can see that in the prod (product, i.e. multiplication) column all values are inf; inf is the
result of a numerical calculation whose magnitude exceeds the floating-point range, i.e. one
that is mathematically treated as infinite.
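The overflow behaviour behind those inf values can be reproduced directly with NumPy (the numbers below are arbitrary, chosen only to exceed the float64 range):

```python
import numpy as np

# multiplying two numbers near the float64 limit overflows to inf
vals = np.array([1e300, 1e300])
product = vals.prod()

print(product)               # inf
print(np.isinf(product))     # True
```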

Python | Pandas.pivot_table()



pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=’mean’,
fill_value=None, margins=False, dropna=True, margins_name=’All’) create a
spreadsheet-style pivot table as a DataFrame.
Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the
index and columns of the result DataFrame.
Parameters:
data : DataFrame
values : column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
aggfunc: function, list of functions, dict, default numpy.mean
-> If list of functions passed, the resulting pivot table will have hierarchical columns whose
top level are the function names.
-> If dict is passed, the key is column to aggregate and value is function or list of functions
fill_value[scalar, default None] : Value to replace missing values with
margins[boolean, default False] : Add all row / columns (e.g. for subtotal / grand totals)
dropna[boolean, default True] : Do not include columns whose entries are all NaN
margins_name[string, default ‘All’] : Name of the row / column that will contain the totals
when margins is True.
Returns: DataFrame
Code:

# Create a simple dataframe

# importing pandas as pd
import pandas as pd
import numpy as np

# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'C': [27, 23, 21, 23, 24]})
df

# The simplest pivot table must have a dataframe
# and an index / list of indexes.
table = pd.pivot_table(df, index=['A', 'B'])
table

# Creates a pivot table dataframe
table = pd.pivot_table(df, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum)
table
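The margins parameter described above adds subtotal rows; here is a minimal, self-contained sketch reusing the B and C columns from the snippet above (the values are the same invented sample data):

```python
import pandas as pd

df = pd.DataFrame({'B': ['Masters', 'Graduate', 'Graduate', 'Masters'],
                   'C': [27, 23, 21, 23]})

# mean of C per degree, plus an 'All' subtotal row from margins=True
table = pd.pivot_table(df, values='C', index='B',
                       aggfunc='mean', margins=True, margins_name='All')
print(table)
```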

Numpy | String Operations


This module is used to perform vectorized string operations for arrays of dtype numpy.string_
or numpy.unicode_. All of them are based on the standard string functions in Python’s built-in
library.
String Operations –
numpy.lower() : This function returns the lowercase string from the given string. It converts
all uppercase characters to lowercase. If no uppercase characters exist, it returns the original
string.
# Python program explaining
# numpy.lower() function

import numpy as np

# converting to lowercase
print(np.char.lower(['GEEKS', 'FOR']))

# converting to lowercase
print(np.char.lower('GEEKS'))
Output :
['geeks' 'for']
geeks

numpy.split() : This function returns a list of strings after breaking the given string by the
specified separator.
# Python program explaining
# numpy.split() function

import numpy as np

# splitting a string
print(np.char.split('geeks for geeks'))

# splitting a string
print(np.char.split('geeks, for, geeks', sep = ','))
Output :
['geeks', 'for', 'geeks']
['geeks', ' for', ' geeks']

numpy.join() : This function is a string method and returns a string in which the elements of
sequence have been joined by str separator.
# Python program explaining
# numpy.join() function

import numpy as np

# joining characters with a separator
print(np.char.join('-', 'geeks'))

# elementwise join with different separators
print(np.char.join(['-', ':'], ['geeks', 'for']))
Output :
g-e-e-k-s
['g-e-e-k-s' 'f:o:r']

FUNCTION : DESCRIPTION
numpy.strip() : For each element in a, return a copy with the leading and trailing characters removed.
numpy.capitalize() : It converts the first character of a string to a capital (uppercase) letter. If the string has its first character as capital, then it returns the original string.
numpy.center() : It creates and returns a new string which is padded with the specified character.
numpy.decode() : It is used to convert from one encoding scheme, in which the argument string is encoded, to the desired encoding scheme.
numpy.encode() : Returns the string in the encoded form.
numpy.ljust() : Return an array with the elements of a left-justified in a string of length width.
numpy.rjust() : Return an array with the elements of a right-justified in a string of length width.
numpy.lstrip() : For each element in a, return a copy with the leading characters removed.
numpy.rstrip() : For each element in a, return a copy with the trailing characters removed.
numpy.partition() : Partition each element in a around sep.
numpy.rpartition() : Partition (split) each element around the right-most separator.
numpy.rsplit() : For each element in a, return a list of the words in the string, using sep as the delimiter string.
numpy.title() : It is used to convert the first character in each word to uppercase and the remaining characters to lowercase in a string, and returns the new string.
numpy.upper() : Returns the uppercased string from the given string. It converts all lowercase characters to uppercase. If no lowercase characters exist, it returns the original string.
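A few of the tabulated functions can be demonstrated quickly (the sample strings are arbitrary):

```python
import numpy as np

a = np.array(['  geeks  ', 'for geeks'])

# strip whitespace from each element
print(np.char.strip(a))                   # ['geeks' 'for geeks']

# title-case and upper-case a string
print(np.char.title('geeks for geeks'))   # Geeks For Geeks
print(np.char.upper('for geeks'))         # FOR GEEKS
```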

String Information –
numpy.count() : This function returns the number of occurrences of a substring in the given
string.
# Python program explaining
# numpy.count() function

import numpy as np

a=np.array(['geeks', 'for', 'geeks'])


# counting a substring
print(np.char.count(a,'geek'))

# counting a substring
print(np.char.count(a, 'fo'))
Output :
[1 0 1]
[0 1 0]

numpy.rfind() : This function returns the highest index of the substring if found in given
string. If not found then it returns -1.
# Python program explaining
# numpy.rfind() function

import numpy as np

a=np.array(['geeks', 'for', 'geeks'])

# counting a substring
print(np.char.rfind(a,'geek'))

# counting a substring
print(np.char.rfind(a, 'fo'))
Output :
[ 0 -1  0]
[-1  0 -1]

numpy.isnumeric() : This function returns “True” if all characters in the string are numeric
characters, Otherwise, It returns “False”.
# Python program explaining
# numpy.isnumeric() function

import numpy as np

# counting a substring
print(np.char.isnumeric('geeks'))

# counting a substring
print(np.char.isnumeric('12geeks'))
Output :
False
False

FUNCTION : DESCRIPTION
numpy.find() : It returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
numpy.index() : It returns the position of the first occurrence of a substring in a string.
numpy.isalpha() : It returns “True” if all characters in the string are alphabets. Otherwise, it returns “False”.
numpy.isdecimal() : It returns true if all characters in a string are decimal. If all characters are not decimal, it returns false.
numpy.isdigit() : It returns “True” if all characters in the string are digits. Otherwise, it returns “False”.
numpy.islower() : It returns “True” if all characters in the string are lowercase. Otherwise, it returns “False”.
numpy.isspace() : Returns true for each element if there are only whitespace characters in the string and there is at least one character; false otherwise.
numpy.istitle() : Returns true for each element if the element is a titlecased string and there is at least one character; false otherwise.
numpy.isupper() : Returns true for each element if all cased characters in the string are uppercase and there is at least one character; false otherwise.
numpy.rindex() : Returns the highest index of the substring inside the string if the substring is found. Otherwise, it raises an exception.
numpy.startswith() : Returns True if a string starts with the given prefix; otherwise returns False.

String Comparison –
numpy.equal(): This function checks for string1 == string2 elementwise.
# Python program explaining
# numpy.equal() function

import numpy as np

# comparing two strings elementwise
# using the equal() method
a = np.char.equal('geeks', 'for')

print(a)
Output :
False

numpy.not_equal(): This function checks whether two strings are unequal, elementwise.


# Python program explaining
# numpy.not_equal() function

import numpy as np

# comparing two strings elementwise
# using the not_equal() method
a = np.char.not_equal('geeks', 'for')

print(a)
Output :
True

numpy.greater(): This function checks whether string1 is greater than string2 or not.
# Python program explaining
# numpy.greater() function

import numpy as np

# comparing two strings elementwise
# using the greater() method
a = np.char.greater('geeks', 'for')

print(a)
Output :
True

FUNCTION DESCRIPTION

numpy.greater_equal() It checks whether string1 >= string2 or not.

numpy.less_equal() It checks whether string1 is <= string2 or not.

numpy.less() It checks whether string1 is less than string2 or not.


Basics of Time Series Manipulation Using Pandas

Although time series support is also available in other libraries, data science
professionals use the Pandas library as it provides a richer set of features for working with
DateTime series. We can include the date and time for every record and can fetch the
records of a DataFrame.
We can find the data within a certain range of dates and times by using the DateTime
module of the Pandas library.
Let’s discuss some major objectives of time series analysis using Pandas library.
Objectives of Time Series Analysis
 Create a series of date
 Work with data timestamp
 Convert string data to timestamp
 Slicing of data using timestamp
 Resample your time series for different time period aggregates/summary statistics
 Working with missing data
Now, let’s do some practical analysis of some data to demonstrate the use of Pandas’ time
series.
Create DateTime Values with Pandas
To create a DateTime series using Pandas, we need the DateTime module and then we can
create a DateTime range with the date_range method.
Example

import pandas as pd
from datetime import datetime
import numpy as np

range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')

print(range_date)

Output
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:01:00',
'2019-01-01 00:02:00', '2019-01-01 00:03:00',
'2019-01-01 00:04:00', '2019-01-01 00:05:00',
'2019-01-01 00:06:00', '2019-01-01 00:07:00',
'2019-01-01 00:08:00', '2019-01-01 00:09:00',
...
'2019-01-07 23:51:00', '2019-01-07 23:52:00',
'2019-01-07 23:53:00', '2019-01-07 23:54:00',
'2019-01-07 23:55:00', '2019-01-07 23:56:00',
'2019-01-07 23:57:00', '2019-01-07 23:58:00',
'2019-01-07 23:59:00', '2019-01-08 00:00:00'],
dtype='datetime64[ns]', length=10081, freq='T')
Explanation:
Here in this code, we have created minute-based timestamps for the date range
from January 1, 2019 to January 8, 2019.
We can vary the frequency from hours to minutes or seconds.
This function will help you to track the record of data stored per minute. As we can see in the
output the length of the datetime stamp is 10081.
Note: Remember that pandas uses the data type datetime64[ns].
Determine the Data Type of an Element in the DateTime Range
To determine the type of an element in the DateTime range, we use indexing to fetch the
element and then use the type function to know its data type.

import pandas as pd
from datetime import datetime
import numpy as np

range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')

print(type(range_date[110]))

Output
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Explanation:
We are checking the type of our object named range_date.
Create DataFrame with DateTime Index
To create a DataFrame from a DateTime range, we first create the range and then pass it to
the pandas.DataFrame constructor as a column; a later example promotes that column to the
index.
 Python3

import pandas as pd
import numpy as np

range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')

# one row per timestamp, plus a column of random integers
df = pd.DataFrame(range_date, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(range_date)))

print(df.head(10))

Output
date data
0 2019-01-01 00:00:00 49
1 2019-01-01 00:01:00 58
2 2019-01-01 00:02:00 48
3 2019-01-01 00:03:00 96
4 2019-01-01 00:04:00 42
5 2019-01-01 00:05:00 8
6 2019-01-01 00:06:00 20
7 2019-01-01 00:07:00 96
8 2019-01-01 00:08:00 48
9 2019-01-01 00:09:00 78
Explanation:
We first created a DateTime range, wrapped it in a DataFrame as a 'date' column, and added a
'data' column of random integers generated with np.random.randint. Finally, we printed the
first ten rows to check the result.
For time series manipulation, the DataFrame should be indexed by the timestamp rather than by
the default integer index; a later example converts the 'date' column into such an index.
Convert DateTime elements to String format
The below example demonstrates how we can convert the DateTime elements of DateTime
object to string format.
 Python3

import pandas as pd

range_date = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')

# convert each Timestamp to its default string representation
string_data = [str(x) for x in range_date]

print(string_data[1:11])

Output:
['2019-01-01 00:01:00', '2019-01-01 00:02:00', '2019-01-01 00:03:00', '2019-01-01 00:04:00',
'2019-01-01 00:05:00', '2019-01-01 00:06:00', '2019-01-01 00:07:00', '2019-01-01 00:08:00',
'2019-01-01 00:09:00', '2019-01-01 00:10:00']
Explanation:
This code converts each element of range_date to a string; because there are many values, we
slice the result and print only ten entries of the list string_data.
The list comprehension visits every value in the series range_date.
Note that date_range does not require both a start and an end date: any two of start, end,
and periods (together with freq) are enough to determine the range.
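When a custom format is needed instead of str()'s default, a DatetimeIndex also provides a strftime method. A minimal sketch (the '%d-%b-%Y %H:%M' pattern is just an example):

```python
import pandas as pd

range_date = pd.date_range(start='1/1/2019', periods=3, freq='min')

# format each timestamp with an explicit pattern instead of str()
formatted = range_date.strftime('%d-%b-%Y %H:%M')
print(list(formatted))
```

This yields strings like '01-Jan-2019 00:00' rather than the full nanosecond-precision default representation.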
Accessing Specific DateTime Element
The below example demonstrates how to access the rows for a specific date once the DataFrame
is indexed by DateTime.
 Python3

import pandas as pd
import numpy as np

range_data = pd.date_range(start='1/1/2019', end='1/08/2019', freq='Min')

df = pd.DataFrame(range_data, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(range_data)))

# promote the date column to a DatetimeIndex
df['datetime'] = pd.to_datetime(df['date'])
df = df.set_index('datetime')
df.drop(['date'], axis=1, inplace=True)

# partial-string indexing: select all rows on Jan 5, then slice ten of them
print(df.loc['2019-01-05'][1:11])

Output
data
datetime
2019-01-05 00:01:00 99
2019-01-05 00:02:00 21
2019-01-05 00:03:00 29
2019-01-05 00:04:00 98
2019-01-05 00:05:00 0
2019-01-05 00:06:00 72
2019-01-05 00:07:00 69
2019-01-05 00:08:00 53
2019-01-05 00:09:00 3
2019-01-05 00:10:00 37
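The first bullet at the top of this section mentioned resampling; with a DatetimeIndex in place it becomes a one-liner. A minimal sketch with randomly generated data (the variable names are illustrative):

```python
import pandas as pd
import numpy as np

rng = pd.date_range(start='1/1/2019', end='1/08/2019', freq='min')
df = pd.DataFrame({'data': np.random.randint(0, 100, size=len(rng))}, index=rng)

# downsample minute-level data to daily means: one row per calendar day
daily = df.resample('D').mean()
print(daily.shape)
```

Here resample('D') groups the minute rows into calendar-day bins (eight days, January 1 through January 8) and mean() computes one summary value per bin; sum(), max(), and similar aggregations work the same way.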

Python | Pandas dataframe.eval()


Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.
Pandas dataframe.eval() function is used to evaluate an expression in the context of the
calling dataframe instance. The expression is evaluated over the columns of the dataframe.
Syntax: DataFrame.eval(expr, inplace=False, **kwargs)
Parameters:
expr : The expression string to evaluate.
inplace : If the expression contains an assignment, whether to perform the operation inplace
and mutate the existing DataFrame. Otherwise, a new
DataFrame is returned.
kwargs : Additional keyword arguments passed through to pandas.eval(); see its documentation
for complete details.
Returns: ret : ndarray, scalar, or pandas object
Example #1: Use the eval() function to evaluate the row-wise sum of all columns and insert
the resulting column into the dataframe.

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [1, 5, 7, 8],
                   "B": [5, 8, 4, 3],
                   "C": [10, 4, 9, 3]})

# Print the first dataframe
df

Let’s evaluate the sum over all the columns and add the resultant column to the dataframe.
# To evaluate the sum over all the columns
df.eval('D = A + B + C', inplace=True)

# Print the modified dataframe
df

Output :
   A  B   C   D
0  1  5  10  16
1  5  8   4  17
2  7  4   9  20
3  8  3   3  14
Example #2: Use the eval() function to evaluate the sum of two columns and insert the
resulting column into the dataframe. This dataframe contains a NaN value.
Note: An expression cannot be evaluated over NaN values, so the corresponding result cells
will be NaN too.

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [4, 5, None],
                   "C": [7, 8, 9]})

# Print the dataframe
df
Let’s evaluate the sum of column “B” and “C”.

# To evaluate the sum of two columns in the dataframe
df.eval('D = B + C', inplace=True)

# Print the modified dataframe
df

Output :
   A    B  C     D
0  1  4.0  7  11.0
1  2  5.0  8  13.0
2  3  NaN  9   NaN
Notice that the resulting column ‘D’ has a NaN value in the last row, since the corresponding
cell used in the evaluation was NaN.
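Beyond column names, eval() expressions can also reference Python variables from the enclosing scope using the @ prefix. A minimal sketch (threshold is an illustrative variable, not part of the examples above):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 7, 8],
                   "B": [5, 8, 4, 3]})

threshold = 4  # a local Python variable, referenced in the expression via @
df.eval('D = A + B + @threshold', inplace=True)
print(df['D'].tolist())   # [10, 17, 15, 15]
```

The @ prefix tells eval() to look the name up in the surrounding Python namespace rather than among the dataframe's columns.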

Pandas query() Method


Analyzing data requires a lot of filtering operations. Pandas DataFrames provide many
methods to filter a data frame, and Dataframe.query() is one of them.
Pandas query() method Syntax
Syntax: DataFrame.query(expr, inplace=False, **kwargs)
Parameters:
 expr: Expression in string form to filter data.
 inplace: Make changes in the original data frame if True
 kwargs: Other keyword arguments.
Return type: Filtered Data frame
Pandas DataFrame query() Method
Dataframe.query() expects plain column names to be valid Python identifiers, so before
applying the method in the examples below, spaces in column names are replaced with ‘_’.
(Recent pandas versions can also wrap such names in backticks inside the expression.) The
examples read a CSV file named employees.csv.
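As a small illustration of the backtick form, which pandas has supported since version 0.25, here is a sketch on a toy dataframe (not the employees.csv file used below):

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ann", "Bob"], "Score": [90, 70]})

# backticks let query() reference a column whose name contains a space
result = df.query('`First Name` == "Ann" and Score > 80')
print(result)
```

With backticks, renaming the columns first is unnecessary, though renaming keeps expressions shorter when a column is queried repeatedly.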
Pandas DataFrame query() Examples
Example 1: Single condition filtering
In this example, the data is filtered on the basis of a single condition. Before applying the
query() method, the spaces in column names have been replaced with ‘_’.
Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# replacing blank spaces with '_' in column names
data.columns = [column.replace(" ", "_") for column in data.columns]

# filtering with query method
data.query('Senior_Management == True', inplace=True)

# display
data

Output:
As shown in the output image, the data now only has rows where Senior_Management is True.
Example 2: Multiple conditions filtering
In this example, the Dataframe has been filtered on multiple conditions. Before applying the
query() method, the spaces in column names have been replaced with ‘_’.
Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# replacing blank spaces with '_' in column names
data.columns = [column.replace(" ", "_") for column in data.columns]

# filtering with query method (the expression must be a single string)
data.query('Senior_Management == True and Gender == "Male" '
           'and Team == "Marketing" and First_Name == "Johnny"',
           inplace=True)

# display
data

Output:
As shown in the output image, only two rows have been returned on the basis of the filters
applied.
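Like eval(), query() can reference Python variables from the enclosing scope with the @ prefix. A minimal sketch on a toy dataframe (min_salary is an illustrative variable; employees.csv is not required here):

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Marketing", "Sales", "Marketing"],
                   "Salary": [50000, 65000, 72000]})

min_salary = 60000  # local variable used inside the query expression
result = df.query('Team == "Marketing" and Salary > @min_salary')
print(result)   # only the third row matches both conditions
```

This avoids string formatting when the filter threshold is computed at runtime.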
