Atmiya University
Faculty of Science,
Department of Computer Science & I.T.
Subject Name: 21UFSDE309 Data Science Using Python
By: Dr. Hiren Kavathiya
Introduction to Pandas in Python
Pandas is an open-source library designed for working with
relational or labeled data easily and intuitively.
It provides data structures and operations for
manipulating numerical data and time series.
The library is built on top of NumPy.
Pandas is fast, and it makes users who work with data
more productive.
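To make the time-series and NumPy points concrete, here is a minimal sketch. The dates and temperature values are invented sample data, and the variable names are only illustrative.

```python
import pandas as pd

# Four consecutive days as the index (arbitrary sample dates).
dates = pd.date_range("2022-01-01", periods=4, freq="D")
temps = pd.Series([21.5, 22.0, 19.5, 20.0], index=dates)

# Select one value by its date label.
second_day = temps.loc[pd.Timestamp("2022-01-02")]
print(second_day)  # 22.0

# The values of a Series can be obtained as a NumPy array.
values = temps.to_numpy()
print(type(values).__name__)  # ndarray
print(values.mean())          # 20.75
```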
History: Pandas was initially developed by Wes
McKinney in 2008 while he was working at AQR Capital
Management.
He convinced AQR to allow him to open-source the
library.
Another AQR employee, Chang She, joined as the
second major contributor to the library in 2012.
Many versions of pandas have been released since then.
At the time these slides were prepared, the latest version was 1.4.1.
Advantages
Fast and efficient for manipulating and analyzing data.
Data can be loaded from many different file formats.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Size mutability: columns can be inserted into and deleted from DataFrame and higher dimensional objects.
Data set merging and joining.
Flexible reshaping and pivoting of data sets.
Provides time-series functionality.
Powerful group by functionality for performing split-apply-combine operations on data sets.
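Several of the advantages listed above can be seen in one short sketch. The table of sales figures, the city names and the 18% tax factor are all made-up values chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one amount is missing (NaN).
sales = pd.DataFrame({
    "city": ["Rajkot", "Surat", "Rajkot", "Surat"],
    "amount": [100.0, np.nan, 300.0, 200.0],
})

# Missing data: count the NaN values, then replace them with 0.
missing_count = int(sales["amount"].isna().sum())
sales["amount"] = sales["amount"].fillna(0)

# Size mutability: insert a new column, then delete it again.
sales["amount_with_tax"] = sales["amount"] * 1.18
sales = sales.drop(columns="amount_with_tax")

# Split-apply-combine: total amount per city.
totals = sales.groupby("city")["amount"].sum()

# Merging: attach a state to each row from a small lookup table.
states = pd.DataFrame({"city": ["Rajkot", "Surat"],
                       "state": ["Gujarat", "Gujarat"]})
merged = sales.merge(states, on="city")

print(missing_count)
print(totals)
print(merged)
```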
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python
that serves as a visualization utility.
Matplotlib was created by John D. Hunter.
Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in Python; a few segments
are written in C, Objective-C and JavaScript for platform
compatibility.
Installation of Matplotlib
If you already have Python and pip installed on your system,
installing Matplotlib is very easy.
Install it using this command:
C:\Users\Your Name>pip install matplotlib
Import Matplotlib
Once Matplotlib is installed, import it in your applications by
adding the import module statement:
import matplotlib
Checking Matplotlib Version
The version string is stored in the __version__ attribute.
Example
import matplotlib
print(matplotlib.__version__)
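Once the import works, a first plot takes only a few lines. The sketch below uses the pyplot interface with made-up data points. The Agg backend is selected only so that the code runs without a display; in an interactive session you could drop that line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

# Sample data: the first four square numbers.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Create a figure with one set of axes and draw a line with point markers.
fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A first Matplotlib plot")
```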
What are the Data Structures in Pandas?
Pandas data structures are classified by the
dimensionality of the data they hold. These data
structures are:
Series
DataFrame
Panel (deprecated in pandas 0.20 and removed in 0.25, so it is not available in current versions)
Data Structure    Dimensions
Series            1D
DataFrame         2D
Panel             3D (removed in pandas 0.25)
Series and DataFrames are the most widely used data
structures in data science. In spreadsheet terms, a
Series is a single column of an Excel sheet, whereas a
DataFrame has rows and columns and is the sheet itself.
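The difference in dimensionality can be checked directly with the ndim and shape attributes. The values below are arbitrary, and the names column and sheet are chosen only to mirror the spreadsheet analogy.

```python
import pandas as pd

# A Series is like one spreadsheet column.
column = pd.Series([10, 20, 30])

# A DataFrame is like a whole sheet with rows and columns.
sheet = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]})

print(column.ndim, column.shape)  # 1 (3,)
print(sheet.ndim, sheet.shape)    # 2 (3, 2)

# Selecting one column of a DataFrame gives back a Series.
print(type(sheet["a"]).__name__)  # Series
```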
What is a Series in Pandas?
A pandas Series is a one-dimensional data structure that
can hold values such as integers, floats and strings. We use
a Series when we want to work with a one-dimensional array.
Note that a Series cannot have multiple columns: it holds
only one column, just like a single column in an Excel
sheet. A Series has an index that labels each value. You can
supply your own index labels instead of the default ones.
This is Series
Name
Dhyey
Krishna
Kishan
Radha
Shyam
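The Series pictured above could be built as follows. This is only a sketch: the name argument labels the column as "Name", and the default integer index (0 to 4) is assumed because the slide shows no custom labels.

```python
import pandas as pd

# One column of names, labeled "Name", with the default 0..4 index.
names = pd.Series(["Dhyey", "Krishna", "Kishan", "Radha", "Shyam"],
                  name="Name")

print(names)
print(len(names))      # 5
print(names.iloc[0])   # Dhyey
```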
Installing Pandas on Windows
You can install pandas on Windows by opening the
Command Prompt and typing:
pip install pandas
Creating a Series in Pandas
A pandas Series can be created in different ways: from a
MySQL table, from an Excel worksheet or CSV file, or from an
array, dictionary, list, etc. Let’s look at how to create a
Series. First, import pandas into the Python file or
notebook that you are working in:
import pandas as pd
ps = pd.Series([1,2,3,4,5])
print(ps)
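A Series can be built from a NumPy array in the same way as from a list. The sketch below uses arbitrary sample values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# From a Python list (as in the example above).
from_list = pd.Series([1, 2, 3, 4, 5])

# From a NumPy array of floating point numbers.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(from_list)
print(from_array)
print(from_array.dtype)  # float64
```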
Changing the index of Series in Pandas
By default, the index values of a Series are integers
starting from 0. You can set your own index labels by
passing a list of labels to the index argument.
ps = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(ps)
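With custom labels in place, values can be looked up by label as well as by position. A short sketch using the same Series:

```python
import pandas as pd

ps = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Look up a single value by its label.
print(ps["c"])           # 3

# Label slices with loc include both end points: b, c and d.
print(ps.loc["b":"d"])

# Positional access is still available through iloc.
print(ps.iloc[0])        # 1
```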
Creating a Series from a Dictionary
A Series can also be created from a dictionary. If no index is
specified, the dictionary keys become the index labels. If an index is
passed, the values are looked up by those labels, in the order given;
any label that is not a key in the dictionary gets the value NaN.
import pandas as pd
dict_pd = {'a' : 1, 'b' : 2, 'c' : 3, 'd': 4, 'e': 5}
series_dict = pd.Series(dict_pd)
print(series_dict)
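The effect of passing an index together with a dictionary is easiest to see with a label that is not among the keys. In this sketch the label "z" is deliberately absent from the dictionary.

```python
import pandas as pd

dict_pd = {"a": 1, "b": 2, "c": 3}

# Only the requested labels are kept, in the requested order.
# "z" is not a key of the dictionary, so its value is NaN.
selected = pd.Series(dict_pd, index=["c", "a", "z"])

print(selected)
print(selected.isna().sum())  # 1
```

Because NaN is a floating point value, the resulting Series is stored as floats rather than integers.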
Data Frames
A Pandas DataFrame is a 2 dimensional data structure, like
a 2 dimensional array, or a table with rows and columns.
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Locate Row
As you can see from the result above, the DataFrame is
like a table with rows and columns.
Pandas uses the loc attribute to return one or more
specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
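To return more than one row, pass a list of index labels to loc. The sketch below reuses the same sample data: a single label gives back a Series, while a list of labels gives back a DataFrame.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# One label: the row comes back as a Series.
row = df.loc[0]
print(type(row).__name__)   # Series

# A list of labels: the rows come back as a DataFrame.
rows = df.loc[[0, 1]]
print(type(rows).__name__)  # DataFrame
print(rows)
```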
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
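Once the rows have names, loc accepts those names instead of numbers. A short sketch with the same data:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Select a whole row by its name.
print(df.loc["day2"])

# Select a single cell by row name and column name.
print(df.loc["day2", "calories"])  # 380

# Positional access still works through iloc.
print(df.iloc[1]["duration"])      # 40
```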
Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them
into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
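If no data.csv file is at hand, the same function can be tried on CSV text held in memory. In this sketch the CSV content is made up, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# CSV text with a header row and three data rows (sample values).
csv_text = "calories,duration\n420,50\n380,40\n390,45\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())         # first rows of the table
print(df.shape)          # (3, 2)
print(list(df.columns))  # ['calories', 'duration']
```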