Python Pandas1
syllabus
Informatics Practices 2023-24
Chapter 1
Data Handling
using Pandas
Visit : python.mykvs.in for regular updates
Basic Features of Pandas
1. The DataFrame object helps a lot in keeping track of our data.
2. A pandas DataFrame can hold different data types
(float, int, string, datetime, etc.) all in one place.
3. Pandas has built-in functionality for easy grouping and
easy joins of data, and for rolling windows.
4. Good I/O capabilities; for example, data can easily be pulled from a MySQL
database directly into a DataFrame.
5. With pandas, you can use patsy for R-style formula syntax in
doing regressions.
6. Tools for loading data into in-memory data objects from
different file formats.
7. Data alignment and integrated handling of missing data.
8. Reshaping and pivoting of data sets.
9. Label-based slicing, indexing and subsetting of large data
sets.
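A few of the features listed above can be sketched in a short program. The sample data and the column names (name, section, score, joined) are assumptions made up for illustration; they do not come from the syllabus.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: columns of different types in one DataFrame
df = pd.DataFrame({
    "name": ["Freya", "Mohak", "Roshni", "Vin"],
    "section": ["A", "B", "A", "B"],
    "score": [90.0, np.nan, 80.0, 70.0],  # NaN marks a missing score
    "joined": pd.to_datetime(
        ["2023-04-01", "2023-04-02", "2023-04-03", "2023-04-04"]
    ),
})

# Feature 2: each column keeps its own data type
print(df.dtypes)

# Feature 7: missing data is detected and skipped by aggregations
missing_count = df["score"].isna().sum()

# Feature 3: easy grouping (the mean ignores the missing score)
mean_by_section = df.groupby("section")["score"].mean()
print(mean_by_section)

# Feature 9: label-based indexing after making 'name' the row label
by_name = df.set_index("name")
roshni_score = by_name.loc["Roshni", "score"]
print(roshni_score)
```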
Pandas – Installation/Environment Setup
The pandas module doesn't come bundled with standard Python.
If we install the Anaconda Python distribution, pandas will be
installed by default.
Steps for Anaconda installation & use
1. Visit the site https://www.anaconda.com/download/
2. Download the appropriate Anaconda installer.
3. After the download completes, install it.
4. During installation, check the options for setting the path and for all users.
5. After installation, start the Spyder utility of Anaconda from the Start menu.
6. Type import pandas as pd in the left pane (temp.py).
7. Then run it.
8. If no error is shown, pandas is installed.
9. Like the default temp.py, we can create another .py file for a new program
from the new window option of the File menu.
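Steps 6 to 8 can be taken a little further by printing the installed version, which also confirms that the import worked. This is only a minimal check; the variable name installed_version is chosen for illustration.

```python
import pandas as pd

# If this import fails with ModuleNotFoundError, pandas is not installed
installed_version = pd.__version__
print("pandas version:", installed_version)
```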
DataFrame
A DataFrame is like a two-dimensional array with
heterogeneous data, e.g.
Sr. No.  Admn No.  Student Name            Class  Section  Gender  Date Of Birth
1        001284    NIDHI MANDAL            I      A        Girl    07/08/2010
2        001285    SOUMYADIP BHATTACHARYA  I      A        Boy     24/02/2011
3        001286    SHREYAANG SHANDILYA     I      A        Boy     29/12/2010
Basic features of a DataFrame are
❖ Heterogeneous data
❖ Size Mutable
❖ Data Mutable
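The three features can be seen by building the student table as a DataFrame. This is a sketch: the House column and the changed Section value are hypothetical additions used only to show that size and data can change.

```python
import pandas as pd

# Heterogeneous data: text columns next to a datetime column
students = pd.DataFrame({
    "Admn No": ["001284", "001285", "001286"],  # strings keep the leading zeros
    "Student Name": ["NIDHI MANDAL", "SOUMYADIP BHATTACHARYA",
                     "SHREYAANG SHANDILYA"],
    "Class": ["I", "I", "I"],
    "Section": ["A", "A", "A"],
    "Gender": ["Girl", "Boy", "Boy"],
    "Date Of Birth": pd.to_datetime(
        ["07/08/2010", "24/02/2011", "29/12/2010"], format="%d/%m/%Y"
    ),
})

# Size mutable: a new column can be added (hypothetical values)
students["House"] = ["Red", "Blue", "Green"]

# Data mutable: an existing value can be changed (hypothetical change)
students.loc[0, "Section"] = "B"

print(students)
```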
Pandas Series
It is like a one-dimensional array capable of holding data
of any type (integer, string, float, Python objects, etc.).
A Series can be created using the constructor.
Syntax :- pandas.Series(data, index, dtype, copy)
A Series can be created from
1. Array (ndarray)
2. Dict
3. Scalar value or constant
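Pandas Series
e.g.
Output (empty Series)
Series([], dtype: float64)
Output (Series from an array)
0 a
1 b
2 c
3 d
dtype: object
Note : default index starts from 0
Output (Series from an array, with index 100 to 103)
100 a
101 b
102 c
103 d
dtype: object
Note : index starts from 100
Output (Series from a dict)
a 0.0
b 1.0
c 2.0
dtype: float64
Output (Series from the same dict, with index b, c, d, a)
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64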
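The following sketch shows one way to create Series like those in the outputs above. The variable names are assumptions, and the dtype of the empty Series is passed explicitly because the default dtype of an empty Series differs between pandas versions. The last line adds the scalar case, which has no output above.

```python
import numpy as np
import pandas as pd

# Empty Series; dtype given explicitly (the default varies by pandas version)
empty = pd.Series(dtype="float64")

# From an ndarray: the default index starts from 0
data = np.array(["a", "b", "c", "d"])
s_default = pd.Series(data)

# From an ndarray with a custom index starting from 100
s_custom = pd.Series(data, index=[100, 101, 102, 103])

# From a dict: the keys become the index
d = {"a": 0.0, "b": 1.0, "c": 2.0}
s_dict = pd.Series(d)

# From a dict with an explicit index: 'd' is not a key, so its value is NaN
s_reordered = pd.Series(d, index=["b", "c", "d", "a"])

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(5, index=[0, 1, 2, 3])

print(s_default)
print(s_custom)
print(s_reordered)
print(s_scalar)
```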
Pandas Series
Head function
e.g.
Output
a 1
b 2
c 3
dtype: int64
Returns the first 3 elements
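A program along the following lines would give the output above. The Series values 1 to 5 with labels a to e are an assumption chosen to match that output.

```python
import pandas as pd

# Assumed data: five integers labelled a to e
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# head(n) returns the first n elements; without an argument it returns 5
first_three = s.head(3)
print(first_three)
```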
Pandas Series
tail function
e.g.
Output
c 3
d 4
e 5
dtype: int64
Returns the last 3 elements
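The tail output above could come from a program like this one. As before, the Series values and labels are an assumption chosen to match the output.

```python
import pandas as pd

# Assumed data: five integers labelled a to e
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# tail(n) returns the last n elements, keeping their original labels
last_three = s.tail(3)
print(last_three)
```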
Pandas Series
Retrieve Data Using Label (Index)
e.g.
Output
c 3
d 4
dtype: int64
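Retrieval by label can be sketched as follows. The Series is assumed (values 1 to 5, labels a to e); passing a list of labels gives the two-element result shown in the output above, while a single label gives one value.

```python
import pandas as pd

# Assumed data: five integers labelled a to e
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# A list of labels returns a Series with just those elements
selected = s[["c", "d"]]
print(selected)

# A single label returns the value itself
single = s["a"]
print(single)
```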
Pandas Series
Retrieve Data from selection
There are three methods for data selection:
▪ loc gets rows (or columns) with particular labels from
the index.
▪ iloc gets rows (or columns) at particular positions in
the index (so it only takes integers).
▪ ix usually tries to behave like loc but falls back to
behaving like iloc if a label is not present in the index.
ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc and iloc
instead.
Pandas Series
Retrieve Data from selection
e.g.
>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s.iloc[:3] # slice the first three rows
49 NaN
48 NaN
47 NaN
>>> s.loc[:3] # slice up to and including label 3
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
>>> s.ix[:3] # pandas before 1.0 only: the integer is in the index so s.ix[:3] works like loc
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
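The difference between position-based and label-based slicing can be checked with a version of the example that runs on current pandas. It uses only loc and iloc; the variable names by_position and by_label are chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

# iloc counts positions: the first three rows, whatever their labels are
by_position = s.iloc[:3]
print(by_position)

# loc uses labels: everything up to and including the row labelled 3
by_label = s.loc[:3]
print(by_label)
```

The two slices differ in length because 3 is treated as a position by iloc and as a label by loc.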
Pandas DataFrame
It is a two-dimensional data structure, just like any table
(with rows & columns).
Basic Features of DataFrame
❖ Columns may be of different types
❖ Size can be changed (mutable)
❖ Labeled axes (rows / columns)
❖ Arithmetic operations on rows and columns
[Figure: structure of a DataFrame, with its rows labeled]
Pandas DataFrame
Create a DataFrame from Lists
e.g.1
import pandas as pd1
data1 = [1,2,3,4,5]
df1 = pd1.DataFrame(data1)
print (df1)
output
   0
0  1
1  2
2  3
3  4
4  5
e.g.2
import pandas as pd1
data1 = [['Freya',10],['Mohak',12],['Dwivedi',13]]
df1 = pd1.DataFrame(data1,columns=['Name','Age'])
print (df1)
output
      Name  Age
0    Freya   10
1    Mohak   12
2  Dwivedi   13
Pandas DataFrame
Create a DataFrame from Dict of ndarrays / Lists
e.g.1
import pandas as pd1
data1 = {'Name':['Freya', 'Mohak'],'Age':[9,10]}
df1 = pd1.DataFrame(data1)
print (df1)
Output
    Name  Age
0  Freya    9
1  Mohak   10
To use custom row labels, write the line below as the 3rd statement of the above program
(the number of index labels must match the number of rows):
df1 = pd1.DataFrame(data1, index=['rank1','rank2'])
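The effect of the index argument can be sketched as below. The variable mismatch_error is a name introduced for illustration; it records what happens when the number of labels does not match the number of rows. The alias pd is used here in place of pd1.

```python
import pandas as pd

data1 = {'Name': ['Freya', 'Mohak'], 'Age': [9, 10]}

# Two rows of data, so two index labels
df1 = pd.DataFrame(data1, index=['rank1', 'rank2'])
print(df1)

# Rows can now be fetched by their labels
print(df1.loc['rank2', 'Name'])

# Four labels for two rows of data is rejected with a ValueError
try:
    pd.DataFrame(data1, index=['rank1', 'rank2', 'rank3', 'rank4'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print("Error for mismatched index:", mismatch_error)
```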
Pandas DataFrame
Create a DataFrame from List of Dicts
e.g.1
import pandas as pd1
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z': 5}]
df1 = pd1.DataFrame(data1)
print (df1)
Output
x y z
0 1 2 NaN
1 5 4 5.0
Column Deletion
del df1['one'] # Deleting the first column using the del statement
df1.pop('two') # Deleting another column using the pop() method
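A complete version of the column deletion steps might look like this. The DataFrame contents and the third column name are assumptions, since the slide does not show how df1 was created.

```python
import pandas as pd

# Assumed data with three columns
df1 = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

# del removes the column in place and returns nothing
del df1["one"]

# pop() removes the column in place and returns it as a Series
popped = df1.pop("two")

print(df1)
print(popped)
```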
Rename columns
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
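One point worth checking is that rename() returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A minimal sketch, with the variable name renamed chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# rename() returns a new DataFrame with the mapped column names
renamed = df.rename(columns={"A": "a", "B": "c"})

print(list(df.columns))       # original column names are unchanged
print(list(renamed.columns))  # new column names
```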
Pandas DataFrame
Row Selection, Addition, and Deletion
#Selection by Label
import pandas as pd1
d1 = {'one' : pd1.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd1.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd1.DataFrame(d1)
print (df1.loc['b'])
Output
one 2.0
two 2.0
Name: b, dtype: float64
Pandas DataFrame
#Selection by integer location
import pandas as pd1
d1 = {'one' : pd1.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd1.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd1.DataFrame(d1)
print (df1.iloc[2])
Output
one 3.0
two 3.0
Name: c, dtype: float64
Pandas DataFrame
Addition of Rows
import pandas as pd1
# DataFrame.append() was removed in pandas 2.0; use concat() instead
df1 = pd1.concat([df1, df2])
print (df1)
Deletion of Rows
# Drop rows with label 0
df1 = df1.drop(0)
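The slide does not show how df1 and df2 were created, so the sketch below assumes two small DataFrames with the same columns. It uses pd.concat() to add the rows, which works on current pandas versions.

```python
import pandas as pd

# Assumed data: two DataFrames with the same columns
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# Addition of rows: the rows of df2 are placed below those of df1.
# The original row labels are kept, so the index becomes 0, 1, 0, 1.
combined = pd.concat([df1, df2])
print(combined)

# Deletion of rows: drop(0) removes every row labelled 0 (two rows here)
after_drop = combined.drop(0)
print(after_drop)
```

Passing ignore_index=True to pd.concat() would renumber the rows from 0 instead of keeping the duplicate labels.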
Pandas DataFrame
Iterate over rows in a dataframe
e.g.
import pandas as pd1
raw_data1 = {'name': ['freya', 'mohak'],
             'age': [10, 1],
             'favorite_color': ['pink', 'blue'],
             'grade': [88, 92]}
df1 = pd1.DataFrame(raw_data1, columns = ['name', 'age',
                                          'favorite_color', 'grade'])
# iterrows() yields (index label, row as a Series) pairs
for index, row in df1.iterrows():
    print (row["name"], row["age"])
Output
freya 10
mohak 1
Pandas DataFrame
Head & Tail
head() returns the first n rows (observe the index values). The default number of
rows to display is five, but you may pass a custom number. tail() returns the
last n rows, e.g.
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
     'Age':pd.Series([25,26,25,23,30,29,23]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print ("The first two rows of the data frame are:")
print (df.head(2))
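The program above only calls head(). A matching sketch for tail(), on the same data, could look like this; the variable names last_two and default_head are chosen for illustration.

```python
import pandas as pd

d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(d)

# tail(2) returns the last two rows; note that they keep index values 5 and 6
last_two = df.tail(2)
print("The last two rows of the data frame are:")
print(last_two)

# Without an argument, head() and tail() return five rows
default_head = df.head()
print(len(default_head))
```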
Pandas DataFrame
Indexing a DataFrame using .loc[ ] :
The .loc[ ] indexer selects data by the labels of the rows and columns.
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# dictionary of lists
data = {'name':["Mohak", "Freya", "Roshni"],
        'degree': ["MBA", "BCA", "M.Tech"],
        'score':[90, 40, 80]}
print (df)
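The program above builds the data but stops before using .loc[ ]. The sketch below shows some common forms of label-based selection on the dictionary of lists. The row labels a, b and c and the result variable names are assumptions added for illustration.

```python
import pandas as pd

data = {'name': ["Mohak", "Freya", "Roshni"],
        'degree': ["MBA", "BCA", "M.Tech"],
        'score': [90, 40, 80]}

# Row labels a, b, c are assumed for this example
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# One row by label: returns that row as a Series
row_b = df.loc['b']

# One value by row label and column label
score_b = df.loc['b', 'score']

# A label slice includes both end points; a list picks specific columns
subset = df.loc['a':'b', ['name', 'score']]

# A boolean condition selects the rows where it is True
high_scorers = df.loc[df['score'] > 50, 'name']

print(row_b)
print(score_b)
print(subset)
print(high_scorers)
```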