
Chapter 2 Data Handling using pandas - I(DATA FRAME)

Chapter 2 covers data handling using pandas, focusing on the creation and manipulation of DataFrames, which are two-dimensional labeled data structures. It details various methods for creating DataFrames from lists, dictionaries, NumPy arrays, and Series, as well as operations for adding, deleting, and renaming rows and columns. The chapter also discusses indexing, attributes of DataFrames, and importing/exporting data with CSV files.

Chapter 2

Data Handling using pandas – I

DataFrame
➢ A two-dimensional labelled data structure, similar to a table in MySQL.
➢ It contains rows and columns, and therefore has both a row index and a column index.
➢ Each column can hold a different type of value, such as numeric, string,
boolean, etc., as in the tables of a database (heterogeneous types).
➢ Data is mutable.
➢ Size is mutable.
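
The properties listed above can be checked with a short sketch. The sample data and the column name passed are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type of value.
df = pd.DataFrame({'name': ['ali', 'mohd'],
                   'age': [17, 18],
                   'passed': [True, False]})
print(df.dtypes)                      # one data type per column

# Data is mutable: an existing value can be changed in place.
df.loc[0, 'age'] = 16

# Size is mutable: a column and a row can be added after creation.
df['mark'] = [50, 30]
df.loc[2] = ['fahad', 16, True, 20]
print(df.shape)
```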

Creation of DataFrame
(A) Creation of an empty DataFrame
import pandas as pd
print("create empty dataframe")
df=pd.DataFrame()
print(df)
Output:
create empty dataframe
Empty DataFrame
Columns: []
Index: []

(B) Creation of Dataframe from list


import pandas as pd
print("create a dataframe from list")
df1=pd.DataFrame([10,20,30])
print(df1)
df2=pd.DataFrame([[10,20,30],[40,50,60]])
print(df2)
df3=pd.DataFrame([[10,20],[30,40],[50,60],[70,80,90]])
print(df3)
Output:
create a dataframe from list
0
0 10
1 20
2 30
0 1 2
0 10 20 30
1 40 50 60
0 1 2
0 10 20 NaN
1 30 40 NaN
2 50 60 NaN
3 70 80 90.0

(C) Creation of DataFrame from Dictionary


import pandas as pd
print("create a dataframe from the dictionary")
df4=pd.DataFrame({'name':'ali','age':17,'mark':50},index=['stud1'])
print(df4)
Output:
create a dataframe from the dictionary
name age mark
stud1 ali 17 50
(D) Creation of DataFrame from NumPy ndarrays
Create a simple DataFrame without any column labels, using a single ndarray
Program:
import pandas as pd
import numpy as np
array1 = np.array([10,20,30])
dFrame4 = pd.DataFrame(array1)
print(dFrame4)
Output:
0
0 10
1 20
2 30

Create a DataFrame using more than one ndarray


Program:
import pandas as pd
import numpy as np
array1 = np.array([10,20,30])
array2 = np.array([100,200,300])
array3 = np.array([-10,-20,-30, -40])
dFrame5 = pd.DataFrame([array1, array3,array2], columns=[ 'A', 'B', 'C', 'D'])
print(dFrame5)
Output:
A B C D
0 10 20 30 NaN
1 -10 -20 -30 -40.0
2 100 200 300 NaN
(E) Creation of DataFrame from Series
import pandas as pd
seriesA = pd.Series([1,2,3,4,5],index = ['a', 'b', 'c', 'd', 'e'])
seriesB = pd.Series ([1000,2000,-1000,-5000,1000],index = ['a', 'b', 'c', 'd', 'e'])
seriesC = pd.Series([10,20,-10,-50,100],index = ['z', 'y', 'a', 'c', 'e'])
print("create a DataFrame using a single series")
dFrame6 = pd.DataFrame(seriesA)
print(dFrame6)
print("create a DataFrame using more than one series")
dFrame7 = pd.DataFrame([seriesA, seriesB])
print(dFrame7)
print("create a DataFrame using more than one series do not have the same set of labels")
dFrame8 = pd.DataFrame([seriesA, seriesC])
print(dFrame8)
Output:
create a DataFrame using a single series
0
a 1
b 2
c 3
d 4
e 5
create a DataFrame using more than one series
a b c d e
0 1 2 3 4 5
1 1000 2000 -1000 -5000 1000
create a DataFrame using more than one series do not have the same set of labels
a b c d e z y
0 1.0 2.0 3.0 4.0 5.0 NaN NaN
1 -10.0 NaN -50.0 NaN 100.0 10.0 20.0

(F) Creation of DataFrame from Dictionary of List


import pandas as pd
print("create a dataframe from the dictionary of list")
df5=pd.DataFrame({'name':['ali','mohd','fahad'],'age':[17,18,16],'mark':[50,30,20]})
print(df5)
Output:
create a dataframe from the dictionary of list
name age mark
0 ali 17 50
1 mohd 18 30
2 fahad 16 20

(G) Creation of DataFrame from Dictionary of Series


import pandas as pd
print("Create a dataframe from the dictionary of series")
s1=pd.Series(['zubair','amaan','ilan'])
s2=pd.Series([17,18,17])
df6=pd.DataFrame({'name':s1,'age':s2})
print(df6)
Output:
Create a dataframe from the dictionary of series
name age
0 zubair 17
1 amaan 18
2 ilan 17
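
Sections (E) and (G) differ in orientation: a list of Series gives one row per Series, while a dictionary of Series gives one column per Series. The following is a minimal sketch of that difference; the Series values and the keys 's1' and 's2' are assumptions chosen for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['a', 'c'])

as_rows = pd.DataFrame([s1, s2])                    # each Series becomes a row
as_columns = pd.DataFrame({'s1': s1, 's2': s2})     # each Series becomes a column

print(as_rows)
print(as_columns)    # label 'b' is missing in s2, so that cell is NaN
```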
(H) Creation of DataFrame from list of dictionaries
import pandas as pd
print("create a dataframe from a list of dictionaries")
df7=pd.DataFrame([{'name':'ali','age':16,'mark':80},{'name':'zubair','age':17},
{'name':'amaan','age':16}])
print(df7)
Output:
create a dataframe from a list of dictionaries
name age mark
0 ali 16 80.0
1 zubair 17 NaN
2 amaan 16 NaN

Operations on rows and columns in DataFrames


Some basic operations on rows and columns of a DataFrame are selection, deletion,
addition, and renaming.
(A) Adding a New Column to a DataFrame
Consider the following dataframe df
name age mark1
0 ali 17 50
1 mohd 18 30
2 fahad 16 20
1) To add a new column with a different value in each row, we can write the following
statement:
df['mark2']=[40,35,45]
2) To add a new column with the same value in every row, we can write the
following statement:
df['mark3']=50
3) To add a new column holding the sum of other columns, we can write
the following statement:
df['total']=df['mark1']+df['mark2']+df['mark3']

Assigning values to a new column label that does not exist will create a new column at
the end. If the column already exists in the DataFrame then the assignment statement
will update the values of the already existing column.
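
The statements above can be put together into one runnable sketch. It rebuilds the sample DataFrame first; the replacement values assigned to mark3 in the last step are assumptions used only to show the update of an existing column.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ali', 'mohd', 'fahad'],
                   'age': [17, 18, 16],
                   'mark1': [50, 30, 20]})

df['mark2'] = [40, 35, 45]        # new column, one value per row
df['mark3'] = 50                  # new column, same value in every row
df['total'] = df['mark1'] + df['mark2'] + df['mark3']

# The label 'mark3' already exists, so this updates the column
# instead of creating a new one ('total' is not recomputed).
df['mark3'] = [45, 40, 50]
print(df)
```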

(B) Adding a New Row to a DataFrame


We can add a new row to a DataFrame using DataFrame.loc[ ].

1) In order to add a new row, we can write the following statement:


df.loc[3]=['abubaker',15,50,80,100,230]
2) To add a new row with the same value in every column, we can write the following
statement:
df.loc[4]=50
3) To update a single element to another value, we can write
the following statement:
df.loc[2,'total']=100

Assigning values to a new row label that does not exist will create a new row at the
end. If the row already exists in the DataFrame then the assignment statement will
update the values of the already existing row.
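
A minimal sketch of this behaviour, using a three-column version of the sample DataFrame (the values written into the rows are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['ali', 'mohd', 'fahad'],
                   'age': [17, 18, 16],
                   'mark1': [50, 30, 20]})

df.loc[3] = ['abubaker', 15, 80]   # label 3 does not exist: a new row is added
df.loc[1] = ['mohd', 18, 35]       # label 1 exists: the whole row is overwritten
df.loc[2, 'mark1'] = 25            # a single element is updated
print(df)
```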

(C) Deleting Rows or Columns from a DataFrame


We can use the DataFrame.drop() method to delete rows and columns from a DataFrame. We
need to specify the labels to be dropped and the axis from which they
are to be dropped. To delete a row, the parameter axis is assigned the value 0;
to delete a column, the parameter axis is assigned the value 1.
1) To delete a row from a dataframe
df=df.drop(3,axis=0)

2) To delete a column from a dataframe, use any one of the following:

(i) df=df.drop('total',axis=1)
(ii) d=df.pop('total')
(iii) del df['total']
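
The three ways of deleting differ in what they return and in whether they change the original DataFrame. The sketch below uses assumed sample values to show this.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ali', 'mohd', 'fahad'],
                   'mark1': [50, 30, 20],
                   'mark2': [40, 35, 45],
                   'total': [90, 65, 65]})

smaller = df.drop(2, axis=0)     # returns a new DataFrame without row 2
print(len(df), len(smaller))     # df itself still has all its rows

totals = df.pop('total')         # removes the column from df and returns it as a Series
del df['mark2']                  # removes the column from df and returns nothing
print(list(df.columns))
```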

(D) Renaming Row Labels of a DataFrame


We can change the labels of rows and columns in a DataFrame using the DataFrame.rename()
method.

To rename the row indices Maths to Sub1, Science to Sub2, English to Sub3 and Hindi
to Sub4, we can write the following statement:
df=df.rename({'Maths':'Sub1','Science':'Sub2','English':'Sub3','Hindi':'Sub4'},
axis='index')

(E) Renaming Column Labels of a DataFrame


To alter the column names of a DataFrame (here ResultDF) we can again use the rename()
method. The parameter axis='columns' implies that we want to change the column labels:

ResultDF=ResultDF.rename({'Arnab':'Student1','Ramit':'Student2','Samridhi':
'Student3','Mallika':'Student4'},axis='columns')
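
A small runnable sketch of both kinds of renaming. The marks are hypothetical and only two subjects and two students are used; just the labels follow the statements above.

```python
import pandas as pd

# Hypothetical marks; only the row and column labels come from the text.
ResultDF = pd.DataFrame({'Arnab': [90, 91], 'Ramit': [92, 81]},
                        index=['Maths', 'Science'])

renamed = ResultDF.rename({'Maths': 'Sub1', 'Science': 'Sub2'}, axis='index')
renamed = renamed.rename({'Arnab': 'Student1', 'Ramit': 'Student2'}, axis='columns')

print(renamed)
print(ResultDF.index.tolist())   # rename() returned a new object; ResultDF is unchanged
```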

Accessing DataFrames Element through Indexing


Data elements in a DataFrame can be accessed using indexing. There are two ways of
indexing Dataframes : Label based indexing and Boolean Indexing.

(A) Label Based Indexing


There are several methods in Pandas to implement label based indexing.
DataFrame.loc[ ] is an important method that is used for label based indexing with
DataFrames.
NOTE:
when the row label is passed as an integer value, it is interpreted as a label of the
index and not as an integer position along the index.
iloc[ ] and loc[ ]
iloc[ ]: used for indexing or selecting based on integer position
(position-based indexing).
loc[ ]: used for indexing or selecting based on label (row
name or column name).
Both are written with square brackets, not parentheses.
import pandas as pd
print("creation of dataframe from dictionary of list")
df2=pd.DataFrame({'name':['ali','giri','mini','geena','meena','reena'],'age':[15,16,17,16,16,17],
'mark':[60,70,80,90,85,75]},index=['s1','s2','s3','s4','s5','s6'])
print(df2)
print("to display the rows s1,s2,s3 using slicing")
print(df2[0:3])
print(df2['s1':'s3'])
print("to display the rows s1,s3,s5 using slicing")
print(df2[0:5:2])
print(df2['s1':'s5':2])
print("to display the rows s1,s4,s5 using loc")
print(df2.loc[['s1','s4','s5']])
print("to display the rows s1,s3,s5 and columns name,mark using loc")
print(df2.loc[['s1','s3','s5'],['name','mark']])
print(df2.loc['s1':'s5':2,'name':'mark':2])
print("to display the rows s1,s4,s5 using iloc")
print(df2.iloc[[0,3,4]])
print("to display the rows s1,s3,s5 and columns name,mark using iloc")
print(df2.iloc[[0,2,4],[0,2]])
print(df2.iloc[0:5:2,0:3:2])
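
The NOTE above about integer row labels can be illustrated with a short sketch. The DataFrame and its labels 10, 20, 30 are assumptions chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data: the row labels are integers that are not the positions.
df = pd.DataFrame({'name': ['ali', 'giri', 'mini']}, index=[10, 20, 30])

by_label = df.loc[10, 'name']      # 10 is treated as a row label
by_position = df.iloc[0, 0]        # 0 is treated as a row position
print(by_label, by_position)

# A label slice includes the end label; a position slice excludes the end position.
print(len(df.loc[10:20]))
print(len(df.iloc[0:1]))
```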
(B) Boolean Indexing
In Boolean indexing, we can select the subsets of data based on the actual values in
the DataFrame rather than their row/column labels. Thus, we can use conditions on
column names to filter data values.

import pandas as pd
student=pd.DataFrame({'English':[50,60,70],'Physics':[60,50,80],'Chemistry':[70,40,
90],'Maths':[80,90,80],'IP':[90,80,85]},index=['athul','fardeen','fawaz'])
print(student)
print(student.IP>80)
#equivalent ways to get the same Boolean Series:
#print(student['IP']>80)
#print(student.loc[:,'IP']>80)
print(student[student.IP>80])

Attributes of DataFrames
Attribute Name       Purpose
DataFrame.index      to display row labels
DataFrame.columns    to display column labels
DataFrame.dtypes     to display data type of each column in the DataFrame
DataFrame.values     to display a NumPy ndarray having all the values in
                     the DataFrame, without the axes labels
DataFrame.shape      to display a tuple (rows, columns) representing the
                     dimensionality of the DataFrame
DataFrame.size       to display the total number of elements in the
                     DataFrame (number of rows x number of columns)
DataFrame.T          to transpose the DataFrame, that is, row indices and
                     column labels of the DataFrame replace each other's position
DataFrame.head(n)    to display the first n rows in the DataFrame
DataFrame.tail(n)    to display the last n rows in the DataFrame
import pandas as pd
a=[{"Name":'abc','age':25,'Mark':80.0},{"Name":'def','age':28,'Mark':95},
{"Name":'ghi','age':30,'Mark':60},{"Name":'jkl','age':22,'Mark':56},
{"Name":'mno','age':23,'Mark':35}]
student=pd.DataFrame(a)
student.index=['s1','s2','s3','s4','s5']
print(student)
print(student.index)
print(student.columns)
print(student.dtypes)
print(student.values)
print(student.shape)
print(student.size)
print(student.T)
print(student.head(3))
print(student.tail(2))
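
The attributes shape and size are easy to confuse. The following sketch, with assumed sample data of 4 rows and 3 columns, shows that shape is a (rows, columns) tuple while size is a single number.

```python
import pandas as pd

student = pd.DataFrame({'Name': ['abc', 'def', 'ghi', 'jkl'],
                        'age': [25, 28, 30, 22],
                        'Mark': [80, 95, 60, 56]},
                       index=['s1', 's2', 's3', 's4'])

rows, cols = student.shape       # tuple: (number of rows, number of columns)
print(student.shape)
print(student.size)              # rows multiplied by columns
print(student.T.shape)           # transposing swaps rows and columns
```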

iterrows() and iteritems() in a DataFrame


iterrows(): Iterate over DataFrame rows as (index, Series) pairs.

iteritems(): Iterate over DataFrame columns as (column name, Series) pairs, that is,
each step returns a tuple with the column name and the column content as a Series.
In pandas 2.0 and later this method is named items(); iteritems() has been removed.
import pandas as pd
a=[{"Name":'abc','age':25,'Mark':80},{"Name":'def','age':28,'Mark':95},
{"Name":'ghi','age':30,'Mark':60},{"Name":'jkl','age':22,'Mark':56},
{"Name":'mno','age':23,'Mark':35}]
student=pd.DataFrame(a)
print(student)
l=[]
for (i,j) in student.iterrows():
    if(j["Mark"]>90):
        grade="A+"
        l.append(grade)
        #student.loc[i,"Grade"]="A+"
    elif(j["Mark"]>70 and j["Mark"] <=90):
        grade="A"
        l.append(grade)
    elif(j["Mark"]>50 and j["Mark"]<=70):
        grade="B+"
        l.append(grade)
    elif(j["Mark"]>32 and j["Mark"]<=50):
        grade="B"
        l.append(grade)
    else:
        grade="C"
        l.append(grade)
student["Grade"]=l
print(student)
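
The program above iterates over rows only. A minimal sketch of column-wise iteration next to row-wise iteration follows; it uses items(), the method name available in current pandas releases, and a two-row sample chosen for illustration.

```python
import pandas as pd

student = pd.DataFrame({'Name': ['abc', 'def'], 'Mark': [80, 95]})

# items() yields one (column name, Series) pair per column.
for label, column in student.items():
    print(label, column.tolist())

# iterrows() yields one (index, Series) pair per row.
for idx, row in student.iterrows():
    print(idx, row['Name'], row['Mark'])
```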

CSV
A Comma Separated Value (CSV) file is a text file where values are separated by
commas. Each line represents a record (row). Each row consists of one or more
fields (columns). CSV files can be easily handled through a spreadsheet application.

Importing and exporting data Between CSV Files and DataFrames.


We can create a DataFrame by importing data from CSV files where values are
separated by commas. Similarly, we can also store or export data in a DataFrame as a
.csv file.
We can load the data from the ResultData.csv file into a DataFrame, say marks using
Pandas read_csv() function as shown below:
>>> marks = pd.read_csv("C:/NCERT/ResultData.csv",sep =",", header=0)

• The first parameter to read_csv() is the name of the comma separated data file
along with its path.
• The parameter sep specifies whether the values are separated by a comma, semicolon,
tab, or any other character. The default value for sep is a comma.
• The parameter header specifies the number of the row whose values are to be used
as the column names. It also marks the start of the data to be fetched. header=0
implies that column names are inferred from the first line of the file. By default,
header='infer', which behaves like header=0 when the names parameter is not given.
• We can explicitly specify column names using the parameter names while creating
the DataFrame using the read_csv() function.
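
The parameters above can be tried without a file on disk. In this sketch the file contents are held in an in-memory text buffer; the contents, the semicolon separator and the new column names are all assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents, kept in memory instead of a real .csv file.
text = "Name;Salary\nAnu;1500\nRavi;1600\n"

# sep=';' because this sample uses semicolons; header=0 takes the
# column names from the first line.
marks = pd.read_csv(io.StringIO(text), sep=';', header=0)
print(marks)

# names= supplies our own column labels; with header=0 they replace
# the labels found in the first line.
renamed = pd.read_csv(io.StringIO(text), sep=';', header=0,
                      names=['ename', 'esalary'])
print(renamed)
```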

We can use the to_csv() method to save a DataFrame to a text or CSV file.

#To open Employee.csv into a DataFrame


import pandas as pd
df=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv")
print("\nRead Employee.csv into a DataFrame")
print(df)
#To show the shape of Dataframe
print("\nshape",df.shape)
#To open and read from Employee.csv with specified columns
df1=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",
usecols=['Name','Salary'])
print("\nRead from Employee.csv with specified columns")
print(df1)
#To display only 5 records from Employee.csv
df2=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",nrows=5)
print("\nDisplay only 5 records from Employee.csv")
print(df2)
#To display records without header
df3=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",header=None)
print("\nDisplay records without header")
print(df3)
#To use the second column (position 1) as the row index instead of the default index numbers
df4=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",index_col=1)
print("\nDisplay records without index numbers")
print(df4)
#To display Employee file with new Column names (the first 5 lines of the file are skipped)
df5=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",skiprows=5,
names=['eid','ename','eage','ecity','esalary'])
print("\nDisplay Employee file with new Column names")
print(df5)
# To read a particular value in the csv (here 1600) as NaN
df6=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv",na_values=[1600])
print("\nUpdated a particular value in csv with NaN value")
print(df6)
df=pd.read_csv("E:\\2020-2021\\12 CBSE\\Employee.csv")
print(df)
print("\nCreating a Empnew CSV file by copying the contents of Employee.csv")
df.to_csv("E:\\2020-2021\\12 CBSE\\Empnew.csv")
print("\nCreating a Emp CSV file by copying selective columns of Employee.csv")
df.to_csv("E:\\2020-2021\\12 CBSE\\Emp.csv",columns=['Empid','Name'])
student={'RollNo':[1,2,3,4,5],'StudName':['Teena','Rinku','Payal','Akshay','Garvit'],
'Marks':[90,78,88,89,77],'Class':['11A','11B','11C','11A','11D']}
df=pd.DataFrame(student)
print("\nCreating a Student CSV file from DataFrame")
df.to_csv("E:\\2020-2021\\12 CBSE\\Student.csv")
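
Since to_csv() also writes the row labels as the first column by default, reading the file back needs index_col=0 to get the same DataFrame again. The sketch below shows this round trip; it writes to a temporary folder, which is an assumption made here so that it runs on any machine instead of needing the E: drive paths used above.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'RollNo': [1, 2], 'StudName': ['Teena', 'Rinku']})

# A temporary folder stands in for the folder paths used in the program above.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Student.csv')
    df.to_csv(path)                          # row labels go into the first column
    with open(path) as f:
        header_line = f.readline().strip()   # first line of the written file
    back = pd.read_csv(path, index_col=0)    # first column becomes the index again

print(header_line)
print(back)
```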

Difference between Pandas Series and NumPy Arrays


Pandas Series                               NumPy Arrays

In a Series we can define our own           NumPy arrays are accessed by their
labelled index to access elements.          integer position, using numbers only.
These labels can be numbers or letters.

The elements can also be indexed in         The indexing starts with zero for the
descending order.                           first element and the index is fixed.

If two Series are not aligned, NaN or       There is no concept of NaN values, and
missing values are generated.               if there are no matching values in the
                                            arrays, alignment fails.

Series require more memory.                 NumPy arrays occupy less memory.
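
The alignment row of the comparison can be seen in a short sketch. The values and labels are assumptions; the point is only that Series are matched on labels, while NumPy arrays are matched on position and shape.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2              # matched on labels; unmatched labels give NaN
print(total)

a1 = np.array([1, 2, 3])
a2 = np.array([10, 20, 30, 40])
try:
    a1 + a2                  # shapes (3,) and (4,) cannot be combined
    failed = False
except ValueError:
    failed = True
print(failed)
```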
