Chapter 2 Data Handling using pandas - I(DATA FRAME)
DataFrame
➢ A DataFrame is a two-dimensional labelled data structure, like a table in MySQL.
➢ It contains rows and columns, and therefore has both a row index and a column index.
➢ Each column can hold a different type of value, such as numeric, string,
boolean, etc., as in the tables of a database (heterogeneous type).
➢ Data is mutable.
➢ Size is mutable.
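The properties listed above can be seen in a minimal sketch. The column names and values below are made up for illustration and do not come from the chapter.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type of value.
df = pd.DataFrame({"Name": ["Asha", "Ravi"],
                   "Marks": [82.5, 74.0],
                   "Passed": [True, False]})
print(df.dtypes)      # one data type per column (heterogeneous)

# Data is mutable: a single value can be changed in place.
df.loc[1, "Marks"] = 79.0

# Size is mutable: deleting a column changes the shape.
print(df.shape)       # (2, 3)
del df["Passed"]
print(df.shape)       # (2, 2)
```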
Creation of DataFrame
(A) Creation of an empty DataFrame
import pandas as pd
print("create empty dataframe")
df=pd.DataFrame()
print(df)
Output:
create empty dataframe
Empty DataFrame
Columns: []
Index: []
Assigning values to a new column label that does not exist will create a new column at
the end. If the column already exists in the DataFrame then the assignment statement
will update the values of the already existing column.
Assigning values to a new row label that does not exist will create a new row at the
end. If the row already exists in the DataFrame then the assignment statement will
update the values of the already existing row.
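The four cases described above (new column, existing column, new row, existing row) might look like this in practice. The DataFrame name result and its marks are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical marks table: rows are subjects, columns are students.
result = pd.DataFrame({"Arnab": [90, 91], "Ramit": [92, 81]},
                      index=["Maths", "Science"])

# New column label: a column is created at the end.
result["Samridhi"] = [89, 91]

# Existing column label: the values of that column are updated.
result["Ramit"] = [99, 98]

# New row label: a row is created at the end.
result.loc["English"] = [85, 86, 83]

# Existing row label: the values of that row are updated.
result.loc["Maths"] = [0, 0, 0]

print(result)
```

Rows are addressed through loc because result["English"] would be read as a column label.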
To rename the row labels Maths to Sub1, Science to Sub2, English to Sub3 and Hindi
to Sub4, we can write the following statement:
ResultDF=ResultDF.rename({'Maths':'Sub1','Science':'Sub2','English':'Sub3','Hindi':'Sub4'},
axis='index')
Similarly, to rename the column labels we pass axis='columns':
ResultDF=ResultDF.rename({'Arnab':'Student1','Ramit':'Student2','Samridhi':
'Student3','Mallika':'Student4'},axis='columns')
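A self-contained sketch of both rename statements follows. The labels are taken from the statements above, but the marks inside ResultDF are made-up values, and the variable renamed is introduced here only to show that rename() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical marks; only the row and column labels come from the text.
ResultDF = pd.DataFrame(
    {"Arnab": [90, 91, 97, 97], "Ramit": [92, 81, 96, 89],
     "Samridhi": [89, 91, 88, 78], "Mallika": [81, 71, 67, 82]},
    index=["Maths", "Science", "English", "Hindi"])

# Rename row labels; rename() returns a new DataFrame.
renamed = ResultDF.rename({'Maths': 'Sub1', 'Science': 'Sub2',
                           'English': 'Sub3', 'Hindi': 'Sub4'}, axis='index')

# Rename column labels on the result.
renamed = renamed.rename({'Arnab': 'Student1', 'Ramit': 'Student2',
                          'Samridhi': 'Student3', 'Mallika': 'Student4'},
                         axis='columns')
print(renamed)
```

Labels that do not appear in the mapping are left as they are.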
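import pandas as pd
student=pd.DataFrame({'English':[50,60,70],'Physics':[60,50,80],
                      'Chemistry':[70,40,90],'Maths':[80,90,80],'IP':[90,80,85]},
                     index=['athul','fardeen','fawaz'])
print(student)
# Boolean Series: True for every student whose IP mark is above 80
print(student.IP>80)
# Equivalent ways of writing the same condition:
#print(student['IP']>80)
#print(student.loc[:,'IP']>80)
# Use the Boolean Series to keep only the rows where the condition is True
print(student[student.IP>80])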
Attributes of DataFrames
Attribute Name       Purpose
DataFrame.index      to display row labels
DataFrame.columns    to display column labels
DataFrame.dtypes     to display data type of each column in the DataFrame
DataFrame.values     to display a NumPy ndarray having all the values in the DataFrame, without the axes labels
DataFrame.shape      to display a tuple representing the dimensionality of the DataFrame, as (rows, columns)
DataFrame.size       to display the total number of elements in the DataFrame (number of rows multiplied by number of columns)
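The attributes in the table can be tried on a small DataFrame. This sketch reuses two columns of the student data shown earlier.

```python
import pandas as pd

student = pd.DataFrame({'English': [50, 60, 70], 'Physics': [60, 50, 80]},
                       index=['athul', 'fardeen', 'fawaz'])

print(student.index)    # row labels
print(student.columns)  # column labels
print(student.dtypes)   # data type of each column
print(student.values)   # NumPy ndarray of the values, without labels
print(student.shape)    # (3, 2): 3 rows, 2 columns
print(student.size)     # 6: total number of elements
```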
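iteritems(): Iterates over the DataFrame columns, returning a tuple with the column
name and the content as a Series. iteritems() was removed in pandas 2.0;
DataFrame.items() does the same job and is available in both older and newer versions.
iterrows(): Iterates over the DataFrame rows, returning a tuple with the row index
and the row content as a Series. The following example uses iterrows() to assign a
grade to each student: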
import pandas as pd
a=[{"Name":'abc','age':25,'Mark':80},{"Name":'def','age':28,'Mark':95},
   {"Name":'ghi','age':30,'Mark':60},{"Name":'jkl','age':22,'Mark':56},
   {"Name":'mno','age':23,'Mark':35}]
student=pd.DataFrame(a)
print(student)
l=[]   # grades, collected in row order
for (i,j) in student.iterrows():   # i is the row label, j is the row as a Series
    if(j["Mark"]>90):
        grade="A+"
        l.append(grade)
        #student.loc[i,"Grade"]="A+"
    elif(j["Mark"]>70 and j["Mark"]<=90):
        grade="A"
        l.append(grade)
    elif(j["Mark"]>50 and j["Mark"]<=70):
        grade="B+"
        l.append(grade)
    elif(j["Mark"]>32 and j["Mark"]<=50):
        grade="B"
        l.append(grade)
    else:
        grade="C"
        l.append(grade)
student["Grade"]=l   # add the grades as a new column
print(student)
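For comparison with the row-wise loop, column-wise iteration over (column name, Series) pairs can be sketched as below. It uses DataFrame.items(), which replaces iteritems() in pandas 2.0 and later. The data is a shortened copy of the student records, and the names marks, labels and highest are introduced here only for illustration.

```python
import pandas as pd

marks = pd.DataFrame([{"Name": 'abc', 'age': 25, 'Mark': 80},
                      {"Name": 'def', 'age': 28, 'Mark': 95}])

labels = []
for (label, column) in marks.items():
    # label is the column name, column is that column's content as a Series
    labels.append(label)
    print(label, column.tolist())

# Each Series can be used like any other, e.g. to find the highest mark.
highest = marks["Mark"].max()
print(highest)   # 95
```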
CSV
A Comma Separated Value (CSV) file is a text file where values are separated by
commas. Each line represents a record (row). Each row consists of one or more
fields (columns). CSV files can be easily handled through a spreadsheet application.
• The first parameter to read_csv() is the name of the comma separated data file
along with its path.
• The parameter sep specifies whether the values are separated by a comma, semicolon,
tab, or any other character. The default value for sep is a comma.
• The parameter header specifies the number of the row whose values are to be used
as the column names. It also marks the start of the data to be fetched. header=0
implies that column names are inferred from the first line of the file. By default,
header='infer', which behaves like header=0 when the parameter names is not given
and like header=None when it is.
• We can explicitly specify column names using the parameter names while creating
the DataFrame using the read_csv() function.
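The parameters above can be tried without a file on disk. In this sketch the CSV text is held in memory through io.StringIO; the data, the semicolon separator and the replacement column names are all made up for illustration. With a real file, the first argument would be its path instead.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, standing in for a file on disk.
data = "Name;Mark\nabc;80\ndef;95\n"

# sep=";" overrides the default comma; header=0 takes column names from line 1.
df1 = pd.read_csv(io.StringIO(data), sep=";", header=0)
print(df1)

# names together with header=0 replaces the column names found in the file.
df2 = pd.read_csv(io.StringIO(data), sep=";", header=0,
                  names=["Student", "Score"])
print(df2)
```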