Revision Point - Dataframe


DATAFRAME

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
Features are:
Two-dimensional
Size-mutable and data-mutable
Contains heterogeneous data
Contains row and column indexes
The DataFrame contains labelled axes (rows or axis = 0 and columns or axis = 1).
All elements within a single column have the same data type, but different columns can have different data types.
Have a look at the 2-D form representation of a DataFrame -

Row \ Column    Roll/0     Name/1     Mark/2
FIRST/0         D[0][0]    D[0][1]    D[0][2]
SECOND/1        D[1][0]    D[1][1]    D[1][2]
THIRD/2         D[2][0]    D[2][1]    D[2][2]

For using the DataFrame object we must import the pandas library as below:
import pandas OR import pandas as ALIAS-NAME

Creating a DataFrame
Mainly the DataFrame() function of the pandas library is used. There are different ways of creating a DataFrame using -

A - Empty DataFrame
Let us learn with the help of an example to create and print a DataFrame.
import pandas
DF = pandas.DataFrame()
print(DF)
OUTPUT
Empty DataFrame
Columns: []
Index: []

B - Dictionary of Series
Example 1
import pandas
s = pandas.Series(100, index=['a', 'b', 'c', 'd'])
print(s)
df = pandas.DataFrame(s)
print(df)

Since DataFrames are two-dimensional, to create a DataFrame from Series we can also take two or more Series objects.
Example 2
import pandas
roll = pandas.Series([10, 12, 13, 16])
name = pandas.Series(['Aruna', 'Kavita', 'Gaurav', 'Sumit'])
DF = pandas.DataFrame({'Roll_No': roll, 'SName': name})
print(DF)

Example 3
import pandas
s1 = pandas.Series({101: 'Amit', 102: 'Anita', 103: 'Geetu', 105: 'Jatin'})
s2 = pandas.Series({101: 93, 102: 87, 103: 82, 104: 93, 105: 90})
# index 104 exists only in s2, so 'Name' is NaN for that row
dfs = pandas.DataFrame({'Name': s1, 'Marks': s2})
print("Series 1")
print(s1)
print("Series 2")
print(s2)
print("data frame from the above series ")
print(dfs)
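
The way Series with different indexes are combined can be checked directly. The following is a minimal sketch that reuses the data of Example 3; the variable names `names` and `marks` are chosen here for illustration only.

```python
import pandas as pd

# Two Series whose indexes do not match exactly (104 is missing in names).
names = pd.Series({101: 'Amit', 102: 'Anita', 103: 'Geetu', 105: 'Jatin'})
marks = pd.Series({101: 93, 102: 87, 103: 82, 104: 93, 105: 90})

# The DataFrame index is the union of both Series indexes;
# positions without a value are filled with NaN.
dfs = pd.DataFrame({'Name': names, 'Marks': marks})
print(dfs)

# isna() marks the cells that were filled in as missing.
print(dfs['Name'].isna())
```

Row 104 appears in the result even though only one Series has a value for it, which is why the 'Name' cell of that row is missing.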
KVS REGIONAL OFFICE, JAIPUR | SUBJECT - INFORMATICS PRACTICES (TERM-I SESSION 2021-22)
C - List of Dictionary
Recall that a dictionary is of the form { key1 : value1 , key2 : value2 , - - - }
The keys of the dictionary become the column names in the DataFrame object and the values of the dictionary become the column values of the DataFrame object.
Example 1
import pandas
d1 = {'roll': 101, 'name': 'Astha', 'tot_mark': 456}
d2 = {'roll': 104, 'name': 'Gautam', 'tot_mark': 478}
d3 = {'roll': 105, 'name': 'Deepika', 'tot_mark': 453, 'grade': 'A2'}
L = [d1, d2, d3]
df_list = pandas.DataFrame(L)
print("Data Frame from list of dictionaries ")
print(df_list)
OUTPUT
Data Frame from list of dictionaries
   roll     name  tot_mark  grade
0   101    Astha       456    NaN
1   104   Gautam       478    NaN
2   105  Deepika       453     A2
As shown in the output, NaN (Not a Number) is automatically added for missing places.

Example 2
Instead of the default row labels (0, 1, 2, ...), we can specify our own row labels by using the index=[list_of_row_labels] parameter in the DataFrame() function.
import pandas
L = [{'roll': 101, 'name': 'Astha'},
     {'roll': 104, 'name': 'Gautam', 'mark': 478}]
DF = pandas.DataFrame(L, index=['s1', 's2'])
print("DataFrame from List of Dictionaries with Row-Index")
print(DF)

Example 3
We can also use index=[list_of_row_labels] and columns=[list_of_column_labels] to specify the row index as well as the column index.
Example 3, DataFrame from a list of dictionaries with row index and column index
import pandas
L = [{'roll': 101, 'name': 'Astha'},
     {'roll': 104, 'name': 'Gautam', 'mark': 478}]
DF = pandas.DataFrame(L, index=['s1', 's2'], columns=['roll', 'name'])
# note, here column 'mark' is skipped
print("First DataFrame")
print(DF)
DF2 = pandas.DataFrame(L, index=['s1', 's2'], columns=['roll', 'name', 'age'])
# Here, column 'age' is an additional column, which does not exist in the list of dictionaries, so it is filled with NaN
print("Second DataFrame is")
print(DF2)

D - Text/CSV Files -
A CSV (Comma Separated Values) file can be imported directly to a DataFrame object using the read_csv() method.
Simple form of syntax is
<data-frame-name> = pandas.read_csv(<file-name-path>)
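
To try read_csv() without creating a file first, the CSV text can be held in memory. The sketch below assumes a small hypothetical sample (`csv_text`) and wraps it in `io.StringIO`, which read_csv() accepts in place of a file path; the helper name `load_students` is made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Adm_No,Name,Class,Marks
1201,Aniket Sharma,XII,83
1203,Anita Gupta,XII,91
1206,Gautam Kumar,XI,89
"""

def load_students(source):
    """Read CSV data from a file path or a file-like object into a DataFrame."""
    return pd.read_csv(source)

# io.StringIO makes the string behave like an open text file.
sdf = load_students(io.StringIO(csv_text))
print(sdf)
print(sdf.shape)  # (number of rows, number of columns)
```

With a real file, the same function would be called with the path instead, for example `load_students("stu_result.csv")`.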
The examples that follow read a CSV file named stu_result.csv with the following contents:

Adm_No   Name            Class   Marks
1201     Aniket Sharma   XII     83
1203     Anita Gupta     XII     91
1206     Gautam Kumar    XI      89
1207     Mahesh Singh    XII     94
1209     Pratik Mehra    XI      90
1214     Nikita Verma    XII     92

Example 1
# read the csv file in a DataFrame
import pandas as pp
# pp = alias-name of pandas library
sdf = pp.read_csv("D:/CPP/python_practice/stu_result.csv")
# OR, read_csv("stu_result.csv"), if the file is in the same folder as our program
print(sdf)

The read_csv() method has many parameters to control the kind of data imported to create the DataFrame.

Example 2
To show the shape (number of rows and columns) of the CSV file imported in a DataFrame
r, c = sdf.shape
print("\nTotal rows", r, "Total columns", c)
Similarly, we can use <data-frame>.size to find the number of values in the DataFrame.

Example 3
To read CSV file with specific / selected columns
# usecols = to read selected columns only
DF3 = pp.read_csv("stu_result.csv", usecols=['Adm_No', 'Name', 'Class'])
print("\nDataFrame is\n", DF3)

Example 4
To read CSV file with specific / selected rows
# nrows = number of rows to read; here only the first four records
DF = pp.read_csv("stu_result.csv", nrows=4)
print("\nFirst four records of DataFrame \n ", DF)

Example 5
To read CSV file without header
# header=None: the first line of the file is not used as column headings;
# it is read as data and the columns are numbered 0, 1, 2, ...
DH = pp.read_csv("stu_result.csv", header=None)
print("The DataFrame is\n", DH)

Example 6
To read CSV file without the default index
# index_col=0: use the first column of the file as the row index
# when we do not want the default row indices 0, 1, 2, ...
df2 = pp.read_csv("stu_result.csv", index_col=0)
print(df2)
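
The read_csv() parameters used in Examples 3 to 6 can be compared side by side. This is a minimal sketch that assumes the stu_result.csv contents shown in the table above, held in memory as `CSV_TEXT`; the helper `read` is hypothetical and only avoids repeating the `io.StringIO` call.

```python
import io
import pandas as pd

# Same records as the stu_result.csv table, kept in memory for the sketch.
CSV_TEXT = """Adm_No,Name,Class,Marks
1201,Aniket Sharma,XII,83
1203,Anita Gupta,XII,91
1206,Gautam Kumar,XI,89
1207,Mahesh Singh,XII,94
1209,Pratik Mehra,XI,90
1214,Nikita Verma,XII,92
"""

def read(**kwargs):
    """Call read_csv on a fresh in-memory copy of the sample data."""
    return pd.read_csv(io.StringIO(CSV_TEXT), **kwargs)

selected_cols = read(usecols=['Adm_No', 'Name', 'Class'])  # only three columns
first_four = read(nrows=4)                                 # only four records
no_header = read(header=None)    # heading line becomes the first data row
by_adm_no = read(index_col=0)    # Adm_No becomes the row index

for label, frame in [('usecols', selected_cols), ('nrows', first_four),
                     ('header=None', no_header), ('index_col=0', by_adm_no)]:
    print(label, frame.shape)
```

Note that header=None gives one extra row, because the heading line is then treated as data.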

Here, in the output of Example 6, Adm_No will be the first column instead of the default row indices.

Example 7
To read CSV file with new column names
# to use column names different from those in the file, use skiprows along with names
DF = pp.read_csv("stu_result.csv", skiprows=1,
                 names=['StuNo', 'SName', 'SClass', 'T_Marks'])
print('DataFrame\n', DF)

Display/Iteration of DataFrame:-
import pandas as pd
L1 = [1, 2, 3, 4, 5]
L2 = [10, 20, 30, 40, 50]
df = pd.DataFrame([L1, L2], columns=['a', 'b', 'c', 'd', 'e'])
print(df) # display entire DataFrame
Output:
    a   b   c   d   e
0   1   2   3   4   5
1  10  20  30  40  50

Display columns
print(df['a']) # display data of particular column (column a)
Output:
0     1
1    10
Name: a, dtype: int64

print(df[['a','c','e']]) # display data of multiple columns (columns a, c and e)
Output:
    a   c   e
0   1   3   5
1  10  30  50

Display rows using loc method:-
Syntax-
<DataFrame object>.loc[<startrow>:<endrow>,<startcolumn>:<endcolumn>]

Examples:
print(df.loc[1]) # display data of particular single row (row 1)
Output:
a    10
b    20
c    30
d    40
e    50
Name: 1, dtype: int64

print(df.loc[0:1]) # display data of multiple rows by using slicing (rows 0 and 1)
Output:
    a   b   c   d   e
0   1   2   3   4   5
1  10  20  30  40  50

print(df.loc[0:1,'a']) # display data of multiple rows with single column by using slicing (rows 0, 1 and column a)
Output:
0     1
1    10
Name: a, dtype: int64

print(df.loc[0:1,'a':'c']) # display data of multiple rows with multiple columns using slicing (rows 0, 1 and columns a, b, c)
Output:
    a   b   c
0   1   2   3
1  10  20  30

Display rows using iloc method:-
This method is used when the DataFrame object does not have row and column labels, or when we may not remember them. It works on numeric (integer position) index.
Syntax:-
<DataFrame object>.iloc[<startrowindex>:<endrowindex>,<startcolumnindex>:<endcolumnindex>]
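
The loc and iloc forms look alike but treat the end of a slice differently. A minimal sketch, reusing the two-row df defined above (the names `by_label` and `by_position` are chosen here for illustration):

```python
import pandas as pd

L1 = [1, 2, 3, 4, 5]
L2 = [10, 20, 30, 40, 50]
df = pd.DataFrame([L1, L2], columns=['a', 'b', 'c', 'd', 'e'])

# loc slices by label and includes BOTH ends:
# rows 0 and 1, columns 'a', 'b' and 'c'.
by_label = df.loc[0:1, 'a':'c']

# iloc slices by integer position and EXCLUDES the end:
# row 0 only, columns at positions 0 and 1 ('a' and 'b').
by_position = df.iloc[0:1, 0:2]

print(by_label)
print(by_position)
```

The same slice numbers 0:1 therefore select two rows with loc but only one row with iloc.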
Examples:
print(df.iloc[0:2,1:3]) # display rows at index 0, 1 and columns at index 1, 2
Output:
    b   c
0   2   3
1  20  30
print(df.iloc[0:2,:]) # display rows at index 0, 1 with all columns
Output:
    a   b   c   d   e
0   1   2   3   4   5
1  10  20  30  40  50

Difference between loc and iloc method:-
In the loc method both the start label and the end label are included, but in the iloc method the end index is excluded when given as start:end.

Operations on rows and columns in DataFrames:-
We can perform some basic operations on rows and columns of a DataFrame like selection, deletion, addition, and renaming.
import pandas as pd
# named marks rather than dict, so that the built-in dict is not hidden
marks = {'Arnab': pd.Series([90, 91, 97], index=['Maths', 'Science', 'Hindi']),
         'Ramit': pd.Series([92, 81, 96], index=['Maths', 'Science', 'Hindi']),
         'Samridhi': pd.Series([89, 91, 88], index=['Maths', 'Science', 'Hindi']),
         'Riya': pd.Series([81, 71, 67], index=['Maths', 'Science', 'Hindi']),
         'Mallika': pd.Series([94, 95, 99], index=['Maths', 'Science', 'Hindi'])}
ResultDF = pd.DataFrame(marks)
print(ResultDF)
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika
Maths       90     92        89    81       94
Science     91     81        91    71       95
Hindi       97     96        88    67       99

Adding a New Column to a DataFrame: To add a new column to the DataFrame ResultDF we can write the following statement:
>>>ResultDF['Radha']=[89,78,76]
Or
>>>ResultDF.loc[:,'Radha']=[89,78,76]
>>>print(ResultDF)
Output:-
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths       90     92        89    81       94     89
Science     91     81        91    71       95     78
Hindi       97     96        88    67       99     76
Note: The at[] accessor reads or writes a single value identified by one row label and one column label. In recent pandas versions it cannot be given a list to fill a whole column or row, so use loc[] for that.
Note: Assigning values to a column label that does not exist will create a new column at the end. If the label already exists, the assignment statement will update the values of the existing column.
Example :
>>>ResultDF['Ramit']=[99, 98, 78]
>>>print(ResultDF)
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths       90     99        89    81       94     89
Science     91     98        91    71       95     78
Hindi       97     78        88    67       99     76

Adding a New Row to a DataFrame: To add a new row to a DataFrame we can use the DataFrame.loc[ ] method.
Suppose we want to add English marks in the above DataFrame; we can write the following statement:
>>>ResultDF.loc['English'] = [85, 86, 83, 80, 90, 89]
>>>print(ResultDF)
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths       90     99        89    81       94     89
Science     91     98        91    71       95     78
Hindi       97     78        88    67       99     76
English     85     86        83    80       90     89

The DataFrame.loc[] method can also be used to change the data values of a row to a particular value.
Example: to set marks in 'Maths' for all columns to 0:
>>>ResultDF.loc['Maths']=0
>>>print(ResultDF)
Output:
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths        0      0         0     0        0      0
Science     91     98        91    71       95     78
Hindi       97     78        88    67       99     76
English     85     86        83    80       90     89
>>>ResultDF[:] = 0 # Set all values in ResultDF to 0
>>>ResultDF
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths        0      0         0     0        0      0
Science      0      0         0     0        0      0
Hindi        0      0         0     0        0      0
English      0      0         0     0        0      0

Selecting / Accessing Data from DataFrame :
DataFrame : DF5
         Population  Hospital  Schools
Delhi      10927986       189     7916
Mumbai     12691836       208     8508
Kolkata     4631392       149     7226

Selecting / Accessing a column: Just use the following syntax
<DF_object>[column_name] or <DF_object>.<column_name>
Example :
>>>DF5.Population
Output:-
Delhi      10927986
Mumbai     12691836
Kolkata     4631392

Selecting / Accessing multiple columns: Just use the following syntax
<DF_object>[[<column_name1>,<column_name2>,<column_name3>......]]
Example :
>>>DF5[['Population','Hospital']]
Output:-
         Population  Hospital
Delhi      10927986       189
Mumbai     12691836       208
Kolkata     4631392       149

Selecting / Accessing a subset from a DataFrame using Row / Column Names: Use the following syntax :-
<DF_object>.loc[<start_row>:<end_row>,<start_column>:<end_column>]
or
<DF_object>.iloc[<start_row_index>:<end_row_index>,<start_column_index>:<end_column_index>]
Example 1. >>>DF5.loc['Mumbai':'Mumbai','Population':'Schools']
Output:
        Population  Hospital  Schools
Mumbai    12691836       208     8508
Example 2. >>>DF5.iloc[0:2,0:2]
Output: -
        Population  Hospital
Delhi     10927986       189
Mumbai    12691836       208

Deleting Rows or Columns from a DataFrame: The DataFrame.drop() method is used to delete rows and columns from a DataFrame. To delete a row set the parameter axis=0 and for deleting a column set axis=1. Consider the following DataFrame:
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths       90     99        89    81       94     89
Science     91     98        91    71       95     78
Hindi       97     78        88    67       99     76
English     85     86        83    80       90     89
To delete the row with label 'Science' we can write the following statement:
>>>ResultDF = ResultDF.drop('Science', axis=0)
>>>ResultDF
Output :
         Arnab  Ramit  Samridhi  Riya  Mallika  Radha
Maths       90     99        89    81       94     89
Hindi       97     78        88    67       99     76
English     85     86        83    80       90     89
To delete the columns having labels 'Samridhi', 'Ramit' and 'Riya' we can write the following statement:-
>>>ResultDF = ResultDF.drop(['Samridhi','Ramit','Riya'], axis=1)
>>>ResultDF
Output:
         Arnab  Mallika  Radha
Maths       90       94     89
Hindi       97       99     76
English     85       90     89

Renaming Row Labels of a DataFrame: The DataFrame.rename() method is used to rename the row and column labels. To rename the row indices Maths to Sub1 and Hindi to Sub2 in the above DataFrame we can write the following statement:-
>>>ResultDF=ResultDF.rename({'Maths':'Sub1','Hindi':'Sub2'}, axis='index')
>>>print(ResultDF)
Output:
         Arnab  Mallika  Radha
Sub1        90       94     89
Sub2        97       99     76
English     85       90     89
Note: The parameter axis='index' is used to specify that the row labels are to be changed and axis='columns' to specify that the column labels are to be changed.

Renaming Column Labels of a DataFrame:
>>>ResultDF=ResultDF.rename({'Arnab':'Student1','Mallika':'Student2','Radha':'Student3'}, axis='columns')
>>>print(ResultDF)
Output:
         Student1  Student2  Student3
Sub1           90        94        89
Sub2           97        99        76
English        85        90        89
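
The addition, deletion and renaming operations described in this part can be chained on a small DataFrame. The following is a minimal sketch with a reduced two-student version of the marks data; the variable names `result`, `trimmed` and `renamed` are chosen here for illustration.

```python
import pandas as pd

subjects = ['Maths', 'Science', 'Hindi']
result = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Mallika': [94, 95, 99]},
    index=subjects,
)

# Add a column, then a row; loc accepts a whole list of values.
result['Radha'] = [89, 78, 76]
result.loc['English'] = [85, 90, 89]

# drop() returns a new DataFrame; 'result' itself keeps the Science row
# unless the returned value is assigned back to it.
trimmed = result.drop('Science', axis=0)

# rename() maps old labels to new labels along the chosen axis.
renamed = trimmed.rename({'Maths': 'Sub1', 'Hindi': 'Sub2'}, axis='index')
renamed = renamed.rename({'Arnab': 'Student1'}, axis='columns')
print(renamed)
```

Because drop() and rename() return new objects by default, the notes above always assign the result back with `ResultDF = ResultDF.drop(...)` or `ResultDF = ResultDF.rename(...)`.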
Indexing and Boolean indexing:-

In Boolean indexing, we select data based on the actual values of the data and not on their row/column labels or integer locations. If we provide a list of Boolean values as the index, then only those rows will be selected where True is stored. Consider the following code for the DataFrame df1:
        Hindi  English  IP
Aditya     34       23  67
Aman       34       85  56
Rajesh     60       80  91
Mohit      45       21  32
print(df1[[True,False,False,True]])

OUTPUT
        Hindi  English  IP
Aditya     34       23  67
Mohit      45       21  32
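
Selection with a list of Boolean values can be tried as follows. This is a minimal sketch that rebuilds df1 from the table above; the names `mask` and `selected` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Hindi': [34, 34, 60, 45],
     'English': [23, 85, 80, 21],
     'IP': [67, 56, 91, 32]},
    index=['Aditya', 'Aman', 'Rajesh', 'Mohit'],
)

# One Boolean per row: True keeps the row, False leaves it out.
# The list must have exactly as many entries as the DataFrame has rows.
mask = [True, False, False, True]
selected = df1[mask]
print(selected)
```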

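Find the details of students whose English marks are more than 50
print(df1[df1['English']>50])
df1['English']>50 will return False, True, True, False, so only those rows are displayed where English marks are more than 50.

OUTPUT
        Hindi  English  IP
Aman       34       85  56
Rajesh     60       80  91

Find the details of students who secured 34 marks in Hindi
print(df1[df1['Hindi']==34])
This will display the rows where 34 marks is stored in Hindi.

OUTPUT
        Hindi  English  IP
Aditya     34       23  67
Aman       34       85  56

Find the details of students whose marks in the IP subject are more than the average marks of the IP subject
df1['IP'].mean() will return the average marks for IP, which is 61.5
df1['IP']>df1['IP'].mean() will return a Series of [True,False,True,False]
So this code can be used as the index to get the desired result
print(df1[df1['IP']>df1['IP'].mean()])

OUTPUT
        Hindi  English  IP
Aditya     34       23  67
Rajesh     60       80  91

We can include specific column(s) in our output in two ways.
To display only the IP column in place of all columns we can modify the above code as given below
df1[df1['IP']>df1['IP'].mean()]['IP']
OR
df1.loc[df1['IP']>df1['IP'].mean(),'IP']

OUTPUT
Aditya    67
Rajesh    91
Name: IP, dtype: int64

If Hindi and IP marks are to be displayed for the same problem stated above, the code will be
df1[df1['IP']>df1['IP'].mean()][['Hindi','IP']]
OR
df1.loc[df1['IP']>df1['IP'].mean(),['Hindi','IP']]

OUTPUT
        Hindi  IP
Aditya     34  67
Rajesh     60  91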
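
The above-average selection for the IP subject can be written out step by step. This is a sketch under the assumption that df1 holds the marks table used in this part; the names `above_avg`, `details` and `hindi_ip` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Hindi': [34, 34, 60, 45],
     'English': [23, 85, 80, 21],
     'IP': [67, 56, 91, 32]},
    index=['Aditya', 'Aman', 'Rajesh', 'Mohit'],
)

# Step 1: a comparison on a column gives a Boolean Series, one value per row.
above_avg = df1['IP'] > df1['IP'].mean()

# Step 2: using that Series as the index keeps only the True rows.
details = df1[above_avg]

# Step 3: loc takes the Boolean Series for rows and a list of labels for columns.
hindi_ip = df1.loc[above_avg, ['Hindi', 'IP']]

print(details)
print(hindi_ip)
```

Storing the condition in a variable first keeps the selection readable when the same condition is reused for different column subsets.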
