Data Handling Python NCERT
In this chapter
»» Introduction to Python Libraries
»» Series
»» DataFrame
»» Importing and Exporting Data between CSV Files
and DataFrames
»» Pandas Series Vs NumPy ndarray
2.1 Introduction to Python Libraries
Python libraries contain a collection of built-in
modules that allow us to perform many actions
without writing detailed programs for it. Each
library in Python contains a large number of
modules that one can import and use.
NumPy, Pandas and Matplotlib are three
well-established Python libraries for scientific
and analytical use. These libraries allow us
to manipulate, transform and visualise data
easily and efficiently.
NumPy, which stands for ‘Numerical
Python’, is a library we discussed in class
XI. Recall that it is a package that can
be used for numerical data analysis and
Output:
0 10
1 20
2 30
dtype: int64
Output:
0 1
1 2
2 3
3 4
dtype: int32
>>> seriesCapCntry[[3,2]]
France Paris
UK London
dtype: object
>>> seriesCapCntry[['UK','USA']]
UK London
USA WashingtonDC
dtype: object
The index values associated with the series can be
altered by assigning new index values as shown in
the following example:
>>> seriesCapCntry.index=[10,20,30,40]
>>> seriesCapCntry
10 NewDelhi
20 WashingtonDC
30 London
40 Paris
dtype: object
(B) Slicing
Sometimes, we may need to extract a part of a series.
This can be done through slicing. This is similar to
slicing used with NumPy arrays. We can define which
part of the series is to be sliced by specifying the start
and end parameters [start :end] with the series name.
When we use positional indices for slicing, the value
at the end index position is excluded, i.e., only (end -
start) number of data values of the series are extracted.
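The difference between the two kinds of slice can be sketched as follows. The data is reconstructed from the seriesCapCntry outputs shown in this section, and the names byPosition and byLabel are illustrative only. The sketch uses .iloc and .loc to make the intent explicit; this is an assumption about style, not the chapter's own notation.

```python
import pandas as pd

# Data reconstructed from the seriesCapCntry outputs in this section.
seriesCapCntry = pd.Series(
    ['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
    index=['India', 'USA', 'UK', 'France'])

# Positional slice: the end position 3 is excluded, so 3 - 1 = 2 values.
byPosition = seriesCapCntry.iloc[1:3]
print(byPosition)

# Label slice: the end label 'France' is included, so 3 values.
byLabel = seriesCapCntry.loc['USA':'France']
print(byLabel)
```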
Consider the following series seriesCapCntry:
USA WashingtonDC
UK London
dtype: object
USA WashingtonDC
UK London
France Paris
dtype: object
>>> seriesAlph[1:3] = 50
>>> seriesAlph
a 10
b 50
c 50
d 13
e 14
f 15
dtype: int32
Observe that updating the values in a series using
slicing also excludes the value at the end index position.
But, it changes the value at the end index label when
slicing is done using labels.
>>> seriesAlph['c':'e'] = 500
>>> seriesAlph
a 10
b 50
c 500
d 500
e 500
f 15
dtype: int32
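The two updates above can be reproduced in one short sketch. The starting values of seriesAlph (10 to 15, labelled 'a' to 'f') are inferred from the outputs shown, and .iloc and .loc are used here to state explicitly whether a slice is positional or label based.

```python
import numpy as np
import pandas as pd

# Assumed starting data: values 10..15 labelled 'a'..'f'.
seriesAlph = pd.Series(np.arange(10, 16),
                       index=['a', 'b', 'c', 'd', 'e', 'f'])

# Positional slice: positions 1 and 2 change, position 3 does not.
seriesAlph.iloc[1:3] = 50

# Label slice: 'c', 'd' and the end label 'e' all change.
seriesAlph.loc['c':'e'] = 500

print(seriesAlph)
```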
>>> seriesCapCntry
India NewDelhi
USA WashingtonDC
UK London
France Paris
dtype: object
>>> seriesTenTwenty.head()
0 10
1 11
2 12
3 13
4 14
dtype: int32
count() Returns the number of non-NaN values in the Series.
>>> seriesTenTwenty.count()
10
tail(n) Returns the last n members of the series. If the value for n is not passed, then by default n takes 5 and the last five members are displayed.
>>> seriesTenTwenty.tail(2)
8 18
9 19
dtype: int32
>>> seriesTenTwenty.tail()
5 15
6 16
7 17
8 18
9 19
dtype: int32
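The three methods can be tried together in a short script. The definition of seriesTenTwenty below is an assumption inferred from the outputs above (the integers 10 to 19 with the default index); the variable names firstFive, lastTwo and nonMissing are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed definition: the ten integers 10..19 with the default index.
seriesTenTwenty = pd.Series(np.arange(10, 20))

firstFive = seriesTenTwenty.head()    # n defaults to 5
lastTwo = seriesTenTwenty.tail(2)     # last two members
nonMissing = seriesTenTwenty.count()  # number of non-NaN values

print(firstFive)
print(lastTwo)
print(nonMissing)
```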
>>> seriesA
a 1
b 2
c 3
d 4
e 5
dtype: int64
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
y 980.0
z 990.0
dtype: float64
(C) Multiplication of two Series
Again, it can be done in two different ways, as shown in
the following examples:
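A minimal sketch of the two ways, presumably the * operator and the mul() method. The values of seriesB are an assumption inferred from the other outputs in this section, and fill_value=0 is one possible choice for the missing entries.

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Assumed values for seriesB, inferred from the outputs in this section.
seriesB = pd.Series([10, 20, -10, -50, 100],
                    index=['z', 'y', 'a', 'c', 'e'])

# Way 1: the operator. Labels present in only one series give NaN.
productOperator = seriesA * seriesB

# Way 2: the method. fill_value=0 stands in for a missing value
# before multiplying, so no NaN appears in the result.
productMethod = seriesA.mul(seriesB, fill_value=0)

print(productOperator)
print(productMethod)
```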
e 0.05
y NaN
z NaN
dtype: float64
Let us now replace the missing values with 0 before
dividing seriesA by seriesB using explicit division
method div().
a -0.10
b inf
c -0.06
d inf
e 0.05
y 0.00
z 0.00
dtype: float64
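The output above is consistent with the following sketch, in which the values of seriesB are an assumption inferred from the results. With fill_value=0, a label missing from seriesB makes the divisor 0 (giving inf), and a label missing from seriesA makes the dividend 0 (giving 0).

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Assumed values for seriesB, inferred from the outputs in this section.
seriesB = pd.Series([10, 20, -10, -50, 100],
                    index=['z', 'y', 'a', 'c', 'e'])

# Missing values on either side are replaced with 0 before dividing.
quotient = seriesA.div(seriesB, fill_value=0)
print(quotient)
```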
2.3 DataFrame
Sometimes we need to work on multiple columns at
a time, i.e., we need to process tabular data, for
example, the result of a class, items in a restaurant’s
menu, the reservation chart of a train, etc. Pandas stores
such tabular data using a DataFrame. A DataFrame is
a two-dimensional labelled data structure like a table
of MySQL. It contains rows and columns, and therefore
has both a row and column index. Each column can
have a different type of value such as numeric, string,
boolean, etc., as in tables of a database.
[Figure: structure of a DataFrame. Column indexes: State, Geographical Area (sq Km), Area under Very Dense Forests (sq Km). Row indexes label each row, e.g. row 1: Assam, 78438, 2797.]
>>> ResultSheet={
'Arnab': pd.Series([90, 91, 97],
index=['Maths','Science','Hindi']),
'Ramit': pd.Series([92, 81, 96],
index=['Maths','Science','Hindi']),
'Samridhi': pd.Series([89, 91, 88],
index=['Maths','Science','Hindi']),
'Riya': pd.Series([81, 71, 67],
index=['Maths','Science','Hindi']),
'Mallika': pd.Series([94, 95, 99],
index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(ResultSheet)
>>> ResultDF
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Science 91 81 91 71 95
Hindi 97 96 88 67 99
Activity 2.7
Use the type function to check the datatypes
of ResultSheet and ResultDF. Are they the
same?
The following output shows that every column in the
DataFrame is a Series:
>>> type(ResultDF.Arnab)
<class 'pandas.core.series.Series'>
When a DataFrame is created from a Dictionary of
Series, the resulting index or row labels are a union of all
series indexes used to create the DataFrame. For example:
dictForUnion = { 'Series1' :
pd.Series([1,2,3,4,5],
index = ['a', 'b', 'c', 'd', 'e']) ,
'Series2' :
pd.Series([10,20,-10,-50,100],
index = ['z', 'y', 'a', 'c', 'e']),
'Series3' :
pd.Series([10,20,-10,-50,100],
index = ['z', 'y', 'a', 'c', 'e']) }
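Passing this dictionary to pd.DataFrame() shows the union of the indexes. The sketch below repeats the dictionary so that it runs on its own; the name dFrameForUnion is illustrative. Labels that a series does not contain are filled with NaN in its column.

```python
import pandas as pd

dictForUnion = {
    'Series1': pd.Series([1, 2, 3, 4, 5],
                         index=['a', 'b', 'c', 'd', 'e']),
    'Series2': pd.Series([10, 20, -10, -50, 100],
                         index=['z', 'y', 'a', 'c', 'e']),
    'Series3': pd.Series([10, 20, -10, -50, 100],
                         index=['z', 'y', 'a', 'c', 'e'])}

# Row labels are the union of the three indexes: a, b, c, d, e, y, z.
dFrameForUnion = pd.DataFrame(dictForUnion)
print(dFrameForUnion)
```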
>>> ResultDF['Arnab']=90
>>> ResultDF
Arnab Ramit Samridhi Riya Mallika Preeti
Maths 90 99 89 81 94 89
Science 90 98 91 71 95 78
Hindi 90 78 88 67 99 76
(B) Adding a New Row to a DataFrame
We can add a new row to a DataFrame using
DataFrame.loc[ ]. Consider the DataFrame
ResultDF that has three rows for the three subjects –
Maths, Science and Hindi. Suppose we need to add the
marks for the English subject to ResultDF; we can use the
following statement:
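A minimal sketch of such a statement is given here. ResultDF is rebuilt from the table shown in this section so that the code runs on its own, and the English marks are hypothetical values chosen only for illustration.

```python
import pandas as pd

# ResultDF rebuilt from the table shown in this section.
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88], 'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99], 'Preeti': [89, 78, 76]},
    index=['Maths', 'Science', 'Hindi'])

# Assigning to a row label that does not exist yet adds a new row.
# The marks below are hypothetical, one value per column.
ResultDF.loc['English'] = [85, 86, 83, 80, 90, 89]
print(ResultDF)
```

If the label already exists, the same statement overwrites that row instead of adding a new one.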
>>> ResultDF
Arnab Ramit Samridhi Riya Mallika Preeti
Maths 90 92 89 81 94 89
Science 91 81 91 71 95 78
Hindi 97 96 88 67 99 76
>>> ResultDF
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Science 91 81 91 71 95
Hindi 97 96 88 67 99
Hindi 97 89 78 60 45
To remove the duplicate rows labelled ‘Hindi’, we
need to write the following statement:
>>> ResultDF= ResultDF.drop('Hindi', axis=0)
>>> ResultDF
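The effect of this statement can be checked with the sketch below, which rebuilds the DataFrame with the duplicate label from the table above. Note that drop() removes every row carrying the label 'Hindi', not only the repeated one.

```python
import pandas as pd

# ResultDF with a duplicate 'Hindi' row, as in the table above.
ResultDF = pd.DataFrame(
    [[90, 92, 89, 81, 94],
     [91, 81, 91, 71, 95],
     [97, 96, 88, 67, 99],
     [97, 89, 78, 60, 45]],
    index=['Maths', 'Science', 'Hindi', 'Hindi'],
    columns=['Arnab', 'Ramit', 'Samridhi', 'Riya', 'Mallika'])

# axis=0 means the label refers to rows; both 'Hindi' rows are removed.
ResultDF = ResultDF.drop('Hindi', axis=0)
print(ResultDF)
```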
>>> ResultDF.loc['Science']
Arnab 91
Ramit 81
Samridhi 91
Riya 71
Mallika 95
Name: Science, dtype: int64
Also, note that when the row label is passed as an
integer value, it is interpreted as a label of the index and
not as an integer position along the index, for example:
>>> dFrame10Multiples = pd.DataFrame([10,20,30,40,50])
>>> dFrame10Multiples.loc[2]
0 30
Name: 2, dtype: int64
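With the default index, labels and positions coincide, so the distinction is easier to see when the index does not start at 0. The sketch below uses a hypothetical index of 2 to 6 and contrasts .loc with the position based .iloc; the name dFrameShifted is illustrative.

```python
import pandas as pd

# Hypothetical integer labels 2..6 instead of the default 0..4.
dFrameShifted = pd.DataFrame([10, 20, 30, 40, 50],
                             index=[2, 3, 4, 5, 6])

# .loc treats 2 as a label: the row labelled 2 holds 10.
byLabel = dFrameShifted.loc[2]

# .iloc treats 2 as a position: the third row holds 30.
byPosition = dFrameShifted.iloc[2]

print(byLabel)
print(byPosition)
```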
When a single column label is passed, it returns the column
as a Series.
>>> ResultDF.loc[:,'Arnab']
Maths 90
Science 91
Hindi 97
Name: Arnab, dtype: int64
Also, we can obtain the same result, that is, the marks
of ‘Arnab’ in all the subjects, by using the command:
>>> print(df['Arnab'])
Maths 56
Science 91
English 97
Hindi 97
Name: Arnab, dtype: int64
To read more than one row from a DataFrame, a list
of row labels is used as shown below. Note that using [[]]
returns a DataFrame.
>>> ResultDF.loc[['Science', 'Hindi']]
Arnab Ramit Samridhi Riya Mallika
Science 91 81 91 71 95
Hindi 97 96 88 67 99
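The difference in return type between a single label and a list of labels can be verified as follows. ResultDF is rebuilt from the table in this section; the variable names oneRow, twoRows and oneRowFrame are illustrative.

```python
import pandas as pd

# ResultDF rebuilt from the table shown in this section.
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88], 'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'])

oneRow = ResultDF.loc['Science']              # single label: Series
twoRows = ResultDF.loc[['Science', 'Hindi']]  # list of labels: DataFrame
oneRowFrame = ResultDF.loc[['Science']]       # one-item list: DataFrame

print(type(oneRow))
print(type(twoRows))
print(type(oneRowFrame))
```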
(B) Boolean Indexing
Boolean means a binary variable that can represent
either of the two states - True (indicated by 1) or False
(indicated by 0). In Boolean indexing, we can select
the subsets of data based on the actual values in the
DataFrame rather than their row/column labels. Thus,
we can use conditions on column names to filter data
values. Consider the DataFrame ResultDF, the following
statement displays True or False depending on whether
the data value satisfies the given condition or not.
>>> ResultDF.loc['Maths'] > 90
Arnab False
Ramit True
Samridhi False
Riya False
Mallika True
Name: Maths, dtype: bool
To check in which subjects ‘Arnab’ has scored more
than 90, we can write:
>>> ResultDF.loc[:,'Arnab']>90
Maths False
Science True
Hindi True
Name: Arnab, dtype: bool
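The True/False result can itself be passed back to .loc[ ] to keep only the rows where the condition holds. A minimal sketch, with ResultDF rebuilt from the table in this section and the names mask and highRows chosen for illustration:

```python
import pandas as pd

# ResultDF rebuilt from the table shown in this section.
ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96],
     'Samridhi': [89, 91, 88], 'Riya': [81, 71, 67],
     'Mallika': [94, 95, 99]},
    index=['Maths', 'Science', 'Hindi'])

# One True/False value per subject.
mask = ResultDF.loc[:, 'Arnab'] > 90

# Only the rows where the mask is True are kept.
highRows = ResultDF.loc[mask]
print(highRows)
```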
>>> dFrame1=dFrame1.append(dFrame2)
>>> dFrame1
C1 C2 C3 C5
R1 1.0 2.0 3.0 NaN
R2 4.0 5.0 NaN NaN
R3 6.0 NaN NaN NaN
R4 NaN 10.0 NaN 20.0
R2 NaN 30.0 NaN NaN
R5 NaN 40.0 NaN 50.0
Alternatively, if we append dFrame1 to dFrame2, the
rows of dFrame2 precede the rows of dFrame1. To make
the column labels appear in sorted order, we can set the
parameter sort=True. The column labels appear in
unsorted order when the parameter sort=False.
# append dFrame1 to dFrame2
>>> dFrame2 = dFrame2.append(dFrame1,
sort=True)
>>> dFrame2
C1 C2 C3 C5
R4 NaN 10.0 NaN 20.0
R2 NaN 30.0 NaN NaN
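DataFrame.append() was removed in pandas 2.0, so the statements above fail on recent versions. The sketch below gets the same row-wise result with pd.concat(), which also accepts a sort parameter. The definitions of dFrame1 and dFrame2 are reconstructed from the outputs above and should be read as assumptions; the name combined is illustrative.

```python
import numpy as np
import pandas as pd

# Definitions reconstructed from the outputs shown above.
dFrame1 = pd.DataFrame(
    [[1, 2, 3], [4, 5, np.nan], [6, np.nan, np.nan]],
    columns=['C1', 'C2', 'C3'], index=['R1', 'R2', 'R3'])
dFrame2 = pd.DataFrame(
    [[10, 20], [30, np.nan], [40, 50]],
    columns=['C2', 'C5'], index=['R4', 'R2', 'R5'])

# Rows of dFrame1 come first, then the rows of dFrame2.
# Columns missing from one DataFrame are filled with NaN.
combined = pd.concat([dFrame1, dFrame2], sort=False)
print(combined)
```

Reversing the order of the list, as in pd.concat([dFrame2, dFrame1]), puts the rows of dFrame2 first.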
>>> ForestArea = {
'Assam' :pd.Series([78438, 2797,
10192, 15116], index = ['GeoArea', 'VeryDense',
'ModeratelyDense', 'OpenForest']),
'Kerala' :pd.Series([ 38852, 1663,
9407, 9251], index = ['GeoArea' ,'VeryDense',
'ModeratelyDense', 'OpenForest']),
'Delhi' :pd.Series([1483, 6.72, 56.24,
129.45], index = ['GeoArea', 'VeryDense',
'ModeratelyDense', 'OpenForest'])}
>>> ResultDF
Pandas Series vs NumPy ndarray
Pandas Series: In a series we can define our own labeled index to access elements of an array. These can be numbers or letters.
NumPy ndarray: NumPy arrays are accessed by their integer position using numbers only.
Pandas Series: The elements can be indexed in descending order also.
NumPy ndarray: The indexing starts with zero for the first element and the index is fixed.
Pandas Series: If two series are not aligned, NaN or missing values are generated.
NumPy ndarray: There is no concept of NaN values and if there are no matching values in arrays, alignment fails.
Pandas Series: Series require more memory.
NumPy ndarray: NumPy occupies less memory.
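The alignment difference can be illustrated with a small sketch. The values and names (s1, s2, total, arrayError) are hypothetical. Two Series with different labels still add, with NaN where a label has no partner, whereas two arrays of different lengths cannot be added element by element.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'd'])

# Labels are aligned first: only 'b' exists in both series.
total = s1 + s2
print(total)

# Arrays have no labels to align on; lengths 3 and 2 do not match.
arrayError = None
try:
    np.array([1, 2, 3]) + np.array([10, 20])
except ValueError as err:
    arrayError = str(err)
print(arrayError)
```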
Summary
• NumPy, Pandas and Matplotlib are Python
libraries for scientific and analytical use.
• pip install pandas is the command to install
Pandas library.
• A data structure is a collection of data values
and the operations that can be applied to that
data. It enables efficient storage, retrieval and
modification to the data.
• Two main data structures in Pandas library
are Series and DataFrame. To use these
data structures, we first need to import the
Pandas library.
• A Series is a one-dimensional array containing a
sequence of values. Each value has a data label
associated with it also called its index.
• The two common ways of accessing the elements
of a series are Indexing and Slicing.
• There are two types of indexes: positional index
and labelled index. Positional index takes an
integer value that corresponds to its position in
the series starting from 0, whereas labelled index
takes any user-defined label as index.
• When positional indices are used for slicing, the
value at end index position is excluded, i.e., only
(end - start) number of data values of the series
are extracted. However, with labelled indexes the
value at the end index label is also included in
the output.
• All basic mathematical operations can be
performed on Series either by using the
operator or by using appropriate methods of the
Series object.
• While performing mathematical operations index
matching is implemented and if no matching
indexes are found during alignment, Pandas
returns NaN so that the operation does not fail.
• A DataFrame is a two-dimensional labeled data
structure like a spreadsheet. It contains rows
and columns and therefore has both a row and
column index.
• When using a dictionary to create a DataFrame,
keys of the Dictionary become the column labels
of the DataFrame. A DataFrame can be thought of
as a dictionary of lists/ Series (all Series/columns
sharing the same index label for a row).
• Data can be loaded in a DataFrame from a file on
the disk by using Pandas read_csv function.
• Data in a DataFrame can be written to a text
file on disk by using the
pandas.DataFrame.to_csv() function.
• DataFrame.T gives the transpose of a DataFrame.
• Pandas has a number of methods that support
label based indexing but every label asked for
must be in the index, or a KeyError will be raised.
• DataFrame.loc[ ] is used for label based indexing
of rows in DataFrames.
• The pandas.DataFrame.append() method is used to
merge two DataFrames. It was removed in pandas 2.0,
where pandas.concat() serves the same purpose.
• Pandas supports non-unique index values. Only
if a particular operation that does not support
duplicate index values is attempted, an exception
is raised at that time.
• The basic difference between Pandas Series and
NumPy ndarray is that operations between Series
automatically align the data based on labels. Thus,
we can write computations without considering
whether all Series involved have the same label or
not whereas in case of ndarrays it raises an error.
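The points above on to_csv(), read_csv() and DataFrame.T can be tried in one short sketch. It writes to an in-memory text buffer instead of a file on disk so that it runs anywhere; the DataFrame marks and the other names are illustrative, with values taken from the ResultDF table in this chapter.

```python
import io
import pandas as pd

marks = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96]},
    index=['Maths', 'Science', 'Hindi'])

# Write the DataFrame as CSV text into an in-memory buffer.
buffer = io.StringIO()
marks.to_csv(buffer)

# Read it back; index_col=0 restores the row labels from column 0.
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)
print(loaded)

# Transpose: rows become columns and columns become rows.
transposed = marks.T
print(transposed)
```

To work with a real file, a path such as 'marks.csv' (a hypothetical name) can be passed in place of the buffer.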
Exercise
1. What is a Series and how is it different from a 1-D
array, a list and a dictionary?
2. What is a DataFrame and how is it different from a
2-D array?
3. How are DataFrames related to Series?
4. What do you understand by the size of (i) a Series,
(ii) a DataFrame?
5. Create the following Series and do the specified
operations:
a) EngAlph, having 26 elements with the alphabets
as values and default index values.
b) Vowels, having 5 elements with index labels ‘a’,
‘e’, ‘i’, ‘o’ and ‘u’ and all the five values set to zero.
Check if it is an empty series.
c) Friends, from a dictionary having roll numbers of
five of your friends as data and their first name
as keys.
d) MTseries, an empty Series. Check if it is an empty
series.
e) MonthDays, from a numpy array having the
number of days in the 12 months of a year. The
labels should be the month numbers from 1 to 12.
6. Using the Series created in Question 5, write
commands for the following:
a) Set all the values of Vowels to 10 and display the
Series.
b) Divide all values of Vowels by 2 and display the
Series.
c) Create another series Vowels1 having 5 elements
with index labels ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’ having values
[2,5,6,3,8] respectively.
d) Add Vowels and Vowels1 and assign the result to
Vowels3.
e) Subtract, Multiply and Divide Vowels by Vowels1.
f) Alter the labels of Vowels1 to [‘A’, ‘E’, ‘I’, ‘O’, ‘U’].
7. Using the Series created in Question 5, write
commands for the following:
a) Find the dimensions, size and values of the Series
EngAlph, Vowels, Friends, MTseries, MonthDays.
b) Rename the Series MTseries as SeriesEmpty.
c) Name the index of the Series MonthDays as
monthno and that of Series Friends as Fname.