Chapter 2 Data Handling using pandas - I(Series)

Chapter 2 focuses on data handling using the Pandas library, highlighting its capabilities for data manipulation and analysis. It covers key data structures like Series and DataFrame, differences between Pandas and NumPy, methods for creating Series, accessing elements, and performing mathematical operations. The chapter also explains attributes and methods of Series, including indexing, slicing, and the use of iloc() and loc() for data retrieval.
Chapter 2

Data Handling using pandas – I

NumPy, Pandas and Matplotlib are three well-established Python libraries for scientific and
analytical use.

PANDAS (PANEL DATA)


➢ It is a high-level data manipulation tool used for analysing data.
➢ It is very easy to import and export data using the Pandas library.
➢ It is built on packages such as NumPy and Matplotlib, and gives a single place to do most of our data analysis and visualisation work.
➢ Pandas has three important data structures, namely Series, DataFrame and Panel (Panel has been removed from recent Pandas releases).

Differences between Pandas and Numpy:


1. A NumPy array holds homogeneous data, while a Pandas DataFrame can hold heterogeneous data (a different data type in each column).
2. Pandas has an interface for operations such as file loading, plotting, selection, joining and GROUP BY.
3. Pandas DataFrames (with column names) make it very easy to keep track of data.
4. Pandas data is in tabular format, whereas a NumPy array is an n-dimensional numeric array.
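The first difference can be seen in a minimal sketch like the one below. The column names and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array stores every element with one common dtype.
arr = np.array([1, 2, 3])
print(arr.dtype)

# A DataFrame can store a different dtype in each named column.
# "name" and "marks" are illustrative column names, not from the chapter.
df = pd.DataFrame({"name": ["Kavi", "Shyam"], "marks": [82.5, 91.0]})
print(df.dtypes)
```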

Installing Pandas
To install Pandas from the command line, type:
pip install pandas
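One simple way to check that the installation worked is to import the library and print its version. This is only a quick check, and the version string printed will differ from one machine to another.

```python
import pandas as pd

# If the import succeeds, Pandas is installed in this Python environment.
installed_version = pd.__version__
print("Pandas version:", installed_version)
```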

Data Structure in Pandas


A data structure is a collection of data values and operations that can be applied to that data. It
enables efficient storage, retrieval and modification of the data.
Two commonly used data structures in Pandas are Series and DataFrame.

Series
➢ one-dimensional array
➢ homogeneous data
➢ contains a sequence of values, each with an index
➢ values may be of any data type (int, float, list, string, etc.)
➢ data is mutable
➢ size is immutable
The data label associated with a particular value is called its index.
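The idea of labelled, mutable data can be sketched as follows. The variable name marks and the labels are illustrative assumptions.

```python
import pandas as pd

# Each value is paired with a label (its index).
marks = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Data is mutable: the value stored at an existing label can be changed.
marks["b"] = 25
print(marks)

# The labels and the number of elements are unchanged by the assignment.
print(list(marks.index), len(marks))
```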

Creation of Series
(A) Creation of Series from List
Program:
import pandas as pd
l=[10,20,30]
series1 = pd.Series(l)
# series1 = pd.Series([10,20,30])
print(series1)
Output:
0 10
1 20
2 30
dtype: int64

User-defined labels can be assigned to the index and then used to access elements of a Series.
Program:
import pandas as pd
series2 = pd.Series(["Kavi","Shyam"], index=[3,5])
print(series2)
Output:
3 Kavi
5 Shyam
dtype: object

We can also use letters or strings as indices.


Program:
import pandas as pd
series2 = pd.Series([2,3],index=["Feb","Mar"])
print(series2)
Output:
Feb 2
Mar 3
dtype: int64
(B) Creation of empty Series
Program:
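import pandas as pd
print("Creation of empty Series")
s=pd.Series(dtype=int)
print(s)
Note: Each of the statements below also creates an empty Series. The first one does not specify a dtype, so Pandas chooses a default dtype instead of int.
s=pd.Series()
s=pd.Series([],dtype=int)
s=pd.Series({},dtype=int)
s=pd.Series((),dtype=int)
Output:
Creation of empty Series
Series([], dtype: int32)
Note: The integer dtype is shown as int32 or int64 depending on the platform.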
(C) Creation of Series from Scalar value
Program:
import pandas as pd
print("create a series from scalar value")
s4=pd.Series(25,index=[10,11,12])
print(s4)
Output:
create a series from scalar value
10 25
11 25
12 25
dtype: int64
(D) Creation of Series from dictionary
Keys of the dictionary will become indices in the series.
Program:
import pandas as pd
print("create a series from dictionary")
d={'a':'ant','b':'bat'}
s5=pd.Series(d)
print(s5)
Output:
create a series from dictionary
a ant
b bat
dtype: object
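A minimal sketch of the same idea, showing that the dictionary keys can be used afterwards as labels to look up values. The dictionary contents are illustrative.

```python
import pandas as pd

# Illustrative dictionary: keys become index labels, values become data.
capitals = {"India": "New Delhi", "Japan": "Tokyo"}
s = pd.Series(capitals)

print(list(s.index))   # the dictionary keys, in insertion order
print(s["Japan"])      # look up a value by its key
```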

(E) Creation of Series from ndarray
Program:
import numpy as np
import pandas as pd
print("create a series from ndarray")
a=np.array([10,20,30])
s6=pd.Series(a)
print(s6)
Output:
create a series from ndarray
0 10
1 20
2 30
dtype: int32
OR
Program:
import numpy as np
import pandas as pd
print("create a series from ndarray using arange function")
b=np.arange(10,20,3)
s7=pd.Series(b)
print(s7)
Output:
create a series from ndarray using arange function
0 10
1 13
2 16
3 19
dtype: int32
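When a Series is created from an ndarray, user-defined labels can also be supplied through the index parameter. The sketch below assumes illustrative labels and shows that the number of labels has to match the number of values, otherwise Pandas raises a ValueError.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# Three values with three labels: this works.
s = pd.Series(values, index=["x", "y", "z"])
print(s)

# Three values with only two labels: Pandas raises a ValueError.
mismatch_raised = False
try:
    pd.Series(values, index=["x", "y"])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```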

Accessing Elements of a Series


There are two common ways for accessing the elements of a series: Indexing and Slicing.
(A) Indexing
Indexing in Series is used to access elements in a series. Indexes are of two types: positional
index and labelled index. Positional index takes an integer value that corresponds to its position
in the series starting from 0, whereas labelled index takes any user-defined label as index.

Program:
import pandas as pd
s1=pd.Series([10,20,30,40,50],index=['I','II','III','IV','V'])
print(s1)
print("Assigning new index values")
s1.index=['one','two','three','four','five']
print(s1)
print("To access the element 30 using labelled indexing")
print(s1['three'])
print("To access the element 50 using positional indexing")
# Older Pandas versions also accept s1[4] here, but recent versions
# deprecate that form on a labelled index; iloc is the explicit way.
print(s1.iloc[4])
print("To access the elements 20 and 40 using labelled indexing")
print(s1[['two','four']])
print("To access the elements 20 and 40 using positional indexing")
print(s1.iloc[[1,3]])
print("To change the value at positional index 4")
s1.iloc[4]=55
print(s1)

Output:
I 10
II 20
III 30
IV 40
V 50
dtype: int64
Assigning new index values
one 10
two 20
three 30
four 40
five 50
dtype: int64
To access the element 30 using labelled indexing
30
To access the element 50 using positional indexing
50
To access the elements 20 and 40 using labelled indexing
two 20
four 40
dtype: int64
To access the elements 20 and 40 using positional indexing
two 20
four 40
dtype: int64
To change the value at positional index 4
one 10
two 20
three 30
four 40
five 55
dtype: int64

(B) Slicing
A part of a series can be extracted through slicing. We define which part of the series
is to be sliced by specifying the start and end parameters [start:end] with the series name.
When positional indices are used for slicing, the value at the end index position is excluded.
When labelled indices are used for slicing, the value at the end index label is included in the
output.
Program:
import pandas as pd
s1=pd.Series([10,20,30,40,55],index=['one','two','three','four','five'])
print("Positional index used for slicing")
print(s1[1:4])#excludes the value at index position 4
print("Labelled index used for slicing")
print(s1['one':'three'])
print("The series in reverse order")
print(s1[::-1])
print("To give same values for a given slice")
s1[1:4]=5
print(s1)
print("To give different values for a given slice")

s1[1:4]=[5,10,15]
print(s1)

Output:
Positional index used for slicing
two 20
three 30
four 40
dtype: int64
Labelled index used for slicing
one 10
two 20
three 30
dtype: int64
The series in reverse order
five 55
four 40
three 30
two 20
one 10
dtype: int64
To give same values for a given slice
one 10
two 5
three 5
four 5
five 55
dtype: int64
To give different values for a given slice
one 10
two 5
three 10
four 15
five 55
dtype: int64

Attributes of Series
Attribute Name    Purpose
name              assigns a name to the Series
index.name        assigns a name to the index of the Series
values            returns the values in the Series as an array
size              returns the number of values in the Series object
empty             returns True if the Series is empty, and False otherwise

Program:
import pandas as pd
import numpy as np
# np.nan is the spelling accepted by all NumPy versions (np.NAN was removed in NumPy 2.0)
s1=pd.Series({'a':np.nan,'b':20,'c':30,'d':40})
print(s1)
s1.name='NIMS'
print(s1)
s1.index.name='Division'
print(s1)
print(s1.size)
print(s1.values)
print(s1.empty)
print(s1.count())   # counts only the non-NaN values
s2=pd.Series(dtype=int)
print(s2)
s2.name='Test'
print(s2)
s2.index.name='Result'   # the index name is not shown for an empty Series
print(s2)
print(s2.size)
print(s2.values)
print(s2.empty)
print(s2.count())
Output:
a NaN
b 20.0
c 30.0
d 40.0
dtype: float64
a NaN

b 20.0
c 30.0
d 40.0
Name: NIMS, dtype: float64
Division
a NaN
b 20.0
c 30.0
d 40.0
Name: NIMS, dtype: float64
4
[nan 20. 30. 40.]
False
3
Series([], dtype: int32)
Series([], Name: Test, dtype: int32)
Series([], Name: Test, dtype: int32)
0
[]
True
0

Methods of Series
Method     Explanation
head(n)    Returns the first n members of the series. If the value for n is not passed, then by default n takes 5 and the first five members are displayed.
count()    Returns the number of non-NaN values in the Series.
tail(n)    Returns the last n members of the series. If the value for n is not passed, then by default n takes 5 and the last five members are displayed.

Program:
import pandas as pd
s1=pd.Series([10,20,30,40,50,60,70,80,90])
print(s1.head())
print(s1.tail())
print(s1.head(2))
print(s1.tail(3))

Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
4 50
5 60
6 70
7 80
8 90
dtype: int64
0 10
1 20
dtype: int64
6 70
7 80
8 90
dtype: int64

Mathematical Operations on Series


While performing mathematical operations on series, index matching is implemented and all
missing values are filled in with NaN by default. Basic mathematical operations like addition,
subtraction, multiplication and division can be done on two Series. The operation is applied to
each pair of elements that share the same index.
(A) Addition of two Series
It can be done in two ways. In the first way, the two series are simply added together (e.g. s1+s2).
The second way is used when we do not want NaN values in the output. We can use
the series method add() with the parameter fill_value to replace a missing value with a specified
value before the operation, e.g. s1.add(s2,fill_value=10)
(B) Subtraction of two Series
Again, it can be done in two different ways
s1-s2
s1.sub(s2,fill_value=20)

(C) Multiplication of two Series
Again, it can be done in two different ways
s1*s2
s1.mul(s2,fill_value=10)
(D) Division of two Series
Again, it can be done in two different ways
s1/s2
s1.div(s2,fill_value=20)
Program:
import pandas as pd
s1=pd.Series([10,20,30])
s2=pd.Series([5,15,25,35])
print(s1+s2)
print(s1.add(s2,fill_value=40))
print(s1-s2)
print(s1.sub(s2,fill_value=40))
print(s1*s2)
print(s1.mul(s2,fill_value=40))
print(s1/s2)
print(s1.div(s2,fill_value=40))
Output:
0 15.0
1 35.0
2 55.0
3 NaN
dtype: float64
0 15.0
1 35.0
2 55.0
3 75.0
dtype: float64
0 5.0
1 5.0
2 5.0
3 NaN
dtype: float64

0 5.0
1 5.0
2 5.0
3 5.0
dtype: float64
0 50.0
1 300.0
2 750.0
3 NaN
dtype: float64
0 50.0
1 300.0
2 750.0
3 1400.0
dtype: float64
0 2.000000
1 1.333333
2 1.200000
3 NaN
dtype: float64
0 2.000000
1 1.333333
2 1.200000
3 1.142857
dtype: float64
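The index matching described in this section is easier to see when the two Series use different labels. The sketch below is only an illustration, with made-up labels and values: elements are paired by label rather than by position, and fill_value stands in for whichever side is missing.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Only labels "b" and "c" exist in both Series, so "a" and "d" become NaN.
plain = s1 + s2
print(plain)

# With fill_value=0, a missing element is treated as 0 before adding.
filled = s1.add(s2, fill_value=0)
print(filled)
```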

iloc and loc


iloc - iloc is used with square brackets to select rows by positional index, e.g. s1.iloc[1:4].
loc - loc is used with square brackets to select rows by labelled index (row name), e.g. s1.loc['a':'i'].
Program:
import pandas as pd
s1=pd.Series([10,20,30,40,50],index=['a','e','i','o','u'])
print(s1)
print(s1.iloc[1:4])#select rows with positional index 1,2,3(upper limit 4 is excluded)
print(s1.loc['a':'i'])#select rows with labelled index 'a','e','i'(upper limit also included in loc)
Output:

a 10
e 20
i 30
o 40
u 50
dtype: int64
e 20
i 30
o 40
dtype: int64
a 10
e 20
i 30
dtype: int64
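The difference between the two matters most when the labels are themselves integers. The following sketch assumes integer labels 3, 5 and 7 (the third name is made up for illustration) to show that loc looks up the label while iloc looks up the position.

```python
import pandas as pd

s = pd.Series(["Kavi", "Shyam", "Ravi"], index=[3, 5, 7])

print(s.loc[3])     # label 3 is the first element
print(s.iloc[1])    # position 1 is the second element

# Slicing: loc includes the end label, iloc excludes the end position.
by_label = s.loc[3:5]
by_position = s.iloc[0:1]
print(list(by_label.index), list(by_position.index))
```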
