Class 12th IP Chapter 2nd
Subject :IP
Chapter 2
Data Handling using pandas
1) NumPy, Pandas and Matplotlib are Python libraries for scientific and analytical use.
2) pip install pandas is the command to install Pandas library.
3) A data structure is a collection of data values and the operations that can be applied to
that data. It enables efficient storage, retrieval and modification to the data.
4) Two main data structures in Pandas library are Series and DataFrame. To use these data
structures, we first need to import the Pandas library.
5) A Series is a one-dimensional array containing a sequence of values. Each value has a
data label associated with it also called its index.
6) The two common ways of accessing the elements of a series are Indexing and Slicing.
7) There are two types of indexes: positional index and labelled index. Positional index takes
an integer value that corresponds to its position in the series starting from 0, whereas
labelled index takes any user-defined label as index.
8) When positional indices are used for slicing, the value at end index position is excluded,
i.e., only (end - start) number of data values of the series are extracted. However with
labelled indexes the value at the end index label is also included in the output.
9) All basic mathematical operations can be performed on Series either by using the operator
or by using appropriate methods of the Series object.
10) While performing mathematical operations index matching is implemented and if no
matching indexes are found during alignment, Pandas returns NaN so that the operation
does not fail.
11) A DataFrame is a two-dimensional labeled data structure like a spreadsheet. It contains
rows and columns and therefore has both a row and column index.
12) When using a dictionary to create a DataFrame, keys of the Dictionary become the column
labels of the DataFrame. A DataFrame can be thought of as a dictionary of lists/ Series (all
Series/columns sharing the same index label for a row).
13) Data can be loaded in a DataFrame from a file on the disk by using Pandas read_csv
function.
14) Data in a DataFrame can be written to a text file on disk by using the
pandas.DataFrame.to_csv() function.
15) DataFrame.T gives the transpose of a DataFrame.
16) Pandas has a number of methods that support label-based indexing, but every label
asked for must be in the index, or a KeyError will be raised.
17) DataFrame.loc[ ] is used for label based indexing of rows in DataFrames.
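18) Pandas.DataFrame.append() method was used to append the rows of one DataFrame to
another. It was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat() is used instead.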
19) Pandas supports non-unique index values. Only if a particular operation that does not
support duplicate index values is attempted, an exception is raised at that time.
20) The basic difference between Pandas Series and NumPy ndarray is that operations
between Series automatically align the data based on labels. Thus, we can write
computations without considering whether all Series involved have the same labels or not,
whereas ndarrays are combined purely by position and raise an error if their shapes are not compatible.
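Points 8, 10 and 20 above can be tried out with a short sketch. The Series values and labels below are made up for illustration; only standard pandas and NumPy calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with user-defined labels.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Positional slice: the value at end position 2 is excluded (2 - 0 = 2 values).
by_position = s.iloc[0:2]
# Label slice: the value at end label 'c' is included (3 values).
by_label = s.loc['a':'c']
print(by_position.tolist())  # [10, 20]
print(by_label.tolist())     # [10, 20, 30]

# Index matching: only 'b' and 'c' exist in both Series.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2                    # 'a' and 'd' become NaN
filled = s1.add(s2, fill_value=0)  # method form; a missing value is treated as 0
print(total)
print(filled)

# ndarrays have no labels; shapes that cannot be broadcast raise ValueError.
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    arrays_added = True
except ValueError:
    arrays_added = False
print(arrays_added)  # False
```

The method form add() is shown alongside the + operator because it accepts fill_value, which avoids NaN when a label is present in only one Series.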
Very short answer type questions:
1. What is Python?
Python is a very popular and easy to learn programming language, created by Guido van
Rossum in 1991. It is used in a variety of fields, including software development, web
development, scientific computing, big data and Artificial Intelligence. The programs given in this
book are written using Python.
1. What is program?
2. What is Software?
System Software
Application Software
3. What is a programming language?
The language used to specify a set of instructions to the computer is called a programming
language, for example Python, C, C++, Java, etc.
4. What is Function?
A function is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result.
5. What is a Variable?
A variable is a reserved memory location used to store a value. When you create a variable, you reserve
some space in memory.
6. What is Array?
An array is a special variable which can hold more than one value at a time. Arrays are used to
store multiple values in one single variable.
7. What is a NumPy array?
NumPy stands for Numerical Python. It is a Python package for the computation and
processing of single-dimensional and multidimensional array elements.
8. What is ndarray?
ndarray is the n-dimensional array object defined in NumPy, which stores a collection of
elements of the same type.
9. What is index and axes attribute?
The axes attribute of DataFrame class contains both the row axis index and the column axis
index. The ndim attribute returns the number of dimensions, which is 2 for a DataFrame instance.
The shape attribute has the shape of the 2 dimensional matrix/DataFrame as a tuple.
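A minimal sketch of these attributes follows. The small DataFrame is invented for illustration (names and marks are placeholders); index, axes, ndim and shape are standard DataFrame attributes.

```python
import pandas as pd

# Hypothetical marks table with names as row labels.
df = pd.DataFrame({'Eco': [18, 23, 51], 'Maths': [57, 45, 37]},
                  index=['Arnab', 'Kritika', 'Divyam'])

print(df.index)  # the row labels only
print(df.axes)   # a list holding the row index and the column index
print(df.ndim)   # 2
print(df.shape)  # (3, 2)

# axes[0] is the row index and axes[1] is the column index.
row_axis, column_axis = df.axes
```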
10. What is re-indexing?
Reindexing in Pandas changes the row labels and column labels of a DataFrame so that they
conform to a new index. Existing data is rearranged to match the new labels, and any label that was
not present in the original object receives NaN. Reindexing can be applied to both a pandas
Series and a pandas DataFrame.
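As a small sketch of re-indexing (the labels r1, r2, r3 and the data are assumptions made for this example), reindex() can reorder rows and columns and introduce a label that did not exist before:

```python
import pandas as pd

# Hypothetical DataFrame with two labelled rows.
df = pd.DataFrame({'Eco': [18, 23], 'Maths': [57, 45]}, index=['r1', 'r2'])

# Reorder the rows and columns; 'r3' is a new label, so its row is filled with NaN.
reordered = df.reindex(index=['r2', 'r1', 'r3'], columns=['Maths', 'Eco'])
print(reordered)
```

reindex() returns a new object; the original df is left unchanged.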
11. What is a CSV (Comma Separated Values) file?
The CSV (Comma Separated Values) format is one of the simplest and most common ways to store
tabular data. A CSV file is conventionally saved with the .csv file extension.
12. Write the parameters of series in pandas.
A pandas Series is a one-dimensional labelled data structure which can hold data such as
strings, integers and even other Python objects. It is built on top of numpy array and is the
primary data structure to hold one-dimensional data in pandas.
o data: It can be any list, dictionary, or scalar value.
o index: Index values must be hashable and of the same length as data. Default is np.arange(n) if
no index is passed.
o dtype: It refers to the data type of series.
o copy: It is used for copying the data.
13. How to install pandas using pip?
Here is how to install Pandas on Windows:
1. Install Python (pip, the Python package manager, is installed along with it).
2. Open the Command Prompt.
3. Type the following command:
pip install pandas
A DataFrame is a two-dimensional labelled data structure like a table in MySQL. It contains rows
and columns, and therefore has both a row and column index.
A DataFrame can also be viewed as a 2-dimensional array with rows and columns. The rows are referred to as “axis 0”, while
the columns are “axis 1”.
DataFrame and Series are both data structures from the Pandas library. A Series is a one-dimensional
structure, whereas a DataFrame is a two-dimensional structure.
16. What do you understand by the size of (i) a Series, (ii) a DataFrame?
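As a hedged illustration for this question (the answer itself is not given in these notes), the size attribute of a Series returns the number of elements, and the size attribute of a DataFrame returns the number of rows multiplied by the number of columns. The sample data below is made up.

```python
import pandas as pd

s = pd.Series([5, 10, 15])
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(s.size)    # 3 elements
print(df.size)   # 3 rows x 2 columns = 6
print(df.shape)  # (3, 2)
```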
17. What is a Series and how is it different from a 1-D array, a list and a dictionary?
A Series is a one-dimensional array containing a sequence of values of any data type (int,
float, list, string, etc.), which by default have numeric data labels starting from zero.
A Pandas Series is a bit like a specialization of a Python dictionary. A dictionary is a structure
that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps
typed keys to a set of typed values.
A Series is a 1D data structure designed for a particular use case which is quite different from
that of a list, yet both are 1D, ordered data structures. In a Series we can assign our own index labels, but we
cannot do so in a list, which is always indexed by position.
Pandas is an open source Python package that is most widely used for data science/data
analysis and machine learning tasks. It is built on top of another package named NumPy, which
provides support for multi-dimensional arrays. As one of the most popular data wrangling
packages, Pandas works well with many other data science modules inside the Python
ecosystem, and is included in many Python distributions, such as commercial vendor
distributions like ActiveState’s ActivePython.
Pandas is fast and offers high performance and productivity for users.
The Pandas Series can be defined as a one-dimensional array that is capable of storing various
data types. We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor.
The row labels of a Series are called the index. A Series cannot contain multiple columns. It has the
following parameters:
o data: It can be any list, dictionary, or scalar value.
o index: The values of the index must be hashable, and the index must be of the same
length as data. If we do not pass any index, the default np.arange(n) will be used.
o dtype: It refers to the data type of the series.
o copy: It is used for copying the data.
1) Creating a Series:
We can create a Series in two ways:
1. Create an empty Series
2. Create a Series using inputs.
2) Create an Empty Series:
We can easily create an empty Series in Pandas, which means it will not have any values.
The syntax that is used for creating an empty Series:
<series object> = pandas.Series()
The example below creates an empty Series object that has no values. The dtype is passed
explicitly as float64 because the default dtype of an empty Series differs between pandas versions.
Example
import pandas as pd
x = pd.Series([], dtype='float64')
print(x)
Output : Series([], dtype: float64)
Creating a Series using inputs:
We can create Series by using various inputs:
o Array
o Dict
o Scalar value
3) Creating Series from Array:
Before creating a Series from an array, we first have to import the numpy module and then use its array()
function in the program. If the data is an ndarray, then the passed index must be of the same length.
If we do not pass an index, then by default an index of range(n) is used, where n is the
length of the array, i.e., [0, 1, 2, ..., len(array)-1].
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output :
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
4) Create a Series from dict
We can also create a Series from a dict. If a dictionary object is passed as input
and the index is not specified, then the dictionary keys are taken in the order in which they were
inserted (older versions of pandas sorted the keys) to construct the index.
If an index is passed, then the values corresponding to each label in the index are extracted from
the dictionary.
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
Output:
x 0.0
y 1.0
z 2.0
dtype: float64
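The case where an index is passed along with the dictionary can be sketched as follows. The label 'w' is an assumption chosen to show what happens when a label is missing from the dictionary.

```python
import pandas as pd

info = {'x': 0., 'y': 1., 'z': 2.}

# Only the labels listed in index are kept, in that order.
# 'w' is not a key of the dictionary, so its value is NaN.
a = pd.Series(info, index=['y', 'z', 'w'])
print(a)
```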
5) Create a Series using Scalar:
If we take the scalar values, then the index must be provided. The scalar value will be repeated
for matching the length of the index.
# import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print(x)
Output:
0 4
1 4
2 4
3 4
dtype: int64
Head function : The head() function in pandas displays the first five rows of the DataFrame by
default. It takes a single parameter: the number of rows. We can use this parameter to display
the number of rows of our choice.
Syntax
The head function is defined as follows:
dataframe.head(N)
N refers to the number of rows. If no parameter is passed, the first five rows are returned.
The head function also supports negative values of N. In that case, all rows except the last N
rows are returned.
Example
The code snippet below shows how the head function is used in pandas:
import pandas as pd
# Creating a dataframe
df = pd.DataFrame({'Days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
'Sunday']})
print(df) # Printing the whole DataFrame
print('\n ************************************')
print(df.head()) # By default, the first 5 rows
print('\n ************************************')
print(df.head(3)) # Printing first 3 rows
print('\n ************************************')
print(df.head(-2)) # Printing all except the last 2 rows
Output
Days
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
************************************
Days
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
************************************
Days
0 Monday
1 Tuesday
2 Wednesday
************************************
Days
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
Tail function : The tail() function in pandas displays the last five rows of the DataFrame by
default. It takes a single parameter: the number of rows. We can use this parameter to display
the number of rows of our choice.
Syntax
The tail function is defined as follows:
dataframe.tail(N)
N refers to the number of rows. If no parameter is passed, the last five rows are returned.
The tail function also supports negative values of N. In that case, all rows except the first N rows
are returned.
Example
The code snippet below shows how the tail function is used in pandas:
import pandas as pd
# Creating a dataframe
df = pd.DataFrame({'Days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
'Sunday']})
print(df) # Printing the whole DataFrame
print('\n ************************************')
print(df.tail()) # By default, the last 5 rows
print('\n ************************************')
print(df.tail(3)) # Printing last 3 rows
print('\n ************************************')
print(df.tail(-2)) # Printing all except the first 2 rows
Output
Days
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
************************************
Days
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
************************************
Days
4 Friday
5 Saturday
6 Sunday
************************************
Days
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
Create DataFrame
A pandas DataFrame can be created using various inputs like −
Lists
dict
Series
Numpy ndarrays
Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame using these
inputs.
1) Create an Empty DataFrame
A basic DataFrame that can be created is an empty DataFrame.
Example
# import the pandas library, aliasing it as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Its output is as follows −
Empty DataFrame
Columns: []
Index: []
output is as follows −
Fruits Quantity
0 Apples 10
1 Mangos 12
2 Bananas 13
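The code for the output above does not appear in these notes. One way such a table could be produced, assuming the data was given as a list of lists with explicit column names, is sketched below.

```python
import pandas as pd

# Assumed input: each inner list is one row of the table.
data = [['Apples', 10], ['Mangos', 12], ['Bananas', 13]]
df = pd.DataFrame(data, columns=['Fruits', 'Quantity'])
print(df)
```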
When a DataFrame is created from a dictionary of ndarrays or lists, all the ndarrays must be of
the same length. If an index is passed, then the length of the index should be equal to the length of the arrays.
If no index is passed, then by default, the index will be range(n), where n is the array length.
Example 1
import pandas as pd
data = {'Fruits':['Apples', 'Mangos', 'Bananas', 'Ricky'],'Quantity':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Its output is as follows −
Fruits Quantity
0 Apples 28
1 Mangos 34
2 Bananas 29
3 Ricky 42
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Fruits':['Apples', 'Mangos', 'Bananas', 'Ricky'],'Quantity':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)
output is as follows –
Fruits Quantity
rank1 Apples 28
rank2 Mangos 34
rank3 Bananas 29
rank4 Ricky 42
List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are
by default taken as column names.
Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)
Its output is as follows −
a b c
0 1 2 NaN
1 5 10 20.0
Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all
the series indexes passed.
Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df)
output is as follows −
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Let us now understand column selection, addition, and deletion through examples.
1) Column Selection
We will understand this by selecting a column from the DataFrame.
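Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])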
Its output is as follows −
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
2) Column Addition
We will understand this by adding a new column to an existing data frame.
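Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by passing new series
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)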
Its output is as follows −
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
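Column deletion, mentioned at the start of this part, is not worked through in these notes. A minimal sketch, assuming the same three-column DataFrame as in the column addition example, uses del and pop():

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)

# del removes a column in place.
del df['one']

# pop() removes a column in place and returns it as a Series.
popped = df.pop('two')

print(df)      # only column 'three' remains
print(popped)
```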
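Selection by label
Rows can be selected by passing a row label to loc.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])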
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the Name of the
series is the label with which it is retrieved.
Selection by integer location
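Rows can be selected by passing an integer location to iloc.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])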
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
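Multiple rows can be selected using the ‘ : ’ operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])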
Its output is as follows −
one two
c 3.0 3
d NaN 4
5) Addition of Rows
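Add new rows to a DataFrame using the concat function (the older DataFrame.append method was
removed in pandas 2.0). The new rows are added at the end.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
print(df)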
Its output is as follows −
a b
0 1 2
1 3 4
0 5 6
1 7 8
6) Deletion of Rows
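Use an index label to delete or drop rows from a DataFrame. If a label is duplicated, then multiple
rows will be dropped.
If you observe, in the above example, the labels are duplicated. Let us drop a label and see
how many rows get dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
df = df.drop(0)
print(df)
Its output is as follows −
a b
1 3 4
1 7 8
Two rows were dropped because both of them carry the label 0.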
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
The parameters of the constructor are as follows −
Sr.No   Parameter   Description
1       data        Takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
2       index       Row labels to be used for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3       columns     Column labels to be used for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4       dtype       Data type of each column.
5       copy        Controls whether the input data is copied. The default was False in older pandas versions.
Q.9. What is the use of lower() and upper()?
To convert a Python string to uppercase, use the built-in upper() method of the string. To convert a
Python string to lowercase, use the built-in lower() method.
Both methods return a new string: upper() converts all of the characters to uppercase, whereas
lower() converts all of the characters to lowercase.
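A short sketch of both methods follows. The sample strings are made up; the second half assumes you want the same conversion on a pandas Series, where the equivalent methods are reached through the .str accessor.

```python
import pandas as pd

name = "Data Handling"
print(name.upper())  # DATA HANDLING
print(name.lower())  # data handling
print(name)          # the original string is unchanged

# On a Series of strings, use the .str accessor.
days = pd.Series(['Monday', 'Tuesday'])
upper_days = days.str.upper()
lower_days = days.str.lower()
print(upper_days.tolist())  # ['MONDAY', 'TUESDAY']
```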
The behavior of basic iteration over Pandas objects depends on the type. When iterating over a
Series, it is regarded as array-like, and basic iteration produces the values. Other data structures,
like DataFrame (and Panel, which has been removed from recent pandas versions), follow the
dict-like convention of iterating over the keys of the objects.
In short, basic iteration (for i in object) produces −
Series − values
DataFrame − column labels
Panel − item labels
Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to understand
the same.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() yields (column label, column Series) pairs; it replaces iteritems(), removed in pandas 2.0
for key,value in df.items():
    print (key,value)
Its output is as follows (the values are random, so they will differ on each run) −
col1 0 0.802390
1 0.324060
2 0.256811
3 0.839186
Name: col1, dtype: float64
col2 0 1.624313
1 -1.033582
2 1.796663
3 1.856277
Name: col2, dtype: float64
col3 0 -0.022142
1 -0.230820
2 1.160691
3 -0.830279
Name: col3, dtype: float64
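The rule for basic iteration stated earlier (Series gives values, DataFrame gives column labels) can be checked with a small sketch. The data below is invented for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Iterating a Series produces its values, not its labels.
series_items = [value for value in s]

# Iterating a DataFrame produces its column labels.
frame_items = [label for label in df]

print(series_items)
print(frame_items)  # ['col1', 'col2']
```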
We can create a DataFrame by importing data from CSV files where values are separated by
commas. Similarly, we can also store or export data in a DataFrame as a .csv file.
Importing a CSV file to a DataFrame
Let us assume that we have the following data in a csv file named ResultData.csv stored in the
folder C:/NCERT. In order to practice the code as we progress, you are advised to create
this csv file using a spreadsheet and save it on your computer.
RollNo Name Eco Maths
1 Arnab 18 57
2 Kritika 23 45
3 Divyam 51 37
4 Vivaan 40 60
5 Aaroosh 18 27
We can load the data from the ResultData.csv file into a DataFrame, say marks, using the Pandas
read_csv() function as shown below:
>>> marks = pd.read_csv("C:/NCERT/ResultData.csv",sep =",", header=0)
>>> marks
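Exporting and importing can be tried together in one round trip. The sketch below is an illustration only: it writes to a temporary folder instead of C:/NCERT so that it runs on any computer, and it uses just the first three rows of the table above.

```python
import os
import tempfile

import pandas as pd

marks = pd.DataFrame({
    'RollNo': [1, 2, 3],
    'Name': ['Arnab', 'Kritika', 'Divyam'],
    'Eco': [18, 23, 51],
    'Maths': [57, 45, 37],
})

# A temporary folder stands in for C:/NCERT in this sketch.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'ResultData.csv')
    # index=False keeps the row labels 0, 1, 2 out of the file.
    marks.to_csv(path, sep=',', index=False)
    # header=0 tells read_csv that the first line holds the column names.
    loaded = pd.read_csv(path, sep=',', header=0)

print(loaded)
```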
Concatenating means joining two or more pandas objects (Series or DataFrames) into a single new object. In
Python pandas, the concat() function combines the objects along an axis: by default the rows of
the objects are stacked one below the other, and with axis=1 the objects are placed side by side.
Syntax: concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort,
copy)
Example:
import pandas as pd
df1=pd.DataFrame({'A':['A1','A2','A3'],'B':['B1','B2','B3']},index=[0,1,2])
df2=pd.DataFrame({'A':['A4','A5','A6'],'B':['B4','B5','B6']},index=[3,4,5])
df3=pd.DataFrame({'A':['A7','A8','A9'],'B':['B7','B8','B9']},index=[6,7,8])
dfram=[df1,df2,df3]
result=pd.concat(dfram)
print('First DataFrame\n',df1)
print('Second DataFrame\n',df2)
print('Third DataFrame\n',df3)
print('Concatenated DataFrame\n',result)
Output :
First DataFrame
A B
0 A1 B1
1 A2 B2
2 A3 B3
Second DataFrame
A B
3 A4 B4
4 A5 B5
5 A6 B6
Third DataFrame
A B
6 A7 B7
7 A8 B8
8 A9 B9
Concatenated DataFrame
A B
0 A1 B1
1 A2 B2
2 A3 B3
3 A4 B4
4 A5 B5
5 A6 B6
6 A7 B7
7 A8 B8
8 A9 B9
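Two of the parameters listed in the syntax, ignore_index and axis, change the result noticeably. The following sketch uses two small made-up DataFrames that both carry the default index 0, 1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']})

# Default: rows are stacked and the original labels are kept (0, 1, 0, 1).
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows 0 to 3.
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the DataFrames side by side, matching rows by label.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```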