0% found this document useful (0 votes)
10 views99 pages

EDA Unit2

The document provides an overview of Exploratory Data Analysis (EDA) using Python's Pandas library, covering data manipulation techniques such as handling missing data, combining datasets, and creating DataFrames. It includes practical examples of creating, appending, and analyzing data, as well as statistical analysis and data cleaning methods. The document emphasizes the importance of preparing datasets for machine learning models and introduces fundamental Pandas data structures like Series and DataFrame.

Uploaded by

Kavitha Ganesan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views99 pages

EDA Unit2

The document provides an overview of Exploratory Data Analysis (EDA) using Python's Pandas library, covering data manipulation techniques such as handling missing data, combining datasets, and creating DataFrames. It includes practical examples of creating, appending, and analyzing data, as well as statistical analysis and data cleaning methods. The document emphasizes the importance of preparing datasets for machine learning models and introduces fundamental Pandas data structures like Series and DataFrame.

Uploaded by

Kavitha Ganesan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 99

Rajalakshmi Institute of Technology

(An Autonomous Institution), Affiliated to Anna University, Chennai


Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

UNIT II
EXPLORATORY DATA ANALYSIS
EDA USING PYTHON - Data Manipulation using Pandas – Pandas Objects –
Data Indexing and Selection – Operating on Data – Handling Missing Data –
Hierarchical Indexing – Combining datasets – Concat, Append, Merge and Join –
Aggregation and grouping – Pivot Tables – Vectorized String Operations.
Data manipulation with Pandas
In Machine Learning, the model requires a dataset to operate, i.e. to
train and test. But data doesn’t come fully prepared and ready to use. There are
discrepancies like Nan/ Null / NA values in many rows and columns. Sometimes the
data set also contains some of the rows and columns which are not even required in
the operation of our model. In such conditions, it requires proper cleaning and
modification of the data set to make it an efficient input for our model. We achieve
that by practicing Data Wrangling before giving data input to the model.
Today, we will get to know some methods using Pandas which is a
famous library of Python. Pandas are a newer package built on top of NumPy, and
provide an efficient implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, and often with
heterogeneous types and/or missing data. As well as offering a convenient storage
interface for labeled data, Pandas implements a number of powerful data operations
familiar to users of both database frameworks and spreadsheet programs.
Installing Pandas
Before moving forward, ensure that Pandas is installed in your system.

1
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

If not, you can use the following command to install it:


pip install pandas
Creating DataFrame
Let’s dive into the programming part. Our first aim is to create a
Pandas dataframe in Python, as you may know, pandas is one of the most used
libraries of Python.
Code Output
# Importing the pandas library Name Age Student
import pandas as pd 0 Abhijit 20 False
# creating a dataframe object 1 Smriti 19 True
student_register = pd.DataFrame() 2 Akash 20 True
# assigning values to the 3 Roshni 14 False
# rows and columns of the dataframe
student_register['Name'] =
['Abhijit','Smriti', 'Akash', 'Roshni']
student_register['Age'] = [20, 19, 20, 14]
student_register['Student'] = [False, True,
True, False]
print(student_register)
As you can see, the dataframe object has four rows [0, 1, 2, 3] and three
columns[“Name”, “Age”, “Student”] respectively. The column which contains the
index values i.e. [0, 1, 2, 3] is known as the index column, which is a default part in
pandas datagram. We can change that as per our requirement too because Python is

2
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

powerful.

Adding data in DataFrame using Append Function


Next, for some reason we want to add a new student in the datagram,
i.e you want to add a new row to your existing data frame, that can be achieved by
the following code snippet.
One important concept is that the “dataframe” object of Python, consists of rows
which are “series” objects instead, stack together to form a table. Hence adding a
new row means creating a new series object and appending it to the dataframe.
Code Output
# creating a new pandas Name Age Student
# series object 0 Abhijit 20 False
new_person = pd.Series(['Mansi', 19, True], index 1 Smriti 19 True
= ['Name', 'Age', 'Student']) 2 Akash 20 True
# using the .append() function 3 Roshni 14 False
# to add that row to the dataframe
student_register.append(new_person, ignore_index =
True)
print(student_register
Data Manipulation on Dataset
Till now, we got the gist of how we can create dataframe, and add data
to it. But how will we perform these operations on a big dataset. For this let’s take a
new dataset

3
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 Getting Shape and information of the data


Let’s exact information of each column, i.e. what type of value it stores and how
many of them are unique. There are three support functions, .shape, .info()
and .corr() which output the shape of the table, information on rows and columns,
and correlation between numerical columns.
Code Output
# dimension of the dataframe Shape:
print('Shape: ') (4, 3)
print(student_register.shape) --------------------------------------
print('--------------------------------------') Info:
# showing info about the data <class 'pandas.core.frame.DataFrame'>
print('Info: ') RangeIndex: 4 entries, 0 to 3
print(student_register.info()) Data columns (total 3 columns):
print('--------------------------------------') # Column Non-Null Count Dtype
# correlation between columns --- ------ -------------- -----
print('Correlation: ') 0 Name 4 non-null object
print(student_register.corr()) 1 Age 4 non-null int64
2 Student 4 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
None
--------------------------------------
Correlation:

4
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Age Student
Age 1.000000 0.502519
Student 0.502519 1.000000
The description of the output given by .info() method is as follows:

RangeIndex describes about the index column, i.e. [0, 1, 2, 3] in our datagram.
Which is the number of rows in our dataframe.
As the name suggests Data columns give the total number of columns as output.
Name, Age, Student are the name of the columns in our data, non-null tells us that in
the corresponding column, there is no NA/ Nan/ None value exists. object, int64 and
bool are the datatypes each column have.
dtype gives you an overview of how many data types present in the datagram, which
in term simplifies the data cleaning process.
Also, in high-end machine learning models, memory usage is an important term, we
can’t neglect that.
 Getting Statistical Analysis of Data
Before processing and wrangling any data you need to get the total overview of it,
which includes statistical conclusions like standard deviation(std), mean and it’s
quartile distributions.
Code Output
# for showing the statistical Describe
# info of the dataframe Age
print('Describe') count 4.000000

5
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

print(student_register.describe()) mean 18.250000


std 2.872281
min 14.000000
25% 17.750000
50% 19.500000
75% 20.000000
max 20.000000
The description of the output given by .describe() method is as follows:
count is the number of rows in the dataframe.
mean is the mean value of all the entries in the “Age” column.
std is the standard deviation of the corresponding column.
min and max are the minimum and maximum entry in the column respectively.
25%, 50% and 75% are the First Quartiles, Second Quartile(Median) and Third
Quartile respectively, which gives us important info on the distribution of the dataset
and makes it simpler to apply an ML model.
 Dropping Columns from Data
Let’s drop a column from the data. We will use the drop function from the pandas.
We will keep axis = 1 for columns.
Code Output
students = student_register.drop('Age', Name Student
axis=1) 0 Abhijit False
print(students.head()) 1 Smriti True
2 Akash True

6
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

3 Roshni False
 Dropping Rows from Data
Let’s try dropping a row from the dataset, for this, we will use drop function. We
will keep axis=0.
Code Output
students = students.drop(2, axis=0) Name Student
print(students.head()) 0 Abhijit False
1 Smriti True
3 Roshni False
In the output we can see that the 2 row is dropped.
Pandas Object:
At the very basic level, Pandas objects can be thought of as enhanced
versions of NumPy structured arrays in which the rows and columns are identified
with labels rather than simple integer indices. As we will see during the course of
this chapter, Pandas provides a host of useful tools, methods, and functionality on
top of the basic data structures, but nearly everything that follows will require an
understanding of what these structures are. Thus, before we go any further, let’s
introduce these three fundamental Pandas data structures: the Series, DataFrame,
and Index.
We will start our code sessions with the standard NumPy and Pandas imports:
import numpy as np
import pandas as pd

7
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 The Pandas Series Object


A Pandas Series is a one-dimensional array of indexed data. It can be
created from a list or array as follows:
Code Output
data = pd.Series([0.25, 0.5, 0.75, 1.0]) 0 0.25
print(data) 1 0.50
2 0.75
3 1.00
dtype: float64
As we see in the preceding output, the Series wraps both a sequence of
values and a sequence of indices, which we can access with the values and
index attributes. The values are simply a familiar NumPy array:
Code Output
data = pd.Series([0.25, 0.5, 0.75, 1.0]) array([ 0.25, 0.5 , 0.75, 1. ])
print(data.values)
The index is an array-like object of type pd.Index, which we’ll discuss in
more detail momentarily:
Code Output
data = pd.Series([0.25, 0.5, 0.75, 1.0]) RangeIndex(start=0, stop=4, step=1)
print(data.index)
0.5
print (data[1])
print (data[1:3]) 1 0.50
2 0.75
dtype: float64
8
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

As we will see, though, the Pandas Series is much more general and
flexible than the one-dimensional NumPy array that it emulates.
 Series as generalized NumPy array
From what we’ve seen so far, it may look like the Series object is basically
inter‐ changeable with a one-dimensional NumPy array. The essential
difference is the pres‐ ence of the index: while the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series
has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional
capabilities. For example, the index need not be an integer, but can consist
of values of any desired type. For example, if we wish, we can use strings
as an index:
Code Output
import pandas as pd a 0.25
b 0.50
data = pd.Series([0.25, 0.5, 0.75, 1.0],
c 0.75
index=['a', 'b', 'c', 'd']) d 1.00
dtype: float64
print(data)
And the item access works as expected: 0.5
print(data ['b'] )
We can even use noncontiguous or 2 0.25
5 0.50
nonsequential indices: 3 0.75
data = pd.Series([0.25, 0.5, 0.75, 1.0], 7 1.00
dtype: float64
index=[2, 5, 3, 7])

9
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

print(data)

Series as Specialized Dictionary:


In this way, you can think of a Pandas Series a bit like a specialization of a
Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of
arbitrary values, and a Series is a structure that maps typed keys to a set of typed
values. This typing is important: just as the type-specific compiled code behind a
NumPy array makes it more efficient than a Python list for certain operations, the type
information of a Pandas Series makes it much more efficient than Python dictionaries
for certain operations.
We can make the Series-as-dictionary analogy even more clear by
constructing a
Series object directly from a Python dictionary:
Code Output
population_dict = {'California': 38332521, California 38332521
Texas 26448193
'Texas': 26448193, 'New York': 19651127,
New York 19651127
'Florida': 19552860,'llinois': 12882135} population = Florida 19552860
Illinois 12882135
pd.Series(population_dict) population
dtype: int64

By default, a Series will be created where the index 38332521


is drawn from the sorted keys. From here, typical
dictionary-style item access can be performed:
population['California']

10
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Unlike a dictionary, though, the Series also supports california 38332521


Texas 26448193
array-style operations such as slicing:
New York 19651127
population['California':'Illinois'] Florida 19552860
Illinois 12882135
dtype: int64
 Constructing Series objects

We’ve already seen a few ways of constructing a Pandas Series from


scratch; all of them are some version of the following:
>>> pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index
defaults to an integer sequence:
Code Output
import pandas as pd a 0.25
b 0.50
pd.Series([2, 4, 6])
0 2 1 4 2 6 dtype: int64
data can be a scalar, which is repeated to fill 100 5
the specified index: 200 5
pd.Series(5, index=[100, 200, 300]) 300 5
dtype: int64

data can be a dictionary, in which index 2a


defaults to the sorted dictionary keys: 1b
pd.Series({2:'a', 1:'b', 3:'c'}) 3c
dtype: object
In each case, the index can be explicitly set 3c
11
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

if a different result is preferred:


pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]) 2a
dtype: object
 The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame. Like the Series
object discussed in the previous section, the DataFrame can be thought of either
as a gener‐ alization of a NumPy array, or as a specialization of a Python
dictionary.
 DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a
DataFrame is an analog of a two-dimensional array with both flexible row indices
and flexible column names. Just as you might think of a two-dimensional array as
an ordered sequence of aligned one-dimensional columns, you can think of a
DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean
that they share the same index.
To demonstrate this, let’s first construct a new Series listing the area of
each of the five states discussed in the previous section:
Code Output
import pandas as pd California 423967
Texas 695662
# Creating a dictionary with area data
New York 141297
area_dict = { Florida 170312
Illinois 149995
'California': 423967,
dtype: int64
'Texas': 695662,

12
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

'New York': 141297,


'Florida': 170312,
'Illinois': 149995
}
# Creating a pandas Series from the
dictionary
area = pd.Series(area_dict)
# Displaying the Series
print(area)
we can use a dictionary to construct a population area
California 38332521 423967
single two-dimensional object containing
Texas 26448193 695662
this information: New York 19651127 141297
Florida 19552860 170312
states = pd.DataFrame({'population':
Illinois 12882135 149995
population, 'area': area}) # Displaying
the DataFrame print(states)
Print(states.index) Index(['California', 'Texas', 'New
York', 'Florida', 'Illinois'],
dtype='object')
the DataFrame has a columns attribute, Index(['area', 'population'],
dtype='object')
which is an Index object holding the
column labels:
states.columns
Thus the DataFrame can be thought of as a generalization of a two-
dimensional NumPy array, where both the rows and columns have a

13
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

generalized index for access‐ ing the data.


 DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary.
Where a dictionary maps a key to a value, a DataFrame maps a column name to a
Series of column data. For example, asking for the 'area' attribute returns the
Series object containing the areas we saw earlier:
In[22]: states['area']
Out[22]: California
423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
Notice the potential point of confusion here: in a two-dimensional NumPy
array, data[0] will return the first row. For a DataFrame, data['col0'] will
return the first column. Because of this, it is probably better to think about
DataFrames as generalized dictionaries rather than generalized arrays,
though both ways of looking at the situa‐ tion can be useful. We’ll explore
more flexible means of indexing DataFrames in “Data Indexing and
Selection” on page 107.
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we’ll

14
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

give several examples.


From a single Series object. A DataFrame is a collection of Series objects, and a
single- column DataFrame can be constructed from a single Series:
In[23]: pd.DataFrame(population, columns=['population'])

Out[23]: population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
From a list of dicts. Any list of dictionaries can be made into a DataFrame. We’ll
use a simple list comprehension to create some data:
In[24]: data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Out[24]: a b
0 0 0
1 1 2
2 2 4
Even if some keys in the dictionary are missing, Pandas will fill them in with
NaN (i.e., “not a number”) values:

15
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

In[25]: pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])


Out[25]: a b c
0 1.0 2 NaN
1 NaN 3 4.0
From a dictionary of Series objects. As we saw before, a DataFrame can be
constructed from a dictionary of Series objects as well:
In[26]: pd.DataFrame({'population': population,
'area': area})
Out[26]: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

From a two-dimensional NumPy array. Given a two-dimensional array of data, we


can create a DataFrame with any specified column and index names. If
omitted, an integer index will be used for each:
In[27]: pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
Out[27]: foo bar
a 0.865257 0.213169

16
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

b 0.442759 0.108267
c 0.047110 0.905718

From a NumPy structured array. We covered structured arrays in “Structured Data:


NumPy’s Structured Arrays” on page 92. A Pandas DataFrame operates
much like a structured array, and can be created directly from one:
In[28]: A = np.zeros(3, dtype=[('A', 'i8'), ('B',
'f8')]) A
Out[28]: array([(0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('A', '<i8'), ('B', '<f8')])
In[29]: pd.DataFrame(A)

Out[29]: A B
0 0 0.0
1 0 0.0
2 0 0.0
The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain
an explicit index that lets you reference and modify data. This Index object
is an interesting structure in itself, and it can be thought of either as an
immutable array or as an ordered set (technically a multiset, as Index
objects may contain repeated values). Those views have some interesting
consequences in the operations available on Index objects. As a simple

17
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

example, let’s construct an Index from a list of integers:


In[30]: ind = pd.Index([2, 3, 5, 7,
11]) ind
Out[30]: Int64Index([2, 3, 5, 7, 11], dtype='int64')
Index as immutable array
The Index object in many ways operates like an array. For example, we can
use stan‐ dard Python indexing notation to retrieve values or slices:
In[31]: ind[1]
Out[31]: 3
In[32]:
ind[::2]
Out[32]: Int64Index([2, 5, 11], dtype='int64')
Index objects also have many of the attributes familiar from NumPy arrays:
In[33]: print(ind.size, ind.shape, ind.ndim,
ind.dtype) 5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices
are immuta‐ ble—that is, they cannot be modified via the normal means:
In[34]: ind[1] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)

<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0

18
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/
base.py ...
1243
1244 def setitem (self, key, value):
-> 1245 raise TypeError("Index does not support mutable
operations") 1246
1247 def getitem (self, key):
TypeError: Index does not support mutable operations
This immutability makes it safer to share indices between multiple
DataFrames and arrays, without the potential for side effects from
inadvertent index modification.
Index as ordered set
Pandas objects are designed to facilitate operations such as joins
across datasets, which depend on many aspects of set arithmetic. The
Index object follows many of
the conventions used by Python’s built-in set data structure, so that unions,
intersec‐ tions, differences, and other combinations can be computed in a
familiar way:
In[35]: indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7,
11]) In[36]: indA & indB #
intersection

19
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Out[36]: Int64Index([3, 5, 7],


dtype='int64') In[37]: indA | indB #
union
Out[37]: Int64Index([1, 2, 3, 5, 7, 9, 11],
dtype='int64') In[38]: indA ^ indB # symmetric
difference
Out[38]: Int64Index([1, 2, 9, 11], dtype='int64')
These operations may also be accessed via object methods—for example,
indA.inter section(indB).
Data Indexing and Selection
In Chapter 2, we looked in detail at methods and tools to access, set, and
modify val‐ ues in NumPy arrays. These included indexing (e.g., arr[2,
1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing
(e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here
we’ll look at similar means of accessing and modifying values in Pandas
Series and DataFrame objects. If you have used the NumPy patterns, the
corresponding patterns in Pandas will feel very famil‐ iar, though there are
a few quirks to be aware of.
We’ll start with the simple case of the one-dimensional Series object, and
then move on to the more complicated two-dimensional DataFrame object.
Data Selection in Series
As we saw in the previous section, a Series object acts in many ways like a
one- dimensional NumPy array, and in many ways like a standard Python

20
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

dictionary. If we keep these two overlapping analogies in mind, it will


help us to understand the pat‐ terns of data indexing and selection in these
arrays.
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection
of keys to a collection of values:
In[1]: import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Out[1]: 0.25
a
b 0.50
c 0.75
d 1.00
dtype:
float64 In[2]:
data['b'] Out[2]:
0.5
We can also use dictionary-like Python expressions and methods to
examine the keys/indices and values:
In[3]: 'a' in data
Out[3]: True

21
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

In[4]:
data.keys()
Out[4]: Index(['a', 'b', 'c', 'd'], dtype='object')
In[5]: list(data.items())
Out[5]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can even be modified with a dictionary-like syntax. Just as
you can extend a dictionary by assigning to a new key, you can extend a
Series by assigning to a new index value:
In[6]: data['e'] = 1.25
data
Out[6]: a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
This easy mutability of the objects is a convenient feature: under the hood,
Pandas is making decisions about memory layout and data copying that
might need to take place; the user generally does not need to worry about
these issues.
Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style
item selec‐ tion via the same basic mechanisms as NumPy arrays—that is,

22
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

slices, masking, and fancy indexing. Examples of these are as follows:


In[7]: # slicing by explicit index
data['a':'c']
Out[7]: a 0.25
b 0.50
c 0.75
dtype: float64
In[8]: # slicing by implicit integer index
data[0:2]
Out[8]: a 0.25
b 0.50
dtype: float64
In[9]: # masking
data[(data > 0.3) & (data < 0.8)]
Out[9]: b 0.50
c 0.75
dtype: float64
In[10]: # fancy indexing
data[['a', 'e']]
Out[10]: a 0.25
e 1.25
dtype: float64
Among these, slicing may be the source of the most confusion. Notice that

23
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

when you are slicing with an explicit index (i.e., data['a':'c']), the final
index is included in the slice, while when you’re slicing with an implicit
index (i.e., data[0:2]), the final index is excluded from the slice.
Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion. For
example, if your Series has an explicit integer index, an indexing
operation such as data[1] will use the explicit indices, while a slicing
operation like data[1:3] will use the implicit Python-style index.
In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3,
5]) data
Out[11]: 1 a
2 b
5 c
dtype: object
In[12]: # explicit index when indexing
data[1
] Out[12]: 'a'
In[13]: # implicit index when slicing
data[1:3]
Out[13]: 3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas

24
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

provides some special indexer attributes that explicitly expose certain


indexing schemes. These
are not functional methods, but attributes that expose a particular slicing
interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references the
explicit index:
In[14]: data.loc[1]
Out[14]: 'a'
In[15]:
data.loc[1:3]
Out[15]: 1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the
implicit Python-style index:
In[16]: data.iloc[1]
Out[16]: 'b'
In[17]:
data.iloc[1:3]
Out[17]: 3 b
5 c
dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects

25
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

is equiva‐ lent to standard []-based indexing. The purpose of the ix indexer


will become more apparent in the context of DataFrame objects, which we
will discuss in a moment.
One guiding principle of Python code is that “explicit is better than
implicit.” The explicit nature of loc and iloc make them very useful in
maintaining clean and read‐ able code; especially in the case of integer
indexes, I recommend using these both to make code easier to read and
understand, and to prevent subtle bugs due to the mixed
indexing/slicing convention.
Data Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or
structured array, and in other ways like a dictionary of Series structures
sharing the same index. These analogies can be helpful to keep in mind as
we explore data selection within this structure.
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related
Series
objects. Let’s return to our example of areas and populations of states:
In[18]: area = pd.Series({'California': 423967, 'Texas':
695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,

26
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

'New York': 19651127, 'Florida': 19552860,


'Illinois': 12882135})
data = pd.DataFrame({'area':area,
'pop':pop}) data

Out[18 area pop


]:
Californi 4239 383325
a 67 21
Florida 1703 195528
12 60
Illinois 1499 128821
95 35
New 1412 196511
York 97 27
Texas 6956 264481
62 93
The individual Series that make up the columns of the DataFrame can be
accessed via dictionary-style indexing of the column name:
In[19]: data['area']

Out[19]: 42396
California 7
Florida 17031
27
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

2
Illinois 14999
5
New York14129
7
Texas 69566
2
Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are
strings:
In[20]: data.area

Out[20]: 42396
California 7
Florida 17031
2
Illinois 14999
5
New York14129
7
Texas 69566
2
Name: area, dtype: int64
This attribute-style column access actually accesses the exact same object
28
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

as the dictionary-style access:


In[21]: data.area is data['area']
Out[21]: True
Though this is a useful shorthand, keep in mind that it does not work for
all cases! For example, if the column names are not strings, or if the
column names conflict with methods of the DataFrame, this attribute-style
access is not possible. For exam‐ ple, the DataFrame has a pop() method,
so data.pop will point to this rather than the "pop" column:
In[22]: data.pop is
data['pop'] Out[22]: False

29
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

In particular, you should avoid the temptation to try column assignment


via attribute (i.e., use data['pop'] = z rather than data.pop = z).
Like with the Series objects discussed earlier, this dictionary-style syntax can
also be used to modify the object, in this case to add a new column:
In[23]: data['density'] = data['pop'] /
data['area'] data

Out[23 area pop density


]:
Californi 4239 383325 90.41392
a 67 21 6
Florida 1703 195528 114.8061
12 60 21
Illinois 1499 128821 85.88376
95 35 3
New 1412 196511 139.0767
York 97 27 46
Texas 6956 264481 38.01874
62 93 0
This shows a preview of the straightforward syntax of element-by-element
arithmetic between Series objects; we’ll dig into this further in “Operating
on Data in Pandas” on page 115.
DataFrame as two-dimensional array
As mentioned previously, we can also view the DataFrame as an enhanced

30
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

two- dimensional array. We can examine the raw underlying data array
using the values attribute:
In[24]: data.values

Out[24]: 4.23967000e+ 3.83325210e+ 9.04139261e+


array([[ 05, 07, 01],
[ 1.70312000e+ 1.95528600e+ 1.14806121e+
05, 07, 02],
[ 1.49995000e+ 1.28821350e+ 8.58837628e+
05, 07, 01],
[ 1.41297000e+ 1.96511270e+ 1.39076746e+
05, 07, 02],
[ 6.95662000e+ 2.64481930e+ 3.80187404e+
05, 07, 01]])
With this picture in mind, we can do many familiar array-like observations
on the DataFrame itself. For example, we can transpose the full DataFrame to
swap rows and columns:
In[25]: data.T
Out[25]:
California Florida Illinois New York Texas
area 4.239670e 1.703120e 1.499950e 1.412970e 6.956620e
+05 +05 +05 +05 +05
pop 3.833252e 1.955286e 1.288214e 1.965113e 2.644819e
+07 +07 +07 +07 +07

31
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

densit 9.041393e 1.148061e 8.588376e 1.390767e 3.801874e


y +01 +02 +01 +02 +01
When it comes to indexing of DataFrame objects, however, it is clear that
the dictionary-style indexing of columns precludes our ability to simply
treat it as a NumPy array. In particular, passing a single index to an array
accesses a row:
In[26]: data.values[0]
Out[26]: array([ 4.23967000e+05, 3.83325210e+07,
9.04139261e+01]) and passing a single
“index” to a DataFrame accesses a column:
In[27]: data['area']

Out[27]: 42396
California 7
Florida 17031
2
Illinois 14999
5
New York14129
7
Texas 69566
2
Name: area, dtype: int64
Thus for array-style indexing, we need another convention. Here Pandas
again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc
32
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

indexer, we can index the underlying array as if it is a simple NumPy


array (using the implicit Python-style index), but the DataFrame index
and column labels are maintained in the result:
In[28]: data.iloc[:3, :2]

Out[28]: area pop


California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
In[29]: data.loc[:'Illinois', :'pop']

Out[29]: area pop


California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
The ix indexer allows a hybrid of these two approaches:
In[30]: data.ix[:3, :'pop']

Out[30]: area pop


California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
Keep in mind that for integer indices, the ix indexer is subject to the same
potential sources of confusion as discussed for integer-indexed Series
33
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

objects.
Any of the familiar NumPy-style data access patterns can be used within
these index‐ ers. For example, in the loc indexer we can combine masking
and fancy indexing as in the following:
In[31]: data.loc[data.density > 100, ['pop',
'density']] Out[31]: pop density
Florida 19552860 114.806121
New York 19651127 139.076746

34
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Any of these indexing conventions may also be used to set or modify


values; this is done in the standard way that you might be accustomed
to from working with NumPy:
In[32]: data.iloc[0, 2] =
90 data
Out[32] area pop density
:
California 42396 3833252 90.000000
7 1
Florida 17031 1955286 114.80612
2 0 1
Illinois 14999 1288213 85.883763
5 5
New York 14129 1965112 139.07674
7 7 6
Texas 69566 2644819 38.018740
2 3
To build up your fluency in Pandas data manipulation, I suggest spending
some time with a simple DataFrame and exploring the types of indexing,
slicing, masking, and fancy indexing that are allowed by these various
indexing approaches.
Additional indexing conventions
There are a couple extra indexing conventions that might seem at odds
with the pre‐ ceding discussion, but nevertheless can be very useful in
practice. First, while index‐ ing refers to columns, slicing refers to rows:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

In[33]: data['Florida':'Illinois']
Out[33]: area pop
density Florida 170312
19552860 114.806121
Illinois 149995 12882135 85.883763
Such slices can also refer to rows by number rather than by index:
In[34]: data[1:3]
Out[34]: area pop
density Florida 170312
19552860 114.806121
Illinois 149995 12882135 85.883763
Similarly, direct masking operations are also interpreted row-wise rather
than column-wise:
In[35]: data[data.density > 100]
Out[35]: area pop
density Florida 170312
19552860 114.806121
New York 141297 19651127 139.076746
These two conventions are syntactically similar to those on a NumPy
array, and while these may not precisely fit the mold of the Pandas
conventions, they are nevertheless quite useful in practice.
Pandas Data operations
One of the essential pieces of NumPy is the ability to perform quick element-
wise operations, both with basic arithmetic (addition, subtraction,
multiplication, etc.) and with more sophisticated operations (trigonometric
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

functions, exponential and logarithmic functions, etc.). Pandas inherits much


of this functionality from NumPy, and the ufuncs that we introduced
in Computation on NumPy Arrays: Universal Functions are key to this. In
Pandas, there are different useful data operations for DataFrame, which are as
follows :
Row and column selection
We can select any row and column of the DataFrame by passing the name of
the rows and column. When you select it from the DataFrame, it becomes
one-dimensional and considered as Series.
Filter Data
We can filter the data by providing some of the boolean expression in DataFrame.
Null values
A Null value can occur when no data is being provided to the items. The
various columns may contain no values which are usually represented as
NaN. In Pandas, several useful functions are available for detecting,
removing, and replacing the null values in Data Frame. These functions are as
follows:
isnull(): The main task of isnull() is to return the true value if any row has
null values.
notnull(): It is opposite of isnull() function and it returns true values for not
null value.
dropna(): This method analyzes and drops the rows/columns of null values.
illna(): It allows the user to replace the NaN values with some other values.
replace(): It is a very rich function that replaces a string, regex, series,
dictionary, etc.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

interpolate(): It is a very powerful function that fills null values in the


DataFrame or series.
String operation
A set of a string function is available in Pandas to operate on string data and
ignore the missing/NaN values. There are different string operation that can
be performed using .str. option. These functions are as follows:
lower(): It converts any strings of the series or index into lowercase letters.
upper(): It converts any string of the series or index into uppercase letters.
strip(): This function helps to strip the whitespaces including a new line from
each string in the Series/index.
split(' '): It is a function that splits the string with the given pattern.
cat(sep=' '): It concatenates series/index elements with a given separator.
ontains(pattern): It returns True if a substring is present in the element, else
False.
replace(a,b): It replaces the value a with the value b.
repeat(value): It repeats each element with a specified number of times.
count(pattern): It returns the count of the appearance of a pattern in each
element.
startswith(pattern): It returns True if all the elements in the series starts with
a pattern.
endswith(pattern): It returns True if all the elements in the series ends with a
pattern.
find(pattern): It is used to return the first occurrence of the pattern.
findall(pattern): It returns a list of all the occurrence of the pattern.
swapcase: It is used to swap the case lower/upper.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

islower(): It returns True if all the characters in the string of the Series/Index
are in lowercase. Otherwise, it returns False.
isupper(): It returns True if all the characters in the string of the Series/Index
are in uppercase. Otherwise, it returns False.
isnumeric(): It returns True if all the characters in the string of the
Series/Index are numeric. Otherwise, it returns False.
Count Values
This operation is used to count the total number of occurrences using
'value_counts()' option.
Plots
Pandas plots the graph with the matplotlib library. The .plot() method allows
you to plot the graph of your data.
.plot() function plots index against every column.
You can also pass the arguments into the plot() function to draw a specific
column.
Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will
work on Pandas Series and DataFrame objects. Let’s start by defining a
simple Series and DataFrame on which to demonstrate this:
In[1]: import pandas
as pd import
numpy as np
In[2]: rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10,
4)) ser
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Out[2]: 0 6
1 3
2 7
3 4
dtype: int64
In[3]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df
Out[3]: A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
If we apply a NumPy ufunc on either of these objects, the result will be
another Pan‐ das object with the indices preserved:
In[4]: np.exp(ser)
Out[4]: 403.42879
0 3
1 20.085537
2 1096.6331
58
3 54.598150
dtype: float64
Or, for a slightly more complex calculation:
In[5]: np.sin(df * np.pi / 4)
Out[5]: A B C D
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

0 -1.000000 7.071068e-01 1.000000 -1.000000e+00


1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
Any of the ufuncs discussed in “Computation on NumPy Arrays:
Universal Func‐ tions” on page 50 can be used in a similar manner.
UFuncs: Index Alignment
For binary operations on two Series or DataFrame objects, Pandas will
align indices in the process of performing the operation. This is very
convenient when you are working with incomplete data, as we’ll see in
some of the examples that follow.
Index alignment in Series
As an example, suppose we are combining two different data sources, and
find only the top three US states by area and the top three US states by
population:
In[6]: area = pd.Series({'Alaska': 1723337, 'Texas':
695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
Let’s see what happens when we divide these to compute the population
density:
In[7]: population / area
Out[7]: Alaska
California90.413926
New York NaN
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Texas 38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays,
which we could determine using standard Python set arithmetic on these
indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with
NaN, or “Not a Number,” which is how Pandas marks missing data (see
further discussion of missing data in “Handling Missing Data” on page
119). This index matching is implemented this way for any of Python’s
built-in arithmetic expressions; any missing val‐ ues are filled in with NaN
by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2,
3]) A + B
Out[9]: 0 NaN
1 5.0
2 9.0
3 NaN
dtype:
float64
If using NaN values is not the desired behavior, we can modify the fill
value using appropriate object methods in place of the operators. For
example, calling A.add(B) is equivalent to calling A + B, but allows
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

optional explicit specification of the fill value for any elements in A or B


that might be missing:
In[10]: A.add(B, fill_value=0)

Out[10]: 2.0
0
1 5.0
2 9.0
3 5.0
dtype: float64

Index alignment in DataFrame


A similar type of alignment takes place for both columns and indices when
you are performing operations on DataFrames:
In[11]: A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
A
Out[11]: A B
0 1 11
1 5 1
In[12]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
B
Out[12]:
B A
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

C 0 4
0 9
1 5 8 0
2 9 2 6
In[13]: A + B
Out[13]: A
BC 0 1.0
15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the asso‐
ciated object’s arithmetic method and pass any desired fill_value to be used in place
of missing entries. Here we’ll fill with the mean of all values in A (which we compute
by first stacking the rows of A):
In[14]: fill =
A.stack().mean()
A.add(B,
fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Table 3-1 lists Python operators and their equivalent Pandas object methods.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Table 3-1. Mapping between Python operators and Pandas methods


Python operator Pandas method(s)
+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()

Ufuncs: Operations Between DataFrame and Series


When you are performing operations between a DataFrame and a Series,
the index and column alignment is similarly maintained. Operations
between a DataFrame and a Series are similar to operations between a
two-dimensional and one-dimensional NumPy array. Consider one
common operation, where we find the difference of a two-dimensional
array and one of its rows:
In[15]: A = rng.randint(10,
size=(3, 4)) A
Out[15]: array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])

In[16]: A - A[0]
Out[16]: array([[ 0, 0, 0, 0],
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

[-1, -2, 2, 4],


[ 3, -7, 1, 4]])
According to NumPy’s broadcasting rules (see “Computation on Arrays:
Broadcast‐ ing” on page 63), subtraction between a two-dimensional array
and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
In[17]: df = pd.DataFrame(A,
column=list('QRST')) df - df.iloc[0]
Out[17]: Q R S T
0 0 0 0 0
1 -1 -2 2 4
2 3 -7 1 4
If you would instead like to operate column-wise, you can use the object
methods mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18] Q R S T
:
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7
Note that these DataFrame/Series operations, like the operations discussed
before, will automatically align indices between the two elements:
In[19]: halfrow =
df.iloc[0, ::2] halfrow
Out[19]: Q 3
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

S 2
Name: 0, dtype:
int64 In[20]: df - halfrow
Out[20 Q R S T
]:
0 Na 0.0 Na
0.0 N N
1 - Na 2.0 Na
1.0 N N
2 Na 1.0 Na
3.0 N N
This preservation and alignment of indices and columns means that
operations on data in Pandas will always maintain the data context, which
prevents the types of silly errors that might come up when you are
working with heterogeneous and/or mis‐ aligned data in raw NumPy
arrays.
Handling Missing Data
Missing Data can occur when no information is provided for one or more
items or for a whole unit. Missing Data is a very big problem in a real-life
scenarios. Missing Data can also refer to as NA(Not Available) values in
pandas. In Pandas missing data is represented by two value:
 None: None is a Python singleton object that is often used for missing
data in Python code.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 NaN : NaN (an acronym for Not a Number), is a special floating-point


value recognized by all systems that use the standard IEEE floating-
point representation
Handling Missing Data
Following these stages, we handle missing data. We'll go through each step in
more detail, but here's a general idea:
1. We start by importing the necessary packages.
2. We use the read_csv() function to read the dataset.
3. The dataset is printed. And we check if any record has NaN values or
missing data.
4. On the dataset, we apply the dropna() function. The records that
contain missing values are deleted using this procedure. In order to
remove the entries and update the new dataset in the same variable, we
additionally pass the argument in place to be True.
5. The dataset is printed. No records contain missing values anymore.
Calculation with Missing Data
None, is a Python singular object that is frequently used for missing
data in Python programs. Because it is a Python object, None can only be
used in arrays of the data type "object" (i.e., arrays of Python objects), and
cannot be used in any other NumPy/Pandas array:
import numpy as np
import pandas as pd
array = np.array([3, None, 0, 4, None])
print(array)
Output:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

array([3, None, 0, 4, None], dtype=object)


The alternative missing data representation, NaN (an acronym for Not a
Number), is distinct; it is a unique floating-point value recognized by all
systems that utilize the common IEEE floating-point notation:
import numpy as np
import pandas as pd
array = np.array([3, np.nan, 0, 4, np.nan])
print(array)
Output:
array([ 3., nan, 0., 4., nan])
It's important to note that NumPy selected a native floating-point type for this
array, which implies that in contrast to the object array from earlier, this array
allows rapid operations that are pushed into the produced code.
Cleaning Missing Data
The result of the isna() and isnull() methods is a Boolean check of whether or
not each cell of the DataFrame has a missing value. In this way, if a value is
absent from a certain cell, the function will return True; otherwise, it will
return False (if the cell has a value).
We've undoubtedly observed that both isna() and isnull() provide an identical
response, so you can use either one to display the Boolean check to see if
there is missing data or not.
#Import the libraries
import numpy as np
import pandas as pd
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

# Create a CSV dataset


data_string = '''ID,Gender,Salary,Country,Company
1,Male,15000,India,Google
2,Female,45000,China,NaN
3,Female,25000,India,Google
4,NaN,NaN,Australia,Google
5,Male,NaN,India,Google
6,Male,54000,NaN,Alibaba
7,NaN,74000,China,NaN
8,Male,14000,Australia,NaN
9,Female,15000,NaN,NaN
10,Male,33000,Australia,NaN'''
with open('salary.csv', 'w') as out:
out.write(data_string)
# Import the dataset
df = pd.read_csv('/content/salary.csv')
print('Salary Dataset: \n', df)
# Check for missing data
print('Missing Data\n', df.isna())
print('Missing Data\n', df.isnull())
# Print only missing data
print('Filter based on columns: \n', df[df.isnull().any(axis=1)])
# Sum up the missing values
print('Sum up the missing values: \n', df.isnull().sum())
Output:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Salary Dataset:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
2 3 Female 25000.0 India Google
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Missing Data
ID Gender Salary Country Company
0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True
7 False False False False True
8 False False False True True
9 False False False False True
Missing Data
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

ID Gender Salary Country Company


0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True
7 False False False False True
8 False False False True True
9 False False False False True
Filter based on columns:
ID Gender Salary Country Company
1 2 Female 45000.0 China NaN
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Sum up the missing values:
ID 0
Gender 2
Salary 2
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Country 2
Company 5
dtype: int64
Dropping Missing Data
You can choose to either ignore missing data or substitute values for it when
handling missing data. As we can see at the bottom of the DataFrame output,
this results in a clean DataFrame with no missing data.
df.dropna(inplace=True)
print(df)
Output:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
2 3 Female 25000.0 India Google
Replacing Missing Data
You can choose to either ignore missing data or substitute values for it when
handling missing data. Fortunately, the Pandas fillna() method may be used to
replace missing values in a DataFrame with a value given by the user. Type
the following to replace any missing values with the number 0 (i.e., the value
of 0 is arbitrary and may be any other value of your choice):
df["Salary"].fillna(20000, inplace=True)
print(df)
Output:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

2 3 Female 25000.0 India Google


3 4 NaN 20000.0 Australia Google
4 5 Male 20000.0 India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Important Functions for Handling Missing Data in Pandas
Pandas treat None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in Pandas
DataFrame :
 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()
Checking for missing values using isnull() and notnull()
In order to check missing values in Pandas DataFrame, we use a function
isnull() and notnull(). Both function help in checking whether a value is
NaN or not. These function can also be used in Pandas Series in order to
find null values in a series.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Checking for missing values using isnull()


In order to check null values in Pandas DataFrame, we use isnull() function
this function return dataframe of Boolean values which are True for NaN
values.
Code #1:

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from list
df = pd.DataFrame(dict)
# using isnull() function
df.isnull()

Output:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Filling missing values using fillna(), replace() and interpolate()


In order to fill null values in a datasets, we use fillna(), replace() and
interpolate() function these function replace NaN values with some value of
their own. All these function help in filling a null values in datasets of a
DataFrame. Interpolate() function is basically used to fill NA values in the
dataframe but it uses various interpolation technique to fill the missing
values rather than hard-coding the value.
Code #1: Filling null values with a single value

 Python

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
# filling missing value using fillna()
df.fillna(0)

Output:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Dropping missing values using dropna()


In order to drop a null values from a dataframe, we used dropna() function
this function drop Rows/Columns of datasets with Null values in different
ways.
Code #1: Dropping rows with at least 1 null value.

 Python

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
df
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Output

Now we drop rows with at least one Nan value (Null value)
Hierarchical Indexing
Hierarchical indexing is a method of creating structured group relationships in
data.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more
than two dimensions. As we already know, a Series is a one-dimensional
labelled NumPy array and a DataFrame is usually a two-dimensional table
whose columns are Series. In some instances, in order to carry out some
sophisticated data analysis and manipulation, our data is presented in higher
dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical
Index as the name suggests is ordering more than one item in terms of their
ranking.
Hierarchical indexing is a method of creating structured group relationships in
the dataset. Data frames can have hierarchical indexes. To show this, let me
create a dataset.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Notice that in this dataset, both row and column have hierarchical indexes.
You can name hierarchical levels. Let’s show this.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Selecting in Hierarchical Indexing


You can select subgroups of data. For example, let’s select the index named
num.

What is Swaplevel?
Sometimes, you may want to swap the level of the indexes. You can use the
swaplevel method for this. The swaplevel method takes two levels and returns
a new object. For example, let’s swap the class and exam indexes in the
dataset.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Sorting in Hierarchical Indexing


To sort the indexes by level, you can use the sort_index method. For example,
let’s sort the dataset by level 1.

Summary Statistics in Hierarchical Indexing


Summary statistics in Series or DataFrame are found by one level. If you have
more than one level of data, you can calculate summary statistics according to
the level. For example, let’s see the sum values according to the exam level in
the dataset.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Let’s see the total values according to the field level.

Hierarchical Indexing in The Data Frame


You can move the DataFrame’s columns to the row index. To show this, let’s
create a dataset.

Let’s transform columns a and b of this dataset into a row index.


Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

In the set_index method, the indexes moved to the row are removed from the
column. You can use drop = False to remain the columns you get as an index
in the same place.

Let’s first take a look at data2 to demonstrate the reset_index method.


Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

You can use the reset_index method to restore the dataset.


Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Combining Datasets with Concat in Pandas


Concatenation is a powerful method in pandas for combining datasets. It
allows you to stack dataframes either vertically (adding rows) or horizontally
(adding columns). The concat function is versatile and can handle different
types of concatenation operations.
Types of Concatenation
1. Concatenating Along Rows (Vertical Concatenation)
2. Concatenating Along Columns (Horizontal Concatenation)
3. Concatenating with Different Indexes
4. Concatenating with Keys
1. Concatenating Along Rows (Vertical Concatenation)
When concatenating along rows, dataframes are stacked on top of each other.
This is the default behavior of concat.
Example:
python
import pandas as pd
# Creating example dataframes
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenating along rows (axis=0)
df_concat_rows = pd.concat([df1, df2])
print(df_concat_rows)
Output:
A B
0 1 3
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

1 2 4
0 5 7
1 6 8
2. Concatenating Along Columns (Horizontal Concatenation)
When concatenating along columns, dataframes are merged side by side.
Example:
python

# Concatenating along columns (axis=1)


df_concat_cols = pd.concat([df1, df2], axis=1)
print(df_concat_cols)
Output:
A B A B
0 1 3 5 7
1 2 4 6 8
3. Concatenating with Different Indexes
When the dataframes have different indexes, the concat function aligns them
by the index. Missing values will be filled with NaN.
Example:
python
# Creating dataframes with different indexes
df3 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df4 = pd.DataFrame({'B': [3, 4]}, index=[2, 3])
# Concatenating with different indexes
df_concat_diff_idx = pd.concat([df3, df4], axis=1)
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

print(df_concat_diff_idx)
Output:
A B
0 1.0 NaN
1 2.0 NaN
2 NaN 3.0
3 NaN 4.0
4. Concatenating with Keys
Adding keys creates a hierarchical index, useful for identifying the source of
each row in the concatenated dataframe.
Example:
python
# Concatenating with keys
df_concat_keys = pd.concat([df1, df2], keys=['df1', 'df2'])
print(df_concat_keys)
Output:
A B
df1 0 1 3
1 2 4
df2 0 5 7
1 6 8
Summary
 Concatenating Along Rows (Vertical): Stacks dataframes vertically.
 Concatenating Along Columns (Horizontal): Merges dataframes
side by side.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 Concatenating with Different Indexes: Aligns dataframes by index,


filling missing values with NaN.
 Concatenating with Keys: Creates a hierarchical index for
identifying the source of each row.
Conclusion
Using concat in pandas, you can efficiently combine datasets in various ways,
whether you need to stack rows, merge columns, align different indexes, or
add hierarchical keys. This flexibility makes concat a powerful tool for data
manipulation.
Combining Datasets with Append in Pandas

Appending is a straightforward method in pandas for adding rows of one dataframe


to the end of another. The `append` function is specifically designed for this
purpose, making it easy to combine datasets row-wise.

Appending DataFrames

When you use the `append` method, it returns a new dataframe with the rows of the
second dataframe added to the end of the first one. The original dataframes remain
unchanged unless explicitly reassigned.
Example 1: Simple Appending
```python
import pandas as pd
# Creating example dataframes
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})


# Appending df2 to df1
df_append = df1.append(df2, ignore_index=True)
print(df_append)
Output:**
A B
0 1 3
1 2 4
2 5 7
3 6 8
```
Appending DataFrames with Different Columns
When appending dataframes with different columns, missing values will be filled
with NaN for the columns that do not exist in one of the dataframes.
Example 2: Appending with Different Columns
```python
# Creating dataframes with different columns
df3 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df4 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})
# Appending df4 to df3
df_append_diff_cols = df3.append(df4, ignore_index=True)
print(df_append_diff_cols)
Output:**
A B C
0 1 3 NaN
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

1 2 4 NaN
2 NaN 5 7.0
3 NaN 6 8.0
Appending a Series to a DataFrame
You can also append a Series to a dataframe as a new row.
Example 3: Appending a Series
python
# Creating a dataframe
df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Creating a series
s = pd.Series({'A': 5, 'B': 6})
# Appending the series to the dataframe
df_append_series = df5.append(s, ignore_index=True)
print(df_append_series)
Output:**
A B
0 1 3
1 2 4
2 5 6
Summary
Simple Appending**: Adds rows of one dataframe to the end of another.
Appending with Different Columns**: Handles different columns by filling
missing values with NaN.
Appending a Series**: Adds a series as a new row to the dataframe.
Conclusion
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

The `append` method in pandas is an intuitive and efficient way to combine


datasets row-wise. Whether you're dealing with matching columns, different
columns, or even appending a single row (series), `append` provides a
straightforward solution for expanding your dataframes.
Combining Datasets: Merge and Join
Merging and joining are powerful techniques in pandas for combining
datasets based on common columns or indices. These operations allow you to
integrate data from different sources in a flexible manner.
Merging DataFrames
The merge function combines datasets based on common columns or indices.
It supports various types of joins, including inner, outer, left, and right joins.
Example 1: Inner Join
An inner join returns only the rows with matching keys in both dataframes.
python

import pandas as pd
# Creating example dataframes
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
# Performing an inner join
df_inner = pd.merge(df1, df2, on='key', how='inner')
print(df_inner)
Output:
key value1 value2
0 A 1 4
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

1 B 2 5
Example 2: Left Join
A left join returns all rows from the left dataframe and the matched rows from
the right dataframe. Missing values are filled with NaN.
python
# Performing a left join
df_left = pd.merge(df1, df2, on='key', how='left')
print(df_left)
Output:
key value1 value2
0 A 1 4.0
1 B 2 5.0
2 C 3 NaN
Example 3: Outer Join
An outer join returns all rows when there is a match in one of the dataframes.
Missing values are filled with NaN.
python
# Performing an outer join
df_outer = pd.merge(df1, df2, on='key', how='outer')
print(df_outer)
Output:
key value1 value2
0 A 1.0 4.0
1 B 2.0 5.0
2 C 3.0 NaN
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

3 D NaN 6.0
Joining DataFrames
The join method is used for combining dataframes on their indices. It is
similar to merge, but it is based on indices rather than columns.
Example 4: Simple Join
python
# Creating example dataframes with indices
df3 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df4 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])

# Performing a join
df_join = df3.join(df4, how='inner')
print(df_join)
Output:
value1 value2
A 1 4
B 2 5
Example 5: Left Join with Join
python
# Performing a left join with join
df_join_left = df3.join(df4, how='left')
print(df_join_left)
Output:
value1 value2
A 1 4.0
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

B 2 5.0
C 3 NaN
Summary
 Merge: Combines dataframes based on common columns or indices.
o Inner Join: Returns rows with matching keys in both
dataframes.
o Left Join: Returns all rows from the left dataframe and
matched rows from the right dataframe.
o Outer Join: Returns all rows when there is a match in one of
the dataframes.
 Join: Combines dataframes based on their indices.
o Simple Join: Performs a join based on indices.
o Left Join with Join: Returns all rows from the left dataframe
and matched rows from the right dataframe based on indices.
Conclusion
Using merge and join in pandas, you can efficiently combine datasets based
on columns or indices, providing flexibility in how you integrate data from
different sources. These operations are fundamental for data manipulation and
analysis, enabling you to create comprehensive datasets for further
exploration and insights.
Aggregation and Grouping
Aggregation and grouping are powerful techniques in pandas for summarizing
and analyzing data. Grouping allows you to split the data into groups based on
some criteria, and aggregation lets you compute summary statistics for each
group.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Grouping DataFrames
The groupby method is used to group data in pandas. This method splits the
data into groups based on some criteria.
Example: Grouping Data
python
import pandas as pd
# Creating an example dataframe
df = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C'],
'Value': [10, 15, 10, 20, 25]
})
# Grouping by 'Category'
grouped = df.groupby('Category')
print(grouped)
Output: The output is a DataFrameGroupBy object. To see the grouped data,
you need to apply an aggregation function.
Aggregating DataFrames
Aggregation involves computing summary statistics for each group. Common
aggregation functions include sum, mean, count, etc.
Example 1: Aggregating with Sum
python
# Aggregating the grouped data with sum
sum_agg = grouped.sum()
print(sum_agg)
Output:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Value
Category
A 25
B 30
C 25
Example 2: Aggregating with Multiple Functions
You can apply multiple aggregation functions to each group.
python

# Aggregating with multiple functions


multi_agg = grouped.agg(['sum', 'mean', 'count'])
print(multi_agg)
Output:
css

Value
sum mean count
Category
A 25 12.5 2
B 30 15.0 2
C 25 25.0 1
Example 3: Custom Aggregation Functions
You can also define custom aggregation functions.
python
# Defining a custom aggregation function
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

def range_func(x):
return x.max() - x.min()
# Applying the custom function
custom_agg = grouped.agg(range_func)
print(custom_agg)
Output:
Value
Category
A 5
B 10
C 0
Grouping by Multiple Columns
You can group by multiple columns to create a hierarchical index.
Example: Grouping by Multiple Columns
python
# Creating an example dataframe with multiple columns
df_multi = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C'],
'SubCategory': ['X', 'Y', 'X', 'Y', 'X'],
'Value': [10, 15, 10, 20, 25]
})
# Grouping by 'Category' and 'SubCategory'
grouped_multi = df_multi.groupby(['Category', 'SubCategory'])
multi_agg = grouped_multi.sum()
print(multi_agg)
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Output:
Value
Category SubCategory
A X 10
Y 15
B X 10
Y 20
C X 25
Grouping and Aggregating with Pivot Tables
Pivot tables provide a way to summarize data in a tabular format.
Example: Creating a Pivot Table
python
# Creating a pivot table
pivot_table = pd.pivot_table(df_multi, values='Value', index=['Category'],
columns=['SubCategory'], aggfunc='sum')
print(pivot_table)
Output:
SubCategory X Y
Category
A 10.0 15.0
B 10.0 20.0
C 25.0 NaN
Summary
 Grouping: Use groupby to split the data into groups based on some
criteria.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 Aggregation: Apply aggregation functions like sum, mean, count, and


custom functions to summarize data for each group.
 Grouping by Multiple Columns: Group by multiple columns to
create hierarchical indices.
 Pivot Tables: Use pivot tables to summarize data in a tabular format.
Conclusion
Aggregation and grouping in pandas provide a robust framework for
summarizing and analyzing data. These techniques allow you to derive
meaningful insights from your data by computing summary statistics and
organizing data into easily interpretable formats.
Pivot Tables in Pandas
A pivot table in Pandas is a quantitative table that summarizes a large
DataFrame, such as a large dataset. It is a component of data processing.
In pivot tables, the report may include average, mode, summation, or other
statistical elements. Pivot tables were originally linked with Microsoft
Excel. However, they may also be created in Python employing Pandas.
Pandas has grown in popularity among data scientists as a platform
for analyzing and altering data. Pandas is a fairly fundamental and flexible
language with which we are all acquainted. It provides numerous useful
features to assist us in transforming data into the desired
format. Pandas.pivot_table()
Pivot tables in Pandas allow users to examine subsets of data depending
on indexes and values. Values are organized by index and provided to the
user. The syntax for the Pandas.pivot_table() function is as follows:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Syntax
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc=’mean’, fill_value=None, margins=False, dropna=True,
margins_name=’All’, observed=False)
Parameters
It requires the following set of parameters:
Sr. Parameter
Parameter Description
No Name
It requires a Pandas DataFrame or the database from which
1 data
the pivot table is to be created.
It is a purely optional parameter. It is used to indicate which
2 values
Column's statistical summary should be displayed.
It specifies the Column that will be employed to index the
feature specified in the values parameter. If an array is
3 index
supplied as a parameter, it must be of a similar length as the
Dataset.
It is used to aggregate information based on specified
4 columns
column characteristics.
It specifies the set of functions that must be executed on our
5 aggfunc
DataFrame.
It is used to supply a value in the DataFrame to substitute
6 fill_value
missing data.
7 margins It only accepts Boolean values and is initially set to False. If
set to True, it adds all rows and columns to the resulting
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Sr. Parameter
Parameter Description
No Name
pivot table.
It can only take Boolean values and is set to True by default.
8 dropna It is employed to delete all NaN values from DataFrame,
including any.
When the margins option is set to True, it is employed to
9 margins_name define the title of the row/column that will hold the
statistics.
It only takes Boolean values. This option applies solely to
10 observed category characteristics. If it is set as 'True,' the DataFrame
will only display data for categorical groupings.
Return Value
It is employed to generate a DataFrame with an excel-style pivot table. The
levels in the pivot table will be saved as MultiIndex objects on the resultant
DataFrame's index and columns.
Pivot Table in Pandas with Python
One of Excel's most powerful features is pivot tables. A pivot table helps us
to extract information from data. Pandas has a method
named pivot_table() that is comparable. Pandas pivot_table() is a simple
method that may quickly provide highly strong analyses. Pandas
pivot_table() is a must-have tool for any Data Scientist. Let's see how we can
make one for ourselves.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

How to Create a Pivot Table DataFrame


The pivot table Pandas method is used to perform a pivot on a Pandas
DataFrame. Let's create a DataFrame and employ the pivot table pandas
method on it.
Code:
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'Name': [ 'PUBG: Battlegrounds' , 'Tetris (EA)', 'Grand
Theft Auto V', 'Wii Sports', 'Minecraft' ],
'Genre': ['Battle royale' , 'Puzzle', 'Action-adventure', 'Sports
simulation', 'Survival,Sandbox'],
'Platform': [ 'PC', 'Multi-platform', 'Multi-platform', 'Wii', 'Multi-
platform'],
'Publishers': [ 'PUBG Corporation', 'Electronic Arts', 'Rockstar
Games', 'Nintendo' , 'Xbox Game Studios',],
'Total_Year': random.sample(range(10, 30), 5),
'Sales': random.sample(range(100, 300), 5)})
df.head()
Output: Name Genre Platform Publishers Total_Year Sales
0 PUBG: Battlegrounds Battle royale PC PUBG Corporation
15 187
1 Tetris (EA) Puzzle Multi-platform Electronic Arts 22
254
2 Grand Theft Auto V Action-adventure Multi-platform Rockstar
Games 27 123
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Now that our DataFrame has been created, we will use the pivot table Pandas
method as pd.pivot table() to indicate which features should be included in
the rows and columns by employing the index and columns arguments.
The values argument should specify the feature that will be employed to write
in the cell values.
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales')
p_table
Output: Platform PC PS4 Xbox
Name
Alice 250.0 500.0 300.0
Bob 150.0 400.0 200.0

In this example, we created a basic pivot table in Pandas that displays the
average income of every type of employee within each department.
How to Fill Missing Values Using the fill_value Parameter
In the last part, we learned how to make pivot tables in Pandas. Sometimes
our dataset contains NaN values, which might interfere with the statistical
computation of our data in the pivot table. This is common encountered in
large datasets, when there are a high number of NaN values that must be
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

addressed before summarizing any relevant insights from the data.


The pivot_table() methods in Pandas include a parameter that we may use to
fill all of the NaN values in our DataFrame before conducting any
calculations on it. To further comprehend the fill_value parameter, consider
the following example.
Code:
import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'Name': [ np.nan , 'Tetris (EA)', 'Grand Theft Auto V',


'Wii Sports', 'Minecraft' ],
'Genre': ['Battle royale' , np.nan, 'Action-adventure', 'Sports
simulation', 'Survival,Sandbox'],
'Platform': [ 'PC', np.nan, 'Multi-platform', 'Wii', 'Multi-platform'],
'Publishers': [ 'PUBG Corporation', 'Electronic Arts', 'Rockstar
Games', 'Nintendo' , 'Xbox Game Studios',],
'Total_Year': random.sample(range(10, 30), 5),
'Sales': random.sample(range(100, 300), 5)})
print(df.to_markdown())

p_table = pd.pivot_table( data=df,


index=['Name'],
columns=['Platform'],
values='Sales',
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

fill_value="None")

p_table
Output:
| | Name | Genre | Platform | Publishers |
Total_Year | Sales |
|:---:|:-------------------:|:------------------:|:---------------:|:------------------|:---------
----:|:--------:|
| 0 | nan | Battle royale | PC | PUBG Corporation |
25 | 259 |
| 1 | Tetris (EA) | nan | nan | Electronic Arts | 13 |
175 |
| 2 | Grand Theft Auto V | Action-adventure | Multi-platform | Rockstar
Games | 19 | 186 |
| 3 | Wii Sports | Sports simulation | Wii | Nintendo |
24 | 294 |
| 4 | Minecraft | Survival,Sandbox | Multi-platform | Xbox Game
Studios | 10 | 297 |

How to Add Margins in Pivot Table


While examining the parameters for the pivot table() method, we discovered
that the margins keyword may be used to estimate totals along every
grouping. We can set margins in a pivot table if and only if the margins
keyword is set to True. The margins_name argument is used to do this. It is
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

set to "All" by default. It is employed to specify the title of the row or column
containing the totals. Consider the following code example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales',
margins=True,
margins_name='Grand Total')
p_table
How to Calculate Multiple Types of Aggregations for any Given Value
Column
The aggfunc keyword specifies the type of aggregation used, which is
by default a mean. The aggregate specification, like the GroupBy, may be
either a string representing one of the many popular possibilities
(e.g.,'sum,"mean,' 'count,"min,"max,' etc.) or a method that executes an
aggregation (– for example, np.sum(), min(), sum(), np.std, np.min, np.max
and so forth.). It can also be given as a dictionary associating a column with
any of the above-mentioned choices. To further comprehend it, consider the
following example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales',
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

aggfunc=['sum', 'mean', 'count'])


p_table
How to Group Data Using Index in a Pivot Table
We saw in the parameter subsection that the index parameter defines the
Column that will be used to index the feature supplied in the values
argument. If the index parameter is an array, it must be the same length as
the DataFrame. Consider the following example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
values='Sales')
p_table

How to Run a Pivot with a Multi-Index


We saw in the last section that the index only employed one characteristic,
a single level indexing. However, we can build pivot tables in Pandas with
numerous indexes. A pivot table with multi-level indices can give highly
helpful and thorough summarized data whenever data is structured
hierarchically. Consider the following illustration:
Code:
p_table = pd.pivot_table( data=df,
index=['Name', 'Genre','Publishers' ],
columns=['Platform'],
values='Sales',
aggfunc= ['sum', 'mean','count'])
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

p_table
Output:
In terms of production, there are multi-level indexes that explain that there
are three genres in multi-platform, and the total of their sales, mean, and
count are as follows. The sequence in which the indices are supplied matters,
and the results will differ as a consequence.
Aggregate on Specific Features with Values Parameter
The value argument instructs the method on which characteristics
to aggregate. It is an optional parameter, and if we do not indicate it, the
method will aggregate all of the dataset's quantitative variables. In the
previous index example, we saw that aggregation was performed on
all quantitative columns. Because the value argument was not given,
pivot_table examined all numerical columns by default. Consider the
following example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
columns=['Genre'],
values='Platform',
aggfunc= ['count'])
p_table
How to Specify and Create Your Own Aggregation Methods
In addition, Pandas lets us to input a custom method into
the pivot_table() method. This significantly increases our capacity to deal
with analyses that are specially targeted to our requirements! Let's look at
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

how we may pass in a function that tests whether or not the value is Multi-
platform.
Code:
def func(value):
plt = 'Multi-platform'
if (plt == value).bool():
return 'True'
return 'False'
This function receives a single input, values, which will be
the pivot_table() function's values. Our plt variable is then used to determine
whether the provided value is Multi-platform or not. Finally, based on the
criteria, the Boolean result is returned. Let's examine how we can apply this to
our Platforms column in our pivot table.
Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
values='Platform',
aggfunc= [func])
p_table
Difference Between Pivot Table and Group By
We saw in our previous article "Introduction to groupby in Pandas" how
the GroupBy concept allows us to examine relationships within a dataset.
Pivot tables in Pandas are comparable to the groupby() function in Pandas.
The pivot table accepts simple column-wise data as input and organizes it
into a two-dimensional DataFrame that gives a multidimensional overview of
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

the data. The distinction between pivot tables in pandas and GroupBy is that
pivot tables are multidimensional versions of GroupBy aggregation. That is,
we divide, execute, or merge, but the divide and merge occur on a two-
dimensional grid.
Aside from that, the object provided by the groupby() method is a
DataFrameGroupBy object rather than a dataframe. As a result, standard
Pandas DataFrame methods will not operate on this object.
Code:
p_table = pd.pivot_table( data=df,
index=['Platform'])
group= df.groupby('Platform')
print("Pivot table type :",type(p_table))
print("Group type :",type(group))
Output:
Pivot table type : <class 'pandas.core.frame.DataFrame'>
Group type : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Advanced Pivot Table Filtering in Pandas
A Pandas pivot table may also be used to filter data. As pivot tables are
frequently rather extensive, filtering a pivot table may greatly focus the
results. Because the method outputs a DataFrame, we could just filter it like
any other.
Now we have the option of filtering by a constant or a dynamic value. We
may, for example, just filter solely on an user defined value. However, if we
wanted to only show instances where the Sales data was greater than the
mean, we might employ the following filter:
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
values='Sales',
aggfunc= ['mean'])

print(p_table[p_table["mean"]['Sales'] > df['Sales'].mean()])


Output:
mean
Sales
Name
Minecraft 297
Wii Sports 294
This concludes our article, kudos! You now have a firm knowledge of
Pandas’ pivot_table() method and can use it to extract relevant
information from numerous real-world instances.
Explore the Hands-on Applications of These Concepts in Our Data Science
Courses. Enroll Now and Turn Theoretical Knowledge into Practical
Mastery.
Conclusion
This article taught us:
 A pivot table is a quantitative table that summarizes a large
DataFrame, such as a large dataset.
 In Pandas, we use the pivot_table() function to generate pivot tables.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

 Pandas' pivot_table() methods provide a fill_value argument, which


we can use to fill all of the NaN values in our DataFrame before doing
any computations.
 A pivot table Pandas method may also be used to filter data.
 Pivot tables in Pandas are comparable to the groupby() function in
Pandas.
 The groupby() function returns a DataFrameGroupBy object rather
than a dataframe.
 The key difference between pivot tables and GroupBy aggregation is
that pivot tables are multidimensional versions of GroupBy
aggregation.
Vectorized String Operations.
ectorized string operations in pandas allow you to perform efficient string
manipulations on entire columns of data. These operations are performed
using the str accessor, which provides a suite of methods for string
manipulation.
Common Vectorized String Operations
1. Lowercase and Uppercase Conversion
2. String Length
3. String Containment and Matching
4. Replacing Substrings
5. Splitting and Joining Strings
6. Removing Whitespace
7. Extracting Substrings
8. Concatenating Strings
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

1. Lowercase and Uppercase Conversion


Example: Convert to Lowercase
python
import pandas as pd
# Creating an example dataframe
df = pd.DataFrame({'Text': ['Hello', 'World', 'Pandas', 'Vectorized', 'String
Operations']})
# Convert to lowercase
df['Text_Lower'] = df['Text'].str.lower()
print(df)
Output:
Text Text_Lower
0 Hello hello
1 World world
2 Pandas pandas
3 Vectorized vectorized
4 String Operations string operations
Example: Convert to Uppercase
python

# Convert to uppercase
df['Text_Upper'] = df['Text'].str.upper()
print(df)
Output:
Text Text_Upper
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

0 Hello HELLO
1 World WORLD
2 Pandas PANDAS
3 Vectorized VECTORIZED
4 String Operations STRING OPERATIONS
2. String Length
Example: Calculate String Length
python
# Calculate string length
df['Text_Length'] = df['Text'].str.len()
print(df)
Output:
Text Text_Length
0 Hello 5
1 World 5
2 Pandas 6
3 Vectorized 10
4 String Operations 17
3. String Containment and Matching
Example: Check for Substring
python

# Check if 'o' is in the string


df['Contains_o'] = df['Text'].str.contains('o')
print(df)
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Output:
Text Contains_o
0 Hello True
1 World True
2 Pandas False
3 Vectorized False
4 String Operations True
4. Replacing Substrings
Example: Replace Substring
python

# Replace 'o' with 'O'


df['Text_Replaced'] = df['Text'].str.replace('o', 'O')
print(df)
Output:
Text Text_Replaced
0 Hello HellO
1 World WOrld
2 Pandas Pandas
3 Vectorized VectOrized
4 String Operations String OperatiOns
5. Splitting and Joining Strings
Example: Split Strings
python
# Split strings on space
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

df['Text_Split'] = df['Text'].str.split()
print(df)
Output:
Text Text_Split
0 Hello [Hello]
1 World [World]
2 Pandas [Pandas]
3 Vectorized [Vectorized]
4 String Operations [String, Operations]
6. Removing Whitespace
Example: Strip Whitespace
python

# Creating an example dataframe with extra whitespace


df_ws = pd.DataFrame({'Text': [' Hello ', ' World ', ' Pandas ']})
# Strip leading and trailing whitespace
df_ws['Text_Strip'] = df_ws['Text'].str.strip()
print(df_ws)
Output:
Text Text_Strip
0 Hello Hello
1 World World
2 Pandas Pandas
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

7. Extracting Substrings
Example: Extract Substring
python
# Extract first 3 characters
df['Text_Substr'] = df['Text'].str[:3]
print(df)
Output:
Text Text_Substr
0 Hello Hel
1 World Wor
2 Pandas Pan
3 Vectorized Vec
4 String Operations Str
8. Concatenating Strings
Example: Concatenate Strings
python
# Creating another column to concatenate
df['More_Text'] = ['Everyone', 'People', 'Library', 'Methods', 'Tutorial']
# Concatenate 'Text' and 'More_Text' with a space
df['Text_Concat'] = df['Text'] + ' ' + df['More_Text']
print(df)
Output:
Text More_Text Text_Concat
0 Hello Everyone Hello Everyone
1 World People World People
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

2 Pandas Library Pandas Library


3 Vectorized Methods Vectorized Methods
4 String Operations Tutorial String Operations Tutorial
Summary
 Lowercase and Uppercase Conversion: Use str.lower() and
str.upper().
 String Length: Use str.len().
 String Containment and Matching: Use str.contains().
 Replacing Substrings: Use str.replace().
 Splitting and Joining Strings: Use str.split() and str.join().
 Removing Whitespace: Use str.strip().
 Extracting Substrings: Use slicing (e.g., str[:3]).
 Concatenating Strings: Use + operator for string concatenation.
Conclusion
Vectorized string operations in pandas provide an efficient way to perform
complex string manipulations on entire columns of data. These operations are
essential for data cleaning, transformation, and analysis.

Unit II Completed
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

You might also like