EDA Unit2
UNIT II
EXPLORATORY DATA ANALYSIS
EDA USING PYTHON - Data Manipulation using Pandas – Pandas Objects –
Data Indexing and Selection – Operating on Data – Handling Missing Data –
Hierarchical Indexing – Combining datasets – Concat, Append, Merge and Join –
Aggregation and grouping – Pivot Tables – Vectorized String Operations.
Data manipulation with Pandas
In Machine Learning, a model requires a dataset to operate, i.e. to
train and test. But data rarely arrives fully prepared and ready to use. There are
discrepancies such as NaN/Null/NA values in many rows and columns, and the
dataset may also contain rows and columns that are not even required for the
operation of our model. In such conditions, the dataset requires proper cleaning and
modification to make it an efficient input for our model. We achieve
that by practicing data wrangling before giving the data as input to the model.
In this unit, we will get to know some methods from Pandas, a
popular library of Python. Pandas is a package built on top of NumPy that
provides an efficient implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, often with
heterogeneous types and/or missing data. As well as offering a convenient storage
interface for labeled data, Pandas implements a number of powerful data operations
familiar to users of both database frameworks and spreadsheet programs.
Installing Pandas
Before moving forward, ensure that Pandas is installed in your system.
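If Pandas is not already available, it can be installed from PyPI with pip (conda works similarly); a quick sketch:

```shell
# Install pandas (NumPy is pulled in automatically as a dependency)
pip install pandas

# Verify the installation by printing the installed version
python -c "import pandas; print(pandas.__version__)"
```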
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

Output of a pairwise correlation (e.g. from students.corr()); only this part of the example survives in the source:

              Age   Student
Age      1.000000  0.502519
Student  0.502519  1.000000
The description of the output given by the .info() method is as follows:
RangeIndex describes the index column, i.e. [0, 1, 2, 3] in our dataframe,
which is the number of rows in our dataframe.
As the name suggests, Data columns gives the total number of columns as output.
Name, Age and Student are the names of the columns in our data; non-null tells us that
no NA/NaN/None value exists in the corresponding column. object, int64 and
bool are the datatypes the columns have.
dtypes gives you an overview of how many data types are present in the dataframe,
which in turn simplifies the data cleaning process.
Also, in high-end machine learning models, memory usage is an important figure;
we can't neglect that.
Getting a Statistical Analysis of the Data
Before processing and wrangling any data you need to get a total overview of it,
which includes statistical summaries such as the standard deviation (std), the mean
and its quartile distribution.
Code:
# show the statistical description of the dataframe
print('Describe')
print(students.describe())

Output (only the first lines survive in the source):
Describe
             Age
count   4.000000
Dropping Rows from Data
Let's try dropping a row from the dataset; for this, we will use the drop function.
We will keep axis=0 (the row axis).

Code:
students = students.drop(2, axis=0)
print(students.head())

Output:
      Name  Student
0  Abhijit    False
1   Smriti     True
3   Roshni    False

In the output we can see that the row with index 2 is dropped.
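The same drop method removes a column when axis=1 is passed instead. A minimal sketch, reusing the hypothetical students dataframe from above:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Abhijit', 'Smriti', 'Akash', 'Roshni'],
    'Age': [20, 19, 20, 14],
    'Student': [False, True, True, False],
})

# axis=1 drops a column instead of a row; drop() returns a new
# dataframe and leaves the original unchanged
without_age = students.drop('Age', axis=1)
print(without_age.columns.tolist())  # ['Name', 'Student']
```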
Pandas Object:
At the very basic level, Pandas objects can be thought of as enhanced
versions of NumPy structured arrays in which the rows and columns are identified
with labels rather than simple integer indices. As we will see during the course of
this chapter, Pandas provides a host of useful tools, methods, and functionality on
top of the basic data structures, but nearly everything that follows will require an
understanding of what these structures are. Thus, before we go any further, let’s
introduce these three fundamental Pandas data structures: the Series, DataFrame,
and Index.
We will start our code sessions with the standard NumPy and Pandas imports:
import numpy as np
import pandas as pd
As we will see, though, the Pandas Series is much more general and
flexible than the one-dimensional NumPy array that it emulates.
Series as generalized NumPy array
From what we've seen so far, it may look like the Series object is basically
interchangeable with a one-dimensional NumPy array. The essential
difference is the presence of the index: while the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series
has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional
capabilities. The index need not be an integer, but can consist
of values of any desired type. For example, if we wish, we can use strings
as an index:
Code:
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

Output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

And the item access works as expected:

print(data['b'])   # 0.5

We can even use noncontiguous or nonsequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
print(data)

Output:
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
Output of constructing a DataFrame from a single population dictionary (the construction code is lost in the source):

Out[23]:            population
         California   38332521
         Florida      19552860
         Illinois     12882135
         New York     19651127
         Texas        26448193
From a list of dicts. Any list of dictionaries can be made into a DataFrame. We’ll
use a simple list comprehension to create some data:
In[24]: data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Out[24]: a b
0 0 0
1 1 2
2 2 4
Even if some keys in the dictionary are missing, Pandas will fill them in with
NaN (i.e., “not a number”) values:
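The output for this case did not survive extraction; a minimal sketch of the same behavior:

```python
import pandas as pd

# Keys missing from a row's dictionary become NaN in that row,
# and the affected columns are upcast to float to hold the NaN
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df)
```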
Only the tails of two further construction examples survive in the source: a DataFrame built from a random two-dimensional NumPy array (rows b and c shown), and one built from a NumPy structured array of zeros:

b  0.442759  0.108267
c  0.047110  0.905718

Out[29]:    A    B
         0  0  0.0
         1  0  0.0
         2  0  0.0
The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain
an explicit index that lets you reference and modify data. This Index object
is an interesting structure in itself, and it can be thought of either as an
immutable array or as an ordered set (technically a multiset, as Index
objects may contain repeated values). Those views have some interesting
consequences in the operations available on Index objects. As a simple
example, one difference between Index objects and NumPy arrays is that
Index objects are immutable; an assignment such as ind[1] = 0 fails:
In[34]: ind[1] = 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0
/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py ...
   1244     def __setitem__(self, key, value):
-> 1245         raise TypeError("Index does not support mutable operations")
   1247     def __getitem__(self, key):

TypeError: Index does not support mutable operations
This immutability makes it safer to share indices between multiple
DataFrames and arrays, without the potential for side effects from
inadvertent index modification.
Index as ordered set
Pandas objects are designed to facilitate operations such as joins
across datasets, which depend on many aspects of set arithmetic. The
Index object follows many of
the conventions used by Python's built-in set data structure, so that unions,
intersections, differences, and other combinations can be computed in a
familiar way:
In[35]: indA = pd.Index([1, 3, 5, 7, 9])
        indB = pd.Index([2, 3, 5, 7, 11])
In[36]: indA & indB  # intersection
(Note: in recent pandas versions the &, | and ^ operators on Index act elementwise;
prefer indA.intersection(indB), indA.union(indB) and indA.symmetric_difference(indB).)
In[4]: data.keys()
Out[4]: Index(['a', 'b', 'c', 'd'], dtype='object')
In[5]: list(data.items())
Out[5]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can even be modified with a dictionary-like syntax. Just as
you can extend a dictionary by assigning to a new key, you can extend a
Series by assigning to a new index value:
In[6]: data['e'] = 1.25
data
Out[6]: a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
This easy mutability of the objects is a convenient feature: under the hood,
Pandas is making decisions about memory layout and data copying that
might need to take place; the user generally does not need to worry about
these issues.
Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style
item selection via the same basic mechanisms as NumPy arrays: slices,
masking, and fancy indexing.
when you are slicing with an explicit index (i.e., data['a':'c']), the final
index is included in the slice, while when you’re slicing with an implicit
index (i.e., data[0:2]), the final index is excluded from the slice.
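This difference can be verified directly; a short sketch:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Explicit-index slice: the final label 'c' IS included
print(data['a':'c'])   # rows a, b, c

# Implicit (positional) slice: the final position is EXCLUDED
print(data[0:2])       # rows a, b
```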
Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion. For
example, if your Series has an explicit integer index, an indexing
operation such as data[1] will use the explicit index, while a slicing
operation like data[1:3] will use the implicit Python-style index.
In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
        data
Out[11]: 1    a
         3    b
         5    c
dtype: object
In[12]: # explicit index when indexing
        data[1]
Out[12]: 'a'
In[13]: # implicit index when slicing
        data[1:3]
Out[13]: 3    b
         5    c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas
provides special indexer attributes that explicitly expose certain indexing
schemes: loc always references the explicit index, while iloc always
references the implicit Python-style positional index. (The hybrid ix
indexer has been removed in modern pandas versions.)
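A short sketch of the two indexers on the integer-indexed Series from above:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit index labels
print(data.loc[1])      # 'a'
print(data.loc[1:3])    # labels 1 and 3 (final label included)

# iloc always uses the implicit, 0-based positions
print(data.iloc[1])     # 'b'
print(data.iloc[1:3])   # positions 1 and 2 (final position excluded)
```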
In[19]: data['area']
Out[19]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are
strings:
In[20]: data.area
Out[20]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
Name: area, dtype: int64
This attribute-style column access actually accesses the exact same object
as the dictionary-style access:
In[21]: data.area is data['area']
Out[21]: True
A DataFrame can also be viewed as an enhanced two-dimensional array.
We can examine the raw underlying data array using the values attribute:
In[24]: data.values
Out[27] (its input, a column access such as data['area'], is lost in the source):
         California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
Name: area, dtype: int64
Thus for array-style indexing, we need another convention. Here Pandas
again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc
indexer, we can index the underlying array as if it were a simple NumPy
array (using the implicit Python-style index), while the DataFrame index
and column labels are maintained in the result; the loc indexer does the
same using the explicit index and column names.
Any of the familiar NumPy-style data access patterns can be used within
these indexers. For example, in the loc indexer we can combine masking
and fancy indexing as in the following:
In[31]: data.loc[data.density > 100, ['pop', 'density']]
Out[31]:                pop     density
         Florida   19552860  114.806121
         New York  19651127  139.076746
In[33]: data['Florida':'Illinois']
Out[33]:            area       pop     density
         Florida   170312  19552860  114.806121
         Illinois  149995  12882135   85.883763
Such slices can also refer to rows by number rather than by index:
In[34]: data[1:3]
Out[34]:            area       pop     density
         Florida   170312  19552860  114.806121
         Illinois  149995  12882135   85.883763
Similarly, direct masking operations are also interpreted row-wise rather
than column-wise:
In[35]: data[data.density > 100]
Out[35]:            area       pop     density
         Florida   170312  19552860  114.806121
         New York  141297  19651127  139.076746
These two conventions are syntactically similar to those on a NumPy
array, and while these may not precisely fit the mold of the Pandas
conventions, they are nevertheless quite useful in practice.
Pandas Data Operations
One of the essential pieces of NumPy is the ability to perform quick element-
wise operations, both with basic arithmetic (addition, subtraction,
multiplication, etc.) and with more sophisticated operations (trigonometric
functions, exponential and logarithmic functions, and so on). Pandas inherits
much of this functionality from NumPy.
islower(): It returns True if all the characters in the string of the Series/Index
are in lowercase. Otherwise, it returns False.
isupper(): It returns True if all the characters in the string of the Series/Index
are in uppercase. Otherwise, it returns False.
isnumeric(): It returns True if all the characters in the string of the
Series/Index are numeric. Otherwise, it returns False.
Count Values
This operation is used to count the total number of occurrences using
'value_counts()' option.
Plots
Pandas plots the graph with the matplotlib library. The .plot() method allows
you to plot the graph of your data.
.plot() function plots index against every column.
You can also pass the arguments into the plot() function to draw a specific
column.
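A small sketch combining the string predicates and value_counts() described above (the series contents are assumptions for illustration):

```python
import pandas as pd

s = pd.Series(['alpha', 'BETA', 'Gamma', '42', 'alpha'])

# Vectorized string predicates return a boolean Series
print(s.str.islower())    # True only for all-lowercase strings
print(s.str.isupper())    # True only for all-uppercase strings
print(s.str.isnumeric())  # True only for purely numeric strings

# value_counts() tallies occurrences of each distinct value
print(s.value_counts())   # 'alpha' appears twice, the rest once

# s.value_counts().plot(kind='bar') would draw a bar chart via matplotlib
```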
Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will
work on Pandas Series and DataFrame objects. Let’s start by defining a
simple Series and DataFrame on which to demonstrate this:
In[1]: import pandas as pd
       import numpy as np
In[2]: rng = np.random.RandomState(42)
       ser = pd.Series(rng.randint(0, 10, 4))
       ser
Out[2]: 0    6
        1    3
        2    7
        3    4
dtype: int64
In[3]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                         columns=['A', 'B', 'C', 'D'])
       df
Out[3]:    A  B  C  D
        0  6  9  2  6
        1  7  4  3  7
        2  7  2  5  4
If we apply a NumPy ufunc on either of these objects, the result will be
another Pandas object with the indices preserved:
In[4]: np.exp(ser)
Out[4]: 0     403.428793
        1      20.085537
        2    1096.633158
        3      54.598150
dtype: float64
Or, for a slightly more complex calculation:
In[5]: np.sin(df * np.pi / 4)
Out[5]:    A  B  C  D
(the table of sine values is lost in the source)

A following example divides a population Series by an area Series; only the
tail of its output survives:
Texas    38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays,
which we could determine using standard Python set arithmetic on these
indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with
NaN, or "Not a Number," which is how Pandas marks missing data (see
"Handling Missing Data" later in this unit). This index matching is
implemented this way for any of Python's built-in arithmetic expressions;
any missing values are filled in with NaN by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
       B = pd.Series([1, 3, 5], index=[1, 2, 3])
       A + B
Out[9]: 0    NaN
        1    5.0
        2    9.0
        3    NaN
dtype: float64
If using NaN values is not the desired behavior, we can modify the fill
value using appropriate object methods in place of the operators. For
example, calling A.add(B) is equivalent to calling A + B, but allows
optional explicit specification of the fill value for any elements in A or B
that might be missing:
In[10]: A.add(B, fill_value=0)
Out[10]: 0    2.0
         1    5.0
         2    9.0
         3    5.0
dtype: float64
Only the tail of the next example survives in the source: a two-row DataFrame A
with columns A and B, and a three-row DataFrame B with columns B, A and C:
Out[12]:    B  A  C
         0  4  0  9
         1  5  8  0
         2  9  2  6
In[13]: A + B
Out[13]:       A     B   C
         0   1.0  15.0 NaN
         1  13.0   6.0 NaN
         2   NaN   NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the
associated object's arithmetic method and pass any desired fill_value to be used in
place of missing entries. Here we'll fill with the mean of all values in A (which we
compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
        A.add(B, fill_value=fill)
Out[14]:       A     B     C
         0   1.0  15.0  13.5
         1  13.0   6.0   4.5
         2   6.5  13.5  10.5
Table 3-1 lists Python operators and their equivalent Pandas object methods.
In[16]: A - A[0]
Out[16]: array([[ 0, 0, 0, 0],
         ... (the remaining rows are lost in the source)
Only the tail of halfrow, a one-row slice of df containing the columns Q and S,
survives in the source:
S    2
Name: 0, dtype: int64
In[20]: df - halfrow
Out[20]:      Q   R    S   T
         0  0.0 NaN  0.0 NaN
         1 -1.0 NaN  2.0 NaN
         2  3.0 NaN  1.0 NaN
This preservation and alignment of indices and columns means that
operations on data in Pandas will always maintain the data context, which
prevents the types of silly errors that might come up when you are
working with heterogeneous and/or misaligned data in raw NumPy
arrays.
Handling Missing Data
Missing data can occur when no information is provided for one or more
items or for a whole unit. Missing data is a very big problem in real-life
scenarios. Missing data is also referred to as NA (Not Available) values in
pandas. In Pandas, missing data is represented by two values:
None: None is a Python singleton object that is often used for missing
data in Python code.
NaN: NaN (Not a Number) is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation.
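A small sketch of how the two sentinels behave (None is upcast to NaN in numeric data):

```python
import pandas as pd
import numpy as np

# None in a numeric column is converted to NaN, and the dtype becomes
# float64 so the sentinel can be represented
s = pd.Series([1, None, 3])
print(s)            # 1.0, NaN, 3.0
print(s.isnull())   # False, True, False

# NaN compares unequal even to itself, so use isnull()/notnull()
print(np.nan == np.nan)  # False
```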
Salary Dataset:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
2 3 Female 25000.0 India Google
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Missing Data (the result of df.isnull(); True marks a missing value):
ID Gender Salary Country Company
0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True
7 False False False False True
8 False False False True True
9 False False False False True
Missing Data per Column
The number of missing values in each column can be counted with
df.isnull().sum(); only the tail of the output survives in the source:
Country    2
Company    5
dtype: int64
Dropping Missing Data
When handling missing data, you can choose to either drop it or substitute
values for it. Dropping the rows that contain missing values with dropna()
results in a clean DataFrame with no missing data:
df.dropna(inplace=True)
print(df)
Output:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
2 3 Female 25000.0 India Google
Replacing Missing Data
Instead of dropping missing data, you can substitute values for it. The
Pandas fillna() method replaces missing values in a DataFrame with a value
given by the user. Type the following to replace any missing Salary value
with 20000 (the value is arbitrary and may be any other value of your choice):
df["Salary"].fillna(20000, inplace=True)
print(df)
Output:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
# using isnull() function
df.isnull()
Output: a DataFrame of booleans with the same shape, True wherever a value is NaN.
Python
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
# filling missing value using fillna()
df.fillna(0)
Output: the DataFrame with every NaN replaced by 0.
Python
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
df
Output: the DataFrame as constructed, including the NaN values.
Now we drop the rows with at least one NaN (null) value using df.dropna().
Hierarchical Indexing
Hierarchical indexing is a method of creating structured group relationships in
data.
• A MultiIndex or hierarchical index comes in when our DataFrame has more
than two dimensions. As we already know, a Series is a one-dimensional
labelled NumPy array and a DataFrame is usually a two-dimensional table
whose columns are Series. In some instances, in order to carry out
sophisticated data analysis and manipulation, our data is presented in higher
dimensions.
• A MultiIndex adds at least one more dimension to the data. A hierarchical
index, as the name suggests, orders more than one item in terms of their
ranking.
Data frames can have hierarchical indexes on both rows and columns. To show
this, let us create a dataset.
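The dataset shown in the original notes did not survive extraction; here is a comparable sketch (the class/exam labels and score values are assumptions):

```python
import pandas as pd
import numpy as np

# A DataFrame with a two-level (hierarchical) row index
index = pd.MultiIndex.from_product(
    [['class1', 'class2'], ['exam1', 'exam2']],
    names=['class', 'exam'])
scores = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=index, columns=['math', 'physics'])
print(scores)

# Selecting on the outer level returns all rows for that class
print(scores.loc['class1'])
```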
Notice that in this dataset, both row and column have hierarchical indexes.
You can name hierarchical levels. Let’s show this.
What is Swaplevel?
Sometimes, you may want to swap the level of the indexes. You can use the
swaplevel method for this. The swaplevel method takes two levels and returns
a new object. For example, let’s swap the class and exam indexes in the
dataset.
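A sketch of swaplevel on the hypothetical class/exam index used above:

```python
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product(
    [['class1', 'class2'], ['exam1', 'exam2']],
    names=['class', 'exam'])
scores = pd.DataFrame(np.arange(8).reshape(4, 2),
                      index=index, columns=['math', 'physics'])

# swaplevel exchanges the two index levels and returns a new object;
# sort_index then groups the rows by the new outer level
swapped = scores.swaplevel('class', 'exam').sort_index()
print(swapped.index.names)  # ['exam', 'class']
```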
In the set_index method, the columns moved into the row index are removed from
the columns. You can use drop=False to keep the columns you turn into an index
in place as well.
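A minimal sketch of set_index with and without drop=False (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Chennai', 'Mumbai'], 'pop': [10, 20]})

# By default the column moved into the index is removed ...
indexed = df.set_index('city')
print('city' in indexed.columns)         # False

# ... but drop=False keeps a copy of it among the columns as well
indexed_keep = df.set_index('city', drop=False)
print('city' in indexed_keep.columns)    # True
```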
Concatenating DataFrames
1. Concatenating Along Rows (Vertical Concatenation)
Only the tail of this example's output survives in the source:
1  2  4
0  5  7
1  6  8
2. Concatenating Along Columns (Horizontal Concatenation)
When concatenating along columns (axis=1), dataframes are merged side by
side. (The code for this example is lost in the source.)
3. Concatenating with Different Indices
When the dataframes have non-overlapping indices, unmatched positions are
filled with NaN:
python
print(df_concat_diff_idx)
Output:
     A    B
0  1.0  NaN
1  2.0  NaN
2  NaN  3.0
3  NaN  4.0
4. Concatenating with Keys
Adding keys creates a hierarchical index, useful for identifying the source of
each row in the concatenated dataframe.
Example:
python
# Concatenating with keys
df_concat_keys = pd.concat([df1, df2], keys=['df1', 'df2'])
print(df_concat_keys)
Output:
A B
df1 0 1 3
1 2 4
df2 0 5 7
1 6 8
Summary
Concatenating Along Rows (Vertical): Stacks dataframes vertically.
Concatenating Along Columns (Horizontal): Merges dataframes
side by side.
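Since parts of the examples above were lost in extraction, here is a consolidated sketch of the three concat variants on the same pair of dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical: stack rows (original indices repeat unless ignore_index=True)
rows = pd.concat([df1, df2])

# Horizontal: merge side by side, aligning on the index
cols = pd.concat([df1, df2], axis=1)

# Keys: label each source block, producing a hierarchical index
keyed = pd.concat([df1, df2], keys=['df1', 'df2'])

print(rows.shape, cols.shape)
print(keyed.loc['df2'])  # recovers df2's rows by key
```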
Appending DataFrames
When you use the append method, it returns a new dataframe with the rows of the
second dataframe added to the end of the first one. The original dataframes remain
unchanged unless explicitly reassigned. (Note: DataFrame.append was deprecated
in pandas 1.4 and removed in pandas 2.0; pd.concat is its replacement in current
versions.)
Example 1: Simple Appending
python
import pandas as pd
# Creating example dataframes
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
The rest of Example 1 and the start of Example 2 (appending dataframes with
different columns) are lost in the source; the tail of Example 2's output shows
unmatched entries filled with NaN:
1    2  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0
Appending a Series to a DataFrame
You can also append a Series to a dataframe as a new row.
Example 3: Appending a Series
python
# Creating a dataframe
df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Creating a series
s = pd.Series({'A': 5, 'B': 6})
# Appending the series to the dataframe
df_append_series = df5.append(s, ignore_index=True)
print(df_append_series)
Output:
A B
0 1 3
1 2 4
2 5 6
Summary
Simple Appending: Adds rows of one dataframe to the end of another.
Appending with Different Columns: Handles different columns by filling
missing values with NaN.
Appending a Series: Adds a series as a new row to the dataframe.
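Because append is removed in pandas 2.0, the same results are obtained with pd.concat; a sketch of the modern equivalents of the examples above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
s = pd.Series({'A': 5, 'B': 6})

# df1.append(df2, ignore_index=True)  ->  pd.concat
appended = pd.concat([df1, df2], ignore_index=True)

# df1.append(s, ignore_index=True)  ->  wrap the Series as a one-row frame
with_row = pd.concat([df1, s.to_frame().T], ignore_index=True)
print(appended)
print(with_row)
```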
Conclusion
Appending is a quick way to stack rows from another dataframe or a Series onto
an existing dataframe, filling unmatched columns with NaN.
Merging DataFrames
The merge function combines dataframes based on common columns.
Example 1: Inner Join
An inner join returns only the rows whose keys appear in both dataframes.
python
import pandas as pd
# Creating example dataframes
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
# Performing an inner join
df_inner = pd.merge(df1, df2, on='key', how='inner')
print(df_inner)
Output:
key value1 value2
0 A 1 4
1 B 2 5
Example 2: Left Join
A left join returns all rows from the left dataframe and the matched rows from
the right dataframe. Missing values are filled with NaN.
python
# Performing a left join
df_left = pd.merge(df1, df2, on='key', how='left')
print(df_left)
Output:
key value1 value2
0 A 1 4.0
1 B 2 5.0
2 C 3 NaN
Example 3: Outer Join
An outer join returns all rows from both dataframes; keys present in only one
dataframe get NaN for the missing values.
python
# Performing an outer join
df_outer = pd.merge(df1, df2, on='key', how='outer')
print(df_outer)
Output:
key value1 value2
0 A 1.0 4.0
1 B 2.0 5.0
2 C 3.0 NaN
3 D NaN 6.0
Joining DataFrames
The join method is used for combining dataframes on their indices. It is
similar to merge, but it is based on indices rather than columns.
Example 4: Simple Join
python
# Creating example dataframes with indices
df3 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df4 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])
# Performing a join
df_join = df3.join(df4, how='inner')
print(df_join)
Output:
value1 value2
A 1 4
B 2 5
Example 5: Left Join with Join
python
# Performing a left join with join
df_join_left = df3.join(df4, how='left')
print(df_join_left)
Output:
value1 value2
A 1 4.0
B 2 5.0
C 3 NaN
Summary
Merge: Combines dataframes based on common columns or indices.
- Inner Join: Returns rows with matching keys in both dataframes.
- Left Join: Returns all rows from the left dataframe and matched rows from the right dataframe.
- Outer Join: Returns all rows from both dataframes, filling NaN where keys do not match.
Join: Combines dataframes based on their indices.
- Simple Join: Performs an inner join based on indices.
- Left Join with Join: Returns all rows from the left dataframe and matched rows from the right dataframe based on indices.
Conclusion
Using merge and join in pandas, you can efficiently combine datasets based
on columns or indices, providing flexibility in how you integrate data from
different sources. These operations are fundamental for data manipulation and
analysis, enabling you to create comprehensive datasets for further
exploration and insights.
Aggregation and Grouping
Aggregation and grouping are powerful techniques in pandas for summarizing
and analyzing data. Grouping allows you to split the data into groups based on
some criteria, and aggregation lets you compute summary statistics for each
group.
Grouping DataFrames
The groupby method is used to group data in pandas. This method splits the
data into groups based on some criteria.
Example: Grouping Data
python
import pandas as pd
# Creating an example dataframe
df = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C'],
'Value': [10, 15, 10, 20, 25]
})
# Grouping by 'Category'
grouped = df.groupby('Category')
print(grouped)
Output: The output is a DataFrameGroupBy object. To see the grouped data,
you need to apply an aggregation function.
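Before applying an aggregation, you can inspect a DataFrameGroupBy object directly: iterating over it yields (key, sub-DataFrame) pairs, and get_group pulls out one group. A small sketch using the same df:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value': [10, 15, 10, 20, 25]
})
grouped = df.groupby('Category')

# Iterating yields (group key, sub-DataFrame) pairs
for name, group in grouped:
    print(name, list(group['Value']))

# get_group pulls out the rows belonging to a single key
print(grouped.get_group('A'))
```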
Aggregating DataFrames
Aggregation involves computing summary statistics for each group. Common
aggregation functions include sum, mean, count, etc.
Example 1: Aggregating with Sum
python
# Aggregating the grouped data with sum
sum_agg = grouped.sum()
print(sum_agg)
Output:
Value
Category
A 25
B 30
C 25
Example 2: Aggregating with Multiple Functions
You can apply multiple aggregation functions to each group.
python
# Applying multiple aggregation functions at once
multi_func = grouped.agg(['sum', 'mean', 'count'])
print(multi_func)
Output:
Value
sum mean count
Category
A 25 12.5 2
B 30 15.0 2
C 25 25.0 1
Example 3: Custom Aggregation Functions
You can also define custom aggregation functions.
python
# Defining a custom aggregation function
def range_func(x):
    return x.max() - x.min()
# Applying the custom function
custom_agg = grouped.agg(range_func)
print(custom_agg)
Output:
Value
Category
A 5
B 10
C 0
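Custom functions can also be combined with named aggregation, where each keyword argument to agg names an output column. A short sketch using the same df (the column names total and value_range are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value': [10, 15, 10, 20, 25]
})

# Named aggregation: each keyword becomes an output column
named = df.groupby('Category').agg(
    total=('Value', 'sum'),
    value_range=('Value', lambda x: x.max() - x.min())
)
print(named)
```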
Grouping by Multiple Columns
You can group by multiple columns to create a hierarchical index.
Example: Grouping by Multiple Columns
python
# Creating an example dataframe with multiple columns
df_multi = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C'],
'SubCategory': ['X', 'Y', 'X', 'Y', 'X'],
'Value': [10, 15, 10, 20, 25]
})
# Grouping by 'Category' and 'SubCategory'
grouped_multi = df_multi.groupby(['Category', 'SubCategory'])
multi_agg = grouped_multi.sum()
print(multi_agg)
Output:
Value
Category SubCategory
A X 10
Y 15
B X 10
Y 20
C X 25
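The hierarchical result above can be reshaped into a two-dimensional table with unstack(), which is essentially what a pivot table does in a single step. A sketch using the same df_multi:

```python
import pandas as pd

df_multi = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'SubCategory': ['X', 'Y', 'X', 'Y', 'X'],
    'Value': [10, 15, 10, 20, 25]
})

# Sum per (Category, SubCategory), then move SubCategory into the columns
wide = df_multi.groupby(['Category', 'SubCategory'])['Value'].sum().unstack()
print(wide)
```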
Grouping and Aggregating with Pivot Tables
Pivot tables provide a way to summarize data in a tabular format.
Example: Creating a Pivot Table
python
# Creating a pivot table
pivot_table = pd.pivot_table(df_multi, values='Value', index=['Category'],
columns=['SubCategory'], aggfunc='sum')
print(pivot_table)
Output:
SubCategory X Y
Category
A 10.0 15.0
B 10.0 20.0
C 25.0 NaN
Summary
Grouping: Use groupby to split the data into groups based on some criteria.
Aggregation: Apply functions such as sum, mean, count, or custom functions via agg to compute summary statistics for each group.
Pivot Tables
Syntax
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
Parameters
It requires the following set of parameters:
1. data: The Pandas DataFrame from which the pivot table is to be created.
2. values: Optional. Indicates which column's statistical summary should be displayed.
3. index: The column that will be employed to index the feature specified in the values parameter. If an array is supplied as a parameter, it must be of the same length as the dataset.
4. columns: Used to aggregate information based on the specified column characteristics.
5. aggfunc: The function, or set of functions, to be executed on our DataFrame.
6. fill_value: A value to substitute for missing data in the DataFrame.
7. margins: Accepts only Boolean values and is initially set to False. If set to True, it adds subtotal rows and columns to the resulting pivot table.
8. dropna: Accepts only Boolean values and is set to True by default. It is employed to drop columns from the DataFrame whose entries are all NaN.
9. margins_name: When the margins option is set to True, it defines the title of the row/column that will hold the totals.
10. observed: Accepts only Boolean values. This option applies solely to categorical features; if set to True, the DataFrame will only display data for observed categorical groupings.
Return Value
It is employed to generate a DataFrame with an excel-style pivot table. The
levels in the pivot table will be saved as MultiIndex objects on the resultant
DataFrame's index and columns.
Pivot Table in Pandas with Python
One of Excel's most powerful features is the pivot table, which helps us extract information from data. Pandas has a comparable method named pivot_table(). Pandas pivot_table() is a simple method that can quickly produce very powerful analyses, and it is a must-have tool for any Data Scientist. Let's see how we can make one for ourselves.
Now that our DataFrame has been created, we will use the Pandas pivot table method, pd.pivot_table(), to indicate which features should appear in the rows and columns by employing the index and columns arguments. The values argument should specify the feature whose values will be written into the cells.
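The construction of df itself is not shown in this excerpt; a hypothetical reconstruction that reproduces the pivot table output below could look like this:

```python
import pandas as pd

# Hypothetical reconstruction of the assumed DataFrame; the original
# construction code is not shown, but these values match the output below.
df = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'Platform': ['PC', 'PS4', 'Xbox', 'PC', 'PS4', 'Xbox'],
    'Sales': [250, 500, 300, 150, 400, 200]
})
print(df)
```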
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales')
p_table
Output: Platform PC PS4 Xbox
Name
Alice 250.0 500.0 300.0
Bob 150.0 400.0 200.0
In this example, we created a basic pivot table in Pandas that displays the
average Sales of every Name across each Platform.
How to Fill Missing Values Using the fill_value Parameter
In the last part, we learned how to make pivot tables in Pandas. Sometimes
our dataset contains NaN values, which might interfere with the statistical
computation of our data in the pivot table. This is commonly encountered in large datasets, where a high number of NaN values must be handled. The fill_value parameter substitutes a value of our choice for these missing entries:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales',
fill_value="None")
p_table
Output:
| | Name | Genre | Platform | Publishers | Total_Year | Sales |
|:---:|:---|:---|:---|:---|:---:|:---:|
| 0 | nan | Battle royale | PC | PUBG Corporation | 25 | 259 |
| 1 | Tetris (EA) | nan | nan | Electronic Arts | 13 | 175 |
| 2 | Grand Theft Auto V | Action-adventure | Multi-platform | Rockstar Games | 19 | 186 |
| 3 | Wii Sports | Sports simulation | Wii | Nintendo | 24 | 294 |
| 4 | Minecraft | Survival,Sandbox | Multi-platform | Xbox Game Studios | 10 | 297 |
How to Add Totals Using the margins and margins_name Parameters
The margins parameter, when set to True, appends subtotal rows and columns to the pivot table. The margins_name parameter is set to "All" by default and is employed to specify the title of the row or column containing the totals. Consider the following code example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales',
margins=True,
margins_name='Grand Total')
p_table
How to Calculate Multiple Types of Aggregations for any Given Value
Column
The aggfunc keyword specifies the type of aggregation used, which is a mean by default. As with GroupBy, the aggregation specification may be either a string naming one of the many popular options (e.g., 'sum', 'mean', 'count', 'min', 'max', etc.) or a function that executes an aggregation (for example, np.sum, np.std, np.min, np.max, and so forth). It can also be given as a dictionary associating a column with any of the above-mentioned choices. To further comprehend it, consider the following example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name'],
columns=['Platform'],
values='Sales',
aggfunc=['sum', 'mean', 'count'])
p_table
Output:
The result has a multi-level column index: for each platform, the sum, mean, and count of sales appear side by side. The sequence in which the aggregation functions are supplied matters, and the ordering of the resulting columns will differ as a consequence.
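The dictionary form of aggfunc mentioned above maps each value column to its own aggregation. A minimal sketch with hypothetical data (the column names mirror those used in this section):

```python
import pandas as pd

# Hypothetical data; column names mirror the examples in this section
df_games = pd.DataFrame({
    'Platform': ['PC', 'PC', 'Wii'],
    'Sales': [100, 200, 300],
    'Total_Year': [5, 7, 9]
})

# Dictionary aggfunc: sum the Sales column, take the max of Total_Year
p = pd.pivot_table(df_games, index='Platform',
                   aggfunc={'Sales': 'sum', 'Total_Year': 'max'})
print(p)
```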
Aggregate on Specific Features with the values Parameter
The values argument instructs the method on which features to aggregate. It is an optional parameter, and if we do not specify it, the method will aggregate all of the dataset's quantitative variables. In the previous index example, aggregation was performed on all quantitative columns because the values argument was not given; pivot_table examined all numerical columns by default. Consider the following example:
Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
columns=['Genre'],
values='Platform',
aggfunc= ['count'])
p_table
How to Specify and Create Your Own Aggregation Methods
In addition, Pandas lets us pass a custom function into the pivot_table() method. This significantly increases our capacity to perform analyses that are specially targeted to our requirements! Let's look at how we may pass in a function that tests whether or not a value is Multi-platform.
Code:
def func(value):
    # 'True' only if every entry in the group equals 'Multi-platform'
    plt = 'Multi-platform'
    if (value == plt).all():
        return 'True'
    return 'False'
This function receives a single input, value, which will be the Series of values that pivot_table() passes in for each group. Our plt variable is then used to determine whether the provided values are Multi-platform or not. Finally, based on that criterion, the Boolean result is returned as a string. Let's examine how we can apply this to our Platform column in our pivot table.
Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
values='Platform',
aggfunc= [func])
p_table
Difference Between Pivot Table and Group By
We saw in our previous article "Introduction to groupby in Pandas" how
the GroupBy concept allows us to examine relationships within a dataset.
Pivot tables in Pandas are comparable to the groupby() function in Pandas.
The pivot table accepts simple column-wise data as input and organizes it
into a two-dimensional DataFrame that gives a multidimensional overview of
the data. The distinction between pivot tables in pandas and GroupBy is that pivot tables are essentially multidimensional versions of GroupBy aggregation. That is, we still split, apply, and combine, but both the split and the combine happen across a two-dimensional grid.
Aside from that, the object provided by the groupby() method is a
DataFrameGroupBy object rather than a dataframe. As a result, standard
Pandas DataFrame methods will not operate on this object.
Code:
p_table = pd.pivot_table( data=df,
index=['Platform'])
group= df.groupby('Platform')
print("Pivot table type :",type(p_table))
print("Group type :",type(group))
Output:
Pivot table type : <class 'pandas.core.frame.DataFrame'>
Group type : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
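Once an aggregation is applied, the GroupBy object becomes a regular DataFrame again, so standard DataFrame methods work on the result. A quick sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'Platform': ['PC', 'PC', 'Wii'], 'Sales': [100, 200, 300]})

group = df.groupby('Platform')
# Aggregating converts the DataFrameGroupBy object into a DataFrame
result = group.mean(numeric_only=True).reset_index()
print(type(result))
print(result)
```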
Advanced Pivot Table Filtering in Pandas
A Pandas pivot table may also be used to filter data. As pivot tables are
frequently rather extensive, filtering a pivot table may greatly focus the
results. Because the method outputs a DataFrame, we could just filter it like
any other.
Now we have the option of filtering by a constant or a dynamic value. We may, for example, filter solely on a user-defined value. However, if we wanted to show only instances where the Sales data is greater than the mean, we might employ the following filter:
Code:
p_table = pd.pivot_table( data=df,
index=['Name' ],
values='Sales',
aggfunc= ['mean'])
# Keep only the names whose mean Sales exceeds the overall mean
p_table[p_table[('mean', 'Sales')] > df['Sales'].mean()]
Vectorized String Operations
Pandas provides vectorized string operations through the str accessor on a Series, which applies a string method to every element at once while handling missing data gracefully.
python
import pandas as pd
# Creating an example dataframe of strings
df = pd.DataFrame({'Text': ['Hello', 'World', 'Pandas', 'Vectorized', 'String Operations']})
1. Case Conversion
Example: Convert to Uppercase
python
# Convert to uppercase
df['Text_Upper'] = df['Text'].str.upper()
print(df)
Output:
Text Text_Upper
0 Hello HELLO
1 World WORLD
2 Pandas PANDAS
3 Vectorized VECTORIZED
4 String Operations STRING OPERATIONS
2. String Length
Example: Calculate String Length
python
# Calculate string length
df['Text_Length'] = df['Text'].str.len()
print(df)
Output:
Text Text_Length
0 Hello 5
1 World 5
2 Pandas 6
3 Vectorized 10
4 String Operations 17
3. String Containment and Matching
Example: Check for Substring
python
# Check whether each string contains the letter 'o'
df['Contains_o'] = df['Text'].str.contains('o')
print(df)
Output:
Text Contains_o
0 Hello True
1 World True
2 Pandas False
3 Vectorized True
4 String Operations True
4. Replacing Substrings
Example: Replace Substring
python
# Replace the letter 'o' with '0'
df['Text_Replace'] = df['Text'].str.replace('o', '0')
print(df)
5. Splitting Strings
Example: Split Strings
python
# Split each string on whitespace
df['Text_Split'] = df['Text'].str.split()
print(df)
Output:
Output:
Text Text_Split
0 Hello [Hello]
1 World [World]
2 Pandas [Pandas]
3 Vectorized [Vectorized]
4 String Operations [String, Operations]
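By default str.split returns lists; passing expand=True spreads the pieces into separate columns instead. A sketch on the same Text data:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['Hello', 'World', 'Pandas', 'Vectorized', 'String Operations']})

# expand=True returns a DataFrame with one column per split piece
parts = df['Text'].str.split(expand=True)
print(parts)
```

Rows with fewer pieces than the widest row are padded with missing values.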
6. Removing Whitespace
Example: Strip Whitespace
python
# Strip leading and trailing whitespace
df['Text_Strip'] = df['Text'].str.strip()
print(df)
7. Extracting Substrings
Example: Extract Substring
python
# Extract first 3 characters
df['Text_Substr'] = df['Text'].str[:3]
print(df)
Output:
Text Text_Substr
0 Hello Hel
1 World Wor
2 Pandas Pan
3 Vectorized Vec
4 String Operations Str
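Beyond positional slicing, str.extract pulls out the part of each string that matches a regular-expression capture group. A short sketch on the same data (the First_Cap column name is an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({'Text': ['Hello', 'World', 'Pandas', 'Vectorized', 'String Operations']})

# Extract the capture group: the leading capital letter of each string
df['First_Cap'] = df['Text'].str.extract(r'^([A-Z])', expand=False)
print(df)
```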
8. Concatenating Strings
Example: Concatenate Strings
python
# Creating another column to concatenate
df['More_Text'] = ['Everyone', 'People', 'Library', 'Methods', 'Tutorial']
# Concatenate 'Text' and 'More_Text' with a space
df['Text_Concat'] = df['Text'] + ' ' + df['More_Text']
print(df)
Output:
Text More_Text Text_Concat
0 Hello Everyone Hello Everyone
1 World People World People
2 Pandas Library Pandas Library
3 Vectorized Methods Vectorized Methods
4 String Operations Tutorial String Operations Tutorial
Unit II Completed