
CCS346 - EXPLORATORY DATA ANALYSIS

UNIT II EDA USING PYTHON

Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations.

COURSE OBJECTIVE:
To implement data manipulation using Pandas.

COURSE OUTCOME:
CO2: Implement data manipulation using Pandas
Data Manipulation using Pandas
Pandas - a package built on top of NumPy
- provides an efficient implementation of a DataFrame
DataFrames - multidimensional arrays with attached row and column labels
- support heterogeneous types and/or missing data
Pandas objects - enhanced versions of NumPy structured arrays
- rows and columns are identified with labels rather than simple integer indices
Three fundamental Pandas data structures: Series, DataFrame, and Index

Pandas Objects
1. The Pandas Series Object
Pandas Series - one-dimensional array of indexed data
- can be created from a list or array
- can be accessed with the values and index attributes
- can also be accessed using index via square-bracket notation

Series as generalized NumPy array


Numpy Array - has an implicitly defined integer index used to access the values
Pandas Series - has an explicitly defined index associated with the values
index need not be an integer

non-contiguous or non-sequential indices
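The points above can be sketched as follows (a minimal illustration; the data values are made up):

```python
import pandas as pd

# A Series created from a list gets an implicit integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)   # the underlying NumPy array
print(data.index)    # RangeIndex(start=0, stop=4, step=1)
print(data[1])       # square-bracket access: 0.5

# The index can be defined explicitly, and need not be an integer
data2 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data2['b'])    # 0.5

# Indices can even be non-contiguous or non-sequential
data3 = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data3[5])      # 0.5
```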

Series as specialized dictionary


A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series
is a structure that maps typed keys to a set of typed values.
A Series created from a dictionary draws its index from the keys (sorted in older pandas
versions; in insertion order in modern pandas)

Series also supports array-style operations such as slicing
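A short sketch of the dictionary analogy (the population figures are illustrative):

```python
import pandas as pd

population_dict = {'California': 39538223, 'Texas': 29145505, 'New York': 20201249}
population = pd.Series(population_dict)

# Dictionary-style access by key
print(population['Texas'])

# Unlike a dictionary, a Series also supports array-style slicing
# (label-based slices include the endpoint)
print(population['California':'Texas'])
```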

Constructing Series objects


Constructing a Pandas Series from scratch

Index is an optional argument


Data can be a list or NumPy array - index defaults to an integer sequence

Data can be a scalar, which is repeated to fill the specified index

Data can be a dictionary - index defaults to the dictionary keys (sorted in older pandas
versions; insertion order in modern pandas)

Index can be explicitly set
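Each construction variant listed above can be sketched in a few lines:

```python
import pandas as pd

# Data as a list: index defaults to an integer sequence
s1 = pd.Series([2, 4, 6])

# Data as a scalar, repeated to fill the specified index
s2 = pd.Series(5, index=[100, 200, 300])

# Data as a dictionary: the keys become the index
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Index set explicitly: only the listed keys are kept, in the given order
s4 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(s4)   # 3    c
            # 2    a
```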

2. The Pandas DataFrame Object

DataFrame as a generalized NumPy array

DataFrame - is a two-dimensional array with both flexible row indices and flexible column
names

columns attribute - column labels

index attribute - gives access to the index labels


DataFrame as specialized dictionary
DataFrame maps a column name to a Series of column data
Constructing DataFrame objects
From a single Series object

From a list of dicts

From a dictionary of Series objects

From a two-dimensional NumPy array

From a NumPy structured array


3. The Pandas Index Object
- can be viewed as an immutable array or as an ordered set

One difference between Index objects and NumPy arrays is that indices are immutable; that
is, they cannot be modified via the normal means
Index as ordered set
- unions, intersections, differences, and other combinations can be computed
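A quick sketch of both properties. (Recent pandas versions deprecate the `&`/`|` operators
on Index objects in favour of the named set methods used here.)

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Index objects are immutable: item assignment raises a TypeError
try:
    ind[1] = 0
except TypeError as e:
    print("immutable:", e)

# Index as ordered set: unions, intersections, differences
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
print(indA.intersection(indB))          # Index([3, 5, 7], ...)
print(indA.union(indB))                 # Index([1, 2, 3, 5, 7, 9, 11], ...)
print(indA.symmetric_difference(indB))  # Index([1, 2, 9, 11], ...)
```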
Data Indexing and Selection
Indexing, slicing, masking, fancy indexing and combinations

Data Selection in Series

Series as dictionary

Series as one-dimensional array


Indexers: loc, iloc, and ix
loc attribute - indexing and slicing always refer to the explicit index

iloc attribute - indexing and slicing always refer to the implicit, position-based index
ix - a hybrid of the two (deprecated and removed in modern pandas)
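The distinction matters most when the explicit index is itself made of integers, as in this
small sketch:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: always uses the explicit index
print(data.loc[1])     # 'a'
print(data.loc[1:3])   # rows labelled 1 and 3 (label slices are inclusive)

# iloc: always uses the implicit, position-based index
print(data.iloc[1])    # 'b'
print(data.iloc[1:3])  # positions 1 and 2 (endpoint excluded)
```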

Data Selection in DataFrame


attribute-style access with column names

If column names are not strings, or if a column name conflicts with a method of the DataFrame,
attribute-style access is not possible. Eg: the pop() method
DataFrame as two-dimensional array
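A sketch of both points, using made-up state data. Note how a column named `pop` is shadowed
by the DataFrame's own `pop()` method:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662], 'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

# Dictionary-style column access always works
print(data['area'])

# Attribute-style access works for well-behaved string column names...
print(data.area.equals(data['area']))   # True

# ...but 'pop' collides with the DataFrame.pop() method
print(callable(data.pop))               # True: the method, not the column

# Viewing the DataFrame as a two-dimensional array
print(data.values)                 # the raw underlying array
print(data.iloc[0, 1])             # implicit row/column positions
print(data.loc['Texas', 'area'])   # explicit row/column labels
```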
Operating on Data
Performing element-wise operations
basic arithmetic - addition, subtraction, multiplication
sophisticated operations - trigonometric functions, exponential and logarithmic functions
Pandas inherits much of this functionality from NumPy, and the ufuncs
Ufuncs: Index Preservation
all NumPy ufunc - work on Pandas Series and DataFrame objects
Index alignment in Series
For binary operations - Pandas will align indices

Index alignment in DataFrame


Ufuncs: Operations Between DataFrame and Series
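The three behaviours above - index preservation, index alignment, and DataFrame/Series
broadcasting - can be sketched together (values are illustrative):

```python
import pandas as pd
import numpy as np

# Ufuncs preserve the index
ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(np.exp(ser))           # result keeps the index 'a', 'b', 'c'

# Index alignment in Series: the result index is the union of the inputs,
# with NaN wherever either input is missing
area = pd.Series({'Alaska': 1723337, 'Texas': 695662})
population = pd.Series({'California': 39538223, 'Texas': 29145505})
print(population / area)     # Alaska and California become NaN

# Index alignment in DataFrames: both rows and columns are aligned
A = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
B = pd.DataFrame({'b': [1, 1], 'c': [2, 2]})
print(A + B)                 # only column 'b' has values; 'a' and 'c' are NaN

# DataFrame/Series operations broadcast row-wise by default
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
print(df - df.iloc[0])       # subtract the first row from every row
```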

Handling Missing Data


• real-world data is rarely clean and homogeneous
• different data sources may indicate missing data in different ways - null, NaN, or NA
Trade-Offs in Missing Data Conventions

➢ number of schemes have been developed


➢ two strategies:
❑ mask
o Globally indicate missing values
o Boolean array- one bit in the data represent null value
o Requires allocation of an additional Boolean array - adds overhead in
both storage and computation
❑ sentinel value
o Indicates a missing entry
o Data-specific convention, such as -9999 or some rare bit pattern
o Reduces the range of valid values that can be represented
➢ Missing Data in Pandas

• use sentinels for missing data


• special floating-point NaN value
• Python None object
➢ None: Pythonic missing data

• Used only in arrays with data type 'object'

• Performing aggregations such as sum() or min() on an array with a None value results
in an error

NaN: Missing numerical data


it is a special floating-point value
result of arithmetic with NaN will be another NaN
Aggregates over the values are well defined (i.e., they don't result in an error) but not always
useful
NaN and None in Pandas
Pandas automatically converts the None to a NaN value
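A minimal sketch of the None vs. NaN behaviour described above:

```python
import numpy as np
import pandas as pd

# None forces dtype=object; aggregations over such an array raise a TypeError
vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)           # object

# NaN is a special floating-point value, and it is "infectious"
vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)           # float64
print(1 + np.nan)            # nan
print(vals2.sum())           # nan: well defined, but not useful

# NaN-aware NumPy aggregates ignore the missing value
print(np.nansum(vals2))      # 8.0

# Pandas converts None to NaN in numeric contexts
print(pd.Series([1, np.nan, 2, None]))   # float64 Series with two NaNs
```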

Operating on Null Values

Detecting null values


Dropping null values

We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present
axis=1 drops all columns containing a null value

Filling null values


fill NA entries with a single value, such as zero:
forward-fill to propagate the previous value forward

if a previous value is not available during a forward fill, the NA value remains.
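The detect/drop/fill operations above can be sketched on a small example:

```python
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])

# Detecting null values
print(data.isnull())          # Boolean mask: True where null
print(data[data.notnull()])   # use the mask to select valid entries

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Dropping: only full rows (default) or full columns can be dropped
print(df.dropna())            # keeps only the one row with no nulls
print(df.dropna(axis=1))      # drops every column containing a null

# Filling: a constant value, or a forward fill
print(data.fillna(0))
print(df.ffill())             # the NaN in the first row remains:
                              # there is no previous value to propagate
```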
Hierarchical Indexing
❑ Multi-indexing
❑ Store higher-dimensional data – data indexed by more than one or two keys
❑ Incorporate multiple index levels within a single index
❑ Higher-dimensional data can be represented within the 1D Series and 2D DataFrame
objects
A Multiply Indexed Series

❑ represent 2D data within a 1D Series


The bad way

❑ index or slice the series based on this multiple index

❑ selecting all values from 2010 becomes a complex process that requires using Python tuples as keys
The Better Way: Pandas MultiIndex

create a multi-index from the tuples

MultiIndex contains multiple levels of indexing

❑ Some entries are missing in the first column

❑ Blank entry indicates the same value as the line above it


Access all data for which the second index is 2010

MultiIndex as extra dimension

Each extra level in a multi-index represents an extra dimension of data

unstack() method - convert a multiply indexed Series into a DataFrame

stack() method provides the opposite operation:
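A sketch of a multiply indexed Series and the stack()/unstack() round trip (the population
figures are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2010), ('California', 2020),
                                   ('Texas', 2010), ('Texas', 2020)])
pop = pd.Series([37253956, 39538223, 25145561, 29145505], index=index)
pop.index.names = ['state', 'year']

# Partial indexing: all data for which the second index is 2020
print(pop[:, 2020])

# Each extra index level acts like an extra dimension:
# unstack() converts the multiply indexed Series into a DataFrame...
df = pop.unstack()
print(df)

# ...and stack() recovers the original Series
print(df.stack())
```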

Methods of MultiIndex Creation


1. Pass a list of two or more index arrays to the constructor

2. Pass a dictionary with appropriate tuples as keys

Explicit MultiIndex constructors

from a simple list of arrays

from a list of tuples

from a Cartesian product of single indices by passing levels and labels (labels are called
codes in newer pandas versions)
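The three explicit constructors listed above produce equivalent indices:

```python
import pandas as pd

# From a simple list of arrays
m1 = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples
m2 = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From the Cartesian product of single indices
m3 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same index
print(m1.equals(m2) and m2.equals(m3))   # True

# Level names can be set at construction time or afterwards
m3 = m3.set_names(['letter', 'number'])
print(m3.names)
```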


MultiIndex level names

MultiIndex for columns

four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number
Indexing and Slicing a MultiIndex

Multiply indexed Series

access single elements by indexing with multiple terms

partial indexing, or indexing just one of the levels in the index

Partial slicing is available as well, as long as the MultiIndex is sorted

With sorted indices, partial indexing can be performed on lower levels by passing an empty
slice in the first index
Selection based on Boolean masks Selection based on fancy indexing

Multiply indexed DataFrames

Recover Guido's heart rate

Using loc, iloc, and ix indexers

Each individual index in loc or iloc can be passed as a tuple of multiple indices.

Creating a slice within a tuple leads to a syntax error; use the IndexSlice object instead
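A sketch with mock medical-style data (names and numbers invented) showing tuple indexing
and the IndexSlice workaround:

```python
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], [2013, 2014]],
                                   names=['subject', 'year'])
columns = pd.MultiIndex.from_product([['HR', 'Temp'], [1, 2]],
                                     names=['type', 'visit'])
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=columns)

# Single elements via a tuple of multiple indices
print(data.loc[('Bob', 2013), ('HR', 1)])

# A slice inside a tuple is a syntax error: data.loc[(:, 2013), ('HR', :)]
# pd.IndexSlice builds such slices instead
idx = pd.IndexSlice
print(data.loc[idx[:, 2013], idx['HR', :]])   # all subjects, year 2013, HR columns
```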
Rearranging Multi-Indices
1. Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted

partial slice of this index, it will result in an error

With the index sorted - partial slicing will work as expected

2. Stacking and unstacking indices

Convert a dataset from a stacked multi-index to a simple two-dimensional representation


The opposite of unstack() is stack() - used to recover the original series

3. Index setting and resetting

Turn the index labels into columns - reset_index method

Build a MultiIndex from the column values – set_index method
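The three rearranging operations above - sorting, stacking/unstacking, and index
setting/resetting - in one sketch (values are made up):

```python
import pandas as pd
import numpy as np

# An unsorted MultiIndex: partial slicing raises an error
index = pd.MultiIndex.from_tuples([('a', 1), ('c', 2), ('b', 1)])
data = pd.Series(np.arange(3.0), index=index)
try:
    data['a':'b']
except Exception as e:
    print(type(e).__name__)    # UnsortedIndexError

# After sorting the index, partial slicing works as expected
data = data.sort_index()
print(data['a':'b'])

# Turning index labels into columns, and back again
pop = pd.Series([10, 20, 30, 40],
                index=pd.MultiIndex.from_product([['A', 'B'], [2010, 2020]],
                                                 names=['state', 'year']))
flat = pop.reset_index(name='population')   # index levels become columns
print(flat)
print(flat.set_index(['state', 'year']))    # rebuild the MultiIndex
```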

Data Aggregations on Multi-Indices


Combining datasets
function which creates a DataFrame of a particular form
Concat, Append, Merge and Join

Concatenation of NumPy Arrays

1. Simple Concatenation with pd.concat


Concatenate higher-dimensional objects
❑ By default, the concatenation takes place row-wise within the DataFrame
❑ pd.concat allows specification of an axis
Duplicate indices
Pandas concatenation preserves indices, even if the result will have duplicate indices

Ignoring the index

Adding MultiIndex keys

2. Concatenation with joins


data from different sources might have different sets of column names
❑ By default, entries for which no data is available are filled with NA values
❑ To change this, specify the join or join_axes parameters
❑ By default, the join is a union of the input columns (join='outer'); it can be changed to an
intersection using join='inner'
Use the join_axes argument to directly specify the index (removed in modern pandas; use
reindex on the result instead)

3. The append() method
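The concatenation behaviours above can be sketched together. (Note: the DataFrame.append()
method was a thin wrapper around pd.concat and was removed in pandas 2.0, so pd.concat is
used throughout.)

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Row-wise concatenation preserves indices, even duplicates
print(pd.concat([df1, df2]))                      # index 0, 1, 0, 1
print(pd.concat([df1, df2], ignore_index=True))   # index 0, 1, 2, 3
print(pd.concat([df1, df2], keys=['x', 'y']))     # adds a MultiIndex level

# Different column sets: outer join (default) fills NAs,
# inner join keeps only the intersection of columns
df3 = pd.DataFrame({'B': ['B4', 'B5'], 'C': ['C4', 'C5']})
print(pd.concat([df1, df3]))                 # columns A and C get NAs
print(pd.concat([df1, df3], join='inner'))   # only column B survives
```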

II. Merge and Join


high-performance, in-memory join and merge operations
Categories of Joins
1. One-to-one joins
❑ simplest type of merge
❑ very similar to the column-wise concatenation
❑ pd.merge() – use common column as a key
❑ The result of the merge is a new DataFrame
❑ The order of entries in each column is not necessarily maintained
❑ Merge in general discards the index
2. Many-to-one joins - preserve duplicate entries as appropriate

3. Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a
many-to-many merge
Specification of the Merge Key
1. The on keyword
This option works only if both the left and right DataFrames have the specified column name.

2. The left_on and right_on keywords


merge two datasets with different column names
The result has a redundant column that we can drop if desired
3. The left_index and right_index keywords
merge on an index

DataFrames implement the join() method - merge on indices

mix indices and columns


Specifying Set Arithmetic for Joins

inner join - how keyword, which defaults to 'inner'


result contains the intersection of the two sets of inputs
❑ how - 'outer', 'left', and 'right'
❑ outer join - returns a join over the union of the input columns, and fills in all missing
values with NAs

The left join and right join return joins over the left entries and right entries, respectively

Overlapping Column Names: The suffixes Keyword


two input DataFrames have conflicting column names
❑ merge function automatically appends a suffix _x or _y to make the output columns
unique.
❑ It is possible to specify a custom suffix using the suffixes keyword
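A sketch of the how keyword and the suffixes keyword (names and values invented):

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

print(pd.merge(df6, df7))               # inner (default): only 'Mary' survives
print(pd.merge(df6, df7, how='outer'))  # union of keys; NAs fill the gaps
print(pd.merge(df6, df7, how='left'))   # all rows of df6

# Conflicting column names get _x/_y suffixes, or custom ones
df8 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})
print(pd.merge(df8, df9, on='name'))                         # rank_x, rank_y
print(pd.merge(df8, df9, on='name', suffixes=['_L', '_R']))  # rank_L, rank_R
```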
Aggregation and grouping
Efficient summarization: computing aggregations like sum(), mean(), median(), min(), and
max(), in which a single number gives insight into the nature of a potentially large dataset
Simple Aggregation in Pandas
for a Pandas Series the aggregates return a single value
For a DataFrame, by default the aggregates return results within each column
describe() - computes several common aggregates for each column and returns the result.

GroupBy: Split, Apply, Combine


aggregate conditionally on some label or index - implemented by groupby operation
1. Split, apply, combine
❑ The split step involves breaking up and grouping a DataFrame depending on the value of
the specified key.
❑ The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
❑ The combine step merges the results of these operations into an output array.

❑ groupby() does not return a DataFrame
❑ It returns a DataFrameGroupBy object
❑ It does no actual computation until the aggregation is applied - "lazy evaluation"
❑ To produce a result - apply an aggregate to the DataFrameGroupBy object
2. The GroupBy object
Aggregate, filter, transform, and apply
pass a dictionary mapping column names to operations to be applied on that column

Aggregation
It can take a string, a function, or a list, and compute all the aggregates at once

Filtering
❑ Allows to drop data based on the group properties.
❑ Eg: All groups in which the standard deviation is larger than some critical value
❑ The filter function - return a Boolean value specifying whether the group passes the
filtering.
Here because group A does not have a standard deviation greater than 4, it is dropped from the
result
Transformation
transformation - return transformed version of the full data to recombine
Eg: Center the data by subtracting the group-wise mean

The apply() method


Lets to apply an arbitrary function to the group results
Eg: Normalizes the first column by the sum of the second
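The four GroupBy operations above can be sketched on one small DataFrame (values are
invented; the apply() example uses a ratio of column sums rather than the text's exact
normalization):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# groupby returns a lazy DataFrameGroupBy object; nothing is computed
# until an aggregate is applied
g = df.groupby('key')
print(g.sum())

# aggregate: a string, a function, or a list computes several at once;
# a dict maps column names to operations
print(g.aggregate(['min', 'max']))
print(g.aggregate({'data1': 'min', 'data2': 'max'}))

# filter: drop whole groups based on a group property
print(g.filter(lambda x: x['data2'].std() > 4))   # group A is dropped

# transform: return a same-shaped result, e.g. centering each group
print(g.transform(lambda x: x - x.mean()))

# apply: an arbitrary function per group, here the ratio of column sums
ratios = g[['data1', 'data2']].apply(lambda x: x['data1'].sum() / x['data2'].sum())
print(ratios)
```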
Pivot Tables
❑ A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data.
❑ The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
Motivating Pivot Tables

Database of passengers on the Titanic, available through the Seaborn library

Pivot Tables by Hand

Group according to gender, survival status, or some combination

Look at survival by both sex and, say, class.


This type of two-dimensional GroupBy is common enough that Pandas includes the pivot_table
routine to handle this kind of multidimensional aggregation.

Pivot Table Syntax

More readable than the groupby approach, and produces the same result.

1. Multi-level pivot tables

❑ Grouping in pivot tables can be specified with multiple levels, and via a number of
options.

❑ For example: age as a third dimension - bin the age using the pd.cut function

The same strategy can be applied to columns as well

Eg: add info on the fare paid using pd.qcut to automatically compute quantiles

The result is a four-dimensional aggregation with hierarchical indices


2. Additional pivot table options

fill_value and dropna - deal with missing data

aggfunc keyword - controls what type of aggregation is applied, which is a mean by default

compute totals along each grouping - margins keyword
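The pivot-table ideas above can be sketched on a small, Titanic-style mock dataset (the
real example in the text uses the Titanic data from Seaborn; these rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'class':    ['First', 'Third', 'First', 'Third', 'First', 'Third'],
    'survived': [1, 1, 0, 0, 1, 1],
    'fare':     [80.0, 8.0, 50.0, 7.0, 120.0, 9.0],
    'age':      [29, 2, 40, 21, 58, 33],
})

# The "by hand" GroupBy version...
print(df.groupby(['sex', 'class'])['survived'].mean().unstack())

# ...and the equivalent, more readable pivot_table (aggfunc='mean' is the default)
print(df.pivot_table('survived', index='sex', columns='class', aggfunc='mean'))

# Multi-level grouping: bin age with pd.cut as a second row level
age = pd.cut(df['age'], [0, 18, 80])
print(df.pivot_table('survived', index=['sex', age], columns='class'))

# aggfunc can be a dict mapping columns to aggregates; fill_value handles NAs
print(df.pivot_table(index='sex', columns='class',
                     aggfunc={'survived': 'sum', 'fare': 'mean'}, fill_value=0))

# margins=True adds grand totals along each grouping
print(df.pivot_table('survived', index='sex', columns='class', margins=True))
```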


Vectorized String Operations
• ease in handling and manipulating string data
• Pandas string operations
Introducing Pandas String Operations

Tables of Pandas String Methods


Methods using regular expressions

Miscellaneous methods

Vectorized item access and slicing


Indicator variables
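The string-method categories above can be sketched on a small Series of names (a classic
example set; the data is purely illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Vectorized string methods mirror Python's str methods and skip missing data
print(monte.str.lower())
print(monte.str.len())
print(monte.str.startswith('T'))

# Methods using regular expressions
print(monte.str.extract('([A-Za-z]+)'))             # first name
print(monte.str.findall(r'^[^AEIOU].*[^aeiou]$'))   # consonant at both ends

# Vectorized item access and slicing
print(monte.str[0:3])                        # first three characters
print(monte.str.split().str.get(-1))         # last name of each entry

# Indicator variables with get_dummies
full = pd.Series(['B|C', 'A', 'A|B'])
print(full.str.get_dummies('|'))             # one indicator column per code
```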
