
CCS346 - EXPLORATORY DATA ANALYSIS

UNIT II EDA USING PYTHON

Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations.

COURSE OBJECTIVE:
To implement data manipulation using Pandas.

COURSE OUTCOME:
CO2: Implement data manipulation using Pandas
Data Manipulation using Pandas
Pandas - a package built on top of NumPy
- provides an efficient implementation of a DataFrame
DataFrames - multidimensional arrays with attached row and column labels
- support heterogeneous types and/or missing data
Pandas objects - enhanced versions of NumPy structured arrays
- rows and columns are identified with labels rather than simple integer indices
Three fundamental Pandas data structures: Series, DataFrame, and Index

Pandas Objects
1. The Pandas Series Object
Pandas Series - one-dimensional array of indexed data
- can be created from a list or array
- can be accessed with the values and index attributes
- can also be accessed using index via square-bracket notation

Series as generalized NumPy array


Numpy Array - has an implicitly defined integer index used to access the values
Pandas Series - has an explicitly defined index associated with the values
index need not be an integer

non-contiguous or non-sequential indices
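The points above can be sketched as follows (a minimal illustration; the data values are made up):

```python
import pandas as pd

# A Series created from a list gets an implicit integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)   # the underlying NumPy array
print(data.index)    # RangeIndex(start=0, stop=4, step=1)
print(data[1])       # square-bracket access: 0.5

# The index can be defined explicitly, and need not be an integer
data2 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data2['b'])    # 0.5

# Indices can even be non-contiguous or non-sequential
data3 = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data3[5])      # 0.5
```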

Series as specialized dictionary


A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series
is a structure that maps typed keys to a set of typed values.
A Series created from a dictionary draws its index from the keys (sorted in older pandas
versions; in insertion order in modern pandas)

Series also supports array-style operations such as slicing
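A short sketch of the dictionary analogy (the population figures are illustrative):

```python
import pandas as pd

population_dict = {'California': 39538223, 'Texas': 29145505, 'New York': 20201249}
population = pd.Series(population_dict)

# Dictionary-style access by key
print(population['Texas'])

# Unlike a dictionary, a Series also supports array-style slicing
# (label-based slices include the endpoint)
print(population['California':'Texas'])
```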

Constructing Series objects


Constructing a Pandas Series from scratch

Index is an optional argument


Data can be a list or NumPy array - index defaults to an integer sequence

Data can be a scalar, which is repeated to fill the specified index

Data can be a dictionary - index defaults to the dictionary keys (sorted in older pandas
versions; insertion order in modern pandas)

Index can be explicitly set
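Each construction variant listed above can be sketched in a few lines:

```python
import pandas as pd

# Data as a list: index defaults to an integer sequence
s1 = pd.Series([2, 4, 6])

# Data as a scalar, repeated to fill the specified index
s2 = pd.Series(5, index=[100, 200, 300])

# Data as a dictionary: the keys become the index
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Index set explicitly: only the listed keys are kept, in the given order
s4 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(s4)   # 3    c
            # 2    a
```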

2. The Pandas DataFrame Object

DataFrame as a generalized NumPy array

DataFrame - is a two-dimensional array with both flexible row indices and flexible column
names

columns attribute - column labels

index attribute - gives access to the index labels


DataFrame as specialized dictionary
DataFrame maps a column name to a Series of column data
Constructing DataFrame objects
From a single Series object

From a list of dicts

From a dictionary of Series objects

From a two-dimensional NumPy array

From a NumPy structured array


3. The Pandas Index Object
- can be viewed as an immutable array or as an ordered set

One difference between Index objects and NumPy arrays is that indices are immutable; that
is, they cannot be modified via the normal means
Index as ordered set
- unions, intersections, differences, and other combinations can be computed
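A quick sketch of both properties. (Recent pandas versions deprecate the `&`/`|` operators
on Index objects in favour of the named set methods used here.)

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Index objects are immutable: item assignment raises a TypeError
try:
    ind[1] = 0
except TypeError as e:
    print("immutable:", e)

# Index as ordered set: unions, intersections, differences
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
print(indA.intersection(indB))          # Index([3, 5, 7], ...)
print(indA.union(indB))                 # Index([1, 2, 3, 5, 7, 9, 11], ...)
print(indA.symmetric_difference(indB))  # Index([1, 2, 9, 11], ...)
```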
Data Indexing and Selection
Indexing, slicing, masking, fancy indexing and combinations

Data Selection in Series

Series as dictionary

Series as one-dimensional array


Indexers: loc, iloc, and ix
loc attribute - indexing and slicing always refer to the explicit index

iloc attribute - indexing and slicing always refer to the implicit, position-based index
ix - a hybrid of the two (deprecated and removed in modern pandas)
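The distinction matters most when the explicit index is itself made of integers, as in this
small sketch:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: always uses the explicit index
print(data.loc[1])     # 'a'
print(data.loc[1:3])   # rows labelled 1 and 3 (label slices are inclusive)

# iloc: always uses the implicit, position-based index
print(data.iloc[1])    # 'b'
print(data.iloc[1:3])  # positions 1 and 2 (endpoint excluded)
```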

Data Selection in DataFrame


attribute-style access with column names

If column names are not strings, or if a column name conflicts with a method of the DataFrame,
attribute-style access is not possible. Eg: the pop() method
DataFrame as two-dimensional array
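A sketch of both points, using made-up state data. Note how a column named `pop` is shadowed
by the DataFrame's own `pop()` method:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662], 'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

# Dictionary-style column access always works
print(data['area'])

# Attribute-style access works for well-behaved string column names...
print(data.area.equals(data['area']))   # True

# ...but 'pop' collides with the DataFrame.pop() method
print(callable(data.pop))               # True: the method, not the column

# Viewing the DataFrame as a two-dimensional array
print(data.values)                 # the raw underlying array
print(data.iloc[0, 1])             # implicit row/column positions
print(data.loc['Texas', 'area'])   # explicit row/column labels
```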
Operating on Data
Performing element-wise operations
basic arithmetic - addition, subtraction, multiplication
sophisticated operations - trigonometric functions, exponential and logarithmic functions
Pandas inherits much of this functionality from NumPy, and the ufuncs
Ufuncs: Index Preservation
all NumPy ufunc - work on Pandas Series and DataFrame objects
Index alignment in Series
For binary operations - Pandas will align indices

Index alignment in DataFrame


Ufuncs: Operations Between DataFrame and Series
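The three behaviours above - index preservation, index alignment, and DataFrame/Series
broadcasting - can be sketched together (values are illustrative):

```python
import pandas as pd
import numpy as np

# Ufuncs preserve the index
ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(np.exp(ser))           # result keeps the index 'a', 'b', 'c'

# Index alignment in Series: the result index is the union of the inputs,
# with NaN wherever either input is missing
area = pd.Series({'Alaska': 1723337, 'Texas': 695662})
population = pd.Series({'California': 39538223, 'Texas': 29145505})
print(population / area)     # Alaska and California become NaN

# Index alignment in DataFrames: both rows and columns are aligned
A = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
B = pd.DataFrame({'b': [1, 1], 'c': [2, 2]})
print(A + B)                 # only column 'b' has values; 'a' and 'c' are NaN

# DataFrame/Series operations broadcast row-wise by default
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
print(df - df.iloc[0])       # subtract the first row from every row
```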

Handling Missing Data


• real-world data is rarely clean and homogeneous
• different data sources may indicate missing data in different ways - null, NaN, or NA
Trade-Offs in Missing Data Conventions

➢ number of schemes have been developed


➢ two strategies:
❑ mask
o Globally indicate missing values
o Boolean array- one bit in the data represent null value
o Requires allocation of an additional Boolean array - adds overhead in
both storage and computation
❑ sentinel value
o Indicates a missing entry
o Data-specific convention, such as -9999 or some rare bit pattern
o Reduces the range of valid values that can be represented
➢ Missing Data in Pandas

• use sentinels for missing data


• special floating-point NaN value
• Python None object
➢ None: Pythonic missing data

• Used only in arrays with data type 'object'

• Performing aggregations such as sum() or min() on an array with a None value results
in an error

NaN: Missing numerical data


it is a special floating-point value
result of arithmetic with NaN will be another NaN
Aggregates over the values are well defined (i.e., they don't result in an error) but not always
useful
NaN and None in Pandas
Pandas automatically converts the None to a NaN value
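A minimal sketch of the None vs. NaN behaviour described above:

```python
import numpy as np
import pandas as pd

# None forces dtype=object; aggregations over such an array raise a TypeError
vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)           # object

# NaN is a special floating-point value, and it is "infectious"
vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)           # float64
print(1 + np.nan)            # nan
print(vals2.sum())           # nan: well defined, but not useful

# NaN-aware NumPy aggregates ignore the missing value
print(np.nansum(vals2))      # 8.0

# Pandas converts None to NaN in numeric contexts
print(pd.Series([1, np.nan, 2, None]))   # float64 Series with two NaNs
```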

Operating on Null Values

Detecting null values


Dropping null values

We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present
axis=1 drops all columns containing a null value

Filling null values


fill NA entries with a single value, such as zero:
forward-fill to propagate the previous value forward

if a previous value is not available during a forward fill, the NA value remains.
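The detect/drop/fill operations above can be sketched on a small example:

```python
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])

# Detecting null values
print(data.isnull())          # Boolean mask: True where null
print(data[data.notnull()])   # use the mask to select valid entries

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Dropping: only full rows (default) or full columns can be dropped
print(df.dropna())            # keeps only the one row with no nulls
print(df.dropna(axis=1))      # drops every column containing a null

# Filling: a constant value, or a forward fill
print(data.fillna(0))
print(df.ffill())             # the NaN in the first row remains:
                              # there is no previous value to propagate
```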
Hierarchical Indexing
❑ Multi-indexing
❑ Store higher-dimensional data – data indexed by more than one or two keys
❑ Incorporate multiple index levels within a single index
❑ Higher-dimensional data can be represented within the 1D Series and 2D DataFrame
objects
A Multiply Indexed Series

❑ represent 2D data within a 1D Series


The bad way

❑ index or slice the series based on this multiple index

❑ selecting all values from 2010 becomes a complex process that requires using Python tuples as keys
The Better Way: Pandas MultiIndex

create a multi-index from the tuples

MultiIndex contains multiple levels of indexing

❑ Some entries are missing in the first column

❑ Blank entry indicates the same value as the line above it


Access all data for which the second index is 2010

MultiIndex as extra dimension

Each extra level in a multi-index represents an extra dimension of data

unstack() method - convert a multiply indexed Series into a DataFrame

stack() method provides the opposite operation:
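A sketch of a multiply indexed Series and the stack()/unstack() round trip (the population
figures are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2010), ('California', 2020),
                                   ('Texas', 2010), ('Texas', 2020)])
pop = pd.Series([37253956, 39538223, 25145561, 29145505], index=index)
pop.index.names = ['state', 'year']

# Partial indexing: all data for which the second index is 2020
print(pop[:, 2020])

# Each extra index level acts like an extra dimension:
# unstack() converts the multiply indexed Series into a DataFrame...
df = pop.unstack()
print(df)

# ...and stack() recovers the original Series
print(df.stack())
```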

Methods of MultiIndex Creation


1. Pass a list of two or more index arrays to the constructor

2. Pass a dictionary with appropriate tuples as keys

Explicit MultiIndex constructors

from a simple list of arrays

from a list of tuples

from a Cartesian product of single indices by passing levels and labels (labels are called
codes in newer pandas versions)
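The three explicit constructors listed above produce equivalent indices:

```python
import pandas as pd

# From a simple list of arrays
m1 = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples
m2 = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From the Cartesian product of single indices
m3 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same index
print(m1.equals(m2) and m2.equals(m3))   # True

# Level names can be set at construction time or afterwards
m3 = m3.set_names(['letter', 'number'])
print(m3.names)
```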


MultiIndex level names

MultiIndex for columns

four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number
Indexing and Slicing a MultiIndex

Multiply indexed Series

access single elements by indexing with multiple terms

partial indexing, or indexing just one of the levels in the index

Partial slicing is available as well, as long as the MultiIndex is sorted

With sorted indices, partial indexing can be performed on lower levels by passing an empty
slice in the first index
Selection based on Boolean masks Selection based on fancy indexing

Multiply indexed DataFrames

Recover Guido's heart rate

Using loc, iloc, and ix indexers

Each individual index in loc or iloc can be passed as a tuple of multiple indices.

Creating a slice within a tuple leads to a syntax error; use the IndexSlice object instead
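A sketch with mock medical-style data (names and numbers invented) showing tuple indexing
and the IndexSlice workaround:

```python
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], [2013, 2014]],
                                   names=['subject', 'year'])
columns = pd.MultiIndex.from_product([['HR', 'Temp'], [1, 2]],
                                     names=['type', 'visit'])
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=columns)

# Single elements via a tuple of multiple indices
print(data.loc[('Bob', 2013), ('HR', 1)])

# A slice inside a tuple is a syntax error: data.loc[(:, 2013), ('HR', :)]
# pd.IndexSlice builds such slices instead
idx = pd.IndexSlice
print(data.loc[idx[:, 2013], idx['HR', :]])   # all subjects, year 2013, HR columns
```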
Rearranging Multi-Indices
1. Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted

partial slice of this index, it will result in an error

With the index sorted - partial slicing will work as expected

2. Stacking and unstacking indices

Convert a dataset from a stacked multi-index to a simple two-dimensional representation


The opposite of unstack() is stack() - used to recover the original series

3. Index setting and resetting

Turn the index labels into columns - reset_index method

Build a MultiIndex from the column values – set_index method
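The three rearranging operations above - sorting, stacking/unstacking, and index
setting/resetting - in one sketch (values are made up):

```python
import pandas as pd
import numpy as np

# An unsorted MultiIndex: partial slicing raises an error
index = pd.MultiIndex.from_tuples([('a', 1), ('c', 2), ('b', 1)])
data = pd.Series(np.arange(3.0), index=index)
try:
    data['a':'b']
except Exception as e:
    print(type(e).__name__)    # UnsortedIndexError

# After sorting the index, partial slicing works as expected
data = data.sort_index()
print(data['a':'b'])

# Turning index labels into columns, and back again
pop = pd.Series([10, 20, 30, 40],
                index=pd.MultiIndex.from_product([['A', 'B'], [2010, 2020]],
                                                 names=['state', 'year']))
flat = pop.reset_index(name='population')   # index levels become columns
print(flat)
print(flat.set_index(['state', 'year']))    # rebuild the MultiIndex
```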

Data Aggregations on Multi-Indices


Combining datasets
function which creates a DataFrame of a particular form
Concat, Append, Merge and Join

Concatenation of NumPy Arrays

1. Simple Concatenation with pd.concat


Concatenate higher-dimensional objects
❑ By default, the concatenation takes place row-wise within the DataFrame
❑ pd.concat allows specification of an axis
Duplicate indices
Pandas concatenation preserves indices, even if the result will have duplicate indices

Ignoring the index

Adding MultiIndex keys

2. Concatenation with joins


data from different sources might have different sets of column names
❑ By default, entries for which no data is available are filled with NA values
❑ To change this, specify the join or join_axes parameters
❑ By default, the join is a union of the input columns (join='outer'); it can be changed to an
intersection using join='inner'
Use the join_axes argument to directly specify the index (removed in modern pandas; use
reindex on the result instead)

3. The append() method
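The concatenation behaviours above can be sketched together. (Note: the DataFrame.append()
method was a thin wrapper around pd.concat and was removed in pandas 2.0, so pd.concat is
used throughout.)

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Row-wise concatenation preserves indices, even duplicates
print(pd.concat([df1, df2]))                      # index 0, 1, 0, 1
print(pd.concat([df1, df2], ignore_index=True))   # index 0, 1, 2, 3
print(pd.concat([df1, df2], keys=['x', 'y']))     # adds a MultiIndex level

# Different column sets: outer join (default) fills NAs,
# inner join keeps only the intersection of columns
df3 = pd.DataFrame({'B': ['B4', 'B5'], 'C': ['C4', 'C5']})
print(pd.concat([df1, df3]))                 # columns A and C get NAs
print(pd.concat([df1, df3], join='inner'))   # only column B survives
```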

II. Merge and Join


high-performance, in-memory join and merge operations
Categories of Joins
1. One-to-one joins
❑ simplest type of merge
❑ very similar to the column-wise concatenation
❑ pd.merge() – use common column as a key
❑ The result of the merge is a new DataFrame
❑ The order of entries in each column is not necessarily maintained
❑ Merge in general discards the index
2. Many-to-one joins - preserve duplicate entries as appropriate

3. Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a
many-to-many merge
Specification of the Merge Key
1. The on keyword
This option works only if both the left and right DataFrames have the specified column name.

2. The left_on and right_on keywords


merge two datasets with different column names
The result has a redundant column that we can drop if desired
3. The left_index and right_index keywords
merge on an index

DataFrames implement the join() method - merge on indices

mix indices and columns


Specifying Set Arithmetic for Joins

inner join - how keyword, which defaults to 'inner'


result contains the intersection of the two sets of inputs
❑ how - 'outer', 'left', and 'right'
❑ outer join - returns a join over the union of the input columns, and fills in all missing
values with NAs

The left join and right join return joins over the left entries and right entries, respectively

Overlapping Column Names: The suffixes Keyword


two input DataFrames have conflicting column names
❑ merge function automatically appends a suffix _x or _y to make the output columns
unique.
❑ It is possible to specify a custom suffix using the suffixes keyword
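A sketch of the how keyword and the suffixes keyword (names and values invented):

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

print(pd.merge(df6, df7))               # inner (default): only 'Mary' survives
print(pd.merge(df6, df7, how='outer'))  # union of keys; NAs fill the gaps
print(pd.merge(df6, df7, how='left'))   # all rows of df6

# Conflicting column names get _x/_y suffixes, or custom ones
df8 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})
print(pd.merge(df8, df9, on='name'))                         # rank_x, rank_y
print(pd.merge(df8, df9, on='name', suffixes=['_L', '_R']))  # rank_L, rank_R
```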
Aggregation and grouping
Efficient summarization: computing aggregations like sum(), mean(), median(), min(), and
max(), in which a single number gives insight into the nature of a potentially large dataset
Simple Aggregation in Pandas
for a Pandas Series the aggregates return a single value
For a DataFrame, by default the aggregates return results within each column
describe() - computes several common aggregates for each column and returns the result.

GroupBy: Split, Apply, Combine


aggregate conditionally on some label or index - implemented by groupby operation
1. Split, apply, combine
❑ The split step involves breaking up and grouping a DataFrame depending on the value of
the specified key.
❑ The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
❑ The combine step merges the results of these operations into an output array.

❑ groupby() does not return a DataFrame
❑ It returns a DataFrameGroupBy object
❑ It does no actual computation until the aggregation is applied - "lazy evaluation"
❑ To produce a result - apply an aggregate to the DataFrameGroupBy object
2. The GroupBy object
Aggregate, filter, transform, and apply
pass a dictionary mapping column names to operations to be applied on that column

Aggregation
It can take a string, a function, or a list, and compute all the aggregates at once

Filtering
❑ Allows to drop data based on the group properties.
❑ Eg: All groups in which the standard deviation is larger than some critical value
❑ The filter function - return a Boolean value specifying whether the group passes the
filtering.
Here because group A does not have a standard deviation greater than 4, it is dropped from the
result
Transformation
transformation - return transformed version of the full data to recombine
Eg: Center the data by subtracting the group-wise mean

The apply() method


Lets to apply an arbitrary function to the group results
Eg: Normalizes the first column by the sum of the second
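The four GroupBy operations above can be sketched on one small DataFrame (values are
invented; the apply() example uses a ratio of column sums rather than the text's exact
normalization):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# groupby returns a lazy DataFrameGroupBy object; nothing is computed
# until an aggregate is applied
g = df.groupby('key')
print(g.sum())

# aggregate: a string, a function, or a list computes several at once;
# a dict maps column names to operations
print(g.aggregate(['min', 'max']))
print(g.aggregate({'data1': 'min', 'data2': 'max'}))

# filter: drop whole groups based on a group property
print(g.filter(lambda x: x['data2'].std() > 4))   # group A is dropped

# transform: return a same-shaped result, e.g. centering each group
print(g.transform(lambda x: x - x.mean()))

# apply: an arbitrary function per group, here the ratio of column sums
ratios = g[['data1', 'data2']].apply(lambda x: x['data1'].sum() / x['data2'].sum())
print(ratios)
```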
Pivot Tables
❑ A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data.
❑ The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
Motivating Pivot Tables

Database of passengers on the Titanic, available through the Seaborn library

Pivot Tables by Hand

Group according to gender, survival status, or some combination

Look at survival by both sex and, say, class.


This type of two-dimensional GroupBy is common enough that Pandas includes the pivot_table
routine to handle this kind of multidimensional aggregation.

Pivot Table Syntax

More readable than the groupby approach, and produces the same result.

1. Multi-level pivot tables

❑ Grouping in pivot tables can be specified with multiple levels, and via a number of
options.

❑ For example: age as a third dimension - bin the age using the pd.cut function

The same strategy can be applied to columns as well

Eg: add info on the fare paid using pd.qcut to automatically compute quantiles

The result is a four-dimensional aggregation with hierarchical indices


2. Additional pivot table options

fill_value and dropna - deal with missing data

aggfunc keyword - controls what type of aggregation is applied, which is a mean by default

compute totals along each grouping - margins keyword
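The pivot-table ideas above can be sketched on a small, Titanic-style mock dataset (the
real example in the text uses the Titanic data from Seaborn; these rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'class':    ['First', 'Third', 'First', 'Third', 'First', 'Third'],
    'survived': [1, 1, 0, 0, 1, 1],
    'fare':     [80.0, 8.0, 50.0, 7.0, 120.0, 9.0],
    'age':      [29, 2, 40, 21, 58, 33],
})

# The "by hand" GroupBy version...
print(df.groupby(['sex', 'class'])['survived'].mean().unstack())

# ...and the equivalent, more readable pivot_table (aggfunc='mean' is the default)
print(df.pivot_table('survived', index='sex', columns='class', aggfunc='mean'))

# Multi-level grouping: bin age with pd.cut as a second row level
age = pd.cut(df['age'], [0, 18, 80])
print(df.pivot_table('survived', index=['sex', age], columns='class'))

# aggfunc can be a dict mapping columns to aggregates; fill_value handles NAs
print(df.pivot_table(index='sex', columns='class',
                     aggfunc={'survived': 'sum', 'fare': 'mean'}, fill_value=0))

# margins=True adds grand totals along each grouping
print(df.pivot_table('survived', index='sex', columns='class', margins=True))
```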


Vectorized String Operations
• ease in handling and manipulating string data
• Pandas string operations
Introducing Pandas String Operations

Tables of Pandas String Methods


Methods using regular expressions

Miscellaneous methods

Vectorized item access and slicing


Indicator variables
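The string-method categories above can be sketched on a small Series of names (a classic
example set; the data is purely illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Vectorized string methods mirror Python's str methods and skip missing data
print(monte.str.lower())
print(monte.str.len())
print(monte.str.startswith('T'))

# Methods using regular expressions
print(monte.str.extract('([A-Za-z]+)'))             # first name
print(monte.str.findall(r'^[^AEIOU].*[^aeiou]$'))   # consonant at both ends

# Vectorized item access and slicing
print(monte.str[0:3])                        # first three characters
print(monte.str.split().str.get(-1))         # last name of each entry

# Indicator variables with get_dummies
full = pd.Series(['B|C', 'A', 'A|B'])
print(full.str.get_dummies('|'))             # one indicator column per code
```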
