Notes - EDA-Unit2 (1)
Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations.
COURSE OBJECTIVE:
To implement data manipulation using Pandas.
COURSE OUTCOME:
CO2: Implement data manipulation using Pandas
Data Manipulation using Pandas
Pandas - package built on top of NumPy
- provides an efficient implementation of a DataFrame
DataFrames - multidimensional arrays with attached row and column labels
- supports heterogeneous types and/or missing data
Pandas objects - enhanced versions of NumPy structured arrays
- rows and columns are identified with labels rather than simple integer indices
Three fundamental Pandas data structures: Series, DataFrame, and Index
Pandas Objects
1. The Pandas Series Object
Pandas Series - one-dimensional array of indexed data
- can be created from a list or array
- can be accessed with the values and index attributes
- can also be accessed using index via square-bracket notation
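A minimal sketch of the points above, assuming made-up sample values: a Series is built from a list, inspected through its values and index attributes, and read with square brackets.

```python
import pandas as pd

# Build a Series from a plain Python list; pandas attaches a default integer index.
data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data.values)   # the underlying NumPy array of values
print(data.index)    # the index object (a RangeIndex here)
print(data[1])       # square-bracket access by index label

# An explicit index can be supplied instead of the default one.
labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(labeled["b"])
```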
DataFrame - is a two-dimensional array with both flexible row indices and flexible column
names
One difference between Index objects and NumPy arrays is that indices are immutable: that
is, they cannot be modified via the normal means (such as item assignment)
Index as ordered set
- unions, intersections, differences, and other combinations can be computed
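The two Index properties above (immutability and set-like behaviour) can be sketched as follows. The index values are arbitrary, and the set operations are written with the named methods rather than operators.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

# Index objects are immutable: item assignment raises a TypeError.
try:
    ind_a[1] = 0
    modified = True
except TypeError:
    modified = False

# Set-style combinations of two indices.
common = ind_a.intersection(ind_b)            # labels present in both
either = ind_a.union(ind_b)                   # labels present in at least one
only_a = ind_a.difference(ind_b)              # labels in ind_a but not in ind_b
one_side = ind_a.symmetric_difference(ind_b)  # labels in exactly one of the two
```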
Data Indexing and Selection
Indexing, slicing, masking, fancy indexing and combinations
Series as dictionary
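loc attribute - indexing and slicing always refer to the explicit index (the labels)
iloc attribute - indexing and slicing always refer to the implicit, Python-style integer position
ix - hybrid of the two (deprecated, and removed in pandas 1.0; use loc or iloc instead)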
If column names are not strings, or if the column names conflict with methods of the DataFrame
– attribute style access is not possible. Eg : pop() method
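A short sketch of explicit versus implicit indexing, and of the attribute-access pitfall mentioned above. The Series values and the two-row DataFrame are illustrative assumptions only.

```python
import pandas as pd

# A Series whose explicit index is made of integers, so label and position differ.
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

explicit_item = data.loc[1]       # label 1
implicit_item = data.iloc[1]      # position 1 (the second element)
explicit_slice = data.loc[1:3]    # label slice: the end label is included
implicit_slice = data.iloc[1:3]   # positional slice: the end position is excluded

# A column named "pop" collides with the DataFrame.pop() method,
# so attribute-style access returns the method, not the column.
df = pd.DataFrame({"area": [423967, 695662], "pop": [38332521, 26448193]},
                  index=["California", "Texas"])
pop_attr_is_method = callable(df.pop)   # df.pop is the method
pop_column = df["pop"]                  # dictionary-style access always works
area_column = df.area                   # no conflict, so attribute access works
```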
DataFrame as two-dimensional array
Operating on Data
Performing element-wise operations
basic arithmetic - addition, subtraction, multiplication
sophisticated operations - trigonometric functions, exponential and logarithmic functions
Pandas inherits much of this functionality from NumPy, and the ufuncs
Ufuncs: Index Preservation
all NumPy ufunc - work on Pandas Series and DataFrame objects
Index alignment in Series
For binary operations - Pandas will align indices
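The two behaviours above can be sketched with small, arbitrary Series: a NumPy ufunc keeps the index, and a binary operation aligns on the union of the two indices, leaving NaN where one side has no entry.

```python
import numpy as np
import pandas as pd

# Index preservation: the result of a ufunc is a Series with the same index.
ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
exp_ser = np.exp(ser)

# Index alignment: the result is indexed by the union of both indices.
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
total = A + B                      # NaN at labels 0 and 3 (missing on one side)

# The method form accepts a fill value for entries missing on one side.
total_filled = A.add(B, fill_value=0)
```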
Handling Missing Data
We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present
axis=1 drops all columns containing a null value
If a previous value is not available during a forward fill, the NA value remains.
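A minimal sketch of the dropping and filling rules above, using a small made-up DataFrame and Series that contain NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

rows_kept = df.dropna()         # default: drop every row that contains a null
cols_kept = df.dropna(axis=1)   # drop every column that contains a null

ser = pd.Series([np.nan, 1, np.nan, 2], index=list("abcd"))
forward = ser.ffill()           # forward fill: "a" has no previous value, so it stays NaN
```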
Hierarchical Indexing
❑ Multi-indexing
❑ Store higher-dimensional data – data indexed by more than one or two keys
❑ Incorporate multiple index levels within a single index
❑ Higher-dimensional data can be represented within the 1D Series and 2D DataFrame
objects
A Multiply Indexed Series
❑ Using Python tuples as keys works, but selecting all values from 2010 becomes a complex process
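To illustrate why plain tuple keys are awkward, here is a sketch assuming a small state/year population table (the figures are illustrative only). The index is deliberately kept as ordinary tuples, so picking out one year means scanning every key by hand.

```python
import pandas as pd

keys = [("California", 2000), ("California", 2010),
        ("New York", 2000), ("New York", 2010),
        ("Texas", 2000), ("Texas", 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# tupleize_cols=False keeps the keys as plain tuples instead of building a MultiIndex.
pop = pd.Series(populations, index=pd.Index(keys, tupleize_cols=False))

# Selecting all 2010 values requires a manual test on every tuple key.
mask = [key[1] == 2010 for key in pop.index]
pop_2010 = pop[mask]
```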
The Better Way: Pandas MultiIndex
four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number
Indexing and Slicing a MultiIndex
Creating a slice within a tuple leads to a syntax error; use the IndexSlice object instead
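A sketch of slicing a multiply indexed DataFrame with pd.IndexSlice. The subject names and the measurement values are placeholders (a simple counter), chosen only so the layout of year, visit, subject and measurement type is easy to follow.

```python
import numpy as np
import pandas as pd

# Rows indexed by (year, visit); columns indexed by (subject, measurement type).
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido"], ["HR", "Temp"]],
                                     names=["subject", "type"])
health = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# health.loc[(:, 1), (:, "HR")] is not valid Python syntax.
# IndexSlice builds the same slices explicitly.
idx = pd.IndexSlice
first_visit_hr = health.loc[idx[:, 1], idx[:, "HR"]]   # visit 1 only, heart rate only
```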
Rearranging Multi-Indices
1. Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted
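A small sketch of the sorting requirement, with arbitrary values. The first level is given in the order a, c, b, so the index is not lexicographically sorted; sort_index() fixes that before slicing.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=index)

# On the unsorted index a partial slice is expected to raise
# (pandas reports an UnsortedIndexError, which is a KeyError subclass).
try:
    data["a":"b"]
    slice_failed = False
except KeyError:
    slice_failed = True

# After sorting, the same slice works.
data_sorted = data.sort_index()
partial = data_sorted["a":"b"]
```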
Combining Datasets: Merge and Join
3. Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a
many-to-many merge
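A many-to-many merge can be sketched as follows, assuming two small made-up tables that share a "group" column with repeated entries on both sides.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
skills = pd.DataFrame({
    "group": ["Accounting", "Accounting", "Engineering", "Engineering", "HR", "HR"],
    "skills": ["math", "spreadsheets", "coding", "linux", "spreadsheets", "organization"],
})

# "group" repeats in both tables, so each employee row is paired with
# every skill row of the same group.
merged = pd.merge(employees, skills)
```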
Specification of the Merge Key
1. The on keyword
This option works only if both the left and right DataFrames have the specified column name.
The left join and right join return joins over the left entries and right entries, respectively
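The on keyword and the left and right joins can be sketched with small hypothetical tables. The how argument of pd.merge selects the join type.

```python
import pandas as pd

groups = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                       "group": ["Accounting", "Engineering", "Engineering", "HR"]})
hires = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                      "hire_date": [2004, 2008, 2012, 2014]})

# on= names the key column explicitly; both frames must contain it.
merged_on = pd.merge(groups, hires, on="employee")

foods = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                      "food": ["fish", "beans", "bread"]})
drinks = pd.DataFrame({"name": ["Mary", "Joseph"],
                       "drink": ["wine", "beer"]})

left = pd.merge(foods, drinks, how="left")    # every row of foods; missing drinks are NaN
right = pd.merge(foods, drinks, how="right")  # every row of drinks; missing foods are NaN
```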
Aggregation
The aggregate() method can take a string, a function, or a list of these, and compute all the aggregates at once
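A minimal sketch of the forms aggregate() accepts, using a small made-up DataFrame with a "key" column and two data columns.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# A list of names computes several aggregates at once for every column.
stats = df.groupby("key").aggregate(["min", "median", "max"])

# A dictionary maps each column to its own aggregate.
per_col = df.groupby("key").aggregate({"data1": "min", "data2": "max"})

# A function is applied to each group; here it returns the range of data2.
spread = df.groupby("key")["data2"].aggregate(lambda x: x.max() - x.min())
```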
Filtering
❑ Allows dropping data based on group properties.
❑ Eg: Keep all groups in which the standard deviation is larger than some critical value
❑ The filter function - returns a Boolean value specifying whether the group passes the
filtering.
Here because group A does not have a standard deviation greater than 4, it is dropped from the
result
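The filtering step can be sketched as below. The data and the threshold of 4 are illustrative assumptions, chosen so that group A falls below the threshold while B and C pass.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def filter_func(group):
    # Keep a group only if the standard deviation of data2 exceeds 4.
    return group["data2"].std() > 4

group_std = df.groupby("key")["data2"].std()   # per-group standard deviation, for reference
kept = df.groupby("key").filter(filter_func)   # rows of the groups that pass the test
```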
Transformation
transformation - returns a transformed version of the full data to recombine; the output has the same shape as the input
Eg: Center the data by subtracting the group-wise mean
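Centering by the group-wise mean can be sketched as follows, on a small made-up DataFrame. The result keeps one row per input row.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# Subtract each group's mean from its own rows.
centered = df.groupby("key").transform(lambda x: x - x.mean())
```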
Pivot Tables
pivot_table - more readable than the groupby approach, and produces the same result.
❑ Grouping in pivot tables can be specified with multiple levels, and via a number of
options.
❑ For example: age as a third dimension - bin the age using the pd.cut function
Eg: add info on the fare paid using pd.qcut to automatically compute quantiles
aggfunc keyword - controls what type of aggregation is applied, which is a mean by default
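The pivot table points above can be sketched on a tiny, invented passenger table (eight made-up rows, not a real dataset). The column names, bin edges and the number of quantiles are assumptions for illustration.

```python
import pandas as pd

passengers = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "female", "male", "male", "female"],
    "class":    ["First", "Third", "First", "Third", "First", "Third", "First", "Third"],
    "age":      [29, 12, 45, 30, 15, 8, 60, 40],
    "fare":     [80.0, 8.0, 70.0, 7.5, 90.0, 9.0, 75.0, 8.5],
    "survived": [1, 1, 0, 0, 1, 1, 0, 0],
})

# Same table two ways: groupby plus unstack, and pivot_table (mean by default).
by_groupby = passengers.groupby(["sex", "class"])["survived"].mean().unstack()
by_pivot = passengers.pivot_table(values="survived", index="sex", columns="class")

# aggfunc switches the aggregation, here from the default mean to a sum.
counts = passengers.pivot_table(values="survived", index="sex",
                                columns="class", aggfunc="sum")

# Multi-level grouping: bin the ages with pd.cut and use the bins as a second row level.
passengers["age_group"] = pd.cut(passengers["age"], [0, 18, 80])
by_age = passengers.pivot_table(values="survived", index=["sex", "age_group"],
                                columns="class", observed=False)

# pd.qcut chooses the bin edges from quantiles, here two equally populated fare bins.
fare_group = pd.qcut(passengers["fare"], 2)
```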
Miscellaneous methods