Unit V
Working with Missing Data
• Missing data occurs when no information is provided for one or more
items, or for a whole unit.
• Missing data is a very common problem in real-life scenarios.
• In pandas, missing data is also referred to as NA (Not Available) values.
• Many datasets simply arrive with missing data, either because it exists
and was not collected, or because it never existed.
• For example, users being surveyed may choose not to share their income
or their address; in this way, values go missing from many datasets.
Pandas supports two values to represent missing data:
❑ None: a Python singleton object that is commonly used in Python
programs to represent missing data.
❑ NaN: short for Not a Number, a special floating-point value recognized
by all systems that use the IEEE standard for floating-point
representation.
• There are several useful functions for detecting, removing, and replacing
null values in a pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
Checking for missing values using isnull() and notnull():
• To check for missing values in a pandas DataFrame, we use the functions
isnull() and notnull(). Both functions help check whether a value is
NaN or not.
• isnull() returns a DataFrame of Boolean values that are True for NaN
values; notnull() returns the opposite mask.
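A minimal sketch of how isnull(), notnull() and dropna() behave, using a small made-up DataFrame (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Small DataFrame with a deliberately missing income value
df = pd.DataFrame({"name": ["Anu", "Ravi", "Meena"],
                   "income": [50000.0, np.nan, 62000.0]})

# isnull(): Boolean mask that is True where a value is NaN
print(df.isnull())

# notnull(): the opposite mask
print(df.notnull())

# dropna(): drop any row that contains at least one NaN
print(df.dropna())
```
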
Filling missing values using fillna(), replace() and interpolate()
https://www.scaler.com/topics/pandas/panel-in-pandas/
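A short sketch of the three filling approaches on a toy Series (the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# fillna(): replace every NaN with a fixed value
print(s.fillna(0))

# replace(): substitute one value for another, including NaN
print(s.replace(np.nan, -1))

# interpolate(): estimate NaN from neighbouring values (linear by default)
print(s.interpolate())
```
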
Data munging
• Data munging includes all the stages prior to analysis: data structuring,
cleaning, enrichment, validation, and data transformation.
• The munging process typically begins with a large volume of raw data. Data
scientists munge the data into shape by removing any errors or
inconsistencies.
• They then organize the data according to the destination schema so that
it is ready to use at the endpoint.
• Data consumers need to have clean, organized, high-quality data. These consumers
can include:
• People: Data scientists and analytics teams require a steady stream of data. To
provide them with this, the business needs to implement a munging process.
• Processes: Automated processes might require data from other systems. Munging
helps to remove any data inconsistencies, allowing these processes to run smoothly in
the background.
1. Discover: First, the data scientist performs a degree of data exploration. This
is a first glance at the data to establish the most important patterns.
2. Structure: Raw data might not have an appropriate structure for the
intended usage. The data scientists will organize and normalize the data so
that it’s more manageable.
3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may also
be values that require conversions, such as dates and currencies. Part of the
cleaning operation is to ensure consistency across all values, for example by
standardizing the format of every address.
4. Enrich: The cleaned data may be augmented with additional information from
other sources; this corresponds to the "enrichment" stage listed earlier.
Data munging
5. Validate: Finally, it is time to ensure that all data values are logically
consistent. This means checking things like whether all phone
numbers have the right number of digits, that there are no numbers in name
fields, and that all dates are valid calendar dates.
• Flexibility: Munging often has one objective in mind, such as preparing data
for analytics. This means that the data may not be in an appropriate format for
other uses, such as warehousing.
Importing Libraries
Let's get pandas and NumPy up and running in your Python script.
By default, pandas uses the first row as the header, but this is not always what we want.
Instead of numeric column labels, we can supply our own column names.
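A sketch of the idea with read_csv; the data and column names here are made up, and an in-memory string stands in for a real CSV file:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (made-up data)
raw = io.StringIO("Alice,F,50000\nBob,M,48000\n")

# header=None: do not treat the first row as column names;
# names=: supply our own names instead of the default 0, 1, 2, ...
df = pd.read_csv(raw, header=None, names=["name", "gender", "salary"])
print(df)
```
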
Data Cleaning With Python
Detect Outliers
• Outliers are numerical values that lie significantly outside the
statistical norm. Put simply, they are data points so far out of range
that they are likely misreads.
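One common way to flag such points is the interquartile range (IQR) rule; this is one approach among many, sketched on made-up temperature readings:

```python
import pandas as pd

s = pd.Series([22, 24, 23, 25, 21, 24, 980])  # 980 is an obvious misread

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [980]
```
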
Filtering
Boolean vectors
Once we have the vector of 209 values of true or false (the Boolean
vector), we can apply that to the original data frame.
If the first value in the Boolean vector is true, the first row of the data
frame is returned; if the first value is false, the row is skipped.
1/9/2024 24
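The mechanism can be sketched on a small made-up employee table (the 209-row survey data from the slides is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anu", "Ravi", "Meena", "John"],
                   "gender": ["F", "M", "F", "M"],
                   "salary": [52000, 48000, 61000, 45000]})

# The comparison produces a Boolean vector: one True/False per row
mask = df["gender"] == "F"
print(mask.tolist())        # [True, False, True, False]

# Applying the mask keeps only the rows where the vector is True
females = df[mask]
print(type(females))        # still a pandas DataFrame

# round() wraps the mean() result to two decimals
print(round(females["salary"].mean(), 2))
```
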
We can assign the results to a separate data frame that contains only the
140 female employees.
The Python type() function is used to make sure the result is a pandas
data frame.
Python has some basic built-in functions that can be applied to the
core data types, such as integers, floating-point numbers, and so on.
For example, if we want to take the result of mean() and round it to
two decimals, we can wrap the whole expression inside
the round() function.
Filtering by a list
It is not uncommon, when we have categorical data, to need to filter or
recode on a specific list of values. To reuse the example used previously,
assume we want to create a list of managerial employees. The easiest way
to do this is to use a greater-than condition. But an alternative approach
(and the only approach that works with categorical data) is to create a
list and use the isin() method to check membership in the list. This
gives the same result as above.
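A sketch of the isin() approach; the job categories below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anu", "Ravi", "Meena"],
                   "jobcat": ["Manager", "Clerk", "Custodial"]})

# Build a list of the categories we want, then test membership per row
managerial = ["Manager"]
mask = df["jobcat"].isin(managerial)

print(df[mask])   # only rows whose jobcat is in the list
```
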
Merging data
We often have multiple data sources, and in order to make statements about
the content as a whole, we would rather combine them.
The pandas DataFrame merge() function is used to merge two DataFrame
objects with a database-style join operation. The join is performed on
columns or indexes. If the join is done on columns, indexes are
ignored.
➢ Concatenate data frames along rows or columns.
➢ Merge data frames on specific keys with different join logics, such as
left join, inner join, etc.
➢ Join data frames by index.
➢ Time-series-friendly merging is provided in pandas.
In a case where two data frames have a similar shape, it might be useful to
just append one after the other. Maybe A and B are products and one data
frame contains the number of items sold per product in a store:
The default concat operation appends both frames along the rows – or
index, which corresponds to axis 0. To concatenate along the columns,
we can pass in the axis keyword argument:
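A sketch of both directions, with made-up sales counts for products A and B in two stores:

```python
import pandas as pd

store1 = pd.DataFrame({"A": [10], "B": [7]})
store2 = pd.DataFrame({"A": [4], "B": [12]})

# Default: append along the rows (axis=0, the index)
rows = pd.concat([store1, store2])
print(rows.shape)   # (2, 2)

# axis=1: append along the columns instead
cols = pd.concat([store1, store2], axis=1)
print(cols.shape)   # (1, 4)
```
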
A left, right and full join can be specified by the how parameter:
The merge methods can be specified with the how parameter. The following
table shows the methods in comparison with SQL:
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key
order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys
lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve
the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left
keys.
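The table above can be sketched on two tiny made-up frames sharing a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [20, 30, 40]})

# inner: intersection of keys, left key order preserved
inner = pd.merge(left, right, on="key", how="inner")
print(inner["key"].tolist())            # ['b', 'c']

# outer: union of keys from both frames
outer = pd.merge(left, right, on="key", how="outer")
print(sorted(outer["key"].tolist()))    # ['a', 'b', 'c', 'd']

# left: only keys from the left frame
left_j = pd.merge(left, right, on="key", how="left")
print(left_j["key"].tolist())           # ['a', 'b', 'c']
```
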
Reshaping data
➢Use NumPy's reshape() method to change the shape of an array to the desired
shape.
➢Use the stack() method to pivot a DataFrame from a wide format to a long
format, if needed.
➢Use the melt() method to unpivot a DataFrame from a wide format to a long
format, if needed.
➢Use the unstack() method to pivot from a long format to a wide
format, if needed.
➢Use the pivot() method to pivot a DataFrame from a long format to a wide
format, if needed.
➢Use the T attribute to transpose a DataFrame, if needed.
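A brief sketch of stack(), unstack() and T on a made-up 2x2 frame:

```python
import pandas as pd

wide = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# stack(): wide -> long; columns become an inner index level
long = wide.stack()
print(long.loc[("r1", "B")])   # 3

# unstack(): long -> wide; the inverse operation
back = long.unstack()
print(back.equals(wide))       # True

# T: transpose rows and columns
print(wide.T.shape)            # (2, 2)
```
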
Here, the shape argument specifies the new dimensions of the array, while the optional
order argument specifies the order in which the elements of the array are arranged.
The pivot() function reshapes data based on column values. It takes simple
column-wise data as input, and groups the entries into a two-dimensional
table.
We have passed the parameters index, columns and values to the pivot
function. Here,
• index specifies the column to be used as the index for the pivoted
dataframe
• columns specifies the column whose unique values will become the new
column headers
• values specifies the column containing the values to be placed in the new
columns
So, as we can see in the output, the dataframe has been pivoted, with the
unique values from the category column (A and B) becoming separate
columns, and the corresponding values from the value column placed in
the respective cells.
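The description above can be sketched as follows; the category and value column names follow the text, while the date index is made up:

```python
import pandas as pd

df = pd.DataFrame({"date": ["d1", "d1", "d2", "d2"],
                   "category": ["A", "B", "A", "B"],
                   "value": [1, 2, 3, 4]})

# index: new row index; columns: unique values become headers;
# values: fills the cells of the pivoted table
wide = df.pivot(index="date", columns="category", values="value")
print(wide)
```
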
Data aggregation
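A minimal sketch of data aggregation in pandas with made-up data; agg() applies one or more aggregation functions at once:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Pune", "Delhi"],
                   "hours": [8, 6, 9]})

# Apply several aggregations to a single column in one call
summary = df["hours"].agg(["min", "max", "mean"])
print(summary)
```
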
Grouping data
The pandas groupby() function is used to split the data into groups based on certain
criteria, with the syntax dataframe.groupby(['Criteria']). It is a powerful tool for
grouping values in datasets.
To group the data by a specific column, you can use the groupby() function and pass
the name of the column that you want to group on.
We use a simple example here. Imagine some fictional weather data about the number
of sunny hours per day and city:
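A sketch of that fictional weather example; the city names and hours below are made up:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Delhi"],
                   "sunny_hours": [8, 9, 6, 7]})

# Split the rows into groups by city, then aggregate each group
totals = df.groupby("city")["sunny_hours"].sum()
print(totals)   # Delhi 16, Pune 14
```
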
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
Parameters :
➢by : mapping, function, str, or iterable
➢axis : int, default 0
➢level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels
➢as_index : For aggregated output, return object with group labels as the index. Only
relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
➢sort : Sort group keys. Get better performance by turning this off. Note this does not
influence the order of observations within each group. groupby preserves the order of
rows within each group.
➢group_keys : When calling apply, add group keys to index to identify pieces
➢squeeze : Reduce the dimensionality of the return type if possible, otherwise return a
consistent type
Returns : GroupBy object
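The effect of the as_index parameter can be sketched with a made-up table:

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"], "pts": [1, 2, 3]})

# Default: the group labels become the index
g1 = df.groupby("team").sum()
print(g1.index.tolist())    # ['x', 'y']

# as_index=False: "SQL-style" output, team stays a regular column
g2 = df.groupby("team", as_index=False).sum()
print(g2.columns.tolist())  # ['team', 'pts']
```
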