Unit V

Data Analysis Application Examples

Dr. K.VENKATA RAMANA


Associate Professor,
Dept of Computer Science & Systems Engineering
Andhra University

Working with Missing Data
• Missing data can occur when no information is provided for one or more
items or for a whole unit.
• Missing data is a very common problem in real-life scenarios.
• Missing data is also referred to as NA (Not Available) values in pandas.
• Many datasets simply arrive with missing data, either because it exists and
was not collected or because it never existed.
• For example, different users being surveyed may choose not to share their
income, and some may choose not to share their address; in this way, many
values in a dataset end up missing.

Working with Missing Data
Pandas supports two values to represent missing data:
❑ None: None is a Python singleton object that is commonly used in Python
programs to represent missing data.
❑ NaN: NaN (Not a Number) is a special floating-point value recognized by
all systems that use the IEEE standard for floating-point representation.
• There are several useful functions for detecting, removing, and replacing
null values in a Pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
Working with Missing Data
Checking for missing values using isnull() and notnull():
• In order to check for missing values in a Pandas DataFrame, we use the
isnull() and notnull() functions. Both help in checking whether a value is
NaN or not.
• isnull() returns a DataFrame of Boolean values which are True for NaN
values.
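A minimal sketch with made-up data:

import pandas as pd
import numpy as np

# hypothetical dataset with some missing entries
df = pd.DataFrame({"Score": [90, np.nan, 75], "Age": [25, 30, np.nan]})

print(df.isnull())    # True where a value is NaN
print(df.notnull())   # True where a value is present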

Working with Missing Data
Filling missing values using fillna(), replace() and interpolate()

• In order to fill null values in a dataset, we use the fillna(), replace() and
interpolate() functions. These functions replace NaN values with some
value of their own.
• All of these functions help in filling null values in a DataFrame.
• The interpolate() function is used to fill NA values in the DataFrame, but
it uses various interpolation techniques to estimate the missing values
rather than hard-coding a replacement value.
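A short sketch on similar made-up data (the replacement values are arbitrary):

import pandas as pd
import numpy as np

df = pd.DataFrame({"Score": [90, np.nan, 75, np.nan]})

print(df.fillna(0))                              # replace NaN with a constant
print(df.replace(to_replace=np.nan, value=-99))  # replace NaN with a marker value
print(df.interpolate(method="linear"))           # estimate NaN from neighbouring values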

Working with Missing Data

dropna(): In order to drop null values from a DataFrame, we use the
dropna() function. This function drops rows or columns of the dataset
that contain null values.
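A brief illustration with made-up data:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

print(df.dropna())           # drop rows containing any NaN
print(df.dropna(axis=1))     # drop columns containing any NaN
print(df.dropna(how="all"))  # drop only rows where every value is NaN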
Hierarchical Indexing
• Hierarchical data is often used to represent multiple levels of nested
groups or categories. For example, a company may have a hierarchy of
employees, departments, and locations.
• One of the challenges of working with hierarchical data is how to represent
it in a tabular format that makes it easy to manipulate and analyze.
• To represent hierarchical data in Python, we can use Pandas' built-in
methods such as set_index() and groupby().
Hierarchical Indexing
• In the following example, we demonstrate the use of the groupby() method
in Pandas to group data based on a specific column. Here, we group the
data based on the unique values in the 'Category' column, which forms a
separate group for each unique category.
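A small sketch; the 'Category', 'Item' and 'Sales' columns are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "Category": ["Fruit", "Fruit", "Vegetable", "Vegetable"],
    "Item": ["Apple", "Banana", "Carrot", "Potato"],
    "Sales": [100, 150, 80, 120],
})

# hierarchical (multi-level) index built from two columns
indexed = df.set_index(["Category", "Item"])
print(indexed)

# one group per unique value in the Category column
print(df.groupby("Category")["Sales"].sum())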
Panel Data
• The Panel in Pandas is used for working with three-dimensional data. It has three
main axes: items (axis 0), which corresponds to the data; major_axis (axis 1) for
rows; and minor_axis (axis 2) for columns. A panel can be created by using the
pandas.Panel() function.

• The panel in pandas is a three-dimensional container of data. To create a panel, we
can use ndarrays (multidimensional arrays) or a dictionary of DataFrames (one of
the Pandas 2-D data structures that contain data in the tabular form of rows and
columns). We can also extract data from panels using different methods.

https://www.scaler.com/topics/pandas/panel-in-pandas/
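Note that Panel was deprecated in pandas 0.20 and removed in pandas 1.0; the
recommended replacements are a MultiIndex DataFrame or the xarray library. A sketch
of the modern MultiIndex equivalent, with invented item names:

import pandas as pd

frames = {
    "item1": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "item2": pd.DataFrame({"a": [5, 6], "b": [7, 8]}),
}

# concatenating a dict of DataFrames yields items x rows x columns
# as a DataFrame with a two-level (hierarchical) row index
panel_like = pd.concat(frames, names=["item"])
print(panel_like)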
Data munging

Introduction to Data Munging

• Data munging, also known as data wrangling, is the process of converting
raw data into a more usable format.

• Data munging includes all the stages prior to analysis, such as data
structuring, cleaning, enrichment, validation, and data transformation.

• The munging process typically begins with a large volume of raw data. Data
scientists mung the data into shape by removing any errors or
inconsistencies.

• They then organize the data according to the destination schema so that
it is ready to use at the endpoint.

• The process also involves data transformation, such as normalizing datasets
to create one-to-many mappings.

• Munging is generally a permanent data transformation process.


Data munging
Why Use Data Munging?
• Most organizations have multiple, disparate sources of incoming data. These sources
will all have different standards for validating data and catching errors. Some may
simply output the data “as-is.”

• Data consumers need clean, organized, high-quality data. These consumers
can include:

• People: Data scientists and analytics teams require a steady stream of data. To
provide them with this, the business needs to implement a munging process.

• Processes: Automated processes might require data from other systems. Munging
helps to remove any data inconsistencies, allowing these processes to run smoothly in
the background.

• Repositories: Organizations often store vast quantities of information in a data
lake or data warehouse. Munging can help standardize data, which makes it
easier to store in a data warehouse.

Data munging

How to Do Data Munging

The modern data munging process involves six main steps:

1. Discover: First, the data scientist performs a degree of data exploration. This
is a first glance at the data to establish the most important patterns.

2. Structure: Raw data might not have an appropriate structure for the
intended usage. The data scientist will organize and normalize the data so
that it is more manageable.

3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may also
be values that require conversion, such as dates and currencies. Part of the
cleaning operation is to ensure consistency across all values; for example,
the cleaning process will standardize the format of every address.

Data munging

4. Enrich: Data enrichment is the process of filling in missing details by
referring to other data sources. For example, the raw data might
contain partial customer addresses that can be completed from
another source.

5. Validate: Finally, it is time to ensure that all data values are logically
consistent. This means checking things like whether all phone
numbers have the expected number of digits, that there are no numbers
in name fields, and that all dates are valid calendar dates.

6. Publish: When the data munging process is complete, the data
science team pushes it towards its final destination. Often this is a
data repository, where it will integrate with data from other sources.
This makes the munged data permanently available to all
consumers.

Data munging

Issues with Data Munging

Data munging processes sometimes present issues such as:

• Resource overheads: When data scientists oversee the munging process, it
can take up a substantial amount of their time.

• Data loss: Data munging is usually a one-way process. Data scientists
permanently transform the incoming data, and there may not be an extant
copy of the original data.

• Flexibility: Munging often has one objective in mind, such as preparing data
for analytics. This means that the data may not be in an appropriate format for
other uses, such as warehousing.

• Process errors: If the munging process is manual or semi-automatic, there is a
chance for errors to creep in. An automated process, by contrast, gives business
experts an opportunity to get involved in the data mapping process.
Data Cleaning With Python

What Is Data Cleaning

• When working with multiple data sources, there are many chances
for data to be incorrect, duplicated, or mislabeled.

• If the data is wrong, outcomes and algorithms are unreliable, even
though they may look correct.

• Data cleaning is the process of changing or eliminating garbage,
incorrect, duplicate, corrupted, or incomplete data in a dataset.

• There is no absolute way to describe the precise steps in the data
cleaning process, because the process varies from dataset to dataset.

• Data cleaning, data cleansing, and data scrubbing are different names
for the same general data preparation process.
Data Cleaning With Python

Importing Libraries
Let's get Pandas and NumPy up and running in your Python script.

Reading the Dataset and Locating Missing Data

This file contains only four rows, which will allow us to demonstrate the
process through to a cleaned data set.

The lines that contain values are all comma-separated, but we have
missing (NA) and probably unclean (5.3*) values.

The argument to read_csv is the path of the dataset we want to examine;
the pd. prefix tells us we are using the Pandas library to read it.
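A hedged sketch of this step; the file name and contents below are reconstructed from
the slide's description, not the original data set:

import pandas as pd
import numpy as np

# hypothetical file contents, four comma-separated rows with an NA
# and an unclean value:
#   2024-01-01,12.0
#   2024-01-02,NA
#   2024-01-03,5.3*
#   2024-01-04,10.1
df = pd.read_csv("data.csv")
print(df)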
Data Cleaning With Python

Pandas used the first row as the header, but this is not what we want.
Instead of the default numeric column names, we would like to supply our own:
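A sketch, reusing the hypothetical file from above; the column names are illustrative:

import pandas as pd

# header=None stops Pandas treating the first row as a header;
# names supplies our own column labels
df = pd.read_csv("data.csv", header=None, names=["date", "value"])
print(df)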

Data Cleaning With Python

If we know in advance the undesirable characters in our data set, we can
augment the read_csv method with a custom converter function.

If we wanted to keep only the complete entries, we could drop any row
that contains undefined values:
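A sketch under the assumption that the unclean values carry a trailing '*', as in the
slide's example; the clean_value helper is invented for illustration:

import pandas as pd

def clean_value(x):
    # strip a trailing '*' and convert to float; NA/empty fields become NaN
    x = x.rstrip("*")
    return float(x) if x not in ("", "NA") else float("nan")

df = pd.read_csv("data.csv", header=None, names=["date", "value"],
                 converters={"value": clean_value})

complete = df.dropna()   # keep only the complete rows
print(complete)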

Data Cleaning With Python

Check for Duplicates


• Duplicates, like missing data, cause problems and clog up analytics
software. Let’s locate and eliminate them.

• To locate duplicates we start out with:
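For example, with made-up data:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [90, 80, 90]})

print(df.duplicated())     # True for rows that repeat an earlier row
df = df.drop_duplicates()  # keep only the first occurrence
print(df)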

Detect Outliers
• Outliers are numerical values that lie significantly outside of the
statistical norm. Put simply, they are data points so far out of range
that they are likely misreads.
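One common approach (an assumption here; the slide does not prescribe a method) is
the 1.5 x IQR rule, shown with made-up data:

import pandas as pd

df = pd.DataFrame({"value": [9.8, 10.0, 11.2, 250.0]})

# flag values beyond 1.5 * IQR of the quartiles
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(outliers)   # the 250.0 misread is flagged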

Filtering

Filtering means limiting rows and/or columns. Filtering is clearly
central to any data analysis.

Boolean vectors

Filtering in Pandas relies heavily on the concept of Boolean vectors.

The expression (==) tests whether each value of the Gender column is
equal to the string “Female”. The result of the expression is a vector of
trues and falses corresponding to whether each of the 209 values of
Gender is equal to “Female”.
Filtering

Once we have the vector of 209 values of true or false (the Boolean
vector), we can apply it to the original data frame.

If the first value in the Boolean vector is true, the first row of the data
frame is returned; if the first value is false, the row is skipped.
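A minimal sketch; the three-row frame below stands in for the slide's 209-row
employee file:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Salary": [52000, 48000, 61000]})

is_female = df["Gender"] == "Female"   # a Boolean vector of trues and falses
print(df[is_female])                   # only the rows where the vector is True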

Filtering

We can assign the results to a separate data frame that contains only the
140 female employees.

The Python type() function is used to make sure the result is a Pandas
data frame.

Python has some basic built-in functions that can be applied to the
core data types, such as integers, floating-point numbers, and so on.
For example, if we want to take the result of mean() and round it to
two decimals, we can wrap the whole expression inside
the round() function.
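A sketch on the same made-up data:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Salary": [52000, 48000, 61000]})

females = df[df["Gender"] == "Female"]
print(type(females))                        # <class 'pandas.core.frame.DataFrame'>
print(round(females["Salary"].mean(), 2))   # mean salary rounded to two decimals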
Filtering

Complex filtering criteria

Boolean vectors can be created by combining conditions with & (and) and
| (or). The only trick is that each condition must be in parentheses.

The vector can then be applied to the whole data set to filter the data
frame to female employees with job grade 1. Rather than listing the
results, we call the shape property to confirm that only 48 employees are
included in the resulting vector.
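A sketch with invented data; the Grade column stands in for the slide's job-grade field:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Grade": [1, 1, 2]})

# each condition in parentheses, combined with & (and)
vector = (df["Gender"] == "Female") & (df["Grade"] == 1)
print(df[vector].shape)   # (rows, columns) of the filtered frame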
Filtering

Filtering by a list
It is not uncommon, when we have categorical data, to need to filter or
recode on a specific list of values. To reuse the previous example, assume
we want to create a list of managerial employees. The easiest way to do
this is to use a greater-than condition. But an alternative approach (and
the only approach that works with categorical data) is to create a list and
use the isin() method to check membership in the list. This gives the
same result.
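A sketch with invented data and grade thresholds:

import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Grade": [1, 2, 3]})

# greater-than condition vs. explicit list membership
print(df[df["Grade"] > 1])
print(df[df["Grade"].isin([2, 3])])   # same result, and works for categorical data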
Merging data

We often have multiple data sources; in order to make statements about
the content, we would rather combine them.
The Pandas DataFrame merge() function is used to merge two DataFrame
objects with a database-style join operation. The joining is performed on
columns or indexes. If the joining is done on columns, indexes are
ignored.
➢ Concatenate DataFrames along rows and columns.
➢ Merge DataFrames on specific keys with different join logics, such as
left join, inner join, etc.
➢ Join DataFrames by index.
➢ Time-series-friendly merging is provided in pandas.

Merging Data

In a case where two data frames have a similar shape, it might be useful to
just append one after the other. Maybe A and B are products, and one data
frame contains the number of items sold per product in a store.

Sometimes, we won't care about the indices of the originating data frames:
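A sketch with invented sales numbers (note that in modern pandas, pd.concat replaces
the removed DataFrame.append):

import pandas as pd

store1 = pd.DataFrame({"product": ["A", "B"], "sold": [10, 20]})
store2 = pd.DataFrame({"product": ["A", "B"], "sold": [7, 15]})

print(pd.concat([store1, store2]))                     # keeps original indices 0, 1, 0, 1
print(pd.concat([store1, store2], ignore_index=True))  # renumbers the rows 0..3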
Merging data

Combining objects is offered by the pd.concat function, which takes an
arbitrary number of series or data frames as input (older pandas versions
also accepted panels).

The default concat operation appends both frames along the rows, or
index, which corresponds to axis 0. To concatenate along the columns,
we can pass in the axis keyword argument:
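For example:

import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})

print(pd.concat([a, b]))          # default: append along the rows (axis 0)
print(pd.concat([a, b], axis=1))  # concatenate along the columns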
Merging data

A left, right and full join can be specified by the how parameter:

Merging Data

The merge method can be specified with the how parameter. The following
list shows the methods in comparison with SQL:

left: use only keys from the left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from the right frame, similar to a SQL right outer join;
preserve key order.
outer: use the union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use the intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
cross: create the cartesian product of both frames; preserve the order of the
left keys.
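A minimal sketch of the how parameter:

import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

print(pd.merge(left, right, on="key", how="inner"))  # keys b and c only
print(pd.merge(left, right, on="key", how="outer"))  # union of keys a, b, c, d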
Reshaping data

Reshaping data refers to the process of converting a DataFrame from one
format to another for better data visualization and analysis.

➢ Use the reshape() method to change the shape of the underlying values
to the desired shape.
➢ Use the stack() method to pivot from a wide format to a long format, if
needed.
➢ Use the melt() method to unpivot from a wide format to a long format, if
needed.
➢ Use the unstack() method to pivot from a long format to a wide format,
if needed.
➢ Use the pivot() method to pivot from a long format to a wide format, if
needed.
➢ Use the T attribute to transpose the data, if needed.
Reshaping data

Using the reshape method

The reshape method can be used to change the shape of a Series' values. This method
requires the new shape to be compatible with the original shape.

Here, the shape argument specifies the new dimensions of the array, while the optional
order argument specifies the order in which the elements of the array are arranged.

In the following example, a Series is created which contains the values 1 to 9; then
values.reshape(3, 3) is used to reshape the Series into a matrix of size 3x3.
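A sketch of this example:

import pandas as pd

s = pd.Series(range(1, 10))       # values 1 to 9
matrix = s.values.reshape(3, 3)   # the underlying NumPy array, reshaped to 3x3
print(matrix)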
Reshaping data
Using the stack() and unstack() methods
In Pandas, we can also use stack() and unstack() to reshape data.
stack() is used to pivot a level of the column labels, transforming them into the
innermost row index level.
unstack() is used to pivot a level of the row index, transforming it into the
outermost column level.
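For example:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

stacked = df.stack()       # column labels move into the innermost row index level
print(stacked)
print(stacked.unstack())   # pivot back to the original wide layout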
Reshaping data

Using the melt function

The melt() function in Pandas transforms a DataFrame from a wide format to a long
format.

In the following example, we use the melt() function to transform the DataFrame df
from a wide format to a long format.
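A sketch with invented student data:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"],
                   "math": [90, 80],
                   "physics": [85, 70]})

# wide -> long: one row per (name, subject) pair
long_df = pd.melt(df, id_vars="name", var_name="subject", value_name="score")
print(long_df)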
Reshaping data

Using the pivot() method

The pivot() function reshapes data based on column values. It takes simple
column-wise data as input, and groups the entries into a two-dimensional
table.

Reshaping data

We have passed the parameters index, columns and values to the pivot
function. Here,
index specifies the column to be used as the index for the pivoted
dataframe,
columns specifies the column whose unique values will become the new
column headers, and
values specifies the column containing the values to be placed in the new
columns.
As the pivoted output shows, the dataframe has been pivoted, with the
unique values from the category column (A and B) becoming separate
columns, and the corresponding values from the value column placed in
the respective cells.
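A sketch matching this description, with invented dates:

import pandas as pd

df = pd.DataFrame({"date": ["d1", "d1", "d2", "d2"],
                   "category": ["A", "B", "A", "B"],
                   "value": [1, 2, 3, 4]})

# the unique categories A and B become separate columns
print(df.pivot(index="date", columns="category", values="value"))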
Reshaping data

Using the transpose attribute

The transpose attribute (T) can be used to switch the rows and columns of a DataFrame.
This is useful when we want to visualize the data in a different way.
Here, T is an attribute and not a method, so you don't need to use parentheses when
using it. Also, because it is an attribute and not a method, it cannot take any arguments.
The T attribute returns a new DataFrame with the rows and columns interchanged.
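For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.T)   # rows and columns interchanged; T is an attribute, no parentheses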

Data aggregation

Pandas comes with a lot of aggregation functions built in. Aggregation is
done using the pandas and numpy libraries, and the data must be available
as, or converted to, a DataFrame before the aggregation functions can be
applied.
We start with some artificial data again, containing measurements of the
number of sunshine hours per city and date.

To view a summary per city, we use the describe function on the
grouped data set:
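A sketch with invented sunshine data:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "date": ["d1", "d2", "d1", "d2"],
                   "hours": [8, 5, 4, 6]})

print(df.groupby("city").describe())   # count, mean, std, quartiles per city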

Data aggregation

On certain data sets, it can be useful to group by more than one
attribute. We can get an overview of the sunny hours per country and
date by passing in two column names:
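A sketch with invented country-level data:

import pandas as pd

df = pd.DataFrame({"country": ["Germany", "Germany", "Norway", "Norway"],
                   "date": ["d1", "d2", "d1", "d2"],
                   "hours": [8, 5, 4, 6]})

# grouping by two columns yields one group per (country, date) pair
print(df.groupby(["country", "date"])["hours"].sum())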

Data aggregation

We can define any function to be applied on the groups with the agg
method. Here we define a custom function which takes a Series object as
input and computes the difference between the smallest and the largest
element:
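A sketch, with value_range as an invented name for the custom function:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "hours": [8, 5, 4, 6]})

# spread between the largest and smallest value in each group
def value_range(series):
    return series.max() - series.min()

print(df.groupby("city")["hours"].agg(value_range))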

Data aggregation

The main task of the DataFrame.aggregate() function is to apply some
aggregation to one or more columns. The most frequently used
aggregations are:
sum: returns the sum of the values for the requested axis.
min: returns the minimum of the values for the requested axis.
max: returns the maximum of the values for the requested axis.
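For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.aggregate(["sum", "min", "max"]))   # one result row per aggregation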

Grouping data

The pandas groupby() function is used to split data into groups based on certain
criteria, with the syntax dataframe.groupby(['Criteria']). It is a powerful tool for
grouping values in datasets.
To group the data by a specific column, use the groupby() function and pass
the name of the column that you want to group on.
We use a simple example here. Imagine some fictional weather data about the
number of sunny hours per day and city:
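A sketch of such data, reconstructed for illustration:

import pandas as pd

# fictional weather data
df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "hours": [8, 5, 4, 6]})

by_city = df.groupby("city")   # a GroupBy object
print(by_city.sum())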
Grouping data

The groups attribute returns a dictionary containing the unique groups
and the corresponding axis labels:

The result of a groupby is a GroupBy object, not a DataFrame, but we can
use the usual indexing notation to refer to columns:
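A self-contained sketch:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo"],
                   "hours": [8, 5, 4]})

grouped = df.groupby("city")
print(grouped.groups)           # dict: group label -> row labels
print(grouped["hours"].mean())  # usual indexing notation on the GroupBy object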

Grouping data
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)

Parameters:
➢ by: mapping, function, str, or iterable
➢ axis: int, default 0
➢ level: if the axis is a MultiIndex (hierarchical), group by a particular level or levels
➢ as_index: for aggregated output, return an object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped
output
➢ sort: sort group keys. Get better performance by turning this off. Note this does not
influence the order of observations within each group; groupby preserves the order of
rows within each group
➢ group_keys: when calling apply, add group keys to the index to identify pieces
➢ squeeze: reduce the dimensionality of the return type if possible, otherwise return a
consistent type
Returns: GroupBy object
