Unit V

Data Analysis Application Examples

Dr. K.VENKATA RAMANA


Associate Professor,
Dept of Computer Science & Systems Engineering
Andhra University

Working with Missing Data
• Missing data can occur when no information is provided for one or more
items or for a whole unit.
• Missing data is a very common problem in real-life scenarios.
• Missing data is also referred to as NA (Not Available) values in pandas.
• Many datasets simply arrive with missing data, either because it exists and
was not collected or because it never existed.
• For example, different users being surveyed may choose not to share their
income, and some may choose not to share their address; in this way, many
values in a dataset end up missing.

Working with Missing Data
Pandas supports two values to represent missing data:
❑ None: None is a Python singleton object that is commonly used in Python
programs to represent missing data.
❑ NaN: NaN (Not a Number) is a special floating-point value recognized by
all systems that use the IEEE standard for floating-point representation.
• There are several useful functions for detecting, removing, and replacing
null values in a Pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
Working with Missing Data
Checking for missing values using isnull() and notnull():
• In order to check for missing values in a Pandas DataFrame, we use the
isnull() and notnull() functions. Both help in checking whether a value is
NaN or not.
• isnull() returns a DataFrame of Boolean values which are True for NaN
values.
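A minimal sketch with made-up data:

import pandas as pd
import numpy as np

# hypothetical dataset with some missing entries
df = pd.DataFrame({"Score": [90, np.nan, 75], "Age": [25, 30, np.nan]})

print(df.isnull())    # True where a value is NaN
print(df.notnull())   # True where a value is present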

Working with Missing Data
Filling missing values using fillna(), replace() and interpolate()

• In order to fill null values in a dataset, we use the fillna(), replace() and
interpolate() functions. These functions replace NaN values with some
value of their own.
• All of these functions help in filling null values in a DataFrame.
• The interpolate() function is used to fill NA values in the DataFrame, but
it uses various interpolation techniques to estimate the missing values
rather than hard-coding a replacement value.
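A short sketch on similar made-up data (the replacement values are arbitrary):

import pandas as pd
import numpy as np

df = pd.DataFrame({"Score": [90, np.nan, 75, np.nan]})

print(df.fillna(0))                              # replace NaN with a constant
print(df.replace(to_replace=np.nan, value=-99))  # replace NaN with a marker value
print(df.interpolate(method="linear"))           # estimate NaN from neighbouring values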

Working with Missing Data

dropna(): In order to drop null values from a DataFrame, we use the
dropna() function. This function drops rows or columns of the dataset
that contain null values.
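A brief illustration with made-up data:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

print(df.dropna())           # drop rows containing any NaN
print(df.dropna(axis=1))     # drop columns containing any NaN
print(df.dropna(how="all"))  # drop only rows where every value is NaN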
Hierarchical Indexing
• Hierarchical data is often used to represent multiple levels of nested
groups or categories. For example, a company may have a hierarchy of
employees, departments, and locations.
• One of the challenges of working with hierarchical data is how to represent
it in a tabular format that makes it easy to manipulate and analyze.
• To represent hierarchical data in Python, we can use Pandas' built-in
methods such as set_index() and groupby().
Hierarchical Indexing
• In the following example, we demonstrate the use of the groupby() method
in Pandas to group data based on a specific column. Here, we group the
data based on the unique values in the 'Category' column, which forms a
separate group for each unique category.
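A small sketch; the 'Category', 'Item' and 'Sales' columns are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "Category": ["Fruit", "Fruit", "Vegetable", "Vegetable"],
    "Item": ["Apple", "Banana", "Carrot", "Potato"],
    "Sales": [100, 150, 80, 120],
})

# hierarchical (multi-level) index built from two columns
indexed = df.set_index(["Category", "Item"])
print(indexed)

# one group per unique value in the Category column
print(df.groupby("Category")["Sales"].sum())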
Panel Data
• The Panel in Pandas is used for working with three-dimensional data. It has three
main axes: items (axis 0), which corresponds to the data; major_axis (axis 1) for
rows; and minor_axis (axis 2) for columns. A panel can be created by using the
pandas.Panel() function.

• The panel in pandas is a three-dimensional container of data. To create a panel, we
can use ndarrays (multidimensional arrays) or a dictionary of DataFrames (one of
the Pandas 2-D data structures that contain data in the tabular form of rows and
columns). We can also extract data from panels using different methods.

https://www.scaler.com/topics/pandas/panel-in-pandas/
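Note that Panel was deprecated in pandas 0.20 and removed in pandas 1.0; the
recommended replacements are a MultiIndex DataFrame or the xarray library. A sketch
of the modern MultiIndex equivalent, with invented item names:

import pandas as pd

frames = {
    "item1": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "item2": pd.DataFrame({"a": [5, 6], "b": [7, 8]}),
}

# concatenating a dict of DataFrames yields items x rows x columns
# as a DataFrame with a two-level (hierarchical) row index
panel_like = pd.concat(frames, names=["item"])
print(panel_like)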
Data munging

Introduction to Data Munging

• Data munging, also known as data wrangling, is the process of converting
raw data into a more usable format.

• Data munging includes all the stages prior to analysis, such as data
structuring, cleaning, enrichment, validation, and data transformation.

• The munging process typically begins with a large volume of raw data. Data
scientists mung the data into shape by removing any errors or
inconsistencies.

• They then organize the data according to the destination schema so that
it is ready to use at the endpoint.

• The process also involves data transformation, such as normalizing datasets
to create one-to-many mappings.

• Munging is generally a permanent data transformation process.


Data munging
Why Use Data Munging?
• Most organizations have multiple, disparate sources of incoming data. These sources
will all have different standards for validating data and catching errors. Some may
simply output the data “as-is.”

• Data consumers need clean, organized, high-quality data. These consumers
can include:

• People: Data scientists and analytics teams require a steady stream of data. To
provide them with this, the business needs to implement a munging process.

• Processes: Automated processes might require data from other systems. Munging
helps to remove any data inconsistencies, allowing these processes to run smoothly in
the background.

• Repositories: Organizations often store vast quantities of information in a data
lake or data warehouse. Munging can help standardize data, which makes it
easier to store in a data warehouse.

Data munging

How to Do Data Munging

The modern data munging process involves six main steps:

1. Discover: First, the data scientist performs a degree of data exploration. This
is a first glance at the data to establish the most important patterns.

2. Structure: Raw data might not have an appropriate structure for the
intended usage. The data scientist will organize and normalize the data so
that it is more manageable.

3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may also
be values that require conversion, such as dates and currencies. Part of the
cleaning operation is to ensure consistency across all values; for example,
the cleaning process will standardize the format of every address.

Data munging

4. Enrich: Data enrichment is the process of filling in missing details by
referring to other data sources. For example, the raw data might
contain partial customer addresses that can be completed from
another source.

5. Validate: Finally, it is time to ensure that all data values are logically
consistent. This means checking things like whether all phone
numbers have the expected number of digits, that there are no numbers
in name fields, and that all dates are valid calendar dates.

6. Publish: When the data munging process is complete, the data
science team pushes it towards its final destination. Often this is a
data repository, where it will integrate with data from other sources.
This makes the munged data permanently available to all
consumers.

Data munging

Issues with Data Munging

Data munging processes sometimes present issues such as:

• Resource overheads: When data scientists oversee the munging process, it
can take up a substantial amount of their time.

• Data loss: Data munging is usually a one-way process. Data scientists
permanently transform the incoming data, and there may not be an extant
copy of the original data.

• Flexibility: Munging often has one objective in mind, such as preparing data
for analytics. This means that the data may not be in an appropriate format for
other uses, such as warehousing.

• Process errors: If the munging process is manual or semi-automatic, there is a
chance for errors to creep in. An automated process, by contrast, gives business
experts an opportunity to get involved in the data mapping process.
Data Cleaning With Python

What Is Data Cleaning

• When working with multiple data sources, there are many chances
for data to be incorrect, duplicated, or mislabeled.

• If the data is wrong, outcomes and algorithms are unreliable, even
though they may look correct.

• Data cleaning is the process of changing or eliminating garbage,
incorrect, duplicate, corrupted, or incomplete data in a dataset.

• There is no absolute way to describe the precise steps in the data
cleaning process, because the process varies from dataset to dataset.

• Data cleaning, data cleansing, and data scrubbing are different names
for the same general data preparation process.
Data Cleaning With Python

Importing Libraries
Let's get Pandas and NumPy up and running in your Python script.

Reading the Dataset and Locating Missing Data

This file contains only four rows, which will allow us to demonstrate the
process through to a cleaned data set.

The lines that contain values are all comma-separated, but we have
missing (NA) and probably unclean (5.3*) values.

The argument to read_csv is the path of the dataset we want to examine;
the pd. prefix tells us we are using the Pandas library to read it.
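A hedged sketch of this step; the file name and contents below are reconstructed from
the slide's description, not the original data set:

import pandas as pd
import numpy as np

# hypothetical file contents, four comma-separated rows with an NA
# and an unclean value:
#   2024-01-01,12.0
#   2024-01-02,NA
#   2024-01-03,5.3*
#   2024-01-04,10.1
df = pd.read_csv("data.csv")
print(df)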
Data Cleaning With Python

Pandas used the first row as the header, but this is not what we want.
Instead of the default numeric column names, we would like to supply our own:
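A sketch, reusing the hypothetical file from above; the column names are illustrative:

import pandas as pd

# header=None stops Pandas treating the first row as a header;
# names supplies our own column labels
df = pd.read_csv("data.csv", header=None, names=["date", "value"])
print(df)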

Data Cleaning With Python

If we know in advance the undesirable characters in our data set, we can
augment the read_csv method with a custom converter function.

If we wanted to keep only the complete entries, we could drop any row
that contains undefined values:
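A sketch under the assumption that the unclean values carry a trailing '*', as in the
slide's example; the clean_value helper is invented for illustration:

import pandas as pd

def clean_value(x):
    # strip a trailing '*' and convert to float; NA/empty fields become NaN
    x = x.rstrip("*")
    return float(x) if x not in ("", "NA") else float("nan")

df = pd.read_csv("data.csv", header=None, names=["date", "value"],
                 converters={"value": clean_value})

complete = df.dropna()   # keep only the complete rows
print(complete)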

Data Cleaning With Python

Check for Duplicates


• Duplicates, like missing data, cause problems and clog up analytics
software. Let’s locate and eliminate them.

• To locate duplicates we start out with:
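For example, with made-up data:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [90, 80, 90]})

print(df.duplicated())     # True for rows that repeat an earlier row
df = df.drop_duplicates()  # keep only the first occurrence
print(df)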

Detect Outliers
• Outliers are numerical values that lie significantly outside of the
statistical norm. Put simply, they are data points so far out of range
that they are likely misreads.
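One common approach (an assumption here; the slide does not prescribe a method) is
the 1.5 x IQR rule, shown with made-up data:

import pandas as pd

df = pd.DataFrame({"value": [9.8, 10.0, 11.2, 250.0]})

# flag values beyond 1.5 * IQR of the quartiles
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(outliers)   # the 250.0 misread is flagged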

Filtering

Filtering means limiting rows and/or columns. Filtering is clearly
central to any data analysis.

Boolean vectors

Filtering in Pandas relies heavily on the concept of Boolean vectors.

The expression (==) tests whether each value of the Gender column is
equal to the string “Female”. The result of the expression is a vector of
trues and falses corresponding to whether each of the 209 values of
Gender is equal to “Female”.
Filtering

Once we have the vector of 209 values of true or false (the Boolean
vector), we can apply it to the original data frame.

If the first value in the Boolean vector is true, the first row of the data
frame is returned; if the first value is false, the row is skipped.
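A minimal sketch; the three-row frame below stands in for the slide's 209-row
employee file:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Salary": [52000, 48000, 61000]})

is_female = df["Gender"] == "Female"   # a Boolean vector of trues and falses
print(df[is_female])                   # only the rows where the vector is True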

Filtering

We can assign the results to a separate data frame that contains only the
140 female employees.

The Python type() function is used to make sure the result is a Pandas
data frame.

Python has some basic built-in functions that can be applied to the
core data types, such as integers, floating-point numbers, and so on.
For example, if we want to take the result of mean() and round it to
two decimals, we can wrap the whole expression inside
the round() function.
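A sketch on the same made-up data:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Salary": [52000, 48000, 61000]})

females = df[df["Gender"] == "Female"]
print(type(females))                        # <class 'pandas.core.frame.DataFrame'>
print(round(females["Salary"].mean(), 2))   # mean salary rounded to two decimals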
Filtering

Complex filtering criteria

Boolean vectors can be created by combining conditions with & (and) and
| (or). The only trick is that each condition must be in parentheses.

The vector can then be applied to the whole data set to filter the data
frame to female employees with job grade 1. Rather than listing the
results, we call the shape property to confirm that only 48 employees are
included in the resulting vector.
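A sketch with invented data; the Grade column stands in for the slide's job-grade field:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"],
                   "Grade": [1, 1, 2]})

# each condition in parentheses, combined with & (and)
vector = (df["Gender"] == "Female") & (df["Grade"] == 1)
print(df[vector].shape)   # (rows, columns) of the filtered frame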
Filtering

Filtering by a list
It is not uncommon, when we have categorical data, to need to filter or
recode on a specific list of values. To reuse the previous example, assume
we want to create a list of managerial employees. The easiest way to do
this is to use a greater-than condition. But an alternative approach (and
the only approach that works with categorical data) is to create a list and
use the isin() method to check membership in the list. This gives the
same result.
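A sketch with invented data and grade thresholds:

import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Grade": [1, 2, 3]})

# greater-than condition vs. explicit list membership
print(df[df["Grade"] > 1])
print(df[df["Grade"].isin([2, 3])])   # same result, and works for categorical data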
Merging data

We often have multiple data sources; in order to make statements about
the content, we would rather combine them.
The Pandas DataFrame merge() function is used to merge two DataFrame
objects with a database-style join operation. The joining is performed on
columns or indexes. If the joining is done on columns, indexes are
ignored.
➢ Concatenate DataFrames along rows and columns.
➢ Merge DataFrames on specific keys with different join logics, such as
left join, inner join, etc.
➢ Join DataFrames by index.
➢ Time-series-friendly merging is provided in pandas.

Merging Data

In a case where two data frames have a similar shape, it might be useful to
just append one after the other. Maybe A and B are products, and one data
frame contains the number of items sold per product in a store.

Sometimes, we won't care about the indices of the originating data frames:
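A sketch with invented sales numbers (note that in modern pandas, pd.concat replaces
the removed DataFrame.append):

import pandas as pd

store1 = pd.DataFrame({"product": ["A", "B"], "sold": [10, 20]})
store2 = pd.DataFrame({"product": ["A", "B"], "sold": [7, 15]})

print(pd.concat([store1, store2]))                     # keeps original indices 0, 1, 0, 1
print(pd.concat([store1, store2], ignore_index=True))  # renumbers the rows 0..3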
Merging data

Combining objects is offered by the pd.concat function, which takes an
arbitrary number of series or data frames as input (older pandas versions
also accepted panels).

The default concat operation appends both frames along the rows, or
index, which corresponds to axis 0. To concatenate along the columns,
we can pass in the axis keyword argument:
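For example:

import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})

print(pd.concat([a, b]))          # default: append along the rows (axis 0)
print(pd.concat([a, b], axis=1))  # concatenate along the columns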
Merging data

A left, right and full join can be specified by the how parameter:

Merging Data

The merge method can be specified with the how parameter. The following
list shows the methods in comparison with SQL:

left: use only keys from the left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from the right frame, similar to a SQL right outer join;
preserve key order.
outer: use the union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use the intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
cross: create the cartesian product of both frames; preserve the order of the
left keys.
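A minimal sketch of the how parameter:

import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

print(pd.merge(left, right, on="key", how="inner"))  # keys b and c only
print(pd.merge(left, right, on="key", how="outer"))  # union of keys a, b, c, d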
Reshaping data

Reshaping data refers to the process of converting a DataFrame from one
format to another for better data visualization and analysis.

➢ Use the reshape() method to change the shape of the underlying values
to the desired shape.
➢ Use the stack() method to pivot from a wide format to a long format, if
needed.
➢ Use the melt() method to unpivot from a wide format to a long format, if
needed.
➢ Use the unstack() method to pivot from a long format to a wide format,
if needed.
➢ Use the pivot() method to pivot from a long format to a wide format, if
needed.
➢ Use the T attribute to transpose the data, if needed.
Reshaping data

Using the reshape method

The reshape method can be used to change the shape of a Series' values. This method
requires the new shape to be compatible with the original shape.

Here, the shape argument specifies the new dimensions of the array, while the optional
order argument specifies the order in which the elements of the array are arranged.

In the following example, a Series is created which contains the values 1 to 9; then
values.reshape(3, 3) is used to reshape the Series into a matrix of size 3x3.
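A sketch of this example:

import pandas as pd

s = pd.Series(range(1, 10))       # values 1 to 9
matrix = s.values.reshape(3, 3)   # the underlying NumPy array, reshaped to 3x3
print(matrix)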
Reshaping data
Using the stack() and unstack() methods
In Pandas, we can also use stack() and unstack() to reshape data.
stack() is used to pivot a level of the column labels, transforming them into the
innermost row index level.
unstack() is used to pivot a level of the row index, transforming it into the
outermost column level.
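For example:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

stacked = df.stack()       # column labels move into the innermost row index level
print(stacked)
print(stacked.unstack())   # pivot back to the original wide layout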
Reshaping data

Using the melt function

The melt() function in Pandas transforms a DataFrame from a wide format to a long
format.

In the following example, we use the melt() function to transform the DataFrame df
from a wide format to a long format.
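A sketch with invented student data:

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"],
                   "math": [90, 80],
                   "physics": [85, 70]})

# wide -> long: one row per (name, subject) pair
long_df = pd.melt(df, id_vars="name", var_name="subject", value_name="score")
print(long_df)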
Reshaping data

Using the pivot() method

The pivot() function reshapes data based on column values. It takes simple
column-wise data as input, and groups the entries into a two-dimensional
table.

Reshaping data

We have passed the parameters index, columns and values to the pivot
function. Here,
index specifies the column to be used as the index for the pivoted
dataframe,
columns specifies the column whose unique values will become the new
column headers, and
values specifies the column containing the values to be placed in the new
columns.
As the pivoted output shows, the dataframe has been pivoted, with the
unique values from the category column (A and B) becoming separate
columns, and the corresponding values from the value column placed in
the respective cells.
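A sketch matching this description, with invented dates:

import pandas as pd

df = pd.DataFrame({"date": ["d1", "d1", "d2", "d2"],
                   "category": ["A", "B", "A", "B"],
                   "value": [1, 2, 3, 4]})

# the unique categories A and B become separate columns
print(df.pivot(index="date", columns="category", values="value"))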
Reshaping data

Using the transpose attribute

The transpose attribute (T) can be used to switch the rows and columns of a DataFrame.
This is useful when we want to visualize the data in a different way.
Here, T is an attribute and not a method, so you don't need to use parentheses when
using it. Also, because it is an attribute and not a method, it cannot take any arguments.
The T attribute returns a new DataFrame with the rows and columns interchanged.
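For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.T)   # rows and columns interchanged; T is an attribute, no parentheses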

Data aggregation

Pandas comes with a lot of aggregation functions built in. Aggregation is
done using the pandas and numpy libraries, and the data must be available
as, or converted to, a DataFrame before the aggregation functions can be
applied.
We start with some artificial data again, containing measurements of the
number of sunshine hours per city and date.

To view a summary per city, we use the describe function on the
grouped data set:
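A sketch with invented sunshine data:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "date": ["d1", "d2", "d1", "d2"],
                   "hours": [8, 5, 4, 6]})

print(df.groupby("city").describe())   # count, mean, std, quartiles per city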

Data aggregation

On certain data sets, it can be useful to group by more than one
attribute. We can get an overview of the sunny hours per country and
date by passing in two column names:
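A sketch with invented country-level data:

import pandas as pd

df = pd.DataFrame({"country": ["Germany", "Germany", "Norway", "Norway"],
                   "date": ["d1", "d2", "d1", "d2"],
                   "hours": [8, 5, 4, 6]})

# grouping by two columns yields one group per (country, date) pair
print(df.groupby(["country", "date"])["hours"].sum())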

Data aggregation

We can define any function to be applied on the groups with the agg
method. Here we define a custom function which takes a Series object as
input and computes the difference between the smallest and the largest
element:
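A sketch, with value_range as an invented name for the custom function:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "hours": [8, 5, 4, 6]})

# spread between the largest and smallest value in each group
def value_range(series):
    return series.max() - series.min()

print(df.groupby("city")["hours"].agg(value_range))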

Data aggregation

The main task of the DataFrame.aggregate() function is to apply some
aggregation to one or more columns. The most frequently used
aggregations are:
sum: returns the sum of the values for the requested axis.
min: returns the minimum of the values for the requested axis.
max: returns the maximum of the values for the requested axis.
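For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.aggregate(["sum", "min", "max"]))   # one result row per aggregation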

Grouping data

The pandas groupby() function is used to split data into groups based on certain
criteria, with the syntax dataframe.groupby(['Criteria']). It is a powerful tool for
grouping values in datasets.
To group the data by a specific column, use the groupby() function and pass
the name of the column that you want to group on.
We use a simple example here. Imagine some fictional weather data about the
number of sunny hours per day and city:
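A sketch of such data, reconstructed for illustration:

import pandas as pd

# fictional weather data
df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo", "Oslo"],
                   "hours": [8, 5, 4, 6]})

by_city = df.groupby("city")   # a GroupBy object
print(by_city.sum())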
Grouping data

The groups attribute returns a dictionary containing the unique groups
and the corresponding axis labels:

The result of a groupby is a GroupBy object, not a DataFrame, but we can
use the usual indexing notation to refer to columns:
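A self-contained sketch:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Oslo"],
                   "hours": [8, 5, 4]})

grouped = df.groupby("city")
print(grouped.groups)           # dict: group label -> row labels
print(grouped["hours"].mean())  # usual indexing notation on the GroupBy object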

Grouping data
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)

Parameters:
➢ by: mapping, function, str, or iterable
➢ axis: int, default 0
➢ level: if the axis is a MultiIndex (hierarchical), group by a particular level or levels
➢ as_index: for aggregated output, return an object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped
output
➢ sort: sort group keys. Get better performance by turning this off. Note this does not
influence the order of observations within each group; groupby preserves the order of
rows within each group
➢ group_keys: when calling apply, add group keys to the index to identify pieces
➢ squeeze: reduce the dimensionality of the return type if possible, otherwise return a
consistent type
Returns: GroupBy object
