Introduction to Python Programming
Module-5
BTECH- II SEM-AIML
JAIN UNIVERSITY
Dr. Senthil Kumar, Associate Professor, Department of CSE-AIML
MODULE 5: Data cleaning, Data wrangling, Plotting and Visualization
[12 hours]
• Data cleaning: Handling Missing Data, Filtering Out Missing Data, Filling In Missing Data, Data Transformation, Removing Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Renaming Axis Indexes
• Discretization and Binning, Detecting and Filtering Outliers, Permutation and Random Sampling, Computing Indicator/Dummy Variables
• Data wrangling: Hierarchical Indexing, Reordering and Sorting Levels, Summary Statistics by Level, Indexing with a DataFrame's columns
• Plotting and visualization: Figures and Subplots, Colors, Markers, and Line Styles, Ticks, Labels, and Legends, Plotting with pandas and seaborn, Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots
Data cleaning
• Data cleaning is a major task of data preprocessing.
• Data cleaning means fixing bad data in the data set: empty cells, data in the wrong format, wrong data, and duplicates.
• Data may also be replaced by alternative, smaller representations using:
• parametric models (e.g., regression or log-linear models)
• nonparametric models (e.g., histograms, clusters, sampling, or data aggregation)
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• Missing data may need to be inferred
How to Handle Missing Data?
• Methods of handling Missing Values
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
Handling Missing Data, Filtering Out Missing Data
• Missing values are represented as:
• None: A Python object commonly used to represent missing values in object-
type arrays.
• NaN: A special floating-point value from NumPy, which is recognized by all
systems that use IEEE floating-point standards.
• Pandas gives us two convenient functions: isnull() and notnull().
Checking for Missing Values Using isnull()
• isnull() returns a DataFrame of Boolean values, where True represents missing data (NaN).
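As a quick illustration, a minimal sketch on a made-up DataFrame (dropna() is pandas' standard way to filter out missing rows):

import numpy as np
import pandas as pd

# Both None and NaN count as missing in pandas
df = pd.DataFrame({"name": ["Asha", None, "Ravi"],
                   "score": [85.0, np.nan, 92.0]})

print(df.isnull())           # True where a value is missing
print(df.notnull())          # True where a value is present
print(df.dropna())           # filter out rows with any missing value
print(df.dropna(how="all"))  # drop only rows where every value is missing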
Filling Missing Values in Pandas Using fillna(), replace()
and interpolate()
• fillna() replaces missing values (NaN) with a given value; for instance, you can replace missing values with 0.
• The 'pad' (forward-fill) method fills a missing value with the previous value; 'bfill' fills it with the next value.
• replace() can also be used to replace NaN values with a specific value.
• interpolate() fills missing values using interpolation techniques, such as the linear method.
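A minimal sketch of each approach on a made-up Series (ffill()/bfill() are the current spellings of the 'pad'/'bfill' fill methods):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

print(s.fillna(0))            # replace every NaN with a constant
print(s.ffill())              # 'pad' / forward fill: use the previous value
print(s.bfill())              # backward fill: use the next value
print(s.replace(np.nan, -1))  # replace() can target NaN directly
print(s.interpolate(method="linear"))  # estimate NaNs between known points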
Data Transformation, Removing Duplicates
• Data transformation is the process of converting raw data into a more
suitable format or structure for analysis, to improve its quality and
make it compatible with the requirements of a particular task or
system.
• Cleaning, encoding, and structuring the data makes it compatible with analytical tools and algorithms. Common operations (sketched in code after this list) include:
• Adding/Dropping Rows or Columns
• Applying Functions
• Merging and Joining
• Grouping and Aggregating
• One-Hot Encoding: Convert categorical variables into binary columns.
• Pivoting: Reshape data using pivot_table for different views.
• Melting: Convert columns into rows.
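A brief, hedged sketch of a few of these operations (the DataFrame and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B"],
                   "year": [2023, 2024, 2023, 2024],
                   "sales": [100, 120, 90, 95]})

df["sales_k"] = df["sales"].apply(lambda x: x / 1000)    # applying a function
totals = df.groupby("city")["sales"].sum()               # grouping and aggregating
wide = df.pivot_table(index="city", columns="year", values="sales")  # pivoting
long = wide.reset_index().melt(id_vars="city", value_name="sales")   # melting
dummies = pd.get_dummies(df["city"], prefix="city")      # one-hot encoding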
Data Transformation, Removing Duplicates
• To remove duplicate rows in a Pandas DataFrame, the drop_duplicates()
method is used.
• This method identifies and removes duplicate rows based on specified
columns or all columns.
• Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
• subset: column label or list of column labels; default is None. If columns are passed, only those columns are considered when identifying duplicates. (Optional)
• keep: controls how duplicate values are treated. It has three possible values, and the default is 'first'.
• If 'first', the first occurrence is treated as unique and the rest of the identical rows as duplicates.
• If 'last', the last occurrence is treated as unique and the rest of the identical rows as duplicates.
• If False, all identical rows are treated as duplicates.
• inplace: Boolean; if True, the duplicate rows are removed from the DataFrame in place.
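For example, on a small made-up frame:

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "A", "A"],
                   "marks": [85, 90, 85, 70]})

print(df.drop_duplicates())                 # keep='first' by default
print(df.drop_duplicates(subset=["name"]))  # consider only 'name' for duplicates
print(df.drop_duplicates(subset=["name"], keep="last"))
print(df.drop_duplicates(keep=False))       # drop every row that has a duplicate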
Transforming Data Using a Function or Mapping
• The transform() function is used with GroupBy objects or directly on a
DataFrame.
• It applies a function to each group or column while preserving the original
shape of the DataFrame.
• This makes it useful for calculations that involve aggregations or window
functions.
• The replace() method replaces the specified value with another specified
value.
• The replace() method searches the entire DataFrame and replaces every case of the
specified value.
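A minimal sketch of both, on made-up department data:

import pandas as pd

df = pd.DataFrame({"dept": ["CSE", "CSE", "ECE"],
                   "marks": [80, 90, 70]})

# transform() preserves the original shape: one value per input row
df["dept_mean"] = df.groupby("dept")["marks"].transform("mean")
print(df)

# replace() swaps every occurrence of the given value across the frame
print(df.replace("ECE", "EEE"))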
Renaming Axis Indexes
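The original slide's example is not reproduced here; a minimal sketch of rename() for axis labels (plus Series.map() for value mapping), with made-up labels:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=["row1", "row2"], columns=["a", "b"])

# rename() relabels axis indexes via a dict or a function
print(df.rename(index={"row1": "first"}, columns={"a": "alpha"}))
print(df.rename(index=str.upper))  # apply a function to every index label

# Series.map() transforms values element-wise using a dict or function
s = pd.Series(["cat", "dog"])
print(s.map({"cat": 0, "dog": 1}))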
Discretization and Binning
• Discretization, also known as binning, transforms continuous
numerical data into discrete intervals or bins.
• This technique is useful for data preprocessing in machine learning to
handle outliers, reduce noise, and improve model performance.
Binning:
• Three approaches to smoothing by binning (a pandas sketch follows this list):
• smoothing by bin means
• smoothing by bin medians
• smoothing by bin boundaries
• Benefits of binning in Python:
• Noise reduction: binning can smooth out minor observation errors or fluctuations in the data.
• Data discretization: binning can transform continuous variables into categorical counterparts, which are easier to analyze.
• Improved model performance: binning can improve the accuracy of predictive models by introducing bins as categorical features.
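A minimal pandas sketch (the bin edges and labels are made up): pd.cut builds custom-width bins, pd.qcut builds equal-frequency bins, and a groupby over the bins gives smoothing by bin means:

import pandas as pd

ages = pd.Series([5, 17, 25, 34, 49, 63, 78])

bins = [0, 18, 40, 60, 100]
labels = ["child", "young", "middle", "senior"]
print(pd.cut(ages, bins=bins, labels=labels))  # custom-width bins

print(pd.qcut(ages, q=3))  # equal-frequency (quantile) bins

# Smoothing by bin means: replace each value with the mean of its bin
groups = pd.cut(ages, bins=bins)
print(ages.groupby(groups, observed=True).transform("mean"))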
Detecting and Filtering Outliers
• Handling outliers:
• An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects.
• They can be caused by measurement or execution errors.
• Outliers can be detected using visualization, implementing mathematical
formulas on the dataset, or using the statistical approach
Visualizing Outliers Using Box Plot
• It captures the summary of the data
effectively and efficiently with only a simple
box and whiskers.
• Boxplot summarizes sample data using 25th,
50th, and 75th percentiles.
• One can just get insights into the dataset by
just looking at its boxplot.
Visualizing Outliers Using a Scatter Plot
• Used with paired numerical data, e.g., when the dependent variable has multiple values for each value of the independent variable.
• Helps determine the relationship between the two variables and spot points that fall far from the overall pattern.
Visualizing Outliers Using the Z-score
• The Z-score is also called a standard score.
• This value/score helps to understand how far a data point is from the mean.
Permutation and Random Sampling
• Permutation
• Permutation refers to the arrangement of elements in a specific order.
• The itertools module in Python provides the permutations() function to generate all possible
permutations of a sequence.
• E.g., permutations('ABC') will generate ('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), and ('C', 'B', 'A').
• The numpy.random.permutation() function can be used to generate a random permutation of a
sequence or a range of numbers.
• Random Sampling
• random.sample() → works for simple lists.
• DataFrame.sample() (pandas) → works for structured datasets (CSV files loaded as DataFrames).
• Stratified sampling (scikit-learn) → ensures proportionate sampling across classes.
Example: Permutation
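A minimal sketch combining both approaches mentioned above:

from itertools import permutations
import numpy as np

# All orderings of a small sequence
print(list(permutations("ABC")))

# One random permutation of a range or of an array
print(np.random.permutation(5))                # e.g. [3 0 4 1 2]
print(np.random.permutation(["a", "b", "c"]))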
Example: Random Sampling
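A minimal sketch of both samplers (the list and DataFrame are made up):

import random
import pandas as pd

# random.sample(): k items from a plain list, without replacement
print(random.sample(range(100), k=5))

# DataFrame.sample(): random rows from a structured dataset
df = pd.DataFrame({"id": range(10), "value": range(0, 100, 10)})
print(df.sample(n=3))        # 3 random rows
print(df.sample(frac=0.5))   # 50% of the rows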
Computing Indicator/Dummy Variables
• Indicator or dummy variables, also known as binary variables, are commonly created to
represent categorical data numerically.
• A dataset may contain various types of values; sometimes it consists of categorical values. In order to use those categorical values efficiently in a program, we create dummy variables. A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value.
• They take on values of 0 or 1 to indicate the absence or presence of a specific category.
The pandas library offers a convenient function called get_dummies() to generate these
variables.
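For instance, on a made-up categorical column (dtype=int makes the 0/1 coding explicit):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One binary indicator column per category
print(pd.get_dummies(df["color"], dtype=int))

# drop_first=True avoids a redundant column for regression-style models
print(pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int))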
Data wrangling
• Definition: Data wrangling (or data munging) in Python refers to the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making and analysis.
• Data wrangling takes place before any kind of data analysis, modeling, or visualization, and typically follows these steps:
1. Loading Data
2. Exploring Data
3. Cleaning Data
4. Transforming Data
5. Filtering and Sorting
6. Combining Data
7. Saving Cleaned Data
• Popular tools for wrangling:
• pandas – for general data wrangling
• numpy – for numerical operations
• openpyxl, xlrd – for Excel files
• json – for JSON data
Data wrangling
• Data wrangling can be summarized in Python with the below
functionalities:
• Data exploration: In this process, the data is studied, analyzed, and understood by
visualizing representations of data.
• Dealing with missing values: large datasets often contain NaN missing values; these need to be taken care of by replacing them with the mean, the mode (the most frequent value) of the column, or simply by dropping the rows that contain NaN.
• Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
• Filtering data: sometimes datasets contain unwanted rows or columns that need to be removed or filtered out.
• Once wrangled, the data can be used for the required purpose, such as data analysis, machine learning, data visualization, or model training.
Data wrangling: indexing using pandas
1. Hierarchical Indexing
Hierarchical indexing (also called multi-level indexing) allows you to have multiple (two or more) index levels
on an axis. It is a powerful feature of pandas for working with higher-dimensional data in a 2D structure.
2. Rearrange the order of the index levels or sort them:
Sorting with sort_index(): Sorts the index.
3. Summary Statistics by Level
Compute summary statistics (like sum, mean, etc.) across a level of a MultiIndex using the level argument.
4. Indexing with a DataFrame’s Columns
Treat one or more columns of a DataFrame as an index by setting them with set_index().
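A minimal sketch of all four steps (the years and quarters are made-up labels; groupby(level=...) is the current idiom for per-level statistics):

import pandas as pd

# 1. Hierarchical (multi-level) index
idx = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"])
s = pd.Series([10, 20, 30, 40], index=idx)

# 2. Reorder and sort levels
print(s.swaplevel("year", "quarter").sort_index())

# 3. Summary statistics by level
print(s.groupby(level="year").sum())

# 4. Use DataFrame columns as the index (and back again)
df = pd.DataFrame({"year": ["2023", "2023"], "q": ["Q1", "Q2"], "v": [1, 2]})
print(df.set_index(["year", "q"]))
print(df.set_index(["year", "q"]).reset_index())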
Plotting and Visualization
• Data visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools
provide an accessible way to see and understand trends, outliers, and patterns in
data.
• Data visualization translates complex data sets into visual formats that are
easier for the human brain to comprehend. This can include a variety of visual
tools such as:
• Charts: Bar charts, line charts, pie charts, etc.
• Graphs: Scatter plots, histograms, etc.
• Maps: Geographic maps, heat maps, etc.
• Dashboards: Interactive platforms that combine multiple visualizations.
• Tools: Matplotlib, Seaborn, Bokeh, Plotly, etc.
Plotting and Visualization: Types of Data
Plotting and Visualization: Types of Data Visualization Techniques
• Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar charts offer a clear visual representation
of values.
• Line Charts: Perfect for illustrating trends over time, line charts connect data points to reveal patterns and fluctuations.
• Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to understand proportions and
percentages.
• Scatter Plots: Showcase relationships between two variables, identifying patterns and outliers through scattered data
points.
• Histograms: Depict the distribution of a continuous variable, providing insights into the underlying data patterns.
• Heatmaps: Visualize complex data sets through color-coding, emphasizing variations and correlations in a matrix.
• Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in data distribution analysis.
• Area Charts: Similar to line charts but with the area under the line filled, these charts accentuate cumulative data
patterns.
• Bubble Charts: Enhance scatter plots by introducing a third dimension through varying bubble sizes, revealing additional
insights.
• Treemaps: Efficiently represent hierarchical data structures, breaking down categories into nested rectangles.
Plotting and visualization: Figures and Subplots
• The matplotlib.figure module provides the top-level container (Figure) for all plot elements and controls the default spacing of the subplots.
• matplotlib.figure.Figure.subplots() method
• Syntax: subplots(self, nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None)
• nrows, ncols: number of rows and columns of the subplot grid (default 1 each).
• sharex, sharey: control sharing of properties among x (sharex) or y (sharey) axes.
• squeeze: optional Boolean, default True; if True, extra dimensions are squeezed out of the returned array of Axes.
• num: the pyplot.figure keyword that sets the figure number or label (used with pyplot.subplots).
• subplot_kw: dict with keywords passed to the add_subplot call used to create each subplot.
• gridspec_kw: dict with keywords passed to the GridSpec constructor used to create the grid the subplots are placed on.
Plotting and visualization: Colors, Markers, and Line Styles, Ticks, Labels, and Legends
• Common options:
• Colors: 'red', 'blue', 'green', 'black', etc. or hex codes like '#FF5733'
• Markers:
• 'o' — circle
• 's' — square
• '^' — triangle
• '*' — star
• Line styles:
• '-' — solid
• '--' — dashed
• ':' — dotted
• '-.' — dash-dot
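A minimal sketch combining the options above in plt.plot() calls:

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

plt.plot(x, x, color="red", marker="o", linestyle="-", label="solid + circle")
plt.plot(x, x + 2, color="#FF5733", marker="s", linestyle="--", label="dashed + square")
plt.plot(x, x + 4, color="green", marker="^", linestyle=":", label="dotted + triangle")
plt.plot(x, x + 6, color="black", marker="*", linestyle="-.", label="dash-dot + star")
plt.legend()
plt.show()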
Plotting and visualization: Ticks, Labels, and Legends
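The original slide's code is not reproduced; a minimal sketch of the usual Axes methods:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")

ax.set_xticks([0, 2.5, 5, 7.5, 10])          # tick positions
ax.set_xticklabels(["zero", "2.5", "five", "7.5", "ten"], rotation=30)
ax.set_xlabel("x value")                     # axis labels
ax.set_ylabel("sin(x)")
ax.set_title("Ticks, labels, and a legend")
ax.legend(loc="best")                        # legend built from label= kwargs
plt.show()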
Plotting with pandas and seaborn: Line Plots
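A minimal sketch on made-up random data: pandas plots straight from a DataFrame, and seaborn accepts the same frame:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(100, 3).cumsum(axis=0),
                  columns=["A", "B", "C"])

df.plot()               # pandas: one line per column, legend added automatically
sns.lineplot(data=df)   # seaborn equivalent
plt.show()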
Plotting with pandas and seaborn: Bar Plots
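A minimal sketch with made-up category counts:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"dept": ["CSE", "ECE", "ME"], "students": [120, 90, 60]})

df.plot.bar(x="dept", y="students")           # pandas bar plot
sns.barplot(data=df, x="dept", y="students")  # seaborn equivalent
plt.show()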
Plotting with pandas and seaborn: Histograms and Density Plots, Scatter or Point Plots
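A minimal sketch of the remaining plot types on made-up random data (plot.density() requires scipy):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": np.random.randn(200), "y": np.random.randn(200)})

df["x"].plot.hist(bins=20)          # histogram
df["x"].plot.density()              # kernel density estimate (needs scipy)
df.plot.scatter(x="x", y="y")       # scatter plot
sns.regplot(data=df, x="x", y="y")  # seaborn scatter with a fitted line
plt.show()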
END