Introduction to Python Programming
Module-5
BTECH- II SEM-AIML
JAIN UNIVERSITY
Dr. Senthil Kumar, Associate Professor, Department of CSE-AIML
MODULE 5: Data cleaning, Data wrangling, Plotting and Visualization
[12 hours]
• Data cleaning: Handling Missing Data, Filtering Out Missing Data, Filling In Missing Data, Data Transformation, Removing Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Renaming Axis Indexes
• Discretization and Binning, Detecting and Filtering Outliers, Permutation and Random Sampling, Computing Indicator/Dummy Variables
• Data wrangling: Hierarchical Indexing, Reordering and Sorting Levels, Summary Statistics by Level, Indexing with a DataFrame's columns
• Plotting and visualization: Figures and Subplots, Colors, Markers, and Line Styles, Ticks, Labels, and Legends, Plotting with pandas and seaborn, Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots
Data cleaning
• Data cleaning is a major task of data preprocessing.
• Data cleaning means fixing bad data in the data set: empty cells, data in the wrong format, wrong data, and duplicates.
• Data may also be replaced by alternative, smaller representations using:
• parametric models (e.g., regression or log-linear models)
• nonparametric models (e.g., histograms, clusters, sampling, or data aggregation)
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• Missing data may need to be inferred
How to Handle Missing Data?
• Methods of handling Missing Values
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
Handling Missing Data, Filtering Out Missing Data
• Missing values are represented as:
• None: A Python object commonly used to represent missing values in object-
type arrays.
• NaN: A special floating-point value from NumPy, which is recognized by all
systems that use IEEE floating-point standards.
• Pandas gives us two convenient functions: isnull() and notnull().
Checking for Missing Values Using isnull()
• isnull() returns a DataFrame of Boolean values, where True represents missing data (NaN).
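As a quick illustration, a minimal sketch on a made-up DataFrame (dropna() is pandas' standard way to filter out missing rows):

import numpy as np
import pandas as pd

# Both None and NaN count as missing in pandas
df = pd.DataFrame({"name": ["Asha", None, "Ravi"],
                   "score": [85.0, np.nan, 92.0]})

print(df.isnull())           # True where a value is missing
print(df.notnull())          # True where a value is present
print(df.dropna())           # filter out rows with any missing value
print(df.dropna(how="all"))  # drop only rows where every value is missing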
Filling Missing Values in Pandas Using fillna(), replace()
and interpolate()
• fillna() replaces missing values (NaN) with a given value; for instance, you can replace missing values with 0.
• The 'pad' (forward-fill) method fills a missing value with the previous value; 'bfill' fills it with the next value.
• replace() can also be used to replace NaN values with a specific value.
• interpolate() fills missing values using interpolation techniques, such as the linear method.
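A minimal sketch of each approach on a made-up Series (ffill()/bfill() are the current spellings of the 'pad'/'bfill' fill methods):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

print(s.fillna(0))            # replace every NaN with a constant
print(s.ffill())              # 'pad' / forward fill: use the previous value
print(s.bfill())              # backward fill: use the next value
print(s.replace(np.nan, -1))  # replace() can target NaN directly
print(s.interpolate(method="linear"))  # estimate NaNs between known points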
Data Transformation, Removing Duplicates
• Data transformation is the process of converting raw data into a more
suitable format or structure for analysis, to improve its quality and
make it compatible with the requirements of a particular task or
system.
• Cleaning, encoding, and structuring the data makes it compatible with analytical tools and algorithms. Common operations (sketched in code after this list) include:
• Adding/Dropping Rows or Columns
• Applying Functions
• Merging and Joining
• Grouping and Aggregating
• One-Hot Encoding: Convert categorical variables into binary columns.
• Pivoting: Reshape data using pivot_table for different views.
• Melting: Convert columns into rows.
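A brief, hedged sketch of a few of these operations (the DataFrame and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B"],
                   "year": [2023, 2024, 2023, 2024],
                   "sales": [100, 120, 90, 95]})

df["sales_k"] = df["sales"].apply(lambda x: x / 1000)    # applying a function
totals = df.groupby("city")["sales"].sum()               # grouping and aggregating
wide = df.pivot_table(index="city", columns="year", values="sales")  # pivoting
long = wide.reset_index().melt(id_vars="city", value_name="sales")   # melting
dummies = pd.get_dummies(df["city"], prefix="city")      # one-hot encoding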
Data Transformation, Removing Duplicates
• To remove duplicate rows in a Pandas DataFrame, the drop_duplicates()
method is used.
• This method identifies and removes duplicate rows based on specified
columns or all columns.
• Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
• subset: column label or list of column labels; default is None. If columns are passed, only those columns are considered when identifying duplicates. (Optional)
• keep: controls how duplicate values are treated. It has three possible values, and the default is 'first'.
• If 'first', the first occurrence is treated as unique and the rest of the identical rows as duplicates.
• If 'last', the last occurrence is treated as unique and the rest of the identical rows as duplicates.
• If False, all identical rows are treated as duplicates.
• inplace: Boolean; if True, the duplicate rows are removed from the DataFrame in place.
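For example, on a small made-up frame:

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "A", "A"],
                   "marks": [85, 90, 85, 70]})

print(df.drop_duplicates())                 # keep='first' by default
print(df.drop_duplicates(subset=["name"]))  # consider only 'name' for duplicates
print(df.drop_duplicates(subset=["name"], keep="last"))
print(df.drop_duplicates(keep=False))       # drop every row that has a duplicate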
Transforming Data Using a Function or Mapping
• The transform() function is used with GroupBy objects or directly on a
DataFrame.
• It applies a function to each group or column while preserving the original
shape of the DataFrame.
• This makes it useful for calculations that involve aggregations or window
functions.
• The replace() method replaces the specified value with another specified
value.
• The replace() method searches the entire DataFrame and replaces every case of the
specified value.
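A minimal sketch of both, on made-up department data:

import pandas as pd

df = pd.DataFrame({"dept": ["CSE", "CSE", "ECE"],
                   "marks": [80, 90, 70]})

# transform() preserves the original shape: one value per input row
df["dept_mean"] = df.groupby("dept")["marks"].transform("mean")
print(df)

# replace() swaps every occurrence of the given value across the frame
print(df.replace("ECE", "EEE"))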
Renaming Axis Indexes
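The original slide's example is not reproduced here; a minimal sketch of rename() for axis labels (plus Series.map() for value mapping), with made-up labels:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=["row1", "row2"], columns=["a", "b"])

# rename() relabels axis indexes via a dict or a function
print(df.rename(index={"row1": "first"}, columns={"a": "alpha"}))
print(df.rename(index=str.upper))  # apply a function to every index label

# Series.map() transforms values element-wise using a dict or function
s = pd.Series(["cat", "dog"])
print(s.map({"cat": 0, "dog": 1}))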
Discretization and Binning
• Discretization, also known as binning, transforms continuous
numerical data into discrete intervals or bins.
• This technique is useful for data preprocessing in machine learning to
handle outliers, reduce noise, and improve model performance.
Binning:
• Three approaches to smoothing by binning (a pandas sketch follows this list):
• smoothing by bin means
• smoothing by bin medians
• smoothing by bin boundaries
• Benefits of binning in Python:
• Noise reduction: binning can smooth out minor observation errors or fluctuations in the data.
• Data discretization: binning can transform continuous variables into categorical counterparts, which are easier to analyze.
• Improved model performance: binning can improve the accuracy of predictive models by introducing bins as categorical features.
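A minimal pandas sketch (the bin edges and labels are made up): pd.cut builds custom-width bins, pd.qcut builds equal-frequency bins, and a groupby over the bins gives smoothing by bin means:

import pandas as pd

ages = pd.Series([5, 17, 25, 34, 49, 63, 78])

bins = [0, 18, 40, 60, 100]
labels = ["child", "young", "middle", "senior"]
print(pd.cut(ages, bins=bins, labels=labels))  # custom-width bins

print(pd.qcut(ages, q=3))  # equal-frequency (quantile) bins

# Smoothing by bin means: replace each value with the mean of its bin
groups = pd.cut(ages, bins=bins)
print(ages.groupby(groups, observed=True).transform("mean"))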
Detecting and Filtering Outliers
• Handling outliers:
• An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects.
• They can be caused by measurement or execution errors.
• Outliers can be detected using visualization, implementing mathematical
formulas on the dataset, or using the statistical approach
Visualizing Outliers Using Box Plot
• It captures the summary of the data
effectively and efficiently with only a simple
box and whiskers.
• Boxplot summarizes sample data using 25th,
50th, and 75th percentiles.
• One can just get insights into the dataset by
just looking at its boxplot.
Visualizing Outliers Using a Scatter Plot
• Used with paired numerical data, e.g., when the dependent variable has multiple values for each value of the independent variable.
• Helps determine the relationship between the two variables and spot points that fall far from the overall pattern.
Visualizing Outliers Using the Z-score
• The Z-score is also called a standard score.
• This value/score helps to understand how far a data point is from the mean.
Permutation and Random Sampling
• Permutation
• Permutation refers to the arrangement of elements in a specific order.
• The itertools module in Python provides the permutations() function to generate all possible
permutations of a sequence.
• E.g., permutations('ABC') will generate ('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), and ('C', 'B', 'A').
• The numpy.random.permutation() function can be used to generate a random permutation of a
sequence or a range of numbers.
• Random Sampling
• random.sample() → works for simple lists.
• DataFrame.sample() (pandas) → works for structured datasets (CSV files loaded as DataFrames).
• Stratified sampling (scikit-learn) → ensures proportionate sampling across classes.
Example: Permutation
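A minimal sketch combining both approaches mentioned above:

from itertools import permutations
import numpy as np

# All orderings of a small sequence
print(list(permutations("ABC")))

# One random permutation of a range or of an array
print(np.random.permutation(5))                # e.g. [3 0 4 1 2]
print(np.random.permutation(["a", "b", "c"]))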
Example: Random Sampling
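A minimal sketch of both samplers (the list and DataFrame are made up):

import random
import pandas as pd

# random.sample(): k items from a plain list, without replacement
print(random.sample(range(100), k=5))

# DataFrame.sample(): random rows from a structured dataset
df = pd.DataFrame({"id": range(10), "value": range(0, 100, 10)})
print(df.sample(n=3))        # 3 random rows
print(df.sample(frac=0.5))   # 50% of the rows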
Computing Indicator/Dummy Variables
• Indicator or dummy variables, also known as binary variables, are commonly created to
represent categorical data numerically.
• A dataset may contain various types of values; sometimes it consists of categorical values. In order to use those categorical values efficiently in a program, we create dummy variables. A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value.
• They take on values of 0 or 1 to indicate the absence or presence of a specific category.
The pandas library offers a convenient function called get_dummies() to generate these
variables.
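For instance, on a made-up categorical column (dtype=int makes the 0/1 coding explicit):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One binary indicator column per category
print(pd.get_dummies(df["color"], dtype=int))

# drop_first=True avoids a redundant column for regression-style models
print(pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int))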
Data wrangling
• Definition: Data wrangling (or data munging) in Python refers to the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making and analysis.
• Data wrangling takes place before any kind of data analysis, modeling, or visualization, and typically follows these steps:
1. Loading Data
2. Exploring Data
3. Cleaning Data
4. Transforming Data
5. Filtering and Sorting
6. Combining Data
7. Saving Cleaned Data
• Popular tools for wrangling:
• pandas – for general data wrangling
• numpy – for numerical operations
• openpyxl, xlrd – for Excel files
• json – for JSON data
Data wrangling
• Data wrangling can be summarized in Python with the below
functionalities:
• Data exploration: In this process, the data is studied, analyzed, and understood by
visualizing representations of data.
• Dealing with missing values: large datasets often contain NaN missing values; these need to be taken care of by replacing them with the mean, the mode (the most frequent value) of the column, or simply by dropping the rows that contain NaN.
• Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
• Filtering data: sometimes datasets contain unwanted rows or columns that need to be removed or filtered out.
• Once wrangled, the data can be used for the required purpose, such as data analysis, machine learning, data visualization, or model training.
Data wrangling: indexing using pandas
1. Hierarchical Indexing
Hierarchical indexing (also called multi-level indexing) allows you to have multiple (two or more) index levels
on an axis. It is a powerful feature of pandas for working with higher-dimensional data in a 2D structure.
2. Rearrange the order of the index levels or sort them:
Sorting with sort_index(): Sorts the index.
3. Summary Statistics by Level
Compute summary statistics (like sum, mean, etc.) across a level of a MultiIndex using the level argument.
4. Indexing with a DataFrame’s Columns
Treat one or more columns of a DataFrame as an index by setting them with set_index().
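A minimal sketch of all four steps (the years and quarters are made-up labels; groupby(level=...) is the current idiom for per-level statistics):

import pandas as pd

# 1. Hierarchical (multi-level) index
idx = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"])
s = pd.Series([10, 20, 30, 40], index=idx)

# 2. Reorder and sort levels
print(s.swaplevel("year", "quarter").sort_index())

# 3. Summary statistics by level
print(s.groupby(level="year").sum())

# 4. Use DataFrame columns as the index (and back again)
df = pd.DataFrame({"year": ["2023", "2023"], "q": ["Q1", "Q2"], "v": [1, 2]})
print(df.set_index(["year", "q"]))
print(df.set_index(["year", "q"]).reset_index())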
Plotting and Visualization
• Data visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools
provide an accessible way to see and understand trends, outliers, and patterns in
data.
• Data visualization translates complex data sets into visual formats that are
easier for the human brain to comprehend. This can include a variety of visual
tools such as:
• Charts: Bar charts, line charts, pie charts, etc.
• Graphs: Scatter plots, histograms, etc.
• Maps: Geographic maps, heat maps, etc.
• Dashboards: Interactive platforms that combine multiple visualizations.
• Tools: Matplotlib, Seaborn, Bokeh, Plotly, etc.
Plotting and Visualization: Types of Data
Plotting and Visualization: Types of Data Visualization Techniques
• Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar charts offer a clear visual representation
of values.
• Line Charts: Perfect for illustrating trends over time, line charts connect data points to reveal patterns and fluctuations.
• Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to understand proportions and
percentages.
• Scatter Plots: Showcase relationships between two variables, identifying patterns and outliers through scattered data
points.
• Histograms: Depict the distribution of a continuous variable, providing insights into the underlying data patterns.
• Heatmaps: Visualize complex data sets through color-coding, emphasizing variations and correlations in a matrix.
• Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in data distribution analysis.
• Area Charts: Similar to line charts but with the area under the line filled, these charts accentuate cumulative data
patterns.
• Bubble Charts: Enhance scatter plots by introducing a third dimension through varying bubble sizes, revealing additional
insights.
• Treemaps: Efficiently represent hierarchical data structures, breaking down categories into nested rectangles.
Plotting and visualization: Figures and Subplots
• The matplotlib.figure module provides the top-level container (Figure) for all plot elements and controls the default spacing of the subplots.
• matplotlib.figure.Figure.subplots() method
• Syntax: subplots(self, nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None)
• nrows, ncols: number of rows and columns of the subplot grid (default 1 each).
• sharex, sharey: control sharing of properties among x (sharex) or y (sharey) axes.
• squeeze: optional Boolean, default True; if True, extra dimensions are squeezed out of the returned array of Axes.
• num: the pyplot.figure keyword that sets the figure number or label (used with pyplot.subplots).
• subplot_kw: dict with keywords passed to the add_subplot call used to create each subplot.
• gridspec_kw: dict with keywords passed to the GridSpec constructor used to create the grid the subplots are placed on.
Plotting and visualization: Colors, Markers, and Line Styles, Ticks, Labels, and Legends
• Common options:
• Colors: 'red', 'blue', 'green', 'black', etc. or hex codes like '#FF5733'
• Markers:
• 'o' — circle
• 's' — square
• '^' — triangle
• '*' — star
• Line styles:
• '-' — solid
• '--' — dashed
• ':' — dotted
• '-.' — dash-dot
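A minimal sketch combining the options above in plt.plot() calls:

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

plt.plot(x, x, color="red", marker="o", linestyle="-", label="solid + circle")
plt.plot(x, x + 2, color="#FF5733", marker="s", linestyle="--", label="dashed + square")
plt.plot(x, x + 4, color="green", marker="^", linestyle=":", label="dotted + triangle")
plt.plot(x, x + 6, color="black", marker="*", linestyle="-.", label="dash-dot + star")
plt.legend()
plt.show()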
Plotting and visualization: Ticks, Labels, and Legends
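The original slide's code is not reproduced; a minimal sketch of the usual Axes methods:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")

ax.set_xticks([0, 2.5, 5, 7.5, 10])          # tick positions
ax.set_xticklabels(["zero", "2.5", "five", "7.5", "ten"], rotation=30)
ax.set_xlabel("x value")                     # axis labels
ax.set_ylabel("sin(x)")
ax.set_title("Ticks, labels, and a legend")
ax.legend(loc="best")                        # legend built from label= kwargs
plt.show()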
Plotting with pandas and seaborn: Line Plots
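A minimal sketch on made-up random data: pandas plots straight from a DataFrame, and seaborn accepts the same frame:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(100, 3).cumsum(axis=0),
                  columns=["A", "B", "C"])

df.plot()               # pandas: one line per column, legend added automatically
sns.lineplot(data=df)   # seaborn equivalent
plt.show()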
Plotting with pandas and seaborn: Bar Plots
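A minimal sketch with made-up category counts:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"dept": ["CSE", "ECE", "ME"], "students": [120, 90, 60]})

df.plot.bar(x="dept", y="students")           # pandas bar plot
sns.barplot(data=df, x="dept", y="students")  # seaborn equivalent
plt.show()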
Plotting with pandas and seaborn: Histograms and Density Plots, Scatter or Point Plots
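A minimal sketch of the remaining plot types on made-up random data (plot.density() requires scipy):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": np.random.randn(200), "y": np.random.randn(200)})

df["x"].plot.hist(bins=20)          # histogram
df["x"].plot.density()              # kernel density estimate (needs scipy)
df.plot.scatter(x="x", y="y")       # scatter plot
sns.regplot(data=df, x="x", y="y")  # seaborn scatter with a fitted line
plt.show()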
END