Unit II - Data Science
21CSS303T
Unit II
Unit-2: DATA WRANGLING, DATA CLEANING AND PREPARATION
10 hours
• Data Handling: Problems faced when handling large data - General
techniques for handling large volumes of data - General
programming tips for dealing with large data sets
• Data Wrangling: Clean, Transform, Merge, Reshape: Combining and
Merging Datasets, Merging on Index, Concatenate, Combining with
overlap, Reshaping, Pivoting
• Data Cleaning and Preparation: Handling Missing Data, Data
Transformation, String Manipulation, Summarizing, Binning,
Classing and Standardization, Outliers/Noise & Anomalies.
Problems faced when
handling large data
• A large volume of data poses new challenges, such as
overloaded memory and algorithms that never stop
running. It forces you to adapt and expand your repertoire
of techniques. But even when you can perform your
analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can
cause speed issues.
Problems faced when
handling large data
• A computer only has a limited amount of RAM.
• When you try to squeeze more data into this memory than
actually fits, the OS will start swapping out memory blocks
to disks, which is far less efficient than having it all in
memory.
• But only a few algorithms are designed to handle large data
sets; most of them load the whole data set into memory at
once, which causes the out-of-memory error.
General techniques for handling
large volumes of data
• Never-ending algorithms, out-of-memory errors, and speed
issues are the most common challenges you face when
working with large data.
General techniques for handling
large volumes of data
• No clear one-to-one mapping exists between the problems
and solutions because many solutions address both lack of
memory and computational performance.
• For instance, data set compression will help you solve
memory issues because the data set becomes smaller. But
this also affects computation speed with a shift from the
slow hard disk to the fast CPU.
• Unlike RAM (random access memory), the hard disk
will store everything even after the power goes down, but
writing to disk costs more time than changing
information in volatile RAM. When the information is
constantly changing, RAM is thus preferable over the
(more durable) hard disk.
General techniques for handling
large volumes of data
• An algorithm that’s well suited for handling large data
doesn’t need to load the entire data set into memory to
make predictions. Ideally, the algorithm also supports
parallelized calculations.
General techniques for handling
large volumes of data
ONLINE LEARNING ALGORITHMS
• Full batch learning (also called statistical learning) —
Feed the algorithm all the data at once
• Mini-batch learning—Feed the algorithm a spoonful (100,
1000, …, depending on what your hardware can handle) of
observations at a time.
• Online learning—Feed the algorithm one observation at a
time.
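• A minimal sketch of mini-batch / online learning, assuming scikit-learn is available; the chunk size and random data below are purely illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])                       # all class labels must be declared up front
for _ in range(10):                              # imagine each loop reads one chunk from disk
    X_chunk = np.random.rand(100, 5)             # illustrative mini-batch of 100 observations
    y_chunk = np.random.randint(0, 2, 100)
    model.partial_fit(X_chunk, y_chunk, classes=classes)  # the model is updated incrementally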
General techniques for handling
large volumes of data
• DIVIDING A LARGE MATRIX INTO MANY SMALL ONES
• bcolz - stores data arrays compactly and uses the hard drive when the
array no longer fits into main memory
• Dask - optimizes the flow of calculations and makes performing
calculations in parallel easier
• MAPREDUCE – easy to parallelize and distribute
• MapReduce algorithms are easy to understand with an
analogy: Imagine that you were asked to count all the votes
for the national elections. Your country has 25 parties, 1,500
voting offices, and 2 million people. You could choose to
• gather all the voting tickets from every office individually and
count them centrally
• ask the local offices to count the votes for the 25 parties and hand
over the results to you, and you could then aggregate them by
party.
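• A minimal sketch of the second (MapReduce-style) option in plain Python; the offices and tickets below are invented for illustration.
from collections import Counter
from functools import reduce

# Map step: each "office" counts its own tickets locally
office_1 = ['Party A', 'Party B', 'Party A']
office_2 = ['Party B', 'Party B', 'Party C']
local_counts = [Counter(votes) for votes in (office_1, office_2)]
# Reduce step: the central office only merges the partial counts
national_total = reduce(lambda a, b: a + b, local_counts)
print(national_total)  # Counter({'Party B': 3, 'Party A': 2, 'Party C': 1})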
Choosing the right data structure
• Algorithms can make or break your program, but the way
you store your data is of equal importance.
• Data structures have different storage requirements, but
also influence the performance of CRUD (create, read,
update, and delete) and other operations on the data set.
Sparse Matrix
• SPARSE DATA
• A sparse data set contains relatively little information
compared to its entries (observations).
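• A hedged sketch, assuming SciPy is available: a sparse matrix stores only the non-zero entries instead of the full table.
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 1000))
dense[0, 5] = 1.0                          # only a handful of cells carry information
sparse_matrix = sparse.csr_matrix(dense)   # keeps just the non-zero values and their positions
print(sparse_matrix.nnz)                   # 1 stored element instead of 1,000,000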
Trees
TREE STRUCTURES
• Trees are a class of data structure that allows you to
retrieve information much faster than scanning through
a table.
• A tree always has a root value and subtrees of children,
each with its children, and so on.
Trees
• Trees are also popular in databases.
• Databases prefer not to scan the table from the first line
until the last, but to use a device called an index to avoid
this.
• Indices are often based on data structures such as trees and
hash tables to find observations faster. The use of an index
speeds up the process of finding data enormously.
Hash Tables
• Hash tables are data structures that calculate a key for
every value in your data and put the keys in a bucket.
This way you can quickly retrieve the information by
looking in the right bucket when you encounter the data.
• Dictionaries in Python are a hash table implementation,
and they’re a close relative of key-value stores.
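• A quick illustration with a plain Python dictionary; the key-value pairs are invented.
# Python's dict hashes each key into a bucket, so lookups do not scan the whole collection
customer_age = {'alice': 34, 'bob': 28, 'carol': 41}
print(customer_age['bob'])        # direct retrieval via the hashed key -> 28
print('dave' in customer_age)     # membership test is also a hash lookup -> False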
Selecting the right tool
• With the right class of algorithms and data structures in
place, it’s time to choose the right tool for the job.
• The right tool can be a Python library or at least a tool
that’s controlled from Python.
Selecting the right tool
• Python has a number of libraries that can help you deal
with large data. They range from smarter data structures
over code optimizers to just-in-time compilers.
• Cython
• Numexpr
• Numba
• Bcolz
• Blaze
• Theano
• Dask
Selecting the right tool
• Python has a number of libraries that can help you deal with
large data. They range from smarter data structures over code
optimizers to just-in-time compilers.
• Cython - specify the data type while developing the program.
Once the compiler has this information, it runs programs much
faster.
• Numexpr - a numerical expression evaluator for NumPy that can be
many times faster than NumPy itself
• Numba - achieve greater speed by compiling your code right
before you execute it, also known as just-in-time compiling.
• Bcolz - overcome the out-of-memory problem that can occur
when using NumPy. It can store and work with arrays in an
optimal compressed form. It not only slims down your data need
but also uses Numexpr in the background to reduce the
calculations needed when performing calculations with bcolz
arrays.
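• A hedged sketch of the Numexpr approach mentioned above (assumes the numexpr package is installed; the arrays are illustrative).
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# numexpr evaluates the whole expression in one optimized pass, avoiding large NumPy temporaries
result = ne.evaluate('2*a + 3*b')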
Selecting the right tool
• Python has a number of libraries that can help you deal
with large data. They range from smarter data structures
over code optimizers to just-in-time compilers.
• Blaze - if you want to use the power of a database backend but
like the “Pythonic way” of working with data. Blaze will
translate your Python code into SQL but can handle many more
data stores than relational databases such as CSV, Spark, and
others.
• Theano - work directly with the graphical processing unit
(GPU) and do symbolical simplifications whenever possible,
and it comes with an excellent just-in-time compiler.
• Dask - Dask enables you to optimize your flow of calculations
and execute them efficiently. It also enables you to distribute
calculations.
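• A minimal Dask sketch (assumes the dask package is installed; 'large_file.csv' and its 'amount' column are hypothetical).
import dask.dataframe as dd

# Dask builds a lazy task graph over many small pandas partitions instead of loading everything
df = dd.read_csv('large_file.csv')      # hypothetical file that would not fit in RAM at once
total = df['amount'].sum()              # nothing is computed yet
print(total.compute())                  # the graph is executed, possibly in parallel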
General programming tips for
dealing with large data sets
■ Don’t reinvent the wheel. Use tools and libraries developed
by others.
■ Get the most out of your hardware. Your machine is never
used to its full potential; with simple adaptations you can make
it work harder.
■ Reduce the computing need. Slim down your memory and
processing needs as much as possible.
General programming tips for
dealing with large data sets
• Exploit the power of databases.
• The first reaction most data scientists have when working
with large data sets is to prepare their analytical base tables
inside a database. This method works well when the features
you want to prepare are fairly simple. When this preparation
involves advanced modeling, find out if it’s possible to employ
user-defined functions and procedures.
• Use optimized libraries.
• Creating libraries like Mahout, Weka, and other machine
learning algorithms requires time and knowledge.
• They are highly optimized and incorporate best practices and
state-of-the art technologies.
• Spend your time on getting things done, not on reinventing
and repeating other people’s efforts, unless it’s for the sake of
understanding how things work.
General programming tips for
dealing with large data sets
• Get the most out of your hardware
• Feed the CPU compressed data - One way to avoid CPU starvation is to
feed the CPU compressed data instead of the inflated (raw)
data.
• Make use of the GPU - Sometimes your CPU and not your
memory is the bottleneck. If your computations are
parallelizable, you can benefit from switching to the GPU. This
has a much higher throughput for computations than a CPU.
• Use multiple threads - It’s still possible to parallelize
computations on your CPU. You can achieve this with normal
Python threads.
General programming tips for
dealing with large data sets
Reduce your computing needs
• “Working smart + hard = achievement.”
• This also applies to the programs you write.
• The best way to avoid having large data problems is by
removing as much of the work as possible up front and
letting the computer work only on the part that can’t be
skipped.
General programming tips for
dealing with large data sets
Reduce your computing needs
• Profile your code and remediate slow pieces of code – Profiler to detect
slow parts inside your program
• Use compiled code whenever possible, certainly when loops are involved -
Use functions from packages that are optimized for numerical computations
• Otherwise, compile the code yourself - just-in-time compiler or implement
the slowest parts of your code in a lower-level language such as C or Fortran
and integrate
• Avoid pulling data into memory - read data in chunks and parse the
data on the fly. This won’t work with every algorithm but enables calculations
on extremely large data sets.
• Use generators to avoid intermediate data storage - Generators help you
return data per observation instead of in batches (see the sketch after this list).
• Use as little data as possible. If no large-scale algorithm is available and you
aren’t willing to implement such a technique yourself, then you can still train
your data on only a sample of the original data.
• Use your math skills to simplify calculations as much as possible.
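• A minimal sketch of the chunked-reading and generator tips above, using pandas; 'huge_file.csv' and its 'value' column are hypothetical.
import pandas as pd

def value_totals(path):
    """Generator that yields one partial sum per chunk instead of holding the file in memory."""
    for chunk in pd.read_csv(path, chunksize=100_000):   # parse 100,000 rows at a time
        yield chunk['value'].sum()

grand_total = sum(value_totals('huge_file.csv'))          # hypothetical large file
print(grand_total)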
Data Wrangling
• Data wrangling is also referred to as data munging.
• It is the process of transforming and mapping data from
one "raw" data form into another format to make it
more appropriate and valuable for various downstream
purposes such as analytics.
• The goal of data wrangling is to assure quality and
useful data
Data Wrangling
• Data Wrangling is one of those technical terms that are
more or less self-descriptive.
• The term "wrangling" refers to rounding up
information in a certain way.
Data Wrangling
• Discovery: Before starting the wrangling process, it is
critical to think about what may lie beneath your data.
• Organization: After you've gathered your raw data
within a particular dataset, you must structure your
data.
• Cleaning: When your data is organized, you can begin
cleaning your data. Data cleaning involves removing
outliers, formatting nulls, and eliminating duplicate
data.
Data Wrangling
• Data enrichment: This step requires that you take a step back
from your data to determine if you have enough data to proceed.
• Validation: After determining you gathered enough data, you
will need to apply validation rules to your data. Validation rules,
performed in repetitive sequences, confirm that data is
consistent throughout your dataset.
• Publishing: The final step of the data munging process is data
publishing. Data providing notes and documentation of your
wrangling process and creating access for other users and
applications.
Data Wrangling
• Fraud Detection
• Customer Behavior Analysis
Data Wrangling
Data munging is used for diverse use-cases as follows:
• Fraud Detection - Distinguish corporate fraud by
identifying unusual behavior by examining detailed
information like multi-party and multi-layered emails
or web chats.
• Customer Behaviour Analysis - A data-munging tool can
quickly help your business processes get precise
insights via customer behavior analysis.
Data Wrangling Tools
There are different tools for data wrangling that can be used for gathering,
importing, structuring, and cleaning data before it can be fed into analytics
and BI apps.
• Spreadsheets / Excel Power Query is the most basic manual data wrangling
tool.
• OpenRefine - An automated data cleaning tool that requires programming
skills
• Tabula - It is a tool suited for all data types
• Google DataPrep - It is a data service that explores, cleans, and prepares
data
• Data wrangler - It is a data cleaning and transforming tool
• Plotly (data wrangling with Python) is useful for maps and chart data.
Benefits of Data Wrangling
• Data consistency: The organizational aspect of data
wrangling offers a resulting dataset that is more
consistent.
• Improved insights: Data wrangling can provide
statistical insights about metadata by transforming the
metadata to be more consistent.
• Cost efficiency: Data wrangling allows for more efficient
data analysis and model-building processes, so businesses
will ultimately save money in the long run.
Clean, Transform and Merge
When working with datasets in pandas, you typically
follow three major steps:
1.Cleaning – Handling missing values, duplicates, and
inconsistent data
2.Transforming – Changing data formats, applying
functions, feature engineering
3.Merging – Combining multiple datasets for analysis
Data Cleaning
import pandas as pd
data = {'Name': ['Alice', 'Bob', None], 'Age': [25, None, 30]}  # sample data with gaps
df = pd.DataFrame(data)
df_cleaned = df.dropna()                    # drop rows containing missing values
df_filled = df.fillna({'Age': df['Age'].mean(), 'Name': 'Unknown'})  # or fill them in
print(df_cleaned)
print(df_filled)
Data Transformation
# Applying functions to columns (apply(), map())
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Old')
# Merging datasets (df1 and df2 are assumed to share an 'ID' column)
merged_df = pd.merge(df1, df2, on='ID', how='left')
print(merged_df)
Combining and Merging
datasets
• Data contained in pandas objects can be combined together
in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or
more keys. This will be familiar to users of SQL or other
relational databases, as it implements database join
operations.
• pandas.concat glues or stacks together objects along an axis.
Row-wise concatenation
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})
row_concat = pd.concat([df1, df2], ignore_index=True)  # stack the rows
print(row_concat)
Column-wise concatenation
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Score': [85, 90]})
col_concat = pd.concat([df1, df2], axis=1)  # place the columns side by side
print(col_concat)
Merging on Index
1. Using merge() with index keys
# Merging on index
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
print(merged_df)
Merging on Index
2. Using join() for Index-Based Merging
• The join() function is simpler and is specifically
designed for merging on indexes.
# Joining on index
joined_df = df1.join(df2, how='outer')
print(joined_df)
Merging on Index
Resetting Index Before Merging
• If the datasets have an ID column instead of an index, you may need to
set the index first.
📌 Key points:
• ignore_index=True resets the index after stacking.
• By default, it stacks row-wise (axis=0).
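• A hedged sketch of the index reset/set step noted above, reusing the small df1/df2 style from the earlier slides.
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Score': [85, 90]})
# Promote the shared ID column to the index, then join on that index
joined_df = df1.set_index('ID').join(df2.set_index('ID'), how='outer')
print(joined_df)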
Concatenate
Concatenating DataFrames (concat())
• Concatenation allows stacking datasets vertically (row-wise) or
horizontally (column-wise).
📌 Key points:
• axis=1 joins column-wise.
• If indices don’t match, missing values (NaN) appear.
Combining with Overlap
Combining with Overlap (combine_first())
• This method is used to fill missing values in one DataFrame with
values from another.
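• A small sketch of combine_first(); the data is illustrative.
import pandas as pd
import numpy as np

df_a = pd.DataFrame({'Price': [100, np.nan, 300]})
df_b = pd.DataFrame({'Price': [110, 200, 310]})
# Values from df_a win; gaps in df_a are filled from df_b
combined = df_a.combine_first(df_b)
print(combined)   # Price -> 100.0, 200.0, 300.0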
Using pivot()
import pandas as pd
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'London', 'New York', 'London'],
    'Sales': [200, 150, 220, 180]})
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print(pivot_df)
📌 Key points:
• Converts City values into column names.
• Uses Date as the index.
• Shows Sales values as table entries.
Melting
Melting Data (melt())
• The opposite of pivoting – it converts wide data into
long format.
melted_df = pivot_df.reset_index().melt(id_vars='Date', var_name='City', value_name='Sales')
print(melted_df)
• 📌 Key points:
• Converts pivot_df back into a long format.
• id_vars keeps specified columns (Date).
• var_name and value_name rename the columns.
Stacking and Unstacking
Stacking (stack())
• Converts columns into a hierarchical index (multi-
index rows).
stacked_df = pivot_df.stack()
print(stacked_df)
Unstacking (unstack())
• Converts multi-index rows back into columns.
unstacked_df = stacked_df.unstack()
print(unstacked_df)
Data Cleaning
• Data cleaning, also known as data cleansing or data
preprocessing, is a crucial step in the data science
pipeline that involves identifying and correcting or
removing errors, inconsistencies, and inaccuracies in
the data to improve its quality and usability.
• Data cleaning is essential because raw data is often
noisy, incomplete, and inconsistent, which can
negatively impact the accuracy and reliability of the
insights derived from it.
Why is data cleaning
important?
• Data cleansing is a crucial step in the data preparation process,
playing an important role in ensuring the accuracy, reliability, and
overall quality of a dataset.
• For decision-making, the integrity of the conclusions drawn heavily
relies on the cleanliness of the underlying data. Without proper data
cleaning, inaccuracies, outliers, missing values, and inconsistencies can
compromise the validity of analytical results. Moreover, clean data
facilitates more effective modeling and pattern recognition, as
algorithms perform optimally when fed high-quality, error-free input.
• Additionally, clean datasets enhance the interpretability of findings,
aiding in the formulation of actionable insights.
Steps to perform data cleaning
• Removal of unwanted observations
• Fixing structural errors
• Handling missing data
• Managing unwanted outliers
Data Preparation
• Data preparation is the process of making raw data
ready for processing and analysis. The key steps are to
collect, clean, and label raw data in a format suitable for
machine learning (ML) algorithms, followed by data
exploration and visualization.
• The process of cleaning and combining raw data before
using it for machine learning and business analysis is
known as data preparation.
Why is data preparation
important?
• Data preparation ensures data accuracy and
consistency, leading to reliable insights and informed
decision-making.
• It optimizes data for analysis, uncovering hidden
patterns and trends. Additionally, it enhances model
performance and accuracy, driving better decision
outcomes.
• It saves time and resources by preventing errors and
inefficiencies in the analysis phase.
Steps to perform data
preparation
• Describe purpose and requirements
• Data collection and enrichment
• Data exploration
• Data profiling
• Data transformation
• Combining and integrating data
Handling Missing data
What are missing values?
• Missing values are data points that are absent for a
specific variable in a dataset.
• They can be represented in various ways, such as blank
cells, null values, or special symbols like “NA” or
“unknown.” These missing data points pose a significant
challenge in data analysis and can lead to inaccurate or
biased results.
Why is it important?
• Data Quality: Missing values can introduce errors and inconsistencies into your
dataset, leading to inaccurate results and unreliable models.
• Model Performance: Many machine learning algorithms cannot handle missing
values directly. If not addressed, they can either produce incorrect results or fail to
converge.
• Bias: Improper handling of missing values can introduce bias into your analysis,
leading to misleading conclusions. For example, if missing values are more likely to
occur in certain groups, ignoring or deleting them can bias your results.
• Data Loss: Deleting rows or columns with missing values can result in significant
data loss, especially if there are many missing values. This can reduce the
statistical power of your analysis.
• Interpretation: Missing values can make it difficult to interpret the results of your
analysis. For example, if a variable has many missing values, it may be difficult to
draw meaningful conclusions about its relationship with other variables.
Techniques to handle
missing data
• Deletion
• Imputation
Deletion
• Listwise Deletion: Remove entire rows or columns containing
missing values. This method is simple but can result in a
significant loss of data, especially if there are many missing
values.
Imputation
• K-Nearest Neighbors (KNN) Imputation: Impute missing values using the average
values of the k nearest neighbors. This method can be effective for numerical data.
import pandas as pd
data = {'Age': [25, None, 30], 'Score': [80, 90, None]}  # sample data with missing values
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
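• For the imputation side, a hedged sketch with scikit-learn's KNNImputer applied to the df above (assumes scikit-learn is installed).
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)       # fill each gap from the 2 most similar rows
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)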
Handling missing values
1) Check for missing values
print(df.isnull())        # Boolean DataFrame indicating missing values
print(df.isnull().sum())  # Count of missing values in each column
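• Once located, the gaps can be dropped or filled; a minimal sketch continuing the df above.
df_dropped = df.dropna()                           # remove rows that contain any missing value
df_filled = df.fillna(df.mean(numeric_only=True))  # or replace them with the column mean
print(df_dropped)
print(df_filled)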
Choosing the right approach
• The best approach for handling missing values depends on the nature of your data,
the amount of missing data, and the specific requirements of your analysis. Consider
the following factors:
• Amount of missing data: If there are many missing values, imputation might be
preferable to deletion.
• Distribution of missing data: If missingness is random, imputation might be
suitable. If missingness is related to other variables, more sophisticated
techniques might be necessary.
• Impact of missing data on the analysis: If missing values are likely to bias your
results, it's important to address them.
• By carefully considering these factors and applying appropriate techniques, you can
effectively handle missing values in your data science projects and improve the
accuracy and reliability of your models.
Data Transformation
• Data transformation is the process of converting,
cleansing, and structuring data into a usable format
that can be analyzed to support decision making
processes, and to propel the growth of an organization.
• Data transformation is used when data needs to be
converted to match that of the destination system.
Data Transformation Techniques
Data Smoothing
• Data smoothing is a process that is used to remove
noise from the dataset using some algorithms.
• It allows for highlighting important features present in
the dataset.
• It helps in predicting the patterns.
• When collecting data, it can be manipulated to
eliminate or reduce any variance or any other noise
form.
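• A hedged sketch of one common smoothing technique, a rolling (moving) average in pandas; the series is illustrative.
import pandas as pd

sales = pd.Series([10, 12, 9, 40, 11, 13, 10])          # the 40 is a noisy spike
smoothed = sales.rolling(window=3, center=True).mean()  # 3-point moving average
print(smoothed)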
Attribute Construction
• In the attribute construction method, the new
attributes consult the existing attributes to construct a
new data set that eases data mining.
• New attributes are created and applied to assist the
mining process from the given attributes. This
simplifies the original data and makes the mining more
efficient.
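• A minimal sketch of attribute construction: deriving a hypothetical BMI column from existing height and weight columns.
import pandas as pd

df = pd.DataFrame({'height_m': [1.60, 1.75, 1.82], 'weight_kg': [55, 72, 90]})
# New attribute built from existing ones; the raw columns stay untouched
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2
print(df)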
Data Generalization
• Data generalization is the process of converting
detailed data into a more abstract, higher-level
representation while retaining essential information.
• It is commonly used in data mining, privacy
preservation, and machine learning to reduce
complexity and improve model generalization.
• Types
• Attribute Generalization
• Hierarchical Generalization
• Numeric Generalization
• Text Generalization
Data Generalization
1) Attribute Generalization
Replacing specific values with more general categories.
Example:
Raw Data: Age = {22, 25, 30, 35}
Generalized Data: Age Group = {20-30, 30-40}
2) Hierarchical Generalization
Using a hierarchy to replace specific values with broader
categories.
Example:
Raw Data: "Toyota Corolla" → Generalized: "Toyota" →
"Car"
Data Generalization
3) Numeric Generalization
Converting continuous numerical data into categorical
ranges.
Example:
Raw Data: Salary = {45000, 52000, 61000}
Generalized: Salary Range = {40K-50K, 50K-60K, 60K-
70K}
4) Text Generalization
Replacing specific words with generalized versions.
Example:
Raw Data: "John lives in New York"
Generalized: "Person lives in City"
Data Aggregation
• Data collection or aggregation is the method of storing
and presenting data in a summary format.
• The data may be obtained from multiple data sources to
integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of
data analysis insights is highly dependent on the
quantity and quality of the data used.
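• A small aggregation sketch with pandas groupby; the sales data is invented.
import pandas as pd

df = pd.DataFrame({'City': ['London', 'London', 'Paris', 'Paris'],
                   'Sales': [150, 180, 200, 220]})
# Summarize the detail rows into one line per city
summary = df.groupby('City')['Sales'].agg(['sum', 'mean'])
print(summary)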
Data Discretization
• Data discretization is the process of converting
continuous numerical data into discrete categories
(bins).
• It is commonly used in machine learning, data mining,
and feature engineering to simplify models and improve
interpretability.
import pandas as pd

# Sample dataset
data = {'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)
# Equal-width binning into 3 categories
df['Age_Binned'] = pd.cut(df['Age'], bins=3, labels=['Young', 'Middle-aged', 'Old'])
print(df)
Data Normalization
• Data normalization is a preprocessing technique used
to scale numerical data into a specific range, usually
[0,1] or [-1,1].
• It ensures that features contribute equally to a model,
preventing bias due to different scales.
Why Normalize Data?
✅ Improves Machine Learning Performance – Many
algorithms (e.g., KNN, SVM, Neural Networks) perform
better with normalized data.
✅ Speeds Up Convergence – Gradient descent optimizes
faster when features are scaled.
✅ Prevents Dominance of Large-Scale Features – Avoids a
situation where one feature overpowers others.
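• A hedged sketch of min-max normalization to the [0, 1] range, done directly in pandas; the values are illustrative.
import pandas as pd

df = pd.DataFrame({'Income': [30000, 45000, 60000, 90000]})
# x_scaled = (x - min) / (max - min) maps every value into [0, 1]
df['Income_scaled'] = (df['Income'] - df['Income'].min()) / (df['Income'].max() - df['Income'].min())
print(df)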
String manipulation
• Strings are fundamental and essential data structures
that every Python programmer works with.
• In Python, a string is a sequence of characters enclosed
within either single quotes ('...') or double quotes ("...").
• It is an immutable built-in data structure, meaning once a
string is created, it cannot be modified. However, we can
create new strings by concatenating or slicing existing
strings.
String manipulations
• Basic String Operations
# Defining a string
text = "Hello, Python!"
# String length
print(len(text)) # 14
# Accessing characters
print(text[0]) # 'H'
print(text[-1]) # '!'
# Slicing a string
print(text[0:5]) # 'Hello'
print(text[:5]) # 'Hello'
print(text[7:]) # 'Python!'
# Reversing a string
print(text[::-1]) # '!nohtyP ,olleH'
String manipulations
• String Case Manipulation
text = "hello python"
print(text.upper())    # 'HELLO PYTHON'
print(text.title())    # 'Hello Python'
str1, str2 = "Hello", "Python"
# Concatenation
print(str1 + " " + str2)  # 'Hello Python'
# Repetition
print(str1 * 3)           # 'HelloHelloHello'
String manipulations
• String Searching and Replacing
text = "Python is fun"
# Find position
print(text.find("fun"))    # 10
print(text.index("is"))    # 7
# Replace substring
print(text.replace("fun", "awesome"))  # 'Python is awesome'
String manipulations
• Splitting and Joining Strings
text = "apple,banana,grape"
print(text.split(","))                          # ['apple', 'banana', 'grape']
print("-".join(["apple", "banana", "grape"]))   # 'apple-banana-grape'
text2 = " Python "
print(text2.strip())   # 'Python'
print(text2.lstrip())  # 'Python '
print(text2.rstrip())  # ' Python'
String manipulations
• Formatting Strings
name = "Alice"
age = 25
# Using format()
print("My name is {} and I am {} years
old.".format(name, age))
# Using % operator
print("My name is %s and I am %d years old." %
(name, age))
String manipulations
• Checking String properties
text = "Python123"
print(text.isalpha())     # False (contains numbers)
print(text.isdigit())     # False (contains letters)
print(text.isalnum())     # True (letters and numbers only)
print(text.isspace())     # False (not just spaces)
print("hello".islower())  # True
print("HELLO".isupper())  # True
String manipulations
• Reversing words in a sentence
text = "Hello World"
print(" ".join(text.split()[::-1]))  # 'World Hello'
Summarizing
• Centrality: Mean, Median, Mode
• Dispersion: Standard Deviation, Variance, Range
• Sample Distribution: Histogram, Skewness, Kurtosis
Summarizing
Centrality
• Centrality describes the center or middle
value of the data set.
• The common measures of centrality are:
• Mean: The average value of a dataset.
• Mode: The most frequent value in a dataset.
• Median: The middle value in a sorted dataset.
Summarizing
Dispersion
• This category measures how spread out the data is.
• The common measures of dispersion are:
• Standard Deviation: Measures the average
distance of data points from the mean.
• Variance: The square of the standard deviation.
• Range: The difference between the maximum
and minimum values.
Summarizing
Sample Distribution
• This category involves analyzing the shape and
characteristics of the distribution of the data.
• The common methods under this category are:
• Histogram: A graphical representation of the distribution
of a numerical variable.
• Tally: A simple counting method.
• Skewness: Measures the asymmetry of the distribution.
• Kurtosis: Measures the "tailedness" of the distribution.
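• A quick sketch computing these summary measures with pandas; the data is illustrative.
import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9, 30])
print(values.mean(), values.median(), values.mode()[0])          # centrality
print(values.std(), values.var(), values.max() - values.min())   # dispersion
print(values.skew(), values.kurt())                              # shape of the distribution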
Binning
• Binning is the process of grouping continuous values
into discrete bins (intervals).
• It is commonly used in data preprocessing, feature
engineering, and data visualization to reduce
complexity and improve interpretability.
Binning
Types of Binning
1.Equal-Width Binning
• Divides the data into equal-sized intervals.
• Example: Ages (22, 25, 30, 35, 40) → Bins: [20-30], [30-
40], etc.
2.Equal-Frequency Binning (Quantile Binning)
• Each bin contains (almost) the same number of values.
• Example: 5 salaries split into 3 bins, each with ~2 values.
3.Custom Binning
• Manually defining bins based on domain knowledge.
• Example: Categorizing income levels → Low, Medium,
High.
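• A small sketch contrasting equal-width and equal-frequency binning in pandas; the ages are illustrative.
import pandas as pd

ages = pd.Series([22, 25, 30, 35, 40, 45, 50, 55, 60])
equal_width = pd.cut(ages, bins=3)    # three intervals of the same width
equal_freq = pd.qcut(ages, q=3)       # three bins with (roughly) the same number of values
print(equal_width.value_counts())
print(equal_freq.value_counts())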
Standardization
• Standardization is a common preprocessing
technique in data science that transforms
numerical data to have a mean of 0 and a standard
deviation of 1.
• This is particularly useful when dealing with
features that have different scales or units, as it
ensures that all features contribute equally to the
model.
Why Standardization?
• Equalizes Feature Importance: Standardization prevents features with
larger magnitudes from dominating the model, ensuring that all
features are treated fairly.
Standardization
• Z-Score
• Min-Max
Standardization
• Z-score normalization is a data preprocessing technique that transforms
numerical data to have a mean of 0 and a standard deviation of 1. This is
particularly useful when dealing with features that have different scales
or units, as it ensures that all features contribute equally to the model.
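• A hedged sketch of z-score standardization, z = (x - mean) / std, computed directly with pandas; the values are illustrative.
import pandas as pd

df = pd.DataFrame({'Height_cm': [150, 160, 170, 180, 190]})
df['Height_z'] = (df['Height_cm'] - df['Height_cm'].mean()) / df['Height_cm'].std()
print(df)   # the new column has mean 0 and standard deviation 1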
Types of Outliers
• Global Outliers
• Contextual Outliers
Types of Outliers
Global outliers
• Global outliers are isolated data points that are far away from the
main body of the data.
• They are often easy to identify and remove.
Contextual outliers
• Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context.
• They are often more difficult to identify and may require additional
information or domain knowledge to determine their significance.
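• A minimal sketch of one common way to flag global outliers, the IQR rule; the data is invented for illustration.
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])          # 95 is far from the rest
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)    # flags the 95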
Anomalies
• A data object that deviates significantly from normal
objects, such as an unusual credit card purchase.
• Anomalies can also be instances or collections of data that
are very rare in the data set and have features that differ
significantly from most of the data.
• Anomalies can be contextual outliers, which are anomalies
that have values that significantly deviate from other data
points in the same context. For example, an anomaly in one
dataset might not be an anomaly in another.