Cleaning & Preprocessing Data by Khushmandeep Kaur
Contents – Why use data cleaning, techniques for data cleaning, and what is tokenization
INTRODUCTION
Data cleaning is the process of detecting and correcting errors, inconsistencies, and irrelevant information in a dataset before it is analyzed. Investing in this step pays off in several ways:
Improved Decision-Making: Clean data provides a solid foundation for making informed decisions. By
removing noise and irrelevant information, data cleaning enables stakeholders to trust the data-driven insights
derived from analysis.
Reduced Risk of Bias: Dirty data can introduce biases into analysis, leading to skewed results and flawed
conclusions. Data cleaning helps mitigate this risk by eliminating duplicates, correcting errors, and
standardizing data, ensuring fair and unbiased analysis.
Optimized Performance of Models: Clean data is essential for training machine learning models effectively. By
providing high-quality input data, data cleaning enhances the performance and reliability of predictive models,
leading to more accurate predictions and actionable insights.
Cost and Time Savings: Cleaning data upfront saves time and resources in the long run. By addressing data quality issues early in the data pipeline, organizations avoid costly errors, rework, and delays in analysis and decision-making.
Techniques for Data Cleaning
Removing Duplicates: Identifying and eliminating duplicate records or entries within the dataset to ensure each
data point is unique.
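As an illustration, here is a minimal sketch using pandas (the DataFrame and its column names are made-up assumptions):

import pandas as pd

# Small illustrative dataset containing one repeated record
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

# Drop rows that are exact duplicates, keeping the first occurrence
unique_rows = df.drop_duplicates()

# Or define duplicates by a subset of columns only
unique_by_email = df.drop_duplicates(subset=["email"], keep="first")
print(unique_by_email)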
Handling Missing Values: Dealing with missing or null values by imputing them with estimated values, deleting them, or using more sophisticated techniques such as predictive modeling to fill them in.
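A possible sketch of both basic strategies with pandas (the column names and values are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["Delhi", "Pune", None, "Mumbai"]})

# Strategy 1: drop rows containing any missing value
dropped = df.dropna()

# Strategy 2: impute numeric gaps with the column mean and
# categorical gaps with a placeholder label
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna("unknown")
print(imputed)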
Error Correction: Detecting and rectifying errors in data entry, formatting, or encoding. This may involve
validating data against predefined rules or constraints and correcting inaccuracies.
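For instance, a small sketch of rule-based validation in pandas (the rule and the column are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"age": [34, -2, 230, 41]})

# Validation rule: ages must lie in a plausible range
valid = df["age"].between(0, 120)
print(df[~valid])                 # rows that violate the rule

# One possible correction: treat out-of-range values as missing
# so they can be imputed or reviewed later
df["age"] = df["age"].mask(~valid)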
Standardization: Converting data into a consistent format or representation. This includes standardizing units of
measurement, date formats, and other variables to facilitate analysis and comparison.
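A small sketch of both ideas with pandas (assumes pandas 2.x for format="mixed"; the columns are illustrative):

import pandas as pd

df = pd.DataFrame({
    "order_date": ["2024-01-05", "05/02/2024", "2024/03/03"],
    "weight_lb": [2.0, 5.5, 1.2],
})

# Standardize mixed date strings into one datetime representation
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# Standardize units: convert pounds to kilograms
df["weight_kg"] = df["weight_lb"] * 0.453592
print(df)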
Parsing and Formatting: Parsing and restructuring unstructured or semi-structured data into a structured format
suitable for analysis. This may involve splitting text fields, extracting relevant information, and formatting data
according to predefined rules.
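For example, splitting a semi-structured text field into structured columns with pandas (the field layout is an assumption):

import pandas as pd

df = pd.DataFrame({"raw": ["Doe, Jane | 2024-05-01",
                           "Roe, R | 2024-06-12"]})

# Split the field on the delimiter and trim surrounding whitespace
parts = df["raw"].str.split("|", expand=True)
df["name"] = parts[0].str.strip()
df["date"] = pd.to_datetime(parts[1].str.strip())

# Split the name further into last / first components
df[["last_name", "first_name"]] = df["name"].str.split(", ", expand=True)
print(df)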
Outlier Detection and Handling: Identifying outliers or anomalies in the data that deviate significantly from the
norm and deciding whether to remove them, adjust them, or treat them separately in the analysis.
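As a sketch, the common interquartile-range (IQR) rule in pandas (the series values are made up):

import pandas as pd

s = pd.Series([12, 14, 13, 15, 11, 13, 98])   # 98 looks anomalous

# IQR rule: flag points far outside the middle 50% of the data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)

# One common treatment: cap (winsorize) values at the bounds
capped = s.clip(lower, upper)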
Data Transformation: Transforming data to meet the requirements of specific analysis or modeling tasks. This may include aggregating, summarizing, or transforming variables to create new features or insights.
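For illustration, aggregation and feature creation with pandas (the sales table is an assumption):

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [100, 150, 80, 120],
    "units": [10, 15, 8, 12],
})

# Aggregate: summarize revenue and units per region
summary = sales.groupby("region").agg(total_revenue=("revenue", "sum"),
                                      avg_units=("units", "mean"))

# Transform: derive a new feature from existing variables
sales["revenue_per_unit"] = sales["revenue"] / sales["units"]
print(summary)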
Normalization and Scaling: Scaling numerical variables to a common scale or normalizing them to a standard distribution. This ensures that variables with different scales or units have a similar impact on analysis and modeling.
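A minimal sketch of min-max scaling and z-score standardization in pandas (the columns are illustrative):

import pandas as pd

df = pd.DataFrame({"income": [30_000, 60_000, 90_000],
                   "age": [25, 40, 55]})

# Min-max scaling: rescale each column to the [0, 1] range
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: zero mean, unit standard deviation
z_scored = (df - df.mean()) / df.std()

print(min_max)
print(z_scored)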
Dealing with Inconsistencies: Resolving inconsistencies in data values, such as variations in spelling,
capitalization, or naming conventions. This may involve standardizing naming conventions or using fuzzy
matching algorithms to identify similar values.
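One possible sketch: standardize case and whitespace, then map near-duplicates onto a canonical list using Python's built-in difflib (the city values and similarity cutoff are assumptions):

import pandas as pd
from difflib import get_close_matches

cities = pd.Series(["New Delhi", "new delhi ", "NEW DELHI", "Nw Delhi", "Mumbai"])

# Standardize case and surrounding whitespace first
cleaned = cities.str.strip().str.title()

# Map remaining near-duplicates onto a canonical spelling
canonical = ["New Delhi", "Mumbai"]

def canonicalize(value):
    match = get_close_matches(value, canonical, n=1, cutoff=0.8)
    return match[0] if match else value

print(cleaned.apply(canonicalize))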
What is Tokenization?
Tokenization is the process of breaking down a text or sequence into smaller units called
tokens. These tokens can be words, phrases, symbols, or other meaningful
elements, depending on the context and requirements of the task. Tokenization is a
fundamental step in natural language processing (NLP) and text analysis tasks, as it
enables computers to understand and process textual data more effectively.
Text Segmentation: The input text is segmented or divided into smaller units, such
as words, sentences, or characters, depending on the granularity required for the
task.
Token Generation: Each segment or unit generated from the segmentation step is
considered a token. Tokens can represent individual words, punctuation marks,
numbers, or special characters.
Tokenization: The text data is tokenized into smaller units, such as words, phrases, or characters, using appropriate tokenization techniques (e.g., word tokenization, sentence tokenization).
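For example, a sketch of sentence and word tokenization with NLTK (the library choice is an assumption; depending on the NLTK version, the "punkt" or "punkt_tab" tokenizer data may need to be downloaded first):

import nltk
nltk.download("punkt", quiet=True)   # tokenizer models; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Data cleaning matters. Tokenization breaks text into smaller units!"

print(sent_tokenize(text))   # sentence-level tokens
print(word_tokenize(text))   # word-level tokens, punctuation kept as separate tokens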
Text Normalization: Tokens may undergo normalization to standardize their representation and improve consistency. This can include converting text to lowercase, removing punctuation, and handling special cases like contractions or abbreviations.
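A small sketch of these normalization steps in plain Python (the contraction map is a tiny illustrative subset):

import re

tokens = ["Data", "Cleaning", "ISN'T", "optional", "!", "123"]

# Lowercase every token and drop punctuation-only tokens
normalized = [t.lower() for t in tokens]
normalized = [t for t in normalized if re.search(r"\w", t)]

# Expand a few common contractions (illustrative, not exhaustive)
contractions = {"isn't": "is not", "don't": "do not"}
normalized = [contractions.get(t, t) for t in normalized]
print(normalized)   # ['data', 'cleaning', 'is not', 'optional', '123']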
Handling Stopwords: Stopwords, i.e., commonly occurring words such as "the," "and," and "is," may be removed from the tokenized text to reduce noise and improve the efficiency of analysis.
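For instance, using NLTK's English stopword list (assumes the "stopwords" corpus has been downloaded):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

tokens = ["the", "model", "is", "trained", "and", "evaluated"]

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['model', 'trained', 'evaluated']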
Removing Irrelevant Tokens: Tokens that are not relevant to the analysis or modeling task may be filtered out. This can include removing numerical tokens, symbols, or rare words with low frequency.
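A minimal sketch of both filters (the frequency threshold of one is an arbitrary choice for illustration):

from collections import Counter

tokens = ["data", "data", "cleaning", "cleaning", "cleaning", "42", "#", "xqzt"]

# Drop purely numeric and symbol tokens
alpha_tokens = [t for t in tokens if t.isalpha()]

# Drop rare tokens that occur only once in the corpus
counts = Counter(alpha_tokens)
frequent = [t for t in alpha_tokens if counts[t] > 1]
print(frequent)   # ['data', 'data', 'cleaning', 'cleaning', 'cleaning']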
Lemmatization or Stemming: Tokens may undergo lemmatization or stemming to reduce inflectional forms to their base or root form. This helps in standardizing tokens and reducing the vocabulary size.
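For example, with NLTK's Porter stemmer and WordNet lemmatizer (assumes the "wordnet" corpus, and in some NLTK versions "omw-1.4", has been downloaded):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "studies", "better"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])                   # rule-based cuts, e.g. 'studi'
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # dictionary base forms, e.g. 'run'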
Handling Out-of-Vocabulary Tokens: Out-of-vocabulary (OOV) tokens, i.e., tokens not present in the vocabulary, may be replaced or handled separately. This can involve using special tokens to represent OOV tokens or performing subword tokenization to handle unknown words.
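A very small sketch of the vocabulary-plus-special-token approach (the vocabulary here is a made-up assumption; subword tokenizers such as BPE are the usual alternative for truly unknown words):

# Replace tokens outside a known vocabulary with a special <UNK> token
vocabulary = {"data", "cleaning", "tokenization", "model"}

tokens = ["data", "cleaning", "hyperparameter", "tokenization"]
handled = [t if t in vocabulary else "<UNK>" for t in tokens]
print(handled)   # ['data', 'cleaning', '<UNK>', 'tokenization']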
Data Integration and Alignment: If the data involves multiple sources or formats, tokenization ensures that the text data is integrated and aligned properly. This may involve aligning tokens across different languages or dialects.
Quality Assurance: Tokenization is accompanied by quality assurance checks to ensure that the tokenized data meets predefined standards. This may involve verifying the correctness of tokenization results and addressing any errors or inconsistencies.
Final Data Formatting: Once tokenization and preprocessing are complete, the data is formatted into a structured format suitable for further analysis or modeling. This may involve converting tokenized text into numerical vectors or other appropriate representations.
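As one possible sketch of this final step, a bag-of-words representation with scikit-learn (the library choice and the two documents are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["clean data is reliable data",
        "tokenization breaks text into tokens"]

# Each document becomes a vector of token counts over the shared vocabulary
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())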
By combining tokenization with other cleaning and preprocessing techniques, the text data is
transformed into a standardized, normalized, and cleaned format, ready for analysis,
modeling, or other NLP tasks. This ensures that the data is reliable, consistent, and conducive
to extracting meaningful insights.