
Cleaning and Preprocessing Data

Contents
 Introduction
 What is Data Cleaning?
 Importance of Data Cleaning
 Techniques for Data Cleaning
 What is Tokenization?
 Why use Tokenization?
 Techniques for Tokenization
 Cleaning and Preprocessing Data with Tokenization
INTRODUCTION

 In today's data-driven world, the quality of data directly impacts the success of analytical and machine learning endeavours.
 Data cleaning and preprocessing are foundational
steps to ensure the accuracy and reliability of our
data.
 Tokenization, a fundamental technique in natural
language processing (NLP), plays a pivotal role in
breaking down textual data into meaningful
components.
 Throughout this presentation, we'll delve into the
significance of data cleaning, explore the
intricacies of tokenization, and showcase how
combining these processes can elevate the quality
of our data for downstream analysis and
modelling.
What is Data Cleaning?
 Data cleaning is the process of identifying and rectifying errors,
inconsistencies, and inaccuracies within datasets.
 It involves transforming raw data into a standardized, accurate,
and reliable format suitable for analysis.
 Data cleaning tasks encompass a range of operations aimed at
improving data quality and integrity.
 Identification and removal of duplicates: Detecting and
eliminating duplicate records or entries within the dataset to
ensure each data point is unique, preventing skewing of analysis
results.
 Handling missing values: Identifying missing or null values in
the dataset and implementing strategies such as imputation or
deletion to handle missing data based on context and impact on
analysis.
 Correction of errors: Detecting and rectifying errors in data
entry, formatting, or encoding, and validating data against
predefined rules or constraints to ensure accuracy.
 Standardization of data: Converting data into a consistent format or representation, standardizing units of
measurement, date formats, and other variables to facilitate analysis and comparison.
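A minimal sketch of the duplicate-removal and missing-value handling described above, using pandas. The column names and the imputation choices (median for the numeric field, dropping rows with a missing required field) are illustrative assumptions.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meera"],   # hypothetical records
    "age":  [29, 29, np.nan, 41],
    "city": ["Delhi", "Delhi", "Pune", None],
})

# Identification and removal of duplicates
df = df.drop_duplicates()

# Handling missing values: impute the numeric field with its median,
# then drop rows where a required categorical field is still missing
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])

print(df)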
Importance of Data Cleaning
 Enhanced Data Accuracy: Data cleaning ensures that the dataset is free from errors, inconsistencies, and
inaccuracies, leading to more reliable and accurate analysis results.

 Improved Decision-Making: Clean data provides a solid foundation for making informed decisions. By
removing noise and irrelevant information, data cleaning enables stakeholders to trust the data-driven insights
derived from analysis.

 Reduced Risk of Bias: Dirty data can introduce biases into analysis, leading to skewed results and flawed
conclusions. Data cleaning helps mitigate this risk by eliminating duplicates, correcting errors, and
standardizing data, ensuring fair and unbiased analysis.

 Optimized Performance of Models: Clean data is essential for training machine learning models effectively. By
providing high-quality input data, data cleaning enhances the performance and reliability of predictive models,
leading to more accurate predictions and actionable insights.

 Cost and Time Savings: Data cleaning upfront saves time and resources in the long run. By addressing data
quality issues early in the data pipeline, organizations avoid costly errors, rework, and delays in analysis and
decision-making processes.
Techniques for Data Cleaning
 Removing Duplicates: Identifying and eliminating duplicate records or entries within the dataset to ensure each
data point is unique.
 Handling Missing Values: Dealing with missing or null values by either imputing them with estimated values,
deleting them, or using more sophisticated techniques like predictive modeling to fill in missing values.
 Error Correction: Detecting and rectifying errors in data entry, formatting, or encoding. This may involve
validating data against predefined rules or constraints and correcting inaccuracies.
 Standardization: Converting data into a consistent format or representation. This includes standardizing units of
measurement, date formats, and other variables to facilitate analysis and comparison.
 Parsing and Formatting: Parsing and restructuring unstructured or semi-structured data into a structured format
suitable for analysis. This may involve splitting text fields, extracting relevant information, and formatting data
according to predefined rules.
 Outlier Detection and Handling: Identifying outliers or anomalies in the data that deviate significantly from the
norm and deciding whether to remove them, adjust them, or treat them separately in the analysis.
 Data Transformation: Transforming data to meet the requirements of specific analysis or modelling tasks. This
may include aggregating, summarizing, or transforming variables to create new features or insights.
 Normalization and Scaling: Scaling numerical variables to a common scale or normalizing them to a standard
distribution. This ensures that variables with different scales or units have a similar impact on analysis and
modelling.
 Dealing with Inconsistencies: Resolving inconsistencies in data values, such as variations in spelling,
capitalization, or naming conventions. This may involve standardizing naming conventions or using fuzzy
matching algorithms to identify similar values.
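A minimal sketch of a few of the techniques listed above (dealing with inconsistencies, outlier handling, and scaling), using pandas and scikit-learn. The column names, percentile thresholds, and scaling choice are illustrative assumptions, not a fixed recipe.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "country": ["india", "India ", "INDIA", "Nepal"],   # inconsistent spellings
    "salary":  [52000, 48000, 1000000, 61000],          # contains an outlier
})

# Dealing with inconsistencies: standardize text values
df["country"] = df["country"].str.strip().str.title()

# Outlier detection and handling: clip values outside the 1st-99th percentiles
low, high = df["salary"].quantile([0.01, 0.99])
df["salary"] = df["salary"].clip(low, high)

# Normalization and scaling: rescale the numeric column to the [0, 1] range
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()

print(df)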
What is Tokenization?
Tokenization is the process of breaking down a text or sequence into smaller units called
tokens. These tokens can be words, phrases, symbols, or other meaningful
elements, depending on the context and requirements of the task. Tokenization is a
fundamental step in natural language processing (NLP) and text analysis tasks, as it
enables computers to understand and process textual data more effectively.

The tokenization process typically involves the following steps:

 Text Segmentation: The input text is segmented or divided into smaller units, such
as words, sentences, or characters, depending on the granularity required for the
task.

 Token Generation: Each segment or unit generated from the segmentation step is
considered a token. Tokens can represent individual words, punctuation marks,
numbers, or special characters.

 Normalization: Tokens may undergo normalization to standardize their representation. This can include
converting all text to lowercase, removing accents or diacritics, and handling special cases like contractions or
abbreviations.

 Filtering: Optionally, certain tokens may be filtered out based on predefined criteria. This can include
removing stopwords (commonly occurring words like "the," "and," "is") or symbols that are not relevant to the
analysis.
Why use Tokenization?
 Simplifies Text Processing: Tokenization breaks down text into smaller units, such as words or phrases, making it easier
for computers to process and analyze textual data. By segmenting text into meaningful units, tokenization simplifies various
NLP tasks, including sentiment analysis, text classification, and information retrieval.
 Feature Extraction: Tokens generated through tokenization serve as features for machine learning models. Each token
represents a specific aspect of the text, allowing models to learn patterns and relationships within the data. Feature extraction
via tokenization enables the creation of numerical representations (vectors) of text data, facilitating the application of machine
learning algorithms.
 Enables Vocabulary Management: Tokenization allows for efficient management of vocabulary and word-level
statistics. By tokenizing text into words or subword units, NLP systems can build vocabularies, track word frequencies, and
perform statistical analysis, which are essential for tasks like language modeling and word embeddings.
 Supports Text Analysis and Understanding: Tokenization enables computers to understand and interpret textual data
more effectively. By breaking down text into tokens, NLP systems can analyze the structure, semantics, and context of text,
leading to insights and understanding that can be leveraged for various applications, such as information retrieval and question
answering.
 Facilitates Text Normalization and Preprocessing : Tokenization is often accompanied by text normalization steps,
such as lowercase conversion, punctuation removal, and stemming. These preprocessing techniques help standardize text
representations and reduce the complexity of downstream analysis tasks. Tokenization thus serves as a crucial step in the data
preprocessing pipeline, improving the quality and consistency of textual data.
 Supports Multilingual and Cross-lingual Processing: Tokenization can be adapted to handle text in multiple
languages and scripts, making it suitable for multilingual and cross-lingual NLP tasks. By tokenizing text into language-
specific units, NLP systems can effectively process and analyze diverse language corpora, enabling applications that span
multiple languages and cultures.
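To make the feature-extraction point above concrete, here is a minimal sketch using scikit-learn's CountVectorizer, which tokenizes each document and turns it into a bag-of-words count vector. The two-document corpus is purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps.",
]

# Tokenize, lowercase, and drop English stopwords while building the vocabulary
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary (tokens)
print(X.toarray())                          # one count vector per document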
Techniques for Tokenization
 Word Tokenization : Breaks text into individual words based on whitespace or punctuation boundaries.
Example: "The quick brown fox jumps over the lazy dog." → ["The", "quick", "brown", "fox", "jumps", "over",
"the", "lazy", "dog"]
 Sentence Tokenization : Segments text into sentences based on punctuation marks or specific language
rules.
Example: "The quick brown fox. Jumps over the lazy dog." → ["The quick brown fox.", "Jumps over the lazy
dog."]
 Regular Expression Tokenization : Uses regular expressions to define custom tokenization patterns.
Allows for more flexible tokenization based on specific requirements.
Example: Tokenizing based on whitespace and punctuation: "The, quick-brown fox" → ["The", "quick", "brown",
"fox"]
 N-gram Tokenization : Divides text into contiguous sequences of n items (words, characters, etc.).
Useful for capturing local context and relationships between adjacent tokens.
Example: "The quick brown fox" (2-gram) → ["The quick", "quick brown", "brown fox"]
 Character Tokenization : Breaks text into individual characters.
Useful for character-level modeling or handling languages with complex scripts.
Example: "The quick brown fox" → ["T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f",
"o", "x"]
Cleaning and Preprocessing Data with Tokenization
Cleaning and preprocessing data with tokenization involves combining tokenization with other techniques to ensure
that the textual data is properly formatted, normalized, and ready for analysis. Here's how tokenization fits into the
data cleaning and preprocessing pipeline:

 Tokenization : The text data is tokenized into smaller units, such as words, phrases, or characters, using
appropriate tokenization techniques (e.g., word tokenization, sentence tokenization).
 Text Normalization: Tokens may undergo normalization to standardize their representation and improve
consistency. This can include converting text to lowercase, removing punctuation, and handling special cases
like contractions or abbreviations.
 Handling Stopwords : Stopwords, commonly occurring words like "the," "and," "is," may be removed from the
tokenized text to reduce noise and improve the efficiency of analysis.
 Removing Irrelevant Tokens : Tokens that are not relevant to the analysis or modeling task may be filtered out.
This can include removing numerical tokens, symbols, or rare words with low frequency.
 Lemmatization or Stemming : Tokens may undergo lemmatization or stemming to reduce inflectional forms to
their base or root form. This helps in standardizing tokens and reducing the vocabulary size.
 Handling Out-of-Vocabulary Tokens : Out-of-vocabulary (OOV) tokens, i.e., tokens not present in the
vocabulary, may be replaced or handled separately. This can involve using special tokens to represent OOV
tokens or performing subword tokenization to handle unknown words.
 Data Integration and Alignment : If the data involves multiple sources or formats,
tokenization ensures that the text data is integrated and aligned properly. This may
involve aligning tokens across different languages or dialects.
 Quality Assurance : Tokenization is accompanied by quality assurance checks to
ensure that the tokenized data meets predefined standards. This may involve verifying
the correctness of tokenization results and addressing any errors or inconsistencies.
 Final Data Formatting : Once tokenization and preprocessing are complete, the data is
formatted into a structured format suitable for further analysis or modeling. This may
involve converting tokenized text into numerical vectors or other appropriate
representations.

By combining tokenization with other cleaning and preprocessing techniques, the text data is
transformed into a standardized, normalized, and cleaned format, ready for analysis,
modeling, or other NLP tasks. This ensures that the data is reliable, consistent, and conducive
to extracting meaningful insights.
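A sketch of how these steps might fit together in code, assuming NLTK for tokenization, stopword removal, and lemmatization, and scikit-learn for the final numerical formatting; the helper function preprocess() and the two sample documents are illustrative.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t.lower() for t in tokens]                  # text normalization
    tokens = [t for t in tokens if t.isalpha()]           # remove irrelevant tokens
    tokens = [t for t in tokens if t not in stop_words]   # handle stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization
    return " ".join(tokens)

docs = [
    "The foxes were jumping over the lazy dogs.",
    "Data cleaning improves the quality of analysis.",
]

cleaned = [preprocess(d) for d in docs]
vectors = TfidfVectorizer().fit_transform(cleaned)        # final data formatting
print(cleaned)
print(vectors.shape)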
