Cleaning & Preprocessing Data by Khushmandeep Kaur
Contents – Why use data cleaning, techniques for data cleaning, and what is tokenization
INTRODUCTION
Data cleaning is the process of detecting and correcting errors, inconsistencies, and irrelevant information in a dataset before it is analyzed. Investing in this step pays off in several ways:
Improved Decision-Making: Clean data provides a solid foundation for making informed decisions. By
removing noise and irrelevant information, data cleaning enables stakeholders to trust the data-driven insights
derived from analysis.
Reduced Risk of Bias: Dirty data can introduce biases into analysis, leading to skewed results and flawed
conclusions. Data cleaning helps mitigate this risk by eliminating duplicates, correcting errors, and
standardizing data, ensuring fair and unbiased analysis.
Optimized Performance of Models: Clean data is essential for training machine learning models effectively. By
providing high-quality input data, data cleaning enhances the performance and reliability of predictive models,
leading to more accurate predictions and actionable insights.
Cost and Time Savings: Cleaning data upfront saves time and resources in the long run. By addressing data quality issues early in the data pipeline, organizations avoid costly errors, rework, and delays in analysis and decision-making.
Techniques for Data Cleaning
Removing Duplicates: Identifying and eliminating duplicate records or entries within the dataset to ensure each
data point is unique.
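As an illustration, here is a minimal sketch using pandas (the DataFrame and its column names are made-up assumptions):

import pandas as pd

# Small illustrative dataset containing one repeated record
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

# Drop rows that are exact duplicates, keeping the first occurrence
unique_rows = df.drop_duplicates()

# Or define duplicates by a subset of columns only
unique_by_email = df.drop_duplicates(subset=["email"], keep="first")
print(unique_by_email)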
Handling Missing Values: Dealing with missing or null values by imputing them with estimated values, deleting them, or using more sophisticated techniques such as predictive modeling to fill them in.
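A possible sketch of both basic strategies with pandas (the column names and values are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["Delhi", "Pune", None, "Mumbai"]})

# Strategy 1: drop rows containing any missing value
dropped = df.dropna()

# Strategy 2: impute numeric gaps with the column mean and
# categorical gaps with a placeholder label
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna("unknown")
print(imputed)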
Error Correction: Detecting and rectifying errors in data entry, formatting, or encoding. This may involve
validating data against predefined rules or constraints and correcting inaccuracies.
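For instance, a small sketch of rule-based validation in pandas (the rule and the column are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"age": [34, -2, 230, 41]})

# Validation rule: ages must lie in a plausible range
valid = df["age"].between(0, 120)
print(df[~valid])                 # rows that violate the rule

# One possible correction: treat out-of-range values as missing
# so they can be imputed or reviewed later
df["age"] = df["age"].mask(~valid)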
Standardization: Converting data into a consistent format or representation. This includes standardizing units of
measurement, date formats, and other variables to facilitate analysis and comparison.
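A small sketch of both ideas with pandas (assumes pandas 2.x for format="mixed"; the columns are illustrative):

import pandas as pd

df = pd.DataFrame({
    "order_date": ["2024-01-05", "05/02/2024", "2024/03/03"],
    "weight_lb": [2.0, 5.5, 1.2],
})

# Standardize mixed date strings into one datetime representation
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# Standardize units: convert pounds to kilograms
df["weight_kg"] = df["weight_lb"] * 0.453592
print(df)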
Parsing and Formatting: Parsing and restructuring unstructured or semi-structured data into a structured format
suitable for analysis. This may involve splitting text fields, extracting relevant information, and formatting data
according to predefined rules.
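For example, splitting a semi-structured text field into structured columns with pandas (the field layout is an assumption):

import pandas as pd

df = pd.DataFrame({"raw": ["Doe, Jane | 2024-05-01",
                           "Roe, R | 2024-06-12"]})

# Split the field on the delimiter and trim surrounding whitespace
parts = df["raw"].str.split("|", expand=True)
df["name"] = parts[0].str.strip()
df["date"] = pd.to_datetime(parts[1].str.strip())

# Split the name further into last / first components
df[["last_name", "first_name"]] = df["name"].str.split(", ", expand=True)
print(df)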
Outlier Detection and Handling: Identifying outliers or anomalies in the data that deviate significantly from the
norm and deciding whether to remove them, adjust them, or treat them separately in the analysis.
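As a sketch, the common interquartile-range (IQR) rule in pandas (the series values are made up):

import pandas as pd

s = pd.Series([12, 14, 13, 15, 11, 13, 98])   # 98 looks anomalous

# IQR rule: flag points far outside the middle 50% of the data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)

# One common treatment: cap (winsorize) values at the bounds
capped = s.clip(lower, upper)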
Data Transformation: Transforming data to meet the requirements of specific analysis or modeling tasks. This may include aggregating, summarizing, or transforming variables to create new features or insights.
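For illustration, aggregation and feature creation with pandas (the sales table is an assumption):

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [100, 150, 80, 120],
    "units": [10, 15, 8, 12],
})

# Aggregate: summarize revenue and units per region
summary = sales.groupby("region").agg(total_revenue=("revenue", "sum"),
                                      avg_units=("units", "mean"))

# Transform: derive a new feature from existing variables
sales["revenue_per_unit"] = sales["revenue"] / sales["units"]
print(summary)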
Normalization and Scaling: Scaling numerical variables to a common scale or normalizing them to a standard distribution. This ensures that variables with different scales or units have a similar impact on analysis and modeling.
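A minimal sketch of min-max scaling and z-score standardization in pandas (the columns are illustrative):

import pandas as pd

df = pd.DataFrame({"income": [30_000, 60_000, 90_000],
                   "age": [25, 40, 55]})

# Min-max scaling: rescale each column to the [0, 1] range
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: zero mean, unit standard deviation
z_scored = (df - df.mean()) / df.std()

print(min_max)
print(z_scored)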
Dealing with Inconsistencies: Resolving inconsistencies in data values, such as variations in spelling,
capitalization, or naming conventions. This may involve standardizing naming conventions or using fuzzy
matching algorithms to identify similar values.
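One possible sketch: standardize case and whitespace, then map near-duplicates onto a canonical list using Python's built-in difflib (the city values and similarity cutoff are assumptions):

import pandas as pd
from difflib import get_close_matches

cities = pd.Series(["New Delhi", "new delhi ", "NEW DELHI", "Nw Delhi", "Mumbai"])

# Standardize case and surrounding whitespace first
cleaned = cities.str.strip().str.title()

# Map remaining near-duplicates onto a canonical spelling
canonical = ["New Delhi", "Mumbai"]

def canonicalize(value):
    match = get_close_matches(value, canonical, n=1, cutoff=0.8)
    return match[0] if match else value

print(cleaned.apply(canonicalize))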
What is Tokenization?
Tokenization is the process of breaking down a text or sequence into smaller units called
tokens. These tokens can be words, phrases, symbols, or other meaningful
elements, depending on the context and requirements of the task. Tokenization is a
fundamental step in natural language processing (NLP) and text analysis tasks, as it
enables computers to understand and process textual data more effectively.
Text Segmentation: The input text is segmented or divided into smaller units, such
as words, sentences, or characters, depending on the granularity required for the
task.
Token Generation: Each segment or unit generated from the segmentation step is
considered a token. Tokens can represent individual words, punctuation marks,
numbers, or special characters.
Tokenization: The text data is tokenized into smaller units, such as words, phrases, or characters, using appropriate tokenization techniques (e.g., word tokenization, sentence tokenization).
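For example, a sketch of sentence and word tokenization with NLTK (the library choice is an assumption; depending on the NLTK version, the "punkt" or "punkt_tab" tokenizer data may need to be downloaded first):

import nltk
nltk.download("punkt", quiet=True)   # tokenizer models; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Data cleaning matters. Tokenization breaks text into smaller units!"

print(sent_tokenize(text))   # sentence-level tokens
print(word_tokenize(text))   # word-level tokens, punctuation kept as separate tokens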
Text Normalization: Tokens may undergo normalization to standardize their representation and improve consistency. This can include converting text to lowercase, removing punctuation, and handling special cases like contractions or abbreviations.
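A small sketch of these normalization steps in plain Python (the contraction map is a tiny illustrative subset):

import re

tokens = ["Data", "Cleaning", "ISN'T", "optional", "!", "123"]

# Lowercase every token and drop punctuation-only tokens
normalized = [t.lower() for t in tokens]
normalized = [t for t in normalized if re.search(r"\w", t)]

# Expand a few common contractions (illustrative, not exhaustive)
contractions = {"isn't": "is not", "don't": "do not"}
normalized = [contractions.get(t, t) for t in normalized]
print(normalized)   # ['data', 'cleaning', 'is not', 'optional', '123']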
Handling Stopwords: Stopwords, i.e., commonly occurring words such as "the," "and," and "is," may be removed from the tokenized text to reduce noise and improve the efficiency of analysis.
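For instance, using NLTK's English stopword list (assumes the "stopwords" corpus has been downloaded):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

tokens = ["the", "model", "is", "trained", "and", "evaluated"]

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['model', 'trained', 'evaluated']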
Removing Irrelevant Tokens: Tokens that are not relevant to the analysis or modeling task may be filtered out. This can include removing numerical tokens, symbols, or rare words with low frequency.
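A minimal sketch of both filters (the frequency threshold of one is an arbitrary choice for illustration):

from collections import Counter

tokens = ["data", "data", "cleaning", "cleaning", "cleaning", "42", "#", "xqzt"]

# Drop purely numeric and symbol tokens
alpha_tokens = [t for t in tokens if t.isalpha()]

# Drop rare tokens that occur only once in the corpus
counts = Counter(alpha_tokens)
frequent = [t for t in alpha_tokens if counts[t] > 1]
print(frequent)   # ['data', 'data', 'cleaning', 'cleaning', 'cleaning']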
Lemmatization or Stemming: Tokens may undergo lemmatization or stemming to reduce inflectional forms to their base or root form. This helps in standardizing tokens and reducing the vocabulary size.
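For example, with NLTK's Porter stemmer and WordNet lemmatizer (assumes the "wordnet" corpus, and in some NLTK versions "omw-1.4", has been downloaded):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "studies", "better"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])                   # rule-based cuts, e.g. 'studi'
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # dictionary base forms, e.g. 'run'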
Handling Out-of-Vocabulary Tokens: Out-of-vocabulary (OOV) tokens, i.e., tokens not present in the vocabulary, may be replaced or handled separately. This can involve using special tokens to represent OOV tokens or performing subword tokenization to handle unknown words.
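A very small sketch of the vocabulary-plus-special-token approach (the vocabulary here is a made-up assumption; subword tokenizers such as BPE are the usual alternative for truly unknown words):

# Replace tokens outside a known vocabulary with a special <UNK> token
vocabulary = {"data", "cleaning", "tokenization", "model"}

tokens = ["data", "cleaning", "hyperparameter", "tokenization"]
handled = [t if t in vocabulary else "<UNK>" for t in tokens]
print(handled)   # ['data', 'cleaning', '<UNK>', 'tokenization']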
Data Integration and Alignment: If the data involves multiple sources or formats, tokenization ensures that the text data is integrated and aligned properly. This may involve aligning tokens across different languages or dialects.
Quality Assurance: Tokenization is accompanied by quality assurance checks to ensure that the tokenized data meets predefined standards. This may involve verifying the correctness of tokenization results and addressing any errors or inconsistencies.
Final Data Formatting: Once tokenization and preprocessing are complete, the data is formatted into a structured format suitable for further analysis or modeling. This may involve converting tokenized text into numerical vectors or other appropriate representations.
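As one possible sketch of this final step, a bag-of-words representation with scikit-learn (the library choice and the two documents are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["clean data is reliable data",
        "tokenization breaks text into tokens"]

# Each document becomes a vector of token counts over the shared vocabulary
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())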
By combining tokenization with other cleaning and preprocessing techniques, the text data is
transformed into a standardized, normalized, and cleaned format, ready for analysis,
modeling, or other NLP tasks. This ensures that the data is reliable, consistent, and conducive
to extracting meaningful insights.