Module 1.2 Data Preprocessing
Prepared by
Fathima Shana E
Assistant Professor
Dept. of ADS
Data Modelling Approach
• Data are individual facts, statistics, or items of information, often numeric.
• A data model is a conceptual framework that organizes and structures data to
represent how data is stored, managed, and processed in a system.
• In practice, it is often expressed as a set of tables and the relationships between them.
1. Data Sources
• Data in business analytics typically comes from various sources:
– Internal: Sales, marketing, operations, HR, finance.
– External: Social media, market research, government reports.
– Real-time: IoT devices, sensors, website interactions.
2. Data Storage
• Businesses use different methods to store and organize data:
– Databases: Structured storage using relational or non-relational
models.
– Data Warehouses: Centralized repositories for analytical data.
– Data Lakes: Storage for raw, unstructured, or semi-structured data.
3. Data Structuring
• Data must be organized into a logical structure to
facilitate analysis:
– Relational Databases: Tables with relationships (SQL
databases).
– Hierarchical Models: Parent-child relationships.
– Dimensional Models: Fact and dimension tables
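To make the dimensional model concrete, here is a minimal pandas sketch, assuming a made-up sales fact table joined to a product dimension table (all table and column names are illustrative, not from the module).

```python
import pandas as pd

# Hypothetical fact table: one row per sales transaction (measurable events).
fact_sales = pd.DataFrame({
    "product_id": [101, 102, 101],
    "store_id": [1, 1, 2],
    "units_sold": [3, 5, 2],
    "revenue": [30.0, 75.0, 20.0],
})

# Hypothetical dimension table: descriptive attributes of each product.
dim_product = pd.DataFrame({
    "product_id": [101, 102],
    "product_name": ["Pen", "Notebook"],
    "category": ["Stationery", "Stationery"],
})

# Star-schema style join: enrich facts with dimension attributes,
# then aggregate revenue by category for analysis.
report = (fact_sales
          .merge(dim_product, on="product_id", how="left")
          .groupby("category", as_index=False)["revenue"].sum())
print(report)
```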
4. Data Cleaning
• Ensuring the data is free from errors and inconsistencies:
– Removing duplicates.
– Handling missing values.
– Standardizing formats.
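A quick pandas sketch of these three cleaning tasks on a toy table (the DataFrame and column names are invented for the example):

```python
import pandas as pd

# Toy customer data with a duplicate row, a missing age, and string-typed dates.
df = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ben", "Carl"],
    "age": [34, 34, None, 45],
    "signup_date": ["2024-01-05", "2024-01-05", "2024-02-05", "2024-03-10"],
})

df = df.drop_duplicates()                                 # removing duplicates
df["age"] = df["age"].fillna(df["age"].median())          # handling missing values
df["signup_date"] = pd.to_datetime(df["signup_date"])     # standardizing formats
print(df)
```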
5. Data Categorization
• Classifying data to improve organization and retrieval:
– By Type: Numerical, categorical, textual, temporal.
– By Source: Customer data, product data, sales data.
– By Purpose: Operational, analytical, strategic.
6. Data Integration
• Combining data from different sources into a single, coherent view
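A minimal pandas sketch of integration, assuming two hypothetical extracts (an internal sales table and a CRM export) that share a customer_id key:

```python
import pandas as pd

# Hypothetical extracts from two different source systems.
sales = pd.DataFrame({"customer_id": [1, 2, 3], "total_spend": [250.0, 90.0, 410.0]})
crm   = pd.DataFrame({"customer_id": [1, 2, 4], "segment": ["gold", "silver", "bronze"]})

# Combine into a single, coherent customer view; an outer join keeps
# customers that appear in only one of the sources.
customer_view = sales.merge(crm, on="customer_id", how="outer")
print(customer_view)
```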
7. Metadata Management
• Maintaining information about the data (e.g., source, definitions,
ownership) to ensure traceability and governance.
8. Data Security and Privacy
• Organizing data with appropriate access controls to protect
sensitive information.
9. Data Accessibility
• Ensuring stakeholders can access the right data efficiently
5 V’s of Business Analytics
1. Velocity is the speed at which the data is created and how fast it moves.
2. Volume is the amount of data qualifying as big data.
E.g.: A retail chain generating millions of sales transactions daily requires scalable storage
solutions like data warehouses or lakes.
3. Value is the business impact derived from data.
Data analytics should drive actionable insights and measurable outcomes.
Example: Using sales data to predict future demand and optimize inventory.
4. Variety is the diversity that exists in the types of data.
• Structured: Databases, spreadsheets.
• Semi-structured: JSON, XML.
• Unstructured: Text, images, videos.
• Example: Combining customer demographics (structured), social media posts (unstructured), and web clickstream data (semi-structured).
5. Veracity is the data's quality and accuracy.
Ensures data is accurate, clean, and consistent for
meaningful analysis.
Example: A company struggling with duplicate or
inconsistent customer records in CRM.
Data cleaning and validation become critical steps in
analytics.
STRUCTURED DATA VS UNSTRUCTURED DATA
Definition
– Structured: Data that is organized in predefined formats, such as rows and columns.
– Unstructured: Data that does not follow a specific structure or predefined format.
Example
– Structured: Customer database with columns for Name, Age, and Email.
– Unstructured: Emails, images, videos, social media posts.
Format
– Structured: Tabular, with clearly defined fields (e.g., spreadsheets, databases).
– Unstructured: Freeform, lacks a rigid structure (e.g., text files, multimedia).
Schema
– Structured: Relies on fixed schemas (e.g., relational models).
– Unstructured: No predefined schema; format is flexible.
Ease of Analysis
– Structured: Easily queried using tools like SQL.
– Unstructured: Requires advanced tools and algorithms (e.g., AI, NLP).
Storage Systems
– Structured: Relational databases (e.g., MySQL).
– Unstructured: Data lakes (e.g., Hadoop).
Examples
– Structured: Tables with rows and columns.
– Unstructured: File systems, object storage.
Processing Tools
– Structured: SQL-based tools, BI platforms (e.g., Tableau, Power BI).
– Unstructured: Machine Learning, NLP, Big Data tools.
Industries
– Structured: Banking, e-commerce, logistics.
– Unstructured: Marketing, healthcare (e.g., medical images), social media analytics.
Example Scenarios
Customer Data
– Structured: Database with Name, Email, Purchase History.
– Unstructured: Emails, customer reviews.
Healthcare
– Structured: Patient records in tables (e.g., age, diagnosis).
– Unstructured: X-ray images, doctor's notes.
Marketing
– Structured: Sales data by product and region.
– Unstructured: Social media posts, advertisements.
Data Analytics Framework
• Data analytics is the process of examining data to uncover useful information
and support decision-making.
• It involves collecting raw data from different sources, cleaning and organizing
it, and using tools and techniques to analyze it.
5. Data Modeling
• Select the appropriate analytical approach (descriptive, diagnostic,
predictive, or prescriptive).
• Apply statistical methods or machine learning algorithms.
• Train and validate models to ensure accuracy and reliability.
6. Interpret Results
• Analyze model outputs to derive insights.
• Relate findings back to business goals and KPIs.
• Use visualization tools to communicate insights effectively (dashboards,
reports).
7. Decision Making
• Use insights to support or refine business strategies.
• Develop action plans based on the analysis.
Types of Data Analytics Frameworks
1. Descriptive Analytics:
• Descriptive analytics is a branch of data analytics that focuses on
summarizing historical data to gain insights into past events or
phenomena. It involves organizing and presenting data in a meaningful
way through visualization techniques, such as charts, graphs, and
dashboards. Descriptive analytics aims to provide a clear and concise
snapshot of what has happened.
2. Diagnostic Analytics:
• Diagnostic analytics is a form of data analytics that delves deeper into
understanding the root causes and reasons behind specific events or
outcomes. It goes beyond descriptive analytics by investigating the
relationships between variables to uncover insights and explanations.
Diagnostic analytics involves conducting exploratory analysis and
applying statistical techniques to identify patterns, correlations, and
anomalies within the data.
3. Predictive Analytics:
Predictive analytics is a field within data analytics that employs historical data
and statistical modelling methods to predict future outcomes or trends. Its
objective is to make well-informed forecasts and estimations based on the
analysis of patterns, correlations, and connections present in the data. By
utilizing a range of statistical and machine learning algorithms, predictive
analytics creates predictive models that enable organizations to anticipate
customer behavior, market trends, demand patterns, and other important
factors.
4. Prescriptive Analytics:
Prescriptive analytics is an advanced field in data analytics that
employs historical data, mathematical models, optimization algorithms,
and simulation methods to offer guidance on the best actions or decisions
to attain desired outcomes. Unlike descriptive and predictive analytics,
which concentrate on understanding past occurrences and forecasting
future trends, prescriptive analytics takes an additional step by
proposing precise courses of action.
5. Cognitive Analytics :
• Cognitive analytics refers to the application of advanced
technologies and techniques that enable systems and machines
to mimic human cognitive abilities, such as perception, learning,
reasoning, and problem-solving.
Metabase
Features:
1. Dashboards with automatic refreshing and full-screen view
2. SQL Mode for data analysts and professionals
3. Establish standardized segments and metrics for team-wide use
4. Schedule data delivery to Slack or email through dashboard
subscriptions
5. Access data in Slack at any time using MetaBot
6. Simplify data for your team by renaming, annotating, and concealing
fields
7. Great UX and user-friendly interface.
• BIRT (Business Intelligence and Reporting Tools)
• BIRT is a versatile open-source business intelligence tool focusing on reporting and data visualization.
• It enables users to design, generate, and view reports with customizable templates and interactive charts, making it a robust choice for data reporting.
• Strong focus on report generation.
• BIRT has two main components:
• BIRT designer: A graphical tool for designing and developing reports
• BIRT runtime engine: Provides support for running reports and rendering
published report output
Pentaho
• Pentaho is an open-source Business Intelligence (BI) suite that
offers comprehensive data integration and analytics capabilities.
• It is used for data management, reporting, data mining, and
dashboarding.
• Pentaho's main strength is its end-to-end BI functionality,
allowing users to collect, store, analyze, and visualize data in
various ways.
• Pentaho Data Integration (PDI): Integrates data from various sources like databases and transforms it using complex operations like filtering, aggregating, and merging.
• Load the data into databases, data warehouses, or other
destinations
• Pentaho Business Analytics: Provides tools
for reporting, dashboards, and visualizations. It includes a Report
Designer and an Interactive Dashboard.
Jaspersoft
• Jaspersoft is a renowned open-source business intelligence tool
with robust reporting, dashboards, and data analysis
capabilities. It is widely used for creating and delivering reports
and interactive data visualizations, making it a valuable choice
for data-driven decision-making.
Features:
1.Comprehensive reporting and dashboard creation
2.Rich library of data visualizations and chart types
3. Integration with various data sources and databases
4. Multi-tenancy support for secure sharing
5. Advanced reporting features like ad-hoc reporting and
scheduling
Helical Insight
• Helical Insight is a self-service open-source BI and reporting
tool.
• It empowers businesses to explore data, create customized
reports, and share insights using Machine Learning and NLP.
• Its focus on self-service makes it accessible to users across the
organization.
Features:
1. Web-based Business Intelligence software
2. Interaction with organizational data
3. Utilizes Machine Learning and NLP (Natural Language
Processing)
4. Custom workflow specification
Redash
• Redash is an open-source tool for data visualization and
collaboration.
• It facilitates organizations in taking a more data-driven approach
by providing them with tools for democratizing data access. It
also has a good range of out-of-the-box dashboards.
Features:
1. Data source connections for querying and visualization
2. Interactive and shareable dashboards
3. Collaboration and sharing features for reports and queries
4. Scheduled and automated report generation
5. Customizable visualizations and chart types
6. Extensible through plugins and API integration
Data Cleaning
• Data cleaning is the process of removing incorrect, inconsistent, incomplete, and inaccurate data from datasets; it also replaces missing values.
• It is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset.
Steps in Data Cleaning
1. Handling Missing Values
2. Noisy Data
3. Data Cleaning as a Process
1. Handling Missing Values
• Deletion: Remove rows or columns with excessive missing data.
• Imputation: Fill missing values using:
Mean/Median/Mode: For numerical data.
Forward/Backward Fill: Use adjacent values to fill gaps (e.g., time-series data).
Predictive Imputation: Use machine learning models to
estimate missing values based on other features.
Steps in Handling Missing Values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use a measure of central tendency for the attribute (e.g., the mean, median, or mode) to fill in the missing value
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
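Below is a minimal pandas sketch of two of the options above, central-tendency (median) imputation and forward fill; the column names and values are invented for the example.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 58000],   # numeric attribute with gaps
    "daily_sales": [200, 210, np.nan, 230, 240],        # time-ordered attribute with a gap
})

# Option 4: fill with a measure of central tendency (median here).
df["income"] = df["income"].fillna(df["income"].median())

# Forward fill: reuse the previous observation, common for time-series gaps.
df["daily_sales"] = df["daily_sales"].ffill()
print(df)
```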
2. Noisy Data
• Noisy data refers to data that contains errors, outliers, or
irrelevant information, making it difficult to analyze and interpret.
• Noise in data can distort the results of analysis, affect model
performance, and compromise decision-making.
• Noisy data has a low Signal-to-Noise Ratio.
• Data = true signal + noise
• Noisy data unnecessarily increases the amount of storage space
required and can adversely affect any data mining analysis results.
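To make the "Data = true signal + noise" idea concrete, here is a small NumPy sketch that builds a noisy series and computes a simple signal-to-noise ratio (the sine signal and the noise level are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t)              # true signal
noise = rng.normal(scale=0.8, size=t.shape)     # random noise
data = signal + noise                           # Data = true signal + noise

# Simple signal-to-noise ratio: power of the signal vs power of the noise.
snr = (signal ** 2).mean() / (noise ** 2).mean()
print(f"SNR = {snr:.2f} (low values indicate noisy data)")
```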
OUTLIER DETECTION: SEMI-SUPERVISED METHODS
1. Assume Normality:
• Semi-supervised methods typically assume that most of the data points are normal, and only a small fraction are outliers.
2. Train on Labeled Normal Data:
• Use the labeled normal data to learn the patterns or boundaries of normal behavior.
3. Analyze Unlabeled Data:
• Predict the likelihood of data points in the unlabeled dataset being normal or anomalous based on the learned model.
4. Identify Outliers: Points that deviate significantly from the learned normal behavior are flagged as outliers.
• E.g.: DBSCAN, K-Means
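A minimal sketch of this train-on-normal, score-unlabeled workflow using K-Means (one of the algorithms named above); the cluster count and the 95th-percentile distance threshold are my own illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # labeled normal data
unlabeled = np.vstack([rng.normal(size=(50, 2)),                # mostly normal points
                       rng.normal(loc=6.0, size=(5, 2))])       # a few anomalies

# Step 2: train on labeled normal data to learn the shape of normal behavior.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal_train)

# Distance of each training point to its nearest learned cluster center.
train_dist = km.transform(normal_train).min(axis=1)
threshold = np.percentile(train_dist, 95)        # assumed cutoff for "normal"

# Steps 3-4: score unlabeled data and flag points that deviate from normal behavior.
unlabeled_dist = km.transform(unlabeled).min(axis=1)
outliers = unlabeled[unlabeled_dist > threshold]
print(f"Flagged {len(outliers)} of {len(unlabeled)} unlabeled points as outliers")
```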
OUTLIER DETECTION: UNSUPERVISED METHODS
• These methods use only unlabeled data to identify outliers.
• There are no predefined labels for normal or anomalous data
points.
• These methods identify outliers by analyzing the data's inherent
patterns, structures, and statistical properties without the need
for labeled examples.
• For example, unsupervised outlier detection methods can use
density-based or distance-based methods to identify data points
that are far away from the rest of the data.
• Some popular unsupervised methods include k-nearest neighbor (KNN)-based methods, DBSCAN, and Isolation Forest.
• No Labels Required: Works with unlabeled data, assuming that
most data points represent normal behavior
• Assumption: Outliers are rare and differ substantially from the
rest of the data.
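A minimal sketch using Isolation Forest (one of the methods listed above), fitted directly on unlabeled data; the contamination value is an assumed guess at the outlier fraction.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(300, 2)),                   # bulk of the data: normal behavior
               rng.uniform(low=-8, high=8, size=(6, 2))])   # a few scattered anomalies

# No labels are used: the model isolates points that differ from the rest.
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)            # +1 = inlier, -1 = outlier
print("Outliers found:", (labels == -1).sum())
```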
OUTLIER DETECTION: STATISTICAL METHODS
• Statistical methods for outlier detection rely on assumptions
about the distribution of the data to identify points that deviate
significantly from the expected pattern.
• These methods are particularly useful when you know or
assume that the data follows a certain distribution, such as
normal distribution or other well-understood statistical
properties.
• The data not following the model are outliers.
1. Z-Score (Standard Score)
• This method standardizes each data point as Z = (x - mean) / standard deviation and identifies outliers as those with Z-scores exceeding a certain threshold (typically above 3 or below -3).
• It measures how far a data point is from the mean in terms of
standard deviations.
• If the Z-score of a point is greater than a threshold (commonly ∣Z∣>3), it is
considered an outlier.
• A high Z-score means the data point is far from the mean, while a Z-score
close to 0 means it’s near the mean.
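A small NumPy sketch of the Z-score rule with the |Z| > 3 threshold; the sample values (and the injected outlier) are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # 120 is an injected outlier

z = (x - x.mean()) / x.std()          # Z = (x - mean) / standard deviation
outliers = x[np.abs(z) > 3]           # common threshold: |Z| > 3
print("Outliers:", outliers)
```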
2. Interquartile Range (IQR)
The IQR method is based on dividing the data into quartiles (Q1, Q3) and identifying outliers as points that fall outside the range defined by these quartiles:
IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 × IQR and Upper Bound = Q3 + 1.5 × IQR (1.5 is the conventional multiplier)
where
• Q1 is the lower quartile (25th percentile) and Q3 is the upper quartile (75th percentile).
• Points below the lower bound or above the upper bound are considered outliers.
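A matching NumPy sketch of the IQR rule with the 1.5 × IQR fences; the sample values are made up.

```python
import numpy as np

x = np.array([7, 9, 10, 10, 11, 11, 12, 13, 14, 42], dtype=float)  # 42 looks suspicious

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # conventional 1.5 * IQR fences
outliers = x[(x < lower) | (x > upper)]
print(f"Q1={q1}, Q3={q3}, bounds=({lower}, {upper}), outliers={outliers}")
```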
OUTLIER DETECTION: PROXIMITY-BASED METHODS
• An object is considered an outlier if its nearest neighbors are
significantly far away, meaning the proximity of the object deviates
substantially from the proximity of most
other objects in the same dataset.
• Example:
– Model the proximity of an object using its 3 nearest neighbors.
– Objects in region R (as shown in the figure) are substantially
different from other objects in the dataset.
– Thus, the objects in region R are considered outliers.
• Two Major Types of Proximity-Based Outlier Detection:
1. Distance-based, e.g., KNN
2. Density-based, e.g., DBSCAN
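A minimal sketch of the distance-based (KNN) variant: each point is scored by the distance to its 3rd nearest neighbor, echoing the 3-nearest-neighbor example above; the cutoff (top 2% of scores) is an assumed choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(300, 2)),               # dense "normal" region
               np.array([[8.0, 8.0], [9.0, -7.0]])])    # two isolated points

# Distance to the 3rd nearest neighbor (column 0 is the point itself).
nn = NearestNeighbors(n_neighbors=4).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, 3]

threshold = np.percentile(score, 98)                    # assumed cutoff: top 2% of scores
print("Proximity-based outliers:\n", X[score > threshold])
```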
OUTLIER DETECTION: CLUSTERING-BASED METHODS
• Clustering-based methods are used for outlier detection by
utilizing clustering algorithms to group data into clusters.
• Objects that do not fit well into any cluster or belong to very
small clusters are considered outliers.
• In clustering, data points are grouped based on similarity
(distance or density).
• Outliers are points that:
– Do not belong to any cluster or
– Are assigned to small or sparse clusters that significantly deviate from larger clusters.
Examples
1. K-Means-Based Outlier Detection:
• In K-Means clustering, data points are assigned to the nearest cluster center.
• Outliers can be detected by calculating the distance of each
point to its nearest cluster center.
– If the distance exceeds a certain threshold, the point is
labeled as an outlier.
2. DBSCAN-Based Outlier Detection:
• Points that do not belong to any cluster (i.e., they are in low-density regions) are marked as noise or outliers.
3. Hierarchical Clustering-Based Outlier Detection:
• In hierarchical clustering, clusters are formed in a hierarchical
manner (either bottom-up or top-down).
• Outliers can be identified as:
– Points that are far from their respective clusters.
– Points that form singleton clusters (clusters with only one
or very few points).
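A minimal sketch of the K-Means and DBSCAN variants described above; the distance threshold and the DBSCAN parameters (eps, min_samples) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
               np.array([[10.0, -10.0]])])           # one far-away point

# K-Means-based: flag points unusually far from their nearest cluster center.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)
km_outliers = X[dist > np.percentile(dist, 99)]      # assumed distance threshold

# DBSCAN-based: points labeled -1 fall in low-density regions (noise/outliers).
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
db_outliers = X[db.labels_ == -1]

print("K-Means flags:", len(km_outliers), "| DBSCAN flags:", len(db_outliers))
```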