
Data Mining

 Data mining is the process of extracting valuable insights and patterns from large datasets, often using statistical analysis and machine learning techniques. It helps organizations discover hidden information, understand complex phenomena, and make informed decisions.
Aspects of Data Mining:
• Analysis: Data mining involves analyzing large datasets to identify patterns, trends, and relationships.
• Pattern Recognition: It helps uncover hidden information and predict future outcomes based on past data.
• Knowledge Discovery: Data mining transforms raw data into actionable knowledge, enabling organizations to make better decisions.
• Techniques: Common techniques include classification, clustering, association rule learning, and predictive modeling.
• Applications: Data mining is used in various fields like marketing, finance, healthcare, and telecommunications.
Data Mining Process:
• Problem Definition: Identifying the business question or objective that needs to be answered.
• Data Collection and Preparation: Gathering and cleaning data from various sources.
• Model Building: Using machine learning or statistical methods to create predictive models.

Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Functionalities:
•Classification:
•Organizes data into predefined categories, like classifying emails as spam or not spam.
•Clustering:
•Groups similar data points together, helping identify distinct segments or groups within
the data.
•Regression:
•Predicts numerical values based on existing data, like forecasting sales based on historical
trends.
•Association Rules:
•Discovers relationships between different variables, such as identifying which items are
frequently purchased together in a grocery store.
•Anomaly Detection:
•Identifies unusual or out-of-the-ordinary data points that may indicate errors or interesting
patterns.
•Visualization:
•Presents data in a visually appealing and understandable format, such as charts and graphs, making it easier to interpret insights.
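
To make two of these functionalities concrete, here is a minimal sketch of classification and clustering with scikit-learn on a small synthetic dataset; the dataset, model choices, and parameters are illustrative assumptions, not a prescribed recipe.

```python
# Sketch of two data mining functionalities: classification and clustering.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Classification: assign records to predefined categories.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: group similar records without predefined labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [(km.labels_ == k).sum() for k in range(2)])
```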
In data mining, data processing, or preprocessing, involves
transforming raw data into a usable format by cleaning, integrating,
reducing, and transforming it, ensuring data quality and suitability for
analysis and model building.
1. Data Cleaning:
•Handling Missing Values: Identifying and addressing missing data points, either by imputation (filling in with estimates) or removal.
•Removing Outliers: Detecting and dealing with extreme values that can skew analysis, either by removing them or transforming them.
•Correcting Inconsistencies: Addressing errors, duplicates, and inconsistencies in the data to ensure accuracy.
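
A minimal pandas sketch of these cleaning steps; the columns, values, and the plausibility range for age are made up purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 42, 42, 250],          # 250 is an implausible value
    "city":   ["Delhi", "delhi", "Mumbai", "Mumbai", "Delhi"],
    "income": [30000, 45000, 52000, 52000, None],
})

# Handling missing values: impute numeric columns with the median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Removing outliers: here, a simple plausibility-range check on age.
df = df[df["age"].between(0, 120)]

# Correcting inconsistencies: normalise text case, then drop duplicates.
df["city"] = df["city"].str.title()
df = df.drop_duplicates()
print(df)
```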
2. Data Integration:
•Combining Data Sources: Merging data from multiple sources (databases, files) into a unified dataset.
•Handling Schema Differences: Resolving discrepancies in data formats, attribute names, and data types across different sources.
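
A sketch of integrating two hypothetical sources whose schemas disagree on a key column's name and type; the table names and values are assumptions.

```python
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Raj", "Li"]})
billing = pd.DataFrame({"customer_id": ["1", "2", "4"], "amount": [120.0, 75.5, 30.0]})

# Handling schema differences: align attribute names and data types.
billing = billing.rename(columns={"customer_id": "cust_id"})
billing["cust_id"] = billing["cust_id"].astype(int)

# Combining data sources: merge into a unified dataset (outer join keeps all rows).
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)
```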

3. Data Reduction:
•Dimensionality Reduction: Reducing the number of variables (features) while preserving relevant information.
•Data Compression: Reducing the size of the dataset for efficient storage and processing.
•Sampling: Selecting a representative subset of the data for analysis, especially when dealing with large datasets.
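
As a quick sketch of the sampling idea, the snippet below keeps a 1% random subset of a synthetic table for faster exploratory analysis; the sizes are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.normal(size=100_000),
                    "y": rng.integers(0, 5, size=100_000)})

# Sampling: keep a representative 1% subset of the rows.
subset = big.sample(frac=0.01, random_state=0)
print(len(big), "->", len(subset))
```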
4. Data Transformation:
•Normalization/Standardization: Scaling data values to a common range (e.g., 0 to 1, or to zero mean and unit variance).
•Encoding Categorical Data: Converting categorical variables (e.g., colors, types) into numerical representations suitable for algorithms.
•Feature Engineering: Creating new features or attributes from existing ones to improve model performance.
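
A sketch of these transformation steps with pandas and scikit-learn; the column names and the derived feature are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"height_cm": [150.0, 165.0, 180.0],
                   "weight_kg": [50.0, 68.0, 82.0],
                   "colour": ["red", "blue", "red"]})

# Feature engineering: derive a new attribute from existing ones.
df["weight_per_cm"] = df["weight_kg"] / df["height_cm"]

# Encoding categorical data: one-hot encode the 'colour' column.
df = pd.get_dummies(df, columns=["colour"])

# Normalization: scale numeric values to the 0-1 range.
num_cols = ["height_cm", "weight_kg", "weight_per_cm"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```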

1. Historical Forms of Data Processing
Historically, data processing was categorized by the level of human intervention and the technology available:
•Manual Data Processing:
Before the advent of electronic systems, data processing was
performed by humans. This includes activities such as bookkeeping and
manual record keeping where data was entered, sorted, and analyzed
by hand.

•Mechanical (or Electromechanical) Data Processing:
With the advent of devices like the punched card systems pioneered by Herman Hollerith for the 1890 U.S. Census, mechanical means were used to collect and process data.
•Distributed and Parallel Processing:
For extremely large data sets, processing tasks can be divided among multiple computers (distributed processing) or multiple processors (parallel processing) within a single system. These forms of processing enable handling “big data” efficiently by scaling horizontally or concurrently executing tasks.

Data cleaning, also known as data cleansing or data scrubbing, is the process of
identifying and correcting or removing errors, inconsistencies, inaccuracies, and
corrupt records from a dataset. It ensures that data is accurate, consistent, and
usable, which is fundamental for building reliable and effective artificial
intelligence (AI) and machine learning (ML) models.
Common Data Issues Addressed
•Missing Values: Incomplete records that can skew analysis.
•Duplicate Entries: Redundant data that inflates datasets.
•Inconsistent Formatting: Variations in data presentation (e.g., date formats).
•Outliers: Anomalous data points that may distort results.
•Typographical Errors: Mistakes in data entry that lead to inaccuracies.
Best Practices for Effective Data Cleaning
•Understand the Data: Gain a comprehensive understanding of the dataset's structure and content.
•Use Automation Tools: Leverage data cleaning software to streamline the process.
•Maintain Documentation: Keep detailed records of cleaning procedures for transparency and reproducibility.
•Regularly Update Data: Implement routine checks to ensure ongoing data quality.
By implementing these techniques and best practices, organizations can enhance the quality of their data, leading to more accurate analyses and informed decision-making.
Types of Missing Data
Understanding the nature of missing data is essential for selecting
appropriate handling techniques.
The main categories include:

Missing Completely at Random (MCAR):
•The probability of data being missing is independent of both observed and unobserved data. In this scenario, the missingness does not relate to any other data values. For example, survey respondents may skip questions randomly due to external factors like accidental omissions.
Missing at Random (MAR):
•The missingness is related to observed data but not the missing data
itself. For instance, if younger individuals are less likely to report their
income, the missing income data depends on age but not on the income
values themselves.
Missing Not at Random (MNAR):
•The missingness is related to the value of the missing data itself. For
example, individuals with higher incomes might choose not to disclose
their earnings, leading to missing data that depends on the unreported
income values.
Data cleaning is a crucial step in data preprocessing to ensure your
dataset is accurate, consistent, and usable for analysis or modeling.
Two common issues in raw data are missing values and noisy data.
Here’s a breakdown of both:
1. Missing Values
Causes:
•Human error (e.g., incomplete surveys)
•Data corruption
•Incompatibility during data merges
•Sensor or equipment malfunction

2. Noisy Data
Noisy data means random errors or variances that distort the dataset.
Sources:
•Data entry errors
•Faulty sensors
•Communication errors
•Outliers
Handling Techniques:
1. Smoothing Techniques:
   • Moving average: Replace data with the average of neighboring values.
   • Bin smoothing: Group data into bins and replace values with the bin mean/median.
2. Clustering: Group data into clusters (e.g., using K-Means) and remove or smooth data points that don’t fit well.
3. Regression or model-based methods: Fit a model and use residuals to detect unusual patterns or noise.
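
As a small sketch of moving-average smoothing, the snippet below damps random noise in a synthetic series with a centered 5-point rolling mean; the signal and window size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(np.sin(np.linspace(0, 6, 60)) + rng.normal(scale=0.3, size=60))

# Moving average: replace each value with the mean of its 5-point neighbourhood.
smoothed = signal.rolling(window=5, center=True, min_periods=1).mean()
print(smoothed.head())
```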

1. Binning (Smoothing by binning)
Binning is a data smoothing technique used to reduce the effect of minor observation errors or noise. It groups continuous data into bins (intervals).
📌 Types of Binning:
•Equal-width binning: Bins of the same size (range).
•Equal-frequency binning: Each bin has the same number of values.
•Smoothing methods within bins:
• By mean: Replace values in a bin with the mean of the bin.
• By median: Use the median instead.
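
A pandas sketch of these ideas: pd.cut produces equal-width bins, pd.qcut produces (roughly) equal-frequency bins, and a group-wise transform does smoothing by bin mean. The sample values are illustrative.

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(values, bins=3)      # bins of equal range
equal_freq = pd.qcut(values, q=3)         # bins with (roughly) equal counts

# Smoothing by bin mean: replace each value with the mean of its bin.
smoothed = values.groupby(equal_width, observed=True).transform("mean")
print(pd.DataFrame({"value": values, "bin": equal_width, "smoothed": smoothed}))
```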
2. Clustering for Cleaning
Clustering can help detect outliers or inconsistencies. If a data point
doesn't belong well to any cluster, it's likely an anomaly or noise.
📌 Use cases:
•Group similar records.
•Identify and remove outliers (points far from any cluster center)
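
A sketch of flagging likely noise by distance to the nearest K-Means centre; the synthetic data, cluster count, and the mean-plus-3-standard-deviations threshold are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2)), [[25.0, 25.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centre = np.min(km.transform(X), axis=1)   # distance to nearest centroid

threshold = dist_to_centre.mean() + 3 * dist_to_centre.std()
outliers = np.where(dist_to_centre > threshold)[0]
print("suspected noise points:", outliers)
```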

3. Regression for Cleaning or Imputation
Regression can be used to predict missing or noisy values based on relationships with other variables.
📌 Use cases:
•Predict a missing column using linear regression.
•Identify anomalies using regression residuals (difference between
actual & predicted).
4. Computer and Human Inspection
Sometimes automated methods aren't enough, especially in critical or
high-risk domains. A hybrid approach helps.
✅ Computer-Based Inspection:
•Use rules or algorithms to flag suspicious data.
•Automate checks (e.g., null checks, range checks, duplicates).
•Use visualization tools (histograms, boxplots) to find patterns.
👀 Human-Based Inspection:
•Data analysts or domain experts manually review flagged or random
samples.
•Especially useful for categorical data or textual entries.
🧠 Tip: Use both!
•Computer: Fast and scalable.
•Human: Context-aware and nuanced
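
A sketch of the computer-based side: simple automated checks (null, range, duplicate) that flag rows for a human reviewer; the columns and the 0-120 age range are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, -3, 40, 40, None],
                   "email": ["a@x.com", "b@x.com", "c@x", "c@x", "d@x.com"]})

flags = pd.DataFrame({
    "null_check":  df.isna().any(axis=1),        # missing fields
    "range_check": ~df["age"].between(0, 120),   # implausible ages
    "dup_check":   df.duplicated(keep=False),    # exact duplicates
})

to_review = df[flags.any(axis=1)]   # hand these rows to a human reviewer
print(to_review)
```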
Data Reduction: Data Cube Aggregation
📌 What is Data Reduction?
Data reduction is the process of reducing the volume of data while
maintaining its integrity and analytical value. It makes data
processing more efficient, especially in big data contexts.

What is Data Cube Aggregation?
Data Cube Aggregation is a technique used to summarize and group multidimensional data by applying aggregation functions like sum, average, count, etc., over combinations of dimensions. Think of it like building a multi-dimensional pivot table.
Main Concepts:
•Dimensions: Categories or attributes (e.g., Time, Region, Product)
•Measures: Numeric values to be aggregated (e.g., Sales, Revenue)
•Aggregation Levels: Can be rolled up (more summarized) or drilled down (more detailed)
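
A small sketch of cube-style aggregation using a pandas pivot table, with Region and Quarter as dimensions and Sales as the measure; the figures are made up.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South", "North", "South"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "Sales":   [100, 120, 90, 110, 80, 60],
})

# Aggregate the Sales measure over the Region x Quarter dimensions.
cube = sales.pivot_table(values="Sales", index="Region", columns="Quarter",
                         aggfunc="sum", margins=True)   # margins = roll-up totals
print(cube)
```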
Dimensionality Reduction
Dimensionality reduction is a technique used in machine learning and
data analysis to reduce the number of input variables (features) in a
dataset while preserving as much information as possible. It’s especially
useful when dealing with high-dimensional data, which can be hard to
visualize and may lead to issues like overfitting.

Common Techniques:
•PCA (Principal Component Analysis) – Projects data to a lower-dimensional space using linear combinations of features.
•t-SNE (t-Distributed Stochastic Neighbor Embedding) – Good for visualizing high-dimensional data in 2D/3D.
•UMAP (Uniform Manifold Approximation and Projection) – Similar to t-SNE, but faster, more scalable, and preserves more global structure; good for visualization and clustering.
Used for:
•Visualization of high-dimensional data
•Noise reduction
•Speeding up algorithms
•Avoiding the curse of dimensionality
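
A minimal PCA sketch: standardize 4-dimensional data and project it down to 2 components; the Iris dataset is used here only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 150 samples x 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```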

Data Compression
This refers to encoding data using fewer bits. It can be lossless (no
info lost) or lossy (some data sacrificed for better compression).
Examples:
•ZIP files – Lossless compression of general data
•JPEG, MP3 – Lossy compression for images/audio
•Autoencoders – Neural networks trained to compress and
reconstruct data (learn an efficient encoding)
Techniques:
•Huffman Coding
•Run-Length Encoding
•LZW (Lempel–Ziv–Welch)
•Autoencoders (again — yes, they can be used for compression too!)
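
To illustrate the lossless idea, here is a toy run-length encoding sketch; real compressors are far more sophisticated, so treat this only as a demonstration of the principle.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Replace runs of repeated characters with (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("AAAABBBCCD")
assert rle_decode(encoded) == "AAAABBBCCD"   # lossless: original fully recovered
print(encoded)                               # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```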
Numerosity Reduction
Numerosity reduction refers to techniques that reduce the volume of
data by replacing the original data with a smaller representation
that is more compact but still maintains the essential properties and
patterns of the data.

It's particularly useful when dealing with large datasets to improve processing speed, reduce storage, and simplify analysis.

Main Techniques of Numerosity Reduction:
1. Parametric Methods
Replace the data with a model. You don't store the data, just the model
parameters.
•Regression models: Fit the data with a linear, polynomial, or
nonlinear function.
•Logistic regression, exponential models, etc.
•Clustering models: Replace data with the cluster center and perhaps
the number of points per cluster (e.g., K-means).
2. Non-Parametric Methods
Reduce data without assuming any specific model form.
•Histograms: Divide data into bins and store the bin ranges and
frequencies.
•Data cube aggregation: Aggregate data across different dimensions
(e.g., monthly → quarterly).
•Sampling: Use a subset that statistically represents the full dataset.
•Cluster-based reduction: Store cluster centroids instead of all data
points.
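
A sketch of two of the non-parametric ideas above: replacing raw values with histogram bin edges and counts, or with a random sample; the synthetic data and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)

# Histogram: store only bin edges and frequencies instead of 100,000 values.
counts, edges = np.histogram(data, bins=20)
print("stored numbers:", len(counts) + len(edges), "instead of", len(data))

# Sampling: a 1% subset that still reflects the distribution.
sample = rng.choice(data, size=1_000, replace=False)
print("sample mean vs full mean:", sample.mean().round(2), data.mean().round(2))
```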
What is Clustering?
Clustering groups similar data points together into clusters based on
a similarity metric (like Euclidean distance). Once you have these
clusters, you can represent each group by its centroid (average
position), thereby reducing the number of data points you need to
store or analyze.
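
A sketch of cluster-based reduction: keep only the K-Means centroids (plus the number of points per cluster) instead of every original point; the data and the choice of 50 clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 2))

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(points)
centroids = km.cluster_centers_                      # 50 representatives
counts = np.bincount(km.labels_, minlength=50)       # points per cluster

print("reduced", len(points), "points to", len(centroids), "centroids")
```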
Discretization
Discretization is the process of transforming continuous attributes
(numerical values) into discrete values
(categories or intervals).
For example: A continuous attribute like Age = [0, 100] might be discretized into:
0–12 → Child
13–19 → Teen
20–64 → Adult
65+ → Senior
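
A sketch of the Age example above using pd.cut; the sample ages are made up, and the bin edges follow the intervals listed.

```python
import pandas as pd

ages = pd.Series([4, 15, 23, 37, 70, 81])
bins = [0, 12, 19, 64, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

# Discretize the continuous Age attribute into labelled intervals.
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"age": ages, "group": age_group}))
```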

Why Discretize?
•Simplifies models and reduces noise
•Helps algorithms that work better with categorical data (like decision
trees)
•Enables generation of concept hierarchies
Concept Hierarchy Generation
Concept hierarchy involves organizing data from low-level concepts
(raw data) to higher-level concepts — think granularity levels.

Example Hierarchy:
For the attribute Location:
City → State → Country → Continent

Types of Concept Hierarchies:
•Schema hierarchy: Already exists in the data schema (e.g., Date → Month → Year)
•Set-grouping hierarchy: Based on grouping values (e.g., ZIP codes → Cities)
•Rule-based hierarchy: Created by user-defined rules (e.g., income levels)
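
A sketch of climbing one level of a set-grouping style hierarchy by mapping City up to Country and aggregating; the mapping and sales figures are illustrative.

```python
import pandas as pd

city_to_country = {"Delhi": "India", "Mumbai": "India",
                   "Paris": "France", "Lyon": "France"}

sales = pd.DataFrame({"City": ["Delhi", "Paris", "Mumbai", "Lyon"],
                      "Sales": [100, 80, 60, 40]})

sales["Country"] = sales["City"].map(city_to_country)   # climb one hierarchy level
print(sales.groupby("Country")["Sales"].sum())          # roll-up aggregation
```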
