
Data Mining

 Data mining is the process of extracting valuable insights and patterns from large datasets, often using statistical analysis and machine learning techniques. It helps organizations discover hidden information, understand complex phenomena, and make informed decisions.
Aspects of Data Mining:
• Analysis: Data mining involves analyzing large datasets to identify patterns, trends, and relationships.
• Pattern Recognition: It helps uncover hidden information and predict future outcomes based on past data.
• Knowledge Discovery: Data mining transforms raw data into actionable knowledge, enabling organizations to make better decisions.
• Techniques: Common techniques include classification, clustering, association rule learning, and predictive modeling.
• Applications: Data mining is used in various fields like marketing, finance, healthcare, and telecommunications.
Data Mining Process:
• Problem Definition: Identifying the business question or objective that needs to be answered.
• Data Collection and Preparation: Gathering and cleaning data from various sources.
• Model Building: Using machine learning or statistical methods to create predictive models.

Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Functionalities:
•Classification:
•Organizes data into predefined categories, like classifying emails as spam or not spam.
•Clustering:
•Groups similar data points together, helping identify distinct segments or groups within
the data.
•Regression:
•Predicts numerical values based on existing data, like forecasting sales based on historical
trends.
•Association Rules:
•Discovers relationships between different variables, such as identifying which items are
frequently purchased together in a grocery store.
•Anomaly Detection:
•Identifies unusual or out-of-the-ordinary data points that may indicate errors or interesting
patterns.
•Visualization:
•Presents data in a visually appealing and understandable format, such as charts and graphs, making it easier to interpret insights.
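
To make two of these functionalities concrete, here is a minimal sketch of classification and clustering with scikit-learn on a small synthetic dataset; the dataset, model choices, and parameters are illustrative assumptions, not a prescribed recipe.

```python
# Sketch of two data mining functionalities: classification and clustering.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Classification: assign records to predefined categories.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: group similar records without predefined labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [(km.labels_ == k).sum() for k in range(2)])
```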
In data mining, data processing, or preprocessing, involves
transforming raw data into a usable format by cleaning, integrating,
reducing, and transforming it, ensuring data quality and suitability for
analysis and model building.
1. Data Cleaning:
•Handling Missing Values: Identifying and addressing missing data points, either by imputation (filling in with estimates) or removal.
•Removing Outliers: Detecting and dealing with extreme values that can skew analysis, either by removing them or transforming them.
•Correcting Inconsistencies: Addressing errors, duplicates, and inconsistencies in the data to ensure accuracy.
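
A minimal pandas sketch of these cleaning steps; the columns, values, and the plausibility range for age are made up purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 42, 42, 250],          # 250 is an implausible value
    "city":   ["Delhi", "delhi", "Mumbai", "Mumbai", "Delhi"],
    "income": [30000, 45000, 52000, 52000, None],
})

# Handling missing values: impute numeric columns with the median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Removing outliers: here, a simple plausibility-range check on age.
df = df[df["age"].between(0, 120)]

# Correcting inconsistencies: normalise text case, then drop duplicates.
df["city"] = df["city"].str.title()
df = df.drop_duplicates()
print(df)
```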
2. Data Integration:
•Combining Data Sources: Merging data from multiple sources (databases, files) into a unified dataset.
•Handling Schema Differences: Resolving discrepancies in data formats, attribute names, and data types across different sources.
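
A sketch of integrating two hypothetical sources whose schemas disagree on a key column's name and type; the table names and values are assumptions.

```python
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Raj", "Li"]})
billing = pd.DataFrame({"customer_id": ["1", "2", "4"], "amount": [120.0, 75.5, 30.0]})

# Handling schema differences: align attribute names and data types.
billing = billing.rename(columns={"customer_id": "cust_id"})
billing["cust_id"] = billing["cust_id"].astype(int)

# Combining data sources: merge into a unified dataset (outer join keeps all rows).
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)
```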

3. Data Reduction:
•Dimensionality Reduction: Reducing the number of variables (features) while preserving relevant information.
•Data Compression: Reducing the size of the dataset for efficient storage and processing.
•Sampling: Selecting a representative subset of the data for analysis, especially when dealing with large datasets.
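
As a quick sketch of the sampling idea, the snippet below keeps a 1% random subset of a synthetic table for faster exploratory analysis; the sizes are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.normal(size=100_000),
                    "y": rng.integers(0, 5, size=100_000)})

# Sampling: keep a representative 1% subset of the rows.
subset = big.sample(frac=0.01, random_state=0)
print(len(big), "->", len(subset))
```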
4. Data Transformation:
•Normalization/Standardization: Scaling data values to a common range (e.g., 0 to 1, or to zero mean and unit variance).
•Encoding Categorical Data: Converting categorical variables (e.g., colors, types) into numerical representations suitable for algorithms.
•Feature Engineering: Creating new features or attributes from existing ones to improve model performance.
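
A sketch of these transformation steps with pandas and scikit-learn; the column names and the derived feature are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"height_cm": [150.0, 165.0, 180.0],
                   "weight_kg": [50.0, 68.0, 82.0],
                   "colour": ["red", "blue", "red"]})

# Feature engineering: derive a new attribute from existing ones.
df["weight_per_cm"] = df["weight_kg"] / df["height_cm"]

# Encoding categorical data: one-hot encode the 'colour' column.
df = pd.get_dummies(df, columns=["colour"])

# Normalization: scale numeric values to the 0-1 range.
num_cols = ["height_cm", "weight_kg", "weight_per_cm"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```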

1. Historical Forms of Data Processing
Historically, data processing was categorized by the level of human intervention and the technology available:
•Manual Data Processing:
Before the advent of electronic systems, data processing was
performed by humans. This includes activities such as bookkeeping and
manual record keeping where data was entered, sorted, and analyzed
by hand.

•Mechanical (or Electromechanical) Data Processing:
With the advent of devices like the punched card systems pioneered by Herman Hollerith for the 1890 U.S. Census, mechanical means were used to collect and process data.
•Distributed and Parallel Processing:
For extremely large data sets, processing tasks can be divided among multiple computers (distributed processing) or multiple processors (parallel processing) within a single system. These forms of processing enable handling “big data” efficiently by scaling horizontally or concurrently executing tasks.

Data cleaning, also known as data cleansing or data scrubbing, is the process of
identifying and correcting or removing errors, inconsistencies, inaccuracies, and
corrupt records from a dataset. It ensures that data is accurate, consistent, and
usable, which is fundamental for building reliable and effective artificial
intelligence (AI) and machine learning (ML) models.
Common Data Issues Addressed
•Missing Values: Incomplete records that can skew analysis.
•Duplicate Entries: Redundant data that inflates datasets.
•Inconsistent Formatting: Variations in data presentation (e.g., date formats).
•Outliers: Anomalous data points that may distort results.
•Typographical Errors: Mistakes in data entry that lead to inaccuracies.
Best Practices for Effective Data Cleaning
•Understand the Data: Gain a comprehensive understanding of the dataset's structure and content.
•Use Automation Tools: Leverage data cleaning software to streamline the process.
•Maintain Documentation: Keep detailed records of cleaning procedures for transparency and reproducibility.
•Regularly Update Data: Implement routine checks to ensure ongoing data quality.
By implementing these techniques and best practices, organizations can enhance the quality of their data, leading to more accurate analyses and informed decision-making.
Types of Missing Data
Understanding the nature of missing data is essential for selecting
appropriate handling techniques.
The main categories include:

Missing Completely at Random (MCAR):
•The probability of data being missing is independent of both observed and unobserved data. In this scenario, the missingness does not relate to any other data values. For example, survey respondents may skip questions randomly due to external factors like accidental omissions.
Missing at Random (MAR):
•The missingness is related to observed data but not the missing data
itself. For instance, if younger individuals are less likely to report their
income, the missing income data depends on age but not on the income
values themselves.
Missing Not at Random (MNAR):
•The missingness is related to the value of the missing data itself. For
example, individuals with higher incomes might choose not to disclose
their earnings, leading to missing data that depends on the unreported
income values.
Data cleaning is a crucial step in data preprocessing to ensure your
dataset is accurate, consistent, and usable for analysis or modeling.
Two common issues in raw data are missing values and noisy data.
Here’s a breakdown of both:
1. Missing Values
Causes:
•Human error (e.g., incomplete surveys)
•Data corruption
•Incompatibility during data merges
•Sensor or equipment malfunction

2. Noisy Data
Noisy data means random errors or variances that distort the dataset.
Sources:
•Data entry errors
•Faulty sensors
•Communication errors
•Outliers
Handling Techniques:
1. Smoothing Techniques:
   • Moving average: Replace data with the average of neighboring values.
   • Bin smoothing: Group data into bins and replace values with the bin mean/median.
2. Clustering: Group data into clusters (e.g., using K-Means) and remove or smooth data points that don’t fit well.
3. Regression or model-based methods: Fit a model and use residuals to detect unusual patterns or noise.
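
As a small sketch of moving-average smoothing, the snippet below damps random noise in a synthetic series with a centered 5-point rolling mean; the signal and window size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(np.sin(np.linspace(0, 6, 60)) + rng.normal(scale=0.3, size=60))

# Moving average: replace each value with the mean of its 5-point neighbourhood.
smoothed = signal.rolling(window=5, center=True, min_periods=1).mean()
print(smoothed.head())
```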

1. Binning (Smoothing by binning)
Binning is a data smoothing technique used to reduce the effect of minor observation errors or noise. It groups continuous data into bins (intervals).
📌 Types of Binning:
•Equal-width binning: Bins of the same size (range).
•Equal-frequency binning: Each bin has the same number of values.
•Smoothing methods within bins:
• By mean: Replace values in a bin with the mean of the bin.
• By median: Use the median instead.
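
A pandas sketch of these ideas: pd.cut produces equal-width bins, pd.qcut produces (roughly) equal-frequency bins, and a group-wise transform does smoothing by bin mean. The sample values are illustrative.

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(values, bins=3)      # bins of equal range
equal_freq = pd.qcut(values, q=3)         # bins with (roughly) equal counts

# Smoothing by bin mean: replace each value with the mean of its bin.
smoothed = values.groupby(equal_width, observed=True).transform("mean")
print(pd.DataFrame({"value": values, "bin": equal_width, "smoothed": smoothed}))
```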
2. Clustering for Cleaning
Clustering can help detect outliers or inconsistencies. If a data point
doesn't belong well to any cluster, it's likely an anomaly or noise.
📌 Use cases:
•Group similar records.
•Identify and remove outliers (points far from any cluster center)
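
A sketch of flagging likely noise by distance to the nearest K-Means centre; the synthetic data, cluster count, and the mean-plus-3-standard-deviations threshold are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2)), [[25.0, 25.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centre = np.min(km.transform(X), axis=1)   # distance to nearest centroid

threshold = dist_to_centre.mean() + 3 * dist_to_centre.std()
outliers = np.where(dist_to_centre > threshold)[0]
print("suspected noise points:", outliers)
```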

3. Regression for Cleaning or Imputation
Regression can be used to predict missing or noisy values based on relationships with other variables.
📌 Use cases:
•Predict a missing column using linear regression.
•Identify anomalies using regression residuals (difference between
actual & predicted).
4. Computer and Human Inspection
Sometimes automated methods aren't enough, especially in critical or
high-risk domains. A hybrid approach helps.
✅ Computer-Based Inspection:
•Use rules or algorithms to flag suspicious data.
•Automate checks (e.g., null checks, range checks, duplicates).
•Use visualization tools (histograms, boxplots) to find patterns.
👀 Human-Based Inspection:
•Data analysts or domain experts manually review flagged or random
samples.
•Especially useful for categorical data or textual entries.
🧠 Tip: Use both!
•Computer: Fast and scalable.
•Human: Context-aware and nuanced
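
A sketch of the computer-based side: simple automated checks (null, range, duplicate) that flag rows for a human reviewer; the columns and the 0-120 age range are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, -3, 40, 40, None],
                   "email": ["a@x.com", "b@x.com", "c@x", "c@x", "d@x.com"]})

flags = pd.DataFrame({
    "null_check":  df.isna().any(axis=1),        # missing fields
    "range_check": ~df["age"].between(0, 120),   # implausible ages
    "dup_check":   df.duplicated(keep=False),    # exact duplicates
})

to_review = df[flags.any(axis=1)]   # hand these rows to a human reviewer
print(to_review)
```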
Data Reduction: Data Cube Aggregation
📌 What is Data Reduction?
Data reduction is the process of reducing the volume of data while
maintaining its integrity and analytical value. It makes data
processing more efficient, especially in big data contexts.

What is Data Cube Aggregation?
Data Cube Aggregation is a technique used to summarize and group multidimensional data by applying aggregation functions like sum, average, count, etc., over combinations of dimensions. Think of it like building a multi-dimensional pivot table.
Main Concepts:
•Dimensions: Categories or attributes (e.g., Time, Region, Product)
•Measures: Numeric values to be aggregated (e.g., Sales, Revenue)
•Aggregation Levels: Can be rolled up (more summarized) or drilled down (more detailed)
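
A small sketch of cube-style aggregation using a pandas pivot table, with Region and Quarter as dimensions and Sales as the measure; the figures are made up.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South", "North", "South"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "Sales":   [100, 120, 90, 110, 80, 60],
})

# Aggregate the Sales measure over the Region x Quarter dimensions.
cube = sales.pivot_table(values="Sales", index="Region", columns="Quarter",
                         aggfunc="sum", margins=True)   # margins = roll-up totals
print(cube)
```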
Dimensionality Reduction
Dimensionality reduction is a technique used in machine learning and
data analysis to reduce the number of input variables (features) in a
dataset while preserving as much information as possible. It’s especially
useful when dealing with high-dimensional data, which can be hard to
visualize and may lead to issues like overfitting.

Common Techniques:
•PCA (Principal Component Analysis) – Projects data to a lower-dimensional space using linear combinations of features.
•t-SNE (t-Distributed Stochastic Neighbor Embedding) – Good for visualizing high-dimensional data in 2D/3D.
•UMAP (Uniform Manifold Approximation and Projection) – Similar to t-SNE, but faster, more scalable, and preserves more global structure; good for visualization and clustering.
Used for:
•Visualization of high-dimensional data
•Noise reduction
•Speeding up algorithms
•Avoiding the curse of dimensionality
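
A minimal PCA sketch: standardize 4-dimensional data and project it down to 2 components; the Iris dataset is used here only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 150 samples x 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```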

Data Compression
This refers to encoding data using fewer bits. It can be lossless (no
info lost) or lossy (some data sacrificed for better compression).
Examples:
•ZIP files – Lossless compression of general data
•JPEG, MP3 – Lossy compression for images/audio
•Autoencoders – Neural networks trained to compress and
reconstruct data (learn an efficient encoding)
Techniques:
•Huffman Coding
•Run-Length Encoding
•LZW (Lempel–Ziv–Welch)
•Autoencoders (again — yes, they can be used for compression too!)
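
To illustrate the lossless idea, here is a toy run-length encoding sketch; real compressors are far more sophisticated, so treat this only as a demonstration of the principle.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Replace runs of repeated characters with (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("AAAABBBCCD")
assert rle_decode(encoded) == "AAAABBBCCD"   # lossless: original fully recovered
print(encoded)                               # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```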
Numerosity Reduction
Numerosity reduction refers to techniques that reduce the volume of
data by replacing the original data with a smaller representation
that is more compact but still maintains the essential properties and
patterns of the data.

It's particularly useful when dealing with large datasets to improve processing speed, reduce storage, and simplify analysis.

Main Techniques of Numerosity Reduction:
1. Parametric Methods
Replace the data with a model. You don't store the data, just the model
parameters.
•Regression models: Fit the data with a linear, polynomial, or
nonlinear function.
•Logistic regression, exponential models, etc.
•Clustering models: Replace data with the cluster center and perhaps
the number of points per cluster (e.g., K-means).
2. Non-Parametric Methods
Reduce data without assuming any specific model form.
•Histograms: Divide data into bins and store the bin ranges and
frequencies.
•Data cube aggregation: Aggregate data across different dimensions
(e.g., monthly → quarterly).
•Sampling: Use a subset that statistically represents the full dataset.
•Cluster-based reduction: Store cluster centroids instead of all data
points.
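
A sketch of two of the non-parametric ideas above: replacing raw values with histogram bin edges and counts, or with a random sample; the synthetic data and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)

# Histogram: store only bin edges and frequencies instead of 100,000 values.
counts, edges = np.histogram(data, bins=20)
print("stored numbers:", len(counts) + len(edges), "instead of", len(data))

# Sampling: a 1% subset that still reflects the distribution.
sample = rng.choice(data, size=1_000, replace=False)
print("sample mean vs full mean:", sample.mean().round(2), data.mean().round(2))
```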
What is Clustering?
Clustering groups similar data points together into clusters based on
a similarity metric (like Euclidean distance). Once you have these
clusters, you can represent each group by its centroid (average
position), thereby reducing the number of data points you need to
store or analyze.
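
A sketch of cluster-based reduction: keep only the K-Means centroids (plus the number of points per cluster) instead of every original point; the data and the choice of 50 clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 2))

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(points)
centroids = km.cluster_centers_                      # 50 representatives
counts = np.bincount(km.labels_, minlength=50)       # points per cluster

print("reduced", len(points), "points to", len(centroids), "centroids")
```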
Discretization
Discretization is the process of transforming continuous attributes
(numerical values) into discrete values
(categories or intervals).
For example: A continuous attribute like Age = [0, 100] might be discretized into:
0–12 → Child
13–19 → Teen
20–64 → Adult
65+ → Senior
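
A sketch of the Age example above using pd.cut; the sample ages are made up, and the bin edges follow the intervals listed.

```python
import pandas as pd

ages = pd.Series([4, 15, 23, 37, 70, 81])
bins = [0, 12, 19, 64, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

# Discretize the continuous Age attribute into labelled intervals.
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"age": ages, "group": age_group}))
```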

Why Discretize?
•Simplifies models and reduces noise
•Helps algorithms that work better with categorical data (like decision
trees)
•Enables generation of concept hierarchies
Concept Hierarchy Generation
Concept hierarchy involves organizing data from low-level concepts
(raw data) to higher-level concepts — think granularity levels.

Example Hierarchy:
For the attribute Location:
City → State → Country → Continent

Types of Concept Hierarchies:
•Schema hierarchy: Already exists in the data schema (e.g., Date → Month → Year)
•Set-grouping hierarchy: Based on grouping values (e.g., ZIP codes → Cities)
•Rule-based hierarchy: Created by user-defined rules (e.g., income levels)
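
A sketch of climbing one level of a set-grouping style hierarchy by mapping City up to Country and aggregating; the mapping and sales figures are illustrative.

```python
import pandas as pd

city_to_country = {"Delhi": "India", "Mumbai": "India",
                   "Paris": "France", "Lyon": "France"}

sales = pd.DataFrame({"City": ["Delhi", "Paris", "Mumbai", "Lyon"],
                      "Sales": [100, 80, 60, 40]})

sales["Country"] = sales["City"].map(city_to_country)   # climb one hierarchy level
print(sales.groupby("Country")["Sales"].sum())          # roll-up aggregation
```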
