Data Mining Overview

Data mining is the process of extracting patterns and useful information from large datasets, essential for informed decision-making across various fields. Key functionalities include association analysis, classification, clustering, prediction, outlier detection, and evolution analysis, while data processing involves cleaning, integration, transformation, and reduction techniques. Decision trees serve as a predictive model that splits data based on conditions, providing an interpretable and visual approach to decision-making.

Uploaded by pkt6279

Data Mining Overview and Related Concepts

1. Overview of Data Mining

Definition:

Data mining is the process of discovering patterns, relationships, and useful information from large datasets using statistical, machine learning, and database techniques. It is an essential step in the Knowledge Discovery in Databases (KDD) process.

Motivation:

- The growth of data from diverse sources (IoT, social media, business, science, etc.).

- The need to make informed decisions based on patterns and trends in data.

- Competitive advantage in various fields such as healthcare, finance, marketing, and science.

Key Functionalities:

1. Association Analysis: Discovering rules that reveal relationships between variables (e.g., 'If X, then Y').

2. Classification: Assigning categories to data based on predefined models.

3. Clustering: Grouping data into clusters based on similarity.

4. Prediction: Forecasting future values using existing data.

5. Outlier Detection: Identifying anomalies or rare items in data.

6. Evolution Analysis: Understanding trends and changes in data over time.
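Association analysis (functionality 1) is easiest to see in code. A minimal sketch, using made-up market-basket transactions and the standard support/confidence definitions:

```python
# Toy market-basket data (hypothetical); each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, data):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in data) / len(data)

def confidence(antecedent, consequent, data):
    """Confidence of the rule 'if antecedent, then consequent'."""
    return support(set(antecedent) | set(consequent), data) / support(antecedent, data)

# Rule "if bread, then milk" holds in 2 of the 3 bread transactions.
print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666...
```

Real systems (e.g., Apriori) enumerate candidate itemsets efficiently; the definitions of support and confidence are the same.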

2. Data Processing in Data Mining

Data preparation is crucial to ensure the quality of input data. It involves:

Data Cleaning:

- Handling Missing Values: Replace, remove, or predict missing entries.

- Handling Noisy Data: Use techniques like binning, regression, or clustering to smooth data.

- Handling Inconsistent Data: Resolve discrepancies by normalization, domain constraints, or user validation.

Data Integration:

- Combining data from multiple sources into a unified dataset.

- Ensuring schema consistency and detecting redundancy.

Data Transformation:

- Normalization: Rescale data to a common range (e.g., [0, 1]).

- Aggregation: Summarize data at a higher abstraction level.

- Encoding categorical data using methods like one-hot encoding.
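The first and last transformation steps above can be sketched without any libraries (the values are toy data for illustration):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each categorical value as a 0/1 indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))  # categories sorted: blue, red
```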

Data Reduction:

Reduce data volume while preserving essential patterns and structure:

- Data Cube Aggregation: Summarizing data at higher levels (e.g., regional vs. store-level sales).

- Dimensionality Reduction: Techniques like PCA or LDA to reduce features.

- Data Compression: Use lossless or lossy compression methods to store data compactly.

- Numerosity Reduction: Approximation using parametric or non-parametric models.

- Discretization: Convert continuous values into discrete intervals.

- Concept Hierarchy Generation: Organize data into multiple levels (e.g., 'city' -> 'state' -> 'country').
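Of the reduction techniques above, discretization is the simplest to sketch. A minimal equal-width binning, with made-up ages:

```python
def discretize(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

ages = [3, 17, 25, 40, 70]
print(discretize(ages, 3))  # [0, 0, 0, 1, 2]
```

Equal-frequency binning (same count per bin) and supervised methods (e.g., entropy-based splitting) are common alternatives.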

3. Data Cleaning in Detail

- Missing Values: Common techniques to handle them:

1. Replace with the mean/median/mode.

2. Predict missing values using regression or machine learning models.

3. Ignore tuples with missing values if the dataset is large enough.
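Mean imputation (technique 1 above) can be sketched as follows; the column values are illustrative:

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(impute_mean([4.0, None, 6.0, None, 5.0]))  # [4.0, 5.0, 6.0, 5.0, 5.0]
```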

- Noisy Data: Contains random errors or variations:

1. Binning: Smooth data by grouping it into bins (e.g., bin means, medians).

2. Clustering: Group data, treating smaller clusters as noise.

3. Regression: Fit a model to the data and treat deviations as noise.

4. Inspection: Use domain expertise to validate data manually or computationally.
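Smoothing by bin means (technique 1 above) can be sketched like this, with hypothetical price values:

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        smoothed.extend([sum(bin_) / len(bin_)] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Replacing each value with its bin median or with the nearest bin boundary are the other two textbook variants.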

- Inconsistent Data: Occurs due to duplicate records, schema differences, or incorrect entries.

Resolve using:

1. Rule-based corrections.

2. Schema alignment during data integration.

3. Human intervention for ambiguous cases.

4. Data Reduction Techniques

1. Data Cube Aggregation: Create summaries by aggregating data across dimensions (e.g., sales by region, time).

2. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) reduce the number of attributes while preserving variability.

3. Data Compression: Compress data storage using algorithms like Huffman coding or wavelet transforms.

4. Numerosity Reduction: Replace original data with models (parametric, like regression, or non-parametric, like histograms).

5. Discretization: Group numeric values into intervals (e.g., age: 0-18, 19-35, etc.).

6. Concept Hierarchy Generation: Summarize data at higher abstraction levels (e.g., product types into broader categories).
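As a sketch of parametric numerosity reduction (technique 4), a least-squares line can stand in for the raw points; the data here is made up:

```python
def fit_line(xs, ys):
    """Replace (x, y) pairs with the slope and intercept of a least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Four points compress to just two parameters: slope and intercept.
slope, intercept = fit_line([1, 2, 3, 4], [2.1, 4.0, 6.2, 7.9])
print(slope, intercept)
```

Only the two parameters need to be stored; individual y-values can then be approximated as slope * x + intercept.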


5. Decision Trees

Definition:

A decision tree is a predictive model that splits data into branches based on conditions at internal nodes, leading to decision outcomes at the leaves.

Steps in Decision Tree Induction:

1. Start with a root node containing all data.

2. Select the best attribute for splitting using metrics like Information Gain or Gini Index.

3. Partition the dataset into subsets based on attribute values.

4. Repeat the process recursively until stopping criteria are met (e.g., maximum depth, no significant gain).

5. Assign leaf nodes with class labels or predictions.
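The split metric in step 2 can be sketched as information gain over class labels (the yes/no labels below are toy data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy drop achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A perfectly separating split of a balanced 4-yes / 4-no node gains 1 bit.
parent = ["yes"] * 4 + ["no"] * 4
print(information_gain(parent, [["yes"] * 4, ["no"] * 4]))  # 1.0
```

The Gini index is computed the same way, with impurity 1 - sum(p**2) in place of entropy.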

Advantages:

- Easy to interpret and visualize.

- Handles both categorical and numerical data.

- Can capture non-linear decision boundaries.

6. Forms of Data in Pre-Processing

1. Structured Data: Tables, spreadsheets, or relational databases.

2. Semi-Structured Data: JSON, XML, or NoSQL databases.

3. Unstructured Data: Text, images, videos, or logs.

4. Temporal/Sequential Data: Time-series or event logs.

5. Spatial Data: Geographic information like maps.
