
INTRODUCTION TO DATA MINING

Unit-7
SCOPE OF DATA MINING

Data mining refers to the process of discovering patterns, relationships, and insights from large datasets. It involves various techniques from statistics, machine learning, and database systems. The key objectives of data mining include:
• Identifying hidden patterns in data
• Predicting future trends and behaviors
• Enhancing decision-making processes
• Extracting useful knowledge from vast amounts of information
MAJOR APPLICATIONS OF DATA MINING

• Business intelligence and market analysis
• Fraud detection and risk management
• Healthcare analytics
• Scientific discovery and research
• Social media and web analytics
DATA EXPLORATION AND REDUCTION

Sampling
• Sampling is a technique used to select a subset of data for analysis.
• It helps in reducing computational costs and improving efficiency.
• Common sampling methods:
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster sampling
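As a minimal sketch (assuming pandas, on a hypothetical DataFrame with an imbalanced "group" column), simple random and stratified sampling look like:

```python
import pandas as pd

# Hypothetical example data: 1,000 rows with an imbalanced "group" column
df = pd.DataFrame({
    "value": range(1000),
    "group": ["A"] * 800 + ["B"] * 200,
})

# Simple random sampling: every row has the same chance of selection
simple = df.sample(n=100, random_state=42)

# Stratified sampling: sample 10% within each group so the subset
# preserves the original group proportions
stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=42)

print(stratified["group"].value_counts())  # roughly 80 A, 20 B
```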
DATA VISUALIZATION

• Data visualization techniques help in understanding patterns, trends, and outliers in datasets.
• Popular visualization tools include histograms, scatter plots, box plots, and heatmaps.
• Effective visualization aids in data preprocessing and model selection.
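A quick illustration of two of these plots (assuming NumPy and matplotlib, on synthetic data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)
y = 2 * x + rng.normal(0, 15, 500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=30)      # histogram: distribution and outliers of x
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=10)   # scatter plot: relationship between x and y
ax2.set_title("x vs. y")
plt.show()
```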
DIRTY DATA

• Dirty data refers to incomplete, inconsistent, or incorrect data.
• Common issues:
• Missing values
• Duplicate records
• Outliers
• Data inconsistencies
• Cleaning techniques include imputation, normalization, and transformation.
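A minimal pandas sketch of these cleaning steps, on a hypothetical dirty dataset:

```python
import pandas as pd

# Hypothetical dirty dataset: missing values, a duplicate row, an outlier
df = pd.DataFrame({
    "age":    [25, None, 40, 40, 120],
    "income": [30_000, 45_000, None, None, 80_000],
})

df = df.drop_duplicates()                              # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())       # imputation: fill missing ages
df["income"] = df["income"].fillna(df["income"].mean())
df = df[df["age"].between(18, 100)]                    # drop an implausible outlier
# Min-max normalization as a simple transformation
df["income"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)
```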


CLUSTER ANALYSIS

• Cluster analysis is a technique used to group similar data points.
• Methods of clustering:
• K-means clustering
• Hierarchical clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Used in market segmentation, image recognition, and anomaly detection.
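As a brief illustration (assuming scikit-learn), K-means on synthetic data with three natural groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # cluster assignment for each point
print(kmeans.cluster_centers_)     # coordinates of the three centroids
```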
CLASSIFICATION

Intuitive Explanation of Classification

• Classification is a supervised learning technique used to categorize new data based on previously known data.
• It involves assigning labels to data instances.
• Examples: spam detection, medical diagnosis, and sentiment analysis.
MEASURING CLASSIFICATION PERFORMANCE

• Accuracy
• Precision and Recall
• F1-score
• Confusion matrix
• ROC curve and AUC
ACCURACY

• Definition: The proportion of correctly predicted instances (both true positives and true negatives) out of the total number of instances.
• Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Useful when the classes are balanced; not ideal for imbalanced datasets.
PRECISION AND RECALL
Precision:
• Measures the proportion of correctly predicted positive instances out of all predicted positive instances.
• Focuses on minimizing false positives.
• Formula: Precision = TP / (TP + FP)

Recall (Sensitivity or True Positive Rate):
• Measures the proportion of correctly predicted positive instances out of all actual positive instances.
• Focuses on minimizing false negatives.
• Formula: Recall = TP / (TP + FN)
F1-SCORE

• Definition: The harmonic mean of precision and recall; it balances both metrics.
• Useful for balancing precision and recall, especially on imbalanced datasets.
• Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
CONFUSION MATRIX

• Definition: A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.
• Provides a detailed breakdown of model performance, helping to identify specific types of errors.
• Structure:

                  Predicted Positive   Predicted Negative
Actual Positive          TP                   FN
Actual Negative          FP                   TN
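A minimal scikit-learn sketch on hypothetical predictions (note that scikit-learn orders classes ascending, so its layout has the negative class first, unlike the table above):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# scikit-learn layout for binary labels {0, 1}:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```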
ROC CURVE AND AUC
ROC Curve (Receiver Operating Characteristic Curve):
• Plots the True Positive Rate (Recall) against the False Positive Rate (FPR) at various threshold settings.
• False Positive Rate (FPR): FPR = FP / (FP + TN)
• Evaluates the trade-off between sensitivity and specificity.

AUC (Area Under the Curve):
• Represents the area under the ROC curve.
• Ranges from 0 to 1, where 1 indicates perfect classification and 0.5 indicates random guessing.
• AUC is useful for comparing models.
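A minimal scikit-learn sketch computing these metrics on hypothetical predictions (AUC is computed from predicted scores rather than hard labels):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted probabilities

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))   # uses scores, not labels
```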
USING TRAINING AND VALIDATION DATA

• Training data is used to build a classification model.
• Validation data is used to tune model parameters.
• Test data is used to evaluate model performance.
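A common way to produce such a split (a sketch assuming scikit-learn, chaining two calls for a rough 60/20/20 split):

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First hold out a test set, then carve a validation set out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
```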
CLASSIFYING NEW DATA

• Once trained, the classification model is used to predict new, unseen data.
• The model assigns a label to each new data instance based on learned patterns.
CLASSIFICATION TECHNIQUES

K-Nearest Neighbors (K-NN)
• A simple, instance-based learning algorithm.
• Classifies a data point based on the majority class of its k-nearest neighbors.
• Works well for smaller datasets but can be computationally expensive for large datasets.
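A minimal K-NN sketch (assuming scikit-learn and its built-in Iris dataset):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbors
knn.fit(X_train, y_train)                   # "training" just stores the data
print(knn.score(X_test, y_test))            # accuracy on unseen data
```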
Discriminant Analysis
• Used for classifying observations into predefined categories.
• Types:
• Linear Discriminant Analysis (LDA)
• Quadratic Discriminant Analysis (QDA)
CLASSIFICATION TECHNIQUES (CONT.)
Logistic Regression
• A statistical model used for binary classification.
• Estimates the probability of a class using the logistic function.
• Suitable for predicting categorical outcomes (e.g., pass/fail, spam/ham).
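A minimal logistic regression sketch (assuming scikit-learn and its built-in breast-cancer dataset):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # binary labels: malignant/benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)     # logistic function maps scores to probabilities
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))        # estimated class probabilities
```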
Association Rule Mining
• Discovers relationships between variables in large datasets.
• Used in market basket analysis to find items that frequently occur together.
• Key algorithms:
• Apriori Algorithm
• FP-Growth Algorithm
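To make support and confidence concrete, a minimal pure-Python sketch on toy market baskets (Apriori and FP-Growth are optimized versions of this counting for large datasets):

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

# Support of an itemset = fraction of baskets that contain it
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.items():
    support = count / len(baskets)
    # Confidence of rule a -> b = support(a, b) / support(a)
    support_a = sum(a in basket for basket in baskets) / len(baskets)
    print(f"{a} -> {b}: support={support:.2f}, confidence={support / support_a:.2f}")
```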
CAUSE AND EFFECT MODELING

• Used to understand causal relationships between variables.
• Techniques include:
• Regression analysis
• Granger causality
• Structural equation modeling

• Applications: economic forecasting, medical research, and policy evaluation.
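As a minimal illustration of the regression-analysis approach (synthetic data; the variable names are hypothetical), with the usual caveat that a coefficient alone does not establish causation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: does advertising spend (x) affect sales (y)?
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 200).reshape(-1, 1)    # hypothetical ad spend
y = 3.0 * x.ravel() + rng.normal(0, 20, 200)   # sales with noise

model = LinearRegression().fit(x, y)
print(model.coef_[0])  # estimated effect of one unit of spend on sales
# Note: a regression coefficient alone does not prove causation; causal
# claims need experimental design or methods such as Granger causality
# or structural equation modeling.
```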
