Introduction-to-Data-Mining

Uploaded by

Aya

Data Mining

Chapter 1. Introduction
SASSI Abdessamed
Motivation
Why do we need data mining?
● Nowadays, the total worldwide volume of data is very large
■ Hundreds of zettabytes (1 ZB = 10^21 bytes ≈ 2^70 bytes)
● Data types and formats can be complex
■ Video, Image, Audio, etc.
● Most data formats are not human-readable
■ Binary formats
● Humans cannot deal with such an amount and complexity of data
● We need concise insights and patterns to make decisions
Is data mining a misnomer?
● Literally, data mining suggests gathering or collecting data
● In practice, data mining means extracting knowledge from data
● This knowledge is like gold nuggets hidden in a large volume of data
● Hence the word mining in the name
● So,
● What is Data?
● What is Knowledge?
● And what does Data Mining really mean?
Data
What is data?
● Data are collected observations or measurements represented as Text,
Numbers, or Multimedia [3].
● Data can be quantitative (represent quantities or numerical values)
■ Sensory data (Temperature, Light, Pixel Intensities, Voltage, …)
■ Time Durations (Age, Travel Length, …)
■ Size & Length Measurements (Area, Volume, Distance, Length, …)
■ Health Measurements (Blood Pressure, Sugar Level, O2 Saturation, …)
● Data can also be qualitative (categorical)
■ Text (words, letters, digits, …)
■ Age Classes (e.g. Football Age categories)
■ Blood Types
● Data can also be a complex mixture of the two types
■ E.g. Maps (Graphs)
Data vs Knowledge
● A book does not know its own content
● Knowing = being aware of the information we possess
■ Understanding
■ Being able to act and make decisions
■ Producing new thoughts
■ Discovering patterns
● Unlike merely having information, knowing is an active process
● How can we make computers discover knowledge on their own?
Data Sources
● In our daily lives we produce tons of data (information)
■ Social Networks, Emails, Blogs, …
■ E-Commerce, Banking, Stores, …
■ Hospitals & Health reports
■ Administrative records
● Hence, data can be supplied by a variety of technologies:
■ Relational databases
■ Data warehouses
■ Transaction databases
■ Text databases
■ Social network data
■ World-Wide Web
■ Time-series data
Data Formats
● The data we want to analyse using data mining methods have various
formats
■ Transactions
■ N-dimensional Vectors (data points)
■ Graphs
■ Tables
■ etc.
● The format of the data determines the data mining algorithm we can use
● We may also change the format of the data in order to be able to use a
certain type of algorithm
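As an illustration of changing the data format, the sketch below turns a list of transactions into fixed-length 0/1 vectors, so that an algorithm expecting N-dimensional data points could be applied. The item names and the helper `transactions_to_vectors` are hypothetical and chosen only for this example.

```python
# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs", "milk"},
]


def transactions_to_vectors(transactions):
    """Return (sorted item list, one 0/1 vector per transaction)."""
    # One dimension per distinct item, in a fixed (alphabetical) order.
    items = sorted(set().union(*transactions))
    vectors = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, vectors


items, vectors = transactions_to_vectors(transactions)
print(items)    # ['bread', 'eggs', 'milk']
print(vectors)  # [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
```

Each transaction becomes one data point whose dimensions are the distinct items.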
Data Preparation & Preprocessing
● Data integration. Combining data from multiple sources
■ Joining multiple tables.
■ Resolving data inconsistencies from different sources.
● Data selection. Selecting domain relevant data.
■ Selecting a specific subset of attributes (columns)
● Data cleaning.
■ Noise Reduction : Removing or correcting noisy data
■ Outlier Detection : Identifying and handling outliers
■ Handling Missing Values : Removing or filling in missing data
● Data Reduction.
■ Dimensionality Reduction: to reduce the number of attributes while retaining
important information.
■ Sampling: Selecting a subset of the data that represents the whole dataset to reduce
computation time.
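A minimal sketch of two of the steps above, data selection and handling missing values, using plain Python dictionaries as records. The records, the column names, and the choice of filling a missing value with the column mean are assumptions made for illustration; other strategies (removing the record, using the median, etc.) are equally valid.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "city": "A", "income": 30.0},
    {"id": 2, "age": None, "city": "B", "income": 45.0},
    {"id": 3, "age": 35, "city": "A", "income": None},
]


def select_columns(rows, columns):
    """Data selection: keep only the attributes relevant to the task."""
    return [{c: row[c] for c in columns} for row in rows]


def fill_missing_with_mean(rows, column):
    """Data cleaning: replace missing values by the mean of the known ones.

    Assumes at least one value in the column is known.
    """
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


selected = select_columns(records, ["age", "income"])
cleaned = fill_missing_with_mean(fill_missing_with_mean(selected, "age"), "income")
print(cleaned)
```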
Data Preparation & Preprocessing
● Data Transformation.
■ Normalization: Scaling numerical data to a common range
■ Data Discretization: Converting continuous attributes into discrete bins or categories
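The two transformations above can be sketched as follows. Min-max scaling to the range [0, 1] and equal-width binning are only one possible choice for each step, and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale numerical values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 30, 45, 60]  # hypothetical ages
print(min_max_normalize(ages))
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 2]
```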
Data Mining
What is data mining?
● Extracting or “mining” knowledge from large amounts of data [1].
● A set of software techniques for identifying / discovering useful
patterns and trends from large amounts of data through automated
analysis.
● Obtaining a simplified view of data to help with decision making.
● Extracting Knowledge from data.
What is knowledge in this context?
● For data mining, knowledge is in the form of Patterns and Insights:
■ (If .. Then) Rules
■ Associations
■ Anomalies
■ Recommendations
■ Groups & Classes (Clusters)
■ Predictions
■ Correlations
Intersection with other fields & technologies
● Statistics
■ A variety of data mining algorithms involve some methods from the field of statistics
■ The methods of statistics themselves can be used as low-level data mining methods
● Databases
■ Most of the data sources will be stored using database technology
● Data warehouses
■ Data mining is generally applied to data integrated in a data warehouse
● Machine Learning
■ We can use some of these techniques to learn patterns
● Data visualization
■ To become familiar with the data, detect outliers, and decide what preprocessing we need
■ To display the extracted patterns and make decisions after data mining
Why Data Mining?
● Large quantities of data to be analysed
■ Algorithms must be highly scalable
● High dimensionality of the data to be analysed
■ Each record of data is a vector with a large number of dimensions (attributes)
● Some data types are complex by nature
■ Web pages
■ Multimedia
■ Sensor data
■ Graphs
■ Social networks
■ …
Data mining process
[Figure: Data Collection → Databases → Data Integration → Data warehouse → Data mining → Patterns]
Data mining as a step in KDD
KDD = Knowledge Discovery from Data
1. Data selection.
■ Identifying relevant datasets and selecting the data that matter for the task at hand
2. Data preprocessing.
■ Cleaning the data by handling missing values, noise, and inconsistencies
3. Data transformation.
■ Changing the form of the data to suit the data mining algorithms to be used
4. Data mining.
■ Applying a set of intelligent data analysis techniques to extract patterns
5. Pattern evaluation.
■ Interpreting the discovered patterns and evaluating their interestingness
6. Knowledge presentation.
■ Visualizing the discovered knowledge (patterns)
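The six steps can be sketched as a chain of small functions. Everything here is a toy assumption made for illustration: the raw records, the field names, the use of simple item counting as the "mining" step, and the support threshold.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing item.
raw = [
    {"customer": "c1", "basket": ["milk", "bread"], "store": "north"},
    {"customer": "c2", "basket": ["milk", None, "eggs"], "store": "north"},
    {"customer": "c3", "basket": ["bread", "milk"], "store": "south"},
]


def select(rows):            # 1. keep only the field relevant to the task
    return [row["basket"] for row in rows]


def preprocess(baskets):     # 2. drop missing items
    return [[item for item in b if item is not None] for b in baskets]


def transform(baskets):      # 3. represent each basket as a set of items
    return [frozenset(b) for b in baskets]


def mine(transactions):      # 4. count the transactions containing each item
    return Counter(item for t in transactions for item in t)


def evaluate(counts, n, min_support):  # 5. keep items that are frequent enough
    return {item: c / n for item, c in counts.items() if c / n >= min_support}


def present(patterns):       # 6. produce a human-readable report
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]


transactions = transform(preprocess(select(raw)))
patterns = evaluate(mine(transactions), len(transactions), min_support=0.6)
report = present(patterns)
print(report)  # ['bread: support=0.67', 'milk: support=1.00']
```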
Data mining as a step in KDD
Architecture of a typical data mining system [1]
[Figure: data sources (Database, Data Warehouse, World Wide Web, other types of repositories such as spreadsheets, NoSQL, …) → Data Cleaning, Integration, and Selection → Database / Data Warehouse Server]
Data Mining Tasks
Categories of Data Mining Tasks
● Data mining tasks fall into one of two categories
● Descriptive Mining Tasks (Unsupervised learning)
- Clustering: find groups of similar items
- Association rules: find relations between items
● Predictive Mining Tasks (Supervised learning)
- Classification: assign data objects to predefined classes
- Regression: fit a function that maps data objects to continuous values
- Time series analysis: analyse how data evolve over time
Association Rules Mining
● Frequent Patterns, Associations, and Correlations Mining
● Frequent Itemsets. Unordered sets of items that appear together very
often.
■ Milk and Bread are frequently bought together.
● Frequent Subsequences. Ordered sequences of items that occur one after
another very often.
■ PC → Camera → Memory Card
● Association Analysis can uncover:
■ Single-dimensional Association Rules
- BUY(X, “COMPUTER”) ⇒ BUY(X, “SOFTWARE”) [Support=1%, Confidence=50%]
■ Multi-dimensional Association Rules
- AGE(X, “20..29”) ∧ INCOME(X, “20K..29K”) ⇒ BUY(X, “CD Player”) [Support=1%, Confidence=50%]
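The Support and Confidence values attached to a rule A ⇒ B can be computed as below, using the usual definitions: support is the fraction of transactions containing all items of A and B, and confidence is support(A ∪ B) / support(A). The transactions are hypothetical toy data, so the resulting numbers differ from the 1% and 50% quoted above.

```python
# Hypothetical toy transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"computer", "software"}, transactions)       # 2 of 4 transactions
c = confidence({"computer"}, {"software"}, transactions)  # 2 of the 3 with a computer
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.50, confidence=0.67
```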
Classification and Prediction
● Classification. Builds a function (model) that describes classes/concepts
and can be used later to predict the classes of new objects.
● Prediction. Finds a function (model) that can predict missing
continuous numerical values.
● In both cases, we need a set of objects with known labels (classes /
outputs) to train the model
■ Training Dataset
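A minimal sketch of both tasks on a tiny hypothetical training dataset. The nearest-neighbour classifier and the least-squares line are stand-ins chosen for brevity; the slides do not prescribe any particular model.

```python
def nearest_neighbor_classify(train, point):
    """Classification: return the label of the closest training object.

    `train` is a list of (feature_tuple, label) pairs.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(train, key=lambda pair: sq_dist(pair[0], point))
    return label


def fit_line(xs, ys):
    """Prediction: least-squares fit of y = slope * x + intercept.

    Assumes the xs are not all identical.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical labelled objects (training dataset).
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
    ((8.0, 9.0), "high"), ((9.0, 8.5), "high"),
]
label = nearest_neighbor_classify(train, (8.5, 8.0))
print(label)  # high

# Hypothetical (x, y) pairs lying on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept
print(predicted)  # 11.0
```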
Cluster Analysis (Clustering)
● Unsupervised classification
● We group objects into clusters (classes) that are initially unknown
● We use the concept of similarity between objects.
● Minimize the inter-class similarity (similarity of objects from different
clusters)
● Maximize the intra-class similarity (similarity of objects of the same
cluster)
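One common way to form such clusters is k-means, sketched below. It is used here only as an example technique; distance plays the role of similarity (a small distance means high similarity), and the points and initial centroids are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, n_iter=10):
    """Alternate assignment and centroid update; requires n_iter >= 1."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

Points in the same cluster end up close to each other (high intra-class similarity), while the two centroids are far apart (low inter-class similarity).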
Outlier Analysis
● Detect objects in the data that are irregular with respect to other objects
● Can be used for:
■ Anomaly detection
■ Detecting fraudulent credit card transactions
■ …
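A simple way to flag irregular values is the interquartile-range (IQR) rule, sketched below with Python's standard `statistics` module. The rule, the conventional factor 1.5, and the transaction amounts are assumptions chosen for illustration; many other outlier detection methods exist.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartiles
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical credit card transaction amounts.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = iqr_outliers(amounts)
print(suspicious)  # [500]
```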
Pattern Evaluation
Pattern Interestingness
● A pattern is considered interesting if [1]:
1. It is easily understood by humans.
2. It remains valid on new, unseen (test) data with some degree of certainty.
3. It is useful.
4. It is novel (it adds something new to our knowledge).
● Various performance (quality) metrics can be used to evaluate (assess)
the usefulness or interestingness of discovered patterns.
● The definition of these performance metrics depends highly on the
nature and structure of the patterns.
● We can prune away uninteresting patterns by comparing their quality to
a threshold defined by the user.
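Threshold-based pruning amounts to a simple filter. In the sketch below the rules, their scores, and the choice of confidence as the quality metric are hypothetical.

```python
# Hypothetical discovered rules, each with a quality score.
rules = [
    {"rule": "computer => software", "confidence": 0.50},
    {"rule": "milk => bread", "confidence": 0.80},
    {"rule": "printer => paper", "confidence": 0.20},
]


def prune(patterns, metric, threshold):
    """Keep only the patterns whose metric reaches the user-defined threshold."""
    return [p for p in patterns if p[metric] >= threshold]


interesting = prune(rules, "confidence", threshold=0.5)
print([p["rule"] for p in interesting])  # ['computer => software', 'milk => bread']
```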
Data Mining Applications
Some Applications
● Healthcare
■ Diagnosis and Treatment: Identifying patterns in patient data to help diagnose diseases
and recommend treatments.
■ Medical Research: Analyzing clinical data to discover new medical knowledge and drug
efficacy.
● Finance and Banking
■ Fraud Detection: Identifying unusual transactions or behavior that could indicate fraud.
■ Risk Management: Assessing loan applicants' risk levels and predicting credit scores.
■ Customer Segmentation: Classifying customers based on spending habits, transaction
frequency, and investment preferences.
● Telecommunications
■ Churn Prediction: Analyzing user behavior to predict when customers may leave the
service
■ Customer Service: Using data mining to offer more personalized and efficient support.
Some Applications
● Social Media and Web Analytics
■ Sentiment Analysis: Analyzing social media posts to gauge public opinion on products,
services, or events.
● Government and Public Services
■ Crime Prevention: Predicting criminal behavior and identifying hotspots based on
historical data.
■ Tax Fraud Detection: Detecting anomalies in tax records to identify potential fraud
cases.
● Marketing
■ Customer Segmentation: Grouping customers into segments based on purchasing
behavior and preferences.
■ Targeted Advertising: Analyzing data to create more effective marketing campaigns and
personalized ads.
References
1. Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2006.
2. IBM Technology on YouTube
3. University of Houston Libraries on YouTube
