Data Mining

Chapter 1. Introduction
SASSI Abdessamed
Motivation
Why do we need data mining?
● Nowadays, the total worldwide volume of data is very large
■ Hundreds of zettabytes (1 ZB ≈ 2^70 bytes)
● Data types and formats can be complex
■ Video, Image, Audio, etc.
● Most data formats are not human-readable
■ Binary formats
● Humans cannot deal with such volume and complexity
● We need concise insights and patterns to make decisions
Is data mining a misnomer?
● Taken literally, data mining would mean gathering or collecting data
● In practice, data mining means extracting knowledge from data
● This knowledge is like gold nuggets hidden in a large volume of data
● Hence the word mining in the name
● So,
● What is Data?
● What is Knowledge?
● And, what does Data Mining really mean?
Data
What is data?
● Data are collected observations or measurements represented as Text,
Numbers, or Multimedia [3].
● Data can be quantitative (represent quantities or numerical values)
■ Sensory data (Temperature, Light, Pixel Intensities, Voltage, …)
■ Time Durations (Age, Travel Length, …)
■ Size & Length Measurements (Area, Volume, Distance, Length, …)
■ Health Measurements (Blood Pressure, Sugar Level, O2 Saturation, …)
● Data can also be qualitative (categorical)
■ Text (words, letters, digits, …)
■ Age Classes (e.g. Football Age categories)
■ Blood Types
● Data can also be a complex mixture of the two types
■ E.g. Maps (Graphs)
Data vs Knowledge
● A book does not know its own content
● Knowing means being aware of the information we possess
■ Understanding
■ Being able to act and make decisions
■ Producing new thoughts
■ Discovering patterns
● Unlike having information, knowing is an active process
● How can we make computers discover knowledge on their own?
Data Sources
● In our daily lives we produce tons of data (information)
■ Social Networks, Emails, Blogs, …
■ E-Commerce, Banking, Stores, …
■ Hospitals & Health reports
■ Administrative records
● Hence, data can be supplied by a variety of technologies:
■ Relational databases
■ Data warehouses
■ Transaction databases
■ Text databases
■ Social network data
■ World-Wide Web
■ Time-series data
Data Formats
● The data we want to analyse using data mining methods have various
formats
■ Transactions
■ N-dimensional Vectors (data points)
■ Graphs
■ Tables
■ etc.
● The format of the data determines the data mining algorithm we can use
● We may also change the format of the data in order to be able to use a
certain type of algorithm
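As a minimal sketch of such a format change, the Python snippet below turns a list of transactions into binary N-dimensional vectors, one dimension per distinct item. The function name `transactions_to_vectors` and the sample baskets are illustrative assumptions, not part of the course material.

```python
def transactions_to_vectors(transactions):
    """Encode each transaction (a set of items) as a 0/1 vector."""
    # Build a fixed, sorted vocabulary so every vector uses the same column order.
    items = sorted({item for t in transactions for item in t})
    # A 1 means the item is present in the transaction, a 0 means it is absent.
    vectors = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, vectors


# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
items, vectors = transactions_to_vectors(transactions)
print(items)    # ['bread', 'butter', 'milk']
print(vectors)  # [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
```

Once the transactions are vectors, algorithms that expect data points (for example distance-based clustering) can be applied to them.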
Data Preparation & Preprocessing
● Data integration. Combining data from multiple sources
■ Joining multiple tables.
■ Resolving data inconsistencies from different sources.
● Data selection. Selecting domain relevant data.
■ Selecting a specific subset of attributes (columns)
● Data cleaning.
■ Noise Reduction : Removing or correcting noisy data
■ Outlier Detection : Identifying and handling outliers
■ Handling Missing Values : Removing or filling in missing data
● Data Reduction.
■ Dimensionality Reduction: to reduce the number of attributes while retaining
important information.
■ Sampling: Selecting a subset of the data that represents the whole dataset to reduce
computation time.
● Data Transformation.
■ Normalization: Scaling numerical data to a common range
■ Data Discretization: Converting continuous attributes into discrete bins or categories
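The sketch below illustrates three of the preprocessing steps listed above on a single numeric attribute: filling in missing values, normalization, and discretization. It uses only the Python standard library. The helper names, the choice of mean imputation and min-max scaling, and the age bins are assumptions made for illustration. Other strategies are equally valid.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to compute a mean from")
    m = mean(known)
    return [m if v is None else v for v in values]


def min_max_normalize(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges, labels):
    """Discretization: map each value to the label of its bin.

    edges[i] is the exclusive upper bound of labels[i]; the last label
    catches every value that is not below any edge.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


# Hypothetical "age" attribute with one missing value.
ages = [15, None, 45, 30]
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
binned = discretize(cleaned, edges=[18, 40], labels=["young", "middle", "older"])
print(cleaned)  # [15, 30, 45, 30]
print(scaled)   # [0.0, 0.5, 1.0, 0.5]
print(binned)   # ['young', 'middle', 'older', 'middle']
```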
Data Mining
What is data mining?
● Extracting or “mining” knowledge from large amounts of data [1].
● A set of software techniques for identifying / discovering useful
patterns and trends from large amounts of data through automated
analysis.
● Obtaining a simplified view of data to help with decision making.
● Extracting Knowledge from data.
What is knowledge in this context?
● For data mining, knowledge is in the form of Patterns and Insights:
■ (If .. Then) Rules
■ Associations
■ Anomalies
■ Recommendations
■ Groups & Classes (Clusters)
■ Predictions
■ Correlations
Intersection with other fields & technologies
● Statistics
■ A variety of data mining algorithms involve some methods from the field of statistics
■ The methods of statistics themselves can be used as low-level data mining methods
● Databases
■ Most of the data sources will be stored using database technology
● Data warehouses
■ Data mining is generally applied to data integrated in a data warehouse
● Machine Learning
■ Machine learning techniques can be used to learn patterns from data
● Data visualization
■ To become familiar with the data, detect outliers, and decide what preprocessing is needed
■ To display the extracted patterns and make decisions after data mining
Why Data Mining?
● Large quantities of data to be analysed
■ Algorithms must be highly scalable
● High dimensionality of the data to be analysed
■ Each record of data is a vector with a large number of dimensions (attributes)
● Some data types are complex by nature
■ Web pages
■ Multimedia
■ Sensor data
■ Graphs
■ Social networks
■ …
Data mining process
[Figure: Data Collection → Databases → Data Integration → Data Warehouse → Data Mining → Patterns]
Data mining as a step in KDD
KDD = Knowledge Discovery from Data
1. Data selection.
■ Identifying relevant datasets and selecting the data that matter for the task at hand
2. Data preprocessing.
■ Cleaning the data by handling missing values, noise, and inconsistencies.
3. Data transformation.
■ Changing the form of the data to suit the data mining algorithms to be used
4. Data mining.
■ Applying a set of intelligent data analysis techniques
5. Pattern evaluation.
■ Interpreting the discovered patterns and evaluating their interestingness.
6. Knowledge presentation.
■ Visualizing the discovered knowledge (patterns)
Architecture of a typical data mining system [1]
[Figure: data sources (Database, Data Warehouse, World Wide Web, other types of repositories such as spreadsheets, NoSQL, …) → Data Cleaning, Integration, and Selection → Database / Data Warehouse Server]
Data Mining Tasks
Categories of Data Mining Tasks
● Data mining tasks fall into one of two categories:
● Descriptive Mining Tasks (Unsupervised learning)
■ Clustering: find groups of similar items
■ Association rules: find relations between items
● Predictive Mining Tasks (Supervised learning)
■ Classification: assign data to their predefined classes
■ Regression: fit a function that predicts a continuous value from the data
■ Time series analysis: analyse how data evolve over time
Association Rules Mining
● Frequent Patterns, Associations, and Correlations Mining
● Frequent Itemsets. Unordered sets of items that appear together very
often.
■ Milk and Bread are frequently bought together.
● Frequent Subsequences. Ordered sequences of items that appear together
very often.
■ PC → Camera → Memory Card
● Association Analysis can uncover:
■ Single-dimensional Association Rules
BUY(X, “COMPUTER”) ⇒ BUY(X, “SOFTWARE”) [Support=1%, Confidence=50%]
■ Multi-dimensional Association Rules
AGE(X, “20..29”) ∧ INCOME(X, “20K..29K”) ⇒ BUY(X, “CD Player”) [Support=1%, Confidence=50%]
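The support and confidence figures attached to a rule A ⇒ B can be computed directly from a list of transactions: support is the fraction of all transactions that contain both A and B, and confidence is the fraction of the transactions containing A that also contain B. The Python sketch below shows this under assumed sample data. The function names and the four baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical purchase records.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"printer"},
]
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support)     # 0.25
print(rule_confidence)  # 0.5
```

Here one basket out of four contains both items (support 25%), and one of the two baskets containing a computer also contains software (confidence 50%).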
Classification and Prediction
● Classification. Describes a class/concept as a function (model) that can
be used later to predict the classes of new objects.
● Prediction. Finds a function (model) that can predict missing
continuous numerical values.
● In both cases, we need a set of objects with known labels (classes /
outputs) to train the model
■ Training Dataset
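To make the role of the training dataset concrete, the sketch below shows one possible model for each task: a 1-nearest-neighbour classifier for classification and a least-squares straight line for numerical prediction. Both techniques are chosen here only for illustration and are not named in the slides. The function names and the sample data are assumptions.

```python
import math


def nearest_neighbor_classify(training, point):
    """Classification: return the label of the closest training object.

    `training` is a list of (features, label) pairs with known labels.
    """
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label


def fit_line(xs, ys):
    """Prediction: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training dataset: 2-D points with known class labels.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]
print(nearest_neighbor_classify(training, (1.1, 0.9)))  # A
print(nearest_neighbor_classify(training, (5.2, 5.1)))  # B

# Hypothetical numeric training data following y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # predicted value for the unseen input x = 5
```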
Cluster Analysis (Clustering)
● Unsupervised classification
● We group objects into clusters (classes) that are initially unknown
● We use the concept of similarity between objects.
● Minimize the inter-class similarity (similarity of objects from different
clusters)
● Maximize the intra-class similarity (similarity of objects of the same
cluster)
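One common way to pursue these two goals is k-means, which is used below purely as an illustration (the slides do not prescribe a specific algorithm). The sketch measures similarity through Euclidean distance, so a smaller distance means a higher similarity. The function name, the starting centroids, and the sample points are assumptions.

```python
import math


def k_means(points, centroids, iterations=10):
    """Group points around centroids; return (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

The result depends on the starting centroids, which is why they are passed in explicitly here rather than chosen at random.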
Outlier Analysis
● Detect objects in the data that deviate markedly from the other objects
● Can be used for:
■ Anomaly detection
■ Detecting fraudulent credit card transactions
■ …
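A simple way to flag irregular objects in a single numeric attribute is the z-score: the distance of a value from the mean, measured in standard deviations. The sketch below is one possible approach among many. The threshold of 2.0 and the transaction amounts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    if s == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - m) / s > threshold]


# Hypothetical credit card transaction amounts with one suspicious value.
amounts = [20, 22, 19, 21, 20, 23, 500]
print(z_score_outliers(amounts))  # [500]
```

Because the outlier itself inflates the mean and the standard deviation, this method works best when outliers are rare. Rank-based rules such as the interquartile range are less sensitive to that effect.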
Pattern Evaluation
Pattern Interestingness
● A pattern is considered interesting if [1]:
1. It is easily understood by humans.
2. It can be generalized to new unseen (test) data with some degree of certainty.
3. It is useful.
4. It is novel (adds something new to our knowledge).
● Various performance (quality) metrics can be used to evaluate (assess)
the usefulness or interestingness of discovered patterns.
● The definition of these performance metrics depends highly on the
nature and structure of the patterns.
● We can prune away uninteresting patterns by comparing their quality to
a threshold defined by the user.
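As a sketch of threshold-based pruning, the snippet below keeps only the association rules whose support and confidence reach user-defined minimums. Using support and confidence as the quality metrics is an assumption made for this example. The rule list and the threshold values are hypothetical.

```python
def prune_patterns(patterns, min_support, min_confidence):
    """Keep only the patterns whose quality metrics reach the user thresholds."""
    return [
        p for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]


# Hypothetical discovered rules with their quality metrics.
rules = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "milk => bread", "support": 0.20, "confidence": 0.80},
    {"rule": "printer => camera", "support": 0.002, "confidence": 0.90},
]
kept = prune_patterns(rules, min_support=0.01, min_confidence=0.60)
print([p["rule"] for p in kept])  # ['milk => bread']
```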
Data Mining Applications
Some Applications
● Healthcare
■ Diagnosis and Treatment: Identifying patterns in patient data to help diagnose diseases
and recommend treatments.
■ Medical Research: Analyzing clinical data to discover new medical knowledge and drug
efficacy.
● Finance and Banking
■ Fraud Detection: Identifying unusual transactions or behavior that could indicate fraud.
■ Risk Management: Assessing loan applicants' risk levels and predicting credit scores.
■ Customer Segmentation: Classifying customers based on spending habits, transaction
frequency, and investment preferences.
● Telecommunications
■ Churn Prediction: Analyzing user behavior to predict when customers may leave the
service
■ Customer Service: Using data mining to offer more personalized and efficient support.
● Social Media and Web Analytics
■ Sentiment Analysis: Analyzing social media posts to gauge public opinion on products,
services, or events.
● Government and Public Services
■ Crime Prevention: Predicting criminal behavior and identifying hotspots based on
historical data.
■ Tax Fraud Detection: Detecting anomalies in tax records to identify potential fraud
cases.
● Marketing
■ Customer Segmentation: Grouping customers into segments based on purchasing
behavior and preferences.
■ Targeted Advertising: Analyzing data to create more effective marketing campaigns and
personalized ads.
References
1. Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2006.
2. IBM Technologies on YouTube
3. University of Houston Libraries on YouTube
