AI512PE: DATA MINING (Professional Elective - I)

Dr. M. Kumara Swamy


Syllabus
❑ UNIT – I
Data Mining: Data–Types of Data–Data Mining Functionalities– Interestingness Patterns–
Classification of Data Mining systems–Data mining Task primitives–Integration of Data
mining system with a Data warehouse–Major issues in Data Mining–Data Preprocessing.
❑ UNIT – II
Association Rule Mining: Mining Frequent Patterns–Associations and correlations – Mining
Methods– Mining Various kinds of Association Rules– Correlation Analysis– Constraint
based Association mining. Graph Pattern Mining, SPM.
❑ UNIT – III
Classification: Classification and Prediction – Basic concepts–Decision tree induction–
Bayesian classification, Rule–based classification, Lazy learner.
❑ UNIT - IV
Clustering and Applications: Cluster analysis–Types of Data in Cluster Analysis–
Categorization of Major Clustering Methods– Partitioning Methods, Hierarchical Methods,
Density–Based Methods, Grid–Based Methods, Outlier Analysis.
❑ UNIT – V
Advanced Concepts: Basic concepts in Mining data streams–Mining Time–series data–
Mining sequence patterns in Transactional databases– Mining Object– Spatial–
Multimedia–Text and Web data – Spatial Data mining–Multimedia Data mining–Text
Mining– Mining the World Wide Web.
TEXT BOOKS
1. Data Mining – Concepts and Techniques – Jiawei Han & Micheline Kamber, 3rd Edition Elsevier.
2. Data Mining Introductory and Advanced topics – Margaret H Dunham, PEA.
REFERENCE BOOK
1. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques
(Second Edition), Morgan Kaufmann, 2005.
Pre-Requisites
• Database Management Systems
• Computer Oriented Statistical Methods
Course Objectives and Outcomes
Course Objectives
❑ It presents methods for mining frequent patterns, associations, and correlations.
❑ It then describes methods for data classification and prediction, and data–clustering
approaches.
❑ It covers mining various types of data stores such as spatial, textual, multimedia, streams.
Course Outcomes
❑ Ability to understand the types of the data to be mined and present a general classification of
tasks and primitives to integrate a data mining system.
❑ Apply preprocessing methods for any given raw data.
❑ Extract interesting patterns from large amounts of data.
❑ Discover the role played by data mining in various fields.
❑ Choose and employ suitable data mining algorithms to build analytical applications
❑ Evaluate the accuracy of supervised and unsupervised models and algorithms.
Unit - I
Data Mining
❑ Data–Types of Data
❑ Data Mining Functionalities
❑ Interestingness Patterns
❑ Classification of Data Mining systems
❑ Data mining Task primitives
❑ Integration of Data mining system with a Data warehouse
❑ Major issues in Data Mining
❑ Data Preprocessing.

Why Data Mining?
❑ The Explosive Growth of Data: from terabytes to petabytes
❑ Data collection and data availability
❑ Automated data collection tools, database systems, Web, computerized
society
❑ Major sources of abundant data
❑ Business: Web, e-commerce, transactions, stocks, …
❑ Science: Remote sensing, bioinformatics, scientific simulation, …
❑ Society and everyone: news, digital cameras, YouTube
❑ We are drowning in data, but starving for knowledge!
❑ “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
What Is Data Mining?
❑ Data mining (knowledge discovery from data)
❑ Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
❑ Data mining: a misnomer?
❑ Alternative names
❑ Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
❑ Watch out: Is everything “data mining”? The following are not:
❑ Simple search and query processing
❑ (Deductive) expert systems
Knowledge Discovery (KDD) Process
❑ This is a view from typical database systems and data warehousing communities
❑ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD pipeline, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
❑ Web mining usually involves
❑ Data cleaning
❑ Data integration from multiple sources
❑ Warehousing the data
❑ Data cube construction
❑ Data selection for data mining
❑ Data mining
❑ Presentation of the mining results
❑ Patterns and knowledge to be used or stored into knowledge-base
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases from the bottom layer to the top. Layers from top to bottom, with the typical role in parentheses:]
❑ Decision Making (End User)
❑ Data Presentation: Visualization Techniques (Business Analyst)
❑ Data Mining: Information Discovery (Data Analyst)
❑ Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
❑ Data Preprocessing/Integration, Data Warehouses (DBA)
❑ Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
KDD Process: A View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
❑ Data pre-processing: data integration, normalization, feature selection, dimension reduction
❑ Data mining: pattern discovery, classification, clustering, outlier analysis, …
❑ Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
❑ This is a view from typical machine learning and statistics communities
Data Mining vs. Data Exploration
❑ Which view do you prefer?
❑ KDD vs. ML/Stat. vs. Business Intelligence
❑ Depending on the data, applications, and your focus

❑ Data Mining vs. Data Exploration


❑ Business intelligence view
❑ Warehouse, data cube, reporting but not much mining
❑ Business objects vs. data mining tools
❑ Supply chain example: mining vs. OLAP vs. presentation tools
❑ Data presentation vs. data exploration

Multi-Dimensional View of Data Mining
❑ Data to be mined
❑ Database data (extended-relational, object-oriented, heterogeneous), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web,
multimedia, graphs & social and information networks
❑ Knowledge to be mined (or: Data mining functions)
❑ Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, …
❑ Descriptive vs. predictive data mining
❑ Multiple/integrated functions and mining at multiple levels
❑ Techniques utilized
❑ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
❑ Applications adapted
❑ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis,
text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
❑ Database-oriented data sets and applications
❑ Relational database, data warehouse, transactional database
❑ Object-relational databases, Heterogeneous databases and legacy databases
❑ Advanced data sets and advanced applications
❑ Data streams and sensor data
❑ Time-series data, temporal data, sequence data (incl. bio-sequences)
❑ Structured data, graphs, social networks and information networks
❑ Spatial data and spatiotemporal data
❑ Multimedia database
❑ Text databases
❑ The World-Wide Web
Data Mining Functions: (1) Generalization
❑ Information integration and data warehouse construction
❑ Data cleaning, transformation, integration, and
multidimensional data model
❑ Data cube technology
❑ Scalable methods for computing (i.e., materializing)
multidimensional aggregates
❑ OLAP (online analytical processing)
❑ Multidimensional concept description: Characterization
and discrimination
❑ Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
Data Mining Functions: (2) Pattern Discovery
❑ Frequent patterns (or frequent itemsets)
❑ What items are frequently purchased together in your Walmart?
❑ Association and Correlation Analysis

❑ A typical association rule


❑ Diaper → Beer [0.5%, 75%] (support, confidence)
❑ Are strongly associated items also strongly correlated?
❑ How to mine such patterns and rules efficiently in large datasets?
❑ How to use such patterns for classification, clustering, and other applications?
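The two numbers attached to a rule such as Diaper → Beer can be computed directly from a transaction list. The sketch below is only an illustration: the five baskets are small made-up data, and the helper names `support` and `confidence` are assumptions for this example, not part of any library. Support is the fraction of all transactions containing both sides of the rule; confidence is the fraction of transactions containing the left side that also contain the right side.

```python
from typing import FrozenSet, Iterable, List

# Illustrative market-basket transactions (assumed data, not real sales records).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diaper", "Milk"}),
    frozenset({"Beer", "Bread", "Diaper", "Milk"}),
    frozenset({"Coke", "Diaper", "Milk"}),
]


def support(itemset: Iterable[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in db if items <= t) / len(db)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[FrozenSet[str]]) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent. Returns 0.0 if the antecedent never occurs."""
    a = frozenset(antecedent)
    a_count = sum(1 for t in db if a <= t)
    if a_count == 0:
        return 0.0
    both = a | frozenset(consequent)
    both_count = sum(1 for t in db if both <= t)
    return both_count / a_count


rule_support = support({"Diaper", "Beer"}, TRANSACTIONS)
rule_confidence = confidence({"Diaper"}, {"Beer"}, TRANSACTIONS)
print(f"Diaper -> Beer [{rule_support:.0%}, {rule_confidence:.0%}]")
# prints: Diaper -> Beer [40%, 67%]
```

On this toy data the rule holds in 2 of 5 baskets (support 40%), and 2 of the 3 baskets with Diaper also contain Beer (confidence 67%). Counting this way scans every transaction for every candidate itemset, which is why efficient mining methods matter on large datasets.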
Data Mining Functions: (3) Classification
❑ Classification and label prediction
❑ Construct models (functions) based on some training examples
❑ Describe and distinguish classes or concepts for future prediction
❑ Ex. 1. Classify countries based on (climate)
❑ Ex. 2. Classify cars based on (gas mileage)
❑ Predict some unknown class labels
❑ Typical methods
❑ Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic
regression, …
❑ Typical applications:
❑ Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
Data Mining Functions: (4) Cluster Analysis
❑ Unsupervised learning (i.e., Class label is
unknown)
❑ Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
❑ Principle: Maximizing intra-class similarity
& minimizing interclass similarity
❑ Many methods and applications
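As one concrete instance of grouping unlabeled data, here is a minimal k-means sketch. k-means is only one of many clustering methods and is chosen here as an assumption for illustration; the house coordinates are made up, and the deterministic "first k points" initialization is a simplification (practical implementations usually use random or smarter seeding).

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int,
           iterations: int = 10) -> Tuple[List[int], List[Point]]:
    """Tiny k-means: returns (cluster label per point, final centroids)."""
    centroids: List[Point] = list(points[:k])  # simplistic deterministic init
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return labels, centroids


# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(houses, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even though both initial centroids start inside the same group, the update step pulls one of them toward the far group, so points within a cluster end up close to each other and far from the other cluster.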

Data Mining Functions: (5) Outlier Analysis
❑ Outlier analysis
❑ Outlier: A data object that does not comply with the
general behavior of the data
❑ Noise or exception?―One person’s garbage could be
another person’s treasure
❑ Methods: by-product of clustering or regression analysis, …
❑ Useful in fraud detection, rare events analysis

Data Mining Functions: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
❑ Sequence, trend and evolution analysis
❑ Trend, time-series, and deviation analysis
❑ e.g., regression and value prediction
❑ Sequential pattern mining
❑ e.g., buy digital camera, then buy large memory cards
❑ Periodicity analysis
❑ Motifs and biological sequence analysis
❑ Approximate and consecutive motifs
❑ Similarity-based analysis
❑ Mining data streams
❑ Ordered, time-varying, potentially infinite, data streams

Data Mining Functions: (7) Structure and
Network Analysis
❑ Graph mining
❑ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
❑ Information network analysis
❑ Social networks: actors (objects, nodes) and relationships (edges)
❑ e.g., author networks in CS, terrorist networks
❑ Multiple heterogeneous networks
❑ A person could be in multiple information networks: friends, family, classmates, …
❑ Links carry a lot of semantic information: Link mining
❑ Web mining
❑ Web is a big information network: from PageRank to Google
❑ Analysis of Web information networks
❑ Web community discovery, opinion mining, usage mining, …
Evaluation of Knowledge
❑ Is all mined knowledge interesting?
❑ One can mine a tremendous number of “patterns”
❑ Some may fit only certain dimension space (time, location, …)
❑ Some may not be representative, may be transient, …
❑ Evaluation of mined knowledge → directly mine only interesting knowledge?
❑ Descriptive vs. predictive
❑ Coverage
❑ Typicality vs. novelty
❑ Accuracy
❑ Timeliness

❑ …
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Database Technology, Algorithm, and High-Performance Computing]
Why Confluence of Multiple Disciplines?
❑ Tremendous amount of data
❑ Algorithms must be scalable to handle big data
❑ High-dimensionality of data
❑ Micro-array may have tens of thousands of dimensions
❑ High complexity of data
❑ Data streams and sensor data
❑ Time-series data, temporal data, sequence data
❑ Structured data, graphs, social and information networks
❑ Spatial, spatiotemporal, multimedia, text and Web data
❑ Software programs, scientific simulations
❑ New and sophisticated applications

Applications of Data Mining
❑ Web page analysis: classification, clustering, ranking
❑ Collaborative analysis & recommender systems
❑ Basket data analysis to targeted marketing
❑ Biological and medical data analysis
❑ Data mining and software engineering
❑ Data mining and text analysis
❑ Data mining and social and information network analysis
❑ Built-in (invisible data mining) functions in Google, MS, Yahoo!, LinkedIn, Facebook, …
❑ Major dedicated data mining systems/tools
❑ SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools
Major Issues in Data Mining (1)
❑ Mining Methodology
❑ Mining various and new kinds of knowledge
❑ Mining knowledge in multi-dimensional space
❑ Data mining: An interdisciplinary effort
❑ Boosting the power of discovery in a networked environment
❑ Handling noise, uncertainty, and incompleteness of data
❑ Pattern evaluation and pattern- or constraint-guided mining
❑ User Interaction
❑ Interactive mining
❑ Incorporation of background knowledge
❑ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
❑ Efficiency and Scalability
❑ Efficiency and scalability of data mining algorithms
❑ Parallel, distributed, stream, and incremental mining methods
❑ Diversity of data types
❑ Handling complex types of data
❑ Mining dynamic, networked, and global data repositories
❑ Data mining and society
❑ Social impacts of data mining
❑ Privacy-preserving data mining
❑ Invisible data mining
Types of Data Sets: (1) Record Data
❑ Relational records
❑ Relational tables, highly structured
❑ Data matrix, e.g., numerical matrix, crosstabs

❑ Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

❑ Document data: Term-frequency vector (matrix) of text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
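A term-frequency vector of the kind used for document data can be produced by counting how often each vocabulary word occurs in a text. The following is a minimal sketch: the sample sentence is invented, the tokenizer is a deliberately crude lowercase regex split, and the function name `term_frequency_vector` is an assumption for this example.

```python
import re
from collections import Counter
from typing import List

# Fixed vocabulary; each position in the output vector corresponds to one term.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Count occurrences of each vocabulary term in `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur, which gives a sparse vector.
    return [counts[term] for term in vocabulary]


# Hypothetical document text, for illustration only.
doc = ("The coach called a timeout. The team lost the game, "
       "but the team will play again next season.")
vector = term_frequency_vector(doc, VOCABULARY)
print(vector)  # [2, 1, 1, 0, 0, 1, 0, 1, 1, 1]
```

Stacking one such vector per document gives the document-term matrix. Words outside the vocabulary (here "the", "called", and so on) are simply ignored, and no stemming is applied, so "plays" would not count toward "play".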
Types of Data Sets: (2) Graphs and Networks
❑ Transportation network

❑ World Wide Web

❑ Molecular Structures

❑ Social or information networks


Types of Data Sets: (3) Ordered Data
❑ Video data: sequence of images

❑ Temporal data: time-series

❑ Sequential Data: transaction sequences

❑ Genetic sequence data


Types of Data Sets: (4) Spatial, image and multimedia Data

❑ Spatial data: maps

❑ Image data:

❑ Video data:

Important Characteristics of Structured Data
❑ Dimensionality
❑ Curse of dimensionality
❑ Sparsity
❑ Only presence counts
❑ Resolution
❑ Patterns depend on the scale
❑ Distribution
❑ Centrality and dispersion

Data Objects
❑ Data sets are made up of data objects
❑ A data object represents an entity
❑ Examples:
❑ sales database: customers, store items, sales
❑ medical database: patients, treatments
❑ university database: students, professors, courses
❑ Also called samples, examples, instances, data points, objects, tuples
❑ Data objects are described by attributes
❑ Database rows → data objects; columns → attributes

Attributes
❑ Attribute (or dimensions, features, variables)
❑ A data field, representing a characteristic or feature of a data object.
❑ E.g., customer_ID, name, address
❑ Types:
❑ Nominal (e.g., red, blue)
❑ Binary (e.g., {true, false})
❑ Ordinal (e.g., {freshman, sophomore, junior, senior})
❑ Numeric: quantitative
❑ Interval-scaled: 100 °C is interval-scaled
❑ Ratio-scaled: 100 K is ratio-scaled since it is twice as high as 50 K
❑ Q1: Is student ID a nominal, ordinal, or interval-scaled attribute?
❑ Q2: What about eye color? Or color in the color spectrum of physics?
Attribute Types
❑ Nominal: categories, states, or “names of things”
❑ Hair_color = {auburn, black, blond, brown, grey, red, white}
❑ marital status, occupation, ID numbers, zip codes
❑ Binary
❑ Nominal attribute with only 2 states (0 and 1)
❑ Symmetric binary: both outcomes equally important
❑ e.g., gender
❑ Asymmetric binary: outcomes not equally important.
❑ e.g., medical test (positive vs. negative)
❑ Convention: assign 1 to most important outcome (e.g., HIV positive)
❑ Ordinal
❑ Values have a meaningful order (ranking) but magnitude between successive
values is not known
❑ Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
❑ Quantity (integer or real-valued)

❑ Interval

❑ Measured on a scale of equal-sized units


❑ Values have order
❑ E.g., temperature in °C or °F, calendar dates
❑ No true zero-point
❑ Ratio

❑ Inherent zero-point
❑ We can speak of a value as being a multiple (or ratio) of another value
(10 K is twice as high as 5 K).
❑ e.g., temperature in Kelvin, length, counts, monetary quantities
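The practical difference between interval and ratio scales can be checked with a few lines of arithmetic. This sketch uses the standard Celsius-to-Fahrenheit and Celsius-to-Kelvin conversions; the three sample temperatures and the variable names are chosen only for illustration.

```python
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


hot_c, warm_c, cold_c = 100.0, 50.0, 0.0

# Ratios of interval-scaled values depend on the arbitrary zero point,
# so "100 °C is twice as hot as 50 °C" is not a meaningful statement.
ratio_c = hot_c / warm_c                                                 # 2.0
ratio_f = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(warm_c)   # 212 / 122
# Kelvin has a true zero, so this ratio is the physically meaningful one.
ratio_k = celsius_to_kelvin(hot_c) / celsius_to_kelvin(warm_c)           # 373.15 / 323.15

# Differences, in contrast, compare consistently on both interval scales.
diff_ratio_c = (hot_c - warm_c) / (warm_c - cold_c)
diff_ratio_f = (
    (celsius_to_fahrenheit(hot_c) - celsius_to_fahrenheit(warm_c))
    / (celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c))
)

print(f"ratio in C: {ratio_c:.3f}, in F: {ratio_f:.3f}, in K: {ratio_k:.3f}")
print(f"difference ratio in C: {diff_ratio_c}, in F: {diff_ratio_f}")
```

The same pair of temperatures gives a ratio of 2.0 in Celsius but about 1.74 in Fahrenheit, while the ratio of differences is 1.0 in both units. That is the sense in which interval attributes support subtraction but not multiplication.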
Discrete vs. Continuous Attributes
❑ Discrete Attribute
❑ Has only a finite or countably infinite set of values
❑ E.g., zip codes, profession, or the set of words in a collection of documents
❑ Sometimes, represented as integer variables
❑ Note: Binary attributes are a special case of discrete attributes
❑ Continuous Attribute
❑ Has real numbers as attribute values
❑ E.g., temperature, height, or weight
❑ Practically, real values can only be measured and represented using a finite
number of digits
❑ Continuous attributes are typically represented as floating-point variables
Visualizing Complex Data and Relations: Social Networks
❑ Visualizing non-numerical data: social and information networks

[Figure panels: organizing information networks; a typical network structure; a social network]
What is Data Preprocessing? — Major Tasks
❑ Data cleaning
❑ Handle missing data, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data reduction
❑ Dimensionality reduction
❑ Numerosity reduction
❑ Data compression
❑ Data transformation and data discretization
❑ Normalization
❑ Concept hierarchy generation
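Normalization, listed above under data transformation, rescales an attribute so that attributes with large ranges do not dominate those with small ranges. The slides do not name a specific method; the sketch below assumes two common choices, min-max scaling and z-score standardization, and uses made-up income values.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max(values: Sequence[float], new_min: float = 0.0,
            new_max: float = 1.0) -> List[float]:
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical annual incomes, for illustration only.
incomes = [12000.0, 35000.0, 54000.0, 73600.0, 98000.0]
scaled = min_max(incomes)
standardized = z_score(incomes)
print([round(v, 3) for v in scaled])  # [0.0, 0.267, 0.488, 0.716, 1.0]
```

Min-max keeps the original relationships between values but is sensitive to extreme values, since a single outlier stretches the range. The z-score version has no fixed output range, which can make it the safer choice when the true minimum and maximum are unknown.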
Why Preprocess the Data? — Data Quality Issues
❑ Measures for data quality: A multidimensional view
❑ Accuracy: correct or wrong, accurate or not
❑ Completeness: not recorded, unavailable, …
❑ Consistency: some modified but some not, dangling, …
❑ Timeliness: timely update?
❑ Believability: how much are the data trusted to be correct?
❑ Interpretability: how easily the data can be understood?

Data Cleaning
❑ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, and transmission error
❑ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
❑ e.g., Occupation = “ ” (missing data)
❑ Noisy: containing noise, errors, or outliers
❑ e.g., Salary = “−10” (an error)
❑ Inconsistent: containing discrepancies in codes or names, e.g.,
❑ Age = “42”, Birthday = “03/07/2010”
❑ Was rating “1, 2, 3”, now rating “A, B, C”
❑ discrepancy between duplicate records
❑ Intentional (e.g., disguised missing data)
❑ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
❑ Data is not always available
❑ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
❑ Missing data may be due to
❑ Equipment malfunction
❑ Inconsistent with other recorded data and thus deleted
❑ Data were not entered due to misunderstanding
❑ Certain data may not be considered important at the time of entry
❑ Did not register history or changes of the data
❑ Missing data may need to be inferred

How to Handle Missing Data?
❑ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it in automatically with
❑ a global constant : e.g., “unknown”, a new class?!
❑ the attribute mean
❑ the attribute mean for all samples belonging to the same class: smarter
❑ the most probable value: inference-based such as Bayesian formula or decision
tree
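Two of the automatic strategies above, the attribute mean and the per-class attribute mean, can be sketched as follows. The records, the attribute names `income` and `risk`, and both function names are hypothetical; `None` is assumed to mark a missing value.

```python
from statistics import mean
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical records; None marks a missing income.
ROWS: List[Row] = [
    {"income": 30000.0, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50000.0, "risk": "low"},
    {"income": 90000.0, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 110000.0, "risk": "high"},
]


def fill_with_attribute_mean(rows: List[Row], attr: str) -> List[Row]:
    """Replace missing values with the mean over all known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows: List[Row], attr: str, label: str) -> List[Row]:
    """Replace missing values with the mean of known values in the same class."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    groups: Dict[Any, List[float]] = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(v) for c, v in groups.items()}
    # Fall back to the overall mean if a class has no known values at all.
    return [
        {**r, attr: class_means.get(r[label], overall) if r[attr] is None else r[attr]}
        for r in rows
    ]


by_attribute = fill_with_attribute_mean(ROWS, "income")
by_class = fill_with_class_mean(ROWS, "income", "risk")
print(by_attribute[1]["income"], by_class[1]["income"], by_class[4]["income"])
# prints: 70000.0 40000.0 100000.0
```

The global mean assigns 70000 to both incomplete records, while the class-conditional mean assigns 40000 to the low-risk record and 100000 to the high-risk one, which is why the slide calls the second option smarter.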

Noisy Data
❑ Noise: random error or variance in a measured variable
❑ Incorrect attribute values may be due to
❑ Faulty data collection instruments
❑ Data entry problems
❑ Data transmission problems
❑ Technology limitation
❑ Inconsistency in naming convention
❑ Other data problems
❑ Duplicate records
❑ Incomplete data
❑ Inconsistent data

How to Handle Noisy Data?
❑ Binning
❑ First sort data and partition into (equal-frequency) bins
❑ Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
❑ Regression
❑ Smooth by fitting the data to regression functions
❑ Clustering
❑ Detect and remove outliers
❑ Semi-supervised: Combined computer and human inspection
❑ Detect suspicious values and check by human (e.g., deal with possible outliers)
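The binning approach from the list above can be sketched in a few functions: sort, split into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The price list is illustrative data and the function names are assumptions for this example; when the values do not divide evenly, this sketch gives the earlier bins one extra value each.

```python
from typing import List, Sequence


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[List[float]]:
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 0 < n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value by the closer of the bin's min and max (ties go to min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative sorted prices
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means removes more local variation, whereas smoothing by boundaries keeps each result equal to a value that actually occurred in the data.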

Data Cleaning as a Process
❑ Data discrepancy detection
❑ Use metadata (e.g., domain, range, dependency, distribution)
❑ Check field overloading
❑ Check uniqueness rule, consecutive rule and null rule
❑ Use commercial tools
❑ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
❑ Data auditing: analyze the data to discover rules and relationships and to detect violators
(e.g., correlation and clustering to find outliers)
❑ Data migration and integration
❑ Data migration tools: allow transformations to be specified
❑ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
❑ Integration of the two processes
❑ Iterative and interactive (e.g., Potter’s Wheel)
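The discrepancy-detection checks named above (null rule, uniqueness rule, and range metadata) can each be expressed as a small function. Everything in this sketch is hypothetical: the customer rows, the set of markers treated as "missing", the salary bounds, and the function names. Real cleaning tools expose much richer rule languages.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]

# Values treated as "missing" in this sketch (an assumption, adjust per dataset).
MISSING_MARKERS = {None, "", "?", "N/A"}


def null_rule_violations(rows: List[Row], attr: str) -> List[int]:
    """Indices of rows whose value for `attr` is missing."""
    return [i for i, r in enumerate(rows) if r.get(attr) in MISSING_MARKERS]


def unique_rule_violations(rows: List[Row], attr: str) -> List[Any]:
    """Values of `attr` that occur more than once although they should be unique."""
    counts = Counter(r.get(attr) for r in rows
                     if r.get(attr) not in MISSING_MARKERS)
    return [value for value, n in counts.items() if n > 1]


def range_violations(rows: List[Row], attr: str,
                     low: float, high: float) -> List[int]:
    """Indices of rows whose known value for `attr` lies outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r.get(attr) not in MISSING_MARKERS and not (low <= r[attr] <= high)]


# Hypothetical customer records with deliberately planted problems.
customers: List[Row] = [
    {"customer_id": 101, "occupation": "engineer", "salary": 52000},
    {"customer_id": 102, "occupation": "", "salary": -10},
    {"customer_id": 102, "occupation": "teacher", "salary": 61000},
    {"customer_id": 104, "occupation": "nurse", "salary": None},
]

print(null_rule_violations(customers, "occupation"))       # [1]
print(unique_rule_violations(customers, "customer_id"))    # [102]
print(range_violations(customers, "salary", 0, 1_000_000)) # [1]
```

These checks only flag suspicious records. Deciding whether to correct, drop, or keep each one is the interactive part of the process and usually needs domain knowledge.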
END OF UNIT - I
