
Unit 5 Introduction To Data Mining: Prashasti Kanikar 9/26/2020

This document introduces data mining, which aims to discover hidden patterns in large databases. It describes the kinds of data that can be mined, including relational databases, time-series data, graphs, text, and web data. It also outlines several data mining techniques, such as classification, clustering, association analysis, and outlier detection, and closes with common applications and challenges of data mining.

Uploaded by

Hansica Madurkar

Unit 5

Introduction to Data Mining

1 PRASHASTI KANIKAR 9/26/2020


• Data Mining is “an information extraction activity whose goal is to discover hidden facts contained in large databases.”



Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia database
  • Text databases
  • The World-Wide Web



What Kinds of Patterns Can Be Mined?
Data Mining Function: (1) Generalization
• Information integration and data warehouse construction
  • Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
  • Scalable methods for computing (i.e., materializing) multidimensional aggregates
  • OLAP (online analytical processing)
• Multidimensional concept description: Characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
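
To make "materializing multidimensional aggregates" concrete, here is a minimal sketch that sums one measure over every combination of dimensions of a tiny fact table. The table, the dimension names (`region`, `season`), and the function `materialize_cube` are assumptions made up for illustration. Real data cube systems use far more scalable methods than this brute-force loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall_mm). Values are made up.
FACTS = [
    ("dry", "summer", 10),
    ("dry", "winter", 30),
    ("wet", "summer", 120),
    ("wet", "winter", 200),
    ("wet", "winter", 180),
]
DIMENSIONS = ("region", "season")


def materialize_cube(facts, dimensions):
    """Sum the measure for every subset of dimensions (one cuboid per subset)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for kept in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for *dims, measure in facts:
                # Keep only the chosen dimensions; the rest are rolled up.
                key = tuple(dims[i] for i in kept)
                cuboid[key] += measure
            names = tuple(dimensions[i] for i in kept)
            cube[names] = dict(cuboid)
    return cube


cube = materialize_cube(FACTS, DIMENSIONS)
print(cube[()])             # grand total over all facts
print(cube[("region",)])    # rolled up over season: dry vs. wet
```

Looking up `cube[("region",)]` is the kind of roll-up an OLAP tool would answer when contrasting the dry and wet regions.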
Data Mining Function: (2) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
  • What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
  • A typical association rule
    • Bread → Butter [0.5%, 75%] (support, confidence)
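
In a rule such as Bread → Butter, support is the fraction of all transactions that contain both items, and confidence is the fraction of transactions containing bread that also contain butter. The sketch below computes both on a small, made-up set of baskets. The data and the function names are assumptions for illustration, so the resulting numbers differ from the slide's 0.5% and 75%.

```python
# Hypothetical market-basket transactions, made up for illustration.
TRANSACTIONS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


print(support({"bread", "butter"}, TRANSACTIONS))       # 0.5
print(confidence({"bread"}, {"butter"}, TRANSACTIONS))  # 0.8
```

A high confidence only says the items co-occur. As the slide notes, association is not causality.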



Data Mining Function: (3) Classification
• Classification and label prediction
  • Construct models (functions) based on some training examples
  • Describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  • Predict some unknown class labels
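
As a small illustration of building a model from training examples and then predicting unknown labels, the sketch below learns a single gas-mileage threshold (a one-level decision tree, often called a decision stump). The training data, the labels `"low"` and `"high"`, and the function names are assumptions made up for this example. The slides do not prescribe a particular classifier.

```python
# Hypothetical training examples: (miles per gallon, mileage class).
TRAINING = [
    (12, "low"), (15, "low"), (18, "low"), (22, "low"),
    (28, "high"), (31, "high"), (35, "high"), (40, "high"),
]


def predict(threshold, mpg):
    """Label a car by comparing its mileage with the learned threshold."""
    return "high" if mpg > threshold else "low"


def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training cars."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, x) != label for x, label in examples)

    return min(candidates, key=errors)


threshold = train_stump(TRAINING)
print(threshold)               # learned from the labelled cars
print(predict(threshold, 30))  # label for a car that was not in the training set
```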



Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass similarity
• Many methods and applications
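
One of the many clustering methods is k-means, sketched below on made-up house locations. The slides do not single out k-means. It is used here only because it shows the principle directly: points end up grouped with the centroid they are most similar (closest) to. The data, the naive initialization, and the function name are assumptions.

```python
import math

# Hypothetical house locations (x, y); values are made up.
HOUSES = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]


def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(points[:k])  # naive deterministic start (an assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


centroids, clusters = kmeans(HOUSES, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No class labels are supplied anywhere. The two groups emerge from the distances alone, which is what makes this unsupervised.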



Data Mining Function: (5) Outlier Analysis
• Outlier analysis
  • Outlier: A data object that does not comply with the general behavior of the data
  • Noise or exception? One person’s garbage could be another person’s treasure
  • Methods: by-product of clustering or regression analysis, …
  • Useful in fraud detection, rare events analysis
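
The slide lists clustering and regression as sources of outlier methods. The sketch below uses a simpler stand-in, a z-score test, to show the basic idea of flagging an object that departs from the general behavior of the data. The amounts, the threshold of 2.5, and the function name are assumptions chosen for illustration.

```python
import statistics

# Hypothetical transaction amounts; the last one is deliberately extreme.
AMOUNTS = [52, 48, 50, 47, 53, 49, 51, 50, 46, 500]


def zscore_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can stand out
    return [v for v in values if abs(v - mean) / stdev > threshold]


print(zscore_outliers(AMOUNTS))
```

Whether the flagged amount is noise to discard or a fraud case worth investigating is a judgment the method itself cannot make.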



Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis
  • Trend, time-series, and deviation analysis: e.g., regression and value prediction
  • Sequential pattern mining
    • e.g., first buy digital camera, then buy large SD memory cards
  • Periodicity analysis
  • Motifs and biological sequence analysis
    • Approximate and consecutive motifs
  • Similarity-based analysis
• Mining data streams
  • Ordered, time-varying, potentially infinite data streams
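
Sequential pattern mining differs from plain association analysis in that order matters: "camera, then SD card" is not the same pattern as "SD card, then camera". The sketch below measures how often an ordered pattern occurs in made-up purchase histories. The data and the function names are assumptions, and a full miner would also search for the candidate patterns rather than test a single given one.

```python
# Hypothetical purchase histories, one time-ordered list per customer.
HISTORIES = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]


def contains_in_order(history, pattern):
    """True if the pattern's items occur in the history in the same order (gaps allowed)."""
    remaining = iter(history)
    # `item in remaining` consumes the iterator, so each match must come after the previous one.
    return all(item in remaining for item in pattern)


def sequence_support(pattern, histories):
    """Fraction of histories that contain the ordered pattern."""
    return sum(contains_in_order(h, pattern) for h in histories) / len(histories)


print(sequence_support(["digital camera", "SD card"], HISTORIES))  # 0.5
print(sequence_support(["SD card", "digital camera"], HISTORIES))  # 0.25
```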



Structure and Network Analysis
• Graph mining
  • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
• Information network analysis
  • Social networks: actors (objects, nodes) and relationships (edges)
    • e.g., author networks in CS, terrorist networks
  • Multiple heterogeneous networks
    • A person could be in multiple information networks: friends, family, classmates, …
  • Links carry a lot of semantic information: Link mining
• Web mining
  • Web is a big information network: from PageRank to Google
  • Analysis of Web information networks
    • Web community discovery, opinion mining, usage mining, …
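
PageRank is a well-known example of link mining: a page is important if important pages link to it. The sketch below runs the usual power-iteration form of PageRank on a made-up four-page graph. The graph, the damping factor of 0.85, the iteration count, and the function name are assumptions for illustration, not details taken from the slides.

```python
# Hypothetical four-page web graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # a page with no links spreads its rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C collects links from three pages and so ranks first, while D, which nothing links to, keeps only the baseline share.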



Technologies to be used

[Figure: Data Mining: Confluence of Multiple Disciplines. Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing.]


Why Confluence of Multiple Disciplines?
• Tremendous amount of data
  • Algorithms must be highly scalable to handle data volumes such as terabytes
• High-dimensionality of data
  • Micro-array may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structure data, graphs, social networks and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text and Web data
  • Software programs, scientific simulations
• New and sophisticated applications



Data Mining Application: Marketing
• Sales Analysis
  • Associations between product sales:
    • bread and butter
    • toothpaste and toothbrush
• Customer Profiling
  • Data mining can tell you what types of customers buy what products
• Identifying Customer Requirements
  • Identify the best products for different customers
  • Use prediction to find what factors will attract new customers
Data Mining Application: Fraud Detection
• Association rule mining can detect a group of people who stage accidents to collect on insurance
• A data-mining application can be used to detect suspicious money transactions
• Data mining can be used to help commercial lending decisions and to prevent fraud



Other Applications of Data Mining
• Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining



Major Issues in Data Mining
• Mining Methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: An interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results
