Data Mining
Data Mining References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, Morgan Kaufmann Publishers, Elsevier, 3rd Edition,
2012.
Margaret H. Dunham, Data Mining: Introduction and Advanced
Topics, Pearson Education, 2006.
Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to
Data Mining, Pearson Education, 2006.
Richard O. Duda, Peter E. Hart and David G. Stork, Pattern
Classification, Wiley Publication, 2nd Edition, 2000.
Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical
Machine Learning Tools and Techniques, Morgan Kaufmann
Publishers, Elsevier, 3rd Edition, 2011.
IEEE Transactions on Knowledge and Data Engineering
ACM Transactions on Information Systems
ACM Transactions on Database Systems
ACM Transactions on Internet Technology
Data Mining Objectives
Data Mining or
Knowledge Discovery from Data
OBJECTIVES
Understanding basic data mining concepts &
techniques:
uncovering interesting data patterns, hidden in large data
sets
Development of data mining tools:
scalable and efficient
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics, or
linguistics.)
Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally
accessible
Scientific info. management, acquisition, organization, query, and
visualization tasks scale almost linearly with data volumes. Data mining is a
major new challenge!
Evolution of Database
Technology
1960s:
data creation & collection (electronic mode)
IMS (hierarchical database system by IBM)
network DBMS
1970s:
relational data model
relational DBMS implementation
1980s:
RDBMS
advanced data models
extended-relational, OO, deductive, etc.
application-oriented DBMS
spatial, scientific, engineering, etc.
1990s:
Data mining
Data warehousing
Multimedia databases
Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology
XML
data integration
social networks
global information systems
DM Evolution
Data Mining Importance
The Explosive Growth of Data:
Terabytes (2^40 bytes)
Petabytes
Exabytes
Zettabytes
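As a quick check on these units, here is a minimal sketch that assumes the binary (power-of-two) convention, in which each unit is 2^10 = 1024 times the previous one. The names `UNITS` and `bytes_in` are illustrative only.

```python
# Binary-prefix sizes: each unit is 2**10 times the previous one.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit`, using powers of two."""
    exponent = 10 * (UNITS.index(unit) + 1)
    return 2 ** exponent


# Print the four units named on the slide.
for unit in UNITS[3:]:
    exponent = 10 * (UNITS.index(unit) + 1)
    print(f"1 {unit} = 2^{exponent} bytes = {bytes_in(unit):,}")
```

Under the decimal (SI) convention the same names denote powers of 1000 instead, so a terabyte would be 10^12 bytes.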
Drowning in DATA, but STARVING for KNOWLEDGE !
Data Tombs to Golden Nuggets
PLATO
Greek philosopher and mathematician
Necessity is the Mother of Invention
Data Mining: automated analysis of massive data sets
Data Mining Definition
Data mining definition:
Extraction or mining of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from large
amounts of data stored in databases, data warehouses, or other
information repositories
Alternative names
knowledge discovery (mining) in databases (KDD)
knowledge extraction
data/pattern analysis
data archeology
data dredging
information harvesting
business intelligence, etc.
Knowledge Discovery (KDD) Process
Data mining: core of knowledge discovery process
Steps, from raw data to knowledge:
Databases
Data Cleaning (remove noise and inconsistent data)
Data Integration (combine multiple data sources)
Data Warehouse
Selection (retrieve relevant data)
Transformation (summary, aggregation etc.)
Task-relevant Data
Data Mining (intelligent methods applied to extract patterns)
Pattern
Pattern Evaluation (identify truly interesting patterns representing knowledge)
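The KDD steps can be sketched as a chain of small functions. This is only an illustrative toy, not a reference implementation: the market-basket records, the function names, and the 0.5 support threshold are all assumptions made for the example, and "data mining" here is reduced to counting item pairs that occur together.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical "databases" of market-basket records (invented toy data).
store_a = [{"id": 1, "items": ["milk", "bread"]},
           {"id": 2, "items": ["milk", "bread", "eggs"]},
           {"id": 3, "items": None}]             # noisy record
store_b = [{"id": 4, "items": ["bread", "eggs"]},
           {"id": 5, "items": ["milk", "bread"]}]


def clean(rows):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [row for row in rows if row["items"]]


def integrate(*sources):
    """Data integration: combine multiple data sources into one list."""
    return [row for source in sources for row in source]


def select(rows):
    """Selection: keep only the task-relevant field (the item lists)."""
    return [row["items"] for row in rows]


def transform(baskets):
    """Transformation: normalise each basket to a sorted tuple of unique items."""
    return [tuple(sorted(set(basket))) for basket in baskets]


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


warehouse = integrate(clean(store_a), clean(store_b))
baskets = transform(select(warehouse))
patterns = evaluate(mine(baskets), len(baskets))
print(patterns)
```

Each function corresponds to one stage of the process, so a stage can be swapped out (for example, a different mining method) without touching the others.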
Data Mining TOOLS
EXPLORE !!!!!!!!!!!!!!
R TOOL
PYTHON TOOL
WEKA TOOL
SPSS TOOL
ORANGE TOOL
CLEMENTINE TOOL
And many more.
References: DM Papers
Data Mining and Business Intelligence
Layers from top to bottom; potential to support business decisions increases toward the top (typical role in parentheses):
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Data Mining: Confluence of Multiple Disciplines
Data mining draws on:
Database Technology
Machine Learning
Pattern Recognition
Statistics
Algorithm
Visualization
Other Disciplines
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
Micro-array data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be
discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
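The descriptive/predictive split can be illustrated with a minimal sketch. The customer data is invented, and the nearest-centroid rule is just one simple stand-in for a predictive method, chosen here for brevity: `describe` summarises data already collected, while `predict` assigns a label to a value not seen before.

```python
from statistics import mean

# Invented toy data: (monthly_spend, churned) for eight customers.
records = [(20, True), (25, True), (30, True), (35, False),
           (60, False), (70, False), (80, False), (22, True)]


def describe(rows):
    """Descriptive mining: summarise the data already collected.

    Returns the mean monthly spend for each class label.
    """
    return {label: mean(x for x, y in rows if y is label)
            for label in (True, False)}


def predict(rows, x_new):
    """Predictive mining: assign a label to an unseen value.

    Uses a nearest-centroid rule: pick the class whose mean is closest.
    """
    centroids = describe(rows)
    return min(centroids, key=lambda label: abs(centroids[label] - x_new))


summary = describe(records)
print(summary)
print(predict(records, 28), predict(records, 65))
```

The same summary feeds both tasks here, which reflects how descriptive results are often a first step toward a predictive model.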
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Warehousing
consolidation of data from several databases which are in turn
maintained by individual business units along with historical and
summary information
Roll-up
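Consolidation and roll-up can be shown on a tiny example. The sketch below assumes two hypothetical business-unit databases with invented sales rows; `roll_up` is an illustrative helper, not part of any OLAP product. Rolling up means aggregating from a finer level (city) to a coarser one (country).

```python
from collections import defaultdict

# Invented sales rows from two hypothetical business-unit databases:
# (country, city, quarter, amount)
unit_north = [("India", "Delhi", "Q1", 100), ("India", "Delhi", "Q2", 120)]
unit_south = [("India", "Chennai", "Q1", 80), ("Nepal", "Kathmandu", "Q1", 40)]

# Consolidation: the warehouse holds rows from every unit in one place.
warehouse = unit_north + unit_south


def roll_up(rows, key):
    """Sum the amount column over the dimensions selected by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


by_city = roll_up(warehouse, key=lambda r: (r[0], r[1]))   # finer level
by_country = roll_up(warehouse, key=lambda r: r[0])        # rolled up
print(by_city)
print(by_country)
```

The country totals are the kind of summary information a warehouse typically stores alongside the detailed rows.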
Multi-Tiered Architecture
Data Sources: Operational DBs, other sources
Extract, Transform, Load, Refresh (Monitor & Integrator, Metadata)
Data Storage: Data Warehouse, Data Marts
OLAP Engine: OLAP Server (serves the front-end tools)
Front-End Tools: Analysis, Query, Reports, Data mining
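The flow through the tiers can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the source rows are invented, an in-memory SQLite database stands in for the data warehouse, and a single `GROUP BY` query stands in for what an OLAP server would serve to front-end tools.

```python
import sqlite3

# Data sources: two hypothetical sources with inconsistent formats.
operational_db = [("2024-01-05", "delhi", "150.0"),
                  ("2024-01-06", "Delhi", "50.0")]
other_source = [("2024-01-05", "MUMBAI", "200.0")]


def extract():
    """Extract: pull raw rows from every source."""
    return operational_db + other_source


def transform(rows):
    """Transform: normalise city names and convert amounts to numbers."""
    return [(day, city.title(), float(amount)) for day, city, amount in rows]


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE sales (day TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")   # stands in for the data warehouse
load(warehouse, transform(extract()))

# Front-end query: total sales per city, the kind of summary that
# reporting and analysis tools would request.
report = warehouse.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(report)
```

A refresh step would repeat extract, transform, and load on new source rows; it is omitted to keep the sketch short.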
Data Mining Research Publications
Tayal, D. K., Jain, A., Arora, S., Agarwal, S., Gupta, T. and Tyagi, N., Crime
Detection and Criminal Identification in India Using Data Mining Techniques,
Artificial Intelligence & Society (AIS), SPRINGER, vol. 30, no. 1, pp. 117-127,
Feb 2015. [Indexed: Scopus, Google Scholar, EBSCO, ACM Digital Library,
DBLP]
Jain, A., Yadav, D., and Tayal, D. K., NER for Hindi Language Using Association
Rules, International Conference on Data Mining and Intelligent Computing
(ICDMIC 2014), IGDTUW Delhi, India, IEEE, 5th-6th Sept 2014. [Indexed: Scopus]