Data Analytics-Introduction: manish@IIITA
Introduction
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Major issues in data mining
• Data explosion problem
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases,
and Web databases
What Is Data Mining?
• Data mining (knowledge discovery in databases): discovering
interesting patterns from large amounts of data
Why Data Mining? — Potential Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
– Fraud detection and management
• Other Applications
– Text mining (news group, email, documents) and Web
analysis.
– Intelligent query answering
Market Analysis and Management (1)
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
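As a minimal sketch of how associations between product sales might be quantified, the Python below computes the support and confidence of one rule over a handful of made-up baskets. The transactions, item names, and function names are illustrative assumptions, not data or code from the lecture.

```python
# Hypothetical market baskets; each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) taken from the baskets."""
    joint = support(set(antecedent) | set(consequent), baskets)
    base = support(antecedent, baskets)
    return joint / base if base else 0.0

# Rule under test: customers who buy bread also buy milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence is a candidate for the prediction step: the presence of the antecedent in a basket suggests the consequent is likely to be bought as well.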
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy
what products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new
customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency
and variation)
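Central tendency and variation can be illustrated with Python's standard `statistics` module. This is only a sketch; the spending figures are invented for the example.

```python
import statistics

# Hypothetical monthly spending (in dollars) for one customer segment.
spending = [120, 135, 150, 110, 400, 125, 140]

summary = {
    "mean": statistics.mean(spending),      # central tendency
    "median": statistics.median(spending),  # central tendency, robust to outliers
    "stdev": statistics.stdev(spending),    # variation (sample standard deviation)
    "range": max(spending) - min(spending), # variation (spread of extremes)
}
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

The single large value (400) pulls the mean well above the median, which is one reason summary reports usually give more than one measure.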
Corporate Analysis and Risk Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– cross-sectional and time series analysis
(financial-ratio, trend analysis, etc.)
• Resource planning:
– summarize and compare the resources and spending
• Competition:
– monitor competitors and market directions
– group customers into classes and set up a class-based
pricing procedure
– set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of referrals
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
– The Australian Health Insurance Commission found that
blanket screening tests were requested in many cases
(saving A$1m/yr).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration,
time of day or week. Analyze patterns that deviate
from an expected norm.
– British Telecom identified discrete groups of callers
with frequent intra-group calls, especially on mobile
phones, and uncovered a multimillion-dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to
dishonest employees.
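The idea of analyzing patterns that deviate from an expected norm can be sketched with a simple z-score test. The sketch assumes call duration is the only feature, although the telephone call model above also uses destination and time of day. The data, the function name `deviating_calls`, and the threshold of 3 are illustrative assumptions.

```python
import statistics

# Hypothetical call durations in minutes for one subscriber.
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0, 4.0, 3.5]
new_calls = [3.2, 47.0, 4.1]

def deviating_calls(baseline, candidates, threshold=3.0):
    """Return candidates whose z-score against the baseline exceeds threshold."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A constant baseline: anything different from it deviates.
        return [c for c in candidates if c != mean]
    return [c for c in candidates if abs(c - mean) / stdev > threshold]

flagged = deviating_calls(history, new_calls)
print(flagged)
```

A flagged call is only a candidate for review. A real system would combine several features and tune the threshold against known fraud cases.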
Other Applications
• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain a competitive advantage
for the New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars
with the help of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer
preferences and behavior, analyze the effectiveness of
Web marketing, improve Web site organization, etc.
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process stages, from bottom to top: Databases; Data Cleaning and Data Integration; Task-relevant Data; Data Mining; Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction,
invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association,
clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
• Use of discovered knowledge
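The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, the function names, and the 50% interestingness threshold are all hypothetical, and the mining step is a plain frequency summary rather than any particular algorithm from the lecture.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, region, product); None marks a missing value.
raw = [
    (1, "north", "tea"),
    (2, "north", None),      # incomplete record
    (3, "south", "coffee"),
    (4, "north", "tea"),
    (4, "north", "tea"),     # duplicate record
    (5, "north", "coffee"),
    (6, "north", "tea"),
]

def clean(records):
    """Data cleaning: drop incomplete records and exact duplicates, keeping order."""
    seen, result = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        result.append(rec)
    return result

def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [rec for rec in records if rec[1] == region]

def transform(records):
    """Data reduction: keep only the feature being mined (the product)."""
    return [rec[2] for rec in records]

def mine(products):
    """Data mining (summarization): relative frequency of each product."""
    counts = Counter(products)
    return {p: n / len(products) for p, n in counts.items()}

def evaluate(patterns, min_share=0.5):
    """Pattern evaluation: keep only patterns above an interestingness threshold."""
    return {p: share for p, share in patterns.items() if share >= min_share}

knowledge = evaluate(mine(transform(select(clean(raw), "north"))))
print(knowledge)
```

Keeping each step as a separate function mirrors the process: any stage can be swapped or rerun without touching the others, which matters because cleaning and preprocessing tend to be revisited most often.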
Data Mining and Business Intelligence
[Figure: layered diagram with increasing potential to support business decisions toward the top. Labels, from the top: Making Decisions (End User); Data Exploration (Statistical Analysis, Querying and Reporting); Pattern evaluation; Data Warehouse; Databases]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous databases
– WWW
Data Mining Functionalities (1)
• Concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Data Mining Functionalities (3)
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
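Trend and deviation analysis by regression can be sketched with an ordinary least-squares line fit. The sales figures and the function name `fit_line` are assumptions made for the example; a real analysis would likely use a statistics library and account for seasonality.

```python
# Hypothetical monthly sales figures; the month index is the predictor.
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.5, 15.5, 18.0, 20.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(months, sales)

# Deviation of each observation from the fitted trend line.
residuals = [y - (slope * x + intercept) for x, y in zip(months, sales)]
largest = max(residuals, key=abs)
print(f"trend: {slope:.2f} per month; largest deviation: {largest:.2f}")
```

The slope summarizes the trend, while the residuals show how far each observation deviates from it. Unusually large residuals are the ones worth inspecting.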
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
A Multi-Dimensional View of Data Mining Classification
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining etc.
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of
abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and
global information systems (WWW)
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
– Protection of data security, integrity, and privacy
Summary
• Data mining: discovering interesting patterns from large amounts
of data
• A natural evolution of database technology, in great demand, with
wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
• Classification of data mining systems
• Major issues in data mining