CSM6404 DM L1

Data Mining Lecture of Bangladesh Agricultural University.
CSM 6404: Data Mining

Dr. Md Geaur Rahman


Director, PGD in ICT
&
Associate Professor in Computer Science
Department of Computer Science and Mathematics
Bangladesh Agricultural University

Email: [email protected]

Lecture 1
Outline
Definition, motivation & application
Branches of data mining
Major issues in data mining
What Is Data Mining?

 Data mining (knowledge discovery in databases):
◦ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases

 Alternative names and their “inside stories”:
◦ Data mining: a misnomer?
◦ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
◦ Exploratory data analysis
◦ Data driven discovery
◦ Deductive learning
Motivation:

 Data explosion problem
◦ Automated data collection tools and mature database technology lead to
tremendous amounts of data stored in databases, data warehouses and
other information repositories
 We are drowning in data, but starving for knowledge!
 Solution: Data warehousing and data mining
◦ Data warehousing and on-line analytical processing
◦ Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why Mine Data? Commercial Viewpoint

 Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
◦ Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Let us look at some examples of data sources
Netflix
Amazon
Wal-Mart
Algorithmic Trading/High Frequency Trading
Banks (Segmint)
Google/Yahoo/Microsoft/IBM
CRM/Consumer Behavior Profiling
Consumer Review
Mobile Ads
Social Network (Facebook/Twitter/Google+)
…
Why Mine Data? Scientific Viewpoint
 Data collected and stored at enormous speeds (GB/hour)
◦ Remote sensors on a satellite
◦ Telescopes scanning the skies
◦ Microarrays generating gene expression data
◦ Scientific simulations generating terabytes of data
 Traditional techniques are infeasible for raw data
 Data mining may help scientists
◦ in classifying and segmenting data
◦ in hypothesis formation
Examples: What is (not) Data Mining?

 What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

 What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according
to their context (e.g. Amazon rainforest, Amazon.com)
Database Processing vs. Data Mining Processing

Database processing
◦ Query: well defined, expressed in SQL
◦ Data: operational data
◦ Output: precise, a subset of the database

Data mining processing
◦ Query: poorly defined, no precise query language
◦ Data: not operational data
◦ Output: fuzzy, not a subset of the database
Evolution of Database Technology
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.

Data Mining
– Find all credit applicants who are poor credit risks. (classification)

– Identify customers with similar buying habits. (Clustering)

– Find all items which are frequently purchased with milk. (association
rules)
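
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the baskets, the function names `database_query` and `frequent_with`, and the 0.5 confidence threshold are all made up for the example and are not part of the lecture.

```python
from collections import Counter

# Hypothetical toy transactions (not from the lecture); each set is one basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]

def database_query(baskets, item):
    """Database-style query: return exactly the baskets that contain `item`."""
    return [b for b in baskets if item in b]

def frequent_with(baskets, item, min_confidence=0.5):
    """Mining-style question: which items are often bought together with `item`?

    Returns {other_item: confidence}, where confidence is the fraction of
    baskets containing `item` that also contain `other_item`.
    """
    with_item = database_query(baskets, item)
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }

milk_baskets = database_query(transactions, "milk")   # a subset of the data
rules = frequent_with(transactions, "milk")           # a derived pattern
print(len(milk_baskets), rules)
```

The first call returns rows that already exist in the data; the second returns a pattern (item and confidence) that is stored nowhere in the database.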
Potential Applications
 Data analysis and decision support
◦ Market analysis and management
 Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
◦ Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
◦ Fraud detection and detection of unusual patterns (outliers)
 Other Applications
◦ Text mining (news group, email, documents) and Web mining
◦ Stream data mining
◦ Bioinformatics and bio-data analysis
Ex.: Market Analysis and Management
 Where does the data come from?—Credit card
transactions, loyalty cards, discount coupons, customer
complaint calls, surveys …
 Target marketing
◦ Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.,
 E.g. Most customers with income level 60k – 80k with food expenses $600 - $800
a month live in that area
◦ Determine customer purchasing patterns over time
 E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k
usually buy this type of CD player
 Cross-market analysis: find associations/correlations between
product sales, and predict based on such associations
◦ E.g. Customers who buy computer A usually buy software B
Ex.: Market Analysis and Management (2)
 Customer requirement analysis
◦ Identify the best products for different customers
◦ Predict what factors will attract new customers
 Provision of summary information
◦ Multidimensional summary reports
 E.g. Summarize all transactions of the first quarter from three different branches
Summarize all transactions of last year from a particular branch
Summarize all transactions of a particular product
◦ Statistical summary information
 E.g. What is the average age for customers who buy product A?

 Fraud detection
◦ Find outliers of unusual transactions
 Financial planning
◦ Summarize and compare the resources and spending
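
As a rough illustration of "find outliers of unusual transactions", the sketch below flags amounts that lie far from the median. The technique (a modified z-score based on the median absolute deviation, with the conventional 0.6745 scaling and 3.5 cut-off) is one common choice swapped in for the example, not a method named in the lecture, and the amounts are invented.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single huge value cannot hide itself by inflating the spread.
    """
    med = median(amounts)
    mad = median([abs(x - med) for x in amounts])
    if mad == 0:
        return []  # no spread at all: nothing can be called unusual
    return [x for x in amounts if 0.6745 * abs(x - med) / mad > threshold]

# Hypothetical card transactions in dollars (assumed data).
amounts = [42.0, 38.5, 51.0, 45.2, 40.0, 39.9, 2500.0, 47.3]
suspicious = flag_outliers(amounts)
print(suspicious)
```

A flagged transaction is only a candidate for review; deciding whether it is actually fraud needs domain knowledge.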
Data Mining Tasks
 Prediction Tasks
◦ Use some variables to predict unknown or future values of other
variables
 Description Tasks
◦ Find human-interpretable patterns that describe the data.

Common data mining tasks


◦ Classification [Predictive]
◦ Clustering [Descriptive]
◦ Association Rule Discovery [Descriptive]
◦ Sequential Pattern Discovery [Descriptive]
◦ Regression [Predictive]
◦ Deviation Detection [Predictive]
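
The difference between a predictive and a descriptive task can be sketched with two tiny functions. Both are deliberately simple stand-ins (a 1-nearest-neighbour classifier and a two-group k-means on one variable); the data, names and labels are assumptions made for the example.

```python
def classify_1nn(labelled, point):
    """Predictive task: predict the label of `point` from its nearest labelled example."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(labelled, key=lambda ex: sq_dist(ex[0], point))
    return label

def cluster_two_means(values, iterations=10):
    """Descriptive task: split unlabelled numbers into a low and a high group."""
    if min(values) == max(values):
        raise ValueError("need at least two distinct values")
    low_c, high_c = min(values), max(values)      # initial centres
    for _ in range(iterations):
        low, high = [], []
        for v in values:
            (low if abs(v - low_c) <= abs(v - high_c) else high).append(v)
        low_c, high_c = sum(low) / len(low), sum(high) / len(high)
    return low, high

# Hypothetical applicants: (income in $1000s, debt in $1000s) with a known risk label.
applicants = [((60, 5), "good"), ((25, 20), "poor"),
              ((80, 10), "good"), ((30, 25), "poor")]
print(classify_1nn(applicants, (27, 21)))

# Hypothetical monthly spending per customer; no labels are given.
spend = [120, 130, 110, 900, 950, 870]
print(cluster_two_means(spend))
```

Classification needs examples whose class is already known and outputs a value for a new record; clustering receives no labels and outputs groups that a person then has to interpret.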
Data Mining Models and Tasks
Decisions in Data Mining
 Databases to be mined
◦ Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
 Knowledge to be mined
◦ Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
◦ Multiple/integrated functions and mining at multiple levels
 Techniques utilized
◦ Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
 Applications adapted
◦ Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Knowledge Discovery (KDD) Process
KDD Process: Several Key Steps
 Learning the application domain
◦ Relevant prior knowledge and goals of application
 Identifying a target data set: data selection
 Data preprocessing
◦ Data cleaning (remove noise and inconsistent data)
◦ Data integration (multiple data sources may be combined)
◦ Data selection (data relevant to the analysis task are retrieved from the database)
◦ Data transformation (data are transformed or consolidated into forms appropriate for
mining)
 After data preprocessing
◦ Data mining (an essential process where intelligent methods are applied to extract
data patterns)
◦ Pattern evaluation (identify the truly interesting patterns)
◦ Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
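
The steps above can be sketched as a small pipeline in which each function stands for one step. Everything here is an assumption made for illustration: the two store record lists, the field names, the choice of pair counting as the mining method, and the 0.5 minimum support.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources (assumed data).
store_a = [
    {"customer": "c1", "items": "milk, bread"},
    {"customer": "c2", "items": "milk, bread, eggs"},
    {"customer": "c3", "items": ""},               # noisy: empty basket
]
store_b = [
    {"customer": "c4", "items": "Milk , Bread"},
    {"customer": "c5", "items": "eggs"},
    {"customer": None, "items": "milk"},           # inconsistent: no customer
]

def clean(records):
    """Data cleaning: drop records with a missing customer or an empty basket."""
    return [r for r in records if r["customer"] and r["items"].strip()]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the field relevant to the analysis."""
    return [r["items"] for r in records]

def transform(item_strings):
    """Data transformation: normalise each basket into a set of lower-case items."""
    return [frozenset(i.strip().lower() for i in s.split(",")) for s in item_strings]

def mine(baskets):
    """Data mining: count how often each pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches `min_support`."""
    return {pair: c / n_baskets for pair, c in pair_counts.items()
            if c / n_baskets >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))

# Knowledge presentation: report the surviving patterns in readable form.
for (a, b), support in patterns.items():
    print(f"{a} + {b}: bought together in {support:.0%} of baskets")
```

In practice the steps are iterative: poor patterns at the evaluation step usually send the analyst back to cleaning, selection or transformation.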

Data Mining and Business Intelligence

[Figure: layered pyramid. The potential to support business decisions increases from bottom to top. Layers from top to bottom:]
◦ Decision Making (End User)
◦ Data Presentation: Visualization Techniques (Business Analyst)
◦ Data Mining: Information Discovery (Data Analyst)
◦ Data Exploration: Statistical Summary, Querying, and Reporting
◦ Data Preprocessing/Integration, Data Warehouses
◦ Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
[The DBA role is shown beside the lowest layers.]
A typical DM System Architecture
 Database, data warehouse, WWW or other information repository (store data)
 Database or data warehouse server (fetch and combine data)
 Knowledge base (turn data into meaningful groups according to domain knowledge)
 Data mining engine (perform mining tasks)
 Pattern evaluation module (find interesting patterns)
 User interface (interact with the user)
A typical DM System Architecture (2)
Origins of Data Mining

 Draws ideas from machine learning/AI, pattern recognition, statistics,
and database systems
 Traditional techniques may be unsuitable due to
◦ Enormity of data
◦ High dimensionality of data
◦ Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the overlap of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]
Major Issues in Data Mining
 Mining methodology and user interaction
◦ Mining different kinds of knowledge
 DM should cover a wide spectrum of data analysis and knowledge discovery tasks
 Enable the database to be used in different ways
 Requires the development of numerous data mining techniques
◦ Interactive mining of knowledge at multiple levels of abstraction
 Difficult to know exactly what will be discovered
 Allow users to focus the search and refine data mining requests
◦ Incorporation of background knowledge
 Guide the discovery process
 Allow discovered patterns to be expressed in concise terms and at different levels of
abstraction
◦ Data mining query languages and ad hoc data mining
 High-level query languages need to be developed
 Should be integrated with a DB/DW query language
Major Issues in Data Mining (contd.)
◦ Presentation and visualization of results
 Knowledge should be easily understood and directly usable
 High-level languages, visual representations or other expressive forms
 Requires the DM system to adopt the above techniques
◦ Handling noisy or incomplete data
 Requires data cleaning methods and data analysis methods that can handle noise
◦ Pattern evaluation: the interestingness problem
 How to develop techniques to assess the interestingness of discovered patterns, especially
with subjective measures based on user beliefs or expectations

Major Issues in Data Mining (contd.)
 Performance issues
◦ Efficiency and scalability
 Huge amount of data
 Running time must be predictable and acceptable
◦ Parallel, distributed and incremental mining algorithms
 Divide the data into partitions and process them in parallel
 Incorporate database updates without having to mine the entire data again from
scratch
 Diversity of database types
◦ Other databases that contain complex data objects, multimedia data,
spatial data, etc.
◦ Expect to have different DM systems for different kinds of data
◦ Heterogeneous databases and global information systems
 Web mining has become a very challenging and fast-evolving field in data mining
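
The idea behind parallel and incremental mining can be sketched with item counting, where partial results from each partition simply add up. This is a minimal illustration under assumed data; the partitions, the batch, and the function name `count_partition` are hypothetical, and a thread pool from the Python standard library stands in for a real distributed system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Count in how many baskets of one partition each item appears."""
    counts = Counter()
    for basket in partition:
        counts.update(set(basket))
    return counts

# Hypothetical data already divided into two partitions.
partitions = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "eggs"]],
]

# Parallel step: each partition is counted independently, then results are merged.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_partition, partitions))
totals = sum(partials, Counter())

# Incremental step: a new batch arrives; only the new batch is counted.
new_batch = [["milk", "bread"]]
totals += count_partition(new_batch)

# Check: the result matches mining everything again from scratch.
all_baskets = [b for p in partitions for b in p] + new_batch
assert totals == count_partition(all_baskets)
print(totals)
```

Counts merge exactly because they are additive; many mining algorithms are harder to partition or update, which is why this is listed as a major issue.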
Thank you