Data Mining
• Where does the data come from? Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interests,
income level, spending habits, etc.
• E.g., most customers with an income of 60k – 80k and food expenses of $600 – $800 a month live in the
same area
• Determine customer purchasing patterns over time
• E.g., customers between 20 and 29 years old with an income of 20k – 29k usually buy a particular type of
CD player
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
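The fraud-detection idea of finding unusual transactions can be sketched as follows. This is only a minimal illustration, assuming a simple rule based on the median absolute deviation (MAD); the function name `find_outliers`, the threshold of 3.5, and the sample amounts are all hypothetical and not taken from the slides.

```python
from statistics import median

def find_outliers(amounts, threshold=3.5):
    """Return the amounts whose MAD-based modified z-score exceeds threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # All values (nearly) identical; nothing stands out.
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
transactions = [25.0, 40.0, 32.5, 18.0, 45.0, 30.0, 2500.0, 27.5]
print(find_outliers(transactions))
```

A median-based rule is used here because a single very large transaction would distort a mean and standard deviation; a real fraud system would look at many more attributes than the amount.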
Knowledge Discovery (KDD) Process
KDD Process: Several Key Steps
• Learning the application domain
• Relevant prior knowledge and goals of the application
• Identifying a target data set: data selection
• Data Pre-processing
• Data cleaning (remove noise and inconsistent data)
• Data integration (multiple data sources may be combined)
• Data selection (data relevant to the analysis task are retrieved from the database)
• Data transformation (data are transformed or consolidated into forms appropriate for mining)
(Done with data preprocessing)
• Data mining (an essential process where intelligent methods are applied to
extract data patterns)
• Pattern evaluation (identify the truly interesting patterns)
• Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
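The steps listed above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not a real KDD system: the record fields (`customer`, `item`, `amount`), the two data sources, and every function name are hypothetical, and the "mining" step is just a frequency count.

```python
from collections import Counter

# Hypothetical toy records from two sources; field names are assumptions.
store_sales = [
    {"customer": "c1", "item": "cd player", "amount": 120.0},
    {"customer": "c2", "item": "cd player", "amount": None},  # noisy: missing amount
    {"customer": "c3", "item": "headphones", "amount": 35.0},
]
online_sales = [
    {"customer": "c4", "item": "cd player", "amount": 110.0},
    {"customer": "c5", "item": "speakers", "amount": 80.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data transformation: reduce each record to the item name."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining: count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep items seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: format the patterns as readable lines."""
    return [f"{item}: {n} purchases" for item, n in sorted(patterns.items())]

records = integrate(clean(store_sales), clean(online_sales))
patterns = evaluate(mine(transform(select(records, min_amount=50.0))), min_count=2)
report = present(patterns)
print("\n".join(report))
```

Each function maps to one bullet, so the composition on the last lines reads in the same order as the process itself.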
[Figure: increasing potential to support business decisions, from Data Exploration (statistical summary, querying, and reporting) up to Decision Making (end user)]
• Data are organized around major subjects, e.g. customer, item, supplier and
activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• Users can perform drill-down or roll-up operations to view the data at
different degrees of summarization
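Roll-up and drill-down can be illustrated by summarizing the same transactions at different levels of detail. The sketch below is a simplified stand-in, assuming a hypothetical list of `(store, item_type, item, amount)` tuples and a hypothetical helper `summarize`; a real system would run such queries against its own storage.

```python
from collections import defaultdict

# Hypothetical transactions: (store, item_type, item, amount).
transactions = [
    ("store_a", "audio", "cd player", 120.0),
    ("store_a", "audio", "headphones", 35.0),
    ("store_a", "video", "dvd player", 90.0),
    ("store_b", "audio", "cd player", 110.0),
]

def summarize(rows, key_fields):
    """Total the amounts, grouped by the chosen fields."""
    totals = defaultdict(float)
    for store, item_type, item, amount in rows:
        record = {"store": store, "item_type": item_type, "item": item}
        key = tuple(record[field] for field in key_fields)
        totals[key] += amount
    return dict(totals)

# Summary of the transactions per item type for each store.
by_store_and_type = summarize(transactions, ["store", "item_type"])
# Roll-up: fewer grouping fields give a coarser summary.
by_store = summarize(transactions, ["store"])
# Drill-down: more grouping fields give a finer summary.
by_item = summarize(transactions, ["store", "item_type", "item"])

print(by_store_and_type)
print(by_store)
```

Moving between the three summaries only changes the list of grouping fields, which is the essence of viewing the data at different degrees of summarization.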
1.9 Major Issues in Data Mining
• Presentation and visualization of results
• Knowledge should be easily understood and directly usable
• High level languages, visual representations or other expressive forms
• Requires the data mining (DM) system to adopt the above techniques
• Handling noisy or incomplete data
• Requires data cleaning methods and data analysis methods that can handle noise
• Pattern evaluation – the interestingness problem
• How to develop techniques to assess the interestingness of discovered patterns, especially
with subjective measures based on user beliefs or expectations
• Performance Issues
• Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
• Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data again from
scratch
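The partitioned and incremental ideas above can be sketched with a simple item-counting task. This is only an illustration under stated assumptions: the function names (`count_partition`, `mine_in_parallel`, `update_incrementally`) and the sample baskets are hypothetical, and a thread pool is used just to show the structure (in CPython it would not speed up CPU-bound counting).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Count item occurrences within one partition of baskets."""
    return Counter(item for basket in partition for item in basket)

def mine_in_parallel(partitions):
    """Count each partition separately, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def update_incrementally(total, new_baskets):
    """Fold newly arrived baskets into existing counts without a full re-scan."""
    updated = Counter(total)
    updated.update(count_partition(new_baskets))
    return updated

# Hypothetical data, already divided into two partitions.
partitions = [
    [["bread", "milk"], ["bread"]],
    [["milk", "eggs"]],
]
counts = mine_in_parallel(partitions)

# A database update arrives later; only the new baskets are processed.
new_counts = update_incrementally(counts, [["bread", "eggs"]])
print(counts)
print(new_counts)
```

Counting works here because partial counts can be merged by addition; algorithms whose results do not combine this simply need more careful partitioning and update schemes.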