Week 1A - Overview and Introduction of Data Mining
Week 1: Overview & Introduction
Informatics Engineering | Universitas Surabaya
Reference Book
Introduction to DATA MINING
Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Topics
• Introduction to Data Mining
• Data Exploration
• Classification Analysis
• Association Analysis
• Clustering Analysis
Lesson Plan (1st Half of Semester)
Week Topic(s) Description
Mid-term Exam
Lesson Plan (2nd Half of Semester)
Week Topic(s) Description
Final Exam
Grading
Mid-term Grade (NTS: Nilai Tengah Semester) =
20% Assignments +
30% Quiz 1 (QTS: Quiz Tengah Semester) +
50% Mid Exam (UTS: Ujian Tengah Semester)
Final Grade (NA: Nilai Akhir) = 40% Mid-term Grade + 60% Final-term Grade
Any form of PLAGIARISM or CHEATING will result in a grade of zero.
Introduction
What is Data Mining?
• The processes or techniques of DISCOVERING INTERESTING
PATTERNS, MODELS, and other kinds of knowledge by analyzing
large datasets, providing insights or enabling fast and accurate
decision making.
• Non-trivial extraction of implicit, previously unknown, and potentially
useful information from data.
• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns.
• Knowledge mining from data.
Why Data Mining? (1)
• Businesses worldwide generate gigantic datasets, including sales
transactions, stock trading records, product descriptions, sales
promotions, company profiles and performance, and customer
feedback.
• Scientific and engineering practices continuously generate data on
the order of petabytes, from remote sensing, process measurement,
scientific experiments, system performance, engineering
observations, and environmental surveillance.
Why Data Mining? (2)
• Biomedical research and the health industry generate tremendous
amounts of data from gene sequencing machines, biomedical
experiments and research reports, medical reports, patient
monitoring, and medical imaging.
• Search engines serve billions of web searches and process tens
of petabytes of data daily.
• Social media tools have become increasingly popular, producing a
tremendous number of texts, pictures, and videos, generating
various kinds of web communities and social networks.
Why Data Mining? (3)
• The explosively growing, widely available, and gigantic body of data
makes our time truly the data age.
• Powerful and versatile tools are badly needed to automatically
uncover information from the tremendous amounts of data and to
transform such data into organized knowledge.
Data Age is Here
• “We are living in the information age” or “We are
actually living in the data age”?
• Terabytes or petabytes of data pour into our
computer networks, the WWW, and various kinds of
devices every day.
• The data come from business, news agencies, society, science,
engineering, medicine, and almost every other aspect
of daily life.
https://www.splunk.com/en_us/campaigns/data-age.html
• This explosive growth of available data volume is a
result of the computerization of our society and the
fast development of powerful computing, sensing,
and data collection, storage, and publication tools.
• This explosively growing, widely available, and gigantic
body of data makes our time truly the data age.
Data Mining Tasks
• Prediction Methods
Use some variables to predict unknown or future values of other
variables.
• Description Methods
Find human-interpretable patterns that describe the data.
Data Mining Tasks
• Classification (PREDICTIVE)
• Clustering (DESCRIPTIVE)
• Association Rule Discovery (DESCRIPTIVE)
• Sequential Pattern Discovery (DESCRIPTIVE)
• Regression (PREDICTIVE)
• Deviation Detection (PREDICTIVE)
Pattern Discovery Techniques
• Classification:
– Decision Trees, Naïve Bayes, Support Vector Machines
• Clustering:
– k-means, Hierarchical Clustering
• Association Rule Mining:
– Apriori Algorithm
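As an illustration of one technique listed above, the following is a minimal sketch of k-means clustering in plain Python. The sample points, the starting centroids, and the function name `kmeans` are assumptions made for this example; real projects would normally use a library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(c)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The loop alternates between assigning points and recomputing centroids until nothing changes, which is the core idea behind k-means; the choice of starting centroids can affect the result on less well-separated data.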
Data Mining in Summary (Shivam Arora, 2024)
Retail and Marketing
(REAL-WORLD EXAMPLES)
• Customer Segmentation
– Retailers use data mining to segment customers based on purchasing
behavior.
– Example: identifying high-value customers who are likely to buy premium
products.
• Market Basket Analysis
– To understand the purchase behavior of customers by finding
associations between different products.
– Example: if customers frequently buy bread and butter together, a store
might place these items next to each other.
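The bread-and-butter example can be made concrete with the two standard association measures, support and confidence. The sketch below is only an illustration: the transactions are invented, and the helper names `count`, `support`, and `confidence` are assumptions for this example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"bread", "butter"}, transactions))       # 3 of 5 baskets
print(confidence({"bread"}, {"butter"}, transactions))  # 3 of the 4 bread baskets
```

A rule such as bread → butter is usually considered interesting only when both measures exceed thresholds chosen by the analyst.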
E-commerce
(REAL-WORLD EXAMPLES)
• Recommendation Systems:
– E-commerce platforms (e.g. Amazon, Netflix) use data mining to
recommend products and content to users based on their browsing and
purchase history.
– Example: “Customers who bought this also bought …”
recommendations.
• Dynamic Pricing
– Online retailers use data mining to adjust prices dynamically based on
demand, competition, and customer behavior. This helps in maximizing
sales and profits.
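A very simple way to produce “also bought” suggestions is to count which items appear in the same orders as a given item. The sketch below assumes invented order data and a hypothetical helper `also_bought`; production recommendation systems use considerably richer models.

```python
from collections import Counter

# Hypothetical purchase history, one list of items per order.
orders = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["phone", "phone case"],
    ["laptop", "laptop bag", "mouse"],
    ["mouse", "keyboard"],
]

def also_bought(item, orders, top_n=2):
    """Return the items most often purchased in the same order as item."""
    co_counts = Counter()
    for order in orders:
        if item in order:
            # Count every other distinct item in this order once.
            co_counts.update(other for other in set(order) if other != item)
    return [name for name, _ in co_counts.most_common(top_n)]

print(also_bought("laptop", orders))
```

Here the mouse appears in three laptop orders and the laptop bag in two, so they are suggested in that order.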
Finance and Banking
(REAL-WORLD EXAMPLES)
• Credit Scoring
– Financial institutions use data mining to assess the creditworthiness of
applicants by analyzing historical data on loan repayments, credit card
usage, and financial transactions.
• Fraud Detection
– Banks use anomaly detection techniques to identify unusual patterns in
transactions that may indicate fraudulent activities.
– Example: a sudden large transaction from a foreign country could trigger
a fraud alert.
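The fraud alert example can be sketched with a simple outlier test. The slides do not name a specific method, so the code below uses one possible choice, the modified z-score based on the median and the median absolute deviation (MAD); the transaction amounts, the 3.5 cutoff, and the function name `flag_unusual` are assumptions for this illustration.

```python
import statistics

def flag_unusual(amounts, cutoff=3.5):
    """Flag amounts whose modified z-score exceeds the cutoff."""
    median = statistics.median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median(abs(a - median) for a in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [a for a in amounts if 0.6745 * abs(a - median) / mad > cutoff]

# Hypothetical card transactions for one customer, in the same currency.
amounts = [42, 38, 55, 47, 51, 40, 45, 4800, 49, 44]
print(flag_unusual(amounts))
```

The median and MAD are used instead of the mean and standard deviation because a single very large transaction would inflate the latter two and could hide itself.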
Healthcare
(REAL-WORLD EXAMPLES)
• Predictive Analytics
– Healthcare providers use data mining to predict disease outbreaks,
patient admission rates, and the likelihood of patient readmissions. This
helps in allocating resources and improving patient care.
• Personalized Treatment Plans
– Healthcare professionals can develop personalized treatment plans
based on the patient’s medical history, genetics, and lifestyle by
analyzing patient data.
Telecommunications
(REAL-WORLD EXAMPLES)
• Churn Prediction
– Telecom companies use data mining to predict which customers are
likely to switch to a competitor.
– Companies can take proactive measures to retain customers by
understanding the factors leading to churn.
– Customer Churn: the number of customers that stopped using the company’s product
or service during a period of time.
• Network Optimization
– Data mining helps in optimizing network performance by analyzing call
data records and detecting issues like dropped calls and network
congestion.
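The customer churn definition above can be computed directly from two customer lists. In the sketch below the customer IDs are invented, the function name `churn` is hypothetical, and the churn rate (churned customers divided by customers at the start of the period) is a commonly used derived measure that is not stated on the slide.

```python
def churn(customers_at_start, customers_at_end):
    """Return the customers lost during the period and the churn rate."""
    start = set(customers_at_start)
    # Customers present at the start who are no longer present at the end.
    churned = start - set(customers_at_end)
    rate = len(churned) / len(start)
    return churned, rate

# Hypothetical subscriber IDs at the start and end of a month.
start_of_month = ["A", "B", "C", "D", "E"]
end_of_month = ["A", "C", "E", "F"]  # "F" joined during the month

churned, rate = churn(start_of_month, end_of_month)
print(sorted(churned), rate)
```

New customers such as "F" do not reduce the churn figure, because churn only looks at who was present at the start of the period.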
Manufacturing
(REAL-WORLD EXAMPLES)
• Predictive Maintenance
– Manufacturers use data mining to predict equipment failures before they
occur by analyzing sensor data from machinery.
– This helps in scheduling maintenance and reducing downtime.
• Quality Control
– Data mining is used to identify patterns in production data that lead to
defects, allowing manufacturers to improve product quality and reduce
waste.
Social Media
(REAL-WORLD EXAMPLES)
• Sentiment Analysis
– Companies use data mining to analyze social media posts, reviews, and
comments to gauge public sentiment about their products or services.
– This helps in marketing strategy and brand management.
• User Behavior Analysis
– Social media platforms like Facebook and Twitter use data mining to
understand user behavior, preferences, and engagement patterns, which
helps in improving user experience and targeted advertising.
Sports & Entertainment
(REAL-WORLD EXAMPLES)
• Performance Analysis
– Sports teams use data mining to analyze player performance, injury
patterns, and game strategies.
– This helps in making informed decisions on player selection and game
tactics.
• Audience Engagement
– Entertainment companies use data mining to analyze viewer preferences
and engagement patterns, helping in content creation and personalized
recommendations.
Data Mining Tools
Challenges of Data Mining
• Scalability
• High Dimensionality
• Heterogeneous & Complex Data
• Data Quality
• Data Ownership & Distribution
• Non-traditional Analysis
Data mining:
An essential step in knowledge discovery
• Many people treat data mining as a synonym for another popularly
used term, knowledge discovery from data, or KDD.
• Others view data mining as merely an essential step in the overall
process of knowledge discovery.
Knowledge Discovery Process
• Data Collection
• Data Preparation
– Data cleaning: to remove noise and inconsistent data
– Data integration: where multiple data sources may be combined
– Data transformation: where data are transformed and consolidated into forms appropriate for mining by
performing summary or aggregation operations
– Data selection: where data relevant to the analysis task are retrieved from the database
• Data mining: an essential process where intelligent methods are applied to extract
patterns or construct models
• Pattern/model Evaluation and Interpretation: to identify the truly interesting patterns or
models representing knowledge based on interestingness measures
– Metrics for Evaluation: accuracy, precision, recall, F1-score
– Visualization Techniques: confusion matrix, ROC curve
• Knowledge Presentation: where visualization and knowledge representation techniques are
used to present mined knowledge to users
– Tools for Presenting Data: Tableau, Power BI
– Effective Data Visualization
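The evaluation metrics named above (accuracy, precision, recall, F1-score) all derive from the four cells of a binary confusion matrix. The following is a minimal sketch with invented labels; the function name `evaluate` and the convention that 1 is the positive class are assumptions for this example.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the usual metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
result = evaluate(y_true, y_pred)
print(result)
```

Precision answers "of the predicted positives, how many were right", recall answers "of the actual positives, how many were found", and F1 is their harmonic mean.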
KDD Process:
A typical view from ML & Statistics