01-Introduction To Data Mining

Data mining involves extracting useful patterns from large amounts of data through techniques like classification, clustering, and association rule mining. It draws from multiple disciplines like machine learning, statistics, and database systems to analyze vast, complex datasets. The goals of data mining include prediction, description, and discovering hidden patterns in data to help organizations make better decisions.


Introduction to Data Mining

M. Tanzil Furqon, S.Kom., MCompSc.


What Is Data Mining?
• Data mining (knowledge discovery from data)
▫ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
▫ Data mining: a misnomer?
• Alternative names
▫ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging/cleaning, information harvesting, business
intelligence, etc.
• Watch out: not everything is “data mining”. The following are not:
▫ Simple search and query processing
▫ (Deductive) expert systems
Why Mine the Data?
• Lots of data is being collected
and warehoused
▫ Web data, e-commerce
▫ purchases at department/
grocery stores
▫ Bank/Credit Card
transactions
• Competitive Pressure is Strong
▫ Provide better, customized services for an edge (e.g.
in Customer Relationship Management) → automobile
industry (Mitsubishi Xpander)
Why Mine the Data? (contd..)
• Data collected and stored at enormous speeds (GB/hour)
▫ remote sensors on a satellite
▫ telescopes scanning the skies
▫ microarrays generating gene
expression data
▫ scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
▫ in classifying and segmenting data
▫ in Hypothesis Formation
Why Mine the Data? (contd..)
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information → manual analysis is very time-consuming
• Much of the data is never analyzed at all
Definition of Data Mining
• Non-trivial extraction of implicit, previously unknown
and potentially useful information from data

• Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Data Mining in Business Intelligence

[Figure: layered pyramid. The potential to support business decisions increases from the bottom layer to the top. Layers from top to bottom, with the role shown beside each layer in parentheses:]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

• This is a view from typical machine learning and statistics communities


Data Mining Vs Non Data Mining
• Non Data Mining (examples)
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
• Data Mining (examples)
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Data Mining Vs Database
• A DB user knows what they are looking for.
• A DM user may or may not know what they are looking for.
• A DB answer to a query is 100% accurate, provided the data is correct.
• DM aims to make the answer as accurate as possible.
• DB data is retrieved as stored.
• DM data needs to be cleaned (somewhat) before
producing results.
• DB results are a subset of the data.
• DM results are an analysis of the data.
• The meaningfulness of the results is not a concern for a
database, whereas it is the main issue in data mining.
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
▫ Object-oriented and object-relational databases
▫ Spatial databases
▫ Time-series data and temporal data
▫ Text databases and multimedia databases
▫ Heterogeneous and legacy databases
▫ WWW
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
▫ Enormity of data
▫ High dimensionality of data
▫ Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
▫ Algorithms must be scalable to handle big data
• High-dimensionality of data
▫ Micro-array may have tens of thousands of dimensions
• High complexity of data
▫ Data streams and sensor data
▫ Time-series data, temporal data, sequence data
▫ Structured data, graphs, social and information networks
▫ Spatial, spatiotemporal, multimedia, text and Web data
▫ Software programs, scientific simulations
• New and sophisticated applications
Data mining is supported by three
sufficiently mature technologies:
• Massive data collections
▫ Commercial databases (using high-performance engines)
are growing at exceptional rates
• Powerful multiprocessor computers
▫ Cost-effective parallel multiprocessor computer technology
• Data mining algorithms
▫ Under development for decades in research areas such as
statistics, artificial intelligence, and machine learning,
and now implemented as mature, reliable, understandable
tools that consistently outperform older statistical methods
Data Mining Tasks
• Prediction Methods
▫ Use some variables to predict unknown or future
values of other variables.

• Description Methods
▫ Find human-interpretable patterns that describe
the data.
Data Mining Tasks (contd..)
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
1. Classification (Definition)
• Given a collection of records (training set)
▫ Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of
the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
Classification (example)
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
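As a rough illustration of "learn a model from the training set, then classify unseen records", the sketch below uses a 1-nearest-neighbour classifier on the training records above. The slides do not prescribe any particular classifier; nearest neighbour is chosen here only because it is short, and the distance function and the `INCOME_SCALE` constant are assumptions made for this example.

```python
# Training set from the example: (refund, marital_status, taxable_income_in_K, cheat)
TRAINING_SET = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Assumption: a 100K income gap weighs as much as one categorical mismatch.
INCOME_SCALE = 100.0


def distance(a, b):
    """Mixed distance: 1 per differing categorical attribute plus the scaled income gap."""
    refund_gap = 0 if a[0] == b[0] else 1
    marital_gap = 0 if a[1] == b[1] else 1
    income_gap = abs(a[2] - b[2]) / INCOME_SCALE
    return refund_gap + marital_gap + income_gap


def predict(record, training_set=TRAINING_SET):
    """Return the class label of the training record closest to `record`."""
    nearest = min(training_set, key=lambda row: distance(record, row))
    return nearest[3]


# Classify the first record of the test set (Refund=No, Single, 75K).
print(predict(("No", "Single", 75)))
```

A different choice of `INCOME_SCALE`, or a different model such as a decision tree, may label the same test records differently; how well the model does on unseen records is what the "as accurately as possible" goal refers to.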
Classification (application -1)
• Direct Marketing
▫ Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
▫ Approach:
– Use the data for a similar product introduced before.
– We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
– Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
– Use this information as input attributes to learn a classifier
model.
Classification (application-2)
• Fraud Detection
▫ Goal: Predict fraudulent cases in credit card
transactions.
▫ Approach:
– Use credit card transactions and the information on its
account-holder as attributes.
– When a customer buys, what they buy, how often they pay on
time, etc.
– Label past transactions as fraud or fair transactions. This
forms the class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card
transactions on an account.
Classification (application-3)
• Customer Attrition/Churn:
▫ Goal: To predict whether a customer is likely to be
lost to a competitor.
▫ Approach:
– Use detailed record of transactions with each of the
past and present customers, to find attributes.
– How often the customer calls, where they call, what
time of day they call most, their financial status,
marital status, etc.
– Label the customers as loyal or disloyal.
– Find a model for loyalty.
2. Clustering (definition)
• Given a set of data points, each having a set of
attributes, and a similarity measure among
them, find clusters such that
▫ Data points in one cluster are more similar to one
another.
▫ Data points in separate clusters are less similar to
one another.
• Similarity Measures:
▫ Euclidean Distance
▫ Cosine similarity, etc.
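The two similarity measures named above can be written in a few lines. This is a minimal sketch for numeric vectors of equal length; the function names are chosen for this example only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_p * norm_q)


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(cosine_similarity((1, 0), (0, 1)))   # 0.0 (orthogonal vectors)
```

Euclidean distance is sensitive to the magnitude of the attributes, while cosine similarity only looks at direction, which is one reason the latter is common for term-frequency vectors.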
Illustration of clustering
• Euclidean distance-based clustering in 3-D space
▫ Intracluster distances are minimized
▫ Intercluster distances are maximized
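One common way to reduce intracluster Euclidean distances is k-means. The slides do not name a specific algorithm, so the sketch below is only one possible instance; the 3-D points and the initial centroids are made-up values for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance (the square root is not needed for comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until the assignment stops changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical 3-D data: two well-separated groups of three points each.
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
labels, centroids = kmeans(points, initial_centroids=[(0, 0, 0), (10, 10, 10)])
print(labels)
print(centroids)
```

The result depends on the initial centroids, so in practice the algorithm is usually run several times from different starting points.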
Clustering (application -1)
• Market Segmentation:
▫ Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
▫ Approach:
– Collect different attributes of customers based on their
geographical and lifestyle related information.
– Find clusters of similar customers.
– Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
Clustering (application-2)
• Document Clustering:
▫ Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
▫ Approach: To identify frequently occurring terms
in each document. Form a similarity measure
based on the frequencies of different terms. Use it
to cluster.
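The approach above (count terms, then compare documents by their term frequencies) might look like the following sketch. The three documents are invented for illustration, whitespace tokenization is a simplification, and cosine similarity is just one possible frequency-based measure.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def tf_cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency tables."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "amazon rainforest trees river rainforest",
    "d2": "rainforest river wildlife amazon",
    "d3": "amazon online shopping books shopping",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}

for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    print(a, b, round(tf_cosine_similarity(vectors[a], vectors[b]), 3))
```

A clustering algorithm would then group d1 with d2, since that pair scores higher than either does with d3.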
3. Association Rule Discovery (definition)
• Given a set of records, each of which contains some
number of items from a given collection:
▫ Produce dependency rules which will predict the
occurrence of an item based on occurrences of other
items.

TID  Items
1    Bread, Coke, Milk
2    Coffee, Bread
3    Coffee, Coke, Diaper, Milk
4    Coffee, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Coffee}
Association Rule Discovery (definition) – contd..
• A rule must have some minimum user-specified
confidence & support
• Support: proportion of transactions in the data
set which contain the itemset
• Confidence (X → Y): Sup(X ∪ Y) / Sup(X)
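To make the two measures concrete, the sketch below computes support and confidence on the five transactions from the definition slide. Confidence is computed from transaction counts, which is equivalent to Sup(X ∪ Y) / Sup(X) because the number of transactions cancels out; the function names are chosen for this example.

```python
# The five transactions from the definition slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Coffee", "Bread"},
    {"Coffee", "Coke", "Diaper", "Milk"},
    {"Coffee", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Proportion of transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Of the transactions containing X, the fraction that also contain Y."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    base = count_containing(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return both / base


print(support({"Milk", "Coke"}))                  # 3 of 5 transactions
print(confidence({"Milk"}, {"Coke"}))             # 3 of the 4 Milk transactions: 0.75
print(confidence({"Diaper", "Milk"}, {"Coffee"}))  # 2 of the 3 Diaper+Milk transactions
```

A rule is reported only if both values meet the user-specified minimums, so raising either threshold shrinks the set of discovered rules.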
Association Rule (application)
• Marketing and Sales Promotion:
▫ Let the rule discovered be
{Bagels, … } --> {Potato Chips}
▫ Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
▫ Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
▫ Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Association Rule (application-2)
• Supermarket shelf management.
▫ Goal: To identify items that are bought together
by sufficiently many customers.
▫ Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies
among items.
▫ A classic rule --
– If a customer buys diapers and milk, then they are very
likely to buy tea.
– So, don’t be surprised if you find tea stacked
next to diapers!
4. Sequential Pattern Discovery
• Given a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among
different events.

(A B) (C) (D E)

• Rules are formed by first discovering patterns. Event occurrences in the
patterns are governed by timing constraints.
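A first step in discovering such patterns is checking whether an object's timeline contains a candidate pattern such as (A B) (C) (D E). The sketch below does only that, and it ignores the timing constraints mentioned above, which is a deliberate simplification; the timelines are hypothetical.

```python
def contains_pattern(timeline, pattern):
    """True if the pattern's event sets occur, in order, within the timeline.

    Both arguments are lists of sets. Each pattern element must be a subset of
    a distinct timeline entry, and the matched entries must appear in order.
    """
    position = 0
    for element in pattern:
        # Scan forward to the next timeline entry that covers this element.
        while position < len(timeline) and not set(element) <= set(timeline[position]):
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next element must match a strictly later entry
    return True


def pattern_support(timelines, pattern):
    """Fraction of objects whose timeline contains the pattern."""
    return sum(contains_pattern(t, pattern) for t in timelines) / len(timelines)


pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

# Hypothetical timelines for three objects.
timelines = [
    [{"A", "B", "X"}, {"Y"}, {"C"}, {"D", "E"}],  # contains the pattern
    [{"C"}, {"A", "B"}, {"D", "E"}],              # wrong order
    [{"A", "B"}, {"C"}, {"D"}],                   # last element incomplete
]

print([contains_pattern(t, pattern) for t in timelines])
print(pattern_support(timelines, pattern))
```

Adding timing constraints would mean storing a timestamp with each entry and rejecting matches whose gaps exceed a chosen limit.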
Sequential Pattern (application)
• In telecommunications alarm logs,
▫ (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
▫ Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
▫ Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
5. Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
▫ Predicting sales amounts of new product based on
advertising expenditure.
▫ Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
▫ Time series prediction of stock market indices.
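For the linear case, a one-variable least-squares fit is enough to show the idea. The sketch below assumes a single predictor (advertising expenditure) and uses made-up numbers; real sales data would not fall exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted value of the continuous target for a new x."""
    return slope * x + intercept


# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)
print(predict(6, slope, intercept))
```

A nonlinear dependency would need a different model form, for example polynomial terms or a neural network, but the goal of predicting a continuous value stays the same.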
Applications of Data Mining
• Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
• Data mining and software engineering
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Major Issues in Data Mining (1)
• Mining Methodology
▫ Mining various and new kinds of knowledge
▫ Mining knowledge in multi-dimensional space
▫ Data mining: An interdisciplinary effort
▫ Boosting the power of discovery in a networked environment
▫ Handling noise, uncertainty, and incompleteness of data
▫ Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
▫ Interactive mining
▫ Incorporation of background knowledge
▫ Presentation and visualization of data mining results
Major Issues in Data Mining (2)

• Efficiency and Scalability


▫ Efficiency and scalability of data mining algorithms
▫ Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
▫ Handling complex types of data
▫ Mining dynamic, networked, and global data repositories
• Data mining and society
▫ Social impacts of data mining
▫ Privacy-preserving data mining
▫ Invisible data mining
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
▫ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
▫ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data
Mining (KDD’95-98)
▫ Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
▫ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001),
WSDM (2008), etc.
• ACM Transactions on KDD (2007)
Conferences and Journals on Data Mining
• KDD Conferences
▫ ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
▫ SIAM Data Mining Conf. (SDM)
▫ (IEEE) Int. Conf. on Data Mining (ICDM)
▫ European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
▫ Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
▫ Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
▫ DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
▫ Web and IR conferences: WWW, SIGIR, WSDM
▫ ML conferences: ICML, NIPS
▫ PR conferences: CVPR,
• Journals
▫ Data Mining and Knowledge Discovery (DAMI or DMKD)
▫ IEEE Trans. on Knowledge and Data Eng. (TKDE)
▫ KDD Explorations
▫ ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD: CDROM)
▫ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
▫ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
▫ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
▫ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
▫ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
▫ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI,
etc.
• Web and IR
▫ Conferences: SIGIR, WWW, CIKM, etc.
▫ Journals: WWW: Internet and Web Information Systems,
• Statistics
▫ Conferences: Joint Stat. Meeting, etc.
▫ Journals: Annals of statistics, etc.
• Visualization
▫ Conference proceedings: CHI, ACM-SIGGraph, etc.
▫ Journals: IEEE Trans. visualization and computer graphics, etc.
Data Mining at GOJEK
• How GOJEK Uses Its Users' Big Data for Business
▫ leverages big data through a data science approach
▫ makes a range of real-time decisions using techniques
such as machine learning, artificial intelligence (AI),
and natural language processing.
Data Science Implementation at GOJEK
• GOJEK applies data science to almost all of its business
and operational processes.
• This applies not only to users: data from driver partners
and merchants is analyzed as well.
Data Science Implementation at GOJEK
• “The driver allocation system is now much better thanks to
machine learning. Previously, an incoming order was always
allocated to the nearest driver. Now, various other
considerations are taken into account. As a result, the
pick-up rate is getting faster and the cancellation rate is falling.”
(Syafrie, VP of Data Science GOJEK)
“Some drivers look for orders
that are on their way home.
Others prefer taking
short-distance orders.
By using machine learning
to predict that, we can match
orders accordingly, so that it
works well for both customers
and drivers.”
Talent recruitment remains the
main challenge
