6 - Romanko - Data Science and Business Analytics - Data Mining
Data mining
What is the difference between descriptive (BI) and predictive
analytics?
Descriptive: John
Lives in Seattle, zip: 98109
21 years old
iPhone 5
Plan: $98 a month
Talk: 400 minutes
Data: 1.9 GB
SMS: 370
Complaints: 0
Customer care calls: 1
Dropped calls: low
Predictive: Low churn risk

Descriptive: Mike
Lives in Atlanta, zip: 30308
38 years old
Samsung Galaxy S3
Plan: $78 a month
Talk: 1200 minutes
Data: 0.2 GB
SMS: 8
Customer care calls: 6
Dropped calls: high
Predictive: High churn risk
Classification
If Gender = Female and Age >= 15 then Purchase lipstick = YES
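The rule above can be read as a one-line classifier. The following is a minimal Python sketch of it; the function name `predicts_lipstick_purchase` and the customer records are assumptions made for illustration and do not come from the slides.

```python
def predicts_lipstick_purchase(gender: str, age: int) -> bool:
    """Apply the rule: Gender = Female and Age >= 15 -> Purchase lipstick = YES."""
    return gender == "Female" and age >= 15


# Hypothetical customer records, used only to exercise the rule.
customers = [
    {"gender": "Female", "age": 22},
    {"gender": "Female", "age": 12},
    {"gender": "Male", "age": 30},
]

predictions = [predicts_lipstick_purchase(c["gender"], c["age"]) for c in customers]
print(predictions)
```

A decision tree learned from data is, in effect, a collection of such if-then rules, one per path from the root to a leaf.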
Decision (classification) Trees
Most cluster analysis methods use a distance measure to calculate
the closeness between pairs of items
o Data points in the same cluster are more similar to one another
o Data points in separate clusters are less similar to one another
[Figure: scatter plot of Spend versus Income with three groups of points labeled Cluster #1, Cluster #2, and Cluster #3]
Source: Nick Kadochnikov, Data Mining Modeling Techniques
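The closeness idea behind cluster analysis can be checked numerically. The sketch below assumes Euclidean distance as the distance measure (the slides do not name one) and uses made-up (income, spend) points that are already grouped into two clusters. It compares the average distance inside each cluster with the average distance across clusters.

```python
import math
from itertools import combinations, product

# Hypothetical (income, spend) points, already grouped into two clusters.
cluster_a = [(20.0, 5.0), (22.0, 6.0), (21.0, 4.0)]
cluster_b = [(80.0, 40.0), (82.0, 42.0), (79.0, 41.0)]


def mean_within_distance(points):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_between_distance(points_1, points_2):
    """Average Euclidean distance over all cross-cluster pairs."""
    pairs = list(product(points_1, points_2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


within_a = mean_within_distance(cluster_a)
within_b = mean_within_distance(cluster_b)
between = mean_between_distance(cluster_a, cluster_b)
print(within_a, within_b, between)
```

For a good clustering, both within-cluster averages should be clearly smaller than the between-cluster average.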
K-means clustering
Clustering: LinkedIn
Source: Matthew A. Russell, Mining the Social Web
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis - Fuzzy C-means clustering (FCM)
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis – Hierarchical clustering
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis – DBSCAN
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis – comparison
Source: Saeed Aghabozorgi, Cluster Analysis
Association Rules
Neural Networks
Training an ANN means learning the weights of the neurons
[Figure: network diagram with inputs x1 and x2, neurons f1 to f7 arranged in an Input Layer, a Hidden Layer, and an Output Layer, and output y]
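To make "learning the weights" concrete, here is a minimal sketch that trains a single sigmoid neuron with gradient descent. It is deliberately smaller than the multi-layer network in the figure. The training data (logical AND), the learning rate, the epoch count, and the cross-entropy loss are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=5000, learning_rate=0.5):
    """Learn two weights and a bias for one sigmoid neuron by gradient descent."""
    w1, w2, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = sigmoid(w1 * x1 + w2 * x2 + bias)
            # Gradient of the cross-entropy loss with respect to the weighted sum.
            error = output - target
            w1 -= learning_rate * error * x1
            w2 -= learning_rate * error * x2
            bias -= learning_rate * error
    return w1, w2, bias


# Hypothetical training data: the neuron should fire only when both inputs are 1.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, bias = train_neuron(samples)
predictions = [round(sigmoid(w1 * x1 + w2 * x2 + bias)) for (x1, x2), _ in samples]
print(predictions)
```

A full network repeats the same idea for every neuron, with backpropagation passing the error from the output layer back through the hidden layer.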
Linear regression: basics
y = c0 + c1 x1 + … + cn xn
[Figure: Spend versus Income with a fitted regression line and a residual marked]
Source: Nick Kadochnikov, Data Mining Modeling Techniques
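For a single predictor, the coefficients c0 and c1 of the linear regression equation have a closed-form least-squares solution. The sketch below computes them and the residuals. The income and spend values are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (c0, c1) in y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


# Hypothetical data: income and spend, both in thousands of dollars.
income = [30.0, 40.0, 50.0, 60.0, 70.0]
spend = [12.0, 15.0, 21.0, 24.0, 28.0]

c0, c1 = fit_line(income, spend)
# A residual is the observed value minus the value predicted by the fitted line.
residuals = [y - (c0 + c1 * x) for x, y in zip(income, spend)]
print(c0, c1, residuals)
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on any fit.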
Other types of regression analysis
Quantile regression
Ordinary least squares regression approximates the conditional mean of the
response variable, while quantile regression estimates the conditional
median or other quantiles of the response variable
This is helpful for skewed data (e.g., the income distribution in the US) or when
outliers should stay in the data rather than being suppressed
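The mean-versus-quantile distinction shows up even without predictors, where each "regression" reduces to choosing one constant. In the sketch below, least squares picks the mean, while minimizing the quantile (pinball) loss at level 0.5 picks the median. The income values are invented, and `pinball_loss` and `best_constant` are hypothetical helper names.

```python
import statistics


def pinball_loss(constant, values, q):
    """Total quantile (pinball) loss of predicting `constant` for every value."""
    total = 0.0
    for y in values:
        diff = y - constant
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total


def best_constant(values, q):
    """Observed value that minimizes the pinball loss at quantile level q."""
    return min(values, key=lambda c: pinball_loss(c, values, q))


# Hypothetical, right-skewed incomes in thousands of dollars.
incomes = [20, 25, 30, 35, 40, 45, 1000]

mean_income = statistics.mean(incomes)       # least-squares answer with no predictors
median_income = best_constant(incomes, 0.5)  # quantile-loss answer at q = 0.5
print(mean_income, median_income)
```

The single large income pulls the mean far above the typical value, while the median stays with the bulk of the data, and the outlier did not have to be removed.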
Logistic regression
Logistic regression is used to predict a categorical target variable
Most often a variable with a binary outcome
Probit regression can also be used to predict a binary outcome (logit regression is another name for logistic regression). While the
underlying distributions are different, the models will produce rather similar
outcomes
Logistic regression is frequently used to estimate the probability of an event
o Bank customer defaulting on the loan
o Customer responding to a marketing promotion
Source: Nick Kadochnikov, Data Mining Modeling Techniques
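As a small illustration of how logistic regression turns a linear score into an event probability, the sketch below applies the logistic function to a weighted sum of two customer attributes. The attributes, the coefficient values, and the name `response_probability` are assumptions for illustration only. In practice the coefficients would be estimated from data.

```python
import math


def logistic(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def response_probability(age, past_purchases, coefficients=(-3.0, 0.02, 0.8)):
    """Probability that a customer responds to a promotion (illustrative coefficients)."""
    c0, c_age, c_purchases = coefficients
    return logistic(c0 + c_age * age + c_purchases * past_purchases)


p_low = response_probability(age=30, past_purchases=0)
p_high = response_probability(age=30, past_purchases=5)
print(round(p_low, 3), round(p_high, 3))
```

Comparing the probability with a cutoff such as 0.5 converts it into a binary prediction, for example respond versus not respond.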
Questions
Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information
contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy,
which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other
materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering
the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken
by you will result in any specific sales, revenue growth or other results.