6 - Romanko - Data Science and Business Analytics - Data Mining

Oleksandr Romanko is a senior research analyst at IBM Canada and adjunct professor at the University of Toronto and Ukrainian Catholic University. He will be giving a presentation on data science, business analytics, big data, and artificial intelligence with a focus on data mining. Data mining techniques that will be covered include classification, clustering, regression, and forecasting. Classification involves assigning records to predefined groups using attributes to predict class, while clustering identifies natural groupings of data without predefined labels. Specific clustering algorithms like k-means and hierarchical clustering will be discussed.


Oleksandr Romanko, Ph.D.

Senior Research Analyst, Risk Analytics – Business Analytics, IBM Canada


Adjunct Professor, University of Toronto
Adjunct Professor, Ukrainian Catholic University

Introduction to Data Science,


Business Analytics, Big Data
and Artificial Intelligence:
Data Mining

Kyiv Polytechnic University, September 10-11, 2016


Please note:
IBM Risk Analytics statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product
direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information about
potential future products may not be incorporated into any contract. The development,
release, and timing of any future features or functionality described for our products
remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks
in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an
individual user will achieve results similar to those stated here.

2
Data mining

 Data mining application classes of problems


– Classification
– Clustering
– Regression
– Forecasting
– Others
 Hypothesis or discovery driven
 Iterative
 Scalable

3
What is the difference between descriptive (BI) and predictive
analytics?

Descriptive                                  Predictive

John                                         Low churn risk
  Lives in Seattle, zip: 98109
  21 years old
  iPhone 5
  Plan: $98 a month
  Talk: 400 minutes
  Data: 1.9 Gb
  SMS: 370
  Complaints: 0
  Customer care calls: 1
  Dropped calls: low

Mike                                         High churn risk
  Lives in Atlanta, zip: 30308
  38 years old
  Samsung Galaxy S3
  Plan: $78 a month
  Talk: 1200 minutes
  Data: 0.2 Gb
  SMS: 8
  Customer care calls: 6
  Dropped calls: high

4
Classification

 Classification is a supervised learning technique that maps data into predefined
classes or groups
 The training set contains a set of records, where one of the attributes indicates
the class
 The modeling objective is to assign a class to every record, using the other
attributes to predict the class
 The data is divided into train / test sets, where "train" is used to build the model
and "test" is used to validate the accuracy of the classification
 Typical techniques: Decision Trees, Neural Networks

Customers table:

Gender  Age  Lipstick
Female  21   Yes
Male    30   No
Female  14   No
Female  35   Yes
Male    17   No
Female  16   Yes

(Tree: Customers split by Gender; Male -> No; Female split by Age:
>= 15 years -> Yes, < 15 years -> No)
5
Classification: Creating Model

Training Data -> Classification Algorithm -> Trained Classifier

Works with both interval and categorical variables

Gender  Age  Lipstick
Female  21   Yes
Male    30   No
Female  14   No
Female  35   Yes
Male    17   No
Female  16   Yes

Learned rule: purchased lipstick if Gender = Female and Age >= 15
6
Classification: Applying Rules

Unscored records            Scored records (after applying scoring)

Gender  Age  Lipstick       Gender  Age  Lipstick
Female  27   ?              Female  27   Yes
Male    55   ?              Male    55   No
Female  47   ?              Female  47   Yes
Male    39   ?              Male    39   No
Female  27   ?              Female  27   Yes
Male    19   ?              Male    19   No

If Gender = Female and Age >= 15 then Purchase lipstick = Yes

7
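The scoring step above can be sketched in a few lines of Python; `purchases_lipstick` is a hypothetical helper name encoding the rule from the slide, not part of any library:

```python
def purchases_lipstick(gender, age):
    """Apply the if-then rule extracted from the trained classifier."""
    return gender == "Female" and age >= 15

# Score the unlabeled records from the slide
records = [("Female", 27), ("Male", 55), ("Female", 47),
           ("Male", 39), ("Female", 27), ("Male", 19)]
scored = [(g, a, "Yes" if purchases_lipstick(g, a) else "No")
          for g, a in records]
```

Real scoring pipelines apply a trained model object the same way: the rule is fixed at training time and then evaluated record by record.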
Decision (classification) Trees

 A tree can be "learned" by splitting the source set into subsets based on an
attribute value test
 The tree partitions samples into mutually exclusive groups by selecting the best
splitting attribute, one group for each terminal node
 The process is repeated recursively for each derived subset until the stopping
criterion is reached

(Tree: Customers split by Gender; Male -> No; Female split by Age:
>= 15 years -> Yes, < 15 years -> No)

 Works with both interval and categorical variables
 No need to normalize the data
 Intuitive if-then rules are easy to extract and apply
 Best applied to binary outcomes

 Decision trees can be used to support multiple modeling objectives:
o Customer segmentation
o Investment / portfolio decisions
o Issuing a credit card or loan
o Medical patient / disease classification


Decision (classification) Trees

Source: Joel Grus, Data Science from Scratch


Decision (classification) Trees

Source: Joel Grus, Data Science from Scratch


Cluster analysis (segmentation)

 Unsupervised learning algorithm


o Unlabeled data and no “target” variable

 Frequently used for segmentation (to identify natural groupings of customers)


o Market segmentation, customer segmentation

 Most cluster analysis methods involve the use of a distance measure to calculate
the closeness between pairs of items
o Data points in one cluster are more similar to one another
o Data points in separate clusters are less similar to one another
(Scatter plot: Spend vs. Income, with Cluster #1, Cluster #2 and Cluster #3 marked)
11
Source: Nick Kadochnikov, Data Mining Modeling Techniques
K-means clustering

(figure-only slides 12-14)
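The assign/update loop that the k-means figure slides walk through can be sketched in pure Python (Lloyd's algorithm). The toy points and the simplistic first-k initialization are illustrative assumptions; production code would use smarter seeding such as k-means++:

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = points[:k]  # simplistic initialization for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments stopped changing
            break
        centroids = new
    return centroids, clusters

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two well-separated toy groups
pts = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

On this data the algorithm converges in two iterations, splitting the points into the two obvious groups of three.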
Clustering: LinkedIn

(figure-only slides 15-16)
Source: Matthew A. Russell, Mining the Social Web
Cluster analysis - K-means clustering

17
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis - Fuzzy C-means clustering (FCM)

(figure-only slides 18-24)
Source: Saeed Aghabozorgi, Cluster Analysis
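Unlike k-means, fuzzy C-means assigns every point a degree of membership in every cluster. A minimal sketch of the membership-update step, assuming the standard FCM formula with fuzzifier m (the toy geometry is illustrative):

```python
def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy C-means membership update: each point belongs to every
    cluster with a degree inversely related to its distance to it."""
    expo = 2.0 / (m - 1.0)
    U = []
    for x in points:
        d = [max(dist(x, c), 1e-12) for c in centroids]  # avoid divide-by-zero
        U.append([1.0 / sum((d[j] / d[k]) ** expo for k in range(len(d)))
                  for j in range(len(d))])
    return U

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# A point 1 unit from one centroid and 3 units from the other
U = fcm_memberships([(0, 0)], [(0, 1), (0, 3)])
```

The memberships in each row sum to 1; here the point belongs 90% to the nearer cluster and 10% to the farther one. A full FCM loop would alternate this step with a weighted centroid update.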
Cluster analysis – Hierarchical clustering

(figure-only slides 25-32)
Source: Saeed Aghabozorgi, Cluster Analysis
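The bottom-up merging shown in the hierarchical clustering slides can be sketched with single linkage (closest pair of points defines cluster distance); the O(n³) loop and the toy data are illustrative simplifications:

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with singleton clusters and
    repeatedly merge the two closest (single-linkage) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = single_linkage(pts, 2)
```

Stopping at k clusters corresponds to cutting the dendrogram at a fixed height; other linkage choices (complete, average, Ward) differ only in how the inter-cluster distance is computed.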
Cluster analysis – DBSCAN

(figure-only slides 33-43)
Source: Saeed Aghabozorgi, Cluster Analysis
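The density-based growth the DBSCAN slides illustrate can be sketched in pure Python; the eps and min_pts values below are illustrative, and real implementations use a spatial index instead of the brute-force neighbor scan:

```python
def dbscan(points, eps, min_pts):
    """Density-based clustering: points with >= min_pts neighbors within
    eps are core points and grow clusters; the rest become noise (-1)."""
    labels = {p: None for p in points}
    cluster = -1
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # noise for now (a cluster may still claim it)
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # former noise becomes a border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in points if dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:  # q is core: keep expanding
                queue.extend(q_neighbors)
    return labels

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Note what distinguishes DBSCAN from k-means: the number of clusters is discovered rather than specified, clusters can be arbitrarily shaped, and the isolated point at (50, 50) is labeled noise instead of being forced into a cluster.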
Cluster analysis – comparison

44
Source: Saeed Aghabozorgi, Cluster Analysis
Association Rules

 Frequently called Market Basket Analysis; an unsupervised learning algorithm
(no target variable)
 Detects associations (affinities) between variables (items or events)
 Example: if a customer purchased bread and bananas, s/he has an 80%
probability of purchasing milk during the same trip
 Multiple applications:
o Cross-sell and up-sell
o Targeted Promotions
o Product bundling
o Store planograms
o Assortment optimization

45
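The bread-and-bananas example above boils down to two standard rule metrics, which a short sketch can compute directly (the toy baskets are chosen to reproduce the slide's 80% figure):

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the association rule lhs -> rhs,
    where each transaction is a set of items."""
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    support = both / n                                    # P(lhs and rhs)
    confidence = both / lhs_count if lhs_count else 0.0   # P(rhs | lhs)
    return support, confidence

baskets = [{"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas"}]
support, confidence = rule_stats(baskets, {"bread", "bananas"}, {"milk"})
```

Algorithms such as Apriori or FP-Growth search for all rules whose support and confidence exceed chosen thresholds, rather than evaluating one rule at a time.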
Neural Networks

 Based loosely on computational models of how brains work
 The model is an assembly of interconnected neurons (nodes) and weighted links
 Each neuron applies a nonlinear function to its inputs to produce an output
 The output node sums its input values according to the weights of its links
 Used for classification, pattern recognition, speech recognition
 "Black box" model – little explanatory power; the results are very hard to interpret

(Diagram: input layer x1, x2 -> hidden layer f1 … f7 -> output layer y;
training an ANN means learning the weights of the neurons)

46
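The "weighted sum plus nonlinearity" idea above can be sketched as one forward pass through a single-hidden-layer network with sigmoid activations; the weight values are arbitrary illustrations, and biases are omitted for brevity:

```python
import math

def forward(x, hidden_weights, output_weights):
    """One forward pass: each hidden neuron applies a sigmoid to a
    weighted sum of the inputs, and the output neuron does the same
    over the hidden activations."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Two inputs, two hidden neurons, one output (illustrative weights)
y = forward([1.0, 2.0], [[0.5, -0.3], [0.1, 0.8]], [1.2, -0.7])
```

Training (backpropagation) adjusts the weight lists so that this output matches the labels; the forward pass itself is all that is needed at scoring time.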
Linear regression: basics

 Predict a value of a given continuous variable based on the values of other


variables, assuming a linear or nonlinear model of dependency
 Virtually endless applications:
o Election outcomes
o Future product revenues or commodity prices
o Wind velocity

 Both predictive and explanatory power

(Scatter plot: Spend vs. Income, fitted regression line with residuals)

y = c0 + c1 x1 + … + cn xn

Spend = 7.4 + 0.37 * Income
47
Source: Nick Kadochnikov, Data Mining Modeling Techniques
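For the one-variable case on the slide, the least-squares coefficients have a closed form, which a short sketch can verify (the income values are synthetic points generated from the slide's fitted line):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = c0 + c1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))   # slope: cov(x, y) / var(x)
    c0 = my - c1 * mx                          # intercept through the means
    return c0, c1

# Points generated from the slide's fitted line Spend = 7.4 + 0.37 * Income
incomes = [0.0, 1.0, 2.0, 3.0]
spends = [7.4 + 0.37 * x for x in incomes]
c0, c1 = fit_line(incomes, spends)
```

With more predictors, the same idea generalizes to solving the normal equations (or, in practice, a QR decomposition) for the coefficient vector.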
Other types of regression analysis

Quantile regression
 Ordinary least squares regression approximates the conditional mean of the
response variable, while quantile regression estimates the conditional median
or other quantiles of the response variable
 This is very helpful for skewed data (e.g., the income distribution in the US) or
for dealing with data without suppressing outliers

Logistic regression
 Logistic regression is used to predict a categorical target variable, most often
one with a binary outcome
 Logit and probit regressions can also be used to predict binary outcomes; while
the underlying distributions are different, all three models produce rather
similar results
 It is frequently used to estimate the probability of an event
o Bank customer defaulting on the loan
o Customer responding to a marketing promotion
48
Source: Nick Kadochnikov, Data Mining Modeling Techniques
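The probability estimate that logistic regression produces comes from passing a linear score through the sigmoid function; the coefficients and the debt-to-income interpretation below are purely hypothetical, for illustration:

```python
import math

def event_probability(c0, c1, x):
    """Logistic regression maps a linear score to a probability:
    P(y = 1) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Hypothetical coefficients: default probability rising with a
# borrower's debt-to-income ratio
p_low = event_probability(-3.0, 5.0, 0.2)   # linear score -1.0
p_high = event_probability(-3.0, 5.0, 0.9)  # linear score  1.5
```

The sigmoid guarantees an output strictly between 0 and 1, which is why the model is a natural fit for the default-risk and promotion-response examples above; fitting the coefficients themselves is done by maximum likelihood.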
49
Questions

50
Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information
contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy,
which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other
materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering
the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken
by you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental
costs and performance characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM
Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server).
Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your
presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included
in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International
Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations,
Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.

51
