6 - Romanko - Data Science and Business Analytics - Data Mining

Oleksandr Romanko is a senior research analyst at IBM Canada and adjunct professor at the University of Toronto and Ukrainian Catholic University. He will be giving a presentation on data science, business analytics, big data, and artificial intelligence with a focus on data mining. Data mining techniques that will be covered include classification, clustering, regression, and forecasting. Classification involves assigning records to predefined groups using attributes to predict class, while clustering identifies natural groupings of data without predefined labels. Specific clustering algorithms like k-means and hierarchical clustering will be discussed.


Oleksandr Romanko, Ph.D.

Senior Research Analyst, Risk Analytics – Business Analytics, IBM Canada


Adjunct Professor, University of Toronto
Adjunct Professor, Ukrainian Catholic University

Introduction to Data Science,


Business Analytics, Big Data
and Artificial Intelligence:
Data Mining

Kyiv Polytechnic University, September 10-11, 2016


Please note:
IBM Risk Analytics statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product
direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information about
potential future products may not be incorporated into any contract. The development,
release, and timing of any future features or functionality described for our products
remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks
in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an
individual user will achieve results similar to those stated here.

2
Data mining

 Data mining application classes of problems


– Classification
– Clustering
– Regression
– Forecasting
– Others
 Hypothesis or discovery driven
 Iterative
 Scalable

3
What is the difference between descriptive (BI) and predictive
analytics?

Descriptive                                  Predictive

John                                         Low churn risk
  Lives in Seattle, zip: 98109
  21 years old
  iPhone 5
  Plan: $98 a month
  Talk: 400 minutes
  Data: 1.9 Gb
  SMS: 370
  Complaints: 0
  Customer care calls: 1
  Dropped calls: low

Mike                                         High churn risk
  Lives in Atlanta, zip: 30308
  38 years old
  Samsung Galaxy S3
  Plan: $78 a month
  Talk: 1200 minutes
  Data: 0.2 Gb
  SMS: 8
  Customer care calls: 6
  Dropped calls: high

4
Classification

 Classification is a supervised learning technique that maps data into predefined
classes or groups
 The training set contains a set of records, where one of the attributes indicates
the class
 The modeling objective is to assign a class to every record, using the other
attributes to predict the class
 The data is divided into train / test sets, where "train" is used to build the model
and "test" is used to validate the accuracy of the classification
 Typical techniques: Decision Trees, Neural Networks

Customers table:

Gender  Age  Lipstick
Female  21   Yes
Male    30   No
Female  14   No
Female  35   Yes
Male    17   No
Female  16   Yes

(Tree: Customers split by Gender; Male -> No; Female split by Age:
>= 15 years -> Yes, < 15 years -> No)
5
Classification: Creating Model

Training Data -> Classification Algorithm -> Trained Classifier

Works with both interval and categorical variables

Gender  Age  Lipstick
Female  21   Yes
Male    30   No
Female  14   No
Female  35   Yes
Male    17   No
Female  16   Yes

Learned rule: purchased lipstick if Gender = Female and Age >= 15
6
Classification: Applying Rules

Unscored records            Scored records (after applying scoring)

Gender  Age  Lipstick       Gender  Age  Lipstick
Female  27   ?              Female  27   Yes
Male    55   ?              Male    55   No
Female  47   ?              Female  47   Yes
Male    39   ?              Male    39   No
Female  27   ?              Female  27   Yes
Male    19   ?              Male    19   No

If Gender = Female and Age >= 15 then Purchase lipstick = Yes

7
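The scoring step above can be sketched in a few lines of Python; `purchases_lipstick` is a hypothetical helper name encoding the rule from the slide, not part of any library:

```python
def purchases_lipstick(gender, age):
    """Apply the if-then rule extracted from the trained classifier."""
    return gender == "Female" and age >= 15

# Score the unlabeled records from the slide
records = [("Female", 27), ("Male", 55), ("Female", 47),
           ("Male", 39), ("Female", 27), ("Male", 19)]
scored = [(g, a, "Yes" if purchases_lipstick(g, a) else "No")
          for g, a in records]
```

Real scoring pipelines apply a trained model object the same way: the rule is fixed at training time and then evaluated record by record.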
Decision (classification) Trees

 A tree can be "learned" by splitting the source set into subsets based on an
attribute value test
 The tree partitions samples into mutually exclusive groups by selecting the best
splitting attribute, one group for each terminal node
 The process is repeated recursively for each derived subset until the stopping
criterion is reached

(Tree: Customers split by Gender; Male -> No; Female split by Age:
>= 15 years -> Yes, < 15 years -> No)

 Works with both interval and categorical variables
 No need to normalize the data
 Intuitive if-then rules are easy to extract and apply
 Best applied to binary outcomes

 Decision trees can be used to support multiple modeling objectives:
o Customer segmentation
o Investment / portfolio decisions
o Issuing a credit card or loan
o Medical patient / disease classification


Decision (classification) Trees

Source: Joel Grus, Data Science from Scratch


Decision (classification) Trees

Source: Joel Grus, Data Science from Scratch


Cluster analysis (segmentation)

 Unsupervised learning algorithm


o Unlabeled data and no “target” variable

 Frequently used for segmentation (to identify natural groupings of customers)


o Market segmentation, customer segmentation

 Most cluster analysis methods involve the use of a distance measure to calculate
the closeness between pairs of items
o Data points in one cluster are more similar to one another
o Data points in separate clusters are less similar to one another
(Scatter plot: Spend vs. Income, with Cluster #1, Cluster #2 and Cluster #3 marked)
11
Source: Nick Kadochnikov, Data Mining Modeling Techniques
K-means clustering

(figure-only slides 12-14)
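The assign/update loop that the k-means figure slides walk through can be sketched in pure Python (Lloyd's algorithm). The toy points and the simplistic first-k initialization are illustrative assumptions; production code would use smarter seeding such as k-means++:

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = points[:k]  # simplistic initialization for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments stopped changing
            break
        centroids = new
    return centroids, clusters

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two well-separated toy groups
pts = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

On this data the algorithm converges in two iterations, splitting the points into the two obvious groups of three.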
Clustering: LinkedIn

(figure-only slides 15-16)
Source: Matthew A. Russell, Mining the Social Web
Cluster analysis - K-means clustering

17
Source: Saeed Aghabozorgi, Cluster Analysis
Cluster analysis - Fuzzy C-means clustering (FCM)

(figure-only slides 18-24)
Source: Saeed Aghabozorgi, Cluster Analysis
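Unlike k-means, fuzzy C-means assigns every point a degree of membership in every cluster. A minimal sketch of the membership-update step, assuming the standard FCM formula with fuzzifier m (the toy geometry is illustrative):

```python
def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy C-means membership update: each point belongs to every
    cluster with a degree inversely related to its distance to it."""
    expo = 2.0 / (m - 1.0)
    U = []
    for x in points:
        d = [max(dist(x, c), 1e-12) for c in centroids]  # avoid divide-by-zero
        U.append([1.0 / sum((d[j] / d[k]) ** expo for k in range(len(d)))
                  for j in range(len(d))])
    return U

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# A point 1 unit from one centroid and 3 units from the other
U = fcm_memberships([(0, 0)], [(0, 1), (0, 3)])
```

The memberships in each row sum to 1; here the point belongs 90% to the nearer cluster and 10% to the farther one. A full FCM loop would alternate this step with a weighted centroid update.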
Cluster analysis – Hierarchical clustering

(figure-only slides 25-32)
Source: Saeed Aghabozorgi, Cluster Analysis
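The bottom-up merging shown in the hierarchical clustering slides can be sketched with single linkage (closest pair of points defines cluster distance); the O(n³) loop and the toy data are illustrative simplifications:

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with singleton clusters and
    repeatedly merge the two closest (single-linkage) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = single_linkage(pts, 2)
```

Stopping at k clusters corresponds to cutting the dendrogram at a fixed height; other linkage choices (complete, average, Ward) differ only in how the inter-cluster distance is computed.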
Cluster analysis – DBSCAN

(figure-only slides 33-43)
Source: Saeed Aghabozorgi, Cluster Analysis
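The density-based growth the DBSCAN slides illustrate can be sketched in pure Python; the eps and min_pts values below are illustrative, and real implementations use a spatial index instead of the brute-force neighbor scan:

```python
def dbscan(points, eps, min_pts):
    """Density-based clustering: points with >= min_pts neighbors within
    eps are core points and grow clusters; the rest become noise (-1)."""
    labels = {p: None for p in points}
    cluster = -1
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # noise for now (a cluster may still claim it)
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # former noise becomes a border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in points if dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:  # q is core: keep expanding
                queue.extend(q_neighbors)
    return labels

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Note what distinguishes DBSCAN from k-means: the number of clusters is discovered rather than specified, clusters can be arbitrarily shaped, and the isolated point at (50, 50) is labeled noise instead of being forced into a cluster.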
Cluster analysis – comparison

44
Source: Saeed Aghabozorgi, Cluster Analysis
Association Rules

 Frequently called Market Basket Analysis; an unsupervised learning algorithm
(no target variable)
 Detects associations (affinities) between variables (items or events)
 Example: if a customer purchased bread and bananas, s/he has an 80%
probability of purchasing milk during the same trip
 Multiple applications:
o Cross-sell and up-sell
o Targeted Promotions
o Product bundling
o Store planograms
o Assortment optimization

45
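The bread-and-bananas example above boils down to two standard rule metrics, which a short sketch can compute directly (the toy baskets are chosen to reproduce the slide's 80% figure):

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the association rule lhs -> rhs,
    where each transaction is a set of items."""
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    support = both / n                                    # P(lhs and rhs)
    confidence = both / lhs_count if lhs_count else 0.0   # P(rhs | lhs)
    return support, confidence

baskets = [{"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas", "milk"},
           {"bread", "bananas"}]
support, confidence = rule_stats(baskets, {"bread", "bananas"}, {"milk"})
```

Algorithms such as Apriori or FP-Growth search for all rules whose support and confidence exceed chosen thresholds, rather than evaluating one rule at a time.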
Neural Networks

 Based loosely on computational models of how brains work
 The model is an assembly of interconnected neurons (nodes) and weighted links
 Each neuron applies a nonlinear function to its inputs to produce an output
 The output node sums its input values according to the weights of its links
 Used for classification, pattern recognition, speech recognition
 "Black box" model – little explanatory power; the results are very hard to interpret

(Diagram: input layer x1, x2 -> hidden layer f1 … f7 -> output layer y;
training an ANN means learning the weights of the neurons)

46
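The "weighted sum plus nonlinearity" idea above can be sketched as one forward pass through a single-hidden-layer network with sigmoid activations; the weight values are arbitrary illustrations, and biases are omitted for brevity:

```python
import math

def forward(x, hidden_weights, output_weights):
    """One forward pass: each hidden neuron applies a sigmoid to a
    weighted sum of the inputs, and the output neuron does the same
    over the hidden activations."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Two inputs, two hidden neurons, one output (illustrative weights)
y = forward([1.0, 2.0], [[0.5, -0.3], [0.1, 0.8]], [1.2, -0.7])
```

Training (backpropagation) adjusts the weight lists so that this output matches the labels; the forward pass itself is all that is needed at scoring time.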
Linear regression: basics

 Predict a value of a given continuous variable based on the values of other


variables, assuming a linear or nonlinear model of dependency
 Virtually endless applications:
o Election outcomes
o Future product revenues or commodity prices
o Wind velocity

 Both predictive and explanatory power

(Scatter plot: Spend vs. Income, fitted regression line with residuals)

y = c0 + c1 x1 + … + cn xn

Spend = 7.4 + 0.37 * Income
47
Source: Nick Kadochnikov, Data Mining Modeling Techniques
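For the one-variable case on the slide, the least-squares coefficients have a closed form, which a short sketch can verify (the income values are synthetic points generated from the slide's fitted line):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = c0 + c1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))   # slope: cov(x, y) / var(x)
    c0 = my - c1 * mx                          # intercept through the means
    return c0, c1

# Points generated from the slide's fitted line Spend = 7.4 + 0.37 * Income
incomes = [0.0, 1.0, 2.0, 3.0]
spends = [7.4 + 0.37 * x for x in incomes]
c0, c1 = fit_line(incomes, spends)
```

With more predictors, the same idea generalizes to solving the normal equations (or, in practice, a QR decomposition) for the coefficient vector.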
Other types of regression analysis

Quantile regression
 Ordinary least squares regression approximates the conditional mean of the
response variable, while quantile regression estimates the conditional median
or other quantiles of the response variable
 This is very helpful for skewed data (e.g., the income distribution in the US) or
for dealing with data without suppressing outliers

Logistic regression
 Logistic regression is used to predict a categorical target variable, most often
one with a binary outcome
 Logit and probit regressions can also be used to predict binary outcomes; while
the underlying distributions are different, all three models produce rather
similar results
 It is frequently used to estimate the probability of an event
o Bank customer defaulting on the loan
o Customer responding to a marketing promotion
48
Source: Nick Kadochnikov, Data Mining Modeling Techniques
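The probability estimate that logistic regression produces comes from passing a linear score through the sigmoid function; the coefficients and the debt-to-income interpretation below are purely hypothetical, for illustration:

```python
import math

def event_probability(c0, c1, x):
    """Logistic regression maps a linear score to a probability:
    P(y = 1) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Hypothetical coefficients: default probability rising with a
# borrower's debt-to-income ratio
p_low = event_probability(-3.0, 5.0, 0.2)   # linear score -1.0
p_high = event_probability(-3.0, 5.0, 0.9)  # linear score  1.5
```

The sigmoid guarantees an output strictly between 0 and 1, which is why the model is a natural fit for the default-risk and promotion-response examples above; fitting the coefficients themselves is done by maximum likelihood.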
49
Questions

50
Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information
contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy,
which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other
materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering
the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken
by you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental
costs and performance characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM
Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server).
Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your
presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included
in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International
Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations,
Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.

51
