
BDA 557 Data Science For Business Slides

The document provides an overview of Big Data and its parameters, including volume, variety, velocity, veracity, value, variability, visualization, validity, vulnerability, and volatility. It explains key concepts such as Data Analytics, Business Analytics, and Data Science, along with the applications and benefits of Big Data Analytics in various business fields. Additionally, it outlines the KDD process model for Big Data analytics and distinguishes between structured, semi-structured, and unstructured data.


Overview of Data Science for Business
Nana Osei Boateng
Study Objectives
• After completing this unit, students should be able to:
▪ Explain the concept of Big Data
▪ Explain the Big Data Parameters
▪ Explain the terminologies such as Data Analytics, Big Data Analytics, Data
Science, Business Analytics
▪ Gain an understanding of Big Data Analytics tools
▪ Distinguish between Supervised and Unsupervised Machine Learning
▪ Explain the benefits of Big Data Analytics

Data Science for Business (BDA 517) 2


What is Big Data?
•Big data refers to a set of techniques and technologies
that require new forms of integration in order to
uncover hidden value from large datasets that are
diverse, complex, and of a very large scale.
Ratra and Gulia (2019)



What is Big Data?
• Big data refers to large datasets that cannot be captured, stored, managed and
analyzed by typical software tools. These data sets are huge not only in size but
also in heterogeneity and complexity, and include operational, transactional,
sales, marketing and other data.
Šubić, Poščić & Jakšić (2015)


Parameters of Big Data
▪ Volume (large amounts of data, reaching levels such as megabytes, gigabytes,
terabytes, petabytes, etc.)
▪ Variety (substantial data heterogeneity across individuals and data types)
▪ Velocity (high speed of access and analysis)
Laney (2001) introduced these three as the dimensional challenges of data management.

Parameters of Big Data
▪ Veracity (uncertainty due to data inconsistency and incompleteness,
ambiguities, latency, deception, model approximations)
▪ Value (the potential insights that an organisation may harness from available data)
Aftab & Siddiqui (2019)


Parameters of Big Data
▪ Variability (inconsistency of the speed at which data is stored in a system)
▪ Visualization (presentation of data in different forms to get a clear understanding)
Erevelles et al. (2016)


Parameters of Big Data
▪ Validity (dealing with data accuracy and correctness)
▪ Vulnerability (data security)
▪ Volatility (dealing with how long data remains valid and how long it should be stored)
Jukić, Sharma, Nestorov & Nestorov (2015)

Sources of Big Data
▪ Databases
▪ Data Warehouses
▪ Online portal content
▪ Point of Sale (POS) data or other transactional data
▪ Smart meter data
▪ Video sources
▪ Image sources
▪ Audio sources
▪ Other natural language text sources
▪ Social media
▪ RFID system data
▪ Clickstream data
▪ GPS data
▪ Weblog posts
▪ Geotagging
▪ Satellite imagery
▪ Aerial imagery and videos
▪ Wireless sensors and networks
▪ Simulation data
▪ Spatial data, etc.

Other Terminologies
Although the terms Data Analytics, Business Analytics, Data Science and Big
Data Analytics are often used interchangeably, they do not mean the same thing.
▪ Data Analytics
Data analytics is the analysis of data, whether huge or small, in order to
understand it and see how to use the knowledge hidden within it.
▪ Business Analytics
Business analytics is the application of data analytics to business.


Other Terminologies (cont'd)
▪ Big Data Analytics
• Big data analytics is the analysis of huge amounts of data (for example,
trillions of records) or the analysis of difficult-to-crack problems. Usually,
this requires a huge amount of storage and/or computing capability.
• This analysis requires enormous amounts of memory to hold the data, a huge
number of processors, and high-speed processing to crunch the data and get
its essence. An example is the analysis of geospatial data captured by
satellite to identify weather patterns and make related predictions.



Other Terminologies (cont'd)
▪ Data Science
• Data science is an interdisciplinary field (including disciplines such
as statistics, mathematics, and computer programming) that derives
knowledge from data and applies it for predictive or other purposes.
• Expertise about underlying processes, systems, and algorithms is
used. An example is the application of t-values and p-values from
statistics in identifying significant model parameters in a regression
equation.
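
The regression example above can be sketched in a few lines of R. This is a minimal illustration, not part of the original slides: the built-in mtcars data set, the chosen predictors and the 5% significance level are all assumptions made for the sake of the example.

```r
# Fit a linear regression on R's built-in mtcars data (illustrative choice).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The coefficient table holds the estimate, standard error, t-value and p-value
# of every model parameter.
coef_table <- summary(fit)$coefficients
print(coef_table)

# Keep the parameters whose p-value is below an assumed 5% significance level.
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(significant)
```

Parameters with a small p-value (equivalently, a large absolute t-value) are the ones treated as significant in the regression equation.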
Applications of Data Science/Big Data Analytics in Business
Data Science has been successfully applied in many business fields including:
▪ Marketing & Sales
▪ Human Resources
▪ Product Design
▪ Service Design
▪ Financial Services
▪ Customer Service & Support areas



Framework for Business Analytics
• The Business Analytics Framework gives a bird's-eye view of how various tools and
techniques converge to drive Business Analytics, which entails making data-driven
decisions for competitive advantage.

Source: Adapted from Hodeghatta & Nayak (2016)


Types of Big Data Analytics

▪ Descriptive Analytics
▪ Diagnostic Analytics
▪ Predictive Analytics
▪ Prescriptive Analytics



Examples of Big Data Analytics Tools



Big Data Analytics (Machine Learning) Techniques

▪ Machine learning techniques are classified into:


❑Supervised Machine Learning
❑Unsupervised Machine Learning.



Supervised & Unsupervised Machine Learning
• Consider two similar questions we might ask about a customer population.

Question 1: “Do our customers naturally fall into different groups?”
▪ No specific purpose or target has been specified for the grouping.
▪ When there is no such target, the data mining problem is referred to as unsupervised.

Question 2: “Can we find groups of customers who have particularly high likelihoods of
cancelling their service soon after their contracts expire?”
▪ A specific target is defined: will a customer leave when their contract expires?
▪ Segmentation has a specific reason: to take action based on likelihood of churn.
▪ Thus, this is a supervised data mining problem.
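
The contrast between the two questions can be sketched in R. The customer data below is simulated and entirely hypothetical (the column names spend, calls and churn are invented for illustration). k-means and logistic regression are used only as convenient stand-ins for an unsupervised and a supervised technique.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical customer data: monthly spend, support calls, and a churn flag.
n <- 200
customers <- data.frame(
  spend = c(rnorm(n / 2, mean = 30, sd = 5), rnorm(n / 2, mean = 80, sd = 5)),
  calls = c(rpois(n / 2, lambda = 5), rpois(n / 2, lambda = 1))
)
customers$churn <- rbinom(n, size = 1,
                          prob = ifelse(customers$calls > 3, 0.7, 0.1))

# Unsupervised: no target variable; k-means simply looks for natural groups.
groups <- kmeans(scale(customers[, c("spend", "calls")]), centers = 2)
print(table(groups$cluster))

# Supervised: a target (churn) is specified; logistic regression predicts it.
model <- glm(churn ~ spend + calls, data = customers, family = binomial)
churn_prob <- predict(model, type = "response")
print(head(round(churn_prob, 2)))
```

The clustering step never sees the churn column, while the regression step cannot run without it. That is the practical difference between the two problem types.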
Big Data Analytics (Machine Learning) Techniques
▪ Supervised Machine Learning Techniques:
❑Linear Regression
❑Decision Trees
❑Regression Trees
❑Random Forest
❑K-Nearest Neighbour (KNN)
❑Artificial Neural Network (ANN)
❑Naïve Bayes



Big Data Analytics (Machine Learning) Techniques
▪ Unsupervised Machine Learning Techniques:
❑Clustering Techniques
❑Association Rule



Benefits of Big Data Analytics
▪ Provides a key resource for enterprises to obtain new knowledge, add value and
foster new products, processes and markets.
▪ Helps enterprises understand their business environments, their customers’
behavior and needs, and their competitors’ activities.
▪ Big Data Analytics sophistication leads to superior performance (a) indirectly, by
producing more timely, relevant and actionable information, thereby creating an
incentive for managers to act upon that information, and, to a lesser extent,
(b) directly, by means of automation of decisions and business processes.


Benefits of Big Data Analytics

Benefit Item: Description

▪ Strategic Benefits
• Aligning IT with a business strategy
• Establishing useful links with other organizations
• Enabling a quicker response to change
• Improving customer relations
• Providing better products or services

▪ Transformational Benefits
• Achieving an improved skill level for the employees
• Developing new business opportunities
• Expanding capabilities
• Improving business models

▪ Transactional Benefits
• Saving on supply chain management
• Reducing operating costs
• Reducing communication costs
• Avoiding the need to increase the workforce
• Increasing return on financial assets
• Enhancing employee productivity

▪ Informational Benefits
• Enabling faster access to data
• Enabling easier access to data
• Improving management data
• Improving data accuracy
• Providing data in more useable formats

Raguseo (2018)


References
• Aftab, U. and Siddiqui, G.F. (2019), “Big Data Augmentation with Data Warehouse: A Survey”, Proceedings - 2018 IEEE
International Conference on Big Data, Big Data 2018, IEEE, pp. 2785–2794.
• Erevelles, S., Fukawa, N. and Swayne, L. (2016), “Big Data Consumer Analytics and the Transformation of Marketing”, Journal
of Business Research, Vol. 69 No. 2, pp. 897–904.
• Hodeghatta, U.R. and Nayak, U. (2017), Business Analytics Using R - A Practical Approach, Apress.
• Jukić, N., Sharma, A., Nestorov, S. and Nestorov, B. (2015), “Augmenting Data Warehouses with Big Data”, Information Systems
Management, Vol. 32 No. 3, pp. 200–209.
• Laney, D. (2001), “3D Data Management: Controlling Data Volume, Velocity and Variety”, META Group Research Note, Vol. 6 No. 70, p. 1.
• Mohamed, A., Najafabadi, M.K., Wah, Y.B., Zaman, E.A.K. and Maskat, R. (2020), “The state of the art and taxonomy of big data
analytics: view from new big data framework”, Artificial Intelligence Review, Springer Netherlands, Vol. 53 No. 2, pp. 989–1037.
• Raguseo, E. (2018), “Big data technologies: An empirical investigation on their adoption, benefits and risks for
companies”, International Journal of Information Management, Vol. 38 No. 1, pp. 187–195.
• Ratra, R. and Gulia, P. (2019), “Big Data Tools and Techniques: A Roadmap for Predictive Analytics”, International Journal of
Engineering and Advanced Technology, Vol. 9 No. 2, pp. 4986–4992.
• Šubić, T., Poščić, P. and Jakšić, D. (2015), “Big Data in Data Warehouses”, pp. 235–244.


Unit 2 – Big Data Analytics Process
Nana Boateng
Data Science for Business

Study Objectives

• After completing this unit, students should be able to:


• Distinguish between structured data, semi-structured data, and unstructured data.
• Explain various approaches to big data analytics processes including
• KDD
• SEMMA
• CRISP-DM
• Analytic Life Cycle
• Data Science Life Cycle

Understanding Data
▪ Getting quality data is the most important factor determining the
accuracy of the results. Data can be either from a primary source or
secondary source.
Examples of Primary Sources of Data (within the organisation)
• Databases (e.g. Operational, HR, Manufacturing, IT, etc.)
• Data Warehouse
Examples of Secondary Sources
• Official reports from government, newspapers, articles, census data, etc.


Understanding Data
Other Sources of Data
❑Structured Data
❑Unstructured Data



Structured Data
▪ Structured data is data that is well organized and consistently formatted. It
typically resides in relational databases, meaning the information is stored in
tables whose rows and columns are related to one another.

▪ Because structured data is arranged and recorded neatly, it can be easily found
and processed. As long as data fits within the structure of a relational database
management system (RDBMS), we can easily search for specific information and single
out the relationships between its pieces. Such data can only be used for its
intended purpose.


Structured Data

This data can comprise both text and numbers, such as employee names, contacts, ZIP
codes, addresses, credit card numbers, etc. Common data formats include CSV and XML.


Unstructured Data
▪ Unstructured data is data that is not organized in a pre-defined way, meaning
it is stored in its native format.

▪ A wide array of forms make up unstructured data, such as email, text files,
social media posts, video, images, audio, sensor data, and so on.

▪ Unstructured data formats include PDF, JPEG, WMV, MP3, etc.


Semi-Structured Data
▪ Semi-structured data is partially structured, meaning that it incorporates
certain markers that separate semantic elements and enforce data hierarchies,
but it still differs from the tabular data models of relational databases.

▪ Such a structure is called self-describing. Markup languages such as XML are
forms of semi-structured data.


Semi-Structured Data

Example of Semi-Structured Data

▪ JSON is also a semi-structured data model; it is used by new-generation
databases such as MongoDB.


Structured & Unstructured Data
Both structured and unstructured data carry great value for businesses of
diverse fields and scale.

As a data scientist, it is essential to be able to analyse all data (structured
and unstructured) in order to improve the effectiveness of business intelligence
and gain competitive advantage.

One of the platforms through which different sources of data can be brought
together to derive insight is a data warehouse.


Big Data Analytics Process Models



KDD Process Model
❑ The Knowledge Discovery in Databases (KDD) process model consists of five steps:
▪ Selection
▪ Pre-processing
▪ Transformation
▪ Data Mining
▪ Interpretation/Evaluation


KDD Process Model

Fayyad, Piatetsky-Shapiro and Smyth (1996)


KDD Process Model
• Selection
• This stage consists of creating a target data set, or focusing on a subset of
variables or data samples, on which discovery is to be performed.
• Pre-processing
• This stage consists of cleaning and pre-processing the target data in order to
obtain consistent data.
• Transformation
• This stage consists of transforming the data using dimensionality
reduction or transformation methods.
Fayyad, Piatetsky-Shapiro and Smyth (1996)


KDD Process Model
• Data Mining
• This stage consists of searching for patterns of interest in a particular
representational form, depending on the data mining objective (usually
prediction).
• Interpretation/Evaluation
• This stage consists of the interpretation and evaluation of the mined patterns.

Fayyad, Piatetsky-Shapiro and Smyth (1996)
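
The five KDD steps can be walked through on a small data set. The sketch below is only an illustration under stated assumptions: it uses R's built-in iris data, principal components as the transformation method and k-means as the data mining method. None of these is prescribed by the KDD model itself.

```r
set.seed(1)  # k-means uses random starting points

# 1) Selection: pick the target data set and the variables of interest.
target <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# 2) Pre-processing: clean the data (here, drop rows with missing values;
#    iris has none, so all 150 rows are kept).
clean <- na.omit(target)

# 3) Transformation: reduce dimensionality with principal components.
pca <- prcomp(clean, center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]

# 4) Data mining: search for patterns (here, three clusters via k-means).
clusters <- kmeans(reduced, centers = 3, nstart = 10)

# 5) Interpretation/Evaluation: compare the clusters with the known species.
evaluation <- table(cluster = clusters$cluster, species = iris$Species)
print(evaluation)
```

In a real project the steps are revisited iteratively; a poor evaluation in step 5 usually sends the analyst back to selection or pre-processing.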


CRISP-DM Process Model
❑ CRISP-DM stands for Cross Industry Standard
Process for Data Mining.
❑ The CRISP-DM process model consists of six iterative phases:
▪ Business Understanding
▪ Data Understanding
▪ Data Preparation
▪ Modelling
▪ Evaluation
▪ Deployment


CRISP-DM Process Model
▪ Business understanding
▪ This initial phase focuses on understanding the project objectives and requirements from a business
perspective, then converting this knowledge into a data mining problem definition and a preliminary plan
designed to achieve the objectives.

▪ Data understanding
▪ The data understanding phase starts with an initial data collection and proceeds with activities in order to
get familiar with the data, to identify data quality problems, to discover first insights into the data or to
detect interesting subsets to form hypotheses for hidden information.

▪ Data preparation
▪ The data preparation phase covers all activities to construct the final dataset from the initial raw data.
(Chapman et al., 2000)



CRISP-DM Process Model
▪ Modelling
▪ In this phase, various modeling techniques are selected and applied and their
parameters are calibrated to optimal values.

▪ Evaluation
▪ At this stage the model (or models) obtained are more thoroughly evaluated and the
steps executed to construct the model are reviewed to be certain it properly achieves the
business objectives.

▪ Deployment
▪ Creation of the model is generally not the end of the project. Even if the purpose of the
model is to increase knowledge of the data, the knowledge gained will need to be
organized and presented in a way that the customer can use it.

(Chapman et al., 2000)



SEMMA Process Model
▪ Developed by SAS and often used in conjunction with SAS tools.

▪ SEMMA (Sample, Explore, Modify, Model, Assess) offers an easy-to-understand
process, allowing an organised and adequate development and maintenance of Big
Data Analytics projects.

▪ It thus provides a structure for their conception, creation and evolution,
which helps to present solutions to business problems.


SEMMA Process Model
▪ Sample
▪ This stage consists of sampling the data by extracting a portion of a large data set big enough to contain
the significant information, yet small enough to manipulate quickly. This stage is pointed out as being
optional.
▪ Explore
▪ This stage consists of the exploration of the data by searching for unanticipated trends and anomalies in
order to gain understanding and ideas.
▪ Modify
▪ This stage consists of the modification of the data by creating, selecting, and transforming the variables to
focus the model selection process.
▪ Model
▪ This stage consists of modeling the data by allowing the software to search automatically for a
combination of data that reliably predicts a desired outcome.
▪ Assess
▪ This stage consists of assessing the data by evaluating the usefulness and reliability of the findings from
the data mining process and estimating how well the model performs.
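
The five SEMMA stages can be sketched in base R rather than SAS tools. This is a hedged illustration: the built-in mtcars data, the 24-row sample, the log transformation and the linear model are all assumptions chosen to keep the example small.

```r
set.seed(7)  # make the random sample reproducible

# Sample: draw a manageable portion of the data (here 24 of 32 rows).
idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[idx, ]
holdout <- mtcars[-idx, ]

# Explore: look for trends and anomalies.
print(summary(train[, c("mpg", "wt", "hp")]))
print(cor(train$mpg, train$wt))

# Modify: create and transform variables.
train$log_hp <- log(train$hp)
holdout$log_hp <- log(holdout$hp)

# Model: fit a model that predicts the desired outcome.
fit <- lm(mpg ~ wt + log_hp, data = train)

# Assess: estimate how well the model performs on data it has not seen.
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$mpg - pred)^2))
print(rmse)
```

Holding back part of the data for the Assess stage is a design choice made here so that the performance estimate is not flattered by the data used for fitting.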


Analytic Life Cycle Process Model
• The Analytic Life Cycle, developed by SAS in 2018, consists of two main
phases: Discovery and Deployment.


Analytic Life Cycle Process Model

Discovery Phase
• Ask a question
▪ Define the business needs of the organisation and break these problems down into mathematical forms
in order to solve them.
• Prepare the Data
▪ Combine and transform data into appropriate formats for further processing.
• Explore the Data
▪ Analyse and visualize data to uncover hidden patterns, anomalies, etc.
• Model the Data
▪ Combine machine learning algorithms in an ensemble in order to help select the best-performing model.

Deployment Phase
• Implement your models
▪ Insights derived from the discovery phase are put into action using repeatable, automated processes.
• Act on new information
▪ Acting on new information is based on strategic and operational decisions.
• Evaluate the results
▪ Feedback derived from the results is fed back into the model, thus creating a machine learning loop.
• Ask again
▪ Recalibrate to incorporate the new information.


Data Science Life Cycle
❑The Data Science Life Cycle consists of five iterative
stages:
▪ business understanding,
▪ data acquisition and understanding,
▪ modelling,
▪ deployment and
▪ customer acceptance.



Data Science Life Cycle



Data Science Life Cycle
❑ Business Understanding
▪ Definition of SMART objectives and identification of data sources

❑ Data Acquisition & Understanding
▪ ingestion of data
▪ exploration of data
▪ setting up a data pipeline

❑ Modelling
▪ feature engineering
▪ model training
▪ model evaluation


Data Science Life Cycle
❑ Deployment
▪ Use of an open Application Programming Interface (API) to connect to other applications such
as dashboards, spreadsheets, online websites, line-of-business applications and
back-end applications.

❑ Customer Acceptance
▪ system validation
▪ project hand-off


Summary
• Data can be structured, semi-structured or unstructured
• There are several Big Data Analytics Models including:
• KDD
• CRISP-DM
• SEMMA
• Analytic Life Cycle
• Data Science Life Cycle



Unit 3 – R Basics for Data Science
Nana Boateng
Study Objectives
After completing this unit, students should be able to:
1) Install R Console & RStudio IDE
2) Explain and create value assignment
3) Explain and create various types of objects in R such as:
❑ Scalar
❑ Vector
❑ Matrices
❑ Arrays
❑ Data Frame
❑ List
❑ Factors

R Basics for Data Science Nana Boateng 2


Installing R
Follow these steps to download the binaries:
1) Go to the official R site at www.r-project.org
2) Click the download link.
3) Pick the nearest geographic area (country) and mirror site to download from.
4) Select the operating system.
5) Download the installer and run it. On Windows, you just have to
click the installer.
6) Follow the instructions provided by the installer to complete the installation.


Installing R

▪ After the installation, click the icon to start R.
▪ A window appears, showing the R console.


Installing RStudio
▪ After the successful installation of the R Console, you will need to install RStudio.


Installing RStudio
RStudio provides an integrated development environment (IDE) for R. RStudio is available
in two variants:
▪ desktop version
▪ server version
• RStudio Desktop allows RStudio to run locally on the desktop. RStudio Server runs on a
web server and can be accessed remotely by using a web browser. RStudio Desktop is
available for Microsoft Windows, macOS, and Linux.
• Refer to the RStudio web site at
https://rstudio.com/products/rstudio/download/#download and download the desktop
version.


Exploring the RStudio IDE

• RStudio has four windows, which allow you to write scripts, view the output,
view the environment and the variables, and view the graphs and plots.
• The top-left window allows you to enter R commands or scripts. R scripts in
this window can be executed one command at a time or as a whole file. The code
can also be saved as an R script for future reference. Each R command can be
executed by clicking Run at the top-right corner of this window.
• The bottom-left window is the R console, which displays the R output. You can
also enter any R command in this window to check the results. Because it is a
console window, the commands you type there are not saved as a script.
• The top-right window lists the environment variable types and global variables.
• The bottom-right window shows the generated graphs and plots, and provides
help information. It also has an option to export or save plots to a file.


R Basics
Value Assignment
▪ Objects in R obtain values by assignment.
▪ This is achieved with the "gets" arrow, <-
For example, assigning the value 1 to x is written as:
x <- 1
Types of Objects in R
• Scalar
• Vector
• Matrices
• Arrays
• Data Frame
• List
• Factors



Scalar
▪ A scalar is an atomic quantity that can hold only one value at a time. Scalars
are the most basic data types and can be used to construct more complex ones.
▪ Let's take a look at some common types of scalars with simple R commands (note
the difference between a number and a character).
[Figure: R console examples of a number scalar and a character scalar]
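
A short sketch of the number and character scalars mentioned above, with a logical scalar added for comparison. The object names a, b and c_flag are arbitrary.

```r
# A numeric scalar
a <- 3.5
class(a)        # "numeric"

# A character scalar (note the quotation marks)
b <- "3.5"
class(b)        # "character"

# A logical scalar
c_flag <- TRUE
class(c_flag)   # "logical"

# Numbers support arithmetic; the character "3.5" would raise an error here.
a + 1           # 4.5
```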


Vector
A vector is a sequence of data elements of the same basic type.
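
A minimal sketch of creating and using vectors with c(); the values and names are invented for illustration.

```r
# c() combines values of the same basic type into a vector.
ages <- c(23, 35, 41, 29)
first_names <- c("Ama", "Kofi", "Esi", "Yaw")

length(ages)    # 4
ages[2]         # 35 (R indexes from 1)
mean(ages)      # 32
ages * 2        # arithmetic is applied element by element

# Mixing types coerces every element to one common type (here, character).
mixed <- c(1, "two", 3)
class(mixed)    # "character"
```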



Matrix
• A matrix is a collection of data elements arranged in a two-dimensional
rectangular layout. As with a vector, the components of a matrix must be of the
same data type. The following example is a matrix with 4 rows and 3 columns.
[Figure: example of a matrix with 4 rows and 3 columns]
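
A matrix with 4 rows and 3 columns can be built with matrix(). This is a sketch with arbitrary values 1 to 12; note that R fills a matrix column by column unless byrow = TRUE is given.

```r
# Build a 4 x 3 matrix from the numbers 1 to 12 (filled column by column).
m <- matrix(1:12, nrow = 4, ncol = 3)
print(m)

dim(m)      # 4 3
m[2, 3]     # 10: the element in row 2, column 3
m[, 1]      # the first column: 1 2 3 4
t(m)        # the transpose, a 3 x 4 matrix
```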


Arrays
• An array is a data structure containing a number of data values (all of
which are of the same type).
• One-dimensional array
[Figure: example of a one-dimensional array]
• The two-dimensional array can function exactly like a matrix.
• Arrays can have more than two dimensions. In R, they're created by using the
array() function:
myarray <- array(vector, dimensions, dimnames)
[Figure: example of a three-dimensional array]
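
The array() syntax above can be filled in as follows. This is a sketch; the dimension labels A1, B1, C1 and so on are invented for illustration.

```r
# Labels for each of the three dimensions (hypothetical names).
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

# A 2 x 3 x 4 array holding the numbers 1 to 24.
myarray <- array(1:24, dim = c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(myarray)                 # 2 3 4
myarray["A2", "B3", "C1"]    # 6: elements can be addressed by label
myarray[1, 2, 3]             # 15: or by position
```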


Data Frame

• A data frame is more general than a matrix, in that different columns can
have different basic data types. The data frame is the most common data type we
are going to use in this class.
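
A minimal sketch of a data frame whose columns hold different basic types. The employee data is made up for illustration.

```r
# Each column is a vector; columns may differ in type but share one length.
employees <- data.frame(
  name   = c("Ama", "Kofi", "Esi"),
  age    = c(34, 45, 29),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(employees)                     # structure: 3 observations of 3 variables
employees$age                      # one column, returned as a vector
employees[employees$age > 30, ]    # rows that satisfy a condition
sapply(employees, class)           # the type of each column
```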
Lists
▪ Lists are the most complex of the R data types. Lists allow you to specify
and store any data type object.
▪ A list can contain a combination of vectors, matrices, or even data frames
under one single object name.
▪ You create a list by using the list() function:
mylist <- list(object1, object2, ...)
[Figure: syntax for creating a list based on previous data structures]
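
A sketch of a list that combines a vector, a matrix and a data frame under one object name. The element names numbers, grid, table and note are arbitrary.

```r
v  <- c(1, 2, 3)
m  <- matrix(1:4, nrow = 2)
df <- data.frame(id = 1:2, label = c("a", "b"))

# A list can hold objects of different types and sizes.
mylist <- list(numbers = v, grid = m, table = df, note = "mixed contents")

length(mylist)            # 4
mylist$numbers            # access an element by name
mylist[["grid"]][2, 1]    # 2: index into the matrix stored in the list
```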


Factors
▪ Factors are special data types used to represent categorical data,
which is important in statistical analysis.

▪ A factor is stored as an integer vector in which each integer code has a
label. For example, if your data has a variable named “gender” with values
Male or Female, then factor() by default sorts the levels alphabetically and
assigns the code 1 to Female and 2 to Male.
▪ In R, categorical (ordinal and nominal) variables are called factors, and
they play a crucial role in determining how data will be analyzed and
presented visually.
Factors
• Factor objects can be created using the factor() function, which stores the
categorical values as a vector of integers and automatically assigns an
internal vector of character strings (the original values) mapped to these
integers.

For example, assume that you have the vector
performance <- c("Excellent", "Average", "Poor", "Average")
• The statement performance <- factor(performance) stores this vector as
(2, 1, 3, 1) and associates it with 1 = Average, 2 = Excellent, 3 = Poor,
because levels are sorted alphabetically by default.
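
The integer codes behind a factor can be inspected directly. The sketch below also shows one way to impose a meaningful order on the levels; the ordering Poor < Average < Excellent is an assumption about what the ratings mean.

```r
performance <- c("Excellent", "Average", "Poor", "Average")
performance <- factor(performance)

levels(performance)        # "Average" "Excellent" "Poor" (alphabetical)
as.integer(performance)    # 2 1 3 1

# For ordinal data, state the level order explicitly.
performance_ord <- factor(performance,
                          levels = c("Poor", "Average", "Excellent"),
                          ordered = TRUE)
as.integer(performance_ord)                 # 3 2 1 2
performance_ord[3] < performance_ord[1]     # TRUE: Poor < Excellent
```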
Importing Data into R
▪ Data is available from a variety of sources and in a variety of formats.

▪ As a data scientist, your task is to read data from different sources
and different formats, analyze it, and report the findings.


Importing Data into R
▪ A data source can be an Oracle database, a SAP management system, the Web,
or a combination of these.
▪ The data format can be a simple flat file in a comma-delimited format, Excel
format, or Extensible Markup Language (XML) format. R provides a wide range of
tools for importing data.


Reading Data from Text File
▪ One of the most popular input formats to R is Comma-Separated
Values (CSV).
▪ To read CSV files in R, you can use read.csv(), which imports data from
a CSV file and creates a data frame.

▪ Steps for importing a text file
1) Set a working directory using the setwd() command
2) Read the text (CSV) file using the read.csv() function
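
The two steps above can be tried without any external file. In this sketch a small CSV file (sales.csv, a made-up name with made-up contents) is first written to a temporary directory so that the example is self-contained. In practice you would point setwd() at the folder that holds your own file.

```r
# Create a small CSV file in a temporary directory (illustration only).
csv_dir <- tempdir()
writeLines(c("id,product,price", "1,Bread,2.5", "2,Milk,1.8", "3,Butter,3.2"),
           file.path(csv_dir, "sales.csv"))

# 1) Set the working directory (setwd() returns the old one, kept for later).
old_wd <- setwd(csv_dir)

# 2) Read the CSV file into a data frame.
sales <- read.csv("sales.csv")
str(sales)
head(sales)

# Restore the previous working directory.
setwd(old_wd)
```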


Unit 5 – Association Rule Learning (Apriori Algorithm)

Nana Boateng
Study Objectives
• After completing this unit students should be able to:
a) Identify item sets from a transactional database
b) Calculate support, confidence and lift ratios
c) Build association rules using the Apriori algorithm
d) Set parameters for the minimum support, confidence and lift thresholds
Unsupervised Machine Learning: Association Rule
▪ An important unsupervised machine-learning concept is association-
rule analysis, also called affinity analysis or market-basket analysis.

▪ This type of analysis is often used to find out “what item goes with
what item,” and is predominantly used in the study of customer
transaction databases.



Shopping Basket Analysis

Is there a relationship between beer and nappies?!
Association Rule
▪ The result of a market basket analysis is a set of association rules that specify patterns of
relationships among items. A typical rule might be expressed in the form:
{peanut butter, jelly} → {bread}

• This association rule states that if peanut butter and jelly are purchased, then bread is
also likely to be purchased.
• In other words, "peanut butter and jelly imply bread."

• Groups of one or more items are surrounded by curly brackets to indicate that they form a set,
or more specifically, an itemset that appears in the data with some regularity.

• Association rules are learned from subsets of itemsets: e.g., the preceding rule was
identified from the set {peanut butter, jelly, bread}.


Association Rule
▪ Association rules provide a simple analysis indicating that when an
event occurs, another event occurs with a certain probability.

▪ Discovering relationships among a huge number of transactional database
records can help in better marketing, inventory management, product
promotions, launching new products, and other business decision processes.


Association Rule
• Association rules indicate relationships by using simple if-then rule
structures computed from the data that are probabilistic in nature.

• The classic example is in retail marketing. If a retail department wants to
find out which items are frequently purchased together, it can use
association-rule analysis.


Association Rule
▪ This helps the store manage inventory, offer promotions, and
introduce new products.
▪ This market-basket analysis also helps retailers plan ahead for sales
and know which items to promote with a reduced price.
▪ For example, this type of analysis can indicate whether customers
who purchase a mobile phone also purchase a screen guard or phone
cover, or whether a customer buys milk and bread together.
▪ Then those stores can promote the phone cover or can offer a new
bakery bread at a promotional price for the purchase of milk. These
offers might encourage customers to buy a new product at a reduced
price.
Apriori Algorithm
• The Apriori algorithm is normally used in association rule mining.

• The algorithm begins by generating frequent-item sets with just one
item (a one-item set) and then generates a two-item set with two
items frequently purchased together, and then moves on to
three-item sets with three items frequently purchased together, and so on,
until all the frequent-item sets are generated.



Apriori Algorithm
▪ Once the list of all frequent-item sets is generated, you can find out how many of
those frequent-item sets are in the database.

▪ For example, how many two-item sets, how many three-item sets, and so forth.
In general, generating n-item sets uses the frequent (n – 1)-item sets and requires
one complete pass through the database.

▪ Therefore, the Apriori algorithm is faster than checking every possible combination
of items, even for a large database with many unique items. The key idea is to begin
generating frequent-item sets with just one item (a one-item set) and then recursively
generate two-item sets, then three-item sets, and so on, until we have generated
frequent-item sets of all sizes.



Apriori Algorithm
• Once we generate the rules, the goal is to find the rules that indicate
a strong association between the items, and indicate dependencies
between the antecedent (previous item) and the consequent (next
item) in the set.
• Three measures are used:
▪ support
▪ confidence
▪ lift ratios



Apriori Algorithm

Support

▪ The support is simply the number of transactions that include both
the antecedent and consequent item sets.
▪ It is expressed as a percentage of the total number of records in the
database.

Illustrative Example

▪ A → B (B follows A), where A and B are item sets. For example:
▪ {Milk, Jam} → {chocolate}
• Support (S) is the fraction of transactions that contain both A
and B (antecedent and consequent).



Apriori Algorithm
Support

▪ For example, support for the two-item set {bread, jam} in the data set
is 5 out of a total of 10 records, which is (5/10) = 0.5 or 50%
▪ You can define a minimum support threshold and ignore the item sets that fall
below it. If support is very low, the item set is not worth examining.



Apriori Algorithm
Confidence
• Confidence (A --> B) is a ratio of support for A & B ( i.e. antecedents
and consequents together), to the support for A.
• It is the conditional probability of B given A:

$$p(B \mid A) = \frac{p(A \cap B)}{p(A)}$$

• A high value of confidence suggests a strong association rule.


Apriori Algorithm

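• {A} → {B}
• {Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)

• Support:
$$p(A \cap B) = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{number of transactions}} = \frac{2}{5} = 0.4$$

• Confidence:
$$p(B \mid A) = \frac{p(A \cap B)}{p(A)} = \frac{2/5}{3/5} \approx 0.67$$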



Apriori Algorithm

• {Milk, Beer} → {Diaper} (support = 0.4, confidence = 1.0)
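
The transaction table behind these figures is not reproduced in the slide text. As a hedged illustration, the Python sketch below uses five assumed transactions (chosen only because they reproduce the quoted values, not taken from the slides) to show how support and confidence are computed. The function names `support` and `confidence` are illustrative.

```python
# Assumed illustrative transactions (NOT from the slides); they were chosen
# so that the support and confidence values quoted above are reproduced.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# {Milk, Diaper} -> {Beer}
print(support({"Milk", "Diaper", "Beer"}, transactions))                  # 0.4
print(round(confidence({"Milk", "Diaper"}, {"Beer"}, transactions), 2))   # 0.67
# {Milk, Beer} -> {Diaper}
print(confidence({"Milk", "Beer"}, {"Diaper"}, transactions))             # 1.0
```

Note that the two rules share the same support, because both are built from the same three-item set; only the antecedent, and therefore the confidence, differs.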



Apriori Algorithm

Given a set of transactions T, the goal of association rule mining is to
find all rules having:
• support ≥ minsup threshold
• confidence ≥ minconf threshold



Apriori Algorithm
▪ Strong rules have both high support and confidence.

▪ The Apriori Algorithm uses minimum levels of support and confidence with the Apriori
principle to quickly find strong rules by reducing the number of rules to a more
manageable level.

▪ Basic principle:
▪ the Apriori principle states that all subsets of a frequent itemset must also be frequent.
▪ In other words, if {A, B} is frequent, then {A} and {B} both must be frequent.
▪ Recall also that by definition, the support metric indicates how frequently an itemset appears
in the data.
▪ Therefore, if we know that {A} does not meet a desired support threshold, there is no reason
to consider {A, B} or any itemset containing {A}; it cannot possibly be frequent.



Apriori Algorithm
▪ The Apriori Algorithm uses this logic to exclude potential association rules prior to
actually evaluating them.
▪ The actual process of creating rules occurs in two phases:
1. Identifying all itemsets that meet a minimum support threshold.
2. Creating rules from these itemsets that meet a minimum confidence threshold.

▪ The first phase occurs in multiple iterations.
▪ Each successive iteration involves evaluating the support of a set of increasingly large
itemsets. E.g.
▪ Iteration 1 involves evaluating the set of 1-item itemsets (1-itemsets)
▪ Iteration 2 evaluates the 2-itemsets, and so on.
▪ The result of each iteration i is a set of all i-itemsets that meet the minimum support threshold.



Apriori Algorithm
▪ All the itemsets from iteration i are combined in order to generate candidate itemsets for
evaluation in iteration i + 1.
▪ The Apriori principle can eliminate some of these before the next round.
▪ If {A}, {B}, and {C} are frequent in iteration 1 while {D} is not frequent, then iteration 2 will consider
only {A, B}, {A, C}, and {B, C}.
▪ Thus, the algorithm needs to evaluate only three itemsets rather than six.

▪ Suppose in iteration 2 {A, B} and {B, C} are frequent, but not {A, C}.
▪ Although iteration 3 would normally begin by evaluating the support for {A, B, C}, this step need not
occur at all.
▪ Why? The Apriori principle states that {A, B, C} cannot be frequent if the subset {A, C} is not.
▪ Having generated no new itemsets, the algorithm may stop.

▪ At this point, the second phase of the Apriori algorithm may begin.
▪ Given the set of frequent itemsets, association rules are generated from all possible subsets.
▪ For instance, {A, B} would result in candidate rules for {A} -> {B} and {B} -> {A}.
▪ These are evaluated against a minimum confidence threshold, and any rules that do not meet the
desired confidence level are eliminated.
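
The two phases described above can be sketched in Python as follows. This is a minimal teaching sketch, not an optimised implementation; the function names `frequent_itemsets` and `rules` and the four-item example data are assumptions made for illustration, with the data built to mirror the {A}, {B}, {C}, {D} scenario.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Phase 1: level-wise search for itemsets meeting the support threshold."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Iteration 1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori principle: drop any candidate with an infrequent (k-1)-subset
        # before its support is ever counted.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """Phase 2: split each frequent itemset into antecedent -> consequent."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((set(antecedent), set(itemset - antecedent), sup, conf))
    return out


# Hypothetical data: {A}, {B}, {C} are frequent, {D} is not;
# {A, B} and {B, C} are frequent, {A, C} is not.
data = [{"A", "B"}, {"A", "B"}, {"B", "C"}, {"B", "C"}, {"D"}]
found = frequent_itemsets(data, min_support=0.4)
strong = rules(found, min_confidence=0.6)
```

With this data, {A, B, C} is pruned without a database pass because its subset {A, C} is not frequent, and only {A} → {B} and {C} → {B} survive the confidence threshold.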



Apriori Algorithm
Lift Ratio

Though support and confidence are good measures of the strength
of an association rule, they can sometimes be deceptive.
For example, if the antecedent or the consequent has a high support, a rule
can have a high confidence even though the two are independent.
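
The slides name the lift ratio without giving its formula. Under the standard definition (stated here as background, not taken from the slides), lift divides the confidence of a rule by the support of its consequent, using the same notation as the confidence formula:

$$\text{lift}(A \rightarrow B) = \frac{p(B \mid A)}{p(B)} = \frac{p(A \cap B)}{p(A)\,p(B)}$$

A lift close to 1 means A and B occur together about as often as independence would predict; a lift well above 1 points to a genuine association.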



Apriori Algorithm
Calculate the support, confidence & lift of the following
item sets based on the following association rules:
▪ {Shirt} → {Tie}
▪ {Socks} → {Shirt}
▪ {Trouser, tie} → {belt}

1) Transaction 1: shirt, trouser, tie, belt


2) Transaction 2: shirt, belt, tie, shoe
3) Transaction 3: socks, tie, shirt, jacket
4) Transaction 4: trouser, tie, belt, blazer
5) Transaction 5: trouser, tie, hat, sweater
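
One way to check a hand calculation for this exercise is the short Python sketch below. It assumes the standard definition of lift (confidence divided by the support of the consequent), which the slides do not spell out; the helper names `support` and `measures` are illustrative.

```python
# The five transactions listed in the exercise.
transactions = [
    {"shirt", "trouser", "tie", "belt"},
    {"shirt", "belt", "tie", "shoe"},
    {"socks", "tie", "shirt", "jacket"},
    {"trouser", "tie", "belt", "blazer"},
    {"trouser", "tie", "hat", "sweater"},
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


def measures(antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    s = support(set(antecedent) | set(consequent))
    c = s / support(antecedent)
    # Assumed standard definition: lift = confidence / support(consequent).
    lift = c / support(consequent)
    return round(s, 2), round(c, 2), round(lift, 2)


print(measures({"shirt"}, {"tie"}))            # (0.6, 1.0, 1.0)
print(measures({"socks"}, {"shirt"}))          # (0.2, 1.0, 1.67)
print(measures({"trouser", "tie"}, {"belt"}))  # (0.4, 0.67, 1.11)
```

The first rule shows why confidence alone can mislead: {Shirt} → {Tie} has a confidence of 1.0, but every transaction contains a tie, so the lift is only 1.0.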
Apriori Algorithm
The lift ratio compares the confidence of a rule with the rate at which the
consequent would be found in a randomly selected transaction.
Confidence, by contrast, gives the rate at which the consequent is found
among the transactions that contain the antecedent.
A low confidence indicates a low consequent rate, which helps in deciding
whether promoting the consequent is a worthwhile exercise. The more
records, the better the conclusion.
Finally, the more distinct the rules that are considered, the better the
interpretation and outcome.



Thank You!



Unit 6 - Clustering

Nana Boateng
Study Objectives
After completing this unit, students should be able to:
• Explain the uses of clustering
• Compute Euclidean and Jaccard distances
• Differentiate between Hierarchical and Non-Hierarchical clustering
methods
• Explain the working of the K-Means and the Hierarchical clustering
algorithms
• Explain how a dendrogram is generated

Data Science for Business (BDA 557) 2


Clustering
▪ Clustering is one of the most important unsupervised machine learning methods.

▪ Clustering Analysis deals with identifying hidden groups or finding a structure in a collection of
unlabelled data and can be used to uncover previously undetected relationships in a data set.

• General Goal: to determine intrinsic groupings in a set of unlabelled data


• Points in a cluster are “close” to one another
• Points in different clusters are “far” from one another

• Typically, we want to find:
• representatives for homogeneous groups (data reduction)
• “natural clusters” to be able to describe their (unknown) properties
• useful and suitable groupings (“useful” data classes)
• unusual data objects (outlier detection)



What is Clustering
• To understand what clustering is,
let’s consider this simple
example where the data is
represented as a matrix
containing entries of card suits.

• This matrix is composed of rows
containing our observations and
columns (features) that tell us
something that we measured
across these observations.



What is Clustering…
• In cluster analysis we are
interested in grouping our
observations such that all
members of a group are similar to
one another and at the same time
they are distinctly different from
all members outside of this group.



What is Clustering…
• Cluster analysis is a form of
Exploratory Data Analysis (EDA)
where observations are divided
into meaningful groups that
share common characteristics
(features) amongst each other.



Some Other Clustering Definitions
• Cluster:
• A summary of data in the form of points in some d-dimensional space
• Points in a cluster are considered “similar” or “near” to each other
• Distance:
• A means to determine the “similarity” of 2 or more points in some d-dimensional space
• Space:
• 2 types of space:
• Euclidean: in which distance can be measured in real numbers
• Non-euclidean: in which it may not be possible to measure distance in terms of real numbers
• Centroid
• The average (or centre) of a cluster – if in a Euclidean space

• Clustroid
• In a non-euclidean space, there may not be a concept representative of “average” so we use
a representative or typical element of a cluster



Some Uses of Clustering….
▪ Market segmentation:
o customers are segmented based on demographics and transaction history so
that a marketing strategy can be formulated.

▪ HR
o identification of employee skills, performance, and attrition. Clustering based on interests,
demographics, gender, and salary helps a business act on HR-related issues such as
relocating, improving performance, or hiring the properly skilled labour force for
forthcoming projects.



Some Uses of Clustering….
▪ Finance
o helps create risk-based portfolios based on various characteristics such as
returns, volatility, and P/E ratio. Selecting stocks from different clusters can
create a balanced portfolio based on risks.
o Similarly, clusters can be created based on revenues and growth, market
capitalisation, products and solutions, as well as global presence. These clusters can
help a business understand how to position itself in the market.



Flow of Cluster Analysis



Distance Between Two Observations
• Most clustering methods
measure similarity between
observations using a dissimilarity
metric, often referred to as the
distance: Distance = 1 - Similarity.
• These two concepts are just two
sides of the same coin. If two
observations have a large
distance then they are less
similar to one another. Likewise,
if their distance value is small,
then they are more similar.
Measures of Distance
• Euclidean: simplest (assumes a Euclidean space)
• Square root of the sum of squared differences between the two feature vectors
• Important: numerical data must be normalised to avoid disproportionate
weightings
• Nominal features are compared as binary matches (same or different)



Measures of Distance –
Euclidean Distance
• Here the blue player is
positioned in the center of the
field, which we will refer to as
(0,0). While the red player has a
position of (12,9) - or twelve feet
to the right of center and 9 feet
up.
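
As a hedged illustration of the calculation for these two players, here is a small Python sketch (the slides themselves use R). The function name `euclidean` is illustrative.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


blue = (0, 0)   # centre of the field
red = (12, 9)   # 12 feet right of centre, 9 feet up

print(euclidean(blue, red))  # 15.0, since sqrt(12**2 + 9**2) = sqrt(225)
```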



Measuring Distance in R
Distance between two players

• To do this in R, we use the dist()
function to calculate the
Euclidean distance between our
observations and specify the
method parameter as
"euclidean".



Measuring Distance in R
• Measuring distance between
three players.



Scaling
• In the previous example, we calculated the distance between two
players on a soccer field using two features, x and y.
• Both of these features are the coordinates of the players and both are
measured in the same manner.
• Because of this, they are comparable to one another and can be used
together to calculate the euclidean distance between the players.
• But, what happens when the features aren't measured in the same
manner?



Scaling Different Measurement of Features
• The Euclidean distance in each of the two
scenarios is 2. However, a height
difference of 2 feet is not the same as a
weight difference of 2 pounds. They are not
comparable.
• In such situations, the features
have to be scaled.



Scaling Different Measurement of Features
• Standardisation
• This entails updating each
measurement for a feature by
subtracting the average value of
that feature and then dividing by
its standard deviation.
• Doing this across our features
places them on a similar scale
where each feature has a mean
of zero and a standard deviation
of one
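
The standardisation step can be sketched in Python as below. The height and weight values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the slides do not say which variant is intended.

```python
from statistics import mean, stdev


def standardise(values):
    """Subtract the feature's mean, then divide by its standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumed variant)
    return [(v - m) / s for v in values]


# Hypothetical measurements on very different scales.
heights_ft = [5.0, 6.0, 7.0]
weights_lb = [140.0, 160.0, 180.0]

print(standardise(heights_ft))  # [-1.0, 0.0, 1.0]
print(standardise(weights_lb))  # [-1.0, 0.0, 1.0]
```

After scaling, both features are expressed in standard deviations from their own mean, so a difference of 1 means the same thing for height as for weight.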



Measures of Distance –
Jaccard Distance
• The Jaccard index measures the similarity between observations described by
categorical variables; the Jaccard distance is 1 minus this index.
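
As a hedged sketch in Python (the slides use R), the Jaccard distance between two observations can be computed by treating each one as the set of categories it has. The function name `jaccard_distance` and the example baskets are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets of categories."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical (a convention)
    return 1 - len(a & b) / len(union)


# Hypothetical observations: drinks bought by two customers.
customer_1 = {"wine", "beer"}
customer_2 = {"beer", "cider", "soda"}

print(jaccard_distance(customer_1, customer_2))  # 0.75 (1 shared out of 4 distinct)
```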



Jaccard Distance



Calculating Jaccard Distance in R



Clustering Methods
• In statistics, cluster analysis is performed on data to gain insights that
help you understand the characteristics and distribution of data.
• Conventional clustering is based on the similarity measures of
geometric distance.
• There are two general methods of clustering for a data set of n
records:
▪ Hierarchical Clustering Method
▪ Non-Hierarchical Clustering Method



Non-Hierarchical Clustering Method
With the non-hierarchical clustering method, the number of clusters is
specified at the start.

The method assigns records to each cluster. Since this method is simple
and computationally less expensive, it is the preferred method for very
large data sets. The K-Means algorithm is a non-hierarchical clustering
method.



K-Means Algorithm
• K-means is a clustering algorithm that is used to find homogeneous
subgroups within a population. The K-means algorithm works by first
fixing the number of subgroups, or clusters, in the data and then assigning
each observation to one of those subgroups.

• The algorithm intends to partition n objects into k clusters, with each
object assigned to the cluster with the nearest mean. The end result is to
produce k different clusters with clear distinctions. In other words, the
objective of k-means clustering is to minimize total intracluster variance.



K-Means Algorithm
The k-means algorithm for clustering is as follows:
1) Select k. It can be 1 or 2 or 3 or anything.
2) Select k points at random as cluster centroids.
3) Start assigning objects to their closest cluster based on Euclidean
measurement.
4) Calculate the centroid of all objects in each cluster.
5) Check the distance of the data point to the centroid of its own cluster. If
it is closest, then leave it as is. If not, move it to the next closest cluster.
6) Repeat the preceding steps until all the data points are covered and no
data point is moving from one cluster to another (the cluster is stable).
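
The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a production implementation: steps 3 and 5 are folded into a single reassignment pass, and the `seed` and `max_iter` parameters and the sample points are hypothetical additions for reproducibility.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster label per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random points as centroids
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 5: put every point in the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # step 6: no point moved, so stop
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centres = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initial centroids, so only the grouping itself, not the label numbers, is meaningful.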



Iterations of the K-Means Algorithm…



Iterations of the K-Means Algorithm &
Determination of Centroids



Limitations of K-Means
▪ K-means is a simple and relatively easy and efficient method.
However, you need to specify k at the beginning.

▪ Different values of k can produce different results and cluster formations.
A practical approach is to compare the outcomes of multiple runs with different
k values and choose the best one based on a predefined criterion.



Hierarchical Clustering
• Hierarchical clustering is used when the number of clusters is not
known ahead of time. This is different from k-means clustering where
you first have to specify the number of clusters and then execute the
algorithm.
• There are two approaches to hierarchical clustering:
❑Agglomerative Algorithm (bottom-up approach)
❑Divisive Algorithm (top-down approach)



Hierarchical Clustering Method
• The agglomerative algorithm begins with n clusters and starts
merging sequentially with similar clusters until a single cluster is
formed.

• The divisive algorithm is the opposite. The algorithm first starts with
one single cluster and then divides into multiple clusters based on
dissimilarities.



Hierarchical Clustering Process
1. Assign every point to a unique cluster

2. Compute the distance between all clusters
• Note that the distance between points is the same as the distance between clusters at this point

3. Find the closest clusters and merge them into 1 cluster

4. Recompute the distance of the new cluster with all other clusters

5. Repeat 3 and 4 until either:
• There is only 1 cluster
• Some other stopping criterion is met
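
The agglomerative process above can be sketched in Python as follows. The slides do not say how the distance between two multi-point clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of points); the `target_clusters` stopping criterion and the sample points are also hypothetical.

```python
import math


def agglomerative(points, target_clusters=1):
    """Merge the closest clusters until `target_clusters` remain.

    Returns the final clusters and the distance at which each merge happened.
    """
    clusters = [[p] for p in points]  # step 1: every point is its own cluster
    merge_heights = []

    def linkage(a, b):
        # Assumed single linkage: closest pair of points across two clusters.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_clusters:  # step 5: stopping criterion
        # Steps 2 and 3: find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merge_heights.append(linkage(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Step 4 happens implicitly: distances to the merged cluster are
        # recomputed by `linkage` on the next pass.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_heights


# Hypothetical data: two tight pairs that are far apart.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, heights = agglomerative(points, target_clusters=2)
print(clusters)  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
print(heights)   # [1.0, 1.0]
```

The recorded merge distances are the values a dendrogram draws as the heights at which branches join.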



Dendrograms
• A dendrogram demonstrates
how clusters are merged in a
hierarchy. A dendrogram is a
tree-like structure that
summarizes the process of
clustering and shows the
hierarchy pictorially. Similar
records are joined by lines
whose vertical length reflects the
distance measure between two
records.

Illustrative Example of a Dendrogram
Hierarchical Clustering (in Euclidean Space)



Thank You!
