BDA 557 Data Science For Business Slides
Nana Osei Boateng
Study Objectives
• After completing this unit, students should be able to:
▪ Explain the concept of Big Data
▪ Explain the Big Data Parameters
▪ Explain the terminologies such as Data Analytics, Big Data Analytics, Data
Science, Business Analytics
▪ Gain an understanding of Big Data Analytics tools
▪ Distinguish between Supervised and Unsupervised Machine Learning
▪ Explain the benefits of Big Data Analytics
▪ Explain the four types of analytics:
o Descriptive Analytics
o Diagnostic Analytics
o Predictive Analytics
o Prescriptive Analytics
Raguseo (2018)
Understanding Data
▪ Getting quality data is the most important factor determining the accuracy of the results. Data can come from either a primary source or a secondary source.
Examples of Primary Sources of Data (within the organisation):
• Databases (e.g. Operational, HR, Manufacturing, IT, etc.)
• Data Warehouse
Examples of Secondary Sources:
• Official reports from government, newspaper articles, census data, etc.
▪ Structured data is arranged and recorded neatly, so it can be easily found and processed. As long as the data fits within the structure of an RDBMS, we can easily search for specific information and single out the relationships between its pieces. However, such data can generally only be used for its intended purpose.
▪ Data understanding
▪ The data understanding phase starts with an initial data collection and proceeds with activities to become familiar with the data, identify data quality problems, discover first insights into the data, and detect interesting subsets from which to form hypotheses about hidden information.
▪ Data preparation
▪ The data preparation phase covers all activities needed to construct the final dataset from the initial raw data (Chapman et al., 2000).
▪ Evaluation
▪ At this stage the model (or models) obtained is more thoroughly evaluated, and the steps executed to construct it are reviewed to be certain it properly achieves the business objectives.
▪ Deployment
▪ Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use.
❑Modelling
▪ Feature engineering
▪ Model training
▪ Model evaluation (a brief R sketch of these steps follows below)
❑Customer Acceptance
▪ System validation
▪ Project hand-off
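▪ As a rough illustration of the Modelling steps above, here is a minimal R sketch assuming the built-in mtcars data set and a simple linear model; the derived feature, the 70/30 split, and the RMSE metric are illustrative choices, not part of the slides.
# Feature engineering: derive a new column (weight-to-horsepower ratio)
data(mtcars)
mtcars$wt_per_hp <- mtcars$wt / mtcars$hp
# Split into training and test sets for honest evaluation
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
# Model training: predict fuel efficiency (mpg) from selected features
fit <- lm(mpg ~ wt + hp + wt_per_hp, data = train)
# Model evaluation: root mean squared error on the held-out test set
pred <- predict(fit, newdata = test)
sqrt(mean((test$mpg - pred)^2))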
• A data frame is more general than a matrix, in that different columns can have different basic data types. The data frame is the most common data type we are going to use in this class.
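• For example, a small data frame mixing numeric, character, and logical columns (the column names and values below are made up for illustration):
# Each column keeps its own basic data type
employees <- data.frame(
  name   = c("Ama", "Kofi", "Esi"),   # character column
  age    = c(34, 29, 41),             # numeric column
  senior = c(TRUE, FALSE, TRUE)       # logical column
)
str(employees)   # shows the type of every column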
Lists
▪ Lists are the most complex of the R data types. Lists allow you to specify and store any data type object.
▪ A list can contain a combination of vectors, matrices, or even data frames under one single object name.
▪ You create a list by using the list() function; the syntax for creating a list based on previous data structures is:
mylist <- list(object1, object2, ...)
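▪ A concrete example, assuming we first create a vector, a matrix, and a data frame and then combine them under one object name:
v  <- c(1, 2, 3)                                  # a numeric vector
m  <- matrix(1:6, nrow = 2)                       # a 2 x 3 matrix
df <- data.frame(x = 1:3, y = c("a", "b", "c"))   # a small data frame
mylist <- list(scores = v, grid = m, table = df)  # one list holding all three
mylist$grid      # access an element by name
mylist[[1]]      # or by position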
Study Objectives
• After completing this unit, students should be able to:
a) Identify item sets from a transactional database
b) Calculate support, confidence and lift ratios
c) Build association rules using the Apriori algorithm
d) Set parameters for the minimum support, confidence and lift thresholds
Unsupervised Machine Learning: Association Rules
▪ An important unsupervised machine-learning concept is association-
rule analysis, also called affinity analysis or market-basket analysis.
▪ This type of analysis is often used to find out “what item goes with
what item,” and is predominantly used in the study of customer
transaction databases.
• This association rule states that if peanut butter and jelly are purchased, then bread is also likely to be purchased: {peanut butter, jelly} → {bread}.
• In other words, "peanut butter and jelly imply bread."
• Groups of one or more items are surrounded by brackets to indicate that they form a set, or more specifically, an itemset that appears in the data with some regularity.
▪ Frequent item sets are counted size by size: how many two-item sets, how many three-item sets, and so forth. In general, generating the frequent n-item sets uses the frequent (n − 1)-item sets and requires one complete pass through the database.
▪ Therefore, the Apriori algorithm is faster than exhaustively enumerating every possible itemset, even for a large database with many unique items. The key idea is to begin by generating frequent item sets with just one item (one-item sets) and then recursively generate two-item sets, then three-item sets, and so on, until we have generated frequent item sets of all sizes.
▪ An association rule has the form A → B (B follows A), where A and B are item sets. For example: {Milk, Jam} → {Chocolate}.
▪ The support is simply the number of transactions that include both the antecedent and consequent item sets. It is expressed as a percentage of the total number of records in the database.
• Support (S) is the fraction of transactions that contain both A and B (antecedent and consequent).
▪ For example, support for the two-item set {bread, jam} in the data set
is 5 out of a total of 10 records, which is (5/10) = 0.5 or 50%
▪ You can define a minimum support threshold and exclude item sets that fall below it from your analysis: if an item set's support is very low, it is not worth examining.
▪ Confidence: p(B | A) = p(A ∩ B) / p(A)
• Example rule: {A} → {B}
• {Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
• Support: p(A ∩ B) = (number of transactions containing both A and B) / (total number of transactions) = 2/5 = 0.4
• Confidence: p(B | A) = p(A ∩ B) / p(A) = (2/5) / (3/5) = 0.67
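• A minimal R sketch of these two calculations. The slides do not show the underlying transaction table, so the five baskets below are only an assumed example that is consistent with s = 0.4 and c = 0.67:
# Five illustrative market baskets (assumed data)
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Cola"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Cola")
)
n <- length(transactions)
contains <- function(basket, items) all(items %in% basket)
# Support: fraction of baskets containing Milk, Diaper AND Beer
support_AB <- sum(sapply(transactions, contains, items = c("Milk", "Diaper", "Beer"))) / n  # 2/5 = 0.4
# Confidence of {Milk, Diaper} -> {Beer}: p(A and B) / p(A)
support_A  <- sum(sapply(transactions, contains, items = c("Milk", "Diaper"))) / n          # 3/5 = 0.6
support_AB / support_A                                                                      # = 0.67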
▪ The Apriori Algorithm uses minimum levels of support and confidence with the Apriori
principle to quickly find strong rules by reducing the number of rules to a more
manageable level.
▪ Basic principle:
▪ the Apriori principle states that all subsets of a frequent itemset must also be frequent.
▪ In other words, if {A, B} is frequent, then {A} and {B} both must be frequent.
▪ Recall also that by definition, the support metric indicates how frequently an itemset appears
in the data.
▪ Therefore, if we know that {A} does not meet a desired support threshold, there is no reason
to consider {A, B} or any itemset containing {A}; it cannot possibly be frequent.
▪ Suppose in iteration 2 {A, B} and {B, C} are frequent, but not {A, C}.
▪ Although iteration 3 would normally begin by evaluating the support for {A, B, C}, this step need not
occur at all.
▪ Why? The Apriori principle states that {A, B, C} cannot be frequent if the subset {A, C} is not.
▪ Having generated no new itemsets, the algorithm may stop.
▪ At this point, the second phase of the Apriori algorithm may begin.
▪ Given the set of frequent itemsets, association rules are generated from all possible subsets.
▪ For instance, {A, B} would result in candidate rules for {A} → {B} and {B} → {A}.
▪ These are evaluated against a minimum confidence threshold, and any rules that do not meet the
desired confidence level are eliminated.
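▪ In R, this two-phase procedure is implemented by the apriori() function in the arules package. A minimal sketch, assuming arules is installed and using its bundled Groceries transaction data; the support and confidence thresholds below are arbitrary illustrations:
library(arules)
data("Groceries")   # example transaction data shipped with arules
# Phase 1 and 2: find frequent item sets and keep only rules that meet
# the minimum support and confidence thresholds
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
# Inspect the strongest rules, ranked by lift
inspect(head(sort(rules, by = "lift"), 5))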
Study Objectives
After completing this unit, students should be able to:
• Explain the uses of clustering
• Compute Euclidean and Jaccard distances
• Differentiate between Hierarchical and Non-Hierarchical clustering methods
• Explain the working of the K-Means and the Hierarchical clustering algorithms
• Explain how a dendrogram is generated
▪ Clustering Analysis deals with identifying hidden groups or finding a structure in a collection of
unlabelled data and can be used to uncover previously undetected relationships in a data set.
• Clustroid
• In a non-Euclidean space there may be no meaningful notion of an "average" point, so we instead use a representative or typical element of the cluster, called the clustroid.
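• The Euclidean and Jaccard distances mentioned in the study objectives can be computed with base R's dist() function; a small sketch with made-up vectors (in dist(), method = "binary" gives the Jaccard distance for presence/absence data):
# Euclidean distance between two numeric records
x <- c(2, 4, 6)
y <- c(1, 7, 3)
dist(rbind(x, y), method = "euclidean")   # sqrt((2-1)^2 + (4-7)^2 + (6-3)^2)
# Jaccard distance between two binary (presence/absence) records
a <- c(1, 1, 0, 1, 0)
b <- c(1, 0, 0, 1, 1)
dist(rbind(a, b), method = "binary")      # 1 - (matches in both) / (items in either) = 0.5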
▪ HR
o Identification of employee skills, performance, and attrition; clustering employees based on interests, demographics, gender, and salary can help a business act on HR-related issues such as relocating staff, improving performance, or hiring a properly skilled labour force for forthcoming projects.
The method assigns each record to the cluster whose centre is nearest. Since this method is simple and computationally less expensive than hierarchical methods, it is the preferred method for very large data sets. The K-Means algorithm is a non-hierarchical clustering method.
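A minimal K-Means sketch in base R, assuming the built-in iris data and k = 3 clusters (the variables and the value of k are illustrative choices):
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)  # cluster on the four numeric columns
km$centers                        # the three cluster centroids
table(km$cluster, iris$Species)   # compare the clusters with the known species labels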
• The divisive algorithm works in the opposite direction. The algorithm starts with one single cluster containing all records and then divides it into multiple clusters based on dissimilarities.
4. Recompute the distances between the new cluster and all other clusters.
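An agglomerative (bottom-up) example in base R: hclust() repeats the merge-and-recompute steps until a single cluster remains, and plot() draws the resulting dendrogram. The data set and the complete-linkage method are illustrative assumptions:
d  <- dist(iris[, 1:4], method = "euclidean")  # pairwise distances between records
hc <- hclust(d, method = "complete")           # agglomerative clustering, complete linkage
plot(hc, labels = FALSE, main = "Dendrogram")  # dendrogram of the merge history
cutree(hc, k = 3)                              # cut the tree into 3 clusters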