Business Intelligence and Analytics Notes

The document provides an overview of Business Intelligence and Analytics, emphasizing the importance of data-driven decision-making and the various types of analytics, including descriptive, predictive, prescriptive, and diagnostic analytics. It outlines the components of business analytics, the data mining process, and the significance of understanding data for effective analysis. The course structure and evaluation scheme for a Business Analytics program are also detailed, along with the role of data mining in extracting valuable insights from data.


Business Intelligence & Analytics

Dr. Ramesh Kandela


[email protected]
Introduction to Business Analytics
Dr. Ramesh Kandela
Analytics
• Analytics is the process of using computational methods to discover and report influential
patterns in data.
• It is the scientific process of transforming data into insight for making better decisions
(INFORMS).
Data → Analytics → Insights → Decisions

Business Analytics
• Business Analytics is “the use of math and statistics to derive meaning from data in order
to make better business decisions.”
• Business Analytics is the extensive use of data, statistical and quantitative analysis,
explanatory and predictive models, and fact-based management to drive decisions and
actions (Davenport and Harris).
• Business analytics is a set of statistical and operations research techniques, artificial
intelligence, information technology, and management strategies used for framing a
business problem, collecting data, and analyzing the data to create value for organizations.
Data
• Data is a collection of facts, such as numbers, words, measurements, and observations.
• A data set is usually a rectangular array of data, with variables in columns and observations in
rows. A variable (or field or attribute) is a characteristic of members of a population, such as
height, gender, or salary. An observation (or case or record) is a list of all variable values for a
single member of a population.
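For illustration, such a rectangular data set can be written in R (the software used later in this course); the values below are made up.

people <- data.frame(
  height = c(170, 165, 180),          # numerical variable
  gender = c("F", "M", "M"),          # categorical variable
  salary = c(52000, 61000, 58000)     # numerical variable
)
str(people)    # 3 observations (rows) of 3 variables (columns)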
Types of Data
• Quantitative (numerical) data: discrete or continuous
• Qualitative (categorical) data: nominal or ordinal
Cross-sectional data are data on a cross section of a population at a distinct point in time.
Time series data are data collected over time.
Why Study Business Analytics?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube , …

• Taking a data-driven approach to business can come with tremendous upside, but many
companies report that skilled employees in analytics roles are in short supply.

• Analytics helps you better understand customers, as well as the performance of HR,
marketing, finance, production, and inventory. Knowing where to improve is the
key to success.
Data-driven Decision Making
• Today’s largest and most successful organizations use data to their advantage when making
high-impact business decisions.
• A typical data-driven decision-making (Business Analytics) process uses the following steps :
1. Identify the problem or opportunity for value creation.
2. Identify sources of data (primary as well as secondary data sources).
3. Pre-process the data for issues such as missing and incorrect data. Generate derived
variables and transform the data if necessary. Prepare the data for analytics model building.
4. Divide the data into training and validation subsets.
5. Build analytical models and identify the best model(s) using model performance in
validation data.
6. Implement Solution/Decision/Develop Product.
Population Versus Sample
• Population: A population consists of all elements—individuals, items, or objects—whose
characteristics are being studied
• Sample: A portion of the population selected for study is referred to as a sample.
Components of Business Analytics
Business Analytics can be broken into 3 components:
1. Business Context
Business analytics projects start with the business context and ability of the organization to
ask the right questions.
2. Technology
Information Technology (IT) is used for data capture, data storage, data preparation, data
analysis, and data sharing. To analyse data, one may need to use software such as Excel, R,
Python, Tableau, SQL, SAS, SPSS, etc.
3. Data Science
Davenport and Patil (2012) claim that ‘data scientist’ will be the sexiest job of the 21st century.
Business Analytics vs. Data Science
The main goal of business analytics is to extract meaningful insights from data to guide
organizational decisions, while data science is focused on turning raw data into meaningful
conclusions using algorithms and statistical models.
Types of Business Analytics
• Business analytics can be grouped into four types:
Descriptive Analytics
• What happened in the past?
• Many organisations use descriptive analytics as part of business intelligence.
Predictive Analytics
• What will happen in the future?
• Many organisations use predictive analytics.
Prescriptive Analytics
• What is the best action?
• A small proportion of organisations use prescriptive analytics.
Diagnostic Analytics
• Why did it happen?
• This focuses on past performance to ascertain why something has happened.
Descriptive Analytics
Descriptive analytics is the simplest form of analytics that mainly uses simple descriptive
statistics, data visualization techniques, and business-related queries to understand past data.
One of the primary objectives of descriptive analytics is to find innovative ways of summarizing data.
Descriptive statistics consists of methods for organizing, displaying, and describing data by
using tables, graphs, and summary measures.
Types of descriptive statistics:
• Organize the Data
• Tables
• Frequency Distributions
• Relative Frequency Distributions
• Displaying the Data
• Graphs
• Bar Chart or Histogram
• Summarize the Data
• Central Tendency
• Variation
Predictive Analytics
• Predictive analytics is a branch of advanced analytics that makes predictions about future
outcomes using historical data combined with statistical modeling, data mining techniques
and machine learning. Companies employ predictive analytics to find patterns in this data to
identify risks and opportunities. Predictive analytics is often associated with big data
and data science.
Types of Predictive Models
• K-Nearest Neighbors
• Regression
• Naïve Bayes Classifier
• Logistic Regression
• Classification and Regression Trees
• Clustering models
• Ensemble Methods
• Time series models
• Neural Networks
• Association Rules and Collaborative Filtering
Prescriptive Analytics
• Prescriptive analytics is the highest level of analytics capability which is used for choosing optimal
actions once an organization gains insights through descriptive and predictive analytics.
Examples of Descriptive Analytics
• Summarizing past events, exchange of data, and social media usage
• Reporting general trends
Examples of Diagnostic Analytics
• Identifying technical issues
• Explaining customer behavior
• Improving organization culture
Examples of Predictive Analytics
• Predicting customer preferences
• Recommending products
• Predicting staff and resources
Examples of Prescriptive Analytics
• Tracking fluctuating manufacturing prices
• Suggesting the best course of action
Business Intelligence
• Business intelligence (BI) is software that ingests business data and presents it in user-
friendly views such as reports, dashboards, charts and graphs.
• Business intelligence combines business analytics, data mining, data visualization, data tools
and infrastructure, and best practices to help organizations make more data-driven
decisions.
BI Tools

Business Intelligence and Business Analytics
• Business Intelligence: uses data collected over a period of time from different sources to create dashboards, reports, and documentation. Business Analytics: focuses on the implementation of data insights into actionable steps.
• Business intelligence focuses on descriptive and diagnostic analytics; business analytics focuses on predictive and prescriptive analytics.
• Both BI and BA focus on presenting and organizing data for visualization.
Course Outline
Session No.: Topic(s)
1: Introduction to Business Analytics
2-3: Introduction to Data Mining Process
4-6: Introduction to ‘R’
7-9: k-Nearest Neighbors (k-NN) Method for Classification/Prediction and Evaluating Predictive Performance
10-12: Naïve Bayes Classifier
13: Test-1 (15% weightage)
14-17: Classification and Regression Trees
18-20: Ensemble Methods
21-23: Neural Networks
24: Test-2 (15% weightage)
25-28: Association Rules and Collaborative Filtering
29-32: Information Dashboards
33: Project Presentations/Evaluation (20% weightage)
Recommended Text Book:
• Data Mining for Business Analytics: Concepts, Techniques and Applications in R by Shmueli, Bruce, Yahav, Patel,
Lichtendahl Jr., Wiley Publication, 2021, An Indian Adaptation.
Evaluation Scheme
Component | Component Number | Expected Slot / Due Date | Weightage
Class Participation | 1 | Session 33 | 10%
Test-1 | 2 | Session 13 | 15%
Test-2 | 3 | Session 24 | 15%
Project Report Submission/Presentation | 4 | Session 33 | 20%
End-Term Exam | | At the end of the semester | 40%
Total | | | 100%
Happy Analyzing
Data Mining

Dr. Ramesh Kandela


[email protected]
Data Mining
• Data mining is the process that helps in extracting information from a given data set to
identify trends, patterns, and useful data.
• Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in the
process of knowledge discovery.
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
• Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society
• Major sources of abundant data:
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, YouTube, …
• We are drowning in data, but starving for knowledge!
Origins of Data Mining
• Data mining draws on ideas from machine learning/statistics, AI/pattern recognition, and database systems.
Data Mining in Business Intelligence
(Layers with increasing potential to support business decisions, from bottom to top; typical users in parentheses.)
• Decision Making (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical summary, querying, and reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: paper, files, Web documents, scientific experiments, database systems
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities; data mining plays an essential role in the knowledge discovery process.
• The flow is: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation.
KDD Process: A Typical View from ML and Statistics
• Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
• This is a view from typical machine learning and statistics communities.
• The process begins with determining the KDD objectives and ends with the implementation of the discovered knowledge.
Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Predictive tasks.
• The objective of these tasks is to predict the value of a particular attribute based on the
values of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.
Descriptive tasks.
• Here, the objective is to derive patterns (correlations, trends, clusters, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.
• In summary, predictive data mining tasks include regression and classification, while descriptive tasks include association rules, clustering, and anomaly detection.
Predictive Modelling
• Predictive modelling refers to the task of building a model for the target variable as a function of
the explanatory variables. There are two types of predictive modelling tasks: classification and
regression.
• The main difference between regression and classification algorithms is that regression algorithms
are used to predict continuous values such as price, salary, or age. Predicting a target numeric value,
such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors, is an
example of regression.
• Classification algorithms are used to predict/Classify the discrete values such as Positive or
Negative, Spam or Not Spam, etc. The spam filter is a good example of this: it is trained with many
example emails along with their class (spam or ham), and it must learn how to classify new emails.
• For example, predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued.
• On the other hand, forecasting the future price of a stock is a regression task because price is a
continuous-valued attribute.
• The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
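As a brief, illustrative R sketch (using R's built-in mtcars data, chosen here only for convenience), a regression model predicts a continuous target while a classification model predicts a class label:

# Regression: predict a continuous target (fuel efficiency) from car features
reg_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(reg_model)

# Classification: predict a binary target (transmission: 0 = automatic, 1 = manual)
clf_model  <- glm(am ~ wt + hp, data = mtcars, family = binomial)
pred_class <- ifelse(predict(clf_model, type = "response") > 0.5, 1, 0)
table(predicted = pred_class, actual = mtcars$am)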
Descriptive Patterns
Association Rule Mining
• Association Rule Mining is used when you want to find an association between different
objects in a set, find frequent patterns in a transaction database, relational databases or any
other information repository. The applications of Association Rule Mining are found in
Marketing, Basket Data Analysis (or Market Basket Analysis) in retailing, clustering and
classification. It can tell you what items do customers frequently buy together by generating
a set of rules called Association Rules.
• For example, the rule "If a customer buys bread, they are also likely to buy milk" is an association rule generated from such data.
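A minimal association-rule sketch in R, assuming the add-on package arules is installed; the Groceries transaction data shipped with that package stands in for a retail basket database:

library(arules)
data("Groceries")                             # example market-basket transactions
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(sort(rules, by = "lift")[1:3])        # rules of the form {A} => {B}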
Cluster Analysis
• Cluster analysis in data mining means finding groups of objects that are similar to each other
within a group but different from the objects in other groups.
• Example (Document Clustering). The collection of news articles can be grouped based on
their respective topics. Each article is represented as a set of word-frequency pairs (w, c),
where w is a word and c is the number of times the word appears in the article.
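A minimal k-means sketch in R, using the built-in iris measurements purely as stand-in numeric data:

num_data <- iris[, 1:4]                       # numeric columns only (no labels)
set.seed(123)                                 # reproducible cluster assignment
km <- kmeans(scale(num_data), centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)   # compare clusters to known groups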
Anomaly Detection
• Anomaly detection is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are known as anomalies
or outliers.
• The goal of an anomaly detection algorithm is to discover the real anomalies and avoid
falsely labeling normal objects as anomalous.
• Example (Credit Card Fraud Detection). A credit card company records the transactions
made by every credit card holder, along with personal information such as credit limit, age,
annual income, and address.
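As a simple illustration (not a full fraud-detection system), potential outliers can be flagged in R with the 1.5 × IQR rule that also underlies box plots:

amounts <- c(52, 48, 55, 60, 51, 47, 53, 950)   # one suspicious transaction amount
boxplot.stats(amounts)$out                      # values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR (here 950)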
Steps in Data Mining Process (CRISP-DM)
• A popular data mining process framework is CRISP-DM (Cross-Industry Standard Process for Data Mining). This framework was developed by a consortium of companies involved in data mining.
• The typical steps are:
1. Develop an understanding of the purpose of the data mining project
2. Data understanding (collect and explore)
3. Data preprocessing
4. Determine the data mining task
5. Partition the data (for supervised tasks)
6. Select a data mining model and train it
7. Make predictions
8. Model evaluation (model testing)
9. Model deployment
Develop an understanding of the purpose of the data mining project
• This can be the hardest part of the data mining process, and many organizations spend too
little time on this important step. Data scientists and business stakeholders need to work
together to define the business problem, which helps inform the data questions and
parameters for a given project.
• Analysts may also need to do additional research to understand the business context
appropriately.
• Define the problem and objectives for the data mining project.

Data Understanding
• The data understanding phase of CRISP-DM involves taking a closer look at the data
available for mining. This step is critical in avoiding unexpected problems during the next
phase--data preparation--which is typically the longest part of a project.
• Data understanding involves accessing the data and exploring it using descriptive statistics.
Data Understanding
Data understanding: Collect and explore the data to gain an understanding of its properties
and characteristics.
Perform Descriptive Analytics or Data Exploration
It is always a good practice to perform descriptive analytics before moving to building a
Machine Learning model. Descriptive statistics will help us to understand the variability in the
data.
Single Variable Summaries
• The simplest way to gain insight into variables is to assess them one at a time through the
calculation of summary statistics.
• Data Visualization in One Dimension
Multiple Variable Summaries
• Cross tabulation
• Data Visualization
• Correlation
Descriptive Statistics
Descriptive statistics consists of methods for organizing, displaying, and describing data by
using tables, graphs, and summary measures.
Types of descriptive statistics:
• Organize the Data
• Tables
• Frequency Distributions
• Relative Frequency Distributions
• Displaying the Data
• Graphs
• Bar Chart or Histogram
• Summarize the Data
• Central Tendency
• Variation
Frequency Distributions
• A frequency distribution organizes raw data in table form, using classes (groups) and frequencies.
• Frequency distribution for qualitative data: lists all categories and the number of elements that belong to each category.

Course   No. of Students
BBA      10
BSc      6
BTech    7
BCOM     8
BPharm   4
Sum      35

• Frequency distribution for quantitative data (grouped frequency distributions): the data must be grouped into classes that are more than one unit in width. In a grouped table, the X column lists groups of observations, called class intervals, rather than individual values.

X       Frequency
52-55   3
56-59   3
60-63   9
64-67   9
68-71   8
72-75   3
Total   35
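Both kinds of frequency table are one-liners in R; the vectors below are made-up stand-ins for the data above:

# Qualitative data: frequency of each category
course <- c(rep("BBA", 10), rep("BSc", 6), rep("BTech", 7),
            rep("BCOM", 8), rep("BPharm", 4))
table(course)

# Quantitative data: grouped frequency distribution with class intervals
set.seed(1)
marks <- round(runif(35, min = 52, max = 75))
table(cut(marks, breaks = seq(52, 76, by = 4), right = FALSE))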
Data Visualization
• Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
• Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
• In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Common charts:
▪ Bar Graph
▪ Column Chart
▪ Pie Chart
▪ Line Chart
▪ Histogram
▪ Box Plot
▪ Scatter Plot
▪ Heat Map
▪ Pair Plot
Descriptive Measures
Measures of Central Tendency
• Mean
• Median
• Mode
Measures of Variability
• Range
• Standard Deviation
• Variance
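All of these measures are one-liners in R; a quick sketch on a made-up vector of marks (base R has no built-in mode for this, so a small helper is defined):

x <- c(7.5, 8.4, 8.4, 9.2, 7.7, 8.4, 9.6)

# Central tendency
mean(x)
median(x)
stat_mode <- function(v) names(which.max(table(v)))   # most frequent value
stat_mode(x)                                          # "8.4"

# Variability
range(x)          # minimum and maximum
diff(range(x))    # range as a single number
var(x)
sd(x)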
Getting the dataset
Study Time   Attendance   Sex      Marks
12.50        85           Female   9.3
12.00        72           Male     8.6
11.50        95           Female   9.6
11.00        86           Male     8.1
10.50        78           Female   9.2
10.00        68           Male     7.5
9.50         74           Female   8.4
9.00         77           Male     7.7
8.50         (missing)    Female   7.3
8.00         (missing)    Male     -7.4
Data Preprocessing
• Data preprocessing is the process of preparing raw data and making it suitable for a data
mining task.
• Real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques
can improve data quality, thereby helping to improve the accuracy and efficiency of the
subsequent mining process.
• Data preprocessing is an important step in the machine learning process, because quality decisions
must be based on quality data.
Why is Data preprocessing important?
• Preprocessing of data is mainly to check the data quality. The quality can be checked by the
following
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether the data is available or not recorded.
• Consistency: To check whether the same data is kept in all the places that do or do not match.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustable.
• Interpretability: The understandability of the data.
Major Tasks in Data Preprocessing
• Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Reduction
• Dimensionality reduction
• Data Transformation
• Normalization
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Attendance=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Marks=“−7.4” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
Identifying and Handling the Missing Values
• Data Cleaning: Most Machine Learning algorithms cannot work with missing features, so let’s
create a few functions to take care of them.
Ways to handle missing data:
• By deleting the particular row or column: this is a common way to deal with null values. We
simply delete the specific row or column that contains null values. However, this approach is
not very efficient, and removing data may lead to a loss of information and less accurate output.
• Use a measure of central tendency (e.g., the mean or median) to fill in the missing value:
for normal (symmetric) data distributions the mean can be used, while skewed data
distributions should employ the median.
• Dropping variables: it is always better to keep data than to discard it. Sometimes you can drop
a variable if its data are missing for more than 60% of observations, but only if that variable is
insignificant. This method is not very effective; imputation is usually preferred over dropping variables.
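A hedged R sketch of both options, retyping the small Study Time/Attendance table shown earlier:

df <- data.frame(
  StudyTime  = c(12.5, 12, 11.5, 11, 10.5, 10, 9.5, 9, 8.5, 8),
  Attendance = c(85, 72, 95, 86, 78, 68, 74, 77, NA, NA)
)

# Option 1: delete rows with missing values (information is lost)
df_complete <- na.omit(df)

# Option 2: impute missing values with a measure of central tendency
df_imputed <- df
df_imputed$Attendance[is.na(df_imputed$Attendance)] <-
  median(df$Attendance, na.rm = TRUE)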
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention

Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates of one another
• Major issue when merging data from heterogenous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British
units
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values
s.t. each old value can be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• One of the most important transformations you need to apply to your data is feature scaling. With
few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales.
There are two common ways to get all attributes to have the same scale:
• Min-max normalization
• Z-score normalization
• Discretization: Concept hierarchy climbing
• Encoding Categorical data
Min-max normalization
• Min-max scaling (normalization) is quite simple: values are shifted and rescaled so that they
end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max
minus the min.
• Let A be a numeric variable with n observed values, v1, v2, …, vn. A value v of A is normalized to
  v' = (v − min_A) / (max_A − min_A)
• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) = 0.716
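The same calculation as a short R sketch:

min_max <- function(v) (v - min(v)) / (max(v) - min(v))
income  <- c(12000, 30000, 73600, 98000)       # made-up incomes including the example value
min_max(income)                                # 73,600 maps to about 0.716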

Z-score normalization
• Standardization is quite different: first it subtracts the mean value (so standardized values
always have a zero mean), and then it divides by the standard deviation so that the resulting
distribution has unit variance.
• In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of A is
normalized to v'i by computing
  v'i = (vi − μ_A) / σ_A
• Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is normalized to
  (73,600 − 54,000) / 16,000 = 1.225
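And the z-score version in R; for whole variables, base R's scale() centres by the sample mean and divides by the sample standard deviation (a sketch, with the example's μ and σ hard-coded):

z_score <- function(v, mu, sigma) (v - mu) / sigma
z_score(73600, mu = 54000, sigma = 16000)       # 1.225

income <- c(12000, 30000, 54000, 73600, 98000)  # made-up values
scale(income)                                   # z-scores using the sample mean and sd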
Encoding Categorical data
• Encoding categorical data is a process of converting categorical data into integer format.
• Since machine learning models work entirely with mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model. So it is necessary to
encode these categorical variables into numbers.
• Encoding the Independent Variables
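A base-R sketch of the two usual encodings, label encoding and one-hot (dummy) encoding, on a made-up Sex variable:

sex <- factor(c("Female", "Male", "Female", "Male"))

# Label encoding: each category becomes an integer code
as.integer(sex)                       # Female -> 1, Male -> 2

# One-hot / dummy encoding: one indicator column per category
model.matrix(~ sex - 1, data = data.frame(sex = sex))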
Determine the Data Mining Task
Data mining techniques come in two main forms: supervised (also known as predictive)
and unsupervised (also known as descriptive).
Supervised Learning
• These algorithms require the knowledge of both the outcome variable (dependent
variable) and the independent variable (input variables).
• In supervised learning, the training data you feed to the algorithm includes the
desired solutions, called label.
• Regression and Classification algorithms are Supervised Learning algorithms.
Unsupervised Learning
• Unsupervised learning: In unsupervised learning, the training data is unlabeled. The
system tries to learn without a teacher.
• These algorithms are set of algorithms which do not have the knowledge of the
outcome variable in the dataset.
Most important supervised learning algorithms:
• Linear Regression
• Logistic Regression
• k-Nearest Neighbors
• Support Vector Machines (SVMs)
• Naïve Bayes
• Decision Trees and Random Forests
• Neural Networks
Most important unsupervised learning algorithms:
• Clustering: K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA)
• Dimensionality Reduction / Feature Selection: Principal Component Analysis (PCA)
• Association Rule Learning: Apriori, Eclat

Partition the data (for supervised tasks). If the task is supervised (classification or prediction),
randomly partition the dataset into training and test datasets.
Splitting the dataset into the Training set and Test set
• The train-test split is a technique for evaluating the performance of a machine learning algorithm.
• It can be used for regression or classification problems and can be used for any supervised learning
algorithm.
• The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit
the model and is referred to as the training dataset. The second subset is not used to train the
model; instead, the input element of the dataset is provided to the model, then predictions are
made and compared to the expected values. This second dataset is referred to as the test dataset.
• Train Dataset: Used to fit the machine learning model.
• Test Dataset: Used to evaluate the fit machine learning model.
• The objective is to estimate the performance of the machine learning model on new data: data not
used to train the model.
• The proportion of training dataset is usually between 70% and 80% of the data and the remaining
data is treated as the test dataset (validation data). The subsets may be created using
random/stratified sampling procedure. This is an important step to measure the performance of the
model using dataset not used in model building. It is also essential to check for any overfitting of the
model. In many cases, multiple training and multiple test data are used (called cross-validation).
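A minimal random 70/30 split in base R (illustrative; packages such as caret or rsample offer stratified and cross-validation splits):

set.seed(42)                                    # reproducible split
n         <- nrow(iris)                         # iris stands in for any data set
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train_data <- iris[train_idx, ]                 # used to fit the model
test_data  <- iris[-train_idx, ]                # held out to evaluate the model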
Select a model and train it
• Train Dataset: Used to fit the machine learning model.
• The selected model may not be always the most accurate model, as accurate model may take more
time to compute and may require expensive infrastructure. The final model for deployment will be
based on multiple criteria such as accuracy, computing speed, cost of deployment, and so on. As a
part of model building, we will also go through feature selection which identifies important features
that have significant relationship with the outcome variable.
Model Evaluation (Model Testing)
• Test Dataset: Used to evaluate the fit machine learning model.
The most common evaluation metrics for regression:
• Mean Absolute Error
• Mean Squared Error
• Root Mean Square Error
• R squared and Adjusted R Square

The key classification metrics are:


• Accuracy
• Recall
• Precision
• F1-Score
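These metrics reduce to a few lines of R; the predicted and actual values below are made up for illustration:

# Regression metrics
actual    <- c(3.0, 2.5, 4.0, 5.1)
predicted <- c(2.8, 2.9, 3.6, 5.0)
mae  <- mean(abs(actual - predicted))                 # Mean Absolute Error
mse  <- mean((actual - predicted)^2)                  # Mean Squared Error
rmse <- sqrt(mse)                                     # Root Mean Square Error
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

# Classification metrics from a 2x2 confusion matrix
truth <- c(1, 0, 1, 1, 0, 1, 0, 0)
pred  <- c(1, 0, 0, 1, 0, 1, 1, 0)
tp <- sum(pred == 1 & truth == 1); fp <- sum(pred == 1 & truth == 0)
fn <- sum(pred == 0 & truth == 1); tn <- sum(pred == 0 & truth == 0)
accuracy  <- (tp + tn) / length(truth)
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
f1        <- 2 * precision * recall / (precision + recall)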
Model Deployment
Model Deployment: Once the final model is chosen, then the organization must decide the
strategy for model deployment. Model deployment can be in the form of simple business
rules, chatbots, real-time actions, robots, and so on.
• The foregoing steps encompass the steps in SEMMA, a methodology developed
by the software company SAS:
• Sample Take a sample from the dataset; partition into training, validation, and
test datasets.
• Explore Examine the dataset statistically and graphically.
• Modify Transform the variables and impute missing values.
• Model Fit predictive models (e.g., regression tree, neural network).
• Assess Compare models using a validation dataset.

• IBM SPSS Modeler (previously SPSS-Clementine) has a similar methodology,


termed CRISP-DM (CRoss-Industry Standard Process for Data Mining).
Thank You
Dashboards with Tableau

Ramesh Kandela

“Information is the Oil and Analytics is the Combustion Engine”.


Exploratory Data Analysis (EDA)

Ramesh Kandela
[email protected]
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis is an approach to analyzing data sets to summarize their main
characteristics, often using descriptive statistics.

• Descriptive statistics will help us to understand the variability in the data.

EDA assists business analytics professionals in various ways:

• Getting a better understanding of data

• Identifying various data patterns

• Getting a better understanding of the problem statement


Data Understanding
• Data understanding involves accessing the data and exploring it using descriptive statistics.
• Descriptive statistics consists of methods for organizing, displaying, and describing data by
using tables, graphs, and summary measures.
Types of descriptive statistics:
• Organize the Data
• Tables
• Frequency Distributions
• Relative Frequency Distributions
• Displaying the Data
• Graphs
• Bar Chart or Histogram
• Summarize the Data
• Central Tendency
• Variation
Titanic Dataset
• There are a total of 891 instances, each consisting of 12 attributes. Here is a brief description of
what the data consist of:
1. Passenger Id: A unique id given for each passenger in the data-set.
2. Survived: It denotes whether the passenger survived or not. Here, 0 = Not Survived 1 = Survived
3. Pclass: represents the Ticket class which is also considered as proxy for socio-economic status (SES)
Here, 1 = Upper Class 2 = Middle Class 3 = Lower Class
4. Name: Name of the Passenger
5. Sex: Denotes the Sex/Gender of the passenger i.e ‘male’ or ‘female’.
6. Age: Denotes the age of the passenger
7. SibSp: It represents no. of siblings / spouses aboard the Titanic
Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife.
8. Parch: It represents no. of parents / children aboard the Titanic (Parent = mother, father. Child = daughter,
son, stepdaughter, stepson). Some children travelled only with a nanny, therefore parch=0 for them.
9. Ticket: It represents the ticket number of the passenger
10. Fare: It represents Passenger fare.
11. Cabin: It represents the Cabin No.
12. Embarked: It represents the Port of Embarkation. Here, C=Cherbourg, Q= Queenstown, S = Southampton
Data Source: https://www.kaggle.com/competitions/titanic/data
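A brief data-understanding sketch in R, assuming the Kaggle file has been downloaded locally as train.csv (the file name and location are an assumption):

titanic <- read.csv("train.csv", stringsAsFactors = FALSE)   # assumed local copy of the Kaggle file

str(titanic)                           # 891 observations of 12 attributes
summary(titanic$Age)                   # numeric summary, including the number of missing values
table(titanic$Survived, titanic$Sex)   # simple cross tabulation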
Types of Data
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | No | Lower | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | Yes | Upper | Cumings, Mrs. John Bradley | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C
Qualitative / Categorical:
Nominal: Without order
• Survived, Sex, Embarked are of this type
Ordinal: Ordered
• Pclass
Text: Text in nature,
• Name is text data in this data set.
Quantitative / Numerical:
Discrete: counted (discrete data can only take certain values)
• SibSp and Parch are of this type
Continuous: measured (continuous data can take any value within a range)
• Age and Fare are of this type
Dashboards
• A dashboard is a way of displaying various types of visual data in one place.
• Usually, a dashboard is intended to convey different, but related information in an easy-to-
digest form. And oftentimes, this includes things like key performance indicators (KPI)s or
other important business metrics that stakeholders need to see and understand at a glance.
• A data dashboard is a tool many businesses use to track, analyze, and display data—usually
to gain insight into the overall wellbeing of an organization, department, or specific process.
Data dashboards versus data visualizations
• Data visualization is a way of presenting data in a visual form to make it easier to understand
and analyze.
• Data dashboards are a summary of different, but related data sets, presented in a way that
makes the related information easier to understand. Dashboards are a type of data
visualization, and often use common visualization tools such as graphs, charts, and tables.
Titanic Dataset Dashboard
Storytelling with Data
• Understand the context. Build a clear understanding of who you are communicating to, what you
need them to know or do, how you will communicate to them, and what data you have to back up
your case.
• Choose an appropriate visual display.
• Eliminate clutter. Identify elements that don’t add informative value and remove them from your
visuals.
• Focus attention where you want it. Employ the power of preattentive attributes like color, size, and
position to signal what’s important. Use these strategic attributes to draw attention to where you
want your audience to look and guide your audience through your visual.
• Tell a story. Craft a story with clear beginning (plot), middle (twists), and end (call to action). Leverage
conflict and tension to grab and maintain your audience’s attention.

https://www.storytellingwithdata.com/
Business Intelligence
• Business intelligence (BI) is software that ingests business data and presents it in user-friendly
views such as reports, dashboards, charts and graphs.
• Business intelligence combines business analytics, data mining, data visualization, data tools and
infrastructure, and best practices to help organizations make more data-driven decisions.
• Business intelligence (BI) uncovers insights for making strategic decisions. Business intelligence
tools analyze historical and current data and present findings in intuitive visual formats.
• According to CIO magazine: “Although business intelligence does not tell business users what to
do or what will happen if they take a certain course, neither is BI only about generating reports.
Rather, BI offers a way for people to examine data to understand trends and derive insights.”

BI Tools
Data Visualization
• Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
• Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
• In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Why Data Visualization
• Provides a clearer understanding of the data
• Instant absorption of large and complex data
• Better decision-making based on data
• Audience engagement
• Reveals hidden patterns and deeper insights
• Helps the analyst avoid problems
Two Key Questions for Data Visualization
1. What type of data are you working with?
• Qualitative
• Quantitative
2. What are you trying to communicate?
• Relationship
• Comparison
• Composition
• Distribution
• Trending
Tableau
• Tableau is an excellent data visualization and business intelligence tool used for reporting and
analyzing vast volumes of data.
• Tableau is a visual analytics platform transforming the way we use data to solve problems—
empowering people and organizations to make the most of their data.

Install Tableau
Tableau Desktop
https://www.tableau.com/support/releases
Tableau for Students/Faculty: Free (one-year license) access to Tableau Desktop
• https://www.tableau.com/academic/teaching#form
• https://www.tableau.com/academic/students#form
Tableau Public
• Tableau Public is a free platform to explore, create, and publicly share data visualizations. Get
inspired by the endless possibilities with data.
• https://public.tableau.com/en-us/s/
Connecting to Data in Tableau
• When open the Tableau, we will see a screen that looks like this,
where we have the option to choose data connection:
• The options under the navigation heading “To a File” can be
accessed with Tableau. All possible data connections, including data
that resides on a server, can be accessed with Tableau
• At the bottom of the left navigation, a couple of data sources come
with every Tableau download. The first, Sample – Superstore, is an
Excel file to connect to it using Tableau.
Tableau
Data Types
Work Sheet

So simply click on the Sheet 1 tab at the bottom to start visualizing the data! You should now
see the main work area within Tableau, which looks like this:
Foundations for building visualizations
• When you first connect to a data source such as the Superstore file, Tableau will display the
data connection and the fields in the Data pane.
• Dimensions contain qualitative values (such as names, dates, or geographical data). You can
use dimensions to categorize, segment, and reveal the details in your data. Dimensions
affect the level of detail in the view.
• Measures contain numeric, quantitative values that you can measure. Measures can be
aggregated. When you drag a measure into the view, Tableau applies an aggregation to that
measure (by default).
• Generally, the measure is the number; the dimension is what you “slice and dice” the
number by.
• Fields can be dragged from the data pane onto the canvas area or onto various shelves such
as Rows, Columns, Color, or Size. As we'll see, the placement of the fields will result in
different encodings of the data based on the type of field.
Bar Graph
• A graph made of bars whose heights represent the frequencies of respective categories is
called a bar graph.
Bar and column charts are commonly used for: comparing numerical data across categories
Examples:
• Total sales by product type
• Population by country
• Revenue by department, by quarter
Create a Bar Chart in Tableau
Create a bar chart by placing a dimension on the Rows shelf and a measure on the Columns shelf, or vice versa.
• Connect to the Sample - Superstore data source.
• Drag the Region dimension to Rows and the Sales measure to Columns (horizontal bar).
• Drag the Region dimension to Columns and the Sales measure to Rows (vertical bar/column chart).
Tableau Show Me
• In the Data pane, select the fields. Hold the Ctrl key
(Command key on a Mac) to make multiple selections.
• Click Show Me on the toolbar and then select the type of
view you want to create.

Create a Bar Chart using Show Me
• In the Data pane, select Region and Profit. Hold down the Ctrl key (or the Command key on a Mac) as you select the fields.
• Click Show Me on the toolbar.
• Select the bar chart type from Show Me.
The Tableau Workspace
(Figure: the workspace, showing the workbook name, toolbar, view canvas, cards and shelves, Tableau icon, side bar, Data Source tab, status bar, and sheet tabs.)
The Tableau Workspace
• A. Workbook name: A workbook contains sheets. A sheet can be a worksheet, a dashboard, or a
story.
• B. Cards and shelves: Drag fields to the cards and shelves in the workspace to add data to your view
• Every worksheet in Tableau contains shelves and cards, such as Columns, Rows, Marks, Filters, Pages,
Legends, and more.
• By placing fields on shelves or cards, Build the structure of visualization.
• Increase the level of detail and control the number of marks in the view by including or excluding
data.
• Add context to the visualization by encoding marks with color, size, shape, text, and detail.
• Marks card: The Marks card is a key element for visual analysis in Tableau. As you drag fields to
different properties in the Marks card, you add context and detail to the marks in the view.
• Filters shelf: The Filters shelf allows to specify which data to include and exclude.
C. Toolbar
Use the toolbar to access commands and analysis and navigation tools
Toolbar Button Description
Tableau icon:Navigates to the start page.
The start page in Tableau Desktop is a central location from which you can do the following:
•Connect to your data
•Open your most recently used workbooks, and
•Discover and explore content produced by the Tableau community.

Undo: Reverses the most recent action in the workbook.

Redo: Repeats the last action you reversed with the Undo button.

Save: In Tableau Desktop, saves the changes made to the workbook.


New Data Source: In Tableau Desktop, opens the Connect pane where you can create a new
connection or open a saved connection.
Pause Auto Updates: Controls whether Tableau updates the view when changes are made.
Run Update: Runs a manual query of the data to update the view with changes when automatic
updates are turned off.
C. Toolbar
Toolbar Button Description
New Worksheet: Creates a new blank worksheet, use the drop-down menu to create a
new worksheet, dashboard, or story.
Duplicate: Creates a new worksheet containing the same view as the current sheet.

Swap: Moves the fields on the Rows shelf to the Columns shelf and vice versa.
Sort Ascending: Applies a sort in ascending order of a selected field based on the
measures in the view.
Sort Descending: Applies a sort in descending order of a selected field based on the
measures in the view.
Totals: You can compute grand totals and subtotals for the data in a view. Select from
the following options:
Show Column Grand Totals: Adds a row showing totals for all columns in the view.
Show Row Grand Totals: Adds a column showing totals for all rows in the view.
Row Totals to Left: Moves rows showing totals to the left of a crosstab or view.
Column Totals to Top: Moves columns showing totals to the top of a crosstab or view.
Add All Subtotals: Inserts subtotal rows and columns in the view, if you have multiple
dimensions in a column or row.
Remove All Subtotals: Removes subtotal rows or columns.
C. Toolbar
Toolbar
Button Description
Clear: Clears the current worksheet. Use the drop-down menu to clear specific parts of the
view such as filters, formatting, sizing, and axis ranges.
Highlight: Turn on highlighting for the selected sheet. Use the options on the drop-down menu
to define how values are highlighted.
Group Members: Creates a group by combining selected values. When multiple dimensions
are selected, use the drop-down menu to specify whether to group on a specific dimension or
across all dimensions.
Show Mark Labels: Switches between showing and hiding mark labels for the current sheet.
Fix Axes: switches between a locked axis that only shows a specific range and a dynamic axis
that adjusts the range based on the minimum and maximum values in the view.
Format Workbook: Open the Format Workbook pane to change how fonts and titles look in every view
in a workbook by specifying format settings at the workbook level instead of at the worksheet level.
Fit: Specifies how the view should be sized within the window. Select Standard, Fit Width, Fit Height, or
Entire View. Note: This menu is not available in geographic map views.
The Cell Size commands have different effects depending on the type of visualization. To access the Cell
Size menu in Tableau Desktop click Format > Cell Size.
C. Toolbar
Toolbar
Button Description
Show/Hide Cards: Shows and hides specific cards in a worksheet. Select each card that you
want to hide or show on the drop-down menu.
In Tableau Server and Tableau Online, you can show and hide cards for
the Title, Caption, Filter and Highlighter only.
Presentation Mode: Switches between showing and hiding everything except the view (i.e.,
shelves, toolbar, Data pane).
Download: Use the options under Download to capture parts of your view for use in other
applications.
Share Workbook With Others: Publish your workbook to Tableau Server or Tableau Online.
Show Me: Helps you choose a view type by highlighting view types that work best with the field
types in your data. An orange outline shows around the recommended chart type that is the
best match for your data.
• D. View - This is the canvas in the workspace where you create a visualization (also
referred to as a "viz").
• E. Click this icon to go to the Start page, where you can connect to data.
• F. Side Bar - In a worksheet, the side bar area contains the Data pane and
the Analytics pane
• G. Click this tab to go to the Data Source page and view your data
• H. Status bar - Displays information about the current view.
• I. Sheet tabs - Tabs represent each sheet in your workbook. This can include
worksheets, dashboards, and stories.
Marks Card
• Marks Card in Tableau
• There is a card to the left of the view where we can
drag fields and control mark properties like color, size,
label, shape, tooltip and detail.
Pie chart
• Pie chart: A circle divided into portions (slices) that represent the relative frequencies or
percentages of different categories or classes.
• Use pie charts to show proportions of a whole; each slice shows the proportion of one part out of the whole.
Commonly used for: comparing proportions totalling 100%
Examples:
• Percentage of budget spent by department
• Proportion of internet users by age range
• Breakdown of site traffic by source
(Example pie chart: share of records by day of week: Thur 31.15%, Fri 7.79%, Sat 25.41%, Sun 35.66%.)
Build a Pie Chart
• Step 1: Connect to the Sample - Superstore data source.
• Step 2:Drag the Sales measure to Columns and drag the Region dimension to Rows.
• Tableau aggregates the Sales measure as a sum.
• Also note that the default chart type is a bar chart.
• Step 3:Click Show Me on the toolbar, then
select the pie chart type.

• Step 4: The result is a rather small pie. To make the chart bigger, hold down Ctrl + Shift
(Cmd + Shift on a Mac) and press B several times.
• Step 5: To add labels, drag the Region dimension from the Data pane to Label on
the Marks card.
Stacked Bar Graphs
• Stacked bar graphs show the quantitative relationship that exists between a main category and
its subcategories.
• Each bar represents a principal category and it is divided into segments
representing subcategories of a second categorical variable.
• The chart shows not only the quantitative relationship between the different subcategories
with each other but also with the main category as a whole. They are also used to show how
the composition of the subcategories changes over time.

Stacked Bar Graphs in Tableau


Step 1: Connect to the Sample - Superstore data source
Step 2: Drag the Region dimension to Columns and drag
the Sales measure to Rows
Step 3: Drag the Ship Mode dimension to Color on the Marks card. The
view shows how different shipping modes have contributed to total
sales in the region.
In stacked bar graphs, various categories of the same field are plotted
on top of each other.
Side-by-Side Bar Chart
• Similar to bar charts, you can use this chart to show a
side by side comparison of data.
• The side-by-side bar chart is similar to the stacked bar
chart except we’ve un-stacked the bars and put the
bars side by side along the horizontal axis.
Maps
Symbol Map
• Connect to the Sample - Superstore data source.
• In the Data pane, under Dimensions, double-click State.
• From Measures, drag Sales to Size on the Marks card.
Add image files to repository:
• Find image files on the internet or any
other source. Download the image to
your computer.
• Drag images into your “Documents” ->
"my Tableau repository" -> "shapes"
folder.
• Open Tableau and your new shapes
will automatically be included in your
"edit shapes" menu.
Filled map
• In the Data pane, under Dimensions, double-click State.
• A map view is automatically created.
• On the Marks card, click the Mark Type drop-down and select Map.
• The map view updates to a filled (polygon) map.
• From Measures, drag Sales to Color on the Marks card.
Removing Borders
• Tableau provides quite a few
options to change the format of a
generated map
To removing borders :
• Select Map > Map Layers.
• The Map Layers pane appears on
the left side of the workspace. This
is where all background map
customization happens.
• In the Map Layers pane, untick all
boxes.
Filters in Tableau
• Tableau provides the ability to filter individual views or
even entire data sources on dimensions, measures, or
dates. This filtering capability can serve a variety of
purposes, including minimizing the size of the data for
efficiency purposes, cleaning up underlying data,
removing irrelevant dimension members, and setting
measure or date ranges for what you want to analyze.
• Drag dimensions, measures, and date fields to the
Filters shelf. When you add a field to the Filters shelf,
the Filter dialog box opens so you can define the filter.
The Filter dialog box differs depending on whether you
are filtering categorical data (dimensions), quantitative
data (measures), or date fields.
Dimension filter
• Dimensions contain discrete categorical data, so filtering this type of field generally involves
selecting the values to include or exclude.
• When you drag a dimension from the Data pane to the Filters shelf in Tableau Desktop, the
following Filter dialog box appears:
Filter Quantitative data (Measures)
• Measures contain quantitative data, so filtering
this type of field generally involves selecting a
range of values that you want to include.
• When you drag a measure from the Data pane
to the Filters shelf in Tableau Desktop, the
following dialog box appears:

Select how you want to aggregate the field, and then click Next.
In the subsequent dialog box, you're given the option to create four types of quantitative filters:
Filter dates
• When you drag a date field from the Data pane to the Filters shelf in Tableau Desktop, the
following Filter Field dialog box appears:
Display interactive filters in the view
• When an interactive filter is shown, you can quickly include or exclude data in the view.
• To show a filter in the view:
• In the view, click the field drop-down menu and select Show Filter.
• The field is automatically added to the Filters shelf (if it is not already being filtered), and
a filter card appears in the view. Interact with the card to filter your data.
• Set options for filter card interaction and appearance
• After you show a filter, there are many different options that let you control how the filter
works and appears. You can access these options by clicking the drop-down menu in the
upper right corner of the filter card in the view.
• Some options are available for all types of filters, and others depend on whether you’re
filtering a categorical field (dimension) or a quantitative field (measure).
Histogram
• Histograms are created by continuous (numerical) data.
• Histogram is the visual representation of the data which can be used to assess the probability
distribution (frequency distribution) of the data. A histogram is a chart that displays the shape
of a distribution.
• A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is
plotted as a bar whose height corresponds to how many data points are in that bin. A histogram
looks like a bar chart but groups values for a continuous measure into ranges, or bins.

Commonly Used For:


• Showing the distribution of a continuous data set
• We should use histogram when we need the
count of the variable in a plot.
Examples:
• Frequency of test scores among students
• Distribution of population by age group
• Distribution of heights or weights
Frequency Distribution (Probability Distribution)
• For a very large data set, as the number of classes is increased (and the width of classes is
decreased), the frequency polygon eventually becomes a smooth curve. Such a curve is called
a frequency distribution curve or simply a frequency curve
Shapes of Histograms
A histogram can assume any one of a large number of shapes.
The most common of these shapes are
1. Symmetric: A symmetric histogram is identical on both sides
of its central point.
2.Skewed: A skewed histogram is nonsymmetric.
For a skewed histogram, the tail on one side is
longer than the tail on the other side. A skewed-to-
the-right histogram has a longer tail on the right
side (Figure a). A skewed-to-the-left histogram has
a longer tail on the left side (Figure b).

3. A uniform or rectangular histogram has the same frequency for each class.
Create a histogram in Tableau
• In Tableau we can create a histogram using Show Me.
1.Connect to the Sample - Superstore data source.
2.Drag Quantity to Columns.
3.Click Show Me on the toolbar, then select the histogram chart type.
• The histogram chart type is available in Show Me when the view contains a single measure
and no dimensions.
Three things happen after you click the histogram icon in Show Me:
• The view changes to show vertical bars, with a continuous x-axis (1 – 14) and
a continuous y-axis (0 – 5,000).
• The Quantity measure you placed on the Columns shelf, which had been
aggregated as SUM, is replaced by a continuous Quantity (bin) dimension.
(The green color of the field on the Columns shelf indicates that the field is
continuous.)
• To edit this bin: In the Data pane, right-click the bin and select Edit.
• The Quantity measure moves to the Rows shelf and the aggregation
changes from SUM to CNT (Count).
Box and Whisker Plot
• A box-and-whisker plot gives a graphic presentation of data using five measures: the median, the
first quartile, the third quartile, and the smallest and the largest values in the data set between the
lower and the upper inner fences.
• The length of the box is equivalent to IQR. It is possible that the data may contain values beyond Q1
– 1.5 IQR and Q3 + 1.5 IQR. The whisker of the box plot extends till Q1 – 1.5 IQR (or minimum value)
and Q3 + 1.5 IQR (or maximum value); observations beyond these two limits are potential outliers.
Commonly Used For:
• Visualizing statistical characteristics across data series
Examples:
• Comparing historical annual rainfall across cities
• Analyzing distributions of values and identifying outliers
• Comparing mean and median height/weight by country

Example data: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
The five-number summary is 25, 29, 32, 35, 38.
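As a quick check of the five-number summary above, here is a minimal Python sketch (not part of the original slides) using NumPy. Note that different quartile conventions can give slightly different Q1/Q3 values; NumPy's default interpolation happens to match the values shown here.

```python
import numpy as np

# Data from the example above
data = np.array([25, 28, 29, 29, 30, 34, 35, 35, 37, 38])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print("Five-number summary:", data.min(), q1, median, q3, data.max())  # 25 29.0 32.0 35.0 38
print("IQR:", iqr)                                                     # 6.0
print("Whisker limits:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)               # 20.0 and 44.0
```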
Box plots are used to show distributions of numeric data values.
(Figures: box plots of a symmetric distribution, a positively skewed distribution, and a negatively skewed distribution.)
Create a Box Plot in Tableau
• To create a box plot that shows discounts by region and
customer segment, follow these steps:
1. Connect to the Sample - Superstore data source.
2. Drag the Segment dimension to Columns.
3. Drag the Discount measure to Rows.
   Tableau creates a vertical axis and displays a bar chart, the default chart type when there is a dimension on the Columns shelf and a measure on the Rows shelf.
4. Drag the Region dimension to Columns, and drop it to the right of Segment.
   Now you have a two-level hierarchy of dimensions from left to right in the view, with regions (listed along the bottom) nested within segments (listed across the top).
5. Click Show Me in the toolbar, then select the box-and-whisker plot chart type.
6. Drag Region from the Marks card back to Columns, to the right of Segment.
   The horizontal lines are flattened box plots, which is what happens when box plots are based on a single mark. Box plots are intended to show a distribution of data, and that can be difficult when data is aggregated, as in the current view.
7. To disaggregate data, select Analysis > Aggregate Measures.
8. Click the Swap button to swap the axes.
9. Right-click (Control-click on Mac) the bottom axis and select Edit Reference Line.
10. In the Edit Reference Line, Band, or Box dialog box, in the Fill drop-down list, select an interesting color scheme.
Scatter Plot
• A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different
numeric variables. The position of each dot on the horizontal and vertical axis indicates
values for an individual data point.
• Scatter plots are used to observe relationships between variables.
Commonly Used For:
• Exploring correlations or relationships between series
Examples:
• Advertisements and sales
• Study time and marks

• Positive correlation depicts a rise, and it is seen on the diagram as data points slope
upwards from the lower-left corner of the chart towards the upper-right.
• Negative correlation depicts a fall, and this is seen on the chart as data points slope
downwards from the upper-left corner of the chart towards the lower-right.
• Data that is neither positively nor negatively correlated is considered uncorrelated (null).
Create scatter plot in Tableau
• To use scatter plots and trend lines to compare sales
to profit, follow these steps:
1.Open the Sample - Superstore data source.
2. Drag the Profit measure to Columns.
   Tableau aggregates the measure as a sum and creates a horizontal axis.
3. Drag the Sales measure to Rows.
   Tableau aggregates the measure as a sum and creates a vertical axis.
4. Drag the Sub-Category dimension and drop it into the Label shelf under the Marks pane.
Create scatter plot in Tableau
• Once the data is loaded, perform the
following steps to create a scatter plot of
two measures:
1. From the top toolbar, under Analysis,
uncheck Aggregate Measures.
2. Drag-and-drop Profit into the Columns
shelf.
3. Drag-and-drop Sales into the Rows shelf.

• Right-click on any data marker or anywhere in the plot area and click the Show Trend Lines option under Trend Lines to see a plot with a linear trend line.
Line Chart
• A line chart is a graphical representation of data (for example, sales or historical prices) that connects a series of data points with a continuous line. It is the most basic type of chart used in finance and can be used on any timeframe.
• Time series is a line plot and it is basically connecting data points with a straight line. It is useful
in understanding the trend over time. It can explain the correlation between points by the trend.
Commonly Used For:
• Visualizing trends over time
• Time Series should be used when single or multiple variables are to be plotted over time.
Examples:
• Stock price by hour
• Average temperature by month
• Profit by quarter

(Example figure: number of monthly active Facebook users worldwide as of the 3rd quarter of 2020.)
Create Line Charts in Tableau
• To create a view that displays the sum of sales
and the sum of profit for all years, and then uses
forecasting to determine a trend, follow these
steps:
1.Connect to the Sample - Superstore data source.
2. Drag the Order Date dimension to Columns.
   Tableau aggregates the date by year and creates column headers.
3. Drag the Sales measure to Rows. (On the Marks card, select Line from the drop-down list.)
   Tableau aggregates Sales as SUM and displays a simple line chart.
4. Drag the Profit measure to Rows and drop it to the right of the Sales measure.
• Click the drop-down arrow in
the Year(Order Date) field on
the Columns shelf and select Month in
the lower part of the context menu to
see a continuous range of values over
the four-year period.
• The small circles on each data point, called “markers,” can be added by clicking the Color Marks card and choosing one of the “Markers:” options under Effects.
Build a Text Table
• In Tableau, create text tables (also called cross-tabs or
pivot tables) by placing one dimension on the Rows shelf
and another dimension on the Columns shelf. You then
complete the view by dragging one or more measures
to Text on the Marks card.
❖To create a text table that shows sales totals by year and
category, follow these steps:
• Connect to the Sample - Superstore data source.
• Drag the Order Date dimension to Columns.
• Tableau aggregates the date by year and creates column
headers.
• Drag the Sub-Category dimension to Rows.
• Now add a measure to the view to see actual data.
• Drag the Sales measure to Text on the Marks card.
• Tableau aggregates the measure as a sum.
Showing Aggregate Measures
• Tableau, by default, aggregates measure values, and this behavior can be changed
to show all individual values of the measures by clicking on the Analysis menu
option from the top toolbar and unchecking Aggregate Measures to remove
aggregation. It is also possible to change the aggregation type, such as total,
average, variance, and others.
• When you add a measure to the view, Tableau automatically aggregates its values.
Sum, average, and median are common aggregations
• The current aggregation appears as part of the measure's name in the view. For
example, Sales becomes SUM(Sales). Every measure has a default aggregation
which is set by Tableau when you connect to a data source
• To change the default aggregation:
• Right-click (control-click on Mac) a
measure in the Data pane and select
Default Properties > Aggregation, and
then select one of the aggregation
options.
Heatmap
• A heatmap is a graphical representation of data that uses colors to visualize the values of a matrix; brighter colors represent more common values or higher activity.
• Heat maps are a visualization where marks on a chart are represented as colors. As the marks “heat up” due to their higher values or density of records, a more intense color is displayed. These colors can be displayed in a matrix/crosstab, which creates a highlight table.
table.
• Heatmaps are useful for cross-examining multivariate data, through placing variables in
the rows and columns and coloring the cells within the table.
• “For heat maps try 1 or more dimensions and 1 or 2 measures”
• This is very close to the requirements for drawing a highlight table.
Heatmap
• Open Tableau Desktop and connect to the Sample - Superstore data source.
• Select the Region and Sub-Category dimensions and the Sales measure.
• Select the heat map chart type from Show Me.
• On the Marks card, apply Sales to Color.
Highlight Table
• A highlight table uses color to highlight data and tell a story. It is similar to an Excel table, but the cells are colored (similar to conditional formatting in Excel). It can be used to compare values across rows and columns, and you can change the color scheme.
Tree map
• Use a treemap to show hierarchical (tree-structured) data and part-to-whole relationships. Treemaps are ideal for showing a large number of items in a single visualization simultaneously. This view is very similar to a heat map, but the boxes are grouped by items that are close in the hierarchy.
Packed Bubble Chart

• Use packed bubble charts to display data in a cluster of circles. Dimensions define the individual bubbles, and measures define the size and color of the individual circles.
Area Chart
An area chart is a line chart where the area between the line and the axis are shaded with a
color. These charts are typically used to represent accumulated totals over time and are the
conventional way to display stacked lines.
The area chart is a combination between a line graph and a stacked bar chart. It shows relative
proportions of totals or percentage relationships.
To create an area chart:
• Open Tableau Desktop and connect to the Sample -
Superstore data source.
• From the Data pane, drag Order Date to
the Columns shelf.
• On the Columns shelf, right-click YEAR(Order Date) and
select Month.
• From the Data pane, drag Quantity to the Rows shelf.
• From the Data pane, drag Ship Mode to Color on the
Marks card.
• On the Marks card, click the Mark Type drop-down and
select Area.
Combination Charts (Dual Axes Charts)
• Combination charts are views that use multiple mark types in the same visualization.
• For example, you may show sum of profit as bars with a line across the bars showing sum of
sales. You can also use combination charts to show multiple levels of detail in the same view.
For example, you can have a line chart with individual lines showing average sales over time
for each customer segment, then you can have another line that shows the combined
average across all customer segments.
• Companies often graph actual monthly sales and targeted monthly sales. If the marketing analyst plots both these series with straight lines it is difficult to differentiate between actual and target sales. For this reason analysts often summarize actual sales versus targeted sales via a combination chart in which the two series are displayed using different formats (such as a line and column graph).
(Example figure: a combo chart of actual versus target sales by month, January through July.)
Combination Charts (Dual Axes Charts)
To create a combination chart, follow the steps below:
• Open Tableau Desktop and connect to the Sample -
Superstore data source.
• From the Data pane, drag Order Date to
the Columns shelf.
• On the Columns shelf, right-click YEAR(Order Date) and
select Month.
• From the Data pane, drag Sales to the Rows shelf.
• From the Data pane, drag Profit to the Rows shelf and
place it to the right of SUM(Sales).
• On the Rows shelf, right-click SUM(Profit) and
select Dual-Axis.
• On the SUM(Profit) Marks card, click the Mark Type drop-
down and select Bar.
Word Map
A word map (word cloud) is a visual representation of text data, in which the size of each word indicates how frequently it appears.
Creating Dashboards
Dashboards in Tableau are very powerful as they are a compilation of individual visualizations
on different sheets. This provides the reader with a lot of information on one single view with
all the filters, parameters, and legends of individual visualizations.
Create a dashboard
You create a dashboard in much the same way you create a new worksheet.
1.At the bottom of the workbook, click the New Dashboard icon:
2.From the Sheets list at left, drag views to your dashboard at right.

Add interactivity
• In the upper corner of a sheet, enable the Use as Filter option to use selected marks in the sheet as filters for other sheets in the dashboard.
Floating and Tiled Layout Arrangements on Dashboards
• Each object (Worksheet) in a dashboard can use one of two types of layouts: Tiled or Floating.
Tiled objects are arranged in a single-layer grid that adjusts in size based on the total
dashboard size and the objects around it. Floating objects can be layered on top of other
objects and can have a fixed size and position.
• Tiled Layout All objects are tiled on a single layer. The top three views are in a horizontal
layout container.
• Floating Layout While most objects are tiled on this dashboard, the map view and its
corresponding color legend are floating. They are layered on top of the bar chart, which uses
a tiled layout.
Adding drop-down selectors
• Single Value (Dropdown): Displays the values of the filter in a drop-down list where only a single value can be selected at a time.
• Multiple Values (Dropdown): Displays the values of the filter in a drop-down list where multiple values can be selected.
Slider selectors
• Single Value (Slider): Displays the values of the filter along the range of a slider. Only a single value can be selected at a time. This option is useful for dimensions that have an implicit order, such as dates.
Search box selectors
• Wildcard Match: Displays a text box where you can type a few characters. All values that match those characters are automatically selected. You can use the asterisk character as a wildcard character. For example, you can type “tab*” to select all values that begin with the letters “tab”. Pattern match is not case sensitive. If you are using a multidimensional data source, this option is only available when filtering single-level hierarchies and attributes.
Happy Visualizing
K Nearest Neighbors (K-NN)

Ramesh Kandela
[email protected]
KNN
• K-Nearest Neighbors is a classification (and regression) algorithm. The idea in k-nearest-neighbors methods is to identify k records in the training dataset that are similar to a new record that we wish to classify.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
• When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space
for the k training tuples that are closest to the unknown tuple. These k training tuples are
the k “nearest neighbors” of the unknown tuple.
• For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k-nearest neighbors.
Why do we need a K-NN Algorithm?
Euclidean Distance(Similarity Measure )
• The most common approach is to measure similarity in terms of distance between pairs of
objects.
• The most commonly used measure of similarity is the Euclidean distance. The Euclidean
distance is the square root of the sum of the squared differences in values for each variable.
• If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance between them is √((x2 − x1)² + (y2 − y1)²).
Point P1=(1,4)
Point p2= (5,1)
Euclidean distance=5
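A minimal Python sketch (not from the original slides) that reproduces this distance calculation; the function name is just illustrative.

```python
import math

def euclidean_distance(p1, p2):
    """Square root of the sum of squared differences between coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

print(euclidean_distance((1, 4), (5, 1)))  # 5.0, as in the example above
```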
• KNN classifier predicts the class of a given test observation by identifying the observations
that are nearest to it, the scale of the variables matters. Any variables that are on a large
scale will have a much larger effect on the distance between the observations, and hence on
the KNN classifier, than variables that are on a small scale.
• There are two common ways to get all attributes to have the same scale:
• Min-max normalization
• Z-score normalization
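A small illustrative sketch of the two scaling approaches, applied to the Salary values from the worked example later in this section. scikit-learn's MinMaxScaler and StandardScaler do the same job; the variable names here are just for illustration.

```python
import numpy as np

# Salary values from the worked example later in this section
salary = np.array([72000, 48000, 54000, 61000, 58000], dtype=float)

# Min-max normalization: rescale values to the [0, 1] range
min_max = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score normalization: subtract the mean and divide by the standard deviation
z_score = (salary - salary.mean()) / salary.std()

print(min_max)
print(z_score)
```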
How does K-NN work?
The K-NN working can be explained on the basis of the below steps:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors and Take the K nearest
neighbors as per the calculated Euclidean distance.
Step-3: Among these k neighbors, count the number of the data points in each category.
Step-4: Assign the new data point to that category for which the number of the neighbor is
maximum.
Step-1: Choosing k
• The k value in the k-NN algorithm defines how many neighbors will be checked to
determine the classification of a specific query point.
• For example, if k=1, the instance will be assigned to the same class as its single nearest
neighbor.
• Defining k can be a balancing act as different values can lead to overfitting or underfitting.
Lower values of k can have high variance, but low bias, and larger values of k may lead to
high bias and lower variance. The choice of k will largely depend on the input data as data
with more outliers or noise will likely perform better with higher values of k.
Age Salary Purchased
44 72000 No
27 48000 Yes
30 54000 No
38 61000 No
35 58000 Yes
37 67000 ?

K=3
How can determine a good value for k, the number of neighbors?”
• There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• This can be determined experimentally. Starting with k =5, we use a test set to estimate the
error rate of the classifier. This process can be repeated each time by incrementing k to
allow for one more neighbor. The k value that gives the minimum error rate may be
selected.
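A hedged sketch of this experimental search for k using scikit-learn. The iris dataset and the train/test split here are only stand-ins for whatever scaled data you are working with, so the best k printed will not match the K = 37 result mentioned below.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data only: iris stands in for whatever scaled training data you have
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

error_rate = []
k_values = range(1, 40)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_test) != y_test))  # misclassification rate

best_k = k_values[int(np.argmin(error_rate))]
print("k with the smallest test error:", best_k, "error:", min(error_rate))
```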

(Figure: error rate plotted against K; in this example the smallest error, 0.59, occurs at K = 37.)
Step-2:Calculate the Euclidean distance
Observation  Age  Salary  Purchased  Euclidean distance  Rank
1            44   72000   No         5000.005            1
2            27   48000   Yes        19000               5
3            30   54000   No         13000               4
4            38   61000   No         6000                2
5            35   58000   Yes        9000                3
New point    37   67000   ?
Step-3 and 4 (K = 3)
Observation  Age  Salary  Purchased  Euclidean distance  Rank
1            44   72000   No         5000.005            1
2            27   48000   Yes        19000               5
3            30   54000   No         13000               4
4            38   61000   No         6000                2
5            35   58000   Yes        9000                3
New point    37   67000   ?

If K = 3, the three nearest neighbors (ranks 1–3) have classes No, No and Yes, so the new data point is assigned to the No category.
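The same worked example can be reproduced with scikit-learn's KNeighborsClassifier (a sketch, not part of the original slides). The features are deliberately left unscaled here so the distances match the table above; in practice you would normalize Age and Salary first.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training data from the worked example: (Age, Salary) -> Purchased
X = [[44, 72000], [27, 48000], [30, 54000], [38, 61000], [35, 58000]]
y = ["No", "Yes", "No", "No", "Yes"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, Euclidean distance by default
knn.fit(X, y)

# New data point: Age = 37, Salary = 67000
print(knn.predict([[37, 67000]]))      # ['No'], matching the manual calculation
print(knn.kneighbors([[37, 67000]]))   # distances and indices of the 3 nearest neighbors
```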


Advantages and disadvantages of the KNN algorithm
Advantages
• Easy to implement: Given the algorithm’s simplicity and accuracy.
• Adapts easily: As new training samples are added, the algorithm adjusts to account for any new data since all training
data is stored into memory.
• Few hyperparameters: KNN only requires a k value and a distance metric, which is low when compared to other
machine learning algorithms.
• It can be used for both classification and regression problems
• No Training Period: KNN is called Lazy Learner (Instance based learning). It does not learn anything in the training
period.
• KNN is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
Disadvantages
• Since KNN is a lazy algorithm, it takes up more memory and data storage compared to other classifiers. This can be costly from
both a time and money perspective. More memory and storage will drive up business expenses and more data can take longer
to compute.
• Need feature scaling: We need to do feature scaling (standardization and normalization) before applying KNN algorithm to any
dataset. If we don't do so, KNN may generate wrong predictions.
• Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of dimensionality, which means that it doesn’t
perform well with high-dimensional data inputs
• Prone to overfitting: Lower values of k can overfit the data, whereas higher values of k tend to “smooth out” the prediction
values since it is averaging the values over a greater area, or neighborhood. However, if the value of k is too high, then it can
underfit the data. Always needs to determine the value of K which may be complex some time.
Applications of k-NN
• Data preprocessing: Datasets frequently have missing values, but the KNN algorithm can
estimate for those values in a process known as missing data imputation.
• Finance: It has also been used in a variety of finance and economic use cases. For example,
using KNN on credit data can help banks assess risk of a loan to an organization or individual.
It is used to determine the credit-worthiness of a loan applicant.
• Healthcare: KNN has also had application within the healthcare industry, making predictions
on the risk of heart attacks and prostate cancer. The algorithm works by calculating the most
likely gene expressions.
• The table below provides a training data set containing six observations, three predictors,
and one qualitative response variable.

• Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0
using K-nearest neighbors.
• b) What is our prediction with K = 1? Why?
• (c) What is our prediction with K = 3? Why?
Observation  Cough  Fever  Covid   Predicted (if k = 3)
1            1      1      No
2            1      2      Yes
3            0      -1     No
4            0      3      No
5            1      2.5    Yes
6            1      1      Yes
7            1      -2     No
8            1      0.5    No      Yes
9            1      0.9    Yes     Yes
10           0      2.5    Yes     Yes
Evaluating Classification Model
The key classification metrics are:
Accuracy
Recall
Precision
F1-Score
Model Evaluation
• A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known.
• True positives (TP): positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
• False positives (FP): negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
• False negatives (FN): positive tuples that were mislabeled as negative. Let FN be the number of false negatives.
                   Predicted Positive                    Predicted Negative
Actual Positive    True Positive (TP)                    False Negative (FN) (Type II Error)
Actual Negative    False Positive (FP) (Type I Error)    True Negative (TN)
Accuracy
• The accuracy of a classifier on a given test set is the percentage of test set data that are correctly classified by the classifier.
• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Accuracy is the proportion of true results among the total number of cases examined.
• The error rate or misclassification rate of a classifier is simply 1 − accuracy.

Example 1 (Covid): TP = 2, FN = 0, FP = 1, TN = 0
Accuracy = (2 + 0) / 3 = 2/3 = 67%

Example 2 (Covid): TP = 0, FN = 5, FP = 0, TN = 95
Accuracy = 95/100 = 95%

When to use?
• Accuracy is useful when the target classes are well balanced.
• Accuracy is not a good choice with unbalanced classes (in Example 2 no positive case is detected, yet accuracy is 95%).
• In this situation we’ll want to understand recall and precision.
Recall
Recall: what proportion of actual Positives is correctly classified?
Recall = TP / (TP + FN) = TP / (total actual positives)
• When to use?
• Recall is a valid choice of evaluation metric when we want to capture as many positives as
possible. For example: If we are building a system to predict if a person has Covid-19 or not,
we want to capture the virus even if we are not very sure.
Example A (Covid): TP = 4, FN = 1, FP = 2, TN = 93
Accuracy = (4 + 93)/100 = 97%
Recall = 4 / (4 + 1) = 0.8

Example B (Covid): TP = 0, FN = 5, FP = 0, TN = 95
Accuracy = 95/100 = 95%
Recall = 0 / (0 + 5) = 0
Precision
What proportion of predicted Positives is truly Positive?
Precision = (TP)/(TP+FP)
When to use?
• Precision is a valid choice of evaluation metric when we want to be very sure of our
prediction. For example: If we are building a system to predict if we should decrease the
credit limit on a particular account, we want to be very sure about our prediction or it may
result in customer dissatisfaction
Example A (Covid): TP = 4, FN = 1, FP = 2, TN = 93
Accuracy = (4 + 93)/100 = 97%
Recall = 4 / (4 + 1) = 0.8
Precision = 4 / (4 + 2) = 0.66

Example B (Covid): TP = 0, FN = 5, FP = 0, TN = 95
Accuracy = 95/100 = 95%
Recall = 0 / (0 + 5) = 0
Precision = 0 / 0 = 0 (undefined in practice)
F1 score
• An alternative way to use precision and recall is to combine them into a single measure.
This is the approach of the F measure (also known as the F1 score or F-score).
• F1-score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• F1-score takes both precision and recall into account, which also means it accounts for
both FPs and FNs. The higher the precision and recall, the higher the F1-score. F1-score
ranges between 0 and 1. The closer it is to 1, the better the model.

• When to use?
• We want to have a model with both good precision and recall.
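A small sketch (not from the original slides) that reproduces the 100-patient Covid example (TP = 4, FN = 1, FP = 2, TN = 93) with scikit-learn's metric functions; the label names "Pos"/"Neg" are just illustrative.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# The 100-patient Covid example: TP = 4, FN = 1, FP = 2, TN = 93
y_true = ["Pos"] * 5 + ["Neg"] * 95
y_pred = ["Pos"] * 4 + ["Neg"] + ["Pos"] * 2 + ["Neg"] * 93

print(confusion_matrix(y_true, y_pred, labels=["Pos", "Neg"]))          # [[4 1], [2 93]]
print("Accuracy :", accuracy_score(y_true, y_pred))                     # 0.97
print("Recall   :", recall_score(y_true, y_pred, pos_label="Pos"))      # 0.8
print("Precision:", precision_score(y_true, y_pred, pos_label="Pos"))   # ~0.67
print("F1-score :", f1_score(y_true, y_pred, pos_label="Pos"))          # ~0.73
```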
Sensitivity & Specificity
Sensitivity or Recall (True Positive Rate): In simple terms, the proportion of patients that were identified correctly as having the disease (i.e., true positives) out of the total number of patients who actually have the disease is called sensitivity or recall.
Sensitivity = TP / (TP + FN)
Specificity (True Negative Rate): When it's actually no, how often does it predict no? It is the proportion of negative tuples that are correctly identified. Similarly, the proportion of patients that were identified correctly as not having the disease (i.e., true negatives) out of the total number of patients who do not have the disease is called specificity.
Specificity = TN / (TN + FP)
• False Positive Rate: When it's actually no, how often does it predict yes?
False Positive Rate = FP / (FP + TN) = 1 − Specificity
Receiver Operating Characteristic (ROC) Curve
• Receiver operating characteristic (ROC) curve can be used to understand the overall worth of a
classification model at various thresholds settings.
• ROC curve is a plot between sensitivity (true positive rate) in the vertical axis and
1 – specificity (false positive rate) in the horizontal axis.
• ROC curves are a useful visual tool for comparing two classification models.

• In a ROC curve, a higher X-axis value indicates a higher number of false positives than true negatives, while a higher Y-axis value indicates a higher number of true positives than false negatives. So the choice of the threshold depends on the ability to balance between false positives and false negatives.
Plotting ROC curve
• There are five positive tuples and five negative tuples.
(Figure: the ROC curve built point by point from the ranked tuples; the plotted points are labeled A through I.)
Area Under the Curve (AUC)
The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish
between classes and is used as a summary of the ROC curve. Higher the AUC, better the model
is at predicting 0s as 0s and 1s as 1s.
• When AUC = 1, the classifier is able to perfectly distinguish between all the positive and the negative class points correctly.
• When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values.
• When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between the positive class and the negative class; this is the worst situation.
Classifier Comparison
The AUC can be used to compare the performance of two or more classifiers. A single
threshold can be selected, and the classifiers’ performance at that point compared, or the
overall performance can be compared by considering the AUC.

ROC curves of two classification models, M1 and M2. The diagonal shows where, for every
true positive, we are equally likely to encounter a false positive. The closer an ROC curve is
to the diagonal line, the less accurate the model is. Thus, M1 is more accurate here.
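A hedged sketch of computing an ROC curve and AUC with scikit-learn; the synthetic data and the logistic regression model are placeholders for any classifier that outputs class probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative binary-classification data (not from the slides)
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points on the ROC curve
print("AUC:", roc_auc_score(y_test, scores))      # closer to 1 = better separation
```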
Happy Analyzing
Naive Bayes Classifier
Ramesh Kandela
[email protected]
Naive Bayes
• The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for
classification tasks, like text classification.
• Naive Bayes classification is based on Bayes’ theorem. They can predict class membership
probabilities such as the probability that a given tuple belongs to a particular class.
• Bayes’ Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes’ theorem is stated mathematically as the following
equation:
P(A|B) = [P(B|A) × P(A)] / P(B)
• Using Bayes theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not
affect the other. Hence it is called naive.
Bayes Theorem
P(A|B) = [P(B|A) × P(A)] / P(B)
• Basically, we are trying to find probability of event A, given the event B is true. Event B is
also termed as evidence.
• P(A) is the priori of A (the prior probability, i.e. Probability of event before evidence is
seen). The evidence is an attribute value of an unknown instance(here, it is event B).
• P(A|B) is the posterior probability of A given B, i.e., the probability of the event after the evidence is seen.
• P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

Bayes theorem can be rewritten as:
P(y|X) = [P(X|y) × P(y)] / P(X)
• P(y|x) is the posterior probability of class (y, target) given predictor (x, attributes).
• P(y) is the prior probability of class.
• P(x|y) is the likelihood which is the probability of the predictor given class.
• P(x) is the prior probability of the predictor.
Naïve Bayesian Classification
Dependent variable Purchased (Y), has two distinct values (namely,( yes, no)) and
independent variables (X) are age, income, student, and credit rating.
• The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

• The variable y is the dependent variable (Purchased), which represents whether the customer purchases or not given the conditions. The variable X represents the independent variables/features.
• X is given as X = (x1, x2, …, xn).
• Here x1, x2, …, xn represent the independent variables/features, i.e., they can be mapped to Age, Income, Credit_rating and Student. According to the above example, Bayes theorem can be rewritten as:
P(y | x1, …, xn) = [P(x1|y) × P(x2|y) × … × P(xn|y) × P(y)] / [P(x1) × P(x2) × … × P(xn)]
Naive Bayes Classifiers
• Now, we can obtain the values for each by looking at the dataset and substitute them into
the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:
P(y | x1, …, xn) ∝ P(y) × P(x1|y) × P(x2|y) × … × P(xn|y)
• Now, we need to create a classifier model. For this, we find the probability of given set of
inputs for all possible values of the class variable y and pick up the output with maximum
probability. This can be expressed mathematically as:
y = argmax_y  P(y) × P(x1|y) × P(x2|y) × … × P(xn|y)
• Using the above function, we can obtain the class, given the predictors.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is
considered “naïve.”
Types of Naive Bayes Classifier
• Gaussian Naive Bayes (GaussianNB): GaussianNB is used in classification tasks and assumes that feature values follow a Gaussian distribution, i.e., a normal distribution.
• Multinomial Naïve Bayes (MultinomialNB): This type of Naïve Bayes classifier assumes that the
features are from multinomial distributions. This variant is useful when using discrete data, such
as frequency counts, and it is typically applied within natural language processing use cases, like
spam classification.
• Bernoulli Naïve Bayes (BernoulliNB): This is another variant of the Naïve Bayes classifier, which is
used with Boolean variables—that is, variables with two values, such as True and False or 1 and 0.
The naive Bayes classification procedure is as follows:
• Calculate the prior probability for the given class labels.
• For class C1, estimate the individual conditional probabilities for each predictor, P(xj | C1).
• Multiply these probabilities by each other, then multiply by the proportion of records belonging to class C1 (the prior probability for the class).
• Repeat Steps 2 and 3 for all the classes.
• Assign the record to the class with the highest probability for this set of predictor values.
Identify the class for the given X = (age = youth, income = medium, student = yes, credit rating = fair)

Id  Age          Income  Student  Credit_rating  Purchased
1   Youth        No - High    No       Fair           No
2   Youth        High    No       Excellent      No
3   Middle_aged  High    No       Fair           Yes
4   Senior       Medium  No       Fair           Yes
5   Senior       Low     Yes      Fair           Yes
6   Senior       Low     Yes      Excellent      No
7   Middle_aged  Low     Yes      Excellent      Yes
8   Youth        Medium  No       Fair           No
9   Youth        Low     Yes      Fair           Yes
10  Senior       Medium  Yes      Fair           Yes
11  Youth        Medium  Yes      Excellent      Yes
12  Middle_aged  Medium  No       Excellent      Yes
13  Middle_aged  High    Yes      Fair           Yes
14  Senior       Medium  No       Excellent      No
15  Youth        Medium  Yes      Fair           ?

P(y): P(buys_computer = “yes”) = 9/14 = 0.643
      P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|y) for each class:
P(age = Youth | buys_computer = “yes”) = 2/9 = 0.222
P(age = Youth | buys_computer = “no”) = 3/5 = 0.6
P(income = Medium | buys_computer = “yes”) = 4/9 = 0.444
P(income = Medium | buys_computer = “no”) = 2/5 = 0.4
P(student = Yes | buys_computer = “yes”) = 6/9 = 0.667
P(student = Yes | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = Fair | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = Fair | buys_computer = “no”) = 2/5 = 0.4
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|y) × P(y):
P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
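The same calculation can be scripted. The sketch below (not from the original slides) recomputes the priors and conditional probabilities with pandas and reproduces the 0.028 vs 0.007 comparison.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":      ["Youth", "Youth", "Middle_aged", "Senior", "Senior", "Senior", "Middle_aged",
                 "Youth", "Youth", "Senior", "Youth", "Middle_aged", "Middle_aged", "Senior"],
    "Income":   ["High", "High", "High", "Medium", "Low", "Low", "Low",
                 "Medium", "Low", "Medium", "Medium", "Medium", "High", "Medium"],
    "Student":  ["No", "No", "No", "No", "Yes", "Yes", "Yes",
                 "No", "Yes", "Yes", "Yes", "No", "Yes", "No"],
    "Credit":   ["Fair", "Excellent", "Fair", "Fair", "Fair", "Excellent", "Excellent",
                 "Fair", "Fair", "Fair", "Excellent", "Excellent", "Fair", "Excellent"],
    "Purchased":["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

new_x = {"Age": "Youth", "Income": "Medium", "Student": "Yes", "Credit": "Fair"}

scores = {}
for label, group in data.groupby("Purchased"):
    prior = len(group) / len(data)               # P(y)
    likelihood = 1.0
    for feature, value in new_x.items():         # product of P(xi | y)
        likelihood *= (group[feature] == value).mean()
    scores[label] = prior * likelihood

print(scores)                                            # {'No': ~0.007, 'Yes': ~0.028}
print("Predicted class:", max(scores, key=scores.get))   # Yes
```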

Identify the class for the given X = (age = Middle_aged, income = medium, student = yes, credit rating = fair)
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
P(X|y) = P(x1|y) × P(x2|y) × … × P(xn|y), so a single zero factor makes the whole product zero.
Identify the class for the given X = (age = Middle_aged, income = medium, student = yes, credit rating = fair)
P(age=Middle_aged/Purchased=No)=0/5=0
P(age=Youth/Purchased=No)=3/5=.6
P(age=Senior/Purchased=No)=2/5=0.4

Use the Laplacian correction (or Laplacian estimator): add 1 to each count. The denominator increases by the number of categories of the attribute (here 3 age values), so 5 becomes 8.
P(age=Middle_aged/Purchased=No)=1/8=0.125
P(age=Youth/Purchased=No)=4/8=.5
P(age=Senior/Purchased=No)=3/8=0.375

• The “corrected” prob. estimates are close to their “uncorrected” counterparts


Predicting class label using Naïve Bayes for given test data
today = (Sunny, Hot, Normal, False)

OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY GOLF
1 Rainy Hot High FALSE No
2 Rainy Hot High TRUE No
3 Overcast Hot High FALSE Yes
4 Sunny Mild High FALSE Yes
5 Sunny Cool Normal FALSE Yes
6 Sunny Cool Normal TRUE No
7 Overcast Cool Normal TRUE Yes
8 Rainy Mild High FALSE No
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier


• First, the naive Bayes classifier requires a very large number of records to obtain good
results.
• Second, where a predictor category is not present in the training data, naive Bayes assumes
that a new record with that category of the predictor has zero probability.
Applications of the Naïve Bayes classifier
• Spam filtering: Spam classification is one of the most popular applications of Naïve Bayes
cited in literature.
• Document classification: Document and text classification go hand in hand. Another
popular use case of Naïve Bayes is content classification. Imagine the content categories of
a News media website. All the content categories can be classified under a topic taxonomy
based on the each article on the site.
• Sentiment analysis: While this is another form of text classification, sentiment analysis is
commonly leveraged within marketing to better understand and quantify opinions and
attitudes around specific products and brands.
Happy Analyzing
Decision Trees

Ramesh Kandela
[email protected]

“Nothing is particularly hard if you divide it into small jobs”.— Henry Ford
Decision Tree
• A decision tree is a non-parametric supervised learning algorithm, which is utilized for both
classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node,
branches, internal nodes and leaf nodes.
• Decision Tree is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are
the output of those decisions and do not contain any further branches.

Why Decision Trees?


•Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
•The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
• Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child
nodes.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf
node (Terminal Node). A leaf node has a parent node. It does not have child nodes.
• The root node represents the top of the tree. It does not have a parent node, however, it has different child
nodes.
• Branch nodes/ sub-nodes are in the middle of the tree. A branch node has a parent node and several child nodes.

Pruning: Pruning is the process of removing the unwanted branches from the tree.
Decision Tree Example
Training data set: Buys_computer

Age          Income  Student  Credit rating  Purchased
Youth        High    No       Fair           No
Youth        High    No       Excellent      No
Middle_aged  High    No       Fair           Yes
Senior       Medium  No       Fair           Yes
Senior       Low     Yes      Fair           Yes
Senior       Low     Yes      Excellent      No
Middle_aged  Low     Yes      Excellent      Yes
Youth        Medium  No       Fair           No
Youth        Low     Yes      Fair           Yes
Senior       Medium  Yes      Fair           Yes
Youth        Medium  Yes      Excellent      Yes
Middle_aged  Medium  No       Excellent      Yes
Middle_aged  High    Yes      Fair           Yes
Senior       Medium  No       Excellent      No
Youth        Medium  Yes      Fair           ?   (classified as Yes by the tree)

Resulting tree:
age?
  - youth → student? (no → No, yes → Yes)
  - Middle_aged → Yes
  - Senior → credit rating? (excellent → No, fair → Yes)
Types of Decision Trees
• ID3: (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. This algorithm leverages entropy and
information gain as metrics to evaluate candidate splits. This algorithm iteratively divides features into two or
more groups at each step. ID3 follows a top-down approach i.e the tree is built from the top and at every step
greedy approach is applied. The greedy approach means that at each iteration we select the best feature at the
present moment to create a node and this node is again split after applying some statistical
methods. ID3 generally is not a very ideal algorithm as it overfits when the data is continuous.
• C4.5: It is considered to be better than the ID3 algorithm as it can handle both discrete and continuous data.
In C4.5 splitting is done based on Information gain and the feature with the highest Information gain is made the
decision node and is further split. C4.5 handles overfitting by the method of pruning i.e it removes the
branches/subpart of the tree that does not hold much importance (or) is redundant.
• C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets
than C4.5 while being more accurate.
• CHAID: CHAID stands for Chi-square Automatic Interaction Detector. In CHAID chi-square is the attribute selection
measure to split the nodes when it's a classification-based use case and uses F-test as an attribute selection
measure when it is a regression-based use case. Higher the chi-square value higher is the preference given to that
feature.
Types of Decision Trees
Classification and Regression Trees
• Classification and Regression Trees or CART introduced by Leo Breiman to refer to Decision
Tree algorithms and it is very similar to C4.5, but it differs in that it supports numerical target
variables (regression).
• As the name suggests CART can also perform both classification and regression-based tasks.
• Classification and Regression Tree (CART) is a common terminology that is used for a
Classification Tree (used when the dependent variable is discrete) and a Regression Tree (used
when the dependent variable is continuous).
• CART uses Gini’s impurity index as an attribute selection method while splitting a node into
further nodes when it's a classification-based use case and uses sum squared error as an
attribute selection measure when the use case is regression-based.
• The CART algorithm provides a foundation for important algorithms like random forest ,bagged
decision trees and boosted decision trees.
Steps in decision trees
The basic idea behind any decision tree algorithm is as follows:
• Select the root node which is the best attribute using Attribute Selection Measures (ASM) to split
the records.
• The root node is then split into two or more subsets that contains possible values for the best
attributes using the ASM. Nodes thus created are known as internal nodes. Each internal node has
exactly one incoming edge.
• Further divide each internal node until no further splitting is possible or the stopping criterion is
met. The terminal nodes (leaf nodes) will not have any outgoing edges.
• Terminal nodes are used for generating business rules.
• Tree pruning is used to avoid large trees and overfitting the data. Tree pruning is achieved through
different stopping criteria.
Attribute Selection Measures
Here is a list of some attribute selection measures.
• Gini index
• Entropy
• Information gain
• Gain Ratio
• Reduction in Variance
• Chi-Square
Gini Index
• It is a measure of purity or impurity while creating a decision tree. It is calculated by
subtracting the sum of the squared probabilities of each class from one. CART ( Classification
and regression tree ) uses the Gini index as an attribute selection measure to select the best
attribute/feature to split.
• The attribute with a lower Gini index is used as the best attribute to split.
The Gini index is defined by Gini(D) = 1 − Σ pi², where pi is the proportion of records in D that belong to class i.
Select the root node
Class frequencies for Age (using the Buys_computer training data above):

Age          Yes  No  Total
Youth        2    3   5
Middle_aged  4    0   4
Senior       3    2   5
Total                 14

• Gini index for Age
• Gini(Age = Youth) = 1 − (2/5)² − (3/5)² = 1 − 0.16 − 0.36 = 0.48
• Gini(Age = Middle_aged) = 1 − (4/4)² − (0/4)² = 0
• Gini(Age = Senior) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48
• Weighted sum of Gini indexes for the Age feature:
• Gini(Age) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
Class frequencies for the remaining attributes:

Income   Yes  No  Total
High     2    2   4
Medium   4    2   6
Low      3    1   4
Total             14

Student  Yes  No  Total
No       3    4   7
Yes      6    1   7
Total             14

Credit rating  Yes  No  Total
Fair           6    2   8
Excellent      3    3   6
Total                   14

• Gini index for Income
• Gini(Income = High) = 1 − (2/4)² − (2/4)² = 0.5
• Gini(Income = Medium) = 1 − (4/6)² − (2/6)² = 1 − 0.444 − 0.111 = 0.445
• Gini(Income = Low) = 1 − (3/4)² − (1/4)² = 1 − 0.5625 − 0.0625 = 0.375
• Weighted sum for the Income feature:
• Gini(Income) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439

• Gini index for Student
• Gini(Student = No) = 1 − (3/7)² − (4/7)² = 1 − 0.183 − 0.326 = 0.489
• Gini(Student = Yes) = 1 − (6/7)² − (1/7)² = 1 − 0.734 − 0.02 = 0.244
• Weighted sum for the Student feature:
• Gini(Student) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367

• Gini index for Credit rating
• Gini(Credit rating = Fair) = 1 − (6/8)² − (2/8)² = 1 − 0.5625 − 0.062 = 0.375
• Gini(Credit rating = Excellent) = 1 − (3/6)² − (3/6)² = 1 − 0.25 − 0.25 = 0.5
• Weighted sum for the Credit rating feature:
• Gini(Credit rating) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428

Summary:
Gini index for Age = 0.342
Gini index for Income = 0.439
Gini index for Student = 0.367
Gini index for Credit rating = 0.428

The top of the tree will be the Age feature because its cost is the lowest.
First split on Age:
age?
  - youth
  - Middle_aged
  - Senior

Youth branch (sub dataset):
Age    Income  Student  Credit rating  Purchased
Youth  High    No       Fair           No
Youth  High    No       Excellent      No
Youth  Medium  No       Fair           No
Youth  Low     Yes      Fair           Yes
Youth  Medium  Yes      Excellent      Yes

Senior branch (sub dataset):
Age     Income  Student  Credit rating  Purchased
Senior  Medium  No       Fair           Yes
Senior  Low     Yes      Fair           Yes
Senior  Low     Yes      Excellent      No
Senior  Medium  Yes      Fair           Yes
Senior  Medium  No       Excellent      No

Middle_aged branch (sub dataset):
Age          Income  Student  Credit rating  Purchased
Middle_aged  High    No       Fair           Yes
Middle_aged  Low     Yes      Excellent      Yes
Middle_aged  Medium  No       Excellent      Yes
Middle_aged  High    Yes      Fair           Yes

• The Middle_aged branch has only Yes decisions, so this branch becomes a leaf labeled Yes and is not split further.
• Apply the same principles to the sub datasets for the Youth and Senior branches. We need to find the Gini index scores for Income, Student and Credit rating in each branch.

Decision for the Youth branch:
• Gini index for Income = 0.2
• Gini index for Student = 0
• Gini index for Credit rating = 0.466
• Student is chosen as the next split for the Youth branch because it has the lowest value.

Decision for the Senior branch:
• Gini index for Income = 0.466
• Gini index for Student = 0.466
• Gini index for Credit rating = 0
• Credit rating is chosen as the next split for the Senior branch because it has the lowest value.
Final Decision Tree

age?
  - youth → student?
      - no  → No   (Youth/High/No/Fair, Youth/High/No/Excellent, Youth/Medium/No/Fair)
      - yes → Yes  (Youth/Low/Yes/Fair, Youth/Medium/Yes/Excellent)
  - Middle_aged → Yes
  - Senior → credit rating?
      - excellent → No   (Senior/Low/Yes/Excellent, Senior/Medium/No/Excellent)
      - fair      → Yes  (Senior/Medium/No/Fair, Senior/Low/Yes/Fair, Senior/Medium/Yes/Fair)

Identify the class for the given X = (age = youth, income = medium, student = yes, credit rating = fair):
Following the tree (youth → student = yes), X belongs to class Purchased = Yes.
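For comparison, a hedged scikit-learn sketch that fits a CART tree (criterion="gini") to the same data. scikit-learn needs numeric inputs, so the categorical predictors are one-hot encoded here, and CART builds binary splits, so the printed tree will not look exactly like the multi-way tree above, but it classifies the query the same way.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("Youth", "High", "No", "Fair", "No"),            ("Youth", "High", "No", "Excellent", "No"),
    ("Middle_aged", "High", "No", "Fair", "Yes"),     ("Senior", "Medium", "No", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Fair", "Yes"),          ("Senior", "Low", "Yes", "Excellent", "No"),
    ("Middle_aged", "Low", "Yes", "Excellent", "Yes"),("Youth", "Medium", "No", "Fair", "No"),
    ("Youth", "Low", "Yes", "Fair", "Yes"),           ("Senior", "Medium", "Yes", "Fair", "Yes"),
    ("Youth", "Medium", "Yes", "Excellent", "Yes"),   ("Middle_aged", "Medium", "No", "Excellent", "Yes"),
    ("Middle_aged", "High", "Yes", "Fair", "Yes"),    ("Senior", "Medium", "No", "Excellent", "No"),
]
data = pd.DataFrame(rows, columns=["Age", "Income", "Student", "Credit_rating", "Purchased"])

# One-hot encode the categorical predictors because sklearn trees need numeric inputs
X = pd.get_dummies(data.drop(columns="Purchased"))
y = data["Purchased"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the learned rules

# Classify X = (age = youth, income = medium, student = yes, credit rating = fair)
query = pd.DataFrame([("Youth", "Medium", "Yes", "Fair")],
                     columns=["Age", "Income", "Student", "Credit_rating"])
query_enc = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(tree.predict(query_enc))                             # expected: ['Yes']
```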
Entropy
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data.
Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S
Entropy can be calculated as:
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where,
•S= Total number of samples
•P(yes)= probability of yes
•P(no)= probability of no
For a binary classification problem
• If all examples are positive or all are negative then entropy will be zero i.e, low.
• If half of the examples are of positive class and half are of negative class then entropy is one i.e,
high.
The lesser the entropy the better will be the model because the classes would be split better
because of less uncertainty.
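A tiny helper (illustrative only, not from the original slides) that computes Shannon entropy from class counts; for the Buys_computer target (9 yes, 5 no) it gives roughly 0.940.

```python
import math

def entropy(class_counts):
    """Shannon entropy of a class distribution, e.g. [9, 5] for 9 'yes' and 5 'no'."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([9, 5]))   # ~0.940: the Buys_computer target (9 yes, 5 no)
print(entropy([7, 7]))   # 1.0: maximum impurity for a 50/50 split
print(entropy([14, 0]))  # 0.0: a pure node
```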
Information Gain
• ID3 uses information gain as its attribute selection measure.
• Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
• It calculates how much information a feature provides us about a class.
• A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first.
• It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Gain ratio
• C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain that
reduces its bias and is usually the best option. Gain ratio overcomes the problem with information
gain by taking into account the number of branches that would result before making the split. It
corrects information gain by taking the intrinsic information of a split into account.
Reduction in Variance
• Reduction in variance is an algorithm used for continuous target variables (regression
problems). This algorithm uses the standard formula of variance to choose the best split. The
split with lower variance is selected as the criterion to split the population:
Variance = Σ (X − X̄)² / n, where X̄ is the mean of the values and n is the number of values.
Chi-Square
• Chi-Square is a comparison between observed results and expected results. This statistical
method is used in CHAID(Chi-square Automatic Interaction Detector). CHAID in itself is a very
old algorithm that is not used much these days. The higher the value of chi-square, higher is
the difference between the current node and its parent.
• The formula for chi-square: χ² = Σ (Observed − Expected)² / Expected
• The higher the chi-square value, the higher the preference given to that feature.
Avoiding Overfitting of Decision Trees
• Decision Trees are prone to over-fitting.
• One danger in growing deeper trees on the training data is overfitting. Overfitting will lead
to poor performance on new data.
• Overfitting refers to the condition when the model completely fits the training data but fails
to generalize the testing unseen data. Overfit condition arises when the model memorizes
the noise of the training data and fails to capture important patterns.
• A perfectly fit decision tree performs well for training data but performs poorly for unseen
test data.
The following are some of the commonly used techniques to avoid overfitting:
• Pruning
Pre-pruning
Post-pruning
• Ensemble – Random Forest
• Cross Validation
Cross-Validation
• Cross-Validation is a resampling technique with the fundamental idea of splitting the
dataset into 2 parts- training data and test data. Train data is used to train the model and
the unseen test data is used for prediction. If the model performs well over the test data
and gives good accuracy, it means the model hasn’t overfitted the training data and can be
used for prediction.
• Cross-validation can be used to estimate the test error associated with a given statistical
learning method in order to evaluate its performance, or to select the appropriate level of
flexibility.
• The process of evaluating a model’s performance is known as model assessment, whereas
the process of selecting the proper level of flexibility for a model is known as model
selection.
• Cross-validation is a statistical method used to estimate the performance (or accuracy) of
machine learning models. It is used to protect against overfitting in a predictive model,
particularly in a case where the amount of data may be limited.
• In cross-validation, you make a fixed number of folds (or partitions) of the data, run the
analysis on each fold, and then average the overall error estimate.
Methods used for Cross-Validation
The following are some common methods that are used for cross-validation.
• Validation Set Approach
• Leave-P-out cross-validation
• Leave one out cross-validation
• K-fold cross-validation
• Stratified k-fold cross-validation
K-Fold Cross-Validation
• K-fold cross-validation is a robust validation approach that can be adopted to verify if the
model is Overfitting.
• K-fold cross-validation is resampling technique, the whole data is divided into k sets of
almost equal sizes. The first set is selected as the test set and the model is trained on the
remaining k-1 sets. The test error rate is then calculated after fitting the model to the
test data.
• Cross-validation is a resampling procedure used to evaluate machine learning models on
a limited data sample.
Steps for K-fold cross-validation (iris dataset)
1.Split the dataset into K equal partitions (or "folds")
1. So if k = 5 and dataset has 150 observations
2. Each of the 5 folds would have 30 observations
2.Use fold 1 as the testing set and the union of the
other folds as the training set
1. Testing set = 30 observations (fold 1)
2. Training set = 120 observations (folds 2-5)
3.Calculate testing accuracy
4.Repeat steps 2 and 3 K times, using a different
fold as the testing set each time
5.Use the average testing accuracy as the estimate of
out-of-sample accuracy
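A minimal sketch of these steps with scikit-learn's cross_val_score on the iris dataset (the decision tree model is just an example estimator).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Iris: 150 observations, so k = 5 gives five folds of 30 observations each
X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation

print("Accuracy per fold:", scores)
print("Average accuracy :", scores.mean())
```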
Pruning
• Pruning is a process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.
• Pruning means to change the model by deleting the child nodes of a branch node.
• A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning.
• There are two common approaches to tree pruning: Pre pruning and Post pruning.
Pre pruning
• In the Pre pruning approach, a tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training tuples at a given node). Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the
subset tuples or the probability distribution of those tuples.
• When constructing a tree, measures such as statistical significance, information gain, Gini
index, and so on, can be used to assess the goodness of a split. If partitioning the tuples at a
node would result in a split that falls below a prespecified threshold, then further partitioning
of the given subset is halted. There are difficulties, however, in choosing an appropriate
threshold. High thresholds could result in oversimplified trees, whereas low thresholds could
result in very little simplification.
Post pruning
• The second and more common approach is post-pruning, which removes subtrees from a
“fully grown” tree. A subtree at a given node is pruned by removing its branches and
replacing them with a leaf. The leaf is labeled with the most frequent class among the
subtree being replaced.
• The cost complexity pruning algorithm used in CART is an example of the post pruning
approach. This approach considers the cost complexity of a tree to be a function of the
number of leaves in the tree and the error rate of the tree (where the error rate is the
percentage of tuples misclassified by the tree). It starts from the bottom of the tree.
• For each internal node, N, it computes the cost complexity of the subtree at N, and the
cost complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node).
The two values are compared. If pruning the subtree at node N would result in a smaller
cost complexity, then the subtree is pruned. Otherwise, it is kept.
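A hedged sketch of cost-complexity (post-)pruning using scikit-learn's ccp_alpha parameter; the breast-cancer dataset is only a stand-in, and the alpha search shown here is a simple cross-validated loop rather than a tuned procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from rounding
    score = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                            X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best alpha:", best_alpha, "cross-validated accuracy:", best_score)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("Leaves after pruning:", pruned_tree.get_n_leaves())
```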
Advantages and Disadvantages of Decision Trees
Advantages
• It is simple to understand, as it follows the same process a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning than many other algorithms.
Disadvantages
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be addressed using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may increase.
Happy Analyzing
Ensemble Methods
Ramesh Kandela
[email protected]
• Bagging
• Random Forest
• Boosting
Ensemble Methods
• Ensemble methods is a machine learning technique that combines several base models in
order to produce one optimal predictive model.
• The goal of ensemble methods is to combine the predictions of several base estimators
built with a given learning algorithm in order to improve generalizability / robustness over
a single estimator.
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an
improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of classifiers
• Boosting: weighted vote with a collection of classifiers
Bagging
• Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly
used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is
selected with replacement—meaning that the individual data points can be chosen more than once.
After several data samples are generated, these weak models are then trained independently, and
depending on the type of task—regression or classification, for example—the average or majority of
those predictions yield a more accurate estimate.
• The same training algorithm is used for every predictor, but each predictor is trained on a different random subset of the training set. When sampling is performed with replacement, this method is called bagging. When sampling is performed without replacement, it is called pasting.
• Bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees.
How bagging works
The bagging algorithm, which has three basic steps:
1.Bootstrapping: Bagging leverages a bootstrapping sampling
technique to create diverse samples. This resampling method
generates different subsets of the training dataset by
selecting data points at random and with replacement. This
means that each time you select a data point from the training dataset, you are able to select the same instance multiple times. As a result, a value or instance may be repeated twice (or more) in a sample.
2. Parallel training: These bootstrap samples are then trained independently and in parallel with
each other using weak or base learners.
3. Aggregation: Finally, depending on the task (i.e. regression or classification), an average or a
majority of the predictions are taken to compute a more accurate estimate. In the case of
regression, an average is taken of all the outputs predicted by the individual classifiers; this is
known as soft voting. For classification problems, the class with the highest majority of votes is
accepted; this is known as hard voting or majority voting.
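A minimal bagging sketch with scikit-learn's BaggingClassifier follows; its default base learner is a decision tree, and the dataset, number of estimators, and other settings here are assumptions made only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

bag = BaggingClassifier(
    n_estimators=100,   # number of bootstrap samples / base decision trees
    bootstrap=True,     # sample with replacement (bagging rather than pasting)
    random_state=1,
)
bag.fit(X_train, y_train)         # base learners are trained independently on their samples
print(bag.score(X_test, y_test))  # classification: hard (majority) voting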
Out-of-Bag Error Estimation
• Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring
the prediction error of bagging random forests, and boosted decision trees.
• Bagging uses subsampling with replacement to create training samples for the model to learn from. On average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.
• In order to obtain a single prediction for the ith observation, we can average these
predicted responses (if regression is the goal) or can take a majority vote (if classification is
the goal). The OOB MSE (for a regression problem) or classification error (for a classification problem) can then be computed.
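The OOB idea can be sketched by setting oob_score=True, which asks scikit-learn to score each observation using only the base learners that did not see it during bootstrapping; the dataset and settings are again illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(bag.oob_score_)   # OOB accuracy; OOB error = 1 - bag.oob_score_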
Random Forests
• Random forest, like its name implies, consists of a large number of individual decision trees
that operate as an ensemble. Each individual tree in the random forest spits out a class
prediction and the class with the most votes becomes our model’s prediction
• Random forests provide an improvement over bagged trees by way of a small tweak that
decorrelates the trees.
• In random forests (see RandomForestClassifier and RandomForestRegressor classes), each
tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap
sample) from the training set.
• The random forest algorithm is an extension of the bagging method, as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or "the random subspace method", generates a random subset of features for each tree, which ensures low correlation among the decision trees.
How it works
• Random forest algorithms have three main hyperparameters, which need to be set before
training. These include node size, the number of trees, and the number of features sampled.
From there, the random forest classifier can be used to solve for regression or classification
problems.
• The random forest algorithm is made up of a collection of decision trees, and each tree in
the ensemble is comprised of a data sample drawn from a training set with replacement,
called the bootstrap sample.
• Of that training sample, one-third of it is set aside as test data, known as the out-of-bag
(oob) sample. Another instance of randomness is then injected through feature bagging,
adding more diversity to the dataset and reducing the correlation among decision trees.
• Depending on the type of problem, the determination of the prediction will vary. For a
regression task, the individual decision trees will be averaged, and for a classification task, a
majority vote—i.e. the most frequent categorical variable—will yield the predicted class.
Finally, the oob sample is then used for cross-validation, finalizing that prediction.
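A minimal random forest sketch is shown below; the hyperparameter values are illustrative, with n_estimators, max_features, and min_samples_leaf standing in for the number of trees, the features sampled per split, and the node size mentioned above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees
    max_features="sqrt",   # feature bagging: features considered at each split
    min_samples_leaf=1,    # node size
    oob_score=True,        # use the out-of-bag sample for validation
    random_state=2,
)
rf.fit(X_train, y_train)
print(rf.oob_score_)              # out-of-bag accuracy
print(rf.score(X_test, y_test))   # majority-vote accuracy on the held-out data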
Variable importance
• Random forests can be used to rank the importance of variables in a regression or
classification problem.
• A variable importance plot for the Heart data. Variable importance is computed using the
mean decrease in Gini index, and expressed relative to the maximum.
Variable Mean Decrease Gini
ChestPain 14.9746278
MaxHR 13.9973942
Thal 12.3107004
Oldpeak 10.764543
Ca 9.3101672
Age 9.2581531
Chol 7.8603037
RestBP 7.5728805
Sex 5.1902859
ExAng 5.0879835
Slope 4.8899441
RestECG 1.6251977
Fbs 0.7974986
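In scikit-learn, a comparable ranking comes from a fitted forest's feature_importances_ attribute (mean decrease in impurity, normalised to sum to 1). Since the Heart data in the table above is not bundled with scikit-learn, the sketch below uses a built-in dataset purely for illustration.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(data.data, data.target)

# Rank variables by their mean decrease in Gini impurity
importance = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importance.sort_values(ascending=False).head(10))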
Boosting
• Boosting works in a similar way to bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to allow the subsequent classifier,
Mi+1, to pay more attention to the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where the weight of each
classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting
the model to misclassified data.
Types of Boosting Algorithms
1. AdaBoost (Adaptive Boosting)
2. Gradient Tree Boosting
3. XGBoost (eXtreme Gradient Boosting)
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased; otherwise, it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
  error(Mi) = Σj wj × err(Xj), summed over the d tuples
• The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
How AdaBoost Works
• Step 1: A weak classifier (e.g. a decision stump) is made on top of the training data based on
the weighted samples. Here, the weights of each sample indicate how important it is to be
correctly classified. Initially, for the first stump, we give all the samples equal weights.
• Step 2: We create a decision stump for each variable and see how well each stump classifies
samples to their target classes
• Step 3: More weight is assigned to the incorrectly classified samples so that they're classified
correctly in the next decision stump. Weight is also assigned to each classifier based on the
accuracy of the classifier, which means high accuracy = high weight!
• Step 4: Reiterate from Step 2 until all the data points have been correctly classified, or the
maximum iteration level has been reached.
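A minimal AdaBoost sketch with scikit-learn follows; its default base learner is a decision stump (a depth-1 tree), matching Step 1, while the dataset and the n_estimators/learning_rate values are assumptions for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=3)
ada.fit(X_train, y_train)          # stumps are added sequentially, reweighting misclassified rows
print(ada.score(X_test, y_test))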
Gradient Boosting
• Gradient Boosting – It builds a final model from the sum of several weak learning
algorithms that were trained on the same dataset. It operates on the idea of stagewise
addition.
• The first weak learner in the gradient boosting algorithm will not be trained on the dataset;
instead, it will simply return the mean of the relevant column.
• The residual for the first weak learner algorithm’s output will then be calculated and used as
the output column or target column for the next weak learning algorithm that will be
trained.
• The second weak learner will be trained using the same methodology, and the residuals will
be computed and utilized as an output column once more for the third weak learner, and so
on until we achieve zero residuals.
• The dataset for gradient boosting must be in the form of numerical or categorical data, and the loss function used to generate the residuals must be differentiable at all times.
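The stagewise idea can be sketched with scikit-learn's GradientBoostingRegressor, whose initial prediction is a constant (essentially the mean of the target) and whose later trees are fit to the current residuals; the dataset and hyperparameters below are illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

gbr = GradientBoostingRegressor(
    n_estimators=300,    # number of sequential weak learners (shallow trees)
    learning_rate=0.1,   # shrinks each tree's contribution to the running sum
    max_depth=3,
    random_state=4,
)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))   # R^2 on the held-out data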
XGBoost
• XGBoost (eXtreme Gradient Boosting) is another boosting machine learning approach, an extreme variation of the gradient boosting technique.
• The key distinction between XGBoost and Gradient Boosting is that XGBoost applies a
regularisation approach. It is a regularised version of the current gradient-boosting
technique. Because of this, XGBoost outperforms a standard gradient boosting method.
• Works better when the dataset contains both numerical and categorical variables.
• XGBoost delivers high performance as compared to Gradient Boosting. Its training is very
fast and can be parallelized across clusters.
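A comparable sketch with the separate xgboost package (assumed to be installed) is shown below; reg_lambda is the L2 regularisation term that distinguishes XGBoost from plain gradient boosting, and all values are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=1.0,   # L2 regularisation on the leaf weights
    n_jobs=-1,        # training parallelised across CPU cores
)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))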
Happy Analyzing
Neural Networks
Ramesh Kandela
[email protected]
Deep Learning
• Deep learning attempts to mimic the human brain.
• Deep learning is a part of machine learning, which is in turn part of a larger umbrella called AI.
• Deep Learning is a special type of machine learning consisting of a few algorithms that are designed to learn from large amounts of data or from unstructured data like images, text, and audio. These algorithms are inspired by the way our brain works.
Artificial Intelligence
• Artificial Intelligence (AI): a technique that enables machines to mimic human behavior.
• Machine Learning: a subset of AI that uses statistical methods to enable machines to improve with experience.
• Deep Learning: a subset of ML that makes the computation of multi-layer neural networks feasible.
Neural Networks
• A neural network is a method in artificial intelligence that teaches computers to process data
in a way that is inspired by the human brain. It is a type of machine learning process, called
deep learning, that uses interconnected nodes or neurons in a layered structure that
resembles the human brain.
• Deep learning algorithms like ANN, CNN, RNN, etc.
• Artificial Neural Networks for Regression and Classification
• Convolutional Neural Networks for Computer Vision or Image Processing
• Recurrent Neural Networks for Time Series Analysis
• In a neural network the independent variables are called input cells and the dependent
variable is called output cell.
Biological Neurons
• It is an unusual-looking cell mostly found in
animal cerebral cortexes (e.g., your brain),
composed of a cell body containing the nucleus
and most of the cell’s complex components,
and many branching extensions called
dendrites, plus one very long extension called
the axon.
• The axon’s length may be just a few times longer than the cell body, or up to tens of
thousands of times longer. Near its extremity the axon splits off into many branches called
telodendria, and at the tip of these branches are minuscule structures called synaptic
terminals (or simply synapses), which are connected to the dendrites (or directly to the cell
body) of other neurons. Biological neurons receive short electrical impulses called signals
from other neurons via these synapses. When a neuron receives a sufficient number of
signals from other neurons within a few milliseconds, it fires its own signals.
Biological Neuron
• A human brain has billions of neurons. Neurons are interconnected nerve cells in the human
brain that are involved in processing and transmitting chemical and electrical signals.
• Dendrites are branches that receive information from other neurons.
• Cell nucleus or Soma processes the information received from dendrites.
• Axon is a cable that is used by neurons to send information.
• Synapse is the connection between an axon and other neuron dendrites.
The Perceptron
• The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank
Rosenblatt.
• At the time, it was claimed that the perceptron "may eventually be able to learn, make decisions, and translate languages."
From Biological to Artificial Neurons
• An artificial neuron is a way to mimic the biological neuron, which is the building block of the
human brain.
Biological Neuron vs. Artificial Neuron
Biological Neuron Artificial Neuron
Cell Nucleus (Soma) Node
Dendrites Input
Synapse Weights or interconnections
Axon Output
What Happens inside a Neuron?
Step–1: Inputs are passed as inputs to the Artificial Neuron which are multiplied by weights and a bias is added to
them inside the transfer function.
What “Transfer function” does?
The transfer function creates a weighted sum of all the inputs and adds a constant to it called bias.
W1*X1 + W2*X2 + W3*X3 . . . Wn*Xn + b
Step–2: Then the resultant value is passed to the activation function. The result of the activation function is
treated as the output of the neuron.
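A tiny NumPy sketch of these two steps is shown below; the input, weight, and bias values are made up, and a sigmoid is assumed as the activation function.

import numpy as np

x = np.array([0.5, 0.2, 0.1])     # inputs X1..X3
w = np.array([0.4, -0.6, 0.9])    # weights W1..W3
b = 0.05                          # bias

z = np.dot(w, x) + b              # transfer function: W1*X1 + W2*X2 + W3*X3 + b
output = 1 / (1 + np.exp(-z))     # activation function (sigmoid assumed here)
print(z, output)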
Artificial Neural Network
• When combine multiple neurons together, it creates a vector of neurons called a layer.
• When we combine multiple layers together, where all the neurons are interconnected to each
other, this network of neurons is called, Artificial Neural Network.
History of Artificial Neural Networks
• Artificial neural networks were first introduced in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts in the paper "A Logical Calculus of the Ideas Immanent in Nervous Activity".
• McCulloch and Pitts presented a simplified computational model of how biological neurons
might work together in animal brains to perform complex computations using propositional
logic. This was the first artificial neural network architecture.
• In the early 1980s there was a revival of interest in ANNs, as new network architectures were invented and better training techniques were developed.
• By the 1990s other powerful Machine Learning techniques were invented, such as Support
Vector Machines. These techniques seemed to offer better results and stronger theoretical
foundations than ANNs, so once again the study of neural networks entered a long winter.
Why ANN’s are relevant today?
• There is now a huge quantity of data available to train neural networks, and ANNs frequently
outperform other ML techniques on very large and complex problems.
• The tremendous increase in computing power since the 1990s now makes it possible to train
large neural networks in a reasonable amount of time. This is in part due to Moore’s Law, but
also thanks to the gaming industry, which has produced powerful GPU cards by the millions.
• The training algorithms have been improved. To be fair they are only slightly different from
the ones used in the 1990s, but these relatively small tweaks have a huge positive impact.
• Some theoretical limitations of ANNs have turned out to be benign in practice. For example,
many people thought that ANN training algorithms were doomed because they were likely to
get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the
case, they are usually fairly close to the global optimum).
• ANNs seem to have entered a virtuous circle of funding and progress. Amazing products
based on ANNs regularly make the headline news, which pulls more and more attention and
funding toward them, resulting in more and more progress, and even more amazing
products.
Artificial Neural Network
• Artificial neural networks (ANNs) are comprised of node layers, containing an input layer, one or more hidden layers, and an output layer.
There are three major types of layers in an ANN:
1. The Input layer (interface to accept data)
2. The Hidden layer(s) (actual neurons)
3. The Output layer (actual neurons that output the result)
• Each node, or artificial neuron, connects to another and has an associated weight and
threshold. If the output of any individual node is above the specified threshold value, that
node is activated, sending data to the next layer of the network. Otherwise, no data is passed
along to the next layer of the network.
How do artificial neural networks work?
• Input Layer :This layer accepts input features. It provides information from the outside world
to the network, no computation is performed at this layer, nodes here just pass on the
information(features) to the hidden layer.
• Hidden Layer: Hidden layers take their input from the input layer or other hidden layers.
Artificial neural networks can have a large number of hidden layers. Each hidden layer
analyzes the output from the previous layer, processes it further, and passes it on to the next
layer. Hidden layer performs all sort of computation on the features.
Output Layer: The output layer gives the final
result of all the data processing by the artificial
neural network. It can have single or multiple
nodes. For instance, if we have a binary (yes/no)
classification problem, the output layer will have
one output node, which will give the result as 1 or
0. However, if we have a multi-class classification
problem, the output layer might consist of more
than one output node.
How do artificial neural networks work?
Step–1: Inputs are passed as inputs to the Artificial Neuron which are multiplied by weights
and a bias is added to them inside the transfer function.
• What “Transfer function” does?
• The transfer function creates a weighted sum of all the inputs and adds a constant to it
called bias.
• W1*X1 + W2*X2 + W3*X3 . . . Wn*Xn + b
Step–2: Then the resultant value is passed to the activation function. The result of the
activation function is treated as the output of the neuron.
How do artificial neural networks work?
• To perform this, the below steps takes place after the input data is passed to the neuron.
1.Each input value is multiplied with a small number (close to zero) called Weights
2.All of these values are summed up
3.A number is added to this sum called Bias
4.The above summation is passed to a function called “Activation Function“
5.The Activation function will produce an output based on its equation.
6.The output produced by the neurons in the last layer of the network is treated as the output.
7.If the output produced by the neurons does not match with the actual answer, then the error signals
are sent back in the network which adjusts the weights and the bias values such that the output
comes closer to the actual value this process is known as Backpropagation.
8. Steps 1-7 are repeated until the difference between the actual value and the predicted value (the output of the neuron) becomes zero or almost zero and there is no further improvement. This situation is known as convergence, the point at which we say the algorithm has reached its maximum possible accuracy.
Activation Function
• An Activation Function decides whether a neuron should be activated or not. This means
that it will decide whether the neuron’s input to the network is important or not in the
process of prediction using simpler mathematical operations.
• The role of the Activation Function is to derive output from a set of input values fed to a
node (or a layer).
• The primary role of the Activation Function is to transform the summed weighted input
from the node into an output value to be fed to the next hidden layer or as output. that’s
why it’s often referred to as a Transfer Function in Artificial Neural Network.
Sigmoid or Logistic Activation Function
• It is the famous S shaped function that transforms the input values into a range
between 0 and 1.
• Sigmoid function gives an ‘S’ shaped curve. In order to map predicted values to
probabilities, we use the sigmoid function. The function maps any real value into
another value between 0 and 1.
Tanh or hyperbolic tangent Activation Function
• Tanh is also like logistic sigmoid but better. The range of the tanh function is from (-
1 to 1). tanh is also sigmoidal (s - shaped).
ReLU (Rectified Linear Unit) Activation Function
• The ReLU is the most used activation function in the world right now. Since it is used in
almost all the convolutional neural networks or deep learning.
Range: [ 0 to infinity)
Linear or Identity Activation Function
• It takes the inputs, multiplied by the weights for each neuron, and creates
an output signal proportional to the input.
• Range : (-infinity to infinity)
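The four activation functions above can be written in a few lines of NumPy, as in this illustrative sketch.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # range (0, 1)

def tanh(z):
    return np.tanh(z)             # range (-1, 1)

def relu(z):
    return np.maximum(0, z)       # range [0, infinity)

def linear(z):
    return z                      # identity: range (-infinity, infinity)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, linear):
    print(f.__name__, f(z))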
Which activation function to use?
• Sigmoid functions and their combinations generally work better in the case
of classification problems.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
• Tanh is avoided most of the time due to dead neuron problem.
• ReLU activation function is widely used and is default choice as it yields better results.
• If we encounter a case of dead neurons in our networks the leaky ReLU function is the
best choice.
• ReLU function should only be used in the hidden layers.
• The output layer can use a linear activation function in the case of regression problems.
Cost Function
and
Gradient Descent
Cost Function
• A Cost Function is used to measure just how wrong the model is in finding a relation between the input and output.
• Cost Function quantifies the error between predicted values and expected values and presents it in
the form of a single real number.
• Cost function is a measure of "how good" a neural network did with respect to its given training sample and the expected output. It also may depend on variables such as weights and biases.
• A cost function is a single value, not a vector, because it rates how good the neural network did as a
whole. Specifically, a cost function is of the form C(W,B,Sr,Er)
• Where W is our neural network's weights, B is our neural network's biases, Sr is the input of a single
training sample, and Er is the desired output of that training sample.
Types of the cost function
• Regression cost Function
• Binary Classification cost Functions
• Multi-class Classification cost Functions
Gradient Descent
• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable
function. Gradient descent is simply used in machine learning to find the values of a
function's parameters (coefficients) that minimize a cost function as far as possible.
• A gradient simply measures the change in all weights with regard to the change in error.
Learning Rate
• Learning rate is a hyper-parameter that controls how much we are adjusting the weights of
our network with respect the loss gradient. The lower the value, the slower we travel along
the downward slope. While this might be a good idea (using a low learning rate) in terms of
making sure that we do not miss any local minima, it could also mean that we’ll be taking a
long time to converge — especially if we get stuck on a plateau region.
• We could start with larger steps, then go smaller as we realize the slope gets closer to zero.
• This is known as adaptive gradient descent.
• In 2015, Kingma and Ba published their paper:
“Adam: A Method for Stochastic Optimization“.
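A minimal sketch of the basic gradient descent update is shown below for a one-parameter cost function C(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and number of steps are arbitrary illustrative choices (adaptive methods such as Adam refine this same loop).

def gradient(w):
    return 2 * (w - 3)            # derivative of C(w) = (w - 3)^2

w = 0.0                           # starting point
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * gradient(w)   # move against the gradient
print(w)                          # approaches the minimum at w = 3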
Back Propagation
• Forward Propagation: Transferring information from the input layer to the output layer
• Backpropagation, which is also referred to as backward propagation, was developed and popularized by David Rumelhart and Ronald Williams, who proposed it in 1986 for use in neural networks.
• Backpropagation, short for "backward propagation of errors," is an algorithm for supervised
learning of artificial neural networks using gradient descent. Given an artificial neural
network and an error function, the method calculates the gradient of the error function
with respect to the neural network's weights. It is a generalization of the delta rule for
perceptrons to multilayer feedforward neural networks.
• Back Propagation: Transferring error information from the output layer to the hidden layers
and adjusting weights of each hidden layer.
• ANN keeps comparing the actual output with the predicted output and adjusts all the
weights in the network to make the predicted values equal to the actual values.
• ANN will start adjusting the weights and bias values for each of the layers starting from the
output layer to each one of the hidden layers one by one in the backward direction. The goal
to reduce the loss value which in turn will reduce the difference between the predicted
value and the original value.
• This adjustment of weights is done based on the direction(Gradient) of the loss and since we
are trying to reduce the loss, we always choose such weights which will generate lower loss
values and the “direction” of movement of loss is downwards, hence, this technique is
known as Gradient Descent.
• When the weight of all the hidden layers is adjusted, we say one round of Back
Propagation is finished.
How does an Artificial Neural Network(ANN) learn
• ANN learns by multiple iterations of forward propagation and backward propagation.
STEP 1: Randomly initialise the weights to small numbers close to 0 (but not 0).
STEP 2: Input the first observation of your dataset in the input layer, each feature in one input node.
STEP 3: Forward-Propagation: from left to right, the neurons are activated in a way that the impact of
each neuron's activation is limited by the weights. Propagate the activations until getting the predicted
result y.
STEP 4: Compare the predicted result to the actual result.
Measure the generated error.
STEP 5: Back-Propagation: from right to left, the error is back-
propagated. Update the weights according to how much they
are responsible for the error. The learning rate decides by
how much we update the weights.
STEP 6: Repeat Steps 1 to 5 and update the weights after
each observation (Reinforcement Learning). Or Repeat Steps
1 to 5 but update the weights only after a batch of
observations (Batch Learning).
STEP 7: When the whole training set passed through the
ANN, that makes an epoch. Redo more epochs.
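The loop above is essentially what Keras runs inside fit(); the sketch below (which assumes TensorFlow/Keras is installed and uses made-up toy data, layer sizes, and epoch/batch settings) is only an illustration, not a prescribed architecture.

import numpy as np
from tensorflow import keras

# toy data: 200 rows, 10 features, binary target (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(10,)),  # hidden layer
    keras.layers.Dense(8, activation="relu"),                     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# forward propagation + backpropagation happen inside fit(); one full pass over the data = 1 epoch
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))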
Thank You
Association Rule Mining
Ramesh Kandela
[email protected]
What Is Association Mining?
Motivation: Finding regularities in data
• What products were often purchased together? — Beer and diapers
• What are the subsequent purchases after buying a Phone?
Amazon Product Recommendations
Association Rule Mining
• Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Figure: overlap between customers who buy diapers, customers who buy beer, and customers who buy both.)

Applications
Market Basket Analysis, Medical Diagnosis, Catalog Design, Sale Campaign Analysis
Market Basket Analysis
• Market Basket Analysis is a typical example of frequent itemset mining. This process analyzes
customer buying habits by finding associations between the different items that customers place in
their “shopping baskets” (Figure ).
• For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
• Market Basket is simply the list of products purchased by the customer.
• Market Basket Analysis identifies pairs or sets of products that customers tend to purchase together and how this knowledge can help the retailer increase profits.
• Market Basket Analysis draws actionable insights after looking at
the association between products bought during a given
transaction.
• The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are
frequently purchased together by customers.
Basic Concept: Association Rules
Generating Candidate Rules
• Association rule learning works on the concept of If and Else Statement, such as if A then B
If A → Then B
• Here the If element is called antecedent, and then statement is called Consequent.
• Let I={i1, i2, . . ., in} be the set of all distinct items
• An association rule can be represented as "A → B", where A and B are subsets (namely itemsets) of I
• If A appears in one transaction, it is most likely that B also occurs in the same transaction
• For example
  • "Bread → Milk"
  • "Beer → Diaper"
• {Milk, Diaper} ⇒ Beer
Rule Evaluation Metrics
The measurement of interestingness for association rules
• Support
• Confidence
Support
Support indicates how frequently the if/then relationship appears in the database.
Support count (): Frequency of occurrence of an itemset Tid Items bought
10 Beer, Nuts, Diaper
E.g. ({Beer, Diaper}) = 3
20 Beer, Coffee, Diaper
Support, s, probability that a transaction contains A∪B 30 Beer, Diaper, Eggs
s=support(“AB”) = P(A∪B) = support count(AUB)/T 40 Nuts, Eggs, Milk
Beer → Diaper (3/5=0.6=60%) 50 Nuts, Coffee, Diaper, Eggs, Milk

Diaper → Beer
Frequent Itemset: An itemset whose support is greater than or equal to a minimum support
threshold.
Confidence
• Confidence tells about the number of times these relationships have been found to be true.
• Confidence, c, the conditional probability that a transaction having A also contains B.
• c = confidence("A → B") = P(B|A) = support count(A ∪ B) / support count(A)
• Beer → Diaper: confidence = 3/3 = 100%
• Diaper → Beer: confidence = 3/4 = 75%
Rule Evaluation Metrics
• Diaper → Beer [support =60%,confidence =75%]
• Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
• A support of 60% for Rule means that 60% of all the transactions under analysis show that
Diaper and Beer are purchased together.
• A confidence of 75% means that 75% of the customers who purchased a Diaper also
bought the Beer.
• Typically, association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
• For the five-transaction table shown earlier, let minsup = 50% and minconf = 50%.
• Find all the rules X → Y with minimum support and confidence.
• Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Beer → Diaper (support 60%, confidence 100%)
• Diaper → Beer (support 60%, confidence 75%)
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}  (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}  (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}  (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}  (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}  (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}  (s = 0.4, c = 0.5)
Association Rule Mining process
• In general, association rule mining can be viewed as a two-step process:
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
• Apriori Algorithm
• ECLAT Algorithm
• FP-Growth Algorithm
2. Rule Generation
– Generate strong association rules from the frequent itemsets: these rules must
satisfy minimum support and minimum confidence.
The Apriori Algorithm
• The name, Apriori, is based on the fact that the algorithm uses prior knowledge of frequent
itemset properties.
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1) itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of
each Lk requires one full scan of the database.
• Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
• Equivalently, if an itemset is infrequent, all of its supersets will also be infrequent.
Apriori: A Candidate Generation-and-test Approach
Method: join and prune steps
• The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
This set of candidates is denoted Ck.
• The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but
all of the frequent k-itemsets are included in Ck. A database scan to determine the count of
each candidate in Ck would result in the determination of Lk (i.e., all candidates having a
count no less than the minimum support count are frequent by definition, and therefore
belong to Lk). Ck, however, can be huge, and so this could involve heavy computation.
• To reduce the size of Ck, the Apriori property is used as follows. Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k−1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be frequent either, and so it can be removed from Ck.
Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 and the minimum confidence required is 70%. Use the Apriori algorithm to generate all the candidate itemsets Ci and frequent itemsets Li. Then, generate the strong association rules from the frequent itemsets using minimum support and minimum confidence.
The Apriori Algorithm—Example
K=1
Create a table containing the support count of each item present in the dataset, called C1 (candidate set).
• Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L1.
• K=2: Generate candidate set C2 using L1 (this is called the join step).
• Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
• Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L2.
• Generate candidate set C3 using L2 (join step). Condition of joining Lk-1 and Lk-1 is that it should
have (K-2) elements in common. So here, for L2, first element should match.
So itemset generated by joining L2 is {I1, I2, I3}{I1, I2, I5}{I1, I3, i5}{I2, I3, I4}{I2, I4, I5}{I2, I3, I5}
• Check if all subsets of these itemsets are frequent or not and if not, then remove that
itemset.(Here subset of {I1, I2, I3} are {I1, I2},{I2, I3},{I1, I3} which are frequent. For {I2, I3, I4},
subset {I3, I4} is not frequent so remove it. Similarly check for every itemset)
• Find the support count of these remaining itemsets by searching the dataset.
• Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L3.
• Generate candidate set C4 using L3 (join step). Condition of joining Lk-1 and Lk-1 (K=4) is that, they
should have (K-2) elements in common. So here, for L3, first 2 elements (items) should match.
• Check all subsets of these itemsets are frequent or not (Here itemset formed by joining L3 is {I1,
I2, I3, I5} so its subset contains {I1, I3, I5}, which is not frequent). So no itemset in C4
• We stop here because no frequent itemsets are found further
Generating Association Rules from Frequent Itemsets
Selecting Strong Rules
• Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate
strong association rules from them (where strong association rules satisfy both minimum support and minimum
confidence).
confidence(A ⇒ B) = P(B|A) = support count(A ∪ B) / support count(A)
Based on this equation, association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l; for every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support count(l) / support count(s) ≥ the minimum confidence threshold.
For the frequent itemset {I1, I2, I5} from the example above, the resulting association rules are as shown below, each listed with its confidence:
• {I1, I2} ⇒ I5, confidence = 2/4 = 50%
• {I1, I5} ⇒ I2, confidence = 2/2 = 100%
• {I2, I5} ⇒ I1, confidence = 2/2 = 100%
• I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
• I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
• I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because
these are the only ones generated that are strong.
Lift
• Lift is a simple correlation measure.
• The lift between the occurrence of A and B can be measured by
  lift(A, B) = P(A ∪ B) / (P(A) × P(B))
• Equivalently, the lift of the association (or correlation) rule A ⇒ B computes the ratio between the rule's confidence and the support of the itemset in the rule consequent:
  lift(A ⇒ B) = confidence(A ⇒ B) / support(B)
• If the lift is less than 1, then the occurrence of A is negatively correlated with the occurrence
of B, meaning that the occurrence of one likely leads to the absence of the other one.
• If the resulting value is greater than 1, then A and B are positively correlated, meaning that
the occurrence of one implies the occurrence of the other.
• If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
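As an illustration, the two-step process and the three metrics above can be run with the mlxtend package (assumed to be installed separately); the five-transaction basket and thresholds below simply reuse the small example shown earlier.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Beer", "Nuts", "Diaper"],
    ["Beer", "Coffee", "Diaper"],
    ["Beer", "Diaper", "Eggs"],
    ["Nuts", "Eggs", "Milk"],
    ["Nuts", "Coffee", "Diaper", "Eggs", "Milk"],
]

# One-hot encode the baskets into a True/False item matrix
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(basket, min_support=0.5, use_colnames=True)                 # step 1: frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)   # step 2: strong rules
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])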
The Apriori Algorithm—An Example
Supmin = 2

Transaction database D:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan – C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (support ≥ 2): {A}:2, {B}:3, {C}:3, {E}:3

2nd scan – C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
C2 support counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (support ≥ 2): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan – C3: {B,C,E}
L3: {B,C,E}:2
The Apriori Algorithm—An Example
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Collaborative Filtering
Movie Recommendation
Recommender systems
• Recommender systems are self-explanatory; as the name suggests, they are systems or
techniques that recommend or suggest a particular product, service, or entity.
• Eg: In the case of Netflix, which movie to watch, In the case of e-commerce, which product to
buy, or In the case of Kindle, which book to read, which songs to listen to (Spotify), etc.
• Recommender systems can be classified into the following two categories, based on their
approach to providing recommendations.
Types of recommendation system
• Content-Based Filtering
• Collaborative Based Filtering
Collaborative Recommendation
• The basic idea of these systems is that if users shared the same interests in the past – if they
viewed or bought the same books, for instance – they will also have similar tastes in the
future.
• For example, user A and user B have a purchase history that overlaps strongly, and user A
has recently bought a book that B has not yet seen; the basic rationale is to propose this
book also to B. Because this selection of hopefully interesting books involves filtering the
most promising ones from a large set and because the users implicitly collaborate with one
another, this technique is also called collaborative filtering (CF).
• Collaborative filtering is based on the fact that relationships exist between products and
people’s interests. Many recommendation systems use collaborative filtering to find these
relationships and to give an accurate recommendation of a product that the user might like
or be interested in.
Data Type and Format:
Collaborative filtering requires availability of all item–user information. Specifically, for each
item–user combination, we should have some measure of the user’s preference for that item.
Preference can be a numerical rating or a binary behavior such as a purchase, a ‘like’, or a click.
There are two classes of Collaborative Filtering:
• User-based, which measures the similarity
between target users and other users.
• Item-based, which measures the similarity
between the items that target users rate or
interact with and other items.
User-Based Collaborative Filtering
• User-Based Collaborative Filtering is a technique used to predict the items that a user might
like on the basis of ratings given to that item by the other users who have similar taste with
that of the target user.
• Many websites use collaborative filtering for building their recommendation system.
• This is a memory-based method
User-Based Collaborative Filtering: “People Like You”:
One approach to generating personalized recommendations for a user using collaborative
filtering is based on finding users with similar preferences, and recommending items that they
liked but the user hasn’t purchased. The algorithm has two steps:
1) Find users who are most similar to the user of interest (neighbors). This is done by
comparing the preference of our user to the preferences of other users.
2) Considering only the items that the user has not yet purchased, recommend the ones that
are most preferred by the user’s neighbors.
Measuring Similarity
• Pearson correlation similarity of two users x, y is defined as
  sim(x, y) = Σ_{i∈Ixy} (r_xi − r̄_x)(r_yi − r̄_y) / ( sqrt(Σ_{i∈Ixy} (r_xi − r̄_x)²) × sqrt(Σ_{i∈Ixy} (r_yi − r̄_y)²) )
  where Ixy is the set of items rated by both user x and user y, r_xi is user x's rating of item i, and r̄_x is user x's average rating.
• The cosine-based approach defines the cosine similarity between two users x and y as
  sim(x, y) = Σ_i r_xi × r_yi / ( sqrt(Σ_i r_xi²) × sqrt(Σ_i r_yi²) )
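Both similarity measures can be computed directly with NumPy; the two rating vectors below are made-up examples over the items co-rated by users x and y.

import numpy as np

x = np.array([5, 3, 4, 4])   # user x's ratings of the co-rated items
y = np.array([4, 3, 5, 3])   # user y's ratings of the same items

# Pearson correlation: centre each vector on its own mean, then take the cosine
xc, yc = x - x.mean(), y - y.mean()
pearson = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Cosine similarity: cosine of the angle between the raw rating vectors
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(pearson, cosine)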
User-based filtering
• The main idea behind user-based filtering is that if we are able to find users that have bought and
liked similar items in the past, they are more likely to buy similar items in the future too. Therefore,
these models recommend items to a user that similar users have also liked. Amazon's Customers who
bought this item also bought is an example of this filter
• In user-based collaborative filtering, we have an active user for whom the recommendation is aimed.
The collaborative filtering engine first looks for users who are similar, that is, users who share the active user's rating patterns. Collaborative filtering bases this similarity on things like history, preference, and choices that users make when buying, watching, or enjoying something.
Item-based filtering
Item-based Collaborative Filtering:
When the number of users is much larger than the number of items, it is computationally
cheaper (and faster) to find similar items rather than similar users. Specifically, when a user
expresses interest in a particular item, the item-based collaborative filtering algorithm has
two steps:
1) Find the items that were co-rated or co-purchased, (by any user) with the item of interest.
2) Recommend the most popular or correlated item(s) among the similar items.
Item-based filtering
If a group of people have rated two items similarly, then the two items must be similar.
Therefore, if a person likes one particular item, they're likely to be interested in the other
item too. This is the principle on which item-based filtering works. Again, Amazon makes
good use of this model by recommending products to you based on your browsing and
purchase history, as shown in the following screenshot:
Thank You