Business Intelligence and Analytics Notes
Business Analytics
• Business Analytics is “the use of math and statistics to derive meaning from data in order
to make better business decisions.”
• Business Analytics is the extensive use of data, statistical and quantitative analysis,
explanatory and predictive models, and fact-based management to drive decisions and
actions (Davenport and Harris).
• Business analytics is a set of statistical and operations research techniques, artificial
intelligence, information technology and management strategies used for framing a
business problem, collecting data, and analyzing the data to create value to organizations.
Data
• Data is a collection of facts, such as numbers, words, measurements, and observations.
• A data set is usually a rectangular array of data, with variables in columns and observations in
rows. A variable (or field or attribute) is a characteristic of members of a population, such as
height, gender, or salary. An observation (or case or record) is a list of all variable values for a
single member of a population.
Types of Data
• Quantitative (numerical) data: Discrete, Continuous
• Qualitative (categorical) data: Nominal, Ordinal
Cross-sectional data are data on a cross section of a population at a distinct point in time.
Time series data are data collected over time.
Why Study Business Analytics?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, …
• Taking a data-driven approach to business can come with tremendous upside, but many
companies report that skilled employees for analytics roles are in short supply.
• Analytics helps you better understand customers, as well as the performance of HR,
marketing, finance, production, and inventory, and knowing where to improve is key to success.
Data-driven Decision Making
• Today’s largest and most successful organizations use data to their advantage when making
high-impact business decisions.
• A typical data-driven decision-making (Business Analytics) process uses the following steps :
1. Identify the problem or opportunity for value creation.
2. Identify sources of data (primary as well as secondary data sources).
3. Pre-process the data for issues such as missing and incorrect data. Generate derived
variables and transform the data if necessary. Prepare the data for analytics model building.
4. Divide the data into subsets: training and validation data sets.
5. Build analytical models and identify the best model(s) using model performance in
validation data.
6. Implement Solution/Decision/Develop Product.
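As a hedged illustration only, these steps might map to Python code roughly as follows (pandas and scikit-learn); the file name, column names, and model choice are all assumptions, not part of the notes:

# Sketch of the data-driven decision-making steps above (illustrative names throughout).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: collect data from a (hypothetical) source.
df = pd.read_csv("customer_data.csv")

# Step 3: pre-process - handle missing values and generate a derived variable.
df["income"] = df["income"].fillna(df["income"].median())
df["income_per_visit"] = df["income"] / (df["visits"] + 1)

# Step 4: divide the data into training and validation subsets.
X = df[["income", "visits", "income_per_visit"]]
y = df["purchased"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: build a model and judge it by its performance on the validation data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))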
Population Versus Sample
• Population: A population consists of all elements—individuals, items, or objects—whose
characteristics are being studied.
• Sample: A portion of the population selected for study is referred to as a sample.
Components of Business Analytics
Business Analytics can be broken into 3 components:
1. Business Context
Business analytics projects start with the business context and ability of the organization to
ask the right questions.
2. Technology
Information Technology (IT) is used for data capture, data storage, data preparation, data
analysis, and data sharing. To analyse data, one may need to use software such as Excel, R,
Python, Tableau, SQL, SAS, SPSS, etc.
3. Data Science
Davenport and Patil (2012) claim that ‘data scientist’ will be the sexiest job of the 21st century.
Business Analytics vs. Data Science
The main goal of business analytics is to extract meaningful insights from data to guide
organizational decisions, while data science is focused on turning raw data into meaningful
conclusions through using algorithms and statistical models.
Types of Business Analytics
• Business analytics can be grouped into four types:
Descriptive Analytics
• What happened in the past?
• Many organisations use descriptive analytics as part of business intelligence.
Predictive Analytics
• What will happen in the future?
• Many organisations use predictive analytics.
Prescriptive Analytics
• What is the best action?
• Small proportion of organisations use prescriptive analytics.
Diagnostic Analytics
• Why did it happen?
• This focuses on the past performance to ascertain why something has happened.
Descriptive Analytics
Descriptive analytics is the simplest form of analytics that mainly uses simple descriptive
statistics, data visualization techniques, and business-related queries to understand past data.
One of the primary objectives of descriptive analytics is innovative ways of data summarization.
Descriptive statistics consists of methods for organizing, displaying, and describing data by
using tables, graphs, and summary measures.
Types of descriptive statistics:
• Organize the Data
• Tables
• Frequency Distributions
• Relative Frequency Distributions
• Displaying the Data
• Graphs
• Bar Chart or Histogram
• Summarize the Data
• Central Tendency
• Variation
Predictive Analytics
• Predictive analytics is a branch of advanced analytics that makes predictions about future
outcomes using historical data combined with statistical modeling, data mining techniques
and machine learning. Companies employ predictive analytics to find patterns in this data to
identify risks and opportunities. Predictive analytics is often associated with big data
and data science.
Types of Predictive Models
• K-Nearest Neighbors
• Regression
• Naïve Bayes Classifier
• Logistic Regression
• Classification and Regression Trees
• Clustering models
• Ensemble Methods
• Time series models
• Neural Networks
• Association Rules and Collaborative Filtering
Prescriptive Analytics
• Prescriptive analytics is the highest level of analytics capability which is used for choosing optimal
actions once an organization gains insights through descriptive and predictive analytics.
Examples of Descriptive Analytics
• Summarizing past events, exchange of data, and social media usage
• Reporting general trends
Examples of Diagnostic Analytics
• Identifying technical issues
• Explaining customer behavior
• Improving organization culture
Examples of Predictive Analytics
• Predicting customer preferences
• Recommending products
• Predicting staff and resources
Examples of Prescriptive Analytics
• Tracking fluctuating manufacturing prices
• Suggest the best course of action
Business Intelligence
• Business intelligence (BI) is software that ingests business data and presents it in user-
friendly views such as reports, dashboards, charts and graphs.
• Business intelligence combines business analytics, data mining, data visualization, data tools
and infrastructure, and best practices to help organizations make more data-driven
decisions.
BI Tools
[Figure: the data-to-decision pyramid. From bottom to top: Databases; Data Cleaning and Data Integration; Task-relevant Data; Data Exploration (statistical summary, querying, and reporting); Data Mining (information discovery); Visualization Techniques; End-user decision making. The potential to support business decisions increases toward the top, with typical users shifting from data analyst to end user.]
KDD Process: A Typical View from ML and Statistics
[Figure: pipeline view of the process, with data preprocessing feeding into data mining and post-processing of the discovered patterns.]
Data Understanding
• The data understanding phase of CRISP-DM involves taking a closer look at the data
available for mining. This step is critical in avoiding unexpected problems during the next
phase--data preparation--which is typically the longest part of a project.
• Data understanding involves accessing the data and exploring it using descriptive statistics.
Data understanding: Collect and explore the data to gain an understanding of its properties
and characteristics.
Perform Descriptive Analytics or Data Exploration
It is always a good practice to perform descriptive analytics before moving to building a
Machine Learning model. Descriptive statistics will help us to understand the variability in the
data.
Single Variable Summaries
• The simplest way to gain insight into variables is to assess them one at a time through the
calculation of summary statistics.
• Data Visualization in One Dimension
Multiple Variable Summaries
• Cross tabulation
• Data Visualization
• Correlation
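A minimal pandas sketch of single- and multiple-variable summaries, using a few illustrative rows modeled on the study-time and marks data shown later in these notes (column names are assumptions):

# Single and multiple variable summaries with pandas (toy data for illustration).
import pandas as pd

df = pd.DataFrame({
    "study_time": [12.5, 12.0, 11.5, 11.0, 10.5],
    "sex": ["Female", "Male", "Female", "Male", "Female"],
    "marks": [9.3, 8.6, 9.6, 8.1, 9.2],
})

# Single variable summary: summary statistics for one column at a time.
print(df["marks"].describe())

# Multiple variable summaries: cross tabulation and correlation.
print(pd.crosstab(df["sex"], df["marks"] > 9))
print(df[["study_time", "marks"]].corr())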
Descriptive Statistics
Frequency Distributions
• A frequency distribution is the organization of raw data in table form, using classes (groups) and frequencies.
• Frequency Distribution for Qualitative Data: A frequency distribution for qualitative data lists all categories and the number of elements that belong to each of the categories.

Course     No. of Students
BBA        10
BSc        6
BTech      7
BCOM       8
BPharm     4
Total      35

• Frequency Distribution for Quantitative Data
• Grouped frequency distributions: The data must be grouped into classes that are more than one unit in width. In a grouped table, the X column lists groups of observations, called class intervals, rather than individual values.

X (class)   Frequency
52-55       3
56-59       3
60-63       9
64-67       9
68-71       8
72-75       3
Total       35
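A short pandas sketch of both kinds of frequency table; the course counts are taken from the table above, while the quantitative values are illustrative:

# Frequency distribution for qualitative data: count elements per category.
import pandas as pd

courses = ["BBA"] * 10 + ["BSc"] * 6 + ["BTech"] * 7 + ["BCOM"] * 8 + ["BPharm"] * 4
print(pd.Series(courses).value_counts())

# Grouped frequency distribution for quantitative data: bin values into class intervals.
marks = pd.Series([53, 57, 61, 65, 70, 74, 62, 66, 69])      # illustrative values
bins = [52, 55, 59, 63, 67, 71, 75]                          # class boundaries as above
print(pd.cut(marks, bins=bins, include_lowest=True).value_counts().sort_index())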
Data Visualization
• Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
• Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
• In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

Charts
▪ Bar Graph
▪ Column Chart
▪ Pie Chart
▪ Line Chart
▪ Histogram
▪ Box Plot
▪ Scatter Plot
▪ Heat Map
▪ Pair Plot
Descriptive Measures
Measures of Central Tendency
• Mean
• Median
• Mode
Measures of Variability
• Range
• Standard Deviation
• Variance
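A minimal pandas sketch of these descriptive measures on a small illustrative series of marks:

# Measures of central tendency and variability with pandas.
import pandas as pd

marks = pd.Series([9.3, 8.6, 9.6, 8.1, 9.2, 7.5, 8.4, 8.6])

print("Mean:", marks.mean())
print("Median:", marks.median())
print("Mode:", marks.mode().tolist())            # most frequent value(s)
print("Range:", marks.max() - marks.min())
print("Standard deviation:", marks.std())        # sample standard deviation (n - 1)
print("Variance:", marks.var())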
Getting the dataset
Study Time Attendance Sex Marks
12.50 85 Female 9.3
12.00 72 Male 8.6
11.50 95 Female 9.6
11.00 86 Male 8.1
10.50 78 Female 9.2
10.00 68 Male 7.5
9.50 74 Female 8.4
9.00 77 Male 7.7
8.50 Female 7.3
8.00 Male -7.4
Data Preprocessing
• Data preprocessing is a process of preparing the raw data and making it suitable for a data
mining task.
• Real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques
can improve data quality, thereby helping to improve the accuracy and efficiency of the
subsequent mining process.
• Data preprocessing is an important step in the machine learning process, because quality decisions
must be based on quality data.
Why is Data preprocessing important?
• Preprocessing of data is mainly to check the data quality. The quality can be checked by the
following
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether all required data is recorded and none is missing.
• Consistency: To check whether the same data stored in different places matches.
• Timeliness: The data should be kept up to date.
• Believability: The data should be trustable.
• Interpretability: The understandability of the data.
Major Tasks in Data Preprocessing
• Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Reduction
• Dimensionality reduction
• Data Transformation
• Normalization
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Attendance=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Marks=“−7.4” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
Identifying and Handling the Missing Values
• Data Cleaning: Most Machine Learning algorithms cannot work with missing features, so let’s
create a few functions to take care of them.
Ways to handle missing data:
• By deleting the particular row: This is a common way to deal with null values: we simply
delete the specific row or column which contains null values. But this approach is not very
efficient, and removing data may lead to loss of information, which will not give accurate
output.
• Use a measure of central tendency (e.g., the mean or median) to fill in the missing value:
• For normal (symmetric) data distributions, the mean can be used, while skewed data
distribution should employ the median .
• Dropping Variables
• It is always better to keep data than to discard it. Sometimes you can drop a variable if its
data is missing for more than 60% of observations, but only if that variable is insignificant. This
method is not very effective; imputation is usually preferred over dropping variables. A sketch
of both approaches is shown below.
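A sketch of these options with pandas and scikit-learn; the column names and values are illustrative:

# Handling missing values: delete rows, or impute with a measure of central tendency.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"attendance": [85, 72, None, 86, None],
                   "marks": [9.3, 8.6, 9.6, 8.1, 9.2]})

# Option 1: delete the rows that contain null values.
dropped = df.dropna()

# Option 2: fill missing values with the median (suitable for skewed distributions).
filled = df.fillna({"attendance": df["attendance"].median()})

# The same median imputation with scikit-learn, convenient inside a modelling pipeline.
imputed = SimpleImputer(strategy="median").fit_transform(df)
print(dropped, filled, imputed, sep="\n")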
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates of one another
• Major issue when merging data from heterogenous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British
units
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values
s.t. each old value can be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• One of the most important transformations you need to apply to your data is feature scaling. With
few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales.
There are two common ways to get all attributes to have the same scale:
• Min-max normalization
• Z-score normalization
• Discretization: Concept hierarchy climbing
• Encoding Categorical data
Min-max normalization
• Min-max scaling (normalization) is quite simple: values are shifted and rescaled so that they
end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max
minus the min:
    v' = (value − min_A) / (max_A − min_A)
• Let A be a numeric variable with n observed values, v1, v2, …, vn.
• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) = 0.716.
Z-score normalization
• Standardization is quite different: first it subtracts the mean value (so standardized values
always have a zero mean), and then it divides by the standard deviation so that the resulting
distribution has unit variance.
• In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, vi , of A is
normalized to v’i by computing
    v'_i = (v_i − μ_A) / σ_A
• Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225.
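A brief scikit-learn sketch of both scaling methods; the income values are illustrative:

# Min-max normalization and z-score normalization (standardization) with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

income = np.array([[12000.0], [54000.0], [73600.0], [98000.0]])

print(MinMaxScaler().fit_transform(income))    # rescales values into the range [0, 1]
print(StandardScaler().fit_transform(income))  # subtracts the mean, divides by the std. deviation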
Encoding Categorical data
• Encoding categorical data is a process of converting categorical data into integer format.
• Since machine learning models work entirely with mathematics and numbers, a categorical
variable in the dataset can create trouble while building the model. So it is necessary to
encode these categorical variables into numbers.
• Encoding the Independent Variables
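A minimal sketch of encoding a categorical independent variable with pandas and scikit-learn; the column and categories are illustrative:

# Encoding categorical data: one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"sex": ["Female", "Male", "Female"], "marks": [9.3, 8.6, 9.6]})

print(pd.get_dummies(df, columns=["sex"]))       # adds sex_Female and sex_Male columns

encoder = OneHotEncoder()                        # the scikit-learn equivalent
print(encoder.fit_transform(df[["sex"]]).toarray())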
Determine the data mining task
Data mining techniques come in two main forms: supervised (also known as predictive)
and unsupervised (also known as descriptive).
Supervised Learning
• These algorithms require the knowledge of both the outcome variable (dependent
variable) and the independent variable (input variables).
• In supervised learning, the training data you feed to the algorithm includes the
desired solutions, called labels.
• Regression and Classification algorithms are Supervised Learning algorithms.
Unsupervised Learning
• Unsupervised learning: In unsupervised learning, the training data is unlabeled. The
system tries to learn without a teacher.
• These algorithms are set of algorithms which do not have the knowledge of the
outcome variable in the dataset.
Most important supervised learning algorithms:
• Linear Regression
• Logistic Regression
• k-Nearest Neighbors
• Support Vector Machines (SVMs)
• Naïve Bayes
• Decision Trees and Random Forests
• Neural Networks
Most important unsupervised learning algorithms:
• Clustering: K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA)
• Dimensionality Reduction / Feature Selection: Principal Component Analysis (PCA)
• Association Rule Learning: Apriori, Eclat
Partition the data (for supervised tasks). If the task is supervised (classification or prediction),
randomly partition the dataset into training and test datasets.
Splitting the dataset into the Training set and Test set
• The train-test split is a technique for evaluating the performance of a machine learning algorithm.
• It can be used for regression or classification problems and can be used for any supervised learning
algorithm.
• The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit
the model and is referred to as the training dataset. The second subset is not used to train the
model; instead, the input element of the dataset is provided to the model, then predictions are
made and compared to the expected values. This second dataset is referred to as the test dataset.
• Train Dataset: Used to fit the machine learning model.
• Test Dataset: Used to evaluate the fit machine learning model.
• The objective is to estimate the performance of the machine learning model on new data: data not
used to train the model.
• The proportion of training dataset is usually between 70% and 80% of the data and the remaining
data is treated as the test dataset (validation data). The subsets may be created using
random/stratified sampling procedure. This is an important step to measure the performance of the
model using dataset not used in model building. It is also essential to check for any overfitting of the
model. In many cases, multiple training and multiple test data are used (called cross-validation).
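A short scikit-learn sketch of this split; the file name, feature columns, and target column are assumptions, and stratified sampling keeps class proportions similar in both subsets:

# Splitting a dataset into training and test subsets.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")            # hypothetical file
X = df.drop(columns=["target"])            # input features
y = df["target"]                           # outcome variable

# 70% training / 30% test, with stratified random sampling on the class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)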
Select a model and train it
• Train Dataset: Used to fit the machine learning model.
• The selected model may not be always the most accurate model, as accurate model may take more
time to compute and may require expensive infrastructure. The final model for deployment will be
based on multiple criteria such as accuracy, computing speed, cost of deployment, and so on. As a
part of model building, we will also go through feature selection which identifies important features
that have significant relationship with the outcome variable.
Model Evaluation (Model Testing)
• Test Dataset: Used to evaluate the fit machine learning model.
The most common evaluation metrics for regression:
• Mean Absolute Error
• Mean Squared Error
• Root Mean Square Error
• R squared and Adjusted R Square
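A sketch of computing these regression metrics with scikit-learn; the actual and predicted values are illustrative, and adjusted R squared is computed manually since scikit-learn does not provide it directly:

# Common evaluation metrics for regression models.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_test = np.array([7.5, 8.4, 9.2, 8.1, 7.7])       # actual values (illustrative)
y_pred = np.array([7.8, 8.0, 9.0, 8.5, 7.6])       # model predictions (illustrative)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                                 # root mean square error
r2 = r2_score(y_test, y_pred)

n, p = len(y_test), 2                               # p = number of predictors (assumption)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(mae, mse, rmse, r2, adj_r2)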
Ramesh Kandela
[email protected]
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis is an approach in analyzing data sets to summarize their main
characteristics, often using Descriptive statistics .
https://www.storytellingwithdata.com/
Business Intelligence
• Business intelligence (BI) uncovers insights for making strategic decisions. Business intelligence
tools analyze historical and current data and present findings in intuitive visual formats.
• According to CIO magazine: “Although business intelligence does not tell business users what to
do or what will happen if they take a certain course, neither is BI only about generating reports.
Rather, BI offers a way for people to examine data to understand trends and derive insights.”
Why Data Visualization
• Provides clearer understanding
• Instant absorption of large and complex data
• Better decision-making based on data
• Audience engagement
• Reveals hidden patterns and deeper insights
• Helps the analyst avoid problems
Two Key Questions for Data Visualization
1. What type of data are you working with?
• Qualitative
• Quantitative
2. What are you trying to communicate?
• Relationship
• Comparison
• Composition
• Distribution
• Trending
Tableau
• Tableau is an excellent data visualization and business intelligence tool used for reporting and
analyzing vast volumes of data.
• Tableau is a visual analytics platform transforming the way we use data to solve problems—
empowering people and organizations to make the most of their data.
Install Tableau
Tableau Desktop
https://www.tableau.com/support/releases
Tableau for Students/Faculty: Free (one-year license) access to Tableau Desktop
• https://www.tableau.com/academic/teaching#form
• https://www.tableau.com/academic/students#form
Tableau Public
• Tableau Public is a free platform to explore, create, and publicly share data visualizations. Get
inspired by the endless possibilities with data.
• https://public.tableau.com/en-us/s/
Connecting to Data in Tableau
• When we open Tableau, we see a screen that looks like this, where we have the option to
choose a data connection.
• The options under the navigation heading “To a File” can be accessed within Tableau. All
possible data connections, including data that resides on a server, can also be accessed with
Tableau.
• At the bottom of the left navigation, a couple of data sources come
with every Tableau download. The first, Sample – Superstore, is an
Excel file to connect to it using Tableau.
Tableau
Data Types
Work Sheet
So simply click on the Sheet 1 tab at the bottom to start visualizing the data! You should now
see the main work area within Tableau, which looks like this:
Foundations for building visualizations
• When you first connect to a data source such as the Superstore file, Tableau will display the
data connection and the fields in the Data pane.
• Dimensions contain qualitative values (such as names, dates, or geographical data). You can
use dimensions to categorize, segment, and reveal the details in your data. Dimensions
affect the level of detail in the view.
• Measures contain numeric, quantitative values that you can measure. Measures can be
aggregated. When you drag a measure into the view, Tableau applies an aggregation to that
measure (by default).
• Generally, the measure is the number; the dimension is what you “slice and dice” the
number by.
• Fields can be dragged from the data pane onto the canvas area or onto various shelves such
as Rows, Columns, Color, or Size. As we'll see, the placement of the fields will result in
different encodings of the data based on the type of field.
Bar Graph
• A graph made of bars whose heights represent the frequencies of respective categories is
called a bar graph.
Bar & column charts are commonly used for: comparing numerical data across categories
Examples:
• Total sales by product type
• Population by country
• Revenue by department, by quarter
Create a Bar Chart in Tableau
To create a bar chart, place a dimension on the Rows shelf and a measure on the Columns shelf,
or vice versa.
C. Toolbar
Toolbar button descriptions:
Redo: Repeats the last action you reversed with the Undo button.
Swap: Moves the fields on the Rows shelf to the Columns shelf and vice versa.
Sort Ascending: Applies a sort in ascending order of a selected field based on the
measures in the view.
Sort Descending: Applies a sort in descending order of a selected field based on the
measures in the view.
Totals: You can compute grand totals and subtotals for the data in a view. Select from
the following options:
Show Column Grand Totals: Adds a row showing totals for all columns in the view.
Show Row Grand Totals: Adds a column showing totals for all rows in the view.
Row Totals to Left: Moves rows showing totals to the left of a crosstab or view.
Column Totals to Top: Moves columns showing totals to the top of a crosstab or view.
Add All Subtotals: Inserts subtotal rows and columns in the view, if you have multiple
dimensions in a column or row.
Remove All Subtotals: Removes subtotal rows or columns.
Clear: Clears the current worksheet. Use the drop-down menu to clear specific parts of the
view such as filters, formatting, sizing, and axis ranges.
Highlight: Turn on highlighting for the selected sheet. Use the options on the drop-down menu
to define how values are highlighted.
Group Members: Creates a group by combining selected values. When multiple dimensions
are selected, use the drop-down menu to specify whether to group on a specific dimension or
across all dimensions.
Show Mark Labels: Switches between showing and hiding mark labels for the current sheet.
Fix Axes: switches between a locked axis that only shows a specific range and a dynamic axis
that adjusts the range based on the minimum and maximum values in the view.
Format Workbook: Open the Format Workbook pane to change how fonts and titles look in every view
in a workbook by specifying format settings at the workbook level instead of at the worksheet level.
Fit: Specifies how the view should be sized within the window. Select Standard, Fit Width, Fit Height, or
Entire View. Note: This menu is not available in geographic map views.
The Cell Size commands have different effects depending on the type of visualization. To access the Cell
Size menu in Tableau Desktop click Format > Cell Size.
Show/Hide Cards: Shows and hides specific cards in a worksheet. Select each card that you
want to hide or show on the drop-down menu.
In Tableau Server and Tableau Online, you can show and hide cards for
the Title, Caption, Filter and Highlighter only.
Presentation Mode: Switches between showing and hiding everything except the view (i.e.,
shelves, toolbar, Data pane).
Download: Use the options under Download to capture parts of your view for use in other
applications.
Share Workbook With Others: Publish your workbook to Tableau Server or Tableau Online.
Show Me: Helps you choose a view type by highlighting view types that work best with the field
types in your data. An orange outline shows around the recommended chart type that is the
best match for your data.
• D. View - This is the canvas in the workspace where you create a visualization (also
referred to as a "viz").
• E. Click this icon to go to the Start page, where you can connect to data.
• F. Side Bar - In a worksheet, the side bar area contains the Data pane and
the Analytics pane
• G. Click this tab to go to the Data Source page and view your data
• H. Status bar - Displays information about the current view.
• I. Sheet tabs - Tabs represent each sheet in your workbook. This can include
worksheets, dashboards, and stories.
Marks Card
• The Marks card in Tableau is a card to the left of the view where we can
drag fields and control mark properties like color, size,
label, shape, tooltip and detail.
Pie chart
• Pie chart: A circle divided into portions (slices) that represent the relative frequencies or
percentages of different categories or classes.
• Use pie charts to show proportions of a whole.
• Each slice of a pie chart shows the proportion of one part out of the whole.
Commonly used for: Comparing proportions totalling 100%
Examples:
• Percentage of budget spent by department
• Proportion of internet users by age range
• Breakdown of site traffic by source
[Figure: example pie chart of proportions by day (Thur, Fri, Sat, Sun): 7.79%, 25.41%, 31.15%, 35.66%]
Build a Pie Chart
• Step 1: Connect to the Sample - Superstore data source.
• Step 2:Drag the Sales measure to Columns and drag the Region dimension to Rows.
• Tableau aggregates the Sales measure as a sum.
• Also note that the default chart type is a bar chart.
• Step 3:Click Show Me on the toolbar, then
select the pie chart type.
• Step 4: The result is a rather small pie. To make the chart bigger, hold down Ctrl + Shift
(hold down ⌘ + Shift on a Mac) and press B several times.
• Step 5: To add labels, drag the Region dimension from the Data pane to Label on
the Marks card.
Stacked Bar Graphs
• Stacked bar graphs show the quantitative relationship that exists between a main category and
its subcategories.
• Each bar represents a principal category and it is divided into segments
representing subcategories of a second categorical variable.
• The chart shows not only the quantitative relationship between the different subcategories
with each other but also with the main category as a whole. They are also used to show how
the composition of the subcategories changes over time.
Filter measures
• When you drag a measure from the Data pane to the Filters shelf, select how you want to
aggregate the field, and then click Next. In the subsequent dialog box, you're given the option
to create four types of quantitative filters.
Filter dates
• When you drag a date field from the Data pane to the Filters shelf in Tableau Desktop, the
following Filter Field dialog box appears:
Display interactive filters in the view
• When an interactive filter is shown, you can quickly include or exclude data in the view.
• To show a filter in the view:
• In the view, click the field drop-down menu and select Show Filter.
• The field is automatically added to the Filters shelf (if it is not already being filtered), and
a filter card appears in the view. Interact with the card to filter your data.
• Set options for filter card interaction and appearance
• After you show a filter, there are many different options that let you control how the filter
works and appears. You can access these options by clicking the drop-down menu in the
upper right corner of the filter card in the view.
• Some options are available for all types of filters, and others depend on whether you’re
filtering a categorical field (dimension) or a quantitative field (measure).
Histogram
• Histograms are created from continuous (numerical) data.
• Histogram is the visual representation of the data which can be used to assess the probability
distribution (frequency distribution) of the data. A histogram is a chart that displays the shape
of a distribution.
• A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is
plotted as a bar whose height corresponds to how many data points are in that bin. A histogram
looks like a bar chart but groups values for a continuous measure into ranges, or bins.
• A uniform or rectangular histogram has the same frequency for each class.
Create a histogram in Tableau
• In Tableau we can create a histogram using Show Me.
1.Connect to the Sample - Superstore data source.
2.Drag Quantity to Columns.
3.Click Show Me on the toolbar, then select the histogram chart type.
• The histogram chart type is available in Show Me when the view contains a single measure
and no dimensions.
Three things happen after you click the histogram icon in Show Me:
• The view changes to show vertical bars, with a continuous x-axis (1 – 14) and
a continuous y-axis (0 – 5,000).
• The Quantity measure you placed on the Columns shelf, which had been
aggregated as SUM, is replaced by a continuous Quantity (bin) dimension.
(The green color of the field on the Columns shelf indicates that the field is
continuous.)
• To edit this bin: In the Data pane, right-click the bin and select Edit.
• The Quantity measure moves to the Rows shelf and the aggregation
changes from SUM to CNT (Count).
Box and Whisker Plot
• A box-and-whisker plot gives a graphic presentation of data using five measures: the median, the
first quartile, the third quartile, and the smallest and the largest values in the data set between the
lower and the upper inner fences.
• The length of the box is equivalent to IQR. It is possible that the data may contain values beyond Q1
– 1.5 IQR and Q3 + 1.5 IQR. The whisker of the box plot extends till Q1 – 1.5 IQR (or minimum value)
and Q3 + 1.5 IQR (or maximum value); observations beyond these two limits are potential outliers.
Commonly Used For:
• Visualizing statistical characteristics across data series
Examples:
• Comparing historical annual rainfall across cities
• Analyzing distributions of values and identifying outliers
• Comparing mean and median height/weight by country
Scatter Plot
• Positive correlation depicts a rise, and it is seen on the diagram as data points slope
upwards from the lower-left corner of the chart towards the upper-right.
• Negative correlation depicts a fall, and this is seen on the chart as data points slope
downwards from the upper-left corner of the chart towards the lower-right.
• Data that is neither positively nor negatively correlated is considered uncorrelated (null).
Create scatter plot in Tableau
• To use scatter plots and trend lines to compare sales
to profit, follow these steps:
1.Open the Sample - Superstore data source.
2. Drag the Profit measure to Columns. Tableau aggregates the measure as a sum and creates a
horizontal axis.
3. Drag the Sales measure to Rows.
Tableau aggregates the measure as a sum and creates
a vertical axis.
4. Drag the dimension Sub-Category and drop into
the Label shelf under the Marks pane.
Create scatter plot in Tableau
• Once the data is loaded, perform the
following steps to create a scatter plot of
two measures:
1. From the top toolbar, under Analysis,
uncheck Aggregate Measures.
2. Drag-and-drop Profit into the Columns
shelf.
3. Drag-and-drop Sales into the Rows shelf.
[Figure: example combination chart comparing Actual vs. Target values by month, January through July.]
Combination Charts (Dual Axes Charts)
To create a combination chart, follow the steps below:
• Open Tableau Desktop and connect to the Sample -
Superstore data source.
• From the Data pane, drag Order Date to
the Columns shelf.
• On the Columns shelf, right-click YEAR(Order Date) and
select Month.
• From the Data pane, drag Sales to the Rows shelf.
• From the Data pane, drag Profit to the Rows shelf and
place it to the right of SUM(Sales).
• On the Rows shelf, right-click SUM(Profit) and
select Dual-Axis.
• On the SUM(Profit) Marks card, click the Mark Type drop-
down and select Bar.
Word Map
Word map is a visual representation of text data.
Creating Dashboards
Dashboards in Tableau are very powerful as they are a compilation of individual visualizations
on different sheets. This provides the reader with a lot of information on one single view with
all the filters, parameters, and legends of individual visualizations.
Create a dashboard
You create a dashboard in much the same way you create a new worksheet.
1.At the bottom of the workbook, click the New Dashboard icon:
2.From the Sheets list at left, drag views to your dashboard at right.
Add interactivity
• In the upper corner of the sheet, enable the Use as Filter option to use selected marks in the
sheet as filters for other sheets in the dashboard.
Floating and Tiled Layout Arrangements on Dashboards
• Each object (worksheet) in a dashboard can use one of two types of layouts: Tiled or Floating.
Tiled objects are arranged in a single-layer grid that adjusts in size based on the total
dashboard size and the objects around it. Floating objects can be layered on top of other
objects and can have a fixed size and position.
• Tiled Layout All objects are tiled on a single layer. The top three views are in a horizontal
layout container.
• Floating Layout While most objects are tiled on this dashboard, the map view and its
corresponding color legend are floating. They are layered on top of the bar chart, which uses
a tiled layout.
Adding drop-down selectors
• Single Value (Dropdown): Displays the values of the filter in a drop-
down list where only a single value can be selected at a time.
• Multiple Values(Dropdown): Displays the values of the filter in a drop-
down list where multiple values can be selected.
Slider selectors
• Single Value (Slider): Displays the values of the filter along the range of a slider. Only a
single value can be selected at a time. This option is useful for dimensions that have an
implicit order such as dates.
Search box selectors
• Wildcard Match: Displays a text box where you can type a few characters. All values that
match those characters are automatically selected. You can use the asterisk character as a
wildcard character. For example, you can type “tab*” to select all values that begin with the
letters “tab”. Pattern Match is not case sensitive. If you are using a multidimensional data
source, this option is only available when filtering single level hierarchies and attributes.
Happy Visualizing
K Nearest Neighbors (K-NN)
KNN
• K Nearest Neighbors is a classification (and regression) algorithm. The idea in k-nearest-
neighbors methods is to identify k records in the training dataset that are similar to a new
record that we wish to classify.
• K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears then it can be easily classified into a well suite
category by using K- NN algorithm.
• When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space
for the k training tuples that are closest to the unknown tuple. These k training tuples are
the k “nearest neighbors” of the unknown tuple.
• For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k-nearest neighbors.
Why do we need a K-NN Algorithm?
Euclidean Distance(Similarity Measure )
• The most common approach is to measure similarity in terms of distance between pairs of
objects.
• The most commonly used measure of similarity is the Euclidean distance. The Euclidean
distance is the square root of the sum of the squared differences in values for each variable.
• If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance
between them is
    d = sqrt((x2 − x1)² + (y2 − y1)²)
• Example: for P1 = (1, 4) and P2 = (5, 1), the Euclidean distance = sqrt((5 − 1)² + (1 − 4)²) = sqrt(25) = 5.
• KNN classifier predicts the class of a given test observation by identifying the observations
that are nearest to it, the scale of the variables matters. Any variables that are on a large
scale will have a much larger effect on the distance between the observations, and hence on
the KNN classifier, than variables that are on a small scale.
• There are two common ways to get all attributes to have the same scale:
• Min-max normalization
• Z-score normalization
How does K-NN work?
The K-NN working can be explained on the basis of the below steps:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors and Take the K nearest
neighbors as per the calculated Euclidean distance.
Step-3: Among these k neighbors, count the number of the data points in each category.
Step-4: Assign the new data point to that category for which the number of the neighbor is
maximum.
Step-1: Choosing k
• The k value in the k-NN algorithm defines how many neighbors will be checked to
determine the classification of a specific query point.
• For example, if k=1, the instance will be assigned to the same class as its single nearest
neighbor.
• Defining k can be a balancing act as different values can lead to overfitting or underfitting.
Lower values of k can have high variance, but low bias, and larger values of k may lead to
high bias and lower variance. The choice of k will largely depend on the input data as data
with more outliers or noise will likely perform better with higher values of k.
Age Salary Purchased
44 72000 No
27 48000 Yes
30 54000 No
38 61000 No
35 58000 Yes
37 67000 ?
K=3
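As a sketch only, the same classification could be run in scikit-learn with k = 3; the features are standardized first, since salary would otherwise dominate the Euclidean distance:

# k-NN classification of the unlabeled record (Age = 37, Salary = 67000) with k = 3.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000], [35, 58000]])
y = np.array(["No", "Yes", "No", "No", "Yes"])

scaler = StandardScaler().fit(X)             # z-score scaling so both variables contribute
knn = KNeighborsClassifier(n_neighbors=3)    # k = 3, Euclidean distance by default
knn.fit(scaler.transform(X), y)

print(knn.predict(scaler.transform([[37, 67000]])))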
How can we determine a good value for k, the number of neighbors?
• There is no particular way to determine the best value for “k”, so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• This can be determined experimentally. Starting with k =5, we use a test set to estimate the
error rate of the classifier. This process can be repeated each time by incrementing k to
allow for one more neighbor. The k value that gives the minimum error rate may be
selected.
• Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0
using K-nearest neighbors.
• b) What is our prediction with K = 1? Why?
• (c) What is our prediction with K = 3? Why?
Observation   Cough   Fever   Covid   Predicted (if k = 3)
1             1       1       No
2             1       2       Yes
3             0       -1      No
4             0       3       No
5             1       2.5     Yes
6             1       1       Yes
7             1       -2      No
8             1       0.5     No      Yes
9             1       0.9     Yes     Yes
10            0       2.5     Yes     Yes
Evaluating Classification Model
The key classification metrics are:
Accuracy
Recall
Precision
F1-Score
Model Evaluation
• A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known.
• True positives (TP): These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labeled as positive.
Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled as negative. Let FN
be the number of false negatives.
                   Predicted Positive                    Predicted Negative
Actual Positive    True Positive (TP)                    False Negative (FN) (Type II Error)
Actual Negative    False Positive (FP) (Type I Error)    True Negative (TN)
Accuracy
• The accuracy of a classifier on a given test set is the percentage of test set data that are
correctly classified by the classifier.
• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Accuracy is the proportion of true results among the total number of cases examined.
• The error rate or misclassification rate of a classifier is simply 1 − accuracy.

Example 1:
Covid               Predicted Positive   Predicted Negative
Actual Positive     2                    0
Actual Negative     1                    0
Accuracy = (2 + 0)/3 = 67%

Example 2:
Covid               Predicted Positive   Predicted Negative
Actual Positive     0                    5
Actual Negative     0                    95
Accuracy = 95/100 = 95%

When to use?
• Accuracy is useful when target classes are well balanced.
• Accuracy is not a good choice with unbalanced classes; in this situation we'll want to
understand recall and precision.
Recall
Recall: what proportion of actual Positives is correctly classified?
Recall = TP / (TP + FN) = TP / total actual positives
• When to use?
• Recall is a valid choice of evaluation metric when we want to capture as many positives as
possible. For example: If we are building a system to predict if a person has Covid-19 or not,
we want to capture the virus even if we are not very sure.
Example 1:
Covid               Predicted Positive   Predicted Negative
Actual Positive     4                    1
Actual Negative     2                    93
Accuracy = (4 + 93)/100 = 97%; Recall = 4/(4 + 1) = 0.8

Example 2:
Covid               Predicted Positive   Predicted Negative
Actual Positive     0                    5
Actual Negative     0                    95
Accuracy = 95/100 = 95%; Recall = 0/(0 + 5) = 0
Precision
What proportion of predicted Positives is truly Positive?
Precision = (TP)/(TP+FP)
When to use?
• Precision is a valid choice of evaluation metric when we want to be very sure of our
prediction. For example: If we are building a system to predict if we should decrease the
credit limit on a particular account, we want to be very sure about our prediction or it may
result in customer dissatisfaction
Example 1:
Covid               Predicted Positive   Predicted Negative
Actual Positive     4                    1
Actual Negative     2                    93
Accuracy = (4 + 93)/100 = 97%

Example 2:
Covid               Predicted Positive   Predicted Negative
Actual Positive     0                    5
Actual Negative     0                    95
Accuracy = 95/100 = 95%
F1-Score
• F1-score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
• F1-score takes both precision and recall into account, which also means it accounts for
both FPs and FNs. The higher the precision and recall, the higher the F1-score. F1-score
ranges between 0 and 1. The closer it is to 1, the better the model.
• When to use?
• We want to have a model with both good precision and recall.
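A scikit-learn sketch that reproduces the metrics for Example 1 above (4 TP, 1 FN, 2 FP, 93 TN):

# Confusion matrix and key classification metrics.
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, f1_score)

# Labels rebuilt from the example confusion matrix: 4 TP, 1 FN, 2 FP, 93 TN.
y_true = np.array([1] * 5 + [0] * 95)                     # 5 actual positives, 95 actual negatives
y_pred = np.array([1] * 4 + [0] * 1 + [1] * 2 + [0] * 93)

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
print("Accuracy:", accuracy_score(y_true, y_pred))        # (4 + 93) / 100 = 0.97
print("Recall:", recall_score(y_true, y_pred))            # 4 / (4 + 1) = 0.8
print("Precision:", precision_score(y_true, y_pred))      # 4 / (4 + 2), approximately 0.67
print("F1-score:", f1_score(y_true, y_pred))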
Sensitivity & Specificity
Sensitivity or Recall (True Positive Rate) In simple terms, the proportion of patients that
were identified correctly to have the disease (i.e. True Positive) upon the total number of
patients who actually have the disease is called as Sensitivity or Recall.
Specificity (True Negative Rate): When it's actually no, how often does it predict no?
The proportion of negative tuples that are correctly identified.
Similarly, the proportion of patients that were identified correctly to not have the disease (i.e.
True Negative) upon the total number of patients who do not have the disease is called as
Specificity.
• False Positive Rate: When it's actually no, how often does it predict yes?
Receiver Operating Characteristic (ROC) Curve
• Receiver operating characteristic (ROC) curve can be used to understand the overall worth of a
classification model at various thresholds settings.
• ROC curve is a plot between sensitivity (true positive rate) in the vertical axis and
1 – specificity (false positive rate) in the horizontal axis.
• ROC curves are a useful visual tool for comparing two classification models.
Area Under the Curve (AUC)
The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish
between classes and is used as a summary of the ROC curve. Higher the AUC, better the model
is at predicting 0s as 0s and 1s as 1s.
When AUC = 1, then the classifier is able to perfectly
distinguish between all the Positive and the Negative class
points correctly.
[Figure: ROC curves of two classification models, M1 and M2. The diagonal shows where, for every true positive, we are equally likely to encounter a false positive. The closer an ROC curve is to the diagonal line, the less accurate the model is; thus, M1 is more accurate here.]
Happy Analyzing
Naive Bayes Classifier
Naive Bayes
• The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for
classification tasks, like text classification.
• Naive Bayes classification is based on Bayes’ theorem. Naive Bayes classifiers can predict
class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• Bayes’ Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes’ theorem is stated mathematically as the following
equation:
    P(A|B) = P(B|A) × P(A) / P(B)
• Using Bayes theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not
affect the other. Hence it is called naive.
Bayes Theorem
• Basically, we are trying to find probability of event A, given the event B is true. Event B is
also termed as evidence.
• P(A) is the priori of A (the prior probability, i.e. Probability of event before evidence is
seen). The evidence is an attribute value of an unknown instance(here, it is event B).
• P(A|B) is the a posteriori probability of A given B, i.e. the probability of the event A after the evidence B is seen.
• P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
• Applied to classification, P(y|x) = P(x|y) × P(y) / P(x), where P(y|x) is the posterior probability of class (y, target) given predictor (x, attributes).
• P(y) is the prior probability of class.
• P(x|y) is the likelihood which is the probability of the predictor given class.
• P(x) is the prior probability of the predictor.
Naïve Bayesian Classification
Dependent variable Purchased (Y) has two distinct values (namely, yes and no), and the
independent variables (X) are age, income, student, and credit rating.
• The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Now, we need to create a classifier model. For this, we find the probability of a given set of
inputs for all possible values of the class variable y and pick the output with maximum
probability. This can be expressed mathematically as:
    y = argmax_y P(y) × P(x1|y) × P(x2|y) × … × P(xn|y)
• Using the above function, we can obtain the class, given the predictors.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is
considered “naïve.”
Types of Naive Bayes Classifier
• Gaussian Naive Bayes (GaussianNB): GaussianNB is used in classification tasks and it assumes that
feature values follow a Gaussian distribution, i.e. a normal distribution.
• Multinomial Naïve Bayes (MultinomialNB): This type of Naïve Bayes classifier assumes that the
features are from multinomial distributions. This variant is useful when using discrete data, such
as frequency counts, and it is typically applied within natural language processing use cases, like
spam classification.
• Bernoulli Naïve Bayes (BernoulliNB): This is another variant of the Naïve Bayes classifier, which is
used with Boolean variables—that is, variables with two values, such as True and False or 1 and 0.
The naive Bayes classification procedure is as follows:
• Calculate the prior probability for given class labels
• For class C1, estimate the individual conditional probabilities for each predictor, P(xj | C1)
• Multiply these probabilities by each other, then multiply by the proportion of records
belonging to class C1 (prior probability for given class labels).
• Repeat Steps 2 and 3 for all the classes.
• Assign the record to the class with the highest probability for this set of predicted values.
Identify the class for the given X = (age = youth, income = medium, student = yes, credit rating = fair)

Id   Age           Income   Student   Credit_rating   Purchased
1    Youth         High     No        Fair            No
2    Youth         High     No        Excellent       No
3    Middle_aged   High     No        Fair            Yes
4    Senior        Medium   No        Fair            Yes
5    Senior        Low      Yes       Fair            Yes
6    Senior        Low      Yes       Excellent       No
7    Middle_aged   Low      Yes       Excellent       Yes
8    Youth         Medium   No        Fair            No
9    Youth         Low      Yes       Fair            Yes
10   Senior        Medium   Yes       Fair            Yes
11   Youth         Medium   Yes       Excellent       Yes
12   Middle_aged   Medium   No        Excellent       Yes
13   Middle_aged   High     Yes       Fair            Yes
14   Senior        Medium   No        Excellent       No
15   Youth         Medium   Yes       Fair            ?

P(y): P(buys_computer = “yes”) = 9/14 = 0.643
      P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|y) for each class:
P(age = youth | buys_computer = “yes”) = 2/9 = 0.222
P(age = youth | buys_computer = “no”) = 3/5 = 0.6
P(income = medium | buys_computer = “yes”) = 4/9 = 0.444
P(income = medium | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = “no”) = 2/5 = 0.4
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|y) * P(y):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Therefore, X belongs to the class (“buys_computer = yes”).
Identify the class for the given X = (age = Middle_aged, income = medium, student = yes, credit rating = fair)
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
    P(X|y) = P(x1|y) × P(x2|y) × … × P(xn|y), so a single zero-valued conditional probability makes the whole product zero.
• A common remedy is the Laplacian correction: add 1 to each count so that no conditional probability estimate is exactly zero.
• For example, identify the class for X = (age = Middle_aged, income = medium, student = yes, credit rating = fair):
P(age=Middle_aged/Purchased=No)=0/5=0
P(age=Youth/Purchased=No)=3/5=.6
P(age=Senior/Purchased=No)=2/5=0.4
“Nothing is particularly hard if you divide it into small jobs”.— Henry Ford
Decision Tree
• A decision tree is a non-parametric supervised learning algorithm, which is utilized for both
classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node,
branches, internal nodes and leaf nodes.
• Decision Tree is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are
the output of those decisions and do not contain any further branches.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Decision Tree Example
Training data set: Buys_computer
Age           Income   Student   Credit rating   Purchased
Youth         High     No        Fair            No
Youth         High     No        Excellent       No
Middle_aged   High     No        Fair            Yes
Senior        Medium   No        Fair            Yes
Senior        Low      Yes       Fair            Yes
Senior        Low      Yes       Excellent       No
Middle_aged   Low      Yes       Excellent       Yes
Youth         Medium   No        Fair            No
Youth         Low      Yes       Fair            Yes
Senior        Medium   Yes       Fair            Yes
Youth         Medium   Yes       Excellent       Yes
Middle_aged   Medium   No        Excellent       Yes
Middle_aged   High     Yes       Fair            Yes
Senior        Medium   No        Excellent       No
Youth         Medium   Yes       Fair            ?   (classified as Yes by the tree)

Resulting tree:
age?
  youth → student?
    student = no  → No
    student = yes → Yes
  middle_aged → Yes
  senior → credit rating?
    credit rating = excellent → No
    credit rating = fair      → Yes
Types of Decision Trees
• ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. This algorithm uses entropy and
information gain as metrics to evaluate candidate splits, iteratively dividing features into two or more
groups at each step. ID3 follows a top-down, greedy approach: the tree is built from the top, and at every
step the best feature at that moment is selected to create a node, which is then split further using the
same statistical measures. ID3 is generally not ideal because it tends to overfit and cannot handle
continuous attributes directly.
• C4.5: It is considered better than ID3 because it can handle both discrete and continuous data.
In C4.5, splitting is done based on information gain (normalized as the gain ratio): the feature with the
highest information gain becomes the decision node and is split further. C4.5 handles overfitting through
pruning, i.e., it removes branches or subtrees that hold little importance or are redundant.
• C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds
smaller rulesets than C4.5 while being more accurate.
• CHAID: CHAID stands for Chi-square Automatic Interaction Detector. In CHAID chi-square is the attribute selection
measure to split the nodes when it's a classification-based use case and uses F-test as an attribute selection
measure when it is a regression-based use case. Higher the chi-square value higher is the preference given to that
feature.
Classification and Regression Trees
• Classification and Regression Trees (CART) was introduced by Leo Breiman to refer to decision tree
algorithms. It is very similar to C4.5, but differs in that it also supports numerical target
variables (regression).
• As the name suggests CART can also perform both classification and regression-based tasks.
• Classification and Regression Tree (CART) is a common terminology that is used for a
Classification Tree (used when the dependent variable is discrete) and a Regression Tree (used
when the dependent variable is continuous).
• CART uses the Gini impurity index as the attribute selection measure when splitting a node in
classification use cases and uses the sum of squared errors as the attribute selection measure
when the use case is regression-based.
• The CART algorithm provides a foundation for important algorithms like random forests, bagged
decision trees, and boosted decision trees.
Steps in decision trees
The basic idea behind any decision tree algorithm is as follows:
• Select the root node which is the best attribute using Attribute Selection Measures (ASM) to split
the records.
• The root node is then split into two or more subsets that contain possible values for the best
attributes, using the ASM. Nodes thus created are known as internal nodes. Each internal node has
exactly one incoming edge.
• Further divide each internal node until no further splitting is possible or the stopping criterion is
met. The terminal nodes (leaf nodes) will not have any outgoing edges.
• Terminal nodes are used for generating business rules.
• Tree pruning is used to avoid large trees and overfitting the data. Tree pruning is achieved through
different stopping criteria.
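As an illustration of these steps (not part of the original notes), the following sketch fits a CART-style classification tree on the buys_computer table above using scikit-learn; the parameter choices are illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14-record buys_computer training set from the example above.
data = pd.DataFrame({
    "Age":     ["Youth","Youth","Middle_aged","Senior","Senior","Senior","Middle_aged",
                "Youth","Youth","Senior","Youth","Middle_aged","Middle_aged","Senior"],
    "Income":  ["High","High","High","Medium","Low","Low","Low",
                "Medium","Low","Medium","Medium","Medium","High","Medium"],
    "Student": ["No","No","No","No","Yes","Yes","Yes","No","Yes","Yes","Yes","No","Yes","No"],
    "Credit":  ["Fair","Excellent","Fair","Fair","Fair","Excellent","Excellent",
                "Fair","Fair","Fair","Excellent","Excellent","Fair","Excellent"],
    "Purchased": ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Purchased"))   # one-hot encode the categorical features
y = data["Purchased"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)  # CART-style splits on Gini
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # terminal nodes printed as text rules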
Attribute Selection Measures
Here is a list of some attribute selection measures.
• Gini index
• Entropy
• Information gain
• Gain Ratio
• Reduction in Variance
• Chi-Square
Gini Index
• It is a measure of purity or impurity while creating a decision tree. It is calculated by
subtracting the sum of the squared probabilities of each class from one. CART ( Classification
and regression tree ) uses the Gini index as an attribute selection measure to select the best
attribute/feature to split.
• The attribute with a lower Gini index is used as the best attribute to split.
The Gini index for a data set D is defined by Gini(D) = 1 − Σi pi², where pi is the proportion of records
in D that belong to class i.
Select the root node
Frequency table for Age (14 records):
Age         | Purchased = Yes | Purchased = No | Total
Youth       | 2               | 3              | 5
Middle_aged | 4               | 0              | 4
Senior      | 3               | 2              | 5
• Gini index for Age
• Gini(Age=Youth) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
• Gini(Age=Middle_aged) = 1 – (4/4)² – (0/4)² = 0
• Gini(Age=Senior) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48
• Weighted sum of Gini indexes for the Age feature:
• Gini(Age) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
Frequency tables for the other features (14 records):
Income    | Purchased = Yes | Purchased = No | Total
High      | 2               | 2              | 4
Medium    | 4               | 2              | 6
Low       | 3               | 1              | 4

Student   | Purchased = Yes | Purchased = No | Total
No        | 3               | 4              | 7
Yes       | 6               | 1              | 7

Credit Rating | Purchased = Yes | Purchased = No | Total
Fair          | 6               | 2              | 8
Excellent     | 3               | 3              | 6

• Gini index for Income
• Gini(Income=High) = 1 – (2/4)² – (2/4)² = 0.5
• Gini(Income=Medium) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445
• Gini(Income=Low) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
• Weighted sum of Gini indexes for the Income feature:
• Gini(Income) = (4/14) x 0.5 + (6/14) x 0.445 + (4/14) x 0.375 = 0.142 + 0.190 + 0.107 = 0.439

• Gini index for Student
• Gini(Student=No) = 1 – (3/7)² – (4/7)² = 1 – 0.184 – 0.327 = 0.489
• Gini(Student=Yes) = 1 – (6/7)² – (1/7)² = 1 – 0.735 – 0.020 = 0.245
• Weighted sum for the Student feature:
• Gini(Student) = (7/14) x 0.489 + (7/14) x 0.245 = 0.367

• Gini index for Credit rating
• Gini(Credit rating=Fair) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
• Gini(Credit rating=Excellent) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
• Weighted sum for the Credit rating feature:
• Gini(Credit rating) = (8/14) x 0.375 + (6/14) x 0.5 = 0.214 + 0.214 = 0.428

Summary of Gini indexes:
Gini index for Age = 0.342
Gini index for Income = 0.439
Gini index for Student = 0.367
Gini index for Credit rating = 0.428
The top of the tree will be the Age feature because its Gini index (cost) is the lowest.
• Apply the same principles to the sub-datasets for the Youth and Senior branches: compute the Gini
index scores for Income, Student and Credit rating within each branch and split on the attribute with
the lowest Gini index.
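To make the arithmetic above reproducible, here is a small illustrative sketch (not from the notes) that computes the weighted Gini index of the Age feature from the buys_computer table; the helper function names are made up for this example.

from collections import Counter

# buys_computer training data as (age, purchased) pairs, in the order of the table above.
rows = [("Youth","No"),("Youth","No"),("Middle_aged","Yes"),("Senior","Yes"),("Senior","Yes"),
        ("Senior","No"),("Middle_aged","Yes"),("Youth","No"),("Youth","Yes"),("Senior","Yes"),
        ("Youth","Yes"),("Middle_aged","Yes"),("Middle_aged","Yes"),("Senior","No")]

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows):
    """Weighted Gini index of a categorical feature (first element of each row)."""
    n = len(rows)
    total = 0.0
    for value in {v for v, _ in rows}:
        subset = [label for v, label in rows if v == value]
        total += len(subset) / n * gini(subset)
    return total

print(round(weighted_gini(rows), 3))   # ~0.343; the notes round the same quantity to 0.342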
Chi-Square
• Chi-Square is a comparison between observed results and expected results. This statistical
method is used in CHAID(Chi-square Automatic Interaction Detector). CHAID in itself is a very
old algorithm that is not used much these days. The higher the chi-square value, the higher the
difference between the current node and its parent.
• The formula for chi-square: χ² = Σ (Observed − Expected)² / Expected, summed over the cells of the
contingency table of the feature against the target class.
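As a worked illustration (not from the notes), the following sketch computes the chi-square statistic for the Age vs. Purchased frequency table built earlier in the Gini example; variable names are made up.

# Observed counts (Yes, No) per Age value, from the frequency table above.
observed = {"Youth": (2, 3), "Middle_aged": (4, 0), "Senior": (3, 2)}

n = sum(yes + no for yes, no in observed.values())            # 14 records
col_totals = (sum(yes for yes, _ in observed.values()),       # 9 Yes
              sum(no for _, no in observed.values()))         # 5 No

chi_square = 0.0
for yes, no in observed.values():
    row_total = yes + no
    for obs, col_total in zip((yes, no), col_totals):
        expected = row_total * col_total / n                  # E = (row total x column total) / n
        chi_square += (obs - expected) ** 2 / expected        # sum of (O - E)^2 / E

print(round(chi_square, 2))   # ~3.55 for the Age feature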
Ensemble Methods
• Bagging
• Random Forest
• Boosting
• Ensemble methods are machine learning techniques that combine several base models in
order to produce one optimal predictive model.
• The goal of ensemble methods is to combine the predictions of several base estimators
built with a given learning algorithm in order to improve generalizability / robustness over
a single estimator.
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an
improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of classifiers
• Boosting: weighted vote with a collection of classifiers
Bagging
• Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly
used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is
selected with replacement—meaning that the individual data points can be chosen more than once.
After several data samples are generated, weak models are trained independently on them, and
depending on the type of task (regression or classification), the average or the majority vote of
their predictions yields a more accurate estimate.
• The same training algorithm is used for every predictor, but each predictor is trained on a different
random subset of the training set. When sampling is performed with replacement, the method is called
bagging; when sampling is performed without replacement, it is called pasting.
Bagging therefore has three steps:
1. Bootstrapping: random samples of the training set are drawn with replacement to create several
bootstrap samples.
2. Parallel training: these bootstrap samples are then trained independently and in parallel with
each other using weak or base learners.
3. Aggregation: Finally, depending on the task (i.e. regression or classification), an average or a
majority of the predictions are taken to compute a more accurate estimate. In the case of
regression, an average is taken of all the outputs predicted by the individual classifiers; this is
known as soft voting. For classification problems, the class with the highest majority of votes is
accepted; this is known as hard voting or majority voting.
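A hedged sketch (not from the notes) of this workflow using scikit-learn's BaggingClassifier; the dataset and parameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# bootstrap=True -> bagging (sampling with replacement); bootstrap=False -> pasting.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # weak/base learner (older sklearn calls this base_estimator)
    n_estimators=100,                    # number of bootstrap samples / trees
    bootstrap=True,
    oob_score=True,                      # evaluate on the out-of-bag observations
    random_state=0,
)
bag.fit(X_train, y_train)
print("OOB accuracy:", round(bag.oob_score_, 3))
print("Test accuracy:", round(bag.score(X_test, y_test), 3))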
Out-of-Bag Error Estimation
• Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring
the prediction error of bagging, random forests, and boosted decision trees.
• Bagging uses subsampling with replacement to create training samples for the model to
learn from. On average, each bagged tree makes use of around two-thirds of the
observations. The remaining one-third of the observations not used to fit a given bagged
tree are referred to as the out-of-bag (OOB) observations.
• In order to obtain a single prediction for the ith observation, we can average the predictions of
the trees for which that observation was out-of-bag (if regression is the goal) or take a majority
vote (if classification is the goal). The OOB MSE (for a regression problem) or classification error
(for a classification problem) can then be computed.
Random Forests
• Random forest, as its name implies, consists of a large number of individual decision trees that
operate as an ensemble. Each individual tree in the random forest produces a class prediction, and
the class with the most votes becomes the model's prediction.
• Random forests provide an improvement over bagged trees by way of a small tweak that
decorrelates the trees.
• In random forests (see RandomForestClassifier and RandomForestRegressor classes), each
tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap
sample) from the training set.
• The random forest algorithm is an extension of the bagging method: it uses both bagging and
feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also
known as feature bagging or "the random subspace method", selects a random subset of features at
each split, which helps keep the individual trees decorrelated.
How it works
• Random forest algorithms have three main hyperparameters, which need to be set before
training. These include node size, the number of trees, and the number of features sampled.
From there, the random forest algorithm can be used to solve regression or classification
problems.
• The random forest algorithm is made up of a collection of decision trees, and each tree in
the ensemble is comprised of a data sample drawn from a training set with replacement,
called the bootstrap sample.
• About one-third of each training sample is not drawn into the bootstrap sample and is known as the
out-of-bag (oob) sample; it can serve as test data. Another instance of randomness is then injected
through feature bagging, adding more diversity to the dataset and reducing the correlation among
decision trees.
• Depending on the type of problem, the determination of the prediction will vary. For a
regression task, the individual decision trees will be averaged, and for a classification task, a
majority vote—i.e. the most frequent categorical variable—will yield the predicted class.
Finally, the oob sample is then used for cross-validation, finalizing that prediction.
Variable importance
• Random forests can be used to rank the importance of variables in a regression or
classification problem.
• A variable importance plot for the Heart data. Variable importance is computed using the
mean decrease in Gini index, and expressed relative to the maximum.
Variable Mean Decrease Gini
ChestPain 14.9746278
MaxHR 13.9973942
Thal 12.3107004
Oldpeak 10.764543
Ca 9.3101672
Age 9.2581531
Chol 7.8603037
RestBP 7.5728805
Sex 5.1902859
ExAng 5.0879835
Slope 4.8899441
RestECG 1.6251977
Fbs 0.7974986
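The sketch below (not from the notes) shows how such an importance ranking could be produced with scikit-learn's RandomForestClassifier, the class named earlier; since the Heart data itself is not included here, a synthetic dataset and made-up feature names are used.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Heart data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # feature randomness: random subset of features per split
    oob_score=True,        # out-of-bag estimate of generalization accuracy
    random_state=0,
)
rf.fit(X, y)

print("OOB score:", round(rf.oob_score_, 3))
# Variable importance (mean decrease in impurity), sorted like the table above.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)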
Boosting
• Boosting works in a similar way to bagging, except that the trees are grown sequentially: each
tree is grown using information from previously grown trees. Boosting does not involve
bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to allow the subsequent classifier,
Mi+1, to pay more attention to the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where the weight of each
classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting
the model to misclassified data.
Types of Boosting Algorithms
1. AdaBoost (Adaptive Boosting)
2. Gradient Tree Boosting
3. XGBoost (eXtreme Gradient Boosting)
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The
error rate of classifier Mi is the weighted sum of the misclassified tuples:
error(Mi) = Σ (j = 1 to d) wj x err(Xj)
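A hedged sketch (not from the notes) of AdaBoost using scikit-learn's AdaBoostClassifier; the dataset and hyperparameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Decision stumps (depth-1 trees) are the classic weak learners for AdaBoost.
# Older scikit-learn versions name the first parameter base_estimator instead of estimator.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,      # number of boosting rounds (k classifiers)
    learning_rate=0.5,     # shrinks each classifier's contribution to the weighted vote
    random_state=0,
)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean().round(3))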
Deep Learning
• Deep learning attempts to mimic the human brain.
• Deep learning is a part of machine learning, which is in turn part of a larger umbrella called AI.
• Deep Learning is a special type of machine learning consisting of algorithms designed to learn from
large amounts of data, including unstructured data such as images, text, and audio. These
algorithms are inspired by the way our brain works.
Artificial Intelligence
• Artificial Intelligence (AI): a technique which enables machines to mimic human behavior.
• Machine Learning: a subset of AI techniques which use statistical methods to enable machines to
improve with experience.
• Deep Learning: a subset of ML which makes the computation of multi-layer neural networks feasible.
(In the usual diagram the three are nested circles: Deep Learning inside Machine Learning inside AI.)
Neural Networks
• A neural network is a method in artificial intelligence that teaches computers to process data
in a way that is inspired by the human brain. It is a type of machine learning process, called
deep learning, that uses interconnected nodes or neurons in a layered structure that
resembles the human brain.
• Deep learning algorithms like ANN, CNN, RNN, etc.
• Artificial Neural Networks for Regression and Classification
• Convolutional Neural Networks for Computer Vision or Image Processing
• Recurrent Neural Networks for Time Series Analysis
• In a neural network, the independent variables are called input cells and the dependent
variable is called the output cell.
Biological Neurons
• It is an unusual-looking cell mostly found in
animal cerebral cortexes (e.g., your brain),
composed of a cell body containing the nucleus
and most of the cell’s complex components,
and many branching extensions called
dendrites, plus one very long extension called
the axon.
• The axon’s length may be just a few times longer than the cell body, or up to tens of
thousands of times longer. Near its extremity the axon splits off into many branches called
telodendria, and at the tip of these branches are minuscule structures called synaptic
terminals (or simply synapses), which are connected to the dendrites (or directly to the cell
body) of other neurons. Biological neurons receive short electrical impulses called signals
from other neurons via these synapses. When a neuron receives a sufficient number of
signals from other neurons within a few milliseconds, it fires its own signals.
Biological Neuron
• A human brain has billions of neurons. Neurons are interconnected nerve cells in the human
brain that are involved in processing and transmitting chemical and electrical signals.
• Dendrites are branches that receive information from other neurons.
• Cell nucleus or Soma processes the information received from dendrites.
• Axon is a cable that is used by neurons to send information.
• Synapse is the connection between an axon and other neuron dendrites.
• The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank
Rosenblatt.
• At the time it was predicted that the Perceptron "may eventually be able to learn, make decisions,
and translate languages."
From Biological to Artificial Neurons
• An artificial neuron is a way to mimic the biological neuron, which is the building block of the
human brain.
Biological Neuron vs. Artificial Neuron
Biological Neuron Artificial Neuron
Cell Nucleus (Soma) Node
Dendrites Input
Synapse Weights or interconnections
Axon Output
• Each node, or artificial neuron, connects to another and has an associated weight and
threshold. If the output of any individual node is above the specified threshold value, that
node is activated, sending data to the next layer of the network. Otherwise, no data is passed
along to the next layer of the network.
How do artificial neural networks work?
• Input Layer :This layer accepts input features. It provides information from the outside world
to the network, no computation is performed at this layer, nodes here just pass on the
information(features) to the hidden layer.
• Hidden Layer: Hidden layers take their input from the input layer or other hidden layers.
Artificial neural networks can have a large number of hidden layers. Each hidden layer
analyzes the output from the previous layer, processes it further, and passes it on to the next
layer. Hidden layer performs all sort of computation on the features.
Output Layer: The output layer gives the final
result of all the data processing by the artificial
neural network. It can have single or multiple
nodes. For instance, if we have a binary (yes/no)
classification problem, the output layer will have
one output node, which will give the result as 1 or
0. However, if we have a multi-class classification
problem, the output layer might consist of more
than one output node.
How do artificial neural networks work?
Step–1: The inputs are passed to the artificial neuron, multiplied by weights, and a bias is added to
them inside the transfer function.
• What does the "transfer function" do?
• The transfer function creates a weighted sum of all the inputs and adds a constant to it
called the bias:
• W1*X1 + W2*X2 + W3*X3 + ... + Wn*Xn + b
Step–2: Then the resultant value is passed to the activation function. The result of the
activation function is treated as the output of the neuron.
How do artificial neural networks work?
• To perform this, the steps below take place after the input data is passed to the neuron.
1. Each input value is multiplied by a number called a weight (typically initialized to a small value).
2. All of these values are summed up.
3. A number called the bias is added to this sum.
4. The resulting summation is passed to a function called the "Activation Function".
5. The activation function produces an output based on its equation.
6. The output produced by the neurons in the last layer of the network is treated as the network's output.
7. If the output produced by the neurons does not match the actual answer, error signals are sent back
through the network, adjusting the weights and bias values so that the output comes closer to the
actual value; this process is known as Backpropagation.
8. Steps 1-7 are repeated until the difference between the actual value and the predicted value (the
output of the neuron) stops improving; this situation is known as convergence, and we say the
algorithm has reached its maximum possible accuracy.
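A minimal numerical sketch of steps 1-5 (not from the notes), assuming NumPy; the inputs, weights, and bias are made-up values.

import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes the weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs, weights and bias for a single artificial neuron.
x = np.array([0.5, 0.1, 0.8])      # input features X1..X3
w = np.array([0.4, -0.2, 0.1])     # weights W1..W3
b = 0.05                           # bias

z = np.dot(w, x) + b               # transfer function: W1*X1 + W2*X2 + W3*X3 + b
output = sigmoid(z)                # activation function applied to the weighted sum
print(round(float(output), 4))     # neuron output, ~0.577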
Activation Function
• An Activation Function decides whether a neuron should be activated or not. This means
that it will decide whether the neuron’s input to the network is important or not in the
process of prediction using simpler mathematical operations.
• The role of the Activation Function is to derive output from a set of input values fed to a
node (or a layer).
• The primary role of the Activation Function is to transform the summed weighted input
from the node into an output value to be fed to the next hidden layer or as output. that’s
why it’s often referred to as a Transfer Function in Artificial Neural Network.
Sigmoid or Logistic Activation Function
• It is the famous S shaped function that transforms the input values into a range
between 0 and 1.
• Sigmoid function gives an ‘S’ shaped curve. In order to map predicted values to
probabilities, we use the sigmoid function. The function maps any real value into
another value between 0 and 1.
Tanh or hyperbolic tangent Activation Function
• Tanh is also like logistic sigmoid but better. The range of the tanh function is from (-
1 to 1). tanh is also sigmoidal (s - shaped).
ReLU (Rectified Linear Unit) Activation Function
• The ReLU is the most used activation function in the world right now, since it is used in
almost all convolutional neural networks and deep learning models.
• Range: [0, infinity)
Linear or Identity Activation Function
• It takes the inputs, multiplied by the weights for each neuron, and creates
an output signal proportional to the input.
• Range : (-infinity to infinity)
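For concreteness, a short sketch (not from the notes) defining the four activation functions just described, assuming NumPy.

import numpy as np

def sigmoid(z):            # range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):               # range (-1, 1)
    return np.tanh(z)

def relu(z):               # range [0, infinity)
    return np.maximum(0.0, z)

def linear(z):             # identity: range (-infinity, infinity)
    return z

z = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, linear):
    print(fn.__name__, fn(z))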
Which activation function to use?
• Sigmoid functions and their combinations generally work better in the case
of classification problems.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
• ReLU can instead suffer from the "dying neuron" (dead ReLU) problem, where some units get stuck
outputting zero and stop learning.
• The ReLU activation function is widely used and is the default choice as it yields better results.
• If we encounter a case of dead neurons in our networks the leaky ReLU function is the
best choice.
• ReLU function should only be used in the hidden layers.
• An output layer can be linear activation function in case of regression problems.
Cost Function
and
Gradient Descent
Cost Function
• A Cost Function is used to measure just how wrong the model is in finding a relation between the input and output.
• Cost Function quantifies the error between predicted values and expected values and presents it in
the form of a single real number.
• The cost function is a measure of "how good" a neural network did with respect to its given training
sample and the expected output. It may also depend on variables such as weights and biases.
• A cost function is a single value, not a vector, because it rates how good the neural network did as a
whole. Specifically, a cost function is of the form C(W,B,Sr,Er)
• Where W is our neural network's weights, B is our neural network's biases, Sr is the input of a single
training sample, and Er is the desired output of that training sample.
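Since the section heading also names gradient descent, here is a minimal sketch (not from the notes) of a mean squared error cost and a few gradient descent updates for a one-parameter linear model; all values are illustrative.

import numpy as np

# Toy data: y is roughly 2 * x, so the ideal weight is near 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def cost(w):
    """Mean squared error between predictions w*x and targets y (a single real number)."""
    return np.mean((w * x - y) ** 2)

w = 0.0                 # initial weight
learning_rate = 0.02
for step in range(200):
    grad = np.mean(2 * (w * x - y) * x)   # dC/dw for the MSE cost
    w -= learning_rate * grad             # gradient descent update
print(round(w, 3), round(cost(w), 4))     # w converges close to 2 with a small remaining cost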
What Is Association Mining?
Motivation: Finding regularities in data
• What products were often purchased together? — Beer and diapers
• What are the subsequent purchases after buying a Phone?
Applications
• Market Basket Analysis, Medical Diagnosis, Catalog Design, Sale Campaign Analysis
Market Basket Analysis
• Market Basket Analysis is a typical example of frequent itemset mining. This process analyzes
customer buying habits by finding associations between the different items that customers place in
their "shopping baskets".
• For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
• Market Basket is simply the list of products purchased by the customer.
Support
Support indicates how frequently the if/then relationship appears in the database.
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk
• Support count σ(X): frequency of occurrence of an itemset X, e.g. σ({Beer, Diaper}) = 3.
• Support, s, is the probability that a transaction contains A ∪ B:
  s = support(A → B) = P(A ∪ B) = support count(A ∪ B) / |T|, where |T| is the number of transactions.
• Beer → Diaper: support = 3/5 = 0.6 = 60%; Diaper → Beer has the same support.
Frequent Itemset: an itemset whose support is greater than or equal to a minimum support
threshold.
Confidence
• Confidence tells how often the if/then relationship has been found to be true.
• Confidence, c, is the conditional probability that a transaction containing A also contains B:
  c = confidence(A → B) = P(B|A) = support count(A ∪ B) / support count(A)
Example rules (computed from the five-transaction grocery dataset shown after the Apriori example below):
{Milk, Diaper} → {Beer}  (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}  (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}  (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}  (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}  (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}  (s = 0.4, c = 0.5)
Association Rule Mining process
• In general, association rule mining can be viewed as a two-step process:
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
• Apriori Algorithm
• ECLAT Algorithm
• FP-Growth Algorithm
2. Rule Generation
– Generate strong association rules from the frequent itemsets: these rules must
satisfy minimum support and minimum confidence.
The Apriori Algorithm
• The name, Apriori, is based on the fact that the algorithm uses prior knowledge of frequent
itemset properties.
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1) itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of
each Lk requires one full scan of the database.
• Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
• All subsets of a frequent itemset must be frequent(Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Apriori: A Candidate Generation-and-test Approach
Method: join and prune steps
• The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
This set of candidates is denoted Ck.
• The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but
all of the frequent k-itemsets are included in Ck. A database scan to determine the count of
each candidate in Ck would result in the determination of Lk (i.e., all candidates having a
count no less than the minimum support count are frequent by definition, and therefore
belong to Lk). Ck, however, can be huge, and so this could involve heavy computation.
• To reduce the size of Ck, the Apriori property is used as follows. Any (k−1)-itemset that is not
frequent cannot be a subset of a frequent k-itemset. Hence, if any (k−1)-subset of a
candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can
be removed from Ck.
Consider a database, D, consisting of 9 transactions. Suppose min.support count required is 2
and let min.confidence required is 70%. Use the apriori algorithm to generate all the frequent
candidate itemsets Ci and frequent itemsets Li. Then, generate the strong association rules
from frequent itemsets using min. support & min. confidence.
The Apriori Algorithm—Example
K = 1
• Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
• Compare the candidate (C1) support counts with the minimum support count (here min_support = 2); items whose
support count is below min_support are removed. This gives us itemset L1.
K = 2
• Generate candidate set C2 by joining L1 with itself, and find the support count of each candidate pair by
scanning the dataset.
• Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the
support count of a candidate itemset is less than min_support, remove that itemset. This gives us itemset L2.
K = 3
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets
should have (k−2) elements in common; so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of
{I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4}
is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by scanning the dataset.
• Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the
support count of a candidate itemset is less than min_support, remove that itemset. This gives us itemset L3.
K = 4
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (k = 4) is that they
should have (k−2) elements in common; so here, for L3, the first two elements (items) should match.
• Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is
{I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.
Generating Association Rules from Frequent Itemsets
Selecting Strong Rules
• Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate
strong association rules from them (where strong association rules satisfy both minimum support and minimum
confidence).
The resulting association rules are as shown below, each listed with its confidence:
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because
these are the only ones generated that are strong.
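One way to run this two-step process in code is sketched below. It assumes the third-party mlxtend library (not mentioned in the notes); the dataset and threshold values are illustrative and reuse the five grocery transactions listed after the Apriori example.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Five grocery transactions (the TID table shown below).
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 1: frequent itemset generation (support >= minsup).
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Step 2: rule generation with a minimum confidence threshold.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])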
Lift
• Lift is a simple correlation measure.
• The lift between the occurrence of A and B can be measured by
  lift(A, B) = P(A ∪ B) / (P(A) * P(B)) = confidence(A → B) / support(B)
• The lift of the association (or correlation) rule A → B therefore computes the ratio between the rule's
confidence and the support of the itemset in the rule consequent.
• If the lift is less than 1, then the occurrence of A is negatively correlated with the occurrence
of B, meaning that the occurrence of one likely leads to the absence of the other one.
• If the resulting value is greater than 1, then A and B are positively correlated, meaning that
the occurrence of one implies the occurrence of the other.
• If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
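To tie the three measures together, here is a small dependency-free sketch (not from the notes) that evaluates the rule Beer → Diaper on the five transactions from the Support example; the helper function names are made up.

# Transactions from the Support example (Tid 10-50).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"Beer"}, {"Diaper"}
print(support(A | B))      # 0.6  -> Beer and Diaper appear together in 3 of 5 transactions
print(confidence(A, B))    # 1.0  -> every Beer transaction also contains Diaper
print(lift(A, B))          # 1.25 -> greater than 1, so Beer and Diaper are positively correlated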
The Apriori Algorithm—An Example
Supmin = 2
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E
1st scan, C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
Pruning ({D} is below Supmin), L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

Transaction data used for the support/confidence example rules above:
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
Collaborative Filtering
Movie Recommendation
Recommender systems
• Recommender systems are self-explanatory; as the name suggests, they are systems or
techniques that recommend or suggest a particular product, service, or entity.
• E.g., in the case of Netflix, which movie to watch; in the case of e-commerce, which product to
buy; in the case of Kindle, which book to read; which songs to listen to (Spotify), etc.
• Recommender systems can be classified into the following two categories, based on their
approach to providing recommendations.
Types of recommendation system
• Content-Based Filtering
• Collaborative Based Filtering
Collaborative Recommendation
• The basic idea of these systems is that if users shared the same interests in the past – if they
viewed or bought the same books, for instance – they will also have similar tastes in the
future.
• For example, user A and user B have a purchase history that overlaps strongly, and user A
has recently bought a book that B has not yet seen; the basic rationale is to propose this
book also to B. Because this selection of hopefully interesting books involves filtering the
most promising ones from a large set and because the users implicitly collaborate with one
another, this technique is also called collaborative filtering (CF).
• Collaborative filtering is based on the fact that relationships exist between products and
people’s interests. Many recommendation systems use collaborative filtering to find these
relationships and to give an accurate recommendation of a product that the user might like
or be interested in.
Data Type and Format:
Collaborative filtering requires availability of all item–user information. Specifically, for each
item–user combination, we should have some measure of the user's preference for that item.
Preference can be a numerical rating or a binary behavior such as a purchase, a 'like', or a click.
• The correlation-based approach defines the similarity between two users x and y as the Pearson
correlation of their ratings over the items both have rated:
  sim(x, y) = Σ (i in Ixy) (rx,i − mean(rx)) (ry,i − mean(ry)) / [ sqrt(Σ (i in Ixy) (rx,i − mean(rx))²) * sqrt(Σ (i in Ixy) (ry,i − mean(ry))²) ]
  where Ixy is the set of items rated by both user x and user y, rx,i is user x's rating of item i, and
  mean(rx) is user x's average rating.
• The cosine-based approach defines the cosine similarity between two users x and y as:
  sim(x, y) = (rx · ry) / (||rx|| * ||ry||)
  where rx and ry are the two users' rating vectors.
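A tiny sketch (not from the notes) of the cosine similarity between two users' rating vectors, assuming NumPy; the ratings are made up.

import numpy as np

def cosine_similarity(rx, ry):
    """Cosine of the angle between two users' rating vectors."""
    return np.dot(rx, ry) / (np.linalg.norm(rx) * np.linalg.norm(ry))

# Made-up ratings of the same five items by two users (0 = not rated).
user_x = np.array([5, 3, 0, 4, 4])
user_y = np.array([4, 0, 0, 5, 5])

print(round(float(cosine_similarity(user_x, user_y)), 3))   # ~0.909, i.e. similar tastes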
User-based filtering
• The main idea behind user-based filtering is that if we are able to find users that have bought and
liked similar items in the past, they are more likely to buy similar items in the future too. Therefore,
these models recommend items to a user that similar users have also liked. Amazon's "Customers who
bought this item also bought" is an example of this filter.
• In user-based collaborative filtering, we have an active user for whom the recommendation is aimed.
The collaborative filtering engine first looks for users who are similar, that is, users who share the
active user's rating patterns. Collaborative filtering bases this similarity on things like history,
preference, and choices that users make when buying, watching, or enjoying something.
Item-based filtering