DA Unit - IV

Introduction UNIT - IV

Features of Python

1 High-level, interpreted, interactive, object-oriented scripting language

2 Simple and easy to learn


3 Portable

4 Free and open source


Introduction UNIT - IV

What is Python?

Python is an object-oriented, high-level programming language.
5 Performs complex tasks using a few lines of code

6 Runs equally well on different platforms such as Windows, Linux, Unix, Macintosh, etc.

7 Provides a vast range of libraries for various fields such as machine learning, web development, and scripting
Introduction UNIT - IV

Advantages of Python

● Ease of programming

● Minimizes the time to develop and maintain code

● Modular and object-oriented

● Large community of users

● A large standard and user-contributed library


Introduction UNIT - IV

Disadvantages of Python

● Interpreted and therefore slower than compiled languages

● Decentralized with packages


Introduction UNIT - IV

Essential Python Libraries

● A library is a collection of files (called modules) that contains functions for other
programs.

● A Python library is a reusable chunk of code that you may include in your
programs.
Introduction UNIT - IV

Essential Python Libraries

01 NumPy
02 Pandas
03 SciPy
04 SciKit-Learn
Introduction UNIT - IV

Essential Python Libraries

01 NumPy

● NumPy (Numerical Python) is a perfect tool for scientific computing and performing basic and advanced array operations.

● The library offers many handy features for performing operations on n-arrays and matrices in Python.

● It helps to process arrays that store values of the same data type and makes performing math operations on arrays easier.
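A minimal sketch of these array operations, assuming NumPy is installed (the array values are invented for illustration):

```python
import numpy as np

# Arrays hold values of the same data type, so math is vectorized
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

elementwise_sum = a + b   # element-by-element addition, no explicit loop
scaled = a * 2            # broadcast a scalar across the whole array

# A small 2x2 matrix and its transpose
m = np.array([[1, 2], [3, 4]])
mt = m.T
```

Here `a + b` adds corresponding elements directly, which is the convenience for math on same-typed arrays that the slide describes.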
Introduction UNIT - IV

Essential Python Libraries

02 Pandas

● It is one of the most popular Python libraries in data science.

● It provides support for data structures and data analysis tools.

● The library is optimized to perform data science tasks especially fast and efficiently.

● Pandas is best suited for structured, labelled data; in other words, tabular data that has headings associated with each column of data.
Introduction UNIT - IV

Essential Python Libraries

02 Pandas

Pandas has two core data structures used to store data: the Series and the DataFrame.

Introduction UNIT - IV

Essential Python Libraries

Series
02 Pandas

● The Series is a one-dimensional array-like structure, designed to hold a single array (or 'column') of data and an associated array of data labels, called an index.
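A short sketch of a Series, assuming pandas is installed (the subject names and marks are hypothetical):

```python
import pandas as pd

# One 'column' of data plus an associated array of labels (the index)
marks = pd.Series([85, 92, 78], index=["maths", "physics", "chemistry"])

physics_mark = marks["physics"]   # look up a value by its index label
highest = marks.max()
```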
Introduction UNIT - IV

Essential Python Libraries

DataFrame
02 Pandas

● The DataFrame represents tabular data, a bit like a spreadsheet.

● DataFrames are organised into columns.

● Each column can store a single data type, such as floating point numbers, strings, boolean values, etc.

● DataFrames can be indexed by either their row or column names.
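A minimal DataFrame sketch, assuming pandas is installed (the names and scores are invented):

```python
import pandas as pd

# Tabular data: each column stores a single data type
df = pd.DataFrame({
    "name":   ["Asha", "Ravi", "Meena"],   # strings
    "score":  [85.5, 90.0, 78.25],         # floating point numbers
    "passed": [True, True, False],         # boolean values
})

mean_score = df["score"].mean()   # operate on one column
second_name = df.loc[1, "name"]   # index by row label and column name
```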
Introduction UNIT - IV

Essential Python Libraries

03 SciPy

● SciPy contains many different packages and modules to assist in mathematics and scientific computing.

● It is difficult to state a single use case for SciPy, considering that it contains so many different useful packages.
Introduction UNIT - IV

Essential Python Libraries

Some of the important packages in the SciPy ecosystem include:

Matplotlib

● A 2D plotting library that can be used in Python scripts, the Python and IPython shells, web application servers, and more.
Introduction UNIT - IV

Essential Python Libraries

IPython

● An interactive console that runs your code like the Python shell, but gives you even more features, like support for data visualizations.
Introduction UNIT - IV
Essential Python Libraries

04 SciKit-Learn

● Scikit-learn is probably the most useful library for machine learning in Python.
Introduction UNIT - IV
Essential Python Libraries

04 SciKit-Learn

● This library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
Introduction UNIT - IV

Essential Python Libraries

● Scikit-learn comes loaded with a lot of features.

1. Supervised learning algorithms:

❏ Think of any supervised learning algorithm you might have heard about and there is a very high chance that it is part of scikit-learn.
Introduction UNIT - IV

Essential Python Libraries

2. Cross-validation:

❏ There are various methods to check the accuracy of supervised models on unseen data.
Introduction UNIT - IV

Essential Python Libraries

3. Unsupervised learning algorithms:

❏ There is a large spread of algorithms on offer, starting from clustering, factor analysis, and principal component analysis to unsupervised neural networks.
Introduction UNIT - IV

Essential Python Libraries

4. Various toy datasets:

❏ These come in handy while learning scikit-learn.

❏ For example: the Iris dataset and the Boston house prices dataset.


Introduction UNIT - IV

Essential Python Libraries

5. Feature extraction:

❏ Useful for extracting features from images and text (e.g. bag of words).
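Several of the features above can be seen together in a small hedged sketch, assuming scikit-learn is installed: a toy dataset (the Iris dataset), a supervised learning algorithm (a decision tree), and cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (feature 4) and a supervised algorithm (feature 1)
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Cross-validation (feature 2): accuracy measured on 5 held-out folds
scores = cross_val_score(clf, X, y, cv=5)
```

Each entry of `scores` is the model's accuracy on one fold of data it did not see during fitting, which is exactly the "accuracy on unseen data" check described above.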
Data Preprocessing UNIT - IV

● Data preprocessing is a data mining technique that involves transforming raw data
into an understandable format.

● It aims to reduce the data size, find relationships between data items, and normalize them.
Data Preprocessing UNIT - IV

Why Data Preprocessing

● Data captured from various sources is not pure.

● It contains some noise.

● It is called dirty or incomplete data.

● In this data, attribute values or attributes of interest may be lacking, or the data may contain only aggregate data. For example: occupation = " ".

● Noisy data contains errors or outliers. For example: Salary = "-10".
Data Preprocessing UNIT - IV

Why Data Preprocessing

● Inconsistent data contains discrepancies in codes or names. For example: Age = "51", Birthday = "03/09/1998".

● Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses.

● Incomplete data can occur for a variety of reasons.
Data Preprocessing UNIT - IV

Steps during pre-processing

1 Data Cleaning

● Data is cleansed through processes such as filling in missing values, smoothing noisy data, or resolving inconsistencies in the data.
Data Preprocessing UNIT - IV

Steps during pre-processing

2 Data Integration

● Data with different representations are put together and conflicts within the data are resolved.
Data Preprocessing UNIT - IV

Steps during pre-processing

3 Data Transformation

● Data is normalized, aggregated and generalized.


Data Preprocessing UNIT - IV

Steps during pre-processing

4 Data Reduction

● Data is reduced in volume so as to produce the same or similar analytical results from a smaller representation.
Data Preprocessing UNIT - IV

Steps during pre-processing

5 Data Discretization

● Involves reducing the number of values of a continuous attribute by dividing the range of the attribute into intervals.
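Discretization as described above can be sketched with pandas' `cut`, which divides a continuous attribute's range into labelled intervals (the ages and bin edges below are hypothetical):

```python
import pandas as pd

# Continuous attribute: age
ages = pd.Series([5, 17, 25, 42, 63, 80])

# Divide the range into intervals and replace values with interval labels
age_groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                    labels=["child", "young", "middle", "senior"])
```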
Data Preprocessing UNIT - IV

Removing Duplicates

● Removing Duplicates in the context of data quality is where an organisation


looks to identify and then remove instances where there is more than one record
of a single person.
Data Preprocessing UNIT - IV

Removing Duplicates

● With large scales of data, this will often be done using tools that find and merge
duplicate records in an existing database and prevent new ones from entering it
based on similarities in specific fields.
Data Preprocessing UNIT - IV

Removing Duplicates

● Preparing a dataset before designing a machine learning model is an important


task for the data scientist.
● If there are many duplicates, the machine learning model will be useless or inaccurate. Therefore, you must know how to remove duplicates from the dataset.
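A small sketch of duplicate removal with pandas, matching on all columns or on a specific field (the names and emails are hypothetical records):

```python
import pandas as pd

# The same person recorded more than once
people = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
})

exact = people.drop_duplicates()                     # drop exact-duplicate rows
by_email = people.drop_duplicates(subset=["email"])  # match on a specific field
```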
Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

● Data cleaning routines attempt to fill in missing values, smooth


out noise while identifying outliers, and correct inconsistencies in
the data.
Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

● The various methods for handling the problem of missing values


in data tuples are as follows:

Ignoring the tuple

● This is usually done when the class label is missing.


Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

Manually filling in the missing value

● This approach is time-consuming and may not be a reasonable task for


large data sets with many missing values, especially when the value to be
filled in is not easily determined.
Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

Using a global constant to fill in the missing value

● Replace all missing attribute values by the same constant


Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

● Using a measure of central tendency for the attribute, such as the mean, the median, or the mode.
Data Preprocessing UNIT - IV

Removing Duplicates

1 Handling missing data values

● Using the attribute mean for numeric values, or the attribute mode for nominal values, for all samples belonging to the same class as the given tuple.
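The methods above can be sketched with pandas (the column names and salary values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["A", "A", "B", "B", "A"],
    "salary": [30000.0, None, 50000.0, None, 40000.0],
})

dropped = df.dropna()                                 # ignore the tuple
constant = df["salary"].fillna(0)                     # fill with a global constant
mean_fill = df["salary"].fillna(df["salary"].mean())  # central tendency (mean)
# Mean of the class (dept) that each tuple belongs to
class_fill = df.groupby("dept")["salary"].transform(
    lambda s: s.fillna(s.mean()))
```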
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

● Data transformation is the process of converting data from one format or


structure into another format or structure.

● Data transformation is critical to activities such as data integration and data


management.
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Common reasons to transform data:

❏ Moving data to a new data store

❏ Users want to join unstructured or streaming data with structured data so they can analyze the data together
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Common reasons to transform data:

❏ Users want to add information to data to enrich it, such as performing lookups, adding geolocation data, or adding timestamps.
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Common reasons to transform data:

❏ Users want to perform aggregations, such as comparing sales data from


different regions or totalling sales from different regions
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Different ways to transform data:

Scripting

❏ Use SQL or Python to write the code to extract and transform the data.
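A small sketch of scripted transformation in Python with pandas (the sales figures are invented): it normalizes one column and aggregates totals per region, two transformations mentioned above.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [200.0, 100.0, 300.0, 400.0],
})

# Normalization: rescale amount to the [0, 1] range
lo, hi = sales["amount"].min(), sales["amount"].max()
sales["amount_norm"] = (sales["amount"] - lo) / (hi - lo)

# Aggregation: total sales per region
totals = sales.groupby("region")["amount"].sum()
```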
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Different ways to transform data:

On-premise ETL tools

❏ ETL (Extract, Transform, Load) tools can take much of the pain out of scripting the transformations by automating the process.

❏ These tools are typically hosted on your company's site, and may require extensive expertise and infrastructure cost.
Data Preprocessing UNIT - IV

Removing Duplicates

2 Transformation of data using function or mapping

Different ways to transform data:

Cloud-based ETL tools

❏ These ETL tools are hosted in the cloud.

❏ There you can leverage the expertise and infrastructure of the vendor.


Data Preprocessing UNIT - IV

Analytics Types

Business analytics is the process of making sense of gathered data, measuring business performance, and producing valuable conclusions that can help companies make informed decisions about the future of the business, through the use of various statistical methods and techniques.


Data Preprocessing UNIT - IV

Analytics Types

❏ Business Analytics (BA) is the iterative, methodical exploration of an organization's


data, with an emphasis on statistical analysis.

❏ Business analytics is used by companies that are committed to making data-driven


decisions.

❏ Business analytics combines the fields of management, business and computer


science.
Data Preprocessing UNIT - IV

Analytics Types

❏ The analytical part requires an understanding of data, statistics and computer


science.

❏ Business analytics utilizes big data, statistical analysis and data visualization to
implement organization changes.
Data Preprocessing UNIT - IV
Analytics Types

Challenges with developing and implementing business analytics:

● Executive ownership
● IT involvement
● Project Management Office (PMO)
● Available production data vs. cleansed modeling data
● End user involvement and buy-in
● Change management
Data Preprocessing UNIT - IV
Analytics Types

Data-driven decision-making process uses the following steps:

1. Identify the problem or opportunity for value creation

2. Identify primary as well as secondary data sources.

3. Pre-process the data for issues such as missing and incorrect data. Generate derived
variables and transform the data if necessary. Prepare the data for analytics model
building.
Data Preprocessing UNIT - IV
Analytics Types

Data-driven decision-making process uses the following steps:

4. Divide the data sets into subsets: training and validation data sets.

5. Build analytical models and identify the best model(s) using model performance
in validation data.

6. Implement solution / Decision / Develop product.


Data Preprocessing UNIT - IV

Analytics Types

Descriptive | Predictive | Prescriptive
Analytics Types Data Preprocessing UNIT - IV

Predictive

Predictive analytics tells you what could happen in the future.

❏ Predictive analytics helps your organization predict with confidence what will
happen next so that you can make smarter decisions and improve business
outcomes.

❏ The purpose of the predictive model is to find the likelihood that different samples will perform in a specific way.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Predictive analytics tells you what could happen in the future.

❏ The predictive model typically calculates live transactions multiple times to


help evaluate the benefit of a customer transaction.

❏ Predictive models typically utilize a variety of variable data to make the


prediction.

❏ The variability of the component data will have a relationship with what it is likely
to predict.
Analytics Types Data Preprocessing UNIT - IV

Predictive

The predictive analytics process is a cycle:

1. Project definition
2. Data collection
3. Analysis
4. Statistics
5. Modelling
6. Deployment
7. Monitoring
Analytics Types Data Preprocessing UNIT - IV

Predictive

Project definition

❏ Identify what the outcome of the project shall be, along with the deliverables and business objectives, and based on that gather the data sets that are to be used.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Data collection

❏ This is more of the big basket where all data from various sources are binned
for usage.

❏ This gives a picture of the various customer interactions as a single-view item.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Analysis

❏ The data is inspected, cleansed, transformed, and modelled to discover whether it really provides useful information, ultimately arriving at a conclusion.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Statistics

❏ This enables validating whether the findings, assumptions, and hypotheses are fine to go ahead with, and testing them using a statistical model.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Modelling

❏ Through this, accurate predictive models about the future can be provided.

❏ From the options available, the best option can be chosen as the required solution via multi-model evaluation.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Deployment

❏ Through predictive model deployment, an option is created to turn the analytics results into everyday, effective decisions.

❏ This way, results, reports, and other metrics can be obtained based on the modelling.
Analytics Types Data Preprocessing UNIT - IV

Predictive

Monitoring

❏ Models are monitored to control and check for performance conformance to


ensure that the desired results are obtained as expected.
Analytics Types Data Preprocessing UNIT - IV

Examples of Predictive Analytics

Social Media Analysis | Weather | Retail | Health care | Fraud detection
Analytics Types Data Preprocessing UNIT - IV


Social Media Analysis

❏ Online social media represents a fundamental shift in how information is being produced, particularly as it relates to businesses.
Analytics Types Data Preprocessing UNIT - IV


Weather

❏ Weather forecasting has improved by leaps and bounds thanks to predictive analytics models.
Analytics Types Data Preprocessing UNIT - IV

Retail

❏ Probably the largest sector to use predictive analytics, retail is always looking to improve its sales position and to build better relations with customers.

❏ One of the most ubiquitous examples is Amazon's recommendations.
Analytics Types Data Preprocessing UNIT - IV

Health care

❏ Usage of predictive analytics in the healthcare domain can aid in determining and preventing cases and risks of developing certain health-related complications like diabetes, asthma, and other life-threatening ailments.

❏ Through the use of predictive analytics in health care, better clinical decisions can be made.
Analytics Types Data Preprocessing UNIT - IV


Fraud detection

❏ Predictive analytics can aid in spotting inaccurate credit applications, deviant transactions leading to fraud both online and offline, identity theft, and false insurance claims, saving financial and insurance institutions lots of issues and damage to their operations.
Analytics Types Data Preprocessing UNIT - IV

Descriptive

❏ It is a simple method used in the first phase of analytics; it involves gathering, organizing, tabulating, and depicting data, and then describing the characteristics of what we are studying.
Analytics Types Data Preprocessing UNIT - IV

Descriptive

❏ The descriptive model shows relationships between the product/service with the
acquired data.

❏ This model can be used to organize a customer by their personal preferences


for example.
Analytics Types Data Preprocessing UNIT - IV

Descriptive

❏ Descriptive statistics are useful to show things like, total stock in inventory,
average dollars spent per customer and year over year change in sales.

❏ While business intelligence tries to make sense of all the data that's collected
each and every day by organizations of all types, communicating the data in a
way that people can easily grasp often becomes an issue.
Analytics Types Data Preprocessing UNIT - IV

Examples of Descriptive Analytics

Reports that provide historical insights regarding the company's production, financials, operations, sales, inventory, etc.
Analytics Types Data Preprocessing UNIT - IV
Descriptive

❏ Data visualization evolved because data


displayed graphically allows for an easier
comprehension of the information,
validating the old adage,

❏ "a picture is worth a thousand words."


Analytics Types Data Preprocessing UNIT - IV

Descriptive

❏ In business, proper data visualization provides a different approach to showing potential connections, relationships, etc., which are not as obvious in non-visual data.

❏ A business intelligence dashboard is an information management tool that is


used to track KPIs, metrics and other key data points relevant to a business,
department or specific process.
Analytics Types Data Preprocessing UNIT - IV

Prescriptive

❏ This model suggests a course of action.

❏ Prescriptive analytics assists users in finding the optimal solution to a problem or


in making the right choice/decision among several alternatives.

❏ The prescriptive model utilizes an understanding of what has happened, why it


has happened and a variety of "what-might-happen" analysis to help the user
determine the best course of action to take.
Analytics Types Data Preprocessing UNIT - IV

Prescriptive

Examples of Prescriptive Analytics

Traffic Applications | Product Optimization | Operational Research


Analytics Types Data Preprocessing UNIT - IV

Fig. Relationship between descriptive, predictive & prescriptive analytics


Market Basket Analysis UNIT - IV

It is a technique that allows us to discover relationships between products.


Market Basket Analysis UNIT - IV

It can also be called Association Analysis or Frequent Itemset Mining.

Market Basket Analysis: Why?

Use Cases (Applications) of Association Rule Mining

(Source: https://blog.rsquaredacademy.com/market-basket-analysis-in-r/)
UNIT - IV

Simple Example (figures): Transaction Data, Frequent Item Set, Association Rule, Support, Confidence, Lift
UNIT - IV
Simple Example - Association Rule Lift: Interpretation

● Lift = 1: implies no relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together only by chance)
● Lift > 1: implies that there is a positive relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together more often than random)
● Lift < 1: implies that there is a negative relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together less often than random)

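These three cases can be checked numerically; the five transactions below are hypothetical, chosen only to illustrate the formulas:

```python
# Hypothetical market baskets
transactions = [
    {"mobile phone", "screen guard"},
    {"mobile phone", "screen guard", "earphones"},
    {"mobile phone"},
    {"screen guard"},
    {"earphones"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions containing every item in `items`
    return sum(items <= t for t in transactions) / n

sup_both = support({"mobile phone", "screen guard"})
confidence = sup_both / support({"mobile phone"})
lift = sup_both / (support({"mobile phone"}) * support({"screen guard"}))
# With this data lift > 1, i.e. the two items co-occur more often than random
```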
UNIT - IV

● Frequent itemsets from the previous section can form candidate rules such as X implies Y (X → Y).
UNIT - IV

The appropriateness of a candidate rule is measured by Support, Confidence, and Lift.
UNIT - IV
Association Rule / Apriori Example

Minimum support = 0.5 or 50%, i.e. 9/2 = 4.5 ≈ 4 transactions

Transactions:

TID   List_Of_Item_IDs
T100  I1, I2, I5
T101  I2, I4
T102  I2, I5
T103  I1, I2, I4
T104  I1, I2, I3
T105  I2, I3
T106  I1, I2, I3, I4
T107  I1, I2, I3
T108  I1, I3, I5

1-itemset frequencies:

Item Set   Frequency
{I1}       6
{I2}       8
{I3}       5
{I4}       3
{I5}       3
UNIT - IV

After pruning (minimum support count = 4):

Item Set   Frequency
{I1}       6
{I2}       8
{I3}       5

({I4} and {I5} are dropped: frequency 3 < 4.)
UNIT - IV

Candidate generation (2-itemsets) and pruning:

Item Set   Frequency
{I1, I2}   5
{I1, I3}   4
{I2, I3}   4

(All three survive pruning: frequency ≥ 4.)
UNIT - IV

Candidate generation (3-itemsets):

Item Set       Frequency
{I1, I2, I3}   3

After pruning no itemset remains (3 < 4), so go back to the previous stage.

We have 3 rules:
1. I1 => I2
2. I1 => I3
3. I2 => I3
Example - Support

Support = Freq(X + Y) / Number of transactions

Rule       Freq(X + Y)   Calculation   Support
I1 => I2   5             5/9           0.55
I1 => I3   4             4/9           0.44
I2 => I3   4             4/9           0.44
Example - Confidence

Confidence(X => Y) = Freq(X + Y) / Freq(X)

Rule       Freq(X)   Freq(X + Y)   Calculation   Confidence
I1 => I2   6         5             5/6           0.83
I1 => I3   6         4             4/6           0.66
I2 => I3   8         4             4/8           0.50
Example - Lift

Lift = Support(X + Y) / (Support(X) * Support(Y))

Rule       Support(X + Y)   Support(X)   Support(Y)   Calculation            Lift
I1 => I2   0.55             6/9 = 0.66   8/9 = 0.88   0.55 / (0.66 * 0.88)   0.94
I1 => I3   0.44             6/9 = 0.66   5/9 = 0.55   0.44 / (0.66 * 0.55)   1.21
I2 => I3   0.44             8/9 = 0.88   5/9 = 0.55   0.44 / (0.88 * 0.55)   0.90
Example - Summary

Rule       Support   Confidence   Lift
I1 => I2   0.55      0.83         0.94
I1 => I3   0.44      0.66         1.21
I2 => I3   0.44      0.50         0.90
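The support, confidence, and lift values in the tables above can be cross-checked with a short script over the nine transactions; small differences (e.g. 0.55 vs 0.56, 1.21 vs 1.2) come from the slides rounding or truncating intermediate values:

```python
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I5"}, {"I1", "I2", "I4"},
    {"I1", "I2", "I3"}, {"I2", "I3"}, {"I1", "I2", "I3", "I4"},
    {"I1", "I2", "I3"}, {"I1", "I3", "I5"},
]
n = len(transactions)

def freq(items):
    # Number of transactions containing every item in `items`
    return sum(items <= t for t in transactions)

def support(x, y):
    return freq(x | y) / n

def confidence(x, y):
    return freq(x | y) / freq(x)

def lift(x, y):
    return n * freq(x | y) / (freq(x) * freq(y))

sup_12 = round(support({"I1"}, {"I2"}), 2)      # 5/9
conf_12 = round(confidence({"I1"}, {"I2"}), 2)  # 5/6
lift_13 = round(lift({"I1"}, {"I3"}), 2)        # (4/9) / ((6/9) * (5/9))
```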

Applications of Association Rules
The term market basket analysis refers to a specific
implementation of association rules

•For better merchandising – products to include/exclude from inventory each month


•Placement of products
•Cross-selling
•Promotional programs—multiple product purchase incentives managed through a
loyalty card program
Market Basket Analysis UNIT - IV

It creates If-Then scenario rules:

If Item A is purchased, then Item B is likely to be purchased.

Rule written as: If {A} Then {B}


Market Basket Analysis UNIT - IV

The "If" part (Item A is purchased) is called the antecedent; it is the condition.

The "Then" part (Item B is likely to be purchased) is called the consequent; it is the result.
Market Basket Analysis UNIT - IV

Algorithm: association rules are mined using the Apriori algorithm.
Market Basket Analysis UNIT - IV

The association rule measures are Support, Confidence, and Lift.
Market Basket Analysis UNIT - IV

Support

● Support is the number of transactions that include items in both the {A} and {B} parts of the rule, as a percentage of the total number of transactions.

● It is a measure of how frequently the collection of items occurs together, as a percentage of all transactions.

Support = Freq(A + B) / Total number of transactions
Market Basket Analysis UNIT - IV

Confidence

● Confidence of the rule is the ratio of the number of transactions that include all items in {A} as well as {B} to the number of transactions that include all items in {A}.

Confidence = Freq(A + B) / Freq(A)
Association Rules UNIT - IV

❏ Association analysis is useful for discovering interesting relationships hidden in large


data sets.

❏ The uncovered relationships can be represented in the form of association rules or sets
of frequent items.
Association Rules UNIT - IV

❏ Association rule mining is a procedure which is meant to find frequent patterns,


correlations, associations, or causal structures from data sets found in various kinds of
databases such as relational databases, transactional databases, and other forms of
data repositories.

❏ Association rules are if/then statements that help uncover relationships between
seemingly unrelated data in a transactional database, relational database or other
information repository.
Association Rules UNIT - IV

❏ An example of an association rule would be:

"If a customer buys a packet of bread, he is 80% likely to also purchase milk."

Market basket transactions:

ID   Items
1    {Bread, Milk}
2    {Bread, Milk, Cola, Sugar}
3    {Bread, Milk, Tea, Sugar}
…    …

{ Bread, Milk }        Example of frequent itemset
{ Bread } → { Milk }   Example of association rule

Association Rules UNIT - IV

❏ Association rule mining can be viewed as a two-step process :

1. Find all frequent itemsets :


By definition, each of these item sets will occur at least as frequently as a predetermined
minimum support count, min sup.

2. Generate strong association rules from the frequent item sets :


By definition, these rules must satisfy minimum support and minimum confidence.
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution

● Find the frequent itemsets and generate association rules on the given dataset.
● Assume a minimum support threshold (s = 33.33%) and a minimum confidence threshold (c = 60%).

Do not consider itemsets whose frequency is < 2.


Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 1: Generating the 1-itemset frequent pattern
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 2: Generating the 2-itemset frequent pattern
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm → Table P. 4.4.3, transactions with 8 items

Solution Step 3: Generating the 3-itemset frequent pattern
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 4: Frequent itemset

Frequent Itemset (I) = {Hot Dogs, Coke, Chips}
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm → Table P. 4.4.3 transaction with 8 items

Solution Step 5: Generating association rules from frequent itemsets

● For each frequent itemset "l", generate all nonempty subsets of l.
● For every nonempty subset s of l, output the rule "s → (l - s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 5: Generating association rules from frequent itemsets

● [Hot Dogs ^ Coke] => [Chips]
● Confidence = sup(Hot Dogs ^ Coke ^ Chips) / sup(Hot Dogs ^ Coke) = 2/2 * 100 = 100%
● Selected
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 5 (continued)

● [Hot Dogs ^ Chips] => [Coke]
● Confidence = sup(Hot Dogs ^ Coke ^ Chips) / sup(Hot Dogs ^ Chips) = 2/2 * 100 = 100%
● Selected
Market Basket Analysis UNIT - IV
Apriori Algorithm

Solution Step 5 (continued)

● [Coke ^ Chips] => [Hot Dogs]
● Confidence = sup(Hot Dogs ^ Coke ^ Chips) / sup(Coke ^ Chips) = 2/3 * 100 = 66.67%
● Selected
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm → Table P. 4.4.3 transaction with 8 items

Solution

Algorithm
● [Hot Dogs]=>[Coke^Chips]

○ confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs)

= 2/4*100=50%

○ Rejected
Apriori
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm → Table P. 4.4.3 transaction with 8 items

Solution

Algorithm
● [Coke]=>[Hot Dogs^Chips]

○ confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke)

= 2/3*100=66.67%

Selected
Apriori
Market Basket Analysis UNIT - IV
Apriori Algorithm
Example of Apriori Algorithm → Table P. 4.4.3 transaction with 8 items

Solution

Algorithm
There are four strong results (minimum confidence greater than 60%):

● [Hot Dogs^Coke]=>[Chips]

● [Hot Dogs^Chips]=>[Coke]
Apriori
● [Coke^Chips]=>[Hot Dogs]

● [Coke]=>[Hot Dogs^Chips]
Market Basket Analysis UNIT - IV
Apriori Algorithm

Drawback

Algorithm
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly
scan the database
Apriori
Market Basket Analysis UNIT - IV

Frequent Pattern (FP) Growth

Algorithm
● An improvement of the Apriori algorithm.
● Used for finding frequent itemsets in a transaction database without candidate generation.
● Represents frequent items in frequent pattern trees (FP-trees).
FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth


Example

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth


Example

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth


Example

Algorithm
● Let the minimum support be 3.
● These elements are stored in descending order of their respective frequencies.
● After insertion of the relevant items, the set L looks like this:

FP Growth
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Market Basket Analysis UNIT - IV

Frequent pattern growth


Example
Ordered-Item set
Algorithm

FP Growth

Item sorting : Items in a transaction are sorted in descending order of support counts.
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Tree Data Structure: Inserting the set {K, E, M, O, Y}

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Tree Data Structure: Inserting the set {K, E,O, Y}

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Tree Data Structure: Inserting the set {K, E,M}

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Tree Data Structure: Inserting the set {K, M,Y}

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Tree Data Structure: Inserting the set {K, E, O}

Algorithm

FP Growth
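The tree construction shown on the slides above can be sketched with a minimal node class. The five ordered itemsets are exactly the ones inserted on the slides (min support = 3):

```python
# A minimal FP-tree insertion sketch (no header table / conditional
# mining step — just the shared-prefix tree with counts).
class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, transaction):
    # Walk/extend the path for this transaction, incrementing counts
    node = root
    for item in transaction:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

# Ordered itemsets from the slides
ordered = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

root = FPNode(None)
for t in ordered:
    insert(root, t)

print(root.children["K"].count)                 # K : 5
print(root.children["K"].children["E"].count)   # K -> E : 4
```

Because transactions share the sorted prefix K, E, ..., common prefixes collapse into shared paths, which is what lets FP-Growth avoid candidate generation.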
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Conditional Pattern Base

Algorithm

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Conditional Frequent Pattern Base

Algorithm
It is done by taking the set of elements common to all the paths in the Conditional Pattern Base of that item, and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

FP Growth
Market Basket Analysis UNIT - IV

Frequent pattern growth

Example Frequent Pattern rules

Algorithm

FP Growth
Regression UNIT - IV

● Regression is a data mining function that predicts a number.


Algorithm

● Profit, sale, mortgage rates, house values, square footage,


temperature or distance could all be predicted using regression
techniques.

● For example, a regression model could be used to predict the values of a data warehouse based on web-marketing, number of data entries, size and other factors.
Regression
Regression UNIT - IV

Algorithm ● A regression task begins with a data set in which the target values are
known.

● Regression analysis is a good choice when all of the predictor variables


are continuously valued as well.

Regression
● For an input x, if the output is continuous, this is called a regression
problem.
Regression UNIT - IV

● For example, based on historical information of demand for toothpaste


in your supermarket, you are asked to predict the demand for the next
Algorithm month.

● Regression is concerned with the prediction of continuous quantities.

● Linear regression is the oldest and most widely used predictive model
in field of machine learning.
Regression
● The goal is to minimize the sum of the squared errors to fit a straight
line to a set of data points.
Regression UNIT - IV
Regression Line

Least squares :

Algorithm ● The least squares regression line is the line that makes
the sum of squared residuals as small as possible.
● Linear means "straight line".

Regression Line :

● It is the line which gives the best estimate of one variable from the value of any other given variable.

Regression
● The regression line gives the average relationship between the two variables in mathematical form.
Regression UNIT - IV
Regression Line Linear Regression

● For two variables X and Y, there are always two lines of regression.

Regression line of X on Y:
Algorithm
Gives the best estimate for the value of X for any specific given values of Y:

X=a+bY

where,
a = X - intercept
b = Slope of the line
Regression X = Dependent variable
Y = Independent variable
Regression UNIT - IV
Regression Line Linear Regression

● For two variables X and Y, there are always two lines of regression.

Algorithm

Regression
Regression UNIT - IV
Regression Line Linear Regression

● For two variables X and Y, there are always two lines of regression.

Regression line of Y on X:
Algorithm
Gives the best estimate for the value of Y for any specific given values of X:

Y=a+bX

where,
a = Y - intercept
b = Slope of the line
Regression
Y = Dependent variable
X = Independent variable
Regression UNIT - IV
Regression Line

Linear Regression Example :

Algorithm
❏ The simplest form of regression to visualize is linear regression
with a single predictor.

❏ A linear regression technique can be used if the relationship between X and Y can be approximated with a straight line.
Regression
Regression UNIT - IV
Regression Line

Linear Regression Example :

Algorithm Consider following data

(i) Find values of b0 and b1 w.r.t. linear regression model which best
fits given data.

(ii) Interpret and explain equation of regression line.


Regression
(iii) If new person rates " Bahubali-Part-I" as 3 then predict the rating of
same person for "Bahubali-Part-II"
Regression UNIT - IV
Regression Line

Linear Regression Example :

Algorithm Person Xi = rating for movie "Bahubali- Yi = rating for movie


Part-I" by ith person "Bahubali-Part-II" by ith person

1st 4 3

2nd 2 4

3rd 3 2

Regression 4th 5 5

5th 1 3

6th 3 1
Regression UNIT - IV
Regression Line
Linear Regression Example :
Average of X values = 3
Average of Y values = 3
Algorithm

Regression
Regression UNIT - IV
Regression Line

values of β0 and β1 w.r.t. linear regression model

Algorithm

Regression
Regression UNIT - IV
Regression Line

Interpretation 1
Algorithm
For an increase in x of one unit, there is an increase of 0.3 units in y (the slope b1 = 0.3).

Interpretation 2

Even when x = 0 (the value of the independent variable), the expected value of y is 2.1 (the intercept b0).
Regression
Regression UNIT - IV
Regression Line

● If new person rates " Bahubali-Part-I" as 3 then predict


Algorithm
the rating of same person for "Bahubali-Part-II"
○ For x = 3 the y value will be
○ Y (Predicted) = 2.1 + 0.3 (3) = 2.1 + 0.9 = 3.0
● If a new person rates "Bahubali-Part-I" as 3, then the predicted rating of the same person for "Bahubali-Part-II" is 3.0
Regression
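The least-squares fit above can be reproduced in a few lines. The data are the six (X, Y) rating pairs from the table:

```python
xs = [4, 2, 3, 5, 1, 3]   # ratings for "Bahubali-Part-I"
ys = [3, 4, 2, 5, 3, 1]   # ratings for "Bahubali-Part-II"

n = len(xs)
x_bar = sum(xs) / n       # 3.0
y_bar = sum(ys) / n       # 3.0

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(b0, b1)        # 2.1 0.3
print(b0 + b1 * 3)   # predicted rating for x = 3: 3.0
```

The formulas are the standard least-squares estimators for a single predictor; they give exactly the b0 = 2.1, b1 = 0.3 obtained on the slides.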
Regression UNIT - IV
Logistic Regression

❏ Logistic regression is a form of regression analysis in which the


Algorithm
outcome variable is binary or dichotomous.

❏ A statistical method used to model dichotomous or binary outcomes


using predictor variables.

Regression
❏ Logistic component : Instead of modeling the outcome, Y, directly,
the method models the log odds (Y) using the logistic function.
Regression UNIT - IV
Logistic Regression

❏ Methods used to quantify association between an outcome and


Algorithm predictor variables. It could be used to build predictive models as a
function of predictors

❏ Simple logistic regression is logistic regression with 1 predictor variable.

Logistic Regression
ln [P/(1-P)] = a0 + a1X1 + a2X2 + -------------- + akXk
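A quick numeric sketch of the logistic component: applying the logistic (sigmoid) function to the linear term gives a probability, and taking the log odds of that probability recovers the linear term. The coefficients a0, a1 below are illustrative values, not fitted to any data.

```python
import math

def sigmoid(z):
    # Logistic function: maps log odds to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# One predictor: ln[P/(1-P)] = a0 + a1*X1 (illustrative coefficients)
a0, a1 = -1.5, 0.8
for x1 in [0, 1, 2, 3]:
    p = sigmoid(a0 + a1 * x1)
    # The recovered log odds equals the linear form a0 + a1*x1
    print(x1, round(p, 3), round(math.log(p / (1 - p)), 3))
```

This is why the outcome itself stays between 0 and 1 while the model remains linear on the log-odds scale.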
Classification UNIT - IV

❏ Classification predicts categorical labels (classes), whereas prediction models continuous-valued functions.

❏ Classification is considered to be supervised learning.

❏ Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values,

❏ relevance analysis to remove irrelevant or redundant data, and data transformation, such as generalizing the data to higher-level concepts or normalizing it
Classification UNIT - IV

Labeled training examples → Machine Learning Algorithm → Rules for Classification

New example → Rules for Classification → Predicted Classification
Classification UNIT - IV

Naive Bayes
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.

Decision Tree
It is a classification algorithm which also provides solutions to regression problems using classification rules.
Naive Bayes Joint Probability Example UNIT - IV

P(Red and King) = 2/52 = 1/26

Color
Type Red Black Total
King 2 2 4
Non-King 24 24 48
Total 26 26 52
Marginal Probability Example
UNIT - IV

P(King) = 4/52 = 1/13

Color
Type Red Black Total
King 2 2 4
Non-King 24 24 48
Total 26 26 52
Conditional Probability Example UNIT - IV

Of the 12 face cards, exactly one is the Jack of Hearts, so the probability of selecting a card that is both a Heart and a Jack, given a face card, is 1/12.
Naïve Bayes Classification UNIT - IV

Based on Bayes Rule


Naïve Bayes Classification UNIT - IV
Naïve Bayes Classification UNIT - IV

Finally, we classify X as RED since its class membership achieves the largest
posterior probability.
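The "pick the class with the largest posterior" rule can be sketched as a tiny categorical Naive Bayes. The six-row toy dataset below is illustrative, not the dataset from the slides:

```python
from collections import Counter

# Illustrative toy dataset: (features, class label)
data = [
    ({"Outlook": "Sunny",    "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Wind": "Strong"}, "Yes"),
]

def predict(x):
    classes = Counter(label for _, label in data)
    scores = {}
    for c, nc in classes.items():
        score = nc / len(data)                  # prior P(c)
        for feat, val in x.items():
            match = sum(1 for row, label in data
                        if label == c and row[feat] == val)
            score *= match / nc                 # likelihood P(val | c)
        scores[c] = score                       # proportional to posterior
    # Classify as the class with the largest posterior score
    return max(scores, key=scores.get)

print(predict({"Outlook": "Overcast", "Wind": "Weak"}))  # Yes
```

Note the "naive" assumption: the per-feature likelihoods are multiplied as if the features were conditionally independent given the class.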
Naïve Bayes Solved Example UNIT - IV
Naïve Bayes Solved Example UNIT - IV

Conditional Probability
Naïve Bayes Solved Example UNIT - IV
Conditional Probability
Naïve Bayes Solved Example UNIT - IV
Example
In this example we have 4 inputs (predictors). The final posterior probabilities can be normalized so that they sum to 1.
Naïve Bayes Solved Example UNIT - IV

P (No | Today) > P (Yes | Today)

So, the prediction is that golf would not be played: ‘No’.

https://www.analyticsvidhya.com/blog/2021/09/naive-bayes-algorithm-a-complete-g
uide-for-data-science-enthusiasts/
Decision Tree UNIT - IV

• The goal is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).
• start from the root of the tree
• compare the values of the root attribute with the record’s attribute.
• On the basis of comparison, follow the branch corresponding to that value and
jump to the next node.
Decision Trees UNIT - IV
Decision Trees UNIT - IV

Each node is associated with a feature (one of the elements of a feature vector that represents an object);
Each node tests the value of its associated feature;
There is one branch for each value of the feature;
Leaves specify the categories (classes);
Can categorize instances into multiple disjoint categories – multi-class
Decision Trees - Algorithms UNIT - IV

ID3 → (Iterative Dichotomiser 3)


C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square automatic interaction detection Performs
multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)
Decision Trees - ID3 UNIT - IV

● The ID3 algorithm builds decision trees using a top-down greedy

search approach through the space of possible branches with no

backtracking.

● A greedy algorithm, as the name suggests, always makes the choice

that seems to be the best at that moment.


Decision Trees - ID3 UNIT - IV

1. It begins with the original set S as the root node.


2. On each iteration of the algorithm, it iterates through the very unused attribute of the
set S and calculates Entropy(H) and Information gain(IG) of this attribute.
3. It then selects the attribute which has the smallest entropy or largest information gain.
4. The set S is then split by the selected attribute to produce a subset of the data.
5. The algorithm continues to recur on each subset, considering only attributes never
selected before.
Decision Trees - Information Gain UNIT - IV

Information gain measures how much information is gained about the class by splitting a node on an attribute, before making further decisions.
Decision Trees - Information Gain UNIT - IV

Less Impurities More Impurities

Information Gain = Entropy(before split) – Weighted Entropy(after split)


Decision Trees - Entropy UNIT - IV
● The entropy of any random variable or random process is the average level of
uncertainty involved in the possible outcome of the variable or process.
● To understand it better, let’s take the example of a coin flip:
● there are two possible outcomes, tail or head; if the probability of a tail after the flip is p, then the probability of a head is 1 - p,
● and the maximum uncertainty is at p = ½, when there is no reason to expect one outcome over another.
● In that case the entropy is 1.
Decision Trees - Entropy UNIT - IV
● Mathematically the formula for entropy is:

Where

X = random variable or process

Xi = possible outcomes

p(Xi) = probability of possible outcomes.


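The formula above is easy to check numerically; a minimal sketch (the two calls reproduce the coin-flip cases worked on the next slides):

```python
import math

def entropy(probs):
    # H(X) = -sum p(Xi) * log2 p(Xi); terms with p = 0 contribute 0 by convention
    h = 0.0
    for p in probs:
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy([0.5, 0.5]))  # 1.0 -> 50/50 split, maximum uncertainty
print(entropy([1.0, 0.0]))  # 0.0 -> pure outcome, no uncertainty
```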
Decision Trees - Entropy UNIT - IV

% of enrolled for training = 50%

% of not enrolled for training = 50%

Let’s first calculate the entropy for the above-given situation.

Entropy = -(0.5) * log2(0.5) -(0.5) * log2(0.5) = 1


Decision Trees - Entropy UNIT - IV

% of enrolled for training = 0%

% of not enrolled for training = 100%

Let’s first calculate the entropy for the above-given situation.

Entropy = -(0) * log2(0) - (1) * log2(1) = 0 (taking 0 * log2(0) = 0 by convention)


Decision Trees - Entropy UNIT - IV

● If a node contains only one class, i.e. the node of the tree is pure, the entropy for the data in that node is zero; by the information gain formula, the information gained from such a split is higher, and purity is higher.
● If the entropy is higher, the information gain is lower and the node can be considered less pure.
Decision Trees - Information Gain UNIT - IV

Gain (S, A) = expected reduction in entropy due to sorting on A

Values (A) is the set of all possible values for attribute A,


Sv is the subset of S which attribute A has value v,
|S| and | Sv | represent the number of samples in set S and set Sv respectively

Gain(S,A) is the expected reduction in entropy caused by knowing the value of attribute A.
Decision Trees UNIT - IV

❑ Play Tennis Example


❑ Feature values:
❑ Outlook = (sunny, overcast, rain)
❑ Temperature =(hot, mild, cool)
❑ Humidity = (high, normal)
❑ Wind =(strong, weak)
Decision Trees UNIT - IV

Decision Trees UNIT - IV

Decision Trees UNIT - IV
❑ Play Tennis Example
❑ Feature Vector = (Outlook, Temperature, Humidity, Wind)

Outlook
Sunny Overcast Rain
Humidity Wind
Yes
High Normal Strong Weak
No Yes No Yes
Decision Trees UNIT - IV

Node Node
associated associated
with a feature with a feature
Outlook
Sunny Overcast Rain
Humidity Yes Wind
High Normal Strong Weak
No Yes No Yes

Node
associated
with a feature
Decision Trees UNIT - IV

❑ Outlook = (sunny, overcast, rain)


One branch
One branch for each value
for each value Outlook
Sunny Overcast Rain
Humidity One branch Yes Wind
for each
High Normal value Strong Weak
No Yes No Yes
Decision Trees UNIT - IV

❑ Class = (Yes, No)

Outlook
Sunny Overcast Rain
Humidity Yes Wind
High Normal Strong Weak
No Yes No Yes

Leaf nodes
specify classes Leaf nodes
specify classes
Example UNIT - IV

Play Tennis Example


UNIT - IV
Example

Humidity

High Normal
3+,4- 6+,1-
E=.985 E=.592

Gain(S, Humidity) = .94 - 7/14 * 0.985 - 7/14 *.592 = 0.151


UNIT - IV
Example

Wind

Weak Strong
6+,2- 3+,3-
E=.811 E=1.0

Gain(S, Wind) = .94 - 8/14 * 0.811 - 6/14 * 1.0 = 0.048


Example UNIT - IV

Outlook

Sunny Overcast Rain


1,2,8,9,11 3,7,12,13 4,5,6,10,14
2+,3- 4+,0- 3+,2-
0.970 0.0 0.970
Gain(S, Outlook) = 0.246
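The three gains above can be verified with a short script. The (Yes, No) counts per attribute value are the ones shown on these slides; the third decimal differs slightly from the slides because the slides round E(S) to 0.94 before subtracting.

```python
import math

def H(pos, neg):
    # Entropy of a node with `pos` positive and `neg` negative examples
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            h -= p * math.log2(p)
    return h

def gain(parent, splits):
    # Gain(S, A) = H(S) - sum over values v of |Sv|/|S| * H(Sv)
    # splits: one (pos, neg) pair per value of attribute A
    n = sum(p + q for p, q in splits)
    return H(*parent) - sum((p + q) / n * H(p, q) for p, q in splits)

parent = (9, 5)  # 9 Yes, 5 No in the full Play Tennis set
print(round(gain(parent, [(3, 4), (6, 1)]), 3))          # Humidity ≈ 0.152
print(round(gain(parent, [(6, 2), (3, 3)]), 3))          # Wind ≈ 0.048
print(round(gain(parent, [(2, 3), (4, 0), (3, 2)]), 3))  # Outlook ≈ 0.247
```

Outlook has the largest gain, which is why it is picked as the root on the next slide.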
UNIT - IV
Example

Pick Outlook as the root


Outlook

Gain(S, Humidity) = 0.151


Sunny Overcast Rain Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook) = 0.246
UNIT - IV
Example

Pick Outlook as the root

Outlook
Sunny Overcast Rain
1,2,8,9,11 3,7,12,13 4,5,6,10,14
2+,3- 4+,0- 3+,2-
? Yes ?

Continue until: Every attribute is included in path, or, all examples in the leaf
have same label
Example

Outlook

Sunny Overcast Rain


Yes
1,2,8,9,11 3,7,12,13
2+,3- 4+,0-
?

Gain (Ssunny, Humidity) = .97-(3/5) * 0-(2/5) * 0 = .97


Gain (Ssunny, Temp) = .97 - (2/5)*0 - (2/5)*1 - (1/5)*0 = .57
Gain (Ssunny, Wind) = .97-(2/5) *1 - (3/5) *.92 = .02
UNIT - IV
Example

Outlook

Sunny Overcast Rain


Yes
Humidity

High Normal
Gain (Ssunny, Humidity) = .97-(3/5) * 0-(2/5) * 0 = .97
No Yes Gain (Ssunny, Temp) = .97 - (2/5)*0 - (2/5)*1 - (1/5)*0 = .57
Gain (Ssunny, Wind) = .97-(2/5) *1 - (3/5) *.92 = .02

UNIT - IV
Example

Outlook

Sunny Overcast Rain


Yes
Humidity ?
4,5,6,10,14
High Normal 3+,2-

No Yes
Gain (Srain, Humidity) =
Gain (Srain, Temp) =
Gain (Srain, Wind) =
UNIT - IV
Example

Outlook

Sunny Overcast Rain


Yes
Humidity Wind

High Normal Strong Weak

No Yes No Yes

https://www.saedsayad.com/decision_tree.htm UNIT - IV
