DA Unit - IV
Features of Python
● Scripting language
● Portable
● Provides a vast range of libraries for various fields such as machine learning, web development, and scripting.
Introduction UNIT - IV
Advantages of Python
● Ease of programming

Python Libraries
● A library is a collection of files (called modules) that contains functions for other programs.
● A Python library is a reusable chunk of code that you may want to include in your programs.
Essential Python Libraries
01 NumPy
02 Pandas
03 SciPy
04 SciKit-Learn
Pandas
● Pandas has two core data structures: Series and DataFrame.
● Series: a one-dimensional, array-like labeled structure.
● DataFrame: a two-dimensional labeled table with rows and columns.
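A minimal sketch of the two core structures (the values here are illustrative only):

```python
import pandas as pd

# Series: one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])               # 20

# DataFrame: two-dimensional labeled table
df = pd.DataFrame({"item": ["Bread", "Milk"], "price": [40, 25]})
print(df["price"].mean())   # 32.5
```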
SciPy
● A library built on NumPy that provides modules for scientific computing, such as optimization, integration, interpolation, linear algebra and statistics.

Matplotlib
● A 2D plotting library that can be used in Python scripts, the Python and IPython shells, web application servers, and more.

Jupyter / IPython
● An interactive console that runs your code like the Python shell, but gives you even more features, like support for data visualizations.
SciKit-Learn
● Scikit-learn is probably the most useful library for machine learning in Python.
● Scikit-learn supports classification, clustering, regression, and dimensionality reduction.
● Data preprocessing is a data mining technique that involves transforming raw data
into an understandable format.
● It aims to reduce the data size, find the relations between data items, and normalize them.
Data Preprocessing UNIT - IV
1. Data Cleaning
2. Data Integration
● Data with different representations are put together and conflicts within the data are resolved.
3. Data Transformation
4. Data Reduction
5. Data Discretization
Removing Duplicates
● With large scales of data, this will often be done using tools that find and merge
duplicate records in an existing database and prevent new ones from entering it
based on similarities in specific fields.
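A small sketch of this with pandas; the customer records below are hypothetical:

```python
import pandas as pd

# Hypothetical customer records containing one exact duplicate row
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["a@x.com", "r@x.com", "a@x.com", "m@x.com"],
})

# Find and merge duplicate records based on specific fields
deduped = df.drop_duplicates(subset=["name", "email"])
print(len(deduped))   # 3
```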
Handling Missing Values
● Ignore the tuple: this is usually done when the class label is missing.
● Fill in the missing value using a measure of central tendency for the attribute, such as the mean, the median, or the mode.
● Using the attribute mean for numeric values, or the attribute mode for nominal values, for all samples belonging to the same class as the given tuple.
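A sketch of both imputation strategies; the column names and values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing numeric value
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [100.0, np.nan, 50.0, 70.0],
})

# Overall-mean imputation: fill with the mean of the whole attribute
overall = df["income"].fillna(df["income"].mean())

# Class-wise imputation: fill with the mean of the same class as the tuple
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(by_class.tolist())   # [100.0, 100.0, 50.0, 70.0]
```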
Data Integration
❏ Users want to join unstructured data or streaming data with structured data so they can analyze the data together.
❏ Users want to add information to data to enrich it, such as performing lookups, adding geolocation data, or adding timestamps.
On-premise ETL tools
❏ ETL (Extract, Transform, Load) tools can take much of the pain out of scripting the transformations by automating the process.
❏ These tools are typically hosted on your company's site, and may require extensive expertise and infrastructure cost.
Cloud-based ETL tools
❏ These ETL tools are hosted in the cloud.
Analytics Types
❏ Business analytics utilizes big data, statistical analysis and data visualization to implement organizational changes.
Data Preprocessing UNIT - IV
Key considerations in business analytics projects:
● Executive ownership
● IT involvement
● Project Management Office (PMO)
● Available production data vs. cleansed modeling data
● End-user involvement and buy-in
● Change management
3. Pre-process the data for issues such as missing and incorrect data. Generate derived variables and transform the data if necessary. Prepare the data for analytics model building.
4. Divide the data into training and validation subsets.
5. Build analytical models and identify the best model(s) using model performance on the validation data.
Analytics Types: Predictive, Descriptive, and Prescriptive.
Predictive
❏ Predictive analytics helps your organization predict with confidence what will happen next so that you can make smarter decisions and improve business outcomes.
❏ The purpose of the predictive model is to find the likelihood that different samples will perform in a specific way.
❏ The variability of the component data will have a relationship with what it is likely to predict.
Predictive Analytics Process:
1. Project definition
2. Data collection
3. Analysis
4. Statistics
5. Modelling
6. Deployment
7. Monitoring
Project definition
❏ Identify the intended outcome of the project, the deliverables and the business objectives, and based on that gather the data sets that are to be used.
Data collection
❏ This is the big basket where all data from various sources is collected for usage.
❏ This gives a picture of the various customer interactions as a single-view item.
Analysis

Statistics
❏ This enables validating whether the findings, assumptions and hypotheses are fine to go ahead with, and testing them using a statistical model.
Modelling
❏ Through this, accurate predictive models about the future can be provided.
❏ From the options available, the best option can be chosen as the required solution through multi-model evaluation.
Deployment
❏ This way the results, reports and other metrics can be obtained based on the modelling.
Monitoring

Examples of Predictive Analytics
❏ Retail: probably the largest sector to use predictive analytics, retail is always looking to improve its sales position and build better relations with customers.
❏ Healthcare: usage of predictive analytics in the healthcare domain can help determine and prevent cases and risks of developing certain health-related complications like diabetes, asthma and other life-threatening ailments.
Descriptive
❏ The descriptive model shows relationships between the product/service and the acquired data.
❏ Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer, and year-over-year change in sales.
❏ While business intelligence tries to make sense of all the data that's collected each and every day by organizations of all types, communicating the data in a way that people can easily grasp often becomes an issue.
Example
❏ Reports that provide financial, inventory, and production information.
Prescriptive

Example of Prescriptive Analytics: Market Basket Analysis (it can also be called Association Analysis)
https://blog.rsquaredacademy.com/market-basket-analysis-in-r/
Use Cases (Applications) of Association Rule Mining
Simple Example
Simple Example - Transaction Data
Simple Example - Frequent Item Set
Simple Example - Association Rule
Simple Example - Association Rule Support
Simple Example - Association Rule Confidence
Simple Example - Association Rule Lift
Simple Example - Association Rule Lift - Interpretation
● Lift = 1: implies no relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together only by chance)
● Lift > 1: implies a positive relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together more often than random)
● Lift < 1: implies a negative relationship between mobile phone and screen guard (i.e., mobile phone and screen guard occur together less often than random)
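These three measures can be computed directly from a transaction list. The five transactions below are hypothetical, chosen only to illustrate the calculation:

```python
# Hypothetical transactions; item names follow the example above
transactions = [
    {"mobile phone", "screen guard"},
    {"mobile phone", "screen guard", "charger"},
    {"mobile phone"},
    {"screen guard"},
    {"charger"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

s_x = support({"mobile phone"})                      # 3/5
s_y = support({"screen guard"})                      # 3/5
s_xy = support({"mobile phone", "screen guard"})     # 2/5
confidence = s_xy / s_x
lift = s_xy / (s_x * s_y)
print(round(confidence, 2), round(lift, 2))          # 0.67 1.11
```

Here lift > 1, so with this data the two items occur together more often than random.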
X → Y

Appropriateness of Candidate Rule
Example: Minimum Support = 0.5 or 50% (with 9 transactions, 9/2 = 4.5, taken as a minimum support count of 4)

TID    List of Item IDs
T100   I1, I2, I5
T101   I2, I4
T102   I2, I5
T103   I1, I2, I4
T104   I1, I2, I3
T105   I2, I3
T106   I1, I2, I3, I4
T107   I1, I2, I3
T108   I1, I3, I5

Candidate 1-itemsets and their frequencies:

Item Set    Frequency
{I1}        6
{I2}        8
{I3}        5
{I4}        3
{I5}        3

After pruning (itemsets below the minimum support count of 4 are removed):

Item Set    Frequency
{I1}        6
{I2}        8
{I3}        5

Candidate generation (2-itemsets), followed by pruning:

Item Set    Frequency
{I1, I2}    5
{I1, I3}    4
{I2, I3}    4

All three 2-itemsets meet the minimum support count, so none are pruned.
From the frequent 2-itemsets we have 3 rules:
1. I1 => I2
2. I1 => I3
3. I2 => I3
Example - Support

Rule       No. of Transactions   Support = Freq(X ∪ Y) / 9   Value
I1 => I2   5                     5/9                          0.55
I1 => I3   4                     4/9                          0.44
I2 => I3   4                     4/9                          0.44
Example - Confidence

Rule       Freq(X)   Freq(X ∪ Y)   Confidence = Freq(X ∪ Y) / Freq(X)   Value
I1 => I2   6         5             5/6                                   0.83
I1 => I3   6         4             4/6                                   0.66
I2 => I3   8         4             4/8                                   0.50
Example - Lift

Rule       Support(X ∪ Y)   Support(X)   Support(Y)   Lift = Support(X ∪ Y) / (Support(X) × Support(Y))   Value
I1 => I2   0.55             6/9 = 0.66   8/9 = 0.88   0.55 / (0.66 × 0.88)                                 0.94
I1 => I3   0.44             6/9 = 0.66   5/9 = 0.55   0.44 / (0.66 × 0.55)                                 1.21
I2 => I3   0.44             8/9 = 0.88   5/9 = 0.55   0.44 / (0.88 × 0.55)                                 0.90
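The tables above can be re-checked with a few lines of Python over the nine transactions (values agree up to rounding):

```python
# The nine transactions from the worked example above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I5"}, {"I1", "I2", "I4"},
    {"I1", "I2", "I3"}, {"I2", "I3"}, {"I1", "I2", "I3", "I4"},
    {"I1", "I2", "I3"}, {"I1", "I3", "I5"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing every item in items."""
    return sum(items <= t for t in transactions) / n

for x, y in [("I1", "I2"), ("I1", "I3"), ("I2", "I3")]:
    s_xy = support({x, y})
    conf = s_xy / support({x})
    lift = s_xy / (support({x}) * support({y}))
    print(f"{x} => {y}: support={s_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")
```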
Example: If Item A is purchased, then Item B is likely to be purchased.
● Antecedent (Item A): the condition.
● Consequent (Item B): the result.
Market Basket Analysis UNIT - IV
Association Rule Algorithm: Apriori

Association rule measures: Support, Confidence, and Lift.
Support
● Support is the number of transactions that include the items in both the {A} and {B} parts of the rule, as a percentage of the total number of transactions.

Confidence
● Confidence of the rule is the ratio of the number of transactions that include all items in {A} as well as {B} to the number of transactions that include all items in {A}:

Confidence = Support(A ∪ B) / Support(A)
Association Rules UNIT - IV
❏ The uncovered relationships can be represented in the form of association rules or sets of frequent items.
❏ Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a transactional database, relational database or other information repository.
Market basket transactions:

ID   Items
1    {Bread, Milk}
2    {Bread, Milk, Cola, Sugar}
…    …
Apriori Algorithm
Solution
● Find the frequent itemsets and generate association rules on the given dataset.
● Assume that the minimum support threshold is s = 33.33% and the minimum confidence threshold is c = 60%.
Example of Apriori Algorithm → Table P. 4.4.3: transactions with 8 items
Rule confidence (= Freq(X ∪ Y) / Freq(X) × 100):
● [Hot Dogs^Coke]=>[Chips] = 2/2 × 100 = 100% (Selected)
● [Hot Dogs^Chips]=>[Coke] = 2/2 × 100 = 100% (Selected)
● [Coke^Chips]=>[Hot Dogs] = 2/3 × 100 = 66.67% (Selected)
● [Coke]=>[Hot Dogs^Chips] = 2/3 × 100 = 66.67% (Selected)
● [Hot Dogs]=>[Coke^Chips] = 2/4 × 100 = 50% (Rejected)

There are four strong rules (minimum confidence greater than 60%):
● [Hot Dogs^Coke]=>[Chips]
● [Hot Dogs^Chips]=>[Coke]
● [Coke^Chips]=>[Hot Dogs]
● [Coke]=>[Hot Dogs^Chips]
Apriori Algorithm
Drawback
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
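A minimal pure-Python sketch of the level-wise procedure, which makes both drawbacks visible in the code; the transactions below are hypothetical (Table P. 4.4.3 is not reproduced in this text):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return frequent itemsets mapped to their support counts."""
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        # Drawback 2: one full scan of the database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt for c, cnt in counts.items() if cnt >= min_count}
        frequent.update(survivors)
        # Drawback 1: candidate sets must be rebuilt at each step
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

# Hypothetical transactions for illustration only
tx = [frozenset(t) for t in (
    {"Hot Dogs", "Coke", "Chips"},
    {"Hot Dogs", "Coke"},
    {"Hot Dogs"},
    {"Coke", "Chips"},
    {"Chips"},
)]
freq = apriori(tx, min_count=2)
print(sorted(tuple(sorted(s)) for s in freq))
```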
FP Growth Algorithm
FP Growth
● Let the minimum support be 3.
● Frequent items are stored in descending order of their respective frequencies.
● After insertion of the relevant items, the set L looks like this: L = {K : 5, E : 4, O : 4, M : 3, Y : 3}
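The first FP-Growth step (count item frequencies, drop infrequent items, and order the survivors by descending frequency) can be sketched as follows. The transactions below are assumed for illustration, since the slide's own transaction table is not shown here, so the exact counts in L depend on them:

```python
from collections import Counter

# Hypothetical transactions (assumed; not the slide's original table)
transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O", "O"],
]
min_support = 3

# Count each item once per transaction (set(t) removes in-transaction repeats)
counts = Counter(i for t in transactions for i in set(t))

# Keep only frequent items, in descending order of frequency
L = {i: c for i, c in counts.most_common() if c >= min_support}
print(L)
```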
● The conditional frequent pattern tree is built by taking the set of elements common to all the paths in the Conditional Pattern Base of an item, and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
Regression UNIT - IV
● A regression task begins with a data set in which the target values are known.
● For an input x, if the output is continuous, this is called a regression problem.
● Linear regression is the oldest and most widely used predictive model in the field of machine learning.
● The goal is to minimize the sum of the squared errors to fit a straight line to a set of data points.
Regression Line
Least squares:
● The least squares regression line is the line that makes the sum of squared residuals as small as possible.
● Linear means "straight line".
Regression Line:
● For two variables X and Y, there are always two lines of regression.

Regression line of X on Y:
Gives the best estimate for the value of X for any specific given value of Y:
X = a + bY
where,
a = X-intercept
b = slope of the line
X = dependent variable
Y = independent variable
Regression line of Y on X:
Gives the best estimate for the value of Y for any specific given value of X:
Y = a + bX
where,
a = Y-intercept
b = slope of the line
Y = dependent variable
X = independent variable
Regression Line
❏ The simplest form of regression to visualize is linear regression with a single predictor.

Linear Regression Example:
(i) Find the values of b0 and b1 for the linear regression model that best fits the given data.

Observation   X   Y
1st           4   3
2nd           2   4
3rd           3   2
4th           5   5
5th           1   3
6th           3   1
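The least-squares coefficients for the six points above can be computed directly:

```python
# Least-squares fit for the six (X, Y) points in the table above
xs = [4, 2, 3, 5, 1, 3]
ys = [3, 4, 2, 5, 3, 1]
n = len(xs)

x_bar = sum(xs) / n            # 3.0
y_bar = sum(ys) / n            # 3.0

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar        # intercept

print(round(b0, 2), round(b1, 2))   # 2.1 0.3
```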
Average of X values = (4 + 2 + 3 + 5 + 1 + 3) / 6 = 3
Average of Y values = (3 + 4 + 2 + 5 + 3 + 1) / 6 = 3

b1 = Σ(Xi - mean X)(Yi - mean Y) / Σ(Xi - mean X)² = 3 / 10 = 0.3
b0 = mean Y - b1 × mean X = 3 - 0.3 × 3 = 2.1

The fitted regression line is Y = 2.1 + 0.3 X.
Regression Line
Interpretation
For an increase in the value of x by one unit, there is an increase in the value of y by 0.3 units.
Logistic Regression
❏ Logistic component: instead of modeling the outcome, Y, directly, the method models the log odds of Y using the logistic function.

ln [P / (1 - P)] = a0 + a1X1 + a2X2 + ... + akXk
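Inverting the log-odds equation gives the predicted probability P. A small sketch with assumed coefficients a0 and a1 (chosen only for illustration):

```python
import math

# Hypothetical coefficients, assumed for illustration
a0, a1 = -1.5, 0.8

def probability(x1):
    """Invert ln[P/(1-P)] = a0 + a1*x1, i.e. P = 1 / (1 + e^-(a0 + a1*x1))."""
    z = a0 + a1 * x1
    return 1 / (1 + math.exp(-z))

p = probability(3)              # z = -1.5 + 0.8*3 = 0.9
print(round(p, 3))              # 0.711

# Recover the log odds from the probability
print(round(math.log(p / (1 - p)), 1))   # 0.9
```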
Classification UNIT - IV
❏ Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values.

New example: drawing one card from a standard deck of 52 playing cards.

            Color
Type        Red    Black    Total
King        2      2        4
Non-King    24     24       48
Total       26     26       52
Marginal Probability Example
P(King) = 4 / 52 = 1 / 13
Conditional Probability Example UNIT - IV
From the face cards, the probability of selecting the Jack of Hearts is 1/12: the total number of face cards is 12, and only one of them is the Jack of Hearts.
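Both probabilities can be checked exactly with Python's `fractions` module:

```python
from fractions import Fraction

# Marginal probability from the 52-card table: P(King)
p_king = Fraction(4, 52)
print(p_king)                        # 1/13

# Conditional probability: P(Jack of Hearts | face card)
face_cards = 12                      # J, Q, K in each of the 4 suits
p_jack_of_hearts_given_face = Fraction(1, face_cards)
print(p_jack_of_hearts_given_face)   # 1/12
```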
Naïve Bayes Classification UNIT - IV
Finally, we classify X as RED since its class membership achieves the largest
posterior probability.
Naïve Bayes Solved Example UNIT - IV
Conditional Probability
Example
In this example we have 4 inputs (predictors). The final posterior probabilities can be standardized between 0 and 1.
https://www.analyticsvidhya.com/blog/2021/09/naive-bayes-algorithm-a-complete-guide-for-data-science-enthusiasts/
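A minimal sketch of computing standardized posteriors for 4 categorical predictors; the training rows below are hypothetical, not the slide's dataset:

```python
from collections import Counter

# Hypothetical training data: 4 categorical predictors and a class label
rows = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
]

def posterior(x):
    """P(class | x) under the naive independence assumption, standardized to sum to 1."""
    classes = Counter(r[-1] for r in rows)
    scores = {}
    for c, nc in classes.items():
        p = nc / len(rows)                      # prior P(c)
        for j, v in enumerate(x):               # multiply P(feature_j = v | c)
            p *= sum(1 for r in rows if r[-1] == c and r[j] == v) / nc
        scores[c] = p
    total = sum(scores.values())                # standardize between 0 and 1
    return {c: s / total for c, s in scores.items()}

post = posterior(("Sunny", "Cool", "High", "Strong"))
print(max(post, key=post.get))                  # the class with the largest posterior
```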
Decision Tree UNIT - IV
• The goal is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).
• To classify a record, start from the root of the tree.
• Compare the values of the root attribute with the record's attribute.
• On the basis of the comparison, follow the branch corresponding to that value and jump to the next node.
• Continue this process until a leaf node is reached; the process involves no backtracking.
Decision Trees UNIT - IV
Decision Trees - Information Gain UNIT - IV
Information gain measures the amount of information gained about the class at a node before splitting it for making further decisions:

Entropy(S) = - Σ P(xi) log2 P(xi)

where xi = possible outcomes and P(xi) is the probability of outcome xi.
● If a node contains only one class (i.e., the node is pure), the entropy of the data in that node is zero; by the information gain formula, the information gained for such a node is higher, and its purity is higher.
● If the entropy is higher, the information gain is lower, and the node can be considered less pure.
Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A:

Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) · Entropy(Sv)

where the sum runs over the values v of attribute A, and Sv is the subset of S with A = v.
❑ Play Tennis Example
❑ Feature Vector = (Outlook, Temperature, Humidity, Wind)
Outlook
├── Sunny → Humidity
│     ├── High → No
│     └── Normal → Yes
├── Overcast → Yes
└── Rain → Wind
      ├── Strong → No
      └── Weak → Yes
Each internal node (Outlook, Humidity, Wind) is associated with a feature; the leaf nodes specify the classes (Yes / No).
Example UNIT - IV

Humidity:  High → 3+, 4-  (E = .985)    Normal → 6+, 1-  (E = .592)
Wind:      Weak → 6+, 2-  (E = .811)    Strong → 3+, 3-  (E = 1.0)
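These entropies and the resulting gains (for the 9+, 5- Play Tennis data) can be verified directly; the code uses exact entropies rather than the rounded values above, so the last digit of a gain may differ slightly from slide figures:

```python
import math

def entropy(pos, neg):
    """Entropy of a node holding pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            e -= p * math.log2(p)
    return e

e_s = entropy(9, 5)                                   # overall set S: 9+, 5-

# Gain(S, A) = Entropy(S) - sum over branches of (|Sv|/|S|) * Entropy(Sv)
gain_humidity = e_s - (7/14) * entropy(3, 4) - (7/14) * entropy(6, 1)
gain_wind     = e_s - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)

print(round(e_s, 3))                                  # 0.94
print(round(gain_humidity, 3), round(gain_wind, 3))   # 0.152 0.048
```

Humidity has the larger gain, so it is the better split of the two.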
Pick Outlook as the root:

Outlook
├── Sunny: examples 1, 2, 8, 9, 11 (2+, 3-) → ?
├── Overcast: examples 3, 7, 12, 13 (4+, 0-) → Yes
└── Rain: examples 4, 5, 6, 10, 14 (3+, 2-) → ?

Continue until: every attribute is included in the path, or all examples in the leaf have the same label.
Example

For the Sunny branch (2+, 3-; E = .97), splitting on Humidity gives High → No and Normal → Yes:
Gain(Ssunny, Humidity) = .97 - (3/5) · 0 - (2/5) · 0 = .97
Gain(Ssunny, Temp) = .97 - 0 - (2/5) · 1 = .57
Gain(Ssunny, Wind) = .97 - (2/5) · 1 - (3/5) · .92 = .02
Example

For the Rain branch, compute the gains similarly:
Gain(Srain, Humidity) =
Gain(Srain, Temp) =
Gain(Srain, Wind) =
Example

The completed tree assigns a Yes or No class at each leaf.
https://www.saedsayad.com/decision_tree.htm