
Algorithms

K-Means Clustering

An implementation of K-Means clustering (a sketch of this function appears after the algorithm description below):

function: k_mean(x, k)

Takes:

x: input data points, as a list of data points

eg: x = [[12,34,21], [23,34,23], ...... ]


k: number of clusters into which to divide the data points

eg: k = 3

Returns:

cluster: the assigned cluster index for every data point

eg: cluster = [1, 0, 2, ......]

K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw
inferences from them. It is a centroid-based clustering algorithm.

Centroid - A centroid is a data point at the centre of a cluster. In centroid-based clustering,
each cluster is represented by a centroid. K-Means is an iterative algorithm in which the notion
of similarity is derived from how close a data point is to the centroid of its cluster. It works as
follows: the algorithm takes the number of clusters K and the data set (a collection of features
for each data point) as input, starts with initial estimates for the K centroids, and then iterates
between two steps:
1. Data assignment step

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. So, if ci is a centroid in the set of centroids C,
then each data point is assigned to the cluster whose centroid ci is at the minimum Euclidean
distance from it.

2. Centroid update step

In this step, the centroids are recomputed and updated. This is done by taking the mean of all
data points assigned to that centroid’s cluster.

The algorithm then iterates between step 1 and step 2 until a stopping criterion is met: no data
points change clusters, the sum of the distances is minimized, or some maximum number of
iterations is reached. The algorithm is guaranteed to converge to a result, but the result may be a
local optimum, so running the algorithm more than once with randomized starting centroids may
give a better outcome.
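
The following is a minimal NumPy sketch of the k_mean function specified earlier. It follows the data assignment and centroid update steps described above; the random initialization of the centroids, the default iteration cap, and the convergence test are assumptions of this sketch rather than details given in the text.

import numpy as np

def k_mean(x, k, max_iter=100):
    """Minimal K-Means (Lloyd's algorithm) sketch.

    x: list of data points, eg [[12, 34, 21], [23, 34, 23], ...]
    k: number of clusters
    Returns a list with the assigned cluster index for every data point.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)
    # Initial estimates for the K centroids: k distinct points chosen at random.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    cluster = None

    for _ in range(max_iter):
        # 1. Data assignment step: nearest centroid by squared Euclidean distance.
        distances = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_cluster = distances.argmin(axis=1)

        # Stopping criterion: no data point changes its cluster.
        if cluster is not None and np.array_equal(new_cluster, cluster):
            break
        cluster = new_cluster

        # 2. Centroid update step: mean of all points assigned to each cluster.
        for j in range(k):
            members = x[cluster == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return cluster.tolist()

# Example usage:
# cluster = k_mean([[12, 34, 21], [23, 34, 23], [80, 2, 5], [78, 1, 7]], k=2)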

3. Choosing the value of K


The K-Means algorithm produces cluster labels for a pre-defined value of K. To find a suitable
number of clusters in the data, we need to run the K-Means clustering algorithm for different
values of K and compare the results, since the performance of K-Means depends on the value of
K. We should choose the value of K that gives us the best performance. There are different
techniques available to find this optimal value of K; the most common is the elbow method,
described below.

4. The elbow method


The elbow method is used to determine the optimal number of clusters in K-means clustering.
The elbow method plots the value of the cost function produced by different values of K.

As K increases, the average distortion decreases: each cluster has fewer constituent instances,
and the instances are closer to their respective centroids. However, the improvement in average
distortion declines as K increases. The value of K at which the improvement in distortion drops
off most sharply is called the elbow, and at this point we should stop dividing the data into
further clusters.
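
A minimal sketch of the elbow method using scikit-learn's KMeans, with its inertia_ attribute (the sum of squared distances of samples to their closest centroid) as the cost function; the range of K values and the plot labels are illustrative choices.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_range=range(1, 11)):
    """Plot the K-Means cost (inertia) against the number of clusters K."""
    costs = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        costs.append(km.inertia_)  # sum of squared distances to the closest centroid
    plt.plot(list(k_range), costs, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("Cost (inertia)")
    plt.title("Elbow method")
    plt.show()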
5. The problem statement
In this project, I implement K-Means clustering with Python and Scikit-Learn. As mentioned
earlier, K-Means clustering is used to find intrinsic groups within an unlabelled dataset and
draw inferences from them. I have used the Facebook Live Sellers in Thailand dataset for this
project and apply K-Means clustering to find intrinsic groups within this dataset that display
the same status_type behaviour. The status_type variable records posts of different types
(videos, photos, statuses and links).

Naïve Bayes
A custom implementation of a Naive Bayes Classifier written in Python 3.

Dataset

Loan Defaulters

Home Owner   Marital Status   Annual Income   Defaulted Borrower
Yes          Single           $125,000        No
No           Married          $100,000        No
No           Single           $70,000         No
Yes          Married          $120,000        No
No           Divorced         $95,000         Yes
No           Married          $60,000         No
Yes          Divorced         $220,000        No
No           Single           $85,000         Yes
No           Married          $75,000         No
No           Single           $90,000         Yes

Source: Introduction to Data Mining (1st Edition) by Pang-Ning Tan

A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated
to the presence of any other feature. For example, a fruit may be considered an apple if it is red,
round, and about 3 inches in diameter. Even if these features depend on each other or on the
existence of the other features, all of these properties independently contribute to the probability
that the fruit is an apple, and that is why the classifier is known as 'Naive'.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
its simplicity, Naive Bayes can outperform even highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c):

P(c|x) = P(x|c) P(c) / P(x)

Here P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior probability
of the class, P(x|c) is the likelihood of the predictor given the class, and P(x) is the prior
probability of the predictor.
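
As a minimal sketch (not the project's actual implementation), the following Naive Bayes classifier is applied to the Loan Defaulters table above, assuming categorical likelihoods for Home Owner and Marital Status and a Gaussian likelihood for Annual Income; Laplace smoothing is omitted for brevity.

import math
from collections import Counter, defaultdict

# Loan Defaulters dataset from the table above:
# (home_owner, marital_status, annual_income, defaulted)
data = [
    ("Yes", "Single",   125_000, "No"),
    ("No",  "Married",  100_000, "No"),
    ("No",  "Single",    70_000, "No"),
    ("Yes", "Married",  120_000, "No"),
    ("No",  "Divorced",  95_000, "Yes"),
    ("No",  "Married",   60_000, "No"),
    ("Yes", "Divorced", 220_000, "No"),
    ("No",  "Single",    85_000, "Yes"),
    ("No",  "Married",   75_000, "No"),
    ("No",  "Single",    90_000, "Yes"),
]

def train(rows):
    """Estimate the class priors, categorical likelihoods and income Gaussians."""
    class_counts = Counter(r[-1] for r in rows)
    cat_counts = defaultdict(Counter)      # (feature index, class) -> value counts
    incomes = defaultdict(list)            # class -> incomes observed in that class
    for home, marital, income, label in rows:
        cat_counts[(0, label)][home] += 1
        cat_counts[(1, label)][marital] += 1
        incomes[label].append(income)
    gauss = {}
    for label, values in incomes.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        gauss[label] = (mean, var)
    return {"n": len(rows), "class_counts": class_counts,
            "cat": cat_counts, "gauss": gauss}

def predict(model, home, marital, income):
    """Return the class with the highest (unnormalized) posterior probability."""
    best_label, best_p = None, -1.0
    for label, count in model["class_counts"].items():
        p = count / model["n"]                           # prior P(c)
        p *= model["cat"][(0, label)][home] / count      # P(home owner | c)
        p *= model["cat"][(1, label)][marital] / count   # P(marital status | c)
        mean, var = model["gauss"][label]                # P(income | c), Gaussian
        p *= math.exp(-(income - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        if p > best_p:
            best_label, best_p = label, p
    return best_label

model = train(data)
print(predict(model, "No", "Married", 120_000))   # -> "No"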

Decision Tree

1. Introduction to Decision Tree algorithm


The Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a
tree-like structure of decisions and their possible outcomes to solve a particular problem. It
belongs to the class of supervised learning algorithms and can be used for both classification
and regression purposes.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
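
As an illustration of this structure (my own sketch, not part of the project), a small decision tree for the Loan Defaulters table above can be written as nested Python dictionaries, where internal nodes test an attribute, branches are the possible test outcomes, and leaves hold a class label.

# Internal nodes test an attribute; branches are the test outcomes; leaves hold a label.
tree = {
    "attribute": "Home Owner",
    "branches": {
        "Yes": {"label": "No"},                        # leaf node
        "No": {
            "attribute": "Marital Status",             # internal node
            "branches": {
                "Married":  {"label": "No"},
                "Single":   {"label": "Yes"},
                "Divorced": {"label": "Yes"},
            },
        },
    },
}

def classify(node, record):
    """Walk from the root node down to a leaf, following the matching branches."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(classify(tree, {"Home Owner": "No", "Marital Status": "Single"}))   # -> "Yes"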

2. Classification and Regression Trees (CART)


Nowadays, the Decision Tree algorithm is often known by its modern name CART, which stands
for Classification and Regression Trees. Classification and Regression Trees (CART) is a
term introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for
classification and regression modeling problems. The CART algorithm provides a foundation for
other important algorithms such as bagged decision trees, random forests and boosted decision
trees.

In this project, I solve a classification problem, so I will also refer to the algorithm as Decision
Tree Classification.

3. Decision Tree algorithm intuition


The Decision Tree algorithm is one of the most frequently and widely used supervised machine
learning algorithms and can be applied to both classification and regression tasks. The intuition
behind the Decision Tree algorithm is simple to understand, and is as follows:

1. For each attribute in the dataset, the Decision Tree algorithm forms a node, and the most
important attribute is placed at the root node.
2. To evaluate the task at hand, we start at the root node and work our way down the tree by
following the branch that matches our condition or decision at each node.
3. This process continues until a leaf node is reached. The leaf node contains the prediction,
or outcome, of the Decision Tree.

4. Attribute selection measures


The primary challenge in implementing a Decision Tree is to identify the attribute to consider as
the root node and at each subsequent level. This process is known as attribute selection.
There are different attribute selection measures for identifying the attribute that should be placed
at the root node at each level.

There are two popular attribute selection measures. They are as follows:

 Information gain
 Gini index

When using information gain as a criterion, we assume attributes to be categorical, and for the
Gini index attributes are assumed to be continuous. These attribute selection measures are
described below.

Information gain
By using information gain as a criterion, we try to estimate the information contained by each
attribute. To understand the concept of Information Gain, we need to know another concept
called Entropy.

Entropy measures the impurity of a given dataset. In physics and mathematics, entropy refers to
the randomness or uncertainty of a random variable X. In information theory, it refers to the
impurity in a group of examples. Information gain is the decrease in entropy: it computes the
difference between the entropy before a split and the weighted average entropy after splitting the
dataset on the given attribute's values.

The ID3 (Iterative Dichotomiser) Decision Tree algorithm uses entropy to calculate information
gain. So, by calculating the decrease in entropy for each attribute, we can calculate its
information gain. The attribute with the highest information gain is chosen as the splitting
attribute at the node.
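
A minimal sketch of these two quantities, where entropy is H(S) = -sum p_i log2 p_i over the class proportions p_i, and information gain is the entropy before the split minus the weighted average entropy after it; the example columns are taken from the Loan Defaulters table above.

import math
from collections import Counter

def entropy(labels):
    """Entropy H(S) = -sum(p_i * log2(p_i)) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Decrease in entropy obtained by splitting the labels on one attribute."""
    n = len(labels)
    split_entropy = 0.0
    for value, count in Counter(attribute_values).items():
        subset = [l for l, v in zip(labels, attribute_values) if v == value]
        split_entropy += (count / n) * entropy(subset)
    return entropy(labels) - split_entropy

# Example: splitting the Loan Defaulters table on "Home Owner".
defaulted  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
home_owner = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
print(information_gain(defaulted, home_owner))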

Gini index
Another attribute selection measure, used by CART (Classification and Regression Trees), is
the Gini index; CART uses the Gini method to create split points.

The Gini index reflects the probability that two items selected at random from a population
belong to the same class; this probability is 1 if the population is pure.
It works with a categorical target variable, such as "Success" or "Failure", and it performs only
binary splits. The higher the value of Gini, the higher the homogeneity.
Steps to Calculate Gini for a split

1. Calculate the Gini score for each sub-node, using the formula: the sum of the squares of the
probabilities of success and failure (p^2 + q^2).
2. Calculate the Gini score for the split as the weighted Gini score of each node of that split.

In the case of a discrete-valued attribute, the subset that gives the minimum Gini index is
selected as the splitting subset. In the case of continuous-valued attributes, the strategy is to
consider each pair of adjacent values as a possible split point, and the point with the smaller
Gini index is chosen as the splitting point. The attribute with the minimum Gini index is chosen
as the splitting attribute.
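
A minimal sketch of these two steps, following the document's convention that the Gini score of a node is p^2 + q^2 (so higher means more homogeneous); the "Success"/"Failure" labels and the example split are illustrative.

def gini_node(labels):
    """Gini score of one node: p^2 + q^2 (higher means more homogeneous)."""
    n = len(labels)
    p = sum(1 for label in labels if label == "Success") / n
    q = 1.0 - p
    return p ** 2 + q ** 2

def gini_split(left_labels, right_labels):
    """Weighted Gini score of a binary split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_node(left_labels) \
         + (len(right_labels) / n) * gini_node(right_labels)

# Example: a split that separates the classes well scores higher
# than the roughly 0.5 score of the mixed, unsplit parent node.
left  = ["Success"] * 8 + ["Failure"] * 2
right = ["Success"] * 1 + ["Failure"] * 9
print(gini_split(left, right))   # 0.75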

5. The problem statement


The problem is to predict the safety of a car. In this project, I build a Decision Tree Classifier to
predict car safety, implementing Decision Tree Classification with Python and Scikit-Learn. I
have used the Car Evaluation Data Set for this project, downloaded from the UCI Machine
Learning Repository website.
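
A minimal sketch of such a classifier with scikit-learn; the local file name car.data, the train/test split and the tree depth are assumptions of this sketch, while the column names follow the UCI Car Evaluation Data Set documentation.

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Column names as documented for the UCI Car Evaluation Data Set.
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car.data", names=cols)   # assumed local copy of the dataset

X = df.drop("class", axis=1)
y = df["class"]

# All attributes are categorical, so encode them as ordinal integers.
X_encoded = OrdinalEncoder().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.3, random_state=42)

# criterion="gini" uses the Gini index; criterion="entropy" would use information gain.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))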

Random Forest Regression

A random forest is a meta-estimator that fits a number of decision trees on various sub-samples
of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. By
default, the sub-sample size is the same as the original input sample size and the samples are
drawn with replacement (both can be changed by the user).

Generally, Decision Tree and Random Forest models are used for classification tasks. However,
the idea of a Random Forest as a regularizing meta-estimator over a single decision tree is best
demonstrated by applying them to regression problems. This way it can be shown that, in the
presence of random noise, a single decision tree is prone to overfitting and to learning spurious
correlations, while a properly constructed Random Forest model is more immune to such
overfitting.

What is the make_regression method?


It is a convenience function from scikit-learn that generates a random regression problem. The
input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

Import the dataset, then make scatter plots and histograms.
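
A minimal sketch of this step; the make_regression parameters (sample count, feature count, noise level) are illustrative choices, not the values used in the project.

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Generate a synthetic regression problem with a few informative features.
X, y = make_regression(n_samples=200, n_features=5, n_informative=3,
                       noise=10.0, random_state=42)

# Scatter plot of each feature against the target, plus a histogram of the target.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i in range(X.shape[1]):
    axes.flat[i].scatter(X[:, i], y, s=10)
    axes.flat[i].set_xlabel(f"feature {i}")
    axes.flat[i].set_ylabel("y")
axes.flat[-1].hist(y, bins=20)
axes.flat[-1].set_xlabel("y")
plt.tight_layout()
plt.show()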

How will a Decision Tree regressor do?


Every run will generate a different result, but on most occasions the single decision tree
regressor is likely to learn spurious features, i.e. it will assign some importance to features which
are not true regressors.

The output y is generated by applying a (potentially biased) random linear regression model
with n_informative nonzero regressors to the previously generated input, plus some centered
Gaussian noise with an adjustable scale.

Show the relative importance of regressors side by side


For the Random Forest model, show the relative importance of features as determined by the
meta-estimator. For the OLS model, show the normalized t-statistic values.

It will be clear that, although the Random Forest regressor identifies the important regressors
correctly, it does not assign them the same level of relative importance as the OLS t-statistics
do.
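
A minimal sketch of this comparison using scikit-learn and statsmodels; the dataset parameters are again illustrative, and the absolute OLS t-statistics are normalized to sum to 1 so they can sit next to the tree-based importances.

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       noise=15.0, random_state=0)

# Feature importances from a single tree and from the Random Forest meta-estimator.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# OLS t-statistics, taken in absolute value and normalized for comparison.
ols = sm.OLS(y, sm.add_constant(X)).fit()
t_stats = np.abs(ols.tvalues[1:])        # drop the intercept term
t_stats = t_stats / t_stats.sum()

for name, importance in [("Decision Tree", tree.feature_importances_),
                         ("Random Forest", forest.feature_importances_),
                         ("OLS |t| (normalized)", t_stats)]:
    print(name, np.round(importance, 3))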
