Research Paper (Machine Learning & Clustering)
Fig. 1. Machine Learning
The term machine learning was coined in
machine learning a central part of their operations. Machine learning has become a significant competitive differentiator for many companies.
Fig. 2. Machine Learning Process
learning, data scientists supply algorithms with labelled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.
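The idea of training on labelled data, where both inputs and outputs are specified, can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbour rule, which is only one of many supervised methods and is not named in the text; the feature values and the labels "small" and "large" are hypothetical.

```python
import math

# Hypothetical labelled training data: (height_cm, weight_kg) -> label.
TRAINING_DATA = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def predict(features, training_data=TRAINING_DATA):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    _, nearest_label = min(
        training_data, key=lambda pair: math.dist(pair[0], features)
    )
    return nearest_label

print(predict((152.0, 52.0)))  # small
print(predict((182.0, 88.0)))  # large
```

Because every training example carries its label, the algorithm never has to discover the groups itself; it only has to map new inputs to outputs it has already seen.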
B. Unsupervised Learning:
This type of machine learning involves algorithms that train on unlabelled data. The algorithm scans through data sets looking for any meaningful connection. The data that algorithms train on as well as the predictions or recommendations they output are predetermined.

C. Reinforcement Learning:
Data scientists typically use reinforcement learning to teach a machine to complete a multi-step process for which there are clearly defined rules. Data scientists program an algorithm to complete a task and give it positive or negative cues as it works out how to complete the task. For the most part, however, the algorithm decides on its own what steps to take along the way.

VI. CLUSTERING
data points in other groups. Clustering is basically a collection of objects on the basis of the similarity and dissimilarity between them. It is a type of unsupervised learning method. It works by finding similar patterns in the unlabelled dataset, such as shape, size, colour and behaviour, and dividing the data according to the presence or absence of those patterns. After this clustering technique is applied, each cluster or group is given a cluster ID, which an ML system can use to simplify the processing of large and complex datasets.

If the examples are labelled, then clustering becomes classification. Classifying input instances based on their corresponding class labels is known as classification, whereas grouping instances based on their similarity without the help of class labels is known as clustering.

Clustering can be divided into two subgroups:
• Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all.
• Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned.
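The difference between hard and soft clustering described above can be sketched in a few lines. The cluster centres are hypothetical, and the softmax over negative squared distances is an illustrative assumption; it is one possible way to turn distances into membership weights, not a method prescribed by the text.

```python
import math

def hard_assignment(point, centres):
    """Hard clustering: index of the single nearest centre."""
    distances = [math.dist(point, c) for c in centres]
    return distances.index(min(distances))

def soft_assignment(point, centres):
    """Soft clustering: one membership weight per centre, summing to 1.

    Weights are a softmax over negative squared distances (an assumption
    made for illustration).
    """
    scores = [-math.dist(point, c) ** 2 for c in centres]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

centres = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
point = (1.8, 0.0)

print(hard_assignment(point, centres))                         # one cluster index
print([round(w, 3) for w in soft_assignment(point, centres)])  # one weight per cluster
```

A point exactly halfway between the two centres receives equal weights under the soft scheme, whereas the hard scheme still has to pick one cluster.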
VII. TYPES OF CLUSTERING MODELS

B. Centroid Model:
In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.

C. Distribution Model:
These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (for example, the normal, or Gaussian, distribution). These models often suffer from overfitting.

D. Density Model:
These models search the data space for areas of varied density of data points. They isolate the different density regions and assign the data points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.

VIII. TYPES OF CLUSTERING ALGORITHMS

A. K-Means Clustering:

B. Hierarchical Clustering:
As the name suggests, this is an algorithm that builds a hierarchy of clusters. It starts with every data point assigned to a cluster of its own. The two nearest clusters are then merged into the same cluster, and this step is repeated. The algorithm terminates when only a single cluster is left. The results of hierarchical clustering can be shown using a dendrogram.

C. Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
DBSCAN is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in a large amount of data that contains noise and outliers. It is an example of a density-based model. In this algorithm, areas of high density are separated by areas of low density. Because of this, clusters of any arbitrary shape can be found.
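To make the density-based idea concrete, the following is a minimal pure-Python sketch of DBSCAN. The parameter names eps (neighbourhood radius) and min_pts (minimum number of neighbours for a dense region) follow common usage and do not appear in the text above; the sample points are hypothetical. It uses a brute-force neighbour search, so it is meant for illustration rather than for large datasets.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster ID per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled as a border point later
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
        cluster_id += 1
    return labels

# Two dense groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labelled as noise instead of being forced into a cluster, which is the behaviour that separates density models from centroid models.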
D. t-Distributed Stochastic Neighbour Embedding (t-SNE) Algorithm:
t-SNE is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or more dimensions suitable for human observation. With the help of the t-SNE algorithm, you may have to plot fewer exploratory data analysis plots the next time you work with high-dimensional data.

The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. It is not as well known as K-Means, but it is quite flexible and often easier to interpret. The cluster hierarchy can be represented as a dendrogram or tree structure.

IX. APPLICATIONS OF CLUSTERING

inbox, email companies use algorithms. The purpose of these algorithms is to flag an email as spam correctly or not.
How clustering works: K-Means clustering techniques have proven to be an effective way of identifying spam. They work by looking at the different sections of the email (header, sender, and content). The data is then grouped together. These groups can then be classified to identify which are

How clustering works: The algorithm takes in the content of the fake news article, the corpus, examines the words used and then clusters them. These clusters are what help the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. When you see a high percentage of specific terms in an article, it gives a higher probability of the material being fake news.
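The word-based approach described above can be sketched as follows. This is a simplified illustration under several assumptions: the list of sensational terms is hypothetical, each article is reduced to a single number (the share of its words found in that list), and the articles are grouped with K-Means using k = 2, which the text does not specify for this application.

```python
# Hypothetical list of "sensational" terms; a real system would learn these.
SENSATIONAL_TERMS = {"shocking", "unbelievable", "secret", "miracle"}

def sensational_share(text):
    """Fraction of words in the text that appear in SENSATIONAL_TERMS."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in SENSATIONAL_TERMS for w in words) / len(words)

def two_means_1d(values, iterations=20):
    """K-Means with k=2 on 1-D values; returns (labels, centres)."""
    centres = [min(values), max(values)]  # simple deterministic start
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
                  for v in values]
        # Update step: each centre moves to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centres[k] = sum(members) / len(members)
    return labels, centres

articles = [
    "The council approved the budget for road repairs on Tuesday.",
    "Shocking secret miracle cure doctors hate!",
    "Researchers published the survey results in a journal.",
    "Unbelievable! Shocking secret about your neighbours revealed!",
]
shares = [sensational_share(a) for a in articles]
labels, centres = two_means_1d(shares)
print(labels)  # [0, 1, 0, 1]
```

Cluster 1 here is simply the group with the higher share of sensational terms. Whether that group really contains fake news would still have to be checked against labelled examples.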
What the problem is: If you are a business trying to get the best return on your marketing investment, it is crucial that you target people in the right way. If you get it wrong, you risk not making any sales or, worse, damaging customer trust.
How clustering works: Clustering algorithms are able to group together people with similar traits and likelihood to purchase. Once you have the groups, you can run tests on each group with different marketing copy, which will help you better target your messaging to them in the future.

D. Classifying Network Traffic –
Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.
What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important to know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on areas driving growth. However, it is hard to know which is which when it comes to classifying the traffic.
How clustering works: K-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively.

What the problem is: You need to look into fraudulent driving activity. The challenge is how to identify which activity is genuine and which is fraudulent.
How clustering works: By analysing the GPS logs, the algorithm is able to group similar behaviours. Based on the characteristics of the groups, you are then able to classify them into those that are real and those that are fraudulent.

F. Document Analysis –
There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organize the documents quickly and efficiently.
What the problem is: Imagine you are limited in time and need to organize information held in documents quickly. To complete this task you need to understand the theme of the text, compare it with other documents and classify it.
How clustering works: Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the paragraph.

G. Fantasy Football and Sports –
Now let’s look at the critical issue: fantasy football!
How clustering works: When there is little performance data available to train your model on, unsupervised learning has an advantage. In this type of machine learning problem, you can find similar players using some of their characteristics. This has been done using K-Means clustering. Ultimately this means you can get a better team more quickly at the start of the year, giving you an advantage.

X. ADVANTAGES OF CLUSTERING
A. Clustering is very useful in data analysis and pattern recognition.
B. Clustering with the use of dendrograms helps us in clear visualization, which is practical and easy to understand.

XI. DISADVANTAGES OF CLUSTERING
A. Algorithm selection is not easy.
B. The speed at which the data is generated after the clustering algorithm runs is a challenge.

XII. BENEFITS OF CLUSTERING

XIII. CONCLUSION
In summary, machine learning is a technique of training machines to perform the activities a human brain can do, albeit a bit faster and better than an average human being. Clustering is an integral part of data mining and machine learning. It segments datasets into groups with similar characteristics, which can help you make better user behaviour predictions. Various clustering models and algorithms are available to help create the best potential groups of data objects. Clustering has a large number of applications in various domains such as data mapping, customer reports, etc. Moreover, clustering can help increase the accuracy of a machine learning approach. We have discussed the various ways of performing clustering.
Although clustering is easy to implement, you need to take care of some important aspects, such as treating outliers in your data and making sure each cluster has a sufficient population. Lastly, clustering is rarely the end of your analysis; often, it is just the beginning.