Unsupervised Machine Learning - Dealing With Unknown Data
https://searchenterpriseai.techtarget.com/post/Unsupervised-machine-learning-Dealing-with-unknown-data?offer=ML_series 1/8
16/04/2021 Unsupervised machine learning: Dealing with unknown data
The following article consists of excerpts from the course "Fundamental Machine Learning," which is part of the Machine Learning Specialist certification
program from Arcitura Education. It is the third part of the 13-part series "Using machine learning algorithms, practices and patterns."
With unsupervised learning, the algorithm and model are subjected to "unknown" data -- that is, data for
which no previously defined categories or labels exist. When data is unknown, the machine learning
system must teach itself to classify the data. It accomplishes this by processing the unlabeled data with
special algorithms to learn from its inherent structure (Figure 1).
Most of the time, data that is used in unsupervised learning is not historical data. For example,
unsupervised learning can be used in healthcare to create a model that categorizes the results of
different tests and quickly flags abnormal situations or test results. The model can learn from
different features of X-ray images or blood test results to categorize future tests or scans.
In unsupervised machine learning, clustering is the most common process used to identify and group similar entities or items. Its aim is to find
similarities among data points and to place similar data points in the same group.
Figure 1. Unknown data is categorized by the system; an analyst then reviews the
For example, a learning model can identify high-risk customers by determining which of them spend more than a certain amount, or spend more often than a
certain number of times, in casinos or on gambling websites; it then places them in a group accordingly (Figure 2).
Grouping similar data points helps to create a more accurate profile and attributes for different groups. Clustering can also be used to reduce the
dimensionality of the data when there are significant amounts of data.
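The grouping step can be sketched with k-means, one widely used clustering algorithm (the article does not name a specific algorithm, so this choice is an assumption). The customer figures, the choice of two clusters and the helper names `squared_distance` and `kmeans` are hypothetical and serve only as an illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Group points (tuples of numbers) into k clusters with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the grouping is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical unlabeled data: (monthly gambling spend, number of visits).
customers = [
    (20, 1), (35, 2), (25, 1), (40, 3),
    (900, 25), (1100, 30), (950, 28), (1000, 22),
]
centroids, labels = kmeans(customers, k=2)
print(labels)  # the first four customers share one label, the last four the other
```

No labels are supplied at any point; the two groups emerge from the structure of the data alone. Which group counts as "high risk" is still a judgment the analyst makes after reviewing the clusters.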
The hypothetical toy company, introduced in Part 2, continues to look for ways to gain further insights into its customer base. It sends an online survey to all
of its customers, asking them to fill out a questionnaire about their preferences regarding the types of toys they enjoy buying for their families and how
much they prefer to spend on toys each year. The toy company gets a good response, primarily because it includes the promise that all customers who
complete the survey will be entered into a raffle for a series of high-end prizes.
The company uses a clustering algorithm to mine the database in which survey results are recorded. The algorithm looks for common responses and
compares those against common characteristics of the customer profiles. Doing so results in potentially useful groups or clusters of data.
After the clustering process is completed, the following new data clusters are discovered and characterized by the analyst:
Cluster A: Customers who have historically paid by credit card are more likely to spend more on toys each year than those who usually pay by cash.
Cluster B: Customers who have three or more children are more likely to purchase outdoor toys priced at over $100 than those who have fewer
children.
The toy company adds a new class label to each customer record (based on its cluster membership) as further input for future model building using
classification algorithms.
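Turning cluster membership into a class label could look like the following minimal sketch. The centroid values, field names and the helper `nearest_cluster` are hypothetical; they are not taken from the toy company example, and the features are left unscaled for brevity.

```python
def nearest_cluster(features, centroids):
    """Return the name of the centroid closest to the feature vector."""
    return min(
        centroids,
        key=lambda name: sum((f - c) ** 2 for f, c in zip(features, centroids[name])),
    )

# Hypothetical centroids as (annual spend, number of children).
centroids = {"A": (1200.0, 1.0), "B": (300.0, 4.0)}

customers = [
    {"id": 1, "annual_spend": 1100.0, "children": 1},
    {"id": 2, "annual_spend": 250.0, "children": 3},
]

# Add the cluster name to each record as a new class label.
for record in customers:
    features = (record["annual_spend"], record["children"])
    record["cluster"] = nearest_cluster(features, centroids)

print(customers)
```

Once every record carries a `cluster` field, that field can serve as the target column for a later supervised classification model.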
Dimension reduction algorithms are used to decrease the number of characteristics or attributes in data sets so that the data generated is more relevant to
the problem being solved, and less difficult to visualize and understand. Reducing dimensions further helps reduce the amount of space required for storing
data sets and can also improve performance, as data sets are trimmed down and optimized, thereby decreasing the time required to perform computations.
Dimension reduction algorithms exist for both supervised and unsupervised learning.
Our hypothetical toy company, when running classification and regression algorithms, has been using a standard set of characteristics about
customers, including:
geographic location
age group
transaction frequency
frequency of returns
To reduce the number of factors (features) taken into consideration when each model is trained, the toy company decides to limit these
characteristics (dimensions) to only those most relevant and valuable to its machine learning analysis goals.
It runs a dimension reduction algorithm for this purpose. The algorithm determines that the age group and frequency of returns
values add negligible value to the typical analysis results, so they are dropped from further classification and regression processing. The remaining features
are used in subsequent model development because they have higher predictive potential.
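The article does not say which dimension reduction algorithm the company uses. One simple possibility is a filter-style feature selection that scores each feature by the strength of its correlation with the value being predicted and drops the weak ones. The sketch below assumes that approach; the data, the numeric encodings, the target (annual spend) and the 0.3 threshold are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column carries no information, so score it as 0.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

# Hypothetical target: annual spend per customer.
annual_spend = [100, 150, 200, 250, 300, 350, 400, 450]

# Hypothetical numeric encodings of the four characteristics.
features = {
    "geographic_location": [0, 0, 0, 1, 0, 1, 1, 1],  # assumed: 1 = urban
    "age_group": [3, 1, 2, 4, 4, 2, 1, 3],
    "transaction_frequency": [2, 3, 4, 5, 6, 7, 8, 9],
    "frequency_of_returns": [1, 0, 2, 1, 1, 2, 0, 1],
}

THRESHOLD = 0.3  # assumed cut-off for "negligible" predictive value
scores = {name: abs(pearson(values, annual_spend)) for name, values in features.items()}
kept = [name for name, score in scores.items() if score >= THRESHOLD]
print(kept)
```

With this made-up data, age group and frequency of returns show no correlation with spend and are dropped. Other dimension reduction techniques, such as principal component analysis, work differently: they build new combined features instead of discarding original ones.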
Semi-supervised learning
Semi-supervised learning is a hybrid approach that combines aspects of supervised and unsupervised learning. Commonly, semi-supervised learning is
carried out with a smaller volume of labeled historical data that is combined with a quantity of unlabeled (unknown) data. These two types of data are combined
to form the training data used to train a model. Essentially, the labeled data establishes base labels and categories that are used as a starting point for the
algorithm to process related unlabeled data.
This approach is often necessary when it is considered too time-consuming and expensive to collect, pre-process and label large amounts of historical
training data.
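One way to combine the two kinds of data is self-training (pseudo-labeling): a model fitted on the few labeled records assigns provisional labels to the unlabeled ones, and is then refitted on everything. The sketch below assumes a nearest-centroid classifier and a single self-training round; the data points, label names and function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fit(labeled):
    """Build a nearest-centroid model: {label: mean of that label's points}."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(dim) / len(points) for dim in zip(*points))
        for label, points in groups.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: squared_distance(features, model[label]))

def self_train(labeled, unlabeled):
    """One self-training round: pseudo-label the unlabeled data, then refit."""
    base_model = fit(labeled)  # the labeled data sets the base categories
    pseudo_labeled = [(x, predict(base_model, x)) for x in unlabeled]
    return fit(labeled + pseudo_labeled)

# A small labeled set and a larger unlabeled set (all values are made up).
labeled = [((1.0, 1.0), "low"), ((9.0, 9.0), "high")]
unlabeled = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (8.0, 9.0), (9.0, 8.0), (8.0, 8.0)]

model = self_train(labeled, unlabeled)
print(model)
print(predict(model, (3.0, 2.0)))
```

Only two records needed manual labels here; the remaining six refined the centroids without any labeling effort. In practice, pseudo-labels can be wrong, so self-training is often restricted to predictions the model is confident about.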
Reinforcement learning
Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions and discovering errors or rewards.
Trial-and-error search and delayed reward are its most relevant characteristics. This method allows machines and software agents to
automatically determine the ideal behavior within a specific context in order to maximize their performance.
In other words, reinforcement learning uses a trial-and-error model to teach the machine the behaviors it needs in order
to make the expected decisions. Reinforcement learning is used in robotics, gaming and self-driving cars.
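Trial-and-error search and delayed reward can be illustrated with tabular Q-learning, one common reinforcement learning algorithm (the article does not name one, so this is an assumed choice). In this minimal sketch, an agent walks along a hypothetical five-cell corridor and is rewarded only when it reaches the last cell; the environment, the parameter values and the names are invented for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # assumed learning rate, discount, exploration rate

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Trial and error: explore at random sometimes (and on ties),
            # otherwise exploit the action that currently looks best.
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            # Delayed reward: nothing is paid out until the goal is reached.
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Q-learning update: move the estimate toward the reward plus
            # the discounted value of the best action in the next state.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

After training, the learned policy is to move right in every cell, even though only the final step is ever rewarded directly; the discount factor propagates that reward back to earlier cells.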
What's next?
The remaining 10 parts of this series focus on proven machine learning techniques in a standard patterns format. (These patterns should not be confused
with computation and data-related patterns resulting from machine learning processing.) The next article focuses on two exploration patterns: central
tendency computation and variability computation.