Anomaly Detection
Student Guide
Course code GAI10SG194
V10.1
Contents
Trademarks
Course description
Trademarks
The reader should recognize that the following terms, which appear in the content of this training document,
are official trademarks of IBM or other companies:
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
DB2®, HACMP™, System i™, System p™, System x™, System z™
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other product and service names might be trademarks of IBM or other companies.
Course description
Pattern Recognition and Anomaly Detection
Purpose
This course describes how to classify a given pattern in a dataset into one of several pre-specified classes, develops skills in using pattern recognition and anomaly detection techniques with AI algorithms, and provides experience in independent study and research across different anomaly detection areas.
Audience
The audience of this course is Bachelor of Technology (B.Tech) students.
Prerequisites
A basic overview of pattern recognition and anomaly detection.
Objectives
After completing this course, you should be able to:
• Understand the concept of pattern recognition and anomaly detection
• Learn about linear models for classification
• Gain an insight into clustering based methods
• Learn examples of sparse kernel machines and graphical models
• Gain knowledge on anomaly detection in big data
References
• https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Gohana_inverted_S-curve.png/560px-Gohana_inverted_S-curve.png
• https://lh3.googleusercontent.com/HkRug5Yd6SlGy0AkSgLZ9FYwrq3Os5jeSoEiHqg5ft1se9C8uSUcXjY9p3yfYfhg13eyUA=s86
• https://images.app.goo.gl/j5A2Q22DcLvTFt5K8
• www.IBM.com
Unit 1. Pattern and Anomaly Detection
Introduction
References
IBM Knowledge Center
Notes:
Unit objectives are stated above.
Notes:
Examples: dress colors, voice style, and so on. In computer science, a vector of feature values is used to describe the pattern.
• It has applications in statistical data analysis, signal processing, image analysis, information
retrieval, bioinformatics, data compression, computer graphics and machine learning.
Notes:
What is pattern recognition?
Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power. These activities can be viewed as two facets of the same field of application, and together they have undergone substantial development over the past few decades.
Pattern recognition is the process of using machine learning algorithms to recognize patterns. It is the classification of data based on statistical information or on knowledge gained from earlier patterns and models. Wide applicability is one of the essential facets of pattern recognition.
Examples: voice recognition, speaker identification, multimedia document recognition (MDR), automated medical diagnosis. In a basic pattern recognition program, the raw data is interpreted and translated into a form the computer can use. Pattern recognition involves identifying groupings and trends. Clustering partitions the data, a basic decision-making step on which further analysis depends; unsupervised learning uses clustering. Pattern recognition has the following characteristics:
• Pattern recognition programs must recognize a familiar pattern quickly and reliably.
• They must recognize and classify unfamiliar objects.
• They must distinguish shapes and structures from various perspectives.
Notes:
Pattern recognition techniques
There are three primary pattern recognition models:
Statistical: Identifies where an individual item belongs (for example, whether or not it is a cake). This approach uses supervised machine learning.
Syntactic/structural: Describes more complex relationships among components (e.g., parts of speech). This model uses semi-supervised learning.
Template matching: Matches the characteristics of an item to a predefined template and identifies the item by proxy. One application of this model is plagiarism testing.
Pattern recognition algorithms have two key components:
• Explorative: Used to discover commonalities in the data.
• Descriptive: Used to describe those commonalities in a definite way.
The combination of these two components is used to extract information from the data for use in big data analytics. The study of the prevalent variables and their interactions provides information that can be important in interpreting the subject matter.
Notes:
Training and learning in pattern recognition: Training and learning is the process that teaches a model to produce reliable results. Training is an important step because how well the software performs on the given data depends on which algorithm is used. The entire dataset is divided into two parts: one is used to train the model, and the other is used to evaluate it after training.
Training set: Training sets are used in model construction. A training set is composed of example patterns used to train the program. Training rules and algorithms provide the details of how input data should be mapped to output decisions. Specific knowledge is derived from the data and from the results produced by applying such algorithms to the training dataset. Typically, 80% of the dataset is used for training.
Test set: The model is checked using test data. This is a collection of data used during testing to verify that the model generates the correct results. Usually, 20 percent of the dataset is used for testing. The test results are used to assess the model's precision. As an illustration, if a program that recognizes which group a specific flower belongs to can correctly recognize seven out of ten flowers and misclassifies the others, its precision is seventy percent.
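The 80/20 split described above can be sketched in Python with scikit-learn. This is a minimal illustration, assuming scikit-learn is available; it uses the library's bundled iris dataset as a stand-in for the flower data mentioned above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 80% of the records train the model; the held-out 20% test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the unseen test set estimates real-world precision.
print("test accuracy:", model.score(X_test, y_test))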
• A pattern is a physical object or an abstract notion. When thinking about animal groups, the description of an animal is a pattern.
• When speaking about various styles of balls, the description of a ball is a pattern.
• In the ball example, the groups can be baseball, cricket ball, or table tennis ball.
• Choosing attributes and describing patterns is a very critical phase in pattern classification.
• A good representation uses discriminatory attributes and keeps the computational burden of classification low.
Notes:
Real-time examples and explanations: A vector is a common representation of a pattern. Each element of the vector reflects one attribute of the pattern. For instance, the first element of the vector holds the value of the first feature.
Illustration: When describing a spherical object, (25, 1) may be interpreted as a round object with a weight of 25 units and a diameter of 1 unit. A class label can be appended to the vector.
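To make the (25, 1) illustration concrete, here is a small sketch of how such a pattern could be stored as a feature vector with an appended class label. The variable names and the numeric label encoding are our own choices, not the course's.

import numpy as np

# Feature vector for the spherical object: weight = 25 units, diameter = 1 unit.
features = np.array([25.0, 1.0])

# A class label appended as a third component (the encoding is hypothetical).
BALL = 1
labeled_pattern = np.append(features, BALL)

print(labeled_pattern)   # [25.  1.  1.]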
Advantages:
• Pattern recognition solves classification problems.
• Pattern recognition addresses the problem of fraudulent biometric identification.
• It is very useful for textile pattern recognition for visually impaired people.
• It helps in speaker diarization ("the process of partitioning an input audio stream into homogeneous segments according to the speaker identity").
• The same object can be recognized from various angles.
Disadvantages:
• Syntactic pattern recognition methods are complex and very slow to use.
• Larger datasets are sometimes necessary to obtain improved precision.
• It cannot explain why it recognizes a particular thing.
• Instance: my face vs. my friend's face.
Applications:
• Image processing, segmentation, and analysis: Pattern recognition is used to give machines the capability needed to process images for human recognition.
• Computer vision: Pattern recognition is used to extract relevant attributes from given image/video samples and is employed for numerous purposes, such as biological and biomedical imaging in machine vision.
• Seismic analysis: Pattern recognition is employed in seismic records to identify, image, and describe temporal patterns. The definition of predictive variables is applied and employed in different applications of seismic research.
• Analysis and classification of radar signals: Pattern recognition and signal processing methods are used to identify and interpret AP mines (anti-personnel mines) in specific radar signal recognition applications.
• Speech recognition: Significant progress has been made in speech recognition by using pattern recognition methods. They are employed in numerous speech recognition algorithms that try to sidestep the problems of representing phonemes and instead model larger objects such as words.
• Fingerprint identification: Fingerprint identification technology is a significant development in the biometric industry. Various identification approaches have been used to match fingerprints, among which pattern recognition tools are commonly utilized.
Pattern recognition use cases
• Chat bots, NLP with text generation, text analysis, text translation.
Notes:
Pattern recognition use cases
Customer research and stock market analysis: In current stock market forecasting, trend identification is used for quantitative analysis of market prices and for forecasting probable outcomes. Chart analysis uses this kind of study for pattern detection. In audience research, selected features of the available consumer data are reviewed and classified; applications such as Google Analytics support this.
Chat bots: NLP with text generation, text analysis, text translation: Natural language processing (NLP) is a machine learning field that focuses on training computers to comprehend human language and generate messages. This sounds like intense sci-fi, yet it is not about the hidden meaning of a conversation; it is just about what is conveyed explicitly in the text. NLP takes fragments of text, looks for links, and makes distinctions. The cycle starts by splitting the sentences; it distinguishes the vocabulary terms and phrases, then describes how such terms fit together in a paragraph. In doing so, NLP uses a mix of techniques such as filtering, segmentation, and tagging to establish a model for processing. Supervised and unsupervised machine learning algorithms are used in many phases of the method.
Notes:
Anomaly detection: An anomaly is a divergence from ordinary, natural, or anticipated values. Anomaly detection identifies atypical patterns within the data; such rare occurrences are often called outliers. You may define the report metrics that anomaly detection should track. For the defined metrics, anomaly detection takes the following actions:
• Analyzes historical data.
• Builds a model and determines the anomalous points.
• Lists the contributing factors.
• Visually displays the information in a separate anomaly detection view.
Now assume you have a report that monitors successful checkouts. You want to learn whether the number of checkouts diverges from the standard, so you pick the checkout metric to spot irregularities. Anomaly detection tracks the metric across historical records and detects and quantifies metric deviations. It may observe, for instance, that the average number of successful weekend checkouts is 15 percent lower than during the week. It then flags an anomaly when checkouts abruptly decrease much more than that. Anomaly detection flags the data deviation and lists the factors contributing to the decrease in successful checkouts.
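The checkout example can be approximated with a simple statistical rule: learn the historical mean and spread of the metric, then flag values that fall far outside them. This is only an illustrative stand-in for a production anomaly detector, and the daily counts below are invented.

import numpy as np

rng = np.random.default_rng(42)
# Hypothetical daily checkout counts: weekdays around 1000, weekends ~15% lower.
history = np.where(np.arange(60) % 7 < 5,
                   rng.normal(1000, 40, 60),
                   rng.normal(850, 40, 60))

mean, std = history.mean(), history.std()

def is_anomaly(value, threshold=3.0):
    """Flag a value whose z-score deviates more than `threshold` sigmas."""
    return abs(value - mean) / std > threshold

print(is_anomaly(960))   # False: an ordinary day
print(is_anomaly(400))   # True: checkouts dropped far below the learned norm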
What are some other practical uses for anomaly detection?
Notes:
What are some other practical uses for anomaly detection?
Anomaly detection can be used for the following business cases:
Traffic dropped or spiked: Normal traffic patterns change irregularly and might not be constant over the year. Anomaly detection identifies the usual traffic trend and flags a deviation when traffic diverges from that trend.
Transactions or revenue dropped: Anomaly detection identifies the usual trend for transactions or revenue and flags an anomaly if they dip below the standard.
Traffic from social media increased or decreased: Marketers are warned if the flow of social network traffic shifts all of a sudden. The shift may be attributed to a mass retweet, or to their Facebook page being penalized for sending too much email.
Traffic from organic search increased or decreased: For SEO, if the volume of traffic from search engines decreases, it may be an indication that the ranking algorithms have changed. The site needs to be revised in order to achieve a better rank under the new assessment criteria.
What are the data requirements for anomaly detection, in terms of days of data and data density? Anomaly detection requires data measured for at least fourteen days, excluding any initial consecutive zero values, and the missing portion of the data must not exceed fifty percent.
Notes:
How is anomaly detection calculated over time?
For instance, in August there was a launch of an iOS app that generated a huge spike in iOS traffic. The traffic increase in the total session count was associated with the anomaly (increase). An Android app was released in September, which generated its own increase in traffic.
Is every anomaly clearly explained?
For the iOS launch, the top contributing factor for the session count metric is the application dimension with a value of iOS. The second contributing factor is the platform dimension with a value of Android. However, if you select the revenue metric instead, the session metric has a device dimension with iOS as the primary contributor and Android as the secondary contributor.
• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply the concepts in the following activity.
• Write the following activities using Python code.
Notes:
Notes:
Key point for AI and ML: anomaly detection
An approach for managing data more quickly and effectively is to detect unexpected occurrences, changes, or shifts in datasets. Anomaly detection refers to identifying items or events that do not conform to an expected pattern or to the other items in a dataset, and that are typically impossible for a human specialist to detect. Anomaly detection is therefore one of the key goals of the Industrial IoT: a system that relies on artificial intelligence to identify unusual activity in the collected data. Typically these irregularities translate into problems such as design faults, defects, or fraud. Instances of potential anomalies:
• A leaking connection pipe that causes the entire manufacturing line to shut down.
• Several unsuccessful authentication attempts indicating possible phishing activity.
• Fraud identification in financial transactions.
Tasks for artificial intelligence
Figure: AI tasks
Source: https://images.app.goo.gl/F9Z664sSWR53yGqLA
Notes:
Tasks for artificial intelligence
• Automation: Driven by AI detection algorithms, datasets are continuously analyzed, the normal behavioral parameters are precisely specified, and pattern deviations are recognized.
• Real-time analysis: AI applications observe the behavior of the data in real time. The moment the system does not recognize a sequence, it raises a warning.
• Scrupulousness: Anomaly monitoring systems provide end-to-end, gap-free surveillance to inspect data in depth and detect the slightest irregularities that people might not find.
• Accuracy: AI improves anomaly detection performance, eliminating noisy alerts and the false positives/negatives caused by static thresholds.
• Self-learning: AI-driven, self-learning algorithms are the backbone of these systems; they learn from data patterns and provide predictions or answers when appropriate.
AI system learning process (1 of 2)
Notes:
AI system learning process
If a curve runs through two points A and B, it would be expected that the curve also runs somewhat near the midpoint of A and B. This is not true for high-order polynomial curves; they may have values of very large positive or negative magnitude. With low-order polynomials, the curve is more likely to fall near the midpoint (a first-degree polynomial is even guaranteed to run exactly through the midpoint). Low-order polynomials tend to be smooth, and high-order polynomial curves tend to be lumpy. To define this more precisely, the maximum possible number of inflection points in a polynomial curve is n-2, where n is the order of the polynomial equation. An inflection point is a location on the curve where it switches from positive to negative radius; it is also where we can say the curve changes from 'water holding' to 'water shedding'.
Be mindful that high-order polynomials are only "possibly" lumpy; they can also be smooth, but unlike low-order polynomial curves, there is no guarantee of it. A 15th-degree polynomial may have, at most, 13 inflection points, but may also have 12, 11, or any number down to zero. A polynomial degree higher than needed for an exact fit is undesirable for all the reasons previously given for high-order polynomials, but it also leads to a case where there are multiple solutions. For instance, a first-degree polynomial (a line) constrained by only one point, instead of the usual two, would give an infinite number of solutions. This raises the problem of how to compare and select one solution, which can be a problem for software and for people alike. For this reason, it is generally best to choose a degree as low as possible to meet all the constraints, and perhaps even lower if an approximate fit is acceptable.
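The trade-off between low- and high-degree polynomial fits can be demonstrated with numpy.polyfit. This is a minimal sketch with synthetic noisy data of our own construction, not data from the course.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # noisy samples

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    residual = np.abs(np.polyval(coeffs, x) - y).max()
    print(f"degree {degree}: max residual on the data = {residual:.4f}")
# The high-degree fit drives the residual toward zero by chasing the noise,
# but it oscillates between the sample points: the lumpiness described above.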
Notes:
Fitting other functions to data points
Other types of curves can, in some cases, be used as well, for example trigonometric functions (like sine and cosine). In spectroscopy, data can be fitted with Gaussian, Lorentzian, and Voigt functions. In agriculture, the inverted logistic sigmoid (S-curve) function describes the relationship between crop yield and growth factors. The blue figure was made by a sigmoid regression of data measured in farm lands. It shows that the crop yield decreases slowly at first, i.e., at low soil salinity, and that the decrease then progresses more rapidly.
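A hedged sketch of fitting an inverted logistic S-curve with scipy.optimize.curve_fit follows; the salinity/yield numbers and the parameter names (top, k, x0) are invented for illustration, not the data behind the figure.

import numpy as np
from scipy.optimize import curve_fit

def inverted_sigmoid(x, top, k, x0):
    """Yield starts near `top` and falls along an inverted logistic curve."""
    return top / (1.0 + np.exp(k * (x - x0)))

# Hypothetical soil salinity (dS/m) vs. relative crop yield (%).
salinity = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
yield_pct = np.array([99.0, 97.0, 90.0, 72.0, 45.0, 22.0, 10.0])

params, _ = curve_fit(inverted_sigmoid, salinity, yield_pct,
                      p0=(100.0, 0.5, 7.0))  # initial guess for top, k, x0
print("fitted top=%.1f, k=%.2f, x0=%.2f" % tuple(params))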
• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply the concepts in the following activity.
• Write the following activities using Python code.
Notes:
Figure: Algebraic curves in parametric form used for creating the forming line in the edge section of different rated surfaces
Source: https://images.app.goo.gl/Rn75dzFLCv9y6nu5A
Notes:
Algebraic fit versus geometric fit for curves
For algebraic analysis of data, "fitting" usually means trying to find the curve that minimizes the vertical (y-axis) displacement of a point from the curve (e.g., ordinary least squares). For graphics and image applications, geometric fitting instead seeks to provide the best visual fit, which usually means trying to minimize the orthogonal distance to the curve (e.g., total least squares), or to otherwise include both axes of displacement of a point from the curve. Geometric fits are not common, since they usually require non-linear or iterative calculations, although the result is more esthetically pleasing and geometrically accurate.
Curves matched to data points (1 of 2)
Notes:
Curves matched to data points
If a function of the form y = f(x) cannot be postulated, a plane curve can still be fitted. In some cases, certain types of curves can be used, such as conic sections (circular, elliptical, parabolic, or hyperbolic arcs) or trigonometric functions (sine, cosine). For example, trajectories of objects under gravity follow a parabolic path when air resistance is ignored; therefore, a parabolic curve is a valid choice for fitting trajectory data points. Tides follow sinusoidal patterns, so tidal data points should be matched to a sine wave, or to the sum of two sine waves of different periods if the effects of the moon and the sun are both considered. For a parametric curve, each of its coordinates can be effectively fitted as a separate function of arc length; provided that data points can be ordered, the chord distance may be used.
A geometrically fitted circle: Coope approaches the problem of trying to find the best visual fit of a circle to a set of 2D data points. The technique elegantly turns the normally non-linear problem into a linear problem that can be solved without using iterative numerical methods.
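The linearization idea can be sketched in a few lines: write the circle as x² + y² = 2ax + 2by + c with c = r² − a² − b², which is linear in (a, b, c) and solvable by ordinary least squares. This is our own minimal rendering of that idea, in the style of Coope's method, not a reference implementation.

import numpy as np

def fit_circle(x, y):
    """Linear least-squares circle fit.

    Solves [2x 2y 1] @ [a b c]^T = x^2 + y^2, where (a, b) is the
    center and c = r^2 - a^2 - b^2, so r = sqrt(c + a^2 + b^2).
    """
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return a, b, np.sqrt(c + a**2 + b**2)

# Noisy points on a circle centered at (3, -2) with radius 5.
t = np.linspace(0, 2 * np.pi, 40)
rng = np.random.default_rng(1)
x = 3 + 5 * np.cos(t) + rng.normal(0, 0.05, t.size)
y = -2 + 5 * np.sin(t) + rng.normal(0, 0.05, t.size)
print(fit_circle(x, y))   # approximately (3.0, -2.0, 5.0)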
The geometric fit of an ellipse: The circle-fitting technique above is extended to general ellipses by adding a non-linear step, resulting in a method that is fast, yet finds visually pleasing ellipses of arbitrary orientation and displacement.
Application to surfaces: Much of this principle extends from 2D curves to 3D surfaces, where each patch is defined by a net of curves in two parametric directions, usually called u and v. A surface may be composed of one or more surface patches in each direction.
Software: Many statistical packages, such as R, and numerical software, such as gnuplot, MLAB, Maple, MATLAB, GNU Octave, and SciPy, include commands for doing curve fitting in a variety of scenarios.
Case study
Example 1: A group of senior citizens who have never used the Internet before receives training. As shown in the table at the top left of figure 1, a random sample of 5 people is followed for 6 months, and their hours on the Internet are recorded. Determine whether the data matches a quadratic regression line. First, to the right of this data, we create a table with a second variable (MonSq, the squared month). Using columns I, J, and K of the right-hand table, we then run the regression data analysis tool (quadratic model). The result is shown in the second figure.
Curves matched to data points (2 of 2)
Notes:
The R-square value of 95 percent and the p-value (significance F) near 0 suggest that the model is very well suited to the data, i.e., it fits well. This is also confirmed by the fact that the p-value of the MonSq variable is close to 0. It is further shown by the scatter diagram in figure 1, which demonstrates that the quadratic trend fits better than the linear trend.
The figure shows the quadratic regression best suited to the data:
Usage hours = 21.92 − 24.55*month + 8.06*month².
From this model (or by using the TREND function) we can predict that a person will access the Internet for about 20.8 hours after three months. For comparison with a linear model, the regression data analysis tool is also run on only columns I and K of figure 1; the output appears in figure 2. The fact that the quadratic model has a higher R-square value (95.2% versus 83.5%) and a lower standard error (13.2 versus 24.5) reflects the fact that the quadratic model matches the data more precisely.
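The case study's quadratic fit can be reproduced with numpy. The raw five-person table is not included here, so this sketch evaluates only the fitted model quoted above; with the raw data available, the same coefficients would come from a degree-2 least-squares fit.

import numpy as np

# Quadratic model quoted in the case study:
# usage_hours = 21.92 - 24.55*month + 8.06*month**2
coeffs = [8.06, -24.55, 21.92]        # highest power first, as polyfit returns

month = 3
print(np.polyval(coeffs, month))      # ~20.8 hours after three months

# With the raw (month, hours) table, the fit itself would be:
# coeffs = np.polyfit(months, hours, 2)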
Figure 1-20. Case study: Anomaly detection with IBM Watson
Notes:
Case study– anomaly detection with IBM Watson
https://dataplatform.cloud.ibm.com/docs/content/wsd/nodes/anomalydetection.html
Anomaly detection models are used to identify outliers or unusual cases in the data. In contrast to other forms of modeling, which store rules about abnormal cases, anomaly detection models store information about normal behavior. This allows outliers to be detected even if they do not conform to any known pattern, and it can be especially useful in applications where new patterns are continually developing, such as fraud detection. Anomaly detection is an unsupervised method, meaning that a training dataset containing known fraud cases is not required.
In contrast to conventional methods of identifying outliers, which usually look at one or two variables at a time, anomaly detection can analyze large numbers of fields to identify clusters or peer groups into which similar records fall. Each record can then be compared to the others in its peer group to assess possible anomalies. The further a case is from the normal center, the more unusual it is. For example, the algorithm might group records into three distinct clusters and flag those that fall far from the center of any cluster.
Each record is assigned an anomaly index, which is the ratio of the group deviation index to its average over the cluster that the case belongs to. The larger this index value, the more the case deviates from the average. Under normal circumstances, cases with anomaly index values less than 1, or even 1.5, would not be considered anomalies, because the deviation is about the same as, or only slightly higher than, average. However, cases with an index value greater than 2 are good candidates for anomalies, because the deviation is at least twice the average.
Anomaly detection is an exploratory method designed to quickly identify unusual cases or records that are candidates for further analysis. These should be regarded as suspected anomalies which, on closer examination, may or may not turn out to be real. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error or artifact in the data collection process.
Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model, regardless of whether those fields are relevant to the pattern you are looking for. For this reason, you may want to use anomaly detection in combination with feature selection or another tool for screening and ranking fields. For example, feature selection can identify the fields most relevant to a particular target, and then anomaly detection can locate the records that are most unusual with respect to those fields. (An alternative approach would be to build a decision-tree model and then examine the misclassified records as potential anomalies, but this process would be more difficult to replicate or automate on a large scale.)
For instance, anomaly detection can be used to screen agricultural development grants for possible fraud, identifying records that deviate from the pattern and are worth investigating further. Particular attention is paid to grant applications that seem to claim too much (or too little) money for the type and size of farm.
Requirements: One or more input fields. Note that only fields with a role of Input can be used as inputs from a source or Type node; target fields (role Target or Both) are ignored. Because an anomaly detection model flags cases that do not conform to a known set of rules rather than matching previously known patterns, it can detect unusual cases even when they do not follow previously known trends. In combination with feature selection, anomaly detection makes it possible to screen large volumes of data quickly to identify the records of greatest interest.
Why does a Brazilian bank give each of its 65 million clients personal attention?
Watson is IBM's AI, which integrates seamlessly into workflows and the leading platforms and tools you already use. Putting AI where you need it means allowing your employees to focus on what they do best. Bradesco is one of Brazil's biggest banks, with over 5,200 branches. In a sector as competitive as banking, customers who do not have a pleasant experience might not stay customers for long. Bradesco therefore started to search for ways to speed up service and boost the level of personalization for each customer. Bradesco deployed these services on IBM Watson.
How Watson learned, in five stages:
• A dedicated team trained Watson in Portuguese and in banking, using 10,000 customer questions.
• Watson was tested in a small number of branches until the bank was pleased with the responses.
• Watson was rolled out nationally to the 5,200 branches.
• Watson's response times decreased from 10 minutes to a few seconds as staff started to trust Watson.
• With feedback on over 10 million interactions, Watson continues to learn and improve.
• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply the concepts in the following activity.
• Write the following activities using Python code.
Notes:
Notes:
Probability theory: Probability describes how likely an event is to occur. Let us dig through the probability jargon:
• Trial or experiment: An act whose outcome is uncertain, with each outcome occurring with a certain probability.
• Sample space: The collection of all possible outcomes of an experiment.
• Event: A non-empty subset of the sample space is referred to as an event.
In scientific terminology, probability is the measure of how likely an event is to occur when an experiment is carried out.
• P(black) = 2/12 = 1/6. There are 2 black marbles in the bag; 12 is the size of the sample space.
• P(blue) = 4/12 = 1/3. There are 4 blue marbles in the bag; 12 is the size of the sample space.
• P(blue or black) = 6/12 = 1/2. There are 4 blue plus 2 black marbles; 12 is the size of the sample space.
• P(not green) = 11/12. There is 1 green marble, so 12 − 1 = 11 marbles are not green; 12 is the size of the sample space.
• P(not purple) = 1. Any selected marble will not be purple, because there are no purple marbles in the bag.
Whenever an outcome is certain to occur, its probability is 1.
Notes:
Likelihood (probability) with marbles: Within a bag you can find four blue marbles, five red marbles, one green marble, and two black marbles. Suppose you randomly select one marble. Find each probability.
• P(black).
• P(blue).
• P(blue or black).
• P(not green).
• P(not purple).
The solution is shown above.
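The marble probabilities can be checked with Python's fractions module; a small sketch:

from fractions import Fraction

bag = {"blue": 4, "red": 5, "green": 1, "black": 2}
total = sum(bag.values())                # 12 marbles in the sample space

def p(color):
    return Fraction(bag.get(color, 0), total)

print(p("black"))                 # 1/6
print(p("blue"))                  # 1/3
print(p("blue") + p("black"))     # 1/2  (disjoint events add)
print(1 - p("green"))             # 11/12
print(1 - p("purple"))            # 1    (no purple marbles: certain event)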
Maximum likelihood theory and estimation (1 of 2)
• Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain.
Notes:
Maximum likelihood theory and estimation: Maximum likelihood estimation is the general framework used throughout machine learning for this problem, though there are many techniques for solving density estimation. Maximum likelihood estimation involves defining a likelihood function that expresses the probability of observing the data sample as a function of the choice of probability distribution and its parameters; this function can then be used to search the space of possible distributions and parameters. This probabilistic framework forms the foundation for many machine learning algorithms, including important approaches for predicting numerical values and class labels, such as linear regression and logistic regression, as well as deep artificial neural networks.
Probability density estimation problem: A common modeling problem involves estimating a joint probability distribution for a dataset. For example, given a sample of observations (X) from a domain (x1, x2, x3, ..., xn), each observation is drawn independently from the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it). Density estimation consists of choosing a probability distribution function and the parameters of that distribution that best explain the observed data (X) as a joint probability distribution.
How do you choose the probability distribution function?
This question is complicated by the fact that the sample (X) drawn from the population is small and noisy, so any chosen density and its parameters will contain some error. There are several techniques for solving the problem, but two common approaches are:
• Maximum a posteriori probability (MAP) estimation.
• Maximum likelihood estimation (MLE).
• Suppose that we are given a sequence (x1, ..., xn) of i.i.d. N(μ, σv²) random variables, and that the prior distribution of μ is N(μ0, σm²). We wish to find the MAP estimate of μ. Note that the normal distribution is its own conjugate prior, so we can find a closed-form solution analytically.
• The MAP estimate is μ̂ = (σm²(x1 + ... + xn) + σv²μ0) / (σm²n + σv²), which turns out to be a linear interpolation between the prior mean and the sample mean, weighted by their respective covariances. The case σm → ∞ is called a non-informative prior and leads to an ill-defined a priori probability distribution.
Notes:
Maximum a posterior probability (MAP)
In Bayesian statistics, a maximum a posterior probability (MAP) estimate is an estimate of an unknown
quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of
an unobserved quantity based on empirical data. It is closely related to the method of maximum likelihood
(ML) estimation but employs an augmented optimization objective which incorporates a prior distribution (that
quantifies the additional information available through prior knowledge of a related event) over the quantity
one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.
Maximum likelihood estimation (MLE): MLE is a method for estimating the parameters of a probability density. The general approach is to treat the problem as an optimization or search problem, in which we search for the set of parameters that best explains the joint likelihood of the data (X). First, a parameter vector named theta is defined, which determines both the choice of probability density function and the parameters of that distribution. It may be a vector of numerical values whose entries can change smoothly and that map to different probability distributions and their parameters.
Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a ‘head’ p.
The goal then becomes to determine p.
Suppose the coin is tossed 80 times; i.e., the sample might be something like x1 = H, x2 = T, ..., x80 = T, and the count of the number of heads "H" is observed.
The probability of tossing tails is 1 − p (so here p is the θ above). Suppose the outcome is 49 heads and 31 tails, and suppose the coin was taken from a box containing three coins: one which gives heads with probability p = 1/3, one which gives heads with probability p = 1/2, and another which gives heads with probability p = 2/3. The coins have lost their labels, so which one was used is unknown. Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed. By using the probability mass function of the binomial distribution with sample size equal to 80 and number of successes equal to 49, but with different values of p (the "probability of success"), the likelihood function takes one of three values: approximately 0.000 for p = 1/3, 0.012 for p = 1/2, and 0.054 for p = 2/3.
The likelihood is maximized when p = 2⁄3, and so this is the maximum likelihood estimate for p.
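The three likelihood values can be computed directly from the binomial probability mass function; a short sketch (Python 3.8+ for math.comb):

from math import comb

n, heads = 80, 49

def likelihood(p):
    """Binomial pmf: P(49 heads in 80 tosses | p)."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

for p in (1/3, 1/2, 2/3):
    print(f"p = {p:.3f}: likelihood = {likelihood(p):.4f}")
# The likelihood peaks at p = 2/3, the maximum likelihood estimate.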
Validation and testing
Validation is the process in which the model and its hyperparameters are tuned. Testing means checking the model on a new collection of data that has not been used yet (i.e., data not used for training, cross-validation, bootstrapping, or any other approach you applied). This simulates the performance of the model on completely new data, which is the key benefit: seeing how the model performs on data it has never seen.
Assessing models of regression
The core techniques for evaluating models of regression are:
• Mean absolute error.
• Median absolute error.
• (root) mean squared error.
• Coefficient of determination (R2).
Residuals: A residual (ei) is the difference between an observed and a predicted value. It can be regarded as the vertical distance between the observed data point and the regression line. Least-squares fitting minimizes the sum of the squared residuals; that is, it minimizes the mean squared error (MSE) between the line and the data.
Residual (error) variation
The residual variance measures how well the data points match a regression line. The naive residual variance estimate is the same as the mean squared error: σ² = (1/n) Σ ei². Nonetheless, you are more likely to see the estimator adjusted to make it unbiased:
σ² = (1/(n − 2)) Σ ei². This takes into account the degrees of freedom used (here, the intercept and the slope, both of which must be estimated). The square root of this variance, σ, is the root mean squared error (RMSE).
Coefficient of determination
The total variation is the residual variation (variation left over after the predictor is used) plus the systematic/regression variation: Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)².
Where:
r = correlation coefficient.
x = values in the first set of data.
y = values in the second set of data.
n = total number of values.
In the case of a line y = mx + b, the error at a point (xn, yn) is yn − (mxn + b). Intuitively, this is the vertical difference between the data point and the point on the line at xn. The squared error of the line is the sum of the squares of all these errors: SE = Σ (yi − (mxi + b))². To find the best-fit line, you want to minimize this squared error.
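scikit-learn exposes each of the regression metrics listed above; a minimal sketch with made-up observed and predicted values:

import numpy as np
from sklearn.metrics import (mean_absolute_error, median_absolute_error,
                             mean_squared_error, r2_score)

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])   # observed values (invented)
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.5])   # model predictions (invented)

print("MAE  :", mean_absolute_error(y_true, y_pred))
print("MedAE:", median_absolute_error(y_true, y_pred))
print("RMSE :", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R^2  :", r2_score(y_true, y_pred))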
Evaluating classification models
Significant quantities:
• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply the concepts in the following activity.
• Write the following activities using Python code.
Notes:
• Where h is the model, D is the predictions made by the model, L(h) is the number of bits required to represent the model, and L(D | h) is the number of bits required to represent the predictions from the model on the training dataset.
• The score as defined above, MDL = L(h) + L(D | h), is minimized; e.g., the model with the lowest MDL is selected.
• The number of bits required to encode (D | h) and the number of bits required to encode (h) can be calculated as the negative log-likelihood. For example: MDL = −log(P(theta)) − log(P(y | X, theta)).
• That is, the negative log-likelihood of the model parameters (theta) plus the negative log-likelihood of the target values (y) given the input values (X) and the model parameters (theta).
Notes:
Model selection: We need to assess the performance of the models and choose the best one based on different factors. We cannot rely only on minimizing the cost function of the model hypothesis, because that can lead to overfitting. A good approach is to split the data into a training set and a test set (e.g., a 70%/30% division). You then train your model on the training set and evaluate it on the test set to see how it performs.
You should also calculate the validation error, not only the test error. Validation is mostly done to tune hyperparameters: you should not tune them on the training set, because that may result in overfitting, and you should not tune them on the test set, because that results in an overly optimistic estimate of generalization. Therefore, we keep a separate dataset, the validation set, for tuning hyperparameters.
• If your model does not fit well, you can use these errors to determine what kind of problem you have.
• If your training error is high, along with a high validation/test set error, you have a high-bias (underfitting) problem.
• If you have a small training error but a significantly larger validation and test error, you have a high-variance (overfitting) problem.
• k-fold cross-validation (better for small datasets):
- The training set is divided into k folds.
- Iteratively train on k − 1 folds and validate on the remaining fold.
- The performance is averaged over the folds. "Leave-one-out" cross-validation is k-fold cross-validation with k = n (n is the number of data points).
• Bootstrapping:
- New datasets are generated from the original dataset by sampling with replacement (uniformly at random).
- Train on the bootstrapped dataset and validate on the unselected data points.
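A hedged sketch of k-fold cross-validation with scikit-learn follows, again using the bundled iris data as a stand-in for course data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat, average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold scores  :", scores)
print("mean accuracy:", scores.mean())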
Figure: AUC-ROC
Source: https://images.app.goo.gl/iMXwj7jLYADko1uCA
Notes:
Area under the curve (AUC)
This method is suitable for binary and multi-label classification. For binary classification, you select a cut-off above which a sample is assigned to one class and below which it is assigned to the other class. You can achieve different results depending on the cut-off: there is a tradeoff between the true and false positive rates.
You can draw the receiver operating characteristic (ROC) curve with P(TP) on its y-axis and P(FP) on its x-axis. Every point on the curve corresponds to a cut-off value. That is, the ROC curve shows the classifier's performance across the entire range of cut-offs, while other metrics (e.g., the F-score) tell you the result for only one cut-off. The ROC curve is produced by sweeping over all cut-off thresholds. By displaying all thresholds at once, you get a more complete and truthful summary of how well your classifier works. It is also not sensitive to imbalance in the class distribution of the data.
To quantify how effective the classification algorithm is, the area under the curve (AUC) is used. An AUC of more than 0.8 is commonly regarded as "healthy". An AUC of 0.5 is equivalent to random guessing: a straight diagonal line.
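A minimal ROC/AUC sketch with scikit-learn, using predicted probabilities on a synthetic binary problem (the dataset is generated, not course data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]          # P(class 1) for each test sample

fpr, tpr, thresholds = roc_curve(y_te, proba)  # one (FPR, TPR) point per cut-off
print("AUC:", roc_auc_score(y_te, proba))      # > 0.8 is commonly regarded as good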
Matrices of uncertainty (confusion matrices): This method is suitable for binary or multi-class classification. Evaluation is often presented as a confusion matrix. The core values are:
• True positives (TP): Positive samples that have been classified as positive.
• True negatives (TN): Negative samples that have been classified as negative.
• False positives (FP): Negative samples that have been classified as positive.
• False negatives (FN): Positive samples that have been classified as negative.
Additional values:
• Positive predictive value (PPV): Like precision, but it takes the prevalence into account. With a fully balanced dataset (i.e., equal positive and negative cases, prevalence 0.5), the PPV equals the precision.
• Null error rate: How often you would be wrong if you always predicted the positive class. This is a good baseline against which to compare your classifier.
• F-score: The weighted average of recall and precision.
• Cohen's kappa: The kappa score is high if there is a large difference between the accuracy of the classifier and the null error rate.
Remember the convention that the rare class is labeled 1 and the common class is labeled 0; that is, we seek to predict the uncommon class.
You may want to use precision/recall as your evaluation metric instead.
1T/0T indicate the true class, and 1P/0P indicate the predicted class.
Precision is the number of true positives over the total number predicted positive. That is, what fraction of the examples labeled positive actually are positive?
Recall is the number of true positives over the number of actual positives. In other words, what fraction of the positive examples in the data were found?
If you are very lax, you may label almost any example as 1. You may then move the threshold up to, say, 0.9 to make the classification stricter. In this scenario you gain precision but lose recall, because some of the more ambiguous positive examples no longer make the cut.
On the other hand, to eliminate false negatives you may want to lower the threshold, in which case recall increases but precision decreases.
So what is the most efficient way to compare precision and recall values between algorithms? You can condense precision and recall into one metric: the F1 score (the harmonic mean of precision and recall, also known as the F-score).
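Confusion matrix, precision, recall, and F1 can be computed with scikit-learn; the labels below are invented for illustration.

from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual classes (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # predicted classes (invented)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two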
Although more data does not always help, it generally does. Many algorithms perform much better as more and more data is collected, and with enough training data even fairly simple algorithms can do well. Here are some things to try if the algorithm is not performing well:
• Obtain more training examples (can help with high-variance problems).
• Try smaller feature sets (can help with high-variance problems).
• Try adding polynomial features (can help with high-bias problems).
• Try decreasing the regularization parameter λ (can help with high-bias problems).
• Add additional features (can help with high-bias problems).
• Try increasing the regularization parameter λ (can help with high-variance problems).
Matrices of uncertainty (confusion matrices)
• False positive rate: How often are negative-labeled samples predicted as positive?
• Specificity (or "true negative rate"): How often are negative-labeled samples predicted as
negative?
• Precision: How many of the predicted positive samples are correctly predicted?
• Prevalence: How many labeled-positive samples are there in the data?
Notes:
Log loss: This approach is suitable for binary, multi-class, and multi-label classification. Log loss is an accuracy metric that can be used when the classifier outputs a probability rather than a class label. For instance, if the classifier predicts class 1 with probability 0.51 and the true class is 0, that is less "wrong" than if it had predicted class 1 with probability 0.95. Log loss penalizes the classifier according to this distance.
Rate for F1 (F1 score)
• The F1 score, also known as the balanced F-score or F-measure, is the weighted average of precision and recall.
Figure: F1 score
Notes:
Rate for F1 (F1 score): The best score is 1 and the worst score is 0. It can be used for binary, multi-class, and multi-label classification (the latter two use the weighted average of the per-class F1 scores).
• Metric selection is more complex for imbalanced classes (or strongly skewed data).
• For example, you have a dataset with only 0.5% of the data in category 1.
• You run your experiment and observe that 99.5 percent of the examples are correctly classified.
Notes:
Metric selection
Simply by exploiting the skew in the data, a model can label every example as category 0 and still achieve this accuracy.
Hyperparameter selection (1 of 2)
Notes:
Hyperparameter selection: Hyperparameter tuning is often considered an art, as it does not follow an exact, principled optimization process. There are, however, automated methods, including:
• Grid search.
• Random search.
• Evolutionary algorithms.
• Bayesian optimization.
Grid search: Simply try various hyperparameter combinations to see which combination is best. In general, hyperparameters are searched over specific intervals or scales, depending on the parameter; this can be 10, 20, 30, ... or 1e-5, 1e-4, 1e-3, and so on. Grid search parallelizes easily, but it is brute force.
Random search: Surprisingly, randomly sampling combinations from the whole grid works nearly as well as scanning the entire grid, in far less time.
Intuitively: a single random hyperparameter combination has a 5 percent probability of landing in the top 5 percent of all combinations. If we want to find such a combination 95% of the time, we need to try many random combinations. If we take n random combinations, the probability that none of them is in the top 5% is (1 − 0.05)^n, so the chance that at least one is in the top 5% is 1 − (1 − 0.05)^n. When we want this to happen 95% of the time, we require 1 − (1 − 0.05)^n ≥ 0.95, which gives n ≈ 60. Therefore, with about 60 random hyperparameter combinations we have a 95% chance that at least one of them is in the top 5% of all such combinations.
Notes:
Optimization of Bayesian hyperparameters
We can use Bayesian optimization to choose promising hyperparameters. We evaluate a few sampled hyperparameter settings and use the scores as observations to update a posterior distribution, typically modeled with a Gaussian process. We then pick the next hyperparameters to evaluate by maximizing a utility function built from the posterior, such as the expected improvement over the best result so far or a Gaussian-process upper confidence bound (UCB).
Basic idea: model the performance of the algorithm as a smooth function of the hyperparameters in order to find where that function is highest. This is faster than grid search because it reasons about where the ideal set of hyperparameters may lie rather than brute-force searching the entire space. One caveat is that evaluating a single hyperparameter sample can be very costly (for example, it may require training a large neural network). We use a Gaussian process because its marginals and conditionals can be computed in closed form.
The problem with high dimensionality
• The dimension of a problem refers to the number of input variables (actually, degrees of
freedom).
• The exponential increase in data required to densely populate space as the dimension
increases.
Notes:
The problem with high dimensionality
Choosing a particular machine learning algorithm, e.g., logistic regression or a support vector machine (SVM), matters less than it may seem: an algorithm may be best for a specific problem, yet its average performance suffers when the data is overfit or underfit. High dimensionality also becomes a disadvantage when one attempts rejection sampling: with a higher-dimensional probability distribution, locating a good enveloping distribution is more complicated, because the acceptance probability diminishes with dimensionality.
Suppose you have a straight line 100 yards long and you lost a penny somewhere on it. Finding it would not be that tough: you walk down the path, and it takes two minutes. Now suppose you have a square 100 yards on each side and you lost a penny somewhere on it. It would be quite hard to find, like searching across two football fields stuck together; it could take days. Now a cube 100 yards on each side: that is like searching a 30-story building the size of a football stadium. The difficulty of searching through the space gets far worse as you add more dimensions. You do not notice this intuitively when it is stated only in mathematical formulas, because in the formulas every dimension looks the same. This is the curse of dimensionality. It gets a name because it is clunky, important, and yet simple.
Notes:
Information theory: Information theory is a subfield of mathematics concerned with communicating data over a noisy channel. A cornerstone of information theory is quantifying exactly how much information a message contains. More generally, the information in an event, or in a random variable, can be quantified using probability, in a measure called entropy. Estimating information and entropy is an important approach in machine learning and serves as the basis for techniques such as feature selection, decision trees, and, in broader terms, fitting classification models. A machine learning practitioner therefore needs a good understanding of information and entropy.
Calculate the information for an event: Quantifying information is the foundation of the field of information theory. The intuition behind quantifying information is measuring how much surprise an event carries. Events that are rare (low probability) are more surprising and carry more information than events that are common (high probability).
• Low-probability event: high information (surprising).
• High-probability event: low information (unsurprising).
We can see the expected trend: low-probability events are more surprising and contain more information, while high-probability events contain less. We can also see that this relationship is not linear but in fact sub-linear, which makes sense when using the log form: the information in an event x is h(x) = −log(p(x)).
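The surprise of an event, h(x) = −log2 p(x), can be computed directly; a small sketch comparing a rare and a common event, plus the entropy of a random variable as the expected information over its outcomes:

from math import log2

def information(p):
    """Information content (surprise) of an event with probability p, in bits."""
    return -log2(p)

print(information(0.1))   # rare event:   ~3.32 bits (surprising)
print(information(0.9))   # common event: ~0.15 bits (unsurprising)

# Entropy = expected information over a variable's outcomes.
probs = [0.5, 0.25, 0.25]
entropy = sum(p * information(p) for p in probs)
print(entropy)            # 1.5 bits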
• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply the concepts in the following activity.
• Write the following activities using Python code.
Notes:
3. The use of nonlinear units in the feedback layer of a competitive network leads to which concept?
a) Feature mapping
b) Pattern storage
c) Pattern classification
d) None of the mentioned
Notes:
Write your answers here:
1.
2.
3.
True or False:
1. Should a pattern recognition model capture the characteristics of the system from given input-output pairs? True/False
2. Can a system be both interpolative and accretive at the same time? True/False
3. Does pattern classification belong to the category of unsupervised learning? True/False
Notes:
Write your answers here:
Fill in the blanks:
1.
2.
3.
4.
True or false:
1.
2.
3.
Notes:
Notes:
Unit summary is as stated above.
© Copyright IBM Corp. 2020
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.