Opentext Demystify Data Science White Paper
Machine learning
Supervised learning
Unsupervised learning
Resources
This paper will clarify some key definitions around artificial intelligence and
machine learning. It will also simplify some common techniques in machine learning,
such as supervised learning, natural language processing and classification, and
identify the types of business questions these techniques can answer.
Finally, this paper will help define meaningful and high value use-cases with
a structured framework to gather and align business, technology and data
requirements for a successful artificial intelligence implementation.
Deep learning
Popular and powerful set of machine learning techniques, which mimic the brain’s
neuron activities, called neural networks.
Machine learning
Field of AI that learns from historical data towards an end goal/outcome. For
example, the customers likely to default on their home loan.
Artificial intelligence
Computing systems capable of performing tasks that humans are very good at, such
as recognizing objects, recognizing and making sense of speech, and self-driving cars.
Source: https://www.kdnuggets.com/2018/11/an-introduction-ai.html
Machine learning, a subset of artificial intelligence, enables users to learn from
historical data to achieve a desired outcome. It powers targeted ads, personalized
content, song recommendations, predictive maintenance activities, virtual
assistants and more.
Machine learning can be broken down into two key phases, learning and predicting.
In the learning phase, certain statistical techniques or algorithms are applied to
historical data and/or previous business outcomes to generate a machine learning
model. A model can be thought of as a set of rules or instructions, such as steps in
a recipe, that one must follow to make a business decision.
For example, in order to approve a loan application, a loan officer will consider
income, age, net worth and many other factors before making a final decision.
Each attribute of the application is a rule or factor that the officer must evaluate
to approve or reject the loan. Machine learning techniques follow a similar
methodology, comparing various attributes, historical decisions and the outcome
of similar applicants to estimate the creditworthiness of the new applicant.
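The two phases can be illustrated with a minimal sketch. It assumes a deliberately simplified setting in which income is the only attribute and the model is a single cut-off; the data, `learn_threshold` and `predict` are hypothetical names chosen for illustration, not part of any product.

```python
# Hypothetical historical loan applications: (annual income in thousands, repaid?)
HISTORY = [(20, False), (28, False), (35, False), (42, True), (55, True), (70, True)]

def learn_threshold(history):
    """Learning phase: pick the income cut-off that best matches past outcomes."""
    incomes = sorted(income for income, _ in history)
    # Candidate cut-offs are the midpoints between neighbouring incomes.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        # Count applicants that the rule "approve if income >= threshold" gets wrong.
        return sum((income >= threshold) != repaid for income, repaid in history)

    return min(candidates, key=errors)

def predict(threshold, income):
    """Predicting phase: apply the learned rule to a new applicant."""
    return income >= threshold

model = learn_threshold(HISTORY)
print(model, predict(model, 60), predict(model, 30))
```

Here the "model" is just one number learned from history. A real model would weigh many attributes at once, but the split between learning from past outcomes and predicting for a new applicant is the same.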
An algorithm is a step-by-step instruction set or formula for solving a problem
or completing a task. Each part of an algorithm can be compared with a recipe:
Task
  Algorithm: a step-by-step instruction set or formula for solving a problem or
  completing a task
  Recipe: 1. Take the chicken out 2. Salt and season 3. Bake it
Main objective
  Algorithm: minimize errors or some sort of “loss function” to attain the best
  approach to solve a task
  Recipe: minimize the number of things/steps needed to take in order to serve
  the dish
Insight/result
  Algorithm: the algorithm learns from its mistakes/errors, finds the best
  approach and generates insights and rules that can be used to make predictions
  Recipe: learn from your mistakes the next time you attempt the recipe
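One way to picture "minimizing a loss function" is gradient descent on a one-parameter model. The sketch below is an illustration under stated assumptions: the model is y ≈ w * x, the loss is the mean squared error, and the data, learning rate and function names are hypothetical.

```python
def mean_squared_error(w, xs, ys):
    """Loss function: average squared gap between predictions and outcomes."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Fit y ≈ w * x by repeatedly stepping against the gradient of the loss."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # "Learning from mistakes": move w in the direction that reduces the error.
        w -= learning_rate * gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = fit_slope(xs, ys)
print(round(w, 4), mean_squared_error(w, xs, ys))
```

Each pass measures how wrong the current rule is and nudges it toward a smaller error, which is the algorithmic counterpart of adjusting a recipe after each attempt.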
With the growth of data, the invention of advanced algorithms and cheaper
commodity hardware to process big data at scale, deep learning, a powerful set
of machine learning techniques, has become prominent in the industry. Deep
learning techniques mimic the brain’s neuron activities, which is why they are also
referred to as neural networks. Some common applications include natural language
processing, image recognition, and realistic photo and video generation.
Data analytics is the science of analyzing raw data to draw conclusions from that
information. Data analytics techniques can reveal trends and metrics that would
otherwise be lost in a mass of information. This information can then be utilized
to optimize processes to increase the overall efficiency of a business or system.
Data analytics techniques can be broken down into four main types based on the
difficulty of analysis and business value.
a. Descriptive analytics parses raw historical data and draws conclusions that help
managers, investors and others determine why business changes occurred.
b. Diagnostic analytics provides an understanding of why events took place
by examining data. A type of advanced analytics, its techniques include data
discovery and mining, correlation analysis and drill-down.
c. Predictive analytics uses statistics and modeling to predict future behavior. Using
data patterns, predictive analytics identifies when patterns are likely to reoccur
to identify and prevent potential risks, take advantage of future opportunities or
advantageously reallocate resources.
d. Prescriptive analytics uses machine learning to analyze raw data to help
organizations make better decisions and take a proper course of action. Factoring
in possible scenarios, available resources, past performance and current
performance, prescriptive analytics helps determine the best course of action in
a situation.
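The four types can be contrasted on one small data set. The sketch below is only an illustration: the monthly figures are hypothetical, the function names are invented for this example, and a least-squares line stands in for whatever statistical or machine learning model an organization would actually use.

```python
from statistics import mean

# Hypothetical monthly figures: advertising spend and units sold.
AD_SPEND = [10, 12, 15, 18, 20]
UNITS_SOLD = [100, 120, 150, 180, 200]

def describe(values):
    """Descriptive: summarize the historical data."""
    return {"total": sum(values), "average": mean(values)}

def correlation(xs, ys):
    """Diagnostic: how strongly two series move together (Pearson r)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def forecast(xs, ys, new_x):
    """Predictive: least-squares line through the history, applied to new_x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my + slope * (new_x - mx)

def best_spend(xs, ys, options, unit_margin):
    """Prescriptive: pick the spend option with the highest predicted profit."""
    return max(options, key=lambda s: forecast(xs, ys, s) * unit_margin - s)

print(describe(UNITS_SOLD))
print(correlation(AD_SPEND, UNITS_SOLD))
print(forecast(AD_SPEND, UNITS_SOLD, 25))
print(best_spend(AD_SPEND, UNITS_SOLD, [10, 20, 30], unit_margin=2))
```

Each function builds on the previous one, which mirrors how both difficulty and business value rise from descriptive to prescriptive analytics.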
Figure: data science pyramid (axis: Difficulty). Levels shown include:
BI or analytics: Metrics, segmentation, aggregation, data labelling
Explore and transform: Data preparation, cleaning and exploratory data analysis
Data flow: Infrastructure, pipelines, ETL, structured and unstructured data storage
Data collection: External data, logging, sensors, user generated content
Data collection. At the bottom of the pyramid is data collection. At this stage, the
goal is to identify what data is needed and what is available. If it is a user-facing
product, are all relevant interactions logged? If it is a sensor, what data is coming
through and how? Without data, no machine learning or AI solution can learn or
predict outcomes.
Data flow. Identify how the data flows through the system. Is there a reliable
stream/ETL process established? Where is the data stored, and how easy is it to
access and analyze?
Explore and transform. Only when data is accessible can it be explored and
transformed for modelling. This stage is one of the most time-consuming and
underestimated of the data science project lifecycle. It is at this stage that teams
and organizations realize that they are missing data, their machine sensors are
unreliable, they are not tracking relevant information about customers and other
key issues. It forces them to return to data collection and ensure the foundation is
solid before moving forward.
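A quick audit of missing values is one way such gaps surface early. The following is a minimal sketch, assuming records arrive as dictionaries and that a missing key or `None` marks an absent value; the field names and readings are hypothetical.

```python
def missing_rates(records, fields):
    """Return the share of records in which each field is absent or None.

    Assumes `records` is a non-empty list of dictionaries.
    """
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in fields
    }

# Hypothetical sensor readings; None or a missing key marks a dropped measurement.
readings = [
    {"sensor_id": "A", "temperature": 71.2, "vibration": 0.02},
    {"sensor_id": "A", "temperature": None, "vibration": 0.03},
    {"sensor_id": "B", "temperature": 69.8},
    {"sensor_id": "B", "temperature": 70.1, "vibration": None},
]
rates = missing_rates(readings, ["sensor_id", "temperature", "vibration"])
print(rates)
```

A field with a high missing rate is a signal to revisit data collection before any modelling starts.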
Machine learning and benchmarking. Although there is sample data that can be
used to make predictions, the work is not complete. An A/B testing or experimentation
framework needs to be in place to deploy models incrementally and avoid real-world
disasters. Model validation and experimentation approaches provide a
rough estimate of the effects of changes before practical implementation. At
this stage, a very simple baseline or benchmark for performance tracking should
be established. In a fraud detection system, for example, this could mean monitoring
high-risk credit card transactions that were proved to be fraudulent and comparing
them with how accurately the current machine learning models detect fraud.
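A baseline comparison of this kind can be sketched in a few lines. The example assumes precision and recall as the tracking metrics, a fixed amount threshold as the simple baseline, and a hard-coded list standing in for a model's output; all values and names are hypothetical.

```python
def precision_recall(predicted, actual):
    """Compare flagged transactions with those later confirmed as fraud."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    flagged = sum(predicted)
    fraud = sum(actual)
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / fraud if fraud else 0.0
    return precision, recall

# Hypothetical transactions: amount and whether fraud was later confirmed.
amounts = [20, 950, 40, 1200, 15, 700, 3000, 60]
confirmed = [False, True, False, True, False, False, True, False]

# Baseline: a very simple rule that flags every transaction over 500.
baseline_flags = [amount > 500 for amount in amounts]
# Stand-in for the output of a machine learning model (assumed, not computed here).
model_flags = [False, True, False, True, False, False, True, False]

baseline_score = precision_recall(baseline_flags, confirmed)
model_score = precision_recall(model_flags, confirmed)
print("baseline:", baseline_score, "model:", model_score)
```

Tracking both numbers against the same confirmed outcomes shows whether a model actually improves on the simple rule before it is rolled out more widely.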
Convert massive amounts of big data into meaningful and actionable insights. Voice
assistants, autopilot features and smart home devices have become part of
day-to-day life. This new class of AI-driven products is powered by machine learning
and advanced analytics techniques, allowing organizations and teams to better
understand consumer needs and wants, feature requests and usage patterns.
Machine learning offers various approaches to solve business problems. The first
approach is based on whether there is data related to the outcome of a process.
Did the machine stop working? Did the customer leave? Did the employee quit?
It is important to understand and model how behavior and fluctuations in data
lead to a certain business outcome. This type of machine learning is known as
supervised learning.
Supervised learning
Supervised learning can be broken down into two categories based on what it is
trying to predict.
• Will this customer default in the next month, six months or year?
• What should be the price of a property based on size, number of rooms and location?
• How many orders am I likely to receive in the next three months for my product?
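The questions above differ in what is predicted: a category (will the customer default?) or a number (a price, an order count). These two cases are commonly called classification and regression; that labelling is an assumption here, since this section does not name the categories. The sketch uses a simple nearest-neighbour approach with hypothetical data and function names.

```python
from collections import Counter

def nearest(train, query, k):
    """Return the k training rows whose feature is closest to the query."""
    return sorted(train, key=lambda row: abs(row[0] - query))[:k]

def classify(train, query, k=3):
    """Classification: predict a category by majority vote of the neighbours."""
    labels = [label for _, label in nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(train, query, k=3):
    """Regression: predict a number by averaging the neighbours' values."""
    values = [value for _, value in nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical data: (debt-to-income ratio, outcome) and (size in square metres, price).
loans = [(0.1, "repaid"), (0.2, "repaid"), (0.3, "repaid"),
         (0.6, "default"), (0.7, "default"), (0.8, "default")]
homes = [(50, 150_000), (60, 180_000), (80, 240_000), (100, 300_000)]

print(classify(loans, 0.75))
print(regress(homes, 70, k=2))
```

Both functions learn from labelled history in the same way; only the form of the answer changes.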
There are multiple unsupervised learning approaches and techniques that can
be utilized to gain meaningful insights. One of the more popular techniques is
clustering, which groups things that are similar or have features in common.
Organizations use clustering techniques to answer business questions, such as:
• How many distinct customer groups exist for my products? Who belongs to
which group?
• To which customer subgroups should I market my product and how should I target
them? What are the key characteristics of each group?
• Given the huge number of container shipments arriving at a country’s ports every
day, which should be opened by customs to prevent smuggling, terrorism, etc.?
• Given a log of all the traffic on a computer network, which sessions represent
attempted intrusions?
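The customer-grouping questions can be illustrated with a small clustering sketch. It assumes k-means (Lloyd's algorithm) as the technique, which is one common choice among several, with fixed starting centroids so the result is repeatable; the customer data and names are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Group 2-D points around the given starting centroids (Lloyd's algorithm)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (orders per year, average basket value).
customers = [(2, 20), (3, 25), (4, 22), (20, 90), (22, 95), (24, 100)]
centroids, clusters = k_means(customers, centroids=[(0, 0), (30, 100)])
print(centroids)
print(clusters)
```

The resulting clusters answer "who belongs to which group", and the centroids summarize the key characteristics of each group.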
Figure: unsupervised learning techniques: clustering, anomaly detection, association mining
Clustering: Partition the data set into X groups so that records in the same group
are similar to each other, and
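The network intrusion question is an anomaly detection problem. A minimal sketch follows, assuming a z-score rule (distance from the mean measured in standard deviations) as the detection method; the session sizes, the threshold and the function name are hypothetical.

```python
from statistics import mean, pstdev

def anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical data volume per network session, in MB.
sessions = [12, 15, 11, 14, 13, 12, 16, 14, 13, 250]
print(anomalies(sessions))
```

No labelled examples of intrusions are needed. The rule only flags sessions that look unlike the rest, which is what makes the approach unsupervised.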
1. Business knowledge
• What is the current business process? How are things done currently? Does
someone manually identify which products to recommend to each customer?
Does someone manually review each loan application for fraud or risk? Does an
engineer manually inspect all machinery each week for failure? Be as specific and
detailed as possible in defining the current process.
2. Solution vision
• Why is it important to solve the current business problem/use-case? Define
what success would look like. Specifically, in order to execute a successful
project, what are the minimum requirements and success criteria?
• How is the ROI of AI and analytics measured? Is there any current method to
track/benchmark the performance of current business processes and outcomes?
3. Data adequacy
• What data is available? Is it structured or unstructured? Is there data relevant to
answering the business problem? Example: Operational data is required to predict
when a machine will fail.
• How much data does the organization have? Where is the data stored?
OpenText™ Magellan™ product overview
OpenText AI white paper
OpenText™ Magellan™ infographic
Section 2: Solution vision
Section 3: Data adequacy
• Identify what success means and what the end solution will look like at the start.
About OpenText
OpenText, The Information Company, enables organizations to gain insight through
market leading information management solutions, on-premises or in the cloud. For
more information about OpenText (NASDAQ: OTEX, TSX: OTEX) visit: opentext.com.
opentext.com/contact
Copyright © 2021 Open Text. All Rights Reserved. Trademarks owned by Open Text.
For more information, visit: https://www.opentext.com/about/copyright-information • (23.04.21)17629.EN#