Machine Learning
Welcome to the MapR Academy course: Introduction to Artificial Intelligence and Machine
Learning.
We'll start with a brief review of artificial intelligence and machine learning, often referred to as AI and ML. Next, we'll look at the first steps in creating a machine learning project plan.
Preparation is crucial to establishing a strong foundation for the project. This includes identifying
a clear business need and plan, a discussion on data management, and careful feature selection
and engineering.
Artificial intelligence, and specifically machine learning, is our future, and it's happening right
now.
Machine learning is the self-driving car. It is the device you talk to in order to control your in-home systems, or just to get directions on your phone. It provides real-time language translation services. It is the tool that recommends the perfect movie to add to your collection, and then tells you when it's on sale. It can be the robot or android that will eventually order your groceries, then prepare and serve you dinner, all tailored to your particular health goals.
Machine learning is all these things and more, and it is changing our world and the way we live
and do business every day.
Before we get into the finer details of artificial intelligence and machine learning, let's see how they fit into the larger world of data science. Because this field is rapidly changing, people may be confused about, or disagree on, the overall landscape and the terms used in the industry, so let's clarify how we will define these in this course.
Think of a series of Russian nesting dolls, but now imagine these as futuristic robot dolls, like
these. With this analogy, we can have multiple robot dolls nested within each level.
With these nesting robot dolls, version 2.0, the largest doll represents the entire field of data science. The second doll represents artificial intelligence, and the next doll, machine learning. A fourth doll, for deep learning, can also be nested within the machine learning doll, but we won't be going into much depth on this topic in this course.
Both artificial intelligence and machine learning nest under the largest doll of data science,
whose purpose is to extract insights from data. Data science analyzes large amounts of data to
deliver value and give businesses a competitive edge across all industries. As an example, retail
businesses analyze buyer habits to better target recommendations and promotions to their
customers.
In the growing world of big data, it is important to have an effective data science strategy to help make informed business decisions. All these fields have become more prominent as demand has grown for more efficient ways to extract value from data at scale.
Using artificial intelligence to accomplish these goals is a natural outgrowth of the big data
movement.
Artificial intelligence describes a machine that is capable of imitating and performing intelligent
human behavior. Some of these tasks could include problem solving and decision making, or
specific activities requiring acute perception, recognition, or translation abilities.
Remember from our robot nesting doll example that machine learning is nested within artificial intelligence; therefore, all machine learning counts as AI, but not all AI counts as machine learning. Other robot dolls within classic AI, outside of machine learning, include expert systems based on symbolic artificial intelligence, and AI Planning.
Symbolic artificial intelligence is one of the first approaches to AI, based on the assertion that human intelligence can be achieved through the manipulation of symbols. This is the basis for physical symbol systems, also called formal systems, which center around three basic concepts that follow human thinking abilities:
Symbols, like the plus sign formed by the joining of two perpendicular lines, are first encoded in our brains.
Thoughts are the structures, or expressions: the plus sign means to add things together.
The manipulation process is the act of thinking, or applying the symbol and its structure together, as when we use the plus sign in a mathematical equation: one plus two equals three.
To understand this more clearly, let's take a look at some specific examples next. Physical symbol system examples include algebra, formal logic, and even chess.
In algebra, the numbers and the mathematical operators, such as plus, times, and equals, are the symbols. The equations and formulas are the expressions, and the calculated result is the manipulated expression.
With formal logic problems, words like "if," "or," and "not" are the symbols. The structures are true or false statements, and the manipulation process is the application of the rules of logical deduction, which results in the final expression. So, we could say: "If a primarily healthy adult has a fever, then they may have the flu."
And in games, such as chess, the defined number of pieces are the symbols and the legal chess
moves are the structures. The manipulated expressions are the resulting positions of the pieces
on the board after each move.
This AI approach states that machines are capable of mimicking this behavior. Though interest in
this approach has faded over time, it led to the development of expert systems, which are widely
considered to be one of the first successful forms of artificial intelligence.
An expert system is a computer system that is designed to solve problems by imitating the
decision making abilities of a human expert. This system uses two subsystems: a knowledge base
and an inference engine.
Input data is presented to an expert system for training. This data is then reasoned over using production, or If-Then, rules. Together, the data and the production rules create the knowledge base of an expert system.
The inference engine applies the rules to the data and facts in the knowledge base, and then
deduces new facts from it.
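To make this concrete, here is a minimal sketch of these two subsystems in Python. The facts, rules, and function names are invented for illustration, not taken from any particular expert system shell:

```python
# Toy expert system: a knowledge base of facts plus If-Then production
# rules, and an inference engine that applies the rules until no new
# facts can be deduced (forward chaining). All facts/rules are invented.

facts = {"has_fever", "has_cough"}

# Each rule: if all conditions are in the fact base, deduce the conclusion.
rules = [
    ({"has_fever", "has_cough"}, "may_have_flu"),
    ({"may_have_flu"}, "recommend_rest"),
]

def infer(facts, rules):
    """Inference engine: repeatedly apply rules to deduce new facts."""
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)   # a newly deduced fact
                changed = True
    return facts

print(infer(facts, rules))   # all four facts, including the two deduced ones
```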
Automated planning and scheduling, also known as AI Planning, is another branch of classic AI. AI Planning can be done in known environments, and it describes a system that coordinates strategies or action sequences from an initial state in order to achieve a specified goal state. The actions may be executed by autonomous robots, intelligent agents, unmanned vehicles, or a combination of these.
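To make this concrete, a classical planner can be sketched as a search from the initial state to a state that satisfies the goal. This toy Python example uses an invented two-action domain; real planners use far richer representations and heuristics:

```python
# Minimal sketch of classical planning as state-space search: find an
# action sequence from an initial state to a goal state. The domain
# (states and actions) is invented for illustration.
from collections import deque

# Each action: name -> (preconditions, add effects, delete effects)
actions = {
    "pick_up":      ({"arm_empty", "box_on_floor"}, {"holding_box"},
                     {"arm_empty", "box_on_floor"}),
    "put_on_shelf": ({"holding_box"}, {"box_on_shelf", "arm_empty"},
                     {"holding_box"}),
}

def plan(initial, goal):
    """Breadth-first search over states; returns a list of action names."""
    frontier = deque([(frozenset(initial), [])])
    visited = {frozenset(initial)}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # goal state reached
            return steps
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                   # action is applicable
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

print(plan({"arm_empty", "box_on_floor"}, {"box_on_shelf"}))
# ['pick_up', 'put_on_shelf']
```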
This field covers such a wide variety of project scopes, with varying complexity, that the level of programming effort and the human resources required for AI Planning became too much for most organizations to support.
Today, machine learning has taken over this field, as it offers a much more agile approach, which
we will soon discuss.
The broader field of AI thus includes both piles of if-then statements, as with the simple rule-based expert systems used in classic AI, and more complex statistical models that use learning algorithms to generate predictions.
Then there is also the Hollywood version of AI: super-fancy computer systems, specialized
robots, and advanced androids.
It may all seem a bit campy, but every day we get closer and closer to this type of reality. Mainly,
this is because we are now teaching machines how to learn, and grow, on their own.
Now that we know about artificial intelligence, how about machine learning? This is where AI
really starts to get interesting.
Machine learning describes machines that are taught to learn and make decisions by examining
large amounts of input data. It makes calculated suggestions and/or predictions based on
analyzing this information and performs tasks that are considered to require human intelligence.
This includes activities like speech recognition, translation, visual perception, and more.
The field of machine learning also encompasses the area of deep learning.
The key difference here between machine learning and artificial intelligence is the term "learning."
Machines learn and provide intelligent insights through a sophisticated use of learning
algorithms.
To provide business value, the machine is trained to learn patterns from data, and can then proceed autonomously on new and changing data. This creates a dynamic feedback loop that allows it to efficiently generate more models and gain further, more accurate insights, without requiring additional resources or human interaction.
With continuous advancement in this field, machines are becoming increasingly self-healing,
self-organizing, and self-architecting, seamlessly producing greater value for businesses.
So, what is deep learning? Built on artificial neural networks, deep learning is one of the most talked about subareas of machine learning.
Deep learning performs machine learning in a hierarchy of layers, where the output of decisions from one layer feeds into the next layer. This model is loosely patterned after the brain's neural networks and has been setting new records of accuracy when applied to sound and image recognition.
The term "deep" refers to the number of layers in a network; some networks go deeper than others by using many layers rather than just one.
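Even at this high level, the layered idea can be sketched in a few lines of Python. This toy example uses numpy (an assumption) with random, untrained weights, purely to show each layer's output feeding the next; a real network would learn its weights from data:

```python
# Minimal sketch of the layered idea behind deep learning: the output
# of one layer feeds into the next. Weights are random and untrained;
# shapes and layer sizes are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)   # a common layer activation

x = rng.normal(size=(1, 4))   # one input example with 4 features

# Two hidden layers and an output layer.
w1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=(8, 8))
w3 = rng.normal(size=(8, 2))

h1 = relu(x @ w1)     # layer 1
h2 = relu(h1 @ w2)    # layer 2 consumes layer 1's output
out = h2 @ w3         # output layer: e.g., scores for two classes
print(out)
```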
We won't be going into any detail regarding deep learning techniques in this lesson, but a new
business course on this topic will be joining this series soon.
Information is now created, collected, and analyzed faster than ever, and from more diverse and distributed sources. These include traditional web data, mobile devices, streaming media, and IoT sensors, both on premises and in the cloud.
The increased volumes and types of data, along with improvements in hardware and big data processing, are a relatively new development.
Machine learning offers businesses the ability to save significant amounts of data processing and manual work, freeing up professionals to use their expertise more efficiently. Using sophisticated learning algorithms, it can also surface new patterns, and potentially missed opportunities, from within the masses of data.
Now, we'll look at the various types of learning methods used in the field of machine learning.
There are a few different types of machine learning, but they generally fall into these main
groups. Supervised and unsupervised learning are the primary learning types, along with semi-
supervised learning.
Other methods sit in the middle or on the outskirts of these methodologies, such as reinforcement learning, which describes a machine that creates a continuous learning loop by training itself on its own results. These results are then fed back into the system as input data. We won't be going into much detail on these other techniques, but further information can be found online.
First, let's understand what differentiates these learning types and how they work.
Supervised learning uses labeled data to train machines to learn the relationships between given
inputs and outputs.
A label is a known description given to objects in the data, which trains the machine on what to
look for. Labels also provide the structure of the algorithm output, as any result must be one of
these labels. Therefore, you can think of labels as a schema defining the possible output that we
want the machine to look for.
Think of this as the algorithm to use when data scientists have labeled input data, and when the
type of behavior to predict is known. We want the machine to learn the patterns used to classify
this data, and apply those patterns to classify new data.
First, labeled or classified data is loaded into the system. The preparation of labeled data makes this the most time-consuming step, as it is often done by a human trainer. The model is then trained, and connections between inputs and outputs are made. As new data is introduced, the algorithm is applied and categorizes the results.
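Here is a minimal sketch of this workflow in Python using scikit-learn, which is an assumed library choice here; any ML library with a similar fit-and-predict interface works the same way. The tiny cat-style dataset is invented for illustration:

```python
# Minimal sketch of the supervised workflow: load labeled data, train,
# then classify new data. Library choice and dataset are illustrative.
from sklearn.tree import DecisionTreeClassifier

# Step 1: labeled data. Features might encode "has ears", "has tail",
# and "has paws"; labels say whether each example is a cat.
X_train = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0]]
y_train = ["cat", "cat", "not cat", "not cat"]

# Step 2: train the model, connecting inputs to outputs.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 3: new data is introduced and the model categorizes it.
print(model.predict([[1, 1, 0]]))    # e.g., ['cat']
```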
In the previous cat example on labeled data, the trained labels include ears, nose, tail, paws, and cat. The algorithm applies these to the presented data, in this case an image of a cat, and returns the known output: yes, this is a "cat."
Supervised learning always has a clear objective and can be easily measured for accuracy. The
training of the machine is also tightly controlled, which leads to very specific behavioral
outcomes.
On the downside, it is often very labor intensive, as all data needs to be labeled before the model is trained, and this can take hundreds of hours of specialized human effort. The costs can become astronomical. This creates an overall slower training process and may also limit the data the model can work with.
Finally, insights may be more limited, as the predicted behavior is described in advance. There is
no freedom for the machine to explore other possibilities, as we will see with unsupervised
learning.
In supervised learning, there are primarily two categories of algorithms: classification and
regression.
A classification algorithm organizes input data as belonging to one of several predefined classes. This algorithm is most useful for providing categorical results that fit within the predefined labels. It is very effective with well-calculated if-then rules, and distinguishes one class of objects from another.
Some common use cases for classification algorithms include credit card fraud detection and
email spam detection, both of which are binary classification problems, meaning there are only
two possible output values. Data is labeled, for example, as fraud/non-fraud or spam/non-spam.
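A hedged sketch of such a binary classifier might look like this in Python with scikit-learn (an assumed library choice); the two transaction features and all values are invented:

```python
# Minimal sketch of binary classification: every prediction must be one
# of the two predefined labels. Features (transaction amount, distance
# from home) and all data are invented for illustration.
from sklearn.linear_model import LogisticRegression

X = [[20, 1], [35, 2], [5000, 800], [15, 3], [4200, 950], [60, 5]]
y = ["non-fraud", "non-fraud", "fraud", "non-fraud", "fraud", "non-fraud"]

clf = LogisticRegression().fit(X, y)

# The output is always one of the two predefined labels.
print(clf.predict([[4800, 700]]))    # likely ['fraud']
```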
Generally, if the question we are asking of a model is open-ended or if the potential answers are
not categorical, then we aren't dealing with a classification problem, but more likely a regression
one.
A regression algorithm attempts to predict the output value given the input value. Regression
problems are predictive of a continuous numerical, as opposed to categorical, result.
Think of this continuous value as a range or average; something that is estimating the
relationship between variables.
For example, this type of algorithm can be used to determine how profitable a credit card model is. It is also used in models predicting customer or employee churn.
Regression algorithms determine the strength of correlation between two attributes, allowing
you to find a predictive range of likelihood.
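For comparison, here is a minimal regression sketch in the same style, predicting a continuous profit figure from an invented monthly-spend feature; the library choice and all numbers are illustrative:

```python
# Minimal sketch of regression: the prediction is a continuous number,
# not a category. The single feature and values are invented.
from sklearn.linear_model import LinearRegression

X = [[500], [1200], [2000], [3100], [4500]]   # monthly spend
y = [5.0, 12.5, 20.0, 32.0, 44.0]             # profit: continuous, not a class

reg = LinearRegression().fit(X, y)
print(reg.predict([[2500]]))    # an estimated value on a continuous range
```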
This table depicts some of the most common algorithms used with supervised learning types. It
is important to understand that many machine learning data models will use more than one, and
sometimes many, different algorithms for a project.
While supervised learning involves having labeled data to find input-output relationships during
the training phase, unsupervised learning has no knowledge of the output label. In this type of
ML, the machine finds groups and patterns in the data on its own, and there is no specific
outcome or target to predict.
Think of this as the algorithm to use when we don't know how to classify the data and we want
the machine to classify or group it for us.
First, unlabeled raw data is loaded into the system. Next, the algorithm analyzes the data and
looks for patterns on its own. It then identifies and groups patterns of behavior and provides
output results.
Compared to supervised learning, unsupervised learning projects are much faster to implement, as no data labeling is required. In this regard, they require fewer human resources. Unsupervised learning also interprets data on its own, and has the potential to provide unique, disruptive insights for a business to consider.
However, unsupervised learning can be difficult to measure for accuracy because there is no
expected result to compare it to. It can require more experimentation and tuning to get
meaningful results.
Lastly, unsupervised learning does not natively handle high-dimensional data well, that is, massively large datasets with considerable variance. This is known as the curse of dimensionality. In some cases, the dimensions, or number of variables, may need to be reduced for it to work effectively. This requires human-intensive data cleansing.
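As an illustration of that reduction step, here is a minimal sketch using principal component analysis (PCA) with scikit-learn, an assumed tool choice, on random stand-in data:

```python
# Minimal sketch of dimensionality reduction with PCA: keep the few
# directions that explain most of the variance, so downstream algorithms
# have fewer variables to work with. The random data is a stand-in.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 samples, 50 variables

X_reduced = PCA(n_components=5).fit_transform(X)
print(X_reduced.shape)                # (100, 5): same samples, fewer dimensions
```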
Let's take a look at a common use case example using cluster analysis. Cluster analysis has the
goal of organizing raw data into related groups, and is often used for anomaly detection.
This security company uses it to identify unusual patterns in network traffic, indicating potential
signs of a security breach or intrusion.
First, the security company streams in raw network traffic data. Next, the algorithm analyzes the
data on its own and looks for unusual patterns. It then identifies patterns of behavior as either
normal or suspect. When suspect behavior is identified, the output is provided and the company
is notified.
With this example using anomaly detection, a scatter plot may return results looking something
like this. The green dots indicate behavior that is grouped together as normal, and the red dots
show the potential outliers that are sent back as suspect.
This table depicts some unsupervised learning algorithms. The most common algorithm here is
K-Means, for cluster analysis, which is what we've just focused on with our security use case
example on anomaly detection.
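A minimal sketch of this security use case might use K-Means distances to flag outliers. The features, data, and threshold below are invented, and scikit-learn is an assumed tool choice:

```python
# Minimal sketch of anomaly detection with K-Means: cluster the raw
# traffic, then flag points far from every cluster center as suspect.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two groups of normal traffic (e.g., packet rate vs. bytes transferred),
# plus a few scattered anomalies. All values are invented.
normal_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
normal_b = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(100, 2))
outliers = rng.uniform(low=-8.0, high=12.0, size=(6, 2))
X = np.vstack([normal_a, normal_b, outliers])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.min(km.transform(X), axis=1)   # distance to the nearest center

suspect = X[dist > 3.0]                  # invented threshold: far from all centers
print(len(suspect), "traffic patterns flagged as suspect")
```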
Semi-supervised learning combines the previous two learning types. Let's look at how a common self-training algorithm from this method works:
First, an initial set of labeled input training data is loaded into the system. The model is trained
on the data, and then a new data set of unlabeled data is presented.
The algorithm infers new labels and classifiers to apply to the new data. High-confidence data, or data that scores well based on the algorithm, is added back to the original labeled data set. From here, the machine progressively adapts and learns in an iterative process.
In some cases, when the inferred labels and the rule-based engine conflict, a human is needed for verification.
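scikit-learn ships a self-training wrapper that follows this pattern; here is a minimal sketch with an invented one-feature dataset, where -1 marks the unlabeled examples:

```python
# Minimal sketch of self-training (semi-supervised learning): the model
# is fit on the labeled examples, labels the unlabeled ones it is
# confident about, and retrains on them iteratively. Data is invented.
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = [[0.1], [0.2], [0.9], [1.0], [0.15], [0.95], [0.5]]
y = [0, 0, 1, 1, -1, -1, -1]        # -1 marks the unlabeled examples

model = SelfTrainingClassifier(LogisticRegression(), threshold=0.7)
model.fit(X, y)     # high-confidence predictions join the labeled set

print(model.predict([[0.85]]))      # e.g., [1]
```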
It is useful to understand what open source tools are available when you begin composing your own machine learning projects. This table shows a list of common ML libraries that can be easily adapted for use in any project.
Note, this is only a subset of the available libraries, as the landscape continues to change and grow.
Machine learning can be applied to any scenario where data is analyzed. ML is used in every industry to gain a strong business advantage by minimizing overhead costs, predicting user activity, and discovering new insights and untapped market opportunities.
From retail and finance to healthcare and manufacturing, businesses of all kinds are taking advantage of artificial intelligence in their data science efforts.
Retail businesses use machine learning for a wide variety of purposes in many markets. In
ecommerce, they track shopping cart activity and avert abandonments, provide useful product
recommendations, plan targeted promotions, and forecast product demand for stock availability.
Their brick-and-mortar stores effectively apply ML to track customer product interest and spend
in real-time. Both markets integrate seamlessly together to provide a fully personalized, holistic
customer 360 program.
All these efforts combine to offer their business reduced administration costs and increased
profits, all while providing a personalized touch for every customer.
A very popular use of ML that we're all familiar with is the recommender system. We've all made
purchases online and have received recommendations for related items. These recommendation
engines use an information filtering technique that predicts our preference for, or rating of, an
item.
As with many things in machine learning, this can be applied to just about anything: products like clothing, books, movies, or music, as well as research articles, jobs, restaurants, online dating, and Twitter feeds.
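One simple flavor of this filtering technique is item-based similarity: items whose rating patterns look alike get recommended together. Here is a toy numpy sketch with an invented ratings matrix; real recommenders use far larger data and more sophisticated models:

```python
# Minimal sketch of item-based recommendation: compare items by the
# similarity of their rating columns. The ratings matrix is invented;
# 0 means "not rated".
import numpy as np

# Rows: users; columns: items (say, four movies).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Similarity of every item to item 0, based on who rated them and how.
sims = [cosine(R[:, 0], R[:, j]) for j in range(R.shape[1])]
print(np.round(sims, 2))
# Item 1 scores highest, so it would be recommended to fans of item 0.
```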
In financial services, we have a consumer credit card company that uses machine learning for
credit card fraud detection and credit line approval. Both of these are considered binary
classification problems, giving a result of fraud or not fraud, or approved or declined. Specific
promotional and offer recommendations are also made to their customers, based on previous
purchase activity.
Finally, ML is used in a customer retention program, which we will look at in more detail, next.
This credit card company uses ML to predict at-risk customers and retain those who call in to close their accounts. By monitoring transaction activity and user demographic data, they can predict behavior and prepare offers specifically for these customers.
This process uses a predictive, binary classification model to determine which customers are at
risk, and then uses a classic recommender model to determine other suitable card offers that
might retain these customers.
Note that in this example, multiple models are working together to solve this business problem.
In the healthcare industry, this provider uses machine learning to predict patient readmission probability. Not only does this help save the time of medical professionals, but it also allows them to provide better care for their patients. ML is used to perform early cancer screens on patients, using anomaly detection and image recognition on ultrasounds and scans.
In this example, high resolution lung scans are examined for lesions, using anomaly detection
and image recognition.
Here, we can see how image recognition performs lung segmentation, revealing lesions which are then evaluated by a professional for malignancy. In this way, doctors' time is used much more efficiently: they don't have to sift through every scan on their own, and can focus only on the scans showing potential issues.
Not only does machine learning assist these medical professionals by providing an additional pair
of tireless eyes, but it has demonstrated constant improvements in accuracy and detection
capabilities in the field.
This manufacturing company is using machine learning to analyze their data with the goal of
finding efficiency improvements and improving quality control. They use ML to avert potential
employee churn for cost reduction and they also analyze real-time streaming data from sensors
to instantly assess and detect the need for machine line maintenance.
This company uses anomaly detection on real-time streaming data to analyze IoT sensors on a
robotic arm. It instantly assesses and detects potential equipment maintenance requirements. If
the organization can accurately predict when a piece of hardware will fail and replace that component before it fails, it saves production costs, reduces downtime, and increases operational efficiency.
With so many devices now including sensor data and components that send diagnostic reports, predictive maintenance using ML is becoming increasingly accurate and effective.
Who isn't excited about the concept of a self-driving car? Every day, this gets closer and closer to our daily reality. Machines, well, cars in this case, need to be trained on many different things before they can actually hit the road: things like how to drive, and what possible environments and conditions they may encounter, to start with.
Much of this can be achieved through basic supervised learning, training that requires vast
amounts of labeled data.
Once perfected, this technology will need to be extremely intelligent, fully autonomous, self-
organizing, and dynamic. But for now, let’s just start with the basics. How do we train a car to
drive itself safely?
In this example, supervised learning techniques are used to train the machine through specific
driving examples on speed and the conditions of terrain. Thousands of miles of driving are
recorded and fed in as streaming input data in order to provide examples on the most
appropriate driving methods for these varying conditions. This process is very human intensive,
requiring hundreds of hours of driving to generate the input data, but the trainers know exactly
what is being fed in, and can control the details of what they want to train and classify as “good
driving.”
From this raw data, the model determines a linear decision surface, which can be displayed in a visual diagram, as shown in this scatter plot example. Based on this information, the car can detect the speed and the conditions of the terrain, and then determine how best to drive in that given scenario.
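As a toy illustration of such a linear decision surface, the sketch below trains a linear support-vector classifier on two invented features, speed and terrain smoothness, with labels of the kind a human trainer might assign. The library choice and every value here are assumptions:

```python
# Minimal sketch of learning a linear decision surface over two driving
# features. Features, values, and labels are invented for illustration.
from sklearn.svm import LinearSVC

# Each example: [speed in mph, terrain smoothness: 0 = rough, 1 = smooth]
X = [[30, 0.2], [90, 0.2], [85, 0.9], [40, 0.9], [75, 0.3], [20, 0.4]]
y = ["good", "bad", "good", "good", "bad", "good"]   # labeled by trainers

surface = LinearSVC().fit(X, y)
print(surface.predict([[60, 0.25]]))   # is 60 mph on rough terrain "good driving"?
```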
Keep in mind that this is a very simplistic view of the very beginning of the training process for a
self-driving car and there are many other ways to accomplish this goal. The end product will be
vastly more complex than what we see here.
Now that we've seen what can be done in a variety of ML use cases, let's see how we
can best put it all together with these MapR solutions.
One of the most important things to understand about machine learning is that 90% of the effort in running a project is about data logistics. Success is largely dependent on the organization and accessibility of your data. Having seamless access to all data types and sources is a critical component in this effort.
Data flow and model management is best handled at the data-platform level. MapR provides a
flexible platform to support the needs of machine learning data logistics with these tools.
Data Science libraries are emerging quickly from a wide variety of developers in academia,
enterprise, and various other communities. These libraries are all being created by different
teams, and use different APIs to interact with data. Many of these APIs are not compatible with
HDFS, and therefore cannot communicate with data stored in a traditional Hadoop cluster.
Since many ML libraries are not compatible with HDFS, vendors often require a separate cluster for ML workflows, apart from their Hadoop storage and processing cluster. Data is transferred or copied between these two clusters to perform ML operations, and then back again for storage and Hadoop processing. On top of the expense of having a second cluster, moving data back and forth multiple times adds significant time to any workflow, can result in forked data, and can cause real-time processes to work with data that is minutes, or even hours, old.
The MapR Data Platform includes numerous built-in, open APIs. Because the platform is fully POSIX compliant, data stored on a MapR cluster is accessible to any library compatible with this industry-standard data interface. In addition to standard POSIX libraries, MapR also provides APIs for Kafka, Spark, Amazon S3, and many others.
Once your data is in a MapR cluster, on-premises, in the cloud, or both, it can be enabled for every type of ML workflow. Your data can interact directly with any number of ML libraries, be processed by Spark or Hadoop, and even be queried and visualized, all in a single cluster, all in real time, without the need for time-intensive data movement between systems.
Recall from earlier, we established that 90% of machine learning work is data logistics. With this
platform, MapR is uniquely positioned to enable a flexible way to support your big data and
machine learning needs.
This is a foundational platform that supports ML data storage needs with tools including: MapR
XD Distributed File and Object Store, MapR Database, MapR Event Streaming, and cloud
integration. As part of the same foundational technology, all of these tools gain the advantages
of MapR security, replication, and high availability features.
Together, these tools optimize your organizational setup for data management logistics.
Let's take a look at some of the technology that creates the MapR Data Platform.
With MapR XD, your ML projects also have access to any files stored in a POSIX-compliant file system. This allows us to include large binary files like images or video, as well as archives of JSON, CSV logs, or Parquet files. It also supports various machine learning tools without requiring special connectors or having to copy training data to local disks.
The MapR Database is a NoSQL datastore supporting both binary and JSON document tables, fully compatible with the HBase API. You can use it to import streaming JSON data for live ML processing, and also to access standard tables, such as customer, HR, or finance data, to include data at rest in your ML projects.
For more specific information about the MapR Database, take a look at our available courses in
MapR Academy.
MapR Event Streaming is a Kafka-API-based publish-and-subscribe messaging system that enables real-time data streaming in your machine learning pipelines. Streaming data comes from sensor networks, mobile applications, web clients, server logs, or even a "Thing" from the Internet of Things.
As relevant data is created by internal and external data sources, it can be fed in real time to machine learning models to analyze the data and make predictions. These streams are replayable and replicable: they can be rebuilt if they are ever interrupted, or replayed to validate results. Not only can the same message streams be shared by multiple consumers in multiple locations, but they can also be easily replicated to feed multiple ML models.
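Since the streams speak the Kafka API, the pattern of feeding events to a model can be sketched with a standard Kafka client. Everything below, the kafka-python library choice, the topic name, the broker address, and the stand-in model, is an assumption for illustration only:

```python
# Hedged sketch: consume messages from a Kafka-API stream and feed each
# event to a trained model in real time. Topic, broker, and features
# are placeholders; the model is a trivial stand-in for one trained
# earlier, as in the previous sketches.
import json
from kafka import KafkaConsumer
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on invented data.
model = LogisticRegression().fit([[0, 0], [100, 90]], ["normal", "alert"])

consumer = KafkaConsumer(
    "sensor-readings",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",     # placeholder broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:                    # processes events as they arrive
    features = [[message.value["speed"], message.value["temp"]]]
    print(model.predict(features))          # one real-time prediction per event
```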
For more information, take a look at our available courses in the MapR Academy Event Store
stream series.
Cloud storage has become very popular and with MapR, our ML projects can use any cloud
storage in just the same way as the tools we've just discussed.
The MapR Data Science Refinery is a preconfigured, easy-to-deploy and scalable data science
toolkit. It assists with the processing of machine learning projects and offers native access to all
platform assets.
MapR recognizes the need for agile, containerized solutions that scale to fit the needs of all
types of data science teams. Containerized data science allows us to run our ML models in a
predictable environment, as well as meaningfully separate data from processing.
This also takes advantage of MapR's global namespace solution for multiple data sources, enabling secure, seamless, built-in collaboration abilities. The global namespace can be mounted directly to containers to enable a stateful development process. This provides the ability to be more agile in the processing of data, such as directing work to GPUs versus CPUs.
The refinery also includes preconfigured Docker containers to leverage MapR as a persistent data
store. It provides the data science notebook, Apache Zeppelin, with a Helium framework to offer
easy, pluggable visualization capabilities for your ML projects.
For more information on containers, check out our course on application containers and Kubernetes!
The rendezvous architecture is a proposed design from MapR on how to handle enterprise-grade
data logistics for machine learning. It specifically supports a wide range of machine learning use
cases and allows for continuous integrations and iterations of machine learning models.
The core design of this architecture centers around three main concepts: it focuses on streaming
data, it takes advantage of containers, and it has a flexible, microservices approach.
The architecture takes a streaming-first approach to handling data. Not only are new, innovative streaming technologies on the rise, but streaming data is powerful and provides widespread advantages to data analytics and data science teams.
Any enterprise-level ML project will need to take advantage of streaming data. For a business to truly generate value, it must continually ingest and analyze real-time data, monitoring live data streams for changes and new opportunities. Streams also allow data to be persisted.
As we've already discussed, this is all supported with the MapR Event Store streaming solution.
The architecture leverages containers to provide a predictable environment for running machine
learning models. Containers make deployments repeatable and easy, because all the applications
and dependencies are packaged together. Compared to virtual machines, containers have similar
resources and isolation benefits, but are much lighter weight.
Lastly, this architecture also processes streaming data in a flexible microservices approach to
handle multiple machine learning models.
These models are treated as small, independently deployable services that are containerized individually, and can be distributed and replicated across servers. It is the use of these containers, which we just covered, that allows for this type of flexible handling.
You can easily introduce new ML models, and retire, replicate, or update existing ones, all on the fly without affecting live data. Extensive metrics and diagnostics allow the models to be compared instantly as they process real-time production data. The architecture also supports continuous integrations and iterations between models, and helps avoid single points of failure and large-scale outages.
When we have the right data and the capability to access and leverage it efficiently, we need only come up with the right questions. Putting serious consideration and effort into the organization of all available data sources will provide the foundation necessary to succeed with all machine learning projects.
When data is well-organized, and the right questions are asked with the guiding hand of domain
knowledge experts, the possibilities are limitless.
More courses in this business series on machine learning will be available soon.
Congratulations, you have completed the MapR Academy business course on artificial
intelligence and machine learning.