Lect1 Introduction
CCCS416
APPLIED MACHINE LEARNING
Outline:
• Big data
• Machine Learning definitions (What and Why)
• Machine Learning applications
• Machine Learning approaches:
  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning
• Offline learning vs online learning
• Instance-based vs model-based learning
• Main challenges of Machine Learning
Big Data
• Widespread use of personal computers and wireless communication leads to “big data”.
• We are both producers and consumers of data.
• Data is not random; it has structure (e.g., customer behavior).
• We need to extract that structure from data for:
  (a) Understanding the process
  (b) Making predictions for the future
• The sheer volume of data calls for automated methods of data analysis, which is what machine learning provides.
https://www.domo.com/learn/infographic/data-never-sleeps-8
What is Machine Learning?
• Machine Learning is the science (and art) of programming computers so they can learn from data.
• Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1997)
Why “Learn”?
• Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
• Learning builds general models from data consisting of particular examples.
• Example in retail, from customer transactions to consumer behavior: people who bought “Blink” also bought “Outliers” (www.amazon.com).
• The goal is to build a model that is a good and useful approximation to the data.
• Learning is useful when:
  • Human expertise does not exist (navigating on Mars)
  • Humans are unable to explain their expertise (speech recognition)
Machine Learning is great for:
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
Applications
• Business
• Walmart's data warehouse is mined for advertising
• Credit card transactions are mined to detect fraudulent use of a card based on purchase patterns
• Netflix developed a movie recommender system
• Genomics
• Human genome project: collection of DNA sequences, microarray data
• Communication Systems
• Speech recognition
• Image analysis
Machine Learning Approaches
• Supervised learning
  • Classification
  • Regression
• Unsupervised learning
• Semi-supervised learning
• Reinforcement learning
Supervised learning: Classification
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.
Example: spam filtering (class: spam or ham)
https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html
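As a rough illustration of learning from labeled examples, the sketch below classifies emails as spam or ham with a k-nearest-neighbors vote. The two features (number of exclamation marks, number of links) and all data values are made up for this example; a real spam filter would use far richer features.

```python
from collections import Counter

# Hypothetical training set: each email is described by two made-up
# features (exclamation marks, links) and carries a label.
training_data = [
    ((5, 3), "spam"),
    ((4, 4), "spam"),
    ((6, 2), "spam"),
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(features, data, k=3):
    """Return the majority label among the k nearest labeled examples."""
    nearest = sorted(data, key=lambda item: squared_distance(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction_a = classify((5, 2), training_data)  # many "!" and links
prediction_b = classify((1, 1), training_data)  # few "!" and links
print(prediction_a, prediction_b)
```

The labels in `training_data` play the role of the "desired solutions": the algorithm never sees a rule for what spam is, only examples.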
Supervised learning: Regression
Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors.
https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html
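A minimal sketch of regression, assuming a single predictor (mileage) and a straight-line model fitted by ordinary least squares. The mileage and price figures are invented for illustration and are not taken from any real dataset.

```python
# Hypothetical data: mileage in thousands of km, price in thousands of dollars.
mileage = [10.0, 20.0, 30.0, 40.0, 50.0]
price = [18.0, 16.0, 14.0, 12.0, 10.0]


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(mileage, price)
predicted_price = slope * 25.0 + intercept  # price estimate for a car at 25,000 km
print(slope, intercept, predicted_price)
```

Unlike classification, the output here is a number rather than a class; with more predictors (age, brand, etc.) the same idea extends to multiple linear regression.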
Examples of Supervised Learning Algorithms
• k-Nearest Neighbors
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
• Linear Regression
• Logistic Regression
• …
Unsupervised Learning
• Does not require a human expert to manually label the data.
• We are just given input data, without any outputs.
• The goal is to discover “interesting structure” in the data (“knowledge discovery”).
• Clustering: grouping similar instances
• Example applications:
  • Customer segmentation in CRM
  • Image compression: color quantization
  • Bioinformatics: learning motifs
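To make clustering concrete, here is a small sketch of k-means on six 2-D points with no labels. The points and the choice of initial centroids are assumptions made for this example; real implementations usually pick initial centroids randomly or with a smarter seeding scheme.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical unlabeled data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(labels, centroids)
```

The algorithm is never told which group a point belongs to; the grouping emerges from distances alone.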
Clustering: example
https://www.researchgate.net/figure/An-example-of-the-document-clustering_fig1_322455242
Unsupervised learning
https://www.edureka.co/blog/machine-learning-tutorial/
Unsupervised machine learning
Some examples of unsupervised learning algorithms:
• K-means
• Fuzzy k-means
• Principal Component Analysis
• Association rule learning
• …
Semi-supervised learning
• Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data.
• Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms.
• Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7.
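One simple way to use a few labels together with many unlabeled points is self-training with a nearest-neighbor rule, sketched below. This is only one possible semi-supervised technique, chosen here for brevity; it is not claimed to be what photo services actually do, and the 1-D data points are invented.

```python
def self_train(labeled, unlabeled):
    """Repeatedly label the unlabeled point closest to any labeled point.

    `labeled` maps a 1-D position to its label; `unlabeled` is a list of
    positions. Newly labeled points help label the remaining ones.
    """
    labeled = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Find the (unlabeled, labeled) pair with the smallest distance.
        point, neighbor = min(
            ((u, l) for u in remaining for l in labeled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labeled[point] = labeled[neighbor]  # copy the neighbor's label
        remaining.remove(point)
    return labeled


# Hypothetical data: only two points carry labels.
seed_labels = {0.0: "A", 10.0: "B"}
unlabeled_points = [1.0, 2.0, 3.5, 9.0, 8.0, 6.8]
result = self_train(seed_labels, unlabeled_points)
print(result)
```

Labels spread outward from the two labeled seeds, so the structure of the unlabeled data influences the final labeling.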
Semi-supervised machine learning
https://www.ecloudvalley.com/mlintroduction/
Reinforcement Learning
• Reinforcement learning is useful for learning how to act or behave when given occasional reward or punishment signals. (For example, consider how a baby learns to walk.)
• The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards).
• It must then learn by itself the best strategy, called a policy, to get the most reward over time.
• A policy defines what action the agent should choose when it is in a given situation.
• Examples: gaming and robot navigation.
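The agent, reward, and policy ideas can be sketched with tabular Q-learning on a toy environment. Everything below is an assumption for illustration: a five-cell corridor, a reward of 1 for reaching the last cell, purely random exploration, and hand-picked values for the learning rate and discount factor.

```python
import random

N_STATES = 5           # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)     # move left or move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed values)


def step(state, action):
    """Environment: move within the corridor, reward 1 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn action values from experience gathered by random exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Policy: in each non-goal state, pick the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


q_table = train()
policy = greedy_policy(q_table)
print(policy)
```

The resulting `policy` is a mapping from situation (cell) to action, which matches the definition of a policy given above.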
Reinforcement Learning
https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html
Offline learning vs Online learning
Offline learning (batch learning)
• The system is incapable of learning incrementally: it must be trained using all the available data.
• If we want a batch learning system to know about new data (such as a new type of spam), we need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.
• This solution is simple and often works fine, but training on the full dataset can take a long time.
• Training on the full dataset also requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.).
Online Learning
• The system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.
• Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. It is also a good option if you have limited computing resources.
https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html
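A minimal sketch of online learning, assuming a linear model updated by one stochastic gradient descent step per incoming instance. The simulated stream (noise-free values of y = 2x + 1) and the learning rate are assumptions chosen so the behavior is easy to follow.

```python
def sgd_step(w, b, x, y, learning_rate=0.1):
    """Update the model y ~ w * x + b using a single instance."""
    error = (w * x + b) - y
    # Gradient of 0.5 * error**2 with respect to w and b.
    return w - learning_rate * error * x, b - learning_rate * error


def data_stream(n):
    """Simulated stream: instances arrive one at a time (hypothetical data)."""
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    for i in range(n):
        x = xs[i % len(xs)]
        yield x, 2.0 * x + 1.0


w, b = 0.0, 0.0
for x, y in data_stream(2000):
    # Each instance is used once and can then be discarded.
    w, b = sgd_step(w, b, x, y)
print(w, b)
```

Each update touches only the current instance, which is why online learning needs little memory compared with retraining on the full dataset.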
Instance-based vs model-based learning
Main challenges of Machine Learning
• Data-related challenges
  • Insufficient quantity of training data
  • Non-representative training data
  • Poor-quality data
  • Irrelevant features
• Algorithm-related challenges
  • Overfitting the training data
  • Underfitting the training data
Insufficient quantity of data
• It takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples.
Non-representative training data
• In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to.
https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html
Poor quality data
• If the training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.
• Examples:
  • If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
  • If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
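The option of filling in missing values with the median can be sketched as follows. The age list is invented, and `None` is assumed to mark a customer who did not specify an age.

```python
from statistics import median

# Hypothetical customer ages; None marks a missing value.
ages = [25, None, 40, 31, None, 58]

# Compute the median from the known values only.
known_ages = [age for age in ages if age is not None]
fill_value = median(known_ages)

# Replace each missing entry with the median, leaving known values untouched.
imputed_ages = [fill_value if age is None else age for age in ages]
print(fill_value, imputed_ages)
```

The median is often preferred over the mean here because a few extreme ages would shift it much less.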
Irrelevant features
• The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.
• Feature engineering:
  • Feature selection: selecting the most useful features to train on among existing features.
  • Feature extraction: combining existing features to produce a more useful one.
  • Creating new features by gathering new data.
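As one rough way to do feature selection, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top two. The feature names and values are hypothetical, and correlation ranking is only one of many selection heuristics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical car data: three candidate features and the target price.
features = {
    "mileage": [10, 20, 30, 40, 50],
    "age_years": [1, 2, 4, 4, 5],
    "paint_code": [3, 1, 2, 1, 3],  # an arbitrary identifier, likely irrelevant
}
price = [18, 16, 14, 12, 10]

# Score each feature by how strongly it varies with the target.
scores = {name: abs(pearson(values, price)) for name, values in features.items()}

# Keep the two highest-scoring features.
selected = sorted(scores, key=scores.get, reverse=True)[:2]
print(scores, selected)
```

Correlation only captures linear relationships, so a feature with a low score is not guaranteed to be useless.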
Overfitting the training data
Overfitting means that the model performs well on the training data, but it does not generalize well.
The possible solutions are:
• To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
• To gather more training data
• To reduce the noise in the training data (e.g., remove outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
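One common form of regularization is an L2 (ridge) penalty on the model's weights. The sketch below assumes a one-parameter model y = w * x with no intercept, for which minimizing the squared error plus penalty * w² gives w = Σxy / (Σx² + penalty). The data and penalty values are chosen only for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Closed-form slope for y = w * x with an L2 penalty on w."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + penalty)


# Hypothetical training data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# A larger penalty constrains the model more, pulling the slope toward zero.
slopes = {penalty: ridge_slope(xs, ys, penalty) for penalty in (0.0, 14.0, 42.0)}
print(slopes)
```

The penalty is a hyperparameter: too small and the model can still overfit, too large and it becomes too constrained to fit the data at all.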
Underfitting the training data
• Underfitting occurs when the model is too simple to learn the underlying structure of the data.
• The main options to fix this problem are:
  • Selecting a more powerful model, with more parameters.
  • Feeding better features to the learning algorithm (feature engineering).
  • Reducing the constraints on the model.
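A small sketch of underfitting, assuming data generated by y = x² and a one-parameter model fitted through the origin. Fitting on the raw feature x cannot capture the curve, while feeding the better feature x² removes the error. The data is synthetic.

```python
def fit_through_origin(features, targets):
    """Least-squares weight for the model y = w * feature."""
    return sum(f * t for f, t in zip(features, targets)) / sum(f * f for f in features)


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Synthetic data with a quadratic structure.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

# Too-simple model: a straight line through the origin on the raw feature.
w_linear = fit_through_origin(xs, ys)
mse_linear = mean_squared_error([w_linear * x for x in xs], ys)

# Better feature: the same kind of model, but fed x squared.
squared = [x ** 2 for x in xs]
w_quadratic = fit_through_origin(squared, ys)
mse_quadratic = mean_squared_error([w_quadratic * s for s in squared], ys)

print(mse_linear, mse_quadratic)
```

The linear model has a high error even on its own training data, which is the typical symptom of underfitting.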
References
• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (chapter 1)
• Ethem Alpaydin, Introduction to Machine Learning (chapter 1)
• Kevin P. Murphy, Machine Learning: A Probabilistic Perspective (chapter 1)