
2 - Basics of Machine Learning

This document provides an introduction to machine learning, including its goals, typical workflow, and some common methods. It discusses how machine learning uses algorithms to find patterns in large amounts of data and make predictions without being explicitly programmed. The document emphasizes the importance of validating models on testing data and cautions that machine learning is best for interpolation and not extrapolation. It also introduces scikit-learn as a popular Python tool for machine learning.

Uploaded by

HERiTAGE1981
Copyright
© © All Rights Reserved

CHAPTER 2


Introduction to Machine Learning

Hervé Gross, PhD - Reservoir Engineer


Advanced Resources and Risk Technology

© ADVANCED RESOURCES AND RISK TECHNOLOGY This document can only be distributed within UFRGS
Machine Learning

o “Give computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)
o Algorithms that use data (“learn”) to make predictions
o No “model” ( = function that maps input to output) is explicitly given
o Although no model is given, users have many algorithms to choose from
o Validation is always required to assess the quality of our predictions

o “data science” often means machine-learning applied to massive amounts of data

o Goals:
    o analyze and describe data: find trends, clusters, anomalies,
    o produce predictions that improve in quality and reliability with more and more data,
    o optimize decisions based on multiple previous experiments



ML always follows the same steps

Data acquisition → Data preparation → Machine Learning (TRAINING) → Model Validation (TESTING) → Predictions / Decisions

Typical machine learning workflow


Data preparation (removing redundant, incomplete, or erroneous data; selecting features…): this is often the most time-consuming step, and it is sometimes iterative.
Model validation: often done lightly, but it is the most important step in the process for establishing the credibility of the outcomes.
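
The workflow steps can be sketched in a few lines with scikit-learn. This is only a minimal illustration under stated assumptions: the synthetic dataset, the scaler followed by a logistic regression, and the 2/3 and 1/3 split are choices made for this example, not part of the original slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data acquisition: a synthetic dataset stands in for real observations (assumption).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    class_sep=2.0, random_state=0,
)

# 2. Data preparation: hold back one third of the data for testing;
#    feature scaling is part of the pipeline so it is learned on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression())

# 3. Machine learning (TRAINING): fit on the training set only.
model.fit(X_train, y_train)

# 4. Model validation (TESTING): score on data the model has never seen.
test_accuracy = model.score(X_test, y_test)

# 5. Predictions: apply the validated model to new inputs
#    (here, the first five test rows play the role of new data).
new_predictions = model.predict(X_test[:5])

print(f"test accuracy: {test_accuracy:.2f}")
print("predictions:", new_predictions)
```

Keeping the test set out of every fitting step, including the scaling, is what makes the validation score a fair estimate of predictive quality.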



Two families, many methods

Non-exhaustive list: beware of hype!

UNSUPERVISED LEARNING

Provide inputs, not outputs
→ Find patterns, clusters, anomalies
→ Predict similarity when new data is given

Clustering
    k-means
    mixture models
    hierarchical clustering
Anomaly detection
Neural networks
    Hebbian learning
    Generative adversarial networks
Approaches for learning latent variable models, such as
    Expectation–maximization algorithm (EM)
    Method of moments
    Blind signal separation techniques
        Principal component analysis
        Independent component analysis
        Non-negative matrix factorization
        Singular value decomposition

https://en.wikipedia.org/wiki/Unsupervised_learning

SUPERVISED LEARNING

Provide inputs and outputs
→ Predict outputs when new sets of inputs are given

Analytical learning
Artificial neural network
Backpropagation
Boosting (meta-algorithm)
Bayesian statistics
Case-based reasoning
Decision tree learning
Inductive logic programming
Gaussian process regression
Group method of data handling
Kernel estimators
Learning automata
Learning classifier systems
Minimum message length (decision trees, decision graphs, etc.)
Multilinear subspace learning
Naive Bayes classifier
Maximum entropy classifier
Conditional random field
Nearest neighbor algorithm
Probably approximately correct (PAC) learning
Ripple down rules, a knowledge acquisition methodology
Symbolic machine learning algorithms
Subsymbolic machine learning algorithms
Support vector machines
Minimum complexity machines (MCM)
Random forests
Ensembles of classifiers
Ordinal classification
Data pre-processing
Handling imbalanced datasets
Statistical relational learning
Proaftn, a multicriteria classification algorithm
…

https://en.wikipedia.org/wiki/Supervised_learning

BEWARE OF HYPE

Many ML algorithms derive from the same concepts. Their names are often marketing, and they all falsely promise to solve “all problems”. All algorithms have assumptions, weaknesses, and applicability restrictions.
→ It is important to exercise critical judgment.
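
The difference between the two families shows up directly in how an algorithm is called: an unsupervised method receives inputs only, while a supervised method receives inputs and known outputs. The sketch below assumes scikit-learn and a synthetic two-group dataset; k-means and a nearest-neighbor classifier are picked from the lists purely as examples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption): two well-separated groups of 2-D points.
# y holds the group index (0 or 1) of each point.
X, y = make_blobs(
    n_samples=100, centers=[[-5.0, -5.0], [5.0, 5.0]],
    cluster_std=1.0, random_state=0,
)

# Unsupervised: only the inputs X are provided; the algorithm finds the clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Supervised: inputs X and known outputs y are both provided,
# then outputs are predicted for new inputs.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X, y)
predicted = classifier.predict([[-4.0, -6.0], [6.0, 4.0]])

print("clusters found:", sorted(set(cluster_labels)))
print("predicted classes for new points:", list(predicted))
```

The cluster numbers returned by k-means are arbitrary labels: the algorithm groups the points but has no way of knowing what the groups mean, whereas the classifier reproduces the labels it was given.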



Applications

o Used widely in computer science and in any field with sets of observations so large that human calibration is impossible:
    o search engines (page ranking, related searches),
    o social networks (suggestions, advertising),
    o image pattern recognition (classification of features, face detection, identification),
    o natural language processing…

o Scientific uses in all domains of science with enough observations. Sometimes seen as a statistical “black box” when theory-driven models are considered more elegant. Both approaches can coexist and be complementary.

o Also used in economics, finance, medical diagnosis, opinion management…

o Not needed when the model is obvious or simple



Applicability : ML is not magic!

o Everything is based on the underlying model that the algorithm forces on the data
o Interpolation = if done well, OK
o Extremes = difficult
o Extrapolation = very dangerous (especially if you cannot explain the underlying statistical model)
o A machine-learning result is just a model; as such, it always needs validation

[Figure: output plotted against input. The training domain covers the central range of inputs; the two regions outside it are labelled “extrapolation”.]
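
The danger of extrapolation can be illustrated with a toy model. The sketch below is an assumption-laden example rather than anything from the slides: it uses a hand-written one-nearest-neighbor predictor (the hypothetical function `nearest_neighbor_predict`) trained on y = x² for inputs between 0 and 5, then queries it inside and outside that training domain.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Return the output of the training point whose input is closest to x."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


# Training domain (assumption): inputs 0.0, 0.1, ..., 5.0 with outputs y = x**2.
train_x = [i / 10 for i in range(51)]
train_y = [x ** 2 for x in train_x]

# Interpolation: the query lies between training points.
inside = nearest_neighbor_predict(train_x, train_y, 2.52)
interpolation_error = abs(inside - 2.52 ** 2)

# Extrapolation: the query lies far outside the training domain.
# The model can only repeat the last value it has seen.
outside = nearest_neighbor_predict(train_x, train_y, 10.0)
extrapolation_error = abs(outside - 10.0 ** 2)

print(f"interpolation error at x=2.52: {interpolation_error:.3f}")
print(f"extrapolation error at x=10:   {extrapolation_error:.1f}")
```

Nothing in the model signals that the second prediction is unreliable; only knowledge of the training domain, or validation on data from that region, reveals the problem.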



The importance of validation

Validation (sometimes called “model assessment”) consists of evaluating the predictive quality of a trained model
• Models are useless (and dangerous!) if they have not been validated
• Always validate

• Methods for validating


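• Holdout methods (split the data into training and testing sets, typically 2/3 and 1/3)
• N-fold cross-validation (split the data into N subsets, train with N-1 of them, test with the remaining one, and repeat so that each subset is used once for testing)
• Bootstrap: create new datasets (usually of the same size as the original) by randomly sampling the data with replacement; the observations left out of a sample can be used for testing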
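
The three splitting strategies can be sketched in plain Python on a list of row indices. The function names below are hypothetical helpers written for this illustration, and the bootstrap sample is drawn at the same size as the original data, which is a common convention assumed here.

```python
import random


def holdout_split(indices, test_fraction=1 / 3, seed=0):
    """Shuffle the indices and split them into a training and a testing set."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(indices, n_folds):
    """Yield (train, test) pairs; each fold is used exactly once for testing."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def bootstrap_sample(indices, seed=0):
    """Draw len(indices) indices with replacement; the rest are out-of-bag."""
    rng = random.Random(seed)
    sample = [rng.choice(indices) for _ in indices]
    drawn = set(sample)
    out_of_bag = [i for i in indices if i not in drawn]
    return sample, out_of_bag


indices = list(range(12))  # stand-in for the rows of a dataset

train, test = holdout_split(indices)
folds = list(k_fold_splits(indices, n_folds=4))
sample, out_of_bag = bootstrap_sample(indices)

print("holdout:", len(train), "training rows,", len(test), "testing rows")
print("4-fold test sets:", [fold_test for _, fold_test in folds])
print("bootstrap sample:", sample, "out-of-bag:", out_of_bag)
```

In every case the model is trained on one set of rows and evaluated on rows it has not seen.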

• Validation metrics
• Accuracy (low bias) and precision (low variability): correlation coefficients (e.g. Spearman), bias matrices, etc.
• Sensitivity: true positive rate (or recall, or probability of detection)
• Specificity: true negative rate
• Sensitivity vs. (1 − specificity) = ROC curve (receiver operating characteristic)
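
Sensitivity and specificity follow directly from counting true and false positives and negatives, and a ROC curve is traced by repeating that count at several decision thresholds, plotting sensitivity against the false positive rate (1 − specificity). The labels, scores, and thresholds below are made-up values for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def roc_points(y_true, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    points = []
    for threshold in thresholds:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Made-up testing data: true classes and the scores a model assigned to them.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1, 0.6, 0.5]

# Classify with a single threshold of 0.45.
y_pred = [1 if s >= 0.45 else 0 for s in scores]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

# Sweep the threshold from "predict nothing" to "predict everything".
points = roc_points(y_true, scores, thresholds=[1.1, 0.45, 0.0])

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print("ROC points (FPR, TPR):", points)
```

A strict threshold gives the point (0, 0) and a permissive one gives (1, 1); a useful model bends the curve between them toward the top-left corner.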



Machine learning toolboxes

Toolboxes in (almost) all languages:


https://github.com/josephmisiti/awesome-machine-learning#awesome-machine-learning-

The number of machine learning packages is very large, free open-source or proprietary,
language-specific or not, cross-platform or not, cloud-friendly or not, etc.

For this class, we will use SCIKIT-LEARN in Python:


http://scikit-learn.org/stable/index.html



Scikit-learn



