
Introduction to

Machine Learning and Data Mining


(Học máy và Khai phá dữ liệu)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2024
Contents
- Introduction to Machine Learning & Data Mining
- Supervised learning
- Unsupervised learning
- Performance evaluation
- Practical advice
Who is real? (Ai thực, ai giả?)
Why ML & DM?
- "The most important general-purpose technology of our era is artificial intelligence, particularly machine learning" – Harvard Business Review.
  https://hbr.org/cover-story/2017/07/the-business-of-artificial-intelligence
- A huge demand for Data Science.
- "Data scientist: the sexiest job of the 21st century" – Harvard Business Review.
  http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
Some projects
- NAFOSTED
  - Title
- AFOSR (2015-2016, USA)
Why ML & DM?
- Data mining, inference, prediction.
- ML & DM provide an efficient way to build intelligent systems/services.
- ML provides vital methods and a foundation for Big Data.

[Chart: all global data, in Zettabytes (1 ZB ≈ 10^21 bytes), growing toward 150 ZB. Source: Statista]
Why? Industry 4.0

https://www.pwc.com/ca/en/industries/industry-4-0.html
Why? AI & DS & Industry 4.0

[Diagram: Machine Learning within Artificial Intelligence, overlapping Data Science and Industry 4.0]
Some successes: Amazon's secret

"The company reported a 29% sales increase to $12.83 billion during its second fiscal quarter, up from $9.9 billion during the same time last year."
– Fortune, July 30, 2012
Some successes: GAN (2014)
- A machine can have imagination (trí tưởng tượng):

  min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_noise}[log(1 − D(G(z)))]

[Figure: artificial faces generated by a GAN; photo of Ian Goodfellow]

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In NIPS, pp. 2672-2680. 2014.
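The minimax value function can be evaluated directly on samples. Below is a toy sketch under assumptions of mine: a fixed logistic "discriminator" and a trivial shift "generator" on 1-D data, not the neural networks used in the paper.

```python
import math
import random

# Toy evaluation of the GAN value function:
#   V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_noise}[log(1 - D(G(z)))]
# The logistic discriminator and the shift generator are illustrative
# assumptions, not the networks from Goodfellow et al.

def discriminator(x, w=2.0, b=0.0):
    """Probability in (0, 1) that x is a real sample."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def generator(z, shift=-1.0):
    """A trivial 'generator': just shift the noise."""
    return z + shift

rng = random.Random(0)
real = [rng.gauss(1.0, 0.5) for _ in range(5000)]              # x ~ p_data
fake = [generator(rng.gauss(0.0, 1.0)) for _ in range(5000)]   # G(z), z ~ p_noise

value = (sum(math.log(discriminator(x)) for x in real) / len(real)
         + sum(math.log(1.0 - discriminator(g)) for g in fake) / len(fake))
print(value)  # both terms are logs of probabilities, hence negative
```

Training a real GAN alternates gradient steps on this value: ascent for D, descent for G; the sketch only evaluates V for one fixed pair.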
Some successes: AlphaGo (2016)
- AlphaGo of Google DeepMind beat the world champion at Go (cờ vây), 3/2016.
  - Go is a 2500-year-old game.
  - Go is one of the most complex games.
- AlphaGo learned from 30 million human moves, and played against itself to find new moves.
- It beat Lee Sedol (world champion).
- http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/
- http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
Some successes: GPT-3 (2020)
- Language generation (writing ability?)
- A huge model was trained on a huge data set.
- This model, as universal knowledge, can be used for problems with little data.
  (GPT-3's knowledge helps in low-data contexts.)
- Humans could not tell whether a 500-word article was written by the machine or by a person.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." NeurIPS (2020). Best Paper Award.
Some successes: Text-to-image (2022)
- Draw pictures from keywords.

[Figure: images for the prompt "A bowl of soup" generated by Midjourney, DALL-E 2, and Imagen]
Some successes: ChatGPT (2022)
- Human-level chatting, writing, QA, …

"Why ChatGPT is about to change how you work, like it or not?"
– Forbes, Feb. 2, 2023
Some successes: Sora (2024)
- Generate videos from short descriptions.
Machine Learning vs Data Mining

Machine Learning (ML - Học máy): to build computer systems that can improve themselves by learning from data.
(Xây dựng những hệ thống mà có khả năng tự cải thiện bản thân bằng cách học từ dữ liệu.)
- Some venues: NeurIPS, ICML, ICLR, IJCAI, AAAI, ACML, ECML

Data Mining (DM - Khai phá dữ liệu): to find new and useful knowledge from datasets.
(Tìm ra/Khai phá những tri thức mới và hữu dụng từ các tập dữ liệu lớn.)
- Some venues: KDD, PKDD, PAKDD, ICDM, CIKM
Data
- Structured: relational (table-like)
- Unstructured: texts in websites, emails, articles, tweets; 2D/3D images, videos + metadata; spectrograms, DNAs, …
Methodology: product-driven

Business understanding → Analytic approach → Data requirements → Data collection → Data understanding → Data preparation → Modeling → Evaluation → Deployment → Feedback (→ back to Business understanding)

(http://www.theta.co.nz/)
Methodology: insight-driven

Data collection → Data processing → Data analysis, visualization & grasping ML → Insight & hypothesis testing → Policy & Decision

Data collection and processing take 70-90% of the whole process.
(John Dickerson, University of Maryland)
Product development: experience
- IBM Research, DeepQA: incremental progress in answering precision on the Jeopardy challenge, 6/2007-11/2010 (IBM Watson playing in the winners cloud).

[Chart: precision vs. % answered for versions from a baseline (12/06) through v0.1 (12/07), v0.2 (05/08), v0.3 (08/08), v0.4 (12/08), v0.5 (05/09), v0.6 (10/09), v0.7 (04/10), and v0.8 (11/10)]

© Data Science Laboratory, SOICT, HUST, 2017
What is Machine Learning?
- Machine Learning (ML) is an active subfield of Artificial Intelligence.
- ML seeks to answer the question [Mitchell, 2006]:
  - How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
- Some other views on ML:
  - Build systems that automatically improve their performance [Simon, 1983].
  - Program computers to optimize a performance objective at some task, based on data and past experience [Alpaydin, 2020].
A learning machine
- We say that a machine learns if the system reliably improves its performance P at task T, following experience E.
- A learning problem can be described as a triple (T, P, E).
- ML is close to and intersects with many areas:
  - Computer Science,
  - Statistics, Probability,
  - Optimization,
  - Psychology, Neuroscience,
  - Computer Vision,
  - Economics, Biology, Bioinformatics, …
Some real examples (1)
- Spam filtering for emails
  - T: filter/predict all emails that are spam.
  - P: the accuracy of prediction, i.e., the percentage of emails correctly classified as normal/spam.
  - E: a set of old emails, each with a spam/normal label.
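The (T, P, E) description above can be made concrete. Here is a minimal Naive Bayes spam filter as a sketch; the tiny training set and all the words are invented for illustration, not taken from the slides.

```python
from collections import Counter
import math

# Experience E: a hand-made labeled set of "emails".
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "normal"),
    ("project meeting notes", "normal"),
]

def fit(examples):
    """Count words per class and examples per class (the 'learning' step)."""
    word_counts = {"spam": Counter(), "normal": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def predict(model, text):
    """Task T: pick the class with the highest log-posterior."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the product
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(predict(model, "free money"))     # spam
print(predict(model, "meeting notes"))  # normal
```

Performance P would be measured as accuracy on a held-out set of labeled emails, not on the training set itself.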
Some real examples (2)
- Image captioning
  - T: give some words that describe the meaning of an image.
  - P: ?
  - E: a set of images, each with a short description.

[Figure: DALL-E 3 images with captions "a girl giving a cat a gentle hug", "a small hedgehog holding a piece of watermelon", "lychee-inspired spherical chair". Image source: https://openai.com/dall-e-3]
What does a machine learn?
- A mapping (function) y*: x ↦ y
  - x: observation (example, data instance), past experience
  - y: prediction, new knowledge, new experience, …
Where does a machine learn from?
- Learn from a set of training examples (training set, tập học, tập huấn luyện): { {x1, x2, …, xN}; {y1, y2, …, yM} }
  - xi is an observation (quan sát, mẫu, điểm dữ liệu) of x in the past.
  - yj is an observation of y in the past, often called the label (nhãn), response (phản hồi), or output (đầu ra).
- After learning:
  - We obtain a model, new knowledge, or new experience (f).
  - We can use that model/function to do prediction or inference for future observations, e.g., y = f(x)
Two basic learning problems
- There is an unknown function y* that maps each x to a number y*(x).
  - In practice, we can collect some pairs (xi, yi), where yi = y*(xi).
- Supervised learning (học có giám sát): find the true function y* from a given training set {x1, x2, …, xN; y1, y2, …, yN}.
  - Classification (categorization, phân loại, phân lớp): if y belongs to a discrete set, for example {spam, normal}.
  - Regression (hồi quy): if y is a real number.
Supervised learning: regression
- Prediction of stock indices
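A regression model of this kind can be fit in closed form. A minimal sketch, fitting y ≈ a·x + b by ordinary least squares; the numbers are made up for illustration (they are not stock data):

```python
# Training pairs (x_i, y_i); roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Closed-form least squares: slope = cov(x, y) / var(x), intercept from means.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

f = lambda x: a * x + b        # the learned model y = f(x)
print(round(a, 2), round(b, 2))  # 1.96 0.14
print(round(f(6.0), 2))          # 11.9 -- prediction for an unseen x
```

After learning, f is used exactly as the earlier slide says: to predict y for future observations x.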
Supervised learning: classification
- Multiclass classification (phân loại nhiều lớp): the output y is one of the pre-defined labels {c1, c2, …, cL}.
  (Mỗi đầu ra chỉ thuộc 1 lớp; mỗi quan sát x chỉ có 1 nhãn.)
  - Spam filtering: y in {spam, normal}
  - Financial risk estimation: y in {high, normal, no}
  - Discovery of network attacks: ?
- Multilabel classification (phân loại đa nhãn): the output y is a subset of labels.
  (Mỗi đầu ra là một tập nhỏ các lớp; mỗi quan sát x có thể có nhiều nhãn.)
  - Image tagging: y = {birds, nest, tree}
  - Sentiment analysis
Two basic learning problems
- Unsupervised learning (học không giám sát): find the true function y* from a given training set {x1, x2, …, xN}.
  - y* can be a data cluster,
  - a hidden structure,
  - a trend, …
- Other learning problems:
  - semi-supervised learning,
  - reinforcement learning,
  - …
Unsupervised learning: examples (1)
- Clustering data into clusters
  - Discover the data groups/clusters
- Community detection
  - Detect communities in online social networks
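Clustering can be sketched with the classic k-means algorithm. The 1-D points, the choice k = 2, and the deterministic initialization below are assumptions for illustration; real data is usually multi-dimensional and initialization is usually random.

```python
def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means on 1-D points; no labels are used."""
    s = sorted(points)
    # Deterministic init for reproducibility: spread centers over sorted data.
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 1.1, 10.1]
print(kmeans(data, k=2))  # two centers, one near 1.0 and one near 10.0
```

Here y* is a cluster assignment: the algorithm discovers the two groups without ever seeing a label.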
Unsupervised learning: examples (2)
- Trend detection
  - Discover the trends, demands, and future needs of online users
Design a learning system (1)
- Some issues should be carefully considered when designing a learning system.
- Determine the type of the function to be learned (xác định dạng bài toán học):
  - y*: X → {0, 1}
  - y*: X → set of labels/tags
- Collect a training set:
  - Do the observations have any label?
  - The training set plays the key role in the effectiveness of the system.
  - The training observations should characterize the whole data space → good for future predictions.

[Diagram: the product-driven methodology cycle, repeated from the earlier slide]
Design a learning system (2)
- Select a representation or approximation (model) f for the unknown function y*.
  (Lựa chọn dạng hàm f để đi xấp xỉ hàm y* chưa biết.)
  - A linear model?
  - A neural network?
  - A decision tree? …
- Select a learning algorithm to find f:
  - Ordinary least squares? Ridge regression?
  - Backpropagation?
  - ID3? …

[Diagram: the product-driven methodology cycle, repeated from the earlier slide]
ML: some issues (1)
- Learning algorithm
  - Under what conditions will the chosen algorithm (asymptotically) converge?
    (Với điều kiện nào thì thuật toán học sẽ hội tụ?)
  - For a given application/domain and a given objective function, which algorithm performs best?
    (Đối với một ứng dụng và mục tiêu cho trước, thuật toán nào sẽ tốt nhất?)
- No-free-lunch theorem [Wolpert and Macready, 1997]: if an algorithm performs well on a certain class of problems, then it necessarily pays for that with degraded performance on the set of all remaining problems.
  - No algorithm can beat another on all domains.
    (Không có thuật toán nào luôn hiệu quả nhất trên mọi miền ứng dụng.)
ML: some issues (2)
- Training data
  - How many observations are enough for learning?
  - Does the size of the training set affect the performance of an ML system?
  - What is the effect of disrupted or noisy observations?
ML: some issues (3)
- Learnability
  - The goodness/limit of the learning algorithm?
  - What is the generalization (tổng quát hoá) ability of the system?
    - Predict new observations well, not only the training data.
    - Avoid overfitting and underfitting.
Overfitting (quá khớp, quá khít)
- A function h is called overfitting [Mitchell, 1997] if there exists another function g such that:
  - g might be worse than h on the training data, but
  - g is better than h on future data.
- A learning algorithm is said to overfit relative to another one if it is more accurate in fitting known data, but less accurate in predicting unseen data.
- Overfitting is caused by many factors:
  - The trained function/model is too complex or has too many parameters.
  - Noise or errors are present in the training data.
  - The training set is too small and does not characterize the whole data space.
Overfitting and Underfitting

[Figure: training error and test error vs. model complexity. A too-simple model underfits (học không đến nơi đến chốn): both errors are high. A too-complex model overfits (học vẹt): training error keeps falling while test error rises. A good model lies in between, where test error is lowest.]
Overfitting: example
- Using few neighbors in k-NN can degrade prediction on unseen data, even though it decreases the error on the training data.

[Figure: expected prediction error rises while training error falls as k decreases; Hastie et al., 2017]


Underfitting: example
- Using too many neighbors in k-nearest neighbors (k-NN) can degrade prediction on both the training data and unseen data.

[Figure: both expected prediction error and training error are high for large k; Hastie et al., 2017]
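Both k-NN effects can be reproduced on synthetic data. In this sketch the 1-D data, the noise level, and the k values are assumptions for illustration: k = 1 memorizes the training set (zero training error) but fits the noise, while a very large k underfits everything.

```python
import random

random.seed(0)
# Training set: y = x plus Gaussian noise. Test set: the noise-free truth.
train = [(x / 10, x / 10 + random.gauss(0, 0.5)) for x in range(30)]
test = [(x / 10 + 0.05, x / 10 + 0.05) for x in range(30)]

def knn_predict(x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    """Mean squared error of k-NN predictions over a dataset."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in data) / len(data)

for k in [1, 5, 29]:
    print(k, round(mse(train, k), 3), round(mse(test, k), 3))
# k = 1: training error is exactly 0, but the test error reflects the noise
# k = 29: nearly every prediction is the global mean -> both errors are large
```

This mirrors the two figures: small k overfits (training error 0, unseen-data error high), large k underfits (both errors high), and a moderate k sits in between.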


Overfitting: Regularization
- Among many functions, which one generalizes best from the given training data?
  - Generalization is the main target of ML: predict unseen data well.
- Regularization (hiệu chỉnh) is a popular choice.

[Figure: a function f(x) fit to data points; picture from http://towardsdatascience.com/multitask-learning-teach-your-ai-more-to-make-it-better-dde116c2cd40]
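Regularization can be made concrete with ridge regression, which adds a penalty λ·a² to the least-squares objective. A 1-D sketch under assumptions of mine: the numbers are invented, and the intercept is omitted since the toy data is roughly centered.

```python
# Ridge regression in one dimension. Minimizing
#   sum_i (y_i - a*x_i)^2 + lam * a^2
# gives the closed-form slope a(lam) = sum(x*y) / (sum(x*x) + lam).

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.2, -1.9, 0.3, 2.1, 3.9]   # roughly y = 2x plus noise

def ridge_slope(lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

for lam in [0.0, 1.0, 10.0]:
    print(lam, round(ridge_slope(lam), 3))
# lam = 0 recovers ordinary least squares (slope 2.02);
# larger lam shrinks the slope toward 0
```

The shrinkage is the point: a smaller weight means a simpler, smoother function, which often generalizes better when the data is noisy or scarce.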


References
- Alpaydin, E. (2020). Introduction to Machine Learning. The MIT Press.
- Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning. Springer.
- Mitchell, T. M. (1997). Machine Learning. McGraw Hill.
- Mitchell, T. M. (2006). The Discipline of Machine Learning. Carnegie Mellon University, School of Computer Science, Machine Learning Department.
- Simon, H. A. (1983). Why Should Machines Learn? In R. S. Michalski, J. Carbonell, and T. M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, chapter 2, pp. 25-38. Morgan Kaufmann.
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
- Wolpert, D. H., & Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1, 67.
