
Introduction to

Machine Learning and Data Mining


(Học máy và Khai phá dữ liệu)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2024
Contents
- Introduction to Machine Learning & Data Mining
- Supervised learning
- Unsupervised learning
- Performance evaluation
- Practical advice
Who is real? (Ai thực, ai giả?)
Why ML & DM?
- "The most important general-purpose technology of our era is artificial intelligence, particularly machine learning" – Harvard Business Review.
  https://hbr.org/cover-story/2017/07/the-business-of-artificial-intelligence
- A huge demand for Data Science.
- "Data scientist: the sexiest job of the 21st century" – Harvard Business Review.
  http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
Some projects
- NAFOSTED
  - Title
- AFOSR (2015-2016, USA)
Why ML & DM?
- Data mining, inference, prediction.
- ML & DM provide an efficient way to build intelligent systems/services.
- ML provides vital methods and a foundation for Big Data.

[Chart: all global data, in Zettabytes (1 ZB ≈ 10^21 bytes), growing toward 150 ZB. Source: Statista]
Why? Industry 4.0

https://www.pwc.com/ca/en/industries/industry-4-0.html
Why? AI & DS & Industry 4.0

[Diagram: Machine Learning within Artificial Intelligence, overlapping Data Science and Industry 4.0]
Some successes: Amazon's secret

"The company reported a 29% sales increase to $12.83 billion during its second fiscal quarter, up from $9.9 billion during the same time last year."
– Fortune, July 30, 2012
Some successes: GAN (2014)
- A machine can have imagination (trí tưởng tượng):

  min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_noise}[log(1 − D(G(z)))]

[Figure: artificial faces generated by a GAN; photo of Ian Goodfellow]

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In NIPS, pp. 2672-2680. 2014.
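The minimax value function can be evaluated directly on samples. Below is a toy sketch under assumptions of mine: a fixed logistic "discriminator" and a trivial shift "generator" on 1-D data, not the neural networks used in the paper.

```python
import math
import random

# Toy evaluation of the GAN value function:
#   V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_noise}[log(1 - D(G(z)))]
# The logistic discriminator and the shift generator are illustrative
# assumptions, not the networks from Goodfellow et al.

def discriminator(x, w=2.0, b=0.0):
    """Probability in (0, 1) that x is a real sample."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def generator(z, shift=-1.0):
    """A trivial 'generator': just shift the noise."""
    return z + shift

rng = random.Random(0)
real = [rng.gauss(1.0, 0.5) for _ in range(5000)]              # x ~ p_data
fake = [generator(rng.gauss(0.0, 1.0)) for _ in range(5000)]   # G(z), z ~ p_noise

value = (sum(math.log(discriminator(x)) for x in real) / len(real)
         + sum(math.log(1.0 - discriminator(g)) for g in fake) / len(fake))
print(value)  # both terms are logs of probabilities, hence negative
```

Training a real GAN alternates gradient steps on this value: ascent for D, descent for G; the sketch only evaluates V for one fixed pair.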
Some successes: AlphaGo (2016)
- AlphaGo of Google DeepMind beat the world champion at Go (cờ vây), 3/2016.
  - Go is a 2500-year-old game.
  - Go is one of the most complex games.
- AlphaGo learned from 30 million human moves, and played against itself to find new moves.
- It beat Lee Sedol (world champion).
- http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/
- http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
Some successes: GPT-3 (2020)
- Language generation (writing ability?)
- A huge model was trained on a huge data set.
- This model, as universal knowledge, can be used for problems with little data.
  (GPT-3's knowledge helps in low-data contexts.)
- Humans could not tell whether a 500-word article was written by the machine or by a person.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." NeurIPS (2020). Best Paper Award.
Some successes: Text-to-image (2022)
- Draw pictures from keywords.

[Figure: images for the prompt "A bowl of soup" generated by Midjourney, DALL-E 2, and Imagen]
Some successes: ChatGPT (2022)
- Human-level chatting, writing, QA, …

"Why ChatGPT is about to change how you work, like it or not?"
– Forbes, Feb. 2, 2023
Some successes: Sora (2024)
- Generate videos from short descriptions.
Machine Learning vs Data Mining

Machine Learning (ML - Học máy): to build computer systems that can improve themselves by learning from data.
(Xây dựng những hệ thống mà có khả năng tự cải thiện bản thân bằng cách học từ dữ liệu.)
- Some venues: NeurIPS, ICML, ICLR, IJCAI, AAAI, ACML, ECML

Data Mining (DM - Khai phá dữ liệu): to find new and useful knowledge from datasets.
(Tìm ra/Khai phá những tri thức mới và hữu dụng từ các tập dữ liệu lớn.)
- Some venues: KDD, PKDD, PAKDD, ICDM, CIKM
Data
- Structured: relational (table-like)
- Unstructured: texts in websites, emails, articles, tweets; 2D/3D images, videos + metadata; spectrograms, DNAs, …
Methodology: product-driven

Business understanding → Analytic approach → Data requirements → Data collection → Data understanding → Data preparation → Modeling → Evaluation → Deployment → Feedback (→ back to Business understanding)

(http://www.theta.co.nz/)
Methodology: insight-driven

Data collection → Data processing → Data analysis, visualization & grasping ML → Insight & hypothesis testing → Policy & Decision

Data collection and processing take 70-90% of the whole process.
(John Dickerson, University of Maryland)
Product development: experience
- IBM Research, DeepQA: incremental progress in answering precision on the Jeopardy challenge, 6/2007-11/2010 (IBM Watson playing in the winners cloud).

[Chart: precision vs. % answered for versions from a baseline (12/06) through v0.1 (12/07), v0.2 (05/08), v0.3 (08/08), v0.4 (12/08), v0.5 (05/09), v0.6 (10/09), v0.7 (04/10), and v0.8 (11/10)]

© Data Science Laboratory, SOICT, HUST, 2017
What is Machine Learning?
- Machine Learning (ML) is an active subfield of Artificial Intelligence.
- ML seeks to answer the question [Mitchell, 2006]:
  - How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
- Some other views on ML:
  - Build systems that automatically improve their performance [Simon, 1983].
  - Program computers to optimize a performance objective at some task, based on data and past experience [Alpaydin, 2020].
A learning machine
- We say that a machine learns if the system reliably improves its performance P at task T, following experience E.
- A learning problem can be described as a triple (T, P, E).
- ML is close to and intersects with many areas:
  - Computer Science,
  - Statistics, Probability,
  - Optimization,
  - Psychology, Neuroscience,
  - Computer Vision,
  - Economics, Biology, Bioinformatics, …
Some real examples (1)
- Spam filtering for emails
  - T: filter/predict all emails that are spam.
  - P: the accuracy of prediction, i.e., the percentage of emails correctly classified as normal/spam.
  - E: a set of old emails, each with a spam/normal label.
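The (T, P, E) description above can be made concrete. Here is a minimal Naive Bayes spam filter as a sketch; the tiny training set and all the words are invented for illustration, not taken from the slides.

```python
from collections import Counter
import math

# Experience E: a hand-made labeled set of "emails".
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "normal"),
    ("project meeting notes", "normal"),
]

def fit(examples):
    """Count words per class and examples per class (the 'learning' step)."""
    word_counts = {"spam": Counter(), "normal": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def predict(model, text):
    """Task T: pick the class with the highest log-posterior."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the product
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(predict(model, "free money"))     # spam
print(predict(model, "meeting notes"))  # normal
```

Performance P would be measured as accuracy on a held-out set of labeled emails, not on the training set itself.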
Some real examples (2)
- Image captioning
  - T: give some words that describe the meaning of an image.
  - P: ?
  - E: a set of images, each with a short description.

[Figure: DALL-E 3 images with captions "a girl giving a cat a gentle hug", "a small hedgehog holding a piece of watermelon", "lychee-inspired spherical chair". Image source: https://openai.com/dall-e-3]
What does a machine learn?
- A mapping (function) y*: x ↦ y
  - x: observation (example, data instance), past experience
  - y: prediction, new knowledge, new experience, …
Where does a machine learn from?
- Learn from a set of training examples (training set, tập học, tập huấn luyện): { {x1, x2, …, xN}; {y1, y2, …, yM} }
  - xi is an observation (quan sát, mẫu, điểm dữ liệu) of x in the past.
  - yj is an observation of y in the past, often called the label (nhãn), response (phản hồi), or output (đầu ra).
- After learning:
  - We obtain a model, new knowledge, or new experience (f).
  - We can use that model/function to do prediction or inference for future observations, e.g., y = f(x)
Two basic learning problems
- There is an unknown function y* that maps each x to a number y*(x).
  - In practice, we can collect some pairs (xi, yi), where yi = y*(xi).
- Supervised learning (học có giám sát): find the true function y* from a given training set {x1, x2, …, xN; y1, y2, …, yN}.
  - Classification (categorization, phân loại, phân lớp): if y belongs to a discrete set, for example {spam, normal}.
  - Regression (hồi quy): if y is a real number.
Supervised learning: regression
- Prediction of stock indices
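A regression model of this kind can be fit in closed form. A minimal sketch, fitting y ≈ a·x + b by ordinary least squares; the numbers are made up for illustration (they are not stock data):

```python
# Training pairs (x_i, y_i); roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Closed-form least squares: slope = cov(x, y) / var(x), intercept from means.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

f = lambda x: a * x + b        # the learned model y = f(x)
print(round(a, 2), round(b, 2))  # 1.96 0.14
print(round(f(6.0), 2))          # 11.9 -- prediction for an unseen x
```

After learning, f is used exactly as the earlier slide says: to predict y for future observations x.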
Supervised learning: classification
- Multiclass classification (phân loại nhiều lớp): the output y is one of the pre-defined labels {c1, c2, …, cL}.
  (Mỗi đầu ra chỉ thuộc 1 lớp; mỗi quan sát x chỉ có 1 nhãn.)
  - Spam filtering: y in {spam, normal}
  - Financial risk estimation: y in {high, normal, no}
  - Discovery of network attacks: ?
- Multilabel classification (phân loại đa nhãn): the output y is a subset of labels.
  (Mỗi đầu ra là một tập nhỏ các lớp; mỗi quan sát x có thể có nhiều nhãn.)
  - Image tagging: y = {birds, nest, tree}
  - Sentiment analysis
Two basic learning problems
- Unsupervised learning (học không giám sát): find the true function y* from a given training set {x1, x2, …, xN}.
  - y* can be a data cluster,
  - a hidden structure,
  - a trend, …
- Other learning problems:
  - semi-supervised learning,
  - reinforcement learning,
  - …
Unsupervised learning: examples (1)
- Clustering data into clusters
  - Discover the data groups/clusters
- Community detection
  - Detect communities in online social networks
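Clustering can be sketched with the classic k-means algorithm. The 1-D points, the choice k = 2, and the deterministic initialization below are assumptions for illustration; real data is usually multi-dimensional and initialization is usually random.

```python
def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means on 1-D points; no labels are used."""
    s = sorted(points)
    # Deterministic init for reproducibility: spread centers over sorted data.
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 1.1, 10.1]
print(kmeans(data, k=2))  # two centers, one near 1.0 and one near 10.0
```

Here y* is a cluster assignment: the algorithm discovers the two groups without ever seeing a label.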
Unsupervised learning: examples (2)
- Trend detection
  - Discover the trends, demands, and future needs of online users
Design a learning system (1)
- Some issues should be carefully considered when designing a learning system.
- Determine the type of the function to be learned (xác định dạng bài toán học):
  - y*: X → {0, 1}
  - y*: X → set of labels/tags
- Collect a training set:
  - Do the observations have any label?
  - The training set plays the key role in the effectiveness of the system.
  - The training observations should characterize the whole data space → good for future predictions.

[Diagram: the product-driven methodology cycle, repeated from the earlier slide]
Design a learning system (2)
- Select a representation or approximation (model) f for the unknown function y*.
  (Lựa chọn dạng hàm f để đi xấp xỉ hàm y* chưa biết.)
  - A linear model?
  - A neural network?
  - A decision tree? …
- Select a learning algorithm to find f:
  - Ordinary least squares? Ridge regression?
  - Backpropagation?
  - ID3? …

[Diagram: the product-driven methodology cycle, repeated from the earlier slide]
ML: some issues (1)
- Learning algorithm
  - Under what conditions will the chosen algorithm (asymptotically) converge?
    (Với điều kiện nào thì thuật toán học sẽ hội tụ?)
  - For a given application/domain and a given objective function, which algorithm performs best?
    (Đối với một ứng dụng và mục tiêu cho trước, thuật toán nào sẽ tốt nhất?)
- No-free-lunch theorem [Wolpert and Macready, 1997]: if an algorithm performs well on a certain class of problems, then it necessarily pays for that with degraded performance on the set of all remaining problems.
  - No algorithm can beat another on all domains.
    (Không có thuật toán nào luôn hiệu quả nhất trên mọi miền ứng dụng.)
ML: some issues (2)
- Training data
  - How many observations are enough for learning?
  - Does the size of the training set affect the performance of an ML system?
  - What is the effect of disrupted or noisy observations?
ML: some issues (3)
- Learnability
  - The goodness/limit of the learning algorithm?
  - What is the generalization (tổng quát hoá) ability of the system?
    - Predict new observations well, not only the training data.
    - Avoid overfitting and underfitting.
Overfitting (quá khớp, quá khít)
- A function h is called overfitting [Mitchell, 1997] if there exists another function g such that:
  - g might be worse than h on the training data, but
  - g is better than h on future data.
- A learning algorithm is said to overfit relative to another one if it is more accurate in fitting known data, but less accurate in predicting unseen data.
- Overfitting is caused by many factors:
  - The trained function/model is too complex or has too many parameters.
  - Noise or errors are present in the training data.
  - The training set is too small and does not characterize the whole data space.
Overfitting and Underfitting

[Figure: training error and test error vs. model complexity. A too-simple model underfits (học không đến nơi đến chốn): both errors are high. A too-complex model overfits (học vẹt): training error keeps falling while test error rises. A good model lies in between, where test error is lowest.]
Overfitting: example
- Using few neighbors in k-NN can degrade prediction on unseen data, even though it decreases the error on the training data.

[Figure: expected prediction error rises while training error falls as k decreases; Hastie et al., 2017]


Underfitting: example
- Using too many neighbors in k-nearest neighbors (k-NN) can degrade prediction on both the training data and unseen data.

[Figure: both expected prediction error and training error are high for large k; Hastie et al., 2017]
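Both k-NN effects can be reproduced on synthetic data. In this sketch the 1-D data, the noise level, and the k values are assumptions for illustration: k = 1 memorizes the training set (zero training error) but fits the noise, while a very large k underfits everything.

```python
import random

random.seed(0)
# Training set: y = x plus Gaussian noise. Test set: the noise-free truth.
train = [(x / 10, x / 10 + random.gauss(0, 0.5)) for x in range(30)]
test = [(x / 10 + 0.05, x / 10 + 0.05) for x in range(30)]

def knn_predict(x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    """Mean squared error of k-NN predictions over a dataset."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in data) / len(data)

for k in [1, 5, 29]:
    print(k, round(mse(train, k), 3), round(mse(test, k), 3))
# k = 1: training error is exactly 0, but the test error reflects the noise
# k = 29: nearly every prediction is the global mean -> both errors are large
```

This mirrors the two figures: small k overfits (training error 0, unseen-data error high), large k underfits (both errors high), and a moderate k sits in between.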


Overfitting: Regularization
- Among many functions, which one generalizes best from the given training data?
  - Generalization is the main target of ML: predict unseen data well.
- Regularization (hiệu chỉnh) is a popular choice.

[Figure: a function f(x) fit to data points; picture from http://towardsdatascience.com/multitask-learning-teach-your-ai-more-to-make-it-better-dde116c2cd40]
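Regularization can be made concrete with ridge regression, which adds a penalty λ·a² to the least-squares objective. A 1-D sketch under assumptions of mine: the numbers are invented, and the intercept is omitted since the toy data is roughly centered.

```python
# Ridge regression in one dimension. Minimizing
#   sum_i (y_i - a*x_i)^2 + lam * a^2
# gives the closed-form slope a(lam) = sum(x*y) / (sum(x*x) + lam).

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.2, -1.9, 0.3, 2.1, 3.9]   # roughly y = 2x plus noise

def ridge_slope(lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

for lam in [0.0, 1.0, 10.0]:
    print(lam, round(ridge_slope(lam), 3))
# lam = 0 recovers ordinary least squares (slope 2.02);
# larger lam shrinks the slope toward 0
```

The shrinkage is the point: a smaller weight means a simpler, smoother function, which often generalizes better when the data is noisy or scarce.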


References
- Alpaydin, E. (2020). Introduction to Machine Learning. The MIT Press.
- Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning. Springer.
- Mitchell, T. M. (1997). Machine Learning. McGraw Hill.
- Mitchell, T. M. (2006). The Discipline of Machine Learning. Carnegie Mellon University, School of Computer Science, Machine Learning Department.
- Simon, H. A. (1983). Why Should Machines Learn? In R. S. Michalski, J. Carbonell, and T. M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, chapter 2, pp. 25-38. Morgan Kaufmann.
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
- Wolpert, D. H., & Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1, 67.
