Topic 08 - Data Modelling - Part II

University of Science, VNU-HCM

Faculty of Information Technology

Fundamentals of Artificial Intelligence Course


Introduction to Data Science Course

Data Modeling (Part 2)

Le Ngoc Thanh
[email protected]
Department of Computer Science

Ho Chi Minh City


Contents
◎ Data science and machine learning review
◎ Classification model
◎ Clustering model

2
Process

3
Data Science’s tasks

4
ML Tasks

5
The course’s focus
◎ In this course, we focus on three main groups of ML:
○ Regression
○ Classification
○ Clustering

6
General model learning architecture

(Hypothesis)

7
Classification
◎ Classification is the problem of identifying which of a set of
categories an observation belongs to.

8
Classification
◎ The inputs and outputs for the binary
classification learning task can be stated as follows:
○ Input
data $\mathbf{x}_j \in \mathbb{R}^n$, $j \in Z := \{1, 2, \ldots, m\}$
labels $\mathbf{y}_j \in \{\pm 1\}$, $j \in Z' \subset Z$
○ Output
labels $\mathbf{y}_j \in \{\pm 1\}$, $j \in Z$

9
Challenges
◎ Some challenges in classification task:
○ The boundary between the data forms a nonlinear manifold that is difficult to characterize.
○ If the sampling data only captures a portion of the manifold, then it will almost surely fail in
characterizing population data.
○ Data can lie in a high-dimensional space where visualization is essentially impossible.

10
Well-known classification algorithms
◎ Some well-known classifiers:
○ Support Vector Machine (SVM)
○ Classification and Regression Tree (CART)
○ k-nearest Neighbors (kNN)
○ Naïve Bayes
○ Ensemble Learning and Boosting (AdaBoost)
○ Ensemble Learning of Decision Tree (C4.5)
○ Deep neural nets

11
Contents
◎ Data science and machine learning review
◎ Classification model
○ Support Vector Machine (SVM)
○ Neural network
◎ Clustering model

12
Support Vector Machines
◎ The original SVM algorithm by Vapnik and Chervonenkis evolved out of the
statistical learning literature in 1963, where hyperplanes are optimized to split
the data into distinct classes.

13
Linear SVM
◎ The key idea of the linear SVM method
is to construct a hyperplane:
𝐰"𝐱+𝑏 =0
where the vector 𝐰 and constant 𝑏
parametrize the hyperplane.

14
Optimization problem in SVM
◎ The optimization problem in SVM includes:
○ Optimize a decision line that makes the fewest labeling errors.
○ Optimize the largest margin between the data classes.

15
Objective function
◎ The loss function is defined as follows:
$$L(\mathbf{y}_j, \hat{\mathbf{y}}_j) = L\big(\mathbf{y}_j, \operatorname{sign}(\mathbf{w}^T \mathbf{x}_j + b)\big) = \begin{cases} 0 & \text{if data is correctly labeled} \\ +1 & \text{if data is incorrectly labeled} \end{cases}$$
◎ The goal is also to make the margin as large as possible:
$$\operatorname*{argmin}_{\mathbf{w},\,b} \sum_{j=1}^{m} L(\mathbf{y}_j, \hat{\mathbf{y}}_j) + \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad \min_j \left| \mathbf{w}^T \mathbf{x}_j + b \right| = 1$$
16
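To make the objective above concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the two-blob dataset and the value C=1.0 are invented purely for illustration) that fits a linear SVM and reads back the hyperplane parameters 𝐰 and b:

```python
# Minimal linear SVM sketch (assumed library: scikit-learn; toy data for illustration).
import numpy as np
from sklearn.svm import SVC

# Two roughly separable blobs labeled +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0)        # C trades off margin width vs. labeling errors
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters: w^T x + b = 0
print("w =", w, "b =", b)
print("train accuracy:", clf.score(X, y))
```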
Noisy/Nonlinear Classification
◎ How to separate the following data by SVM?

17
Noisy/Nonlinear Classification
◎ Two basic approaches:
○ Use a linear classifier, but allow some (penalized) errors
◉ Soft margin, slack variables
○ Project data into higher dimensional space
◉ Do linear classification there
◉ Kernel functions

18
Soft margin
◎ A margin violation means choosing a hyperplane that
allows some data points to lie either inside the margin
area or on the incorrect side of the hyperplane.

19
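A rough sketch of the soft-margin idea (scikit-learn assumed; the overlapping dataset and the C values are arbitrary choices for illustration): a smaller C tolerates more margin violations, while a larger C penalizes them more heavily.

```python
# Soft-margin sketch: the effect of C on the number of margin violations (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (100, 2)), rng.normal(1, 1.0, (100, 2))])  # overlapping classes
y = np.array([+1] * 100 + [-1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Support vectors include every point inside the margin or misclassified.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```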
Map to higher-dimensional space
◎ Embedding the data in a higher dimensional space:
𝐱 → Φ(𝐱)
◎ For example:
$(x_1, x_2) \to (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2)$

20
Kernel trick
◎ A kernel function (similarity function) takes input vectors in
the original space and returns the dot product of these vectors
in the feature space (a real number):
$$K(\mathbf{x}, \mathbf{z}) = \Phi(\mathbf{x})^T \Phi(\mathbf{z})$$

21
Kernel trick
◎ The objective function only includes the dot product of the
transformed feature vectors.
$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i \Phi(\mathbf{x}_i)$$

$$f(\mathbf{x}) = \mathbf{w}^T \Phi(\mathbf{x}) + b = \sum_{i=1}^{m} \alpha_i \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}) + b = \sum_{i=1}^{m} \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b$$
◎ Therefore, we can simply substitute these dot-product terms with the kernel
function and never compute Φ(𝐱) explicitly.
22
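A small sketch of the kernel trick in practice (scikit-learn assumed; the concentric-circles data is a standard toy example and not from the slides): a linear kernel fails on this data, while an RBF kernel, which implicitly maps into a higher-dimensional feature space, separates it.

```python
# Kernel-trick sketch: classify concentric circles with an RBF kernel (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
y = 2 * y - 1                                   # relabel {0,1} -> {-1,+1} to match the slides

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)    # K(x, z) = exp(-gamma * ||x - z||^2)

print("linear kernel accuracy:", linear.score(X, y))  # poor: data not linearly separable
print("RBF kernel accuracy:   ", rbf.score(X, y))     # good: implicit high-dimensional map
```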
Contents
◎ Data science and machine learning review
◎ Classification model
○ Support Vector Machine (SVM)
○ Neural network
◎ Clustering model

23
Generic architecture of a multi-layer NN
◎ For classification tasks, the goal of the NN is to map a set of
input data to a classification.

24
One-layer network
◎ First, consider a single layer network for binary classification:

25
One-layer network
◎ Hypothesis with linear mapping:
𝐀𝐗 = 𝐘

$$\Rightarrow \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} \vert & \vert & & \vert \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \\ \vert & \vert & & \vert \end{bmatrix} = \begin{bmatrix} +1 & +1 & \cdots & -1 & -1 \end{bmatrix}$$
where each column of the matrix 𝐗 is a dog or cat image
($\mathbf{x}_j \in \mathbb{R}^n$) and the entries of 𝐘 are the corresponding labels.

26
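A minimal sketch of solving 𝐀𝐗 = 𝐘 for the weights by least squares (NumPy assumed; random vectors stand in for the flattened dog/cat images, which are not part of these slides):

```python
# One-layer linear model sketch: solve A X = Y for the weights A via the pseudo-inverse.
import numpy as np

n, m = 64, 40                          # n pixels per (flattened) image, m training images
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))            # each column is one training image x_j in R^n
Y = np.sign(rng.normal(size=(1, m)))   # corresponding labels +1 / -1

A = Y @ np.linalg.pinv(X)              # least-squares solution of A X = Y
Y_pred = np.sign(A @ X)                # apply the learned linear map, threshold to labels
print("training accuracy:", (Y_pred == Y).mean())
```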
One-layer network
◎ The linear mapping:
𝐀𝐗 = 𝐘

27
Nonlinear transformations
◎ Hypothesis with nonlinear mapping:
𝐲 = 𝑓(𝐀, 𝐱)
where 𝑓(⋅) is called an activation function (transfer function) in
neural networks.

28
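A tiny sketch of the nonlinear hypothesis 𝐲 = 𝑓(𝐀, 𝐱) (NumPy assumed; the sigmoid activation and the random weights are illustrative choices, not prescribed by the slides):

```python
# Nonlinear hypothesis sketch: y = f(A, x) with a sigmoid activation (NumPy assumed).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.normal(size=(1, 64))           # weight matrix of a single-layer network
x = rng.normal(size=(64,))             # one input sample

y = sigmoid(A @ x).item()              # activation squashes the linear response into (0, 1)
label = +1 if y >= 0.5 else -1         # threshold to a binary class
print(y, label)
```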
Neural network optimization
◎ The optimization of a NN (determining the weights of the
network) is done through the backpropagation process.
○ This process propagates the error backward through the
network and relies on the chain rule of differentiation.
◎ For example, consider a simple neural network:

29
Neural network optimization
◎ The compositional structure is:
$$y = g(z, b) = g(f(x, a), b)$$
◎ The error function is:
$$E = \frac{1}{2}(y_0 - y)^2$$

30
Neural network optimization
◎ Partial derivatives:
$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{\partial y}{\partial a} = -(y_0 - y)\frac{dy}{dz}\frac{\partial z}{\partial a} \quad \text{(chain rule)}$$
$$\frac{\partial E}{\partial b} = -(y_0 - y)\frac{\partial y}{\partial b}$$
◎ Gradient descent:
$$a_{k+1} = a_k - \eta \frac{\partial E}{\partial a_k}$$
$$b_{k+1} = b_k - \eta \frac{\partial E}{\partial b_k}$$
31
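A minimal sketch of this two-node chain trained by gradient descent (plain Python; the concrete forms f(x, a) = a·x and g(z, b) = b·z are assumptions chosen only to make the chain rule explicit, not the slides' network):

```python
# Gradient-descent sketch for the chain y = g(f(x, a), b) with E = 1/2 (y0 - y)^2.
# Assumed forms for illustration: f(x, a) = a*x and g(z, b) = b*z.
x, y0 = 2.0, 3.0          # one training input and its target output
a, b = 0.1, 0.1           # initial weights
eta = 0.05                # learning rate

for step in range(200):
    z = a * x                     # forward pass through the first node
    y = b * z                     # forward pass through the second node
    dE_da = -(y0 - y) * b * x     # chain rule: dE/dy * dy/dz * dz/da
    dE_db = -(y0 - y) * z         # chain rule: dE/dy * dy/db
    a -= eta * dE_da              # descent step on a
    b -= eta * dE_db              # descent step on b

print(a, b, b * a * x)            # b*a*x should be close to the target y0 = 3.0
```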
Overall progress of neural network
1. A NN is specified along with a labeled training set.
2. The initial weights of the network are set to random values.
3. The training data is run through the network to produce an output $y$,
whose ideal ground-truth output is $y_0$.
4. The derivatives with respect to each network weight are then
computed using the backprop formulas.
5. For a given learning rate 𝜂, the network weights are updated as in
gradient descent equation.
6. Return to step (3) and continue iterating until a maximum number of
iterations is reached or convergence is achieved.
32
Overall progress of neural network
1. A NN is specified along with a labeled training set.
2. The initial weights of the network are set to random values.

33
3. The training data is run through the network

34
◎ Backpropagate error

35
4. The derivatives with respect to each network weight are computed.
5. The network weights are updated as in the gradient descent equation.

36
Stochastic Gradient Descent
◎ Stochastic Gradient Descent (SGD): a single, randomly chosen data
point (k) is used to approximate the gradient at each step
of the iteration.
$$\mathbf{w}_{j+1} = \mathbf{w}_j - \eta \nabla E_k(\mathbf{w}_j)$$

37
Batch gradient descent
◎ If instead of a single point, a subset of points (K) is used, then
we have the following batch gradient descent algorithm.
$$\mathbf{w}_{j+1} = \mathbf{w}_j - \eta \nabla E_K(\mathbf{w}_j)$$

38
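A sketch of the batch update above for a least-squares error function (NumPy assumed; the data, batch size, and learning rate are arbitrary illustrative choices):

```python
# Mini-batch gradient descent sketch for E(w) = 1/2 ||X w - y||^2 (NumPy assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # 500 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)

w = np.zeros(3)
eta, batch_size = 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        K = idx[start:start + batch_size]            # indices of the current batch
        grad = X[K].T @ (X[K] @ w - y[K]) / len(K)   # gradient of E over the batch only
        w -= eta * grad                              # w_{j+1} = w_j - eta * grad E_K(w_j)

print("estimated w:", w)   # should be close to w_true
```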
Deep Learning

39
Contents
◎ Data science and machine learning review
◎ Classification model
◎ Clustering model

40
Clustering
◎ Clustering is the process of grouping objects into clusters:
○ Objects in the same group are highly similar to each other.
○ And very different from the objects in the rest of the groups.

The gap within a group is small; the gap between groups is large.

41
Clustering
◎ Clustering is an unsupervised learning task because the
label/class is not pre-defined.
◎ Therefore, clustering is a form of learning by observation rather than
learning by examples.

42
CHAMELEON 43
Some applications of clustering
◎ Grouping related documents for web browsing
◎ Groups of genes and proteins that have the same function
◎ Groups of stocks with similar price fluctuations
◎ Groups of areas with similar land use in geography
◎ Identifying groups of houses by house type, value, and geographic
location
◎ Defining groups of game objects

44
What is a group?

How many groups are there? The same set of points can be seen as 2 groups, 4 groups, or 6 groups.
45
A good Clustering?
◎ A good clustering method has to create groups of
high quality:
○ High similarity within each group (intra-cluster).
○ Low similarity between different groups (inter-cluster).
◎ The quality of the clustering depends on:
○ The similarity measure used
○ Its implementation
○ Its ability to discover some or all of the hidden patterns

46
Measure Similarity
◎ Distance is mainly used to measure how similar or dissimilar
two objects are.
○ Examples: Euclidean distance, cosine distance, Minkowski, Manhattan, ... (see the sketch below)
◎ Distance functions differ depending on the variables' value ranges,
types, and scales.
◎ The weights of the variables depend on the application
and the meaning of the data.

47
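A small sketch computing several of these distances for two sample points (SciPy assumed; the two vectors are the points A1 and A4 from Exercise 1 at the end of these slides):

```python
# Distance/similarity sketch for two feature vectors (SciPy assumed for the helpers).
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 10.0])
z = np.array([5.0, 8.0])

print("Euclidean:", distance.euclidean(x, z))        # sqrt of the sum of squared differences
print("Manhattan:", distance.cityblock(x, z))        # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, z, p=3))
print("Cosine distance:", distance.cosine(x, z))     # 1 - cosine similarity
```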
Some clustering methods
◎ Partitional clustering
○ Form partitions and evaluate them based on some criterion
○ Algorithms: K-Means, K-Medoids, CLARANS
◎ Hierarchical clustering
○ Create a hierarchical decomposition of the data
○ Algorithms: Diana, Agnes, BIRCH, CHAMELEON
◎ Density-based clustering
○ Based on connectivity and density functions
○ Algorithms: DBSCAN, OPTICS, DenClue

48
Examples

49
Some clustering methods (2/2)
◎ Grid-based methods
○ Based on a multi-level granularity structure
○ Algorithms: STING, WaveCluster, CLIQUE
◎ Model-based methods
◎ Frequent pattern-based methods
◎ Methods based on constraints or user guidance
◎ Link-based methods

50
Partitioning Clustering
◎ Partitioning clustering is the simplest and most
fundamental of the clustering methods.
◎ Idea: partition a database D of n objects
into k groups so that the partitioning
criterion is optimized.
◎ Global optimization: exhaustively enumerate all partitions.
◎ Heuristic algorithms:
○ K-Means: each group is represented by the group's centroid.
○ K-Medoids, PAM: each group is represented by one of the group's
objects.
51
K-Means algorithm (1/2)
◎ Given a number k, each group is represented by the group's
centroid value.
○ S1: Randomly select k objects as the initial centers of the groups.
○ S2: Assign each remaining object to the closest group based on a
distance measure such as Euclidean distance, cosine similarity,
correlation, ...
○ S3: Recompute the centroid of each group based on the newly assigned
objects.
○ S4: If the group centers no longer change (or only a few points change
groups), stop; otherwise, go back to S2. (See the Python sketch below.)

52
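A compact sketch of steps S1–S4 (NumPy assumed; it uses the sample points and the seeds A1, A4, A7 from Exercise 1 at the end of these slides):

```python
# K-means sketch following steps S1-S4 above (NumPy assumed; Exercise 1 data).
import numpy as np

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = X[[0, 3, 6]].copy()          # S1: seeds k1 = A1, k2 = A4, k3 = A7

for iteration in range(100):
    # S2: assign each point to the closest center (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # S3: recompute each center as the mean of its assigned points.
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
    # S4: stop when the centers no longer change.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print("labels:", labels)
print("centers:\n", centers)
```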
K-means example (1/4)

k = 3
S1: Select any three points as the initial centers k1, k2, k3.

[Scatter plot of the points in the X–Y plane with initial centers k1, k2, k3]
53
K-means example (2/4)

S2: Assign each point to the group with the closest group center.

[Scatter plot of the points assigned to the groups of k1, k2, k3]
54
K-means example (3/4)

S3: Move each group center to the group's new mean position.

[Scatter plot showing the centers k1, k2, k3 moved to the new group means]
55
K-means example (4/4)

Repeat: Reassign points to the closest of the new group centers...

[Scatter plot showing 3 points to be reassigned among the new centers k1, k2, k3]
56
k-means
◎ Pros:
○ Simple and efficient. Complexity O(tkn), with t iterations, k groups, n objects; normally t, k ≪ n.
○ Often terminates at a local optimum.
◎ Cons:
○ Only applicable to objects in a continuous n-dimensional space.
○ The number of groups k must be specified in advance.
○ Sensitive to noise and outliers.
○ Not suitable for discovering clusters with non-convex shapes.

57
Exercise 1
◎ Use the K-means algorithm and Euclidean distance to cluster the following 8
samples into 3 groups:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4),
A7=(1,2), A8=(4,9).
The Euclidean distance matrix is given on the following slide.
Assume the initial seeds (centers) are k1 = A1, k2 = A4, and k3
= A7. Run one iteration of K-means. Identify the groups that are formed. Where is
the new center of each group?
◎ In a 10 × 10 space, draw the samples grouped after one iteration
(draw the cluster boundaries) and the center of each group. After how many
iterations does K-means converge (stop)? Illustrate the results at each
iteration.
58
59
