Topic 08 - Data Modelling - Part II
Le Ngoc Thanh
[email protected]
Department of Computer Science
Process
Data Science’s tasks
ML Tasks
The course’s focus
◎ In this course, we focus on three main groups of ML:
○ Regression
○ Classification
○ Clustering
General model learning architecture
(Hypothesis)
Classification
◎ Classification is the problem of identifying which of a set of
categories an observation belongs to.
Classification
◎ The inputs and outputs for the binary classification learning task can be stated as follows:
○ Input
  data $\mathbf{x}_j \in \mathbb{R}^n$, $j \in Z := \{1, 2, \dots, m\}$
  labels $\mathbf{y}_j \in \{\pm 1\}$, $j \in Z' \subset Z$
○ Output
  labels $\mathbf{y}_j \in \{\pm 1\}$, $j \in Z$
Challenges
◎ Some challenges in the classification task:
○ The boundary between the classes forms a nonlinear manifold that is difficult to characterize.
○ If the sampled data only captures a portion of this manifold, the classifier will almost surely fail to characterize the full population.
○ Data may live in a high-dimensional space where visualization is essentially impossible.
Well-known classification algorithms
◎ Some well-known classifiers:
○ Support Vector Machine (SVM)
○ Classification and Regression Tree (CART)
○ k-nearest Neighbors (kNN)
○ Naïve Bayes
○ Ensemble Learning and Boosting (AdaBoost)
○ Ensemble Learning of Decision Tree (C4.5)
○ Deep neural nets
Contents
◎ Data science and machine learning review
◎ Classification model
○ Support Vector Machine (SVM)
○ Neural network
◎ Clustering model
Support Vector Machines
◎ The original SVM algorithm by Vapnik and Chervonenkis evolved out of the
statistical learning literature in 1963, where hyperplanes are optimized to split
the data into distinct classes.
Linear SVM
◎ The key idea of the linear SVM method
is to construct a hyperplane:
$$\mathbf{w}^T \mathbf{x} + b = 0$$
where the vector $\mathbf{w}$ and the constant $b$ parametrize the hyperplane.
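As a practical illustration of fitting such a hyperplane, here is a minimal sketch that assumes scikit-learn and synthetic 2-D data (neither is prescribed by the slides); it fits a linear SVM and reads off $\mathbf{w}$ and $b$:

# Minimal sketch (assumes scikit-learn): fit a linear SVM on synthetic data
# and read off the hyperplane parameters w and b of w^T x + b = 0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(2, 2), size=(50, 2)),     # class +1 samples
               rng.normal(loc=(-2, -2), size=(50, 2))])  # class -1 samples
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0)   # C is the soft-margin penalty (later slides)
clf.fit(X, y)

w = clf.coef_[0]       # the vector w
b = clf.intercept_[0]  # the constant b
print("w =", w, "b =", b)
print("first predictions:", clf.predict(X[:5]))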
Optimization problem in SVM
◎ The optimization problem in SVM includes:
○ Optimize a decision line that makes the fewest labeling errors.
○ Optimize the largest possible margin between the data of the two classes.
Objective function
◎ The loss function is defined as follows:
$$L(\mathbf{y}_j, \hat{\mathbf{y}}_j) = L\!\left(\mathbf{y}_j, \operatorname{sign}(\mathbf{w}^T \mathbf{x}_j + b)\right)$$
Noisy/Nonlinear Classification
◎ Two basic approaches:
○ Use a linear classifier, but allow some (penalized) errors
◉ Soft margin, slack variables
○ Project data into higher dimensional space
◉ Do linear classification there
◉ Kernel functions
Soft margin
◎ Margin violation means choosing a hyperplane that allows some data points to lie either inside the margin area or on the incorrect side of the hyperplane.
Map to higher-dimensional space
◎ Embedding the data in a higher dimensional space:
𝐱 → Φ(𝐱)
◎ For example:
$$(x_1, x_2) \rightarrow (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2)$$
Kernel trick
◎ A kernel function (similarity function) takes input vectors in the original space and returns the dot product of these vectors in the feature space (a real number).
$$K(\mathbf{x}, \mathbf{z}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{z})$$
Kernel trick
◎ The objective function only includes the dot product of the transformed feature vectors.
$$\mathbf{w} = \sum_{j=1}^{m} \alpha_j \Phi(\mathbf{x}_j)$$
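As a small numerical check of the kernel trick, the sketch below assumes the quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ (not the specific mapping shown two slides earlier) and verifies that it equals the dot product under the explicit feature map $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

# Sketch: a kernel computed in the original space equals a dot product
# in the feature space, without ever forming Phi explicitly.
import numpy as np

def phi(v):
    # Explicit feature map for the quadratic kernel (2-D input -> 3-D feature)
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    # The same similarity computed directly in the original space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # dot product in the feature space
print(K(x, z))                 # identical value from the kernel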
Generic architecture of a multi-layer NN
◎ For classification tasks, the goal of the NN is to map a set of
input data to a classification.
One-layer network
◎ First, consider a single layer network for binary classification:
One-layer network
◎ Hypothesis with linear mapping:
$$\mathbf{A}\mathbf{X} = \mathbf{Y}$$
$$\begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
\begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} +1 & +1 & \cdots & -1 & -1 \end{bmatrix}$$
where each column of the matrix $\mathbf{X}$ is a dog or cat image ($\mathbf{x}_j \in \mathbb{R}^n$) and the columns of $\mathbf{Y}$ are its corresponding labels.
One-layer network
◎ The linear mapping:
$$\mathbf{A}\mathbf{X} = \mathbf{Y}$$
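In practice the weights $\mathbf{A}$ of this linear model can be obtained as a least-squares solution of $\mathbf{A}\mathbf{X} = \mathbf{Y}$, for example with the pseudo-inverse. The sketch below uses random vectors in place of real dog/cat images (the sizes and data are assumptions made only for illustration):

# Sketch: solve A X = Y in the least-squares sense via the pseudo-inverse.
import numpy as np

n, m = 1024, 80                      # n "pixels" per image, m training images
X = np.random.randn(n, m)            # each column stands in for one image
Y = np.hstack([np.ones(m // 2), -np.ones(m // 2)]).reshape(1, m)  # labels +/-1

A = Y @ np.linalg.pinv(X)            # least-squares weights, shape (1, n)

Y_hat = np.sign(A @ X)               # predicted labels on the training data
print("training accuracy:", np.mean(Y_hat == Y))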
Nonlinear transformations
◎ Hypothesis with nonlinear mapping:
$$\mathbf{y} = f(\mathbf{A}, \mathbf{x})$$
where $f(\cdot)$ is called an activation function (transfer function) in neural networks.
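For instance, the linear output $\mathbf{A}\mathbf{x}$ can be passed through a nonlinear activation such as tanh or the sigmoid (a small sketch; the sizes and the particular activations are illustrative assumptions):

# Sketch: a nonlinear one-layer mapping y = f(A, x) with common activations.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

A = np.random.randn(1, 4)   # weights of a single output unit
x = np.random.randn(4)      # one input vector

print("linear:  ", A @ x)            # y = A x
print("tanh:    ", np.tanh(A @ x))   # y = f(A, x) with f = tanh
print("sigmoid: ", sigmoid(A @ x))   # y = f(A, x) with f = sigmoid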
Neural network optimization
◎ The optimization of a NN (determining the weights of the network) is done through the backpropagation process.
○ The process backpropagates the error through the network and relies on the chain rule of differentiation.
◎ For example, consider a simple neural network:
Neural network optimization
◎ The compositional structure is:
$$y = g(z, b) = g(f(x, a), b)$$
◎ The error function is:
$$E = \frac{1}{2}\left(y_0 - y\right)^2$$
Neural network optimization
◎ Partial derivatives:
$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{\partial y}{\partial a} = -(y_0 - y)\frac{\partial y}{\partial z}\frac{\partial z}{\partial a} \quad \text{(chain rule)}$$
$$\frac{\partial E}{\partial b} = -(y_0 - y)\frac{\partial y}{\partial b}$$
◎ Gradient descent:
$$a_{k+1} = a_k - \eta\,\frac{\partial E}{\partial a_k}$$
$$b_{k+1} = b_k - \eta\,\frac{\partial E}{\partial b_k}$$
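As a concrete check of these formulas, the sketch below assumes the simple choices $f(x, a) = \tanh(ax)$ and $g(z, b) = bz$ (the slides do not fix $f$ and $g$) and compares the chain-rule gradients against finite differences:

# Sketch: chain-rule (backpropagation) gradients for y = g(f(x, a), b),
# assuming f(x, a) = tanh(a*x) and g(z, b) = b*z.
import numpy as np

x, y0 = 1.5, 0.7          # one training sample and its target output
a, b = 0.3, -0.2          # current weights

z = np.tanh(a * x)        # forward pass: z = f(x, a)
y = b * z                 #               y = g(z, b)

dE_da = -(y0 - y) * b * (1 - z**2) * x   # -(y0 - y) dy/dz dz/da
dE_db = -(y0 - y) * z                    # -(y0 - y) dy/db

# Finite-difference check of both derivatives
eps = 1e-6
E = lambda a_, b_: 0.5 * (y0 - b_ * np.tanh(a_ * x)) ** 2
print(dE_da, (E(a + eps, b) - E(a - eps, b)) / (2 * eps))
print(dE_db, (E(a, b + eps) - E(a, b - eps)) / (2 * eps))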
Overall training process of a neural network
1. A NN is specified along with a labeled training set.
2. The initial weights of the network are set to random values.
3. The training data is run through the network to produce an output $y$, whose ideal ground-truth output is $y_0$.
4. The derivatives with respect to each network weight are then computed using the backpropagation formulas.
5. For a given learning rate $\eta$, the network weights are updated as in the gradient descent equations.
6. Return to step (3) and continue iterating until a maximum number of
iterations is reached or convergence is achieved.
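A minimal end-to-end sketch of these six steps on a toy network (the model $y = b\,\tanh(ax)$, the data, and the learning rate are all assumptions for illustration):

# Sketch of the training loop: random init, forward pass, backprop, update.
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])   # training inputs
ys = 1.3 * np.tanh(0.8 * xs)                 # ground-truth outputs y0

a, b = rng.standard_normal(2)                # step 2: random initial weights
eta = 0.1                                    # learning rate

for it in range(200):                        # step 6: iterate until done
    for x, y0 in zip(xs, ys):
        z = np.tanh(a * x)                   # step 3: run data through network
        y = b * z
        dE_da = -(y0 - y) * b * (1 - z**2) * x   # step 4: backprop derivatives
        dE_db = -(y0 - y) * z
        a -= eta * dE_da                     # step 5: gradient descent update
        b -= eta * dE_db

# The weights should approach (0.8, 1.3), up to a simultaneous sign flip.
print("learned a, b:", a, b)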
Overall training process of a neural network
1. A NN is specified along with a labeled training set.
2. The initial weights of the network are set to random values.
3. The training data is run through the network.
◎ Backpropagate the error.
4. The derivatives with respect to each network weight are computed.
5. The network weights are updated as in the gradient descent equations.
Stochastic Gradient Descent
◎ Stochastic Gradient Descent (SGD): a single, randomly chosen data point $k$ is used to approximate the gradient at each step of the iteration.
$$\mathbf{w}_{j+1} = \mathbf{w}_j - \eta\,\nabla E_k(\mathbf{w}_j)$$
Batch gradient descent
◎ If, instead of a single point, a subset of points $K$ is used, then we have the following batch gradient descent algorithm.
$$\mathbf{w}_{j+1} = \mathbf{w}_j - \eta\,\nabla E_K(\mathbf{w}_j)$$
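Both update rules can be sketched on a simple least-squares problem, where $E_k(\mathbf{w}) = \tfrac{1}{2}(\mathbf{x}_k \cdot \mathbf{w} - y_k)^2$ (the model, data, and step size are assumptions for illustration):

# Sketch: stochastic vs. (mini-)batch gradient descent updates.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(200)

def grad_E(w, idx):
    # Gradient of the summed squared error over the samples in idx
    return X[idx].T @ (X[idx] @ w - y[idx])

eta = 0.01
w_sgd, w_batch = np.zeros(3), np.zeros(3)
for step in range(1000):
    k = rng.integers(len(y))                        # SGD: one random point k
    w_sgd -= eta * grad_E(w_sgd, [k])
    K = rng.choice(len(y), size=20, replace=False)  # batch: a subset K of points
    w_batch -= eta * grad_E(w_batch, K) / len(K)

print("SGD estimate:  ", w_sgd)
print("batch estimate:", w_batch)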
Deep Learning
Contents
◎ Data science and machine learning review
◎ Classification model
◎ Clustering model
Clustering
◎ Clustering is the process of grouping objects into clusters:
○ Objects within the same group are highly similar to one another.
○ Objects in different groups are very dissimilar from each other.
(Figure: the gap within a group is small; the gap between groups is large.)
Clustering
◎ Clustering is unsupervised learning because the labels/classes are not pre-defined.
◎ Therefore, clustering is a form of learning by observation rather than learning by examples.
(Figure: CHAMELEON clustering example)
Some applications of clustering
◎ Grouping related documents for web browsing
◎ Grouping genes and proteins that have the same function
◎ Grouping stocks with similar volatility
◎ Grouping areas of the same land type in geography
◎ Identifying groups of houses by house type, value, and geographic location
◎ Identifying groups of objects in games
…
What is a group?
(Figure: 2 groups vs. 4 groups)
A good clustering?
◎ A good clustering method will produce groups of high quality:
○ High similarity within each group.
○ Low similarity between different groups.
◎ The quality of the clustering depends on:
○ The similarity measure used
○ Its implementation
○ Its ability to discover some or all of the hidden patterns
Measuring similarity
◎ A distance measure is mainly used to quantify how similar or dissimilar two objects are.
○ Examples: Euclidean distance, cosine distance, Minkowski, Manhattan, ...
◎ Distance functions differ depending on the value ranges, types, and scales of the variables involved.
◎ The weighting of the variables depends on the application and the semantics of the data.
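For example, several of these distances between two objects can be computed directly with SciPy's distance functions (a sketch; the use of SciPy and the sample vectors are assumptions):

# Sketch: common distance measures between two objects.
import numpy as np
from scipy.spatial import distance

u = np.array([2.0, 10.0])
v = np.array([5.0, 8.0])

print("Euclidean:      ", distance.euclidean(u, v))
print("Manhattan:      ", distance.cityblock(u, v))
print("Minkowski (p=3):", distance.minkowski(u, v, p=3))
print("Cosine distance:", distance.cosine(u, v))   # 1 - cosine similarity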
Some clustering methods (1/2)
◎ Partitional clustering
○ Form partitions and evaluate them based on some criterion
○ Algorithms: K-Means, K-Medoids, CLARANS
◎ Hierarchical clustering
○ Create a hierarchical (layered) decomposition of the data
○ Algorithms: DIANA, AGNES, BIRCH, CHAMELEON
◎ Density-based clustering
○ Based on connectivity and density functions
○ Algorithms: DBSCAN, OPTICS, DenClue
Examples
Some clustering methods (2/2)
◎ Grid-based methods
○ Based on a multi-level granularity (grid) structure
○ Algorithms: STING, WaveCluster, CLIQUE
◎ Model-based methods
◎ Frequent pattern-based methods
◎ Constraint-based or user-guided methods
◎ Link-based methods
…
Partitioning Clustering
◎ Partitioning clustering is the simplest and most fundamental of the clustering methods.
◎ Idea: partition a database D of n objects into k groups so as to optimize the chosen partitioning criterion.
◎ Global optimum: exhaustively enumerate all possible partitions.
◎ Heuristic algorithms:
○ K-means: each group is represented by the centroid (mean value) of the group.
○ K-medoids (PAM): each group is represented by one of the group's objects.
K-Means algorithm (1/2)
◎ Given the number of groups k in advance, each group is represented by the centroid (mean value) of the group (see the sketch below).
○ S1: Randomly select k objects as the initial group centers.
○ S2: Assign each remaining object to the closest group, based on a distance measure such as Euclidean distance, cosine similarity, correlation, ...
○ S3: Recompute the center (centroid) of each group based on the objects newly assigned to it.
○ S4: If the group centers no longer change (or only a few points change groups), stop; otherwise, go back to S2.
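A minimal NumPy sketch of steps S1-S4 follows (the data, k, and the stopping test are illustrative assumptions; in practice a library implementation such as sklearn.cluster.KMeans would normally be used):

# Sketch of K-means steps S1-S4 with Euclidean distance.
# (Empty groups are not handled in this sketch.)
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # S1: random centers
    for _ in range(n_iter):
        # S2: assign each object to the closest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # S3: recompute each center as the mean of its group
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # S4: stop if the centers no longer change; otherwise repeat from S2
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.random.default_rng(1).random((30, 2)) * 10   # 30 random 2-D points
labels, centers = kmeans(X, k=3)
print(labels)
print(centers)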
K-means example (1/4)
k=3
S1: Select any of the three centers: k1, k2, k3
(Figure: data points in the X-Y plane with the three initial centers k1, k2, k3)
K-means example (2/4)
(Figure: each point is assigned to its closest center k1, k2, or k3)
K-means example (3/4)
(Figure: the centers k1, k2, k3 are recomputed and move to new positions)
K-means example (4/4)
(Figure: with the new centers, 3 points are reassigned to a different group)
k-means
◎ Pros:
○ Simple and efficient. Complexity O(tkn), where t = number of iterations, k = number of groups, n = number of objects, and t, k ≪ n.
○ Often reaches a local optimum.
◎ Cons:
○ Applicable only to objects in a continuous n-dimensional space.
○ The number of groups k must be specified in advance.
○ Sensitive to noise and outliers.
○ Not suitable for discovering groups with non-convex (e.g., ring-shaped) shapes.
Exercise 1
◎ Use the K-means algorithm with Euclidean distance to cluster the 8 samples into 3 groups:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
The Euclidean distance matrix is given on the following slide.
Assume the initial seeds (centers) are k1 = A1, k2 = A4, and k3 = A7. Run one iteration of K-means. Which groups are formed? What is the new center of each group?
◎ In a 10 x 10 space, draw the samples grouped after this one iteration (draw the group boundaries) and the center of each group. After how many iterations does K-means converge (stop)? Illustrate the result at each iteration.
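To check your hand computation, the following sketch runs exactly one assignment-and-update iteration with the given seeds (NumPy and the variable names are assumptions; it only automates the arithmetic of the exercise):

# Sketch: one K-means iteration for Exercise 1 (Euclidean distance).
import numpy as np

A = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # A1..A8
centers = A[[0, 3, 6]]                    # seeds k1 = A1, k2 = A4, k3 = A7

d = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
groups = d.argmin(axis=1)                 # index (0, 1, 2) of the closest seed
new_centers = np.array([A[groups == j].mean(axis=0) for j in range(3)])

print("group of each sample (1 = k1, 2 = k2, 3 = k3):", groups + 1)
print("new centers:", new_centers)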