
CE6146

Introduction to Deep Learning


Deep Learning Basics
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/9/19
Outline

• Basic Concepts

• Essential Mathematics for Deep Learning

• Building and Evaluating Deep Learning Models

• Frameworks for Deep Learning: Keras and PyTorch

2
Intended Learning Outcomes

By the end of this lecture, you will be able to:


• Understand key terminologies and differentiate between Artificial Intelligence,
Machine Learning, and Deep Learning.
• Identify different learning tasks and machine learning paradigms.
• Apply essential linear algebra, probability, and numerical computation concepts in
deep learning.
• Explore basic machine learning concepts relevant to deep learning.
• Understand the basic architecture and features of Keras and PyTorch frameworks.
3
Basic Concepts
• Key Terminologies
• What is Artificial Intelligence, Machine Learning, and Deep Learning
• Learning Tasks
• Machine Learning Paradigms
Key Terminologies

• Data: Information used for analysis, training, and testing models. A single data instance may include a label and features.

• Label: The output or target value used in supervised learning (only supervised learning uses labels).
• Feature: An individual measurable property or characteristic of the data being analyzed.
• Model: An algorithm or mathematical representation that makes predictions or decisions based on data; often treated as a "black box."
• Training Dataset: The subset of data used to train the model.
• Validation Dataset: A portion of the data used to tune the model's hyperparameters and prevent overfitting.
• Test Dataset: Data that is completely unseen by the model and used to assess its final performance; also called an independent dataset.
5
What is Artificial Intelligence (AI)

• Definition:
Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. AI includes any technique that enables computers to mimic human intelligence, including rule-based systems (where the rules can be defined explicitly), machine learning, and deep learning.
• Key Characteristics:
‐ Learning: The ability to learn from data and experiences, and improve over time.
‐ Reasoning: The ability to make decisions and solve problems based on available information.
‐ Self-Correction: The ability to recognize and correct errors to improve performance.
6
What is Machine Learning (ML)

• Definition:
Machine learning is a subset of artificial intelligence focused on building systems that learn from data to make decisions or predictions, without being explicitly programmed for each task. It is used when explicit rules cannot be clearly defined, although human intervention (e.g., feature engineering) is still required.
• Key Characteristics:
‐ Data-driven: Models learn from data rather than being explicitly programmed with rules.
‐ Adaptable: Models can adapt and improve over time with more data.
‐ Predictive: Used for making predictions or decisions based on input data.
7
What is Deep Learning (DL)

• Definition:
Deep learning is a subset of machine learning that uses neural networks with many layers (deep networks) to model complex patterns in data. These multi-layered networks automatically learn hierarchical representations from raw data, so they handle unstructured data well and often require no manual feature engineering or hand-specified rules.
• Key Characteristics:
‐ Automated Feature Extraction: Neural networks automatically learn features from raw data.
‐ Scalability: Effective with large datasets and complex problems.
‐ High Performance: Achieves state-of-the-art results in various tasks like image recognition and natural language processing.
8
Difference Between ML and DL

Aspect              | Machine Learning                                 | Deep Learning
Feature Extraction  | Manual (features engineered by hand)             | Automated (features learned from data)
Model Complexity    | Simpler models                                   | Complex neural networks
Data Requirements   | Smaller datasets                                 | Large datasets
Performance         | Good for structured data, limited scalability    | Excels with unstructured data, scalable
Example Algorithms  | Linear Regression, Decision Trees, Random Forest | Convolutional Neural Networks (CNNs)

9
10

Source: 1. https://reurl.cc/yLq3Q6 2. https://reurl.cc/8vkgpb 3. https://reurl.cc/jWAn1L


Source: https://www.youtube.com/shorts/CxT5DVZmWCU 11
Key Applications of Deep Learning

• Image Recognition: Identifying objects within images (computer vision).

• Natural Language Processing (NLP): Understanding and generating human language.

• Autonomous Systems: Self-driving cars, drones.

• Healthcare: Disease detection from medical images, personalized medicine.

• Finance: Fraud detection, algorithmic trading.

12
What are Learning Tasks

• Learning tasks are specific problems or objectives that an algorithm is designed to solve through learning from data.
• Learning tasks form the core of what algorithms aim to achieve, whether in machine learning or deep learning.
• Learning tasks guide the design and implementation of models, determining how algorithms are trained, validated, and deployed in real-world applications.

13
Types of Learning Tasks

• Classification: The task of assigning input data into predefined categories or classes (labeled data, discrete classes).
• Regression: The task of predicting a continuous numerical value based on input data (labeled data, numerical targets). Classification and regression tasks can often be converted into one another.
• Clustering: The task of automatically grouping similar instances together without predefined labels.

Source: https://medium.datadriveninvestor.com/problems-with-classification-examples-from-real-life-645b7b756e96 14
What are Machine Learning Paradigms

• Machine learning paradigms refer to the different approaches or methodologies used to train machine learning models.
• Machine learning paradigms specify how the model learns from data and how it is guided to make predictions or decisions.
• The three primary paradigms are supervised learning, unsupervised learning, and reinforcement learning.

15
Types of Machine Learning Paradigms

• Supervised Learning: Learning from labeled data (input-output pairs).
• Unsupervised Learning: Learning from unlabeled data to find hidden patterns; no labels are required.
• Reinforcement Learning: Learning by interacting with an environment to maximize cumulative reward, through repeated interaction with a dynamic environment and without human intervention.

Source: 1. https://botpenguin.com/glossary/supervised-learning 2. https://botpenguin.com/glossary/unsupervised-learning 3. https://botpenguin.com/glossary/reinforcement-learning 16


Source: https://doi.org/10.3389/fphar.2021.720694 17
Source: https://www.youtube.com/watch?v=6M5VXKLf4D4 18
Essential Mathematics for Deep Learning
• Linear Algebra
• Probability and Information Theory
• Numerical Computation
Why Mathematics is Essential in DL

• Mathematics is the foundation of deep learning models because it allows us to represent and transform data, optimize parameters, and measure performance.
• Core areas include:
‐ Linear Algebra: Handles matrix and vector operations for data manipulation,
parameter updates, and neural network layers.
‐ Calculus: Used for optimization through gradient-based methods like backpropagation.
‐ Probability & Statistics: Necessary for understanding model evaluation, predictions,
and uncertainty in deep learning.
20
The Role of Linear Algebra in DL

• Linear algebra is fundamental in deep learning because most computations


(input transformations, weight updates, activations) are expressed as matrix
or vector operations.
• Vectors, matrices, and tensors are the building blocks for representing data,
weights, and operations in neural networks.
• Weight matrices, activation vectors, gradient computations, and
transformation operations are all handled through linear algebra.
21
Vectors and Matrices in DL

• Vectors are 1D arrays of numbers, e.g., [x1, x2], that represent features of data (e.g., pixel intensities in an image or values in a dataset).
• Matrices are 2D arrays of numbers, e.g., [[x1, y1], [x2, y2]], that represent transformations (e.g., weight matrices that map input features to outputs in a neural network layer).

22
Matrix Operations in DL

• Matrix Addition: Adds corresponding elements from two matrices. Used in neural networks to incorporate biases into the linear transformation of inputs.
• Matrix Multiplication: Multiplies rows of one matrix by columns of another. In deep learning, this is used to combine inputs with weights.
• Matrix Transpose: Switches rows and columns. In backpropagation, transposed matrices are used to compute weight gradients.
• In deep learning, we apply matrix multiplications repeatedly through each layer of the network to transform input data into predictions (see the sketch below).

23
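A minimal NumPy sketch of these operations in a single dense layer (the shapes, data, and variable names are illustrative assumptions, not taken from the slides):

import numpy as np

X = np.random.randn(4, 3)   # a batch of 4 samples with 3 features each
W = np.random.randn(3, 2)   # weight matrix mapping 3 input features to 2 outputs
b = np.zeros(2)             # bias vector

# Matrix multiplication combines inputs with weights; matrix addition
# incorporates the bias into the linear transformation.
Z = X @ W + b               # shape (4, 2)

# In backpropagation, the transpose is used to compute the weight gradient.
dZ = np.ones_like(Z)        # dummy upstream gradient of the loss w.r.t. Z
dW = X.T @ dZ               # shape (3, 2), matches the shape of W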
Eigenvalues and Eigenvectors

• Eigenvalues and eigenvectors are crucial in understanding transformations in data and feature extraction techniques like Principal Component Analysis (PCA).
• An eigenvector is a direction in the data that does not change direction when a transformation (matrix) is applied; only its magnitude changes (determined by the eigenvalue).
• Eigenvector Decomposition: Given a square matrix A, the eigenvector v satisfies:
A · v = λ · v
Here, λ is the eigenvalue, representing the scaling factor, and v is the eigenvector (a numerical check appears below).

24
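The decomposition can be checked numerically; this small NumPy sketch (the example matrix is an assumption chosen for illustration) verifies A · v = λ · v for one eigenpair:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # a small symmetric matrix, e.g., a covariance matrix in PCA

eigenvalues, eigenvectors = np.linalg.eig(A)
lam, v = eigenvalues[0], eigenvectors[:, 0]

print(np.allclose(A @ v, lam * v))   # True: A·v equals λ·v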
Tensors in DL

• Tensors generalize vectors and matrices to higher dimensions.


• A tensor can be 1D (vector), 2D (matrix), or higher (3D, 4D, etc.), and they
are essential in representing multi-dimensional data (e.g., batches of images)
in deep learning.
• In frameworks like PyTorch and TensorFlow, tensors are the primary data
structure for storing inputs, weights, and intermediate activations.

25
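A short PyTorch sketch of tensors of different ranks (the shapes are illustrative assumptions):

import torch

vector = torch.tensor([1.0, 2.0, 3.0])   # 1D tensor (vector)
matrix = torch.randn(3, 2)               # 2D tensor (matrix)
images = torch.randn(8, 3, 32, 32)       # 4D tensor: a batch of 8 RGB images of size 32x32

# Tensors support matrix operations and can track gradients for backpropagation.
weights = torch.randn(3, 2, requires_grad=True)
output = vector @ weights                # shape (2,)
print(images.shape, output.shape)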
Calculus in DL

• Calculus is used for optimizing neural networks. Specifically, derivatives and gradients help determine how model parameters should change to minimize the loss function.
• Key Operations:
‐ Derivatives measure how a small change in input affects the output.
‐ Gradients (vectors of partial derivatives) are used to find the direction in which to update the model's parameters to reduce error.

26
Derivatives and Gradients

• The derivative of a function measures the rate of change of the function's


output with respect to a change in its input.
• In Deep Learning:
‐ The derivative of the loss function with respect to the model's weights tells
us how the weights should be adjusted to reduce the loss.
‐ Gradient (Vector of Derivatives):The gradient points in the direction of the
steepest increase in the loss function. By moving in the opposite direction,
we minimize the loss. 27
Gradient Descent for Optimization

• Gradient Descent Algorithm:
θ_new = θ_old − η ∇_θ J(θ),
where θ represents the model's parameters, η is the learning rate, and ∇_θ J(θ) is the gradient of the loss function with respect to θ.
• The learning rate η controls the step size of each update. If it is too high, the model may overshoot the optimal solution; if it is too low, training may be slow or get stuck in local minima.
• Gradient descent adjusts the model's weights in the direction that reduces the loss function. This is the core optimization method used in training deep learning models (a minimal sketch follows below).

28
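A minimal sketch of the update rule on a toy one-parameter loss J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning-rate value is an illustrative assumption:

theta = 0.0    # initial parameter
eta = 0.1      # learning rate (step size)

for step in range(100):
    grad = 2 * (theta - 3)         # gradient of the loss at the current theta
    theta = theta - eta * grad     # theta_new = theta_old - eta * gradient

print(theta)   # converges toward the minimizer theta = 3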
Gradient Descent for Optimization

Figure 4.3 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron
Courville.
Figure 4.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron
Courville.
29
Backpropagation and Chain Rule

• Backpropagation uses the chain rule to compute gradients for all parameters
in a neural network. It allows us to efficiently compute the derivatives of the
loss function with respect to each parameter in the network.
• The chain rule states:
d/dx f(g(x)) = f′(g(x)) · g′(x)
In a neural network, this rule is applied layer by layer to propagate the error back from the output to the input, updating the weights accordingly.
30
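A small PyTorch sketch showing autograd applying the chain rule to a composite function (the function choice is an illustrative assumption):

import torch

# f(g(x)) with g(x) = x^2 and f(u) = sin(u); the chain rule gives cos(x^2) * 2x.
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)

y.backward()                                     # autograd applies the chain rule
manual = (torch.cos(x ** 2) * 2 * x).detach()    # the derivative computed by hand
print(torch.allclose(x.grad, manual))            # True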
Probability and Statistics in DL

• In deep learning, models often make predictions based on uncertain data. Understanding probability helps quantify and interpret these uncertainties.
• Statistics allows us to assess model performance, optimize decision-making, and understand data distributions.
• Key Applications:
‐ Probability is essential for making predictions and understanding model uncertainty.
‐ Statistics is used for model evaluation, understanding distributions, and performance measurement.

31
Understanding Random Variables

• A random variable is a variable that can take different values based on some random process or experiment. In deep learning, we often treat model outputs (predictions) as random variables. In classification, for example, the predicted class probabilities from a softmax output can be treated as random variables that follow a discrete distribution.
• Types of Random Variables:
‐ Discrete Random Variable: Takes on a countable set of values (e.g., binary classification outputs like 0 or 1).
‐ Continuous Random Variable: Takes on any value within a range (e.g., regression outputs like temperature predictions).

32
Probability Distributions in DL

• Bernoulli Distribution: Used for binary classification tasks (e.g., predicting whether an image is a cat or not).
• Gaussian (Normal) Distribution: The bell-shaped curve; assumes that the prediction errors or noise in a model are normally distributed.
• In a regression task, the output prediction might follow a normal distribution centered around the predicted value, with some variance representing the uncertainty in the prediction.

33
Bernoulli Distribution in Binary Classification

• Bernoulli Distribution: Used to model binary classification tasks, where the


output is either 0 or 1.
P(X = 1) = p, P(X = 0) = 1 − p
• In binary classification (e.g., spam detection), the model assigns a probability
p that the email is spam (1), and 1 − p that it is not spam (0).
• Real-World Example: Logistic regression outputs probabilities for binary classification tasks, and the decision is made by thresholding this probability (e.g., if p > 0.5, classify as 1), as in the sketch below.

34
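A minimal sketch of the thresholding decision and the Bernoulli log-likelihood (the probabilities and labels are made-up illustrative values):

import numpy as np

p = np.array([0.92, 0.10, 0.55, 0.48])   # predicted spam probabilities from a hypothetical model
predictions = (p > 0.5).astype(int)      # Bernoulli decision: classify as 1 (spam) when p > 0.5

y = np.array([1, 0, 1, 1])               # observed labels
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(predictions, log_likelihood)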
Gaussian (Normal) Distribution in Regression

• Gaussian Distribution: A continuous probability distribution commonly used


to model prediction errors in regression tasks.
• The errors (residuals) in regression models often assume a normal
distribution, allowing the model to predict not just a point estimate but also a
measure of uncertainty (e.g., prediction intervals).
• Example: In a temperature prediction model, you might predict that the temperature tomorrow will be 25°C with a standard deviation of 2°C, implying that most of the time (about 68%), the actual temperature will lie between 23°C and 27°C.
35
Maximum Likelihood Estimation

• MLE is used to estimate the parameters of a model by maximizing the


likelihood of the observed data.
• In neural networks, this often translates to minimizing the negative log-
likelihood. The objective is to find the parameter θ that maximizes the
likelihood of the observed data.
• Example: For a classification task, maximizing the likelihood of the correct
class label is equivalent to minimizing the cross-entropy loss.
36
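A small numerical sketch of this equivalence for a 3-class problem (the probabilities and labels are illustrative assumptions):

import numpy as np

probs = np.array([[0.7, 0.2, 0.1],       # softmax outputs for 3 samples
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
labels = np.array([0, 1, 2])             # true class indices

# Negative log-likelihood of the correct classes ...
nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
# ... which is exactly the mean cross-entropy loss used to train classifiers.
print(nll)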
Information Theory in DL

• Entropy (H) measures the uncertainty of a probability distribution; it is used to quantify the unpredictability in classification tasks:
H(X) = − Σ_i P(x_i) log P(x_i)
• Kullback–Leibler (KL) Divergence measures how one probability distribution differs from another reference distribution; it is often used in Variational Autoencoders (VAEs) and model regularization:
D_KL(P || Q) = Σ_i P(x_i) log( P(x_i) / Q(x_i) )

37
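Both quantities follow directly from the formulas; a minimal NumPy sketch (the distributions are illustrative, and zero probabilities are not handled):

import numpy as np

def entropy(p):
    # H(X) = -sum_i P(x_i) log P(x_i)  (natural log; use log2 for bits)
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(x_i) log(P(x_i) / Q(x_i))
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(entropy(p), kl_divergence(p, q))   # KL is 0 only when P and Q are identical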
Building and Evaluating Deep Learning Models
Flow for Constructing a DL Model

1) Data Collection and Preprocessing


2) Dataset Splitting (Training, Validation, Test)
3) Model Design (Architecture Selection)
4) Model Training
5) Model Evaluation
6) Model Optimization
Following a structured model development flow ensures that the model
generalizes well to unseen data and performs reliably. 39
Data Collection and Preprocessing

• Data Collection:
‐ Collect data relevant to the problem you are solving (e.g., images, text, structured data).
‐ Ensure that the data is representative of the real-world scenarios the model will encounter.

• Data Preprocessing:
‐ Cleaning the Data: Handle missing values, remove duplicates.
‐ Normalization/Standardization: Ensure input features have the same scale (important for
neural networks).
‐ Augmentation (for images): Apply transformations like rotations, flips, and scaling to
artificially expand the dataset.
40
Dataset Splitting
• Why Split the Dataset?
To prevent the model from learning patterns specific to the training data and ensure it
can generalize to unseen data.
• Dataset Splitting:
‐ Training Set: Used to train the model (typically 70%-80% of the data).
‐ Validation Set: Used to tune hyperparameters and monitor overfitting (typically 10%-
15%).
‐ Test Set: Used to evaluate the model’s final performance on unseen data (typically
10%-15%).
41
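A minimal sketch of a 70% / 15% / 15% split using scikit-learn's train_test_split (the data here is random placeholder data):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 10), np.random.randint(0, 2, 1000)   # placeholder features and labels

# First split off the test set (15%), then carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.176, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 700 / 150 / 150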
Model Design

• Choosing a Model Architecture:


‐ Select a model architecture suitable for the problem (e.g., CNN for image
data, RNN for sequence data).
‐ Layers and Neurons: Decide how many layers and neurons each layer
should have based on the complexity of the problem.

42
Model Training

• Training the Model:


‐ Compile the Model: Choose the loss function and optimizer.
‐ Fit the Model: Train the model using the training set and monitor performance on the
validation set.

• Hyperparameters:
‐ Learning Rate: Controls how much to adjust the model weights with each training step.
‐ Batch Size: Determines how many samples are processed before the model updates its
weights.
43
Overfitting and Underfitting

• Overfitting (overtraining):
‐ Occurs when the model performs well on the training set but poorly on the
validation/test set.
‐ Symptoms: Training accuracy continues to improve, but validation accuracy stagnates
or decreases.

• Underfitting:
‐ Occurs when the model performs poorly on both the training and validation sets.
‐ Symptoms: Low training and validation accuracy.
44
Overfitting and Underfitting

Figure 5.2 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 45
Overfitting and Underfitting

(Note: in practice, the optimal capacity is a range rather than a single point.)

Figure 5.3 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 46
Bias-Variance Tradeoff

• Bias: Error introduced by simplifying assumptions in the model. High bias


models tend to underfit the data and have poor performance on both training
and test sets.
• Variance: Error introduced by the model's sensitivity to small fluctuations in the training data, i.e., how much the model's predictions change if we train it on different training datasets. High variance models tend to overfit, performing well on the training set but poorly on unseen data.
As model complexity increases, bias decreases but variance increases.


The goal is to find a model that balances these two sources of error to minimize total error.
47
Bias-Variance Tradeoff

(Illustration: in general, a more complex model can achieve lower bias because it has more details it can adjust; a low-variance model is one whose predictions do not vary much across training sets, whereas high-variance predictions are widely scattered.)

Source: https://medium.com/@ivanreznikov/stop-using-the-same-image-in-bias-variance-trade-off-explanation-691997a94a54 48
Model Evaluation

• Evaluating the Model Based on the Learning Task:


‐ Classification Tasks: Use accuracy, precision, recall, F1-score, ROC-AUC to assess
model performance.
‐ Regression Tasks: Use mean squared error (MSE), mean absolute error (MAE), R-
squared to evaluate continuous output models.
‐ Unsupervised Learning Tasks: Use silhouette score, Davies-Bouldin index, or
clustering evaluation metrics.
The performance of a model depends on how well it aligns with the task's goals
(e.g., accuracy may not be sufficient for imbalanced classification).
49
Evaluating Classification Models (1/3)

• Accuracy: Proportion of correct predictions out of all predictions.
• Precision: Proportion of true positives among predicted positives.
• Recall (Sensitivity): Proportion of actual positives correctly predicted.
• F1-Score: The harmonic mean of precision and recall, useful when you need
to balance false positives and false negatives.
• ROC-AUC: Area under the Receiver Operating Characteristic curve. It helps
measure the trade-off between true positive rate (TPR) and false positive rate
(FPR). 50
Evaluating Classification Models (2/3)

Confusion matrix (rows: true condition status, columns: test result):
                Predicted Positive    Predicted Negative
Disease (P)     True Positive (TP)    False Negative (FN)
Health (N)      False Positive (FP)   True Negative (TN)

• Sensitivity (hit rate or recall) = TP / P = TP / (TP + FN)
• Specificity = TN / N = TN / (TN + FP)
• Precision (Positive Predictive Value, PPV) = TP / (TP + FP)
• Accuracy (ACC) = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)
• F1 score = 2TP / (2TP + FP + FN), which is less distorted by extreme class imbalance than accuracy.
• Matthews correlation coefficient (MCC) = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

51
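A minimal sketch computing these metrics from assumed confusion-matrix counts (the counts are illustrative, not from the slides):

import numpy as np

TP, FN, FP, TN = 40, 10, 5, 45            # illustrative counts

sensitivity = TP / (TP + FN)              # recall / hit rate
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)              # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(sensitivity, specificity, precision, accuracy, f1, mcc)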
Evaluating Classification Models (3/3)

• Let X be the predicted scores of the positive-label examples, and Y those of the negative-label examples.
• TPR(c) = Pr(X > c)
• FPR(c) = Pr(Y > c)
• AUC = ∫₀¹ ROC(t) dt
• AUC = (1 / (n_x · n_y)) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} I(X_i > Y_j)
• The closer the AUC is to 1, the better the model.

52
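The pairwise formula above can be evaluated directly; a minimal sketch with made-up scores (ties counted as 1/2, a common convention):

import numpy as np

X = np.array([0.9, 0.8, 0.6, 0.55])   # scores of the n_x positive examples
Y = np.array([0.7, 0.4, 0.3])         # scores of the n_y negative examples

# AUC = (1 / (n_x * n_y)) * sum_i sum_j I(X_i > Y_j)
pairs = (X[:, None] > Y[None, :]).astype(float) + 0.5 * (X[:, None] == Y[None, :])
auc = pairs.mean()
print(auc)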
Evaluating Regression Models

• Mean Squared Error (MSE): Measures the average squared difference


between the predicted and actual values.
• Mean Absolute Error (MAE): Measures the average absolute difference
between the predicted and actual values.
• R-Squared (R²): Indicates the proportion of variance explained by the model.
An R² of 1 means the model perfectly fits the data.

53
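A minimal NumPy sketch of these three metrics on made-up values:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
r2  = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # R^2 = 1 - SS_res / SS_tot

print(mse, mae, r2)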
Evaluating Unsupervised Learning Models

• Silhouette Score: Measures how similar an object is to its own cluster


compared to other clusters. A higher score indicates better-defined clusters.
• Davies-Bouldin Index: Measures the average similarity ratio between each
cluster and its most similar cluster. A lower value indicates better clustering.
• Cluster Purity: Evaluates the extent to which each cluster contains only data
points from a single class.

54
Preventing Overfitting

• Regularization Techniques:
‐ L2 Regularization: Adds a penalty for large weights in the model.
‐ Dropout: Randomly drops a percentage of neurons during training to reduce
dependency on specific neurons.

• Early Stopping: Stop training when validation performance stops improving


to prevent overfitting.

55
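A minimal Keras sketch combining these techniques (the layer sizes, rates, and patience values are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=keras.regularizers.l2(0.01)),   # L2 penalty on large weights
    layers.Dropout(0.5),                                            # randomly drop 50% of units during training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt training once validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])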
Frameworks for DL: Keras and PyTorch
• Keras
• PyTorch
Introduction to Keras and PyTorch

• Deep learning frameworks like Keras and PyTorch allow developers to focus
on model design and training without dealing with low-level matrix
operations or gradient computation.
• They provide pre-built functionalities for creating neural networks,
optimizing them, and handling large datasets efficiently.

57
Keras

• Keras is a high-level API built on top of TensorFlow. It simplifies model


building by abstracting complex backend operations, allowing for rapid
prototyping.
• Key Features:
‐ Sequential API: Simple linear stack of layers.
‐ Functional API: For more complex models (multi-input/output).

58
Keras Key Functionalities

• Loss Functions and Optimizers: Built-in loss functions (e.g., MSE, cross-
entropy) and optimizers (e.g., SGD, Adam).
• Callbacks: Customizable training callbacks for early stopping, learning rate
scheduling, etc.
• Data Handling: In-built support for image, text, and sequence data.

59
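A minimal Keras sketch of the Sequential API with a built-in optimizer, loss, and callback (the architecture and hyperparameters are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# Sequential API: a simple linear stack of layers for 10-class classification
# of 28x28 grayscale images.
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Built-in optimizer and loss function.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Callbacks customize training (early stopping, learning-rate scheduling, ...).
callbacks = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)]
# model.fit(x_train, y_train, validation_split=0.15, epochs=20, batch_size=32, callbacks=callbacks)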
PyTorch Overview

• PyTorch is a low-level, flexible deep learning framework popular in academic


research. It features dynamic computation graphs, allowing more freedom in
experimenting with model structures.
• Key Features:
‐ Dynamic Computation Graphs: Graphs are built on the fly, making it
easier to debug and adjust models during training.
‐ Autograd: Automatic differentiation library that computes gradients during backpropagation.
60
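A minimal PyTorch sketch of a model definition and one training step, showing autograd computing gradients (the architecture, data, and hyperparameters are illustrative assumptions):

import torch
import torch.nn as nn

class MLP(nn.Module):
    # A small feed-forward network; with dynamic graphs, the forward pass is plain Python.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))   # dummy batch of 32 samples
logits = model(x)
loss = loss_fn(logits, y)

optimizer.zero_grad()
loss.backward()    # autograd computes gradients for all parameters via backpropagation
optimizer.step()   # gradient-based parameter update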
Tensors in PyTorch and Keras

• Tensors are the core data structure in both Keras and PyTorch, representing
multi-dimensional arrays.
• Tensors enable matrix operations and hold gradients for backpropagation.

61
Key Differences Between Keras and PyTorch

• Keras vs PyTorch:
‐ Ease of Use: Keras is user-friendly, suited for beginners, and focuses on rapid
prototyping.
‐ Flexibility: PyTorch offers more flexibility, making it preferred for research, where
dynamic changes to the computation graph are required.

• Computation Graph:
‐ Keras uses static computation graphs (compiled before training).
‐ PyTorch uses dynamic computation graphs (created on the fly).
62
Q&A
