AI2025 Lecture05 Inperson Slide
Model = Architecture + Parameters

[Figure: an input image is flattened into a vector of pixel values, 𝒙 = (123, 94, …)⊤; the model maps 𝒙 to an output label: 1 = Cat, 0 = Not a Cat.]
Learning Process

[Figure: Input → Model (many vector/matrix operations) → Output; compare the output with the label to compute the loss, then update the parameters to minimize the loss.]
Linear model:   ŷ = 𝒘⊤𝒙 + 𝑏        Parameters: 𝒘 ∈ ℝⁿ, 𝑏 ∈ ℝ
Logistic model: ŷ = σ(𝒘⊤𝒙 + 𝑏)     Parameters: 𝒘 ∈ ℝⁿ, 𝑏 ∈ ℝ

Sigmoid: σ(x) = 1 / (1 + e^(−x)),  with σ(−∞) = 0 and σ(+∞) = 1
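As a minimal sketch of the two models above; the helper names and all numbers are illustrative, not from the lecture code:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # logistic model: y_hat = sigma(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

# example with made-up parameters and input
w = np.array([0.5, -1.0])
b = 0.1
x = np.array([2.0, 1.0])
print(predict(w, b, x))   # a probability between 0 and 1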
Logistic Regression: Cost Function
▪ Given (𝒙^(1), 𝑦^(1)), …, (𝒙^(m), 𝑦^(m)), we want ŷ^(i) = 𝑃(𝑦^(i) = 1); roughly speaking, ŷ^(i) ≈ 𝑦^(i)
▪ Loss for a single example:
  𝐿(ŷ, 𝑦) = −𝑦 log ŷ − (1 − 𝑦) log(1 − ŷ)
  If 𝑦 = 1: 𝐿(ŷ, 𝑦) = −log ŷ
  If 𝑦 = 0: 𝐿(ŷ, 𝑦) = −log(1 − ŷ)
▪ Cost function:
  𝐽(𝒘, 𝑏) = (1/𝑚) Σ_{i=1}^{m} 𝐿(ŷ^(i), 𝑦^(i))
▪ Logistic regression model: ŷ = σ(𝒘⊤𝒙 + 𝑏)
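A minimal numpy sketch of this loss and cost; the function names and example values are illustrative, not from the lecture code:

import numpy as np

def loss(y_hat, y):
    # per-example cross-entropy: L(y_hat, y) = -y*log(y_hat) - (1-y)*log(1-y_hat)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(y_hat, y):
    # J(w, b): average loss over the m training examples
    return np.mean(loss(y_hat, y))

# example: predictions close to the labels give a small cost
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
print(cost(y_hat, y))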
Logistic Regression – Optimization
Optimization
▪ Logistic regression model
  ŷ = σ(𝒘⊺𝒙 + 𝑏), where σ(z) = 1 / (1 + e^(−z))
▪ Cost function
  𝐽(𝒘, 𝑏) = (1/𝑚) Σ_{i=1}^{m} 𝐿(ŷ^(i), 𝑦^(i)) = −(1/𝑚) Σ_{i=1}^{m} [𝑦^(i) log ŷ^(i) + (1 − 𝑦^(i)) log(1 − ŷ^(i))]
▪ Our goal
  ▪ Find parameters 𝒘 ∈ ℝⁿ, 𝑏 ∈ ℝ that minimize 𝐽(𝒘, 𝑏)
  ▪ Gradient Descent! Repeatedly update
    θ* = θ − η ⋅ ∇_θ 𝐽(θ)
Computing the Parameters with Gradient Descent
ŷ = σ(𝒘⊤𝒙 + 𝑏)
𝐽(𝒘, 𝑏) = (1/𝑚) Σ_{i=1}^{m} 𝐿(ŷ^(i), 𝑦^(i))
Gradient descent update: θ* = θ − η ⋅ ∇_θ 𝐽(θ), where
θ = (𝑤₁, 𝑤₂, …, 𝑤ₙ, 𝑏)⊤ and ∇_θ 𝐽(θ) = (∂𝐽/∂𝑤₁, ∂𝐽/∂𝑤₂, …, ∂𝐽/∂𝑤ₙ, ∂𝐽/∂𝑏)⊤

Two useful derivatives:
1. (d/dx) logₑ x = (d/dx) ln x = 1/x
2. (d/dx) σ(x) = σ(x)(1 − σ(x)), since for σ(z) = 1/(1 + e^(−z)):
   (d/dz) σ(z) = e^(−z) / (1 + e^(−z))² = σ(z)(1 − σ(z))
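A quick numerical check of derivative fact 2, as a hedged sketch (the evaluation point z = 0.7 and the step h are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# check d(sigma)/dz = sigma(z) * (1 - sigma(z)) numerically at one point
z, h = 0.7, 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
analytic  = sigmoid(z) * (1 - sigmoid(z))
print(numerical, analytic)   # the two values should agree closely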
▪ 𝐿(ŷ, 𝑦) = −𝑦 log ŷ − (1 − 𝑦) log(1 − ŷ)
▪ For simplicity, consider two features: 𝒙 = (𝑥₁, 𝑥₂)⊤, 𝒘 = (𝑤₁, 𝑤₂)⊤

[Computation graph: 𝑥₁, 𝑥₂, 𝑤₁, 𝑤₂, 𝑏  →  𝑧 = 𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑏  →  ŷ = 𝑎 = σ(𝑧)  →  𝐿(𝑎, 𝑦); the label 𝑦 is the target to learn.]

Chain rule through the graph:
d𝐿/d𝑤ₖ = (d𝐿/d𝑎)(d𝑎/d𝑧)(d𝑧/d𝑤ₖ)
       = (−𝑦/𝑎 + (1 − 𝑦)/(1 − 𝑎)) ⋅ σ(𝑧)(1 − σ(𝑧)) ⋅ 𝑥ₖ
       = (−𝑦/𝑎 + (1 − 𝑦)/(1 − 𝑎)) ⋅ 𝑎(1 − 𝑎) ⋅ 𝑥ₖ
       = −𝑦(1 − 𝑎)𝑥ₖ + (1 − 𝑦)𝑎𝑥ₖ
       = (𝑎 − 𝑦)𝑥ₖ
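A small sketch that checks the result d𝐿/d𝑤ₖ = (𝑎 − 𝑦)𝑥ₖ against a finite-difference derivative; all parameter and input values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    # cross-entropy loss of a single example through the graph z -> a -> L
    a = sigmoid(np.dot(w, x) + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# illustrative values (not from the lecture)
w = np.array([0.3, -0.2]); b = 0.1
x = np.array([1.0, 2.0]);  y = 1.0

a = sigmoid(np.dot(w, x) + b)
analytic = (a - y) * x          # dL/dw_k = (a - y) * x_k from the derivation above

# central-difference check on w_1
h = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += h; w_minus[0] -= h
numerical = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h)
print(analytic[0], numerical)   # should match closely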
Logistic Regression Derivative – Calculate on your own
d𝐿/d𝑤₁ = ?
d𝐿/d𝑤₂ = ?
d𝐿/d𝑏 = ?

[Computation graph: 𝑥₁, 𝑥₂, 𝑤₁, 𝑤₂, 𝑏  →  𝑧 = 𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑏  →  ŷ = 𝑎 = σ(𝑧)  →  𝐿(𝑎, 𝑦)]

Answers, with the corresponding gradient-descent updates:
d𝐿/d𝑤₁ = (d𝐿/d𝑎)(d𝑎/d𝑧)(d𝑧/d𝑤₁) = (𝑎 − 𝑦)𝑥₁     𝑤₁* ≔ 𝑤₁ − η ⋅ d𝐿/d𝑤₁
d𝐿/d𝑤₂ = (d𝐿/d𝑎)(d𝑎/d𝑧)(d𝑧/d𝑤₂) = (𝑎 − 𝑦)𝑥₂     𝑤₂* ≔ 𝑤₂ − η ⋅ d𝐿/d𝑤₂
d𝐿/d𝑏  = (d𝐿/d𝑎)(d𝑎/d𝑧)(d𝑧/d𝑏)  = 𝑎 − 𝑦          𝑏* ≔ 𝑏 − η ⋅ d𝐿/d𝑏
Gradient descent on m examples
𝐽(𝒘, 𝑏) = (1/𝑚) Σ_{i=1}^{m} 𝐿(ŷ^(i), 𝑦^(i))

(d/d𝑤ₖ) 𝐽(𝒘, 𝑏) = (1/𝑚) Σ_{i=1}^{m} (d/d𝑤ₖ) 𝐿(ŷ^(i), 𝑦^(i)) = (1/𝑚) Σ_{i=1}^{m} (ŷ^(i) − 𝑦^(i)) 𝑥ₖ^(i)
(d/d𝑏) 𝐽(𝒘, 𝑏)  = (1/𝑚) Σ_{i=1}^{m} (d/d𝑏) 𝐿(𝑎^(i), 𝑦^(i))  = (1/𝑚) Σ_{i=1}^{m} (𝑎^(i) − 𝑦^(i))

Per-epoch updates (d_w1, d_w2, d_b are the gradients summed over the m examples):
▪ w1 -= lr * d_w1/m
▪ w2 -= lr * d_w2/m
▪ b -= lr * d_b/m
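A vectorized numpy sketch of these per-epoch updates; the function name, learning rate, and epoch count are assumptions for illustration, not the lecture's code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_epochs=1000):
    # Batch gradient descent over all m examples.
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_epochs):
        a = sigmoid(X @ w + b)          # predictions y_hat for all m examples
        dw = X.T @ (a - y) / m          # dJ/dw_k = (1/m) * sum_i (a_i - y_i) * x_k^(i)
        db = np.mean(a - y)             # dJ/db   = (1/m) * sum_i (a_i - y_i)
        w -= lr * dw                    # gradient-descent updates
        b -= lr * db
    return w, b

With a feature matrix X and label vector y, calling w, b = train_logistic(X, y) returns the fitted parameters.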
1 Epoch = one full pass over the m_train training examples

Training data, generated from the true parameters w = 2, b = 3:
m_train = 1000
w = 2
b = 3
x_data = 10*np.array(range(m_train))/m_train
y_data = x_data*w + np.random.randn(m_train) + b

Model definition: ŷ = w ⋅ x + b
Loss: MSE = (1/m_train) Σ_{n=1}^{m_train} (ŷₙ − yₙ)²
▪ Report guidelines (it doesn't have to be long; keep it short and easy to read)
  ▪ Plot the training data
  ▪ Describe the data shape in a few sentences, e.g., what does it look like?
  ▪ Guess your fitting results without coding and explain the reasoning behind your guess
  ▪ Attach your full code
  ▪ Show the training result → w, b, cost (J)
  ▪ How many times did you repeat training (number of epochs) to get a good result?
▪ Submit to iCampus
▪ Due: 3/19, 8:59 AM
import numpy as np
import matplotlib.pyplot as plt

# x_data, y_data, and m_train are defined in the data-generation step above

# Forward pass, loss, and gradients of the linear model y_hat = w*x + b
def forward_cal(w, b, x):
    return x*w + b

def loss_cal(w, b, x, y, y_pred):
    # squared error for one example
    return (y_pred - y)**2

def w_gradient(w, b, x, y, y_pred):
    # d/dw (y_pred - y)^2 = 2*x*(y_pred - y)
    return 2*x*(y_pred - y)

def b_gradient(w, b, x, y, y_pred):
    # d/db (y_pred - y)^2 = 2*(y_pred - y)
    return 2*(y_pred - y)

w = 0
b = 0
lr = 0.02
Nepoch = 400

# Training loop
for epoch in range(Nepoch):
    # initialize the accumulated gradients and loss for this epoch
    d_w = 0
    d_b = 0
    loss = 0
    for x_val, y_val in zip(x_data, y_data):
        y_pred = forward_cal(w, b, x_val)
        d_w += w_gradient(w, b, x_val, y_val, y_pred)
        d_b += b_gradient(w, b, x_val, y_val, y_pred)
        loss += loss_cal(w, b, x_val, y_val, y_pred)
    loss = loss/m_train
    w = w - lr * d_w/m_train
    b = b - lr * d_b/m_train
    if epoch % 100 == 0:
        print("epoch=%d, previous_loss=%f, w'=%f, b'=%f" % (epoch, loss, w, b))

plt.figure()
plt.plot(x_data, y_data, '*', label='Training data')
plt.plot(x_data, w*x_data + b, label='Trained model')
plt.xlabel('Input x')
plt.ylabel('Output y')
plt.legend()
plt.show()

▪ The fitted model is supposed to recover w = 2, b = 3.
▪ You may observe that training diverges if you give too large a learning rate.
▪ Launch Jupyter
▪ Select "Notebook", then select "Python 3"
▪ Press "Shift+Enter" to run each cell (including Markdown cells)
x1  x2 | y
0   0  | 0
0   1  | 0
1   0  | 0
1   1  | 1

𝑦 = 𝑥₁ AND 𝑥₂
Nepoch = 1000
cost_list = []
for epoch in range(Nepoch):
    cost = ~~                     # compute the cost J for this epoch here
    cost_list.append(cost)        # record it so the learning curve can be plotted later
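A hedged sketch of how the AND data above could be set up as numpy arrays for this loop (the array names X and y are illustrative):

import numpy as np

# the four AND training examples from the truth table above
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)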
Testing
▪ Result analysis
▪ model.predict((0, 0)) → 7.1e-6
  ▪ This means that when you give (0, 0), the result is 7.1e-6.
  ▪ Hence, our trained model tells us the output is 0.
▪ model.predict((1, 1)) → 0.98
  ▪ Our model predicts the output as 0.98, which means 1.
▪ The model predicts the output well.
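The analysis above amounts to thresholding the predicted probability at 0.5; a minimal sketch (to_label is an illustrative helper, and model.predict's output is assumed to be a probability):

# threshold the predicted probability at 0.5 to obtain the class label
def to_label(p):
    return 1 if p >= 0.5 else 0

print(to_label(7.1e-6))   # -> 0  (prediction for input (0, 0))
print(to_label(0.98))     # -> 1  (prediction for input (1, 1))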
▪ Explain whether the logistic regression model works well for the AND, OR, and XOR data
▪ For one of these operators, logistic regression won't work. Which one?
▪ Due: 2025/3/27 (Thu) 11:59 PM (after in-person lecture07)
▪ So you have two more classes (including today) in which to ask questions before finishing.
▪ Submit to iCampus
Import numpy
Inner product
Addition
Scalar multiplication
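A minimal numpy sketch of these operations (the values are illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b))   # inner product: 1*4 + 2*5 + 3*6 = 32
print(a + b)          # element-wise addition: [5. 7. 9.]
print(3 * a)          # scalar multiplication: [3. 6. 9.]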