Machine Learning Notes

SLOPE

Slope = Vertical change / Horizontal change

Through (–2, –2) and (1, 2):

m = (2 − (−2)) / (1 − (−2)) = 4/3

Through (–4, –1) and (2, 1):

m = (1 − (−1)) / (2 − (−4)) = 2/6 = 1/3

Positive slope: y = mx + b, m > 0
Negative slope: y = mx + b, m < 0
Zero slope: y = b
Undefined slope: x = a
LINEAR EQUATIONS

• Slope-intercept form: y = mx + b

• Standard form: Ax + By = C, where A and B are not both 0.


PARALLEL LINES

• Same slope (positive, negative, zero), or both vertical

Example: y = –2x – 1 and y = –2x + 4

PERPENDICULAR LINES

• Product of slopes is –1, or one is vertical and the other horizontal

Example: y = –2x – 1 and y = ½x – 1
EXAMPLE

Find the line that passes through (–2, 6) and is parallel to y = (2/3)x − 5/3:

m = 2/3
y = (2/3)x + b
(2/3)(−2) + b = 6
−4/3 + b = 6
b = 4/3 + 18/3 = 22/3

y = (2/3)x + 22/3

Find the line that passes through (–2, 6) and is perpendicular to y = (2/3)x − 5/3:

m = −3/2
y = −(3/2)x + b
−(3/2)(−2) + b = 6
3 + b = 6
b = 6 − 3 = 3

y = −(3/2)x + 3
BREAK-EVEN ANALYSIS

• Linear cost function: C(x) = mx + b, where m is the marginal cost, b is the fixed cost, and x is the number of items produced

• Revenue function: R(x) = px, where p is the price per unit and x is the number of units sold

• Profit function: P(x) = R(x) − C(x)

• Break-even point: the point where R(x) = C(x); occurs where the two lines intersect
EXAMPLE

The cost to produce x widgets is given by C(x) = 105x + 6000, and each widget sells for $250. Determine the break-even quantity.

Solution:

R(x) = 250x

250x = 105x + 6000
145x = 6000
x ≈ 41.38

Check: R(41) = 250(41) = 10,250 and C(41) = 105(41) + 6000 = 10,305, so selling 41 widgets is not enough.
R(42) = 250(42) = 10,500 and C(42) = 105(42) + 6000 = 10,410.

The break-even quantity is 42 widgets.
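The break-even arithmetic can be checked with a short Python sketch (the function names are illustrative, not from the original notes):

```python
import math

# Widget example: C(x) = 105x + 6000, selling price $250 per widget.
def cost(x):
    return 105 * x + 6000

def revenue(x):
    return 250 * x

def profit(x):
    return revenue(x) - cost(x)

x_exact = 6000 / 145                 # solve 250x = 105x + 6000
breakeven_qty = math.ceil(x_exact)   # round up: fractional widgets can't be sold

print(round(x_exact, 2))       # 41.38
print(breakeven_qty)           # 42
print(profit(41), profit(42))  # -55 90 (still losing at 41, profitable at 42)
```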
LEAST SQUARES LINE

Minimize the sum of the squares of the vertical distances from the data points to the line

y = mx + b

Data points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)

m = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²)

and

b = (Σy − mΣx) / n
SCATTERPLOT

Income from a side business. Let x represent the number of years since 1980 and y represent the income in thousands of dollars.

Year  Income
1980   8,414
1985   9,124
1990  10,806
1995  12,321
2000  15,638
2005  18,242
2010  24,792
2015  25,436
LEAST SQUARES CALCULATIONS

   x        y        xy       x²        y²
   0    8.414      0          0    70.795396
   5    9.124     45.62      25    83.247376
  10   10.806    108.06     100   116.769636
  15   12.321    184.815    225   151.807041
  20   15.638    312.76     400   244.547044
  25   18.242    456.05     625   332.770564
  30   24.792    743.76     900   614.643264
  35   25.436    890.26    1225   646.990096
 140  124.773   2741.325   3500  2261.57042   (column sums)

m = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²)
  = (8(2741.325) − 140(124.773)) / (8(3500) − 140²)
  = 0.5312

b = (Σy − mΣx) / n
  = (124.773 − 0.5312(140)) / 8
  = 6.3

y = 0.5312x + 6.3
GRAPH OF LEAST SQUARES LINE
LEAST SQUARES LINE PREDICTION
Use the least squares line 𝑦 = 0.5312𝑥 + 6.3 to predict income in 2025

Recall, 𝑥 is the number of years since 1980, so 𝑥 = 45 corresponds to 2025

y = 0.5312(45) + 6.3 = 30.204

Since 𝑦 is in thousands of dollars, the predicted income in 2025 is $30,204


CORRELATION COEFFICIENT

r = (nΣxy − Σx·Σy) / √[(nΣx² − (Σx)²) · (nΣy² − (Σy)²)]

  = (8(2741.325) − 140(124.773)) / √[(8(3500) − 140²) · (8(2261.57042) − 124.773²)]

  = 0.9691
PYTHON
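The code from the original Python slide is not preserved in this copy; a minimal plain-Python sketch that reproduces the least-squares slope, intercept, correlation coefficient, and 2025 prediction from the income data (variable names are illustrative) might look like:

```python
# Income data: x = years since 1980, y = income in thousands of dollars.
xs = [0, 5, 10, 15, 20, 25, 30, 35]
ys = [8.414, 9.124, 10.806, 12.321, 15.638, 18.242, 24.792, 25.436]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)

# Least squares slope and intercept.
m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b = (sy - m * sx) / n

# Correlation coefficient.
r = (n * sxy - sx * sy) / (((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5)

print(round(m, 4), round(b, 1), round(r, 4))  # 0.5312 6.3 0.9691
print(m * 45 + b)  # prediction for 2025 (x = 45): about 30.2 thousand dollars
```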
AVERAGE RATE OF CHANGE

The average rate of change of f(x) with respect to x as x changes from a to b is

(f(b) − f(a)) / (b − a)

Based on population projections for 2000 to 2050, the projected Hispanic population (in millions) for a certain country can be modeled by the exponential function

H(t) = 37.791(1.021)ᵗ

where t = 0 corresponds to 2000 and 0 ≤ t ≤ 50. Use H to estimate the average rate of change in the Hispanic population from 2000 to 2010.

The years 2000 and 2010 correspond to t = 0 and t = 10, respectively.

(H(10) − H(0)) / (10 − 0) = (37.791(1.021)¹⁰ − 37.791(1.021)⁰) / 10 ≈ 8.73/10 = 0.873

Tip: Use technology, and never round until the last step.

Based on this model, the Hispanic population increased at an average rate of approximately 873,000 people per year between 2000 and 2010.
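The average-rate computation above can be reproduced with a quick sketch (H is the model from the example):

```python
# H(t) = 37.791(1.021)^t, t = years since 2000; average rate of change 2000-2010.
def H(t):
    return 37.791 * 1.021 ** t

avg_rate = (H(10) - H(0)) / (10 - 0)
print(round(avg_rate, 3))  # 0.873 million people per year
```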
INSTANTANEOUS RATE OF CHANGE

Suppose a car is stopped at a traffic light. When the light turns green, the car begins to move along a straight road. Assume that the distance traveled by the car is given by s(t) = 3t², for 0 ≤ t ≤ 15, where t is time in seconds and s(t) is distance traveled in feet.

How do we find the exact velocity of the car at, say, t = 10? (Velocity represents both how fast something is moving and its direction, so velocity can be negative.)

A table of average velocities over shorter and shorter intervals suggests that the velocity at t = 10 is 60 ft/sec.

Consider the following, where h is small but not 0:

(s(10 + h) − s(10)) / ((10 + h) − 10)
  = (s(10 + h) − s(10)) / h
  = (3(10 + h)² − 3(10)²) / h
  = (3(100 + 20h + h²) − 300) / h
  = (300 + 60h + 3h² − 300) / h
  = (60h + 3h²) / h
  = h(60 + 3h) / h
  = 60 + 3h

lim(h→0) (s(10 + h) − s(10)) / h = lim(h→0) (60 + 3h) = 60 ft/sec
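A quick numeric check of this limit (a sketch; shrinking h shows the difference quotient 60 + 3h closing in on 60):

```python
# s(t) = 3t^2; average velocity over [10, 10 + h] for shrinking h.
def s(t):
    return 3 * t ** 2

for h in [1, 0.1, 0.01, 0.001]:
    print(h, (s(10 + h) - s(10)) / h)  # equals 60 + 3h: 63, 60.3..., 60.03..., 60.003...
```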
INSTANTANEOUS RATE OF CHANGE

The instantaneous rate of change for a function f when x = a is

lim(h→0) (f(a + h) − f(a)) / h

provided this limit exists. The expression (f(a + h) − f(a)) / h is called the difference quotient.

Alternate Form

The instantaneous rate of change for a function f when x = a can be written as

lim(b→a) (f(b) − f(a)) / (b − a)

provided this limit exists.
EXAMPLE

Suppose the total profit in hundreds of dollars from selling x items is given by P(x) = 2x² − 5x + 6. Find and interpret the following:

(a) The average rate of change of profit from x = 2 to x = 4
(b) The average rate of change of profit from x = 2 to x = 3
(c) The instantaneous rate of change of profit with respect to the number produced when x = 2

(a) (P(4) − P(2)) / (4 − 2) = ([2(4)² − 5(4) + 6] − [2(2)² − 5(2) + 6]) / 2 = (18 − 4)/2 = 7

The average rate of change of profit from x = 2 to x = 4 is $700 per item.

(b) (P(3) − P(2)) / (3 − 2) = ([2(3)² − 5(3) + 6] − [2(2)² − 5(2) + 6]) / 1 = 9 − 4 = 5

The average rate of change of profit from x = 2 to x = 3 is $500 per item.

(c) lim(h→0) (P(2 + h) − P(2)) / h
  = lim(h→0) ([2(2 + h)² − 5(2 + h) + 6] − 4) / h
  = lim(h→0) (8 + 8h + 2h² − 10 − 5h + 6 − 4) / h
  = lim(h→0) (2h² + 3h) / h
  = lim(h→0) (2h + 3) = 3

The instantaneous rate of change of profit with respect to the number of items produced when x = 2 is $300 per item.
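The three rates in this example can be verified numerically (a sketch; the instantaneous rate uses a small fixed h in place of the exact limit):

```python
# P(x) = 2x^2 - 5x + 6, profit in hundreds of dollars.
def P(x):
    return 2 * x ** 2 - 5 * x + 6

avg_2_to_4 = (P(4) - P(2)) / (4 - 2)   # 7 -> $700 per item
avg_2_to_3 = (P(3) - P(2)) / (3 - 2)   # 5 -> $500 per item

# The difference quotient at x = 2 simplifies to 2h + 3, so the limit is 3 -> $300 per item.
h = 1e-6
inst_at_2 = (P(2 + h) - P(2)) / h

print(avg_2_to_4, avg_2_to_3, round(inst_at_2, 3))  # 7.0 5.0 3.0
```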
SECANT AND TANGENT LINES

The slope of the secant line of the graph of y = f(x) containing the points (a, f(a)) and (a + h, f(a + h)) is given by

(f(a + h) − f(a)) / h

Slope of secant line = average rate of change

The slope of the tangent line of the graph of y = f(x) at the point (a, f(a)) is given by

lim(h→0) (f(a + h) − f(a)) / h

provided this limit exists. If this limit does not exist, then there is no tangent at the point.

Slope of tangent line = instantaneous rate of change
DEFINITION OF THE DERIVATIVE

The derivative of the function f at x is defined as

f′(x) = lim(h→0) (f(x + h) − f(x)) / h

The function f′(x) represents the instantaneous rate of change of y = f(x) with respect to x.

The function f′(x) represents the slope of the graph at any point x.

If f′(x) is evaluated at the point x = a, then it represents the slope of the curve, or the slope of the tangent line, at that point.
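The definition translates directly into a numerical approximation (a sketch: a small fixed h stands in for the limit, and `derivative` is an illustrative name, not a library function):

```python
# Approximate f'(a) using the difference quotient with a small h.
def derivative(f, a, h=1e-6):
    return (f(a + h) - f(a)) / h

# Example: f(x) = x^2 has f'(3) = 6; the approximation is close for small h.
print(derivative(lambda x: x ** 2, 3))
```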
APPLICATIONS OF DERIVATIVES

● Rate of Change of Quantities
● Increasing and Decreasing Functions
● Maxima and Minima
VISUALIZING GD: LEARNING RATE AND LOSS FUNCTION

Target: Find optimal model parameters to minimize the Loss
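As a sketch of the idea on a one-parameter loss L(w) = (w − 4)² with gradient 2(w − 4) (illustrative values, not from the original slide): the update w ← w − lr·∇L(w) converges when the learning rate is small enough, and overshoots and diverges when it is too large.

```python
# Gradient descent on L(w) = (w - 4)^2, whose gradient is 2(w - 4).
def grad(w):
    return 2 * (w - 4)

def gradient_descent(lr, steps=50, w=0.0):
    for _ in range(steps):
        w = w - lr * grad(w)  # move opposite the gradient, scaled by the learning rate
    return w

print(gradient_descent(lr=0.1))   # converges near the minimizer w = 4
print(gradient_descent(lr=1.05))  # learning rate too large: iterates diverge from 4
```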


K-Means Algorithm Overview
The K-means algorithm in mathematical terms

The K-means algorithm is a popular unsupervised machine learning method used for clustering data into K distinct groups based on feature similarity. Mathematically, it partitions a set of n data points in a d-dimensional space into K clusters {C₁, C₂, …, C_K} such that the within-cluster sum of squares (WCSS) is minimized.

Below is a detailed mathematical exposition of the K-means algorithm.

1. Mathematical Formulation

1.1. Data Representation

Let X = {x₁, x₂, …, xₙ} be a set of n data points, where each xᵢ ∈ ℝᵈ is a d-dimensional vector.

1.2. Cluster Centers

The goal is to find K cluster centers (also called centroids) {μ₁, μ₂, …, μ_K}, where each μₖ ∈ ℝᵈ, that best represent the clusters.

1.3. Objective Function

K-means aims to minimize the within-cluster sum of squares (WCSS), which is defined as:

WCSS = Σₖ₌₁ᴷ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²

Where:

Cₖ is the set of points assigned to cluster k.

‖xᵢ − μₖ‖² is the squared Euclidean distance between point xᵢ and centroid μₖ.

Printed with ChatGPT to PDF

1.4. Optimization Problem

Formally, the K-means clustering problem can be stated as:

min over {Cₖ}, {μₖ} of  Σₖ₌₁ᴷ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²

Subject to:

Cₖ ⊆ X for all k.

⋃ₖ₌₁ᴷ Cₖ = X.

Cₖ ∩ Cₖ′ = ∅ for all k ≠ k′.
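The objective can be computed directly (a sketch; `wcss` is an illustrative helper, with clusters given as lists of points and one centroid per cluster):

```python
# WCSS: sum of squared Euclidean distances from each point to its cluster's centroid.
def wcss(clusters, centroids):
    total = 0.0
    for pts, mu in zip(clusters, centroids):
        for p in pts:
            total += sum((pi - mi) ** 2 for pi, mi in zip(p, mu))
    return total

clusters = [[(1, 2), (1, 4), (1, 0)], [(10, 2)]]
centroids = [(1, 2), (10, 2)]
print(wcss(clusters, centroids))  # 8.0 (distances 0 + 4 + 4 in cluster 1, 0 in cluster 2)
```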

2. Algorithm Steps
The K-means algorithm iteratively optimizes the objective function through two main
steps: Assignment and Update.

2.1. Initialization

Random Initialization: Select K distinct data points randomly from X as the initial centroids {μ₁⁽⁰⁾, μ₂⁽⁰⁾, …, μ_K⁽⁰⁾}.

Alternative Methods: Techniques like K-means++ can be used to choose initial centroids to improve convergence and solution quality.

2.2. Assignment Step

Given the current centroids {μ₁⁽ᵗ⁾, μ₂⁽ᵗ⁾, …, μ_K⁽ᵗ⁾} at iteration t, assign each data point to the nearest centroid:

Cₖ⁽ᵗ⁾ = { xᵢ ∈ X : ‖xᵢ − μₖ⁽ᵗ⁾‖² ≤ ‖xᵢ − μₖ′⁽ᵗ⁾‖² ∀ k′ ∈ {1, 2, …, K} }

2.3. Update Step

Recompute the centroids as the mean of all data points assigned to each cluster:

μₖ⁽ᵗ⁺¹⁾ = (1 / |Cₖ⁽ᵗ⁾|) Σ_{xᵢ ∈ Cₖ⁽ᵗ⁾} xᵢ   for each k = 1, 2, …, K

2.4. Convergence Criteria


The algorithm repeats the Assignment and Update steps until one of the following
conditions is met:


1. Centroid Stabilization: The centroids do not change significantly between iterations, i.e., ‖μₖ⁽ᵗ⁺¹⁾ − μₖ⁽ᵗ⁾‖ < ϵ for all k, where ϵ is a small threshold.

2. Maximum Iterations: A predefined maximum number of iterations is reached.

3. No Change in Assignments: The cluster assignments {Cₖ} do not change between consecutive iterations.
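Putting the Assignment and Update steps together gives a compact implementation (a sketch in plain Python; the names are illustrative, and distance ties go to the lower-indexed centroid):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm: random initialization, then alternate assignment/update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for c, cl in zip(centroids, clusters)
        ]
        if new_centroids == centroids:  # centroids stabilized: converged
            break
        centroids = new_centroids
    return centroids, clusters

cents, cls = kmeans([(1, 2), (1, 4), (1, 0), (10, 2)], 2)
print(sorted(cents))  # the two centroids: (1, 2) and (10, 2), possibly as floats
```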

3. Mathematical Properties
3.1. Convergence
K-means is guaranteed to converge to a local minimum of the WCSS objective
function. However, it may not find the global minimum due to its dependence on the
initial centroid positions.

3.2. Computational Complexity


Each iteration of the K-means algorithm has a computational complexity of O(nKd),
where:

n is the number of data points.


K is the number of clusters.
d is the dimensionality of the data.

The total complexity depends on the number of iterations until convergence.

3.3. Optimality
K-means solves the clustering problem via Lloyd's algorithm, which is an instance of
the Expectation-Maximization (EM) algorithm for Gaussian mixtures with equal
spherical covariance and equal priors. However, K-means assumes clusters are convex
and isotropic, which may not hold in all datasets.

4. Extensions and Variations


Several variations of the K-means algorithm have been proposed to address its
limitations:

K-means++: Improves initialization by spreading out the initial centroids,


leading to better convergence properties.


Mini-Batch K-means: Uses small random subsets (mini-batches) of data for
updates, enhancing scalability for large datasets.

Kernel K-means: Extends K-means to non-linear cluster boundaries by applying


kernel functions.

5. Example

Consider a simple 2-dimensional dataset with n = 4 points:

X = {x₁ = (1, 2), x₂ = (1, 4), x₃ = (1, 0), x₄ = (10, 2)}

Let K = 2.

Initialization:

Suppose we randomly choose μ₁⁽⁰⁾ = (1, 2) and μ₂⁽⁰⁾ = (10, 2).

Assignment:

x₁ and x₂ are closer to μ₁⁽⁰⁾.

x₃ is closer to μ₁⁽⁰⁾.

x₄ is assigned to μ₂⁽⁰⁾.

Update:

New μ₁⁽¹⁾ = (1/3)((1, 2) + (1, 4) + (1, 0)) = (1, 2)

New μ₂⁽¹⁾ = (10, 2) (unchanged)

Convergence:

Since the centroids did not change, the algorithm converges with the final clusters:

C₁ = {x₁, x₂, x₃}

C₂ = {x₄}
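The example's single assignment/update pass can be replayed step by step (a sketch; `nearest` is an illustrative helper):

```python
# One K-means pass on the example, starting from mu1 = (1, 2), mu2 = (10, 2).
points = [(1, 2), (1, 4), (1, 0), (10, 2)]
mu = [(1, 2), (10, 2)]

def nearest(p, centroids):
    # Index of the centroid with the smallest squared distance to p.
    return min(range(len(centroids)),
               key=lambda k: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[k])))

labels = [nearest(p, mu) for p in points]
print(labels)  # [0, 0, 0, 1]: x1, x2, x3 join mu1; x4 joins mu2

new_mu = []
for k in range(2):
    cl = [p for p, lab in zip(points, labels) if lab == k]
    new_mu.append(tuple(sum(c) / len(cl) for c in zip(*cl)))
print(new_mu)  # [(1.0, 2.0), (10.0, 2.0)] -- unchanged, so the algorithm has converged
```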
6. Limitations
Choosing K : Determining the optimal number of clusters K is non-trivial and
often requires methods like the Elbow Method or Silhouette Analysis.


Sensitivity to Initialization: Poor initial centroid selection can lead to
suboptimal clustering.

Assumption of Spherical Clusters: K-means works best when clusters are


spherical and equally sized, which may not hold for all datasets.

Scalability: While efficient for small to medium-sized datasets, K-means can be computationally intensive for very large datasets without modifications like Mini-Batch K-means.

7. Conclusion
Mathematically, K-means is an iterative optimization algorithm aimed at partitioning
data into K clusters by minimizing the within-cluster variance. Its simplicity and
efficiency make it a widely used clustering technique, though it comes with
assumptions and limitations that must be considered in practical applications.
