ISYE 8803 - Kamran - M1 - Intro to HD and Functional Data - Updated

This document provides an introduction to high-dimensional data analytics and functional data analysis. It discusses functional data, i.e., data that can be represented as functions such as signals over time, the "curse of dimensionality" that arises with high-dimensional data, and how to perform dimension reduction and extract low-dimensional structure using methods such as splines, regression, and functional principal component analysis. Splines are piecewise polynomials fitted locally in intervals to provide flexibility while maintaining continuity across intervals.

Topics on High-

Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

Introduction to HD and
Functional Data
Learning Objectives
• To understand the definition of high-
dimensional data and Big Data.
• To explain the concepts of “curse of
dimensionality” and “low-dimensional
learning.”
• To define Functional Data.
Big Data
The initial definition revolves around the three Vs:
Volume, Velocity, and Variety

Volume: Large sample size, and each sample could be high-dimensional. Use MapReduce, Hadoop, etc. when the data are too large to be stored on one machine.

Velocity: Data are generated and collected very quickly. Increase computational efficiency.

Variety: Data come in many different types and shapes. How to deal with high-dimensional data, e.g., profiles, images, videos, etc.?
High-Dimensional Data
High-Dimensional data is defined as a data set with a large number of attributes.

[Figure: example power signal, Power (Watt) vs. Time (Sec)]

Examples:
➢ Signals: 100 kHz sampling
➢ Images: 1M pixels
➢ Videos: image sequences
➢ Surveys: movie ratings

How to extract useful information from such massive datasets?
High-Dimensional Data vs. Big Data
• Small p, small n: traditional statistics with limited samples
• Small p, large n: classic large-sample theory (Big Data challenge)
• Large p, small n: HD statistics and optimization (High-Dimensional Data challenge)
• Large p, large n: deep learning and deep neural networks

BD Analytics challenge: n is too large for the data to be stored or processed on one machine.
– Solution: big-data frameworks for data storage and computation (e.g., parallel computing, MapReduce, Hadoop, Spark).
High-Dimensional Data vs. Big Data
• Small p, small n: traditional statistics with limited samples
• Small p, large n: classic large-sample theory (Big Data challenge)
• Large p, small n: HD statistics and optimization (High-Dimensional Data challenge)
• Large p, large n: deep learning and deep neural networks

HD Analytics challenge is mainly related to the "curse of dimensionality":

Computational issue: in some optimization algorithms, we need on the order of (1/ε)^p function evaluations in order to obtain a solution within ε of the optimum. For example, with ε = 0.1 and p = 10, this is already 10^10 evaluations.
Curse of Dimensionality
HD Analytics challenge is mainly related to the "curse of dimensionality":

Model learning issue: as the distance between observations increases with the dimension, the sample size required for learning a model drastically increases.
– Solution: feature extraction and dimension reduction through low-dimensional learning.
Low-Dimensional Learning from
High-Dimensional Data
• High-dimensional data usually have a low-dimensional structure.

• Real data are often highly concentrated on a low-dimensional, sparse, or degenerate structure in high-dimensional space.

How can the LD structure be learned and exploited from HD data?


LD Learning Methods
Functional Data Analysis
• Splines
• Smoothing Splines
• Kernels

Tensor Analysis
• Multilinear Algebra
• Low Rank Tensor Decomposition

Rank Deficient Methods


• (Functional) Principal Component Analysis (FPCA)
• Robust PCA (RPCA)
• Matrix Completion
Functional Data
Definition: A fluctuating quantity or impulse whose variations represent information, often represented as a function of time or space.

Examples: single-channel signals, multi-channel signals, images, point clouds.

[Figure: example single-channel power signal, Power (Watt) vs. Time (Sec)]
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

Review of Regression
Learning Objectives
• To review linear regression
• To understand the geometric interpretation of linear regression
• To explain feature extraction using regression
Regression
• Observe a collection of i.i.d. training data {(x_i, y_i)}, i = 1, …, n,
  where the x's are explanatory (independent) variables and y is the response (dependent) variable.

• We want to build a function f(x) to model the relationship between the x's and y.

• An intuitive way of finding f(x) is by minimizing the squared-error loss
  L(f) = Σ_{i=1}^{n} (y_i − f(x_i))².

• We have to impose some constraints/structure on f(x), e.g., a linear model f(x) = β_0 + Σ_j β_j x_j.
Regression – Least Squares Estimates

  y = Xβ + ε

where

  y = [y_1 ⋯ y_n]ᵀ,   X = [x_11 ⋯ x_1p; ⋮ ⋱ ⋮; x_n1 ⋯ x_np],   β = [β_1 ⋯ β_p]ᵀ,   ε = [ε_1 ⋯ ε_n]ᵀ

We wish to find the vector of least squares estimators that minimizes

  L = εᵀε = (y − Xβ)ᵀ(y − Xβ)

The resulting least squares estimate is

  β̂ = (XᵀX)⁻¹Xᵀy
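As a quick illustration, the least squares estimate can be computed directly in MATLAB. This is a minimal sketch with simulated data (not the wire-bond example that follows):

% Simulate a small regression problem (hypothetical data, for illustration only)
n = 50; p = 3;
X = [ones(n,1) randn(n,p)];     % design matrix with an intercept column
beta = [2; 0.5; -1; 3];         % coefficients used to generate the data
y = X*beta + 0.2*randn(n,1);    % response with additive Gaussian noise

% Least squares estimate beta_hat = (X'X)^(-1) X'y
beta_hat = (X'*X)\(X'*y);       % backslash solves the normal equations without an explicit inverse
yhat = X*beta_hat;              % fitted values (yhat = H*y, with H the hat matrix)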
Example
Pull strength for a wire bond against wire length and die height (Montgomery and Runger 2006).

(Hastie et al. 2009)

Example cont.
(Montgomery and Runger 2006)

  β̂ = (XᵀX)⁻¹Xᵀy
Geometric interpretation
  ŷ = X(XᵀX)⁻¹Xᵀy = Hy

H is the projection matrix (a.k.a. hat matrix).

The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of predictions obtained by the least squares method (Hastie et al. 2009).
Properties of OLS Estimators
Unbiased estimator:
  E(β̂) = E[(XᵀX)⁻¹Xᵀy] = E[(XᵀX)⁻¹Xᵀ(Xβ + ε)] = β

Covariance matrix:
  cov(β̂) = σ²(XᵀX)⁻¹,   with   σ̂² = SSE / (n − p)

According to the Gauss-Markov theorem, among all linear unbiased estimators, the least squares estimate (LSE) has the minimum variance and it is unique.
Feature Extraction Using Regression

Polynomial regression:   y = β_0 + β_1 t + β_2 t² + β_3 t³

A signal (functional data/profile) sample is fitted by OLS, and the estimated coefficients

  β̂ = [β̂_0  β̂_1  β̂_2  β̂_3]ᵀ

are used as the extracted features.
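A sketch of this feature-extraction idea in MATLAB; the signal below is simulated for illustration, and in practice each observed profile would be fitted in the same way:

% Hypothetical signal sample observed over time t
t = linspace(0,1,200)';
s = 3 - 2*t + 5*t.^2 - 4*t.^3 + 0.1*randn(size(t));   % simulated profile with noise

% Cubic polynomial regression y = b0 + b1*t + b2*t^2 + b3*t^3
T = [ones(size(t)) t t.^2 t.^3];                      % polynomial design matrix
beta_hat = (T'*T)\(T'*s);                             % OLS estimate of the coefficients

% The four estimated coefficients are the extracted features for this signal
features = beta_hat';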
Reference
• Montgomery, D. C., Runger, G., (2013), Applied Statistics and Probability for
Engineers. 6th Edition. Wiley, NY, USA.

• Hastie, T., Tibshirani, R., and Friedman, J., (2009) The Elements of Statistical
Learning. Springer Series in Statistics Springer New York Inc., New York, NY,
USA.
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

Splines
Learning Objectives
• To discuss local vs. global polynomial regression
• To explain splines and piecewise polynomial regression
• To recognize spline bases and the truncated power basis
Polynomial Vs. Nonlinear Regression
mth-order polynomial regression:
  y = β_0 + β_1 x + β_2 x² + β_3 x³ + ⋯ + β_m x^m + ε

Nonlinear regression:
Often requires domain knowledge or first principles for finding the underlying nonlinear function, e.g.,

  y = a_1 (x − c)^{b_1} + d + ε,   x > c
  y = a_2 (−x + c)^{b_2} + d + ε,  x ≤ c
Polynomial Regression
mth-order polynomial:   y = β_0 + β_1 x + β_2 x² + β_3 x³ + ⋯ + β_m x^m

Disadvantages of polynomial regression:
• Remote parts of the function are very sensitive to outliers
• Less flexibility due to the global functional structure

(Example from Ji Zhou, 2011, estimated using polynomials)
Splines
• Linear combination of piecewise polynomial functions under continuity assumptions
• Partition the domain of x into contiguous intervals and fit polynomials in each interval separately
• Provides flexibility and local fitting

Suppose x ∈ [a, b]. Partition the x domain using a set of points ξ_1, …, ξ_K (a.k.a. knots).
Fit a polynomial in each interval under the continuity conditions and combine them as

  f(X) = Σ_{m=1}^{K} β_m h_m(X)
Splines – Simple Example

  f(X) = Σ_{m=1}^{3} β_m h_m(X),   with LSE  β̂_m = Ȳ_m (the mean of Y in the mth region)

  f(X) = Σ_{m=1}^{6} β_m h_m(X)

(Image taken from Hastie et al., 2014)


Splines – Simple Example
  f(X) = Σ_{m=1}^{6} β_m h_m(X)

Impose a continuity constraint at each knot:
  f(ξ_k⁻) = f(ξ_k⁺),   k = 1, 2

Total number of free parameters (degrees of freedom) is 6 − 2 = 4.

Alternatively, one could incorporate the constraints into the basis functions:
  h_1(X) = 1,  h_2(X) = X,  h_3(X) = (X − ξ_1)_+,  h_4(X) = (X − ξ_2)_+

This basis is known as the truncated power basis.

(Image taken from Hastie et al., 2014)


Splines with Higher Order of Continuity
Cubic polynomials with continuity constraints for smoothness: continuity of f, f′, and f″ at each knot.

The spline df is calculated by
  (# of regions) × (# of parameters per region) − (# of knots) × (# of constraints per knot)

For example, a cubic spline with K = 2 interior knots has 3 × 4 − 2 × 3 = 6 free parameters, which equals K + 4.

(Image taken from Hastie et al., 2014)
Order-M Splines
Piecewise polynomials of degree M − 1 with continuous derivatives up to order M − 2:
• M = 1: piecewise-constant splines
• M = 2: linear splines
• M = 3: quadratic splines
• M = 4: cubic splines

Truncated power basis functions:
  h_j(X) = X^{j−1},   j = 1, …, M
  h_{M+k}(X) = (X − ξ_k)_+^{M−1},   k = 1, …, K

• Total degrees of freedom is K + M.
• The cubic spline is the lowest-order spline for which the knot discontinuity is not visible to the human eye.
• Knot selection: a simple method is to place knots at quantiles of x. However, the choice of knots is in general a variable/model selection problem.
Estimation

[Figure: truncated power basis functions h_1(x), …, h_6(x)]

  f(X) = Σ_{m=1}^{M+K} β_m h_m(X)

The least squares method can be used to estimate the coefficients:

  H = [h_1(x)  h_2(x)  h_3(x)  h_4(x)  h_5(x)  h_6(x)],   β̂ = (HᵀH)⁻¹Hᵀy

Linear smoother:   ŷ = Hβ̂ = H(HᵀH)⁻¹Hᵀy = Sy

Degrees of freedom: trace(S)
• Truncated power basis functions are simple and algebraically appealing.
• However, they are not efficient for computation and are ill-posed and numerically unstable:   det((HᵀH)⁻¹) = 1.3639e-06
Example

[Figure: noisy data and mean function sin(2πx³)³ on x ∈ [0, 1]]

%Data Generation
X=[0:0.001:1];
Y_true=sin(2*pi()*X.^3).^3;
Y=sin(2*pi()*X.^3).^3+normrnd(0,0.1,1,length(X));
%Define knots and cubic truncated power basis
k = [1/7:1/7:6/7];
h1=ones(1,length(X)); h2=X; h3=X.^2; h4=X.^3;
h5=(X-k(1)).^3; h5(h5<=0)=0;
h6=(X-k(2)).^3; h6(h6<=0)=0;
h7=(X-k(3)).^3; h7(h7<=0)=0;
h8=(X-k(4)).^3; h8(h8<=0)=0;
h9=(X-k(5)).^3; h9(h9<=0)=0;
h10=(X-k(6)).^3; h10(h10<=0)=0;
H=[h1' h2' h3' h4' h5' h6' h7' h8' h9' h10'];
%Least square estimates
B=(H'*H)\H'*Y';
scatter(X,Y,'.'); hold on
plot(X,H*B,'r')
plot(X,Y_true,'k')
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

B-splines
Learning Objectives
• To discuss the computational issues of splines
• To understand the B-spline basis
• To define the smoother matrix and degrees of freedom
Computational Issue of Splines
• Truncated power basis functions are simple and algebraically appealing.
• However, they are not efficient for computation and are ill-posed and numerically unstable.

[Figure: cubic truncated power basis functions]

  det((HᵀH)⁻¹) = 1.3639e-06
B-splines
An alternative basis for piecewise polynomials that is computationally more efficient (de Boor, 1978).
• Each basis function has local support, i.e., it is nonzero over at most M (the spline order) consecutive intervals.
• The basis matrix is banded.
Bspline Basis
Let B_{j,m}(x) be the jth B-spline basis function of order m (m ≤ M) for the knot sequence τ.

Define the augmented knot sequence τ, e.g.,
  τ_1 = ⋯ = τ_M = ξ_0,   τ_{M+j} = ξ_j for j = 1, …, K,   ξ_{K+1} = τ_{M+K+1} = ⋯ = τ_{2M+K}

For j = 1, …, 2M+K−1 (order 1, piecewise constant):
  B_{j,1}(x) = 1 if τ_j ≤ x < τ_{j+1}, and 0 otherwise.

For j = 1, …, 2M+K−m (recursion up to order m ≤ M; terms with a zero denominator are taken to be zero):
  B_{j,m}(x) = (x − τ_j)/(τ_{j+m−1} − τ_j) · B_{j,m−1}(x) + (τ_{j+m} − x)/(τ_{j+m} − τ_{j+1}) · B_{j+1,m−1}(x)


Bspline Basis in Matlab
n = 100
for sd = 1:4
subplot(4,1,sd)
knots = [ones(1,sd-1)...
linspace(1,n,10) n * ones(1,sd-1)];
nKnots = length(knots) - sd;
kspline = spmak(knots,eye(nKnots));
B=spval(kspline,1:n)';
plot(B)
end
Show the matrix B, a banded (low-bandwidth) matrix: imagesc(B)

Generate bspline basis using R: bs(x, df, knots, intercept)
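For completeness, the recursion on the previous slide can also be implemented directly, without the spline toolbox. The following is a minimal sketch; the function name bspline_basis and its interface are hypothetical, and the half-open intervals mean a point exactly at the right boundary knot evaluates to zero (in practice the last interval is usually treated as closed).

function B = bspline_basis(x, tau, M)
% Evaluate all order-M B-spline basis functions at the points in x using
% the Cox-de Boor recursion; tau is the augmented (non-decreasing) knot sequence.
x = x(:);
nB = numel(tau) - 1;                      % number of order-1 (indicator) functions
B = zeros(numel(x), nB);
for j = 1:nB                              % order m = 1: piecewise-constant indicators
    B(:,j) = (x >= tau(j)) & (x < tau(j+1));
end
for m = 2:M                               % recursion up to order M
    nB = numel(tau) - m;
    Bm = zeros(numel(x), nB);
    for j = 1:nB
        d1 = tau(j+m-1) - tau(j);
        d2 = tau(j+m)   - tau(j+1);
        t1 = 0; t2 = 0;
        if d1 > 0, t1 = (x - tau(j))/d1 .* B(:,j);     end
        if d2 > 0, t2 = (tau(j+m) - x)/d2 .* B(:,j+1); end
        Bm(:,j) = t1 + t2;
    end
    B = Bm;
end
end

With an augmented knot sequence of length 2M+K, the returned matrix has M+K columns, matching the K+M degrees of freedom noted earlier.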


Example

[Figure: cubic truncated power basis functions (left) vs. cubic B-spline basis functions (right)]

  det((HᵀH)⁻¹) = 1.3639e-06        det((BᵀB)⁻¹) = 1.4119e+04
Smoother Matrix
Consider a regression spline with basis matrix B:

  f̂ = B(BᵀB)⁻¹Bᵀy = Hy

• H is the smoother matrix (a.k.a. projection matrix)
• H is idempotent
• H is symmetric
• Degrees of freedom: trace(H)
Example - MATLAB
% Generate data:
n = 100; D = linspace(0,1,n); sigma = 0.3;
fun = @(x) 2.5 * x - sin(10 * x) - exp(-10 * x);
y = fun(D) + randn(1,n)*sigma; y = y';
% Generate B-spline basis:
sd=4;
knots = [ones(1,2) linspace(1,n,10) n * ones(1,2)];
nKnots = length(knots) - sd;
kspline = spmak(knots,eye(nKnots));
B=spval(kspline,1:n)';
% Least Square Estimation:
yhat = B/(B'*B)*B'*y;
K= trace(B/(B'*B)*B')
sigma2 = 1/(n-K)*(y-yhat)'*(y-yhat);
yn = yhat-3*sqrt(diag(sigma2*B/(B'*B)*B'));
yp = yhat+3*sqrt(diag(sigma2*B/(B'*B)*B'));
plot(D,y,'r.',D,yn,'b--',D,yp,'b--',D,yhat,'k-')
Example: Fat content prediction
• A beef distributor wants to know the fat
content of meat from spectrometric curves,
which correspond to the absorbance
measured at 100 wavelengths.
• She obtains the spectrometric curves for 215
pieces of finely chopped meat, (functional
predictors).
• Additionally, through a time consuming
chemical processing, she estimates the fat
content of each piece (response).
• She wants us to build a model to predict the fat content of a new piece using the spectrometric curve.

The original dataset can be found at http://lib.stat.cmu.edu/datasets/tecator.
Example: Fat content
[Figure: spectrometric curves]

• We split the dataset into a training set of 195 curves and a test set of 20 curves.
• Regular approach: we build a linear regression using the 100 measurements from the spectrometer as predictors.
• Functional approach: we use B-splines to model each curve and extract features. The estimated B-spline coefficients are used as predictive features for building the fat regression model.
• Example of extracted features for one curve: B-spline coefficients 2.61, 2.64, 2.63, 2.75, 2.74, 3.41, 3.35, 3.07, 2.89, 2.81; fat content 12.5.
• The mean square errors of the predictions for the test dataset are:
  MSE (regular approach, raw spectra) = 27.02
  MSE (functional approach, B-spline features) = 14.25
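A rough sketch of the functional approach in MATLAB is shown below. The variable names curves (a 215 x 100 matrix of spectra) and fat (a 215 x 1 vector of fat contents) are hypothetical placeholders for the Tecator data, and the knot placement is only one possible choice; the exact settings used in the lecture may differ.

% curves: 215 x 100 matrix of spectrometric curves (hypothetical variable name)
% fat:    215 x 1 vector of measured fat content (hypothetical variable name)
wl = 1:100;                                        % wavelength index
knots  = [ones(1,3) linspace(1,100,8) 100*ones(1,3)];
nBasis = length(knots) - 4;                        % cubic B-splines (order 4)
kspline = spmak(knots, eye(nBasis));
B = spval(kspline, wl)';                           % 100 x nBasis basis matrix

% Extract B-spline coefficients for every curve (functional features)
coef = (B'*B) \ (B'*curves');                      % nBasis x 215 coefficient matrix
features = coef';                                  % one row of features per curve

% Regress fat content on the extracted features (training subset)
idxTrain = 1:195; idxTest = 196:215;
bhat = [ones(195,1) features(idxTrain,:)] \ fat(idxTrain);
pred = [ones(20,1)  features(idxTest,:)] * bhat;
mseBspline = mean((pred - fat(idxTest)).^2)        % test-set mean square error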
Reference
• Hastie, T., Tibshirani, R., and Friedman, J., (2009) The Elements of Statistical
Learning. Springer Series in Statistics Springer New York Inc., New York, NY,
USA.
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.

Associate Professor
School of Industrial & Systems Engineering

Smoothing splines
Learning Objectives
• Discuss B-spline basis boundary
issue
• Introduce natural spline basis
• Define smoothing splines
• Discuss cross-validation for tuning
penalty parameter.
Boundary Effects on Splines

Consider the following setting with fixed training data.

• The behavior of splines tends to be erratic near the boundaries, and extrapolation can be problematic.
Natural Cubic Splines
• Additional constraints are added to make the function linear beyond the boundary knots.
• Assuming the function is linear near the boundaries (where there is less information) is often reasonable.
• A natural cubic spline is a cubic spline that is linear on (−∞, ξ_1] and [ξ_K, ∞).
• The prediction variance decreases near the boundaries.
• The price is some bias near the boundaries.
• The degrees of freedom is K, the number of knots.
• Each of these basis functions has zero second and third derivatives in the linear regions.

Generate a natural spline basis in R:   B = ns(x, df, intercept)

Smoothing Splines

Minimize the penalized residual sum of squares

  RSS(f, λ) = Σ_{i=1}^{n} (y_i − f(x_i))² + λ ∫ (f''(t))² dt

• The first term measures the closeness of the model to the data (related to bias).
• The second term penalizes the curvature of the function (related to variance).
• Knot selection is avoided: use as many knots as the number of observations.
• λ is the smoothing parameter controlling the trade-off between bias and variance:
  – λ = 0: interpolates the data (overfitting)
  – λ = ∞: linear least-squares regression
Example of Overfitting

[Figure: true function vs. an estimated curve with a large number of knots]
Smoothing Splines
Penalized residual sum of squares:

  RSS(f, λ) = Σ_{i=1}^{n} (y_i − f(x_i))² + λ ∫ (f''(t))² dt

It can be shown that the minimizer is a natural cubic spline with knots at each of the unique x_i's:

  f(x) = Σ_{j=1}^{n} N_j(x) θ_j

where the N_j's are a set of natural cubic spline basis functions.

Matrix form:
  RSS(θ, λ) = (y − Nθ)ᵀ(y − Nθ) + λ θᵀΩ_N θ,   where {Ω_N}_{jk} = ∫ N_j''(t) N_k''(t) dt

Solution:
  θ̂ = (NᵀN + λΩ_N)⁻¹ Nᵀy
Smoother Matrix
The smoothing spline estimator is a linear smoother:

  f̂ = N(NᵀN + λΩ_N)⁻¹Nᵀy = S_λ y

• S_λ is the smoother matrix
• S_λ is NOT idempotent
• S_λ is symmetric
• S_λ is positive definite
• Degrees of freedom: trace(S_λ)


Choice of Tuning Parameter
Collect three independent data sets: training, validation, and test.

[Diagram: tuning parameters → model estimation on the training data → intermediate models assessed on the validation data → optimal tuning parameters and estimated model → final model assessed on the test data]
• If an independent validation dataset is not affordable, the K-fold cross validation (CV)
or leave-one-out CV can be used.
K-fold Cross-Validation (CV)
• 5-fold cross-validation (blank: training; red: test)
Choice of Tuning Parameter
Model Selection Criteria
• Akaike information criterion (AIC):  −2 log(L) + 2k,
  where k is the number of estimated parameters and L is the likelihood function.

• Bayesian information criterion (BIC):  −2 log(L) + k log(n),
  where n is the sample size.

• Generalized cross-validation (GCV):  GCV(λ) = (RSS(λ)/n) / (1 − df(λ)/n)²
Example - Over-fitting
Generate 40 knots for fitting 100 data samples.

% Generate data:
fun = @(x) 2.5 * x - sin(10 * x) - exp(-10 * x);
n = 100; D = linspace(0,1,n); k = 40;
sigma = 0.3; y = fun(D) + randn(1,n)*sigma; y = y';
% Generate B-spline basis:
knots = [ones(1,2) linspace(1,n,k) n * ones(1,2)];
nKnots = length(knots) - 3;
kspline = spmak(knots,eye(nKnots));
B=spval(kspline,1:n)';
% Least Square Estimation:
yhat = B/(B'*B)*B'*y;
sigma2 = 1/(n-k)*(y-yhat)'*(y-yhat);
yn = yhat-3*sqrt(diag(sigma2*B/(B'*B)*B'));
yp = yhat+3*sqrt(diag(sigma2*B/(B'*B)*B'));
plot(D,y,'r.',D,yn,'b--',D,yp,'b--',D,yhat,'k-')
Example - Avoid Over-fitting by Smoothing Penalty
% B is defined in the previous slide
D1 = (B(2:n,:)-B(1:(n-1),:));
D2 = (D1(2:(n-1),:)-D1(1:(n-2),:));
% Different lambda selection
alllambda = 0.0001:0.5:100;
L = length(alllambda);
RSS = zeros(L,1);
df = zeros(L,1);
for i = 1:L
    S = B/(B'*B+alllambda(i)*(D2'*D2))*B';
    yhat = S*y;
    RSS(i) = sum((yhat-y).^2);
    df(i) = trace(S);
end
% GCV criterion
GCV = (RSS/n)./(1-df/n).^2;
plot(alllambda,GCV)
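Continuing this example, one way to use the GCV curve is to refit at the minimizing λ. A sketch, assuming B, y, D, D2, n, alllambda, and GCV from the code above:

% Refit with the lambda that minimizes GCV
[~, idx] = min(GCV);
lambdaOpt = alllambda(idx);
S = B/(B'*B + lambdaOpt*(D2'*D2))*B';   % smoother matrix at the selected lambda
yhat = S*y;                             % penalized (smoothed) fit
plot(D,y,'r.',D,yhat,'k-')
title(['\lambda = ' num2str(lambdaOpt) ',  df = ' num2str(trace(S))])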
Reference
• Hastie, T., Tibshirani, R., and Friedman, J., (2009) The Elements of Statistical
Learning. Springer Series in Statistics Springer New York Inc., New York, NY,
USA.
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

Kernel Smoothers
Learning Objectives
• To define kernel functions
• To understand KNN regression, weighted kernel regression, and local linear and polynomial kernel regression.
K-Nearest Neighbor (KNN)
KNN average:

  f̂(x_0) = Σ_{i=1}^{n} w(x_0, x_i) y_i,   where  w(x_0, x_i) = 1/k if x_i ∈ N_k(x_0), and 0 otherwise

• Simple average of the k nearest observations to x_0 (local averaging)
• Equal weights are assigned to all neighbors
• The fitted function has the form of a step function (a non-smooth function)

(From Hastie et al. 2009)
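A minimal sketch of the kNN average at a single query point in MATLAB (simulated data, for illustration only):

% Simulated 1-D data (hypothetical)
x = linspace(0,1,100);
y = sin(2*pi*x) + 0.2*randn(1,100);

k = 10; x0 = 0.37;                 % number of neighbors and query point
[~, order] = sort(abs(x - x0));    % sort observations by distance to x0
fhat_x0 = mean(y(order(1:k)))      % simple average of the k nearest responses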
Kernel Function
Any non-negative, real-valued, integrable function K that satisfies the following conditions:

1. ∫_{−∞}^{∞} K(u) du = 1
2. K is an even function: K(−u) = K(u)
3. It has a finite second moment: ∫_{−∞}^{∞} u² K(u) du < ∞
Examples of Kernel functions
• Symmetric Beta family kernel:
  K(u; d) = (1 − u²)^d / (2^{2d+1} B(d+1, d+1)) · I(|u| < 1)
  – Uniform kernel (d = 0)
  – Epanechnikov kernel (d = 1)
  – Bi-weight / tri-weight kernels (d = 2, 3)
• Tri-cube kernel:  K(u) = (1 − |u|³)³ · I(|u| < 1)
• Gaussian kernel:  K(u) = (1/√(2π)) exp(−u²/2)

(From Hastie et al. 2009)
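For reference, these kernels can be written as anonymous functions in MATLAB (a sketch; the tri-cube kernel is left unnormalized, as on this slide):

Kunif  = @(u) 0.5*(abs(u) < 1);                   % uniform kernel (d = 0)
Kepan  = @(u) 0.75*(1 - u.^2).*(abs(u) < 1);      % Epanechnikov kernel (d = 1)
Ktric  = @(u) (1 - abs(u).^3).^3 .* (abs(u) < 1); % tri-cube kernel (unnormalized)
Kgauss = @(u) exp(-u.^2/2)/sqrt(2*pi);            % Gaussian kernel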
Kernel Smoother Regression
Kernel Regression
• Weighted local averaging that fits a simple model separately at each query point x_0.
• More weight is assigned to closer observations.
• Localization is defined by the weighting function.

For any point x_0,

  f̂(x_0) = Σ_{i=1}^{n} K_λ(x_0, x_i) y_i / Σ_{i=1}^{n} K_λ(x_0, x_i),   where  K_λ(x_0, x_i) = K((x_0 − x_i)/λ)

• K is a kernel function.
• λ is the so-called "bandwidth" or "window width" that defines the width of the neighborhood.
• Kernel regression requires little training; all calculations are done at evaluation time.
Example - Kernel Smoother Regression

  K_λ(x_0, x_i) = K((x_0 − x_i)/λ),   with the Epanechnikov kernel  K(u) = (3/4)(1 − u²) I(|u| < 1)

  f̂(x_0) = Σ_{i=1}^{n} K_λ(x_0, x_i) y_i / Σ_{i=1}^{n} K_λ(x_0, x_i)

(From Hastie et al. 2009)


Choice of λ
• λ defines the width of the neighborhood.
• Only points within [x_0 − λ, x_0 + λ] receive positive weights for kernels with support [−1, 1].
• Larger λ: smoother estimate, larger bias, smaller variance.
• Smaller λ: rougher estimate, smaller bias, larger variance.

The following criteria can be used for determining λ:
– Leave-one-out cross-validation
– K-fold cross-validation
– Generalized cross-validation


Example – RBF Kernel
% Data Generation
x=[0:100];
y=[sin(x/10)+(x/50).^2+0.1*normrnd(0,1,1,101)]';
kerf=@(z)exp(-z.*z/2)/sqrt(2*pi);     % Gaussian (RBF) kernel
% Leave-one-out CV over a grid of bandwidths
h1=[1:0.1:4];
for j=1:length(h1); h=h1(j);
    for i=1:length(y)
        X1=x; Y1=y; X1(i)=[]; Y1(i)=[];
        z=kerf((x(i)-X1)/h); yke=sum(z.*Y1')/sum(z);
        er(i)=y(i)-yke;
    end
    mse(j)=sum(er.^2);
end
plot(h1,mse); h=h1(find(mse==min(mse)));   % bandwidth minimizing the CV MSE
Example – RBF Kernel
% Interpolation for N values
N=1000;
xall = linspace(min(x),max(x),N);

f = zeros(1,N);
for k=1:N
z=kerf((xall(k)-x)/h);
f(k)=sum(z.*y')/sum(z);
end
Drawbacks of Local Averaging
The local averaging can be biased on
the boundaries of the domain due to the
asymmetry of the kernel in that region.

From Hastie. et al. 2009


Local Linear Regression
The locally weighted linear regression model is estimated by

  min_{β_0(x_0), β_1(x_0)}  Σ_{i=1}^{n} K_λ(x_0, x_i) [y_i − β_0(x_0) − β_1(x_0) x_i]²

The estimate of the function at x_0 is then

  f̂(x_0) = β̂_0(x_0) + β̂_1(x_0) x_0

Local linear regression corrects the bias on the boundaries.

(From Hastie et al. 2009)
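A sketch of a local linear fit at a single query point in MATLAB, using Epanechnikov weights (simulated data; in practice this is repeated over a grid of query points):

% Simulated 1-D data (hypothetical)
x = linspace(0,1,200)';
y = sin(4*x) + 0.2*randn(200,1);

lambda = 0.15; x0 = 0.05;                       % bandwidth and query point (near the boundary)
u = (x - x0)/lambda;
w = 0.75*(1 - u.^2).*(abs(u) < 1);              % Epanechnikov kernel weights
X0 = [ones(size(x)) x];                         % local linear design matrix
W = diag(w);
beta = (X0'*W*X0)\(X0'*W*y);                    % weighted least squares at x0
fhat_x0 = beta(1) + beta(2)*x0                  % local linear estimate at x0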


Local Polynomial Regression
The locally weighted polynomial regression model is estimated by

  min_{β_0(x_0), …, β_p(x_0)}  Σ_{i=1}^{n} K_λ(x_0, x_i) [y_i − β_0(x_0) − Σ_{j=1}^{p} β_j(x_0) x_i^j]²

The estimate of the function at x_0 is then

  f̂(x_0) = β̂_0(x_0) + Σ_{j=1}^{p} β̂_j(x_0) x_0^j

Local polynomial regression corrects the bias in curvature regions.

(From Hastie et al. 2009)


Local Polynomial Regression
• Higher-order polynomials result in lower bias but higher variance.
• Local linear fits help reduce bias at the boundaries.
• Local quadratic fits are effective for reducing bias due to curvature in the interior region, but not in boundary regions (they increase the variance there).

(From Hastie et al. 2009)


Reference
• Hastie, T., Tibshirani, R., and Friedman, J., (2009) The Elements of Statistical
Learning. Springer Series in Statistics Springer New York Inc., New York, NY,
USA.
Topics on High-
Dimensional Data
Analytics
Functional Data Analysis

Kamran Paynabar, Ph.D.


Associate Professor
School of Industrial & Systems Engineering

Functional Principal Component Analysis

Learning Objectives
• How to perform PCA on functional data
• To understand the Karhunen–Loeve (KL) theorem and identify eigen-functions and eigen-values
• To demonstrate feature extraction using FPCA
Signal Functional Form
  s_i(t) = μ(t) + ε_i(t)

• s_i(t): observed signals, i = 1, …, N
• μ(t): continuous functional mean
• ε_i(t): realizations from a stochastic process with mean function 0 and covariance function C(t, t′).
  It includes both random noise and signal-to-signal variations.
Karhunen–Loeve Theorem
Using the Karhunen–Loeve theorem, ε_i(t) can be written as

  ε_i(t) = Σ_{k=1}^{∞} ξ_{ik} φ_k(t)

where the ξ_{ik} are zero-mean, uncorrelated coefficients, i.e., E[ξ_{ik}] = 0 and E[ξ_{ik}²] = λ_k, and the φ_k(t) are eigen-functions of the covariance function C(t, t′) = cov(ε(t), ε(t′)), i.e.,

  C(t, t′) = Σ_{k=1}^{∞} λ_k φ_k(t) φ_k(t′)

λ_1 ≥ λ_2 ≥ ⋯ are the ordered eigen-values. The eigen-functions can be obtained by solving

  ∫_0^M C(t, t′) φ_k(t) dt = λ_k φ_k(t′)
Functional PCA
The variance of ξ_{ik} quickly decays with k. Therefore, only a few ξ_{ik}, also known as FPC-scores, are enough to accurately approximate the noise function. That is,

  ε_i(t) ≅ Σ_{k=1}^{K} ξ_{ik} φ_k(t)

The signal decomposition is given by

  s_i(t) = μ(t) + ε_i(t) ≅ μ(t) + Σ_{k=1}^{K} ξ_{ik} φ_k(t)
Model Estimation
– Complete signals: sampled regularly

– Incomplete signals: sampled irregularly, sparse, fragmented


Estimation of Mean Function
Historical signals s_i(t_{ij}):
– i = 1, …, N is the signal index
– j = 1, …, m_i is the observation index within each signal
– s_i(t_{ij}) ≅ μ(t_{ij}) + Σ_{k=1}^{K} ξ_{ik} φ_k(t_{ij})

We can estimate the mean function μ̂(t) using local linear regression by minimizing

  min_{c_0, c_1}  Σ_{i=1}^{N} Σ_{j=1}^{m_i} W((t_{ij} − t)/h) [s_i(t_{ij}) − c_0 − c_1 (t − t_{ij})]²

– Solution: μ̂(t) = ĉ_0(t)
Estimation of Covariance Function
First, we use the estimated mean function to compute the raw covariances Ĉ_i:

  Ĉ_i(t_{ij}, t_{ik}) = (s_i(t_{ij}) − μ̂(t_{ij})) (s_i(t_{ik}) − μ̂(t_{ik}))

To estimate the covariance surface Ĉ(t, t′), we use local quadratic regression:

  min_{c_0, c_1, c_2}  Σ_{i=1}^{N} Σ_{1≤j≠k≤m_i} W((t_{ij} − t)/h, (t_{ik} − t′)/h) [Ĉ_i(t_{ij}, t_{ik}) − c_0 − c_1 (t − t_{ij}) − c_2 (t′ − t_{ik})]²

Solution: Ĉ(t, t′) = ĉ_0(t, t′)

The eigen-functions φ̂_k(t) are then estimated by discretizing the estimated covariance function Ĉ(t, t′).
Computing FPC-Scores
Compute the eigen-functions φ̂_k(t) by solving

  ∫_0^M Ĉ(t, t′) φ̂_k(t) dt = λ̂_k φ̂_k(t′),   subject to  ∫_0^M φ̂_k(t) φ̂_m(t) dt = 1 if m = k, and 0 if m ≠ k

– solved by discretizing the estimated covariance function Ĉ(t_j, t_{j′})

Compute the FPC-scores ξ̂_{ik}:

  ξ_{ik} = ∫_0^M (s_i(t) − μ̂(t)) φ_k(t) dt

– Numerical integration (with t_0 = 0):

  ξ̂_{ik} = Σ_{j=1}^{J} (s_i(t_j) − μ̂(t_j)) φ̂_k(t_j) (t_j − t_{j−1})
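For completely observed signals on a common time grid, these steps reduce to an eigen-decomposition of the discretized sample covariance. A minimal sketch in MATLAB (simulated signals; the variable names are not from the lecture code):

% Simulate N complete signals on a common grid (hypothetical data)
N = 200; J = 101; t = linspace(0,1,J); dt = t(2) - t(1);
S = zeros(N,J);
for i = 1:N
    S(i,:) = sin(2*pi*t) + randn*cos(2*pi*t) + 0.5*randn*sin(4*pi*t) + 0.05*randn(1,J);
end

mu = mean(S,1);                          % estimated mean function
R  = S - repmat(mu,N,1);                 % centered signals
Chat = (R'*R)/(N-1);                     % discretized covariance function C_hat(t,t')

[Phi, Lam] = eig(Chat);                  % eigen-decomposition of the covariance
[lam, idx] = sort(diag(Lam),'descend');
Phi = Phi(:,idx)/sqrt(dt);               % rescale so that the integral of phi_k^2 dt = 1
lam = lam*dt;                            % eigen-values of the covariance operator

K = 2;                                   % number of retained components
xi = R*Phi(:,1:K)*dt;                    % FPC-scores by numerical integration
fve = cumsum(lam)/sum(lam);              % fraction of variance explained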
FPCA Example

[Figure: original signals (left); signals with missing data, 6 observations per signal (right)]

FPCA Example

[Figure: estimated mean function (left); smoothed covariance function (right)]

FPCA Example

[Figure: fraction of variance explained (left); 1st eigen-function (right). The 1st eigen-value explains more than 98% of the total variation.]
Example: Functional Data
• In a press machine, the load profiles are measured during the forging process. The goal is to predict the quality of the produced part based on the load profiles.
• There are 200 profiles along with their quality labels: 100 non-defective and 100 defective parts.

• For a new curve, we want to decide whether it belongs to class 1 or class 2.

• Option 1: B-spline coefficients
• Option 2: Functional principal components
Example: Functional Data
Classification

Step 1: Extract features from the functional data
• B-spline coefficients
• Functional principal components
Note: Each curve has 50 time observations. Using B-splines we reduced the curve dimension from 50 to 10, and using FPCA scores from 50 to 2.

Step 2: Train a classifier (e.g., Random Forest, SVM, etc.) using the extracted features, as sketched below.
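A sketch of this step in MATLAB, assuming a feature matrix F (one row of B-spline coefficients or FPC-scores per curve) and a label vector g have already been built; these variable names are hypothetical, and fitcsvm requires the Statistics and Machine Learning Toolbox:

% F: nCurves x nFeatures matrix of extracted features (hypothetical)
% g: nCurves x 1 vector of class labels (1 = non-defective, 2 = defective)
mdl = fitcsvm(F, g, 'KernelFunction', 'linear', 'Standardize', true);

% Step 3 then predicts the class of new curves from their extracted features Fnew
gpred = predict(mdl, Fnew);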
Example: Functional Data
Classification

Step 3. Predict the class for 40 new observations.


• Using the 10 B-spline coefficients, all the curves were correctly classified.
• Using the scores of the first two principal components, 2 curves of class 1 were classified into class 2.
