ISYE 8803 - Kamran - M1 - Intro To HD and Functional Data - Updated
Topics on High-Dimensional Data Analytics
Functional Data Analysis
Introduction to HD and Functional Data
Learning Objectives
• To understand the definition of high-
dimensional data and Big Data.
• To explain the concepts of “curse of
dimensionality” and “low-dimensional
learning.”
• To define Functional Data.
Big Data
The initial definition revolves around the three Vs:
Volume, Velocity, and Variety
Variety: Data come in many different types and shapes. How do we deal with high-dimensional data, e.g., profiles, images, videos, etc.?
High-Dimensional Data
High-dimensional data is defined as a data set with a large number of attributes.
[Figure: example profile signal, Power (Watt) vs. Time (Sec)]
Signals ➢ 100 kHz
Images ➢ 1M pixels
Videos ➢ image sequences
Surveys ➢ movie ratings
How to extract useful information from such massive datasets?
High-Dimensional Data vs. Big Data
• Small n, small p: traditional statistics with limited samples
• Large n, small p: classic large-sample theory (the Big Data challenge)
• Small n, large p: HD statistics and optimization (the High-Dimensional Data challenge)
• Large n, large p: deep learning and deep neural networks
Tensor Analysis
• Multilinear Algebra
• Low Rank Tensor Decomposition
Topics on High-Dimensional Data Analytics
Functional Data Analysis
Review of Regression
Learning Objectives
• To review linear regression
• To understand the geometric interpretation of linear regression
• To explain feature extraction using regression
Regression
• Observe a collection of i.i.d. training data
• We want to build a function f(x) to model the relationship between x’s and y
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Example
Pull strength for a wire bond against
wire length and die height. (Montgomery and Runger 2006)
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Geometric interpretation
$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$
Projection Matrix (a.k.a. Hat matrix)
$\hat{\sigma}^2 = \dfrac{SSE}{n - p}$
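A minimal sketch (with simulated data, not the wire-bond data set) of the OLS estimator, the hat matrix, and the variance estimate:

% OLS estimator, hat matrix, and sigma^2_hat on simulated data
n = 25;  p = 3;
X = [ones(n,1) rand(n,2)];                 % design matrix: intercept + two predictors
beta_true = [2; 1; -0.5];
y = X*beta_true + 0.1*randn(n,1);
beta_hat = (X'*X)\(X'*y);                  % beta_hat = (X'X)^(-1) X'y
H = X*((X'*X)\X');                         % hat (projection) matrix H
y_hat = H*y;                               % fitted values y_hat = H*y
sigma2_hat = sum((y - y_hat).^2)/(n - p);  % sigma^2_hat = SSE/(n - p)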
Polynomial regression: $y = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$
OLS estimator: $\hat{\boldsymbol{\beta}} = [\hat\beta_0 \;\; \hat\beta_1 \;\; \hat\beta_2 \;\; \hat\beta_3]^T$
Extracted features
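A minimal sketch (simulated signals, hypothetical data) of feature extraction with cubic polynomial regression: fit the model above to each signal and keep the estimated coefficients as its feature vector.

% Cubic polynomial regression as feature extraction
t = linspace(0,1,100)';
nsig = 5;
features = zeros(nsig, 4);                                    % one row of features per signal
for i = 1:nsig
    y = 2 + i*t - 3*t.^2 + 0.5*t.^3 + 0.05*randn(size(t));    % simulated noisy signal
    T = [ones(size(t)) t t.^2 t.^3];                          % polynomial design matrix
    features(i,:) = ((T'*T)\(T'*y))';                         % extracted features [b0 b1 b2 b3]
end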
References
• Montgomery, D. C., and Runger, G. C. (2013), Applied Statistics and Probability for Engineers, 6th Edition. Wiley, NY, USA.
• Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning. Springer Series in Statistics, Springer, New York, NY, USA.
Topics on High-Dimensional Data Analytics
Functional Data Analysis
Splines
Learning Objectives
• To discuss local vs. global polynomial
regression
• To explain splines and piecewise
polynomial regression
• To recognize spline bases and the truncated power basis
Polynomial Vs. Nonlinear Regression
mth-order polynomial regression: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_m x^m + \varepsilon$
Nonlinear Regression:
Suppose $x \in [a, b]$. Partition the x domain using a set of points $a < \xi_1 < \cdots < \xi_K < b$ (a.k.a. knots).
Fit a polynomial in each interval under continuity conditions and combine them:
$f(X) = \sum_{m=1}^{K} \beta_m h_m(X)$
Splines – Simple Example
Piecewise-constant fit (three regions, no continuity constraints), with least-squares estimates (LSE):
$f(X) = \sum_{m=1}^{3} \beta_m h_m(X), \qquad \hat\beta_m = \bar{Y}_m$
Piecewise-linear fit:
$f(X) = \sum_{m=1}^{6} \beta_m h_m(X)$
Spline degrees of freedom are calculated by
(# of regions) × (# of parameters in each region) − (# of knots) × (# of constraints per knot)
Image taken from Hastie et al., 2009.
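For instance, a piecewise-cubic fit over 3 regions (2 interior knots) with continuity of the function and of its first and second derivatives at each knot has $(3)(4) - (2)(3) = 12 - 6 = 6$ degrees of freedom.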
Order-M Splines
Piecewise polynomials of order M (degree M−1), with continuous derivatives up to order M−2
• M=1 piecewise-constant splines
• M=2 linear splines
• M=3 quadratic splines
• M=4 cubic splines
$f(X) = \sum_{m=1}^{M+K} \beta_m h_m(X)$
[Figure: spline basis functions]
Degrees of freedom: M + K (one per basis function)
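For reference, the order-M truncated power basis with interior knots $\xi_1, \dots, \xi_K$ (as in Hastie et al., 2009) consists of the $M + K$ functions

$h_j(X) = X^{\,j-1}, \quad j = 1, \dots, M, \qquad h_{M+\ell}(X) = (X - \xi_\ell)_+^{\,M-1}, \quad \ell = 1, \dots, K.$

The MATLAB example below builds exactly this basis for M = 4 (cubic) and K = 6 knots.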
• Truncated power basis functions are simple and algebraically appealing.
• However, they are not efficient for computation and can be ill-posed and numerically unstable, e.g., $\det(\mathbf{H}^T\mathbf{H}) = 1.3639\mathrm{e}{-06}$.
Example

[Figure: simulated noisy data and true mean function]

%Data Generation
X=[0:0.001:1];
Y_true=sin(2*pi()*X.^3).^3;                       % true mean function
Y=Y_true+normrnd(0,0.1,1,length(X));              % noisy observations

%Define knots and truncated power basis (cubic, 6 interior knots)
k = [1/7:1/7:6/7];
h1=ones(1,length(X)); h2=X; h3=X.^2; h4=X.^3;     % global cubic terms
h5=(X-k(1)).^3;  h5(h5<=0)=0;                     % truncated cubic terms (X - k)_+^3
h6=(X-k(2)).^3;  h6(h6<=0)=0;
h7=(X-k(3)).^3;  h7(h7<=0)=0;
h8=(X-k(4)).^3;  h8(h8<=0)=0;
h9=(X-k(5)).^3;  h9(h9<=0)=0;
h10=(X-k(6)).^3; h10(h10<=0)=0;
H=[h1' h2' h3' h4' h5' h6' h7' h8' h9' h10'];

%Least square estimates
B=(H'*H)\(H'*Y');
scatter(X,Y,'.'); hold on
plot(X,H*B,'r')                                   % fitted spline
plot(X,Y_true,'k')                                % true mean function
Topics on High-Dimensional Data Analytics
Functional Data Analysis
B-splines
Learning Objectives
• To discuss the computational issues of splines.
• To understand B-spline basis.
• To define the smoother matrix and
degrees of freedom.
Computational Issue of Splines
• Truncated power basis functions are simple and algebraically appealing, but they are not computationally efficient.
• The B-spline basis is an equivalent, numerically stable basis defined on an augmented knot sequence in which the boundary knots are repeated M times, e.g., $\xi_{K+1} = \tau_{M+K+1} = \cdots = \tau_{2M+K}$.
$\hat{\mathbf{f}} = \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{y} = \mathbf{H}\mathbf{y}$
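A minimal sketch (not part of the course code; the simulated data and knots mirror the earlier truncated-power example) of building a cubic B-spline basis with the Cox–de Boor recursion and computing the least-squares fit above:

% Cubic B-spline basis via the Cox-de Boor recursion (same simulated data as before)
X = 0:0.001:1;
Y = sin(2*pi()*X.^3).^3 + normrnd(0,0.1,1,length(X));
M = 4;                                        % order 4 = cubic splines
xi = 1/7:1/7:6/7;                             % K = 6 interior knots
tau = [zeros(1,M) xi ones(1,M)];              % augmented knots: boundary knots repeated M times
% Order-1 (piecewise-constant) B-splines
B = zeros(length(X), length(tau)-1);
for j = 1:length(tau)-1
    B(:,j) = (X >= tau(j) & X < tau(j+1))';
end
B(end, find(tau < X(end), 1, 'last')) = 1;    % close the last interval at the right endpoint
% Raise the order from 2 up to M with the Cox-de Boor recursion
for m = 2:M
    Bnew = zeros(length(X), size(B,2)-1);
    for j = 1:size(B,2)-1
        d1 = tau(j+m-1) - tau(j);  d2 = tau(j+m) - tau(j+1);
        t1 = 0;  t2 = 0;
        if d1 > 0, t1 = (X' - tau(j))/d1 .* B(:,j);     end
        if d2 > 0, t2 = (tau(j+m) - X')/d2 .* B(:,j+1); end
        Bnew(:,j) = t1 + t2;
    end
    B = Bnew;
end
% B is now length(X) x (M+K) = 1001 x 10
beta = (B'*B)\(B'*Y');                        % B'*B is banded and well-conditioned
plot(X, B*beta, 'r')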
Smoothing splines
Learning Objectives
• Discuss B-spline basis boundary
issue
• Introduce natural spline basis
• Define smoothing splines
• Discuss cross-validation for tuning the penalty parameter.
Boundary Effects on Splines
Polynomial and regression-spline fits can behave erratically near the boundaries of the domain; natural cubic splines address this by forcing the fit to be linear beyond the boundary knots.

Smoothing Splines
$\min_{f} \; \sum_{i=1}^{n} [y_i - f(x_i)]^2 + \lambda \int [f''(t)]^2\, dt$
• The first term measures the closeness of the model to the data (related to bias).
• The second term penalizes the curvature of the function (related to variance).
• Knot selection is avoided: select as many knots as there are observations. The solution is a natural cubic spline $f(x) = \sum_{j=1}^{n} N_j(x)\,\theta_j$, where the $N_j$'s are a set of natural cubic spline basis functions with knots at each of the unique $x_i$'s.
Matrix form: $(\mathbf{y} - \mathbf{N}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{N}\boldsymbol{\theta}) + \lambda\,\boldsymbol{\theta}^T \boldsymbol{\Omega}_N \boldsymbol{\theta}$, with $\{\boldsymbol{\Omega}_N\}_{jk} = \int N_j''(t)\,N_k''(t)\,dt$
Solution: $\hat{\boldsymbol{\theta}} = (\mathbf{N}^T\mathbf{N} + \lambda\,\boldsymbol{\Omega}_N)^{-1}\mathbf{N}^T\mathbf{y}$
Smoother Matrix
The smoothing spline estimator is a linear smoother: $\hat{\mathbf{f}} = \mathbf{N}(\mathbf{N}^T\mathbf{N} + \lambda\,\boldsymbol{\Omega}_N)^{-1}\mathbf{N}^T\mathbf{y} = \mathbf{S}_\lambda\,\mathbf{y}$
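A minimal sketch of the linear-smoother idea, using a simplified discrete roughness penalty (a second-difference matrix on an equally spaced grid) rather than the natural cubic spline basis $\mathbf{N}$ and penalty $\boldsymbol{\Omega}_N$ above; the data and the form of the smoother matrix are assumptions of this sketch:

% Discrete analogue of the smoothing-spline smoother matrix S_lambda
n = 200;
t = linspace(0,1,n);
y = sin(2*pi()*t) + 0.2*randn(1,n);           % simulated noisy signal
lambda = 10;                                  % roughness penalty parameter
D = diff(eye(n), 2);                          % (n-2) x n second-difference operator
S = (eye(n) + lambda*(D'*D)) \ eye(n);        % smoother matrix: f_hat = S*y is linear in y
fhat = S * y';                                % smoothed estimate
df = trace(S);                                % effective degrees of freedom = trace(S_lambda)
plot(t, y, '.', t, fhat, 'r')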
[Diagram: tuning parameters → model estimation → estimated model; optimal tuning parameters → optimal model]
• If an independent validation dataset is not available, K-fold cross-validation (CV) or leave-one-out CV can be used.
K-fold Cross-Validation (CV)
• 5-fold cross-validation (blank: training; red: test)
Choice of Tuning Parameter
Model Selection Criteria
• Akaike information criterion (AIC): $\mathrm{AIC} = -2\log(L) + 2k$, where k is the number of estimated parameters and L is the likelihood function.
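A hedged sketch of using AIC to compare candidate penalty values for the discrete smoother sketched earlier, assuming Gaussian errors so that $-2\log(L)$ can be replaced by $n\log(\mathrm{RSS}/n)$ up to a constant, with the effective degrees of freedom $\mathrm{trace}(\mathbf{S}_\lambda)$ playing the role of k:

% Compare candidate lambdas with a Gaussian-error AIC (simulated data as before)
n = 200;  t = linspace(0,1,n);
y = sin(2*pi()*t) + 0.2*randn(1,n);
D = diff(eye(n), 2);
lambdas = logspace(-3, 3, 25);
aic = zeros(size(lambdas));
for j = 1:length(lambdas)
    S = (eye(n) + lambdas(j)*(D'*D)) \ eye(n);
    r = y' - S*y';                            % residuals
    k = trace(S);                             % effective number of parameters
    aic(j) = n*log(sum(r.^2)/n) + 2*k;        % -2*log(L) + 2*k up to an additive constant
end
[~, jbest] = min(aic);  lambda_best = lambdas(jbest);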
Kernel Smoothers
Learning Objectives
• To define kernel functions
• To understand KNN regression, weighted kernel regression, and local linear and polynomial kernel regression.
K-Nearest Neighbor (KNN)
KNN average: $\hat{f}(x_0) = \sum_{i=1}^{n} w(x_0, x_i)\, y_i$, where $w(x_0, x_i) = \begin{cases} 1/K, & x_i \in N_K(x_0) \\ 0, & \text{otherwise} \end{cases}$
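A minimal sketch (with simulated data) of the KNN average: each of the K nearest neighbors of the query point receives weight 1/K.

% KNN regression at a single query point
x = rand(1,100);  y = sin(2*pi()*x) + 0.1*randn(1,100);   % simulated data
K = 10;  x0 = 0.5;
[~, idx] = sort(abs(x - x0));                 % order observations by distance to x0
fhat_x0 = mean(y(idx(1:K)));                  % average the K nearest responses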
A kernel function K(u) satisfies:
1. $\int_{-\infty}^{\infty} K(u)\,du = 1$
2. $K$ is an even function: $K(-u) = K(u)$
3. $K$ has a finite second moment: $\int_{-\infty}^{\infty} u^2 K(u)\,du < \infty$
Examples of Kernel Functions
• Symmetric Beta family kernel: $K(u, d) = \dfrac{(1 - u^2)^d}{2^{2d+1}\, B(d+1, d+1)}\, I(|u| < 1)$
  – Uniform kernel (d = 0)
  – Epanechnikov kernel (d = 1)
  – Bi-weight / Tri-weight kernels (d = 2, 3)
• Tri-cube kernel: $K(u) = (1 - |u|^3)^3\, I(|u| < 1)$
• Gaussian kernel: $K(u) = \dfrac{1}{\sqrt{2\pi}} \exp(-u^2/2)$
(Figure from Hastie et al., 2009)
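The kernel function handle kerf used in the code snippets that follow is not defined in the slides; any of the kernels above could be used, for example (an assumption of this sketch):

% Possible definitions of the kernel handle used later as kerf(...)
kerf_epan  = @(u) 0.75*(1 - u.^2).*(abs(u) < 1);          % Epanechnikov
kerf_tcube = @(u) (1 - abs(u).^3).^3.*(abs(u) < 1);       % Tri-cube
kerf_gauss = @(u) exp(-u.^2/2)/sqrt(2*pi());              % Gaussian
kerf = kerf_gauss;                                        % e.g., pick the Gaussian kernel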
Kernel Smoother Regression
Kernel Regression
• Kernel regression is weighted local averaging that fits a simple model separately at each query point x0.
• K is a kernel function.
• Kernel regression requires little training; all calculations are done at evaluation time.
Example - Kernel Smoother
Regression
$K_\lambda(x_0, x_i) = K\!\left(\dfrac{x_0 - x_i}{\lambda}\right)$
Epanechnikov kernel (d = 1): $K(u) = \dfrac{3}{4}(1 - u^2)\, I(|u| < 1)$
$\hat{f}(x_0) = \dfrac{\sum_{i=1}^{n} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{n} K_\lambda(x_0, x_i)}$
%Leave-one-out cross-validation over candidate bandwidths h1
for j=1:length(h1); h=h1(j);
for i=1:length(y)
X1=x;Y1=y;X1(i)=[];Y1(i)=[];                  % leave observation i out
z=kerf((x(i)-X1)/h);                          % kernel weights at x(i)
yke=sum(z.*Y1')/sum(z);                       % kernel-smoother prediction at x(i)
er(i)=y(i)-yke;                               % leave-one-out residual
end
mse(j)=sum(er.^2);                            % CV error for bandwidth h1(j)
end
plot(h1,mse); h=h1(find(mse==min(mse)));      % bandwidth minimizing the CV error
[Figure: CV MSE vs. bandwidth lambda]
Example – RBF Kernel
% Interpolation: evaluate the fitted kernel smoother at N query points
N=1000;
xall = linspace(min(x),max(x),N);
f = zeros(1,N);
for k=1:N
z=kerf((xall(k)-x)/h);                        % kernel weights at query point xall(k)
f(k)=sum(z.*y')/sum(z);                       % kernel-weighted average of the responses
end
Drawbacks of Local Averaging
The local averaging can be biased on
the boundaries of the domain due to the
asymmetry of the kernel in that region.
Local linear regression corrects this boundary bias by solving a kernel-weighted least-squares problem at each target point:
$\min_{\beta_0(x_0),\, \beta_1(x_0)} \; \sum_{i=1}^{n} K_\lambda(x_0, x_i)\,[y_i - \beta_0(x_0) - \beta_1(x_0)\, x_i]^2$
$\hat{f}(x_0) = \hat\beta_0(x_0) + \hat\beta_1(x_0)\, x_0$
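A minimal sketch (simulated data; a Gaussian kernel is assumed) of local linear regression: at each query point, solve the kernel-weighted least-squares problem above and evaluate the fitted line at x0.

% Local linear regression on a grid of query points
x = rand(1,200);  y = sin(2*pi()*x) + 0.1*randn(1,200);   % simulated data
kerf = @(u) exp(-u.^2/2)/sqrt(2*pi());                    % Gaussian kernel (assumption)
lambda = 0.05;                                            % bandwidth
x0all = linspace(0,1,100);  fhat = zeros(size(x0all));
for q = 1:length(x0all)
    x0 = x0all(q);
    w = kerf((x - x0)/lambda);                % kernel weights K_lambda(x0, x_i)
    Xd = [ones(length(x),1) x'];              % local design matrix [1, x_i]
    b = (Xd'*diag(w)*Xd) \ (Xd'*diag(w)*y');  % weighted least squares
    fhat(q) = b(1) + b(2)*x0;                 % f_hat(x0) = beta0_hat(x0) + beta1_hat(x0)*x0
end
plot(x, y, '.', x0all, fhat, 'r')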
Functional Principal Component Analysis (FPCA)
$s_i(t) = \mu(t) + \epsilon_i(t) \approx \mu(t) + \sum_{k=1}^{K} \xi_{ik}\,\phi_k(t)$
where the $\phi_k$'s are orthonormal eigenfunctions of the covariance and the $\xi_{ik}$'s are the FPC scores.
Model Estimation
– Complete signals: sampled regularly
$\tilde{C}_i(t_{ij}, t_{ik}) = \big[s_i(t_{ij}) - \hat\mu(t_{ij})\big]\big[s_i(t_{ik}) - \hat\mu(t_{ik})\big]$
To estimate the covariance surface $\hat{C}(t, t')$, we use local quadratic regression:
$\min_{c_0, c_1, c_2} \; \sum_{i=1}^{n} \sum_{1 \le j \ne k \le m_i} W\!\left(\frac{t_{ij} - t}{h}, \frac{t_{ik} - t'}{h}\right)\left[\tilde{C}_i(t_{ij}, t_{ik}) - c_0 - c_1(t - t_{ij}) - c_2(t' - t_{ik})\right]^2$
– The estimated eigenfunctions are orthonormal: $\int_0^M \hat\phi_k(t)\,\hat\phi_m(t)\,dt = \begin{cases} 1, & m = k \\ 0, & m \ne k \end{cases}$
– The eigen-problem is solved by discretizing the estimated covariance function $\hat{C}(t_j, t_{j'})$
Computing the FPC scores $\hat\xi_{ik}$:
$\xi_{ik} = \int_0^M \big[s_i(t) - \hat\mu(t)\big]\,\phi_k(t)\,dt$
– Numerical integration: $\hat\xi_{ik} = \sum_{j=1}^{J} \big[s_i(t_j) - \hat\mu(t_j)\big]\,\hat\phi_k(t_j)\,(t_j - t_{j-1})$, where $t_0 = 0$
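A minimal sketch (simulated complete signals on a common regular grid; not the course's implementation) of FPCA by discretizing the covariance: estimate the mean, eigen-decompose the sample covariance, rescale eigenvectors into eigenfunctions, and compute FPC scores by numerical integration.

% FPCA for complete, regularly sampled signals
n = 50;  J = 101;  t = linspace(0,1,J);  dt = t(2) - t(1);
S = zeros(n, J);                                   % signals, one per row (simulated)
for i = 1:n
    S(i,:) = sin(2*pi()*t) + randn*cos(2*pi()*t) + 0.1*randn(1,J);
end
mu_hat = mean(S, 1);                               % estimated mean function mu_hat(t)
Sc = S - repmat(mu_hat, n, 1);                     % centered signals
C_hat = (Sc'*Sc)/n;                                % discretized covariance C_hat(t_j, t_j')
[V, D] = eig(C_hat);                               % eigen-decomposition
[~, order] = sort(diag(D), 'descend');  V = V(:, order);
K = 2;                                             % number of retained components
phi_hat = V(:, 1:K)/sqrt(dt);                      % rescale so that int phi_k(t)^2 dt = 1
xi_hat = Sc*phi_hat*dt;                            % FPC scores by numerical integration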
FPCA Example
[Figures: original signals; signals with missing data (6 observations for each signal)]
FPCA Example
Step 1: Extract features from each signal: the estimated FPC scores $\hat\xi_{ik}$.
Step 2: Train a classifier (e.g., Random Forest, SVM, etc.) using the extracted features.
Example: Functional Data
Classification
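A hedged sketch of the two-step workflow, continuing the FPCA sketch above; the class labels here are hypothetical placeholders, and TreeBagger/fitcsvm require the Statistics and Machine Learning Toolbox.

% Step 1: features = FPC scores xi_hat; Step 2: train a classifier on them
labels = [zeros(25,1); ones(25,1)];               % hypothetical labels for the n = 50 signals
mdl = TreeBagger(100, xi_hat, labels);            % random forest on the FPC scores
% mdl = fitcsvm(xi_hat, labels);                  % SVM alternative
pred = predict(mdl, xi_hat);                      % predicted classes (here, on the training scores)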