BCI Unit 3


Subject Name : BRAIN COMPUTER INTERFACE AND APPLICATIONS


Subject code : CBM342
Regulation : 2021
Department : ECE
Year : III
Semester : VI

UNIT-III

FEATURE EXTRACTION METHODS

Time/Space methods – Fourier Transform, PSD – Wavelets – Parametric methods – AR, MA, ARMA models – PCA – Linear and Nonlinear Features.


UNIT-III
1. Introduction:
Brain-computer interfaces (BCIs) are control and communication systems based on the acquisition and processing of brain signals to control a computer or an external device. A BCI typically recognizes events acquired through different neuroimaging methods, of which electroencephalography (EEG) is the most widely used. Feature extraction from EEG signals is therefore crucial to the classification performance of a BCI system.

Figure 1: BCI pattern recognition system


• Temporal (time-domain) features are obtained from the EEG signal values at distinct time windows or at distinct time points.
• Frequency-domain features, also called spectral features, represent the signal power in specific frequency bands.
• However, because EEG signals are nonstationary in nature, time-frequency methods are useful; they provide additional information by taking the dynamic changes of the signal into consideration.
• Spatial features deal with the spatial representation of the signal, i.e. the selection of the most appropriate channels for the specific task.
• One of the most important components of a BCI system is the EEG feature extraction procedure, due to its role in the proper performance of the classification stage at discriminating mental states.
• In one reported study, a classification of imagined right/left hand movements was presented. Features were extracted by means of the wavelet transform and classified with support vector machines (SVM).
• Test results showed that the proposed method could accurately extract substantial EEG features and provide an effective means to classify the motor mental tasks. Similar work presented a new method for classifying the EEG signals from the BCI Competition 2003.
• Wavelet coefficients and power spectral density (PSD) were combined into a feature vector for the subsequent classification step, underlining the importance of using a suitable feature vector.
• A feature extraction method was also proposed to discriminate hand movements, based on the processing of EEG signals recorded from two subjects with 10 electrodes.


• Features are extracted from the raw data by means of the PSD and the alpha and beta band powers. Even though the results show the discrimination ability of such features, it is worth noting the variability across subjects; it is also important to determine the most prominent electrodes for each task.

Figure 2: Taxonomy of feature extraction methods for EEG-based BCI

Table 1: Time-domain features



Table 2: Time-frequency domain features

Table 3: Dimension reduction


• Another feature extraction methodology applied rhythmic component extraction (RCE) to characterize left/right hand motor imagery.
• The EEG signals were recorded from two subjects with 14 electrodes. The experiment showed that the combination of RCE and Fisher discriminant analysis on the 12-15 Hz frequency band performed slightly better than other methods.
• Although the results reported in the state of the art are adequate, in most cases the shared information between different electrodes is not taken into account.
• This information is useful because it allows new relevant features to be extracted with a small number of electrodes while achieving comparable classification results.
• Due to the high importance of a suitable feature extraction stage in any classification system, a comparison between different characterization methodologies of EEG signals for BCI systems is presented, based on the extraction of frequency information by different approaches in each electrode as well as shared information among electrodes.
• The methodologies are tested on the BCI Competition 2003 dataset, which contains left- and right-hand motor imagery data with 140 recordings, 3 electrodes and 1152 samples.
***************************************************************************
Write short notes on Feature Extraction methods in BCI.
*********************************************************************************


What is Feature Extraction?


Feature extraction in machine learning is the process of transforming raw data into a set of numerical features that can be used for further analysis. In the context of image processing, this involves converting pixels into a form that a machine learning model can understand and utilize, typically resulting in a feature vector that encapsulates the essential aspects of the input data.

Figure 3: Feature extraction is the thread that weaves raw data into patterns of insight.

 Process: It involves transforming raw data (like pixels in an image) into a set of usable features.
In deep learning, this is typically done through a series of convolutional layers.

 Layers Involved: Early layers of a convolutional neural network (CNN) capture basic features
like edges and textures, while deeper layers capture more complex features like patterns or
specific objects.

 Output: The output is a high-dimensional vector or set of vectors that succinctly represent the
important aspects of the input data.

Figure 4: Flow diagram of BCI

The Significance of Feature Extraction


1. Simplification of Complexity: One of the primary benefits of feature extraction is the simplification of data. Raw data, like the pixel values of an image, are often too voluminous and complex for direct analysis. Feature extraction distills this data into a more manageable form, retaining only the most relevant information.

2. Enhancing Model Performance: Feature extraction is pivotal in improving the performance of machine learning models. By providing a clear, concise representation of the data, it allows models to learn more effectively and make more accurate predictions.

3. Facilitating Transfer Learning: In the realm of deep learning, models pre-trained on extensive datasets (like ImageNet) serve as powerful feature extractors. These pre-trained models can be repurposed for various tasks, significantly reducing the time and resources required for model training.

Methodologies in Feature Extraction


1. Traditional Techniques: Historically, feature extraction involved handcrafted techniques where
domain experts identified and coded algorithms to extract features. Examples include edge detection
filters and color histograms in image processing.
2. Deep Learning Approaches: With the advent of deep learning, feature extraction has been
revolutionized. Convolutional Neural Networks (CNNs), for instance, automatically learn to extract
features during the training process. This has led to a paradigm shift from manual feature design to
automated feature learning.

Applications of Feature Extraction


1. Image Classification: In image classification, feature extractors identify patterns and characteristics
that define various categories, allowing models to categorize images effectively.
2. Object Detection and Recognition: Feature extraction is crucial in object detection, where it helps in
identifying and localizing objects within an image, and in facial recognition systems, where it discerns
unique facial features.
3. Beyond Computer Vision: The concept extends beyond visual data. In audio processing, for instance,
feature extractors identify characteristics like pitch and tempo, while in text analysis, they might focus
on semantic representations of words.

Challenges and Future Directions

1. Balancing Complexity and Performance: A significant challenge is balancing the complexity of the
feature extractor with the computational resources available. More complex models may offer better
feature extraction but at the cost of increased computational demands.
2. Generalization: Another challenge is ensuring that feature extractors generalize well to new, unseen
data. This is particularly important in applications like autonomous vehicles and medical image
analysis, where errors can have serious consequences.

********************************************************************
Explain Fourier Transform method used in feature extraction

*********************************************************************

2. Fourier Transform:

• The Fourier Transform is a powerful tool in feature engineering, widely used in fields such as signal processing, image analysis, and data science.
• Its significance lies in transforming time- or space-based signals into the frequency domain, offering a different perspective from which to analyze and process data.
• This section explores the concepts of the Fourier Transform and its application in feature engineering.

Understanding Fourier Transform


• At its core, the Fourier Transform decomposes a function of time (a signal) into its constituent
frequencies.
• This is crucial because in many practical scenarios, analyzing the frequency components of a
signal can be more insightful than examining the signal in its original time domain.
• For example, in signal processing, it can reveal hidden periodicities or dominant frequencies that
are not apparent in the time domain.

EQUATION
The continuous Fourier Transform (for a signal x(t)) is given by:

X(f) = ∫_{−∞}^{∞} x(t) e^{−j2πft} dt

Where:
X(f) is the Fourier Transform of x(t),
x(t) is the input signal in the time domain,
f is the frequency, and
j is the imaginary unit.

Block Diagram:
The block diagram for the Fourier Transform involves taking the input signal x(t) and passing it
through a Fourier Transform block. The output is X(f), which represents the signal in the frequency
domain.

+--------+ +----------------------+
| x(t) | ----> | Fourier Transform | ----> X(f)
+--------+ +----------------------+
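
For illustration, the following Python sketch computes the discrete Fourier Transform of a sampled signal with NumPy and locates its dominant frequency component. The sampling rate and the synthetic EEG-like signal are assumptions made for the example, not values taken from these notes:

import numpy as np

# Illustrative parameters (assumptions, not taken from the notes)
fs = 250.0                        # sampling rate in Hz, typical for EEG
t = np.arange(0, 2.0, 1.0 / fs)   # 2 s of data

# Synthetic EEG-like signal: 10 Hz alpha + 20 Hz beta components plus noise
x = 40e-6 * np.sin(2 * np.pi * 10 * t) + 15e-6 * np.sin(2 * np.pi * 20 * t)
x = x + 5e-6 * np.random.randn(t.size)

# Discrete Fourier Transform of the sampled signal (one-sided spectrum)
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)

# Dominant frequency component of the record
print("Peak at %.1f Hz" % freqs[np.argmax(np.abs(X))])

The magnitudes |X(f)| at selected frequencies, or band powers derived from them, can then serve as frequency-domain features.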

***********************************************************************************

************************************************************************
Explain the significance and importance of Power spectral density in feature
extraction of EEG signals
************************************************************************

3. Power Spectral Density (PSD):


• Power spectral density (PSD) describes how the power of a signal is distributed over frequency.
• A signal with nonzero average power is not square integrable, so its ordinary Fourier transform does not exist; the PSD is instead defined as the Fourier transform of the autocorrelation function of the signal.
• The power of a signal in a given frequency band is calculated by integrating the PSD over the corresponding positive and negative frequencies.
• The definition of power spectral density generalizes in a straightforward manner to a finite time series x_n = x(nΔt), 1 ≤ n ≤ N, sampled at discrete times for a total measurement period T = NΔt:

S(e^{jω}) = (1 / (2πN)) · |Σ_{n=1}^{N} x_n e^{−jωn}|²

or

Equation:
The Power Spectral Density (for a signal x(t)) is often defined as the Fourier Transform of the autocorrelation function R_xx(τ):

S_xx(f) = ∫_{−∞}^{∞} R_xx(τ) e^{−j2πfτ} dτ

Where:
 Sxx(f) is the PSD,

 Rxx(τ) is the autocorrelation function of x(t),

 f is the frequency, and

 j is the imaginary unit.

Block Diagram:
The block diagram for PSD involves computing the autocorrelation function Rxx(τ) and then
passing it through a Fourier Transform block.

+--------+      +---------------------+      +--------------------+
|  x(t)  | ---->|  Autocorrelation    | ---->|  Fourier           | ----> S_xx(f)
+--------+      |  Function R_xx(τ)   |      |  Transform         |
                +---------------------+      +--------------------+


• The distribution of the average power of a signal x(t) in the frequency domain is called the power spectral density (PSD) or power density (PD).
• The PSD is a measure of a signal's power content versus frequency.
• The power spectral density, which represents the power distribution of an EEG series in the frequency domain, has also been used to evaluate abnormalities of the brain in Alzheimer's disease (AD).
 The power spectral density (PSD) or power spectrum represents the proportion of the total
signal power contributed by each frequency component of a voltage signal.
 It is computed from the DFT as the mean squared amplitude of each frequency component,
averaged over the n samples in the digitised record.
 The PSD is a real, not a complex, quantity, expressed in terms of squared signal units per
frequency units and can be plotted as a single graph.
 The relationship between power spectral density and frequency is that each element of the
PSD is a measure of the signal power contributed by frequencies within a band of width ∆f
centred on the frequency k ∆f.
 The variance of the original digitised record can be computed from the integral of the PSD
 EEG relative power can be calculated by comparing the power values of specific frequency
bands in the EEG data with the power values of a control group.
 To calculate EEG relative power, the program POTENCOR uses Fourier analysis to separate
frequency components and calculates the normalized data for relative power.

• PSD is a good tool for stationary signal processing and is suitable for narrowband signals. It is a common signal processing technique that distributes the signal power over frequency and shows the strength of the energy as a function of frequency.

• The power spectral density can be calculated using the Welch and Burg methods to extract features from the filtered data.

A. Welch Method
• Generally, the Welch method of PSD estimation can be described by the equations below. A modified periodogram P_i(f) is first computed for each windowed segment x_i(n) of length M; the Welch power spectrum is then the average of these periodograms over the L segments:

P_i(f) = (1 / (M·U)) · |Σ_{n=0}^{M−1} x_i(n) w(n) e^{−j2πfn}|²

P_welch(f) = (1 / L) · Σ_{i=0}^{L−1} P_i(f)

where w(n) is the window function and U is its power normalization factor.
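
For illustration, the sketch below uses SciPy's welch function; the sampling rate, segment length and synthetic signal are assumptions made for the example, and the alpha band limits follow the usual 8-13 Hz convention:

import numpy as np
from scipy.signal import welch

# Assumed example values: sampling rate and synthetic signal are illustrative
fs = 250.0
t = np.arange(0, 10.0, 1.0 / fs)
eeg = 30e-6 * np.sin(2 * np.pi * 10 * t) + 8e-6 * np.random.randn(t.size)

# Welch PSD: average of periodograms over overlapping Hann-windowed segments
freqs, psd = welch(eeg, fs=fs, window='hann', nperseg=512, noverlap=256)

# Band-power feature: integrate the PSD over the alpha band (8-13 Hz)
alpha = (freqs >= 8) & (freqs <= 13)
alpha_power = np.trapz(psd[alpha], freqs[alpha])
print("Alpha band power:", alpha_power)

Band powers obtained this way (for the delta, theta, alpha and beta bands) are among the most common PSD-based features in BCI pipelines.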
B. Burg Method
• The Burg method minimizes the forward and backward prediction errors while satisfying the Levinson-Durbin recursion.


• With higher orders of the Burg model, the accuracy becomes lower and false peaks may be inferred in the spectra.
 The Burg method is highly suitable for short data records as it can generate accurate
prediction and always produces a stable model.
• Overall, the Burg method of PSD estimation can be computed through the following equation:

P_burg(f) = Ê_p / |1 + Σ_{k=1}^{p} â_p(k) e^{−j2πfk}|²

On the whole, the effects on PSDs suggest that researchers should be careful when making choices about EEG transformation and time window, since these seemed to have the largest effects on PSDs. Artifact removal, filter, and PSD estimation method choices may have less effect on PSDs and can possibly be ignored in trial-to-trial studies.

Figure 5: Flow diagram of the processing methods carried out to estimate powers and phases for four frequency bands. The selection choices of five methods, highlighted with diamond shapes, were explored: artifact removal, electroencephalogram (EEG) transformation, filtering, time-window selection, and power spectral density (PSD) estimation. The estimated powers and phases were used to find the correlation between the choices.


***************************************************************************
Explain Wavelet method in detail and how it is used to extract feature from EEG.
***************************************************************************

4. Wavelet Transform:
Equation:
The Continuous Wavelet Transform (CWT) is given by:

W(a, b) = (1/√a) ∫_{−∞}^{∞} x(t) ψ*((t − b)/a) dt

Where:
 W(a,b) is the wavelet transform,
 x(t) is the input signal,
 ψ∗(t) is the complex conjugate of the wavelet function,
 a is the scale parameter, and
 b is the translation parameter.

Block Diagram:
The block diagram for the Continuous Wavelet Transform involves scaling and translating the
wavelet function to analyze the input signal at different scales and positions.

+--------+      +------------------------+
|  x(t)  | ---->|  Continuous Wavelet    | ----> W(a, b)
+--------+      |  Transform             |
                +------------------------+
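
For illustration, the following sketch uses the PyWavelets library (an assumption; any CWT implementation would do) with a Morlet wavelet and an illustrative scale range to show how |W(a, b)| localizes a transient in both time and frequency:

import numpy as np
import pywt

# Illustrative signal: a 10 Hz burst appearing in the second half of the record
fs = 250.0
t = np.arange(0, 4.0, 1.0 / fs)
x = np.where(t > 2.0, np.sin(2 * np.pi * 10 * t), 0.0)

# Continuous Wavelet Transform: each scale maps to a pseudo-frequency
scales = np.arange(1, 64)
coefs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1.0 / fs)

# coefs has shape (len(scales), len(x)); large |coefs| values appear only
# at rows near 10 Hz and columns after t = 2 s (time-frequency localization)
print(coefs.shape, freqs.min(), freqs.max())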

• Each of these methods has its own strengths and weaknesses in different applications.
• The Fourier Transform is excellent for analyzing the frequency content of a signal.
• The PSD gives information about the distribution of power with respect to frequency, and the Wavelet Transform is valuable for analyzing signals in both the time and frequency domains simultaneously, making it useful for non-stationary signals.
• The Wavelet Transform is suitable for nonstationary signals and has an advantage over spectral analysis.
• For time frequency representation of a signal wavelet is an effective method. The important feature
of WT is that it provides accurate frequency information at the low frequencies and accurate time
information at the high frequencies.
• This property is important in biomedical applications, because most signals in the biomedical field contain high-frequency components of short duration and low-frequency components of long duration. The WT provides multiresolution analysis of nonstationary signals, as shown in Fig. 6.
• Here g[n] is the high-pass filter and h[n] is the low-pass filter. The WT is most suitable for locating transient events and has an advantage over spectral analysis. Here the EEG signal is decomposed into levels D1-D4.
• Wavelets overcome the limitations of the short-time Fourier transform (STFT). In severely ill patients, detection of brain disorders using conventional methods is very inconvenient.
• Frequency content in the EEG signal provides useful information as compared to time domain.
The mother function ψ(n) is convolved with the signal x(n).


• Its function is given by the formula

w_ψ x(b, a) = (1/√a) Σ_{n'=0}^{N−1} x(n') ψ*((n' − b)/a)

where a is the scale coefficient and b is the shift (translation) coefficient. The choice of the mother wavelet is important because, once it is fixed, the signal can be analyzed at all possible values of the coefficients a and b.

Figure 6: Wavelet decomposition process


• The decomposition levels of the EEG signal are selected based upon the dominant frequency components present in the signal.
• This decomposition of the EEG signal leads to the formation of coefficients called wavelet coefficients. Among the different wavelet families, the Daubechies family of order 2 (db2) is most often used due to its smoothing function.
• The downsampled output of the high-pass filter provides the detail wavelet coefficients, and that of the low-pass filter provides the approximation wavelet coefficients.
• The discrete wavelet transform is a signal processing tool that has many engineering and scientific applications.
• It is useful for quantifying spikes, sharp waves and spike-waves. The DWT analyses the signal in different frequency bands using filters. It is mainly used in the detection of epileptic seizures.
• The wavelet decomposition provides features such as the maximum, minimum, mean and standard deviation of the coefficients of each sub-band.
• The WT is also used in the detection of mental tasks such as resting, multiplication, figure rotation and letter composition; a minimal feature-extraction sketch is given below.
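
For illustration, the following Python sketch uses the PyWavelets library (assumed to be available; the sampling rate and the random placeholder signal are also assumptions) to decompose one EEG channel into four levels with the db2 wavelet and to collect simple per-sub-band statistics as a feature vector:

import numpy as np
import pywt

# Assumed inputs: 'eeg' stands in for one EEG channel sampled at fs Hz
fs = 250.0
eeg = np.random.randn(int(10 * fs))   # placeholder for a real recording

# 4-level discrete wavelet transform with the Daubechies-2 (db2) mother wavelet
coeffs = pywt.wavedec(eeg, 'db2', level=4)   # [A4, D4, D3, D2, D1]

# Simple statistical features per sub-band: maximum, minimum, mean, standard deviation
features = []
for band in coeffs:
    features.extend([band.max(), band.min(), band.mean(), band.std()])

feature_vector = np.array(features)
print(feature_vector.shape)   # one feature vector per channel (or per trial)

*********************************************************************************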

Explain in detail the parametric methods (AR, MA and ARMA models) used for the analysis of features in EEG.
*********************************************************************************
5. Parametric Methods


• AR, MA, ARMA, and ARIMA models are used to forecast the observation at time (t+1) based on the historical data recorded for the same observation at previous time points.
• However, it is necessary to make sure that the time series is stationary over the historical observation period.
• If the time series is not stationary, we can apply differencing to the records and check whether the graph of the differenced series is stationary over the time period.

ACF (Auto Correlation Function)


• The autocorrelation function takes into consideration all the past observations, irrespective of their effect on the future or present time period.
• It calculates the correlation between the t and (t−k) time periods.
• It includes all the lags or intervals between the t and (t−k) time periods.
• The correlation is always calculated using the Pearson correlation formula.

PACF (Partial Correlation Function)


• The PACF determines the partial correlation between time periods t and t−k.
• It does not take into consideration all the time lags between t and t−k. For example, today's stock price may depend on the stock price 3 days ago without taking yesterday's closing price into consideration.
• Hence we consider only the time lags having a direct impact on the future time period, neglecting the insignificant time lags between the two time slots t and t−k. A short computational sketch of both functions is given after this list.
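
The sketch below computes both functions with statsmodels (assumed installed); the synthetic AR(2) series merely stands in for a single EEG channel, so its coefficients and lag counts are illustrative:

import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Illustrative AR(2)-like series standing in for one EEG channel
rng = np.random.default_rng(0)
x = np.zeros(1000)
for n in range(2, 1000):
    x[n] = 0.6 * x[n - 1] - 0.3 * x[n - 2] + rng.standard_normal()

# ACF keeps all intermediate lags; PACF keeps only the direct effect of lag k
r = acf(x, nlags=20)
phi = pacf(x, nlags=20)

# For an AR(p) process the PACF cuts off after lag p, while the ACF decays gradually
print("ACF  lags 1-5:", np.round(r[1:6], 2))
print("PACF lags 1-5:", np.round(phi[1:6], 2))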

How to differentiate when to use ACF and PACF?


• Let's take an example of sweets sale and income generated in a village over a year. Under the
assumption that every 2 months there is a festival in the village, we take out the historical data of
sweets sale and income generated for 12 months.
• If we plot the time as month then we can observe that when it comes to calculating the sweets sale
we are interested in only alternate months as the sale of sweets increases every two months.
• But if we are to consider the income generated next month then we have to take into consideration
all the 12 months of last year, so in the above situation, we will use ACF to find out the income
generated in the future but we will be using PACF to find out the sweets sold in the next month.

LTI system model
• In the model given below, the random signal x[n] is observed. Given the observed signal x[n], the goal here is to find a model that best describes the spectral properties of x[n] under the following assumptions:
• x[n] is WSS (wide-sense stationary) and ergodic.
• The input signal to the LTI system is white noise following a Gaussian distribution with zero mean and variance σ².
• The LTI system is BIBO (Bounded Input Bounded Output) stable.

Figure 7: Linear Time Invariant (LTI) system – signal model

In the model shown above, the input to the LTI system is white noise following a Gaussian distribution with zero mean and variance σ². The power spectral density (PSD) of the noise w[n] is

S_ww(e^{jω}) = σ²

The noise process drives the LTI system with frequency response H(e^{jω}), producing the signal of interest x[n]. The PSD of the output process is therefore

S_xx(e^{jω}) = σ² |H(e^{jω})|²

A short numerical sketch of this relation is given after the list of model types below.

Three cases are possible given the nature of the transfer function of the LTI system that is under
investigation here.

• Auto Regressive (AR) models: H(e^{jω}) is an all-poles system
• Moving Average (MA) models: H(e^{jω}) is an all-zeros system
• Auto Regressive Moving Average (ARMA) models: H(e^{jω}) is a pole-zero system
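
The following sketch checks the relation S_xx = σ²|H|² numerically with SciPy; the AR(2) filter coefficients are assumptions chosen only to produce a visible resonant peak:

import numpy as np
from scipy.signal import freqz, lfilter

# Illustrative all-pole (AR) system; coefficients are assumed for the demo
sigma2 = 1.0
b = [1.0]                 # numerator (no zeros)
a = [1.0, -1.2, 0.8]      # denominator 1 + a1*z^-1 + a2*z^-2 (poles)

# Theoretical output PSD: S_xx(e^jw) = sigma^2 * |H(e^jw)|^2
w, H = freqz(b, a, worN=512)
Sxx_theory = sigma2 * np.abs(H) ** 2

# Empirical check: drive the filter with zero-mean white Gaussian noise
noise = np.sqrt(sigma2) * np.random.randn(20000)
x = lfilter(b, a, noise)

print("Peak of theoretical PSD:", Sxx_theory.max())
print("Output signal variance:", x.var())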
5.1. Auto Regressive (AR) models (all-poles model)
In the AR model, the present output sample x[n] and the past N output samples determine the source input w[n]. The difference equation that characterizes this model is given by

x[n] + a1·x[n−1] + a2·x[n−2] + … + aN·x[n−N] = w[n]

Here, the LTI system is an Infinite Impulse Response (IIR) filter. This is evident from the fact that the above equation uses past samples of x[n] when determining w[n], thereby creating a feedback loop from the output of the filter.
The frequency response of the IIR filter is well known:

H(e^{jω}) = 1 / (Σ_{k=0}^{N} a_k e^{−jkω}),   a_0 = 1

Figure 8: Spectrum of all-pole transfer function (representing AR model)

The transfer function H(e^{jω}) is an all-pole transfer function (when the denominator goes to zero, the transfer function goes to infinity, creating peaks in the spectrum). Poles are best suited to model resonant peaks in a given spectrum. At the peaks, the poles are close to the unit circle. This model is well suited for modeling peaky spectra.
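
A model-fitting sketch follows; it estimates the AR coefficients from the autocorrelation sequence via the Yule-Walker equations using only NumPy. The model order and the synthetic test series are assumptions, and the sign convention matches the difference equation above, x[n] + Σ a_k x[n−k] = w[n]:

import numpy as np

def ar_yule_walker(x, order):
    # Estimate AR coefficients a_1..a_order for x[n] + sum_k a_k x[n-k] = w[n]
    x = np.asarray(x, dtype=float) - np.mean(x)
    N = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(order + 1)])
    # Solve the Yule-Walker system R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    noise_var = r[0] + np.dot(a, r[1:])
    return a, noise_var

# Illustrative use on a synthetic AR(2) series standing in for an EEG epoch
rng = np.random.default_rng(1)
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = 1.2 * x[n - 1] - 0.8 * x[n - 2] + rng.standard_normal()

a_hat, var_hat = ar_yule_walker(x, order=2)
print(a_hat)   # approximately [-1.2, 0.8] under this sign convention

The estimated coefficients a_k (and the prediction-error variance) are commonly used directly as AR features for each EEG epoch and channel.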
5.2. Moving Average (MA) models (all-zeros model)

In the MA model, the present output sample x[n] is determined by the present source input w[n] and
past N samples of source input w[n]. The difference equation that characterizes this model is given by
x[n] =b0 w[n] + b1 w[n-1] + b2w[n-2] + … + bM w[n-M]

Here, the LTI system is a Finite Impulse Response (FIR) filter. This is evident from the fact that in the above equation no feedback is involved from the output to the input. The frequency response of the FIR filter is well known:

H(e^{jω}) = Σ_{k=0}^{M} b_k e^{−jkω}

The transfer function H(ejɷ) is an all-zero transfer function (when the numerator is set to zero, the
transfer function goes to zero -> creating nulls in the spectrum). Zeros are best suited to model
sharp nulls in a given spectrum.

Figure 9: Spectrum of all-zeros transfer function (representing MA model)

How they differ:


 The AR model relates the current value of the series to its past values. It assumes that past

values have a linear relationship with the current value.


 The MA model relates the current value of the series to past white noise or error terms. It

captures the shocks or unexpected events in the past that are still affecting the series.
Combined Models:
Often, these models are combined to model and forecast time series data more effectively:
 ARMA (Autoregressive Moving Average): This model combines both AR and MA

components.
 ARIMA (Autoregressive Integrated Moving Average): This model adds an “I” (integrated)

component, which involves differencing the series to make it stationary before applying an
ARMA model.
Both AR and MA models (and their combinations) are foundational in time series forecasting, and
their applicability depends on the characteristics of the data and the nature of the underlying
processes generating the time series.


5.3. Auto Regressive Moving Average (ARMA) model (pole-zero model)


The ARMA model is a generalized model that is a combination of the AR and MA models. The output of the filter is a linear combination of both weighted inputs (present and past samples) and weighted past outputs. The difference equation that characterizes this model is given by

x[n] + a1·x[n−1] + … + aN·x[n−N] = b0·w[n] + b1·w[n−1] + … + bM·w[n−M]

The frequency response of this generalized filter is well known:

H(e^{jω}) = (Σ_{k=0}^{M} b_k e^{−jkω}) / (Σ_{k=0}^{N} a_k e^{−jkω}),   a_0 = 1

Figure 10: Spectrum of pole-zero transfer function (representing ARMA model)

The transfer function H(ejɷ) is a pole-zero transfer function. It is best suited for modelling complex
spectra having well defined resonant peaks and nulls.
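
As an illustration, the sketch below fits an ARMA(2,1) model with statsmodels (assumed installed) by specifying an ARIMA order of (2, 0, 1); the stand-in epoch and the chosen order are assumptions, and in practice the order would be selected per dataset (e.g. via AIC):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative stand-in for one EEG epoch (a real detrended epoch would be used)
rng = np.random.default_rng(2)
epoch = rng.standard_normal(1000)

# ARMA(2,1) fitted as ARIMA with no differencing: order = (p, d, q) = (2, 0, 1)
model = ARIMA(epoch, order=(2, 0, 1))
result = model.fit()

# The estimated AR and MA parameters form a compact feature vector for the epoch
features = np.concatenate([result.arparams, result.maparams])
print(features)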

Comparing AR and ARMA model – minimization of squared error


AR model error and minimization

In the AR model, the present output sample x[n] and the past N output samples determine the source input w[n]. The difference equation that characterizes this model is given by

x[n] + a1·x[n−1] + a2·x[n−2] + … + aN·x[n−N] = w[n]
The model can be viewed from another perspective, where the input noise w[n] is viewed as an
error – the difference between present output sample x[n] and the predicted sample of x[n] from
the previous N-1 output samples. Let’s term this “AR model error”. Rearranging the difference
equation,

w[n] = x[n] − (−Σ_{k=1}^{N} a_k x[n−k])

• The summation term inside the brackets is viewed as the output sample predicted from the past N output samples, and the difference is the error w[n].
• Least-squares estimates of the coefficients a_k are found by evaluating the first derivative of the squared error with respect to a_k and equating it to zero to find the minimum.
• From the equation above, w²[n] is the squared error that we wish to minimize. Here, w²[n] is a quadratic function of the unknown model parameters a_k.


• Quadratic functions have a unique minimum, therefore it is easier to find the least-squares estimates of a_k by minimizing w²[n].
ARMA model error and minimization

The difference equation that characterizes this model is given by

x[n] +a1x[n-1]+…+aNx[n-N] =b0w[n] +b1w[n-1]+…+bMw[n-M]

Re-arranging, the ARMA model error w[n] is given by

w[n] = x[n] − (−Σ_{k=1}^{N} a_k x[n−k] + Σ_{k=1}^{M} b_k w[n−k])

Now, the predictor (terms inside the brackets) considers weighted combinations of past values of
both input and output samples.

The squared error w²[n] is NOT a quadratic function, and we have two sets of unknowns, a_k and b_k. Therefore, no unique solution may be available to minimize this squared error, since multiple minima pose a difficult numerical optimization problem.
ARMA Process
The ARMA process of order (p, q) is obtained by combining an MA(q) process and an AR(p)
processes. That is, it contains p AR terms and q MA terms and is given by
Y_n = Σ_{k=1}^{p} α_k Y_{n−k} + Σ_{k=0}^{q} β_k W_{n−k},   n ≥ 0
A structural representation of the ARMA process is a combination of the structures

Figure 11: Structure of an ARMA Process


One of the advantages of ARMA is that a stationary random sequence (or time series) may be more
adequately modeled by an ARMA model involving fewer parameters than a pure MA or AR process
alone. Since E[W[n − k]] = 0 for k = 0, 1, 2, …, q, it is easy to show that E[Y[n]] = 0. Similarly, it can
be shown that the variance of Y[n] is given by

σ²_Y(n) = Σ_{k=1}^{p} α_k R_YY(n, n−k) + Σ_{k=0}^{q} β_k R_YW(n, n−k)

Thus, the variance is obtained as the weighted sum of the autocorrelation function evaluated at
different times and the weighted sum of various crosscorrelation functions at different times. Finally, it
can be shown that the transfer function of the linear system defined by the ARMA(p, q) is given by

H(Ω) = (Σ_{k=0}^{q} β_k e^{−jΩk}) / (1 − Σ_{k=1}^{p} α_k e^{−jΩk})

***************************************************************************

Explain the concept of Principal component Analysis in detail for feature selection in BCI

****************************************************************************

6. Principal Component Analysis (PCA)

• PCA is a multivariate analytical method based on the linear transformation that is often used to
reduce the dimensionality of the data, to extract significant information from big data, to analyze
the variable structures, etc.
• The PCA method has been used for dimensionality reduction of EEG signals. Since the spatial resolution of the EEG signal is poor, considering all channels for feature extraction only increases the computational burden.

PCA is a dimensionality reduction technique that identifies important relationships in our data, transforms the existing data based on these relationships, and then quantifies the importance of these relationships so that we can keep the most important ones and drop the others. To remember this definition, we can break it down into four steps:


Figure 12: PCA method

1. We identify the relationships among features through a covariance matrix.
2. Through the linear transformation or eigendecomposition of the covariance matrix, we get eigenvectors and eigenvalues.
3. Then we transform our data using the eigenvectors into principal components.
4. Lastly, we quantify the importance of these relationships using the eigenvalues and keep the important principal components.

Figure 13: steps of Principal Component Analysis


The following demo presents the linear transformation between features and principal components
using eigenvectors for a single data point from the Iris database.

Step by Step of Principal Component Analysis

Step by step, Principal Component Analysis unveils the hidden layers of data complexity,
simplifying the intricate to reveal the essential.

Algorithm
Principal Component Analysis (PCA) is a widely used technique in data analysis and
dimensionality reduction. It helps in identifying the most significant patterns and reducing the
complexity of high-dimensional data while preserving its essential information. This essay will
explain the steps involved in performing PCA:
 SCORE MATRIX GENERATION

Figure 14: Score Matrix generation


Step 1: Data Collection and Standardization Before applying PCA, gather your data. Ensure that
your data is numeric, as PCA is primarily suited for numerical data. If your data has categorical
variables, you may need to preprocess them.
Next, standardize the data. Standardization is important because PCA is sensitive to the
scales of variables. Standardization transforms the data so that each variable has a mean of 0 and a
standard deviation of 1. This ensures that all variables are on the same scale.

Figure 15: Dimensionality Reduction of PCA

Step 2: Covariance Matrix Calculation The first step in PCA is to compute the covariance matrix
of the standardized data. The covariance matrix represents the relationships between variables.
Each element in the matrix represents the covariance between two variables.The formula for the
covariance between two variables X and Y is:
Cov(X, Y) = (1/(N−1)) Σ_{i=1}^{N} (X_i − X̄)(Y_i − Ȳ)

Where:
 N is the number of data points.
 Xi and Yi are individual data points.
• X̄ and Ȳ are the means of variables X and Y, respectively.
The covariance matrix is a square matrix, with each element representing the covariance between
two variables.
Step 3: Eigenvalue and Eigenvector Computation After computing the covariance matrix, the
next step is to find the eigenvalues and eigenvectors of this matrix.
These are crucial in determining the principal components.Eigenvalues (λ) and eigenvectors (v) are
obtained by solving the following equation:
Covariance Matrix × v = λ × v


In this equation, λ represents the eigenvalue, and v represents the eigenvector. You’ll have as many
eigenvalues and eigenvectors as the number of variables in your data.

Step 4: Sorting and Selecting Principal Components The eigenvalues represent the amount of
variance in the data that each eigenvector explains. To reduce the dimensionality, sort the
eigenvalues in descending order.

The eigenvector corresponding to the largest eigenvalue explains the most variance and is the first
principal component. The second largest eigenvalue corresponds to the second principal
component, and so on.

Typically, you’ll select a subset of the top eigenvalues/eigenvectors that explain most of the
variance in the data while reducing the dimensionality. You can decide on the number of principal
components to keep based on a variance explained threshold (e.g., 95% of the total variance).

Step 5: Data Transformation To reduce the dimensionality of your data, create a projection
matrix using the selected eigenvectors (principal components). This matrix represents the
transformation needed to project the data into the new reduced-dimensional space.

Multiply the standardized data by this projection matrix to obtain the new data in the principal component space.

Step 6: Interpretation and Analysis Once the data is transformed, you can interpret the principal
components and their relationships to the original variables. This is crucial for understanding the
most significant patterns in the data.
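
The six steps can be condensed into a short NumPy sketch; the random data matrix is only a stand-in for a real trials-by-features array, and the 95% variance threshold is the illustrative choice mentioned above:

import numpy as np

# Stand-in data matrix: rows are trials/samples, columns are features (e.g. channels)
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 8))

# Step 1: standardize each feature to zero mean and unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(Xs, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh, since C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort by decreasing eigenvalue and keep enough components for 95% variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.95)) + 1

# Step 5: project the data onto the top-k principal components
scores = Xs @ eigvecs[:, :k]
print(k, scores.shape)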

Explain how PCA is used for linear and non-linear component Analysis in detail

******************************************************************************

6.1. KARHUNEN-LOÈVE TRANSFORM

For completeness, we reiterate the basic approach of principal components, or the Karhunen-Loève transform. First, the d-dimensional mean vector µ and the d × d covariance matrix Σ are computed for the full data set.

Next, the eigenvectors and eigenvalues are computed and sorted according to decreasing eigenvalue. Call these eigenvectors e1 with eigenvalue λ1, e2 with eigenvalue λ2, and so on, and choose the k eigenvectors having the largest eigenvalues.

Often there will be just a few large eigenvalues, and this implies that k is the inherent dimensionality of the subspace governing the "signal", while the remaining d − k dimensions generally contain noise.


Next we form a d × k matrix A whose columns consist of the k eigenvectors. The representation of the data by principal components consists of projecting the data onto the k-dimensional subspace according to

x′ = F1(x) = Aᵗ(x − µ).

AUTO ENCODER

A simple three-layer linear neural network, trained as an auto-encoder, can form such a representation, as shown in Figure 16.

Each pattern of the data set is presented to both the input and output layers, and the full network is trained by gradient descent on a sum-squared-error criterion, for instance by backpropagation.

It can be shown that this representation minimizes a squared-error criterion. After the network is trained, the top layer is discarded and the linear hidden layer provides the principal components.

FIGURE 16. A three-layer neural network with linear hidden units, trained to be an auto-encoder, develops an internal representation that corresponds to the principal components of the full data set. The transformation F1 is a linear projection onto a k-dimensional subspace denoted Г(F2).
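
A minimal sketch of this idea is given below using scikit-learn's MLPRegressor as a stand-in three-layer auto-encoder (an assumption of convenience; a dedicated deep-learning framework would normally be used). With identity activations and k hidden units, the discarded-top-layer representation spans approximately the same subspace as the first k principal components:

import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data matrix (rows = patterns, columns = d features)
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)

k = 3   # number of hidden (principal) components to learn

# Three-layer network trained as an auto-encoder: input and target are both X
net = MLPRegressor(hidden_layer_sizes=(k,), activation='identity',
                   solver='adam', max_iter=2000, random_state=0)
net.fit(X, X)

# Discard the top layer: the hidden-layer activations are the learned components
hidden = X @ net.coefs_[0] + net.intercepts_[0]
print(hidden.shape)   # (500, k)

For the nonlinear component analysis described next, the same auto-encoder idea is used, but with additional nonlinear hidden layers on either side of the k-unit bottleneck.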

6.2. Nonlinear Component Analysis (NLCA)

• Principal component analysis yields a k-dimensional linear subspace of feature space that best
represents the full data according to a minimum-square-error criterion,
• If the data represent complicated interactions of features, then the linear subspace may be a
poor representation and nonlinear components may be needed.
• A neural network approach to such nonlinear component analysis employs a network with five
layers of units, as shown in Fig.17
• The middle layer consists of k < d linear units, and it is here that the nonlinear components will
be revealed. It is important that the two other internal layers have nonlinear units.
• The entire network is trained using the techniques as an auto encoder or auto-associator. That
is, each d-dimensional pattern is presented as both the input and as the target or desired output.


• When trained using a sum-squared error criterion, such a network readily learns the auto-
encoder problem. The top two layers of the trained network are discarded, and the rest used for
nonlinear component analysis.
• For each input pattern x, the outputs of the k units of the three-layer network correspond to the
nonlinear components.
• We can understand the function of the full five-layer network in terms of two successive mappings, F1 followed by F2. As Fig. 17 illustrates, F1 is a projection from the d-dimensional input onto a k-dimensional nonlinear subspace, and F2 is a mapping from that subspace back to the full d-dimensional space.
• There are often multiple local minima in the error surface associated with the five-layer network, and we must take care to set an appropriate number k of units.
• Recall that in (linear) principal component analysis, the number of components k could be chosen based on the spectrum of eigenvalues.
• If the eigenvalues are ordered by magnitude, any significant drop between successive values indicates a "natural" dimension of the subspace. Likewise, suppose five-layer networks are trained with different numbers k of units in the middle layer.

FIGURE 17. A five-layer neural network with two layers of nonlinear units (e.g. sigmoidal), trained to be an auto-encoder, develops an internal representation that corresponds to the nonlinear components of the full data set. The process can be viewed in feature space (at the right). The transformation F1 is a nonlinear projection onto a k-dimensional subspace, denoted Г(F2). Points in Г(F2) are mapped via F2 back to the d-dimensional space of the original data. After training, the top two layers of the net are removed and the remaining three-layer network maps inputs x to the space Г(F2).

FIGURE 18. Features from two classes are shown, along with nonlinear components of the full data set. The classes are well separated along the line marked z2, but the large noise makes the largest nonlinear component lie along z1. Pre-processing by keeping merely the largest nonlinear component would retain the "noise" and discard the "signal", giving poor recognition. The same defect can arise in linear principal components, where the coordinates are linear and orthogonal.


• Assuming poor local minima have been avoided, the training error will surely decrease for successively larger values of k.
• If the improvement of k + 1 over k is small, this may indicate that k is the "natural" dimension of the subspace at the network's middle layer.
• We should not conclude that principal component analysis or nonlinear component analysis is always beneficial for classification.
• If the noise is large compared to the difference between categories, then component analysis will find the directions of the noise, rather than the signal, as illustrated in Fig. 18.
• In such cases, we seek to ignore the noise and instead extract the directions that are indicative of the categories, a technique considered next.


UNIT-III
TWO MARK Questions and Answers

1. What is feature extraction in the context of BCI, and why is it important?


 Feature extraction in BCI refers to the process of selecting relevant information from raw
electroencephalogram (EEG) signals or other neural data to represent important characteristics.
 It is crucial because raw neural signals are often high-dimensional and contain a lot of
redundant or irrelevant information.
 Extracting relevant features helps reduce the dimensionality and highlights essential patterns,
making it easier for machine learning algorithms to interpret and classify brain activity.

2. Mention two common feature extraction methods used in BCI.


a) Time Domain Features:
Time-domain features involve analyzing the characteristics of neural signals in the time
dimension. Examples include mean amplitude, standard deviation, and peak amplitude. These
features capture information about the signal's amplitude and temporal dynamics.
b) Frequency Domain Features:
Frequency-domain features involve transforming the neural signals into the frequency domain
using methods like Fourier Transform. Features such as power spectral density, frequency band
power, or event-related synchronization/desynchronization are extracted, providing insights
into different frequency components of brain activity.

3. Explain the significance of spectral power in BCI feature extraction.


 Spectral power is a crucial feature extracted in BCI to understand the distribution of signal power
across different frequency bands.
 Different brain activities manifest in specific frequency ranges and analyzing spectral power
helps capture these patterns.
 For instance, alpha waves (8-13 Hz) are associated with relaxation, while beta waves (13-30 Hz)
are linked to cognitive processes.
 Spectral power features provide valuable information for identifying and classifying different
mental states or intentions in a BCI system.

4. How does spatial filtering contribute to feature extraction in BCI?


 Spatial filtering involves manipulating the distribution of neural signals across different
recording channels.
 In BCI, methods like Common Spatial Patterns (CSP) are employed to enhance the
discrimination between different brain states or classes.
 CSP optimally combines spatial information from multiple EEG channels, emphasizing the
channels that contain the most discriminative information.
 This enhances the signal-to-noise ratio and facilitates more effective feature extraction for
classification algorithms in BCI systems.

5. What is the primary purpose of utilizing the Fourier Transform in the context of Brain-
Computer Interface (BCI)?
 The Fourier Transform is used in BCI to convert neural signals from the time domain into the
frequency domain. By doing so, it allows the analysis and extraction of frequency components
present in the signal.
 This transformation is crucial for identifying and understanding different brain activities that
manifest at specific frequency bands, aiding in feature extraction for BCI applications.

6. Define power spectral density. How does it help in feature extraction?


The power spectral density (PSD) is a measure that describes how the power of a signal is
distributed across different frequencies. It provides a way to analyze the frequency content of a
time or space series.


Interpretation: The PSD indicates the strength of the signal at different frequencies, helping to
identify dominant frequency components and patterns in the data.

7. Explain the significance of Power Spectral Density (PSD) in BCI feature extraction.
 Power Spectral Density (PSD) provides information about the distribution of signal power
across different frequencies.
 In BCI, PSD is essential for identifying frequency-specific characteristics of neural signals.
 It helps in extracting features related to different mental states or tasks by highlighting the
power variations within specific frequency bands.
 PSD is particularly useful for discerning patterns associated with cognitive processes, making
it a valuable tool in BCI feature extraction.

8. How do wavelets contribute to feature extraction in BCI, and what distinguishes them from
Fourier Transform?
Wavelets are employed in BCI for both time and frequency analysis. Unlike the Fourier Transform, which provides a fixed resolution in the frequency domain, wavelets offer variable resolution, allowing the analysis of both time and frequency content with a resolution adapted to each frequency band.

9. Define Wavelet Transform.


Wavelet transform is a mathematical tool used for analyzing signals, images, and other types of
data by decomposing them into different scales or resolutions. The wavelet transform uses
wavelets, which are small, localized functions that are well-suited for capturing localized features
in the data.

10. How Wavelet transform is used in signal processing?


Wavelet transform is useful for analyzing and extracting features from non-stationary signals with
time-varying characteristics.

11. What is the basic idea behind Wavelets?


The basic idea behind wavelets is to decompose a signal into different frequency components at
different scales. This decomposition is achieved by convolving the signal with a series of wavelet
functions that are dilated and translated. The resulting wavelet coefficients represent the
contribution of different frequency components at various scales.

12. What is the primary purpose of using AutoRegressive (AR) models in BCI?
AutoRegressive (AR) models in BCI are employed to capture the temporal dependencies within
neural signals. By representing each data point as a linear combination of its previous values, AR
models help in understanding and extracting the dynamic aspects of brain activity over time.

13. How does Moving Average (MA) contribute to feature extraction in BCI?
 Moving Average (MA) in BCI serves the purpose of smoothing time-series data. It calculates
the average of consecutive data points, reducing noise and highlighting underlying trends.
 MA is often used in BCI feature extraction to enhance the signal quality and improve the
interpretability of neural activity patterns.

14. What is the role of ARMA parametric model in feature Extraction?


ARMA models combine the autoregressive and moving average components, offering a more
comprehensive representation of the temporal dynamics in EEG signals. The parameters of both
AR and MA components can be utilized as features for capturing both short-term and long-term
dependencies in the data.
15. Define the role of Principal Component Analysis (PCA) in BCI.
 Principal Component Analysis (PCA) is utilized in BCI to reduce the dimensionality of data.
By identifying the principal components, which are linear combinations of original features,
 PCA helps in extracting essential information from neural signals. This reduction in
dimensionality facilitates efficient feature representation and analysis in BCI applications.


16. Provide an example of a linear feature and a nonlinear feature used in BCI.

 Linear Feature: Mean amplitude of EEG signals is an example of a linear feature. It is


calculated as the average value of signal amplitudes and provides a straightforward measure of
signal intensity.
 Nonlinear Feature: Fractal dimension of neural signals is an example of a nonlinear feature. It
characterizes the complexity and irregularity of the signal, offering insights beyond linear
relationships and providing a more nuanced understanding of brain activity in BCI.

17. What is the basic concept of the Principal Component Analysis (PCA) method?

Principal Component Analysis (PCA) is a statistical method used for analyzing and simplifying the structure of high-dimensional data by transforming it into a new coordinate system defined by the principal components. The goal of PCA is to identify the directions, or principal components, in which the data varies the most.

18. What are the linear-feature consequences of how Principal Component Analysis operates?

• PCA operates through linear transformations. It seeks to find a set of orthogonal axes (principal components) along which the data has the maximum variance. Each principal component is a linear combination of the original features.
• The principal components are orthogonal, meaning they are uncorrelated. This is a consequence of the linear transformation applied during PCA.
