
UNIT II FEATURE ENGINEERING

Text Data – Visual Data – Feature-based Time-Series Analysis – Data Streams – Feature Selection and Evaluation
•What is Feature Engineering?
•Feature Engineering is the process of extracting and organizing the important features from raw data so that they fit the purpose of the machine learning model. It can be thought of as the art of selecting the important features and transforming them into refined and meaningful features that suit the needs of the model.
•Benefits of Feature Engineering
•Effective feature engineering implies:

 Higher efficiency of the model

 Simpler algorithms that fit the data

 Easier detection of patterns in the data by the algorithms

 Greater flexibility of the features


Feature Creation, Transformations,
Feature Extraction, and Feature Selection
1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using addition, subtraction, and ratios, and these new features have great flexibility.

2. Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model is flexible enough to take input from a variety of data; it ensures that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data. The main aim of this step is to
reduce the volume of data so that it can be easily used and managed for data modelling. Feature
extraction methods include cluster analysis, text analytics, edge detection
algorithms, and principal components analysis (PCA).

4. Feature Selection: While developing the machine learning model, only a few variables in the dataset are useful for building the model; the remaining features are either redundant or irrelevant. If we feed the dataset with all these redundant and irrelevant features into the model, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important features, which is done with the help of feature selection in machine learning.
Need for Feature Engineering in Machine
Learning

• Better features mean flexibility.


• Better features mean simpler models.
• Better features mean better results.
Steps in Feature Engineering
o Data Preparation: The first step is data preparation. In this step, raw data acquired from different sources is prepared and put into a suitable format so that it can be used in the ML model. Data preparation may involve cleaning, delivery, data augmentation, fusion, ingestion, or loading of data.
o Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA), is an important step of feature engineering, which is mainly used by data scientists. This step involves analyzing and investigating the data set and summarizing its main characteristics. Different data visualization techniques are used to better understand the manipulation of data sources, to find the most appropriate statistical technique for data analysis, and to select the best features for the data.
o Benchmark: Benchmarking is a process of setting a standard baseline for accuracy
to compare all the variables from this baseline. The benchmarking process is used to
improve the predictability of the model and reduce the error rate.
Feature Engineering Techniques

•1. Imputation
•Feature engineering deals with inappropriate data, missing values, human error, general errors, insufficient data sources, etc. Missing values within the dataset strongly affect the performance of the algorithm, and the "imputation" technique is used to deal with them. Imputation is responsible for handling irregularities within the dataset.
•For example, rows or columns with a huge percentage of missing values can simply be removed. But at the same time, to maintain the data size, it is often preferable to impute the missing data, which can be done as follows (a short sketch is given after the list):
o For numerical data imputation, a default value can be imputed in a column, or missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most frequently occurring value in the column.
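A minimal sketch of both imputation strategies using pandas; the DataFrame, column names, and the 70% missing-value cut-off are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 71_000],
    "city":   ["Delhi", None, "Mumbai", "Delhi", None],
})

# Drop columns in which more than 70% of the values are missing (arbitrary cut-off).
df = df.loc[:, df.isna().mean() < 0.7]

# Numerical imputation: fill missing values with the column median (or mean).
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical imputation: fill missing values with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```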
2. Handling Outliers
•Outliers are deviated values or data points that lie so far away from the other data points that they badly affect the performance of the model. Outliers can be handled with this feature engineering technique: first the outliers are identified, and then they are removed.
•Standard deviation can be used to identify outliers. For example, each value in a dataset lies at some distance from the average, but if a value lies further away than a certain threshold, it can be considered an outlier. The Z-score can also be used to detect outliers, as in the sketch below.
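A small sketch of standard-deviation (Z-score) based outlier removal; the data and the 3-sigma cut-off are illustrative choices, not a prescribed setting.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=11, scale=1.5, size=50), [120.0]])  # 120 is a planted outlier

# Z-score: distance from the mean expressed in standard deviations.
z_scores = (values - values.mean()) / values.std()
mask = np.abs(z_scores) < 3          # keep points within 3 standard deviations

print("outliers removed:", values[~mask])
print("remaining points:", mask.sum())
```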
3. Log transform
•Logarithm transformation, or log transform, is one of the most commonly used mathematical techniques in machine learning. Log transform helps in handling skewed data, making the distribution closer to normal after transformation. It also reduces the effect of outliers on the data: because the magnitude differences are normalized, the model becomes more robust.
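A quick sketch of a log transform on right-skewed data; np.log1p (log(1 + x)) is used so that zero values do not cause errors. The income values are made up.

```python
import numpy as np

incomes = np.array([0, 20_000, 25_000, 30_000, 45_000, 1_200_000])  # heavily right-skewed
log_incomes = np.log1p(incomes)       # log(1 + x) keeps zero values valid

print(np.round(log_incomes, 2))       # magnitudes compressed to a narrow, near-normal range
```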
4. Binning
•In machine learning, overfitting is one of the main issues that degrade model performance; it occurs due to a large number of parameters and noisy data. One of the popular feature engineering techniques, "binning", can be used to normalize the noisy data. This process involves segmenting different features into bins, as sketched below.
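A minimal sketch of binning a numerical feature with pandas; the bin edges, labels, and quantile counts are arbitrary illustrative choices.

```python
import pandas as pd

ages = pd.Series([3, 17, 22, 35, 41, 58, 64, 79])

# Fixed-width bins with readable labels ...
age_groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                    labels=["child", "young_adult", "adult", "senior"])

# ... or equal-frequency (quantile) bins.
age_quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(age_groups.tolist())
print(age_quartiles.tolist())
```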
5. Feature Split
•As the name suggests, feature split is the process of splitting a feature into two or more parts to form new features. This technique helps the algorithms to better understand and learn the patterns in the dataset.
•The feature splitting process enables the new features to be clustered and binned, which results in extracting useful information and improving the performance of the data models.
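A short sketch of feature splitting with pandas: one raw column is split into simpler parts that algorithms can exploit separately. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-06-21 18:45"]),
})

# Split a text feature into two new features.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Split a datetime feature into calendar components.
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour

print(df[["first_name", "last_name", "month", "hour"]])
```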
6. One hot encoding
•One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that can be easily understood by machine learning algorithms and hence used to make good predictions. It enables grouping of categorical data without losing any information.
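A minimal sketch of one-hot encoding a categorical column with pandas; scikit-learn's OneHotEncoder offers the same transformation and can be fitted once and reused on new data.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour")
print(encoded)   # one binary column per colour, no information lost
```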
Tools for feature engineering

•Featuretools

•Featuretools is one of the most widely used libraries for feature engineering automation. It supports a wide range of
operations such as selecting features and constructing new ones with relational databases, etc. In addition, it offers
simple conversions utilizing max, sum, mode, and other terms. But one of its most important functionalities is the
possibility to build features using deep feature synthesis (DFS).
•Feature Selector

•As the name suggests, Feature Selector is a Python library for choosing features. It determines attribute significance based on missing data, single unique values, and collinear or insignificant features. For that, it uses "lightgbm" tree-based learning methods. The package also includes a set of visualization techniques that can provide more information about the dataset.
•PyCaret

•PyCaret is a Python-based open-source library. Although it is not a dedicated tool for automated feature engineering, it
does allow for the automatic generation of features before model training. Its advantage is that it lets you replace
hundreds of code lines with just a handful, thus increasing productivity and exponentially speeding up the
experimentation cycle.
Benefits
• Models with engineered features result in faster data processing.
• Less complex models are easier to maintain.
• Engineered features allow for more accurate estimations/predictions.

Drawbacks
• Making a proper feature list requires deep analysis and understanding of the business context and processes.
• Feature engineering is often time-consuming.
• Complex ML solutions achieved through complicated feature engineering are difficult to explain because the model's logic remains unclear.
UNIT II FEATURE ENGINEERING
Text Data – Visual Data – Feature-based Time-Series Analysis – Data
Streams – Feature Selection and Evaluation
The mutual information between two variables X and Y is defined as

I(X; Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]

Where:
•P(x, y) is the joint probability distribution of X and Y,
•P(x) and P(y) are the marginal probability distributions of X and Y.

A contingency table (also known as a cross-tabulation or crosstab) is a data table used in statistics to display the frequency distribution of variables. It is particularly useful for analyzing the relationship between two or more categorical variables. Each cell in the table shows the count or frequency of observations corresponding to the intersection of the categories of the variables.
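A small sketch of building a contingency table (crosstab) between two categorical variables with pandas; the variables here are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "smoker":  ["yes", "no", "yes", "no", "no", "yes"],
    "disease": ["yes", "no", "yes", "yes", "no", "no"],
})

# Rows: smoker, columns: disease, cells: count of observations in each combination.
table = pd.crosstab(df["smoker"], df["disease"])
print(table)
```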
•In feature selection, the mutual information is calculated between each feature and the target variable.
Features with higher mutual information values are considered more relevant for predicting the target.
•Features with low mutual information may be redundant or irrelevant and can be excluded from the model.
• Mutual Information Feature Selection Algorithm:
• Step 1: Compute the mutual information between each feature and the target variable.
• Step 2: Rank the features based on their mutual information scores.
• Step 3: Select the top k features with the highest mutual information scores.
• Step 4: Use these selected features to train your model.
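A hedged sketch of the four steps above using scikit-learn; the iris data stands in for an arbitrary dataset, and k = 2 is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
k = 2

# Steps 1-2: compute mutual information between each feature and the target, then rank.
scores = mutual_info_classif(X, y, random_state=0)
print("MI score per feature:", scores.round(3))

# Step 3: keep the k features with the highest scores.
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_selected = selector.fit_transform(X, y)
print("selected feature indices:", selector.get_support(indices=True))

# Step 4: X_selected (n_samples x k) is then used to train the model.
```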
UNIT II FEATURE ENGINEERING
Text Data – Visual Data – Feature-based Time-Series Analysis – Data
Streams – Feature Selection and Evaluation
• Most visual computing tasks involve prediction, regression or decision
making using features extracted from the original, raw visual data
(images or videos).
• Feature engineering typically refers to this (often creative) process of
extracting new representations from the raw data that are more
conducive to a computing task.
• Indeed, the performance of many machine learning algorithms
heavily depends on having insightful input representations that
expose the underlying explanatory factors of the output for the
observed input.
• In the case of images or videos, dimensionality reduction is often an integral part of feature engineering, since the raw data are typically high dimensional.
• Many existing feature engineering approaches may be categorized
into one of three broad groups:
• 1. Classical, sometimes hand-crafted, feature representations
• 2.Advanced, latent-feature representations.
• 3. Deep representations through end-to-end learning
• Classical, sometimes hand-crafted, feature representations:
• In general, these may refer to rudimentary features such as image gradients as
well as fairly sophisticated features from elaborate algorithms such as the
histogram of oriented gradients feature .
• More often than not, such features are designed by domain experts who have
good knowledge about the data properties and the demands of the task. Hence
sometimes such features are called hand-crafted features.
• Hand-engineering features for each task requires a lot of manual labor and
domain knowledge, and optimality is hardly guaranteed. However, it allows
integration of human knowledge of the real world and of that specific task into
the feature design process, hence making it possible to obtain good results for
the said task.
• These types of features are easy to interpret. Note that it is not completely
correct to call all classical features as being hand-crafted, since some of them
are general-purpose features with little task-specific tuning (such as outputs of
simple gradient filters).
• 2.Advanced, latent-feature representations.
• While the raw data may be of high dimensions, the factors relevant to a
computing task may lie on a lower dimensional space.
• Latent representations may expose the underlying properties of data that exist
but cannot be readily measured from the original data.
• These features usually seek a specific structure such as sparsity, decorrelation of
reduced dimension, low rank, etc. The structure being enforced depends on the
task.
• The sparsity and low dimensionality of these representations is often
encouraged as real-world visual data have naturally sparse representations with
respect to some basis (e.g., Fourier basis) and may also be embedded in a lower-
dimensional manifold.
• However, obtaining latent representations is often a difficult optimization
process that may require extensive reformulation and/or clever optimization
techniques such as alternating minimization.
3. Deep representations through end-to-end learning.
• Deep representations are obtained by passing raw input data with minimal
preprocessing through a learned neural network, often consisting of a stack of
convolutional and/or fully connected layers.
• As the input is propagated through each network layer, different data
representations are obtained that abstract higher-level concepts. These networks
are being trained iteratively by minimizing a task-specific loss that alters the
parameters/weights in all layers.
• Recently, deep features have been found extremely effective in many visual
computing tasks, leading to tremendous gain in performance. Their most attractive
property is their ability to learn from raw input with minimal pre-processing.
• Moreover, it appears that such learned representations can provide a reasonable
performance on many tasks, alleviating the need for domain experts for each task.
• However, learning deep representations needs not only huge computational
resources but also large data collections, making them suitable primarily only for
computing clusters or servers with powerful GPUs, and for applications where
abundant labeled data are readily available
Classical Visual Feature
Representations
• Images are the most common form of visual data in many visual computing applications. An image is
a collection of pixels where each pixel is represented as a multidimensional array.
• Depending on the dimensionality of the pixels, images can be gray-scale, color or in other higher-
dimensional forms such as RGBD (with RGB for color and D for depth).
• Raw pixel intensities can be viewed as the simplest form of feature representation. However, this is
sensitive to many factors such as viewpoint change, lighting change, image interpolation, etc.
• Moreover, if each pixel value is treated as a separate feature, then that would result in a high-
dimensional feature space.
• In turn, increased processing time and large amounts of training data are needed to make an
inference from such high-dimensional features. Hence traditionally, most visual computing tasks do
not directly operate on this simple form of representation of pixels, but on features extracted from
them.
• Depending on how features are extracted, there are many feature extractors. Some categories
include spatial versus frequency domain approaches, global versus local, appearance versus
geometry, point versus area, etc.
3.1.1 Color Features
• Color Histogram: A histogram provides a rough estimate of the underlying distribution of
the data. To construct a histogram, the range of the input data is split into a number of bins.
Then the number of data points falling in each bin are counted. This is followed by
normalization. A histogram provides us with center, spread, skewness and number of
modes present in the data.
• Let us consider the computation of a color histogram of an RGB image. The first step is to
convert the RGB image into a single-channel image.
• To this end, the image is quantized such that each triplet (R,G,B) falls into a bin, converting
the RGB image into a single channel image. The resulting image has as many distinct values
as there were quantization bins.
• The quantized image is then vectorized and the histogram is computed by simply counting
the occurrences of each bin. While this is the most common strategy, other alternatives also
exist. For example, the histogram of each color channel can be computed separately and
concatenated.
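A minimal NumPy sketch of the quantise-then-count procedure just described; the random array stands in for an RGB image and 4 bins per channel is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)   # H x W x 3 RGB image

bins_per_channel = 4                       # 4 * 4 * 4 = 64 colour bins in total
q = image // (256 // bins_per_channel)     # quantise each channel to 0..3

# Map each (R, G, B) triplet of bin indices to one bin id -> single-channel image.
single_channel = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]

# Count occurrences of each bin and normalise to obtain the colour histogram.
hist = np.bincount(single_channel.ravel(), minlength=bins_per_channel ** 3)
hist = hist / hist.sum()
print(hist.shape)                          # (64,): one entry per colour bin
```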
• A color histogram is largely invariant to smaller changes in lighting
and viewpoint. It can be computed in linear time, making it an
effective feature for applications such as image retrieval.
• On the other hand, its effectiveness depends on the binning
strategy.
• A certain amount of hand-crafting is required to adapt the histogram
to fundus images. To automatically derive the quantization bins for
any dataset (assuming that it has sufficient number of samples), one
can extract all the unique shades (i.e., RGB triplets) from the dataset
and perform k-means clustering on them.
• The centroids can then be viewed as histogram bins. It produces a histogram that has more
entropy than the one derived using bins based on natural images. Figure 3.2 shows a color
histogram of the landscape image that uses pre-defined bins suitable for natural images.
• It also shows a color histogram of the fundus image using the same binning strategy. This
binning strategy does not capture the color distribution in the fundus image well.
• By using this adaptive binning strategy, one may obtain a better histogram as shown in Figure
3.3 The obtained histogram has a higher entropy as it makes better utilization of the available
number of bins. In practice, it has been observed that the adaptive binning strategy allows us
to use fewer bins (e.g., 16 instead of 64) with almost no loss of performance. Another
drawback is that the color histogram does not model the relationship of adjacent pixels, thus
failing to account for spatial distribution of different colors. As a remedy, a local color
histogram (LCH) for image retrieval has been proposed . To compute LCH, an image is divided
into blocks and for each, a color histogram is computed.
• A Local Color Histogram (LCH) is a method used in computer vision and image processing
to analyze the color distribution within specific regions or local patches of an image,
rather than considering the entire image as a whole, as is done in a global color
histogram. It is particularly useful for tasks that require understanding the spatial
distribution of colors and identifying features in localized areas of the image.
• All color histograms can then be averaged or concatenated to create the LCH. Though LCH is better at modeling regional similarities, it is sensitive to transformations such as rotation.
• Color Coherence Vector: This is another feature that considers color spatial distribution. Color Coherence Vector (CCV) computation involves blurring the image with either a mean or a Gaussian filter. The image is then quantized using binning strategies similar to those mentioned above. Each pixel is classified as coherent or incoherent, where a pixel is considered coherent if its connected component count is higher than a user-specified threshold. Repeating this for each bin produces CCVs. Although CCV models the spatial distribution of different colors, it is possible that two different connected component patterns produce the same CCV. To overcome this, the relative position of the components is usually added as another feature dimension. Figure 3.4 shows CCVs for the landscape image and the fundus image. The same pattern can be observed here: the same binning strategy fails to work well for both images. Figure 3.5 shows CCVs of the fundus image obtained with the adaptive binning strategy, which has a higher entropy.
• Histogram of Color Moments: This feature is invariant to scaling and rotation. Color moments uniquely characterize the color distribution in an image and are computed on a per-channel basis. The first moment is just the average color of an image, whereas the next two moments describe the variance and the shape of the color distribution. These three moments provide sufficient information to carry out effective image retrieval. Color moments work well under dynamic lighting conditions but fail under occlusion.
• Color Correlogram: The above features do not lead to a compact representation that models both the local and the global distribution of colors in an image. Color histograms and CCVs fail when there is a large lighting change. To overcome these drawbacks, color correlograms (CC) were introduced. The CC measures the global distribution of the local correlation of colors.

• The rough procedure to compute the color correlogram is as follows: for each pair of quantized colors and each distance k, count how often a pixel of the second color is found at distance k from a pixel of the first color, and normalize these counts into probabilities.
Texture Features
• Texture features are quantitative measures used in image processing and computer vision to
describe the surface characteristics of objects within an image. They capture patterns of variation
in pixel intensities that correspond to how the surface "feels" visually. Texture features are widely
used in applications such as image classification, object detection, medical image analysis, and
more. These features describe properties like smoothness, roughness, regularity, and
directionality in images.
• Another important descriptor of visual data is their textural properties. Some of the texture
analysis approaches use spatial frequency to characterize textures since coarse textures
constitute low spatial frequencies and vice versa.
• Gray-level Co-occurrence Matrix (GCM): The co-occurrence matrix M counts how often each pair of quantized gray levels occurs at pixel positions satisfying a chosen adjacency condition (for example, one pixel lying below and to the right of the other).
• To model textures with reasonable confidence, we need multiple
observations to fill each entry in M. There are two ways to achieve
this. One option is to use a small number of quantization levels so
that multiple pixels could satisfy the aforementioned condition.
Higher quantization will lead to greater loss of accuracy.
• Another option is to evaluate the above condition over a larger
window. This will cause errors if the texture changes significantly
inside the window. The adjacency condition mentioned above models
pixels that are below and to the right of the pixels under
consideration. Three more types of adjacency conditions can be
defined: below and to the left, below, and to the right
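A small NumPy sketch of this idea, under the assumption that M is a normalised co-occurrence matrix of quantised gray levels for one adjacency condition (here, one pixel below and to the right); the number of quantisation levels is configurable, as discussed above, and only non-negative offsets are handled in this sketch.

```python
import numpy as np

def co_occurrence(gray, levels=8, offset=(1, 1)):
    """Count co-occurrences of quantised gray levels at the given (row, col) offset."""
    q = (gray.astype(float) / 256 * levels).astype(int)    # quantise to 0..levels-1
    dr, dc = offset                                        # (1, 1) = below and to the right
    M = np.zeros((levels, levels), dtype=float)
    ref = q[:q.shape[0] - dr, :q.shape[1] - dc]            # reference pixels
    nbr = q[dr:, dc:]                                      # their below-right neighbours
    np.add.at(M, (ref.ravel(), nbr.ravel()), 1)            # accumulate pair counts
    return M / M.sum()                                     # normalise to probabilities

rng = np.random.default_rng(0)
gray_image = rng.integers(0, 256, size=(64, 64))
print(co_occurrence(gray_image, levels=8).shape)           # (8, 8)
```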
Shape Features
• An image contains many well-defined objects that we encounter daily.
Shape is an important attribute of these objects. In some tasks such
as object detection, shape is an indispensable attribute. Moreover,
most of the objects come in different colors and textures, making
both weak attributes for discriminating objects. On the other hand,
objects can be both rigid and non-rigid. Modeling the shape of rigid
objects is relatively easier, but non-rigid ones can conform to many
shapes. Capturing and modeling all the shape forms is difficult.
Capturing discriminative characteristics of an object using color or
texture requires suitable pre-processing or large training data
• Shape Context: A feature that is invariant to rotation, translation and
scaling is highly desirable. Shape context tries to achieve this by
capturing relative positioning of all the other points with respect to a
reference point
UNIT II FEATURE ENGINEERING
Text Data – Visual Data – Feature-based Time-Series Analysis – Data
Streams – Feature Selection and Evaluation
Feature-Based Time Series Analysis
•What is Time Series Analysis? Time series analysis is a technique in statistics that deals with time series data and
trend analysis. Time series data follows periodic time intervals that have been measured in regular time intervals or
have been collected in particular time intervals. In other words, a time series is simply a series of data points ordered in
time, and time series analysis is the process of making sense of this data.
•In a business context, examples of time series data include any trends that need to be captured over a period of time.
A Google trends report is a type of time series data that can be analyzed. There are also far more complex
applications such as demand and supply forecasting based on past trends.

•Examples of Time Series Data


•In economics, time series data could be the Gross Domestic Product (GDP), the Consumer Price Index, S&P 500 Index, and
unemployment rates. The data set could be a country’s gross domestic product from the federal reserve economic data.
•From a social sciences perspective, time series data could be birth rate, migration data, population rise, and political
factors.
•The statistical characteristics of time series data do not always fit conventional statistical methods. As a result, analyzing time series data accurately requires a unique set of tools and methods, collectively known as time series analysis.
 Seasonality refers to periodic fluctuations. For example, electricity consumption is typically high during the day and lower during the night. In the case of shopping patterns, online sales spike during the holidays before slowing down and dropping.
 Autocorrelation is the similarity between observations as a function of the time lag between them.
•Data: Types, Terms, and Concepts
•Data, in general, is considered to be one of these three types:
1. Time series data: A set of observations on the values that a variable takes on at different points of time.

2. Cross-sectional data: Data of one or more variables, collected at the same point in time.

3. Pooled data: A combination of time series data and cross-sectional data.

•These are some of the terms and concepts associated with time series data analysis:
 Dependence: Dependence refers to the association of two observations with the same variable at prior time points.

 Stationarity: A series is stationary when its statistical properties, such as the mean, remain constant over time. If the mean changes over the given time period, if there are spikes throughout the data, or if the values tend toward infinity, then the series is not stationary.

 Differencing: Differencing is a technique to make the time series stationary and to control the correlations that arise
automatically. That said, not all time series analyses need differencing and doing so can produce inaccurate estimates.

 Curve fitting: Curve fitting as a regression method is useful for data not in a linear relationship. In such cases, the
mathematical equation for curve fitting ensures that data that falls too much on the fringes to have any real impact is
“regressed” onto a curve with a distinct formula that systems can use and interpret.
•Identifying Cross Sectional Data vs Time Series Data
•The opposite of time series data is cross-sectional data. This is when various
entities such as individuals and organizations are observed at a single point in
time to draw inferences. Both forms of data analysis have their own value,
and sometimes businesses use both forms of analysis to draw better
conclusions.
•Time series data can be found in nearly every area of business and
organizational application affected by the past. This ranges from economics,
social sciences, and anthropology to climate change, business, finance,
operations, and even epidemiology.
•In a time series, time is often the independent variable, and the goal is to
make a forecast for the future.
•The most prominent advantage of time series analysis is that—because data
points in a time series are collected in a linear manner at adjacent time
periods—it can potentially make correlations between observations. This
feature sets time series data apart from cross-sectional data.
•Time Series Analysis Techniques
•As we have seen above, time series analysis can be an ambitious goal for
organizations. In order to gain accurate results from model-fitting, one of several
mathematical models may be used in time series analysis such as:
 Box-Jenkins autoregressive integrated moving average (ARIMA) models

 Box-Jenkins multivariate models

 Holt-Winters exponential smoothing

•The Box-Jenkins models of both the ARIMA and multivariate varieties use the past behaviour
of a variable to decide which model is best to analyse it. The assumption is that any time
series data for analysis can be characterized by a linear function of its past values, past
errors, or both. When the model was first developed, the data used was from a gas furnace
and its variable behaviour over time.
•In contrast, the Holt-Winters exponential smoothing model is best suited to analyzing
time series data that exhibits a defining trend and varies by seasons. Such mathematical
models are a combination of several methods of measurement; the Holt-Winters method uses
weighted averages which can seem simple enough, but these values are layered on the
equations for exponential smoothing.
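A hedged sketch of Holt-Winters exponential smoothing with statsmodels (assuming the library is available); the synthetic monthly series with an additive trend and yearly seasonality is purely illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Four years of monthly data: upward trend + yearly seasonal pattern + noise.
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (np.linspace(100, 160, 48)
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
          + np.random.default_rng(0).normal(0, 2, 48))
series = pd.Series(values, index=idx)

model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(12).round(1))     # weighted-average based forecast for the next year
```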
•Applications of Time Series Analysis
•Time series analysis models yield two outcomes:
 Obtain an understanding of the underlying forces and structure that produced
the observed data patterns. Complex, real-world scenarios very rarely fall into
set patterns, and time series analysis allows for their study—along with all of
their variables as observed over time. This application is usually meant to
understand processes that happen gradually and over a period of time such
as the impact of climate change on the rise of infection rates.
 Fit a mathematical model as accurately as possible so the process can move
into forecasting, monitoring, or even certain feedback loops. This is a use-
case for businesses that look to operate at scale and need all the input
they can get to succeed.
•From a practical standpoint, time series analysis in organizations are mostly used for:
 Economic forecasting

 Sales forecasting

 Utility studies

 Budgetary analysis

 Stock market analysis

 Yield projections

 Census analysis

 Process and quality control

 Inventory studies

 Workload projections
•Time series in Financial and Business Domain

• Most financial, investment and business decisions are made on the basis of forecasts of future changes and demands in the financial domain.

•Time series analysis and forecasting are essential processes for explaining the dynamic and influential behaviour of financial markets. By examining financial data, an expert can produce the forecasts required for important financial applications in several areas such as risk evaluation, option pricing & trading, portfolio construction, etc.

•For example, time series analysis has become an intrinsic part of financial analysis and can be used in predicting interest rates, foreign currency risk, volatility in stock markets and much more. Policymakers and business experts use financial forecasting to make decisions about production, purchases, market sustainability, allocation of resources, etc.

•In investment, this analysis is employed to track the price fluctuations of a security over time. For instance, the price of a security can be recorded:
•for the short term, such as one observation per hour over a business day, and
•for the long term, such as one observation at the end of each month for five years.

•Time series analysis is extremely useful to observe how a given asset,


security, or economic variable behaves/changes over time. For example, it
can be deployed to evaluate how the underlying changes associated with
some data observation behave after shifting to other data observations in
the same time period.

•Time series in Medical Domain

•Medicine has evolved into a data-driven field, and time series analysis continues to contribute to human knowledge in this domain with enormous developments.
Advantages of Time Series Analysis
Data analysts have much to gain from time series analysis. From cleaning raw data, making sense of it, and uncovering patterns to help with projections, much can be accomplished through the application of various time series models.
Here are a few advantages of time series analysis:
It Cleans Data and Removes Confounding Factors

Data cleansing filters out noise, removes outliers, or applies various averages to gain a better overall perspective of the data. It means zeroing in on the signal by filtering out the noise. The process of time series analysis removes the noise and allows businesses to get a clearer picture of what is happening day-to-day.
Provides Understanding of Data

The models used in time series analysis do help to interpret the true meaning of the data in a data set, making life
easier for data analysts. Autocorrelation patterns and seasonality measures can be applied to predict when a certain
data point can be expected. Furthermore, stationarity measures can gain an estimate of the value of said data point.
Forecasting Data

Time series analysis can be the basis to forecast data. Time series analysis is inherently equipped to uncover
patterns in data which form the base to predict future data points. It is this forecasting aspect of time series analysis
that makes it extremely popular in the business area. Where most data analytics use past data to retroactively gain
insights, time series analysis helps predict the future. It is this very edge that helps management make better business
decisions.
•Disadvantages of Time Series Analysis
•Time series analysis is not perfect. It can suffer from generalization
from a single study where more data points and models were
warranted. Human error could misidentify the correct data model,
which can have a snowballing effect on the output.
•It could also be difficult to obtain the appropriate data points. A major
point of difference between time-series analysis and most other
statistical problems is that in a time series, observations are not always
independent.
Time series and Trend analysis
A time series consists of a set of observations measured at specified, usually equal, time intervals.

Time series analysis attempts to identify those factors that exert influence on the values in the series.

Time series analysis is a basic tool for forecasting. Industry and government must forecast future activity to make decisions and plans to meet projected changes.

An analysis of the trend of the observations is needed to acquire an understanding of the progress of events leading to prevailing conditions.
Time series examples
• Sales data
• Gross national product
• Share prices
• $A Exchange rate
• Unemployment rates
• Population
• Foreign debt
• Interest rates
Time series components
Time series data can be broken into these four components:
1. Secular trend
2. Seasonal variation
3. Cyclical variation
4. Irregular variation
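A hedged sketch of pulling these components apart with a classical decomposition in statsmodels (assuming the library is installed); this simple additive model separates trend, seasonal and residual (irregular) parts, while cyclical variation is left mixed into the trend and residual. The synthetic series is invented.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2018-01-01", periods=60, freq="MS")
data = (np.linspace(50, 80, 60)                              # secular trend
        + 5 * np.sin(2 * np.pi * np.arange(60) / 12)         # seasonal variation
        + np.random.default_rng(1).normal(0, 1, 60))         # irregular variation
series = pd.Series(data, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated secular trend
print(result.seasonal.head(12))          # estimated seasonal pattern (one year)
```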
Components of Time-Series Data
[Figure: a time series plotted over years 1–13, with its trend, seasonal, cyclical and irregular fluctuations labelled.]

Predicting long-term trends without smoothing: what could go wrong?
Where do you commence your prediction: from the bottom of a variation going up, or from the peak of a variation going down?
1. Secular Trend
This is the long term growth or decline of the series.

• In economic terms, long term may mean >10 years

• Describes the history of the time series

• Uses past trends to make prediction about the future

• Where the analyst can isolate the effect of a secular


trend, changes due to other causes become clearer
[Figure: All Ords index plotted from 1984 to 2008 (vertical scale 0–8000), illustrating a long-term secular trend.]

Look out
While trend estimates are often reliable, in some instances the usefulness of estimates is reduced:

• by a high degree of irregularity in the original or seasonally adjusted series, or

• by an abrupt change in the time series characteristics of the original data
$A vs $US
during day 1 vote count, 2000 US Presidential election

This graph shows the amazing trend of the $A vs the $US during an 18-hour period on November 8, 2000.
2. Seasonal Variation

The seasonal variation of a time series is a pattern of


change that recurs regularly over time.

Seasonal variations are usually due to the differences


between seasons and to festive occasions such as
Easter and Christmas.

Examples include:
• Air conditioner sales in Summer
• Heater sales in Winter
• Flu cases in Winter
• Airline tickets for flights during school vacations
Monthly Retail Sales in NSW Retail
Department Stores
3. Cyclical variation

Cyclical variations also have recurring patterns but with a


longer and more erratic time scale compared to Seasonal
variations.

The name is quite misleading because these cycles can be far


from regular and it is usually impossible to predict just how
long periods of expansion or contraction will be.

There is no guarantee of a regularly returning pattern.


Cyclical variation

Example include:

• Floods
• Wars
• Changes in interest rates
• Economic depressions or recessions
• Changes in consumer spending
Cyclical variation
This chart represents an economic cycle, but we know
it doesn’t always go like this. The timing and length of
each phase is not predictable.
4. Irregular variation

An irregular (or random) variation in a time series occurs over varying


(usually short) periods.

It follows no pattern and is by nature unpredictable.

It usually occurs randomly and may be linked to events that also occur
randomly.

Irregular variation cannot be explained mathematically.


Irregular variation

If the variation cannot be accounted for by secular trend, season or


cyclical variation, then it is usually attributed to irregular variation.
Example include:

– Sudden changes in interest rates


– Collapse of companies
– Natural disasters
– Sudden shifts in government policy
– Dramatic changes to the stock market
– Effect of Middle East unrest on petrol prices
Monthly Value of Building Approvals (ACT)

Now we have considered the 4 time series


components:
1. Secular trend
2. Seasonal variation
3. Cyclical variation
4. Irregular variation

We now take a closer look at secular trends and 5


techniques available to measure the underlying trend.
Why examine the trend?

When a past trend can be reasonably expected to


continue on, it can be used as the basis of future
planning:

• Capacity planning for increased population


• Utility loads
• Market progress
UNIT II FEATURE ENGINEERING
Text Data – Visual Data – Feature-based Time-Series Analysis – Data
Streams – Feature Selection and Evaluation
•Data Streams

•Data Stream is a continuous, fast-changing, and ordered chain of data transmitted at a very high speed. It is an ordered sequence of information for a specific interval. Data is transferred from the sender's side and immediately appears in the stream at the receiver's side. Streaming does not mean downloading the data or storing the information on storage devices.
•Sources of Data Stream

•There are so many sources of the data stream, and a few widely used sources are listed below:
•Internet traffic
•Sensors data
•Real-time ATM transaction
•Live event data
•Call records
•Satellite data
•Audio listening
•Watching videos
•Real-time surveillance systems
•Online transactions
4.3 Characteristics of Data Stream in Data Mining
Data Stream in Data Mining should have the following characteristics:

Continuous Stream of Data: The data stream is an infinite continuous stream resulting in
big data. In data streaming, multiple data streams are passed simultaneously.
Time Sensitive: Data Streams are time-sensitive, and elements of data streams carry timestamps with them. A data stream is relevant only for a certain period; after that period, it loses its significance.
Data Volatility: No data is stored in data streaming as it is volatile. Once the data mining and analysis are done, the information is summarized or discarded.
Concept Drifting: Data Streams are very unpredictable. The data changes or evolves with
time, as in this dynamic world, nothing is constant.
Data Stream is generated through various data stream generators. Then, data mining
techniques are implemented to extract knowledge and patterns from the data streams.
Therefore, these techniques need to process multi-dimensional, multi-level, single pass, and
online data streams.

4.4 Data Streams in Data Mining Techniques


Data Streams in Data Mining techniques are implemented to extract patterns and insights from a
data stream. A vast range of algorithms is available for stream mining. There are four main
algorithms used for Data Streams in Data Mining techniques.
Data Streams
• For the feature selection problem with streaming features, the number of instances
is fixed while candidate features arrive one at a time; the task is to timely select a
subset of relevant features from all features seen so far.
• A typical framework for streaming feature selection consists of
• Step 1: a new feature arrives;
• Step 2: decide whether to add the new feature to the selected features;
• Step 3: determine whether to remove features from the selected features; and
• Step 4: repeat Step 1 to Step 3.
• Different algorithms may have distinct implementations for Step 2 and Step 3;
next we will review some representative methods. Note that Step 3 is optional and
some streaming feature selection algorithms only provide Step 2.
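A schematic sketch of the four-step framework above; the relevance and redundancy tests are placeholder callables, since each concrete algorithm (grafting, alpha-investing, OSFS, and so on) supplies its own criteria for Steps 2 and 3.

```python
def streaming_feature_selection(feature_stream, keep_new, prune_existing):
    """feature_stream yields (name, values) pairs; keep_new and prune_existing
    are algorithm-specific tests (placeholders in this sketch)."""
    selected = {}                                   # name -> feature values
    for name, values in feature_stream:             # Step 1: a new feature arrives
        if keep_new(values, selected):              # Step 2: add it if judged relevant
            selected[name] = values
            for old in prune_existing(selected):    # Step 3 (optional): drop features made
                selected.pop(old, None)             #   redundant by the newcomer
    return list(selected)                           # Step 4 is the loop itself

# Trivial usage with dummy tests: keep every feature, prune nothing.
features = [("f1", [0, 1, 1]), ("f2", [1, 1, 0])]
print(streaming_feature_selection(iter(features),
                                  keep_new=lambda v, s: True,
                                  prune_existing=lambda s: []))
```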
1. The Grafting Algorithm
•It is a machine learning technique used for incremental or online feature selection in the context of
classification problems. It is designed to select a subset of relevant features from a larger set while training a
classifier, thus improving both classification accuracy and computational efficiency.
•Here's an overview of how the Grafting Algorithm works:
•Initialization: Start with an empty set of selected features and an initial classifier. This classifier can be a
simple one, such as a linear model or a decision tree.
•Feature Selection and Classifier Update: For each incoming data point or batch of data points, follow these
steps:
a.Train the current classifier on the selected features.
b.Evaluate the performance of the classifier on the new data points or batch. You can use metrics like accuracy, F1-score, or log-
likelihood, depending on the problem.
c.For each feature not yet in the selected set, evaluate its potential contribution to the classifier's performance. This is typically done
by temporarily adding the feature to the selected set and measuring the change in performance (e.g., increase in accuracy or decrease
in error).
d.Select the feature that provides the greatest improvement in classification performance. If the improvement exceeds a predefined
threshold (a hyperparameter), add the feature to the selected set.
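A heavily simplified, hedged sketch of steps a-d: a candidate feature is kept only if it improves validation accuracy by more than a threshold. The data, model and threshold are illustrative, and this is not the original grafting formulation (which works with gradient tests on an L1-regularised objective).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

selected, best, threshold = [], 0.0, 0.01
for j in range(X.shape[1]):                       # features "arrive" one at a time
    trial = selected + [j]                        # step c: temporarily add the candidate
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, trial], y_tr)
    score = clf.score(X_va[:, trial], y_va)       # step b: evaluate on held-out data
    if score - best > threshold:                  # step d: keep it only if it helps enough
        selected, best = trial, score

print("selected features:", selected, "validation accuracy:", round(best, 3))
```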
Key Points about the Grafting Algorithm:

Incremental Feature Selection: Grafting incrementally selects features one at a time, taking into account their contributions to
the classifier's performance.

Adaptive Feature Selection: It dynamically adjusts the set of selected features as new data arrives, ensuring that only the most
relevant features are retained.

Efficiency: Grafting is efficient because it avoids exhaustive search over feature subsets and only evaluates the utility of adding
or removing one feature at a time.

Performance Improvement: By selecting informative features during the learning process, Grafting aims to improve
classification accuracy while potentially reducing the computational complexity of the model.

Thresholds: The algorithm relies on a predefined threshold for evaluating whether adding a feature is beneficial. This threshold can be set based on domain knowledge or through cross-validation.

Grafting is particularly useful in scenarios where you have a large number of features and limited computational resources or
when dealing with data streams where the feature set may evolve over time. It strikes a balance between maintaining model
performance and reducing feature dimensionality, which can be beneficial for both efficiency and interpretability of machine
learning models. Keep in mind that the specific implementation and parameter settings of the Grafting Algorithm may vary
depending on the machine learning framework and problem domain.
2 The Alpha-Investing algorithm

It is a statistical method used for sequential hypothesis testing, primarily in the context of multiple hypothesis testing or feature selection. It was introduced as an enhancement
to the Sequential Bonferroni method, aiming to control the Family-Wise Error Rate (FWER) while being more powerful and efficient in adaptive and sequential settings.
Here's a high-level overview of the Alpha-Investing algorithm:

Initialization: Start with an empty set of selected hypotheses (features) and set an initial significance level (alpha). This alpha level represents
the desired FWER control and guides the decision-making process.

Sequential Testing: As you encounter new hypotheses (features) or updates to existing ones, perform hypothesis tests (e.g., p-value tests) to
assess their significance. The tests are often related to whether a feature is associated with an outcome of interest.

Alpha Update: After each hypothesis test, update the alpha level dynamically based on the test results and the number of hypotheses tested
so far. Alpha-Investing adjusts the significance level to maintain FWER control while adapting to the increasing number of tests.

Decision Rules: Make decisions on whether to reject or retain each hypothesis based on the adjusted alpha level. Common decision rules include
rejecting a hypothesis if its p-value is less than the current alpha.

Continue or Terminate: Continue the process as long as you encounter new hypotheses or updates to existing ones. You can choose a stopping
criterion, such as reaching a fixed number of hypotheses or achieving a certain level of significance control.

Output: The selected hypotheses at the end of the process are considered statistically significant, and the others are rejected or not selected.
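A simplified, hedged sketch of the alpha-investing decision rule described above: an "alpha wealth" budget is spent on each test and partially refunded whenever a hypothesis (feature) is rejected as null, i.e. selected. The alpha_j = wealth / (2j) rule, the initial wealth and the payout are one common parameterisation, not the only one.

```python
def alpha_investing(p_values, initial_wealth=0.5, payout=0.5):
    wealth, selected = initial_wealth, []
    for j, p in enumerate(p_values, start=1):
        alpha_j = wealth / (2 * j)          # current, dynamically adjusted significance level
        if p <= alpha_j:                    # reject the null hypothesis -> select the feature
            selected.append(j - 1)
            wealth += payout - alpha_j      # earn back some alpha wealth on a rejection
        else:
            wealth -= alpha_j               # spend alpha wealth on a non-rejection
        if wealth <= 0:
            break                           # budget exhausted: stop testing
    return selected

print(alpha_investing([0.001, 0.2, 0.04, 0.5, 0.0005]))   # indices of selected hypotheses
```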

Key Advantages of Alpha-Investing:

Adaptivity: Alpha-Investing adapts its significance level as more hypotheses are tested. This adaptivity helps maintain better statistical power
compared to fixed significance levels like Bonferroni correction.

FWER Control: It controls the Family-Wise Error Rate, which is the probability of making at least one false discovery among all the hypotheses
tested. This makes it suitable for applications where controlling the overall error rate is critical.

Efficiency: Alpha-Investing is often more efficient than other multiple testing correction methods like Bonferroni correction because it tends to
use higher alpha levels for early tests and lower alpha levels for later tests.
•Selective: It allows for the selection of relevant features or hypotheses from a large pool while controlling the
overall error rate.
•Alpha-Investing is commonly used in fields like bioinformatics, genomics, finance, and any domain where multiple
hypothesis testing or feature selection is necessary and maintaining a strong control over the FWER is important.
It offers a balance between adaptivity and statistical rigor, making it a valuable tool in the data analysis toolkit.
3 The Online Streaming Feature Selection Algorithm
The feature selection in the context of data streams and online learning often involves adapting traditional
feature selection methods to handle streaming data.
Here is a conceptual outline of how feature selection can be performed in an online streaming setting:
Data Stream Ingestion: Start by ingesting your streaming data, which arrives continuously over time. This data can be in the form of individual instances or mini-batches.
Initialization: Initialize your feature selection process by setting up the necessary data structures and variables.

Feature Selection Process:


• Receive New Data: As new data points arrive in the streaming fashion, preprocess and prepare them for
feature selection.
• Compute Feature Importance: Calculate the importance or relevance of each feature in the current data
batch. Various statistical measures, machine learning models, or domain knowledge can guide this
calculation.
• Update Feature Set: Decide whether to keep or discard each feature based on its importance. You can
use a threshold or ranking mechanism to select the most relevant features. This step can be done
incrementally as new data arrives.
• Retrain Models (Optional): If you are using machine learning models, retrain them using the updated
feature set to ensure that the model adapts to the changing data distribution.
• Update Metrics: Continuously monitor and evaluate the performance of your models or the selected features
using appropriate evaluation metrics. This can help you assess the quality of your feature selection process.
•Streaming Feature Selection Loop: Repeat the above steps as new data points continue to arrive in the stream. The feature selection
process is ongoing and adaptive to the changing data distribution.

•Termination: Decide on a stopping criterion for the feature selection process. This could be a fixed time duration, a certain number of
data points processed, or a change in model performance.

•Final Feature Set: The selected features at the end of the streaming feature selection process are considered the final set for modelling
or analysis.
It's important to note that the exact algorithm and methodology used for feature selection in a streaming context can vary based on
the specific problem, data, and goals. The choice of feature importance measure, update frequency, and stopping criteria should be tailored
to your particular application.

4 Unsupervised Streaming Feature Selection in social media

•Unsupervised streaming feature selection in social media data presents a unique set of challenges and opportunities. Unlike traditional
feature selection in batch data, where you have a fixed dataset, social media data arrives continuously, often with varying topics, trends,
and user behaviour. Here's an approach to unsupervised streaming feature selection in social media:

•Data Ingestion:

•Stream social media data from platforms like Twitter, Facebook, or Instagram.

•Preprocess the data, including text cleaning, tokenization, and potentially feature extraction techniques like TF-IDF or word embeddings.

•Online Clustering:

•Implement an online clustering algorithm like Online K-Means, Mini-Batch K-Means, or DBSCAN.

•Cluster the incoming data based on the extracted features. The number of clusters can be determined using heuristics or adaptively based
on data characteristics.
Feature Ranking and Selection:

•Rank features within each cluster based on their importance scores. Select a fixed number or percentage of top-ranked features from each
cluster. Alternatively, you can use dynamic thresholds to adaptively select features based on their importance scores within each cluster.

Dynamic Updating:

•Continuously update the clustering and feature selection process as new data arrives.

•Periodically recluster the data to adapt to changing trends and topics in social media discussions.

Evaluation and Monitoring:

•Monitor the quality of the selected features over time.

•Use unsupervised evaluation metrics such as silhouette score, Davies-Bouldin index, or within-cluster sum of squares to assess the quality
of clusters and feature importance.

Anomaly Detection (Optional):

•Incorporate anomaly detection techniques to identify unusual or emerging topics or trends in the data. Anomalies may indicate the need for
adaptive feature selection.

Modeling or Analysis:

•Utilize the selected features for various downstream tasks such as sentiment analysis, topic modeling, recommendation systems, or
anomaly detection.

Regular Maintenance:

•Regularly review and update the feature selection process as the social media landscape evolves. Consider adding new features or
modifying existing ones based
•Unsupervised streaming feature selection in social media data requires a flexible and adaptive approach due to the dynamic nature of social
media content. It aims to extract relevant features that capture the current themes and trends in the data without requiring labeled
training data. Keep in mind that the choice of clustering algorithm and feature importance metric should be tailored to your specific
social media data and objectives.
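A hedged sketch of the online-clustering and feature-ranking steps above using scikit-learn's MiniBatchKMeans, which supports incremental updates through partial_fit. The random matrices stand in for batches of extracted social media features, and the ranking rule is a crude illustrative heuristic.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
model = MiniBatchKMeans(n_clusters=5, random_state=0)

for _ in range(20):                           # simulated stream of mini-batches
    batch = rng.random((100, 50))             # 100 posts x 50 extracted features
    model.partial_fit(batch)                  # update the clusters incrementally

# Crude per-cluster ranking: features whose centroid weight deviates most from the
# global mean are treated as the most informative for that cluster.
global_mean = model.cluster_centers_.mean(axis=0)
for c, centre in enumerate(model.cluster_centers_):
    top = np.argsort(np.abs(centre - global_mean))[::-1][:3]
    print(f"cluster {c}: top feature indices {top}")
```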

5. Non-Linear Methods for Streaming Feature Construction

•Non-linear methods for streaming feature construction are essential for extracting meaningful patterns and representations from streaming
data where the relationships among features may not be linear. These methods transform the input data into a new feature space, often with
higher dimensionality, to capture complex and non-linear relationships that may exist in the data. Here are some non-linear methods
commonly used for streaming feature construction:

•Kernel Methods:

•Kernel Trick: Apply the kernel trick to transform data into a higher-dimensional space without explicitly computing the feature vectors.
Common kernels include the Radial Basis Function (RBF) kernel and polynomial kernels.

•Online Kernel Methods: Adapt kernel methods to streaming data by updating kernel matrices incrementally as new data arrives. Online kernel
principal component analysis (KPCA) and online kernel support vector machines (SVM) are examples.

•Neural Networks:

•Deep Learning: Utilize deep neural networks, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for
feature extraction. Deep architectures can capture intricate non-linear relationships in the data.

•Online Learning: Implement online learning techniques to continuously update neural network parameters as new data streams in. This
enables real-time feature construction.

•Autoencoders:

•Variational Autoencoders (VAEs): VAEs can be used to learn non-linear representations and reduce dimensionality. They are useful for capturing
latent variables and complex patterns in streaming data.
Online Autoencoders: Design autoencoders that update their weights as new data arrives, allowing them to adapt to changing
data distributions.

Manifold Learning:

t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a dimensionality reduction technique that can reveal non-
linear relationships in high-dimensional data. It can be adapted to streaming data by updating the t-SNE embedding.

Isomap:

Isomap is another manifold learning method that can be used for non-linear feature construction in streaming data by
incrementally updating the geodesic distances between data points.

Random Features:

Random Fourier Features: Use random Fourier features to approximate kernel methods' non-linear transformations in a
computationally efficient manner. This can be suitable for streaming data when kernel-based methods are too slow.
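A hedged sketch of random Fourier features with scikit-learn's RBFSampler: the random projection approximating an RBF-kernel feature map is fitted once, and each arriving batch is then transformed cheaply. The dimensions, gamma and the simulated stream are illustrative.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(0)
rbf = RBFSampler(gamma=0.5, n_components=200, random_state=0)
rbf.fit(rng.random((10, 8)))                 # fixes the random projection for 8 input dims

for _ in range(5):                           # simulated streaming batches
    batch = rng.random((32, 8))
    features = rbf.transform(batch)          # approximate kernel features, shape (32, 200)
    # ... feed `features` to any linear online model (e.g. SGDClassifier.partial_fit)

print(features.shape)
```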

6. Non-linear Dimensionality Reduction:

Locally Linear Embedding (LLE) and Spectral Embedding: These dimensionality reduction techniques aim to preserve
local relationships, making them suitable for capturing non-linear structures in data streams.

Feature Mapping:

Apply non-linear feature mappings, such as polynomial expansions or trigonometric transformations, to create new features
that capture non-linear relationships among the original features.

Ensemble Techniques:
•Online Clustering and Density Estimation:

•Clustering and density estimation methods, such as DBSCAN and Gaussian Mixture Models (GMM), can be used to create features that represent the
underlying non-linear structures in streaming data.

•When selecting a non-linear feature construction method for streaming data, consider factors such as computational efficiency, scalability, and the adaptability
of the method to evolving data distributions. The choice of method should align with the specific characteristics and requirements of your streaming data
application.

7.Locally Linear Embedding for Data Streams

•Locally Linear Embedding (LLE) is a dimensionality reduction technique commonly used for nonlinear manifold learning and feature extraction. While it was
originally developed for batch data, it can be adapted for data streams with some modifications and considerations. Here's an overview of how LLE can be
applied to data streams:

1. Data Stream Preprocessing:

 Ingest the streaming data and preprocess it as it arrives, including cleaning, normalization, and transformation.

2. Sliding Window:

 Implement a sliding window mechanism to maintain a fixed-size buffer of the most recent data points. This buffer will be used for performing LLE on
the data stream.

3. Local Neighborhood Selection:

 For each incoming data point, determine its local neighborhood by considering a fixed number of nearest neighbors within the sliding window.

4. Local Linear Models:

 Construct local linear models for each data point based on its neighbors. This involves finding weights that best reconstruct the data point as a
linear combination of its neighbors.
5. Local Reconstruction Weights:

 Calculate the reconstruction weights for each data point in the local neighborhood. These weights represent the contribution of each neighbor to
the reconstruction of the data point.

6. Global Embedding:

 Combine the local linear models and reconstruction weights to compute a global embedding for the entire dataset. This embedding represents the
lower-dimensional representation of the data stream.

7. Continuous Update:

 Continuously update the sliding window and recompute the LLE embedding as new data points arrive. The old data points are removed
from the window, and the new ones are added.

8. Memory Management:

 Manage memory efficiently to ensure that the sliding window remains within a predefined size limit. You may need to adjust the window size
dynamically based on available memory and computational resources.

9. Hyperparameter Tuning:

 Tune hyperparameters such as the number of nearest neighbors, the dimensionality of the embedding space, and any regularization terms
based on the specific characteristics of your data stream.

10. Evaluation and Monitoring:

 Periodically evaluate the quality of the LLE embedding using appropriate metrics, such as reconstruction error or visual inspection. Monitoring the
quality helps ensure that the embedding captures meaningful patterns in the data stream.

11. Application:

 Use the lower-dimensional representation obtained through LLE for downstream tasks such as clustering, visualization, or classification,
depending on your specific objectives.
•Adapting LLE to data streams requires careful management of the sliding window and efficient computation of the local linear models.
Additionally, choosing an appropriate neighborhood size and dimensionality for the embedding is crucial for achieving meaningful results.
Consider the computational resources available and the real-time constraints of your application when implementing LLE for data streams.
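The following is a minimal sketch of the sliding-window scheme described above, assuming scikit-learn's LocallyLinearEmbedding is simply re-fitted on the buffer each time a batch arrives; a truly incremental LLE update would require a more specialised algorithm, and the synthetic stream here is only a placeholder.

from collections import deque
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.RandomState(0)
window = deque(maxlen=500)                       # fixed-size buffer of recent points

def embed_window(n_neighbors=10, n_components=2):
    # Recompute the LLE embedding of everything currently in the window.
    X = np.asarray(window)
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=n_components)
    return lle.fit_transform(X)

# Hypothetical stream: noisy batches drawn from a 3-D curve.
for step in range(10):
    t = rng.uniform(0, 3 * np.pi, size=100)
    batch = np.c_[np.sin(t), np.cos(t), t] + 0.05 * rng.randn(100, 3)
    window.extend(batch)                         # old points fall off automatically
    if len(window) > 50:
        Y = embed_window()
        print(f"step {step}: embedded {len(window)} points -> shape {Y.shape}")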

8. Kernel Learning for Data Streams

•Kernel learning for data streams is an area of machine learning that focuses on adapting kernel methods, which are originally
designed for batch data, to the streaming data setting. Kernel methods are powerful techniques for dealing with non-linear relationships and
high-dimensional data. Adapting them to data streams requires efficient processing and storage of data as it arrives in a sequential and
potentially infinite manner. Here are some key considerations and techniques for kernel learning in data streams:

1. Online Kernel Methods:

 Traditional kernel methods, such as Support Vector Machines (SVM) and Kernel Principal Component Analysis (KPCA), can be
adapted to data streams using online learning techniques.

 Online SVM and Online KPCA algorithms update model parameters incrementally as new data arrives.

2. Incremental Kernel Matrix Updates:

 A key challenge in kernel methods for data streams is efficiently updating the kernel matrix as new data points arrive. Techniques
like the Nyström approximation and random Fourier features can be employed to approximate kernel matrices and update
them incrementally.

3. Memory Management:

 Efficiently manage memory to ensure that the kernel matrix doesn't grow too large as data accumulates. This may involve storing
only a subset of the most recent data points or employing methods like forgetting mechanisms.
4. Streaming Feature Selection:

 Apply feature selection techniques to the input data to reduce dimensionality before applying kernel methods. This can help in
maintaining computational efficiency.

5. Online Hyperparameter Tuning:

 Tune kernel hyperparameters (e.g., the kernel width or the regularization parameter in SVM) adaptively based on the streaming data to maintain model
performance.

6. Concept Drift Detection:

 Monitor the data stream for concept drift, which occurs when the data distribution changes over time. When drift is detected, consider retraining or
adapting the kernel model.

7. Kernel Approximations:

 Use kernel approximations such as Random Kitchen Sinks or Fastfood to approximate kernel operations with linear time complexity, making them
suitable for streaming data.

8. Parallel and Distributed Computing:

 Utilize parallel or distributed computing frameworks to handle large-scale streaming data and kernel computations efficiently.

9. Online Ensemble Methods:

 Consider ensemble methods like Online Random Forest or Online Boosting, which combine multiple models with kernels to adapt to
changing data.

10. Evaluation and Monitoring:

 Continuously monitor the performance of the kernel learning model using appropriate evaluation metrics, such as classification accuracy, mean
squared error, or others relevant to your task.
11. Resource Constraints:

 Adapt your kernel learning approach to resource constraints, such as processing power and memory, which may be limited in
streaming environments.

•Kernel learning for data streams is an active area of research, and various algorithms and techniques have been proposed to address the unique
challenges posed by streaming data. The choice of approach should be based on the specific requirements and constraints of your streaming data
application.
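The following is a minimal sketch of streaming kernel learning via the approximation strategy discussed above: a Nyström feature map (scikit-learn's Nystroem) is fitted once on a warm-up buffer, after which a linear online classifier is updated batch by batch with partial_fit. The make_batch() generator, kernel parameters, and number of components are illustrative assumptions.

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(42)

def make_batch(n=64):
    # Hypothetical stream source with a non-linear class boundary.
    X = rng.randn(n, 2)
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)
    return X, y

# Landmarks for the Nystroem map come from an initial warm-up buffer.
X_init, y_init = make_batch(256)
feature_map = Nystroem(kernel="rbf", gamma=1.0, n_components=100,
                       random_state=42).fit(X_init)

clf = PassiveAggressiveClassifier(random_state=42)
clf.partial_fit(feature_map.transform(X_init), y_init, classes=[0, 1])

for step in range(20):                           # simulated stream
    X, y = make_batch()
    Z = feature_map.transform(X)                 # approximate kernel features
    acc = clf.score(Z, y)                        # evaluate before updating
    clf.partial_fit(Z, y)
    print(f"step {step}: accuracy on incoming batch = {acc:.2f}")

If concept drift is detected, the landmark set (and hence the feature map) can be re-fitted on the current window, at the cost of restarting the linear model.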

9. Neural Networks for Data Streams

•Using neural networks for data streams, where data arrives continuously and in a potentially infinite sequence, presents unique challenges and
opportunities. Neural networks are powerful models for various machine learning tasks, including classification, regression, and sequence
modelling. Adapting them to data streams requires specialized techniques to handle the dynamic nature of the data. Here's an overview of
considerations when using neural networks for data streams:

1. Online Learning:

 Implement online learning techniques, also known as incremental or streaming learning, where the neural network is updated
incrementally as new data arrives. This is crucial for maintaining model performance in a changing data distribution.

2. Sliding Window:

 Use a sliding window mechanism to manage the memory and computational resources. Maintain a fixed-size window of the most recent data
points for training and updating the model.

3. Model Architecture:

 Choose neural network architectures that are amenable to online learning. Feedforward neural networks (multilayer perceptrons),
recurrent neural networks (RNNs), and online versions of convolutional neural networks (CNNs) can be adapted for data streams.
4. Mini-Batch Learning:

 Train neural networks in mini-batches as new data points arrive. This helps in utilizing efficient gradient descent algorithms, such as stochastic
gradient descent (SGD) or variants like ADAM, RMSprop, and AdaGrad.

5. Concept Drift Detection:

 Implement mechanisms to detect concept drift, which occurs when the data distribution changes over time. When drift is detected, consider
retraining or adapting the neural network.

6. Memory-efficient Models:

 Explore memory-efficient neural network architectures designed for streaming data, such as online memory networks, which adapt to the
limited memory capacity of the sliding window.

7. Feature Engineering:

 Perform feature engineering to extract relevant information from the data stream. Preprocessing steps like text tokenization, feature scaling,
or dimensionality reduction may be necessary.

8. Regularization:

 Apply regularization techniques, such as dropout or weight decay, to prevent overfitting, especially when data is limited in the sliding window.

9. Hyperparameter Tuning:

 Tune hyperparameters adaptively based on the streaming data, such as learning rates or network architectures.

10. Ensemble Methods:

 Consider ensemble techniques that combine multiple neural networks or models to improve robustness and adaptability in the presence of
concept drift.
11. Model Evaluation:

 Continuously monitor and evaluate the neural network's performance using appropriate evaluation metrics relevant to your task,
such as accuracy, F1-score, or mean squared error.

12. Online Anomaly Detection (Optional):

 Incorporate anomaly detection methods, including neural network-based approaches, to identify unusual or unexpected patterns in
the data stream.

13. Scalability and Parallel Processing:

 Utilize parallel or distributed computing frameworks to handle the computational load when processing large-scale data streams.

14. Resource Constraints:

 Adapt your neural network approach to resource constraints, such as processing power and memory, which may be limited in
streaming environments.

•Adapting neural networks to data streams is an active area of research, and various approaches, architectures, and libraries are available to
address the challenges of streaming data. The choice of approach should be tailored to the specific requirements of your streaming data
application.
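The following is a minimal sketch of online mini-batch training, assuming scikit-learn's MLPClassifier with partial_fit as the network; a deep-learning framework such as PyTorch or TensorFlow would follow the same test-then-train loop. The synthetic batches and hyperparameters are placeholders.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(7)
classes = np.array([0, 1])

net = MLPClassifier(hidden_layer_sizes=(32, 16), learning_rate_init=0.01,
                    random_state=7)

for step in range(50):                           # simulated, potentially endless stream
    X = rng.randn(64, 20)                        # hypothetical 20-feature mini-batch
    y = (X[:, :5].sum(axis=1) > 0).astype(int)
    if step > 0:
        print(f"step {step}: prequential accuracy = {net.score(X, y):.2f}")
    net.partial_fit(X, y, classes=classes)       # single update pass on this batch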

•Feature Selection for Data Streams with Streaming Instances

•In this subsection, we review feature selection with streaming instances, where the set of features is fixed while new instances arrive continuously.

(i) Online Feature Selection


(ii) Unsupervised Feature Selection on Data Streams
• The entire original feature set can then be divided into four basic
disjoint subsets:
(1) irrelevant features,
(2) redundant features,
(3) weakly relevant but non-redundant features, and
(4) strongly relevant features
Feature Selection Frameworks
• These algorithms are either search-based or correlation-based.
1. Search-Based Feature Selection Framework
• For the search-based framework, a typical feature selection process
consists of three basic steps (shown in Fig. 1), namely, subset
generation, subset evaluation, and stopping criterion.
• Subset generation aims to generate a candidate feature subset. Each
candidate subset is evaluated and compared with the previous best
one according to a certain evaluation criterion.
• If the newly generated subset is better than the previous one, it will
be the latest best subset. The first two steps of search-based feature
selection are repeated until a given stopping criterion is satisfied.
• Figure 1 indicates that search-based feature selection includes two key
factors: the evaluation criterion and the search strategy.
• According to the evaluation criterion, feature selection algorithms are
categorized into filter, wrapper, and hybrid (embedded) models.
• Feature selection algorithms under the filter model rely on analyzing
the general characteristics of data and evaluating features without
involving any learning algorithms.
• Wrapper utilizes a predefined learning algorithm instead of an
independent measure for subset evaluation.
• A typical hybrid algorithm makes use of both an independent
measure and a learning algorithm to evaluate feature subsets. The
analysis of advantages and disadvantages of filter, wrapper and hybrid
models is summarized in Table 2.
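The skeleton below is a minimal sketch of this search-based framework: a greedy forward search generates candidate subsets, a wrapper-style cross-validated score evaluates each candidate, and the loop stops when no candidate improves on the current best. The dataset, estimator, and scoring choice are illustrative; a filter or hybrid criterion could be substituted for the evaluate() function.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

def evaluate(subset):
    # Subset evaluation criterion (wrapper model): mean cross-validated accuracy.
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, sorted(subset)], y, cv=5).mean()

best_subset, best_score = set(), 0.0
improved = True
while improved:                                  # stopping criterion
    improved = False
    for f in range(X.shape[1]):                  # subset generation (greedy forward)
        if f in best_subset:
            continue
        candidate = best_subset | {f}
        score = evaluate(candidate)              # subset evaluation
        if score > best_score:
            best_subset, best_score, improved = candidate, score, True
print("selected features:", sorted(best_subset), "score:", round(best_score, 3))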
2. Correlation-Based Feature Selection Framework
• The correlation-based framework considers the feature-feature
correlation and feature-class correlation.
• Generally, the correlation between features is known as feature
redundancy, while the feature-class correlation is viewed as feature
relevance.
• Then an entire feature set can be divided into four basic disjoint
subsets: (1) irrelevant features, (2) redundant features, (3) weakly
relevant but non-redundant features, and (4) strongly relevant
features. An optimal feature selection algorithm should select non-
redundant and strongly relevant features as shown in Fig. 2.
• The correlation-based feature selection framework is shown in Fig. 3,
which consists of two steps: relevance analysis determines the subset
of relevant features, and redundancy analysis determines and
eliminates the redundant features from relevant ones to produce the
final subset.
• This framework has advantages over the search-based framework as
it circumvents subset search and allows for an efficient and effective
way of finding an approximately optimal subset.
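The following is a minimal sketch of this two-step framework, using absolute Pearson correlation for both relevance and redundancy analysis; the thresholds and synthetic dataset are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=3, random_state=0)

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

# Step 1: relevance analysis (feature-class correlation).
relevant = [j for j in range(X.shape[1]) if corr(X[:, j], y) > 0.1]
relevant.sort(key=lambda j: -corr(X[:, j], y))   # strongest features first

# Step 2: redundancy analysis (feature-feature correlation among relevant features).
selected = []
for j in relevant:
    if all(corr(X[:, j], X[:, k]) < 0.8 for k in selected):
        selected.append(j)

print("relevant features:", relevant)
print("selected (non-redundant) features:", selected)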
Advanced Topics for Feature Selection
• Massive amounts of high-dimensional data bring about both
opportunities and challenges to feature selection. Valid computational
paradigms for new challenges are becoming increasingly important.
• Then along with the paradigms, many feature selection topics are
emerging, such as feature selection for high-dimensional small
sample size (HDSSS) data, feature selection for big data mining,
feature selection for multi-label learning, feature selection with
privacy preservation, and feature selection for streaming data mining.
• Stable Feature Selection
• Sparsity-Based Feature Selection
• Multi-Source Feature Selection
• Distributed Feature Selection
• Multi-View Feature Selection
• Multi-Label Feature Selection
• Online Feature Selection
• Online feature selection proceeds in four steps (a minimal sketch follows this topic list):
Step 1: Generate a new feature.
Step 2: Determine whether the newly generated feature should be added to the currently selected feature subset.
Step 3: Determine whether some features should be removed from the currently selected feature subset when the new feature is added.
Step 4: Repeat Steps 1 to 3.
• Privacy-Preserving Feature Selection
• Adversarial Feature Selection
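The following is a minimal sketch of the four-step online feature selection loop above, using simple correlation-based relevance and redundancy criteria; the feature_stream() generator and the thresholds are illustrative assumptions.

import numpy as np

rng = np.random.RandomState(0)
n_samples = 500
y = rng.randn(n_samples)                         # continuous target for simplicity
selected = {}                                    # name -> feature column

def relevance(f):
    return abs(np.corrcoef(f, y)[0, 1])          # feature-class correlation

def redundancy(f, g):
    return abs(np.corrcoef(f, g)[0, 1])          # feature-feature correlation

def feature_stream():
    # Hypothetical stream of candidate features, arriving one column at a time.
    for i in range(20):
        noise = rng.randn(n_samples)
        yield f"x{i}", (0.7 * y + noise) if i % 4 == 0 else noise

for name, f in feature_stream():                 # Step 1: a new feature arrives
    if relevance(f) < 0.2:                       # Step 2: discard weakly relevant ones
        continue
    for old in list(selected):                   # Step 3: drop features made
        if redundancy(f, selected[old]) > 0.3:   #         redundant by the new one
            del selected[old]
    selected[name] = f                           # Step 4: repeat for the next feature
print("selected features:", sorted(selected))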
•Stable Feature Selection:
•Focuses on selecting features that are consistently important across different samples or
different subsets of the data. Stability in feature selection is crucial for ensuring the robustness
of the model, especially in cases where the data may be noisy or subject to change.
•Sparsity-Based Feature Selection:
•Relies on sparsity-inducing norms (like L1-norm) to select a subset of features. Methods like
Lasso (Least Absolute Shrinkage and Selection Operator) are commonly used to enforce
sparsity, leading to models that are easier to interpret and potentially reducing overfitting.
•Multi-Source Feature Selection:
•Deals with scenarios where data comes from multiple sources or domains. The goal is to select
features that are most informative across these different sources, potentially combining
information from them to improve model performance.
•Distributed Feature Selection:
Involves selecting features in a distributed or parallel computing environment. This approach is useful when working with datasets that are too large to process on a single machine.
•Multi-View Feature Selection:
Refers to feature selection in cases where the data is represented in different "views" or
feature sets, each providing different perspectives on the data. The goal is to select the most
informative features from each view to build a comprehensive model.
•Multi-Label Feature Selection:
Involves selecting features when the data has multiple labels (e.g., multi-label classification problems).
The challenge here is to select features that are informative for predicting all labels simultaneously,
potentially reducing redundancy and improving generalization.
•Wrapper Methods
•In wrapper methods, a predefined learning algorithm is trained on candidate feature subsets, and the subset that yields the best model performance is selected. Common wrapper techniques include:
•Forward selection - Forward selection is an iterative process that begins with an empty set of features. In each iteration it adds a feature, evaluates the model, and keeps the feature if performance improves. The process continues until adding a new variable/feature no longer improves the performance of the model.
• Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection. It begins with all the features and removes the least significant feature at each step; elimination continues until removing further features does not improve the performance of the model.
• Exhaustive Feature Selection - Exhaustive feature selection evaluates every possible combination of features by brute force and returns the best-performing subset. It is the most thorough wrapper approach, but also the most computationally expensive.
•Recursive Feature Elimination
•Recursive feature elimination is a greedy optimization approach in which features are selected by recursively considering smaller and smaller subsets. An estimator is trained on each subset, the importance of each feature is obtained from the estimator's coef_ or feature_importances_ attribute, and the least important features are eliminated at each step.
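The sketch below illustrates the wrapper approaches above with scikit-learn: SequentialFeatureSelector for forward selection and RFE for recursive feature elimination. The dataset, estimator, and number of features to keep are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
est = LogisticRegression(max_iter=1000)

# Forward selection: start empty and greedily add the feature that most
# improves the cross-validated score, until the requested number is reached.
forward = SequentialFeatureSelector(est, n_features_to_select=4,
                                    direction="forward").fit(X, y)

# Recursive feature elimination: fit, drop the weakest feature(s), repeat.
rfe = RFE(est, n_features_to_select=4).fit(X, y)

print("forward selection kept:", forward.get_support(indices=True))
print("RFE kept:", rfe.get_support(indices=True))
print("RFE ranking:", rfe.ranking_)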
•Filter Methods
•In filter methods, features are selected on the basis of statistical measures. This approach does not depend on a learning algorithm and chooses features as a pre-processing step. Filter methods remove irrelevant features and redundant columns from the model by ranking them with different metrics. Their advantages are low computational cost and a reduced risk of overfitting the data.
•Some common techniques of Filter methods are as follows:
•Information Gain
•Chi-square Test
•Fisher's Score
•Missing Value Ratio

•Information Gain: Information gain measures the reduction in entropy obtained when the dataset is split on a variable. It can be used as a feature selection technique by calculating the information gain of each variable with respect to the target variable.
•Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
•Fisher's Score:
•Fisher's score is one of the popular supervised techniques for feature selection. It ranks the variables by Fisher's criterion in descending order, and the variables with the largest scores are then selected.
•Missing Value Ratio:
•The missing value ratio is used to evaluate each feature against a threshold value. It is computed as the number of missing values in a column divided by the total number of observations; variables whose ratio exceeds the threshold can be dropped.
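The following is a minimal sketch of filter-style selection combining the missing value ratio with chi-square and mutual information scores (mutual information serving as a stand-in for information gain); the synthetic DataFrame and the 0.3 missing-ratio threshold are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 10, size=(200, 6)),
                  columns=[f"f{i}" for i in range(6)]).astype(float)
df.loc[rng.rand(200) < 0.4, "f5"] = np.nan       # f5 has roughly 40% missing values
y = (df["f0"] + df["f1"] > 9).astype(int)

# Missing value ratio: missing count per column divided by the number of rows.
missing_ratio = df.isnull().sum() / len(df)
kept = df.loc[:, missing_ratio < 0.3].fillna(0)  # drop columns above the threshold

# Chi-square scores (non-negative features required) and mutual information.
chi_selector = SelectKBest(chi2, k=3).fit(kept, y)
mi_scores = mutual_info_classif(kept, y, random_state=0)

print("chi2 scores:", dict(zip(kept.columns, chi_selector.scores_.round(2))))
print("mutual information:", dict(zip(kept.columns, mi_scores.round(2))))
print("selected by chi2:", list(kept.columns[chi_selector.get_support()]))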
Embedded Methods
•Embedded methods are also iterative: each training iteration is evaluated, and the features that contribute the most to that iteration are identified as the most important. Some common embedded techniques are:
• Regularization - Regularization adds a penalty term to the parameters of a machine learning model to avoid overfitting. The penalty is applied to the coefficients and can shrink some of them to exactly zero; features with zero coefficients can then be removed from the dataset. Common regularization techniques for this purpose are L1 regularization (Lasso) and Elastic Net (a combination of L1 and L2 regularization).
• Random Forest Importance - Tree-based feature selection methods provide feature importance scores that offer a natural way of selecting features. Feature importance indicates how much a feature contributes to model building or how strongly it impacts the target variable. Random Forest is such a tree-based method: a bagging algorithm that aggregates a number of decision trees and automatically ranks the nodes by their decrease in impurity (Gini impurity) over all the trees. Because nodes are ordered by impurity, the trees can be pruned below a specific node, and the remaining nodes correspond to a subset of the most important features.
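The following is a minimal sketch of the embedded techniques above: SelectFromModel with a Lasso estimator keeps the features whose coefficients remain non-zero, and a random forest ranks features by impurity-based importance. The regression dataset and the alpha value are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# L1 regularization: features whose coefficients are driven to zero are dropped.
lasso = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print("kept by Lasso:", np.where(lasso.get_support())[0])

# Random forest importance: impurity decrease aggregated over all the trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]
print("features ranked by importance:", ranked)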