IoT AMAR2
Variety
Variability
Veracity
Visualization
Value
Data Handling Technology
Data Handling Technology Cont.
Flow of Data
Data Source
Data Acquisition
Data Storage
Data Handling Using Hadoop
• Reliable, Scalable, Distributed data handling.
Building Blocks of Hadoop
Hadoop Distributed File System (HDFS)
NameNode and DataNodes
JobTracker and TaskTrackers
Hadoop Master/Slave Architecture
What is Data Analytics
Types of data analytics
Qualitative Analysis
Quantitative Analysis
Comparison
Advantages of Data Analytics
Method
Statistical Models:
Data consistency in an intermittently connected or disconnected environment: Connected class
• Connection: Maintains the information required to connect to the data source through a connection string. The connection string contains information such as the name and location of the data source, along with authorization credentials and settings. The Connection class has methods to open and close the connection, to initiate transactions on it, and to control other properties of the connection.
Connected class
• Command: Executes SQL statements or stored procedures against the data source. The Command class has a ParameterCollection object containing Parameter objects that allow parameterized SQL statements and stored procedures to be used against the data source.
• DataReader: Provides connected, forward-only, read-only access to the data source. It is optimized for speed. The DataReader is instantiated through a Command object.
Connected class
• Parameter: Allows parameters for both parameterized queries and stored procedures to be defined and set to appropriate values. The Parameter class is accessed through the Parameters collection within a Command object. It supports input and output parameters as well as return values from stored procedures.
• Transaction: Allows transactions to be created on a connection so that multiple changes to data in a data source are treated as a single unit of work and either all committed or all cancelled.
• DataAdapter: Bridges the data source and the disconnected DataSet or DataTable classes. The DataAdapter wraps the connected classes to provide this functionality. It provides a method to retrieve data into a disconnected object and a method to reconcile modified data in the disconnected object with the data source. The CommandBuilder class can generate the logic to reconcile changes in simple situations; custom logic can be supplied to deal with complex situations and optimize performance. (A sketch of this connected pattern in code follows.)
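The classes above are ADO.NET types and the slides include no code, so the following is only a minimal sketch of the same connected pattern, written in Python with the standard-library sqlite3 module; the accounts table and its values are invented for illustration:

import sqlite3

# Connection: the "connection string" here is just a database path
# (an in-memory SQLite database keeps the example self-contained).
conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [(1, 500.0), (2, 200.0)])

    # Transaction: the with-block treats both updates as a single unit
    # of work -- committed together, or rolled back together on error.
    with conn:
        # Command + Parameter: parameterized SQL statements.
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (100.0, 1))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (100.0, 2))

    # DataReader analogue: forward-only, read-only iteration over rows.
    for row in conn.execute("SELECT id, balance FROM accounts"):
        print(row)
finally:
    conn.close()  # the connection exposes explicit open/close control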
Disconnected class
• DataSet: Provides a consistent way to deal with disconnected data completely independently of the data source. The DataSet is essentially an in-memory relational database, serving as a container for the DataTable, DataColumn, DataRow, Constraint, and DataRelation objects. A DataSet is serialized and transported as XML; it can be accessed and manipulated either as XML or through the methods and properties of the DataSet interchangeably, and the XmlDataDocument class represents and synchronizes the relational data within a DataSet object with the XML Document Object Model (DOM).
• DataTable: Allows disconnected data to be examined and modified through a collection of DataColumn and DataRow classes. The DataTable allows constraints such as foreign keys and unique constraints to be defined using the Constraint class.
• DataColumn: Corresponds to a column in a table. The DataColumn class stores metadata about the structure of the column that, together with constraints, defines the schema of the table. The DataColumn can also create expression columns based on other columns in the table.
Disconnected Data
• DataRow: Corresponds to a row in a table and can examine and update data in the DataTable. The DataTable exposes DataRow objects through the DataRowCollection object it contains. The DataRow caches changes made to data contained in its columns, storing both original and current values. This allows changes to be cancelled or to be later reconciled with the data source.
• Constraint: Allows constraints to be placed on data stored within a DataTable. Unique and foreign key constraints can be created to maintain data integrity.
• DataRelation: Provides a way to indicate a relationship between different DataTable objects within a DataSet. The DataRelation relates columns in the parent and child tables, allowing navigation between the parent and child tables and referential integrity to be enforced through cascading updates and deletes. (A sketch of the disconnected fill/update cycle follows.)
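For illustration only (not from the slides): a crude Python stand-in for the disconnected fill/update cycle, caching each row's original and current values and reconciling only the changed rows back to the source, again using sqlite3 with an invented people table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

# "Fill": copy the result set into an in-memory structure, keeping the
# original values so changes can later be cancelled or reconciled
# (a stand-in for DataAdapter.Fill into a DataTable of DataRows).
table = [{"original": {"id": r[0], "name": r[1]},
          "current":  {"id": r[0], "name": r[1]}}
         for r in conn.execute("SELECT id, name FROM people")]

# Work with the cached rows while disconnected from the source.
table[0]["current"]["name"] = "Ada Lovelace"

# "Update": push only the changed rows back to the data source
# (a stand-in for DataAdapter.Update / CommandBuilder logic).
with conn:
    for row in table:
        if row["current"] != row["original"]:
            conn.execute("UPDATE people SET name = ? WHERE id = ?",
                         (row["current"]["name"], row["original"]["id"]))
            row["original"] = dict(row["current"])  # accept the change

print(conn.execute("SELECT * FROM people").fetchall())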
• Cyber Intrusions – e.g., a web server involved in FTP traffic
Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
[Figure: scatter plot in the X-Y plane showing the normal regions N1 and N2, the isolated anomalies o1 and o2, and the anomalous region O3]
Related problems
• Rare Class Mining
• Chance discovery
• Novelty Detection
• Exception Mining
• Noise Removal
• Black Swan*
Key Challenges
• Defining a representative normal region is challenging
• The boundary between normal and outlying behavior is
often not precise
• The exact notion of an outlier is different for different
application domains
• Availability of labeled data for training/validation
• Malicious adversaries
• Data might contain noise
• Normal behavior keeps evolving
Data Labels
• Supervised Anomaly Detection
– Labels available for both normal data and anomalies
– Similar to rare class mining
• Semi-supervised Anomaly Detection
– Labels available only for normal data
• Unsupervised Anomaly Detection
– No labels assumed
– Based on the assumption that anomalies are very
rare compared to normal data
Applications of Anomaly Detection
• Network intrusion detection
• Insurance / Credit card fraud detection
• Healthcare Informatics / Medical diagnostics
• Industrial Damage Detection
• Image Processing / Video surveillance
• Novel Topic Detection in Text Mining
• …
Intrusion Detection
• Intrusion Detection:
– Process of monitoring the events occurring in a computer system or
network and analyzing them for intrusions
– Intrusions are defined as attempts to bypass the security mechanisms
of a computer or network
• Challenges
– Traditional signature-based intrusion detection
systems are based on signatures of known
attacks and cannot detect emerging cyber threats
– Substantial latency in deployment of newly
created signatures across the computer system
• Anomaly detection can alleviate these
limitations
Fraud Detection
• Fraud detection refers to detection of criminal activities
occurring in commercial organizations
– Malicious users might be the actual customers of the organization or
might be posing as a customer (also known as identity theft).
• Types of fraud
– Credit card fraud
– Insurance claim fraud
– Mobile / cell phone fraud
– Insider trading
• Challenges
– Fast and accurate real-time detection
– Misclassification cost is very high
Healthcare Informatics
• Detect anomalous patient records
– Indicate disease outbreaks, instrumentation errors,
etc.
• Key Challenges
– Only normal labels available
– Misclassification cost is very high
– Data can be complex: spatio-temporal
Industrial Damage Detection
• Industrial damage detection refers to detection of different faults
and failures in complex industrial systems, structural damages,
intrusions in electronic security systems, suspicious events in
video surveillance, abnormal energy consumption, etc.
– Example: Aircraft Safety
• Anomalous Aircraft (Engine) / Fleet Usage
• Anomalies in engine combustion data
• Total aircraft health and usage management
• Key Challenges
– Data is extremely large, noisy, and unlabelled
– Most applications exhibit temporal behavior
– Detecting anomalous events typically requires immediate intervention
Image Processing
• Detecting outliers in an image monitored over time
• Detecting anomalous regions
within an image
• Used in
– medical image analysis
– video surveillance
– satellite image analysis
• Key Challenges
– Detecting collective anomalies
– Data sets are very large
Classification Based Techniques
• Main idea: build a classification model for normal (and, where available, rare anomalous) events based on labeled training data, and use it to classify each new unseen event
• Classification models must be able to handle skewed (imbalanced) class distributions
• Categories:
– Supervised classification techniques
• Require knowledge of both normal and anomaly class
• Build classifier to distinguish between normal and known anomalies
– Semi-supervised classification techniques
• Require knowledge of normal class only!
• Use modified classification model to learn the normal behavior and then detect
any deviations from normal behavior as anomalous
Classification Based Techniques
• Advantages:
– Supervised classification techniques
• Models that can be easily understood
• High accuracy in detecting many kinds of known anomalies
– Semi-supervised classification techniques
• Models that can be easily understood
• Normal behavior can be accurately learned
• Drawbacks:
– Supervised classification techniques
• Require labels from both the normal and anomaly classes
• Cannot detect unknown and emerging anomalies
– Semi-supervised classification techniques
• Require labels from normal class
• Possible high false alarm rate - previously unseen (yet legitimate) data records may be
recognized as anomalies
Supervised Classification Techniques
• Manipulating data records (oversampling / undersampling / generating artificial examples); see the sketch after this list
• Rule based techniques
• Model based techniques
– Neural network based approaches
– Support Vector machines (SVM) based approaches
– Bayesian networks based approaches
• Cost-sensitive classification techniques
• Ensemble based algorithms (SMOTEBoost, RareBoost)
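As a concrete illustration of the data-level option above, here is a minimal sketch (not from the slides) of random oversampling of the rare class using numpy; SMOTE-style methods would interpolate new points instead of duplicating, as noted in the final comment:

import numpy as np

rng = np.random.default_rng(seed=0)

# Toy imbalanced training set: 95 normal points, 5 rare (anomalous) points.
X_normal = rng.normal(0.0, 1.0, size=(95, 2))
X_rare = rng.normal(4.0, 0.5, size=(5, 2))

# Random oversampling: resample the rare class with replacement
# until the two classes are balanced.
idx = rng.integers(0, len(X_rare), size=len(X_normal))
X_rare_over = X_rare[idx]

X_train = np.vstack([X_normal, X_rare_over])
y_train = np.array([0] * len(X_normal) + [1] * len(X_rare_over))
print(X_train.shape, y_train.mean())  # 190 points, classes now ~balanced

# SMOTE would instead create synthetic rare points by interpolating
# between a rare point and one of its rare-class nearest neighbors.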
Semi-supervised Classification Techniques
• Use modified classification model to learn the normal
behavior and then detect any deviations from normal
behavior as anomalous
• Recent approaches:
– Neural network based approaches
– Support Vector machines (SVM) based approaches (see the one-class sketch below)
– Markov model based approaches
– Rule-based approaches
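To make the semi-supervised idea concrete, a minimal one-class SVM sketch (assuming scikit-learn is available; the data is synthetic): the model is fit on normal data only and flags deviations from it.

import numpy as np
from sklearn.svm import OneClassSVM  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # labels for normal class only

# Learn a boundary around normal behavior; nu bounds the fraction of
# training points allowed to fall outside the learned region.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

X_test = np.array([[0.1, -0.2],   # consistent with normal behavior
                   [5.0, 5.0]])   # a deviation from normal behavior
print(model.predict(X_test))      # +1 = normal, -1 = anomalous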
Nearest Neighbor Based Techniques
• Key assumption: normal points have close neighbors
while anomalies are located far from other points
• General two-step approach (sketched in code below)
1. Compute neighborhood for each data record
2. Analyze the neighborhood to determine whether data record is
anomaly or not
• Categories:
– Distance based methods
• Anomalies are data points most distant from other points
– Density based methods
• Anomalies are data points in low density regions
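A minimal distance-based sketch of the two-step approach above (synthetic data; the score is the distance to the k-th nearest neighbor, so larger means more anomalous):

import numpy as np

def knn_anomaly_scores(X, k=5):
    """Distance to the k-th nearest neighbor (larger => more anomalous)."""
    diff = X[:, None, :] - X[None, :, :]      # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # O(n^2) distances: fine for a sketch
    np.fill_diagonal(dist, np.inf)            # ignore self-distance
    return np.sort(dist, axis=1)[:, k - 1]    # k-th smallest per point

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense normal region
               np.array([[6.0, 6.0]])])           # one far-away point
scores = knn_anomaly_scores(X, k=5)
print(np.argmax(scores))  # 100: the injected anomaly has the largest score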
Clustering Based Techniques
• Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
• Categorization according to labels:
– Semi-supervised – cluster normal data to create modes of normal behavior; if a new instance does not belong to any of the clusters or is not close to any cluster, it is an anomaly
– Unsupervised – post-processing is needed after the clustering step to determine the cluster sizes and the distance from a cluster required for a point to be an anomaly
• Anomalies detected using clustering based methods can be (see the sketch below):
– Data records that do not fit into any cluster (residuals from clustering)
– Small clusters
– Low density clusters or local anomalies (far from other points within the same
cluster)
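A minimal clustering-based sketch (synthetic data, hand-rolled k-means): points far from every cluster centre are flagged as residuals from clustering.

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; enough for a sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)  # assign each point to its nearest centre
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # cluster 1
               rng.normal(8, 1, size=(100, 2)),   # cluster 2
               np.array([[4.0, 12.0]])])          # belongs to neither

centers = kmeans(X, k=2)
# Anomaly score: distance to the nearest cluster centre.
dists = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)).min(axis=1)
print(np.argmax(dists))  # 200: the injected point, far from both clusters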
Clustering Based Techniques
• Advantages:
– No need to be supervised
– Easily adaptable to on-line / incremental mode suitable for
anomaly detection from temporal data
• Drawbacks
– Computationally expensive
• Using indexing structures (k-d tree, R* tree) may alleviate this problem
– If normal points do not form any clusters, the techniques may fail
– In high dimensional spaces, data is sparse and distances
between any two data records may become quite similar.
• Clustering algorithms may not give any meaningful clusters
Statistics Based Techniques
• Data points are modeled using a stochastic distribution ⇒ points are determined to be outliers depending on their relationship with this model
• Advantage
– Utilize existing statistical modeling techniques to model various types of distributions
• Challenges
– With high dimensions, difficult to estimate distributions
– Parametric assumptions often do not hold for real data sets
Types of Statistical Techniques
• Parametric Techniques
– Assume that the normal (and possibly anomalous) data is generated from
an underlying parametric distribution
– Learn the parameters from the normal sample
– Determine the likelihood of a test instance being generated from this distribution to detect anomalies (see the Gaussian sketch below)
• Non-parametric Techniques
– Do not assume any knowledge of parameters
– Use non-parametric techniques to learn a distribution, e.g., Parzen window estimation
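A minimal parametric sketch of the idea above (synthetic 1-D data; Gaussian assumption with the common 3-sigma rule as the anomaly threshold):

import numpy as np

rng = np.random.default_rng(2)
normal_sample = rng.normal(10.0, 2.0, size=1000)  # training data: normal only

# Parametric step: assume a Gaussian and learn its parameters.
mu, sigma = normal_sample.mean(), normal_sample.std()

def anomaly_score(x):
    """|z|-score under the fitted Gaussian; low likelihood <=> large |z|."""
    return abs(x - mu) / sigma

for x in (10.5, 25.0):
    verdict = "ANOMALY" if anomaly_score(x) > 3.0 else "normal"  # 3-sigma rule
    print(x, round(anomaly_score(x), 2), verdict)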
Information Theory Based Techniques
• Compute information content in data using information theoretic
measures, e.g., entropy, relative entropy, etc.
• Key idea: Outliers significantly alter the information content in a
dataset
• Approach: Detect data instances that significantly alter the information content (sketched below)
– Requires an information theoretic measure
• Advantage
– Operate in an unsupervised mode
• Challenges
– Require an information theoretic measure sensitive enough to detect
irregularity induced by very few outliers
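A toy sketch of the key idea (synthetic categorical data, not from the slides): score each symbol by how much removing one of its records reduces the dataset's entropy; the rare records alter the information content the most.

import math
from collections import Counter

def entropy(counts, n):
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

# Protocol field of 100 connection records; "ftp" is rare.
data = ["http"] * 50 + ["https"] * 45 + ["ftp"] * 5
counts, n = Counter(data), len(data)
base = entropy(counts, n)

# Greedy test: which single record's removal reduces entropy the most?
for symbol in counts:
    reduced = counts.copy()
    reduced[symbol] -= 1
    print(symbol, round(base - entropy(reduced, n - 1), 4))
# Removing one rare "ftp" record gives the largest entropy drop,
# marking the rare records as the strongest outlier candidates.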
Visualization Based Techniques
• Use visualization tools to observe the data
• Provide alternate views of data for manual inspection
• Anomalies are detected visually
• Advantages
– Keeps a human in the loop
• Disadvantages
– Works well for low-dimensional data
– Can provide only aggregated or partial views for high-dimensional data
Visual Data Mining*
• Detecting telecommunication fraud
• Display telephone call patterns as a graph
• Use colors to identify fraudulent telephone calls (anomalies)
Contextual Anomaly Detection
• Detect contextual anomalies (a toy sketch follows this slide)
• General Approach
– Identify a context around a data instance (using a set
of contextual attributes)
– Determine if the data instance is anomalous w.r.t. the
context (using a set of behavioral attributes)
• Assumption
– All normal instances within a context will be similar (in
terms of behavioral attributes), while the anomalies
will be different
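A toy sketch of the approach (invented city/temperature readings; city is the contextual attribute, temperature the behavioral one): a value that is normal in one context is flagged in another.

import statistics
from collections import defaultdict

readings = [("Delhi", 34), ("Delhi", 36), ("Delhi", 35), ("Delhi", 35),
            ("Oslo", 4), ("Oslo", 6), ("Oslo", 5), ("Oslo", 5),
            ("Oslo", 4), ("Oslo", 6), ("Oslo", 35)]   # 35 C in Oslo

by_context = defaultdict(list)
for city, temp in readings:
    by_context[city].append(temp)

# Score each reading against its own context, not the global data:
# 35 C is normal in Delhi but anomalous in Oslo.
for city, temp in readings:
    mu = statistics.mean(by_context[city])
    sd = statistics.stdev(by_context[city])
    z = abs(temp - mu) / sd
    if z > 2.0:
        print(f"{temp} C in {city}: contextual anomaly (|z| = {z:.1f})")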
Contextual Attributes
• Contextual attributes define a neighborhood (context)
for each instance
• For example:
– Spatial Context
• Latitude, Longitude
– Graph Context
• Edges, Weights
– Sequential Context
• Position, Time
– Profile Context
• User demographics
Sequential Anomaly Detection
• Detect anomalous sequences in a database of
sequences, or
• Detect anomalous subsequence within a sequence
• Data is presented as a set of symbolic sequences (a Markov-model sketch follows), e.g.:
– System call intrusion detection
– Proteomics
– Climate data
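A minimal sketch for symbolic sequences (invented traces over an assumed 26-letter alphabet): fit a first-order Markov model to normal sequences and score a test sequence by its average log transition probability; low scores indicate anomalous sequences.

import math
from collections import defaultdict

# Normal symbolic sequences (e.g., system-call traces; invented here).
normal_traces = ["ABCABCABC", "ABCABC", "ABCABCABCABC"]

counts = defaultdict(lambda: defaultdict(int))
for trace in normal_traces:
    for a, b in zip(trace, trace[1:]):
        counts[a][b] += 1            # first-order transition counts

def transition_prob(a, b, alpha=0.01, alphabet=26):
    """P(b | a) with additive smoothing so unseen transitions get a
    small but nonzero probability."""
    total = sum(counts[a].values())
    return (counts[a][b] + alpha) / (total + alpha * alphabet)

def avg_log_prob(trace):
    pairs = list(zip(trace, trace[1:]))
    return sum(math.log(transition_prob(a, b)) for a, b in pairs) / len(pairs)

print(avg_log_prob("ABCABC"))  # near 0: all transitions seen in training
print(avg_log_prob("ACBACB"))  # strongly negative: anomalous sequence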
Motivation for On-line Anomaly Detection
• Data in many rare-event applications arrives continuously at an enormous pace
• There is a significant challenge to analyze such data
• Examples of such rare-event applications:
– Video analysis
– Aircraft safety
[Diagram: an attacker machine scans computer network activity and compromises a machine with a vulnerability]
Data Mining for Intrusion Detection
⬥ Increased interest in data mining based intrusion detection
– Attacks for which it is difficult to build signatures
– Attack stealthiness
– Unforeseen/Unknown/Emerging attacks
– Distributed/coordinated attacks
⬥ Data mining approaches for intrusion detection
– Misuse detection
⬥ Building predictive models from labeled data sets (instances are labeled as “normal” or “intrusive”) to identify known intrusions
⬥ High accuracy in detecting many kinds of known attacks
⬥ Cannot detect unknown and emerging attacks
– Anomaly detection
⬥ Detect novel attacks as deviations from “normal” behavior
⬥ Potential high false alarm rate - previously unseen (yet legitimate) system behaviors
may also be recognized as anomalies
– Summarization of network traffic