CC - Unit IV - Chapters
• DBSCAN can find irregular shaped clusters as seen from this example and can even find a cluster completely surrounded
by a different cluster.
• DBSCAN considers some points as noise and does not assign them to any cluster.
Classification of Big Data
• Classification is the process of categorizing objects into predefined categories.
• Classification is achieved by classification algorithms that belong to a broad category of algorithms called supervised
machine learning.
• Supervised learning involves inferring a model from a set of input data and known responses to the data (training
data) and then using the inferred model to predict responses to new data.
• Binary classification
• Binary classification involves categorizing the data into two categories. For example, classifying the sentiment of a news article into positive or negative, classifying the state of a machine into good or faulty, classifying a health test into positive or negative, etc.
• Multi-class classification
• Multi-class classification involves more than two classes into which the data is categorized. For example, a gene expression classification problem involves multiple classes.
• Document classification
• Document classification is a type of multi-class classification approach in which the data to be classified is in the form of a text document, for example, classifying news articles into different categories such as politics, sports, etc.
Performance of Classification Algorithms
• Precision: Precision is the fraction of objects classified into a category that actually belong to that category.
• Recall: Recall is the fraction of objects belonging to a category that are classified correctly.
• Accuracy: Accuracy is the fraction of all objects that are classified correctly.
• F1-score: F1-score is a measure of accuracy that considers both precision and recall. F1-score is the harmonic mean of precision and recall, given as,
F1 = 2 · (precision · recall) / (precision + recall)
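To make the definitions concrete, here is a minimal Python sketch that computes all four metrics for a binary classifier; the true and predicted label vectors are made up for illustration.

    # Hypothetical labels: 1 = positive class, 0 = negative class
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp)          # fraction of predicted positives that are correct
    recall = tp / (tp + fn)             # fraction of actual positives that are found
    accuracy = (tp + tn) / len(y_true)  # fraction of all objects classified correctly
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

    print(precision, recall, accuracy, f1)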
Naive Bayes
• Naive Bayes is a probabilistic classification algorithm based on the Bayes theorem with a naive assumption about the independence of the feature attributes. Given a class variable C and feature variables F1,...,Fn, the conditional probability (posterior) according to Bayes theorem is given as,
P(C | F1,...,Fn) = P(C) · P(F1,...,Fn | C) / P(F1,...,Fn)
• Since the evidence P(F1,...,Fn) is constant for a given input and does not depend on the class variable C, only the numerator of the posterior probability is important for classification.
• With this simplification and the naive independence assumption P(F1,...,Fn | C) = P(F1 | C) · ... · P(Fn | C), classification can then be done as follows,
C* = argmax_C P(C) · P(F1 | C) · ... · P(Fn | C)
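A minimal sketch of Naive Bayes classification using scikit-learn's GaussianNB, which assumes Gaussian feature likelihoods P(Fi | C); the choice of the Iris dataset is purely illustrative.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GaussianNB()                # assumes Gaussian P(Fi | C)
    model.fit(X_train, y_train)         # estimates P(C) and per-class feature distributions
    print(model.score(X_test, y_test))  # accuracy on held-out data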
Decision Trees
• Decision Trees are a supervised learning method that use a tree created
from simple decision rules learned from the training data as a predictive
model.
• The predictive model is in the form of a tree that can be used to predict the value of a target variable based on several attribute variables.
• Each node in the tree corresponds to one attribute in the dataset on
which the “split” is performed.
• Each leaf in a decision tree represents a value of the target variable.
• The learning process involves recursively splitting on the attributes until all
the samples in the child node have the same value of the target variable
or splitting further results in no further information gain.
• To select the best attribute for splitting at each stage, different metrics
can be used.
Splitting Attributes in Decision Trees
To select the best attribute for splitting at each stage, different metrics can be used such as:
• Information Gain
• Information gain is defined based on the entropy of the random variable X, which is defined as,
H(X) = - Σx p(x) log2 p(x)
• Entropy is a measure of uncertainty in a random variable, and choosing the attribute with the highest information gain results in the split that reduces the uncertainty the most at that stage. The information gain of splitting on an attribute is the entropy of the parent node minus the weighted average entropy of the child nodes.
• Gini Coefficient
• Gini coefficient measures the inequality, i.e. how often a randomly chosen sample, labeled according to the distribution of labels in the node, would be labeled incorrectly. The Gini coefficient is defined as,
G = Σi pi (1 - pi) = 1 - Σi pi^2, where pi is the fraction of samples belonging to class i.
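A minimal Python sketch of the two splitting metrics, computed directly from the class labels at a node:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # H(X) = -sum p(x) log2 p(x)
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # G = 1 - sum pi^2
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(entropy(["yes", "yes", "no", "no"]))  # 1.0 bit, maximum uncertainty
    print(gini(["yes", "yes", "yes", "yes"]))   # 0.0, a pure node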
Decision Tree Algorithms
• There are different algorithms for building decision trees, popular ones being ID3 and C4.5.
• ID3:
• Attributes are discrete. If not, discretize the continuous attributes.
• Calculate the entropy of every attribute using the dataset.
• Choose the attribute with the highest information gain.
• Create branches for each value of the selected attribute.
• Repeat with the remaining attributes.
• The ID3 algorithm can result in over-fitting to the training data and can be expensive to train, especially for continuous attributes.
• C4.5
• The C4.5 algorithm is an extension of the ID3 algorithm. C4.5 supports both discrete and continuous attributes.
• To support continuous attributes, C4.5 finds thresholds for the continuous attributes and then splits based on the
threshold values. C4.5 prevents over-fitting by pruning trees after they have been created.
• Pruning involves removing or aggregating those branches which provide little discriminatory power.
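A compact ID3-style sketch for discrete attributes, with simplified behavior (no pruning, no continuous attributes, as in ID3); the toy records and attribute names are made up for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, attr, target):
        # entropy of the parent minus weighted entropy of the children
        gain = entropy([r[target] for r in rows])
        for value in set(r[attr] for r in rows):
            subset = [r[target] for r in rows if r[attr] == value]
            gain -= (len(subset) / len(rows)) * entropy(subset)
        return gain

    def id3(rows, attrs, target):
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1 or not attrs:         # pure node, or nothing left to split on
            return Counter(labels).most_common(1)[0][0]  # leaf: majority label
        best = max(attrs, key=lambda a: information_gain(rows, a, target))
        rest = [a for a in attrs if a != best]
        return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                       for v in set(r[best] for r in rows)}}

    rows = [{"outlook": "sunny", "windy": "no",  "play": "yes"},
            {"outlook": "rainy", "windy": "yes", "play": "no"},
            {"outlook": "sunny", "windy": "yes", "play": "yes"},
            {"outlook": "rainy", "windy": "no",  "play": "yes"}]
    print(id3(rows, ["outlook", "windy"], "play"))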
Random Forest
• Random Forest is an ensemble learning method that is based on randomized decision trees.
• Random Forest trains a number of decision trees and then takes the majority vote, using the mode of the classes predicted by the individual trees.
Breiman’s Algorithm
1. Draw a bootstrap sample from the dataset (n times with replacement from the N samples in the training set).
2. Train a decision tree:
- Until the tree is fully grown (maximum size):
- Choose the next leaf node.
- Select m attributes at random (m is much less than the total number of attributes M).
- Choose the best attribute and split as usual.
3. Measure the out-of-bag error:
- Use the rest of the samples (those not selected in the bootstrap) to estimate the error of the tree by predicting their classes.
4. Repeat steps 1-3 k times to generate k trees.
5. Make a prediction by majority vote among the k trees.
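Breiman's algorithm maps naturally onto scikit-learn's RandomForestClassifier; the sketch below is illustrative, and the dataset and parameter values are assumptions, not prescriptions.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(
        n_estimators=100,     # k trees, each grown on a bootstrap sample
        max_features="sqrt",  # m attributes chosen at random at each split (m << M)
        oob_score=True,       # estimate error from the out-of-bag samples
        random_state=0,
    )
    forest.fit(X, y)          # prediction = majority vote among the k trees
    print(forest.oob_score_)  # out-of-bag accuracy estimate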
Support Vector Machine
• Support Vector Machine (SVM) is a supervised machine
learning approach used for classification and regression.
• In its basic form, SVM is a binary classifier that classifies the data points into one of two classes.
• SVM training involves determining the maximum
margin hyperplane that separates the two classes.
• The maximum margin hyperplane is one which has the
largest separation from the nearest training data point.
• Given a training data set (xi, yi), where xi is an n-dimensional vector and yi = 1 if xi is in class 1 and yi = -1 if xi is in class 2, a standard SVM finds a hyperplane w·x - b = 0 which correctly separates the training data points and has the maximum margin, i.e. the largest distance between the two parallel hyperplanes w·x - b = 1 and w·x - b = -1.
[Figure: Binary classification with a Linear SVM and with an RBF SVM]
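A minimal sketch of binary classification with linear and RBF-kernel SVMs in scikit-learn; the synthetic dataset is an assumption for illustration.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)  # maximum margin hyperplane w·x - b = 0
    rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)        # non-linear decision boundary

    print(linear_svm.score(X_test, y_test), rbf_svm.score(X_test, y_test))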
Cloud Application Design for PaaS
• For applications that use the Platform-as-a-service (PaaS) cloud service model, the architecture and
deployment design steps are not required since the platform takes care of the architecture and deployment.
• Component Design
• In the component design step, the developers have to take into consideration the platform specific features.
• Platform Specific Software
• Different PaaS offerings such as Google App Engine, Windows Azure Web Sites, etc., provide platform specific software
development kits (SDKs) for developing cloud applications.
• Sandbox Environments
• Applications designed for specific PaaS offerings run in sandbox environments and are allowed to perform only those
actions that do not interfere with the performance of other applications.
• Deployment & Scaling
• The deployment and scaling is handled by the platform while the developers focus on the application development
using the platform-specific SDKs.
• Portability
• Portability is a major constraint for PaaS based applications as it is difficult to move the application from one PaaS offering to another, since applications are built using platform-specific SDKs and services.
Multimedia Cloud Reference Architecture
• Infrastructure Services
• In the Multimedia Cloud reference architecture, the first layer is the
infrastructure services layer that includes computing and storage resources.
• Platform Services
• On top of the infrastructure services layer is the platform services layer
that includes frameworks and services for streaming and associated tasks
such as transcoding and analytics that can be leveraged for rapid
development of multimedia applications.
• Applications
• The topmost layer consists of the applications, such as live video streaming, video transcoding, video-on-demand, multimedia processing, etc.
• Cloud-based multimedia applications alleviate the burden of installing and maintaining multimedia applications locally on the multimedia consumption devices (desktops, tablets, smartphones, etc.) and provide access to rich multimedia content.
• Service Models
• A multimedia cloud can have various service models such as IaaS, PaaS
and SaaS that offer infrastructure, platform or application services.
Multimedia Cloud - Live Video Streaming
• Workflow of a live video streaming application that uses multimedia cloud:
• The video and audio feeds generated by a number of cameras and microphones are mixed/multiplexed with video/audio mixers and then encoded by a client application, which then sends the encoded feeds to the multimedia cloud.
• On the cloud, streaming instances are created on-demand and the streams are then broadcast over the internet.
• The streaming instances also record the event streams which are later moved to the cloud storage for video
archiving.
Workload Generation
• Empirical approach
• In this approach traces of applications are sampled and replayed to generate the synthetic workloads.
• The empirical approach lacks flexibility as the real traces obtained from a particular system are used for
workload generation which may not well represent the workloads on other systems with different
configurations and load conditions.
• Analytical approach
• Uses mathematical models to define the workload characteristics that are used by a synthetic workload
generator.
• Analytical approach is flexible and allows generation of workloads with different characteristics by
varying the workload model attributes.
• With the analytical approach it is possible to modify the workload model parameters one at a time and
investigate the effect on application performance to measure the application sensitivity to different
parameters.
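A minimal sketch of the analytical approach, assuming a Poisson arrival model (exponentially distributed inter-arrival times); the rate and duration values are illustrative.

    import random

    def generate_arrivals(rate_per_sec, duration_sec, seed=0):
        random.seed(seed)
        t, arrivals = 0.0, []
        while t < duration_sec:
            t += random.expovariate(rate_per_sec)  # model attribute: arrival rate
            arrivals.append(t)
        return arrivals

    # Varying one model attribute (the rate) regenerates a new workload,
    # which is what makes the analytical approach flexible.
    print(len(generate_arrivals(rate_per_sec=10, duration_sec=60)))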
User Emulation vs Aggregate Workloads
The commonly used techniques for workload generation are:
• User Emulation
• Each user is emulated by a separate thread that mimics the actions of a user by alternating between making
requests and lying idle.
• The attributes for workload generation in the user emulation method include, for instance, think time, request types, and inter-request dependencies.
• User emulation allows fine grained control over modeling the behavioral aspects of the users interacting with the system under test; however, it does not allow controlling the exact time instants at which the requests arrive at the system.
• Aggregate Workload Generation:
• Allows specifying the exact time instants at which the requests should arrive at the system under test.
• However, there is no notion of an individual user in aggregate workload generation, therefore, it is not possible
to use this approach when dependencies between requests need to be satisfied.
• Dependencies can be of two types: inter-request and data dependencies.
• An inter-request dependency exists when the current request depends on the previous request, whereas a data dependency exists when the current request requires input data which is obtained from the response of the previous request.
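A minimal sketch of user emulation with one thread per user, alternating between a request and idle think time; the URL, user count, and think-time distribution are assumptions for illustration.

    import random
    import threading
    import time
    import urllib.request

    def emulate_user(num_requests):
        # each emulated user alternates between a request and idle think time
        for _ in range(num_requests):
            try:
                urllib.request.urlopen("http://localhost:8080/", timeout=5)
            except OSError:
                pass                                  # a real harness would record errors
            time.sleep(random.expovariate(1 / 4.0))  # mean think time of 4 s (assumed)

    threads = [threading.Thread(target=emulate_user, args=(10,)) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()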
Workload Characteristics
• Session
• A set of successive requests submitted by a user constitute a session.
• Inter-Session Interval
• Inter-session interval is the time interval between successive sessions.
• Think Time
• In a session, a user submits a series of requests in succession. The time interval between
two successive requests is called think time.
• Session Length
• The number of requests submitted by a user in a session is called the session length.
• Workload Mix
• Workload mix defines the transitions between different pages of an application and the
proportion in which the pages are visited.
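These characteristics can be combined into a simple session model; the sketch below treats the workload mix as page-transition probabilities (the pages, probabilities, and session length are made-up assumptions).

    import random

    transitions = {
        "home":    [("browse", 0.7), ("search", 0.3)],
        "browse":  [("product", 0.6), ("home", 0.4)],
        "search":  [("product", 0.8), ("home", 0.2)],
        "product": [("home", 1.0)],
    }

    def next_page(page):
        # sample the next page according to the workload mix probabilities
        r, total = random.random(), 0.0
        for target, p in transitions[page]:
            total += p
            if r < total:
                return target
        return transitions[page][-1][0]

    # One session of length 6: the user would pause for a think time
    # between successive requests.
    session = ["home"]
    for _ in range(5):
        session.append(next_page(session[-1]))
    print(session)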
Application Performance Metrics
The most commonly used performance metrics for cloud applications are:
• Response Time
• Response time is the time interval between the moment when the user submits a
request to the application and the moment when the user receives a response.
• Throughput
• Throughput is the number of requests that can be serviced per second.
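A minimal sketch that measures both metrics against a hypothetical endpoint by timing sequential requests:

    import time
    import urllib.request

    def measure(url, num_requests):
        latencies = []
        start = time.time()
        for _ in range(num_requests):
            t0 = time.time()
            urllib.request.urlopen(url, timeout=5).read()
            latencies.append(time.time() - t0)  # response time per request
        elapsed = time.time() - start
        return sum(latencies) / len(latencies), num_requests / elapsed

    mean_rt, throughput = measure("http://localhost:8080/", 100)
    print(f"mean response time {mean_rt:.3f} s, throughput {throughput:.1f} req/s")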
Considerations for Benchmarking Methodology
• Accuracy
• Accuracy of a benchmarking methodology is determined by how closely the generated synthetic workloads
mimic the realistic workloads.
• Ease of Use
• A good benchmarking methodology should be user friendly and should involve minimal hand-coding effort for writing workload generation scripts that take into account, for instance, the dependencies between requests and the workload attributes.
• Flexibility
• A good benchmarking methodology should allow fine grained control over the workload attributes, such as think time, inter-session interval, session length, and workload mix, to perform sensitivity analysis.
• Sensitivity analysis is performed by varying one workload characteristic at a time while keeping the others
constant.
• Wide Application Coverage
• A good benchmarking methodology is one that works for a wide range of applications and is not tied to a specific application architecture or workload type.
Types of Tests
• Baseline Tests
• Baseline tests are done to collect the performance metrics data of the entire application or a component of the
application.
• The performance metrics data collected from baseline tests is used to compare various performance tuning
changes which are subsequently made to the application or a component.
• Load Tests
• Load tests evaluate the performance of the system with multiple users and workload levels that are encountered in
the production phase.
• The number of users and workload mix are usually specified in the load test configuration.
• Stress Tests
• Stress tests load the application to a point where it breaks down.
• These tests are done to determine how the application fails, the conditions in which the application fails and the
metrics to monitor which can warn about impending failures under elevated workload levels.
• Soak Tests
• Soak tests involve subjecting the application to a fixed workload level for long periods of time.
• Soak tests help in determining the stability of the application under prolonged use and how the performance changes
with time.
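As an illustration, a load test of this kind could be scripted in a tool such as Locust (a third-party Python load-testing framework); the endpoints, task weights, and think-time range below are assumptions.

    from locust import HttpUser, task, between

    class ShopUser(HttpUser):
        wait_time = between(1, 5)  # think time between requests, in seconds

        @task(3)
        def browse(self):          # weighted 3:1 in the workload mix
            self.client.get("/products")

        @task(1)
        def home(self):
            self.client.get("/")

    # Run with: locust -f loadtest.py --host http://localhost:8080
    # Raising the user count turns the same script into a stress test;
    # holding a constant user count for a long period makes it a soak test.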
Deployment Prototyping
• Deployment prototyping can help in making deployment architecture design choices.
• By comparing performance of alternative deployment architectures, deployment
prototyping can help in choosing the best and most cost effective deployment
architecture that can meet the application performance requirements.