CC Unit IV

Part-A

INTRODUCTION:-
Big data is defined as collections of data sets whose volume, velocity (rate of arrival or time variation), or variety is so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data processing tools. In recent years there has been exponential growth in both structured and unstructured data generated by information technology, industrial, healthcare, and other systems. Some examples of big data are described as follows:
- Social networks: create text, image, audio, and video data.
- Web applications: generate click-stream data used to understand user behavior, as well as application logs.
- Industrial systems: use sensor data to monitor machine health and detect failures.
- Healthcare: electronic health record (EHR) systems collect patient data.
- Stock markets: generate trading data for analysis.
The underlying characteristics of big data include:
 VOLUME: Volume refers to the amount of data being generated, processed, and stored, that is, the sheer quantity of data involved.
 VELOCITY: Velocity refers to the speed at which data is generated, processed, and analyzed: how quickly data is collected and how fast insights can be derived from it.
 VARIETY: Variety refers to the diverse types and formats of data being collected and analyzed, including structured data (such as database records), unstructured data (such as social media posts), and semi-structured data (such as XML files). It is all about dealing with different data sources and formats.
CLUSTERING BIG DATA:-
Clustering is the process of grouping similar data items together such that data items that are more similar to each other (with respect to some similarity criterion) than to other data items are put in one cluster. Clustering big data is of much interest and arises in applications such as:
- Clustering social network data to find groups of similar users
- Clustering electronic health record (EHR) data to find similar patients
- Clustering sensor data to group similar or related faults in a machine
- Clustering market research data to group similar customers
- Clustering clickstream data to group similar users
Clustering algorithms belong to unsupervised machine learning: they find patterns in data without using labeled training data, which makes them well suited to analyzing big data.
K-MEANS CLUSTERING:
K-means clustering is a popular unsupervised learning algorithm for cluster analysis. It groups similar data points into K clusters by minimizing the distance between each point and the centroid (mean) of its assigned cluster.
K-MEANS CLUSTERING ALGORITHM:-

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids (these need not be points from the input dataset).

Step-3: Assign each data point to its closest centroid, forming the K clusters.

Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it.

Step-5: Repeat step-3, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurred, go to step-4; otherwise the algorithm has converged.

Step-7: The model is ready.


K-MEANS CLUSTERING IN PYTHON:-
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import joblib

# Load the data
data = pd.read_csv('data.csv')

# Create a KMeans object with 5 clusters and random initialization
kmeans = KMeans(n_clusters=5, init='random', n_init=10, random_state=42)

# Fit the model to the data
kmeans.fit(data)

# Predict the cluster labels for each data point
labels = kmeans.predict(data)

# Evaluate the model using the silhouette score
score = silhouette_score(data, labels)

# Save the model
joblib.dump(kmeans, 'kmeans.model')

# Load the model
kmeans = joblib.load('kmeans.model')

# Use the model to make predictions on new data
new_data = pd.read_csv('new_data.csv')
new_labels = kmeans.predict(new_data)
DBSCAN CLUSTERING:-
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm in big data analytics known for its efficiency and ability to
identify clusters of arbitrary shapes. It works by grouping points based on their
density, with parameters like `epsilon` (ε) and `minPts` determining cluster
formation. DBSCAN efficiently handles large datasets, is robust to noise, and can
find clusters of various shapes. However, it requires careful parameter tuning and
may have scalability issues with extremely large or high-dimensional data.
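A minimal sketch of DBSCAN with scikit-learn, assuming a numeric dataset in a file named data.csv (the file name and the eps/min_samples values below are illustrative, not prescribed by the text):

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Load and scale the data (DBSCAN is distance-based, so scaling matters)
data = pd.read_csv('data.csv')
X = StandardScaler().fit_transform(data)

# eps is the neighborhood radius (epsilon), min_samples is minPts
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Points labeled -1 are treated as noise (outliers)
print('Clusters found:', len(set(labels)) - (1 if -1 in labels else 0))
print('Noise points:', list(labels).count(-1))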

Figure: Clusters formed in k-means and DBSCAN

Figure: Outlier influence on DBSCAN

Parallelizing Clustering Algorithms using MapReduce:-


The parallel implementation of k-means clustering with MapReduce is designed to handle large-scale datasets that cannot fit into the memory of a single machine. The process involves the following steps:
1. Data Distribution: The data to be clustered is distributed across a distributed file
system like HDFS and split into blocks, which are replicated across different nodes
in the cluster. This ensures fault tolerance and allows for parallel processing.
2. Initialization: Clustering begins with an initial set of centroids. These centroids
can be randomly chosen or selected using a specific initialization method.
3. Map Phase:
- Each map task calculates the distances between the data samples and the
centroids.
- Based on these distances, each data sample is assigned to the nearest centroid.
4. Reduce Phase:
- In the reduce phase, the centroids are recomputed using the mean of all the
points in each cluster.
- Each reducer is responsible for calculating the new centroid for a particular
cluster.
5. Convergence Check:
- The new centroids are then fed back to the client program.
- The client program checks whether convergence is reached or the maximum
number of iterations is completed.
- Convergence in k-means is typically determined by measuring the difference
between the coordinates of the new centroids and the centroids from the previous
iteration. If the movement of centroids falls below a specified threshold,
convergence is achieved.
PROGRAM:-
from mrjob.job import MRJob

# Example fixed centroids for one k-means iteration; in practice these
# would be loaded from a file or passed in as a job parameter
CENTROIDS = [[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]

class KMeansClusteringJob(MRJob):
    def mapper(self, _, line):
        # Parse the input record into a numeric point
        point = [float(x) for x in line.split(',')]
        # Assign the point to the nearest centroid (squared Euclidean distance)
        cluster = min(range(len(CENTROIDS)),
                      key=lambda i: sum((p - c) ** 2 for p, c in zip(point, CENTROIDS[i])))
        yield cluster, point

    def reducer(self, cluster, points):
        # Recompute the centroid as the mean of all points assigned to the cluster
        points = list(points)
        centroid = [sum(dim) / len(points) for dim in zip(*points)]
        yield cluster, centroid

if __name__ == '__main__':
    KMeansClusteringJob.run()

CLASSIFICATION OF BIG DATA:-

Multi-class classification: A machine learning task in which data points are categorized into more than two distinct classes. Example: classifying gene expression data can involve multiple classes, such as identifying different disease states or healthy tissues based on gene activity patterns.

Binary classification: A machine learning technique for sorting data into exactly two categories (e.g., positive/negative, good/bad). A real-world example is news sentiment analysis; various algorithms are used for this task.

Document classification: A subtype of multi-class classification in which the data being classified consists of text documents that are assigned to predefined categories such as "politics" or "sports". This makes it a multi-class problem because there are more than two possible categories.
1) Naive Bayes Classifier:-

 Multinomial Naive Bayes: This version works with features that take multiple discrete values, such as word counts in documents (e.g., "red" appears 3 times, "green" once). It is well suited to situations where features have a limited number of possible values.
 Bernoulli Naive Bayes: A simpler variant that works with binary (yes/no, 1/0) features. For example, it can classify documents based on whether a specific word is present (1) or not (0).

Program:-
from sklearn.naive_bayes import GaussianNB
# Create a GaussianNB classifier
clf = GaussianNB()
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the classifier
accuracy = clf.score(X_test, y_test)
print(accuracy)
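The program above uses the Gaussian variant. A minimal sketch of Multinomial Naive Bayes for document classification follows; the example documents and category labels are made up purely for demonstration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training documents and their categories
docs = ["the match ended in a draw", "parliament passed the new bill",
        "the striker scored twice", "the minister announced a policy"]
labels = ["sports", "politics", "sports", "politics"]

# Convert documents to word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Train the Multinomial Naive Bayes classifier on word counts
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new document
print(clf.predict(vectorizer.transform(["the team won the final"])))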

2) Decision Tree:-
A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

 Information Gain: Focuses on reducing uncertainty (entropy) in the data. The higher the information gain, the better the feature is at separating the data clearly. It is the entropy of the set S minus the weighted-average entropy of the subsets Sv produced by splitting on the feature:

Information Gain = Entropy(S) - Σv (|Sv| / |S|) * Entropy(Sv)

 Gini Coefficient: Measures the likelihood of misclassifying a randomly chosen data point, where Pj is the proportion of samples belonging to class j. A lower Gini index indicates better separation:

Gini Index = 1 - Σj (Pj)^2
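A short illustrative calculation of entropy, information gain, and the Gini index for a hypothetical binary split (the class counts are made up for demonstration):

import math

def entropy(counts):
    # Shannon entropy of a class distribution given as counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    # Gini index: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Parent node: 10 positive and 10 negative samples
parent = [10, 10]
# A candidate split produces two children: [8, 2] and [2, 8]
left, right = [8, 2], [2, 8]

weighted_child_entropy = (sum(left) / 20) * entropy(left) + (sum(right) / 20) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy

print('Information gain:', round(info_gain, 3))    # ~0.278
print('Gini (left child):', round(gini(left), 3))  # 0.32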

PROGRAM:-

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data
df = pd.read_csv('churn.csv')

# Split the data into features and target variable
X = df.drop('churned', axis=1)
y = df['churned']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Visualize the decision tree
from sklearn.tree import export_graphviz
export_graphviz(clf, out_file='tree.dot', feature_names=X.columns)

3) Random Forest Classifier:-

The random forest classifier builds a set of decision trees, each trained on a randomly selected subset of the training set, and then collects the votes from the different decision trees to decide the final prediction.

Program:-
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a random forest classifier
clf = RandomForestClassifier()

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the classifier's performance
print("Accuracy:", accuracy_score(y_test, y_pred))

4) Support Vector Machine (SVM) Algorithm:-

The goal of the SVM algorithm is to find the best line or decision boundary that segregates n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.

PROGRAM:-
# Import the necessary libraries
from sklearn import svm
import numpy as np

# Create a dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 1, -1, -1])

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

# Fit the classifier to the data
clf.fit(X, y)

# Make a prediction
prediction = clf.predict([[9, 10]])

# Print the prediction
print(prediction)

Recommendation System: -
Recommendation systems are an important part of modern cloud applications such as e-commerce, social networks, and content delivery networks. A recommendation system provides recommendations to users (for items such as books, movies, songs, or restaurants) for unrated items, based either on the characteristics of the items or on the ratings given by the user and by other users to similar items. The former approach is called item-based or content-based recommendation, and the latter is called collaborative filtering.

Program: -
import surprise

# Load the built-in MovieLens 100k dataset
data = surprise.Dataset.load_builtin('ml-100k')

# Build the model
model = surprise.KNNBasic()

# Train the model on the full training set
trainset = data.build_full_trainset()
model.fit(trainset)

# Get top-10 recommendations for a user by predicting ratings
# for all items the user has not yet rated
user_id = '196'
rated = set(iid for (iid, _) in trainset.ur[trainset.to_inner_uid(user_id)])
predictions = [model.predict(user_id, trainset.to_raw_iid(iid))
               for iid in trainset.all_items() if iid not in rated]
recommendations = sorted(predictions, key=lambda p: p.est, reverse=True)[:10]

# Print the recommendations (item id and estimated rating)
for recommendation in recommendations:
    print(recommendation.iid, recommendation.est)

Part-B: -
MULTIMEDIA CLOUD: -
INTRODUCTION: -

• Multimedia web applications (video, audio) are gaining popularity due to web advancements
and faster internet.

• These applications require significant computing resources.

• Cloud computing offers a cost-effective solution for managing the resource demands of
multimedia applications.

• A new concept, the "multimedia cloud," caters specifically to mobile multimedia applications
by providing storage, processing, and streaming services.

• Users of cloud-based multimedia applications benefit from not needing to install and maintain
software locally and gain access to richer multimedia content.

• Multimedia clouds offer various service options (IaaS, PaaS, SaaS) to suit different
development and user needs.
Case Study: Live Video Streaming App
Live video streaming has become popular because it allows people to watch events live from anywhere on their devices. This case study explains how cloud computing makes this possible. Instead of buying expensive dedicated streaming equipment, providers rent cloud resources on demand, much like renting rather than buying. The cloud can scale to handle a large audience and reach viewers all over the world, and events can also be recorded so viewers can watch them later.

Figures 10.3, 10.4, and 10.5 show a demo application for creating live video streams in the cloud. These figures walk through the steps of setting up a stream, including specifying the stream details, choosing a stream size, and launching the stream.
Here is a Django template for a live streaming app stream page:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Live Streaming App</title>
</head>
<body>
<h1>Live Streaming App</h1>

<div id="stream">
<video id="video" width="640" height="480" controls>
</video>
</div>

<script>
var video = document.getElementById('video');

// Connect to the live stream


var socket = new WebSocket('ws://localhost:8000/stream');

socket.onmessage = function(event) {
var data = JSON.parse(event.data);
// Set the video source
video.src = data.stream_url;
};
</script>
</body>
</html>
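A minimal sketch of the Django view that might render this template (the view name and template path are assumptions for illustration, not part of the original app):

# views.py in a hypothetical Django app
from django.shortcuts import render

def stream(request):
    # Render the live streaming page shown above, assuming the
    # template is saved as templates/stream.html in the app
    return render(request, 'stream.html')

The view would then be wired to a URL pattern in the project's urls.py, for example path('stream/', views.stream).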
Streaming Protocols
The live streaming application described in the previous section uses RTMP
streaming protocol. There are a number of streaming methods used by stream
servers such as Flash Media Server including [41]:
• RTMP Dynamic Streaming (Unicast): High-quality, low-latency media streaming
with support for live and on-demand and full adaptive bitrate.
• RTMPE (encrypted RTMP): Real-time encryption of RTMP.
• RTMFP (multicast): Encrypted IP multicast with support for both ASM and SSM multicast on multicast-enabled networks.
• RTMFP (P2P): P2P live video delivery between Flash Player clients.
• RTMFP (multicast fusion): IP and P2P working together to support higher QoS
within enterprise networks.
• HTTP Dynamic Streaming (HDS): HDS lets you watch high-quality videos (live or
on-demand) that adjust to your internet speed, all using regular
internet connections
• Protected HTTP Dynamic Streaming (PHDS): Real-time encryption of HDS.
• HTTP Live Streaming (HLS): videos can be streamed directly over the internet
(HTTP) to iPhones, iPads, and other devices that can play HLS videos. You can also
add a security lock (AES128 encryption) to these streams if needed.
The streaming methods listed above are based on the RTMP and HTTP
streaming protocols.
RTMP Streaming: -

RTMP acts like a traffic manager for live streaming. It keeps a constant, reliable connection open
(like a dedicated highway lane) to transmit video, audio, and data all at once. This data is
chopped up into small packets (think of them as individual cars) to ensure smooth delivery. The
size of these packets can be adjusted depending on traffic conditions, and more important data
(like audio) can be prioritized to avoid delays. Ultimately, RTMP efficiently juggles all this
streaming traffic to deliver high-quality video and audio.
HTTP Live Streaming: -

HTTP Live Streaming (HLS) was proposed by Apple and is a part of the Apple iOS [43]. HLS can
dynamically adjust playback quality to match the available speed of wired or wireless networks.
HLS streams video in bite-sized chunks, adjusting quality on the fly to match your internet speed, for a
smooth viewing experience.

HTTP Dynamic Streaming: -

 Delivers video (including HD) in chunks over regular internet connections, similar to how you
download a file.
 Like a smart delivery person, it adjusts the video quality (bitrate) based on your internet speed for smooth playback.
 Leverages existing internet infrastructure for efficient delivery of both live and on-demand
content.

Part-C

Cloud Application Benchmarking & Tuning: -

Introduction: - Allocating resources for cloud applications is tricky because user demands
keep changing. Unlike traditional setups, cloud resources need to adjust automatically (scale up
or down) to handle these changes. This section talks about how to figure out how much cloud
space (resources) is needed for an application to run smoothly. By testing the application with
simulated use (workload), we can identify weak spots (bottlenecks) and allocate just the right
amount of resources, saving money and keeping users happy.

Benchmarking of cloud applications is important for the following reasons:

 Figuring out how much cloud space (resources) an application needs is complex.
 Testing the application (benchmarking) helps find the best setup for smooth performance and
saves money on resources.
 Benchmarking should be done regularly to handle changes in user demand.
 Ensuring an application's smooth launch requires testing its performance with various user demands
(workloads).
The steps involved in benchmarking cloud applications, from trace collection to synthetic workload generation, are described below:

1. Collecting Real User Data (Trace Collection/Generation):


o This involves monitoring a live application to record user actions like requests, timestamps, etc.
This data creates a "trace" of real-world usage.
2. Workload Modeling :
o The collected traces are analyzed to understand user behavior patterns.
o This analysis helps create mathematical models that can be used to simulate user actions.

3. Workload Specification :
o A workload specification language (WSL) is used to define the important user actions and workload attributes that affect application performance.
o This allows for creating workloads with slightly different user behaviors for testing purposes.
4. Synthetic Workload Generation :
o The goal is to create realistic user simulations (workloads) for testing the application.
o There are two approaches:
 Empirical Approach: Replaying recordings of real user behavior (traces). This can be limiting as it
only reflects a specific scenario.
 Analytical Approach: Using mathematical models to generate user actions with various
characteristics. This allows for more flexibility in testing different user behaviors.
5. Generating Workloads (User Emulation vs Aggregate Workloads ):
o User Emulation: Simulates individual users with "think time" between actions, mimicking real
user behavior. This is good for testing how the application handles individual user interactions
but doesn't control the exact timing of requests.
o Aggregate Workload Generation: Specifies the exact timing of requests arriving at the system. This is useful for testing how the application handles bursts of traffic but does not consider individual user behavior (a minimal sketch of aggregate workload generation follows this list).
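A minimal sketch of aggregate workload generation, assuming requests arrive as a Poisson process at a specified rate (the rate, duration, and URL list below are illustrative):

import random

def generate_arrivals(rate_per_sec, duration_sec, urls):
    """Generate (timestamp, url) request pairs with exponential inter-arrival times."""
    t = 0.0
    workload = []
    while t < duration_sec:
        # Exponential inter-arrival times give a Poisson arrival process
        t += random.expovariate(rate_per_sec)
        workload.append((round(t, 3), random.choice(urls)))
    return workload

# Example: 50 requests/sec for 10 seconds against three illustrative URLs
trace = generate_arrivals(50, 10, ['/home', '/product', '/checkout'])
print(len(trace), 'requests generated')
print(trace[:5])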
Workload Characteristics
Each class of multi-tier applications can have its own characteristic workload. Suppose you are testing a shopping website (an e-commerce application). To understand how the website performs under different conditions, you need to consider how people use it (a small sketch combining these characteristics follows the list):
 Sessions: A single visit by a user is like a session. They might browse a few pages, make a
purchase, or just window shop.
 Think Time: This is the time between when a user sees a response (like a product page loading)
and when they decide what to do next (click another link, add something to their cart, etc.). It's
like their "thinking time" in real life.
 Session Length: This is simply how many things a user does in one visit, like the number of pages
they browse.
 Workload Mix: This refers to the different things users typically do on the site. For example, on a
shopping site, users might browse more during sales (read-intensive) and buy more during
holidays (write-intensive, because purchases involve writing data to the database).
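A minimal sketch of user emulation based on these characteristics: sessions of random length, a workload mix of browse and buy requests, and think time between requests (all distributions and values below are illustrative assumptions):

import random

# Illustrative workload mix: 80% browse (read-intensive), 20% buy (write-intensive)
WORKLOAD_MIX = [('browse', 0.8), ('buy', 0.2)]

def emulate_session(mean_think_time=3.0, mean_session_length=5):
    """Emulate a single user session as a list of (request_type, think_time) pairs."""
    session = []
    # Session length: number of requests in one visit
    length = max(1, int(random.expovariate(1.0 / mean_session_length)))
    for _ in range(length):
        # Pick a request type according to the workload mix
        r, cumulative = random.random(), 0.0
        for request_type, weight in WORKLOAD_MIX:
            cumulative += weight
            if r <= cumulative:
                break
        # Think time: pause before the user's next action
        think_time = random.expovariate(1.0 / mean_think_time)
        session.append((request_type, round(think_time, 2)))
    return session

print(emulate_session())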
The table below summarizes performance evaluation tools for cloud applications.

httperf
- Application: A tool that generates various HTTP workloads for measuring web server performance.
- Approach: Has a core HTTP engine, a workload generation module, and a statistics collection module.
- Input/Output: Takes request URLs and specifications such as request rates and number of connections as input; produces requests generated at the specified rate as output.

SURGE
- Application: A synthetic web workload generator used for evaluating web server performance.
- Approach: Generates references that match empirical measurements of real web workloads (such as file size and popularity distributions and user think times), emulating a number of concurrent users.
- Input/Output: Takes workload characteristics and distributions as input and produces a stream of HTTP requests as output.

SWAT
- Application: A tool for stress testing session-based web applications.
- Approach: Generates session-based workloads using traces of real sessions together with think times and session characteristics.
- Input/Output: Takes session traces and workload parameters as input and produces synthetic sessions that are replayed against the application as output.

HP LoadRunner
- Application: A performance testing tool used to simulate high user loads on applications.
- Approach: Creates virtual users with specific behaviors to model real-world usage patterns and to test application stability under heavy workloads.
- Input/Output: Takes user scenarios as input and provides performance metrics such as response times and error rates as output.

GT-CAT
- Application: A cloud application benchmarking tool that implements the automated performance evaluation approach described in this unit.
- Approach: Analyzes real traces from application and database servers to build workload, benchmark, and architecture models, and generates synthetic workloads for evaluating deployments.
- Input/Output: Takes real application traces and deployment configurations as input and produces synthetic workloads and performance evaluations as output.

Table: - Performance evaluation tools for cloud applications

Application Performance Metrics:-


* Response time is the total time it takes for a user to get a response from the application after submitting a
request. It's influenced by data transfer size, network bandwidth, application processing time, and user device
processing time.

* Throughput refers to the number of requests an application can handle per second. It indicates how well the
application handles traffic volume.
These metrics are crucial for setting performance goals (SLOs) that ensure a smooth user experience.
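A small illustrative calculation of these two metrics from a hypothetical request log of (timestamp, response time) pairs (the log values are made up for demonstration):

# Hypothetical request log: (arrival timestamp in seconds, response time in seconds)
log = [(0.1, 0.20), (0.4, 0.25), (0.9, 0.22), (1.2, 0.40), (1.8, 0.35), (2.5, 0.30)]

# Throughput: completed requests per second over the measurement window
duration = log[-1][0] - log[0][0]
throughput = len(log) / duration

# Average and worst-case response time observed
response_times = sorted(rt for _, rt in log)
avg_rt = sum(response_times) / len(response_times)
max_rt = response_times[-1]

print(f"Throughput: {throughput:.2f} req/sec")
print(f"Average response time: {avg_rt * 1000:.0f} ms, max: {max_rt * 1000:.0f} ms")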

Design Consideration for a Benchmarking Methodology: -


* A good benchmarking methodology should be accurate (realistic workloads), easy to use
(minimal coding), flexible (controllable workload attributes), and widely applicable (various
applications).
* Accuracy is achieved by mimicking real workloads, especially when considering dependencies
between requests. User emulation is better suited for this than aggregate workload generation.
* Ease of use minimizes manual scripting for workload generation.
* Flexibility allows for fine-grained control over workload attributes for sensitivity analysis.
* Wide application coverage ensures the methodology works across different applications
and architectures.

Benchmarking Tools: -
Web application performance evaluation has shifted from manually scripted user interactions to
automated workload models built from real user data analysis.
Capturing Workload Characteristics: Compared to the traditional method of recording user interactions for workload characteristics (Figure 11.1), the automated approach (Figure 11.2) analyzes real traces from web, application, and database servers to build workload models using statistical analysis, eliminating the need for manual scripting.

Automated Performance Evaluation: The automated performance evaluation approach (Figure


11.2) simplifies creating various workload scenarios by analyzing real user traces to build models,
eliminating the manual scripting and parameterization needed in the traditional approach
(Figure 11.1).

Realistic Workloads: Unlike traditional methods that rely on manually scripted user
interactions, resulting in unrealistic workloads, the automated approach (Figure 11.2) leverages
real user traces for workload and benchmark models, generating synthetic workloads that
accurately mimic real-world user behavior.

Rapid Deployment Prototyping:

Traditional performance evaluation hinders rapid deployment architecture comparison as it


requires manual script creation for each new deployment; the automated approach (Figure 11.2)
addresses this by using an architecture model to swiftly build and evaluate various deployment
configurations.
Types of Tests:
 Baseline tests are like a starting point to see how fast the system runs before any changes
are made. This way, you can compare performance later and see if things got better or
worse.
 Load tests are like practice sessions with many users to see how well the system will handle real-world traffic (see the sketch after this list).
 Stress tests are like pushing the system way past its normal limits to see how it breaks
and what warnings to look for before it actually does.
 Soak tests are like running the system for a long time at a steady pace (consistent
workload level) to see if it holds up and how its speed changes over time.
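A minimal sketch of a simple load test that sends concurrent requests to an application endpoint and records response times, using the Python requests library (the URL, number of users, and request counts are illustrative assumptions):

import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'http://localhost:8000/'   # illustrative endpoint under test

def one_user(n_requests=20):
    """Emulate one user issuing a series of requests and record response times."""
    times = []
    for _ in range(n_requests):
        start = time.time()
        requests.get(URL, timeout=10)
        times.append(time.time() - start)
    return times

# Emulate 10 concurrent users
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda _: one_user(), range(10)))

all_times = [t for user_times in results for t in user_times]
print(f"Requests: {len(all_times)}")
print(f"Average response time: {statistics.mean(all_times) * 1000:.1f} ms")
print(f"Max response time: {max(all_times) * 1000:.1f} ms")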

Deployment Prototyping: -
Big cloud doesn't mean fast app! Testing different setups helps pick the cheapest and fastest
way to run your app, especially if user traffic goes up and down.

Figure 11.3 shows the steps involved in deployment prototyping along with the variables involved in each step. Given the performance requirements for an application, deployment design is an iterative process that involves the following steps:

Deployment Design: Create the deployment with the various tiers as specified in the deployment configuration and deploy the application.

Performance Evaluation: Verify whether the application meets the performance requirements with the deployment.

Deployment Refinement: Deployments are refined based on the performance evaluations. Various alternatives can exist in this step, such as vertical scaling and horizontal scaling.

Benchmarking Case study: -

• Fig (a) shows the average throughput and response time. The observed throughput increases as the demanded request rate increases. As more requests are served per second by the application, the response time also increases. The observed throughput saturates beyond a demanded request rate of 50 req/sec.
• Fig (b) shows the CPU usage density of one of the application servers. This plot shows that the application server CPU is a non-saturated resource.
• Fig (c) shows the database server CPU usage density. From this density plot we observe that the database CPU spends a large percentage of time at high utilization levels for demanded request rates of more than 40 req/sec.
• Fig (d) shows the density plot of the database disk I/O bandwidth.
• Fig (e) shows the network out rate for one of the application servers.
• Fig (f) shows the density plot of the network out rate for the database server. From this plot we observe a continuous saturation of the network out rate around 200 KB/s.
• Analysis: Throughput continuously increases as the demanded request rate increases from 10 to 40 req/sec. Beyond a demanded request rate of 40 req/sec, we observe that throughput saturates, which is due to the high CPU utilization density of the database server CPU. From the analysis of the density plots of the various system resources, we observe that the database CPU is the system bottleneck.
