CC Unit IV
INTRODUCTION:-
Big data is defined as collections of data sets whose volume, velocity (the rate at which the data varies over time), or variety is so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data processing tools. In recent years there has been exponential growth in both structured and unstructured data generated by information technology, industrial, healthcare, and other systems. Some examples of big data are described as follows:
- Social networks: They create text, images, audio, and video data.
- Web applications: They generate click-stream data to understand user behavior.
- Industrial systems: They use sensor data to monitor health and detect failures.
- Healthcare: Electronic health record (EHR) systems collect data.
- Web applications: They produce logs.
- Stock markets: They generate data for analysis.
The underlying characteristics of big data include:
VOLUME: In Big Data, volume refers to the amount of data that is being
generated, processed, and stored. It's all about the sheer quantity of
data involved. 📊📈
VELOCITY: Velocity in Big Data refers to the speed at which data is
generated, processed, and analyzed. It's about how quickly data is being
collected and how fast insights can be derived from it. 🚀⏱
VARIETY: Variety in Big Data refers to the diverse types and formats of data
that are being collected and analyzed. It includes structured data (like
databases), unstructured data (like social media posts), and semi-structured
data (like XML files). It's all about dealing with different data sources
and formats. 📚📊
CLUSTERING BIG DATA:-
Clustering is the process of grouping similar data items together such that items that are more similar to each other (with respect to some similarity criterion) than to other items are placed in the same cluster. Clustering big data is of much interest and arises in applications such as:
- Clustering social network data to find groups of similar users
- Clustering electronic health record (EHR) data to find similar patients
- Clustering sensor data to group similar or related faults in a machine
- Clustering market research data to group similar customers
- Clustering clickstream data to group similar users
Clustering algorithms are unsupervised machine learning methods: they find patterns in data without using labeled training data, which makes them well suited to analyzing big data. 🌟🔍
K-MEANS CLUSTERING:
K-means clustering is a popular unsupervised learning algorithm for cluster analysis. It groups similar data points into K clusters by minimizing the distance between each data point and the centroid (mean) of its cluster.
K-MEANS CLUSTERING ALGORITHM:-
Step-1: Select the number K to decide how many clusters are to be formed.
Step-2: Select K random points as the initial centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the mean of the points in each cluster and place a new centroid there.
Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid. If any assignment changed, return to Step-4; otherwise the clusters have converged and the algorithm finishes.
# Entry point for a MapReduce-style K-means job; KMeansClusteringJob is
# assumed to be defined elsewhere (for example, using the mrjob library)
if __name__ == '__main__':
    KMeansClusteringJob.run()
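The entry point above assumes a KMeansClusteringJob class defined elsewhere. As a self-contained alternative, here is a minimal sketch of K-means using scikit-learn; the sample data points are made up for illustration:
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D sample data: two loose groups of points
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

# Fit K-means with K=2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # final centroids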
CLASSIFICATION OF BIG DATA:-
Multi-class classification: This is a machine learning task where data points are categorized into
more than two distinct classes. Example: Classifying gene expression can involve multiple classes,
such as identifying different disease states or healthy tissues based on gene activity patterns.
Binary classification: a machine learning task where data points are sorted into one of two categories (positive/negative, good/bad). A real-world example is classifying the sentiment of news articles; various algorithms are used for this task.
1) Naive Bayes :-
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem with the "naive" assumption that features are independent of each other. Common variants include:
Multinomial Naive Bayes: This version works with features that have multiple categories, like
word counts in documents (red: 3 times, green: 1 time). It's good for situations where features
have a limited number of possibilities.
Bernoulli Naive Bayes: This one is simpler and works with just yes/no (1/0) features. For example,
it can classify documents based on whether a specific word is present (1) or not (0).
Gaussian Naive Bayes: This variant handles continuous, real-valued features by assuming each feature follows a normal (Gaussian) distribution; it is the variant used in the program below.
Program:-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Example data (Iris) used here so the snippet is self-contained
X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=42)

# Create a GaussianNB classifier
clf = GaussianNB()
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the classifier (mean accuracy on the test set)
accuracy = clf.score(X_test, y_test)
print(accuracy)
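The program above uses the Gaussian variant. To illustrate the Multinomial variant described earlier, here is a minimal sketch that classifies short documents by their word counts; the example documents and labels are made up:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training documents and sentiment labels (1 = positive, 0 = negative)
docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# Convert documents into word-count feature vectors
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Train Multinomial Naive Bayes on the counts
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new document
print(clf.predict(vec.transform(["great movie"])))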
2) Decision Tree :-
A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
Information Gain: Focuses on reducing uncertainty (entropy) in the data. The higher the information gain, the better the feature is at separating the data clearly.
Gini Index: Measures the likelihood of misclassifying a randomly chosen data point. A lower Gini index indicates better separation.
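To make these two measures concrete, here is a small sketch that computes entropy, information gain, and the Gini index from class counts; the counts used are made-up numbers:
import numpy as np

def entropy(counts):
    # Entropy = -sum(p * log2(p)) over the class proportions p
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(counts):
    # Gini index = 1 - sum(p^2) over the class proportions p
    p = np.array(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes
    n = np.sum(parent)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Made-up node with 6 positive and 4 negative samples
print(entropy([6, 4]))   # ~0.971 bits
print(gini([6, 4]))      # 0.48
print(information_gain([6, 4], [[5, 1], [1, 3]]))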
PROGRAM:-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the churn dataset ('churn.csv' with a 'churned' label column, as in the original)
df = pd.read_csv('churn.csv')
X = df.drop('churned', axis=1)
y = df['churned']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier and predict on the test set
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
3) Random Forest :-
Random forest is an ensemble learning technique that builds many decision trees on random subsets of the data and features, and combines their predictions (majority vote for classification) to improve accuracy and reduce overfitting.
Program:-
from sklearn.ensemble import RandomForestClassifier

# Train a random forest classifier (reusing the train/test split from above)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
4) Support Vector Machine (SVM) :-
SVM is a supervised learning technique that finds the hyperplane which best separates the classes with the maximum margin.
PROGRAM:-
# Import the necessary libraries
import numpy as np
from sklearn import svm

# Create a small example dataset with two classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Train a linear SVM classifier
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# Make a prediction for a new point
prediction = clf.predict([[4, 4]])
print(prediction)
Recommendation System: -
Recommendation systems are an important part of modern cloud applications such as e-commerce, social networks, content delivery networks, etc. A recommendation system provides recommendations to users (for items such as books, movies, songs, or restaurants) for unrated items, based either on the characteristics of the item or on the ratings given by the user and by other users to similar items. The former approach is called item-based or content-based recommendation, and the latter is called collaborative filtering.
Program: -
import surprise
# Load the data
data = surprise.Dataset.load_builtin('ml-100k')
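The snippet above only loads the MovieLens 100k ratings. As a sketch of how a collaborative filtering model could be trained and used with Surprise (the SVD algorithm and the example user/item ids are choices made here, not from the original program):
# Train an SVD matrix-factorization model on the full dataset
trainset = data.build_full_trainset()
algo = surprise.SVD()
algo.fit(trainset)

# Predict the rating user '196' would give item '302' (raw ids from ml-100k)
pred = algo.predict('196', '302')
print(pred.est)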
PART B: -
MULTIMEDIA CLOUD: -
INTRODUCTION: -
• Multimedia web applications (video, audio) are gaining popularity due to web advancements
and faster internet.
• Cloud computing offers a cost-effective solution for managing the resource demands of
multimedia applications.
• A new concept, the "multimedia cloud," caters specifically to mobile multimedia applications
by providing storage, processing, and streaming services.
• Users of cloud-based multimedia applications benefit from not needing to install and maintain
software locally and gain access to richer multimedia content.
• Multimedia clouds offer various service options (IaaS, PaaS, SaaS) to suit different
development and user needs.
Case Study: Live Video Streaming App
Live video streaming has become popular because it allows people to watch events live from
anywhere on their devices. This case study explains how cloud computing makes it possible.
Instead of needing expensive equipment, cloud-based streaming is like renting a bike instead of
buying one. The cloud can handle a large audience and reach viewers all over the world, just like
a satellite signal. The events are even recorded so you can watch them later!
Figures 10.3, 10.4, and 10.5 (not included here) show a demo app for creating live video streams in the cloud. These figures walk through the steps of setting up a stream, including specifying details, choosing a stream size, and launching the stream.
Here is a Django template for a live streaming app stream page:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Live Streaming App</title>
</head>
<body>
<h1>Live Streaming App</h1>
<div id="stream">
<video id="video" width="640" height="480" controls>
</video>
</div>
<script>
var video = document.getElementById('video');
// Open a WebSocket connection to the streaming server
// (this endpoint path is an example; it must match the app's routing)
var socket = new WebSocket('ws://' + window.location.host + '/stream/');
socket.onmessage = function(event) {
var data = JSON.parse(event.data);
// Set the video source
video.src = data.stream_url;
};
</script>
</body>
</html>
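For context, a minimal Django view that could serve this template might look as follows; the view name, template name, and URL path are hypothetical, not from the original app:
# views.py: render the live stream page (template name assumed to be stream.html)
from django.shortcuts import render

def stream_page(request):
    return render(request, 'stream.html')

# urls.py: route for the stream page (path is an example)
from django.urls import path
from . import views

urlpatterns = [path('stream/', views.stream_page)]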
Streaming Protocols
The live streaming application described in the previous section uses the RTMP streaming protocol. There are a number of streaming methods used by streaming servers such as Flash Media Server, including [41]:
• RTMP Dynamic Streaming (Unicast): High-quality, low-latency media streaming with support for live and on-demand content and full adaptive bitrate.
• RTMPE (encrypted RTMP): Real-time encryption of RTMP.
• RTMFP (multicast): Encrypted IP multicast with support for both ASM and SSM multicast on multicast-enabled networks.
• RTMFP (P2P): P2P live video delivery between Flash Player clients.
• RTMFP (multicast fusion): IP multicast and P2P working together to support higher QoS within enterprise networks.
• HTTP Dynamic Streaming (HDS): HDS lets you watch high-quality videos (live or on-demand) that adjust to your internet speed, all over regular internet (HTTP) connections.
• Protected HTTP Dynamic Streaming (PHDS): Real-time encryption of HDS.
• HTTP Live Streaming (HLS): Videos are streamed directly over the internet (HTTP) to iPhones, iPads, and other devices that can play HLS streams. You can also add a security lock (AES-128 encryption) to these streams if needed.
The streaming methods listed above are based on the RTMP and HTTP
streaming protocols.
RTMP Streaming: -
RTMP acts like a traffic manager for live streaming. It keeps a constant, reliable connection open
(like a dedicated highway lane) to transmit video, audio, and data all at once. This data is
chopped up into small packets (think of them as individual cars) to ensure smooth delivery. The
size of these packets can be adjusted depending on traffic conditions, and more important data
(like audio) can be prioritized to avoid delays. Ultimately, RTMP efficiently juggles all this
streaming traffic to deliver high-quality video and audio.
HTTP Live Streaming: -
HTTP Live Streaming (HLS) was proposed by Apple and is a part of Apple iOS [43]. HLS can dynamically adjust playback quality to match the available speed of wired or wireless networks.
• Streams video in bite-sized chunks, adjusting quality on the fly to match your internet speed, for a smooth viewing experience.
• Delivers video (including HD) in chunks over regular internet connections, similar to how you download a file.
• Like a smart delivery person, it adjusts the video quality (bitrate) based on your internet speed for smooth playback.
• Leverages existing internet infrastructure for efficient delivery of both live and on-demand content.
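To illustrate the chunked, multi-bitrate idea, here is a minimal example of what an HLS master playlist (.m3u8) might look like; the variant URLs, bandwidths, and resolutions are made up:
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/stream.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
high/stream.m3u8
The player measures its download speed and switches between the variant streams as network conditions change.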
PART C: -
Introduction: - Allocating resources for cloud applications is tricky because user demands keep changing. Unlike traditional setups, cloud resources need to adjust automatically (scale up or down) to handle these changes. This section looks at how to figure out how much cloud capacity (resources) an application needs to run smoothly. By testing the application with simulated use (workload), we can identify weak spots (bottlenecks) and allocate just the right amount of resources, saving money and keeping users happy.
• Figuring out how much cloud capacity (resources) an application needs is complex.
• Testing the application (benchmarking) helps find the best setup for smooth performance and saves money on resources.
• Benchmarking should be done regularly to handle changes in user demand.
• Ensuring an application's smooth launch requires testing its performance with various user demands (workloads).
The steps involved in benchmarking cloud applications are described below. Here's a breakdown of the different steps involved in creating workloads for testing cloud applications:
1. Trace Collection :
o Real user interactions with the application (requests and their timestamps) are recorded from the web, application, and database server logs.
2. Workload Modeling :
o Statistical analysis of the collected traces is used to build workload models that capture the key characteristics of real workloads.
3. Workload Specification :
o A special language (WSL) is used to define the important user actions that affect application
performance.
o This allows for creating workloads with slightly different user behaviors for testing purposes.
4. Synthetic Workload Generation :
o The goal is to create realistic user simulations (workloads) for testing the application.
o There are two approaches:
Empirical Approach: Replaying recordings of real user behavior (traces). This can be limiting as it
only reflects a specific scenario.
Analytical Approach: Using mathematical models to generate user actions with various
characteristics. This allows for more flexibility in testing different user behaviors.
5. Generating Workloads (User Emulation vs Aggregate Workloads ):
o User Emulation: Simulates individual users with "think time" between actions, mimicking real
user behavior. This is good for testing how the application handles individual user interactions
but doesn't control the exact timing of requests.
o Aggregate Workload Generation: Specifies the exact timing of requests arriving at the system.
This is useful for testing how the application handles bursts of traffic but doesn't consider
individual user behavior.
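As a sketch of the difference between the two approaches, the following Python fragment generates request timestamps both ways; the exponential distributions and parameter values are illustrative assumptions:
import random

# User emulation: each simulated user waits a random "think time"
# between consecutive requests (exponential think times assumed here)
def user_emulation(num_requests, mean_think_time):
    t, times = 0.0, []
    for _ in range(num_requests):
        times.append(t)
        t += random.expovariate(1.0 / mean_think_time)
    return times

# Aggregate workload generation: arrival times follow a Poisson process,
# i.e., exponential inter-arrival times at a specified rate (req/sec)
def aggregate_workload(num_requests, rate):
    t, times = 0.0, []
    for _ in range(num_requests):
        t += random.expovariate(rate)
        times.append(t)
    return times

print(user_emulation(5, mean_think_time=2.0))
print(aggregate_workload(5, rate=10.0))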
Workload Characteristics
Each class of multi-tier applications can have its own characteristic workload. Suppose you're testing a shopping website (e-commerce application). To understand how the website performs under different conditions, we need to consider how people use it:
Sessions: A single visit by a user is like a session. They might browse a few pages, make a
purchase, or just window shop.
Think Time: This is the time between when a user sees a response (like a product page loading)
and when they decide what to do next (click another link, add something to their cart, etc.). It's
like their "thinking time" in real life.
Session Length: This is simply how many things a user does in one visit, like the number of pages
they browse.
Workload Mix: This refers to the different things users typically do on the site. For example, on a
shopping site, users might browse more during sales (read-intensive) and buy more during
holidays (write-intensive, because purchases involve writing data to the database).
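As a toy illustration of a workload mix, the following sketch samples one session from made-up request-type probabilities and a made-up session-length range:
import random

# Made-up workload mix: probability of each request type on the site
workload_mix = {'browse': 0.6, 'search': 0.25, 'add_to_cart': 0.1, 'purchase': 0.05}

# Sample a session: a random session length, then one request type per step
session_length = random.randint(3, 10)
session = random.choices(list(workload_mix),
                         weights=list(workload_mix.values()),
                         k=session_length)
print(session)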
Tool: httperf
Application & Approach: A tool that generates various HTTP workloads for measuring web server performance.
Input/Output & Model: Input: request URLs, specifications of the request rates, number of connections, for instance.
* Throughput refers to the number of requests an application can handle per second. It indicates how well the
application handles traffic volume.
These metrics are crucial for setting performance goals (SLOs) that ensure a smooth user experience.
Benchmarking Tools: -
Web application performance evaluation has shifted from manually scripted user interactions to
automated workload models built from real user data analysis.
Capturing Workload Characteristics: Compared to the traditional method of recording user interactions for workload characterization (Figure 11.1), the automated approach (Figure 11.2) analyzes real traces from web, application, and database servers to build workload models with statistical analysis, eliminating the need for manual scripting.
Realistic Workloads: Unlike traditional methods that rely on manually scripted user
interactions, resulting in unrealistic workloads, the automated approach (Figure 11.2) leverages
real user traces for workload and benchmark models, generating synthetic workloads that
accurately mimic real-world user behavior.
Deployment Prototyping: -
Big cloud doesn't mean fast app! Testing different setups helps pick the cheapest and fastest
way to run your app, especially if user traffic goes up and down.
Figure 11.3 shows the steps involved in deployment prototyping along with the variables involved in each step. Given the performance requirements for an application, the deployment design is an iterative process that involves the following steps:
Deployment Design: Create the deployment with various tiers as specified in the deployment configuration and deploy the application.
Performance Evaluation: Verify whether the application meets the performance requirements with the deployment.
Deployment Refinement: If the requirements are not met, refine the deployment configuration (for example, change the number or size of instances in each tier) and evaluate again.
• Fig (a) shows the average throughput and response time. The observed throughput increases as the demanded request rate increases. As more requests are served per second by the application, the response time also increases. The observed throughput saturates beyond a demanded request rate of 50 req/sec.
• Fig (b) shows the CPU usage density of one of the application servers. This plot shows that the application server CPU is a non-saturated resource.
• Fig (c) shows the database server CPU usage density. From this density plot we observe that the database CPU spends a large percentage of time at high utilization levels for demanded request rates of more than 40 req/sec.
• Fig (d) shows the density plot of the database disk I/O bandwidth.
• Fig (e) shows the network out rate for one of the application servers.
• Fig (f) shows the density plot of the network out rate for the database server. From this plot we observe a continuous saturation of the network out rate around 200 KB/s.
Analysis:
Throughput continuously increases as the demanded request rate increases from 10 to 40 req/sec. Beyond a 40 req/sec demanded request rate, we observe that throughput saturates, which is due to the high CPU utilization density of the database server CPU. From the analysis of the density plots of the various system resources, we observe that the database CPU is the system bottleneck.