
INTRODUCTION TO CLOUD COMPUTING

ASSIGNMENT # 3

Course In charge: Prof. Dr. Narmeen Zakaria Bawany


INSTRUCTIONS:
− The assignment must be submitted on JUW LMS. Email submissions will not be accepted
− Each student should solve the assignment individually
− You are advised to go through the related topics before solving the assignment.
− Make your work clear and understandable
− Use references where necessary
− Viva will be conducted from this assignment

SECTION # A

1. What are the challenges of big data?

(Read Section 1.5 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A survey."
Mobile networks and applications 19.2 (2014): 171-209.)

ANSWER

CHALLENGES OF BIG DATA

The sharply increasing data deluge of the big data era brings huge challenges in data
acquisition, storage, management, and analysis. Traditional data management and
analysis systems are based on the relational database management system (RDBMS).
However, RDBMSs apply only to structured data, not to semi-structured or unstructured
data. In addition, RDBMSs rely on more and more expensive hardware as data grows.
Much of the incoming data is also highly redundant; for example, most data generated by
sensor networks are highly redundant and may be filtered and compressed by orders of
magnitude.

Key challenges of big data include:

− Lack of proper understanding of big data: companies fail in their big data initiatives
due to insufficient understanding.
− Data growth issues.
− Confusion during big data tool selection.
− Lack of data professionals.
− Securing data.
− Integrating data from a variety of sources.
2. Define Direct Attached Storage (DAS), Network Attached Storage (NAS), and
Storage Area Network (SAN).
(Read Section 4.1 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A survey."
Mobile networks and applications 19.2 (2014): 171-209.)

ANSWER

❖ DIRECT-ATTACHED STORAGE (DAS) is a digital storage system directly
attached to a server or workstation. A typical DAS system is made of a data storage
device (for example, an enclosure holding a number of hard disk drives) connected
directly to a computer through a host bus adapter, which historically used SCSI and
nowadays more often SATA, SAS, or Fibre Channel. The most important
differentiation between DAS and NAS is that between the computer and DAS there is
no network device (such as a hub, switch, or router).

❖ NETWORK-ATTACHED STORAGE (NAS) is computer data storage connected
to a network, providing data access to a heterogeneous group of clients. A NAS device
not only operates as a file server but is also specialized for this task by its hardware,
software, or the configuration of those elements.

NAS systems are networked appliances which contain one or more hard drives, often
arranged into logical, redundant storage containers (RAID). Network-attached
storage removes the responsibility of file serving from other servers on the network.
NAS devices typically provide access to files using network file sharing protocols and
have gained popularity as a convenient method of sharing files among multiple
computers. The benefits of network-attached storage, compared to file servers,
include faster data access, easier administration, and simple configuration.

❖ A STORAGE AREA NETWORK (SAN) is a separate network on the back side of
the server that is dedicated only to moving storage data between servers and the
external storage media. There is a perception that SANs are unnecessarily expensive,
but this does not have to be the case. A SAN can be created simply by placing an
additional NIC in the server and porting archive data out the back to the external
storage. Where multiple servers or multiple external storage units are required, a SAN
switch handles the task. SANs keep this storage traffic off the primary data network
over which live data are flowing; they place virtually no additional burden on the live
data network, conserving its spare capacity for system growth. SANs are always
recommended for external storage, even if there is only one server and one external
storage unit.

Reference: Thomas Norman, CPP, Effective Physical Security (Fifth Edition), 2017
(https://www.sciencedirect.com/topics/computer-science/storage-area-network)

3. Define big data. What are the 5 V’s of big data?


ANSWER

BIG DATA

Big data is data that contains greater variety, arrives in increasing volumes, and moves
with higher velocity. Big data sets are larger and more complex than traditional data sets,
especially those coming from new data sources.

5V’S OF BIG DATA

− VELOCITY

Velocity is the speed at which data is collected. It usually increases every year as
network hardware and technology become more powerful and businesses can capture
data faster.

Example: Google receives more than 60,000 searches per second on any given day.

− VOLUME

Volume refers to the amount of data being collected. Big data gets its name from this V
because of the huge amount of data involved.

Example: Netflix has almost 86 million members internationally, streaming over 125
million hours of shows per day. This results in a data warehouse that is over 60
petabytes in size.

− VALUE

Value is the worth of the data being collected. A company may end up storing large
amounts of data that have no value. For big data that is voluntarily collected, a business
should review exactly what data is being collected and how it can be valuable to the
business. Data that has no value can serve as a distraction and only hinder the data
analysis process.

VALUABLE DATA
• Customer Lifetime Value
• Average Order Value
• Cancellation Rate

DATA WITH NO VALUE
• Data with missing or corrupt values
• Data missing key structured elements such as customer reference or date

− VARIETY

Variety refers to the different types of data being captured; it could be structured or
unstructured. The data must be processed in order to analyze it.

Example: For a product review, this could mean performing a sentiment analysis to
determine whether the review is positive or negative. From there, a result such as
“percent of positive reviews” could be generated.

UNSTRUCTURED DATA
• Review sentiment
• Free-form comments

STRUCTURED DATA
• Email address
• Phone number

− VERACITY

Veracity is the quality or trustworthiness of the data. There is little point in collecting
big data if you are not confident that the resulting analysis can be trusted.

For example, if you are piping in all order data but also including fraudulent or
cancelled orders, you cannot trust the analysis of the e-commerce conversion rate
because it will be artificially inflated.

4. Give examples of extracting value from big data.


(Read Section 1.3 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A
survey." Mobile networks and applications 19.2 (2014): 171-209.)
ANSWER

McKinsey & Company observed how big data creates value after in-depth research on
U.S. healthcare, EU public sector administration, U.S. retail, global manufacturing, and
global personal location data. Through research on these five core industries that
represent the global economy, the McKinsey report pointed out that big data can fully
realize its economic potential, improve the productivity and competitiveness of
enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey
summarized the value that big data could create: if big data could be creatively and
effectively utilized to improve efficiency and quality, the potential value gained through
data in the U.S. medical industry may surpass USD 300 billion, reducing U.S. healthcare
expenditure by over 8%; retailers that fully utilize big data may improve their profit by
more than 60%; big data may also be utilized to improve the efficiency of government
operations, such that the developed economies in Europe could save over EUR 100
billion (which excludes the effect of reduced fraud, errors, and tax difference).

In general, text analytics solutions for big data use a combination of statistical and
Natural Language Processing (NLP) techniques to extract information from unstructured
data. NLP is a broad and complex field that has developed over the last 20 years. A
primary goal of NLP is to derive meaning from text.
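
As an illustration of extracting value from unstructured text, the sketch below shows a minimal keyword-based sentiment analysis over product reviews. The word lists and review data are invented for illustration; production systems would use a trained NLP model rather than a fixed lexicon.

```python
# Minimal sketch: lexicon-based sentiment scoring of product reviews.
# The word lists and sample reviews are illustrative only.

POSITIVE = {"great", "excellent", "love", "fast", "reliable"}
NEGATIVE = {"bad", "broken", "slow", "terrible", "refund"}

def sentiment(review: str) -> str:
    """Classify a review as positive, negative, or neutral by keyword counts."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Great product, fast delivery",
    "Terrible quality, asking for a refund",
    "It works as described",
]

labels = [sentiment(r) for r in reviews]
positive_share = labels.count("positive") / len(labels)
print(labels, f"{positive_share:.0%} positive reviews")
```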

5. IoT is an important source of big data. According to the characteristics of the Internet of
Things, what are the features of data generated from IoT?

Read Section 3.1.2 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A
survey." Mobile networks and applications 19.2 (2014): 171-209

ANSWER

According to the characteristics of the Internet of Things, the data generated from IoT has
the following features:

− LARGE-SCALE DATA:
In IoT, masses of data acquisition devices are deployed in a distributed manner; they
may acquire simple numeric data (e.g., location) or complex multimedia data (e.g.,
surveillance video). In order to meet the demands of analysis and processing, not only
the currently acquired data but also historical data within a certain time frame must be
stored. Therefore, data generated by IoT are characterized by large scale.
− HETEROGENEITY: Because of the variety of data acquisition devices, the acquired
data differs in type, so the data features heterogeneity.

− STRONG TIME AND SPACE CORRELATION: In IoT, every data acquisition device is
placed at a specific geographic location and every piece of data has a time stamp.
Time and space correlation is an important property of data from IoT. During data
analysis and processing, time and space are also important dimensions for statistical
analysis.
− EFFECTIVE DATA ACCOUNTS FOR ONLY A SMALL PORTION OF THE BIG DATA:
A great quantity of noise may occur during the acquisition and transmission of data
in IoT. Among the datasets acquired by acquisition devices, only a small amount of
abnormal data is valuable. For example, during the acquisition of traffic video, the
few video frames that capture violations of traffic regulations or traffic accidents
are more valuable than those capturing only the normal flow of traffic.

6. Explain data preprocessing

(Read Section 3.2.3 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A
survey." Mobile networks and applications 19.2 (2014): 171-209)

ANSWER

DATA PREPROCESSING is a data mining technique that involves transforming raw
data into an understandable format. Real-world data is often incomplete, inconsistent,
lacking in certain behaviors or trends, and likely to contain many errors. Data
preprocessing is a proven method of resolving such issues; it prepares raw data for
further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and in rule-based applications (like neural networks). To enable effective
data analysis, we often need to pre-process data to integrate it from different sources,
which not only reduces storage expense but also improves analysis accuracy. Some
relational data pre-processing techniques are discussed below.

– INTEGRATION: Data integration is the cornerstone of modern commercial informatics.
It involves the combination of data from different sources and provides users with a
uniform view of the data.

It has two widely recognized methods: data warehousing and data federation.

▪ Data warehousing includes a process named ETL (Extract, Transform and
Load); a minimal sketch of this process follows this list.
1) Extraction involves connecting to source systems and selecting, collecting,
analyzing, and processing the necessary data.
2) Transformation is the execution of a series of rules to transform the extracted
data into standard formats.
3) Loading means importing the extracted and transformed data into the target
storage infrastructure. Loading is the most complex procedure of the three
and includes operations such as transformation, copying, clearing,
standardization, screening, and data organization.

Generally, data integration methods are accompanied by flow processing
engines and search engines.
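
Below is a minimal sketch of the ETL steps described above, expressed in Python. The source records, field names, and the in-memory "warehouse" list are invented for illustration; a real pipeline would read from operational databases and load into a warehouse system.

```python
# Minimal ETL sketch: extract -> transform -> load.
# Source records, field names, and the in-memory "warehouse" are illustrative.

raw_orders = [
    {"id": "1", "amount": "19.99", "country": "pk "},
    {"id": "2", "amount": "5.50", "country": "US"},
]

def extract():
    """Extraction: collect the necessary records from the source system."""
    return raw_orders

def transform(records):
    """Transformation: apply rules to convert records into a standard format."""
    cleaned = []
    for r in records:
        cleaned.append({
            "id": int(r["id"]),
            "amount": float(r["amount"]),
            "country": r["country"].strip().upper(),
        })
    return cleaned

warehouse = []  # stand-in for the target storage infrastructure

def load(records):
    """Loading: import the standardized records into the target store."""
    warehouse.extend(records)

load(transform(extract()))
print(warehouse)
```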

− CLEANING: Data cleaning is a process to identify inaccurate, incomplete, or
unreasonable data and then modify or delete such data to improve data quality.
Generally, data cleaning includes five complementary procedures (a small sketch
follows this list):
▪ Defining and determining error types
▪ Searching and identifying errors
▪ Correcting errors
▪ Documenting error examples and error types
▪ Modifying data entry procedures to reduce future errors

During cleaning, data formats, completeness, rationality, and restrictions shall be
inspected.

Data cleaning is of vital importance for keeping data consistent and is widely applied
in many fields, such as banking, insurance, the retail industry, telecommunications,
and traffic control.
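
As a small illustration of the correction step, the sketch below drops records with missing or corrupt values and normalizes formats. The field names and sample rows are invented for this example.

```python
# Minimal data cleaning sketch: drop corrupt rows and normalize formats.
# Field names and sample rows are illustrative only.

rows = [
    {"customer": "C-101", "date": "2023-05-01", "amount": "42.0"},
    {"customer": "", "date": "2023-05-02", "amount": "17.5"},      # missing key element
    {"customer": "C-103", "date": "2023-05-03", "amount": "n/a"},  # corrupt value
]

def is_valid(row):
    """Keep only rows with a customer reference and a numeric amount."""
    if not row["customer"]:
        return False
    try:
        float(row["amount"])
    except ValueError:
        return False
    return True

cleaned = [
    {**row, "amount": float(row["amount"])}
    for row in rows
    if is_valid(row)
]
print(cleaned)  # only the first row survives
```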
− REDUNDANCY ELIMINATION: Data redundancy refers to data repetition or surplus,
which occurs in many datasets. Data redundancy increases unnecessary data
transmission expense and causes defects in storage systems, e.g., waste of storage
space, data inconsistency, reduced data reliability, and data damage. Therefore,
various redundancy reduction methods have been proposed, such as redundancy
detection, data filtering, and data compression. Such methods may apply to different
datasets or application environments.
However, redundancy reduction may also bring certain negative effects. For example,
data compression and decompression cause additional computational burden.
Therefore, the benefits of redundancy reduction and its cost should be carefully
balanced.

Data collected from different fields will increasingly appear in image or video
formats. It is well known that images and videos contain considerable redundancy,
including:

▪ temporal redundancy
▪ spatial redundancy
▪ statistical redundancy
▪ sensing redundancy

Video compression is widely used to reduce redundancy in video data, as
specified in many video coding standards (MPEG-2, MPEG-4, H.263, and
H.264/AVC). In [74], the authors investigated the problem of video compression in a
video surveillance system with a video sensor network. They propose a new
MPEG-4-based method that exploits the contextual redundancy related to background
and foreground in a scene; the low complexity and the low compression ratio of the
proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion (deduplication)
is a special data compression technique that aims to eliminate repeated data copies. With
repeated data deletion, individual data blocks or data segments are assigned identifiers
(e.g., using a hash algorithm), stored, and added to an identification list. If a new data
block has an identifier that is identical to one already in the identification list, the new
data block is deemed redundant and is replaced by a reference to the corresponding
stored data block.
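
The sketch below illustrates the identifier-based deduplication idea using a hash of each block; the block size and sample payload are arbitrary choices for this example.

```python
# Minimal deduplication sketch: hash each block and store only unseen blocks.
# The block size and sample payload are arbitrary for illustration.
import hashlib

BLOCK_SIZE = 8  # bytes; real systems use much larger blocks

def dedupe(data: bytes):
    store = {}     # identifier -> stored block
    layout = []    # sequence of identifiers describing the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:          # new block: keep one physical copy
            store[ident] = block
        layout.append(ident)            # duplicates become references only
    return store, layout

store, layout = dedupe(b"ABCDEFGH" * 3 + b"12345678")
print(len(layout), "blocks referenced,", len(store), "blocks actually stored")
```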

7. Discuss the case study of Google Flu Trends.


ANSWER

Earlier prediction of flu outbreaks could limit the number of people who get sick or die
from the flu each year. More accurate and earlier detection of flu outbreaks can ensure
that resources for combating outbreaks are allocated and deployed earlier (e.g., clinics
could be deployed to affected neighborhoods).

Why did Google Flu Trends eventually fail? What assumptions did they make about their
data or their model that ultimately proved not to be true?

− Google Flu Trends worked well in some instances but often overestimated,
underestimated, or entirely missed flu outbreaks. A notable example occurred when
Google Flu Trends largely missed the outbreak of the H1N1 flu virus.

− Just because someone is reading about the flu doesn’t mean they actually have it.
Some search terms like “high school basketball” might be good predictors of the flu
one year but clearly shouldn’t be used to measure whether someone has the flu.

− In general, many terms may have been good predictors of the flu for a while only
because, like high school basketball, they are more searched in the winter when more
people get the flu.

− Google began recommending searches to users, which skewed what terms people
searched for. As a result, the tool was also measuring Google-generated suggested
searches, which skewed results.

8. Discuss the case study of "The Re-identification of Governor William Weld's Medical
Information".
ANSWER
The 1997 re-identification of Massachusetts Governor William Weld's medical data
within an insurance data set that had been stripped of direct identifiers has had a
profound impact on the development of de-identification provisions within the 2003
Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Weld's
re-identification, purportedly achieved through the use of a voter registration list from
Cambridge, MA, is frequently cited as an example that computer scientists can re-identify
individuals within de-identified data with "astonishing ease". However, a careful
re-examination of the population demographics in Cambridge indicates that Weld was
most likely re-identifiable only because he was a public figure who experienced a highly
publicized hospitalization, rather than there being any certainty underlying his
re-identification using the Cambridge voter data, which had missing data for a large
proportion of the population.

The complete story of Weld's re-identification exposes an important systemic barrier to
accurate re-identification known as "the myth of the perfect population register". Because
the logic underlying re-identification depends critically on being able to demonstrate that
a person within a health data set is the only person in the larger population with a set of
combined characteristics (known as "quasi-identifiers") that could potentially re-identify
them, most re-identification attempts face a strong challenge in creating a complete and
accurate population register. This strong limitation not only underlies the entire set of
famous Cambridge re-identification results but also affects much of the existing
re-identification research cited by those making claims of easy re-identification.

The paper critically examines the historic Weld re-identification and the dramatic
(thousands-fold) reductions of re-identification risks for de-identified health data as they
have been protected by the HIPAA Privacy Rule provisions for de-identification since
2003. It also provides recommendations for enhancements to existing HIPAA
de-identification policy, discusses critical advances routinely made in medical science
and in improving our healthcare system using de-identified data, and comments on the
vital importance of properly balancing the competing goals of protecting patient privacy
and preserving the accuracy of scientific research and statistical analyses conducted with
de-identified data.
9. Explain how data anonymization and data encryption can be used to improve
privacy/security. Explain the issues of each in detail.
(https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/)
ANSWER

Data anonymization is a process of protecting private or sensitive information by deleting
or encrypting identifiers that connect an individual to stored data.

However, even when you clear data of identifiers, attackers can use de-anonymization
methods to retrace the data anonymization process. Since data usually passes through
multiple sources, some of which are available to the public, de-anonymization techniques
can cross-reference the sources and reveal personal information.

The General Data Protection Regulation (GDPR) outlines a specific set of rules that
protect user data and create transparency. While the GDPR is strict, it permits companies
to collect anonymized data without consent, use it for any purpose, and store it for an
indefinite time—as long as companies remove all identifiers from the data.

DATA ANONYMIZATION TECHNIQUES (a small sketch of several of these follows the list)

− Data masking—hiding data with altered values. You can create a mirror version of a
database and apply modification techniques such as character shuffling, encryption,
and word or character substitution.
− Pseudonymization—a data management and de-identification method that replaces
private identifiers with fake identifiers or pseudonyms.
− Generalization—deliberately removes some of the data to make it less identifiable.
Data can be modified into a set of ranges or a broad area with appropriate boundaries.
You can remove the house number in an address, but make sure you don't remove the
road name. The purpose is to eliminate some of the identifiers while retaining a
measure of data accuracy.
− Data swapping—also known as shuffling and permutation, a technique used to
rearrange the dataset attribute values so they don't correspond with the original
records. Swapping attributes (columns) that contain identifier values, such as date of
birth, may have more impact on anonymization than swapping membership-type
values.
− Data perturbation—modifies the original dataset slightly by applying techniques
that round numbers and add random noise. The range of values needs to be in
proportion to the perturbation: a small base may lead to weak anonymization, while a
large base can reduce the utility of the dataset.
− Synthetic data—algorithmically manufactured information that has no connection to
real events. Synthetic data is used to create artificial datasets instead of altering the
original dataset or using it as is and risking privacy and security. The process involves
creating statistical models based on patterns found in the original dataset.
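
The sketch below illustrates three of the techniques above (pseudonymization, generalization, and perturbation) on a made-up record; the field names, salt, and noise scale are illustrative assumptions only.

```python
# Minimal anonymization sketch: pseudonymization, generalization, perturbation.
# Field names, the salt value, and the noise scale are illustrative assumptions.
import hashlib
import random

SALT = "example-salt"  # a real deployment would keep this secret

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted hash pseudonym."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def generalize_age(age: int) -> str:
    """Generalize an exact age into a 10-year range."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def perturb(amount: float, scale: float = 5.0) -> float:
    """Add small random noise to a numeric value."""
    return round(amount + random.uniform(-scale, scale), 2)

record = {"name": "Jane Doe", "age": 34, "zip": "75300", "bill": 120.0}

anonymized = {
    "name": pseudonymize(record["name"]),
    "age": generalize_age(record["age"]),
    "zip": record["zip"][:3] + "XX",   # generalize location
    "bill": perturb(record["bill"]),
}
print(anonymized)
```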

DISADVANTAGES OF DATA ANONYMIZATION

The GDPR stipulates that websites must obtain consent from users to collect personal
information such as IP addresses, device IDs, and cookies. Collecting anonymous data
and deleting identifiers from the database limits your ability to derive value and insight
from your data. For example, anonymized data cannot be used for marketing efforts or to
personalize the user experience.

SECTION # B

Refer to following research papers to answer the questions given below:

[1] Bonomi, Flavio, et al. "Fog computing and its role in the internet of
things." Proceedings of the first edition of the MCC workshop on Mobile cloud
computing. ACM, 2012.

[2] Yi, Shanhe, et al. "Fog computing: Platform and applications." 2015 Third IEEE
Workshop on Hot Topics in Web Systems and Technologies (HotWeb). IEEE, 2015.

[3] Bonomi, Flavio, et al. "Fog computing: A platform for internet of things and
analytics." Big data and internet of things: A roadmap for smart environments. Springer,
Cham, 2014. 169-186.

1. Define Fog Computing


ANSWER
The term fog computing, originated by Cisco, refers to an alternative to cloud
computing. Decentralization and flexibility are the main differences between fog
computing and cloud computing. Fog computing, also called fog networking or fogging,
describes a decentralized computing structure located between the cloud and the devices
that produce data. This flexible structure enables users to place resources, including
applications and the data they produce, in logical locations to enhance performance. Fog
computing also provides security benefits for users: the fog computing paradigm can
segment bandwidth traffic, enabling users to boost security with additional firewalls in
the network.
2. What are the characteristics of fog computing?

ANSWER
Fog computing provides low latency and location awareness, has widespread
geographical distribution, supports mobility, and comprises a very large number of
nodes. The main task of fog is to deliver data and place it closer to the user, who is
positioned at the edge of the network.
3. Define latency-sensitive applications.
ANSWER
A latency-sensitive application is an application that is expected to respond quickly to
specific events. Latency is defined as the time between the occurrence of an event and its
handling. There was a time when it took minutes or even hours to send a simple email,
but that time is long gone; technology has evolved in tune with the requirements of the
times. From phone calls to the most critical medical and military technology, the 'time
factor' plays a pivotal role. Medical imaging, for example, is a time-critical application:
it requires real-time, low-latency image processing with high throughput. This is where
latency sensitivity comes into play.
As medical imaging continues to evolve rapidly, meeting the latency targets for real-time
services remains a constant challenge. Latency is the total delay between the sensor and
the receiver, and the level of latency decides the quality and timely delivery of the
images.
4. What are the requirements that led to the development of the fog computing platform?
ANSWER

Existing work has focused on the design and implementation of fog computing nodes
such as Cloudlet and ParaDrop. Cloudlet is considered an exemplar implementation of
resource-rich fog nodes. It has a three-layer design, in which the bottom layer is Linux
with a data cache from the cloud, the middle layer is virtualization with a set of cloud
software such as OpenStack, and the top layer is applications isolated in different virtual
machine (VM) instances. Another example is Cisco's IOx, built on a grid router; it
works by hosting applications in a guest OS running on a hypervisor on the hardware of
the grid router. The platform allows developers to run scripts, compile code, and install
their own operating system, but it is not open to the public and relies on expensive
hardware. ParaDrop is implemented on a gateway (Wi-Fi access point or home set-top
box), which is an ideal fog node choice due to its proximity to the end user. However, it
is designed for home usage scenarios and is not fully decentralized: all application
servers are required to use a ParaDrop server as the entry point to services provided by
gateways. We consider ParaDrop a complementary implementation of a fog computing
platform for these scenarios.

5. Discuss how Fog computing supports cloud computing.


ANSWER
Fog computing came into existence to serve as an extension of cloud computing services,
not as a replacement for them. Adopting fog computing has the following advantages:
− Off the grid: Leveraging the benefits of fog computing, you can enable your IoT
solution to control, manage, and administer your local edge device network
without external dependencies on cloud-based services, providing freedom
from subscription-based cloud services.

− Global distributed network: Fog computing-empowered edge nodes or
gateways provide distributed networks with the power of local decision-making
and temporary data storage for analysis. This kind of distribution ensures that
even if cloud services are not available, your IoT solution would be able to
function locally with some limited restrictions.
− Better bandwidth utilization: Fog computing empowers edge nodes to process
the raw data obtained from end devices locally and periodically push the
processed data to a central mainframe, ensuring the most optimal usage of
network bandwidth (a small sketch of this pattern follows this list).
− Real-time operation and low latency: Fog computing bifurcates data based on
time criticality and ensures that the most time-critical data are processed locally
without the intervention of a central mainframe — thus enabling real-time
operations and very low latency.
− Optimal usage of edge node resources: Fog computing-enabled edge nodes are
designed to maximally leverage edge node resources, overcoming the
limitations of cloud computing and making optimal use of network
bandwidth.
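
Below is a minimal sketch of the local-processing-then-periodic-upload pattern described in the bandwidth bullet above. The sensor readings, aggregation window, and upload function are invented for illustration; a real fog node would read from actual devices and push to a cloud API.

```python
# Minimal fog/edge aggregation sketch: process raw readings locally,
# push only compact summaries upstream periodically.
# Readings, window size, and the upload stub are illustrative assumptions.

def summarize(window):
    """Reduce a window of raw readings to a compact summary."""
    return {
        "count": len(window),
        "min": min(window),
        "max": max(window),
        "avg": sum(window) / len(window),
    }

def upload_to_cloud(summary):
    """Stand-in for pushing the summary to a central mainframe/cloud."""
    print("uploading summary:", summary)

WINDOW_SIZE = 5
buffer = []

raw_readings = [21.0, 21.2, 20.9, 35.7, 21.1, 21.3, 21.0, 21.2, 20.8, 21.1]

for reading in raw_readings:
    if reading > 30.0:                      # time-critical event: handle locally, at once
        print("local alert, abnormal reading:", reading)
    buffer.append(reading)
    if len(buffer) == WINDOW_SIZE:          # periodic push instead of raw streaming
        upload_to_cloud(summarize(buffer))
        buffer.clear()
```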
6. The fog computing platform has a broad range of applications. Discuss any 3 such
applications. Clearly explain why fog computing is needed in these scenarios.
ANSWER

FOG COMPUTING APPLICATIONS:
− Linked Vehicles:
Self-driving vehicles are available on the market and produce a significant volume
of data. The information has to be interpreted and processed quickly based on the
context, such as traffic, driving conditions, and the environment. All this
information is processed rapidly with the aid of fog computing.
− Smart Grid and Smart Cities:
Energy networks use real-time data for efficient management of the system. It is
necessary to process remote data close to the location where it is produced; fog
computing is constructed in such a manner that these problems can be solved.
− Real-Time Analytics:
Using fog computing deployments, data can be transferred from the location
where it is produced to a different location. Fog computing is used for real-time
analytics that passes data from production networks to financial institutions that
use real-time data.
7. What are the limitations of cloud computing that were addressed by fog computing?

ANSWER

With the rapid increase of IoT-powered applications in the last couple of years and a lack
of proper network bandwidth allocation and utilization, cloud computing faces the
following limitations that hinder its wide adoption:

− Subscription-based cloud support.
− Unused silicon power on edge devices (routers, gateways).
− Huge amounts of raw data pushed to the cloud, resulting in high latency.
− Always dependent on Internet connections.
− Over-utilization of network bandwidth.
8. Discuss why the Smart Traffic Light System (STLS) is a good use case for fog computing.
Read paper [3].
ANSWER

SMART TRAFFIC LIGHT SYSTEM

− The Smart Traffic Light System (STLS) calls for the deployment of a smart traffic
light (STL) at each intersection.
− The STL is equipped with sensors that:
▪ Measure the distance and speed of approaching vehicles from every direction
▪ Detect the presence of pedestrians and other vehicles crossing the street
− The STL issues "slow down" warnings to vehicles at risk of crossing on red and even
modifies its own cycle to prevent collisions.

STLS GOALS

1) Accident prevention.
2) Maintenance of a steady flow of traffic (green waves along the main roads).

3) Collection of relevant data to evaluate and improve the system.

Note: Goal (1) requires real-time reaction, goal (2) near-real-time reaction, and goal (3)
relates to the collection and analysis of global data over long periods.

KEY REQUIREMENTS DRIVEN BY STLS

1. Local subsystem latency
The reaction time needed is on the order of < 10 milliseconds.
2. Middleware orchestration platform
▪ Middleware to handle a number of critical software components
▪ Decision maker
▪ Message bus
3. Networking infrastructure
Edge/Fog nodes belong to a family of modular compute and storage devices.
4. Interplay with the cloud
Data must be injected into a data center/cloud for deep analysis to identify patterns
in traffic, city pollutants, etc.
5. Consistency of a highly distributed system
The system needs to be consistent across the different aggregator points.
6. Multi-tenancy
The platform must provide strict service guarantees at all times.
7. Multiplicity of providers
The system may extend beyond the borders of a single controlling authority;
orchestration of consistent policies involving multiple agencies is a challenge unique
to Edge/Fog computing.
9. Can geo-distribution be termed the fourth dimension of big data? Why? Read paper [3].
ANSWER

GEO-DISTRIBUTION: A NEW DIMENSION OF BIG DATA

▪ Big data traditionally has 3 dimensions: volume, velocity, and variety.
▪ Many use cases (STLS, connected rail, pipeline monitoring) are naturally distributed.
▪ This suggests adding a 4th dimension: geo-distribution.
▪ The challenge is to manage a large number of sensors (and actuators), which are
naturally distributed, as a coherent whole.
▪ This calls for "moving the processing to the data".
▪ It requires a distributed intelligent platform at the edge (fog computing) that manages
distributed compute, networking, and storage resources.

SECTION # C

1. Discuss Google File System.


ANSWER
Google introduced the distributed and fault-tolerant Google File System (GFS) [24]. GFS
was designed to meet many of the same goals as preexisting distributed file systems,
including scalability, performance, reliability, and robustness. However, Google also
designed GFS to meet some specific goals driven by key observations of its workload.
Firstly, Google experienced regular failures of its cluster machines; therefore, a
distributed file system must be extremely fault tolerant and have some form of automatic
fault recovery. Secondly, multi-gigabyte files are common, so I/O and file block size
must be designed appropriately. Thirdly, the majority of files are appended to rather than
having existing content overwritten or changed, which means optimizations should focus
on appending. Lastly, the computation engine should be designed and co-located with
the distributed file system for best performance.
2. What is Hadoop?
ANSWER
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.

3. Explain HDFS Architecture?
ANSWER

The Hadoop Distributed File System (HDFS) was developed using a distributed file
system design. Unlike some other distributed systems, HDFS is highly fault tolerant and
designed to run on low-cost hardware. HDFS holds very large amounts of data and
provides easy access; to store such huge data, the files are stored across multiple
machines.

4. Explain Hadoop Map Reduce.


ANSWER
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for the distributed
processing of large data sets on compute clusters of commodity hardware. It is a
sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks,
monitoring them, and re-executing any failed tasks. A minimal word-count example
follows.
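
The sketch below shows the classic word-count job as a mapper and reducer written for Hadoop Streaming (which lets MapReduce run scripts that read from stdin and write to stdout). The file name and the way the two roles are combined into one script are illustrative choices, not part of Hadoop itself.

```python
#!/usr/bin/env python3
# Word-count sketch for Hadoop Streaming: the mapper emits "word\t1" pairs and
# the reducer sums the counts for each word from the sorted mapper output.
# Combining both roles in one script (selected by a command-line argument)
# is an illustrative choice, not a Hadoop requirement.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such a script could be launched roughly as: hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" -reducer "wordcount.py reduce" -input <in> -output <out> (the script name and paths here are placeholders).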
5. What is Rack Awareness in Hadoop HDFS? What is the advantage of Rack Awareness?
ANSWER
Rack awareness in Hadoop is the concept of choosing closer DataNodes based on
rack information, in order to reduce network traffic while reading/writing HDFS files in
large Hadoop clusters. The NameNode chooses DataNodes that are on the same rack or a
nearby rack to serve read/write requests from the client node.
ADVANTAGES OF IMPLEMENTING RACK AWARENESS IN HADOOP
Rack awareness in Hadoop helps optimize replica placement, thus ensuring high
reliability and fault tolerance. Rack awareness ensures that read/write requests go to
replicas on the closest rack or the same rack. A sketch of a rack-mapping script is given
below.
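
Hadoop typically learns rack locations from an administrator-supplied topology script that maps host names or IP addresses to rack IDs. The sketch below is a minimal, hypothetical version of such a script; the host-to-rack table is invented, and the exact configuration property used to register the script should be checked against the Hadoop version in use.

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop rack-topology script: it receives host names or
# IP addresses as arguments and prints one rack ID per argument.
# The host-to-rack mapping below is invented for illustration.
import sys

RACK_MAP = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

DEFAULT_RACK = "/default-rack"

for host in sys.argv[1:]:
    print(RACK_MAP.get(host, DEFAULT_RACK))
```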
6. Explain fault tolerance in Hadoop. How does HDFS handle worker node and master node
failures?
ANSWER
Fault tolerance in Hadoop HDFS refers to the working strength of the system under
unfavorable conditions and how it handles such situations. HDFS maintains the
replication factor by creating replicas of the data on other available machines in the
cluster if one machine suddenly fails.

In an HDFS cluster, there is one master node and many worker nodes. The master node
is called the NameNode (NN) and the workers are called DataNodes (DN). DataNodes
actually store the data; they are the workhorses.
The NameNode is in charge of file system operations (such as creating files and
managing user permissions). Without it, the cluster is inoperable: no one can write or
read data. It is therefore a single point of failure. The file system, however, continues to
function even if a worker node fails; Hadoop accomplishes this by duplicating data across
nodes, as sketched below.
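
The sketch below illustrates, in simplified form, how a NameNode-like component could restore the replication factor after a DataNode failure by copying under-replicated blocks to surviving nodes. The names and data structures here are invented for the illustration; the real HDFS implementation is far more involved.

```python
# Simplified sketch of HDFS-style re-replication after a DataNode failure.
# Names and data structures are invented; real HDFS is far more involved.

REPLICATION_FACTOR = 3

# block_id -> set of DataNodes currently holding a replica
block_locations = {
    "blk_001": {"dn1", "dn2", "dn3"},
    "blk_002": {"dn2", "dn3", "dn4"},
}
live_datanodes = {"dn1", "dn2", "dn3", "dn4"}

def handle_datanode_failure(failed_node: str) -> None:
    """Drop the failed node and re-replicate any under-replicated blocks."""
    live_datanodes.discard(failed_node)
    for block, holders in block_locations.items():
        holders.discard(failed_node)
        while len(holders) < REPLICATION_FACTOR:
            candidates = live_datanodes - holders
            if not candidates:
                break                      # not enough nodes to restore the factor
            target = sorted(candidates)[0] # real HDFS also considers rack placement
            holders.add(target)            # copy the block to the chosen node
            print(f"re-replicating {block} to {target}")

handle_datanode_failure("dn3")
print(block_locations)
```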
