Cloud Computing Assign # 3
ASSIGNMENT # 3
SECTION # A
(Read Section 1.5 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A survey."
Mobile networks and applications 19.2 (2014): 171-209.)
ANSWER
The sharply increasing data deluge of the big data era brings huge challenges in data
acquisition, storage, management, and analysis. Traditional data management and
analysis systems are based on the relational database management system (RDBMS).
However, RDBMSs apply only to structured data, not to semi-structured or
unstructured data. In addition, RDBMSs rely on more and more expensive hardware as
data grows. Much of the incoming data is also highly redundant; for example, most data
generated by sensor networks is highly redundant and can be filtered and compressed by
orders of magnitude. Key challenges in adopting big data include:
− Lack of proper understanding of Big Data. Companies fail in their Big Data initiatives
due to insufficient understanding.
− Data growth issues.
− Confusion over Big Data tool selection.
− Lack of data professionals.
− Securing data.
− Integrating data from a variety of sources.
2. Define Direct Attached Storage (DAS) and Network Attached Storage (NAS) and
Storage Area Network (SAN).
(Read Section 4.1 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A survey."
Mobile networks and applications 19.2 (2014): 171-209.)
ANSWER
❖ DIRECT ATTACHED STORAGE (DAS) is storage, such as internal disks or an external
drive enclosure, attached directly to a single server or workstation with no storage
network in between, so it is accessible only through that host.
❖ NETWORK ATTACHED STORAGE (NAS) systems are networked appliances which contain
one or more hard drives, often arranged into logical, redundant storage containers
(RAID). Network-attached storage removes the responsibility of file serving from other
servers on the network and typically provides access to files using network file-sharing
protocols. NAS devices have gained popularity, including in CCTV systems, as a
convenient method of sharing files among multiple computers. The benefits of
network-attached storage, compared to file servers, include faster data access, easier
administration, and simple configuration.
❖ A STORAGE AREA NETWORK (SAN) is a separate network on the backside of
the server that is dedicated only to moving storage data between servers and the
external storage media. There is a perception that SANs are unnecessarily expensive,
but this does not have to be the case. A SAN can be created simply by placing an
additional NIC in the server and porting archive data out the back to the external
storage. Where multiple servers or multiple external storage units are required, a SAN
switch handles the task. Because storage traffic stays off the primary data network over
which live data are flowing, SANs place virtually no additional burden on the live data
network, conserving its spare capacity for system growth. SANs are always
recommended for external storage, even if there is only one server and one external
storage unit.
BIG DATA
Big Data is data that arrives in greater variety, in increasing volumes, and with higher
velocity; big data sets are larger and more complex than traditional data sets, especially
those coming from new data sources.
− VELOCITY
The speed at which data is collected is called velocity. It usually increases every year
as network hardware and technology become more powerful and businesses can
capture data faster.
Example: Google receives over 60,000 searches per second on any given day.
− VOLUME
Volume refers to the amount of data being collected. Big data gets its name from this
"V" because of the huge amount of data being collected.
Example: Netflix has almost 86 million members internationally, streaming over 125
million hours of shows per day. This results in a data warehouse that is over 60
petabytes in size.
− VALUE
Value is the worth of the data being collected. A company may end up paying to store
large amounts of data that has no value. For the big data a business collects, it should
review exactly what data is being gathered and how it can be valuable to the
business. Data that has no value can often serve as a distraction and only hinder the
data analysis process.
− VARIETY
The different types of data being captured are called variety; the data could be structured
or unstructured. The data must be processed in order to analyze it.
[Figure: examples of unstructured data vs. structured data]
− VERACITY
Veracity is the quality or trustworthiness of the data. There is little point in collecting
Big Data if you are not confident that the resulting analysis can be trusted.
For example, if you are piping in all order data but also including fraudulent or
cancelled orders, you can't trust the analysis of the e-commerce conversion rate
because it will be artificially inflated.
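A minimal sketch of this veracity problem (the order records, status values, and visit count below are made-up assumptions, not data from the cited sources): computing the conversion rate with and without the untrustworthy orders shows how they inflate the metric.

# Minimal sketch: how low-veracity records (fraud/cancelled orders) inflate a metric.
# The order records and field names here are illustrative assumptions.
orders = [
    {"id": 1, "status": "completed"},
    {"id": 2, "status": "completed"},
    {"id": 3, "status": "cancelled"},
    {"id": 4, "status": "fraudulent"},
]
visits = 40  # assumed number of site visits in the same period

naive_rate = len(orders) / visits
trusted = [o for o in orders if o["status"] == "completed"]
clean_rate = len(trusted) / visits

print(f"naive conversion rate:   {naive_rate:.1%}")   # 10.0% -- inflated
print(f"cleaned conversion rate: {clean_rate:.1%}")   # 5.0%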
McKinsey & Company observed how big data creates value after in-depth research on
U.S. healthcare, the EU public sector administration, U.S. retail, global manufacturing,
and global personal location data. Through research on these five core industries that
represent the global economy, the McKinsey report pointed out that big data can fully
realize its economic potential, improve the productivity and competitiveness of
enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey
summarized the values that big data could create: if big data could be creatively and
effectively utilized to improve efficiency and quality, the potential value gained through
data in the U.S. medical industry may surpass USD 300 billion, reducing U.S. healthcare
expenditure by over 8 %; retailers that fully utilize big data may improve their profit by
more than 60 %; and big data may also be utilized to improve the efficiency of
government operations, such that the developed economies of Europe could save over
EUR 100 billion (which excludes the effect of reduced fraud, errors, and tax differences).
In general, text analytics solutions for big data use a combination of statistical and
Natural Language Processing (NLP) techniques to extract information from unstructured
data. NLP is a broad and complex field that has developed over the last 20 years. A
primary goal of NLP is to derive meaning from text.
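As a rough illustration of the statistical side of text analytics (not the specific techniques the survey describes), the sketch below counts term frequencies in a piece of unstructured text using only the Python standard library; the sample text and the tiny stopword list are assumptions.

# Minimal sketch of the statistical side of text analytics: counting term
# frequencies in unstructured text. Real NLP pipelines (tokenization, parsing,
# entity extraction) are far richer; this only illustrates the basic idea.
import re
from collections import Counter

text = "Big data analytics extracts value from data. Data volume keeps growing."
tokens = re.findall(r"[a-z]+", text.lower())   # naive tokenization
stopwords = {"from", "the", "keeps", "of"}     # tiny illustrative stopword list
counts = Counter(t for t in tokens if t not in stopwords)

print(counts.most_common(3))   # e.g. [('data', 3), ('big', 1), ('analytics', 1)]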
(Read Section 3.1.2 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A
survey." Mobile networks and applications 19.2 (2014): 171-209)
ANSWER
According to the characteristics of the Internet of Things (IoT), the data generated from
IoT has the following features:
− LARGE-SCALE DATA:
In IoT, masses of data acquisition devices are deployed in a distributed manner, and they
may acquire simple numeric data, e.g., location, or complex multimedia data, e.g.,
surveillance video. In order to meet the demands of analysis and processing, not only
the currently acquired data but also the historical data within a certain time frame
must be stored. Therefore, data generated by IoT is characterized by large scale.
− HETEROGENEITY: Because of the variety of data acquisition devices, the acquired data
also differs in type and format, so such data features heterogeneity.
− STRONG TIME AND SPACE CORRELATION: In IoT, every data acquisition device is
placed at a specific geographic location and every piece of data has a time stamp. Time
and space correlation is an important property of data from IoT. During data
analysis and processing, time and space are also important dimensions for statistical
analysis.
− EFFECTIVE DATA ACCOUNTS FOR ONLY A SMALL PORTION OF THE BIG DATA:
A great quantity of noise may occur during the acquisition and transmission of data
in IoT. Among the datasets acquired by acquisition devices, only a small amount of
abnormal data is valuable. For example, during the acquisition of traffic video, the
few video frames that capture violations of traffic regulations and traffic accidents
are more valuable than those only capturing the normal flow of traffic.
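A minimal sketch of this last point, with assumed speed-camera readings and an assumed threshold: only the few abnormal readings are treated as effective data worth keeping or forwarding.

# Sketch: only a small portion of IoT data is "effective".
# Readings and the threshold are illustrative assumptions; a real deployment
# would use domain-specific rules (e.g. detecting violations in traffic video).
readings = [
    {"sensor": "speed-cam-1", "ts": 1, "value": 48},
    {"sensor": "speed-cam-1", "ts": 2, "value": 51},
    {"sensor": "speed-cam-1", "ts": 3, "value": 95},   # abnormal: likely a violation
    {"sensor": "speed-cam-1", "ts": 4, "value": 50},
]

SPEED_LIMIT = 60
effective = [r for r in readings if r["value"] > SPEED_LIMIT]

print(f"{len(effective)} of {len(readings)} readings are worth keeping:", effective)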
(Read Section 3.2.3 Chen, Min, Shiwen Mao, and Yunhao Liu. "Big data: A
survey." Mobile networks and applications 19.2 (2014): 171-209)
ANSWER
− INTEGRATION: Data integration is the cornerstone of modern commercial informatics;
it involves the combination of data from different sources and provides users with a
uniform view of the data. Two methods are widely recognized: the data warehouse and
data federation. A small sketch of building such a uniform view follows below.
− CLEANING: Data cleaning is of vital importance for keeping data consistent, and it is
widely applied in many fields, such as banking, insurance, retail,
telecommunications, and traffic control.
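A minimal sketch of data integration under assumed sources and field names (a CRM export and a billing system): records from both sources are combined into one uniform view keyed by customer id.

# Sketch of data integration: combining records from two assumed sources
# into one uniform view keyed by customer id.
crm = {
    "c1": {"name": "Alice", "city": "Lahore"},
    "c2": {"name": "Bob", "city": "Karachi"},
}
billing = {
    "c1": {"balance": 120.0},
    "c2": {"balance": 0.0},
}

# Uniform view: one merged record per customer, covering ids from either source.
unified = {
    cid: {**crm.get(cid, {}), **billing.get(cid, {})}
    for cid in crm.keys() | billing.keys()
}

print(unified["c1"])   # {'name': 'Alice', 'city': 'Lahore', 'balance': 120.0}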
− REDUNDANCY ELIMINATION: Data redundancy refers to data repetition or surplus,
which occurs in many datasets. Data redundancy increases unnecessary data
transmission expense and causes defects in storage systems, e.g., waste of storage
space, data inconsistency, reduced data reliability, and data damage.
Therefore, various redundancy reduction methods have been proposed, such as
redundancy detection, data filtering, and data compression. Such methods may apply
to different datasets or application environments.
However, redundancy reduction may also bring about certain negative effects. For
example, data compression and decompression cause additional computational
burden.
Therefore, the benefits of redundancy reduction and the cost should be carefully
balanced.
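A minimal sketch of two of these redundancy-reduction methods, using assumed sample records: duplicate detection via hashing, and compression with zlib, which also illustrates the extra computational cost mentioned above.

# Sketch of two redundancy-reduction methods: duplicate detection via hashing,
# and compression with zlib. The sample records are illustrative assumptions.
import hashlib
import zlib

records = [b"sensor=42;temp=21.5", b"sensor=42;temp=21.5", b"sensor=42;temp=21.6"]

# 1) Redundancy detection: keep only records whose hash has not been seen before.
seen, unique = set(), []
for rec in records:
    digest = hashlib.sha256(rec).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique.append(rec)

# 2) Data compression: smaller to store/transmit, but costs CPU to (de)compress.
#    For tiny inputs like this the saving may be negligible or even negative.
blob = b"\n".join(unique)
compressed = zlib.compress(blob)

print(len(records), "->", len(unique), "unique records")
print(len(blob), "bytes ->", len(compressed), "bytes compressed")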
Data collected from different fields will increasingly appear in image or video
formats. It is well-known that images and videos contain considerable redundancy,
including:
▪ temporal redundancy
▪ spatial redundancy
▪ statistical redundancy
▪ sensing redundancy
Earlier prediction of flu outbreaks could limit the number of people who get sick or die
from the flu each year. More accurate and earlier detection of flu outbreaks can ensure
resources for combating outbreaks are allocated and deployed earlier (e.g., clinics could
be deployed to affected neighborhoods).
Why did Google Flu Trends eventually fail? What assumptions did they make about their
data or their model that ultimately proved not to be true?
− Google Flu Trends worked well in some instances but often overestimated,
underestimated, or entirely missed flu outbreaks. A notable example occurred when
Google Flu Trends largely missed the outbreak of the H1N1 flu virus.
− Just because someone is reading about the flu doesn’t mean they actually have it.
Some search terms like “high school basketball” might be good predictors of the flu
one year but clearly shouldn’t be used to measure whether someone has the flu.
− In general, many terms may have been good predictors of the flu for a while only
because, like high school basketball, they are more searched in the winter when more
people get the flu.
− Google began recommending searches to users, which skewed what terms people
searched for. As a result, the tool was measuring Google-generated suggested searches
as well, which further skewed results.
8. Discuss the case study of the "Re-identification" of Governor William Weld's Medical
Information.
ANSWER
The 1997 re-identification of Massachusetts Governor William Weld's medical data
within an insurance data set which had been stripped of direct identifiers has had a
profound impact on the development of de-identification provisions within the 2003
Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Weld's
re-identification, purportedly achieved through the use of a voter registration list from
Cambridge, MA, is frequently cited as an example that computer scientists can re-identify
individuals within de-identified data with "astonishing ease". However, a careful
re-examination of the population demographics in Cambridge indicates that Weld was
most likely re-identifiable only because he was a public figure who experienced a highly
publicized hospitalization, rather than there being any certainty underlying his
re-identification using the Cambridge voter data, which had missing data for a large
proportion of the population. The complete story of Weld's re-identification exposes an
important systemic barrier to accurate re-identification known as "the myth of the perfect
population register". Because the logic underlying re-identification depends critically on
being able to demonstrate that a person within a health data set is the only person in the
larger population who has a set of combined characteristics (known as
"quasi-identifiers") that could potentially re-identify them, most re-identification attempts
face a strong challenge in being able to create a complete and accurate population
register. This strong limitation not only underlies the entire set of famous Cambridge
re-identification results but also impacts much of the existing re-identification research
cited by those making claims of easy re-identification. This paper critically examines the
historic Weld re-identification and the dramatic (thousands-fold) reductions in
re-identification risks for de-identified health data as they have been protected by the
HIPAA Privacy Rule provisions for de-identification since 2003. The paper also provides
recommendations for enhancements to existing HIPAA de-identification policy, discusses
critical advances routinely made in medical science and improvement of our healthcare
system using de-identified data, and provides commentary on the vital importance of
properly balancing the competing goals of protecting patient privacy and preserving the
accuracy of scientific research and statistical analyses conducted with de-identified data.
9. Explain how Data anonymization and Data Encryption can be used to improve the
privacy/security. Explain the issues of each in detail.
(https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-
ruin/)
ANSWER
Data anonymization removes or masks personally identifiable information in a dataset so
that the people the data describes cannot be directly identified. However, even when you
clear data of identifiers, attackers can use de-anonymization methods to retrace the data
anonymization process. Since data usually passes through multiple sources, some of
which are available to the public, de-anonymization techniques can cross-reference the
sources and reveal personal information.
The General Data Protection Regulation (GDPR) outlines a specific set of rules that
protect user data and create transparency. While the GDPR is strict, it permits companies
to collect anonymized data without consent, use it for any purpose, and store it for an
indefinite time—as long as companies remove all identifiers from the data.
− Data masking—hiding data with altered values. You can create a mirror version of a
database and apply modification techniques such as character shuffling, encryption,
and word or character substitution.
− Pseudonymization—a data management and de-identification method that replaces
private identifiers with fake identifiers or pseudonyms (a minimal sketch of this and
the other techniques appears after this list).
− Generalization—deliberately removes some of the data to make it less identifiable.
Data can be modified into a set of ranges or a broad area with appropriate boundaries.
You can remove the house number in an address, but make sure you don’t remove the
road name. The purpose is to eliminate some of the identifiers while retaining a
measure of data accuracy.
− Data swapping—also known as shuffling or permutation, a technique used to
rearrange the dataset attribute values so they don't correspond with the original
records. Swapping attributes (columns) that contain identifier values, such as date of
birth, may have more impact on anonymization than swapping membership-type
values.
− Data perturbation—modifies the original dataset slightly by applying techniques
that round numbers and add random noise. The range of values needs to be in
proportion to the perturbation. A small base may lead to weak anonymization while a
large base can reduce the utility of the dataset.
− Synthetic data—algorithmically manufactured information that has no connection to
real events. Synthetic data is used to create artificial datasets instead of altering the
original dataset or using it as is and risking privacy and security. The process involves
creating statistical models based on patterns found in the original dataset.
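A minimal sketch of three of the techniques above (pseudonymization, generalization, and perturbation) applied to assumed records; the field names, salt, age buckets, and noise range are illustrative assumptions, not a prescribed GDPR procedure.

# Sketch of three anonymization techniques from the list above: pseudonymization,
# generalization, and perturbation. The records and salt are illustrative assumptions.
import hashlib
import random

patients = [
    {"name": "Alice Khan", "address": "12 Mall Road, Lahore", "age": 34},
    {"name": "Bob Raza",   "address": "7 Club Road, Lahore",  "age": 37},
]

SALT = "demo-salt"   # in practice, a secret value kept separate from the data

def anonymize(rec):
    return {
        # Pseudonymization: replace the direct identifier with a stable pseudonym.
        "pid": hashlib.sha256((SALT + rec["name"]).encode()).hexdigest()[:10],
        # Generalization: drop the house number, keep only the road name.
        "road": rec["address"].split(",")[0].split(" ", 1)[1],
        # Generalization: bucket the exact age into a range.
        "age_range": f"{rec['age'] // 10 * 10}-{rec['age'] // 10 * 10 + 9}",
        # Perturbation: add small random noise to a numeric attribute.
        "age_noisy": rec["age"] + random.randint(-2, 2),
    }

print([anonymize(p) for p in patients])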
The GDPR stipulates that websites must obtain consent from users to collect personal
information such as IP addresses, device ID, and cookies. Collecting anonymous data and
deleting identifiers from the database limit your ability to derive value and insight from
your data. For example, anonymized data cannot be used for marketing efforts, or to
personalize the user experience.
SECTION # B
[1] Bonomi, Flavio, et al. "Fog computing and its role in the internet of
things." Proceedings of the first edition of the MCC workshop on Mobile cloud
computing. ACM, 2012.
[2] Yi, Shanhe, et al. "Fog computing: Platform and applications." 2015 Third IEEE
Workshop on Hot Topics in Web Systems and Technologies (HotWeb). IEEE, 2015.
[3] Bonomi, Flavio, et al. "Fog computing: A platform for internet of things and
analytics." Big data and internet of things: A roadmap for smart environments. Springer,
Cham, 2014. 169-186.
ANSWER
Fog computing provides low latency and location awareness, has wide-spread
geographical distribution, supports mobility, and comprises a very large number of
nodes. The main task of fog is to deliver data and place it closer to the user, who is
positioned at the edge of the network.
3. Define latency-sensitive applications.
ANSWER
A latency-sensitive application is an application that is expected to respond quickly to
specific events. Latency is defined as the time between the occurrence of an event and its
handling. There was a time when it took minutes or even hours to send a simple email,
but that time is long gone; technology has evolved in tune with the requirements of the
times. From everyday phone calls to the most critical medical and military technology,
the 'time factor' assumes a pivotal role. Medical imaging is a time-critical application: it
requires real-time, low-latency image processing with high throughput. This is where
latency sensitivity comes into play. As medical imaging continues to evolve rapidly,
meeting the latency targets for real-time services remains a constant challenge. Latency is
the total delay between the sensor and the receiver, and the level of latency decides the
quality and timely delivery of the images.
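A minimal sketch of what "latency-sensitive" means in practice, with an assumed 50 ms budget and simulated work standing in for real image processing: the handler's latency is measured and checked against the budget.

# Sketch: measuring event-handling latency against a target budget.
# The 50 ms budget and the simulated work are illustrative assumptions.
import time

LATENCY_BUDGET_S = 0.050   # e.g. a real-time imaging step must finish within 50 ms

def handle_event():
    time.sleep(0.01)       # stand-in for actual image processing

start = time.perf_counter()
handle_event()
latency = time.perf_counter() - start

print(f"handled in {latency * 1000:.1f} ms "
      f"({'within' if latency <= LATENCY_BUDGET_S else 'over'} budget)")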
4. What are the requirements that lead to the development of Fog Computing platform?
ANSWER
Existing work has focused on the design and implementation of fog computing nodes
such as Cloudlet, IOx, and ParaDrop. Cloudlet is considered an exemplar implementation
of resource-rich fog nodes. It has a three-layer design, in which the bottom layer is Linux
with a data cache from the cloud, the middle layer is virtualization with a bundle of cloud
software such as OpenStack, and the top layer consists of applications isolated in
different virtual machine (VM) instances. IOx is Cisco's platform built around a grid
router; it works by hosting applications in a guest OS running on a hypervisor upon the
hardware of the router. The platform lets developers run scripts, compile code, and install
their own operating system, but it is not open to the public and relies on expensive
hardware. ParaDrop is implemented on a gateway (Wi-Fi access point or home set-top
box), which is an ideal fog node choice due to its proximity to end users. However, it is
designed for home usage scenarios and does not operate in a fully decentralized manner:
all application servers are required to use a ParaDrop server as the entry point to services
provided by gateways. We consider ParaDrop a complementary implementation of a fog
computing platform for such task scenarios.
− Global distributed network: Fog computing-empowered edge nodes or
gateways provide distributed networks with the power of local decision-making
and temporary data storage for analysis. This kind of distribution ensures that
even if cloud services are not available, your IoT solution would be able to
function locally with some limited restrictions.
− Better bandwidth utilization: Fog computing empowers edge nodes to process
the raw data obtained from the end devices locally and to periodically push only
the processed data to a central mainframe, thus ensuring the most optimal usage
of network bandwidth (see the sketch after this list).
− Real-time operation and low latency: Fog computing bifurcates data based on
time criticality and ensures that the most time-critical data are processed locally
without the intervention of a central mainframe — thus enabling real-time
operations and very low latency.
− Optimal usage of edge node resources: Fog computing-enabled edge nodes are
designed to maximally leverage edge node resources, overcoming the limitations
of cloud computing and making optimal use of network bandwidth.
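A minimal sketch of the bandwidth-utilization point above: an edge node buffers raw readings locally and pushes only a periodic summary upstream. The readings, batch size, and the upload stand-in are assumptions.

# Sketch of a fog/edge node that processes raw readings locally and pushes only a
# periodic summary upstream, instead of streaming every raw value to the cloud.
from statistics import mean

def push_to_cloud(summary):
    print("uploading summary:", summary)   # stand-in for a real upload

BATCH_SIZE = 5
buffer = []

for raw in [21.4, 21.5, 21.7, 21.6, 21.5, 22.0, 21.9, 22.1, 22.0, 21.8]:
    buffer.append(raw)                     # local, temporary storage at the edge
    if len(buffer) == BATCH_SIZE:
        push_to_cloud({"count": len(buffer), "avg": round(mean(buffer), 2),
                       "max": max(buffer)})
        buffer.clear()                     # only the summary left the edge node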
6. The fog computing platform has a broad range of applications. Discuss any 3 such
applications. Clearly explain why fog computing is needed in these scenarios.
FOG COMPUTING APPLICATIONS:
− Linked Vehicles:
Self-driving vehicles are available on the market and produce a significant volume of
data. The information has to be interpreted and processed quickly, taking into account
conditions such as traffic, driving conditions, and the environment. All of this
information is processed quickly with the aid of fog computing, close to where it is
generated.
− Smart Grid And Smart Cities:
Energy networks use real-time data for efficient management of the system. It is
necessary to process this remote data near the location where it is produced, and fog
computing is constructed in such a manner that these problems can be sorted out.
− Real Time Analytics:
Data can be transferred using fog computing deployments from the location where it
is produced to different locations with very little delay. Fog computing is used for
real-time analytics, for example passing data from a production network to financial
institutions that rely on real-time data.
7. What are the limitations of cloud computing that were addressed by Fog Computing?
ANSWER
With the rapid increase of IoT-powered applications in the last couple of years and the
lack of proper network bandwidth allocation and utilization, cloud computing faces
limitations in wide adoption. The Smart Traffic Light System (STLS) use case below
illustrates these limitations:
− Smart Traffic Light System (STLS) calls for deployment of a STL at each
intersection
− The STL is equipped with sensors that
▪ Measure the distance and speed of approaching vehicles from every direction
▪ Detect presence of pedestrians/other vehicles crossing the street
− Issues "Slow down" warnings to vehicles at risk of crossing on red, and even modifies
its own cycle to prevent collisions.
STLS GOALS
1) Accident prevention.
2) Maintenance of steady flow of traffic (Green waves along the main roads).
3) Collection of relevant data to evaluate and improve the system.
Note: Goal (1) requires real-time reaction, (2) near-real time, and (3) relates to the
collection and analysis of global data over long periods.
▪ The challenge is to manage the large number of sensors (and actuators) that are
naturally distributed, as a coherent whole.
▪ This calls for "moving the processing to the data".
▪ A distributed intelligent platform at the edge (fog computing) manages the
distributed compute, networking, and storage resources.
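A minimal sketch of this split for the STLS described above: the time-critical reaction (goal 1) is handled locally at the intersection's fog node, while data for long-term analysis (goal 3) is only queued for later upload. The thresholds and event format are assumptions.

# Sketch of the STLS split: the time-critical decision (goal 1) is made locally at
# the intersection's fog node, while data for long-term analysis (goal 3) is merely
# queued for later upload. All thresholds here are assumptions.
UNSAFE_SPEED_KMH = 70
analytics_queue = []            # batched and shipped to the cloud later

def on_vehicle_detected(speed_kmh, distance_m, light_state):
    # Goal 1: real-time, local reaction -- no round trip to a data centre.
    if light_state == "red" and speed_kmh > UNSAFE_SPEED_KMH and distance_m < 80:
        print("WARNING to vehicle: slow down, red light ahead")
    # Goal 3: record the event for global, long-term analysis.
    analytics_queue.append({"speed": speed_kmh, "dist": distance_m,
                            "light": light_state})

on_vehicle_detected(speed_kmh=85, distance_m=60, light_state="red")
on_vehicle_detected(speed_kmh=40, distance_m=120, light_state="green")
print(len(analytics_queue), "events queued for the cloud")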
SECTION # C
3. Explain HDFS Architecture?
ANSWER
The Hadoop Distributed File System (HDFS) was developed using a distributed file
system design. Unlike other distributed systems, HDFS is highly fault tolerant and
designed to run on low-cost hardware. HDFS holds very large amounts of data and
provides easy access. To store such huge data, files are stored across multiple machines.
In an HDFS cluster, there is ONE master node and many worker nodes. The master node
is called the NameNode (NN) and the workers are called DataNodes (DN). DataNodes
actually store the data; they are the workhorses.
The NameNode is in charge of file system operations (like creating files, managing user
permissions, etc.). Without it, the cluster is inoperable: no one can write or read data,
which makes the NameNode a single point of failure. The file system will, however,
continue to function even if a DataNode fails, because Hadoop duplicates (replicates)
each block of data across several nodes.
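A toy, self-contained model of this design (not the real HDFS API): the NameNode holds only metadata about which blocks make up a file and where each replica lives, while DataNodes hold the block contents. The tiny block size and the replication factor of 3 are illustrative assumptions.

# Toy model of the HDFS design described above: the NameNode keeps only metadata
# (which blocks make up a file and where each replica lives), while DataNodes hold
# the actual block contents.
import itertools

BLOCK_SIZE = 8          # bytes, tiny for illustration (HDFS defaults to 128 MB)
REPLICATION = 3

data_nodes = {f"dn{i}": {} for i in range(1, 5)}      # DataNodes: block_id -> bytes
name_node = {}                                        # NameNode: file -> block map

def put(path, data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    name_node[path] = []
    dn_cycle = itertools.cycle(data_nodes)
    for block_id, block in enumerate(blocks):
        replicas = [next(dn_cycle) for _ in range(REPLICATION)]
        for dn in replicas:
            data_nodes[dn][(path, block_id)] = block  # DataNodes store the data
        name_node[path].append({"block": block_id, "replicas": replicas})

put("/logs/day1.txt", b"hello hadoop world!")
print(name_node["/logs/day1.txt"])
# Even if one DataNode fails, every block still has two other replicas.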