
Introduction

1.2 Data sharing mediators


The term “mediator of data sharing” is decades old and refers to an approach, virtual database,
and system to integrate data from diverse databases, thereby connecting data sources and the
application (i.e. computer program) using them (Wiederhold, 1992). However, as the importance
of human infrastructure to enable data releasing, sharing, and reusing has garnered greater
recognition in recent years, this term has also been used to refer to the people who connect data
creators and data users (Borgman, 2015). The work of human mediators, such as those who curate
and manage data and who develop and maintain sharing standards and technologies, is crucial in
making data sharing possible; nevertheless, human mediators are usually invisible and
overlooked (Kervin, Cook, & Michener, 2014; Borgman, 2015). These two concepts of mediator
are both accurate, but technological components (e.g., information systems) have consistently
received more attention and investment than human and social components.

5.1 Why Data Sharing is Important


Data sharing is becoming increasingly important for many users and is sometimes a crucial
requirement, especially for businesses and organisations aiming to generate profit. Historically, many
people viewed computers as “impersonal giants” that threatened to cut the jobs of many people
through automation. In recent times, however, computing has been welcomed by a huge number of people
as it has become significantly more social [56]. It is thus not surprising that more and more people are
demanding data sharing capability on their phones, computers and, more recently, smart TVs.
People love to share information with one another. Whether it is with friends, family, colleagues
or the world, many people benefit greatly through sharing data. Some of the benefits include:

• Higher productivity: Businesses get more work done and collaborate with peers far more
efficiently, which is key to meeting their business goals. Hospitals also benefit from data sharing,
which has helped lower healthcare costs [57]. Students also benefit when working on group
projects, as they are better able to collaborate with group members and get work done more efficiently.
• More enjoyment: People of any age, gender or ethnicity can connect with friends, family
and colleagues to share their experiences in life as well as catch up with others via social
networking sites such as Facebook or MySpace. Employees and enterprise users can share their
experiences through sites like Yammer. People can also share videos on YouTube or photos on
Flickr, which can provide greater enjoyment for some people. In the past, connecting with a loved
one in a different country was not possible except through letters. Hence social data sharing
generally provides people with a rich experience, as the sharing of personal information can
lead to deeper and stronger relationships.

• To voice opinions: Some people prefer to share information with the world in order to voice an
opinion. Many people want to be heard and use social networking sites to promote their opinions,
which previously was not possible unless they organised protests. People are now using social
networking sites such as Facebook, Twitter and YouTube to raise awareness about real issues in
the world. Although some campaigns have led to violent protests, online campaigns usually
inform people of issues and encourage them to help a cause.

Data sharing is becoming increasingly prevalent in many industries and organizations. Hospitals
are now benefitting from data sharing as it provides better, safer care of patients. There is no
longer a need to repeat medical history every time a new health professional is consulted, which
means fewer unnecessary tests; hence, the health professional gets a more complete picture of the
patient's medical history [57]. There is also a strong focus on the sharing of research data [58].
According to Feldman et al. [59], there is growing support for the sharing of research data in
order to accelerate the pace of scientific discovery. Such sharing will allow for more rapid
translation of science to practice. Financial institutions also benefit from data sharing; benefits
include better customer support and a better understanding of the needs of the customer [60].
Shared data can be used to improve modeling, analysis and risk tools.

5.2 Requirements of Data Sharing


To enable data sharing, it is imperative that only authorised users are able to access the stored
data. We summarise the ideal requirements of data sharing below.

• The data owner should be able to specify a group of users that are allowed to view his/her data.
• Any member of the group should gain access to the data anytime without the data owner’s
intervention.

• No other user, other than the data owner and the members of the group, should gain access to
the data, including the Service Provider.

• The data owner should be able to revoke access to data for any member of the group.

• The data owner should be able to add members to the group.

• No member of the group should be allowed to revoke rights of other members of the group or
join new users to the group.

• The data owner should be able to specify who has read/write permissions on the data owner’s
files.

We now look at the privacy and security requirements of data sharing. Achieving these
requirements can go a long way towards attracting large numbers of users to adopt and embrace
Cloud technology.

• Data Confidentiality: Unauthorized users should not be able to access data at any time.
Data should remain confidential in transit, at rest and on backup media. Only authorized users
should be able to gain access to data.

• User revocation: When a user’s access rights to data are revoked, that user should not be able to
gain access to the data at any time. Ideally, user revocation should not affect other
authorized users in the group, for efficiency purposes.

• Scalable and Efficient: Since the number of Cloud users tends to be extremely large and at
times unpredictable as users join and leave, it is imperative that the system maintain both
efficiency and scalability.

• Collusion between entities: When considering data sharing methodologies in the Cloud, it is
vital that even when certain entities collude, they are still unable to access any of the data
without the data owner’s permission. Earlier literature on data sharing did not consider
this problem; however, collusion between entities can never be written off as an unlikely event.
A minimal sketch of how these functional requirements might be modelled in code follows.
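
To make the requirements above concrete, here is a minimal, illustrative Python sketch (not taken from the text) of an in-memory model in which only the data owner can grant, revoke, and scope read/write access for group members. Names such as SharedFile are hypothetical; a real cloud deployment would enforce these checks cryptographically or through the provider's access control mechanisms.

```python
# Illustrative sketch only: an in-memory model of the group-based sharing
# requirements listed above (owner-managed membership, revocation, and
# read/write permissions). Not a real cloud implementation.

class SharedFile:
    def __init__(self, owner, content=""):
        self.owner = owner            # only the owner may change membership
        self.content = content
        self.permissions = {}         # user -> set of rights, e.g. {"read", "write"}

    def grant(self, actor, user, rights):
        # Only the data owner may add members or change their rights.
        if actor != self.owner:
            raise PermissionError("only the owner may grant access")
        self.permissions.setdefault(user, set()).update(rights)

    def revoke(self, actor, user):
        # Only the data owner may revoke a member's access.
        if actor != self.owner:
            raise PermissionError("only the owner may revoke access")
        self.permissions.pop(user, None)

    def read(self, user):
        # Members read without further owner intervention; everyone else
        # (including the service provider) is denied.
        if user == self.owner or "read" in self.permissions.get(user, set()):
            return self.content
        raise PermissionError("access denied")

    def write(self, user, new_content):
        if user == self.owner or "write" in self.permissions.get(user, set()):
            self.content = new_content
        else:
            raise PermissionError("access denied")


# Example usage
doc = SharedFile(owner="alice", content="quarterly figures")
doc.grant("alice", "bob", {"read"})
print(doc.read("bob"))        # allowed
doc.revoke("alice", "bob")
# doc.read("bob") would now raise PermissionError
```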
A BASIC DATA SHARING FRAMEWORK
Once a data set has been created, a basic framework can be described outlining the ways data
may be shared. This framework describes increasing access to data with ever fewer restrictions:

• The data set exists – no detail may be provided other than the existence of the data set. For
example, knowing that a register of drivers’ licences exists

• Details about the data set – such as sharing details of the scope, parameters involved (often
referred to as the data dictionary), period over which the data is collected

• Ability to interrogate aggregated, perturbed, or obfuscated data – such as the ability to run a
defined set of logical operations over, and receive a result from, data which has been de-
identified in some way without accessing the data itself. Access may further be refined through
the level of aggregation, perturbation, or obfuscation.

• Ability to access aggregated, perturbed, or obfuscated data – the ability to run an unlimited set
of queries over data which has been de-identified in some way

• Access to data – whilst this may still be restricted to certain individuals, for certain approved
purposes in secure operating environments, there are no technical limitations on the operations
which may be performed.

• Ability to share data – some systems, such as the SURE system used by the SAX Institute,
limit how data is accessed to prevent further sharing. The ability to on-share data
provides the most open access and the greatest risk, as has been seen with Edward Snowden and
WikiLeaks.
In this basic framework, there is an explicit assumption that data sharing involves a data source,
a data recipient, and a sharing mechanism. It also implies increasingly open access to data to the
ultimate point of being able to on-share.
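
As one way of illustrating this framework (an assumption for this document, not part of the original framework), the tiers above can be encoded as an ordered enumeration; a real policy would attach conditions such as approved purpose, secure environment, and de-identification level to each tier.

```python
# Illustrative sketch only: the access tiers of the basic framework above,
# ordered from most restricted to most open.

from enum import IntEnum

class AccessTier(IntEnum):
    EXISTENCE_ONLY = 1        # only the existence of the data set is known
    METADATA = 2              # scope, data dictionary, collection period
    QUERY_DEIDENTIFIED = 3    # defined queries over aggregated/perturbed data
    ACCESS_DEIDENTIFIED = 4   # unlimited queries over de-identified data
    ACCESS_RAW = 5            # direct access, possibly in a secure environment
    ON_SHARE = 6              # recipient may share the data onwards

def allows(granted: AccessTier, requested: AccessTier) -> bool:
    """A grant at one tier implies everything below it in this simple model."""
    return requested <= granted

# Example: a grant of ACCESS_DEIDENTIFIED permits metadata queries but not on-sharing.
assert allows(AccessTier.ACCESS_DEIDENTIFIED, AccessTier.METADATA)
assert not allows(AccessTier.ACCESS_DEIDENTIFIED, AccessTier.ON_SHARE)
```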

Sharing risks and benefits

Who, What, When

Who is at risk when network data is shared?


Entities potentially at risk when network traffic is shared include: persons who are identified or
identifiable in network traffic, researchers, and network providers (NP) such as ISPs, backbone
providers, and private network owners. In addition to legal liabilities and ethical responsibilities,
researchers and their institutions also risk withdrawal of data and/or funding as a result of
privacy leakage. Society also bears costs associated with misinformation, mistrust, and
internalizing behavioral norms that may result from privacy harms.
Which traffic data components are privacy-relevant?
We call a first-order identifier one which functionally distinguishes an individual: first and last
name, social security number, government-issued and other account identifiers, physical and
email addresses, certain biometric markers, and possibly the same information about immediate
family. A second-order identifier could be an IP address (IPA), media access control (MAC)
address, host name, birthdate, phone number, zip code, gender, and financial, health, or
geographic information. These indirect identifiers can also include aggregated or behavioral
profile information such as IP header information, which in many cases can reveal which
applications are used, how often, and with which machines. Indirect identifiers also include URL
click streams, which can reveal information about the content of communications, including
search terms.
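
As a simple illustration (not drawn from the source), the sketch below classifies record fields into the first-order and second-order categories described above; the field names are hypothetical examples, and a real tool would work on parsed packet/flow fields and a vetted data dictionary.

```python
# Illustrative sketch only: a toy classifier for the identifier categories
# described above. Field names are hypothetical examples.

FIRST_ORDER = {"full_name", "ssn", "account_id", "email", "postal_address"}
SECOND_ORDER = {"ip_address", "mac_address", "hostname", "birthdate",
                "phone_number", "zip_code", "gender", "url_clickstream"}

def classify(field_name: str) -> str:
    """Return the privacy class of a record field."""
    if field_name in FIRST_ORDER:
        return "first-order identifier (functionally distinguishes a person)"
    if field_name in SECOND_ORDER:
        return "second-order / indirect identifier"
    return "not classified as identifying on its own"

print(classify("ip_address"))   # second-order / indirect identifier
print(classify("full_name"))    # first-order identifier
```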
Under what conditions do these data types pose risk?
Network traffic measurement data can present a privacy risk when information in packets and
flow records can directly expose non-public information about persons, such as health, sexual
orientation, political affiliation, religious affiliation, criminal activity, associations, behavioral
activities, or physical or virtual location; or about organizations, such as intellectual property, trade
secrets or other proprietary information. Network traffic may also indirectly expose non-public,
sensitive information if correlated (linked) with other public or private data, such as lists of IPAs
of worm-infected and thus vulnerable hosts. Network data can also yield mistaken attributions
and inferences about behavior, potentially more damaging than correct inferences.
The privacy risk may also vary across time, as the threat may be immediately manifest upon
disclosure of data, or it may be a latent risk which is held in abeyance until some future condition
arises. Lack of transparency between the data provider (DP) and data seeker (DS) regarding the shared data's nature, scope,
and lineage is invariably a condition that enhances risk [https://www.caida.org/data/sharing/].

1.2 INFORMATION SECURITY GOVERNANCE


Secure flow of information plays a vital role in today’s economy and should therefore be treated
with due importance. More often than not, Information Security is dealt with purely as
a technology issue rather than also being treated as a governance issue. According to the IT
Governance Institute (ITGI), governance is a set of responsibilities and practices exercised by the
board and executive management with the goal of providing strategic direction, ensuring that
objectives are achieved, ascertaining that risks are managed appropriately and verifying that the
enterprise’s resources are used responsibly (CISSP Guide to Security Essentials 2010). Although
it is the responsibility of the executive management to deal with the technical aspects of
information security, the board of directors are expected to include the information security
concern in the organization’s governance undertakings. The role of integrating information
security into the governance framework is the responsibility of the board of directors, who,
despite acknowledging the need to uphold the integrity and continuity of business processes,
regrettably have superficial knowledge of information security (IT Security Governance - CIO,
CISO and Practitioner Guide 2009).

1.3 Data Security


Issues around data confidentiality and privacy are under greater focus than ever before as
ubiquitous internet access exposes critical corporate data and personal information to new
security threats. On one hand, data sharing across different parties and for different purposes is
crucial for many applications, including homeland security, medical research, and environmental
protection. The availability of ‘‘big data’’ technologies makes it possible to quickly analyze huge
data sets and is thus further pushing the massive collection of data. On the other hand, the
combination of multiple datasets may allow parties holding these datasets to infer sensitive
information. Pervasive data gathering from multiple data sources and devices, such as smart
phones and smart power meters, further exacerbates this tension. Techniques for fine-grained and
context-based access control are crucial for achieving data confidentiality and privacy.
Depending on the specific use of data, e.g. operational purposes or analytical purposes, data
anonymization techniques may also be applied. An important challenge in this context is
represented by the insider threat, that is, data misuses by individuals who have access to data for
carrying out their organizational functions, and thus possess the necessary authorizations to
access proprietary or sensitive data. Protection against insiders requires not only fine-grained and
context-based access control but also anomaly detection systems, able to detect unusual patterns
of data access, and data user surveillance systems, able to monitor user actions and habits in
cyber space – for example whether a data user is active on social networks. Notice that the
adoption of anomaly detection and surveillance systems entails data user privacy issues and
therefore a challenge is how to reconcile data protection with data user privacy. It is important to
point out that when dealing with data privacy, one has to distinguish between data subjects, that
is, the users to whom the data is related, and data users, that is, the users accessing the data.
Privacy of both categories of user is important, even though only a few approaches have been
proposed for data user privacy [6, 8, 9]. Data security is not, however, limited to data
confidentiality and privacy. As data is often used for critical decision making, data
trustworthiness is a crucial requirement. Data needs to be protected from unauthorized
modifications. Its provenance must be available and certified. Data must be accurate, complete
and up-to-date. Comprehensive data trustworthiness solutions are difficult to achieve as they
need to combine different techniques, such as digital signatures, semantic integrity and data quality
techniques, as well as taking into account data semantics. Notice also that assuring data
trustworthiness may require a tight control on data management processes which has privacy
implications. In what follows we briefly elaborate on the above issues and research challenges.

2 Access Control and Protection from Insider Threat


From a conceptual point of view, an access control mechanism typically includes a reference
monitor that checks whether accesses requested by subjects, in order to perform certain actions on
protected objects, are allowed according to the access control policies. The decision taken
by the access control mechanism is referred to as the access control decision. Of course, in order to
be effective, access control mechanisms must support fine-grained access control, which refers to
finely tuning the permitted accesses along different dimensions, including data object contents,
time and location of the access, and purpose of the access. By properly restricting the contexts of the
possible accesses one can reduce improper data accesses and the opportunities for insiders to
steal data. To address such a requirement, extended access control models have been proposed,
including time-based access control models, location-based access control models, purpose-
based access control models, and attribute-based access control models that restrict data accesses
with respect to time periods, locations, purpose of data usage, and user identity attributes [8],
respectively. Even though the area of access control has been widely investigated [2], there are
many open research directions, including how to reconcile access control with privacy, and how
to design access control models and mechanisms for social networks and mobile devices. Many
advanced access control models require that information, such as the location of the user
requiring access or user identity attributes [3], be provided to the access control monitor. The
acquisition of such information may result in privacy breaches, and the use of the cloud for managing
the data and enforcing access control policies on the data further increases the risk of data users
being targets of spear phishing attacks. The challenge is how to perform access control while at
the same time maintaining the privacy of the user personal and context information [6, 8]. Social
networks and mobile devices acquire a large variety of information about individuals; therefore
access control mechanisms are needed to control with which parties this information is shared.
Also, today user-owned mobile devices are increasingly being used for job-related tasks and thus
store enterprise confidential data. The main issue is that, unlike conventional enterprise
environments in which administrators and other specialized staff are in charge of deploying
access control policies, in social networks and mobile devices end-users are in
charge of deploying their own personal access control policies. The main challenge is how to
make sure that devices storing enterprise confidential data enforce the enterprise access control
policies and to make sure that un-trusted applications are unable to access this data. It is
important to point out that access control alone may not be sufficient to protect data against
insider threat, as an insider may have a legitimate permission for certain data accesses. It is
therefore crucial to be able to determine whether an access, even though it is granted by the access
control mechanism, is ‘‘anomalous’’ with respect to data accesses typical of the job function of
the data user and/or the usual data access patterns. For example, consider a user that has the
permission to read an entire table in a database and assume that for his/her job function, the user
only needs to access a few entries a day and does so during working hours. With respect to such
access pattern, an access performed after office hours and resulting in the download of the entire
table would certainly be anomalous and needs to be flagged. Initial solutions to anomaly
detection for data accesses have been proposed [5]. However, these may not be effective against
sophisticated attacks and need to be complemented by techniques such as separation-of-duties
[1] and data flow control.
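
The after-hours bulk-download example above can be made concrete with a small sketch (an illustration, not a method proposed in the text) that combines a much simplified context-based permission check with an anomaly flag; the thresholds, attribute names, and policy format are assumptions.

```python
# Illustrative sketch only: a context-aware access check combined with a very
# simple anomaly flag, modelled on the example above (a user who normally reads
# a few rows during working hours suddenly downloads a whole table at night).

from datetime import datetime

POLICY = {
    # user -> allowed hours (start, end) and a typical per-day row budget
    "analyst": {"hours": (8, 18), "typical_rows_per_day": 50},
}

def access_decision(user: str, rows_requested: int, when: datetime):
    """Return (allowed, anomalous) for a read request."""
    policy = POLICY.get(user)
    if policy is None:
        return False, False                      # no permission at all

    start, end = policy["hours"]
    allowed = True                               # the permission itself exists
    after_hours = not (start <= when.hour < end)
    bulk_read = rows_requested > 10 * policy["typical_rows_per_day"]

    # Access control alone would grant this; anomaly detection flags it.
    anomalous = after_hours or bulk_read
    return allowed, anomalous

allowed, anomalous = access_decision("analyst", rows_requested=1_000_000,
                                     when=datetime(2024, 1, 10, 23, 30))
print(allowed, anomalous)   # True True -> grant but raise an alert, per policy
```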

3 Data Trustworthiness

The problem of providing ‘‘trustworthy’’ data to users is an inherently difficult problem which
often depends on the application and data semantics as well as on the current context and
situation. In many cases, it is crucial to provide users and applications not only with the needed
data, but with also an evaluation indicating how much the data can be trusted. Being able to do
so is particularly challenging especially when large amounts of data are generated and
continuously transmitted. Solutions for improving data, like those found in data quality, may be
very expensive and require access to data sources which may have access restrictions, because of
data sensitivity. Also even when one adopts methodologies to assure that the data is of good
quality, attackers may still be able to inject incorrect data; therefore, it is important to assess the
damage resulting from the use of such data, to track and contain the spread of errors, and to
recover. The many challenges for assuring data trustworthiness require articulated solutions
combining different approaches and techniques including data integrity, data quality, record
linkage [4], and data provenance [10]. Initial approaches for sensor networks [7] have been
proposed that apply game theory techniques with the goal of determining which sensor nodes need
to be ‘‘hardened’’ so as to assure that data has a certain level of trustworthiness. However, many
issues still need to be addressed, such as protection against colluding attackers, articulated metrics for
‘‘data trustworthiness’’, and privacy-preserving data matching and correlation techniques.
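
As one small, concrete building block of data trustworthiness (an illustration, not a solution proposed in the text), the sketch below binds a data item to its declared source with an HMAC so that later modification can be detected; key management, semantic integrity, and data-quality checks remain out of scope.

```python
# Illustrative sketch only: a tamper-evident provenance record that binds a
# data item to its source with an HMAC. The shared key below is an assumption
# for the example; real deployments need proper key management.

import hmac, hashlib, json, time

SECRET_KEY = b"shared-secret-held-by-the-data-source"

def make_record(source: str, payload: dict) -> dict:
    body = {"source": source, "timestamp": time.time(), "payload": payload}
    digest = hmac.new(SECRET_KEY, json.dumps(body, sort_keys=True).encode(),
                      hashlib.sha256).hexdigest()
    return {"body": body, "mac": digest}

def verify_record(record: dict) -> bool:
    expected = hmac.new(SECRET_KEY, json.dumps(record["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["mac"])

rec = make_record("sensor-17", {"temperature_c": 21.4})
print(verify_record(rec))               # True
rec["body"]["payload"]["temperature_c"] = 99.9
print(verify_record(rec))               # False: modification is detected
```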

4 Reconciling Data Security and Privacy


As already mentioned, assuring data security requires among other measures creating user
activity profiles for anomaly detection, collecting data provenance, and context information such
as user location. Much of this information is privacy sensitive, and security breaches or data
misuses by administrators may lead to
privacy breaches. Also, users may not feel comfortable with their personal data, habits and
behavior being collected for security purposes. It would thus seem that security and privacy are
conflicting requirements. However this is not necessarily true. Notable examples of approaches
reconciling data security and privacy include:

• Privacy-preserving attribute-based fine-grained access control for data on a cloud [8]. These
techniques allow one to enforce access control policies taking into account identity information
about users for data stored in a public cloud without requiring this information to be disclosed to
the cloud, thus preserving user privacy.

• Privacy-preserving location-based role-based access control [6]. These techniques allow one to
enforce access control based on location, so that users can access certain data only when located
in secure locations associated with the protected data. Such techniques do not require however
that the user locations be disclosed to the access control systems, thus preserving user location
privacy.
GOVERNMENT DATA VALUATION FRAMEWORK
For governments, the value of data may be reframed to consider not just operational and policy
improvements from the use of data, but the economic stimulus which can be created by
deliberate release of data. Governments hold vast quantities of personal data on citizens, as well
as data which is of importance for national security. In a modified Gartner framework for
government (see Figure 4), the Exclusivity Value of Data must consider the impact of the release
of highly personal information and issues of national security. Such estimations would be
extremely difficult to quantify. The Research Value of Data focusses on research areas which
would be enhanced by deliberate release of data under controlled conditions (selected research
partners or by anonymising data). In areas as complex as health and human services, access to
data held by governments will be critical to understanding and addressing some of the greatest
challenges facing Australia. The challenge is of course that this is potentially the highest risk /
most sensitive data held by government. The Economic Value of Data focusses on industries
which would be stimulated by release of government data with appropriate treatment to prevent
personal information from being released. The Market Value of Data focusses on new industries
which may be created by release of government data under the same circumstances.

With the challenges associated with recognising the value of data, and quantifying the potential
risk of release of data, it is not surprising that so many individual data custodians are paralysed
by concerns about the consequences of releasing or sharing data. It is also not surprising that,
without a framework for recognising data in an accounting sense, data is undervalued as a
factor of production in the Digital Economy.
2.1 History of Big Data

The first major data project was created in 1937 and was ordered by Franklin D. Roosevelt’s
administration in the USA. The first data-processing machine appeared in 1943 and was
developed by the British to decipher Nazi codes during World War II. This device, named
Colossus, searched for patterns in intercepted messages at a rate of 5,000 characters per second
[3]. From the 1990s onwards, the creation of data was spurred as more and more devices were
connected to the Internet. Even after the first supercomputers were built, they could not process
data that differed in storage, size and format; big data created a new challenge in handling such
data and producing useful information from it. In 2005, Hadoop was developed by Yahoo, built
on top of Google’s MapReduce, with the goal of indexing the entire World Wide Web. Today,
open source Hadoop is used by many organizations to tackle huge amounts of data. Many
government organizations are using big data to derive useful decision support for the betterment
of society. In 2009 the Indian government started the project named “AADHAAR” to take an
iris scan, fingerprint and photograph of all of its 1.32 billion inhabitants; all this data is stored in
the largest biometric database in the world. In 2017, the Indian government started a big data
project to find income tax defaulters using social media data (Facebook and Twitter).

Big Data
Efforts to collect, store, and analyze large amounts of data are not new to the technology world.
Many companies have collected large amounts of data on their customers to better understand
their preferences and provide better services and products. Data analysis is present in almost
every aspect of modern society, including mobile services, retail, manufacturing, the financial
sector, industries, and science. The concept of big data began to emerge in the early 2000s. Since
then many attempts have been made to find a precise definition for this term. The modern
definition of "Big Data" was first presented to the computing world by Roger Magoulas in 2005,
in order to define a large amount of data that traditional data management techniques could not
manage and process due to its complexity and its size [7]. In addition, [8] stated that "the big
data is defined by its size, comprising a large, complex and independent collection of data sets".
This data set cannot be manipulated with standard data management techniques because of the
inconsistency and unpredictability of the possible combinations. However, after reviewing
definitions of big data from major technology organizations, the definition presented by Gartner
was found to be the most appropriate for this paper. "Big data is high-volume, high-velocity,
high-variety information assets that demands cost-effective and innovative forms of information
processing that enable enhanced insight and decision-making" [9]. Gartner's definition is broader
and encompasses the main aspects of big data.

The term Big Data is now used almost everywhere in our daily life. The term came around 2005
and refers to a wide range of large data sets that are almost impossible to manage and process
using traditional data management tools, due to their size but also their complexity. Big
Data can be seen in finance and business, where enormous amounts of stock exchange,
banking, and online and onsite purchasing data flow through computerized systems every day and
are then captured and stored for inventory monitoring, customer behaviour and market
behaviour analysis. It can also be seen in the life sciences, where big sets of data such as genome
sequences, clinical data and patient data are analysed and used to advance breakthroughs in
science and research. Other areas of research where Big Data is of central importance are
astronomy, oceanography, and engineering among many others. The leap in computational and
storage power enables the collection, storage and analysis of these Big Data sets and companies
introducing innovative technological solutions to Big Data analytics are flourishing. In this
article, we explore the term Big Data as it emerged from the peer reviewed literature. As opposed
to news items and social media articles, peer reviewed articles offer a glimpse into Big Data as a
topic of study and the scientific problems, methodologies and solutions that researchers are
focusing on in relation to it. The purpose of this article, therefore, is to sketch the emergence of
Big Data as a research topic from several points: (1) timeline, (2) geographic output, (3)
disciplinary output, (4) types of published papers, and (5) thematic and conceptual development.
The amount of data available to us is increasing manifold with each passing moment. Data is
generated in huge amounts all around us. Every digital process and social media exchange
produces it; systems, sensors and mobile devices transmit it [1]. With the advancement in
technology, this data is being recorded and meaningful value is being extracted from it. Big data
is an evolving term that describes any voluminous amount of structured, semi-structured and
unstructured data that has the potential to be mined for information. The 3Vs that define Big
Data are Variety, Velocity and Volume.
1) Volume: There has been an exponential growth in the volume of data that is being dealt with.
Data is not just in the form of text data, but also in the form of videos, music and large image
files. Data is now stored in terms of Terabytes and even Petabytes in different enterprises. With
the growth of the database, we need to re-evaluate the architecture and applications built to
handle the data.

2) Velocity: Data is streaming in at unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data
in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most
organizations.

3) Variety: Today, data comes in all types of formats. Structured, numeric data in traditional
databases. Information created from line-of-business applications. Unstructured text documents,
email, video, audio, stock ticker data and financial transactions. We need to find ways of
governing, merging and managing these diverse forms of data. There are two other metrics for
defining Big Data:
4) Variability: In addition to the increasing velocities and varieties of data, data flows
can be highly inconsistent, with periodic peaks. Daily, seasonal and event-triggered peak data
loads can be challenging to manage, even more so when unstructured data is involved [2].

5) Complexity: Today's data comes from multiple sources, and it is still an
undertaking to link, match, cleanse and transform data across systems. However, it is necessary
to connect and correlate relationships, hierarchies and multiple data linkages, or your data can
quickly spiral out of control. A data environment can lie at the extremes of any one of these
parameters, or a combination of them, or even all of them together.

2.4 Big Data Generation

In this section, we explore the key actors that generate the flood of data.

1) IoT (INTERNET OF THINGS): The phrase “Internet of Things”, commonly abbreviated as
IoT, is coined from two words: “Internet” and “Things”. A good definition of IoT is “An open
and comprehensive network of intelligent objects that have the capacity to auto-organize, share
information, data and resources, reacting and acting in face of situations and changes in the
environment” [8]. The Indian government plans to allocate 20 million USD for five internet-based,
sensor-driven projects within its “IoT and 100 Smart Cities” development program by 2020.
According to Gartner, Inc., an estimated 26 billion units will be installed within the Internet of
Things by 2020, excluding PCs, tablets and smartphones, which are predicted to reach 7.3 billion
units by 2020. Cisco and Intel estimate 50 billion connected devices by 2020 [9].

2) BIOMEDICAL AND MEDICAL DATA: Biomedical and medical fields such as
bioinformatics, clinical informatics and health informatics, which develop multidimensional
datasets, are another source of Big Data generation. Clinical projects, genome projects and
real-time health monitoring systems are adding huge amounts of data. In addition, medical
imaging (e.g. CT scans and MRI) produces vast amounts of data with even more complex
features. A project named ProteomicsDB was started by the Swiss government to handle gene
data amounting to 5.17 TB in size [10].

3) SOCIAL MEDIA: Nowadays, social media platforms (Facebook, YouTube, Twitter,
Instagram, etc.) produce vast amounts of data. Instagram has 7.1 million active users, 34.7
billion photos shared, and an average of 1,650 million likes per day [11]. On YouTube, 300 hours
of video are uploaded per second, and more than 3.25 billion hours of video are watched per
month [12]. Likewise, Twitter has 115 million active users every month, with an average of 58
million tweets per day [13]. More than 5 billion people worldwide call, text, tweet, and browse
on mobile devices [14]. The number of e-mail accounts created worldwide is expected to increase
from 3.3 billion in 2012 to over 4.3 billion by late 2016, at an average annual rate of 6% over the
next four years. In 2012, a total of 89 billion e-mails were sent and received daily, and this value
is expected to increase at an average annual rate of 13% over the next four years to exceed 143
billion by the end of 2016 [15].

4) OTHER SCIENTIFIC DATA: Other scientific areas such as astronomy, earth science and
ocean-related projects also generate vast amounts of data. DPOSS (The Palomar Digital Sky
Survey) holds 3 TB, 2MASS (The Two Micron All-Sky Survey) 10 TB, GBT (Green Bank
Telescope) 20 PB, GALEX (The Galaxy Evolution Explorer) 30 TB, SDSS (The Sloan Digital
Sky Survey) 40 TB, the SkyMapper Southern Sky Survey 500 TB, Pan-STARRS (The Panoramic
Survey Telescope and Rapid Response System) more than 40 PB expected, LSST (The Large
Synoptic Survey Telescope) more than 200 PB expected, and SKA (The Square Kilometre Array)
more than 4.6 EB expected [18].

The storage layer serves storage and retrieval requests received from the other layers, such as the
computing layer; this layer is also known as the physical layer of Big Data.

The mid layer provides an abstraction over the physical data layer and offers core functionalities
such as data organization, access and retrieval. This layer indexes the data and organizes it over
the distributed storage devices; the data is partitioned into blocks and organized into multiple
repositories [42].

The analytics or application layer consists of the tools, techniques and logic for developing
domain-specific analytics. This layer is also known as the logical layer.

3.1 Management/Storage Tools Of Big Data

With the enhancement of computing technology, huge amounts of data can be managed without
supercomputers and at low cost. Data can be saved and accessed over the network. Many tools and
techniques are available for storage management; some of them are Google BigTable, SimpleDB,
NoSQL stores, and MemcacheDB [20].

3.2 Hadoop

A project was started by Mike Cafarella and Doug Cutting to index nearly 1 billion pages for
their search engine project. In 2003, Google introduced the concept of the Google File System,
known as GFS. Later, in 2004, Google published the architecture of MapReduce, which became
the foundation of the framework known as Hadoop. In simple terms, the core of the Hadoop
system is MapReduce and HDFS (Hadoop Distributed File System). In this section we briefly
discuss the components of Hadoop.

3.2.1 HDFS: HDFS is a Java-based file system that provides scalable and reliable data storage,
and it was designed to span large clusters of commodity servers. A cluster contains two types of
nodes. The first is the name node, which acts as the master node. The second type is the data
node, which acts as a slave node. HDFS stores files in blocks, with a default block size of 64 MB.
These blocks are replicated in multiples to facilitate the parallel processing of large amounts of data.
HDFS stores huge amounts of data, and to do so files are stored across multiple machines. Files
are stored in a redundant manner so that the system is protected from possible data losses in case
of failure. HDFS also enables parallel processing. The architecture consists of a master server
with a single name node that manages the file system namespace and regulates access to files by
clients, and a number of data nodes, usually one per node in the cluster, which manage the
storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows
user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks
are stored in a set of data nodes. The name node performs file system namespace operations such
as opening, closing and renaming files and directories, and it also determines the mapping of
blocks to data nodes.
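
To illustrate the block-and-replica idea described above, the following toy sketch splits a file into 64 MB blocks and assigns each block to several data nodes. The round-robin placement is a simplification for illustration and not HDFS's actual placement policy.

```python
# Illustrative sketch only: an in-memory model of splitting a file into
# fixed-size blocks and replicating each block across data nodes.

BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB default block size, as noted above
REPLICATION = 3                    # a common default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return a list of (offset, length) block descriptors for a file."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, data_nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct data nodes (toy round-robin policy)."""
    placement = {}
    for i, block in enumerate(blocks):
        chosen = [data_nodes[(i + r) % len(data_nodes)] for r in range(replication)]
        placement[block] = chosen
    return placement

nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file -> 4 blocks
for block, replicas in place_replicas(blocks, nodes).items():
    print(block, "->", replicas)
```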

3.2.2 MapReduce Frameworks

Map Reduce is a programming model for distributed computing based on Java. It is a processing
technique. The Map Reduce algorithm includes two important tasks, namely Map and Reduce.
The term Map Reduce actually refers to two separate and distinct tasks that Hadoop programs
perform. The first is the map job, which takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key/value pairs). The reduce job
takes the output from a map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name Map Reduce implies, the reduce job is always performed after the
map job.

In MapReduce, it is easy to scale data processing over multiple computing nodes. Under this
model, the data processing primitives are called mappers and reducers. Writing an application in
the MapReduce form is not always easy, but once it is written it allows the application to scale to
run over thousands or even tens of thousands of machines in a cluster. A MapReduce program
runs in three stages, as shown in Fig. 3, namely the map stage, the shuffle stage, and the reduce stage.

Map Phase: The input data is stored in the form of files in the Hadoop file system (HDFS). The
input file is then passed to the mapper function line by line. This data is then processed by the
mapper, and several small chunks of data are created.
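
As a concrete illustration of the mapper/reducer pattern (a standard word-count example, not taken from the text), the two scripts below follow the Hadoop Streaming convention of reading from standard input and writing tab-separated key/value pairs; the file names mapper.py and reducer.py are assumptions for the example.

```python
# mapper.py -- illustrative sketch of a Hadoop Streaming-style mapper.
# Reads input lines from stdin and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- illustrative sketch of the matching reducer.
# Hadoop sorts mapper output by key, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```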

3. BIG DATA MANAGEMENT


The question arises of how to manage and develop Big Data related projects, and what
architecture we should follow to manage all the components of Big Data. The architecture of Big
Data must be synchronized with the supporting infrastructure of the institution or company. Data
are generated from different sources and are often noisy, faulty and messy. In this section, we
briefly discuss data storage tools, Hadoop and other management tools.

2.1. Big Data and Its Characteristics

Big Data is a collection of massive and complex datasets and data volume that includes huge
quantities of data, data management capabilities, social media analytics, and real-time data. Big
Data can come from structured and unstructured data, with five Vs as distinct characteristics
(Volume, Velocity, Value, Variety and Veracity) [12]. The changes in consumer behavior had
strong influences on all enterprises throughout time. A decision moment occurred in the 1970s
when a significant macroeconomic change affected the law of supply and demand. Until 1960,
the economic perspective of consumer behavior and the models relied on the assumption that all
consumers were always rational in their purchases, so they would always purchase the product
which brings the highest satisfaction. Before 1979, scientists developed three models (the Economic
Model, the Learning Model, and the Psychoanalytic and Sociological Model) [13]. During that time,
consumers had conservative behavior, buying the same products. Moreover, consumer
behavior is an emergent phenomenon that has evolved along with human development. This
diversification of needs is the main cause that stimulated researchers to study consumer
behavior. In 2008, the economic and financial crisis that spread all over the world led consumers
to think twice before buying a product. Consumers were buying fewer products, meaning their
behavior tended towards a defensive one. Online marketing began to take a role in purchases.
People started to use the Internet to order products and to compare the prices and characteristics
of products of interest. At present, customers face diverse offers, which makes customers’
decision processes more complicated and their behavior more unpredictable.

VI. Big Data Advantages

Big Data has numerous advantages for society, science and technology, depending on how it is
used for human beings. Some of the advantages (Marr, 2013) are described below:

A. Understanding and Targeting Customers

This is one of the biggest and most publicized areas of big data use today. Here, big data is used
to better understand customers and their behaviors and preferences. Companies are keen to
expand their traditional data sets with social media data, browser logs as well as text analytics
and sensor data to get a more complete picture of their customers. The big objective, in many
cases, is to create predictive models.

B. Understanding and Optimizing Business Process

Big data is also increasingly used to optimize business processes. Retailers are able to optimize
their stock based on predictions generated from social media data, web search trends and weather
forecasts. One particular business process that is seeing a lot of big data analytics is supply chain
or delivery route optimization. HR business processes are also being improved using big data
analytics. This includes the optimization of talent acquisition.

C. Improving Science and Research

Science and research is currently being transformed by the new possibilities big data brings.
Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the
world’s largest and most powerful particle accelerator. Experiments to unlock the secrets of our
universe – how it started and works - generate huge amounts of data. The CERN datacentre has
65,000 processors to analyse its 30 petabytes of data.

D. Improving Healthcare and Public Health

The computing power of big data analytics enables us to decode entire DNA strings in minutes
and will allow us to find new cures and better understand and predict disease patterns. Just think
of what happens when all the individual data from smart watches and wearable devices can be
applied to millions of people and their various diseases. The clinical trials of the future
won’t be limited by small sample sizes but could potentially include everyone.

E. Optimizing Machine and Device Performance


Big data analytics help machines and devices become smarter and more autonomous. For
example, big data tools are used to operate Google’s self-driving car. The Toyota Prius is fitted
with cameras, GPS as well as powerful computers and sensors to safely drive on the road without
the intervention of human beings. Big data tools are also used to optimize energy grids using data
from smart meters. We can even use big data tools to optimize the performance of computers and
data warehouses.

F. Improving Security and Law Enforcement

Big data is applied heavily in improving security and enabling law enforcement. The revelations
are that the National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist
plots (and maybe spy on us). Others use big data techniques to detect and prevent cyber-attacks.
Police forces use big data tools to catch criminals and even predict criminal activity, and credit
card companies use big data to detect fraudulent transactions.

IV. BIG DATA IN GOVERNMENTS

In today's information society, data is the fuel that powers several entities, and they are rapidly
transforming the way people live and work. Big data is treated as the next frontier of innovation,
competition and productivity [10]. That is why big data is drawing so much attention and being
applied in an increasing number of sectors of the global economy. In the private sector, the big
data is already being widely used in many industries such as logistics, healthcare, retail,
manufacturing and financial services. A study conducted by [11] found that the first
companies to adopt analytical techniques for big data presented far superior results compared
to their competitors, becoming much more productive than those who entrusted their strategies
to specialists and experienced professionals. Although the private sector leads the pace of
adoption of big data, this technology can also be widely used in the context of the public sector.
Recognizing the impact of big data applications on society, governments in some countries have
begun to invest heavily in initiatives aimed at developing this technology. The United States, for
example, in March 2012 invested more than $200 million in a big data research and
development initiative to improve the tools and techniques needed to access, organize and gain
knowledge from a large amount of data [12]. In 2014, in the UK, the government allocated £73
million to fund studies to leverage the potential of the big data in the country's public
administration [13]. In Australia, the government launched the Australian Public Service for Big
Data Strategies in 2013. The purpose of this program was to outline the potential of the big data
analysis to increase the value of the national government's and the Australian people's
information assets [14]. The Japanese government has allocated £87.5 million for the research
and development of big data projects, including a project to develop a high-speed network
infrastructure with 400 Gbps (gigabits per second) capacity and another for the
development of data analysis [15]. In France, the Ministry of Digital Affairs was created in 2014.
This Ministry published a bill on the Digital Republic, outlining the general orientation for the
big data policy for France [16]. Therefore, big data technology creates opportunities and
challenges for governments. Opportunities include generating efficient analyzes for significant
improvement in government service delivery, using real-time information for e-government
experiences, monitoring and visualizing government performance for public decision-making in
a dynamic and participatory manner; And producing useful insights for the modernization of
governments [17]. The challenges of big data for the government are institutional and technical.
According to [18], institutional challenges include the creation of a governance structure to
efficiently address some key issues: standardization of data structures, enabling the
interoperability of information; privacy guarantees to gain the trust of citizens who share
information; Data sharing and inter-organization linkages for creating custom systems. Reference
[19] stated that the technical challenges are exemplified by shortage of skilled workers, the
underdevelopment of relevant software tools, the integration of multiple sources and data
formats, and the storage and access of data.

A SNEAK-PEEK INTO THE FUTURE

The government agencies should formulate strategies to harness the capabilities of big data. The
strategies should be designed to highlight key opportunities and challenges that big data will
bring to government agencies. The strategy should aim to assist agencies to take advantage of
these opportunities and realize the potential benefits of these new technologies. Big data allows
for more focused and evidence-based policy design and service implementation that in turn
allows citizens to interact with the government in a personalized and seamless way. A successful
big data strategy is expected to assist in realizing each of the priority areas which are:
a. Better services delivery — the usage of big data analytics will allow agencies to provide
personalized services that are designed to meet citizens’ needs and preferences. For example, the
government agencies can identify the individuals or groups who are eligible for certain
entitlements without needing them to apply for it explicitly.

b. Efficient government operations — by employing big data for predictive analysis government
agencies will be able to assess risk and feasibility, and they will be more effective in detecting
fraud and error. This will lead to enhanced productivity, as the government can direct
more resources into the projects which have more impact and greater confidence in the outcome.

c. Increased Collaboration — the usage of big data analytics and its related technologies will
enable the government agencies to collaborate with industry, academia, non-government
organizations and other interested parties locally and internationally. This will help increase
knowledge, kick-start ideas, spark innovation, generate growth, and formulate better
decisions and solutions which meet the needs of local as well as international governments. Also,
as the government agencies pursue big data technologies, it will open up a channel for
collaboration between different agencies that will strengthen existing networks and help develop
new partnerships.

CHALLENGES FOR GOVERNMENT AGENCIES

Along with the massive opportunities, big data brings with itself many challenges which the
government agencies have to carefully address. The volume of data is already enormous and is
increasing rapidly every day. Due to the proliferation of internet-connected devices, the velocity
of data generation is very high. In addition, the variety of data being generated is diversifying,
and the agencies' capability to capture and process this data is limited. Current software and
hardware, technologies, architectures, and management and analysis approaches are not enough
to cope with the massive amount of data, and agencies have to change the way they think, plan,
work, govern, process, manage and report on data if they want to harness the full power of big
data. Some of the biggest challenges which government agencies will face are:

A. Challenges related to privacy and security


The government and its agencies are committed to safeguarding the privacy of their citizens. In
doing so the agencies should set clear boundaries for the usage of personal information.
Government agencies, when collecting or managing citizens' data, must comply with the
applicable acts, regulations and other laws. The agencies should be careful to maintain public
confidence in the government as a secure repository and custodian of citizen information. The use of big data
will introduce an additional layer of complexity in terms of management of information security
risks. Big data sources, the transport and delivery systems within and across agencies, and the
end points for this data will all become targets of interest for hackers, both local and international
and will have to be protected. Many governments have open governance policies, under which
they might release large amounts of machine-readable data; this could lead to a disaster, as the
data may provide important information to unfriendly state and non-state actors, whether local or
international. This security threat will need to be understood very well and carefully managed.
Big data gets its power by combining a number of different datasets which can be linked and
analyzed as a whole to reveal new patterns, trends and insights. But before employing big data
analytics techniques, government agencies should earn public trust and citizens’ confidence. The
government should convince its citizens that this linking and analysis of data can happen without
compromising privacy rights. Government
must ensure that citizens' and public trust in government agencies is maintained. As the
volume of data which government holds is increasing, the public trust can easily be affected by
leakage of data or information into the public domain. Government and the concerned agencies
have to consider this fact at the very beginning and should develop the secure systems with
security as the prime requirement. Communication and collaboration with industry experts is an
important first step to begin with.

B. Challenges related to managing and sharing of data

Accessible information is the lifeblood of a robust democracy and a productive economy [2].
Everyone who has worked on data analytics will agree that for data to hold any value, it should
fulfil three basic requirements: it must be discoverable, accessible and usable. Government
agencies should realize that these requirements become even more significant when the focus is
big data. Agencies should work to achieve these requirements while complying with privacy
laws. The processes by which government agencies collect, handle, utilize and manage data
should comply with all applicable legislative regulations, and the government must focus on
ensuring that its agencies analyze data in a lawful and meaningful manner. In most cases, big
data is utilized for complex analysis and to support decision making. For this to happen, the data
sets considered should be accurate, complete and available on time. For these reasons, data
movement and accessibility across government agencies should be handled via standardized
APIs, formats and metadata. Following standardized methods will produce better quality data
and phenomenal benefits as far as business intelligence is concerned. Many governments these
days follow an open governance policy under which data sets are made available to the public.
Governments should also extend these 'open' initiatives to make data standardized, available
and, of course, open to flow within and between government agencies. This will result in
increased collaboration among government agencies. But again, all
this should be done to the extent permissible by privacy laws. Inter-agency collaboration brings
opportunities for agencies to innovate and compete in the marketplace. In order to promote the
use of government data across agencies, the government can require agencies to maintain
information asset logs for information available to the public. These logs can be used by other
agencies, or within the same agency, which will increase the reusability of data. In the process of
making data availability more fluid, the government and its agencies should consider that new
technologies can provide an opportunity for unfriendly state and non-state actors to gather or
extract sensitive information from seemingly harmless data. There have been concerns that by
correlating separate data sets that are individually de-identified, personal information can be
extracted; this is called the 'mosaic effect' [3]. Agencies should take proper measures to ensure
that anonymity and privacy are not compromised.

C. Challenges related to technology

The evolution of big data and its ability to tackle complex analysis of huge data sets, which seemed impossible in the past, can be credited to recent advancements in technology. If government agencies want to embrace big data analytics, it will put a lot of stress on current hardware and software, which is already burdened with tasks like processing, analyzing and storing data. Government agencies should manage these technological challenges and gaps efficiently in order to gain the full benefits of big data. In particular, the relevant technologies include low-cost storage technologies and arrays, in-memory processing and cloud-based solutions, combined with new software products supported by high-performance servers and processing platforms. The advancement of cloud computing over the last few years is the prime factor in the wide adoption of big data analytics. With features like tiered storage, compute and analytics capabilities, widely available software solutions and granularity in selecting services, cloud computing has made big data analytics approachable. Agencies should utilize the flexibility offered by cloud computing to store, manage and perform computational analysis on ever-expanding data in a manner that was not possible before. Government agencies need networks that provide sufficient bandwidth to transfer the data and enable real-time analysis in a cloud environment. The feasibility of big data projects should be determined based upon the agency's business needs rather than preconceived technological preferences. Government agencies should collaborate with vendors and developers of big data solutions, and work with them to come up with more capable tools and technologies that can ease the challenges of big data analysis.

D. Challenges related to skills

Big data is relatively young and fairly complex, because of which government agencies have to hire employees with new and diverse skill sets. This includes people skilled in science, engineering, technological research, mathematics and statistics, analysis and interpretation, business sense and understanding of the underlying nature of the business process or policy intent, and above all, creativity. It is unlikely that agencies can find individuals with all of these skills, so they have to form teams of specialists to achieve the results they desire from their data analysis efforts. Industry experts have reported a major shortage of data scientists and people with experience in big data analytics. As reported by Gartner, by 2015 big data demand will reach 4.4 million jobs globally, with two thirds of these positions remaining unfilled [4]. Currently, there is a shortage of degrees and courses whose curriculum is focused on big data analytics. The industry is searching for qualified people skilled in big data analytics, and educational institutions and universities are being pushed to come up with courses that educate and train people for this line of business. It is evident that it will take some more time before government agencies can find enough skilled people for big data analytics. This leaves government agencies to seek support from the rich experience and expertise in big data available outside of government. Agencies should also strive for better collaboration with academia, industry (scientists, vendors and solution providers) and independent research institutions. This will allow government agencies to retain skilled resources and attract more expertise.

Encryption and Decryption Algorithms

A. Homomorphic encryption algorithm:

Homomorphic encryption is an encryption scheme which allows specific types of computations to be carried out on ciphertexts and generates an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintexts. RSA was the first encryption algorithm with a homomorphic property: multiplying RSA ciphertexts corresponds to multiplying the underlying plaintexts (a minimal sketch of this appears after the group axioms below). In algebra, a homomorphism is a structure-preserving map between two algebraic structures such as groups [3]. A group is a set G together with an operation (called the group law of G) that combines any two elements a and b to form another element, denoted a ∘ b. To qualify as a group, the set and operation, (G, ∘), must satisfy four requirements known as the group axioms [3]:

• Closure: For all a and b in G, the result of the operation, a ∘ b, is also in G.

• Associativity: For all a, b and c in G, (a ∘ b) ∘ c = a ∘ (b ∘ c).

• Identity element: There exists an element e in G such that for every element a in G, the equality e ∘ a = a ∘ e = a holds. Such an element is unique, and thus one speaks of the identity element; it is often written as e or 1.

• Inverse element: For each a in G, there exists an element b in G such that a ∘ b = b ∘ a = e, where e is the identity element.

The result of an operation may depend on the order of the operands. In other words, combining element a with element b need not yield the same result as combining element b with element a; the equation a ∘ b = b ∘ a may not always hold. It always holds in the group of integers under addition, because a + b = b + a for any two integers (commutativity of addition). Groups for which the commutativity equation a ∘ b = b ∘ a always holds are called abelian (commutative) groups.
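To make the multiplicative homomorphic property of RSA concrete, the following is a minimal sketch in Python using textbook (unpadded) RSA with tiny, purely illustrative parameters; it is not a secure implementation and is not taken from any of the cited works.

# Textbook RSA with toy parameters, for illustration only (not secure).
p, q = 61, 53
n = p * q                    # RSA modulus
e = 17                       # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 7, 11
c1, c2 = encrypt(m1), encrypt(m2)

# Multiplying ciphertexts corresponds to multiplying plaintexts (mod n):
assert decrypt((c1 * c2) % n) == (m1 * m2) % n

Here the map m ↦ m^e mod n preserves multiplication, which is exactly the structure-preserving behaviour described by the group axioms above.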
B. Verifiable computation algorithm (outsourced computing):

A verifiable computation (VC) scheme permits a computationally weak client to outsource its data and computations to the cloud without worrying unduly about security issues, because the client can verify that the results returned by the cloud are correct. This makes it a very secure approach for a user who wants to send data to a cloud storage device [4].

C. Message digest algorithm:

The MD5 function is a cryptographic algorithm that takes an input of arbitrary length and produces a message digest that is 128 bits long. The digest is sometimes also called the "hash" or "fingerprint" of the input. MD5 is used in many situations where a potentially long message needs to be condensed into a short, fixed-length value, such as the creation and verification of digital signatures. How MD5 works [5]: the algorithm first divides the input into blocks of 512 bits each. Sixty-four bits are inserted at the end of the last block to record the length of the original input; if the last block is then shorter than 512 bits, extra bits are 'padded' to fill it. Next, each block is divided into 16 words of 32 bits each, denoted M0 ... M15. MD5 uses a buffer that is made up of four words that are each 32 bits long [5]. MD5 further uses a table K that has 64 elements; element number i is denoted Ki. The table is computed beforehand to speed up the computations, and its elements are derived from the mathematical sine function [5]:

Ki = floor(abs(sin(i + 1)) × 2^32),  for i = 0, 1, ..., 63
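As a quick check of this formula, the table can be generated in a few lines of Python; the first value agrees with the constant table published in RFC 1321, whose first entry is 0xd76aa478:

import math

# K[i] = floor(|sin(i + 1)| * 2^32), with i running from 0 to 63.
K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

assert K[0] == 0xd76aa478   # matches the first published MD5 constant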


Four auxiliary functions

In addition, MD5 uses four auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word. They apply the logical operators and, or, not and xor to the input bits.

F(X,Y,Z) = (X and Y) or (not(X) and Z)

G(X,Y,Z) = (X and Z) or (Y and not(Z))

H(X,Y,Z) = X xor Y xor Z

I(X,Y,Z) = Y xor (X or not(Z))

The contents of the four buffers (A, B, C and D) are then mixed with the words of the input, using the four auxiliary functions (F, G, H and I). There are four rounds, each involving 16 basic operations. After all rounds have been performed, the buffers A, B, C and D contain the MD5 digest of the original input.
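The four auxiliary functions translate directly into bitwise operations on 32-bit words. The sketch below shows them in Python and then obtains the finished digest through the standard library, since the full 64-step compression loop is rarely hand-written in practice:

import hashlib

MASK = 0xFFFFFFFF  # keep every result within 32 bits

def F(x, y, z): return ((x & y) | (~x & z)) & MASK
def G(x, y, z): return ((x & z) | (y & ~z)) & MASK
def H(x, y, z): return (x ^ y ^ z) & MASK
def I(x, y, z): return (y ^ (x | ~z)) & MASK

# The complete algorithm is available in the standard library:
digest = hashlib.md5(b"hello world").hexdigest()
print(digest)   # 128-bit digest printed as 32 hexadecimal characters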

D. Key rotation algorithm:


In the key rotation approach, data is stored in or sent to the cloud environment encrypted under a shared symmetric key, and that key is replaced (rotated) periodically to limit the impact of a key compromise. The data owner grants permission to encrypt the data with the shared symmetric key, so a cloud user can store data in encrypted form: the cloud service provider, with the support of the data owner, converts plain data to cipher data and stores it in the cloud. In the same way, with the owner's permission, data is retrieved from the cloud data centre in encrypted form and returned to the user in decrypted form.
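As an illustration of the general idea (not the specific scheme described above), the sketch below uses the third-party Python "cryptography" package, whose MultiFernet helper encrypts with a primary symmetric key while still accepting data encrypted under older keys, and can re-encrypt stored ciphertexts during a rotation:

from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet.generate_key()      # key currently shared between owner and cloud user
new_key = Fernet.generate_key()      # replacement key introduced during rotation

stored = Fernet(old_key).encrypt(b"sensitive record")     # data kept encrypted in the cloud

# Rotation: the first key is used for new encryptions, older keys still decrypt.
rotator = MultiFernet([Fernet(new_key), Fernet(old_key)])
rotated = rotator.rotate(stored)                          # re-encrypted under new_key

assert rotator.decrypt(rotated) == b"sensitive record"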
E. DES Algorithm:

The Data Encryption Standard (DES) is the name of the Federal Information Processing Standard (FIPS) which describes the Data Encryption Algorithm (DEA). DES is a symmetric cryptosystem, specifically a 16-round Feistel cipher, with a 64-bit block size and a 56-bit key used during execution. When communicating, both sender and receiver must know the same secret key, which can be used to encrypt and decrypt the message, or to generate and verify a Message Authentication Code (MAC). DES can also be used for single-user encryption, such as storing files on a hard disk in encrypted form. In the Cipher Block Chaining (CBC) mode of operation, each plaintext block is XORed with the previous ciphertext block before being encrypted, making every ciphertext block dependent on all the previous blocks; this means that in order to recover the plaintext of a particular block, you need to know the key, that block's ciphertext and the ciphertext of the previous block. The first block to be encrypted has no previous ciphertext, so the plaintext is XORed with a 64-bit value called the initialization vector. If data is transmitted over a network or phone line and a transmission error occurs, the error affects the decryption of the corrupted block and of the block that follows it. This mode of operation is more secure than ECB (Electronic Code Book) because the extra XOR step adds one more layer to the encryption process.
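A minimal sketch of CBC-mode DES, assuming the third-party PyCryptodome package (the key, IV and message below are illustrative only; DES itself is obsolete and is shown purely because the text describes it):

from Crypto.Cipher import DES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = get_random_bytes(8)        # 64-bit key block (56 effective key bits)
iv = get_random_bytes(8)         # 64-bit initialization vector for the first block

plaintext = b"data to be stored or sent over the network"

cipher = DES.new(key, DES.MODE_CBC, iv=iv)
ciphertext = cipher.encrypt(pad(plaintext, DES.block_size))   # each block chained to the previous one

decipher = DES.new(key, DES.MODE_CBC, iv=iv)
recovered = unpad(decipher.decrypt(ciphertext), DES.block_size)
assert recovered == plaintext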

F. Rijndael Encryption Algorithm:

Rijndael is the block cipher algorithm chosen by the National Institute of Standards and Technology (NIST) as the Advanced Encryption Standard (AES), superseding the Data Encryption Standard (DES). Rijndael is a standard symmetric key encryption algorithm used to encrypt sensitive information. It is an iterated block cipher: the encryption or decryption of a block of data is accomplished by iterating a specific transformation (a round function). Rijndael also defines a method to generate a series of sub-keys from the original key; the generated sub-keys are used as input to the round function. Rijndael was designed with the following three criteria in mind:

1. Resistance against all known attacks;

2. Speed and code compactness on a wide range of platforms;

3. Design simplicity

Rijndael offers a strong combination of security, performance, efficiency, ease of implementation and flexibility. The algorithm supports key sizes of 128, 192 and 256 bits, with data handled in 128-bit blocks. Rijndael uses a variable number of rounds depending on the key/block size: 9 rounds (plus a final round) if the key/block size is 128 bits, 11 if it is 192 bits, and 13 if it is 256 bits. Rijndael is a substitution/linear-transformation cipher. It uses three distinct invertible uniform transformations (layers): the linear mixing transform, the non-linear transform and the key addition transform. Even before the first round, a simple key addition layer is performed, which adds to security; thereafter there are Nr-1 rounds and then a final round. The intermediate result on which the transformations operate, from the start until the completion of the entire process, is called the State.

Algorithm

• Key Expansion: round keys are derived from the cipher key using Rijndael's key schedule.

• Initial Round: AddRoundKey, in which each byte of the state is combined with the round key using bitwise XOR.

• Rounds: SubBytes (a non-linear byte substitution), ShiftRows (a cyclic shift of the state rows), MixColumns (a mixing of the state columns) and AddRoundKey; the final round omits the MixColumns step.
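A minimal usage sketch of Rijndael/AES, again assuming PyCryptodome; the library derives the total round count (10, 12 or 14) from the key length automatically, so the same code covers all three key sizes:

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

data = b"confidential record to protect"

for key_len in (16, 24, 32):                      # 128-, 192- and 256-bit keys
    key = get_random_bytes(key_len)
    nonce = get_random_bytes(12)

    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)          # authenticated encryption mode
    ciphertext, tag = cipher.encrypt_and_digest(data)

    verifier = AES.new(key, AES.MODE_GCM, nonce=nonce)
    assert verifier.decrypt_and_verify(ciphertext, tag) == data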

LITERATURE
1. Diao Zhe ; Wang Qinghong ; Su Naizheng ; Zhang Yuhan, “Study on Data Security
Policy Based on Cloud Storage”, IEEE, 2017. With the growing popularisation of cloud
computing, cloud storage technology has received more and more attention as an
emerging network storage technology that extends and develops cloud computing
concepts. The cloud computing environment depends on user services such as high-speed
storage and retrieval provided by the cloud computing system; meanwhile, data security
is an urgent problem to solve for cloud storage technology. In recent years there have
been more and more malicious attacks on cloud storage systems, and data leaks from
cloud storage systems have also occurred frequently. Cloud storage security concerns the
user's data security. The purpose of this paper is to achieve data security of cloud storage
and to formulate a corresponding cloud storage security policy, combining the results of
existing academic research, analyzing the security risks of user data in cloud storage, and
approaching the relevant security technology based on the structural characteristics of
cloud storage systems.
2. Ruimin Hu, “Key Technology for Big Visual Data Analysis in Security Space and Its
Applications”, IEEE, 2016. "Seeing (surveillance) but not understanding (poor security
performance)" is a common problem in most security systems. Although in recent years
significant progress has been achieved in biometric identification technology, the
progress of single technologies does not dramatically improve the overall performance or
solve the system-level problems of social security. Currently in addition to improvements
of single technology, the following system-level technical bottlenecks must be solved to
improve the overall performance of social security: 1. Blind spots of perception exist in
surveillance area due to the non-overlap of surveillance sensors, 2. The performance of
single identification technology decreases sharply in complex surveillance scenarios such
as poor lighting conditions or disguise, 3. Traditional warning technologies become
invalid due to the multi-stage non-stationary evolution feature of complex events. Three
challenges listed above closely relate to three scientific problems in the analysis
technology of big visual data on three levels: sensing data, identification technology and
pattern recognition. Our study aims at (1) exploiting the complete mapping mechanism
between physical space and multivariate sensing spaces to fill-in the blind spots of
sensing data, (2) exploring the correlative mechanism of multi-modal objects in
multivariate sensing spaces to improve the analytical performance from single
identification technology, (3) studying the spatial-temporal evolution mechanism in the
entire lifecycle of complex events to extend pattern recognition from local space and
time. The security space is the physical space with comprehensive protective ability,
which includes ubiquitous sensing, reliable identification, penetrating trend analysis and
approaching danger warning at any time, at any location, for any object and for any
behavior. Aiming to build the theory of security space, this study is divided into three
levels: information acquisition and perceptional computation, scene analysis and
evolution prediction, resource scheduling and system applications. Then it is further
divided into five tasks: (1) task 1: visual object recognition and big data based
identification, (2) task 2: situational awareness of groups and multi-scale revolution, (3)
task 3: semantic analysis of scenes and correlative computation in multivariate spaces, (4)
task 4: big scale visual retrieval and security risk analysis, (5) task 5: Warning system of
the security space and its applications. Task 1 corresponds to the first level because
sensing and identification of objects is the basis for analysis. Task 2 and 3 correspond to
the second level since group behavior, events, scene and their evolution are crucial for the
prediction and warning of security events. Task 4 and 5 correspond to the third level of
this study in order to develop the high performance computing platform, to conduct
system evaluation and application demonstration. Big visual data contains massive high-
dimensional sensing data, implying the complicated relationship among social objects. In
fact, in the world of data, the spatial-temporal relationship between the big data objects is
more essential than the causal relationship, and these private and implicit relationships
compose the core values of big data social analysis. Only when the analysis of
individuals, groups and scenes in big visual data is based on the core element of social
security analysis, that is, "social structure and social activities", can it support the
strategic shift of the urban security system from after-the-fact investigation to advance
warning. The overall purpose of this study is to build the big data analysis system in
security space, realizing the intelligent big visual data system which supports data
analysis of hundred billions of feature data, billions of image data, millions of visual
sensing terminals. The expected achievements on the warning and protection system for
large spaces can reach the international leading level. Based on the achievements above,
the project plans to develop 10 intelligent big data analysis products of 3 categories, and
the expected benefits of industrialization promotion can reach 100 billion, which
promotes the upstream and downstream industry to realize economic benefits 3 billion.
Also we strive to become the internationally leading industry in the field of big security
data analysis.
3. Yulong Meng ; Guisheng Yin ; Ke Geng, “Ontology Security Strategy of Security Data
Integrity”, IEEE, 2009. There exists a variety of heterogeneous data that need to be
integrated within a secure domain. How to integrate the security data from various data
source safely is a challenge for database researchers. In this paper, we introduce a new
model for security data integrity, Weight-value-Extended MLS (WEMLS), which is
developed based on MLS. Compared with MLS, WEMLS can not only guarantee the
security and integrity of data accessing, but also provide a more flexible mechanism for
data accessing. Finally, WEMLS has been verified by proposing a security data
integration model based on ontology.
4. Lin Liu, “Security and Privacy Requirements Engineering Revisited in the Big Data Era”,
IEEE, 2016. The world is witnessing exploding interests in big data and its analytics, which
makes data a critical and relevant asset more than ever for today's organizations. However,
without a proper understanding of security and privacy requirements, data can hardly be used
to its full power. Data owners and users can be at risk in terms of information security and
privacy when sharing their data and running the analytics procedures without a proper security
infrastructure in place. In this talk, current security requirements engineering concepts,
approaches and techniques are discussed, research challenges are identified in this setting, the
security models and strategies for open data repositories are examined, and requirements
engineering techniques needed in this age of rapid and disruptive changes are speculated.
5. Seungmin Kang ; Bharadwaj Veeravalli ; Khin Mi Mi Aung, “A Security-Aware Data
Placement Mechanism for Big Data Cloud Storage Systems”, IEEE, 2016. Public clouds
have become an attractive candidate to meet the ever-growing storage demands.
However, storing data in public clouds increases data retrieval time and threat level for
data security. These challenges drive the need for intelligent methods that solve the data
placement problem to achieve high performance while satisfying the security
requirement. In this paper, we propose a novel approach for data placement in cloud
storage systems addressing the above challenges. With the security constraint, we first
formulate the data placement problem as a linear programming model that minimizes the
total retrieval time of the data, which is divided and distributed over storage nodes. We then
develop a heuristic algorithm namely Security-awarE Data placement mechanism for
cLOUd storage Systems (SEDuLOUS) to solve the problem. We demonstrate the
effectiveness of the proposed algorithm through comprehensive simulations. The
simulation results show that the proposed algorithm significantly reduces the retrieval
time by up to 20% for the random-network-topology systems and 19% for the Internet2-
topology system compared to baseline methods, which consider only the security
requirement.
6. Hui Zou, “Protection of Personal Information Security in the Age of Big Data”, IEEE,
2016. As a production factor, big data is being intensively integrated with the
development of various industries. The consequent security issues concerning personal
information are becoming increasingly severe. This article aims to analyze the major
issues which arise with big data, including the breach of personal information, the
potential security risks and the users' reduced personal information control rights. It is
followed by the analysis of their causes: users' weak awareness of information security,
the under-developed laws and regulations concerning the security of personal
information, the laggard technology for security maintenance of big data, and motivations
for profits. In order to solve these problems, firstly, the awareness of information security
and users' ability of security maintenance should be strengthened. In addition, the
security management system of big data should be improved with deepening reform of
security technology. Furthermore, legislation and regulation system should be reinforced.
Lastly, standardized codes of practice and self-discipline pact of the industry should be
reinforced to ensure the security of personal information in the age of big data.
7. M. Hsu, “Parallel computing with distributed shared data”, IEEE, 1989. Summary form
only given. The issue of ease of using shared data in a data-intensive parallel computing
environment is discussed. An approach is investigated for transparently supporting data
sharing in a loosely coupled parallel computing environment, where a moderate to a large
number of individual computing elements are connected via a high-bandwidth network
without necessarily physically sharing memory. A system called VOYAGER is discussed
which serves as the underlying system facility that supervises the distributed shared
virtual memory. VOYAGER allows shared-data parallel applications to take advantage of
parallel and distributed processing with relative ease. The application program merely
maps the shared data onto its virtual address space, replicates itself on distributed
machines and spawns appropriate execution threads; the threads would automatically be
given coordinated access to the shared data distributed in the network. Multiple
computation threads migrate and populate the processors of a number of computing
elements, making use of the multiple processors to achieve a high degree of parallelism.
The low-level resource management chores are made available once and for all in the
underlying facility VOYAGER, usable by many different data-intensive applications.
8. Chung-Ming Chen ; Soo-Young Lee, “Replication of uniformly accessed shared data for
large-scale data-parallel algorithms”, IEEE, 1995. In this paper we show how to minimize
data sharing overhead required in most parallel algorithms, especially in Large-Scale
Data-Parallel (LSDP) algorithms on a 2D mesh. Two specific issues are addressed in this
study. One is what the optimal group size is, i.e., how many PEs should share a copy of
shared data. The other is where the replicated data should be allocated.
9. S.H. Wong ; J.P. Contillo ; S. Molina ; B.E. Mase, “Web-based data management and
sharing of bottlenose dolphin photographic identification information”, IEEE, 2002. The
photo identification of bottlenose dolphins is a non-invasive technique of tracking individual
animals. It can provide data on the social affiliations, habitat usage, population structure,
behavior, and birth/death rates of bottlenose dolphin in a study area. The Southeast Fisheries
Science Center (SEFSC) has been conducting a long-term photo-ID program in Biscayne Bay,
Florida. Previously, the data were managed in a desktop database, which was limited in
capability in terms of data backup, security (e.g., computer virus), access to multiple users within
the organization, and information sharing among peers with data from adjacent study areas.
Information sharing can enhance our knowledge on the migration, habitat association, and
population structure of the animals in a larger geographic area. To improve data management
and access of bottlenose dolphin photo-ID information in the SEFSC database, and to facilitate
efficient data sharing with other photo-ID research groups, we have developed an Oracle
database application to enable data entry, update, categorization, search, and download of
dolphin photo-ID information collected by SEFSC and our partners in south Florida. The data
include scanned digital photos and associated attributes. The system has multiple levels of
access rights: system administrator, partners, and the general public. These users access the
system through Web browsers. Routine tasks such as data backup and virus protection are
managed by the Oracle database administrator, thus allowing the fisheries researchers to focus
on data quality control and analysis. The system also minimizes the effort experienced by
fisheries researchers to maintain the Web site because information (images and attributes)
submitted to the system can be shared instantly by the designated user community.
10. Muhammad Baqer Mollah and Md. Abul Kalam Azad, “Secure Data Sharing and
Searching at the Edge of Cloud-Assisted Internet of Things”, IEEE, 2017. This article
proposes an efficient data-sharing scheme that allows smart devices to share securely data with
others at the edge of cloud-assisted Internet of Things (IoT). We also propose a secure searching
scheme to search desired data within own/shared data on storage.
11. Qinlong Huang, Licheng Wang,Yixian Yang, “Secure and Privacy-Preserving Data
Sharing and Collaboration in Mobile Healthcare Social Networks of Smart Cities”,
Hindawi, 2017. Mobile healthcare social networks (MHSN) integrated with connected
medical sensors and cloud-based health data storage provide preventive and curative
health services in smart cities. The fusion of social data together with real-time health data
facilitates a novel paradigm of healthcare big data analysis. However, the collaboration of
healthcare and social network service providers may pose a series of security and privacy
issues. In this paper, we propose a secure health and social data sharing and collaboration
scheme in MHSN. To preserve data privacy, we realize secure and fine-grained health
data and social data sharing with attribute-based encryption and identity-based broadcast
encryption techniques, respectively, which allows patients to share their private personal
data securely. In order to achieve enhanced data collaboration, we allow the healthcare
analyzers to access both the reencrypted health data and the social data with authorization
from the data owner based on proxy reencryption. Specifically, most of the health data
encryption and decryption computations are outsourced from resource-constrained
mobile devices to a health cloud, and the decryption of the healthcare analyzer incurs a
low cost. The security and performance analysis results show the security and efficiency
of our scheme.
12. Elisa Bertino, “Data Security – Challenges and Research Opportunities”, Springer, pp 9-
13, 2014. The proliferation of web-based applications and information systems, and recent
trends such as cloud computing and outsourced data management, have increased the
exposure of data and made security more difficult. In this paper we briefly discuss open
issues, such as data protection from insider threat and how to reconcile security and
privacy, and outline research directions.
13. Mazhar Ali, Revathi Dhamotharan, Eraj Khan, Samee U. Khan,
Athanasios V. Vasilakos, Keqin Li, Albert Y. Zomaya, “SeDaSC: Secure Data Sharing in
Clouds”, IEEE, pp 1-11, 2015. Cloud storage is an application of clouds that liberates
organizations from establishing in-house data storage systems. However, cloud storage
gives rise to security concerns. In case of group-shared data, the data face both cloud-
specific and conventional insider threats. Secure data sharing among a group that
counters insider threats of legitimate yet malicious users is an important research issue. In
this paper, we propose the Secure Data Sharing in Clouds (SeDaSC) methodology that
provides: 1) data confidentiality and integrity; 2) access control; 3) data sharing
(forwarding) without using compute-intensive reencryption; 4) insider threat security;
and 5) forward and backward access control. The SeDaSC methodology encrypts a file
with a single encryption key. Two different key shares for each of the users are generated,
with the user only getting one share. The possession of a single share of a key allows the
SeDaSC methodology to counter the insider threats. The other key share is stored by a
trusted third party, which is called the cryptographic server. The SeDaSC methodology is
applicable to conventional and mobile cloud computing environments. We implement a
working prototype of the SeDaSC methodology and evaluate its performance based on
the time consumed during various operations. We formally verify the working of SeDaSC
by using high-level Petri nets, the Satisfiability Modulo Theories Library, and a Z3
solver. The results proved to be encouraging and show that SeDaSC has the potential to
be effectively used for secure data sharing in the cloud.
14. Yunchuan Sun, Junsheng Zhang, Yongping Xiong, Guangyu Zhu, “Data Security and
Privacy in Cloud Computing”, Hindawi, pp 1-10, 2014. Data security has consistently
been a major issue in information technology. In the cloud computing environment, it
becomes particularly serious because the data is located in different places even in all the
globe. Data security and privacy protection are the two main factors of user’s concerns
about the cloud technology. Though many techniques on the topics in cloud computing
have been investigated in both academics and industries, data security and privacy
protection are becoming more important for the future development of cloud computing
technology in government, industry, and business. Data security and privacy protection
issues are relevant to both hardware and software in the cloud architecture. This study is
to review different security techniques and challenges from both software and hardware
aspects for protecting data in the cloud and aims at enhancing the data security and
privacy protection for the trustworthy cloud environment. In this paper, we make a
comparative research analysis of the existing research work regarding the data security
and privacy protection techniques used in the cloud computing.
15. Wesley Lourenco Barbosa, Antonio Manoel Batista da Silva, “Analysis of Studies on
Applications and Challenges in Implementation of Big Data in the Public
Administration”, International Journal on Recent and Innovation Trends in Computing
and Communication, vol 5, Issue 5, pp 751-759, 2017. The big data – huge amount of
data – era has begun and is redefining how organizations deal with information. While
the business sector has been using and developing big data applications for nearly a
decade, only recently the public sector has begun to adopt this technology to gather
information and use it as a decision support tool. Few organizations have so many
advantages to harness the potential of the big data as the public service agencies, because
of the large amount of data they have access to. However, due to the current theme, there
is still a long way to go. Some papers have presented ways in which governments are
using big data to better serve their citizens. Nevertheless, there is still much uncertainty
about the real possibility of improving government operations through this technology.
By analyzing the literature related to the topic, this paper aims to present the areas of
public administration that can take advantage of data analysis. In addition, raising the
challenges and resilience faced for the insertion of the big data in the public sector is also
important. In this way, we seek to understand how public organizations can take
advantage of the data that they have, to manage and improve the efficiency and offer of
public services to society. The results showed that several organizations can modernize
and improve their operations with data analysis, but challenges need to be overcome.
Therefore, the big data presents itself as an important tool for the modernization of public
administration.
16. Varun Singh, Ishan Srivastava, Vishal Johri, “Big Data and the Opportunities and
Challenges for Government Agencies”, (IJCSIT) International Journal of Computer
Science and Information Technologies, Vol. 5, Issue 4, pp 5821-5824, 2014. The data
stored with Government agencies is an asset both for the nation and the government. This
data which is a potential source of opportunity brings with itself many challenges and the
government agencies like many other corporations should be able to seize the opportunity
that this big data presents and utilize it to develop policies and deliver services to citizens.
In this paper the authors have tried to highlight the opportunities presented to government
bodies in relation to the use of big data and some other emerging tools and technologies
which facilitate better appreciation of what this powerful yet unutilized information can
tell us and also the potential threats it might pose.
17. Thi Mai Le and Shu-Yi Liaw, “Effects of Pros and Cons of Applying Big Data
Analytics to Consumers’ Responses in an E-Commerce Context”, MDPI, 2017. The era
of Big Data analytics has begun in most industries within developed countries. This new
analytics tool has raised motivation for experts and researchers to study its impacts to
business values and challenges. However, studies which help to understand customers’
views and their behavior towards the applications of Big Data analytics are lacking. This
research aims to explore and determine the pros and cons of applying Big Data analytics
that affects customers’ responses in an e-commerce environment. Data analyses were
conducted on a sample of 273 respondents from Vietnam. The findings show that
information search, recommendation system, dynamic pricing, and customer services had
significant positive effects on customers’ responses. Privacy and security, shopping
addiction, and group influences were found to have significant negative effects on
customers’ responses. Customers’ responses were measured at intention and behavior
stages. Moreover, positive and negative effects simultaneously presented significant
effect on customers’ responses. Each dimension of positive and negative factors had
different significant impacts on customers’ intention and behavior. Specifically,
information search had a significant influence on customers’ intention and improved
customers’ behavior. Shopping addiction had a drastic change from intention to behavior
compared to group influences and privacy and security. This study contributes to
improving the understanding of customers' responses in the big data era. This could play
an important role in developing a sustainable consumer market. E-vendors can rely on
Big Data analytics, but overuse may have some negative implications.
18. Lenka Venkata Satyanarayana, “A Survey on Challenges and Advantages in Big Data”,
IJCST Vol. 6, Issue 2, 2015. Big data, which refers to the data sets that are too big to be
handled using the existing database management tools, are emerging in many important
applications, such as Internet search, business informatics, social networks, social media,
genomics, and meteorology. Big data presents a grand challenge for database and data
analytics research. This paper discusses the exciting activities addressing the big data
challenge. The central theme is to connect big data with people in various ways. In
particular, this paper showcases recent progress in user preference understanding,
context-aware, on-demand data mining using crowd intelligence, summarization and
explorative analysis of large data sets, and privacy-preserving data sharing and analysis.
The primary purpose of
this paper is to provide an in-depth analysis of different platforms available for
performing big data analytics. This paper surveys different hardware platforms available
for big data analytics and assesses the advantages and drawbacks of Big Data.
19. Abawajy, J. et al., 2014 [20] studied Large Iterative Multitier Ensemble (LIME)
classifiers for the security of big data. They evaluated the performance of LIME
classifiers for the detection of malware using big data, and found that Random Forest
outperformed the other base classifiers for the malware data set, while Decorate improved
its outcomes more than the other ensemble meta-classifiers did.
20. Shen, Y. et al. [19] studied scalable multi-criteria clustering for big data security
intelligence applications. They note that attack attribution and situational understanding
are considered critical aspects of effectively dealing with emerging, increasingly
sophisticated internet attacks. The security data mining process typically involves a
considerable number of features interacting in a non-obvious way, which makes it
inherently complex. To deal with this challenge, they introduce MR-TRIAGE, a set of
distributed algorithms built on MapReduce that can perform scalable multi-criteria data
clustering on large security data sets and identify complex relationships hidden in massive
datasets. This approach effectively clusters any type of security events (e.g., spam emails,
spear-phishing attacks, etc.) that share at least some commonalities among a number of
predefined features.
21. Saravanan, S. et al., 2015 [44] studied information security in big data using
encryption and decryption. They show that a dynamic secret-based encryption scheme can
be designed to secure the data stored in the database; to reduce its complexity, a
transmission sequence is proposed to update the dynamic encryption key, and the SH3
encryption and decryption algorithm is used. This provides more security to the data. To
implement big data security they used a student database management system to apply
the encryption and decryption technique. The user has permission for adding, deleting,
updating and searching data. Data is encrypted before being stored in the database; in a
search operation the data is decrypted and then searched in the database. When new data
is added, it is provided as normal data bytes by the user, encrypted and then stored in the
database. If the user retrieves data from the database, it is decrypted and then shown to the
end user. The user is also given the option to encrypt and decrypt the data.
22. Zhao, C. et al. [18] studied secure group communication over big data. They propose
an efficient group key transfer protocol without an online key generation centre, based on
a 3-LSSS. It does not need to update existing two-party secrets when adding or deleting
users. In the key transfer phase, the protocol only needs one inner product over a finite
field for each member. The security of the protocol is based on the Diffie-Hellman secret
sharing scheme and a linear secret sharing scheme. The proposed protocol is suitable for
many group-oriented applications over big data.
23. Sharif, A. et al. [16] studied current security threats and prevention measures relating
to cloud services, Hadoop, concurrent processing and big data. They note that cloud
services are widely used across the globe to store and analyze big data, and these days the
news seems full of stories about breaches of these services, resulting in the exposure of
huge amounts of private data. They studied the current security threats to cloud services,
big data and Hadoop, and analyzed a newly proposed big data security system based on
the EnCore system, which uses sticky policies, as well as the existing security
architectures of Verizon and Twilio, presenting the preventive measures taken by these
firms to minimize security concerns.
24. Nadar, S. et al. [23] studied security in big data using Secure Sockets Layer (SSL) and
Transport Layer Security (TLS). Yang, B. et al., 2016 [10] studied a family-based
meta-model framework to solve big data problems, especially in the field of network
traffic data analysis. The key contributions of the proposed framework are: 1) a per-row
disposability computing philosophy that makes in-memory computing of the big data
matrix possible; 2) a per-row updatable matrix that opens the door for exact computation
of the learning model, rather than estimates and sampling; 3) the abstraction of the data
into a meta-model matrix that enables privacy-preserved data sharing and rapid model
selection. The experiment will then be expanded to a distributed environment where
MapReduce and Hadoop will be used to comprehensively understand the value of the
meta-model framework.
25. Xiao, C. et al. [9] studied the security of multimedia big data in multimedia sensing.
Firstly, by analysing the resource constraints in multimedia sensing systems, it is found
that the problem of resource constraints will be aggravated. An optimization model for
data encryption under resource constraints is then proposed. Secondly, a general-purpose
lightweight speed-adjustable video encryption scheme is proposed, which can reduce the
computation overload on weak nodes and achieve a balance between performance and
security. Thirdly, a series of selective encryption control models are proposed, in which
the improved model is built based on the SAFE encryption scheme. The proposed models
can optimize resource distribution across multiple streams at the system level. Finally,
experimental analyses show that the performance of the presented schemes is effective
enough to support real-time applications.
26. Kang, S. et al. [6] [15] studied how storing data in public clouds increases data
retrieval time and the threat level for data security. They addressed the data placement
problem in cloud storage systems considering the security constraint. They formulated the
problem as a linear programming model that minimizes the total retrieval time of data,
which is divisible into multiple chunks of arbitrary size. They developed an efficient
heuristic algorithm, namely a security-aware data placement mechanism for cloud storage
systems, that uses the T-colorings approach to address the security issue. The results
demonstrate the effectiveness of the proposed algorithm, reducing the retrieval time by up
to 20% for random topologies and 19% for the Internet2 topology, while achieving the
best performance and guaranteeing data security.
27. Gai, K. et al. [8] [12] [13] studied a security-aware efficient mass distributed storage
approach for cloud systems in big data. They studied cloud data security and privacy
information protection, focusing mainly on distributed storage security.

3. MOTIVATION FOR THE WORK

Because unstructured big data has to be captured from multiple heterogeneous sources, various organizations as well as government agencies are still struggling to handle such varieties of data. Big data veracity ensures that the data used are trusted, authentic and protected from unauthorized access and modification. Data that is used for analytics must be secured throughout its collection, processing and storage. It is not possible to store the huge volume of big data in centralized storage, so distributed storage is used to store unstructured big data. By storing unstructured big data in a distributed environment, parallel processing can be applied, which is essential for most organizations and government agencies. The unstructured data on social sites changes continuously, so security issues arise around continuously changing data and around capturing it for processing and analytics. Many times it is better to move code rather than data across the various nodes, which in turn raises further security issues [1]-[20].

STORAGE AND PROCESSING ISSUES:-

When a huge amount of data is used for data analytics, it is necessary to identify who has the right to access the data, at what time, and from which location. The personal information of a person, once processed together with external data sets, generates new facts about that person. This information needs to be protected so that it adds value to the business of the organization and government agencies [1]-[3].

4. RESEARCH OBJECTIVE

The assumptions made for unstructured big data security in data analytics are as follows: i) the unstructured big data used in analytics is circular in nature; ii) an organization or government agency is located at the centre and there are many interlinked circular areas; iii) big data are placed randomly following a uniform distribution; iv) each organization and government agency has the capability to capture and store the data used for data analytics; v) each organization and government agency transmits or publishes the data to various organizations or sites; vi) each organization and government agency works on useful data drawn from a huge collection of infrastructure data that is gathered in various ways; vii) data is used by organizations and government agencies for parallel processing. We shall work with encrypted big data (structured or unstructured) stored at different locations, as this will be more secure than direct storage and direct sharing, and we shall also study to what extent it resists various attacks. The present study is being carried out to fulfil the following objectives:

1. To analyse the measures that protect data from access by others during the exchange of valuable information and confidential data by an organization and government agencies.

2. To analyse the management of data collection and the extraction of data relevant to the organization for processing.

3. To investigate the proper security measures to be put in place before the owner of big data delivers the data sets for processing and storage by a third party.

4. To provide valuable suggestions regarding information security culture and top management support in an organizational context involving big data.

5. To investigate data analytics use cases to upgrade the system design manufacturing model by design decomposition with big data.

6. To provide valuable suggestions to improve the data transfer rate with security and memory management, avoiding direct access by cloud users to private data.

Now, a detailed description of some parameters and the scope of the work is as follows:

Security in big data is a great challenge, and research efforts need to be focused in this direction. One reason is that today's big data is used in various fields, so the protection of that data is a major concern. Moreover, volume, variety and velocity are major concerns in big data security, and they also come into play when handling the privacy of secret information during transmission. Encrypted, distributed data storage can be beneficial for optimizing the security and privacy of confidential information.

REFERENCES

1. Diao Zhe ; Wang Qinghong ; Su Naizheng ; Zhang Yuhan, “Study on Data Security
Policy Based on Cloud Storage”, IEEE, 2017.
2. Ruimin Hu, “Key Technology for Big Visual Data Analysis in Security Space and Its
Applications”, IEEE, 2016.
3. Yulong Meng ; Guisheng Yin ; Ke Geng, “Ontology Security Strategy of Security Data
Integrity”, IEEE, 2009.
4. Lin Liu, “Security and Privacy Requirements Engineering Revisited in the Big Data Era”,
IEEE, 2016.
5. Seungmin Kang ; Bharadwaj Veeravalli ; Khin Mi Mi Aung, “A Security-Aware Data
Placement Mechanism for Big Data Cloud Storage Systems”, IEEE, 2016.
6. Hui Zou, “Protection of Personal Information Security in the Age of Big Data”, IEEE,
2016.
7. M. Hsu, “Parallel computing with distributed shared data”, IEEE, 1989.
8. Chung-Ming Chen ; Soo-Young Lee, “Replication of uniformly accessed shared data for
large-scale data-parallel algorithms”, IEEE, 1995.
9. S.H. Wong ; J.P. Contillo ; S. Molina ; B.E. Mase, “Web-based data management and
sharing of bottlenose dolphin photographic identification information”, IEEE, 2002.
10. Muhammad Baqer Mollah and Md. Abul Kalam Azad,, “Secure Data Sharing and
Searching at the Edge of Cloud-Assisted Internet of Things”,IEEE, 2017.
11. Qinlong Huang, Licheng Wang,Yixian Yang, “Secure and Privacy-Preserving Data
Sharing and Collaboration in Mobile Healthcare Social Networks of Smart Cities”,
Hindawi, 2017.
12. Elisa Bertino, “Data Security – Challenges and Research Opportunities”, Springer, pp 9-
13, 2014.
13. Mazhar Ali, Student Member, Revathi Dhamotharan, Eraj Khan, Samee U. Khan,
Athanasios V. Vasilakos, Keqin Li, Albert Y. Zomaya, “SeDaSC: Secure Data Sharing in
Clouds”, IEEE, pp 1-11, 2015.
14. Yunchuan Sun, Junsheng Zhang, Yongping Xiong, Guangyu Zhu, “Data Security and
Privacy in Cloud Computing”, Hindawi, pp 1-10, 2014.
15. Wesley Lourenco Barbosa, Antonio Manoel Batista da Silva, “Analysis of Studies on
Applications and Challenges in Implementation of Big Data in the Public
Administration”, International Journal on Recent and Innovation Trends in Computing
and Communication, vol 5, Issue 5, pp 751-759, 2017
16. Varun Singh, Ishan Srivastava, Vishal Johri, “Big Data and the Opportunities and
Challenges for Government Agencies”, (IJCSIT) International Journal of Computer
Science and Information Technologies, Vol. 5, Issue 4, pp 5821-5824, 2014.
17. Thi Mai Le and Shu-Yi Liaw, “Effects of Pros and Cons of Applying Big Data
Analytics to Consumers’ Responses in an E-Commerce Context”, MDPI, 2017.
18. Lenka Venkata Satyanarayana, “A Survey on Challenges and Advantages in Big Data”,
IJCST, Vol. 6, Issue 2, 2015.
19. Jemal H. Abawajy, Andrei Kelarev and Morshed Chowdhury, “Large Iterative Multitier
Ensemble Classifiers for Security of Big Data”, IEEE, pp 352-363, 2014.
20. Yun Shen, Olivier Thonnard, “MR-TRIAGE: Scalable Multi-Criteria Clustering for Big
Data Security Intelligence Applications”, IEEE International Conference on Big Data,
pp 627-635, 2014.
21. S.K. Sarvanan, G. Rekha, “Information security in big data using encryption and
decryption”, International Research Journal of Computer Science, vol 2, pp 65-70, 2015.
22. Changxiao Zhao, Jianhua Liu, “Novel Group Key Transfer Protocol for Big Data
Security”, IEEE, pp 161-165, 2015.
23. Ather Sharif, Sarah Cooney, Shengqi Gong, Drew Vitek, “Current Security Threats and
Prevention Measures Relating to Cloud Services, Hadoop Concurrent Processing, and Big
Data”, IEEE International Conference on Big Data (Big Data), pp 1865-1870, 2015.
24. Shivasakthi Nadar, Narendra Gawai, “Unstructured Big Data Processing: Security Issues
and Countermeasures”, International Journal of Science & Engineering Research, ISSN
2229-5518, Vol. 6, Issue 3, March 2015.
25. Chen Xiao, Lifeng Wang, Zhu Jie, Tiemeng Chen, “A Multilevel Intelligent Selective
Encryption Control Model for Multimedia Big Data Security in Sensing System with
Resource Constraints”, IEEE 3rd International Conference on Cyber Security and Cloud
Computing, pp 148-153, 2016.
26. Seungmin Kang, Bharadwaj Veeravalli, Khin Mi Mi Aung, “A Security-Aware Data
Placement Mechanism for Big Data Cloud Storage Systems”, IEEE 2nd International
Conference on Big Data Security on Cloud, IEEE International Conference on High
Performance and Smart Computing, IEEE International Conference on Intelligent Data
and Security, pp 327-332, 2016.
27. Pedro H.B. Las-Casas, Vinicius Santos Dias, Wagner Meira Jr. and Dorgival Guedes, “A
Big Data Architecture for Security Data and its Application to Phishing Characterization”,
IEEE 2nd International Conference on Big Data Security on Cloud, IEEE International
Conference on High Performance and Smart Computing, IEEE International Conference
on Intelligent Data and Security, pp 36-41, 2016.
28. Keke Gai, Meikang Qiu, Hui Zhao, “Security-Aware Efficient Mass Distributed Storage
Approach for Cloud Systems in Big Data”, IEEE 2nd International Conference on Big
Data Security on Cloud, IEEE International Conference on High Performance and Smart
Computing, IEEE International Conference on Intelligent Data and Security, pp 140-145,
2016.
29. Keke Gai, Meikang Qiu, Sam Adam Elnagdy, “Security-Aware Information Classification
Using Supervised Learning for Cloud-Based Cyber Risk Management in Financial Big
Data”, IEEE 2nd International Conference on Big Data Security on Cloud, IEEE
International Conference on High Performance and Smart Computing, IEEE International
Conference on Intelligent Data and Security, pp 197-202, New York, USA, 2016.
30. M. Qiu, K. Gai, B. Thuraisingham, L. Tao, and H. Zhao, “Proactive user-centric secure
data scheme using attribute-based semantic access control for mobile cloud in financial
industry”, Future Generation Computer Systems, pp 1, 2016.
