Big Data Unit I
Syllabus
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big Data
platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big Data
technology components, Big Data importance and applications, Big Data features – security, compliance,
auditing and protection, Big Data privacy and ethics, Big Data Analytics, Challenges of conventional
systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs reporting,
modern data analytic tools.
Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source. Big data was originally associated with three key concepts (the drivers of big data): volume, variety, and velocity. The analysis of big data presents challenges in sampling, which previously allowed only for observation and sampling. A fourth concept, veracity, refers to the quality or insightfulness of the data. Without sufficient investment in expertise for big data veracity, the volume and variety of data can produce costs and risks that exceed an organization's capacity to create and capture value from big data.
Current usage of the term big data tends to refer to the use of predictive analytics, user behavior analytics,
or certain other advanced data analytics methods that extract value from big data, and seldom to a
particular size of data set. "There is little doubt that the quantities of data now available are indeed large,
but that's not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find
new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business
executives, medical practitioners, advertising and governments alike regularly meet difficulties with large
data-sets in areas including Internet searches, fintech, healthcare analytics, geographic information
systems, urban informatics, and business informatics. Scientists encounter limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics simulations, biology, and
environmental research.
The size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices, cheap and numerous information-sensing Internet of Things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×2^60 bytes) of data were generated. An IDC report predicted that the global data volume would grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020, and that by 2025 there would be 163 zettabytes of data. According to IDC, global spending on big data and business analytics (BDA) solutions was estimated to reach $215.7 billion in 2021, while a Statista report forecasts the global big data market to grow to $103 billion by 2027. In 2011, McKinsey & Company reported that if US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year; that in the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data; and that users of services enabled by personal-location data could capture $600 billion in consumer surplus. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.
Relational database management systems and desktop statistical software packages used to visualize data
often have difficulty processing and analyzing big data. The processing and analysis of big data may
require "massively parallel software running on tens, hundreds, or even thousands of servers".What
qualifies as "big data" varies depending on the capabilities of those analyzing it and their tools.
Furthermore, expanding capabilities make big data a moving target. "For some organizations, facing
hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options.
For others, it may take tens or hundreds of terabytes before data size becomes a significant
consideration."
In its true essence, Big Data is not something that is completely new or exclusive to the last two decades. Over the course of centuries, people have been trying to use data analysis and analytics techniques to support their decision-making process. The ancient Egyptians around 300 BC already tried to capture all existing ‘data’ in the Library of Alexandria. Moreover, the Roman Empire used to carefully analyze statistics of its military to determine the optimal distribution for its armies.
However, in the last two decades, the volume and speed with which data is generated have changed beyond the measure of human comprehension. The total amount of data in the world was 4.4 zettabytes in 2013, and was set to rise steeply to 44 zettabytes by 2020. To put that in perspective, 44 zettabytes is equivalent to 44 trillion gigabytes. Even with the most advanced technologies today, it is impossible to analyze all this data. The need to process these increasingly large (and unstructured) data sets is how traditional data analysis transformed into ‘Big Data’ over the last decade.
To illustrate this development over time, the evolution of Big Data can roughly be sub-divided into three
main phases. Each phase has its own characteristics and capabilities. In order to understand the context of
Big Data today, it is important to understand how each phase contributed to the contemporary meaning of
Big Data.
Big Data phase 1.0
Database management and data warehousing are considered the core components of Big Data Phase 1. This phase provides the foundation of modern data analysis as we know it today, using well-known techniques such as database queries, online analytical processing (OLAP) and standard reporting tools.
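As a rough sketch of the Phase 1 style of working, the following uses Python's built-in sqlite3 module to run a standard reporting query against a tiny warehouse-style table; the orders table and its columns are invented for illustration.

import sqlite3

# In-memory database standing in for a traditional data warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical fact table of orders.
cur.execute("CREATE TABLE orders (region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [("north", 120.0), ("south", 80.0), ("north", 45.5)])

# A standard reporting query: total revenue per region.
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()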
Big Data phase 2.0
From a data analysis, data analytics, and Big Data point of view, HTTP-based web traffic introduced a massive increase in semi-structured and unstructured data. Besides the standard structured data types, organizations now needed to find new approaches and storage solutions to deal with these new data types in order to analyze them effectively. The arrival and growth of social media data greatly amplified the need for tools, technologies and analytics techniques that were able to extract meaningful information out of this unstructured data.
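A small sketch of what ‘semi-structured’ means in practice: web or social-media events arrive as JSON whose fields vary from record to record, so the code has to tolerate missing attributes instead of relying on a fixed schema. The event layout below is made up for illustration.

import json

# Two hypothetical clickstream events; the second lacks the "user" field,
# which a rigid relational schema would reject outright.
raw_events = [
    '{"type": "click", "user": "u42", "url": "/home"}',
    '{"type": "search", "query": "big data"}',
]

for line in raw_events:
    event = json.loads(line)
    # .get() copes with attributes that may or may not be present.
    print(event["type"], event.get("user", "anonymous"))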
Big Data phase 3.0
Although web-based unstructured content is still the main focus for many organizations in data analysis, data analytics, and Big Data, new possibilities for retrieving valuable information are emerging from mobile devices.
Mobile devices not only make it possible to analyze behavioral data (such as clicks and search queries), but also to store and analyze location-based data (GPS data). With the advancement of these mobile devices, it is possible to track movement, analyze physical behavior and even health-related data (such as the number of steps you take per day). This data provides a whole new range of opportunities, from transportation to city design and health care.
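As one small example of analyzing location-based data, the sketch below computes the great-circle (haversine) distance between two GPS fixes, a building block for estimating distance travelled; the coordinates and function name are arbitrary choices for illustration.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two GPS points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is roughly the Earth's mean radius

# Two arbitrary fixes a phone might report a few minutes apart.
print(round(haversine_km(52.3702, 4.8952, 52.3791, 4.9003), 3), "km")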
Simultaneously, the rise of sensor-based, internet-enabled devices is increasing data generation like never before. In what has famously been coined the ‘Internet of Things’ (IoT), millions of TVs, thermostats, wearables and even refrigerators now generate zettabytes of data every day. And the race to extract meaningful and valuable information out of these new data sources has only just begun.
Big Data architecture
Data sources: All big data solutions start with one or more data sources. Examples include application data stores (such as relational databases), static files produced by applications (such as web server log files), and real-time data sources (such as IoT devices).
Batch processing: Because the data sets are so large, often a big data solution must process
data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for
analysis. Usually these jobs involve reading source files, processing them, and writing the output to
new files.
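A stripped-down sketch of such a batch job in plain Python, assuming hypothetical newline-delimited JSON source files: it reads the sources, filters and aggregates them, and writes the prepared output to a new file, mirroring the read-process-write pattern described above.

import json
from collections import Counter
from pathlib import Path

def run_batch(source_dir, output_file):
    # Read raw event files, keep only purchases, aggregate per product, write results.
    totals = Counter()
    for path in Path(source_dir).glob("*.jsonl"):        # hypothetical source layout
        with path.open() as f:
            for line in f:
                event = json.loads(line)
                if event.get("type") == "purchase":      # filter
                    totals[event["product"]] += 1        # aggregate
    with open(output_file, "w") as out:                  # write output to a new file
        json.dump(dict(totals), out)

# run_batch("raw_events/", "purchases_per_product.json")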
Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This might be a
simple data store, where incoming messages are dropped into a folder for processing. However,
many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics.
Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is
then written to an output sink.
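The sketch below imitates the ingestion-plus-stream-processing pair with Python's standard queue module: a queue buffers incoming messages, and a consumer filters and aggregates them before writing to an output sink (here just a file). A real deployment would use a message broker such as Kafka; the message shape is invented.

import json
import queue

# The queue stands in for the message ingestion store / buffer.
messages = queue.Queue()

# Producer side: real-time sources drop messages into the buffer.
for reading in ({"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 38.9}):
    messages.put(reading)

# Consumer side: filter and aggregate, then write to an output sink.
alerts = []
while not messages.empty():
    msg = messages.get()
    if msg["temp"] > 30.0:           # filtering step
        alerts.append(msg)           # (trivial) aggregation step

with open("alerts_sink.json", "w") as sink:   # hypothetical output sink
    json.dump(alerts, sink)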
Analytical data store: Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools.
Analysis and reporting: The goal of most big data solutions is to provide insights into the
data through analysis and reporting. To empower users to analyze the data, the architecture may
include a data modeling layer, such as a multidimensional OLAP cube or a tabular data model.
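A minimal sketch of serving prepared data for analysis and reporting: a pandas DataFrame stands in for the analytical data store, and pivot_table plays the role of a very small cube-style summary. The column names are hypothetical.

import pandas as pd

# Processed data as it might sit in an analytical store (hypothetical columns).
processed = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "revenue": [120.0, 80.0, 45.5, 60.0],
})

# A tiny tabular/OLAP-style report: revenue by region and product.
report = processed.pivot_table(index="region", columns="product",
                               values="revenue", aggfunc="sum")
print(report)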
Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results straight to a report or
dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
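Orchestration tools such as Azure Data Factory or Apache Oozie express a workflow as an ordered set of dependent steps. The plain-Python sketch below only illustrates that idea (run extract, then transform, then load, stopping on failure); it is not how those products are actually configured.

# Hypothetical pipeline steps; each returns True on success.
def extract():
    print("pulling source files")
    return True

def transform():
    print("filtering and aggregating")
    return True

def load():
    print("loading into the analytical store")
    return True

def run_pipeline(steps):
    # Run steps in order, as an orchestrator would, halting at the first failure.
    for step in steps:
        if not step():
            print(step.__name__, "failed; halting workflow")
            return False
    return True

run_pipeline([extract, transform, load])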
Big Data features – security, compliance, auditing and protection
Granular auditing is a must in Big Data security, particularly after an attack on your system. Organizations need to create a cohesive audit view following any attack, provide a full audit trail, and ensure easy access to that data in order to cut down incident response time.
Audit information integrity and confidentiality are also essential. Audit information should be stored
separately and protected with granular user access controls and regular monitoring. Make sure to keep
your Big Data and audit data separate, and enable all required logging when you're setting up auditing (in
order to collect and process the most detailed information possible).
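A small sketch of keeping audit data separate and detailed, using Python's standard logging module: audit events go to their own file with enough context (who, what, which resource) to reconstruct a trail after an incident. The file name and event fields are made up.

import logging

# Dedicated audit logger writing to its own file, kept apart from the Big Data itself.
audit = logging.getLogger("audit")
audit.setLevel(logging.INFO)

handler = logging.FileHandler("audit_trail.log")   # hypothetical audit store
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
audit.addHandler(handler)

# Record who did what to which resource; detailed logs cut incident response time.
audit.info("user=alice action=read dataset=customer_records rows=1500")
audit.warning("user=bob action=export dataset=customer_records denied=permission")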
Compliance is always a headache for enterprises, and even more so when you're dealing with a constant
deluge of data. It's best to tackle it head-on with real-time analytics and security at every level of the
stack. Organizations apply Big Data analytics by using tools such as Kerberos (a computer network security protocol that authenticates service requests between two or more trusted hosts across an untrusted network, like the internet, using secret-key cryptography and a trusted third party to authenticate client-server applications and verify users' identities), Secure Shell (SSH, also known as Secure Socket Shell, a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network), and Internet Protocol Security (IPsec) to get a handle on real-time data.
Once you're doing that, you can mine logging events, deploy front-end security systems such as routers
and application-level firewalls, and begin implementing security controls throughout the stack at the
cloud, cluster, and application levels.
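As one concrete example of the tools listed above, the sketch below uses the third-party paramiko library to open an SSH connection and run a command over the encrypted channel; the host name, user name and key path are placeholders, not values from the text.

import paramiko

# Placeholder connection details; in practice these come from configuration.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname="analytics-node.example.com",
               username="dataops",
               key_filename="/home/dataops/.ssh/id_rsa")

# Everything sent over this channel is encrypted end to end.
stdin, stdout, stderr = client.exec_command("uptime")
print(stdout.read().decode())

client.close()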
Big Data privacy and ethics
Big data ethics, also known simply as data ethics, refers to systemizing, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data.
Data ethics is concerned with the following principles:
1. Ownership - Individuals own their own data.
2. Transaction transparency - If an individual's personal data is used, they should have transparent access to the algorithm design used to generate aggregate data sets.
3. Consent - If an individual or legal entity would like to use personal data, they need the informed and explicitly expressed consent of the data owner regarding what personal data moves to whom, when, and for what purpose.
4. Privacy - If data transactions occur, all reasonable effort needs to be made to preserve privacy.
5. Currency - Individuals should be aware of financial transactions resulting from the use of their personal data and of the scale of these transactions.
6. Openness - Aggregate data sets should be freely available.