BIG DATA
UNIT - 1 NOTES
BIG DATA
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
Big data is anything beyond the human and technical infrastructure needed to support its storage, processing, and analysis.
Variety: Data can be structured, semi-structured, or unstructured. Data stored in a database is an example of structured data. HTML data, XML data, email data, and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc., are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We have moved from simple desktop applications like payroll applications to real-time processing applications.
Volume: Volume can be in terabytes, petabytes, or even zettabytes.
Gartner Glossary: Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.
Data generates information, and from information we can draw valuable insight. Digital data can be broadly classified into structured, semi-structured, and unstructured data.
1. Unstructured data: This is data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat-room transcripts, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
2. Semi-structured data: This is data that does not conform to a strict data model but carries tags or markers that give it some structure; as noted above, HTML, XML, email, and CSV files are examples.
3. Structured data: This is data in an organized form (for example, rows and columns) that can be used easily by a computer program; relationships exist between entities of the data, such as classes and their objects. About 10% of an organization's data is in this format. Data stored in databases is an example of structured data.
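To make the three categories concrete, here is a small, purely illustrative Python sketch; the record contents and field names are invented for demonstration and are not taken from any particular system.

# Purely illustrative Python records for the three forms of digital data
# (field names and values below are made up for demonstration).

# Structured: fits a fixed schema of rows and columns.
structured_record = {"student_id": 101, "name": "Asha", "marks": 87}

# Semi-structured: self-describing tags give some structure, but no rigid schema.
semi_structured_record = "<student><name>Asha</name><marks>87</marks></student>"

# Unstructured: free text with no data model at all.
unstructured_record = ("Asha scored well this semester and plans to take the "
                       "advanced analytics elective next year.")

for kind, value in [("structured", structured_record),
                    ("semi-structured", semi_structured_record),
                    ("unstructured", unstructured_record)]:
    print(kind, "->", value)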
The "Internet of Things" and its widely ultra-connected nature are leading to a burgeoning rise in
big data. There is no dearth of data for today's enterprise. On the contrary, they are mired in data
and quite deep at that. That brings us to a key question: data is widely available, but what is scarce is the ability to draw valuable insight from it.
Some examples of Big Data Analytics in different areas such as retail, IT infrastructure, and social media:
• Retail: As mentioned earlier, Big Data presents many opportunities to improve sales and
marketing analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main
life-event situations.
• Marriage, when people tend to buy many new products
• Divorce, when people buy new products and change their spending habits
• Pregnancy, when people have many new things to buy and an urgency to buy them.
This analysis helped Target manage its inventory, knowing that there would be demand for specific products and that it would likely vary by month over the coming nine- to ten-month cycles.
• IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big Data projects, which rely on large data sets with unconventional data structures.
• One of the main benefits of Hadoop is that it employs a distributed file system, meaning it can
use a distributed cluster of servers and commodity hardware to process large amounts of data.
Some of the most common examples of Hadoop implementations are in the social media space,
where Hadoop can manage transactions, give textual updates, and develop social graphs among
millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its
ecosystem of tools to manage this high volume.
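To make the MapReduce paradigm concrete, here is a minimal single-machine sketch in plain Python (not actual Hadoop code): a map step emits (word, 1) pairs and a reduce step sums the counts per word, which is the same division of work that Hadoop distributes across a cluster, each node processing its local blocks of the input.

from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key (word) and sum the values."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = [
    "big data needs distributed processing",
    "hadoop brings processing to the data",
]
print(reduce_phase(map_phase(docs)))
# e.g. {'big': 1, 'data': 2, 'processing': 2, ...}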
CHARACTERISTICS OF DATA
1. Composition: The composition of data deals with the structure of data, that is, the sources of
data, the granularity, the types, and the nature of data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as
is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?",
"What are the events associated with this data?" and so on. Small data (data as it existed prior to
the big data revolution) is about certainty. It is about known data sources; it is about no major
changes to the composition or context of data.
Most often we have answers to queries like why this data was generated, where and when it was
generated, exactly how we would like to use it, what questions will this data be able to answer,
and so on. Big data is about complexity. Complexity in terms of multiple and unknown datasets,
in terms of exploding volume, in terms of speed at which the data is being generated and the
speed at which it needs to be processed and in terms of the variety of data (internal or external,
behavioral or social) that is being generated.
1970s and before was the era of mainframes. The data was essentially primitive and structured.
Relational databases evolved in the 1980s and 1990s; this was the era of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have since led to an onslaught of structured, unstructured, and multimedia data.
CHALLENGES WITH BIG DATA
Data volume: Data today is growing at an exponential rate, and this high tide of data will continue to rise. The key questions are: "Will all this data be useful for analysis?", "Do we work with all of this data or only a subset of it?", "How will we separate the knowledge from the noise?", and so on.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. However, it further complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term decisions, but other data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of
traditional database software tools. There is no explicit definition of how big the data set should
be for it to be considered big data. Data visualization (computer graphics) is becoming popular as
a separate discipline. There are very few data visualization experts.
The more data we have for analysis, the greater will be the analytical accuracy and the greater
would be the confidence in our decisions based on these analytical findings. The analytical
accuracy will lead to a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, developing new products and new services, and optimizing existing services.
Operational, transactional, or day-to-day business data is gathered from Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) systems, legacy systems, and several third-party applications. The data from these sources may differ in format.
This data is then integrated, cleaned up, transformed, and standardized through the process of
Extraction, Transformation, and Loading (ETL).
The transformed data is then loaded into the enterprise data warehouse (available at the
enterprise level) or data marts (available at the business unit/ functional unit or business process
level).
Business intelligence and analytics tools are then used to enable decision making through ad-hoc queries, SQL, enterprise dashboards, data mining, Online Analytical Processing (OLAP), etc.
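As a rough illustration of the ETL flow described above, the sketch below uses only the Python standard library; the file name, field names, and table name are hypothetical, and a real enterprise load would use a dedicated ETL tool and a proper data warehouse rather than SQLite.

import csv
import sqlite3

# Extract: read operational data exported from a source system (hypothetical file).
with open("crm_customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and standardize (trim names, normalize country codes, drop bad rows).
cleaned = [
    {"name": r["name"].strip().title(), "country": r["country"].strip().upper()}
    for r in rows
    if r.get("name") and r.get("country")
]

# Load: write the standardized records into a warehouse-style table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS dim_customer (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO dim_customer (name, country) VALUES (:name, :country)", cleaned
)
conn.commit()
conn.close()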
Following are the differences that one encounters when dealing with traditional BI and big data.
In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales horizontally by scaling out (adding nodes) or scaling in (removing nodes), as compared to a typical database server that scales vertically.
In traditional BI, data is generally analyzed in an offline mode whereas in big data, it is analyzed
in both real-time streaming as well as in offline mode.
Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code), whereas big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
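The "move code to data" idea can be sketched with PySpark, assuming Spark is installed and the input path is a hypothetical location in a distributed file system: the filter and aggregation logic is shipped to the nodes holding each partition, and only the small aggregated result returns to the driver. Contrast this with a traditional BI setup, where the full dataset would first be moved to the server that runs the query.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("move-code-to-data").getOrCreate()

# The file lives in a distributed file system; each node reads its own partitions.
events = spark.read.json("hdfs:///logs/clickstream/")   # hypothetical path

# This filter + aggregation is serialized and executed where the data resides;
# only the aggregated result is brought back to the driver.
daily_counts = (
    events.filter(events.event_type == "purchase")
          .groupBy("event_date")
          .count()
)
daily_counts.show()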
Big data technologies are widely associated with many other technologies, such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT), all of which they massively augment. In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time data and batch data.
Before we start with the list of big data technologies, let us first discuss their broad classification. Big Data technology is primarily classified into the following two types:
Operational Big Data Technologies
This type of big data technology mainly includes the basic, day-to-day data that people used to process. Typically, operational big data includes data such as online transactions, social media activity, and data from any particular organization or firm, which is usually needed for analysis using software based on big data technologies. This data can also be referred to as raw data, used as the input for several Analytical Big Data Technologies.
Some specific examples of Operational Big Data Technologies are listed below:
○ Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
○ Online trading or shopping from e-commerce websites like Amazon, Flipkart, Walmart,
etc.
○ Online data on social media sites, such as Facebook, Instagram, WhatsApp, etc.
Analytical Big Data Technologies
Analytical Big Data is commonly referred to as an improved version of Big Data Technologies. This type of big data technology is a bit more complicated than operational big data. Analytical big data is mainly used when performance criteria are involved, and important real-time business decisions are made based on reports created by analyzing real operational data. This means that the actual investigation of big data that is important for business decisions falls under this type of big data technology.
Some common examples of Analytical Big Data Technologies are listed below:
○ Medical health records where doctors can personally monitor the health status of an
individual
○ Space mission databases, where every piece of information about a mission is very important
We can categorize the leading big data technologies into the following four sections:
○ Data Storage
○ Data Mining
○ Data Analytics
○ Data Visualization
Big data infrastructure is what it sounds like: The IT infrastructure that hosts your “big data.”
(Keep in mind that what constitutes big data depends on a lot of factors; the data need not be
enormous in size to qualify as “big.”)
More specifically, big data infrastructure entails the tools and agents that collect data, the
software systems and physical storage media that store it, the network that transfers it, the
application environments that host the analytics tools that analyze it and the backup or archive
infrastructure that backs it up after analysis is complete.
Lots of things can go wrong with these various components. Below are the most common
problems you may experience that delay or prevent you from transforming big data into value.
Disk I/O bottlenecks
Disk I/O bottlenecks are one common source of delays in data processing. Fortunately, there are some tricks that you can use to minimize their impact.
One solution is to upgrade your data infrastructure to solid-state disks (SSDs), which typically run faster. Alternatively, you could use in-memory data processing, which is much faster than relying on conventional storage.
SSDs and in-memory storage are more costly, of course, especially when you use them at scale.
But that does not mean you can’t take advantage of them strategically in a cost-effective way:
Consider deploying SSDs or in-memory data processing for workloads that require the highest
speed, but sticking with conventional storage where the benefits of faster I/O won’t outweigh the
costs.
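As a rough sketch of this tiering idea using Spark (the dataset paths are hypothetical), a hot, frequently queried dataset can be pinned in memory while a colder one is allowed to spill to disk:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiered-storage").getOrCreate()

hot = spark.read.parquet("s3a://analytics/last_7_days/")    # hypothetical paths
cold = spark.read.parquet("s3a://analytics/last_5_years/")

# Hot data: keep in memory because many queries reuse it and latency matters.
hot.persist(StorageLevel.MEMORY_ONLY)

# Cold data: allow spilling to disk; slower I/O is acceptable for occasional scans.
cold.persist(StorageLevel.MEMORY_AND_DISK)

print(hot.count(), cold.count())   # the first actions materialize the cached data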
Lack of scalability
If your data infrastructure can’t increase in size as your data needs grow, it will undercut your
ability to turn data into value.
At the same time, of course, you don’t want to maintain substantially more big data infrastructure
than you need today just so that it’s there for the future. Otherwise, you will be paying for
infrastructure you’re not currently using, which is not a good use of money.
One way to help address this challenge is to deploy big data workloads in the cloud, where you
can increase the size of your infrastructure virtually instantaneously when you need it, without
paying for it when you don’t. If you prefer not to shift all of your big data workloads to the
cloud, you might also consider keeping most workloads on-premise, but having a cloud
infrastructure set up and ready to handle “spillover” workloads when they arise—at least until
you can create a new on-premise infrastructure to handle them permanently.
Network bottlenecks
If your data is large in size, transferring it across the network can take time—especially if
network transfers require using the public internet, where bandwidth tends to be much more
limited than it is on internal company networks.
Paying for more bandwidth is one way to mitigate this problem, but that will only get you so far
(and it will cost you). A better approach is to architect your big data infrastructure in a way that
minimizes the amount of data transfer that needs to occur over the network. You could do this by,
for example, using cloud-based analytics tools to analyze data that is collected in the cloud,
rather than downloading that data to an on-premise location first. (The same logic applies in
reverse: If your data is born or collected on-premise, analyze it there.)
Data transformation challenges
Getting data from the format in which it is born into the format that you need to analyze it or
share it with others can be very tricky. Most applications structure data in ways that work best for
them, with little consideration of how well those structures work for other applications or
contexts.
This is why data transformation is so important. Data transformation allows you to convert data
from one format to another.
When done incorrectly—which means manually and in ways that do not control for data
quality—data transformation can quickly cause more trouble than it is worth. But when you
automate data transformation and ensure the quality of the resulting data, you maximize your
data infrastructure’s ability to meet your big data needs, no matter how your infrastructure is
constructed.
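Below is a minimal sketch of automated transformation with a simple quality gate; the event format, field names, and rules are hypothetical. Semi-structured JSON events are flattened into uniform rows, and malformed records are set aside rather than silently corrupting the output.

import csv
import json

def transform(record):
    """Flatten one raw JSON event into the target row format, or return None if invalid."""
    try:
        return {
            "user_id": int(record["user"]["id"]),
            "event": record["event"].lower(),
            "amount": round(float(record.get("amount", 0)), 2),
        }
    except (KeyError, TypeError, ValueError):
        return None   # quality control: reject malformed records

raw_lines = [
    '{"user": {"id": 42}, "event": "PURCHASE", "amount": "19.99"}',
    '{"user": {}, "event": "purchase"}',          # missing id -> rejected
]

rows, rejected = [], []
for line in raw_lines:
    row = transform(json.loads(line))
    (rows if row else rejected).append(row or line)

with open("events_clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "event", "amount"])
    writer.writeheader()
    writer.writerows(rows)

print(f"loaded {len(rows)} rows, rejected {len(rejected)}")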
Big data analytics is the often complex process of examining large and varied data sets - or big data - that have been generated by various sources such as eCommerce, mobile devices, social media and the Internet of Things (IoT). It involves integrating different data sources, transforming unstructured data into structured data, and generating insights from the data using specialized tools and techniques that spread data processing out over an entire network.

The amount of digital data that exists is growing at a fast pace, doubling every two years. Big data analytics is the solution that came with a different approach for managing and analyzing all of these data sources. While the principles of traditional data analytics generally still apply, the scale and complexity of big data analytics required the development of new ways to store and process the petabytes of structured and unstructured data involved.

The demand for faster speeds and greater storage capacities created a technological vacuum that was soon filled by new storage methods, such as data warehouses and data lakes, and nonrelational databases like NoSQL, as well as data processing and data management technologies and frameworks, such as open source Apache Hadoop, Spark, and Hive. Big data analytics takes advantage of advanced analytic techniques to analyze really big data sets that include structured, semi-structured and unstructured data, from various sources, and in different sizes from terabytes to zettabytes.
The Most Common Data Types Involved in Big Data Analytics Include:
● Web data. Customer level web behavior data such as visits, page views, searches,
purchases, etc.
● Text data. Data generated from sources of text including email, news articles, Facebook
feeds, Word documents, and more is one of the biggest and most widely used types of
unstructured data.
● Time and location, or geospatial data. GPS and cell phones, as well as Wi-Fi
connections, make time and location information a growing source of interesting data.
This can also include geographic data related to roads, buildings, lakes, addresses,
people, workplaces, and transportation routes, which have been generated from
geographic information systems.
● Real-time media. Real-time data sources can include real-time streaming or event-based
data.
● Smart grid and sensor data. Sensor data from cars, oil pipelines, windmill turbines, and
other sensors is often collected at extremely high frequency.
● Social network data. Unstructured text (comments, likes, etc.) from social network sites
like Facebook, LinkedIn, Instagram, etc. is growing. It is even possible to do link
analysis to uncover the network of a given user.
● Linked data. This type of data has been collected using standard Web technologies like HTTP, RDF, SPARQL, and URLs.
● Network data. Data related to very large social networks, like Facebook and Twitter, or
technological networks such as the Internet, telephone and transportation networks.
Big data analytics helps organizations harness their data and use advanced data science techniques and methods, such as natural language processing, deep learning, and machine learning, to uncover hidden patterns, unknown correlations, market trends, and customer preferences, identify new opportunities, and make more informed business decisions.
● Cost reduction. Cloud computing and storage technologies, such as Amazon Web
Services (AWS) and Microsoft Azure, as well as Apache Hadoop, Spark, and Hive can
help companies decrease their expenses when it comes to storing and processing large
data sets.
● Improved decision making. With the speed of Spark and in-memory analytics, combined
with the ability to quickly analyze new sources of data, businesses can generate
immediate and actionable insights needed to make decisions in real time.
● New products and services. With the help of big data analytics tools, companies can
more precisely analyze customer needs, making it easier to give customers what they
want in terms of products and services.
● Fraud detection. Big data analytics is also used to prevent fraud, mainly in the financial
services industry, but it is gaining importance and usage across all verticals.
The properties you should strive for in Big Data systems are as much about complexity as they
are about scalability. Not only must a Big Data system perform well and be resource-efficient, it
must be easy to reason about as well. Let’s go over each property one by one.
Robustness and fault tolerance
Building systems that “do the right thing” is difficult in the face of the challenges of distributed
systems. Systems need to behave correctly despite machines going down randomly, the complex
semantics of consistency in distributed databases, duplicated data, concurrency, and more. These
challenges make it difficult even to reason about what a system is doing. Part of making a Big
Data system robust is avoiding these complexities so that you can easily reason about the system.
Low latency reads and updates
The vast majority of applications require reads to be satisfied with very low latency, typically
between a few milliseconds to a few hundred milliseconds. On the other hand, the update latency
requirements vary a great deal between applications. Some applications require updates to
propagate immediately, but in other applications a latency of a few hours is fine. Regardless, you
need to be able to achieve low latency updates when you need them in your Big Data systems.
More importantly, you need to be able to achieve low latency reads and updates without
compromising the robustness of the system.
Scalability
Scalability is the ability to maintain performance in the face of increasing data or load by adding
resources to the system. The Lambda Architecture is horizontally scalable across all layers of the
system stack: scaling is accomplished by adding more machines.
Generalization
A general system can support a wide range of applications. Because the Lambda Architecture is
based on functions of all data, it generalizes to all applications, whether financial management
systems, social media analytics, scientific applications, social networking, or anything else.
Extensibility
You don’t want to have to reinvent the wheel each time you add a related feature or make a
change to how your system works. Extensible systems allow functionality to be added with a
minimal development cost.
Often a new feature or a change to an existing feature requires a migration of old data into a new
format. Part of making a system extensible is making it easy to do large-scale migrations. Being
able to do big migrations quickly and easily is core to the approach you’ll learn.
Ad hoc queries
Being able to do ad hoc queries on your data is extremely important. Nearly every large dataset
has unanticipated value within it. Being able to mine a dataset arbitrarily gives opportunities for
business optimization and new applications. Ultimately, you can’t discover interesting things to
do with your data unless you can ask arbitrary questions of it.
Minimal maintenance
Maintenance is a tax on developers. Maintenance is the work required to keep a system running
smoothly. This includes anticipating when to add machines to scale, keeping processes up and
running, and debugging anything that goes wrong in production.
The Lambda Architecture achieves this in part by pushing complexity out of the core components and into pieces of the system whose outputs are discardable after a few hours; the most complex components used, like read/write distributed databases, are in this layer, where outputs are eventually discardable.
Debuggability
A Big Data system must provide the information necessary to debug the system when things go
wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have
that value.
“Debuggability” is accomplished in the Lambda Architecture through the functional nature of the
batch layer and by preferring to use recomputation algorithms when possible.
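As a tiny sketch of what the functional batch layer means in practice (the event records below are invented), the batch view is computed as a pure function of the immutable master dataset, so any suspicious value can be reproduced and traced simply by recomputing the view over the raw events.

from collections import Counter

# Immutable master dataset: raw events are only ever appended, never updated in place.
master_dataset = [
    {"user": "alice", "action": "login"},
    {"user": "bob",   "action": "login"},
    {"user": "alice", "action": "purchase"},
]

def batch_view(all_events):
    """Batch view = pure function of ALL data: logins per user, recomputable at any time."""
    return Counter(e["user"] for e in all_events if e["action"] == "login")

view = batch_view(master_dataset)
print(view)   # Counter({'alice': 1, 'bob': 1})

# Debugging: if a count looks wrong, recompute from the raw events and inspect them;
# the same inputs always produce the same output.
assert batch_view(master_dataset) == view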