Big Data UNIT I
Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used easily by a computer program. It is not organised in a pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream relational database.
Sources of Unstructured Data:
• Web pages
• Images (JPEG, GIF, PNG, etc.)
• Videos
• Memos
• Reports
• Word documents and PowerPoint presentations
• Surveys
Disadvantages of Unstructured Data:
• It is difficult to store and manage unstructured data due to the lack of schema and structure.
• Indexing the data is difficult and error-prone because of the unclear structure and the absence of pre-defined attributes, so search results are not very accurate.
• Ensuring the security of the data is a difficult task.
Problems faced in storing unstructured data: the lack of a schema and structure makes storage, indexing, search, and security difficult, as noted above.
Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It does not reside in a relational database, yet it has some organizational properties that make it easier to analyze, and with some processing it can be stored in a relational database.
Big Data refers to amounts of data so large that they cannot be processed by traditional data storage or processing systems. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that describe its characteristics:
• Volume
• Veracity
• Variety
• Value
• Velocity
Volume:
The name Big Data itself is related to its enormous size. Big Data refers to the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and more.
For example, Facebook generates approximately a billion messages, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies are designed to handle such large amounts of data.
Variety:
Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Some data is still in text format but cannot be easily structured; it requires a good, intelligent transformer to convert it into a well-defined format, for example Google search results, web-scraped data, web pages, and web server click-stream data.
Note: Web scraping is an automatic method of obtaining large amounts of data from websites. Most of this data is unstructured data in HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format.
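As a rough illustration of how raw web data is obtained before it is structured, the sketch below fetches a page over HTTP using Java 11's built-in HttpClient; the URL https://example.com is just a placeholder, and a real scraper would additionally parse the returned HTML into fields.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        // Fetch a web page; the body is unstructured HTML until it is parsed
        // into fields and loaded into a spreadsheet or database.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com")).build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body().length() + " characters of HTML received");
    }
}
```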
1. Structured data: Structured data has a defined schema with all the required columns and is in tabular form. It is stored in a relational database management system.
2. Semi-structured data: The schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email (see the sketch after this list). OLTP (Online Transaction Processing) systems, by contrast, are built to work with structured data stored in relations, i.e., tables.
3. Unstructured data: All the unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it because the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be structured with effort, time, and the right tools.
Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities.
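To make the contrast with structured tables concrete, here is a minimal sketch of reading a semi-structured JSON record in Java; it assumes the Jackson library (jackson-databind) is on the classpath, and the record itself is made up for illustration.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // A semi-structured JSON record: fields carry labels, but no rigid schema is enforced.
        String json = "{\"user\":\"alice\",\"likes\":42,\"tags\":[\"bigdata\",\"hadoop\"]}";
        JsonNode node = new ObjectMapper().readTree(json);
        System.out.println(node.get("user").asText());        // alice
        System.out.println(node.get("tags").get(0).asText()); // bigdata
    }
}
```

Unlike a relational row, the next record could add or drop fields without any schema change.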
Veracity:
Veracity means how reliable the data is. It refers to being able to handle and manage data efficiently, including the many ways the data may need to be filtered or translated before use. This is essential in business development.
Value:
Value is an essential characteristic of big data. What matters is not merely the data that we process or store, but valuable and reliable data that we store, process, and analyze.
Velocity:
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide the data that is in demand rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Big Data Analytics
Big data analytics refers to the methods, tools, and applications used to collect,
process, and derive insights from varied, high-volume, high-velocity data
sets. These data sets may come from a variety of sources, such as web, mobile,
email, social media, and networked smart devices.
Big data is data that is generated at high speed and varies in form, ranging from structured (database tables, Excel sheets) to semi-structured (XML files, webpages) to unstructured (images, audio files).
Traditional forms of data analysis software aren't equipped to support this level
of complexity and scale, which is where the systems, tools, and applications
designed specifically for big data analysis come into play.
How does big data analytics work?
Analytics solutions glean insights and predict outcomes by analyzing data sets. However, in
order for the data to be successfully analyzed, it must first be stored, organized, and cleaned
by a series of applications in an integrated, step-by-step preparation process:
1) Collect
2) Process
3) Scrub
4) Analyze
• Collect. The data, which comes in structured, semi-structured, and unstructured forms,
is collected from multiple sources across web, mobile, and the cloud. It is then stored in
a repository—a data lake or data warehouse—in preparation to be processed.
• Process. During the processing phase, the stored data is verified, sorted, and filtered,
which prepares it for further use and improves the performance of queries.
• Scrub. After processing, the data is then scrubbed. Conflicts, redundancies, invalid or incomplete fields, and formatting errors within the data set are corrected and cleaned (a small sketch of this step follows the list).
• Analyze. The data is now ready to be analyzed. Analyzing big data is accomplished
through tools and technologies such as data mining, AI, predictive analytics, machine
learning, and statistical analysis, which help define and predict patterns and behaviors.
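As a toy illustration of the scrub step above, the following sketch uses plain Java streams to drop null and incomplete records and remove duplicates; the records are invented for the example, and real pipelines would use dedicated cleansing tools.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ScrubExample {
    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice,25", "bob,", "alice,25", null, "carol,31");
        // Scrub: drop nulls, drop rows with a missing field, remove duplicates.
        List<String> clean = raw.stream()
                .filter(Objects::nonNull)
                .filter(r -> !r.endsWith(","))
                .distinct()
                .collect(Collectors.toList());
        System.out.println(clean); // [alice,25, carol,31]
    }
}
```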
Key big data analytics technologies and tools
Big data analytics is composed of many individual technologies and tools working
together to store, move, scale, and analyze data. They may vary depending on your
infrastructure, but here are some of the most common big data analytics tools you'll find:
Collection and storage:
• Hadoop. One of the first frameworks to address the requirements of big data analytics, Apache Hadoop is an open-source
ecosystem that stores and processes large data sets through a distributed computing environment. Hadoop can scale up or
down, depending on your needs, which makes it a highly flexible and cost-efficient framework for managing big data.
• NoSQL databases. Unlike traditional databases, which are relational, NoSQL databases do not require that their data types
adhere to a fixed schema or structure. This allows them to support all types of data models, which is useful when working
with large quantities of semi-structured and raw data. Due to their flexibility, NoSQL databases have also proven to be faster and more scalable than relational databases. Some popular examples of NoSQL databases include MongoDB, Apache CouchDB, and Azure Cosmos DB; a minimal MongoDB sketch follows this list.
• Data lakes and warehouses. Once data is collected from its sources, it must be stored in a central silo for further processing.
A data lake holds raw and unstructured data, which is then ready to be used across applications, while a data warehouse is a
system that pulls structured, pre-defined data from a variety of sources and processes that data for operational use. Both
options have different functions, but they often work together to make up a well-organized system for data storage.
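The MongoDB sketch referred to above shows the schema-less idea: two documents with different fields land in the same collection. It assumes a local MongoDB instance on the default port and the mongodb-driver-sync library on the classpath; the names "analytics" and "posts" are made up for illustration.

```java
import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class NoSqlExample {
    public static void main(String[] args) {
        // Assumed: a MongoDB instance running at localhost:27017.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("analytics").getCollection("posts");
            // Two documents with different fields: no fixed schema is required.
            posts.insertOne(new Document("user", "alice").append("text", "hello"));
            posts.insertOne(new Document("user", "bob")
                    .append("tags", Arrays.asList("bigdata", "nosql")));
            System.out.println("documents stored: " + posts.countDocuments());
        }
    }
}
```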
Processing
• Data integration software. Data integration tools connect and consolidate data from different
platforms into one unified hub, such as a data warehouse, so that users have centralized access to all
the information they need for data mining, business intelligence reporting, and operational purposes.
• In-memory data processing. While traditional data processing is disk-based, in-memory data
processing uses RAM, or memory, to process data. This substantially increases processing and transfer
speeds, making it possible for organizations to glean insights in real time. Processing frameworks like
Apache Spark perform batch processing and real-time data stream processing in memory.
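As a small sketch of in-memory processing with Apache Spark's Java API (run in local mode; the numbers are arbitrary), the data set below is held in memory as an RDD and aggregated without intermediate disk writes. It assumes the spark-core dependency is on the classpath.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryDemo {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all available cores.
        SparkConf conf = new SparkConf().setAppName("in-memory-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> values = sc.parallelize(Arrays.asList(4, 8, 15, 16, 23, 42));
            long evenCount = values.filter(v -> v % 2 == 0).count(); // transformation + action
            int total = values.reduce(Integer::sum);                 // aggregation in memory
            System.out.println("even values: " + evenCount + ", sum: " + total);
        }
    }
}
```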
Scrubbing
• Data preprocessing and scrubbing tools. To ensure that your data is of the highest quality, data
cleansing tools resolve errors, fix syntax mistakes, remove missing values, and scrub duplicates. These
tools then standardize and validate your data so that it's ready for analysis.
Analysis
• Data mining. Big data analytics gains insights from data through knowledge discovery
processes like data mining, which extracts underlying patterns from large data sets. Through
algorithms designed to identify notable relationships between the data, data mining can
automatically define current trends in data, both structured and unstructured.
• Predictive analytics. Predictive analytics helps build analytic models that predict patterns
and behavior. This is accomplished through machine learning and other types of statistical
algorithms, which allow you to identify future outcomes, improve operations, and meet the
needs of your users.
• Deep learning. Imitates human learning patterns by using artificial intelligence and machine learning to layer algorithms and find patterns in the most complex and abstract data.
• MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters data to various nodes within the cluster. The second is
reducing, which organizes and reduces the results from each node to answer a query.
• YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and
resource management in the cluster.
• Spark is an open source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
• Tableau is an end-to-end data analytics platform that allows you to prep, analyze,
collaborate, and share your big data insights. Tableau excels in self-service visual analysis,
allowing people to ask new questions of governed big data and easily share those insights
across the organization.
The big challenges of big data analytics:
• Making big data accessible. Collecting and processing data becomes more difficult
as the amount of data grows. Organizations must make data easy and convenient for
data owners of all skill levels to use.
• Maintaining quality data. With so much data to maintain, organizations are spending
more time than ever before scrubbing for duplicates, errors, absences, conflicts, and
inconsistencies.
• Keeping data secure. As the amount of data grows, so do
privacy and security concerns. Organizations will need to strive for compliance and
put tight data processes in place before they take advantage of big data.
• Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the right
technology to work within their established ecosystems and address their particular
needs. Often, the right solution is also a flexible solution that can accommodate future
infrastructure changes.
The Lifecycle Phases of Big Data Analytics
• Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business case, which
defines the reason and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to remove corrupt
data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then transformed into
a compatible form.
• Stage 5 - Data munging - Here, the data is validated and cleaned.
• Stage 6 - Data aggregation - In this stage, data with the same fields across different datasets are integrated.
• Stage 7 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful
information.
• Stage 8 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can
produce graphic visualizations of the analysis.
• Stage 9 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where the final results of the analysis are made available to business stakeholders.
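A minimal sketch of Stage 3 (data filtering) and Stage 6 (data aggregation) using plain Java streams; the records and their field layout are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LifecycleStages {
    public static void main(String[] args) {
        List<String> raw = Arrays.asList("web,120", "mobile,80", "CORRUPT", "web,60", "mobile,40");
        Map<String, Integer> totals = raw.stream()
                .filter(r -> r.matches("\\w+,\\d+"))        // Stage 3: drop corrupt records
                .map(r -> r.split(","))
                .collect(Collectors.groupingBy(p -> p[0],   // Stage 6: merge records with the same field
                        Collectors.summingInt(p -> Integer.parseInt(p[1]))));
        System.out.println(totals); // e.g. {mobile=120, web=180}
    }
}
```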
Introduction to Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS
was developed. It states that the files will be broken into blocks and stored in nodes over the distributed
architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. MapReduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
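A minimal word-count sketch of the Map and Reduce tasks just described, written against the Hadoop MapReduce Java API (hadoop-mapreduce-client on the classpath); the job/driver setup and input paths are omitted, so treat it as an outline rather than a complete job.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: turn each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts emitted by the mappers for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```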
Google File System (GFS)
The Google File System (GFS) is a scalable distributed file system for large, distributed, data-intensive applications. It delivers high aggregate performance to a large number of clients.
GFS provides a familiar file system interface, though it does not implement a
standard API such as POSIX.
GFS has snapshot and record append operations. Snapshot creates a copy of a file or
a directory tree at low cost.
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append.
Hadoop Distributed File System (HDFS) is an Apache project. It is a file system used to store the initial and 'reduced' data once the data is processed using MapReduce. Google File System (GFS) was the file system created by Google initially to store the website indexing data for its search engine.
GFS supports commands to open, create, read, write, and close files. The team also included a couple of specialized commands, append and snapshot, created based on Google's needs. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot creates a quick, low-cost copy of a file or a directory tree.
Files on the GFS tend to be very large, usually in the multi-gigabyte (GB)
range. Accessing and manipulating files that large would take up a lot of the
network's bandwidth. Bandwidth is the capacity of a system to move data
from one location to another. The GFS addresses this problem by breaking
files up into chunks of 64 megabytes (MB) each. Every chunk receives a
unique 64-bit identification number called a chunk handle.
Google organized the GFS into clusters of computers. A cluster is simply a network
of computers. Each cluster might contain hundreds or even thousands of machines.
Within GFS clusters there are three kinds of entities: clients, master
servers and chunkservers.
GFS Architecture
A GFS cluster consists of a single master and multiple chunk
servers and is accessed by multiple clients, as shown in the above
figure
Each of these is typically a commodity Linux machine running a
user-level server process.
Files are divided into fixed-size chunks. Each chunk is identified by a fixed and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
Chunk servers store chunks on local disks as Linux files. For reliability, each chunk is replicated on multiple chunk servers. By default, there are three replicas, and this value can be changed by the user.
The master maintains all file system metadata. This includes the
namespace, access control information, the mapping from files to
chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers.
The master periodically communicates with each chunk server in HeartBeat messages to give it instructions and collect its state.
Files in GFS are divided into chunks. Chunkservers store chunks on their local disks as Linux files, and each chunk is also replicated on multiple chunkservers. This avoids any loss of data if and when the hardware fails. The default number of replicas is three.
The master maintains all file system metadata which includes the namespace, access
control information, the mapping from files to chunks and the current locations of the
chunks.
The clients communicate with the master and the chunkservers. The clients do not
cache the file data as the files are huge. It also avoids cache coherence issues.
Chunkservers do not cache any data because the chunks are stored as local files.
One of the most common issues that a master in any system suffers from is becoming a bottleneck. It is necessary that the load is balanced among the various components of the system; otherwise the master becomes a single point of failure.
GFS avoids this by minimizing the master's involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact, caches this information for a limited time, and interacts with the chunkservers directly for many subsequent operations. Hence, the master is moved off the critical path.
When a master fails, a new master is needed. To make this possible, journaling is used: the master's operation log, which records changes to the namespace and metadata, is replicated on multiple machines so that a replacement master can recover the file system state.
Chunk Size:
A large chunk size offers several important advantages:
• First, it reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information.
• Second, it can reduce network overhead by keeping a persistent TCP
connection to the chunk server over an extended period of time.
• Third, it reduces the size of the metadata stored on the master.
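A back-of-the-envelope sketch of the third advantage: with 64 MB chunks, a file needs far fewer chunks, and therefore far fewer metadata entries on the master, than with a small chunk size (the 10 GB file size and the 4 MB comparison value are arbitrary).

```java
public class ChunkMath {
    public static void main(String[] args) {
        long fileSize   = 10L * 1024 * 1024 * 1024; // a 10 GB file (arbitrary example)
        long smallChunk = 4L * 1024 * 1024;         // hypothetical 4 MB chunks
        long gfsChunk   = 64L * 1024 * 1024;        // GFS default 64 MB chunks
        // Number of chunks = ceil(fileSize / chunkSize); each chunk needs a
        // metadata entry (chunk handle + replica locations) on the master.
        System.out.println("4 MB chunks : " + (fileSize + smallChunk - 1) / smallChunk);  // 2560
        System.out.println("64 MB chunks: " + (fileSize + gfsChunk - 1) / gfsChunk);      // 160
    }
}
```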
Disadvantages:
Large block size can lead to internal fragmentation.
Even with lazy space allocation, a small file consists of a small number
of chunks, perhaps just one. The chunk servers storing those chunks
may become hot spots if many clients are accessing the same file.
Hadoop
1. HDFS:
HDFS, short for Hadoop Distributed File System, provides distributed storage for Hadoop. HDFS has a master-slave topology.
The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks, and Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. The metadata is stored on the master.
HDFS has two daemons running for it. They are :
1.Namenode
2.Datanode
1.Namenode:
The NameNode performs the following functions:
• NameNode Daemon runs on the master machine.
• It is responsible for maintaining, monitoring and managing DataNodes.
• It records the metadata of the files like the location of blocks, file size, permission,
hierarchy etc.
• Namenode captures all the changes to the metadata like deletion, creation and
renaming of the file in edit logs.
• It regularly receives heartbeat and block reports from the DataNodes.
The NameNode needs to know when a DataNode is dead. If a DataNode is dead, the NameNode issues commands to other DataNodes to replicate the data stored on the dead DataNode, to bring the replication factor of the blocks back to the configured number of replicas.
To make the NameNode aware of its status, each DataNode regularly sends heartbeats:
• You can set the heartbeat interval in the hdfs-site.xml file by configuring the
parameter dfs.heartbeat.interval.
• If the NameNode doesn’t receive any heartbeats for a specified time, which is ten
minutes by default, it assumes the DataNode is lost and that the block
replicas hosted by that DataNode are unavailable.
DataNode: The DataNode daemon runs on the slave machines; it stores the actual data blocks and serves read and write requests from clients.
Due to the fault tolerance feature of Hadoop, if any of the DataNodes goes down, the data is still available to the user from other DataNodes containing a copy of the same data.
Also, a high-availability Hadoop cluster consists of two or more NameNodes (active and passive) running in a hot standby configuration. The active node serves all client requests, while the passive node is a standby that reads the edit log modifications of the active NameNode and applies them to its own namespace.
If the active node fails, the passive node takes over the responsibility of the active node. Thus, even if a NameNode goes down, files remain available and accessible to users.
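A minimal sketch of writing and reading a file through the HDFS Java client (the FileSystem API). It assumes hadoop-client is on the classpath and a NameNode reachable at hdfs://localhost:9000; the path is made up. The client asks the NameNode for metadata and then talks to DataNodes for the block data, but the API hides those details.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(path, true)) { // overwrite if present
                out.writeUTF("hello hdfs");
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```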
5. Hadoop is very Cost-Effective
Since the Hadoop cluster consists of inexpensive commodity hardware nodes, it provides a cost-effective solution for storing and processing big data. Being an open-source product, Hadoop doesn't need any license.
Hadoop stores data in a distributed fashion, which allows data to be processed distributedly on a
cluster of nodes. Thus it provides lightning-fast processing capability to the Hadoop framework.
Hadoop is popularly known for its data locality feature, which means moving the computation logic to the data rather than moving the data to the computation logic. This feature of Hadoop reduces bandwidth utilization in the system.
8. Hadoop provides Feasibility
Unlike traditional systems, Hadoop can process unstructured data, giving users the flexibility to analyze data of any format and size.
Hadoop is easy to use, as clients don't have to worry about distributed computing; the processing is handled by the framework itself.
In Hadoop, due to the replication of data across the cluster, data is stored reliably on the cluster machines despite machine failures.
The framework itself provides mechanisms to ensure data reliability through the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. Even if a machine goes down or data gets corrupted, your data is still stored reliably in the cluster and is accessible from another machine containing a copy of the data.
Yarn technology
YARN was introduced to make the most out of HDFS, and job scheduling
is also handled by YARN.
Now that YARN has been introduced, the architecture of Hadoop 2.x provides a data processing platform that
is not only limited to MapReduce. It lets Hadoop process other-purpose-built data processing systems as well,
i.e., other frameworks can run on the same hardware on which Hadoop is installed.
Now that you have learned what YARN is, let's see why we need Hadoop YARN.
Why YARN is used
Resource Manager
Resource Manager is the master daemon of YARN. It is responsible for managing several other
applications, along with the global assignments of resources such as CPU and memory. It is used
for job scheduling. Resource Manager has two components:
• Scheduler: The Scheduler's task is to distribute resources to the running applications. It deals only with the scheduling of tasks and hence performs no tracking or monitoring of applications.
• Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting the Application Master and monitoring it are done by the Application Manager.
Node Manager
Node Manager is the slave daemon of YARN. It has the following
responsibilities:
• Node Manager has to monitor the container’s resource usage, along with
reporting it to the Resource Manager.
• The health of the node on which YARN is running is tracked by the Node
Manager.
• It takes care of each node in the cluster while managing the workflow, along
with user jobs on a particular node.
• It keeps the data in the Resource Manager updated.
• Node Manager can also destroy or kill the container if it gets an order from the
Resource Manager to do so.
Application Master:
Every job submitted to the framework is an application, and every application has a
specific Application Master associated with it. Application Master performs the
following tasks:
• It coordinates the execution of the application in the cluster, along with managing the
faults.
• It works with the Node Manager for executing and monitoring other components’ tasks.
• At regular intervals, it sends heartbeats to the Resource Manager to report its health and to update the record of its resource demands.
Container
A container is a bundle of physical resources (memory, CPU cores, and disk) on a single node that the Resource Manager allocates; application tasks run inside containers.
Workflow of an Application in Apache Hadoop YARN
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container in which the Application Master starts.
3. The Application Master registers itself with the Resource Manager.
4. Allocate resources: the Application Master negotiates further containers from the Resource Manager.
5. Launching and executing: the Application Master asks the Node Managers to launch the containers, the application code runs inside them, and on completion the Application Master unregisters from the Resource Manager.
Benefits of YARN:
• Better cluster utilization: YARN allocates all cluster resources efficiently and
dynamically, which leads to better utilization of Hadoop as compared to the
previous version of it.
• Multi-tenancy: Various engines that access data on the Hadoop cluster can
efficiently work together all because of YARN as it is a highly versatile technology.
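As a small sketch of talking to the Resource Manager programmatically, the snippet below uses the YarnClient API to list the applications the cluster is tracking; it assumes hadoop-yarn-client is on the classpath and a yarn-site.xml pointing at a running Resource Manager.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // picks up yarn-site.xml if present
        try (YarnClient client = YarnClient.createYarnClient()) {
            client.init(conf);
            client.start();
            // Ask the Resource Manager for the applications it currently tracks.
            List<ApplicationReport> apps = client.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
            }
        }
    }
}
```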
Pig:
Pig is a scripting platform that runs on Hadoop clusters designed to process and
analyze large datasets. Pig is extensible, self-optimizing, and easily
programmed.
Programmers can use Pig to write data transformations without knowing Java.
Pig uses both structured and unstructured data as input to perform analytics and
uses HDFS to store the results.
HIVE:
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
In other words, Hive is an open-source system that processes structured data in Hadoop, residing on top of the latter for summarizing Big Data, as well as facilitating analysis and queries.
Limitations of Hive
Of course, no resource is perfect, and Hive has some limitations. They are:
• Hive doesn’t support OLTP. Hive supports Online Analytical Processing
(OLAP), but not Online Transaction Processing (OLTP).
• It doesn’t support subqueries.
• It has a high latency.
• Hive tables don’t support delete or update operations.
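To make the OLAP-style usage concrete, here is a hedged sketch of querying Hive over JDBC; it assumes a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical page_views table. Note that the query is an aggregate read, in line with the limitations above (no updates or deletes).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS visits FROM page_views GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + " -> " + rs.getLong("visits"));
            }
        }
    }
}
```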
Apache Mahout is an open source project that is primarily used for creating scalable
machine learning algorithms. It implements popular machine learning techniques such as:
• Recommendation
• Classification
• Clustering
Applications of Mahout
• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout
internally.
• Foursquare helps you in finding out places, food, and entertainment available in a particular
area. It uses the recommender engine of Mahout.
• Twitter uses Mahout for user interest modelling.
• Yahoo! uses Mahout for pattern mining.
What is OOZIE?
Apache Oozie is a workflow scheduler for Hadoop. It is a system which runs the
workflow of dependent jobs. Here, users are permitted to create Directed Acyclic
Graphs of workflows, which can be run in parallel and sequentially in Hadoop.
Oozie is scalable and can manage the timely execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
HBase
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the simplest
of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of
data in a single unit of time (random access).
Hadoop Random Access Databases
Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of
the databases that store huge amounts of data and access the data in a random manner.
HDFS vs HBase:
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
• HDFS is designed for high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data; HBase internally uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups.
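A minimal sketch of HBase's random access using the HBase Java client: write one row, then read it back by key. It assumes an HBase cluster reachable through an hbase-site.xml on the classpath and a hypothetical 'users' table with a column family 'info'.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write a single cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random read by row key: no scan of the whole data set is needed.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```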
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to
import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop
file system to relational databases. It is provided by the Apache Software Foundation.
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record
in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
Apache Flume is a reliable and distributed system for collecting, aggregating and
moving massive quantities of log data. It has a simple yet flexible architecture
based on streaming data flows. Apache Flume is used to collect log data present in log files on web servers and to aggregate it into HDFS for analysis.
ZooKeeper:
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services; Hadoop ecosystem components such as HBase use it for coordination.