Big Data UNIT I
Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used easily by a computer program. It is not organised in a pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream relational database.
Sources of Unstructured Data:
• Web pages
• Images (JPEG, GIF, PNG, etc.)
• Videos
• Memos
• Reports
• Word documents and PowerPoint presentations
• Surveys
Disadvantages of Unstructured Data:
• It is difficult to store and manage unstructured data due to the lack of schema and structure.
• Indexing the data is difficult and error-prone because of the unclear structure and the absence of pre-defined attributes, so search results are not very accurate.
• Ensuring the security of the data is a difficult task.
Problems faced in storing unstructured data: the lack of a schema and structure makes storage, indexing, search, and security difficult, as noted above.
Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It does not reside in a relational database, yet it has some organizational properties that make it easier to analyze, and with some processing it can be stored in a relational database.
Big Data refers to amounts of data so large that they cannot be processed by traditional data storage or processing systems. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that describe its characteristics:
• Volume
• Veracity
• Variety
• Value
• Velocity
Volume:
The name Big Data itself is related to its enormous size. Big Data refers to the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and more.
For example, Facebook generates approximately a billion messages, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies are designed to handle such large amounts of data.
Variety:
Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Some data is still in text format but cannot be easily structured; it requires a good, intelligent transformer to convert it into a well-defined format, for example Google search results, web-scraped data, web pages, and web server click-stream data.
Note: Web scraping is an automatic method of obtaining large amounts of data from websites. Most of this data is unstructured data in HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format.
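As a rough illustration of how raw web data is obtained before it is structured, the sketch below fetches a page over HTTP using Java 11's built-in HttpClient; the URL https://example.com is just a placeholder, and a real scraper would additionally parse the returned HTML into fields.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        // Fetch a web page; the body is unstructured HTML until it is parsed
        // into fields and loaded into a spreadsheet or database.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com")).build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body().length() + " characters of HTML received");
    }
}
```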
1. Structured data: Structured data has a defined schema with all the required columns and is in tabular form. It is stored in a relational database management system.
2. Semi-structured data: The schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email (see the sketch after this list). OLTP (Online Transaction Processing) systems, by contrast, are built to work with structured data stored in relations, i.e., tables.
3. Unstructured data: All the unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it because the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be structured with effort, time, and the right tools.
Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities.
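To make the contrast with structured tables concrete, here is a minimal sketch of reading a semi-structured JSON record in Java; it assumes the Jackson library (jackson-databind) is on the classpath, and the record itself is made up for illustration.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // A semi-structured JSON record: fields carry labels, but no rigid schema is enforced.
        String json = "{\"user\":\"alice\",\"likes\":42,\"tags\":[\"bigdata\",\"hadoop\"]}";
        JsonNode node = new ObjectMapper().readTree(json);
        System.out.println(node.get("user").asText());        // alice
        System.out.println(node.get("tags").get(0).asText()); // bigdata
    }
}
```

Unlike a relational row, the next record could add or drop fields without any schema change.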
Veracity:
Veracity means how reliable the data is. It refers to being able to handle and manage data efficiently, including the many ways the data may need to be filtered or translated before use. This is essential in business development.
Value:
Value is an essential characteristic of big data. What matters is not merely the data that we process or store, but valuable and reliable data that we store, process, and analyze.
Velocity:
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide the data that is in demand rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Big Data Analytics
Big data analytics refers to the methods, tools, and applications used to collect,
process, and derive insights from varied, high-volume, high-velocity data
sets. These data sets may come from a variety of sources, such as web, mobile,
email, social media, and networked smart devices.
Big data is data that is generated at high speed and varies in form, ranging from structured (database tables, Excel sheets) to semi-structured (XML files, webpages) to unstructured (images, audio files).
Traditional forms of data analysis software aren't equipped to support this level
of complexity and scale, which is where the systems, tools, and applications
designed specifically for big data analysis come into play.
How does big data analytics work?
Analytics solutions glean insights and predict outcomes by analyzing data sets. However, in
order for the data to be successfully analyzed, it must first be stored, organized, and cleaned
by a series of applications in an integrated, step-by-step preparation process:
1) Collect
2) Process
3) Scrub
4) Analyze
• Collect. The data, which comes in structured, semi-structured, and unstructured forms,
is collected from multiple sources across web, mobile, and the cloud. It is then stored in
a repository—a data lake or data warehouse—in preparation to be processed.
• Process. During the processing phase, the stored data is verified, sorted, and filtered,
which prepares it for further use and improves the performance of queries.
• Scrub. After processing, the data is then scrubbed. Conflicts, redundancies, invalid or incomplete fields, and formatting errors within the data set are corrected and cleaned (a small sketch of this step follows the list).
• Analyze. The data is now ready to be analyzed. Analyzing big data is accomplished
through tools and technologies such as data mining, AI, predictive analytics, machine
learning, and statistical analysis, which help define and predict patterns and behaviors.
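As a toy illustration of the scrub step above, the following sketch uses plain Java streams to drop null and incomplete records and remove duplicates; the records are invented for the example, and real pipelines would use dedicated cleansing tools.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ScrubExample {
    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice,25", "bob,", "alice,25", null, "carol,31");
        // Scrub: drop nulls, drop rows with a missing field, remove duplicates.
        List<String> clean = raw.stream()
                .filter(Objects::nonNull)
                .filter(r -> !r.endsWith(","))
                .distinct()
                .collect(Collectors.toList());
        System.out.println(clean); // [alice,25, carol,31]
    }
}
```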
Key big data analytics technologies and tools
Big data analytics is composed of many individual technologies and tools working
together to store, move, scale, and analyze data. They may vary depending on your
infrastructure, but here are some of the most common big data analytics tools you'll find:
Collection and storage:
• Hadoop. One of the first frameworks to address the requirements of big data analytics, Apache Hadoop is an open-source
ecosystem that stores and processes large data sets through a distributed computing environment. Hadoop can scale up or
down, depending on your needs, which makes it a highly flexible and cost-efficient framework for managing big data.
• NoSQL databases. Unlike traditional databases, which are relational, NoSQL databases do not require that their data types
adhere to a fixed schema or structure. This allows them to support all types of data models, which is useful when working
with large quantities of semi-structured and raw data. Due to their flexibility, NoSQL databases have also proven to be faster and more scalable than relational databases. Some popular examples of NoSQL databases include MongoDB, Apache CouchDB, and Azure Cosmos DB; a minimal MongoDB sketch follows this list.
• Data lakes and warehouses. Once data is collected from its sources, it must be stored in a central silo for further processing.
A data lake holds raw and unstructured data, which is then ready to be used across applications, while a data warehouse is a
system that pulls structured, pre-defined data from a variety of sources and processes that data for operational use. Both
options have different functions, but they often work together to make up a well-organized system for data storage.
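The MongoDB sketch referred to above shows the schema-less idea: two documents with different fields land in the same collection. It assumes a local MongoDB instance on the default port and the mongodb-driver-sync library on the classpath; the names "analytics" and "posts" are made up for illustration.

```java
import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class NoSqlExample {
    public static void main(String[] args) {
        // Assumed: a MongoDB instance running at localhost:27017.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("analytics").getCollection("posts");
            // Two documents with different fields: no fixed schema is required.
            posts.insertOne(new Document("user", "alice").append("text", "hello"));
            posts.insertOne(new Document("user", "bob")
                    .append("tags", Arrays.asList("bigdata", "nosql")));
            System.out.println("documents stored: " + posts.countDocuments());
        }
    }
}
```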
Processing
• Data integration software. Data integration tools connect and consolidate data from different
platforms into one unified hub, such as a data warehouse, so that users have centralized access to all
the information they need for data mining, business intelligence reporting, and operational purposes.
• In-memory data processing. While traditional data processing is disk-based, in-memory data
processing uses RAM, or memory, to process data. This substantially increases processing and transfer
speeds, making it possible for organizations to glean insights in real time. Processing frameworks like
Apache Spark perform batch processing and real-time data stream processing in memory.
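As a small sketch of in-memory processing with Apache Spark's Java API (run in local mode; the numbers are arbitrary), the data set below is held in memory as an RDD and aggregated without intermediate disk writes. It assumes the spark-core dependency is on the classpath.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryDemo {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all available cores.
        SparkConf conf = new SparkConf().setAppName("in-memory-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> values = sc.parallelize(Arrays.asList(4, 8, 15, 16, 23, 42));
            long evenCount = values.filter(v -> v % 2 == 0).count(); // transformation + action
            int total = values.reduce(Integer::sum);                 // aggregation in memory
            System.out.println("even values: " + evenCount + ", sum: " + total);
        }
    }
}
```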
Scrubbing
• Data preprocessing and scrubbing tools. To ensure that your data is of the highest quality, data
cleansing tools resolve errors, fix syntax mistakes, remove missing values, and scrub duplicates. These
tools then standardize and validate your data so that it's ready for analysis.
Analysis
• Data mining. Big data analytics gains insights from data through knowledge discovery
processes like data mining, which extracts underlying patterns from large data sets. Through
algorithms designed to identify notable relationships between the data, data mining can
automatically define current trends in data, both structured and unstructured.
• Predictive analytics. Predictive analytics helps build analytic models that predict patterns
and behavior. This is accomplished through machine learning and other types of statistical
algorithms, which allow you to identify future outcomes, improve operations, and meet the
needs of your users.
• Deep learning. Imitates human learning patterns by using artificial intelligence and machine learning to layer algorithms and find patterns in the most complex and abstract data.
• MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters data to various nodes within the cluster. The second is
reducing, which organizes and reduces the results from each node to answer a query.
• YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and
resource management in the cluster.
• Spark is an open source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
• Tableau is an end-to-end data analytics platform that allows you to prep, analyze,
collaborate, and share your big data insights. Tableau excels in self-service visual analysis,
allowing people to ask new questions of governed big data and easily share those insights
across the organization.
The big challenges of big data analytics:
• Making big data accessible. Collecting and processing data becomes more difficult
as the amount of data grows. Organizations must make data easy and convenient for
data owners of all skill levels to use.
• Maintaining quality data. With so much data to maintain, organizations are spending
more time than ever before scrubbing for duplicates, errors, absences, conflicts, and
inconsistencies.
• Keeping data secure. As the amount of data grows, so do
privacy and security concerns. Organizations will need to strive for compliance and
put tight data processes in place before they take advantage of big data.
• Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the right
technology to work within their established ecosystems and address their particular
needs. Often, the right solution is also a flexible solution that can accommodate future
infrastructure changes.
The Lifecycle Phases of Big Data Analytics
• Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business case, which
defines the reason and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to remove corrupt
data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then transformed into
a compatible form.
• Stage 5 - Data munging - Here, the data is validated and cleaned.
• Stage 6 - Data aggregation - In this stage, data with the same fields across different datasets are integrated.
• Stage 7 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful
information.
• Stage 8 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can
produce graphic visualizations of the analysis.
• Stage 9 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where the final results of the analysis are made available to business stakeholders.
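A minimal sketch of Stage 3 (data filtering) and Stage 6 (data aggregation) using plain Java streams; the records and their field layout are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LifecycleStages {
    public static void main(String[] args) {
        List<String> raw = Arrays.asList("web,120", "mobile,80", "CORRUPT", "web,60", "mobile,40");
        Map<String, Integer> totals = raw.stream()
                .filter(r -> r.matches("\\w+,\\d+"))        // Stage 3: drop corrupt records
                .map(r -> r.split(","))
                .collect(Collectors.groupingBy(p -> p[0],   // Stage 6: merge records with the same field
                        Collectors.summingInt(p -> Integer.parseInt(p[1]))));
        System.out.println(totals); // e.g. {mobile=120, web=180}
    }
}
```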
Introduction to Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS
was developed. It states that the files will be broken into blocks and stored in nodes over the distributed
architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. MapReduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
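A minimal word-count sketch of the Map and Reduce tasks just described, written against the Hadoop MapReduce Java API (hadoop-mapreduce-client on the classpath); the job/driver setup and input paths are omitted, so treat it as an outline rather than a complete job.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: turn each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts emitted by the mappers for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```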
Google File System (GFS)
The Google File System (GFS) is a scalable distributed file system for large, distributed, data-intensive applications. It delivers high aggregate performance to a large number of clients.
GFS provides a familiar file system interface, though it does not implement a
standard API such as POSIX.
GFS has snapshot and record append operations. Snapshot creates a copy of a file or
a directory tree at low cost.
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append.
Hadoop Distributed File System (HDFS) is an Apache project. It is a file system used to store the initial and 'reduced' data once the data is processed using MapReduce. Google File System (GFS) was the file system created by Google initially to store the website indexing data for its search engine.
GFS supports commands to open, create, read, write, and close files. The team also included a couple of specialized commands, append and snapshot, created based on Google's needs. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot creates a quick, low-cost copy of a file or a directory tree.
Files on the GFS tend to be very large, usually in the multi-gigabyte (GB)
range. Accessing and manipulating files that large would take up a lot of the
network's bandwidth. Bandwidth is the capacity of a system to move data
from one location to another. The GFS addresses this problem by breaking
files up into chunks of 64 megabytes (MB) each. Every chunk receives a
unique 64-bit identification number called a chunk handle.
Google organized the GFS into clusters of computers. A cluster is simply a network
of computers. Each cluster might contain hundreds or even thousands of machines.
Within GFS clusters there are three kinds of entities: clients, master
servers and chunkservers.
GFS Architecture
A GFS cluster consists of a single master and multiple chunk
servers and is accessed by multiple clients, as shown in the above
figure
Each of these is typically a commodity Linux machine running a
user-level server process.
Files are divided into fixed-size chunks. Each chunk is identified by a fixed and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
Chunk servers store chunks on local disks as Linux files. For reliability, each chunk is replicated on multiple chunk servers. By default, there are three replicas, and this value can be changed by the user.
The master maintains all file system metadata. This includes the
namespace, access control information, the mapping from files to
chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers.
The master periodically communicates with each chunk server in HeartBeat messages to give it instructions and collect its state.
Files in GFS are divided into chunks. Chunkservers store chunks on their local disks as Linux files, and each chunk is also replicated on multiple chunkservers. This avoids any loss of data if and when the hardware fails. The default number of replicas is three.
The master maintains all file system metadata which includes the namespace, access
control information, the mapping from files to chunks and the current locations of the
chunks.
The clients communicate with the master and the chunkservers. The clients do not
cache the file data as the files are huge. It also avoids cache coherence issues.
Chunkservers do not cache any data because the chunks are stored as local files.
One of the most common issues that a master in any system suffers from is becoming a bottleneck. It is necessary that the load is balanced among the various components of the system; otherwise the master becomes a single point of failure.
GFS avoids this by minimizing the master's involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact, caches this information for a limited time, and interacts with the chunkservers directly for many subsequent operations. Hence, the master is moved off the critical path.
When a master fails, a new master is needed. To make this possible, journaling is used: the master's operation log, which records changes to the namespace and metadata, is replicated on multiple machines so that a replacement master can recover the file system state.
Chunk Size:
A large chunk size offers several important advantages:
• First, it reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information.
• Second, it can reduce network overhead by keeping a persistent TCP
connection to the chunk server over an extended period of time.
• Third, it reduces the size of the metadata stored on the master.
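A back-of-the-envelope sketch of the third advantage: with 64 MB chunks, a file needs far fewer chunks, and therefore far fewer metadata entries on the master, than with a small chunk size (the 10 GB file size and the 4 MB comparison value are arbitrary).

```java
public class ChunkMath {
    public static void main(String[] args) {
        long fileSize   = 10L * 1024 * 1024 * 1024; // a 10 GB file (arbitrary example)
        long smallChunk = 4L * 1024 * 1024;         // hypothetical 4 MB chunks
        long gfsChunk   = 64L * 1024 * 1024;        // GFS default 64 MB chunks
        // Number of chunks = ceil(fileSize / chunkSize); each chunk needs a
        // metadata entry (chunk handle + replica locations) on the master.
        System.out.println("4 MB chunks : " + (fileSize + smallChunk - 1) / smallChunk);  // 2560
        System.out.println("64 MB chunks: " + (fileSize + gfsChunk - 1) / gfsChunk);      // 160
    }
}
```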
Disadvantages:
Large block size can lead to internal fragmentation.
Even with lazy space allocation, a small file consists of a small number
of chunks, perhaps just one. The chunk servers storing those chunks
may become hot spots if many clients are accessing the same file.
Hadoop
1. HDFS:
HDFS, short for Hadoop Distributed File System, provides distributed storage for Hadoop. HDFS has a master-slave topology.
The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks, and Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. The metadata is stored on the master.
HDFS has two daemons running for it. They are :
1.Namenode
2.Datanode
1.Namenode:
The NameNode performs the following functions:
• NameNode Daemon runs on the master machine.
• It is responsible for maintaining, monitoring and managing DataNodes.
• It records the metadata of the files like the location of blocks, file size, permission,
hierarchy etc.
• Namenode captures all the changes to the metadata like deletion, creation and
renaming of the file in edit logs.
• It regularly receives heartbeat and block reports from the DataNodes.
The NameNode needs to know when a DataNode is dead. If a DataNode is dead, the NameNode issues commands to other DataNodes to replicate the data stored on the dead DataNode, to bring the replication factor of the blocks back to the configured number of replicas.
To make the NameNode aware of its status, each DataNode regularly sends heartbeats:
• You can set the heartbeat interval in the hdfs-site.xml file by configuring the
parameter dfs.heartbeat.interval.
• If the NameNode doesn’t receive any heartbeats for a specified time, which is ten
minutes by default, it assumes the DataNode is lost and that the block
replicas hosted by that DataNode are unavailable.
DataNode: The DataNode daemon runs on the slave machines; it stores the actual data blocks and serves read and write requests from clients.
Due to the fault tolerance feature of Hadoop, if any of the DataNodes goes down, the data is still available to the user from other DataNodes containing a copy of the same data.
Also, a high-availability Hadoop cluster consists of two or more NameNodes (active and passive) running in a hot standby configuration. The active node serves all client requests, while the passive node is a standby that reads the edit log modifications of the active NameNode and applies them to its own namespace.
If the active node fails, the passive node takes over the responsibility of the active node. Thus, even if a NameNode goes down, files remain available and accessible to users.
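A minimal sketch of writing and reading a file through the HDFS Java client (the FileSystem API). It assumes hadoop-client is on the classpath and a NameNode reachable at hdfs://localhost:9000; the path is made up. The client asks the NameNode for metadata and then talks to DataNodes for the block data, but the API hides those details.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(path, true)) { // overwrite if present
                out.writeUTF("hello hdfs");
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```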
5. Hadoop is very Cost-Effective
Since the Hadoop cluster consists of inexpensive commodity hardware nodes, it provides a cost-effective solution for storing and processing big data. Being an open-source product, Hadoop doesn't need any license.
Hadoop stores data in a distributed fashion, which allows data to be processed distributedly on a
cluster of nodes. Thus it provides lightning-fast processing capability to the Hadoop framework.
Hadoop is popularly known for its data locality feature, which means moving the computation logic to the data rather than moving the data to the computation logic. This feature of Hadoop reduces bandwidth utilization in the system.
8. Hadoop provides Feasibility
Unlike traditional systems, Hadoop can process unstructured data, giving users the flexibility to analyze data of any format and size.
Hadoop is easy to use, as clients don't have to worry about distributed computing; the processing is handled by the framework itself.
In Hadoop, due to the replication of data across the cluster, data is stored reliably on the cluster machines despite machine failures.
The framework itself provides mechanisms to ensure data reliability through the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. Even if a machine goes down or data gets corrupted, your data is still stored reliably in the cluster and is accessible from another machine containing a copy of the data.
Yarn technology
YARN was introduced to make the most out of HDFS, and job scheduling
is also handled by YARN.
Now that YARN has been introduced, the architecture of Hadoop 2.x provides a data processing platform that
is not only limited to MapReduce. It lets Hadoop process other-purpose-built data processing systems as well,
i.e., other frameworks can run on the same hardware on which Hadoop is installed.
Now that you have learned what YARN is, let's see why we need Hadoop YARN.
Why YARN is used
Resource Manager
Resource Manager is the master daemon of YARN. It is responsible for managing several other
applications, along with the global assignments of resources such as CPU and memory. It is used
for job scheduling. Resource Manager has two components:
• Scheduler: The Scheduler's task is to distribute resources to the running applications. It deals only with the scheduling of tasks and hence performs no tracking or monitoring of applications.
• Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting the Application Master and monitoring it are done by the Application Manager.
Node Manager
Node Manager is the slave daemon of YARN. It has the following
responsibilities:
• Node Manager has to monitor the container’s resource usage, along with
reporting it to the Resource Manager.
• The health of the node on which YARN is running is tracked by the Node
Manager.
• It takes care of each node in the cluster while managing the workflow, along
with user jobs on a particular node.
• It keeps the data in the Resource Manager updated.
• Node Manager can also destroy or kill the container if it gets an order from the
Resource Manager to do so.
Application Master:
Every job submitted to the framework is an application, and every application has a
specific Application Master associated with it. Application Master performs the
following tasks:
• It coordinates the execution of the application in the cluster, along with managing the
faults.
• It works with the Node Manager for executing and monitoring other components’ tasks.
• At regular intervals, it sends heartbeats to the Resource Manager to report its health and to update the record of its resource demands.
Container
A container is a bundle of physical resources (memory, CPU cores, and disk) on a single node that the Resource Manager allocates; application tasks run inside containers.
Workflow of an Application in Apache Hadoop YARN
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container in which the Application Master starts.
3. The Application Master registers itself with the Resource Manager.
4. Allocate resources: the Application Master negotiates further containers from the Resource Manager.
5. Launching and executing: the Application Master asks the Node Managers to launch the containers, the application code runs inside them, and on completion the Application Master unregisters from the Resource Manager.
Benefits of YARN:
• Better cluster utilization: YARN allocates all cluster resources efficiently and
dynamically, which leads to better utilization of Hadoop as compared to the
previous version of it.
• Multi-tenancy: Various engines that access data on the Hadoop cluster can
efficiently work together all because of YARN as it is a highly versatile technology.
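As a small sketch of talking to the Resource Manager programmatically, the snippet below uses the YarnClient API to list the applications the cluster is tracking; it assumes hadoop-yarn-client is on the classpath and a yarn-site.xml pointing at a running Resource Manager.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // picks up yarn-site.xml if present
        try (YarnClient client = YarnClient.createYarnClient()) {
            client.init(conf);
            client.start();
            // Ask the Resource Manager for the applications it currently tracks.
            List<ApplicationReport> apps = client.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
            }
        }
    }
}
```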
Pig:
Pig is a scripting platform that runs on Hadoop clusters designed to process and
analyze large datasets. Pig is extensible, self-optimizing, and easily
programmed.
Programmers can use Pig to write data transformations without knowing Java.
Pig uses both structured and unstructured data as input to perform analytics and
uses HDFS to store the results.
HIVE:
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
In other words, Hive is an open-source system that processes structured data in Hadoop, residing on top of the latter for summarizing Big Data, as well as facilitating analysis and queries.
Limitations of Hive
Of course, no resource is perfect, and Hive has some limitations. They are:
• Hive doesn’t support OLTP. Hive supports Online Analytical Processing
(OLAP), but not Online Transaction Processing (OLTP).
• It doesn’t support subqueries.
• It has a high latency.
• Hive tables don’t support delete or update operations.
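To make the OLAP-style usage concrete, here is a hedged sketch of querying Hive over JDBC; it assumes a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical page_views table. Note that the query is an aggregate read, in line with the limitations above (no updates or deletes).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS visits FROM page_views GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + " -> " + rs.getLong("visits"));
            }
        }
    }
}
```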
Apache Mahout is an open source project that is primarily used for creating scalable
machine learning algorithms. It implements popular machine learning techniques such as:
• Recommendation
• Classification
• Clustering
Applications of Mahout
• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout
internally.
• Foursquare helps you in finding out places, food, and entertainment available in a particular
area. It uses the recommender engine of Mahout.
• Twitter uses Mahout for user interest modelling.
• Yahoo! uses Mahout for pattern mining.
What is OOZIE?
Apache Oozie is a workflow scheduler for Hadoop. It is a system which runs the
workflow of dependent jobs. Here, users are permitted to create Directed Acyclic
Graphs of workflows, which can be run in parallel and sequentially in Hadoop.
Oozie is scalable and can manage the timely execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
HBase
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the simplest
of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of
data in a single unit of time (random access).
Hadoop Random Access Databases
Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of
the databases that store huge amounts of data and access the data in a random manner.
HDFS vs HBase:
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
• HDFS is designed for high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data; HBase internally uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups.
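A minimal sketch of HBase's random access using the HBase Java client: write one row, then read it back by key. It assumes an HBase cluster reachable through an hbase-site.xml on the classpath and a hypothetical 'users' table with a column family 'info'.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write a single cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random read by row key: no scan of the whole data set is needed.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```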
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to
import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop
file system to relational databases. It is provided by the Apache Software Foundation.
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record
in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
Apache Flume is a reliable and distributed system for collecting, aggregating and
moving massive quantities of log data. It has a simple yet flexible architecture
based on streaming data flows. Apache Flume is used to collect log data present in log files on web servers and to aggregate it into HDFS for analysis.
ZooKeeper:
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services; Hadoop ecosystem components such as HBase use it for coordination.