BIG DATA TECHNOLOGIES
MODULE - 1
Content
• What is Big Data? Evolution of Big Data
• Big data Challenges-Traditional versus big data approach
• Structured, unstructured, semi-structured and quasi structured data.
• Drivers for Big data- Five Vs
• Big data applications.
• Basics of Distributed File System
• The Big Data Technology Landscape: NoSQL and Hadoop
Contents
• History of Hadoop, Hadoop use cases
• Distributed File System
• HDFS architecture
• The Design of HDFS
• Name node and data node
• Blocks and replication management
• Rack awareness
• HDFS Federation
• Anatomy of File write
• Anatomy of File read.
What is Big Data?
• Big Data is a term used for collections of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications.
• The challenge includes capturing, curating, storing,
searching, sharing, transferring, analyzing and
visualization of this data.
Evolution of Big Data
Big Data Challenges: Traditional versus Big Data Approach
1. Traditional data is generated at the enterprise level; big data is generated both inside and outside the enterprise.
2. Traditional data volume ranges from gigabytes to terabytes; big data volume ranges from petabytes to zettabytes or exabytes.
3. Traditional database systems deal with structured data; big data systems deal with structured, semi-structured and unstructured data.
4. Traditional data is generated per hour or per day; big data is generated far more frequently, mainly every second.
5. Traditional data sources are centralized and managed in a centralized form; big data sources are distributed and managed in a distributed form.
6. For traditional data, data integration is very easy; for big data, data integration is very difficult.
7. A normal system configuration is capable of processing traditional data; a high-end system configuration is required to process big data.
8. The size of traditional data is very small; big data is much larger than traditional data.
9. Ordinary database tools are enough to perform operations on traditional data; special kinds of database tools are required for big data.
10. The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
11. Traditional data is stable, with known inter-relationships; big data is not stable, with unknown relationships.
12. Traditional data is of manageable volume; big data is of huge volume, which becomes unmanageable.
13. Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
14. Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc.; big data sources include social media, device data, sensor data, video, images, audio, etc.
Types of Big Data
• Unstructured
• Quasi-Structured
• Semi-Structured
• Structured
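To make these categories concrete, the sketch below shows the same kind of order information in each of the four forms. It is only an illustrative example: the sample record, the JSON fields and the log line are invented for demonstration and do not come from any real system.

// Illustrative only: one "order" represented in the four forms listed above.
public class DataVarietyExample {
    public static void main(String[] args) {
        // Structured: fixed schema, fits a relational table (columns: id, customer, amount)
        String structuredRow = "101 | Alice | 250.00";

        // Semi-structured: self-describing keys, but no rigid schema (e.g. JSON)
        String semiStructured = "{ \"id\": 101, \"customer\": \"Alice\", \"items\": [\"book\", \"pen\"] }";

        // Quasi-structured: textual data with an erratic, tool-specific format (e.g. a clickstream log line)
        String quasiStructured = "192.168.1.5 - - [12/Mar/2024:10:15:32] \"GET /product/101 HTTP/1.1\" 200";

        // Unstructured: free text, images, audio, video - no inherent data model
        String unstructured = "Alice wrote: the delivery was quick and the packaging was great!";

        System.out.println(structuredRow);
        System.out.println(semiStructured);
        System.out.println(quasiStructured);
        System.out.println(unstructured);
    }
}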
Characteristics of Big Data
The five characteristics that define Big Data are: Volume, Velocity,
Variety, Veracity and Value.
VOLUME
• Volume refers to the ‘amount of data’,
which is growing day by day at a very fast
pace.
• The size of data generated by humans,
machines and their interactions on social
media itself is massive.
• Researchers have predicted that 40
Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of
300 times from 2005.
Characteristics of Big Data
VELOCITY
• Velocity is defined as the pace at which different sources
generate the data every day.
• This flow of data is massive and continuous.
• Facebook, for example, has reported about 1.03 billion Daily Active Users (DAU) on mobile, an increase of 22% year-over-year.
• This shows how fast the number of users is growing on social media and how fast data is being generated daily.
• If we are able to handle the velocity, we will be able to generate
insights and take decisions based on real-time data.
VARIETY
• As there are many sources contributing to Big Data, the type of data they generate is different.
• It can be structured, semi-structured or unstructured.
• Hence, there is a variety of data getting generated every day.
• Earlier, we used to get data from Excel sheets and databases; now data also arrives in the form of images, audio, video, sensor data, etc.
• Hence, this variety of unstructured data creates problems in
capturing, storage, mining and analyzing the data.
VERACITY
• Veracity refers to the data in doubt or uncertainty of data available due to data
inconsistency and incompleteness.
• For example, a dataset may have missing values in a table, or values that are hard to accept, such as a recorded minimum of 15,000 where that is not physically possible.
• This inconsistency and incompleteness is Veracity.
• Data available can sometimes get messy and maybe difficult to trust.
• With many forms of big data, quality and accuracy are difficult to control like
Twitter posts with hashtags, abbreviations, typos and colloquial speech.
• The volume is often the reason behind the lack of quality and accuracy in the data.
VALUE
• It is all well and good to have access to big data, but unless we can turn it into value it is useless.
• Turning it into value means asking: is it adding to the benefit of the organization analyzing it? Is the organization working on Big Data achieving a high ROI (Return On Investment)?
• Unless working on Big Data adds to their profits, it is useless.
Applications of Big Data
• Smarter Healthcare
-Making use of petabytes of patient data, organizations can extract meaningful information and then build applications that can predict a patient's deteriorating condition in advance.
• Telecom
-The telecom sector collects information, analyzes it and provides solutions to different problems.
- By using Big Data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, and thus provide a seamless connection to their customers.
Applications of Big Data
• Retail
Retail has some of the tightest margins, and is one of the greatest
beneficiaries of big data.
The beauty of using big data in retail is to understand consumer
behavior.
Amazon's recommendation engine provides suggestions based on the browsing history of the consumer.
• Traffic control
Traffic congestion is a major challenge for many cities globally.
Effective use of data and sensors will be key to managing traffic better
as cities become increasingly densely populated.
Applications of Big Data
• Manufacturing
Analyzing big data in the manufacturing industry can reduce
component defects, improve product quality, increase efficiency, and
save time and money.
• Search Quality
Every time we extract information from Google, we simultaneously generate data for it.
Google stores this data and uses it to improve its search quality.
The Big Data Technology Landscape: NoSQL (Not Only SQL)
Types of NoSQL Databases
• Key-value pair based
• Column-oriented
• Document-oriented
• Graph-based
Types of NoSQL Databases
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to
handle lots of data and heavy load.
• Key-value pair storage databases store data as a hash table where
each key is unique, and the value can be a JSON, BLOB(Binary Large
Objects), string, etc.
• It is one of the most basic types of NoSQL database. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data.
• They work well for data such as shopping cart contents.
• Redis, Dynamo and Riak are some examples of key-value store databases; Dynamo and Riak are based on Amazon's Dynamo paper.
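A key-value store can be pictured as a big, distributed hash table. The sketch below uses a plain Java HashMap only to illustrate the idea behind the shopping-cart use case above; the keys, values and class name are invented for illustration, and this is not the API of Redis, Dynamo or Riak.

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of key-value storage: unique key -> opaque value (here a JSON string).
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();

        // The key is unique (e.g. a user's cart id); the value can be JSON, a BLOB, a string, etc.
        store.put("cart:alice", "{ \"items\": [\"book\", \"pen\"], \"total\": 12.50 }");
        store.put("cart:bob",   "{ \"items\": [\"laptop\"], \"total\": 799.00 }");

        // Lookups are by key only - there is no query over the structure of the value.
        System.out.println(store.get("cart:alice"));
    }
}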
Types of NoSQL Databases
Column-based
• Column-oriented databases work on
columns and are based on BigTable paper
by Google.
• Every column is treated separately. Values
of single column databases are stored
contiguously.
• They deliver high performance on
aggregation queries like SUM, COUNT, AVG,
MIN etc. as the data is readily available in a
column.
• Column-based NoSQL databases are widely used for data warehouses, business intelligence, CRM, library card catalogs, etc.
• HBase, Cassandra and Hypertable are examples of column-based databases.
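The point about fast aggregation can be illustrated with a toy model: when all values of a column are stored together, computing SUM or AVG touches only that column. The sketch below is purely conceptual (the column names and values are invented) and is not the HBase or Cassandra API.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Conceptual sketch: a "column store" keeps the values of each column together,
// so an aggregate over one column never has to read the others.
public class ColumnStoreSketch {
    public static void main(String[] args) {
        Map<String, List<Double>> columns = Map.of(
                "order_amount", Arrays.asList(250.0, 12.5, 799.0),
                "shipping_cost", Arrays.asList(5.0, 2.5, 0.0));

        // SUM over a single column: only the "order_amount" values are scanned.
        double sum = columns.get("order_amount").stream().mapToDouble(Double::doubleValue).sum();
        System.out.println("SUM(order_amount) = " + sum);
    }
}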
Types of NoSQL Databases
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key
value pair but the value part is stored as a document.
• The document is stored in JSON or XML formats.
• The value is understood by the DB and can be queried.
Types of NoSQL Databases
Document-Oriented
• In a relational database we have rows and columns, whereas a document database stores each record as a document with a structure similar to JSON.
• For a relational database, we have to know in advance what columns we have, and so on.
• For a document database, however, we store data like a JSON object. We do not need to define a schema up front, which makes it flexible.
• The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications.
• It should not be used for complex transactions that require multiple operations or queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMSs.
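A document groups related fields into one self-describing record, and two documents in the same collection need not share the same fields. The sketch below models documents as nested Java maps purely for illustration; the field names are invented and this is not the MongoDB or CouchDB driver API.

import java.util.List;
import java.util.Map;

// Conceptual sketch: documents in the same collection can have different fields (schema-less).
public class DocumentStoreSketch {
    public static void main(String[] args) {
        Map<String, Object> product1 = Map.<String, Object>of(
                "_id", "p101",
                "name", "Notebook",
                "price", 3.5,
                "tags", List.of("stationery", "paper"));

        // A second document with an extra field - no schema change needed.
        Map<String, Object> product2 = Map.<String, Object>of(
                "_id", "p102",
                "name", "Headphones",
                "price", 49.0,
                "warrantyMonths", 12);

        System.out.println(product1);
        System.out.println(product2);
    }
}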
Types of NoSQL Databases
Graph-Based
• A graph type database stores entities as well the relations amongst those
entities.
• The entity is stored as a node with the relationship as edges.
• An edge gives a relationship between nodes.
• Every node and edge has a unique identifier.
• Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature.
• Traversing relationships is fast as they are already captured in the DB, and there is no need to calculate them.
• Graph-based databases are mostly used for social networks, logistics and spatial data.
• Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based
databases.
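The idea of relationships stored as edges can be sketched as an adjacency list: following an edge is a direct lookup, with no join to compute. The node names below are invented, and this conceptual sketch is not the Neo4j API.

import java.util.List;
import java.util.Map;

// Conceptual sketch: nodes plus explicitly stored edges; traversal is a direct lookup.
public class GraphSketch {
    public static void main(String[] args) {
        // "FOLLOWS" edges of a tiny social network, stored as an adjacency list.
        Map<String, List<String>> follows = Map.of(
                "alice", List.of("bob", "carol"),
                "bob",   List.of("carol"),
                "carol", List.of("alice"));

        // Who does alice follow? No join, no computation - the relationship is already stored.
        System.out.println("alice follows " + follows.get("alice"));
    }
}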
Hadoop
• Hadoop is an open-source software framework used for storing and
processing Big Data in a distributed manner on large clusters of
commodity hardware.
• Hadoop is licensed under the Apache v2 license.
• Hadoop was developed, based on the paper written by Google on the
MapReduce system and it applies concepts of functional
programming.
• Hadoop is written in the Java programming language and is a top-level Apache project.
• Hadoop was developed by Doug Cutting and Michael J. Cafarella.
History of Hadoop
USE CASE OF HADOOP
The ClickStream analysis using Hadoop provides three key benefits:
• Hadoop helps to join ClickStream data with other data sources such as
Customer Relationship Management Data (Customer Demographics
Data, Sales Data, and Information on Advertising Campaigns). This
additional data often provides the much needed information to
understand customer behavior.
• Hadoop's scalability helps you to store years of data without much incremental cost. This helps you to perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
• Business analysts can use Apache Pig or Apache Hive for website
analysis. With these tools, you can organize ClickStream data by user
session, refine it, and feed it to visualization or analytics tools.
USE CASE OF HADOOP
• ClickStream data (mouse clicks) helps you to understand the
purchasing behavior of customers.
• ClickStream analysis helps online marketers to optimize their
product web pages, promotional content, etc. to improve their
business.
Distributed File System
• A distributed file system (DFS) is a
file system that is distributed on
various file servers and locations.
• It permits programs to access and store remote data in the same way as local files.
• It also permits the user to access
files from any system.
• It allows network users to share
information and files in a regulated
and permitted manner. The servers have complete control over the data and provide access control to users.
Distributed File System
• DFS's primary goal is to enable users of physically
distributed systems to share resources and information
through the Common File System (CFS).
• It is a file system that runs as a part of the operating
systems. Its configuration is a set of workstations and
mainframes that a LAN connects.
Distributed File System
• DFS has two components in its services, and these are as follows:
1. Local transparency: achieved via the namespace component.
2. Redundancy: achieved via a file replication component.
• In the case of failure or heavy load, these components work
together to increase data availability by allowing data from
multiple places to be logically combined under a single folder
known as the "DFS root".
• It is not required to use both DFS components simultaneously;
the namespace component can be used without the file
replication component, and the file replication component can be
used between servers without the namespace component.
Applications of Distributed File System
• There are several applications of the distributed file system. Some of
them are as follows:
• Hadoop
✓ Hadoop is a collection of open-source software services.
✓ It is a software framework that uses the MapReduce programming
style to allow distributed storage and management of large amounts
of data.
✓ Hadoop is made up of a storage component, known as the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model.
Blocks
• Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored.
• In general, in any of the File System, you store the data as a collection
of blocks.
• Similarly, HDFS stores each file as blocks which are scattered
throughout the Apache Hadoop cluster.
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
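As a quick worked example, assume the 128 MB default block size and a hypothetical 500 MB file: it is stored as three full 128 MB blocks plus one 116 MB block, i.e. four blocks in total. The small sketch below does exactly that arithmetic.

// Worked example: how many HDFS blocks does a file occupy?
public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;      // 128 MB default in Hadoop 2.x
        long fileSize  = 500L * 1024 * 1024;      // hypothetical 500 MB file

        long fullBlocks  = fileSize / blockSize;             // 3 full blocks
        long remainder   = fileSize % blockSize;             // 116 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Blocks needed: " + totalBlocks); // prints 4
        // Note: the last block only occupies 116 MB on disk, not a full 128 MB.
    }
}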
Anatomy of File Read
The steps involved in the anatomy of File Read are as follows:
1. The client calls open() on the Distributed File System to open the file it wishes to read.
2. The Distributed File System communicates with the Name Node to get the locations of the data blocks. The Name Node returns the addresses of the Data nodes that the data blocks are stored on. Subsequent to this, the Distributed File System returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream DFSInputStream, which has the addresses of the Data nodes for the first few blocks of the file, and connects to the closest Data node for the first block in the file.
Anatomy of File Read
4. Client calls read() repeatedly to stream the data from the Data node.
5. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
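Seen from client code, the read path above reduces to open(), read() and close(). The sketch below uses the standard Hadoop FileSystem / FSDataInputStream API; the path /user/demo/input.txt is a placeholder, and the cluster address is assumed to come from the configuration files on the classpath.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS read sketch: open() -> read() -> close(), mirroring the steps above.
public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();             // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // DistributedFileSystem instance

        Path file = new Path("/user/demo/input.txt");         // placeholder path
        try (FSDataInputStream in = fs.open(file)) {          // NameNode is asked for block locations here
            IOUtils.copyBytes(in, System.out, 4096, false);   // read() streams data from the DataNodes
        }                                                     // close() ends the connection
    }
}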
Anatomy of File Write
• The steps involved in anatomy of File Write are as follows:
1. The client calls create() on Distributed File System to create a file.
2. An RPC call to the Name Node happens through the Distributed File
System to create a new file. The Name Node performs various
checks to create a new file (checks whether such a file exists or
not). Initially, the Name Node creates a file without associating any
data blocks to the file. The Distributed File System returns an
FSDataOutputStream to the client to perform write.
3. As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue called the data queue. The DataStreamer consumes the data queue. The DataStreamer requests the Name Node to allocate new blocks by
selecting a list of suitable Data nodes to store replicas. This list of
Data nodes makes a pipeline. Here, we will go with the default
replication factor of three, so there will be three nodes in the
pipeline for the first block.
Anatomy of File Write
4. The DataStreamer streams the packets to the first Data node in the pipeline, which stores each packet and forwards it to the second Data node in the pipeline. In the same way, the second Data node stores the packet and forwards it to the third Data node in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an
"Ack queue" of packets that are waiting for the acknowledgement by
Data nodes. A packet is removed from the "Ack queue" only if it is
acknowledged by all the Data nodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the Data node pipeline and waits for the relevant acknowledgments before contacting the Name Node to signal that the file write is complete.
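From the client's point of view, the write path above reduces to create(), write() and close(), as in the sketch below. It again uses the standard Hadoop FileSystem API; the output path is a placeholder chosen for illustration.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS write sketch: create() -> write() -> close(), mirroring the steps above.
public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/output.txt");          // placeholder path
        try (FSDataOutputStream out = fs.create(file)) {         // NameNode records the new file (no blocks yet)
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8)); // data is packetized and pipelined to DataNodes
        }                                                        // close() flushes packets and waits for acks
    }
}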
MAPREDUCE
Hadoop Mapreduce paradigm
• Hadoop is an open-source software framework
for storing and processing large datasets ranging
in size from gigabytes to petabytes.
• It was developed under the Apache Software Foundation.
• There are basically two components in Hadoop:
1. Massive data storage
2. Faster data processing
Hadoop Mapreduce paradigm
• Hadoop Distributed File System (HDFS):
• It allows you to store data of various formats across a cluster.
• Map-Reduce:
• It is the processing layer of Hadoop. It allows parallel processing over the data stored across HDFS.
Why Hadoop?
• Cost Effective System
• Computing power
• Scalability
• Storage flexibility
• Inherent data protection
• Varied Data Sources
• Fault-Tolerant
• Highly Available
• High Throughput
• Multiple Languages Supported
Traditional Restaurant Scenario
Traditional Scenario
Distributed Processing Scenario
Distributed Processing Scenario Failure
Solution of the Restaurant Problem
Hadoop in Restaurant Analogy
Map tasks
• Process independent chunks in a parallel manner
• Output of map task stored as intermediate data
on local disk of that server
Reduce task
• Output of mapper automatically shuffled and stored by
framework
• Sorts the output based on key
• Provide reduced output by combining the output from
various mappers
Map-reduce daemons
1. JobTracker
2. TaskTracker
JobTracker
• Master daemon
• Single JobTracker per Hadoop cluster
• Provide connectivity between Hadoop and client
application
• Creates the execution plan (which task to assign to which node)
• Monitors all running tasks
• If a task fails, it is rescheduled
Task Tracker
• Responsible for executing the individual tasks assigned by the JobTracker
• One TaskTracker per slave node
• Continuously sends heartbeat messages to the JobTracker
• If no heartbeat message is received, the task is allocated to another TaskTracker
Map-reduce execution pipeline
Mapper
• Mapper maps the input key-value pairs into a set of
intermediate key-value pairs
• Phases:
1. RecordReader:
• Converts the input split into records as key-value pairs
• <key, value> → <positional information (byte offset), chunk of data that constitutes the record>
2. Map:
• generate zero or more intermediate key-value pairs
Mapper
3. Combiner
• An optimization technique for a MapReduce job; it applies a user-specified aggregate function to the output of a single mapper
• Also known as a local reducer
4. Partitioner
• Decides which reducer each intermediate key-value pair is sent to (a sketch of the default hash partitioner follows below)
• Usually the number of partitions is equal to the number of reducers
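For reference, Hadoop's default HashPartitioner assigns a record to a reducer by hashing the key and taking it modulo the number of reduce tasks. The small sketch below reproduces that logic outside Hadoop so it can be run on its own.

// Logic of Hadoop's default HashPartitioner: partition = (hash of key) mod (number of reducers).
public class HashPartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // Mask with Integer.MAX_VALUE to keep the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}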
Reducer
1. Shuffle and sort:
• Consumes the output of the Map phase
• Consolidates the relevant records from the Map phase output
• For example, in word count, the same words are clubbed together along with their respective frequencies
Reducer
2. Reduce:
• Takes the grouped data produced by the shuffle and sort phase
• Applies the reduce function, processing one group at a time
• The reduce function iterates over all the values associated with a key
• Typical operations are aggregation, filtering and combining
3. Output format:
• Separates key and value with a tab by default
• Writes the output to a file using a RecordWriter
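Putting the Mapper, Combiner and Reducer phases together, the classic word-count job looks like the sketch below. It follows the well-known Apache Hadoop WordCount example and uses the newer org.apache.hadoop.mapreduce API (the import list on the following slides refers to the older mapred API); the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit <word, 1> for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase (also usable as combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local reducer on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled jar can be submitted with the usual hadoop jar command (for example, hadoop jar wordcount.jar WordCount <input> <output>), and the output files contain each word and its count separated by a tab, matching the output format described above.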
API
• Main Class file Packages
Main class file packages
• import org.apache.hadoop.conf.Configured; (Configuration of system parameters)
• import org.apache.hadoop.util.Tool; (interface that supports handling of generic command-line options in MapReduce programs)
• import org.apache.hadoop.util.ToolRunner;
Mapper File Packages
• import java.io.IOException; ( Exception handle)
Reducer file Package
• import java.io.IOException; ( Exception handle)
• import java.util.Iterator; (used to iterate over the values associated with a key, e.g. via hasNext() and next())
Reducer file Package
• import org.apache.hadoop.mapred.MapReduceBase; (base class that Mapper and Reducer implementations inherit from in the old mapred API)
Hadoop 2.0 features
• HDFS Federation – horizontal scalability of
NameNode
• NameNode High Availability – NameNode is no
longer a Single Point of Failure
• YARN – ability to process Terabytes and
Petabytes of data available in HDFS using Non-
MapReduce applications such as MPI, GIRAPH
Namenode high availability
• In Hadoop 1.x, the NameNode was a single point of failure.
• Hadoop administrators had to manually recover the NameNode using the Secondary NameNode.
• The Hadoop 2.0 architecture supports multiple NameNodes to remove this bottleneck.
• Passive Standby NameNode support.
• In case of Active NameNode failure, the passive
NameNode becomes the Active NameNode and
starts writing to the shared storage
MapReduce Execution Framework
Motivation for MR V2
YARN Architecture
YARN(Yet Another Resource Negotiator)
• The main idea is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.
YARN daemons
1. Global resource manager:
a) Scheduler (allocates resources among the various running applications)
b) Applications Manager (accepts job submissions, restarts the Application Master in case of failure)
YARN daemons
2. Node manager:
• Per-machine slave daemon
• Launches application containers for application execution
• Reports resource usage to the global Resource Manager
YARN daemons
3. Application master:
• Application-specific entity
• Negotiates the required resources for execution from the Resource Manager
• Works with the Node Manager to execute and monitor the component tasks
YARN
YARN workflow
1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. The application code is executed in the containers
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the Resource Manager