Big Data Module 1,2,3
2> Hadoop is not a type of database, but rather a software ecosystem that allows for massively
parallel computing.
3> Hadoop is a big data platform consisting of multiple independent tools and modules such as
Spark, YARN, HDFS, … These tools can be used and deployed independently for specific use cases.
Hadoop ecosystem 1
Hadoop ecosystem 2
Hadoop ecosystem 3
Core components of Hadoop….
HDFS
HDFS is the storage component of the Hadoop framework. HDFS stands for Hadoop Distributed
File System. It is designed to:
● Be fault-tolerant
● Provide high-throughput access to data, making it suitable for applications with large data
sets.
HDFS follows a master/slave architecture: the NameNode is the master, and the DataNodes are
the slaves.
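A minimal sketch of writing to and reading from HDFS over WebHDFS using the Python hdfs package; the NameNode host, port (9870 is the usual Hadoop 3 default) and the file path are assumptions, not values from the slides:

from hdfs import InsecureClient  # pip install hdfs

# Assumed WebHDFS endpoint of the NameNode; adjust host/port for your cluster.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a small file into HDFS.
with client.write('/user/hadoop/example.txt', overwrite=True) as writer:
    writer.write(b'hello hdfs\n')

# Read it back.
with client.read('/user/hadoop/example.txt') as reader:
    print(reader.read())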
MapReduce
A programming model for processing and generating large datasets in parallel. It breaks a job
down into smaller sub-tasks that can be executed concurrently across the cluster.
1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map
function.
3. The result generated by the Map function is a key-value pair (K, V), which acts as the
input for the Reduce function.
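A minimal, self-contained Python sketch of this flow (illustrative only, not from the slides): the map step emits (word, 1) key-value pairs, the reduce step sums them per key, and sorting the pairs stands in for the framework's shuffle/sort.

import sys
from itertools import groupby

def map_phase(lines):
    # Map: split the input and emit a (word, 1) key-value pair for every word.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_phase(sorted_pairs):
    # Reduce: pairs arrive grouped by key; aggregate the counts for each word.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = sorted(map_phase(sys.stdin))   # stands in for the shuffle/sort phase
    for word, total in reduce_phase(pairs):
        print(f"{word}\t{total}")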
Example 1 (MapReduce)
Map Phase: assign helpers (mappers) to different sections of shelves, each responsible for
counting the books in a specific genre (e.g., Mystery, Science Fiction).
Each helper (mapper) creates a small list with the genre name and the count of books in that genre.
Sort Phase: gather all these small lists from the helpers (mappers) and group/sort them by
genre. Now, for each genre you have multiple entries which are grouped together.
Reduce Phase: assign another group of helpers (reducers) to each genre's list. Each helper
(reducer) takes a list, adds up the counts, and writes down the final total for that genre.
Breaking it Down
1. Input Splits
2. Mapping
3. Shuffling
4. Sorting
5. Reducing
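A small Python sketch that walks the library example through these five steps; the genre data is made up purely for illustration:

from collections import defaultdict

# 1. Input splits: each "split" is one section of shelves (assumed sample data).
splits = [
    ["Mystery", "Sci-Fi", "Mystery"],
    ["Sci-Fi", "Romance", "Mystery"],
]

# 2. Mapping: each mapper emits a (genre, 1) pair per book in its split.
mapped = [(genre, 1) for split in splits for genre in split]

# 3. Shuffling: bring all pairs with the same key (genre) together.
shuffled = defaultdict(list)
for genre, count in mapped:
    shuffled[genre].append(count)

# 4. Sorting: process the genres in key order.
sorted_genres = sorted(shuffled.items())

# 5. Reducing: sum the counts for each genre.
for genre, counts in sorted_genres:
    print(genre, sum(counts))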
Map Reduce example process
Example 3
Traditional approach vs MapReduce approach
NoSQL (not only SQL)
> NoSQL databases ("not only SQL") store data differently than relational tables. NoSQL
databases come in a variety of types based on their data model.
The main types are:
1> Document
2> Key-value
3> Wide-column
4> Graph
Document-oriented databases
{
  "_id": "12345",
  "name": "Rohit",
  "email": "[email protected]",
  "address": {
    "street": "123 Collectors Colony",
    "city": "some city",
    "state": "some state",
    "zip": "123456"
  },
  "hobbies": ["music", "guitar", "reading"]
}
Document-oriented databases : Example
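An illustrative sketch (not from the slides) of storing and querying such a document with pymongo, MongoDB's Python driver; the connection URI, database, and collection names are assumptions:

from pymongo import MongoClient  # pip install pymongo

# Assumed local MongoDB instance; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["users"]

# Insert a document shaped like the example above; no predefined schema is needed.
collection.insert_one({
    "_id": "12345",
    "name": "Rohit",
    "hobbies": ["music", "guitar", "reading"],
})

# Query directly on an array field.
print(collection.find_one({"hobbies": "guitar"}))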
Key-value databases
A key-value store is a simpler type of database where each item contains keys and values.
Each key is unique and associated with a single value. They are used for caching and
session management and provide high performance in reads and writes because they tend
to store things in memory.
Key: user:12345
Value: {"name": "Amit", "email": "@bar.com", "designation": "software developer"}
Wide-column stores
Wide-column stores store data in tables, rows, and dynamic columns. However, unlike traditional
SQL databases, wide-column stores are flexible: different rows can have different sets of columns.
These databases can employ column compression techniques to reduce the storage space
and enhance performance.
Examples: Apache Cassandra and HBase (Netflix is a well-known user of Cassandra).
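A hedged sketch (not from the slides) using the Python cassandra-driver; the keyspace, table, and data are assumptions, and it only illustrates that different rows can populate different subsets of columns:

from cassandra.cluster import Cluster  # pip install cassandra-driver

# Assumed local Cassandra node; keyspace and table names are illustrative.
session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id text PRIMARY KEY, name text, email text, city text)
""")

# Different rows write different subsets of columns; unwritten columns take no storage.
session.execute("INSERT INTO demo.users (user_id, name) VALUES ('u1', 'Amit')")
session.execute("INSERT INTO demo.users (user_id, city) VALUES ('u2', 'Pune')")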
Graph databases
A graph database stores data in the form of nodes and edges. Nodes typically store
information about people, places, and things (like nouns), while edges store information
about the relationships between the nodes. They work well for highly connected data, where
the relationships or patterns may not be very obvious initially.
Graph databases : Example
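An illustrative sketch (not from the slides) using the Neo4j Python driver: two person nodes connected by a KNOWS edge; the URI, credentials, labels, and property names are assumptions:

from neo4j import GraphDatabase  # pip install neo4j

# Assumed local Neo4j instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes store the "nouns"; the edge stores the relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Amit", b="Rohit",
    )
    # Traverse the relationship to find who Amit knows.
    result = session.run(
        "MATCH (:Person {name: $a})-[:KNOWS]->(p) RETURN p.name", a="Amit"
    )
    print([record["p.name"] for record in result])

driver.close()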
Questions:
What was the need for MongoDB when there were already many databases in use?
What was the purpose of building MongoDB?
Replication in MongoDB.
Sharding in MongoDB.
1> Modern applications require big data handling, fast feature development, and flexible
deployment, and the older database systems were not competent enough for this, so MongoDB
was needed.
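As a client-side illustration of replication (a sketch under assumptions, not from the slides): pymongo can connect to a replica set and automatically route writes to the primary; the host names and the replica-set name "rs0" are assumptions:

from pymongo import MongoClient

# Assumed three-member replica set named "rs0".
client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)

# Writes go to the primary; if it fails, the driver fails over to the new primary
# elected from the secondaries, which hold replicated copies of the data.
client["demo_db"]["users"].insert_one({"name": "Rohit"})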
Def: The Apache Spark framework uses a master-slave architecture that consists of a driver,
which runs on a master node, and many executors that run across the worker nodes in the cluster.
Apache Spark can be used for both batch processing and real-time processing.
The Spark architecture depends upon two abstractions:
○ Resilient Distributed Dataset (RDD)
○ Directed Acyclic Graph (DAG)
Executor
○ An executor is a process launched for an application on a worker node.
○ It runs tasks and keeps data in memory or disk storage across them.
○ It reads and writes data to external sources.
○ Every application has its own executors.
Task
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed
on a single partition of data.
The driver program divides the Spark job into tasks and assigns them to the executor nodes for
execution.
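A minimal PySpark sketch (illustrative; local mode and the data are assumptions) showing the driver creating an RDD whose partitions become tasks executed by the executors:

from pyspark import SparkContext

# The driver: running locally with 2 executor threads (an assumption for the sketch).
sc = SparkContext("local[2]", "rdd-demo")

# An RDD with 4 partitions; each partition is processed by a separate task.
rdd = sc.parallelize(range(1, 11), numSlices=4)

# Transformations build the DAG lazily; the action (reduce) triggers execution,
# and the driver schedules one task per partition on the executors.
total_of_squares = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total_of_squares)  # 385

sc.stop()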
Execution of a MapReduce program: overview
Case of failures?
● Application master
● Node manager
● Resource manager
● Task
Case of failures?
Task Failure
1> The most common case is task failure, which usually occurs when user code in the map or
reduce task throws a runtime exception. If this happens, the task JVM reports the error back
to its parent application master before it exits. The error ultimately makes it into the user
logs, the application master marks the task attempt as failed, and the container is freed up.
2> When the application master is notified of a task attempt that has failed, it will
reschedule execution of the task.
3> Hanging tasks are dealt with differently. The application master notices that it hasn’t
received a progress update for a while (10 minutes by default) and proceeds to mark the task
as failed. The task JVM process will be killed automatically after this period.
Task Failure (continued)
4> The application master will try to avoid rescheduling the task on a node manager where
it has previously failed. Furthermore, if a task fails four times, it will not be retried again.
This value is configurable.
5> For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum
percentage of tasks that are allowed to fail without triggering job failure can be set for the
job.
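For reference, a sketch of the mapred-site.xml properties behind these behaviours; property names are from Hadoop 2.x/3.x, the values shown are the usual defaults, and both should be verified against your cluster's version:

<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>   <!-- attempts per map task before the task is declared failed -->
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>   <!-- attempts per reduce task -->
</property>
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>   <!-- ms without progress before a hanging task is killed -->
</property>
<property>
  <name>mapreduce.map.failures.maxpercent</name>
  <value>0</value>   <!-- % of map tasks allowed to fail without failing the job -->
</property>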
Application Master Failure
1> The maximum number of attempts to run a MapReduce application master is controlled
by the mapreduce.am.max-attempts property.
2> The default value is 2, so if a MapReduce application master fails twice it will not be
tried again and the job will fail.
3> An application master sends periodic heartbeats to the resource manager, and in the
event of application master failure, the resource manager will detect the failure and start a
new instance of the master running in a new container which is managed by a node
manager.
Node Manager Failure
1> If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager (or send them very infrequently). The resource manager
will notice a node manager that has stopped sending heartbeats if it hasn’t received one for
10 minutes (configurable).
2> Node managers may be blacklisted if the number of failures for the application is high,
even if the node manager itself has not failed.
3> Blacklisting is done by the application master, and for MapReduce the application
master will try to reschedule tasks on different nodes if more than three tasks fail on a node
manager. (configurable)
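A sketch of the related properties; names are from Hadoop 2.x/3.x, the values are the usual defaults, and both should be verified against your version:

<!-- yarn-site.xml: how long the resource manager waits before declaring a node manager dead -->
<property>
  <name>yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>   <!-- 10 minutes -->
</property>

<!-- mapred-site.xml: task failures on one node manager before the AM blacklists it -->
<property>
  <name>mapreduce.job.maxtaskfailures.per.tracker</name>
  <value>3</value>
</property>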
Resource Manager Failure
1> Failure of the resource manager is serious, because without it, neither jobs nor task
containers can be launched.
2> To achieve high availability (HA), it is necessary to run a pair of resource managers in an
active-standby configuration. If the active resource manager fails, then the standby can take
over without a significant interruption to the client.
3> Information about all the running applications is stored in a highly available state store
(backed by ZooKeeper or HDFS), so that the standby can recover the core state of the failed
active resource manager.
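A sketch of the yarn-site.xml settings for running an active-standby resource manager pair; property names are from Hadoop 2.x/3.x, and the host names and IDs are assumptions:

<!-- yarn-site.xml (illustrative) -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181</value>   <!-- ZooKeeper-backed state store -->
</property>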