Big Data Module 1,2,3

The document outlines the curriculum for a Big Data Analytics course, covering topics such as the definition and characteristics of big data, types of analytics, and the differences between big data and traditional data. It also introduces the Hadoop framework and its components, including HDFS and MapReduce, as well as various NoSQL databases. Additionally, it discusses Apache Spark architecture and failure management in distributed systems.


Department of Information Technology

Semester - VIII AY-2024-25

Course Code: ITDO8011


Course Title : Big Data Analytics
Module 1 : Introduction to Big Data
• Introduction to big data
• Big Data characteristics
• Types of Big Data
• Traditional vs. Big Data business approach
• Challenges
• Real-life examples
Def (Big Data): Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time.

Def (Big Data Analytics): Big data analytics is the process of examining big data to uncover information -- such as hidden patterns, correlations, market trends and customer preferences -- that can help organizations make informed business decisions.
Big Data Characteristics (The 5V's of big data analytics):
1. Volume. This refers to the massive amounts of data generated from different sources. For example,
this can consist of data from IoT devices, sensors, transaction logs and social media.
2. Velocity. Velocity refers to the speed at which this data is generated and how fast it's processed and
analyzed. If data is needed quickly, real-time or near-real-time data processing might be needed.
3. Variety. This refers to the data types, including structured, semistructured and unstructured data. It
also refers to the data's format, such as text, videos or images. The variety in data means that
organizations must have a flexible data management system to handle, integrate and analyze different
data types.
4. Veracity. Veracity refers to the accuracy and quality of data. The data must be reliable and should
contain minimal noise or anomalies. This is why tools that can clean, validate and verify data are
important.
5. Value. Value refers to the overall worth that big data analytics should provide. Large data sets should
be processed and analyzed to provide real-world meaningful insights that can positively affect an
organization's decisions.
Types of big data analytics
• Descriptive analytics. Data is analyzed for general assessment and summarization. For example, an organization can use such data in sales reporting to analyze marketing efficiency.
• Diagnostic analytics. This refers to analytics that determines why a problem occurred. For example,
this could include gathering and studying competitor pricing data to determine when a product's sales
fell off because the competitor undercut it with a price drop.
• Predictive analytics. This refers to analysis that predicts what comes next. For example, this could
include monitoring the performance of machines in a factory and comparing that data to historical data
to determine when a machine is likely to break down or require maintenance or replacement.
• Prescriptive analytics. This form of analysis follows diagnostics and predictions. After identifying an
issue, it recommends what can be done about it. For example, this could include addressing supply
chain inconsistencies that are causing pricing problems by identifying suppliers whose performance is
unreliable and suggesting their replacement.
• Real-time analytics. This refers to the processing and analyzing of data as it's generated. Real-time
analytics is useful in settings where large amounts of data are generated and quick decisions need to
be made based on that data. For example, this would be useful in fraud detection systems.
Difference between big data and traditional data:
1> Scale: traditional data is measured in gigabytes or terabytes, while big data runs to terabytes, petabytes and beyond.
2> Type of data: traditional data is mostly structured, while big data also includes semi-structured and unstructured data.
3> Data handling/processing: traditional data is handled with SQL/relational databases, while big data requires distributed frameworks such as Hadoop and Spark.
Challenges in Big Data:
1> Data accessibility.

2>Data quality maintenance.

3> Data security.

4> Choosing the right tools.

5> Talent shortages.


NoSQL databases vs. RDBMS
Module 2

Introduction to Big Data Framework


> What is Hadoop?
> Hadoop core components
> Hadoop ecosystem
1> Apache Hadoop is a framework for distributed storage and processing of large datasets.

2> Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing.

3> Hadoop is a big data platform consisting of multiple independent tools and modules such as Spark, YARN, HDFS, … These tools can be used and deployed independently for specific use cases.
Hadoop ecosystem (overview diagrams)
Core components of Hadoop….
HDFS
HDFS is the storage component of Hadoop framework. HDFS stands for Hadoop Distributed
File System.

It is a distributed file system designed to :

● Store petabytes of data across many machines.

● Be fault-tolerant.

● Provide low-cost storage.

● Provide high-throughput access to data, making it suitable for applications with large data sets (a minimal usage sketch follows).
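As a rough illustration, here is a minimal Python sketch that drives the standard hdfs dfs command-line interface; the directory and file names are placeholders, and an installed, configured hdfs client on the PATH is assumed.

# Hypothetical example: copy a local file into HDFS and read it back.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")            # create a directory in HDFS
hdfs("-put", "sales.csv", "/user/demo/")      # upload a local file
print(hdfs("-ls", "/user/demo"))              # list the directory
print(hdfs("-cat", "/user/demo/sales.csv"))   # read the file back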
HDFS architecture: NameNode and DataNodes (master/slave)
MapReduce
A programming model for processing and generating large datasets in parallel.

● Parallel Processing Model: Hadoop’s MapReduce programming model allows for

parallel processing of large datasets. It breaks down tasks into smaller sub-tasks that can

be executed in parallel across the nodes in the cluster.

● This parallelization significantly accelerates data processing, making it well-suited for

tasks like batch processing and large-scale analytics.


MapReduce

In a MapReduce program, Map() and Reduce() are the two core functions (a minimal Python sketch follows this list):

1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map function.
3. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
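A minimal sketch of these two functions in Python, in the style of Hadoop Streaming (which pipes text line by line through stand-alone mapper and reducer scripts); the classic word-count example is used, and mapper.py / reducer.py are placeholder file names.

# --- mapper.py: emit a (word, 1) key-value pair for every word ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# --- reducer.py: sum the counts per word (input arrives sorted by key) ---
import sys
from itertools import groupby

pairs = (line.strip().split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    total = sum(int(count) for _, count in group)
    print(f"{word}\t{total}")

With Hadoop Streaming these scripts would typically be supplied through the -mapper and -reducer options, with the framework handling the shuffle and sort between the two stages.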
Example 1 (map reduce)
Map Phase : assign helpers (mappers) to different sections of shelves, each responsible for
counting the books in a specific genre (e.g., Mystery, Science Fiction).

Each helper (mapper) creates a small list with the genre name and the count of books in that genre.
Sort phase: gather all these small lists from the helpers (mappers) and group/sort them by genre. Now, for each genre you have multiple entries which are grouped.
Reduce Phase : assign another group of helpers (reducers) to each genre’s list. Each helper
(reducer) takes a list, adds up the counts, and writes down the final total for that genre
Breaking it Down

● Mapping (Map): Assigning tasks and creating small lists.

● Grouping and Sorting: Organizing and grouping the small lists.

● Reducing (Reduce): Summing up and finalizing the results.


Illustration of Map Reduce 2
MapReduce Architecture

A MapReduce example process has the following phases (a small Python simulation follows the list):

1. Input Splits

2. Mapping

3. Shuffling

4. Sorting

5. Reducing
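To make these phases concrete, here is a small, self-contained Python simulation of the genre-counting example above; it only illustrates the data flow (input splits, map, shuffle/sort, reduce), not how Hadoop itself is launched, and the book list is made-up sample data.

# Illustrative simulation of the MapReduce phases in plain Python.
from collections import defaultdict

books = ["Mystery", "Sci-Fi", "Mystery", "Romance", "Sci-Fi", "Mystery"]

# 1. Input splits: divide the input into chunks, one per mapper.
splits = [books[0:2], books[2:4], books[4:6]]

# 2. Mapping: each mapper emits a (genre, 1) pair for every book in its split.
mapped = [(genre, 1) for split in splits for genre in split]

# 3./4. Shuffling and sorting: group all emitted values by key across mappers.
grouped = defaultdict(list)
for genre, count in sorted(mapped):
    grouped[genre].append(count)

# 5. Reducing: each reducer sums the counts for one genre.
totals = {genre: sum(counts) for genre, counts in grouped.items()}
print(totals)   # {'Mystery': 3, 'Romance': 1, 'Sci-Fi': 2}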
Map Reduce example process
Example 3
Traditional approach vs. MapReduce approach
NO SQL (not only SQL)
> NoSQL databases ("not only SQL") store data differently than relational tables. NoSQL
databases come in a variety of types based on their data model.
The main types are

1> document

2> key-value

3> wide-column

4> graph
Document-oriented databases

1> A document-oriented database stores data in documents similar to JSON (JavaScript Object Notation) objects.
2> Each document contains pairs of fields and values.
3> The values can typically be a variety of types, including things like strings, numbers, booleans, arrays, or even other objects.
4> Documents support nested structures, making it easy to represent complex relationships or hierarchical data.

Example: MongoDB (used by Uber)


Document-oriented databases : Example

{
  "_id": "12345",
  "name": "Rohit",
  "email": "[email protected]",
  "address": {
    "street": "123 Collectors Colony",
    "city": "some city",
    "state": "some state",
    "zip": "123456"
  },
  "hobbies": ["music", "guitar", "reading"]
}
Document-oriented databases : Example
Key-value databases

A key-value store is a simpler type of database where each item contains keys and values.
Each key is unique and associated with a single value. They are used for caching and
session management and provide high performance in reads and writes because they tend
to store things in memory.

Examples are Amazon DynamoDB and Redis.

Example: a word dictionary.

Users: Twitter and Pinterest
Key-value databases

Key: user:12345
Value: {"name": "Amit", "email": "...@bar.com", "designation": "software developer"}
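A minimal sketch of storing the same user record in a key-value store with the redis-py client; the connection settings and key names are assumptions for illustration, and a local Redis server is assumed.

# Hypothetical Redis (key-value) example using the redis-py client.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Write: one unique key mapped to a single value (a JSON string here).
r.set("user:12345", json.dumps({"name": "Amit", "designation": "software developer"}))

# Read it back -- the typical caching / session-management access pattern.
user = json.loads(r.get("user:12345"))
print(user["name"])

# Session-style usage: the key expires automatically after 30 minutes.
r.setex("session:abc123", 1800, "user:12345")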
Wide-column stores
Wide-column stores store data in tables, rows, and dynamic columns.
The data is stored in tables. However, unlike traditional SQL databases, wide-column stores
are flexible, where different rows can have different sets of columns.
These databases can employ column compression techniques to reduce the storage space
and enhance performance.
Examples: Apache Cassandra and HBase (used by Netflix)
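A brief sketch using the DataStax cassandra-driver package for Python; the keyspace, table and column names are made up, and a locally running Cassandra node with an existing keyspace is assumed.

# Hypothetical Cassandra example: rows in one table can populate different columns.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])             # connect to a local node
session = cluster.connect("demo_keyspace")   # keyspace assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id text PRIMARY KEY,
        name text,
        email text,
        last_login timestamp
    )
""")

# One row fills several columns, another only a subset; unset columns stay empty.
session.execute(
    "INSERT INTO user_profile (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Asha", "asha@example.com"),
)
session.execute(
    "INSERT INTO user_profile (user_id, name) VALUES (%s, %s)",
    ("u2", "Ravi"),
)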
Wide-column stores
Graph databases
A graph database stores data in the form of nodes and edges. Nodes typically store
information about people, places, and things (like nouns), while edges store information
about the relationships between the nodes. They work well for highly connected data, where
the relationships or patterns may not be very obvious initially.
Graph databases : Example

Example: Neo4j (used by Walmart, Facebook)
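A short sketch using the official neo4j Python driver and Cypher; the connection URI, credentials, labels and property names are placeholders, and a local Neo4j instance is assumed.

# Hypothetical Neo4j example: nodes for people, an edge for their relationship.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two Person nodes and a FRIENDS_WITH relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Rohit", b="Amit",
    )
    # Query the graph: who are Rohit's friends?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        name="Rohit",
    )
    print([record["friend"] for record in result])

driver.close()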


Graph databases : Example
Trends of Databases
MongoDB
Def: "MongoDB is a scalable, open source, high performance, document-oriented database."

Questions:
What was the need for MongoDB when there were already many databases in use?
What was the purpose of building MongoDB?
Replication in MongoDB.
Sharding in MongoDB.
1> All modern applications require big data, fast feature development and flexible deployment, and the older database systems were not competent enough, so MongoDB was needed.

2> Purpose of building MongoDB:
○ Scalability
○ Performance
○ High availability
○ Scaling from single-server deployments to large, complex multi-site architectures
○ Develop faster
○ Deploy easier
○ Scale bigger
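A minimal pymongo sketch of the document model described above; the database, collection and field names are illustrative, a local mongod instance is assumed, and replication/sharding are configured on the server side rather than in this client code.

# Hypothetical MongoDB example: insert and query a document like the earlier one.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]

db.users.insert_one({
    "_id": "12345",
    "name": "Rohit",
    "address": {"city": "some city", "state": "some state"},
    "hobbies": ["music", "guitar", "reading"],
})

# Query on a nested field and on an array element.
print(db.users.find_one({"address.city": "some city"}))
print(db.users.find_one({"hobbies": "guitar"}))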
Apache Spark Architecture

Def: The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs as the master node, and many executors that run across the worker nodes in the cluster. Apache Spark can be used for both batch processing and real-time processing.
The Spark architecture depends upon two abstractions:
○ Resilient Distributed Dataset (RDD)
○ Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


Resilient Distributed Datasets are groups of data items that can be stored in memory on the worker nodes. Here,
○ Resilient: Restore the data on failure.
○ Distributed: Data is distributed among different nodes.
○ Dataset: Group of data.
A Directed Acyclic Graph (DAG) is a finite directed graph that represents a sequence of computations on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data.
Driver Program
The Driver Program is a process that runs the main() function of the application and creates the
SparkContext object. The purpose of SparkContext is to coordinate the spark applications.
SparkContext connects to different types of cluster managers and then performs the following tasks:
○ It acquires executors on nodes in the cluster.
○ Then, it sends your application code to the executors. Here, the application code can be defined
by JAR or Python files passed to the SparkContext.
○ At last, the SparkContext sends tasks to the executors to run.
Cluster Manager Roles :

○ The role of the cluster manager is to allocate resources across applications.


○ Spark supports various types of cluster managers, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.
○ Spark Driver works in conjunction with the Cluster Manager to control the execution of various
other jobs
Worker Node
○ The worker node is a slave node
○ Its role is to run the application code in the cluster.

Executor
○ An executor is a process launched for an application on a worker node.
○ It runs tasks and keeps data in memory or on disk across them.
○ It reads and writes data to external sources.
○ Every application has its own executors.

Task
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed
on a single partition of data.
The driver program divides the Spark job into tasks and assigns them to the executor nodes for
execution.
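A minimal PySpark sketch that ties these pieces together: the driver creates the SparkContext, the chained transformations build the RDD lineage (the DAG), and the final action triggers tasks on the executors. The input path is a placeholder and a local Spark installation is assumed.

# Hypothetical PySpark word count: driver, RDD transformations (DAG), action.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)   # the driver program creates the SparkContext

lines = sc.textFile("input.txt")                       # RDD from an external source
counts = (lines.flatMap(lambda line: line.split())     # transformations only:
               .map(lambda word: (word, 1))            # they build the DAG lazily
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # the action triggers tasks on the executors
sc.stop()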
Execution of a MapReduce program: overview
Case of failures?

Any one of the four components can fail:

● Application master
● Node manager
● Resource manager
● Task
Task Failure
1> The most common case is task failure, typically when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits. The error finally makes it into the user logs, and the application master frees up the container.

2> When the application master is notified of a task attempt that has failed, it will reschedule execution of the task.

3> Hanging tasks are dealt with differently. The application master notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The task JVM process is killed automatically after this period.
Task Failure……..

4> The application master will try to avoid rescheduling the task on a node manager where
it has previously failed. Furthermore, if a task fails four times, it will not be retried again.
This value is configurable.

5> For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum
percentage of tasks that are allowed to fail without triggering job failure can be set for the
job
Application Master Failure
1> The maximum number of attempts to run a MapReduce application master is controlled
by the mapreduce.am.max-attempts property.

2>The default value is 2, so if a MapReduce application master fails twice it will not be
tried again and the job will fail.

3> An application master sends periodic heartbeats to the resource manager, and in the
event of application master failure, the resource manager will detect the failure and start a
new instance of the master running in a new container which is managed by a node
manager.
Node Manager Failure

1> If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager (or send them very infrequently). The resource manager
will notice a node manager that has stopped sending heartbeats if it hasn’t received one for
10 minutes (this timeout is configurable).

2> Node managers may be blacklisted if the number of failures for the application is high,
even if the node manager itself has not failed.

3> Blacklisting is done by the application master, and for MapReduce the application
master will try to reschedule tasks on different nodes if more than three tasks fail on a node
manager. (configurable)
Resource Manager Failure

1> Failure of the resource manager is serious, because without it, neither jobs nor task
containers can be launched.

2> To achieve high availability (HA), it is necessary to run a pair of resource managers in an
active-standby configuration. If the active resource manager fails, then the standby can take
over without a significant interruption to the client.

3> Information about all the running applications is stored in a highly available state store (backed by ZooKeeper or HDFS), so that the standby can recover the core state of the failed active resource manager.
