Big Data Module 1,2,3

The document outlines the curriculum for a Big Data Analytics course, covering topics such as the definition and characteristics of big data, types of analytics, and the differences between big data and traditional data. It also introduces the Hadoop framework and its components, including HDFS and MapReduce, as well as various NoSQL databases. Additionally, it discusses Apache Spark architecture and failure management in distributed systems.


Department of Information Technology

Semester - VIII AY-2024-25

Course Code: ITDO8011


Course Title : Big Data Analytics
Module 1 : Introduction to Big Data
• Introduction to big data
• Big Data characteristics
• Types of Big Data
• Traditional vs. Big Data business approach
• Challenges
• Real-life examples
Def (Big Data): Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time.

Def (Big Data Analytics): Big data analytics is the process of examining big data to uncover information -- such as hidden patterns, correlations, market trends and customer preferences -- that can help organizations make informed business decisions.
Big Data Characteristics (The 5V's of big data analytics):
1. Volume. This refers to the massive amounts of data generated from different sources. For example,
this can consist of data from IoT devices, sensors, transaction logs and social media.
2. Velocity. Velocity refers to the speed at which this data is generated and how fast it's processed and
analyzed. If data is needed quickly, real-time or near-real-time data processing might be needed.
3. Variety. This refers to the data types, including structured, semistructured and unstructured data. It
also refers to the data's format, such as text, videos or images. The variety in data means that
organizations must have a flexible data management system to handle, integrate and analyze different
data types.
4. Veracity. Veracity refers to the accuracy and quality of data. The data must be reliable and should
contain minimal noise or anomalies. This is why tools that can clean, validate and verify data are
important.
5. Value. Value refers to the overall worth that big data analytics should provide. Large data sets should
be processed and analyzed to provide real-world meaningful insights that can positively affect an
organization's decisions.
Types of big data analytics
• Descriptive analytics. Data is analyzed for general assessment and summarization. For example, an organization can use such data in sales reporting to analyze marketing efficiency.
• Diagnostic analytics. This refers to analytics that determines why a problem occurred. For example,
this could include gathering and studying competitor pricing data to determine when a product's sales
fell off because the competitor undercut it with a price drop.
• Predictive analytics. This refers to analysis that predicts what comes next. For example, this could
include monitoring the performance of machines in a factory and comparing that data to historical data
to determine when a machine is likely to break down or require maintenance or replacement.
• Prescriptive analytics. This form of analysis follows diagnostics and predictions. After identifying an
issue, it recommends what can be done about it. For example, this could include addressing supply
chain inconsistencies that are causing pricing problems by identifying suppliers whose performance is
unreliable and suggesting their replacement.
• Real-time analytics. This refers to the processing and analyzing of data as it's generated. Real-time
analytics is useful in settings where large amounts of data are generated and quick decisions need to
be made based on that data. For example, this would be useful in fraud detection systems.
Difference between big data and traditional data:
1> Scale: traditional data is measured in gigabytes or terabytes, while big data runs to terabytes, petabytes and beyond.
2> Type of data: traditional data is mostly structured, while big data also includes semi-structured and unstructured data.
3> Data handling/processing: traditional data is handled with SQL/relational databases, while big data requires distributed frameworks such as Hadoop and Spark.
Challenges in Big Data:
1> Data accessibility.

2>Data quality maintenance.

3> Data security.

4> Choosing the right tools.

5> Talent shortages.


NoSQL databases vs. RDBMS
Module 2

Introduction to Big Data Framework


> What is Hadoop?
> Hadoop core components
> Hadoop ecosystem
1> Apache Hadoop is a framework for distributed storage and processing of large datasets.

2> Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing.

3> Hadoop is a big data platform consisting of multiple independent tools and modules such as Spark, YARN, HDFS, … These tools can be used and deployed independently for specific use cases.
Hadoop ecosystem (overview diagrams)
Core components of Hadoop….
HDFS
HDFS is the storage component of Hadoop framework. HDFS stands for Hadoop Distributed
File System.

It is a distributed file system designed to :

● Store petabytes of data across many machines.

● Be fault-tolerant.

● Provide low-cost storage.

● Provide high-throughput access to data, making it suitable for applications with large data sets (a minimal usage sketch follows).
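As a rough illustration, here is a minimal Python sketch that drives the standard hdfs dfs command-line interface; the directory and file names are placeholders, and an installed, configured hdfs client on the PATH is assumed.

# Hypothetical example: copy a local file into HDFS and read it back.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")            # create a directory in HDFS
hdfs("-put", "sales.csv", "/user/demo/")      # upload a local file
print(hdfs("-ls", "/user/demo"))              # list the directory
print(hdfs("-cat", "/user/demo/sales.csv"))   # read the file back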
HDFS architecture: NameNode and DataNodes (master/slave)
MapReduce
A programming model for processing and generating large datasets in parallel.

● Parallel Processing Model: Hadoop’s MapReduce programming model allows for

parallel processing of large datasets. It breaks down tasks into smaller sub-tasks that can

be executed in parallel across the nodes in the cluster.

● This parallelization significantly accelerates data processing, making it well-suited for

tasks like batch processing and large-scale analytics.


MapReduce

In a MapReduce program, Map() and Reduce() are the two core functions (a minimal Python sketch follows this list):

1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map function.
3. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
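A minimal sketch of these two functions in Python, in the style of Hadoop Streaming (which pipes text line by line through stand-alone mapper and reducer scripts); the classic word-count example is used, and mapper.py / reducer.py are placeholder file names.

# --- mapper.py: emit a (word, 1) key-value pair for every word ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# --- reducer.py: sum the counts per word (input arrives sorted by key) ---
import sys
from itertools import groupby

pairs = (line.strip().split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    total = sum(int(count) for _, count in group)
    print(f"{word}\t{total}")

With Hadoop Streaming these scripts would typically be supplied through the -mapper and -reducer options, with the framework handling the shuffle and sort between the two stages.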
Example 1 (map reduce)
Map Phase : assign helpers (mappers) to different sections of shelves, each responsible for
counting the books in a specific genre (e.g., Mystery, Science Fiction).

Each helper (mapper) creates a small list with the genre name and the count of books in that genre.
Sort phase: gather all these small lists from the helpers (mappers) and group/sort them by genre. Now, for each genre you have multiple entries which are grouped.
Reduce Phase : assign another group of helpers (reducers) to each genre’s list. Each helper
(reducer) takes a list, adds up the counts, and writes down the final total for that genre
Breaking it Down

● Mapping (Map): Assigning tasks and creating small lists.

● Grouping and Sorting: Organizing and grouping the small lists.

● Reducing (Reduce): Summing up and finalizing the results.


Illustration of Map Reduce 2
MapReduce Architecture

A MapReduce example process has the following phases (a small Python simulation follows the list):

1. Input Splits

2. Mapping

3. Shuffling

4. Sorting

5. Reducing
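To make these phases concrete, here is a small, self-contained Python simulation of the genre-counting example above; it only illustrates the data flow (input splits, map, shuffle/sort, reduce), not how Hadoop itself is launched, and the book list is made-up sample data.

# Illustrative simulation of the MapReduce phases in plain Python.
from collections import defaultdict

books = ["Mystery", "Sci-Fi", "Mystery", "Romance", "Sci-Fi", "Mystery"]

# 1. Input splits: divide the input into chunks, one per mapper.
splits = [books[0:2], books[2:4], books[4:6]]

# 2. Mapping: each mapper emits a (genre, 1) pair for every book in its split.
mapped = [(genre, 1) for split in splits for genre in split]

# 3./4. Shuffling and sorting: group all emitted values by key across mappers.
grouped = defaultdict(list)
for genre, count in sorted(mapped):
    grouped[genre].append(count)

# 5. Reducing: each reducer sums the counts for one genre.
totals = {genre: sum(counts) for genre, counts in grouped.items()}
print(totals)   # {'Mystery': 3, 'Romance': 1, 'Sci-Fi': 2}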
Map Reduce example process
Example 3
Traditional approach vs. MapReduce approach
NO SQL (not only SQL)
> NoSQL databases ("not only SQL") store data differently than relational tables. NoSQL
databases come in a variety of types based on their data model.
The main types are

1> document

2> key-value

3> wide-column

4> graph
Document-oriented databases

1> A document-oriented database stores data in documents similar to JSON (JavaScript Object Notation) objects.
2> Each document contains pairs of fields and values.
3> The values can typically be a variety of types, including things like strings, numbers, booleans, arrays, or even other objects.
4> Documents support nested structures, making it easy to represent complex relationships or hierarchical data.

Example: MongoDB (used by Uber)


Document-oriented databases : Example

{
  "_id": "12345",
  "name": "Rohit",
  "email": "[email protected]",
  "address": {
    "street": "123 Collectors Colony",
    "city": "some city",
    "state": "some state",
    "zip": "123456"
  },
  "hobbies": ["music", "guitar", "reading"]
}
Document-oriented databases : Example
Key-value databases

A key-value store is a simpler type of database where each item contains keys and values.
Each key is unique and associated with a single value. They are used for caching and
session management and provide high performance in reads and writes because they tend
to store things in memory.

Examples are Amazon DynamoDB and Redis.

Example: a word dictionary.

Users: Twitter and Pinterest
Key-value databases

Key: user:12345
Value: {"name": "Amit", "email": "...@bar.com", "designation": "software developer"}
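A minimal sketch of storing the same user record in a key-value store with the redis-py client; the connection settings and key names are assumptions for illustration, and a local Redis server is assumed.

# Hypothetical Redis (key-value) example using the redis-py client.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Write: one unique key mapped to a single value (a JSON string here).
r.set("user:12345", json.dumps({"name": "Amit", "designation": "software developer"}))

# Read it back -- the typical caching / session-management access pattern.
user = json.loads(r.get("user:12345"))
print(user["name"])

# Session-style usage: the key expires automatically after 30 minutes.
r.setex("session:abc123", 1800, "user:12345")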
Wide-column stores
Wide-column stores store data in tables, rows, and dynamic columns.
The data is stored in tables. However, unlike traditional SQL databases, wide-column stores
are flexible, where different rows can have different sets of columns.
These databases can employ column compression techniques to reduce the storage space
and enhance performance.
Examples: Apache Cassandra and HBase (used by Netflix)
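A brief sketch using the DataStax cassandra-driver package for Python; the keyspace, table and column names are made up, and a locally running Cassandra node with an existing keyspace is assumed.

# Hypothetical Cassandra example: rows in one table can populate different columns.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])             # connect to a local node
session = cluster.connect("demo_keyspace")   # keyspace assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id text PRIMARY KEY,
        name text,
        email text,
        last_login timestamp
    )
""")

# One row fills several columns, another only a subset; unset columns stay empty.
session.execute(
    "INSERT INTO user_profile (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Asha", "asha@example.com"),
)
session.execute(
    "INSERT INTO user_profile (user_id, name) VALUES (%s, %s)",
    ("u2", "Ravi"),
)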
Wide-column stores
Graph databases
A graph database stores data in the form of nodes and edges. Nodes typically store
information about people, places, and things (like nouns), while edges store information
about the relationships between the nodes. They work well for highly connected data, where
the relationships or patterns may not be very obvious initially.
Graph databases : Example

Example: Neo4j (used by Walmart, Facebook)
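A short sketch using the official neo4j Python driver and Cypher; the connection URI, credentials, labels and property names are placeholders, and a local Neo4j instance is assumed.

# Hypothetical Neo4j example: nodes for people, an edge for their relationship.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two Person nodes and a FRIENDS_WITH relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Rohit", b="Amit",
    )
    # Query the graph: who are Rohit's friends?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        name="Rohit",
    )
    print([record["friend"] for record in result])

driver.close()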


Graph databases : Example
Trends of Databases
MongoDB
Def: "MongoDB is a scalable, open source, high performance, document-oriented database."

Questions:
What was the need for MongoDB when there were already many databases in use?
What was the purpose of building MongoDB?
Replication in MongoDB.
Sharding in MongoDB.
1> All modern applications require big data, fast feature development and flexible deployment, and the older database systems were not competent enough, so MongoDB was needed.

2> Purpose of building MongoDB:
○ Scalability
○ Performance
○ High availability
○ Scaling from single-server deployments to large, complex multi-site architectures
○ Develop faster
○ Deploy easier
○ Scale bigger
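A minimal pymongo sketch of the document model described above; the database, collection and field names are illustrative, a local mongod instance is assumed, and replication/sharding are configured on the server side rather than in this client code.

# Hypothetical MongoDB example: insert and query a document like the earlier one.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]

db.users.insert_one({
    "_id": "12345",
    "name": "Rohit",
    "address": {"city": "some city", "state": "some state"},
    "hobbies": ["music", "guitar", "reading"],
})

# Query on a nested field and on an array element.
print(db.users.find_one({"address.city": "some city"}))
print(db.users.find_one({"hobbies": "guitar"}))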
Apache Spark Architecture

Def: The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs as the master node, and many executors that run across the worker nodes in the cluster. Apache Spark can be used for both batch processing and real-time processing.
The Spark architecture depends upon two abstractions:
○ Resilient Distributed Dataset (RDD)
○ Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


Resilient Distributed Datasets are groups of data items that can be stored in memory on the worker nodes. Here,
○ Resilient: Restore the data on failure.
○ Distributed: Data is distributed among different nodes.
○ Dataset: Group of data.
A Directed Acyclic Graph (DAG) is a finite directed graph that represents a sequence of computations on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data.
Driver Program
The Driver Program is a process that runs the main() function of the application and creates the
SparkContext object. The purpose of SparkContext is to coordinate the spark applications.
SparkContext connects to different types of cluster managers and then performs the following tasks:
○ It acquires executors on nodes in the cluster.
○ Then, it sends your application code to the executors. Here, the application code can be defined
by JAR or Python files passed to the SparkContext.
○ At last, the SparkContext sends tasks to the executors to run.
Cluster Manager Roles :

○ The role of the cluster manager is to allocate resources across applications.


○ Spark supports various types of cluster managers, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.
○ Spark Driver works in conjunction with the Cluster Manager to control the execution of various
other jobs
Worker Node
○ The worker node is a slave node
○ Its role is to run the application code in the cluster.

Executor
○ An executor is a process launched for an application on a worker node.
○ It runs tasks and keeps data in memory or on disk across them.
○ It reads and writes data to external sources.
○ Every application has its own executors.

Task
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed
on a single partition of data.
The driver program divides the Spark job into tasks and assigns them to the executor nodes for
execution.
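A minimal PySpark sketch that ties these pieces together: the driver creates the SparkContext, the chained transformations build the RDD lineage (the DAG), and the final action triggers tasks on the executors. The input path is a placeholder and a local Spark installation is assumed.

# Hypothetical PySpark word count: driver, RDD transformations (DAG), action.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)   # the driver program creates the SparkContext

lines = sc.textFile("input.txt")                       # RDD from an external source
counts = (lines.flatMap(lambda line: line.split())     # transformations only:
               .map(lambda word: (word, 1))            # they build the DAG lazily
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # the action triggers tasks on the executors
sc.stop()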
Execution of a MapReduce program: overview
Case of failures?

Any one of the four components can fail:

● Application master
● Node manager
● Resource manager
● Task
Task Failure
1> The most common case is task failure, typically when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits. The error finally makes it into the user logs, and the application master frees up the container.

2> When the application master is notified of a task attempt that has failed, it will reschedule execution of the task.

3> Hanging tasks are dealt with differently. The application master notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The task JVM process is killed automatically after this period.
Task Failure……..

4> The application master will try to avoid rescheduling the task on a node manager where
it has previously failed. Furthermore, if a task fails four times, it will not be retried again.
This value is configurable.

5> For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum
percentage of tasks that are allowed to fail without triggering job failure can be set for the
job
Application Master Failure
1> The maximum number of attempts to run a MapReduce application master is controlled
by the mapreduce.am.max-attempts property.

2>The default value is 2, so if a MapReduce application master fails twice it will not be
tried again and the job will fail.

3> An application master sends periodic heartbeats to the resource manager, and in the
event of application master failure, the resource manager will detect the failure and start a
new instance of the master running in a new container which is managed by a node
manager.
Node Manager Failure

1> If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager (or send them very infrequently). The resource manager
will notice a node manager that has stopped sending heartbeats if it hasn’t received one for
10 minutes (this timeout is configurable).

2> Node managers may be blacklisted if the number of failures for the application is high,
even if the node manager itself has not failed.

3> Blacklisting is done by the application master, and for MapReduce the application
master will try to reschedule tasks on different nodes if more than three tasks fail on a node
manager. (configurable)
Resource Manager Failure

1> Failure of the resource manager is serious, because without it, neither jobs nor task
containers can be launched.

2> To achieve high availability (HA), it is necessary to run a pair of resource managers in an
active-standby configuration. If the active resource manager fails, then the standby can take
over without a significant interruption to the client.

3> Information about all the running applications is stored in a highly available state store (backed by ZooKeeper or HDFS), so that the standby can recover the core state of the failed active resource manager.
