Unit1 - BDH

The document provides an overview of Big Data and its significance in the digital world, highlighting the challenges traditional databases face in processing large volumes of structured and unstructured data. It discusses the evolution of Big Data technologies, particularly Hadoop, and outlines its architecture, tools, and applications across various industries. Additionally, it emphasizes the benefits of Big Data analytics for improving operations, customer service, and driving innovations.

BIG DATA HADOOP

INTRODUCTION
• Today we live in a digital world. With increased digitization, the amount of
structured and unstructured data being created and stored is exploding.
• This data is generated from various sources such as transactions, social
media, sensors, digital images, videos, audio and clickstreams, across
domains including healthcare, retail, energy and utilities.
• For instance, about 30 billion pieces of content are shared on Facebook
every month, and the photos viewed every 16 seconds on Picasa could cover
a football field.
WHAT IS BIG DATA?
• Big data describes a massive volume of both structured and unstructured
data that is so large it is difficult to process using traditional
database and software techniques.
• In most enterprise scenarios, the volume of data is too big, it moves
too fast, or it exceeds current processing capacity.
• Despite these problems, big data has the potential to help companies
improve operations and make faster, more intelligent decisions.
• The term big data is believed to have originated with web search
companies that needed to query very large, distributed aggregations
of loosely structured data.
Structured Data vs. Unstructured Data
Semi-Structured Data
• Email
• NoSQL databases
• CSV, XML, and JSON documents
• Electronic data interchange (EDI)
• HTML
• RDF
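The formats listed above are semi-structured because each record carries its own schema as field names, yet records need not share identical fields. A minimal sketch in Python, using a hypothetical clickstream record, shows how a JSON document can be parsed without any predefined table schema:

```python
import json

# Hypothetical semi-structured record: field names travel with the data,
# and other records in the same feed may have different fields entirely.
record = '{"user": "alice", "action": "click", "tags": ["ad", "promo"]}'

parsed = json.loads(record)  # dict with the self-describing fields
print(parsed["user"])        # alice
print(len(parsed["tags"]))   # 2
```

Unlike a relational row, nothing forces the next record to contain `tags` or `action`, which is exactly what makes such data hard for a fixed-schema database.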
Why Is Big Data Important?
• Companies use big data in their systems to improve operations,
provide better customer service, create personalized marketing
campaigns and take other actions that, ultimately, can increase
revenue and profits.
Cont…
• Cost Savings
• Time Reductions
• Understand market conditions
• Social media listening
• Using Big Data Analytics to Boost Customer Acquisition and Retention
• Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights
• Big Data Analytics as a Driver of Innovations and Product
Development
Evolution of Big Data
1) Early Days of Computing
2) Data Warehousing
3) The rise of Internet
4) The Emergence of Big Data: new technologies (Hadoop, NoSQL databases)
to handle the Volume and Variety of data
5) The Growth of Big Data: new technologies (Cloud Computing and Streaming
Analytics) to handle the Volume, Variety and Velocity of data
6) Artificial Intelligence & Machine Learning
7) IoT & 5G
8) Blockchain & Big Data
Cont…
1) 1940s to 1989 – Data Warehousing and Personal Desktop Computers
2) 1989 to 1999 – Emergence of the World Wide Web
3) 2000s to 2010s – Controlling Data Volume, Social Media and Cloud
Computing
4) 2010s to now – Optimization Techniques, Mobile Devices and IoT
History
• John R. Mashey is credited with introducing the term Big Data.

Characteristics of Big Data
Core (the 5 Vs):
• Volume
• Velocity
• Variety
• Veracity
• Value
Additional:
• Complexity
• Scalability
• Flexibility
• Accessibility
• Security
Failure of Traditional Databases in Handling Big Data
• Big Data Is Too Big for Traditional Storage
• Big Data Is Too Complex for Traditional Storage
• Big Data Is Too Fast for Traditional Storage
These challenges apply across the main types of big data:
• Machine data
• Social data
• Transactional data
Applications of BIG DATA
1) Banking
2) Education
3) Media
4) Healthcare
5) Agriculture
6) Travel
7) Manufacturing
8) Government
9) Retail
Real World Big Data Examples
• Discovering consumer shopping habits.
• Personalized marketing.
• Fuel optimization tools for the transportation industry.
• Monitoring health conditions through data from wearables.
• Live road mapping for autonomous vehicles.
• Streamlined media streaming.
• Predictive inventory ordering
The Applications of Big Data
• Banking and Securities
• Communications, Media and Entertainment
• Healthcare Providers
• Education
• Manufacturing and Natural Resources
• Government
• Insurance
• Retail and Wholesale trade
• Transportation
• Energy and Utilities
BIG DATA INFRASTRUCTURE
• Big data architecture is a comprehensive solution to deal with an
enormous amount of data.
• It details the blueprint for providing solutions and infrastructure for
dealing with big data based on a company’s demands.
BIG DATA INFRASTRUCTURE
• Data Sources: Relational databases, data warehouses, cloud-based
data warehouses, SaaS applications, real-time data from company
servers and sensors such as IoT devices, third-party data providers,
and static files such as Windows logs.
• Data Storage: HDFS and blob containers such as Microsoft Azure, AWS,
and GCP storage.
• Batch Processing: Multiple approaches to batch processing are
employed, including Hive jobs, U-SQL jobs, Sqoop or Pig, and custom
MapReduce jobs written in Java, Scala, or other languages such as
Python.
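The custom MapReduce jobs mentioned above follow a map, shuffle, reduce pattern. A minimal local sketch in Python, simulating the shuffle-and-sort phase in memory rather than running on a real Hadoop cluster, counts words across input lines:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all counts grouped under one key.
    return word, sum(counts)

def run_job(lines):
    # Local stand-in for Hadoop's shuffle-and-sort: group mapper
    # outputs by key before handing each group to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = run_job(["big data big hadoop", "hadoop big"])
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

On a real cluster, Hadoop distributes the mapper and reducer across nodes and performs the shuffle over the network; the per-record logic stays the same.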
• Real-Time Message Ingestion: Message-based ingestion stores such
as Apache Kafka, Apache Flume, and Azure Event Hubs must be used if
message-based processing is required. The delivery process, along with
other message-queuing semantics, is generally more reliable.
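The core idea behind these ingestion stores, a queue that decouples fast producers from slower consumers, can be illustrated with Python's standard library. This is only an in-memory sketch with hypothetical event data; real stores like Kafka additionally persist and replicate the message log:

```python
import queue
import threading

# Bounded in-memory queue standing in for a message-ingestion store.
events = queue.Queue(maxsize=100)
received = []

def producer():
    # Producer writes events without waiting for the consumer.
    for i in range(5):
        events.put({"event_id": i, "type": "click"})
    events.put(None)  # sentinel marking end of stream

def consumer():
    # Consumer drains the queue at its own pace, in arrival order.
    while True:
        msg = events.get()
        if msg is None:
            break
        received.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [0, 1, 2, 3, 4]
```

The queue guarantees ordered, at-most-once delivery within this process; durable stores extend the same contract across machines and restarts.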
• Stream Processing: Stream processing handles streaming data in the
form of windows or streams and writes the results to the sink. Engines
include Apache Spark, Flink, Storm, etc.
• Analytics-Based Datastore: To analyze and process already processed
data, analytical tools use a data store based on HBase or another
NoSQL data warehouse technology. Query engines such as Spark SQL can
also be used.
• Reporting and Analysis: The generated insights must be processed,
which is accomplished by reporting and analysis tools that use
embedded technology to produce useful graphs, analyses, and insights
beneficial to the business. Examples include Cognos and Hyperion.
• Orchestration: Big data solutions involve repetitive data-related
tasks, contained in workflow chains that transform source data and
move data across sources, sinks, and stores. Sqoop, Oozie, and Data
Factory are just a few examples.
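At their core, orchestrators like Oozie run tasks in an order that respects their dependencies. A minimal sketch using Python's standard-library `graphlib`, with a hypothetical extract-transform-load chain, shows how a dependency graph yields a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it
# depends on, mirroring how an orchestrator chains data tasks.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Topological sort produces an order where every task runs only
# after all of its dependencies have completed.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Production orchestrators layer scheduling, retries, and monitoring on top of this same dependency-resolution idea.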
There is more than one workload type
involved in big data systems, and they are
broadly classified as follows:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• The exploration of new interactive big data technologies and tools.
• The use of machine learning and predictive analysis.
Types of Big Data Architecture

• 1) LAMBDA ARCHITECTURE
• 2) KAPPA ARCHITECTURE
• Batch Layer: The batch layer of the lambda architecture saves
incoming data in its entirety as batch views. The batch views are used
to prepare the indexes. The data is immutable, and only copies of the
original data are created and preserved.
• Speed Layer: The speed layer handles data that has not yet been
incorporated into the batch views, computing incremental results in
real time. It keeps latency low by limiting the amount of computation,
and any approximation it introduces is corrected once the batch layer
recomputes the full views.
• Serving Layer: The batch views and the speed-layer results are pushed
to the serving layer, which indexes the views and parallelizes them to
ensure users' queries are fast and free from delays.
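How the serving layer merges the two paths can be sketched in a few lines of Python: the batch view is complete but stale, the speed view covers only events since the last batch run, and a query sums the two. The counts below are hypothetical page-view data:

```python
def serve_query(key, batch_view, speed_view):
    # Merge the complete-but-stale batch view with the
    # recent-but-partial speed-layer view to answer a query.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Hypothetical page-view counts: batch covers all data up to the
# last batch run; speed covers events that arrived since then.
batch_view = {"page_a": 1000, "page_b": 250}
speed_view = {"page_a": 7, "page_c": 3}

print(serve_query("page_a", batch_view, speed_view))  # 1007
print(serve_query("page_c", batch_view, speed_view))  # 3
```

Note that `page_c` exists only in the speed view: it first appeared after the last batch run, yet queries still see it immediately.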
• Compared to the Lambda architecture, the Kappa architecture is also
intended to handle both real-time streaming and batch data. In addition
to reducing the extra cost that comes with the Lambda architecture, it
replaces the data-sourcing medium with message queues.
• The messaging engines store a sequence of data in the analytical
databases, which is then read and converted into an appropriate format
before being saved for the end user.
• The batch layer is eliminated in the Kappa architecture, and the speed
layer is enhanced to provide reprocessing capabilities. The key
difference with the Kappa architecture is that all data is presented as
a series or stream. Data transformation is achieved through the stream
engine, which is the central engine for data processing.
Benefits of Big Data
Architecture
• High-performance parallel computing
• Elastic scalability
• Freedom of choice
• The ability to interoperate with other systems
BIG DATA LIFE CYCLE
Big Data Tools and Techniques

• A big data tool can be classified into the four buckets listed below
based on its practicability.
• Massively Parallel Processing (MPP)
• NoSQL Databases
• Distributed Storage and Processing Tools
• Cloud Computing Tools
• Doug Cutting and his team developed an open-source project
called HADOOP.
• Hadoop is an open-source framework that allows storing and processing
big data in a distributed environment across clusters of computers
using simple programming models.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Hadoop Architecture
History of Hadoop (Contd.)
Hadoop can be divided into four (4) distinctive layers.
Hadoop Server Roles
• 2003: Google introduced GFS (Google File System)
• 2004: Google published MapReduce
Hadoop Tools
• HDFS
• MAP REDUCE
• YARN
• APACHE HIVE
• APACHE PIG
• APACHE HBASE
• APACHE ZOOKEEPER
• APACHE FLUME
• SQOOP
• OOZIE
• SPARK
VMWARE INSTALLATION STEPS
• The easiest way to run Hadoop on a Windows computer is to install
VMware Player and then install a virtual Hadoop server.
Instructions for installing VMware Player on Windows:
• Download VMware Player for Windows (32-bit and 64-bit, VMware Player
v5 and up).
• Run the installer file and then click the Next button on the welcome
screen.