Unit1 - BDH

The document provides an overview of Big Data and its significance in the digital world, highlighting the challenges traditional databases face in processing large volumes of structured and unstructured data. It discusses the evolution of Big Data technologies, particularly Hadoop, and outlines its architecture, tools, and applications across various industries. Additionally, it emphasizes the benefits of Big Data analytics for improving operations, customer service, and driving innovations.

BIG DATA HADOOP

INTRODUCTION
• Today we live in a digital world. With increased digitization, the amount of
structured and unstructured data being created and stored is exploding.
• This data is generated from various sources such as transactions, social
media, sensors, digital images, videos, audio and clickstreams, across
domains including healthcare, retail, energy and utilities.
• For instance, about 30 billion pieces of content are shared on Facebook
every month, and the photos viewed every 16 seconds on Picasa could cover
a football field.
WHAT IS BIG DATA?
• Big data describes a massive volume of both structured and unstructured
data that is so large it is difficult to process using traditional
database and software techniques.
• In most enterprise scenarios, the volume of data is too big, it moves
too fast, or it exceeds current processing capacity.
• Despite these problems, big data has the potential to help companies
improve operations and make faster, more intelligent decisions.
• The term big data is believed to have originated with web search
companies that needed to query very large, distributed aggregations
of loosely structured data.
Structured Data vs. Unstructured Data
Semi-Structured Data
• Email
• NoSQL databases
• CSV, XML, and JSON documents
• Electronic data interchange (EDI)
• HTML
• RDF
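The formats listed above are semi-structured because each record carries its own schema as field names, yet records need not share identical fields. A minimal sketch in Python, using a hypothetical clickstream record, shows how a JSON document can be parsed without any predefined table schema:

```python
import json

# Hypothetical semi-structured record: field names travel with the data,
# and other records in the same feed may have different fields entirely.
record = '{"user": "alice", "action": "click", "tags": ["ad", "promo"]}'

parsed = json.loads(record)  # dict with the self-describing fields
print(parsed["user"])        # alice
print(len(parsed["tags"]))   # 2
```

Unlike a relational row, nothing forces the next record to contain `tags` or `action`, which is exactly what makes such data hard for a fixed-schema database.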
Why Is Big Data Important?
• Companies use big data in their systems to improve operations,
provide better customer service, create personalized marketing
campaigns and take other actions that, ultimately, can increase
revenue and profits.
Cont…
• Cost Savings
• Time Reductions
• Understand market conditions
• Social media listening
• Using Big Data Analytics to Boost Customer Acquisition and Retention
• Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights
• Big Data Analytics as a Driver of Innovations and Product
Development
Evolution of Big Data
1) Early Days of Computing
2) Data Warehousing
3) The rise of Internet
4) The Emergence of Big Data: new technologies (Hadoop, NoSQL databases)
to handle the Volume and Variety of data
5) The Growth of Big Data: new technologies (Cloud Computing and Streaming
Analytics) to handle the Volume, Variety and Velocity of data
6) Artificial Intelligence & Machine Learning
7) IoT & 5G
8) Blockchain & Big Data
Cont…
1) 1940s to 1989 – Data Warehousing and Personal Desktop Computers
2) 1989 to 1999 – Emergence of the World Wide Web
3) 2000s to 2010s – Controlling Data Volume, Social Media and Cloud
Computing
4) 2010s to now – Optimization Techniques, Mobile Devices and IoT
History
• John R. Mashey is credited with introducing the term Big Data.

Characteristics of Big Data
Core (the 5 Vs):
• Volume
• Velocity
• Variety
• Veracity
• Value
Additional:
• Complexity
• Scalability
• Flexibility
• Accessibility
• Security
Failure of Traditional Databases in Handling Big Data
• Big Data Is Too Big for Traditional Storage
• Big Data Is Too Complex for Traditional Storage
• Big Data Is Too Fast for Traditional Storage
These challenges apply across the main types of big data:
• Machine data
• Social data
• Transactional data
Applications of BIG DATA
1) Banking
2) Education
3) Media
4) Healthcare
5) Agriculture
6) Travel
7) Manufacturing
8) Government
9) Retail
Real World Big Data Examples
• Discovering consumer shopping habits.
• Personalized marketing.
• Fuel optimization tools for the transportation industry.
• Monitoring health conditions through data from wearables.
• Live road mapping for autonomous vehicles.
• Streamlined media streaming.
• Predictive inventory ordering
The Applications of Big Data
• Banking and Securities
• Communications, Media and Entertainment
• Healthcare Providers
• Education
• Manufacturing and Natural Resources
• Government
• Insurance
• Retail and Wholesale trade
• Transportation
• Energy and Utilities
BIG DATA INFRASTRUCTURE
• Big data architecture is a comprehensive solution to deal with an
enormous amount of data.
• It details the blueprint for providing solutions and infrastructure for
dealing with big data based on a company’s demands.
BIG DATA INFRASTRUCTURE
• Data Sources: Relational databases, data warehouses, cloud-based
data warehouses, SaaS applications, real-time data from company
servers and sensors such as IoT devices, third-party data providers,
and static files such as Windows logs.
• Data Storage: HDFS and blob containers such as Microsoft Azure, AWS,
and GCP storage.
• Batch Processing: Multiple approaches to batch processing are
employed, including Hive jobs, U-SQL jobs, Sqoop or Pig, and custom
MapReduce jobs written in Java, Scala, or other languages such as
Python.
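The custom MapReduce jobs mentioned above follow a map, shuffle, reduce pattern. A minimal local sketch in Python, simulating the shuffle-and-sort phase in memory rather than running on a real Hadoop cluster, counts words across input lines:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all counts grouped under one key.
    return word, sum(counts)

def run_job(lines):
    # Local stand-in for Hadoop's shuffle-and-sort: group mapper
    # outputs by key before handing each group to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = run_job(["big data big hadoop", "hadoop big"])
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

On a real cluster, Hadoop distributes the mapper and reducer across nodes and performs the shuffle over the network; the per-record logic stays the same.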
• Real-Time Message Ingestion: Message-based ingestion stores such
as Apache Kafka, Apache Flume, and Azure Event Hubs must be used if
message-based processing is required. The delivery process, along with
other message-queuing semantics, is generally more reliable.
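The core idea behind these ingestion stores, a queue that decouples fast producers from slower consumers, can be illustrated with Python's standard library. This is only an in-memory sketch with hypothetical event data; real stores like Kafka additionally persist and replicate the message log:

```python
import queue
import threading

# Bounded in-memory queue standing in for a message-ingestion store.
events = queue.Queue(maxsize=100)
received = []

def producer():
    # Producer writes events without waiting for the consumer.
    for i in range(5):
        events.put({"event_id": i, "type": "click"})
    events.put(None)  # sentinel marking end of stream

def consumer():
    # Consumer drains the queue at its own pace, in arrival order.
    while True:
        msg = events.get()
        if msg is None:
            break
        received.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [0, 1, 2, 3, 4]
```

The queue guarantees ordered, at-most-once delivery within this process; durable stores extend the same contract across machines and restarts.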
• Stream Processing: Stream processing handles streaming data in the
form of windows or streams and writes the results to the sink. Engines
include Apache Spark, Flink, Storm, etc.
• Analytics-Based Datastore: To analyze and process already processed
data, analytical tools use a data store based on HBase or another
NoSQL data warehouse technology. Query engines such as Spark SQL can
also be used.
• Reporting and Analysis: The generated insights must be processed,
which is accomplished by reporting and analysis tools that use
embedded technology to produce useful graphs, analyses, and insights
beneficial to the business. Examples include Cognos and Hyperion.
• Orchestration: Big data solutions involve repetitive data-related
tasks, contained in workflow chains that transform source data and
move data across sources, sinks, and stores. Sqoop, Oozie, and Data
Factory are just a few examples.
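At their core, orchestrators like Oozie run tasks in an order that respects their dependencies. A minimal sketch using Python's standard-library `graphlib`, with a hypothetical extract-transform-load chain, shows how a dependency graph yields a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it
# depends on, mirroring how an orchestrator chains data tasks.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Topological sort produces an order where every task runs only
# after all of its dependencies have completed.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Production orchestrators layer scheduling, retries, and monitoring on top of this same dependency-resolution idea.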
There is more than one workload type
involved in big data systems, and they are
broadly classified as follows:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• The exploration of new interactive big data technologies and tools.
• The use of machine learning and predictive analysis.
Types of Big Data Architecture

• 1) LAMBDA ARCHITECTURE
• 2) KAPPA ARCHITECTURE
• Batch Layer: The batch layer of the lambda architecture saves
incoming data in its entirety as batch views. The batch views are used
to prepare the indexes. The data is immutable, and only copies of the
original data are created and preserved.
• Speed Layer: The speed layer handles data that has not yet been
incorporated into the batch views, computing incremental results in
real time. It keeps latency low by limiting the amount of computation,
and any approximation it introduces is corrected once the batch layer
recomputes the full views.
• Serving Layer: The batch views and the speed-layer results are pushed
to the serving layer, which indexes the views and parallelizes them to
ensure users' queries are fast and free from delays.
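How the serving layer merges the two paths can be sketched in a few lines of Python: the batch view is complete but stale, the speed view covers only events since the last batch run, and a query sums the two. The counts below are hypothetical page-view data:

```python
def serve_query(key, batch_view, speed_view):
    # Merge the complete-but-stale batch view with the
    # recent-but-partial speed-layer view to answer a query.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Hypothetical page-view counts: batch covers all data up to the
# last batch run; speed covers events that arrived since then.
batch_view = {"page_a": 1000, "page_b": 250}
speed_view = {"page_a": 7, "page_c": 3}

print(serve_query("page_a", batch_view, speed_view))  # 1007
print(serve_query("page_c", batch_view, speed_view))  # 3
```

Note that `page_c` exists only in the speed view: it first appeared after the last batch run, yet queries still see it immediately.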
• Compared to the Lambda architecture, the Kappa architecture is also
intended to handle both real-time streaming and batch data. In addition
to reducing the extra cost that comes with the Lambda architecture, it
replaces the data-sourcing medium with message queues.
• The messaging engines store a sequence of data in the analytical
databases, which is then read and converted into an appropriate format
before being saved for the end user.
• The batch layer is eliminated in the Kappa architecture, and the speed
layer is enhanced to provide reprocessing capabilities. The key
difference with the Kappa architecture is that all data is presented as
a series or stream. Data transformation is achieved through the stream
engine, which is the central engine for data processing.
Benefits of Big Data
Architecture
• High-performance parallel computing
• Elastic scalability
• Freedom of choice
• The ability to interoperate with other systems
BIG DATA LIFE CYCLE
Big Data Tools and Techniques

• A big data tool can be classified into the four buckets listed below
based on its practicability.
• Massively Parallel Processing (MPP)
• NoSQL Databases
• Distributed Storage and Processing Tools
• Cloud Computing Tools
• Doug Cutting and his team developed an open-source project
called HADOOP.
• Hadoop is an open-source framework that allows storing and processing
big data in a distributed environment across clusters of computers
using simple programming models.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Hadoop Architecture
History of Hadoop (Contd.)
Hadoop can be divided into four (4) distinctive layers.
Hadoop Server Roles
• 2003: Google introduced GFS (Google File System)
• 2004: Google published MapReduce
Hadoop Tools
• HDFS
• MAP REDUCE
• YARN
• APACHE HIVE
• APACHE PIG
• APACHE HBASE
• APACHE ZOOKEEPER
• APACHE FLUME
• SQOOP
• OOZIE
• SPARK
VMWARE INSTALLATION STEPS
• The easiest way to run Hadoop on a Windows computer is to install
VMware Player and then install a virtual Hadoop server.
Instructions for installing VMware Player on Windows:
• Download VMware Player for Windows (32-bit and 64-bit, VMware Player
v5 and up).
• Run the installer file and then click the Next button on the welcome
screen.