
BIG DATA PROCESSING TECHNIQUES

AND PLATFORMS
Prof. JP AGGARWAL
Introduction
• Almost everything we do leaves a digital trail that is
recorded and saved forever
• In fact, it is estimated that 90% of the world's data was
generated in the last two years alone.
• In the space of 13 years, this figure has increased by an
estimated 60x from just 2 zettabytes in 2010.
• The 120 zettabytes generated in 2023 are expected to grow to about 181
zettabytes by 2025, an increase of roughly 50%.
Source: https://explodingtopics.com/blog/data-generated-per-day

Source: https://www.domo.com/data-never-sleeps
Big Data Era
• We are exposed to more information each day than our
15th century ancestors were exposed to in a lifetime
• The 3 Vs of Big Data (Volume, Velocity, Variety)
• Data beyond storage capacity or data beyond local
processing capability.
• Big data refers to large volumes of structured, semi-structured and
unstructured data. The analysis of big data leads to better insights
for business.
• Why suddenly Big Data?

Sources of Big Data (Internal and External)
• Web searches, browsing, and social media
• Astronomy and telescopes
• Financial data
• Medical data
• Telephone calls and email
• Surveillance data
• Internet of Things (smart watches, smart fridges, etc.)
• Sensor data (e.g. smartphone sensors, GPS)

Big Data Analytics
• Big Data Analytics is the process of examining large data sets (current as
well as historic) to uncover hidden patterns, unknown correlations, market
trends, customer preferences, ways to prevent failures, and other useful
business information.
[Diagram: transactions, archived data and social media feed into Big Data,
which is analysed to produce predictions]
• Analyze both internal and external data
• Batch Analytics
  • On historical data
• Real-Time Analytics
  • On current data in real time
  • e.g. credit card transactions
Turning Big Data into Value
[Diagram: live data feeds, social & web data and advanced analytics help
answer questions such as "What's the social sentiment for my brand or
products?", "How do I optimize my fleet based on weather and traffic
patterns?" and "How do I better predict future outcomes?"]
Applications of Big Data
Better Understand and Target Customers
• Telecom: companies can now better predict customer churn
• Retail: retailers can predict what products will sell
• Car insurance: insurers can now understand how well their customers actually drive
Applications of Big Data

Understand and Optimize Business
• Retailers can optimize their stock based on predictive models generated from
social media data, web search trends and weather forecasts.
• Supply chains and delivery routes can be optimized using data from geographic
positioning and radio-frequency identification sensors.
Applications of Big Data
Improving Health
• The computing power of Big Data Analytics enables us to find new cures and
better understand and predict disease patterns.
• We can use all the data from smart watches and wearable devices to better
understand links between lifestyles and diseases.
• Big Data Analytics also allows us to monitor and predict epidemics and
disease outbreaks simply by listening to what people are saying, e.g.
"Not feeling well today – in bed with a cold", or searching for on the
internet, e.g. "cures for flu".
Applications of Big Data
Improving Security and Law Enforcement
• Security services use Big Data Analytics to foil terrorist plots and detect cyber attacks.
• Police forces use Big Data tools to catch criminals and even predict criminal activity.
• Credit card companies use Big Data Analytics to detect fraudulent transactions.
Applications of Big Data
Improving Sports Performance
• Most elite sports have now embraced Big Data Analytics. Many use video
analytics to track the performance of every player.
• Sensor technology is built into sports equipment.
• Many elite sports teams track athletes outside of the sporting environment –
using smart technology to track nutrition and sleep, as well as social media
conversations to monitor emotional wellbeing!
Big Data Sizes
https://www.c-sharpcorner.com/UploadFile/mahesh/cracking-big-data/
Case Study – Analyze Payment Risks
A product distributor wants to match invoices to payments received from
vendors to predict payment risks. The invoice system and receivables system
are maintained in Oracle, and they want to present the data using Tableau,
which is a data visualization tool.
Source: SpiderOpsNet

Big Data Solution Highlights
• A cluster of 30 low-cost (commodity-class) machines
• Use of all data, including archived data
• Work done in parallel: time reduced from 30 hours to 1 hour
• Cost reduced by 80%
• Open-source tools: free software, low maintenance
• Built-in fault tolerance and data replication

Big Data Industry Examples
Market Basket Analysis
Predict Inventory Demand
Medical device calibration
Bank loan default prediction and fraud detection
Algorithmic trading in stock markets
Data collection for insurance underwriting
Sensor data ingestion and processing
DNA analysis in genetics
Short-term load forecasting in a power grid

Traditional Technology
• Large relational databases on SAN (storage area networks)
• Highly parallel processors
• Data may be distributed, but processing happens in one place
• Bring the data to the process
• High-end hardware (to the tune of $50,000 / TB)
• Limited scalability
Big Data Technology
• Parallel processing
• Cluster of commodity hardware ($3,000 / TB)
• Fault-tolerant processing
• Distributed data and distributed processing
• Data redundancy
• Data locality: bring the process to the data
• Data and processing on the same machine


How is Big Data Technology Different?
Cluster of commodity-class machines
• Mid-range servers with less memory and mid-level hardware
• Machines may have 32 GB – 96 GB RAM and a 2 GHz processor
• Cost of hardware < $3,000 / TB
Distributed data
• Data is distributed across the cluster of machines
• Each piece is called a block
• Data redundancy and data replication

How is Big Data Technology Different?
Distributed processing
• Processing is divided into parallel tasks – as many as the number of blocks
• Tasks are distributed to the cluster of machines
• The process is sent to where the data is, instead of bringing the data to the process
• Fault-tolerant processing
Question
• Can I use Big Data technologies to process a small amount of data?

Big Data Solution Stages

BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION

BIG DATA SECURITY (across all stages)

Introduction to Hadoop
HDFS
• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.
• It provides file permissions, authentication, and streaming access to file system data.
• HDFS can be accessed through the Hadoop command-line interface (see the sketch below).
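As an illustration, here is a minimal sketch of driving the HDFS command-line
interface from Python using subprocess. The directory and file names are
hypothetical, and it assumes the hdfs client is installed and configured on
the machine.

import subprocess

# Create a directory in HDFS, upload a local file, then list the directory.
# "/data/raw" and "events.log" are example names, not from the slides.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "events.log", "/data/raw/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/data/raw"], check=True)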

Hadoop Architecture
• In the 1990s: hard disk capacity 1 GB to 20 GB, RAM 64 to 128 MB, I/O around 10 Kbps
• Now: hard disk capacity 1 TB or more, RAM 16 – 32 GB, I/O around 100 MBps
• E.g. you accumulate 100 TB of data within a few years, so you approach a data center
• E.g. processing time: 1 person – 12 months, 12 people – 1 month
• E.g. files keep arriving continuously for processing
• You save data because you might need to process it later
• So you write a program to process your data
• Before Hadoop, computation was processor-bound

Hadoop Architecture
• Hadoop is an open-source framework by the Apache Software Foundation
• For small data, it is better to use a local disk
• Hadoop = HDFS (storage) + MapReduce (processing)
• Key terms for HDFS:
  • Cluster (set of machines)
  • Commodity hardware (cheap hardware)
  • Streaming access pattern (write once, read many times)
  • Blocks
    • 4 KB in a normal operating system file system
    • If you store only 2 KB, the remaining space in the block is wasted

Streaming Access Pattern
• HDFS is built around the idea that the most efficient data processing pattern is
write-once, read-many-times pattern. A dataset is typically generated or copied
from source, then various analyses are performed on that dataset over time. Each
analysis will involve a large proportion, if not all, of the dataset, so the time to
read the whole dataset is more important than the latency in reading the first
record.
• It stores data in large blocks - like 64 MB / 128 MB. The idea is that you want your
data laid out sequentially on your hard drive, reducing the number of seeks your
hard drive has to do to read data.
• In addition, HDFS is a user-space file system, so there is a single central name
node that contains an in-memory directory of where all of the blocks (and their
replicas) are stored across the cluster. Files are expected to be large (say 1 GB or
more), and are split up into several blocks. In order to read a file, the code asks
the name node for a list of blocks and then reads the blocks sequentially.
• The data is "streamed" off the hard drive by maintaining the maximum I/O rate
that the drive can sustain for these large blocks of data (see the back-of-the-envelope sketch below).
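A back-of-the-envelope sketch of why large blocks help, using assumed figures
(a 10 ms average seek time and 100 MB/s sustained transfer rate; these numbers
are illustrative, not from the slides):

seek_time_s = 0.010          # assumed average disk seek time: 10 ms
transfer_rate_mb_s = 100.0   # assumed sustained transfer rate: 100 MB/s

for block_mb in (4 / 1024, 64, 128):   # a 4 KB OS block vs. typical HDFS blocks
    transfer_s = block_mb / transfer_rate_mb_s
    seek_share = seek_time_s / (seek_time_s + transfer_s)
    print(f"{block_mb:8.3f} MB block: seek is {seek_share:.1%} of the total read time")

# With these figures, seeking dominates (~99%) for 4 KB blocks,
# but is well under 2% of the read time for 64 MB or 128 MB blocks.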

Hadoop Architecture
• HDFS block size – 64 MB to 128 MB
• e.g. for a 35 MB file, no space is wasted
• Five Services (Daemons):
• Master - Name Node
• Master - Secondary Name Node
• Master - Job Tracker (Resource Manager)
• Slave - Data Node
• Slave - Task Tracker (Node Manager)
• Master Services can talk to each other and Slave services
can talk to each other, Master can talk to its Slave
services and vice-versa. (NN<->DN, JT <-> TT)
Hadoop Architecture
• Writing to a Hadoop cluster (e.g. a 200 MB file, for the sake of understanding) for faster processing
• With a 64 MB block size, 4 blocks will be required (3 × 64 MB and 1 × 8 MB) – see the sketch below
• The client will contact the Name Node
• Metadata (data about data) is stored on the Name Node
• A replication factor is applied to the blocks
• The Data Nodes communicate back to the client that the file has been written successfully
• Data Nodes send heartbeats + block reports to the Name Node regularly
• If any Data Node fails to send a heartbeat, the Name Node modifies the metadata and places the block on another node
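A quick check of the block arithmetic above, assuming the 64 MB block size
used in the example:

import math

BLOCK_SIZE_MB = 64   # block size assumed in the example above
file_size_mb = 200

n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (n_blocks - 1) * BLOCK_SIZE_MB
print(f"{file_size_mb} MB file -> {n_blocks} blocks "
      f"({n_blocks - 1} x {BLOCK_SIZE_MB} MB + 1 x {last_block_mb} MB)")
# 200 MB file -> 4 blocks (3 x 64 MB + 1 x 8 MB)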

Hadoop Architecture
• What if the Name Node crashes?
  • Name Nodes are highly reliable machines (they don't store huge data, only metadata)
  • Single point of failure
• Processing:
  • The Job Tracker accepts the client's processing request
  • The Job Tracker talks to the Name Node to get block details for the requested file
  • The metadata is shared by the Name Node with the Job Tracker
  • The Job Tracker selects the best possible Data Node and assigns the task to the Task Tracker on that node
  • This process is called MAP
Hadoop Architecture

• Similar MAP jobs are run on all input splits (number of mappers = number of input splits)
• What happens if a node crashes while processing?
  • The TT fails and the JT assigns a new TT to process the same job; heartbeats are sent by the TTs to the JT
• What if the Job Tracker is down?
  • We maintain a highly reliable machine for the JT
  • Single point of failure
• All map outputs are then combined and REDUCED to the required output (a toy sketch of the map/reduce flow follows)
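A toy, pure-Python sketch of the map → shuffle → reduce flow described above.
This is not the Hadoop API; the two strings simply stand in for input splits.

from collections import defaultdict

splits = ["big data needs big storage", "big data needs parallel processing"]

def map_task(text):
    # MAP: emit a (word, 1) pair for every word in the split
    return [(word, 1) for word in text.split()]

# Shuffle: group the intermediate (key, value) pairs by key
grouped = defaultdict(list)
for split in splits:
    for word, one in map_task(split):
        grouped[word].append(one)

# REDUCE: combine the values for each key into the required output
word_counts = {word: sum(ones) for word, ones in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, 'needs': 2, ...}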
Limitations of Hadoop vs. Spark

• In Hadoop, you can only do batch processing; there is no real-time processing.

Apache Spark

• Spark supports batch as well as real-time processing of data.

Apache Spark

• Apache Spark is a cluster computing framework for real-time processing developed by the Apache Software Foundation.
• Spark provides an interface for programming the entire cluster with implicit data parallelism and fault tolerance.
• It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations (a minimal PySpark sketch follows).
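A minimal PySpark word-count sketch, assuming the pyspark package is
installed; "logs.txt" is a hypothetical input path (local or HDFS).

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("logs.txt")   # hypothetical input file

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())   # split each line into words
          .map(lambda word: (word, 1))              # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))         # sum the counts per word

print(counts.take(10))
spark.stop()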

Driverless Cars
• Waymo, Uber, Tesla, others
• Mission: design a driverless car that can drive on highways
and roads
• Goals:
• Fully autonomous operation
• High safety
• Fleet operation
• Solutions
• Deep learning from video
• Object detection model
• Action model

Forensics
• Palantir and Amazon, as well as Chinese companies
• Mission: design system to predict crime
• Goals:
• Integration across multiple surveillance sources
• High reliability
• Transparent algorithms that comply with laws
• Solutions
• Predictive algorithms using history, surveillance and behaviour
analysis
• Transparent explanations
• Face and gait recognition
Military AI
• USA, Russia, China. Others?
• Mission: design autonomous forces for ground, sea,
amphibian and air combat and surveillance
• Goals:
• Autonomous combat
• Autonomous surveillance
• Swarm tactics and strategy
• Solutions
• Autonomous weaponised robots for ground combat
• Driverless terrestrial, underwater, and aerial combat vehicles
• Drone and dust swarms

Checkpoint Frequency
• Default: every 1 hour
• Or every 64 MB of the edits log
• Or after 1 million transactions

Big Data Solution Stages

BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION

BIG DATA SECURITY (across all stages)

Types of Data
• Internal Data
  • Transactional data from relational databases
  • Archived files
  • Log files from applications
  • Website click data
  • Customer files
• External Data
  • Social media, e.g. Twitter
  • Web scraping
  • Third-party agencies supplying trending data

Big Data Ingestion
• Batch Data
• Streaming Data

Types of data for Ingestion
• Data Ingestion: Getting data into the big data system.
• Batch Data
• Relational databases
• Archived Files
• Streaming Data
• Continuous flow of data
• Social media e.g. Twitter
• Data coming from a customer every 10 minutes

Batch Data
• Data that comes in periodically:
• Once a day or once a week etc.
• Examples:
• Data from RDBMS
• Data from archived files
• Data coming from clients
• Can be processed through a scheduler
• Bulk loading to big data system
• Tools: Sqoop, distcp

Streaming Data
• Data that comes in continuously
• Examples:
• Data from Twitter
• Data from application log files
• Data from messaging system
• Needs to be processed as soon as the data arrives
• Fault tolerance and exactly-once processing are important
• Tools: Flume, Nifi, Flink

Features to Look Out For
• Fault tolerance
  • No data should be lost even if the process aborts
• Accurate processing
  • Should not miss any data
  • Should handle the case where the destination system is down
• Exactly-once processing
  • If the process restarts, it should not re-process data that has already been processed
• High scalability
  • Should be able to handle a huge number of records per minute

Sqoop
• Part of the Apache open-source ecosystem
• Used for importing relational data into Hadoop
• Extracts data from databases such as MySQL, Oracle, DB2, etc.
• Can import into HDFS, Hive or HBase
• Can also be used to export data back to the database (a hedged import sketch follows)
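A hedged sketch of a Sqoop import launched from Python. The connection string,
credentials file, table and target directory are hypothetical, and it assumes
the sqoop CLI is on the PATH.

import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # example source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",     # keeps the password off the command line
    "--table", "invoices",                            # example table to import
    "--target-dir", "/data/raw/invoices",             # example destination directory in HDFS
    "--num-mappers", "4",                             # parallel map tasks
]
subprocess.run(cmd, check=True)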

Distcp
• distcp is part of Apache Hadoop
• Used for importing file data into HDFS
• It can be used to copy data from Amazon S3 to HDFS
• It can also be used to copy data from one HDFS installation to another
• It uses MapReduce to copy data in parallel (see the sketch below)
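A sketch of the two distcp use cases above, run from Python. The cluster
addresses and bucket name are hypothetical, and it assumes the hadoop CLI
(and, for S3, the s3a connector and credentials) is configured.

import subprocess

# Copy a directory from one HDFS installation to another
subprocess.run(
    ["hadoop", "distcp", "hdfs://nn1:8020/data/logs", "hdfs://nn2:8020/data/logs"],
    check=True,
)

# Copy data from Amazon S3 into HDFS
subprocess.run(
    ["hadoop", "distcp", "s3a://my-bucket/raw/", "hdfs://nn1:8020/data/raw/"],
    check=True,
)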

Flume
• Part of the Apache open-source ecosystem
• Parallel copying of streaming data into HDFS
• Distributed, reliable and available service
• Collects, aggregates and moves log data
• Can move large amounts of data efficiently
• Used for getting web logs and program logs from multiple sources into HDFS
• Sqoop is for structured data; Flume is for unstructured data

Apache Flink
• Part of the Apache open-source ecosystem and also supported by Confluent
• Continuous processing of streaming data
• Can process millions of events per second
• Source → sink based architecture, like Flume
• Exactly-once processing of data
• Provides high-level data set APIs
• Out-of-order data processing

Out of Order Data
• Some data in the stream may arrive late
• If ingestion time is used, late data will end up in the wrong batch
• Flink makes it possible to assign such data to the right batch using the timestamp specified in the data itself
• This event time, carried as part of the data, is used to decide which batch a record belongs to (a toy sketch follows)
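A toy, pure-Python sketch (not the Flink API) contrasting ingestion-time and
event-time batching. The timestamps are made up: the third reading is
generated at t = 658 s but arrives late, at t = 665 s.

from collections import defaultdict

WINDOW_S = 60  # one-minute batches

events = [
    {"meter": "A", "event_time": 600, "ingest_time": 601, "kw": 1.2},
    {"meter": "A", "event_time": 630, "ingest_time": 631, "kw": 1.4},
    {"meter": "A", "event_time": 658, "ingest_time": 665, "kw": 1.9},  # late arrival
]

def bucket(readings, time_field):
    windows = defaultdict(list)
    for r in readings:
        window_start = r[time_field] // WINDOW_S * WINDOW_S
        windows[window_start].append(r["kw"])
    return dict(windows)

print("by ingestion time:", bucket(events, "ingest_time"))  # {600: [1.2, 1.4], 660: [1.9]}
print("by event time:    ", bucket(events, "event_time"))   # {600: [1.2, 1.4, 1.9]}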

Nifi
• Part of the Apache open-source ecosystem
• Supports dataflows with a graphical user interface
• Can route messages from data ingestion to processing using the GUI
• Can be configured to include various connectors:
  • Kafka
  • HDFS
• Provides data provenance: tracks data as it flows through the system

Kafka – Distributed Streaming Platform
• What is Apache Kafka?
  • A distributed streaming platform
• What does this mean?
  • Create one or more real-time data streams
  • Process these real-time data streams
• "Real time" here refers to minutes, seconds or even milliseconds.

Example – Smart Electricity Meter
• A smart electricity meter generates load-related data every minute.
• A Kafka server receives this data every minute.
• Similarly, other meters in the city send their load data to this Kafka server.
• An application reads and processes this data, e.g. computing and monitoring the load of every house.
• As soon as the charge goes above a pre-defined threshold, send an SMS within at most a few seconds or minutes.
• This is real-time stream processing (a hedged producer sketch follows).

Apache Kafka is a highly scalable and distributed platform for creating and processing streams in real time.
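A hedged producer sketch for the smart-meter example, assuming the
kafka-python client, a broker at localhost:9092 and a hypothetical
"meter-readings" topic.

import json
import random
import time

from kafka import KafkaProducer   # kafka-python package (assumed)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Simulate one smart meter publishing a load reading every minute
while True:
    reading = {"meter_id": "MTR-001",
               "kw": round(random.uniform(0.5, 5.0), 2),
               "ts": int(time.time())}
    producer.send("meter-readings", value=reading)
    producer.flush()
    time.sleep(60)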

How does Kafka Work?
• Typically, there are 3 components:
  • Publisher / Producer
    • The client application that sends data records (messages)
  • Broker
    • Responsible for receiving messages from the producer and storing them on local disk
  • Consumers
    • Client applications that read messages from the broker and process them
• The broker sits at the centre and acts as the middleman between producer and consumer.
• The Kafka server is the messaging broker.
• Kafka works as a pub-sub messaging system (a hedged consumer sketch follows).
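And a matching consumer sketch, again assuming the kafka-python client and the
hypothetical "meter-readings" topic. It checks each reading against a
threshold, as in the smart-meter example.

import json

from kafka import KafkaConsumer   # kafka-python package (assumed)

consumer = KafkaConsumer(
    "meter-readings",                     # topic written by the producer sketch
    bootstrap_servers="localhost:9092",
    group_id="load-monitor",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

THRESHOLD_KW = 4.0

for message in consumer:
    reading = message.value
    if reading["kw"] > THRESHOLD_KW:
        # In the real system this is where the SMS alert would be sent
        print(f"ALERT: meter {reading['meter_id']} is at {reading['kw']} kW")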

Who created Kafka and Why ?
• Created at LinkedIn and open-sourced in 2011.
• It was initially designed to handle the data integration problem.

Who created Kafka and Why?
• All the boxes (applications in the diagram) generate and store some data.
• Data generated by one application is often needed in other applications.
• LinkedIn solved this problem using a pub-sub architecture.

Kafka in the Enterprise Application Ecosystem
• Kafka is the circulatory system of your data ecosystem.
• Kafka occupies a central place in the real-time data integration infrastructure.
• Data producers send their data as messages.
• Messages are sent to the Kafka broker as soon as the business event occurs.
• Consumers can consume messages from the broker as soon as the data arrives at the broker.
• The system can be designed to process data within milliseconds.
• Producers always interact with Kafka brokers and need not be concerned with who is using the data.
• Producers and consumers can be added, removed or modified as the business case evolves.

