Big Data Intro-1
AND PLATFORMS
Prof. JP AGGARWAL
Introduction
• Almost everything we do leaves a digital trail that is
recorded and saved forever
• In fact, it is estimated that 90% of the world's data was
generated in the last two years alone.
• In the space of 13 years, this figure has increased by an
estimated 60x from just 2 zettabytes in 2010.
• The 120 zettabytes generated in 2023 are expected to grow to roughly 150% of that figure by 2025, reaching 181 zettabytes.
Source: https://explodingtopics.com/blog/data-generated-per-day
Source: https://www.domo.com/data-never-sleeps
Big Data Era
• We are exposed to more information each day than our
15th century ancestors were exposed to in a lifetime
• The 3 Vs of Big Data (Volume, Velocity, Variety)
• Data beyond storage capacity or data beyond local
processing capability.
• Big data refers to large volumes of structured, semi-structured, and unstructured data. The analysis of big data leads to better insights for business.
• Why Big Data all of a sudden?
Sources of Big Data (Internal and External)
• Web searches, browsing, and social media
• Astronomy and telescopes
• Financial data
• Medical data
• Telephone calls and email
• Surveillance data
• Internet of Things (smart watches, smart fridges, etc.)
• Sensor data (e.g. smartphones, GPS)
Big Data Analytics
• Big Data Analytics is the process of examining large data sets (current as well as historical) to uncover hidden patterns, unknown correlations, market trends, customer preferences, potential failures, and other useful business information.
[Diagram: archived data, transactions, and social media feed into Big Data, which drives predictions.]
• Analyze both internal and external data
• Batch Analytics
• On historical data
• Real-Time Analytics
• On current data in real time
• e.g. credit card transactions
Turning Big Data into Value
[Diagram: live data feeds help answer questions such as "What's the social sentiment for my brand or products?" and "How do I better predict future outcomes?"]
Applications of Big Data
Better Understand and Target Customers
Understand and Optimize Business
• Retailers can optimize their stock based on predictive models generated from social media data, web search trends, and weather forecasts.
• Supply chains and delivery routes can be optimized using data from geographic positioning and radio-frequency identification (RFID) sensors.
Improving Health
• We can use all the data from smart watches and wearable devices to better understand the links between lifestyles and diseases.
Improving Sports Performance
• Sensor technology is built into sports equipment.
• Many elite sports teams track athletes outside the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing!
Big Data Sizes
Source: https://www.c-sharpcorner.com/UploadFile/mahesh/cracking-big-data/
Case Study – Analyze Payment Risks
A product distributor wants to match invoices to payments received from vendors in order to predict payment risks. The invoice system and the receivables system are maintained in Oracle, and they want to present the data using Tableau, which is a data visualization tool.
Source: SpiderOpsNet
Big Data Solution Highlights
A cluster of 30 low-cost (commodity class) machines
Big Data Industry Examples
Market Basket Analysis
Predict Inventory Demand
Medical device calibration
Bank loan default prediction and fraud detection
Algorithmic trading in stock markets
Data collection for insurance underwriting
Sensor data ingestion and processing
DNA analysis in genetics
Short-term load forecasting in a power grid
Traditional Technology
• Large relational databases on SAN (storage area networks)
• Highly parallel processors
• Data may be distributed, but the data is brought to the processing
• High-end hardware with limited scalability (to the tune of $50,000 / TB)
• Parallel processing
How is Big Data Technology Different?
• Distributed processing
• Processing is divided into parallel tasks – as many as the number of blocks
• Tasks are distributed across the cluster of machines
• The process is sent to where the data is, instead of bringing the data to the process
• Fault-tolerant processing
Question
Big Data Solution Stages
BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION
Introduction to Hadoop
HDFS
• HDFS is a storage layer of Hadoop suitable for
distributed storage and processing.
• It provides file permissions, authentication, and
streaming access to file system data.
• HDFS can be accessed through the Hadoop command-line interface, as sketched below.
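A minimal sketch of driving the standard `hdfs dfs` commands from Python; it assumes a configured Hadoop client on the PATH, and the directory and file names are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and return its stdout as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# List the HDFS root directory
print(hdfs("-ls", "/"))

# Create a directory and upload a local file (hypothetical paths)
hdfs("-mkdir", "-p", "/data/logs")
hdfs("-put", "web_access.log", "/data/logs/")

# Stream the file contents back to the client
print(hdfs("-cat", "/data/logs/web_access.log"))
```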
Hadoop Architecture
• In the 1990s: hard disk capacity 1 GB to 20 GB, RAM 64 to 128 MB, I/O 10 Kbps
• Now: hard disk capacity 1 TB, RAM 16 – 32 GB, I/O 100 MB/s
• E.g. 100 TB of data accumulates within a few years, so you approach a data centre
• E.g. processing time: 1 person – 12 months, 12 people – 1 month
• E.g. files continuously coming in for processing
• You save data because you might need to process it later.
• So, you write a program to process the data.
• Before Hadoop, computation was processor-bound.
Hadoop Architecture
• Hadoop is an open-source framework from the Apache Software Foundation
• For small data, it is better to use a local disk
• Hadoop = HDFS (storage) + MapReduce (processing)
• Key terms for HDFS
• Cluster (a set of machines)
• Commodity hardware (cheap hardware)
• Streaming access pattern (write once, read many times)
• Blocks
• 4 KB in a typical operating system file system
• If you store 2 KB in a 4 KB block, the remaining space is wasted
Streaming Access Pattern
• HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
• It stores data in large blocks – such as 64 MB or 128 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks your hard drive has to do to read the data.
• In addition, HDFS is a user-space file system, so there is a single central name node that contains an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more), and are split up into several blocks. In order to read a file, the code asks the name node for a list of blocks and then reads the blocks sequentially, as in the sketch below.
• The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.
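The read path described above can be sketched in a few lines. This is an illustrative example only, assuming pyarrow with libhdfs is available on the client; the name node host, port, and file path are hypothetical.

```python
# A minimal sketch of the write-once / read-many access pattern: stream a large
# HDFS file sequentially in big chunks rather than seeking around in it.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", 8020)   # hypothetical name node

CHUNK = 128 * 1024 * 1024  # read in 128 MB chunks, in line with the HDFS block size
total_bytes = 0
with hdfs.open_input_stream("/data/clickstream/2023.log") as stream:
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        total_bytes += len(chunk)   # a real job would parse or aggregate here

print(f"Read {total_bytes} bytes sequentially")
```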
Hadoop Architecture
• HDFS block size – 64 MB to 128 MB
• E.g. a 35 MB file occupies only 35 MB – no space is wasted
• Five services (daemons):
• Master – Name Node
• Master – Secondary Name Node
• Master – Job Tracker (Resource Manager)
• Slave – Data Node
• Slave – Task Tracker (Node Manager)
• Master services can talk to each other, slave services can talk to each other, and a master can talk to its slave services and vice versa (NN <-> DN, JT <-> TT).
Hadoop Architecture
• Writing to a Hadoop cluster (e.g. a 200 MB file, for illustration) for faster processing
• With a 64 MB block size, 4 blocks will be required (3 × 64 MB and 1 × 8 MB), as in the sketch below
• The client will contact the Name Node
• Metadata (data about data) will be stored at the Name Node
• A replication factor is applied to the blocks
• The Data Node communicates back to the client that the file has been written successfully.
• Data Nodes send heartbeats + block reports to the Name Node regularly.
• If any Data Node fails to send a heartbeat, the Name Node will modify the metadata and place the affected blocks on another node.
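The block arithmetic in the 200 MB example can be checked with a few lines of plain Python (illustrative only, not a Hadoop API call):

```python
# Plain-Python sketch of how a file is split into HDFS blocks.
def split_into_blocks(file_size_mb, block_size_mb=64):
    full, last = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if last:
        blocks.append(last)          # the final block only occupies what it needs
    return blocks

blocks = split_into_blocks(200)       # the 200 MB example from the slide
print(len(blocks), "blocks:", blocks) # 4 blocks: [64, 64, 64, 8]

replication_factor = 3                # default HDFS replication
print("Total block replicas stored:", len(blocks) * replication_factor)
```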
Hadoop Architecture
• What if the Name Node crashes?
• Name Nodes are highly reliable machines (they don't store huge amounts of data, only metadata)
• Still a single point of failure
• Processing:
• The Job Tracker accepts the client's processing request.
• The Job Tracker talks to the Name Node to get block details for the requested file
• The metadata is shared by the Name Node with the Job Tracker
• The Job Tracker selects the best possible Data Node and assigns the task to the Task Tracker on that node.
• This process is called MAP.
Limitations of Hadoop vs. Spark
Apache Spark
Driverless Cars
• Waymo, Uber, Tesla, others
• Mission: design a driverless car that can drive on highways
and roads
• Goals:
• Fully autonomous operation
• High safety
• Fleet operation
• Solutions
• Deep learning from video
• Object detection model
• Action model
Forensics
• Palantir and Amazon, as well as Chinese companies
• Mission: design system to predict crime
• Goals:
• Integration across multiple surveillance sources
• High reliability
• Transparent algorithms that comply with laws
• Solutions
• Predictive algorithms using history, surveillance and behaviour
analysis
• Transparent explanations
• Face and gait recognition
Military AI
• USA, Russia, China. Others?
• Mission: design autonomous forces for ground, sea,
amphibian and air combat and surveillance
• Goals:
• Autonomous combat
• Autonomous surveillance
• Swarm tactics and strategy
• Solutions
• Autonomous weaponised robots for ground combat
• Driverless terrestrial, underwater, and aerial combat vehicles
• Drone and dust swarms
Checkpoint Frequency
• Default: every 1 hour
• Every 64 MB of the edits log
• After 1 million transactions
Big Data Solution Stages
BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION
Types of Data
• Internal Data
• Transactional data from relational databases
• Archived files
• Log files from applications
• Website click data
• Customer files
• External Data
• Social media, e.g. Twitter
• Web scraping
• Third-party agencies supplying trending data
Big Data Ingestion
• Batch Data
• Streaming Data
Types of data for Ingestion
• Data Ingestion: Getting data into the big data system.
• Batch Data
• Relational databases
• Archived Files
• Streaming Data
• Continuous flow of data
• Social media e.g. Twitter
• Data coming from a customer every 10 minutes
Batch Data
• Data that comes in periodically:
• Once a day or once a week etc.
• Examples:
• Data from RDBMS
• Data from archived files
• Data coming from clients
• Can be processed through a scheduler
• Bulk loading to big data system
• Tools: Sqoop, distcp
Streaming Data
• Data that comes in continuously
• Examples:
• Data from Twitter
• Data from application log files
• Data from messaging systems
• Needs to be processed as soon as the data arrives
• Fault tolerance and exactly-once processing are important
• Tools: Flume, Nifi, Flink
Features to Look Out For
• Fault tolerance
• No data should be lost, even if the process aborts
• Accurate processing
• Should not miss any data
• Should cope if the destination system is down
• Exactly-once processing
• If the process restarts, it should not re-process data that was already processed
• High scalability
• Should be able to handle a huge number of records per minute
Sqoop
• Part of Apache open source
• Used for importing relational data into Hadoop
• Extracts data from databases such as MySQL, Oracle, DB2, etc.
• Can import into HDFS, Hive, or HBase
• Can also be used to export data back to the database (see the sketch below)
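A sketch of a typical `sqoop import` invocation, launched from Python to stay consistent with the other examples; the JDBC URL, credentials, table, and target directory are hypothetical.

```python
# A minimal sketch of a "sqoop import" call that copies a relational table into HDFS.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",     # hypothetical JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",        # keep secrets out of the command line
    "--table", "invoices",
    "--target-dir", "/data/raw/invoices",               # destination directory in HDFS
    "--num-mappers", "4",                                # parallel copy with 4 map tasks
], check=True)
```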
DistCp
• DistCp is part of Apache Hadoop
• Used for importing file data into HDFS
• It can be used to get data from Amazon S3 into HDFS
• It can also be used to copy data from one HDFS installation to another
• It uses MapReduce to copy data in parallel
Flume
• Part of Apache open source
• Parallel copying of streaming data into HDFS
• A distributed, reliable, and available service
• Collects, aggregates, and moves log data
• Can move large amounts of data efficiently
• Used for getting web logs and program logs from multiple sources into HDFS
• Sqoop for structured data, Flume for unstructured data
Apache Flink
• Part of Apache open source, and also supported by Confluent Technologies
• Continuous processing of streaming data
• Can process millions of events per second
• Source -> sink based architecture, like Flume
• Exactly-once processing of data
• Provides high-level dataset APIs
• Supports out-of-order data processing
Out of Order Data
• Some data in the stream may arrive late
• If ingestion time is used, that data will be placed in the wrong batch
• Flink makes it possible to assign this data to the correct batch using the timestamp carried in the data itself, as illustrated below
• The event time that is part of the data can be used to determine the batch it belongs to
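A plain-Python illustration (not the Flink API) of why event time matters for late records; the timestamps are hypothetical.

```python
# A record produced at 10:04 arrives late, at 10:07. Bucketing it by ingestion
# time puts it in the wrong 1-minute batch; bucketing by event time does not.
from datetime import datetime

def one_minute_bucket(ts):
    """Assign a timestamp to a 1-minute batch/window."""
    return ts.replace(second=0, microsecond=0)

event_time     = datetime(2023, 5, 1, 10, 4, 30)   # when the reading happened
ingestion_time = datetime(2023, 5, 1, 10, 7, 2)    # when it reached the system

print("Ingestion-time bucket:", one_minute_bucket(ingestion_time))  # 10:07 - wrong batch
print("Event-time bucket:    ", one_minute_bucket(event_time))      # 10:04 - correct batch
```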
NiFi
• Part of Apache open source
• Supports dataflows with a graphical user interface
• Can route messages from data ingestion to processing using the GUI
• Can be configured with various connectors:
• Kafka
• HDFS
• Provides data provenance: tracks data as it flows through the system.
Kafka – Distributed Streaming Platform
• What is Apache Kafka?
• A distributed streaming platform
Example – Smart Electricity Meter
• A smart electricity meter generates load-related data every minute.
• A Kafka server receives this data every minute.
• Similarly, there are other meters in the city that send load data to this Kafka server.
• An application reads and processes this data, e.g. computing and monitoring the load of every house.
• As soon as the load goes above a pre-defined threshold, send an SMS within at most a few seconds or minutes.
• This is real-time stream processing (sketched below).
Apache Kafka is a highly scalable and distributed platform for creating and processing streams in real time.
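A minimal sketch of the producer side of this example using the kafka-python client; the broker address, topic name, and message layout are illustrative assumptions.

```python
# Producer side of the smart-meter example: publish one simulated reading per minute.
import json, time, random
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    reading = {
        "meter_id": "MTR-1042",
        "timestamp": int(time.time()),
        "load_kw": round(random.uniform(0.2, 6.0), 2),       # simulated load reading
    }
    producer.send("meter-load", value=reading)                # hypothetical topic name
    producer.flush()
    time.sleep(60)
```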
How does Kafka Work?
• Typically, 3 components:
• Publisher / Producer
• A client application that sends data records (messages)
• The broker is responsible for receiving messages from the producer and storing them on local disk.
• The consumers are client applications that read messages from the broker and process them (see the sketch below).
• The Broker is at the center and acts as the
middleman between Producer and
Consumer.
• Kafka Server is the Messaging Broker.
• Kafka works as Pub-Sub Messaging
System
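A matching consumer sketch that reads the same (assumed) `meter-load` topic from the broker and raises an alert when the load crosses a threshold; the topic, threshold, and SMS call are placeholders.

```python
# Consumer side: read meter readings from the broker and alert on high load.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "meter-load",                                            # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="load-monitor",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

THRESHOLD_KW = 5.0

def send_sms_alert(meter_id, load_kw):
    # Placeholder for a real SMS gateway call
    print(f"ALERT: {meter_id} load {load_kw} kW exceeds {THRESHOLD_KW} kW")

for message in consumer:
    reading = message.value
    if reading["load_kw"] > THRESHOLD_KW:
        send_sms_alert(reading["meter_id"], reading["load_kw"])
```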
Who created Kafka and Why ?
• Created at LinkedIn and open-sourced in 2011.
• It was initially designed to handle the data integration problem.
Who created Kafka and Why ?
• All the boxes (applications) generate and store some data.
• Data generated in one application is often needed in other applications.
• LinkedIn solved this problem using a pub-sub architecture.
Kafka in the Enterprise Application Ecosystem
• The circulatory system of your data ecosystem.
• Kafka occupies a central place in the real-time data integration infrastructure.
• Data producers can send data as messages.
• Messages are sent to the Kafka broker as soon as the business event occurs.
• Consumers can consume messages from the broker as soon as the data arrives at the broker.
• Can be designed for processing data in milliseconds.
• Producers always interact with Kafka brokers and need not be concerned with who is using the data.
• Producers and consumers can be added, removed, or modified as the business case evolves.