Big Data
Understanding Big Data
Volume
● ? TBs of data every day
● 100s of millions of GPS-enabled devices sold annually
● 25+ TBs of log data every day
● 2+ billion people on the Web by end 2011
● 76 million smart meters in 2009… 200M by 2014
The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. Piled up in the form of disks, it could fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously.
Velocity (Speed)
● If traditional data is like a drop of water, Big Data is like a flowing river
● Speed at which data is generated by billions of devices and communicated at the speed of light (thanks to high-speed Internet availability)
● Mobile devices can generate and
communicate data from anywhere, at
any time
● Processing should be faster than
generation
● Analyse data while it is being
generated without even putting it into
databases
Image Source: https://www.domo.com/blog/data-never-sleeps-5/
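To make the last point above concrete (analysing data while it is being generated, without first landing it in a database), here is a minimal sketch in plain Python, not from the slides: a sliding one-second window that tracks the event rate entirely in memory. The event source is hypothetical.

    import time
    from collections import deque

    def rate_monitor(events, window_secs=1.0):
        """Track events/sec over a sliding window, entirely in memory."""
        window = deque()                      # timestamps of recent events
        for ts in events:
            window.append(ts)
            while window and ts - window[0] > window_secs:
                window.popleft()              # expire old events
            yield len(window) / window_secs   # current rate, available instantly

    # Hypothetical usage: a burst of events arriving 10 ms apart
    events = (time.time() + i * 0.01 for i in range(500))
    for rate in rate_monitor(events):
        pass                                  # act on the live rate here
    print(f"final rate: {rate:.0f} events/sec")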
Sources of Real-time/Fast Data
Mobile devices
(tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
● Product Recommendations that are Relevant & Compelling
● Learning why Customers Switch to competitors and their offers, in time to Counter
● Friend Invitations to join a Game or Activity that expands business
● Improving the Marketing Effectiveness of a Promotion while it is still in Play
● Preventing Fraud as it is Occurring & preventing more proactively
All of these aim to influence customer behavior in real time.
Variety
Data about our known customer comes from many sources: Social Media, Banking, Finance, Gaming, Entertainment, and Purchase History.
Veracity (uncertainty and trustworthiness of data)
Additional V's
● Value
Definition
“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
By Gartner
Wikipedia Definition
• Big data is a term for data sets that are so large or complex
that traditional data processing applications are inadequate…
• Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, querying, updating and
information privacy. …
• The term often refers simply to the use of predictive analytics
or certain other advanced methods to extract value from data,
and seldom to a particular size of data set. …
• Accuracy in big data may lead to more confident decision
making, and better decisions can result in greater operational
efficiency, cost reduction and reduced risk.
Harnessing Big Data
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
Organizations that do not learn to engage with Big Data could find themselves left far behind their competitors, landing in the dustbin of history.
The image describes the Life Cycle of Big Data, emphasizing the opportunities for
technology providers in managing large volumes of mostly unstructured data. Here's a
breakdown of the life cycle mentioned:
1. Generate: Data creation from various sources like sensors, social media, transactions, and more.
2. Gather: Collecting data from different systems or sources.
3. Store: Keeping data in databases, cloud storage, or other repositories.
4. Organize: Structuring and managing data to make it usable.
5. Analyze: Extracting insights and meaning from the data.
6. Visualize: Presenting the data insights in a visual format for easy interpretation.
The text highlights that over 90% of the data generated is unstructured, and this creates vast
potential for technological innovation in managing and utilizing this data effectively.
THE EVOLUTION OF BUSINESS INTELLIGENCE
● 1990’s (Scale): BI Reporting, OLAP & Data warehouse - Business Objects, SAS, Informatica, Cognos and other SQL Reporting Tools
● 2000’s (Speed): Interactive Business Intelligence & In-memory RDBMS, Real Time & Single View - QlikView, Tableau, HANA
● 2010’s: Big Data for Scale: Batch Processing & Distributed Data Store - Hadoop/Spark, HBase/Cassandra; Big Data for Speed: Real Time & Single View - Graph Databases
What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
Black Box Data − A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
Power Grid Data − Power grid data holds information about the power consumed by a particular node with respect to a base station.
Transport Data − Transport data includes the model, capacity, distance, and availability of a vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it is of three types:
Structured data − relational data; Semi-structured data − XML data; Unstructured data − Word documents, PDFs, text, media logs
Benefitting from Big Data
There are 3 major types of Big Data Applications
• Monitoring and Tracking
– Consumer goods producers: sentiments and needs of their customers
– Industrial organizations : Track inventory in massive interlinked global supply chains
– Factory owners: Monitor machine performance and do preventive maintenance
– Utility companies: predict energy consumption, manage demand and supply
– Information Technology: Track website performance and improve its usefulness
– Financial organizations: to project trends better and make more effective and profitable bets, etc.
• Analysis and Insight
– Political organizations: to micro-target voters and win elections
– Police: To predict and prevent crimes
– Hospitals: to better diagnose diseases and prescribe medicines
– Advertising Agencies: To design more targeted marketing campaigns more quickly
– Fashion Designers: To track trends and create more innovative products.
• Digital Product Development
– Stock market feeds could be a digital product; imagination is the limit
Big Data Technology
Big Data Technologies
Big data technologies are important in providing more accurate
analysis, which may lead to more concrete decision-making resulting
in greater operational efficiencies, cost reductions, and reduced risks
for the business.
To harness the power of big data, you would require an infrastructure
that can manage and process huge volumes of structured and
unstructured data in real-time and can protect data privacy and
security.
There are various technologies in the market from different vendors
including Amazon, IBM, Microsoft, etc., to handle big data. While
looking into the technologies that handle big data, we examine the
following two classes of technology
– Operational and Analytical Systems
Big Data technologies are generally divided into two categories: Analytical Big Data and Operational Big Data.
Each serves different purposes and is used in different contexts. Here's an explanation of both:
1. Analytical Big Data
• Purpose: Used for performing complex data analysis and extracting insights from vast amounts of historical data.
• Focus: Deals with batch processing, which involves working with historical data stored over time and analyzing it to uncover patterns, trends, and predictive insights.
• Examples of Use:
  • Business reporting and dashboards.
  • Predictive analytics (forecasting future trends or customer behaviors).
  • Data mining for discovering patterns or correlations.
• Technologies Used:
  • Hadoop: A framework for processing large datasets distributed across clusters of computers.
  • MapReduce: A programming model used within Hadoop for processing and generating large datasets.
  • Apache Spark: A fast data processing engine that handles large-scale data analysis.
  • Data Warehousing: Systems like Amazon Redshift and Google BigQuery for storing and querying large-scale datasets.
• Key Characteristics:
  • Works with historical, structured or unstructured data.
  • Batch processing: runs tasks periodically on collected data.
  • Often not real-time, but can provide deeper insights.
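As an illustration of the batch model, here is a minimal word-count sketch of the MapReduce programming idea in plain Python. It is a toy stand-in, not Hadoop's API: the map and reduce functions mirror what would run distributed across cluster nodes.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        """Mapper: emit (word, 1) pairs for each word in one document."""
        return [(word.lower(), 1) for word in document.split()]

    def reduce_phase(pairs):
        """Reducer: group pairs by key and sum the counts."""
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    documents = ["big data needs batch processing",
                 "batch processing scales to big data"]
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    print(reduce_phase(pairs))   # {'big': 2, 'data': 2, 'batch': 2, ...}

In a real cluster the mappers run in parallel on the nodes holding the data blocks, and a shuffle phase routes each word to one reducer.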
2. Operational Big Data
• Purpose: Used for real-time data processing, dealing with data that is continuously generated by day-to-day operations.
• Focus: Stream processing, which means managing and analyzing real-time data generated from operations, applications, and devices.
• Examples of Use:
  • Real-time monitoring of system performance or transactions.
  • Fraud detection in banking systems by analyzing live transaction data.
  • Recommendations (e.g., Netflix recommending movies based on live user interactions).
  • Sensor data in IoT devices (smart devices, manufacturing machinery).
• Technologies Used:
  • Apache Kafka: A distributed streaming platform for building real-time data pipelines and streaming apps.
  • NoSQL Databases: Like MongoDB or Cassandra, optimized for handling high-volume, unstructured data.
  • HBase: A real-time, scalable database that runs on top of Hadoop.
  • Amazon Kinesis: For real-time data streaming and processing in the cloud.
• Key Characteristics:
  • Deals with real-time or near-real-time data.
  • Handles high-velocity and high-volume data streams.
  • Supports low-latency systems where immediate action is required.
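For contrast, a minimal stream-processing sketch of the fraud-detection example above, again in plain Python. The transaction feed and the threshold rule are hypothetical; a production system would consume from something like Kafka or Kinesis and use a trained model rather than a fixed limit.

    def detect_fraud(transactions, limit=10_000):
        """Flag transactions as they arrive, without waiting for a batch job."""
        for txn in transactions:              # an unbounded stream in practice
            if txn["amount"] > limit:         # toy rule, stands in for a model
                yield txn                     # act immediately: block, alert, ...

    stream = [{"id": 1, "amount": 120}, {"id": 2, "amount": 54_000}]
    for suspicious in detect_fraud(stream):
        print("alert:", suspicious)           # alert: {'id': 2, 'amount': 54000}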
Key Differences:
• Analytical Big Data focuses on long-term, batch processing for historical analysis and insights. It is more suitable for understanding long-term trends and informing strategic decisions.
• Operational Big Data focuses on real-time processing for immediate decision-making and monitoring. It is essential for systems where immediate action is required.
http://analytics-magazine.org/how-big-data-is-changing-the-oil-a-gas-industry/
Use Case: Uber - Pay Surge Pricing if Battery is Low
https://www.forbes.com/sites/amitchowdhry/2016/05/25/uber-low-battery/#19762c0474b3
Big Data Challenges
Big Data Challenges: Size does matter
1 KB Kilobyte
1 MB Megabyte
1 GB Gigabyte (1 GB ≈ 1 hr of video)
1 TB Terabyte (1 TB = 1024 hrs ≈ 102 days of viewing)
1 PB Petabyte (1 PB ≈ 286 yrs, more than a lifetime)
1 EB Exabyte (1 EB ≈ 293K yrs)
1 ZB Zettabyte
1 YB Yottabyte
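The playback figures above line up if one assumes 1 GB ≈ 1 hour of video and roughly 10 viewing hours per day; both are assumptions, as the slide does not state them. A quick check in Python:

    HOURS_PER_GB = 1             # assumption: 1 GB ~ 1 hour of video
    VIEWING_HOURS_PER_DAY = 10   # assumption: ~10 hrs of viewing per day

    for name, gib in [("1 TB", 1024), ("1 PB", 1024**2), ("1 EB", 1024**3)]:
        hours = gib * HOURS_PER_GB
        days = hours / VIEWING_HOURS_PER_DAY
        print(f"{name}: {hours:,} hrs ~ {days:,.0f} days ~ {days / 365:,.0f} yrs")
    # 1 TB: 1,024 hrs ~ 102 days;  1 PB: ~287 yrs;  1 EB: ~294,176 yrs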
Big Data Challenges: Vertical vs Horizontal Scaling
Source: https://s-media-cache-ak0.pinimg.com/736x/10/0c/d0/100cd0da1c19e5d6f850ed23c3633714.jpg
Big Data Challenges: Scale of infrastructure
★ http://hadoop.apache.org/
Components of HDFS
NameNode, DataNodes, and the Secondary NameNode, each described below.
1. NameNode
• Role: Acts as the master of the HDFS system.
• Function: Manages the metadata for the file system (e.g., file names, permissions, file block locations).
• Responsibilities:
  • Keeps track of the directory structure and file locations.
  • Maintains the mapping of file blocks to DataNodes.
  • Does not store the actual data but knows where it is located.
  • Coordinates file system operations such as opening, closing, and renaming files or directories.
• Single Point of Failure (SPOF): In traditional setups, if the NameNode fails, the HDFS cluster becomes unavailable (although High Availability (HA) configurations can mitigate this).
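A conceptual sketch of the mappings the NameNode keeps in memory. The data structures and names here are illustrative only, not HDFS's actual implementation:

    # namespace metadata: file path -> ordered list of block IDs
    file_to_blocks = {"/logs/2024/app.log": ["blk_1", "blk_2", "blk_3"]}

    # block ID -> DataNodes currently holding a replica of that block
    block_locations = {"blk_1": ["dn1", "dn4", "dn7"],
                       "blk_2": ["dn2", "dn4", "dn8"],
                       "blk_3": ["dn3", "dn5", "dn7"]}

    def locate(path):
        """Answer a client's open(): return block locations, never the data."""
        return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]

    print(locate("/logs/2024/app.log"))  # the client then reads the blocks
                                         # directly from those DataNodes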
2. DataNodes
• Role: The worker nodes that store the actual data blocks.
• Function: Responsible for storing and retrieving data as instructed by the NameNode.
• Responsibilities:
  • Store the actual file data in blocks (the default block size is 128 MB, though configurable).
  • Periodically signal liveness to the NameNode (a heartbeat) and report the list of blocks they are storing (a Blockreport).
  • Handle block replication (by default, each block is replicated 3 times across different DataNodes for fault tolerance).
  • Repair and rebalance blocks when failures occur.
3. Secondary NameNode (Checkpoint Node)
• Role: Often misunderstood; it does not serve as a backup for the NameNode but helps by periodically merging the NameNode's edit log with the namespace image.
• Function: Periodically creates checkpoints of the NameNode’s metadata by combining the edit log and the namespace image.
• Responsibilities:
  • Keeps the filesystem metadata at a manageable size.
  • Reduces the time required for the NameNode to restart by managing the size of the edit logs.
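A toy sketch of the checkpointing idea: replay the operations in the edit log onto the last namespace image to produce a fresh checkpoint, after which the edit log can be truncated. The structures are hypothetical; real HDFS works on fsimage and edits files on disk.

    fsimage = {"/a.txt": ["blk_1"]}               # last checkpointed namespace
    edit_log = [("create", "/b.txt", ["blk_2"]),  # operations logged since then
                ("rename", "/a.txt", "/c.txt"),
                ("delete", "/b.txt", None)]

    def checkpoint(image, edits):
        """Merge the edit log into the image, yielding a new fsimage."""
        image = dict(image)
        for op, path, arg in edits:
            if op == "create":
                image[path] = arg
            elif op == "rename":
                image[arg] = image.pop(path)
            elif op == "delete":
                image.pop(path, None)
        return image

    print(checkpoint(fsimage, edit_log))          # {'/c.txt': ['blk_1']}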
Terminology
Architecture
Storing file on HDFS
Motivation: Reliability, Availability, Network Bandwidth
➢ The input file (say 1 TB) is split into smaller chunks/blocks of 128 MB
➢ The chunks are stored as independent files on multiple data nodes
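A minimal sketch of the splitting step in plain Python with a toy reader; conceptually, the HDFS client does this as it writes the file, and 128 MB is the default (configurable) block size.

    BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

    def split_into_blocks(path, block_size=BLOCK_SIZE):
        """Yield fixed-size chunks of a file; each would become one HDFS block."""
        with open(path, "rb") as f:
            while chunk := f.read(block_size):
                yield chunk

    # a 1 TB file yields 8192 blocks of 128 MB:
    print((1024**4) // BLOCK_SIZE)  # 8192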
Storing file on HDFS
➢ To ensure that data is not lost, it can typically be replicated on:
➢ local rack
➢ remote rack (in case local rack fails)
➢ remote node (in case local node fails)
➢ randomly
➢ Default replication factor is 3
Local Rack: The first replica is stored on a node local to the client, so data is written faster and there is less load on the network.
Remote Rack: The second replica is stored on a different rack. This ensures that the data is not dependent on the availability of a single rack; by storing the replica on a different rack, the system achieves fault tolerance. If the entire rack where the first replica is stored fails (power failure, network issues, or hardware failure), the data is still accessible from a different rack.
Same Remote Rack (Different DataNode): The third replica is stored on the same remote rack as the second replica, but on a different DataNode within that rack. This helps minimize cross-rack bandwidth consumption: data transmission between DataNodes on the same rack is faster and uses fewer resources compared to inter-rack communication. While the third replica is kept on the same rack for bandwidth efficiency, it is placed on a different DataNode for added fault tolerance, so that the failure of one DataNode does not affect all replicas on that rack.
Why not more replicas on the same rack? If more than two replicas are stored on the same rack, the entire rack going down could result in significant data loss, so data is distributed across racks as much as possible. Though placing replicas on the same rack minimizes network traffic between racks, keeping too many replicas on the same rack leaves the data vulnerable to rack-level failures.
Beyond 3: With a higher replication factor (more than 3 replicas), HDFS will place the additional replicas on available DataNodes while adhering to the following principle: no more than two replicas are placed on the same rack, to ensure that even in the case of a rack failure there are still enough replicas on other racks.
Storing file on HDFS
● Default replication factor is 3
○ first replica of a block will be stored on a local rack
○ the next replica will be stored on a remote rack
○ the third replica will be stored on the same remote rack
but on a different Datanode
○ Why?
● More replicas?
○ the rest will be placed on random Datanodes
○ As far as possible, no more than two replicas are kept on
the same rack
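The placement rules above can be sketched as follows; this is illustrative pseudologic in Python, not the real implementation (which lives in HDFS's BlockPlacementPolicyDefault):

    import random

    def place_replicas(racks, local_rack, n=3):
        """racks: dict rack -> list of DataNodes. Returns (rack, node) pairs.
        Assumes the cluster has enough nodes for the requested factor."""
        first = (local_rack, random.choice(racks[local_rack]))    # 1: local rack
        remote = random.choice([r for r in racks if r != local_rack])
        second = (remote, random.choice(racks[remote]))           # 2: remote rack
        others = [nd for nd in racks[remote] if nd != second[1]]
        third = (remote, random.choice(others))                   # 3: same remote rack,
        chosen = [first, second, third][:n]                       #    different DataNode
        while len(chosen) < n:                                    # extras: random nodes,
            r = random.choice(list(racks))                        # <= 2 replicas per rack
            on_rack = sum(1 for rr, _ in chosen if rr == r)
            free = [nd for nd in racks[r] if (r, nd) not in chosen]
            if free and on_rack < 2:
                chosen.append((r, random.choice(free)))
        return chosen

    racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
    print(place_replicas(racks, local_rack="rack1"))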
[Figure: a file is divided into blocks B1, B2, B3, ..., Bn; the Master Node tracks them while the blocks are replicated across data nodes in Rack 1 and Rack 2, with 8-gigabit and 1-gigabit network links connecting switches and nodes.]
Tasks of NameNode
❑ Manages File System
➢ mapping files to blocks and blocks to data nodes
❑ Maintaining status of data nodes
➢ Heartbeat
■ Datanode sends heartbeat at regular intervals
■ If heartbeat is not received, datanode is declared dead
➢ Blockreport
■ DataNode sends list of blocks on it
■ Used to check health of HDFS
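A toy sketch of the heartbeat bookkeeping described above. The timeout value and structures are illustrative; in real HDFS a DataNode is declared dead after roughly 10 minutes without heartbeats.

    import time

    HEARTBEAT_TIMEOUT = 630.0     # seconds; illustrative value (~10.5 min)

    last_heartbeat = {}           # DataNode id -> time of last heartbeat

    def on_heartbeat(datanode_id):
        last_heartbeat[datanode_id] = time.time()

    def dead_datanodes(now=None):
        """DataNodes whose heartbeats stopped; their blocks get re-replicated."""
        now = now or time.time()
        return [dn for dn, ts in last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

    on_heartbeat("dn1")
    on_heartbeat("dn2")
    last_heartbeat["dn2"] -= 660  # pretend dn2 went silent 11 minutes ago
    print(dead_datanodes())       # ['dn2']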
NameNode Functions
❑ Rebalancing - balancer tool
❑ Replication
➢ On addition of new nodes
➢ On Datanode failure
➢ On decommissioning
➢ On disk failure
➢ On deletion of some files
➢ On block corruption
❑ Data integrity
➢ Checksum for each block
➢ Stored in hidden file
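A minimal sketch of the per-block checksum idea. Real HDFS stores CRC checksums for small chunks of each block in a hidden metadata file and verifies them on read; this toy version checksums a whole block.

    import zlib

    def block_checksum(data: bytes) -> int:
        return zlib.crc32(data)          # CRC32 stand-in for HDFS's checksums

    def verify(data: bytes, stored: int) -> bool:
        """On read, recompute and compare; a mismatch means the block is
        corrupt, so the client reads another replica and the NameNode
        schedules re-replication."""
        return block_checksum(data) == stored

    block = b"...contents of one 128 MB block..."
    stored = block_checksum(block)       # kept in the hidden metadata file
    print(verify(block, stored))         # True
    print(verify(block + b"!", stored))  # False -> corruption detected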
HDFS Robustness
❑ Safemode
➢ At startup: No replication possible
➢ Receives Heartbeats and Blockreports from Datanodes
➢ Only a percentage of blocks are checked for the defined replication factor (see the sketch after this list)
❑ Replicate blocks wherever necessary
❑ Fault tolerant
❑ Scalable
❑ Reliable
❑ Files are distributed in large blocks for
➢ Efficient reads
➢ Parallel access
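Returning to the Safemode bullets above: a sketch of the exit condition. The threshold constant mirrors HDFS's dfs.namenode.safemode.threshold-pct setting (default 0.999), but the code itself is illustrative.

    SAFEMODE_THRESHOLD = 0.999   # fraction of blocks that must meet
                                 # their minimal replication factor

    def can_leave_safemode(blocks_ok, total_blocks,
                           threshold=SAFEMODE_THRESHOLD):
        """NameNode exits safemode once enough blocks, as seen in the
        Blockreports, satisfy minimal replication; until then it
        schedules no replication."""
        return total_blocks > 0 and blocks_ok / total_blocks >= threshold

    print(can_leave_safemode(9_989, 10_000))  # False: stay in safemode
    print(can_leave_safemode(9_995, 10_000))  # True: resume normal operation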
Questions?