Introduction to Big Data

● Big data is an all-inclusive term that refers to extremely large, very fast, highly diverse and complex data that cannot be managed with traditional data management tools
● It includes all kinds of data, and helps deliver the right information, to the right person, in the right quantity, at the right time, to make the right decision
● Big data can be harnessed by developing infinitely scalable, flexible, and evolutionary data architectures, coupled with the use of cost-effective computing machines
Definition

“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

By Gartner
Understanding Big Data

● Big data is data that exceeds the processing capacity of conventional database systems.
● The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.
● To gain value from this data, you must choose an alternative way to process it.
● At the fundamental level it is just another collection of data that can be analyzed and utilized for the benefit of the business; at another level, it is a special kind of data that poses unique challenges and offers unique benefits.
Life Cycle of Big Data

Big data is mostly (over 90%) unstructured data. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data:
● Generate
● Gather
● Store
● Organize
● Analyze
● Visualize
The Three V-s

Volume and Velocity are driven by Variety. The varying veracity and value of data complicates the situation.
Volume

● The quantity of data generated in the world is doubling every 12-18 months
● Data sets are too large to store and analyse using traditional databases, and finding something in them within a reasonable period of time is next to impossible (Petabytes and Exabytes; 1 Exabyte = 1 Million TB)
Image Source: https://www.domo.com/blog/data-never-sleeps-5/
Examples of the scale and speed of data generation:
● 12+ TBs of tweet data every day
● 25+ TBs of log data every day
● 30 billion RFID tags today (1.3B in 2005)
● 4.6 billion camera phones worldwide
● 100s of millions of GPS-enabled devices sold annually, generating ? TBs of data every day
● 2+ billion people on the Web
● 76 million smart meters in 2009; 200M by 2014
The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you pile up that data in the form of disks, it could fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously.
Velocity
● If traditional data is like a drop of water, Big Data is like a flowing river
● Speed at which data is generated by billions of devices, and communicated at the speed of light (high-speed Internet availability)
● Mobile devices can generate and communicate data from anywhere, at any time
● Processing should be faster than generation
● Analyse data while it is being generated, without even putting it into databases
Image Source: https://www.domo.com/blog/data-never-sleeps-5/
Velocity (Speed)

• Data is being generated fast from different sources and needs to be processed fast
• Online Data Analytics
• Late decisions ➔ missed opportunities
• Examples
– E-Promotions: Based on your current location, your purchase history, and what you like ➔ send promotions right now for the store next to you
– Healthcare monitoring: sensors monitoring your activities and body ➔ any abnormal measurements require immediate reaction
Sources of Real-time/Fast Data

Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)

• Progress and innovation is no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement

Influencing customer behavior in real time requires:
● Product recommendations that are relevant & compelling
● Learning why customers switch to competitors and their offers, in time to counter
● Improving the marketing effectiveness of a promotion while it is still in play
● Friend invitations to join a game or activity that expands the business
● Preventing fraud as it is occurring & preventing more proactively
Variety

● Different types of data that we can use; inclusive of all forms of data, for all kinds of functions, from all sources and devices
● Generated by different entities:
○ Humans
○ Machines (HW + SW)
○ Sensors
Image Source: https://www.domo.com/blog/data-never-sleeps-5/


Variety (Complexity)
• Relational Data (Tables/Transactions/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
• A single application can be generating/collecting many types of data
• Form: data types range from numbers to text, graph, map, audio, video and others
• Function: human conversation, songs and movies, business transaction records, machine operation performance data, new product design data, old archived data, etc.
• Source of data: mobile phones and tablets, web access and search logs, business transactional information, temperature and pressure sensors on machines, RFID tags on assets generating incessant and repetitive data
• Big Public Data (online, weather, finance, detecting human faces from pictures, comparing voices to identify the speaker, comparing handwriting to identify the writer, etc.)

Broadly speaking, there are three broad types of sources of data: Human-to-Human communication, Human-Machine communication, and Machine-to-Machine communication. To extract knowledge ➔ all these types of data need to be linked together.
A Single View of the Customer

Combining data from social media, banking & finance, gaming, entertainment, purchase history, and known history into a single view of our customer.
Veracity

● Relates to the truthfulness, believability and quality of data
● Big Data is messy and there is considerable misinformation out there
● The reasons for poor quality of data can range from technical error to human error to malicious intent
● Volume makes up for quality
○ E.g. tweets with spelling mistakes, short words
○ u→you, thr→there, teh→the
Image Source: https://www.domo.com/blog/data-never-sleeps-5/
Veracity
● The source of information may not be authoritative
● Whitehouse.gov or nytimes.com is authentic and complete, but Wikipedia data may not be equally reliable
● The data may not be communicated and received correctly because of human or technical failure
● Sensors and communication machines may malfunction and may record and transmit incorrect data
● Urgency may require the transmission of the best data available at a point in time; such data makes reconciliation with later, accurate records more problematic
● The data provided and received may, however, also be intentionally wrong for competitive or security reasons
● There could be disinformation and malicious information spread for strategic reasons
Some Make it 4 V’s
Additional V-s

Value

Getting value out of Big Data!!!

Image Source: https://www.domo.com/blog/data-never-sleeps-5/


Definition

“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

By Gartner
Wikipedia Definition
• Big data is a term for data sets that are so large or complex
that traditional data processing applications are inadequate…
• Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, querying, updating and
information privacy. …
• The term often refers simply to the use of predictive analytics
or certain other advanced methods to extract value from data,
and seldom to a particular size of data set. …
• Accuracy in big data may lead to more confident decision
making, and better decisions can result in greater operational
efficiency, cost reduction and reduced risk.
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
The Model Has Changed…
• The model of generating/consuming data has changed

Old Model: A few companies generate data; all others consume it

New Model: All of us generate data, and all of us consume data
What’s driving Big Data

The shift is away from:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

and toward:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time orientation

Organizations that do not learn to engage with Big Data could find themselves left far behind their competitors, landing in the dustbin of history.
The Life Cycle of Big Data highlights the opportunities for technology providers in managing large volumes of mostly unstructured data. Here's a breakdown of the life cycle:
1. Generate: Data creation from various sources like sensors, social media, transactions, and more.
2. Gather: Collecting data from different systems or sources.
3. Store: Keeping data in databases, cloud storage, or other repositories.
4. Organize: Structuring and managing data to make it usable.
5. Analyze: Extracting insights and meaning from the data.
6. Visualize: Presenting the data insights in a visual format for easy interpretation.
Over 90% of the data generated is unstructured, which creates vast potential for technological innovation in managing and utilizing this data effectively.
THE EVOLUTION OF BUSINESS INTELLIGENCE

● 1990’s: BI Reporting, OLAP & Data warehouse (Business Objects, SAS, Informatica, Cognos and other SQL reporting tools); focus on Scale
● 2000’s: Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA); focus on Speed
● 2010’s: Big Data: Real Time & Single View (Graph Databases) plus Batch Processing & Distributed Data Stores (Hadoop/Spark; HBase/Cassandra); focus on Speed & Scale
What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.
Black Box Data − A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on the shares of different companies.
Power Grid Data − Power grid data holds the information consumed by a particular node with respect to a base station.
Transport Data − Transport data includes the model, capacity, distance and availability of a vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
● Structured data − Relational data
● Semi-structured data − XML data
● Unstructured data − Word, PDF, Text, Media Logs
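
To make these three types concrete, here is a minimal Python sketch; the sample record, XML fragment, and sentence are invented for illustration. It shows a structured row with a fixed schema, a semi-structured XML document parsed with the standard library, and unstructured free text that needs ad-hoc processing.

```python
import xml.etree.ElementTree as ET

# Structured data: fixed schema, like a row in a relational table.
structured_row = {"id": 101, "name": "Alice", "balance": 2500.75}

# Semi-structured data: self-describing tags, flexible schema (XML).
xml_doc = "<customer><id>101</id><name>Alice</name></customer>"
root = ET.fromstring(xml_doc)
print(root.find("name").text)  # -> Alice

# Unstructured data: no predefined model; needs parsing or NLP to extract meaning.
note = "Alice called support on Monday complaining about a late delivery."
print("delivery" in note.lower())  # naive keyword scan -> True
```
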
Benefitting from Big Data
There are 3 major types of Big Data Applications
• Monitoring and Tracking
– Consumer goods producers: Sentiments and needs of their customers
– Industrial organizations: Track inventory in massive interlinked global supply chains
– Factory owners: Monitor machine performance and do preventive maintenance
– Utility companies: Predict energy consumption, manage demand and supply
– Information Technology: Track website performance and improve its usefulness
– Financial organizations: Project trends better and make more effective and profitable bets, etc.
• Analysis and Insight
– Political organizations: To micro-target voters and win elections
– Police: To predict and prevent crimes
– Hospitals: To better diagnose diseases and make medicine prescriptions
– Advertising agencies: To design more targeted marketing campaigns more quickly
– Fashion designers: To track trends and create more innovative products
• Digital Product Development
– Stock market feeds could be a digital product; imagination is the limit
Big Data Technology
Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology: Operational and Analytical systems.
Big Data technologies are generally divided into two categories: Analytical Big Data and Operational Big Data. Each serves different purposes and is used in different contexts. Here's an explanation of both:

1. Analytical Big Data
• Purpose: Used for performing complex data analysis and extracting insights from vast amounts of historical data.
• Focus: Deals with batch processing, which involves working with historical data stored over time and analyzing it to uncover patterns, trends, and predictive insights.
• Examples of use:
– Business reporting and dashboards.
– Predictive analytics (forecasting future trends or customer behaviors).
– Data mining for discovering patterns or correlations.
• Technologies used:
– Hadoop: A framework for processing large datasets distributed across clusters of computers.
– MapReduce: A programming model used within Hadoop for processing and generating large datasets.
– Apache Spark: A fast data processing engine that handles large-scale data analysis.
– Data warehousing: Systems like Amazon Redshift and Google BigQuery for querying and storing large-scale datasets.
• Key characteristics:
– Works with historical, structured or unstructured data.
– Batch processing: Runs tasks periodically on collected data.
– Often not real-time, but can provide deeper insights.
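
As a rough illustration of this batch-processing style, here is a hedged PySpark sketch; the HDFS path, column names, and dataset are hypothetical, and the point is only the shape of a batch job: read historical data, aggregate it, and inspect the trend.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-analytics").getOrCreate()

# Read a (hypothetical) historical sales dataset stored on HDFS.
sales = spark.read.csv("hdfs:///data/sales_history.csv",
                       header=True, inferSchema=True)

# Batch aggregation over all of history: revenue per region per year.
trend = (sales.groupBy("region", "year")
              .agg(F.sum("amount").alias("revenue"))
              .orderBy("region", "year"))
trend.show()

spark.stop()
```
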
2. Operational Big Data
• Purpose: Used for real-time data processing and dealing with data that is continuously generated by day-to-day operations.
• Focus: Stream processing, which means managing and analyzing real-time data generated from operations, applications, and devices.
• Examples of use:
– Real-time monitoring of system performance or transactions.
– Fraud detection in banking systems by analyzing live transaction data.
– Recommendations (e.g., Netflix recommending movies based on live user interactions).
– Sensor data in IoT devices (smart devices, manufacturing machinery).
• Technologies used:
– Apache Kafka: A distributed streaming platform for building real-time data pipelines and streaming apps.
– NoSQL databases: Like MongoDB or Cassandra, optimized for handling high-volume, unstructured data.
– HBase: A real-time, scalable database that runs on top of Hadoop.
– Amazon Kinesis: For real-time data streaming and processing in the cloud.
• Key characteristics:
– Deals with real-time or near-real-time data.
– Handles high-velocity and high-volume data streams.
– Supports low-latency systems where immediate action is required.
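
To illustrate the stream-processing style without assuming any particular platform, here is a small pure-Python sketch of sliding-window fraud detection; the window size and alert threshold are invented for illustration, and a real deployment would sit behind something like Kafka or Kinesis.

```python
from collections import deque
import time

WINDOW_SECONDS = 60   # sliding window length (illustrative)
ALERT_THRESHOLD = 5   # transactions per window that trigger an alert (illustrative)

windows: dict[str, deque] = {}

def on_transaction(card_id: str, amount: float, ts: float) -> None:
    """Process one event as it arrives: no batch job, no database round-trip."""
    win = windows.setdefault(card_id, deque())
    win.append(ts)
    # Evict events that have fallen out of the window.
    while win and ts - win[0] > WINDOW_SECONDS:
        win.popleft()
    if len(win) >= ALERT_THRESHOLD:
        print(f"ALERT: {card_id} made {len(win)} transactions in {WINDOW_SECONDS}s")

# Simulated event stream: six rapid transactions on one card.
now = time.time()
for i in range(6):
    on_transaction("card-42", 99.0, now + i)
```
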
Key Differences:
• Analytical Big Data focuses on long-term, batch processing for historical analysis and insights. It's more suitable for understanding long-term trends and informing strategic decisions.
• Operational Big Data focuses on real-time processing for immediate decision-making and monitoring. It's essential for day-to-day operations that demand low-latency responses.
Operational Big Data

This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, and cheaper and faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data

This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
These two classes of technology are complementary and frequently deployed together.
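
The MapReduce model itself is simple enough to sketch in a few lines of plain Python. This toy word count mimics the map, shuffle, and reduce phases that a framework such as Hadoop would distribute across thousands of machines:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data is big", "data moves fast"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in documents)

# Shuffle: group values by key, as the framework does between the two phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```
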
Operational vs. Analytical Systems

                 Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1000 - 100,000       1 - 10
Access Pattern   Writes and Reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data Scientist
Technology       NoSQL                MapReduce, MPP Database


Use cases
Use Case: Big Data in Oil & Gas Drilling

http://analytics-magazine.org/how-big-data-is-changing-the-oil-a-gas-industry/
Use Case: Uber - Pay Surge Pricing if Battery is Low

https://www.forbes.com/sites/amitchowdhry/2016/05/25/uber-low-battery/#19762c0474b3
Big Data Challenges

Big Data Challenges: Size does matter

1 KB   Kilobyte
1 MB   Megabyte
1 GB   Gigabyte    1 GB = 1 hr (e.g., of video)
1 TB   Terabyte    1 TB = 1024 hrs ≈ 102 days
1 PB   Petabyte    1 PB = 286 yrs > 1 lifetime
1 EB   Exabyte     1 EB = 293K yrs
1 ZB   Zettabyte
1 YB   Yottabyte
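
To see why these sizes matter, consider how long a single disk would need just to read the data sequentially. The ~200 MB/s throughput below is an assumed figure, typical of a hard disk, not a measured one:

```python
MB_PER_SEC = 200  # assumed sequential throughput of one disk

def read_time_days(num_bytes: float) -> float:
    """Days needed to stream num_bytes off a single disk."""
    return num_bytes / (MB_PER_SEC * 1024**2) / 86400

TB, PB = 1024**4, 1024**5
print(f"1 TB: {read_time_days(TB):.2f} days")  # ~0.06 days (about 1.5 hours)
print(f"1 PB: {read_time_days(PB):.2f} days")  # ~62 days on one disk
```

Reading a petabyte off one disk takes about two months, which is why Big Data systems spread data across many disks and read them in parallel.
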
Big Data Challenges: Vertical vs. Horizontal Scaling

Vertical Scaling (scale up: a bigger machine) vs. Horizontal Scaling (scale out: more machines)
Big Data Challenges: Scaling

Source: https://s-media-cache-ak0.pinimg.com/736x/10/0c/d0/100cd0da1c19e5d6f850ed23c3633714.jpg
Big Data Challenges: Scale of infrastructure

Image Source: https://datacenter.legrand.com


Further Reading
● A Brief History of Big Data Everyone Should Read
● Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity
● What is big data? - OpenSource.com & O’Reilly
● Uber Use Case
● 5 Big Data Use Cases To Watch
● Best Big Data Analytics Use Cases
● The 5 game changing big data use cases
● Big Data - The 5 Vs Everyone Must Know
● Top SlideShare Presentations on Big Data
● Google Data Center 360° Tour
How to store huge files?
Requirements:
● Efficient access
● Effective utilization of space
● Redundancy (failsafe)
○ Given: the probability of 1 disk failing is 1% per year
○ What are the chances that at least 1 out of 10³ disks fails at a data center?
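
A back-of-the-envelope answer, assuming disk failures are independent:

```python
p_fail = 0.01  # probability that one disk fails within a year
n = 1000       # disks in the data center

p_at_least_one = 1 - (1 - p_fail) ** n
print(f"{p_at_least_one:.5f}")  # ~0.99996: at least one failure is near-certain
```

With a thousand disks, at least one failure per year is practically certain, which is why redundancy is a hard requirement rather than a nice-to-have.
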
HDFS
Hadoop Distributed File System

HDFS
● Data storage system used by Hadoop
○ Hadoop: project to develop open-source software for reliable, scalable, distributed computing★
○ Will discuss Hadoop later
● Components
● Architecture
● Tasks / Services

★ http://hadoop.apache.org/
Components of HDFS

● Active NameNode
● Standby NameNode
● Secondary NameNode
● DataNodes
1. NameNode
• Role: Acts as the master of the HDFS system.
• Function: Manages the metadata for the file system (e.g., file names, permissions, file block locations).
• Responsibilities:
– Keeps track of the directory structure and file locations.
– Maintains the mapping of file blocks to DataNodes.
– Does not store the actual data but knows where it is located.
– Coordinates file system operations such as opening, closing, and renaming files or directories.
• Single Point of Failure (SPOF): In traditional setups, if the NameNode fails, the HDFS cluster becomes unavailable (although High Availability (HA) configurations can mitigate this).
2. DataNodes
• Role: The worker nodes that store actual data blocks.
• Function: Responsible for storing and retrieving data as instructed by the NameNode.
• Responsibilities:
– Store the actual file data in blocks (the default block size is 128 MB, though configurable).
– Periodically report back to the NameNode: a heartbeat signals liveness, and a block report lists the blocks the node is storing.
– Handle block replication (by default, each block is replicated 3 times across different DataNodes for fault tolerance).
– Repair and rebalance blocks when failures occur.
3. Secondary NameNode (Checkpoint Node)
• Role: Often misunderstood; it does not serve as a backup for the NameNode but helps periodically merge the NameNode's edit log with the namespace image.
• Function: Periodically creates checkpoints of the NameNode's metadata by combining the edit log and the namespace image.
• Responsibilities:
– Keeps the filesystem metadata at a manageable size.
– Reduces the time required for the NameNode to restart by managing the size of the edit logs.
Terminology

Architecture
Storing file on HDFS
Motivation: Reliability, Availability, Network Bandwidth
➢ The input file (say 1 TB) is split into smaller chunks/blocks of 128 MB
➢ The chunks are stored on multiple nodes as independent files on data nodes
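
A quick calculation shows the bookkeeping these defaults imply for a single large file:

```python
import math

file_size = 1 * 1024**4      # 1 TB in bytes
block_size = 128 * 1024**2   # 128 MB default block size
replication = 3              # default replication factor (next slide)

blocks = math.ceil(file_size / block_size)
print(blocks)                 # 8192 blocks for one 1 TB file
print(blocks * replication)   # 24576 block replicas spread across DataNodes
```
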
Storing file on HDFS
➢ To ensure that data is not lost, data can typically be replicated on:
➢ the local rack
➢ a remote rack (in case the local rack fails)
➢ a remote node (in case the local node fails)
➢ randomly
➢ The default replication factor is 3
Why this placement?
● Local rack: Because the first replica is local to the client, data is written faster, and there's less load on the network.
● Remote rack: The second replica is stored on a different rack. This ensures that the data is not dependent on the availability of a single rack; if the entire rack holding the first replica is lost (power failure, network issues, or hardware failure), the data is still accessible from a different rack, so the system achieves fault tolerance.
● Same remote rack (different DataNode): The third replica is stored on the same remote rack as the second replica, but on a different DataNode within that rack. This helps minimize cross-rack bandwidth consumption: data transmission between DataNodes on the same rack is faster and uses fewer resources than inter-rack communication. While the third replica is kept on the same rack for bandwidth efficiency, it is placed on a different DataNode for added fault tolerance, so that the failure of one DataNode doesn't affect all replicas on that rack.
● Why not more replicas on the same rack? If more than two replicas are stored on the same rack, the entire rack going down could result in significant data loss, so replicas are distributed across racks as much as possible. Placing replicas on the same rack minimizes network traffic between racks, but keeping too many replicas on the same rack leaves the data vulnerable to rack-level failures.
● Beyond 3 replicas: With a higher replication factor, HDFS places the additional replicas on available DataNodes while adhering to the principle that no more than two replicas are placed on the same rack, ensuring that even in the case of a rack failure there are still enough replicas on other racks.
Storing file on HDFS
● The default replication factor is 3
○ the first replica of a block will be stored on a local rack
○ the next replica will be stored on a remote rack
○ the third replica will be stored on the same remote rack, but on a different Datanode
○ Why? To balance rack-level fault tolerance against cross-rack bandwidth, as explained above (see the sketch below)
● More replicas?
○ the rest will be placed on random Datanodes
○ as far as possible, no more than two replicas are kept on the same rack
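
Here is a minimal Python sketch of this placement policy. The topology, rack names, and node names are hypothetical, and real HDFS additionally weighs client locality, node load, and available space:

```python
import random

def place_replicas(topology: dict[str, list[str]], local_rack: str) -> list[str]:
    """Pick 3 DataNodes following the local-rack / remote-rack / same-remote-rack rule."""
    first = random.choice(topology[local_rack])            # 1st replica: local rack
    remote_rack = random.choice(                           # pick a remote rack that
        [r for r in topology                               # can hold two replicas
         if r != local_rack and len(topology[r]) >= 2])
    second, third = random.sample(topology[remote_rack], 2)  # 2nd & 3rd: same remote
    return [first, second, third]                            # rack, distinct nodes

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas(topology, local_rack="rack1"))  # e.g. ['dn2', 'dn3', 'dn4']
```
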
[Diagram: A file is split into blocks B1, B2, B3, …, Bn by the Master Node; the block replicas are spread across data nodes on Rack 1 and Rack 2, connected by 8-gigabit and 1-gigabit network links.]
Tasks of NameNode
❑ Manages the file system
➢ mapping files to blocks, and blocks to data nodes
❑ Maintains the status of data nodes
➢ Heartbeat
■ Datanode sends a heartbeat at regular intervals
■ If a heartbeat is not received, the datanode is declared dead
➢ Blockreport
■ DataNode sends the list of blocks on it
■ Used to check the health of HDFS
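
A toy sketch of the liveness bookkeeping described above; the 30-second timeout is illustrative (real HDFS uses configurable heartbeat intervals and a longer expiry):

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before declaring a node dead (illustrative)

last_heartbeat: dict[str, float] = {}

def on_heartbeat(datanode: str) -> None:
    """Called whenever a DataNode checks in."""
    last_heartbeat[datanode] = time.time()

def dead_datanodes() -> list[str]:
    """DataNodes whose last heartbeat is older than the timeout."""
    now = time.time()
    return [dn for dn, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT]

on_heartbeat("dn1")
last_heartbeat["dn2"] = time.time() - 60  # simulate a node that has gone silent
print(dead_datanodes())  # ['dn2'] -> its blocks must be re-replicated elsewhere
```
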
NameNode Functions
❑ Rebalancing: balancer tool
❑ Replication, triggered by:
➢ Addition of new nodes
➢ Datanode failure
➢ Decommissioning
➢ Disk failure
➢ Deletion of some files
➢ Block corruption
❑ Data integrity
➢ Checksum for each block
➢ Stored in a hidden file
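
A minimal sketch of checksum-based integrity checking. HDFS actually stores CRC-based checksums in hidden sidecar files; the MD5 below is only a stand-in to show the verify-on-read idea:

```python
import hashlib

def block_checksum(block: bytes) -> str:
    # Stand-in hash; HDFS uses CRC checksums per chunk, stored in a hidden file.
    return hashlib.md5(block).hexdigest()

stored = block_checksum(b"block contents")          # computed when the block is written

# Later, on read: recompute and compare to detect silent corruption.
assert block_checksum(b"block contents") == stored  # passes: data is intact
print(block_checksum(b"block c0ntents") == stored)  # False: corruption detected
```
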
HDFS Robustness

❑ Safemode
➢ At startup: no replication possible
➢ NameNode receives Heartbeats and Blockreports from Datanodes
➢ Only a percentage of blocks need to be checked for the defined replication factor
❑ Replicate blocks wherever necessary
❑ All is well ➔ Exit Safemode
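
A sketch of the safemode exit condition under stated assumptions: the 0.999 threshold mirrors HDFS's configurable dfs.namenode.safemode.threshold-pct, and the minimum replication of 1 mirrors dfs.namenode.replication.min; treat both as illustrative defaults.

```python
SAFEMODE_THRESHOLD = 0.999  # fraction of blocks that must check out (illustrative)
MIN_REPLICATION = 1         # minimum replicas for a block to count as "safe"

def can_exit_safemode(block_replica_counts: dict[str, int]) -> bool:
    """True once enough blocks have reported at least the minimum replica count."""
    if not block_replica_counts:
        return True
    ok = sum(1 for c in block_replica_counts.values() if c >= MIN_REPLICATION)
    return ok / len(block_replica_counts) >= SAFEMODE_THRESHOLD

print(can_exit_safemode({"blk_1": 3, "blk_2": 1, "blk_3": 0}))  # False: blk_3 unreported
```
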
HDFS Summary

❑ Fault tolerant
❑ Scalable
❑ Reliable
❑ Files are distributed in large blocks for
➢ Efficient reads
➢ Parallel access
Questions?