Big Data-2
Big Data
• What is Big Data?
• Analog storage vs. digital storage
• The FOUR V’s of Big Data
• Who’s generating Big Data?
• The importance of Big Data
• Hadoop
• Hadoop Architecture
Definition
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. (Wikipedia)
• Volume
• Velocity
• Variety
• Veracity
The FOUR V’s of Big Data
Volume. Many factors contribute to the increase in data volume.
Transaction-based data stored through the years.
Unstructured data streaming in from social media. Increasing
amounts of sensor and machine-to-machine data being
collected. In the past, excessive data volume was a storage
issue. But with decreasing storage costs, other issues emerge,
including how to determine relevance within large data
volumes and how to use analytics to create value from relevant
data.
The FOUR V’s of Big Data
Variety. Data today comes in all types of formats.
Structured, numeric data in traditional databases.
Information created from line-of-business applications.
Unstructured text documents, email, video, audio,
stock ticker data and financial transactions. Managing,
merging and governing different varieties of data is
something many organizations still grapple with.
The FOUR V’s of Big Data
Velocity. Data streams in at unprecedented speed and must be dealt with in a timely manner. Sensors, smart metering, and RFID tags are driving the need to process torrents of data in near-real time.
The FOUR V’s of Big Data
Veracity. Data can be uncertain, incomplete, or inconsistent. Establishing trust in the data, and cleaning it before analysis, is a challenge that grows with volume, velocity, and variety.
Who’s Generating Big Data?
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data,
• but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
The importance of Big Data
The real issue now is not that you are acquiring large
amounts of data. It's what you do with the data that
counts. The hopeful vision is that organizations will be
able to take data from any source, harness relevant data
and analyze it to find answers that enable:
• Cost reductions
• Time reductions
• New product development and optimized offerings
• Smarter business decision making
The importance of Big Data
For instance, by combining big data and high-powered analytics, it is possible
to:
• Determine root causes of failures, issues and defects in near-real time,
potentially saving billions of dollars annually.
• Optimize routes for many thousands of package delivery vehicles while
they are on the road.
• Analyze millions of SKUs to determine prices that maximize profit and
clear inventory.
• Generate retail coupons at the point of sale based on the customer's
current and past purchases.
• Send tailored recommendations to mobile devices while customers are in
the right area to take advantage of offers.
• Recalculate entire risk portfolios in minutes.
• Quickly identify customers who matter the most.
• Use clickstream analysis and data mining to detect fraudulent behavior.
Hadoop
Hadoop, a distributed framework for Big Data
Hadoop vs. RDBMS
• Data size: gigabytes (RDBMS) vs. petabytes (Hadoop)
• Access: interactive and batch (RDBMS) vs. batch (Hadoop)
• Updates: read and write many times (RDBMS) vs. write once, read many times (Hadoop)
• Structure: static schema (RDBMS) vs. dynamic schema (Hadoop)
• Integrity: high (RDBMS) vs. low (Hadoop)
• Scaling: nonlinear (RDBMS) vs. linear (Hadoop)
Reference: Tom White’s Hadoop: The Definitive Guide
Search engines in the 1990s (timeline figure: early web search engines, 1996–1997)
Google search engine (timeline figure: Google, 1998–2018)
Hadoop’s Developers
2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
Some Hadoop Milestones
• 2004: Google publishes the MapReduce paper that inspired Hadoop’s processing model.
• 2006: Hadoop becomes an independent Apache project, and Doug Cutting joins Yahoo! to work on it.
DataNode:
• Stores the actual data blocks in HDFS
• Can run on any underlying file system (ext3/4, NTFS, etc.)
• Notifies the NameNode of which blocks it holds
• The NameNode places replicas of each block 2x in the local rack and 1x elsewhere
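As a rough illustration of the placement policy stated above, here is a minimal Python sketch of rack-aware replica placement. The cluster layout, rack names, and block IDs are all hypothetical; this is a simulation of the idea, not the actual HDFS implementation.

import random

# Hypothetical cluster layout: rack name -> DataNode hostnames (not a real cluster).
CLUSTER = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def place_replicas(local_rack):
    """Pick 3 DataNodes for one block: 2 in the local rack, 1 in a remote rack,
    mirroring the policy described in the slide above."""
    local = random.sample(CLUSTER[local_rack], 2)            # 2 replicas in the local rack
    remote_rack = random.choice([r for r in CLUSTER if r != local_rack])
    remote = random.choice(CLUSTER[remote_rack])             # 1 replica elsewhere
    return local + [remote]

# Example: place replicas for three (hypothetical) blocks written from rack1.
for block_id in ("blk_001", "blk_002", "blk_003"):
    print(block_id, "->", place_replicas("rack1"))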
HDFS
• Apache Hadoop HDFS is the distributed file system of Hadoop, designed to store large files on inexpensive hardware.
• It is highly fault-tolerant and provides high throughput to applications. HDFS is best suited to applications with very large data sets.
• HDFS uses a master/slave architecture: the master node runs the NameNode daemon, and the slave nodes run DataNode daemons.
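To make the "large files on cheap hardware" idea concrete, here is a simplified Python sketch of how HDFS splits a file into fixed-size blocks (128 MB by default in recent Hadoop versions). The file name is hypothetical, and this only mimics the splitting step; real HDFS then distributes and replicates the blocks across DataNodes as described above.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, payload) chunks, mimicking how HDFS splits a file."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(block_size):
            yield index, chunk
            index += 1

# Example with a hypothetical file: a 300 MB file yields two 128 MB blocks plus
# a final 44 MB block; like HDFS, the last block is not padded to full size.
for i, chunk in split_into_blocks("bigfile.dat"):
    print(f"block {i}: {len(chunk):,} bytes")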
MapReduce
MapReduce is the data processing layer of Hadoop. It splits a job into small tasks, assigns those tasks to many machines joined over a network, and assembles the partial results into the final output dataset. The basic unit of data in MapReduce is the key-value pair: all data, whether structured or not, must be translated into key-value pairs before it is passed through the MapReduce model. In the MapReduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
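To make the key-value flow concrete, below is a minimal word-count sketch in the style of Hadoop Streaming, which pipes text through a mapper and a reducer via stdin/stdout. Here both phases run locally in one script for illustration; in a real Hadoop job they would be separate programs, with the framework's shuffle-and-sort step between them. The script name used in the example is hypothetical.

import sys
from itertools import groupby

def mapper(lines):
    """Map phase: turn each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word; pairs must arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally we sort by key ourselves; in a real job, Hadoop's shuffle-and-sort
    # delivers each reducer its keys in sorted order.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")

For example, piping the line "to be or not to be" into this script prints be 2, not 1, or 1, to 2, one word-count pair per output line.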
Thank you for your
attention.