Big Data-2

Big Data refers to large and complex data sets that are challenging to process with traditional tools, characterized by the FOUR V's: Volume, Velocity, Variety, and Veracity. The document discusses the significance of Big Data, its generation sources, and the role of Hadoop as a distributed framework for managing and processing this data. Key attributes of Hadoop include scalability, cost-effectiveness, flexibility, and fault tolerance, making it suitable for handling vast amounts of data efficiently.


Big Data

Big Data
• What is Big Data?
• Analog storage vs digital.
• The FOUR V’s of Big Data.
• Who’s Generating Big Data
• The importance of Big Data.
• Hadoop
• Hadoop Architecture
Definition
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.”
– Wikipedia
Data Growth vs Technology Cost
• Device explosion: >5.5 billion devices (70+% of the global population)
• Social networks: >2 billion users
• Cheap storage: $100 gets you 3 million times more storage than 30 years ago
• Ubiquitous connection: web traffic of ~130 exabytes (10^18 bytes) in 2010, ~1.6 zettabytes (10^21 bytes) in 2015
• Sensor networks: >10 billion sensors
• Inexpensive computing: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005
The FOUR V’s of Big Data
According to IBM scientists, big data can be broken down into four dimensions:
• Volume
• Velocity
• Variety
• Veracity
The FOUR V’s of Big Data
Volume. Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
The FOUR V’s of Big Data
Variety. Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
The FOUR V’s of Big Data

Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
The FOUR V’s of Big Data
Veracity. Big Data veracity refers to the biases, noise and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? Veracity in data analysis is the biggest challenge when compared to things like volume and velocity. In scoping out your big data strategy, you need your team and partners to work to keep your data clean, and processes in place to keep ‘dirty data’ from accumulating in your systems.
Who’s Generating the 4 V’s?
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The importance of Big Data
The real issue now is not that you are acquiring large
amounts of data. It's what you do with the data that
counts. The hopeful vision is that organizations will be
able to take data from any source, harness relevant data
and analyze it to find answers that enable:
• Cost reductions
• Time reductions
• New product development and optimized offerings
• Smarter business decision making
The importance of Big Data
For instance, by combining big data and high-powered analytics, it is possible
to:
• Determine root causes of failures, issues and defects in near-real time,
potentially saving billions of dollars annually.
• Optimize routes for many thousands of package delivery vehicles while
they are on the road.
• Analyze millions of SKUs to determine prices that maximize profit and
clear inventory.
• Generate retail coupons at the point of sale based on the customer's
current and past purchases.
• Send tailored recommendations to mobile devices while customers are in
the right area to take advantage of offers.
• Recalculate entire risk portfolios in minutes.
• Quickly identify customers who matter the most.
• Use clickstream analysis and data mining to detect fraudulent behavior.
HADOOP
Hadoop, a distributed framework for Big Data
Hadoop vs RDBMS
• Data size: RDBMS handles gigabytes; Hadoop handles petabytes
• Access: RDBMS supports interactive and batch access; Hadoop is batch-oriented
• Updates: RDBMS data is read and written many times; Hadoop data is written once and read many times
• Structure: RDBMS uses a static schema; Hadoop uses a dynamic schema
• Integrity: RDBMS integrity is high; Hadoop integrity is low
• Scaling: RDBMS scaling is nonlinear; Hadoop scaling is linear
• DBA ratio
Reference: Tom White’s Hadoop: The Definitive Guide
Search engines in the 1990s
[Timeline of early search engines, 1996–1997]
Google search engines
[Google search, 1998–2018]
Hadoop’s Developers
• 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
• 2006: Yahoo gave the project to the Apache Software Foundation.
[Photo: Doug Cutting]
Google Origins
• 2003: Google File System (GFS) paper
• 2004: MapReduce paper
• 2006: Bigtable paper
Some Hadoop Milestones
• 2008 - Hadoop wins the terabyte sort benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds)
• 2009 - Avro and Chukwa became new members of the Hadoop framework family
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework
• 2011 - ZooKeeper completed
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari, Cassandra and Mahout added

Apache Hadoop
Open Source
It is a set of open source projects that transform commodity
hardware into a service that can:
• Store petabytes of data reliably
• Allow huge distributed computations
Key attributes:
• Open source
• Highly scalable
• Runs on commodity hardware
• Redundant and reliable (no data loss)
• Batch processing centric – using “Map-Reduce” processing
paradigm
HDFS / Hadoop
Data in an HDFS cluster is broken down into smaller pieces
(called blocks) and distributed throughout the cluster. In
this way, the map and reduce functions can be executed
on smaller subsets of your larger data sets, and this
provides the scalability that is needed for big data
processing. The goal of Hadoop is to use commonly
available servers in a very large cluster, where each server
has a set of inexpensive internal disk drives.
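To make the block idea concrete, here is a minimal client-side sketch (not from the slides) that writes a small file to HDFS and then asks where its blocks and replicas ended up; the NameNode address hdfs://namenode:9000 and the path /demo/sample.txt are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the URI is a placeholder for a real NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a small file; HDFS splits larger files into blocks behind the scenes.
        Path file = new Path("/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }

        // Each block is reported together with the DataNodes that hold a replica of it.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Because map and reduce tasks are scheduled against these block locations, computation can run where the data already lives.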
PROS OF HDFS
• Scalable – New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
• Cost effective – Hadoop brings massively parallel
computing to commodity servers. The result is a
sizeable decrease in the cost per terabyte of storage,
which in turn makes it affordable to model all your
data.
PROS OF HDFS
• Flexible – Hadoop is schema-less, and can absorb any
type of data, structured or not, from any number of
sources. Data from multiple sources can be joined
and aggregated in arbitrary ways enabling deeper
analyses than any one system can provide.
• Fault tolerant – When you lose a node, the system
redirects work to another location of the data and
continues processing without missing a beat.
Hadoop Framework Tools
Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lies
• Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to store needed blocks as close as possible
• The central control node runs NameNode to keep track of HDFS directories & files, and JobTracker to dispatch compute tasks to TaskTrackers
• Written in Java; jobs can also be written in languages such as Python and Ruby (e.g., via Hadoop Streaming)
Hadoop’s Architecture
• Hadoop Distributed Filesystem (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default; configurable, see the sketch below)
• No need for RAID on normal nodes
• Large block size (64 MB by default)
• Location awareness of DataNodes in the network
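The replication factor and block size above are defaults, not fixed properties. Below is a small hedged sketch of how a client can override them through Hadoop's Configuration object; dfs.replication and dfs.blocksize are standard HDFS property names, while the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standard HDFS client-side settings; the values here are illustrative.
        conf.setInt("dfs.replication", 3);                  // three replicas per block (the usual default)
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks (64 MB in older releases)
        System.out.println("replication = " + conf.get("dfs.replication"));
        System.out.println("block size  = " + conf.get("dfs.blocksize"));
    }
}
```

Files created through a FileSystem built from this Configuration pick up these settings instead of the cluster defaults.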


Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks
or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure

DataNode:
• Stores the actual data in HDFS
• Can run on any underlying file system (ext3/4, NTFS, etc)
• Notifies the NameNode of which blocks it has
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere (see the sketch below)
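As a hedged illustration of the NameNode's role in replication (not part of the original slides), the sketch below reads a file's replication factor from its metadata and then asks for it to be raised, after which the NameNode schedules the extra replicas on DataNodes. The cluster URI and file path are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/demo/sample.txt");

        // File metadata (including the target replication factor) comes from the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication = " + status.getReplication());

        // Request a higher replication factor; the NameNode creates the extra replicas.
        boolean accepted = fs.setReplication(file, (short) 4);
        System.out.println("replication change accepted = " + accepted);
    }
}
```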
HDFS
• The Apache Hadoop HDFS is the distributed file system of Hadoop, designed to store large files on cheap hardware.
• It is highly fault-tolerant and provides high throughput to applications. HDFS is best suited for applications that have very large data sets.
• The Hadoop HDFS file system provides a master/slave architecture. The master node runs the NameNode daemon and slave nodes run DataNode daemons.
Map Reduce
Map-Reduce is the data processing layer of Hadoop. It splits a job into small tasks, assigns those tasks to many machines joined over a network, and then assembles all of the intermediate results into the final result dataset. The basic unit of data required by Map-Reduce is the key-value pair: all data, whether structured or not, must be translated into key-value pairs before it is passed through the Map-Reduce model. In the Map-Reduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
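The classic word-count job is a compact way to see this key-value model in action. The sketch below is the standard example rather than anything specific to these slides: the map phase emits (word, 1) pairs from each line of input, and the reduce phase sums the counts for each word; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // combine counts locally before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```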
Thank you for your attention.
