Big Data-2

Big Data refers to large and complex data sets that are challenging to process with traditional tools, characterized by the FOUR V's: Volume, Velocity, Variety, and Veracity. The document discusses the significance of Big Data, its generation sources, and the role of Hadoop as a distributed framework for managing and processing this data. Key attributes of Hadoop include scalability, cost-effectiveness, flexibility, and fault tolerance, making it suitable for handling vast amounts of data efficiently.


Big Data

Big Data
• What is Big Data?
• Analog storage vs digital.
• The FOUR V’s of Big Data.
• Who’s Generating Big Data
• The importance of Big Data.
• Hadoop
• Hadoop Architecture
Definition
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.”
– Wikipedia
Data Growth vs Technology Cost
• Device explosion: >5.5 billion devices (70+% of the global population)
• Social networks: >2 billion users
• Cheap storage: $100 gets you 3 million times more storage than 30 years ago
• Ubiquitous connection: web traffic of ~130 exabytes (10^18 bytes) in 2010, ~1.6 zettabytes (10^21 bytes) in 2015
• Sensor networks: >10 billion sensors
• Inexpensive computing: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005
The FOUR V’s of Big Data
According to IBM scientists, big data can be broken down into four dimensions:
• Volume
• Velocity
• Variety
• Veracity
The FOUR V’s of Big Data
Volume. Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
The FOUR V’s of Big Data
Variety. Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
The FOUR V’s of Big Data

Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
The FOUR V’s of Big Data
Veracity. Big Data veracity refers to the biases, noise and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? Veracity in data analysis is the biggest challenge when compared to things like volume and velocity. In scoping out your big data strategy, you need your team and partners to work to keep your data clean, and processes in place to keep ‘dirty data’ from accumulating in your systems.
Who’s Generating the 4 V’s?
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The importance of Big Data
The real issue now is not that you are acquiring large
amounts of data. It's what you do with the data that
counts. The hopeful vision is that organizations will be
able to take data from any source, harness relevant data
and analyze it to find answers that enable:
• Cost reductions
• Time reductions
• New product development and optimized offerings
• Smarter business decision making
The importance of Big Data
For instance, by combining big data and high-powered analytics, it is possible
to:
• Determine root causes of failures, issues and defects in near-real time,
potentially saving billions of dollars annually.
• Optimize routes for many thousands of package delivery vehicles while
they are on the road.
• Analyze millions of SKUs to determine prices that maximize profit and
clear inventory.
• Generate retail coupons at the point of sale based on the customer's
current and past purchases.
• Send tailored recommendations to mobile devices while customers are in
the right area to take advantage of offers.
• Recalculate entire risk portfolios in minutes.
• Quickly identify customers who matter the most.
• Use clickstream analysis and data mining to detect fraudulent behavior.
HADOOP
Hadoop, a distributed framework for Big Data
Hadoop vs RDBMS
• Data size: RDBMS handles gigabytes; Hadoop handles petabytes
• Access: RDBMS supports interactive and batch access; Hadoop is batch-oriented
• Updates: RDBMS data is read and written many times; Hadoop data is written once and read many times
• Structure: RDBMS uses a static schema; Hadoop uses a dynamic schema
• Integrity: RDBMS integrity is high; Hadoop integrity is low
• Scaling: RDBMS scaling is nonlinear; Hadoop scaling is linear
• DBA ratio
Reference: Tom White’s Hadoop: The Definitive Guide
Search engines in the 1990s
[Timeline of early search engines, 1996–1997]
Google search engines
[Google search, 1998–2018]
Hadoop’s Developers
• 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
• 2006: Yahoo gave the project to the Apache Software Foundation.
[Photo: Doug Cutting]
Google Origins
• 2003: Google File System (GFS) paper
• 2004: MapReduce paper
• 2006: Bigtable paper
Some Hadoop Milestones
• 2008 - Hadoop wins the terabyte sort benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds)
• 2009 - Avro and Chukwa became new members of the Hadoop framework family
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework
• 2011 - ZooKeeper completed
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari, Cassandra and Mahout added

Apache Hadoop
Open Source
It is a set of open source projects that transform commodity
hardware into a service that can:
• Store petabytes of data reliably
• Allow huge distributed computations
Key attributes:
• Open source
• Highly scalable
• Runs on commodity hardware
• Redundant and reliable (no data loss)
• Batch processing centric – using “Map-Reduce” processing
paradigm
HDFS / Hadoop
Data in an HDFS cluster is broken down into smaller pieces
(called blocks) and distributed throughout the cluster. In
this way, the map and reduce functions can be executed
on smaller subsets of your larger data sets, and this
provides the scalability that is needed for big data
processing. The goal of Hadoop is to use commonly
available servers in a very large cluster, where each server
has a set of inexpensive internal disk drives.
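To make the block idea concrete, here is a minimal client-side sketch (not from the slides) that writes a small file to HDFS and then asks where its blocks and replicas ended up; the NameNode address hdfs://namenode:9000 and the path /demo/sample.txt are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the URI is a placeholder for a real NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a small file; HDFS splits larger files into blocks behind the scenes.
        Path file = new Path("/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }

        // Each block is reported together with the DataNodes that hold a replica of it.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Because map and reduce tasks are scheduled against these block locations, computation can run where the data already lives.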
PROS OF HDFS
• Scalable – New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
• Cost effective – Hadoop brings massively parallel
computing to commodity servers. The result is a
sizeable decrease in the cost per terabyte of storage,
which in turn makes it affordable to model all your
data.
PROS OF HDFS
• Flexible – Hadoop is schema-less, and can absorb any
type of data, structured or not, from any number of
sources. Data from multiple sources can be joined
and aggregated in arbitrary ways enabling deeper
analyses than any one system can provide.
• Fault tolerant – When you lose a node, the system
redirects work to another location of the data and
continues processing without missing a beat.
Hadoop Framework Tools
Hadoop’s Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lies
• Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to store needed blocks as close as possible
• The central control node runs NameNode to keep track of HDFS directories & files, and JobTracker to dispatch compute tasks to TaskTrackers
• Written in Java; jobs can also be written in languages such as Python and Ruby (e.g., via Hadoop Streaming)
Hadoop’s Architecture
• Hadoop Distributed Filesystem (HDFS)
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default; configurable, see the sketch below)
• No need for RAID on normal nodes
• Large block size (64 MB by default)
• Location awareness of DataNodes in the network
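The replication factor and block size above are defaults, not fixed properties. Below is a small hedged sketch of how a client can override them through Hadoop's Configuration object; dfs.replication and dfs.blocksize are standard HDFS property names, while the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standard HDFS client-side settings; the values here are illustrative.
        conf.setInt("dfs.replication", 3);                  // three replicas per block (the usual default)
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks (64 MB in older releases)
        System.out.println("replication = " + conf.get("dfs.replication"));
        System.out.println("block size  = " + conf.get("dfs.blocksize"));
    }
}
```

Files created through a FileSystem built from this Configuration pick up these settings instead of the cluster defaults.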


Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks
or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure

DataNode:
• Stores the actual data in HDFS
• Can run on any underlying file system (ext3/4, NTFS, etc)
• Notifies the NameNode of which blocks it has
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere (see the sketch below)
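As a hedged illustration of the NameNode's role in replication (not part of the original slides), the sketch below reads a file's replication factor from its metadata and then asks for it to be raised, after which the NameNode schedules the extra replicas on DataNodes. The cluster URI and file path are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/demo/sample.txt");

        // File metadata (including the target replication factor) comes from the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication = " + status.getReplication());

        // Request a higher replication factor; the NameNode creates the extra replicas.
        boolean accepted = fs.setReplication(file, (short) 4);
        System.out.println("replication change accepted = " + accepted);
    }
}
```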
HDFS
• The Apache Hadoop HDFS is the distributed file system of Hadoop, designed to store large files on cheap hardware.
• It is highly fault-tolerant and provides high throughput to applications. HDFS is best suited for applications that have very large data sets.
• The Hadoop HDFS file system provides a master/slave architecture. The master node runs the NameNode daemon and slave nodes run DataNode daemons.
Map Reduce
Map-Reduce is the data processing layer of Hadoop. It splits a job into small tasks, assigns those tasks to many machines joined over a network, and then assembles all of the intermediate results into the final result dataset. The basic unit of data required by Map-Reduce is the key-value pair: all data, whether structured or not, must be translated into key-value pairs before it is passed through the Map-Reduce model. In the Map-Reduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
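The classic word-count job is a compact way to see this key-value model in action. The sketch below is the standard example rather than anything specific to these slides: the map phase emits (word, 1) pairs from each line of input, and the reduce phase sums the counts for each word; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // combine counts locally before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```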
Thank you for your attention.
