0% found this document useful (0 votes)

32 views38 pages

Unit - I Introduction To Big Data

The document discusses definitions of big data from various organizations and experts. It describes the importance, sources, characteristics and issues related to big data. It also provides an overview of Hadoop including its architecture, advantages, history and applications.

Uploaded by

praneelp2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views38 pages

Unit - I Introduction To Big Data

Uploaded by

praneelp2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 38

What is Big Data?

Some Famous definitions

Gartner:
“Big data” is high-volume, velocity, and variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”

IBM:
"Big Data is characterized by its volume, velocity, and variety – the three
Vs. But the fourth V, veracity (quality of data), is also a critical factor.“

McKinsey & Company:

"Big Data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze."
• Doug Laney (3Vs Model):
"Big Data is high volume, high velocity, and/or high variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery, and process optimization.“
•
Oracle:
"Big Data represents the information assets characterized by such a high volume,
velocity, and variety to require specific technology and analytical methods for its
transformation into value."
TechTarget:
•
"Big Data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis."
Why Big Data is important
Why Big Data is important
• 1. Cost savings: Large amounts of data – help in identifying
more efficient ways of doing business.
• 2. Time reductions: High speed of tools- helps businesses
analysing data immediately and take decisions based on the
learning.
• 3. Understand the market conditions: Analysing customers
purchase behaviour – company’s competitions.
• 4. Control online reputation: Get feedback who is saying
about your’s company- monitoring and improve the online
presence of the business.
• 5. Using big data analytics to boost customers acquisition
and retention – to observe various customer related
patterns and trends.
• 6.To solve advertisers problems and offer marketing
insights- the ability to make customer expectations,
changing company’s product line
• 7. As a Driver of innovations and product development –
to help company’s innovate and redevelop their products.
Sources of Big Data
These data come from many sources like
Social networking sites: Facebook, Google, LinkedIn all these sites generates huge amount of data on a
day to day basis as they have billions of users worldwide.
E-commerce site: Sites like Amazon, Flipkart, Alibaba generates huge amount of logs from which users
buying trends can be traced.
Weather Station: All the weather station and satellite gives very huge data which are stored and
manipulated to forecast weather.
Telecom company: Telecom giants like Airtel, Vodafone study the user trends and accordingly publish
their plans and for this they store the data of its million users.
Share Market: Stock exchange across the world generates huge amount of data through its daily
transaction.
3V's of Big Data
Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double in
every 2 years.
Variety: Now a days data are not stored in rows and column. Data is structured as well as unstructured.
Log file, CCTV footage is unstructured data. Data which can be saved in tables are structured data like
the transaction data of the bank.
Volume: The amount of data which we deal with is of very large size of Peta bytes.

1 Petabyte= 1024 TB(nearly storing 7 lakhs of HD movies)

In analysis of 2021 – Every one hour 300 hours of youtube videos are uploaded
- In twitter account 500 millions of tweets post per day
Now a days, Data must be in audio, video, images and text formats(different formats)
Issues
Huge amount of unstructured data which needs to be stored, processed and analyzed.

Solution
Storage: This huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System) which uses
commodity hardware to form clusters and store data in a distributed fashion. It works on Write once, read
many times principle.
Processing: Map Reduce paradigm is applied to data distributed over network to find the required
output.
Analyze: Pig, Hive can be used to analyze the data.
Cost: Hadoop is open source so the cost is no more an issue.
Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS
was developed. It states that the files will be broken into blocks and stored in nodes over the distributed
architecture.
Yarn: Yet another Resource Negotiator is used for job scheduling and manage the cluster.
Map Reduce: This is a framework which helps Java programs to do the parallel computation on data
using key value pair. The Map task takes input data and converts it into a data set which can be computed
in Key value pair. The output of Map task is consumed by reduce task and then the out of reducer gives
the desired result.
Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
History of Hadoop
The Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
Advantages of Hadoop
Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster retrieval. Even
the tools to process the data are often on the same servers, thus reducing the processing time. It is able to
process terabytes of data in minutes and Peta bytes in hours.

Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.

Cost Effective: Hadoop is open source and uses commodity hardware to store data so it really cost
effective as compared to traditional relational database management system.

Resilient to failure: HDFS has the property with which it can replicate data over the network, so if one
node is down or some other network failure happens, then Hadoop takes the other copy of data and use it.
Normally, data are replicated thrice but the replication factor is configurable.
Hadoop Architecture:
Hadoop Distributed File System
This architecture consist of a single NameNode performs the role of master, and multiple DataNodes
performs the role of a slave.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is
used to develop HDFS. So any machine that supports Java language can easily run the NameNode and
DataNode software.
Namenode
Namenode and Datanode
•NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become the reason of single point failure.
o It manages the file system namespace by executing an operation like the opening, renaming and
closing the files.
o It simplifies the architecture of the system.
•DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of DataNode to read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
•Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using
NameNode.
o In response, NameNode provides metadata to Job Tracker.
•Task Tracker
o It works as a slave node for Job Tracker.
o It receives task and code from Job Tracker and applies that code on the file. This process can also be
called as a Mapper.
Big Data is different from traditional
databases
• While traditional data is based on a centralized database architecture, big
data uses a distributed architecture. Computation is distributed among
several computers in a network. This makes big data far more scalable than
traditional data, in addition to delivering better performance and cost
benefits.
• In 2014, Data Science Central, Kirk Born has defined big data in 10 V's
i.e. Volume, Variety, Velocity, Veracity, Validity, Value, Variability, Venue,
Vocabulary, and Vagueness
Year Event

2003 Google released the paper, Google File System (GFS).

2004 Google released a white paper on Map Reduce.

2006
o Hadoop introduced.

o Hadoop 0.1.0 released.

o Yahoo deploys 300 machines and within this year reaches 600 machines.
2007
o Yahoo runs 2 clusters of 1000 machines.

o Hadoop includes HBase.

2008
o YARN JIRA opened

o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.

o Yahoo clusters loaded with 10 terabytes per day.

o Cloudera was founded as a Hadoop distributor.

2009
o Yahoo runs 17 clusters of 24,000 machines.

o Hadoop becomes capable enough to sort a petabyte.

o MapReduce and HDFS become separate subproject.

2010
o Hadoop added the support for Kerberos.

o Hadoop operates 4,000 nodes with 40 petabytes.

o Apache Hive and Pig released.

2011
o Apache Zookeeper released.

o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of

storage.

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

Comparison tools used for analysing bigdata
Applications:
1. Transportation - Congestion management and traffic control, Route planning, Traffic safety
2. Advertising and Marketing-
3. Banking and Financial Services - Fraud detection, Risk management, Customer relationship
optimization, Personalized marketing
4. Government-

5. Media and Entertainment

6. Meteorology - Study natural disaster patterns, Prepare weather forecasts, Understand the
impact of global warming, Provide early warning of impending crises such as hurricanes and
tsunamis
7. Healthcare-

8. Cybersecurity

9. Education - Customizing curricula, Reducing dropout rates, Improving student outcomes,

Targeted international recruiting
• Hadoop Community Package Consists of
• ∙ File system and OS level abstractions
• ∙ A MapReduce engine (either MapReduce or YARN)
• ∙ The Hadoop Distributed File System (HDFS)
• ∙ Java ARchive (JAR) files
• ∙ Scripts needed to start Hadoop
• ∙ Source code, documentation and a contribution section

Big Data Analytics 1-5
100% (1)
Big Data Analytics 1-5
63 pages
Big Data Analytics - Lecture Slides
No ratings yet
Big Data Analytics - Lecture Slides
72 pages
Module 02 - Learners Guide
No ratings yet
Module 02 - Learners Guide
82 pages
Lecture 4
No ratings yet
Lecture 4
32 pages
BigData Unit1
No ratings yet
BigData Unit1
74 pages
Hadoop Lab
100% (1)
Hadoop Lab
32 pages
Unit 1 BDA
No ratings yet
Unit 1 BDA
43 pages
Big Data - 1
No ratings yet
Big Data - 1
46 pages
Unit 1
No ratings yet
Unit 1
89 pages
Notes Hadoop
No ratings yet
Notes Hadoop
19 pages
Bda Ese
No ratings yet
Bda Ese
66 pages
07 BigData DataAnalysis
No ratings yet
07 BigData DataAnalysis
66 pages
HADOOP and PYTHON For BEGINNERS - 2 BOOKS in 1 - Learn Coding Fast! HADOOP and PYTHON Crash Course, A QuickStart Guide, Tutorial Book by Program Examples, in Easy Steps!
100% (1)
HADOOP and PYTHON For BEGINNERS - 2 BOOKS in 1 - Learn Coding Fast! HADOOP and PYTHON Crash Course, A QuickStart Guide, Tutorial Book by Program Examples, in Easy Steps!
89 pages
Assignment Questions BDA Lec 6
No ratings yet
Assignment Questions BDA Lec 6
51 pages
Big Data Demystified - How To Use Big Data, Data Science and AI To Make Better Business Decisions and Gain Competitive Advantage (PDFDrive) - 61-71
No ratings yet
Big Data Demystified - How To Use Big Data, Data Science and AI To Make Better Business Decisions and Gain Competitive Advantage (PDFDrive) - 61-71
11 pages
Experiment No 8 Memory Design
No ratings yet
Experiment No 8 Memory Design
6 pages
Multiple Stacks and Queues
No ratings yet
Multiple Stacks and Queues
6 pages
Big Data Analytics
No ratings yet
Big Data Analytics
12 pages
Biodiesel Research
No ratings yet
Biodiesel Research
29 pages
Lecture8 - Big Data (Hadoop)
No ratings yet
Lecture8 - Big Data (Hadoop)
29 pages
Da ANSWERS
No ratings yet
Da ANSWERS
13 pages
4 A Review Paper On Big Data and Hadoop
No ratings yet
4 A Review Paper On Big Data and Hadoop
3 pages
Updated Unit-2
0% (1)
Updated Unit-2
55 pages
Big Data-2
No ratings yet
Big Data-2
40 pages
The Age OF: Every Minute
No ratings yet
The Age OF: Every Minute
47 pages
Big Data: Presented By, Nishaa R
No ratings yet
Big Data: Presented By, Nishaa R
24 pages
Data Science
No ratings yet
Data Science
87 pages
Naresh I Technologies, Hyderabad, 04023746666: C, C++ and Java Programs Compilation On UNIX Environment
No ratings yet
Naresh I Technologies, Hyderabad, 04023746666: C, C++ and Java Programs Compilation On UNIX Environment
18 pages
Unit V Cloud Technologies and Advancements
No ratings yet
Unit V Cloud Technologies and Advancements
33 pages
Competition Questions
No ratings yet
Competition Questions
10 pages
Hadoop ISE 2
No ratings yet
Hadoop ISE 2
25 pages
Survey Paper On Big Data Analytics Using Hadoop Technologies
No ratings yet
Survey Paper On Big Data Analytics Using Hadoop Technologies
7 pages
Big Data
No ratings yet
Big Data
51 pages
HADOOP
No ratings yet
HADOOP
55 pages
Unit II Big Data Final PDF
No ratings yet
Unit II Big Data Final PDF
25 pages
A Review of Machine Learning Techniques
No ratings yet
A Review of Machine Learning Techniques
6 pages
Bda Unit 1
No ratings yet
Bda Unit 1
32 pages
Grade 10 (Computer System Servicing) : Western Bicutan National High School Ph1 Ep Housing Pinagsama Taguig City
No ratings yet
Grade 10 (Computer System Servicing) : Western Bicutan National High School Ph1 Ep Housing Pinagsama Taguig City
6 pages
Big Data and Hadoop: A Review Paper
No ratings yet
Big Data and Hadoop: A Review Paper
3 pages
Hadoop - Quick Guide Hadoop - Big Data Overview
No ratings yet
Hadoop - Quick Guide Hadoop - Big Data Overview
32 pages
Lect7 IoT BigData1
No ratings yet
Lect7 IoT BigData1
28 pages
Lect 2 Big Data Lesson01
No ratings yet
Lect 2 Big Data Lesson01
26 pages
Big Data Overview
No ratings yet
Big Data Overview
18 pages
Chapter 2-Data Science
No ratings yet
Chapter 2-Data Science
23 pages
Bigdata Analysis: Streaming Twitter Data With Apache Hadoop and V Isualizing Using Biginsights
No ratings yet
Bigdata Analysis: Streaming Twitter Data With Apache Hadoop and V Isualizing Using Biginsights
5 pages
Hadoop - Quick Guide Hadoop - Big Data Overview
No ratings yet
Hadoop - Quick Guide Hadoop - Big Data Overview
41 pages
Hadoop Quick Guide
No ratings yet
Hadoop Quick Guide
32 pages
Hadoop Interview Questions
No ratings yet
Hadoop Interview Questions
28 pages
Hadoop & BigData (UNIT - 2)
No ratings yet
Hadoop & BigData (UNIT - 2)
22 pages
Big Data and Hadoop Self Notes
No ratings yet
Big Data and Hadoop Self Notes
16 pages
Q. What Is Big Data?
No ratings yet
Q. What Is Big Data?
8 pages
Testing Big Data: Camelia Rad
No ratings yet
Testing Big Data: Camelia Rad
31 pages
Hadoop Interview Questions
No ratings yet
Hadoop Interview Questions
28 pages
Big Data
No ratings yet
Big Data
4 pages
Alpha Server 255
No ratings yet
Alpha Server 255
146 pages
Introduction To Big Data: Soorya Prasanna Ravichandran
No ratings yet
Introduction To Big Data: Soorya Prasanna Ravichandran
33 pages
DPP ITNO Pump PetroTec CEM03 80335202
No ratings yet
DPP ITNO Pump PetroTec CEM03 80335202
9 pages
BDA Answers-1
No ratings yet
BDA Answers-1
15 pages
LENOVO AIO M90a GEN 5
No ratings yet
LENOVO AIO M90a GEN 5
3 pages
Big Data
No ratings yet
Big Data
12 pages
@vtucode - In-2022-Scheme-Module-4-3rd semester-CSE
No ratings yet
@vtucode - In-2022-Scheme-Module-4-3rd semester-CSE
35 pages
Sistem Komputer
No ratings yet
Sistem Komputer
12 pages
Escritura 1
No ratings yet
Escritura 1
7 pages
Getting Started W/ Arduino On Windows
No ratings yet
Getting Started W/ Arduino On Windows
5 pages
Hadoop Architecture and Its Functionality
No ratings yet
Hadoop Architecture and Its Functionality
7 pages
Virtualization Structure and Tools
No ratings yet
Virtualization Structure and Tools
16 pages
What Is Bigdata
No ratings yet
What Is Bigdata
5 pages
19 Ways To Bypass Software Restrictions and Spawn A Shell
No ratings yet
19 Ways To Bypass Software Restrictions and Spawn A Shell
26 pages
Guha Roy 2017
No ratings yet
Guha Roy 2017
3 pages
Node B Integration Instructions For Ericsson 3308
100% (2)
Node B Integration Instructions For Ericsson 3308
3 pages
REVISION Jenkins
No ratings yet
REVISION Jenkins
3 pages
Video RAM
No ratings yet
Video RAM
2 pages
Exam Questions
No ratings yet
Exam Questions
12 pages
CS M151B / EE M116C: Computer Systems Architecture
No ratings yet
CS M151B / EE M116C: Computer Systems Architecture
50 pages
CC2530 ZNP Interface Specification
No ratings yet
CC2530 ZNP Interface Specification
79 pages
Two-Minute Drill - OCA
No ratings yet
Two-Minute Drill - OCA
6 pages
NPIB Panasonic Win10
No ratings yet
NPIB Panasonic Win10
2 pages
9 2 3 7 Lab ConfiguringPortAddressTranslation
No ratings yet
9 2 3 7 Lab ConfiguringPortAddressTranslation
11 pages
Cs693 Unit II
No ratings yet
Cs693 Unit II
24 pages
Synopsis On 8 Canmdidate Quiz Buzzer
No ratings yet
Synopsis On 8 Canmdidate Quiz Buzzer
4 pages
Power Macintosh G3 (Blue and White) Firmware Update 1.1 Document and Software
No ratings yet
Power Macintosh G3 (Blue and White) Firmware Update 1.1 Document and Software
3 pages
EE502 Assignment Answers
No ratings yet
EE502 Assignment Answers
2 pages
Oracle VM Manual
No ratings yet
Oracle VM Manual
1 page
Location Wise Details MASTER
No ratings yet
Location Wise Details MASTER
2 pages
DIFx Install Log
No ratings yet
DIFx Install Log
2 pages
Log
No ratings yet
Log
2 pages
Alessio Porcacchia
No ratings yet
Alessio Porcacchia
7 pages
Hadoop Ecosystem for Big Data
From Everand
Hadoop Ecosystem for Big Data
Dr. Zemelak Goraga
No ratings yet
Big Data Analytics
From Everand
Big Data Analytics
Nitin Kumar Yadav
No ratings yet
Learn Hadoop in 24 Hours
From Everand
Learn Hadoop in 24 Hours
Alex Nordeen
No ratings yet

Unit - I Introduction To Big Data

Uploaded by

Unit - I Introduction To Big Data

Uploaded by

What is Big Data?

Some Famous definitions

McKinsey & Company:

1 Petabyte= 1024 TB(nearly storing 7 lakhs of HD movies)

2003 Google released the paper, Google File System (GFS).

2004 Google released a white paper on Map Reduce.

o Hadoop 0.1.0 released.

o Hadoop includes HBase.

o Yahoo clusters loaded with 10 terabytes per day.

o Cloudera was founded as a Hadoop distributor.

o Hadoop becomes capable enough to sort a petabyte.

o MapReduce and HDFS become separate subproject.

o Hadoop operates 4,000 nodes with 40 petabytes.

o Apache Hive and Pig released.

o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

5. Media and Entertainment

9. Education - Customizing curricula, Reducing dropout rates, Improving student outcomes,

You might also like