Unit - I Introduction To Big Data

The document discusses definitions of big data from various organizations and experts. It describes the importance, sources, characteristics and issues related to big data. It also provides an overview of Hadoop including its architecture, advantages, history and applications.


What is Big Data?

Some Famous definitions

Gartner:
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

IBM:
"Big Data is characterized by its volume, velocity, and variety – the three
Vs. But the fourth V, veracity (quality of data), is also a critical factor.“

McKinsey & Company:
"Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."
Doug Laney (3Vs Model):
"Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization."

Oracle:
"Big Data represents the information assets characterized by such a high volume,
velocity, and variety to require specific technology and analytical methods for its
transformation into value."
TechTarget:
"Big Data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis."
Why Big Data is important
• 1. Cost savings: Analysing large amounts of data helps identify more efficient ways of doing business.
• 2. Time reductions: The high speed of big data tools lets businesses analyse data immediately and make decisions based on what they learn.
• 3. Understanding market conditions: Analysing customers' purchase behaviour reveals market conditions and what competitors are doing.
• 4. Controlling online reputation: Feedback on what people are saying about your company helps monitor and improve the online presence of the business.
• 5. Boosting customer acquisition and retention: Big data analytics is used to observe various customer-related patterns and trends.
• 6. Solving advertisers' problems and offering marketing insights: Big data helps anticipate customer expectations and adapt the company's product line.
• 7. Driving innovation and product development: Big data helps companies innovate and redevelop their products.
Sources of Big Data
These data come from many sources:
Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
3V's of Big Data
Velocity: Data is increasing at a very fast rate; it is estimated that the volume of data doubles every two years.
Variety: Nowadays data is not stored only in rows and columns; it is both structured and unstructured. Log files and CCTV footage are unstructured data, while data that can be saved in tables, such as a bank's transaction data, is structured.
Volume: The amount of data we deal with is very large, on the order of petabytes.

1 petabyte = 1024 TB = 1,048,576 GB; at roughly 1.5 GB per HD movie, that is enough to store about 7 lakh (700,000) HD movies.

According to a 2021 analysis, 300 hours of YouTube video are uploaded every hour, and about 500 million tweets are posted on Twitter per day. Nowadays, data arrives in many different formats: audio, video, images and text.
Issues
Huge amounts of unstructured data need to be stored, processed and analyzed.

Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write-once, read-many-times principle (see the sketch after this list).
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the required output.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
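
To make the write-once, read-many-times idea concrete, here is a minimal Java sketch using the standard HDFS client API (org.apache.hadoop.fs.FileSystem). The path /demo/hello.txt is a hypothetical example, and the NameNode address is assumed to be picked up from the cluster's core-site.xml; treat this as an illustration, not reference code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path

    // Write once: create the file and write its contents.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times: open the file and read it back.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}

The write-once, read-many principle means a file is created once and then read repeatedly; HDFS does not support in-place edits of existing file contents.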
Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
YARN: Yet Another Resource Negotiator, used for job scheduling and cluster management.
MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal example follows this list).
Hadoop Common: Java libraries used to start Hadoop and used by the other Hadoop modules.
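
As a concrete illustration of the Map and Reduce tasks described above, here is the classic WordCount example in Java, essentially as it appears in the Apache Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: sum the counts for each word key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A typical invocation (jar name and HDFS paths are hypothetical): hadoop jar wordcount.jar WordCount /input /output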
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System (GFS) paper published by Google.
Advantages of Hadoop
Fast: In HDFS, data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data often run on the same servers, reducing processing time. Hadoop can process terabytes of data in minutes and petabytes in hours.

Scalable: A Hadoop cluster can be extended simply by adding nodes to the cluster.

Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is very cost-effective compared with a traditional relational database management system.

Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop uses another copy of the data. Normally data is replicated three times, but the replication factor is configurable (a configuration sketch follows).
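
As a sketch of how the replication factor can be configured: the cluster-wide default comes from the dfs.replication property in hdfs-site.xml (normally 3), and a client can also override it per file. The file path below is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created by this client; the
    // cluster-wide default is dfs.replication in hdfs-site.xml.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Raise an existing file's replication factor to 4 copies.
    fs.setReplication(new Path("/demo/hello.txt"), (short) 4); // hypothetical path
  }
}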
Hadoop Architecture:
Hadoop Distributed File System
This architecture consists of a single NameNode, which performs the role of master, and multiple DataNodes, which perform the role of slaves.

Both the NameNode and the DataNodes can run on commodity machines. HDFS is developed in Java, so any machine that supports Java can run the NameNode and DataNode software.
NameNode and DataNode
•NameNode
o A single master server in the HDFS cluster.
o Because it is a single node, it can become a single point of failure.
o It manages the file system namespace, executing operations such as opening, renaming and closing files.
o This single-master design simplifies the architecture of the system.
•DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode holds multiple data blocks, which are used to store the data.
o DataNodes serve read and write requests from the file system's clients.
o They perform block creation, deletion, and replication upon instruction from the NameNode.
•Job Tracker
o The Job Tracker accepts MapReduce jobs from clients and processes the data, using the NameNode to locate it.
o In response, the NameNode provides the metadata to the Job Tracker.
•Task Tracker
o Works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file; this process can also be called a Mapper.
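
The division of labour between the NameNode (metadata) and the DataNodes (blocks) shows up directly in the client API. The following Java sketch asks where a file's blocks live: the answer comes from the NameNode's metadata, and the hosts listed are the DataNodes holding replicas of each block. The file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt")); // hypothetical path

    // The NameNode answers this metadata query: which blocks make up
    // the file, and which DataNodes hold a replica of each block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
  }
}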
Big Data is different from traditional databases
• While traditional data is based on a centralized database architecture, big data uses a distributed architecture: computation is distributed among several computers in a network. This makes big data far more scalable than traditional data, in addition to delivering better performance and cost benefits.
• In 2014, on Data Science Central, Kirk Borne defined big data in terms of 10 Vs: Volume, Variety, Velocity, Veracity, Validity, Value, Variability, Venue, Vocabulary, and Vagueness.
Year-wise events in the history of Hadoop:

2003: Google released the Google File System (GFS) paper.
2004: Google released a white paper on MapReduce.
2006: Hadoop introduced; Hadoop 0.1.0 released; Yahoo deploys 300 machines and reaches 600 machines within the year.
2007: Yahoo runs two clusters of 1,000 machines; Hadoop includes HBase.
2008: YARN JIRA opened; Hadoop becomes the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds; Yahoo clusters are loaded with 10 terabytes per day; Cloudera is founded as a Hadoop distributor.
2009: Yahoo runs 17 clusters with 24,000 machines; Hadoop becomes capable of sorting a petabyte; MapReduce and HDFS become separate subprojects.
2010: Hadoop adds support for Kerberos; Hadoop operates 4,000 nodes with 40 petabytes; Apache Hive and Pig released.
2011: Apache ZooKeeper released; Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012: Apache Hadoop 1.0 released.
2013: Apache Hadoop 2.2 released.
2014: Apache Hadoop 2.6 released.
2015: Apache Hadoop 2.7 released.
2017: Apache Hadoop 3.0 released.
2018: Apache Hadoop 3.1 released.

Comparison of tools used for analysing big data
Applications:
1. Transportation - Congestion management and traffic control, route planning, traffic safety
2. Advertising and Marketing
3. Banking and Financial Services - Fraud detection, risk management, customer relationship optimization, personalized marketing
4. Government
5. Media and Entertainment
6. Meteorology - Studying natural disaster patterns, preparing weather forecasts, understanding the impact of global warming, providing early warning of impending crises such as hurricanes and tsunamis
7. Healthcare
8. Cybersecurity
9. Education - Customizing curricula, reducing dropout rates, improving student outcomes, targeted international recruiting
The Hadoop community package consists of:
• File system and OS level abstractions
• A MapReduce engine (either MapReduce or YARN)
• The Hadoop Distributed File System (HDFS)
• Java ARchive (JAR) files
• Scripts needed to start Hadoop
• Source code, documentation and a contribution section
