CS8091 BDA Unit I Lecture Notes
UNIT – I
CS8091 Big Data Analytics Lecture 1:
Overview of Big Data Analytics
Definition and Characteristics of Big Data
(i) Volume
❑ The name 'Big Data' itself refers to data of enormous size.
❑ The size of the data plays a crucial role in determining the value that can be derived from it.
❑ Whether a particular dataset can actually be considered Big Data or not depends on its volume.
❑ Hence, 'Volume' is one characteristic that needs to be considered while dealing with 'Big Data'.
CHARACTERISTICS OF BIG DATA:
(ii) Variety
❑ Variety refers to the heterogeneous sources and nature of data, both structured and unstructured.
❑ In earlier days, spreadsheets and databases were the only sources of data.
❑ Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
❑ This variety of unstructured data poses certain issues for storing, mining and analyzing data.
CHARACTERISTICS OF BIG DATA:
(iii) Velocity
❑ Velocity refers to the speed at which data is generated, collected and processed.
❑ Big Data flows in continuously from sources such as machines, networks, social media and mobile devices, and often has to be handled in near real time.
(iv) Variability
❑ This refers to the inconsistency that data can exhibit at times, which hampers the ability to handle and manage the data effectively.
❑ Variability refers to data whose meaning is constantly changing. Organizations often need to develop sophisticated programs in order to understand the context of such data and decode its exact meaning.
(v) Veracity
❑ It refers to the quality of the data that is being analyzed.
❑ High veracity data has many records that are valuable to analyze and that
contribute in a meaningful way to the overall results.
❑ Low veracity data, on the other hand, contains a high percentage of
meaningless data.
UNDERSTANDING THE BUSINESS DRIVERS
Business drivers are about agility in the utilization and analysis of collections of datasets and data streams to create value:
• increase revenues,
• decrease costs,
• improve the customer experience,
• reduce risks, and
• increase productivity.
UNDERSTANDING THE BUSINESS DRIVERS
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
Figure: Big Data – 5V Classification / Characteristics (Source: IDC)
THE PROMOTION OF THE VALUE OF BIG DATA
The same types of benefits that business intelligence and data warehouse tool vendors and system integrators have promoted for the past 15-20 years.
Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
The general architecture distinguishes the management of computing resources and the
management of the data across the network of storage nodes.
Hadoop Architecture
IBM Watson
In 2011, IBM’s computer system Watson participated in the U.S. television game
show Jeopardy against two of the best Jeopardy champions in the show’s history. In the
game, the contestants are provided a clue such as “He likes his martinis shaken, not stirred”
and the correct response, phrased in the form of a question, would be, “Who is James
Bond?”
Over the three-day tournament, Watson was able to defeat the two human
contestants. To educate Watson, Hadoop was utilized to process various data sources such
as encyclopedias, dictionaries, news wire feeds, literature, and the entire contents of
Wikipedia. For each clue provided during the game,
Watson had to perform the following tasks in less than three seconds.
LinkedIn
LinkedIn is an online professional network of 250 million users in 200 countries as
of early 2014 [5]. LinkedIn provides several free and subscription-based services, such as
company information pages, job postings, talent searches, social graphs of one’s contacts,
personally tailored news feeds, and access to discussion groups, including a Hadoop users
group.
LinkedIn utilizes Hadoop for the following purposes:
❑ Process daily production database transaction logs
❑ Examine the users’ activities such as views and clicks
❑ Feed the extracted data back to the production systems
❑ Restructure the data to add to an analytical database
❑ Develop and test analytical models
Use Cases of Hadoop
Yahoo!
As of 2012, Yahoo! had one of the largest publicly announced Hadoop deployments, at 42,000 nodes across several clusters utilizing 350 petabytes of raw storage. Yahoo!'s Hadoop applications include the following:
❑ Search index creation and maintenance
❑ Web page content optimization
❑ Web ad placement optimization
❑ Spam filters
❑ Ad-hoc analysis and analytic model development
Prior to deploying Hadoop, it took 26 days to process three years’ worth of log
data. With Hadoop, the processing time was reduced to 20 minutes.
History behind Hadoop
Hadoop is an open source project managed and licensed by the Apache Software Foundation.
The origins of Hadoop began as a search engine called Nutch, developed by Doug
Cutting and Mike Cafarella. Based on two Google papers [9] [12], versions of MapReduce
and the Google File System were added to Nutch in 2004. In 2006, Yahoo! hired Cutting,
who helped to develop Hadoop based on the code in Nutch [13]. The name “Hadoop” came
from the name of Cutting's child's stuffed toy elephant, which also inspired the well-recognized
symbol for the Hadoop project.
Overview of High Performance Architecture – Apache Hadoop Framework
Apart from HDFS and MapReduce, the other components of the Hadoop ecosystem
are shown below.
Overview of High Performance Architecture – Apache Hadoop Framework
HBase
Apache HBase is a distributed, column-oriented NoSQL database that runs on top of
HDFS and provides random, real-time read/write access to large datasets.
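As a brief illustration that is not part of the original notes, the following Java sketch uses the HBase client API to write and read a single cell; the table name "users", the column family "info" and the row key are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1001", column family "info", column "email"
            Put put = new Put(Bytes.toBytes("user1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Random, real-time read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user1001")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}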
PIG
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze
large data sets by representing them as data flows. Pig is generally used with
Hadoop; we can perform all of the data manipulation operations in Hadoop using Pig.
Hive
Apache Hive is an open source data warehouse system for querying and
analyzing large datasets stored in Hadoop files.
Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL
automatically translates SQL-like queries into MapReduce jobs which execute
on Hadoop.
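To give a feel for HiveQL, here is a minimal Java sketch (added to these notes as an illustration) that submits a SQL-like query through Hive's JDBC interface; the HiveServer2 address, the credentials and the "weblogs" table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLExample {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver; HiveServer2 is assumed to run on localhost:10000
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");

        // A SQL-like HiveQL query; Hive translates it into MapReduce jobs.
        // The 'weblogs' table is a hypothetical example table.
        String hql = "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page";

        try (Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(hql)) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
        con.close();
    }
}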
Overview of High Performance Architecture – Apache Hadoop Framework
Apache Mahout
Mahout is an open source framework for creating scalable machine learning
algorithms and a data mining library. Once data is stored in HDFS,
Mahout provides data science tools to automatically find meaningful
patterns in those big data sets.
Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem
components such as HDFS, HBase or Hive. It also exports data from Hadoop to
other external sources. Sqoop works with relational databases such as
Teradata, Netezza, Oracle and MySQL.
Overview of High Performance Architecture – Apache Hadoop Framework
Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem
component for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. Zookeeper manages
and coordinates a large cluster of machines.
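As a small illustrative sketch (not taken from these notes), the Java snippet below stores a piece of shared configuration in a znode and reads it back using the standard ZooKeeper client API; the ensemble address localhost:2181, the znode name and the payload are assumptions.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (a single local server is assumed here).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session is established

        // Store a small piece of shared configuration in a znode.
        zk.create("/demo-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any machine in the cluster can now read the same configuration.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}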
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie
combines multiple jobs sequentially into one logical unit of work. The Oozie
framework is fully integrated with the Apache Hadoop stack.
Oozie is scalable and can manage the timely execution of thousands of workflows
in a Hadoop cluster. Oozie is also very flexible: one can easily start,
stop, suspend and rerun jobs.
Overview of High Performance Architecture – Apache Hadoop Framework
HDFS Architecture
The name node maintains metadata about each file (i.e., the number of data blocks,
file name, path, block IDs, block locations, number of replicas, and also
slave-related configuration), as well as the history of changes to file metadata.
This metadata is held in memory on the master for faster retrieval.
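As an illustration added to these notes (not HDFS source code), the Java sketch below uses the standard Hadoop FileSystem API to ask the name node for exactly this kind of metadata; the file path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS

        Path file = new Path("/data/sample.txt");   // hypothetical HDFS file
        FileStatus status = fs.getFileStatus(file); // metadata served by the name node

        System.out.println("Length (bytes): " + status.getLen());
        System.out.println("Block size    : " + status.getBlockSize());
        System.out.println("Replication   : " + status.getReplication());

        // Block locations: which data nodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset()
                    + " on hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}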
Overview of High Performance Architecture – Apache Hadoop Framework
Data Replication
⮚ For a given file, HDFS breaks the file, say, into 64 MB blocks and
stores the blocks across the cluster. So, if a file size is 300 MB, the file
is stored in five blocks: four 64 MB blocks and one 44 MB block. If a
file size is smaller than 64 MB, the block is assigned the size of the
file (see the sketch after this list).
⮚ By default, HDFS creates three copies of each block across the cluster
to provide the necessary redundancy in case of a failure.
⮚ If a machine fails, HDFS replicates an accessible copy of the relevant
data blocks to another available machine. HDFS is also rack aware,
which means that it distributes the blocks across several equipment
racks to prevent an entire rack failure from making the data unavailable.
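To make the block arithmetic above concrete, here is a small self-contained Java sketch (an illustration added to these notes) that computes how the 300 MB example file is split into 64 MB blocks and how many replicas the cluster stores.

public class BlockSplitExample {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 64L * 1024 * 1024;   // 64 MB block size, as assumed in the notes
        long fileSize = 300L * 1024 * 1024;          // a 300 MB file

        long fullBlocks = fileSize / BLOCK_SIZE;     // 4 full blocks of 64 MB
        long remainder  = fileSize % BLOCK_SIZE;     // 1 final block of 44 MB

        System.out.println("Full 64 MB blocks : " + fullBlocks);
        if (remainder > 0) {
            System.out.println("Last block (MB)   : " + remainder / (1024 * 1024));
        }
        // With the default replication factor of 3, the cluster stores
        // (4 + 1) * 3 = 15 block replicas for this file.
    }
}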
Overview of High Performance Architecture – Apache Hadoop Framework
MapReduce
• A software framework for easily writing applications that
process large amounts of data in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner.
• The MapReduce execution environment employs a master/slave
execution model, in which one master node (called the
JobTracker) manages a pool of slave computing resources (called
TaskTrackers) that are called upon to do the actual work.
• The term MapReduce actually refers to two different tasks:
✔ the Map task - the first task, which takes input data and converts
it into a set of data in which individual elements are broken down
into tuples (key/value pairs)
✔ the Reduce task - this task takes the output from a Map task as
input and combines those tuples to provide the output.
Overview of High Performance Architecture – Apache Hadoop Framework
MapReduce
• the master is responsible for:
✔ resource management
✔ tracking resource consumption/availability
✔ scheduling the job's component tasks on the slaves
✔ monitoring the slaves and re-executing the failed tasks
• the slave TaskTracker:
✔ executes the tasks as directed by the master
✔ provides task-status information to the master periodically
Overview of High Performance Architecture – Apache Hadoop Framework
MapReduce, which can be used to develop applications that read, analyze, transform,
and share massive amounts of data, is not a database system but rather a
programming model introduced and described by Google researchers for parallel,
distributed computation involving massive datasets (ranging from hundreds of
terabytes to petabytes).
MapReduce depends on two basic operations that are applied to sets or lists
of data key/value pairs:
Map, in which a computation or analysis is applied to each input key/value pair
to produce a set of intermediate key/value pairs; and
Reduce, in which the sets of values associated with the intermediate key/value
pairs output by the Map operation are combined to provide the results.
Overview of High Performance Architecture – Apache Hadoop Framework
The data items are indexed using a defined key into (key, value) pairs, in which the
key represents some grouping criterion associated with a computed value. With
some applications applied to massive datasets, the theory is that the computations
applied during the Map phase to each input key/value pair are independent of
one another. Figure 1.5 shows how Map and Reduce work.
Overview of High Performance Architecture – Apache Hadoop Framework
MapReduce
The MapReduce paradigm provides the means to break a large task
into smaller tasks, run the tasks in parallel, and consolidate the
outputs of the individual tasks into the final output. As its name
implies, MapReduce consists of two basic parts
▪ a map step and
▪ a reduce step
Map:
Applies an operation to a piece of data
Provides some intermediate output
Reduce:
Consolidates the intermediate outputs from the map steps
Overview of High Performance Architecture – Apache Hadoop Framework
Each step uses key/value pairs, denoted as <key, value>, as input and
output.
For example, the key could be a filename, and the value could be the
entire contents of the file.
In general, each reducer processes the values for each key and
emits a key/value pair as defined by the reduce logic.
The output is then stored in HDFS like any other file in, say, 64
MB blocks replicated three times across the nodes.
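To make the <key, value> flow concrete, here is a minimal word count sketch written against the standard Hadoop MapReduce Java API (an illustration added to these notes, not an example taken from them); the class names and input/output paths are chosen for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: for each input line, emit <word, 1> intermediate pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate <key, value> pair
                }
            }
        }
    }

    // Reduce step: consolidate all counts for a word into a single total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // final <word, count> pair
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job would typically be submitted with a command along the lines of hadoop jar wordcount.jar WordCount <input dir> <output dir>, and the output would be written back to HDFS as described above.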
Advantages of Hadoop
⮚ allows the user to quickly write and test distributed systems
⮚ efficient
⮚ automatic data and work distribution across the machines
⮚ easy utilization of the underlying parallelism of the CPU cores
⮚ Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA)
⮚ Hadoop library itself has been designed to detect and handle
failures at the application layer
⮚ servers can be added or removed from the cluster dynamically -
Hadoop continues to operate without interruption
⮚ open source
⮚ compatible across all platforms since it is Java-based
Overview of High Performance Architecture – Apache Hadoop Framework
Advantages of YARN