
KCS 061 Big Data Notes

Date: 05 April 2021 UNIT 1

Introduction to Big Data:-


 "Fourth Industrial Revolution" by the World Economic Forum (WEF) in 2016.
 The Fourth Industrial Revolution is marked through the emergence of "cyber-physical
systems" where software interfaces seamlessly over networks with physical systems, such as
sensors, smartphones, vehicles, power grids or buildings, to create a new world of Internet of
Things (IoT).
 Data and information are fuel of this new age where powerful analytics algorithms burn this
fuel to generate decisions that are expected to create a smarter and more efficient world for
all of us to live in.
 This new area of technology has been defined as Big Data Science and Analytics, and the
industrial and academic communities are realizing this as a competitive technology that can
generate significant new wealth and opportunity.
 Examples:- Discovering consumer shopping habits, Personalized marketing, Fuel optimization
tools for the transportation industry, Monitoring health conditions through data from
wearables, Live road mapping for autonomous vehicles, Streamlined media streaming,
Predictive inventory ordering, Personalized health plans for cancer patients, Real-time data
monitoring and cybersecurity protocols.
 Big data is defined as collections of datasets whose volume, velocity or variety is so large that
it is difficult to store, manage, process and analyze the data using traditional databases and
data processing tools.
 In recent years, there has been exponential growth in both structured and unstructured data generated by information technology, industrial, healthcare, retail, web, and other systems.
 Structured data examples:- spreadsheets (Google Sheets, Microsoft Excel), data in Database Management Systems (DBMS) and SQL tables, customer data, phone records, transaction history.
 Unstructured data examples:- email, text files, social media posts and comments, phone call transcriptions, various log files, images, audio, video, sensor data, BLOBs (Binary Large OBjects).
 Big data science and analytics deals with collection, storage, processing and analysis of
massive-scale data on cloud-based computing systems.
 Industry surveys, by Gartner and e-Skills for instance, predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a 150 percent year-over-year rate.

Big Data Analytics:-


 Analytics is a broad term that encompasses the processes, technologies, frameworks and algorithms
to extract meaningful insights from data.
 Analytics is the process of extracting and creating information from raw data by filtering, processing, categorizing, condensing and contextualizing the data.
 Examples:-
o To predict something (for example, whether a transaction is fraudulent, whether it will rain on a particular day, or whether a tumor is benign or malignant).
o To find patterns in the data (for example, finding the top 10 coldest days in the year, the pages visited most on a particular website, or the most searched celebrity in a particular year).
o To find relationships in the data (for example, finding similar news articles, similar patients in an electronic health record system, related products on an eCommerce website, similar images, or correlations between news items and stock prices).

Analytics types:-
 Descriptive Analytics:- aims to answer - What has happened?
o For example, computing the total number of likes for a particular post, computing the average monthly rainfall, or finding the average number of visitors per month on a website. What was the sales volume over the past 12 months? What is the number of support calls received, categorized by severity and geographic location? What is the monthly commission earned by each sales agent?
o Analyses past data to describe patterns in the data and present it in a summarized form, using statistical functions such as counts, maximum, minimum, mean, top-N and percentages (a small illustrative sketch follows this list).
 Diagnostic analytics: - Aims to answer - Why did it happen?
o Analysis of past data to diagnose the reasons as to why certain events happened.
o Example: - a system that collects and analyses sensor data from machines for monitoring their
health and predicting failures. Why were Q2 sales less than Q1 sales? Why have there been
more support calls originating from the Eastern region than from the Western region? Why
was there an increase in patient re-admission rates over the past three months?
o Descriptive analytics can be useful for summarizing the data by computing various statistics, while diagnostic analytics can provide more insight into why a certain fault has occurred, based on the patterns in the sensor data for previous faults.
 Predictive analytics: - Aims to answer - What is likely to happen?
o Predicting the occurrence of an event or the likely outcome of an event or forecasting the
future values using prediction models.
o Predictive analytics is done using predictive models that are trained on existing data. These models learn patterns and trends from the existing data and predict the occurrence of an event or its likely outcome (classification models), or forecast numbers (regression models).
 Prescriptive Analytics: - Aims to answer - What can we do to make it happen?
o Prescriptive analytics uses multiple prediction models to predict various outcomes and the best course of action for each outcome.
o Prescribes actions or the best option to follow from the available options; for example, suggesting the best mobile data plan for a customer based on the customer's browsing patterns.
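
As a small illustration of the descriptive statistics named above (count, minimum, maximum, mean), here is a minimal, self-contained Java sketch; the class name and the monthly visitor numbers are made up for illustration:

import java.util.IntSummaryStatistics;
import java.util.List;

public class DescriptiveStats {
    public static void main(String[] args) {
        // Monthly visitor counts for a website (illustrative numbers).
        List<Integer> monthlyVisitors = List.of(1200, 980, 1430, 1100, 1675, 1320);

        // summaryStatistics() computes count, min, max and mean in one pass.
        IntSummaryStatistics stats = monthlyVisitors.stream()
                .mapToInt(Integer::intValue)
                .summaryStatistics();

        System.out.println("count = " + stats.getCount());
        System.out.println("min   = " + stats.getMin());
        System.out.println("max   = " + stats.getMax());
        System.out.println("mean  = " + stats.getAverage());
    }
}
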
Characteristics of Big Data:-

 Volume:- Big data is a form of data whose volume is so large that it would not fit on a single machine; therefore, specialized tools and frameworks are required to store, process and analyse such data. There is no fixed threshold for the volume of data to be considered big data; however, typically, the term big data is used for massive-scale data that is difficult to store, manage and process using traditional databases and data processing architectures.
 Velocity of data refers to how fast the data is generated. Data generated by certain sources can arrive
at very high velocities, for example, social media data or sensor data. Specialized tools are required to
ingest such high velocity data into the big data infrastructure and analyse the data in real-time.
 Variety refers to the forms of the data. Big data comes in different forms such as structured, unstructured or semi-structured, including text, image, audio, video and sensor data.
 Veracity refers to how accurate the data is. To extract value from the data, the data needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate.
 Value of data refers to the usefulness of data for the intended purpose. The value of the data is also
related to the veracity or accuracy of the data. For some applications value also depends on how fast
we are able to process the data.

Figure 1-1: IBM characterizes Big Data by its volume, velocity, and variety, or simply, V3. (Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Paul C. Zikopoulos et al.)
Figure: Big Data Analytics Flow (Big Data Science & Analytics: A Hands-On Approach, 1st ed.)

Big Data: Concepts and Terminology


Several fundamental concepts and terms need to be defined and understood.

Datasets
 Collections or groups of related data are generally referred to as datasets. Each group or dataset
member (datum) shares the same set of attributes or properties as others in the same dataset.
 Examples:- tweets stored in a flat file, a collection of image files in a directory, an extract of rows from
a database table stored in a CSV formatted file, historical weather observations that are stored as XML
files.
Data Analysis
 Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or
trends.
 The overall goal of data analysis is to support better decision making.
 Example - analysis of ice cream sales data in order to determine how the number of ice cream cones
sold is related to the daily temperature. The results of such an analysis would support decisions related
to how much ice cream a store should order in relation to weather forecast information.

Data Analytics
 Data analytics is a discipline that includes the management of the complete data lifecycle, which
encompasses collecting, cleansing, organizing, storing, analyzing and governing data.
 Different kinds of organizations use data analytics tools and techniques in different ways.
o In business-oriented environments, data analytics results can lower operational costs and
facilitate strategic decision-making.
o In the scientific domain, data analytics can help identify the cause of a phenomenon to
improve the accuracy of predictions.
o In service-based environments like public sector organizations, data analytics can help
strengthen the focus on delivering high-quality services by driving down costs.

Drivers for Big Data


 Business motivations and drivers behind the adoption of Big Data solutions and technologies
o marketplace dynamics, appreciation and formalism of Business Architecture (BA), Business
Process Management (BPM), innovation in Information and Communications Technology
(ICT), Internet of Everything (IoE)
o Affordable Technology and Commodity Hardware
o Hyper-Connected Communities and Devices

Big Data Architecture

 Big data architecture refers to the logical and physical structure that dictates how high volumes of
data are ingested, processed, stored, managed, and accessed.
 Big data architecture is the foundation for big data analytics.
 The big data architecture framework serves as a reference blueprint for big data infrastructures and
solutions, logically defining how big data solutions will work, the components that will be used, how
information will flow, and security details.

Big Data Architecture Layers


 Big Data Sources Layer: a big data environment can manage both batch processing and real-time
processing of big data sources, such as data warehouses, relational database management systems,
SaaS applications, and IoT devices.
 Management & Storage Layer: receives data from the source, converts the data into a format
comprehensible for the data analytics tool, and stores the data according to its format.
 Analysis Layer: analytics tools extract business intelligence from the big data storage layer.
 Consumption Layer: receives results from the big data analysis layer and presents them to the
pertinent output layer - also known as the business intelligence layer.

Big Data Architecture Processes


 Connecting to Data Sources: connectors and adapters are capable of efficiently connecting any format
of data and can connect to a variety of different storage systems, protocols, and networks.
 Data Governance: includes provisions for privacy and security, operating from the moment of
ingestion through processing, analysis, storage, and deletion.
 Systems Management: highly scalable, large-scale distributed clusters are typically the foundation for
modern big data architectures, which must be monitored continually via central management
consoles.
 Protecting Quality of Service: the Quality of Service framework supports the defining of data quality,
compliance policies, and ingestion frequency and sizes.

Big Data Architecture Best Practices


 Understand how the data will be used and how it will bring value to the business.
 Apply big data architecture principles when defining the big data architecture strategy:
o Preliminary Step: A big data project should be in line with the business vision and have a good
understanding of the organizational context, the key drivers of the organization, data
architecture work requirements, architecture principles and framework to be used, and the
maturity of the enterprise architecture. It is also important to have a thorough understanding
of the elements of the current business technology landscape, such as business strategies and
organizational models, business principles and goals, current frameworks in use, governance
and legal frameworks, IT strategy, and any pre-existing architecture frameworks and
repositories.
o Data Sources: Before any big data solution architecture is coded, data sources should be
identified and categorized so that big data architects can effectively normalize the data to a
common format. Data sources can be categorized as either structured data, which is typically
formatted using predefined database techniques, or unstructured data, which does not follow
a consistent format, such as emails, images, and Internet data.
o Big Data ETL: Data should be consolidated into a single Master Data Management system for
querying on demand, either via batch processing or stream processing. For processing,
Hadoop has been a popular batch processing framework. For querying, the Master Data
Management system can be stored in a data repository such as NoSQL-based or relational
DBMS.
o Data Services API: When choosing a database solution, consider whether or not there is a
standard query language, how to connect to the database, the ability of the database to scale
as data grows, and which security mechanisms are in place.
o User Interface Service: a big data application architecture should have an intuitive design that
is customizable, available through current dashboards in use, and accessible in the cloud.
Standards like Web Services for Remote Portlets (WSRP) facilitate the serving of User
Interfaces through Web Service calls.

Building a Big Data Architecture


 Analyze the Problem:
o First determine if the business does in fact have a big data problem,
o Taking into consideration criteria such as data variety, velocity, and challenges with the
current system.
o Common use cases include data archival, process offload, data lake implementation,
unstructured data processing, and data warehouse modernization.
 Select a Vendor:
o Hadoop is one of the most widely recognized big data architecture tools for managing big data
end to end architecture.
o Popular vendors for Hadoop distributions include Amazon Web Services, IBM (BigInsights), Cloudera, Hortonworks, MapR, and Microsoft.
 Deployment Strategy:
o Deployment can be either on-premises, which tends to be more secure;
o cloud-based, which is cost effective and provides flexibility regarding scalability;
o or a mixed (hybrid) strategy combining both.
 Capacity Planning:
o When planning hardware and infrastructure sizing, consider daily data ingestion volume, data
volume for one-time historical load, the data retention period, multi-data center deployment,
and the time period for which the cluster is sized
 Infrastructure Sizing:
o Based on capacity planning.
o Determines the number of clusters/environment required and the type of hardware required.
o Consider the type of disk and number of disks per machine,
o the types of processing memory and memory size,
o number of CPUs and cores,
o and the data retained and stored in each environment.
 Plan a Disaster Recovery:
o The criticality of data stored,
o The Recovery Point Objective and Recovery Time Objective requirements,
o Backup interval,
o Multi datacenter deployment,
o and whether Active-Active or Active-Passive disaster recovery is most appropriate.

Big Data Analytics Tools


 Apache Hadoop, CDH (Cloudera Distribution for Hadoop), Cassandra, KNIME, Datawrapper, MongoDB, Storm, Apache SAMOA, RapidMiner, Tableau, R Programming.
 https://www.softwaretestinghelp.com/big-data-tools/

References:-
 https://builtin.com/big-data/big-data-examples-applications
 Arshdeep Bahga and Vijay Madisetti. 2016. Big Data Science & Analytics: A Hands-On Approach (1st ed.). VPT.
 Big Data Fundamentals: Concepts, Drivers & Techniques (The Pearson Service Technology Series from Thomas Erl), by Thomas Erl, Wajid Khattak and Paul Buhler.
 https://www.omnisci.com/technical-glossary/big-data-architecture
 Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis.

Date: 09 Apr 2021 UNIT II

HADOOP
 Hadoop (http://hadoop.apache.org/) is a top-level Apache project in the Apache Software Foundation that's written in Java.
 Hadoop is a computing environment built on top of a distributed clustered file system that was
designed specifically for very large-scale data operations.
 Hadoop was inspired by Google’s work on its Google (distributed) File System (GFS).
 Hadoop uses the MapReduce programming paradigm, in which work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers for massive parallelism.
 Hadoop is designed to scan through large data sets to produce its results through a highly scalable,
distributed batch processing system.
 Hadoop is actually the name that creator Doug Cutting’s son gave to his stuffed toy elephant. In
thinking up a name for his project, Cutting was apparently looking for something that was easy to say
and stands for nothing in particular, so the name of his son’s toy seemed to make perfect sense.
 Hadoop is generally seen as having two parts:
o a file system (the Hadoop Distributed File System)
o and a programming paradigm (MapReduce)
 One of the key components of Hadoop is the redundancy built into the environment.
o data redundantly stored in multiple places across the cluster
o programming model - failures are expected and resolved automatically by running portions of
the program on various servers in the cluster.
o It is well known that commodity hardware components will fail (especially when you have
very large numbers of them), but this redundancy provides fault tolerance and a capability for
the Hadoop cluster to heal itself.
o scale out workloads across large clusters of inexpensive machines to work on Big Data
problems.
 Hadoop-related projects
o Apache Avro (for data serialization),
o Cassandra and HBase (databases),
o Chukwa (a monitoring system specifically designed with large distributed systems in mind),
o Hive (provides ad hoc SQL-like queries for data aggregation and summarization),
o Mahout (a machine learning library),
o Pig (a high-level Hadoop programming language that provides a data-flow language and
execution framework for parallel computation),
o ZooKeeper (provides coordination services for distributed applications),
o and more.

Components of Hadoop
 The Hadoop project comprises three pieces:
o Hadoop Distributed File System (HDFS),
o Hadoop MapReduce model,
o and Hadoop Common.
 Hadoop Distributed File System
o Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed
throughout the cluster. Copies of these blocks are stored on other servers in the Hadoop
cluster.
o an individual file is actually stored as smaller blocks that are replicated across multiple servers
in the entire cluster.
o the map and reduce functions can be executed on smaller subsets of your larger data sets,
and this provides the scalability that is needed for Big Data processing.
o use commonly available servers in a very large cluster, where each server has a set of
inexpensive internal disk drives.
o MapReduce tries to assign workloads to these servers where the data to be processed is
stored. (Data Locality)

Figure:- example of how data blocks are written to HDFS. Notice how (by default) each block is written three times and at least one block
is written to a different server rack for redundancy. (Understanding Big Data Analytics, Paul C. Zikopoulos)

 Hadoop Distributed File System


o Think of a file that contains the phone numbers for everyone. The people with a last name
starting with A might be stored on server 1, B on server 2, and so on.
o In a Hadoop world, pieces of this phone book would be stored across the cluster, and to reconstruct the entire phone book, your program would need the blocks from every server in the cluster.
o HDFS replicates these smaller pieces onto two additional servers by default.
o A data file in HDFS is divided into blocks; the default block size was 64 MB in Apache Hadoop 1.x (it is 128 MB in Hadoop 2.x and later).
o All of Hadoop’s data placement logic is managed by a special server called NameNode.
o This NameNode server keeps track of all the data files in HDFS, such as where the blocks are
stored, and more.
o All of the NameNode’s information is stored in memory, which allows it to provide quick
response times to storage manipulation or read requests.
o Interaction with HDFS
 write your own Java applications that use the HDFS file system API (a minimal sketch follows below).
 use the HDFS shell commands to manage and manipulate files in the file system.
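
A minimal, hedged sketch of programmatic interaction with HDFS using the org.apache.hadoop.fs.FileSystem API (the class name and the /user/demo/hello.txt path are illustrative; the cluster address is taken from core-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from the Hadoop configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // illustrative path

        // Write a small file; HDFS transparently splits it into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Read it back through the same API.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}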

MapReduce
 MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
 The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs
perform.
o map job
o reduce job
 The first is the map job, which takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
 The reduce job takes the output from a map as input and combines those data tuples into a smaller
set of tuples.
 As the sequence of the name MapReduce implies, the reduce job is always performed after the map
job.
 Example
o Consider five files, where each file contains two columns (a key and a value in Hadoop terms).
o Here the key represents a city and the value represents the temperature recorded in that city on various measurement days.
o The task: find the maximum temperature for each city across all of the data files.

o The following snippet shows a sample of the data from one of the test files:

Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18

o The MapReduce framework can break this down into five map tasks, where each mapper works on one of the five files; each mapper task goes through the data and returns the maximum temperature for each city.

(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)

o The other four mapper tasks (working on the four files not shown here) produce the following intermediate results:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)


(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

o All five of these output streams would be fed into the reduce tasks, which combine the input
results and output a single value for each city, producing a final result set as follows:

(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)


Figure:- The flow of data in a simple MapReduce job. (Understanding Big Data Analytics, Paul C. Zikopoulos)
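
A hedged Java sketch of how the mapper and reducer for this temperature example might look (the class names and the comma-separated input format are illustrative assumptions, and each class would normally live in its own source file; the full word-count code given later in these notes follows the same structure):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (city, temperature) for every "City, temperature" line in its input split.
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        if (parts.length == 2) {
            context.write(new Text(parts[0].trim()),
                    new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}

// Receives (city, [temperatures...]) and emits the maximum temperature per city.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}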

 In a Hadoop cluster, a MapReduce program is referred to as a job.


 A job is executed by breaking it down into smaller pieces called tasks.
 An application submits a job to a specific node in a Hadoop cluster, which is running a daemon called
the JobTracker.
 The JobTracker communicates with the NameNode to find out where all of the data required for this
job exists across the cluster, and then breaks the job down into map and reduce tasks for each node
to work on in the cluster.
 These tasks are scheduled on the nodes in the cluster where the data exists.
 In a Hadoop cluster, a set of continually running daemons, referred to as TaskTracker agents, monitor
the status of each task.
 If a task fails to complete, the status of that failure is reported back to the JobTracker, which will then
reschedule that task on another node in the cluster.
 All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive
file (jar) that’s distributed by the JobTracker to the various Hadoop cluster nodes to execute the map
and reduce tasks.

Hadoop Word-Count Example

 Mapper Code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Split each input line into tokens and emit (word, 1) for every token.
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, " ");

        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}

 Reducer Code
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Sum the counts emitted by the mappers for this word.
        int sum = 0;
        Iterator<IntWritable> valuesIt = values.iterator();

        while (valuesIt.hasNext()) {
            sum = sum + valuesIt.next().get();
        }

        context.write(key, new IntWritable(sum));
    }
}
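
 Driver Code (a minimal sketch, not part of the original notes: it wires MapClass and ReduceClass into a job; the WordCount class name and the command-line input/output paths are illustrative assumptions)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is arbitrary

        job.setJarByClass(WordCount.class);
        job.setMapperClass(MapClass.class);
        // ReduceClass can also act as a combiner because summing counts is associative.
        job.setCombinerClass(ReduceClass.class);
        job.setReducerClass(ReduceClass.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would typically be packaged into a jar and submitted with "hadoop jar wordcount.jar WordCount <input path> <output path>" (jar and path names illustrative).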

Hadoop Common Components


 The Hadoop Common Components are a set of libraries that support the various Hadoop subprojects.
 HDFS shell commands, invoked as hadoop fs -<command> (some examples):
o cat - Copies the file to standard output (stdout).
o chmod - Changes the permissions for reading and writing to a given file or set of files.
o chown - Changes the owner of a given file or set of files.
o copyFromLocal - Copies a file from the local file system into HDFS.
o copyToLocal - Copies a file from HDFS to the local file system.
o cp - Copies HDFS files from one directory to another.
o expunge - Empties all of the files that are in the trash.
o ls - Displays a listing of files in a given directory.
o mkdir - Creates a directory in HDFS.
o mv - Moves files from one directory to another.
o rm - Deletes a file and sends it to the trash.

Application Development in Hadoop
 Several application development languages have emerged that run on top of Hadoop.
o Pig and PigLatin
o Hive
o Jaql
o ZooKeeper
o HBase

Pig and PigLatin


 Pig was initially developed at Yahoo! so that people using Hadoop could focus more on analyzing large data sets and spend less time having to write mapper and reducer programs.
 Pig programming language is designed to handle any kind of data.
 Pig is made up of two components:
o the first is the language itself, which is called PigLatin.
o The second is a runtime environment where PigLatin programs are executed.

Hive
 Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent
with SQL to leverage the Hadoop platform.
 Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to
standard SQL statements.
 HQL statements are broken down by the Hive service into MapReduce jobs and executed across a
Hadoop cluster.

Jaql
 Jaql is primarily a query language for JavaScript Object Notation (JSON), developed by IBM, that allows you to process both structured and nontraditional data.
 Jaql allows you to select, join, group, and filter data that is stored in HDFS.
 Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.

ZooKeeper
 Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud
applications.
 ZooKeeper is essentially a service for distributed systems offering a hierarchical key-value store, which
is used to provide a distributed configuration service, synchronization service, and naming registry for
large distributed systems.
 ZooKeeper is an open source Apache project that provides a centralized infrastructure and services
that enable synchronization across a cluster.
 ZooKeeper maintains common objects needed in large cluster environments.
 Examples of these objects include configuration information, hierarchical naming space, and so on.
 Applications can leverage these services to coordinate distributed processing across large clusters.
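
A minimal, hedged sketch of using the ZooKeeper Java client to share a piece of configuration across a cluster (the connection string zkhost:2181, the znode path and the stored value are illustrative; connection-establishment handling is omitted for brevity):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is illustrative).
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });

        // Store a small piece of shared configuration in a znode.
        String path = "/demo-refresh-interval";
        if (zk.exists(path, false) == null) {
            zk.create(path, "60".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}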

HBase
 HBase is a column-oriented database management system that runs on top of HDFS.
 Unlike relational database systems, HBase does not support a structured query language like SQL.
 An HBase system comprises a set of tables. Each table contains rows and columns, much like a
traditional database.
 Each table must have an element defined as a Primary Key, and all access attempts to HBase tables
must use this Primary Key.
 An HBase column represents an attribute of an object; for example, if the table is storing diagnostic logs from servers in your environment, where each row might be a log record, a typical column in such a table would be the timestamp of when the log record was written, or the server name where the record originated.
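
A minimal, hedged sketch of writing and reading a row with the HBase Java client API (it assumes a table named diagnostic_logs with a column family named log already exists; the row key and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("diagnostic_logs"))) {

            // The row key plays the role of the primary key; all access goes through it.
            byte[] rowKey = Bytes.toBytes("server42-2021-04-09T10:15:00");

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("servername"), Bytes.toBytes("server42"));
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("message"), Bytes.toBytes("disk warning"));
            table.put(put);

            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("log"), Bytes.toBytes("message"))));
        }
    }
}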

Data Formats for Hadoop


 The Hadoop ecosystem is designed to process large volumes of data distributed through the
MapReduce programming model.
 Hadoop Distributed File System (HDFS) is a distributed file system designed for large-scale data
processing where scalability, flexibility and performance are critical.
 Hadoop works in a master/slave architecture to store data in HDFS and is based on the principle of storing a few very large files.
 In HDFS two services are executed: the Namenode and the Datanode.
o The Namenode manages the namespace of the file system, in addition to maintaining the file system tree and the metadata for all files and directories. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The Namenode also knows the Datanodes on which the blocks of a file are located.
o The Datanodes store and retrieve the actual data blocks when told to (by clients or by the Namenode), and periodically report back to the Namenode with the list of blocks they are storing.
 The default size of an HDFS block is 128MB.
 HDFS blocks are large in order to minimize the cost of seeks: if a block is large enough, the time to transfer the data from disk is significantly longer than the time needed to seek to the start of the block.
 Blocks fit well with replication to provide fault tolerance and availability: each block is replicated on a number of physically separate machines. For example, a 1 GB file stored with a 128 MB block size and a replication factor of 3 is split into 8 blocks and occupies 24 block replicas (3 GB of raw cluster storage).
 Hadoop allows information to be stored in any format, whether structured, semi-structured or unstructured data. In addition, it also provides support for optimized formats for storage and processing in HDFS.
 Hadoop does not have a default file format and the choice of a format depends on its use.
 The choice of an appropriate file format can produce the following benefits: optimum write time, optimum read time, file splittability, schema evolution support and compression support.
 Each format has advantages and disadvantages, and each stage of data processing will need a
different format to be more efficient.
 The objective is to choose a format that maximizes advantages and minimizes inconveniences.
 Choosing an appropriate HDFS file format to the type of work that will be done with it, can ensure
that resources will be used efficiently.
 Most common formats of the Hadoop ecosystem:
o Text/CSV
o SequenceFile
o Avro Data Files
o Parquet
o RCFile (Record Columnar File)
o ORC (Optimized Row Columnar)

 Text/CSV -
o A text file is the most basic and a human-readable file. It can be read or written in any
programming language and is mostly delimited by comma or tab.
o The text file format consumes more space when a numeric value needs to be stored as a
string. It is also difficult to represent binary data such as an image.
o A plain text file or CSV is the most common format both outside and within the Hadoop
ecosystem.
o The disadvantage in the use of this format is that it does not support block compression,
so the compression of a CSV file in Hadoop can have a high cost in reading.
o The plain text format or CSV would only be recommended in case of extractions of data
from Hadoop or a massive data load from a file.

 SequenceFile –
o The SequenceFile format stores the data in binary format.
o The sequencefile format can be used to store an image in the binary format.
o They store key-value pairs in a binary container format and are more efficient than a text
file. However, sequence files are not human- readable.
o This format accepts compression; however, it does not store metadata and the only
option in the evolution of its scheme is to add new fields at the end.
o This is usually used to store intermediate data in the input and output of MapReduce
processes.
o The SequenceFile format is recommended for storing intermediate data in MapReduce jobs (a minimal Java write example appears after this list of formats).

 Avro Data Files -


o Avro is a row-based storage format.
o This format includes, in each file, the definition of the schema of your data in JSON format, improving interoperability and allowing schema evolution.
o Avro also allows block compression in addition to being splittable, making it a good choice for most cases when using Hadoop.
o Avro is a good choice in case the data schema can evolve over time.
o The Avro file format has efficient storage due to optimized binary encoding. It is widely
supported both inside and outside the Hadoop ecosystem.
o The Avro file format is ideal for long-term storage of important data. It can read from and
write in many languages like Java, Scala and so on.
o Schema metadata can be embedded in the file to ensure that it will always be readable.
Schema evolution can accommodate changes.
o The Avro file format is considered the best choice for general-purpose storage in Hadoop.

 Parquet -
o Parquet is a column-based (columnar) binary storage format that can store nested data structures.
o This format is very efficient in terms of disk input / output operations when the necessary
columns to be used are specified.
o This format is highly optimized for use with Cloudera Impala.
o Parquet is a columnar format developed by Cloudera and Twitter.
o It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on.
o Parquet file format uses advanced optimizations described in Google’s Dremel paper.
These optimizations reduce the storage space and increase performance.
o This Parquet file format is considered the most efficient for adding multiple records at a
time. Some optimizations rely on identifying repeated patterns.

 RCFile (Record Columnar File) -


o RCFile is a columnar format that divides data into groups of rows, and inside it, data is
stored in columns.
o This format does not support schema evolution; if you want to add a new column, it is necessary to rewrite the file, slowing down the process.

 ORC (Optimized Row Columnar) -


o ORC is considered an evolution of the RCFile format and has all its benefits alongside some improvements, such as better compression, allowing faster queries.
o This format also does not support schema evolution.
o ORC is recommended when query performance is important.
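
As referenced in the SequenceFile item above, a minimal, hedged Java sketch of writing key/value pairs to a SequenceFile (it assumes a Hadoop 2.x+ client on the classpath; the HDFS path and the sample records are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/temps.seq");   // illustrative path

        // Write (city, temperature) key/value pairs in the binary SequenceFile format.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("Toronto"), new IntWritable(20));
            writer.append(new Text("Rome"), new IntWritable(33));
        }
    }
}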

Hadoop – Streaming
 Hadoop streaming is a utility that comes with the Hadoop distribution.

Figure:- Hadoop Streaming (https://data-flair.training/blogs/hadoop-streaming/)

 It enables us to create or run MapReduce jobs with scripts in any language, Java or non-Java, as the mapper/reducer.
 By default, the Hadoop MapReduce framework is written in Java and provides support for writing
map/reduce programs in Java only.
 But Hadoop provides API for writing MapReduce programs in languages other than Java.
 Hadoop Streaming is the utility that allows us to create and run MapReduce jobs with any script or
executable as the mapper or the reducer.
 It uses Unix streams as the interface between Hadoop and our MapReduce program, so we can use any language that can read from standard input and write to standard output to write our MapReduce program.
 Hadoop Streaming supports the execution of both Java and non-Java MapReduce jobs over the Hadoop cluster.
 It supports the Python, Perl, R, PHP, and C++ programming languages.

How Does Streaming Work?


Figure:- Hadoop Streaming Working (https://data-flair.training/blogs/hadoop-streaming/)
 The mapper and the reducer are the scripts that read the input line-by-line from stdin and emit the
output to stdout.
 The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
 When a script is specified for mappers, then each mapper task launches the script as a separate
process when the mapper is initialized.
 The mapper task converts its inputs (key, value pairs) into lines and pushes the lines to the standard
input of the process. Meanwhile, the mapper collects the line oriented outputs from the standard
output and converts each line into a (key, value pair) pair, which is collected as the result of the
mapper.
 When reducer script is specified, then each reducer task launches the script as a separate process, and
then the reducer is initialized.
 As reducer task runs, it converts its input key/values pairs into lines and feeds the lines to the standard
input of the process. Meantime, the reducer gathers the line-oriented outputs from the stdout of the
process and converts each line collected into a key/value pair, which is then collected as the result of
the reducer.
 For both mapper and reducer, the prefix of a line until the first tab character is the key, and the rest
of the line is the value except the tab character. In the case of no tab character in the line, the entire
line is considered as key, and the value is considered null. This is customizable by setting -inputformat
command option for mapper and -outputformat option for reducer.

HADOOP PIPES –
 Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
 Unlike Streaming, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with the process
running the C++ map or reduce function.
 The application links against the Hadoop C++ library, which is a thin wrapper for communicating with
the tasktracker child process.
 The map and reduce functions are defined by extending the Mapper and Reducer classes defined in
the HadoopPipes namespace and providing implementations of the map() and reduce() methods in
each case.
 These methods take a context object (of type MapContext or ReduceContext), which provides the
means for reading input and writing output, as well as accessing job configuration information via the
JobConf class.
 Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types. This is evident in MaxTemperatureReducer, where we have to convert the input value into an integer (using a convenience method in HadoopUtils) and then convert the maximum value back into a string before it is written out. In some cases, we can save on doing the conversion, such as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer since it is never processed as a number in the map() method.
 The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the
Java parent process and marshals data to and from the Mapper or Reducer.
 The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer.
Which one it creates is controlled by the Java parent over the socket connection.
 There are overloaded template factory methods for setting a combiner, partitioner, record reader, or
record writer.

HADOOP ECOSYSTEM
 The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework that solves big data problems.
 It includes a number of services (ingesting, storing, analyzing and maintaining data).
 The Hadoop components that together form the Hadoop ecosystem are listed below:

Figure:- Hadoop Ecosystem (https://www.edureka.co/blog/hadoop-ecosystem)

o HDFS -> Hadoop Distributed File System


o YARN -> Yet Another Resource Negotiator
o MapReduce -> Data processing using programming
o Spark -> In-memory Data Processing
o PIG, HIVE-> Data Processing Services using Query (SQL-like)
o HBase -> NoSQL Database
o Mahout, Spark MLlib -> Machine Learning
o Apache Drill -> SQL on Hadoop
o Zookeeper -> Managing Cluster
o Oozie -> Job Scheduling
o Flume, Sqoop -> Data Ingesting Services
o Solr & Lucene -> Searching & Indexing
o Ambari -> Provision, Monitor and Maintain cluster

Map Reduce Framework and Basics


References:-
 Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis.
 https://en.wikipedia.org/wiki/Apache_ZooKeeper
 https://blog.bi-geek.com/en/formatos-de-ficheros-en-hadoop/
 http://hadoop.apache.org/docs/r1.2.1/streaming.html
 https://data-flair.training/blogs/hadoop-streaming/
 https://www.wisdomjobs.com/e-university/hadoop-tutorial-484/hadoop-pipes-14765.html
 https://www.edureka.co/blog/hadoop-ecosystem
