BAD601 Module 2 PDF
Introducing Hadoop
• Hadoop is an open-source framework for distributed storage and processing of large data sets.
• It divides large data into smaller parts and processes them across multiple computers at the same time.
• It is part of the Apache Software Foundation and is widely used for Big
Data processing.
Big Data Problems: Companies like Google, Facebook, and Amazon generate
terabytes of data every second.
Slow Processing: A single computer cannot handle such huge data efficiently.
Storage Limitations: Traditional databases have limits on how much data they can
store.
Hadoop helps to store and process large-scale data efficiently and quickly.
HDFS (Hadoop Distributed File System) – Like Google Drive, it stores data across
multiple computers.
MapReduce – Like group work, it processes data in parallel and then combines
results.
YARN (Yet Another Resource Negotiator) – Like a manager, it assigns tasks to
different computers.
Why Hadoop?
Low cost: Hadoop is an open-source framework and uses low-cost commodity hardware to store and process data.
Computing power: Hadoop's distributed computing model processes data in parallel; the more computing nodes you use, the more processing power you have.
Scalability: This boils down to simply adding nodes as the system grows and requires much less administration.
• RDBMS is not suitable for storing and processing large files, images, and
videos.
RDBMS vs Hadoop
• System: RDBMS is a Relational Database Management System; Hadoop uses a node-based flat structure.
• Processing: RDBMS supports OLTP (Online Transaction Processing); Hadoop supports analytical and Big Data processing.
• Processor: RDBMS needs expensive hardware or high-end processors to store huge volumes of data; a node in a Hadoop cluster requires only a processor, a network card, and a few hard drives.
Choose Hadoop for large-scale batch processing, big data analytics, and cost-
effective distributed storage with HDFS.
Although there are several challenges with distributed computing, we will focus on
two major challenges.
Hardware Failure
In a distributed system, several servers are networked together, so there is always a real possibility of hardware failure. When such a failure does happen, how does one retrieve the data that was stored in the system? To put this in perspective: a regular hard disk may fail once in three years, so when you have 1,000 such hard disks, at least a few are likely to be down on any given day.
How to Combine the Data
In a distributed system, the data is spread across the network on several machines. A key challenge here is to integrate the data available on several machines prior to processing it.
History of Hadoop
The history of Hadoop is usually shown as a timeline of key events in the development of Hadoop and related technologies.
Hadoop Overview
Framework: Everything that you need to develop and execute an application is provided - programs, tools, etc.
Hadoop components
Hadoop core components:
1. HDFS: The storage layer, which stores data in blocks across the nodes of the cluster.
2. MapReduce: The data processing layer, which processes data in parallel to extract richer and more meaningful insights from the data.
Hadoop ecosystem components:
1. HIVE - a data warehouse that provides SQL-like querying (HiveQL) over data stored in HDFS.
2. PIG - a scripting platform (Pig Latin) for analyzing large data sets.
3. SQOOP - transfers data between RDBMS and Hadoop.
4. HBASE - a column-oriented NoSQL database that runs on top of HDFS.
5. FLUME - collects and moves large volumes of log and streaming data into HDFS.
6. OOZIE - a workflow scheduler for Hadoop jobs.
7. MAHOUT - a library of scalable machine learning algorithms.
1. Master HDFS: Its main responsibility is partitioning the data storage across the
slave nodes. It also keeps track of locations of data on DataNodes.
ClickStream Data
ClickStream data (mouse clicks) helps you to understand the purchasing behavior
of customers. ClickStream analysis helps online marketers to optimize their product
web pages, promotional content, etc. to improve their business.
As shown in the figure above, ClickStream analysis using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as Customer Relationship Management data (customer demographics data, sales data, and information on advertising campaigns). This additional data often provides the much-needed information to understand customer behavior.
2. Hadoop's scalability helps you to store years of data without much incremental cost. This lets you perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis. With
these tools, you can organize ClickStream data by user session, refine it, and feed it
to visualization or analytics tools.
Key Features of HDFS
4. Optimized for high throughput (HDFS leverages large block size and moves computation to where the data is stored).
5. You can create multiple copies of a file as per configuration, ensuring reliability.
This replication enhances fault tolerance for both software and hardware failures.
7. You can realize the power of HDFS when you perform read or write on large files
(gigabytes and larger).
8. It operates above native file systems like ext3 and ext4, as illustrated in the
figure below. This abstraction enables additional functionality and flexibility in data
management.
The figure highlights key aspects of the Hadoop Distributed File System (HDFS).
It mentions that HDFS uses a block-structured file system, has a default
replication factor of 3 for fault tolerance, and a default block size of 64 MB for
efficient storage and processing.
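For illustration, the block size and replication factor of a file already stored in HDFS can be inspected through the HDFS Java API. This is a minimal sketch; the path /sample/test.txt and the class name are assumptions, not part of the original material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Fetch the metadata that the NameNode keeps for this file.
        FileStatus status = fs.getFileStatus(new Path("/sample/test.txt")); // hypothetical path
        System.out.println("Block size (bytes): " + status.getBlockSize());   // e.g. 64 MB or 128 MB
        System.out.println("Replication factor: " + status.getReplication()); // e.g. 3
        fs.close();
    }
}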
The figure illustrates the Hadoop Distributed File System (HDFS) architecture,
showing how a file is stored and managed across multiple nodes.
Key Components:
Client Application
The client interacts with HDFS through the Hadoop File System Client to
read/write data.
NameNode
Manages file-to-block mapping (e.g., Sample.txt is split into Block A, Block B, and
Block C).
DataNodes
The figure shows three DataNodes (A, B, and C), each storing different copies of
Blocks A, B, and C based on the replication factor.
Working of HDFS:
The NameNode manages block locations but does not store actual data.
The Hadoop File System Client communicates with the NameNode to fetch block
locations and then retrieves data from DataNodes.
This design ensures high availability, fault tolerance, and parallel processing for
large-scale data applications.
HDFS Daemons
NameNode
The NameNode is the master daemon of HDFS. It maintains the file system namespace and metadata in two files: the FsImage (a snapshot of the namespace) and the EditLog (a log of changes made to the namespace).
• When the NameNode starts up, it reads the FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage.
• It then flushes a new version of the FsImage to disk and truncates the old EditLog, because its changes have now been applied to the FsImage.
DataNode
DataNodes store the actual data blocks and serve read/write requests from clients. Each DataNode periodically reports the list of blocks it holds to the NameNode.
Heartbeat Mechanism: Each DataNode sends a heartbeat message to the NameNode at regular intervals (every 3 seconds by default). If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as dead.
Data Replication: The NameNode then re-replicates the blocks that were stored on the dead DataNode to other DataNodes so that the configured replication factor is maintained.
This mechanism helps detect node failures and automatically redistribute data, ensuring Hadoop's reliability and resilience.
Secondary NameNode
The Secondary NameNode periodically merges the FsImage with the EditLog (checkpointing) so that the EditLog does not grow too large. It is not a standby NameNode and cannot take over if the NameNode fails.
Anatomy of File Read
1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem.
4. Client calls read() repeatedly to stream the data from the DataNode.
6. When the client completes the reading of the file, it calls close() on the
FSDataInputStream to close the connection.
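The read steps above can be exercised directly through the FileSystem Java API. The following is a minimal sketch; the HDFS path and the class name are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // DistributedFileSystem when fs.defaultFS points to HDFS
        FSDataInputStream in = fs.open(new Path("/sample/test.txt"));  // step 1: open()
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {  // step 4: read() repeatedly to stream the data
            System.out.println(line);
        }
        reader.close();  // step 6: close() the stream
        fs.close();
    }
}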
Anatomy of File Write
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits
for relevant acknowledgments before communicating with the NameNode to
inform the client that the creation of the file is complete.
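Similarly, the write path can be exercised through the same API. A minimal sketch; the path and class name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt")); // create the file
        out.writeBytes("Hello HDFS\n");  // data is buffered into packets and sent down the DataNode pipeline
        out.close();                     // flushes remaining packets and waits for acknowledgments
        fs.close();
    }
}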
As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node in that rack. Once the replica locations have been set, a pipeline is built. This strategy provides good reliability. The figure below describes a typical replica pipeline.
Objective: To get the list of directories and files at the root of HDFS.
hadoop fs -ls /
hadoop fs -ls -R /
The command hadoop fs -ls -R / is used to recursively list all files and
directories in Hadoop Distributed File System (HDFS) starting from the root
(/). This command is useful for searching files, checking storage structure, and
debugging file locations in HDFS.
Use Cases:
Objective: To copy a file from the Hadoop file system to the local file system using the copyToLocal command.
hadoop fs -copyToLocal: Copies a file from HDFS to the local file system.
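A typical invocation looks like the following (both the HDFS path and the local path are illustrative):

hadoop fs -copyToLocal /sample/test.txt /home/user/test.txt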
Processing Data with Hadoop
The Hadoop Distributed File System and the MapReduce framework run on the same set of nodes. This configuration allows effective scheduling of tasks on the nodes where the data is present (data locality), which in turn results in very high throughput.
The MapReduce functions and input/output locations are implemented via MapReduce applications. These applications use suitable interfaces to construct the job. The application and the job parameters together are known as the job configuration. The Hadoop job client submits the job (jar/executable, etc.) to the JobTracker. It is then the responsibility of the JobTracker to schedule tasks on the slaves. In addition to scheduling, it also monitors the tasks and provides status information to the job client.
MapReduce Daemons:
1. JobTracker: The master daemon. It accepts jobs from clients, schedules map and reduce tasks on the TaskTrackers, and monitors their progress.
2. TaskTracker: The slave daemon that runs on each node. It executes the map and reduce tasks assigned to it and reports progress to the JobTracker through heartbeats.
The figure above shows how MapReduce programming works, which is fundamental to Hadoop's data processing framework. Here is a step-by-step explanation of the workflow illustrated in the figure:
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master and several workers processes and
executes the worker processes remotely.
3. Several map tasks work simultaneously and read the pieces of data that were assigned to them. Each map worker uses the map function to extract only the data that is present on its server and generates key/value pairs for the extracted data.
4. Map worker uses partitioner function to divide the data into regions. Partitioner
decides which reducer should get the output of the specified mapper.
5. When the map workers complete their work, the master instructs the reduce
workers to begin their work. The reduce workers in turn contact the map workers to
get the key/value data for their partition. The data thus received is shuffled and
sorted as per keys.
6. The reduce worker then calls the reduce function for every unique key. This function writes the output to the output file.
7. When all the reduce workers complete their work, the master transfers the
control to the user program.
MapReduce Example
The famous example for MapReduce programming is Word Count. For example, suppose you need to count the occurrences of each word across 50 files. You can achieve this using MapReduce programming. Refer to the figure below.
A Word Count MapReduce program typically consists of three classes:
1. Driver Class: This class configures the job (Mapper, Reducer, input/output paths, and output key/value types) and submits it.
2. Mapper Class: This class overrides the Map function based on the problem statement.
3. Reducer Class: This class overrides the Reduce function based on the problem statement.
// Imports for the Word Count Mapper class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Imports for the Word Count Reducer class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
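The class bodies themselves are not reproduced above, so the following is a minimal sketch of how the Word Count Mapper and Reducer are typically written against these imports. The class names, the whitespace-based tokenization, and the additional LongWritable import are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts received for each word and writes (word, total).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}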
HDFS Limitation
The NameNode saves all of its file metadata in main memory. Although main memory today is not as small or as expensive as it was two decades ago, there is still a limit on the number of objects that a single NameNode can hold in memory. The NameNode can quickly become overwhelmed as the load on the system increases.
In Hadoop 2.x, this is resolved with the help of HDFS Federation.
Hadoop 2: HDFS
HDFS 2 consists of two major components: (a) namespace and (b) block storage service. The namespace service takes care of file-related operations such as creating and modifying files and directories. The block storage service handles DataNode cluster management and replication.
HDFS 2 Features
1. Horizontal scalability.
2. High availability.
HDFS Federation uses multiple independent NameNodes for horizontal scalability. The NameNodes are independent of each other, meaning they do not need any coordination among themselves. The DataNodes are common storage for blocks and are shared by all NameNodes; every DataNode in the cluster registers with each NameNode.
High availability of the NameNode is obtained with the help of a passive Standby NameNode. In Hadoop 2.x, the Active-Passive NameNode pair handles failover automatically. All namespace edits are recorded to shared NFS storage, and there is a single writer at any point in time. The Passive NameNode reads the edits from the shared storage and keeps its metadata up to date. In case of Active NameNode failure, the Passive NameNode automatically becomes the Active NameNode and starts writing to the shared storage. The figure below describes the Active-Passive NameNode interaction.
Hadoop 2: YARN
Fundamental Idea
The fundamental idea behind this architecture is splitting the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons. The daemons that are part of the YARN architecture are described below.
1. A global ResourceManager: Its main responsibility is to distribute resources among the various applications in the system. It has two main components:
(a) Scheduler: The pluggable scheduler of the ResourceManager decides the allocation of resources to the various running applications. The Scheduler is just that, a pure scheduler, meaning it does NOT monitor or track the status of the application.
(b) ApplicationManager: The ApplicationManager does the following:
• Accepts job submissions.
• Negotiates resources (a container) for executing the application-specific ApplicationMaster.
• Restarts the ApplicationMaster in case of failure.
2. NodeManager: This is a per-machine slave daemon. The NodeManager's responsibility is to launch the application containers for application execution. It monitors resource usage such as memory, CPU, disk, and network, and reports the usage to the global ResourceManager.
Basic Concepts
Application:
1. Application is a job submitted to the framework.
2. Example - MapReduce Job.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (Memory, CPU,
disk, network, etc.)
(a) container_0 = 2 GB, 1 CPU
(b) container_1 = 1 GB, 6 CPU
3. Replaces the fixed map/reduce slots.
YARN ARCHITECTURE
The figure below shows YARN architecture.
MapReduce Programming
A MapReduce job splits the work into map tasks and reduce tasks; the reduce tasks combine the outputs of the map tasks to generate the final output. Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by map task is known as intermediate keys and values. These
intermediate keys and values are sent to reducer. The reduce tasks are broken into
the following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed
resides. This ensures data locality. Data locality means that data is not moved over
network; only computational code is moved to process data which saves network
bandwidth.
Mapper
A mapper maps the input key-value pairs into a set of intermediate key-value pairs.
Maps are individual tasks that have the responsibility of transforming input records
into intermediate key-value pairs.
1. RecordReader: RecordReader converts the byte-oriented view of the input (as generated by the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks with keys and values. Generally, the key is the positional information and the value is the chunk of data that constitutes the record.
2. Map: The map function works on the key-value pair produced by the RecordReader and generates zero or more intermediate key-value pairs. What constitutes the key-value pair depends on the context (the problem being solved).
Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones
that share a common key) to a smaller set of values. The Reducer has three primary
phases: Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it onto the local machine where the reducer is running. These individual data pipes are then sorted by key, which produces a larger data list. The main purpose of this sort is to group identical keys so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies the reduce function, and processes one group at a time. The reduce function iterates over all the values associated with a key. The reducer function supports various operations such as aggregation, filtering, and combining data. Once it is done, the output of the reducer (zero or more key-value pairs) is sent to the output format.
3. Output Format: The output format separates the key and the value with a tab (by default) and writes the result out to a file using a record writer.
Figure describes the chores of Mapper, Combiner, Partitioner, and Reducer for the
word count problem.
The Word Count problem has been discussed under "Combiner" and "Partitioner".
Combiner
It is an optimization technique for MapReduce Job. Generally, the reducer class is
set to be the combiner class. The difference between combiner class and reducer
class is as follows:
1. Output generated by combiner is intermediate data and it is passed to the
reducer.
2. Output of the reducer is passed to the output file on disk.
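In the Word Count driver, the combiner described above is typically enabled with a single line; here the reducer class (called WordCountReducer in the earlier sketch) is assumed to be reused as the combiner:

job.setCombinerClass(WordCountReducer.class);   // run the reducer logic locally on each map output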
The sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input Data: What is the input that has been given to us to act upon?
Act: The command or program used to act on the input.
Output: The result produced by acting on the input.
The Word Count output is written to: /mapreducedemos/output/wordcount/part-r-00000
Partitioner
• The partitioning phase happens after the map phase and before the
reduce phase.
• The number of partitions equals the number of reducers.
• The default partitioner in Hadoop is the hash partitioner (sketched below), but custom partitioners can be implemented.
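For reference, the default hash partitioner computes the target partition from the key's hash code, roughly as sketched below (shown here for Text keys and IntWritable values):

// Equivalent of Hadoop's default HashPartitioner logic:
public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result is non-negative, then take the modulo.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}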
Objective of the Exercise
• Implement a MapReduce program to count word occurrences.
• Use a custom partitioner to divide words based on their starting
alphabet.
• This ensures that words beginning with the same letter are sent to the
same reducer.
Input Data Example
• Welcome to Hadoop Session
• Introduction to Hadoop
• Introducing Hive
• Hive Session
• Pig Session
• The custom partitioner uses a switch-case structure for letters A–Z, with a default partition for non-alphabetic words.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: routes each word to a reducer based on its first letter.
public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        char alphabet = Character.toUpperCase(key.toString().charAt(0));
        int partitionNumber = 0;
        switch (alphabet) {
            // The full program has one case per letter ('A' -> 1 ... 'Z' -> 26);
            // the equivalent mapping is shown compactly in the default branch.
            // Partition 0 collects words that do not start with A-Z.
            default:
                if (alphabet >= 'A' && alphabet <= 'Z') {
                    partitionNumber = alphabet - 'A' + 1;
                }
        }
        return partitionNumber;
    }
}

// In the driver:
job.setNumReduceTasks(27);   // 26 letters + 1 default partition
job.setPartitionerClass(WordCountPartitioner.class);
FileOutputFormat.setOutputPath(job, new Path("/mapreducedemos/output/wordcountpartitioner/"));
The input lines are split into the following words:
• Welcome
• to
• Hadoop
• Session
• Introduction
• to
• Hadoop
• Introducing
• Hive
• Hive
• Session
• Pig
• Session
After mapping, each word gets a count of 1 (key-value pairs emitted by the mapper):
• Welcome → 1
• to → 1
• Hadoop → 1
• Session → 1
• Introduction → 1
• to → 1
• Hadoop → 1
• Introducing → 1
• Hive → 1
• Hive → 1
• Session → 1
• Pig → 1
• Session → 1
Word            First Letter    Partition Number
Welcome         W               23
to              T               20
Hadoop          H               8
Session         S               19
Introduction    I               9
Introducing     I               9
Hive            H               8
Pig             P               16
Each reducer generates an output file containing the words that start with a specific letter. The combined results are:
Hadoop 2
Hive 2
Introduction 1
Introducing 1
Pig 1
Session 3
to 2
Welcome 1
The output files are written under: /mapreducedemos/output/wordcountpartitioner/
Searching
Objective: To search for a given keyword (here, "Jack") in the input file. The program outputs the lines containing the keyword along with the file name and the position of the match.
Input Data (student.csv):
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
The Driver class below configures the Hadoop job and specifies the Mapper, Reducer, and input/output paths.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver class: configures and submits the keyword-search job.
public class WordSearcher {
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordSearcher.class);
        job.setMapperClass(WordSearchMapper.class);
        job.setReducerClass(WordSearchReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.getConfiguration().set("keyword", "Jack");  // keyword to be searched
        job.setNumReduceTasks(1);
        // Input and output paths (not shown in the original excerpt; passed as arguments here)
        FileInputFormat.addInputPath(job, new Path(args[0]));    // path to student.csv in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /mapreduce/output/search
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
• Configures the job, sets the keyword, and specifies input and output paths.
• Uses only one reducer for simplicity.
The Mapper scans each line, searches for the keyword, and outputs matching
lines.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emits every line that contains the keyword, along with the file name
// and the position at which the keyword occurs in that line.
public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    static String keyword;
    static int pos = 0;   // running line counter within the split

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration configuration = context.getConfiguration();
        keyword = configuration.get("keyword");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        Integer wordPos;
        pos++;
        if (value.toString().contains(keyword)) {
            wordPos = value.find(keyword);   // offset of the keyword within the line
            context.write(value, new Text(fileName + "," + wordPos));
        }
    }
}
The Reducer simply writes the filtered results from the Mapper.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: writes out the matching lines received from the mapper unchanged.
public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
After running the job, the output is stored in Hadoop HDFS at:
/mapreduce/output/search/part-r-00000
Content of part-r-00000:
1002,Jack,39 student.csv,5
• The program correctly identifies and outputs the row that contains "Jack".
Sorting
Objective: To sort the student records by name.
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStudNames {

    // Mapper Class: emits the student name as the key so that the framework
    // sorts the records by name during the shuffle-and-sort phase.
    public static class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] token = value.toString().split(",");
            // Key: Name, Value: ID,Score
            context.write(new Text(token[1]), new Text(token[0] + "," + token[2]));
        }
    }

    // Reducer Class: writes out the sorted records as Name,ID,Score.
    public static class SortReducer extends Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text details : values) {
                context.write(NullWritable.get(), new Text(key.toString() + "," + details.toString()));
            }
        }
    }

    // Driver Class
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Sort Students by Name");
        job.setJarByClass(SortStudNames.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths (not shown in the original excerpt; passed as arguments here)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Output:
Alex,1003,44
Bob,1005,33
Jack,1002,39
John,1001,45
Smith,1004,38
COMPRESSION
In MapReduce programming, you can compress the MapReduce output file.
Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
// GzipCodec and CompressionCodec come from the org.apache.hadoop.io.compress package.
Here, codec is the implementation of a compression and decompression algorithm.
GzipCodec is the compression algorithm for gzip. This compresses the output file.
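Equivalently, with the newer MapReduce API the same effect can be achieved through FileOutputFormat in the driver; a minimal sketch, assuming a Job object named job:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);                    // compress the job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec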
****END****