Hadoop Questions
https://www.facebook.com/chatchindia
Document-oriented
High performance
High availability
Easy scalability
Rich-query language
When using replication, can some members use journaling and others not?
Yes!
Will there be journal replay problems in case of incomplete entries (if there is a failure in the
middle of one)?
Each journal (group) write is consistent and won't be replayed during recovery unless it is complete.
What is a namespace?
MongoDB stores BSON objects in collections. The concatenation of the database name and the collection name (with
a period in between) is called a namespace.
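As a small illustration of the naming rule above, the sketch below builds a namespace string from a database name and a collection name, and splits one back apart. The helper names and the sample names "blog" and "posts" are hypothetical, chosen only for this example.

```python
from typing import Tuple


def namespace(database: str, collection: str) -> str:
    """Join a database name and a collection name with a period."""
    return f"{database}.{collection}"


def split_namespace(ns: str) -> Tuple[str, str]:
    """Split at the first period: the database name comes first,
    and the collection name itself may contain further periods."""
    database, collection = ns.split(".", 1)
    return database, collection


if __name__ == "__main__":
    print(namespace("blog", "posts"))
    print(split_namespace("blog.system.indexes"))
```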
How do I do transactions/locking?
MongoDB does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight,
fast and predictable in its performance. It can be thought of as analogous to MySQL's MyISAM autocommit
model. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may
run across many servers.
Should you start out with Sharded or with a Non-Sharded MongoDB environment?
We suggest starting with Non-Sharded for simplicity and quick startup, unless your initial data set will not fit on a single
server. Upgrading to Sharded from Non-Sharded is easy and seamless, so there is not a lot of advantage in setting
up Sharding before your data set is large.
to store something in DB and commits all changes in a batch at a later time. If there is a server crash
or power failure, all those commits buffered in memory will be lost. This functionality can be disabled, but then
it will perform as well as or worse than MySQL.
3. MongoDB is only ideal for implementing things like analytics/caching where the impact of small data loss is
negligible.
4. In MongoDB, it's difficult to represent relationships between data, so you end up doing that manually by creating
another table to represent the relationship between rows in two or more tables.
What is the default block size in HDFS and what are the benefits of
having larger block sizes?
Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in
HDFS is 64 MB (128 MB from Hadoop 2.x onward), and it can be configured larger. This allows HDFS to decrease the
amount of metadata storage required per file.
Furthermore, it allows fast streaming reads of data, by keeping large amounts of data sequentially organized on the
disk. As a result, HDFS is expected to have very large files that are read sequentially. Unlike a file system such as
NTFS or EXT, which has numerous small files, HDFS stores a modest number of very large files: hundreds of
megabytes, or gigabytes each.
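To make the metadata argument concrete, the following sketch counts how many blocks a single file occupies at different block sizes. The 1 GB file size is an assumed figure for illustration; the block sizes are the ones mentioned above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold file_size bytes (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


if __name__ == "__main__":
    file_size = 1 * GB  # assumed example file size
    for label, size in [("4 KB", 4 * KB), ("64 MB", 64 * MB)]:
        print(f"{label} blocks: {block_count(file_size, size)}")
```

With 4 KB blocks the file needs 262,144 block entries; with 64 MB blocks it needs 16, so far less per-file metadata has to be tracked.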
What are two main modules which help you interact with HDFS and
what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific
command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as FsShell, provides basic file manipulation operations and works with objects within
the file system. The dfsadmin module manipulates or queries the file system as a whole.
What are schedulers and what are the three types of schedulers that
can be used in Hadoop cluster?
Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the
jobtracker. The three types of schedulers are:
FIFO Scheduler (the default)
Fair Scheduler
Capacity Scheduler
allocation.
2. When you have very little fluctuation within queue utilization. The Capacity Scheduler's more rigid resource
allocation makes sense when all queues are at capacity almost all the time.
3. When you have high variance in the memory requirements of jobs and you need the Capacity Scheduler's
memory-based scheduling support.
4. When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
1. When you have a slow network and data locality makes a significant difference to a job's runtime, features like
delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
2. When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model
achieves much greater overall cluster utilization by giving away otherwise reserved resources when they're not
used.
3. When you require jobs within a pool to make equal progress rather than running in FIFO order.
What is file system checking utility FSCK used for? What kind of
information does it show? Can FSCK show information about files
which are open for writing by a client?
The file system checking utility FSCK is used to check and display the health of the file system and of the files and
blocks in it. When used with a path (bin/hadoop fsck / -files -blocks -locations -racks) it recursively shows the health
of all files under the path. When used with /, it checks the entire file system. By default, FSCK ignores files still open
for writing by a client. To list such files, run FSCK with the -openforwrite option.
hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
masters
slaves
These files can be found in your Hadoop conf directory. If the Hadoop daemons are started individually using
bin/hadoop-daemon.sh start xxxx, where xxxx is the name of the daemon, then the masters and slaves files need not be
updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to
start the appropriate daemons. If the Hadoop daemons are started using bin/start-dfs.sh and bin/start-mapred.sh, then
the masters and slaves configuration files on the namenode machine need to be updated.
masters: IP address/hostname of the node where the secondary namenode will run.
slaves: IP address/hostname of the nodes where the datanodes, and eventually the task trackers, will run.
What is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity
computers using a simple programming model.
ANTRIXSHGUPTA
What are the port numbers of Namenode, job tracker and task tracker?
The default web UI port number for Namenode is 50070, for job tracker is 50030 and for task tracker is 50060.
To exit the Vi Editor, press ESC and type :q and then press enter.
What happens if you get a connection refused java exception when you type hadoop fsck /?
It could mean that the Namenode is not working on your VM.
We are using the Ubuntu operating system with Cloudera, but from where can we download
Hadoop, or does it come by default with Ubuntu?
This is a default configuration of Hadoop that you have to download from Cloudera or from Edureka's Dropbox and
then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu
or Red Hat. There are installation steps present at the Cloudera location or in Edureka's Dropbox. You can go either
way.
sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then
Can you give us some more details about SSH communication between Masters and the
Slaves?
SSH is a secure communication protocol, set up here for password-less login, through which data packets are sent
across to the slaves. The data is sent in a defined format. SSH is not only used between masters and slaves but also
between any two hosts.
In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your present
cluster and install the new cluster.
parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text, IntWritable. The first two represent intermediate
output parameters and the second two represent final output parameters.
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and
generating key-value pairs out of the mapper.
Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar keys together at
one location is known as Sorting. The process by which the intermediate output of the mapper is sorted and sent
across to the reducers is known as Shuffling.
Before the data is transferred from its hard disk location to the map method, there is a phase called the Split
method. The Split method pulls a block of data from HDFS into the framework. The Split class does not write anything;
it reads data from the block and passes it to the mapper. By default, Split is taken care of by the framework. By default,
the split size is equal to the block size, and it is used to divide a block into a bunch of splits.
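The relationship between a file and its splits can be sketched as below. This is a simplified model that assumes the split size equals the block size, as described above; it is not Hadoop's actual input format code, and the function name compute_splits is hypothetical.

```python
from typing import List, Tuple


def compute_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering the file; each pair would feed one mapper."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    MB = 1024 * 1024
    # Assumed example: a 150 MB file with 64 MB splits.
    for offset, length in compute_splits(150 * MB, 64 * MB):
        print(f"split at offset {offset}, length {length}")
```

In this model a 150 MB file yields three splits: two of 64 MB and a final one of 22 MB.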
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a custom splitter. Hadoop
has a customization feature for this, which can be called from the main method.
Why can't we do aggregation (addition) in a mapper? Why do we need a reducer for that?
We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper. Sorting happens only on
the reducer side. Mapper initialization depends on each input split, and the map method is called afresh for each
row of that split. While doing aggregation, we would lose the value of the previous call, so we have no track of the
previous row value.
What is Streaming?
Streaming is a feature with Hadoop framework that allows us to do programming using MapReduce in any
programming language which can accept standard input and can produce standard output. It could be Perl, Python,
Ruby and not necessarily be Java. However, customization in MapReduce can only be done using Java and not any
other programming language.
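A Streaming job only needs programs that read standard input and write standard output. The sketch below shows the logic of a word-count mapper and reducer in Python under that contract. The function names are hypothetical, and the code assumes that the reducer receives its input sorted by key, with key and value separated by a tab. The main block simulates the map, sort and reduce steps locally; on a cluster each function would instead be fed from standard input.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines that share a key (input must be sorted by key)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local simulation of map -> sort -> reduce on assumed sample input.
    sample = ["the quick fox", "the lazy dog"]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```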
What is a Combiner?
A Combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a
particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by
reducing the quantum of data that is required to be sent to the reducers.
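The effect of a combiner on the amount of data sent to the reducers can be sketched with a small in-memory simulation. This is plain Python rather than Hadoop API code; the function names and the sample input are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one line."""
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Combiner: sum counts per word locally, on the mapper's node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Reducer: sum counts per word across all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


if __name__ == "__main__":
    node_inputs = ["to be or not to be", "to do or not to do"]  # one line per simulated node
    raw = [map_words(line) for line in node_inputs]
    combined = [combine(pairs) for pairs in raw]
    print("pairs sent without combiner:", sum(len(p) for p in raw))
    print("pairs sent with combiner:", sum(len(p) for p in combined))
    print(reduce_counts([pair for pairs in combined for pair in pairs]))
```

For this sample the mappers emit 12 pairs in total, while the combiners forward only 8, and the final counts are the same either way.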
What is PIG?
PIG is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs. PIG's infrastructure layer consists of a compiler
that produces sequences of MapReduce programs.
Are there any problems which can only be solved by MapReduce and cannot be solved by
PIG? In which kinds of scenarios will MR jobs be more useful than PIG?
Let us take a scenario where we want to count the population in two cities. I have a data set and sensor list of different
cities, and I want to count the population of two cities using one MapReduce job. Let us assume that one is Bangalore and
the other is Noida. I need to treat the key for Bangalore as equivalent to the key for Noida, so that the population
data of these two cities reaches one reducer. The idea is to instruct the MapReduce program that whenever it finds a
city named Bangalore or a city named Noida, it should create an alias that serves as the common name for these two
cities. That gives both cities a common key, which gets passed to the same reducer. For this, we have to write a
custom partitioner.
In MapReduce, when you create a key for a city, you have to consider the city as the key. So, whenever the framework
comes across a different city, it considers it a different key. Hence, we need to use a customized partitioner. MapReduce
has a provision for writing your own custom partitioner, in which you can specify that if city = bangalore or noida,
the same hash code is returned. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we
cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
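The idea of routing two city keys to the same reducer can be sketched as follows. This is a plain-Python simulation, not the Hadoop Partitioner API; the alias name bangalore_noida, the function name, and the use of CRC32 as a stable hash are all assumptions made for the example.

```python
import zlib

# Cities that should share a reducer are mapped to a common alias (hypothetical name).
CITY_ALIASES = {"bangalore": "bangalore_noida", "noida": "bangalore_noida"}


def partition(city: str, num_reducers: int) -> int:
    """Return the reducer index for a city key, sending aliased cities to the same reducer."""
    key = CITY_ALIASES.get(city.lower(), city.lower())
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for city in ["Bangalore", "Noida", "Delhi"]:
        print(city, "-> reducer", partition(city, 4))
```

Because both city names resolve to the same alias before hashing, they always land on the same reducer index, whatever the number of reducers.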
Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is no matching field or a mismatch. Even if Pig did give such a
warning, it would be difficult to find in the log file. If any mismatch is found, Pig assumes a null value.
bagname GENERATE expression1, expression2,
statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.
What is bag?
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags
are used to store collections while grouping. The size of a bag is bounded by the size of the local disk, which means
that the size of the bag is limited. When the bag is full, Pig will spill it onto the local disk and keep only some parts of
the bag in memory. The complete bag does not need to fit into memory. We represent bags with {}.
Behavior-based targeting
Segmentation
CRM
Credit risk, scoring and analysis
Trade surveillance and abnormal trading pattern analysis
Health & Life Sciences:
Price optimization
Gaming:
1. Behavioral Analytics
High Tech:
1. Optimize Funnel Conversion
2. Predictive Support
3. Predict Security Threats
4. Device Analytics
FACEBOOK
Facebook today is a world-wide phenomenon that has caught on with young and old alike. Launched in 2004 by a
bunch of Harvard University students, it was hardly expected to become such a rage. In a span of just a decade, how
did it manage this giant leap?
With around 1.23 billion users and counting, Facebook definitely has the upper hand over other social media websites.
What is the reason behind this success? This blog is an attempt to answer some of these queries.
It is quite evident that a durable storage system and high technological expertise have contributed to
the support of various user data, such as messages, applications and personal information, without which all of
it would have come to a staggering halt. So what does a website do when its user count exceeds the number of cars
in the world? How does it manage such massive data?
WHAT IS HADOOP
So, what exactly is Hadoop?
It is truly said that necessity is the mother of invention, and Hadoop is amongst the finest inventions in the world
of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that can handle
and process Big Data efficiently.
Technically speaking, Hadoop is an open source software framework that supports data-intensive distributed
applications. Hadoop is licensed under the Apache v2 license. It is therefore generally known as Apache Hadoop.
Hadoop was developed based on a paper originally written by Google on its MapReduce system, and it applies
concepts of functional programming. Hadoop is written in the Java programming language and is a top-level
Apache project being built and used by a global community of contributors. Hadoop was developed by Doug
Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug's son's toy
elephant!
Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let's probe into its ecosystem. The Hadoop Ecosystem is nothing but the various
components that make Hadoop so powerful, among which HDFS and MapReduce are the core components!
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store
very large amounts of data reliably, to transfer the data at high speed among nodes, and to let the
system continue working smoothly even if any of the nodes fail. HDFS is very competent in writing
programs, handling their allocation, processing the data and generating the final outcomes. In fact, HDFS manages
around 40 petabytes of data at Yahoo! The key components of HDFS are NameNode, DataNodes and Secondary
NameNode.
2. MapReduce:
It all started with Google applying the concept of functional programming to solve the problem of how to manage large
amounts of data on the internet. Google named the approach MapReduce and described it in a published
paper. With the ever-increasing amount of data generated on the web, MapReduce was created in
2004, and Yahoo stepped in to develop Hadoop in order to implement the MapReduce technique. The
function of MapReduce at Google was to help search and index the large quantity of web pages in a matter of a few
seconds or even a fraction of a second. The key components of MapReduce are JobTracker, TaskTrackers and
JobHistoryServer.
3. Apache Pig:
Apache Pig is another component of Hadoop, which is used to analyze huge data sets with a high-level
language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic
attribute of Pig programs is parallelization, which helps them manage large data sets. Apache Pig consists of a
compiler that generates a series of MapReduce programs and a Pig Latin language layer that lets SQL-like
queries be run on distributed data in Hadoop.
4. Apache Hive:
As the name suggests, Hive is Hadoop's data warehouse system. It enables quick data summarization for Hadoop,
handles queries, evaluates huge data sets located in Hadoop's file systems, and maintains full support
for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to
speed up queries. Apache Hive was originally developed by Facebook, but now it is developed and used by other
companies too, including Netflix.
5. Apache HCatalog
Apache HCatalog is another important component of Apache Hadoop, which provides a table and storage
management service for data created with the help of Apache Hadoop. HCatalog offers features like a shared schema
and data type mechanism, a table abstraction for users, and smooth functioning across other components of Hadoop
such as Pig, MapReduce, Streaming, and Hive.
6. Apache HBase
HBase is an acronym for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for
storage. On one hand it manages batch-style computations using MapReduce, and on the other hand it
handles point queries (random reads). The key components of Apache HBase are the HBase Master and the
RegionServer.
7. Apache Zookeeper
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to maintain
configuration information and naming, and to provide distributed synchronization and group services, which are
immensely crucial for various distributed systems. In fact, HBase depends on ZooKeeper for its functioning.
WHY HADOOP
Hadoop can be contagious. Its implementation in one organization can lead to another one elsewhere. Thanks to
Hadoop being robust and cost-effective, handling humongous data seems much easier now. The ability to include
HIVE in an EMR workflow is yet another awesome point. It's incredibly easy to boot up a cluster, install HIVE, and be
doing simple SQL analytics in no time. Let's take a look at why Hadoop can be so incredible.
2. Scalable
Hadoop is a scalable platform, in the sense that new nodes can easily be added to the system as and when required
without altering the data formats, how data is loaded, or how programs are written, and without modifying the existing
applications. Hadoop is an open source platform and runs on industry-standard hardware. Moreover, Hadoop is also
fault tolerant: even if a node gets lost or goes out of service, the system automatically reallocates work
to another location of the data and continues processing as if nothing had happened!
4. Robust Ecosystem:
Hadoop has a very robust and rich ecosystem that is well suited to meet the analytical needs of developers, web
startups and other organizations. The Hadoop Ecosystem consists of various related projects such as MapReduce, Hive,
HBase, Zookeeper, HCatalog and Apache Pig, which make Hadoop very competent to deliver a broad spectrum of
services.
6. Cost Effective:
Loaded with such great features, the icing on the cake is that Hadoop generates cost benefits by bringing massively
parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which
in turn makes it reasonable to model all your data. The basic idea behind Hadoop is to perform cost-effective
analysis of data present across the World Wide Web!
1. Volume: Big Data is clearly determined by its volume. It could amount to hundreds of terabytes or even
petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could
mean Big Data!
2. Velocity: Velocity means the rate at which data is flowing into companies. Big data requires fast processing.
The time factor plays a very crucial role in several organizations. For instance, processing 2 million records at the share
market or evaluating the results of millions of students who applied for competitive exams could mean Big Data!
3. Variety: Big Data may not belong to a specific format. It could be in any form, such as structured, unstructured,
text, images, audio, video, log files, emails, simulations, 3D models, etc. New research shows that a substantial
amount of an organization's data is not numeric; however, such data is equally important for the decision-making process.
So, organizations need to think beyond stock records, documents, personnel files, finances, etc.
4. Veracity: Veracity refers to the uncertainty of the data available. Data available can sometimes get messy and
may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, like Twitter posts
with hash tags, abbreviations, typos and colloquial speech. But big data and analytics technology now permits to work
Though Google has only provided the Whitepapers, without any implementation, around 90-95% of the architecture
presented in these Whitepapers is applied in these three Java-based Apache projects.
HDFS and MapReduce are the two major components of Hadoop, where HDFS is from the infrastructural point of
view and MapReduce is from the programming aspect. Though HDFS is at present a subproject of Apache Hadoop,
it was originally developed as infrastructure for the Apache Nutch web search engine project.
To understand the magic behind the scalability of Hadoop from one-node cluster to a thousand-nodes cluster
4. Commodity Hardware:
HDFS (Hadoop Distributed File System) assumes that the cluster(s) will run on common hardware, that is,
inexpensive, ordinary machines rather than high-availability systems. A great feature of Hadoop is that it can be
installed on any average commodity hardware. We don't need supercomputers or high-end hardware to work on
Hadoop. This leads to a considerable overall cost reduction.
6. High Throughput:
Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the
system and it is usually used to measure performance of the system. In Hadoop HDFS, when we want to perform a
task or an action, then the work is divided and shared among different systems. So, all the systems will be executing
the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time.
In this way, the Apache HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read
data tremendously.
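The gain from reading in parallel can be estimated with simple arithmetic. The figures below (about 1 TB of data, 100 MB/s per disk, 100 nodes) are assumed values for illustration, and the model ignores network and coordination overhead.

```python
def read_time_seconds(data_mb: float, throughput_mb_per_s: float, nodes: int = 1) -> float:
    """Idealized time to read data_mb spread evenly across `nodes` disks that are read in parallel."""
    return data_mb / (throughput_mb_per_s * nodes)


if __name__ == "__main__":
    data_mb = 1_000_000  # roughly 1 TB, an assumed figure
    print("single node:", read_time_seconds(data_mb, 100), "seconds")
    print("100 nodes:", read_time_seconds(data_mb, 100, nodes=100), "seconds")
```

Under these assumptions the read takes about 10,000 seconds on one node and about 100 seconds on 100 nodes.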