Big Data Notes
In today's digital world, data is growing at a very high rate because of the increasing use of the
internet, sensors and heavy machines. The sheer volume, variety, velocity and veracity of such
data is signified by the term “BIG DATA”.
EX: rolling web log data, network and system logs, click-stream information. What is considered “big
data” varies depending on the capabilities of the organization managing the data set, and on the
capabilities of the applications that are traditionally used to process and analyze the data
set in its domain. Big data is when the data itself becomes part of the problem.
There are some major milestones in the evolution of big data.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed
to support distribution for the Nutch search engine project. Doug, who was working at
Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's
toy elephant.
Q) Explain Distributed File System?
A distributed file system (DFS) is a file system whose data is stored on one or more servers but is
accessed and processed as if it were stored on the local client machine. The DFS makes it
convenient to share information and files among users on a network in a controlled and
authorized way.
Example
The following scenario shows how much time is required to read 1 terabyte of data using one
machine with 4 I/O channels, each channel having a capacity of 100 MB/s.
In the normal environment the reading operation on 1 TB of data takes about 45 minutes (1 TB at
4 × 100 MB/s ≈ 400 MB/s). Performing the same task in a distributed environment, for example with
ten such machines reading in parallel, takes only about 4.5 minutes. The same performance benefit
can be achieved throughout all the operations, like searching and other key operations.
Most Important Question
(i) Volume – We already know that Big Data indicates huge ‘volumes’ of data being
generated on a daily basis from various sources such as social media platforms, business
processes, machines, networks and human interactions.
Facebook Example:
(vi) Volatility – It refers to how long the data is valid. Data that is valid now may
become invalid after a few minutes or a few days.
1. Structured
2. Unstructured
3. Semi-structured
Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly stored
and accessed from a database by simple search engine algorithms. For instance, the
employee table in a company database will be structured as the employee details, their
job positions, their salaries, etc., will be present in an organized manner.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. This makes it
very difficult and time-consuming to process and analyze unstructured data. A typical example
of unstructured data is a heterogeneous data source containing a combination of simple text
files, images, videos, etc.
Semi-structured
Semi-structured data contains both of the formats mentioned above, that is, structured and
unstructured data. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
GOVERNMENT
Big data analytics has proven to be very useful in the government sector. Big data analysis
played a large role in Barack Obama’s successful 2012 re-election campaign. More recently,
big data analysis contributed significantly to the win of the BJP and its allies in the
2014 Indian General Election. The Indian Government utilizes numerous techniques to ascertain
how the Indian electorate is responding to government action, as well as ideas for policy
augmentation.
SOCIAL MEDIA ANALYTICS
The advent of social media has led to an outburst of big data. Various solutions have been
built to analyze social media activity; for example, IBM’s Cognos Consumer Insights, a point
solution running on IBM’s BigInsights Big Data platform, can make sense of the chatter.
Social media can provide valuable real-time insights into how the market is responding to
products and campaigns. With the help of these insights, the companies can adjust their
pricing, promotion, and campaign placements accordingly. Before utilizing the big data
there needs to be some preprocessing to be done on the big data in order to derive some
intelligent and valuable results. Thus to know the consumer mindset the application of
intelligent decisions derived from big data is necessary.
TECHNOLOGY
The technological applications of big data involve companies which deal
with huge amounts of data every day and put it to use for business decisions as well. For
example, eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB
Hadoop cluster, for search, consumer recommendations, and merchandising; together these hold
around 90 PB of data. Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers. The core technology that
keeps Amazon running is Linux-based and as of 2005, they had the world’s three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion
photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly
100 million drivers to help new home buyers determine their typical drive times to and from
work throughout various times of the day.
Now we turn to the customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What’s going on in a customer’s call center is often a
great barometer and influencer of market sentiment, but without a Big Data solution, much
of the insight that a call center can provide will be overlooked or discovered too late. Big
Data solutions can help identify recurring problems or customer and staff behavior patterns
on the fly not only by making sense of time/quality resolution metrics but also by capturing
and processing call content itself.
BANKING
The use of customer data invariably raises privacy issues. By uncovering hidden connections
between seemingly unrelated pieces of data, big data analytics could potentially reveal
sensitive personal information. Research indicates that 62% of bankers are cautious in their
use of big data due to privacy issues. Further, outsourcing of data analysis activities or
distribution of customer data across departments for the generation of richer insights also
amplifies security risks; there have been incidents in which customer data such as earnings,
savings, mortgages, and insurance policies ended up in the wrong hands. Such incidents
reinforce concerns about data privacy
and discourage customers from sharing personal information in exchange for customized
offers.
AGRICULTURE
A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and
runs simulations to measure how plants react to various changes in conditions. Its data
environment constantly adjusts to changes in the attributes of the various data it collects.
MARKETING
Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that
analyses facial expressions to reveal what viewers are feeling.” The research was designed
to discover what kinds of promotions induced watchers to share the ads with their social
network, helping marketers create ads most likely to “go viral” and improve sales.
SMART PHONES
Perhaps more impressive, people now carry facial recognition technology in their pockets.
Users of iPhone and Android smartphones have applications at their fingertips that use
facial recognition technology for various tasks. For example, Android users with the
remember app can snap a photo of someone, then bring up stored information about that
person based on their image when their own memory lets them down, a potential boon for
salespeople.
TELECOM
Nowadays big data is used in many different fields, and it plays an important role in telecom as well.
Operators face an uphill challenge when they need to deliver new, compelling, revenue-
generating services without overloading their networks while keeping their running costs
under control. The market demands a new set of data management and analysis capabilities
that can help service providers make accurate decisions by taking into account customer,
network context and other critical aspects of their businesses. Most of these decisions must
be made in real time, placing additional pressure on the operators. Real-time predictive
analytics can help leverage the data that resides in their multitude of systems, make it
immediately accessible and help correlate that data to generate insight that can help them
drive their business forward.
HEALTHCARE
Traditionally, the healthcare industry has lagged behind other industries in the use of big
data. Part of the problem stems from resistance to change: providers are accustomed to
making treatment decisions independently, using their own clinical judgment, rather than
relying on protocols based on big data. Other obstacles are more structural in nature, even
though this is one of the best places to set an example for Big Data applications. Even within
a single hospital, payor, or pharmaceutical company, important information often remains siloed
within one group or department because organizations lack procedures for integrating data
and communicating findings.
Health care stakeholders now have access to promising new threads of knowledge. This
information is a form of “big data,” so called not only for its sheer volume but for its
complexity, diversity, and timeliness.
While most definitions of Big Data focus on the new forms of unstructured data flowing
through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to
answer the question using a simple equation:
Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage
Technical
1. Data collected and stored continues to grow exponentially
2. Data is increasingly everywhere and in many formats
3. Traditional solutions are failing under new requirements
Financial
1. Cost of data systems, as a percentage of IT spend, continues to grow
2. Cost advantages of commodity hardware & open source software
Big data analytics refers to the strategy of analyzing large volumes of data, or big data. This
big data is gathered from a wide variety of sources, including social networks, videos, digital
images, sensors, and sales transaction records. The aim in analyzing all this data is to
uncover patterns and connections that might otherwise be invisible, and make superior
business decisions.
Prescriptive
These analytics reveal what kind of actions should be taken, and they determine future
rules and regulations. They are quite valuable since they allow business owners to answer
specific queries. Take the bariatric health care industry for example. Patient populations can
be measured using prescriptive analytics to measure how many patients are morbidly
obese. That number can then be filtered further by adding categories such as diabetes, LDL
cholesterol levels and others to determine the exact treatment. Some companies also use
this data analysis to forecast sales leads, social media, CRM data, etc.
Diagnostic
These analytics analyze past data to determine why certain incidents happened. Say you
end up with an unsuccessful social media campaign; using a diagnostic big data analysis you
can examine the number of posts that were put up, followers, fans, page views/reviews,
pins, etc., allowing you to sift the grain from the chaff, so to speak. In other words, you
can distill literally thousands of data points into a single view to see what worked and what
didn’t, thus saving time and resources.
Descriptive
This phase is based on present processes and incoming data. Such analysis can help you
determine valuable patterns that can offer critical insights into important processes. For
instance, it can help you assess credit risk, review old financial performance to determine
how a customer might pay in the future and even categorize your clientele according to
their preferences and sales cycle. Mining descriptive analytics involves the usage of a
dashboard or simple email reports.
Predictive Analytics
These analytics involve the extraction of current data sets that can help users determine
upcoming trends and outcomes with ease. They cannot tell us exactly what will happen in the
future, but rather what a business owner can expect along with different scenarios.
In other words, predictive analysis is an enabler of big data in that it amasses an enormous
amount of data such as customer info, historical data and customer insight in order to
predict future scenarios.
MapReduce implements various mathematical algorithms to divide a task into small parts
and assign them to multiple systems. In technical terms, MapReduce algorithm helps in
sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the
mapper by their keys.
• Sorting methods are implemented in the mapper class itself.
• In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the
Context class collects the matching valued keys as a collection.
• To collect similar key-value pairs (intermediate keys), the Mapper class takes the
help of the RawComparator class to sort the key-value pairs.
• The set of intermediate key-value pairs for a given Reducer is automatically sorted
by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the
Reducer.
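To make the sort step concrete, here is a minimal hedged sketch (not part of the original notes; the class names are hypothetical) showing where a custom sort order plugs into a job. By default Hadoop sorts the intermediate keys with the key type's own comparator; Job.setSortComparatorClass lets you substitute a RawComparator, for example to reverse the order of Text keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SortOrderDemo {

    // Hypothetical comparator that reverses the default ordering of Text keys.
    public static class ReverseTextComparator extends WritableComparator {
        protected ReverseTextComparator() {
            super(Text.class, true); // true = instantiate keys so compare() receives deserialized objects
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b); // negate the normal result to sort keys in descending order
        }
    }

    // Call this while configuring the job; without it, Hadoop uses the key type's default comparator.
    public static void configure(Job job) {
        job.setSortComparatorClass(ReverseTextComparator.class);
    }
}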
Searching
Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how Searching works with
the help of an example.
Let us assume we have employee data in four different files − A, B, C, and D. Let us
also assume there are duplicate employee records in all four files because of importing
the employee data from all database tables repeatedly. See the following illustration.
The Map phase processes each input file and provides the employee data in key-
value pairs (<k, v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the Map
phase as a key-value pair with employee name and salary. Using searching
technique, the combiner will check all the employee salary to find the highest
salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary so far)
if (v(next employee).salary > Max) {
Max = v(salary);
} else {
Continue checking;
}
Reducer phase − From each file, you will find the highest salaried employee. To
avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any.
The same algorithm is used in between the four <k, v> pairs, which are coming from
four input files. The final output should be as follows −
<gopal, 50000>
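The following is a small illustrative sketch, assuming the key/value types used above (employee name as Text key, salary as IntWritable value); it is not the exact code of the notes. The same logic can also be registered as the combiner to find the per-file maximum before the final reduce produces a single record such as <gopal, 50000>.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private Text maxName = new Text();
    private int maxSalary = Integer.MIN_VALUE;

    @Override
    public void reduce(Text empName, Iterable<IntWritable> salaries, Context con)
            throws IOException, InterruptedException {
        // Each key is an employee name; keep track of the single highest salary seen so far.
        for (IntWritable salary : salaries) {
            if (salary.get() > maxSalary) {
                maxSalary = salary.get();
                maxName.set(empName);
            }
        }
    }

    @Override
    protected void cleanup(Context con) throws IOException, InterruptedException {
        // Emit one record, the highest salaried employee, after all keys have been processed.
        con.write(maxName, new IntWritable(maxSalary));
    }
}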
1. Open Source
Apache Hadoop is an open source project. It means its code can be modified
according to business requirements.
2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is
processed in parallel on a cluster of nodes.
3. Fault Tolerance
This is one of the very important features of Hadoop. By default, 3 replicas of each
block are stored across the cluster in Hadoop, and this can be changed as per the
requirement. So if any node goes down, data on that node can be recovered from other
nodes easily.
8. Easy to use
The client does not need to deal with distributed computing; the framework takes care of
all of it. So this feature makes Hadoop easy to use.
9. Data Locality
This is a unique feature of Hadoop that makes it easily handle Big Data.
Hadoop works on the data locality principle, which states: move computation to the data
instead of data to the computation. When a client submits a MapReduce job, the
algorithm is moved to the data in the cluster rather than bringing the data to the
location where the algorithm is submitted and then processing it.
HDFS:
Features:
• Scalable
• Reliable
• Commodity Hardware
Map Reduce:
Map Reduce is a programming model designed to process high volumes of distributed data.
The platform is built using Java for better exception handling. Map Reduce includes two
daemons, JobTracker and TaskTracker.
Features:
• Functional Programming.
• Works very well on Big Data.
• Can process large datasets.
Map Reduce is the main component known for processing big data.
YARN:
YARN stands for Yet Another Resource Negotiator. It is also called as MapReduce 2(MRv2).
The two major functionalities of Job Tracker in MRv1, resource management and job
scheduling/ monitoring are split into separate daemons which are ResourceManager,
NodeManager and ApplicationMaster.
Features:
Data Access :
Pig:
Apache Pig is a high-level language built on top of MapReduce for analyzing
large datasets with simple ad-hoc data analysis programs. Pig is also known as
a Data Flow language. It is very well integrated with Python. It was initially
developed by Yahoo!.
Hive:
Apache Hive is another high-level query language and data warehouse
infrastructure built on top of Hadoop for providing data summarization, query
and analysis. It was initially developed by Facebook and later made open source.
If you want to become a big data analyst, these two high level languages are a must know!!
Data Storage:
Hbase:
Apache HBase is a NoSQL database built for hosting large tables with
billions of rows and millions of columns on top of Hadoop commodity
hardware machines. Use Apache HBase when you need random, real-time
read/write access to your Big Data.
Features:
Cassandra:
Cassandra is a NoSQL database designed for linear scalability and high availability. Cassandra
is based on key-value model. Developed by Facebook and known for faster response to
queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.
HCatalog is a table management layer which provides integration of Hive metadata for other
Hadoop applications. It enables users with different data processing tools, like Apache Pig,
Apache MapReduce and Apache Hive, to more easily read and write data.
Features:
Lucene:
Features:
Hama:
Features:
Crunch:
Apache crunch is built for pipelining MapReduce programs which are simple and efficient.
This framework is used for writing, testing and running MapReduce pipelines.
Features:
• Developer focused.
• Minimal abstractions
• Flexible data model.
Avro:
Apache Sqoop:
Features:
Apache Flume:
Features:
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Ambari:
Features:
Apache Zookeeper:
Features:
• Serialization
• Atomicity
• Reliability
• Simple API
Apache Oozie:
Features:
Q ) Hadoop HDFS Data Read and Write Operations(anatomy of file write and
read) ?
To write a file in HDFS, a client first needs to interact with the NameNode (master). The
NameNode provides the addresses of the DataNodes (slaves) on which the client will start
writing the data. The client writes data directly to the DataNodes, and the DataNodes then
create the data write pipeline.
The first DataNode copies the block to another DataNode, which in turn copies it to the third
DataNode. Once the replicas of the block are created, the acknowledgment is sent back.
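As a rough illustration of what the client side of this write path looks like, here is a hedged Java sketch using the HDFS FileSystem API (the path name is made up for the example). The NameNode lookup, the DataNode pipeline and the replication acknowledgements described above all happen behind these few calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle to the distributed file system

        Path file = new Path("/user/hadoop/demo.txt");   // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");              // data flows through the DataNode write pipeline
        }

        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());           // blocks are fetched back from the DataNodes
        }
        fs.close();
    }
}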
Data input :-
The two classes that support data input in MapReduce are InputFormat and RecordReader.
The InputFormat class is consulted to determine how the input data should be partitioned
for the map tasks, and the RecordReader performs the reading of data from the inputs.
INPUT FORMAT :-
Every job in MapReduce must define its inputs according to contracts specified in
the InputFormat abstract class. InputFormat implementers must fulfill three contracts:
first, they describe type information for map input keys and values; next, they specify
how the input data should be partitioned; and finally, they indicate the RecordReader
instance that should read the data from source.
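A minimal sketch of such an implementer is shown below (the class name is hypothetical, and it simply reuses Hadoop's built-in LineRecordReader); it is only meant to show where the three contracts appear, not a production InputFormat. The generic parameters fix the map input key/value types, isSplitable controls partitioning, and createRecordReader names the reader that pulls records from the source.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class NonSplittableTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // contract 2: each file becomes exactly one split (one map task)
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new LineRecordReader(); // contract 3: delegate record reading to the built-in line reader
    }
}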
RECORDREADER :-
MapReduce
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed
processing of large data sets on computing clusters. Apache Hadoop is an open-source
framework that allows to store and process big data in a distributed environment across
clusters of computers using simple programming models. MapReduce is the core
component for data processing in Hadoop framework. In layman’s term Mapreduce helps
to split the input data set into a number of parts and run a program on all data parts
parallel at once. The term MapReduce refers to two separate and distinct tasks. The first is
the map operation, which takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
The Reducer node processes all the tuples such that all the pairs with same key are counted
and the count is updated as the value of that specific key. In the example there are two pairs
with the key ‘Bear’ which are then reduced to single tuple with the value equal to the count.
All the output tuples are then collected and written in the output file.
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Spilts & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of Map and Reduce tasks, both) is controlled by
two types of entities called a
1. Jobtracker : Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers : Acts like slaves, each of them performing the job
• A job is divided into multiple tasks which are then run on multiple data nodes in a
cluster.
• It is the responsibility of the JobTracker to coordinate the activity by scheduling tasks to
run on different data nodes.
• Execution of an individual task is then looked after by the TaskTracker, which resides on
every data node executing part of the job.
• The TaskTracker's responsibility is to send the progress report to the JobTracker.
• In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker so as
to notify it of the current state of the system.
• Thus the JobTracker keeps track of the overall progress of each job. In the event of task
failure, the JobTracker can reschedule it on a different TaskTracker.
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Reduce Function – Takes the output from Map as an input and combines those data
tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order
to group them in the “Reduce Phase”, the similar KEY data should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – The last phase where all the data (individual result sets from each
cluster) is combined together to form a result.
Now Let’s See the Word Count Program in Java
Fortunately we don’t have to write all of the above steps; we only need to write the splitting
parameter, the Map function logic, and the Reduce function logic. The rest of the steps
will execute automatically.
Make sure that Hadoop is installed on your system with the Java JDK.
Steps
Step 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish
Step 3. Right Click on Package > New > Class (Name it - WordCount)
Step 4. Add Following Reference Libraries –
Right Click on Project > Build Path > Add External Archives
/usr/lib/hadoop-0.20/hadoop-core.jar
/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException, InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
Explanation
The program consists of 3 classes: the driver class WordCount (which contains the main
method and job configuration), the mapper class MapForWordCount, and the reducer class
ReduceForWordCount.
To run this on Hadoop, export the project as a JAR file, open the terminal and submit the
job with the hadoop jar command.
Q) What is Serialization?
Serialization is the process of translating data structures or object state into a binary or textual
form in order to transport the data over a network or to store it on some persistent storage. Once
the data is transported over the network or retrieved from the persistent storage, it needs to be
deserialized again. Serialization is termed marshalling and deserialization is termed
unmarshalling.
Avro is one of the preferred data serialization systems because of its language neutrality.
Due to lack of language portability in Hadoop writable classes, Avro becomes a natural
choice because of its ability to handle multiple data formats which can be further processed
by multiple languages.
Avro is also very much preferred for serializing the data in Hadoop.
It uses JSON(JavaScript Object Notation) for defining data types and protocols and serializes
data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide
both a serialization format for persistent data, and a wire format for communication
between Hadoop nodes, and from client programs to the Hadoop services.
By this, we can define Avro as a file format introduced with Hadoop to store data in a
predefined format. This file format can be used in any of Hadoop's tools like Pig and
Hive.
Generally in distributed systems like Hadoop, the concept of serialization is used
for Interprocess Communication and Persistent Storage.
Interprocess Communication
To establish the interprocess communication between the nodes connected in a
network, RPC technique was used.
RPC used internal serialization to convert the message into binary format before
sending it to the remote node via network. At the other end the remote system
deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows −
o Compact − To make the best use of network bandwidth, which is the most
scarce resource in a data center.
Persistent Storage
Persistent Storage is a digital storage facility that does not lose its data with the loss of
power supply. For example - Magnetic disks and Hard Disk Drives.
Writable Interface
This is the interface in Hadoop which provides methods for serialization and
deserialization. The following table describes the methods −
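As a small illustration of these two methods, the sketch below (not from the notes) serializes an IntWritable into a byte array with write(DataOutput) and rebuilds it with readFields(DataInput), which is exactly the round trip used for RPC messages and persistent storage.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(42);

        // Serialize (marshal) the value into a byte array.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize (unmarshal) it back into a fresh object.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get()); // prints 42
    }
}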
WritableComparable Interface
It is the combination of Writable and Comparable interfaces. This interface
inherits Writable interface of Hadoop as well as Comparable interface of Java. Therefore it
provides methods for data serialization, deserialization, and comparison.
In addition to these classes, Hadoop supports a number of wrapper classes that implement
WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy
of Hadoop serialization is given below –
Constructors
S.No.  Summary
1      IntWritable()
Methods
S.No.  Summary
1      int get() – Using this method you can get the integer value present in the current object.
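To show how the pieces fit together, here is a hedged sketch of a custom key type (the EmployeeKey name and fields are invented for the example) that implements WritableComparable: write/readFields give Hadoop the serialization it needs, and compareTo supplies the ordering used during the sort.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class EmployeeKey implements WritableComparable<EmployeeKey> {
    private String name = "";
    private int salary;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);      // serialize fields in a fixed order
        out.writeInt(salary);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();     // read them back in exactly the same order
        salary = in.readInt();
    }

    @Override
    public int compareTo(EmployeeKey other) {
        return Integer.compare(this.salary, other.salary); // sort keys by salary
    }
}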
Name Node
Functions of NameNode:
It is the master daemon that maintains and
manages the DataNodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. the location of
blocks stored, the size of the files, permissions, hierarchy, etc. There are two files
associated with the metadata: FsImage and EditLogs.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike NameNode, DataNode is a commodity
hardware, that is, a non-expensive system which is not of high quality or high-availability.
The DataNode is a block server that stores the data in the local file ext3 or ext4.
Functions of DataNode:
These are slave daemons or process which runs on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s
clients.
They send heartbeats to the NameNode periodically to report the overall health of
HDFS, by default, this frequency is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode. The Secondary NameNode works concurrently with the primary NameNode
as a helper daemon. And don’t be confused about the Secondary NameNode being
a backup NameNode because it is not.
Functions of Secondary NameNode:
The Secondary NameNode is one which constantly reads all the file systems and
metadata from the RAM of the NameNode and writes it into the hard disk or the file
system.
It is responsible for combining the EditLogs with FsImage from the NameNode.
Blocks:
Replication Management:
Ex:
Copy the sampletext1.csv file from HDFS to the local file system
hadoop fs -get /user/hadoop/sampletext1.csv localfile
6. View the content of a file
Syntax: hadoop fs -cat <path[filename]>
Description: used to display the content of the HDFS file on your stdout
Ex:
Show the contents of the “cmd.txt” file
hadoop fs -cat /user/hadoop/cmd.txt
Ex:
Copy “text1.txt” to directory “kumar”
Ex:
Copy a file to HDFS
hadoop fs -copyFromLocal localfile hdfs://host:port/user/hadoop/
Ex:
Copy the “text1.txt” file to a local directory from HDFS
hadoop fs -copyToLocal hdfs://host:port/user/hadoop/text1.txt localdirpath
Ex:
Move “text1.txt” to kumar
hadoop fs -mv /user/hadoop/text1.txt /user/hadoop/kumar
Ex:
Delete the “text5.txt” file from HDFS
hadoop fs -rm /user/hadoop/text5.txt
Ex:
Delete the empty directory “sairam”
hadoop fs -rmdir /user/hadoop/sairam
Ex:
View the disk usage of the “sampletext1.csv” file
hadoop fs -du /user/hadoop/sampletext1.csv
Ex:
hadoop fs -expunge
16. Collect specific information about the file or directory
Syntax: hadoop fs -stat [format] <path> ...
Description: formatting options
%b size of file in bytes
%g group name
%n filename
%r replication factor
%u user name of owner
%y modification date
%o HDFS block size in bytes (128 MB by default)
Ex:
Change the replication factor of files within the directory
hadoop fs -setrep <replication factor> /user/hadoop/hdfsdir
Ex:
Move files “text1.txt” and “text2.txt” from the local file system to HDFS
hadoop fs -moveFromLocal text1.txt text2.txt /user/hadoop/
conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Execution
Format a new distributed-filesystem:
$ bin/hadoop namenode -format
RDBMS is best suited for dynamic data analysis and where fast responses are expected,
but Hive is suited for data warehouse applications, where relatively static data is
analyzed, fast response times are not required, and the data is not changing rapidly.
Hive is very easily scalable at low cost, but an RDBMS is not that scalable, and scaling it
up is very costly.
Q ) Explain Hive ?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Features of Hive
It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
This component diagram contains different units. The following table describes each unit:
Step 3: Edit the “.bashrc” file to update the environment variables for the user.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set HIVE_HOME
export HIVE_HOME=/home/edureka/apache-hive-2.1.0-bin
export PATH=$PATH:/home/edureka/apache-hive-2.1.0-bin/bin
Q) Explain HiveQL ?
HiveQL is the Hive query language
Hadoop is an open source framework for the distributed processing of large amounts of
data across a cluster. It relies upon the MapReduce paradigm to reduce complex tasks
into smaller parallel tasks that can be executed concurrently across multiple machines.
However, writing MapReduce tasks on top of Hadoop for processing data is not for
everyone since it requires learning a new framework and a new programming paradigm
altogether. What is needed is an easy-to-use abstraction on top of Hadoop that allows
people not familiar with it to use its capabilities as easily.
Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on
top of Hadoop. Hive achieves this task by converting queries written in HiveQL into
MapReduce tasks that are then run across the Hadoop cluster to fetch the desired
results
Hive is best suited for batch processing large amounts of data (such as in data
warehousing) but is not ideally suitable as a routine transactional database because of
its slow response times (it needs to fetch data from across a cluster).
A common task for which Hive is used is the processing of logs of web servers. These
logs have a regular structure and hence can be readily converted into a format that
Hive can understand and process
Hive query language (HiveQL) supports SQL features like CREATE tables, DROP
tables, SELECT ... FROM ... WHERE clauses, Joins (inner, left outer, right outer and
outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, union and
many useful functions on primitive as well as complex data types. Metadata browsing
features such as list databases, tables and so on are also provided. HiveQL does have
limitations compared with traditional RDBMS SQL. HiveQL allows creation of new
tables in accordance with partitions(Each table can have one or more partitions in
Hive) as well as buckets (The data in partitions is further distributed as buckets)and
allows insertion of data in single or multiple tables but does not allow deletion or
updating of data
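For illustration, the sketch below shows one common way to run HiveQL from a Java program through the Hive JDBC driver and HiveServer2; the connection URL, credentials and the employees table are assumptions made for the example, not part of the notes. Interactively, the same queries can be typed at the hive> prompt shown later in these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");  // assumed URL/user
             Statement stmt = con.createStatement()) {

            // Ordinary SQL-like HiveQL; "employees" is a hypothetical table.
            try (ResultSet rs = stmt.executeQuery(
                         "SELECT name, salary FROM employees WHERE salary > 40000")) {
                while (rs.next()) {
                    // Hive converts the query into MapReduce jobs and streams back the rows.
                    System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
                }
            }
        }
    }
}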
2. DESCRIBE database
- shows the directory location for the database.
4. Alter Database
You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be
changed, including its name and directory location:
The EXTERNAL keyword tells Hive this table is external and the LOCATION …
clause is required to tell Hive where it’s located. Because it’s external, Hive does not
assume it owns the data, so dropping the table does not delete the data (only the table’s
metadata is removed).
Load data
Select data
output is
When you select columns that are one of the collection types, Hive uses JSON (Java-
Script Object Notation) syntax for the output. First, let’s select the subordinates, an
ARRAY, where a comma-separated list surrounded with […] is used.
hive> SELECT name, subordinates FROM employees;
output is
The deductions column is a MAP, where the JSON representation for maps is used, namely a
comma-separated list of key:value pairs, surrounded with {…}:
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
order-id  cust-id
1         12
2         15
3         16
4         20
5         25
The above table contains the order id (order-id) and the corresponding cust id (cust-id).
Map-Side Joins:
In a join operation in Hive, the job is assigned to a MapReduce task that consists
of two stages: map and reduce. At the map stage, the data is read from the join tables and the
‘join key’ and ‘join value’ pairs are written to an intermediate file. This intermediate file is
then sorted and merged in the shuffle stage. At the reduce stage, the sorted result is taken as
input, and the joining task is completed.
A map-side join is similar to the normal join, but here all the work is performed by the
mapper alone. The map-side join is preferred when one of the tables is small. Suppose you have
two tables, of which one is a small table. When a MapReduce task is submitted, a MapReduce
local task is first created to read the data of the small table from HDFS and store it into
an in-memory hash table. After reading the data, the MapReduce local task serializes the in-
memory hash table into a hash table file.
In the next stage, the main join MapReduce task runs and moves the data from the hash table
file to the Hadoop distributed cache, which supplies these files to each mapper's local disk.
Sub queries:
A Query present within a Query is known as a sub query. The main query will depend on the
values returned by the subqueries.
Subqueries can be classified into two types
Subqueries in FROM clause
Subqueries in WHERE clause
When to use:
To get a particular value combined from two column values from different tables
Dependency of one table values on other tables
Comparative checking of one column values from other tables
Syntax:
Subquery in FROM clause
SELECT <column names 1, 2…n>From (SubQuery) <TableName_Main >
Subquery in WHERE clause
SELECT <column names 1, 2…n> From<TableName_Main>WHERE col1 IN (SubQuery);
Example:
SELECT col1 FROM (SELECT a+b AS col1 FROM t1) t2
Here t1 and t2 are table names. The part in parentheses is the subquery performed on table t1.
Here a and b are columns that are added in the subquery and assigned to col1. col1 is the
column value present in the main table; the column "col1" produced by the subquery is
equivalent to the main query's column col1.
HBase can store massive amounts of data from terabytes to petabytes. The tables present in
HBase consists of billions of rows having millions of columns.
HBase is a column-oriented database and can manage structured and un-structured data. It is
a NoSQL tool for accessing huge amounts of data through a non-relational data model.
HBase is a column-oriented NoSQL database. Although it looks similar to a relational
database, which contains rows and columns, it is not a relational database. Relational
databases are row-oriented while HBase is column-oriented. So, let us first understand the
difference between column-oriented and row-oriented databases.
If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo,
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases data is stored on the basis of rows or tuples as you can see above.
While the column-oriented databases store this data as:
1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
In a column-oriented database, all the column values are stored together: the first column's
values are stored together, then the second column's values, and the data in other columns is
stored in a similar manner.
• When the amount of data is very huge, like in terms of petabytes or exabytes, we use
the column-oriented approach, because the data of a single column is stored together and
can be accessed faster.
• The row-oriented approach comparatively handles a smaller number of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
• When we need to process and analyze a large set of semi-structured or unstructured
data, we use the column-oriented approach, such as in applications dealing with Online
Analytical Processing, like data mining and data warehousing.
Region
A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the
columns of a column family are stored in one region. Each region contains the rows in a
sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing
and executing read and write operations on that set of regions.
A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
A Region has a default size of 256MB which can be configured according to the need.
A Group of regions is served to the clients by a Region Server.
A Region Server can serve approximately 1000 regions to the client.
Now, starting from the top of the hierarchy, I would first like to explain to you the HMaster
Server, which acts similarly to a NameNode in HDFS.
Block Cache – This is the read cache. The most frequently read data is stored in the read cache,
and whenever the block cache is full, the least recently used data is evicted.
MemStore- This is the write cache and stores new data that is not yet written to the disk.
Every column family in a region has a MemStore.
Write Ahead Log (WAL) is a file that stores new data that is not persisted to permanent
storage.
HFile is the actual storage file that stores the rows as sorted key values on a disk.
HMaster
As in the below image, you can see the HMaster handles a collection of Region Servers which
reside on DataNodes. Let us understand how HMaster does that.
HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region servers as you can see in the above image.
It coordinates and manages the Region Servers (similar to how the NameNode manages
DataNodes in HDFS).
It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.
ZooKeeper
Hashing
When you have data which is represented by a string identifier, that is a good candidate
for your HBase table row key. Use a hash of that string identifier as the row key instead
of the raw string. For example, if you are storing user data that is identified by user IDs,
then a hash of the user ID is a better choice for your row key.
Timestamps
When you retrieve data based on the time when it was stored, it is best to include the
timestamp in your row key. For example, if you are storing machine logs identified
by a machine number, append the timestamp to the machine number when designing the
row key, e.g. machine001#1435310751234.
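Putting the two row-key ideas together, here is a hedged Java sketch using the HBase client API; the table name, column family and machine id are invented for the example. The row key combines a hash of the machine id with the event timestamp, as suggested above.

import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MachineLogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("machine_logs"))) { // hypothetical table

            // Row key = hex MD5 hash of the machine id, then '#', then the event timestamp.
            String machineId = "machine001";
            byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(machineId));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            String rowKey = hex + "#" + System.currentTimeMillis();

            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("message"),
                          Bytes.toBytes("disk almost full"));
            table.put(put);                                      // random, real-time write to HBase
        }
    }
}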
The ZooKeeper framework was originally built at Yahoo! for easier accessing of applications
but, later on, ZooKeeper was used for organizing services used by distributed frameworks
like Hadoop, HBase, etc., and Apache ZooKeeper became a standard. It was designed to be a
vigorous service that enabled application developers to focus mainly on their application
logic rather than coordination.
In a distributed environment, coordinating and managing a service has become a difficult
process. Apache ZooKeeper was used to solve this problem because of its simple
architecture, as well as API, that allows developers to implement common coordination
tasks like electing a master server, managing group membership, and managing metadata.
Apache ZooKeeper is used for maintaining centralized configuration information, naming,
providing distributed synchronization, and providing group services in a simple interface so
that we don’t have to write it from scratch. Apache Kafka also uses ZooKeeper to manage
configuration. ZooKeeper allows developers to focus on the core application logic, and it
implements various protocols on the cluster so that the applications need not implement
them on their own.
ZooKeeper Architecture
Apache ZooKeeper works on the Client–Server architecture in which clients are machine
nodes and servers are nodes.
The following figure shows the relationship between the servers and their clients. In this, we
can see that each client sources the client library, and further they communicate with any of
the ZooKeeper nodes.
Server – Provides all services to clients and gives an acknowledgement to the client to inform
that the server is alive.
Leader – If any of the server nodes fails, this server node performs automatic recovery.
Follower – It is a server node which follows the instructions given by the leader.
Working of Apache ZooKeeper
The first thing that happens as soon as the ensemble (a group of ZooKeeper servers) starts is,
it waits for the clients to connect to the servers.
After that, the clients in the ZooKeeper ensemble will connect to one of the nodes. That node
can be any of a leader node or a follower node.
Once the client is connected to a particular node, the node assigns a session ID to the client
and sends an acknowledgement to that particular client.
If the client does not get any acknowledgement from the node, then it resends the message to
another node in the ZooKeeper ensemble and tries to connect with it.
On receiving the acknowledgement, the client makes sure that the connection is not lost by
sending the heartbeats to the node at regular intervals.
Finally, the client can perform functions like read, write, or store the data as per the need.
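As a small illustration of this client flow, the sketch below uses the standard ZooKeeper Java client to connect to an ensemble (the localhost address and znode path are assumptions), create a znode and read its data back; the session heartbeats are handled by the client library.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkClientDemo {
    public static void main(String[] args) throws Exception {
        // The session is established with one node of the ensemble; heartbeats keep it alive.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event ->
                System.out.println("ZooKeeper event: " + event.getState()));

        String path = "/demo-config";                       // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        byte[] data = zk.getData(path, false, null);        // read the stored configuration back
        System.out.println(new String(data));
        zk.close();
    }
}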
Updating the Node’s Status: Apache ZooKeeper is capable of updating every node, which
allows it to store updated information about each node across the cluster.
Managing the Cluster: This technology can manage the cluster in such a way that the status
of each node is maintained in real time, leaving lesser chances for errors and ambiguity.
Naming Service: ZooKeeper attaches a unique identification to every node which is quite
similar to the DNA that helps identify it.
Automatic Failure Recovery: Apache ZooKeeper locks the data while modifying which
helps the cluster recover it automatically if a failure occurs in the database.
One of the ways in which we can communicate with the ZooKeeper ensemble is by using the
ZooKeeper Command Line Interface (CLI). This gives us the feature of using various options,
and the CLI is also relied upon heavily for debugging.
Applications of Zookeeper
a. Apache Solr
For leader election and centralized configuration, Apache Solr uses Zookeeper.
b. Apache Mesos
A tool which offers efficient resource isolation and sharing across distributed applications,
as a cluster manager is what we call Apache Mesos. Hence for the fault-tolerant replicated
master, Mesos uses ZooKeeper.
c. Yahoo!
As we all know ZooKeeper was originally built at “Yahoo!”, so for several requirements like
robustness, data transparency, centralized configuration, better performance, as well as
coordination, they designed Zookeeper.
d. Apache Hadoop
As we know behind the growth of Big Data industry, Apache Hadoop is the driving force. So,
for configuration management and coordination, Hadoop relies on ZooKeeper.
Multiple ZooKeeper servers support large Hadoop clusters; in order to retrieve
and update synchronization information, each client machine communicates with one of the
ZooKeeper servers. An example is the Human Genome Project: as there are terabytes of data in
the Human Genome Project, the Hadoop MapReduce framework is used to analyze the dataset and
find interesting facts for human development.
e. Apache HBase
An open source, distributed, NoSQL database which we use for real-time read/write access
of large datasets is what we call Apache HBase. While it comes to Zookeeper, installation
of HBase distributed application depends on a running ZooKeeper cluster.
i. Telecom
One of them is the telecom industry, as it stores billions of mobile call records and accesses
them in real time. HBase is used to process all the records in real time, easily and efficiently.
ii. Social network
Social networks like Twitter, LinkedIn, and Facebook receive huge volumes of data on a daily
basis, so HBase is also used to find recent trends and other interesting facts.
f. Apache Accumulo
Moreover, on top of Apache ZooKeeper (and Apache Hadoop), another sorted distributed
key/value store “Apache Accumulo” is built.
g. Neo4j
Neo4j is a distributed graph database which uses ZooKeeper for write master selection and
read slave coordination.
h. Cloudera