2.2. Components of Hadoop - Analysing
● The Hadoop ecosystem is a framework that helps solve Big Data problems.
● It comprises different components and services (ingesting, storing, analyzing, and
maintaining data).
● Most of the services available in the Hadoop ecosystem supplement the four core
components of Hadoop: HDFS, YARN, MapReduce, and Common.
1. HDFS (Hadoop Distributed File System)
● Hadoop Distributed File System runs on top of the existing file systems on
each node in a Hadoop cluster.
● Hadoop Distributed File System is a block-structured file system where each
file is divided into blocks of a predetermined size.
● Data in a Hadoop cluster is broken down into smaller units (called blocks)
and distributed throughout the cluster. By default, each block is duplicated
twice (for a total of three copies), with the two replicas stored on two nodes
in a different rack elsewhere in the cluster.
● Since the data has a default replication factor of three, it is highly available
and fault-tolerant.
● If a copy is lost, HDFS will automatically re-replicate it elsewhere in the
cluster, ensuring that the threefold replication factor is maintained.
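The block and replication arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the 128 MB block size is an assumption (a common HDFS default, configurable per cluster), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: a common HDFS default block size
REPLICATION_FACTOR = 3   # the default replication factor described above


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across the cluster.
    replicas = blocks * REPLICATION_FACTOR
    # A partial last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Under these assumptions, a 1000 MB file needs 8 blocks, and losing any single copy still leaves two replicas from which HDFS can re-replicate.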
Big data is mostly generated by social media websites, sensors, devices, video/audio,
networks, log files, and the web, much of it in real time and on a very large scale.
Big data analytics is the process of examining these large volumes of varied data
in an effort to uncover hidden patterns, unknown correlations, and other useful information.
Advantages of Big Data Analysis
Big data analysis allows market analysts, researchers and business users to develop deep insights
from the available data, resulting in numerous business advantages. Business users can analyze
the data precisely, and the early indicators that emerge from this analysis can be highly
valuable to the business. Some example use cases are as follows:
● Whenever users browse travel portals or shopping sites, search for flights or hotels,
or add an item to their cart, ad targeting companies can analyze this wide variety
of data and activity and provide better recommendations for offers, discounts and
deals based on the user's browsing history and product history.
● In the telecommunications space, if customers are moving from one service provider
to another, analyzing the huge volume of call data records can unearth the issues
those customers faced. Issues could be as wide-ranging as a significant increase in
call drops or network congestion problems. Based on this analysis, a telecom company
can identify whether it needs to place a new tower in a particular urban area or
revise its marketing strategy for a particular region because a new player has
entered there. That way customer churn can be proactively minimized.
Now let’s look at a case study for analyzing stock market data. We will evaluate various big data
technologies for analyzing a sample ‘New York Stock Exchange’ dataset, calculate the covariance
for this stock data, and aim to solve both the storage and processing problems that come with
a huge volume of data.
Covariance is a statistical measure of how one investment moves in relation to another: it
represents the degree to which two stocks or financial instruments move together or apart.
With covariance, investors can seek out investment options that suit their respective
risk profiles.
A positive covariance means that asset returns moved together. If investment instruments or
stocks tend to be up or down during the same time periods, they have positive covariance.
A negative covariance means returns move inversely. If one investment instrument tends to be up
while the other is down, they have negative covariance.
This helps a stock broker recommend stocks to customers.
Dataset: The sample dataset provided is a comma-separated values (CSV) file named
‘NYSE_daily_prices_Q.csv’ that contains stock information such as daily quotes, opening
price, highest price, etc. for the New York Stock Exchange.
The dataset provided is just a small sample of around 3500 records, but in a real
production environment the stock data could run into GBs or TBs. Our solution must
therefore scale to a real production environment.
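For a first look at the small sample on a single machine, a row-by-row reader is enough. The sketch below is only illustrative: the column names and their order are assumptions (the actual file layout is not shown here), the rows are made up, and `highest_price_by_symbol` is a hypothetical helper.

```python
import csv
import io

# Hypothetical rows; column names and order are assumptions, not the real file layout.
SAMPLE = """stock_symbol,date,stock_price_open,stock_price_high
QRR,2010-01-04,10.10,10.50
QTM,2010-01-04,2.30,2.45
QRR,2010-01-05,10.40,10.80
"""


def highest_price_by_symbol(lines):
    """Stream rows one at a time and track the highest price seen per symbol."""
    best = {}
    for row in csv.DictReader(lines):
        symbol = row["stock_symbol"]
        high = float(row["stock_price_high"])
        if high > best.get(symbol, float("-inf")):
            best[symbol] = high
    return best


result = highest_price_by_symbol(io.StringIO(SAMPLE))
print(result)
```

Reading one row at a time keeps memory use flat for a local file, but it does not address storage or parallel processing, which is why the case study turns to Hadoop-based tools for production-sized data.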
Let’s have a look at the existing open source Hadoop data analysis technologies that can
analyze large volumes of frequently generated stock data.
Which Data Analysis Technologies should be used?
Based on these criteria and comparing with the above analysis of features of these technologies,
we can conclude:
● If we use MapReduce, complex business logic needs to be written to handle the
joins. We would have to think in terms of map and reduce and decide which code
goes on the map side and which on the reduce side. A lot of development effort
goes into deciding how map-side and reduce-side joins will take place. We would
not be able to map the data into a schema, so everything has to be handled
programmatically.
● If we use Pig, we would not be able to partition the data, which is useful for
processing a subset of the data by a particular stock symbol, date or month.
In addition, Pig is more of a scripting language, better suited to prototyping
and rapidly developing MapReduce-based jobs. It also does not provide a way to
map our data into an explicit schema, which seems more suitable for this case
study.
● Hive not only provides a familiar programming model for people who know SQL, it
also eliminates a lot of boilerplate and sometimes tricky coding that we would have
to do in MapReduce. If we apply Hive to the stock data, we can leverage the SQL
capabilities of Hive-QL and manage the data in a defined schema. This reduces
development time, and joins between stock data can be expressed in Hive-QL, which
is considerably harder in MapReduce. Hive also provides a Thrift server, through
which we can submit Hive queries from anywhere to the Hive server, which in turn
executes them. The Hive compiler converts Hive-QL queries into MapReduce jobs,
freeing programmers from complex programming and letting them focus on the
business problem.
So based on the above discussion, Hive seems the perfect choice for the aforementioned case
study.
Problem Solution with Hive
Apache Hive is a data warehousing package built on top of Hadoop for providing data
summarization, query and analysis. The query language being used by Hive is called Hive-QL
and is very similar to SQL.
Now that we have settled on the data analysis technology, it’s time to get your
feet wet by deriving a solution for the case study.
Follow the steps mentioned in my previous blog How to Configure Hive On Cloudera:
Use the Hive ‘create table’ command to create a Hive table for the provided CSV dataset:
This creates a Hive table named ‘NYSE’ whose rows are delimited and whose fields are
terminated by commas. The schema is created in the embedded Derby database configured in
the Hive setup. By default, Hive stores metadata in an embedded Apache Derby database, but
it can be configured to use other databases such as MySQL, SQL Server, or Oracle.
● Load CSV Data into Hive Table
Use the following Hive command to load the CSV data file into the Hive table:
This loads the dataset from the specified location into the Hive table ‘NYSE’ created above.
The data is stored in the Hive-controlled file system namespace on HDFS, so that it can be
batch processed further by MapReduce jobs or Hive queries.
We can calculate the covariance of the provided stock dataset for a given year using a
Hive select query built around the following expression:
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) -
(AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH)))
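The expression is the covariance identity AVG(a*b) - AVG(a)*AVG(b) applied to the high prices of two stocks. As a rough check of the logic outside Hive, the Python sketch below pairs two symbols by date and applies the same expression per month. The records are made-up values, `monthly_covariance` is a hypothetical helper, and joining on date and grouping by month are assumptions based on the description of the results, not the original query.

```python
from collections import defaultdict

# Hypothetical (symbol, date, high price) records; values are made up for illustration.
RECORDS = [
    ("QRR", "2010-01-04", 10.0), ("QTM", "2010-01-04", 2.0),
    ("QRR", "2010-01-05", 12.0), ("QTM", "2010-01-05", 3.0),
    ("QRR", "2010-02-01", 11.0), ("QTM", "2010-02-01", 4.0),
    ("QRR", "2010-02-02", 13.0), ("QTM", "2010-02-02", 2.0),
]


def mean(values):
    return sum(values) / len(values)


def monthly_covariance(records, symbol_a, symbol_b):
    """Covariance of two symbols' high prices per month: AVG(a*b) - AVG(a)*AVG(b)."""
    highs = {(sym, date): high for sym, date, high in records}
    pairs_by_month = defaultdict(list)
    for (sym, date), high_a in highs.items():
        # Pair the two series on date, similar to joining the table with itself on date.
        if sym == symbol_a and (symbol_b, date) in highs:
            pairs_by_month[date[:7]].append((high_a, highs[(symbol_b, date)]))
    result = {}
    for month, pairs in sorted(pairs_by_month.items()):
        a = [x for x, _ in pairs]
        b = [y for _, y in pairs]
        result[month] = mean([x * y for x, y in pairs]) - mean(a) * mean(b)
    return result


cov = monthly_covariance(RECORDS, "QRR", "QTM")
print(cov)
```

With this made-up data the January value is positive (both highs rise together) and the February value is negative (one rises while the other falls). Note that this form is the population covariance, dividing by the number of observations rather than by one less.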
The covariance results from the above stock data analysis are as follows:
The covariance has been calculated between pairs of different stocks for each month of
the available year.
From the covariance results, stock brokers or fund managers can make the following
recommendations:
● Stocks QRR and QTM show positive covariance more often than negative
covariance, so there is a high probability that they will move in the same
direction.
● Stocks QRR and QXM mostly show negative covariance, so there is a greater
probability of their prices moving in opposite directions.
● Stocks QTM and QXM show positive covariance in most months, so they tend to
move in the same direction most of the time.
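The reading of the results above amounts to counting how often each sign occurs across the months. A minimal sketch of that rule follows; the covariance values are hypothetical and not taken from the dataset, and `summarize_pair` is an illustrative helper.

```python
def summarize_pair(monthly_cov):
    """Count positive and negative monthly covariances and label the dominant sign."""
    positive = sum(1 for c in monthly_cov if c > 0)
    negative = sum(1 for c in monthly_cov if c < 0)
    if positive > negative:
        label = "tend to move together"
    elif negative > positive:
        label = "tend to move inversely"
    else:
        label = "no clear tendency"
    return positive, negative, label


# Hypothetical monthly covariance values, not results from the dataset.
summary = summarize_pair([0.5, -1.0, 0.8, 0.2])
print(summary)
```

A simple sign count ignores the magnitude of each monthly value, so it is best treated as a first screening step rather than a recommendation on its own.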
Similarly, we can analyze other big data use cases, explore the possible solutions for each
one, and then use a comparison chart to narrow down the best solution.