
BIG DATA ANALYTICS UNIT-III

UNIT III
BIG DATA FROM DIFFERENT PERSPECTIVES

Syllabus

Big data from business Perspective: Introduction of big data - Characteristics of big data - Data in
the warehouse and data in Hadoop - Importance of Big data - Big data Use cases: Patterns for Big
data deployment. Big data from Technology Perspective: History of Hadoop - Components of
Hadoop - Application Development in Hadoop - Getting your data into Hadoop - Other Hadoop
Components.

Big data from business Perspective: Introduction to Big Data

Fig. Big Data layout


1. Big Data is becoming one of the most talked about technology trends today.
2. The real challenge for large organizations is to get the maximum out of the data already
available and to predict what kind of data to collect in the future.
3. How to take the existing data and make it meaningful so that it provides accurate insight
into past data is one of the key discussion points in many executive meetings in
organizations.
4. With the explosion of data, the challenge has gone to the next level, and Big Data is now
becoming a reality in many organizations.
5. Although the goal of every organization and expert is the same, namely to get the maximum
out of the data, the route and the starting point are different for each organization and expert.

6. As organizations evaluate and architect Big Data solutions, they are also learning about
the ways and opportunities related to Big Data.
7. There is no single solution to Big Data, and there is no single vendor that can claim to
know all about Big Data.
8. Big Data is a very broad concept, and there are many players: different architectures,
different vendors, and different technologies.
9. The three Vs of Big Data are Velocity, Volume, and Variety.
******************************

Big Data Characteristics

Figure 1: Characteristics of Big Data

Volume


 Data storage has grown exponentially, as data is now much more than just text.
 Data can be found in the form of videos, music, and large images on our social
media channels.
 It is very common for enterprises to have terabytes and even petabytes of storage.
 As the database grows, the applications and architecture built to support the data need
to be re-evaluated quite often.
 Sometimes the same data is re-evaluated from multiple angles, and even though the
original data is the same, the newly found intelligence creates an explosion of data.
 This big volume indeed represents Big Data.

Velocity
 The growth of data and the explosion of social media have changed how we look at data.
 There was a time when we believed that yesterday's data was recent.
 As a matter of fact, newspapers still follow that logic.
 However, news channels and radio have changed how fast we receive the news.
 Today, people rely on social media to keep them updated with the latest happenings. On social
media, a message that is even a few seconds old (a tweet, a status update, and so on) often no
longer interests users.
 They discard old messages and pay attention to recent updates. Data movement is now
almost real time, and the update window has shrunk to fractions of a second.
 This high-velocity data represents Big Data.

Variety

 Data can be stored in multiple formats: for example, in a database, Excel, CSV, or Access, or,
for that matter, in a simple text file.
 Sometimes the data is not even in the traditional format we assume; it may be in the
form of video, SMS, PDF, or something we have not thought about yet. It is the job
of the organization to arrange it and make it meaningful.
 This would be easy if all the data arrived in the same format, but that is not the case most
of the time. The real world has data in many different formats, and that is the challenge
we need to overcome with Big Data. This variety of data represents Big Data.

******************************

Data in the warehouse and data in Hadoop


1. Data warehouse: mostly ideal for analyzing structured data from various systems and
   producing insights with known and relatively stable measurements.
   Hadoop: a Hadoop-based platform is well suited to dealing with semi-structured and
   unstructured data, as well as cases where a data discovery process is needed.
2. Data warehouse: data goes through a lot of rigor to make it into the warehouse.
   Hadoop: data isn't likely to be distributed like data warehouse data.
3. Data warehouse: data in warehouses must shine with respect to quality; subsequently, it's
   cleaned up via cleansing, enrichment, matching, glossary, metadata, master data
   management, modeling, and other services before it's ready for analysis.
   Hadoop: with all the volume and velocity of today's data, there's just no way you can afford
   to spend the time and resources required to cleanse and document every piece of data
   properly, because it's just not going to be economical.
4. Data warehouse: this is an expensive process.
   Hadoop: applying the same rigor would make most Hadoop use cases cost prohibitive.
5. Data warehouse: data is going to go places and will be used in reports and dashboards
   where the accuracy of that data is key.
   Hadoop: data in Hadoop might seem of low value today, or its value non-quantified, but it
   can in fact be the key to questions yet unasked.
6. Data warehouse: data warehouse data is trusted enough to be "public."
   Hadoop: Hadoop data isn't as trusted.
7. Data warehouse: cost per compute in a traditional data warehouse is relatively high.
   Hadoop: the cost of Hadoop is low.
8. Data warehouse: for your interactive navigational needs, you'll continue to pick and choose
   sources, cleanse that data, and keep it in warehouses.
   Hadoop: lets you store all of the data in its native business object format and get value out
   of it through massive parallelism on readily available components.
9. Data warehouse: data in the warehouse can be placed in reports and dashboards.
   Hadoop: data might sit in Hadoop for a while, and when you discover its value, it might
   migrate its way into the warehouse.
10. Data warehouse: data in the warehouse is already analyzed and useful.
    Hadoop: you can get more value out of analyzing more data that may even initially seem
    unrelated.

******************************

Importance of Big data


Big Data is defined as conforming to the volume, velocity, and variety (V3) attributes that
characterize it. Big Data solutions aren't a replacement for existing warehouse solutions.
 Big Data solutions are ideal for analyzing not only raw structured data, but semi- structured
and unstructured data from a wide variety of sources.
 Big Data solutions are ideal when all, or most, of the data needs to be analyzed rather than a
sample of the data, or when a sampling of data isn't nearly as effective as a larger set of data from
which to derive analysis.
 Big Data solutions are ideal for iterative and exploratory analysis when business measures
on data are not predetermined.
When it comes to solving information management challenges using Big Data technologies, we
suggest you consider the following:
• A Big Data solution is not only going to leverage data that is not typically suitable for a traditional
warehouse environment, in massive volumes, but it's also going to give up some
of the formalities and "strictness" of that data. The benefit is that you can preserve the
fidelity of the data and gain access to mountains of information for exploration and discovery
of business insights before running it through the due diligence that you're accustomed to;
that data can then be included as a participant in a cyclic system, enriching the models in the
warehouse.
• Big Data is well suited for solving information challenges that don’t natively fit within a
traditional relational database approach for handling the problem at hand.
It’s important that you understand that conventional database technologies are an important, and
relevant, part of an overall analytic solution. In fact, they become even more vital when used in
conjunction with your Big Data platform. A good analogy here is your left and right hands; each
offers individual strengths and optimizations for a task at hand.
There exists a class of problems that doesn't natively belong in traditional databases, at least not
at first. And there's data that we're not sure we want in the warehouse, because perhaps we don't
know whether it's rich in value, it's unstructured, or it's too voluminous. In many cases, we can't find out
the value per byte of the data until after we spend the effort and money to put it into the
warehouse; but we want to be sure that data is worth saving and has a high value per byte before
investing in it.

******************************

Big Data Use Cases:

Patterns for Big Data Deployment

The best way to frame why Big Data is important is to share with you a number of our real
customer experiences regarding usage patterns they are facing (and problems they are solving)
with an IBM Big Data platform. These patterns represent great Big Data opportunities—business
problems that weren’t easy to solve before—and help to understand how Big Data can help.

IT for IT Log Analytics

 Log analytics is a common use case for an inaugural Big Data project. We like to refer to all
those logs and trace data that are generated by the operation of IT solutions as data exhaust.
 Enterprises have lots of data exhaust, and it’s pretty much a pollutant if it’s just left around
for a couple of hours or days in case of emergency and simply purged.
 Data exhaust has concentrated value, and IT shops need to figure out a way to store and
extract value from it. Some of the value derived from data exhaust is obvious and has been
transformed into value-added click-stream data that records every gesture, click, and
movement made on a web site.
 Quite simply, IT departments need logs at their disposal, and today they just can’t store
enough logs and analyze them in a cost-efficient manner, so logs are typically kept for
emergencies and discarded as soon as possible.
 Another reason why IT departments keep large amounts of data in logs is to look for rare
problems. It is often the case that the most common problems are known and easy to deal
with, but the problem that happens “once in a while” is typically more difficult to diagnose
and prevent from occurring again.

But there are more reasons why log analysis is a Big Data problem aside from its voluminous
nature. The nature of these logs is semi-structured and raw, so they aren’t always suited for
traditional database processing. In addition, log formats are constantly changing due to
hardware and software upgrades, so they can’t be tied to strict inflexible analysis paradigms.
Finally, not only do you need to perform analysis on the longevity of the logs to determine trends
and patterns and to pinpoint failures, but you need to ensure the analysis is done on all the data.
Log analytics is actually a pattern that IBM established after working with a number of
companies, including some large financial services sector (FSS) companies. We'll call this pattern
IT for IT.
IT for IT Big Data solutions are an internal use case within an organization itself. For
example, non-IT business entities often want this data provided to them as a kind of service
bureau. An internal IT for IT implementation is well suited for any organization with a large data
center footprint, especially if it is relatively complex.

The Fraud Detection Pattern


Fraud detection comes up a lot in the financial services vertical, but if you look around, you’ll find
it in any sort of claims- or transaction-based environment (online auctions, insurance claims,
underwriting entities, and so on). Pretty much anywhere some sort of financial transaction is
involved presents a potential for misuse and the ubiquitous specter of fraud. If you leverage a Big
Data platform, you have the opportunity to do more than you’ve ever done before to identify it or,
better yet, stop it. Several challenges in the fraud detection pattern are directly attributable to
solely utilizing conventional technologies.
 The most common, and recurring, theme you will see across all Big Data patterns is limits on
what can be stored as well as available compute resources to process your intentions. Without
Big Data technologies, these factors limit what can be modeled. Less data equals constrained
modeling. Highly dynamic environments commonly have cyclical fraud patterns that come and
go in hours, days, or weeks. If the data used to identify or bolster new fraud detection models
isn’t available with low latency, by the time you discover these new patterns, it’s too late and
some damage has already been done.
 Traditionally, in fraud cases, samples and models are used to identify customers that
characterize a certain kind of profile. The problem with this approach (and this is a trend that
you're going to see in a lot of these use cases) is that although it works, you're profiling a
segment and not working at the granularity of an individual transaction or person.
 Quite simply, making a forecast based on a segment is good, but making a decision based upon
the actual particulars of an individual transaction is obviously better. To do this, you need to
work with a larger set of data than is conventionally possible in the traditional approach. You
can use BigInsights to provide an elastic and cost-effective repository to establish which of the
remaining 80 percent of the information is useful for fraud modeling, and then feed newly
discovered high-value information back into the fraud model, as shown in Fig. 1.

Fig. 1: Traditional fraud detection patterns use approximately 20 percent of available data

 A modern-day fraud detection ecosystem that provides a low-cost Big Data platform for
exploratory modeling and discovery. Notice how this data can be leveraged by traditional
systems either directly or through integration into existing data quality and governance
protocols.
 Notice the addition of InfoSphere Streams (the circle by the DB2 database cylinder) as well,
which showcases the unique Big Data platform that only IBM can deliver: it’s an ecosystem
that provides analytics for data-in-motion and data-at-rest.
 We teamed with a large credit card issuer to work on a permutation of Figure 2, and they
quickly discovered that they could not only speed up the build and refresh of their fraud
detection models, but that their models were broader and more accurate because of all the
new insight.
 In the end, this customer took a process that once took about three weeks from when a
transaction hit the transaction switch until when it was actually available for their fraud
teams to work on, and turned that latency into a couple of hours.
 In addition, the fraud detection models were built on an expanded amount of data that was
roughly 50 percent broader than the previous set of data. As we can see in this example, all
of that “80 percent of the data” that we talked about not being used wasn’t all valuable in
the end, but they found out what data had value and what didn’t, in a cost effective and
efficient manner, using the BigInsights platform.

Fig. 2: A modern-day fraud detection ecosystem synergizes a Big Data platform with traditional processes

 Now, of course, once you have your fraud models built, you’ll want to put them into action
to try and prevent the fraud in the first place. Recovery rates for fraud are dismal in all
industries, so it’s best to prevent it versus discover it and try to recover the funds post-
fraud. This is where InfoSphere Streams comes into play as you can see in Figure 2.
 Typically, fraud detection works after a transaction gets stored only to get pulled out of
storage and analyzed; storing something to instantly pull it back out again feels like latency
to us. With Streams, you can apply your fraud detection models as the transaction is
happening.

The Social Media Pattern


 Perhaps the most talked about Big Data usage pattern is social media and customer
sentiment. You can use Big Data to figure out what customers are saying about you (and
perhaps what they are saying about your competition); furthermore, you can use this
newly found insight to figure out how this sentiment impacts the decisions you’re making
and the way your company engages.
 More specifically, you can determine how sentiment is impacting sales, the effectiveness or
receptiveness of your marketing campaigns, the accuracy of your marketing mix (product,
price, promotion, and placement), and so on.
 Social media analytics is a pretty hot topic, so hot in fact that IBM has built a solution
specifically to accelerate your use of it: Cognos Consumer Insights (CCI). It's a point solution
that runs on BigInsights. CCI can tell you what people are saying, how topics are
trending in social media, and all sorts of things that affect your business, all packed into a
rich visualization engine.
 Although basic insights into social media can tell you what people are saying and how
sentiment is trending, they can’t answer what is ultimately a more important question:
“Why are people saying what they are saying and behaving in the way they are behaving?”
Answering this type of question requires enriching the social media feeds with additional
and differently shaped information that’s likely residing in other enterprise systems.
 Simply put, linking behavior, and the driver of that behavior, requires relating social media
analytics back to your traditional data repositories, whether they are SAP, DB2, Teradata,
Oracle, or something else. You have to look beyond just the data; you have to look at the
interaction of what people are doing with their behaviors, current financial trends, actual
transactions that you’re seeing internally, and so on. Sales, promotions, loyalty programs,
the merchandising mix, competitor actions, and even variables such as the weather can all
be drivers for what consumers feel and how opinions are formed.
Big Data and the Energy Sector
The energy sector provides many Big Data use case challenges in how to deal with the massive
volumes of sensor data from remote installations. Many companies are using only a fraction of the
data being collected, because they lack the infrastructure to store or analyze the available scale of
data. Take, for example, a typical oil drilling platform that can have 20,000 to 40,000 sensors on
board. All of these sensors are streaming data about the health of the oil rig, quality of operations,
and so on. Not every sensor is actively broadcasting at all times, but some are reporting back many
times per second. Now take a guess at what percentage of those sensors are actively utilized. If
you're thinking in the 10 percent range (or even 5 percent), you're either a great guesser or you're
getting the recurring theme for Big Data that spans industries and use cases: clients aren't using all
of the data that's available to them in their decision-making process.

The Call Center Mantra: “This Call May Be Recorded for Quality Assurance
Purposes”
It seems that when we want our call with a customer service representative (CSR) to be recorded
for quality assurance purposes, the "may" part never works in our favor. The challenge of
call center efficiency is somewhat similar to the fraud detection pattern we discussed: much like
the fraud information latency that is critical to robust fraud models, if you've got experience in a call
center, you'll know that the time/quality resolution metrics and trending discontent patterns for a
call center can show up weeks after the fact. This latency means that if someone's on the
phone and has a problem, you’re not going to know about it right away from an enterprise
perspective and you’re not going to know that people are calling about this new topic or that you’re
seeing new and potentially disturbing trending in your interactions within a specific segment.
The bottom line is this: In many cases, all of this call center information comes in too little, too late,
and the problem is left solely up to the CSR to handle without consistent and approved remediation
procedures in place. We’ve been asked by a number of clients for help with this pattern, which we
believe is well suited for Big Data. Call centers of all kinds want to find better ways to process
information to address what’s going on in the business with lower latency. This is a really
interesting Big Data use case, because it uses analytics-in-motion and analytics-at-rest. Using in-
motion analytics (Streams) means that you basically build your models and find out what’s
interesting based upon the conversations that have been converted from voice to text or with voice
analysis as the call is happening. Using at-rest analytics (BigInsights), you build up these models
and then promote them back into Streams to examine and analyze the calls that are actually
happening in real time: it’s truly a closed-loop feedback mechanism.

Risk: Patterns for Modeling and Management


Risk modeling and management is another big opportunity and common Big Data usage pattern.
Risk modeling brings into focus a recurring question when it comes to the Big Data usage patterns,
“How much of your data do you use in your modeling?” The financial crisis of 2008, the associated
subprime mortgage crisis, and their aftermath have made risk modeling and management a key area
of focus for financial institutions. As you can tell from today's financial markets, a lack of
understanding of risk can have devastating effects on wealth creation. In addition, newly legislated
within acceptable thresholds. As was the case in the fraud detection pattern, our customer
engagements suggest that in this area, firms use between 15 and 20 percent of the available
structured data in their risk modeling. It’s not that they don’t recognize that there’s a lot of data
that’s potentially underutilized and rich in yet to be determined business rules that can be infused
into a risk model; it’s just that they don’t know where the relevant information can be found in the
rest of the data. In addition, as we've seen, it's just too expensive in many clients' current
infrastructure to figure it out, because clearly they cannot double, triple, or quadruple the size of
the warehouse just because there might (key word here) be value hidden in the rest of the data.

******************************

Big data from Technology Perspective: History of Hadoop


 Hadoop (http://hadoop.apache.org/) is a top-level Apache project in the Apache Software
Foundation that’s written in Java. Hadoop was inspired by Google’s work on its Google
(distributed) File System (GFS) and the MapReduce programming paradigm, in which work
is broken down into mapper and reducer tasks to manipulate data that is stored across a
cluster of servers for massive parallelism.
 MapReduce is not a new concept; however, Hadoop has made it practical to be applied to
a much wider set of use cases. Unlike transactional systems, Hadoop is designed to scan
through large data sets to produce its results through a highly scalable, distributed batch
processing system.
 Hadoop is not about speed-of-thought response times, real time warehousing, or blazing
transactional speeds; it is about discovery and making the once near-impossible possible
from a scalability and analysis perspective. The Hadoop methodology is built around a
function-to-data model as opposed to data-to-function; Hadoop is generally seen as having
two parts: a file system (the Hadoop Distributed File System) and a programming paradigm
(MapReduce)—more on these in a bit.
 One of the key components of Hadoop is the redundancy built into the environment. Not
only is the data redundantly stored in multiple places across the cluster, but the
programming model is such that failures are expected and are resolved automatically by
running portions of the program on various servers in the cluster. Due to this redundancy,
it’s possible to distribute the data and its associated programming across a very large
cluster of commodity components.
 It is well known that commodity hardware components will fail (especially when you have
very large numbers of them), but this redundancy provides fault tolerance and a capability
for the Hadoop cluster to heal itself. This allows Hadoop to scale out workloads across large
clusters of inexpensive machines to work on Big Data problems.
 Some of the Hadoop-related projects include: Apache Avro (for data serialization),
Cassandra and HBase (databases), Hive (provides ad hoc SQL-like queries for data
aggregation and summarization), Mahout (a machine learning library), Pig (a high-level
Hadoop programming language that provides a data-flow language and execution
framework for parallel computation), ZooKeeper (provides coordination services for
distributed applications), and more.
******************************

Components of Hadoop
1. Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful
distributed platform to store and manage Big Data.
2. It is licensed under the Apache v2 license.
3. It runs applications on large clusters of commodity hardware and processes thousands of
terabytes of data on thousands of nodes. Hadoop is inspired by Google's MapReduce and
Google File System (GFS) papers.
4. The major advantage of the Hadoop framework is that it provides reliability and high availability.
Core components of Hadoop
There are two major components of the Hadoop framework, and each of them performs one of its
two most important tasks.
1. Hadoop MapReduce is the method of splitting a large data problem into smaller chunks and
distributing them to many different commodity servers. Each server has its own set of
resources and processes its chunk locally. Once a commodity server has processed its
data, it sends the results back to be collected and combined. This is effectively how large
data sets are processed efficiently (see the word-count sketch at the end of this section).
2. Hadoop Distributed File System (HDFS) is a virtual file system, and there is a big difference
between it and most other file systems. When we move a file onto HDFS, it is
automatically split into many small pieces. These small chunks of the file are replicated and
stored on other servers (usually three) for fault tolerance and high availability.
3. NameNode: The NameNode is the heart of the Hadoop system. The NameNode manages the file
system namespace. It stores the metadata information of the data blocks. This metadata is
stored permanently on the local disk in the form of a namespace image and an edit log file. The
NameNode also knows the location of the data blocks on the DataNodes. However, the
NameNode does not store this information persistently; it creates the block-to-DataNode
mapping when it is restarted. If the NameNode crashes, then the entire Hadoop
system goes down.
4. Secondary NameNode: The responsibility of the Secondary NameNode is to periodically copy
and merge the namespace image and edit log. If the NameNode crashes, then the namespace
image stored in the Secondary NameNode can be used to restart the NameNode.
5. DataNode: It stores the blocks of data and retrieves them. The DataNodes also report the
block information to the NameNode periodically.

6. Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker
creates map and reduce tasks and schedules them to run on the DataNodes (Task Trackers). The
Job Tracker also checks for any failed tasks and reschedules the failed tasks on another
DataNode. The Job Tracker can run on the NameNode or on a separate node.
7. Task Tracker: The Task Tracker runs on the DataNodes. The Task Tracker's responsibility is to
run the map or reduce tasks assigned by the Job Tracker and to report the status of those tasks
back to the Job Tracker.
Besides the above two core components, the Hadoop project also contains the following modules:
1. Hadoop Common: common utilities that support the other Hadoop modules
2. Hadoop YARN: a framework for job scheduling and cluster resource management
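
To make the mapper/reducer split in point 1 concrete, here is a minimal word-count sketch in Java
against the org.apache.hadoop.mapreduce API (assuming a Hadoop 2.x or later client library). It is
illustrative only; the class name and the input/output paths passed on the command line are
placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on each server against its local chunk of the input, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all the counts for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");    // the job name is arbitrary
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);        // optional local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}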
******************************

Application Development in Hadoop


The Hadoop platform can be a powerful tool for manipulating extremely large data sets. However,
the core Hadoop MapReduce APIs are primarily called from Java, which requires skilled
programmers. In addition, it is even more complex for programmers to develop and maintain
MapReduce applications for business applications that require long and pipelined processing. To
abstract some of the complexity of the Hadoop programming model, several application
development languages have emerged that run on top of Hadoop. Three of the more popular ones
are Pig, Hive, and Jaql.

1. Pig and PigLatin


 Pig allows Hadoop users to focus more on analyzing large data sets and to spend less time
writing mapper and reducer programs.
 The Pig programming language is designed to handle any kind of data—hence the name!
 Pig is made up of two components: the first is the language itself, which is called PigLatin
and the second is a runtime environment where PigLatin programs are executed.
 The first step in a Pig program is to LOAD the data you want to manipulate from HDFS.
Then you run the data through a set of transformations (which, under the covers, are
translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen
or you STORE the results in a file somewhere.

LOAD

As is the case with all the Hadoop features, the objects that are being worked on by Hadoop are
stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig what
file (or files) it will use, and that’s done through the LOAD 'data_file' command. If a directory is
specified, all the files in that directory will be loaded into the program. If the data is stored in a file
format that is not natively accessible to Pig, you can optionally add the USING function to the LOAD
statement to specify a user-defined function that can read in and interpret the data.

TRANSFORM

The transformation logic is where all the data manipulation happens. Here you can FILTER out
rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER
results, and much more. The following is an example of a Pig program that takes a file composed
of Twitter feeds, selects only those tweets that are using the en (English) iso_language code, then
groups them by the user who is tweeting, and displays the sum of the number of retweets of that
user’s tweets.

-- load the raw tweet data; the schema clause is illustrative and lets us refer to fields by name
L = LOAD 'hdfs://node/tweet_data' AS (from_user:chararray, iso_language_code:chararray, retweets:int);

-- keep only the English-language tweets
FL = FILTER L BY iso_language_code == 'en';

-- group the filtered tweets by the user who is tweeting
G = GROUP FL BY from_user;

-- for each user, emit the user and the total number of retweets
RT = FOREACH G GENERATE group, SUM(FL.retweets);

DUMP and STORE

If you don’t specify the DUMP or STORE command, the results of a Pig program are not generated.
You would typically use the DUMP command, which sends the output to the screen, when you are
debugging your Pig programs. When you go into production, you simply change the DUMP call to
a STORE call so that any results from running your programs are stored in a file for further
processing or analysis. You can use the DUMP command anywhere in your program to dump
intermediate result sets to the screen, which is very useful for debugging purposes. Now that we've
got a Pig program, we need to have it run in the Hadoop environment. Here is where the Pig
runtime comes in. There are three ways to run a Pig program:

 Embedded in a script
 Embedded in a Java program
 From the Pig command line, called Grunt

No matter which of the three ways you run the program, the Pig runtime environment translates
the program into a set of map and reduce tasks and runs them under the covers on your behalf.
This greatly simplifies the work associated with the analysis of large amounts of data and lets the
developer focus on the analysis of the data rather than on the individual map and reduce tasks.

2. Hive
 Although Pig can be quite a powerful and simple language to use, the downside is that it’s
something new to learn and master. Some folks at Facebook developed a runtime Hadoop
support structure that allows anyone who is already fluent with SQL (which is
commonplace for relational database developers) to leverage the Hadoop platform right
out of the gate.
 Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL)
statements that are similar to standard SQL statements; you should be aware that HQL
is limited in the commands it understands, but it is still pretty useful. HQL statements are
broken down by the Hive service into MapReduce jobs and executed across a Hadoop
cluster.
 As with any database management system (DBMS), you can run your Hive queries in many
ways.
 From a command line interface (known as the Hive shell)
 From a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC)
application, leveraging the Hive JDBC/ODBC drivers (a JDBC sketch appears at the
end of this section)
 From what is called a Hive Thrift Client; the Hive Thrift Client is much like any
database client

The following shows an example of creating a table, populating it, and then querying that table
using Hive:

-- create a table to hold the Twitter feed
CREATE TABLE Tweets(from_user STRING, userid BIGINT, tweettext STRING, retweets INT)
COMMENT 'This is the Twitter feed table'
STORED AS SEQUENCEFILE;

-- bulk-load data that already sits in HDFS into the table
LOAD DATA INPATH 'hdfs://node/tweetdata' INTO TABLE Tweets;

-- total retweets per user
SELECT from_user, SUM(retweets)
FROM Tweets
GROUP BY from_user;

Because Hive queries are translated into MapReduce jobs, they carry the start-up and batch-processing
overhead of MapReduce. This means that Hive would not be appropriate for applications that need very
fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and
therefore not appropriate for transaction processing that typically involves a high percentage of write
operations.
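
Because Hive exposes JDBC/ODBC drivers (as noted in the list above), a Java application can also
issue HQL much as it would SQL against a relational database. The following is only a sketch: the
driver class, connection URL, host, port, and credentials are assumptions that depend on your Hive
version and setup (older Hive releases use org.apache.hadoop.hive.jdbc.HiveDriver and a
jdbc:hive:// URL; HiveServer2 uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2://).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Driver class and URL are placeholders; adjust to your Hive version and host.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "");
    Statement stmt = con.createStatement();

    // Same aggregation as the HQL example above; it runs as MapReduce under the covers.
    ResultSet rs = stmt.executeQuery("SELECT from_user, SUM(retweets) FROM Tweets GROUP BY from_user");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}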

3. Jaql
 Jaql is primarily a query language for JavaScript Object Notation (JSON), but it supports
more than just JSON. It allows you to process both structured and nontraditional data and
was donated by IBM to the open source community.
 Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much
like a blend of Pig and Hive.
 Jaql’s query language was inspired by many programming and query languages, including
Lisp, SQL, XQuery, and Pig.
 Jaql is a functional, declarative query language that is designed to process large data sets.
For parallelism, Jaql rewrites high-level queries, when appropriate, into “low-level” queries
consisting of MapReduce jobs.
 Before we get into the Jaql language, let’s first look at the popular data interchange format
known as JSON, so that we can build our Jaql examples on top of it. Application developers
are moving in large numbers towards JSON as their choice for a data interchange format,
because it’s easy for humans to read, and because of its structure, it’s easy for applications
to parse or generate.
 JSON is built on top of two types of structures.
 The first is a collection of name/value pairs. These name/value pairs can represent
anything since they are simply text strings (and subsequently fit well into existing
models) that could represent a record in a database, an object, an associative array, and
more.
 The second JSON structure is the ability to create an ordered list of values much like an
array, list, or sequence you might have in your existing applications. An object in JSON
is represented as {string : value }, where an array can be simply represented by [ value,
value, … ], where value can be a string, number, another JSON object, or another JSON
array.
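
For illustration, a single tweet record (reusing the hypothetical field names from the Pig and Hive
examples earlier) might look like this in JSON:

{ "from_user" : "sample_user",
  "iso_language_code" : "en",
  "retweets" : 3,
  "hashtags" : [ "bigdata", "hadoop" ] }

Here from_user, iso_language_code, and retweets are name/value pairs (the first structure), while
hashtags is an ordered list of values (the second structure). A Jaql query would operate over
collections of such records.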

******************************

Getting Your Data into Hadoop

 One of the challenges with HDFS is that it’s not a POSIX-compliant file system. This means
that all the things you are accustomed to when it comes to interacting with a typical file
system (copying, creating, moving, deleting, or accessing a file, and more) don’t
automatically apply to HDFS.
 To do anything with a file in HDFS, you must use the HDFS interfaces or APIs directly. That
is yet another advantage of using the GPFS-SNC file system; with GPFS-SNC, you interact
with your Big Data files in the same manner that you would any other file system, and,
therefore, file manipulation tasks with Hadoop running on GPFS-SNC are greatly reduced.
 The basics of getting data into HDFS are discussed here, along with Flume, a distributed data
collection service for flowing data into a Hadoop cluster.

Basic Copy Data

 We must use specific commands to move files into HDFS either through APIs or using the
command shell.
 The most common way to move files from a local file system into HDFS is through the
copyFromLocal command.
 To get files out of HDFS to the local file system, you’ll typically use the copyToLocal
command. An example of each of these commands is shown here:
 hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
 hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file
 These commands are run through the HDFS shell program, which is simply a Java
application. The shell uses the Java APIs for getting data into and out of HDFS. These APIs
can be called from any Java application.
 HDFS commands can also be issued through the Hadoop shell, which is invoked by the
command hadoop fs.
 The problem with this method is that you must have Java application developers write the
logic and programs to read and write data from HDFS.
 If you need to access HDFS files from your Java applications, you would use the methods
in the org.apache.hadoop.fs package. This allows you to incorporate read and write
operations directly, to and from HDFS, from within your MapReduce applications.
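
As a rough sketch of that package (not a prescribed recipe), the following copies a file into and out
of HDFS with the org.apache.hadoop.fs.FileSystem API; the NameNode URI and the paths are
placeholders, and the property name fs.defaultFS applies to newer releases (older ones used
fs.default.name).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://s1.n1.com:8020");   // placeholder NameNode URI

    FileSystem fs = FileSystem.get(conf);                // handle to the HDFS file system

    // Equivalent of "hdfs dfs -copyFromLocal /user/dir/file /dir/hdfsfile"
    fs.copyFromLocalFile(new Path("/user/dir/file"), new Path("/dir/hdfsfile"));

    // Equivalent of "hdfs dfs -copyToLocal /dir/hdfsfile /user/dir/file.copy"
    fs.copyToLocalFile(new Path("/dir/hdfsfile"), new Path("/user/dir/file.copy"));

    fs.close();
  }
}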

Flume
 A flume is a channel that directs water from a source to some other location where water is
needed. As its clever name implies, Flume was created to allow you to flow data from a source
into your Hadoop environment.
 In Flume, the entities you work with are called sources, decorators, and sinks.
 A source can be any data source, and Flume has many predefined source adapters.
 A sink is the target of a specific operation (and in Flume, among other paradigms that use this
term, the sink of one operation can be the source for the next downstream operation).
 A decorator is an operation on the stream that can transform the stream in some manner,
which could be to compress or uncompress data, modify data by adding or removing pieces of
information, and more.
 A number of predefined source adapters are built into Flume. For example, some adapters
allow the flow of anything coming off a TCP port to enter the flow, or anything coming to
standard input (stdin).
 A number of text file source adapters give you the granular control to grab a specific file and
feed it into a data flow or even take the tail of a file and continuously feed the flow with
whatever new data is written to that file.
 The latter is very useful for feeding diagnostic or web logs into a data flow, since they are
constantly being appended to, and the TAIL operator will continuously grab the latest entries
from the file and put them into the flow.
 A number of other predefined source adapters, as well as a command exit, allow you to use any
executable command to feed the flow of data. There are three types of sinks in Flume.
 One sink is basically the final flow destination and is known as a Collector Tier Event
sink. This is where you would land a flow (or possibly multiple flows joined together)
into an HDFS formatted file system.
 Another sink type used in Flume is called an Agent Tier Event; this sink is used when
you want the sink to be the input source for another operation. When you use these
sinks, Flume will also ensure the integrity of the flow by sending back
acknowledgments that data has actually arrived at the sink.
 The final sink type is known as a Basic sink, which can be a text file, the console display,
a simple HDFS path, or a null bucket where the data is simply deleted.

******************************

Other Hadoop Components

ZooKeeper

 ZooKeeper is an open source Apache project that provides a centralized infrastructure and
services that enable synchronization across a cluster.
 ZooKeeper maintains common objects needed in large cluster environments.
 Examples of these objects include configuration information, hierarchical naming space,
and so on. Applications can leverage these services to coordinate distributed processing
across large clusters.
 ZooKeeper provides an infrastructure for cross-node synchronization and can be used by
applications to ensure that tasks across the cluster are serialized or synchronized. It does
this by maintaining status type information in memory on ZooKeeper servers.
 A ZooKeeper server is a machine that keeps a copy of the state of the entire system and
persists this information in local log files.
 A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
 This cluster-wide status centralization service is essential for management and
serialization tasks across a large distributed set of servers.
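
As a hedged illustration of how an application might lean on this coordination service, the sketch
below connects to a ZooKeeper ensemble and publishes a small piece of status under a znode that
other processes could read or watch. It uses the standard org.apache.zookeeper client API; the host
name, session timeout, and znode path are placeholders.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; the host:port and the 3-second session timeout are placeholders.
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        System.out.println("ZooKeeper event: " + event.getType());
      }
    });

    // Publish a piece of cluster-wide status; an EPHEMERAL znode disappears when this session ends.
    String path = zk.create("/cluster-status", "ready".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    System.out.println("Created znode: " + path);

    // Any other process in the cluster can read (or watch) the same path.
    byte[] data = zk.getData("/cluster-status", false, null);
    System.out.println("Status: " + new String(data));

    zk.close();
  }
}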

HBase

 HBase is a column-oriented database management system that runs on top of HDFS. It is
well suited for sparse data sets, which are common in many Big Data use cases.
 Unlike relational database systems, HBase does not support a structured query language
like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in
Java much like a typical MapReduce application.
 HBase does support writing applications in Avro, REST, and Thrift.
 An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database.
 Each table must have an element defined as a Primary Key, and all access attempts to HBase
tables must use this Primary Key.
 An HBase column represents an attribute of an object.
 HBase allows for many attributes to be grouped together into what are known as column
families, such that the elements of a column family are all stored together. This is different
from a row-oriented relational database, where all the columns of a given row are stored
together.

 With HBase you must predefine the table schema and specify the column families.
However, it’s very flexible in that new columns can be added to families at any time, making
the schema flexible and therefore able to adapt to changing application requirements.
 Just as HDFS has a NameNode and slave nodes, and MapReduce has Job-Tracker and
TaskTracker slaves, HBase is built on similar concepts. In HBase a master node manages
the cluster and region servers store portions of the tables and perform the work on the
data.
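
A minimal sketch of a read and a write against an HBase table from Java, using the classic HTable
client API from the same era as the rest of this material; the table name 'tweets', the column family
'stats', and the row key are placeholders, and the table and family must already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "tweets");           // assumes the 'tweets' table already exists

    // Write one cell: row key = user id, column family 'stats', qualifier 'retweets'.
    Put put = new Put(Bytes.toBytes("user123"));
    put.add(Bytes.toBytes("stats"), Bytes.toBytes("retweets"), Bytes.toBytes("42"));
    table.put(put);

    // Read it back by row key, the "Primary Key" access path described above.
    Result result = table.get(new Get(Bytes.toBytes("user123")));
    byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("retweets"));
    System.out.println("retweets = " + Bytes.toString(value));

    table.close();
  }
}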

Oozie

 Oozie is an open source project that simplifies workflow and coordination between jobs.
It provides users with the ability to define actions and dependencies between actions.
Oozie will then schedule actions to execute when the required dependencies have been
met.
 A workflow in Oozie is defined in what is called a Directed Acyclical Graph (DAG). Acyclical
means there are no loops in the graph, and all tasks and dependencies point from start to
end without going back. A DAG is made up of action nodes and dependency nodes. An action
node can be a MapReduce job, a Pig application, a file system task, or a Java application.
 Flow control in the graph is represented by node elements that provide logic based on the
input from the preceding task in the graph. Examples of flow control nodes are decisions,
forks, and join nodes.

Figure: Oozie workflow


 A workflow can be scheduled to begin based on a given time or based on the arrival of some
specific data in the file system. After inception, further workflow actions are executed
based on the completion of the previous actions in the graph.
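
To show how an application might kick off such a workflow, here is a hedged sketch using the Oozie
Java client. The server URL, HDFS application path (which would hold the workflow.xml defining the
DAG), and property names are assumptions from memory and should be checked against your Oozie
version.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
  public static void main(String[] args) throws Exception {
    // URL of the Oozie server and the HDFS path holding workflow.xml are placeholders.
    OozieClient client = new OozieClient("http://oozieserver:11000/oozie");

    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker:8021");

    // Submit and start the workflow; Oozie then walks the DAG defined in workflow.xml.
    String jobId = client.run(conf);
    System.out.println("Workflow job submitted: " + jobId);

    // Poll the job status until the workflow finishes.
    while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10 * 1000);
    }
    System.out.println("Final status: " + client.getJobInfo(jobId).getStatus());
  }
}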

Lucene

 Lucene is an extremely popular open source Apache project for text search and is included in
many open source projects.
 Lucene provides full text indexing and searching libraries for use within your Java application.
If you’ve searched on the Internet, it’s likely that you’ve interacted with Lucene.
 The Lucene concept is fairly simple, yet the use of these search libraries can be very powerful.
In a nutshell, let’s say you need to search within a collection of text, or a set of documents.
 Lucene breaks down these documents into text fields and builds an index on these fields. The
index is the key component of Lucene, as it forms the basis for rapid text search capabilities.
 You then use the searching methods within the Lucene libraries to find the text components.
This indexing and search platform is shipped with BigInsights and is integrated into Jaql,
providing the ability to build, scan, and query Lucene indexes within Jaql.
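
A minimal sketch of the index-then-search cycle using the core Lucene Java libraries (assuming a
reasonably recent Lucene release; class locations and constructors have shifted across versions, and
the index directory, field name, and query text here are placeholders):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));  // placeholder index location

    // Index a couple of small text documents on a "content" field.
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
    for (String text : new String[] {"hadoop scales out on commodity hardware",
                                     "lucene provides full text indexing"}) {
      Document doc = new Document();
      doc.add(new TextField("content", text, Field.Store.YES));
      writer.addDocument(doc);
    }
    writer.close();

    // Search the index for documents mentioning "hadoop".
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new QueryParser("content", analyzer).parse("hadoop");
    TopDocs hits = searcher.search(query, 10);
    for (ScoreDoc hit : hits.scoreDocs) {
      System.out.println(searcher.doc(hit.doc).get("content"));
    }
    reader.close();
  }
}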

Avro

 Avro is an Apache project that provides data serialization services. When writing Avro data
to a file, the schema that defines that data is always written to the file. This makes it easy for
any application to read the data at a later time, because the schema defining the data is stored
within the file.
 Data can be versioned by the fact that a schema change in an application can be easily handled
because the schema for the older data remains stored within the data file.
 An Avro schema is defined using JSON. A schema defines the data types contained within a file
and is validated as the data is written to the file using the Avro APIs. Similarly, the data can be
formatted based on the schema definition as the data is read back from the file. The schema
allows you to define two types of data. The first are the primitive data types, such as STRING,
INT[eger], LONG, FLOAT, DOUBLE, BYTES, NULL, and BOOLEAN.
 The second are complex type definitions. A complex type can be a record, an array, an enum
(which defines an enumerated list of possible values for a type), a map, a union (which defines
a type to be one of several types), or a fixed type.
 APIs for Avro are available in C, C++, C#, Java, Python, Ruby, and PHP, making it available to
most application development environments that are common around Hadoop.
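
As an illustration, a schema for the tweet records used earlier might look like the following JSON;
the record name, namespace, and field names are hypothetical:

{
  "type" : "record",
  "name" : "Tweet",
  "namespace" : "com.example.avro",
  "fields" : [
    { "name" : "from_user", "type" : "string" },
    { "name" : "userid", "type" : "long" },
    { "name" : "tweettext", "type" : "string" },
    { "name" : "retweets", "type" : "int" },
    { "name" : "iso_language_code", "type" : [ "null", "string" ] }
  ]
}

Here from_user, userid, tweettext, and retweets use primitive types, while iso_language_code uses a
union (a complex type) so that the field is allowed to be null.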

******************************
