Big Data Lecture Notes
Definition
❖ Big data can be defined as very large volumes of data available from various sources, in
varying degrees of complexity, generated at different speeds, which cannot be processed
using traditional technologies, processing methods and algorithms.
❖ Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process the data within a tolerable elapsed
time.
❖ Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision-making.
◻ Big data is often boiled down to a few varieties including social data, machine data, and
transactional data.
◻ Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants like
US pizza chain Domino's, which serves over 1 million customers per day, are generating
petabytes of transactional big data.
Big Data Analytics
Big (and small) Data analytics is the process of examining data—typically of a variety of
sources, types, volumes and / or complexities—to uncover hidden patterns, unknown
correlations, and other useful information.
The intent is to find business insights that were not previously possible or were missed, so that
better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide
1. Deeper insights. Rather than looking at segments, classifications, regions, groups, or
other summary levels, you'll have insights into all the individuals, all the products, all the
parts, all the events, all the transactions, etc.
2. Broader insights. Analysis draws on new and additional data sources to add context that
summary-level reporting cannot provide.
3. Frictionless actions. Increased reliability and accuracy that will allow the deeper and
broader insights to be automated into systematic actions.
Difference between Data Science and Big Data:
4. Re-develop your products: Big data can also help you understand how others perceive your
products so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product/services, if any.
6. Better operational efficiency.
Big Data Challenges:
Collecting, storing and processing big data comes with its own set of challenges:
1. Big data is growing exponentially and existing data management solutions have to be
constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and
work with big data and big data tools.
1.2 Convergence of Key Trends:
◻ The essence of computer applications is to store things from the real world in computer
systems in the form of data, i.e., it is a process of producing data. Some data are records
related to culture and society, and others are descriptions of phenomena of the universe
and life. Data on a large scale is rapidly generated and stored in computer systems, which
is called the data explosion.
◻ Data is generated automatically by mobile devices and computers: think of Facebook posts,
search queries, directions and GPS locations, and image capture.
◻ Sensors also generate volumes of data, including medical and commerce location-based
sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all this
data is expensive. Analysis gets more important and more expensive every year.
◻ The diagram below shows the big data explosion driven by the current data boom and how
critical it is for us to be able to extract meaning from all of this data.
1. Volume: Volumes of data are larger than what conventional relational database infrastructure
can cope with. It consists of terabytes or petabytes of data.
2. Velocity:
➢ The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet demand determines the real potential in the data. It is
increasingly being created in or near real time.
➢ The velocity at which data is created and integrated keeps accelerating. We have moved
from batch processing to a real-time business.
➢ Initially, companies analyzed data using a batch process. One takes a chunk of data,
submits a job to the server and waits for delivery of the result.
3. Variety:
➢ It refers to heterogeneous sources and the nature of data, both structured and
unstructured.
➢ Variety presents an equally difficult challenge. The growth in data sources has fuelled the
growth in data types. In fact, 80% of the world's data is unstructured.
➢ Yet most traditional methods apply analytics only to structured information.
➢ From Excel tables and databases, data has changed to lose its structure and to take on
hundreds of formats.
➢ Pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents,
SMS, PDF, Flash, etc. One no longer has control over the input data format.
➢ Structure can no longer be imposed as in the past in order to keep control over the
analysis. As new applications are introduced, new data formats come to life.
The variety of data sources continues to increase. It includes
● Internet data (i.e., click stream, social media, social networking links)
● Primary research (i.e., surveys, experiments, observations)
● Secondary research (i.e., competitive and marketplace data, industry reports,
consumer data, business data)
● Location data (i.e., mobile device data, geospatial data)
● Image data (i.e., video, satellite image, surveillance)
● Supply chain data (i.e., EDI, vendor catalogs and pricing, quality information)
● Device data (i.e., sensors, PLCs, RF devices, LIMs, telemetry)
4. Value
➢ It represents the business value to be derived from big data. The ultimate objective
of any big data project should be to generate some sort of value for the company doing all
the analysis. Otherwise, you're just performing some technological task for technology's
sake.
➢ For real-time spatial big data, decisions can be enhanced through visualization of dynamic
change in such spatial phenomena as climate, traffic, social-media-based attitudes and
massive inventory locations.
➢ Exploration of data trends can include spatial proximities and relationships. Once spatial big
data is structured, formal spatial analytics can be applied, such as spatial autocorrelation,
overlays, buffering, spatial cluster techniques and location quotients.
5. Veracity
➢ Big data must be fed with relevant and true data. We will not be able to perform useful
analytics if much of the incoming data comes from false sources or has errors.
➢ Veracity refers to the level of trustworthiness or messiness of data: the higher the
trustworthiness of the data, the lower the messiness, and vice versa.
➢ It relates to the assurance of the data's quality, integrity, credibility and accuracy. Because
the data is obtained from multiple sources, we must evaluate it for accuracy before using
it for business insights.
1. Understanding and Targeting Customers
2. Understanding and Optimizing Business Processes
3. Personal Quantification and Performance Optimization
4. Improving Healthcare and Public Health
5. Improving Sports Performance
6. Improving Science and Research
7. Optimizing Machine and Device Performance
8. Improving Security and Law Enforcement
9. Improving and Optimizing Cities and Countries
10. Financial Trading
1.3 Unstructured Data
★Unstructured data is information that either does not have a predefined data model and/or
does not fit well into a relational database.
★Rows and columns are not used for unstructured data; therefore it is difficult to retrieve the
required information.
★Unstructured information is typically text heavy, but may contain data such as dates,
numbers, and facts as well.
★The term semi-structured data is used to describe structured data that does not fit into a
formal structure of data models.
★However, semi-structured data does contain tags that separate semantic elements, which
includes the capability to enforce hierarchies within the data.
★The amount of data (all data, everywhere) is doubling every two years. Most new data is
unstructured.
★Specifically, unstructured data represents almost 80 percent of new data, while structured data
represents only 20 percent.
★Unstructured data tends to grow exponentially, unlike structured data, which tends to grow
in a more linear fashion. Unstructured data is vastly underutilized.
Structured Data
★Structured data is arranged in a rows-and-columns format. It helps applications to retrieve and
process data easily. A DBMS is used for storing structured data.
★With a structured document, certain information always appears in the same location on the
page.
★Structured data generally resides in a relational database, and as a result, it is sometimes called
"relational data." This type of data can be easily mapped into pre-designed fields.
★For example, a database designer may set up fields for phone numbers, zip codes and
credit card numbers that accept a certain number of digits. Structured data has been, or can
be, placed in fields like these.
Mining Unstructured Data
★Many organizations believe that their unstructured data stores include information that
could help them make better business decisions.
★Unfortunately, it's often very difficult to analyze unstructured data. To help with the
problem, organizations have turned to a number of different software solutions designed
to search unstructured data and extract important information.
★The primary benefit of these tools is the ability to glean actionable information that can
help a business succeed in a competitive environment.
★Because the volume of unstructured data is growing so rapidly, many enterprises also turn to
technological solutions to help them better manage and store their unstructured data.
★These can include hardware or software solutions that enable them to make the most
efficient use of their available storage space.
Implementing Unstructured Data Management
Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:
● Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
● Business intelligence software: Also known as BI, this is a broad category of analytics,
data mining, dashboards and reporting tools that help companies make sense of their
structured and unstructured data for the purpose of making better business decisions.
● Data integration tools: These tools combine data from disparate sources so that they can
be viewed or analyzed from a single application. They sometimes include the capability to
unify structured and unstructured data.
● Document management systems: Also called "enterprise content management systems,"
a DMS can track, store and share unstructured data that is saved in the form of document
files.
● Information management solutions: This type of software tracks structured and
unstructured enterprise data throughout its lifecycle.
● Search and indexing tools: These tools retrieve information from unstructured data
files such as documents, Web pages and photos.
Big data plays an important role in digital marketing. Each day, the information shared digitally
increases significantly. With the help of big data, marketers can analyze every action of the
consumer. It provides better marketing insights and helps marketers to make more accurate and
advanced marketing strategies.
b) Personalized targeting
c) Increasing sales
d) Improves the efficiency of a marketing campaign
e) Budget optimization
★Data constantly informs marketing teams of customer behaviors and industry trends and is
used to optimize future efforts, create innovative campaigns and build lasting
relationships with customers.
★Big data regarding customers provides marketers details about user demographics,
locations and interests, which can be used to personalize the product experience and
increase customer loyalty over time.
★Big data solutions can help organize data and pinpoint which marketing campaigns,
strategies or social channels are getting the most traction. This lets marketers allocate
marketing resources and reduce costs for projects that are not yielding as much revenue or
meeting desired audience goals.
★Personalized targeting: Nowadays, personalization is the key strategy for every marketer.
Engaging the customers at the right moment with the right message is the biggest issue
for marketers. Big data helps marketers to create targeted and personalized campaigns.
★Personalized marketing is creating and delivering messages to individuals or groups of the
audience through data analysis, with the help of consumer data such as geolocation,
browsing history, clickstream behavior and purchasing history. It is also known as
one-to-one marketing.
★Consumer insights: In this day and age, marketing has become the ability of a company to
interpret the data and change its strategies accordingly. Big data allows for real-time
consumer insights, which are crucial to understanding the habits of your customers. By
interacting with your consumers through social media, you will know exactly what they
want and expect from your product or service, which will be key to distinguishing your
campaign from your competitors'.
★Help increase sales: Big data will help with demand predictions for a product or service.
Information gathered on user behavior will allow marketers to answer what types of
products their users are buying, how often they make purchases or search for a product or
service and, lastly, what payment methods they prefer using.
★Analyze campaign results: Big data allows marketers to measure their campaign
performance. This is the most important part of digital marketing. Marketers will use
reports to measure any negative changes to marketing KPIs. If they have not achieved the
desired results, it is a signal that the strategy needs to be changed in order to maximize
revenue and make marketing efforts more scalable in future.
1.5 Web Analytics
★Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
★Web analytics is not just a tool for measuring web traffic but can be used as a tool for
business and market research, and to assess and improve the effectiveness of a web site.
★The following are some of the web analytics metrics: Hit, Page View, Visit / Session,
First Visit / First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page
Time Viewed / Page Visibility Time / Page View Duration, Session Duration / Visit
Duration, Average Page View Duration, Click Path, etc.
★Most people in the online publishing industry know how complex and onerous it can be to build
an infrastructure to access and manage all the Internet data within their own IT
department. Back in the day, IT departments would opt for a four-year project and
millions of dollars to go that route. However, today this sector has built up an ecosystem of
companies that spread the burden and allow others to benefit.
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time (seasonality, progress in customer journey)
• How behavior drives value
• It tells you how customers engage with you via your website / web app
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and lifetime
value
• It tells you how customers and prospective customers engage with your different marketing
campaigns and how that drives subsequent behavior
Web analytics tools are good at delivering the standard reports that are common across different
business types (a small sketch of how such reports can be computed from raw event data follows the list below)…
• Where does your traffic come from, e.g.
• Sessions by marketing campaign / referrer
• Sessions by landing page
• Understanding events common across business types (page views, transactions, ‘goals’)
e.g.
• Page views per session
• Page views per web page
• Conversion rate by traffic source
• Transaction value by traffic source
• Capturing contextual data common to people browsing the web
• Timestamps
• Referer data
• Web page data (e.g. page title, URL)
• Browser data (e.g. type, plugins, language)
• Operating system (e.g. type, time zone)
• Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
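To make the computation of such reports concrete, here is a minimal Python sketch (not tied to any particular analytics tool) that derives page views per session and conversion rate by traffic source from a small list of illustrative event records; the field names session_id, type and referrer are assumptions for the example.

# Minimal sketch: deriving standard web-analytics reports from raw events.
# Each event is a dict; the field names used here are illustrative only.
from collections import defaultdict

events = [
    {"session_id": "s1", "type": "pageview", "referrer": "google"},
    {"session_id": "s1", "type": "pageview", "referrer": "google"},
    {"session_id": "s1", "type": "transaction", "referrer": "google", "value": 40.0},
    {"session_id": "s2", "type": "pageview", "referrer": "newsletter"},
]

pageviews_per_session = defaultdict(int)
sessions_by_source = defaultdict(set)
converted_sessions_by_source = defaultdict(set)

for e in events:
    source = e["referrer"]
    sessions_by_source[source].add(e["session_id"])
    if e["type"] == "pageview":
        pageviews_per_session[e["session_id"]] += 1
    elif e["type"] == "transaction":
        converted_sessions_by_source[source].add(e["session_id"])

for source, sessions in sessions_by_source.items():
    rate = len(converted_sessions_by_source[source]) / len(sessions)
    print(source, "sessions:", len(sessions), "conversion rate:", rate)

print("page views per session:", dict(pageviews_per_session))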
Data in the World of Health Care
★The healthcare industry is now awash in data: from biological data such as gene
expression, single nucleotide polymorphisms (SNPs), proteomics and metabolomics to, more recently,
next-generation gene sequence data.
★This exponential growth in data is further fueled by the digitization of patient-level data:
stored in Electronic Health Records (EHRs) and Health Information Exchanges (HIEs),
enhanced with data from imaging and test results, medical and prescription claims, and
personal health devices.
★The U.S. healthcare system is increasingly challenged by issues of cost and access to
quality care. Payers, producers, and providers are each attempting to realize improved
treatment outcomes and effective benefits for patients within a disconnected health care
framework.
★Historically, these healthcare ecosystem stakeholders tend to work at cross purposes with other
members of the health care value chain. High levels of variability and ambiguity across
these individual approaches increase costs, reduce overall effectiveness, and impede the
performance of the healthcare system as a whole.
★Recent approaches to health care reform attempt to improve access to health care by
increasing government subsidies and reducing the ranks of the uninsured.
★One outcome of the recently passed Affordable Care Act is a revitalized focus on cost
containment and the creation of quantitative proofs of economic benefit by payers,
producers, and providers.
★A more interesting unintended consequence is an opportunity for these health care
stakeholders to set aside historical differences and create a combined counterbalance to
potential regulatory burdens established, without the input of the actual industry the
government is setting out to regulate.
★This “the enemy of my enemy is my friend” mentality has created an urgent motivation for
payers, producers, and, to a lesser extent, providers, to create a new health care
information value chain derived from a common healthcare analytics approach.
★The health care system is facing severe economic, effectiveness, and quality challenges.
These external factors are forcing a transformation of the pharmaceutical business model.
★Health care challenges are forcing the pharmaceutical business model to undergo rapid
change. Our industry is moving from a traditional model built on regulatory approval and
settling of claims, to one of medical evidence and proving economic effectiveness
through improved analytics-derived insights.
★The success of this new business model will be dependent on having access to data created
across the entire healthcare ecosystem.
★We believe there is an opportunity to drive competitive advantage for our LS clients by
creating a robust analytics capability and harnessing integrated real-world patient level
data.
1.7 Big Data Technology
Big data technology is defined as the technology and software utility that is designed for the
analysis, processing and extraction of information from a large set of extremely complex
structures and large data sets, which is very difficult for traditional systems to deal with. Big data
technology is used to handle both real-time and batch-related data.
Big data technology is defined as a software utility. This technology is primarily designed
to analyze, process and extract information from a large data set and a huge set of extremely
complex structures. This is very difficult for traditional data processing software to deal with.
Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Apache Cassandra,
Plotly, Pig, Tableau, etc.
Cassandra: Cassandra is one of the leading big data technologies among the list of top NoSQL
databases. It is open-source, distributed and has extensive column storage options. It is freely
available and provides high availability without fail.
Apache Pig is a high-level scripting language used to execute queries for larger data sets that are used
within Hadoop.
Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of
circumstances. Spark can be deployed in several ways; it features Java, Python, Scala and R
programming languages and supports SQL, streaming data, machine learning and graph
processing, which can be used together in an application.
MongoDB: MongoDB is another important component of big data technologies in terms of
storage. Relational and RDBMS properties do not apply to MongoDB because it is a
NoSQL database. It is not the same as traditional RDBMS databases that use structured query
languages; instead, MongoDB uses schema-less documents.
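As a small illustration of the document model described above, the following Python sketch (assuming a locally running MongoDB instance and the pymongo client) inserts two documents with different fields into the same collection and queries them; the database, collection and field names are chosen only for illustration.

# Minimal pymongo sketch: documents in the same collection need not share a schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB instance
db = client["shop"]                                  # hypothetical database name

db.customers.insert_one({"name": "Martin", "city": "Chicago"})
db.customers.insert_one({"name": "Pramod", "interests": ["nosql", "travel"]})  # different fields

for doc in db.customers.find({"city": "Chicago"}):
    print(doc["name"])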
1.8 Introduction to Hadoop
★Apache Hadoop is an open source framework that is used to efficiently store and process large
datasets ranging in size from gigabytes to petabytes of data.
★Hadoop can also be installed on cloud servers to better manage the compute and storage
resources required for big data. Leading cloud vendors such as Amazon Web Services
(AWS) and Microsoft Azure offer solutions.
★Cloudera supports Hadoop workloads both on-premises and in the cloud, including options
for one or more public cloud environments from multiple vendors.
★Hadoop provides a distributed file system and a framework for the analysis and
transformation of very large data sets using the MapReduce paradigm.
★A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by
simply adding commodity servers.
Key features of Hadoop:
1. Cost-Effective System
2. Large Cluster of Nodes
3. Parallel Processing
4. Distributed Data
5. Automatic Failover Management
6. Data Locality Optimization
7. Heterogeneous Cluster
8. Scalability.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously. Software clients input
data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then
processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware failures of
individual machines or racks of machines are common and should be automatically handled in
software by the framework.
Challenges of Hadoop:
MapReduce complexity: As a file-intensive system, MapReduce can be a difficult tool to utilize for
complex jobs, such as interactive analytical tasks.
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the
data (a minimal word-count sketch follows this list).
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is an improved
version of MapReduce and is used for scheduling processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various
machines or clusters. It also allows the data to be stored in an accessible format.
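To make the MapReduce idea concrete, the following is a minimal word-count sketch in Python written in the Hadoop Streaming style (the mapper emits key-value pairs and the reducer sums them after the framework sorts by key); it is an illustrative sketch, not the framework's own code, and it assumes the mapper output is sorted before it reaches the reducer.

# Minimal word-count sketch in the Hadoop Streaming style.
# mapper: emit (word, 1) pairs; reducer: sum counts per word.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    pairs = (line.strip().split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as: mapper on raw text, then 'sort', then reducer (as Hadoop Streaming would).
    role = sys.argv[1] if len(sys.argv) > 1 else "mapper"
    (mapper if role == "mapper" else reducer)(sys.stdin)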
1.8.1 Hadoop Ecosystem
● The Hadoop ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.
● In addition to these core elements of Hadoop, Apache has also delivered other kinds of
accessories or complementary tools for developers.
● Some of the most well-known tools of the Hadoop ecosystem include HDFS, Hive, Pig,
YARN, MapReduce, Spark, Hbase, Oozie, Sqoop, Zookeeper, etc.
● Hadoop Distributed File System (HDFS) is one of the largest Apache projects and the
primary storage system of Hadoop. It employs a NameNode and DataNode architecture.
● It is a distributed file system able to store large files running over the cluster of
commodity hardware.
● Apache Pig is a high-level scripting language used to execute queries for larger data sets
that are used within Hadoop.
● Apache Spark is a fast, in-memory data processing engine suitable for use in a wide
range of circumstances. Spark can be deployed in several ways; it features Java, Python,
Scala and R programming languages and supports SQL, streaming data, machine learning and
graph processing, which can be used together in an application.
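As a small illustration of Spark's in-memory processing model mentioned above, the following PySpark sketch counts words in a text file; the input path input.txt is a placeholder and a local Spark/pyspark installation is assumed.

# Minimal PySpark word-count sketch (assumes pyspark is installed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
lines = spark.sparkContext.textFile("input.txt")      # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
for word, count in counts.take(10):
    print(word, count)
spark.stop()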
3. Resilient to failure: HDFS has the property with which it can replicate data over the network.
1.9 Open Source Technologies
★Open source software is like any other software (closed/proprietary software). This software is
differentiated by its use and licenses.
★Standard software is sold and supported commercially. However, open source software can be
sold and/or supported commercially, too. Open source is a disruptive technology.
★Netscape, for example, released its browser source code under the Netscape Public License and
subsequently under the Mozilla Public License.
★Closed source is a term for software whose license does not allow for the release or
distribution of the software's source code. Generally, it means only the binaries of a
computer program are distributed and the license provides no access to the program's
source code.
★The source code of such programs is usually regarded as a trade secret of the company.
Access to source code by third parties commonly requires the party to sign a non-disclosure
agreement.
★In the 1970s and early 1980s, software organizations started using technical measures to
prevent computer users from being able to study and modify software. The copyright law
was extended to computer programs in 1980. The free software movement was conceived
in 1983 by Richard Stallman to satisfy the need for and to give the benefit of "software
freedom" to computer users.
It is a movement which aims to promote the universal freedom to distribute and modify
computer software without restriction. In February 1986, the first formal definition of free
software was published.
★The term "free software" is associated with FSFs definition, and the term "open source
software" isassociatedwithOSI'sdefinition.FSFsandOSI'sdefinitionsarewordedquite
differently but the set of software that they cover is almost identical.
★One of the primary goals of this foundation was the development of a free and open
computer operating system and application software that can be used and shared among
different users with complete freedom.
★Open source differs from traditional copyright licensing by permitting both open
distribution and open modification.
★Before the term open source became widely adopted, developers and producers used a
variety of phrases to describe the concept. The term open source gained popularity with
the rise of the Internet, which provided access to diverse production models,
communication paths and last but not least, interactive communities.
Examples of open source software include:
Servers: Apache, Tomcat, MediaWiki, WordPress, Eclipse, Moodle
Client Software: Mozilla Firefox, Mozilla Thunderbird, OpenOffice, 7-Zip
Digital Content: Wikipedia, Wiktionary, Project Gutenberg
★The NIST defines cloud computing as: "Cloud computing is a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of configurable
computing resources that can be rapidly provisioned and released with minimal
management effort or service provider interaction.
★This cloud model is composed of five essential characteristics, three service models and
four deployment models."
★Cloud provider is responsible for the physical infrastructure and the cloud consumer is
responsible for application configuration, personalization and data.
★Cloud computing stacks are used for all sorts of applications and systems. They are
especially good for microservices and scalable applications, as each tier is dynamically
scalable and replaceable.
★The cloud computing stack makes up a threefold system that comprises its lower-level
elements. These components function as formalized cloud computing delivery models:
a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)
SaaS applications are designed for end-users and delivered over the web.
PaaS is the set of tools and services designed to make coding and deploying those applications
quick and efficient.
IaaS is the hardware and software that powers it all, including servers, storage networks and
operating systems.
★At the crossroads of high capital costs and rapidly changing business needs is a sea change
that is driving the need for a new, compelling value proposition that is being manifested
in a cloud-deployment model.
★The traditional cost of value chains is being completely disintermediated by platforms:
massively scalable platforms where the marginal cost to deliver an incremental product
or service is zero.
★The ability to build massively scalable platforms, where you have the option to keep
adding new products and services for zero additional cost, is giving rise to business
models that weren't possible before. Mehta calls it "the next industrial revolution, where
the raw material is data and data factories replace manufacturing factories." He pointed
out a few guiding principles that his firm stands by:
1. Stop saying “cloud.”It’s not about the fact that it is virtual, but the true value lies in
delivering software, data,and/oranalyticsinan“asaservice”model.Whetherthatisinaprivate hosted
model or a publicly shared one does not matter. The delivery, pricing, and consumption model
matters.
2. Acknowledge the business issues. There is no point to make light of matters around
information privacy, security, access, and delivery. These issues are real, more often than not
heavily regulated by multiple government agencies, and unless dealt with in a solution,willkill
any platform sell.
➔ Mobile devices are a great leveling field where making complicated actions easy is the
name of the game. For example, a young child can use an iPad but not a laptop.
➔ As a result, this will drive broad-based adoption as much for the ease of use as for the
mobility these devices offer. This will have an immense impact on the business
intelligence sector.
➔ Mobile BI or mobile analytics is the rising software technology that allows users to
access information and analytics on their phones and tablets instead of desktop-based BI
systems.
➔ Mobile analytics involves measuring and analyzing data generated by mobile platforms
and properties, such as mobile sites and mobile applications.
➔ Analytics is the practice of measuring and analyzing data of users in order to create an
understanding of user behavior as well as website or app performance. If this practice is done on
mobile apps and app users, it is called "mobile analytics".
➔ Mobile analytics is similar to web analytics in that it involves identifying unique customers
and recording their usage.
➔ With mobile analytics data, you can improve your cross-channel marketing initiatives,
optimize the mobile experience for your customers and grow mobile user engagement and
retention.
➔ Analytics usually comes in the form of a software that integrates into a company's
existing websites and apps to capture, store and analyze the data.
➔ It is always very important for businesses to measure their critical KPIs (Key
Performance Indicators), as the old rule is always valid: "If you can't measure it, you
can't improve it".
Working of Mobile Analytics:
➔ Most analytics tools need a library (an SDK) to be embedded into the mobile app's
project code, and at minimum an initialization code, in order to track the users and screens.
➔ SDKs differ by platform, so a different SDK is required for each platform such as iOS,
Android, Windows Phone, etc. On top of that, additional code is required for custom event
tracking.
➔ With the help of this code, analytics tools track and count each user, app launch, tap,
event, app crash or any additional information that the user has, such as device, operating
system, version, IP address (and probable location).
➔ Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique
users, since mobile analytics SDKs can generate a persistent and unique identifier for each device.
➔ The tracking technology varies between websites, which use either JavaScript or cookies,
and apps, which use a software development kit (SDK).
➔ Each time a website or app visitor takes an action, the application fires off data, which is
recorded in the mobile analytics platform.
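The flow above can be illustrated with a small sketch of a hypothetical analytics client in Python: it generates a persistent device identifier (instead of a browser cookie) and attaches contextual data to each tracked event before sending it to the analytics platform. The class name, methods and sending behaviour are invented for illustration and do not correspond to any particular vendor's SDK.

# Hypothetical mobile-analytics client sketch (not a real vendor SDK).
import uuid, time, platform

class AnalyticsClient:
    def __init__(self):
        # Persistent, unique identifier per device instead of a browser cookie.
        self.device_id = str(uuid.uuid4())

    def track(self, event_name, properties=None):
        event = {
            "device_id": self.device_id,
            "event": event_name,
            "properties": properties or {},
            "timestamp": time.time(),
            "os": platform.system(),          # contextual data captured automatically
        }
        self.send(event)

    def send(self, event):
        # A real SDK would POST this to the analytics platform's collection endpoint.
        print("queued event:", event)

client = AnalyticsClient()
client.track("app_launch")
client.track("purchase", {"item": "subscription", "price": 4.99})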
1.12 Crowdsourcing Analytics
★Crowdsourcing is the process of exploring customer's ideas, opinions and thoughts
available on the internet from large groups of people aimed at incorporating innovation,
implementing new ideas and eliminating product issues.
★Crowdsourcing means the outsourcing of human-intelligence tasks to a large group of
unspecified people via the Internet.
★Crowdsourcing is all about collecting data from users through some services, ideas, or
content, which then needs to be stored on a server so that the necessary data can be
provided to users whenever necessary.
★Most users nowadays use Truecaller to identify unknown numbers and Google Maps to find out
about places and the traffic in a region. All these services are based on crowdsourcing.
★Crowdsourced data is a form of secondary data. Secondary data refers to data that is
collected by any party other than the researcher. Secondary data provides important
context for any investigation into a policy intervention.
★When crowdsourcing data, researchers collect plentiful, valuable and dispersed data at a
cost typically lower than that of traditional data collection methods.
★Consider the trade-offs between sample size and sampling issues before deciding to
crowdsource data. Ensuring data quality means making sure the platform through which you are
collecting crowdsourced data is well-tested.
★Crowdsourcing experiments are normally set up by asking a set of users to perform a task for a
very small remuneration on each unit of the task. Amazon Mechanical Turk (AMT) is a
popular platform that has a large set of registered remote workers who are hired to
perform tasks such as data labeling.
★In data labeling tasks, the crowd workers are randomly assigned a single item in the
dataset. A data object may receive multiple labels from different workers, and these have to
be aggregated to get the overall true label (a simple majority-vote aggregation is sketched after this list).
★Crowdsourcing allows for many contributors to be recruited in a short period of time,
thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing
platforms usually employ their own tools to optimize the annotation process, making it
easier to conduct time-intensive labeling tasks.
★Crowdsourcing data is especially effective in generating complex and free-form labels
such as in the case of audio transcription, sentiment analysis, image annotation or
translation.
★With crowdsourcing, companies can collect information from customers and use it
to their advantage. Brands gather opinions, ask for help, and receive feedback to improve their
product or service, and drive sales. For instance, Lego conducted a campaign where
customers had the chance to develop their own toy designs and submit them.
★To become the winner, the creator had to receive the largest number of people's votes. The
best design was moved to the production process. Moreover, the winner got a privilege
that amounted to a 1% royalty on the net revenue.
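The label-aggregation step mentioned earlier in this section can be as simple as a majority vote across workers. The Python sketch below assumes each data item has a list of labels collected from different crowd workers (the item IDs and labels are illustrative) and picks the most common label per item.

# Minimal majority-vote aggregation of crowdsourced labels.
from collections import Counter

labels_by_item = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}

aggregated = {item: Counter(labels).most_common(1)[0][0]
              for item, labels in labels_by_item.items()}
print(aggregated)   # {'img_001': 'cat', 'img_002': 'dog'}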
Types of Crowdsourcing:
There are four main types of crowdsourcing.
1. Wisdom of the crowd: It is a collective opinion of different individuals gathered in a group.
This type is used for decision-making since it allows one to find the best solution for problems.
2. Crowd creation: This type involves a company asking its customers to help with new
products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting: It is a type of crowdsourcing where customers are allowed to choose a winner. They
can vote to decide which of the options is the best for them. This type can be applied to different
situations. Consumers can choose one of the options provided by experts or products created by
consumers.
4. Crowdfunding: It is when people collect money and ask for investments for charities,
projects and startups without planning to return the money to the owners. People do it
voluntarily. Often, companies gather money to help individuals and families suffering from
natural disasters, poverty, social problems, etc.
1.13 Inter- and Trans-Firewall Analytics
● A firewall is a device designed to control the flow of traffic into and out of a
network. In general, firewalls are installed to prevent attacks. A firewall can be a
software program or a hardware device.
● Firewalls are software programs or hardware devices that filter the traffic that
flows into a user PC or user network through an internet connection.
● They sift through the data flow and block that which they deem harmful to the
user network or computer system.
● Firewalls filter based on IP, UDP and TCP information. A firewall is placed on the
link between a network router and the Internet, or between a user and a router.
● For large organizations with many small networks, the firewall is placed on every
connection attached to the Internet.
● Large organizations may use multiple levels of firewalls or distributed firewalls,
rather than locating a firewall at a single access point to the network.
● Firewalls test all traffic against consistent rules and pass traffic that meets those
rules. Many routers support basic firewall functionality. A firewall can also be used to
control data traffic.
● Firewall-based security depends on the firewall being the only connectivity to the
site from outside; there should be no way to bypass the firewall via other
gateways or wireless connections.
● A firewall filters out all incoming messages addressed to a particular IP address or a
particular TCP port number. It divides a network into a more trusted zone internal to
the firewall and a less trusted zone external to the firewall.
● Firewalls may also impose restrictions on outgoing traffic, to prevent certain
attacks and to limit losses if an attacker succeeds in getting access inside the
firewall.
Functions of a firewall:
1. Access control: The firewall filters incoming as well as outgoing packets.
2. Address/Port Translation: Using network address translation, internal machines,
though not visible on the Internet, can establish a connection with external machines on
the Internet. NATing is often done by firewall.
3. Logging: Security architecture ensures that each incoming or outgoing packet
encounters at least one firewall. The firewall can log all anomalous packets.
Firewalls can protect the computer and user personal information from:
1. Hackers who breach your system security.
2. Firewalls prevent malware and other Internet hacker attacks from reaching your
computer in the first place.
3. Outgoing traffic from your computer created by a virus infection.
Firewalls cannot provide protection:
1. Against phishing scams and other fraudulent activity
2. Viruses spread through e-mail
3. From physical access to your computer or network
4. For an unprotected wireless network.
Firewall Characteristics
1. All traffic from inside to outside, and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.
1.13.1 Firewall Rules
● Firewall rules are the rules and regulations set by the organization. The policy determines
the type of internal and external information resources employees can access, the kinds of
programs they may install on their own computers, as well as their authority for
reserving network resources.
A user can create or disable firewall filter rules based on the following conditions:
1. IP addresses: The system admin can block a certain range of IP addresses.
2. Domain names: Admin can only allow certain specific domain names to access your
systems or allow access to only some specific types of domain names or domain name
extension.
3. Protocol: A firewall can decide which of the systems can allow or have access to
common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports: Blocking or disabling ports of servers that are connected to the internet will
help maintain the kind of data flow you want to see it used for and also close down
possible entry points for hackers or malignant software.
5. Keywords: Firewalls also can sift through the data flow for a match of the keywords
or phrases to block out offensive or unwanted data from flowing in.
● When your computer makes a connection with another computer onthenetwork,
several things are exchanged including the source and destination ports.
● In a standard firewall configuration, most inbound ports are blocked. This would
normally cause a problem with return traffic since the source port is randomly
assigned.
● A state is a dynamic rule created by the firewall containing the source-destination port
combination, allowing the desired return traffic to pass the firewall.
1.13.2 Types of Firewall
1. Packet filter
2. Application level firewall
3. Circuit level gateway.
➔ A packet filter firewall controls access to packets on the basis of packet source and
destination address or specific transport protocol type.
➔ Filtering is done at the OSI data link, network and transport layers. A packet filter firewall
works on the network layer of the OSI model.
➔ Packet filters do not see inside a packet; they block or accept packets solely on the basis of
the IP addresses and ports. All incoming SMTP and FTP packets are parsed to
check whether they should be dropped or forwarded.
➔ But outgoing SMTP and FTP packets have already been screened by the gateway and
do not have to be checked by the packet filtering router. A packet filter firewall only
checks the header information (a minimal rule-matching sketch follows below).
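A packet filter of this kind can be modelled as an ordered list of rules matched against header fields only. The Python sketch below is a simplified illustration of that idea; the rule format and sample packets are invented for the example and are not a real firewall's configuration syntax.

# Simplified packet-filter sketch: decisions use header fields only, never payload.
RULES = [
    {"action": "deny",  "src_ip": "10.0.0.5", "protocol": None,  "dst_port": None},
    {"action": "allow", "src_ip": None,       "protocol": "TCP", "dst_port": 80},
    {"action": "deny",  "src_ip": None,       "protocol": None,  "dst_port": None},  # default deny
]

def matches(rule, packet):
    # None acts as a wildcard for that header field.
    return all(rule[field] is None or rule[field] == packet[field]
               for field in ("src_ip", "protocol", "dst_port"))

def filter_packet(packet):
    for rule in RULES:
        if matches(rule, packet):
            return rule["action"]

print(filter_packet({"src_ip": "192.168.1.2", "protocol": "TCP", "dst_port": 80}))  # allow
print(filter_packet({"src_ip": "10.0.0.5",   "protocol": "TCP", "dst_port": 80}))  # deny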
Application level gateway is also called a bastion host. It operates at the application
level. Multiple application gateways can run on the same host but each gateway is a
separate server with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: A circuit-level firewall is a second generation firewall that
validates TCP and UDP sessions before opening a connection.
The firewall does not simply allow or disallow packets but also determines whether the
connection between both ends is valid according to configurable rules, then opens a
session and permits traffic only from the allowed source and possibly only for a limited
period of time.
It typically performs basic packet filter operations and then adds verification of proper
handshaking of TCP and the legitimacy of the session information used in establishing the
connection.
The decision to accept or reject a packet is based upon examining the packet's IP header
and TCP header.
A circuit level gateway cannot examine the data content of the packets it relays between a
trusted network and an untrusted network.
UNIT II NOSQL DATA MANAGEMENT
Introduction to NoSQL – aggregate data models – aggregates – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution
models – sharding – master-slave replication – peer-peer replication –
sharding and replication – consistency – relaxing consistency – version stamps – map-reduce –
partitioning and combining – composing map-reduce calculations
NOSQL DATA MANAGEMENT
What is NoSQL?
NoSQL database, also called Not Only SQL, is an approach to data
management and database design that's useful for very large sets of distributed data.
NoSQL is a whole new way of thinking about a database. NoSQL is not a relational
database.
The reality is that a relational database model may not be the best solution for all situations. The
easiest way to think of NoSQL is as a database which does not adhere to the traditional
relational database management system (RDBMS) structure. Sometimes you will also see it
referred to as 'not only SQL'. The most popular NoSQL database is Apache Cassandra.
Cassandra, which was once Facebook's proprietary database, was released as open source in 2008.
Other NoSQL implementations include SimpleDB, Google BigTable, Apache Hadoop,
MapReduce, MemcacheDB, and Voldemort. Companies that use NoSQL include Netflix,
LinkedIn and Twitter.
Why Are NoSQL Databases Interesting? / Why should we use NoSQL? / When to use NoSQL?
There are several reasons why people consider using a NoSQL database.
Large data. Organizations are finding it valuable to capture more data and process it
more quickly. They are finding it expensive, if even possible, to do so with relational
databases. The primary reason is that a relational database is designed to run on a
single machine, but it is usually more economical to run large data and computing loads
on clusters of many smaller and cheaper machines. Many NoSQL databases are designed
explicitly to run on clusters, so they make a better fit for big data scenarios.
Analytics. One reason to consider adding a NoSQL database to your corporate infrastructure is that
many NoSQL databases are well suited to performing analytical queries.
Scalability. NoSQL databases are designed to scale; it's one of the primary reasons that people
choose a NoSQL database. Typically, with a relational database like SQL Server
or Oracle, you scale by purchasing larger and faster servers and storage or by employing
specialists to provide additional tuning. Unlike relational databases, NoSQL
databases are designed to easily scale out as they grow. Data is partitioned and balanced
across multiple nodes in a cluster, and aggregate queries are distributed by default.
Fast key-value access. This is probably the second most cited virtue of NoSQL in the
general mindset. When latency is important, it's hard to beat hashing on a key and
reading the value directly from memory or in as little as one disk seek. Not every
NoSQL product is about fast access; some are more about reliability, for example. But
what people have wanted for a long time was a better memcached, and many NoSQL
systems offer that.
Schema migration. Schemalessness makes it easier to deal with schema migrations without so
much worrying. Schemas are in a sense dynamic, because they are imposed
by the application at run-time, so different parts of an application can have a different
view of the schema.
Write availability. Do your writes need to succeed no matter what? Then we can get
into partitioning, CAP, eventual consistency and all that jazz.
No single point of failure. Not every product is delivering on this, but we are seeing a
definite convergence on relatively easy to configure and manage high availability with
automatic load balancing and cluster sizing. A perfect cloud partner.
Generally available parallel computing. We are seeing MapReduce baked into
products, which makes parallel computing something that will be a normal part of
development in the future.
Use the right data model for the right problem. Different data models are used to
solve different problems. Much effort has been put into, for example, wedging graph operations
into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph
database? We are now seeing a general strategy of trying to find the best fit between a problem
and a solution.
Distributed systems and cloud computing support. Not everyone is worried about scale or
performance over and above that which can be achieved by non-NoSQL systems. What they
need is a distributed system that can span data centers while handling failure scenarios without a
hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not to
use heavy strict consistency protocols, and so are well positioned to operate in distributed
scenarios.
Difference between SQL and NoSQL
SQL databases are primarily called Relational Databases (RDBMS), whereas
NoSQL databases are primarily called non-relational or distributed databases.
SQL databases are table-based databases, whereas NoSQL databases are document-based,
key-value pairs, graph databases or wide-column stores. This means that SQL
databases represent data in the form of tables consisting of n number of rows of data,
whereas NoSQL databases are collections of key-value pairs, documents, graph databases or
wide-column stores which do not have standard schema definitions that they need to adhere to.
SQL databases have a predefined schema, whereas NoSQL databases have a dynamic schema
for unstructured data.
SQL databases are vertically scalable, whereas NoSQL databases are horizontally
scalable. SQL databases are scaled by increasing the horsepower of the hardware. NoSQL
databases are scaled by increasing the database servers in the pool of resources to reduce
the load.
SQL databases use SQL (Structured Query Language) for defining and manipulating the data,
which is very powerful. In a NoSQL database, queries are focused on collections of documents.
Sometimes this is also called UnQL (Unstructured Query Language). The syntax of using UnQL
varies from database to database.
SQL database examples: MySQL, Oracle, SQLite, Postgres and MS-SQL.
NoSQL database examples: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase,
Neo4j and CouchDB.
For complex queries: SQL databases are a good fit for complex, query-intensive
environments, whereas NoSQL databases are not a good fit for complex queries. On a
high level, NoSQL doesn't have standard interfaces to perform complex queries, and the
queries themselves in NoSQL are not as powerful as the SQL query language.
For the type of data to be stored: SQL databases are not the best fit for hierarchical data
storage. But a NoSQL database fits better for hierarchical data storage, as it follows the
key-value pair way of storing data, similar to JSON data. NoSQL databases are highly
preferred for large data sets (i.e., for big data). HBase is an example for this purpose.
For scalability: In most typical situations, SQL databases are vertically scalable. You can manage
increasing load by increasing the CPU, RAM, SSD, etc., on a single server. On the other hand,
NoSQL databases are horizontally scalable. You can just add a few more servers easily to your
NoSQL database infrastructure to handle the large traffic.
For high transactional based applications: SQL databases are the best fit for heavy-duty
transactional type applications, as they are more stable and promise the atomicity as well as
integrity of the data. While you can use NoSQL for transaction purposes, it is still not
comparable and stable enough under high load and for complex transactional applications.
For support: Excellent support is available for all SQL databases from their vendors.
There are also a lot of independent consultants who can help you with an SQL database
for very large scale deployments. For some NoSQL databases you still have to rely
on community support, and only limited outside experts are available to help you
set up and deploy your large scale NoSQL deployments.
For properties: SQL databases emphasize ACID properties (Atomicity, Consistency,
Isolation and Durability), whereas NoSQL databases follow Brewer's CAP theorem
(Consistency, Availability and Partition tolerance).
For DB types: On a high level, we can classify SQL databases as either open-source or
closed-source from commercial vendors. NoSQL databases can be classified on the basis of the
way of storing data as graph databases, key-value store databases, document store databases,
column store databases and XML databases.
Types of NoSQL Databases: There are four general types of NoSQL databases, each with their own
specific attributes:
1. Key-Value storage
This is the first category of NoSQL database. Key-value stores have a simple
data model, which allows clients to put a map/dictionary and request values per key. In
key-value storage, each key has to be unique to provide non-ambiguous identification of values.
For example, see the sketch below.
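Since the original example figure is not reproduced here, a minimal Python sketch of the key-value idea follows: every value is stored and retrieved only by its unique key. The keys and values shown are illustrative.

# Minimal key-value store sketch: the store knows nothing about value structure.
store = {}

def put(key, value):
    store[key] = value          # key must be unique; a new put overwrites the old value

def get(key):
    return store.get(key)       # lookup is only ever by key

put("user:1001", '{"name": "Martin", "city": "Chicago"}')
put("session:ab12", "2024-01-15T10:30:00")
print(get("user:1001"))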
2. Document databases
In a document database, NoSQL stores documents in JSON format. JSON-based documents
with completely different sets of attributes can be stored together. This stores highly
unstructured data as named value pairs and suits applications that look at user behavior,
actions, and logs in real time.
3. Column storage
Columnar databases are almost like tabular databases, but keys in wide column stores can
have many dimensions, resulting in a structure similar to a multi-dimensional associative array.
The example below shows storing data in a wide column system using a two-dimensional key.
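Because the original figure is not reproduced here, the following Python sketch illustrates the two-dimensional key idea: a value is addressed by a row key and a column key, and different rows may have different columns. The row keys, columns and values are illustrative.

# Wide-column sketch: each value is addressed by (row key, column key).
table = {
    "user:1001": {"name": "Martin", "city": "Chicago", "last_login": "2024-01-15"},
    "user:1002": {"name": "Pramod", "country": "India"},   # rows may have different columns
}

def get(row_key, column_key):
    return table.get(row_key, {}).get(column_key)

print(get("user:1001", "city"))       # Chicago
print(get("user:1002", "city"))       # None: that column does not exist for this row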
4. Graph storage
Graph databases are best suited for representing data with a high, yet flexible, number of
interconnections, especially when information about those interconnections is at least as
important as the represented data. In a graph database, data is stored in graph-like structures
so that the data can be made easily accessible. Graph databases are commonly used on
social networking sites, as shown in the figure below.
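As the referenced figure is not included here, a small Python sketch of the graph idea follows: nodes connected by edges, where traversing relationships (for example, finding friends of friends) is the primary query pattern. The names and relationships are illustrative.

# Minimal graph sketch as an adjacency map: relationships are first-class data.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice"],
    "Dave":  ["Bob"],
}

def friends_of_friends(person):
    direct = set(friends.get(person, []))
    indirect = {fof for f in direct for fof in friends.get(f, [])}
    return indirect - direct - {person}

print(friends_of_friends("Alice"))   # {'Dave'}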
Example databases
Pros and Cons of Relational Databases
• Advantages
• Data persistence
• Concurrency – ACID, transactions, etc.
• Integration across multiple applications
• Standard Model – tables and SQL
• Disadvantages
• Impedance mismatch
• Integration databases vs. application databases
• Not designed for clustering
Database Impedance Mismatch:
Impedance mismatch means the difference between the relational data model and in-memory
data structures.
Impedance is the measure of the amount by which some object resists (or obstructs) the
flow of another object.
Imagine you have a low-current flashlight that normally uses AAA batteries. Suppose you
could attach your car battery to the flashlight. The low-current flashlight will pitifully
output a fraction of the light energy that the high-current battery is capable of
producing. However, match the AAA batteries to the flashlight and they will run with
maximum efficiency.
The data representation in an RDBMS does not match the data structures used in memory. In memory,
data structures are lists, dictionaries, and nested and hierarchical structures, whereas a
relational database stores only atomic values, and there are no lists or nested records.
Translating between these representations can be costly and confusing, and it limits
application development productivity.
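The mismatch can be seen in a small Python sketch: the nested structure an application works with in memory has to be flattened into several flat, atomic-valued rows before a relational store can hold it. The entities and fields below are illustrative.

# Impedance mismatch sketch: one nested in-memory object vs. flat relational rows.
order_in_memory = {
    "id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 40.00},
    ],
}

# The same information split into flat, atomic-valued rows for relational tables.
orders_rows      = [(99, 1)]                                   # (order_id, customer_id)
customers_rows   = [(1, "Martin")]                             # (customer_id, name)
order_items_rows = [(99, "NoSQL Distilled", 32.45),
                    (99, "Refactoring", 40.00)]                # (order_id, product, price)

print(len(order_in_memory["items"]), "nested items become", len(order_items_rows), "rows")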
Some common characteristics of NoSQL include:
Does not use the relational model (mostly)
Generally open source projects (currently)
Driven by the need to run on clusters
Built for the need to run 21st-century web properties
Schema-less
Polyglot persistence: The point of view of using different data stores in different
circumstances is known as Polyglot Persistence.
Today, most large companies are using a variety of different data storage technologies for
different kinds of data. Many companies still use relational databases to store some data, but the
persistence needs of applications are evolving from predominantly relational to a mixture of
data sources.
Polyglot persistence is commonly used to define this hybrid approach. The definition of
polyglot is "someone who speaks or writes several languages." The term polyglot is
redefined for big data as a set of applications that use several core database technologies.
the application. Similarly, when writing data, the write needs to be coordinated and
performed on many tables.
NoSQL databases have a very different model. For example, a document-oriented
NoSQL database takes the data you want to store and aggregates it into documents using the
JSON format. Each JSON document can be thought of as an object to be used by your
application. A JSON document might, for example, take all the data stored in a row that
spans 20 tables of a relational database and aggregate it into a single document/object.
Aggregating this information may lead to duplication of information, but since storage is
no longer cost-prohibitive, the resulting data model flexibility, ease of efficiently distributing
the resulting documents, and read and write performance improvements make it an easy
trade-off for web-based applications.
AnothermajordifferenceisthatrelationaltechnologieshaverigidschemaswhileNoSQL
models are schemaless.Relational technology requires strictdefinition of aschema prior
to storing any data into a database. Changing the schema once data isinserted is a big
deal, extremely disruptive and frequently avoided – the exact oppositeof thebehavior
desired in the BigData era, whereapplication developers needtoconstantly–andrapidly–
incorporatenewtypesofdatatoenrichtheirapps.
Aggregate data model in NoSQL
Data Model: A data model is the model through which we perceive and manipulate our data. For people using a database, the data model describes how we interact with the data in the database.
Relational Data Model: The relational model takes the information that we want to store and divides it into tuples. A tuple is a limited data structure: it captures a set of values and cannot be nested, which constrains how data can be structured in the relational model.
Aggregate Model: Aggregate is a term that comes from Domain-Driven Design. An aggregate is a collection of related objects that we wish to treat as a unit; it is the unit for data manipulation and for management of consistency.
Atomicity holds within an aggregate
Communication with data storage happens in units of aggregates
Dealing with aggregates is much more efficient on clusters
It is often easier for application programmers to work with aggregates
Example of Relations and Aggregates
Let's assume we have to build an e-commerce website; we are going to be selling items directly to customers over the web, and we will have to store information about users, our product catalog, orders, shipping addresses, billing addresses, and payment data. We can use this scenario to model the data using a relational data store as well as NoSQL data stores and talk about their pros and cons. For a relational database, we might start with the data model shown in the following figure.
The following figure presents some sample data for this model.
In the relational model, everything is properly normalized, so that no data is repeated in multiple tables, and we have referential integrity. A realistic order system would naturally be more involved than this. Now let's see how this model might look when we think in more aggregate-oriented terms.
Again, we have some sample data, which we'll show in JSON format, as that's a common representation for data in NoSQL.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
In this model, we have two main aggregates: customer and order. We've used the black-diamond composition marker in UML to show how data fits into the aggregation structure. The customer contains a list of billing addresses; the order contains a list of order items, a shipping address, and payments. The payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead of using IDs it's treated as a value and copied each time. This fits the domain, where we would not want the shipping address, nor the payment's billing address, to change. In a relational database, we would ensure that the address rows aren't updated in this case, making a new row instead. With aggregates, we can copy the whole address structure into the aggregate as we need to.
Aggregate-Oriented Databases: Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases are better when interactions use data organized in many different formations.
Key-value databases
• Store data that is opaque to the database
• The database cannot see the structure of records
• The application needs to deal with this
• Allow flexibility regarding what is stored (i.e. text or binary data)
Document databases
• Store data whose structure is visible to the database
• Impose limitations on what can be stored
• Allow more flexible access to data (i.e. partial records) via querying
Both key-value and document databases consist of aggregate records accessed by ID values.
Column-family databases
• Two levels of access to aggregates (and hence, two parts to the "key" used to access an aggregate's data)
• The ID is used to look up the aggregate record
• The column name is either a label for a value (name) or a key to a list entry (order id)
• Columns are grouped into column families
Schemaless Databases
A common theme across all the forms of NoSQL databases is that they are schemaless. When you want to store data in a relational database, you first have to define a schema—a defined structure for the database which says what tables exist, which columns exist, and what data types each column can hold. Before you store some data, you have to have the schema defined for it.
With NoSQL databases, storing data is much more casual. A key-value store allows you to store any data you like under a key. A document database effectively does the same thing, since it makes no restrictions on the structure of the documents you store. Column-family databases allow you to store any data under any column you like. Graph databases allow you to freely add new edges and freely add properties to nodes and edges as you wish.
Why schemaless?
A schemaless store makes it easier to deal with nonuniform data.
When starting a new development project you don't need to spend the same amount of time on up-front design of the schema.
There is no need to learn SQL or database-specific tools.
The rigid schema of a relational database (RDBMS) means you have to follow the schema exactly; it can be harder to push data into the DB because it has to fit the schema perfectly. Being able to add data directly, without having to tweak it to match the schema, can save you time.
With minor changes to the model you have to change both your code and the schema in the DBMS. With no schema, you don't have to make changes in two places, which is less time consuming.
With a NoSQL DB you have fewer ways to pull the data out.
Less overhead for the DB engine.
Less overhead for developers related to scalability.
Eliminates the need for database administrators or database experts -> fewer people involved and less waiting on experts.
Saves time writing complex SQL joins -> more rapid development.
Pros and cons of schemaless data
Pros:
More freedom and flexibility
You can easily change your data organization
You can deal with nonuniform data
Cons:
A program that accesses data:
almost always relies on some form of implicit schema
assumes that certain fields are present
assumes those fields carry data with a certain meaning
The implicit schema is shifted into the application code that accesses the data. To understand what data is present you have to look at the application code. The schema cannot be used to:
decide how to store and retrieve data efficiently
ensure data consistency
Problems arise if multiple applications, developed by different people, access the same database.
Relational schemas, by contrast, can be changed at any time with standard SQL commands.
Key-value databases
A key-value store is a simple hash table, primarily used when all access to the database is via primary key.
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for a key, put a value for a key, or delete a key from the data store. The value is a BLOB (Binary Large Object) that the data store just stores, without caring or knowing what's inside; it's the responsibility of the application to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
A key-value store is an associative container such as a map, a dictionary, or, in query processing, an index. It is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing. The relationship between a key and its value is sometimes called a mapping or binding.
Some of the popular key-value databases are Riak, Redis, Memcached DB, Berkeley DB, HamsterDB, and Amazon DynamoDB.
A key-value model is great for lookups of simple or even complex values. When the values are themselves interconnected, you've got a graph, as shown in the following figure, which lets you traverse quickly among all the connected values.
In a key-value database:
Data is stored sorted by key.
Callers can provide a custom comparison function to override the sort order.
The basic operations are Put(key, value), Get(key), Delete(key).
Multiple changes can be made in one atomic batch.
Users can create a transient snapshot to get a consistent view of data.
Forward and backward iteration is supported over the data.
In key-value databases, all the data for a key is stored as a single object, which is placed into a bucket. Buckets are used to define a virtual keyspace and provide the ability to define isolated non-default configuration. Buckets might be compared to tables or folders in relational databases or file systems, respectively.
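As a minimal sketch of the Put/Get/Delete operations listed above, the following Java fragment models one bucket as an in-memory sorted map; the class and method names are invented for illustration and do not correspond to any particular product.

import java.util.NavigableMap;
import java.util.TreeMap;

// A toy key-value "bucket": values are opaque byte arrays (BLOBs).
class Bucket {
    private final NavigableMap<String, byte[]> store = new TreeMap<>(); // data sorted by key

    void put(String key, byte[] value) { store.put(key, value); }   // Put(key, value)
    byte[] get(String key)             { return store.get(key); }   // Get(key)
    void delete(String key)            { store.remove(key); }       // Delete(key)

    public static void main(String[] args) {
        Bucket users = new Bucket();
        users.put("user:1234", "{\"name\":\"Martin\"}".getBytes());
        // The store does not interpret the BLOB; the application does.
        System.out.println(new String(users.get("user:1234")));
    }
}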
As their name suggests, key-value stores hold key/value pairs. For example, for a search engine, a store may associate with each keyword (the key) a list of documents containing it (the corresponding value).
One approach to implementing a key-value store is to use a file decomposed into blocks. As the following figure shows, each block is associated with a number (ranging from 1 to n). Each block manages a set of key-value pairs: the beginning of the block contains, after some header information, an index of keys and the positions of the corresponding values. The values themselves are stored starting from the end of the block (like a memory heap). The free space available is delimited by the end of the index and the end of the values.
In this implementation, the size of a block is important since it defines the largest value that can be stored (for example, the longest list of document identifiers containing a given keyword). Moreover, it supposes that a block number is associated with each key. These block numbers can be assigned in two different ways:
1. The block number is obtained directly from the key, typically by using a hash function. The size of the file is then defined by the largest block number computed over every possible key.
2. The block number is assigned incrementally. When a new pair must be stored, the first block that can hold it is chosen. In practice, a given amount of space is reserved in a block in order to manage updates of existing pairs (a new value can replace an older, smaller one). This limits the size of the file to the amount of values to store.
Document Databases
In a relational database system you must define a schema before adding records to a database. The schema is the structure, described in a formal language supported by the database, that provides a blueprint for the tables in a database and the relationships between tables of data. Within a table, you need to define constraints in terms of rows and named columns as well as the type of data that can be stored in each column.
In contrast, a document-oriented database contains documents, which are records that describe the data in the document as well as the actual data. Documents can be as complex as you choose; you can use nested data to provide additional sub-categories of information about your object. You can also use one or more documents to represent a real-world object. The following compares a conventional table with document-based objects:
In this example we have a table that represents beers and their respective attributes: id, beer name, brewer, bottles available, and so forth. As we see in this illustration, the relational model conforms to a schema with a specified number of fields, each representing a specific purpose and data type. The equivalent document-based model has an individual document per beer; each document contains the same types of information for a specific beer.
In a document-oriented model, data objects are stored as documents; each document stores your data and enables you to update or delete it. Instead of columns with names and data types, we describe the data in the document and provide the value for that description. If we wanted to add attributes to a beer in a relational model, we would need to modify the database schema to include the additional columns and their data types. In the case of document-based data, we would add additional key-value pairs into our documents to represent the new fields.
The other characteristic of a relational database is data normalization; this means you decompose data into smaller, related tables. The figure below illustrates this:
In the relational model, data is shared across multiple tables. The advantage of this model is that there is less duplicated data in the database. If we did not separate beers and brewers into different tables and had one beer table instead, we would have repeated information about breweries for each beer produced by that brewer.
The problem with this approach is that when you change information across tables, you need to lock those tables simultaneously to ensure the information changes across the tables consistently. Because you also spread information across a rigid structure, it is more difficult to change the structure during production, and it is also difficult to distribute the data across multiple servers.
In a document-oriented database, we could choose to have two different document structures: one for beers, and one for breweries. Instead of splitting your application objects into tables and rows, you would turn them into documents. By providing a reference in the beer document to a brewery document, you create a relationship between the two entities:
In this example we have two different beers from the Amtel brewery. We represent each beer as a separate document and reference the brewery in the brewer field. The document-oriented approach provides several upsides compared to the traditional RDBMS model. First, because information is stored in documents, updating a schema is a matter of updating the documents for that type of object. This can be done with no system downtime. Secondly, we can distribute the information across multiple servers with greater ease. Since records are contained within entire documents, it is easier to move or replicate an entire object to another server.
Using JSON Documents
JavaScript Object Notation (JSON) is a lightweight data-interchange format which is easy to read and change. JSON is language-independent, although it uses constructs similar to JavaScript. The following are the basic data types supported in JSON:
Numbers, including integer and floating point
Strings, including all Unicode characters and backslash escape characters
Boolean: true or false
Arrays, enclosed in square brackets: ["one", "two", "three"]
Objects, consisting of key-value pairs, also known as an associative array or hash. The key must be a string and the value can be any supported JSON data type.
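The short Java sketch below mirrors those JSON data types using plain collections (the beer values are invented for illustration); any JSON library could serialize such a structure into a document like the ones shown next.

import java.util.List;
import java.util.Map;

public class JsonTypesExample {
    public static void main(String[] args) {
        // An object (key-value pairs) holding a string, a number, a boolean,
        // an array, and a nested object - the same types JSON supports.
        Map<String, Object> beer = Map.of(
            "name", "Sample IPA",                 // string
            "abv", 6.5,                           // number (floating point)
            "available", true,                    // boolean
            "tags", List.of("hoppy", "ipa"),      // array
            "brewery", Map.of("city", "Chicago")  // nested object
        );
        System.out.println(beer);
    }
}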
For instance, if you are creating a beer application, you might want a particular document structure to represent a beer:
{
  "name":
  "description":
  "category":
  "updated":
}
For each of the keys in this JSON document you would provide unique values to represent individual beers. If you want to provide more detailed information in your beer application about the actual breweries, you could create a JSON structure to represent a brewery:
{
  "name":
  "address":
  "city":
  "state":
  "website":
  "description":
}
Performing data modeling for a document-based application is no different than the work you would need to do for a relational database. For the most part it can be much more flexible, it can provide a more realistic representation of your application data, and it also enables you to change your mind later about data structure. For more complex items in your application, one option is to use nested pairs to represent the information:
{
  "name":
  "address":
  "city":
  "state":
  "website":
  "description":
  "geo":
  {
    "location": ["-105.07", "40.59"],
    "accuracy": "RANGE_INTERPOLATED"
  },
  "beers": [_id4058, _id7628]
}
In this case we added a nested attribute for the geo-location of the brewery and for its beers. Within the location, we provide an exact longitude and latitude, as well as a level of accuracy for plotting it on a map. The level of nesting you provide is your decision; as long as a document is under the maximum storage size for the server, you can provide any level of nesting that you can handle in your application.
In traditional relational database modeling, you would create tables that contain a subset of information for an item. For instance, a brewery may have types of beers which are stored in a separate table and referenced by the beer id. In the case of JSON documents, you use key-value pairs, or even nested key-value pairs.
Column-Family Stores
The name conjures up a tabular structure, which these stores realize with sparse columns and no schema. The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking out the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. As well as accessing the row as a whole, operations also allow picking out a particular column, so to get a particular customer's name from the following figure you could do something like
get('1234', 'name').
Column-family databases organize their columns into column families. Each column has to be part of a single column family, and the column acts as the unit of access, with the assumption that data for a particular column family will usually be accessed together.
The data is structured into:
• Row-oriented: Each row is an aggregate (for example, the customer with the ID of 1234), with column families representing useful chunks of data (profile, order history) within that aggregate.
• Column-oriented: Each column family defines a record type (e.g., customer profiles) with rows for each of the records. You then think of a row as the join of records in all column families.
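A rough Java sketch of the two-level aggregate structure described above, using nested maps (the names are illustrative only): the outer key is the row identifier and the inner map holds the columns, so the get('1234', 'name') access pattern becomes two map lookups.

import java.util.HashMap;
import java.util.Map;

public class ColumnFamilySketch {
    // row id -> (column name -> value): a toy stand-in for one column family
    private static final Map<String, Map<String, String>> profileFamily = new HashMap<>();

    public static void main(String[] args) {
        profileFamily.put("1234", new HashMap<>(Map.of(
            "name", "Martin",
            "city", "Chicago"
        )));

        // Whole-row access: the aggregate of interest
        Map<String, String> row = profileFamily.get("1234");

        // Single-column access, analogous to get('1234', 'name')
        String name = row.get("name");
        System.out.println(name); // Martin
    }
}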
Even though a document database declares some structure to the database, each document is still seen as a single unit. Column families give a two-dimensional quality to column-family databases.
Cassandra uses the terms "wide" and "skinny." Skinny rows have few columns, with the same columns used across many different rows. In this case, the column family defines a record type, each row is a record, and each column is a field. A wide row has many columns (perhaps thousands), with rows having very different columns. A wide column family models a list, with each column being one element in that list.
Relationships: Atomic Aggregates
Aggregates allow one to store a single business entity as one document, row, or key-value pair and update it atomically:
Graph Databases:
Graph databases are one style of NoSQL database that uses a distribution model similar to relational databases but offers a different data model that makes it better at handling data with complex relationships.
Entities are also known as nodes, which have properties.
Nodes are organized by relationships, which allow you to find interesting patterns between the nodes.
The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.
Let's follow along some graphs, using them to express themselves. We'll read a graph by following arrows around the diagram to form sentences.
A Graph contains Nodes and Relationships
A Graph –[:RECORDS_DATA_IN]–> Nodes –[:WHICH_HAVE]–> Properties.
The simplest possible graph is a single Node, a record that has named values referred to as Properties. A Node could start with a single Property and grow to a few million, though that can get a little awkward. At some point it makes sense to distribute the data into multiple nodes, organized with explicit Relationships.
Query a Graph with a Traversal
A Traversal – navigates –> a Graph; it – identifies –> Paths – which order –> Nodes.
A Traversal is how you query a Graph, navigating from starting Nodes to related Nodes according to an algorithm, finding answers to questions like "what music do my friends like that I don't yet own," or "if this power supply goes down, what web services are affected?"
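To make nodes, relationships, and traversal concrete, here is a minimal Java sketch (the node names and relationships are invented, and the structure is a simplification, not any real graph database API): nodes carry properties, edges record relationships, and a breadth-first walk answers "what can I reach from this node?"

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GraphSketch {
    record Node(String id, Map<String, String> properties) {}

    public static void main(String[] args) {
        Node martin = new Node("martin", Map.of("name", "Martin"));
        Node pramod = new Node("pramod", Map.of("name", "Pramod"));
        Node book = new Node("book1", Map.of("title", "NoSQL Distilled"));

        // relationships: node -> neighbours (e.g. FRIEND_OF, LIKES)
        Map<Node, List<Node>> edges = Map.of(
            martin, List.of(pramod),   // Martin FRIEND_OF Pramod
            pramod, List.of(book)      // Pramod LIKES the book
        );

        // traversal: visit everything reachable from 'martin'
        Deque<Node> frontier = new ArrayDeque<>(List.of(martin));
        Set<Node> visited = new HashSet<>();
        while (!frontier.isEmpty()) {
            Node current = frontier.poll();
            if (visited.add(current)) {
                System.out.println("visited " + current.properties());
                frontier.addAll(edges.getOrDefault(current, List.of()));
            }
        }
    }
}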
Example
NoSQL databases do not have views; instead they have precomputed and cached queries, usually called "materialized views".
Distribution Models
Multiple servers: in NoSQL systems, data is distributed over large clusters.
Single server: the simplest model, with everything on one machine. Run the database on a single machine that handles all the reads and writes to the data store. We prefer this option where possible because it eliminates all the complexities of distribution; it's easy for operations people to manage and easy for application developers to reason about.
Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more suited to the application. Graph databases are the obvious category here; these work best in a single-server configuration.
If your data usage is mostly about processing aggregates, then a single-server document or key-value store may well be worthwhile because it's easier on application developers.
Orthogonal aspects of data distribution models:
Sharding:
DB sharding is nothing but horizontal partitioning of data. Different people access different parts of the dataset, so we can support horizontal scalability by putting different parts of the data onto different servers, a technique that is called sharding.
A table with billions of rows can be partitioned using range partitioning, for example on the customer transaction date; that partitions the data vertically. So irrespective of which instance in a Real Application Cluster accesses the data, it is not horizontally partitioned (although Global Enqueue Resources own certain blocks in each instance, and those can move around). In a DB shard environment, by contrast, the data is horizontally partitioned. For example, United States customers can live in one shard, European Union customers in another, and customers from other countries in a third; but from an access perspective there is no need to know where the data lives, because the shard layer goes to the appropriate shard to pick up the data.
Different parts of the data are placed onto different servers.
Horizontal scalability
Ideal case: different users all talking to different server nodes
Data accessed together is kept on the same node, the aggregate unit!
Pros: it can improve both reads and writes
Cons: clusters use less reliable machines, so resilience decreases
Many NoSQL databases offer auto-sharding: the database takes on the responsibility of sharding
Improving performance, the main rules of sharding:
1. Place the data close to where it is accessed
Orders for Boston: data in your eastern US data center
2. Try to keep the load even
All nodes should get equal amounts of the load
3. Put together aggregates that may be read in sequence
Same order, same node
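To make the routing idea concrete, here is a minimal sketch (the shard names and classes are invented, not any real product) that maps a customer key to one of several shards by hashing, so the application never needs to know where the data physically lives.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardRouter {
    // Hypothetical shard names standing in for separate servers.
    private static final List<String> SHARDS = List.of("shard-us", "shard-eu", "shard-other");
    private static final Map<String, Map<String, String>> DATA = new HashMap<>();

    static { SHARDS.forEach(s -> DATA.put(s, new HashMap<>())); }

    // Route a key to a shard by hashing it; callers never pick a server themselves.
    static String shardFor(String key) {
        int index = Math.floorMod(key.hashCode(), SHARDS.size());
        return SHARDS.get(index);
    }

    public static void main(String[] args) {
        String key = "customer:1234";
        DATA.get(shardFor(key)).put(key, "{\"name\":\"Martin\"}");
        System.out.println(key + " lives on " + shardFor(key));
    }
}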
Master-Slave Replication
Master
is the authoritative source for the data
is responsible for processing any updates to that data
can be appointed manually or automatically
Slaves
A replication process synchronizes the slaves with the master
After a failure of the master, a slave can be appointed as the new master very quickly
Pros and cons of Master-Slave Replication
Pros:
More read requests: add more slave nodes and ensure that all read requests are routed to the slaves
Should the master fail, the slaves can still handle read requests
Good for read-intensive datasets
Peer-to-Peer Replication
All the replicas have equal weight and they can all accept writes
The loss of any of them doesn't prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
You can ride over node failures without losing access to data
You can easily add nodes to improve your performance
Cons:
Inconsistency
Slow propagation of changes to copies on different nodes
Inconsistencies on read lead to problems but are relatively transient
Two people can update different copies of the same record stored on different nodes at the same time: a write-write conflict.
Inconsistent writes are forever.
Sharding and Replication on Master-Slave
Replication and sharding are strategies that can be combined. If we use both master-slave replication and sharding, this means that we have multiple masters, but each data item only has a single master. Depending on your configuration, you may choose a node to be a master for some data and a slave for others, or you may dedicate nodes for master or slave duties.
We have multiple masters, but each data item only has a single master. Two schemes:
A node can be a master for some data and a slave for others
Nodes are dedicated for master or slave duties
Sharding and Replication on P2P
Using peer-to-peer replication and sharding is a common strategy for column-family databases. In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them. A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. Should a node fail, the shards on that node will be rebuilt on the other nodes. (See the following figure.)
Usually each shard is present on three nodes
A common strategy for column-family databases
Key Points
• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.
A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication avoids loading all writes onto a single point of failure.
Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.
Consistency is the biggest change when moving from a centralized relational database to a cluster-oriented NoSQL system. Relational databases have strong consistency, whereas NoSQL systems mostly have eventual consistency.
ACID: A DBMS is expected to support "ACID transactions," processes that are:
Atomicity: either the whole process is done or none of it is
Consistency: only valid data are written
Isolation: one operation at a time
Durability: once committed, it stays that way
Various forms of consistency
1. Update Consistency (or write-write conflict):
Martin and Pramod are looking at the company website and notice that the phone number is out of date. Incredibly, they both have update access, so they both go in at the same time to update the number. To make the example interesting, we'll assume they update it slightly differently, because each uses a slightly different format. This issue is called a write-write conflict: two people updating the same data item at the same time.
When the writes reach the server, the server will serialize them: decide to apply one, then the other. Let's assume it uses alphabetical order and picks Martin's update first, then Pramod's. Without any concurrency control, Martin's update would be applied and immediately overwritten by Pramod's. In this case Martin's is a lost update. We see this as a failure of consistency because Pramod's update was based on the state before Martin's update, yet was applied after it.
Solutions:
Pessimistic approach
Prevents conflicts from occurring
Usually implemented with write locks managed by the system
Optimistic approach
Lets conflicts occur, but detects them and takes action to sort them out
Approaches:
conditional updates: test the value just before updating
save both updates: record that they are in conflict and then merge them
These do not work if there is more than one server (peer-to-peer replication)
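A minimal Java sketch of the conditional-update idea above (the AtomicReference simply simulates the stored value; a real database would perform this check server-side): the write succeeds only if the value has not changed since it was read, so the second writer's lost update is detected instead of silently applied.

import java.util.concurrent.atomic.AtomicReference;

public class ConditionalUpdateSketch {
    public static void main(String[] args) {
        // The "database" holds the company phone number.
        AtomicReference<String> phoneNumber = new AtomicReference<>("555-0100");

        // Both Martin and Pramod read the same starting value.
        String martinRead = phoneNumber.get();
        String pramodRead = phoneNumber.get();

        // Martin's conditional update: apply only if nothing changed since his read.
        boolean martinOk = phoneNumber.compareAndSet(martinRead, "555-0199");

        // Pramod's conditional update now fails, because the value he read is stale.
        boolean pramodOk = phoneNumber.compareAndSet(pramodRead, "(555) 0198");

        System.out.println("Martin applied: " + martinOk); // true
        System.out.println("Pramod applied: " + pramodOk); // false, conflict detected
        System.out.println("Stored value: " + phoneNumber.get());
    }
}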
2. Read Consistency (or read-write conflict)
Alice and Bob are using the Ticketmaster website to book tickets for a specific show, and only one ticket is left. Alice signs on to Ticketmaster first, finds the one remaining ticket, and finds it expensive; she takes time to decide. Bob signs on, finds one ticket left, and orders it instantly. Bob purchases and logs off. When Alice decides to buy a ticket, she finds there are no tickets left. This is a typical read-write conflict situation.
Another example, where Pramod has done a read in the middle of Martin's write, is shown below.
We refer to this type of consistency as logical consistency. To avoid a logically inconsistent read, Martin wraps his two writes in a transaction; the system then guarantees that Pramod will either read both data items before the update or both after the update. The length of time an inconsistency is present is called the inconsistency window.
Replication consistency
Let's imagine there's one last hotel room for a desirable event. The hotel reservation system runs on many nodes. Martin and Cindy are a couple considering this room, but they are discussing it on the phone because Martin is in London and Cindy is in Boston. Meanwhile Pramod, who is in Mumbai, goes and books that last room. That updates the replicated room availability, but the update gets to Boston quicker than it gets to London. When Martin and Cindy fire up their browsers to see if the room is available, Cindy sees it booked and Martin sees it free. This is another inconsistent read, but it's a breach of a different form of consistency we call replication consistency: ensuring that the same data item has the same value when read from different replicas.
Eventual consistency:
At any time, nodes may have replication inconsistencies but, if there are no further updates, eventually all nodes will be updated to the same value. In other words, eventual consistency is a consistency model used in NoSQL databases to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
Eventually consistent services are often classified as providing BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.
Basic Availability. The NoSQL database approach focuses on availability of data even in the presence of multiple failures. It achieves this by using a highly distributed approach to database management. Instead of maintaining a single large data store and focusing on the fault tolerance of that store, NoSQL databases spread data across many storage systems with a high degree of replication. In the unlikely event that a failure disrupts access to a segment of data, this does not necessarily result in a complete database outage.
Soft state. BASE databases abandon the consistency requirements of the ACID model pretty much completely. One of the basic concepts behind BASE is that data consistency is the developer's problem and should not be handled by the database.
Eventual Consistency. The only requirement that NoSQL databases have regarding consistency is that at some point in the future, data will converge to a consistent state. No guarantees are made, however, about when this will occur. That is a complete departure from the immediate consistency requirement of ACID, which prohibits a transaction from executing until the prior transaction has completed and the database has converged to a consistent state.
Version stamp: a field that changes every time the underlying data in the record changes. When you read the data you keep a note of the version stamp, so that when you write data you can check whether the version has changed.
You may have come across this technique when updating resources with HTTP. One way of doing this is to use etags. Whenever you get a resource, the server responds with an etag in the header. This etag is an opaque string that indicates the version of the resource. If you then update that resource, you can use a conditional update by supplying the etag that you got from your last GET. If the resource has changed on the server, the etags won't match and the server will refuse the update, returning a 412 (Precondition Failed) error response. In short, it helps you detect concurrency conflicts.
When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write.
Version stamps can be implemented using counters, GUIDs (large random numbers guaranteed to be unique), content hashes, timestamps, or a combination of these.
With distributed systems, a vector of version stamps (a set of counters, one for each node) allows you to detect when different nodes have conflicting updates.
Checking the version before applying a write in this way is sometimes called a compare-and-set (CAS) operation.
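As an illustrative sketch (not tied to any particular database), a vector of version stamps can be represented as a per-node counter map; two vectors indicate conflicting updates when each is ahead of the other on at least one node.

import java.util.HashMap;
import java.util.Map;

public class VectorStampSketch {
    // Two vectors conflict when each side has at least one counter greater than
    // the other's, i.e. neither version descends from the other.
    static boolean conflict(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> union = new HashMap<>(a);
        union.putAll(b);
        boolean aAhead = false, bAhead = false;
        for (String node : union.keySet()) {
            int va = a.getOrDefault(node, 0);
            int vb = b.getOrDefault(node, 0);
            if (va > vb) aAhead = true;
            if (vb > va) bAhead = true;
        }
        return aAhead && bAhead;
    }

    public static void main(String[] args) {
        // node1 and node2 each accepted a different update to the same record.
        Map<String, Integer> copyOnNode1 = Map.of("node1", 2, "node2", 1);
        Map<String, Integer> copyOnNode2 = Map.of("node1", 1, "node2", 2);
        System.out.println("conflicting updates? " + conflict(copyOnNode1, copyOnNode2)); // true
    }
}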
Relaxing consistency
The CAP Theorem: the basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.
Consistency: all people see the same data at the same time
Availability: if you can talk to a node in the cluster, it can read and write data
Partition tolerance: the cluster can survive communication breakages that separate the cluster into partitions unable to communicate with each other
Network partition: the CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
Very large systems will "partition" at some point: that leaves either C or A to choose from (a traditional DBMS prefers C over A and P).
In almost all cases, you would choose A over C (except in specific applications such as order processing).
CA systems
A single-server system is the obvious example of a CA system.
CA cluster: if a partition occurs, all the nodes would go down.
A failed, unresponsive node doesn't imply a lack of CAP availability.
A system that suffers partitions must trade off consistency vs. availability: give up some consistency to get some availability.
An example
Ann is trying to book a room at the Ace Hotel in New York on a node of a booking system located in London.
Pathin is trying to do the same on a node located in Mumbai.
The booking system uses peer-to-peer distribution.
There is only one room available.
The network link breaks.
Possible solutions
CP: Neither user can book any hotel room, sacrificing availability.
CaP: Designate the Mumbai node as the master for the Ace Hotel.
Pathin can make the reservation.
Ann can see the inconsistent room information.
Ann cannot book the room.
AP: Both nodes accept the hotel reservation. Overbooking!
Map-Reduce
Map-Reduce is a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases.
Logical view
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
Map(k1, v1) → list(k2, v2)
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return.
Example 1: Counting and Summing
class Reducer
    method Reduce(word t, counts [c1, c2, ...])
        sum = 0
        for all count c in [c1, c2, ...] do
            sum = sum + c
        Emit(word t, sum)
Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
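For completeness, a plain-Java sketch of the corresponding map step (independent of any particular framework; the class name is invented): it splits each input line into words and emits a (word, 1) pair for each occurrence, which the framework would then group by key and hand to the reducer above.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMapSketch {
    // map(documentId, line) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1)); // emit(word, 1)
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("the quick brown fox and the lazy dog"));
    }
}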
Example 2: Multistage map-reduce calculations
Let us say that we have a set of documents with attributes of the following form:
{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "excellent blog..."
    }
  ]
}
And we want to answer a question over more than a single document. That sort of operation requires us to use aggregation and, over a large amount of data, that is best done using Map/Reduce to split the work.
Map/Reduce is just a pair of functions operating over a list of data. Let us say that we want to get a count of comments per blog. We can do that using the following Map/Reduce queries:

from post in docs.posts
select new {
    post.blog_id,
    comments_length = comments.length
};

from agg in results
group agg by agg.key into g
select new {
    agg.blog_id,
    comments_length = g.Sum(x => x.comments_length)
};
There are a couple of things to note here:
The first query is the map query; it maps the input document into the final format.
The second query is the reduce query; it operates over a set of results and produces an answer.
The first value in the result is the key, which is what we are aggregating on (think of the group by clause in SQL).
Let us see how this works. We start by applying the map query to the set of documents that we have, producing this output:
The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result. In this case, imagine that the output of the first step was divided into groups of 3 (so 4 groups overall), and then the reduce query was applied to each, giving us:
You can see why it is called reduce: for every batch, we apply a sum by blog_id to get a new total comments value. We started with 11 rows, and we ended up with just 10. That is where it gets interesting, because we are still not done; we can still reduce the data further.
This is what we do in the third step, reducing the data further still. That is why the input and output formats of the reduce query must match: we feed the output of several reduce queries as the input of a new one. You can also see that we have now moved from having 10 rows to having just 7.
And the final step is:
And now we are done: we can't reduce the data any further because all the keys are unique.
RDBMS compared to MapReduce
MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
Partitioning and Combining
The first thing we can do is increase parallelism by partitioning the output of the mappers. Each reduce function operates on the results of a single key. This is a limitation (it means you can't do anything in the reduce that operates across keys), but it is also a benefit in that it allows you to run multiple reducers in parallel. To take advantage of this, the results of the mapper are divided up based on the key on each processing node. Typically, multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer. Multiple reducers can then operate on the partitions in parallel, with the final results merged together.
(This step is also called "shuffling," and the partitions are sometimes referred to as "buckets" or "regions.")
The next problem we can deal with is the amount of data being moved from node to node between the map and reduce stages. Much of this data is repetitive, consisting of multiple key-value pairs for the same key. A combiner function cuts this data down by combining all the data for the same key into a single value (see figure).
Combinable Reducer Example
A reduce function used as a combiner needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
UNIT III BASICS OF HADOOP
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes – design of Hadoop distributed file system (HDFS) – HDFS concepts
Introduction to Hadoop
Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, in turn utilizing the underlying parallelism of the CPU cores.
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
MAPREDUCE in Hadoop
Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from the others. While this sounds like a major limitation at first, it makes the whole framework much more reliable. Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named "MapReduce."
In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
Hadoop Architecture:
Typically the cluster has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.
Data Format
InputFormat: how input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
Selects the files or other objects that should be used for input
Defines the InputSplits that break a file into tasks
Provides a factory for RecordReader objects that read the file
OutputFormat: the (key, value) pairs provided to the OutputCollector are then written to output files.
Hadoop can process many different types of data formats, from flat text files to databases. If it is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record. For example, in the NCDC (National Climatic Data Center) data given below, the format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
Data files are organized by date and weather station.
To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
These lines are presented to the map function as key-value pairs.
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature (indicated in bold text) and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The whole data flow is illustrated in the following figure.
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by an implementation of the Mapper interface, which declares a map() method.

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // statement to convert the input data into a String
        // statement to obtain the year and temperature using the substring method
        // statement to place the year and temperature into the OutputCollector
    }
}

The Mapper interface is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.
The map() method also provides an instance of OutputCollector to write the output to.
The reduce function is similarly defined using a Reducer, as illustrated in the following figure.
The third piece of code runs the MapReduce job.

public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}

A JobConf object forms the specification of the job. It gives you control over how the job is run. Having constructed a JobConf object, we specify the input and output paths. Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions. The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.
Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
The whole data flow with a single reduce task is illustrated in the figure.
Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle, since the processing can be carried out entirely in parallel, as shown in the figure.
Combiner Functions
Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:
(1950, [20, 25])
and the reduce would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

public class MaxTemperatureWithCombiner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
        conf.setJobName("Max temperature");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setCombinerClass(MaxTemperatureReducer.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
Streaming is naturally suited for text processing, and when used in text mode, it has a line-oriented view of data. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line.
Let's illustrate this by rewriting our MapReduce program for finding maximum temperatures by year in Streaming.
Ruby: the map function can be expressed in Ruby as shown below.

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

Since the script just operates on standard input and output, it's trivial to test the script without using Hadoop, simply using Unix pipes:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950    +0000
1950    +0022
1950    -0011
1949    +0111
1949    +0078
The key line of the reduce function is shown below:
puts "#{last_key}\t#{max_val}" if last_key
We can now simulate the whole MapReduce pipeline with a Unix pipeline and get the same result.
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
The following example shows the source code for the map and reduce functions in C++.

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
    MaxTemperatureMapper(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context) {
        // statement to convert the input data into a string
        // statement to obtain the year and temperature using the substring method
        // statement to place the year and temperature into a set
    }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
    MapTemperatureReducer(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context) {
        // statement to find the maximum temperature of each year
        // statement to put the max. temperature and its year in a set
    }
};

int main(int argc, char *argv[]) {
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<MaxTemperatureMapper, MapTemperatureReducer>());
}

The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case.
These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as accessing job configuration information via the JobConf class.
The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
The sample data also needs to be copied from the local filesystem into HDFS:
% hadoop fs -put input/ncdc/sample.txt sample.txt
Now we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument:
% hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input sample.txt \
    -output output \
    -program bin/max_temperature
The result is the same as the other versions of the same program that we ran in the previous example.
Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
"Very large"
In this context, "very large" means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.
Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file.
HDFS Concepts
The following diagram illustrates the Hadoop concepts.
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default.
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
Benefits of blocks
1. A file can be larger than any single disk in the network. There is nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
2. It simplifies the storage subsystem. The storage subsystem deals with blocks, simplifying storage management and eliminating metadata concerns.
3. Blocks fit well with replication for providing fault tolerance and availability.
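As a small illustration of the block idea (the HDFS path comes from the command line and is a placeholder), the Hadoop FileSystem API can list the blocks of a file together with the datanodes holding each replica; this is a sketch, not part of the examples in these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);             // an HDFS path passed on the command line

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, with the datanodes that hold its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}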
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to, and they report back to the namenode periodically with lists of the blocks that they are storing. Without the namenode, the filesystem cannot be used.
Secondary Namenode
It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
The Command-Line Interface
There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar. It provides a command-line interface called the FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
Action                                                  Command
Create a directory named /foodir                        bin/hadoop dfs -mkdir /foodir
View the contents of a file named /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt
HadoopFilesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one
implementation.TheJavaabstractclassorg.apache.hadoop.fs.FileSystemrepresentsafiles
ysteminHadoop,andthereareseveralconcreteimplementations,whicharedescribedinTabl
e
Thrift
The Thrift API in the "thriftfs" module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS, but despite its name it can be used to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client.
FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop's Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem.
WebDAV
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it is possible to access HDFS as a standard filesystem.
File patterns
It is a common requirement to process sets of files in a single operation. Hadoop supports the same set of glob characters as Unix bash, as shown in the table.
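As a small illustration, the sketch below (an assumption, not part of the notes) expands a glob against the filesystem using FileSystem.globStatus(); the /2007/*/* pattern is a hypothetical directory layout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Expand the glob into the set of matching paths, e.g. all month/day directories of 2007
    FileStatus[] matches = fs.globStatus(new Path("/2007/*/*"));
    for (Path p : FileUtil.stat2Paths(matches)) {
      System.out.println(p);
    }
  }
}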
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode and the datanodes, consider the figure, which shows the main sequence of events when reading a file.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
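The read path above can be exercised with a short client program. The following sketch is illustrative only (the file URI is hypothetical); it opens the file through the FileSystem object, streams it, and then uses the seek() support of FSDataInputStream to re-read it from the beginning.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekableRead {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode/user/tom/quangle.txt";   // hypothetical HDFS path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));                      // steps 1-2: the namenode returns block locations
      IOUtils.copyBytes(in, System.out, 4096, false);   // steps 3-5: data streamed from the datanodes
      in.seek(0);                                       // FSDataInputStream supports file seeks
      IOUtils.copyBytes(in, System.out, 4096, false);   // read the file a second time
    } finally {
      IOUtils.closeStream(in);                          // step 6: close()
    }
  }
}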
Anatomy of a File Write
The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
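A minimal write-path sketch is shown below; it is an illustration under assumed paths (a local source file and an HDFS destination), not part of the original notes. The call to create() triggers steps 1 and 2, and the copy loop drives the packet pipeline described above.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileCopyToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];                          // a local file to copy
    String dst = args[1];                               // e.g. hdfs://namenode/user/tom/copy.txt
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst));        // steps 1-2: create() asks the namenode for a new file
    IOUtils.copyBytes(in, out, 4096, true);             // steps 3-4: packets flow down the datanode pipeline
  }
}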
Application level gateway is also called a bastion host. It operates at the application level.
Multiple application gateways can run on the same host but each gateway is a separate server
with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: A circuit-level firewall is a second generation firewall that validates
TCP and UDP sessions before opening a connection.
The firewall does not simply allow or disallow packets but also determines whether the
connection between both ends is valid according to configurable rules, then opens a session
and permits traffic only from the allowed source and possibly only for a limited period of
time.
It typically performs basic packet filter operations and then adds verification of proper handshaking of TCP and the legitimacy of the session information used in establishing the connection.
The decision to accept or reject a packet is based upon examining the packet's IP header and TCP header.
A circuit level gateway cannot examine the data content of the packets it relays between a trusted network and an untrusted network.
UNIT IV MAPREDUCE APPLICATIONS
MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of MapReduce job run – classic Map-reduce – YARN – failures in classic Map-reduce and YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output formats
MapReduce Workflows
This section looks at how to translate a data processing problem into the MapReduce model. When the processing gets more complex, the complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs. A MapReduce workflow is divided into two steps:
• Decomposing a Problem into MapReduce Jobs
• Running Jobs
1. Decomposing a Problem into MapReduce Jobs
Let's look at an example of a more complex problem that we want to translate into a MapReduce workflow. When we write a MapReduce workflow, we have to create two scripts:
2. Running Dependent Jobs (a linear chain of jobs) or more complex Directed Acyclic Graph (DAG) jobs
For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
For anything more complex than a linear chain, such as a DAG of jobs, there is a class called JobControl which represents a graph of jobs to be run.
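A minimal sketch of chaining two dependent jobs with the new-API JobControl classes is given below. It is an assumption for illustration: the job names are arbitrary, and the jobs themselves would still need mappers, reducers and paths configured.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TwoStepWorkflow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ControlledJob first = new ControlledJob(Job.getInstance(conf, "first"), null);
    ControlledJob second = new ControlledJob(Job.getInstance(conf, "second"), null);
    second.addDependingJob(first);              // second is held back until first succeeds

    JobControl control = new JobControl("two-step-workflow");
    control.addJob(first);
    control.addJob(second);

    Thread runner = new Thread(control);        // JobControl implements Runnable
    runner.start();
    while (!control.allFinished()) {            // poll until the whole graph has completed
      Thread.sleep(1000);
    }
    control.stop();
  }
}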
Oozie
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as a server, and a client submits a workflow to the server. In Oozie, a workflow is a DAG of action nodes and control-flow nodes. An action node performs a workflow task, like moving files in HDFS, running a MapReduce job or running a Pig job. When the workflow completes, Oozie can make an HTTP callback to the client to inform it of the workflow status. It is also possible to receive callbacks every time the workflow enters or exits an action node.
Oozie allows failed workflows to be re-run from an arbitrary point. This is useful for dealing with transient errors when the early actions in the workflow are time-consuming to execute.
How does MapReduce work? / Explain the anatomy of a classic MapReduce job run / How does Hadoop run a MapReduce job?
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It is very short, but it conceals a great deal of processing behind the scenes. The whole process is illustrated in the following figure.
As shown in Figure 1, there are four independent entities in the framework:
- Client, which submits the MapReduce job.
- JobTracker, which coordinates and controls the job run. It is a Java class called JobTracker.
- TaskTrackers, which run the tasks that the job has been split into, control the specific map or reduce task, and make reports to the JobTracker. They are Java classes as well.
- HDFS, which provides distributed data storage and is used to share job files between the other entities.
1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Task Progress and Status Updates
6. Task Completion
Job Submission
When the client calls submit() on the job object, an internal JobSubmitter Java object is created and submitJobInternal() is called. If the client calls waitForCompletion(), job progress is polled and reported back to the client until the job completes.
The JobSubmitter does the following work:
- Asks the JobTracker for a new job ID.
- Checks the output specification of the job.
- Computes the input splits for the job.
- Copies the resources needed to run the job. Resources include the job JAR file, the configuration file and the computed input splits. These resources are copied to HDFS in a directory named after the job ID. The job JAR is copied with a high replication factor (10 by default) across the cluster so that TaskTrackers can access it quickly.
- Tells the JobTracker that the job is ready for execution by calling submitJob() on the JobTracker.
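For reference, a bare-bones driver that performs this submission with the newer Job API might look like the sketch below. It is an assumption for illustration: it uses the identity Mapper and Reducer base classes so that it compiles on its own, where a real job would plug in its own classes, and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "minimal job");          // client-side job object
    job.setJarByClass(MinimalDriver.class);
    job.setMapperClass(Mapper.class);                        // identity mapper (stand-in for a real one)
    job.setReducerClass(Reducer.class);                      // identity reducer (stand-in for a real one)
    job.setOutputKeyClass(LongWritable.class);               // matches the default (offset, line) records
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input used to compute the splits
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output specification checked at submission
    // waitForCompletion() submits the job and polls its progress until it finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}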
Job Initialization
When the JobTracker receives the call to submitJob(), it puts the call into an internal queue from where the job scheduler will pick it up and initialize it. Initialization is done as follows:
- A job object is created to represent the job being run. It encapsulates its tasks and bookkeeping information so as to keep track of the tasks' progress and status.
- The JobTracker retrieves the input splits from HDFS and creates the list of tasks, each of which has a task ID. It creates one map task for each split, and the number of reduce tasks according to the configuration.
- The JobTracker also creates a setup task and a cleanup task. The setup task creates the final output directory for the job and the temporary working space for the task output. The cleanup task deletes the temporary working space for the task output.
- The JobTracker assigns tasks to free TaskTrackers.
Task Assignment
TaskTrackers send heartbeats periodically to the JobTracker to tell it that they are alive and ready to get a new task. The JobTracker allocates a new task to a ready TaskTracker. Task assignment works as follows:
- The JobTracker chooses a job to select the task from according to its scheduling algorithm; a simple way is to choose from a priority list of jobs. Having chosen the job, the JobTracker chooses a task from that job.
- TaskTrackers have a fixed number of slots for map tasks and for reduce tasks, which are set independently; the scheduler fills the empty map task slots before the reduce task slots.
- To choose a reduce task, the JobTracker simply takes the next in its list of yet-to-be-run reduce tasks, because there is no data locality consideration. A map task, however, is chosen depending on data locality and the TaskTracker's network location.
Task Execution
Progress and Status Updates
After a client submits a job, the MapReduce job is a long-running batch job, so progress reporting is important. The following operations constitute progress in a Hadoop task:
- Reading an input record in a mapper or reducer
- Writing an output record in a mapper or a reducer
- Setting the status description on a reporter, using the Reporter's setStatus() method
- Incrementing a counter
- Calling Reporter's progress()
The mapper or reducer running in the child JVM reports to its TaskTracker with its progress status every few seconds. The mapper or reducer sets a flag to indicate a status change that should be sent to the TaskTracker. The flag is checked in a separate thread every 3 seconds; if it is set, the thread notifies the TaskTracker of the current task status.
The JobTracker combines all of the updates to produce a global view, and the client can use getStatus() to get the job progress status.
Job Completion
When the JobTracker receives a report that the last task for a job is complete, it changes the job's status to successful. The JobTracker then sends an HTTP notification to the client which called waitForCompletion(). The job statistics and the counter information are printed to the client console. Finally, the JobTracker and the TaskTrackers perform cleanup actions for the job.
MRUnit test
MRUnit is based on JUnit and allows for the unit testing of mappers, reducers and some limited integration testing of the mapper-reducer interaction, along with combiners, custom counters and partitioners.
To write your test you would:
Testing Mappers
1. Instantiate an instance of the MapDriver class parameterized exactly as the mapper under test.
2. Add an instance of the Mapper you are testing in the withMapper call.
3. In the withInput call pass in your key and input value.
4. Specify the expected output in the withOutput call.
5. The last call, runTest, feeds the specified input values into the mapper and compares the actual output against the expected output set in the withOutput method.
Testing Reducers
1. The test starts by creating a list of objects (pairList) to be used as the input to the reducer.
2. A ReduceDriver is instantiated.
3. Next we pass in an instance of the reducer we want to test in the withReducer call.
4. In the withInput call we pass in the key (of "190101") and the pairList object created at the start of the test.
5. Next we specify the output that we expect our reducer to emit.
6. Finally runTest is called, which feeds our reducer the inputs specified and compares the output from the reducer against the expected output.
The MRUnit testing framework is based on JUnit and it can test MapReduce programs written on several versions of Hadoop.
The following is an example of using MRUnit to unit test a MapReduce program that does SMS CDR (call details record) analysis.
The records look like
The MapReduce program analyzes these records, finds all records with CDRType as 1, and notes the corresponding SMSStatusCode. For example, the Mapper outputs are
6,1
0,1
The Reducer takes these as inputs and outputs the number of times a particular status code has been obtained in the CDR records.
The corresponding Mapper and Reducer are
public class SMSCDRMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Text status = new Text();
  private final static IntWritable addOne = new IntWritable(1);

  protected void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    // 655209;1;796764372490213;804422938115889;6 is the sample record format
    String[] line = value.toString().split(";");
    // If the record is an SMS CDR
    if (Integer.parseInt(line[1]) == 1) {
      status.set(line[4]);
      context.write(status, addOne);
    }
  }
}
The corresponding Reducer code is
public class SMSCDRReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws java.io.IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
The MRUnit test class for the Mapper is
public class SMSCDRMapperReducerTest {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    SMSCDRMapper mapper = new SMSCDRMapper();
    SMSCDRReducer reducer = new SMSCDRReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
    mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
  }

  @Test
  public void testMapper() throws IOException {
    mapDriver.withInput(new LongWritable(), new Text("655209;1;796764372490213;804422938115889;6"));
    mapDriver.withOutput(new Text("6"), new IntWritable(1));
    mapDriver.runTest();
  }

  @Test
  public void testReducer() throws IOException {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("6"), values);
    reduceDriver.withOutput(new Text("6"), new IntWritable(2));
    reduceDriver.runTest();
  }
}
YARN: It is Hadoop MapReduce 2, developed to address the various limitations of classic MapReduce.
Current MapReduce (classic) limitations:
• Scalability problems
  – Maximum cluster size: 4,000 nodes
  – Maximum concurrent tasks: 40,000
• Coarse synchronization in the JobTracker
• Single point of failure
  – Jobs need to be resubmitted by users
  – Restart is very tricky due to complex state
For large clusters with more than 4,000 nodes, the classic MapReduce framework hits scalability problems.
YARN stands for Yet Another Resource Negotiator.
A group at Yahoo! began to design the next-generation MapReduce in 2010, and in 2013 the Hadoop 2.x releases shipped MapReduce 2, Yet Another Resource Negotiator (YARN), to remedy the scalability shortcoming.
What does YARN do?
• Provides a cluster-level resource manager
• Adds application-level resource management
• Provides slots for jobs other than Map/Reduce
• Improves resource utilization
• Improves scaling
  – Cluster size of 6,000 to 10,000 nodes
  – 100,000+ concurrent tasks can be executed
  – 10,000 concurrent jobs can be executed
• Splits the JobTracker into:
  1. Resource Manager (RM): performs cluster-level resource management
  2. Application Master (AM): performs job scheduling and monitoring
YARN Architecture
As shown in the figure, YARN involves more entities than classic MapReduce 1:
- Client, the same as in classic MapReduce, which submits the MapReduce job.
- Resource Manager, which has the ultimate authority that arbitrates resources among all the applications in the cluster; it coordinates the allocation of compute resources on the cluster.
- Node Manager, which is in charge of resource containers, monitoring resource usage (CPU, memory, disk, network) on the node, and reporting to the Resource Manager.
- Application Master, which is in charge of the life cycle of an application, like a MapReduce job. It negotiates with the Resource Manager for cluster resources, which in YARN are called containers. The Application Master and the MapReduce tasks in the containers are scheduled by the Resource Manager, and both of them are managed by the Node Managers. The Application Master is also responsible for keeping track of task progress and status.
- HDFS, the same as in classic MapReduce, for file sharing between the different entities.
The Resource Manager consists of two components:
• the Scheduler and
• the Applications Manager.
YARN MapReduce
As shown in the figure above, this is the MapReduce process with YARN. There are 11 steps, and we will explain them in 6 stages, the same as for the MapReduce 1 framework: Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates, and Job Completion.
Job Submission
Clients can submit jobs with the same API as MapReduce 1 in YARN. YARN implements its ClientProtocol, and the submission process is similar to MapReduce 1:
- The client calls the submit() method, which initiates the JobSubmitter object and calls submitJobInternal().
- The Resource Manager allocates a new application ID and returns it to the client.
- The job client checks the output specification of the job.
- The job client computes the input splits.
- The job client copies the resources, including the split data, the configuration information and the job JAR, into HDFS.
- Finally, the job client notifies the Resource Manager that it is ready by calling submitApplication() on the Resource Manager.
Job Initialization
When the Resource Manager (RM) receives the call to submitApplication(), the RM hands off the job to its scheduler. Job initialization is as follows:
- The scheduler allocates a resource container for the job.
- The RM launches the Application Master under the Node Manager's management.
- The Application Master initializes the job. The Application Master is a Java class named MRAppMaster, which initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It receives the progress and completion reports from the tasks.
- The Application Master retrieves the input splits from HDFS and creates a map task object for each split. It creates a number of reduce task objects determined by the mapreduce.job.reduces configuration property.
- The Application Master then decides how to run the job.
For a small job, called an uber job, which has fewer than 10 mappers, only one reducer and an input size smaller than an HDFS block, the Application Master runs the job sequentially in its own JVM. This policy differs from MapReduce 1, which never runs a small job on a single TaskTracker in this way.
For a large job, the Application Master launches the tasks in new containers managed by Node Managers, which runs the job in parallel and gains more performance.
The Application Master calls the job setup method to create the job's output directory. That is different from MapReduce 1, where the setup task is called by each task's TaskTracker.
Task Assignment
When the job is too large to be run on the same node as the Application Master, the Application Master makes requests to the Resource Manager to negotiate more resource containers, which is piggybacked on heartbeat calls. Task assignment is as follows:
- The Application Master makes requests to the Resource Manager in heartbeat calls. The requests include the data locality information, like the hosts and corresponding racks that the input splits reside on.
- The Resource Manager hands the request over to the Scheduler. The Scheduler makes decisions based on this information and attempts to place the task as close to the data as possible: a data-local node is ideal; if this is not possible, a rack-local node is preferred to a non-local node.
- The request also specifies the memory requirements, which lie between the minimum allocation (1 GB by default) and the maximum allocation (10 GB). The Scheduler schedules a container with a multiple of 1 GB of memory for the task, based on the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties set by the task.
This approach is more flexible than MapReduce 1. In MapReduce 1, the TaskTrackers have a fixed number of slots and each task runs in a slot. Each slot has a fixed memory allowance, which results in two problems: a small task wastes memory, while a large task that needs more memory runs short of it. In YARN, the memory allocation is more fine-grained, which is also where the beauty of YARN resides.
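A hedged illustration of requesting container sizes from a driver is given below; the 2 GB and 4 GB values are arbitrary examples chosen to fall between the default minimum and maximum allocations mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ContainerMemorySetup {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);      // ask for 2 GB containers for map tasks
    conf.setInt("mapreduce.reduce.memory.mb", 4096);   // ask for 4 GB containers for reduce tasks
    return Job.getInstance(conf, "memory-tuned job");
  }
}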
Task Execution
After the task has been assigned a container by the Resource Manager's scheduler, the Application Master contacts the Node Manager, which launches the task JVM. Task execution is as follows:
- The Java application, whose main class is YarnChild, localizes the resources that the task needs. YarnChild retrieves the job resources, including the job JAR, the configuration file and any needed files, from HDFS and the distributed cache onto the local disk.
- YarnChild runs the map or the reduce task.
Each YarnChild runs in a dedicated JVM, which isolates user code from the long-running system daemons like the Node Manager and the Application Master. Unlike MapReduce 1, YARN doesn't support JVM reuse, so each task must run in a new JVM.
Streaming and Pipes processes and their communication work the same as in MapReduce 1.
Progress and Status Updates
When the job is running under YARN, the mapper or reducer reports its status and progress to its Application Master every 3 seconds over the umbilical interface. The Application Master aggregates these status reports into a view of the task status and progress. In MapReduce 1, by contrast, the TaskTracker reports status to the JobTracker, which is responsible for aggregating status into a global view.
Moreover, the Node Manager sends heartbeats to the Resource Manager every few seconds. The Node Manager monitors the Application Master and the resource container usage, like CPU, memory and network, and makes reports to the Resource Manager. When a Node Manager fails and stops heartbeating to the Resource Manager, the Resource Manager removes that node from its pool of available resource nodes.
The client polls the status by calling getStatus() every second to receive the progress updates, which are printed on the user console. The user can also check the status from the web UI. The Resource Manager web UI displays all the running applications, with links to the web UIs where task status and progress are displayed in detail.
Job Completion
Every 5 seconds the client checks for job completion over the HTTP ClientProtocol by calling waitForCompletion(). When the job is done, the Application Master and the task containers clean up their working state and the OutputCommitter's job cleanup method is called. The job information is then archived as history for later interrogation by the user.
YARN has fault tolerance (it continues to work in the event of failure) and availability.
• Resource Manager
  – No single point of failure: state is saved in ZooKeeper
  – Application Masters are restarted automatically on RM restart
• Application Master
  – Optional failover via application-specific checkpoints
  – MapReduce applications pick up where they left off via state saved in HDFS
YARN has network compatibility.
  – Protocols are wire-compatible
  – Old clients can talk to new servers
  – Rolling upgrades
YARN supports programming paradigms other than MapReduce (multi-tenancy):
  – Tez: a generic framework to run complex MR
  – HBase on YARN
  – Machine learning: Spark
  – Graph processing: Giraph
  – Real-time processing: Storm
This is enabled by allowing the use of paradigm-specific application masters; they all run on the same Hadoop cluster.
Job Scheduling in MapReduce
Types of job schedulers
Failures in classic MapReduce
One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete.
1. Task Failure
• Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.
• For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property.
• Another failure mode is the sudden exit of the child JVM. In this case, the tasktracker notices that the process has exited and marks the attempt as failed.
• A task attempt may also be killed, which is a different kind of failing. Killed task attempts do not count against the number of attempts to run the task, since it wasn't the task's fault that an attempt was killed.
2. Tasktracker Failure
• If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes) and remove it from its pool of tasktrackers to schedule tasks on.
• A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. A tasktracker is blacklisted if the number of tasks that have failed on it is significantly higher than the average task failure rate on the cluster. Blacklisted tasktrackers can be restarted to remove them from the jobtracker's blacklist.
3. Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has no mechanism for dealing with failure of the jobtracker; it is a single point of failure, so in this case the job fails.
Failures in YARN
Referring to the YARN architecture figure above, container and task failures are handled by the node manager. When a container fails or dies, the node manager detects the failure event and launches a new container to replace the failing container and restart the task execution in the new container. In the event of application master failure, the resource manager detects the failure and starts a new instance of the application master in a new container. The ability to recover the associated job state depends on the application master implementation. The MapReduce application master has the ability to recover the state, but it is not enabled by default. Other than the resource manager, the associated client also reacts to the failure: the client contacts the resource manager to locate the new application master's address.
Upon failure of a node manager, the resource manager updates its list of available node managers. The application master should recover the tasks that were running on the failed node manager, but this depends on the application master implementation. The MapReduce application master has an additional capability to recover the failing tasks and blacklist node managers that often fail.
Failure of the resource manager is severe, since clients cannot submit new jobs and existing running jobs cannot negotiate and request new containers. Existing node managers and application masters try to reconnect to the failed resource manager, and the job progress is lost when they are unable to reconnect. This loss of job progress will likely frustrate engineers or data scientists that use YARN, because typical production jobs that run on top of YARN are expected to have long running times, typically in the order of a few hours.
Furthermore, this limitation prevents YARN from being used efficiently in cloud environments (such as Amazon EC2), since node failures happen often in cloud environments.
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle. The shuffle is an area of the codebase where refinements and improvements are continually being made.
STEPS
1. The Map Side
2. The Reduce Side
3. Configuration Tuning
I. The Map Side
When the map function starts producing output, it is not simply written to disk. The process is more involved, and takes advantage of buffering writes in memory and doing some presorting for efficiency reasons.
Shuffle and sort in MapReduce
The buffer is 100 MB by default, a size which can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size, a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer. Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10. If there are at least three spill files, then the combiner is run again before the output file is written. Combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.
II. The Reduce Side
The map output file is sitting on the local disk of the machine that ran the map task. The reduce task needs the map output for its particular partition from several map tasks across the cluster.
Copy phase of reduce: The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property. The map outputs are copied to the reduce task JVM's memory if they are small enough; otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size, or reaches a threshold number of map outputs, it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk. As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on.
Any map outputs that were compressed have to be decompressed in memory in order to perform a merge on them. When all the map outputs have been copied, the reduce task moves into the sort phase, which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10, then there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files. Rather than merging these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function. This final merge can come from a mixture of in-memory and on-disk segments.
During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS. In the case of HDFS, since the tasktracker node is also running a datanode, the first block replica will be written to the local disk.
III. Configuration Tuning
On the map side, the best performance can be obtained by avoiding multiple spills to disk; one is optimal. If you can estimate the size of your map outputs, then you can set the io.sort.* properties appropriately to minimize the number of spills. There is a MapReduce counter that counts the total number of records that were spilled to disk over the course of a job, which can be useful for tuning. The counter includes both map- and reduce-side spills.
On the reduce side, the best performance is obtained when the intermediate data can reside entirely in memory. By default, this does not happen, since for the general case all the memory is reserved for the reduce function. But if your reduce function has light memory requirements, then setting mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0 may bring a performance boost. Hadoop uses a buffer size of 4 KB by default, which is low, so you should increase this across the cluster.
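A small sketch of applying these tuning properties through the Java Configuration API follows; the values are illustrative assumptions, not recommendations from the notes.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                                  // larger map-side buffer to avoid extra spills
    conf.setInt("io.sort.factor", 25);                               // merge more streams in one round
    conf.setInt("mapred.inmem.merge.threshold", 0);                  // keep map outputs in memory on the reduce side...
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);   // ...if the reduce function needs little memory
    conf.setInt("io.file.buffer.size", 65536);                       // raise the low 4 KB default I/O buffer
    return conf;
  }
}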
Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.
1) Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record, a key-value pair, in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
FileInputFormat: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
FileInputFormat input paths: The input to a job is specified as a collection of paths, which offers great flexibility in constraining the input to a job. FileInputFormat offers four static convenience methods for setting a Job's input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
Fig: InputFormat class hierarchy
Table: Input path and filter properties
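A short, assumed example of using these convenience methods inside a driver is shown below; the paths are hypothetical.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
  public static void configure(Job job) throws IOException {
    FileInputFormat.addInputPath(job, new Path("/user/logs/2013"));          // a whole directory
    FileInputFormat.addInputPath(job, new Path("/user/extra/part-r-00000")); // a single file
    FileInputFormat.addInputPaths(job, "/archive/a,/archive/b");             // a comma-separated list of paths
  }
}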
FileInputFormat input splits: FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block.
Table: Properties for controlling split size
Preventing splitting: There are a couple of ways to ensure that an existing file is not split. The first way is to increase the minimum split size to be larger than the largest file in your system. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method to return false.
File information in the mapper: A mapper processing a file input split can find information about the split by calling the getInputSplit() method on the Mapper's Context object.
Table: File split properties
Processing a whole file as a record: A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. The listing for WholeFileInputFormat shows a way of doing this.
Example: An InputFormat for reading a whole file as a record
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
WholeFileRecordReader is responsible for taking a FileSplit and converting it into a single record, with a null key and a value containing the bytes of the file.
2) Text Input
TextInputFormat: TextInputFormat is the default InputFormat. Each record is a line of input. A file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Fig: Logical records and HDFS blocks for TextInputFormat
KeyValueTextInputFormat: You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property. It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
Like in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
NLineInputFormat: N refers to the number of lines of input that each mapper receives. With N set to one, each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property controls the value of N.
For example, if N is two, then each split contains two lines. One mapper will receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
3) Binary Input: Hadoop MapReduce is not just restricted to processing textual data; it has support for binary formats, too.
SequenceFileInputFormat: Hadoop's sequence file format stores sequences of binary key-value pairs.
SequenceFileAsTextInputFormat: SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.
SequenceFileAsBinaryInputFormat: SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects.
4) Multiple Inputs: Although the input to a MapReduce job may consist of multiple input files, all of the input is interpreted by a single InputFormat and a single Mapper.
The MultipleInputs class has an overloaded version of addInputPath() that doesn't take a mapper:
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)
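The sketch below shows one assumed way to use this overload so that a text directory and a sequence-file directory feed the same job (and therefore its single mapper); the paths are hypothetical.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsSetup {
  public static void configure(Job job) {
    // Each path gets its own InputFormat, but records from both flow to the job's single mapper
    MultipleInputs.addInputPath(job, new Path("/input/textrecords"), TextInputFormat.class);
    MultipleInputs.addInputPath(job, new Path("/input/binaryrecords"), SequenceFileInputFormat.class);
  }
}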
Output Formats
Figure: OutputFormat class hierarchy
1) Text Output: The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them.
2) Binary Output
SequenceFileOutputFormat: As the name indicates, SequenceFileOutputFormat writes sequence files for its output. Compression is controlled via the static methods on SequenceFileOutputFormat.
SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format into a SequenceFile container.
MapFileOutputFormat: MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.
3) Multiple Outputs: FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, etc. MapReduce comes with the MultipleOutputs class to help you do this.
Zero reducers: There are no partitions, as the application needs to run only map tasks.
One reducer: It can be convenient to run small jobs to combine the output of previous jobs into a single file. This should only be attempted when the amount of data is small enough to be processed comfortably by one reducer.
MultipleOutputs: MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. MultipleOutputs delegates to the mapper's OutputFormat, which in this example is a TextOutputFormat, but more complex setups are possible.
Lazy Output: FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the JobConf and the underlying output format.
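In the new API the wrapper is configured on the Job rather than a JobConf; a minimal assumed sketch:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputSetup {
  public static void configure(Job job) {
    // Wrap the real output format so that empty part-r-nnnnn files are never created
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}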
Database Output: The output formats for writing to relational databases and to HBase.
UNIT V HADOOP RELATED TOOLS
HBase
• HBASE stands for Hadoop dataBase
• HBase is a distributed column-oriented database built on top of HDFS.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
• Horizontally scalable
  – Automatic sharding
• Supports strongly consistent reads and writes
• Supports automatic fail-over
• It has a simple Java API
• Integrated with the Map/Reduce framework
• Supports Thrift, Avro and REST web services
• When to use HBase
  • Good for large amounts of data
  • 100s of millions or billions of rows
  • If data is too small, all the records will end up on a single node, leaving the rest of the cluster idle
• When NOT to use HBase
  • Bad for traditional RDBMS retrieval
  • Transactional applications
  • Relational analytics
    • 'group by', 'join', and 'where column like', etc.
  • Currently bad for text-based search access
Let's now take a look at how HBase (a column-oriented database) is different from some other data structures and concepts that we are familiar with: row-oriented vs. column-oriented data stores. As shown below, in a row-oriented data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is stored together and hence quickly retrieved.
Row-oriented data stores –
  Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the data in a row is required.
  Easy to read and write records
  Well suited for OLTP systems
  Not efficient in performing operations applicable to the entire dataset, and hence aggregation is an expensive operation
  Typical compression mechanisms provide less effective results than those on column-oriented data stores
Column-oriented data stores –
  Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
  Read and write are typically slower operations
  Well suited for OLAP systems
  Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over many rows and columns
  Permits high compression rates due to few distinct values in columns
Relational Databases vs. HBase
When talking of data stores, we first think of relational databases with structured data storage and a sophisticated query engine. However, a relational database incurs a big penalty to improve performance as the data size increases. HBase, on the other hand, is designed from the ground up to provide scalability and partitioning to enable efficient data structure serialization, storage and retrieval. Broadly, the differences between a relational database and HBase are:
HDFS vs. HBase
HDFS is a distributed file system that is well suited for storing large files. It's designed to support batch processing of data but doesn't provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access to single rows of data in large tables. Overall, the differences between HDFS and HBase are:
HBase Architecture
Just like in a relational database, data in HBase is stored in tables and these tables are stored in Regions. When a table becomes too big, the table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.
The HMaster in HBase is responsible for
  Performing administration
  Managing and monitoring the cluster
  Assigning Regions to the Region Servers
  Controlling the load balancing and failover
On the other hand, the HRegionServer performs the following work
  Hosting and managing Regions
  Splitting the Regions automatically
  Handling the read/write requests
  Communicating with the clients directly
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory modifications to the Store (data).
The mapping of Regions to Region Servers is kept in a system table called .META. When trying to read or write data from HBase, the clients read the required Region information from the .META. table and directly communicate with the appropriate Region Server. Each Region is identified by the start key (inclusive) and the end key (exclusive).
HBase Detailed Architecture
You can see that HBase handles basically two kinds of file types. One is used for the write-ahead log and the other for the actual data storage. The files are primarily handled by the HRegionServers. But in certain scenarios even the HMaster will have to perform low-level file operations. You may also notice that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed Filesystem (HDFS). This is also one of the areas where you can configure the system to handle larger or smaller data better. More on that later.
The general flow is that a new client contacts the ZooKeeper quorum first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from ZooKeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these details are cached and only looked up once. Lastly, it can query the .META. server and retrieve the server that has the row the client is looking for.
Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.
Next, when the HRegionServer opens the region it creates a corresponding HRegion object. When the HRegion is "opened" it sets up a Store instance for each HColumnFamily for every table as defined by the user beforehand. Each of the Store instances can in turn have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile. An HRegion also has a MemStore and an HLog instance. We will now have a look at how they work together, but also where there are exceptions to the rule.
HBase Data Model
The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.
Tables – The HBase Tables are more like logical collections of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of a Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and these Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence it's important that proper care be taken when designing Column Families in a table. The table above shows Customer and Sales Column Families. The Customer Column Family is made up of 2 columns, Name and City, whereas the Sales Column Family is made up of 2 columns, Product and Amount.
Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon, for example: columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable; this value by default is 3.
In JSON format, the model can be viewed as a nested map, keyed by row key, then column family, then column qualifier, then timestamp.
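To make the model concrete, the following sketch uses the classic (pre-1.0 style) HBase Java client API against an assumed table named customer with the Customer and Sales column families described above; the rowkey and values are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
    HTable table = new HTable(conf, "customer");                // hypothetical table name
    Put put = new Put(Bytes.toBytes("row-1"));                  // the rowkey, always a byte[]
    put.add(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
    put.add(Bytes.toBytes("Sales"), Bytes.toBytes("Product"), Bytes.toBytes("Book"));
    table.put(put);                                             // write one row with two column families

    Get get = new Get(Bytes.toBytes("row-1"));
    Result result = table.get(get);                             // read the row back
    byte[] name = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
    System.out.println(Bytes.toString(name));
    table.close();
  }
}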
HBase Clients
1. REST
HBase ships with a powerful REST server, which supports the complete client and administrative API. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
REST Java client
The REST server also comes with a comprehensive Java client API. It is located in the org.apache.hadoop.hbase.rest.client package.
2. Thrift
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more. Once you have compiled a schema, you can exchange messages transparently between systems implemented in one or more of those languages.
3. Avro
Apache Avro, like Thrift, provides schema compilers for many programming languages, including Java, C++, PHP, Python, Ruby, and more. Once you have compiled a schema, you can exchange messages transparently between systems implemented in one or more of those languages.
4. Other Clients
• JRuby
The HBase Shell is an example of using a JVM-based language to access the Java-based API. It comes with the full source code, so you can use it to add the same features to your own JRuby code.
• HBql
HBql adds an SQL-like syntax on top of HBase, while adding the extensions needed where HBase has unique features.
• HBase-DSL
This project gives you dedicated classes that help when formulating queries against an HBase cluster. Using a builder-like style, you can quickly assemble all the options and parameters necessary.
• JPA/JPO
You can use, for example, DataNucleus to put a JPA/JPO access layer on top of HBase.
• PyHBase
The PyHBase project (https://github.com/hammer/pyhbase/) offers an HBase client through the Avro gateway server.
• AsyncHBase
AsyncHBase offers a completely asynchronous, nonblocking, and thread-safe client to access HBase clusters. It uses the native RPC protocol to talk directly to the various servers.
Cassandra
The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods.
Main features
• Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
• Supports replication and multi-data-center replication
Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, for redundancy, for failover and disaster recovery.
• Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
• Fault-tolerant
Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
• Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
• MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache Hive.
• Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python, Node.JS and Go.
Why do we use Cassandra?
1. Quick writes
2. Fail-safe
3. Quick reporting
4. Batch processing too, with MapReduce
5. Ease of maintenance
6. Ease of configuration
7. Tunably consistent
8. Highly available
9. Fault tolerant
10. The peer-to-peer design allows for high performance with linear scalability and no single points of failure
11. Decentralized databases
12. Supports 12 different client languages
13. Automatic provisioning of new nodes
Cassandra Data Model
Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is represented by a row object, in which the value would generally be organized in columns. In short, Cassandra uses the following terms:
1. Keyspace can be seen as a DB schema in SQL.
2. Column family resembles a table in the SQL world (read below why this analogy is misleading).
3. Row has a key and, as a value, a set of Cassandra columns, but without the relational schema corset.
4. Column is a triplet := (name, value, timestamp).
5. Super column is a tuple := (name, collection of columns).
6. Data types: validators & comparators
7. Indexes
KeySpaces
KeySpaces are the largest container, with an ordered list of ColumnFamilies, similar to a database in RDBMS.
Column
A Column is the most basic element in Cassandra: a simple tuple that contains a name, value and timestamp. All values are set by the client. That's an important consideration for the timestamp, as it means you'll need clock synchronization.
SuperColumn
A SuperColumn is a column that stores an associative array of columns. You could think of it as similar to a HashMap in Java, with an identifying column (name) that stores a list of columns inside (value). The key difference between a Column and a SuperColumn is that the value of a Column is a string, where the value of a SuperColumn is a map of Columns. Note that SuperColumns have no timestamp, just a name and a value.
ColumnFamily
A ColumnFamily holds a number of Rows, a sorted map that matches column names to column values. A row is a set of columns, similar to the table concept from relational databases. The column family holds an ordered list of columns which you can reference by column name.
The ColumnFamily can be of two types, Standard or Super. Standard ColumnFamilies contain a map of normal columns,
Example
SuperColumnFamilies contain rows of SuperColumns.
Example
Data Types
And of course there are predefined data types in Cassandra, in which
  The data type of a row key is called a validator.
  The data type for a column name is called a comparator.
You can assign predefined data types when you create your column family (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as hex byte arrays (BytesType). This is the default client encoding.
Indexes
An understanding of indexes in Cassandra is requisite. There are two kinds of them.
The Primary index for a column family is the index of its row keys. Each node maintains this index for the data it manages.
The Secondary indexes in Cassandra refer to indexes on column values. Cassandra implements secondary indexes as a hidden column family.
The Primary index determines cluster-wide row distribution. Secondary indexes are very important for custom queries.
Differences Between RDBMS and Cassandra
Cassandra Clients
1. Thrift
Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages. Thrift was developed at Facebook and donated as an Apache project.
The design of Thrift offers the following features:
  Language-independent types
  Common transport interface
  Protocol independence
  Versioning support
2. Avro
The Apache Avro project is a data serialization and RPC system targeted as the replacement for Thrift in Cassandra.
Avro provides many features similar to those of Thrift and other data serialization and RPC mechanisms, including:
• Robust data structures
• An efficient, small binary format for RPC calls
• Easy integration with dynamically typed languages such as Python, Ruby, Smalltalk, Perl, PHP, and Objective-C
Avro is the RPC and data serialization mechanism for Cassandra. It generates code that remote clients can use to interact with the database. It's well supported in the community and has the strength of growing out of the larger and very well-known Hadoop project. It should serve Cassandra well for the foreseeable future.
3. Hector
Hector is an open source project written in Java using the MIT license. It was created by Ran Tavory of Outbrain (previously of Google) and is hosted at GitHub. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.
Hector is a well-supported and full-featured Cassandra client, with many users and an
activecommunity.It offersthefollowing:
High-levelobject-orientedAPI
Failoversupport
Connectionpooling
JMX(JavaManagement eXtensions) support
8. Chirper
ChirperisaportofTwissandrato.NET,writtenbyChakerNakhli.It’savailableundertheApache2.0license,and
thesourcecodeis onGitHub
9. Chiton
Chitonis aCassandrabrowserwrittenbyBrandonWilliams thatusesthePythonGTKframework
6. Pelops
Pelops is a free, open source Java client written by Dominic Williams. It is similar to Hector in that it is Java-based, but it was started more recently. It has become a very popular client. Its goals include the following:
To create a simple, easy-to-use client
To completely separate concerns for data processing from lower-level items such as connection pooling
To act as a close follower to Cassandra so that it is readily up to date
7. Kundera
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
8. Fauna
Ryan King of Twitter and Evan Weaver created a Ruby client for the Cassandra database called Fauna.
Pig
Pig is a simple-to-understand data flow language used in the analysis of large data sets. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster even if you aren't familiar with MapReduce.
Pig is used to:
o Process web logs
o Build user behavior models
o Process images
o Perform data mining
Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.
The Pig execution environment has two modes:
• Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
• Hadoop mode: Also called MapReduce mode; all scripts are run on a given Hadoop cluster.
Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
1. Pig Latin script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
2. Grunt shell: Grunt is a command interpreter. You can type Pig Latin on the Grunt command line and Grunt will execute the command on your behalf.
3. Embedded: Pig programs can be executed as part of a Java program.
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
It is a large-scale data processing system.
Scripts are written in Pig Latin, a data flow language.
It was developed by Yahoo, and is open source.
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce.
Differences between Pig and MapReduce
Pig is a data flow language; the key focus of Pig is managing the flow of data from an input source to an output store.
Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce jobs or data movement jobs. Pig allows custom functions to be added for processing; some default ones are ordering, grouping, distinct, count, etc.
MapReduce, on the other hand, is a programming model, or framework, for processing large data sets in a distributed manner using a large number of computers, i.e. nodes.
Pig commands are submitted as MapReduce jobs internally. An advantage Pig has over MapReduce is that the former is more concise: 200 lines of Java code written for MapReduce can often be reduced to 10 lines of Pig code.
Pig Latin
Pig Latin has a very rich syntax. It supports operators for the following operations:
Loading and storing of data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands. A short end-to-end sketch combining several of these operators follows.
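The following is a minimal, illustrative Pig Latin sketch that strings several of the operators above into one data flow; the file names (weblog.txt, top_pages) and the field layout are assumptions for illustration, not from the original notes.
-- Load a hypothetical tab-delimited web log: (user, page, bytes)
logs    = LOAD 'weblog.txt' USING PigStorage('\t') AS (user:chararray, page:chararray, bytes:int);
-- Keep only large responses
big     = FILTER logs BY bytes > 1000;
-- Group by page and count hits per page
by_page = GROUP big BY page;
counts  = FOREACH by_page GENERATE group AS page, COUNT(big) AS hits;
-- Sort by hit count, descending, and store the result
ranked  = ORDER counts BY hits DESC;
STORE ranked INTO 'top_pages';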
DUMP
DUMP directs the output of your script to your screen.
Syntax:
DUMP alias;   (for example, DUMP A;)
LOAD: Loads data from the file system.
Syntax
LOAD 'data' [USING function] [AS schema];
'data' is the name of the file or directory, in single quotes. USING and AS are keywords. If the USING clause is omitted, the default load function PigStorage is used. The schema is defined using the AS keyword, enclosed in parentheses.
Usage
Use the LOAD operator to load data from the file system.
Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements below are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
Sample Code
The examples are based on these Pig commands, which extract all user IDs from the /etc/passwd file.
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.txt';
STORE: Stores or saves results to the file system.
Syntax
STORE alias INTO 'directory' [USING function];
Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data';
DUMP A;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
Output
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
STREAM: Sends data to an external script or program.
GROUP: Groups the data in one or more relations (a worked GROUP example follows the JOIN example below).
JOIN (inner): Performs an inner join of two or more relations based on common field values.
JOIN (outer): Performs an outer join of two relations based on common field values.
Example
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int, b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
In this example relations A and B are joined by their first fields.
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Grunt
Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS.
In other words, it is a command interpreter. You can type Pig Latin on the Grunt command line and Grunt will execute the command on your behalf.
To enter Grunt, invoke Pig with no script or command to run. Typing:
$ pig -x local
will result in the prompt:
grunt>
This gives you a Grunt shell to interact with your local filesystem. To exit Grunt you can type quit or enter Ctrl-D.
Example of entering Pig Latin scripts in Grunt
From your current working directory in Linux, run:
$ pig -x local
The Grunt shell is invoked and you can enter commands at the prompt.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
Pig will not start executing the Pig Latin you enter until it sees either a store or a dump.
Pig's Data Model
This includes Pig's data types, how it handles concepts such as missing data, and how you can describe your data to Pig.
Types
Pig's data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.
1. Scalar Types
int
An integer. Stores a four-byte signed integer.
long
A long integer. Stores an eight-byte signed integer.
float
A floating-point number. Uses four bytes to store its value.
double
A double-precision floating-point number. Uses eight bytes to store its value.
chararray
A string or character array, expressed as a string literal with single quotes.
bytearray
A blob or array of bytes.
2. ComplexTypes
Pig has several complex data types such as maps, tuples, and bags. All of these types cancontain
data of any type, including other complex types. So it is possible to have a map
wherethevaluefieldis abag,whichcontainsatuplewhereoneofthefields is amap.
Map
A map in Pig is a chararray to data element mapping, where that element can be any Pigtype,
including a complex type. The chararray is called a key and is used as an index
tofindtheelement,referredtoas thevalue.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are dividedinto
fields, with each field containing one data element. These elements can be of anytype
—they do not all need to be the same type. A tuple is analogous to a row in
SQL,withthefields beingSQL columns.
Bag
A bag is an unordered collection of tuples. Because the tuples in a bag have no order, it is not possible to reference them by position.
Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown.
Casts
A cast converts content of one type to another type.
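To make the type system concrete, here is a small Pig Latin sketch (not from the original notes) showing complex types declared in a LOAD schema and an explicit cast; the file names and field names are assumptions for illustration:
-- Complex types in a schema: a bag of tuples, a map, and a nested tuple
emp = LOAD 'employees.txt'
      AS (name:chararray,
          salary:double,
          subordinates:bag{t:(sub:chararray)},
          deductions:map[],
          address:tuple(street:chararray, city:chararray));
-- An explicit cast: treat the second (bytearray) field of a schema-less load as an int
raw    = LOAD 'prices.txt';
as_int = FOREACH raw GENERATE (int)$1;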
Hive
Hive was originally an internal Facebook project which eventually matured into a full-blown Apache project. It was created to simplify access to MapReduce (MR) by exposing a SQL-based language for data manipulation. Hive also maintains metadata in a metastore, which is stored in a relational database; this metadata contains information about what tables exist, their columns, privileges, and more. Hive is an open source data warehousing solution built on top of Hadoop, and its particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading.
Hive is a natural starting point for more full-featured business intelligence systems which offer a user-friendly interface for non-technical users.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS as well as in easily compatible file systems like Amazon S3 (Simple Storage Service). Amazon S3 is a scalable, high-speed, low-cost, web-based service designed for online backup and archiving of data as well as application programs. Hive provides a SQL-like language called HiveQL while maintaining full support for map/reduce, and to accelerate queries, it provides indexes, including bitmap indexes. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Advantages of Hive
Perfectly fits the low-level interface requirement of Hadoop
Hive supports external tables and ODBC/JDBC
Has an intelligent optimizer
Hive supports table-level partitioning to speed up query times
The metadata store is a big plus in the architecture and makes lookups easy
Data Units
Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views table, where each row could comprise the following columns (schema):
timestamp - of INT type, corresponding to a UNIX timestamp of when the page was viewed.
userid - of BIGINT type, identifying the user who viewed the page.
page_url - of STRING type, capturing the location of the page.
referer_url - of STRING type, capturing the location of the page from which the user arrived at the current page.
IP - of STRING type, capturing the IP address from which the page request was made.
Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
Partition columns are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to efficiently sample the data. A CREATE TABLE sketch combining these ideas follows.
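Pulling these pieces together, the following is a hedged HiveQL sketch of the page_views table just described. The column called timestamp in the description is renamed viewtime here because TIMESTAMP is a type name in Hive, and the partition column names (dt, country), the bucket count, and the file format are assumptions for illustration:
-- Sketch of the page_views table with partitions and buckets
CREATE TABLE page_views (
  viewtime     INT,      -- UNIX timestamp of the view
  userid       BIGINT,
  page_url     STRING,
  referer_url  STRING,
  ip           STRING
)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;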
Hive Data Types
Hive supports two kinds of data types:
1. Primitive data types
2. Collection data types
1. Primitive Data Types
TINYINT, SMALLINT, INT and BIGINT are four integer data types that differ only in their size.
FLOAT and DOUBLE are two floating-point data types. BOOLEAN stores true or false.
STRING stores character strings. Note that, in Hive, we do not specify a length for STRING as in other databases; it is more flexible and variable in length.
TIMESTAMP can be an integer, interpreted as seconds since the UNIX epoch; a float, where the digits after the decimal point are nanoseconds; or a string, interpreted according to the JDBC date string format, i.e. YYYY-MM-DD hh:mm:ss.fffffffff. The time component is interpreted as UTC.
BINARY is used to hold raw bytes which will not be interpreted by Hive. It is suitable for binary data.
2. Collection Data Types
Hive supports three collection data types (a sketch of their use follows this list):
1. STRUCT
2. MAP
3. ARRAY
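As a minimal illustration (not from the original notes), the employees table used in the HiveQL examples later in this unit can be declared with all three collection types; the exact field names inside the STRUCT are assumptions:
-- Sketch: collection types in a column list
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,                -- list of subordinate names
  deductions   MAP<STRING, FLOAT>,           -- deduction name -> amount
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);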
Hive File Formats
Hive supports all the Hadoop file formats, plus Thrift encoding, as well as supporting pluggable SerDe (serializer/deserializer) classes to support custom formats.
There are several file formats supported by Hive:
TEXTFILE, the default, is the simplest but the least space-efficient; the SEQUENCEFILE format is more space-efficient.
MAPFILE adds an index to a SEQUENCEFILE for faster retrieval of particular records.
Hive defaults to the following record and field delimiters, all of which are non-printable control characters and all of which can be customized:
\n separates records (lines); ^A (octal \001) separates fields (columns); ^B (octal \002) separates the elements of an ARRAY or STRUCT and the key-value pairs of a MAP; ^C (octal \003) separates a MAP key from its value.
Let us take an example to understand this, assuming an employee table in Hive: the ROW FORMAT clause in the CREATE TABLE examples later in this unit spells out these same default delimiters explicitly.
HiveQL
HiveQL is the Hive query language.
Hadoop is an open source framework for the distributed processing of large amounts of data across a cluster. It relies upon the MapReduce paradigm to reduce complex tasks into smaller parallel tasks that can be executed concurrently across multiple machines. However, writing MapReduce tasks on top of Hadoop for processing data is not for everyone, since it requires learning a new framework and a new programming paradigm altogether. What is needed is an easy-to-use abstraction on top of Hadoop that allows people not familiar with it to use its capabilities just as easily.
Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on top of Hadoop. Hive achieves this by converting queries written in HiveQL into MapReduce tasks that are then run across the Hadoop cluster to fetch the desired results.
Hive is best suited for batch processing of large amounts of data (such as in data warehousing) but is not ideally suited as a routine transactional database because of its slow response times (it needs to fetch data from across a cluster).
A common task for which Hive is used is the processing of web server logs. These logs have a regular structure and hence can be readily converted into a format that Hive can understand and process.
The Hive query language (HiveQL) supports SQL features like CREATE TABLE, DROP TABLE, SELECT ... FROM ... WHERE clauses, joins (inner, left outer, right outer and full outer), Cartesian products, GROUP BY, SORT BY, aggregations, UNION and many useful functions on primitive as well as complex data types. Metadata browsing features such as listing databases, tables and so on are also provided. HiveQL does have limitations compared with traditional RDBMS SQL: it allows creation of new tables with partitions (each table can have one or more partitions in Hive) as well as buckets (the data in partitions is further distributed as buckets), and it allows insertion of data into single or multiple tables, but it does not allow deletion or updating of data. A short aggregation sketch follows.
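The following is a minimal sketch of the kind of HiveQL aggregation this refers to, run against the page_views table described earlier; the partition column name dt is an assumption carried over from the earlier sketch:
-- Count page views per country for one day's partition
SELECT country, count(*) AS views
FROM page_views
WHERE dt = '2009-12-23'
GROUP BY country;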
HiveQL: Data Definition
First open the Hive console by typing:
$ hive
Once the Hive console is opened, a prompt like
hive>
appears, at which you run the queries that create databases and tables.
1. Create and Show Databases
Databases are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It is also common to use databases to organize production tables into logical groups. If you don't specify a database, the default database is used.
At any time, you can see the databases that already exist as follows:
hive> SHOW DATABASES;
Output:
default
financials
hive> CREATE DATABASE human_resources;
hive> SHOW DATABASES;
Output:
default
financials
human_resources
2. DESCRIBE Database
DESCRIBE DATABASE shows the directory location for the database.
hive> DESCRIBE DATABASE financials;
Output:
hdfs://master-server/user/hive/warehouse/financials.db
3. USE Database
The USE command sets a database as your working database, analogous to changing working directories in a filesystem:
hive> USE financials;
4. DROP Database
You can drop a database:
hive> DROP DATABASE IF EXISTS financials;
The IF EXISTS is optional and suppresses warnings if financials doesn't exist.
5. ALTER Database
You can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE command. No other metadata about the database can be changed, including its name and directory location:
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'activesteps');
6. Create Tables
The CREATE TABLE statement follows SQL conventions, but Hive's version offers significant extensions to support a wide range of flexibility in where the data files for tables are stored, the formats used, etc.
Managed Tables
The tables we have created so far are called managed tables, or sometimes internal tables, because Hive controls the lifecycle of their data. As we've seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.
Managed tables are less convenient for sharing with other tools. A minimal managed-table sketch follows.
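For contrast with the external table below, this is a minimal managed-table sketch (not from the original notes; the database and column names are illustrative). Because the table is managed, dropping it removes both its metadata and its data:
-- A minimal managed (internal) table; assumes the database mydb already exists
CREATE TABLE IF NOT EXISTS mydb.employees_managed (
  name   STRING,
  salary FLOAT
);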
External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';
The EXTERNAL keyword tells Hive this table is external, and the LOCATION ... clause is required to tell Hive where it is located. Because it is external, Hive does not assume it owns the data, so dropping the table does not delete the data; only the table's metadata is removed.
Partitioned, Managed Tables
We have to use address.state to project the value inside the address, so let's partition the data first by country and then by state. The relevant clauses of the CREATE TABLE employees statement are:
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
Partitioning tables changes how Hive structures the data storage. If we create this table in the mydb database, there will still be an employees directory for the table:
hdfs://master_server/user/hive/warehouse/mydb.db/employees
Data is loaded into a specific partition:
LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');
hive> SHOW PARTITIONS employees;
Output:
OK
country=US/state=IL
Time taken: 0.145 seconds
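The payoff of partitioning is that queries which filter on the partition columns only scan the matching partition directories; a minimal sketch (not from the original notes) against the employees table loaded above:
-- Only the country=US/state=IL partition is read for this query
SELECT name, salary
FROM employees
WHERE country = 'US' AND state = 'IL';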
7. Dropping Tables
The familiar DROP TABLE command from SQL is supported:
DROP TABLE IF EXISTS employees;
HiveQL: Data Manipulation
1. Loading Data into Managed Tables
Create the stocks table:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load the stocks data:
LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE stocks
PARTITION (exchange = 'NASDAQ', symbol = 'AAPL');
This command will first create the directory for the partition, if it doesn't already exist, and then copy the data to it.
2. Inserting Data into Tables from Queries
INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR')
With OVERWRITE, any previous contents of the partition are replaced. If you drop the keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replacing it. A completed sketch of this statement follows.
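The fragment above omits the query that feeds the insert. A minimal completed sketch, assuming a hypothetical staging table staged_employees whose columns cnty and st carry the target country and state:
-- Populate one partition of employees from a (hypothetical) staging table
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT name, salary, subordinates, deductions, address
FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';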
HiveQL Queries
1. SELECT ... FROM Clauses
SELECT is the projection operator in SQL. The FROM clause identifies the table, view, or nested query from which we select records.
Create the employees table (its column list matches the collection-type sketch shown earlier; the storage clauses are):
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
Load the data:
LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');
The data in employees.txt is assumed to contain, for each employee, a name, a salary, a list of subordinates, a map of deductions, and an address.
Select the data:
hive> SELECT name, salary FROM employees;
When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation) syntax for the output. First, let's select the subordinates, an ARRAY, where a comma-separated list surrounded with [...] is used:
hive> SELECT name, subordinates FROM employees;
The deductions column is a MAP, where the JSON representation for maps is used, namely a comma-separated list of key:value pairs surrounded with {...}:
hive> SELECT name, deductions FROM employees;
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
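Individual elements of the collection types can also be referenced in queries; a minimal sketch (not from the original notes), assuming the deductions map contains a key named 'State Taxes' and the address STRUCT has a city field as in the earlier sketch:
-- Index into an ARRAY, a MAP, and a STRUCT
SELECT name,
       subordinates[0],            -- first subordinate (ARRAY index)
       deductions['State Taxes'],  -- value for one MAP key
       address.city                -- one field of the STRUCT
FROM employees;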