
JAYA COLLEGE OF ENGINEERING AND TECHNOLOGY

CCS334 Big Data Analytics Lecture Notes

UNIT I UNDERSTANDING BIG DATA

Introduction to big data – convergence of key trends – unstructured data – industry examples of big data – web analytics – big data applications – big data technologies – introduction to Hadoop – open source technologies – cloud and big data – mobile business intelligence – crowdsourcing analytics – inter and trans firewall analytics.
UNIT II NOSQL DATA MANAGEMENT
Introduction to NoSQL – aggregate data models – key-value and document data models – relationships – graph databases – schemaless databases – materialized views – distribution models – master-slave replication – consistency – Cassandra – Cassandra data model – Cassandra examples – Cassandra clients.
UNIT III MAPREDUCE APPLICATIONS
MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of a MapReduce job run – classic MapReduce – YARN – failures in classic MapReduce and YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output formats.
UNIT IV BASICS OF HADOOP
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes – design of Hadoop distributed file system (HDFS) – HDFS concepts – Java interface – data flow – Hadoop I/O – data integrity – compression – serialization – Avro – file-based data structures – Cassandra – Hadoop integration.
UNIT V HADOOP RELATED TOOLS
HBase – data model and implementations – HBase clients – HBase examples – praxis. Pig – Grunt – Pig data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries.
REFERENCES:
1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
8. Alan Gates, "Programming Pig", O'Reilly, 2011.
30 PERIODS

1.1 Big Data

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of traditional database architectures. In other words, big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. To gain value from this data, you must choose an alternative way to process it.
Big Data is the next generation of data warehousing and business analytics and is poised to deliver top-line revenues cost efficiently for enterprises. Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Definition

❖ Big data can be defined as very large volumes of data available at various sources, in varying degrees of complexity, generated at different speeds, which cannot be processed using traditional technologies, processing methods and algorithms.
❖ Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, create, manage and process the data within a tolerable elapsed time.
❖ Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.


◻ Big data is often boiled down to a few varieties including social data, machine data, and transactional data.

◻ Social media data is providing remarkable insights to companies on consumer behavior and sentiment that can be integrated with CRM data for analysis, with 230 million tweets posted on Twitter per day, 2.7 billion Likes and comments added to Facebook every day, and 60 hours of video uploaded to YouTube every minute (this is what we mean by velocity of data).

◻ Machine data consists of information generated from industrial equipment, real-time data from sensors that track parts and monitor machinery (often also called the Internet of Things), and even web logs that track user behavior online.

◻ Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants like US pizza chain Domino's, which serves over 1 million customers per day, are generating petabytes of transactional big data.

◻ The thing to note is that big data can resemble traditional structured data or unstructured, high frequency information.


Big Data Analytics


Big (and small) Data analytics is the process of examining data—typically of a variety of
sources, types, volumes and / or complexities—to uncover hidden patterns, unknown
correlations, and other useful information.
The intent is to find business insights that were not previously possible or were missed, so that
better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide
1. Deeper insights. Rather than looking at segments, classifications, regions, groups, or other summary levels, you'll have insights into all the individuals, all the products, all the parts, all the events, all the transactions, etc.

2. Broader insights. The world is complex. Operating a business in a global, connected economy is very complex given constantly evolving and changing conditions. As humans, we simplify conditions so we can process events and understand what is happening. But our best-laid plans often go astray because of this estimating or approximating. Big Data analytics takes into account all the data, including new data sources, to understand the complex, evolving, and interrelated conditions to produce more accurate insights.

3. Frictionless actions. Increased reliability and accuracy that will allow the deeper and
broader insights to be automated into systematic actions.

Difference between Data Science and Big Data:

1. Data Science is the field of scientific analysis of data in order to solve analytically complex problems, including the significant and necessary activity of cleansing and preparing data. Big Data is the storing and processing of large volumes of structured and unstructured data that is not possible with traditional applications.

2. Data Science is used in biotech, energy, gaming and insurance. Big Data is used in retail, education, healthcare and social media.

3. Goals of Data Science: data classification, anomaly detection, prediction, scoring and ranking. Goals of Big Data: to provide better customer service, identify new revenue opportunities, effective marketing, etc.

Benefits of Big Data Processing:


1. Improved customer service.
2. Business can utilize outside intelligence while making decisions.
3. Reducing maintenance costs.


4. Re-develop your products: Big data can also help you understand how others perceive your products so that you can adapt them or your marketing, if need be.
5. Early identification of risk to the product/services, if any.
6. Better operational efficiency.
Big Data Challenges:
Collecting, storing and processing big data comes with its own set of challenges:
1. Big data is growing exponentially and existing data management solutions have to be constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and work with big data and big data tools.
1.2 Convergence of Key Trends:
◻ The essence of computer applications is to store things from the real world in computer systems in the form of data, i.e., it is a process of producing data. Some data are records related to culture and society and others are descriptions of phenomena of the universe and life. The large scale of data that is rapidly generated and stored in computer systems is called data explosion.
◻ Data is generated automatically by mobile devices and computers: think of Facebook, search queries, directions and GPS locations, and image capture.
◻ Sensors also generate volumes of data, including medical data and commerce location-based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all this data is expensive. Analysis gets more important and more expensive every year.
◻ The diagram below shows the big data explosion driven by the current data boom and how critical it is for us to be able to extract meaning from all of this data.

◻ The phenomenon of exponential multiplication of stored data is termed "data explosion". A continuous inflow of real-time data from various processes, machinery and manual inputs keeps flooding the storage servers every second.
◻ Sending emails, making phone calls, collecting information for campaigns; each day we create a massive amount of data just by going about our normal business, and this data explosion does not seem to be slowing down. In fact, 90% of the data that currently exists was created in just the last two years.
The reason for this data explosion is innovation.


1. Business model transformation: Innovation has changed the way in which we do business and provide services. The data world is governed by three fundamental trends: business model transformation, globalization and personalization of services.
● Organizations have traditionally treated data as a legal or compliance requirement, supporting limited management reporting requirements. Consequently, organizations have treated data as a cost to be minimized.
● Businesses are now required to produce more data related to products and to provide services that cater to every sector and channel of customers.
2. Globalization: Globalization is an emerging trend in business where organizations start operating on an international scale. From manufacturing to customer service, globalization has changed the commerce of the world. A variety of different formats of data are generated due to globalization.
3. Personalization of services: To enhance customer service, one-to-one marketing in the form of personalized service is expected by the customer. Customers expect communication through various channels, which increases the speed of data generation.
4. New sources of data: The shift to online advertising supported by the likes of Google, Yahoo and others is a key driver in the data boom. Social media, mobile devices, sensor networks and new media are at the fingertips of customers or users. The data generated through these is used by corporations for decision support systems like business intelligence and analytics. The growth of technology has helped new business models emerge over the last decade or more. Integration of all the data across the enterprise is used to create a business decision support platform.
1.2.1 V's of Big Data
We differentiate big data characteristics from traditional data by one or more of the five V's: volume, velocity, variety, veracity and value.


1. Volume: Volumes of data are larger than what conventional relational database infrastructure can cope with. It consists of terabytes or petabytes of data.

➢ The size of available data has been growing at an increasing rate.
➢ The volume of data is growing. Experts predict that the volume of data in the world will grow to 25 zettabytes in 2020.
➢ That same phenomenon affects every business: their data is growing at the same exponential rate too.
➢ This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes. More sources of data are added on a continuous basis.
➢ For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For a group of companies, the data is also generated by machines.


➢ For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
➢ More sources of data, with a larger size of data, combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
➢ Petabyte data sets are common these days and exabyte is not far away.

2. Velocity:
➢ The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data. It is being created in or near real time.
➢ Data is increasingly accelerating the velocity at which it is created and at which it is integrated. We have moved from batch to real-time business.
➢ Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result.


➢ That scheme works when the incoming data rate is slower than the batch-processing rate and when the result is useful despite the delay.
➢ With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short.
➢ Data comes at you at a record or byte level, not always in bulk. And the demands of the business have increased as well: from an answer next week to an answer in a minute.
➢ In addition, the world is becoming more instrumented and interconnected. The volume of data streaming off those instruments is exponentially larger than it was even two years ago.

3. Variety:
➢ It refers to heterogeneous sources and the nature of data, both structured and unstructured.
➢ Variety presents an equally difficult challenge. The growth in data sources has fuelled the growth in data types. In fact, 80% of the world's data is unstructured.
➢ Yet most traditional methods apply analytics only to structured information.
➢ From Excel tables and databases, data has changed to lose its structure and to add hundreds of formats.
➢ Pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, etc. One no longer has control over the input data format.
➢ Structure can no longer be imposed like in the past in order to keep control over the analysis. As new applications are introduced, new data formats come to life.
The variety of data sources continues to increase. It includes:
● Internet data (i.e., clickstream, social media, social networking links)
● Primary research (i.e., surveys, experiments, observations)
● Secondary research (i.e., competitive and marketplace data, industry reports, consumer data, business data)
● Location data (i.e., mobile device data, geospatial data)
● Image data (i.e., video, satellite images, surveillance)
● Supply chain data (i.e., EDI, vendor catalogs and pricing, quality information)
● Device data (i.e., sensors, PLCs, RF devices, LIMS, telemetry)


4. Value
➢ It represents the business value to be derived from big data. The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, you're just performing some technological task for technology's sake.
➢ For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
➢ Exploration of data trends can include spatial proximity relationships. Once spatial big data is structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.
5. Veracity
➢ Big data must be fed with relevant and true data. We will not be able to perform useful analytics if much of the incoming data comes from false sources or has errors.
➢ Veracity refers to the level of trustworthiness or messiness of data; the higher the trustworthiness of the data, the lower the messiness, and vice versa.
➢ It relates to the assurance of the data's quality, integrity, credibility and accuracy. We must evaluate the data for accuracy before using it for business insights because it is obtained from multiple sources.


Why Big Data?

1. Understanding and Targeting Customers
2. Understanding and Optimizing Business Processes
3. Personal Quantification and Performance Optimization
4. Improving Healthcare and Public Health
5. Improving Sports Performance
6. Improving Science and Research
7. Optimizing Machine and Device Performance
8. Improving Security and Law Enforcement
9. Improving and Optimizing Cities and Countries
10. Financial Trading
1.3 Unstructured Data
★ Unstructured data is information that either does not have a predefined data model and/or does not fit well into a relational database.
★ Rows and columns are not used for unstructured data; therefore it is difficult to retrieve the required information.
★ Unstructured information is typically text heavy, but may contain data such as dates, numbers, and facts as well.
★ The term semi-structured data is used to describe structured data that does not fit into a formal structure of data models.
★ However, semi-structured data does contain tags that separate semantic elements, which includes the capability to enforce hierarchies within the data.
★ The amount of data (all data, everywhere) is doubling every two years. Most new data is unstructured.
★ Specifically, unstructured data represents almost 80 percent of new data, while structured data represents only 20 percent.
★ Unstructured data tends to grow exponentially, unlike structured data, which tends to grow in a more linear fashion. Unstructured data is vastly underutilized.

Structured Data
★ Structured data is arranged in rows and columns format. It helps applications to retrieve and process data easily. A DBMS is used for storing structured data.
★ With a structured document, certain information always appears in the same location on the page.
★ Structured data generally resides in a relational database, and as a result, it is sometimes called "relational data." This type of data can be easily mapped into pre-designed fields.


★ For example, a database designer may set up fields for phone numbers, zip codes and credit card numbers that accept a certain number of digits. Structured data has been or can be placed in fields like these.
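To make the contrast concrete, here is a minimal Python sketch showing structured, semi-structured and unstructured forms of the same kind of customer information; the field names and sample values are purely illustrative, not taken from any real dataset:

import json

# Structured: fixed fields that map directly to relational columns.
structured_row = {"customer_id": 1042, "zip_code": "600054", "phone": "9876543210"}

# Semi-structured: self-describing tags (here JSON) with no rigid schema.
semi_structured = json.loads('{"customer_id": 1042, "orders": [{"sku": "A1", "qty": 2}]}')

# Unstructured: free text; any structure must be extracted before analysis.
unstructured = "Called support on 12 March about a delayed order; very unhappy."

print(structured_row["zip_code"])            # direct field access
print(semi_structured["orders"][0]["sku"])   # navigate nested tags
print(len(unstructured.split()), "words")    # only crude measures without text mining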

Mining Unstructured Data

★ Many organizations believe that their unstructured data stores include information that could help them make better business decisions.
★ Unfortunately, it's often very difficult to analyze unstructured data. To help with the problem, organizations have turned to a number of different software solutions designed to search unstructured data and extract important information.
★ The primary benefit of these tools is the ability to glean actionable information that can help a business succeed in a competitive environment.
★ Because the volume of unstructured data is growing so rapidly, many enterprises also turn to technological solutions to help them better manage and store their unstructured data.
★ These can include hardware or software solutions that enable them to make the most efficient use of their available storage space.

Implementing Unstructured Data Management

Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:

● Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
● Business intelligence software: Also known as BI, this is a broad category of analytics,
data mining, dashboards and reporting tools that help companies make sense of their
structured and unstructured data for the purpose of making better business decisions.
● Data integration tools: These tools combine data from disparate sources so that they can
beviewed or analyzed from a single application. They sometimes include the capability to
unify structured and unstructured data.
● Document management systems: Also called "enterprise content management systems," a DMS can track, store and share unstructured data that is saved in the form of document files.
● Information management solutions: This type of software tracks structured and unstructured enterprise data throughout its lifecycle.
● Search and indexing tools: These tools retrieve information from unstructured data files such as documents, Web pages and photos.


1.4 Industry Examples of Big Data

Big data plays an important role in digital marketing. Each day, the amount of information shared digitally increases significantly. With the help of big data, marketers can analyze every action of the consumer. It provides better marketing insights and helps marketers to make more accurate and advanced marketing strategies.

• Reasons why big data is important for digital marketers :

a) Real-time customer insights

b) Personalized targeting

c) Increasing sales

d) Improves the efficiency of a marketing campaign

e) Budget optimization

f) Measuring campaign results more accurately.

★ Data constantly informs marketing teams of customer behaviors and industry trends and is used to optimize future efforts, create innovative campaigns and build lasting relationships with customers.
★ Big data regarding customers provides marketers details about user demographics, locations and interests, which can be used to personalize the product experience and increase customer loyalty over time.
★ Big data solutions can help organize data and pinpoint which marketing campaigns, strategies or social channels are getting the most traction. This lets marketers allocate marketing resources and reduce costs for projects that are not yielding as much revenue or meeting desired audience goals.
★ Personalized targeting: Nowadays, personalization is the key strategy for every marketer. Engaging customers at the right moment with the right message is the biggest issue for marketers. Big data helps marketers to create targeted and personalized campaigns.
★ Personalized marketing is creating and delivering messages to individuals or groups of the audience through data analysis, with the help of consumer data such as geolocation, browsing history, clickstream behavior and purchasing history. It is also known as one-to-one marketing.
★ Consumer insights: In this day and age, marketing has become the ability of a company to interpret data and change its strategies accordingly. Big data allows for real-time consumer insights, which are crucial to understanding the habits of your customers. By interacting with your consumers through social media, you will know exactly what they want and expect from your product or service, which will be key to distinguishing your campaign from your competitors.
★ Help increase sales: Big data will help with demand predictions for a product or service. Information gathered on user behavior will allow marketers to answer what types of products their users are buying, how often they conduct purchases or search for a product or service and, lastly, what payment methods they prefer using.
★ Analyze campaign results: Big data allows marketers to measure their campaign performance. This is the most important part of digital marketing. Marketers will use reports to measure any negative changes to marketing KPIs. If they have not achieved the desired results, it is a signal that the strategy needs to be changed in order to maximize revenue and make marketing efforts more scalable in future.

1.5 Web Analytics

★ Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage.
★ Web analytics is not just a tool for measuring web traffic but can be used as a tool for business and market research, and to assess and improve the effectiveness of a web site.
★ The following are some of the web analytics metrics: hit, page view, visit / session, first visit / first session, repeat visitor, new visitor, bounce rate, exit rate, page time viewed / page visibility time / page view duration, session duration / visit duration, average page view duration, and click path, etc.
★ Most people in the online publishing industry know how complex and onerous it could be to build an infrastructure to access and manage all the Internet data within their own IT department. Back in the day, IT departments would opt for a four-year project and millions of dollars to go that route. However, today this sector has built up an ecosystem of companies that spread the burden and allow others to benefit.

Why use big data tools to analyze web analytics data?

Web event data is incredibly valuable:

• It tells you how your customers actually behave (in lots of detail), and how that varies
• between different customers
• for the same customers over time (seasonality, progress in the customer journey)
• and how behavior drives value
• It tells you how customers engage with you via your website / web app
• how that varies by different versions of your product
• how improvements to your product drive increased customer satisfaction and lifetime value


• It tells you how customers and prospective customers engage with your different marketing campaigns and how that drives subsequent behavior

Deriving value from web analytics data often involves very bespoke analytics

• The web is a rich and varied space! E.g.:
• Bank
• Newspaper
• Social network
• Analytics application
• Government organization (e.g. tax office)
• Retailer
• Marketplace
• For each type of business you'd expect different:
• types of events, with different types of associated data
• ecosystem of customers / partners with different types of relationships
• product development cycle (and approach to product development)
• types of business questions / priorities to inform how the data is analysed

Web analytics tools are good at delivering the standard reports that are common across different business types (a small example of computing such metrics follows this list), e.g.:

• Where does your traffic come from, e.g.
• sessions by marketing campaign / referrer
• sessions by landing page
• Understanding events common across business types (page views, transactions, 'goals'), e.g.
• page views per session
• page views per web page
• conversion rate by traffic source
• transaction value by traffic source
• Capturing contextual data common to people browsing the web
• timestamps
• referer data
• web page data (e.g. page title, URL)
• browser data (e.g. type, plugins, language)
• operating system (e.g. type, time zone)
• hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
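As a small illustration of how such standard metrics fall out of raw event data, the following Python sketch computes page views per session and bounce rate from a handful of invented (session_id, page) records; the data and field names are assumptions made only for this example:

from collections import defaultdict

# Invented sample of page-view events: (session_id, page_url)
events = [
    ("s1", "/home"), ("s1", "/pricing"), ("s1", "/signup"),
    ("s2", "/home"),                      # single-page session counts as a bounce
    ("s3", "/blog"), ("s3", "/home"),
]

pages_per_session = defaultdict(int)
for session_id, url in events:
    pages_per_session[session_id] += 1

sessions = len(pages_per_session)
page_views = sum(pages_per_session.values())
bounces = sum(1 for n in pages_per_session.values() if n == 1)

print("Page views per session:", round(page_views / sessions, 2))
print("Bounce rate:", f"{bounces / sessions:.0%}")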

1.6 Big Data and Advances in Health Care


Big Data promises an enormous revolution in health care, with important advancements in everything from the management of chronic disease to the delivery of personalized medicine. In addition to saving and improving lives, Big Data has the potential to transform the entire health care system by replacing guesswork and intuition with objective, data-driven science.


Data in the World of Health Care

★ The healthcare industry is now awash in data: from biological data such as gene expression, single nucleotide polymorphisms (SNPs), proteomics and metabolomics to, more recently, next-generation gene sequence data.
★ This exponential growth in data is further fueled by the digitization of patient-level data: stored in Electronic Health Records (EHRs) and Health Information Exchanges (HIEs), enhanced with data from imaging and test results, medical and prescription claims, and personal health devices.
★The U.S. healthcare system is increasingly challenged by issues of cost and access to
quality care. Payers, producers, and providers are each attempting to realize improved
treatment outcomes and effective benefits for patients within a disconnected health care
framework.
★ Historically, these healthcare ecosystem stakeholders tend to work at cross purposes with other members of the health care value chain. High levels of variability and ambiguity across these individual approaches increase costs, reduce overall effectiveness, and impede the performance of the healthcare system as a whole.
★Recent approaches to health care reform attempt to improve access to health care by
increasing government subsidies and reducing the ranks of the uninsured.
★One outcome of the recently passed Accountable Care Act is a revitalized focus on cost
containment and the creation of quantitative proofs of economic benefit by payers,
producers, and providers.
★A more interesting unintended consequence is an opportunity for these health care
stakeholders to set aside historical differences and create a combined counterbalance to
potential regulatory burdens established, without the input of the actual industry the
government is setting out to regulate.


★ This "the enemy of my enemy is my friend" mentality has created an urgent motivation for payers, producers and, to a lesser extent, providers to create a new health care information value chain derived from a common healthcare analytics approach.
★ The health care system is facing severe economic, effectiveness, and quality challenges. These external factors are forcing a transformation of the pharmaceutical business model.
★ Health care challenges are forcing the pharmaceutical business model to undergo rapid change. Our industry is moving from a traditional model built on regulatory approval and settling of claims to one of medical evidence and proving economic effectiveness through improved analytics-derived insights.
★ The success of this new business model will be dependent on having access to data created across the entire healthcare ecosystem.
★ We believe there is an opportunity to drive competitive advantage for our LS clients by creating a robust analytics capability and harnessing integrated real-world patient-level data.
1.7 Big Data Technology
Big data technology is defined as the technology and software utilities designed for the analysis, processing and extraction of information from very large, extremely complex data sets that are very difficult for traditional systems and traditional data processing software to deal with. Big data technology is used to handle both real-time and batch-related data.
Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Apache Cassandra, Plotly, Pig, Tableau, etc.
Cassandra: Cassandra is one of the leading big data technologies and among the top NoSQL databases. It is open source, distributed and has extensive column storage options. It is freely available and provides high availability without failure.
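As a brief, hedged illustration of how an application talks to Cassandra, the sketch below uses the DataStax Python driver against a local single-node cluster; the keyspace, table and sample row are assumptions made only for this example:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])              # contact point of a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# Insert one row and read it back.
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Asha"))
row = session.execute("SELECT name FROM demo.users WHERE id = 1").one()
print(row.name)

cluster.shutdown()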
Apache Pig is a high-level scripting language used to execute queries for larger data sets that are used within Hadoop.
Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it features the Java, Python, Scala and R programming languages and supports SQL, streaming data, machine learning and graph processing, which can be used together in an application.
MongoDB: MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not the same as traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-less documents.
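The following minimal PyMongo sketch shows what schema-less documents look like in practice: two documents in the same collection carry different fields. It assumes a MongoDB instance running locally; the database, collection and field names are illustrative only:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents in the same collection need not share a schema.
orders.insert_one({"customer": "Asha", "items": ["book", "pen"], "total": 450})
orders.insert_one({"customer": "Ravi", "total": 120, "coupon": "NEW10"})

# Query by a field that both documents happen to have.
for doc in orders.find({"total": {"$gt": 200}}):
    print(doc["customer"], doc["total"])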


1.8 Introduction to Hadoop
★ Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

★ Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage.

★ Hadoop is sometimes referred to as an acronym for High Availability Distributed Object Oriented Platform.

★ The Hadoop framework consists of a storage layer known as the Hadoop Distributed File System (HDFS) and a processing framework called the MapReduce programming model.

★ Hadoop splits large amounts of data into chunks, distributes them within the network cluster and processes them in its MapReduce framework.

★ Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors such as Amazon Web Services (AWS) and Microsoft Azure offer solutions.

★ Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.

★ Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.

★ An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts and executing application computations in parallel close to their data.

★ A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.

Key features of Hadoop:

1. Cost Effective System
2. Large Cluster of Nodes
3. Parallel Processing
4. Distributed Data
5. Automatic Failover Management
6. Data Locality Optimization
7. Heterogeneous Cluster


8. Scalability.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously. Software clients input
data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then
processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware failures of
individual machines or racks of machines are common and should be automatically handled in
software by the framework.
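To make the map and reduce steps above concrete, here is a small word-count sketch in the Hadoop Streaming style, where the map and reduce functions read and write plain text over standard input and output. It is shown as a single self-contained Python file that simulates the shuffle-and-sort step locally with sorted(); on a real cluster the mapper and reducer scripts would instead be passed to the Hadoop Streaming jar, and the file names used here are placeholders.

import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Pairs arrive grouped by key; sum the counts per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))    # sorted() stands in for Hadoop's shuffle and sort
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")

Locally this can be tested with a command such as: cat input.txt | python wordcount.py (both names are again placeholders).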
Challenges of Hadoop:
MapReduce complexity: As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as interactive analytical tasks.
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the data.
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is the resource management layer introduced as an improvement over classic MapReduce job management and is used for processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.
1.8.1 Hadoop Ecosystem
● The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.

● The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects, and to the ways that they work together.

● Hadoop is a Java-based framework that is extremely popular for handling and analyzing large sets of data. The idea of a Hadoop ecosystem involves the use of different parts of the core Hadoop set such as MapReduce, a framework for handling vast amounts of data, and the Hadoop Distributed File System (HDFS), a sophisticated file handling system. There is also YARN, a Hadoop resource manager.

● In addition to these core elements of Hadoop, Apache has also delivered other kinds of accessories or complementary tools for developers.

● Some of the most well-known tools of the Hadoop ecosystem include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, etc.


● Hadoop Distributed File System (HDFS) is one of the largest Apache projects and the primary storage system of Hadoop. It employs a NameNode and DataNode architecture.

● It is a distributed file system able to store large files running over a cluster of commodity hardware.

● YARN stands for Yet Another Resource Negotiator. It is one of the core components in open source Apache Hadoop, suitable for resource management. It is responsible for managing workloads, monitoring and security controls implementation.

● Hive is an ETL and data warehousing tool used to query or analyze large datasets stored within the Hadoop ecosystem. Hive has three main functions: data summarization, query, and analysis of unstructured and semi-structured data in Hadoop.

● MapReduce: It is the core component of processing in the Hadoop ecosystem as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

● Apache Pig is a high-level scripting language used to execute queries for larger data sets that are used within Hadoop.

● Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it features the Java, Python, Scala and R programming languages and supports SQL, streaming data, machine learning and graph processing, which can be used together in an application. A short PySpark sketch appears after this list.


● Apache HBase is a Hadoop ecosystem component which is a distributed database that was designed to store structured data in tables that could have billions of rows and millions of columns. HBase is a scalable, distributed, NoSQL database that is built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
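As promised above, here is a minimal PySpark sketch of the same word-count pattern running through the Spark engine. It assumes a Spark installation with the pyspark package available; the HDFS input path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemWordCount").getOrCreate()

lines = spark.read.text("hdfs:///data/notes.txt")      # DataFrame with one 'value' column
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())      # split lines into words
          .map(lambda word: (word.lower(), 1))         # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))            # aggregate counts per word

for word, n in counts.take(10):
    print(word, n)

spark.stop()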

1.8.2 Hadoop Advantages

1. Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.

2. Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective as compared to traditional relational database management systems.

3. Resilient to failure: HDFS has the property with which it can replicate data over the network.

4. Hadoop can handle unstructured as well as semi-structured data.

5. The unique storage method of Hadoop is based on a distributed file system that effectively maps data wherever the cluster is located.

1.9 Open Source Technologies

★ Open source software is like any other software (closed/proprietary software). This software is differentiated by its use and licenses.

★ Open source software guarantees the right to access and modify the source code and to use, reuse and redistribute the software, all with no royalty or other costs.


★ Standard software is sold and supported commercially. However, open source software can be sold and/or supported commercially, too. Open source is a disruptive technology.

★ Open source is an approach to the design, development and distribution of software, offering practical accessibility to the software's source code.

★ Open source licenses must permit non-exclusive commercial exploitation of the licensed work, must make available the work's source code and must permit the creation of derivative works from the work itself.

★ Netscape, for example, released its browser source code under the Netscape Public License and subsequently under the Mozilla Public License.

★ Proprietary software is computer software which is the legal property of one party. The terms of use for other parties are defined by contracts or licensing agreements. These terms may include various privileges to share, alter, disassemble and use the software and its code.

★ Closed source is a term for software whose license does not allow for the release or distribution of the software's source code. Generally, it means only the binaries of a computer program are distributed and the license provides no access to the program's source code.

★ The source code of such programs is usually regarded as a trade secret of the company. Access to source code by third parties commonly requires the party to sign a non-disclosure agreement.

Need for open source

★ The demands of consumers as well as enterprises are ever increasing with the increase in information technology usage. Information technology solutions are required to satisfy their different needs. It is a fact that a single solution provider cannot produce all the needed solutions. Open source, freeware and free software are now available for anyone and for any use.

★ In the 1970s and early 1980s, software organizations started using technical measures to prevent computer users from being able to study and modify software. Copyright law was extended to computer programs in 1980. The free software movement was conceived in 1983 by Richard Stallman to satisfy the need for, and to give the benefit of, "software freedom" to computer users.

★ Richard Stallman announced the idea of the GNU operating system in September 1983. The GNU Manifesto was written by Richard Stallman and published in March 1985.

★ The Free Software Foundation (FSF) is a non-profit corporation started by Richard Stallman on 4 October 1985 to support the free software movement, a copyleft-based movement which aims to promote the universal freedom to distribute and modify computer software without restriction. In February 1986, the first formal definition of free software was published.

★ The term "free software" is associated with FSF's definition, and the term "open source software" is associated with OSI's definition. FSF's and OSI's definitions are worded quite differently, but the set of software that they cover is almost identical.

★One of the primary goals of this foundation was the development of a free and open
computer operating system and application software that can be used and shared among
different users with complete freedom.

★ Open source differs from traditional copyright licensing by permitting both open distribution and open modification.

★Before the term open source became widely adopted, developers and producers used a
variety of phrases to describe the concept. The term open source gained popularity with
the rise of the Internet, which provided access to diverse production models,
communication paths and last but not least, interactive communities.

Successes of Open Source

Operating Systems: Linux, Symbian, GNU Project, NetBSD

Servers: Apache, Tomcat, MediaWiki, WordPress, Eclipse, Moodle

Client Software: Mozilla Firefox, Mozilla Thunderbird, OpenOffice, 7-Zip

Digital Content: Wikipedia, Wiktionary, Project Gutenberg

1.10 Cloud and Big Data

★ NIST defines cloud computing as: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

★ This cloud model is composed of five essential characteristics, three service models and four deployment models."

★ The cloud provider is responsible for the physical infrastructure, and the cloud consumer is responsible for application configuration, personalization and data.

★ Broad network access refers to resources hosted in a cloud network that are available for access from a wide range of devices. Rapid elasticity is used to describe the capability to provide scalable cloud computing services.


★ Measured service: NIST describes measured service as a setup where cloud systems may control a user or tenant's use of resources by a metering capability somewhere in the system.
On-demand self-service refers to the service provided by cloud computing vendors that enables the provision of cloud resources on demand whenever they are required.
The Cloud Cube Model has four dimensions to differentiate cloud formations:
a) External / Internal
b) Proprietary / Open
c) Perimeterized / De-perimeterized
d) Outsourced / Insourced.
External / Internal: The physical location of data is defined by the external/internal dimension. It defines the organization's boundary.
Example: Information inside a data center using a private cloud deployment would be considered internal, and data that resides on Amazon EC2 would be considered external.
Proprietary / Open: Ownership is proprietary or open; this is a measure not only of ownership of technology but also of its interoperability, use of data, ease of data transfer and degree of vendor application lock-in.
Proprietary means that the organization providing the service is keeping the means of provision under their ownership. Clouds that are open are using technology that is not proprietary, meaning that there are likely to be more suppliers.
Perimeterized / De-perimeterized: The security range is perimeterized or de-perimeterized, which measures whether the operations are inside or outside the security boundary, firewall, etc. Encryption and key management will be the technological means for providing data confidentiality and integrity in a de-perimeterized model.
Outsourced / Insourced: Outsourcing / insourcing defines whether the customer or the service provider provides the service.
Outsourced means the service is provided by a third party. It refers to letting contractors or service providers handle all requests, and most cloud business models fall into this.
Insourced means the services are provided by your own staff under organizational control, i.e., in-house development of clouds.
★ Cloud computing is often described as a stack, as a response to the broad range of services built on top of one another under the "cloud". A cloud computing stack is a cloud architecture built in layers of one or more cloud-managed services (SaaS, PaaS, IaaS, etc.).


★ Cloud computing stacks are used for all sorts of applications and systems. They are especially good for microservices and scalable applications, as each tier is dynamically scalable and replaceable.

★ The cloud computing stack makes up a threefold system that comprises its lower-level elements. These components function as formalized cloud computing delivery models:

a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)
SaaS applications are designed for end-users and delivered over the web.
PaaS is the set of tools and services designed to make coding and deploying those applications quick and efficient.
IaaS is the hardware and software that powers it all, including servers, storage, networks and operating systems.
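At the IaaS layer, big data workloads usually start by landing raw files in cloud object storage. The sketch below uses the AWS boto3 SDK as one possible illustration; the bucket name, file name and the assumption that AWS credentials are already configured are all illustrative, not prescriptive:

import boto3

s3 = boto3.client("s3")

# Upload one raw data file into an object-storage "data lake" bucket.
s3.upload_file("clickstream-2024-01-01.csv",
               "my-bigdata-lake",
               "raw/clickstream-2024-01-01.csv")

# List what the bucket currently holds.
response = s3.list_objects_v2(Bucket="my-bigdata-lake")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])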
★ At the crossroads of high capital costs and rapidly changing business needs is a sea change that is driving the need for a new, compelling value proposition that is being manifested in a cloud-deployment model.

★ With a cloud model, you pay on a subscription basis with no upfront capital expense. You don't incur the typical 30 percent maintenance fees, and all the updates on the platform are automatically available.

★ The traditional cost of value chains is being completely disintermediated by platforms: massively scalable platforms where the marginal cost to deliver an incremental product or service is zero.

★ The ability to build massively scalable platforms, where you have the option to keep adding new products and services for zero additional cost, is giving rise to business models that weren't possible before. Mehta calls it "the next industrial revolution, where the raw material is data and data factories replace manufacturing factories." He pointed out a few guiding principles that his firm stands by:

1. Stop saying "cloud." It is not about the fact that it is virtual; the true value lies in delivering software, data and/or analytics in an "as a service" model. Whether that is in a private hosted model or a publicly shared one does not matter. The delivery, pricing and consumption model matters.
2. Acknowledge the business issues. There is no point in making light of matters around information privacy, security, access and delivery. These issues are real, more often than not heavily regulated by multiple government agencies, and unless dealt with in a solution, they will kill any platform sell.


3. Fix some core technical gaps. Everything from the ability to run analytics at scale in a virtual environment to ensuring information processing and analytics authenticity are issues that need solutions and have to be fixed.
1.11 Mobile Business Intelligence
➔ Analytics on mobile devices is what some refer to as putting BI in your pocket. Mobile drives straight to the heart of the simplicity and ease of use that has been a major barrier to BI adoption since day one.

➔ Mobile devices are a great leveling field where making complicated actions easy is the
name of the game. For example, a young child can use an iPad but not a laptop.

➔ As a result, this will drive broad-based adoption as much for the ease of use as for the
mobility these devices offer. This will have an immense impact on the business
intelligence sector.

➔ Mobile BI or mobile analytics is the rising software technology that allows users to access information and analytics on their phones and tablets instead of desktop-based BI systems.

➔ Mobile analytics involves measuring and analyzing data generated by mobile platforms
and properties, such as mobile sites and mobile applications.

➔ Analytics is the practice of measuring and analyzing data of users in order to create an understanding of user behavior as well as website or app performance. If this practice is done on mobile apps and app users, it is called "mobile analytics".

➔ Mobile analytics is the practice of collecting user behavior data, determining intent from those metrics and taking action to drive retention, engagement and conversion.

➔ Mobile analytics is similar to web analytics, where unique customers are identified and their usage is recorded.

➔ With mobile analytics data, you can improve your cross-channel marketing initiatives, optimize the mobile experience for your customers and grow mobile user engagement and retention.

➔ Analytics usually comes in the form of a software that integrates into a company's
existing websites and apps to capture, store and analyze the data.

➔ It is always very important for businesses to measure their critical KPIs (Key
Performance Indicators), as the old rule is always valid: "If you can't measure it, you
can't improve it".


➔ To be more specific, if a business finds out that 75% of its users exit on the shipment screen of its sales funnel, probably there is something wrong with that screen in terms of its design, user interface (UI) or user experience (UX), or there is a technical problem preventing users from completing the process.

Working of Mobile Analytics:
➔ Most analytics tools need a library (an SDK) to be embedded into the mobile app's project code, and at minimum an initialization code, in order to track users and screens.

➔ SDKs differ by platform, so a different SDK is required for each platform such as iOS, Android, Windows Phone, etc. On top of that, additional code is required for custom event tracking.

➔ With the help of this code, analytics tools track and count each user, app launch, tap, event, app crash or any additional information that the user has, such as device, operating system, version and IP address (and probable location).

➔ Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique users, since mobile analytics SDKs can generate a persistent and unique identifier for each device.

➔ The tracking technology varies between websites, which use either JavaScript or cookies, and apps, which use a software development kit (SDK).

➔ Each time a website or app visitor takes an action, the application fires off data which is recorded in the mobile analytics platform. A small sketch of such an event payload is shown below.
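The sketch below shows the kind of event payload an analytics SDK typically assembles and sends; the endpoint URL, field names and event name are assumptions made for illustration and do not correspond to any particular vendor's API:

import json, time, uuid
import urllib.request

event = {
    "device_id": str(uuid.uuid4()),       # persistent per-device identifier in a real SDK
    "event": "checkout_abandoned",
    "screen": "shipment",
    "os": "Android 14",
    "timestamp": int(time.time()),
}

request = urllib.request.Request(
    "https://analytics.example.com/collect",          # placeholder collection endpoint
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request)                     # uncomment to actually send the event
print(json.dumps(event, indent=2))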

Three elements that have impacted the viability of mobile BI:

1. Location: the GPS component and location awareness; you know where you are in time as well as the movement.
2. It's not just about pushing data; you can transact with your smartphone based on information you get.
3. Multimedia functionality allows the visualization pieces to really come into play.
Three challenges with mobile BI include:
1. Managing standards for rolling out these devices.
2. Managing security (always a big challenge).
3. Managing "bring your own device," where you have devices both owned by the company and devices owned by the individual, both contributing to productivity.


1.12 Crowdsourcing Analytics
★Crowdsourcing is the process of exploring customer's ideas, opinions and thoughts
available on the internet from large groups of people aimed at incorporating innovation,
implementing new ideas and eliminating product issues.
★Crowdsourcing means the outsourcing of human-intelligence tasks to a large group of
unspecified people via the Internet.
★ Crowdsourcing is all about collecting data from users through some services, ideas, or content; it then needs to be stored on a server so that the necessary data can be provided to users whenever required.
★ Most users nowadays use Truecaller to find unknown numbers and Google Maps to find places and the traffic in a region. All these services are based on crowdsourcing.
★Crowdsourced data is a form of secondary data. Secondary data refers to data that is
collected by any party other than the researcher. Secondary data provides important
context for any investigation into a policy intervention.
★Whencrowdsourcing data, researchers collect plentiful, valuable and disperseddataata
cost typically lower than that of traditional data collection methods.
★Consider the trade-offs between sample size and sampling issues before deciding to crowd
source data. Ensuring data quality means making sure the platform which you are
collecting crowd sourced data is well-tested.
★Crowdsourcing experiments are normally set up by asking a set of users to perform a task for a very small remuneration on each unit of the task. Amazon Mechanical Turk (AMT) is a popular platform that has a large set of registered remote workers who are hired to perform tasks such as data labeling.
★In data labeling tasks, the crowd workers are randomly assigned a single item in the dataset. A data object may receive multiple labels from different workers, and these have to be aggregated to get the overall true label, as the small sketch below illustrates.
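As a rough illustration, the Python sketch below aggregates multiple crowd labels per item with a simple majority vote; the item names and labels are made up for the example, and real platforms often use more sophisticated aggregation than this.

from collections import Counter

# Hypothetical labels collected from several different workers per item.
crowd_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "bird", "cat", "cat"],
}

def majority_label(labels):
    # The label given by the largest number of workers wins.
    label, _count = Counter(labels).most_common(1)[0]
    return label

aggregated = {item: majority_label(labels) for item, labels in crowd_labels.items()}
print(aggregated)   # {'img_001': 'cat', 'img_002': 'dog', 'img_003': 'cat'}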
★Crowdsourcing allows for many contributors to be recruited in a short period of time,
thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing
platforms usually employ their own tools to optimize the annotation process, making it
easier to conduct time-intensive labeling tasks.
★Crowdsourcing data is especially effective in generating complex and free-form labels
such as in the case of audio transcription, sentiment analysis, image annotation or
translation.
★With crowdsourcing, companies can collect information from customers and use it to their advantage. Brands gather opinions, ask for help, and receive feedback to improve their


product or service, and drive sales. For instance, Lego conducted a campaign where
customers had the chance to develop their designs of toys and submit them.
★To become the winner, a creator had to receive the largest number of people's votes. The best design was moved into the production process. Moreover, the winner got a privilege that amounted to a 1% royalty on the net revenue.
Types of Crowdsourcing:
There are four main types of crowdsourcing.
1. Wisdom of the crowd: It is the collective opinion of different individuals gathered in a group. This type is used for decision-making since it allows one to find the best solution to problems.
2. Crowd creation :This type involves a company asking its customers to help with new
products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting: It is a type of crowdsourcing where customers are allowed to choose a winner. They can vote to decide which of the options is the best for them. This type can be applied to different situations. Consumers can choose one of the options provided by experts or products created by consumers.
4. Crowdfunding: It is when people collect money and ask for investments for charities,
projects and startups without planning to return the money to the owners. People do it
voluntarily. Often, companies gather money to help individuals and families suffering from
natural disasters, poverty, social problems, etc.
Example:

o 99designs.com/, which does crowdsourcing of graphic design


o agentanything.com/, which posts "missions" where agents vie to run errands
o 33needs.com/, which allows people to contribute to charitable programs that make a social impact

1.13 Inter- and Trans-Firewall Analytics

● A firewall is a device designed to control the flow of traffic into and out of a network. In general, firewalls are installed to prevent attacks. A firewall can be a software program or a hardware device.

● Firewalls are software programs or hardware devices that filter the traffic that
flows into a user PC or user network through an internet connection.

● They sift through the data flow and block that which they deem harmful to the
user network or computer system.


Fig. 1.13.1 Firewall

● Firewalls filter based on IP, UDP and TCP information. A firewall is placed on the link between a network router and the Internet, or between a user and a router.
● For large organizations with many small networks, a firewall is placed on every connection attached to the Internet.
● Large organizations may use multiple levels of firewalls or distributed firewalls rather than locating a firewall at a single access point to the network.
● Firewalls test all traffic against consistent rules and pass traffic that meets those rules. Many routers support basic firewall functionality. A firewall can also be used to control data traffic.
● Firewall-based security depends on the firewall being the only connectivity to the site from outside; there should be no way to bypass the firewall via other gateways or wireless connections.
● A firewall filters out all incoming messages addressed to a particular IP address or a particular TCP port number. It divides a network into a more trusted zone internal to the firewall and a less trusted zone external to the firewall.
● Firewalls may also impose restrictions on outgoing traffic, to prevent certain
attacks and to limit losses if an attacker succeeds in getting access inside the
firewall.
Functions of firewall:
1. Access control: A firewall filters incoming as well as outgoing packets.
2. Address/Port Translation: Using network address translation, internal machines,
though not visible on the Internet, can establish a connection with external machines on
the Internet. NATing is often done by firewall.
3. Logging: Security architecture ensures that each incoming or outgoing packet
encounters at least one firewall. The firewall can log all anomalous packets.


Firewalls can protect the computer and the user's personal information from:
1. Hackers who breach your system security.
2. Firewall prevents malware and other Internet hacker attacks from reaching your
computer in the first place.
3. Outgoing traffic from your computer created by a virus infection.
Firewalls cannot provide protection:
1. Against phishing scams and other fraudulent activity
2. Viruses spread through e-mail
3. From physical access of your computer or network
4. For an unprotected wireless network.
Firewall Characteristics
1. All traffic from inside to outside, and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.
1.13.1 Firewall Rules
● A firewall policy consists of the rules and regulations set by the organization. Policy determines the type of internal and external information resources employees can access, the kinds of programs they may install on their own computers, as well as their authority for reserving network resources.

● Policy is typically general and set at a high level within the organization. Policies that contain details generally become too much of a "living document".

A user can create or disable firewall filter rules based on the following conditions:
1. IP addresses: A system admin can block a certain range of IP addresses.
2. Domain names: Admin can only allow certain specific domain names to access your
systems or allow access to only some specific types of domain names or domain name
extension.
3. Protocol: A firewall can decide which of the systems can allow or have access to common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.


4. Ports: Blocking or disabling ports of servers that are connected to the internet will
help maintain the kind of data flow you want to see it used for and also close down
possible entry points for hackers or malignant software.
5. Keywords: Firewalls can also sift through the data flow for a match of keywords or phrases, to block offensive or unwanted data from flowing in.
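To make the idea concrete, here is a minimal Python sketch of how such filter rules could be checked against a packet's header fields; the rule set, field names and first-match-wins policy are assumptions for illustration, not the configuration format of any real firewall product.

import ipaddress

# Illustrative rules: block an address range, block Telnet, allow everything else.
rules = [
    {"action": "block", "src": "203.0.113.0/24"},
    {"action": "block", "protocol": "tcp", "port": 23},
    {"action": "allow"},
]

def decide(packet):
    for rule in rules:
        if "src" in rule and ipaddress.ip_address(packet["src"]) not in ipaddress.ip_network(rule["src"]):
            continue
        if "protocol" in rule and packet["protocol"] != rule["protocol"]:
            continue
        if "port" in rule and packet["dst_port"] != rule["port"]:
            continue
        return rule["action"]          # first matching rule wins
    return "block"                     # default-deny if no rule matches

packet = {"src": "203.0.113.7", "protocol": "tcp", "dst_port": 80}
print(decide(packet))                  # 'block' -- the source address falls in the blocked range

A real firewall additionally tracks connection state, as described in the points that follow, so that return traffic for an allowed outbound connection is let back in.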
● When your computer makes a connection with another computer onthenetwork,
several things are exchanged including the source and destination ports.
● In a standard firewall configuration, most inbound ports are blocked. This would
normally cause a problem with return traffic since the source port is randomly
assigned.
● A state is a dynamic rule created by the firewall containing the source-destination port combination, allowing the desired return traffic to pass the firewall.
1.13.2 Types of Firewall
1. Packet filter
2. Application level firewall
3. Circuit level gateway.
➔ A packet filter firewall controls access to packets on the basis of packet source and destination address or specific transport protocol type.
➔ Filtering is done at the OSI data link, network and transport layers; the packet filter firewall works on the network layer of the OSI model.
➔ Packet filters do not see inside a packet; they block or accept packets solely on the basis of the IP addresses and ports. All incoming SMTP and FTP packets are parsed to check whether they should be dropped or forwarded.
➔ Outgoing SMTP and FTP packets, however, have already been screened by the gateway and do not have to be checked by the packet filtering router. A packet filter firewall only checks the header information.


Application level gateway is also called a bastion host. It operates at the application
level. Multiple application gateways can run on the same host but each gateway is a
separate server with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: A circuit-level firewall is a second generation firewall that
validates TCP and UDP sessions before opening a connection.
The firewall does not simply allow or disallow packets but also determines whether the
connection between both ends is valid according to configurable rules, then opens a
session and permits traffic only from the allowed source and possibly only for a limited
period of time.
It typically performs basic packet filter operations and then adds verification of proper handshaking of TCP and the legitimacy of the session information used in establishing the connection.
The decision to accept or reject a packet is based upon examining the packet's IP header and TCP header.
A circuit level gateway cannot examine the data content of the packets it relays between a trusted network and an untrusted network.

UNITIINOSQLDATAMANAGEMENT
IntroductiontoNoSQL–aggregatedatamodels–aggregates–key-valueanddocumentdatamodels–
relationships–graphdatabases–schemalessdatabases–materialized views – distribution
models– sharding– master-slave replication – peer-peerreplication–
shardingandreplication–consistency–relaxingconsistency–versionstamps–map-reduce–
partitioningandcombining–composingmap-reducecalculations

NOSQLDATAMANAGEMENT
What is NoSQL?
A NoSQL database, also called Not Only SQL, is an approach to data management and database design that is useful for very large sets of distributed data. NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database.
The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'. The most popular NoSQL database is Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL implementations include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort. Companies that use NoSQL include Netflix, LinkedIn and Twitter.

Why Are NoSQL Databases Interesting? / Why we should use Nosql? / when to useNosql?

ThereareseveralreasonswhypeopleconsiderusingaNoSQLdatabase.

Application development productivity.A lot of application development


effort isspentonmappingdatabetweenin-
memorydatastructuresandarelationaldatabase.ANoSQL database may provide a data
model that better fits the application’s needs, thussimplifying
thatinteractionandresultinginlesscode towrite, debug, andevolve.

Large data.Organizations are finding it valuable to capture more data and process
itmore quickly. They are finding it expensive, if even possible, to do so with
relationaldatabases. The primary reason is that a relational database is designed to run
on asingle machine, but it is usually more economic to run large data and computing
loadsonclustersofmanysmallerandcheapermachines.ManyNoSQLdatabasesaredesigne
dexplicitlytorunonclusters,sotheymakeabetterfitforbigdatascenarios.

Analytics.OnereasontoconsideraddingaNoSQLdatabasetoyourcorporateinfrastructureisthat
manyNoSQLdatabasesarewellsuitedtoperforminganalyticalqueries.

Scalability.NoSQLdatabasesaredesignedtoscale;it’soneoftheprimaryreasonsthatpeople
choose a NoSQL database. Typically, with a relational database like SQL
ServerorOracle,youscalebypurchasinglargerandfasterserversandstorageorbyemploying
specialists to provide additional tuning. Unlike relational databases,
NoSQLdatabasesaredesignedtoeasilyscaleoutastheygrow.Dataispartitionedandbalance
d across multiple nodes in a cluster, and aggregate queries are distributed bydefault.

Massive write performance. This is probably the canonical usage, based on Google's influence. High volume: Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day, with the prospect of this requirement doubling multiple times per year. This is the "data is too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.

Fast key-value access.This is probably the second most cited virtue of NoSQL in
thegeneral mind set.When latency is important it's hard to beat hashing on a key
andreading the value directly from memory or in as little as one disk seek. Not
everyNoSQL product is about fast access, some are more about reliability, for
example. butwhat people have wanted for a long time was a better memcached and
many NoSQLsystemsofferthat.

Flexible data model and flexible datatypes.NoSQL products support a whole


range ofnew data types, and this is a major area of innovation in NoSQL. We have:
column-oriented, graph, advanced data structures, document-oriented, and key-value.
Complexobjectscanbeeasilystoredwithoutalotofmapping.Developerslove
avoidingcomplexschemasandORMframeworks.Lackofstructureallowsformuchmorefle
xibility. We also have program and programmer friendly compatible datatypes
likesJSON.

Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.

Write availability. Do your writes need to succeed no mater what? Then we can
getintopartitioning,CAP,eventualconsistencyandallthatjazz.

Easier maintainability, administration and operations.This is very


product specific,but many NoSQL vendors are trying to gain adoption by making it
easy for developersto adopt them. They are spending a lot of effort on ease of use,
minimal administration,and automated operations. This can lead to lower operations
costs as special
codedoesn'thavetobewrittentoscaleasystemthatwasneverintendedtobeusedthatway.

No single point of failure. Not every product is delivering on this, but we are
seeing adefinite convergence on relatively easy to configure and manage high
availability withautomaticloadbalancingandclustersizing.Aperfectcloudpartner.

Generallyavailableparallelcomputing.WeareseeingMapReducebakedintoprod
ucts, which makes parallel computing something that will be a normal part
ofdevelopmentinthefuture.

Programmer ease of use.Accessing your data should be easy. While the


relationalmodel is intuitive for end users, like accountants, it's not very intuitive for
developers.Programmers grok keys, values, JSON, Javascript stored procedures,
HTTP, and so on.NoSQL is for programmers. This is a developer led coup. The
response to a databaseproblem can't always be to hire a really knowledgeable DBA,
get your schema right,denormalize a little, etc., programmers would prefer a system
that they can make workfor themselves. It shouldn't be so hard to make a product
perform. Money is part of
theissue.Ifitcostsalottoscaleaproductthenwon'tyougowiththecheaperproduct,thatyoucon
trol,that'seasiertouse,and that's easiertoscale?

Use the right data model for the right problem. Different data models are
used to
solvedifferentproblems.Muchefforthasbeenputinto,forexample,wedginggraphoperations
into a relational model, butit doesn't work. Isn't it better to solve a graphprobleminagraph
database?Wearenow
seeingageneralstrategyoftryingfindthebestfitbetweenaproblemandsolution.

Distributed systems and cloud computing support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span data centers while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.

Difference between SQL and NoSQL
 SQL databases are primarily called as Relational Databases (RDBMS);
whereasNoSQLdatabaseareprimarilycalledasnon-relationalordistributeddatabase.
 SQLdatabasesaretablebaseddatabaseswhereasNoSQLdatabasesaredocument based,
key-value pairs, graph databases or wide-column stores. Thismeans that SQL
databases represent data in form of tables which consists of
nnumberofrowsofdatawhereasNoSQLdatabasesarethecollectionofkey-valuepair,
documents,graph databasesor wide-columnstores whichdo not
havestandardschemadefinitionswhichit needsto adheredto.
 SQLdatabaseshavepredefinedschemawhereasNoSQLdatabases
havedynamicschemaforunstructureddata.
 SQLdatabasesareverticallyscalablewhereastheNoSQLdatabasesarehorizontally
scalable. SQL databases are scaled by increasing the horse-power
ofthehardware.NoSQLdatabasesarescaledbyincreasingthedatabasesserversinthepoolof
resourcestoreducetheload.
 SQLdatabasesusesSQL(structuredquerylanguage)fordefiningandmanipulatingthedata,whi
chisverypowerful.InNoSQLdatabase,queriesarefocusedoncollectionofdocuments.Someti
mesitisalsocalledasUnQL(Unstructured Query Language).The
syntaxofusingUnQLvariesfrom databasetodatabase.
 SQLdatabaseexamples:MySql,Oracle,Sqlite,PostgresandMS-
SQL.NoSQLdatabaseexamples:MongoDB,BigTable,Redis,RavenDb,Cassandra,Hbas
e,Neo4jandCouchDb
 For complex queries: SQL databases are good fit for the complex query
intensiveenvironment whereas NoSQL databases are not good fit for complex
queries. Ona high-level, NoSQL don’t have standard interfaces to perform complex
queries,andthequeriesthemselvesinNoSQLarenotaspowerfulasSQL querylanguage.
 For the type of data to be stored: SQLdatabasesare not best fitfor
hierarchicaldatastorage.But,NoSQLdatabasefitsbetterforthehierarchicaldatastorageas

it follows the key-value pair way of storing data similar to JSON data. NoSQLdatabase
are highly preferred for large data set (i.e for big data). Hbase is
anexampleforthispurpose.
 Forscalability:Inmosttypicalsituations,SQLdatabasesareverticallyscalable.Youcanmanage
increasingloadbyincreasingtheCPU,RAM,SSD,etc,onasingleserver.Ontheotherhand,NoS
QLdatabasesarehorizontallyscalable.YoucanjustaddfewmoreserverseasilyinyourNoSQLd
atabaseinfrastructuretohandlethelargetraffic.
 For high transactional based application: SQL databases are best fit for
heavydutytransactionaltypeapplications,asitismorestableandpromisestheatomicityasw
ellasintegrityofthedata.WhileyoucanuseNoSQLfortransactionspurpose,itisstillnotcom
parableandsableenoughinhighloadandforcomplextransactionalapplications.
 For support: Excellentsupport are available for all SQL databasefrom theirvendors.
There are also lot of independent consultations who can help you withSQL database
for a very large scale deployments. For some NoSQL database youstill have to rely
on community support, and only limited outside experts
areavailableforyoutosetupanddeployyourlargescaleNoSQLdeployments.
 Forproperties:SQLdatabasesemphasizesonACIDproperties(Atomicity,Consistency,
Isolation and Durability) whereas the NoSQL database follows
theBrewersCAPtheorem(Consistency, AvailabilityandPartitiontolerance )
 ForDBtypes:Onahigh-level,wecanclassifySQLdatabasesaseitheropen-sourceorclose-
sourcedfromcommercialvendors.NoSQLdatabasescanbeclassified on the basis of way of
storing data as graph databases, key-value
storedatabases,documentstoredatabases,columnstoredatabaseandXMLdatabases.

Types of NoSQL Databases: There are four general types of NoSQL databases, each with their own specific attributes:

1. Key-Value storage

This is the first category of NoSQL database. Key-value stores have a simple data model, which allows clients to put a map/dictionary entry and request the value per key. In key-value storage, each key has to be unique to provide non-ambiguous identification of values. For example:
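Since the original example figure is not reproduced here, the following minimal Python sketch illustrates the idea with made-up keys and values: each unique key maps to one value, and the value is opaque to the store.

# A key-value store behaves much like a dictionary keyed by unique keys.
store = {}
store["user:1001"] = '{"name": "Martin", "city": "Chicago"}'   # put(key, value)
store["user:1002"] = '{"name": "Pramod", "city": "Mumbai"}'
print(store["user:1001"])                                       # get(key)
del store["user:1002"]                                           # delete(key)

The store treats each value as an opaque blob; only the application knows the value happens to hold JSON.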

2. Document databases

In a document database, NoSQL stores documents in JSON format. In JSON-based documents, completely different sets of attributes can be stored together; such databases store highly unstructured data as named value pairs and suit applications that look at user behavior, actions, and logs in real time.

3. Column storage

Columnar databases are almost like tabular databases. Keys in wide column stores can have many dimensions, resulting in a structure similar to a multi-dimensional associative array. The example below shows storing data in a wide column system using a two-dimensional key.

4. Graph storage

Graph databases are best suited for representing data with a high, yet flexible, number of interconnections, especially when information about those interconnections is at least as important as the represented data. In a graph database, data is stored in graph-like structures so that it can be made easily accessible. Graph databases are commonly used on social networking sites, as shown in the figure below.

Exampledatabases

ProsandConsofRelationalDatabases
• Advantages
• Datapersistence
• Concurrency–ACID,transactions,etc.
• Integrationacrossmultipleapplications
• StandardModel–tablesandSQL
• Disadvantages
• Impedancemismatch
• Integrationdatabasesvs.applicationdatabases
• Not designedforclustering

Database Impedance Mismatch:
Impedance mismatch means the difference between the database's data model and the in-memory data structures.
Impedance is a measure of the amount by which one object resists (obstructs) the flow of another.
Imagine you have a low current flashlight that normally uses AAA batteries. Supposeyou
could attach yourcarbatteryto theflashlight. Thelow currentflashlightwillpitifully
output a fraction of the light energy that the high current battery is capable
ofproducing. However, match the AAA batteries to the flashlight and they will run
withmaximumefficiency.
The data representation in an RDBMS is not matched with the data structures used in memory. In-memory data structures are lists, dictionaries, and nested and hierarchical structures, whereas a relational database stores only atomic values, and there are no lists or nested records. Translating between these representations can be costly and confusing, and it limits application development productivity, as the small sketch below suggests.
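A small Python sketch of the mismatch, with made-up order data: the in-memory object is naturally nested, while the relational form of the same information has to be flattened into two tables linked by a foreign key.

# In memory: one nested structure, convenient for the application.
order = {
    "id": 99,
    "customer": "Martin",
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 40.00},
    ],
}

# Relational form: the same data split across two flat tables (order id as the foreign key).
orders_table = [(99, "Martin")]
order_items_table = [
    (99, "NoSQL Distilled", 32.45),
    (99, "Refactoring", 40.00),
]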
Somecommoncharacteristicsofnosqlinclude:
 Doesnotusetherelationalmodel(mostly)
 Generallyopensourceprojects(currently)
 Drivenbytheneedtorunonclusters
 Builtfortheneedtorun21stcenturywebproperties
 Schema-less
 Polyglot persistence: The point of view of using different data stores in different circumstances is known as Polyglot Persistence.

Today,mostlargecompaniesareusingavarietyofdifferentdatastoragetechnologiesfordiffere
ntkindsofdata.Manycompaniesstilluserelationaldatabasestostoresomedata,butthepersi
stenceneedsofapplicationsareevolving from predominantly relational to a mixture of
data sources.
Polyglotpersistenceiscommonlyusedtodefinethishybridapproach.Thedefinitionof
polyglot is“someone whospeaksor writes several languages.”Thetermpolyglot is
redefined for big data as a set of applications that use several
coredatabasetechnologies.

 Auto Sharding: NoSQL databases usually support auto-sharding, meaning


thattheynativelyandautomaticallyspreaddataacrossanarbitrarynumberofservers,withou
trequiringtheapplicationtoevenbeawareofthecompositionoftheserverpool
Nosqldata model
Relational
andNoSQLdatamodelsareverydifferent.Therelationalmodeltakesdataandseparatesitintoman
yinterrelatedtablesthatcontainrowsandcolumns.Tablesreferenceeachotherthroughforeignke
ysthatarestoredincolumnsaswell.Whenlooking up data, the desired information needs to be
collected from many tables
(oftenhundredsintoday’senterpriseapplications)andcombinedbeforeitcanbeprovidedto

the application. Similarly, when writing data, the write needs to be coordinated
andperformedonmanytables.

NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object. Aggregating this information may lead to duplication of information, but since storage is no longer cost prohibitive, the resulting data model flexibility, the ease of efficiently distributing the resulting documents, and the read and write performance improvements make it an easy trade-off for web-based applications.

Another major difference is that relational technologies have rigid schemas while NoSQL models are schemaless. Relational technology requires strict definition of a schema prior to storing any data into a database. Changing the schema once data is inserted is a big deal, extremely disruptive and frequently avoided – the exact opposite of the behavior desired in the Big Data era, where application developers need to constantly, and rapidly, incorporate new types of data to enrich their apps.

Aggregatesdatamodelin nosql
Data Model: A data model is the model through which we perceive and
manipulateour data. For people using a database, the data model describes how we
interact withthedatainthedatabase.
RelationalDataModel:Therelationalmodeltakestheinformationthatwewantto
storeanddividesitintotuples.
Tuple being a limited Data Structure it captures a set of values and can’t be nested.
ThisgivesRelationalModelaspaceofdevelopment.

Aggregate Model: Aggregate is a term that comes from Domain-Driven Design,
anaggregate is a collection of related objects that we wish to treat as a unit, it is a unit
fordata manipulationandmanagementofconsistency.

 Atomicpropertyholdswithinanaggregate
 Communicationwithdatastoragehappensinunitofaggregate
 Dealingwithaggregateismuchmoreefficientinclusters
 Itisofteneasierforapplicationprogrammerstoworkwithaggregates

ExampleofRelationsandAggregates
Let’sassumewehavetobuildane-
commercewebsite;wearegoingtobesellingitemsdirectlytocustomersovertheweb,andwewillh
avetostoreinformationabout
users, our product catalog, orders, shipping addresses, billing addresses,
andpaymentdata.Wecanusethisscenariotomodelthedatausingarelationdatastoreasw
ell asNoSQLdatastoresandtalkabouttheirprosand cons.Fora
relationaldatabase,wemightstartwithadatamodelshowninthefollowingfigure.

Thefollowingfigurepresentssomesampledataforthismodel.

In relational, everything is properly normalized, so that no data is repeated in
multipletables. We also have referential integrity. A realisticorder system would
naturally bemore involved than this. Now let’s see how this model might look when
we think inmoreaggregateorientedterms

Again,wehavesomesampledata,whichwe’llshowinJSONformatasthat’sacommonrepresentationf
ordatainNoSQL.
//incustomers
{"
id":1,"name":"Martin",
"billingAddress":[{"city":"Chicago"}]
}
//inorders
{"

id":99,
"customerId":1,"orderIte
ms":[
{
"productId":27,"price":3
2.45,
"productName":"NoSQLDistilled"
}
],
"shippingAddress":
[{"city":"Chicago"}]"orderPayment":[
{
"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft","billingAddress":
{"city":"Chicago"}
}
],
}

Inthismodel,wehavetwomainaggregates:customerandorder.We’veusedtheblack-
diamondcompositionmarkerinUMLtoshowhowdatafitsintotheaggregationstructure.
The customer contains a list of billing addresses; the order contains a list oforder
items, a shipping address, and payments. The payment itself contains a
billingaddressforthatpayment.
A single logical address record appears three times in the example data, but instead
ofusing IDs it’s treated as a value and copied each time. This fits the domain where
wewould not want the shipping address, nor the payment’s billing address, to change.
In
arelationaldatabase,wewouldensurethattheaddressrowsaren’tupdatedforthiscase,makin
g a new row instead. With aggregates, we can copy the whole address
structureintotheaggregateasweneedto.

Aggregate-OrientedDatabases:Aggregate-
orienteddatabasesworkbestwhenmostdatainteractionisdonewiththesameaggregate;aggregat
e-ignorantdatabasesarebetterwheninteractionsusedataorganizedinmanydifferentformations.

 Key-valuedatabases
 •Storesdatathatisopaquetothedatabase
 •Thedatabasedoescannotseethestructureofrecords
 •Applicationneedstodealwiththis
 •Allowsflexibilityregardingwhatisstored(i.e.textorbinarydata)
 Documentdatabases
 •Storesdatawhosestructureisvisibletothedatabase
 •Imposeslimitationsonwhatcanbestored
 •Allowsmoreflexibleaccesstodata(i.e.partialrecords)viaquerying

Bothkey-valueanddocumentdatabasesconsistofaggregaterecordsaccessedbyIDvalues
 Column-familydatabases
 •Twolevelsofaccesstoaggregates(andhence,twoparstothe“key”toaccessanaggre
gate’sdata)
 •IDisusedto lookupaggregaterecord
 •Columnname–eitheralabelforavalue(name)orakeytoalistentry(orderid)
 •Columnsaregroupedintocolumnfamilies

SchemalessDatabases
A common theme across all the forms of NoSQL databases is that they are
schemaless.Whenyouwanttostoredatainarelationaldatabase,youfirsthavetodefineasche
ma—a definedstructure forthe databasewhichsayswhattablesexist,whichcolumns exist,
and what data types each column can hold. Before you store some data,you
havetohavetheschemadefinedforit.
With NoSQL databases, storing data is much more casual. A key-value store allows youto
store any data you like under a key. A document database effectively does the
samething,sinceitmakesnorestrictionsonthestructureofthedocumentsyou store.Column-
familydatabasesallowyoutostoreanydataunderanycolumn you like.Graph databases
allow you to freely add new edges and freely add properties to
nodesandedgesasyouwish.
whyschemaless?
 Aschemalessstorealsomakesiteasiertodealwithnonuniformdata
 Whenstartinganewdevelopmentprojectyoudon'tneedtospendthesameamountoftimeonu
p-frontdesignoftheschema.
 NoneedtolearnSQLordatabasespecific stuffandtools.
 Therigidschemaofarelationaldatabase(RDBMS)meansyouhavetoabsolutelyfollow the
schema. It can be harder to push data into the DB as it has to perfectlyfit the schema.
Being able to add data directly without having to tweak it
tomatchtheschemacansaveyoutime
 Minor changes to the model and you will have to change both your code and
theschema in the DBMS. If no schema,you don't have to make changes in
twoplaces.Lesstimeconsuming
 WithaNoSqlDByouhavefewerwaystopullthedataout
 LessoverheadforDBengine
 Lessoverheadfordevelopersrelatedtoscalability

 EliminatestheneedforDatabaseadministratorsordatabaseexperts-
>fewerpeopleinvolvedandlesswaitingonexperts
 SavetimewritingcomplexSQLjoins ->morerapiddevelopment
Prosandconsofschemaless data
Pros:
 Morefreedomandflexibility
 youcaneasilychangeyourdataorganization
 you can deal with nonuniform
dataCons:
Aprogramthataccessesdata:
 almostalwaysreliesonsomeformofimplicitschema
 itassumesthatcertainfieldsarepresent
 carrydatawithacertainmeaning
TheimplicitschemaisshiftedintotheapplicationcodethataccessesdataTounderstand
whatdataispresentyouhavelookattheapplicationcodeTheschemacannotbeusedt
o:
 decidehowtostoreandretrievedataefficiently
 ensuredataconsistency
Problemsifmultipleapplications,developedbydifferentpeople,accessthesamedata
base.
RelationalschemascanbechangedatanytimewithstandardSQLcommands

Key-valuedatabases
A key-value store is a simple hash table, primarily used when all access to the
databaseisviaprimarykey.
Key-value stores are the simplest NoSQL data stores to use from an API perspective.The
client can either get the value for the key, put a value for a key, or delete a key
fromthe data store. The value is a BLOB(Binary Large Object) that the data store just
stores,without caring or knowing what’s inside; it’s the responsibility of the
application tounderstand what was stored. Since key-value stores always use primary-
key access,theygenerally havegreat performanceandcanbeeasilyscaled.
It is an associative container such as map, dictionary, and in query processing an index.It
is an abstract data type composed of a collection of unique keys and a collection
ofvalues, where each key is associated with one value (or set of values). The operation
offinding the value associated with a key is called a lookup or indexing The
relationshipbetweenakey anditsvalueissometimescalledamappingor binding.

Some of the popular key-value databases are Riak, Redis, Memcached DB, Berkeley
DB,HamsterDB,AmazonDynamoDB.
A Key-Value model is great for lookups of simple or even complex values. When
thevalues are themselves interconnected, you’ve got a graph as shown in following
figure.Letsyoutraversequicklyamongalltheconnectedvalues.

In a key-value database:

 Data is stored sorted by key.
 Callers can provide a custom comparison function to override the sort order.
 The basic operations are Put(key, value), Get(key), Delete(key).
 Multiple changes can be made in one atomic batch.
 Users can create a transient snapshot to get a consistent view of data.
 Forward and backward iteration is supported over the data. (A small sketch of these operations appears after the next paragraph.)
In key-value databases, a single object that stores all the data and is put into a
singlebucket. Buckets are used to define a virtual keyspace and provide the ability to
defineisolated non-default configuration. Buckets might be compared to tables or
folders inrelationaldatabasesorfilesystems,respectively.
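The class below is a small, in-memory Python sketch of the operations listed above; it is not a real engine such as Riak or Redis, and the key ordering is only simulated with sorted(), but it shows put/get/delete plus forward and backward iteration in key order.

class TinyKVStore:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)

    def scan(self, reverse=False):
        # Forward or backward iteration in key order.
        for key in sorted(self.data, reverse=reverse):
            yield key, self.data[key]

kv = TinyKVStore()
kv.put("user:2", "Pramod")
kv.put("user:1", "Martin")
print(list(kv.scan()))            # [('user:1', 'Martin'), ('user:2', 'Pramod')]
kv.delete("user:2")
print(kv.get("user:2"))           # None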
As their name suggest, they store key/value pairs. For example, for search engines, astore
may associate to each keyword (the key) a list of documents containing it
(thecorrespondingvalue).
One approach to implement a key-value store is to use a file decomposed in blocks .
Asthefollowingfigureshows,eachblockisassociatedwithanumber(rangingfrom1ton).Eac
hblockmanagesasetofkey-valuepairs:thebeginningoftheblockcontained,after some
information, an index of keys and the position of the corresponding values.These
values are stored starting from the end of the block (like a memory heap).
Thefreespaceavailable is delimited by theendoftheindex andtheendofthe values.

In this implementation, the size of a block is important since it defines the largest value that can be stored (for example, the longest list of document identifiers containing a given keyword). Moreover, it supposes that a block number is associated with each key. These block numbers can be assigned in two different ways:

1. The block number is obtained directly from the key, typically by using a hash function. The size of the file is then defined by the largest block number computed over every possible key.

2. The block number is assigned incrementally. When a new pair must be stored, the first block that can hold it is chosen. In practice, a given amount of space is reserved in a block in order to manage updates of existing pairs (a new value can replace an older and smaller one). This limits the size of the file to the amount of values to store.
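A minimal Python sketch of the first approach: the block number is derived from the key with a hash function, so the same key always lands in the same block. The block count of 1024 and the use of MD5 are arbitrary assumptions for the example.

import hashlib

N_BLOCKS = 1024   # illustrative number of blocks in the file

def block_number(key):
    # Hash the key and fold it into the range of valid block numbers.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BLOCKS

print(block_number("database"))   # always maps this keyword to the same block
print(block_number("nosql"))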

DocumentDatabases

In a relational database system you must define a schema before adding records to
adatabase. The schema is the structure described in a formal language supported by
thedatabase and provides a blueprint for the tables in a database and the
relationshipsbetween tables of data. Within a table, you need to define constraints in
terms of rowsandnamedcolumns aswellasthetype ofdatathat canbe
storedineachcolumn.
Incontrast,adocument-
orienteddatabasecontainsdocuments,whicharerecordsthatdescribethedatainthedocument,
aswellastheactualdata.Documentscanbeascomplex as you choose; you can use nested data
to provide additional sub-categories ofinformation about your object. You can also use one
or more document to represent areal-
worldobject.Thefollowingcomparesaconventionaltablewithdocument-basedobjects:

Inthisexamplewehaveatablethatrepresentsbeersandtheirrespectiveattributes:id,beer name,
brewer, bottles available and so forth. As we see in this illustration,
therelationalmodelconformstoaschemawithaspecifiednumberoffieldswhichrepresent a
specific purpose and data type. The equivalent document-based model
hasanindividualdocumentperbeer;eachdocumentcontainsthesametypesofinformation
foraspecificbeer.
In a document-oriented model, data objects are stored as documents; each documentstores
your data and enables you to update the data or delete it. Instead of columnswith
names and data types, we describe the data in the document, and provide thevalue for
that description. If we wanted to add attributes to a beer in a relational mode,we would
need to modify the database schema to include the additional columns andtheir data
types. In the case of document-based data, we would add additional key-valuepairsinto
ourdocumentsto representthe newfields.
Theothercharacteristicofrelationaldatabaseisdatanormalization;thismeansyoudecomposed
ataintosmaller,relatedtables.Thefigurebelowillustratesthis:

In the relational model, data is shared across multiple tables. The advantage to thismodel
is that there is less duplicated data in the database. If we did not separate
beersandbrewersintodifferenttablesandhadonebeertableinstead,wewouldhaverepeatedi
nformationaboutbreweriesforeachbeerproducedbythatbrewer.

Theproblemwiththisapproachisthatwhenyouchangeinformationacrosstables,youneedtolock
thosetablessimultaneouslytoensureinformationchangesacrossthetableconsistently.
Because you also spread information across a rigid structure, it makes itmore
difficultto changethe structureduring
production,anditisalsodifficulttodistributethedataacrossmultipleservers.
In the document-oriented database, we could choose to have two different
documentstructures: one for beers, and one for breweries. Instead of splitting your
applicationobjects into tables and rows, you would turn them into documents. By
providing areferenceinthebeer documentto abrewery document,you
createarelationshipbetweenthetwoentities:

InthisexamplewehavetwodifferentbeersfromtheAmtelbrewery.Werepresenteach
beerasaseparatedocumentandreferencethebreweryinthedocument- brewer field.Thetradit
orientedapproachprovidesseveralupsidescompared tothe ional
RDBMS model. First, because information is stored in documents, updating a schema isa
matter of updating the documents for that type of object. This can be done with
nosystem downtime. Secondly, we can distribute the information across multiple
serverswith greater ease. Since records are contained within entire documents, it
makes iteasiertomove,orreplicateanentireobjecttoanotherserver.
UsingJSONDocuments
JavaScriptObjectNotation(JSON)isalightweightdata-
interchangeformatwhichiseasytoreadandchange.JSONislanguage-
independentalthoughitusessimilarconstructstoJavaScript.Thefollowingare
basicdatatypessupportedinJSON:
 Numbers,includingintegerandfloatingpoint,
 Strings,includingallUnicodecharactersandbackslashescapecharacters,
 Boolean:trueorfalse,
 Arrays,enclosedinsquarebrackets:[“one”,“two”,“three”]
 Objects,consistingofkey-
valuepairs,andalsoknownasanassociativearrayorhash.Thekey mustbeastring andthe
valuecanbeanysupportedJSONdatatype.
Forinstance,ifyouarecreatingabeerapplication,youmightwantparticulardocumentstructuretore
presentabeer:
{
"name":"descripti
on":"category"
:"updated":
}
For each of the keys in this JSON document you would provide unique values
torepresentindividualbeers.Ifyouwanttoprovidemoredetailedinformationinyourbeer
application about the actual breweries, you could create a JSON structure
torepresentabrewery:
{
"name":"addre
ss":"city":
"state":"website":"
description":
}
Performing data modeling for a document-based application is no different than thework
you would need to do for a relational database. For the most part it can be muchmore
flexible, it can provide a more realistic representation or your application data,andit
also enablesyou to changeyour mindlater about data structure. For morecomplex items
in your application, one option is to use nested pairs to represent theinformation:
{
"name":
"address":
"city":
"state":
"website":
"description":
"geo":
{
"location":["-
105.07","40.59"],"accuracy":"RANGE_INT
ERPOLATED"
}
"beers":[_id4058,_id7628]
}
Inthiscaseweaddedanestedattributeforthegeo-
locationofthebreweryandforbeers.Withinthelocation,weprovideanexactlongitudeandlatitud
e,aswellaslevel

ofaccuracyforplottingitonamap.Thelevelofnestingyouprovideisyourdecision;as long as a
document is under the maximum storage size for Server, you can provideanylevel
ofnestingthatyoucanhandleinyourapplication.
In traditional relational database modeling, you would create tables that contain asubset of
information for an item. For instance a brewery may contain types of beerswhich are
stored in a separate table and referenced by the beer id. In the case of
JSONdocuments,youusekey-valuespairs,orevennestedkey-valuepairs.

Column-FamilyStores
Its name conjures up a tabular structure realized with sparse columns and no schema. The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. As well as accessing the row as a whole, operations also allow picking out a particular column, so to get a particular customer's name from the following figure you could do something like
get('1234','name').

Column-familydatabasesorganizetheircolumnsintocolumnfamilies.Eachcolumnhas to be
part of a single column family, and the column acts as unit for access, with
theassumptionthatdataforaparticularcolumnfamilywillbeusuallyaccessedtogether.
Thedataisstructuredinto:
• Row-oriented: Each row is an aggregate (for example, customer with the ID of
1234)withcolumnfamiliesrepresentingusefulchunksofdata (profile, orderhistory)
withinthataggregate.

• Column-oriented: Each column family defines a record type (e.g., customer
profiles)with rows for each of the records. You then think of a row as the join of records in
allcolumnfamilies.
Even though a document database declares some structure to the database, each document is still seen as a single unit. Column families give a two-dimensional quality to column-family databases.
Cassandra uses the terms "wide" and "skinny." Skinny rows have few columns, with the same columns used across the many different rows. In this case, the column family defines a record type, each row is a record, and each column is a field. A wide row has many columns (perhaps thousands), with rows having very different columns. A wide column family models a list, with each column being one element in that list.
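Conceptually, a column-family row can be pictured as a two-level map: row key, then column family, then column. The Python sketch below uses plain nested dictionaries with made-up data; it only mirrors the access pattern, not Cassandra's actual storage layout or client API.

# Row key -> column family -> column name -> value.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago", "payment": "cc-1000"},
        "orders": {"od99": "NoSQL Distilled", "od86": "Refactoring"},   # a "wide" family
    }
}

def get(row_key, column, family="profile"):
    # Two-level access: first the row aggregate, then a single column within a family.
    return rows[row_key][family][column]

print(get("1234", "name"))        # 'Martin'
print(rows["1234"]["orders"])     # the whole order-history column family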

Relationships:AtomicAggregates
Aggregates allow one to store a single business entity as one document, row or key-
valuepairandupdateitatomically:

GraphDatabases:
GraphdatabasesareonestyleofNoSQLdatabasesthatusesadistributionmodelsimilar to
relational databases but offers a different data model that makes it better
athandlingdatawithcomplexrelationships.

 Entitiesarealsoknownasnodes,whichhaveproperties
 Nodes are organized byrelationshipswhich allowsto findinteresting
patternsbetweenthenodes
 Theorganizationofthegraphletsthedatatobestoredonceandtheninterpretedindifferentwaysba
sedonrelationships

Let’sfollowalongsomegraphs,usingthemtoexpressthemselves.We’llread”agraphbyfollowin
garrowsaroundthediagramtoformsentences.
AGraphcontainsNodesandRelationships

AGraph–[:RECORDS_DATA_IN]–>Nodes–[:WHICH_HAVE]–>Properties.

ThesimplestpossiblegraphisasingleNode,arecordthathasnamedvaluesreferredtoas
Properties. A Node could start with a single Property and grow to a few
million,thoughthatcangetalittleawkward.Atsomepointitmakessensetodistributethedatai
ntomultiplenodes,organizedwithexplicitRelationships.

QueryaGraphwithaTraversal

ATraversal–navigates–>aGraph;it–identifies–>Paths–whichorder–>Nodes.

ATraversalishowyouqueryaGraph,navigatingfromstartingNodes to relatedNodes
according to an algorithm, finding answers to questions like “what music do
myfriendslikethatIdon’tyetown,”or“ifthispowersupplygoesdown,what
webservicesareaffected?”

Example

In this context,a graph refers to a graph data structureof nodes connected


byedges.Intheabovefigurewehaveawebofinformationwhosenodesareverysmall(nothingmore than
a name)but there isarich structure ofinterconnectionsbetweenthem. Withthis structure, we can ask
questions such as “ find the books in the Databases categorythatare
writtenbysomeonewhom afriendof minelikes.”
Graph databases specialize in capturing this sort of information—but on a much
largerscalethanareadablediagramcouldcapture.Thisisidealforcapturinganydataconsistin
g of complex relationships such as social networks, product preferences,
oreligibilityrules.
MaterializedViews
In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary based on aggregations of a table's data. Materialized views can be used within the same aggregate.
Materialized views, which store data based on remote tables, are also known as snapshots. A snapshot can be redefined as a materialized view.
A materialized view is computed in advance and cached on disk.
Strategies for building a materialized view:
Eager approach: the materialized view is updated at the same time as the base data. It is good when you have more frequent reads than writes.
Detached approach: batch jobs update the materialized views at regular intervals. It is good when you don't want to pay an overhead on each update.
NoSQL databases do not have views; instead they have precomputed and cached queries, usually called "materialized views".
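A minimal Python sketch of the eager strategy described above: a made-up "orders per customer" count is kept as a precomputed view and refreshed in the same step as each write to the base data, so reads never recompute the aggregate.

# Base data and the precomputed view it feeds.
orders = []                      # base data
orders_per_customer = {}         # materialized view: customer -> order count

def add_order(order):
    # Eager approach: update the view at the same time as the base data.
    orders.append(order)
    customer = order["customerId"]
    orders_per_customer[customer] = orders_per_customer.get(customer, 0) + 1

add_order({"id": 99, "customerId": 1})
add_order({"id": 100, "customerId": 1})
print(orders_per_customer[1])    # 2 -- read straight from the cached view

# A detached approach would instead rebuild orders_per_customer from the
# orders list in a periodic batch job rather than on every write.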
DistributionModels
Multipleservers:InNoSQLsystems,datadistributedoverlargeclusters
Single server – simplest model, everything on one machine. Run the database on
asingle machine that handles all the reads and writes to the data store. We prefer
thisoption because it eliminates all the complexities. It’s easy for operations people
tomanageandeasyforapplicationdevelopersto reasonabout.
AlthoughalotofNoSQLdatabasesaredesignedaroundtheideaofrunningon acluster, it can
make sense to use NoSQL with a single-server distribution model if thedata model of
the NoSQL store is more suited to the application. Graph databases
aretheobviouscategoryhere—theseworkbestinasingle-serverconfiguration.
Ifyourdatausageismostlyaboutprocessingaggregates,thenasingle-serverdocumentorkey-
valuestoremaywellbeworthwhilebecauseit’seasieronapplicationdevelopers.
Orthogonalaspectsofdatadistributionmodels:
Sharding:
DBShardingisnothingbuthorizontalpartitioningofdata.Differentpeopleareaccessingdifferent
partsofthedataset.Inthesecircumstanceswecansupporthorizontalscalabilitybyputtingdifferen
tpartsofthedataontodifferentservers—atechniquethat’scalledsharding.
Atablewithbillionsofrowscanbepartitionedusing“RangePartitioning”.Ifthecustomer
transaction date, for an example, based partitioning will partition the datavertically. So
irrespective which instance in a Real Application Clusters access the data,it
is“not”horizontally partitionedalthoughGlobalEnqueueResources are
owningcertainblocksineachinstancebutitcanbemovingaround.Butin“dbshard”environm
ent,thedataishorizontallypartitioned.Foranexample:UnitedStatescustomer can live in
one shard and European Union customers can be in another
shardandtheothercountriescustomerscanliveinanothershardbutfromanaccessperspective
there is no need to know where the data lives. The DB Shard can go to
theappropriateshardtopickupthedata.

Differentpartsofthedataontodifferentservers

 Horizontalscalability
 Idealcase:differentusersalltalkingtodifferentservernodes
 Dataaccessedtogetheronthesamenode̶aggregateunit!
Pros:itcanimprovebothreadsandwrites
Cons:Clustersuselessreliablemachinesr̶esiliencedecreasesMany
NoSQLdatabasesofferauto-sharding
thedatabasetakesontheresponsibilityofsharding
Improving performance - main rules of sharding:
1. Place the data close to where it's accessed. Orders for Boston: data in your eastern US data center.
2. Try to keep the load even. All nodes should get equal amounts of the load.
3. Put together aggregates that may be read in sequence. Same order, same node.
(A small sketch of key-based shard placement follows.)
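As a rough illustration of these rules, the Python sketch below routes each customer to a shard by region, falling back to a hash of the key for everything else; the region table and shard names are invented for the example and do not come from any particular product.

import hashlib

# Hypothetical placement table: keep data close to where it is accessed (rule 1).
REGION_SHARD = {"US-East": "shard-boston", "EU": "shard-dublin"}
OTHER_SHARDS = ["shard-singapore", "shard-mumbai"]

def shard_for(customer_id, region):
    if region in REGION_SHARD:
        return REGION_SHARD[region]
    # Hash-based fallback keeps the remaining load spread evenly (rule 2).
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return OTHER_SHARDS[int(digest, 16) % len(OTHER_SHARDS)]

print(shard_for("cust-42", "US-East"))   # shard-boston
print(shard_for("cust-77", "APAC"))      # one of the fallback shards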

Master-Slave Replication
Master
is the authoritative source for the data
is responsible for processing any updates to that data
can be appointed manually or automatically
Slaves
A replication process synchronizes the slaves with the master
After a failure of the master, a slave can be appointed as the new master very quickly

ProsandconsofMaster-SlaveReplication
Pro
s
Morereadrequests:
Addmoreslavenodes
Ensurethatall readrequestsareroutedtotheslaves
Shouldthemasterfail,theslavescan stillhandlereadrequestsGoodfor
datasets witharead-intensivedataset

Cons Themaster isabottleneck


Limited by its ability to process updates and to pass those updates
onItsfailuredoeseliminatetheability tohandlewrites until:
themasterisrestored or
anewmaster isappointed
Inconsistencyduetoslowpropagationofchangestothesl
avesBadfor datasetswithheavywritetraffic
Peer-to-PeerReplication

 Allthereplicashaveequalweight,theycanallacceptwrites
 Thelossofanyofthemdoesn’tpreventaccesstothedatastore.
Prosandconsofpeer-to-peerreplicationPros:
youcanrideovernodefailureswithoutlosingaccesstodatayoucaneasilyadd
nodestoimprove yourperformance
Con
s:
Inconsistency
Slowpropagationofchangestocopiesondifferentnodes

Inconsistenciesonreadleadtoproblemsbutarerelativelytransient
Twopeoplecanupdatedifferentcopiesofthesamerecordstoredondifferentnodesatthesametime
-awrite-writeconflict.
Inconsistentwritesareforever.
ShardingandReplication onMaster-Slave
Replication and sharding are strategies that can be combined. If we use both
masterslavereplicationandsharding,thismeansthatwehavemultiplemasters,but
eachdataitemonlyhasasinglemaster.Dependingonyourconfiguration,youmaychooseano
de to be a master for some data and slaves for others, or you may dedicate nodes
formasterorslaveduties.

Wehavemultiplemasters,buteachdataonlyhasasinglemaster.Twoschemes:
AnodecanbeamasterforsomedataandslavesforothersNodes
arededicatedformasterorslaveduties
ShardingandReplication onP2P
Usingpeer-to-peerreplicationandshardingisacommonstrategyforcolumnfamilydatabases.In a
scenariolikethis you might have tens or hundreds of nodes in
aclusterwithdatashardedoverthem.Agoodstartingpointforpeer-to-peerreplicationistohave a
replication factor of 3, so each shard is present on three nodes. Should a node
fail,thentheshardsonthatnodewillbebuiltontheothernodes.(Seefollowingfigure)

Usuallyeachshardispresentonthreenodes
Acommonstrategyforcolumn-familydatabases
KeyPoints
• Therearetwostylesofdistributingdata:
• Shardingdistributesdifferentdataacrossmultipleservers,soeachserveractsasthesinglesour
ceforasubsetofdata.
• Replicationcopiesdataacrossmultipleservers,soeachbitofdatacanbefoundinmultiplep
laces.
Asystemmayuseeitherorbothtechniques.
• Replicationcomesintwoforms:
• Master-
slavereplicationmakesonenodetheauthoritativecopythathandleswriteswhileslavessync
hronizewiththemasterandmay handle reads.
• Peer-to-
peerreplicationallowswritestoanynode;thenodescoordinatetosynchronizetheircopiesofthed
ata.
Master-slavereplicationreducesthechanceofupdateconflictsbutpeerto-
peerreplicationavoidsloadingallwritesonto asinglepointoffailure.
Consistency
Theconsistencypropertyensuresthatanytransactionwillbringthedatabasefromonevalid state
to another. Any data written to the database must be valid according to
alldefinedrules,includingconstraints,cascades,triggers,andanycombinationthereof.
ItisabiggestchangefromacentralizedrelationaldatabasetoaclusterorientedNoSQL.
RelationaldatabaseshasstrongconsistencywhereasNoSQLsystemshassmostlyeventualconsi
stency.
ACID:ADBMSisexpectedtosupport“ACIDtransactions,”processesthatare:
 Atomicity:eitherthewholeprocessisdoneornoneis
 Consistency:onlyvaliddataarewritten
 Isolation:oneoperationatatime
 Durability:oncecommitted,itstaysthatway
Variousformsofconsistency
1. UpdateConsistency(orwrite-writeconflict):
Martin and Pramod are looking at the company website and notice that the phonenumber
is out of date. Incredibly, they both have update access, so they both go in atthe same
time to update the number. To make the example interesting, we’ll
assumetheyupdateitslightlydifferently,becauseeachusesaslightlydifferentformat.This
issueiscalledawrite-write conflict:twopeopleupdatingthesamedataitematthesametime.
When the writes reach the server, the server will serialize them—decide to apply one,then
the other. Let’s assume it uses alphabetical order and picks Martin’s update
first,thenPramod’s.Withoutanyconcurrencycontrol,Martin’supdatewouldbeappliedand
immediately overwritten by Pramod’s. In this case Martin’s is a lost update. We
seethis as a failure of consistency because Pramod’s update was based on the state
beforeMartin’supdate,yetwasappliedafterit.
Solutions:
Pessimistic approach:
Prevent conflicts from occurring.
Usually implemented with write locks managed by the system.
Optimistic approach:
Lets conflicts occur, but detects them and takes action to sort them out. Approaches:
conditional updates: test the value just before updating (see the sketch below)
save both updates: record that they are in conflict and then merge them
These do not work if there is more than one server (peer-to-peer replication).
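The Python sketch below illustrates the conditional-update idea on a single server: each record carries a version number, and a write is accepted only if the caller read the current version. The record layout and version field are assumptions made for the example, not a particular database's mechanism.

# A record with a version stamp that changes on every update.
record = {"phone": "555-1000", "version": 7}

def conditional_update(rec, expected_version, new_phone):
    # Optimistic approach: test the version just before updating.
    if rec["version"] != expected_version:
        return False                 # someone else wrote in the meantime
    rec["phone"] = new_phone
    rec["version"] += 1
    return True

# Martin and Pramod both read version 7, then both try to write.
print(conditional_update(record, 7, "555-2000"))   # True  -- Martin's update applied
print(conditional_update(record, 7, "555-3000"))   # False -- Pramod's conflict detected
print(record)

This is the same compare-and-set pattern that the version-stamp discussion later in this unit refers to.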
2. ReadConsistency(orread-writeconflict)
AliceandBobareusingTicketmasterwebsitetobookticketsforaspecificshow.Only oneticket is left
forthe specificshow. Alicesignson to Ticketmasterfirst andfindsone
left,andfindsitexpensive.Alicetakestimetodecide.Bobsignsonandfindsone ticketleft, orders
itinstantly. Bob purchasesand logs off. Alice decides
tobuyaticket,tofindtherearenotickets.ThisisatypicalRead-WriteConflictsituation.
Another example where Pramod has done a read in the middle of Martin’s write
asshowninbelow.

We refer to this type of consistency as logical consistency. To avoid a logically inconsistent read, Martin wraps his two writes in a transaction; the system then guarantees that Pramod will either read both data items before the update or both after the update. The length of time an inconsistency is present is called the inconsistency window.
Replicationconsistency
Let’simaginethere’sonelasthotelroomforadesirableevent.Thehotelreservationsystem runs on
many nodes. Martin and Cindy are a couple considering this room,but they are
discussing this on the phone because Martin is in London and Cindy isin Boston.
Meanwhile Pramod, who is in Mumbai, goes and books that last room.That updates the
replicated room availability, but the update gets to Boston quickerthan it gets to London.
When Martin and Cindy fire up their browsers to see if theroom is available, Cindy sees
it booked and Martin sees it free. This is anotherinconsistentread—
butit’sabreachofadifferentformofconsistencywecallreplicationconsistency:ensur
ingthatthesamedataitemhasthesamevaluewhenread fromdifferentreplicas.

Eventual consistency:
Atanytime,nodesmayhavereplicationinconsistenciesbut,iftherearenofurtherupdates,eventually
allnodeswillbeupdatedtothesamevalue.Inotherwords,Eventualconsistencyisaconsistencymode
lusedinnosqldatabasetoachievehighavailabilitythatinformallyguaranteesthat,ifnonewupdatesar
emadetoagivendataitem,eventuallyallaccessestothatitemwillreturnthelastupdatedvalue.
EventuallyconsistentservicesareoftenclassifiedasprovidingBASE(BasicallyAvailable,Softstat
e,Eventualconsistency)semantics,incontrastto
traditionalACID(Atomicity,Consistency,Isolation,Durability)guarantees.
Basic Availability.The NoSQL database approach focuses on availability of
dataeveninthepresenceofmultiplefailures.Itachievesthisbyusingahighlydistributedapproac
htodatabasemanagement.Insteadofmaintainingasinglelargedata store and focusing on the
fault tolerance of that store, NoSQL databases spreaddata across many storage systems
with a high degree of replication. In the unlikelyevent that a failure disrupts access to a
segment of data, this does not necessarilyresultinacompletedatabaseoutage.
Softstate.BASEdatabasesabandontheconsistencyrequirementsoftheACIDmodel pretty much
completely. One of the basic concepts behind BASE is that
dataconsistencyisthedeveloper'sproblemandshouldnotbehandledbythedatabase.
Eventual Consistency.The only requirement that NoSQL databases have
regardingconsistencyistorequirethatatsomepointinthefuture,datawillconvergetoaconsistentstat
e.Noguaranteesaremade,however,aboutwhenthiswilloccur.Thatisacompletedeparture
fromtheimmediateconsistencyrequirementofACID thatprohibits a transaction fromexecuting
untiltheprior transactionhas completed andthedatabasehasconvergedtoaconsistentstate.
Version stamp: A field that changes every time the underlying data in the record changes. When you read the data you keep a note of the version stamp, so that when you write data you can check to see if the version has changed.
You may have come across this technique when updating resources with HTTP. One way of doing this is to use etags. Whenever you get a resource, the server responds with an etag in the header. This etag is an opaque string that indicates the version of the resource. If you then update that resource, you can use a conditional update by supplying the etag that you got from your last GET method. If the resource has changed on the server, the etags won't match and the server will refuse the update, returning a 412 (Precondition Failed) error response. In short, it helps you detect concurrency conflicts.
When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write.
Version stamps can be implemented using counters, GUIDs (a large random number that's guaranteed to be unique), content hashes, timestamps, or a combination of these.
With distributed systems, a vector of version stamps (a set of counters, one for each node) allows you to detect when different nodes have conflicting updates.
Sometimes this is called a compare-and-set (CAS) operation.
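To make the compare-and-set idea concrete, here is a minimal Java sketch (our own illustration, not from the source text; the class and field names are hypothetical). The write succeeds only if the version stamp read earlier is still the one currently stored:

import java.util.HashMap;
import java.util.Map;

// Hypothetical record carrying a version stamp implemented as a simple counter.
class VersionedRecord {
    long versionStamp;
    String value;
}

class RecordStore {
    private final Map<String, VersionedRecord> data = new HashMap<>();

    // Compare-and-set style update: it only succeeds if the version stamp the
    // client read earlier is still the current one.
    synchronized boolean conditionalUpdate(String key, long expectedStamp, String newValue) {
        VersionedRecord current = data.get(key);
        if (current == null || current.versionStamp != expectedStamp) {
            return false;            // somebody else updated in between: conflict detected
        }
        current.value = newValue;
        current.versionStamp++;      // bump the stamp so later writers see the change
        return true;
    }
}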

Relaxing consistency
The CAP Theorem: The basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, you can only get two.
Consistency: all people see the same data at the same time.
Availability: if you can talk to a node in the cluster, it can read and write data.
Partition tolerance: the cluster can survive communication breakages that separate the cluster into partitions unable to communicate with each other.

Network partition: The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency. Very large systems will "partition" at some point:

 That leaves either C or A to choose from (a traditional DBMS prefers C over A and P)
 In almost all cases, you would choose A over C (except in specific applications such as order processing)

CA systems
A single-server system is the obvious example of a CA system.
CA cluster: if a partition occurs, all the nodes would go down.
A failed, unresponsive node does not imply a lack of CAP availability.
A system that suffers partitions must trade off consistency versus availability: give up some consistency to get some availability.

An example
Ann is trying to book a room at the Ace Hotel in New York on a node of the booking system located in London.
Pathin is trying to do the same on a node located in Mumbai.
The booking system uses a peer-to-peer distribution.
There is only one room available.
The network link breaks.

Possible solutions
CP: Neither user can book any hotel room, sacrificing availability.
caP: Designate the Mumbai node as the master for the Ace hotel.
    Pathin can make the reservation.
    Ann can see the inconsistent room information.
    Ann cannot book the room.
AP: Both nodes accept the hotel reservation. Overbooking!

Map-Reduce

It is a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

Logical view
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
Map(k1,v1) → list(k2,v2)
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2,list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return.

Example 1: Counting and Summing

Problem Statement: There are a number of documents where each document is a set of words. It is required to calculate the total number of occurrences of each word in all documents. (Similarly, there may be a log file where each record contains a response time and it is required to calculate an average response time.)

Solution:
Let us start simple. The code snippet below shows a Mapper that simply emits "1" for each word it processes and a Reducer that goes through the lists of ones and sums them up:

class Mapper
  method Map(docid id, doc d)
    for all word t in doc d do
      Emit(word t, count 1)

class Reducer
  method Reduce(word t, counts [c1, c2, ...])
    sum = 0
    for all count c in [c1, c2, ...] do
      sum = sum + c
    Emit(word t, sum)

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
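As a concrete illustration of the same counting pattern, a word-count Mapper and Reducer can be written in Java with the old-style org.apache.hadoop.mapred API that is also used later in these notes. This is our own sketch rather than code from the source text:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);        // emit (word, 1) for every word in the line
    }
  }
}

class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();       // add up all the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}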

Example 2: Multistage map-reduce calculations

Let us say that we have a set of documents and its attributes with the following form:

{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "excellent blog..."
    }
  ]
}

And we want to answer a question over more than a single document. That sort of operation requires us to use aggregation, and over a large amount of data, that is best done using Map/Reduce, to split the work.

Map/Reduce is just a pair of functions, operating over a list of data. Let us say that we want to be able to get a count of comments per blog. We can do that using the following Map/Reduce queries:

from post in docs.posts
select new {
  post.blog_id,
  comments_length = comments.length
};

from agg in results
group agg by agg.key into g
select new {
  agg.blog_id,
  comments_length = g.Sum(x => x.comments_length)
};

There are a couple of things to note here:

 The first query is the map query; it maps the input document into the final format.
 The second query is the reduce query; it operates over a set of results and produces an answer.
 The first value in the result is the key, which is what we are aggregating on (think of the group by clause in SQL).

Let us see how this works. We start by applying the map query to the set of documents that we have, producing this output:

The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result. In this case, imagine that the output of the first step was divided into groups of 3 (so 4 groups overall), and then the reduce query was applied to it, giving us:

You can see why it was called reduce: for every batch, we apply a sum by blog_id to get a new Total Comments value. We started with 11 rows, and we ended up with just 10. That is where it gets interesting, because we are still not done; we can still reduce the data further.

This is what we do in the third step, reducing the data further still. That is why the input and output format of the reduce query must match, since we feed the output of several reduce queries as the input of a new one. You can also see that now we moved from having 10 rows to having just 7.

And the final step is:

And now we are done; we can't reduce the data any further because all the keys are unique.

RDBMS compared to MapReduce
MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

Partitioning and Combining

In the simplest form, we think of a map-reduce job as having a single reduce function. The outputs from all the map tasks running on the various nodes are concatenated together and sent into the reduce. While this will work, there are things we can do to increase the parallelism and to reduce the data transfer (see figure).

Reduce Partitioning Example

The first thing we can do is increase parallelism by partitioning the output of the mappers. Each reduce function operates on the results of a single key. This is a limitation, since it means you can't do anything in the reduce that operates across keys, but it's also a benefit in that it allows you to run multiple reducers in parallel. To take advantage of this, the results of the mapper are divided up based on the key on each processing node. Typically, multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer. Multiple reducers can then operate on the partitions in parallel, with the final results merged together.
(This step is also called "shuffling," and the partitions are sometimes referred to as "buckets" or "regions.")
The next problem we can deal with is the amount of data being moved from node to node between the map and reduce stages. Much of this data is repetitive, consisting of multiple key-value pairs for the same key. A combiner function cuts this data down by combining all the data for the same key into a single value (see fig).

Combinable Reducer Example

A combiner function is, in essence, a reducer function; indeed, in many cases the same function can be used for combining as the final reduction. The reduce function needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
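Returning to the partitioning step described above, the sketch below shows what a hash-based partitioner looks like in Java using the old org.apache.hadoop.mapred API. It is our own illustration; Hadoop's default partitioner behaves essentially this way, sending every record with the same key to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Assigns each map output key to one of numPartitions buckets, so that all
// records with the same key end up at the same reducer.
public class HashKeyPartitioner implements Partitioner<Text, IntWritable> {
  @Override
  public void configure(JobConf job) { }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}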

UNIT III BASICS OF HADOOP
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes – design of Hadoop distributed file system (HDFS) – HDFS concepts

Introduction to Hadoop

Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, in turn utilizing the underlying parallelism of the CPU cores.

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

MAPREDUCE in Hadoop
Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from the others. While this sounds like a major limitation at first, it makes the whole framework much more reliable. Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named "MapReduce."

In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.

Hadoop Architecture:

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

Data Format

InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
 Selects the files or other objects that should be used for input
 Defines the InputSplits that break a file into tasks
 Provides a factory for RecordReader objects that read the file

OutputFormat: The (key, value) pairs provided to the OutputCollector are then written to output files.

Hadoop can process many different types of data formats, from flat text files to databases. If it is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record. For example, in the (National Climatic Data Center) NCDC data given below, the format supports a rich set of meteorological elements, many of which are optional or with variable data lengths.

Data files are organized by date and weather station.

Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job.

Map and Reduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reducer function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

These lines are presented to the map function as the key-value pairs:

The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The whole data flow is illustrated in the following figure.

Java MapReduce

Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by an implementation of the Mapper interface, which declares a map() method.

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    // statement to convert the input data into string
    // statement to obtain year and temp using the substring method
    // statement to place the year and temp into a set called OutputCollector
  }
}

The Mapper interface is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an instance of OutputCollector to write the output to.

The reduce function is similarly defined using a Reducer, as illustrated in the following figure.

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    // statement to find the maximum temperature of each year
    // statement to put the max. temp and its year in a set called OutputCollector
  }
}

The third piece of code runs the MapReduce job.

public class MaxTemperature
{
  public static void main(String[] args) throws IOException
  {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}

A JobConf object forms the specification of the job. It gives you control over how the job is run. Having constructed a JobConf object, we specify the input and output paths. Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions. The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.
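The map() and reduce() bodies above are left as placeholder comments. One possible way to fill them in, assuming the NCDC field offsets that the Ruby streaming example later in this unit also uses (year at characters 15-19, temperature at 87-92, quality code at 92), is sketched below; treat it as an illustration dropped into the classes above rather than the exact code intended by the notes:

// A possible body for MaxTemperatureMapper.map():
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
  String line = value.toString();                        // convert the input data into a string
  String year = line.substring(15, 19);                  // obtain the year
  int airTemperature;
  if (line.charAt(87) == '+') {                          // strip a leading plus sign before parsing
    airTemperature = Integer.parseInt(line.substring(88, 92));
  } else {
    airTemperature = Integer.parseInt(line.substring(87, 92));
  }
  String quality = line.substring(92, 93);
  if (airTemperature != 9999 && quality.matches("[01459]")) {   // drop missing/suspect readings
    output.collect(new Text(year), new IntWritable(airTemperature));
  }
}

// A possible body for MaxTemperatureReducer.reduce():
public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
  int maxValue = Integer.MIN_VALUE;
  while (values.hasNext()) {
    maxValue = Math.max(maxValue, values.next().get());  // find the maximum temperature
  }
  output.collect(key, new IntWritable(maxValue));        // emit (year, maximum temperature)
}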

Scaling out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.

The whole data flow with a single reduce task is illustrated in the Figure.

The number of reduce tasks is not governed by the size of the input, but is specified independently. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
The data flow for the general case of multiple reduce tasks is illustrated in the Figure.

Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle, since the processing can be carried out entirely in parallel, as shown in the figure.

Combiner Functions

Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:
(1950,0)
(1950,20)
(1950,10)
And the secondproduced:
(1950,25)
(1950,15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:
(1950, [20, 25])
and the reduce would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

public class MaxTemperatureWithCombiner
{
  public static void main(String[] args) throws IOException
  {
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}

Hadoop Streaming

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

Streaming is naturally suited for text processing and, when used in text mode, it has a line-oriented view of data. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line.

Let's illustrate this by rewriting our MapReduce program for finding maximum temperatures by year in Streaming.

Ruby: The map function can be expressed in Ruby as shown below

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

Since the script just operates on standard input and output, it's trivial to test the script without using Hadoop, simply using Unix pipes:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950  +0000
1950  +0022
1950  -0011
1949  +0111
1949  +0078

The reduce function is shown below

last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key

We can now simulate the whole MapReduce pipeline with a Unix pipeline, and we get

% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949  111
1950  22

Hadoop Pipes

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.

The following example shows the source code for the map and reduce functions in C++.

class MaxTemperatureMapper : public HadoopPipes::Mapper
{
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context)
  {
    // statement to convert the input data into string
    // statement to obtain year and temp using the substring method
    // statement to place the year and temp into a set
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer
{
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context)
  {
    // statement to find the maximum temperature of each year
    // statement to put the max. temp and its year in a set
  }
};

int main(int argc, char *argv[])
{
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper, MapTemperatureReducer>());
}

The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case.
These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as accessing job configuration information via the JobConf class.
The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.

Pipes doesn't run in standalone (local) mode, since it relies on Hadoop's distributed cache mechanism, which works only when HDFS is running.
With the Hadoop daemons now running, the first step is to copy the executable to HDFS so that it can be picked up by tasktrackers when they launch map and reduce tasks:
% hadoop fs -put max_temperature bin/max_temperature

The sample data also needs to be copied from the local filesystem into HDFS:
% hadoop fs -put input/ncdc/sample.txt sample.txt
Now we can run the job. For this, we use the Hadoop pipes command, passing the URI of the executable in HDFS using the -program argument:
% hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input sample.txt \
  -output output \
  -program bin/max_temperature
The result is the same as the other versions of the same program that we ran in the previous example.

Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

 "Very large"
In this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
 Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
 Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
 Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.
 Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
 Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.

HDFS Concepts

The following diagram illustrates the Hadoop concepts.

Blocks

A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default.

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
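As a rough, illustrative calculation (the figures here are assumptions, not from the notes): if a seek takes about 10 ms and the disk transfers data at about 100 MB/s, then to keep the seek overhead near 1% of the transfer time, a block needs to be on the order of 100 MB. This is why HDFS blocks are tens of megabytes or more rather than a few kilobytes.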

Benefits of blocks

1. A file can be larger than any single disk in the network. There is nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
2. It simplifies the storage subsystem. The storage subsystem deals with blocks, simplifying storage management and eliminating metadata concerns.
3. Blocks fit well with replication for providing fault tolerance and availability.

Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to, and they report back to the namenode periodically with lists of blocks that they are storing. Without the namenode, the filesystem cannot be used.

Secondary Namenode
It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.

The Command-Line Interface
There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar. It provides a command line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                        Command
Create a directory named /foodir              bin/hadoop dfs -mkdir /foodir
View the contents of a file named             bin/hadoop dfs -cat /foodir/myfile.txt
/foodir/myfile.txt

Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, which are described in the Table.
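As an illustration of this abstraction (our own sketch, not from the notes; the sample URI is hypothetical), application code can be written against FileSystem without knowing which concrete implementation backs it:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                               // e.g. hdfs://namenode/user/hadoop/sample.txt
    Configuration conf = new Configuration();
    // Returns the concrete FileSystem implementation (HDFS, local, etc.)
    // chosen from the URI scheme and the configuration.
    FileSystem fs = FileSystem.get(java.net.URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                      // open the file for reading
      IOUtils.copyBytes(in, System.out, 4096, false);   // copy its contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}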

Thrift
The Thrift API in the "thriftfs" module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.

C
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS, but despite its name it can be used to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client.

FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop's Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem.

WebDAV
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it's possible to access HDFS as a standard filesystem.

File patterns
It is a common requirement to process sets of files in a single operation. Hadoop supports the same set of glob characters as Unix bash, as shown in the Table.

Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode and the datanodes, consider the Figure, which shows the main sequence of events when reading a file.

The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
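The steps above correspond to a small number of client-side calls. The following sketch (ours, with an assumed file path) shows open() returning an FSDataInputStream, repeated read() calls, a seek() (which FSDataInputStream supports), and close():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAnatomyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem when the default filesystem is HDFS
    Path file = new Path("/user/hadoop/sample.txt");    // assumed path, for illustration only

    FSDataInputStream in = fs.open(file);               // step 1: open() returns an FSDataInputStream
    byte[] buffer = new byte[128];
    int n = in.read(buffer);                            // steps 3-4: read() streams data from the closest datanode
    System.out.println("read " + n + " bytes");

    in.seek(0);                                         // FSDataInputStream supports seeking back in the file
    n = in.read(buffer);                                // re-read the same bytes after the seek
    System.out.println("re-read " + n + " bytes");

    in.close();                                         // step 6: close() the stream
  }
}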

Anatomy of a File Write
The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).

DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5). If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. When the client has finished writing data, it calls close() on the stream (step 6).
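A matching write-side sketch (again ours, with an assumed path) shows create(), writes that are packetized behind the scenes, and close():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAnatomyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/hadoop/output.txt");   // assumed path, for illustration only

    FSDataOutputStream out = fs.create(file);          // steps 1-2: create() asks the namenode to record the new file
    out.writeUTF("hello hdfs");                        // step 3: data is split into packets and queued for the pipeline
    out.close();                                       // step 6: close() flushes remaining packets and completes the file
  }
}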

Application level gateway is also called a bastion host. It operates at the application level.
Multiple application gateways can run on the same host but each gateway is a separate server
with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: A circuit-level firewall is a second generation firewall that validates
TCP and UDP sessions before opening a connection.
The firewall does not simply allow or disallow packets but also determines whether the
connection between both ends is valid according to configurable rules, then opens a session
and permits traffic only from the allowed source and possibly only for a limited period of
time.
It typically performs basic packet filter operations and then adds verification of proper handshaking of TCP and the legitimacy of the session information used in establishing the connection.
The decision to accept or reject a packet is based upon examining the packet's IP header and TCP header.
Circuit level gateway cannot examine the data content of the packets it relays between a trusted network and an untrusted network.

UNIT IV MAPREDUCE APPLICATIONS

MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of MapReduce job run – classic Map-reduce – YARN – failures in classic Map-reduce and YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output formats

MapReduce Workflows

This section explains how to translate a data processing problem into the MapReduce model. When the processing gets more complex, the complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs. A MapReduce workflow is divided into two steps:

• Decomposing a Problem into MapReduce Jobs
• Running Jobs

1. Decomposing a Problem into MapReduce Jobs
Let's look at an example of a more complex problem that we want to translate into a MapReduce workflow. When we write a MapReduce workflow, we have to create two scripts:

• the map script, and
• the reduce script.

When we start a map/reduce workflow, the framework will split the input into segments, passing each segment to a different machine. Each machine then runs the map script on the portion of data attributed to it.
2. Running Dependent Jobs (linear chain of jobs) or More complex Directed Acyclic Graph jobs

When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order? There are several approaches, and the main consideration is whether you have a linear chain of jobs, or a more complex directed acyclic graph (DAG) of jobs.

For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:

JobClient.runJob(conf1);
JobClient.runJob(conf2);

For anything more complex than a linear chain, such as a DAG of jobs, there is a class called JobControl which represents a graph of jobs to be run.

Oozie

Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as a server, and a client submits a workflow to the server. In Oozie, a workflow is a DAG of action nodes and control-flow nodes. An action node performs a workflow task, like moving files in HDFS, running a MapReduce job or running a Pig job. When the workflow completes, Oozie can make an HTTP callback to the client to inform it of the workflow status. It is also possible to receive callbacks every time the workflow enters or exits an action node.
Oozie allows failed workflows to be re-run from an arbitrary point. This is useful for dealing with transient errors when the early actions in the workflow are time-consuming to execute.

Anatomy of a classic map reduce job run

How does MapReduce work? / Explain the anatomy of a classic map reduce job run / How does Hadoop run a MapReduce job?

You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It is very short, but it conceals a great deal of processing behind the scenes. The whole process is illustrated in the following figure.

As shown in Figure 1, there are four independent entities in the framework:
- Client, which submits the MapReduce job.
- JobTracker, which coordinates and controls the job run. It is a Java class called JobTracker.
- TaskTrackers, which run the tasks that the job has been split into, control the specific map or reduce task, and make reports to the JobTracker. They are Java classes as well.
- HDFS, which provides distributed data storage and is used to share job files between the other entities.

As Figure 1 shows, MapReduce processing includes 10 steps, which in short are:

- The clients submit MapReduce jobs to the JobTracker.
- The JobTracker assigns Map and Reduce tasks to other nodes in the cluster.
- These nodes each run a software daemon TaskTracker on a separate JVM.
- Each TaskTracker actually initiates the Map or Reduce tasks and reports progress back to the JobTracker.

There are six detailed phases in the workflow. They are:

1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Task Progress and status updates
6. Task Completion

Job Submission

When the client calls submit() on the job object, an internal JobSubmitter Java object is initiated and submitJobInternal() is called. If the client calls waitForCompletion(), job progress is reported back to the client until the job completes. The JobSubmitter does the following work (a hypothetical client-side sketch of this call sequence is shown after the list):
- Asks the JobTracker for a new job ID.
- Checks the output specification of the job.
- Computes the input splits for the job.
- Copies the resources needed to run the job. Resources include the job jar file, the configuration file and the computed input splits. These resources will be copied to HDFS in a directory named after the job id. The job jar will be copied more than 3 times across the cluster so that TaskTrackers can access it quickly.
- Tells the JobTracker that the job is ready for execution by calling submitJob() on the JobTracker.
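For reference, the client-side calls that trigger this submission path can be sketched with the newer org.apache.hadoop.mapreduce API (this hypothetical driver is our own illustration, not code from the notes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example job");      // creates the job object
    job.setJarByClass(SubmitSketch.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // In a real job you would also call job.setMapperClass(...), job.setReducerClass(...),
    // job.setOutputKeyClass(...) and job.setOutputValueClass(...) here.

    // waitForCompletion() calls submit() internally, which drives the JobSubmitter
    // steps listed above, and then polls progress until the job finishes.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}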

Job Initialization

When the JobTracker receives the call submitJob(), it will put the call into an internal queue from where the job scheduler will pick it up and initialize it. The initialization is done as follows:
- A job object is created to represent the job being run. It encapsulates its tasks and bookkeeping information so as to keep track of the task progress and status.
- The JobTracker retrieves the input splits from HDFS and creates the list of tasks, each of which has a task ID. It creates one map task for each split, and the number of reduce tasks according to the configuration.
- The JobTracker will create the setup task and cleanup task. The setup task creates the final output directory for the job and the temporary working space for the task output. The cleanup task deletes the temporary working space for the task output.
- The JobTracker will assign tasks to free TaskTrackers.

Task Assignment

TaskTrackers send heartbeats periodically to the JobTracker node to tell it that they are alive and whether they are ready to get a new task. The JobTracker will allocate a new task to a ready TaskTracker. Task assignment is as follows:
- The JobTracker will choose a job to select the task from according to a scheduling algorithm; a simple way is to choose from a priority list of jobs. After choosing the job, the JobTracker will choose a task from the job.
- TaskTrackers have a fixed number of slots for map tasks and for reduce tasks, which are set independently; the scheduler will fill the empty map task slots before reduce task slots.
- To choose a reduce task, the JobTracker simply takes the next in its list of yet-to-be-run reduce tasks, because there is no data locality consideration. But the map task chosen depends on the data locality and the TaskTracker's network location.

Task Execution

When the TaskTracker has been assigned a task, the task execution runs as follows:

- Copy the jar file from HDFS, and copy needed files from the distributed cache onto the local disk.
- Create a local working directory for the task and 'un-jar' the jar file contents into the directory.
- Create a TaskRunner to run the task. The TaskRunner will launch a new JVM to run each task, so a TaskRunner failing because of bugs will not affect the TaskTracker. Multiple tasks on the node can reuse the JVM created by the TaskRunner.
- Each task on the same JVM created by the TaskRunner will run the setup task and cleanup task.
- The child process created by the TaskRunner informs the parent process of the task's progress every few seconds until the task is complete.

Progress and Status Updates

A MapReduce job is a long-running batch job, so after clients submit a job the job progress report is important. Hadoop task progress consists of the following:
- Reading an input record in a mapper or reducer
- Writing an output record in a mapper or a reducer
- Setting the status description on a reporter, using the Reporter's setStatus() method
- Incrementing a counter
- Calling Reporter's progress()

As shown in the Figure, when a task is running, the TaskTracker will notify the JobTracker of its task progress by heartbeat every 5 seconds.

The mapper and reducer on the child JVM will report to the TaskTracker with their progress status every few seconds. The mapper or reducer will set a flag to indicate a status change that should be sent to the TaskTracker. The flag is checked in a separate thread every 3 seconds. If the flag is set, it will notify the TaskTracker of the current task status.
The JobTracker combines all of the updates to produce a global view, and the Client can use getStatus() to get the job progress status.

Job Completion

When the JobTracker receives a report that the last task for a job is complete, it will change its status to successful. Then the JobTracker will send an HTTP notification to the client which called waitForCompletion(). The job statistics and the counter information will be printed to the client console. Finally the JobTracker and the TaskTracker will do the cleanup actions for the job.

MRUnit test

MRUnit is based on JUnit and allows for the unit testing of mappers, reducers and some limited integration testing of the mapper-reducer interaction along with combiners, custom counters and partitioners.

To write your test you would:

 Testing Mappers
1. Instantiate an instance of the MapDriver class parameterized exactly as the mapper under test.
2. Add an instance of the Mapper you are testing in the withMapper call.
3. In the withInput call pass in your key and input value.
4. Specify the expected output in the withOutput call.
5. The last call, runTest, feeds the specified input values into the mapper and compares the actual output against the expected output set in the withOutput method.

 Testing Reducers
1. The test starts by creating a list of objects (pairList) to be used as the input to the reducer.
2. A ReduceDriver is instantiated.
3. Next we pass in an instance of the reducer we want to test in the withReducer call.
4. In the withInput call we pass in the key (of "190101") and the pairList object created at the start of the test.
5. Next we specify the output that we expect our reducer to emit.
6. Finally runTest is called, which feeds our reducer the inputs specified and compares the output from the reducer against the expected output.

The MRUnit testing framework is based on JUnit and it can test MapReduce programs written on several versions of Hadoop.
Following is an example that uses MRUnit to unit test a MapReduce program that does SMS CDR (call details record) analysis.
The records look like

CDRID;CDRType;Phone1;Phone2;SMS Status Code
655209;1;796764372490213;804422938115889;6
353415;0;356857119806206;287572231184798;4
835699;1;252280313968413;889717902341635;0

The MapReduce program analyzes these records, finds all records with CDRType as 1, and notes the corresponding SMS Status Code. For example, the Mapper outputs are
6, 1
0, 1
The Reducer takes these as inputs and outputs the number of times a particular status code has been obtained in the CDR records.

The corresponding Mapper and Reducer are

public class SMSCDRMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Text status = new Text();
  private final static IntWritable addOne = new IntWritable(1);

  /** Returns the SMS status code and its count */
  protected void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {

    // 655209;1;796764372490213;804422938115889;6 is the sample record format
    String[] line = value.toString().split(";");
    // If record is of SMS CDR
    if (Integer.parseInt(line[1]) == 1) {
      status.set(line[4]);
      context.write(status, addOne);
    }
  }
}

The corresponding Reducer code is

public class SMSCDRReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws java.io.IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The MRUnit test class for the Mapper is

public class SMSCDRMapperReducerTest {

  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    SMSCDRMapper mapper = new SMSCDRMapper();
    SMSCDRReducer reducer = new SMSCDRReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
    mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
  }

  @Test
  public void testMapper() throws IOException {
    mapDriver.withInput(new LongWritable(), new Text("655209;1;796764372490213;804422938115889;6"));
    mapDriver.withOutput(new Text("6"), new IntWritable(1));
    mapDriver.runTest();
  }

  @Test
  public void testReducer() throws IOException {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("6"), values);
    reduceDriver.withOutput(new Text("6"), new IntWritable(2));
    reduceDriver.runTest();
  }
}

YARN: It is Hadoop MapReduce 2 and was developed to address the various limitations of classic MapReduce.

Current MapReduce (classic) Limitations:

 Scalability problem
 Maximum Cluster Size – 4000 nodes only
 Maximum Concurrent Tasks – 40000 only
 Coarse synchronization in JobTracker
 It has a single point of failure
 When a failure occurs it kills all queued and running jobs
 Jobs need to be resubmitted by users
 Restart is very tricky due to complex state

For large clusters with more than 4000 nodes, the classic MapReduce framework hits scalability problems.

YARN stands for Yet Another Resource Negotiator.

A group in Yahoo began to design the next generation MapReduce in 2010, and in 2013 Hadoop 2.x released MapReduce 2, Yet Another Resource Negotiator (YARN), to remedy the scalability shortcoming.

What does YARN do?

 Provides a cluster level resource manager
 Adds application level resource management
 Provides slots for jobs other than Map/Reduce
 Improves resource utilization
 Improves scaling
 Cluster size is 6000 – 10000 nodes
 100,000+ concurrent tasks can be executed
 10,000 concurrent jobs can be executed
 Splits the JobTracker into

1. ResourceManager (RM): performs cluster level resource management
2. ApplicationMaster (AM): performs job scheduling and monitoring

YARN Architecture

As shown in the Figure, YARN involves more entities than classic MapReduce 1:
- Client, the same as in classic MapReduce, which submits the MapReduce job.
- Resource Manager, which has the ultimate authority that arbitrates resources among all the applications in the cluster; it coordinates the allocation of compute resources on the cluster.
- Node Manager, which is in charge of resource containers, monitoring resource usage (CPU, memory, disk, network) on the node, and reporting to the Resource Manager.
- Application Master, which is in charge of the life cycle of an application, like a MapReduce job. It negotiates with the Resource Manager for cluster resources, which in YARN are called containers. The Application Master and the MapReduce tasks in the containers are scheduled by the Resource Manager, and both of them are managed by the Node Manager. The Application Master is also responsible for keeping track of task progress and status.
- HDFS, the same as in classic MapReduce, for sharing files between the different entities.

The Resource Manager consists of two components:
• Scheduler, and
• ApplicationsManager.

The Scheduler is in charge of allocating resources. The resource Container incorporates elements such as memory, CPU, disk, network, etc. The Scheduler just has the resource allocation function; it is not responsible for job status monitoring. The scheduler is pluggable and can be replaced by other scheduler plug-ins.

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and it provides a restart service when the container fails.
The MapReduce job is just one type of application in YARN. Different applications can run on the same cluster with the YARN framework.

YARN MapReduce

As shown in the above Figure, the MapReduce process with YARN has 11 steps, and we will explain it in the same 6 phases as the MapReduce 1 framework. They are Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates, and Job Completion.

Job Submission

Clients can submit jobs with the same API as MapReduce 1 in YARN. YARN implements its own ClientProtocol, and the submission process is similar to MapReduce 1.
- The client calls the submit() method, which will initiate the JobSubmitter object and call submitJobInternal().
- The Resource Manager will allocate a new application ID and return it to the client.
- The job client checks the output specification of the job.
- The job client computes the input splits.
- The job client copies resources, including the split data, configuration information and the job JAR, into HDFS.
- Finally, the job client notifies the Resource Manager that it is ready by calling submitApplication() on the Resource Manager.

Job Initialization

When the Resource Manager (RM) receives the call submitApplication(), the RM hands off the job to its scheduler. The job initialization is as follows:
- The scheduler allocates a resource container for the job.
- The RM launches the Application Master under the Node Manager's management.
- The Application Master initializes the job. The Application Master is a Java class named MRAppMaster, which initializes the job by creating a number of bookkeeping objects to keep track of the job progress. It will receive the progress and the completion reports from the tasks.
- The Application Master retrieves the input splits from HDFS, and creates a map task object for each split. It will create a number of reduce task objects determined by the mapreduce.job.reduces configuration property.
- The Application Master then decides how to run the job.

For a small job, called an uber job, which is one that has fewer than 10 mappers and only one reducer, or whose input split size is smaller than an HDFS block, the Application Master will run the job on its own JVM sequentially. This policy is different from MapReduce 1, which does not treat small jobs specially and runs every task on a TaskTracker.

For a large job, the Application Master will request new containers, launched under Node Managers, in which to run the tasks. This allows the job to run in parallel and gain more performance.

Application Master calls the job setup method to create the job’s output directory.
That’sdifferentfromMapReduce1,wherethesetuptaskiscalledbyeachtask’sTaskTracker.
TaskAssignment

When the job is very large so that it can’t be run on the same node as the Application
Master.TheApplicationMasterwill make requesttotheResourceManager
tonegotiatemoreresourcecontainerwhichisinpiggybackedonheartbeatcalls. Thetaskassignmentisas
follows:
- The Application Master make request to the Resource Manager in heartbeat call. The
requestincludes the data locality information, like hosts and corresponding racks that the input
splitsresideson.
- TheRecourseManagerhandovertherequesttotheScheduler.TheSchedulermakes
decisionsbasedon theseinformation.Itattemptsto placethetaskasclosethe dataaspossible.Thedata-
localnodes isgreat,ifthisisnotpossible, therack-localthepreferredtonolocalnode.
- The request also specific the memory requirements, which is between the minimum
allocation(1GB by default) and the maximum allocation (10GB). The Scheduler will schedule a
containerwith multiples of 1GB memory to the task, based on the mapreduce.map.memory.mb
andmapreduce.reduce.memory.mbpropertysetby thetask.

This is more flexible than MapReduce 1. In MapReduce 1, the TaskTrackers have a fixed number of slots and each task runs in a slot. Each slot has a fixed memory allowance, which leads to two problems: a small task wastes memory, while a large task that needs more memory runs short of it. In YARN, the memory allocation is more fine-grained, which is part of the beauty of YARN.
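
A minimal sketch of setting per-task memory requests, using the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties mentioned above (the values shown are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {
  public static Configuration createConf() {
    Configuration conf = new Configuration();
    // Request 2 GB containers for map tasks and 4 GB for reduce tasks.
    // The Scheduler rounds these up to a multiple of the minimum allocation.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    return conf;
  }
}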
Task Execution

After the task has been assigned a container by the ResourceManager's scheduler, the Application Master contacts the NodeManager, which launches the task JVM. Task execution proceeds as follows:
- The Java application whose main class is YarnChild localizes the resources that the task needs. YarnChild retrieves the job resources, including the job JAR, the configuration file, and any needed files from HDFS and the distributed cache, onto the local disk.
- YarnChild runs the map or the reduce task.
Each YarnChild runs in a dedicated JVM, which isolates user code from the long-running system daemons such as the NodeManager and the Application Master. Unlike MapReduce 1, YARN does not support JVM reuse, so each task must run in a new JVM.
Streaming and Pipes processes and communication work in the same way as in MapReduce 1.

Progress and Status Updates

When the job is running under YARN, the mapper or reducer reports its status and progress to its Application Master every 3 seconds over the umbilical interface. The Application Master aggregates these status reports into a view of the task status and progress. In MapReduce 1, by contrast, the TaskTracker reports status to the JobTracker, which is responsible for aggregating status into a global view.

Moreover, the NodeManager sends heartbeats to the ResourceManager every few seconds. The NodeManager monitors the Application Master and the resource container usage (CPU, memory and network) and reports to the ResourceManager. When a NodeManager fails and stops sending heartbeats, the ResourceManager removes the node from its pool of available resource nodes.

The client polls the status by calling getStatus() every second to receive progress updates, which are printed on the user console. Users can also check the status from the web UI. The ResourceManager web UI displays all the running applications, with links to the web UIs that display task status and progress in detail.

Job Completion

Every 5 seconds the client checks for job completion over the HTTP ClientProtocol by calling waitForCompletion(). When the job is done, the Application Master and the task containers clean up their working state, and the OutputCommitter's job cleanup method is called. The job information is then archived as history for later interrogation by the user.

Comparing Classic MapReduce with YARN MapReduce

 YARN has Fault Tolerance (continues to work in the event of failure) and Availability
    ResourceManager
       No single point of failure – state saved in ZooKeeper
       Application Masters are restarted automatically on RM restart
    Application Master
       Optional failover via application-specific checkpoint
       MapReduce applications pick up where they left off via state saved in HDFS
 YARN has Network Compatibility
    Protocols are wire-compatible
    Old clients can talk to new servers
    Rolling upgrades
 YARN supports programming paradigms other than MapReduce (multi-tenancy)
    Tez – generic framework to run complex MR
    HBase on YARN
    Machine Learning: Spark
    Graph processing: Giraph
    Real-time processing: Storm
    This is enabled by allowing the use of paradigm-specific Application Masters
    They all run on the same Hadoop cluster
 YARN's biggest advantage is multi-tenancy; being able to run multiple paradigms simultaneously is a big plus.

Job Scheduling in MapReduce

Hadoop ships with a choice of job schedulers. The default is a simple FIFO queue-based scheduler. The Fair Scheduler aims to give every user a fair share of the cluster capacity over time, and the Capacity Scheduler divides the cluster into queues with guaranteed capacities, using FIFO scheduling (optionally with priorities) within each queue.

Types of job schedulers

1. FIFO scheduler (the default)
2. Fair Scheduler
3. Capacity Scheduler
Failures in Classic MapReduce

One of the major benefits of using Hadoop is its ability to handle failures and allow your job to complete.

1. Task Failure
• Consider first the case of a child task failing. The most common way this happens is when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.
• For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property.
• Another failure mode is the sudden exit of the child JVM. In this case, the tasktracker notices that the process has exited and marks the attempt as failed.
• A task attempt may also be killed, which is a different kind of failure. Killed task attempts do not count against the number of attempts to run the task, since it was not the task's fault that an attempt was killed.

2. Tasktracker Failure
• If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes) and remove it from its pool of tasktrackers to schedule tasks on.

• A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. A tasktracker is blacklisted if the number of tasks that have failed on it is significantly higher than the average task failure rate on the cluster. Blacklisted tasktrackers can be restarted to remove them from the jobtracker's blacklist.

3. Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has no mechanism for dealing with failure of the jobtracker (it is a single point of failure), so in this case the job fails.

Failures in YARN

Referring to the YARN architecture figure above, container and task failures are handled by the NodeManager. When a container fails or dies, the NodeManager detects the failure event and launches a new container to replace the failing one, restarting the task execution in the new container. In the event of an Application Master failure, the ResourceManager detects the failure and starts a new instance of the Application Master in a new container. The ability to recover the associated job state depends on the Application Master implementation; the MapReduce Application Master can recover the state, but this is not enabled by default. Besides the ResourceManager, the associated client also reacts to the failure: it contacts the ResourceManager to locate the new Application Master's address.

Upon failure of a NodeManager, the ResourceManager updates its list of available NodeManagers. The Application Master should recover the tasks run on the failing NodeManager, but this again depends on the Application Master implementation. The MapReduce Application Master has the additional capability to recover failed tasks and to blacklist NodeManagers that fail often.

Failure of the ResourceManager is severe, since clients cannot submit new jobs and existing running jobs cannot negotiate and request new containers. Existing NodeManagers and Application Masters try to reconnect to the failed ResourceManager, and the job progress is lost when they are unable to reconnect. This loss of job progress is likely to frustrate the engineers or data scientists using YARN, because typical production jobs that run on top of YARN are expected to have long running times, often on the order of a few hours. Furthermore, this limitation prevents YARN from being used efficiently in cloud environments (such as Amazon EC2), since node failures happen frequently there.

Shuffle and Sort

 MapReduce makes the guarantee that the input to every reducer is sorted by key.
 The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.
 The shuffle is an area of the codebase where refinements and improvements are continually being made.

STEPS

1. The Map Side
2. The Reduce Side
3. Configuration Tuning

I. The Map Side

When the map function starts producing output, it is not simply written to disk. The process is more involved, and takes advantage of buffering writes in memory and doing some presorting for efficiency reasons.

Figure: Shuffle and sort in MapReduce

Each map task has a circular memory buffer that it writes output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size, a background thread starts to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer. Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10. If there are at least three spill files, the combiner is run again before the output file is written. Combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.

Compressing the map output as it is written to disk makes it faster to write, saves disk space, and reduces the amount of data to transfer to the reducer. By default, the output is not compressed, but compression is easy to enable by setting mapred.compress.map.output to true. The output file's partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property. The default of 40 may need increasing for large clusters running large jobs.
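
A minimal sketch of enabling map output compression from the driver. It uses the mapred.compress.map.output property named above; the codec property name (mapred.map.output.compression.codec) is an assumption based on the older property naming scheme used in this text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class MapOutputCompression {
  public static Configuration createConf() {
    Configuration conf = new Configuration();
    // Compress intermediate map output before it is written to local disk.
    conf.setBoolean("mapred.compress.map.output", true);
    // Optionally pick the codec (DefaultCodec = zlib; Snappy or LZO trade ratio for speed).
    conf.setClass("mapred.map.output.compression.codec",
        DefaultCodec.class, CompressionCodec.class);
    return conf;
  }
}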

II. The Reduce Side

The map output file sits on the local disk of the machine that ran the map task. The reduce task needs the map output for its particular partition from several map tasks across the cluster.

Copy phase of reduce: The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property. The map outputs are copied to the reduce task JVM's memory if they are small enough; otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size, or reaches a threshold number of map outputs, it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk. As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on.

Any map outputs that were compressed have to be decompressed in memory in order to perform a merge on them. When all the map outputs have been copied, the reduce task moves into the sort phase, which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10, there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files. Rather than merging these five files into a single sorted file, the merge saves a trip to disk by feeding them directly to the reduce function. This final merge can come from a mixture of in-memory and on-disk segments.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS. In the case of HDFS, since the tasktracker node is also running a datanode, the first block replica will be written to the local disk.

III. Configuration Tuning

To improve MapReduce performance, it helps to understand how to tune the shuffle. The general principle is to give the shuffle as much memory as possible. There is a trade-off, in that you need to make sure your map and reduce functions get enough memory to operate, so write your map and reduce functions to use as little memory as possible; they should not use an unbounded amount of memory. The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property. Make this as large as possible for the amount of memory on your task nodes.

On the map side, the best performance can be obtained by avoiding multiple spills to disk; one is optimal. If you can estimate the size of your map outputs, you can set the io.sort.* properties appropriately to minimize the number of spills. There is a MapReduce counter that counts the total number of records spilled to disk over the course of a job, which can be useful for tuning. The counter includes both map- and reduce-side spills.

On the reduce side, the best performance is obtained when the intermediate data can reside entirely in memory. By default, this does not happen, since in the general case all the memory is reserved for the reduce function. But if your reduce function has light memory requirements, setting mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0 may bring a performance boost. Hadoop uses a buffer size of 4 KB by default, which is low, so you should increase this across the cluster.
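
The following sketch simply collects the tuning properties named above in one place. The values shown are illustrative placeholders, not recommendations; appropriate values depend on the cluster and the job.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration createConf() {
    Configuration conf = new Configuration();
    // Map side: larger sort buffer and merge factor to avoid multiple spills.
    conf.setInt("io.sort.mb", 200);          // in-memory sort buffer (MB)
    conf.setInt("io.sort.factor", 20);       // streams merged at once
    // JVM heap for map and reduce tasks.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // Reduce side: keep intermediate data in memory when the reducer is light.
    conf.setInt("mapred.inmem.merge.threshold", 0);
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);
    return conf;
  }
}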

Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.
1) Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
 FileInputFormat: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
 FileInputFormat input paths: The input to a job is specified as a collection of paths, which offers great flexibility in constraining the input to a job. FileInputFormat offers four static convenience methods for setting a Job's input paths (see the usage sketch below):

public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
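
A short usage sketch of these methods from a driver; the paths shown are hypothetical placeholders, not paths from this text.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
  public static void configure(Job job) {
    // Add a single file and a whole directory as input.
    FileInputFormat.addInputPath(job, new Path("/data/logs/2023-01-01.log"));
    FileInputFormat.addInputPath(job, new Path("/data/archive/"));
    // Or replace the input list in one call with a comma-separated string.
    FileInputFormat.setInputPaths(job, "/data/logs,/data/archive");
  }
}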

Fig: InputFormat class hierarchy

Table: Input path and filter properties

 FileInputFormat input splits: FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block.

Table: Properties for controlling split size

 Preventing splitting: There are a couple of ways to ensure that an existing file is not split. The first way is to increase the minimum split size to be larger than the largest file in your system. The second is to subclass the concrete subclass of FileInputFormat that you want to use, and override the isSplitable() method to return false.
 File information in the mapper: A mapper processing a file input split can find information about the split by calling the getInputSplit() method on the Mapper's Context object.

Table: File split properties

 Processing a whole file as a record: A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. The listing for WholeFileInputFormat shows a way of doing this.

Example: An InputFormat for reading a whole file as a record

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

WholeFileRecordReader is responsible for taking a FileSplit and converting it into a single record, with a null key and a value containing the bytes of the file.
2) Text Input
 TextInputFormat: TextInputFormat is the default InputFormat. Each record is a line of input. A file containing the following text:

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

is divided into one split of four records. The records are interpreted as the following key-value pairs:

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

Fig: Logical records and HDFS blocks for TextInputFormat

 KeyValueTextInputFormat: You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property. It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
 NLineInputFormat: N refers to the number of lines of input that each mapper receives. With N set to one, each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property controls the value of N.
For example, if N is two, then each split contains two lines. One mapper will receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
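
A short driver-side sketch of configuring NLineInputFormat for the two-line example above; the setNumLinesPerSplit() helper is part of the new-API NLineInputFormat class.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetup {
  public static void configure(Job job) {
    job.setInputFormatClass(NLineInputFormat.class);
    // Each mapper receives two lines of input (N = 2).
    NLineInputFormat.setNumLinesPerSplit(job, 2);
  }
}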

3) Binary Input: Hadoop MapReduce is not just restricted to processing textual data; it has support for binary formats, too.
 SequenceFileInputFormat: Hadoop's sequence file format stores sequences of binary key-value pairs.
 SequenceFileAsTextInputFormat: SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.
 SequenceFileAsBinaryInputFormat: SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects.
4) Multiple Inputs: Although the input to a MapReduce job may consist of multiple input files, all of the input is interpreted by a single InputFormat and a single Mapper. The MultipleInputs class lets different input paths use different InputFormats and Mappers (a usage sketch follows), and it has an overloaded version of addInputPath() that doesn't take a mapper:

public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)

Output Formats

Figure: OutputFormat class hierarchy
1) Text Output: The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them.
2) Binary Output
 SequenceFileOutputFormat: As the name indicates, SequenceFileOutputFormat writes sequence files for its output. Compression is controlled via the static methods on SequenceFileOutputFormat.
 SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format into a SequenceFile container.
 MapFileOutputFormat: MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.
3) Multiple Outputs: FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, etc. MapReduce comes with the MultipleOutputs class to help you change this.
Zero reducers: There are no partitions, as the application needs to run only map tasks.
One reducer: It can be convenient to run small jobs to combine the output of previous jobs into a single file. This should only be attempted when the amount of data is small enough to be processed comfortably by one reducer.
 MultipleOutputs: MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. MultipleOutputs delegates to the mapper's OutputFormat, which in this example is a TextOutputFormat, but more complex setups are possible.
 Lazy Output: FileOutputFormat subclasses will create output (part-r-nnnnn) files even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the JobConf and the underlying output format (see the sketch after this list).
 Database Output: The output formats for writing to relational databases and to HBase.
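
A minimal sketch of the lazy-output setup described above, using the new-API LazyOutputFormat wrapper:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputSetup {
  public static void configure(Job job) {
    // Wrap TextOutputFormat so empty part-r-nnnnn files are not created;
    // a file appears only when the first record for that partition is emitted.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}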

UNIT V HADOOP RELATED TOOLS

Hbase – data model and implementations – Hbase clients – Hbase examples – praxis. Cassandra – cassandra data model – cassandra examples – cassandra clients – Hadoop integration. Pig – Grunt – pig data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries.

HBase
• HBASE stands for Hadoop dataBase
• HBase is a distributed column-oriented database built on top of HDFS.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
• Horizontally scalable
   – Automatic sharding
• Supports strongly consistent reads and writes
• Supports automatic fail-over
• It has a simple Java API
• Integrated with the Map/Reduce framework
• Supports Thrift, Avro and REST Web services

• When To Use HBase
   – Good for large amounts of data (100s of millions or billions of rows)
   – If the data is too small, all the records will end up on a single node, leaving the rest of the cluster idle

• When NOT to Use HBase
   – Bad for traditional RDBMS retrieval
   – Transactional applications
   – Relational analytics: 'group by', 'join', and 'where column like', etc.
   – Currently bad for text-based search access

HBase is a column-oriented database that's an open-source implementation of Google's BigTable storage architecture. It can manage structured and semi-structured data and has some built-in features such as scalability, versioning, compression and garbage collection. Since it uses write-ahead logging and distributed configuration, it can provide fault-tolerance and quick recovery from individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using Hadoop's MapReduce capabilities.

Let's now take a look at how HBase (a column-oriented database) is different from some other data structures and concepts that we are familiar with: row-oriented vs. column-oriented data stores. As shown below, in a row-oriented data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is stored together and hence quickly retrieved.

Row-oriented data stores –

 Data is stored and retrieved one row at a time, and hence unnecessary data could be read if only some of the data in a row is required.
 Easy to read and write records
 Well suited for OLTP systems
 Not efficient in performing operations applicable to the entire dataset, and hence aggregation is an expensive operation
 Typical compression mechanisms provide less effective results than those on column-oriented data stores

Column-oriented data stores –

 Data is stored and retrieved in columns, and hence only the relevant data can be read if only some data is required
 Read and write are typically slower operations
 Well suited for OLAP systems
 Can efficiently perform operations applicable to the entire dataset, and hence enables aggregation over many rows and columns
 Permits high compression rates due to few distinct values in columns

Relational Databases vs. HBase

When talking of data stores, we first think of relational databases with structured data storage and a sophisticated query engine. However, a relational database incurs a big penalty to improve performance as the data size increases. HBase, on the other hand, is designed from the ground up to provide scalability and partitioning to enable efficient data structure serialization, storage and retrieval. Broadly, the differences between a relational database and HBase are:

HDFS vs. HBase

HDFS is a distributed file system that is well suited for storing large files. It's designed to support batch processing of data but doesn't provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access to single rows of data in large tables. Overall, the differences between HDFS and HBase are:

HBase Architecture

The HBase physical architecture consists of servers in a master-slave relationship, as shown below. Typically, the HBase cluster has one Master node, called HMaster, and multiple Region Servers called HRegionServer. Each Region Server contains multiple Regions (HRegions).

Just like in a relational database, data in HBase is stored in Tables, and these Tables are stored in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.

The HMaster in HBase is responsible for

 Performing administration
 Managing and monitoring the cluster
 Assigning Regions to the Region Servers
 Controlling load balancing and failover

On the other hand, the HRegionServer performs the following work:

 Hosting and managing Regions
 Splitting the Regions automatically
 Handling the read/write requests
 Communicating with the clients directly

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Servers is kept in a system table called .META. When trying to read or write data from HBase, the clients read the required Region information from the .META table and directly communicate with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).

HBase Detailed Architecture

You can see that HBase handles basically two kinds of file types. One is used for the write-ahead log and the other for the actual data storage. The files are primarily handled by the HRegionServers. But in certain scenarios even the HMaster will have to perform low-level file operations. You may also notice that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed Filesystem (HDFS). This is also one of the areas where you can configure the system to handle larger or smaller data better. More on that later.

ZooKeeper is a high-performance coordination server for distributed applications. It exposes common services, such as naming and configuration management, synchronization, and group services, in a simple interface.

The general flow is that a new client contacts the ZooKeeper quorum first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from ZooKeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for.

Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.

Next, when the HRegionServer opens the region, it creates a corresponding HRegion object. When the HRegion is "opened" it sets up a Store instance for each HColumnFamily of every table, as defined by the user beforehand. Each of the Store instances can in turn have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile. An HRegion also has a MemStore and an HLog instance. We will now have a look at how they work together, but also where there are exceptions to the rule.
HBase Data Model
The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.

Tables – The HBase Tables are more like logical collections of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of a Table.

Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].

Column Families – Data in a row is grouped together as Column Families. Each Column Family has one or more Columns, and these Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence it's important that proper care be taken when designing Column Families in a table. The table above shows Customer and Sales Column Families. The Customer Column Family is made up of 2 columns – Name and City, whereas the Sales Column Family is made up of 2 columns – Product and Amount.

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon – example: columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varied number of Columns.

Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and Column (Column Qualifier). The data stored in a Cell is called its value, and the data type is always treated as byte[].

Version – The data stored in a cell is versioned, and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable, and this value by default is 3.

In JSON format, the model is represented as

(Table, RowKey, Family, Column, Timestamp) → Value
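
To make the model concrete, here is a minimal sketch of writing and reading one cell through the classic (pre-1.0) HBase Java client API, which is the API generation this text describes. The table name "customers" is hypothetical; the Customer/Name column family and column mirror the example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "customers");   // hypothetical table name
    // Write: (Table=customers, RowKey=row1, Family=Customer, Column=Name) -> "Alice"
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
    table.put(put);
    // Read the same cell back.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] name = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
    System.out.println(Bytes.toString(name));
    table.close();
  }
}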

HBase Clients

1. REST
HBase ships with a powerful REST server, which supports the complete client and administrative API. It also provides support for different message formats, offering many choices for a client application to communicate with the server.

REST Java client
The REST server also comes with a comprehensive Java client API. It is located in the org.apache.hadoop.hbase.rest.client package.

2. Thrift
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more. Once you have compiled a schema, you can exchange messages transparently between systems implemented in one or more of those languages.

3. Avro
Apache Avro, like Thrift, provides schema compilers for many programming languages, including Java, C++, PHP, Python, Ruby, and more. Once you have compiled a schema, you can exchange messages transparently between systems implemented in one or more of those languages.

4. Other Clients
• JRuby
The HBase Shell is an example of using a JVM-based language to access the Java-based API. It comes with the full source code, so you can use it to add the same features to your own JRuby code.
• HBql
HBql adds an SQL-like syntax on top of HBase, while adding the extensions needed where HBase has unique features.
• HBase-DSL
This project gives you dedicated classes that help when formulating queries against an HBase cluster. Using a builder-like style, you can quickly assemble all the options and parameters necessary.
• JPA/JPO
You can use, for example, DataNucleus to put a JPA/JPO access layer on top of HBase.
• PyHBase
The PyHBase project (https://github.com/hammer/pyhbase/) offers an HBase client through the Avro gateway server.
• AsyncHBase
AsyncHBase offers a completely asynchronous, nonblocking, and thread-safe client to access HBase clusters. It uses the native RPC protocol to talk directly to the various servers.

Cassandra

The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods.

Main features

• Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
• Supports replication and multi-data-center replication
Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, for redundancy, for failover and disaster recovery.
• Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
• Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
• Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
• MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache Hive.
• Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python, Node.js and Go.

Why we use Cassandra

1. Quick writes
2. Failsafe
3. Quick reporting
4. Batch processing too, with MapReduce
5. Ease of maintenance
6. Ease of configuration
7. Tunably consistent
8. Highly available
9. Fault tolerant
10. The peer-to-peer design allows for high performance with linear scalability and no single points of failure
11. Decentralized databases
12. Supports 12 different client languages
13. Automatic provisioning of new nodes

Cassandra Data Model
Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is represented by a row object, in which the value is generally organized in columns. In short, Cassandra uses the following terms:

1. Keyspace can be seen as a DB schema in SQL.
2. Column family resembles a table in the SQL world (read below why this analogy is misleading).
3. Row has a key and, as its value, a set of Cassandra columns, but without the relational schema corset.
4. Column is a triplet := (name, value, timestamp).
5. Super column is a tuple := (name, collection of columns).
6. Data types: validators & comparators
7. Indexes

The Cassandra data model is illustrated in the following figure.

KeySpaces

KeySpaces are the largest container, with an ordered list of ColumnFamilies, similar to a database in an RDBMS.

Column

A Column is the most basic element in Cassandra: a simple tuple that contains a name, value and timestamp. All values are set by the client. That's an important consideration for the timestamp, as it means you'll need clock synchronization.

SuperColumn

A SuperColumn is a column that stores an associative array of columns. You could think of it as similar to a HashMap in Java, with an identifying column (name) that stores a list of columns inside (value). The key difference between a Column and a SuperColumn is that the value of a Column is a string, whereas the value of a SuperColumn is a map of Columns. Note that SuperColumns have no timestamp, just a name and a value.

ColumnFamily

A ColumnFamily holds a number of Rows, a sorted map that matches column names to column values. A row is a set of columns, similar to the table concept from relational databases. The column family holds an ordered list of columns which you can reference by column name.

The ColumnFamily can be of two types, Standard or Super.

Standard ColumnFamilys contain a map of normal columns.

Example

SuperColumnFamilys contain rows of SuperColumns.

Example

Data Types
And of course there are predefined data types in Cassandra, in which

 The data type of a row key is called a validator.
 The data type for a column name is called a comparator.

You can assign predefined data types when you create your column family (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as hex byte arrays (BytesType). This is the default client encoding.

Indexes
An understanding of indexes in Cassandra is requisite. There are two kinds of them.

 The primary index for a column family is the index of its row keys. Each node maintains this index for the data it manages.
 The secondary indexes in Cassandra refer to indexes on column values. Cassandra implements secondary indexes as a hidden column family.

The primary index determines cluster-wide row distribution. Secondary indexes are very important for custom queries.

Differences Between RDBMS and Cassandra

1. No Query Language: SQL is the standard query language used in relational databases. Cassandra has no query language. It does have an API that you access through its RPC serialization mechanism, Thrift.
2. No Referential Integrity: Cassandra has no concept of referential integrity, and therefore has no concept of joins.
3. Secondary Indexes: A second column family acts as an explicit secondary index in Cassandra.
4. Sorting: In an RDBMS, you can easily change the order of records by using ORDER BY or GROUP BY in your query. There is no support for ORDER BY and GROUP BY statements in Cassandra. In Cassandra, however, sorting is treated differently; it is a design decision. Column family definitions include a CompareWith element, which dictates the order in which your rows will be sorted.
5. Denormalization: In the relational world, denormalization violates Codd's normal forms, and we try to avoid it. But in Cassandra, denormalization is, well, perfectly normal. It's not required if your data model is simple.
6. Design Patterns: Cassandra design patterns offer a Materialized View, Valueless Column, and Aggregate Key.

Cassandra Clients

1. Thrift

Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages. Thrift was developed at Facebook and donated as an Apache project.

Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java, Objective-C/Cocoa, OCaml, Perl, PHP, Python, Ruby, Smalltalk, and Squeak. Its goal is to provide an easy way to support efficient RPC calls in a wide variety of popular languages, without requiring the overhead of something like SOAP.

The design of Thrift offers the following features:
 Language-independent types
 Common transport interface
 Protocol independence
 Versioning support

2. Avro
The Apache Avro project is a data serialization and RPC system targeted as the replacement for Thrift in Cassandra.
Avro provides many features similar to those of Thrift and other data serialization and RPC mechanisms, including:
• Robust data structures
• An efficient, small binary format for RPC calls
• Easy integration with dynamically typed languages such as Python, Ruby, Smalltalk, Perl, PHP, and Objective-C

Avro is the RPC and data serialization mechanism for Cassandra. It generates code that remote clients can use to interact with the database. It's well supported in the community and has the strength of growing out of the larger and very well-known Hadoop project. It should serve Cassandra well for the foreseeable future.

3. Hector
Hector is an open source project written in Java using the MIT license. It was created by Ran Tavory of Outbrain (previously of Google) and is hosted at GitHub. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.

Hector is a well-supported and full-featured Cassandra client, with many users and an active community. It offers the following:
 High-level object-oriented API
 Failover support
 Connection pooling
 JMX (Java Management eXtensions) support

4. Chirper
Chirper is a port of Twissandra to .NET, written by Chaker Nakhli. It's available under the Apache 2.0 license, and the source code is on GitHub.

5. Chiton
Chiton is a Cassandra browser written by Brandon Williams that uses the Python GTK framework.

6. Pelops
Pelops is a free, open source Java client written by Dominic Williams. It is similar to Hector in that it's Java-based, but it was started more recently. This has become a very popular client. Its goals include the following:
 To create a simple, easy-to-use client
 To completely separate concerns for data processing from lower-level items such as connection pooling
 To act as a close follower to Cassandra so that it's readily up to date

7. Kundera
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.

8. Fauna
Ryan King of Twitter and Evan Weaver created a Ruby client for the Cassandra database called Fauna.

Pig
 Pig is a simple-to-understand data flow language used in the analysis of large data sets. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster even if you aren't familiar with MapReduce.
 Used to
o Process web logs
o Build user behavior models
o Process images
o Data mining

Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.

The Pig execution environment has two modes:
• Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
• Hadoop: Also called MapReduce mode; all scripts are run on a given Hadoop cluster.

Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:

1. Pig Latin script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
2. Grunt shell: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf.
3. Embedded: Pig programs can be executed as part of a Java program (see the sketch below).
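
A minimal sketch of the embedded approach, assuming the PigServer class from the org.apache.pig package and the same passwd-style input used later in this chapter; the output path is a hypothetical placeholder.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode; use ExecType.MAPREDUCE for a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("A = LOAD 'passwd' USING PigStorage(':');");
    pig.registerQuery("B = FOREACH A GENERATE $0 AS id;");
    // Materialize relation B into an output directory.
    pig.store("B", "id_out");
  }
}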

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
 It is a large-scale data processing system
 Scripts are written in Pig Latin, a data flow language
 Developed by Yahoo, and open source
 Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce.

Differences between Pig and MapReduce

Pig is a data flow language; the key focus of Pig is to manage the flow of data from input source to output store.

Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce jobs or data movement jobs. Pig allows custom functions to be added, which can be used for processing in Pig; some default ones are ordering, grouping, distinct, count, etc.

MapReduce, on the other hand, is a programming model, or framework, for processing large datasets in a distributed manner using a large number of computers, i.e. nodes.

Pig commands are submitted as MapReduce jobs internally. An advantage Pig has over MapReduce is that the former is more concise: a 200-line Java program written for MapReduce can be reduced to 10 lines of Pig code.

A disadvantage Pig has: it is a bit slower compared to MapReduce, as Pig commands are translated into MapReduce prior to execution.

Pig Latin

Pig Latin has a very rich syntax. It supports operators for the following operations:
 Loading and storing of data
 Streaming data
 Filtering data
 Grouping and joining data
 Sorting data
 Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and filesystem commands.

DUMP
Dump directs the output of your script to your screen.
Syntax:
dump alias;   (for example, dump B;)

LOAD: Loads data from the filesystem.
Syntax
LOAD 'data' [USING function] [AS schema];

'data' is the name of the file or directory, in single quotes. USING and AS are keywords. If the USING clause is omitted, the default load function PigStorage is used. Schema: a schema using the AS keyword, enclosed in parentheses.

Usage
Use the LOAD operator to load data from the filesystem.

Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.

1	2	3
4	2	1
8	3	4

In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements shown are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.

A = LOAD 'myfile.txt' USING PigStorage('\t');
-- equivalent: A = LOAD 'myfile.txt';
DUMP A;

Output:

(1,2,3)
(4,2,1)
(8,3,4)

Sample Code

The examples are based on these Pig commands, which extract all user IDs from the /etc/passwd file.

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.txt';

STORE: Stores or saves results to the filesystem.
Syntax
STORE alias INTO 'directory' [USING function];

Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.

A = LOAD 'data';
DUMP A;

Output:

(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;

Output:

1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3

STREAM: Sends data to an external script or program.

GROUP: Groups the data in one or more relations.

JOIN (inner): Performs an inner join of two or more relations based on common field values.

JOIN (outer): Performs an outer join of two relations based on common field values.
Example
Suppose we have relations A and B.

A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int, b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

In this example relations A and B are joined by their first fields.

X = JOIN A BY a1, B BY b1;

DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

Grunt

Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS.

In other words, it is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf.

To enter Grunt, invoke Pig with no script or command to run. Typing:

$ pig -x local

will result in the prompt:

grunt>

This gives you a Grunt shell to interact with your local filesystem. To exit Grunt you can type quit or enter Ctrl-D.

Example of entering Pig Latin scripts in Grunt

To run Pig's Grunt shell in local mode, follow these instructions. From your current working directory, run:

$ pig -x local

The Grunt shell is invoked and you can enter commands at the prompt.

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

Pig will not start executing the Pig Latin you enter until it sees either a store or a dump.

Pig's Data Model

This includes Pig's data types, how it handles concepts such as missing data, and how you can describe your data to Pig.

Types

Pig's data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.

1. Scalar Types

Pig's scalar types are simple types that appear in most programming languages.

 int
An integer. ints store a four-byte signed integer.
 long
A long integer. longs store an eight-byte signed integer.
 float
A floating-point number. floats use four bytes to store their value.
 double
A double-precision floating-point number. doubles use eight bytes to store their value.
 chararray
A string or character array, expressed as a string literal with single quotes.
 bytearray
A blob or array of bytes.

2. Complex Types
Pig has several complex data types such as maps, tuples, and bags. All of these types can contain data of any type, including other complex types. So it is possible to have a map where the value field is a bag, which contains a tuple where one of the fields is a map.

 Map
A map in Pig is a chararray-to-data-element mapping, where that element can be any Pig type, including a complex type. The chararray is called a key and is used as an index to find the element, referred to as the value.

 Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type; they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns.

 Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the schema describes all tuples within the bag.

 Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown.

 Casts
A cast converts content of one type to another type.

Hive

Hive was originally an internal Facebook project which eventually matured into a full-blown Apache project. It was created to simplify access to MapReduce (MR) by exposing a SQL-based language for data manipulation. Hive also maintains metadata in a metastore, which is stored in a relational database; this metadata contains information about what tables exist, their columns, privileges, and more. Hive is an open source data warehousing solution built on top of Hadoop, and its particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading.

Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS as well as easily compatible file systems like Amazon S3 (Simple Storage Service). Amazon S3 is a scalable, high-speed, low-cost, Web-based service designed for online backup and archiving of data as well as application programs. Hive provides a SQL-like language called HiveQL while maintaining full support for map/reduce, and to accelerate queries, it provides indexes, including bitmap indexes. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Advantages of Hive

 Perfectly fits the low-level interface requirement of Hadoop
 Hive supports external tables and ODBC/JDBC
 Has an intelligent optimizer
 Hive supports table-level partitioning to speed up query times
 The metadata store is a big plus in the architecture, as it makes lookup easy

Data Units

Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming collisions.
Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views table, where each row could comprise the following columns (schema):
timestamp - of INT type, corresponding to a UNIX timestamp of when the page was viewed.
userid - of BIGINT type, identifying the user who viewed the page.
page_url - of STRING type, capturing the location of the page.
referer_url - of STRING type, capturing the location of the page from where the user arrived at the current page.
IP - of STRING type, capturing the IP address from where the page request was made.
Partitions: Each Table can have one or more partition keys which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy a certain criterion. For example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
Partition columns are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. These can be used to efficiently sample the data.

Hive Data Types

Hive supports two categories of data types:
1. Primitive data types
2. Collection data types

1. Primitive Data Types

 TINYINT, SMALLINT, INT, BIGINT are four integer data types, with the only difference being their size.

 FLOAT and DOUBLE are two floating-point data types. BOOLEAN is used to store true or false.

 STRING is used to store character strings. Note that, in Hive, we do not specify a length for STRING as in other databases. It is more flexible and variable in length.

 TIMESTAMP can be an integer, which is interpreted as seconds since the UNIX epoch; a float, where the digits after the decimal point are nanoseconds; or a string, which is interpreted according to the JDBC date string format, i.e. YYYY-MM-DD hh:mm:ss.fffffffff. The time component is interpreted as UTC time.

 BINARY is used to place raw bytes which will not be interpreted by Hive. It is suitable for binary data.

2. Collection Data Types

1. STRUCT
2. MAP
3. ARRAY

Hive File Formats

Hive supports all the Hadoop file formats, plus Thrift encoding, as well as supporting pluggable SerDe (serializer/deserializer) classes for custom formats.

There are several file formats supported by Hive:

TEXTFILE is the easiest to use, but the least space-efficient.

SEQUENCEFILE format is more space-efficient.

MAPFILE adds an index to a SEQUENCEFILE for faster retrieval of particular records.

Hive defaults to the following record and field delimiters, all of which are non-printable control characters and all of which can be customized.

Let us take an example to understand this, assuming an employee table in Hive with the structure below.

Note that \001 is the octal code for ^A, \002 is ^B and \003 is ^C. For further explanation, we will use text instead of octal codes. Let's assume one record as shown in the table below.

HiveQL

HiveQL is the Hive query language.

Hadoop is an open source framework for the distributed processing of large amounts of data across a cluster. It relies upon the MapReduce paradigm to reduce complex tasks into smaller parallel tasks that can be executed concurrently across multiple machines. However, writing MapReduce tasks on top of Hadoop for processing data is not for everyone, since it requires learning a new framework and a new programming paradigm altogether. What is needed is an easy-to-use abstraction on top of Hadoop that allows people not familiar with it to use its capabilities as easily.

Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on top of Hadoop. Hive achieves this task by converting queries written in HiveQL into MapReduce tasks that are then run across the Hadoop cluster to fetch the desired results.

Hive is best suited for batch processing large amounts of data (such as in data warehousing) but is not ideally suited as a routine transactional database because of its slow response times (it needs to fetch data from across a cluster).

A common task for which Hive is used is the processing of logs of web servers. These logs have a regular structure and hence can be readily converted into a format that Hive can understand and process.

Hive query language (HiveQL) supports SQL features like CREATE tables, DROP tables, SELECT ... FROM ... WHERE clauses, joins (inner, left outer, right outer and outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, union, and many useful functions on primitive as well as complex data types. Metadata browsing features such as listing databases, tables and so on are also provided. HiveQL does have limitations compared with traditional RDBMS SQL. HiveQL allows creation of new tables in accordance with partitions (each table can have one or more partitions in Hive) as well as buckets (the data in partitions is further distributed as buckets), and allows insertion of data into single or multiple tables, but does not allow deletion or updating of data.

HiveQL: Data Definition

First open the Hive console by typing:

$ hive

Once the Hive console is opened, at the prompt

hive>

you can run queries to create databases and tables.

1. Create and Show Databases

Databases are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It's also common to use databases to organize production tables into logical groups. If you don't specify a database, the default database is used.

hive> CREATE DATABASE IF NOT EXISTS financials;

At any time, you can see the databases that already exist as follows:

hive> SHOW DATABASES;

Output:

default
financials

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;

Output:

default
financials
human_resources

2. DESCRIBE Database
- shows the directory location for the database.

hive> DESCRIBE DATABASE financials;

output is

hdfs://master-server/user/hive/warehouse/financials.db

3. USE Database

The USE command sets a database as your working database, analogous to changing working directories in a filesystem.

hive> USE financials;

4. DROP Database

You can drop a database:

hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials doesn't exist.
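By default Hive will not drop a database that still contains tables. Adding the CASCADE keyword drops the tables first and then the database:

hive> DROP DATABASE IF EXISTS financials CASCADE;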

5. Alter Database

You can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE command. No other metadata about the database can be changed, including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'activesteps');
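The properties set this way can be viewed with the EXTENDED form of DESCRIBE:

hive> DESCRIBE DATABASE EXTENDED financials;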

6. Create Tables

The CREATE TABLE statement follows SQL conventions, but Hive's version offers significant extensions to support a wide range of flexibility in where the data files for tables are stored, the formats used, etc.

 Managed Tables

 The tables we have created so far are called managed tables, or sometimes internal tables, because Hive controls the lifecycle of their data. As we have seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.

 When we drop a managed table, Hive deletes the data in the table.

 Managed tables are less convenient for sharing with other tools.
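For contrast with the external table shown next, a minimal managed table could be created like this (the table name is illustrative):

CREATE TABLE IF NOT EXISTS mydb.test_managed (x INT, y STRING);

Dropping it with DROP TABLE would delete both its metadata and its data under the warehouse directory.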

 External Tables

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';

The EXTERNAL keyword tells Hive this table is external, and the LOCATION … clause is required to tell Hive where it is located. Because it is external, Hive does not assume it owns the data, so dropping the table deletes only the table's metadata, not the data itself.

Partitioned, Managed Tables

Partitioned tables help to organize data in a logical fashion, such as hierarchically.

Example: Our HR people often run queries with WHERE clauses that restrict the results to a particular country or to a particular first-level subdivision (e.g., state in the United States or province in Canada).

We have to use address.state to project the value inside the address. So, let's partition the data first by country and then by state:

CREATE TABLE IF NOT EXISTS mydb.employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n';

Partitioning tables changes how Hive structures the data storage. If we create this table in the mydb database, there will still be an employees directory for the table:

hdfs://master_server/user/hive/warehouse/mydb.db/employees

Data is then loaded into a specific partition:

LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

Once created, the partition keys (country and state, in this case) behave like regular columns.

hive> SHOW PARTITIONS employees;

output is

OK
country=US/state=IL
Time taken: 0.145 seconds
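If a table has many partitions, the listing can be narrowed to one partition-key value:

hive> SHOW PARTITIONS employees PARTITION(country='US');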

7. Dropping Tables

The familiar DROP TABLE command from SQL is supported:

DROP TABLE IF EXISTS employees;

HiveQL: Data Manipulation

1. Loading Data into Managed Tables

Create the stocks table:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';

Load the stocks data:

LOAD DATA LOCAL INPATH '/path/to/stocks.txt'
INTO TABLE stocks
PARTITION (exchange = 'NASDAQ', symbol = 'AAPL');

This command will first create the directory for the partition, if it doesn't already exist, then copy the data to it.

2. Inserting Data into Tables from Queries

INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') ...

With OVERWRITE, any previous contents of the partition are replaced. If you drop the keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replacing it.
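A complete statement needs a SELECT clause to supply the rows; the sketch below assumes a staging table named staged_employees with columns cnty and st holding the country and state values (these names are assumptions, not from the notes):

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';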

HiveQL Queries

1. SELECT … FROM Clauses

SELECT is the projection operator in SQL. The FROM clause identifies the table, view, or nested query from which we select records.

Create the employees table:

CREATE EXTERNAL TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load the data:

LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

The data in employees.txt is assumed to be as follows.
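The sample file itself is not reproduced here; a representative record, with ^A, ^B and ^C standing in for the \001, \002 and \003 delimiters, might look like this (the names and figures are purely illustrative):

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600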

Select data:

hive> SELECT name, salary FROM employees;

output is
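For the illustrative record above, the output would resemble:

John Doe    100000.0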

When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation) syntax for the output. First, let's select the subordinates, an ARRAY, where a comma-separated list surrounded with [...] is used.

hive> SELECT name, subordinates FROM employees;

output is
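Again for the illustrative record, roughly:

John Doe    ["Mary Smith","Todd Jones"]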

The deductions column is a MAP, where the JSON representation for maps is used, namely a comma-separated list of key:value pairs surrounded with {...}:

hive> SELECT name, deductions FROM employees;

output is
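Roughly, for the same illustrative record:

John Doe    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}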

Finally, the address is a STRUCT, which is also written using the JSON map format:

hive> SELECT name, address FROM employees;

output is
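And, for the illustrative record:

John Doe    {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}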

