
Ahmedabad Institute of Technology

CE & IT Department

Hand Book
BIG DATA ANALYTICS (2180710)
Year: 2020-21

Prepared By: Prof. Neha Prajapati, Asst. Prof. CE Dept, AIT


Index
1.0 Introduction to Big Data

2.0 Introduction to Hadoop and Hadoop Architecture

3.0 HDFS, HIVE and HIVEQL, HBASE

4.0 Spark

5.0 NoSQL

6.0 Data Base for the Modern Web



1.0 Introduction to Big Data
Introduction of Big Data

 Big Data is a collection of large datasets that cannot be adequately processed
using traditional processing techniques. Big data is not only data; it has become a
complete subject, which involves various tools, techniques and frameworks.
 The term big data describes the large volume of data, both structured and
unstructured, that a business handles on a day-to-day basis. What matters is how
organizations use the data.
 Analyzing big data provides in-depth insight for better decisions and strategic
moves for the development of the organization.
 Big Data includes huge volume, high velocity, and an extensible variety of data.
 The data in it will be of three types:
 Structured data − Relational data.
 Semi-structured data − XML data.
 Unstructured data − Word, PDF, text, media logs.

Benefits of Big Data


 Using the information kept in social networks like Facebook, marketing
agencies learn about the response to their campaigns, promotions, and
other advertising media.

 Using the information in social media, such as the preferences and product
perception of their consumers, product companies and retail organizations
plan their production.

 Using data regarding the previous medical history of patients, hospitals
provide better and quicker service.



Categories Of 'Big Data'
Big data comes in three forms:
1) Structured 2) Unstructured 3) Semi-structured

1) Structured
 Any data that can be stored, accessed and processed in a fixed format
is termed 'structured' data.

 Over time, talent in computer science has achieved great success
in developing techniques for working with such data (where the format
is well known in advance) and in deriving value out of it.

 The size of such data can grow to a huge extent; typical sizes are in the range
of multiple zettabytes.

Employee_ID Employee_Name Gender Department Salary_In_lacs


1 Xyz M Finance 750000
2 ABC F Admin 150000
3 MNP M Admin 550000
4 NXP M Finance 600000

2) Unstructured
 Any data with unknown form or structure is classified as unstructured data.
 In addition to its huge size, unstructured data poses multiple challenges
in terms of processing it to derive value out of it.
 A typical example of unstructured data is a heterogeneous data source containing
a combination of simple text files, images, videos, etc.
 Nowadays organizations have a wealth of data available with them, but
unfortunately they don't know how to derive value out of it, since this data is in
its raw form or unstructured format.
Examples of unstructured data − Typical human-generated unstructured data
includes:
 Text files: Word processing, spreadsheets, presentations, email, logs.
 Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is
unstructured and traditional analytics tools cannot parse it.
 Social media: Data from Facebook, Twitter, LinkedIn.
 Websites: YouTube, Instagram, photo-sharing sites.
 Mobile data: Text messages, locations.
 Communications: Chat, IM, phone recordings, collaboration software.
 Media: MP3, digital photos, audio and video files.
 Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:



 Satellite imagery: Weather data, land forms, military movements.
 Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
 Digital surveillance: Surveillance photos and video.
 Sensor data: Traffic, weather, oceanographic sensors.
3) Semi-structured
 Semi-structured data can contain both forms of data.
 Users may see semi-structured data as structured in form, but it is actually not
defined with, for example, a table definition as in a relational DBMS.

Examples of semi-structured data − Personal data stored in an XML file:


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
 Basic requirements for working with big data are the same as the requirements
for working with datasets of any size.
 However, the massive scale, the speed of ingesting and processing, and the
characteristics of the data that must be dealt with at each stage of the process
present significant new challenges when designing solutions.
 In 2001, Gartner's Doug Laney first presented what became known as the "three
Vs of big data" to describe some of the characteristics that make big data
different from other data processing:

Volume
 The sheer scale of the information processed helps define big data systems.
 These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
 Often, because the work requirements exceed the capabilities of a single
computer, this becomes a challenge of pooling, allocating, and coordinating
resources from groups of computers.
 Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.

Velocity
 Another way in which big data differs significantly from other data systems is
the speed at which information moves through the system.
 Data is frequently flowing into the system from multiple sources and is often
expected to be processed in real time to gain insights and update the current
understanding of the system.
 This focus on near instant feedback has driven many big data practitioners away
from a batch-oriented approach and closer to a real-time streaming system.



 Data is constantly being added, massaged, processed, and analyzed in order to
keep up with the influx of new information and to surface valuable information
early when it is most relevant.

Variety
 Big data problems are often unique because of the wide range of both the sources
being processed and their relative quality.
 Data can be ingested from internal systems like application and server logs, from
social media feeds and other external APIs, from physical device sensors, and
from other providers.

 Big data seeks to handle potentially useful data regardless of where it's coming
from by consolidating all information into a single system.
 The formats and types of media can vary significantly as well. Rich media like
images, video files, and audio recordings are ingested alongside text
files, structured logs, etc.
 While more traditional data processing systems might expect data to enter the
pipeline already labeled, formatted, and organized, big data systems usually
accept and store data closer to its raw state.
 Ideally, any transformations or changes to the raw data will happen in memory
at the time of processing.

2. Distributed File System


 A distributed file system (DFS) is a file system with data stored on a server. The
data is accessed and processed as if it were stored on the local client machine.
 The DFS makes it convenient to share information and files among users on a
network in a controlled and authorized way.
 The server allows the client users to share files and store data just as if they were
storing the information locally.
 The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications.
 It employs a NameNode and DataNode architecture to implement a distributed
file system that provides high-performance access to data across highly scalable
Hadoop clusters.
 HDFS is a key part of many Hadoop ecosystem technologies, as it provides a
reliable means for managing pools of big data and supporting related big data
analytics applications.
 HDFS supports the rapid transfer of data between compute nodes. At its outset,
it was closely coupled with MapReduce, a programmatic framework for data
processing.



 When HDFS takes in data, it breaks the information down into separate blocks
and distributes them to different nodes in a cluster, thus enabling highly efficient
parallel processing.
 HDFS holds a very large amount of data and provides easier access. To store such
huge data, the files are stored across multiple machines.
 These files are stored in a redundant fashion to rescue the system from possible
data loss in case of failure. HDFS also makes applications available for parallel
processing.

 Features of HDFS
1) It is suitable for the distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of namenode and datanode help users to easily check the
status of cluster.
4) Streaming access to file system data.
5) HDFS provides file permissions and authentication.
HDFS Architecture
 HDFS follows the master-slave architecture and it has the following elements.
Namenode
 The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
 It is a software that can be run on commodity hardware.
 The system having the namenode acts as the master server and it does the
following tasks:
 Manages the file system namespace.
 Regulates clients’ access to files. It also executes file system operations such as
renaming, closing, and opening files and directories.

2) Datanode
 The datanode is a commodity hardware having the GNU/Linux operating
system and datanode software. For every node (Commodity hardware/System)
in a cluster, there will be a datanode.
 These nodes manage the data storage of their system. Datanodes perform read-
write operations on the file systems, as per client request.
 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

3) Block
 The file in a file system will be divided into one or more segments and/or stored
in individual data nodes.



 These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block.
 The default block size is 64 MB, but it can be changed as needed in the HDFS
configuration.
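 The block size can also be set from client code when a file is created. The following
is a minimal sketch using the Hadoop FileSystem Java API; the property name shown is
the Hadoop 1.x one, and the path and sizes are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "dfs.block.size" is the Hadoop 1.x property name; 128 MB is an illustrative value
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Alternatively, pass the block size explicitly when creating the file
        Path file = new Path("/user/hduser/sample.txt");   // hypothetical path
        FSDataOutputStream out = fs.create(file, true, 4096,
                fs.getDefaultReplication(), 128L * 1024 * 1024);
        out.writeUTF("written with a 128 MB block size");
        out.close();
        fs.close();
    }
}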

Goals of HDFS

1) Fault detection and recovery − Since HDFS includes a large amount of
commodity hardware, failure of components is frequent.
Therefore HDFS should have mechanisms for quick and automatic fault
detection and recovery.

2) Huge datasets − HDFS should have hundreds of nodes per cluster to manage
the applications having huge datasets.

3) Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, it reduces the network traffic and increases the throughput.

Big Data And its importance


 Complex or massive data sets which are quite impractical to manage using
traditional database systems and software tools are referred to as big data.
 Big data is utilized by organizations in one way or another. It is the technology
that makes it possible to realize big data’s value.
 It is a voluminous amount of both multi-structured and unstructured data.

Advantages of Big Data


 Big data systems were developed due to the loopholes present in the traditional data
systems used by organizations. The following table marks the difference
between traditional data and big data:

Factors              Big Data                                  Traditional Data
Data Architecture    Distributed architecture                  Centralized architecture
Type of Data         Semi-structured/unstructured data         Structured data
Volume of Data       Consists of 2^50 – 2^60 bytes of data     Consists of about 2^40 bytes of data
Schema               No fixed schema                           Based on a fixed schema
Data Relationship    Complex relationships between data        Known relationships between the data
4. Drivers for Big Data
1) Media & communications
 In media and communications, big data is used to analyze the personal and behavioral
data of customers in order to create customer profiles.
 It creates content for different customers, recommends that content on the
basis of demand and measures its performance (Tankard 2012).
 In the media and communication industry, big data is used by companies like
Spotify and Amazon Prime. Spotify uses big data analytics to analyze listening data
and give individual music recommendations to its customers.

2) Banking
 In banking, big data is used to manage large volumes of financial data.
 The SEC (Securities and Exchange Commission) uses big data to monitor
market and finance-related data of banks, and network analytics to
track illegal activities in finance.

 Big data is also used in the trading sector for trade analytics and decision-support
analytics.

3) Healthcare
 Big data is used in the healthcare sector to manage the large amount of
data related to patients, doctors and other staff members.
 It helps to eliminate failures like errors, invalid or inappropriate data, and
system faults that arise while using the system, and provides benefits like
managing customer, staff and doctor information related to healthcare (Bughin
et al. 2010).
 According to Gartner (2013), 43% of healthcare industries have invested in
big data.



4) Communications, Media and Entertainment
 Consumers expect rich media on demand in different formats and on a variety of
devices. Some big data challenges in the communications, media and
entertainment industry include:
 Collecting, analyzing, and utilizing consumer insights
 Leveraging mobile and social media content
 Understanding patterns of real-time media content usage
 Organizations in this industry simultaneously analyze customer data along with
behavioral data to create detailed customer profiles that can be used to:
 Create content for different target audiences
 Recommend content on demand
 Measure content performance

5) Education
 A major challenge in the education industry is to incorporate big data from
different sources and vendors and to utilize it on platforms that were not
designed for the varying data.
 Big data is used quite significantly in higher education. For example, the
University of Tasmania, an Australian university with over 26,000 students, has
deployed a Learning and Management System that tracks, among other things,
when a student logs onto the system, how much time is spent on different pages
in the system, as well as the overall progress of a student over time.
 In a different use case of big data in education, it is also used to
measure teachers’ effectiveness to ensure a good experience for both students
and teachers.



 Teachers’ performance can be fine-tuned and measured against student numbers,
subject matter, student demographics, student aspirations, behavioral
classification and several other variables.
 On a governmental level, the Office of Educational Technology in the U.S.
Department of Education is using big data to develop analytics to help course-
correct students who are going astray while using online big data courses. Click
patterns are also being used to detect boredom.

6) Manufacturing and Natural Resources


 Increasing demand for natural resources including oil, agricultural products,
minerals, gas, metals, and so on has led to an increase in the volume, complexity,
and velocity of data that is a challenge to handle.

 Similarly, large volumes of data from the manufacturing industry are untapped.
The underutilization of this information prevents improved quality of products,
energy efficiency, reliability, and better profit margins.

7) Government
 In government, the biggest challenges are the integration and interoperability of
big data across different government departments and affiliated organizations.
 In public services, big data has a very wide range of applications including:
energy exploration, financial market analysis, fraud detection, health related
research and environmental protection.
 Some more specific examples are as follows:
 Big data is being used in the analysis of large amounts of social disability claims,
made to the Social Security Administration (SSA), that arrive in the form of
unstructured data. The analytics are used to process medical information rapidly
and efficiently for faster decision making and to detect suspicious or fraudulent
claims.
 The Food and Drug Administration (FDA) is using big data to detect and study
patterns of food-related illnesses and diseases. This allows for a faster response,
which has led to faster treatment and fewer deaths.



8) Insurance
 Lack of personalized services, lack of personalized pricing and the lack of
targeted services to new segments and to specific market segments are some of
the main challenges.
 In a survey conducted by Marketforce, challenges identified by professionals in
the insurance industry include underutilization of data gathered by loss
adjusters and a hunger for better insight.
 Applications of big data in the insurance industry:
 Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analyzing and predicting customer
behavior through data derived from social media, GPS-enabled devices and
CCTV footage. Big data also allows for better customer retention for
insurance companies.

5) Algorithms using MapReduce


 MapReduce is a programming paradigm that runs in the background of Hadoop
to provide scalability and easy data-processing solutions.

Why MapReduce?
 Traditional Enterprise Systems normally have a centralized server to store and
process data.
 The following illustration depicts a schematic view of a traditional enterprise
system.
 The traditional model is certainly not suitable for processing huge volumes of
scalable data, which cannot be accommodated by standard database servers.
 Moreover, the centralized system creates too much of a bottleneck while
processing multiple files simultaneously.

MapReduce divides a task into small parts and assigns them to many computers.
The results are collected at one place and integrated to form the result dataset.
How MapReduce Works?
 The MapReduce algorithm contains two important tasks, namely Map and
Reduce.



 The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those
data tuples (key-value pairs) into a smaller set of tuples.
 The reduce task is always performed after the map job.

 Input Phase − Here we have a Record Reader that translates each record in an
input file and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known
as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from
the map phase into identifiable sets. It takes the intermediate keys from the
mapper as input and applies a user-defined code to aggregate the values in a
small scope of one mapper. It is not a part of the main MapReduce algorithm; it
is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be aggregated,
filtered, and combined in a number of ways, and it requires a wide range of
processing. Once the execution is over, it gives zero or more key-value pairs to
the final step.



 Output Phase − In the output phase, we have an output formatter that translates
the final key-value pairs from the Reducer function and writes them onto a file
using a record writer.
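 To make these phases concrete, the following is a minimal word-count sketch written
against the classic org.apache.hadoop.mapreduce Java API; the class names and the
input/output paths passed on the command line are illustrative, not prescribed by
Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce phase: sum the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

 The job would be packaged into a jar and submitted with something like
hadoop jar wordcount.jar WordCount /input /output (jar name and paths are placeholders).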

MapReduce-Example
 Let us take a real-world example to comprehend the power of MapReduce.
 Twitter receives around 500 million tweets per day, which is nearly 3000 tweets
per second.
 The following illustration shows how Twitter manages its tweets with the help
of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following
actions −
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-
value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered
maps as key-value pairs.



 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
 The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
 The map task is done by means of the Mapper class. The reduce task is done by
means of the Reducer class. The Mapper class takes the input, tokenizes it, maps and
sorts it. The output of the Mapper class is used as input by the Reducer class, which in
turn searches for matching pairs and reduces them.

 MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems.
 In technical terms, the MapReduce algorithm helps in sending the Map and Reduce
tasks to appropriate servers in a cluster.
 These mathematical algorithms may include the following:
Sorting, Searching, Indexing, TF-IDF

Sorting
 Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements sorting algorithm to automatically sort the output key-
value pairs from the mapper by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the
Context class (user-defined class) collects the matching valued keys as a
collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes the
help of RawComparator class to sort the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented
to the Reducer.



Searching
 Searching plays an important role in MapReduce algorithm. It helps in the
combiner phase (optional) and in the Reducer phase. Let us try to understand
how Searching works with the help of an example.
Example
 The following example shows how MapReduce employs Searching algorithm to
find out the details of the employee who draws the highest salary in a given
employee dataset.
 Let us assume we have employee data in four different files − A, B, C, and D. Let
us also assume there are duplicate employee records in all four files because of
importing the employee data from all database tables repeatedly. See the
following illustration.

 The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.

 The combiner phase (searching technique) will accept the input from the Map
phase as a key-value pair with employee name and salary. Using searching
technique, the combiner will check all the employee salary to find the highest
salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = salary of the first employee   // treated as the maximum salary so far
if (salary of the next employee > Max)
{
    Max = that salary;
}
else { continue checking; }
The expected result is as follows −
<satish,26000> <gopal,50000> <kiran,45000> <manisha,45000>
Reducer phase − From each file, you will find the highest salaried employee.
 To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries,
if any. The same algorithm is used among the four <k, v> pairs, which are
coming from the four input files. The final output should be as follows −
<gopal, 50000>
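 A hedged Java sketch of the same searching idea is shown below. The record layout
(employee name and salary separated by a tab, under a single constant key) is an
assumption made for illustration. Because its input and output types match, the same
class can be registered as both the combiner (per-mapper maximum) and the reducer
(global maximum).

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input/output: key = a constant marker such as "max", value = "empName\tsalary" (assumed format)
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String best = null;
        long bestSalary = Long.MIN_VALUE;

        for (Text value : values) {
            String[] parts = value.toString().split("\t");
            long salary = Long.parseLong(parts[1]);
            if (salary > bestSalary) {          // same comparison as in the pseudocode above
                bestSalary = salary;
                best = value.toString();
            }
        }
        if (best != null) {
            context.write(key, new Text(best)); // e.g. ("max", "gopal\t50000")
        }
    }
}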

Indexing
 Normally indexing is used to point to a particular data and its address. It
performs batch indexing on the input files for a particular Mapper.
 The indexing technique that is normally used in MapReduce is known as
inverted index. Search engines like Google and Bing use inverted indexing
technique. Let us try to understand how Indexing works with the help of a
simple example.

Example
 The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are
the file names and their contents are in double quotes.
 T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana"
 After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
 Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2}
implies the term "is" appears in the files T[0], T[1], and T[2].
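 A compact Java sketch of such an inverted-index job is given below; it assumes the
document name can be taken from the input split (via FileSplit), which is one common
way to do it rather than the only one.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Mapper: emit (term, documentName) for every token in the document
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken().toLowerCase()), new Text(doc));
            }
        }
    }

    // Reducer: collect the distinct documents in which each term appears
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new HashSet<String>();
            for (Text value : values) {
                docs.add(value.toString());
            }
            context.write(key, new Text(docs.toString())); // e.g. is -> [T0, T1, T2]
        }
    }
}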

TF-IDF
 TF-IDF is a text processing algorithm which is short for Term Frequency −
Inverse Document Frequency. It is one of the common web analysis algorithms.
Here, the term 'frequency' refers to the number of times a term appears in a
document.
 Term Frequency (TF)
 It measures how frequently a particular term occurs in a document. It is
calculated by the number of times a word appears in a document divided by the
total number of words in that document.



 TF(the) = (Number of times the term ‘the’ appears in a document) / (Total
number of terms in the document)

Inverse Document Frequency (IDF)


 It measures the importance of a term. It is calculated by the number of
documents in the text database divided by the number of documents where a
specific term appears.
 While computing TF, all the terms are considered equally important. That means
TF counts the term frequency even for common words like “is”, “a”, “what”, etc. Thus
we need to weigh down the frequent terms while scaling up the rare ones, by
computing the following −
 IDF(the) = log(Total number of documents / Number of documents with the term
‘the’ in it).
 The algorithm is explained below with the help of a small example.

Example
 Consider a document containing 1000 words, wherein the word hive appears 50
times. The TF for hive is then (50 / 1000) = 0.05.
 Now, assume we have 10 million documents and the word hive appears in 1000
of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4.
 The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
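 The same arithmetic can be reproduced with a few lines of Java (using the base-10
logarithm, which matches the numbers above):

public class TfIdfExample {
    // tf = termCount / totalTerms, idf = log10(totalDocs / docsWithTerm)
    static double tfIdf(long termCount, long totalTerms, long totalDocs, long docsWithTerm) {
        double tf = (double) termCount / totalTerms;
        double idf = Math.log10((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        // 'hive' appears 50 times in a 1000-word document; 1000 of 10 million documents contain it
        System.out.println(tfIdf(50, 1000, 10000000, 1000)); // prints 0.2
    }
}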

2.0 Introduction to Hadoop and Hadoop Architecture


Overview of Big Data
 Big data is a term used to refer to data sets that are too large or complex for
traditional data-processing application software to adequately deal with. Data
with many cases (rows) offer greater statistical power, while data with higher
complexity (more attributes or columns) may lead to a higher false discovery rate.
 Big data works on the data produced by various devices and their applications.
Below are some of the fields that are involved in the umbrella of Big Data.



 Black Box Data: It is generated by aircraft and includes a large amount of
information, such as the conversation between crew members and any
other communications (alert messages or orders passed) with the
ground duty staff.
 Social Media Data: Social networking sites such as Facebook and Twitter
contain the information and the views posted by millions of people across the
globe.
 Stock Exchange Data: It holds information (complete details of incoming and
outgoing business transactions) about the ‘buy’ and ‘sell’ decisions on shares of
different companies made by customers.
 Power Grid Data: The power grid data mainly holds the information about power
consumed by a particular node with respect to a base station.
 Transport Data: It includes data from various transport sectors, such as the
model, capacity, distance and availability of a vehicle.
 Search Engine Data: Search engines retrieve a large amount of data from
different databases.
Big Data Technology

Operational Big Data


 It includes applications such as MongoDB that provide operational
capabilities for interactive and real-time workloads where data is generally
captured and stored.
 NoSQL Big Data systems are designed to capitalize on new cloud
computing architectures, permitting massive computations to be run
inexpensively and efficiently. This makes operational big data workloads
much easier to manage, and cheaper and faster to implement.
Analytical Big Data
 It includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce, which provide analytical capabilities for retrospective and
complex analysis.
 MapReduce provides a new method of analyzing data that complements the
capabilities provided by SQL, based on a system that can be scaled up from
single servers to thousands of high- and low-end machines.
Big Data Barriers(Challenges)
Barriers that are imposed on big data are as follows:
 Capture data
 Storage Capacity
 Searching
 Sharing
 Transfer
 Analysis



 Presentation

Apache Hadoop and Hadoop Ecosystem


Apache Hadoop
Apache Hadoop was born to enhance the usage and solve major issues of big data. The
web media was generating loads of information on a daily basis, and it was becoming
very difficult to manage the data of around one billion pages of content.

Apache Hadoop is the most important framework for working with Big Data. Hadoop's
biggest strength is scalability. It upgrades from working on a single node to thousands
of nodes without any issue, in a seamless manner.

Hadoop Architecture
The architecture can be broken down into two branches:
1) Hadoop core components 2) Hadoop ecosystem

1) Hadoop core components

Core Hadoop Components :


There are four basic or core components :
 Hadoop Common – It is a set of common utilities and libraries which handle
other Hadoop modules. It makes sure that the hardware failures are managed by
Hadoop cluster automatically.
 HDFS – HDFS is the Hadoop distributed file system that stores data in
the form of small blocks and distributes them across the cluster. Each
block is replicated multiple times to ensure data availability.
 Hadoop YARN – It allocates resources which in turn allow different users to
execute various applications without worrying about the increased workloads.



 Hadoop MapReduce – It executes tasks in a parallel fashion by distributing it as
small blocks.
2) Hadoop ecosystem

Hadoop Ecosystem


 Ambari – Ambari is a web-based interface for managing, configuring and testing
big data clusters to support its components like HDFS, MapReduce, Hive,
HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. It provides a console for
monitoring the health of the clusters as well as allows assessing the performance
of certain components such as MapReduce, Pig, Hive, etc., in a user-friendly way.
 Cassandra – An open source highly scalable distributed database system based
on NoSQL dedicated to handle massive amount of data across multiple
commodity servers, ultimately contributing to high availability without a single
failure.
 Flume – A distributed and reliable tool to for effectively collecting, aggregating
and moving bulk of streaming data into HDFS.
 HBase – A non-relational distributed database running on the Big Data Hadoop
cluster that stores large amounts of structured data. HBase acts as an input for the
MapReduce jobs.
 HCatalog – It is a layer of table and storage management which allows the
developers to access and share the data.
 Hadoop Hive – Hive is a data warehouse infrastructure that allows
summarization, querying, and analyzing of data with the help of a query
language similar to SQL.



 Hadoop Oozie – A server-based system that schedules and manages the Hadoop
jobs.
 Hadoop Pig– A dedicated high-level platform which is responsible for
manipulating the data stored in HDFS with the help of a compiler for Mapreduce
and a language called Pig Latin. It allows the analysts to extract, transform and
load (ETL) the data without writing the codes for MapReduce.
 Solr – A highly scalable search tool which enables indexing, central configuration,
failovers and recovery.
 Spark –An open source fast engine responsible for Hadoop streaming and
supporting SQL, machine learning and processing graphs.
 Hadoop Sqoop – A mechanism to transfer huge amount of data between Hadoop
and structured databases.
 Hadoop Zookeeper – An open source application that configures and synchronizes
the distributed systems.

Moving Data In and Out of Hadoop


 Moving data in and out of Hadoop, which is referred to as data ingress and egress,
is the process by which data is transported from an external system into an
internal system, and vice versa.
 Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
 Files can be moved in and out of HDFS, and data can be pulled from external
data sources and pushed to external data sinks using MapReduce.
 The following figure shows some of Hadoop’s ingress and egress mechanisms.



Moving large quantities of data in and out of Hadoop has logistical challenges that
include consistency guarantees and resource impacts on data sources and destinations.

IDEMPOTENCE
 An idempotent operation produces the same result no matter how many times it’s
executed.
 In a relational database, inserts typically aren’t idempotent, because executing
them multiple times doesn’t produce the same resulting database state.
 Alternatively, updates often are idempotent, because they’ll produce the same
end result.
 Any time data is being written, idempotence should be a consideration, and data
ingress and egress in Hadoop is no different.
 How well do distributed log collection frameworks deal with data
retransmissions?
 How do you ensure idempotent behavior in a MapReduce job where multiple
tasks are inserting into a database in parallel?

AGGREGATION
 The data aggregation process combines multiple data elements.
 In the context of data ingress this can be useful, because moving large quantities
of small files into HDFS potentially translates into NameNode memory woes, as
well as slow MapReduce execution times.
 Having the ability to aggregate files or data together mitigates this problem, and
is a feature to consider.

DATA FORMAT TRANSFORMATION


 The data format transformation process converts one data format into another.
 If your source data is in multiline XML or JSON form, for example, you may want
to consider a preprocessing step.



 This would convert the data into a form that can be split, such as a JSON or an
XML element per line, or convert it into a format such as Avro.

RECOVERABILITY
 Recoverability allows an ingress or egress tool to retry in the event of a failed
operation.
 Because it’s unlikely that any data source, sink, or Hadoop itself can be 100 percent
available, it’s important that an ingress or egress action be retried in the event of
failure.

CORRECTNESS
 In the context of data transportation, checking for correctness is how you verify
that no data corruption occurred as the data was in transit.
 When users work with heterogeneous systems such as Hadoop data ingress and
egress tools, the fact that data is being transported across different hosts,
networks, and protocols only increases the potential for problems during data
transfer.
 Common methods for checking the correctness of raw data on storage devices
include Cyclic Redundancy Checks (CRC), which are what HDFS uses internally
to maintain block-level integrity.
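 As a small illustration of the idea (not HDFS's internal checksum implementation), a
checksum computed before and after a transfer can be compared with java.util.zip.CRC32:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcCheckExample {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block of data moved into Hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] received = sent.clone();   // pretend this arrived over the network

        // If the two checksums differ, the data was corrupted in transit
        System.out.println(checksum(sent) == checksum(received) ? "intact" : "corrupted");
    }
}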
Moving Data into Hadoop
Making data available to Hadoop is the first step when working with data in Hadoop.
There are two primary approaches: 1) HDFS level 2) MapReduce level.
Comparing Flume, Chukwa and Scribe

Flume
 In the basic architecture of Flume, data generators (such as Facebook or Twitter)
generate data which gets collected by individual Flume agents running on them.
 Thereafter, a data collector (which is also an agent) collects the data from the
agents; it is aggregated and pushed into a centralized store such as HDFS or
HBase.

 Flume Event
 An event is the basic unit of the data transported inside Flume.



 It contains a payload of byte array that is to be transported from the source to the
destination accompanied by optional headers.
 A typical Flume event would have the following structure

 Flume Agent
 An agent is an independent daemon process (JVM) in Flume.
 It receives the data (events) from clients or other agents and forwards it to its
next destination (sink or agent).
 Flume may have more than one agent. Following diagram represents a Flume
Agent.

As shown in the diagram a Flume Agent contains three main components namely,
source, channel, and sink.
Source
 A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
 Apache Flume supports several types of sources and each source receives events
from a specified data generator.
 Example − Avro source, Thrift source, Twitter 1% source, etc.

Channel



 A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks.
 It acts as a bridge between the sources and the sinks.
 These channels are fully transactional and they can work with any number of
sources and sinks.
 Example − JDBC channel, File system channel, Memory channel, etc.
Sink
 A sink stores the data into centralized stores like HBase and HDFS.
 It consumes the data (events) from the channels and delivers it to the destination.
 The destination of the sink might be another agent or the central stores.
 Example − HDFS sink
Chukwa
 Chukwa is structured as a pipeline of collection and processing stages, with clean
and narrow interfaces between stages. This will facilitate future innovation
without breaking existing code.
 Chukwa has four primary components:
 Agents that run on each machine and emit data.
 Collectors that receive data from the agent and write it to stable storage.
 MapReduce jobs for parsing and archiving the data.
 HICC, the Hadoop Infrastructure Care Center; a web-portal style interface for
displaying data.
 Below is a figure showing the Chukwa data pipeline, annotated with data dwell
times at each stage. A more detailed figure is available at the end of this
document

Agents and Adaptors
 Chukwa agents do not collect some particular fixed set of data. Rather, they
support dynamically starting and stopping Adaptors, which are small dynamically
controllable modules that run inside the Agent process and are responsible for
the actual collection of data.
 These dynamically controllable data sources are called adaptors, since they
generally are wrapping some other data source, such as a file or a Unix

command-line tool. The Chukwa agent guide includes an up-to-date list of
available Adaptors.
Data Model
 Chukwa Adaptors emit data in Chunks. A Chunk is a sequence of bytes, with
some metadata. Several of these are set automatically by the Agent or Adaptors.
Two of them require user intervention: cluster name and datatype.
Collectors
 Rather than have each adaptor write directly to HDFS, data is sent across the
network to a collector process, that does the HDFS writes. Each collector receives
data from up to several hundred hosts, and writes all this data to a single sink
file, which is a Hadoop sequence file of serialized Chunks. Periodically,
collectors close their sink files, rename them to mark them available for
processing, and resume writing a new file. Data is sent to collectors over HTTP.
MapReduce processing
 Collectors write data in sequence files. This is convenient for rapidly getting data
committed to stable storage. But it's less convenient for analysis or finding particular
data items. As a result, Chukwa has a toolbox of MapReduce jobs for organizing and
processing incoming data.
 These jobs come in two kinds: Archiving and Demux. The archiving jobs simply
take Chunks from their input, and output new sequence files of Chunks, ordered
and grouped. They do no parsing or modification of the contents. (There are several
different archiving jobs, that differ in precisely how they group the data.)

HICC
 HICC, the Hadoop Infrastructure Care Center is a web-portal style interface for
displaying data. Data is fetched from a MySQL database, which in turn is
populated by a mapreduce job that runs on the collected data, after Demux.
Scribe
 Scribe was a server for aggregating log data streamed in real-time from a large
number of servers. It was designed to be scalable, extensible without client-side
modification, and robust to failure of the network or any specific machine.
 Scribe was developed at Facebook and released in 2008 as open source.
 Scribe servers are arranged in a directed graph, with each server knowing only
about the next server in the graph. This network topology allows for adding
extra layers of fan-in as a system grows, and batching messages before sending
them between data centers, without having any code that explicitly needs to
understand data center topology, only a simple configuration.
 Scribe was designed to consider reliability but not to require heavyweight protocols
and expansive disk usage. Scribe spools data to disk on any node to handle
intermittent connectivity or node failure, but doesn't sync a log file for every message.
 This creates a possibility of a small amount of data loss in the event of a crash or
catastrophic hardware failure. However, this degree of reliability is often suitable for
most Facebook use cases.

3.0 HDFS, HIVE and HIVEQL, HBASE


HDFS Overview



 The Hadoop File System was developed using a distributed file system design. It runs
on commodity hardware. Unlike other distributed systems, HDFS is highly
fault-tolerant and designed using low-cost hardware.
 HDFS holds a very large amount of data and provides easier access. To store such
huge data, the files are stored across multiple machines.
 These files are stored in a redundant fashion to rescue the system from possible
data loss in case of failure. HDFS also makes applications available for parallel
processing.

Features of HDFS
 It is suitable for the distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of namenode and datanode help users to easily check the
status of cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.

 HDFS follows the master-slave architecture and it has the following elements.
Namenode
 The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
 It is software that can be run on commodity hardware. The system having the
namenode acts as the master server and it does the following tasks:
 Manages the file system namespace.
 Regulates clients’ access to files.



 It also executes file system operations such as renaming, closing, and opening
files and directories.
Datanode
 The datanode is a commodity hardware having the GNU/Linux operating system
and datanode software. For every node (Commodity hardware/System) in a
cluster, there will be a datanode. These nodes manage the data storage of their
system.
 Datanodes perform read-write operations on the file systems, as per client
request.
 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
 Generally the user data is stored in the files of HDFS. The file in a file system
will be divided into one or more segments and/or stored in individual data nodes.
These file segments are called blocks.
 In other words, the minimum amount of data that HDFS can read or write is
called a block. The default block size is 64 MB, but it can be changed as needed
in the HDFS configuration.

Read Operation In HDFS


 Data read request is served by HDFS, NameNode, and DataNode. Let's call the
reader as a 'client'. Below diagram depicts file read operation in Hadoop.



 A client initiates a read request by calling the 'open()' method of the FileSystem
object; it is an object of type DistributedFileSystem.
 This object connects to the namenode using RPC and gets metadata information such
as the locations of the blocks of the file. Please note that these addresses are of the
first few blocks of the file.
 In response to this metadata request, the addresses of the DataNodes having a copy
of each block are returned.
 Once addresses of DataNodes are received, an object of type FSDataInputStream
is returned to the client.
 FSDataInputStream contains DFSInputStream, which takes care of interactions
with the DataNode and NameNode. In step 4 shown in the above diagram, the client
invokes the 'read()' method, which causes DFSInputStream to establish a connection
with the first DataNode holding the first block of the file.
 Data is read in the form of streams wherein client invokes 'read()' method
repeatedly. This process of read() operation continues till it reaches the end of
block.
 Once the end of a block is reached, DFSInputStream closes the connection and
moves on to locate the next DataNode for the next block.
 Once the client is done with the reading, it calls the close() method.
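 A minimal sketch of the client side of this read path, using the FileSystem Java API,
is shown below; the HDFS URI matches the core-site.xml used later in this handbook and
the file path is hypothetical.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:54310"), conf);

        InputStream in = null;
        try {
            // open() returns an FSDataInputStream backed by a DFSInputStream
            in = fs.open(new Path("/user/hduser/sample.txt"));   // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);      // repeated read() calls, streamed to stdout
        } finally {
            IOUtils.closeStream(in);   // close() ends the read, as described above
        }
    }
}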

Write Operation In HDFS


In this section, we will understand how data is written into HDFS through files.



 A client initiates the write operation by calling the 'create()' method of the
DistributedFileSystem object, which creates a new file (step 1 in the above
diagram).
 The DistributedFileSystem object connects to the NameNode using an RPC call and
initiates new file creation. However, this file create operation does not associate
any blocks with the file. It is the responsibility of the NameNode to verify that the
file (which is being created) does not exist already and that the client has correct
permissions to create a new file. If the file already exists or the client does not have
sufficient permission to create a new file, then an IOException is thrown to the client.
Otherwise, the operation succeeds and a new record for the file is created by the
NameNode.
 Once a new record in NameNode is created, an object of type
FSDataOutputStream is returned to the client. A client uses it to write data into
the HDFS. Data write method is invoked (step 3 in the diagram).
 FSDataOutputStream contains a DFSOutputStream object which looks after
communication with DataNodes and the NameNode. While the client continues
writing data, DFSOutputStream continues creating packets with this data. These
packets are enqueued into a queue which is called the DataQueue.
 There is one more component called DataStreamer which consumes this
DataQueue. DataStreamer also asks NameNode for allocation of new blocks
thereby picking desirable DataNodes to be used for replication.



 Now, the process of replication starts by creating a pipeline using DataNodes. In
our case, we have chosen a replication level of 3 and hence there are 3 DataNodes
in the pipeline.
 The DataStreamer pours packets into the first DataNode in the pipeline.
 Every DataNode in a pipeline stores packet received by it and forwards the same
to the second DataNode in a pipeline.
 Another queue, 'Ack Queue' is maintained by DFSOutputStream to store packets
which are waiting for acknowledgment from DataNodes.
 Once acknowledgment for a packet in the queue is received from all DataNodes
in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode
failure, packets from this queue are used to reinitiate the operation.
 After the client is done writing data, it calls the close() method (step 9 in the
diagram). The call to close() results in flushing the remaining data packets to the
pipeline, followed by waiting for acknowledgment. Once the final acknowledgment is
received, the NameNode is contacted to tell it that the file write operation is complete.
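 The corresponding client-side write path can be sketched as follows; the local source
file and the HDFS destination path are illustrative.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:54310"), conf);

        InputStream in = new BufferedInputStream(new FileInputStream("/tmp/local.txt")); // local source
        FSDataOutputStream out = fs.create(new Path("/user/hduser/copy.txt")); // create() -> DFSOutputStream

        try {
            IOUtils.copyBytes(in, out, 4096, false);  // packets flow through the DataNode pipeline
        } finally {
            out.close();   // flushes remaining packets and waits for acknowledgments
            in.close();
        }
    }
}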

Goals of HDFS
 Fault detection and recovery − Since HDFS includes a large number of
commodity hardware, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.
 Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
 Hardware at data − A requested task can be done efficiently, when the
computation takes place near the data. Especially where huge datasets are
involved, it reduces the network traffic and increases the throughput.

Hadoop Installation
 This text describes the installation and configuration of a Hadoop cluster backed
by the Hadoop Distributed File System, running on Ubuntu Linux.
 Hadoop 1.2.1 (stable release)
 Ubuntu 14.04 LTS
 Java 7 (Open JDK for Linux)

Installing and running Hadoop requires the following parts:-


1.Setting up the prerequisites for Hadoop
2.Configuring all the machines on the cluster for Hadoop
3.Configuring the cluster to assign one master and several slaves

Part 1: Prerequisites
Configuring Ubuntu:-
 For these steps I will be creating 2-node cluster of Hadoop. Both machines have
Ubuntu 14.04 LTS installed with all latest updates from default repositories.



 The master machine’s hostname is master and the slave machine’s hostname is slave.
 Both machines have the username hduser (which has root privileges on the
system). So the terminal prompt on the master looks like hduser@master:~$
and on the slave hduser@slave:~$

Networking:
 Both machines must be able to reach each other over the network. The easiest is
to put both machines in the same network with regard to hardware and software
configuration.
When using virtual machines in VirtualBox:
 These steps are only to be performed if configuring networking in Oracle
VirtualBox. In the settings of each virtual machine, go to Network and select
Bridged Adapter to put the virtual machines on the same network as the host, so
that all VMs are in the same network.
 Inside the running VM, configure the IP address manually to allocate an IP
address like 192.168.0.xxx. For example, the IP addresses are as follows:
master 192.168.0.1
slave 192.168.0.2
 Now edit the /etc/hosts file to let the system know about other systems on the
network. Edit the file on all machines to look like this.
127.0.0.1    localhost
#127.0.1.1   hostname
192.168.0.1  master
192.168.0.2  slave

Disabling IPv6:
 IPv6 should be disabled as a caution, because Hadoop uses the IP address 0.0.0.0 for
internal options and it would be bound to an IPv6 address by default in Ubuntu. To
disable it, add the following lines to /etc/sysctl.conf. The machines should be
restarted in order for the changes to take effect.
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Configuring Java: -
 Java JDK is required to execute the Hadoop code, so it must be installed and
configured properly on each machine on the cluster.

Installation:
 Run the following command from a terminal:
sudo apt-get install openjdk-7-jre openjdk-7-jdk
 The installed files of Java will be placed in /usr/lib/jvm/java-7-openjdk-i386



Configuring SSH:-
 Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your
local machine if you want to use Hadoop on it, so that it can access the cluster
machines without having to supply username and password every time.
This requires that all the nodes have the same version of JDK and Hadoop.

Installation:
To install ssh in Ubuntu: sudo apt-get install ssh
 Generate keys: Execute this step on the master as well as all the slaves.
ssh-keygen -t rsa -P ""
 Adding this key to the authorized keys in ssh: Perform this step on all machines in the
cluster to copy the above generated key to the trusted keys for the user.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Distributing the master’s key to all slaves:


 We have to add the master’s public SSH key (which should be in
$HOME/.ssh/id_rsa.pub) to the authorized keys of each slave (in that slave user’s
$HOME/.ssh/authorized_keys), so that it doesn’t require a password every time.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub slaveIP
 The above command will prompt for the login password on the slave, then copy the
public SSH key for you, creating the correct directory and fixing the permissions
as necessary.
 Now login to each slave node and add it to the authorized keys.
cat $HOME/.ssh/master.pub >> $HOME/.ssh/authorized_keys

Checking the ssh connections:


 Execute the following commands from the master node
ssh master
ssh slave
 The terminal prompt will change to display that we have successfully logged
into the respective machines remotely.
Part 2: Hadoop Configuration on individual machines
(a) Hadoop Installation: Perform the following steps on all machines, including the
master. Download the Hadoop archive from
https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/
tar -xzvf ./hadoop-1.2.1.tar.gz

b) Hadoop configuration (In all machines): -


Edit the files present in hadoop/conf directory as follows:
hadoop-env.sh:- Set the path for java JDK



export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

core-site.xml: Insert the following code snippet in between the <configuration>
and </configuration> tags:
<property><name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description></property>

<property><name>fs.default.name</name><value>hdfs://master:54310</value>
<description>The name of the default file system. </description></property>
mapred-site.xml:
Insert the following code snippet in between
<configuration></configuration>tags.
<property><name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local",
then jobs are run in-process as a single map and reduce task.
</description></property>
hdfs-site.xml
<property><name>dfs.replication</name> <value>2</value>
<description>Default block replication. The actual number of replications can be
specified when the file is created. The default is used if replication is not
specified at create time.</description></property>

Part 3: Assign the master and slaves from the cluster:


 The conf/masters files define on which machines Hadoop will start as master and
run the secondary NameNodes in our multi-node cluster. In our case, this is just
the master machine.
 The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons
(DataNodes and TaskTrackers) will be run.
 We want both the master box and the slave box to act as Hadoop slaves because
we want both of them to store and process data. (If you do not want the master to
take part in tasks, do not add its hostname to conf/slaves on the master.) Change
conf/slaves on all machines to contain, one host per line:
master
slave

RUNNING THE HADOOP CLUSTER


a) Formatting the HDFS filesystem via the NameNode:
Before we start our new multi-node cluster, we must format Hadoop’s
distributed filesystem (HDFS) via the NameNode. We need to do this the first
time to set up a Hadoop cluster. (on the master machine).

hduser@master:~/Desktop$ hadoop/bin/hadoop namenode -format
b) Starting the multi-node cluster:
Run the following command from the master terminal, from the hadoop directory:
bin/start-all.sh
This will start all the required processes, which can be viewed with the Java process
viewer (jps) in the terminal. On the master machine the following processes should
be running:
3810 Jps
3439 SecondaryNameNode
3510 JobTracker
3678 TaskTracker
3142 NameNode
3291 DataNode

 On the slave machines:
2404 DataNode
3102 SecondaryNameNode
2532 TaskTracker
3325 Jps

c) Stopping the cluster: Execute the command from master machine


bin/stop-all.sh
The output will be:
stopping jobtracker
slave: stopping tasktracker
master: stopping tasktracker
stopping namenode
slave: stopping datanode
master: stopping datanode

master: stopping secondarynamenode

Access HDFS using JAVA API


 In this section, we try to understand the Java interface used for accessing Hadoop's
file system.
 In order to interact with Hadoop's filesystem programmatically, Hadoop
provides multiple JAVA classes. Package named org.apache.hadoop.fs contains
classes useful in manipulation of a file in Hadoop's filesystem. These operations
include, open, read, write, and close.
 Actually, the file API for Hadoop is generic and can be extended to interact with
filesystems other than HDFS. The java.net.URL object is used for reading the
contents of a file.
 To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This
is done by calling setURLStreamHandlerFactory method on URL object and an
instance of FsUrlStreamHandlerFactory is passed to it. This method needs to be
executed only once per JVM, hence it is enclosed in a static block.
An example code is-
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    // Register Hadoop's URL scheme handler once per JVM
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // Open the HDFS file whose path is given on the command line
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
 This code opens and reads contents of a file. Path of this file on HDFS is passed
to the program as a command line argument.
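 Besides the java.net.URL approach shown above, the org.apache.hadoop.fs.FileSystem
class from the same package can be used to open, read, and write files directly. The
listing below is only a minimal sketch of that idea (written in Scala, which calls the same
Java API; the output path /tmp/hello.txt is an assumed example, not part of this handbook):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object FileSystemCat {
  def main(args: Array[String]): Unit = {
    // Connect to the filesystem that owns the URI given on the command line
    val conf = new Configuration()
    val fs = FileSystem.get(URI.create(args(0)), conf)

    // Read an existing HDFS file and copy its bytes to standard output
    val in = fs.open(new Path(args(0)))
    try {
      IOUtils.copyBytes(in, System.out, 4096, false)
    } finally {
      IOUtils.closeStream(in)
    }

    // Create (or overwrite) a small file to demonstrate the write path
    val out = fs.create(new Path("/tmp/hello.txt"))
    out.writeBytes("hello hdfs\n")
    out.close()
  }
}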
 Access HDFS Using COMMAND-LINE INTERFACE
 This is one of the simplest ways to interact with HDFS. Command-line interface
has support for filesystem operations like read the file, create directories, moving
files, deleting data, and listing directories.
 We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every
command. Here 'dfs' is a shell command of HDFS which supports multiple
subcommands.
 Some of the widely used commands are listed below along with some details of
each one.
1. Copy a file from the local filesystem to HDFS
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /

This command copies file temp.txt from the local filesystem to HDFS.

2. We can list files present in a directory using -ls


$HADOOP_HOME/bin/hdfs dfs -ls /
We can see a file 'temp.txt' (copied earlier) being listed under ' / ' directory.

3. Command to copy a file to the local filesystem from HDFS


$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt

We can see temp.txt copied to a local filesystem.

4. Command to create a new directory


$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

HDFS Commands
1) touchz: HDFS command to create a file in HDFS with file size 0 bytes.
Usage: hdfs dfs -touchz /directory/filename
Command: hdfs dfs -touchz /new_edureka/sample
2) text
HDFS command that takes a source file and outputs the file in text format.
Usage: hdfs dfs -text /directory/filename
Command: hdfs dfs -text /new_edureka/test

3) cat
HDFS command that reads a file on HDFS and prints the content of that file to
the standard output.
Usage: hdfs dfs -cat /path/to/file_in_hdfs
Command: hdfs dfs -cat /new_edureka/test
4) copyFromLocal
HDFS command to copy a file from the local file system to HDFS.
Usage: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
Command: hdfs dfs -copyFromLocal /home/edureka/test /new_edureka
5) copyToLocal
HDFS command to copy a file from HDFS to the local file system.
Usage: hdfs dfs -copyToLocal <hdfs source> <localdst>
Command: hdfs dfs -copyToLocal /new_edureka/test /home/edureka
6) put
HDFS command to copy single or multiple sources from the local file system
to the destination file system.
Usage: hdfs dfs -put <localsrc> <destination>
Command: hdfs dfs -put /home/edureka/test /user
7) get
HDFS command to copy files from HDFS to the local file system.
Usage: hdfs dfs -get <src> <localdst>
Command: hdfs dfs -get /user/test /home/edureka
8) count
HDFS command to count the number of directories, files, and bytes under the
paths that match the specified file pattern.
Usage: hdfs dfs -count <path>
Command: hdfs dfs -count /user
10) rm
HDFS command to remove a file from HDFS.
Usage: hdfs dfs -rm <path>
Command: hdfs dfs -rm /new_edureka/test
11) cp
HDFS command to copy files from source to destination. This command allows
multiple sources as well, in which case the destination must be a directory.
Usage: hdfs dfs -cp <src> <dest>
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
12) expunge
HDFS command that makes the trash empty.
Command: hdfs dfs -expunge
13) usage
HDFS command that returns the help for an individual command.
Usage: hdfs dfs -usage <command>
Command: hdfs dfs -usage mkdir
14) fsck
HDFS command to check the health of the Hadoop file system.
15) ls
HDFS command to display the list of files and directories in HDFS.
Command: hdfs dfs -ls /
16) mkdir
HDFS command to create a directory in HDFS.
Usage: hdfs dfs -mkdir /directory_name
Command: hdfs dfs -mkdir /new_edureka
Hive
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
 This is a brief tutorial that provides an introduction on how to use Apache Hive
HiveQL with Hadoop Distributed File System. This tutorial can be your first step
towards becoming a successful Hadoop Developer with Hive.
 Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.
What is Hive
 Initially Hive was developed by Facebook, later the Apache Software Foundation
took it up and developed it further as an open source under the name Apache
Hive. It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores schema in a database and processed data into HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
 The following component diagram depicts the architecture of Hive:

Unit Name: Operation
 User Interface: Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are Hive
Web UI, Hive command line, and Hive HD Insight (on Windows Server).
 Meta Store: Hive chooses respective database servers to store the schema or metadata
of tables, databases, columns in a table, their data types, and the HDFS mapping.
 HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info in the
Metastore. It is one of the replacements of the traditional approach for a MapReduce
program. Instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.
 Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce
is the Hive Execution Engine. The execution engine processes the query and generates
results the same as MapReduce results. It uses the flavor of MapReduce.
 HDFS or HBASE: HDFS (Hadoop Distributed File System) or HBASE are the data
storage techniques used to store data into the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:
1. Execute Query: The Hive interface such as Command Line or Web UI sends the query
to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler that parses the query to check
the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of the execution job is a MapReduce job. The
execution engine sends the job to the JobTracker, which is in the Name node, and it
assigns this job to the TaskTracker, which is in the Data node. Here, the query executes
the MapReduce job.
7. Metadata Ops: Meanwhile, during execution, the execution engine can execute
metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive Interfaces.

Step 1: Downloading Hive


 Download Hive from http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets
downloaded onto the /Downloads directory. Here, we download the Hive archive named
“apache-hive-0.14.0-bin.tar.gz”. The following command is used to verify the
download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
Step 2: Installing Hive
 The following steps are required for installing Hive on your system. Let us
assume the Hive archive is downloaded onto the /Downloads directory.
Extracting and verifying Hive Archive.

The following command is used to verify the download and extract the hive
archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory


 We need to copy the files from the super user “su -”. The following commands
are used to copy the files from the extracted directory to the /usr/local/hive”
directory.

$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive


 You can set up the Hive environment by appending the following lines to
~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute ~/.bashrc file.


$ source ~/.bashrc

Step 3: Configuring Hive


 To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is
placed in the $HIVE_HOME/conf directory. The following commands redirect to
Hive config folder and copy the template file:

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:


export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external
database server to configure Metastore.

Step 4: Downloading and Installing Apache Derby


Follow the steps given below to download and install Apache Derby:
Downloading Apache Derby. The following command is used to download
Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-
10.4.2.0-bin.tar.gz

The following command is used to verify the download:


$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive


The following commands are used for extracting and verifying the Derby
archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to /usr/local/derby directory


We need to copy from the super user “su -”.
The following commands are used to copy the files from the extracted directory
to the /usr/local/derby directory:

$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby


 You can set up the Derby environment by appending the following lines to
~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin

export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc

Create a directory to store Metastore


 Create a directory named data in $DERBY_HOME directory to store Metastore
data.
$ mkdir $DERBY_HOME/data

Step 5: Configuring Metastore of Hive


 Configuring Metastore means specifying to Hive where the database is stored.
You can do this by editing the hive-site.xml file, which is in the
$HIVE_HOME/conf directory. First of all, copy the template file using the
following command:

$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the <configuration>
and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>

Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create =
true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 6: Verifying Hive Installation


 Before running Hive, you need to create the /tmp folder and a separate Hive
folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set
write permission for these newly created folders as shown below:
chmod g+w

Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify Hive installation:


$ cd $HIVE_HOME
$ bin/hive

 On successful installation of Hive, you get to see the following response:


Logging initialized using configuration in jar:file:/home/hadoop/hive-
0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties

Hive history
file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>

The following sample command is executed to display all the tables:


hive> show tables;
OK
Time taken: 2.798 seconds
hive>
Hive Metastore
 Metastore is the central repository of Apache Hive metadata. It stores metadata
for Hive tables (like their schema and location) and partitions in a relational
database. It provides client access to this information by using metastore service
API. Hive metastore consists of two fundamental units:
 A service that provides metastore access to other Apache Hive services.
 Disk storage for the Hive metadata which is separate from HDFS storage.
 Hive Metastore Modes

There are three modes for Hive Metastore deployment:


 Embedded Metastore
 Local Metastore
 Remote Metastore

i. Embedded Metastore
 In Hive by default, metastore service runs in the same JVM as the Hive service. It
uses embedded derby database stored on the local file system in this mode. Thus
both metastore service and hive service runs in the same JVM by using
embedded Derby Database. But, this mode also has limitation that, as only one
embedded Derby database can access the database files on disk at any one time,

only one Hive session could be open at a time.


 If we try to start the second session it produces an error when it attempts to open
a connection to the metastore. So, to allow many services to connect the
Metastore, it configures Derby as a network server. This mode is good for unit
testing. But it is not good for the practical solutions.

 ii. Local Metastore


Hive is a data-warehousing framework, so Hive does not prefer single sessions.
To overcome this limitation of the Embedded Metastore, the Local Metastore was
introduced. This mode allows us to have many Hive sessions, i.e. many users can
use the metastore at the same time. We can achieve this by using any JDBC-compliant
database like MySQL, which runs in a separate JVM or on a different machine than the
Hive service and metastore service, which run in the same JVM.

 This configuration is called as local metastore because metastore service still runs
in the same process as the Hive. But it connects to a database running in a
separate process, either on the same machine or on a remote machine. Before
starting Apache Hive client, add the JDBC / ODBC driver libraries to the Hive lib
folder.
iii. Remote Metastore
 Moving further, another metastore configuration called Remote Metastore. In
this mode, metastore runs on its own separate JVM, not in the Hive service JVM.
If other processes want to communicate with the metastore server they can
communicate using Thrift Network APIs. We can also have more metastore
servers in this case to provide higher availability.
 This also brings better manageability/security because the database tier can be
completely firewalled off. And the clients no longer need to share database
credentials with each Hive user to access the metastore database.
Comparison with Traditional Database
Schema on Read Versus Schema on Write
 In a traditional database, a table’s schema is enforced at data load time. If the
data being loaded doesn’t conform to the schema, then it is rejected. This design
is sometimes called schema on write, since the data is checked against the
schema when it is written into the database.
 Hive, on the other hand, doesn’t verify the data when it is loaded, but rather
when a query is issued. This is called schema on read. There are trade-offs
between the two approaches. Schema on read makes for a very fast initial load,
since the data does not have to be read, parsed, and serialized to disk in the
database’s internal format.
 The load operation is just a file copy or move. It is more flexible, too: consider
having two schemas for the same underlying data, depending on the analysis
being performed. (This is possible in Hive using external tables, see “Managed
Tables and External Tables”)
 The trade-off, however, is that it takes longer to load data into the database.
Furthermore, there are many scenarios where the schema is not known at load
time, so there are no indexes to apply, since the queries have not been formulated
yet. These scenarios are where Hive shines.
Updates, Transactions, and Indexes
 Updates, transactions, and indexes are mainstays of traditional databases. Yet,
until recently, these features have not been considered a part of Hive’s feature set.
This is because Hive was built to operate over HDFS data using MapReduce,
where full-table scans are the norm and a table update is achieved by
transforming the data into a new table.
 However, there are workloads where updates (or insert appends, at least) are
needed, or where indexes would yield significant performance gains. On the
transactions front, Hive doesn’t define clear semantics for concurrent access to
tables, which means applications need to build their own application-level
concurrency or locking mechanism.
 The Hive team is actively working on improvements in all these areas. Change is
also coming from another direction: HBase integration. HBase ( HBase Chapter )
has different storage characteristics to HDFS, such as the ability to do row
updates and column indexing, so we can expect to see these features used by
Hive in future releases.
Hive versus a traditional database:
 Schema: Hive is schema on READ; it does not verify the schema while the data is
loaded. A traditional database is schema on WRITE; the table schema is enforced at
data load time, i.e. if the data being loaded does not conform to the schema it is rejected.
 Scalability: Hive is very easily scalable at low cost. A traditional database is not as
scalable, and scaling up is costly.
 Access pattern: Hive is based on the Hadoop notion of write once, read many times. In
a traditional database we can read and write many times.
 Updates: Record-level updates are not possible in Hive. In a traditional database,
record-level updates, insertions and deletes, transactions and indexes are possible.
 Workload: OLTP (On-line Transaction Processing) is not yet supported in Hive, but it
supports OLAP (On-line Analytical Processing). Both OLTP and OLAP are supported in
an RDBMS.

HiveQL

The following list compares SQL and HiveQL feature by feature:
 Updates: SQL: UPDATE, INSERT, DELETE. HiveQL: UPDATE, INSERT, DELETE.
 Transactions: SQL: Supported. HiveQL: Limited support.
 Indexes: SQL: Supported. HiveQL: Supported.
 Data types: SQL: Integral, floating-point, fixed-point, text and binary strings, temporal.
HiveQL: Boolean, integral, floating-point, fixed-point, text and binary strings, temporal,
array, map, struct.
 Functions: SQL: Hundreds of built-in functions. HiveQL: Hundreds of built-in functions.
 Multitable inserts: SQL: Not supported. HiveQL: Supported.
 CREATE TABLE ... AS SELECT: SQL: Not supported. HiveQL: Supported.
 SELECT: SQL: Supported. HiveQL: Supported, with the SORT BY clause for partial
ordering and LIMIT to restrict the number of rows returned.
 Joins: SQL: Supported. HiveQL: Inner joins, outer joins, semi joins, map joins, cross joins.
 Subqueries: SQL: Used in any clause. HiveQL: Used in FROM, WHERE, or HAVING clauses.
 Views: SQL: Updatable. HiveQL: Read-only.

HiveQL
 Hive’s SQL language is known as HiveQL; it is a combination of SQL-92, Oracle’s
SQL language, and MySQL.
 HiveQL provides some improved features over previous versions of the SQL
standard, like analytic functions from SQL:2003.
 Hive adds some extensions of its own, like multitable inserts, TRANSFORM, MAP and
REDUCE.
 Hive is an open source data warehouse system used for querying and analyzing
large datasets. Data in Apache Hive can be categorized into Tables, Partitions, and
Buckets. A table in Hive is logically made up of the data being stored. Hive has
two types of tables: the Managed Table (Internal Table) and the External Table.
 Hive Managed Tables-
It is also known as an internal table. When we create a table in Hive, it by default
manages the data. This means that Hive moves the data into its warehouse
directory.
 Hive External Tables-
We can also create an external table. It tells Hive to refer to the data that is at an
existing location outside the warehouse directory.
 Here we are going to cover the comparison between Hive internal tables and
external tables on the basis of different features. Let’s discuss them one by one-

LOAD and DROP Semantics


 We can see the main difference between the two table type in the LOAD and
DROP semantics.
 Managed Tables –When we load data into a Managed table, then Hive moves
data into Hive warehouse directory.

CREATE TABLE managed_table (dummy STRING);


 LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
 This moves the file hdfs://user/tom/data.txt into Hive’s warehouse directory for
the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
Further, if we drop the table using:

DROP TABLE managed_table


 Then this will delete the table metadata including its data. The data no longer
exists anywhere. This is what it means for HIVE to manage the data.
 External Tables –External table behaves differently. In this, we can control the
creation and deletion of the data. The location of the external data is specified at
the table creation time:

CREATE EXTERNAL TABLE external_table(dummy STRING)


 LOCATION '/user/tom/external_table';
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

 Now, with the EXTERNAL keyword, Apache Hive knows that it is not managing
the data. So it doesn’t move data to its warehouse directory. It does not even
check whether the external location exists at the time it is defined. This is a very
useful feature because it means we can create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will
leave the data untouched and only delete the metadata.

ii. Security
 Managed Tables –Hive solely controls the Managed table security. Within Hive,
security needs to be managed; probably at the schema level (depends on
organization).
 External Tables –These tables’ files are accessible to anyone who has access to
HDFS file structure. So, it needs to manage security at the HDFS file/folder level.
iii. When to use Managed and external table
 Use Managed table when –
 We want Hive to completely manage the lifecycle of the data and table.
 Data is temporary

 Use External table when –


Data is used outside of Hive. For example, the data files are read and processed
by an existing program that does not lock the files.
We are not creating a table based on the existing table.
 We need data to remain in the underlying location even after a DROP TABLE.
This may apply if we are pointing multiple schemas at a single data set.
 Hive shouldn’t own the data and control settings, directories, etc.; we may have
another program or process that will do those things.

 HiveQL SELECT Statement


SELECT [ALL | DISTINCT] select_expr, select_expr, … FROM table_reference
 [WHERE where_condition]
 [GROUP BY col_list]
 [HAVING having_condition]
 [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
 [LIMIT number]
hive> SELECT * FROM employee WHERE salary>30000;
Create Table Statement
 Create Table is a statement used to create a table in Hive. The syntax and
example are as follows: Syntax
 CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]
table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

 hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary
String, destination String)
 COMMENT 'Employee details'
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;
 ALTER TABLE name RENAME TO new_name
 ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
 ALTER TABLE name DROP [COLUMN] column_name
 ALTER TABLE name CHANGE column_name new_name new_type
 ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Drop Table Statement


 The syntax is as follows:DROP TABLE [IF EXISTS] table_name;
 The following query drops a table named employee:
 hive> DROP TABLE IF EXISTS employee;
 JOIN is a clause that is used for combining specific fields from two tables by
using values common to each one. It is used to combine records from two or more
tables in the database.
database.
join_table:
 table_reference JOIN table_factor [join_condition]| table_reference
{LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition |
table_reference LEFT SEMI JOIN table_reference join_condition| table_reference
CROSS JOIN table_reference [join_condition]
 Example
 We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS..
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:


+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+

 The following query executes JOIN on the CUSTOMER and ORDER tables, and
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN
 The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if
there are no matches in the right table. This means, if the ON clause matches 0
(zero) records in the right table, the JOIN still returns a row in the result, but
with NULL in each column from the right table.
 A LEFT JOIN returns all the values from the left table, plus the matched values
from the right table, or NULL in case of no matching JOIN predicate.
 The following query demonstrates LEFT OUTER JOIN between CUSTOMER and
ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+

RIGHT OUTER JOIN


 The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even
if there are no matches in the left table. If the ON clause matches 0 (zero) records
in the left table, the JOIN still returns a row in the result, but with NULL in each
column from the left table.
 A RIGHT JOIN returns all the values from the right table, plus the matched
values from the left table, or NULL in case of no matching join predicate.
 The following query demonstrates RIGHT OUTER JOIN between the
CUSTOMER and ORDER tables.
 hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM
CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
 On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik |1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
 FULL OUTER JOIN
 The HiveQL FULL OUTER JOIN combines the records of both the left and the
right outer tables that fulfil the JOIN condition. The joined table contains either all
the records from both the tables, or fills in NULL values for missing matches on
either side.
 The following query demonstrates FULL OUTER JOIN between CUSTOMER
and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

 On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

Hbase Concept
Introduction of HBase
 HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
 HBase is a data model that is similar to Google’s big table designed to provide
quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS).
 It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
 One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.

HBase and HDFS


 HDFS is a distributed file system suitable for storing large files. HBase is a database
built on top of HDFS.
 HDFS does not support fast individual record lookups. HBase provides fast lookups
for larger tables.
 HDFS provides high-latency batch processing. HBase provides low-latency access to
single rows from billions of records (random access).
 HDFS provides only sequential access to data. HBase internally uses hash tables and
provides random access, and it stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase


 HBase is a column-oriented database and the tables in it are sorted by row. The
table schema defines only column families, which are the key-value pairs. A table
can have multiple column families and each column family can have any number of
columns. Subsequent column values are stored contiguously on the disk. Each
cell value of the table has a timestamp. In short, in an HBase:
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.
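 To make the table/row/column-family model concrete, the sketch below writes and
reads back a single cell using the HBase client API (shown in Scala against the Java
client; it assumes an HBase 1.x-style client and a hypothetical table named employee
with a column family personal that has already been created):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseCellExample {
  def main(args: Array[String]): Unit = {
    // Connection settings (ZooKeeper quorum etc.) are read from hbase-site.xml on the classpath
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("employee"))

    // Write one cell: row key "row1", column family "personal", column "name"
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Ramesh"))
    table.put(put)

    // Random read of the same row and cell
    val result = table.get(new Get(Bytes.toBytes("row1")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")))
    println("personal:name = " + name)

    table.close()
    connection.close()
  }
}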
Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of
data, rather than as rows of data. In short, they have column families.
 Row-Oriented Database: It is suitable for Online Transaction Processing (OLTP). Such
databases are designed for a small number of rows and columns.
 Column-Oriented Database: It is suitable for Online Analytical Processing (OLAP).
Column-oriented databases are designed for huge tables.
The following image shows column families in a column-oriented database:

Features of HBase
 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent read and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has easy java API for client.
 It provides data replication across clusters.

Where to Use HBase


 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable.
Bigtable acts upon the Google File System; likewise, Apache HBase works on top of
Hadoop and HDFS.
Applications of HBase
 It is used whenever there is a need to write heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Zookeeper
 Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
 Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers.
 In addition to availability, the nodes are also used to track server failures or
network partitions.
 Clients communicate with region servers via zookeeper.
 In pseudo and standalone modes, HBase itself will take care of zookeeper.
 A distributed application can run on multiple systems in a network at a given
time (simultaneously) by coordinating among themselves to complete a
particular task in a fast and efficient manner. Normally, complex and time-
consuming tasks, which will take hours to complete by a non-distributed
application (running in a single system) can be done in minutes by a distributed
application by using computing capabilities of all the system involved.
 The time to complete the task can be further reduced by configuring the
distributed application to run on more systems. A group of systems in which a
distributed application is running is called a Cluster and each machine running
in a cluster is called a Node.
 A distributed application has two parts, Server and Client application. Server
applications are actually distributed and have a common interface so that clients
can connect to any server in the cluster and get the same result. Client
applications are the tools to interact with a distributed application.

Benefits of Distributed Applications
 Reliability − Failure of a single or a few systems does not make the whole system
to fail.
 Scalability − Performance can be increased as and when needed by adding more
machines with minor change in the configuration of the application with no
downtime.
 Transparency − Hides the complexity of the system and shows itself as a single
entity / application.
Challenges of Distributed Applications
 Race condition − Two or more machines trying to perform a particular task,
which actually needs to be done only by a single machine at any given time. For
example, shared resources should only be modified by a single machine at any
given time.
 Deadlock − Two or more operations waiting for each other to complete
indefinitely.
 Inconsistency − Partial failure of data.
Benefits of ZooKeeper
 Simple distributed coordination process.
 Synchronization − Mutual exclusion and co-operation between server processes.
This helps in Apache HBase for configuration management.
 Ordered messages.
 Serialization − Encode the data according to specific rules and ensure your
application runs consistently. This approach can be used in MapReduce to
coordinate the queue when executing running threads.
 Reliability.
 Atomicity − Data transfer either succeeds or fails completely, but no transaction is
partial.
 Before going deep into the working of ZooKeeper, let us take a look at the
fundamental concepts of ZooKeeper.
Architecture of ZooKeeper
 Take a look at the following diagram. It depicts the “Client-Server Architecture” of
ZooKeeper.
Each one of the components that is a part of the ZooKeeper architecture has been
explained in the following table.
Part and Description:
 Client: Clients, one of the nodes in our distributed application cluster, access
information from the server. For a particular time interval, every client sends a message
to the server to let the server know that the client is alive. Similarly, the server sends an
acknowledgement when a client connects. If there is no response from the connected
server, the client automatically redirects the message to another server.
 Server: One of the nodes in our ZooKeeper ensemble; it provides all the services to
clients and gives an acknowledgement to the client to inform it that the server is alive.
 Ensemble: A group of ZooKeeper servers. The minimum number of nodes that is
required to form an ensemble is 3.
 Leader: The server node which performs automatic recovery if any of the connected
nodes fails. Leaders are elected on service startup.
 Follower: A server node which follows the leader’s instructions.
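 As a small illustration of how a client coordinates through ZooKeeper, the sketch below
creates a znode and reads it back using the ZooKeeper Java client (written in Scala; the
connection string localhost:2181 and the znode path /app_config are assumed examples):

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZooKeeperExample {
  def main(args: Array[String]): Unit = {
    // Connect to a ZooKeeper server; the Watcher receives session and node events
    val zk = new ZooKeeper("localhost:2181", 3000, new Watcher {
      override def process(event: WatchedEvent): Unit = println("event: " + event)
    })

    // Create a persistent znode holding a small piece of shared configuration
    zk.create("/app_config", "value=1".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE,
      CreateMode.PERSISTENT)

    // Any client in the cluster can now read (and watch) the same znode
    val data = new String(zk.getData("/app_config", false, null))
    println("read back: " + data)

    zk.close()
  }
}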
Pig Architecture
 Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running PigLatin programs.
 A Pig Latin program consists of a series of operations or transformations which
are applied to the input data to produce output. These operations describe a data
flow which is translated into an executable representation, by Pig execution
environment. Underneath, results of these transformations are series of
MapReduce jobs which a programmer is unaware of. So, in a way, Pig allows the
programmer to focus on data rather than the nature of execution.

Pig has two execution modes:


1. Local mode: In this mode, Pig runs in a single JVM and makes use of local file
system. This mode is suitable only for analysis of small datasets using Pig
2. Map Reduce mode: In this mode, queries written in Pig Latin are translated into
MapReduce jobs and are run on a Hadoop cluster (cluster may be pseudo or fully
distributed). MapReduce mode with the fully distributed cluster is useful of
running Pig on large datasets.
Pig can be used for following purposes:
 ETL data pipeline
 Research on raw data
 Iterative processing.

Running Pig Programs


 There are three ways of executing Pig programs, all of which work in both local
and MapReduce mode:
Script
 Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig. Alternatively, for very short scripts,
you can use the -e option to run a script specified as a string on the command
line.
Grunt
 Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
 You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java; a sketch follows below.
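 The following is only a minimal sketch of this embedded mode using the PigServer class
(written in Scala against Pig's Java API; the input and output paths are assumed
examples):

import org.apache.pig.{ExecType, PigServer}

object EmbeddedPig {
  def main(args: Array[String]): Unit = {
    // Run Pig in local mode from inside a JVM program
    val pig = new PigServer(ExecType.LOCAL)

    // Register Pig Latin statements just as you would type them in Grunt
    pig.registerQuery("Lines = LOAD 'input/hadoop.log' AS (line:chararray);")
    pig.registerQuery("Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;")

    // Materialize a relation into a directory on the (local) file system
    pig.store("Words", "output/words")
    pig.shutdown()
  }
}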

Example: Word count in Pig


Lines = LOAD 'input/hadoop.log' AS (line:chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words);
Results = ORDER Counts BY $1 DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Difference between hive and pig

 Hive is used for data analysis; Pig is used for data and programs.
 Hive works with structured data; Pig works with semi-structured data.
 Hive uses HiveQL; Pig uses Pig Latin.
 Hive is used for creating reports; Pig is used for programming.
 Hive works on the server side; Pig works on the client side.
 Hive does not support Avro; Pig supports Avro.
4.0 SPARK
Introduction of Spark
 Spark extends the popular MapReduce model to efficiently support more types
of computations, including interactive queries and stream processing.
 Speed is important in processing large datasets, as it means the difference
between exploring data interactively and waiting minutes or hours.
 One of the main features Spark offers for speed is the ability to run computations
in memory, but the system is also more efficient than MapReduce for complex
applications running on disk.
 Spark is designed to be highly accessible, offering simple APIs in Python, Java,
Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big
Data tools.
 In particular, Spark can run in Hadoop clusters and access any Hadoop data
source, including Cassandra.

Spark Stack
 Spark Core
 Spark Core contains the basic functionality of Spark, including components for
task scheduling, memory management, fault recovery, interacting with storage
systems, and more.
 Spark Core is also home to the API that defines resilient distributed datasets
(RDDs), which are Spark’s main programming abstraction. RDDs represent a
collection of items distributed across many compute nodes that can be
manipulated in parallel. Spark Core provides many APIs for building and
manipulating these collections.

 Spark SQL
 Spark SQL is Spark’s package for working with structured data. It allows
querying data via SQL as well as the Apache Hive variant of SQL—called the
Hive Query Language (HQL)—and it supports many sources of data, including
Hive tables, Parquet, and JSON.
 Beyond providing a SQL interface to Spark, Spark SQL allows developers to
intermix SQL queries with the programmatic data manipulations supported by
RDDs in Python, Java, and Scala, all within a single application, thus combining
SQL with complex analytics.
 This tight integration with the rich computing environment provided by Spark
makes Spark SQL unlike any other open source data warehouse tool. Spark SQL
was added to Spark in version 1.0.
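 The short program below is a minimal sketch of this idea, using the newer Spark 2.x
SparkSession API rather than the 1.0-era API described above; the input file people.json
is an assumed example:

import org.apache.spark.sql.SparkSession

object SparkSQLExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQLExample").master("local[*]").getOrCreate()

    // Load structured data (here, a JSON file of people records) into a DataFrame
    val people = spark.read.json("people.json")

    // Register it as a temporary view and query it with plain SQL
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // Mix the SQL result with programmatic manipulation
    adults.show()
    println("number of adults: " + adults.count())

    spark.stop()
  }
}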
 Shark was an older SQL-on-Spark project out of the University of California,
Berkeley, that modified Apache Hive to run on Spark. It has now been replaced
by Spark SQL to provide better integration with the Spark engine and language
APIs.
 Spark Streaming
 Spark Streaming is a Spark component that enables processing of live streams of
data.
 Examples of data streams include logfiles generated by production web servers,
or queues of messages containing status updates posted by users of a web service.
 Spark Streaming provides an API for manipulating data streams that closely
matches the Spark Core’s RDD API, making it easy for programmers to learn the
project and move between applications that manipulate data stored in memory,
on disk, or arriving in real time.
 Underneath its API, Spark Streaming was designed to provide the same degree
of fault tolerance, throughput, and scalability as Spark Core.
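 A minimal streaming word count sketch is shown below (Scala; it assumes text lines
arriving on a TCP socket at localhost:9999, which is only an example source):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

    // Process the live stream in 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // DStream of text lines received on a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // The DStream API closely mirrors the RDD API: flatMap, map, reduceByKey, ...
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}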

 MLlib
 Spark comes with a library containing common machine learning (ML)
functionality, called MLlib. MLlib provides multiple types of machine learning
algorithms, including classification, regression, clustering, and collaborative
filtering, as well as supporting functionality such as model evaluation and data
import.
 It also provides some lower-level ML primitives, including a generic gradient
descent optimization algorithm. All of these methods are designed to scale out
across a cluster.
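 As a small illustration of MLlib, the sketch below clusters a few 2-dimensional points
with k-means (Scala; the data points are made up purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample").setMaster("local[*]"))

    // A tiny in-memory dataset of 2-dimensional points
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.5, 8.5)))

    // Cluster the points into k = 2 groups with at most 20 iterations
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}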

 GraphX
 GraphX is a library for manipulating graphs (e.g., a social network’s friend graph)
and performing graph-parallel computations.
 Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API,
allowing us to create a directed graph with arbitrary properties attached to each
vertex and edge.
 GraphX also provides various operators for manipulating graphs (e.g., subgraph
and mapVertices) and a library of common graph algorithms (e.g., PageRank and
triangle counting).
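 A minimal GraphX sketch is given below (Scala); the tiny "follows" graph is invented
only to show the vertex/edge construction and one built-in algorithm:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))

    // Vertices carry an arbitrary property (a user name); edges carry a relationship label
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // Run the built-in PageRank algorithm and print the rank of each vertex
    graph.pageRank(0.001).vertices.collect().foreach(println)

    sc.stop()
  }
}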
 Data Analysis with Spark
 Data Science Tasks
 Data science, a discipline that has been emerging over the past few years, centers
on analyzing data. While there is no standard definition, for our purposes a data
scientist is somebody whose main task is to analyze and model data.
 Data scientists may have experience with SQL, statistics, predictive modeling
(machine learning), and programming, usually in Python, Matlab, or R.
 Data scientists also have experience with techniques necessary to transform data
into formats that can be analyzed for insights (sometimes referred to as data
wrangling).
 Data scientists use their skills to analyze data with the goal of answering a
question or discovering insights.
 Oftentimes, their workflow involves ad hoc analysis, so they use interactive
shells (versus building complex applications) that let them see results of queries
and snippets of code in the least amount of time.
 Spark’s speed and simple APIs shine for this purpose, and its built-in libraries
mean that many algorithms are available out of the box.
 Spark supports the different tasks of data science with a number of components.
The Spark shell makes it easy to do interactive data analysis using Python or
Scala.
 Spark SQL also has a separate SQL shell that can be used to do data exploration
using SQL, or Spark SQL can be used as part of a regular Spark program or in the
Spark shell.
 Machine learning and data analysis is supported through the MLLib libraries. In
addition, there is support for calling out to external programs in Matlab or R.
 Spark enables data scientists to tackle problems with larger data sizes than they
could before with tools like R or Pandas.
 Sometimes, after the initial exploration phase, the work of a data scientist will be
“productized,” or extended, hardened (i.e., made fault-tolerant), and tuned to
become a production data processing application, which itself is a component of
a business application.
 For example, the initial investigation of a data scientist might lead to the creation
of a production recommender system that is integrated into a web application
and used to generate product suggestions to users. Often it is a different person
or team that leads the process of productizing the work of the data scientists, and
that person is often an engineer.
 Data Processing Applications
 The other main use case of Spark can be described in the context of the engineer
persona. For our purposes here, we think of engineers as a large class of software
developers who use Spark to build production data processing applications.
 These developers usually have an understanding of the principles of software
engineering, such as encapsulation, interface design, and object-oriented
programming. They frequently have a degree in computer science.
 They use their engineering skills to design and build software systems that
implement a business use case. For engineers, Spark provides a simple way to
parallelize these applications across clusters, and hides the complexity of
distributed systems programming, network communication, and fault tolerance.
 The system gives them enough control to monitor, inspect, and tune applications
while allowing them to implement common tasks quickly. The modular nature
of the API (based on passing distributed collections of objects) makes it easy to
factor work into reusable libraries and test it locally.
Resilient Distributed Datasets (RDDs)
 RDDs are the main logical data unit in Spark. They are a distributed collection of
objects, which are stored in memory or on disks of different machines of a cluster.
 A single RDD can be divided into multiple logical partitions so that these partitions
can be stored and processed on different machines of a cluster.
 RDDs are immutable (read-only) in nature. You cannot change an original RDD,
but you can create new RDDs by performing coarse-grain operations, like
transformations, on an existing RDD.

Features of RDD
 Resilient: RDDs track data lineage information to recover the lost data,
automatically on failure. It is also called Fault tolerance.
 Distributed: Data present in the RDD resides on multiple nodes. It is distributed
across different nodes of a cluster.
 Lazy Evaluation: Data does not get loaded in the RDD even if we define it.
Transformations are actually computed when you call an action, like count or
collect, or save the output to a file system.
 Immutability: Data stored in the RDD is in read-only mode; you cannot edit the
data which is present in the RDD. But you can create new RDDs by performing
transformations on the existing RDDs.
 In-memory Computation: RDD stores any intermediate data that is generated in
memory (RAM) rather than on disk, so that it provides faster access.
 Partitioning: Partitions can be done on any existing RDDs to create logical parts
that are mutable. You can achieve this by applying transformations on existing
partitions.
There are two basic operations which can be done on RDDs. They are:
1. Transformations
2. Actions

 Transformations: These are functions which accept existing RDDs as the input
and outputs one or more RDDs. The data in the existing RDDs does not change
as it is immutable. Some of the transformation operations are shown in the table
given below:

 map(): Returns a new RDD by applying the function on each data element.
 filter(): Returns a new RDD formed by selecting those elements of the source on
which the function returns true.
 reduceByKey(): Used to aggregate values of a key using a function.
 groupByKey(): Used to convert a (key, value) pair to a (key, <iterable value>) pair.
 union(): Returns a new RDD that contains all elements and arguments from the
source RDD.
 intersection(): Returns a new RDD that contains an intersection of the elements in
the datasets.

 These transformations are executed when they are invoked or called. Every time
transformations are applied, a new RDD is created.
 Actions: Actions in Spark are functions which return the end result of RDD
computations. It uses a lineage graph to load the data onto the RDD in a
particular order. After all transformations are done, actions return the final result
to the Spark Driver. Actions are operations which provide non-RDD values.
Some of the common actions used in Spark are:
Functions            Description
count()              Gets the number of data elements in an RDD
collect()            Gets all data elements in the RDD as an array
reduce()             Aggregates data elements of the RDD by taking two arguments and
                     returning one
take(n)              Used to fetch the first n elements of the RDD
foreach(operation)   Used to execute the operation for each data element in the RDD
first()              Retrieves the first data element of the RDD
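
The following is a minimal PySpark sketch that ties together a few of the transformations and actions from the tables above; it assumes a SparkContext is already available as sc (for example, in the pyspark shell).

nums = sc.parallelize([1, 2, 3, 4, 5])          # create an RDD from a Python list
squares = nums.map(lambda x: x * x)             # transformation: returns a new RDD, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: lazily chained
print(evens.count())                            # action: triggers the computation, prints 2
print(evens.collect())                          # action: returns [4, 16] to the driver
print(nums.reduce(lambda a, b: a + b))          # action: aggregates the elements, prints 15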

Motivation
 Spark provides special types of operations on RDDs containing key/value pairs.
These RDDs are called pair RDDs. Pair RDDs are a useful building block in many
programs, as they expose operations that allow you to act on each key in parallel
or regroup data across the network.



Creating Pair RDDs
 Pair RDDs can be created by running a map() function that returns key/value
pairs. The procedure to build the key-value RDDs differs by language. In Python,
for the functions on keyed data to work, we need to return an RDD composed of
tuples. Creating a pair RDD using the first word as the key in Python:
pairs = lines.map(lambda x: (x.split(" ")[0], x))
 In Scala, for the functions on keyed data to be available, we also need to return
tuples as shown in the previous example. An implicit conversion on RDDs of
tuples exists to provide the additional key or value functions as per the
requirement.
 Creating a pair RDD using the first word as the key word in Scala
val pairs = lines.map(x => (x.split(" ")(0), x))

 Java doesn’t have a built-in tuple type, so Spark’s Java API has users create tuples
using the scala.Tuple2 class. Java users can construct a new tuple by writing new
Tuple2(elem1, elem2) and can then access its relevant elements with the _1() and
_2() methods.
 Java users also need to call special versions of Spark’s functions when creating
pair RDDs. For instance, the mapToPair() function should be used in place of the
basic map() function.

Creating a pair RDD using the first word as the key word in Java program.
PairFunction<String, String, String> keyData =
   new PairFunction<String, String, String>() {
      public Tuple2<String, String> call(String x) {
         return new Tuple2(x.split(" ")[0], x);
      }
   };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

Transformations on Pair RDDs


Aggregations
 When datasets are described in terms of key/value pairs, it is a common
requirement to aggregate statistics across all elements with the same key. Spark
has a set of operations that combine values that share the same key. These
operations return RDDs and thus are transformations rather than actions, e.g.
reduceByKey(), foldByKey(), combineByKey(), as in the sketch below.
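
As an illustration (not from the original text), a minimal word-count sketch using reduceByKey(), assuming a SparkContext named sc:

lines = sc.parallelize(["spark is fast", "spark is simple"])
# build (word, 1) pairs, then sum the counts per key with reduceByKey()
word_counts = (lines.flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
print(word_counts.collect())   # e.g. [('fast', 1), ('spark', 2), ('is', 2), ('simple', 1)]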
Grouping Data



 With keyed data, a common use case is to group our data sets with respect to a
predefined key value, for example, viewing all of a customer’s orders together in
one file.
 If our data is already keyed in the way we want, groupByKey() will group our
data using the key value in our RDD. On an RDD consisting of keys of type K and
values of type V, we get back an RDD of type [K, Iterable[V]].
 groupBy() works on unpaired data, or data where we want to use a different
condition besides equality on the current key. It takes a function that it applies to
every element in the source RDD and uses the result to determine the key value.
Joins
 Some of the most useful operations we get with keyed data come from using it
together with other keyed data. Joining datasets together is probably one of the
most common types of operations you will perform on a pair RDD. A sketch of
these joins follows this list.
 Inner join: Only keys that are present in both pair RDDs appear in the output.
 leftOuterJoin(): The resulting pair RDD has entries for each key in the source
RDD. The value associated with each key in the result is a tuple of the value from
the source RDD and an Option for the value from the other pair RDD.
 rightOuterJoin(): Almost identical to leftOuterJoin() except that the key must be
present in the other RDD and the tuple has an Option for the source rather than
the other RDD.
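
A small illustrative sketch of these joins in PySpark (assumes a SparkContext sc; the keys are made-up customer ids):

names = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Carol")])
orders = sc.parallelize([(1, "Book"), (1, "Pen"), (3, "Lamp")])
print(names.join(orders).collect())            # inner join: only keys present in both RDDs
print(names.leftOuterJoin(orders).collect())   # every key from names; missing orders appear as None
print(names.rightOuterJoin(orders).collect())  # every key from orders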
Sorting Data
 We can sort an RDD of key/value pairs provided that there is an ordering defined
on the key. Once we have sorted our data, any subsequent call on the sorted data
to collect() or save() will result in an ordered dataset, as in the sketch below.
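
For example, sorting a pair RDD by key could look like this (a sketch, assuming a SparkContext sc):

scores = sc.parallelize([("maths", 81), ("physics", 72), ("chemistry", 90)])
print(scores.sortByKey().collect())                  # ascending order of keys
print(scores.sortByKey(ascending=False).collect())   # descending order of keys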
Common Transformations and Actions
 The two most common transformations you will be using are the map() and
filter() transformations. The map() transformation takes in a function and applies
it to each element in the RDD, with the result of the function being the new value
of each element in the resulting RDD. The filter() transformation takes in a
function and returns an RDD that only has the elements that pass the filter()
function.



Machine learning with MLib
 Apache Spark comes with a library named MLlib to perform machine learning
tasks using the Spark framework. Since Apache Spark also has a Python API,
PySpark, we can use this library from PySpark as well. MLlib contains many
algorithms and machine learning utilities.
What is Machine Learning?
 Machine learning is one of the many applications of Artificial intelligence (AI)
where the primary aim is to enable the computers to learn automatically without
any human assistance.
 With the help of machine learning, computers are able to tackle the tasks that
were, until now, only handled and carried out by people.
 It’s basically a process of teaching a system, how to make accurate predictions
when fed the right data.
 It provides the ability to learn and improve from experience without being
specifically programmed for that task.
 Machine learning mainly focuses on developing the computer programs and
algorithms that make predictions and learn from the provided data.
What are dataframes?
 A dataframe is the newer API for Apache Spark. It is basically a distributed,
strongly-typed collection of data, that is, a dataset which is organised into named
columns. A dataframe is equivalent to what a table is for a relational database,
only with richer optimization options.

 How to create dataframes


There are multiple ways to create dataframes in Apache Spark (see the sketch after this list):
 Dataframes can be created using an existing RDD
 You can create a dataframe by loading a CSV file directly
 You can programmatically specify a schema to create a dataframe as well
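
A minimal sketch of these three approaches, assuming a SparkSession is available as spark (the CSV path and column names are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. From an existing RDD of tuples
rdd = spark.sparkContext.parallelize([("Walmart", 2300000), ("Apple", 116000)])
df1 = rdd.toDF(["Title", "Employees"])

# 2. Directly from a CSV file
df2 = spark.read.csv("companies.csv", header=True, inferSchema=True)

# 3. With a programmatically specified schema
schema = StructType([StructField("Title", StringType(), True),
                     StructField("Employees", IntegerType(), True)])
df3 = spark.createDataFrame(rdd, schema)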
 Basic intro to MLlib
 MLlib is short for machine learning library. Machine learning in PySpark is easy
to use and scalable. It works on distributed systems.
 We use machine learning in PySpark for data analysis.



Because of MLlib in Apache Spark, we get the benefit of various machine learning
algorithms such as regression, classification, etc.
Parameters in PySpark MLlib
Some of the main parameters of PySpark MLlib are listed below:
 Ratings: This parameter is used to create an RDD of ratings, rows or tuples
 Rank: It shows the number of features computed and ranks them
 Lambda: Lambda is a regularization parameter
 Blocks: Blocks are used to parallelize the number of computations. The default
value for this is -1. (See the sketch below for how these parameters are used.)
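
These parameters correspond to collaborative filtering with ALS in spark.mllib; the following is a hedged, illustrative sketch (assumes a SparkContext sc; the ratings are made up):

from pyspark.mllib.recommendation import ALS, Rating

# (user, product, rating) tuples wrapped as Rating objects
ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 3.0), Rating(2, 10, 5.0)])

# rank = number of latent features, lambda_ = regularization parameter
model = ALS.train(ratings, rank=5, iterations=10, lambda_=0.01)
print(model.predict(2, 20))   # predicted rating of product 20 by user 2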
Performing Linear Regression on a real world Dataset
 Let’s understand machine learning better by implementing full-fledged code to
perform linear regression on a dataset of the top 5 Fortune 500 companies in a
recent year.
Loading the data:
 As mentioned above, we are going to use a dataframe that we have created
directly from a CSV file. Following are the commands to load the data into a
dataframe and to view the loaded data.

Input: In [1]: from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)

 In [2]: company_df = sqlContext.read.format('com.databricks.spark.csv') \
   .options(header='true', inferschema='true') \
   .load('C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv')
company_df.take(1)
 You can choose the number of rows you want to view while displaying the data
of a dataframe. I have displayed the first row only.
 Output: Out[2]: [Row(Rank=1, Title='Walmart', Website='http:/www.walmart.com',
Employees=2300000, Sector='retailing')]
Data exploration:
 To check the datatype of every column of a dataframe and print the schema of
the dataframe in a tree format, you can use the following commands respectively.
Input: In [3]: company_df.cache(); company_df.printSchema()
Output: Out[3]: DataFrame[Rank: int, Title: string, Website: string, Employees: int,
Sector: string]
root
|-- Rank: integer (nullable = true)
|-- Title: string (nullable = true)
|-- Website: string (nullable = true)
|-- Employees: integer (nullable = true)
|-- Sector: string (nullable = true)
Performing Descriptive Analysis:
Input: In [4]: company_df.describe().toPandas().transpose()
Output: Out[4]:
0 1 2 3 4
Summary count mean stddev min max
Rank 5 3.0 1.581138830084 1 5
Title 5 None None Apple Walmart
Website 5 None None www.apple.com www.walmart.com
Employees 5 584880.0 966714.2168190142 68000 2300000
Sector 5 None None Energy Wholesalers
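
The handbook stops at data exploration; a hedged sketch of how the regression step itself could be performed with pyspark.ml, using Rank as the input feature and Employees as the label (column names as loaded above), is shown below.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# assemble the numeric input column(s) into a single feature vector
assembler = VectorAssembler(inputCols=["Rank"], outputCol="features")
train_df = assembler.transform(company_df).select("features", "Employees")

# fit an ordinary least-squares linear regression model
lr = LinearRegression(featuresCol="features", labelCol="Employees")
model = lr.fit(train_df)

print(model.coefficients, model.intercept)
print(model.summary.rootMeanSquaredError)   # simple goodness-of-fit check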
 Machine learning in Industry
 Computer systems with the ability to predict and learn from a given data and
improve themselves without having to be reprogrammed used to be a dream
only but in the recent years it has been made possible using machine learning.
 Now machine learning is a most used branch of artificial intelligence that is
being accepted by big industries in order to benefit their businesses.
Following are some of the organisations where machine learning has various use
cases:
 PayPal: PayPal uses machine learning to detect suspicious activity.
 IBM: There is a machine learning technology patented by IBM which helps to
decide when to handover the control of self-driving vehicle between a vehicle
control processor and a human driver.
 Google: Machine learning is used to gather information from the users, which is
further used to improve their search engine results.
 Walmart: Machine learning is used in Walmart to improve its efficiency.
 Amazon: Machine learning is used to design and implement personalised
product recommendations.
 Facebook: Machine learning is used to filter out poor quality content.



5.0 NOSQL
What is NOSQL
 NoSQL (commonly referred to as "Not Only SQL") represents a completely
different framework of databases that allows for high-performance, agile
processing of information at massive scale.
 In other words, it is a database infrastructure that has been very well adapted to
the heavy demands of big data.
 The efficiency of NoSQL can be achieved because unlike relational databases that
are highly structured, NoSQL databases are unstructured in nature, trading off
stringent consistency requirements for speed and agility.
 NoSQL centers around the concept of distributed databases, where unstructured
data may be stored across multiple processing nodes, and often across multiple
servers.
 This distributed architecture allows NoSQL databases to be horizontally scalable;
as data continues to explode, just add more hardware to keep up, with no
slowdown in performance.
 The NoSQL distributed database infrastructure has been the solution to handling
some of the biggest data warehouses on the planet – i.e. the likes of Google,
Amazon, and the CIA.
 Advantages of NoSQL
NoSQL databases provide various important advantages over traditional
relational databases. A few core features of NoSQL are listed here, which apply
to most NoSQL databases.
 Schema agnostic: NoSQL databases are schema agnostic. You aren’t required to
do a lot on designing your schema before you can store data in NoSQL databases.
You can start coding, and store and retrieve data without knowing how the
database stores and works internally. If you need advanced functionality, then
you can customise the schema manually before indexing the data. Schema
agnosticism may be the most significant difference between NoSQL and
relational databases.
 Scalability: NoSQL databases support horizontal scaling methodology that
makes it easy to add or reduce capacity quickly without tinkering with
commodity hardware. This eliminates the tremendous cost and complexity of
manual sharding that is necessary when attempting to scale RDBMS.



 Performance: Some databases are designed to operate best (or only) with
specialised storage and processing hardware. With a NoSQL database, you can
increase performance by simply adding cheaper servers, called commodity
servers. This helps organisations to continue to deliver reliably fast user
experiences with a predictable return on investment for adding resources again,
without the overhead associated with manual sharding.
 High availability: NoSQL databases are generally designed to ensure high
availability and avoid the complexity that comes with a typical RDBMS
architecture, which relies on primary and secondary nodes. Some ‘distributed’
NoSQL databases use a masterless architecture that automatically distributes
data equally among multiple resources so that the application remains available
for both read and write operations, even when one node fails.
 Global availability: By automatically replicating data across multiple servers,
data centres or cloud resources, distributed NoSQL databases can minimise
latency and ensure a consistent application experience wherever users are
located. An added benefit is a significantly reduced database management
burden of manual RDBMS configuration, freeing operations teams to focus on
other business priorities.
Types of NoSQL Databases
 Key-value stores are the simplest. Every item in the database is stored as an
attribute name (or "key") together with its value. Riak, Voldemort, and Redis are
the most well-known in this category.
 Wide-column stores store data together as columns instead of rows and are
optimized for queries over large datasets. The most popular are Cassandra and
HBase.
 Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents. MongoDB is the most popular of these
databases.
 Graph databases are used to store information about networks, such as social
connections. Examples are Neo4J and HyperGraphDB.

Key-value store NoSQL database


 Key-value stores are the simplest NoSQL data stores to use. The client can either
get the value for the key, assign a value for a key or delete a key from the data
store.
 The value is a blob that the data store just stores, without caring or knowing
what’s inside; it’s the responsibility of the application to understand what was
stored.
 Since key-value stores always use primary-key access, they generally have great
performance and can be easily scaled.



 The key-value database uses a hash table to store unique keys and pointers (in
some databases it’s also called the inverted index) with respect to each data value
it stores. There are no column type relations in the database; hence, its
implementation is easy. Key-value databases give great performance and can be
very easily scaled as per business needs.
Use cases: Here are some popular use cases of the key-value databases:
 For storing user session data
 Maintaining schema-less user profiles
 Storing user preferences
 Storing shopping cart data
Examples of this database are Redis, MemcacheDB and Riak.
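
As an illustration of the key-value model, a minimal sketch using the redis-py client (assumes a Redis server on localhost; the key and value are arbitrary):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("session:42", '{"user": "jdoe", "cart": ["book", "pen"]}')  # store a value under a key
print(r.get("session:42"))   # fetch the value back (returned as bytes)
r.delete("session:42")       # remove the key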

 Document store NoSQL database


 Document store NoSQL databases are similar to key-value databases in that
there’s a key and a value.
 Data is stored as a value. Its associated key is the unique identifier for that value.
 The difference is that, in a document database, the value contains structured or
semi-structured data. This structured/semi-structured value is referred to as a
document and can be in XML, JSON or BSON format.
Use cases: Document store databases are preferable for:
 E-commerce platforms
 Content management systems
 Analytics platforms
 Blogging platforms

Column store NoSQL database


 In column-oriented NoSQL databases, data is stored in cells grouped in columns
of data rather than as rows of data. Columns are logically grouped into column
families.
 Column families can contain a virtually unlimited number of columns that can
be created at runtime or while defining the schema. Read and write is done using
columns rather than rows.
 Column families are groups of similar data that is usually accessed together. As
an example, we often access customers’ names and profile information at the
same time, but not the information on their orders.
 The main advantages of storing data in columns over relational DBMS are fast
search/access and data aggregation. Relational databases store a single row as a
continuous disk entry.
 Different rows are stored in different places on the disk while columnar
databases store all the cells corresponding to a column as a continuous disk entry,
thus making the search/access faster.
Use cases: Developers mainly use column databases in:
 Content management systems
 Blogging platforms
 Systems that maintain counters
 Services that have expiring usage
 Systems that require heavy write requests (like log aggregators)

Graph base NoSQL database


 Graph databases are basically built upon the Entity – Attribute – Value model.
Entities are also known as nodes, which have properties. It is a very flexible way
to describe how data relates to other data.
 Nodes store data about each entity in the database, relationships describe how
nodes are connected, and properties store additional information about the nodes
and relationships.
 Whereas a traditional database stores a description of each possible relationship
in foreign key fields or junction tables, graph databases allow for virtually any
relationship to be defined on-the-fly.
Use cases: Graph base NoSQL databases are usually used in:
 Fraud detection
 Graph based search
 Network and IT operations
 Social networks, etc
Examples of graph base NoSQL databases are Neo4j, ArangoDB and OrientDB.

Session Store
 Managing session data using relational database is very difficult, especially in
case where applications are grown very much.
 In such cases the right approach is to use a global session store, which manages
session information for every user who visits the site.
 NoSQL is suitable for storing such web application session information, which is
very large in size.
 Since the session data is unstructured in form, it is easy to store it in schema-less
documents rather than in relational database records.

User Profile Store


 To enable online transactions, user preferences, user authentication and more,
web and mobile applications need to store user profiles.
 In recent times the number of users of web and mobile applications has grown
very rapidly. A relational database cannot handle such a large and rapidly growing
volume of user profile data, as it is limited to a single server.
 Using NOSQL capacity can be easily increased by adding server, which makes
scaling cost effective.



Content and Metadata Store
 Many companies like publication houses require a place where they can store
large amount of data, which include articles, digital content and e-books, in order
to merge various tools for learning in single platform.
 For content-based applications, metadata is very frequently accessed data that
needs low response times.
 For building applications based on content, NoSQL provides flexibility, faster
access to data, and the ability to store different types of content.

Mobile Applications
 Since the smartphone users are increasing very rapidly, mobile applications face
problems related to growth and volume.
 Using NoSQL database mobile application development can be started with
small size and can be easily expanded as the number of user increases, which is
very difficult if you consider relational databases.
 Since NoSQL databases store the data in a schema-less form, the application
developer can update the apps without having to make major modifications in the
database.
 Mobile app companies like Kobo and Playtika use NoSQL to serve millions of
users across the world.

Third-Party Data Aggregation


 Frequently a business requires access to data produced by a third party. For
instance, a consumer packaged goods company may need to get sales data from
stores as well as shoppers’ purchase histories.
 In such scenarios, NoSQL databases are suitable, since NoSQL databases can
manage huge amount of data which is generating at high speed from various
data sources.

Internet of Things
 Today, billions of devices are connected to internet, such as smartphones, tablets,
home appliances, systems installed in hospitals, cars and warehouses. For such
devices a large volume and variety of data is generated, and it keeps growing.
 Relational databases are unable to store such data. NoSQL permits organizations
to provide concurrent access to data from billions of connected devices and
systems, store huge amounts of data and meet the required performance.
E-Commerce
 E-commerce companies use NoSQL to store huge volumes of data and handle a
large number of user requests.
Social Gaming



 Data-intensive applications such as social games can grow to millions of users.
Such growth in the number of users, as well as in the amount of data, requires a
database system which can store the data and be scaled to accommodate the
growing number of users; NoSQL is suitable for such applications.
 NoSQL has been used by some of the mobile gaming companies like Electronic
Arts, Zynga and Tencent.
Ad Targeting
 Displaying ads or offers on the current web page is a decision with a direct impact
on income. To determine which group of users to target and where on the web
page to display ads, the platform gathers behavioral and demographic
characteristics of users.
 A NoSQL database enables ad companies to track user details and also place the
ads very quickly, which increases the probability of clicks.
 AOL, Mediamind and PayPal are some of the ad-targeting companies which use
NoSQL.

Advantages of NoSQL
 NOSQL provides high level of scalability.
 It is used in distributed computing environment.
 Implementation is less costly.
 It provides storage for semi-structured data and also provides flexibility in schema.
 Relationships are less complicated
 The advantages of NOSQL also include being able to handle :
 Large volumes of structured, semi-structured and unstructured data.
 Object-oriented algorithms permit implementations in order to achieve the
maximum availability over multiple data centers.
 Eventual-consistency based systems scale update workloads better than
traditional OLAP RDBMS, while also scaling to very large datasets.
 Programming that is easy to use and flexible. Efficient, scale-out
architecture instead of expensive, monolithic architecture
Differences between SQL and NoSQL databases:
1) SQL: Databases are categorized as Relational Database Management Systems (RDBMS).
   NoSQL: Databases are categorized as non-relational or distributed database systems.
2) SQL: Databases have a fixed, static or predefined schema.
   NoSQL: Databases have a dynamic schema.
3) SQL: Databases display data in the form of tables, so they are known as table-based databases.
   NoSQL: Databases display data as collections of key-value pairs, documents, graphs or wide-column stores.
4) SQL: Databases are vertically scalable.
   NoSQL: Databases are horizontally scalable.
5) SQL: Databases use a powerful language, "Structured Query Language", to define and manipulate the data.
   NoSQL: Collections of documents are used to query the data; this is also called unstructured query language, and it varies from database to database.
6) SQL: Databases are best suited for complex queries.
   NoSQL: Databases are not so good for complex queries because these are not as powerful as SQL queries.
7) SQL: Databases are not best suited for hierarchical data storage.
   NoSQL: Databases are best suited for hierarchical data storage.
8) SQL: MySQL, Oracle, SQLite, PostgreSQL, MS-SQL, etc. are examples of SQL databases.
   NoSQL: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j, CouchDB, etc. are examples of NoSQL databases.

NewSQL
 NewSQL is a class of modern relational database management systems that seek
to provide the same scalable performance of NoSQL systems for online
transaction processing (OLTP) read-write workloads while still maintaining the
ACID guarantees of a traditional database system.
 NewSQL systems vary greatly in their internal architectures, but the two
distinguishing features common amongst them are that they all support the
relational data model and use SQL as their primary interface.
 The applications typically targeted by these NewSQL systems are characterized
by being OLTP, that is, having a large number of transactions that (1) are
short-lived (i.e., no user stalls), (2) touch a small subset of data using index
lookups (i.e., no full table scans or large distributed joins), and (3) are
repetitive (i.e., execute the same queries with different inputs).
 However, some of the NewSQL databases are also HTAP systems, therefore,
supporting hybrid transactional/analytical workloads.
 These NewSQL systems achieve high performance and scalability by eschewing
much of the legacy architecture of the original IBM System R design, such as
heavyweight recovery or concurrency control algorithms.
One of the first known NewSQL systems is the H-Store parallel database system.

Advantages of NoSQL
 Large volumes of structured, semi-structured, and unstructured data
 Agile sprints, quick iteration, and frequent code pushes
 Object-oriented programming that is easy to use and flexible
 Efficient, scale-out architecture instead of expensive, monolithic architecture



Disadvantages of NewSQL
 No NewSQL systems are as general-purpose as traditional SQL systems set out
to be.
 In-memory architectures may be inappropriate for volumes exceeding a few
terabytes.
 Offers only partial access to the rich tooling of traditional SQL systems.

6.0 Database for The Modern Web


Introduction of MongoDB
 MongoDB is a document-oriented NoSQL database used for high volume data
storage. MongoDB is a database which came into light around the mid-2000s.
 It falls under the category of a NoSQL database.
MongoDB Features
 Each database contains collections which in turn contains documents. Each
document can be different with a varying number of fields. The size and content
of each document can be different from each other.
 The document structure is more in line with how developers construct their
classes and objects in their respective programming languages. Developers will



often say that their classes are not rows and columns but have a clear structure
with key-value pairs.
 As seen in the introduction with NoSQL databases, the rows (or documents as
called in MongoDB) doesn't need to have a schema defined beforehand. Instead,
the fields can be created on the fly.
 The data model available within MongoDB allows you to represent hierarchical
relationships, to store arrays, and other more complex structures more easily.
 Examples of a user document and a department document are shown.

The user document contains the following code:

{ _id: <ObjectId1>, username: "jdoe" }

The department document contains the code:

{ _id: <ObjectId3>, user_id: <ObjectId1>, Name: "Sales", Office: "Chicago" }

 The related user document line is _id: <ObjectId1>,
 The related department document line is user_id: <ObjectId1>,
 Relationships use links or “references” between data. Applications resolve the
relationships to access the related data. This is considered a “normalized” model.
An example of embedded sub-documents is as follows:

{ _id: <ObjectId1>, username: "jdoe",
  contact: { phone: "555-555-1111", Cell: "555-555-222", email: "[email protected]" },
  Position: { department: "sales", manager: "vp of sales" } }
Ad hoc queries
 MongoDB supports field, range query, and regular expression searches.
 Queries can return specific fields of documents and also include user-defined
JavaScript functions.
 Queries can also be configured to return a random sample of results of a given
size.
 A range query is a common database operation that retrieves all records where
some value is between an upper and lower boundary. For example, list all
employees with 3 to 5 years' experience.
 Range queries are unusual because it is not generally known in advance how
many entries a range query will return, or if it will return any at all.
 Many other queries, such as the top ten most senior employees, or the newest
employee, can be done more efficiently because there is an upper bound to the
number of results they will return.
 A query that returns exactly one result is sometimes called a singleton.
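
A range query of the kind described above could look like this with the Python driver, pymongo (a sketch; the collection and field names are illustrative):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["myDb"]

# all employees with 3 to 5 years' experience (inclusive range)
for doc in db.employees.find({"experience": {"$gte": 3, "$lte": 5}}):
    print(doc)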

Secondary Indexes
 Indexes support the efficient execution of queries in MongoDB. Without indexes,
MongoDB must perform a collection scan, i.e. scan every document in a
collection, to select those documents that match the query statement. If an



appropriate index exists for a query, MongoDB can use the index to limit the
number of documents it must inspect.
 Indexes are special data structures that store a small portion of the collection’s
data set in an easy to traverse form. The index stores the value of a specific field
or set of fields, ordered by the value of the field.
 The ordering of the index entries supports efficient equality matches and range-
based query operations. In addition, MongoDB can return sorted results by using
the ordering in the index.
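
A minimal sketch of creating such indexes with pymongo (collection and field names are illustrative, continuing the example above):

import pymongo
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["myDb"]

# single-field index, so queries on "experience" can avoid a collection scan
db.employees.create_index([("experience", pymongo.ASCENDING)])

# compound index: equality match on department, range/sort on experience
db.employees.create_index([("department", pymongo.ASCENDING),
                           ("experience", pymongo.DESCENDING)])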

Replication
 Replication is the process of synchronizing data across multiple servers.
Replication provides redundancy and increases data availability with multiple
copies of data on different database servers.
 Replication protects a database from the loss of a single server.
 Replication also allows you to recover from hardware failure and service
interruptions. With additional copies of the data, you can dedicate one to disaster
recovery, reporting, or backup.

Why Replication?
 To keep your data safe
 High (24*7) availability of data
 Disaster recovery
 No downtime for maintenance (like backups, index rebuilds, compaction)
 Read scaling (extra copies to read from)
 Replica set is transparent to the application

How Replication Works in MongoDB


 MongoDB achieves replication by the use of replica set. A replica set is a group of
mongod instances that host the same data set. In a replica, one node is primary
node that receives all write operations. All other instances, such as secondaries,
apply operations from the primary so that they have the same data set. Replica
set can have only one primary node.
 Replica set is a group of two or more nodes (generally minimum 3 nodes are
required).
 In a replica set, one node is primary node and remaining nodes are secondary.
 All data replicates from primary to secondary node.
 At the time of automatic failover or maintenance, election establishes for primary
and a new primary node is elected.
 After the recovery of failed node, it again join the replica set and works as a
secondary node.



Deduplication at the File or Bock Level
 In its simplest form, deduplication takes place on the file level; that is, it
eliminates duplicate copies of the same file.
 This kind of deduplication is sometimes called file-level deduplication or single
instance storage (SIS).
 Deduplication can also take place on the block level, eliminating duplicated
blocks of data that occur in non-identical files.
 Block-level deduplication frees up more space than SIS, and a particular type
known as variable block or variable length deduplication has become very
popular.
 Often the phrase data deduplication is used as a synonym for block-level or
variable length deduplication.

Load Balancer
 A load balancer is a device that distributes network or application traffic across a
cluster of servers. Load balancing improves responsiveness and increases
availability of applications.
 A load balancer sits between the client and the server farm accepting incoming
network and application traffic and distributing the traffic across multiple
backend servers using various methods.
 By balancing application requests across multiple servers, a load balancer
reduces individual server load and prevents any one application server from
becoming a single point of failure, thus improving overall application availability
and responsiveness.
Scaling Up Vs Scaling Out
 Scaling out is considered more important as commodity hardware is cheaper
compared to cost of special configuration hardware (super computer).



 But increasing the number of requests that an application can handle on a single
commodity hardware box is also important. An application is said to be
performing well if it can handle more requests with-out degrading response time
by just adding more resources.

Scalable Architecture
 Application architecture is scalable if each layer in multi layered architecture is
scalable (scale out). For example, we should be able to scale linearly by adding an
additional box in the Application Layer or the Database Layer.

MongoDB’s Core Server and Tools


 MongoDB is a document-based database. The basic idea behind shifting from
relational data model to a new data model is to replace the concept of a ‘row’
with a flexible model, the ‘document’.
 The document-based approach allows embedded documents, arrays, and
represents a complex hierarchical relationship using a single record. This is how
developers using object-oriented languages want to represent their data.

The Core Server


 The core database server of MongoDB can be run as an executable process called
mongod or mongodb.exe on Windows.
The mongod process receives a command to run the MongoDB server over a
network socket through a custom binary protocol. The data files for a mongod
process are stored by default in the directory /data/db (read as slash data slash D-
B).
A mongod process can be run in several modes:
 Replica set: Configurations comprise two replicas and an arbiter process that
reside on a third server.



 Per-shard replica set: The auto-sharding architecture of MongoDB consist of
mongod processes configured as per-shard replica sets.
 Mongos: A separate routing server is used to send requests to the appropriate
shard.
 Mongos queries from the application layer and locates the data in the sharded
cluster to complete these operations. A mongos instance is identical to any
MongoDB instance.

MongoDB's Tools
MongoDB Tools consists of the following:
 JavaScript shell
 Database drivers
 Command-line tools.

The JavaScript shell


 The command shell in MongoDB is a JavaScript-based tool. It is used to
administer the database and manipulate data.
 A mongo executable loads the shell and connects it to a specified mongod
process. In addition to inserting and querying data, the shell allows you to run
administrative commands.
Database drivers
 The MongoDB drivers are easy to use. It provides an Application Program
Interface or API that matches the syntax of the language used while maintaining
uniform interfaces across languages.
 10gen (pronounce as ten gen), a company behind MongoDB supports drivers for
C, C++, C#, (pronounce as C, C plus plus C hash) Erlang, Haskell, Java, Perl, PHP,
Python, Scala, and Ruby.
Command-line tools
 Mongodump and Mongorestore (Pronounce as Mongo dump and mongo store)
are standard utilities that help backup and restore a database.
 Mongodump can save the data in the BSON format of MongoDB and thus is
used for backups only. This tool is used for hot backups and can be restored with
mongorestore easily.
 Mongoexport and Mongoimport are used to export and import JSON, comma
separated value or CSV, and Tab Separated Value or TSV data.
 These tools help you get data in widely supported formats. user can use
mongoimport for initial imports of large data sets.
 However, you need to adjust the data models for best results. In such a case, you
can use a custom script to easily import the data through one of the drivers.
 Mongosniff is a wire-sniffing tool used for viewing operations sent to the
database. This tool translates the BSON that is transmitted to human-readable
shell statements.
 Mongostat is similar to iostat (pronounce as I-O-stat). Mongostat provides
helpful statistics, including the number of operations per second, for example,
inserts, queries, updates, deletes, and so on.
 It also provides information, such as the amount of virtual memory allocated and
the number of connections to the server.

Compare MongoDB Versus other Databases


MySQL               MongoDB
ACID Transactions   ACID Transactions
Table               Collection
Row                 Document
Column              Field
Secondary Index     Secondary Index
JOINs               Embedded documents, $lookup & $graphLookup
GROUP_BY            Aggregation Pipeline

Feature                                            MySQL   MongoDB   NoSQL Data Store
Open source                                        Yes     Yes       Yes
ACID Transactions                                  Yes     Yes       No
Flexible, rich data model                          No      Yes       Partial: schema flexibility but
                                                                     support for only simple data structures
Schema governance                                  Yes     Yes       No
Expressive joins, faceted search, graph queries,   Yes     Yes       No
powerful aggregations
Idiomatic, native language drivers                 No      Yes       No
Horizontal scale-out with data locality controls   No      Yes       Partial: no controls over data locality
Analytics and BI ready                             Yes     Yes       No
Enterprise grade security and mature               Yes     Yes       No
management tools
Database as a service on all major clouds          Yes     Yes       No

MongoDB through the JavaScript Shell



 The MongoDB shell is an interactive JavaScript shell. As such, it provides the
capability to use JavaScript code directly in the shell or executed as a standalone
JavaScript file.
 Subsequent hours that deal with using the shell to access the database and create
and manipulate collections and documents provide examples that are written in
JavaScript.
 To follow those examples, you need to understand at least some of the
fundamental aspects of the JavaScript language.

Defining Variables
 The first place to begin within JavaScript is defining variables. Variables are a
means to name data so that you can use that name to temporarily store and
access data from your JavaScript files. Variables can point to simple data types
such as numbers or strings, or they can point to more complex data types such as
objects.
 To define a variable in JavaScript, you use the var keyword and then give the
variable a name, as in this example:
var myData;
You can also assign a value to the variable in the same line. For example, the
following line of code creates a variable named myString and assigns it the value
of "Some Text":
var myString = "Some Text";
 It works as well as this code:
var myString;
myString = "Some Text";
 After you have declared the variable, you can use the name to assign the variable a
value and access the value of the variable. For example, the following code stores a
string into the myString variable and then uses it when assigning the value to the
newString variable:
var myString = "Some Text";
var newString = myString + " Some More Text";
 Your variable names should describe the data stored in them so that you can easily
use them later in your program. The only rules for creating variable names is that
they must begin with a letter, $, or _ and they cannot contain spaces. Also remember
that variable names are case sensitive, so myString is different from MyString.

Data Type
 JavaScript uses data types to determine how to handle data that is assigned to a
variable. The variable type determines what operations you can perform on the
variable, such as looping or executing. The following list describes the most common
types of variables you will be working with through the book:



 String: This variable stores character data as a string. The character data is
specified by either single or double quotes. All the data contained in the quotes is
assigned to the string variable. For example:
var myString = 'Some Text';
var anotherString = 'Some More Text';
 Number: Data is stored as a numerical value. Numbers are useful in counts,
calculations, and comparisons. Some examples follow:
var myInteger = 1;
var cost = 1.33;
 Boolean: This variable stores a single bit that is either true or false. Booleans are
often used for flags. For example, you might set a variable to false at the
beginning of some code and then check it upon completion to see if the code
execution hit a certain spot. The following shows an example of defining a true
variable and a false variable:
var yes = true;
var no = false;

 Array: An indexed array is a series of separate distinct data items all stored
under a single variable name. Items in the array can be accessed by their zero-
based index using array[index]. The following is an example of creating a simple
array and then accessing the first element, which is at index 0.
var arr = ["one", "two", "three"];
var first = arr[0];
 Object literal: JavaScript supports the capability to create and use object literals.
When you use an object literal, you can access values and functions in the object
using object.property syntax. The following example shows how to create and
access properties with an object literal:
var obj = {"name":"Brad", "occupation":"Hacker", "age", "Unknown"};
var name = obj.name;

 Null: Sometimes you do not have a value to store in a variable either because it
hasn’t been created or you are no longer using it. At this time, you can set a
variable to null. Using null is better than assigning the variable a value of 0 or an
empty string "" because those might be valid values for the variable. Assigning
the variable null lets you assign no value and check against null inside your code.
var newVar = null;
Inserts and queries
 Inserts a document into a collection.
The insertOne() method has the following syntax:
db.collection.insertOne(
   <document>,
   { writeConcern: <document> }
)
db.collection.update(query, update, options)
 Modifies an existing document or documents in a collection. The method can
modify specific fields of an existing document or documents or replace an
existing document entirely, depending on the update parameter.
 By default, the update() method updates a single document. Set the Multi
Parameter to update all documents that match the query criteria.
 The update() method has the following form:

db.collection.update(
   <query>,
   <update>,
   { upsert: <boolean>, multi: <boolean>, writeConcern: <document>,
     collation: <document>, arrayFilters: [ <filterdocument1>, ... ] }
)

Creating and Querying with Indexes


 An index supports a query when the index contains all the fields scanned by the
query. The query scans the index and not the collection. Creating indexes that
support queries results in greatly increased query performance.
 This document describes strategies for creating indexes that support queries.
 The insert() Method
To insert data into MongoDB collection, you need to use MongoDB's insert() or save()
method.
Syntax
The basic syntax of insert() command is as follows −
>db.COLLECTION_NAME.insert(document)
Example
>db.mycol.insert({
   _id: ObjectId("7df78ad8902c"),
   title: 'MongoDB Overview',
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
})
 Here mycol is our collection name, as created in the previous chapter. If the
collection doesn't exist in the database, then MongoDB will create this collection
and then insert a document into it.
 In the inserted document, if we don't specify the _id parameter, then MongoDB
assigns a unique ObjectId for this document.



 _id is 12 bytes hexadecimal number unique for every document in a collection. 12
bytes are divided as follows −_id: ObjectId(4 bytes timestamp, 3 bytes machine id,
2 bytes process id,3 bytes incrementer)
 To insert multiple documents in a single query, you can pass an array of
documents in insert() command.
Example
>db.post.insert([
   { title: 'MongoDB Overview', description: 'MongoDB is no sql database',
     by: 'tutorials point', url: 'http://www.tutorialspoint.com',
     tags: ['mongodb', 'database', 'NoSQL'], likes: 100 },
   { title: 'NoSQL Database', description: "NoSQL database doesn't have tables",
     by: 'tutorials point', url: 'http://www.tutorialspoint.com',
     tags: ['mongodb', 'database', 'NoSQL'], likes: 20,
     comments: [ { user: 'user1', message: 'My first comment',
                   dateCreated: new Date(2013,11,10,2,35), like: 0 } ] }
])

 To insert the document you can use db.post.save(document) also. If you don't
specify _id in the document then save() method will work same as insert()
method. If you specify _id then it will replace whole data of document
containing _id as specified in save() method.

MongoDB Update() Method


 The update() method updates the values in the existing document.
Syntax:- The basic syntax of update() method is as follows −
>db.COLLECTION_NAME.update(SELECTION_CRITERIA,
UPDATED_DATA)
 Example
Consider the mycol collection has the following data.
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
 Following example will set the new title 'New MongoDB Tutorial' of the
documents whose title is 'MongoDB Overview'.
>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB
Tutorial'}})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
"_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}>



 By default, MongoDB will update only a single document. To update multiple
documents, you need to set a parameter 'multi' to true.
>db.mycol.update({'title':'MongoDB Overview'},
{$set:{'title':'New MongoDB Tutorial'}},{multi:true})

 MongoDB Save() Method


 The save() method replaces the existing document with the new document
passed in the save() method.
Syntax
The basic syntax of MongoDB save() method is shown below −
>db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})
Example
 Following example will replace the document with the _id
'5983548781331adf45ec5'.
>db.mycol.save( {"_id" : ObjectId(5983548781331adf45ec5),
"title":"Tutorials Point New Topic", "by":"Tutorials Point" })
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"Tutorials Point New Topic",
"by":"Tutorials Point"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>
Creating and Querying with Indexes
 Indexes support the efficient resolution of queries. Without indexes, MongoDB
must scan every document of a collection to select those documents that match
the query statement. This scan is highly inefficient and require MongoDB to
process a large volume of data.
 Indexes are special data structures, that store a small portion of the data set in an
easy-to-traverse form. The index stores the value of a specific field or set of fields,
ordered by the value of the field as specified in the index.
 The ensureIndex() Method
To create an index you need to use ensureIndex() method of MongoDB.
Syntax: The basic syntax of the ensureIndex() method is as follows.
>db.COLLECTION_NAME.ensureIndex({KEY:1})
 Here key is the name of the field on which you want to create index and 1 is for
ascending order. To create index in descending order you need to use -1.

Example
>db.mycol.ensureIndex({"title":1})
In ensureIndex() method you can pass multiple fields, to create index on multiple
fields.
>db.mycol.ensureIndex({"title":1,"description":-1})
The ensureIndex() method also accepts a list of options (which are optional):
Parameter            Type            Description
background           Boolean         Builds the index in the background so that building an index does
                                     not block other database activities. Specify true to build in the
                                     background. The default value is false.
unique               Boolean         Creates a unique index so that the collection will not accept
                                     insertion of documents where the index key or keys match an
                                     existing value in the index. Specify true to create a unique index.
                                     The default value is false.
name                 string          The name of the index. If unspecified, MongoDB generates an index
                                     name by concatenating the names of the indexed fields and the sort
                                     order.
dropDups             Boolean         Creates a unique index on a field that may have duplicates. MongoDB
                                     indexes only the first occurrence of a key and removes all documents
                                     from the collection that contain subsequent occurrences of that key.
                                     Specify true to create a unique index. The default value is false.
sparse               Boolean         If true, the index only references documents with the specified
                                     field. These indexes use less space but behave differently in some
                                     situations (particularly sorts). The default value is false.
expireAfterSeconds   integer         Specifies a value, in seconds, as a TTL to control how long MongoDB
                                     retains documents in this collection.
v                    index version   The index version number. The default index version depends on the
                                     version of MongoDB running when creating the index.
weights              document        The weight is a number ranging from 1 to 99,999 and denotes the
                                     significance of the field relative to the other indexed fields in
                                     terms of the score.
default_language     string          For a text index, the language that determines the list of stop
                                     words and the rules for the stemmer and tokenizer. The default
                                     value is english.
language_override    string          For a text index, specify the name of the field in the document
                                     that contains the language to override the default language. The
                                     default value is language.
 Connect to Database
 To connect database, you need to specify the database name, if the database
doesn't exist then MongoDB creates it automatically.
 Following is the code snippet to connect to the database −



import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class ConnectToDB {


public static void main( String args[] ) {
// Creating a Mongo client
MongoClient mongo = new MongoClient( "localhost" , 27017 );
// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");

// Accessing the database


MongoDatabase database = mongo.getDatabase("myDb");
System.out.println("Credentials ::"+ credential);
}
}
 Now, let's compile and run the above program to create our database myDb as
shown below.
$javac ConnectToDB.java
$java ConnectToDB
On executing, the above program gives you the following output.
Connected to the database successfully
Credentials ::MongoCredential{
mechanism = null,
userName = 'sampleUser',
source = 'myDb',
password = <hidden>,
mechanismProperties = {}}
Create a Collection
To create a collection, createCollection() method of
com.mongodb.client.MongoDatabase class is used.
Following is the code snippet to create a collection −
import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class CreatingCollection {



public static void main( String args[] ) {

// Creating a Mongo client


MongoClient mongo = new MongoClient( "localhost" , 27017 );

// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");

//Accessing the database


MongoDatabase database = mongo.getDatabase("myDb");

//Creating a collection
database.createCollection("sampleCollection");
System.out.println("Collection created successfully");
}
}
On compiling, the above program gives you the following result −
Connected to the database successfully
Collection created successfully

Getting/Selecting a Collection
To get/select a collection from the database, getCollection() method of
com.mongodb.client.MongoDatabase class is used.
Following is the program to get/select a collection −
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class selectingCollection {


public static void main( String args[] ) {

// Creating a Mongo client


MongoClient mongo = new MongoClient( "localhost" , 27017 );

// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());



System.out.println("Connected to the database successfully");

// Accessing the database


MongoDatabase database = mongo.getDatabase("myDb");

// Creating a collection
System.out.println("Collection created successfully");

// Retieving a collection
MongoCollection<Document> collection = database.getCollection("myCollection");
System.out.println("Collection myCollection selected successfully");
}
}

On compiling, the above program gives you the following result −


Connected to the database successfully
Collection created successfully
Collection myCollection selected successfully

Insert a Document
To insert a document into MongoDB, the insertOne() method of the
com.mongodb.client.MongoCollection class is used.

Following is the code snippet to insert a document −


import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class InsertingDocument {

public static void main( String args[] ) {

// Creating a Mongo client


MongoClient mongo = new MongoClient( "localhost" , 27017 );

// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");



// Accessing the database
MongoDatabase database = mongo.getDatabase("myDb");

// Retrieving a collection
MongoCollection<Document> collection =
database.getCollection("sampleCollection");
System.out.println("Collection sampleCollection selected successfully");

Document document = new Document("title", "MongoDB")


.append("id", 1)
.append("description", "database")
.append("likes", 100)
.append("url", "http://www.tutorialspoint.com/mongodb/")
.append("by", "tutorials point");
collection.insertOne(document);
System.out.println("Document inserted successfully");
}
}

On compiling, the above program gives you the following result −


Connected to the database successfully
Collection sampleCollection selected successfully
Document inserted successfully

Retrieve All Documents


To select all documents from the collection, find() method of
com.mongodb.client.MongoCollection class is used. This method returns a cursor, so
you need to iterate this cursor.
Following is the program to select all documents −
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

import java.util.Iterator;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class RetrievingAllDocuments {



public static void main( String args[] ) {

// Creating a Mongo client


MongoClient mongo = new MongoClient( "localhost" , 27017 );

// Creating Credentials
MongoCredential credential;
credential = MongoCredential.createCredential("sampleUser", "myDb",
"password".toCharArray());
System.out.println("Connected to the database successfully");

// Accessing the database


MongoDatabase database = mongo.getDatabase("myDb");

// Retrieving a collection
MongoCollection<Document> collection =
database.getCollection("sampleCollection");
System.out.println("Collection sampleCollection selected successfully");

// Getting the iterable object


FindIterable<Document> iterDoc = collection.find();
int i = 1;

// Getting the iterator


Iterator it = iterDoc.iterator();

while (it.hasNext()) {
System.out.println(it.next());
i++;
}
}
}

On compiling, the above program gives you the following result −


Document{{
_id = 5967745223993a32646baab8,
title = MongoDB,
id = 1,
description = database,
likes = 100,
url = http://www.tutorialspoint.com/mongodb/, by = tutorials point
}}
Document{{
_id = 7452239959673a32646baab8,
title = RethinkDB,
id = 2,
description = database,
likes = 200,
url = http://www.tutorialspoint.com/rethinkdb/, by = tutorials point
}}

Update Document
To update a document in the collection, the updateOne() method of the
com.mongodb.client.MongoCollection class is used, as sketched below.
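
Following is a minimal sketch of such an update (an addition, not part of the original listing), assuming the same "myDb" database and "sampleCollection" collection used in the insert example above; the Filters and Updates helpers come from the com.mongodb.client.model package of the Java driver.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import com.mongodb.MongoClient;

public class UpdatingDocument {

   public static void main( String args[] ) {

      // Creating a Mongo client
      MongoClient mongo = new MongoClient( "localhost" , 27017 );

      // Accessing the database and the collection used earlier
      MongoDatabase database = mongo.getDatabase("myDb");
      MongoCollection<Document> collection = database.getCollection("sampleCollection");

      // Updating the "likes" field of the document whose "id" is 1
      collection.updateOne(Filters.eq("id", 1), Updates.set("likes", 150));
      System.out.println("Document updated successfully");
   }
}

Run against the document inserted earlier, this would change its "likes" value from 100 to 150; updateMany() can be used in the same way to modify every matching document.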

MongoDB query language with sort()

If we want to fetch documents from the collection "userdetails" which contain the value
of "date_of_join" as "16/10/2010" and the value of "education" as "M.C.A.", and sort the
fetched results in descending order of "profession", the following MongoDB command can be used :
>db.userdetails.find({"date_of_join":"16/10/2010","education":"M.C.A."}).sort({"profession":-1}).pretty();
N.B. The find() method displays the documents in a non-structured format; to display the
results in a formatted way, the pretty() method can be used.
For comparison of different BSON type values, see the specified BSON comparison
order.

Comparison
Name Description
$eq Matches values that are equal to a specified value.
$gt Matches values that are greater than a specified value.
$gte Matches values that are greater than or equal to a specified value.
$in Matches any of the values specified in an array.
$lt Matches values that are less than a specified value.
$lte Matches values that are less than or equal to a specified value.
$ne Matches all values that are not equal to a specified value.
$nin Matches none of the values specified in an array.
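
To illustrate how these comparison operators are written, the following sketch (an addition, not from the original listings) queries the "sampleCollection" used earlier with $gt by nesting org.bson.Document objects; the "likes" field comes from the documents inserted above.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;

public class ComparisonQuery {

   public static void main( String args[] ) {

      // Connecting and selecting the collection as in the earlier examples
      MongoClient mongo = new MongoClient( "localhost" , 27017 );
      MongoDatabase database = mongo.getDatabase("myDb");
      MongoCollection<Document> collection = database.getCollection("sampleCollection");

      // Equivalent to the shell query { likes : { $gt : 100 } }
      FindIterable<Document> result = collection.find(new Document("likes", new Document("$gt", 100)));

      // Printing every matching document
      for (Document doc : result) {
         System.out.println(doc.toJson());
      }
   }
}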

Logical
Name Description
$and Joins query clauses with a logical AND returns all documents that match the
conditions of both clauses.
$not Inverts the effect of a query expression and returns documents that do not match
the query expression.
$nor Joins query clauses with a logical NOR returns all documents that fail to match both clauses.
$or Joins query clauses with a logical OR returns all documents that match the
conditions of either clause.
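
A similar sketch for the logical operators (again an illustrative addition), assuming the "sampleCollection" documents shown earlier: the $or clause below matches a document if either sub-condition holds, and java.util.Arrays.asList builds the array of clauses that $or expects.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import java.util.Arrays;

public class LogicalQuery {

   public static void main( String args[] ) {

      // Connecting and selecting the collection as in the earlier examples
      MongoClient mongo = new MongoClient( "localhost" , 27017 );
      MongoDatabase database = mongo.getDatabase("myDb");
      MongoCollection<Document> collection = database.getCollection("sampleCollection");

      // Equivalent to { $or : [ { title : "MongoDB" }, { likes : { $gt : 150 } } ] }
      Document orQuery = new Document("$or", Arrays.asList(
            new Document("title", "MongoDB"),
            new Document("likes", new Document("$gt", 150))));

      // Printing every document that satisfies either clause
      for (Document doc : collection.find(orQuery)) {
         System.out.println(doc.toJson());
      }
   }
}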

Bitwise
Name Description
$bitsAllClear Matches numeric or binary values in which a set of bit positions all have a value of 0.
$bitsAllSet Matches numeric or binary values in which a set of bit positions all have a value of 1.
$bitsAnyClear Matches numeric or binary values in which any bit from a set of bit positions has a value of 0.
$bitsAnySet Matches numeric or binary values in which any bit from a set of bit positions has a value of 1.

Projection Operators
Name Description
$ Projects the first element in an array that matches the query condition.
$elemMatch Projects the first element in an array that matches the specified $elemMatch condition.
$meta Projects the document’s score assigned during $text operation.
$slice Limits the number of elements projected from an array. Supports skip and limit slices.
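
The sketch below (an illustrative addition) applies $slice through the projection() method of FindIterable; the "comments" array field is hypothetical and is used only to show where the operator goes.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;

public class ProjectionQuery {

   public static void main( String args[] ) {

      // Connecting and selecting the collection as in the earlier examples
      MongoClient mongo = new MongoClient( "localhost" , 27017 );
      MongoDatabase database = mongo.getDatabase("myDb");
      MongoCollection<Document> collection = database.getCollection("sampleCollection");

      // Equivalent to projecting { comments : { $slice : 2 } } -- at most the
      // first two elements of the hypothetical "comments" array are returned
      FindIterable<Document> result = collection.find()
            .projection(new Document("comments", new Document("$slice", 2)));

      // Printing the projected documents
      for (Document doc : result) {
         System.out.println(doc.toJson());
      }
   }
}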
