
Fundamentals of BIG DATA ANALYTICS

(R20A0519)

DIGITAL NOTES

B.TECH IV YEAR – II SEM(2023-2024)

MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY


(Autonomous Institution – UGC, Govt. of India)
Recognized under 2(f) and 12 (B) of UGC ACT 1956

(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India
FUNDAMENTALS OF BIG DATA ANALYTICS

SYLLABUS:

UNIT I

INTRODUCTION TO BIG DATA AND ANALYTICS: Classification of Digital Data, Structured and Unstructured Data - Introduction to Big Data: Characteristics – Evolution – Definition - Challenges with Big Data - Other Characteristics of Data - Why Big Data - Traditional Business Intelligence versus Big Data - Data Warehouse and Hadoop Environment - Big Data Analytics: Classification of Analytics – Challenges - Importance of Big Data Analytics - Top Analytics Tools

UNIT II

INTRODUCTION TO TECHNOLOGY LANDSCAPE: NoSQL, Comparison of SQL and


NoSQL, Hadoop - RDBMS Versus Hadoop - Distributed Computing Challenges – Hadoop Overview
- Hadoop Distributed File System - Processing Data with Hadoop - Managing Resources and
Applications with Hadoop YARN - Interacting with Hadoop Ecosystem

UNIT III

INTRODUCTION TO MONGODB AND MAPREDUCE PROGRAMMING:

MongoDB: Why MongoDB - Terms used in RDBMS and MongoDB - Data Types - MongoDB Query Language. MapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting – Compression

UNIT IV

INTRODUCTION TO HIVE: Hive: Introduction – Architecture - Data Types - File Formats -


Hive Query Language Statements – Partitions – Bucketing – Views - Sub-Query – Joins –
Aggregations - Group by and Having - RCFile Implementation - Hive User Defined Function -
Serialization and Deserialization.

UNIT V

INTRODUCTION TO DATA ANALYTICS: Machine Learning: Introduction, Supervised


Learning, Unsupervised Learning, Machine Learning Algorithms: Regression Model, Clustering, Collaborative Filtering, Association Rule Mining, Decision Tree
Text Book:

1. Seema Acharya, Subhashini Chellappan, “Big Data and Analytics”, Wiley Publications, First Edition, 2015

Reference Books:

1. Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, “Big Data for Dummies”, John Wiley & Sons, Inc., 2013

2. Tom White, “Hadoop: The Definitive Guide”, O’Reilly Publications, Fourth Edition, 2015

3. Dirk Deroos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Rafael Coss, “Hadoop For Dummies”, Wiley Publications, 2014

4. Robert D. Schneider, “Hadoop For Dummies”, John Wiley & Sons, Inc., 2012

5. Paul Zikopoulos, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, McGraw Hill, 2012

6. Chuck Lam, “Hadoop in Action”, Dreamtech Publications, 2010
UNIT – I

What is Big Data?


According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.

However, there are certain basic tenets of Big Data that will make it even simpler to answer what Big Data is:
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
• It includes data mining, data storage, data analysis, data sharing, and data visualization.
• The term is an all-comprehensive one, including data, data frameworks, and the tools and techniques used to process and analyze the data.

The History of Big Data

• Although the concept of big data itself is relatively new, the origins of large data sets
go back to the 1960s and '70s when the world of data was just getting started with the
first data centers and the development of the relational database.

• Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open-source framework

created specifically to store and analyze big data sets) was developed that same year.
NoSQL also began to gain popularity during this time.

• The development of open-source frameworks, such as Hadoop (and more recently,


Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data, but it is not just humans who are doing it.

• With the advent of the Internet of Things (IoT), more objects and devices are connected
to the internet, gathering data on customer usage patterns and product performance. The
emergence of machine learning has produced still more data.

• While big data has come far, its usefulness is only just beginning. Cloud computing has expanded big data possibilities even further. The cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.

Benefits of Big Data and Data Analytics

• Big data makes it possible for you to gain more complete answers because you have more information.
• More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.

Types of Big Data


Now that we are on track with what is big data, let’s have a look at the types of big data:

a) Structured
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search engine algorithms. For instance, the employee table in a company database will be structured, as the employee details, their job positions, their salaries, etc., will be present in an organized manner.

b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two important types
of big data.

c) Semi-structured
Semi-structured is the third type of big data. Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. This brings us to the end of the types of data; a small illustration of the difference follows below.
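To make the distinction concrete, here is a small Python sketch (the records are invented for illustration, not taken from the notes): a structured record has a fixed set of columns, while a semi-structured, JSON-like document may add or omit fields from record to record.

# Structured: every record follows the same fixed set of columns (like a table row).
employee_row = ("E101", "Asha", "Data Engineer", 85000)   # (id, name, role, salary)

# Semi-structured: JSON-like documents; fields and nesting can differ per record.
employee_docs = [
    {"id": "E101", "name": "Asha", "role": "Data Engineer", "salary": 85000},
    {"id": "E102", "name": "Ravi", "skills": ["Hadoop", "Hive"],           # extra field
     "contact": {"email": "ravi@example.com"}},                            # nested field
]

# Querying structured data is straightforward; with semi-structured data we must
# tolerate missing fields.
high_paid = [doc["name"] for doc in employee_docs if doc.get("salary", 0) > 50000]
print(high_paid)   # ['Asha']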

Characteristics of Big Data

Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume. Let’s discuss the characteristics of big data. These characteristics, taken individually, are enough to know what big data is. Let’s look at them in depth:

a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and so much more. Variety is one of the important characteristics of big data.

b) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets at varying speeds, and activity bursts.

c) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge ‘volumes’ of data that is being generated on a daily basis from various sources like social media platforms, business processes, machines, networks, human interactions, etc. Such a large amount of data is stored in data warehouses. This brings us to the end of the characteristics of big data.

Why is Big Data Important?

The importance of big data does not revolve around how much data a company has but how a company utilizes the collected data. Every company uses data in its own way; the more efficiently a company uses its data, the more potential it has to grow. The company can take data from any source and analyze it to find answers which will enable:

1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring cost advantages to business when large amounts of data are to be stored, and these tools also help in identifying more efficient ways of doing business.

2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on the learning.

3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.

4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, then big data tools can help in all this.

5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no
single business that can claim success without first having to establish a solid
customer base. However, even with a customer base, a business cannot afford to
disregard the high competition it faces. If a business is slow to learn what customers
are looking for, then it is very easy to begin offering poor quality products. In the
end, loss of clientele will result, and this creates an adverse overall effect on business
success. The use of big data allows businesses to observe various customer-related patterns and trends. Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing
Insights

Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company’s product line, and of course ensure that the marketing campaigns are powerful.

7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and
redevelop their products.

Business Intelligence vs Big Data

Although Big Data and Business Intelligence are two technologies used to analyze data to help companies in the decision-making process, there are differences between the two. They differ in the way they work as much as in the type of data they analyze.

Traditional BI methodology is based on the principle of grouping all business data into a central server.
Typically, this data is analyzed in offline mode, after storing the information in an environment called
Data Warehouse. The data is structured in a conventional relational database with an additional set of
indexes and forms of access to the tables (multidimensional cubes).

A Big Data solution differs from BI in many aspects. These are the main differences between Big Data and Business Intelligence:

1. In a Big Data environment, information is stored on a distributed file system, rather


than on a central server. It is a much safer and more flexible space.

2. Big Data solutions carry the processing functions to the data, rather than the data to the functions. As the analysis is centered on the information, it is easier to handle larger amounts of information in a more agile way.

3. Big Data can analyze data in different formats, both structured and unstructured. The volume of unstructured data (data not stored in a traditional database) is growing at levels much higher than structured data. Nevertheless, its analysis carries different challenges. Big Data solutions solve them by allowing a global analysis of various sources of information.

4. Data processed by Big Data solutions can be historical or come from real-time sources.
Thus, companies can make decisions that affect their business in an agile and efficient
way.

5. Big Data technology uses massively parallel processing (MPP) concepts, which improves the speed of analysis. With MPP many instructions are executed simultaneously, and since the various jobs are divided into several parallel execution parts, at the end the overall results are reunited and presented. This allows you to analyze large volumes of information quickly.

Big Data vs Data Warehouse

Big Data has become the reality of doing business for organizations today. There is a boom in the
amount of structured as well as raw data that floods every organization daily. If this data is managed well,
it can lead to powerful insights and quality decision making.

Big data analytics is the process of examining large data sets containing a variety of data types to discover knowledge in databases, to identify interesting patterns, and to establish relationships to solve problems, uncover market trends, customer preferences, and other useful information. Companies and businesses that implement Big Data Analytics often reap several business benefits. Companies implement Big Data Analytics because they want to make more informed business decisions.

A data warehouse (DW) is a collection of corporate information and data derived from operational systems and external data sources. A data warehouse is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels. Data is populated into the Data Warehouse through the processes of extraction, transformation and loading (ETL). Data analysis tools, such as business intelligence software, access the data within the warehouse.
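Since the warehouse is populated through extraction, transformation and loading, a minimal ETL sketch in Python may help fix the idea; the file name, column names and target table below are assumptions made for illustration, not details from the notes.

import csv
import sqlite3

def extract(path):
    # Extract: read rows exported from an operational system (assumed CSV columns:
    # order_id, amount, region).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: type conversion and simple standardisation.
    for row in rows:
        row["amount"] = float(row["amount"])
        row["region"] = row["region"].strip().upper()
    return rows

def load(rows, db_path="warehouse.db"):
    # Load: append the cleaned rows into a warehouse-style table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales(order_id TEXT, amount REAL, region TEXT)")
    con.executemany("INSERT INTO sales VALUES (:order_id, :amount, :region)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))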

Hadoop Environment Big Data Analytics


Hadoop is changing the perception of handling Big Data, especially unstructured data. Let us see how the Apache Hadoop software library, which is a framework, plays a vital role in handling Big Data.

Apache Hadoop enables surplus data to be streamlined for any distributed processing system across clusters of computers using simple programming models. It truly is made to scale up from single servers to a large number of machines, each offering local computation and storage. Instead of depending on hardware to provide high availability, the library itself is built to detect and handle breakdowns at the application layer, thus providing a highly available service on top of a cluster of computers, each of which may be vulnerable to failures.
Hadoop Community Package Consists of
• File system and OS level abstractions
• A MapReduce engine (either MapReduce or YARN)
• The Hadoop Distributed File System (HDFS)
• Java ARchive (JAR) files
• Scripts needed to start Hadoop
• Source code, documentation and a contribution section

Activities performed on Big Data

• Store – Big data needs to be collected in a seamless repository, and it is not necessary to store it in a single physical database.
• Process – The process becomes more tedious than traditional processing in terms of cleansing, enriching, calculating, transforming, and running algorithms.
• Access – There is no business sense to it at all when the data cannot be searched and retrieved easily, and cannot be virtually showcased along the business lines.
Classification of Analytics

Descriptive analytics

Descriptive analytics is a statistical method that is used to search and summarize historical data in order to identify patterns or meaning.

Data aggregation and data mining are two techniques used in descriptive analytics to discover historical data. Data is first gathered and sorted by data aggregation in order to make the datasets more manageable by analysts.

Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning. Identified patterns are analyzed to discover the specific ways
that learners interacted with the learning content and within the learning environment.
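A minimal sketch of these two steps in Python, using pandas and a made-up set of learner records (the data and column names are assumptions for illustration only):

import pandas as pd

# Invented learner activity records.
records = pd.DataFrame({
    "learner": ["A", "A", "B", "B", "C", "C"],
    "module":  ["Intro", "Quiz", "Intro", "Quiz", "Intro", "Quiz"],
    "score":   [78, 85, 62, 55, 90, 95],
    "minutes": [30, 20, 45, 25, 28, 18],
})

# Data aggregation: roll the raw events up into per-learner summaries.
per_learner = records.groupby("learner").agg(avg_score=("score", "mean"),
                                             total_minutes=("minutes", "sum"))
print(per_learner)

# Descriptive summary of the historical data (counts, means, spread, ranges).
print(records[["score", "minutes"]].describe())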

Advantages:

• Quickly and easily report on the Return on Investment (ROI) by showing how performance achieved business or target goals.

• Identify gaps and performance issues early - before they become problems.

• Identify specific learners who require additional support, regardless of how many students or employees there are.

• Identify successful learners in order to offer positive feedback or additional resources.

• Analyze the value and impact of course design and learning resources.

Predictive analytics

Predictive Analytics is a statistical method that utilizes algorithms and machine learning to identify trends in data and predict future behaviors.

The software for predictive analytics has moved beyond the realm of statisticians and is becoming more affordable and accessible for different markets and industries, including the field of learning & development.

For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as specialized
software.

For the learner, predictive forecasting could be as simple as a dashboard located on the main screen after logging in to access a course. Analyzing data from past and current progress, visual indicators in the dashboard could be provided to signal whether the employee was on track with training requirements.
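As a hedged sketch of the idea (using scikit-learn and invented progress data, neither of which comes from the notes), a simple model can be fitted on past learners and then used to flag whether current learners appear to be on track:

from sklearn.linear_model import LogisticRegression

# Invented history: [hours spent, modules completed] -> completed the course (1) or not (0).
X_past = [[2, 1], [10, 6], [4, 2], [12, 8], [1, 0], [9, 7]]
y_past = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X_past, y_past)

# Score current learners and print a simple dashboard-style indicator.
current = [[3, 1], [11, 7]]
for features, on_track in zip(current, model.predict(current)):
    print(features, "on track" if on_track else "needs support")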

Advantages

• Personalize the training needs of employees by identifying their gaps, strengths, and
weaknesses; specific learning resources and training can be offered to support individual
needs.

• Retain Talent by tracking and understanding employee career progression and


forecasting what skills and learning resources would best benefit their career paths. Knowing what skills employees need also benefits the design of future training.

• Support employees who may be falling behind or not reaching their potential by
offering intervention support before their performance puts them at risk.

• Simplified reporting and visuals that keep everyone updated when predictive
forecasting is required.
Prescriptive analytics
Prescriptive analytics is a statistical method used to generate recommendations and make decisions based on the computational findings of algorithmic models.

Generating automated decisions or recommendations requires specific and unique algorithmic models and clear direction from those utilizing the analytical technique. A recommendation cannot be generated without knowing what to look for or what problem is desired to be solved. In this way, prescriptive analytics begins with a problem.

Example
A Training Manager uses predictive analysis to discover that most learners without a particular skill will not complete the newly launched course. What could be done? Now prescriptive analytics can be of assistance on the matter and help determine options for action. Perhaps an algorithm can detect the learners who require that new course but lack that particular skill, and send an automated recommendation that they take an additional training resource to acquire the missing skill.

The accuracy of a generated decision or recommendation, however, is only as good as the quality of the data and the algorithmic models developed. What may work for one company’s training needs may not make sense when put into practice in another company’s training department. Models are generally recommended to be tailored for each unique situation and need.
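A minimal prescriptive sketch of the training-manager example above, in Python; the learner records, risk scores, threshold, and recommended resource are all invented for illustration:

# Prescriptive step: turn a prediction into a recommended action.
learners = [
    {"name": "Asha",  "has_skill": False, "predicted_completion": 0.35},
    {"name": "Ravi",  "has_skill": True,  "predicted_completion": 0.90},
    {"name": "Meena", "has_skill": False, "predicted_completion": 0.55},
]

def recommend(learner, threshold=0.6):
    # Rule: learners who lack the prerequisite skill and are predicted unlikely to
    # complete the course are sent an additional training resource.
    if not learner["has_skill"] and learner["predicted_completion"] < threshold:
        return "Send " + learner["name"] + " the prerequisite skill module"
    return "No action for " + learner["name"]

for learner in learners:
    print(recommend(learner))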

Descriptive vs Predictive vs Prescriptive Analytics

Descriptive Analytics is focused solely on historical data.

You can think of Predictive Analytics as then using this historical data to develop statistical models that will then forecast future possibilities. Prescriptive Analytics takes Predictive Analytics a step further: it takes the possible forecasted outcomes and predicts the consequences of these outcomes.

Big Data Analytics Challenges

1. Need For Synchronization Across Disparate Data Sources

As data sets are becoming bigger and more diverse, there is a big challenge to incorporate them into
an analytical platform. If this is overlooked, it will create gaps and lead to wrong messages and insights.

2. Acute Shortage Of Professionals Who Understand Big Data Analysis

The analysis of data is important to make the voluminous amount of data being produced every minute useful. With the exponential rise of data, a huge demand for big data scientists and Big Data analysts has been created in the market. It is important for business organizations to hire a data scientist with varied skills, as the job of a data scientist is multidisciplinary. Another major challenge faced by businesses is the shortage of professionals who understand Big Data analysis. There is a sharp shortage of data scientists in comparison to the massive amount of data being produced.

3. Getting Meaningful Insights Through The Use Of Big Data Analytics

It is imperative for business organizations to gain important insights from Big Data analytics, and it is also important that only the relevant department has access to this information. A big challenge faced by companies in Big Data analytics is mending this wide gap in an effective manner.

4. Getting Voluminous Data Into The Big Data Platform

It is hardly surprising that data is growing with every passing day. This simply indicates that business organizations need to handle large amounts of data on a daily basis. The amount and variety of data available these days can overwhelm any data engineer, and that is why it is considered vital to make data accessibility easy and convenient for brand owners and managers.

5. Uncertainty Of Data Management Landscape

With the rise of Big Data, new technologies and companies are being developed every
day. However, a big challenge faced by the companies in the Big Data analytics is to find out
which technology will be best suited to them without the introduction of new problems and
potential risks.

6. Data Storage And Quality

Business organizations are growing at a rapid pace. With the tremendous growth of companies and large business organizations, the amount of data produced increases. The storage of this massive amount of data is becoming a real challenge for everyone. Popular data storage options like data lakes/warehouses are commonly used to gather and store large quantities of unstructured and structured data in its native format. The real problem arises when a data lake/warehouse tries to combine unstructured and inconsistent data from diverse sources: it encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate data all result in data quality challenges.

7. Security And Privacy Of Data

Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and opportunities. However, it also involves the potential risks associated with big data when it comes to the privacy and the security of the data. The Big Data tools used for analysis and storage utilize data from disparate sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus, the rise of voluminous amounts of data increases privacy and security concerns.

Terminologies Used In Big Data Environments:

a. As-a-service infrastructure

Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather than selling data, licences to use data, or platforms for running Big Data technology, these can be provided “as a service”, rather than as a product. This reduces the upfront capital investment necessary for customers to begin putting their data, or platforms, to work for them, as the provider bears all of the costs of setting up and hosting the infrastructure. As a customer, as-a-service infrastructure can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.

b. Data science

Data science is the professional field that deals with turning data into value such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science, and communication, as well as domain expertise such as business knowledge. Data scientist has recently been voted the No. 1 job in the U.S., based on current demand, salary, and career opportunities.

c. Data mining

Data mining is the process of discovering insights from data. In terms of Big Data, because it is so large, this is generally done by computational methods in an automated way using methods such as decision trees, clustering analysis and, most recently, machine learning. This can be thought of as using the brute mathematical power of computers to spot patterns in data which would not be visible to the human eye due to the complexity of the dataset.

d. Hadoop

Hadoop is a framework for Big Data computing which has been released into the public domain as open source software, and so can freely be used by anyone. It consists of a number of modules, all tailored for a different vital step of the Big Data process – from file storage (Hadoop File System – HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce – see below). It has become so popular due to its power and flexibility that it has developed its own industry of retailers (selling tailored versions), support service providers and consultants.

• Predictive modelling

At its simplest, this is predicting what will happen next based on data about what has happened previously. In the Big Data age, because there is more data around than ever before, predictions are becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action which will lead to the most desirable outcome. The speed of modern computers and the volume of data available mean that predictions can be made based on a huge number of variables, allowing an ever-increasing number of variables to be assessed for the probability that they will lead to success.

• MapReduce

MapReduce is a computing procedure for working with large datasets, which was devised due to the difficulty of reading and analysing really Big Data using conventional computing methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis – e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21).
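As a tiny illustration of the two procedures (names and ages invented), the sketch below maps each record to a key-value pair and then reduces the pairs to a single result, mirroring the over-21 example:

from functools import reduce

people = [("Anita", 19), ("Bala", 25), ("Chen", 32), ("Divya", 20)]

# Map: emit a (key, value) pair per record -- value 1 if the person is over 21, else 0.
mapped = [("over_21", 1 if age > 21 else 0) for _, age in people]

# Reduce: combine all values for the key into one result (here, a count).
over_21_count = reduce(lambda total, pair: total + pair[1], mapped, 0)
print(over_21_count)   # 2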

• NoSQL

NoSQL refers to a database format designed to hold more than data which is simply arranged into tables, rows, and columns, as is the case in a conventional relational database. This database format has proven very popular in Big Data applications because Big Data is often messy, unstructured and does not easily fit into traditional database frameworks.

• Python

Python is a programming language which has become very popular in the Big Data space due to its ability to work very well with large, unstructured datasets (see Part II for the difference between structured and unstructured data). It is considered to be easier to learn for a data science beginner than other languages such as R (see also Part II), and more flexible.

• R Programming

R is another programming language commonly used in Big Data, and can be thought of as more specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of structured data. Like Python, it has an active community of users who are constantly expanding and adding to its capabilities by creating new libraries and extensions.

• Recommendation engine

A recommendation engine is basically an algorithm, or collection of algorithms, designed to match an entity (for example, a customer) with something they are looking for. Recommendation engines used by the likes of Netflix or Amazon rely heavily on Big Data technology to gain an overview of their customers and, using predictive modelling, match them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind a lot of commercial Big Data initiatives and developments over the last decade.
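A toy sketch of the matching idea in Python (the items and tags are entirely invented; production engines such as those mentioned above work at far larger scale and with far richer models): score unseen items by their similarity to items the customer already liked.

# Toy content-based recommendation: suggest the unseen item closest to what was liked.
items = {
    "Film A": {"sci-fi", "space"},
    "Film B": {"romance", "drama"},
    "Film C": {"sci-fi", "drama"},
}
liked = {"Film A"}

def jaccard(a, b):
    # Similarity between two tag sets: size of intersection over size of union.
    return len(a & b) / len(a | b)

scores = {name: max(jaccard(tags, items[l]) for l in liked)
          for name, tags in items.items() if name not in liked}
print(max(scores, key=scores.get))   # Film C (shares the 'sci-fi' tag with Film A)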

• Real-time

Real-time means “as it happens” and in Big Data refers to a system or process which is
able to give data-driven insights based on what is happening at the present moment. Recent
years have seen a large push for the development of systems capable of processing and offering
insights in real-time (or near-real-time), and advances in computing power as well as
development of techniques such as machine learning have made it a reality in many
applications today.

• Reporting

The crucial “last step” of many Big Data initiatives involves getting the right information to the people who need it to make decisions, at the right time. When this step is automated, analytics is applied to the insights themselves to ensure that they are communicated in a way that they will be understood and be easy to act on. This will usually involve creating multiple reports based on the same data or insights, each intended for a different audience (for example, in-depth technical analysis for engineers, and an overview of the impact on the bottom line for C-level executives).

• Spark

Spark is another open source framework like Hadoop but more recently developed and more suited
to handling cutting-edge Big Data tasks involving real time analytics and machine learning. Unlike
Hadoop it does not include its own filesystem, though it is designed to work with Hadoop’s HDFS
or a number of other options. However, for certain data related processes it is able to calculate at
over 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it
is becoming an increasingly popular choice for projects involving deep learning, neural networks
and other compute-intensive tasks.

• Structured Data

Structured data is simply data that can be arranged neatly into charts and tables consisting of rows, columns or multi-dimensioned matrixes. This is traditionally the way that computers have stored data, and information in this format can easily and simply be processed and mined for insights. Data gathered from machines is often a good example of structured data, where various data points – speed, temperature, rate of failure, RPM etc. – can be neatly recorded and tabulated for analysis.

• Unstructured Data

Unstructured data is any data which cannot easily be put into conventional charts and tables. This can include video data, pictures, recorded sounds, text written in human languages and a great deal more. This data has traditionally been far harder to draw insight from using computers, which were generally designed to read and analyze structured information. However, since it has become apparent that a huge amount of value can be locked away in this unstructured data, great efforts have been made to create applications which are capable of understanding unstructured data – for example visual recognition and natural language processing.

• Visualization

Humans find it very hard to understand and draw insights from large amounts of text or numerical data – we can do it, but it takes time, and our concentration and attention is limited. For this reason effort has been made to develop computer applications capable of rendering information in a visual form – charts and graphics which highlight the most important insights which have resulted from our Big Data projects. A subfield of reporting (see above), visualization is now often an automated process, with visualizations customized by algorithm to be understandable to the people who need to act or take decisions based on them.

Top Analytics Tools

R is a language for statistical computing and graphics. It is also used for big data analysis. It provides a wide variety of statistical tests.

Features:

• Effective data handling and storage facility
• It provides a suite of operators for calculations on arrays, in particular matrices
• It provides a coherent, integrated collection of big data tools for data analysis
• It provides graphical facilities for data analysis which display either on-screen or on hardcopy

* Apache Spark is a powerful open source big data analytics tool. It offers over 80 high-level
operators that make it easy to build parallel apps. It is used at a wide range of organizations to
process large datasets.

Features:

• It helps to run an application in a Hadoop cluster, up to 100 times faster in memory, and ten times faster on disk
• It offers lightning-fast processing
• Support for sophisticated analytics
• Ability to integrate with Hadoop and existing Hadoop data

* Plotly is an analytics tool that lets users create charts and dashboards to share online.

Features:

• Easily turn any data into eye-catching and informative graphics


• It provides audited industries with fine-grained information on data provenance
• Plotly offers unlimited public file hosting through its free community plan

* Lumify is a big data fusion, analysis, and visualization platform. It helps users to discover
connections and explore relationships in their data via a suite of analytic options.

Features:

• It provides both 2D and 3D graph visualizations with a variety of automatic layouts
• It provides a variety of options for analyzing the links between entities on the graph
• It comes with specific ingest processing and interface elements for textual content, images, and videos
• Its spaces feature allows you to organize work into a set of projects, or workspaces
• It is built on proven, scalable big data technologies

* IBM SPSS Modeler is a predictive big data analytics platform. It offers predictive models
and delivers to individuals, groups, systems and the enterprise. It has a range of advanced
algorithms and analysis techniques.

Features:

• Discover insights and solve problems faster by analyzing structured and unstructured
data
• Use an intuitive interface for everyone to learn
• You can select from on-premises, cloud and hybrid deployment options
• Quickly choose the best performing algorithm based on model performance

* MongoDB is a NoSQL, document-oriented database written in C, C++, and JavaScript. It is free to use and is an open source tool that supports multiple operating systems, including Windows Vista (and later versions), OS X (10.7 and later versions), Linux, Solaris, and FreeBSD.

Its main features include aggregation, ad hoc queries, use of the BSON format, sharding, indexing, replication, server-side execution of JavaScript, schemaless design, capped collections, MongoDB Management Service (MMS), load balancing and file storage.

Features:

• Easy to learn.
• Provides support for multiple technologies and platforms.
• No hiccups in installation and maintenance.
• Reliable and low cost.

UNIT II

NoSQL

NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL databases are used for distributed data stores with humongous data storage needs. NoSQL is used for Big Data and real-time web apps, for example by companies like Twitter, Facebook, and Google that collect terabytes of user data every single day.
SQL

Structured Query Language (SQL), pronounced as "S-Q-L" or sometimes as "See-Quel", is the standard language for dealing with relational databases. A relational database defines relationships in the form of tables.

SQL programming can be effectively used to insert, search, update, delete database records.

Comparison of SQL and NoSQL

Definition
SQL: SQL databases are primarily called RDBMS or relational databases.
NoSQL: NoSQL databases are primarily called non-relational or distributed databases.

Design for
SQL: Traditional RDBMS uses SQL syntax and queries to analyze and get the data for further insights; they are used for OLAP systems.
NoSQL: NoSQL database systems consist of various kinds of database technologies, developed in response to the demands of modern application development.

Query language
SQL: Structured Query Language (SQL).
NoSQL: No declarative query language.

Type
SQL: SQL databases are table-based databases.
NoSQL: NoSQL databases can be document based, key-value pairs, or graph databases.

Schema
SQL: SQL databases have a predefined schema.
NoSQL: NoSQL databases use a dynamic schema for unstructured data.

Ability to scale
SQL: SQL databases are vertically scalable.
NoSQL: NoSQL databases are horizontally scalable.

Examples
SQL: Oracle, Postgres, and MS-SQL.
NoSQL: MongoDB, Redis, Neo4j, Cassandra, HBase.

Best suited for
SQL: An ideal choice for the complex, query-intensive environment.
NoSQL: Not a good fit for complex queries.

Hierarchical data storage
SQL: SQL databases are not suitable for hierarchical data storage.
NoSQL: More suitable for hierarchical data storage as it supports the key-value pair method.

Variations
SQL: One type with minor variations.
NoSQL: Many different types, which include key-value stores, document databases, and graph databases.

Development year
SQL: Developed in the 1970s to deal with the issues of flat-file storage.
NoSQL: Developed in the late 2000s to overcome the issues and limitations of SQL databases.

Open source
SQL: A mix of open-source (Postgres, MySQL) and commercial (Oracle Database) systems.
NoSQL: Open source.

Consistency
SQL: It should be configured for strong consistency.
NoSQL: It depends on the DBMS; some offer strong consistency (like MongoDB), whereas others offer only eventual consistency (like Cassandra).

Best used for
SQL: An RDBMS database is the right option for solving ACID problems.
NoSQL: Best used for solving data-availability problems.

Importance
SQL: Use when data validity is super important.
NoSQL: Use when it is more important to have fast data than correct data.

Best option
SQL: When you need to support dynamic queries.
NoSQL: Use when you need to scale based on changing requirements.

Hardware
SQL: Specialized DB hardware (Oracle Exadata, etc.).
NoSQL: Commodity hardware.

Network
SQL: Highly available network (Infiniband, Fabric Path, etc.).
NoSQL: Commodity network (Ethernet, etc.).

Storage type
SQL: Highly available storage (SAN, RAID, etc.).
NoSQL: Commodity drive storage (standard HDDs, JBOD).

Best features
SQL: Cross-platform support, secure and free.
NoSQL: Easy to use, high performance, and a flexible tool.

Top companies using
SQL: Hootsuite, CircleCI, Gauges.
NoSQL: Airbnb, Uber, Kickstarter.

Average salary
SQL: The average salary for a SQL developer is about $84,328 per year in the U.S.A.
NoSQL: The average salary for a "NoSQL developer" starts from approximately $72,174 per year.

ACID vs. BASE model
SQL: ACID (Atomicity, Consistency, Isolation, Durability) is the standard for RDBMS.
NoSQL: BASE (Basically Available, Soft state, Eventually consistent) is the model followed by many NoSQL systems.

RDBMS Versus Hadoop

Distributed Computing Challenges

Designing a distributed system is not easy and straightforward. A number of challenges need to be overcome in order to get the ideal system. The major challenges in distributed systems are listed below:

1. Heterogeneity:

The Internet enables users to access services and run applications over a heterogeneous collection of computers and networks. Heterogeneity (that is, variety and difference) applies to all of the following:

o Hardware devices: computers, tablets, mobile phones, embedded devices, etc.


o Operating System: MS Windows, Linux, Mac, Unix, etc.
o Network: Local network, the Internet, wireless network, satellite links, etc.
o Programming languages: Java, C/C++, Python, PHP, etc.
o Different roles of software developers, designers, system managers


Different programming languages use different representations for characters and data structures such as arrays and records. These differences must be addressed if programs written in different languages are to be able to communicate with one another. Programs written by different developers cannot communicate with one another unless they use common standards, for example, for network communication.

Heterogeneity and mobile code: The term mobile code is used to refer to program code that can be
transferred from one computer to another and run at the destination – Java applets are an example. Code
suitable for running on one computer is not necessarily suitable for running on another because executable
programs are normally specific both to the instruction set and to the host operating system.
2. Transparency:

Transparency is defined as the concealment from the user and the application programmer of the separation of components in a distributed system, so that the system is perceived as a whole rather than as a collection of independent components. In other words, distributed systems designers must hide the complexity of the systems as much as they can. Some types of transparency in distributed systems are:

• Access – Hide differences in data representation and how a resource is accessed
• Location – Hide where a resource is located
• Migration – Hide that a resource may move to another location
• Relocation – Hide that a resource may be moved to another location while in use
• Replication – Hide that a resource may be copied in several places
• Concurrency – Hide that a resource may be shared by several competitive users
• Failure – Hide the failure and recovery of a resource
• Persistence – Hide whether a (software) resource is in memory or on disk


3. Openness


The openness of a computer system is the characteristic that determines whether the
system can be extended and re-implemented in various ways. The openness of distributed
systems is determined primarily by the degree to which new resource-sharing services can be
added and be made available for use by a variety of client programs. If the well-defined
interfaces for a system are published, it is easier for developers to add new features or replace
sub-systems in the future. Example: Twitter and Facebook have APIs that allow developers to develop their own software interactively.

4. Concurrency

Both services and applications provide resources that can be shared by clients in a distributed system. There is therefore a possibility that several clients will attempt to access a shared resource at the same time. For example, a data structure that records bids for an auction may be accessed very frequently when it gets close to the deadline time. For an object to be safe in a concurrent environment, its operations must be synchronized in such a way that its data remains consistent. This can be achieved by standard techniques such as semaphores, which are used in most operating systems.
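A minimal Python sketch of that synchronization idea (the auction data is invented): a lock plays the role of the semaphore, so concurrent bidders cannot corrupt the shared record.

import threading

highest_bid = 0
lock = threading.Lock()          # the semaphore/lock protecting the shared record

def place_bid(amount):
    global highest_bid
    with lock:                   # only one client updates the shared data at a time
        if amount > highest_bid:
            highest_bid = amount

threads = [threading.Thread(target=place_bid, args=(amt,)) for amt in (100, 250, 180)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(highest_bid)               # 250, regardless of how the bids interleave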

5. Security

Many of the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users. Their security is therefore of considerable importance. Security for information resources has three components:
• Confidentiality (protection against disclosure to unauthorized individuals)
• Integrity (protection against alteration or corruption)
• Availability for the authorized (protection against interference with the means to access the resources)
6. Scalability

Distributed systems must be scalable as the number of users increases. Scalability is defined by B. Clifford Neuman as:


A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity.

Scalability has 3 dimensions:

o Size – the number of users and resources to be processed. The problem associated with it is overloading.
o Geography – the distance between users and resources. The problem associated with it is communication reliability.
o Administration – as the size of a distributed system increases, many parts of the system need to be controlled. The problem associated with it is administrative mess.
7. Failure Handling

Computer systems sometimes fail. When faults occur in hardware or software, programs
may produce incorrect results or may stop before they have completed the intended
computation. The handling of failures is particularly difficult.

Hadoop Overview

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture


At its core, Hadoop has two major layers namely −

• Processing/Computation layer (MapReduce), and

• Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications devised at Google
for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on
Hadoop which is an Apache open-source framework.
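Production MapReduce jobs are usually written in Java (or run through Hadoop Streaming), but the programming model itself can be sketched in a few lines of plain Python; the word count below is only a local simulation of the map, shuffle/sort and reduce phases, not Hadoop code.

from collections import defaultdict

def map_phase(text):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort groups values by key; the reducer then sums each group.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

if __name__ == "__main__":
    data = "big data needs big clusters"
    print(reduce_phase(map_phase(data)))   # {'big': 2, 'data': 1, 'needs': 1, 'clusters': 1}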

Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes
the following two modules −

• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.

• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing, but as an alternative you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered and low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −

• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).

• These files are then distributed across various cluster nodes for further processing.

• HDFS, being on top of the local file system, supervises the processing.

• Blocks are replicated for handling hardware failure.


• Checking that the code was executed successfully.

• Performing the sort that takes place between the map and reduce stages.

• Sending the sorted data to a certain computer.

• Writing the debugging logs for each job.

Advantages of Hadoop

• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

• Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.

• Another big advantage of Hadoop is that, apart from being open source, it is compatible on all platforms since it is Java based.

Processing Data with Hadoop - Managing Resources and Applications with Hadoop
YARN

YARN divides the tasks of resource management and job scheduling/monitoring into separate daemons. There is one ResourceManager and a per-application ApplicationMaster. An application can be either a job or a DAG of jobs.

The ResourceManager has two components – the Scheduler and the ApplicationManager.

The Scheduler is a pure scheduler, i.e. it does not track the status of running applications. It only allocates resources to the various competing applications. Also, it does not restart a job after failure due to hardware or application failure. The Scheduler allocates the resources based on


an abstract notion of a container. A container is nothing but a fraction of resources like CPU,
memory, disk, network etc.
Following are the tasks of the ApplicationManager:

• Accepts submission of jobs by the client.
• Negotiates the first container for a specific ApplicationMaster.
• Restarts the container after application failure.

Below are the responsibilities of the ApplicationMaster:

• Negotiates containers from the Scheduler.
• Tracks container status and monitors its progress.

YARN supports the concept of Resource Reservation via the Reservation System. With this, a user can reserve resources for the execution of a particular job over time, subject to temporal constraints. The Reservation System makes sure that the resources are available to the job until its completion. It also performs admission control for reservations.

YARN can scale beyond a few thousand nodes via YARN Federation. YARN Federation allows multiple sub-clusters to be wired into a single massive cluster. We can use many independent clusters together for a single large job. It can be used to achieve a large-scale system.

Let us summarize how Hadoop works step by step:


• Input data is broken into blocks of size 128 MB and the blocks are then moved to different nodes.
• Once all the blocks of the data are stored on data-nodes, the user can process the data.
• Resource Manager then schedules the program (submitted by the user) on individual
nodes.
• Once all the nodes process the data, the output is written back to HDFS.


Interacting with Hadoop Ecosystem


Hadoop Ecosystem

Hadoop has an ecosystem that has evolved from its three core components: processing, resource management, and storage. In this topic, you will learn the components of the Hadoop ecosystem and how they perform their roles during Big Data processing. The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following twelve components:

• HDFS(Hadoop Distributed file system)


• HBase
• Sqoop
• Flume
• Spark
• Hadoop MapReduce
• Pig
• Impala
• Hive
• Cloudera Search
• Oozie
• Hue.

Let us understand the role of each component of the Hadoop ecosystem.

Components of Hadoop Ecosystem

Let us start with the first component HDFS of Hadoop Ecosystem.

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

• HDFS is a storage layer for Hadoop.


• HDFS is suitable for distributed storage and processing; that is, while the data is being stored, it first gets distributed and then it is processed.
• HDFS provides Streaming access to file system data.
• HDFS provides file permission and authentication.
• HDFS uses a command line interface to interact with Hadoop.

So what stores data in HDFS? It is the HBase which stores data in HDFS.

HBase

• HBase is a NoSQL database or non-relational database.
• HBase is important and mainly used when you need random, real-time read or write access to your Big Data.
• It provides support to a high volume of data and high throughput.
• In an HBase, a table can have thousands of columns.


UNIT-III

INTRODUCTION TO MONGODB AND MAPREDUCE PROGRAMMING

MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.

Database

Database is a physical container for collections. Each database gets its own set of files on
the filesystem. A single MongoDB server typically has multiple databases.

Collection

Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A


collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purpose.

Document

A document is a set of key-value pairs. Documents have a dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

The following table shows the relationship of RDBMS terminology with MongoDB.


RDBMS → MongoDB
Database → Database
Table → Collection
Tuple/Row → Document
Column → Field
Primary Key → Primary Key (default key _id provided by MongoDB itself)

Database server and client:
mysqld/Oracle → mongod
mysql/sqlplus → mongo
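To see this mapping from application code, here is a hedged sketch using PyMongo, the official Python driver (the connection URI, database name and collection name are assumptions for illustration):

from pymongo import MongoClient

# Connect to a local mongod; database ~ database, collection ~ table, document ~ row.
client = MongoClient("mongodb://localhost:27017/")
db = client["blogdb"]          # database
posts = db["posts"]            # collection

# Insert a document; MongoDB adds the _id primary key automatically if it is omitted.
posts.insert_one({"title": "MongoDB Overview", "likes": 100, "tags": ["mongodb", "NoSQL"]})

# Query documents, roughly the counterpart of SELECT ... WHERE likes >= 50.
for doc in posts.find({"likes": {"$gte": 50}}):
    print(doc["_id"], doc["title"])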

Sample Document

The following example shows the document structure of a blog site, which is simply a comma-separated key-value pair structure.

{
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview',
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100,
   comments: [
      {
         user: 'user1',
         message: 'My first comment',
         dateCreated: new Date(2011,1,20,2,15),
         like: 0
      },
      {
         user: 'user2',
         message: 'My second comments',
         dateCreated: new Date(2011,1,25,7,45),
         like: 5
      }
   ]
}

_id is a 12-byte hexadecimal number which assures the uniqueness of every document. You can provide _id while inserting the document. If you don't provide it, then MongoDB provides a unique id for every document. Of these 12 bytes, the first 4 bytes are the current timestamp, the next 3 bytes the machine id, the next 2 bytes the process id of the MongoDB server, and the remaining 3 bytes a simple incremental value.
Any relational database has a typical schema design that shows number of tables and the
relationship between these tables. While in MongoDB, there is no concept of relationship.


Advantages of MongoDB over RDBMS

• Schema-less − MongoDB is a document database in which one collection holds different documents. The number of fields, content and size of the document can differ from one document to another.

• Structure of a single object is clear.

• No complex joins.

• Deep query-ability. MongoDB supports dynamic queries on documents using a


document-based query language that's nearly as powerful as SQL.

• Tuning.

• Ease of scale-out − MongoDB is easy to scale.

• Conversion/mapping of application objects to database objects not needed.

• Uses internal memory for storing the (windowed) working set, enabling faster access
of data.

Why Use MongoDB?

• Document Oriented Storage − Data is stored in the form of JSON style documents.

• Index on any attribute

• Replication and high availability

• Auto-Sharding

• Rich queries

• Fast in-place updates

• Professional support by MongoDB

Where to Use MongoDB?

BIG DATA 42
ANALYTICS
MRCET
-CSE

• Big Data

• Content Management and Delivery

• Mobile and Social Infrastructure

• User Data Management

• Data Hub

MongoDB supports many datatypes. Some of them are −

• String − This is the most commonly used datatype to store the data. Strings in MongoDB must be UTF-8 valid.

• Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit
depending upon your server.

• Boolean − This type is used to store a boolean (true/ false) value.

• Double − This type is used to store floating point values.

• Min/Max keys − This type is used to compare a value against the lowest and highest BSON elements.

• Arrays − This type is used to store arrays or list or multiple values into one key.

• Timestamp − This can be handy for recording when a document has been modified or added.

• Object − This datatype is used for embedded documents.

• Null − This type is used to store a Null value.

• Symbol − This datatype is used identically to a string; however, it is generally reserved for languages that use a specific symbol type.

• Date − This datatype is used to store the current date or time in UNIX time format. You
can specify your own date time by creating object of Date and passing day, month,
year into it.

• Object ID − This datatype is used to store the document’s ID.


• Binary data − This datatype is used to store binary data.

• Code − This datatype is used to store JavaScript code into the document.

• Regular expression − This datatype is used to store regular expressions.
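As an illustrative sketch (the collection name and field values are assumptions, not taken from the notes), a PyMongo document that mixes several of the datatypes listed above:

# One document exercising several BSON types via PyMongo.
import datetime
from bson.objectid import ObjectId
from pymongo import MongoClient

doc = {
    "_id": ObjectId(),                        # Object ID
    "title": "Type demo",                     # String (must be valid UTF-8)
    "likes": 100,                             # Integer
    "rating": 4.5,                            # Double
    "active": True,                           # Boolean
    "tags": ["mongodb", "NoSQL"],             # Array
    "meta": {"author": "tutorials point"},    # Object (embedded document)
    "createdAt": datetime.datetime.utcnow(),  # Date
    "note": None,                             # Null
}
MongoClient()["sampleDB"]["typedemo"].insert_one(doc)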

The find() Method

To query data from a MongoDB collection, you need to use MongoDB's find() method.

Syntax

The basic syntax of the find() method is as follows −

>db.COLLECTION_NAME.find()

The find() method will display all the documents in a non-structured way.

Example

Assume we have created a collection named mycol as −

> use sampleDB


switched to db sampleDB

> db.createCollection("mycol")
{ "ok" : 1 }

And inserted the following documents in it using the insert() method as shown below −

> db.mycol.insert([
	{
		title: "MongoDB Overview",
		description: "MongoDB is no SQL database",
		by: "tutorials point",
		url: "http://www.tutorialspoint.com",
		tags: ["mongodb", "database", "NoSQL"],
		likes: 100
	},
	{
		title: "NoSQL Database",
		description: "NoSQL database doesn't have tables",
		by: "tutorials point",
		url: "http://www.tutorialspoint.com",
		tags: ["mongodb", "database", "NoSQL"],
		likes: 20,
		comments: [
			{
				user: "user1",
				message: "My first comment",
				dateCreated: new Date(2013,11,10,2,35),
				like: 0
			}
		]
	}
])

Following method retrieves all the documents in the collection −

> db.mycol.find()
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534c"), "title" : "MongoDB Overview", "description" : "MongoDB is no SQL database", "by" : "tutorials point", "url" : "http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 100 }
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534d"), "title" : "NoSQL Database", "description" : "NoSQL database doesn't have tables", "by" : "tutorials point", "url" : "http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 20, "comments" : [ { "user" : "user1", "message" : "My first comment", "dateCreated" : ISODate("2013-12-09T21:05:00Z"), "like" : 0 } ] }
>

The pretty() Method

To display the results in a formatted way, you can use the pretty() method.

Syntax

>db.COLLECTION_NAME.find().pretty()

Example

Following example retrieves all the documents from the collection named mycol and arranges them in an easy-to-read format.

> db.mycol.find().pretty()
{
	"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
	"title" : "MongoDB Overview",
	"description" : "MongoDB is no SQL database",
	"by" : "tutorials point",
	"url" : "http://www.tutorialspoint.com",
	"tags" : [
		"mongodb",
		"database",
		"NoSQL"
	],
	"likes" : 100
}
{
	"_id" : ObjectId("5dd4e2cc0821d3b44607534d"),
	"title" : "NoSQL Database",
	"description" : "NoSQL database doesn't have tables",
	"by" : "tutorials point",
	"url" : "http://www.tutorialspoint.com",
	"tags" : [
		"mongodb",
		"database",
		"NoSQL"
	],
	"likes" : 20,
	"comments" : [
		{
			"user" : "user1",
			"message" : "My first comment",
			"dateCreated" : ISODate("2013-12-09T21:05:00Z"),
			"like" : 0
		}
	]
}

The findOne() method

Apart from the find() method, there is the findOne() method, which returns only one document.

Syntax

>db.COLLECTION_NAME.findOne()

Example

Following example retrieves the document with title MongoDB Overview.

> db.mycol.findOne({title: "MongoDB Overview"})
{
	"_id" : ObjectId("5dd6542170fb13eec3963bf0"),
	"title" : "MongoDB Overview",
	"description" : "MongoDB is no SQL database",
	"by" : "tutorials point",
	"url" : "http://www.tutorialspoint.com",
	"tags" : [
		"mongodb",
		"database",
		"NoSQL"
	],
	"likes" : 100
}


RDBMS Where Clause Equivalents in MongoDB

To query the document on the basis of some condition, you can use following operations.

Operation: Equality
Syntax: {<key>:<value>}
Example: db.mycol.find({"by":"tutorials point"}).pretty()
RDBMS Equivalent: where by = 'tutorials point'

Operation: Less Than
Syntax: {<key>:{$lt:<value>}}
Example: db.mycol.find({"likes":{$lt:50}}).pretty()
RDBMS Equivalent: where likes < 50

Operation: Less Than Equals
Syntax: {<key>:{$lte:<value>}}
Example: db.mycol.find({"likes":{$lte:50}}).pretty()
RDBMS Equivalent: where likes <= 50

Operation: Greater Than
Syntax: {<key>:{$gt:<value>}}
Example: db.mycol.find({"likes":{$gt:50}}).pretty()
RDBMS Equivalent: where likes > 50

Operation: Greater Than Equals
Syntax: {<key>:{$gte:<value>}}
Example: db.mycol.find({"likes":{$gte:50}}).pretty()
RDBMS Equivalent: where likes >= 50

Operation: Not Equals
Syntax: {<key>:{$ne:<value>}}
Example: db.mycol.find({"likes":{$ne:50}}).pretty()
RDBMS Equivalent: where likes != 50

Operation: Values in an array
Syntax: {<key>:{$in:[<value1>, <value2>, ……, <valueN>]}}
Example: db.mycol.find({"name":{$in:["Raj", "Ram", "Raghu"]}}).pretty()
RDBMS Equivalent: where name matches any value in ["Raj", "Ram", "Raghu"]

Operation: Values not in an array
Syntax: {<key>:{$nin:[<value1>, <value2>, ……, <valueN>]}}
Example: db.mycol.find({"name":{$nin:["Ramu","Raghav"]}}).pretty()
RDBMS Equivalent: where name values are not in the array ["Ramu", "Raghav"] or don't exist at all
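As a quick illustration, a hedged PyMongo sketch showing how a few of these operators look from Python, where the query documents are ordinary dicts (the driver and collection are the ones assumed earlier in this chapter):

# The same query operators expressed through PyMongo.
from pymongo import MongoClient

col = MongoClient()["sampleDB"]["mycol"]

equality   = col.find({"by": "tutorials point"})     # where by = 'tutorials point'
less_than  = col.find({"likes": {"$lt": 50}})        # where likes < 50
not_equals = col.find({"likes": {"$ne": 50}})        # where likes != 50
in_list    = col.find({"title": {"$in": ["MongoDB Overview", "NoSQL Database"]}})

for doc in less_than:
    print(doc["title"], doc["likes"])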

AND in MongoDB

Syntax

To query documents based on the AND condition, you need to use the $and keyword. Following is the basic syntax of AND −

>db.mycol.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })

Example

Following example will show all the tutorials written by 'tutorials point' and whose title is 'MongoDB Overview'.

> db.mycol.find({$and:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
	"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
	"title" : "MongoDB Overview",
	"description" : "MongoDB is no SQL database",
	"by" : "tutorials point",
	"url" : "http://www.tutorialspoint.com",
	"tags" : [
		"mongodb",
		"database",
		"NoSQL"
	],
	"likes" : 100
}
>

For the above given example, the equivalent where clause will be ' where by = 'tutorials point' AND title = 'MongoDB Overview' '. You can pass any number of key-value pairs in the find clause.

OR in MongoDB

Syntax

To query documents based on the OR condition, you need to use the $or keyword. Following is the basic syntax of OR −

>db.mycol.find(
   {
      $or: [
         {key1: value1}, {key2:value2}
      ]
   }
).pretty()

Example

Following example will show all the tutorials written by 'tutorials point' or whose title is 'MongoDB Overview'.

>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
	"_id": ObjectId(7df78ad8902c),
	"title": "MongoDB Overview",
	"description": "MongoDB is no sql database",
	"by": "tutorials point",
	"url": "http://www.tutorialspoint.com",
	"tags": ["mongodb", "database", "NoSQL"],
	"likes": "100"
}
>

Using AND and OR Together

Example

The following example will show the documents that have likes greater than 10 and whose title is either 'MongoDB Overview' or by is 'tutorials point'. The equivalent SQL where clause is 'where likes>10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')'.

>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"}, {"title": "MongoDB Overview"}]}).pretty()
{
	"_id": ObjectId(7df78ad8902c),
	"title": "MongoDB Overview",
	"description": "MongoDB is no sql database",
	"by": "tutorials point",
	"url": "http://www.tutorialspoint.com",
	"tags": ["mongodb", "database", "NoSQL"],
	"likes": "100"
}
>

NOR in MongoDB

Syntax

To query documents based on the NOR condition, you need to use the $nor keyword. Following is the basic syntax of NOR −

>db.COLLECTION_NAME.find(
   {
      $nor: [
         {key1: value1}, {key2:value2}
      ]
   }
)

Example

First, create an empDetails collection by inserting the following documents −

db.empDetails.insertMany(
	[
		{
			First_Name: "Radhika",
			Last_Name: "Sharma",
			Age: "26",
			e_mail: "[email protected]",
			phone: "9000012345"
		},
		{
			First_Name: "Rachel",
			Last_Name: "Christopher",
			Age: "27",
			e_mail: "[email protected]",
			phone: "9000054321"
		},
		{
			First_Name: "Fathima",
			Last_Name: "Sheik",
			Age: "24",
			e_mail: "[email protected]",
			phone: "9000054321"
		}
	]
)

Following example will retrieve the document(s) whose first name is not "Radhika" and last name is not "Christopher" −


> db.empDetails.find(
	{
		$nor: [
			{"First_Name": "Radhika"},
			{"Last_Name": "Christopher"}
		]
	}
).pretty()
{
	"_id" : ObjectId("5dd631f270fb13eec3963bef"),
	"First_Name" : "Fathima",
	"Last_Name" : "Sheik",
	"Age" : "24",
	"e_mail" : "[email protected]",
	"phone" : "9000054321"
}

NOT in MongoDB

Syntax

To query documents based on the NOT condition, you need to use the $not keyword. Following is the basic syntax of NOT, applied here to retrieve the document(s) whose Age is not greater than 25 −

> db.empDetails.find( { "Age": { $not: { $gt: "25" } } } )
{
	"_id" : ObjectId("5dd6636870fb13eec3963bf7"),
	"First_Name" : "Fathima",
	"Last_Name" : "Sheik",
	"Age" : "24",
	"e_mail" : "[email protected]",
	"phone" : "9000054321"
}


MapReduce:

MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and therefore is easy for the programmer to reason about. MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines. The first is under the control of the programmer, while the second is exclusively the responsibility of the execution framework or "runtime". The advantage is that the execution framework only needs to be designed once and verified for correctness; thereafter, as long as the developer expresses computations in the programming model, code is guaranteed to behave as expected. The upshot is that the developer is freed from having to worry about system-level details (e.g., no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling large-data problems. But beyond that, MapReduce is important in how it has changed the way we organize computations at a massive scale. MapReduce represents the first widely-adopted step away from the von Neumann model that has served as the foundation of computer science over the last half plus century. Valiant called this a bridging model [148], a conceptual bridge between the physical implementation of a machine and the software that is to be executed on that machine. Until recently, the von Neumann model has served us well: hardware designers focused on efficient implementations of the von Neumann model and didn't have to think much about the actual software that would run on the machines. Similarly, the software industry developed software targeted at the model without worrying about the hardware details. The result was extraordinary growth: chip designers churned out successive generations of increasingly powerful processors, and software engineers were able to develop applications in high-level languages that exploited those processors.

MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer. MapReduce is certainly not the first model of parallel computation that has been proposed; the most prevalent model in theoretical computer science, which dates back several decades, is the PRAM.

MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data types, although a number of libraries such as Protocol Buffers, Thrift, and Avro simplify the task. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly meaningful and are simply ignored during processing, while in other cases input keys are used to uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys and values in the design of various algorithms.

In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The convention [. . .] is used throughout this book to denote a list. The input to a MapReduce job starts as data stored on the underlying distributed file system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. Implicit between the map and reduce phases is a distributed "group by" operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure.

A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., a probability distribution over words in a collection). Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we've seen the word once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key, and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted by alphabetical order, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.
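To make the two-stage data flow concrete, the following is a minimal, self-contained Python sketch that simulates the map, shuffle-and-sort, and reduce phases of word count in memory; it is illustrative only and is not Hadoop code (the function and variable names are our own):

# Word count in the MapReduce model, simulated in one process:
# the mapper emits (word, 1) pairs, a sort stands in for the distributed
# "group by", and the reducer sums the counts for each word.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1                      # key = word, value = 1

def reduce_phase(pairs):
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
for word, count in reduce_phase(map_phase(docs)):
    print(word, count)                         # prints e.g.: the 3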

There are some differences between the Hadoop implementation of MapReduce and Google's implementation. In Hadoop, the reducer is presented with a key and an iterator over all values associated with the particular key. The values are arbitrarily ordered. Google's implementation allows the programmer to specify a secondary sort key for ordering the values (if desired), in which case values associated with each key would be presented to the developer's reduce code in sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary sorting. Another difference: in Google's implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks. We will return to discuss the details of Hadoop job execution in Section 2.6, which is dependent on an understanding of the distributed file system (covered in Section 2.5). To reiterate: although the presentation of algorithms in this book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding, not actual Hadoop programming. For that, we would recommend Tom White's book [154].

What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary computations over their inputs. However, one must generally be careful about use of external resources since multiple mappers or reducers may be contending for those resources. For example, it may be unwise for a mapper to query an external SQL database, since that would introduce a scalability bottleneck on the number of map tasks that could be run in parallel (since they might all be simultaneously querying the database). In general, mappers can emit an arbitrary number of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs. Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type from the intermediate key-value pairs. Although not permitted in functional programming, mappers and reducers can have side effects. This is a powerful and useful feature: for example, preserving state across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as having side effects that only change state that is internal to the mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee (since the function's behavior depends not only on the current input but on previous inputs), most potential synchronization problems are avoided since internal state is private only to individual mappers and reducers. In other cases (see Section 4.4 and Section 6.5), it may be useful for mappers or reducers to have external side effects, such as writing files to the distributed file system. Since many mappers and reducers are run in parallel, and the distributed file system is a shared global resource, special care must be taken to ensure that such operations avoid synchronization conflicts. One strategy is to write a temporary file that is renamed upon successful completion of the mapper or reducer.

In addition to the "canonical" MapReduce processing flow, other variations are also possible. MapReduce programs can contain no reducers, in which case mapper output is directly written to disk (one file per mapper). For embarrassingly parallel problems, e.g., parse a large text collection or independently analyze a large number of images, this would be a common pattern. The converse, a MapReduce program with no mappers, is not possible, although in some cases it is useful for the mapper to implement the identity function and simply pass input key-value pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some cases it is useful for the reducer to implement the identity function, in which case the program simply sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).

Although in the most common case, input to a MapReduce job comes from data stored on the distributed file system and output is written back to the distributed file system, any other system that satisfies the proper abstractions can serve as a data source or sink. With Google's MapReduce implementation, BigTable [34], a sparse, distributed, persistent multidimensional sorted map, is frequently used as a source of input and as a store of MapReduce output. HBase is an open-source BigTable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP (massively parallel processing) relational databases, which allows a programmer to write MapReduce jobs over database rows and dump output into a new database table. Finally, in some cases MapReduce jobs may not consume any input at all (e.g., computing π) or may only consume a small amount of data (e.g., input parameters to many instances of processor-intensive simulations running in parallel).
PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements that complete the programming model: partitioners and combiners. Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order (which is how the "group by" is implemented). The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers. This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function). Note, however, that the partitioner only considers the key and ignores the value; therefore, a roughly-even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.
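The simplest partitioner described above can be sketched in a few lines of Python; this is an illustrative stand-in, not Hadoop's actual HashPartitioner source:

# Hash the key and take the modulus with the number of reducers,
# so a given key is always routed to the same reducer.
def partition(key: str, num_reducers: int) -> int:
    return hash(key) % num_reducers

# Example: route intermediate (word, 1) pairs to 4 reducers.
pairs = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1)]
buckets = {r: [] for r in range(4)}
for key, value in pairs:
    buckets[partition(key, 4)].append((key, value))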

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these key-value pairs need to be copied across the network, and so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as "mini-reducers" that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners are not interchangeable.

In many cases, proper use of combiners can spell the difference between an impractical algorithm and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses on various techniques for local aggregation. It suffices to say for now that a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms. The complete MapReduce model is shown in Figure 2.4. Output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase. Therefore, a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.
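For word count, where the reduce operation (addition) is associative and commutative, the reducer logic can double as the combiner. A hedged Python sketch of such local aggregation applied to one mapper's output:

# A combiner for word count, applied to the output of a single mapper.
# Because addition is associative and commutative, this is the same logic
# as the reducer, just run locally before the shuffle and sort phase.
from collections import Counter

def combine(mapper_output):
    """Collapse (word, 1) pairs from one mapper into (word, local_count) pairs."""
    local_counts = Counter()
    for word, count in mapper_output:
        local_counts[word] += count
    return list(local_counts.items())

# One mapper emitted these pairs; the combiner shrinks them before the network copy.
print(combine([("the", 1), ("quick", 1), ("the", 1), ("the", 1)]))
# [('the', 3), ('quick', 1)]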


SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase, which is very convenient if computations inside the reducer rely on sort order (e.g., the order inversion design pattern described in the previous section). However, what if in addition to sorting by key, we also need to sort by value? Google's MapReduce implementation provides built-in functionality for (optional) secondary sorting, which guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where rx after each timestamp represents the actual sensor readings (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images).

(t1, m1, r80521) (t1, m2, r14209) (t1, m3, r76042) . . .
(t2, m1, r21823) (t2, m2, r66508) (t2, m3, r98347)

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce program to accomplish this might map over the raw data and emit the sensor id as the intermediate key, with the rest of each record as the value:

m1 → (t1, r80521)

This would bring all readings from the same sensor together in the reducer. However, since MapReduce makes no guarantees about the ordering of values associated with the same key, the sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the readings in memory and then sort by timestamp before additional processing. However, it should be apparent by now that any in-memory buffering of data introduces a potential scalability bottleneck. What if we are working with a high frequency sensor or sensor readings over a long period of time? What if the sensor readings themselves are large complex objects? This approach may not scale in these cases; the reducer would run out of memory trying to buffer all values associated with the same key.

This is a common problem, since in many applications we wish to first group together data one way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time). Fortunately, there is a general purpose solution, which we call the "value-to-key conversion" design pattern. The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting. In the above example, instead of emitting the sensor id as the key, we would emit the sensor id and the timestamp as a composite key:

(m1, t1) → (r80521)

The sensor reading itself now occupies the value. We must define the intermediate key sort order to first sort by the sensor id (the left element in the pair) and then by the timestamp (the right element in the pair). We must also implement a custom partitioner so that all pairs associated with the same sensor are shuffled to the same reducer. Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted order:

(m1, t1) → [(r80521)]
(m1, t2) → [(r21823)]
(m1, t3) → [(r146925)]
. . .

However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin. The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck. With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming model.
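A small Python sketch of the value-to-key conversion pattern, simulating the composite-key sort and the custom partitioner (the sensor ids, timestamps and readings follow the example above; the code itself is only illustrative):

# Value-to-key conversion: the timestamp moves into the key, the sort on
# (sensor_id, timestamp) delivers readings in temporal order, and the
# partitioner uses only sensor_id so a sensor's readings reach one reducer.
records = [("t2", "m1", "r21823"), ("t1", "m1", "r80521"), ("t1", "m2", "r14209")]

# Map phase: emit ((sensor_id, timestamp), reading) instead of (sensor_id, ...).
intermediate = [((m, t), r) for (t, m, r) in records]

def partition(composite_key, num_reducers):
    sensor_id, _timestamp = composite_key
    return hash(sensor_id) % num_reducers    # partition on sensor id only

# Shuffle/sort: sort by the full composite key (sensor id, then timestamp).
for (sensor_id, timestamp), reading in sorted(intermediate):
    print(sensor_id, timestamp, reading)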

INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a "toy" indexer and one that works on real-world collections. Otherwise, MapReduce inverted indexing algorithms are pretty straightforward.

Let us consider the canonical case where each posting consists of a document id and the term frequency. A naïve implementation might represent the first as a 32-bit integer and the second as a 16-bit integer. Thus, a postings list might be encoded as follows:

[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and commas are only included to enhance readability; in reality the postings would be represented as a long stream of integers. This naïve implementation would require six bytes per posting. Using this scheme, the entire inverted index would be about as large as the collection itself. Fortunately, we can do significantly better. The first trick is to encode differences between document ids as opposed to the document ids themselves. Since the postings are sorted by document ids, the differences (called d-gaps) must be positive integers greater than zero. The above postings list, represented with d-gaps, would be:

[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]

Of course, we must actually encode the first document id. We haven't lost any information, since the original document ids can be easily reconstructed from the d-gaps. However, it's not obvious that we've reduced the space requirements either, since the largest possible d-gap is one less than the number of documents in the collection. This is where the second trick comes in, which is to represent the d-gaps in a way such that it takes less space for smaller numbers. Similarly, we want to apply the same techniques to compress the term frequencies, since for the most part they are also small values. But to understand how this is done, we need to take a slight detour into compression techniques, particularly for coding integers.

Compression, in general, can be characterized as either lossless or lossy: it's fairly obvious that lossless compression is required in this context. To start, it is important to understand that all compression techniques represent a time-space tradeoff. That is, we reduce the amount of space on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding and decoding data. Therefore, it is possible that compression reduces size but also slows processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth), we can achieve the best of both worlds: smaller and faster.

POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be properly compressed. As we can see from the previous section, there is a wide range of choices that represent different tradeoffs between compression ratio and decoding speed. Actual performance also depends on characteristics of the collection, which, among other factors, determine the distribution of d-gaps. Büttcher et al. [30] recently compared the performance of various compression techniques on coding document ids. In terms of the amount of compression that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by varInt. Simple-9 was substantially slower, and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being the slowest (about ten times slower than group varInt).

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this represents a space-efficient encoding, at the cost of slower decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set compression parameters.


Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the document collection and the number of postings for a particular postings list (i.e., the document frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The df of a term, however, is not known until all the postings have been processed, and unfortunately, the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would suffer from the memory bottleneck we've been trying to avoid in the first place.

To get around this problem, we need to somehow inform the reducer of a term's df before any of its postings arrive. This can be solved with the order inversion design pattern introduced in Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df)

to keep track of document frequencies associated with each term. In practice, we can accomplish this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an in-memory associative array that keeps track of how many documents a term has been observed in (i.e., the local document frequency of the term for the subset of documents processed by the mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value.

To ensure that these special keys arrive first, we define the sort order of the tuple so that the special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial df values originating from each mapper. Summing all these partial contributions will yield the term's df, which can then be used to set the Golomb compression parameter b. This allows the postings to be incrementally compressed as they are encountered in the reducer; memory bottlenecks are eliminated since we do not need to buffer postings in memory.

Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation. For computing relative frequencies, that bit of information was the marginal. In this case, it's the document frequency.

PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest path problem, where the task is to find shortest paths from a source node to all other nodes in the graph (or alternatively, edges can be associated with costs or weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate algorithm courses, where students are taught the solution using Dijkstra's algorithm. However, this famous algorithm assumes sequential processing: how would we solve this problem in parallel, and more specifically, with MapReduce?

Dijkstra(G, w, s)
  d[s] ← 0
  for all vertex v ∈ V do
    d[v] ← ∞
  Q ← {V}
  while Q ≠ ∅ do
    u ← ExtractMin(Q)
    for all vertex v ∈ u.AdjacencyList do
      if d[v] > d[u] + w(u, v) then
        d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra's algorithm, which is based on maintaining a global priority queue of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm expands the node with the shortest distance and updates distances to all reachable nodes.

As a refresher and also to serve as a point of comparison, Dijkstra's algorithm is shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest's classic algorithms textbook [41] (often simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E) represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices with priorities equal to their distance values d[v].

Dijkstra's algorithm operates by iteratively selecting the node with the lowest current distance from the priority queue (initially, this is the source node). At each iteration, the algorithm "expands" that node by traversing the adjacency list of the selected node to see if any of those nodes can be reached with a path of a shorter distance. The algorithm terminates when the priority queue Q is empty, or equivalently, when all nodes have been considered. Note that the algorithm as presented in Figure 5.2 only computes the shortest distances. The actual paths can be recovered by storing "backpointers" for every node indicating a fragment of the shortest path.

A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also adapted from CLR). We start out in (a) with n1 having a distance of zero (since it's the source) and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3, we see in (c) that the distance to n2 has decreased because we've found a shorter path. The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the end state shown in (f), where we've discovered the shortest distance to all nodes.

The key to Dijkstra's algorithm is the priority queue that maintains a globally sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as parallel breadth-first search. First, as a simplification let us assume that all edges have unit distance (modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but we'll relax this restriction later.
The intuition behind the algorithm is this: the distance of all nodes connected directly to the source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine water rippling away from a rock dropped into a pond; that's a good image of how parallel breadth-first search works. However, what if there are multiple paths to the same node? Suppose we wish to compute the shortest distance to node n. The shortest path must go through one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find ms, the node with the shortest distance. The shortest distance to n is the distance to ms plus one.

Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in Figure 5.4. As with Dijkstra's algorithm, we assume a connected, directed graph represented as adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote the node's corresponding data structure (adjacency list and current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the value is the current distance to the node plus one. This says: if we can reach node n with a distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1.

After shuffle and sort, reducers will receive keys corresponding to the destination node ids and distances corresponding to all paths leading to that node. The reducer will select the shortest of these distances and then update the distance in the node data structure.

Each iteration corresponds to a MapReduce job. The first time we run the algorithm, we "discover" all nodes that are connected to the source. The second iteration, we discover all nodes connected to those, and so on. Each iteration of the algorithm expands the "search frontier" by one hop, and, eventually, all nodes will be discovered with their shortest distances (assuming a fully-connected graph). Before we discuss termination of the algorithm, there is one more detail required to make the parallel breadth-first search algorithm work. We need to "pass along" the graph structure from one iteration to the next. This is accomplished by emitting the node data structure itself, with the node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node data structure before emitting it as the final value. The final output is now ready to serve as input to the next iteration.

So how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, or the greatest distance between any pair of nodes. This number is surprisingly small for many real-world problems: the saying "six degrees of separation" suggests that everyone on the planet is connected to everyone else by at most six steps (the people a person knows are one step away, people that they know are two steps away, etc.). If this is indeed true, then parallel breadth-first search on the global social network would take at most six MapReduce iterations.
1: class Mapper
2:   method Map(nid n, node N)
3:     d ← N.Distance
4:     Emit(nid n, N)                      . Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, d + 1)                . Emit distances to reachable nodes

1: class Reducer
2:   method Reduce(nid m, [d1, d2, . . .])
3:     dmin ← ∞
4:     M ← ∅
5:     for all d ∈ counts [d1, d2, . . .] do
6:       if IsNode(d) then
7:         M ← d                           . Recover graph structure
8:       else if d < dmin then             . Look for shorter distance
9:         dmin ← d
10:    M.Distance ← dmin                   . Update shortest distance
11:    Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to reachable nodes, while the reducers select the minimum of those distances for each destination node. Each iteration (one MapReduce job) of the algorithm expands the "search frontier" by one hop.
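To connect the pseudo-code to something runnable, here is a hedged, self-contained Python simulation of a single iteration of parallel breadth-first search on a toy graph with unit edge distances (the in-memory data layout is an assumption; a real Hadoop job would read and write the node structures from the distributed file system):

# One iteration of parallel breadth-first search (unit edge distances).
# Each node record mirrors Figure 5.4: a current distance plus an adjacency list.
INF = float("inf")
graph = {
    "n1": {"dist": 0,   "adj": ["n2", "n3"]},
    "n2": {"dist": INF, "adj": ["n4"]},
    "n3": {"dist": INF, "adj": ["n2"]},
    "n4": {"dist": INF, "adj": []},
}

def map_phase(graph):
    for nid, node in graph.items():
        yield nid, node                          # pass along graph structure
        for m in node["adj"]:
            yield m, node["dist"] + 1            # distance to reachable node

def reduce_phase(pairs):
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    new_graph = {}
    for nid, values in grouped.items():
        node = next(v for v in values if isinstance(v, dict))   # recover structure
        dmin = min([v for v in values if not isinstance(v, dict)] + [node["dist"]])
        new_graph[nid] = {"dist": dmin, "adj": node["adj"]}
    return new_graph

graph = reduce_phase(list(map_phase(graph)))     # one MapReduce iteration
print({n: d["dist"] for n, d in graph.items()})  # n2 and n3 now at distance 1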
For more serious academic studies of "small world" phenomena in networks, we refer the reader to a number of publications [61, 62, 152, 2]. In practical terms, we iterate the algorithm until there are no more node distances that are ∞. Since the graph is connected, all nodes are reachable, and since all edge distances are one, all discovered nodes are guaranteed to have the shortest distances (i.e., there is not a shorter path that goes through a node that hasn't been discovered).

The actual checking of the termination condition must occur outside of MapReduce. Typically, execution of an iterative MapReduce algorithm requires a non-MapReduce "driver" program, which submits a MapReduce job to iterate the algorithm, checks to see if a termination condition has been met, and if not, repeats. Hadoop provides a lightweight API for constructs called "counters", which, as the name suggests, can be used for counting events that occur during execution, e.g., number of corrupt records, number of times a certain condition is met, or anything that the programmer desires. Counters can be defined to count the number of nodes that have distances of ∞: at the end of the job, the driver program can access the final counter value and check to see if another iteration is necessary.

Finally, as with Dijkstra's algorithm in the form presented earlier, the parallel breadth-first search algorithm only finds the shortest distances, not the actual shortest paths. However, the path can be straightforwardly recovered. Storing "backpointers" at each node, as with Dijkstra's algorithm, will work, but may not be efficient since the graph needs to be traversed again to reconstruct the path segments. A simpler approach is to emit paths along with distances in the mapper, so that each node will have its shortest path easily accessible at all times. The additional space requirements for shuffling these data from mappers to reducers are relatively modest, since for the most part paths (i.e., sequences of node ids) are relatively short.

Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction and see what changes are required in the parallel breadth-first search algorithm. The adjacency lists, which were previously lists of node ids, must now encode the edge distances as well. In line

6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d + w where w is the edge distance. No other changes to the algorithm are required, but the termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5, where s is the source node, and in this iteration, we just "discovered" node r for the very first time. Assume for the sake of argument that we've already discovered the shortest distance to node p, and that the shortest distance to r so far goes through p. This, however, does not guarantee that we've discovered the shortest distance to r, since there may exist a path going through q that we haven't encountered yet (because it lies outside the search frontier). However, as the search frontier expands, we'll eventually cover q and all other nodes along the path from p to q to r, which means that with sufficient iterations, we will discover the shortest distance to r. But how do we know that we've found the shortest distance to p? Well, if the shortest path to p lies within the search frontier, we would have already discovered it. And if it doesn't, the above argument applies. Similarly, we can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with sufficient iterations, we'll eventually discover all the shortest distances.

So exactly how many iterations does "eventually" mean? In the worst case, we might need as many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source. The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6 goes through n3, n4, and n5 until the fifth iteration. Three more iterations are necessary to cover the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the number of iterations necessary

In practical terms, how do we know when to stop iterating in the case of arbitrary edge distances? The algorithm can terminate when shortest distances at every node no longer change. Once again, we can use counters to keep track of such events. Every time we encounter a shorter distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver program reads the counter value and determines if another iteration is necessary.

Compared to Dijkstra's algorithm on a single processor, parallel breadth-first search in MapReduce can be characterized as a brute force approach that "wastes" a lot of time performing computations whose results are discarded. At each iteration, the algorithm attempts to recompute distances to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier, the algorithm is simply repeating previous computations. Outside the search frontier, the algorithm hasn't discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra's algorithm, on the other hand, is far more efficient. Every time a node is explored, we're guaranteed to have already found the shortest path to it. However, this is made possible by maintaining a global data structure (a priority queue) that holds nodes sorted by distance; this is not possible in MapReduce because the programming model does not provide support for global data that is mutable and accessible by the mappers and reducers. These inefficiencies represent the cost of parallelization.
The parallel breadth-first search algorithm is instructive in that it represents the prototypical structure of
a large class of graph algorithms in MapReduce. They share the following characteristics:

The graph structure is represented with adjacency lists, which is part of some larger
node data structure that may contain additional information (variables to store intermediate
output, features of the nodes). In many cases, features are attached to edges as well (e.g., edge
weights).


In addition to computations, the graph itself is also passed from the mapper to the reducer. In the reducer, the data structure corresponding to each node is updated and written back to disk.

Graph algorithms in MapReduce are generally iterative, where the output of the previous iteration serves as input to the next iteration. The process is controlled by a non-MapReduce driver program that checks for termination.

For parallel breadth-first search, the mapper computation is the current distance plus edge distance (emitting distances to neighbors), while the reducer computation is the Min function (selecting the shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in much the same way.


UNIT IV

INTRODUCTION TO HIVE

The term ‘Big Data’ is used for collections of large datasets that include huge volume,
high velocity, and a variety of data that is increasing day by day. Using traditional data
management systems, it is difficult to process Big Data. Therefore, the Apache Software
Foundation introduced a framework called Hadoop to solve Big Data management and
processing challenges.

Hadoop

Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules, one is MapReduce and another is Hadoop Distributed File System (HDFS).

• MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

• HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and process the datasets. It provides a fault-tolerant file system to run on commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help Hadoop modules.

• Sqoop: It is used to import and export data to and from between HDFS and RDBMS.

• Pig: It is a procedural language platform used to develop a script for MapReduce operations.

• Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.

Note: There are various ways to execute MapReduce operations:

• The traditional approach using a Java MapReduce program for structured, semi-structured, and unstructured data.

• The scripting approach for MapReduce to process structured and semi-structured data using Pig.

• The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data using Hive.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

• A relational database

• A design for OnLine Transaction Processing (OLTP)

• A language for real-time queries and row-level updates

Features of Hive

• It stores schema in a database and processed data into HDFS.


• It is designed for OLAP.

• It provides SQL type language for querying called HiveQL or HQL.

• It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes
each unit:


User Interface: Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are Hive
Web UI, Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata
of tables, databases, columns in a table, their data types, and HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying on schema information
in the Metastore. It is one of the replacements of the traditional approach for a MapReduce
program. Instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce
is the Hive Execution Engine. The execution engine processes the query and generates
results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage
techniques used to store data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.


The following table defines how Hive interacts with Hadoop framework:

Step No.   Operation

1 Execute Query

The Hive interface such as Command Line or Web UI sends the query to the Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of the query compiler, which parses the query to check the syntax
and build the query plan (the requirements of the query).

3 Get Metadata

The compiler sends metadata request to Metastore (any database).

4 Send Metadata

Metastore sends metadata as a response to the compiler.

5 Send Plan


The compiler checks the requirement and resends the plan to the driver. Up to here,
the parsing and compiling of the query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the process of executing the job is a MapReduce job. The execution engine sends
the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker,
which is in the Data node. Here, the query executes the MapReduce job.

7.1 Metadata Ops

Meanwhile, during execution, the execution engine can execute metadata operations
with the Metastore.

8 Fetch Result

The execution engine receives the results from Data nodes.

9 Send Results

The execution engine sends those resultant values to the driver.

10 Send Results

The driver sends the results to Hive Interfaces.


File Formats in Hive

▪ File Format specifies how records are encoded in files


▪ Record Format implies how a stream of bytes for a given record are encoded

▪ The default file format is TEXTFILE – each record is a line in the file
▪ Hive uses different control characters as delimiters in text files

▪ ^A (octal 001), ^B (octal 002), ^C (octal 003), \n

▪ The term field is used when overriding the default delimiter


▪ FIELDS TERMINATED BY ‘\001’

▪ Supports text files – csv, tsv

▪ TextFile can contain JSON or XML documents.

Commonly used File Formats –

1. TextFile format

▪ Suitable for sharing data with other tools

▪ Can be viewed/edited manually


2. SequenceFile

▪ Flat files that stores binary key ,value pair

▪ SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and
sorting respectively
▪ Supports – Uncompressed, Record compressed (only value is compressed) and Block
compressed (both key and value compressed) formats
3. RCFile


▪ RCFile stores columns of a table in a record columnar way


4. ORC

5. AVRO
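As an illustration of how the file format is chosen when a table is defined, here is a minimal HiveQL sketch (the table names and columns are hypothetical, not taken from the text above):

-- text file with explicit field delimiter (the default ^A shown above is octal 001)
CREATE TABLE logs_text (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

-- the same schema stored as a binary SequenceFile
CREATE TABLE logs_seq (ts STRING, msg STRING)
STORED AS SEQUENCEFILE;

-- the same schema stored in columnar ORC format
CREATE TABLE logs_orc (ts STRING, msg STRING)
STORED AS ORC;

The same schema can be declared with any of these storage clauses; only the on-disk encoding (and therefore the compression and I/O behavior) changes.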

Hive Commands

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML),
and user-defined functions.

Hive DDL Commands

create database, drop database, create table, drop table, alter table, create index, create view


Hive DML Commands

Select, Where, Group By, Order By, Load Data, Join:

o Inner Join

o Left Outer Join

o Right Outer Join

o Full Outer Join

Hive DDL Commands

Create Database Statement


A database in Hive is a namespace or a collection of tables.


1. hive> CREATE SCHEMA userdb;

2. hive> SHOW DATABASES;

Drop Database

1. hive> DROP DATABASE IF EXISTS userdb;

Creating Hive Tables
Create a table called Sonoo with two columns, the first being an integer and the other a
string.

1. hive> CREATE TABLE Sonoo(foo INT, bar STRING);

Create a table called HIVE_TABLE with two columns and a partition column called ds.
The partition column is a virtual column. It is not part of the data itself but is derived from the
partition that a particular dataset is loaded into. By default, tables are assumed to be of text input
format and the delimiters are assumed to be ^A (ctrl-a).

1. hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds
STRING);

Browse the table

1. hive> Show tables;

Altering and Dropping Tables

1. hive> ALTER TABLE Sonoo RENAME TO Kafka;


2. hive> ALTER TABLE Kafka ADD COLUMNS (col INT);


3. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a
comment');
4. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING,
baz INT COMMENT 'baz replaces new_col1');

Hive DML Commands

To understand the Hive DML commands, let's see the employee and employee_department table first.

LOAD DATA

1. hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE


Employee;

SELECTS and FILTERS

1. hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';

GROUP BY

1. hive> SELECT E.EMP_ID FROM Employee E GROUP BY E.Address;

Adding a Partition

We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.


Syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec


[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:

: (p_column = p_col_value, p_column = p_col_value, ...)


The following query is used to add a partition to the employee table.

hive> ALTER TABLE employee
    > ADD PARTITION (year='2012')
    > location '/2012/part2012';
Renaming a Partition

The syntax of this command is as follows

ALTER TABLE table_name PARTITION partition_spec RENAME
TO PARTITION partition_spec;

The following query is used to rename a partition:

hive> ALTER TABLE employee PARTITION (year='1203')
    > RENAME TO PARTITION (Yoj='1203');
Dropping a Partition

The following syntax is used to drop a partition:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,
PARTITION partition_spec, ...;

The following query is used to drop a partition:

hive> ALTER TABLE employee DROP IF EXISTS
    > PARTITION (year='1203');
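After partitions have been added, they can be listed and used to prune queries. A small illustrative sketch, assuming the employee table above is partitioned by year:

hive> SHOW PARTITIONS employee;

hive> SELECT Id, Name, Dept FROM employee WHERE year = '2012';

Because year is a partition column, the second query only reads the files under the matching partition directory instead of scanning the whole table.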


Hive Query Language

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data
in a Metastore. This chapter explains how to use the SELECT statement with WHERE clause.

SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a condition.
It filters the data using the condition and gives you a finite result. The built-in operators and functions
generate an expression, which fulfils the condition.

Syntax

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as given
below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the
employee details who earn a salary of more than Rs 30000.


+------+-------------+--------+--------------------+--------+
| ID   | Name        | Salary | Designation        | Dept   |
+------+-------------+--------+--------------------+--------+
| 1201 | Gopal       | 45000  | Technical manager  | TP     |
| 1202 | Manisha     | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali | 40000  | Technical writer   | TP     |
| 1204 | Krian       | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi     | 30000  | Op Admin           | Admin  |
+------+-------------+--------+--------------------+--------+

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you get to see the following response:

+------+-------------+--------+--------------------+--------+
| ID   | Name        | Salary | Designation        | Dept   |
+------+-------------+--------+--------------------+--------+
| 1201 | Gopal       | 45000  | Technical manager  | TP     |
| 1202 | Manisha     | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali | 40000  | Technical writer   | TP     |
| 1204 | Krian       | 40000  | Hr Admin           | HR     |
+------+-------------+--------+--------------------+--------+

JDBC Program

The JDBC program to apply where clause for the given example is as follows.


import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000;");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3)
            + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}

Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.

$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:

ID Name Salary Designation Dept

1201 Gopal 45000 Technical manager TP

1202 Manisha 45000 Proofreader PR

1203 Masthanvali 40000 Technical writer TP

1204 Krian 40000 Hr Admin HR


The ORDER BY clause is used to retrieve the details based on one column and sort the result set
by ascending or descending order.

Syntax

Given below is the syntax of the ORDER BY clause:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...


FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];


Example

Let us take an example for the SELECT...ORDER BY clause. Assume the employee table as
given below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query
to retrieve the employee details ordered by department name.

+------+-------------+--------+--------------------+--------+
| ID   | Name        | Salary | Designation        | Dept   |
+------+-------------+--------+--------------------+--------+
| 1201 | Gopal       | 45000  | Technical manager  | TP     |
| 1202 | Manisha     | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali | 40000  | Technical writer   | TP     |
| 1204 | Krian       | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi     | 30000  | Op Admin           | Admin  |
+------+-------------+--------+--------------------+--------+

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee ORDER BY DEPT;

On successful execution of the query, you get to see the following response:

+------+-------------+--------+--------------------+--------+
| ID   | Name        | Salary | Designation        | Dept   |
+------+-------------+--------+--------------------+--------+
| 1205 | Kranthi     | 30000  | Op Admin           | Admin  |
| 1204 | Krian       | 40000  | Hr Admin           | HR     |
| 1202 | Manisha     | 45000  | Proofreader        | PR     |
| 1201 | Gopal       | 45000  | Technical manager  | TP     |
| 1203 | Masthanvali | 40000  | Technical writer   | TP     |
+------+-------------+--------+--------------------+--------+

JDBC Program

Here is the JDBC program to apply Order By clause for the given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLOrderBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT;");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3)
            + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}

Save the program in a file named HiveQLOrderBy.java. Use the following commands to
compile and execute this program.


$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy

Output:

ID Name Salary Designation Dept

1205 Kranthi 30000 Op Admin Admin

1204 Krian 40000 Hr Admin HR

1202 Manisha 45000 Proofreader PR

1201 Gopal 45000 Technical manager TP


1203 Masthanvali 40000 Technical writer TP


The GROUP BY clause is used to group all the records in a result set using a particular
collection column. It is used to query a group of records.

Syntax

The syntax of the GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example

Let us take an example of SELECT…GROUP BY clause. Assume employee table as


given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve
the number of employees in each department.

+------+-------------+--------+--------------------+--------+
| ID   | Name        | Salary | Designation        | Dept   |
+------+-------------+--------+--------------------+--------+
| 1201 | Gopal       | 45000  | Technical manager  | TP     |
| 1202 | Manisha     | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali | 40000  | Technical writer   | TP     |
| 1204 | Krian       | 45000  | Proofreader        | PR     |
| 1205 | Kranthi     | 30000  | Op Admin           | Admin  |
+------+-------------+--------+--------------------+--------+

The following query retrieves the employee details using the above scenario.

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;

On successful execution of the query, you get to see the following response:

+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 3        |
+-------+----------+

JDBC Program

Given below is the JDBC program to apply the Group By clause for the given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) " + "FROM employee GROUP BY DEPT;");
      System.out.println(" Dept \t count(*)");

      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2));
      }
      con.close();
   }
}

Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.

$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy

Output:

Dept    Count(*)
Admin   1
PR      2
TP      3

JOIN is a clause that is used for combining specific fields from two tables by using
values common to each one. It is used to combine records from two or more tables in the
database.


Syntax

join_table:

table_reference JOIN table_factor [join_condition]


| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition

| table_reference CROSS JOIN table_reference [join_condition]

Example

We will use the following two tables in this chapter. Consider the following table named
CUSTOMERS:

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
| 2  | Khilan   | 25  | Delhi     | 1500.00  |
| 3  | kaushik  | 23  | Kota      | 2000.00  |
| 4  | Chaitali | 25  | Mumbai    | 6500.00  |
| 5  | Hardik   | 27  | Bhopal    | 8500.00  |
| 6  | Komal    | 22  | MP        | 4500.00  |
| 7  | Muffy    | 24  | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:

+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
| 101 | 2009-11-20 00:00:00 | 2           | 1560   |
| 103 | 2008-05-20 00:00:00 | 4           | 2060   |
+-----+---------------------+-------------+--------+

There are different types of joins given as follows:

• JOIN

• LEFT OUTER JOIN

• RIGHT OUTER JOIN

• FULL OUTER JOIN

JOIN

JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the
same as INNER JOIN in SQL. A JOIN condition is defined using the primary keys and
foreign keys of the tables.

The following query executes JOIN on the CUSTOMERS and ORDERS tables, and
retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
    > FROM CUSTOMERS c JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
| 3  | kaushik  | 23  | 3000   |
| 3  | kaushik  | 23  | 1500   |
| 2  | Khilan   | 25  | 1560   |
| 4  | Chaitali | 25  | 2060   |
+----+----------+-----+--------+

LEFT OUTER JOIN

The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there
are no matches in the right table. This means, if the ON clause matches 0 (zero) records in the
right table,the JOIN still returns a row in the result, but with NULL in each column from the
right table.

A LEFT JOIN returns all the values from the left table, plus the matched values from the
right table, or NULL in case of no matching JOIN predicate.

The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
    > FROM CUSTOMERS c
    > LEFT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
+----+----------+--------+---------------------+

RIGHT OUTER JOIN

The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there
are no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the
JOIN still returns a row in the result, but with NULL in each column from the left table.

A RIGHT JOIN returns all the values from the right table, plus the matched values from
the left table, or NULL in case of no matching join predicate.

The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
    > FROM CUSTOMERS c
    > RIGHT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
+----+----------+--------+---------------------+

FULL OUTER JOIN

The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer
tables that fulfil the JOIN condition. The joined table contains either all the records from both
the tables, or fills in NULL values for missing matches on either side.

The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
    > FROM CUSTOMERS c
    > FULL OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
+----+----------+--------+---------------------+

Bucketing

• Bucketing concept is based on (hashing function on the bucketed column) mod (by total
numberof buckets). The hash_function depends on the type of the bucketing column.

• Records with the same bucketed column will always be stored in the same bucket.

• We use CLUSTERED BY clause to divide the table into buckets.


• Physically, each bucket is just a file in the table directory, and Bucket numbering is 1-based.

• Bucketing can be done along with Partitioning on Hive tables and even without partitioning.

• Bucketed tables will create almost equally distributed data file parts, unless there is skew in
data.

• Bucketing is enabled by setting hive.enforce.bucketing = true; a sample bucketed-table definition is sketched after the advantages list below.

Advantages


• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can
try out queries on a fraction of data for testing and debugging purposes when the original data
sets are very huge.

• As the data files are equal sized parts, map-side joins will be faster on bucketed tables than on
non-bucketed tables.

• Bucketing also provides the flexibility to keep the records in each bucket sorted
by one or more columns. This makes map-side joins even more efficient, since the join of each
bucket becomes an efficient merge-sort.
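The following is a minimal HiveQL sketch of how a bucketed table might be declared and populated (the users source table is assumed here for illustration; buck_users is the bucketed table also sampled in the Sampling section below):

SET hive.enforce.bucketing = true;

CREATE TABLE buck_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE buck_users
SELECT id, name FROM users;

Each of the 4 buckets ends up as one file under the table directory, with rows assigned by hash(id) mod 4.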

Bucketing Vs Partitioning

• Partitioning helps in the elimination of data, if used in the WHERE clause, whereas bucketing helps
in organizing the data in each partition into multiple files, so that the same set of data is always
written to the same bucket.

• Bucketing helps a lot in joining of columns.


• A Hive bucket is simply another technique for decomposing data into more manageable,
roughly equal parts.

Sampling

• TABLESAMPLE() gives more disordered and random records from a table as compared to LIMIT.
• We can sample using the rand() function, which returns a random number.

SELECT * from users TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;

• Here rand() refers to any random column.
• The denominator in the bucket clause represents the number of buckets into which data will be hashed.
• The numerator is the bucket number selected.

SELECT * from users TABLESAMPLE(BUCKET 2 OUT OF 4 ON name) s;

• If the columns specified in the TABLESAMPLE clause match the columns in the
CLUSTEREDBY clause, TABLESAMPLE queries only scan the required hash partitions of
the table.

SELECT * FROM buck_users TABLESAMPLE(BUCKET 1 OUT OF 2 ON id) s


LIMIT 1;


Joins and Types

Reduce-Side Join

• If the datasets are large, a reduce-side join takes place.

Map-Side Join

• In case one of the datasets is small, a map-side join takes place.
• In a map-side join, a local job runs to create a hash table from the content of the HDFS file and
sends it to every node.

SET hive.auto.convert.join = true;

Bucket Map Join

• The data must be bucketed on the keys used in the ON clause and the number of buckets for
one table must be a multiple of the number of buckets for the other table.
• When these conditions are met, Hive can join individual buckets between tables in the map phase,
because it does not have to fetch the entire content of one table to match against each bucket in the
other table.
• set hive.optimize.bucketmapjoin = true;
• SET hive.auto.convert.join = true;

SMBM Join

• Sort-Merge-Bucket (SMB) joins can be converted to SMB map joins as well.

• SMB joins are used wherever the tables are sorted and bucketed.

• The join boils down to just merging the already sorted tables, allowing this operation to be
fasterthan an ordinary map-join.

• set hive.enforce.sortmergebucketmapjoin =false;

• set hive.auto.convert.sortmerge.join =true;

• set hive.optimize.bucketmapjoin = true;

• set hive.optimize.bucketmapjoin.sortedmerge = true;

LEFT SEMI JOIN


• A left semi-join returns records from the lefthand table if records are found in the righthand
table that satisfy the ON predicates.

• It’s a special, optimized case of the more general inner join.

• Most SQL dialects support an IN … EXISTS construct to do the same thing.

• SELECT and WHERE clauses can’t reference columns from the righthand table.

• Right semi-joins are not supported in Hive.


• The reason semi-joins are more efficient than the more general inner join is as follows:

• For a given record in the lefthand table, Hive can stop looking for matching records in the
righthand table as soon as any match is found.

• At that point, the selected columns from the lefthand table record can be projected

• A file format is a way in which information is stored or encoded in a computer file.

• In Hive it refers to how records are stored inside the file.

• InputFormat reads key-value pairs from files.

• As we are dealing with structured data, each record has to be its own structure.

• How records are encoded in a file defines a file format.

• These file formats mainly vary between data encoding, compression rate, usage of space and
disk I/O.

• Hive does not verify whether the data that you are loading matches the schema for the
table or not.
• However, it verifies whether the file format matches the table definition or not.

SerDe in Hive


• The SerDe interface allows you to instruct Hive as to how a record should be processed.

• A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).

• The Deserializer interface takes a string or binary representation of a record, and translates it
intoa Java object that Hive can manipulate.

• The Serializer, however, will take a Java object that Hive has been working with, and turn it
intosomething that Hive can write to HDFS or another supported system.

• Commonly, Deserializers are used at query time to execute SELECT statements, and
Serializersare used when writing data, such as through an INSERT-SELECT statement.

CSVSerDe

• Use ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’

• Define the following in SERDEPROPERTIES:
  ("separatorChar" = <value_of_separator>,
   "quoteChar" = <value_of_quote_character>,
   "escapeChar" = <value_of_escape_character>)

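Putting these pieces together, a table backed by OpenCSVSerde might be declared as in the following sketch (the table name and columns are hypothetical, not from the text above):

CREATE TABLE csv_users (id STRING, name STRING, city STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

Note that OpenCSVSerde treats every column as a string, which is why the columns above are declared as STRING.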

JSONSerDe

• Include hive-hcatalog-core-0.14.0.jar
• Use ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

RegexSerDe

• It is used in case of pattern matching.
• Use ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
• In SERDEPROPERTIES, define the input pattern and output fields. For example:

• input.regex = '(.)/(.)@(.*)'
• output.format.string = '%1$s %2$s %3$s';
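For illustration, a RegexSerDe-backed table could be declared along these lines (a sketch; the table name, columns, and the regular expression are assumptions, not taken from the text above):

CREATE TABLE raw_emails (user STRING, host STRING, domain STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)/(.*)@(.*)",
  "output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;

Each capture group in input.regex is mapped, in order, to one column of the table.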

USE PARTITIONING AND BUCKETING

• Partitioning a table stores data in sub-directories categorized by table location, which allows
Hive to exclude unnecessary data from queries without reading all the data every time a new
query is made.

• Hive does support Dynamic Partitioning (DP) where column values are only known at
EXECUTION TIME. To enable Dynamic Partitioning :

SET hive.exec.dynamic.partition =true;


• Another situation we want to protect against with dynamic partition inserts is that the user may
accidentally specify all partitions to be dynamic partitions without specifying one static
partition, while the original intention is to just overwrite the sub-partitions of one root partition.

SET hive.exec.dynamic.partition.mode = strict;

To enable bucketing:

SET hive.enforce.bucketing = true;
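As a small sketch of dynamic partitioning in practice (the employee_part table and its columns are hypothetical, chosen only to mirror the employee example used earlier):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE employee_part (id INT, name STRING, salary INT)
PARTITIONED BY (dept STRING);

INSERT OVERWRITE TABLE employee_part PARTITION (dept)
SELECT Id, Name, Salary, Dept FROM employee;

Here the partition value for dept is taken from the last column of the SELECT at execution time, so one partition directory is created per distinct department.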

Optimizations in Hive

• Use Denormalisation, Filtering and Projection as early as possible to reduce data before join.

• Join is a costly affair and requires an extra map-reduce phase to accomplish the query job. With
De-normalisation, the data is present in the same table so there is no need for any joins, hence the
selects are very fast.

• As join requires data to be shuffled across nodes, use filtering and projection as early as
possible to reduce data before join.

TUNE CONFIGURATIONS

• To increase number of mapper, reduce split size :

SET mapred.max.split.size =1000000; (~1 MB)

• Compress map/reduce output

SET mapred.compress.map.output = true;
SET mapred.output.compress = true;


• Parallel execution


• Applies to MapReduce jobs that can run in parallel, for example jobs processing different
source tables before a join.

SET hive.exec.parallel = true;

USE ORCFILE

• Hive supports ORCFile, a new table storage format that sports fantastic speed
improvements through techniques like predicate push-down, compression and more.

• Using ORCFile for every Hive table is extremely beneficial to get fast response times for
your Hive queries.

USE TEZ

• With Hadoop 2 and Tez, the cost of job submission and scheduling is minimized.

• Also, Tez does not restrict the job to be only Map followed by Reduce; this implies that all
the query execution can be done in a single job without having to cross job boundaries.

• Let’s look at an example. Consider a click-stream event table:

CREATE TABLE clicks (
  timestamp date,
  sessionID string,
  url string,
  source_ip string
)
STORED AS ORC
tblproperties ("orc.compress" = "SNAPPY");

• Each record represents a click event, and we would like to find the latest URL for each
sessionID


• One might consider the following approach:

SELECT clicks.sessionID, clicks.url FROM clicks inner join (select sessionID,


max(timestamp) as max_ts from clicks group by sessionID) latest ON clicks.sessionID =
latest.sessionID and clicks.timestamp = latest.max_ts;

• In the above query, we build a sub-query to collect the timestamp of the latest event in
each session, and then use an inner join to filter out the rest.

• While the query is a reasonable solution from a functional point of view, it turns out
there's a better way to re-write this query as follows:

SELECT ranked_clicks.sessionID, ranked_clicks.url
FROM (SELECT sessionID, url,
        RANK() OVER (PARTITION BY sessionID ORDER BY timestamp DESC) as rank
      FROM clicks) ranked_clicks
WHERE ranked_clicks.rank = 1;

• Here, we use Hive's OLAP functionality (OVER and RANK) to achieve the same thing,
but without a Join.

• Clearly, removing an unnecessary join will almost always result in better performance, and
when using big data this is more important than ever.

MAKING MULTIPLE PASS OVER SAME DATA

• Hive has a special syntax for producing multiple aggregations from a single pass through a
source of data, rather than rescanning it for each aggregation.

• This change can save considerable processing time for large input data sets.


• For example, each of the following two queries creates a table from the same source table,
history:

INSERT OVERWRITE TABLE sales
SELECT * FROM history WHERE action = 'purchased';

INSERT OVERWRITE TABLE credits
SELECT * FROM history WHERE action = 'returned';


• This syntax is correct, but inefficient.

• The following rewrite achieves the same thing, but using a single pass through the source
history table:

FROM history
INSERT OVERWRITE TABLE sales SELECT * WHERE action = 'purchased'
INSERT OVERWRITE TABLE credits SELECT * WHERE action = 'returned';


UNIT V

What is Machine Learning?


A subset of artificial intelligence (AI) and computer science called machine learning focuses on
using data and algorithms to simulate how humans learn, gradually increasing the accuracy of the
system.

The rapidly expanding field of data science includes machine learning as a key element.
Algorithms are trained to generate classifications or predictions using statistical techniques,
revealing essential insights in data mining operations.

Features of Machine Learning

➢ Machine learning is a subfield of artificial intelligence that focuses on developing


algorithms capable of learning from data and improving their performance over time.
➢ Machine learning algorithms can handle large and complex datasets and can
automatically extract relevant features from data, reducing the need for manual feature
engineering. These algorithms can be supervised or unsupervised and can be used for a
wide range of applications, including predictive modeling, classification, clustering,
and recommendation systems.
➢ Machine learning is a rapidly evolving field that is transforming the way we work with
data, and its ability to adapt to new data and changing conditions makes it an essential
tool for many industries and applications.
➢ In summary, machine learning is characterized by its ability to learn from data, handle
complex datasets, perform automatic feature extraction, be used for various
applications, and adapt to new data and changing conditions.

How Does Machine Learning Work?

➢ Machine Learning works by using algorithms to analyze data, identify patterns and
relationships, and make predictions or decisions based on the learned patterns.


➢ The process involves several steps, including data preprocessing, feature engineering,
model selection, training, evaluation, and deployment.
➢ Machine Learning algorithms can be supervised, unsupervised, or semi-supervised,
depending on the type of input data and the desired outcome.
➢ During training, the algorithm adjusts its parameters to minimize the difference
between its predictions and the actual outcomes.
➢ Once the model is trained, it can be used to make predictions on new data. Overall,
Machine Learning is a complex process that requires a solid understanding of statistics,
mathematics, and computer science.

Need for Machine Learning

➢ Machine learning is significant because it helps in the development of new goods and
provides businesses with a picture of trends in consumer behavior and operational
business patterns.
➢ A substantial portion of the operations of many of today's top companies, like
Facebook, Google, and Uber, revolve around machine learning. For many businesses,
machine learning has emerged as a key competitive differentiation.

Types of Machine Learning

There are three main types of machine learning:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

• Supervised learning involves training a model on labeled data, where the correct outputs
are provided alongside the inputs. The goal of supervised learning is to enable the model to
generalize to new data by learning patterns and relationships in the data.
• Unsupervised learning involves training a model on unlabeled data, where the inputs are
not accompanied by any specific output. The goal of unsupervised learning is to enable the
model to identify patterns and relationships in the data without any explicit guidance.


• Reinforcement learning involves training a model to interact with an environment and


learn from the rewards or penalties it receives based on its actions. The goal of reinforcement
learning is to enable the model to learn the best sequence of actions to achieve a specific goal.

Each type of machine learning has its own strengths and weaknesses, and the choice of which
type to use depends on the specific problem being addressed and the nature of the available data

Regression Analysis in Machine learning


Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us to understand how the value of the dependent variable changes
corresponding to an independent variable when the other independent variables are
held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every
year and gets sales based on them. The below list shows the advertisements made by the company
in the last 5 years and the corresponding sales:


Now, the company wants to do the advertisement of $200 in the year 2019 and wants to
know the prediction about the sales for this year. So to solve such type of prediction problems
in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between

variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and

determining the causal-effect relationship between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints,

using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on target-

predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between datapoints and line tells whether a model

has captured a strong relationship or not.


Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression


Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are used to
predict the values of the dependent variables are called independent variable, also called as
a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then such
condition is called Multicollinearity. It should not be present in the dataset, because it creates
problems while ranking the most affecting variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not well
with test dataset, then such problem is called Overfitting. And if our algorithm does not perform
well even with training dataset, then such problem is called underfitting.

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous variable.
There are various scenarios in the real world where we need future predictions, such as
weather conditions, sales prediction, and marketing trends; for such cases we need a
technique which can make predictions more accurately. Regression analysis is such a
statistical method, used in machine learning and data science. Below are
some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent variable.


o It is used to find the trends in data.


o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor, the least
important factor, and how each factor is affecting the other factors.

Types of Regression
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression:


Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.


o Below is the mathematical equation for Linear regression:

1. Y= aX+b

Here, Y = dependent variable (target variable),

X = independent variable (predictor variable),

a and b are the linear coefficients.
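For reference, in simple (one-variable) linear regression the least-squares estimates of these coefficients can be written as (a standard result, stated here as a brief aside):

a = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x}

where \bar{x} and \bar{y} are the means of the input and output values in the training data.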

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.


Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or
discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm
in the term how they are used.
o Logistic regression uses sigmoid function or logistic function which is a complex cost function.
This sigmoid function is used to model the data in logistic regression. The function can be
represented as:
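A common way to write the sigmoid (logistic) function is:

f(x) = \frac{1}{1 + e^{-x}}

which maps any real-valued input x to an output strictly between 0 and 1.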

o f(x)= Output between the 0 and 1 value.


o x= input to the function
o e= base of natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:


o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and
values below the threshold level are rounded down to 0.

There are three types of logistic regression:

o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion,
so for such case, linear regression will not best fit to those datapoints. To cover such datapoints,
we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of
given degree and then modeled using a linear model. Which means the datapoints are best
fitted using a polynomial line.


o The equation for polynomial regression is also derived from the linear regression equation, which
means the linear regression equation Y = b0 + b1x is transformed into the polynomial regression
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ... bn are the regression coefficients, and x is
our independent/input variable.
o The model is still linear, as the coefficients are still linear; only the features are raised to higher degrees.

Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the same
degree.

Support Vector Regression:


Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line
which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin
for datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and
opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the

maximum datapoints within the boundary lines and the hyperplane (best-fit line) must
contain a maximum number of datapoints. Consider the below image:


Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both classification
and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents the
"test" for an attribute, each branch represent the result of the test, and each leaf node represents
the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits into
left and right child nodes (subsets of dataset). These child nodes are further divided into their
children node, and themselves become the parent node of those nodes. Consider the below
image:


The above image shows an example of Decision Tree regression; here, the model is trying to
predict the choice of a person between a sports car or a luxury car.

Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, which is capable of
performing regression as well as classification tasks.
o The Random Forest regression is an ensemble learning method which combines multiple decision
trees and predicts the final output based on the average of each tree output. The combined
decision trees are called as base models, and it can be represented more formally as:

g(x)= f0(x)+ f1(x)+ f2(x)+....

o Random forest uses Bagging or Bootstrap Aggregation technique of ensemble learning in


which aggregated decision tree runs in parallel and do not interact with each other.
o With the help of Random Forest regression, we can prevent Overfitting in the model by creating
random subsets of the dataset.


Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small amount
of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can compute
this penalty term by multiplying lambda by the squared weight of each individual feature.
o The equation for ridge regression will be:
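A common way to write the cost that ridge regression minimizes is (with lambda as the regularization strength and beta_j as the coefficients):

\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2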

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.


Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to the Ridge Regression except that penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can
only shrink it near to 0.
o It is also called as L1 regularization. The equation for Lasso regression will be:
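A common way to write the cost that Lasso regression minimizes is (note the absolute values in the penalty term, in contrast to the squared weights used by ridge):

\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert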

Clustering in Machine Learning


Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,

consisting of similar data points. The objects with the possible similarities remain in a
group that has less or no similarities with another group."

It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,

behavior, etc., and divides them as per the presence and absence of those similar patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm,

and it deals with the unlabeled dataset.

After applying this clustering technique, each cluster or group is provided with a cluster-ID.
ML system can use this id to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.


Note: Clustering is somewhere similar to the classification algorithm, but the difference is
the type of dataset that we are using. In classification, we work with the labeled data set,
whereas in clustering, we work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of Mall:
When we visit any shopping mall, we can observe that the things with similar usage are grouped

together. Such as the t-shirts are grouped in one section, and trousers are at other sections,
similarly, at vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate sections,

so that we can easily find out the things. The clustering technique also works in the same way.
Other examples of clustering are grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common uses of

this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system
to provide recommendations based on past product searches. Netflix also uses this
technique to recommend movies and web series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the different

fruits are divided into several groups with similar properties.


Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (datapoint belongs to only

one group) and Soft Clustering (data points can belong to another group also). But there are
also other various approaches of Clustering exist. Below are the main clustering methods used

in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.


In this type, the dataset is divided into a set of k groups, where K is used to define the

number of pre-defined groups. The cluster center is created in such a way that the distance
between the data points of one cluster is minimum as compared to another cluster centroid.

Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This

algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser

areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.


Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done by
assuming some distributions commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).


Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is
no requirement of pre-specifying the number of clusters to be created. In this technique, the

dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
The observations or any number of clusters can be selected by cutting the tree at the correct

level. The most common example of this method is the Agglomerative Hierarchical algorithm.


Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each dataset has a set of membership coefficients, which depend on the

degree of membership to be in a cluster. Fuzzy C-means algorithm is the example of this type
of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms
The Clustering algorithms can be divided based on their models that are explained above.
There are different types of clustering algorithms published, but only a few are commonly used.

The clustering algorithm is based on the kind of data that we are using. Such as, some algorithms

need to guess the number of clusters in the given dataset, whereas some are required to find
the minimum distance between the observation of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely used in machine

learning:


1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The number
of clusters must be specified in this algorithm. It is fast, with fewer computations required and a
linear complexity of O(n). (Its objective function is sketched just after this list.)
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density
of data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with
Noise. It is an example of a density-based model similar to the mean-shift, but with some
remarkable advantages. In this algorithm, the areas of high density are separated by the areas of
low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where K-means may fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset
and then successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, pairs of data points exchange messages until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
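As referenced in point 1 above, a minimal K-Means sketch assuming scikit-learn; the synthetic dataset and the choice of four clusters are assumptions for the example:

# A minimal K-Means sketch (scikit-learn); the number of clusters must be
# specified up front, as noted in point 1 above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster index assigned to each sample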

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells, dividing cancerous and non-cancerous data points into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search results appear based on the objects closest to the search query; this is done by grouping similar data objects into one group placed far from dissimilar objects. The accuracy of a query result depends on the quality of the clustering algorithm used.

o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using
the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.

What is Collaborative Filtering?


In Collaborative Filtering, we find similar users and recommend what those similar users like. In this type of recommendation system, we don't use the features of an item to recommend it; rather, we classify users into clusters of similar types and recommend items to each user according to the preferences of its cluster.

There are basically four types of algorithms, or say techniques, to build Collaborative Filtering based recommender systems:

➢ Memory-Based

➢ Model-Based

➢ Hybrid

➢ Deep Learning

Advantages of Collaborative Filtering-Based Recommender Systems

As we know, there are two broad types of recommender systems. Content-based recommender systems have limited use cases and higher time complexity, and they rely on a limited amount of item content; that is not the case with Collaborative Filtering based algorithms. One of the main advantages of these recommender systems is that they are highly efficient at providing personalized content and are also able to adapt to changing user preferences.

Measuring Similarity

A simple example of a movie recommendation system will help in explaining:

In this scenario, we can see that User 1 and User 2 give nearly similar ratings to the movies, so we can conclude that Movie 3 will also be moderately liked by User 1, while Movie 4 will be a good recommendation for User 2. Likewise, we can see that some users have different tastes: User 1 and User 3 are opposite to each other. User 3 and User 4 share a common interest in the movies, so on that basis we can say that Movie 4 is also going to be disliked by User 4. This is Collaborative Filtering: we recommend to users the items liked by users with similar interests.

Cosine Similarity

We can also use the cosine similarity between users to find users with similar interests: a larger cosine implies a smaller angle between two users, and hence more similar interests. We can apply the cosine distance between two users in the utility matrix, assigning a value of zero to all the unfilled columns to make the calculation easier. If the cosine is smaller, there is a larger distance between the users; if the cosine is larger, the angle between the users is small and we can recommend them similar things.

similarity(A, B) = (A · B) / (||A|| × ||B||) = Σi (Ai × Bi) / (√(Σi Ai²) × √(Σi Bi²))
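A minimal sketch of this calculation in plain NumPy; the small utility matrix below is made up for illustration, with 0 standing for an unrated item:

# A minimal sketch of user-user cosine similarity on a small utility matrix.
# Ratings and users are made up for illustration; unrated items are 0.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],   # User 1
    [4, 5, 0, 2],   # User 2
    [1, 0, 5, 4],   # User 3
])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine_similarity(ratings[0], ratings[1]), 3))  # similar users -> close to 1
print(round(cosine_similarity(ratings[0], ratings[2]), 3))  # dissimilar users -> much smaller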

Rounding the Data


In collaborative filtering, we round off the data to compare it more easily; for example, ratings below 3 can be assigned 0 and ratings of 3 and above assigned 1, which helps us compare the data more easily.

Taking the previous example again and applying this rounding-off process, the data becomes much more readable: we can see that User 1 and User 2 are more similar, and User 3 and User 4 are more alike.

Normalizing Rating
In the process of normalizing, we take the average rating of a user and subtract it from each of that user's ratings, so we get either positive or negative values as ratings, which can then be used to classify users into similar groups. By normalizing the data we can make clusters of users that give similar ratings to similar items, and then use these clusters to recommend items to the users. A small sketch of this step is given below.
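A minimal sketch of this normalization step, assuming NumPy and treating 0 as "unrated"; the ratings are made up for illustration:

# A minimal sketch of mean-centering (normalizing) ratings: each user's average
# rating is subtracted from that user's ratings. Zeros stand for unrated items
# and are ignored when computing the average.
import numpy as np

ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
], dtype=float)

mask = ratings > 0                                    # True where a rating exists
user_means = (ratings * mask).sum(axis=1) / mask.sum(axis=1)
normalized = np.where(mask, ratings - user_means[:, None], 0.0)
print(np.round(normalized, 2))   # positive = above the user's average, negative = below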

What are some of the Challenges to be Faced while using Collaborative Filtering?

As we know, every algorithm has its pros and cons, and the same is true of Collaborative Filtering algorithms. Collaborative Filtering algorithms are very dynamic and can adapt to changes in user preferences over time. But one of the main issues faced by recommender
systems is that of scalability: as the user base increases, the sizes of the computation and the data storage space all increase manifold, which leads to slow and inaccurate results.

Also, collaborative filtering algorithms can fail to recommend a diverse range of products, since they are based on historical data and hence tend to provide recommendations related to it.

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning

model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies


• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a
leaf node.

• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the
given conditions.

• Branch/Sub Tree: A tree formed by splitting the tree.

• Pruning: Pruning is the process of removing the unwanted branches from the tree.

• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; call these final nodes leaf nodes. A short sketch of this procedure in code follows the steps.
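A minimal sketch of this overall procedure, assuming scikit-learn's DecisionTreeClassifier (a CART-style implementation) and the bundled Iris dataset; the depth limit is an assumption for the example:

# A minimal decision-tree classification sketch with scikit-learn's
# DecisionTreeClassifier, which handles attribute selection internally.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)            # builds the tree by recursive splitting
print(tree.score(X_test, y_test))     # accuracy on held-out data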

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. So, to solve this problem, the decision tree starts with the root

node (Salary attribute by ASM). The root node splits further into the next decision node (distance

from the office) and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision

node splits into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. Using this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
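A minimal sketch that follows these formulas directly; the parent counts and the two child subsets below are made-up numbers for illustration:

# A minimal sketch computing entropy and information gain for a binary target.
import math

def entropy(yes, no):
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:                       # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Parent set: 9 "yes" and 5 "no" samples; two child subsets after a split.
parent = entropy(9, 5)
children = [(4, 0, entropy(4, 0)), (5, 5, entropy(5, 5))]   # (yes, no, entropy)

weighted = sum((y + n) / 14 * e for y, n, e in children)
print(round(parent - weighted, 3))      # information gain of this split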

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑j Pj²
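A minimal sketch of the same formula; the class counts are made up for illustration:

# A minimal Gini index sketch: Gini = 1 - sum of squared class probabilities.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # impure node (mixed classes) -> closer to 0.5
print(gini([14, 0]))   # pure node -> 0.0, the preferred (low) Gini value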

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is therefore known as Pruning. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.
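A minimal sketch of cost complexity pruning, assuming a recent scikit-learn version that exposes the ccp_alpha parameter; the alpha value and dataset are assumptions for the example:

# A minimal sketch of cost complexity pruning: larger ccp_alpha values prune
# more of the tree. The alpha value here is illustrative only.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# The pruned tree typically has fewer nodes than the fully grown tree.
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)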

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o As the number of class labels increases, the computational complexity of the decision tree may increase.
