
A REPORT OF PRACTICAL TRAINING

at

CETPA INFOTECH PRIVATE LIMITED


SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR

THE AWARD OF THE DEGREE OF

BACHELOR OF TECHNOLOGY

(Computer Science & Engineering)

May-June 2018

SUBMITTED BY :-

Anil Kumar Chouhan

15EEBCS004

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

ENGINEERING COLLEGE BIKANER

RAJASTHAN TECHNICAL UNIVERSITY

2018-19
ENGINEERING COLLEGE BIKANER

CANDIDATE’S DECLARATION

I, ANIL KUMAR CHOUHAN, hereby declare that I have undertaken industrial/practical
training at CETPA INFOTECH PRIVATE LIMITED during the period from 16 May to 24
June 2018 in partial fulfilment of the requirements for the award of the degree of B.Tech
(Computer Science & Engineering) at ENGINEERING COLLEGE BIKANER, RAJASTHAN.
The work which is being presented in the training report submitted to the Department of
Computer Science & Engineering at ENGINEERING COLLEGE BIKANER, RAJASTHAN is an
authentic record of training work.

ANIL KUMAR CHOUHAN


15EEBCS004

Abstract
Analysis of structured data has seen tremendous success in the past. However, analysis of
large-scale unstructured data in the form of video remains a challenging area. YouTube, a
Google company, has over a billion users and generates billions of views. Since YouTube
data is being created in very large amounts and with equally great speed, there is a huge
demand to store, process and carefully study this large amount of data to make it usable.
The main objective of this project is to demonstrate, by using Hadoop concepts,
how data generated from YouTube can be mined and utilized to make targeted, real-time and
informed decisions.
The project utilizes the YouTube Data API (Application Programming Interface),
which allows applications/websites to incorporate functions that are used by the YouTube
application to fetch and view information. The Google Developers Console is used to
generate a unique access key which is further required to fetch YouTube public channel
data. Once the API key is generated, a .NET-based console application is designed to use the
YouTube API for fetching video information based on a search criterion. The text file
output generated from the console application is then loaded from HDFS (Hadoop
Distributed File System) into the HIVE database. Hive uses a SQL-like interface to query
data stored in various databases and file systems that integrate with Hadoop. HDFS is a
primary Hadoop application, and a user can directly interact with HDFS using various
shell-like commands supported by Hadoop. This project uses SQL-like queries that are later
run on the Big Data using HIVE to extract the meaningful output which can be used by the
management for analysis.
Hadoop is an ecosystem of open source components that fundamentally changes the
way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables
multiple types of analytic workloads to run on the same data, at the same time, at massive
scale on industry-standard hardware. Cloudera's open source platform is the most popular
distribution of Hadoop and related projects in the world (with support available via a
Cloudera Enterprise subscription).

ACKNOWLEDGEMENTS

The time I spent at CETPA INFOTECH PRIVATE LIMITED as a trainee was a memorable
one for me, as it was rich in experience sharing and helped me learn core technology. I
have had so many rich experiences and opportunities that I personally believe will forever
shape and influence my professional life while fostering personal growth and development.
Firstly, I would like to thank my parents, who patiently helped me as I went through my work
and helped to modify and eliminate some of the irrelevant or unnecessary material.
Secondly, I would like to thank CETPA INFOTECH PRIVATE LIMITED for giving me
such a wonderful opportunity to expand my knowledge of my own branch and giving me
guidelines to present a seminar report. It helped me a lot to realize what we study for.
Thirdly, I would like to thank Mr. Kuldeep Kumar, who gave training to us and who helped
me to make my work more organized and well-stacked till the end.
I am really thankful to our HOD Mr. Ranulal Chouhan for giving me the chance to present
myself and do some research work regarding the latest technology in IT.
Next, I would thank Microsoft for developing such a wonderful tool like MS Word. It helped
my work a lot to remain error-free.
Last but clearly not the least, I would thank The Almighty for giving me the strength to
complete my report on time.

Anil Kumar Chouhan


15EEBCS004

About the Institute
CETPA INFOTECH, one of the best training institutes, is the right choice for students who are
freshers or final-year pass-outs and want to engage in long-term or short-term training, winter
training or industrial training. CETPA is one of the finest training companies, offering fresher
training in various cities of Northern India like Roorkee, Noida, Dehradun and Lucknow. The
company offers hands-on training with the help of a team of highly experienced and proficient
trainers who follow the best training methodology, which makes CETPA one of the most
trusted training companies in the whole of North India.

Online training is delivered using a computer, whether as a part of the training or as the whole
training course. CETPA offers online training in various
technologies like Software Testing, BIG DATA HADOOP, CLOUD COMPUTING, ARM,
JAVA, AutoCAD, CNC, STAAD.Pro, PLC and SCADA, VLSI, Networking, Embedded
Systems, SEO, Digital Marketing, and much more. Earlier, it was
thought that taking computers into classrooms would eliminate the personal touch that many
students want, and hence the idea received unhealthy publicity. But with the passage of time
and the advancement of technology, smartphones and tablets are now welcomed in the office
as well as the classroom.

CONTENTS
CONTENTS Page
Certificate by Institute....................................................................................................i
Candidate’s Declaration.................................................................................................ii
Abstract..........................................................................................................................iii
Acknowledgement.........................................................................................................iv
About the Institute..........................................................................................................v
List of Figures...............................................................................................................vii
List of Tables...............................................................................................................viii
CHAPTER 1 INTRODUCTION.................................................................................1-4
1. BACKGROUND OF BIG DATA....................................................................3-4
1.1 Purpose and Scope ...................................................................................3
1.2 Overview of the Project...........................................................................4
CHAPTER 2 BIG DATA AND HADOOP..................................................................5-11
2.1 Big Data.......................................................................................................5
2.2 Hadoop.......................................................................................................6
2.3 Hadoop Ecosystem....................................................................................7
2.4 HDFS Architecture..................................................................................7
2.4.1. Hadoop Distributed File System (HDFS) ...................................7
2.4.2. Yet Another Resource Negotiator (YARN)........................................9
2.4.3 Hadoop Framework .......................................................................11
CHAPTER 3 HADOOP DATA ANALYSIS TECHNOLOGY...................................12-22
3.1 Map Reduce ...........................................................................................12
3.2 Hive..........................................................................................................13
3.3 Pig .........................................................................................................14
3.4 Analysis in Technology to Use……………………….……......…....…15
YOUTUBE DATA ANALYSIS USING HADOOP...............................................17-20
3.5 Solution Extraction Using Hadoop......................................................... 17
3.6 Output Result........................................................................................ 20
CHAPTER 4 CONCLUSION AND FUTURE SCOPE..............................................23-25
4.1 Conclusion……………………………………………………..…….….23
4.2 Future Work……………………………………………….………....….25
List of Figures
Figure Page No

1.1 High Level Flow Diagram.............................................................................................4


2.1 Flowchart of data Analysis............................................................................................6
2.2 HDFS as a user-space-level file system. ......................................................................8
2.3 YARN Architecture ………………....……………......…….........................................10
2.4 Hadoop Framework……………….…………………......…………………...............11
3.1 MapReduce input & output………………………………....………………...............13
3.2 HIVE Architecture …………………………………………...……………........…....14
3.3 Pig Architecture ……………………………………………..….………..…......…...15
3.4 View & Like Graph……………………………………………....……………...........20
3.5 Video-id & Like Graph…………………………………………....………..…...........21
3.6 Bitvise SSH Link To Cloudera……………………………………....….……......…...21
3.7 Windows To Linux File Transfer server…………………………….….…….............22
3.8 Count Max-Comment…………………..…………………………..……....…....…....22
3.9 Count Max-Like……………………………………………………..………...............22
4.1 Data is growing at 40 percent compound annual rate...................................................24

List of Tables

Table Page No
3.1 Features and Comparisons of Map Reduce, Pig and Hive..........................16

3.2 Data Sample For Sales Data........................................................................19

3.3 Data View After Analysis...........................................................................20

4.1 Cloud Shift Summary by Market Segment..................................................24

Chapter 1
INTRODUCTION
With Internet companies like Google, Yahoo, Amazon and eBay, and a rapidly growing
internet-savvy population, today's advanced systems and enterprises are generating data in
very large volumes with great velocity and in multi-structured formats including videos,
images, sensor data, weblogs etc. from different sources. This has given birth to a new type
of data called Big Data, which is unstructured, sometimes semi-structured, and also
unpredictable in nature. This data is mostly generated in real time from social media websites
and is increasing exponentially on a daily basis.
"Big Data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate to deal with them." Analysis of such data sets can find
new correlations to spot business trends, prevent diseases, combat crime and so on. With
millions of people using Twitter to tweet about their most recent brand experience, hundreds
of thousands of check-ins on Yelp, thousands of people talking about a recently released
movie on Facebook and millions of views on YouTube for a recently released movie trailer,
we are at a stage wherein we are heading into a social media data explosion. Companies are
already facing challenges getting useful information from the transactional data of their
customers (e.g. data captured by companies when a customer opens a new account or signs
up for a credit card or a service). This type of data is structured in nature and still manageable.
However, social media data is primarily unstructured in nature. The very unstructured nature
of the data makes it very hard to analyze and very interesting at the same time.
Whereas RDBMS is designed to handle structured data, and that too only to a certain
limit, RDBMS fails to handle this kind of unstructured and huge amount of data called Big
Data. This inability of RDBMS has given birth to a new database management system called
the NoSQL management system.
Some of the key concepts used in Big Data analysis are:
1. Data Mining: Data mining is the incorporation of quantitative methods, using powerful
mathematical techniques to analyze data and process it. It is used to
extract data and find actionable information which is used to increase productivity and
efficiency.

2. Data Warehousing: A data warehouse is a database as the name implies. It is a kind of
central repository for collecting relevant information. It has centralized logic which reduces
the need for manual data integration.
3. MapReduce: MapReduce is a data processing paradigm for condensing large volumes
of data into useful aggregated results. Suppose we have a large volume of data for particular
users or employees etc. to handle. For that we need MapReduce function to get the aggregated
result as per the query.
4. Hadoop: Anyone holding a web application would be aware of the problem of storing
and retrieving data every minute. The adaptive solution created for the same was the use of
Hadoop including Hadoop Distributed File System or HDFS for performing operations of
storing and retrieving data. Hadoop framework has a scalable and highly accessible
architecture.
5. Hive: Hive is a data warehouse system for Hadoop that facilitates ad hoc queries and
analysis of large data sets stored in Hadoop.
6. HQL: Hive uses a SQL-like language called HQL (Hive Query Language). HQL is a
popular choice for Hadoop analytics; a minimal sketch is given below.
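
As a minimal illustration of HQL (the table and column names here are hypothetical, not
taken from the project files), a Hive table over YouTube video statistics and an ad hoc
query on it could look like this:

    -- Hypothetical table holding YouTube video statistics
    CREATE TABLE youtube_videos (
        video_id       STRING,
        uploader       STRING,
        category       STRING,
        views          BIGINT,
        likes          BIGINT,
        dislikes       BIGINT,
        comment_count  BIGINT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';

    -- Total views per category, highest first
    SELECT category, SUM(views) AS total_views
    FROM youtube_videos
    GROUP BY category
    ORDER BY total_views DESC;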

YouTube is one of the most popular and engaging social media tools and an amazing
platform that reveals community feedback through comments on published videos, the
number of likes and dislikes, and the number of subscribers for a particular channel. YouTube
collects a wide variety of traditional data points including View Counts, Likes, Votes, and
Comments. The analysis of the above listed data points constitutes a very interesting data
source to mine for obtaining implicit knowledge about users, videos, categories and
community interests.
Most companies upload their product launches on YouTube and anxiously await their
subscribers' reviews. Major production houses launch movie trailers and people provide their
first reactions and reviews about the trailers. This further creates a buzz and excitement about
the product. Hence the above listed data points become very critical for the companies so that
they can do the analysis and understand the customers' sentiments.

1. BACKGROUND OF BIG DATA
1.1 Purpose and Scope
"YouTube has over a billion users and every day people watch hundreds of millions of
hours on YouTube and generate billions of views”. "Every day, people across the world are
uploading 1.2 million videos to YouTube, or over 100 hours per minute and this number is
ever increasing . To analyze and understand the activity occurring on such a massive scale, a
relational SQL database is not enough. Such kind of data is well suited to a massively parallel
and distributed system like Hadoop. This project in data generated from YouTube can be
mined and utilized by different companies to make targeted, real time and informed
decisions about their product that can increase their market share. This can be done by using
Hadoop concepts.
The given project demonstrates how data generated from YouTube can be mined and utilized.
There are multiple applications of this project. Companies can use this project to understand
how effective and penetrative their marketing programs are. In addition to the view counts,
subscribers, shares and audience retention count, companies can also evaluate views
according to date range. This can tell the companies when the slow periods or spikes in
viewership occur and attribute the same to certain marketing campaigns. Applications for
YouTube data can be endless. For example, companies can analyze how much a product is
liked by people. This project can also help in analyzing new emerging trends and knowing
about people's changing behavior over time. Also, people in different countries have different
preferences. By analyzing the comments/feedback/likes/view counts etc. of the videos
uploaded, companies can understand what the likes/dislikes of people around the world are
and work on their preferences accordingly.
This project uses the following tools throughout its lifecycle.
1. Eclipse (Java)
2. Hadoop
3. Cloudera
4. Linux (CentOS)
5. RStudio
6. Bitvise SSH
7. HIVE
8. Pig
1.2 Overview of the Project
In this project we fetch a specific channel's YouTube data using the
YouTube API. We use the Google Developers Console to generate a unique access key
which is required to fetch YouTube public channel data. Once the API key is generated, a
.NET (C#) based console application is designed to use the YouTube API for fetching video
information based on a search criterion. The text file output generated from the console
application is then loaded from an HDFS file into the HIVE database. HDFS is a primary
Hadoop application, and a user can directly interact with HDFS using various shell-like
commands supported by Hadoop. Then we run queries on the Big Data using HIVE to extract
the meaningful output which can be used by the management for analysis; the load step is
sketched below.
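
The ingestion step can be sketched in HiveQL as follows. The file path and table are
illustrative assumptions (the actual schema used in the project is not reproduced here); the
table is the hypothetical youtube_videos table sketched earlier:

    -- The console application's text output is first copied into HDFS, e.g.:
    --   hdfs dfs -put youtube_data.txt /user/cloudera/youtube/
    -- The HDFS file is then loaded into the Hive table:
    LOAD DATA INPATH '/user/cloudera/youtube/youtube_data.txt'
    INTO TABLE youtube_videos;

    -- Quick sanity check that the rows arrived
    SELECT COUNT(*) FROM youtube_videos;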

Fig 1.1 High Level Flow Diagram

Chapter 2
BIG DATA AND HADOOP
2.1 Big Data
"a collection of data sets so large and complex that it becomes difficult to
process using the available database management tools. The challenges include how to
capture, curate, store, search, share, analyze and visualize Big Data”. In today's environment,
we have access to more types of data. These data sources include online transactions, social
networking activities, mobile device services, internet gaming
Big Data is a collection of data sets that are large and complex in nature. They
constitute both structured and unstructured data that grow large so fast that they are not
manageable by traditional relational database systems or conventional statistical tools. Big
Data is defined as any kind of data source that has at least three shared characteristics:
 Extremely large Volumes of data
 Extremely high Velocity of data
 Extremely wide Variety of data
According to Big Data: Concepts, Methodologies, Tools, and Applications, Volume I
by the Information Resources Management Association (IRMA), "organizations today are at
the tipping point in terms of managing data". Data sources are ever expanding. Data from
Facebook, Twitter, YouTube, Google etc. is expected to grow 50X in the next 10 years. Over
2.5 exabytes of data is generated every day. Some of the sources of huge volumes of data are:
1. A typical large stock exchange captures more than 1 TB of data every day.
2. There are over 5 billion mobile phones in the world which are producing an
enormous amount of data on a daily basis.
3. YouTube users upload more than 48 hours of video every minute.
4. Large social networks such as Twitter and Facebook capture more than 10 TB of
data daily.
5. There are more than 30 million networked sensors in the world which further
produce TBs of data every day.

Fig 2.1 Flowchart of data Analysis

Structured and semi-structured formats have some limitations with respect to
handling large quantities of data. Hence, in order to manage the data in the Big Data world,
new emerging approaches are required, including document, graph, columnar, and geospatial
database architectures. Collectively, these are referred to as NoSQL, or "not only SQL",
databases. In essence, the data architectures need to be mapped to the types of transactions.
Doing so will help to ensure the right data is available when you need it.

2.2 Hadoop
As organizations are getting flooded with massive amounts of raw data, the
challenge is that traditional tools are poorly equipped to deal with the scale and
complexity of such data. That's where Hadoop comes in. Hadoop is well suited to
meet many Big Data challenges, especially with high volumes of data and data with a variety
of structures.
At its core, Hadoop is a framework for storing data on large clusters of commodity
hardware (everyday computer hardware that is affordable and easily available) and running
applications against that data. A cluster is a group of interconnected computers (known as
nodes) that can work together on the same problem. Using networks of affordable compute
resources to acquire business insight is the key value proposition of Hadoop.

Hadoop consists of two main components:
1. A distributed processing framework named MapReduce (which is now supported by
a component called YARN, Yet Another Resource Negotiator).
2. A distributed file system known as the Hadoop Distributed File System (HDFS). In
Hadoop you can do any kind of aggregation of data, whether it is one-month-old
data or one-year-old data. Hadoop provides a mechanism called the MapReduce model to do
distributed processing of large data which internally takes care of the data even if one
machine goes down.

2.3 Hadoop Ecosystem


Hadoop is a shared-nothing system where each node acts independently throughout the
system. It is a framework where a piece of work is divided among several parallel MapReduce
tasks. Each task operates independently on cheap commodity servers. This enables businesses
to generate value from data that was previously considered too expensive to be stored and
processed in a traditional data warehouse.

In the old paradigm, companies would use a traditional enterprise data warehouse
system: they would buy the biggest data warehouse they could afford and store the data on a
single machine. However, with the increasing amount of data, this approach is no longer
affordable or practical. Some of the components of the Hadoop ecosystem are HDFS (Hadoop
Distributed File System), MapReduce, YARN, Hive and HBase. Hadoop has two core
components: a 'Storage' part to store the data and a 'Processing' part to process the data. The
storage part is called 'HDFS' and the processing part is called 'YARN'.

2.4 HDFS Architecture


2.4.1 HDFS :-
The Hadoop Distributed File System (HDFS) is the storage component of the
core Hadoop infrastructure. HDFS provides a distributed architecture for extremely large-
scale storage, which can easily be extended by scaling out. It is important to mention the
difference between scale-up and scale-out. In its initial days, Google was facing challenges to
store and process not only all the pages on the internet but also its users' web log data. At
that time, Google was using the scale-up architecture model, where you increase the system
capacity by adding CPU cores, RAM etc. to the existing server. But such a model was not
only expensive but also had structural limitations. So instead, Google engineers implemented
the scale-out architecture model, using a cluster of smaller servers which can be further
scaled out if more power and capacity are required. The Google File System (GFS) was
developed based on this architectural model. HDFS is designed based on a similar concept.

Fig 2.2 HDFS as a user-space-level file system

The core concept of HDFS is that it can be made up of dozens, hundreds, or even thousands
of individual computers, where the system's files are stored in directly attached disk drives.
Each of these individual computers is a self-contained server with its own memory, CPU,
disk storage, and installed operating system (typically Linux, though Windows is also
supported). Technically speaking, HDFS is a user-space-level file system, because it lives on
top of the file systems that are installed on all the individual computers that make up the
Hadoop cluster.
Figure 2.2 above shows that a Hadoop cluster is made up of two classes of servers:
slave nodes, where the data is stored and processed, and master nodes, which govern the
management of the Hadoop cluster. On each of the master nodes and slave nodes, HDFS runs
special services and stores raw data to capture the state of the file system. In the case of the
slave nodes, the raw data consists of the blocks stored on the node, and with the master
nodes, the raw data consists of metadata that maps data blocks to the files stored in HDFS.
HDFS is a system that allows multiple commodity machines to store data from a single
source. HDFS consists of a NameNode and DataNodes, and operates as a master-slave
architecture as opposed to a peer-to-peer architecture. The NameNode serves as the master
component while the DataNodes serve as slave components:
 The NameNode holds only the metadata information of HDFS, that is, which blocks of
data are present on which DataNodes.
 The DataNodes report block information to the NameNode, along with their capacity and
space utilization.
 The DataNodes handle the data processing; all the data being processed is stored on the
DataNodes, which are deployed on each machine. They perform the actual storage of the
files being processed and serve read and write requests from clients.
In early Hadoop there was only one NameNode attached to the DataNodes, which was a
single point of failure. Hadoop version 2.x provides multiple NameNodes, where a secondary
NameNode can take over in the event of a primary NameNode failure. The secondary
NameNode is responsible for performing periodic checkpoints; in the event of a primary
NameNode failure, the secondary NameNode can be started from these checkpoints, which
provides high availability within HDFS.
HDFS creates a self-healing architecture by replicating the same data across
multiple nodes, so it can process the data in a high-availability environment. For example,
if we have three DataNodes and one NameNode, the data is transferred from the client
environment into the HDFS DataNodes.
The replication factor defines the number of times a data block is replicated in a
clustered environment. Suppose we have a file that is split into two data blocks across three
DataNodes, and we set the replication factor to three. If one of the nodes fails, the data from
the failed node is redistributed among the remaining active nodes, and the other nodes
complete the processing function.

2.4.2 YARN :-
Let us take a look at a data warehouse example where we have one machine, and
with HDFS we can distribute the data onto more than one machine. Let's say we have a 100
GB file that takes 20 minutes to process on a machine with a given number of channels
and hard drives. If you add four machines of exactly the same configuration to a Hadoop
cluster, the processing time reduces.

Processing Component: Yet Another Resource Negotiator (YARN) is a resource
manager that identifies on which machine a particular task is going to be executed. The actual
processing of the task or program is done by the Node Manager. In Hadoop 2.x, YARN
augments the MapReduce platform and serves as the Hadoop operating system. Hadoop 2.x
separates the resource management function from data processing, allowing greater
flexibility: MapReduce only performs data processing while resource management is isolated
in YARN. Being the primary resource manager in Hadoop, YARN enables enterprises to
store data in a single place and interact with it in multiple ways with consistent levels of
service. In Hadoop 1.0 the NameNode used a job tracker and the DataNodes used task
trackers to manage resources. In Hadoop 2.x, YARN splits the two major functionalities of
the job tracker into resource management and job scheduling. The client reports to the
Resource Manager, and the Resource Manager allocates resources to jobs using resource
containers, the Node Manager and the App Master. A resource container bundles memory,
CPU, network bandwidth and other hardware constraints into a single unit. The Node
Manager receives updates from the resource containers, which communicate with the App
Master. The Node Manager is the framework for containers, resource monitoring and for
reporting data to the Resource Manager.

Fig 2.3 YARN Architecture

2.4.3 Hadoop Framework :-

The Hadoop framework comprises the Hadoop Distributed File System and the
MapReduce framework. The Hadoop framework divides the data into smaller chunks and
stores each part of the data on a separate node within the cluster. For example, if we have
4 terabytes of data, HDFS divides this data into 4 parts of 1 TB each. By doing this, the time
taken to store the data onto disk is significantly reduced: the total time to store this entire
data onto disk is equal to the time to store one part of the data, as all the parts are stored
simultaneously on different machines.
In order to provide high availability, Hadoop replicates each part of the data onto
other machines that are present within the cluster. The number of copies it replicates
depends on the replication factor. By default the replication factor is 3; in such a case there
will be 3 copies of each part of the data on three different machines. In order to reduce the
bandwidth and latency time, it stores two copies on the same rack and the third copy on a
different rack. For example, suppose NODE 1 and NODE 2 are on rack one and NODE 3
and NODE 4 are on rack two. Then the first two copies of part 1 will be stored on NODE 1
and NODE 2, and the third copy will be stored either on NODE 3 or NODE 4. A similar
process is followed in storing the remaining parts of the data. HDFS takes care of the
networking required by these nodes in order to communicate. A small worked example of
the storage cost of replication is given below.
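
As a quick worked example, using the figures above: with a replication factor of 3, the 4 TB
dataset occupies

    4 TB of data x 3 replicas = 12 TB of raw cluster capacity

even though only 4 TB of logical data is stored. Cluster sizing therefore has to account for
the replication factor, not just the logical data volume.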

Fig 2.4 Hadoop Framework


Chapter 3

HADOOP DATA ANALYSIS TECHNOLOGY


While Hadoop provides the ability to collect data on HDFS (the Hadoop Distributed
File System), there are many applications available in the market (like MapReduce, Pig and
Hive) that can be used to analyze the data. Let us first take a closer look at all three
applications and then analyze which application is better suited for the YouTube Data
Analysis project.

3.1 MapReduce
MapReduce is a set of Java classes run on YARN with the purpose of
processing massive amounts of data and reducing them into output files. HDFS works
with MapReduce to divide the data and process it in a parallel fashion across the machines of
the cluster. The parallel structure requires that the data is immutable and cannot be updated.
Processing begins with the input files where the data is initially stored, typically residing in
HDFS. These input files are then split up by an input format, which selects the files, defines
the input splits, breaks the file into tasks and provides a place for record reader objects. The
input format defines the list of tasks that make up the map phase. The tasks are then assigned
to the nodes in the system based on where the input file chunks are physically resident.
The input split describes the unit of work that comprises a single map task in a
MapReduce program. The record reader loads the data and converts it into key-value pairs
that can be read by the Mapper. The Mapper performs the first phase of the MapReduce
program: given a key and a value, the mappers emit key-value pairs and send these
values to the reducers. The process of moving mapped outputs to the reducers is known as
shuffling. Partitions are the inputs to reduce tasks; the partitioner determines which key-value
pair will be stored and reduced.
The set of intermediate keys is automatically sorted before being sent to the
reduce function. A reducer instance is created for each reduce task to create an output
format. The output format governs the way objects are written; the output format provided
by Hadoop writes the files to HDFS.

Fig 3.1 MapReduce input & output

3.2 Hive
Hive provides the ability to store and query large amounts of data in HDFS. Hive was
designed to appeal to a community comfortable with SQL. Hive uses an SQL-like language
known as HiveQL. Its philosophy is that we don't need yet another scripting language. Hive
supports map and reduce transform scripts in the language of the user's choice, which can be
embedded in SQL. Hive is widely used at Facebook, by analysts comfortable with SQL as
well as by data miners programming in Python and Java.
Supporting SQL syntax also means that it is possible to integrate with existing tools.
Hive has ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity)
drivers that allow and facilitate easy querying. It also adds support for indexes, which allows
support for queries common in such environments. Hive is a framework for performing
analytical queries. Currently, Hive can be used to query data stored in HBase, which is a
key-value store like those found in the guts of most RDBMSs (Relational Database
Management Systems), and the Hadoop database project uses Hive to query its RDBMS tier.

Fig 3.2 Hive Architecture

3.3 Pig
Pig's name comes from the language Pig Latin. Pig Latin is a procedural programming
language and fits very naturally in the pipeline paradigm. When queries become complex,
with many joins and filters, Pig is strongly recommended. Pig Latin allows pipeline
developers to decide where to checkpoint data in the pipeline; that is, storing data in
between operations has the advantage of checkpointing data in the pipeline. This ensures the
whole pipeline does not have to be rerun in the event of a failure. Pig Latin allows users to
store data at any point in the pipeline without disturbing the pipeline execution.
The advantage that Pig Latin provides is that pipeline developers decide where
appropriate checkpoints are in the pipeline, rather than being forced to checkpoint
wherever the semantics of SQL impose it. Pig Latin also supports splits in the pipeline.
A common feature of data pipelines is that they are often graphs, not linear pipelines.
Since disk read and write scan times and intermediate results usually dominate the
processing of large datasets, reducing the number of times data must be written to and read
from disk is crucial for good performance.

Pig Latin allows developers to insert their own code almost anywhere in the data pipeline,
which is useful for pipeline development. This is accomplished through user-defined
functions (UDFs). UDFs allow users to specify how data is
loaded, how data is stored and how data is processed. Streaming allows users to include
executables at any point in the data flow. Pipelines also often include user-defined column
transformation functions and user-defined aggregations. Pig Latin supports writing
both of these types of functions in Java.

Fig 3.3 Pig Architecture

3.4 Analysis in Technology to Use


The following table shows features and a comparison of the leading Hadoop data
analysis technologies available in the market. For this project, the dataset is relatively small:
around 2000 records for any given search criterion across multiple countries.
However, in a real environment, with the extensive information available on YouTube, the
data size can be much larger. Given a video ID, the application first extracts information from
the YouTube API, which contains all the meta-data. The application then scrapes the video's
webpage to obtain the remaining information. The recorded fields of each YouTube video are
stored in order, delimited, in the data file.

After extracting the sample dataset from the YouTube API, this dataset can be fed into
various Hadoop technologies, and meaningful results can be extracted and analyzed.

Table 3.1 Features and Comparisons of Map Reduce, Pig and Hive

YOUTUBE DATA ANALYSIS USING HADOOP
YouTube, owned by Google, is a video sharing website, where users can upload,
watch and share videos with others. YouTube provides a forum for people to connect, inform,
and inspire others across the globe and acts as a distribution platform for original content
creators and advertisers large and small. According to Statista.com (The Statistics Portal),
"As of July 2010, more than 400 hours of video content were uploaded to YouTube every
minute, a fourfold increase compared to only two years prior. The platform, which was
created in 2005, has slowly become one of the most visited websites in the world and a
global phenomenon. In the first quarter of 2015, more than 80 percent of global internet users
had visited YouTube in the last month. In the United States, it is the second largest social
media website after Facebook, accounting for over 22 percent of social media traffic. The
rise in Smartphone and other mobile devices usage has also helped increase the consumption
of YouTube videos on the go. As of mid 2015, approximately half of U.S. mobile users
accessed YouTube via a mobile device, whether Smartphone or tablet computer."
While companies, musicians or film distributors might use YouTube as a form of
free direct advertisement, YouTube has also become a launch pad for various
products/services, wherein large corporations reveal the first look of a product on YouTube,
generate a buzz about their product, assess the market demand based upon likes and view
counts, and improve their product based upon customers' feedback. Hence data points
including View Counts, Likes, Votes, Comments etc. become very critical for the
companies so that they can do the analysis and understand the customers' sentiments about
their products/services.
The main objective of this project is to show how companies can analyze YouTube
data using the YouTube API to make targeted, real-time and informed decisions. This project
will help in understanding changing trends among people by analyzing YouTube data and
fetching meaningful results. For example, when companies like Disney launch their new
movie trailers on YouTube, this application can help Disney in analyzing the reaction of
people towards a specific movie trailer. This application can analyze how many people liked
the trailer, in which country the trailer was liked the most, and whether the comments posted
on YouTube are generally positive, negative or neutral. This way management can take
executive decisions on how to spend their marketing budget in order to maximize their
returns.
Since YouTube data is getting created in a very huge amount and with an equally
great speed, there is a huge demand to store, process and carefully study this large amount of
data to make it usable. Hadoop is definitely the preferred framework to analyze data of this
magnitude.
The YouTube sample dataset collected using the .NET console application
has the following properties:
 The data is in a structured format
 It would require joins to capture country-wise count statistics
 The collected data set can be organized into a schema

The following feature comparison analysis is performed in order to analyze which
Hadoop technology is suitable for the YouTube Data Analysis project.
1. If MapReduce is to be used for the YouTube data analysis project, then we need to write
complex business logic in order to successfully execute the join queries. We would
have to think, from the map and reduce point of view, about what is important and what
is not important, and which particular little piece of code will go into the map side and
which one will go into the reduce side. Programmatically, this effort becomes quite
challenging, as a lot of custom code is required to successfully execute the business logic
even for the simplest tasks. Also, it may be difficult to map the data into a schema format,
and a lot of development effort may go into deciding how map-side and reduce-side joins
can function efficiently.
2. Pig is a procedural data flow language. A procedural language follows a step by step
approach defined by the programmer. Pig requires a learning curve, since the syntax
is new and different from SQL. Also, Pig requires more maintenance: the values of
variables may not be retained; instead, the query needs to be rerun in order to get the
values from a variable. Moreover, Pig is a scripting language that is more suitable for
prototyping and rapidly developing MapReduce-based jobs. The data schema is not
enforced explicitly in Pig, and hence it becomes difficult to map the data into a
schema format. Also, the errors that Pig produces are not very user friendly: it just gives
an exec error even if the problem is related to a syntax or type error. Pig development
may require more time than Hive, but this is purely based on the developer's familiarity
with Pig code.
3. Hive provides a familiar programming model. It operates on query data with a SQL-
based language. It is comparatively faster, with better interactive response times, even
over huge datasets. As data variety and volume grow, more commodity machines can
be added without reducing performance. Hence, Hive is scalable and extensible, and is
highly compatible with existing SQL-based tools.
4. Hive provides a familiar programming model. It operates on query data with a SQL-
based language. It is comparatively faster with better interactive response times, even
over huge datasets. As data variety and volume grows, more commodity machines can
be added without reducing the performance. Hence, Hive is scalable and extensible.
Hive is very compatible.

If we apply Hive to analyze the YouTube data, then we would be able to leverage the
SQL capabilities of HiveQL, as well as manage the data in a particular schema. Also, by
using Hive, the development time can be significantly reduced. After looking at the pros and
cons, Hive becomes the obvious choice for this YouTube Data Analysis project; a sketch of
the kind of query involved is given below.
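
For instance, the country-wise count statistics mentioned above reduce to a short grouped
HiveQL query. The sketch below assumes the illustrative youtube_videos table from
Chapter 1, extended with a country column; the real project schema may differ:

    -- Country-wise video, view and like counts
    SELECT country,
           COUNT(video_id) AS video_count,
           SUM(views)      AS total_views,
           SUM(likes)      AS total_likes
    FROM youtube_videos
    GROUP BY country
    ORDER BY total_views DESC;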

Table 3.2 Data Sample For Sales Data

Table 3.3 Data View After Analysis

Fig 3.4 View & Like Graph

Fig 3.5 Video-id & Like Graph

Fig 3.6 Bitvise SSH Link To Cloudera

Fig 3.7 Windows To Linux File Transfer Server

Fig 3.8 Count of Max-Comment

Fig 3.9 Count of Max-Like in Hadoop


Chapter 4

CONCLUSION AND FUTURE SCOPE

4.1 Conclusion

The task of big data analysis is not only important but also a necessity. In fact,
many organizations that have implemented Big Data are realizing significant competitive
advantage compared to other organizations with no Big Data efforts. The project is intended
to analyze the YouTube Big Data and come up with significant insights which cannot be
determined otherwise.
The output results of the YouTube data analysis project show key insights that can
be extrapolated to other use cases as well. One of the output results describes, for a
specific video id, how many likes were received. The number of likes -- or "thumbs-up" -- a
video has carries a direct significance for the YouTube video's ranking, according to YouTube
Analytics. So if a company posts its video on YouTube, then the number of YouTube likes
the company has could determine whether the company or its competitors appear more
prominently in YouTube search results.
Another output result gives us insights on whether there is a pattern of affinity of
interests for a certain video category. This can be done by analyzing the comment counts.
For example, if the company falls under the 'Comedy' or 'Education' category, a meaningful
discussion in the form of comments can be triggered on YouTube. A comment analysis can
further be conducted to understand the attitude of people towards the specific video; an
illustrative query is sketched below.
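
Queries of this kind are short in HiveQL. As an illustration (again using the hypothetical
youtube_videos table from Chapter 1, with a placeholder video id), the like count for a
specific video and the most-liked videos, in the spirit of the output shown in Fig 3.9, might
be obtained as follows:

    -- Likes received by one specific video (the id is a placeholder)
    SELECT video_id, likes
    FROM youtube_videos
    WHERE video_id = 'VIDEO_ID_HERE';

    -- Top 10 videos by number of likes
    SELECT video_id, likes
    FROM youtube_videos
    ORDER BY likes DESC
    LIMIT 10;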

Table 4.1 Cloud Shift Summary by Market Segment

Fig 4.1 Data is growing at 40 percent compound annual rate, reaching nearly 45 ZB by
2020

4.2 Future Work

The future work would include extending the analysis of YouTube data using
other Big Data analysis technologies like Pig and MapReduce and doing a feature comparison
analysis. It would be interesting to see which technology fares better as compared to the
other ones. One feature that is not added in the project is representing the output in a
Graphical User Interface (GUI).
The current project displays a very simplistic output which does not warrant a
GUI interface. However, if the output is too large and complex, the output can be interfaced
in a GUI format to display the results. The data can then be presented in different formats,
including pie-charts and graphs, for a better user experience. Another possible extension of
this project could be a YouTube Comment Analysis project.
The current scope of the project includes analyzing the statistics for a
channel/category, including view counts, likes, dislikes, country-wise views etc. By
identifying and classifying/categorizing the polarity of words, sentiment analysis or opinion
mining can be performed for a specific video. This would tell us the writer's attitude towards
a particular product or a given subject. Using sentiment analysis, we can determine if the
general attitude of people is positive, negative or neutral towards a specific subject/video.
