
BIG DATA and HADOOP

DIGITAL NOTES

SCHOOL OF ENGINEERING
Department of Cyber Security
MALLA REDDY UNIVERSITY
III Year B. Tech – II Semester
(MR22-1CS04) BIG DATA and HADOOP

COURSE OBJECTIVES
1. Introduce Big Data concepts.
2. Introduce distributed processing concepts with MapReduce.
3. Store and analyze data with Hadoop ecosystem tools.
4. Introduce NoSQL concepts using HBase.
5. Perform data analytics using Hive.
UNIT – I
Introduction to Big Data: What is Big Data – Why Big Data is Important – Evolution of Big Data – Failure of Traditional Databases in Handling Big Data – 3Vs of Big Data – Sources of Big Data – Different Types of Data – Big Data Infrastructure – Big Data Life Cycle – Big Data Applications – A Brief History of Hadoop – Apache Hadoop – Hadoop Ecosystem – Linux Refresher – VMware Installation of Hadoop.

UNIT – II
Hadoop I/O: Data Integrity – Data Integrity in HDFS – Local File System – Checksum File System – Compression and Input Splits – Using Compression in MapReduce – Serialization – The Writable Interface – Writable Classes – Implementing a Custom Writable – Serialization Frameworks – File-Based Data Structures – SequenceFile – MapFile – Other File Formats and Column-Oriented Formats.
HDFS: The Design of HDFS – HDFS Concepts – Command-Line Interface to HDFS – Hadoop File Systems – Interfaces – Java Interface to Hadoop – Anatomy of a File Read – Anatomy of a File Write – Replica Placement and Coherency Model – Parallel Copying with distcp – Keeping an HDFS Cluster Balanced.

UNIT – III
Understanding MapReduce Fundamentals: Introduction – Analyzing Data with Unix Tools – Analyzing Data with Hadoop – Java MapReduce Classes (New API) – Data Flow – Combiner Functions – Running a Distributed MapReduce Job.
Classic MapReduce: Job Submission – Job Initialization – Task Assignment – Task Execution – Progress and Status Updates – Job Completion – Shuffle and Sort on the Map and Reduce Side – Configuration Tuning – MapReduce Types – Input Formats – Output Formats – Sorting – Map-Side and Reduce-Side Joins.

UNIT – IV
HBASE: Introduction – Architecture – Storage of Big Data – Interacting with the Hadoop Ecosystem – Installation – Programming with HBase – Combining HBase and HDFS – Installation Test Drive – Clients – Java – MapReduce – REST and Thrift – Building an Online Query Application – Schema Design – Loading Data – Online Queries – HBase Versus RDBMS – Successful Service – HBase.
YARN: Anatomy of a YARN Application Run – Resource Requests – Application Lifespan – Building YARN Applications – YARN Compared to MapReduce 1 – Scheduling in YARN – Scheduler Options – Capacity Scheduler Configuration – Fair Scheduler Configuration – Delay Scheduling – Dominant Resource Fairness.
UNIT – V
HIVE:
The Hive Shell – Hive Services – Hive Clients – The Metastore – Comparison with Traditional Databases – HiveQL – HBase Basics – Concepts – Implementation – Java and MapReduce Clients – Loading Data – Web Queries – Data Types – Operators and Functions – Tables – Managed Tables and External Tables – Partitions and Buckets – Storage Formats – Importing Data – Altering Tables – Dropping Tables – Querying Data – Sorting and Aggregating – MapReduce Scripts – Joins – Subqueries.

TEXT BOOKS:
1. Student’s Handbook for Associate Analytics.
2. BIG DATA and ANALYTICS, Seema Acharya, Subhasinin Chellappan, Wiley Publications.
3. BIG DATA, Black Book™, DreamTech Press, 2015 Edition.
4. BUSINESS ANALYTICS, 5e, by Albright and Winston.
REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira (the authors have kindly made an online version available): http://www.dataminingbook.info/uploads/book.pdf.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), and Jeffrey D. Ullman (Stanford Univ.).
COURSE OUTCOMES:
1. Outline how to store and manage data in HDFS and implement basic applications in MapReduce.
2. Summarize how to store and analyze data using Pig scripts and handle partitioned and bucketed tables in Hive.
3. Interpret the import and export of data from databases such as MySQL or Oracle.
4. Illustrate working with various file formats in the Hadoop ecosystem.
5. Implement Spark scripts using RDDs and work with a columnar database using HBase.
UNIT – I
1. INTRODUCTION TO BIG DATA:

Today we live in a digital world. With increasing digitization, the amount of structured and unstructured data being created and stored is exploding. Data is generated from many sources – transactions, social media, sensors, digital images, videos, audio and clickstreams – across domains including healthcare, retail, energy and utilities. In addition to businesses and organizations, individuals contribute to the data volume. For instance, 30 billion pieces of content are shared on Facebook every month, and the photos viewed every 16 seconds in Picasa could cover a football field.
IDC estimates that this digital universe will explode to an unimaginable 8 zettabytes by the year 2015. The term “Big Data” was coined to address the storage and processing of this massive volume of data. It is increasingly imperative for organizations to be able to bring this data into their applications.
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations. A few basic tenets make it even simpler to answer what Big Data is:
• It refers to a massive amount of data that keeps growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
• It involves data mining, data storage, data analysis, data sharing, and data visualization.
• The term is all-comprehensive, covering data and data frameworks, along with the tools and techniques used to process and analyze the data.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go back
to the 1960s and '70s when the world of data was just getting started with the first data centers
and the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically
to store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.
The development of open-source frameworks such as Hadoop (and more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data – but it's not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence
of machine learning has produced still more data. While big data has come far, its usefulness is
only just beginning. Cloud computing has expanded big data possibilities even further. The
cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to
test a subset of data.
Benefits of Big Data and Data Analytics
• Big data makes it possible for you to gain more complete answers because you have more information.
• More complete answers mean more confidence in the data – which means a completely different approach to tackling problems.
Why is Big Data Important?
The importance of big data does not revolve around how much data a company has but how a
company utilizes the collected data. Every company uses data in its own way; the more
efficiently a company uses its data, the more potential it has to grow. The company can take
data from any source and analyze it to find answers which will enable:
1. Cost Savings: Big Data tools such as Hadoop and cloud-based analytics bring cost advantages to a business when large amounts of data are to be stored, and these tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’ purchasing
behaviors, a company can find out the products that are sold the most and produce products
according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment analysis, so you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with all of this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no single business
that can claim success without first having to establish a solid customer base. However, even
with a customer base, a business cannot afford to disregard the high competition it faces. If a
business is slow to learn what customers are looking for, then it is very easy to begin offering
poor quality products. In the end, loss of clientele will result, and this creates an adverse
overall effect on business success. The use of big data allows businesses to observe various
customer related patterns and trends. Observing customer behavior is important to trigger
loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights
Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company's product line and, of course, ensure that the marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and redevelop
their products.
Evolution of Big Data
Looking at the last few decades, we can see that Big Data technology has grown tremendously. There are a number of milestones in the evolution of Big Data, described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology helps companies to store their important data in data
centers that are remote, and it saves their infrastructure cost and maintenance costs.
5. Machine Learning:
Machine Learning algorithms are those algorithms that work on large data, and
analysis is done on a huge amount of data to get meaningful insights from it. This has
led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data streaming technology has emerged as a solution to process large volumes of data
in real time.
7. Edge Computing:
Edge Computing is a kind of distributed computing paradigm that allows data
processing to be done at the edge or the corner of the network, closer to the source of
the data.
Overall, big data technology has come a long way since the early days of data warehousing.
The introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data
streaming, and edge computing has revolutionized how we store, process, and analyze large
volumes of data. As technology evolves, we can expect Big Data to play a very important role
in various industries.

Failure of Traditional Database in Handling Big Data


1. Need for Synchronization across Disparate Data Sources
As data sets are becoming bigger and more diverse, there is a big challenge to incorporate
them into an analytical platform. If this is overlooked, it will create gaps and lead to wrong
messages and insights.
2. Acute Shortage of Professionals Who Understand Big Data Analysis
Analysis is what makes the voluminous amount of data produced every minute useful. With the exponential rise of data, a huge demand for big data scientists and Big Data analysts has been created in the market. It is important for business organizations to hire data scientists with varied skills, as the job of a data scientist is multidisciplinary. There is a sharp shortage of data scientists in comparison to the massive amount of data being produced.
3. Getting Meaningful Insights Through The Use Of Big Data Analytics
It is imperative for business organizations to gain important insights from Big Data analytics, and it is also important that only the relevant department has access to this information. A big challenge faced by companies in Big Data analytics is bridging this wide gap in an effective manner.
4. Getting Voluminous Data into The Big Data Platform
It is hardly surprising that data is growing with every passing day. This simply means that business organizations need to handle a large amount of data on a daily basis. The amount and variety of data available these days can overwhelm any data engineer, which is why it is considered vital to make data accessibility easy and convenient for brand owners and managers.
5. Uncertainty of Data Management Landscape
With the rise of Big Data, new technologies and companies are being developed every day.
However, a big challenge faced by companies working with Big Data analytics is to find out which technology will best suit them without introducing new problems and potential risks.
6. Data Storage and Quality
Business organizations are growing at a rapid pace. As companies and large business organizations grow, the amount of data they produce increases. Storing this massive amount of data is becoming a real challenge for everyone. Popular data storage options like data lakes and warehouses are commonly used to gather and store large quantities of unstructured and structured data in its native format. The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data from diverse sources: it encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate data all result in data quality challenges.
7. Security and Privacy of Data
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and opportunities. However, it also involves potential risks when it comes to the privacy and security of the data. The Big Data tools used for analysis and storage draw on data from disparate sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus, the rise of voluminous amounts of data increases privacy and security concerns.
3Vs of Big Data
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume. Let's discuss the characteristics of big data; these three characteristics are enough to understand what big data is. Let's look at each of them in depth. Understanding the 3 Vs of Big Data –
1. Volume
2. Velocity
3. Variety
1. Volume
Within the Social Media space for example, Volume refers to the amount of data generated
through websites, portals and online applications. Especially for B2C companies, Volume
encompasses the available data that are out there and need to be assessed for relevance.
Consider the following: Facebook has 2 billion users, YouTube 1 billion users, Twitter 350 million users and Instagram 700 million users. Every day, these users contribute billions of images, posts, videos, tweets, etc. You can now imagine the insanely large amount – or Volume – of data that is generated every minute and every hour.
2. Velocity
With Velocity we refer to the speed with which data are being generated. Staying with our
social media example, every day 900 million photos are uploaded on Facebook, 500
million tweets are posted on Twitter, 0.4 million hours of video are uploaded on Youtube
and 3.5 billion searches are performed on Google. This is like a nuclear data explosion. Big Data helps a company handle this explosion, accept the incoming flow of data and at the same time process it quickly so that it does not create bottlenecks.
3. Variety
Variety in Big Data refers to all the structured and unstructured data that has the possibility
of getting generated either by humans or by machines. The most commonly added data are structured – texts, tweets, pictures and videos. However, unstructured data like emails, voicemails, hand-written text, ECG readings, audio recordings, etc., are also important elements under Variety. Variety is all about the ability to classify the incoming data into
various categories.
Sources of Big Data
Classification of Types of Big Data
The following classification was developed by the Task Team on Big Data, in June 2013.
1. Social Networks (human-sourced information): this information is the record of
human experiences, previously recorded in books and works of art, and later in
photographs, audio and video. Human-sourced information is now almost entirely digitized
and stored everywhere from personal computers to social networks. Data are loosely
structured and often ungoverned.
• Social networks: Facebook, Twitter, Tumblr, etc.
• Blogs and comments
• Personal documents
• Pictures: Instagram, Flickr, Picasa, etc.
• Videos: YouTube, etc.
• Internet searches
• Mobile data content: text messages
• User-generated maps
• E-mail
2. Traditional Business Systems (process-mediated data): these processes record and monitor business events of interest, such as registering a customer, manufacturing a product, taking an order, etc. The process-mediated data thus collected is highly structured and includes transactions, reference tables and relationships, as well as the metadata that sets its context. Traditional business data is the vast majority of what IT has managed and processed, in both operational and BI systems. It is usually structured and stored in relational database systems. (Some sources belonging to this class may fall into the category of "Administrative data".)
• Data produced by public agencies
  o Medical records
• Data produced by businesses
  o Commercial transactions
  o Banking/stock records
  o E-commerce
  o Credit cards
3. Internet of Things (machine-generated data): derived from the phenomenal growth in
the number of sensors and machines used to measure and record the events and situations
in the physical world. The output of these sensors is machine-generated data, and from
simple sensor records to complex computer logs, it is well structured. As sensors
proliferate and data volumes grow, it is becoming an increasingly important component of the information stored and processed by many businesses. Its well-structured nature is suitable for computer processing, but its size and speed are beyond traditional approaches.
• Data from sensors
  o Fixed sensors: home automation, weather/pollution sensors, traffic sensors/webcams, scientific sensors, security/surveillance videos and images
  o Mobile sensors (tracking): mobile phone location, cars, satellite images
• Data from computer systems
  o Logs
  o Web logs

Types of Big Data


Now that we are on track with what big data is, let's have a look at the types of big data:
a) Structured
Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms. For instance, the employee table in a company database will be structured: the employee details, their job positions, their salaries, etc., will be present in an organized manner.

b) Unstructured
Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data. Structured and unstructured are two important types of big data.
c) Semi-structured
Semi-structured data is the third type of big data. It pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not organized under a particular repository (database), nevertheless contains vital information or tags that segregate individual elements within the data. This brings us to the end of the types of big data; a short illustration of the semi-structured case follows below.
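To make the semi-structured case concrete, here is a minimal Java sketch. The record contents and field names are invented purely for illustration: two records from the same hypothetical feed carry self-describing field names (tags), but they do not share a fixed schema, so code has to probe for fields rather than assume fixed columns.

import java.util.List;
import java.util.Map;

public class SemiStructuredExample {
    public static void main(String[] args) {
        // Two records from the same (hypothetical) feed: both are tagged with
        // field names, but they do not follow one fixed schema.
        Map<String, Object> rec1 = Map.of(
                "id", 1,
                "name", "Asha",
                "email", "asha@example.com");            // has an email field
        Map<String, Object> rec2 = Map.of(
                "id", 2,
                "name", "Ravi",
                "interests", List.of("video", "music")); // no email, extra list field

        for (Map<String, Object> rec : List.of(rec1, rec2)) {
            // Code must check for a field instead of assuming a fixed column.
            System.out.println(rec.getOrDefault("email", "<no email>"));
        }
    }
}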
Big Data Infrastructure
• Big data architecture is a comprehensive solution to deal with an enormous amount of
data.
• It details the blueprint for providing solutions and infrastructure for dealing with big
data based on a company’s demands.
• Data Sources: Relational databases, data warehouses, cloud-based data warehouses, SaaS applications, real-time data from company servers and sensors such as IoT devices, third-party data providers, and also static files such as Windows logs comprise several data sources.
• Data Storage: HDFS, Microsoft Azure, AWS, and GCP storage, among other blob containers.
• Batch Processing: Multiple approaches to batch processing are employed, including Hive jobs, U-SQL jobs, Sqoop or Pig, and custom map-reduce jobs written in Java, Scala, or other languages such as Python (a minimal example appears after this list).
• Real-Time Message Ingestion: Message-based ingestion stores such as Apache Kafka, Apache Flume, Event Hubs from Azure, and others must be used if message-based processing is required. The delivery process, along with other message queuing semantics, is generally more reliable.
• Stream Processing: Stream processing handles the streaming data in the form of windows or streams and writes it to the sink. This includes Apache Spark, Flink, Storm, etc.
• Analytics-Based Datastore: In order to analyze and process already processed data, analytical tools use a data store based on HBase or another NoSQL data warehouse technology. NoSQL databases like HBase, or Spark SQL, are also available.
• Reporting and Analysis: The generated insights must be processed, and that is accomplished by reporting and analysis tools that use embedded technology to produce useful graphs, analyses, and insights that benefit the business. Examples include Cognos, Hyperion, and others.
• Orchestration: Big-data solutions consist of repetitive, data-related tasks contained in workflow chains that transform the source data, move it across sources and sinks, and load it into stores. Sqoop, Oozie, Data Factory, and others are just a few examples.

Big Data Life Cycle


The Big Data Analytics life cycle is divided into nine phases, named as:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and Filtration
4. Data Extraction
5. Data Munging (Validation and Cleaning)
6. Data Aggregation & Representation (Storage)
7. Exploratory Data Analysis
8. Data Visualization (Preparation for Modeling and Assessment)
9. Utilization of Analysis Results
1. Phase I Business Problem Definition –
In this stage, the team learns about the business domain, which presents the motivation and goals for carrying out the analysis. The problem is identified, and assumptions are made about how much potential gain the company will make after carrying out the analysis. Important activities in this step include framing the business problem as an analytics challenge that can be addressed in subsequent phases. It helps the decision-makers understand the business resources that will be required, thereby determining the underlying budget needed to carry out the project.
Moreover, it can be determined whether the problem identified is a Big Data problem or not, based on the business requirements in the business case. To qualify as a big data problem, the business case should be directly related to one (or more) of the characteristics of volume, velocity, or variety.
2. Phase II Data Identification –
Once the business case is identified, it is time to find the appropriate datasets to work with. In this stage, analysis is done to see what other companies have done for a similar case. Depending on the business case and the scope of analysis of the project being addressed, the sources of datasets can be either external or internal to the company. Internal datasets include data collected from internal sources, such as feedback forms or existing software; external datasets include datasets from third-party providers.
3. Phase III Data Acquisition and Filtration –
Once the source of data is identified, it is time to gather the data from these sources. This kind of data is mostly unstructured. It is then subjected to filtration, such as removal of corrupt or irrelevant data that is of no use to the analysis objective. Here, corrupt data means data that may have missing records, or records with incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of use in the future, for some other analysis.
4. Phase IV Data Extraction –
The data is now filtered, but some of its entries might still be incompatible. To rectify this issue, a separate phase is created, known as the data extraction phase. In this phase, the data that does not match the underlying scope of the analysis is extracted and transformed into a compatible form.
5. Phase V Data Munging –
As mentioned in Phase III, the data is collected from various sources, which results in the data being unstructured. The data might also have unsuitable constraints, which can lead to false results. Hence there is a need to clean and validate the data.
This includes removing invalid data and establishing complex validation rules. There are many ways to validate and clean the data. For example, a dataset might contain a few rows with null entries. If a similar dataset is present, then those entries are copied from that dataset; otherwise, those rows are dropped (a small sketch follows below).
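A minimal sketch of the cleaning rule just described, in plain Java (outside Hadoop): rows of a hypothetical comma-separated students file are dropped when any field is blank. The file name students.csv and the three-column layout are assumptions made purely for illustration.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DataMunging {
    public static void main(String[] args) throws IOException {
        // Hypothetical input: CSV rows of the form "rollNo,name,marks".
        List<String> rows = Files.readAllLines(Paths.get("students.csv"));

        List<String> cleaned = rows.stream()
                .filter(row -> {
                    // Split on commas, keeping trailing empty fields (-1).
                    String[] fields = row.split(",", -1);
                    if (fields.length != 3) {
                        return false;              // wrong number of columns
                    }
                    for (String f : fields) {
                        if (f.trim().isEmpty()) {
                            return false;          // drop rows with null/blank entries
                        }
                    }
                    return true;
                })
                .collect(Collectors.toList());

        System.out.println("Kept " + cleaned.size() + " of " + rows.size() + " rows");
    }
}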

6. Phase VI Data Aggregation & Representation –


The data is cleansed and validated against certain rules set by the enterprise. But the data might be spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the datasets are joined together. For example, if there are two datasets, namely those of a Student Academic section and a Student Personal Details section, then both can be joined together via a common field, i.e., roll number (a small join sketch follows below).
This phase calls for intensive operations, since the amount of data can be very large. Automation can be brought in, so that these steps are executed without any human intervention.
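The join on a common field can likewise be sketched in a few lines of Java. The two in-memory maps below stand in for the hypothetical Student Academic and Student Personal Details datasets, keyed by roll number; the record values are invented for illustration.

import java.util.HashMap;
import java.util.Map;

public class DataAggregation {
    public static void main(String[] args) {
        // Hypothetical datasets keyed by roll number.
        Map<Integer, String> academic = new HashMap<>();
        academic.put(101, "CGPA 8.7");
        academic.put(102, "CGPA 9.1");

        Map<Integer, String> personal = new HashMap<>();
        personal.put(101, "Asha, Hyderabad");
        personal.put(102, "Ravi, Chennai");

        // Join the two datasets on the common roll-number field.
        for (Map.Entry<Integer, String> entry : academic.entrySet()) {
            String details = personal.get(entry.getKey());
            if (details != null) {
                System.out.println(entry.getKey() + " -> " + details + " | " + entry.getValue());
            }
        }
    }
}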

7. Phase VII Exploratory Data Analysis –


Here comes the actual step: the analysis task. Depending on the nature of the big data problem, analysis is carried out. Data analysis can be classified as confirmatory analysis or exploratory analysis. In confirmatory analysis, a proposed cause of a phenomenon is assumed in advance; this assumption is called the hypothesis. The data is analyzed to prove or disprove the hypothesis. This kind of analysis provides definitive answers to specific questions and confirms whether an assumption was true or not. In exploratory analysis, the data is explored to obtain information about why a phenomenon occurred. This type of analysis answers “why” a phenomenon occurred. It does not provide definitive answers, but it enables the discovery of patterns.

8. Phase VIII Data Visualization –


Now we have answers to some questions, using the information from the data in the datasets. But these answers are still in a form that can't be presented to business users. Some form of representation is required to obtain value or conclusions from the analysis. Hence, various tools are used to visualize the data in graphic form, which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows the users to discover answers to questions that are yet to be formulated.

9. Phase IX Utilization of analysis results –


The analysis is done and the results are visualized; now it's time for the business users to make decisions utilizing the results. The results can be used for optimization, to refine the business process. They can also be used as input to systems to enhance performance.
Big Data Applications
In today's world, there is a lot of data, and big companies utilize that data for their business growth. By analyzing this data, useful decisions can be made in various cases, as discussed below:
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team keeps data on customers' spending habits (which products they spend on, which brands they prefer, how frequently they spend), their shopping behavior, and their most-liked products (so that those products can be kept in the store). Based on which products are searched for or sold the most, the production/collection rate of those products is fixed.
2. Banking Sector: Banks use their customers' spending-behavior data so that they can offer a particular customer a deal to buy a product they like using the bank's credit or debit card with a discount or cashback. In this way, they can send the right offer to the right person at the right time.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside the road and at the entry and exit points of the city, and from GPS devices placed in vehicles (Ola, Uber cabs, etc.). All such data is analyzed, and jam-free or less congested, less time-consuming routes are recommended. In this way, a smart traffic system can be built in the city using big data analysis. An added benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System: Sensors are present at various places in an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on analysis of such data, environmental parameters within the flight are set up and varied.
5. Auto-Driving Car: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding cars, obstacles, and the distance from them. This data is analyzed, and various calculations are carried out, such as how many degrees to turn, what the speed should be, and when to stop. These calculations help the car take action automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) answer the various questions asked by users. The tool tracks the user's location, local time, season, and other data related to the question asked. By analyzing all such data, it provides an answer.
7. IoT: Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can act before the machine faces serious issues or goes down completely. Thus, the cost of replacing the whole machine can be saved.
8. Education Sector: Organizations conducting online educational courses utilize big data to search for candidates interested in a course. If someone searches for a YouTube tutorial video on a subject, then online or offline course providers for that subject send that person ads about their courses.
9. Energy Sector: A smart electric meter reads the consumed power every 15 minutes and sends this data to a server, where it is analyzed to estimate the times of day when the power load is lowest across the city. With this system, manufacturing units or householders are advised to run their heavy machinery during night hours, when the power load is low, and so enjoy a lower electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon Prime and Spotify analyze data collected from their users. Data such as the type of videos or music users watch or listen to most, and how long users spend on the site, is collected and analyzed to set the next business strategy.

A Brief History of Hadoop


Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:

“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.”

Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It's expensive, too: Mike Cafarella and Doug Cutting estimated that a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed File System (NDFS).

In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop's easy-to-use parallel programming model. In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3.5 minutes), beating the previous year's winner of 297 seconds (described in detail in “TeraByte Sort on Apache Hadoop” on page 601). In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.

Apache Hadoop and the Hadoop Ecosystem

Although Hadoop is best known for MapReduce and its distributed file system (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:

Common
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

Avro
A serialization system for efficient, cross-language RPC and persistent data storage.

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.

HDFS
A distributed filesystem that runs on large clusters of commodity machines (a short Java usage sketch appears at the end of this section).

Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

Sqoop
A tool for efficiently moving data between relational databases and HDFS.

Hadoop Releases
Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. “Hadoop Releases” summarizes the high-level features in recent Hadoop release series.
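As a small illustration of the Java interface to HDFS referenced in the HDFS entry above, the sketch below prints a file stored in HDFS to standard output using the FileSystem API. The HDFS path /user/demo/A.txt is hypothetical, and the program assumes a Hadoop configuration (core-site.xml) is available on the classpath.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path used only for illustration.
        InputStream in = null;
        try {
            in = fs.open(new Path("/user/demo/A.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Run with the Hadoop client libraries on the classpath; the point here is only to show the shape of the FileSystem API.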

LINUX REFRESHER:

VMWARE:

The easiest way to run Hadoop on a Windows computer is to install VMware Player and then install a virtual Hadoop server.

The steps to install VMware Player on Windows are:

1. Download VMware Player for Windows (32-bit and 64-bit, for VMware Player v5 and up).

2. Run the installer file and then click the Next button on the welcome screen.
3. You will see the End User License Agreement. You need to accept to proceed.

4. You will be prompted for a folder to install VMware Player into – accept the default and click the Next button.

5. You can optionally enable VMware Player to check for updates when it starts up.
6. You can optionally enable sending usage statistics to VMware.

7. You can then choose whether to create shortcuts on the Desktop and/or the Windows Start Menu.

8. Click the Continue button to proceed with the installation.


9. The installation will take a few minutes.

10. Once the installation has completed, you can click the final Finish button to exit the installer. On my Windows 7 computer, I did not need to reboot the system for VMware Player.
