Unit 1
DIGITAL NOTES
SCHOOL OF ENGINEERING
Department of Cyber Security
MALLA REDDY UNIVERSITY
III Year B. Tech – II Semester
(MR22-1CS04) BIG DATA and HADOOP
COURSE OBJECTIVES
1. Introduce Big Data concepts.
2. Introduce distributed computing concepts with MapReduce.
3. Store and analyze data with Hadoop ecosystems.
4. Introduce NoSQL concept using HBase.
5. Perform Data Analytics using Hive.
UNIT – I
Introduction to Big Data: What is Big Data - Why Big Data is Important - Evolution of Big
Data - Failure of Traditional Databases in Handling Big Data - 3Vs of Big Data - Sources of
Big Data - Different Types of Data - Big Data Infrastructure - Big Data Life Cycle - Big Data
Applications - A Brief History of Hadoop - Apache Hadoop - Hadoop Ecosystem - Linux
Refresher - VMware Installation of Hadoop.
UNIT – II
Hadoop I/O :
Data Integrity - Data Integrity in HDFS – Local File System – Checksum File System -
Compression and Input Splits - Using Compression in Map Reduce - Serialization - The
Writable Interface - Writable Classes - Implementing a Custom Writable - Serialization
Frameworks - File-Based Data Structures –Sequence File - MapFile - Other File Formats and
Column-Oriented Formats.
HDFS : The design of HDFS-HDFS concepts -Command line interface to HDFS- Hadoop
File systems- Interfaces -Java Interface to Hadoop - Anatomy of a file read - Anatomy of a
file write - Replica placement and Coherency Model - Parallel copying with distcp - Keeping
an HDFS cluster balanced.
UNIT – III
Understanding Map Reduce Fundamentals: Introduction- Analyzing data with unix tools-
Analyzing data with hadoop- Java MapReduce classes (new API)- Data flow-combiner
functions-Running a distributed Map Reduce Job.
Classic Map reduce. Job submission- Job Initialization- Task Assignment- Task execution -
Progress and status updates- Job Completion- Shuffle and sort on Map and reducer side-
Configuration tuning- Map Reduce Types-Input formats-Output formats – Sorting - Map side
and Reduce side joins.
UNIT – IV
HBASE: Introduction -Architecture - storage of Big Data - Interacting with hadoop Eco-
system – Installation - Programming with HBase -Combining HBase and HDFS- Installation
Test Drive - Clients - Java - MapReduce - REST and Thrift - Building an Online Query
Application - Schema Design - Loading Data - Online Queries - HBase Versus RDBMS -
Successful Service - HBase.
YARN: Anatomy of a YARN Application Run - Resource Requests - Application Lifespan -
Building YARN Applications - YARN Compared to MapReduce 1 - Scheduling in YARN -
Scheduler Options - Capacity Scheduler Configuration - Fair Scheduler Configuration -
Delay Scheduling - Dominant Resource Fairness.
UNIT – V
HIVE:
The Hive Shell - Hive services - Hive clients - The metastore - Comparison with traditional
databases - HiveQL - Basics - Concepts - Implementation - Java and MapReduce clients -
Loading data - Web queries - Data Types - Operators and Functions - Tables - Managed
Tables and External Tables - Partitions and Buckets - Storage Formats - Importing Data -
Altering Tables - Dropping Tables - Querying Data - Sorting and Aggregating - MapReduce
Scripts - Joins – Subqueries.
TEXT BOOKS:
1. Student’s Handbook for Associate Analytics.
2. BIG DATA and ANALYTICS, Seema Acharya, Subhashini Chellappan, Wiley
publications.
3. BIG DATA, Black BookTM, DreamTech Press, 2015 Edition.
4. BUSINESS ANALYTICS, 5e, by Albright and Winston.
REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira (the
authors have kindly made an online version available): http://www.dataminingbook.info/uploads/book.pdf.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs),
and Jeffrey D. Ullman (Stanford Univ.).
COURSE OUTCOMES:
1. Outline how to store and manage data in HDFS and implement basic applications in
MapReduce.
2. Summarize how to store and analyze data using Pig scripts and handle partitioned and
bucketed tables in Hive.
3. Interpret how to import and export data from databases like MySQL or Oracle.
4. Illustrate working with various file formats in the Hadoop ecosystem.
5. Implement Spark scripts using RDDs and work with a column-oriented database using HBase.
UNIT – I
1. INTRODUCTION TO BIG DATA:
Today we live in the digital world. With increased digitization the amount of structured and
unstructured data being created and stored is exploding. The data is being generated from
various sources - transactions, social media, sensors, digital images, videos, audio, and
clickstreams - for domains including healthcare, retail, energy, and utilities. In addition to
businesses and organizations, individuals contribute to the data volume. For instance, about
30 billion pieces of content are shared on Facebook every month, and the photos viewed every
16 seconds on Picasa could cover a football field.
IDC estimates that this digital universe will explode to an unimaginable 8 zettabytes by the year
2015. The term “Big Data” was coined to address this massive volume of data storage and
processing. It is becoming increasingly imperative for organizations to be able to bring this data
into their applications.
What is Big Data?
According to Gartner, the definition of Big Data –
“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex
and large data sets that have to be processed and analyzed to uncover valuable information that
can benefit businesses and organizations. However, there are certain basic tenets of Big Data
that will make it even simpler to answer what is Big Data:
It refers to a massive amount of data that keeps on growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
It includes data mining, data storage, data analysis, data sharing, and data visualization.
The term is an all-comprehensive one including data, data frameworks, along with the
tools and techniques used to process and analyze the data.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go back
to the 1960s and '70s when the world of data was just getting started with the first data centers
and the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically
to store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and
cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still
generating huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence
of machine learning has produced still more data. While big data has come far, its usefulness is
only just beginning. Cloud computing has expanded big data possibilities even further. The
cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to
test a subset of data.
Benefits of Big Data and Data Analytics
Big data makes it possible for you to gain more complete answers because you have
more information.
More complete answers mean more confidence in the data—which means a completely
different approach to tackling problems.
Why is Big Data Important?
The importance of big data does not revolve around how much data a company has but how a
company utilizes the collected data. Every company uses data in its own way; the more
efficiently a company uses its data, the more potential it has to grow. The company can take
data from any source and analyze it to find answers which will enable:
1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring
cost advantages to business when large amounts of data are to be stored and these tools also
help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily
identify new sources of data, which helps businesses analyze data immediately and make
quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’ purchasing
behaviors, a company can find out the products that are sold the most and produce products
according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get
feedback about who is saying what about your company. If you want to monitor and improve
the online presence of your business, big data tools can help with all of this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no single business
that can claim success without first having to establish a solid customer base. However, even
with a customer base, a business cannot afford to disregard the high competition it faces. If a
business is slow to learn what customers are looking for, then it is very easy to begin offering
poor quality products. In the end, loss of clientele will result, and this creates an adverse
overall effect on business success. The use of big data allows businesses to observe various
customer related patterns and trends. Observing customer behavior is important to trigger
loyalty.
6. Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights
Big data analytics can help change all business operations. This includes the ability to match
customer expectations, change the company’s product line, and of course ensure that the
marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and redevelop
their products.
Evolution of Big Data
Looking back over the last few decades, we can see that Big Data technology has grown
tremendously. There are several milestones in the evolution of Big Data, described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology lets companies store their data in remote data centers,
saving infrastructure and maintenance costs.
5. Machine Learning:
Machine Learning algorithms work on large data sets, analyzing huge amounts of data to
extract meaningful insights. This has driven the development of artificial intelligence (AI)
applications.
6. Data Streaming:
Data streaming technology has emerged as a solution to process large volumes of data
in real time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm in which data processing is done at
the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing.
The introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data
streaming, and edge computing has revolutionized how we store, process, and analyze large
volumes of data. As technology evolves, we can expect Big Data to play a very important role
in various industries.
Different Types of Data
a) Structured
Structured data is data that conforms to a fixed schema and can be stored in rows and columns,
as in a relational database, which makes it straightforward to store, query, and analyze.
b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever.
This makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two important
types of big data.
c) Semi-structured
Semi-structured data is the third type of big data. It contains elements of both of the formats
mentioned above, that is, structured and unstructured data. To be precise, it refers to data that
has not been classified under a particular repository (database), yet contains vital information
or tags that segregate individual elements within the data. XML documents and JSON records,
where each value is labelled by a tag or key, are common examples. With this we come to the
end of the types of data.
Big Data Infrastructure
• Big data architecture is a comprehensive solution to deal with an enormous amount of
data.
• It details the blueprint for providing solutions and infrastructure for dealing with big
data based on a company’s demands.
Data Sources: Relational databases, data warehouses, cloud-based data warehouses,
SaaS applications, real-time data from company servers and sensors such as IoT devices,
third-party data providers, and static files such as Windows logs are typical data sources.
Data Storage: Data is stored in HDFS or in blob/object storage on Microsoft Azure, AWS,
or GCP, among other stores.
Batch Processing: Multiple approaches to batch processing are employed, including Hive
jobs, U-SQL jobs, Sqoop or Pig jobs, and custom MapReduce jobs written in Java, Scala,
or other languages such as Python (a minimal MapReduce sketch follows this list).
Real Time-Based Message Ingestion: Message-based ingestion stores such as Apache Kafka,
Apache Flume, or Azure Event Hubs must be used if message-based processing is required.
The delivery process, along with other message-queuing semantics, is generally more reliable.
Stream Processing: Stream processing handles the incoming streaming data in the form of
windows or streams and writes it to the sink. Engines include Apache Spark, Flink, Storm, etc.
Analytics-Based Datastore: To analyze the already processed data, analytical tools use a
datastore based on HBase or another NoSQL data warehouse technology; query engines
such as Spark SQL can also be used over this data.
Reporting and Analysis: The generated insights must then be presented, which is
accomplished by reporting and analysis tools that use embedded technology to produce
useful graphs, analyses, and insights that are beneficial to the business. Examples include
Cognos, Hyperion, and others.
Orchestration: Big data solutions consist of repetitive data-processing tasks, organized into
workflow chains that transform the source data, move data between sources and sinks, and
load the results into stores. Sqoop, Oozie, Data Factory, and others are just a few examples.
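To make the Batch Processing layer above concrete, the following is a minimal word-count sketch written against the Hadoop Java MapReduce API. It is an illustrative example only: the class names, the whitespace tokenization, and the assumption that input records are plain text lines are choices made for this sketch, not requirements of the architecture.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each input line, emit (word, 1) for every whitespace-separated token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum the counts emitted for each word and write the total.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

A driver that configures and submits a job using these two classes is sketched later, under the MapReduce entry of the Hadoop ecosystem list.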
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting,
explains how the name came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating
such. Googol is a kid’s term.”
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their
function, often with an elephant or other animal theme (“Pig,” for example). Smaller components
are given more descriptive (and therefore more mundane) names. This is a good principle, as it
means you can generally work out what something does from its name. For example, the
jobtracker keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the software
required to crawl and index websites complex to write, but it is also a challenge to run without a
dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike
Cafarella and Doug Cutting estimated that a system supporting a 1-billion-page index would cost
around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless,
they believed it was a worthy goal, as it would open up and ultimately democratize search engine
algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However,
they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was
at hand with the publication of a paper in 2003 that described the architecture of Google’s
distributed filesystem, called GFS, which was being used in production at Google. GFS, or
something like it, would solve their storage needs for the very large files generated as a part of
the web crawl and indexing process. In particular, GFS would free up time being spent on
administrative tasks such as managing storage nodes. In 2004, they set about writing an open
source implementation, the Nutch Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the
Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that
year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS
and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in
February 2006 they moved out of Nutch to form an independent subproject of Lucene called
Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team
and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in
February 2008 when Yahoo! announced that its production search index was being generated by
a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its success
and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-
publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four
terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines, and the project probably wouldn’t
have been embarked on without the combination of Amazon’s pay-by-the-hour model (which
allowed the NYT to access a large number of machines for a short period) and Hadoop’s
easy-to-use parallel programming model. In April 2008, Hadoop broke a world record to become
the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one
terabyte in 209 seconds (just under 3.5 minutes), beating the previous year’s winner of 297
seconds. In November of the same year, Google reported that its MapReduce implementation
sorted one terabyte in 68 seconds. In May 2009, it was announced that a team at Yahoo! used
Hadoop to sort one terabyte in 62 seconds.
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed
from NDFS), the term is also used for a family of related projects that fall under the umbrella of
infrastructure for distributed computing and large-scale data processing. All of the core projects
covered here are hosted by the Apache Software Foundation, which provides support for a
community of open source software projects, including the original HTTP Server from which it
gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily
hosted at Apache, which provide complementary services to Hadoop or build on the core to add
higher-level abstractions.
The main Hadoop projects are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
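As a brief sketch of how such a job is configured and submitted, the driver below reuses the WordCountMapper and WordCountReducer classes sketched earlier in the infrastructure section; the input and output paths are passed as command-line arguments and are purely hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // mapper sketched earlier
        job.setCombinerClass(WordCountReducer.class);   // combiner reuses the reducer logic
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}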
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
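To give a feel for the Java interface to HDFS (treated in detail in Unit II), here is a minimal sketch that opens a file stored in HDFS and copies its contents to standard output; the path /user/demo/sample.txt is a hypothetical example.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default filesystem
        InputStream in = null;
        try {
            in = fs.open(new Path("/user/demo/sample.txt"));  // hypothetical HDFS path
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}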
Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (which is translated by the runtime engine to MapReduce jobs)
for querying the data.
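One common way to issue such SQL-like queries from Java is through the HiveServer2 JDBC driver, as in the sketch below; the host, port, credentials, and the "sales" table are assumptions made only for illustration (the Hive JDBC driver must be on the classpath).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are assumed values
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             // "sales" is a hypothetical table used only for this example
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}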
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and
supports both batch-style computations using MapReduce and point queries (random reads).
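A minimal sketch of a point write and a random read using the HBase Java client API (Unit IV covers this in depth); the table name, column family, and cell values below are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Write one cell: row "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Point query (random read) of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}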
ZooKeeper
A distributed, highly available coordination service, providing primitives such as distributed
locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Hadoop Releases
Which version of Hadoop should you use? The answer to this question changes over time, of
course, and also depends on the features that you need. “Hadoop Releases” summarizes the
high-level features in recent Hadoop release series.
LINUX REFRESHER:
VMWARE:
The easiest way to run Hadoop on a Windows computer is to install VMware Player and then
install a virtual Hadoop server.
1. Download VMware Player for Windows (32-bit and 64-bit, for VMware Player v5 and up).
2. Run the installer file and then click the Next button on the welcome screen.
3. You will see the End User License Agreement. You need to accept to proceed.
4. You will be prompted for a folder to install VMware Player into - accept the default and click the
Next button.
5. You can optionally enable VMware Player to check for updates when it starts up.
6. You can optionally enable sending usage statistics to VMware.
7. You can then choose whether to create shortcuts on the Desktop and/or the Windows Start Menu.
10. Once the installation has completed, you can click on the final Finish button to exit the installer.
On my Windows 7 computer, I did not need to reboot the system after installing VMware Player.