BDA Unit-1
Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
[OR]
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is still data, but of very large size.
[OR]
The definition of big data is data that contains greater variety, arriving in increasing volumes and with
more velocity.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in terms of photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with this kind of data (where the format is well known in advance) and deriving value out of it. However, we now foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available with them, but unfortunately they do not know how to derive value out of it, since this data is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Characteristics of Big Data
(i) Volume
(ii) Variety
(iii) Velocity
(iv) Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value of the data. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – Variability refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.
Every company uses its collected data in its own way; the more effectively a company uses its data, the more rapidly it grows. Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they
have to store large amounts of data. These tools help organizations in identifying more effective ways
of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools like Hadoop help them to analyze data immediately, thus enabling quick decisions based on the learnings.
For example, analysis of customer purchasing behavior helps companies to identify the products sold most and thus produce those products accordingly. This helps companies to get ahead of their competitors.
Companies can also use Big Data tools to improve their online presence. If a company does not know what its customers want, its success will suffer, resulting in a loss of clientele and an adverse effect on business growth. Big Data analytics helps businesses to identify customer-related trends and patterns, and customer behavior analysis leads to a profitable business.
Meet Hadoop:
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google
for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce
program runs on Hadoop which is an Apache open-source framework.
MapReduce contains two important tasks. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
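To make these two steps concrete, here is a minimal word-count sketch in Java using the org.apache.hadoop.mapreduce API (the class names are illustrative, not part of Hadoop): the map job emits a (word, 1) tuple for every word it sees, and the reduce job combines all tuples for the same word into a single (word, total) tuple.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map job: break each input line into (word, 1) tuples.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);                  // one tuple per word occurrence
    }
  }
}

// Reduce job: combine all tuples for a word into a single (word, total) tuple.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}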
HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Data Storage and Analysis - Hadoop
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
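As a quick back-of-the-envelope check of these figures, here is a small, self-contained Java sketch that uses only the numbers quoted above:

public class DriveReadTimes {
  public static void main(String[] args) {
    // 1990 drive: 1,370 MB at 4.4 MB/s
    double minutes1990 = 1370.0 / 4.4 / 60.0;                       // about 5.2 minutes
    // Modern drive: 1 TB (roughly 1,000,000 MB) at 100 MB/s
    double hoursNow = 1_000_000.0 / 100.0 / 3600.0;                 // about 2.8 hours
    // 100 drives read in parallel, each holding one hundredth of the terabyte
    double minutesParallel = (1_000_000.0 / 100.0) / 100.0 / 60.0;  // about 1.7 minutes
    System.out.printf("1990 drive: %.1f minutes%n", minutes1990);
    System.out.printf("1 TB drive: %.1f hours%n", hoursNow);
    System.out.printf("100 drives in parallel: %.1f minutes%n", minutesParallel);
  }
}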
Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times and that, statistically, their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much.
There's more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop's file system, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs. Like HDFS, MapReduce has built-in reliability.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Hadoop vs RDBMS
An RDBMS is more efficient for point queries, where data is indexed to improve disk latency, whereas Hadoop's MapReduce is more efficient for queries that involve the complete dataset. Moreover, MapReduce suits applications in which data is written once and read many times, whereas in an RDBMS the dataset is continuously updated.
   Property                  MapReduce               RDBMS
1. Size of data              Petabytes               Gigabytes
2. Integrity of data         Low                     High
3. Data schema               Dynamic                 Static
4. Access method             Batch                   Interactive and batch
5. Scaling                   Linear                  Nonlinear
6. Data structure            Unstructured            Structured
7. Normalization of data     Not required            Required
Hadoop vs Grid Computing
Moreover, MapReduce saves programmers from writing code for node failures and for handling data flow, as these are handled implicitly by the MapReduce framework, whereas grid computing gives the programmer explicit control over data flow and node failures.
Thus we can say that Hadoop is not a replacement for an RDBMS; the two systems can coexist.
Hadoop vs Cloud Computing
In simplest terms, cloud computing means storing and accessing your data, programs, and files over the internet rather than your PC's hard drive. Basically, the cloud is another name for the internet.
Cloud computing has several attractive benefits for end users and businesses. The primary benefits of cloud computing include:
Elasticity – with cloud computing, businesses only use the resources they require. Organizations can
increase their usage as computing needs increase and reduce their usage as the computing needs
decrease. This eliminates the need for investing heavily in IT infrastructures which may or may not be
used.
Self-service provisioning – users can always use the resources for almost any type of workload on
demand. This eliminates the need for IT admins to provide and manage computer resources.
Pay-per-use – compute resources are metered depending on the usage level. This means that users are only charged for the cloud resources they use.
Differences between Hadoop and Cloud Computing:
1. Hadoop is a framework which uses simple programming models to process large data sets across
clusters of computers. It is designed to scale up from single servers to thousands of machines, which
offer local computation and storage individually. Cloud computing, on the other hand, constitutes
various computing concepts, which involve a large number of computers which are usually connected
through a real-time communication network.
2. Cloud computing focuses on on-demand, scalable, and adaptable service models. Hadoop, on the other hand, is all about extracting value out of volume, variety, and velocity.
3. In cloud computing, Cloud MapReduce is an alternative implementation of MapReduce. The main difference between Cloud MapReduce and Hadoop is that Cloud MapReduce does not run on its own cluster infrastructure; rather, it relies on the infrastructure offered by different cloud service providers.
4. Hadoop is an ‘ecosystem’ of open source software projects which allow cheap computing which is
well distributed on industry-standard hardware. On the other hand, cloud computing is a model where
processing and storage resources can be accessed from any location via the internet.
Hadoop vs Volunteer Computing
When people first hear about Hadoop and MapReduce, they often ask, “How is it different from
SETI@home?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home
in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope
data for signs of intelligent life outside earth. SETI@home is the most well-known of many volunteer
computing projects; others include the Great Internet Mersenne Prime Search (to search for large
prime numbers) and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problem they are trying to solve into chunks
called work units, which are sent to computers around the world to be analyzed. For example, a
SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze
on a typical home computer. When the analysis is completed, the results are sent back to the server,
and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to
three different machines and needs at least two results to agree to be accepted.
Although SETI@home may be superficially similar to MapReduce (breaking a problem into
independent pieces to be worked on in parallel), there are some significant differences. The
SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of
thousands of computers across the world, since the time to transfer the work unit is dwarfed by the
time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running
in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home
runs a perpetual computation on untrusted machines on the Internet with highly variable connection
speeds and no data locality.
A Brief History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a major constraint for the project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data
processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). Along with this file system, MapReduce was also implemented in Nutch.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in the same year.
o Doug Cutting named the project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, using a 900-node cluster, in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Hadoop Ecosystem:
Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed efficiently with the help of traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing the cluster
Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too that are
part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e., data. That's the beauty of Hadoop: it revolves around data, which makes processing it easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node; it contains the metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
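As a small illustration of how a client program interacts with HDFS, here is a Java sketch using the org.apache.hadoop.fs API; the fs.defaultFS address and the file path are assumptions for a single-node setup, not part of the notes above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed single-node address

    FileSystem fs = FileSystem.get(conf);                // the client asks the name node for metadata
    Path path = new Path("/user/demo/hello.txt");        // hypothetical file path

    // Write a file: the bytes themselves are streamed to the data nodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeBytes("Hello, HDFS\n");
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}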
YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
The resource manager has the privilege of allocating resources for the applications in the system, whereas node managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later report back to the resource manager. The application manager works as an interface between the resource manager and the node manager and performs negotiations as per the requirement of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over the cluster and helps to write applications which transform big data sets into manageable ones.
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of the data and thereby organizes it in the form of groups. Map generates a key-value pair based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the
way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time (interactive) and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
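As an illustration of the JDBC route mentioned above, the following Java sketch connects to a HiveServer2 instance and runs a query; the URL, credentials, and the weather table are assumptions, not part of these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the URL below assumes a local single-node setup.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement();
         // Hypothetical table; HQL looks very much like SQL.
         ResultSet rs = stmt.executeQuery("SELECT year, MAX(temp) FROM weather GROUP BY year")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
      }
    }
  }
}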
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking algorithms
as per our need with the help of its own libraries.
Apache Spark:
It's a platform that handles all the process-intensive tasks like batch processing, interactive or iterative real-time processing, graph processing, visualization, etc.
It uses in-memory resources, which makes it faster than MapReduce in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing; hence both are used together in most companies, each for the workloads it handles best.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling almost anything within a Hadoop ecosystem. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing and quickly looking up such data.
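A small Java sketch of that kind of quick, key-based lookup with the HBase client API; the table name, row key, and column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table
      Get get = new Get(Bytes.toBytes("user#1001"));                         // row-key lookup
      Result result = table.get(get);
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(name == null ? "not found" : Bytes.toString(name));
    }
  }
}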
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is a Java library that also provides features such as spell checking, and Solr is a search platform built on top of Lucene.
Zookeeper: There was a huge issue of management, coordination, and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcomes these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequential order, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Hadoop Installation
In this section of the Hadoop tutorial, we will be talking about the Hadoop installation process.
Hadoop is supported on the Linux platform and its different flavors. If you are working on Windows, you can use Cloudera VMware that has preinstalled Hadoop, or you can use Oracle VirtualBox or the VMware Workstation. In this tutorial, I will be demonstrating the installation process for Hadoop using VMware Workstation 12. You can use any of the above to perform the installation. I will do this by installing CentOS on my VMware.
Prerequisites
Step 3: Setting up CentOS in VMware 12
Click on Create a New Virtual Machine
1. As seen in the screenshot above, browse the location of your CentOS file you downloaded. Note
that it should be a disc image file
2. Click on Next
1. Choose the name of your machine. Here, I have given the name CentOS 64-bit
2. Then, click Next
After this, you should be able to see a window as shown below. This screen indicates that the system is booting and getting ready for installation. You will be given 60 seconds to change the option from Install CentOS to something else; if you want Install CentOS to remain selected, just wait out the 60 seconds.
Note: In the image above, you can see three options, such as, I Finished Installing, Change Disc,
and Help. You don’t need to touch any of these until your CentOS is successfully installed.
o At the moment, your system is being checked and is getting ready for installation
Once the checking percentage reaches 100%, you will be taken to a screen as shown below:
Step 4: Here, you can choose your language. The default language is English, and that is what I have
selected
1. If you want any other language to be selected, specify it
2. Click on Continue
Step 5: Setting up the Installation Processes
o From Step 4, you will be directed to a window with various options as shown below:
o First, to select the software type, click on the SOFTWARE SELECTION option
Now, you will see the following window:
1. Select the Server with GUI option to give your server a graphical appeal
2. Click on Done
After clicking on Done, you will be taken to the main menu where you had previously
selected SOFTWARE SELECTION
Next, you need to click on INSTALLATION DESTINATION
On clicking this, you will see the following window:
1. Under Other Storage Options, select I would like to make additional space available
2. Then, select the radio button that says I will configure partitioning
3. Then, click on Done
o Next, you’ll be taken to another window as shown below:
1. Select the partition scheme here as Standard Partition
2. Now, you need to add three mount points here. For doing that, click on '+'
a) Select the Mount Point /boot as shown above
b) Next, select the Desired Capacity as 500 MiB as shown below:
c) Click on Add mount point
d) Again, click on ‘+’ to add another Mount Point
e) This time, select the Mount Point as swap and Desired Capacity as 2 GiB
f) Click on Add Mount Point
g) Now, to add the last Mount Point, click on + again
h) Add another Mount Point ‘/’ and click on Add Mount Point
i) Click on Done, and you will see the following window:
Note: This is just to make you aware of all the changes you had made in the partition of your drive
Now, click on Accept Changes if you’re sure about the partitions you have made
Next, select NETWORK & HOST NAME
You’ll be taken to a window as shown below:
Step 6: Configuration
o Once you complete step 5, you will see the following window where the final installation
process will be completed.
o But before that, you need to set the ROOT PASSWORD and create a user
o Click on ROOT PASSWORD, which will direct you to the following window:
1. Enter your root password here
2. Confirm the password
3. Click on Done
Now, click on USER CREATION, and you will be directed to the following window:
o In the next screen, you will see the installation process in progress
o When the installation is done (it takes up to 20–30 minutes), you will see the Reboot button, as shown below
Wait until a window pops up asking you to accept your license information
Step 7: Setting up the License Information
Step 8: Logging into CentOS
Note: All commands need to be run on the Terminal. You can open the Terminal by right-clicking on
the desktop and selecting Open Terminal
Step 9: Downloading and Installing Java 8
Download the Java 8 package and save the file in your home directory
Extract the Java tar file using the following command:
Step 10: Downloading and Extracting Hadoop
o Download a stable Hadoop release packed as a zipped file from the Apache Hadoop releases page and unpack it somewhere on your file system
o Extract the Hadoop file using the following command on the terminal:
Step 11: Moving Hadoop to a Location
o Use the following code to move your file to a particular location, here Hadoop:
mv hadoop-2.7.3 /home/intellipaaat/hadoop
Note: The location of the file you want to change may differ. For demonstration purposes, I
have used this location, and this will be the same throughout this tutorial. You can change it
according to your choice.
Step 12: Setting up the Environment Variables
o When you get logged in to your root user, enter the command: vi ~/.bashrc
o The above command takes you to the vi editor, and you should be able to see the following screen:
To edit the file, press Insert on your keyboard, and then start writing the following set of lines for setting the paths for Java and Hadoop:
#HADOOP VARIABLES START
export JAVA_HOME=(path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export HADOOP_INSTALL=$HADOOP_HOME   # define HADOOP_INSTALL, which the lines below refer to
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
After writing these lines, press Esc on your keyboard and type the command :wq!
This will save the file and exit you from the vi editor. The path has now been set, as can be seen in the image below:
Step 13: Adding Configuration Files
vi /home/intellipaaat/hadoop/etc/hadoop/hadoop-env.sh
o Replace this path with the Java path to tell Hadoop which path to use. You will see the
following window coming up:
o Change the JAVA_HOME variable to the path you had copied in the previous step
Step 14:
Now, several XML files need to be edited, and you need to set the property and the path for
them.
o Editing core-site.xml
Use the same command as in the previous step and just change the last part to core-site.xml as given below:
vi /home/intellipaaat/hadoop/etc/hadoop/core-site.xml
Next, you will see the following window:
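The window contents are not reproduced here. A minimal single-node core-site.xml, assuming HDFS will run on localhost at port 9000, typically places the following between the configuration tags (adjust the value to your own setup):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>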
Now, exit from this window by pressing Esc and entering the command :wq!
o Editing yarn-site.xml
vi /home/intellipaaat/hadoop/etc/hadoop/yarn-site.xml
You will see the following window:
Enter the code in between the configuration tags as shown below:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Exit from this window by pressing Esc and then writing the command :wq!
Editing mapred-site.xml
cp /home/intellipaaat/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template /home/intellipaaat/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Once the contents have been copied to a new file named mapred-site.xml, you can verify
it by going to the following path:
Home > intellipaaat > hadoop > hadoop-2.7.3 > etc > hadoop
vi /home/intellipaaat/hadoop/etc/hadoop/mapred-site.xml
In the new window, enter the following code in between the configuration tags as below:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Editing hdfs-site.xml
Before editing the hdfs-site.xml, two directories have to be created, which will contain the
namenode and the datanode.
mkdir -p /home/intellipaaat/hadoop_store/hdfs/namenode
mkdir -p /home/intellipaaat/hadoop_store/hdfs/datanode
vi /home/intellipaaat/hadoop/etc/hadoop/hdfs-site.xml
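The window contents are not reproduced here either. A minimal single-node hdfs-site.xml, assuming a replication factor of 1 and the two directories created above, would place the following between the configuration tags:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/intellipaaat/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/intellipaaat/hadoop_store/hdfs/datanode</value>
</property>
</configuration>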
That's all! All your configurations are done, and the Hadoop installation is now complete.
Step 15: Checking Hadoop
You will now need to check whether the Hadoop installation is successfully done on your system or
not.
Go to the location where you had extracted the Hadoop tar file, right-click on the bin, and open it
in the terminal
Now, write the command, ls
Next, if you see a window as below, then it means that Hadoop is successfully installed!
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to
run it on a cluster of machines.
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the
map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of
the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a
data preparation phase, setting up the data in such a way that the reducer function can
do its work on it: finding the maximum temperature for each year. The map function is
also a good place to drop bad records: here we filter out temperatures that are missing,
suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
Figure 2-1. MapReduce logical data flow
JAVA MAPREDUCE
Having run through how the MapReduce program works, the next step is to express
it in code. We need three things: a map function, a reduce function, and some code
to run the job. The map function is represented by the Mapper class, which declares
an abstract map() method.
The Mapper class is a generic type, with four formal type parameters that specify the
input key, input value, output key, and output value types of the map function.
For the present example, the input key is a long integer offset, the input value is
a line of text, the output key is a year, and the output value is an air temperature
(an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value
containing the line of input into a Java String, then use its substring() method to
extract the columns we are interested in.
The map() method also provides an instance of Context to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
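A sketch of the mapper just described (the class name and the exact column offsets are illustrative; they follow the NCDC record layout shown in the sample lines above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel used for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                 // year columns
    int airTemperature;
    if (line.charAt(87) == '+') {                         // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {   // drop missing/suspect readings
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}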
Again, four formal type parameters are used to specify the input and output
types, this time for the reduce function. The input types of the reduce function
must match the output types of the map function: Text and IntWritable. And in
this case, the output types of the reduce function are Text and IntWritable, for a
year and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
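A sketch of the reducer just described (the class name mirrors the mapper sketch above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());   // keep the highest reading seen so far
    }
    context.write(key, new IntWritable(maxValue));  // emit (year, maximum temperature)
  }
}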
A Job object forms the specification of the job. It gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code
into a JAR file (which Hadoop will distribute around the cluster). Rather than
explicitly specify the name of the JAR file, we can pass a class in the Job’s
setJarByClass() method, which Hadoop will use to locate the relevant JAR file
by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat,
and it can be a single file, a directory (in which case, the input forms all the files in
that directory), or a file pattern. As the name suggests, addInputPath() can be called
more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutput
Path() method on FileOutputFormat. It specifies a directory where the output
files from the reducer functions are written. The directory shouldn’t exist before
running the job, as Hadoop will complain and not run the job. This precaution is
to prevent data loss (it can be very annoying to accidentally overwrite the output
of a long job with another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The input types are controlled via the input format, which we have not explicitly
set since we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to
run the job. The waitForCompletion() method on Job submits the job and waits for it
to finish. The method’s boolean argument is a verbose flag, so in this case the job
writes information about its progress to the console.
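A sketch of the driver described above (the class names match the mapper and reducer sketches; Job.getInstance() assumes a Hadoop 2.x client, and the input and output paths are taken from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();                       // the job specification
    job.setJarByClass(MaxTemperature.class);           // Hadoop finds the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));   // file, directory, or file pattern
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // directory that must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);                 // output types of the reduce function
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);  // true = print progress to the console
  }
}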
After writing a MapReduce job, it’s normal to try it out on a small dataset to flush out
any immediate problems with the code. First install Hadoop in standalone mode—
there are instructions for how to do this in Appendix A. This is the mode in which
Hadoop runs using the local filesystem with a local job runner. Then install and
compile the examples using the instructions on the book’s website.
When the hadoop command is invoked with a classname as the first argument, it
launches a JVM to run the class. It is more convenient to use hadoop than straight
java since the former adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too. To add the application classes to the
classpath, we’ve defined an environment variable called HADOOP_CLASSPATH,
which the hadoop script picks up.
The last section of the output, titled “Counters,” shows the statistics that Hadoop
generates for each job it runs. These are very useful for checking whether the
amount of data processed is what you expected. For example, we can follow the
number of records that went through the system: five map inputs produced five
map outputs, then five reduce inputs in two groups produced two reduce outputs.
The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949   111
1950   22
This result is the same as when we went through it by hand earlier. We interpret
this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in
1950 it was 2.2°C.
THE OLD AND THE NEW JAVA MAPREDUCE APIS
The Java MapReduce API used in the previous section was first released in Hadoop 0.20.0. This new API, sometimes referred to as “Context Objects,” was designed to make the API easier to evolve in the future. It is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it.
The new API is not complete in the 1.x (formerly 0.20) release series, so the old API is recommended for these releases, despite having been marked as deprecated in the early 0.20 releases.
Previous editions of this book were based on 0.20 releases, and used the old API
throughout (although the new API was covered, the code invariably used the old
API). In this edition the new API is used as the primary API, except where mentioned. However, should you wish to use the old API, you can, since the code for all the examples in this book is available for the old API on the book's website.
• The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package (and
subpackages). The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
• In both APIs, key-value record pairs are pushed to the mapper and reducer, but
in addition, the new API allows both mappers and reducers to control the
execution flow by overriding the run() method. For example, records can be
processed in batches, or the execution can be terminated before all the
records have been processed. In the old API this is possible for mappers by
writing a MapRunnable, but no equivalent exists for reducers.
• Configuration has been unified. The old API has a special JobConf object for
job configuration, which is an extension of Hadoop’s vanilla Configuration
object (used for configuring daemons). In the new API, this distinction is
dropped, so job configuration is done through a Configuration.
• Job control is performed through the Job class in the new API, rather than the
old
JobClient, which no longer exists in the new API.
• Output files are named slightly differently: in the old API both map and
reduce outputs are named part-nnnnn, while in the new API map outputs are
named part- m-nnnnn, and reduce outputs are named part-r-nnnnn (where
nnnnn is an integer designating the part number, starting from zero).
• User-overridable methods in the new API are declared to throw java.lang.InterruptedException. What this means is that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.
• In the new API the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
Scale up
Resources such as CPU, network, and storage are common targets for scaling up. The goal is to
increase the resources supporting your application to reach or maintain adequate performance. In
a hardware-centric world, this might mean adding a larger hard drive to a computer for increased
storage capacity. It might mean replacing the entire computer with a machine that has more
CPU and a more performant network interface. If you are managing a non-cloud system, this
scaling up process can take anywhere from weeks up to months as you request, purchase, install,
and finally deploy the new resources.
In a cloud system, the process should take seconds or minutes. A cloud system might still target hardware, and that will be at the tens-of-minutes end of the time-to-scale range. But virtualized systems dominate cloud computing, and some scaling actions, like increasing storage volume capacity or spinning up a new container to scale up a microservice, can take seconds to deploy.
What is being scaled will not be that different. One may still shift applications to a larger VM or
it may be as simple as allocating more capacity on an attached storage volume.
Regardless of whether you are dealing with virtual or hardware resources, the take-home point is
that you are moving from one smaller resource and scaling up to one larger, more performant
resource.
Scale out
Scaling up makes sense when you have an application that needs to sit on a single machine. If
you have an application that has a loosely coupled architecture, it becomes possible to easily
scale out by replicating resources.
Scaling out a microservices application can be as simple as spinning up a new container running
a webserver app and adding it to the load balancer pool. When scaling out the idea is that it is
possible to add identical services to a system to increase performance. Systems that support this
model also tolerate the removal of resources when the load decreases. This allows greater
fluidity in scaling resource size in response to changing conditions.
The incremental nature of the scale out model is of great benefit when considering cost
management. Because components are identical, cost increments should be relatively
predictable. Scaling out also provides greater responsiveness to changes in demand. Typically
services can be rapidly added or removed to best meet resource needs. This flexibility and speed
effectively reduces spending by only using (and paying for) the resources needed at the time.