Big Data 3

1. What is Big Data? List the best practices of Big Data Analytics.

Ans:

Big Data is a high-volume, high-velocity, or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

BEST PRACTICES FOR BIG DATA

Now, with an understanding of what big data is and what it offers, organizations must know how analytics should be practiced to make the most of their data. The list below shows five of the best practices for big data:

1. UNDERSTAND THE BUSINESS REQUIREMENTS

Analyzing and understanding the business requirements and organizational goals is the first and foremost step, to be carried out even before bringing big data analytics into your projects. Business users must understand which projects in their company should use big data analytics to gain the maximum benefit.

2. DETERMINE THE COLLECTED DIGITAL ASSETS

The second best practice is to identify the type of data pouring into the organization, as well as the data generated in-house. Usually, the data collected is disorganized and in varying formats. Moreover, some data is never even exploited (so-called dark data), and it is essential that organizations identify this data too.

3. IDENTIFY WHAT IS MISSING

The third practice is analyzing and understanding what is missing. Once you have collected the data needed for a project, identify the additional information that might be required for that particular project and where it can come from. For instance, if you want to leverage big data analytics in your organization to understand your employees' well-being, then along with information such as login/logout times, medical reports and email reports, you need some additional information about the employees' stress levels, say. This information can be provided by co-workers or leaders.

4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED

After analyzing and collecting data from different sources, it is time for the organization to understand which big data technologies, such as predictive analytics, stream analytics, data preparation, fraud detection, sentiment analysis and so on, can best serve the current business requirements. For instance, big data analytics helps a company's HR team identify the right talent faster during recruitment by combining data from social media and job portals using predictive and sentiment analysis.

5. ANALYZE DATA CONTINUOUSLY

This is the final best practice that an organization must follow when it comes to big data. You must always be aware of what data your organization holds and what is being done with it. Check the health of your data periodically so that you never miss important but hidden signals in the data. Before implementing any new technology in your organization, it is vital to have a strategy to help you get the most out of it. With adequate and accurate data at their disposal, companies must also follow the above-mentioned big data practices to extract value from this data.


2. Write down the characteristics of Big Data Applications.
Ans:
Same as the answer to question 8.

3. Write down the four computing resources of Big Data storage.


Ans:
a) Processing Capability:

-Data in its raw form is not useful to any organization. Data processing is the method of collecting raw data and translating it into usable information.
-The raw data is subjected to various data processing methods, often using machine learning and artificial intelligence algorithms, to generate the desired output.
-This step may vary slightly from process to process depending on the source of the data being processed (data lakes, online databases, connected devices, etc.) and the intended use of the output.
-Processing is usually performed step by step by a team of data scientists and data engineers in an organization.
-The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format.
-Data processing is essential for organizations to create better business strategies and increase their competitive edge.
-By converting the data into readable formats such as graphs, charts, and documents, employees throughout the organization can understand and use the data.

b) Memory:
-Memory (RAM) holds the data and intermediate results while they are being processed; distributed frameworks spread this memory requirement across the nodes of a cluster.

c) Storage:
-One of the steps of the data processing cycle is storage, where data and metadata are stored for further use.
-This allows for quick access and retrieval of information whenever needed, and also allows the data to be used directly as input in the next data processing cycle.

d) Network:
-The network connects the nodes of the cluster and carries data between storage and processing; adequate bandwidth is needed to move large volumes of data between machines.
4. What is HDFS?
Ans:
-HDFS stands for Hadoop Distributed File System.
-Apache Hadoop is a collection of open-source software utilities that facilitate using a
network of many computers to solve problems involving massive amounts of data and
computation.
-It provides a software framework for distributed storage and processing of big data
using the MapReduce programming model.

- Hadoop comes with a distributed file system called HDFS.

- In HDFS, data is distributed over several machines and replicated to ensure durability against failures and high availability to parallel applications.

- It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and a name node.

- HDFS (Hadoop Distributed File System) is the primary storage system used by
Hadoop applications.

-This open source framework works by rapidly transferring data between nodes.
-It's often used by companies that need to handle and store big data.

- HDFS is a key component of many Hadoop systems, as it provides a means for managing big data, as well as supporting big data analytics (a short usage sketch follows).
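
The Python snippet below (not part of the original answer) drives the standard "hdfs dfs" command-line client to create a directory, upload a local file, and list it back; it assumes a configured Hadoop client on the PATH, and the paths and file name are placeholders.

# Sketch: basic HDFS operations via the standard "hdfs dfs" CLI.
# Assumes a configured Hadoop client on PATH; all paths are placeholders.
import subprocess

def hdfs(*args):
    # Run one "hdfs dfs" subcommand and return its output.
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    hdfs("-mkdir", "-p", "/user/demo/input")                   # create a directory
    hdfs("-put", "-f", "localfile.txt", "/user/demo/input/")   # upload a local file
    print(hdfs("-ls", "/user/demo/input"))                     # list the stored file

Behind these commands, HDFS splits the file into blocks and replicates each block across data nodes, as described above.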

5. What is Map Reduce?


Ans:
-MapReduce is a processing technique and a programming model for distributed computing, based on Java.
-The MapReduce algorithm contains two important tasks, namely Map and Reduce.
-Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- MapReduce is a software framework and programming model used for processing huge amounts of data.
- A MapReduce program works in two phases, namely Map and Reduce.
- Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
- Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
- MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
- The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function (a minimal word-count sketch follows below).
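
Because Hadoop Streaming lets MapReduce jobs be written in Python, the word-count job can be sketched as follows; this is an illustration only, and the script name, the "map"/"reduce" mode argument, and the streaming-jar submission are assumptions rather than part of the original answer.

#!/usr/bin/env python3
# Minimal Hadoop Streaming word-count sketch: the same script acts as the
# mapper when run with the argument "map" and as the reducer with "reduce".
# Hadoop Streaming pipes input lines on stdin and sorts map output by key.
import sys

def mapper():
    # Map: emit one <word, 1> pair per word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce: keys arrive sorted, so all counts for one word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if "map" in sys.argv[1:] else reducer()

Such a script would typically be submitted with the Hadoop Streaming jar, passing it as both the mapper and the reducer along with the input and output paths.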
6. What is YARN?
Ans:
-YARN stands for Yet Another Resource Negotiator.
-It manages the cluster's compute resources.
-The platform is responsible for providing computational resources, such as CPU, memory and network I/O, which are needed when applications execute.
-The YARN architecture basically separates the resource management layer from the processing layer.
-With YARN, the responsibilities of the Hadoop 1.0 JobTracker are split between the ResourceManager and a per-application ApplicationMaster.
-YARN also allows different data processing engines such as graph processing, interactive processing and stream processing, as well as batch processing, to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient.
-Through its various components, it can dynamically allocate resources and schedule application processing.
-For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them (a small CLI sketch follows below).
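
The standard YARN command-line client can show what the ResourceManager currently manages; the snippet below (not from the original answer) simply shells out to two of its subcommands and assumes a running, configured cluster.

# Sketch: querying YARN through its standard CLI from Python.
# Assumes the "yarn" client is installed and configured for a running cluster.
import subprocess

def yarn(*args):
    # Run one "yarn" subcommand and return its output.
    return subprocess.run(["yarn", *args], capture_output=True,
                          text=True, check=True).stdout

print(yarn("node", "-list"))          # NodeManagers currently registered
print(yarn("application", "-list"))   # applications known to the ResourceManager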

7. What is Map Reduce Programming Model?


Ans:

-MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
-The model is a specialization of the split-apply-combine strategy for data analysis.
-MapReduce is a software framework and programming model used for processing huge amounts of data.
-A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
-Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
-MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
-The input to each phase is key-value pairs.
-In addition, every programmer needs to specify two functions: a map function and a reduce function.

-The data goes through the following phases of MapReduce in Big Data:
● Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
● Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In the word-count example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency>.
● Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are grouped together along with their respective frequencies.
● Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines the values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, it aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word (the worked example below walks through all four phases).
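
Here is a small, self-contained Python walk-through of the word-count example; it is an illustration only, the sample splits are made up, and in a real job these phases run in parallel across many machines.

# Pure-Python walk-through of the MapReduce phases for word count.
from collections import defaultdict

# 1. Input splits: fixed-size pieces, each consumed by a single map task.
splits = ["deer bear river", "car car river", "deer car bear"]

# 2. Mapping: each split produces a list of <word, 1> pairs.
mapped = [[(word, 1) for word in split.split()] for split in splits]

# 3. Shuffling: consolidate the pairs so identical words are grouped together.
groups = defaultdict(list)
for pairs in mapped:
    for word, count in pairs:
        groups[word].append(count)

# 4. Reducing: aggregate each group into a single output value.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}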

8. What are the characteristics of big data?


Ans: Big data can be described by the following characteristics:

● Volume
● Variety
● Velocity
● Veracity

(i) Volume – The name Big Data itself relates to an enormous size. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity – This feature of Big Data is connected to the previous ones. It defines the degree of trustworthiness of the data. As most of the data you encounter is unstructured, it is important to filter out the unnecessary information and use the rest for processing.

Veracity is one of the characteristics of big data analytics that denotes data inconsistency as well as data uncertainty.

For example, a huge amount of data can create much confusion, whereas too little data results in inadequate information.

9. What is Big Data Platform?


Ans:
• A Big Data Platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide an easy-to-use system to enterprises.

• It is a single one-stop solution for all the Big Data needs of an enterprise, irrespective of its size and data volume. A Big Data Platform is an enterprise-class IT solution for developing, deploying and managing Big Data.

● There are several open-source and commercial Big Data Platforms in the market with varied features which can be used in a Big Data environment.
● Features of a Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:

● A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or due to changes in business processes.
● It should support linear scale-out.
● It should have the capability for rapid deployment.
● It should support a variety of data formats.
● The platform should provide data analysis and reporting tools.
● It should provide real-time data analysis software.
● It should have tools for searching through large data sets.

10. What is Big Data? Give some examples related to big data.


Ans:
• Big Data is a high-volume, high-velocity or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

Examples:

• A client introduced a different kind of eco-friendly packaging for one of its brands.
• Customer sentiment towards the new packaging was negative.
• After tracking customer feedback and comments, the company learned of a certain amount of discontent around the change and moved to a different kind of eco-friendly packaging.
• The credit goes to the company's adoption of big data technologies to discover, understand and react to customer sentiment.

11. Explain in detail the types and subtypes of data.


Ans:

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It covers all data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Structured data is currently the easiest to process and the simplest way to manage information. Example: relational data.

2. Semi-structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but its partial structure makes it easier to handle than unstructured data. Example: XML data.

3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. For unstructured data there are alternative platforms for storing and managing it; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media logs. (A small example contrasting the three forms follows below.)

Reference: https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
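
The snippet below (an illustration, not from the cited article; the record itself is made up) stores one employee record as a structured SQL row, as semi-structured XML, and as an unstructured free-text note, using only the Python standard library.

# Illustration: one record in structured, semi-structured, and unstructured form.
import sqlite3
import xml.etree.ElementTree as ET

# Structured: fixed schema of rows and columns (relational data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
db.execute("INSERT INTO employee VALUES (1, 'Asha', 'HR')")
print(db.execute("SELECT name, dept FROM employee").fetchall())

# Semi-structured: self-describing tags but no rigid schema (XML).
record = ET.fromstring("<employee id='1'><name>Asha</name><dept>HR</dept></employee>")
print(record.find("name").text, record.find("dept").text)

# Unstructured: free text with no predefined data model.
note = "Asha from HR joined the wellness programme last month."
print("HR" in note)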

12. Briefly discuss MapReduce and YARN.


Ans:
-YARN is an Apache Hadoop technology and stands for Yet Another Resource Negotiator.
-YARN is a large-scale, distributed operating system for big data applications.
-YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from its data processing component.
-MapReduce itself remains the programming model for processing the data (see answers 5 and 7); under YARN it runs as one of several possible processing engines on the cluster.

13. Explain HDFS in detail.


Ans:

- HDFS works on a master-slave architecture.

- The name node acts as the master node.

- The name node stores the metadata.

- A file is divided into blocks.

- The name node maps each block to a data node.

- The default size of an HDFS block in Hadoop 1.0 is 64 MB, and the default size in Hadoop 2.0 is 128 MB (see the quick block-count arithmetic below).
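
The block size directly determines how many blocks a file occupies; the arithmetic below is an illustration only, and the 500 MB file size is made up.

# How many HDFS blocks does a file occupy? (Illustrative arithmetic only.)
import math

def blocks_needed(file_size_mb, block_size_mb):
    # A file is split into fixed-size blocks; the last block may be partially filled.
    return math.ceil(file_size_mb / block_size_mb)

print(blocks_needed(500, 64))    # 8 blocks with the Hadoop 1.0 default (64 MB)
print(blocks_needed(500, 128))   # 4 blocks with the Hadoop 2.0 default (128 MB)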

14. Write a note on: YARN Architecture.


Ans:
-The master node has two components:

Job history server

Resource manager

-The ResourceManager is the master, and there is only one in a cluster.

-The ResourceManager decides how to assign resources.

-An ApplicationMaster instance estimates the resource requirements for running an application program.

-The NodeManager acts as a slave of the infrastructure.

-All active NodeManagers periodically send a heartbeat signal to the ResourceManager, signalling their presence.

15.Explain Hadoop Ecosystem?


Ans:

The Hadoop Ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption (ingestion), analysis, storage and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:


· HDFS: Hadoop Distributed File System
· YARN: Yet Another Resource Negotiator
· MapReduce: Programming based Data Processing
· Spark: In-Memory data processing
· PIG, HIVE: Query based processing of data services
· HBase: NoSQL Database
· Mahout, Spark MLLib: Machine Learning algorithm libraries
· Solr, Lucene: Searching and Indexing
· Zookeeper: Managing cluster
· Oozie: Job Scheduling

16. Write a note on:
a) Apache Oozie:

- Apache Oozie is an open-source Apache project that schedules Hadoop jobs.

- The Oozie design provisions for the scalable processing of multiple jobs.

- Oozie provides a way to package and bundle multiple coordinator and workflow jobs and manage the lifecycle of those jobs.

- The two basic Oozie functions are:

● Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs), specifying a sequence of actions to execute.
● Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and data availability.

- Oozie provisions for the following:

I. Integrates multiple jobs in a sequential manner.

II. Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop.

III. Runs workflow jobs based on time and data triggers.

IV. Manages batch coordination for the applications.

b) Sqoop

-Sqoop is used to transfer large amounts of data between Hadoop and enterprise data stores such as application servers and relational databases.

-Sqoop works with relational databases such as Oracle, MySQL and PostgreSQL.

-It provides the mechanism to import data from an external data store into HDFS.

-Sqoop provisions for fault tolerance.

-Sqoop initially parses the arguments passed on the command line and prepares the map task.

-The number of map tasks launched depends on the number of mappers supplied by the user on the command line.

-Sqoop distributes the input data equally among the mappers.

-Each mapper then creates a connection with the database using JDBC.

-The data is then sent to HDFS, HBase or Hive (an example import invocation is sketched below).
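
A typical import is launched roughly as follows; the JDBC URL, credentials, table name and target directory are placeholders, not from the original answer, and the call is shown from Python only for convenience.

# Sketch: launching a Sqoop import of a relational table into HDFS.
# The connection string, credentials, table, and paths are placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",         # password kept in a file, not on the CLI
    "--table", "orders",                        # table to import
    "--target-dir", "/user/etl/orders",         # destination directory in HDFS
    "--num-mappers", "4",                       # parallel map tasks doing the copy
], check=True)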

c) Apache Ambari

● Apache Ambari is a management platform for Hadoop.

● It is open source and enables an enterprise to plan, securely install, manage and maintain Hadoop clusters.

● The features of Ambari and its associated components are as follows:

● Simplification of installation, configuration and management.

● Enables easy, efficient, repeatable and automated creation of clusters.

● Manages and monitors scalable clusters.

● Provides an intuitive web user interface and a REST API. This provision enables automation of cluster operations.

● Visualizes the health of clusters and critical metrics for their operation.

● Enables detection of faulty nodes and links.

● Provides extensibility and customizability.

d) HBase
- HBase is a column-oriented database management system that runs on top of HDFS.

- HBase does not support a structured query language like SQL; it is a non-relational (NoSQL) database.

- Data is stored in a tabular format.

- Each table contains rows and columns, like a traditional database.

- HBase provides a row key, analogous to a primary key in a database table.

- Data accesses are performed using that key (a small access sketch follows below).
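
Row-key based reads and writes look roughly like this; the "happybase" client is a third-party Python library and an assumption here, as are the host, table and column names, and it requires the HBase Thrift server.

# Sketch: key-based access to an HBase table via the third-party "happybase"
# client (requires the HBase Thrift server). All names are placeholders.
import happybase

connection = happybase.Connection("hbase-host")    # Thrift server host
table = connection.table("employees")              # an existing table

# Write one row: the row key identifies the row; columns live in column families.
table.put(b"emp-001", {b"info:name": b"Asha", b"info:dept": b"HR"})

# Read it back by its key, which is the access path noted above.
print(table.row(b"emp-001"))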

e) Apache Hive
- Apache Hive is an open-source data warehouse software system.

- Hive facilitates reading, writing and managing large datasets that reside in distributed Hadoop storage.

- Hive also enables data serialization/deserialization and increases flexibility in design by including a system catalog called the Hive Metastore.

- Highly scalable.

- Uses HiveQL, i.e. HQL.

Hive + SQL = HQL

- HQL translates SQL-like queries into MapReduce jobs executed on Hadoop automatically.

- The three major functions of Hive are data summarization, query and analysis (a query sketch follows below).
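
An HQL query can be submitted to HiveServer2 roughly as follows; the "pyhive" client is a third-party Python library and an assumption here, as are the host, port, user and table name. Hive turns the query into MapReduce jobs behind the scenes, as described above.

# Sketch: running an HQL query against HiveServer2 with the third-party
# "pyhive" client. Host, port, username, and table name are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-host", port=10000, username="analyst")
cursor = conn.cursor()

# Hive translates this SQL-like query into MapReduce jobs on the cluster.
cursor.execute("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
for dept, headcount in cursor.fetchall():
    print(dept, headcount)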

f) Apache Pig

● Apache Pig is an open-source, high-level language platform.

● The language used in Pig is known as Pig Latin.

● Pig executes queries on large datasets that are stored in HDFS using Apache Hadoop.

● Features of Pig are:

- Loads the data after applying the required filters and dumps the data in the desired format.
- Requires a Java runtime environment.
- Converts all the operations into Map and Reduce tasks.
17. Explain Mahout in detail.
Ans:

-Mahout is an Apache project with a library of scalable machine learning algorithms.

-Mahout provides the learning tools to automate the finding of meaningful patterns in Big Data sets.
-Mahout supports four main areas:

● Collaborative filtering, which mines user behavior and makes product recommendations.

● Clustering, which takes data items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.

● Classification, which means learning from existing categorizations and then assigning future items to the best category.

● Frequent item-set mining, which analyses items in a group and then identifies which items usually occur together (a toy sketch of this idea follows below).
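
Here is a toy pure-Python sketch of frequent item-set mining by pair counting; it illustrates the concept only, the baskets are made up, and Mahout itself implements this at scale on a cluster.

# Toy illustration of frequent item-set mining: count which pairs of items
# occur together in the same basket, then keep the pairs seen often enough.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "butter", "coffee"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that occur together in at least two baskets count as "frequent" here.
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent)   # ('bread', 'butter') appears 3 times; several other pairs appear twice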
