Big Data 3

1. What is Big Data? List the best practices of Big Data Analytics.

Ans:

Big Data is a high-volume, high-velocity, or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

BEST PRACTICES FOR BIG DATA

Now, with an understanding of what big data is and what it offers, organizations must know how analytics should be practiced to make the most of their data. The list below shows five of the best practices for big data:

1. UNDERSTAND THE BUSINESS REQUIREMENTS

Analyzing and understanding the business requirements and organizational goals is the first and foremost step, to be carried out even before bringing big data analytics into your projects. Business users must understand which projects in their company should use big data analytics to gain the maximum benefit.

2. DETERMINE THE COLLECTED DIGITAL ASSETS

The second best practice is to identify the type of data pouring into the organization, as well as the data generated in-house. Usually, the data collected is disorganized and in varying formats. Moreover, some data is never even exploited (so-called dark data), and it is essential that organizations identify this data too.

3. IDENTIFY WHAT IS MISSING

The third practice is analyzing and understanding what is missing. Once you have collected the data needed for a project, identify the additional information that might be required for that particular project and where it can come from. For instance, if you want to leverage big data analytics in your organization to understand your employees' well-being, then along with information such as login/logout times, medical reports and email reports, you need some additional information about the employees' stress levels, say. This information can be provided by co-workers or leaders.

4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED

After analyzing and collecting data from different sources, it is time for the organization to understand which big data technologies, such as predictive analytics, stream analytics, data preparation, fraud detection, sentiment analysis and so on, can best serve the current business requirements. For instance, big data analytics helps a company's HR team identify the right talent faster during recruitment by combining data from social media and job portals using predictive and sentiment analysis.

5. ANALYZE DATA CONTINUOUSLY

This is the final best practice that an organization must follow when it comes to big data. You must always be aware of what data your organization holds and what is being done with it. Check the health of your data periodically so that you never miss important but hidden signals in the data. Before implementing any new technology in your organization, it is vital to have a strategy to help you get the most out of it. With adequate and accurate data at their disposal, companies must also follow the above-mentioned big data practices to extract value from this data.


2. Write down the characteristics of Big Data Applications.
Ans:
Same as the answer to question 8.

3. Write down the four computing resources of Big Data storage.


Ans:
a) Processing Capability:

-Data in its raw form is not useful to any organization. Data processing is the method of collecting raw data and translating it into usable information.
-The raw data is subjected to various data processing methods, often using machine learning and artificial intelligence algorithms, to generate the desired output.
-This step may vary slightly from process to process depending on the source of the data being processed (data lakes, online databases, connected devices, etc.) and the intended use of the output.
-Processing is usually performed step by step by a team of data scientists and data engineers in an organization.
-The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format.
-Data processing is essential for organizations to create better business strategies and increase their competitive edge.
-By converting the data into readable formats such as graphs, charts, and documents, employees throughout the organization can understand and use the data.

b) Memory:
-Memory (RAM) holds the data and intermediate results while they are being processed; distributed frameworks spread this memory requirement across the nodes of a cluster.

c) Storage:
-One of the steps of the data processing cycle is storage, where data and metadata are stored for further use.
-This allows for quick access and retrieval of information whenever needed, and also allows the data to be used directly as input in the next data processing cycle.

d) Network:
-The network connects the nodes of the cluster and carries data between storage and processing; adequate bandwidth is needed to move large volumes of data between machines.
4. What is HDFS?
Ans:
-HDFS stands for Hadoop Distributed File System.
-Apache Hadoop is a collection of open-source software utilities that facilitate using a
network of many computers to solve problems involving massive amounts of data and
computation.
-It provides a software framework for distributed storage and processing of big data
using the MapReduce programming model.

- Hadoop comes with a distributed file system called HDFS.

- In HDFS, data is distributed over several machines and replicated to ensure durability against failures and high availability to parallel applications.

- It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and a name node.

- HDFS (Hadoop Distributed File System) is the primary storage system used by
Hadoop applications.

-This open source framework works by rapidly transferring data between nodes.
-It's often used by companies that need to handle and store big data.

- HDFS is a key component of many Hadoop systems, as it provides a means for managing big data, as well as supporting big data analytics (a short usage sketch follows).
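
The Python snippet below (not part of the original answer) drives the standard "hdfs dfs" command-line client to create a directory, upload a local file, and list it back; it assumes a configured Hadoop client on the PATH, and the paths and file name are placeholders.

# Sketch: basic HDFS operations via the standard "hdfs dfs" CLI.
# Assumes a configured Hadoop client on PATH; all paths are placeholders.
import subprocess

def hdfs(*args):
    # Run one "hdfs dfs" subcommand and return its output.
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    hdfs("-mkdir", "-p", "/user/demo/input")                   # create a directory
    hdfs("-put", "-f", "localfile.txt", "/user/demo/input/")   # upload a local file
    print(hdfs("-ls", "/user/demo/input"))                     # list the stored file

Behind these commands, HDFS splits the file into blocks and replicates each block across data nodes, as described above.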

5. What is Map Reduce?


Ans:
-MapReduce is a processing technique and a programming model for distributed computing, based on Java.
-The MapReduce algorithm contains two important tasks, namely Map and Reduce.
-Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- MapReduce is a software framework and programming model used for processing huge amounts of data.
- A MapReduce program works in two phases, namely Map and Reduce.
- Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
- Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
- MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
- The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function (a minimal word-count sketch follows below).
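
Because Hadoop Streaming lets MapReduce jobs be written in Python, the word-count job can be sketched as follows; this is an illustration only, and the script name, the "map"/"reduce" mode argument, and the streaming-jar submission are assumptions rather than part of the original answer.

#!/usr/bin/env python3
# Minimal Hadoop Streaming word-count sketch: the same script acts as the
# mapper when run with the argument "map" and as the reducer with "reduce".
# Hadoop Streaming pipes input lines on stdin and sorts map output by key.
import sys

def mapper():
    # Map: emit one <word, 1> pair per word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce: keys arrive sorted, so all counts for one word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if "map" in sys.argv[1:] else reducer()

Such a script would typically be submitted with the Hadoop Streaming jar, passing it as both the mapper and the reducer along with the input and output paths.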
6. What is YARN?
Ans:
-YARN stands for Yet Another Resource Negotiator.
-It manages the cluster's compute resources.
-The platform is responsible for providing computational resources, such as CPU, memory and network I/O, which are needed when applications execute.
-The YARN architecture basically separates the resource management layer from the processing layer.
-With YARN, the responsibilities of the Hadoop 1.0 JobTracker are split between the ResourceManager and a per-application ApplicationMaster.
-YARN also allows different data processing engines such as graph processing, interactive processing and stream processing, as well as batch processing, to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient.
-Through its various components, it can dynamically allocate resources and schedule application processing.
-For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them (a small CLI sketch follows below).
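
The standard YARN command-line client can show what the ResourceManager currently manages; the snippet below (not from the original answer) simply shells out to two of its subcommands and assumes a running, configured cluster.

# Sketch: querying YARN through its standard CLI from Python.
# Assumes the "yarn" client is installed and configured for a running cluster.
import subprocess

def yarn(*args):
    # Run one "yarn" subcommand and return its output.
    return subprocess.run(["yarn", *args], capture_output=True,
                          text=True, check=True).stdout

print(yarn("node", "-list"))          # NodeManagers currently registered
print(yarn("application", "-list"))   # applications known to the ResourceManager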

7. What is Map Reduce Programming Model?


Ans:

-MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
-The model is a specialization of the split-apply-combine strategy for data analysis.
-MapReduce is a software framework and programming model used for processing huge amounts of data.
-A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
-Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
-MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
-The input to each phase is key-value pairs.
-In addition, every programmer needs to specify two functions: a map function and a reduce function.

-The data goes through the following phases of MapReduce in Big Data:
● Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
● Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In the word-count example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency>.
● Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are grouped together along with their respective frequencies.
● Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines the values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, it aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word (the worked example below walks through all four phases).
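
Here is a small, self-contained Python walk-through of the word-count example; it is an illustration only, the sample splits are made up, and in a real job these phases run in parallel across many machines.

# Pure-Python walk-through of the MapReduce phases for word count.
from collections import defaultdict

# 1. Input splits: fixed-size pieces, each consumed by a single map task.
splits = ["deer bear river", "car car river", "deer car bear"]

# 2. Mapping: each split produces a list of <word, 1> pairs.
mapped = [[(word, 1) for word in split.split()] for split in splits]

# 3. Shuffling: consolidate the pairs so identical words are grouped together.
groups = defaultdict(list)
for pairs in mapped:
    for word, count in pairs:
        groups[word].append(count)

# 4. Reducing: aggregate each group into a single output value.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}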

8. What are the characteristics of big data?


Ans: Big data can be described by the following characteristics:

● Volume
● Variety
● Velocity
● Veracity

(i) Volume – The name Big Data itself relates to an enormous size. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity – This feature of Big Data is connected to the previous ones. It defines the degree of trustworthiness of the data. As most of the data you encounter is unstructured, it is important to filter out the unnecessary information and use the rest for processing.

Veracity is one of the characteristics of big data analytics that denotes data inconsistency as well as data uncertainty.

For example, a huge amount of data can create much confusion, whereas too little data results in inadequate information.

9. What is Big Data Platform?


Ans:
• A Big Data Platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide an easy-to-use system to enterprises.

• It is a single one-stop solution for all the Big Data needs of an enterprise, irrespective of its size and data volume. A Big Data Platform is an enterprise-class IT solution for developing, deploying and managing Big Data.

● There are several open-source and commercial Big Data Platforms in the market with varied features which can be used in a Big Data environment.
● Features of a Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:

● A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or due to changes in business processes.
● It should support linear scale-out.
● It should have the capability for rapid deployment.
● It should support a variety of data formats.
● The platform should provide data analysis and reporting tools.
● It should provide real-time data analysis software.
● It should have tools for searching through large data sets.

10. What is Big Data? Give some examples related to big data.


Ans:
• Big Data is a high-volume, high-velocity or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

Examples:

• A client introduced a different kind of eco-friendly packaging for one of its brands.
• Customer sentiment towards the new packaging was negative.
• After tracking customer feedback and comments, the company learned of a certain amount of discontent around the change and moved to a different kind of eco-friendly packaging.
• The credit goes to the company's adoption of big data technologies to discover, understand and react to customer sentiment.

11. Explain in detail the types and subtypes of data.


Ans:

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It covers all data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Structured data is currently the easiest to process and the simplest way to manage information. Example: relational data.

2. Semi-structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but its partial structure makes it easier to handle than unstructured data. Example: XML data.

3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. For unstructured data there are alternative platforms for storing and managing it; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media logs. (A small example contrasting the three forms follows below.)

Reference: https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
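
The snippet below (an illustration, not from the cited article; the record itself is made up) stores one employee record as a structured SQL row, as semi-structured XML, and as an unstructured free-text note, using only the Python standard library.

# Illustration: one record in structured, semi-structured, and unstructured form.
import sqlite3
import xml.etree.ElementTree as ET

# Structured: fixed schema of rows and columns (relational data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
db.execute("INSERT INTO employee VALUES (1, 'Asha', 'HR')")
print(db.execute("SELECT name, dept FROM employee").fetchall())

# Semi-structured: self-describing tags but no rigid schema (XML).
record = ET.fromstring("<employee id='1'><name>Asha</name><dept>HR</dept></employee>")
print(record.find("name").text, record.find("dept").text)

# Unstructured: free text with no predefined data model.
note = "Asha from HR joined the wellness programme last month."
print("HR" in note)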

12. Briefly discuss MapReduce and YARN.


Ans:
-YARN is an Apache Hadoop technology and stands for Yet Another Resource Negotiator.
-YARN is a large-scale, distributed operating system for big data applications.
-YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from its data processing component.
-MapReduce itself remains the programming model for processing the data (see answers 5 and 7); under YARN it runs as one of several possible processing engines on the cluster.

13. Explain HDFS in detail.


Ans:

- HDFS works on a master-slave architecture.

- The name node acts as the master node.

- The name node stores the metadata.

- A file is divided into blocks.

- The name node maps each block to a data node.

- The default size of an HDFS block in Hadoop 1.0 is 64 MB, and the default size in Hadoop 2.0 is 128 MB (see the quick block-count arithmetic below).
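
The block size directly determines how many blocks a file occupies; the arithmetic below is an illustration only, and the 500 MB file size is made up.

# How many HDFS blocks does a file occupy? (Illustrative arithmetic only.)
import math

def blocks_needed(file_size_mb, block_size_mb):
    # A file is split into fixed-size blocks; the last block may be partially filled.
    return math.ceil(file_size_mb / block_size_mb)

print(blocks_needed(500, 64))    # 8 blocks with the Hadoop 1.0 default (64 MB)
print(blocks_needed(500, 128))   # 4 blocks with the Hadoop 2.0 default (128 MB)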

14. Write a note on: YARN Architecture.


Ans:
-The master node has two components:

Job history server

Resource manager

-The ResourceManager is the master, and there is only one in a cluster.

-The ResourceManager decides how to assign resources.

-An ApplicationMaster instance estimates the resource requirements for running an application program.

-The NodeManager acts as a slave of the infrastructure.

-All active NodeManagers periodically send a heartbeat signal to the ResourceManager, signalling their presence.

15.Explain Hadoop Ecosystem?


Ans:

The Hadoop Ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption (ingestion), analysis, storage and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:


· HDFS: Hadoop Distributed File System
· YARN: Yet Another Resource Negotiator
· MapReduce: Programming based Data Processing
· Spark: In-Memory data processing
· PIG, HIVE: Query based processing of data services
· HBase: NoSQL Database
· Mahout, Spark MLLib: Machine Learning algorithm libraries
· Solr, Lucene: Searching and Indexing
· Zookeeper: Managing cluster
· Oozie: Job Scheduling

16. Write a note on:
a) Apache Oozie:

- Apache Oozie is an open-source Apache project that schedules Hadoop jobs.

- The Oozie design provisions for the scalable processing of multiple jobs.

- Oozie provides a way to package and bundle multiple coordinator and workflow jobs and manage the lifecycle of those jobs.

- The two basic Oozie functions are:

● Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs), specifying a sequence of actions to execute.
● Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and data availability.

- Oozie provisions for the following:

I. Integrates multiple jobs in a sequential manner.

II. Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop.

III. Runs workflow jobs based on time and data triggers.

IV. Manages batch coordination for the applications.

b) Sqoop

-Sqoop is used to transfer large amounts of data between Hadoop and enterprise data stores such as application servers and relational databases.

-Sqoop works with relational databases such as Oracle, MySQL and PostgreSQL.

-It provides the mechanism to import data from an external data store into HDFS.

-Sqoop provisions for fault tolerance.

-Sqoop initially parses the arguments passed on the command line and prepares the map task.

-The number of map tasks launched depends on the number of mappers supplied by the user on the command line.

-Sqoop distributes the input data equally among the mappers.

-Each mapper then creates a connection with the database using JDBC.

-The data is then sent to HDFS, HBase or Hive (an example import invocation is sketched below).
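
A typical import is launched roughly as follows; the JDBC URL, credentials, table name and target directory are placeholders, not from the original answer, and the call is shown from Python only for convenience.

# Sketch: launching a Sqoop import of a relational table into HDFS.
# The connection string, credentials, table, and paths are placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",         # password kept in a file, not on the CLI
    "--table", "orders",                        # table to import
    "--target-dir", "/user/etl/orders",         # destination directory in HDFS
    "--num-mappers", "4",                       # parallel map tasks doing the copy
], check=True)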

c) Apache Ambari

● Apache Ambari is a management platform for Hadoop.

● It is open source and enables an enterprise to plan, securely install, manage and maintain Hadoop clusters.

● The features of Ambari and its associated components are as follows:

● Simplification of installation, configuration and management.

● Enables easy, efficient, repeatable and automated creation of clusters.

● Manages and monitors scalable clusters.

● Provides an intuitive web user interface and a REST API. This provision enables automation of cluster operations.

● Visualizes the health of clusters and critical metrics for their operation.

● Enables detection of faulty nodes and links.

● Provides extensibility and customizability.

d) HBase
- HBase is a column-oriented database management system that runs on top of HDFS.

- HBase does not support a structured query language like SQL; it is a non-relational (NoSQL) database.

- Data is stored in a tabular format.

- Each table contains rows and columns, like a traditional database.

- HBase provides a row key, analogous to a primary key in a database table.

- Data accesses are performed using that key (a small access sketch follows below).
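
Row-key based reads and writes look roughly like this; the "happybase" client is a third-party Python library and an assumption here, as are the host, table and column names, and it requires the HBase Thrift server.

# Sketch: key-based access to an HBase table via the third-party "happybase"
# client (requires the HBase Thrift server). All names are placeholders.
import happybase

connection = happybase.Connection("hbase-host")    # Thrift server host
table = connection.table("employees")              # an existing table

# Write one row: the row key identifies the row; columns live in column families.
table.put(b"emp-001", {b"info:name": b"Asha", b"info:dept": b"HR"})

# Read it back by its key, which is the access path noted above.
print(table.row(b"emp-001"))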

e) Apache Hive
- Apache Hive is an open-source data warehouse software system.

- Hive facilitates reading, writing and managing large datasets that reside in distributed Hadoop storage.

- Hive also enables data serialization/deserialization and increases flexibility in design by including a system catalog called the Hive Metastore.

- Highly scalable.

- Uses HiveQL, i.e. HQL.

Hive + SQL = HQL

- HQL translates SQL-like queries into MapReduce jobs executed on Hadoop automatically.

- The three major functions of Hive are data summarization, query and analysis (a query sketch follows below).
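
An HQL query can be submitted to HiveServer2 roughly as follows; the "pyhive" client is a third-party Python library and an assumption here, as are the host, port, user and table name. Hive turns the query into MapReduce jobs behind the scenes, as described above.

# Sketch: running an HQL query against HiveServer2 with the third-party
# "pyhive" client. Host, port, username, and table name are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-host", port=10000, username="analyst")
cursor = conn.cursor()

# Hive translates this SQL-like query into MapReduce jobs on the cluster.
cursor.execute("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
for dept, headcount in cursor.fetchall():
    print(dept, headcount)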

f) Apache Pig

● Apache Pig is an open-source, high-level language platform.

● The language used in Pig is known as Pig Latin.

● Pig executes queries on large datasets that are stored in HDFS using Apache Hadoop.

● Features of Pig are:

- Loads the data after applying the required filters and dumps the data in the desired format.
- Requires a Java runtime environment.
- Converts all the operations into Map and Reduce tasks.
17. Explain Mahout in detail.
Ans:

-Mahout is an Apache project with a library of scalable machine learning algorithms.

-Mahout provides the learning tools to automate the finding of meaningful patterns in Big Data sets.
-Mahout supports four main areas:

● Collaborative filtering, which mines user behavior and makes product recommendations.

● Clustering, which takes data items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.

● Classification, which means learning from existing categorizations and then assigning future items to the best category.

● Frequent item-set mining, which analyses items in a group and then identifies which items usually occur together (a toy sketch of this idea follows below).
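
Here is a toy pure-Python sketch of frequent item-set mining by pair counting; it illustrates the concept only, the baskets are made up, and Mahout itself implements this at scale on a cluster.

# Toy illustration of frequent item-set mining: count which pairs of items
# occur together in the same basket, then keep the pairs seen often enough.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "butter", "coffee"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that occur together in at least two baskets count as "frequent" here.
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent)   # ('bread', 'butter') appears 3 times; several other pairs appear twice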
