Big Data 3
Ans:
Big Data is a high-volume, high-velocity, and/or high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and
process optimization.
Now, with the knowledge of what big data is and what it offers, organizations must know
how analytics should be practiced to make the most of their data. The list below describes five best practices.
Analyzing and understanding the business requirements and organizational goals is the
first and foremost step that must be carried out even before leveraging big data
analytics in your projects. The business users must understand which projects in their
company should use big data analytics to make maximum profit.
The second best practice is to identify the type of data pouring into the
organization, as well as the data generated in-house. Usually, the data collected is
disorganized and in varying formats. Moreover, some data is never even exploited (so-called
dark data), and it is essential that organizations identify this data too.
The third practice is analyzing and understanding what is missing. Once you have
collected the data needed for a project, identify the additional information that might be
required for that particular project and where it can come from. For instance, if you want
to assess employee well-being, then along with information such as login/logout times, medical reports, and
email reports, you may also need some additional information about the employee from other sources.
After analyzing and collecting data from different sources, it's time for the organization to
understand which big data technologies, such as predictive analytics, stream analytics,
data preparation, fraud detection, sentiment analysis, and so on, can best serve the
current business requirements. For instance, big data analytics helps HR teams
identify the right talent faster during recruitment by combining data from
social media and job portals using predictive and sentiment analysis.
This is the final best practice that an organization must follow when it comes to big data.
You must always be aware of what data is lying with your organization and what is being
done with it. Check the health of your data periodically so that you never miss out on any important
but hidden signals in the data. Before implementing any new technology in your
organization, it is vital to have a strategy to help you get the most out of it. With adequate
and accurate data at their disposal, companies must also follow the above-mentioned big data best practices.
-In the processing step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the desired output.
-This step may vary slightly from process to process depending on the source of data
being processed (data lakes, online databases, connected devices, etc.) and the
intended use of the output.
-Data in its raw form is not useful to any organization. Data processing is the method of
collecting raw data and translating it into usable information.
-It is usually performed in a step-by-step process by a team of data scientists and data
engineers in an organization.
-The raw data is collected, filtered, sorted, processed, analyzed, stored, and then
presented in a readable format.
-Data processing is essential for organizations to create better business strategies and
increase their competitive edge.
-By converting the data into readable formats like graphs, charts, and documents,
employees throughout the organization can understand and use the data.
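On a very small scale, these steps can be sketched with plain Java streams; the snippet below is only a toy, in-memory illustration of the collect, filter, sort, aggregate and present flow (the record format and values are invented for the example), not a distributed big data pipeline.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniPipeline {
    public static void main(String[] args) {
        // Collection: raw, messy records (invented sample data)
        List<String> raw = List.of("sale:120", "sale:80", "noise", "sale:200", "refund:50");

        // Filtering, sorting and aggregation: keep usable records, group them by type, sum the values
        Map<String, Integer> summary = raw.stream()
                .filter(r -> r.contains(":"))                     // drop records that cannot be parsed
                .map(r -> r.split(":"))
                .sorted((a, b) -> a[0].compareTo(b[0]))           // sort by record type
                .collect(Collectors.groupingBy(p -> p[0],
                        Collectors.summingInt(p -> Integer.parseInt(p[1]))));

        // Presentation: print the result in a readable form
        summary.forEach((type, total) -> System.out.println(type + " -> " + total));
    }
}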
b) Memory:
c) Storage:
-One of the steps of the data processing cycle is storage, where data and metadata are
stored for further use.
-This allows for quick access and retrieval of information whenever needed, and also
allows it to be used directly as input in the next data processing cycle.
d) Network:
4. What is HDFS?
Ans:
-HDFS stands for Hadoop Distributed File System.
-Apache Hadoop is a collection of open-source software utilities that facilitate using a
network of many computers to solve problems involving massive amounts of data and
computation.
-It provides a software framework for distributed storage and processing of big data
using the MapReduce programming model.
- In HDFS, data is distributed over several machines and replicated to ensure
durability against failures and high availability to parallel applications.
- HDFS (Hadoop Distributed File System) is the primary storage system used by
Hadoop applications.
-This open-source framework works by rapidly transferring data between nodes.
-It's often used by companies that need to handle and store big data.
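As a rough illustration of how an application talks to HDFS, the Java sketch below uses the Hadoop FileSystem API to write a small file into HDFS and read it back. The NameNode URI and the file path are placeholders, so treat this as a sketch under those assumptions rather than a ready-made program for any particular cluster.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value instead.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");
        // Write: HDFS splits the file into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello, HDFS!\n");
        }
        // Read the file back and copy its contents to standard output.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}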
-The data goes through the following phases of MapReduce in Big Data:
● Input Splits:
An input to a MapReduce job in Big Data is divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is consumed by a single map task.
● Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, data
in each split is passed to a mapping function to produce output values. In our example, the
job of the mapping phase is to count the number of occurrences of each word from the input splits
(described above) and prepare a list in the form of <word,
frequency>.
● Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, the same words are clubbed
together along with their respective frequencies.
● Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from the Shuffling phase and returns a single output value. In short, this
phase summarizes the complete dataset.
In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates
the total occurrences of each word.
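To make the four phases concrete, here is a sketch of the classic Hadoop word-count job in Java: the mapper implements the Mapping phase, the framework performs the Shuffling between map and reduce, and the reducer implements the Reducing phase. Input and output paths are taken from the command line, and the class names are only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapping phase: emit <word, 1> for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducing phase: after shuffling groups identical words, sum their counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before shuffling
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}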
● Volume
● Variety
● Velocity
● Veracity
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of
data plays a very crucial role in determining the value of data. Also, whether particular
data can actually be considered Big Data or not depends upon the volume of
data. Hence, 'Volume' is one characteristic which needs to be considered while dealing
with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of
data considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in
analysis applications. This variety of unstructured data poses certain issues for storage,
mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast
the data is generated and processed to meet the demands determines the real potential
of the data.
Big Data velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, social media sites, sensors, mobile devices,
etc. The flow of data is massive and continuous.
(iv) Veracity – This feature of Big Data is connected to the previous one. It defines the
degree of trustworthiness of the data. As most of the data you encounter is unstructured,
it is important to filter out the unnecessary information and use the rest for processing.
Veracity is one of the characteristics of big data analytics that denotes data inconsistency
as well as data uncertainty.
As an example, a huge amount of data can create much confusion; on the other hand,
a smaller amount of data may convey inadequate information.
• It is a single one-stop solution for all Big Data needs of an enterprise irrespective of
size and data volume. A Big Data Platform is an enterprise-class IT solution for developing,
deploying and managing Big Data.
● There are several open-source and commercial Big Data Platforms in the market with
varied features which can be used in a Big Data environment.
● Features of Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:
● A Big Data platform should be able to accommodate new platforms and tools based
on the business requirement, because business needs can change due to new
technologies or due to changes in business processes.
● It should support linear scale-out
● It should have capability for rapid deployment
● It should support a variety of data formats
● The platform should provide data analysis and reporting tools
● It should provide real-time data analysis software
● It should have tools for searching through large data sets
Examples:
• A client introduced a different kind of eco-friendly packaging for one of its
brands.
• Customer sentiment toward the new packaging was negative, which emerged after
tracking customer feedback and comments.
• The company got to know of a certain amount of discontent around the change
and moved to a different kind of eco-friendly package.
• The credit goes to the company adopting big data technologies to discover,
understand and react to the sentiment.
1. Structured data –
Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository that is
typically a database. It concerns all data which can be stored in an
SQL database in a table with rows and columns. They have relational
keys and can easily be mapped into pre-designed fields. Today,
structured data is the most processed kind of data and the simplest
way to manage information. Example: Relational data.
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational
database but that has some organizational properties that make it
easier to analyze. With some processing, you can store it in a
relational database (though this can be very hard for some kinds of
semi-structured data), but the semi-structured format exists to ease storage. Example: XML data.
3. Unstructured data –
Unstructured data is data which is not organized in a predefined
manner or does not have a predefined data model, thus it is not a good
fit for a mainstream relational database. So for unstructured data there
are alternative platforms for storing and managing it; it is increasingly
prevalent in IT systems and is used by organizations in a variety of
business intelligence and analytics applications. Examples: Word, PDF,
text, media logs.
Source: https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
- The default size of an HDFS block in Hadoop 1.0 is 64 MB, and the default size in
Hadoop 2.0 is 128 MB.
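For example, a 1 GB (1,024 MB) file stored with the Hadoop 2.0 default block size is split into 1,024 / 128 = 8 blocks, and each block is additionally replicated (three copies by default).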
Resource manager
-All active node managers periodically send a heartbeat signal to the Resource Manager,
signalling their presence.
Hadoop Ecosystem is a platform or a suite which provides various services to solve the
big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN,
and Hadoop Common. Most of the tools or solutions are used to supplement or
support these major elements. All these tools work collectively to provide services such
as the absorption, analysis, storage and maintenance of data.
16. Write a note on:
a) Apache Oozie:
- Oozie provides a way to package and bundle multiple coordinator and workflow jobs and
manage the lifecycle of those jobs.
● Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs), specifying
a sequence of actions to execute.
● Oozie coordinator jobs are recurrent Oozie workflow jobs that are
triggered by time and data availability.
II. Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop.
III. Runs workflow jobs based on time and data triggers.
b) Sqoop
-It provides the mechanism to import data from an external data store into HDFS.
-Sqoop initially parses the arguments passed on the command line and prepares
the map tasks.
-Each mapper then creates a connection with the database using JDBC.
-The data is then sent to HDFS / HBase / Hive.
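For instance, an import might be launched with a command along these lines (the JDBC connection string, table name, credentials and target directory are only placeholders):

sqoop import --connect jdbc:mysql://dbhost/shop --table customers --username dbuser -P --target-dir /user/demo/customers -m 4

Here -m 4 asks Sqoop to run four parallel map tasks, each importing a slice of the table over its own JDBC connection.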
c) Apache Ambari
● Provides an intuitive Web User Interface and REST API, which enable
automation of cluster operations.
● Visualizes the health of clusters and critical metrics for their operation.
d) HBase
- HBase is a column-oriented database management system that runs on
HDFS.
- HBase does not support a structured query language like SQL; it is a
non-relational database.
- Each table contains rows and columns like a traditional database.
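A minimal Java sketch of the HBase client API is shown below; it assumes a table named "employee" with a column family "personal" already exists (both names are hypothetical) and simply writes one cell and reads it back by row key, without any SQL.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {

            // Write one cell: row key "emp1", column family "personal", qualifier "name".
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("emp1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}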
e) Apache Hive
- Apache Hive is an open-source data warehouse software system.
- Hive facilitates reading, writing and managing large datasets that reside in
distributed Hadoop storage.
- It is highly scalable.
f) Apache Pig
● Pig executes queries on large datasets that are stored in HDFS using
Apache Hadoop.
g) Apache Mahout
● Clustering takes data items in a particular class and organizes them into
naturally occurring groups, such that items belonging to the same group are
similar to each other.
● Frequent item-set mining analyses items in a group and then identifies
which items usually occur together.