BDA Notes
IT
• One of the largest users of Big Data, IT companies around the world are using
Big Data to optimize their functioning, enhance employee productivity, and
minimize risks in business operations.
• By combining Big Data technologies with ML and AI, the IT sector continually
powers innovation to find solutions even for the most complex problems.
Retail
• Big Data has changed the way traditional brick-and-mortar retail stores work.
• Over the years, retailers have collected vast amounts of data from local
demographic surveys, POS scanners, RFID, customer loyalty cards, store
inventory, and so on.
• Now, they've started to leverage this data to create personalized customer
experiences, boost sales, increase revenue, and deliver outstanding customer
service.
• Retailers are even using smart sensors and Wi-Fi to track the movement of
customers, identify the most frequented aisles, and measure how long customers
linger in them, among other things.
• They also gather social media data to understand what customers are saying
about their brand and services, and tweak their product design and marketing
strategies accordingly.
Transportation
Big Data Analytics holds immense value for the transportation industry.
• In countries across the world, both private and government-run transportation
companies use Big Data technologies to optimize route planning, control traffic,
manage road congestion, and improve services.
• Additionally, transportation services even use Big Data for revenue management,
to drive technological innovation, enhance logistics, and, of course, to gain the
upper hand in the market.
Benefits of Big Data
1. Better decision making
Big data provides business intelligence and cutting-edge analytical insights that
help with decision-making. A company can get a more in-depth picture of its target
market by collecting more customer data.
2. Cost reduction
According to surveys done by NewVantage and Syncsort (now Precisely), big data
analytics has helped businesses significantly cut their costs. Big data is being used
to cut costs, according to 66.7% of survey participants from NewVantage.
Moreover, 59.4% of Syncsort survey participants stated that using big data tools
improved operational efficiency and reduced costs. Hadoop and cloud-based
analytics, two popular big data analytics tools, can also help lower the cost of
storing big data.
3. Detection of Fraud
Financial companies especially use big data to identify fraud. To find anomalies and
unusual transaction patterns, data analysts use artificial intelligence and machine
learning algorithms. Irregularities in transaction patterns show that something is out
of place or mismatched, providing hints about potential fraud.
For credit unions, banks, and credit card companies, fraud detection is crucial for
spotting misuse of account information and unauthorized access to materials or
products. By spotting fraud before it causes problems, any industry, including
finance, can provide better customer service.
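For instance, a very simplified way to flag anomalous transactions is a z-score rule on transaction amounts. The Python sketch below is purely illustrative (the amounts, threshold, and customer data are made up, and this is not any specific company's detection system); real systems combine many more signals and trained ML models.
Example (illustrative Python sketch):
from statistics import mean, stdev

# A customer's recent transaction amounts (made-up example values)
past_amounts = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1]
new_transactions = [49.0, 980.0, 51.5]    # incoming transactions to check

mu = mean(past_amounts)
sigma = stdev(past_amounts)

for amount in new_transactions:
    z = (amount - mu) / sigma             # how far this amount is from the usual pattern
    if abs(z) > 3:                        # more than 3 standard deviations away
        print(f"Possible fraud: {amount} (z-score {z:.1f})")
    else:
        print(f"Looks normal: {amount} (z-score {z:.1f})")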
4. A rise in productivity
A survey by Syncsort found that 59.9% of respondents said they were using big
data analytics tools like Spark and Hadoop to boost productivity. They have been
able to increase sales and improve customer retention as a result of this rise in
productivity. Modern big data tools make it possible for data scientists and analysts
to analyse a lot of data quickly and effectively, giving them an overview of more
data.
They become more productive as a result of this. Additionally, big data analytics
helps data scientists and analysts learn more about their own work, so they can
figure out how to be more effective in their tasks and job responsibilities. As a
result, investing in big data analytics gives businesses across all sectors a chance
to stand out through improved productivity.
5. Improved customer service
Big data also enables businesses to better understand the thoughts and feelings
of their clients and to provide them with more individualised goods and services.
Providing a personalised experience can increase client satisfaction, strengthen
bonds with clients, and, most importantly, foster loyalty.
6. Increased agility
Increased business agility is another competitive benefit of big data. Big data
analytics can assist businesses in becoming more innovative and adaptable in the
marketplace. Large customer data sets can be analysed to help businesses gain
insights ahead of the competition and more effectively address customer pain
points.
7. Greater innovation
Innovation is another common benefit of big data, and the NewVantage survey
found that 11.6 per cent of executives are investing in analytics primarily as a means
to innovate and disrupt their markets. They reason that if they can glean insights
that their competitors don't have, they may be able to get out ahead of the rest of
the market with new products and services.
Challenges of Big Data
1. A talent gap
A study by AtScale found that for the past three years, the biggest challenge in this
industry has been a lack of big data specialists and data scientists. Given that it
requires a different skill set, big data analytics is currently beyond the scope of
many IT professionals. Finding data scientists who are also knowledgeable about
big data can be difficult.
Data scientists and big data specialists are two well-paid professions in the data
science industry. As a result, hiring big data analysts can be very costly for
businesses, particularly for start-ups. Some businesses must wait a long time to hire
the necessary personnel to carry out their big data analytics tasks.
2. Security hazard
For big data analytics, businesses frequently collect sensitive data. This data needs
to be protected, and security lapses can be very damaging if the data is not properly
maintained.
Additionally, having access to enormous data sets can attract the unwanted
attention of hackers, and your company could become the target of a potential
cyber-attack. For many businesses today, data breaches are among the biggest
threats. Unless you take all necessary precautions, important information could be
leaked to rivals, which is another risk associated with big data.
3. Compliance
Another disadvantage of big data is the requirement for legal compliance with
governmental regulations. To store, handle, maintain, and process big data that
contains sensitive or private information, a company must make sure that they
adhere to all applicable laws and industry standards. As a result, managing data
governance tasks, transmission, and storage will become more challenging as big
data volumes grow.
4. High Cost
Because the field is constantly evolving and aims to process ever-increasing
amounts of data, often only large companies can sustain the investment needed to
develop their Big Data capabilities.
5. Data quality
Dealing with data quality issues is a major drawback of working with big data.
Data scientists and analysts must ensure the data they are using is accurate,
pertinent, and in the right format for analysis before they can use big data for
analytics efforts.
This significantly slows down the reporting process, but if businesses don't address
data quality problems, they may discover that the insights their analytics produce
are useless or even harmful if acted upon.
6. Rapid Change
The fact that technology is evolving quickly is another potential disadvantage of big
data analytics. Businesses must deal with the possibility of spending money on one
technology only to see something better emerge a few months later. This big data
drawback was ranked fourth among all the potential difficulties by Syncsort
respondents.
Structured Data
Name   Class  Section  Roll No  Grade
Geek1  11     A        1        A
Geek2  11     A        2        B
Geek3  11     A        3        A
Semi-Structured Data
Unstructured Data
Q. 10) What are the benefits of Big Data? Discuss challenges under Big Data. How
Big Data Analytics can be useful in the development of smart cities. 07
Ans.
• Better Decision Making. ...
• Reduce costs of business processes. ...
• Fraud Detection. ...
• Increased productivity. ...
• Improved customer service. ...
• Increased agility
• Sharing and Accessing Data:
o Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
o Sharing data can cause substantial challenges.
o This includes the need for inter- and intra-institutional legal
documents.
o Accessing data from public repositories leads to multiple
difficulties.
o Data needs to be available in an accurate, complete and timely
manner, because decisions based on a company's information
system can only be accurate and made in time if the underlying
data is available in this way.
1. Analytical Challenges:
• Big data poses some huge analytical challenges, which raise
questions such as: how do you deal with a problem when the
data volume gets too large?
• Or how do you find the important data points?
• Or how do you use the data to its best advantage?
• The large amounts of data on which this type of analysis is
done can be structured (organized data), semi-structured
(semi-organized data) or unstructured (unorganized data).
There are two approaches through which decision making can
be done:
• Either incorporate massive data volumes in the
analysis.
• Or determine upfront which Big Data is relevant.
2. Technical challenges:
• Quality of data:
• When there is a collection of a large amount of
data and storage of this data, it comes at a cost.
Big companies, business leaders and IT leaders
always want large data storage.
• For better results and conclusions, Big Data focuses on
storing quality data rather than irrelevant data.
• This further raises the questions of how to ensure that
the data is relevant, how much data is enough for
decision making, and whether the stored data is
accurate or not.
• Fault tolerance:
• Fault tolerance is another technical challenge
and fault tolerance computing is extremely hard,
involving intricate algorithms.
• Nowadays, newer technologies like cloud computing and
big data are intended to ensure that whenever a failure
occurs, the damage done stays within an acceptable
threshold, i.e., the whole task does not have to begin
again from scratch.
• Scalability:
• Big data projects can grow and evolve rapidly. The
scalability issue of Big Data has led towards cloud
computing.
• It raises various challenges, like how to run and schedule
various jobs so that the goal of each workload is
achieved cost-effectively.
• It also requires dealing with system failures in an
efficient manner. This again raises the big question of
what kinds of storage devices are to be used.
Big Data Analytics in Smart Cities
Traffic Management
An increasing population and number of vehicles will give rise to traffic issues. Smart
city solutions can harness real-time data to deliver improved mobility and resolve
traffic-related problems. In the coming years, traffic management will become more
significant. As per one estimate, the revenue of traffic-focused technology for the
smart city will be valued at $4.4 billion by 2023.
The smarter traffic system is based on big data analytics solutions. These
solutions assist the administration to fetch and share the information across
various departments to take prompt action. Big data solutions are designed to
gather all sorts of information using sensors for ensuring real-time traffic
control. Also, these solutions can predict traffic trends based on mathematical
models and current scenario.
Public Safety
In any big or small city, public safety is highly important. Data insights can be
used to identify the most possible reasons for crime and find vulnerable areas
in the city. Data-driven smart cities can readily unfold any criminal offenses and
the administration can implement safety-related measures more effectively.
Once the high-crime-prone zones are defined, it is easy for law enforcement
agencies to act in a ‘smarter’ way to prevent crimes before they occur.
Emergency services will also get a boost from real-time data. For example, it is
possible for police, fire, and ambulance services to rush to the spot immediately with
the help of real-time data about any unwanted incident. Different data sets can also
help handle crisis situations during natural or man-made calamities.
Public Spending
A lot of investment from the government is necessary for transforming any city.
In a smart city, city planners can spend money either on renovating or
redesigning the city. Before spending money and allocating the budget for a
smart city, it is better to go through the collected data. The data can suggest the
major areas that need upgrades or revamping. Also, big data analytics solutions
can predict the situation of particular areas in the future. It can give the city
planners a clear idea of how they can invest in such areas to get the maximum
value for money.
City Planning
Big data solutions aim at providing continuous updates about the necessary
change. In a way, these solutions ascertain the growth of a smart city. It is
necessary for the smart city to implement desired features and infrastructure-
based facilities to attract and retain more people. In other words, sustainability
can be achieved by proper monitoring and control over the city’s infrastructure.
Big data analytics can play a big role in achieving this objective by collecting
real-time information.
Ch – 2
Q. 1)Draw and Explain HDFS architecture. How can you restart Name Node & all
the daemons in Hadoop? 07(4)
Ans.
Q. 2)What is MapReduce? Explain working of various phases of MapReduce with
appropriate example and diagram.07(4)
Ans.
Q. 3)What is Hadoop? Briefly explain the core components and advantages of it.
07 (3)
Ans.
Advantages of Hadoop
• Varied Data Sources. Hadoop accepts a variety of data. ...
• Cost-effective. Hadoop is an economical solution as it uses a cluster
of commodity hardware to store data. ...
• Performance. ...
• Fault-Tolerant. ...
• Highly Available. ...
• Low Network Traffic. ...
• High Throughput. ...
• Open Source.
->In Hadoop, data serialization is used to convert data objects into a format that
can be stored in Hadoop Distributed File System (HDFS) and processed by
MapReduce.
->Improved performance: Serialized data is smaller and can be read and written
faster than non-serialized data.
->Complexity: The process of serializing and deserializing data can be complex and
time-consuming.
->Limited flexibility: The chosen serialization format may not be well-suited for all
types of data or use cases.
->Increased memory usage: Serialized data takes up more memory than non-
serialized data.
Hadoop serialization is also known for its efficient data compression and data
transfer, which reduces the network I/O and storage costs. However, it can also
increase the CPU load when deserializing and serializing large data sets, which
can be a drawback.
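As a rough illustration of why a compact serialized form is smaller and faster to read or write, the Python sketch below compares a verbose JSON encoding with a fixed-size binary packing of the same records. This is only conceptual; it is not Hadoop's actual Writable/serialization format, and the records are made up.
Example (illustrative Python sketch):
import json
import struct

records = [(1, 250.5), (2, 300.0), (3, 125.75)]   # hypothetical (id, value) pairs

# Verbose, human-readable encoding
json_form = json.dumps([{"id": r, "value": v} for r, v in records]).encode("utf-8")
# Compact binary encoding: 4-byte int + 4-byte float per record
binary_form = b"".join(struct.pack(">if", r, v) for r, v in records)

print("JSON bytes:  ", len(json_form))     # larger on the wire and on disk
print("binary bytes:", len(binary_form))   # smaller, so faster to transfer and store

# Deserialization reverses the process
for i in range(0, len(binary_form), 8):
    r, v = struct.unpack(">if", binary_form[i:i + 8])
    print(r, v)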
Q. 5)How is Big data and Hadoop related? 03
Ans.
Big data and Hadoop are interrelated. In simple words, big data is massive
amount of data that cannot be stored, processed, or analyzed using the
traditional methods. Big data consists of vast volumes of various types of
data which is generated at a high speed. To overcome the issue of storing,
processing, and analyzing big data, Hadoop is used.
Q. 14)What is the role of a “combiner” in MapReduce? Explain with the help of an example. 07
Ans.
In the above example, we can see that the two Mappers contain different data:
the main text file is divided between two different Mappers. Each mapper is assigned
to process a different line of our data. In our example, we have two lines
of data, so we have two Mappers to handle each line. Mappers produce the
intermediate key-value pairs, where the name of the particular word is key and
its count is its value. For example, for the data "Geeks For Geeks For", the key-
value pairs are shown below.
// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-
value pairs or intermediate output of the Mapper. Now we can minimize the
number of these key-value pairs by introducing a combiner for each Mapper in
our program. In our case, each Mapper generates 4 key-value pairs. Since feeding
these intermediate key-value pairs directly to the Reducer would increase network
congestion, the Combiner combines them before sending them to the Reducer.
The combiner combines these intermediate key-value pairs as per their key. For
the data "Geeks For Geeks For" above, the combiner will partially reduce them by
merging the pairs that share the same key and generate new key-value pairs as
shown below.
// Partially reduced key-value pairs with combiner
(Geeks,2)
(For,2)
With the help of Combiner, the Mapper output got partially reduced in terms of
size(key-value pairs) which now can be made available to the Reducer for better
performance. Now the Reducer will again Reduce the output obtained from
combiners and produces the final output that is stored on HDFS(Hadoop
Distributed File System).
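The same word-count data flow can be sketched in plain Python (this is only an illustration of the idea, not actual Hadoop MapReduce code; the two input lines mirror the example above).
Example (illustrative Python sketch):
from collections import Counter

lines = ["Geeks For Geeks For", "Geeks For Geeks For"]   # one line per "Mapper"

def mapper(line):
    # Emit (word, 1) for every word: the intermediate key-value pairs
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Local reduce on a single mapper's output, before anything crosses the network
    return list(Counter(word for word, _ in pairs).items())

def reducer(all_pairs):
    # Final aggregation of the partially reduced pairs from all mappers
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

combined = []
for line in lines:
    intermediate = mapper(line)               # [('Geeks',1), ('For',1), ('Geeks',1), ('For',1)]
    combined.extend(combiner(intermediate))   # [('Geeks',2), ('For',2)]

print(reducer(combined))                      # {'Geeks': 4, 'For': 4}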
Ans. copyToLocal (or) get: To copy files/folders from hdfs store to local file
system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
mv: This command is used to move files within hdfs. Let's cut-paste a
file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
Ans.
Data Integrity ensures the correctness of the data. However, it is possible that the
data will get corrupted during I/O operation on the disk. Corruption can occur due
to various reasons like faults in a storage device, network faults, or buggy
software.Hadoop HDFS framework implements checksum checking on the
contents of HDFS files. In Hadoop, when a client creates an HDFS file, it computes a
checksum of each block of file and stores these checksums in a separate hidden file
in the same HDFS namespace.
When the HDFS client retrieves file contents, it first verifies that the data it received
from each DataNode matches the checksum stored in the associated checksum file.
If not, the client can opt to retrieve that block from another DataNode that has a
replica of that block.
1) Data Integrity means to make sure that no data is lost or corrupted during
storage or processing of the Data.
5) It is possible that it’s the checksum that is corrupt, not the data, but this is very
unlikely, because the checksum is much smaller than the data
7) DataNodes are responsible for verifying the data they receive before storing the
data and its checksum. Checksum is computed for the data that they receive from
clients and from other DataNodes during replication
8) Hadoop can heal corrupted data by copying one of the good replicas to produce a
new, uncorrupted replica.
9) If a client detects an error when reading a block, it reports the bad block and the
DataNodes it was trying to read from to the NameNode before throwing a
ChecksumException.
10) The NameNode marks the block replica as corrupt so it doesn’t direct any more
clients to it or try to copy this replica to other DataNodes.
11) It then schedules a copy of the block from a good replica to another DataNode,
so that its replication factor is back at the expected level. A conceptual sketch of
checksum verification follows below.
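The sketch below shows the general checksum idea in Python, not HDFS's real implementation: compute a CRC32 per fixed-size chunk when data is written, and recompute and compare the checksums on read. The chunk size and data are illustrative.
Example (illustrative Python sketch):
import zlib

def checksums(data: bytes, chunk_size: int = 512):
    # Compute a CRC32 checksum for every chunk of the data as it is written
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

def verify(data: bytes, stored, chunk_size: int = 512) -> bool:
    # On read, recompute the checksums and compare with the stored ones
    return checksums(data, chunk_size) == stored

original = b"some file contents " * 100
stored = checksums(original)

corrupted = bytearray(original)
corrupted[10] ^= 0xFF                      # flip bits to simulate disk corruption

print(verify(original, stored))            # True  -> data is intact
print(verify(bytes(corrupted), stored))    # False -> read from another replica instead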
Q. 1)What are the features of MongoDB and How MongoDB is better than SQL database? 04 (4)
Ans.
Differences in Terminology
There are differences based on terminology between MongoDB and MySQL.
Data Representation
The way data is represented and stored in the two databases is quite
different.
MongoDB stores data in form of JSON-like documents and MySQL stores data
in form of rows of table as mentioned earlier.
Example: To show how data is stored and represented in MongoDB and
MySQL.
Ans.
Q. 3)Explain Replication and scaling feature of MongoDB. 04 (2)
Ans.
Ans.
SQL vs NoSQL
• SQL: Relational Database Management System (RDBMS); NoSQL: non-relational
or distributed database system.
• SQL databases have a fixed, static or predefined schema; NoSQL databases have
a dynamic schema.
• SQL databases are not suited for hierarchical data storage; NoSQL databases are
best suited for hierarchical data storage.
• SQL databases are best suited for complex queries; NoSQL databases are not so
good for complex queries.
• SQL databases are vertically scalable; NoSQL databases are horizontally scalable.
• SQL follows the ACID properties; NoSQL follows CAP (consistency, availability,
partition tolerance).
• SQL examples: MySQL, PostgreSQL, Oracle, MS-SQL Server, etc.; NoSQL
examples: MongoDB, GraphQL, HBase, Neo4j, Cassandra, etc.
1. Structure –
SQL databases are table-based on the other hand NoSQL databases
are either key-value pairs, document-based, graph databases or wide-
column stores. This makes relational SQL databases a better option
for applications that require multi-row transactions such as an
accounting system or for legacy systems that were built for a relational
structure.
2. Property followed –
SQL databases follow ACID properties (Atomicity, Consistency,
Isolation and Durability) whereas the NoSQL database follows the
Brewers CAP theorem (Consistency, Availability and Partition
tolerance).
3. Support –
Great support is available for all SQL databases from their vendors.
There are also a lot of independent consultants who can help you with
SQL databases for very large-scale deployments. For some NoSQL
databases, however, you still have to rely on community support, and
only limited outside experts are available for setting up and deploying
your large-scale NoSQL deployments.
Q. 5)What is NoSQL database? List the differences between NoSQL and relational databases. Explain
various types of NoSQL databases . 07(2)
Ans.
• Document-based databases
• Key-value stores
• Column-oriented databases
• Graph-based databases
Document-Based Database:
A document-based database stores data as JSON-like documents instead of rows
and columns.
Key-Value Stores:
A key-value store is like a relational database with only two columns: the key and
the value.
Key features of the key-value store:
• Simplicity.
• Scalability.
• Speed.
Graph-Based databases:
Ans.
Q. 7)Difference between master-slave versus peer-to-peer distribution models. 03
-> In a master-slave model, there is a central authority, or "master," that controls and manages all the
other nodes, or "slaves," in the network.
-> The master node is responsible for distributing tasks and coordinating the work of the slave nodes.
->This model is often used in distributed computing systems, where a central server is responsible for
managing a group of worker nodes.
-> Decision making is straightforward because it is handled by the master node only.
-> However, it can have a single point of failure and a bottleneck at the master node.
->a peer-to-peer (P2P) distribution model does not have a central authority.
->Instead, all nodes in the network are equal and can communicate with each other directly. Each node
acts as both a client and a server, making requests and providing resources to other nodes.
-> This model is often used in decentralized systems, such as file sharing networks.
->It can have issues when it comes to decision making, coordination and consistency.
Q. ) In which four ways do NoSQL systems handle big data problems? 07
Ans.
Ans.
Ans.
Ans.
The term NewSQL categorizes databases that combine the relational model with
advances in scalability and flexibility for different types of data. These databases
focus on features that are not present in NoSQL, such as a strong consistency
guarantee. They cover two layers of data: a relational one and a key-value store.
NoSQL vs NewSQL
1. NoSQL is a schema-free database; NewSQL is schema-fixed as well as
schema-free.
2. NoSQL is horizontally scalable; NewSQL is also horizontally scalable.
3. NoSQL possesses automatic high availability; NewSQL possesses built-in
high availability.
4. NoSQL supports cloud, on-disk, and cache storage; NewSQL fully supports
cloud, on-disk, and cache storage.
5. NoSQL promotes CAP properties; NewSQL promotes ACID properties.
6. NoSQL does not support Online Transactional Processing; NewSQL fully
supports Online Transactional Processing.
7. NoSQL has low security concerns; NewSQL has moderate security concerns.
8. Use cases: NoSQL – Big Data, social network applications, and IoT;
NewSQL – e-commerce, telecom industry, and gaming.
9. Examples: NoSQL – DynamoDB, MongoDB, RavenDB, etc.; NewSQL – VoltDB,
CockroachDB, NuoDB, etc.
Ans.
Sharding means to distribute data on multiple servers, here a large amount of data is partitioned into
data chunks using the shard key, and these data chunks are evenly distributed across shards that reside
across many physical servers.
Sharding, which is also known as data partitioning, works on the same concept as
sharing slices of a pizza. It is basically a database architecture pattern in
which we split a large dataset into smaller chunks (logical shards) and we
store/distribute these chunks in different machines/database nodes (physical
shards). Each chunk/partition is known as a “shard” and each shard has the
same database schema as the original database. We distribute the data in such
a way that each row appears in exactly one shard. It’s a good mechanism to
improve the scalability of an application.
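A minimal Python sketch of the routing idea: hash the shard key and use the hash to pick which shard (server) holds that row. The shard names and records below are made up for illustration; real systems such as MongoDB also support range-based sharding and rebalancing.
Example (illustrative Python sketch):
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]       # each would be a separate server in practice

def shard_for(shard_key: str) -> str:
    # Hash the shard key and map it onto one of the shards
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

orders = [{"order_id": "A17", "amount": 120},
          {"order_id": "B42", "amount": 75},
          {"order_id": "C08", "amount": 310}]

for order in orders:
    # Every row lives on exactly one shard; all shards share the same schema
    print(order["order_id"], "->", shard_for(order["order_id"]))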
Map Reduce
Map reduce is used for aggregating results over large volumes of data. Map
reduce has two main functions: a map function that groups all the documents, and
a reduce function that performs operations on the grouped data.
Syntax:
db.collectionName.mapReduce(mappingFunction, reduceFunction,
{out:'Result'});
Indexing in MongoDB :
MongoDB uses indexing in order to make the query processing more efficient. If
there is no indexing, then the MongoDB must scan every document in the
collection and retrieve only those documents that match the query. Indexes are
special data structures that store some information related to the documents
such that it becomes easy for MongoDB to find the right data file. The indexes
are ordered by the value of the field specified in the index.
Creating an Index :
MongoDB provides a method called createIndex() that allows user to create an
index.
Syntax –
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on the basis of which you want to create an index
and 1 (or -1) determines the order in which these indexes will be
arranged(ascending or descending).
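The same operation can be done from Python using the pymongo driver (this assumes pymongo is installed and a MongoDB server is running on localhost; the database, collection, and field names are hypothetical).
Example (Python sketch):
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017/")
collection = client["school"]["students"]          # hypothetical database and collection

# 1 / ASCENDING and -1 / DESCENDING control the sort order of the index
collection.create_index([("roll_no", ASCENDING)])
collection.create_index([("grade", DESCENDING)])

print(collection.index_information())              # lists the indexes on the collection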
Q. 15)Which terms are used for table, row, column, and table-join in MongoDB 03
Ans.
Ch-4
Ans.
1. Apache Samza
• Apache Samza is an open source, near-real time, asynchronous computational framework for stream
processing developed by the Apache Software Foundation in Scala and Java.
• Samza allows users to build stateful applications that process data in real-time from multiple sources
including Apache Kafka.
• Samza provides fault tolerance, isolation and stateful processing. Samza is used by multiple
companies. The biggest installation is in LinkedIn.
2. Apache Flink
• The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala.
• Flink provides a high-throughput, low-latency streaming engine as well as support for event-time
processing and state management.
• Flink does not provide its own data-storage system, but provides data-source and sink connectors to
systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elastic Search.
Ans.
Q. 4)How Graph Analytics used in Big Data. 03
Ans.
Q. 5)Write a short note on Decaying Window. 04
Ans.
Q. 6)Explain with example: How to perform Real Time Sentiment Analysis of any product. 07
Ans.
CH-5
Ans.
Ans.
The major components of Hive and its interaction with the Hadoop is
demonstrated in the figure below and all the components are described further:
• User Interface (UI) –
As the name describes User interface provide an interface between
user and hive. It enables user to submit queries and other operations
to the system. Hive web UI, Hive command line, and Hive HD Insight
(In windows server) are supported by the user interface.
• Driver –
The driver receives the user's queries after the interface, within Hive.
The concept of session handles is implemented by the driver. Execute
and fetch APIs modelled on JDBC/ODBC interfaces are provided by the
driver.
• Compiler –
The compiler parses the query and performs semantic analysis on the
different query blocks and query expressions. It eventually generates an
execution plan with the help of the table and partition metadata obtained
from the metastore.
• Metastore –
All the structured data or information of the different tables and
partition in the warehouse containing attributes and attributes level
information are stored in the metastore. It also holds the serializers and
de-serializers necessary to read and write data, and the corresponding
HDFS files where the data is stored. Hive selects corresponding database servers
to stock the schema or Metadata of databases, tables, attributes in a
table, data types of databases, and HDFS mapping.
• Execution Engine –
Execution of the execution plan made by the compiler is performed in
the execution engine. The plan is a DAG of stages. The execution engine
manages the dependencies between the various stages of the plan and
executes these stages on the suitable system components.
Ans.
Ans.
Q. 5)What do you mean by HiveQL DDL? Explain any 3 HiveQL DDL command with its Syntax & example.
07 (3)
Ans. Hive Data Definition Language (DDL) is a subset of Hive SQL statements that describe
the data structure in Hive by creating, deleting, or altering schema objects such as
databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the
keywords CREATE , DROP , or ALTER .
For example, the syntax for creating a database is:
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path];
Ans.
HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have
multiple column families and each column family can have any number of columns.
Subsequent column values are stored contiguously on the disk. Each cell value of the
table has a timestamp. In short, in an HBase:
Ans.
HDFS vs HBase
• HDFS provides high-latency access operations; HBase provides low-latency
access to small amounts of data.
Ans.
Ans.
Ans.
Q. 11)What are WAL, MemStore, Hfile and Hlog in HBase? 04
Ans.
Write Ahead Log (WAL) is a file used to store new data that is yet to be put on permanent
storage. It is used for recovery in the case of failure. When a client issues a put request, it will
write the data to the write-ahead log (WAL).
Memstore is just like a cache memory. Anything that is entered into the HBase is stored here
initially. Later, the data is transferred and saved in Hfiles as blocks and the memstore is flushed.
When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a
new HFile in HDFS
HFile is the internal file format for HBase to store its data. These are the first two
lines of the description of HFile from its source code: File format for hbase. A file of
sorted key/value pairs. Both keys and values are byte arrays.
HLog contains entries for edits of all regions performed by a particular Region
Server. WAL abbreviates Write Ahead Log, into which all the HLog edits are
written immediately. With deferred log flush, WAL edits remain in memory until
the flush period.
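A greatly simplified Python sketch of the write path described above (purely conceptual, not real HBase code; the tiny flush threshold is only for illustration): every put is appended to a write-ahead log, buffered in a MemStore, and flushed as a sorted, HFile-like set when the buffer is large enough.
Example (illustrative Python sketch):
FLUSH_THRESHOLD = 3      # flush after this many buffered edits (tiny, for illustration)

wal = []                 # write-ahead log: every edit is recorded here first, for recovery
memstore = {}            # in-memory buffer of recent edits
hfiles = []              # immutable files of sorted key-value pairs

def flush():
    # Write the sorted key-value set out as a new "HFile" and clear the MemStore
    hfiles.append(sorted(memstore.items()))
    memstore.clear()

def put(row_key, value):
    wal.append((row_key, value))     # 1. record the edit in the WAL
    memstore[row_key] = value        # 2. keep it in memory for fast reads
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()                      # 3. persist when the MemStore is large enough

for key, val in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row4", "d")]:
    put(key, val)

print("HFiles:  ", hfiles)           # [[('row1', 'a'), ('row2', 'b'), ('row3', 'c')]]
print("MemStore:", memstore)         # {'row4': 'd'} still buffered, protected by the WAL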
Q. 12)Explain the concept of regions in HBase and storing Big data with HBase. 07
Ans.
Region Server –
HBase tables are divided horizontally by row-key range into Regions.
Regions are the basic building blocks of an HBase cluster: they hold the
distributed portions of tables and are comprised of column families. A
Region Server runs on an HDFS DataNode present in the Hadoop cluster.
A Region Server is responsible for several things for its set of regions, such
as handling, managing, and executing read and write HBase operations on
that set of regions. The default size of a region is 256 MB.
In the column-Oriented data storage approach, the data is stored and retrieved
based on the columns. Thus the problem which we were facing in the case of
the row-oriented approach has been solved because in the column-oriented
approach we can filter out the data which are required to us from the whole set
of data with the help of the corresponding columns. In the column-oriented
approach, read and write operations on individual records are slower, but it is
efficient when performing operations over the entire database, and hence it
permits very high compression rates.
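The contrast can be sketched with plain Python data structures (purely conceptual; the records are the made-up student rows used earlier in these notes).
Example (illustrative Python sketch):
# Row-oriented layout: each record's values are stored together
rows = [
    {"name": "Geek1", "class": 11, "grade": "A"},
    {"name": "Geek2", "class": 11, "grade": "B"},
    {"name": "Geek3", "class": 11, "grade": "A"},
]

# Column-oriented layout: all values of one column are stored together
columns = {
    "name":  ["Geek1", "Geek2", "Geek3"],
    "class": [11, 11, 11],
    "grade": ["A", "B", "A"],
}

# Scanning a single column touches only that column's data, and runs of repeated
# values (like "class") compress very well
print(columns["grade"])

# Reading back one full record, however, must gather a value from every column
print({col: values[0] for col, values in columns.items()})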
Q. 13)Explain Pig data Model in detail and Discuss how it will help for effective data flow. 07
Ans.
Q. 14)Describe data processing operators in Pig. 04
Ans.
Ans.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with
storage systems and memory management.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured
data.
o It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java
objects and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON (a
PySpark sketch follows below).
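A minimal PySpark sketch of Spark SQL (assumes pyspark is installed; the data, view name, and column names are made up for illustration).
Example (Python sketch):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-sql-demo").getOrCreate()

# A small DataFrame of structured data
df = spark.createDataFrame(
    [("Geek1", 11, "A"), ("Geek2", 11, "B"), ("Geek3", 11, "A")],
    ["name", "section", "grade"],
)

df.createOrReplaceTempView("students")        # expose the DataFrame to SQL
spark.sql("SELECT grade, COUNT(*) AS n FROM students GROUP BY grade").show()

spark.stop()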
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant
processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused
to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example
of a data stream (a PySpark sketch follows below).
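A word-count sketch over mini-batches using the classic DStream API (deprecated in newer Spark releases in favour of Structured Streaming). It assumes pyspark is installed and that text is being sent to localhost port 9999, e.g. with nc -lk 9999; the port and batch interval are arbitrary.
Example (Python sketch):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(master="local[2]", appName="streaming-demo")
ssc = StreamingContext(sc, 5)                      # a new mini-batch every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)    # the incoming data stream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # ordinary RDD-style transformations
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()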
MLlib
o The MLlib is a Machine Learning library that contains various machine learning
algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform graph-
parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to each
vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph,
joinVertices, and aggregateMessages.
Ans.
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller than
the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
String Types
String type data types can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-types escape characters.
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision values. The syntax is as follows:
DECIMAL(precision, scale)
Null Value
Missing values are represented by the special value NULL.
CH-6
Q. 1)Write a short note on Spark stack and Explain Components of Spark. 07 (3)
Ans.
Q. 2)Write a short note on Spark. What are the advantages of using Apache Spark over Hadoop? 04 (3)
Ans.done
Ans.
Actions: Actions are operations that also apply to an RDD, but they instruct
Spark to perform the computation and send the result back to the driver. A
PySpark example of transformations and actions appears after the list below.
The Transformations and Actions in Apache Spark are divided into 4 major
categories:
• General
• Mathematical and Statistical
• Set Theory and Relational
• Data-structure and IO
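A minimal PySpark sketch of the difference (assumes pyspark is installed): filter() and map() are transformations that only build a lazy execution plan, while collect() and count() are actions that actually run the computation and return results to the driver.
Example (Python sketch):
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="rdd-demo")

numbers = sc.parallelize(range(1, 11))

# Transformations: nothing is executed yet, only the plan is recorded
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions: these trigger the computation and return results to the driver
print(evens_squared.collect())   # [4, 16, 36, 64, 100]
print(evens_squared.count())     # 5

sc.stop()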
Ans.
Q. 5)What is RDD in Apache Spark? Why RDD is better than Map Reduce data storage?(2)
Ans.
Apache Spark is very much popular for its speed. It runs 100 times faster in
memory and ten times faster on disk than Hadoop MapReduce since it
processes data in memory (RAM). At the same time, Hadoop MapReduce has
to persist data back to the disk after every Map or Reduce action.
Apache Spark can schedule all its tasks by itself, while Hadoop MapReduce
requires external schedulers like Oozie.
• Streaming Data:
• Machine Learning:
• Fog Computing:
• Apache Spark at Alibaba:
• Apache Spark at MyFitnessPal:
• Apache Spark at TripAdvisor:
• Apache Spark at Yahoo:
• Finance:
Q. 10)Discuss Spark Streaming with suitable example such as analyzing tweets from Twitter. 07
Ans.
Q. 11)What are the problems related to Map Reduce data storage? How Apache Spark solves it using
RDD? 07
Ans.
Apache MapReduce is a distributed data processing framework which uses the MapReduce
algorithm for processing large volumes of data distributed across multiple nodes in a
Hadoop cluster.
It is a data processing technology and not a data storage technology. HDFS (Hadoop
Distributed File System) is a data storage technology which stores large volumes of data in a
distributed manner across multiple nodes in a Hadoop cluster.
Apache MapReduce reads large volumes of data from the disk, executes the mapper tasks
and then the reducer tasks and stores the output back on the disk.
The intermediate output of the mapper task is also stored on the local disk. This results in
latency as writing the data to the disk (as part of the map task) and reading the data from
the disk (for it to be used in the reduce task) is a time consuming operation as disk reads
are I/O expensive.
MapReduce provides high throughput along with high latency. Hence it is not suitable for:
• Iterative tasks
• Interactive queries
RDD avoids all of the reading/writing to HDFS. By significantly reducing I/O operations,
RDD offers a much faster way to retrieve and process data in a Hadoop cluster. In fact, it's
estimated that Hadoop MapReduce apps spend more than 90% of their time performing
reads/writes to HDFS.
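A sketch of how an RDD avoids that repeated disk I/O for iterative work (assumes pyspark is installed; the HDFS path and the iterative update rule are hypothetical, just to show the pattern): the data is read and parsed once, cached in memory, and then reused across iterations instead of being re-read from disk, as a chain of MapReduce jobs would require.
Example (Python sketch):
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="iterative-demo")

points = (sc.textFile("hdfs:///data/points.txt")          # hypothetical input path
            .map(lambda line: float(line.split(",")[0]))
            .cache())                                      # keep the parsed RDD in memory

estimate = 0.0
for _ in range(10):
    # Each iteration reuses the cached RDD instead of re-reading it from HDFS
    estimate = 0.5 * (estimate + points.mean())

print(estimate)
sc.stop()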