Module 5

Uploaded by sonia

1. Compare relational databases with NoSQL databases.

 Relational databases handle data arriving at low velocity; NoSQL handles data arriving at high velocity.
 Relational databases give only read scalability; NoSQL gives both read and write scalability.
 Relational databases manage structured data; NoSQL manages all types of data.
 In relational databases, data arrives from one or a few locations; in NoSQL, data arrives from many locations.
 Relational databases support complex transactions; NoSQL supports simple transactions.
 Relational databases have a single point of failure; NoSQL has no single point of failure.
 Relational databases handle data in low volume; NoSQL handles data in high volume.
 In relational databases, transactions are written in one location; in NoSQL, transactions are written in many locations.
 Relational databases support ACID properties; NoSQL typically does not guarantee ACID compliance.
 In a relational database it is difficult to make changes once the schema is defined; NoSQL enables easy and frequent changes to the database.
 Relational databases require a schema to store data; NoSQL does not require schema design.
 Relational databases are scaled vertically; NoSQL databases are scaled horizontally.
2. What is the purpose of sharding? What is the difference between replication and
sharding?

Ans. Sharding is a technique for providing horizontal scalability by allowing different sites
to hold different subsets of the data. This scalability helps reduce the workload on each server.
Replication is the process of copying the same data across different sites, while sharding is
the process of distributing different datasets across different sites. In addition, sharding improves
both read and write performance, while replication improves read performance but not write
performance.
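
The contrast above can be sketched in a few lines of plain Python (hypothetical code, not any particular database's API): sharding places each record on exactly one server, while replication copies every record to every server.

```python
# Minimal sketch: routing records to shards by hashing a shard key,
# versus copying every record to all replicas.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard deterministically by hashing the shard key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Sharding: each record lands on exactly one of the three shards.
shards = [[] for _ in range(3)]
for user in ["alice", "bob", "carol", "dave"]:
    shards[shard_for(user, 3)].append(user)

# Replication: every record is copied to all three replicas.
replicas = [["alice", "bob", "carol", "dave"] for _ in range(3)]

print(sum(len(s) for s in shards))    # 4 records total, split across shards
print(sum(len(r) for r in replicas))  # 12 copies: 4 records x 3 replicas
```

Because each shard holds only a fraction of the data, both reads and writes spread across servers; the replicas, by contrast, must all apply every write, which is why replication alone does not improve write performance.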

3. Explain the CAP theorem.


Ans. In distributed databases, the three important aspects of the CAP theorem
are Consistency (C), Availability (A), and Partition tolerance (P). Consistency
means that every read receives the most recent write; Availability means that
every request receives a response; and Partition tolerance means that the system
keeps operating even when the network between nodes fails. The CAP theorem
states that when a partition occurs, a distributed system can guarantee only two
of the three properties, and since network partitions cannot be ruled out in
practice, the designer must effectively choose between consistency and availability.
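
The consistency/availability trade-off during a partition can be illustrated with a toy sketch (hypothetical code, not a real database API): a "CP" node refuses possibly stale reads, an "AP" node answers anyway.

```python
# Toy illustration of a node's read behaviour when it is cut off
# from the other replicas by a network partition.
def handle_read(partitioned: bool, mode: str, local_value: str) -> str:
    if not partitioned:
        return local_value        # no partition: consistency and availability both hold
    if mode == "CP":
        # Choose consistency: give up availability rather than serve stale data.
        raise TimeoutError("refusing possibly stale read during partition")
    # Choose availability ("AP"): answer from the local copy, which may be stale.
    return local_value

print(handle_read(False, "CP", "v1"))  # v1
print(handle_read(True, "AP", "v0"))   # v0 (possibly stale)
```

A CP system such as a quorum-based store errors out or blocks during the partition; an AP system keeps answering and reconciles the copies later.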

4. Explain the ways in which data can be distributed.


Ans. Data distribution can be performed in the following two ways:
 Through sharding―Sharding is one of the major techniques of data
distribution. It distributes different subsets of the data across multiple servers,
so each server acts as the single source for its subset of the data.
 Through replication―Replication is one of the major techniques for fault
tolerance. The idea is to copy data across multiple servers so that each piece of data can
be found in multiple places. Replication occurs in two forms:
 Master-slave replication, which makes one node the authoritative copy
that handles writes, while the slaves, which are synchronized with the master, handle
reads.
 Peer-to-peer replication, which allows writes to any node without a
designated master. Here, the nodes coordinate with each other to
synchronize their copies of the data.
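
The master-slave form described above can be sketched as follows (hypothetical classes for illustration, not a real driver API): writes go only to the master, which pushes them to the slaves, and reads are served from any slave.

```python
# Minimal master-slave replication sketch: the master is the
# authoritative copy for writes; synchronized slaves handle reads.
class Slave:
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)

class Master:
    def __init__(self, slaves):
        self.data, self.slaves = {}, slaves
    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:       # synchronize every replica after a write
            slave.data[key] = value

slaves = [Slave(), Slave()]
master = Master(slaves)
master.write("profile:42", "alice")
print(slaves[0].read("profile:42"))  # alice: the write is visible on every copy
```

In peer-to-peer replication there would be no distinguished `Master`; any node would accept the write and then gossip it to the others.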

5. List the three important aspects of CAP theorem


Consistency (C), Availability (A), and Partition tolerance (P)

6. Demonstrate how the concept of materialized views can be effectively utilized
within NoSQL databases to optimize query performance and support real-time
analytics in a specific use case or application scenario.

Materialized views differ slightly from normal views: they are disk based and are
updated periodically according to the requirements of the query. This is an advantage
because querying a materialized view is essentially querying a table, which
can be indexed. Creating materialized views in the form of aggregate tables or copies
of frequently executed queries can speed up response time. A disadvantage of a
materialized view is that the data obtained from it is only as current as its last
refresh. Materialized views are mostly used in BI applications or Big Data,
where query response time is a basic need.
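
The trade-off above can be shown with a plain-Python stand-in for a materialized view (the data and names are illustrative; a real database would persist the aggregate to disk): queries hit a precomputed table, which is fast but only as fresh as its last refresh.

```python
# A materialized view as a precomputed aggregate table over raw orders.
orders = [("alice", 30), ("bob", 20), ("alice", 50)]
view = {}  # customer -> total spend, the "materialized" aggregate

def refresh_view():
    """Recompute the aggregate, like a REFRESH of a materialized view."""
    view.clear()
    for customer, amount in orders:
        view[customer] = view.get(customer, 0) + amount

refresh_view()
print(view["alice"])       # 80, answered straight from the precomputed table

orders.append(("alice", 10))
print(view["alice"])       # still 80: the view is stale until refreshed
refresh_view()
print(view["alice"])       # 90 after the refresh
```

The query never scans the raw `orders` list at read time, which is exactly why aggregate tables speed up BI-style workloads; the cost is the staleness between refreshes.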

7. List the various functions of Sqoop.

a. Data Import
b. Data Export
c. Parallel Data Transfer
d. Incremental Data Transfer
e. Customized Data Mapping
f. Support for Various Data Sources
g. Integration with Hadoop Ecosystem
h. Compression and Serialization
i. Security
j. Extensibility
k. Job Scheduling
8. Describe some applications of the clustering technique in Mahout.

a. Recommendation Systems
b. Document Classification
c. Customer Segmentation
d. Anomaly Detection
e. Image and Video Analysis
f. Network Analysis
g. Natural Language Processing (NLP)
h. Image Compression
i. Spatial Data Analysis
j. Fraud Detection
k. Data Preprocessing
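
Several of these applications (customer segmentation, anomaly detection, image compression) rest on the same k-means idea. Mahout runs such algorithms at scale on Hadoop; the tiny one-dimensional version below is only a pure-Python illustration of the concept, not Mahout's API.

```python
# Toy 1-D k-means: alternate between assigning points to their nearest
# center and moving each center to the mean of its assigned points.
def kmeans_1d(points, centers, rounds=10):
    for _ in range(rounds):
        # Assignment step: group each point with its nearest center.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Two obvious clusters: low spenders around 2, high spenders around 20.
print(kmeans_1d([1, 2, 3, 19, 20, 21], centers=[0.0, 10.0]))  # [2.0, 20.0]
```

For customer segmentation, each point would be a customer feature (here, spend), and the converged centers describe the segments.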

9. Differentiate Apache Flume and Apache Sqoop with respect to failure handling.
Apache Flume is well-suited for real-time, event-based data streaming with a focus on
robust failure handling and guaranteed event delivery. On the other hand, Apache
Sqoop is designed for batch-oriented data transfer between Hadoop and structured
data stores, with a focus on data integrity but not real-time processing or fine-grained
failure handling. The choice between the two tools depends on the specific
requirements and use cases of the data transfer operation.

10. Name the data model that can be used for social network mining.
The data model commonly used for social network mining is the graph data model,
in which people are represented as nodes and their relationships as edges.
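
A minimal sketch of this graph model (plain Python with made-up data): friendships form an adjacency structure, and typical mining queries such as mutual friends or friend suggestions are walks over the edges.

```python
# A tiny social graph as an adjacency map: node -> set of neighbours.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

def mutual_friends(a, b):
    """People connected to both a and b: a basic social-mining query."""
    return friends[a] & friends[b]

def suggest_friends(person):
    """Friends-of-friends who are not already friends: simple link prediction."""
    candidates = set().union(*(friends[f] for f in friends[person]))
    return candidates - friends[person] - {person}

print(sorted(mutual_friends("alice", "bob")))  # ['carol']
print(sorted(suggest_friends("alice")))        # ['dave']
```

Graph databases and graph-processing frameworks provide the same operations as first-class queries (neighbourhoods, traversals, shortest paths) rather than hand-written set algebra.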

11. Correlate reliability and Failure handling in Apache Flume.


In Apache Flume, reliability and failure handling are closely intertwined. Flume's design
and features prioritize data reliability by using mechanisms such as transactional channels
(including durable file channels), retries, backoff strategies, and reliable sinks to manage
and mitigate failures, ensuring that events are transferred to their destination with minimal
data loss. The correlation between these aspects is essential for maintaining data integrity
in data streaming and collection processes.

12. Differentiate replication and sharding.


Sharding is a technique for providing horizontal scalability by allowing different
sites to hold different subsets of the data. This scalability helps reduce the workload
on each server. Replication is the process of copying the same data across different sites,
while sharding is the process of distributing different datasets across different sites. In
addition, sharding improves both read and write performance, while replication
improves read performance but not write performance.

13. Illustrate the different types of NoSQL databases in detail.


Refer PPT

14. Explain the architecture of Flume in detail.

Refer PPT

15. Explain the architecture of Sqoop in detail.


Refer PPT

16. Illustrate the 3Cs of Mahout in the machine learning framework for processing
data. [or] Illustrate collaborative filtering, clustering, and classification in
Mahout.
Refer PPT
