Module 5
Ans. Sharding is a technique for horizontal scalability in which the data is partitioned
so that each site (shard) stores a different subset of the data. Spreading the data in this
way reduces the workload on each server. Replication copies the same data to different
sites, whereas sharding distributes different datasets across different sites. In addition,
sharding improves both read and write performance, while replication improves read
performance but not write performance, since every write must be applied to each copy.
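The difference can be illustrated with a minimal sketch in Python. It is a toy in-memory model, not the behaviour of any particular database: the class names `ShardedStore` and `ReplicatedStore` and the hash-based routing rule in `shard_for` are assumptions made only for this example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (assumed routing rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Each shard holds a different subset of the keys."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # A write touches exactly one shard, so write load is spread out.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # A read also goes to the single shard that owns the key.
        return self.shards[shard_for(key, len(self.shards))].get(key)


class ReplicatedStore:
    """Every replica holds a full copy of the data."""

    def __init__(self, num_replicas: int):
        self.replicas = [dict() for _ in range(num_replicas)]
        self._next = 0  # round-robin pointer for reads

    def put(self, key: str, value) -> None:
        # A write must be applied to every copy, so writes do not get cheaper.
        for replica in self.replicas:
            replica[key] = value

    def get(self, key: str):
        # Reads can be served by any replica, so read load is spread out.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)
```

After storing the same 30 keys in both stores with three sites each, the shard sizes of `ShardedStore` add up to 30, while every replica of `ReplicatedStore` holds all 30 keys.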
a. Data Import
b. Data Export
c. Parallel Data Transfer
d. Incremental Data Transfer
e. Customized Data Mapping
f. Support for Various Data Sources
g. Integration with Hadoop Ecosystem
h. Compression and Serialization
i. Security
j. Extensibility
k. Job Scheduling
8. Describe some applications of the clustering technique in Mahout.
a. Recommendation Systems
b. Document Classification
c. Customer Segmentation
d. Anomaly Detection
e. Image and Video Analysis
f. Network Analysis
g. Natural Language Processing (NLP)
h. Image Compression
i. Spatial Data Analysis
j. Fraud Detection
k. Data Preprocessing
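As an illustration of one application from the list (customer segmentation), the following is a minimal k-means sketch in plain Python. It does not use the Mahout API, which is a Java library; the function name `kmeans`, the choice of the first `k` points as initial centroids, and the two-feature customer data described afterwards are all assumptions made for this example.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, clusters)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters
```

For example, calling `kmeans` with `k=2` on customers described as hypothetical (visits per month, average basket size) pairs such as `(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)` separates the low-activity customers from the high-activity ones.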
9. Differentiate Apache Flume and Apache Sqoop with respect to failure handling.
Apache Flume is suited to real-time, event-based data streaming and emphasizes
robust failure handling: events are passed between agents through transactional
channels, so an event is removed from a channel only after the next hop has accepted
it. Apache Sqoop is designed for batch-oriented data transfer between Hadoop and
structured data stores such as relational databases. It focuses on data integrity
rather than real-time processing or fine-grained failure handling, and a failed job is
typically rerun. The choice between the two tools depends on the specific
requirements and use cases of the data transfer operation.
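The idea behind guaranteed event delivery can be sketched with a toy model in Python. This is not the Flume API: the `Channel` class and its `deliver_next` method are hypothetical names, and the model only shows the principle that an event leaves the buffer after the receiver has accepted it, never before.

```python
from collections import deque


class Channel:
    """Toy buffer that keeps an event until delivery succeeds."""

    def __init__(self):
        self._queue = deque()

    def put(self, event) -> None:
        self._queue.append(event)

    def deliver_next(self, deliver) -> bool:
        """Try to hand the oldest event to `deliver`; return True on success."""
        if not self._queue:
            return False
        event = self._queue[0]  # peek only; the event stays in the channel
        try:
            deliver(event)
        except Exception:
            # "Rollback": the event was never removed, so it will be retried.
            return False
        self._queue.popleft()  # "Commit": remove only after successful delivery
        return True

    def __len__(self) -> int:
        return len(self._queue)
```

If the receiver raises an error on one attempt, the same event is offered again on the next call, so no event is lost. The price of this design is that a receiver may see an event more than once if it fails after partially processing it.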
10. Name the data model that can be used for social network mining.
The data model commonly used for social network mining is the "Graph Data Model",
or simply "Graph". Users are represented as nodes and their relationships as edges.
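A minimal sketch of such a graph, stored as an adjacency list in Python, is given below. The class name `SocialGraph`, its methods, and the user names mentioned afterwards are hypothetical and chosen only for illustration; friendships are assumed to be undirected.

```python
from collections import defaultdict


class SocialGraph:
    """Undirected graph: each user maps to the set of their friends."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_friendship(self, a: str, b: str) -> None:
        # An undirected edge is stored in both directions.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def degree(self, user: str) -> int:
        """Number of direct connections, a simple measure of importance."""
        return len(self.adj.get(user, set()))

    def mutual_friends(self, a: str, b: str) -> set:
        """Users connected to both a and b."""
        return self.adj.get(a, set()) & self.adj.get(b, set())
```

With the friendships alice-bob, alice-carol, bob-carol and carol-dave, `degree("carol")` counts three connections and `mutual_friends("alice", "bob")` contains only carol. Typical mining tasks such as finding influential users or suggesting new friends build on queries of this kind.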
Refer to the PPT.
16. Illustrate the 3Cs of Mahout as a machine learning framework for processing
data. [or] Illustrate collaborative filtering, clustering and classification in
Mahout.
Refer to the PPT.