Week 6-1
1. Which of the following statements about HDFS is/are true?
a) Replication Factor can be set at the cluster level as well as at the file level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode
c) User data is distributed across multiple DataNodes in the cluster and is managed by the NameNode
d) DataNode is aware of the files to which the blocks stored on it belong
a) Replication Factor can be set at the cluster level as well as at the file level
● Correct. Replication Factor can indeed be set at both the cluster level and the file level in distributed file systems like HDFS.
b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode
● Correct. The Block Report that each DataNode periodically sends to the NameNode is exactly this: a list of all the blocks stored on that DataNode.
c) User data is distributed across multiple DataNodes in the cluster and is managed by the NameNode
● Correct. In a distributed file system like HDFS, user data is stored in a distributed manner across the cluster and managed by HDFS through the NameNode, not just by the local file system of each DataNode.
d) DataNode is aware of the files to which the blocks stored on it belong
● Incorrect. DataNodes manage blocks of data and are not aware of the higher-level file structure; this information is managed by the NameNode in HDFS.
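To make the division of responsibilities above concrete, here is a purely conceptual Python sketch (not HDFS code or its API; all block IDs, file paths, and variable names are made up for illustration): DataNodes only know block IDs, which is what a Block Report carries, while the NameNode alone maps files to blocks and holds the replication settings.

# Conceptual model only -- HDFS itself is not implemented as Python dicts.
datanode_block_reports = {                  # each DataNode reports just its own blocks
    "datanode-1": ["blk_001", "blk_002"],
    "datanode-2": ["blk_002", "blk_003"],   # blk_002 is replicated on two nodes
}

namenode_namespace = {                      # only the NameNode knows file -> blocks
    "/user/data/file.txt": ["blk_001", "blk_002", "blk_003"],
}

# Replication factor: a cluster-wide default plus an optional per-file override,
# both held as NameNode metadata in this toy model.
replication_factor = {"cluster_default": 3, "/user/data/file.txt": 2}

# A DataNode cannot answer "which file does blk_002 belong to?" -- the NameNode can.
file_of_block = {blk: f for f, blks in namenode_namespace.items() for blk in blks}
print(file_of_block["blk_002"])             # -> /user/data/file.txt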
2. What is the primary technique used by Random Forest to reduce overfitting?
a) Boosting
b) Bagging
c) Pruning
d) Neural networks
a. Incorrect - Boosting:
● Not Used in Random Forest: Boosting is a different technique used in methods
like Gradient Boosting, where trees are built sequentially to correct errors from
previous trees. It’s not used by Random Forest, which relies on bagging.
b. Correct - Bagging:
● Used by Random Forest: Bagging (bootstrap aggregating) trains each decision tree on a random bootstrap sample of the training data and aggregates their predictions, which lowers variance and thereby reduces overfitting (a short sketch follows after this question's explanations).
c. Incorrect - Pruning:
● Not a Primary Technique in Random Forest: Pruning is a technique used to
reduce the size of decision trees by removing parts that are not contributing to
the prediction accuracy. While pruning helps to control overfitting in individual
decision trees, Random Forest primarily relies on bagging for overfitting
reduction.
d. Incorrect - Neural networks:
● Not Part of Random Forest: Neural networks are a separate class of models altogether; they are not a technique used within Random Forest to control overfitting.
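As a hedged illustration of the bagging idea (not the course's reference code), the Python sketch below trains several decision trees on bootstrap samples using scikit-learn and NumPy, which are assumed to be installed, and combines them by majority vote; the dataset and all parameter values are arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):                        # 25 bagged trees
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample: draw with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across trees: averaging many high-variance trees lowers variance,
# which is how bagging reduces overfitting.
votes = np.mean([t.predict(X) for t in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())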
3. Which of the following statements about Random Forest and Gradient Boosting is/are correct?
S1: Random Forest and Gradient Boosting can both be used for classification tasks
S2: Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
S3: Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
S4: Random Forest and Gradient Boosting can both be used for regression tasks
A) S1 and S2
B) S2 and S4
C) S3 and S4
D) S1 and S4
S1
● Correct. Both Random Forest and Gradient Boosting can be used for classification problems.
S2
● Incorrect. Random Forest and Gradient Boosting can both be used for both regression and classification tasks.
S3
● Incorrect. As with S2, both methods can be used for both types of tasks.
S4
● Correct. Both Random Forest and Gradient Boosting can be used for regression tasks as well.
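A small sketch supporting the explanations above, assuming scikit-learn is available: the library exposes both classification and regression variants of each method, so neither algorithm is restricted to one task type; the toy datasets here are only for demonstration.

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
)

Xc, yc = make_classification(n_samples=200, random_state=0)   # classification data
Xr, yr = make_regression(n_samples=200, random_state=0)       # regression data

print(RandomForestClassifier(random_state=0).fit(Xc, yc).score(Xc, yc))
print(GradientBoostingClassifier(random_state=0).fit(Xc, yc).score(Xc, yc))
print(RandomForestRegressor(random_state=0).fit(Xr, yr).score(Xr, yr))
print(GradientBoostingRegressor(random_state=0).fit(Xr, yr).score(Xr, yr))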
4. In the context of K-means clustering with MapReduce, what role does the Map
phase play in handling very large datasets?
● Incorrect. The Map phase does not focus on removing duplicates but rather on
distributing and processing the data.
● Correct. The Map phase is responsible for calculating distances between data
points and centroids and distributing this task across nodes.
● Incorrect. PCA is not typically done in the Map phase; it is a preprocessing step
for dimensionality reduction.
● Incorrect. Reducing the number of clusters might not improve performance and
could lead to less meaningful clustering.
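A minimal sketch of the Map-phase role described above, written in plain Python with NumPy rather than any particular MapReduce framework (the names kmeans_map and kmeans_reduce are illustrative): each mapper computes point-to-centroid distances for its partition of the data and emits (nearest centroid id, point); reducers then average the points per key to produce the new centroids.

from collections import defaultdict
import numpy as np

def kmeans_map(points, centroids):
    """Runs on one data partition: assign each point to its nearest centroid."""
    for p in points:
        dists = np.linalg.norm(centroids - p, axis=1)  # distance to every centroid
        yield int(np.argmin(dists)), p                 # key = nearest centroid id

def kmeans_reduce(cluster_id, assigned_points):
    """Runs per key: average the assigned points to get the updated centroid."""
    return cluster_id, np.mean(assigned_points, axis=0)

# Toy single-machine run; a real framework would shuffle by key across nodes.
points = np.random.default_rng(0).normal(size=(100, 2))
centroids = points[:3].copy()
groups = defaultdict(list)
for cid, p in kmeans_map(points, centroids):
    groups[cid].append(p)
new_centroids = [kmeans_reduce(cid, ps)[1] for cid, ps in sorted(groups.items())]
print(new_centroids)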
6. Which similarity measure is often used to determine the similarity between two text
documents by considering the angle between their vector representations in a
high-dimensional space?
A) Manhattan Distance
B) Cosine Similarity
C) Jaccard Similarity
D) Hamming Distance
A) Manhattan Distance
● Incorrect. Manhattan Distance measures axis-aligned distance between points and does not capture the angle between document vectors, so it is not used for text document similarity in this context.
B) Cosine Similarity
● Correct. Cosine Similarity measures the cosine of the angle between two
vectors, making it ideal for text documents in high-dimensional space.
C) Jaccard Similarity
● Incorrect. Jaccard Similarity is used for comparing sets and is not based on
vector angles.
D) Hamming Distance
● Incorrect. Hamming Distance is used for comparing strings of equal length and
is not applicable to text document similarity in vector space.
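A worked sketch of the correct option, using NumPy (assumed available) and made-up term-frequency vectors: cosine similarity compares only the direction of the document vectors, so two documents with the same term proportions score 1.0 regardless of their lengths.

import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = a.b / (||a|| * ||b||); 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-frequency vectors over the vocabulary ["data", "mining", "cluster"]
doc1 = np.array([3, 1, 0])
doc2 = np.array([6, 2, 0])   # same proportions as doc1, just a longer document
print(cosine_similarity(doc1, doc2))  # 1.0: identical orientation despite different lengths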
7. Which distance measure calculates the distance along strictly horizontal and vertical paths, consisting of segments along the axes?
A) Minkowski distance
B) Cosine similarity
C) Manhattan distance
D) Euclidean distance
A) Minkowski distance
● Incorrect. Minkowski distance is a general family of distance measures (it reduces to Manhattan distance for p = 1 and Euclidean distance for p = 2); it is not defined specifically as movement along axis-parallel segments.
B) Cosine similarity
● Incorrect. Cosine similarity measures the angle between vectors and does not
involve distance calculation.
C) Manhattan distance
● Correct. Manhattan (city-block) distance sums the absolute differences along each coordinate axis, i.e. it measures the path made up of strictly horizontal and vertical segments.
D) Euclidean distance
● Incorrect. Euclidean distance measures the straight-line distance between two points, not a path restricted to segments along the axes.
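The three distance measures compared above can be checked numerically with a short NumPy sketch on two arbitrary points; note how Minkowski distance with p = 1 reproduces the Manhattan (axis-aligned) value and with p = 2 the Euclidean (straight-line) value.

import numpy as np

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])

manhattan = np.sum(np.abs(p - q))          # |1-4| + |2-6| = 7  (axis-aligned path)
euclidean = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 16) = 5   (straight line)

def minkowski(r):
    """General Minkowski distance; r = 1 gives Manhattan, r = 2 gives Euclidean."""
    return np.sum(np.abs(p - q) ** r) ** (1 / r)

print(manhattan, euclidean, minkowski(1), minkowski(2))  # 7.0 5.0 7.0 5.0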
8. What is the purpose of the validation set in machine learning?
● Incorrect. The validation set is not used for training but for model evaluation during training.
● Incorrect. Testing the final model's performance is done using a separate test set, not the validation set.
● Correct. The validation set is used to evaluate the model and tune hyperparameters during training, before the final performance is measured once on a held-out test set.
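To illustrate the train/validation/test distinction above, here is a minimal sketch assuming scikit-learn: the validation split is scored during development for tuning and model selection, while the test split is touched only once for the final report; the dataset and split sizes are arbitrary.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy (used to tune/select during training):", model.score(X_val, y_val))
print("test accuracy (reported once for the final model):", model.score(X_test, y_test))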
9. In K-fold cross-validation, what is the purpose of splitting the dataset into K folds?
A) To ensure that every data point is used for training only once
A) To ensure that every data point is used for training only once
● Incorrect. The dataset is split into folds, and only K-1 folds are
used for training each time.
● Incorrect. Each fold is used for testing once, and training is done on the remaining folds.
● Correct. Splitting the dataset into K folds ensures that every data point is used for testing exactly once and for training K-1 times, which gives a more reliable estimate of model performance (a short sketch follows below).
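A short sketch of K-fold cross-validation as described above, assuming scikit-learn and NumPy are available: with K = 5 each fold serves as the test set exactly once while the model trains on the other four folds, and the per-fold scores are averaged.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # this fold is the test set exactly once

print("per-fold accuracy:", np.round(scores, 3), "mean:", np.mean(scores))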
10. Which of the following steps is NOT typically part of the machine learning process?
A) Data Collection
B) Model Training
C) Model Deployment
D) Data Encryption
A) Data Collection
● Incorrect. Data Collection is a fundamental step in the machine learning
process.
B) Model Training
● Incorrect. Model Training is a core step of the machine learning process.
C) Model Deployment
● Incorrect. Model Deployment is a typical final step, in which the trained model is put into production for use.
D) Data Encryption
● Correct. Data Encryption is not typically a part of the machine learning process
itself, though it may be relevant for data security and privacy.
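As a hedged end-to-end sketch of the steps the question names (data collection, model training, and a stand-in for deployment), assuming scikit-learn and joblib are installed; the iris dataset and the file name model.joblib are illustrative choices, and encryption is deliberately absent because it belongs to data security rather than to the machine learning workflow itself.

from joblib import dump
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                                 # data collection (toy dataset)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)    # model training
print("held-out accuracy:", model.score(X_te, y_te))              # evaluation
dump(model, "model.joblib")                                       # persist the model for deployment/serving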