Week 6-1

big data computing

Uploaded by iti01

Week 6

1. Point out the wrong statement.

a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level

b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode

c) User data is distributed across multiple DataNodes in the cluster and is managed by the NameNode.

d) DataNode is aware of the files to which the blocks stored on it belong

a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level

● Correct. The replication factor can indeed be set at both the cluster level and the file level in distributed file systems like HDFS.

b) Block Report from each DataNode contains a list of all the blocks that
are stored on that DataNode

● Correct. A Block Report includes a list of all blocks stored on a DataNode.

c) User data is distributed across multiple DataNodes in the cluster and is managed by the NameNode.

● Correct. In a distributed file system like HDFS, user data is stored in blocks spread across the cluster's DataNodes, while the NameNode manages the file-system namespace and the mapping of files to blocks.

d) DataNode is aware of the files to which the blocks stored on it belong

● Incorrect. DataNodes manage blocks of data and are not aware of the
higher-level file structure; this information is managed by the NameNode in
HDFS.
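For reference, the cluster-wide default described in option (a) is set via the `dfs.replication` property in `hdfs-site.xml`; a sketch of the standard setting:

```xml
<!-- hdfs-site.xml: cluster-level default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

The file-level override can then be applied per path, e.g. `hdfs dfs -setrep 2 /path/to/file`.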
2. What is the primary technique used by Random Forest to reduce overfitting?

a) Boosting

b) Bagging

c) Pruning

d) Neural networks

a. Incorrect - Boosting:
● Not Used in Random Forest: Boosting is a different technique used in methods
like Gradient Boosting, where trees are built sequentially to correct errors from
previous trees. It’s not used by Random Forest, which relies on bagging.

b. Correct - Bagging (Bootstrap Aggregating):

● Primary Technique Used by Random Forest: Bagging is the key technique
used by Random Forest to reduce overfitting. In Random Forest, multiple
decision trees are trained on different random subsets of the data, created
through bootstrapping (sampling with replacement). Each tree is trained
independently on its subset of data, and the predictions from all trees are
aggregated (typically by voting or averaging) to produce the final result. This
approach helps to reduce variance and improves the model's ability to generalize
by averaging out the errors from individual trees.

c. Incorrect - Pruning:
● Not a Primary Technique in Random Forest: Pruning is a technique used to
reduce the size of decision trees by removing parts that are not contributing to
the prediction accuracy. While pruning helps to control overfitting in individual
decision trees, Random Forest primarily relies on bagging for overfitting
reduction.

d. Incorrect - Neural Networks:

● Not Related to Random Forest: Neural networks are a different class of models
and are not related to the ensemble method of Random Forest.
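To make the bagging idea concrete, here is a minimal, self-contained Python sketch; the three hard-coded "trees" are hypothetical stand-ins for decision trees trained on bootstrap samples:

```python
import random

def bootstrap_sample(data, rng):
    # Sampling with replacement: each tree in a random forest is
    # trained on its own bootstrap sample of the training data.
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    # Aggregate the ensemble by majority vote (classification).
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Hypothetical ensemble: three threshold rules standing in for
# trees fitted on different bootstrap samples.
trees = [lambda x: int(x > 2), lambda x: int(x > 3), lambda x: int(x > 2.5)]
print(bagged_predict(trees, 4))  # all three vote 1 -> prints 1
print(bootstrap_sample([1, 2, 3, 4], random.Random(0)))
```

Averaging the votes of many such independently trained models is what cancels out the individual trees' errors and reduces variance.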
3. What statements accurately describe Random Forest and Gradient Boosting
ensemble methods?

S1: Both methods can be used for classification task

S2: Random Forest is used for regression whereas Gradient Boosting is used for classification tasks

S3: Random Forest is used for classification whereas Gradient Boosting is used for regression tasks

S4: Both methods can be used for regression

A) S1 and S2
B) S2 and S4
C) S3 and S4
D) S1 and S4

S1: Both methods can be used for classification tasks

● Correct. Both Random Forest and Gradient Boosting can be used for
classification problems.

S2: Random Forest is used for regression whereas Gradient Boosting is used for classification tasks

● Incorrect. Random Forest and Gradient Boosting can both be used for both
regression and classification tasks.

S3: Random Forest is used for classification whereas Gradient Boosting is used for regression tasks

● Incorrect. As with S2, both methods can be used for both types of tasks.

S4: Both methods can be used for regression

● Correct. Both Random Forest and Gradient Boosting can be used for regression
tasks as well.
4. In the context of K-means clustering with MapReduce, what role does the Map
phase play in handling very large datasets?

A) It reduces the size of the dataset by removing duplicates

B) It distributes the computation of distances between data points and centroids across multiple nodes

C) It initializes multiple sets of centroids to improve clustering accuracy

D) It performs principal component analysis (PCA) on the data

A) It reduces the size of the dataset by removing duplicates

● Incorrect. The Map phase does not focus on removing duplicates but rather on
distributing and processing the data.

B) It distributes the computation of distances between data points and centroids across multiple nodes

● Correct. The Map phase is responsible for calculating distances between data
points and centroids and distributing this task across nodes.

C) It initializes multiple sets of centroids to improve clustering accuracy

● Incorrect. Initialization of centroids is generally done before the Map phase starts and is not part of its functionality.

D) It performs principal component analysis (PCA) on the data

● Incorrect. PCA is not typically done in the Map phase; it is a preprocessing step
for dimensionality reduction.
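As a sketch, the per-point work of the Map phase can be written as a plain Python function; the centroid list here is illustrative:

```python
def kmeans_map(point, centroids):
    # Map step: each mapper runs this over its own partition of the
    # data, emitting (index of nearest centroid, point) key-value
    # pairs; the reducers then average the points per centroid.
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    nearest = min(range(len(centroids)),
                  key=lambda i: sq_dist(point, centroids[i]))
    return nearest, point

# Illustrative centroids; in practice these come from the previous iteration.
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_map((1.0, 2.0), centroids))  # -> (0, (1.0, 2.0))
```

Because each point is processed independently, this step parallelizes cleanly across the cluster's nodes.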

5. What is a common method to improve the performance of the K-means algorithm when dealing with large-scale datasets in a MapReduce environment?

A) Using hierarchical clustering before K-means

B) Reducing the number of clusters

C) Employing mini-batch K-means

D) Increasing the number of centroids

A) Using hierarchical clustering before K-means

● Incorrect. While hierarchical clustering can be used to initialize centroids, it is not specific to improving K-means in a MapReduce environment.

B) Reducing the number of clusters

● Incorrect. Reducing the number of clusters might not improve performance and
could lead to less meaningful clustering.

C) Employing mini-batch K-means

● Correct. Mini-batch K-means is a method used to handle large-scale datasets efficiently by processing small, random subsets of the data.

D) Increasing the number of centroids

● Incorrect. Increasing the number of centroids might not improve performance and could complicate the clustering process.
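A minimal sketch of one mini-batch update, assuming squared-Euclidean assignment and the usual per-centroid learning rate of 1/count; the data and starting centroids are made up for illustration:

```python
import random

def minibatch_kmeans_step(data, centroids, counts, batch_size, rng):
    # One update: draw a small random batch, assign each batch point to
    # its nearest centroid, and nudge that centroid toward the point
    # with a learning rate that decays as the centroid sees more points.
    batch = rng.sample(data, batch_size)
    for x in batch:
        i = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        counts[i] += 1
        eta = 1.0 / counts[i]
        centroids[i] = tuple((1 - eta) * c + eta * a
                             for c, a in zip(centroids[i], x))

# Two made-up clusters around (0, 0) and (10, 10).
data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (9, 10), (10, 9), (9, 9)]
centroids = [(2.0, 2.0), (8.0, 8.0)]
counts = [1, 1]
rng = random.Random(0)
for _ in range(20):
    minibatch_kmeans_step(data, centroids, counts, 4, rng)
print(centroids)  # centroids drift toward the two cluster centres
```

Because each step touches only `batch_size` points, the cost per iteration stays constant no matter how large the full dataset is.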

6. Which similarity measure is often used to determine the similarity between two text
documents by considering the angle between their vector representations in a
high-dimensional space?

A) Manhattan Distance

B) Cosine Similarity

C) Jaccard Similarity

D) Hamming Distance
A) Manhattan Distance

● Incorrect. Manhattan Distance measures axis-aligned distance between points; it does not capture the angle between document vectors.

B) Cosine Similarity

● Correct. Cosine Similarity measures the cosine of the angle between two
vectors, making it ideal for text documents in high-dimensional space.

C) Jaccard Similarity

● Incorrect. Jaccard Similarity is used for comparing sets and is not based on
vector angles.

D) Hamming Distance

● Incorrect. Hamming Distance is used for comparing strings of equal length and
is not applicable to text document similarity in vector space.
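The measure itself is easy to state in code; a minimal sketch with hypothetical term-count vectors:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|): 1 means same direction,
    # 0 means orthogonal (no shared terms), regardless of vector length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term counts over the vocabulary ("big", "data", "computing").
doc1 = [2, 3, 0]
doc2 = [4, 6, 0]  # same direction as doc1, so similarity is ~1.0
print(cosine_similarity(doc1, doc2))
```

Note that `doc2` is twice as long as `doc1` yet scores ~1.0, which is exactly why cosine similarity suits documents of different lengths.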

7. Which distance measure calculates the distance along strictly horizontal and vertical
paths, consisting of segments along the axes?

A) Minkowski distance

B) Cosine similarity

C) Manhattan distance

D) Euclidean distance

A) Minkowski distance

● Incorrect. Minkowski distance generalizes Euclidean and Manhattan distances but is not specific to axis-aligned paths.

B) Cosine similarity

● Incorrect. Cosine similarity measures the angle between vectors and does not
involve distance calculation.
C) Manhattan distance

● Correct. Manhattan distance measures distance along axis-aligned paths (horizontal and vertical segments).

D) Euclidean distance

● Incorrect. Euclidean distance measures the straight-line distance between points, not restricted to axis-aligned paths.
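The contrast between the two distances is clearest side by side; a minimal sketch:

```python
import math

def manhattan(p, q):
    # L1: sum of segment lengths along the axes.
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # L2: straight-line distance; Minkowski distance with p = 2.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((0, 0), (3, 4)))  # 3 + 4 = 7
print(euclidean((0, 0), (3, 4)))  # sqrt(9 + 16) = 5.0
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, since walking along the axes can only be longer than the straight line.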

8. What is the purpose of a validation set in machine learning?

A) To train the model on unseen data

B) To evaluate the model’s performance on the training data

C) To tune hyperparameters and prevent overfitting

D) To test the final model’s performance

A) To train the model on unseen data

● Incorrect. The validation set is not used for training but for model evaluation
during training.

B) To evaluate the model’s performance on the training data

● Incorrect. Evaluation on training data is not the purpose of a validation set; it is used for hyperparameter tuning and model selection.

C) To tune hyperparameters and prevent overfitting

● Correct. The validation set is used to tune hyperparameters and monitor performance to avoid overfitting.

D) To test the final model’s performance

● Incorrect. Testing the final model's performance is done using a separate test
set, not the validation set.
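The three roles map onto a three-way split of the data; a minimal sketch, where the 60/20/20 proportions are just a common convention:

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle, then carve off test and validation slices. The
    # validation slice drives hyperparameter tuning; the test slice
    # is held out until one final evaluation.
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # -> 60 20 20
```

Keeping the three slices disjoint is the whole point: a model never trains on the data used to judge it.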
9. In K-fold cross-validation, what is the purpose of splitting the dataset into K folds?

A) To ensure that every data point is used for training only once

B) To train the model on all the data points

C) To test the model on the same data multiple times

D) To evaluate the model’s performance on different subsets of data

A) To ensure that every data point is used for training only once

● Incorrect. In K-fold cross-validation, every data point is used for training multiple times, not just once.

B) To train the model on all the data points

● Incorrect. The dataset is split into folds, and only K-1 folds are
used for training each time.

C) To test the model on the same data multiple times

● Incorrect. Each fold is used for testing once, and training is done
on the remaining folds.

D) To evaluate the model’s performance on different subsets of data

● Correct. K-fold cross-validation ensures that the model is evaluated on different subsets of the data, providing a robust measure of its performance.

10. Which of the following steps is NOT typically part of the machine learning process?

A) Data Collection

B) Model Training

C) Model Deployment

D) Data Encryption

A) Data Collection
● Incorrect. Data Collection is a fundamental step in the machine learning
process.

B) Model Training

● Incorrect. Model Training is a core step in machine learning.

C) Model Deployment

● Incorrect. Model Deployment is part of the machine learning lifecycle, as it involves putting the model into production.

D) Data Encryption

● Correct. Data Encryption is not typically a part of the machine learning process
itself, though it may be relevant for data security and privacy.
