ClassTest1 DeepLearning


What does BERT stand for?

A) Basic Encoder for Robust Transformers

B) Bidirectional Encoder Representations from Transformers

C) Binary Encoded Recursive Transformers

D) Balanced Embedding Representation Technology


Which of the following is NOT a component of the transformer architecture?

A) Multi-head attention

B) Feed-forward neural networks

C) Positional encoding

D) Convolutional layers
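
For reference: a standard transformer encoder layer consists of multi-head self-attention and a position-wise feed-forward network (with residual connections and layer normalization), and positional encoding is added to the input embeddings; convolutional layers are not part of the original architecture. A quick way to inspect this, assuming PyTorch is installed, is to print a stock encoder layer:

    import torch.nn as nn

    # Instantiate one standard encoder layer and list its sub-modules.
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    print(layer)
    # The printout shows self_attn (MultiheadAttention), linear1/linear2
    # (the feed-forward block), LayerNorm and Dropout modules, and no
    # convolutional layers.
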
What is the primary advantage of the transformer architecture over RNNs?

A) Lower computational complexity

B) Ability to handle variable-length sequences

C) Parallel processing of input sequences

D) Smaller model size


What pre-training task does BERT use to learn bidirectional context?

A) Next Sentence Prediction

B) Masked Language Modeling

C) Machine Translation

D) Both A and B
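
For intuition, below is a simplified sketch of the Masked Language Modeling corruption step. It is only illustrative: the 15% masking rate follows the original BERT paper, but real BERT replaces only 80% of the selected tokens with [MASK] (keeping 10% unchanged and swapping 10% for random tokens) and is additionally pre-trained with Next Sentence Prediction.

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Replace ~15% of tokens with [MASK]; the model must predict the
        originals from both left and right context, which is what forces
        a bidirectional representation."""
        corrupted, labels = [], []
        for tok in tokens:
            if random.random() < mask_prob:
                corrupted.append(mask_token)
                labels.append(tok)      # this position is scored by the loss
            else:
                corrupted.append(tok)
                labels.append(None)     # this position is ignored by the loss
        return corrupted, labels

    random.seed(0)
    print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
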
What is the purpose of the [CLS] token in BERT?

A) To mark the end of a sentence

B) To represent the entire sequence for classification tasks

C) To separate two sentences in the input

D) To mask random words in the input
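
For reference, a minimal sketch of how the [CLS] representation is typically read out, assuming the Hugging Face transformers and torch packages are installed (the model name bert-base-uncased is just an example):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Transformers handle long-range context well.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # [CLS] is always the first token; its final hidden state is commonly
    # used as a summary vector of the whole sequence for classification heads.
    cls_vector = outputs.last_hidden_state[:, 0]
    print(cls_vector.shape)   # torch.Size([1, 768]) for bert-base
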


How do transformer models handle out-of-vocabulary words?

A) Ignore them

B) Use sub-word tokenization

C) Assign them a random embedding

D) Assign embedding of the closest word from vocabulary
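
For intuition, here is a simplified greedy longest-match-first splitter in the spirit of WordPiece; the tiny vocabulary is made up purely for illustration:

    def wordpiece_split(word, vocab):
        """Split an unseen word into known sub-word pieces instead of
        mapping the whole word to a single [UNK] token."""
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    break
                end -= 1
            else:
                return ["[UNK]"]        # no piece matched at this position
            start = end
        return pieces

    vocab = {"trans", "##form", "##er", "##s", "play", "##ing"}
    print(wordpiece_split("transformers", vocab))  # ['trans', '##form', '##er', '##s']
    print(wordpiece_split("playing", vocab))       # ['play', '##ing']
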


What is the key difference between BERT and GPT models?

A) BERT uses encoders only while GPT uses decoders only

B) BERT is bidirectional while GPT is unidirectional

C) BERT is for classification tasks only, while GPT is for generation

D) Both A and B

What is the primary purpose of self-attention in transformer models?

A) To reduce the model size

B) To speed up training

C) To eliminate the need for positional encoding

D) To capture dependencies between different positions in a sequence
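
For reference, a minimal NumPy sketch of single-head self-attention: every position attends to every other position, and the softmax weights express how strongly pairs of positions depend on each other. The weight matrices here are random placeholders.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head self-attention over a sequence of embeddings X."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)             # pairwise position scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                           # context-mixed outputs

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                     # 5 positions, 16-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 16)
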
What is the purpose of the scaling factor in scaled dot-product attention?

A) To normalize the input

B) To prevent vanishing gradients

C) To stabilize the gradients, especially for large-dimension inputs

D) To increase the model's capacity
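
For intuition on option C: with d_k-dimensional unit-variance queries and keys, the raw dot products have variance d_k, so without the 1/sqrt(d_k) factor the softmax saturates and its gradients become tiny. A quick NumPy illustration with random vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    d_k = 512
    q = rng.normal(size=d_k)            # one query
    K = rng.normal(size=(8, d_k))       # eight keys

    softmax = lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum()

    raw = K @ q                         # std grows like sqrt(d_k), ~22 here
    scaled = raw / np.sqrt(d_k)         # std back near 1

    print(softmax(raw).round(3))        # typically close to one-hot
    print(softmax(scaled).round(3))     # smoother, gradients stay usable
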


What is the purpose of positional encoding in transformer models?

A) To add information about the order of the sequence

B) To increase the model's vocabulary

C) To reduce computational complexity

D) To enable multi-head attention
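
For reference, the sinusoidal positional encoding from the original transformer paper, sketched in NumPy (assumes an even d_model):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        """PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
        PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
        Added to the token embeddings so the otherwise order-agnostic
        attention layers can tell positions apart."""
        pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
        i = np.arange(0, d_model, 2)[None, :]           # even dimensions
        angles = pos / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    print(positional_encoding(seq_len=50, d_model=16).shape)   # (50, 16)
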
