
DATA SCIENCE INTERVIEW PREPARATION
(30 Days of Interview Preparation)

# DAY 06
Q1. What is NLP?
Natural language processing (NLP): It is the branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.

Q2. What are the libraries we use for NLP?

We usually use these libraries in NLP:
NLTK (Natural Language Toolkit), TextBlob, CoreNLP, Polyglot, Gensim, spaCy and scikit-learn.
A newer one is NVIDIA's Megatron library, launched recently.

Q3. What do you understand by tokenisation?


Tokenisation is the act of breaking a sequence of text into pieces called tokens, such as words, keywords, phrases and symbols. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters, like punctuation marks, are discarded.
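As a minimal sketch, NLTK's word_tokenize splits a sentence into tokens (this assumes NLTK is installed and its "punkt" tokenizer models have been downloaded):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize

text = "It is going to rain today."
tokens = word_tokenize(text)
print(tokens)  # ['It', 'is', 'going', 'to', 'rain', 'today', '.']

# Punctuation tokens can then be filtered out if they are not wanted.
words = [t for t in tokens if t.isalpha()]
print(words)   # ['It', 'is', 'going', 'to', 'rain', 'today']
```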
Q4. What do you understand by stemming?
Stemming: It is the process of reducing inflected words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
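A minimal sketch with NLTK's PorterStemmer shows several inflected forms collapsing to one stem that is not itself an English word:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studying", "studied"]:
    print(word, "->", stemmer.stem(word))
# All three map to the stem "studi", which is not a valid English word.
```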

Q5. What is lemmatisation?


Lemmatisation: It is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. It is quite similar to stemming, but it brings context to the words, linking words with a similar meaning to one word.
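A minimal sketch with NLTK's WordNetLemmatizer (this assumes the "wordnet" corpus has been downloaded); unlike a stem, the lemma is a real word, and the part-of-speech tag supplies the context:

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet corpus, downloaded once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="n"))  # study
print(lemmatizer.lemmatize("better", pos="a"))   # good (context from the POS tag)
```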

Q6. What is the Bag-of-Words model?

We need a way to represent text data for machine learning algorithms, and the bag-of-words model helps us achieve that task. The model is simple to understand and to implement. It is a way of extracting features from text for use in machine learning algorithms.
In this approach, we use the tokenised words for each observation and find out the frequency of each token.
Let’s do an example to understand this concept in depth.
“It is going to rain today.”
“Today, I am not going outside.”
“I am going to watch the season premiere.”
We treat each sentence as a separate document, and we make a list of all words from all three documents, excluding the punctuation. We get the fourteen unique words:
'It', 'is', 'going', 'to', 'rain', 'today', 'I', 'am', 'not', 'outside', 'watch', 'the', 'season', 'premiere'
The next step is to create vectors, which convert the text into a form that can be used by the machine learning algorithm.
We take the first document, "It is going to rain today", and check the frequency of each of the fourteen unique words.
"It" = 1
"is" = 1
"going" = 1
"to" = 1
"rain" = 1
"today" = 1
"I" = 0
"am" = 0
"not" = 0
"outside" = 0
"watch" = 0
"the" = 0
"season" = 0
"premiere" = 0
The vectors for the three documents are then:
"It is going to rain today" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
"Today I am not going outside" = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"I am going to watch the season premiere" = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
In this approach, each word (a token) is called a "gram". Creating a vocabulary of two-word pairs instead is called a bigram model.
The process of converting NLP text into numbers is called vectorisation in ML. There are different ways to convert text into vectors:

• Counting the number of times that each word appears in a document (the approach sketched below).
• Calculating the frequency with which each word appears in a document, out of all the words in the document.
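As a minimal sketch, scikit-learn's CountVectorizer builds exactly this kind of count vector for the three example documents (note that its default tokeniser lowercases the text and drops one-letter tokens such as "I", so the vocabulary differs slightly from the hand-built list above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It is going to rain today.",
    "Today, I am not going outside.",
    "I am going to watch the season premiere.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                      # one row of word counts per document
```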
Q7. What do you understand by TF-IDF?

TF-IDF: It stands for term frequency-inverse document frequency.
TF-IDF weight: It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.

• Term Frequency (TF): It is a score for how frequently the word appears in the current document. Since every document differs in length, a term may appear many more times in a long document than in a short one, so the term frequency is often divided by the document length to normalise it.

• Inverse Document Frequency (IDF): It is a score for how rare the word is across the documents. The rarer the term, the higher the IDF score.

Thus, for a term t in a document d:
TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / df(t)),
with N the total number of documents in the corpus and df(t) the number of documents containing t.
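As a minimal sketch, scikit-learn's TfidfVectorizer computes these weights directly for the three example documents from Q6:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is going to rain today.",
    "Today, I am not going outside.",
    "I am going to watch the season premiere.",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
# Words shared by all documents (e.g. "going") get low weights;
# words unique to one document (e.g. "rain") get high weights.
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```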

Q8. What is Word2vec?

Word2Vec is a shallow, two-layer neural network which is trained to reconstruct the linguistic contexts of words. It takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in that space.
Word vectors are positioned in the vector space such that words which share common contexts in the corpus are located close to one another in the space.
Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
Word2Vec is a group of models which help derive the relations between a word and its contextual words. Let's look at the two important models inside Word2Vec: Skip-gram and CBOW.
Skip-gram

In the Skip-gram model, we take a centre word and a window of context (neighbouring) words, and we try to predict the context words out to some window size for each centre word. So our model defines a probability distribution, i.e. the probability of a word appearing in the context given a centre word, and we choose our vector representations to maximise that probability.

Continuous Bag-of-Words (CBOW)


CBOW predicts target words (e.g. 'mat') from the surrounding context words ('the cat sits on the').
Statistically, the effect is that CBOW smoothes over a lot of distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.
This was about converting words into vectors. But where does the "learning" happen? Essentially, we begin with a small random initialisation of the word vectors. Our predictive model then learns the vectors by minimising a loss function. In Word2Vec, this happens with feed-forward neural networks and optimisation techniques such as stochastic gradient descent.
There are also count-based models, which build a co-occurrence count matrix of the words in our corpus: a very large matrix with a row for each "word" and a column for each "context". The number of "contexts" is, of course, very large, since it is essentially combinatorial in size. To overcome this issue, we apply SVD to the matrix, which reduces its dimensions while retaining the maximum amount of information.
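A minimal sketch with gensim's Word2Vec on a toy corpus (a real corpus would be far larger); sg=1 selects the Skip-gram model, while sg=0, the default, selects CBOW:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"].shape)         # (50,) - the learned word vector
print(model.wv.most_similar("cat"))  # nearest words in the vector space
```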

Q9. What is Doc2vec?

Paragraph Vector (more popularly known as Doc2Vec) — Distributed Memory (PV-DM)

Paragraph Vector (Doc2Vec) is supposed to be an extension of Word2Vec: Word2Vec learns to project words into a latent d-dimensional space, whereas Doc2Vec aims at learning how to project a whole document into a latent d-dimensional space.

The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the model learns to predict a centre word based on its context. For example, given the sentence "The cat sat on the table", the CBOW model would learn to predict the word "sat" given the context words "the", "cat", "on" and "table". Similarly, in PV-DM the main idea is: randomly sample consecutive words from the paragraph and predict a centre word from the randomly sampled set of words by taking as input the context words and the paragraph id.

Looking at the model diagram, we see a Paragraph matrix, an Average/Concatenate step and a Classifier section.
Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.
Average/Concatenate: It indicates whether the word vectors and the paragraph vector are averaged or concatenated.
Classifier: It takes the hidden-layer vector (the one that was concatenated or averaged) as input and predicts the centre word.
The matrix D holds the embeddings for "seen" paragraphs (i.e. arbitrary-length documents), in the same way that Word2Vec models learn embeddings for words. For unseen paragraphs, the model is again run through gradient descent (5 or so iterations) to infer a document vector.
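A minimal sketch with gensim's Doc2Vec on a toy corpus; dm=1 selects the PV-DM model described above, and infer_vector runs the gradient-descent inference step on an unseen document:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "table"], tags=["d0"]),
    TaggedDocument(words=["today", "i", "am", "not", "going", "outside"], tags=["d1"]),
]
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=20)
print(model.dv["d0"].shape)                               # vector of a seen paragraph
print(model.infer_vector(["the", "dog", "sat", "down"]))  # vector for an unseen one
```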

Q10. What is Time-Series forecasting?

Time-series forecasting is a technique for predicting events through a sequence of time. The technique is used across many fields of study, from geology to behaviour to economics. It predicts future events by analysing the trends of the past, on the assumption that future trends will be similar to historical trends.

Q11. What is the difference between time series and regression?

Time-series:
1. The data is recorded at regular intervals of time.
2. A time-series forecast is extrapolation.
3. Time series refers to an ordered series of data.
Regression:
1. Regression can be applied whether the data is recorded at regular or irregular intervals of time.
2. Regression is interpolation.
3. Regression applies to both ordered and unordered series of data.
Q12. What is the difference between stationary and non-stationary data?

Stationary: A series is said to be "strictly stationary" if the mean, variance and covariance are constant over time (time-invariant).

Non-Stationary: A series is non-stationary if the mean, variance or covariance is not constant over time.
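As a minimal sketch, the augmented Dickey-Fuller test from statsmodels is a common way to check stationarity in practice (its null hypothesis is that the series is non-stationary):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
white_noise = rng.normal(size=200)             # stationary: constant mean/variance
random_walk = np.cumsum(rng.normal(size=200))  # non-stationary: variance grows

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF p-value = {pvalue:.3f}")  # small p-value -> stationary
```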
Q13. Why can you not take non-stationary data to solve a time-series problem?

o Most models assume stationarity of the data. In other words, standard techniques are invalid if the data is non-stationary.
o Autocorrelation may result from non-stationarity.
o Non-stationary processes include random walks with or without a drift (a slow, steady change); differencing such a series, as sketched below, is the usual remedy.
o They may also contain deterministic trends (trends that are constant, positive or negative, independent of time, for the whole life of the series).
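A minimal sketch of that remedy: first-order differencing turns a random walk back into its white-noise steps, which are stationary:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=300))  # non-stationary random walk
diffed = np.diff(walk)                  # first difference: the white-noise steps

print("before differencing, p =", round(adfuller(walk)[1], 3))    # large p: non-stationary
print("after differencing,  p =", round(adfuller(diffed)[1], 3))  # tiny p: stationary
```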

-------------------------------------------------------------------------------------------------------------------
