Unstructured Data Classification

The document outlines a series of tasks and questions related to sentiment analysis and natural language processing (NLP). It covers dataset loading, supervised learning concepts, text classification, performance metrics like confusion matrix, and techniques such as lemmatization and TF-IDF. Additionally, it addresses issues like class imbalance and overfitting in machine learning models.

Uploaded by

Gurram Anurag

1. a) Download the dataset from
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it into the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'.
c) Try out the code snippets and answer the questions.
To view the first 3 rows of the dataset, which of the following commands is used?

sentiment_analysis_data.head(3)
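
Steps a) to c) can be sketched as follows. This is only an illustration: it assumes pandas is installed and that training.txt holds one tab-separated label and message per line with no header row. The four sample rows are made up and stand in for the downloaded file.

```python
import io

import pandas as pd

# Hypothetical stand-in for training.txt: "label<TAB>message" on each line.
sample = io.StringIO(
    "1\tI love this movie\n"
    "0\tThis book was awful\n"
    "1\tWhat a great story\n"
    "0\tI hated the ending\n"
)

# header=None because the file is assumed to have no header row;
# names= assigns the column names asked for in step b).
sentiment_analysis_data = pd.read_csv(
    sample, sep="\t", header=None, names=["label", "message"]
)

# head(3) returns a new DataFrame holding the first three rows.
first_three = sentiment_analysis_data.head(3)
print(first_three)
```

For the real file, the path to the downloaded training.txt would replace the in-memory sample.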

2. In supervised learning, class labels of the training samples are ____________
Known

3. Inverse document frequency is used in the term-document matrix.
True

4. Can we consider sentiment classification as a text classification problem?
Yes

5. In document classification, each document has to be converted from full text to a
document vector.
True
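
One simple way to turn text into a document vector is a bag-of-words count. The sketch below is an assumption-laden illustration, not the quiz's own method: it splits on whitespace and uses two made-up documents.

```python
from collections import Counter

# Two made-up documents, used only for illustration.
documents = ["the movie was great", "the movie was bad bad"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc) for doc in documents]
print(vocabulary)
print(vectors)
```

Every document maps to a vector of the same length, which is what a classifier needs as input.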

6. A technique used to depict the performance in a tabular form that has 2
dimensions namely actual and predicted sets of data is ___________
Confusion Matrix
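
A confusion matrix for two classes can be built by counting (actual, predicted) pairs. The labels below are invented for illustration, and the layout (rows for actual, columns for predicted) is one common convention rather than the only one.

```python
from collections import Counter

# Invented ground-truth and predicted labels for six samples.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
labels = [0, 1]

# Count how often each (actual, predicted) combination occurs.
pair_counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
confusion = [[pair_counts[(a, p)] for p in labels] for a in labels]

for label, row in zip(labels, confusion):
    print(f"actual={label}: {row}")
```

The diagonal holds the correct predictions; everything off the diagonal is a misclassification.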

7. Which NLP technique uses a lexical knowledge base to obtain the correct base form
of the words?
Lemmatization

8. a) Download the dataset from
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it into the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'.
c) Try out the code snippets and answer the questions.
What does the command sentiment_analysis_data['label'].value_counts() return?

The number of rows for each distinct value in the 'label' column
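
A small sketch of what value_counts() produces, assuming pandas is installed. The four-row frame is made up and only mimics the 'label' and 'message' columns.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with the column names from step b).
sentiment_analysis_data = pd.DataFrame(
    {
        "label": [1, 0, 1, 1],
        "message": ["good", "bad", "great", "nice"],
    }
)

# value_counts() returns a Series indexed by the distinct labels,
# sorted by count in descending order.
label_counts = sentiment_analysis_data["label"].value_counts()
print(label_counts)
```

Comparing the counts per label is also a quick first check for class imbalance.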

9. a) Download the dataset from
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it into the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'.
c) Try out the code snippets and answer the questions.
What command should be given to tokenize a sentence into words?

from nltk.tokenize import word_tokenize
Word_tokens = word_tokenize(sentence)

10. Which numerical statistic is used to identify the importance of a rare word in
a document?
TF-IDF
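
The effect of TF-IDF on rare words can be seen in a few lines. This sketch assumes the plain, unsmoothed form tf * log(N / df); libraries often use smoothed variants, so their numbers will differ. The three documents are invented.

```python
import math

# Three invented, already tokenized documents.
documents = [
    "the movie was great".split(),
    "the movie was bad".split(),
    "the plot was thin".split(),
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to term."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency, unsmoothed.

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "the" appears in every document, "great" in only one.
common = tf_idf("the", documents[0], documents)
rare = tf_idf("great", documents[0], documents)
print(common, rare)
```

Both words occur once in the first document, yet only the rare one receives a non-zero weight.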

11. Which type of cross-validation is used for an imbalanced dataset?
Stratified K-Fold (plain K-Fold does not keep the class proportions in each fold)
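
With an imbalanced dataset, folds are commonly built so that each one keeps roughly the overall class ratio (stratification). The function below is a hypothetical, simplified sketch of that idea without shuffling; in practice scikit-learn's StratifiedKFold would normally be used instead.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions."""
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal the indices of each class round-robin across the folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Invented imbalanced labels: eight samples of class 0, four of class 1.
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)

for fold in folds:
    print(fold, [labels[i] for i in fold])
```

Each fold ends up with two samples of class 0 and one of class 1, matching the 2:1 ratio of the full set.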

12. Cross-validation causes over-fitting.
False

13. Select the pre-processing technique(s) from the following.
All the options

14. Clustering is supervised classification.
False

15. a) Download the dataset from
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it into the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'.
c) Try out the code snippets and answer the questions.
Is there a class imbalance problem in the given data set?
Yes

16. SVM is a _____________
Supervised learning algorithm

17. In a Term Document Matrix (TDM), each row represents ____________
A term (each column is a document; the cells hold frequency or TF-IDF values)

18. Imagine you have just finished training a decision tree for spam classification,
and it is showing abnormally bad performance on both your training and test sets.
Assume that your implementation has no bugs. What could be the reason for this
problem?
All the options

19. Which of the given hyperparameters, when increased, may cause the random forest
to overfit the data?
Depth of Tree

20. In a Document Term Matrix (DTM), each row represents ____________
A document (each column is a term; the cells hold frequency or TF-IDF values)
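
Under the usual convention, a Document Term Matrix has one row per document and one column per term, and a Term Document Matrix is its transpose. The sketch below uses raw counts as cell values and two invented documents; TF-IDF weights could be stored in the cells instead.

```python
# Two invented documents, used only for illustration.
documents = ["spam offer now", "meeting now"]

# Sorted vocabulary gives a fixed term order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# DTM: one row per document, one column per term.
dtm = [[doc.split().count(term) for term in vocabulary] for doc in documents]

# TDM: the transpose, one row per term, one column per document.
tdm = [list(row) for row in zip(*dtm)]

print(vocabulary)
print("DTM:", dtm)
print("TDM:", tdm)
```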

21. Email spam data is an example of __________
Unstructured data

22.
