{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "NLP notes.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "oNJxtkag_19P"
},
"source": [
"#**NATURAL LANGUAGE PREPROCESSING (NLP)**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9TMXJzIv_17A"
},
"source": [
"**What will you learn?**\r\n",
"1. **NLP**: Intoduction to NLP\r\n",
"2. **NLTK**: NLTK library\r\n",
"3. **Tokenization**: Converting paragraphs into sentences and words.\r\n",
"4. **StopWords**: Most common words \r\n",
"5. **Stemming**: Converting the words to their base form.\r\n",
"6. **POS tag**: Part of speech of each word.\r\n",
"7. **Lemmatization**: gives the meaningful word in proper form.\r\n",
"8. **Implementation on Movie Review Dataset**\r\n",
"9. **Count-vectorizer**: counts the frequency of words\r\n",
"10. **N-grams** & **TF-IDF**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iJeddoM5_14I"
},
"source": [
"Natural language processing (NLP) is a subfield of computer science,
information engineering, and artificial intelligence concerned with the
interactions between computers and human (natural) languages, in particular how to
program computers to process and analyze large amounts of natural language data .\
r\n",
"\r\n",
"Let's use an example to show just how powerful NLP is when used in a
practical situation. When you're typing on your phone, like many of us do every
day, you'll see word suggestions based on what you type and what you're currently
typing. That's natural language processing in action.\r\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MnZ3WZ7p_11I"
},
"source": [
"NLP includes many different techniques for interpreting human language,
ranging from statistical and machine learning methods to rules-based and
algorithmic approaches. We need a broad array of approaches because the text- and
voice-based data varies widely, as do the practical applications. \r\n",
"\r\n",
"Basic NLP tasks include tokenization and parsing, lemmatization/stemming,
part-of-speech tagging, language detection and identification of semantic
relationships. \r\n",
"\r\n",
"In general terms, NLP tasks break down language into shorter, elemental
pieces, try to understand relationships between the pieces and explore how the
pieces work together to create meaning.\r\n",
"\r\n",
"We will study about all of them in detail."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "izf2fX3D_1yI"
},
"source": [
"##**Natural Language toolkit (NLTK)**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hBq6oXLn_1wA"
},
"source": [
"The Natural Language Toolkit (NLTK) is a platform used for building Python
programs that work with human language data for applying in statistical natural
language processing (NLP).\r\n",
"\r\n",
"It contains text processing libraries for tokenization, parsing,
classification, stemming, tagging and semantic reasoning. It also includes
graphical demonstrations and sample data sets as well as accompanied by a cook book
and a book which explains the principles behind the underlying language processing
tasks that NLTK supports.\r\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EZ11bTp-_1sy"
},
"source": [
"###**Installing NLTK**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "somB3Exu_1pu"
},
"source": [
"Type $!pip \\;install\\; nltk$ in the Jupyter Notebook or if it doesn’t
work in cmd type conda install$ -c \\;conda-forge\\; nltk.$ This should work in
most cases.\r\n",
"\r\n",
"To check if NLTK has installed correctly, you can open your Python
terminal and type the following: Import nltk. If everything goes fine, that means
you've successfully installed NLTK library."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9W_2x-UC_1mv"
},
"source": [
"##**Tokenisation**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UGCTa4i-_1k_"
},
"source": [
"**Tokenization** is the process breaking complex data like paragraphs into
simple units called tokens. Tokens can be individual words, phrases or even whole
sentences. In the process of tokenization, some characters like punctuation marks
are discarded.\r\n",
"1. **Sentence tokenization** : split a paragraph into list of sentences
using sent_tokenize() method\r\n",
"2. **Word tokenization** : split a sentence into list of words using
word_tokenize() method\r\n",
"\r\n",
"Import all the libraries required to perform tokenization on input data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "kYF-IrOfGzj0"
},
"source": [
"import nltk"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "r0UjInG9F3cp"
},
"source": [
"from nltk.tokenize import sent_tokenize, word_tokenize"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "7ReOFuAjGGaF"
},
"source": [
"sample_text = \"In the ninja world, those who break the rules are trash.
That's true, but those who abandon their friends are worse than trash.\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WJAtERHkGqJ7",
"outputId": "62aad1a8-a1cb-48b2-e241-991ea9ddd0c0"
},
"source": [
"sent_tokenize(sample_text) #converted the sample text into sentences\r\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt.zip.\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['In the ninja world, those who break the rules are trash.',\n",
" \"That's true, but those who abandon their friends are worse than
trash.\"]"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "X2xAt2QcGqIU",
"outputId": "ec170d0d-7010-4cea-e9d1-85e6a1cb8bab"
},
"source": [
"words = word_tokenize(sample_text)\r\n",
"print(words,\"\\ncount = \",len(words))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['In', 'the', 'ninja', 'world', ',', 'those', 'who', 'break', 'the',
'rules', 'are', 'trash', '.', 'That', \"'s\", 'true', ',', 'but', 'those', 'who',
'abandon', 'their', 'friends', 'are', 'worse', 'than', 'trash', '.'] \n",
"count = 28\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J55aOc34_1iJ"
},
"source": [
"Data Cleaning plays important role in NLP to remove noise from data."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HCUc7kdP_1fh"
},
"source": [
"##**Stop Words**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "awOTdrRv_1c-"
},
"source": [
"**Stop Words** refers to the most common words in a language (such
as \"the\", \"a\", \"an\", \"in\") which helps in formation of sentence to make
sense, but these words does not provide any significance in language processing so
remove it .\r\n",
"\r\n",
"In computing, stop words are words which are filtered out before or after
processing of natural language data (text). "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yt7xNCi-_1aX"
},
"source": [
"You can check list of stopwords by running below code snippet"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9z1iYMxjI4AV",
"outputId": "f9ffe3b6-e9d9-445d-b788-b10e3c2c7366"
},
"source": [
"nltk.download('stopwords')\r\n",
"from nltk.corpus import stopwords\r\n",
"stop = stopwords.words('english')\r\n",
"stop"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Unzipping corpora/stopwords.zip.\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['i',\n",
" 'me',\n",
" 'my',\n",
" 'myself',\n",
" 'we',\n",
" 'our',\n",
" 'ours',\n",
" 'ourselves',\n",
" 'you',\n",
" \"you're\",\n",
" \"you've\",\n",
" \"you'll\",\n",
" \"you'd\",\n",
" 'your',\n",
" 'yours',\n",
" 'yourself',\n",
" 'yourselves',\n",
" 'he',\n",
" 'him',\n",
" 'his',\n",
" 'himself',\n",
" 'she',\n",
" \"she's\",\n",
" 'her',\n",
" 'hers',\n",
" 'herself',\n",
" 'it',\n",
" \"it's\",\n",
" 'its',\n",
" 'itself',\n",
" 'they',\n",
" 'them',\n",
" 'their',\n",
" 'theirs',\n",
" 'themselves',\n",
" 'what',\n",
" 'which',\n",
" 'who',\n",
" 'whom',\n",
" 'this',\n",
" 'that',\n",
" \"that'll\",\n",
" 'these',\n",
" 'those',\n",
" 'am',\n",
" 'is',\n",
" 'are',\n",
" 'was',\n",
" 'were',\n",
" 'be',\n",
" 'been',\n",
" 'being',\n",
" 'have',\n",
" 'has',\n",
" 'had',\n",
" 'having',\n",
" 'do',\n",
" 'does',\n",
" 'did',\n",
" 'doing',\n",
" 'a',\n",
" 'an',\n",
" 'the',\n",
" 'and',\n",
" 'but',\n",
" 'if',\n",
" 'or',\n",
" 'because',\n",
" 'as',\n",
" 'until',\n",
" 'while',\n",
" 'of',\n",
" 'at',\n",
" 'by',\n",
" 'for',\n",
" 'with',\n",
" 'about',\n",
" 'against',\n",
" 'between',\n",
" 'into',\n",
" 'through',\n",
" 'during',\n",
" 'before',\n",
" 'after',\n",
" 'above',\n",
" 'below',\n",
" 'to',\n",
" 'from',\n",
" 'up',\n",
" 'down',\n",
" 'in',\n",
" 'out',\n",
" 'on',\n",
" 'off',\n",
" 'over',\n",
" 'under',\n",
" 'again',\n",
" 'further',\n",
" 'then',\n",
" 'once',\n",
" 'here',\n",
" 'there',\n",
" 'when',\n",
" 'where',\n",
" 'why',\n",
" 'how',\n",
" 'all',\n",
" 'any',\n",
" 'both',\n",
" 'each',\n",
" 'few',\n",
" 'more',\n",
" 'most',\n",
" 'other',\n",
" 'some',\n",
" 'such',\n",
" 'no',\n",
" 'nor',\n",
" 'not',\n",
" 'only',\n",
" 'own',\n",
" 'same',\n",
" 'so',\n",
" 'than',\n",
" 'too',\n",
" 'very',\n",
" 's',\n",
" 't',\n",
" 'can',\n",
" 'will',\n",
" 'just',\n",
" 'don',\n",
" \"don't\",\n",
" 'should',\n",
" \"should've\",\n",
" 'now',\n",
" 'd',\n",
" 'll',\n",
" 'm',\n",
" 'o',\n",
" 're',\n",
" 've',\n",
" 'y',\n",
" 'ain',\n",
" 'aren',\n",
" \"aren't\",\n",
" 'couldn',\n",
" \"couldn't\",\n",
" 'didn',\n",
" \"didn't\",\n",
" 'doesn',\n",
" \"doesn't\",\n",
" 'hadn',\n",
" \"hadn't\",\n",
" 'hasn',\n",
" \"hasn't\",\n",
" 'haven',\n",
" \"haven't\",\n",
" 'isn',\n",
" \"isn't\",\n",
" 'ma',\n",
" 'mightn',\n",
" \"mightn't\",\n",
" 'mustn',\n",
" \"mustn't\",\n",
" 'needn',\n",
" \"needn't\",\n",
" 'shan',\n",
" \"shan't\",\n",
" 'shouldn',\n",
" \"shouldn't\",\n",
" 'wasn',\n",
" \"wasn't\",\n",
" 'weren',\n",
" \"weren't\",\n",
" 'won',\n",
" \"won't\",\n",
" 'wouldn',\n",
" \"wouldn't\"]"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2ifmPPHyJDRg",
"outputId": "ab1d0049-a57f-44e9-dba4-38c45384587c"
},
"source": [
"clean_words = [w for w in words if not w.lower() in stop]\r\n",
"print(clean_words,\"\\ncount = \",len(clean_words))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['ninja', 'world', ',', 'break', 'rules', 'trash', '.', \"'s\",
'true', ',', 'abandon', 'friends', 'worse', 'trash', '.'] \n",
"count = 15\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4Fl5gaqOI1DM"
},
"source": [
"Words like \"in\", \"the\", \"who\",etc. are removed from the list of
words."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "59fMwzP-I0gX"
},
"source": [
"**Remove Punctuations**\r\n",
"\r\n",
"To remove punctuations from the list of words, import all punctuations and
add them in the stop word list."
]
},
{
"cell_type": "code",
"metadata": {
"id": "dj1CPsYHPJSj"
},
"source": [
"import string\r\n",
"punctuations = list(string.punctuation)\r\n",
"stop = stop + punctuations"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IS7bV8nhPZXh",
"outputId": "0f77ec8f-c861-4b61-bd8f-ac6b5369e97f"
},
"source": [
"clean_words = [w for w in words if not w.lower() in stop]\r\n",
"print(clean_words,\"\\ncount = \",len(clean_words))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['ninja', 'world', 'break', 'rules', 'trash', \"'s\", 'true',
'abandon', 'friends', 'worse', 'trash'] \n",
"count = 11\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GW-emZqA_1X2"
},
"source": [
"##**Stemming**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q-5pX0t9_1U3"
},
"source": [
"**Stemming** is a normalization technique where list of tokenized words
are converted into shorten root words to remove redundancy. Stemming is the process
of reducing inflected (or sometimes derived) words to their word stem, base or root
form.\r\n",
"\r\n",
"A computer program that stems word may be called a stemmer.\r\n",
"\r\n",
"A stemmer reduce the words like fishing, fished, and fisher to the stem
fish. The stem need not be a word, for example the Porter algorithm reduces, argue,
argued, argues, arguing, and argus to the stem argu .\r\n",
"\r\n",
"It removes suffices, like \"ing\", \"ly\", \"s\", etc. by a simple rule-
based approach. It reduces the corpus of words but often the actual words get
neglected."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fzCt_UeU_1Q1"
},
"source": [
"**Various Stemming algorithms**\r\n",
"1. **Porter stemming algorithm**: This class knows several regular word
forms and suffixes with the help of which it can transform the input word to a
final stem.\r\n",
"2. **Lancaster stemming algorithm**: It was developed at Lancaster
University and it is another very common stemming algorithms.\r\n",
"NLTK has LancasterStemmer class with the help of which we can easily
implement Lancaster Stemmer algorithms for the word we want to stem.\r\n",
"3. **Regular Expression stemming algorithm** : With the help of this
stemming algorithm, we can construct our own stemmer.\r\n",
"NLTK has RegexpStemmer class with the help of which we can easily
implement Regular Expression Stemmer algorithms. It basically takes a single
regular expression and removes any prefix or suffix that matches the expression.\r\
n",
"4. **Snowball stemming algorithm**: NLTK has SnowballStemmer class with
the help of which we can easily implement Snowball Stemmer algorithms. It supports
15 non-English languages. In order to use this steaming class, we need to create an
instance with the name of the language we are using and then call the stem()
method."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "nksAa1HJ55xC",
"outputId": "1ec1e449-effb-405c-c659-b92132f26ecc"
},
"source": [
"stem_words =
[\"play\", \"played\", \"playing\", \"player\", \"happier\", \"happiness\", \"unive
rse\", \"universal\"]\r\n",
"from nltk.stem import PorterStemmer #Here we have used the porter stemming
algorithm\r\n",
"ps = PorterStemmer()\r\n",
"for w in stem_words:\r\n",
" print (ps.stem(w))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"play\n",
"play\n",
"play\n",
"player\n",
"happier\n",
"happi\n",
"univers\n",
"univers\n"
],
"name": "stdout"
}
]
},
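{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the other stemmers described above can be run in the same way. A minimal sketch (the word list and the regular expression passed to RegexpStemmer are our own illustrative choices):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer\r\n",
"\r\n",
"ls = LancasterStemmer()\r\n",
"rs = RegexpStemmer('ing$|s$|ed$', min=4)  # strips only these suffixes\r\n",
"ss = SnowballStemmer('english')  # instance created with a language name\r\n",
"for w in [\"played\", \"happiness\", \"universal\"]:\r\n",
"    print(w, '->', ls.stem(w), '|', rs.stem(w), '|', ss.stem(w))"
],
"execution_count": null,
"outputs": []
},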
{
"cell_type": "markdown",
"metadata": {
"id": "nFVlqCLDZc1F"
},
"source": [
"##**POS Tag**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r0Kbc7hbZc2z"
},
"source": [
"Parts of speech Tagging is responsible for reading the text in a language
and assigning some specific token (Parts of Speech) to each word.=\r\n",
"POS tag tell us about grammatical information of words of the sentence by
assigning specific token (Determiner, noun, adjective , adverb ,verb,Personal
Pronoun etc.) as tag (DT,NN ,JJ,RB,VB,PRP etc) to each words.\r\n",
"\r\n",
"Word can have more than one POS depending upon context where it is used.
we can use POS tags as statistical NLP tasks it distinguishes sense of word which
is very helpful in text realization and infer semantic information from gives text
for sentiment analysis."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WwrMn2xcZc5h"
},
"source": [
"Steps Involved:\r\n",
"\r\n",
"1. Tokenize text (word_tokenize)\r\n",
"2. apply pos_tag to above step that is nltk.pos_tag(tokenize_text)"
]
},
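{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of these two steps on a short example sentence of our own (the tagger model is downloaded first):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"nltk.download('averaged_perceptron_tagger')  # model used by pos_tag\r\n",
"tokens = word_tokenize(\"The quick brown fox jumps over the lazy dog\")\r\n",
"nltk.pos_tag(tokens)"
],
"execution_count": null,
"outputs": []
},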
{
"cell_type": "markdown",
"metadata": {
"id": "0k0fB_qnZc7D"
},
"source": [
"POS tag list:\r\n",
"\r\n",
"**Abbreviation** \t **Meaning**\r\n",
"\r\n",
"CC coordinating conjunction\r\n",
"\r\n",
"CD\tcardinal digit\r\n",
"\r\n",
"DT\tdeterminer\r\n",
"\r\n",
"EX\texistential there\r\n",
"\r\n",
"FW\tforeign word\r\n",
"\r\n",
"IN\tpreposition/subordinating conjunction\r\n",
"\r\n",
"JJ\tadjective (large)\r\n",
"\r\n",
"JJR\tadjective, comparative (larger)\r\n",
"\r\n",
"JJS\tadjective, superlative (largest)\r\n",
"\r\n",
"LS\tlist market\r\n",
"\r\n",
"MD\tmodal (could, will)\r\n",
"\r\n",
"NN\tnoun, singular (cat, tree)\r\n",
"\r\n",
"NNS\tnoun plural (desks)\r\n",
"\r\n",
"NNP\tproper noun, singular (sarah)\r\n",
"\r\n",
"NNPS\tproper noun, plural (indians or americans)\r\n",
"\r\n",
"PDT\tpredeterminer (all, both, half)\r\n",
"\r\n",
"POS\tpossessive ending (parent\\ 's)\r\n",
"\r\n",
"PRP\tpersonal pronoun (hers, herself, him,himself)\r\n",
"\r\n",
"PRP$\tpossessive pronoun (her, his, mine, my, our )\r\n",
"\r\n",
"RB\tadverb (occasionally, swiftly)\r\n",
"\r\n",
"RBR\tadverb, comparative (greater)\r\n",
"\r\n",
"RBS\tadverb, superlative (biggest)\r\n",
"\r\n",
"RP\tparticle (about)\r\n",
"\r\n",
"TO\tinfinite marker (to)\r\n",
"\r\n",
"UH\tinterjection (goodbye)\r\n",
"\r\n",
"VB\tverb (ask)\r\n",
"\r\n",
"VBG\tverb gerund (judging)\r\n",
"\r\n",
"VBD\tverb past tense (pleaded)\r\n",
"\r\n",
"VBN\tverb past participle (reunified)\r\n",
"\r\n",
"VBP\tverb, present tense not 3rd person singular(wrap)\r\n",
"\r\n",
"VBZ\tverb, present tense with 3rd person singular (bases)\r\n",
"\r\n",
"WDT\twh-determiner (that, what)\r\n",
"\r\n",
"WP\twh- pronoun (who)\r\n",
"\r\n",
"WRB\twh- adverb (how)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "vMpdlfAmaahc",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "06dcf16c-c4b2-4480-b961-68025b4138b2"
},
"source": [
"nltk.download('state_union')\r\n",
"\r\n",
"from nltk.corpus import state_union\r\n",
"text = state_union.raw(\"2006-GWBush.txt\")"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package state_union to /root/nltk_data...\n",
"[nltk_data] Unzipping corpora/state_union.zip.\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3vqe11_daotx",
"outputId": "858f8c65-1738-4cc9-ddb6-248dce210dae"
},
"source": [
"from nltk import pos_tag\r\n",
"import numpy as np\r\n",
"pos = pos_tag(word_tokenize(text.lower()))\r\n",
"pos2=np.array(pos)\r\n",
"pos2"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package averaged_perceptron_tagger to\n",
"[nltk_data] /root/nltk_data...\n",
"[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([['president', 'NN'],\n",
" ['george', 'NN'],\n",
" ['w.', 'VBD'],\n",
" ...,\n",
" ['applause', 'IN'],\n",
" ['.', '.'],\n",
" [')', ')']], dtype='<U18')"
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XO85PZCFawNF",
"outputId": "e1112163-1637-43d4-8a51-c7e7ddbf99d9"
},
"source": [
"print(pos_tag([\"One\"]),\r\n",
" pos_tag([\"legendary\"]),\r\n",
" pos_tag([\"flying\"]),\r\n",
" pos_tag([\"person\"]))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[('One', 'CD')] [('legendary', 'JJ')] [('flying', 'VBG')] [('person',
'NN')]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "m-egWEU5bAt_"
},
"source": [
"##**Lemmatization**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yPYT16q7bAyl"
},
"source": [
"Major drawback of stemming is it produces Intermediate representation of
word. Stemmer may or may not return meaningful word.\r\n",
"\r\n",
"To overcome this problem , Lemmatization comes into picture.\r\n",
"Stemming algorithm works by cutting suffix or prefix from the word.On the
contrary Lemmatization consider morphological analysis of the words and returns
meaningful word in proper form.\r\n",
"\r\n",
" The output we will get after lemmatization is called ‘lemma’, which is a
root word rather than root stem,"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JhK69BMLbA0X"
},
"source": [
"NLTK provides WordNetLemmatizer class which is a thin wrapper around the
wordnet corpus."
]
},
{
"cell_type": "code",
"metadata": {
"id": "0p_4_0_Mc_Zu"
},
"source": [
"from nltk.stem import WordNetLemmatizer\r\n",
"\r\n",
"lem = WordNetLemmatizer()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "5_FsXkeNd5Sr",
"outputId": "2b1b3020-9873-4738-e43d-82bedc54874b"
},
"source": [
"\r\n",
" \r\n",
"lem.lemmatize(\"good\", pos = 'a')\r\n",
" "
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'good'"
]
},
"metadata": {
"tags": []
},
"execution_count": 26
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "J9UmE1KYd-AB",
"outputId": "cb0a1cd2-4138-4c3a-de22-b21830cfe4ad"
},
"source": [
"lem.lemmatize(\"better\", pos = 'a')"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'good'"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "b9YxjY_meBSo",
"outputId": "dae42fde-9de2-4951-b6f1-5cb44c1e3aa5"
},
"source": [
"lem.lemmatize(\"painting\", pos = 'n') \r\n",
"#Here painting is a noun which means painting can't be converted into
paint. For eg\r\n",
"#\"This painting is beautiful\". Here painting cannot be changed."
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'painting'"
]
},
"metadata": {
"tags": []
},
"execution_count": 28
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "qG9NFvPGeDRs",
"outputId": "6e2b283d-65fd-4053-bca6-4e4197e2c209"
},
"source": [
"lem.lemmatize(\"painting\", pos = 'v')\r\n",
"#Here painting is a verb which means it can be converted into paint.\r\n",
"#\"I love painting\""
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'paint'"
]
},
"metadata": {
"tags": []
},
"execution_count": 29
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q4ae_VeVeoTX"
},
"source": [
"##**Working on Movie Reviews Dataset**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0j-nqG6SfiQe"
},
"source": [
"**STEP 1**: Importing the data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "XHsNbuXJesqn",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "8bd2a125-2b78-41ac-8d42-4e7c23da8fbf"
},
"source": [
"from nltk.corpus import movie_reviews\r\n",
"nltk.download('movie_reviews')"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package movie_reviews to /root/nltk_data...\
n",
"[nltk_data] Unzipping corpora/movie_reviews.zip.\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 34
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1BrxunscfrdP",
"outputId": "bd4b2ebb-2571-48df-912a-69d2a75c181e"
},
"source": [
"len(movie_reviews.fileids()) # dataset contains 2000 movie reviews out of
which 1000 are positive and rest are negative."
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2000"
]
},
"metadata": {
"tags": []
},
"execution_count": 35
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0_9k_UHgfw0S",
"outputId": "1e15d7d0-d91c-48e3-f921-47745ec3c96f"
},
"source": [
"movie_reviews.fileids('pos')"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['pos/cv000_29590.txt',\n",
" 'pos/cv001_18431.txt',\n",
" 'pos/cv002_15918.txt',\n",
" 'pos/cv003_11664.txt',\n",
" 'pos/cv004_11636.txt',\n",
" 'pos/cv005_29443.txt',\n",
" 'pos/cv006_15448.txt',\n",
" 'pos/cv007_4968.txt',\n",
" 'pos/cv008_29435.txt',\n",
" 'pos/cv009_29592.txt',\n",
" 'pos/cv010_29198.txt',\n",
" 'pos/cv011_12166.txt',\n",
" 'pos/cv012_29576.txt',\n",
" 'pos/cv013_10159.txt',\n",
" 'pos/cv014_13924.txt',\n",
" 'pos/cv015_29439.txt',\n",
" 'pos/cv016_4659.txt',\n",
" 'pos/cv017_22464.txt',\n",
" 'pos/cv018_20137.txt',\n",
" 'pos/cv019_14482.txt',\n",
" 'pos/cv020_8825.txt',\n",
" 'pos/cv021_15838.txt',\n",
" 'pos/cv022_12864.txt',\n",
" 'pos/cv023_12672.txt',\n",
" 'pos/cv024_6778.txt',\n",
" 'pos/cv025_3108.txt',\n",
" 'pos/cv026_29325.txt',\n",
" 'pos/cv027_25219.txt',\n",
" 'pos/cv028_26746.txt',\n",
" 'pos/cv029_18643.txt',\n",
" 'pos/cv030_21593.txt',\n",
" 'pos/cv031_18452.txt',\n",
" 'pos/cv032_22550.txt',\n",
" 'pos/cv033_24444.txt',\n",
" 'pos/cv034_29647.txt',\n",
" 'pos/cv035_3954.txt',\n",
" 'pos/cv036_16831.txt',\n",
" 'pos/cv037_18510.txt',\n",
" 'pos/cv038_9749.txt',\n",
" 'pos/cv039_6170.txt',\n",
" 'pos/cv040_8276.txt',\n",
" 'pos/cv041_21113.txt',\n",
" 'pos/cv042_10982.txt',\n",
" 'pos/cv043_15013.txt',\n",
" 'pos/cv044_16969.txt',\n",
" 'pos/cv045_23923.txt',\n",
" 'pos/cv046_10188.txt',\n",
" 'pos/cv047_1754.txt',\n",
" 'pos/cv048_16828.txt',\n",
" 'pos/cv049_20471.txt',\n",
" 'pos/cv050_11175.txt',\n",
" 'pos/cv051_10306.txt',\n",
" 'pos/cv052_29378.txt',\n",
" 'pos/cv053_21822.txt',\n",
" 'pos/cv054_4230.txt',\n",
" 'pos/cv055_8338.txt',\n",
" 'pos/cv056_13133.txt',\n",
" 'pos/cv057_7453.txt',\n",
" 'pos/cv058_8025.txt',\n",
" 'pos/cv059_28885.txt',\n",
" 'pos/cv060_10844.txt',\n",
" 'pos/cv061_8837.txt',\n",
" 'pos/cv062_23115.txt',\n",
" 'pos/cv063_28997.txt',\n",
" 'pos/cv064_24576.txt',\n",
" 'pos/cv065_15248.txt',\n",
" 'pos/cv066_10821.txt',\n",
" 'pos/cv067_19774.txt',\n",
" 'pos/cv068_13400.txt',\n",
" 'pos/cv069_10801.txt',\n",
" 'pos/cv070_12289.txt',\n",
" 'pos/cv071_12095.txt',\n",
" 'pos/cv072_6169.txt',\n",
" 'pos/cv073_21785.txt',\n",
" 'pos/cv074_6875.txt',\n",
" 'pos/cv075_6500.txt',\n",
" 'pos/cv076_24945.txt',\n",
" 'pos/cv077_22138.txt',\n",
" 'pos/cv078_14730.txt',\n",
" 'pos/cv079_11933.txt',\n",
" 'pos/cv080_13465.txt',\n",
" 'pos/cv081_16582.txt',\n",
" 'pos/cv082_11080.txt',\n",
" 'pos/cv083_24234.txt',\n",
" 'pos/cv084_13566.txt',\n",
" 'pos/cv085_1381.txt',\n",
" 'pos/cv086_18371.txt',\n",
" 'pos/cv087_1989.txt',\n",
" 'pos/cv088_24113.txt',\n",
" 'pos/cv089_11418.txt',\n",
" 'pos/cv090_0042.txt',\n",
" 'pos/cv091_7400.txt',\n",
" 'pos/cv092_28017.txt',\n",
" 'pos/cv093_13951.txt',\n",
" 'pos/cv094_27889.txt',\n",
" 'pos/cv095_28892.txt',\n",
" 'pos/cv096_11474.txt',\n",
" 'pos/cv097_24970.txt',\n",
" 'pos/cv098_15435.txt',\n",
" 'pos/cv099_10534.txt',\n",
" 'pos/cv100_11528.txt',\n",
" 'pos/cv101_10175.txt',\n",
" 'pos/cv102_7846.txt',\n",
" 'pos/cv103_11021.txt',\n",
" 'pos/cv104_18134.txt',\n",
" 'pos/cv105_17990.txt',\n",
" 'pos/cv106_16807.txt',\n",
" 'pos/cv107_24319.txt',\n",
" 'pos/cv108_15571.txt',\n",
" 'pos/cv109_21172.txt',\n",
" 'pos/cv110_27788.txt',\n",
" 'pos/cv111_11473.txt',\n",
" 'pos/cv112_11193.txt',\n",
" 'pos/cv113_23102.txt',\n",
" 'pos/cv114_18398.txt',\n",
" 'pos/cv115_25396.txt',\n",
" 'pos/cv116_28942.txt',\n",
" 'pos/cv117_24295.txt',\n",
" 'pos/cv118_28980.txt',\n",
" 'pos/cv119_9867.txt',\n",
" 'pos/cv120_4111.txt',\n",
" 'pos/cv121_17302.txt',\n",
" 'pos/cv122_7392.txt',\n",
" 'pos/cv123_11182.txt',\n",
" 'pos/cv124_4122.txt',\n",
" 'pos/cv125_9391.txt',\n",
" 'pos/cv126_28971.txt',\n",
" 'pos/cv127_14711.txt',\n",
" 'pos/cv128_29627.txt',\n",
" 'pos/cv129_16741.txt',\n",
" 'pos/cv130_17083.txt',\n",
" 'pos/cv131_10713.txt',\n",
" 'pos/cv132_5618.txt',\n",
" 'pos/cv133_16336.txt',\n",
" 'pos/cv134_22246.txt',\n",
" 'pos/cv135_11603.txt',\n",
" 'pos/cv136_11505.txt',\n",
" 'pos/cv137_15422.txt',\n",
" 'pos/cv138_12721.txt',\n",
" 'pos/cv139_12873.txt',\n",
" 'pos/cv140_7479.txt',\n",
" 'pos/cv141_15686.txt',\n",
" 'pos/cv142_22516.txt',\n",
" 'pos/cv143_19666.txt',\n",
" 'pos/cv144_5007.txt',\n",
" 'pos/cv145_11472.txt',\n",
" 'pos/cv146_18458.txt',\n",
" 'pos/cv147_21193.txt',\n",
" 'pos/cv148_16345.txt',\n",
" 'pos/cv149_15670.txt',\n",
" 'pos/cv150_12916.txt',\n",
" 'pos/cv151_15771.txt',\n",
" 'pos/cv152_8736.txt',\n",
" 'pos/cv153_10779.txt',\n",
" 'pos/cv154_9328.txt',\n",
" 'pos/cv155_7308.txt',\n",
" 'pos/cv156_10481.txt',\n",
" 'pos/cv157_29372.txt',\n",
" 'pos/cv158_10390.txt',\n",
" 'pos/cv159_29505.txt',\n",
" 'pos/cv160_10362.txt',\n",
" 'pos/cv161_11425.txt',\n",
" 'pos/cv162_10424.txt',\n",
" 'pos/cv163_10052.txt',\n",
" 'pos/cv164_22447.txt',\n",
" 'pos/cv165_22619.txt',\n",
" 'pos/cv166_11052.txt',\n",
" 'pos/cv167_16376.txt',\n",
" 'pos/cv168_7050.txt',\n",
" 'pos/cv169_23778.txt',\n",
" 'pos/cv170_3006.txt',\n",
" 'pos/cv171_13537.txt',\n",
" 'pos/cv172_11131.txt',\n",
" 'pos/cv173_4471.txt',\n",
" 'pos/cv174_9659.txt',\n",
" 'pos/cv175_6964.txt',\n",
" 'pos/cv176_12857.txt',\n",
" 'pos/cv177_10367.txt',\n",
" 'pos/cv178_12972.txt',\n",
" 'pos/cv179_9228.txt',\n",
" 'pos/cv180_16113.txt',\n",
" 'pos/cv181_14401.txt',\n",
" 'pos/cv182_7281.txt',\n",
" 'pos/cv183_18612.txt',\n",
" 'pos/cv184_2673.txt',\n",
" 'pos/cv185_28654.txt',\n",
" 'pos/cv186_2269.txt',\n",
" 'pos/cv187_12829.txt',\n",
" 'pos/cv188_19226.txt',\n",
" 'pos/cv189_22934.txt',\n",
" 'pos/cv190_27052.txt',\n",
" 'pos/cv191_29719.txt',\n",
" 'pos/cv192_14395.txt',\n",
" 'pos/cv193_5416.txt',\n",
" 'pos/cv194_12079.txt',\n",
" 'pos/cv195_14528.txt',\n",
" 'pos/cv196_29027.txt',\n",
" 'pos/cv197_29328.txt',\n",
" 'pos/cv198_18180.txt',\n",
" 'pos/cv199_9629.txt',\n",
" 'pos/cv200_2915.txt',\n",
" 'pos/cv201_6997.txt',\n",
" 'pos/cv202_10654.txt',\n",
" 'pos/cv203_17986.txt',\n",
" 'pos/cv204_8451.txt',\n",
" 'pos/cv205_9457.txt',\n",
" 'pos/cv206_14293.txt',\n",
" 'pos/cv207_29284.txt',\n",
" 'pos/cv208_9020.txt',\n",
" 'pos/cv209_29118.txt',\n",
" 'pos/cv210_9312.txt',\n",
" 'pos/cv211_9953.txt',\n",
" 'pos/cv212_10027.txt',\n",
" 'pos/cv213_18934.txt',\n",
" 'pos/cv214_12294.txt',\n",
" 'pos/cv215_22240.txt',\n",
" 'pos/cv216_18738.txt',\n",
" 'pos/cv217_28842.txt',\n",
" 'pos/cv218_24352.txt',\n",
" 'pos/cv219_18626.txt',\n",
" 'pos/cv220_29059.txt',\n",
" 'pos/cv221_2695.txt',\n",
" 'pos/cv222_17395.txt',\n",
" 'pos/cv223_29066.txt',\n",
" 'pos/cv224_17661.txt',\n",
" 'pos/cv225_29224.txt',\n",
" 'pos/cv226_2618.txt',\n",
" 'pos/cv227_24215.txt',\n",
" 'pos/cv228_5806.txt',\n",
" 'pos/cv229_13611.txt',\n",
" 'pos/cv230_7428.txt',\n",
" 'pos/cv231_10425.txt',\n",
" 'pos/cv232_14991.txt',\n",
" 'pos/cv233_15964.txt',\n",
" 'pos/cv234_20643.txt',\n",
" 'pos/cv235_10217.txt',\n",
" 'pos/cv236_11565.txt',\n",
" 'pos/cv237_19221.txt',\n",
" 'pos/cv238_12931.txt',\n",
" 'pos/cv239_3385.txt',\n",
" 'pos/cv240_14336.txt',\n",
" 'pos/cv241_23130.txt',\n",
" 'pos/cv242_10638.txt',\n",
" 'pos/cv243_20728.txt',\n",
" 'pos/cv244_21649.txt',\n",
" 'pos/cv245_8569.txt',\n",
" 'pos/cv246_28807.txt',\n",
" 'pos/cv247_13142.txt',\n",
" 'pos/cv248_13987.txt',\n",
" 'pos/cv249_11640.txt',\n",
" 'pos/cv250_25616.txt',\n",
" 'pos/cv251_22636.txt',\n",
" 'pos/cv252_23779.txt',\n",
" 'pos/cv253_10077.txt',\n",
" 'pos/cv254_6027.txt',\n",
" 'pos/cv255_13683.txt',\n",
" 'pos/cv256_14740.txt',\n",
" 'pos/cv257_10975.txt',\n",
" 'pos/cv258_5792.txt',\n",
" 'pos/cv259_10934.txt',\n",
" 'pos/cv260_13959.txt',\n",
" 'pos/cv261_10954.txt',\n",
" 'pos/cv262_12649.txt',\n",
" 'pos/cv263_19259.txt',\n",
" 'pos/cv264_12801.txt',\n",
" 'pos/cv265_10814.txt',\n",
" 'pos/cv266_25779.txt',\n",
" 'pos/cv267_14952.txt',\n",
" 'pos/cv268_18834.txt',\n",
" 'pos/cv269_21732.txt',\n",
" 'pos/cv270_6079.txt',\n",
" 'pos/cv271_13837.txt',\n",
" 'pos/cv272_18974.txt',\n",
" 'pos/cv273_29112.txt',\n",
" 'pos/cv274_25253.txt',\n",
" 'pos/cv275_28887.txt',\n",
" 'pos/cv276_15684.txt',\n",
" 'pos/cv277_19091.txt',\n",
" 'pos/cv278_13041.txt',\n",
" 'pos/cv279_18329.txt',\n",
" 'pos/cv280_8267.txt',\n",
" 'pos/cv281_23253.txt',\n",
" 'pos/cv282_6653.txt',\n",
" 'pos/cv283_11055.txt',\n",
" 'pos/cv284_19119.txt',\n",
" 'pos/cv285_16494.txt',\n",
" 'pos/cv286_25050.txt',\n",
" 'pos/cv287_15900.txt',\n",
" 'pos/cv288_18791.txt',\n",
" 'pos/cv289_6463.txt',\n",
" 'pos/cv290_11084.txt',\n",
" 'pos/cv291_26635.txt',\n",
" 'pos/cv292_7282.txt',\n",
" 'pos/cv293_29856.txt',\n",
" 'pos/cv294_11684.txt',\n",
" 'pos/cv295_15570.txt',\n",
" 'pos/cv296_12251.txt',\n",
" 'pos/cv297_10047.txt',\n",
" 'pos/cv298_23111.txt',\n",
" 'pos/cv299_16214.txt',\n",
" 'pos/cv300_22284.txt',\n",
" 'pos/cv301_12146.txt',\n",
" 'pos/cv302_25649.txt',\n",
" 'pos/cv303_27520.txt',\n",
" 'pos/cv304_28706.txt',\n",
" 'pos/cv305_9946.txt',\n",
" 'pos/cv306_10364.txt',\n",
" 'pos/cv307_25270.txt',\n",
" 'pos/cv308_5016.txt',\n",
" 'pos/cv309_22571.txt',\n",
" 'pos/cv310_13091.txt',\n",
" 'pos/cv311_16002.txt',\n",
" 'pos/cv312_29377.txt',\n",
" 'pos/cv313_18198.txt',\n",
" 'pos/cv314_14422.txt',\n",
" 'pos/cv315_11629.txt',\n",
" 'pos/cv316_6370.txt',\n",
" 'pos/cv317_24049.txt',\n",
" 'pos/cv318_10493.txt',\n",
" 'pos/cv319_14727.txt',\n",
" 'pos/cv320_9530.txt',\n",
" 'pos/cv321_12843.txt',\n",
" 'pos/cv322_20318.txt',\n",
" 'pos/cv323_29805.txt',\n",
" 'pos/cv324_7082.txt',\n",
" 'pos/cv325_16629.txt',\n",
" 'pos/cv326_13295.txt',\n",
" 'pos/cv327_20292.txt',\n",
" 'pos/cv328_10373.txt',\n",
" 'pos/cv329_29370.txt',\n",
" 'pos/cv330_29809.txt',\n",
" 'pos/cv331_8273.txt',\n",
" 'pos/cv332_16307.txt',\n",
" 'pos/cv333_8916.txt',\n",
" 'pos/cv334_10001.txt',\n",
" 'pos/cv335_14665.txt',\n",
" 'pos/cv336_10143.txt',\n",
" 'pos/cv337_29181.txt',\n",
" 'pos/cv338_8821.txt',\n",
" 'pos/cv339_21119.txt',\n",
" 'pos/cv340_13287.txt',\n",
" 'pos/cv341_24430.txt',\n",
" 'pos/cv342_19456.txt',\n",
" 'pos/cv343_10368.txt',\n",
" 'pos/cv344_5312.txt',\n",
" 'pos/cv345_9954.txt',\n",
" 'pos/cv346_18168.txt',\n",
" 'pos/cv347_13194.txt',\n",
" 'pos/cv348_18176.txt',\n",
" 'pos/cv349_13507.txt',\n",
" 'pos/cv350_20670.txt',\n",
" 'pos/cv351_15458.txt',\n",
" 'pos/cv352_5524.txt',\n",
" 'pos/cv353_18159.txt',\n",
" 'pos/cv354_8132.txt',\n",
" 'pos/cv355_16413.txt',\n",
" 'pos/cv356_25163.txt',\n",
" 'pos/cv357_13156.txt',\n",
" 'pos/cv358_10691.txt',\n",
" 'pos/cv359_6647.txt',\n",
" 'pos/cv360_8398.txt',\n",
" 'pos/cv361_28944.txt',\n",
" 'pos/cv362_15341.txt',\n",
" 'pos/cv363_29332.txt',\n",
" 'pos/cv364_12901.txt',\n",
" 'pos/cv365_11576.txt',\n",
" 'pos/cv366_10221.txt',\n",
" 'pos/cv367_22792.txt',\n",
" 'pos/cv368_10466.txt',\n",
" 'pos/cv369_12886.txt',\n",
" 'pos/cv370_5221.txt',\n",
" 'pos/cv371_7630.txt',\n",
" 'pos/cv372_6552.txt',\n",
" 'pos/cv373_20404.txt',\n",
" 'pos/cv374_25436.txt',\n",
" 'pos/cv375_9929.txt',\n",
" 'pos/cv376_19435.txt',\n",
" 'pos/cv377_7946.txt',\n",
" 'pos/cv378_20629.txt',\n",
" 'pos/cv379_21963.txt',\n",
" 'pos/cv380_7574.txt',\n",
" 'pos/cv381_20172.txt',\n",
" 'pos/cv382_7897.txt',\n",
" 'pos/cv383_13116.txt',\n",
" 'pos/cv384_17140.txt',\n",
" 'pos/cv385_29741.txt',\n",
" 'pos/cv386_10080.txt',\n",
" 'pos/cv387_11507.txt',\n",
" 'pos/cv388_12009.txt',\n",
" 'pos/cv389_9369.txt',\n",
" 'pos/cv390_11345.txt',\n",
" 'pos/cv391_10802.txt',\n",
" 'pos/cv392_11458.txt',\n",
" 'pos/cv393_29327.txt',\n",
" 'pos/cv394_5137.txt',\n",
" 'pos/cv395_10849.txt',\n",
" 'pos/cv396_17989.txt',\n",
" 'pos/cv397_29023.txt',\n",
" 'pos/cv398_15537.txt',\n",
" 'pos/cv399_2877.txt',\n",
" 'pos/cv400_19220.txt',\n",
" 'pos/cv401_12605.txt',\n",
" 'pos/cv402_14425.txt',\n",
" 'pos/cv403_6621.txt',\n",
" 'pos/cv404_20315.txt',\n",
" 'pos/cv405_20399.txt',\n",
" 'pos/cv406_21020.txt',\n",
" 'pos/cv407_22637.txt',\n",
" 'pos/cv408_5297.txt',\n",
" 'pos/cv409_29786.txt',\n",
" 'pos/cv410_24266.txt',\n",
" 'pos/cv411_15007.txt',\n",
" 'pos/cv412_24095.txt',\n",
" 'pos/cv413_7398.txt',\n",
" 'pos/cv414_10518.txt',\n",
" 'pos/cv415_22517.txt',\n",
" 'pos/cv416_11136.txt',\n",
" 'pos/cv417_13115.txt',\n",
" 'pos/cv418_14774.txt',\n",
" 'pos/cv419_13394.txt',\n",
" 'pos/cv420_28795.txt',\n",
" 'pos/cv421_9709.txt',\n",
" 'pos/cv422_9381.txt',\n",
" 'pos/cv423_11155.txt',\n",
" 'pos/cv424_8831.txt',\n",
" 'pos/cv425_8250.txt',\n",
" 'pos/cv426_10421.txt',\n",
" 'pos/cv427_10825.txt',\n",
" 'pos/cv428_11347.txt',\n",
" 'pos/cv429_7439.txt',\n",
" 'pos/cv430_17351.txt',\n",
" 'pos/cv431_7085.txt',\n",
" 'pos/cv432_14224.txt',\n",
" 'pos/cv433_10144.txt',\n",
" 'pos/cv434_5793.txt',\n",
" 'pos/cv435_23110.txt',\n",
" 'pos/cv436_19179.txt',\n",
" 'pos/cv437_22849.txt',\n",
" 'pos/cv438_8043.txt',\n",
" 'pos/cv439_15970.txt',\n",
" 'pos/cv440_15243.txt',\n",
" 'pos/cv441_13711.txt',\n",
" 'pos/cv442_13846.txt',\n",
" 'pos/cv443_21118.txt',\n",
" 'pos/cv444_9974.txt',\n",
" 'pos/cv445_25882.txt',\n",
" 'pos/cv446_11353.txt',\n",
" 'pos/cv447_27332.txt',\n",
" 'pos/cv448_14695.txt',\n",
" 'pos/cv449_8785.txt',\n",
" 'pos/cv450_7890.txt',\n",
" 'pos/cv451_10690.txt',\n",
" 'pos/cv452_5088.txt',\n",
" 'pos/cv453_10379.txt',\n",
" 'pos/cv454_2053.txt',\n",
" 'pos/cv455_29000.txt',\n",
" 'pos/cv456_18985.txt',\n",
" 'pos/cv457_18453.txt',\n",
" 'pos/cv458_8604.txt',\n",
" 'pos/cv459_20319.txt',\n",
" 'pos/cv460_10842.txt',\n",
" 'pos/cv461_19600.txt',\n",
" 'pos/cv462_19350.txt',\n",
" 'pos/cv463_10343.txt',\n",
" 'pos/cv464_15650.txt',\n",
" 'pos/cv465_22431.txt',\n",
" 'pos/cv466_18722.txt',\n",
" 'pos/cv467_25773.txt',\n",
" 'pos/cv468_15228.txt',\n",
" 'pos/cv469_20630.txt',\n",
" 'pos/cv470_15952.txt',\n",
" 'pos/cv471_16858.txt',\n",
" 'pos/cv472_29280.txt',\n",
" 'pos/cv473_7367.txt',\n",
" 'pos/cv474_10209.txt',\n",
" 'pos/cv475_21692.txt',\n",
" 'pos/cv476_16856.txt',\n",
" 'pos/cv477_22479.txt',\n",
" 'pos/cv478_14309.txt',\n",
" 'pos/cv479_5649.txt',\n",
" 'pos/cv480_19817.txt',\n",
" 'pos/cv481_7436.txt',\n",
" 'pos/cv482_10580.txt',\n",
" 'pos/cv483_16378.txt',\n",
" 'pos/cv484_25054.txt',\n",
" 'pos/cv485_26649.txt',\n",
" 'pos/cv486_9799.txt',\n",
" 'pos/cv487_10446.txt',\n",
" 'pos/cv488_19856.txt',\n",
" 'pos/cv489_17906.txt',\n",
" 'pos/cv490_17872.txt',\n",
" 'pos/cv491_12145.txt',\n",
" 'pos/cv492_18271.txt',\n",
" 'pos/cv493_12839.txt',\n",
" 'pos/cv494_17389.txt',\n",
" 'pos/cv495_14518.txt',\n",
" 'pos/cv496_10530.txt',\n",
" 'pos/cv497_26980.txt',\n",
" 'pos/cv498_8832.txt',\n",
" 'pos/cv499_10658.txt',\n",
" 'pos/cv500_10251.txt',\n",
" 'pos/cv501_11657.txt',\n",
" 'pos/cv502_10406.txt',\n",
" 'pos/cv503_10558.txt',\n",
" 'pos/cv504_29243.txt',\n",
" 'pos/cv505_12090.txt',\n",
" 'pos/cv506_15956.txt',\n",
" 'pos/cv507_9220.txt',\n",
" 'pos/cv508_16006.txt',\n",
" 'pos/cv509_15888.txt',\n",
" 'pos/cv510_23360.txt',\n",
" 'pos/cv511_10132.txt',\n",
" 'pos/cv512_15965.txt',\n",
" 'pos/cv513_6923.txt',\n",
" 'pos/cv514_11187.txt',\n",
" 'pos/cv515_17069.txt',\n",
" 'pos/cv516_11172.txt',\n",
" 'pos/cv517_19219.txt',\n",
" 'pos/cv518_13331.txt',\n",
" 'pos/cv519_14661.txt',\n",
" 'pos/cv520_12295.txt',\n",
" 'pos/cv521_15828.txt',\n",
" 'pos/cv522_5583.txt',\n",
" 'pos/cv523_16615.txt',\n",
" 'pos/cv524_23627.txt',\n",
" 'pos/cv525_16122.txt',\n",
" 'pos/cv526_12083.txt',\n",
" 'pos/cv527_10123.txt',\n",
" 'pos/cv528_10822.txt',\n",
" 'pos/cv529_10420.txt',\n",
" 'pos/cv530_16212.txt',\n",
" 'pos/cv531_26486.txt',\n",
" 'pos/cv532_6522.txt',\n",
" 'pos/cv533_9821.txt',\n",
" 'pos/cv534_14083.txt',\n",
" 'pos/cv535_19728.txt',\n",
" 'pos/cv536_27134.txt',\n",
" 'pos/cv537_12370.txt',\n",
" 'pos/cv538_28667.txt',\n",
" 'pos/cv539_20347.txt',\n",
" 'pos/cv540_3421.txt',\n",
" 'pos/cv541_28835.txt',\n",
" 'pos/cv542_18980.txt',\n",
" 'pos/cv543_5045.txt',\n",
" 'pos/cv544_5108.txt',\n",
" 'pos/cv545_12014.txt',\n",
" 'pos/cv546_11767.txt',\n",
" 'pos/cv547_16324.txt',\n",
" 'pos/cv548_17731.txt',\n",
" 'pos/cv549_21443.txt',\n",
" 'pos/cv550_22211.txt',\n",
" 'pos/cv551_10565.txt',\n",
" 'pos/cv552_10016.txt',\n",
" 'pos/cv553_26915.txt',\n",
" 'pos/cv554_13151.txt',\n",
" 'pos/cv555_23922.txt',\n",
" 'pos/cv556_14808.txt',\n",
" 'pos/cv557_11449.txt',\n",
" 'pos/cv558_29507.txt',\n",
" 'pos/cv559_0050.txt',\n",
" 'pos/cv560_17175.txt',\n",
" 'pos/cv561_9201.txt',\n",
" 'pos/cv562_10359.txt',\n",
" 'pos/cv563_17257.txt',\n",
" 'pos/cv564_11110.txt',\n",
" 'pos/cv565_29572.txt',\n",
" 'pos/cv566_8581.txt',\n",
" 'pos/cv567_29611.txt',\n",
" 'pos/cv568_15638.txt',\n",
" 'pos/cv569_26381.txt',\n",
" 'pos/cv570_29082.txt',\n",
" 'pos/cv571_29366.txt',\n",
" 'pos/cv572_18657.txt',\n",
" 'pos/cv573_29525.txt',\n",
" 'pos/cv574_22156.txt',\n",
" 'pos/cv575_21150.txt',\n",
" 'pos/cv576_14094.txt',\n",
" 'pos/cv577_28549.txt',\n",
" 'pos/cv578_15094.txt',\n",
" 'pos/cv579_11605.txt',\n",
" 'pos/cv580_14064.txt',\n",
" 'pos/cv581_19381.txt',\n",
" 'pos/cv582_6559.txt',\n",
" 'pos/cv583_29692.txt',\n",
" 'pos/cv584_29722.txt',\n",
" 'pos/cv585_22496.txt',\n",
" 'pos/cv586_7543.txt',\n",
" 'pos/cv587_19162.txt',\n",
" 'pos/cv588_13008.txt',\n",
" 'pos/cv589_12064.txt',\n",
" 'pos/cv590_19290.txt',\n",
" 'pos/cv591_23640.txt',\n",
" 'pos/cv592_22315.txt',\n",
" 'pos/cv593_10987.txt',\n",
" 'pos/cv594_11039.txt',\n",
" 'pos/cv595_25335.txt',\n",
" 'pos/cv596_28311.txt',\n",
" 'pos/cv597_26360.txt',\n",
" 'pos/cv598_16452.txt',\n",
" 'pos/cv599_20988.txt',\n",
" 'pos/cv600_23878.txt',\n",
" 'pos/cv601_23453.txt',\n",
" 'pos/cv602_8300.txt',\n",
" 'pos/cv603_17694.txt',\n",
" 'pos/cv604_2230.txt',\n",
" 'pos/cv605_11800.txt',\n",
" 'pos/cv606_15985.txt',\n",
" 'pos/cv607_7717.txt',\n",
" 'pos/cv608_23231.txt',\n",
" 'pos/cv609_23877.txt',\n",
" 'pos/cv610_2287.txt',\n",
" 'pos/cv611_21120.txt',\n",
" 'pos/cv612_5461.txt',\n",
" 'pos/cv613_21796.txt',\n",
" 'pos/cv614_10626.txt',\n",
" 'pos/cv615_14182.txt',\n",
" 'pos/cv616_29319.txt',\n",
" 'pos/cv617_9322.txt',\n",
" 'pos/cv618_8974.txt',\n",
" 'pos/cv619_12462.txt',\n",
" 'pos/cv620_24265.txt',\n",
" 'pos/cv621_14368.txt',\n",
" 'pos/cv622_8147.txt',\n",
" 'pos/cv623_15356.txt',\n",
" 'pos/cv624_10744.txt',\n",
" 'pos/cv625_12440.txt',\n",
" 'pos/cv626_7410.txt',\n",
" 'pos/cv627_11620.txt',\n",
" 'pos/cv628_19325.txt',\n",
" 'pos/cv629_14909.txt',\n",
" 'pos/cv630_10057.txt',\n",
" 'pos/cv631_4967.txt',\n",
" 'pos/cv632_9610.txt',\n",
" 'pos/cv633_29837.txt',\n",
" 'pos/cv634_11101.txt',\n",
" 'pos/cv635_10022.txt',\n",
" 'pos/cv636_15279.txt',\n",
" 'pos/cv637_1250.txt',\n",
" 'pos/cv638_2953.txt',\n",
" 'pos/cv639_10308.txt',\n",
" 'pos/cv640_5378.txt',\n",
" 'pos/cv641_12349.txt',\n",
" 'pos/cv642_29867.txt',\n",
" 'pos/cv643_29349.txt',\n",
" 'pos/cv644_17154.txt',\n",
" 'pos/cv645_15668.txt',\n",
" 'pos/cv646_15065.txt',\n",
" 'pos/cv647_13691.txt',\n",
" 'pos/cv648_15792.txt',\n",
" 'pos/cv649_12735.txt',\n",
" 'pos/cv650_14340.txt',\n",
" 'pos/cv651_10492.txt',\n",
" 'pos/cv652_13972.txt',\n",
" 'pos/cv653_19583.txt',\n",
" 'pos/cv654_18246.txt',\n",
" 'pos/cv655_11154.txt',\n",
" 'pos/cv656_24201.txt',\n",
" 'pos/cv657_24513.txt',\n",
" 'pos/cv658_10532.txt',\n",
" 'pos/cv659_19944.txt',\n",
" 'pos/cv660_21893.txt',\n",
" 'pos/cv661_2450.txt',\n",
" 'pos/cv662_13320.txt',\n",
" 'pos/cv663_13019.txt',\n",
" 'pos/cv664_4389.txt',\n",
" 'pos/cv665_29538.txt',\n",
" 'pos/cv666_18963.txt',\n",
" 'pos/cv667_18467.txt',\n",
" 'pos/cv668_17604.txt',\n",
" 'pos/cv669_22995.txt',\n",
" 'pos/cv670_25826.txt',\n",
" 'pos/cv671_5054.txt',\n",
" 'pos/cv672_28083.txt',\n",
" 'pos/cv673_24714.txt',\n",
" 'pos/cv674_10732.txt',\n",
" 'pos/cv675_21588.txt',\n",
" 'pos/cv676_21090.txt',\n",
" 'pos/cv677_17715.txt',\n",
" 'pos/cv678_13419.txt',\n",
" 'pos/cv679_28559.txt',\n",
" 'pos/cv680_10160.txt',\n",
" 'pos/cv681_9692.txt',\n",
" 'pos/cv682_16139.txt',\n",
" 'pos/cv683_12167.txt',\n",
" 'pos/cv684_11798.txt',\n",
" 'pos/cv685_5947.txt',\n",
" 'pos/cv686_13900.txt',\n",
" 'pos/cv687_21100.txt',\n",
" 'pos/cv688_7368.txt',\n",
" 'pos/cv689_12587.txt',\n",
" 'pos/cv690_5619.txt',\n",
" 'pos/cv691_5043.txt',\n",
" 'pos/cv692_15451.txt',\n",
" 'pos/cv693_18063.txt',\n",
" 'pos/cv694_4876.txt',\n",
" 'pos/cv695_21108.txt',\n",
" 'pos/cv696_29740.txt',\n",
" 'pos/cv697_11162.txt',\n",
" 'pos/cv698_15253.txt',\n",
" 'pos/cv699_7223.txt',\n",
" 'pos/cv700_21947.txt',\n",
" 'pos/cv701_14252.txt',\n",
" 'pos/cv702_11500.txt',\n",
" 'pos/cv703_16143.txt',\n",
" 'pos/cv704_15969.txt',\n",
" 'pos/cv705_11059.txt',\n",
" 'pos/cv706_24716.txt',\n",
" 'pos/cv707_10678.txt',\n",
" 'pos/cv708_28729.txt',\n",
" 'pos/cv709_10529.txt',\n",
" 'pos/cv710_22577.txt',\n",
" 'pos/cv711_11665.txt',\n",
" 'pos/cv712_22920.txt',\n",
" 'pos/cv713_29155.txt',\n",
" 'pos/cv714_18502.txt',\n",
" 'pos/cv715_18179.txt',\n",
" 'pos/cv716_10514.txt',\n",
" 'pos/cv717_15953.txt',\n",
" 'pos/cv718_11434.txt',\n",
" 'pos/cv719_5713.txt',\n",
" 'pos/cv720_5389.txt',\n",
" 'pos/cv721_29121.txt',\n",
" 'pos/cv722_7110.txt',\n",
" 'pos/cv723_8648.txt',\n",
" 'pos/cv724_13681.txt',\n",
" 'pos/cv725_10103.txt',\n",
" 'pos/cv726_4719.txt',\n",
" 'pos/cv727_4978.txt',\n",
" 'pos/cv728_16133.txt',\n",
" 'pos/cv729_10154.txt',\n",
" 'pos/cv730_10279.txt',\n",
" 'pos/cv731_4136.txt',\n",
" 'pos/cv732_12245.txt',\n",
" 'pos/cv733_9839.txt',\n",
" 'pos/cv734_21568.txt',\n",
" 'pos/cv735_18801.txt',\n",
" 'pos/cv736_23670.txt',\n",
" 'pos/cv737_28907.txt',\n",
" 'pos/cv738_10116.txt',\n",
" 'pos/cv739_11209.txt',\n",
" 'pos/cv740_12445.txt',\n",
" 'pos/cv741_11890.txt',\n",
" 'pos/cv742_7751.txt',\n",
" 'pos/cv743_15449.txt',\n",
" 'pos/cv744_10038.txt',\n",
" 'pos/cv745_12773.txt',\n",
" 'pos/cv746_10147.txt',\n",
" 'pos/cv747_16556.txt',\n",
" 'pos/cv748_12786.txt',\n",
" 'pos/cv749_17765.txt',\n",
" 'pos/cv750_10180.txt',\n",
" 'pos/cv751_15719.txt',\n",
" 'pos/cv752_24155.txt',\n",
" 'pos/cv753_10875.txt',\n",
" 'pos/cv754_7216.txt',\n",
" 'pos/cv755_23616.txt',\n",
" 'pos/cv756_22540.txt',\n",
" 'pos/cv757_10189.txt',\n",
" 'pos/cv758_9671.txt',\n",
" 'pos/cv759_13522.txt',\n",
" 'pos/cv760_8597.txt',\n",
" 'pos/cv761_12620.txt',\n",
" 'pos/cv762_13927.txt',\n",
" 'pos/cv763_14729.txt',\n",
" 'pos/cv764_11739.txt',\n",
" 'pos/cv765_19037.txt',\n",
" 'pos/cv766_7540.txt',\n",
" 'pos/cv767_14062.txt',\n",
" 'pos/cv768_11751.txt',\n",
" 'pos/cv769_8123.txt',\n",
" 'pos/cv770_10451.txt',\n",
" 'pos/cv771_28665.txt',\n",
" 'pos/cv772_12119.txt',\n",
" 'pos/cv773_18817.txt',\n",
" 'pos/cv774_13845.txt',\n",
" 'pos/cv775_16237.txt',\n",
" 'pos/cv776_20529.txt',\n",
" 'pos/cv777_10094.txt',\n",
" 'pos/cv778_17330.txt',\n",
" 'pos/cv779_17881.txt',\n",
" 'pos/cv780_7984.txt',\n",
" 'pos/cv781_5262.txt',\n",
" 'pos/cv782_19526.txt',\n",
" 'pos/cv783_13227.txt',\n",
" 'pos/cv784_14394.txt',\n",
" 'pos/cv785_22600.txt',\n",
" 'pos/cv786_22497.txt',\n",
" 'pos/cv787_13743.txt',\n",
" 'pos/cv788_25272.txt',\n",
" 'pos/cv789_12136.txt',\n",
" 'pos/cv790_14600.txt',\n",
" 'pos/cv791_16302.txt',\n",
" 'pos/cv792_3832.txt',\n",
" 'pos/cv793_13650.txt',\n",
" 'pos/cv794_15868.txt',\n",
" 'pos/cv795_10122.txt',\n",
" 'pos/cv796_15782.txt',\n",
" 'pos/cv797_6957.txt',\n",
" 'pos/cv798_23531.txt',\n",
" 'pos/cv799_18543.txt',\n",
" 'pos/cv800_12368.txt',\n",
" 'pos/cv801_25228.txt',\n",
" 'pos/cv802_28664.txt',\n",
" 'pos/cv803_8207.txt',\n",
" 'pos/cv804_10862.txt',\n",
" 'pos/cv805_19601.txt',\n",
" 'pos/cv806_8842.txt',\n",
" 'pos/cv807_21740.txt',\n",
" 'pos/cv808_12635.txt',\n",
" 'pos/cv809_5009.txt',\n",
" 'pos/cv810_12458.txt',\n",
" 'pos/cv811_21386.txt',\n",
" 'pos/cv812_17924.txt',\n",
" 'pos/cv813_6534.txt',\n",
" 'pos/cv814_18975.txt',\n",
" 'pos/cv815_22456.txt',\n",
" 'pos/cv816_13655.txt',\n",
" 'pos/cv817_4041.txt',\n",
" 'pos/cv818_10211.txt',\n",
" 'pos/cv819_9364.txt',\n",
" 'pos/cv820_22892.txt',\n",
" 'pos/cv821_29364.txt',\n",
" 'pos/cv822_20049.txt',\n",
" 'pos/cv823_15569.txt',\n",
" 'pos/cv824_8838.txt',\n",
" 'pos/cv825_5063.txt',\n",
" 'pos/cv826_11834.txt',\n",
" 'pos/cv827_18331.txt',\n",
" 'pos/cv828_19831.txt',\n",
" 'pos/cv829_20289.txt',\n",
" 'pos/cv830_6014.txt',\n",
" 'pos/cv831_14689.txt',\n",
" 'pos/cv832_23275.txt',\n",
" 'pos/cv833_11053.txt',\n",
" 'pos/cv834_22195.txt',\n",
" 'pos/cv835_19159.txt',\n",
" 'pos/cv836_12968.txt',\n",
" 'pos/cv837_27325.txt',\n",
" 'pos/cv838_24728.txt',\n",
" 'pos/cv839_21467.txt',\n",
" 'pos/cv840_16321.txt',\n",
" 'pos/cv841_3967.txt',\n",
" 'pos/cv842_5866.txt',\n",
" 'pos/cv843_15544.txt',\n",
" 'pos/cv844_12690.txt',\n",
" 'pos/cv845_14290.txt',\n",
" 'pos/cv846_29497.txt',\n",
" 'pos/cv847_1941.txt',\n",
" 'pos/cv848_10036.txt',\n",
" 'pos/cv849_15729.txt',\n",
" 'pos/cv850_16466.txt',\n",
" 'pos/cv851_20469.txt',\n",
" 'pos/cv852_27523.txt',\n",
" 'pos/cv853_29233.txt',\n",
" 'pos/cv854_17740.txt',\n",
" 'pos/cv855_20661.txt',\n",
" 'pos/cv856_29013.txt',\n",
" 'pos/cv857_15958.txt',\n",
" 'pos/cv858_18819.txt',\n",
" 'pos/cv859_14107.txt',\n",
" 'pos/cv860_13853.txt',\n",
" 'pos/cv861_1198.txt',\n",
" 'pos/cv862_14324.txt',\n",
" 'pos/cv863_7424.txt',\n",
" 'pos/cv864_3416.txt',\n",
" 'pos/cv865_2895.txt',\n",
" 'pos/cv866_29691.txt',\n",
" 'pos/cv867_16661.txt',\n",
" 'pos/cv868_11948.txt',\n",
" 'pos/cv869_23611.txt',\n",
" 'pos/cv870_16348.txt',\n",
" 'pos/cv871_24888.txt',\n",
" 'pos/cv872_12591.txt',\n",
" 'pos/cv873_18636.txt',\n",
" 'pos/cv874_11236.txt',\n",
" 'pos/cv875_5754.txt',\n",
" 'pos/cv876_9390.txt',\n",
" 'pos/cv877_29274.txt',\n",
" 'pos/cv878_15694.txt',\n",
" 'pos/cv879_14903.txt',\n",
" 'pos/cv880_29800.txt',\n",
" 'pos/cv881_13254.txt',\n",
" 'pos/cv882_10026.txt',\n",
" 'pos/cv883_27751.txt',\n",
" 'pos/cv884_13632.txt',\n",
" 'pos/cv885_12318.txt',\n",
" 'pos/cv886_18177.txt',\n",
" 'pos/cv887_5126.txt',\n",
" 'pos/cv888_24435.txt',\n",
" 'pos/cv889_21430.txt',\n",
" 'pos/cv890_3977.txt',\n",
" 'pos/cv891_6385.txt',\n",
" 'pos/cv892_17576.txt',\n",
" 'pos/cv893_26269.txt',\n",
" 'pos/cv894_2068.txt',\n",
" 'pos/cv895_21022.txt',\n",
" 'pos/cv896_16071.txt',\n",
" 'pos/cv897_10837.txt',\n",
" 'pos/cv898_14187.txt',\n",
" 'pos/cv899_16014.txt',\n",
" 'pos/cv900_10331.txt',\n",
" 'pos/cv901_11017.txt',\n",
" 'pos/cv902_12256.txt',\n",
" 'pos/cv903_17822.txt',\n",
" 'pos/cv904_24353.txt',\n",
" 'pos/cv905_29114.txt',\n",
" 'pos/cv906_11491.txt',\n",
" 'pos/cv907_3541.txt',\n",
" 'pos/cv908_16009.txt',\n",
" 'pos/cv909_9960.txt',\n",
" 'pos/cv910_20488.txt',\n",
" 'pos/cv911_20260.txt',\n",
" 'pos/cv912_5674.txt',\n",
" 'pos/cv913_29252.txt',\n",
" 'pos/cv914_28742.txt',\n",
" 'pos/cv915_8841.txt',\n",
" 'pos/cv916_15467.txt',\n",
" 'pos/cv917_29715.txt',\n",
" 'pos/cv918_2693.txt',\n",
" 'pos/cv919_16380.txt',\n",
" 'pos/cv920_29622.txt',\n",
" 'pos/cv921_12747.txt',\n",
" 'pos/cv922_10073.txt',\n",
" 'pos/cv923_11051.txt',\n",
" 'pos/cv924_29540.txt',\n",
" 'pos/cv925_8969.txt',\n",
" 'pos/cv926_17059.txt',\n",
" 'pos/cv927_10681.txt',\n",
" 'pos/cv928_9168.txt',\n",
" 'pos/cv929_16908.txt',\n",
" 'pos/cv930_13475.txt',\n",
" 'pos/cv931_17563.txt',\n",
" 'pos/cv932_13401.txt',\n",
" 'pos/cv933_23776.txt',\n",
" 'pos/cv934_19027.txt',\n",
" 'pos/cv935_23841.txt',\n",
" 'pos/cv936_15954.txt',\n",
" 'pos/cv937_9811.txt',\n",
" 'pos/cv938_10220.txt',\n",
" 'pos/cv939_10583.txt',\n",
" 'pos/cv940_17705.txt',\n",
" 'pos/cv941_10246.txt',\n",
" 'pos/cv942_17082.txt',\n",
" 'pos/cv943_22488.txt',\n",
" 'pos/cv944_13521.txt',\n",
" 'pos/cv945_12160.txt',\n",
" 'pos/cv946_18658.txt',\n",
" 'pos/cv947_10601.txt',\n",
" 'pos/cv948_24606.txt',\n",
" 'pos/cv949_20112.txt',\n",
" 'pos/cv950_12350.txt',\n",
" 'pos/cv951_10926.txt',\n",
" 'pos/cv952_25240.txt',\n",
" 'pos/cv953_6836.txt',\n",
" 'pos/cv954_18628.txt',\n",
" 'pos/cv955_25001.txt',\n",
" 'pos/cv956_11609.txt',\n",
" 'pos/cv957_8737.txt',\n",
" 'pos/cv958_12162.txt',\n",
" 'pos/cv959_14611.txt',\n",
" 'pos/cv960_29007.txt',\n",
" 'pos/cv961_5682.txt',\n",
" 'pos/cv962_9803.txt',\n",
" 'pos/cv963_6895.txt',\n",
" 'pos/cv964_6021.txt',\n",
" 'pos/cv965_26071.txt',\n",
" 'pos/cv966_28832.txt',\n",
" 'pos/cv967_5788.txt',\n",
" 'pos/cv968_24218.txt',\n",
" 'pos/cv969_13250.txt',\n",
" 'pos/cv970_18450.txt',\n",
" 'pos/cv971_10874.txt',\n",
" 'pos/cv972_26417.txt',\n",
" 'pos/cv973_10066.txt',\n",
" 'pos/cv974_22941.txt',\n",
" 'pos/cv975_10981.txt',\n",
" 'pos/cv976_10267.txt',\n",
" 'pos/cv977_4938.txt',\n",
" 'pos/cv978_20929.txt',\n",
" 'pos/cv979_18921.txt',\n",
" 'pos/cv980_10953.txt',\n",
" 'pos/cv981_14989.txt',\n",
" 'pos/cv982_21103.txt',\n",
" 'pos/cv983_22928.txt',\n",
" 'pos/cv984_12767.txt',\n",
" 'pos/cv985_6359.txt',\n",
" 'pos/cv986_13527.txt',\n",
" 'pos/cv987_6965.txt',\n",
" 'pos/cv988_18740.txt',\n",
" 'pos/cv989_15824.txt',\n",
" 'pos/cv990_11591.txt',\n",
" 'pos/cv991_18645.txt',\n",
" 'pos/cv992_11962.txt',\n",
" 'pos/cv993_29737.txt',\n",
" 'pos/cv994_12270.txt',\n",
" 'pos/cv995_21821.txt',\n",
" 'pos/cv996_11592.txt',\n",
" 'pos/cv997_5046.txt',\n",
" 'pos/cv998_14111.txt',\n",
" 'pos/cv999_13106.txt']"
]
},
"metadata": {
"tags": []
},
"execution_count": 36
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LhRoiFntf0Yb",
"outputId": "daef79ff-fe54-4701-e6f6-9ad23190fb0f"
},
"source": [
"movie_reviews.words(movie_reviews.fileids()[5])"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['capsule', ':', 'in', '2176', 'on', 'the', 'planet', ...]"
]
},
"metadata": {
"tags": []
},
"execution_count": 37
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DXFC7CBYgAjd",
"outputId": "12641d9d-75b5-463a-b909-63ff4b7fd682"
},
"source": [
"documents = []\r\n",
"for category in movie_reviews.categories():\r\n",
" for fileid in movie_reviews.fileids(category):\r\n",
" documents.append([movie_reviews.words(fileid), category])\r\n",
"documents[0:5] ##preparing the dataset, here every review is tokenized
and result is appended to get a training example."
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'],\
n",
" [['the', 'happy', 'bastard', \"'\", 's', 'quick', 'movie', ...],
'neg'],\n",
" [['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...],
'neg'],\n",
" [['\"', 'quest', 'for', 'camelot', '\"', 'is', 'warner', ...],
'neg'],\n",
" [['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...],
'neg']]"
]
},
"metadata": {
"tags": []
},
"execution_count": 38
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "G6zTjdh466ZQ"
},
"source": [
"import random\r\n",
"random.seed(2)\r\n",
"random.shuffle(documents) ## shuffling the training exapmles ."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "8D17Y_mC69qZ"
},
"source": [
"from nltk.stem import WordNetLemmatizer\r\n",
"lemmatizer = WordNetLemmatizer()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "kqCBVZmr_R1C"
},
"source": [
"from nltk.corpus import wordnet\r\n",
"def get_simple_pos(tag): #creating simple tags to pass into the
lemmatizer\r\n",
" if tag.startswith('J'):\r\n",
" return wordnet.ADJ\r\n",
" elif tag.startswith('V'):\r\n",
" return wordnet.VERB\r\n",
" elif tag.startswith('N'):\r\n",
" return wordnet.NOUN\r\n",
" elif tag.startswith('R'):\r\n",
" return wordnet.ADV\r\n",
" else:\r\n",
" return wordnet.NOUN"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "AqusEhKr_V3R"
},
"source": [
"from nltk.corpus import stopwords\r\n",
"import string\r\n",
"stops = stopwords.words('english') + list(string.punctuation)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "yLWb2vzM_axG"
},
"source": [
"from nltk import pos_tag"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "WSb89iSb_dGQ"
},
"source": [
"def clean_review(words):\r\n",
" output_words = []\r\n",
" for w in words:\r\n",
" if w.lower() not in stops:\r\n",
" pos = pos_tag([w]) \r\n",
" clean_word = lemmatizer.lemmatize(w, get_simple_pos(pos[0]
[1]))\r\n",
" output_words.append(clean_word.lower())\r\n",
" return output_words"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "POBT4BR9BPAQ",
"outputId": "f67ef44f-25a0-4d99-a720-9e7be4228db3"
},
"source": [
" import time\r\n",
" start = time.time()\r\n",
"documents = [(clean_review(document), category) for document, category in
documents]\r\n",
"end = time.time()\r\n",
"print(\"Cleaning time: \", end - start)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Cleaning time: 103.91443848609924\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UmzRLUtEB7zI",
"outputId": "3bbc332c-100b-40eb-dc00-73c62e38ee5a"
},
"source": [
"documents[0]"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(['cold',\n",
" 'molecule',\n",
" 'move',\n",
" 'everything',\n",
" 'clean',\n",
" 'essential',\n",
" 'word',\n",
" 'mikey',\n",
" 'carver',\n",
" 'elijah',\n",
" 'wood',\n",
" 'young',\n",
" 'teenage',\n",
" 'boy',\n",
" 'living',\n",
" '1973',\n",
" 'new',\n",
" 'canaan',\n",
" 'connecticut',\n",
" 'ice',\n",
" 'storm',\n",
" 'mikey',\n",
" 'delivers',\n",
" 'word',\n",
" 'bore',\n",
" 'science',\n",
" 'class',\n",
" 'unlikely',\n",
" 'anyone',\n",
" 'realizes',\n",
" 'much',\n",
" 'parallel',\n",
" 'mikey',\n",
" 'life',\n",
" 'life',\n",
" 'surround',\n",
" 'father',\n",
" 'jim',\n",
" 'jamey',\n",
" 'sheridan',\n",
" 'rarely',\n",
" 'see',\n",
" 'mother',\n",
" 'janey',\n",
" 'sigourney',\n",
" 'weaver',\n",
" 'affair',\n",
" 'married',\n",
" 'neighbor',\n",
" 'ben',\n",
" 'hood',\n",
" 'kevin',\n",
" 'kline',\n",
" 'ben',\n",
" 'wife',\n",
" 'elena',\n",
" 'joan',\n",
" 'allen',\n",
" 'suspect',\n",
" 'affair',\n",
" 'say',\n",
" 'anything',\n",
" 'meanwhile',\n",
" 'ben',\n",
" '14',\n",
" 'year',\n",
" 'old',\n",
" 'daughter',\n",
" 'wendy',\n",
" 'christina',\n",
" 'ricci',\n",
" 'continuously',\n",
" 'lure',\n",
" 'mikey',\n",
" 'young',\n",
" 'brother',\n",
" 'sandy',\n",
" 'adam',\n",
" 'hann',\n",
" 'byrd',\n",
" 'sexual',\n",
" 'exploration',\n",
" 'tobey',\n",
" 'maguire',\n",
" 'play',\n",
" 'paul',\n",
" 'hood',\n",
" '16',\n",
" 'year',\n",
" 'old',\n",
" 'narrator',\n",
" 'story',\n",
" 'also',\n",
" 'happens',\n",
" 'least',\n",
" 'prevalent',\n",
" 'character',\n",
" 'start',\n",
" 'film',\n",
" 'interest',\n",
" 'outlook',\n",
" 'family',\n",
" 'paul',\n",
" 'compare',\n",
" 'family',\n",
" 'fantastic',\n",
" 'four',\n",
" 'comic',\n",
" 'book',\n",
" 'even',\n",
" 'go',\n",
" 'far',\n",
" 'say',\n",
" 'family',\n",
" 'everybody',\n",
" 'anti',\n",
" 'matter',\n",
" 'something',\n",
" 'everybody',\n",
" 'return',\n",
" 'eventually',\n",
" 'farther',\n",
" 'go',\n",
" 'deeper',\n",
" 'return',\n",
" 'ice',\n",
" 'storm',\n",
" 'character',\n",
" 'piece',\n",
" 'explores',\n",
" 'dismal',\n",
" 'time',\n",
" 'america',\n",
" 'individual',\n",
" 'life',\n",
" 'portrayed',\n",
" 'movie',\n",
" 'everything',\n",
" 'parallel',\n",
" 'everything',\n",
" 'else',\n",
" 'young',\n",
" 'teenager',\n",
" 'try',\n",
" 'discover',\n",
" 'thru',\n",
" 'drug',\n",
" 'sex',\n",
" 'alcohol',\n",
" 'really',\n",
" 'almost',\n",
" 'identical',\n",
" 'parent',\n",
" 'try',\n",
" 'figure',\n",
" 'purpose',\n",
" 'life',\n",
" 'use',\n",
" 'method',\n",
" 'cold',\n",
" 'outside',\n",
" 'molecule',\n",
" 'move',\n",
" 'everything',\n",
" 'clean',\n",
" 'everything',\n",
" 'clean',\n",
" 'nobody',\n",
" 'admit',\n",
" 'go',\n",
" 'even',\n",
" 'president',\n",
" 'tv',\n",
" 'deny',\n",
" 'wrong',\n",
" 'doings',\n",
" 'expect',\n",
" 'anything',\n",
" 'couple',\n",
" 'suburban',\n",
" 'family',\n",
" 'rid',\n",
" 'coattail',\n",
" 'sexual',\n",
" 'revolution',\n",
" 'sex',\n",
" 'drug',\n",
" 'obviously',\n",
" 'empty',\n",
" 'think',\n",
" 'point',\n",
" 'film',\n",
" 'first',\n",
" 'view',\n",
" 'entire',\n",
" 'movie',\n",
" 'might',\n",
" 'seem',\n",
" 'empty',\n",
" 'parallel',\n",
" 'get',\n",
" 'know',\n",
" 'character',\n",
" 'deeply',\n",
" 'think',\n",
" 'nobody',\n",
" 'movie',\n",
" 'know',\n",
" 'either',\n",
" 'sadly',\n",
" 'watch',\n",
" 'two',\n",
" 'family',\n",
" 'go',\n",
" 'life',\n",
" 'nearly',\n",
" 'oblivious',\n",
" 'one',\n",
" 'another',\n",
" 'first',\n",
" 'glance',\n",
" 'might',\n",
" 'think',\n",
" 'emotion',\n",
" 'lose',\n",
" 'end',\n",
" 'scene',\n",
" 'gotten',\n",
" 'know',\n",
" 'character',\n",
" 'well',\n",
" 'enough',\n",
" 'sympathize',\n",
" 'realize',\n",
" 'might',\n",
" 'point',\n",
" 'feel',\n",
" 'pain',\n",
" 'act',\n",
" 'quite',\n",
" 'good',\n",
" 'particularly',\n",
" 'like',\n",
" 'elijah',\n",
" 'wood',\n",
" 'seem',\n",
" 'receive',\n",
" 'much',\n",
" 'recognition',\n",
" 'others',\n",
" 'still',\n",
" 'found',\n",
" 'posse',\n",
" 'quite',\n",
" 'real',\n",
" 'sense',\n",
" 'christina',\n",
" 'ricci',\n",
" 'acclaim',\n",
" 'part',\n",
" 'misguide',\n",
" 'teenage',\n",
" 'temptress',\n",
" 'look',\n",
" 'something',\n",
" 'life',\n",
" 'pant',\n",
" 'every',\n",
" 'available',\n",
" 'boy',\n",
" 'still',\n",
" 'say',\n",
" 'hat',\n",
" 'remain',\n",
" 'wood',\n",
" 'joan',\n",
" 'allen',\n",
" 'subtle',\n",
" 'believable',\n",
" 'performance',\n",
" 'lonely',\n",
" 'unappreciated',\n",
" 'wife',\n",
" 'always',\n",
" 'excellent',\n",
" 'kevin',\n",
" 'kline',\n",
" 'tobey',\n",
" 'maguire',\n",
" 'fine',\n",
" 'job',\n",
" 'character',\n",
" 'perhaps',\n",
" 'intact',\n",
" 'sensible',\n",
" 'person',\n",
" 'story',\n",
" 'seem',\n",
" 'little',\n",
" 'lose',\n",
" 'need',\n",
" 'perhaps',\n",
" 'use',\n",
" 'ice',\n",
" 'storm',\n",
" 'hail',\n",
" 'many',\n",
" 'one',\n",
" 'best',\n",
" 'film',\n",
" 'year',\n",
" 'hate',\n",
" 'say',\n",
" 'agree',\n",
" 'fact',\n",
" 'think',\n",
" 'many',\n",
" 'film',\n",
" 'would',\n",
" 'rank',\n",
" 'high',\n",
" 'one',\n",
" 'even',\n",
" 'good',\n",
" 'film',\n",
" 'take',\n",
" 'lot',\n",
" 'retrospect',\n",
" 'fully',\n",
" 'appreciate',\n",
" 'art',\n",
" 'finally',\n",
" 'start',\n",
" 'see',\n",
" 'thing',\n",
" 'start',\n",
" 'grow',\n",
" 'even',\n",
" 'fonder',\n",
" 'character',\n",
" 'story',\n",
" 'perhaps',\n",
" 'one',\n",
" 'see',\n",
" 'ponder',\n",
" 'watch',\n",
" 'different',\n",
" 'eye'],\n",
" 'pos')"
]
},
"metadata": {
"tags": []
},
"execution_count": 46
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "hR7j6q5sCZca"
},
"source": [
"training_documents = documents[0:1500]\r\n",
"testing_documents = documents[1500:]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Atrz1fzyDbrz"
},
"source": [
"all_words = []\r\n",
"for doc in documents:\r\n",
" all_words += doc[0]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "BkaI2AeuDfIx"
},
"source": [
        "freq = nltk.FreqDist(all_words) #returns a frequency distribution object\r\n",
        "common = freq.most_common(3000)\r\n",
        "features = [i[0] for i in common] #choosing the top 3000 most frequent words"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "0Kqd03IQDg8m"
},
"source": [
        "def get_feature_dict(words): #returns a dict mapping each feature word to True/False, i.e. whether it is present in the document\r\n",
" current_features = {}\r\n",
" words_set = set(words)\r\n",
" for w in features:\r\n",
" current_features[w] = w in words_set\r\n",
" return current_features"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "v-MOWlisDmrI"
},
"source": [
        "training_data = [(get_feature_dict(doc), category) for doc, category in training_documents]\r\n",
        "testing_data = [(get_feature_dict(doc), category) for doc, category in testing_documents]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "lDKXTU5WD4ZA"
},
"source": [
"#Classification using NLTK Naive Bayes"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ZttTwJ0nEcEf"
},
"source": [
"from nltk import NaiveBayesClassifier "
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "oWIogFYfEddV"
},
"source": [
"classifier = NaiveBayesClassifier.train(training_data)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "KUrkh-BaEeqB",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "5dc69053-8c61-405b-b161-e4fcd2ed1c29"
},
"source": [
"nltk.classify.accuracy(classifier, testing_data)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.776"
]
},
"metadata": {
"tags": []
},
"execution_count": 55
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "17GDTkZXEgAj",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "1fd97dbc-27a0-4441-fe68-317903cf270e"
},
"source": [
"classifier.show_most_informative_features(15)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
            "Most Informative Features\n",
            "               ludicrous = True              neg : pos    =     20.2 : 1.0\n",
            "             outstanding = True              pos : neg    =     13.5 : 1.0\n",
            "                    anna = True              pos : neg    =     10.0 : 1.0\n",
            "                   damon = True              pos : neg    =      9.3 : 1.0\n",
            "              schumacher = True              neg : pos    =      9.3 : 1.0\n",
            "                 idiotic = True              neg : pos    =      8.4 : 1.0\n",
            "             wonderfully = True              pos : neg    =      8.2 : 1.0\n",
            "            breathtaking = True              pos : neg    =      8.2 : 1.0\n",
            "               stupidity = True              neg : pos    =      7.6 : 1.0\n",
            "                   anger = True              pos : neg    =      7.6 : 1.0\n",
            "                lifeless = True              neg : pos    =      7.3 : 1.0\n",
            "                 balance = True              pos : neg    =      7.1 : 1.0\n",
            "                   inept = True              neg : pos    =      7.0 : 1.0\n",
            "               painfully = True              neg : pos    =      6.7 : 1.0\n",
            "                     sat = True              neg : pos    =      6.7 : 1.0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0qajIEqvDpPU"
},
"source": [
        "##**Sklearn Classifiers within NLTK**\r\n",
        "\r\n",
        "NLTK's SklearnClassifier gives users of NLTK a way to call an underlying scikit-learn classifier from their Python code.\r\n",
        "\r\n",
        "First construct a scikit-learn estimator object, then use it to construct a SklearnClassifier. E.g., to wrap a linear SVM with default settings:\r\n",
        "\r\n",
        "```python\r\n",
        "from sklearn.svm import LinearSVC\r\n",
        "from nltk.classify.scikitlearn import SklearnClassifier\r\n",
        "\r\n",
        "classifier = SklearnClassifier(LinearSVC())\r\n",
        "```"
]
},
{
"cell_type": "code",
"metadata": {
"id": "y3WbseqjEiCT"
},
"source": [
"#Using Sklearn Classifier within Nltk\r\n",
"from sklearn.svm import SVC\r\n",
"from nltk.classify.scikitlearn import SklearnClassifier"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qYIaiM_sD2xx"
},
"source": [
"svc = SVC()\r\n",
"classifier_sklearn = SklearnClassifier(svc)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "B5QcC4eJD4eC",
"outputId": "5ce3ffa7-f56a-459f-d5a3-51317b96c2d2"
},
"source": [
"classifier_sklearn.train(training_data)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
            "<SklearnClassifier(SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,\n",
            "    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',\n",
            "    max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
            "    tol=0.001, verbose=False))>"
]
},
"metadata": {
"tags": []
},
"execution_count": 59
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "C80AC8mLFHMw",
"outputId": "d5015639-c0da-4761-a662-3293b1f28e80"
},
"source": [
"nltk.classify.accuracy(classifier_sklearn, testing_data)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.812"
]
},
"metadata": {
"tags": []
},
"execution_count": 60
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "xokgLCIqFJJz"
},
"source": [
"from sklearn.ensemble import RandomForestClassifier"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "LfEMr7SbFLQn"
},
"source": [
"rfc = RandomForestClassifier()\r\n",
"classifier_sklearn1 = SklearnClassifier(rfc)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iaS7s2PBFNIs",
"outputId": "b0149952-4f0b-4a2b-f2cf-2fcf0cc3f5f4"
},
"source": [
"classifier_sklearn1.train(training_data)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
            "<SklearnClassifier(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n",
            "                       criterion='gini', max_depth=None, max_features='auto',\n",
            "                       max_leaf_nodes=None, max_samples=None,\n",
            "                       min_impurity_decrease=0.0, min_impurity_split=None,\n",
            "                       min_samples_leaf=1, min_samples_split=2,\n",
            "                       min_weight_fraction_leaf=0.0, n_estimators=100,\n",
            "                       n_jobs=None, oob_score=False, random_state=None,\n",
            "                       verbose=0, warm_start=False))>"
]
},
"metadata": {
"tags": []
},
"execution_count": 63
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EGLbuk3OFO9K",
"outputId": "baa5017f-b496-4656-d7d6-1b288d44373c"
},
"source": [
"nltk.classify.accuracy(classifier_sklearn1, testing_data)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.8"
]
},
"metadata": {
"tags": []
},
"execution_count": 64
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LyYCqotKoCm0"
},
"source": [
"##**Count Vectorizer**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S1Q3b49eoH9u"
},
"source": [
        "Count Vectorizer is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.\r\n",
        "\r\n",
        "It converts a collection of text documents to a vector of term/token counts, and it also enables pre-processing of the text data prior to generating the vector representation.\r\n",
        "\r\n",
        "CountVectorizer creates a matrix in which each unique word is represented by a column, and each text sample from the document is a row. The value of each cell is simply the count of that word in that particular text sample."
]
},
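    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cv_toy_example_md"
      },
      "source": [
        "A quick self-contained sketch of the row/column layout described above, using a made-up two-sentence corpus (the corpus and variable names here are illustrative, not part of the movie-review data):"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cv_toy_example_code"
      },
      "source": [
        "from sklearn.feature_extraction.text import CountVectorizer\r\n",
        "\r\n",
        "corpus = [\"the cat sat\", \"the cat sat on the mat\"] #two tiny text samples (one row each)\r\n",
        "vec = CountVectorizer()\r\n",
        "matrix = vec.fit_transform(corpus) #rows = text samples, columns = unique words\r\n",
        "print(sorted(vec.vocabulary_)) #columns in order: cat, mat, on, sat, the\r\n",
        "print(matrix.todense()) #[[1 0 0 1 1], [1 1 1 1 2]]"
      ],
      "execution_count": null,
      "outputs": []
    },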
{
"cell_type": "code",
"metadata": {
"id": "Yb5H2DtnFQ5g"
},
"source": [
        "documents = [(clean_review(document), category) for document, category in documents]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "FNiqeLlfo-lr"
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5M0jg0fypk-N",
"outputId": "376174b0-5d48-4afb-c9bb-b8c9f29b00fd"
},
"source": [
"train_set = {\"the sky sky is blue\", \"the sun is bright\"}\r\n",
"count_vec = CountVectorizer(max_features = 3)\r\n",
"a = count_vec.fit_transform(train_set)\r\n",
        "a.todense() #dense count matrix over the 3 most frequent words"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"matrix([[1, 0, 1],\n",
" [1, 2, 1]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 67
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2I_ew788ptJC",
"outputId": "79175d5b-fc4d-44d8-c203-facfab1ccb52"
},
"source": [
"count_vec.get_feature_names() #features with the highest frequency"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['is', 'sky', 'the']"
]
},
"metadata": {
"tags": []
},
"execution_count": 68
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GAQ12lA4p147"
},
"source": [
"categories = [category for document, category in documents]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "8aO6WnbEp389"
},
"source": [
        "text_documents = [\" \".join(document) for document, category in documents]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6zk5FBO_p52l"
},
"source": [
"from sklearn.model_selection import train_test_split"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "tf0WX8JWp7lH"
},
"source": [
        "x_train, x_test, y_train, y_test = train_test_split(text_documents, categories)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Wr3NHD47p9GI",
"outputId": "e0ec45b9-0c93-4e6b-d29d-6021fae2c74b"
},
"source": [
"count_vec = CountVectorizer(max_features = 2000)\r\n",
"x_train_features = count_vec.fit_transform(x_train)\r\n",
"x_train_features.todense()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"matrix([[0, 0, 0, ..., 1, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" ...,\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 1, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 1, 0, 0]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 73
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qb9pLOljp-t4",
"outputId": "e811e991-b126-4c66-931a-8687fdfdde38"
},
"source": [
"count_vec.get_feature_names()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['000',\n",
" '10',\n",
" '100',\n",
" '12',\n",
" '13',\n",
" '15',\n",
" '17',\n",
" '1995',\n",
" '1996',\n",
" '1997',\n",
" '1998',\n",
" '1999',\n",
" '20',\n",
" '30',\n",
" '50',\n",
" '60',\n",
" '70',\n",
" '80',\n",
" '90',\n",
" 'abandon',\n",
" 'ability',\n",
" 'able',\n",
" 'absolutely',\n",
" 'academy',\n",
" 'accent',\n",
" 'accept',\n",
" 'accident',\n",
" 'accidentally',\n",
" 'accomplish',\n",
" 'achieve',\n",
" 'achievement',\n",
" 'across',\n",
" 'act',\n",
" 'action',\n",
" 'actor',\n",
" 'actress',\n",
" 'actual',\n",
" 'actually',\n",
" 'ad',\n",
" 'adam',\n",
" 'adaptation',\n",
" 'add',\n",
" 'addition',\n",
" 'admit',\n",
" 'adult',\n",
" 'adventure',\n",
" 'affair',\n",
" 'affleck',\n",
" 'african',\n",
" 'age',\n",
" 'agent',\n",
" 'ago',\n",
" 'agree',\n",
" 'agrees',\n",
" 'ahead',\n",
" 'aid',\n",
" 'aim',\n",
" 'air',\n",
" 'al',\n",
" 'alan',\n",
" 'alex',\n",
" 'alien',\n",
" 'alive',\n",
" 'allen',\n",
" 'allow',\n",
" 'allows',\n",
" 'almost',\n",
" 'alone',\n",
" 'along',\n",
" 'already',\n",
" 'also',\n",
" 'although',\n",
" 'always',\n",
" 'amaze',\n",
" 'america',\n",
" 'american',\n",
" 'among',\n",
" 'amount',\n",
" 'amuse',\n",
" 'amy',\n",
" 'anderson',\n",
" 'andrew',\n",
" 'angel',\n",
" 'angle',\n",
" 'angry',\n",
" 'animal',\n",
" 'animate',\n",
" 'animation',\n",
" 'anna',\n",
" 'anne',\n",
" 'annoy',\n",
" 'another',\n",
" 'answer',\n",
" 'anthony',\n",
" 'anti',\n",
" 'anyone',\n",
" 'anything',\n",
" 'anyway',\n",
" 'anywhere',\n",
" 'apart',\n",
" 'apartment',\n",
" 'ape',\n",
" 'apparent',\n",
" 'apparently',\n",
" 'appeal',\n",
" 'appear',\n",
" 'appearance',\n",
" 'appreciate',\n",
" 'approach',\n",
" 'appropriate',\n",
" 'area',\n",
" 'argue',\n",
" 'arm',\n",
" 'army',\n",
" 'arnold',\n",
" 'around',\n",
" 'arquette',\n",
" 'arrest',\n",
" 'arrive',\n",
" 'arrives',\n",
" 'art',\n",
" 'artist',\n",
" 'aside',\n",
" 'ask',\n",
" 'asks',\n",
" 'aspect',\n",
" 'assistant',\n",
" 'associate',\n",
" 'assume',\n",
" 'atmosphere',\n",
" 'attack',\n",
" 'attempt',\n",
" 'attention',\n",
" 'attitude',\n",
" 'attract',\n",
" 'attractive',\n",
" 'audience',\n",
" 'austin',\n",
" 'author',\n",
" 'authority',\n",
" 'available',\n",
" 'average',\n",
" 'avoid',\n",
" 'award',\n",
" 'aware',\n",
" 'away',\n",
" 'awful',\n",
" 'babe',\n",
" 'baby',\n",
" 'back',\n",
" 'background',\n",
" 'bacon',\n",
" 'bad',\n",
" 'badly',\n",
" 'bag',\n",
" 'baldwin',\n",
" 'ball',\n",
" 'band',\n",
" 'bank',\n",
" 'bar',\n",
" 'barely',\n",
" 'barry',\n",
" 'base',\n",
" 'basic',\n",
" 'basically',\n",
" 'bat',\n",
" 'batman',\n",
" 'battle',\n",
" 'beach',\n",
" 'bear',\n",
" 'beast',\n",
" 'beat',\n",
" 'beautiful',\n",
" 'beauty',\n",
" 'become',\n",
" 'becomes',\n",
" 'bed',\n",
" 'begin',\n",
" 'behavior',\n",
" 'behind',\n",
" 'belief',\n",
" 'believable',\n",
" 'believe',\n",
" 'beloved',\n",
" 'ben',\n",
" 'benefit',\n",
" 'besides',\n",
" 'besson',\n",
" 'best',\n",
" 'beyond',\n",
" 'big',\n",
" 'bill',\n",
" 'billy',\n",
" 'bird',\n",
" 'bit',\n",
" 'bitter',\n",
" 'bizarre',\n",
" 'black',\n",
" 'blade',\n",
" 'blair',\n",
" 'blame',\n",
" 'bland',\n",
" 'block',\n",
" 'blockbuster',\n",
" 'blood',\n",
" 'bloody',\n",
" 'blow',\n",
" 'blue',\n",
" 'board',\n",
" 'boat',\n",
" 'bob',\n",
" 'bobby',\n",
" 'body',\n",
" 'bomb',\n",
" 'bond',\n",
" 'boogie',\n",
" 'book',\n",
" 'bore',\n",
" 'boring',\n",
" 'born',\n",
" 'bos',\n",
" 'bother',\n",
" 'bottom',\n",
" 'bound',\n",
" 'box',\n",
" 'boy',\n",
" 'boyfriend',\n",
" 'brain',\n",
" 'break',\n",
" 'breast',\n",
" 'brian',\n",
" 'bride',\n",
" 'bridge',\n",
" 'brief',\n",
" 'bright',\n",
" 'brilliant',\n",
" 'bring',\n",
" 'brings',\n",
" 'british',\n",
" 'broderick',\n",
" 'broken',\n",
" 'brook',\n",
" 'brother',\n",
" 'brought',\n",
" 'brown',\n",
" 'bruce',\n",
" 'buck',\n",
" 'buddy',\n",
" 'budget',\n",
" 'bug',\n",
" 'build',\n",
" 'building',\n",
" 'built',\n",
" 'bullet',\n",
" 'bunch',\n",
" 'burn',\n",
" 'burton',\n",
" 'bus',\n",
" 'business',\n",
" 'buy',\n",
" 'cage',\n",
" 'call',\n",
" 'cameo',\n",
" 'camera',\n",
" 'cameron',\n",
" 'camp',\n",
" 'campbell',\n",
" 'cannot',\n",
" 'capable',\n",
" 'captain',\n",
" 'capture',\n",
" 'car',\n",
" 'card',\n",
" 'care',\n",
" 'career',\n",
" 'carpenter',\n",
" 'carrey',\n",
" 'carry',\n",
" 'carter',\n",
" 'cartoon',\n",
" 'case',\n",
" 'cash',\n",
" 'cast',\n",
" 'cat',\n",
" 'catch',\n",
" 'catherine',\n",
" 'caught',\n",
" 'cause',\n",
" 'cell',\n",
" 'center',\n",
" 'central',\n",
" 'century',\n",
" 'certain',\n",
" 'certainly',\n",
" 'cgi',\n",
" 'chain',\n",
" 'challenge',\n",
" 'chan',\n",
" 'chance',\n",
" 'change',\n",
" 'character',\n",
" 'characterization',\n",
" 'charge',\n",
" 'charles',\n",
" 'charlie',\n",
" 'charm',\n",
" 'chase',\n",
" 'cheap',\n",
" 'cheat',\n",
" 'check',\n",
" 'cheesy',\n",
" 'chemistry',\n",
" 'chicago',\n",
" 'chicken',\n",
" 'chief',\n",
" 'child',\n",
" 'chinese',\n",
" 'choice',\n",
" 'chosen',\n",
" 'chris',\n",
" 'christian',\n",
" 'christmas',\n",
" 'christopher',\n",
" 'chuckle',\n",
" 'church',\n",
" 'cinema',\n",
" 'cinematic',\n",
" 'cinematographer',\n",
" 'cinematography',\n",
" 'circumstance',\n",
" 'city',\n",
" 'claim',\n",
" 'claire',\n",
" 'class',\n",
" 'classic',\n",
" 'clean',\n",
" 'clear',\n",
" 'clearly',\n",
" 'clever',\n",
" 'clich',\n",
" 'cliche',\n",
" 'climactic',\n",
" 'climax',\n",
" 'clooney',\n",
" 'close',\n",
" 'club',\n",
" 'clue',\n",
" 'co',\n",
" 'cold',\n",
" 'college',\n",
" 'color',\n",
" 'combine',\n",
" 'come',\n",
" 'comedic',\n",
" 'comedy',\n",
" 'comic',\n",
" 'command',\n",
" 'comment',\n",
" 'commercial',\n",
" 'commit',\n",
" 'common',\n",
" 'community',\n",
" 'company',\n",
" 'compare',\n",
" 'comparison',\n",
" 'compelling',\n",
" 'complete',\n",
" 'completely',\n",
" 'complex',\n",
" 'complicate',\n",
" 'computer',\n",
" 'con',\n",
" 'concept',\n",
" 'concern',\n",
" 'conclusion',\n",
" 'condition',\n",
" 'conflict',\n",
" 'confuse',\n",
" 'connection',\n",
" 'connor',\n",
" 'consider',\n",
" 'conspiracy',\n",
" 'constant',\n",
" 'constantly',\n",
" 'contact',\n",
" 'contain',\n",
" 'contains',\n",
" 'content',\n",
" 'continue',\n",
" 'contrast',\n",
" 'contrive',\n",
" 'control',\n",
" 'conversation',\n",
" 'convince',\n",
" 'convincing',\n",
" 'cool',\n",
" 'cop',\n",
" 'copy',\n",
" 'core',\n",
" 'corner',\n",
" 'cost',\n",
" 'costume',\n",
" 'could',\n",
" 'count',\n",
" 'country',\n",
" 'couple',\n",
" 'course',\n",
" 'court',\n",
" 'courtroom',\n",
" 'cover',\n",
" 'cowboy',\n",
" 'crack',\n",
" 'craft',\n",
" 'crap',\n",
" 'crash',\n",
" 'crazy',\n",
" 'create',\n",
" 'creates',\n",
" 'creative',\n",
" 'creature',\n",
" 'credit',\n",
" 'creepy',\n",
" 'crew',\n",
" 'crime',\n",
" 'criminal',\n",
" 'critic',\n",
" 'critical',\n",
" 'critique',\n",
" 'cross',\n",
" 'crowd',\n",
" 'cruise',\n",
" 'cry',\n",
" 'crystal',\n",
" 'cult',\n",
" 'culture',\n",
" 'current',\n",
" 'cusack',\n",
" 'cut',\n",
" 'cute',\n",
" 'dad',\n",
" 'damme',\n",
" 'damn',\n",
" 'damon',\n",
" 'dance',\n",
" 'danger',\n",
" 'dangerous',\n",
" 'daniel',\n",
" 'danny',\n",
" 'dare',\n",
" 'dark',\n",
" 'date',\n",
" 'daughter',\n",
" 'david',\n",
" 'day',\n",
" 'de',\n",
" 'dead',\n",
" 'deadly',\n",
" 'deal',\n",
" 'death',\n",
" 'debut',\n",
" 'decade',\n",
" 'decent',\n",
" 'decide',\n",
" 'decides',\n",
" 'decision',\n",
" 'deep',\n",
" 'definitely',\n",
" 'degree',\n",
" 'delight',\n",
" 'deliver',\n",
" 'delivers',\n",
" 'demand',\n",
" 'dennis',\n",
" 'department',\n",
" 'depict',\n",
" 'depth',\n",
" 'derek',\n",
" 'describe',\n",
" 'desert',\n",
" 'deserve',\n",
" 'deserves',\n",
" 'design',\n",
" 'desire',\n",
" 'desperate',\n",
" 'desperately',\n",
" 'despite',\n",
" 'destroy',\n",
" 'detail',\n",
" 'detective',\n",
" 'determine',\n",
" 'develop',\n",
" 'developed',\n",
" 'development',\n",
" 'device',\n",
" 'devil',\n",
" 'dialogue',\n",
" 'die',\n",
" 'difference',\n",
" 'different',\n",
" 'difficult',\n",
" 'digital',\n",
" 'dimensional',\n",
" 'dinner',\n",
" 'direct',\n",
" 'direction',\n",
" 'directly',\n",
" 'director',\n",
" 'dirty',\n",
" 'disappoint',\n",
" 'disappointment',\n",
" 'disaster',\n",
" 'discover',\n",
" 'discovers',\n",
" 'discus',\n",
" 'disney',\n",
" 'display',\n",
" 'distract',\n",
" 'disturb',\n",
" 'doctor',\n",
" 'documentary',\n",
" 'dog',\n",
" 'dollar',\n",
" 'doom',\n",
" 'door',\n",
" 'double',\n",
" 'doubt',\n",
" 'douglas',\n",
" 'dozen',\n",
" 'dr',\n",
" 'drag',\n",
" 'dragon',\n",
" 'drama',\n",
" 'dramatic',\n",
" 'draw',\n",
" 'drawn',\n",
" 'dream',\n",
" 'dress',\n",
" 'drive',\n",
" 'driven',\n",
" 'driver',\n",
" 'drop',\n",
" 'drug',\n",
" 'dude',\n",
" 'due',\n",
" 'dull',\n",
" 'dumb',\n",
" 'dvd',\n",
" 'dy',\n",
" 'earlier',\n",
" 'early',\n",
" 'earn',\n",
" 'earth',\n",
" 'easily',\n",
" 'easy',\n",
" 'eat',\n",
" 'eccentric',\n",
" 'ed',\n",
" 'eddie',\n",
" 'edge',\n",
" 'edit',\n",
" 'edward',\n",
" 'effect',\n",
" 'effective',\n",
" 'effort',\n",
" 'eight',\n",
" 'either',\n",
" 'elaborate',\n",
" 'element',\n",
" 'elizabeth',\n",
" 'else',\n",
" 'embarrass',\n",
" 'emotion',\n",
" 'emotional',\n",
" 'emotionally',\n",
" 'employ',\n",
" 'empty',\n",
" 'encounter',\n",
" 'end',\n",
" 'enemy',\n",
" 'energy',\n",
" 'engage',\n",
" 'england',\n",
" 'english',\n",
" 'enjoy',\n",
" 'enjoyable',\n",
" 'enough',\n",
" 'enter',\n",
" 'entertain',\n",
" 'entertainment',\n",
" 'entire',\n",
" 'entirely',\n",
" 'epic',\n",
" 'episode',\n",
" 'equally',\n",
" 'era',\n",
" 'eric',\n",
" 'escape',\n",
" 'especially',\n",
" 'essentially',\n",
" 'establish',\n",
" 'etc',\n",
" 'eve',\n",
" 'even',\n",
" 'event',\n",
" 'eventually',\n",
" 'ever',\n",
" 'every',\n",
" 'everybody',\n",
" 'everyone',\n",
" 'everything',\n",
" 'evidence',\n",
" 'evil',\n",
" 'ex',\n",
" 'exact',\n",
" 'exactly',\n",
" 'example',\n",
" 'excellent',\n",
" 'except',\n",
" 'exception',\n",
" 'exchange',\n",
" 'excite',\n",
" 'excuse',\n",
" 'execute',\n",
" 'exercise',\n",
" 'exist',\n",
" 'existence',\n",
" 'expect',\n",
" 'expectation',\n",
" 'experience',\n",
" 'explain',\n",
" 'explains',\n",
" 'explanation',\n",
" 'explore',\n",
" 'explosion',\n",
" 'express',\n",
" 'expression',\n",
" 'extend',\n",
" 'extra',\n",
" 'extraordinary',\n",
" 'extreme',\n",
" 'extremely',\n",
" 'eye',\n",
" 'face',\n",
" 'fact',\n",
" 'factor',\n",
" 'fail',\n",
" 'fails',\n",
" 'failure',\n",
" 'fair',\n",
" 'fairly',\n",
" 'faith',\n",
" 'fake',\n",
" 'fall',\n",
" 'fame',\n",
" 'familiar',\n",
" 'family',\n",
" 'famous',\n",
" 'fan',\n",
" 'fantastic',\n",
" 'fantasy',\n",
" 'far',\n",
" 'fare',\n",
" 'fascinate',\n",
" 'fashion',\n",
" 'fast',\n",
" 'fat',\n",
" 'fate',\n",
" 'father',\n",
" 'fault',\n",
" 'favor',\n",
" 'favorite',\n",
" 'fbi',\n",
" 'fear',\n",
" 'feature',\n",
" 'feel',\n",
" 'fellow',\n",
" 'felt',\n",
" 'female',\n",
" 'fi',\n",
" 'fiction',\n",
" 'field',\n",
" 'fifteen',\n",
" 'fifth',\n",
" 'fight',\n",
" 'figure',\n",
" 'file',\n",
" 'fill',\n",
" 'film',\n",
" 'filmmaker',\n",
" 'filmmaking',\n",
" 'final',\n",
" 'finale',\n",
" 'finally',\n",
" 'find',\n",
" 'fine',\n",
" 'finish',\n",
" 'fire',\n",
" 'first',\n",
" 'fish',\n",
" 'fit',\n",
" 'five',\n",
" 'flash',\n",
" 'flashback',\n",
" 'flat',\n",
" 'flaw',\n",
" 'flesh',\n",
" 'flick',\n",
" 'float',\n",
" 'floor',\n",
" 'flow',\n",
" 'fly',\n",
" 'flynt',\n",
" 'focus',\n",
" 'folk',\n",
" 'follow',\n",
" 'food',\n",
" 'fool',\n",
" 'foot',\n",
" 'footage',\n",
" 'football',\n",
" 'force',\n",
" 'ford',\n",
" 'forever',\n",
" 'forget',\n",
" 'forgotten',\n",
" 'form',\n",
" 'former',\n",
" 'formula',\n",
" 'fortune',\n",
" 'forward',\n",
" 'found',\n",
" 'four',\n",
" 'fox',\n",
" 'frame',\n",
" 'frank',\n",
" 'free',\n",
" 'freedom',\n",
" 'freeze',\n",
" 'french',\n",
" 'frequently',\n",
" 'fresh',\n",
" 'friend',\n",
" 'friendship',\n",
" 'frighten',\n",
" 'front',\n",
" 'frustrate',\n",
" 'fugitive',\n",
" 'full',\n",
" 'fully',\n",
" 'fun',\n",
" 'funniest',\n",
" 'funny',\n",
" 'future',\n",
" 'gadget',\n",
" 'gag',\n",
" 'gain',\n",
" 'game',\n",
" 'gang',\n",
" 'gangster',\n",
" 'gary',\n",
" 'gay',\n",
" 'general',\n",
" 'generally',\n",
" 'generate',\n",
" 'generation',\n",
" 'genius',\n",
" 'genre',\n",
" 'genuine',\n",
" 'genuinely',\n",
" 'george',\n",
" 'german',\n",
" 'get',\n",
" 'ghost',\n",
" 'giant',\n",
" 'gibson',\n",
" 'gift',\n",
" 'girl',\n",
" 'girlfriend',\n",
" 'give',\n",
" 'glass',\n",
" 'go',\n",
" 'goal',\n",
" 'god',\n",
" 'godzilla',\n",
" 'gold',\n",
" 'good',\n",
" 'goofy',\n",
" 'gore',\n",
" 'gotten',\n",
" 'government',\n",
" 'grace',\n",
" 'grade',\n",
" 'grand',\n",
" 'grant',\n",
" 'graphic',\n",
" 'great',\n",
" 'green',\n",
" 'gross',\n",
" 'ground',\n",
" 'group',\n",
" 'grow',\n",
" 'grown',\n",
" 'grows',\n",
" 'guard',\n",
" 'guess',\n",
" 'guest',\n",
" 'guilty',\n",
" 'gun',\n",
" 'guy',\n",
" 'hair',\n",
" 'half',\n",
" 'hall',\n",
" 'halloween',\n",
" 'hand',\n",
" 'handle',\n",
" 'hang',\n",
" 'hank',\n",
" 'happen',\n",
" 'happens',\n",
" 'happy',\n",
" 'hard',\n",
" 'hardly',\n",
" 'harry',\n",
" 'hat',\n",
" 'hate',\n",
" 'haunt',\n",
" 'head',\n",
" 'hear',\n",
" 'heard',\n",
" 'heart',\n",
" 'hearted',\n",
" 'heaven',\n",
" 'heavily',\n",
" 'heavy',\n",
" 'held',\n",
" 'hell',\n",
" 'help',\n",
" 'henry',\n",
" 'hero',\n",
" 'heroine',\n",
" 'hey',\n",
" 'hidden',\n",
" 'hide',\n",
" 'high',\n",
" 'highlight',\n",
" 'highly',\n",
" 'hilarious',\n",
" 'hill',\n",
" 'hint',\n",
" 'hip',\n",
" 'hire',\n",
" 'history',\n",
" 'hit',\n",
" 'hitchcock',\n",
" 'hoffman',\n",
" 'hold',\n",
" 'hole',\n",
" 'hollywood',\n",
" 'home',\n",
" 'honest',\n",
" 'hong',\n",
" 'hook',\n",
" 'hop',\n",
" 'hope',\n",
" 'hopkins',\n",
" 'horizon',\n",
" 'horrible',\n",
" 'horror',\n",
" 'horse',\n",
" 'hospital',\n",
" 'hot',\n",
" 'hotel',\n",
" 'hour',\n",
" 'house',\n",
" 'however',\n",
" 'huge',\n",
" 'human',\n",
" 'humanity',\n",
" 'humor',\n",
" 'humorous',\n",
" 'humour',\n",
" 'hundred',\n",
" 'hunt',\n",
" 'hunter',\n",
" 'hurt',\n",
" 'husband',\n",
" 'ice',\n",
" 'idea',\n",
" 'ideal',\n",
" 'identity',\n",
" 'ignore',\n",
" 'ii',\n",
" 'ill',\n",
" 'image',\n",
" 'imagination',\n",
" 'imagine',\n",
" 'immediately',\n",
" 'impact',\n",
" 'important',\n",
" 'impossible',\n",
" 'impression',\n",
" 'impressive',\n",
" 'include',\n",
" 'incredible',\n",
" 'incredibly',\n",
" 'indeed',\n",
" 'individual',\n",
" 'industry',\n",
" 'inevitable',\n",
" 'influence',\n",
" 'information',\n",
" 'innocent',\n",
" 'inside',\n",
" 'insight',\n",
" 'inspire',\n",
" 'instance',\n",
" 'instead',\n",
" 'instinct',\n",
" 'insult',\n",
" 'intelligence',\n",
" 'intelligent',\n",
" 'intend',\n",
" 'intense',\n",
" 'intensity',\n",
" 'intention',\n",
" 'interest',\n",
" 'interested',\n",
" 'international',\n",
" 'interview',\n",
" 'intrigue',\n",
" 'introduce',\n",
" 'investigate',\n",
" 'investigation',\n",
" 'involve',\n",
" 'involves',\n",
" 'island',\n",
" 'issue',\n",
" 'jack',\n",
" 'jackal',\n",
" 'jackie',\n",
" 'jackson',\n",
" 'jail',\n",
" 'jake',\n",
" 'james',\n",
" 'jane',\n",
" 'japanese',\n",
" 'jar',\n",
" 'jason',\n",
" 'jay',\n",
" 'jean',\n",
" 'jedi',\n",
" 'jeff',\n",
" 'jennifer',\n",
" 'jerry',\n",
" 'jim',\n",
" 'jimmy',\n",
" 'joan',\n",
" 'job',\n",
" 'joe',\n",
" 'joel',\n",
" 'john',\n",
" 'johnny',\n",
" 'join',\n",
" 'joke',\n",
" 'jon',\n",
" 'jonathan',\n",
" 'jones',\n",
" 'journey',\n",
" 'joy',\n",
" 'jr',\n",
" 'judge',\n",
" 'julia',\n",
" 'julie',\n",
" 'jump',\n",
" 'jungle',\n",
" 'justice',\n",
" 'kate',\n",
" 'keaton',\n",
" 'keep',\n",
" 'kelly',\n",
" 'kept',\n",
" 'kevin',\n",
" 'key',\n",
" 'kick',\n",
" 'kid',\n",
" 'kidnap',\n",
" 'kill',\n",
" 'killer',\n",
" 'kind',\n",
" 'king',\n",
" 'kiss',\n",
" 'knew',\n",
" 'knight',\n",
" 'knock',\n",
" 'know',\n",
" 'knowledge',\n",
" 'kong',\n",
" 'la',\n",
" 'lack',\n",
" 'lady',\n",
" ...]"
]
},
"metadata": {
"tags": []
},
"execution_count": 74
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "6J53v5dLqAUr"
},
"source": [
"x_test_features = count_vec.transform(x_test)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "s5QgFimrqFEc",
"outputId": "be9cd59c-f36c-4645-b682-8cf3fdfa4c2b"
},
"source": [
"x_test_features\r\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<500x2000 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 85178 stored elements in Compressed Sparse Row format>"
]
},
"metadata": {
"tags": []
},
"execution_count": 76
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "djBNS2_dqHKb"
},
"source": [
"#Sklearn classifier on countvectorized data\r\n",
"from sklearn.svm import SVC\r\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "u_YAHDzpqlWN"
},
"source": [
        "svc = SVC()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "W79psHeOqnmm",
"outputId": "313efe57-65f4-4da9-8cf7-0dfd0ba49f33"
},
"source": [
"svc.fit(x_train_features,y_train)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
            "SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,\n",
            "    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',\n",
            "    max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
            "    tol=0.001, verbose=False)"
]
},
"metadata": {
"tags": []
},
"execution_count": 79
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LBcqFZumqs0K",
"outputId": "652feb08-e485-44df-eac4-b77d284767b3"
},
"source": [
"svc.score(x_test_features , y_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.816"
]
},
"metadata": {
"tags": []
},
"execution_count": 80
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TH6PRXWVriaN"
},
"source": [
"##**N-Grams**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jY6dErmTt4Eq"
},
"source": [
        "An N-gram is an N-token sequence of words: a 2-gram (called a bigram) is a two-word sequence like “really good”, “not good”, or “your homework”, and a 3-gram (trigram) is a three-word sequence like “not at all” or “turn off light”.\r\n",
        "\r\n",
        "Set the parameter ngram_range=(a, b), where a is the minimum and b is the maximum size of the n-grams you want to include in your features. The default ngram_range is (1, 1)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b9wanxJrt38A"
},
"source": [
        "Instead of using a single word as a feature, we can use a pair of words or three words as one feature for our model."
]
},
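    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ngram_toy_example_md"
      },
      "source": [
        "A minimal sketch of how ngram_range changes the extracted features (toy sentence and variable names are illustrative):"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ngram_toy_example_code"
      },
      "source": [
        "from sklearn.feature_extraction.text import CountVectorizer\r\n",
        "\r\n",
        "bigram_vec = CountVectorizer(ngram_range = (2, 2)) #bigrams only\r\n",
        "bigram_vec.fit([\"this movie is not good\"])\r\n",
        "print(sorted(bigram_vec.vocabulary_)) #['is not', 'movie is', 'not good', 'this movie']"
      ],
      "execution_count": null,
      "outputs": []
    },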
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "710mBY-qq0vR",
"outputId": "b4289a2e-a6f9-463e-a867-8c1070deab96"
},
"source": [
"count_vec = CountVectorizer(max_features = 2000, ngram_range=(2,3))\r\n",
"x_train_features = count_vec.fit_transform(x_train)\r\n",
"x_train_features.todense()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"matrix([[0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" ...,\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 83
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Lh5vHjHpueTt",
"outputId": "e9d77119-a680-4aae-c334-48a8de8cae25"
},
"source": [
        "count_vec.get_feature_names() #here each feature is a pair of words or a set of three words"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['10 10',\n",
" '10 minute',\n",
" '10 scale',\n",
" '10 scale scale',\n",
" '10 year',\n",
" '100 million',\n",
" '12 year',\n",
" '13th warrior',\n",
" '14 year',\n",
" '15 minute',\n",
" '15 year',\n",
" '17 year',\n",
" '19th century',\n",
" '20 minute',\n",
" '20 year',\n",
" '2001 space',\n",
" '2001 space odyssey',\n",
" '20th century',\n",
" '30 minute',\n",
" '30 year',\n",
" '90 minute',\n",
" 'absolutely nothing',\n",
" 'academy award',\n",
" 'ace ventura',\n",
" 'across country',\n",
" 'act ability',\n",
" 'act film',\n",
" 'act like',\n",
" 'act skill',\n",
" 'act talent',\n",
" 'action adventure',\n",
" 'action comedy',\n",
" 'action film',\n",
" 'action flick',\n",
" 'action hero',\n",
" 'action movie',\n",
" 'action packed',\n",
" 'action scene',\n",
" 'action sequence',\n",
" 'action star',\n",
" 'action thriller',\n",
" 'actor film',\n",
" 'actor play',\n",
" 'actually get',\n",
" 'adam sandler',\n",
" 'adult film',\n",
" 'african american',\n",
" 'al pacino',\n",
" 'albert brook',\n",
" 'alec baldwin',\n",
" 'alien film',\n",
" 'alien resurrection',\n",
" 'almost always',\n",
" 'almost entirely',\n",
" 'almost every',\n",
" 'along line',\n",
" 'along way',\n",
" 'already know',\n",
" 'also direct',\n",
" 'also feature',\n",
" 'also get',\n",
" 'also give',\n",
" 'also good',\n",
" 'also happens',\n",
" 'also help',\n",
" 'also include',\n",
" 'also like',\n",
" 'also make',\n",
" 'also one',\n",
" 'also seem',\n",
" 'also write',\n",
" 'although character',\n",
" 'although film',\n",
" 'always one',\n",
" 'always seem',\n",
" 'amaze potent',\n",
" 'amaze potent stuff',\n",
" 'american beauty',\n",
" 'american dream',\n",
" 'american film',\n",
" 'american history',\n",
" 'american movie',\n",
" 'american pie',\n",
" 'angelina jolie',\n",
" 'animate feature',\n",
" 'animate film',\n",
" 'another character',\n",
" 'another film',\n",
" 'another man',\n",
" 'another movie',\n",
" 'another one',\n",
" 'another thing',\n",
" 'answer question',\n",
" 'anthony hopkins',\n",
" 'antonio banderas',\n",
" 'anyone else',\n",
" 'anyone see',\n",
" 'anything else',\n",
" 'arnold schwarzenegger',\n",
" 'art direction',\n",
" 'ashley judd',\n",
" 'aspect film',\n",
" 'aspect ratio',\n",
" 'attempt make',\n",
" 'audience get',\n",
" 'audience member',\n",
" 'austin power',\n",
" 'award win',\n",
" 'away film',\n",
" 'ba ku',\n",
" 'back forth',\n",
" 'back home',\n",
" 'back seat',\n",
" 'back time',\n",
" 'bad act',\n",
" 'bad actor',\n",
" 'bad boy',\n",
" 'bad dialogue',\n",
" 'bad enough',\n",
" 'bad film',\n",
" 'bad guy',\n",
" 'bad movie',\n",
" 'bad news',\n",
" 'bad one',\n",
" 'bad script',\n",
" 'bad thing',\n",
" 'bank robber',\n",
" 'barb wire',\n",
" 'base book',\n",
" 'base novel',\n",
" 'base true',\n",
" 'base true story',\n",
" 'basic instinct',\n",
" 'basic plot',\n",
" 'batman forever',\n",
" 'batman robin',\n",
" 'battle scene',\n",
" 'battlefield earth',\n",
" 'beautiful woman',\n",
" 'beauty beast',\n",
" 'become one',\n",
" 'becomes apparent',\n",
" 'begin end',\n",
" 'begin film',\n",
" 'begin movie',\n",
" 'behind camera',\n",
" 'behind film',\n",
" 'behind scene',\n",
" 'ben affleck',\n",
" 'ben stiller',\n",
" 'benicio del',\n",
" 'best actor',\n",
" 'best film',\n",
" 'best film year',\n",
" 'best friend',\n",
" 'best know',\n",
" 'best movie',\n",
" 'best part',\n",
" 'best performance',\n",
" 'best picture',\n",
" 'best scene',\n",
" 'best support',\n",
" 'best thing',\n",
" 'best way',\n",
" 'best work',\n",
" 'beverly hill',\n",
" 'big budget',\n",
" 'big disappointment',\n",
" 'big fan',\n",
" 'big hit',\n",
" 'big laugh',\n",
" 'big lebowski',\n",
" 'big name',\n",
" 'big problem',\n",
" 'big screen',\n",
" 'big star',\n",
" 'big time',\n",
" 'bill murray',\n",
" 'bill paxton',\n",
" 'billy bob',\n",
" 'billy crystal',\n",
" 'black cauldron',\n",
" 'black comedy',\n",
" 'black white',\n",
" 'blade runner',\n",
" 'blade squad',\n",
" 'blair witch',\n",
" 'blair witch project',\n",
" 'body count',\n",
" 'boiler room',\n",
" 'bond film',\n",
" 'boogie night',\n",
" 'born killer',\n",
" 'bottom line',\n",
" 'box office',\n",
" 'brad pitt',\n",
" 'brand new',\n",
" 'brian de',\n",
" 'brian de palma',\n",
" 'bridget fonda',\n",
" 'bright spot',\n",
" 'bring back',\n",
" 'bring friend',\n",
" 'bring friend amaze',\n",
" 'bruce willis',\n",
" 'budget film',\n",
" 'bug life',\n",
" 'cameo appearance',\n",
" 'camera angle',\n",
" 'camera work',\n",
" 'cameron diaz',\n",
" 'car chase',\n",
" 'car crash',\n",
" 'care character',\n",
" 'care less',\n",
" 'carlito way',\n",
" 'carry film',\n",
" 'carry movie',\n",
" 'casper van',\n",
" 'casper van dien',\n",
" 'cast character',\n",
" 'cast crew',\n",
" 'cast include',\n",
" 'cast member',\n",
" 'catherine keener',\n",
" 'catherine zeta',\n",
" 'catherine zeta jones',\n",
" 'center around',\n",
" 'central character',\n",
" 'century fox',\n",
" 'chan movie',\n",
" 'change mind',\n",
" 'change pace',\n",
" 'character actor',\n",
" 'character actually',\n",
" 'character almost',\n",
" 'character also',\n",
" 'character come',\n",
" 'character development',\n",
" 'character even',\n",
" 'character film',\n",
" 'character get',\n",
" 'character give',\n",
" 'character interaction',\n",
" 'character like',\n",
" 'character make',\n",
" 'character movie',\n",
" 'character much',\n",
" 'character name',\n",
" 'character never',\n",
" 'character one',\n",
" 'character play',\n",
" 'character played',\n",
" 'character really',\n",
" 'character see',\n",
" 'character seem',\n",
" 'character show',\n",
" 'character situation',\n",
" 'character study',\n",
" 'character suppose',\n",
" 'character well',\n",
" 'character would',\n",
" 'charlize theron',\n",
" 'chase amy',\n",
" 'chase scene',\n",
" 'chicken run',\n",
" 'child actor',\n",
" 'chris donnell',\n",
" 'chris rock',\n",
" 'chris tucker',\n",
" 'christina ricci',\n",
" 'christopher walken',\n",
" 'civil war',\n",
" 'claire dane',\n",
" 'classic film',\n",
" 'clint eastwood',\n",
" 'close encounter',\n",
" 'close ups',\n",
" 'closing credit',\n",
" 'co star',\n",
" 'co worker',\n",
" 'co write',\n",
" 'co writer',\n",
" 'coen brother',\n",
" 'col nicholson',\n",
" 'come across',\n",
" 'come along',\n",
" 'come back',\n",
" 'come close',\n",
" 'come end',\n",
" 'come go',\n",
" 'come life',\n",
" 'come mind',\n",
" 'come movie',\n",
" 'come play',\n",
" 'come surprise',\n",
" 'come together',\n",
" 'comedy film',\n",
" 'comic book',\n",
" 'comic relief',\n",
" 'comic timing',\n",
" 'completely different',\n",
" 'computer animate',\n",
" 'computer generate',\n",
" 'con air',\n",
" 'concentration camp',\n",
" 'consider portion',\n",
" 'consider portion follow',\n",
" 'cop movie',\n",
" 'could easily',\n",
" 'could go',\n",
" 'could make',\n",
" 'could much',\n",
" 'could never',\n",
" 'could possibly',\n",
" 'could say',\n",
" 'could see',\n",
" 'could use',\n",
" 'could well',\n",
" 'couple year',\n",
" 'course film',\n",
" 'course one',\n",
" 'courtney love',\n",
" 'courtroom drama',\n",
" 'creaky still',\n",
" 'creaky still well',\n",
" 'credit roll',\n",
" 'crew member',\n",
" 'cruel intention',\n",
" 'cuba gooding',\n",
" 'dalai lama',\n",
" 'danny devito',\n",
" 'danny elfman',\n",
" 'danny glover',\n",
" 'dante peak',\n",
" 'dark city',\n",
" 'dark side',\n",
" 'darth vader',\n",
" 'david arquette',\n",
" 'david lynch',\n",
" 'dawson creek',\n",
" 'day day',\n",
" 'day go',\n",
" 'day life',\n",
" 'de bont',\n",
" 'de niro',\n",
" 'de palma',\n",
" 'decides go',\n",
" 'decides take',\n",
" 'deep end',\n",
" 'deep impact',\n",
" 'deep rise',\n",
" 'del toro',\n",
" 'denise richards',\n",
" 'dennis quaid',\n",
" 'denzel washington',\n",
" 'despite fact',\n",
" 'dialogue character',\n",
" 'die hard',\n",
" 'digital effect',\n",
" 'dim witted',\n",
" 'dimensional character',\n",
" 'direct film',\n",
" 'director david',\n",
" 'director john',\n",
" 'director paul',\n",
" 'director peter',\n",
" 'director stephen',\n",
" 'director writer',\n",
" 'directorial debut',\n",
" 'disaster film',\n",
" 'disaster movie',\n",
" 'disney animate',\n",
" 'disney animate feature',\n",
" 'disney film',\n",
" 'donald sutherland',\n",
" 'donnie brasco',\n",
" 'double cross',\n",
" 'double jeopardy',\n",
" 'downey jr',\n",
" 'dr evil',\n",
" 'dream sequence',\n",
" 'dress like',\n",
" 'drew barrymore',\n",
" 'drug deal',\n",
" 'drug dealer',\n",
" 'drug use',\n",
" 'drunken master',\n",
" 'dusk till',\n",
" 'dusk till dawn',\n",
" 'dustin hoffman',\n",
" 'earlier film',\n",
" 'early film',\n",
" 'early scene',\n",
" 'easy see',\n",
" 'ed wood',\n",
" 'eddie murphy',\n",
" 'edward norton',\n",
" 'effect movie',\n",
" 'eight year',\n",
" 'either one',\n",
" 'either way',\n",
" 'element film',\n",
" 'elmore leonard',\n",
" 'end credit',\n",
" 'end day',\n",
" 'end end',\n",
" 'end film',\n",
" 'end get',\n",
" 'end movie',\n",
" 'end one',\n",
" 'end result',\n",
" 'enjoy film',\n",
" 'enjoy movie',\n",
" 'enough make',\n",
" 'entertainment value',\n",
" 'entire film',\n",
" 'entire life',\n",
" 'entire movie',\n",
" 'etc etc',\n",
" 'eugene levy',\n",
" 'even bad',\n",
" 'even begin',\n",
" 'even come',\n",
" 'even get',\n",
" 'even go',\n",
" 'even know',\n",
" 'even less',\n",
" 'even make',\n",
" 'even one',\n",
" 'even remotely',\n",
" 'even see',\n",
" 'even though',\n",
" 'even try',\n",
" 'even well',\n",
" 'event horizon',\n",
" 'ever get',\n",
" 'ever heard',\n",
" 'ever make',\n",
" 'ever see',\n",
" 'ever since',\n",
" 'every bit',\n",
" 'every character',\n",
" 'every day',\n",
" 'every line',\n",
" 'every movie',\n",
" 'every one',\n",
" 'every scene',\n",
" 'every single',\n",
" 'every time',\n",
" 'everyone else',\n",
" 'everyone know',\n",
" 'everything else',\n",
" 'everything seem',\n",
" 'ewan mcgregor',\n",
" 'ex girlfriend',\n",
" 'exactly go',\n",
" 'expect film',\n",
" 'expect movie',\n",
" 'expect much',\n",
" 'expect see',\n",
" 'extremely well',\n",
" 'eye candy',\n",
" 'facial expression',\n",
" 'fact film',\n",
" 'fact movie',\n",
" 'fairy tale',\n",
" 'fall apart',\n",
" 'fall flat',\n",
" 'fall love',\n",
" 'fall short',\n",
" 'family film',\n",
" 'family movie',\n",
" 'famke janssen',\n",
" 'far away',\n",
" 'far fetch',\n",
" 'far superior',\n",
" 'far well',\n",
" 'fare well',\n",
" 'farrelly brother',\n",
" 'fast forward',\n",
" 'fast pace',\n",
" 'fbi agent',\n",
" 'feature film',\n",
" 'feature length',\n",
" 'feel good',\n",
" 'feel like',\n",
" 'feel sorry',\n",
" 'felt like',\n",
" 'female character',\n",
" 'fi film',\n",
" 'fiction film',\n",
" 'fifteen minute',\n",
" 'fifteen year',\n",
" 'fifth element',\n",
" 'fight club',\n",
" 'fight scene',\n",
" 'fight sequence',\n",
" 'film action',\n",
" 'film actually',\n",
" 'film alien',\n",
" 'film almost',\n",
" 'film also',\n",
" 'film although',\n",
" 'film another',\n",
" 'film attempt',\n",
" 'film bad',\n",
" 'film base',\n",
" 'film become',\n",
" 'film becomes',\n",
" 'film begin',\n",
" 'film best',\n",
" 'film big',\n",
" 'film call',\n",
" 'film character',\n",
" 'film come',\n",
" 'film completely',\n",
" 'film contains',\n",
" 'film could',\n",
" 'film course',\n",
" 'film critic',\n",
" 'film deal',\n",
" 'film debut',\n",
" 'film despite',\n",
" 'film direct',\n",
" 'film director',\n",
" 'film end',\n",
" 'film entertain',\n",
" 'film even',\n",
" 'film ever',\n",
" 'film ever make',\n",
" 'film ever see',\n",
" 'film expect',\n",
" 'film fact',\n",
" 'film far',\n",
" 'film feature',\n",
" 'film feel',\n",
" 'film festival',\n",
" 'film fill',\n",
" 'film film',\n",
" 'film first',\n",
" 'film focus',\n",
" 'film follow',\n",
" 'film full',\n",
" 'film funny',\n",
" 'film genre',\n",
" 'film get',\n",
" 'film give',\n",
" 'film go',\n",
" 'film good',\n",
" 'film great',\n",
" 'film history',\n",
" 'film however',\n",
" 'film industry',\n",
" 'film instead',\n",
" 'film interest',\n",
" 'film know',\n",
" 'film lack',\n",
" 'film last',\n",
" 'film least',\n",
" 'film left',\n",
" 'film like',\n",
" 'film little',\n",
" 'film look',\n",
" 'film look like',\n",
" 'film lot',\n",
" 'film main',\n",
" 'film make',\n",
" 'film maker',\n",
" 'film many',\n",
" 'film may',\n",
" 'film might',\n",
" 'film movie',\n",
" 'film much',\n",
" 'film need',\n",
" 'film never',\n",
" 'film noir',\n",
" 'film offer',\n",
" 'film often',\n",
" 'film one',\n",
" 'film open',\n",
" 'film opening',\n",
" 'film original',\n",
" 'film people',\n",
" 'film play',\n",
" 'film plot',\n",
" 'film probably',\n",
" 'film produce',\n",
" 'film progress',\n",
" 'film real',\n",
" 'film really',\n",
" 'film release',\n",
" 'film review',\n",
" 'film run',\n",
" 'film say',\n",
" 'film scene',\n",
" 'film see',\n",
" 'film seem',\n",
" 'film series',\n",
" 'film set',\n",
" 'film show',\n",
" 'film simply',\n",
" 'film since',\n",
" 'film star',\n",
" 'film start',\n",
" 'film still',\n",
" 'film story',\n",
" 'film suppose',\n",
" 'film sure',\n",
" 'film take',\n",
" 'film take place',\n",
" 'film think',\n",
" 'film though',\n",
" 'film time',\n",
" 'film title',\n",
" 'film try',\n",
" 'film turn',\n",
" 'film two',\n",
" 'film use',\n",
" 'film version',\n",
" 'film want',\n",
" 'film watch',\n",
" 'film way',\n",
" 'film well',\n",
" 'film whole',\n",
" 'film without',\n",
" 'film work',\n",
" 'film would',\n",
" 'film write',\n",
" 'film year',\n",
" 'film yet',\n",
" 'final destination',\n",
" 'final scene',\n",
" 'finally get',\n",
" 'find film',\n",
" 'find good',\n",
" 'find right',\n",
" 'find way',\n",
" 'fine performance',\n",
" 'first contact',\n",
" 'first feature',\n",
" 'first film',\n",
" 'first half',\n",
" 'first half hour',\n",
" 'first hour',\n",
" 'first movie',\n",
" 'first one',\n",
" 'first place',\n",
" 'first rate',\n",
" 'first saw',\n",
" 'first scene',\n",
" 'first see',\n",
" 'first thing',\n",
" 'first time',\n",
" 'first two',\n",
" 'fish water',\n",
" 'five minute',\n",
" 'five year',\n",
" 'fly inkpot',\n",
" 'fly inkpot rating',\n",
" 'follow text',\n",
" 'follow text spoiler',\n",
" 'foul mouth',\n",
" 'found film',\n",
" 'four star',\n",
" 'four year',\n",
" 'freddie prinze',\n",
" 'freddie prinze jr',\n",
" 'friend amaze',\n",
" 'friend amaze potent',\n",
" 'full length',\n",
" 'full monty',\n",
" 'fun film',\n",
" 'fun movie',\n",
" 'fun watch',\n",
" 'funny moment',\n",
" 'funny movie',\n",
" 'funny scene',\n",
" 'gary oldman',\n",
" 'gary sinise',\n",
" 'gas station',\n",
" 'gauge 10',\n",
" 'gene hackman',\n",
" 'general daughter',\n",
" 'geoffrey rush',\n",
" 'george clooney',\n",
" 'george lucas',\n",
" 'get away',\n",
" 'get back',\n",
" 'get bad',\n",
" 'get big',\n",
" 'get caught',\n",
" 'get chance',\n",
" 'get even',\n",
" 'get feel',\n",
" 'get first',\n",
" 'get go',\n",
" 'get good',\n",
" 'get involve',\n",
" 'get job',\n",
" 'get kill',\n",
" 'get know',\n",
" 'get little',\n",
" 'get lose',\n",
" 'get mail',\n",
" 'get married',\n",
" 'get money',\n",
" 'get movie',\n",
" 'get much',\n",
" 'get one',\n",
" 'get past',\n",
" 'get point',\n",
" 'get rid',\n",
" 'get right',\n",
" 'get see',\n",
" 'get shot',\n",
" 'get together',\n",
" 'get trouble',\n",
" 'get way',\n",
" 'get well',\n",
" 'get wrong',\n",
" 'ghost dog',\n",
" 'ghost mar',\n",
" 'gingerbread man',\n",
" 'girl name',\n",
" 'give audience',\n",
" 'give away',\n",
" 'give best',\n",
" 'give chance',\n",
" 'give character',\n",
" 'give credit',\n",
" 'give enough',\n",
" 'give film',\n",
" 'give good',\n",
" 'give great',\n",
" 'give little',\n",
" 'give movie',\n",
" 'give much',\n",
" 'give one',\n",
" 'give opportunity',\n",
" 'glenn close',\n",
" 'go along',\n",
" 'go anywhere',\n",
" 'go around',\n",
" 'go away',\n",
" 'go awry',\n",
" 'go back',\n",
" 'go bad',\n",
" 'go beyond',\n",
" 'go far',\n",
" 'go film',\n",
" 'go get',\n",
" 'go go',\n",
" 'go happen',\n",
" 'go home',\n",
" 'go like',\n",
" 'go long',\n",
" 'go make',\n",
" 'go movie',\n",
" 'go nowhere',\n",
" 'go one',\n",
" 'go see',\n",
" 'go see movie',\n",
" 'go way',\n",
" 'go well',\n",
" 'go without',\n",
" 'go wrong',\n",
" 'good actor',\n",
" 'good bad',\n",
" 'good enough',\n",
" 'good even',\n",
" 'good evil',\n",
" 'good film',\n",
" 'good get',\n",
" 'good guy',\n",
" 'good hunt',\n",
" 'good idea',\n",
" 'good job',\n",
" 'good laugh',\n",
" 'good look',\n",
" 'good movie',\n",
" 'good natured',\n",
" 'good old',\n",
" 'good one',\n",
" 'good performance',\n",
" 'good reason',\n",
" 'good scene',\n",
" 'good script',\n",
" 'good story',\n",
" 'good thing',\n",
" 'good time',\n",
" 'gooding jr',\n",
" 'granger movie',\n",
" 'granger movie gauge',\n",
" 'granger review',\n",
" 'great actor',\n",
" 'great deal',\n",
" 'great film',\n",
" 'great job',\n",
" 'great movie',\n",
" 'great performance',\n",
" 'great thing',\n",
" 'groundhog day',\n",
" 'group people',\n",
" 'gu van',\n",
" 'gu van sant',\n",
" 'guilty pleasure',\n",
" 'guy get',\n",
" 'gwyneth paltrow',\n",
" 'ha ha',\n",
" 'half dozen',\n",
" 'half film',\n",
" 'half hour',\n",
" 'halfway film',\n",
" 'hank azaria',\n",
" 'happy end',\n",
" 'happy gilmore',\n",
" 'hard rain',\n",
" 'hard time',\n",
" 'hard work',\n",
" 'harrison ford',\n",
" 'haunt hill',\n",
" 'heather graham',\n",
" 'heavy hand',\n",
" 'help film',\n",
" 'hen wen',\n",
" 'high art',\n",
" 'high concept',\n",
" 'high level',\n",
" 'high profile',\n",
" 'high school',\n",
" 'hip hop',\n",
" 'hit man',\n",
" 'hollywood film',\n",
" 'hollywood movie',\n",
" 'holy man',\n",
" 'home alone',\n",
" 'home gotcha',\n",
" 'home gotcha pretty',\n",
" 'home video',\n",
" 'hong kong',\n",
" 'horn king',\n",
" 'horror film',\n",
" 'horror movie',\n",
" 'horse whisperer',\n",
" 'hot shot',\n",
" 'hotel room',\n",
" 'hour film',\n",
" 'hour half',\n",
" 'hour long',\n",
" 'hour movie',\n",
" 'house haunt',\n",
" 'house haunt hill',\n",
" 'however film',\n",
" 'humor film',\n",
" 'ice cream',\n",
" 'include one',\n",
" 'independence day',\n",
" 'indiana jones',\n",
" 'inkpot rating',\n",
" 'inkpot rating system',\n",
" 'inspector gadget',\n",
" 'instead get',\n",
" 'interest character',\n",
" 'interest film',\n",
" 'interest thing',\n",
" 'jack nicholson',\n",
" 'jackie brown',\n",
" 'jackie chan',\n",
" 'jackie chan movie',\n",
" 'james bond',\n",
" 'james cameron',\n",
" 'james cromwell',\n",
" 'james spader',\n",
" 'james wood',\n",
" 'jamie lee',\n",
" 'jan de',\n",
" 'jan de bont',\n",
" 'janeane garofalo',\n",
" 'jar jar',\n",
" 'jay mohr',\n",
" 'jay silent',\n",
" 'jay silent bob',\n",
" 'jean claude',\n",
" 'jean reno',\n",
" 'jedi knight',\n",
" 'jeff goldblum',\n",
" 'jennifer lopez',\n",
" 'jennifer love',\n",
" 'jennifer love hewitt',\n",
" 'jerry springer',\n",
" 'jesse james',\n",
" 'jet li',\n",
" 'jim carrey',\n",
" 'joblo come',\n",
" 'joe black',\n",
" 'joe eszterhas',\n",
" 'joe young',\n",
" 'joel schumacher',\n",
" 'john carpenter',\n",
" 'john cusack',\n",
" 'john goodman',\n",
" 'john grisham',\n",
" 'john hughes',\n",
" 'john malkovich',\n",
" 'john travolta',\n",
" 'john williams',\n",
" 'john woo',\n",
" 'johnny depp',\n",
" 'jon voight',\n",
" 'julia robert',\n",
" 'julianne moore',\n",
" 'jurassic park',\n",
" 'kathy bates',\n",
" 'keanu reef',\n",
" 'kenneth williams',\n",
" 'kevin bacon',\n",
" 'kevin costner',\n",
" 'kevin smith',\n",
" 'kevin spacey',\n",
" 'kevin williamson',\n",
" 'kill one',\n",
" 'kind film',\n",
" 'kind movie',\n",
" 'kiss girl',\n",
" 'know character',\n",
" 'know exactly',\n",
" 'know fact',\n",
" 'know fact film',\n",
" 'know film',\n",
" 'know get',\n",
" 'know go',\n",
" 'know last',\n",
" 'know last summer',\n",
" 'know movie',\n",
" 'know one',\n",
" 'know well',\n",
" 'know would',\n",
" 'kung fu',\n",
" 'la motta',\n",
" 'la vega',\n",
" 'large life',\n",
" 'larry flynt',\n",
" 'last day',\n",
" 'last film',\n",
" 'last half',\n",
" 'last minute',\n",
" 'last night',\n",
" 'last one',\n",
" 'last scene',\n",
" 'last summer',\n",
" 'last time',\n",
" 'last two',\n",
" 'last year',\n",
" 'late film',\n",
" 'late night',\n",
" 'later film',\n",
" 'laugh film',\n",
" 'laugh loud',\n",
" 'lead character',\n",
" 'lead role',\n",
" 'learn little',\n",
" 'least bit',\n",
" 'least one',\n",
" 'lee jones',\n",
" 'left behind',\n",
" 'let alone',\n",
" 'let face',\n",
" 'let go',\n",
" 'let hope',\n",
" 'let say',\n",
" 'let see',\n",
" 'lethal weapon',\n",
" 'liam neeson',\n",
" 'life beautiful',\n",
" 'life death',\n",
" 'life film',\n",
" 'life make',\n",
" 'life one',\n",
" 'light hearted',\n",
" 'like alien',\n",
" 'like bad',\n",
" 'like film',\n",
" 'like first',\n",
" 'like get',\n",
" 'like make',\n",
" 'like many',\n",
" 'like movie',\n",
" 'like much',\n",
" 'like one',\n",
" 'like say',\n",
" 'like see',\n",
" 'like something',\n",
" 'like watch',\n",
" 'lili taylor',\n",
" 'line dialogue',\n",
" 'line film',\n",
" 'line like',\n",
" 'lisa kudrow',\n",
" 'little bit',\n",
" ...]"
]
},
"metadata": {
"tags": []
},
"execution_count": 84
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ryySur8lveqc"
},
"source": [
"##**TF-IDF**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iUCA7-rMzkqK"
},
"source": [
"TF-IDF is an abbreviation for Term Frequency Inverse Document Frequency.
This is very common algorithm to transform text into a meaningful representation of
numbers which is used to fit machine algorithm for prediction\r\n",
"\r\n",
"TF-IDF is a statistical measure that evaluates how relevant a word is to a
document in a collection of documents. This is done by multiplying two metrics: how
many times a word appears in a document, and the inverse document frequency of the
word across a set of documents.\r\n",
"\r\n",
"TF-IDF for a word in a document is calculated by multiplying two different
metrics:\r\n",
"1. The term frequency of a word in a document. There are several ways of
calculating this frequency, with the simplest being a raw count of instances a word
appears in a document. Then, there are ways to adjust the frequency, by length of a
document, or by the raw frequency of the most frequent word in a document.\r\n",
"2. The inverse document frequency of the word across a set of documents.
This means, how common or rare a word is in the entire document set. The closer it
is to 0, the more common a word is. This metric can be calculated by taking the
total number of documents, dividing it by the number of documents that contain a
word, and calculating the logarithm.\r\n",
"\r\n",
"Multiplying these two numbers results in the TF-IDF score of a word in a
document. The higher the score, the more relevant that word is in that particular
document.\r\n",
"\r\n",
"**min_df, max_df**: These are the minimum and maximum document frequencies
words/n-grams must have to be used as features. If either of these parameters are
set to integers, they will be used as bounds on the number of documents each
feature must be in to be considered as a feature. "
]
},
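{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the formula above, here is a minimal hand-computed sketch on a toy corpus (the corpus and helper names here are illustrative, not part of the movie review dataset):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import math\r\n",
"\r\n",
"# Toy corpus of 3 'documents', already tokenized\r\n",
"docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'film']]\r\n",
"\r\n",
"def tf(word, doc):\r\n",
"    # Raw count of the word in the document (the simplest term frequency)\r\n",
"    return doc.count(word)\r\n",
"\r\n",
"def idf(word, docs):\r\n",
"    # log(total documents / documents containing the word)\r\n",
"    n_containing = sum(1 for d in docs if word in d)\r\n",
"    return math.log(len(docs) / n_containing)\r\n",
"\r\n",
"# 'good' appears in 2 of 3 documents, so its IDF is lower than that of 'film' (1 of 3)\r\n",
"print(tf('good', docs[2]) * idf('good', docs))\r\n",
"print(tf('film', docs[2]) * idf('film', docs))"
],
"execution_count": null,
"outputs": []
},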
{
"cell_type": "code",
"metadata": {
"id": "bMjOgkRkvhGQ"
},
"source": [
"count_vec = CountVectorizer(max_features = 2000,
ngram_range=(2,3),min_df=0.1,max_df=0.7)\r\n",
"x_train_features = count_vec.fit_transform(x_train)\r\n"
],
"execution_count": null,
"outputs": []
}
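,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same features can also be TF-IDF weighted directly with `TfidfVectorizer`, which combines the counting step of `CountVectorizer` with the IDF weighting described above. A minimal sketch, assuming `x_train` is the list of cleaned reviews prepared earlier:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\r\n",
"\r\n",
"# Same vocabulary limits as the CountVectorizer above, with IDF weighting applied on top\r\n",
"tfidf_vec = TfidfVectorizer(max_features=2000, ngram_range=(2, 3), min_df=0.1, max_df=0.7)\r\n",
"x_train_tfidf = tfidf_vec.fit_transform(x_train)"
],
"execution_count": null,
"outputs": []
}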
]
}