Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk
in Python. We will be looking at the following speeches of the Presidents of the
United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and
sentences for the mentioned documents. – 3Marks.
Import Libraries.
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data] C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data] Package movie_reviews is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data] C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data] Package inaugural is already up-to-date!
y0 = pd.DataFrame({'Text':inaugural.raw('1961-Kennedy.txt')},index = [0])
y1 = pd.DataFrame({'Text':inaugural.raw('1941-Roosevelt.txt')},index = [0])
y2 = pd.DataFrame({'Text':inaugural.raw( '1973-Nixon.txt')},index = [0])
Text wordcount char count sent c
[('the', 9446),
('of', 7087),
(',', 7045),
('and', 5146),
('.', 4856),
('to', 4414),
('in', 2561),
('a', 2184),
('our', 2021),
('that', 1748)]
Most Common top (10) Words Used by all 3 Presidents during the Inaugural Ceremony since the
Time.
2.2 Remove all the stop words from all three
speeches. – 3 Marks.
We can filter the stop words with the help to Filter, Sort & Stop function.
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "yo
u're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yours
elves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', '
herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'thei
rs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "tha
t'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been'
, 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing'
, 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between'
, 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to
', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'ag
ain', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why
', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than'
, 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'shou
ld', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', '
aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "does
n't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "i
sn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn'
t", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren'
, "weren't", 'won', "won't", 'wouldn', "wouldn't"]
from nltk.tokenize import word_tokenize
text =inaugural.raw('1941-Roosevelt.txt')
text_tokens = word_tokenize(y1['Text'][0])
tokens_without_sw = [word for word in text_tokens if not word in stop_t
est]
print(tokens_without_sw)
We need to tokenize the all three speeches to get the stop words and to get out the special
characters, Sentences and Words out of the Speeches.
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)
Need to Filter all speeches to get the speech in proper Maner., we can use function Filter Sentences.
2.3 Which word occurs the most number of
times in his inaugural address for each
president? Mention the top three words. (after
removing the stopwords) – 3 Marks¶
from collections import Counter
Roosevelt_split = filtered_sentence.split()#y0['Text'][0].split()
Roosevelt_counter = Counter(Roosevelt_split)
Kennedy_split = filtered_sentence.split()#y1['Text'][0].split()
Kenndey_counter = Counter(Kennedy_split)
Nixon_split = filtered_sentence.split()#y2['Text'][0].split()
Nixon_counter = Counter(Nixon_split)
In [39]:
Roosevelt_most_occur = Roosevelt_counter.most_common(10)
print("Most common word of Roosevelt speech ",Roosevelt_most_occur )
Roosevelt_freq = pd.DataFrame(Roosevelt_most_occur, columns= ['Roosevelt_Fr
equent_words', 'Roosevelt_total_words'])
Roosevelt_freq
Kennedy_most_occur = Kenndey_counter.most_common(10)
print("Most common word of Kennedy speech ",Kennedy_most_occur )
Kennedy_freq = pd.DataFrame(Kennedy_most_occur, columns= ['Kennedy_Frequent
_words', 'Kennedy_total_words'])
Kennedy_freq
Nixon_most_occur = Nixon_counter.most_common(10)
print("Most common word of Nixon speech ",Nixon_most_occur )
Nixon_freq = pd.DataFrame(Nixon_most_occur, columns= ['Nixon_Frequent_words
', 'Nixon_total_words'])
Nixon_freq
Nixon_Frequent_words Nixon_total_words
0 , 77
1 . 68
2 -- 25
3 It 13
4 The 10
5 know 10
6 We 10
7 spirit 9
8 life 9
9 us 8
The Most Common words use by the all 3 President during the Speech.
Most common word of Roosevelt speech [(',', 77), ('.', 68), ('--', 25
), ('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9),
('life', 9), ('us', 8)]
Most common word of Kennedy speech [(',', 77), ('.', 68), ('--', 25),
('It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('l
ife', 9), ('us', 8)]
Most common word of Nixon speech [(',', 77), ('.', 68), ('--', 25), (
'It', 13), ('The', 10), ('know', 10), ('We', 10), ('spirit', 9), ('li
fe', 9), ('us', 8)]
2.4 Plot the word cloud of each of the speeches of
the variable. (after removing the stopwords) – 3
Marks¶
from wordcloud import WordCloud,STOPWORDS
from wordcloud import WordCloud,STOPWORDS
words = ' '.join(y0['Text'])
cleaned_word = " ".join([word for word in words.split()
if '\n' not in word
With the Help of World Cloud Function, we can distinguish the most used word by the all 3 Presidents
During the Speech. We need to change the Vales of y0,y1,& y2 for app