0% found this document useful (0 votes)
21 views5 pages

Data Science 1: Assignment No. 2 Date: Sept 26, 2016

This document is an assignment for a data science course. It contains Python code to analyze a text document from Project Gutenberg. The code downloads the text, removes HTML tags and punctuation, stems the words, removes common stopwords, counts word frequencies, and plots the results in a histogram. The code also times how long the analysis takes to run.

Uploaded by

Ashish
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views5 pages

Data Science 1: Assignment No. 2 Date: Sept 26, 2016

This document is an assignment for a data science course. It contains Python code to analyze a text document from Project Gutenberg. The code downloads the text, removes HTML tags and punctuation, stems the words, removes common stopwords, counts word frequencies, and plots the results in a histogram. The code also times how long the analysis takes to run.

Uploaded by

Ashish
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 5

Data Science 1: Assignment No. 2
Date: Sept 26, 2016

By,
Ashish Menkudale
UIN: 656130575
[email protected]

import timeit
import urllib.request  # Python 3 replacement for the Python 2 urllib2 module
import numpy as np
import pandas as pd
import bs4
import requests
from bs4 import BeautifulSoup

start = timeit.default_timer()
# timer started

# Download the full text of "A Tale of Two Cities" from archive.org.
# NOTE(review): the original URL was mangled by a link proxy
# ("https://..."); restored to the plain archive.org address.
url = "https://archive.org/stream/ataleoftwocities00098gut/98.txt"
with urllib.request.urlopen(url) as data:
    # Decode each line once and join once at the end: the original re-joined
    # the whole accumulated list on every iteration, which is quadratic in
    # the number of lines.
    l = [line.decode("utf-8", errors="replace") for line in data]
# NOTE(review): this name shadows the builtin `str`; it is kept only because
# later sections of this script read it. A rename is the better long-term fix.
str = '\n'.join(l)

print(str)
# got all the text here

import lxml.html
# NOTE(review): `htmlentitydefs` (Python 2 only) was imported here but never
# used; in Python 3 the equivalent is `html.entities`. Dropped as dead code.
import re

# Strip anything that looks like an HTML/XML tag (non-greedy match between
# angle brackets).
filtered_str = re.sub('<[^<]+?>', '', str)
print(filtered_str)
# cleared html tags

# Remove punctuation: keep only word characters and whitespace.
removed_punct = re.sub(r'[^\w\s]', '', filtered_str)
print(removed_punct)
# removed punctuation over here
# Common words (and bare digits) to drop before counting frequencies.
# Changed from a list to a set: membership tests are O(1) instead of O(n)
# per word, and the duplicates in the original ('his', 'she') disappear.
stopwords = {
    'had', 'has', 'your', 'you', 'with', 'i', 'his', 'she', 'he', 'are',
    'not', 'the', 'a', 'was', 'an', 'and', 'of', 'at', 'on', 'over', 'under',
    'to', 'from', 'what', 'if', 'else', 'also', 'in', 'is', 'it', 'by',
    'this', 'that', 'have', 'be', 'as', 'were', 'for', 'so', 'him', 'her',
    'but', 'or', 'no', 'will', 'my', 'up', 'its', 'there', 'away', 'me',
    'we', 'they', 'only', 'too', 'down', 'upon', 'into', 'their', 'here',
    'could', 'would', 'been', 'after', 'us',
    '1', '2', '3', '4', '5', '6', '7', '8', '9', '0',
}
querywords = removed_punct.split()
# Case-insensitive filter; original casing of the kept words is preserved.
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
# removed the common occurrences

from collections import Counter

# Count word frequencies. The original built the dict with Python 2's
# builtin `reduce` (moved to functools in Python 3), shadowed the builtin
# name `list`, and relied on py2's list-returning dict.items() plus the
# in-place list.sort() — all replaced with Counter + sorted().
word_freq = Counter(result.split())

# Ascending by frequency, matching the original sort key.
sorted_list = sorted(word_freq.items(), key=lambda item: item[1])
for word in sorted_list:
    print(word)
# got the frequency and sorted it over here

# Split the filtered text into a flat list of words (any non-word char
# acts as a separator).
wordList = re.sub(r"[^\w]", " ", result).split()

# BUG FIX: the original printed `wordlist` (lowercase L), which is an
# undefined name and raises NameError at runtime.
print(wordList)
# changed the datatype over here

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# Tally how often each remaining (stopword-filtered) word occurs.
word_counts = Counter(wordList)
def plot_bar_from_counter(counter, ax=None):
    """Draw a bar chart of the frequencies stored in *counter*.

    Parameters
    ----------
    counter : collections.Counter
        Mapping of label -> count to plot, one bar per entry.
    ax : matplotlib axes, optional
        Axes to draw on; when omitted a new figure and axes are created.

    Returns
    -------
    The axes the bars were drawn on.
    """
    if ax is None:
        fig = plt.figure()
        ax = fig.add_subplot(111)
    # Materialize the dict views: in Python 3 .values()/.keys() are lazy
    # views, and plt.FixedFormatter expects an indexable sequence.
    frequencies = list(counter.values())
    names = list(counter.keys())
    x_coordinates = np.arange(len(counter))
    ax.bar(x_coordinates, frequencies, align='center')
    # Pin one tick (and one label) under each bar.
    ax.xaxis.set_major_locator(plt.FixedLocator(x_coordinates))
    ax.xaxis.set_major_formatter(plt.FixedFormatter(names))
    return ax

# Render the word-frequency bar chart and block until the window is closed.
plot_bar_from_counter(word_counts)
plt.show()
# plotted histogram

# Report total wall-clock time for the whole analysis (Python 3 print
# function replaces the Python 2 print statement).
print(timeit.default_timer() - start)
# got the time

6.86172139321

You might also like