Information Retrieval: Assignment 1

The document summarizes an assignment completed by the author involving creating a text corpus from a specific domain using Scrapy or Feedparser. NLTK functions were then tested on the corpus, including analyzing word frequencies and concordances. The corpus was from a Wikipedia article on Sterjo Spasse and various NLTK analyses were conducted including word counts, frequencies, and concordances.

Assignment 1

Information Retrieval
Dua Fetai

2. Activity: With Scrapy or Feedparser, create a text corpus for a specific domain. Then create the vocabulary from this corpus, where each word in the vocabulary corresponds to a number indicating whether that word is represented in the sentence. Test the NLTK functions described in http://www.nltk.org/book/ch01.html on this corpus (at least 10 functions).
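The vocabulary step of the task can be sketched as follows. This is a minimal illustration using only the standard library, assuming that "the number indicating that this word is represented in the sentence" means a 0/1 indicator per vocabulary word. The helper names and the two sample sentences are hypothetical and are not taken from the assignment's corpus file.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())

def build_vocabulary(sentences):
    """Return the sorted list of distinct words across all sentences."""
    return sorted({word for s in sentences for word in tokenize(s)})

def indicator_vector(sentence, vocabulary):
    """Map each vocabulary word to 1 if it occurs in the sentence, else 0."""
    present = set(tokenize(sentence))
    return {word: int(word in present) for word in vocabulary}

# Hypothetical two-sentence corpus (assumption, for illustration only)
sentences = ["Spasse shkroi romane.", "Spasse punoi si mësues."]
vocabulary = build_vocabulary(sentences)
vectors = [indicator_vector(s, vocabulary) for s in sentences]

for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```

Each sentence ends up with one number per vocabulary word, so all sentences share the same vector layout.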

• The tools that I used to complete the assignment:

• Python, NLTK
• Anaconda, Spyder (3.7), text file

• I launched Spyder from Anaconda. Spyder is the Scientific Python Development Environment, a powerful Python IDE with advanced editing, interactive testing, debugging and introspection features.

• I took the corpus text from the Wikipedia article on Sterjo Spasse.

• I created a project in Spyder, launched from Anaconda.

• I imported NLTK and processed my text file.
This step imports NLTK and connects it to the corpus text file:

We then start with the functions:

The code:
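# -*- coding: utf-8 -*-
# *** Spyder Python Console History Log ***

## ---(Fri Apr 3 20:30:56 2020)---

import nltk
nltk.download()  # opens the NLTK downloader to fetch the data packages

## ---(Sat Apr 4 03:26:38 2020)---

import nltk

## ---(Sat Apr 4 03:34:38 2020)---

# The functions below follow chapter 1 of the NLTK book:
# http://www.nltk.org/book/ch01.html
import nltk
from nltk.corpus import PlaintextCorpusReader

# Read the single corpus file from the corpus folder
corpusText = PlaintextCorpusReader(r"/Users/duafetai/Desktop/corp", "SterjoSpasse.txt")
# Wrap the tokens in nltk.Text to use the chapter 1 functions
# (an earlier attempt used the undefined name corp and was dropped)
text = nltk.Text(corpusText.words())
text
# Show every occurrence of a word together with its context
text.concordance("Spasse")
text.concordance("Arsimit")
text.concordance("Veprimtaria")
# similar() lists words that appear in contexts like the given word;
# "monstrous" is the NLTK book's example and is not in this corpus
text.similar("monstrous")
# Vocabulary (distinct tokens) and total number of tokens
set(text)
len(text)
# count() is an exact, case-sensitive match
text.count("arsimi")
text.count("Spasse")
sorted(set(text))
# Tokens by index and by slice
text[10]
text[20]
text[2]
text[5:9]
# Frequency distribution of all tokens
from nltk.probability import FreqDist
fdist2 = FreqDist(text)
print(fdist2)
fdist2.most_common(50)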
set(text)
(The start of this output is missing from the source; the set listing resumes here.)
'fshati',
'fshatin',
'fshatrat',
'fundin',
'fundit',
'fushën',
'gazetën',
'gjithashtu',
'hartimin',
'here',
'herë',
'i',
'integral',
'ishin',
'ishte',
'iu',
'ja',
'jeta',
'jetë',
'jonë',
'jug',
'ka',
'kaloi',
'kaluar',
'katër',
'kishte',
'kohën',
'komplet',
'komuniste',
'korrespondencë',
'krahët',
'kreu',
'krijimet',
'kryevepra',
'ku',
'kurs',
'letrare',
'lindur',
'liqenit',
'liri',
'lirë',
'lufte',
'maqedonase',
'maqedonishtfolës',
'marrë',
'mbahet',
'me',
'monografi',
'muar',
'më',
'mësues',
'ndaj',
'ndryshme',
'ndryshëm',
'ndërsa',
'nga',
'nisi',
'njihet',
'një',
'njëri',
'nuk',
'në',
'nëntë',
'organin',
'pa',
'pak',
'pandehur',
'parisë',
'partiak',
'parë',
'pas',
'pashë',
'pedagogji',
'pedagogjike',
'periudhë',
'po',
'politike',
'popullor',
'por',
'porsaçliruar',
'pranë',
'preference',
'prejardhje',
'profesion',
'provimet',
'punime',
'punoi',
'punë',
'për',
'përkohshmes',
'përkthime',
'përkthye',
'përmbledhje',
'përsëri',
'qe',
'që',
're',
'realizmit',
'revistat',
'rinie',
'romane',
'romani',
'romanin',
'romanit',
'shekullit',
'shkak',
'shkollore',
'shkollën',
'shkrimtar',
'shkroi',
'shkruante',
'shkurt',
'shumëllojshme',
'si',
'sidomos',
'socialist',
'së',
'tekste',
'teksteve',
'tetë',
'tij',
'tregime',
'tregimesh',
'tregimit',
'tridhjetë',
'të',
'u',
'vdekje',
'vepra',
'veprash',
'veprën',
'veta',
'vetëm',
'vitet',
'vitit',
'vonë',
'vëllime',
'zë',
'është'}

len(text)
Out[13]: 525

text.count("arsimi")
Out[14]: 0

text.count("Spasse")
Out[15]: 3
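The count of 0 for "arsimi" is consistent with the vocabulary listed in this report, which contains only the capitalized forms 'Arsimi' and 'Arsimit': the count is an exact, case-sensitive match on whole tokens. A minimal sketch of a case-insensitive count is given here; the helper name and the short token list are hypothetical, and in the real session the tokens would come from corpusText.words().

```python
def count_ignoring_case(tokens, word):
    """Count tokens equal to `word` when both are compared in lowercase."""
    target = word.lower()
    return sum(1 for token in tokens if token.lower() == target)

# Hypothetical token sample (assumption, for illustration only)
tokens = ["Arsimi", "dhe", "Arsimit", ",", "Spasse"]

exact = tokens.count("arsimi")                   # exact match on whole tokens
relaxed = count_ignoring_case(tokens, "arsimi")  # matches "Arsimi", not "Arsimit"

print(exact, relaxed)
```

Inflected forms such as "Arsimit" still count as different words; merging them would need stemming or lemmatization, which this sketch does not attempt.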

sorted(set(text))
Out[16]:
['!?"',
'!?.',
'"',
'",',
'(',
')',
'),',
').',
',',
'-',
'.',
'..."',
'.[',
'15',
'1934',
'1935',
'1944',
'1946',
'1952',
'1954',
'1958',
'1965',
'1968',
'1972',
'1973',
'1975',
'1978',
'1980',
'1983',
'1985',
'2',
'3',
'4',
':',
'Afërdita',
'Aleksandër',
'Arsimi',
'Arsimit',
'Artistëve',
'Ata',
'Botimeve',
'Botoi',
'Buzë',
'Derviçan',
'Dhimitër',
'Draçinin',
'Elbasan',
'Firencë',
'Gjirokastrës',
'Gollomboç',
'Hakiun',
'Harbutët',
'Italisë',
'Ja',
'Kjo',
'Kokonën',
'Korçë',
'Korçës',
'Kryengritësit',
'Kulturës',
'Kuror',
'Kutelin',
'Lidhjes',
'Literatura',
'Me',
'Min',
'Më',
'Ndërsa',
'Nga',
'Normale',
'Nusja',
'Në',
'Nëntori',
'Pishtarë',
'Po',
'Prespës',
'Pse',
'Punoi',
'Përpara',
'Qemal',
'Redaksia',
'Revista',
'Rilindja',
'Shkolla',
'Shkrimtarëve',
'Shkroi',
'Shqipërisë',
'Shuteriqin',
'Shuteriqit',
'Si',
'Sipas',
'Spasse',
'Spasses',
'Sterjo',
'Tiranë',
'Tiranën',
'Trebeshinës',
'Të',
'Vepra',
'Veprimtaria',
'Xhuvanin',
'Zgjimi',
'Zjarre',
'[',
']',
'].',
'ai',
'ajo',
'anglisht',
'arriti',
'artikuj',
'ashtu',
'atë',
'bashkëthemelues',
'botimi',
'botoi',
'botua',
'botuan',
'botuar',
'buzë',
'cili',
'cilën',
'dallua',
'dhe',
'dhjetë',
'duvak',
'e',
'etj',
'fala',
'femre',
'fill',
'fillore',
'fitores',
'fshat',
'fshati',
'fshatin',
'fshatrat',
'fundin',
'fundit',
'fushën',
'gazetën',
'gjithashtu',
'hartimin',
'here',
'herë',
'i',
'integral',
'ishin',
'ishte',
'iu',
'ja',
'jeta',
'jetë',
'jonë',
'jug',
'ka',
'kaloi',
'kaluar',
'katër',
'kishte',
'kohën',
'komplet',
'komuniste',
'korrespondencë',
'krahët',
'kreu',
'krijimet',
'kryevepra',
'ku',
'kurs',
'letrare',
'lindur',
'liqenit',
'liri',
'lirë',
'lufte',
'maqedonase',
'maqedonishtfolës',
'marrë',
'mbahet',
'me',
'monografi',
'muar',
'më',
'mësues',
'ndaj',
'ndryshme',
'ndryshëm',
'ndërsa',
'nga',
'nisi',
'njihet',
'një',
'njëri',
'nuk',
'në',
'nëntë',
'organin',
'pa',
'pak',
'pandehur',
'parisë',
'partiak',
'parë',
'pas',
'pashë',
'pedagogji',
'pedagogjike',
'periudhë',
'po',
'politike',
'popullor',
'por',
'porsaçliruar',
'pranë',
'preference',
'prejardhje',
'profesion',
'provimet',
'punime',
'punoi',
'punë',
'për',
'përkohshmes',
'përkthime',
'përkthye',
'përmbledhje',
'përsëri',
'qe',
'që',
're',
'realizmit',
'revistat',
'rinie',
'romane',
'romani',
'romanin',
'romanit',
'shekullit',
'shkak',
'shkollore',
'shkollën',
'shkrimtar',
'shkroi',
'shkruante',
'shkurt',
'shumëllojshme',
'si',
'sidomos',
'socialist',
'së',
'tekste',
'teksteve',
'tetë',
'tij',
'tregime',
'tregimesh',
'tregimit',
'tridhjetë',
'të',
'u',
'vdekje',
'vepra',
'veprash',
'veprën',
'veta',
'vetëm',
'vitet',
'vitit',
'vonë',
'vëllime',
'zë',
'është']
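The sorted vocabulary starts with tokens such as '!?"' and '),' because PlaintextCorpusReader, by default, splits text into runs of word characters and runs of punctuation. The sketch below reproduces that splitting with the standard re module, using the same pattern as NLTK's WordPunctTokenizer, and then keeps only alphabetic tokens. The helper names and the sample line are hypothetical; the sample is built for illustration and is not copied from the corpus file.

```python
import re

# Pattern used by nltk.tokenize.WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokens(raw):
    """Split raw text into word tokens and punctuation-run tokens."""
    return WORD_PUNCT.findall(raw)

def word_vocabulary(tokens):
    """Keep alphabetic tokens only and return them as a sorted vocabulary."""
    return sorted({t for t in tokens if t.isalpha()})

# Hypothetical sample line (assumption, for illustration only)
sample = 'Romani "Pse!?" (1935), botuar në Korçë.'
tokens = word_punct_tokens(sample)
vocab = word_vocabulary(tokens)

print(tokens)
print(vocab)
```

Filtering with str.isalpha() also drops the year tokens, so it suits a word vocabulary but not an analysis that needs the dates.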

text[10]
Out[17]: ','

text[20]
Out[18]: 'Sterjo'

text[2]
Out[19]: 'me'

text[5:9]
Out[20]: [',', 'i', 'lindur', 'në']

from nltk.probability import FreqDist

fdist2 = FreqDist(text)

print(fdist2)
<FreqDist with 273 samples and 525 outcomes>
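The two numbers in this summary give the lexical diversity measure from chapter 1 of the NLTK book: distinct tokens divided by total tokens. The sketch below applies it to the reported values; the function name follows the book, while the variable names are chosen here for illustration.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens."""
    return len(set(tokens)) / len(tokens)

# Values reported by print(fdist2): 273 samples (distinct tokens), 525 outcomes
samples, outcomes = 273, 525
diversity = samples / outcomes        # share of tokens that are distinct
average_uses = outcomes / samples     # how often each distinct token occurs on average

print(round(diversity, 2), round(average_uses, 2))
```

In the session itself, lexical_diversity(text) would compute the same ratio directly from the tokens. Punctuation tokens are included in both counts, so the figure describes tokens rather than words only.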

fdist2.most_common(50)
Out[25]:
[('"', 42),
('e', 22),
(',', 21),
('të', 21),
('në', 18),
('(', 13),
('.', 12),
('),', 10),
('dhe', 8),
('me', 7),
('i', 5),
('që', 5),
('një', 5),
('Më', 4),
('pas', 4),
('si', 4),
('letrare', 4),
('më', 4),
('1944', 4),
('[', 4),
('].', 4),
(':', 4),
('",', 4),
('Spasse', 3),
('nga', 3),
('Sterjo', 3),
('nisi', 3),
('për', 3),
('nuk', 3),
('Tiranë', 3),
('2', 3),
('tij', 3),
('-', 3),
('ishte', 2),
('liqenit', 2),
('shkollën', 2),
('ndërsa', 2),
('shkurt', 2),
('Pse', 2),
('fundit', 2),
('Në', 2),
('botuar', 2),
('shkollore', 2),
('Dhimitër', 2),
('pedagogjike', 2),
('etj', 2),
('Veprimtaria', 2),
('u', 2),
('shkroi', 2),
('romane', 2)]
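The list above is dominated by punctuation, and capitalized and lowercase forms such as 'Më' and 'më' are counted separately. One possible cleanup, sketched here under the assumption that only words are of interest, is to lowercase the tokens and drop everything that is not alphabetic before counting. The sketch uses collections.Counter so that it runs without NLTK; FreqDist is a Counter subclass, so the same generator expression could be passed to FreqDist. The helper name and the token sample are hypothetical.

```python
from collections import Counter

def word_frequencies(tokens):
    """Count lowercase alphabetic tokens, dropping punctuation and numbers."""
    return Counter(t.lower() for t in tokens if t.isalpha())

# Hypothetical token sample (assumption); real tokens come from corpusText.words()
tokens = ['"', "Më", "pas", ",", "më", "1944", "(", "Në", "në", "),", "në"]
freq = word_frequencies(tokens)
top = freq.most_common(2)

print(top)
```

Lowercasing merges sentence-initial forms with their lowercase variants, but it also folds proper nouns such as "Spasse" into lowercase, which may or may not be wanted.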
