
DEPARTMENT OF INFORMATION TECHNOLOGY

LABORATORY MANUAL

BE (INFORMATION TECHNOLOGY) (SEMESTER – I)

414446: Lab Practice III (ISR)

2019 course

FACULTY NAME- VRUSHALI PATIL



DEPARTMENT OF INFORMATION TECHNOLOGY

Course Code: 414446


Course Name: Lab Practice III (ISR) (2019 Course)

Teaching Scheme: Practical: 4 hrs./week
Credits: 02
Examination Scheme: TW: 25 Marks, OR: 25 Marks

Course Outcomes:
At the end of the course, a student will be able to:
CO1: Understand the concept of information retrieval and apply clustering in information retrieval.
CO2: Use an appropriate indexing approach for retrieval of text and multimedia data.
CO3: Evaluate the performance of information retrieval systems and apply appropriate tools in analyzing web information.
CO4: Map the concepts of the subject onto recent developments in the information retrieval field.



Sr. No. | Title of Assignment | PO Mapping | PSO Mapping

GROUP A

1 | Implement Conflation algorithm to generate document representative of a text file. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

2 | Implement Single-pass Algorithm for clustering of files (consider 4 to 5 files). | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

3 | Implement a program for retrieval of documents using inverted files. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

GROUP B

1 | Implement a program to calculate precision and recall for sample input (answer set A, query q1, relevant documents to query q1: Rq1). | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

2 | Write a program to calculate harmonic mean (F-measure) and E-measure for the above example. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

3 | Implement a program for feature extraction in 2D color images (any features like color, texture, etc.), extract features from an input image, and plot a histogram for the features. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

GROUP C

1 | Build a web crawler to pull product information and links from an e-commerce website. (Python) | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

2 | Write a program to find the live weather report (temperature, wind speed, description, and weather) of a given city. (Python) | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

3 | Case study on a recommender system for a product / doctor / product price / music. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2


Assignment No. 1

Problem Statement:

Implementation of Conflation Algorithm to generate document representative of a


text file.

Objective:
To study:
1. The various concepts and components of information retrieval.
2. Conflation Algorithm.
3. The role of clustering in information retrieval.
4. Indexing structures for information retrieval

Outcomes:
At the end of the assignment the students will be able to
1. Understand the concept of Information retrieval and to apply clustering in
information retrieval.

Scope:
Removal of Stop Words
Suffix Stripping (Any Five Grammar Rules)
Frequency Occurrences Of Key Words (Weight Calculation)

Theory

Introduction to Information Retrieval


In today’s era of information explosion, the increasing demand for quicker dissemination of information, from content stored in a variety of forms, requires speedy search and timely retrieval. The value of a document is measured by the information it contains, but that information proves useless until it is brought out for use by readers, whether by subject analysis or by representation of the terms through symbols. It has always been the need of scholars, and the lingering concern of library organizers, to facilitate the extraction of contents expeditiously and exhaustively, and this is what brought forward the concept of information retrieval.

Meaning & Definition:


Calvin Mooers coined the term information retrieval in 1950. In the context of library and information science, it means to get back information which is, in a way, hidden from normal sight or vision. According to J.H. Shera, it is "the process of locating and selecting data relevant to a given requirement." Calvin Mooers defined it as "searching and retrieval of information from storage, according to specification by subject."

Functions:
The major functions that constitute an information retrieval system are: acquisition, analysis, representation of information, organisation of the indexes, matching, retrieving, readjustment, and feedback.

Components of Information Retrieval System:


A study of the functions of IRS brings forth some of the essential components
that constitute the proper functioning of the system. According to Lancaster,
an information retrieval system consists of six basic subsystems. They are as
follows:
1. The document selection subsystem
2. The indexing subsystem
3. The vocabulary subsystem
4. The searching subsystem
5. The user-system interface
6. The matching subsystem
All the above subsystems may be grouped under two heads: subject/content analysis and search strategy. Subject or content analysis includes the tasks of analysis, organisation, and storage of information. Search strategy includes analysis of user queries, creation of the search formulation, and the actual searching.

Document Representative:
Documents in a collection are frequently represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human subject. Modern computers make it possible to represent a document by its full set of words. With very large collections, however, even modern computers might have to reduce the set of representative keywords. This can be accomplished through the elimination of stopwords (such as articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs). Further, compression might be employed. These operations are called text operations (or transformations).

The full text is clearly the most complete logical view of a document, but its usage usually implies higher computational costs. A small set of categories (generated by a human specialist) provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality. Several intermediate logical views (of a document) might be adopted by an information retrieval system, as illustrated in the figure below.

Besides adopting any of the intermediate representations, the retrieval system might also recognize the internal structure normally present in a document. This information on the structure of the document might be quite useful and is required by structured text retrieval models. As illustrated in the figure, we view the issue of logically representing a document as a continuum in which the logical view of a document might shift (smoothly) from a full-text representation to a higher-level representation specified by a human subject. The simplest document representative consists of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.

Figure: Logical view of a document, from full text to a set of index terms

Conflation Algorithm:

Ultimately one would like to develop a text processing system which by


means of computable methods with the minimum of human intervention will
generate from the input text (full text, abstract, or title) a document
representative adequate for use in an automatic retrieval system. This is a tall
order and can only be partially met. A document will be indexed by a name
if one of its significant words occurs as a member of that class.

Such a system will usually consist of three parts:

(1) removal of high frequency words,

(2) suffix stripping,

(3) detecting equivalent stems

Luhn's ideas
In one of Luhn's early papers he states: 'It is here proposed that the frequency of
word occurrence in an article furnishes a useful measurement of word



significance. It is further proposed that the relative position within a sentence of
words having given values of significance furnish a useful measurement for
determining the significance of sentences. The significance factor of a sentence
will therefore be based on a combination of these two measurements.' This quote fairly summarizes Luhn's contribution to automatic text analysis. His
assumption is that frequency data can be used to extract words and sentences to

represent a document.

The removal of high frequency words, 'stop' words or 'fluff' words is one
way of implementing Luhn's upper cut-off. This is normally done by
comparing the input text with a 'stop list' of words which are to be removed.
The advantages of the process are not only that non-significant words are
removed and will therefore not interfere during retrieval, but also that the size
of the total document file can be reduced by between 30 and 50 per cent.
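As a quick illustration of this step, the following is a minimal Python sketch of stop-word removal and frequency counting (the stop list here is a small illustrative sample, not the manual's actual stop-word file):

import re

STOP_WORDS = {"the", "is", "and", "of", "are", "for", "in", "a", "to"}

def significant_word_frequencies(text):
    # Lowercase, split into words, drop stop words, count what remains.
    counts = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        if w not in STOP_WORDS:
            counts[w] = counts.get(w, 0) + 1
    return counts

print(significant_word_frequencies("A text has many words. Words are made from letters."))
# {'text': 1, 'has': 1, 'many': 1, 'words': 2, 'made': 1, 'from': 1, 'letters': 1}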

Fig: A plot of the hyperbolic curve relating frequency of words f to words by rank order r

Let f be the frequency of occurrence of the various word types in a given piece of text and r their rank order, that is, the order of their frequency of occurrence. A plot relating f and r then yields a curve similar to the hyperbolic curve in the figure above. This curve demonstrates that the product of the frequency of use of words and their rank order is approximately constant (f × r ≈ C, Zipf's law): for example, if the most frequent word occurs 1,000 times, the tenth-ranked word occurs roughly 100 times.


2. Suffix Stripping:
The second stage, suffix stripping, is more complicated. A standard approach is to have a complete list of suffixes and to remove the longest possible one. For example, we may well want UAL removed from FACTUAL but not from EQUAL. To avoid erroneously removing suffixes, context rules are devised so that a suffix will be removed only if the context is right. 'Right' may mean a number of things:
(1) the length of the remaining stem exceeds a given number; the default is usually 2;
(2) the stem-ending satisfies a certain condition, e.g. does not end with Q.
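The following minimal Python sketch illustrates these two context rules; the suffix list is an illustrative assumption, not the manual's full rule set:

SUFFIXES = ["ATIONAL", "UAL", "ING", "ED", "S"]  # longest suffixes first

def strip_suffix(word, min_stem=2):
    # Remove the longest matching suffix, but only if the context is 'right':
    # (1) the remaining stem is longer than min_stem characters, and
    # (2) the stem does not end with Q.
    for suf in SUFFIXES:
        if word.endswith(suf):
            stem = word[:-len(suf)]
            if len(stem) > min_stem and not stem.endswith("Q"):
                return stem
            return word  # context not right: keep the word unchanged
    return word

print(strip_suffix("FACTUAL"))  # FACT  ('UAL' removed; stem length 4 > 2)
print(strip_suffix("EQUAL"))    # EQUAL ('UAL' kept; stem 'EQ' is too short)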
3. Detecting equivalent stems:
Many words which are equivalent in the above sense map to one morphological form by removing their suffixes. The simplest method of dealing with this is to construct a list of equivalent stem-endings. For two stems to be equivalent they must match except for their endings, which themselves must appear in the list as equivalent. For example, stems such as ABSORB- and ABSORPT- are conflated because there is an entry in the list defining B and PT as equivalent stem-endings if the preceding characters match (a sketch follows below).
The assumption (in the context of IR) is that if two words have the same underlying stem then they refer to the same concept and should be indexed as such. This is obviously a simplification, since words with the same stem, such as NEUTRON and NEUTRALISE, sometimes need to be distinguished. Even words which are essentially equivalent may mean different things in different contexts. Since there is no cheap way of making these fine distinctions, we put up with a certain proportion of errors and assume (correctly) that they will not degrade retrieval effectiveness too much. The final output from a conflation algorithm is a set of classes, one for each stem detected. A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document. A document representative then becomes a list of class names. These are often referred to as the document's index terms or keywords.
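A minimal Python sketch of this idea, using an equivalence list of stem-endings (the list contents are illustrative assumptions):

EQUIVALENT_ENDINGS = [("B", "PT"), ("MIT", "MISS")]  # e.g. ABSORB-/ABSORPT-

def equivalent_stems(s1, s2):
    # Two stems conflate if they match except for their endings and that
    # pair of endings appears in the equivalence list.
    for e1, e2 in EQUIVALENT_ENDINGS:
        for a, b in ((e1, e2), (e2, e1)):
            if s1.endswith(a) and s2.endswith(b) and s1[:-len(a)] == s2[:-len(b)]:
                return True
    return False

print(equivalent_stems("ABSORB", "ABSORPT"))   # True: ABSOR- matches, B ~ PT
print(equivalent_stems("NEUTRON", "NEUTRAL"))  # False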



Input:
1. A text file containing stop words
2. A document which is searched and index according to frequency of words

Output:
A document representative containing the frequently appearing keywords, with stop words removed and suffixes stripped

Program implementation: Code written in C/C++ to implement the Conflation Algorithm, with proper output.

Conclusion: Thus, we have implemented the Conflation Algorithm to generate


document representative of a text file.

A. Write short answers to the following questions:

1. Explain the working of the conflation algorithm.
2. State and explain Luhn's theory.
B. Viva Questions:
1. Difference between data retrieval and information retrieval.
2. Indexing exhaustivity and specificity.
3. Five commonly used measures of association in information retrieval.
4. Why are normalized versions of the simple matching coefficient used as measures of association?



Practical No.1

Title: Implement Conflation algorithm to generate document representative of a text file.


Program:
package assign1;
import java.io.File;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
public class Conflation
{
public static void main(String[] args) throws IOException
{
try
{
File fi=new File("Input.txt");
Scanner sc1=new Scanner(new File("Input.txt"));
int ch,i,ans;
do
{
System.out.println("1. Display the file");
System.out.println("2. Remove Stop Words");
System.out.println("3.Suffix Stripping");
System.out.println("4. Count Frequency");
System.out.println("Enter your choice");
Scanner sc=new Scanner(System.in);
ch=sc.nextInt();
switch(ch)
{
case 1:
while(sc1.hasNext())
{
System.out.print(sc1.next()+" ");
}
System.out.println(" ");
break;
case 2:
remove_punctuation(fi);
//remove_stop_words(fi);
break;
case 3:
suffix_stripping();
break;
case 4:
frequency_count();
break;
}
}while(ch!=4);
}catch (FileNotFoundException e)
{
System.out.println(e);
}
}
private static void remove_punctuation(File fi)
{
try {
Scanner sc_punctuation=new Scanner(fi);
BufferedWriter out = new BufferedWriter(
new FileWriter("without_punctuation_and_stopwords.txt"));
while(sc_punctuation.hasNext())
{
String str_p=sc_punctuation.next();
String str_r=str_p.replaceAll("[^a-zA-Z\\s]", "");
if (!str_r.toLowerCase().equals("the") && !str_r.toLowerCase().equals("is") &&
!str_r.toLowerCase().equals("and") && !str_r.toLowerCase().equals("of") &&
!str_r.toLowerCase().equals("are") && !str_r.toLowerCase().equals("for") &&
!str_r.toLowerCase().equals("in"))
{
out.write(str_r+" ");
}
}
out.close();
System.out.println("File after punctuation and stopwords:");
File testfile = new File("without_punctuation_and_stopwords.txt");
BufferedReader br= new BufferedReader(new FileReader(testfile));
String z;
while ((z = br.readLine()) != null)
System.out.println(z);
br.close();
}
catch (IOException e) {
System.out.println("exception occurred" + e);
}
}
private static void suffix_stripping() throws FileNotFoundException,IOException
{
Scanner sc1=new Scanner(new File("without_punctuation_and_stopwords.txt"));
BufferedWriter out = new BufferedWriter(



new FileWriter("suffix_stripping2.txt"));
while (sc1.hasNext())
{
String str=sc1.next();
str=str+"/";
if(str.endsWith("ier/"))
{
str=str.replaceAll("ier/", "y");
}
else if (str.endsWith("ied/"))
{
str=str.replaceAll("ied/", "y");
}
else if (str.endsWith("iage/"))
{
str=str.replaceAll("iage/", "y");
}
else if (str.endsWith("iest/"))
{
str=str.replaceAll("iest/", "y");
}
else if (str.endsWith("ies/"))
{
str=str.replaceAll("ies/", "y");
}
else if (str.endsWith("iful/"))
{
str=str.replaceAll("iful/", "y");
}
else if (str.endsWith("ify/"))
{
str=str.replaceAll("ify/", "y");
}
else if (str.endsWith("iness/"))
{
str=str.replaceAll("iness/", "y");
}
else if (str.endsWith("ness/"))
{
str=str.replaceAll("ness/", "y");
}
else if (str.endsWith("ily/"))
{
str=str.replaceAll("ily/", "y");
}
else if (str.endsWith("yer/"))
{
str=str.replaceAll("yer/", "y");
}
else if (str.endsWith("ying/"))



{
str=str.replaceAll("ying/", "y");
}
else if (str.endsWith("ys/"))
{
str=str.replaceAll("ys/", "y");
}
else if (str.endsWith("yable/"))
{
str=str.replaceAll("yable/", "y");
}
else if (str.endsWith("yful"))
{
str=str.replaceAll("yful", "y");
}
else if (str.endsWith("al/"))
{
str=str.replaceAll("al/", "y");
}
else if (str.endsWith("ly/"))
{
if(str.endsWith("ely/"))
{
str=str.replaceAll("ely/", "e");
}
else
{
str=str.replaceAll("ly/", "");
}
}
else if (str.endsWith("ing/"))
{
str=str.replaceAll("ing/", "y");
}
else if (str.endsWith("ed/"))
{
str=str.replaceAll("ed/", "y");
}
else if (str.endsWith("es/"))
{
str=str.replaceAll("es/", "y");
}
else if (str.endsWith("es/"))
{
str=str.replaceAll("es/", "y");
}
else if (str.endsWith("s/"))
{
str=str.replaceAll("s/", " ");
}



else if (str.endsWith("is/"))
{
str=str.replaceAll("is", "y");
}
else if (str.endsWith("ment/"))
{
str=str.replaceAll("ment/", " ");
}
else if (str.endsWith("eing/"))
{
str=str.replaceAll("eing/", " ");
}
else if (str.endsWith("led/"))
{
str=str.replaceAll("led/", " ");
}
else if (str.endsWith("lex/"))
{
str=str.replaceAll("lex/", " ");
}
else if (str.endsWith("ling/"))
{
str=str.replaceAll("ling/", " ");
}
str=str.replace("/", " ");
out.write(str+" ");
}
out.close();
sc1.close();
System.out.println("File after suffix Stripping:");
File testfile = new File("suffix_stripping2.txt");
BufferedReader br= new BufferedReader(new
FileReader(testfile));
String z;
while ((z = br.readLine()) != null)
System.out.println(z);
br.close();
}

private static void frequency_count() throws


FileNotFoundException,IOException
{
Scanner sc3=new Scanner(new File("suffix_stripping2.txt"));
int flag=0,i=0,l=0,ct=0,flag_w=0;
String w[]=new String[1000];
int cnt[]=new int[1000];
while(sc3.hasNext())
{
w[i]=sc3.next();
i++;
}
sc3.reset();
Scanner sc5=new Scanner(new File("suffix_stripping2.txt"));
while (sc5.hasNext())
{
String str1=sc5.next();
for(int j=0;j<i;j++)
{
if(str1.equalsIgnoreCase(w[j]))
{
flag=1;
cnt[j]++;
}
}
if(flag==0)
{
w[i]=str1;
cnt[i]=1;
i++;
}
}
for(int j=0;j<i;j++)
{
for(int k=j+1;k<i;k++)
{
if(w[j].equalsIgnoreCase(w[k]))
{
flag_w=0;
break;
}
else
{
flag_w=1;
}
}
if(flag_w==1)
{
System.out.println(w[j]+"."+cnt[j]+" ");
}
}
}
}
Output:
1. Display the file
2. Remove Stop Words
3.Suffix Stripping
4. Count Frequency
Enter your choice
1
Astronomy is the study of everything in the universe beyond Earth's atmosphere. That includes objects we can
see with our naked eyes, like the Sun , the Moon , the planets, and the stars . It also includes objects we can only
see with telescopes or other instruments, like faraway galaxies and tiny particles.
1. Display the file
2. Remove Stop Words
3.Suffix Stripping
4. Count Frequency
Enter your choice
2
File after punctuation and stopwords:
Astronomy study everything universe beyond Earths atmosphere That includes objects we can see with our naked
eyes like Sun Moon planets stars It also includes objects we can only see with telescopes or other instruments like
faraway galaxies tiny particles
1. Display the file
2. Remove Stop Words
3.Suffix Stripping
4. Count Frequency
Enter your choice
3
File after suffix Stripping:
Astronomy study everythy universe beyond Earth atmosphere That includy object we can see with our naky eyy
like Sun Moon planet star It also includy object we can on see with telescopy or other instrument like faraway
galaxy tiny particly
1. Display the file
2. Remove Stop Words
3.Suffix Stripping
4. Count Frequency
Enter your choice
4
Astronomy.1
study.1
everythy.1
universe.1
beyond.1
Earth.1
atmosphere.1
That.1
our.1
naky.1



eyy.1
Sun.1
Moon.1
planet.1
star.1
It.1
also.1
includy.2
object.2
we.2
can.2
on.1
see.2
with.2
telescopy.1
or.1
other.1
instrument.1
like.2
faraway.1
galaxy.1
tiny.1
particly.1



Assignment No. 2

Problem Statement:

Implement Single-pass Algorithm for clustering of files (consider 4 to 5 files).

Objectives:
To study:

1. What is clustering?

2. Single-pass algorithm for clustering.

3. Measures of association.

4. The graphical representation of clustering.

Theory:
Clustering
Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.

A definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Clustering is the process of grouping the documents which are relevant. It can be shown by a graph with nodes connected if they are relevant to the same request. A basic assumption is that documents relevant to a request are separated from those which are not relevant, i.e. the relevant documents are more like one another than they are like non-relevant ones.



A simple graphical example:

To identify the 4 clusters into which the data can be divided, the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering.

Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this cluster defines a concept common to all those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The Goals of Clustering
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, the user could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection).

Clustering Requirements
The main requirements that a clustering algorithm should satisfy are:
1. Scalability;
2. Dealing with different types of attributes;
3. Discovering clusters with arbitrary shape;
4. Minimal requirements for domain knowledge to determine input parameters;
5. Ability to deal with noise and outliers;
6. Insensitivity to the order of input records;
7. High dimensionality;
8. Interpretability and usability.



Single Pass Clustering
Single-pass clustering is a quick method by which we make incremental clusters of stream data. This clustering technique provides us with a simple yet flexible technique for stream data. Given a collection of clusters and a threshold value h, if a new document n has its highest similarity, greater than h, to some cluster, the document n is appended to that cluster; if no such cluster exists, a new cluster is generated which contains only the document n. Clearly, single-pass clustering is suitable for incremental clustering of temporal data (or a data stream) since, once a document is assigned to a cluster, it is not changed in the future.

The algorithm is as follows:
(1) Let h be a threshold value.
(2) Let S be an empty set and d1 be the first document. We generate a new cluster C1 consisting of d1.
(3) When a new document di (i > 1) comes in, calculate the similarity values to all the clusters C.
(4) Let sim_max be the highest value and C_di the most similar cluster. If sim_max > h, add di to C_di and adjust the center of C_di. Otherwise, we generate a new cluster C_di that contains only di.
(5) Repeat the process above until no more data comes.
In (4) we define sim_max = MAX(sim(di, C)). We also define the similarity of a document d and a cluster C whose center is V_C as below (called cosine similarity):

sim(d, C) = (d · V_C) / (|d| |V_C|)
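The following is a minimal Python sketch of steps (1)-(5), using the cosine similarity above; documents are plain lists of term weights, and the cluster center is maintained as the running mean of its members (an illustrative sketch, not the Java implementation given in Practical No. 2):

import math

def cosine(d, c):
    # sim(d, C) = (d . V_C) / (|d| |V_C|)
    dot = sum(x * y for x, y in zip(d, c))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in c))
    return dot / norm if norm else 0.0

def single_pass(docs, h):
    clusters = []   # each cluster is a list of document indices
    centers = []    # cluster representative (centroid) per cluster
    for i, d in enumerate(docs):
        sims = [cosine(d, c) for c in centers]
        best = max(range(len(sims)), key=lambda j: sims[j]) if sims else -1
        if best >= 0 and sims[best] > h:
            clusters[best].append(i)
            n = len(clusters[best])  # adjust the center incrementally
            centers[best] = [(c * (n - 1) + x) / n for c, x in zip(centers[best], d)]
        else:
            clusters.append([i])     # no cluster is similar enough: start a new one
            centers.append(list(d))
    return clusters

print(single_pass([[1, 3, 3], [2, 1, 0], [0, 2, 0], [1, 3, 2]], h=0.8))
# [[0, 3], [1], [2]]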

Measures of association

Association is the similarity between objects characterized by discrete-state attributes. The measure of similarity or association is designed to quantify likeness between the objects in such a way that an object in a group is more like the other members of the group than it is like any object outside the group; a cluster method then enables such a group structure to be discovered.

There are five commonly used measures of association in IR:

1. |X ∩ Y|                              Simple matching coefficient
2. 2|X ∩ Y| / (|X| + |Y|)               Dice's coefficient
3. |X ∩ Y| / |X ∪ Y|                    Jaccard's coefficient
4. |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))    Cosine coefficient
5. |X ∩ Y| / min(|X|, |Y|)              Overlap coefficient
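Treating each document representative as a set of keywords, these coefficients can be computed directly, as in the minimal Python sketch below (the two keyword sets are illustrative):

def dice(x, y):    return 2 * len(x & y) / (len(x) + len(y))
def jaccard(x, y): return len(x & y) / len(x | y)
def cosine(x, y):  return len(x & y) / (len(x) ** 0.5 * len(y) ** 0.5)
def overlap(x, y): return len(x & y) / min(len(x), len(y))

X = {"text", "words", "letters"}
Y = {"text", "words", "cluster", "files"}
print(dice(X, Y), jaccard(X, Y), cosine(X, Y), overlap(X, Y))
# 0.5714... 0.4 0.5773... 0.6666...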



In short, the measure of association is calculated by this program by taking into account the frequency of occurrence of words in both documents, i.e. the lower of the two frequencies of occurrence of a common word in the two documents is used in computing the measure of association.

Classification methods
1. Multistate attribute (e.g. colour)
2. Binary state (e.g. keyword)
3. Numerical (e.g. hardness scale or weighted keyword)
4. Probability distribution

Cluster hypothesis
The hypothesis can be simply stated as: closely associated documents tend to be relevant to the same request. This hypothesis is referred to as the cluster hypothesis.

The basic assumption in a retrieval system is that documents relevant to a request are separated from those which are not relevant. Compute the association between all pairs of documents:
a. both of which are relevant to a request, and
b. one of which is relevant and the other is not.

Two approaches to clustering

1. The clustering is based on a measure of similarity between the objects to be clustered. An example of this approach is the graph-theoretic method, which can be used to define clusters in terms of graphs derived from the measure of similarity.
2. The cluster method proceeds directly from the object descriptions.

Graphical representation of clustering

Here a similarity matrix is used in order to draw the graph; documents having a measure of association greater than the threshold value are connected by an edge in the graph. Such an identified cluster can be shown as a connected graph.

From the graph below, it can be easily understood that all documents are associated. But documents like 2 & 5 are not directly associated, and the same is the case for documents 4 & 5. In this way clusters can be depicted.

Example:
Objects {1, 2, 3, 4, 5, 6}



Threshold: 0.59

Clusters are:

Input: Document representatives (minimum 5 files)

Scope: Use of a matching function (Dice coefficient) for comparing document representatives; define an appropriate threshold value.

Program Implementation: Code written in C/C++ to implement the single-pass algorithm for clustering, with proper output.

Output: Clusters of documents

Conclusion: Thus, we have implemented the single-pass algorithm for clustering.

A. Write short answers to the following questions:

1. Explain clustering using a dissimilarity matrix. Also explain the effect of the threshold on clustering.

2. Explain K-list.

3. Explain cluster-based retrieval.

4. Explain the working of Rocchio's algorithm.



B. Viva Questions:

1. Cluster using similarity measures.

2. IR Models.
3. Boolean Search.
4. What is the multipass clustering technique?



Practical No.2

Title : Implement Single-pass Algorithm for clustering of files (consider 4 to 5 files).


Program:
package com.prac.prac;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
public class singlepass {
public static void main(String[] args) throws IOException{
BufferedReader stdInpt = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Enter the no of Tokens");
int noOfDocuments=Integer.parseInt(stdInpt.readLine());
System.out.println("Enter the no of Documents");
int noOfTokens=Integer.parseInt(stdInpt.readLine());
System.out.println("Enter the threshhold");
float threshhold=Float.parseFloat(stdInpt.readLine());
System.out.println("Enter the Document Token Matrix");
int [][]input= new int [noOfDocuments][noOfTokens];
for(int i=0;i<noOfDocuments;i++)
{
for(int j=0;j<noOfTokens;j++)
{
System.out.println("Enter("+i+","+j+")");
input[i][j]=Integer.parseInt(stdInpt.readLine());
}
}
SinglePassAlgorithm(noOfDocuments, noOfTokens, threshhold, input);
}
private static void SinglePassAlgorithm(int noOfDocuments,int noOfTokens,float threshhold,int
[][]input)
{
int [][] cluster = new int [noOfDocuments][noOfDocuments+1];



ArrayList<Float[]> clusterRepresentative = new ArrayList<Float[]>();
cluster [0][0]=1;
cluster [0][1]=0;
int noOfClusters=1;
Float []temp= new Float[noOfTokens];
temp=convertintArrToFloatArr(input[0]);
clusterRepresentative.add(temp);
for(int i=1;i<noOfDocuments;i++)
{
float max=-1;
int clusterId=-1;
for(int j=0;j<noOfClusters;j++)
{
float similarity=calculateSimilarity(convertintArrToFloatArr(input[i]),clusterRepresentative.get(j) );
if(similarity>threshhold)
{
if(similarity>max)
{
max=similarity;
clusterId=j;
}
}
}
if(max==-1)
{
cluster[noOfClusters][0]=1;
cluster[noOfClusters][1]=i;
noOfClusters++;
clusterRepresentative.add(convertintArrToFloatArr(input[i]));
}
else
{



cluster[clusterId][0]+=1;
int index=cluster[clusterId][0];
cluster[clusterId][index]=i;
clusterRepresentative.set(clusterId,calculateClusterRepresentative(cluster[clusterId],input,
noOfTokens));
}
}
for(int i=0;i<noOfClusters;i++)
{
System.out.print("\n"+i+"\t");
for(int j=1;j<=cluster[i][0];++j)
{
System.out.print(" "+cluster[i][j]);
}
}
}
private static Float[] convertintArrToFloatArr(int[] input)
{
int size=input.length;
Float[] answer = new Float[size];
for(int i=0;i<size;i++)
{
answer[i]=(float)input[i];
}
return answer;
}
private static float calculateSimilarity(Float[] a,Float[] b)
{
float answer=0;
for(int i=0;i<a.length;i++)
{
answer+=a[i]*b[i];
}
return answer;



}
private static Float[] calculateClusterRepresentative(int[] cluster,int [][] input,int noOFTokens)
{
Float[] answer= new Float[noOFTokens];
for(int i=0;i<noOFTokens;i++)
{
answer[i]=Float.parseFloat("0");
}
for(int i=1;i<=cluster[0];++i)
{
for(int j=0;j<noOFTokens;j++)
{
answer[j]+=input[cluster[i]][j];
}
}
for(int i=0;i<noOFTokens;i++)
{
answer[i]/=cluster[0];
}
return answer;
}
}



Output :
Enter the no of Tokens
5
Enter the no of Documents
5
Enter the threshhold
10
Enter the Document Token Matrix
Enter(0,0)
1
Enter(0,1)
3
Enter(0,2)
3
Enter(0,3)
2
Enter(0,4)
2
Enter(1,0)
2
Enter(1,1)
1
Enter(1,2)
0
Enter(1,3)
1
Enter(1,4)
2
Enter(2,0)
0
Enter(2,1)
2
Enter(2,2)
0
Enter(2,3)
0
Enter(2,4)
1
Enter(3,0)
0
Enter(3,1)



3
Enter(3,2)
1
Enter(3,3)
0
Enter(3,4)
5
Enter(4,0)



1
Enter(4,1)
0
Enter(4,2)
1
Enter(4,3)
0
Enter(4,4)
1
0    0 1 3
1    2
2    4



Assignment No. 3

Problem Statement:
To implement a program for retrieval of documents using inverted files.

Objectives:
To study indexing, inverted files, and searching for information with the help of an inverted file.

Outcomes:
At the end of the assignment the students should have:
1. Understood the use of indexing in fast retrieval
2. Understood the working of an inverted index

Infrastructure: Desktop/laptop system with Linux or its derivatives.

Software used: LINUX/Windows OS/Virtual Machine/IOS/C/C++/Java/Python

Theory:
Indexing
The basic way to search for a query is to scan the text sequentially. Sequential or online text searching involves finding the occurrences of a pattern in a text. Online searching is appropriate when the text is small, and it is the only choice if the text collection is very volatile or the index space overhead cannot be afforded. A second option is to build data structures over the text to speed up the search. It is worthwhile building and maintaining an index when the text collection is large and semi-static.
Inverted Files:
An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the matching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the occurrences. These positions can refer to words or characters. Word positions simplify phrase and proximity queries, while character positions facilitate direct access to the matching text position.


Searching with the help of an inverted file:
The search algorithm on an inverted index has three steps:
1. Vocabulary search
2. Retrieval of occurrences
3. Manipulation of occurrences
Single-word queries can be searched using any suitable data structure to speed up the search, such as hashing, tries, or B-trees. The first two give O(m) search cost. However, simply storing the words in lexicographical order is cheaper in space and very competitive in performance, since a word can be binary searched at O(log n) cost. Prefix and range queries can also be solved with binary search, tries, or B-trees, but not with hashing. If the query is formed by single words, then the process ends by delivering the list of occurrences. Context queries are more difficult to solve with inverted indices: each element must be searched separately and a list generated for each one. Then, the lists of all elements are traversed in synchronization to find places where all the words appear in sequence (for a phrase) or appear close enough (for proximity).

Text (numbers are 1-based character positions of selected words):

This(1) is(6) a(9) text(11). A(17) text(19) has many(28) words(33). Words(40) are made(50) from letters(60).

Inverted Index:

Vocabulary | Occurrences
letters    | 60 ...
made       | 50 ...
many       | 28 ...
text       | 11, 19 ...
words      | 33, 40 ...
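A minimal Python sketch of building and querying such an index, recording 1-based character positions as in the example above (illustrative, separate from the C++ program in Practical No. 3):

import re
from collections import defaultdict

def build_index(text):
    index = defaultdict(list)
    for m in re.finditer(r"\w+", text):
        index[m.group().lower()].append(m.start() + 1)  # 1-based position
    return index

index = build_index("This is a text. A text has many words. Words are made from letters.")
print(index["text"])   # [11, 19]
print(index["words"])  # [33, 40]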

Algorithm

1. Input the conflated file
2. Build the index file for the input file
3. Input the query
4. Print the index file and the result of the query

Conclusion: Implementation is concluded by stating analysis of retrieval of documents using inverted files.

A. Write short answers to the following questions:
1. Working of inverted files.
2. What are the applications of an inverted index?
3. Working of signature files.

B. Viva Questions:

1. What are vocabulary and occurrences?

2. How is search carried out on an inverted index?

3. How to index multimedia objects?

4. Limitations of an inverted index.

5. What are a suffix array and a suffix tree?

6. What is the concept of signature files?



Practical No.3
Title : Implement a program for retrieval of documents using inverted files.
Program:
#include<iostream>
#include<vector>
#include<map>
#include<string>
#include<fstream>
#include<sstream>
using namespace std;
struct word_position
{
string file_name;
int line;
int index;
};
class InvertedIndex
{
map<string,vector<word_position> > Dictionary;
vector<string> filelist;
public:
void addfile(string filename);
void show_files();
void search(string word);
};



void InvertedIndex::addfile(string filename)
{
ifstream fp;
fp.open(filename + ".txt",ios::in);
if(!fp)
{
cout<<"File Not Found\n";
return ;
}
filelist.push_back(filename);
string line,word;
int line_number=0,word_number=0;
while(getline(fp,line))
{
line_number++;
word_number = 0;
stringstream s(line);
while(s>>word)
{
word_number++;
word_position obj;
obj.file_name = filename;
obj.line = line_number;
obj.index = word_number;
Dictionary[word].push_back(obj);
}



}
fp.close();
}
void InvertedIndex::show_files()
{
int size = (int)filelist.size();
for(int i=0;i<size;i++) cout<<i+1<<": "<<filelist[i]<<endl;
if(!size) cout<<"No files added\n";
}
void InvertedIndex::search(string word)
{
if(Dictionary.find(word)==Dictionary.end())
{
cout<<"No instance exist\n";
return ;
}
int size = (int)Dictionary[word].size();
for(int counter = 0;counter < size ;counter++)
{
cout<<counter+1<<":\n";
cout<<" Filename: "<<Dictionary[word][counter].file_name<<endl;
cout<<" Line Number: "<<Dictionary[word][counter].line<<endl;
cout<<" Index: "<<Dictionary[word][counter].index<<endl;
}



}
int main(int argc, char*argv[])
{
InvertedIndex Data;
for(int i = 1 ; i< argc ; i++)
{
Data.addfile(argv[i]);
}
int choice = 0;
do
{
cout<<"1: See files\n2: Add File\n3: Query Word\n4: Exit\n";
cin>>choice;
switch(choice)
{
case 1: Data.show_files(); break;
case 2:
{
cout<<"Enter File Name: ";
string name;
cin>>name;
Data.addfile(name);
break;
}
case 3:
{



cout<<"Enter Word: ";
string word;
cin>>word;
Data.search(word);
break;
}
case 4: break;
default : continue;
}
}while(choice!=4);
return 0;
}



Output :
1: See files
2: Add File
3: Query Word
4: Exit
1
No files added
1: See files
2: Add File
3: Query Word
4: Exit
2
Enter File Name: ABC.txt
File Not Found
1: See files
2: Add File
3: Query Word
4: Exit
3
Enter Word: ABC
No instance exist
1: See files
2: Add File
3: Query Word
4: Exit
4
** Process exited - Return Code: 0 **



Assignment No. 4

Problem Statement:
Implement a program to calculate precision and recall for sample input. (Answer set A, query q1, relevant documents to query q1: Rq1)

Objectives:
1. To understand precision and recall in information retrieval
2. To study indexing structures for information retrieval.

Outcomes:
At the end of the assignment the students should have:
1. Understood precision and recall in information retrieval.
2. Understood the use of indexing in fast retrieval.

Theory:
Precision and Recall in Information Retrieval
Information Systems can be measured with two metrics: precision and recall. When a
user decides to search for information on a topic, the total database and the results to be
obtained can be divided into 4 categories:
1. Relevant and Retrieved
2. Relevant and Not Retrieved
3. Non-Relevant and Retrieved
4. Non-Relevant and Not Retrieved
Relevant items are those documents that help the user in answering his question. Non-relevant items are items that don't provide actually useful information. For each item there are two possibilities: it can be retrieved or not retrieved by the user's query.

Precision is defined as the ratio of the number of relevant and retrieved documents (the number of items retrieved that are actually useful to the user and match his search need) to the total number of documents retrieved by the query. Recall is defined as the ratio of the number of retrieved and relevant documents (the number of items retrieved that are relevant to the user and match his needs) to the number of possible relevant documents (the number of relevant documents in the database).

Precision measures one aspect of the information retrieval overhead for a user associated with a particular search. If a search has 85 percent precision, then 15 (100 − 85) percent of the user's effort is overhead spent reviewing non-relevant items. Recall measures to what extent a system processing a particular query is able to retrieve the relevant items the user is interested in seeing. Recall is a very useful concept, but its denominator is non-calculable in operational systems. If the system is made aware of the total set of relevant items in the database, recall becomes calculable.

Precision/recall trade-off
You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall! The converse is also true (usually): it's easy to get high precision for very low recall.


Consider an Information retrieval (IR) system returning relevant documents

Fig 1: IR system returning relevant documents

Precision and Recall explanation:

Consider,
I: an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
R ∩ A: the intersection of the sets R and A
|A|: the number of documents in the set A
|Ra|: the number of documents in the intersection of the sets R and A
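With this notation, the two measures can be written as follows (formulas reconstructed from the definitions above; they match the computation in the program of Practical No. 4):

Precision = |Ra| / |A|
Recall = |Ra| / |R|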



The goal is to achieve both high precision and high recall. The definition of precision and recall assumes that all docs in the set A have been examined. However, the user is not usually presented with all docs in the answer set A at once: the user sees a ranked set of documents and examines them starting from the top. Thus, precision and recall vary as the user proceeds with the examination of the set A. Most appropriate, then, is to plot a curve of precision versus recall.



If we proceed with our examination of the ranking generated, we can plot a curve of precision versus recall.

Thus, precision and recall have been extensively used to evaluate the retrieval performance of IR systems or algorithms. However, a more careful reflection reveals problems with these two measures. First, the proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection. Second, in many situations the use of a single measure could be more appropriate. Third, recall and precision measure the effectiveness over a set of queries processed in batch mode. Fourth, for systems which require a weak ordering, recall and precision might be inadequate.

Sample code in C++

• Code and Output: see Practical No. 4 below.



Conclusion: Implementation is concluded by executing a program to calculate precision
and recall for sample input with relevant documents Rq1 for query q1.

A. Write short answers to the following questions:

1. What are precision and recall in IR systems?
2. How are the recall and precision measures defined?

B. Viva Questions:

1. What is the relevance of a document?

2. What are the metrics to measure information systems?

3. How are precision and recall calculated for information systems?

4. What is the problem with these two measures?



Practical No.4

Title : Implement a program to calculate precision and recall for sample input. (Answer set A, Query q1,
Relevant documents to query q1- Rq1 )
Program:
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
#include <fstream>
using namespace std;
string left(const string s, const int w)
{ // Left aligns input string in table
stringstream ss, spaces;
int padding = w - s.size(); // count excess room to pad
for (int i = 0; i < padding; ++i)
spaces << " ";
ss << s << spaces.str() << '|'; // format with padding
return ss.str();
}
string center(const string s, const int w)
{ // center aligns input string in table
stringstream ss, spaces;
int padding = w - s.size(); // count excess room to pad
for (int i = 0; i < padding / 2; ++i)
spaces << " ";
ss << spaces.str() << s << spaces.str(); // format with padding
if (padding > 0 && padding % 2 != 0) // if odd #, add 1 space



ss << " ";
return ss.str();
}
string prd(float x, int decDigits, int width)
{ // right aligns float values with specified no. of precision digits in a table
stringstream ss;
ss << fixed << right;
ss.fill(' '); // fill space around displayed #
ss.width(width); // set width around displayed #
ss.precision(decDigits); // set # places after decimal
ss << x;
return ss.str();
}
string printDocs(string state[], int size)
{
// prints each document at a specific iteration inside the table
stringstream ss;
ss << '|' << ' ';
for (int i = 0; i < size; i++)
{ // convert the array into a string of comma seprated values
ss << state[i];
if (state[i].compare("") != 0 and i + 1 < size and state[i + 1].compare("") != 0)
ss << ',' << ' ';
}
return left(ss.str(), 98);
}
float E_value(float b, float rj, float pj)



{ // calculates E value
return 1 - (((1 + b * b) * rj * pj) / (b * b * pj + rj));
}
int main()
{ // Hardcoded Rq and A
string Rq[10] = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"};
string A[15] = {"d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25", "d38", "d48",
"d250", "d113", "d3"};
// Creating and opening output file
ofstream write("Recall_Precision_Evaluation_output.txt");
// required constants and arrays for calculations
float modRq = sizeof(Rq) / sizeof(Rq[0]);
string Ra[sizeof(A) / sizeof(A[0])];
float P[sizeof(A) / sizeof(A[0])];
float R[sizeof(A) / sizeof(A[0])];
float modRa = 0;
float modA = 0;
double precision;
double recall;
// table header formatting and printing
std::cout << setprecision(2) << fixed;
write << setprecision(2) << fixed;
std::cout << string(45 * 3 + 11, '-') << "\n";
write << string(45 * 3 + 11, '-') << "\n";
std::cout << '|' << center("Documents", 96) << " | "
<< center("|Ra|", 8) << " | "



<< center("|A|", 8) << " | "
<< center("Precision(%)", 5) << "|"
<< center("Recall(%)", 5) << " | " << endl;
write << '|' << center("Documents", 96) << " | "
<< center("|Ra|", 8) << " | "
<< center("|A|", 8) << " | "
<< center("Precision(%)", 5) << "|"
<< center("Recall(%)", 5) << " | " << endl;
std::cout << string(45 * 3 + 11, '-') << "\n";
write << string(45 * 3 + 11, '-') << "\n";
// Algorithm to calculate and print all the values in the output table, MAIN algo
for (int i = 0; i < sizeof(A) / sizeof(A[0]); i++)
{
Ra[i] = A[i];
modA++;
for (int j = 0; j < modRq; j++)
{
if (A[i] == Rq[j])
{
modRa++;
break;
}
}
precision = (modRa / modA) * 100;
P[i] = precision / 100;
recall = (modRa / modRq) * 100;
R[i] = recall / 100;
// Printing documents and other values of current iteration within the table
std::cout << printDocs(Ra, sizeof(Ra) / sizeof(Ra[0]));



write << printDocs(Ra, sizeof(Ra) / sizeof(Ra[0]));
std::cout << prd(modRa, 2, 10) << "|"
<< prd(modA, 2, 10) << "|"
<< prd(precision, 2, 13) << "|"
<< prd(recall, 2, 10) << "|"
<< endl;
write << prd(modRa, 2, 10) << "|"
<< prd(modA, 2, 10) << "|"
<< prd(precision, 2, 13) << "|"
<< prd(recall, 2, 10) << "|"
<< endl;
}
// closing the table
std::cout << string(45 * 3 + 11, '-') << "\n";
write << string(45 * 3 + 11, '-') << "\n";
// taking user input for calculation of Fj and Ej
int j;
do
{
std::cout << "Harmonic mean and E-value\nEnter value of j(0 - " << (sizeof(A) / sizeof(A[0])) - 1 << ") to
find F(j)and E(j):" << endl;
cin >> j;
} while (j < 0 || j >= (int)(sizeof(Ra) / sizeof(Ra[0]))); // reject out-of-range j
// calculating Harmonic mean and printing in table
float Fj = (2 * P[j] * R[j]) / (P[j] + R[j]);
std::cout << string(15 * 2 + 3, '-') << "\n"
<< "| Harmonic mean (F" << j << ") is: |" << Fj << " |\n"
<< string(15 * 2 + 3, '-') << "\n";



write << string(15 * 2 + 3, '-') << "\n"
<< "| Harmonic mean (F" << j << ") is: |" << Fj << " |\n"
<< string(15 * 2 + 3, '-') << "\n";
// table header
std::cout << string(15 * 2 + 4, '-') << "\n"
<< "|" << center("E-Value", 32) << "|\n"
<< string(15 * 2 + 4, '-') << "\n";
write << string(15 * 2 + 4, '-') << "\n"
<< "|" << center("E-Value", 32) << "|\n"
<< string(15 * 2 + 4, '-') << "\n";
// table header (sub columns)
std::cout << "|" << center("b>1", 10) << "|"
<< center("b=0", 10) << "|"
<< center("b<1", 10) << "|\n"
<< string(15 * 2 + 4, '-') << "\n";
write << "|" << center("b>1", 10) << "|"
<< center("b=0", 10) << "|"
<< center("b<1", 10) << "|\n"
<< string(15 * 2 + 4, '-') << "\n";
// Calculating and Printing E-Values in table
std::cout << "|" << prd(E_value(1.1, R[j], P[j]), 2, 10) << "|"
<< prd(E_value(0, R[j], P[j]), 2, 10) << "|"
<< prd(E_value(0.9, R[j], P[j]), 2, 10) << "|\n";
write << "|" << prd(E_value(1.1, R[j], P[j]), 2, 10) << "|"
<< prd(E_value(0, R[j], P[j]), 2, 10) << "|"
<< prd(E_value(0.9, R[j], P[j]), 2, 10) << "|\n";



// Closing table
std::cout << string(15 * 2 + 4, '-') << "\n";
write << string(15 * 2 + 4, '-') << "\n";
write.close();
return 0;
}

Output:
| Documents | |Ra| | |A| | Precision(%)|Recall(%) |

| d123 | 1.00| 1.00| 100.00| 10.00|


| d123, d84 | 1.00| 2.00| 50.00| 10.00|
| d123, d84, d56 | 2.00| 3.00| 66.67| 20.00|
| d123, d84, d56, d6 | 2.00| 4.00| 50.00| 20.00|
| d123, d84, d56, d6, d8 | 2.00| 5.00| 40.00| 20.00|
| d123, d84, d56, d6, d8, d9 | 3.00| 6.00| 50.00| 30.00|
| d123, d84, d56, d6, d8, d9, d511 | 3.00| 7.00| 42.86| 30.00|
| d123, d84, d56, d6, d8, d9, d511, d129 | 3.00| 8.00| 37.50| 30.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187 | 3.00| 9.00| 33.33| 30.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25 | 4.00| 10.00| 40.00| 40.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38 | 4.00| 11.00| 36.36| 40.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48 | 4.00| 12.00| 33.33| 40.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250 | 4.00| 13.00| 30.77| 40.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113 | 4.00| 14.00| 28.57| 40.00|
| d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3 | 5.00| 15.00| 33.33| 50.00|

Harmonic mean and E-value


Enter value of j(0 - 14) to find F(j)and E(j):
10

| Harmonic mean (F10) is: |0.38 |

| E-Value |

| b>1 | b=0 | b<1 |



| 0.62| 0.64| 0.62|



Assignment No. 5

Problem Statement:
Write a program to calculate harmonic mean (F-measure) and E-measure for the above example.

Objectives:

1. To evaluate the retrieval performance of IR systems

2. To understand the importance of the harmonic mean (F-measure) and E-measure in information retrieval
3. To study indexing structures for information retrieval.

Outcomes:
At the end of the assignment the students should have:
1. Understood how to calculate the harmonic mean (F-measure) and E-measure in information retrieval.
2. Understood the method to evaluate the retrieval performance of IR systems.

THEORY:
F-Score (F-Measure)
The F1 Score considers both precision and recall.
It is the harmonic mean (average) of precision and recall.
The F1 Score is high when there is some sort of balance between precision (p) and recall (r) in the system; conversely, the F1 Score is not high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 Score is 0.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Information systems can be measured with two metrics: precision and recall. Precision and recall have been extensively used to evaluate the retrieval performance of IR systems or algorithms. However, a more careful reflection reveals problems with these two measures: first, the proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection; second, in many situations the use of a single measure could be more appropriate; third, recall and precision measure the effectiveness over a set of queries processed in batch mode; fourth, for systems which require a weak ordering, recall and precision might be inadequate.



Alternative measures: The harmonic mean/F measure
The F-measure is also a single measure that combines recall and precision
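The formula, reconstructed here as the harmonic mean of r(j) and P(j) (consistent with the computation in Practical No. 4), is:

F(j) = 2 / (1/r(j) + 1/P(j)) = 2 · P(j) · r(j) / (P(j) + r(j))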

Where,
r ( j ) is the recall at the j-th position in the ranking
P ( j ) is the precision at the j-th position in the ranking
F ( j ) is the harmonic mean at the j-th position in the ranking

Determining the maximum value of F can be interpreted as an attempt to find the best possible compromise between recall and precision. The function F assumes values in the interval [0, 1]. It is 0 when no relevant documents have been retrieved and is 1 when all ranked documents are relevant. Further, the harmonic mean F assumes a high value only when both recall and precision are high. To maximize F requires finding the best possible compromise between recall and precision.

Alternative measures: E measure

The E measure, proposed by van Rijsbergen, also combines recall and precision. The user is allowed to specify whether he is more interested in recall or in precision. The E measure is defined as
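(formula reconstructed to match the E_value computation in Practical No. 4):

E(j) = 1 − ((1 + b²) · P(j) · r(j)) / (b² · P(j) + r(j))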

Where,
r ( j ) is the recall at the j-th position in the ranking
P ( j ) is the precision at the j-th position in the ranking
b ≥ 0 is a user specified parameter



E(j) is the E metric at the j-th position in the ranking.
If b = 1, the E(j) measure works as the complement of the harmonic mean F(j).
If b > 1, the user is more interested in precision than in recall.
If b < 1, the user is more interested in recall than in precision.
Notice that setting b = 1 in the formula of the E-measure yields F(j) = 1 − E(j).



The parameter b is specified by the user and reflects the relative importance of recall and precision:
If b = 0, E(j) = 1 − P(j): low values of b make E(j) a function of precision.
If b → ∞, lim(b→∞) E(j) = 1 − r(j): high values of b make E(j) a function of recall.
For b = 1, the E-measure becomes the complement of the F-measure.
Thus, single-value measures can also be stored in a table to provide a statistical summary. For instance, these summary statistics could include: the number of queries used in the task, the total number of documents retrieved by all queries, the total number of relevant docs retrieved by all queries, and the total number of relevant docs for all queries, as judged by the specialists.



Practical No.5

Sample code in C++

• Code and Output: see the program under Practical No. 4 above, which also computes the harmonic mean F(j) and the E-value E(j) for the same sample input.



Conclusion: Implementation is concluded by executing a program to calculate the harmonic mean (F-measure) and E-measure for the sample input used in the above example.

A. Write short answers to the following questions:

1. What are the harmonic mean (F-measure) and E-measure in IR systems?
2. How are the F-measure and E-measure calculated?

B. Viva Questions:

1. What is the difference between the F-measure and the E-measure?

2. What are the metrics to measure information systems?

3. What is the advantage of the F-measure and E-measure?



Assignment No. 6

Problem Statement:
Implement a program for feature extraction in 2D color images (any features like color, texture, etc.); extract features from an input image and plot a histogram for the features.

Objective:
1. To study a program for feature extraction in 2D colour images for features like colour and texture, and to plot a histogram.
2. The given feature extraction source code is implemented using the Java or Python language.
3. The input to the program is an image file that is to be modified by the program by changing colour.
Outcomes:
At the end of the assignment the students should have
1. Understood the feature extraction process and its applications
2. Apply appropriate tools in analyzing the web information.

Infrastructure: Desktop/ laptop system with Linux or its derivatives.


Software used: LINUX/ Windows OS/ Virtual Machine/ IOS, Python 3.9.12
Input: Image file
Output: Features of the image file
Theory:
Introduction:
Technology determines the types and amounts of information we can access. Currently, a large fraction of information originates in silicon. Cheap, fast chips and smart algorithms are helping digital data processing take over all sorts of information processing. Consequently, the volume of digital data surrounding us increases continuously. However, an information-centric society has additional requirements besides the availability and capability to process digital data. We should also be able to find the pieces of information relevant to a particular problem. Having the answer to a question but not being able to find it is equivalent to not having it at all. The increased volume of information and the wide variety of data types make finding information a challenging task. Current searching methods and algorithms are based on assumptions about technology and goals that seemed reasonable before the widespread use of computers.

The idea of the feature-extraction pattern is to map from a large, complex problem space into a small, simple feature space. The mapping represents the creative part of the solution; every type of application uses a different kind of mapping, and mapping into the feature space is also the hard part of this pattern. Traditional searching algorithms are not viable for problems typical of the information retrieval domain: since they were designed for exact matching, their use for similarity search is cumbersome. In contrast, feature extraction provides an elegant and efficient alternative. With information retrieval expanding into other fields, this pattern is applicable in a wide range of applications. We work with an alternative, simpler representation of the data. The representation contains some information that is unique to each data item. This computation is actually a function: it maps from the problem space into a feature space. For this reason, it is also called the feature extraction process.
Feature Extraction:
Texture is an important feature that identifies the object present in any image. The
texture is defined by the spatial distribution of pixels in the neighborhood of an image.
The gray level spatial dependency is represented by a two-dimensional matrix known as
GLCM, and it is used for texture analysis. The GLCM matrix specifies how often the
pairs of pixels with certain values occur in an image. The statistical measures are then
derived using the GLCM matrix. The textural features represent the spatial distribution of
gray tonal variations within a specified area. In images, the neighboring pixel is
correlated and spatial values are obtained by the redundancy between the neighboring
pixel values. The color features are represented by color histograms in six color spaces
namely RGB, HSV, LAB, CIE, HUE and OPP.
The textural features are considered for classifying the image. These textural
features are calculated in the spatial domain and a set of gray tone spatial dependency
matrix was computed. The textural features are computed using GLCM matrix in four
different orientation angles. The textural features are based on how one gray tone appears in a spatial relationship to another.



GRAY LEVEL CO-OCCURRENCE MATRIX (GLCM)
In statistical texture analysis, texture features are obtained from the distribution of intensities at specified positions relative to one another in an image. The statistics of texture are classified into first-order, second-order and higher-order statistics. Second-order statistical texture features are extracted using the Gray Level Co-occurrence Matrix (GLCM). A first-order texture measure is not related to pixel-neighbor relationships and is calculated from the original image. GLCM considers the relation between two pixels at a time, called the reference pixel and the neighbor pixel. A GLCM is defined by a matrix in which the number of rows and columns is equal to the number of gray levels G in the image. The matrix element P(i, j | Δx, Δy) is the relative frequency with which two pixels with intensities i and j occur, separated by a pixel distance (Δx, Δy). The different textural features such as energy, entropy, contrast, homogeneity, correlation, dissimilarity, inverse difference moment and maximum probability can be computed from the GLCM matrix, as the sketch below shows.
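As a sketch (scikit-image is assumed to be installed; versions before 0.19 spell these functions greycomatrix/greycoprops, and the image path is a placeholder):

import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.feature import graycomatrix, graycoprops

# quantize the image to 8-bit gray levels for the co-occurrence counts
gray = (rgb2gray(imread("flower.png")) * 255).astype(np.uint8)

# GLCM at pixel distance 1 in the four standard orientation angles
glcm = graycomatrix(gray, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=256, symmetric=True, normed=True)

# statistical texture measures derived from the matrix
for prop in ("contrast", "dissimilarity", "homogeneity", "energy", "correlation"):
    print(prop, graycoprops(glcm, prop).ravel())

# entropy is not built in, so compute it from the normalized matrix
entropy = -np.sum(glcm * np.log2(glcm + 1e-10))
print("entropy", entropy)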
Significance of Extracted Feature:
1. Color: It signifies object identification and extraction from a scene.
2. Brightness: Brightness is one of the most significant pixel characteristics. It should be used only for nonquantitative references to physiological sensations and perceptions of light.
3. Entropy: It characterizes the texture in an image.
4. Contrast: Contrast is the dissimilarity or difference between things.
5. Shape of image
6. Size of image
7. Owner, file name, file type etc.



Practical No.6

A. Importing an Image:
Importing an image in Python is easy. The following code will help you import an image in Python:
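A minimal sketch (assuming scikit-image and matplotlib are installed, and an image file such as flower.png in the working directory):

from skimage.io import imread
import matplotlib.pyplot as plt

image = imread("flower.png")  # read the image file into a numpy array
plt.imshow(image)             # display the image
plt.axis("off")
plt.show()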

B. Understanding the underlying data:


This image has several colors and many pixels.
1. To visualize how this image is stored, think of every pixel as a cell in matrix.
2. Now this cell contains three different intensity values, catering to the colors Red, Green and Blue. So an RGB image becomes a 3-D matrix.



3. Each number is the intensity of the Red, Green and Blue colors (the quick check below prints these values).
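As a quick check (assuming the image variable loaded in section A above):

print(image.shape)  # e.g. (height, width, 3) for an RGB image
print(image[0, 0])  # the [R, G, B] intensities of the top-left pixel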



C. Converting Images to a 2-D matrix:-
1. Handling the third dimension of images sometimes can be complex and redundant.
2. In feature extraction, it becomes much simpler if we compress the image to a 2-D
matrix.

3. This is done by gray-scaling; the combined example below shows how to convert an RGB image to grayscale.

Now let's try to binarize this grayscale image; the binarization step also appears in the example below.


Blurring an Image:-
The last part of this assignment is most relevant for feature extraction: the blurring of images.



Example:-
import matplotlib.pyplot as plt
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu, gaussian

def show_images(images, titles):
    # helper: display the images side by side with their titles
    fig, axes = plt.subplots(1, len(images), squeeze=False)
    for ax, img, title in zip(axes[0], images, titles):
        ax.imshow(img, cmap='gray')
        ax.set_title(title)
        ax.axis('off')
    plt.show()

image = imread(r"C:\Users\Tavish\Desktop\7.jpg")
show_images(images=[image], titles=["Original"])

# isolate colour intensities by zeroing the other channels
red, yellow = image.copy(), image.copy()
red[:, :, (1, 2)] = 0  # keep only the red channel
yellow[:, :, 2] = 0    # keep red + green, which appears yellow
show_images(images=[red, yellow], titles=['Red Intensity', 'Yellow Intensity'])

# convert the RGB image to a 2-D grayscale matrix
gray_image = rgb2gray(image)
show_images(images=[image, gray_image], titles=["Color", "Grayscale"])
print("Colored image shape:", image.shape)
print("Grayscale image shape:", gray_image.shape)

# binarize the grayscale image using Otsu's threshold
thresh = threshold_otsu(gray_image)
binary = gray_image > thresh
show_images(images=[gray_image, binary], titles=["Grayscale", "Otsu Binary"])

# blur the grayscale image with a wide Gaussian kernel
blurred_image = gaussian(gray_image, sigma=20)
show_images(images=[gray_image, blurred_image], titles=["Gray Image", "20 Sigma Blur"])
Second Part of the Assignment – Plotting the Histogram
A histogram is a graphical representation showing how frequently various colour values occur in the image.
Steps:-
Importing image data:-
import matplotlib.pyplot as plt  # importing matplotlib
# a PNG file is used here because matplotlib natively reads only PNG images;
# other formats require the Pillow library
img = plt.imread('flower.png')  # reads image data



Histogram creation using a numpy array:-
 To create a histogram of our image data, we use the hist() function.
 plt.hist(img.ravel(), bins=256, range=(0.0, 1.0), fc='k', ec='k')
# calculating the histogram (img.ravel() flattens the 2-D data to 1-D)



Histogram Calculation:-
 Here, we use cv2.calcHist() (an in-built function in OpenCV) to find the histogram.
 cv2.calcHist(images, channels, mask, histSize, ranges[, hist[, accumulate]])
images: the source image, of type uint8 or float32, passed as “[img]”.
channels: the index of the channel for which we calculate the histogram. For a grayscale image, its value is [0]; for a color image, you can pass [0], [1] or [2] to calculate the histogram of the blue, green or red channel respectively.
mask: mask image. To find the histogram of the full image, it is given as “None”.
histSize: this represents our BIN count. For full scale, we pass [256].
ranges: this is our RANGE. Normally, it is [0, 256].
Example:-
# load an image in grayscale mode
img = cv2.imread('ex.jpg',0)
# calculate frequency of pixels in range 0-255
histg = cv2.calcHist([img],[0],None,[256],[0,256])
Then, we need to plot the histogram to show the characteristics of the image.
Plotting Histograms
Analysis using Matplotlib:
# importing required libraries of opencv
import cv2
# importing library for plotting
from matplotlib import pyplot as plt

# reads an input image in grayscale mode
img = cv2.imread('ex.jpg', 0)
# find frequency of pixels in range 0-255
histr = cv2.calcHist([img], [0], None, [256], [0, 256])
# show the plotting graph of an image
plt.plot(histr)
plt.show()



Input:

Output:



Algorithm
1. Open the colored 2D bitmap file in binary mode.
2. Read the header structure.
3. Extract the various features.
4. Print the values of the features (a minimal sketch follows).
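A minimal sketch of these steps, assuming a standard uncompressed bitmap with a BITMAPINFOHEADER and a placeholder file name:

import struct

with open("sample.bmp", "rb") as f:      # placeholder path
    header = f.read(54)  # 14-byte file header + 40-byte info header

# fixed byte offsets defined by the BMP format
size = struct.unpack_from("<I", header, 2)[0]     # file size in bytes
width = struct.unpack_from("<i", header, 18)[0]   # image width in pixels
height = struct.unpack_from("<i", header, 22)[0]  # image height in pixels
bpp = struct.unpack_from("<H", header, 28)[0]     # color depth (bits per pixel)

print("File size :", size, "bytes")
print("Width     :", width, "px")
print("Height    :", height, "px")
print("Bit depth :", bpp, "bits/pixel")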

Conclusion: Thus, we have successfully implemented a program for feature extraction in 2D colour images and plotted a histogram for the features.

A. Write short answers of the following questions:

1. How are images indexed?

2. Explain how color is extracted from an image.

3. What is multimedia IR? Discuss the steps on which data retrieval relies.

B. Viva Questions:

1. What is the use of image features?

2. Enlist some of the features of an image and their applications.

3. How do you compare two images and calculate the relevancy?



Assignment No. 7

Problem Statement:
Build the web crawler to pull product information and links from an e-commerce website. (Python)

Objective: -

To understand the working of web crawler and implement it

Outcomes:

At the end of the assignment the students should have

1. Understood how web crawler works

Infrastructure: Desktop/ laptop system with Linux or its derivatives.

Software used: LINUX/ Windows OS/ Virtual Machine/ IOS/C/C++/Java

Theory:

Search Engines

A program that searches documents for specified keywords and returns a list of the
documents where the keywords were found is a search engine. Although search engine is
really a general class of programs, the term is often used to specifically describe systems
like Google, AltaVista and Excite that enable users to search for documents on the World
Wide Web and USENET newsgroups.
Typically, a search engine works by sending out a spider to fetch as many documents as

possible. Another program, called an indexer, then reads these documents and creates an

index based on the words contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query. Search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:

They search the Internet, based on important words.


They keep an index of the words they find, and where they find them.

They allow users to look for words or combinations of words found in that index.

Fig. 1 shows a general search engine architecture. Every engine relies on a crawler module to provide the grist for its operation. Crawlers are small programs that browse the Web on the search engine's behalf, similar to how a human user would follow links to reach different pages. The programs are given a starting set of URLs, whose pages they retrieve from the Web. The crawlers extract URLs appearing in the retrieved pages, and give this information to the crawler control module. This module determines what links to visit next, and feeds the links to visit back to the crawlers. (Some of the functionality of the crawler control module may be implemented by the crawlers themselves.) The crawlers also pass the retrieved pages into a page repository. Crawlers continue visiting the Web until local resources, such as storage, are exhausted.



Fig: General search engine architecture
Web Crawlers
Web crawlers are programs that exploit the graph structure of the Web to move from page
to page. It may be observed that the noun 'crawler' is not indicative of the speed of these
programs, as they can be considerably fast. A key motivation for designing Web crawlers

has been to retrieve Web pages and add them or their representations to a local repository.
Such a repository may then serve particular application needs such as those of a Web
search engine. In its simplest form, a crawler starts from a seed page and then uses the
external links within it to attend to other pages. The crawler is the means by which WebCrawler collects pages from the Web. It operates by iteratively downloading a web page, processing it, and following the links in that page to other Web pages, perhaps on other servers.

The end result of crawling is a collection of Web pages, HTML or plain text, stored at a central location.
In a more traditional IR system, the documents to be indexed are available locally in a database or file system. WebCrawler's first information retrieval system was based on Salton's vector-space retrieval model. In the vector-space model, the queries and documents represent vectors in a high-dimensional word space. The component of the vector in a
particular dimension is the significance of the word to the document. For example, if
a particular word is very significant to a document, the component of the vector
along that word's axis would be strong. In this vector space, then, the task of
querying becomes that of determining what document vectors are most similar to the
query vector. Practically speaking, this task amounts to comparing the query vector,
component by component, to all the document vectors that have a word in common
with the query vector. WebCrawler determined a similarity number for each of these
comparisons that formed the basis of the relevance score returned to the user.
WebCrawler's first IR system had three pieces: a query processing module, an inverted full-text index, and a metadata store. The query processing module parses the searcher's query, looks up the words in the inverted index, forms the result list, looks up the metadata for each result, and builds the HTML for the result page. The query processing module used a series of data structures and algorithms to generate results for a given query. First, this module put the query in a canonical form, and parsed each space-separated word in the query. If necessary, each word was converted to its singular form using a modified Porter stemming algorithm, and all words were filtered through the stop list to obtain the final list of words. Finally, the query processor looked up each word in the dictionary, and ordered the list of words for optimal query execution. WebCrawler's key contribution to distributed systems is to show that a reliable, scalable, and responsive system can be built using simple techniques for handling distribution, load balancing, and fault tolerance.
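A toy Python sketch of vector-space ranking (the documents and query are made up) illustrates the idea:

from collections import Counter
import math

docs = ["web crawler retrieves web pages",
        "search engine indexes documents",
        "crawler follows links to pages"]
query = "web crawler pages"

def vectorize(text):
    # term-frequency vector of the text
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

qv = vectorize(query)
for d in sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True):
    print(round(cosine(qv, vectorize(d)), 3), d)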
Robot Exclusion
The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or
robots.txt protocol, is a convention to prevent co-operating web crawlers and other
web robots from accessing all or a part of a website which is otherwise publicly
viewable. Robots are often used by search engines to categorize and archive web
sites, or by webmasters to proofread source code.

The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites. A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This
might be, for example, out of preference for privacy from search engine results, or
the belief that the content of the selected directories might be misleading or irrelevant
to the categorization of the site as a whole, or out of desire that an application only
operates on certain data. A person may not want certain pages indexed. Crawlers
should obey the Robot Exclusion Protocol.
The Robots Exclusion Protocol (REP) is a very simple but powerful mechanism
available to webmasters and SEOs alike. Perhaps it is the simplicity of the file that
means it is often overlooked and often the cause of one or more critical SEO issues.
To this end, we have attempted to pull together tricks, tips and examples to assist
with the implementation and management of your robots.txt file. As many of the
non-standard REP declarations supported by Google, Yahoo and Bing may change,
we will be providing updates to this in the future. The robots.txt file defines the
Robots Exclusion Protocol (REP) for a website. The file defines directives that
exclude Web robots from directories or files per website host. The robots.txt file
defines crawling directives, not indexing directives. Good Web robots adhere to
directives in your robots.txt file. Bad Web robots may not. Do not rely on the
robots.txt file to protect private or sensitive data from search engines. The robots.txt file is publicly accessible, so do not include any files or folders that may contain business-critical information. For example:
 Website analytics folders (/webstats/, /stats/ etc.)
 Test or development areas (/test/, /dev/)
 XML Sitemap element, if your URL structure contains vital taxonomy
If a URL redirects to a URL that is blocked by a robots.txt file, the first URL will
be reported as being blocked by robots.txt in Google Webmaster Tools. Search
engines may cache your robots.txt file (for example, Google may cache your robots.txt file for 24 hours). When deploying a new website from a development
environment, always check the robots.txt file to ensure no key directories are
excluded. Excluding files using robots.txt may not save the crawl budget from the
same crawl session. For example: if Google cannot access a number of files it may
not crawl other files in their place. URLs excluded by REP (Robots Exclusion
Protocol) may still appear in a search engine index.
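As a sketch, a polite crawler can check these directives with Python's standard-library urllib.robotparser (the site and paths below are hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()  # fetch and parse the robots.txt file

url = "https://www.example.com/dev/test-page.html"
if rp.can_fetch("MyCrawler", url):  # "MyCrawler" is an assumed user-agent name
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)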
Program Implementation: Code to implement the same, with appropriate output, is given in Practical No. 7.

Algorithm
1. Make a user interface
2. Input the URL of any website
3. Establish an HTTP connection
4. Read the HTML page source code
5. Extract the hyperlinks of the HTML page
6. Display the list of hyperlinks on the same page

Conclusion: Implementation is concluded by stating the basic working of the web crawler.

A. Write short answers of the following questions:

1. What are the crawler architectures?

2. Explain the Harvest architecture.

3. Explain the working of the Google crawler.

4. Explain the challenges involved in searching the web.

5. Explain meta-searches with examples.

B. Viva Questions:



1. What is robots.txt?
2. What is the significance of robots.txt?
3. What are the strategies used by a crawler?
4. What is PageRank?
5. What is the significance of the damping factor?



Practical No.7

Title: Build the web crawler to pull product information and links from an e-commerce website. (Python)
Program (a sample implementation in Java is given first; a Python sketch follows the listing):
import java.net.*;
import java.io.*;

public class Crawler {
    public static void main(String[] args) throws Exception {
        String urls[] = new String[1000];
        String url = "https://www.cricbuzz.com/live-cricket-scores/20307/aus-vs-ind-3rd-odi-india-tour-of-australia-2018-19";
        int i = 0, j = 0, tmp = 0, total = 0, MAX = 1000;
        int start = 0, end = 0;
        String webpage = Web.getWeb(url);
        end = webpage.indexOf("<body");
        for (i = total; i < MAX; i++, total++) {
            start = webpage.indexOf("http://", end);
            if (start == -1) {
                // no more links in this page: fetch the next stored URL
                start = 0;
                end = 0;
                try {
                    webpage = Web.getWeb(urls[j++]);
                } catch (Exception e) {
                    System.out.println("******************");
                    System.out.println(urls[j - 1]);
                    System.out.println("Exception caught \n" + e);
                }
                /* logic to fetch urls out of the body of the webpage only */
                end = webpage.indexOf("<body");
                if (end == -1)
                    end = start = 0;
                continue;
            }
            // the link ends at the first quote character after it
            end = webpage.indexOf("\"", start);
            tmp = webpage.indexOf("'", start);
            if (tmp < end && tmp != -1) {
                end = tmp;
            }
            url = webpage.substring(start, end);
            urls[i] = url;
            System.out.println(urls[i]);
        }
        System.out.println("Total URLS Fetched are " + total);
    }
}

/* This class contains a static function which will fetch the webpage
   of the given url and return it as a string */
class Web {
    public static String getWeb(String address) throws Exception {
        String webpage = "";
        String inputLine = "";
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        while ((inputLine = in.readLine()) != null)
            webpage += inputLine;
        in.close();
        return webpage;
    }
}
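Since the problem statement asks for Python, an illustrative sketch using the requests and BeautifulSoup libraries is given below. The URL and the CSS selectors are placeholders: every e-commerce site uses different markup, so inspect the target page and adjust the selectors accordingly (and honor the site's robots.txt and terms of use).

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example.com/products"  # hypothetical listing page

def crawl_products(url):
    # fetch the page and parse its HTML
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    products = []
    # assumed markup: each product sits in a container with class "product"
    for card in soup.select(".product"):
        name = card.select_one(".product-title")
        price = card.select_one(".price")
        link = card.select_one("a")
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
            "link": link["href"] if link and link.has_attr("href") else None,
        })
    return products

if __name__ == "__main__":
    for item in crawl_products(BASE_URL):
        print(item)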



Assignment No. 8

Problem Statement:
Write a program to find the live weather report (temperature, wind speed, description, and
weather) of a given city. (Python).

Objective: -
1. To Get Weather Information using Python
2. To evaluate the performance of the IR system and understand user interfaces for
searching.
3. To understand information sharing on the web

Outcomes:

At the end of the assignment the students should have

1. Understood and implemented the program to find the live weather report using
python.

Infrastructure: Desktop/ laptop system with Linux or its derivatives.

Software used: LINUX/ Windows OS/ Virtual Machine/ IOS/Python 3.9

Theory:

Weather Information using Python

Python is a growing language that has become increasingly popular in the


programming community. One of Python’s key features is that it’s easy to work with
APIs on websites and many weather entities have their API which you can access with
just a couple of lines of code. One such great API is OpenWeatherMap's API; with it you can build a small program to access any given location's weather forecast anywhere across the globe. This assignment will help you to get weather information using Python.



What is OpenWeatherMap?

The OpenWeatherMap (OWM) is a helpful and free way to gather and display weather
information. Because it’s an open-source project, it’s also free to use and modify in
any way. OWM offers a variety of features and is very flexible. Because of these
qualities, it offers a lot of benefits to developers. One of the major benefits of OWM is that it's ready to go. Unlike other weather applications, OWM has a web API that's ready to use. You don't have to install any software or set up a database to get it up and running. This is a great option for developers who want to get weather readings on a website quickly and efficiently.

It has an API that supports HTML, XML, and JSON endpoints. Current weather information, extended forecasts, and graphical maps can be requested by users. These maps show cloud cover, wind speed, pressure, and precipitation.

Conclusion: Thus, we have successfully implemented a program to find the live weather report
(temperature, wind speed, description, and weather) of a given city using Python.



Practical No.8

Code Snippets

Sample Solution:

Python Code:
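A minimal sketch using OpenWeatherMap's current-weather endpoint is given below; the API key is a placeholder and must be replaced with your own free key from openweathermap.org.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: obtain a free key from openweathermap.org
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

city = input("Enter the city: ")
params = {"q": city, "appid": API_KEY, "units": "metric"}  # metric = Celsius, m/s
response = requests.get(BASE_URL, params=params, timeout=10)
data = response.json()

if response.status_code == 200:
    print(f"{city}'s temperature: {data['main']['temp']}°C")
    print("Wind speed:", data["wind"]["speed"], "m/s")
    print("Description:", data["weather"][0]["description"])
    print("Weather:", data["weather"][0]["main"])
else:
    print("Could not fetch weather:", data.get("message", "unknown error"))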

Sample Output:

Enter the city: Brazil


Brazil's temperature: 16.45°C
Wind speed: 2.1 m/s
Description: clear sky
Weather: Clear



Assignment No. 9

Problem Statement:

Case study on recommender system for a product / Doctor / Product price / Music

Objective: -
1. To study recommender systems

Outcomes:

At the end of the assignment the students should have


1. Understood the concept of a collaborative recommender system

Theory:

(Do a study of a collaborative or content-based recommender system)

The e-commerce domain has seen enormous growth in online platforms in recent years. Product recommendation is extremely complex: the number of possible user-product combinations is so large that it can be overwhelming and extremely difficult to compute recommendations directly.

The paradigms of machine learning and natural language processing come into the picture in achieving this goal of product recommendation. Through these approaches, products can be effectively reviewed and assessed for their potential to be recommended to a particular user, whether the item is a product, a doctor, a product price, or music. A toy sketch of collaborative filtering follows.
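A toy user-based collaborative-filtering sketch is given below; the ratings matrix is entirely made up. The core computation is to find users similar to the target user and predict scores for the products the target has not rated yet.

import numpy as np

# rows = users, columns = products; 0 means "not yet rated" (hypothetical data)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return (u @ v) / denom if denom else 0.0

target = 0  # recommend for the first user
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0  # ignore self-similarity

# predicted score = similarity-weighted average of the other users' ratings
for item in np.where(ratings[target] == 0)[0]:
    raters = ratings[:, item] > 0
    weight = sims[raters].sum()
    score = (sims[raters] @ ratings[raters, item]) / weight if weight else 0.0
    print(f"Predicted rating of user {target} for product {item}: {score:.2f}")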



Fig: Recommendation system block diagram



Conclusion: Thus, we have studied the collaborative recommender system.

A. Write short answers of the following questions:

1. Explain Collaborative Filtering Recommendation of Documents and products.

2. Explain Content Based Filtering Recommendation of Documents and products.

B. Viva Questions:

1. What are recommendation systems?

2. How are recommendation systems classified?
