Indexing and Abstracting
Course Overview
The present era has been described as the Information Age. This is
because of the belief that 21st-century society is a
knowledge-based one in which data, information and knowledge are
integral to the existence of the human race. Information and the
ability to retrieve, select, evaluate, process and use it are central to the
survival and success of individuals, groups, organizations and
communities (Rowley and Hartley, 2008). Currently individuals,
organizations and communities are exposed to more information,
transmitted from a wider range of sources through a broader range of
channels, many of which possess faster response and
turnaround times.
The Phenomenal Growth of Information
• In the past thirty to forty years the world has undergone dramatic changes
in technology which have affected the way information is handled.
• Our era has been dubbed the Information Age because of the tremendous
amount of information that is generated on a daily basis.
• Today one can buy a CD-ROM that contains the texts of about 2,000
books; photographs and movies are now stored on desktop
computers; and some scholarly journals appear in electronic format
only (Cleveland and Cleveland, 1983).
• The value of information has been recognized since early times. For
example, books in medieval libraries were chained to reading stalls to
prevent people from taking them away and so depriving other people
of the information content of the books.
• As far back as 1979, the information business was reckoned as a 25 billion
dollar industry in the USA. In the UK the information industry was
reckoned to be worth 20 billion dollars or 5% of Gross Domestic Product (GDP).
The information business in the USA is now referred to as the knowledge
industry, with components including advertising, book and magazine
publishing, computers, research, government activities, libraries, and radio
and television broadcasting.
• A special characteristic of information is that it is neither scarce nor
depleting. In fact, the more it is used and manipulated, the more
information is generated.
Origins of Information Retrieval
• Information retrieval began with ancient Greek and Roman scholars who
began producing large works containing different types of information.
Having the good sense to know that people might not be interested in
reading everything contained in a book, they sought a way of organizing it so
that a reader could retrieve only what he/she needed. One of the first things
they did in this direction was to provide a table of contents. For example,
Pliny the Elder wrote what he called ‘The Natural History’ in 37 volumes.
It was a sort of encyclopedia. In order to make the work user-friendly, the
first volume acted as the table of contents, listing volume by volume all the
subjects that had been treated in the subsequent volumes. This acted as a
directional aid to help identify where in each volume a particular subject
was located.
• Alphabetization was another method used to facilitate the retrieval of
information. It was devised by the Greek scholars in the third century
before the birth of Christ (C3rd BC) at the library of Alexandria in Egypt.
Definition and Functions of Information Retrieval
• Information
• Documents
• Document Surrogates
Information
• This is the primary theoretical concept that underlies the discipline of
library and information science. It is at the heart of the information
profession. Information professionals deal with the collection, description,
classification, storage, retrieval and dissemination of information upon
request. It is very difficult to arrive at a universal definition of information.
At best what might be done is to identify some of its properties.
Properties of Information
• True or false
• Current or old
• Raw or processed
• Valuable or worthless
• Confidential or open
• Private or public
• It may exist in large or small quantities.
• It may also exist in a wide variety of media.
Documents
• Documents may be classified into groups such as textual and non-textual,
or published and unpublished. Some may be on paper; others on film,
diskette etc. Sometimes those on paper may be transferred onto other
media, e.g. a paper document may be put on a PC to prevent the loss of
the information contained in it as a result of deterioration. I am sure you
have learned about how paper documents deteriorate in your
Preservation course.
Document Surrogates
Author Index
• This is made up of a series of alphabetically arranged names of individual
authors; corporate authors; government agencies; organizations; and
institutions including universities. Each entry will have either a class
number or call number or accession number by which we can locate
information in the index.
Types of Indexes.contd
Subject Index
Computer databases
• Computer databases have three components:
• Reference databases
- Bibliographic databases: Their records consist of the name of the author of the
document, place and date of publication, title, publisher, and descriptors or access
terms. Sometimes they also contain abstracts of the documents. An example of a
bibliographic database is LISA (Library and Information Science Abstracts).
• Source databases
- Numerical databases: These provide factual, statistical and survey data
among other kinds of information in the areas of business, economics,
industry etc. An example is FAME (Financial Analysis Made Easy).
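The fields listed above for a bibliographic database can be pictured as one record per document. The sketch below is only an illustration: the class name, the field names and the sample record are all hypothetical and do not reproduce the format of LISA or any other real database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BibliographicRecord:
    """One entry in a hypothetical bibliographic database."""
    author: str
    title: str
    place: str
    date: str
    publisher: str
    descriptors: List[str] = field(default_factory=list)  # access terms
    abstract: Optional[str] = None  # present only in some records

    def has_descriptor(self, term: str) -> bool:
        # Case-insensitive match against the assigned access terms.
        return term.lower() in (d.lower() for d in self.descriptors)


# Invented sample record, used only to exercise the structure.
record = BibliographicRecord(
    author="Example, A.",
    title="An example title",
    place="Accra",
    date="2001",
    publisher="Example Press",
    descriptors=["Indexing", "Abstracting"],
)
print(record.has_descriptor("indexing"))
```

Keeping the descriptors separate from the title mirrors the point made above: the access terms, not the wording of the document, are what the searcher matches against.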
• SOUGHT TERM: This is what a searcher is looking for, when using an index.
It may be one word or it may be a phrase, for example, ‘psychology’ or
‘clinical psychology’ etc. The end result of indexing invariably is to provide
a term that may be used to gain easy access to documents or information
contained in a database. Thus the sought term is affected by the
parameters that we are about to look at. The parameters are exhaustivity,
specificity, recall, precision and fallout. Now let us look at what they are
and how they affect the indexing system.
EXHAUSTIVITY
• It is the extent to which the indexing system allows for the analysis of the
content of a document to its barest minimum. That is how fully the
subject matter of a document has been represented in the index. In order
to achieve this objective the indexer has to select as many keywords as
possible to represent the author’s ideas in the document.
SPECIFICITY
• This refers to the extent to which the indexing system allows for precision
when searching for information within the index, that is, how broad or
specific the terms or keywords selected in a particular situation are. For
example, ‘Orange’ is a more specific or precise term to use for a search
than ‘Citrus fruits’ when an information seeker is searching for
information on oranges.
RECALL
• Fallout ratio is the last of the parameters used to measure the efficiency of
the indexing system. It reflects the ability of the system to suppress, that is
not to retrieve, irrelevant items; a lower ratio indicates better suppression.
It is reckoned as the total irrelevant items retrieved over the total
irrelevant items in the system, multiplied by a hundred.
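The arithmetic behind these measures can be sketched in a few lines. The sketch assumes the common textbook definitions, which are not all spelled out in these notes: recall is relevant items retrieved over all relevant items, precision is relevant items retrieved over all items retrieved, and fallout is irrelevant items retrieved over all irrelevant items in the collection, each multiplied by a hundred. The function name and the ten-document collection are invented for illustration.

```python
def retrieval_measures(retrieved, relevant, collection):
    """Return recall, precision and fallout as percentages."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    irrelevant = collection - relevant
    relevant_retrieved = retrieved & relevant
    irrelevant_retrieved = retrieved & irrelevant

    def percent(part, whole):
        # Guard against an empty denominator.
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "recall": percent(relevant_retrieved, relevant),
        "precision": percent(relevant_retrieved, retrieved),
        "fallout": percent(irrelevant_retrieved, irrelevant),
    }


# Invented collection of ten documents, four of them relevant to the query.
collection = {f"doc{i}" for i in range(1, 11)}
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc5"}

measures = retrieval_measures(retrieved, relevant, collection)
for name, value in measures.items():
    print(f"{name}: {value:.1f}%")
```

In this invented search two of the four relevant documents are found, and one of the six irrelevant documents slips through, so recall is 2/4 and fallout is 1/6 before multiplying by a hundred.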
SUBJECT INDEXING
• Subject indexing is the creation of indexes from the conceptual analysis of
the contents of documents.
• Most users of information approach information sources with queries
concerning a particular subject or topic. Hence indexers analyze the
intellectual contents of documents in order to represent them in the index
and so facilitate easy access to the relevant document or information.
Indexing and subject analysis
Summarization
• In contrast to depth indexing, this is the policy where only dominant
overall themes are recognized for the purpose of indexing. In other words,
summarization of a document is the expression of the total contents of
the document by a brief description, e.g. for a document that discusses
psychology, ‘Psychology’ will be recognized as the overall theme or concept.
Sub-themes such as ‘clinical psychology’, ‘abnormal psychology’, ‘child
psychology’ and ‘industrial psychology’ will not be recognized separately.
Identifying indexable concepts
There are questions the indexer has to pose to help him identify concepts.
Some of these are:
• To what extent is the document about a particular subject?
• Is there enough information about this particular concept in the
documents?
• Would the user searching for information on this concept be satisfied with
this document?
• Is there any possibility that the concept will feature in a search query?
Guidelines
An indexer may have guidelines to help him make decisions about which
concepts to include in the index and which ones to exclude from it,
e.g. Commonwealth Agricultural Bureaux International (CAB International) has
the following guidelines about concepts that should be indexed:
• Organisms e.g. snakes, tigers
• Geographical Locations e.g. Kumasi, Tamale, UK
• All relevant concepts like techniques, behaviour
• Bibliographical terms like conferences, books, theses etc.
The indexing process
Familiarization
• This is the first step in the indexing process. At this level, the indexer
wants to have an understanding of the overall purpose of the author of
the document. He will have that understanding by making a survey of the
entire contents of the document.
Analysis
• The third stage of the indexing process is a very critical one. This is
because it is at this stage that the indexer must select terms which must
agree with the concepts that have been identified at the analysis stage. It
is at this stage that the indexing language is applied. This stage is critical
also because in assigning labels a number of problems arise.
Another problem is the occurrence of homographs. These are words that are
spelt the same but have different meanings e.g.
• ‘minute’ which may mean time as in 30 minutes or small as in amount or
size
• ‘score’ as in the results of a game or as in music which is the written notes
• ‘crane’ as in the equipment used to lift heavy objects or the bird.
Nouns
These inflect for number, so one is faced with which form to use, whether
singular or plural form. The rules say that the plural form should be used for
count nouns e.g. pencils instead of pencil.
Problems associated with translation of concepts. contd
• Composite Subjects
These may have the same components or concepts. For example, ‘assessment
of lecturers by students’ and ‘the assessment of students by lecturers’ have
the same components or concepts. In such circumstance citation order is
used to show differences in meaning. This involves the syntactic relationship
that exists between words; that is the position of occurrence of the words in
the sentence may provide the meaning.
The Searching Process
• The searching process follows the same steps as the indexing process, i.e.
familiarization, analysis and translation.
• It is important for the searcher to have a clear view of the objective of the
search otherwise the search may end in dissatisfaction with the search
results.
• The next stage in the searching process is to analyze the concepts in the
search query (reference question, information need).
Translation
To translate the concepts in a search profile means to match the concepts
identified in the analysis of the search query with the thesaurus, classification
scheme or list of subject headings (indexing language) that has been used to
index documents in the collection to be searched. If the terms match, then
good results may be produced. Successful translation depends a lot on the
support that has been provided in the system being searched.
INDEXING LANGUAGES
• Natural language may be used in cases where the search involves specific
words or phrases that are known to have been used in the document.
Natural language is, therefore, used for unique brand names or company
names. It is also used for slogans or expressions for which there is no
controlled language equivalent, e.g. Hasmal, and where
geographical labels have been used in the document.
Controlled Indexing Language
• Scorpion is another system that uses DDC to index and catalogue internet
resources. It is a project of the OCLC Office of Research.
• CyberStacks is also a collection of selected digital resources that uses the
Library of Congress Classification Scheme for retrieval purposes.
• The Library of Congress list, on the other hand, is the most important of the lists of
subject headings. It covers all known areas of knowledge in the world.
• Like classification schemes lists of subject headings are also now used to
organize information resources on the web. Examples include INFOMINE and
Scout Report.
INFOMINE
• The first set of devices is USE and UF (use for). These are used to solve the
problems of synonymy. They show Equivalence relationships.
• The term ‘use’ indicates that a particular term should be preferred to the
other e.g. ‘espionage’ and ‘spying’. Depending on the users of the index,
the indexer will determine which of the terms is more likely to be used in a
search query.
Thesauri.contd
MURDER
USE HOMICIDE
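A USE reference behaves like a lookup from an unauthorized term to its authorized equivalent. The following is a minimal sketch of that idea, not a description of any real thesaurus software; the table and function names are hypothetical, and the two entries are taken from the examples in these notes.

```python
# Unauthorized term -> authorized descriptor (the USE references).
USE_REFERENCES = {
    "MURDER": "HOMICIDE",
    "CORN SHUCKING": "CORN HUSKING",
}


def preferred_term(term):
    """Follow a USE reference if one exists, else keep the term as given."""
    key = term.strip().upper()
    return USE_REFERENCES.get(key, key)


print(preferred_term("murder"))
print(preferred_term("Homicide"))
```

A term that has no USE reference is returned unchanged, on the assumption that it is already an authorized descriptor.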
• It is important to note that the USE, UF, BT, NT and RT connections in the thesaurus
reciprocate each other. Therefore a corresponding complementary link must
appear at a different point in the thesaurus. For example:
Corn Shucking (unauthorized term)
USE CORN HUSKING (link from unauthorized term to authorized term)
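The reciprocity requirement lends itself to a mechanical check. The sketch below assumes a thesaurus held as a dictionary of terms, each with its own dictionary of relations; that layout and the function name are assumptions made for illustration. It lists every complementary link that is missing.

```python
# Each relation paired with the relation that must appear at the other term.
RECIPROCAL = {"USE": "UF", "UF": "USE", "BT": "NT", "NT": "BT", "RT": "RT"}


def missing_reciprocals(thesaurus):
    """Return (term, relation, target) links that still need to be added."""
    missing = []
    for term, links in thesaurus.items():
        for relation, targets in links.items():
            inverse = RECIPROCAL[relation]
            for target in targets:
                if term not in thesaurus.get(target, {}).get(inverse, []):
                    missing.append((target, inverse, term))
    return sorted(missing)


# Only the USE link has been recorded so far.
thesaurus = {
    "Corn Shucking": {"USE": ["CORN HUSKING"]},
    "CORN HUSKING": {},
}
missing = missing_reciprocals(thesaurus)
print(missing)

# Adding the complementary UF link makes the pair consistent.
thesaurus["CORN HUSKING"]["UF"] = ["Corn Shucking"]
print(missing_reciprocals(thesaurus))
```

The first call reports that CORN HUSKING still needs a UF link back to Corn Shucking; after that link is added nothing is reported.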
• STEP 1
• Skim through the collection of documents and record words and phrases
that seem to represent the concepts in the field.
• The idea behind this is to base the Thesaurus on the actual concepts that
have been used in the document.
• Other sources that may be used to select concepts include other Thesauri;
glossaries (that is a list of technical or foreign terms used in the
document); back of the book index; and records of requests that searchers
have made – these show query words that have been used by the public.
Building a Thesaurus.contd
• STEP 2
Arrange the selected terms into categories - that is cluster those terms that are
closely related to each other.
• Materials: rubber, plastic, paint etc.
• Properties of persons, things, materials or actions: stability, confidence,
speed
• Equipment: computers, photocopiers, printers, cranes
• Activities or processes: hairdressing, lubrication, distillation
• Institutions: University of Ghana, CSIR, IEA
• Historical periods: the Renaissance, Napoleonic wars
• Events or occurrences: natural disasters, funerals, exhibitions
• Things and their physical parts: birds, documents, mountain regions
• Disciplines or subjects: philosophy, medicine, business studies
• Units of measurement: minutes, kilometers
• Types of people and organizations: adults, nurses, lawyers, financial service
organizations, nations.
• STEP 3
This is where proper control of the language comes in. Here the terms that
have been selected must be converted into the controlled vocabulary. The terms
in a thesaurus are called descriptors.
Generally there are fairly standardized rules that the indexer can use in
forming the descriptors. These are:
• For count nouns, the plural form must be used e.g. oranges rather than
orange.
• Use the singular form for mass nouns or uncountable nouns e.g.
congregation instead of congregations.
• The singular form is also used for processes, e.g. filtration.
• STEP 4
This is the second area of control. The indexer decides which terms are to be
authorized and which ones are not. The former are those
which will be used to access the index, while the latter will be used for
cross-references, e.g. Homicide UF murder. Homicide is the authorized
term; murder is then used for cross-reference.
• STEP 5
Establish a relationship between the terms by indicating clearly which terms
are authorized, unauthorized, broader, narrower, and which are
related (USE, UF, BT, NT and RT).
• STEP 6
Arrange the descriptors in a thesaurus structure by listing them alphabetically. The thesaurus
form is as follows – authorized term comes first, followed by the unauthorized term, then by the
broader term, narrower term and related term. For example:
LECTURES
UF CLASSES
BT CAMPUS LIFE
NT ASSIGNMENTS
   EXAMINATIONS
   TUTORIALS
RT HALL WEEK
   STUDENT POLITICS
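The display order described in this step (descriptor, then UF, BT, NT and RT) can be produced mechanically once the relations are stored. This is a rough sketch only; the function name, the dictionary layout and the exact indentation are assumptions, while the terms come from the LECTURES example.

```python
DISPLAY_ORDER = ("UF", "BT", "NT", "RT")


def format_thesaurus_entry(descriptor, links):
    """Lay out one entry: the descriptor first, then UF, BT, NT and RT."""
    lines = [descriptor]
    for relation in DISPLAY_ORDER:
        for position, term in enumerate(sorted(links.get(relation, []))):
            # Print the relation code once, then align the terms that follow.
            label = relation if position == 0 else "  "
            lines.append(f"  {label} {term}")
    return "\n".join(lines)


entry = format_thesaurus_entry(
    "LECTURES",
    {
        "UF": ["CLASSES"],
        "BT": ["CAMPUS LIFE"],
        "NT": ["TUTORIALS", "ASSIGNMENTS", "EXAMINATIONS"],
        "RT": ["STUDENT POLITICS", "HALL WEEK"],
    },
)
print(entry)
```

Sorting the terms within each relation keeps the entry alphabetical however the indexer happened to record them.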
• STEP 7
Add scope notes (SN) to explain terms that may be ambiguous or misinterpreted.
For example:
INDEXING
SN Assignment of terms to documents for the purpose of retrieval at a later
date. Do not use for ‘cost index’.
Scope notes are not full dictionary definitions; they just explain how a term
has been used in a particular thesaurus.
• STEP 8
Use the thesaurus to index a set of documents. After that, use it to search for
documents from the index. If this is successful, then a good thesaurus has been
constructed. If difficulties arise in finding a selected topic, then it means that the
thesaurus is not good enough. The indexer has to go back to step two and work
his way downwards again.
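The trial described in this step can be imitated on a very small scale: assign descriptors to a few documents, build an index from them, then search with both authorized and unauthorized terms. Everything in the sketch is hypothetical (document identifiers, descriptor assignments, function names); it only shows the shape of such a test.

```python
from collections import defaultdict


def build_index(assignments):
    """Map each descriptor to the set of documents indexed under it."""
    index = defaultdict(set)
    for doc_id, descriptors in assignments.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index


def search(index, use_references, query):
    """Translate the query through the USE references, then look it up."""
    term = query.strip().upper()
    term = use_references.get(term, term)
    return sorted(index.get(term, set()))


# Invented trial set of three documents.
assignments = {
    "doc1": ["HOMICIDE"],
    "doc2": ["LECTURES", "EXAMINATIONS"],
    "doc3": ["HOMICIDE", "LECTURES"],
}
use_references = {"MURDER": "HOMICIDE"}

index = build_index(assignments)
print(search(index, use_references, "murder"))
print(search(index, use_references, "tutorials"))
```

A query that returns nothing for a topic the documents are known to cover is the kind of difficulty that would send the indexer back to step two.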
Criteria For Evaluating A Thesaurus
• There are two main issues relevant to pre-coordinate indexing. These are:
-Consistent description of subjects and
-Referencing
• Consistent Description of Subjects
Pre-coordinate indexing deals with many subject headings. Describing the
headings must be done in a consistent manner. Equally, there must be
consistency in the arrangement of terms (citation order). To
ensure consistency in the description and arrangement of terms, a
controlled vocabulary and a standard citation order have to be used.
Issues Relevant to Pre-Coordinate indexing. contd
• Principles and guidelines have been evolved over the years to ensure
consistency in the description of concepts and the arrangement of terms.
Three of the most notable of these are Cutter’s Rules for a Dictionary
Catalogue; Kaiser’s Systematic Indexing; and Coates’s rules for British
Technology Index.
Cutter’s Rules for a Dictionary Catalogue
• The rules were evolved by Charles Ammi Cutter in 1876. The rules say that
where a subject and a place are both elements of a topic, the subject
should precede the place in science-related areas. For example, “Ghana
Institution of Engineers” will be realized as “Institution of Engineers :
Ghana”. On the other hand, the place should precede the subject in other
areas like commerce, history, government, etc. For example,
“Government of Ghana” will be “Ghana : Government”.
• They are used in the Library of Congress List of Subject Headings.
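The subject and place rule described above amounts to a choice between two citation orders. The sketch below is a simplified illustration and not Cutter's full rules: the set of areas treated as science-related and the function name are assumptions, while the two headings are the examples given above.

```python
# Assumption: which areas count as science-related is decided by the indexer.
SCIENCE_RELATED_AREAS = {"science", "engineering", "technology", "medicine"}


def subject_place_heading(subject, place, area):
    """Cite subject before place in science-related areas, place first otherwise."""
    if area.lower() in SCIENCE_RELATED_AREAS:
        return f"{subject} : {place}"
    return f"{place} : {subject}"


print(subject_place_heading("Institution of Engineers", "Ghana", "engineering"))
print(subject_place_heading("Government", "Ghana", "government"))
```

Because the order depends only on the area, every heading in the same area comes out in the same citation order, which is the consistency the rules are meant to secure.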
Kaiser’s Systematic Indexing
• Coates also established rules to cater for the British Technology Index, now
called Current Technology Index. His rules covered a wide range of
composite subjects. He based his rules on Kaiser’s Systematic Indexing
except that he changed Kaiser’s “Concrete” and “Process” to “Thing” and
“Action”. Thus according to Coates the “Thing” should be cited before
the “Action” taken on it. He used the principle to map out an extended
citation order where there is a “thing” and a “part of the thing”, then
“material” and “action”. For example “Manufacture of multiwall Kraft
paper sacks” will be cited as:
• The primary operators determine the citation order for the string and they are cited ordinally.
1 - Represents the key system or the object on which an action may be taken or its effect
may be experienced.
2 - Shows action or phenomena, that is action or the effect of an action on the key
system or the object.
r represents assembly
s indicates role definer
t shows author-attributed association
u represents two-way interaction.
Steps in Creating PRECIS Indexing String
• Step One
The first step is to identify the concepts in the composite subject that are to be
reflected in the index entries. In PRECIS, a concept is defined as a term or
topic that matches a PRECIS operator.
• Step Two
Express the identified concepts in the controlled vocabulary that will be used in
the index.
• Step Three
Assign an “operator” or “code” to each term that has been identified. The
operators have specific filing values. This is to ensure that terms appear in the
indexing string in an order that will produce a meaningful set of index entries.
• Step Four
Arrange the index terms according to the filing value of the operators that
have been assigned to the concepts. By the end of this step, an indexing string
would have been created
• Step Five
The indexer explores the possible entries that may be generated from the
string. He may make whatever adjustments are necessary in
terms of which terms should take the lead position. He could also include or
exclude terms at this stage.
• Step Six
Computer instruction codes or commands are used to replace the
operators in the string and convert the operators into machine readable
codes that will show which terms to be used as entry terms.
• Step Seven
The computer creates a series of entries based on the indexing string that
was generated.
• LEAD Qualifier
Display
The Lead is the term to be used as the access point. The Qualifier is the
context establishing term and the Display indicates narrower terms.
• By a process of rotation, the previous LEAD term goes to Qualifier position.
The next term at the head of the Display will move to LEAD position.
• The process of rotation is called SHUNTING.
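Shunting can be sketched as a loop over the terms of an indexing string that has already been put in order. The sketch assumes the simplest case, in which every term is allowed to lead and the operators have already done their work; the function names and the four-term string are illustrative only. Each term takes the Lead in turn, the terms before it move into the Qualifier (most recent first), and the terms after it stay in the Display.

```python
def shunt(terms):
    """Return one (lead, qualifier, display) entry per term in the string."""
    entries = []
    for position, lead in enumerate(terms):
        qualifier = ". ".join(reversed(terms[:position]))
        display = ". ".join(terms[position + 1:])
        entries.append((lead, qualifier, display))
    return entries


def format_precis_entry(lead, qualifier, display):
    """Lead and Qualifier share the first line; the Display goes beneath."""
    first_line = f"{lead}. {qualifier}" if qualifier else lead
    return first_line + (f"\n  {display}" if display else "")


# Illustrative string, already arranged in context-dependent order.
string = ["India", "Cotton industries", "Personnel", "Training"]
entries = shunt(string)
for entry in entries:
    print(format_precis_entry(*entry))
    print()
```

The first entry has an empty Qualifier and the last has an empty Display, which is why the two-line layout shrinks to a single line at the end of the rotation.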
POPSI
• A POPSI index entry is made up of two parts, namely the Lead Heading
which contains the access term and the Context Heading which normally
appears on the second line after the lead heading. The context heading
contains the subject words with auxiliary words which show the context in
which the lead term has been discussed.
• There are four basic categories of subjects under POPSI called DEPA for
short. These are:
-Discipline
-Entity
-Property
-Action
POPSI.contd
• Brevity: They must be shorter than the document from which they are
derived. Brevity saves the user’s time and lowers the cost of production
of the abstract.
• Clarity: This means that it must be clearly written and all sorts of
ambiguities avoided.
• Accuracy: As far as practicable, all errors must be avoided.
Additionally abstracts
• must be self-contained and must make good reading by
• should be objective without containing any critique or interpretation or
evaluation.
• must be high in information content and should place emphasis on
reporting new facts
Characteristics of abstracts.contd
• Abstracts may also be characterized by their length. On this there are two
schools of thought. One believes that a document of about a page or two
should be abstracted in not more than 200 words, while the abstract of a larger
document should not exceed 500 words. The other school believes that there should be
no rule on the length of the abstract because the length would necessarily be
affected by a number of factors such as:
• The length of item to be abstracted itself would affect the abstract
• The complexity of the subject matter of the item to be abstracted will also
affect the length of the abstract.
• The diversity of the subject matter of the document.
• The importance of the item to the organization preparing the abstract.
• Accessibility of the subject matter of the document can affect the length.
Accessibility here refers to physical and intellectual accessibility.
• Printing costs may also affect the length of the abstract.
• The purpose of the abstract will also determine its length.
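The first school of thought can be expressed as a simple word-count check. The sketch takes "a page or two" to mean at most two pages and counts words by splitting on white space; both choices, and the function names, are assumptions made for illustration.

```python
def max_abstract_words(document_pages):
    """Word limit under the first school: 200 for short documents, else 500."""
    return 200 if document_pages <= 2 else 500


def within_limit(abstract, document_pages):
    """True if the abstract does not exceed the limit for its document."""
    return len(abstract.split()) <= max_abstract_words(document_pages)


sample = "word " * 250  # a 250-word stand-in for an abstract
print(within_limit(sample, document_pages=2))
print(within_limit(sample, document_pages=30))
```

The second school would reject a fixed table like this, since the factors listed above would each push the acceptable length up or down.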
Types of materials that may be abstracted
Types of abstracts
There are several types of abstracts namely Informative abstracts;
Indicative abstracts; Informative / Indicative abstracts; Critical abstracts;
Slanted abstracts; Author abstracts; Highlight abstracts etc.
Informative Abstracts
• They are normally used for documents that report new findings e.g. scientific
journals, technical bulletins, monographs and sometimes conference
proceedings.
Indicative Abstracts
• In practice these are more common than the purely informative or purely
indicative abstract. The two together mean that parts of the abstracts are
written informatively while other parts are written indicatively. Those
parts of the document considered to be of great importance are written
informatively whilst those of minor significance are treated indicatively.
Informative/Indicative abstracts may be used for papers or documents
that report original results. They may also be used for literature reviews.
Critical Abstracts