Language Technology
A First Overview
Hans Uszkoreit
1. Scope
Language technologies are information technologies that are specialized for dealing with the
most complex information medium in our world: human language. Therefore these technologies
are also often subsumed under the term Human Language Technology. Human language
occurs in spoken and written form. Whereas speech is the oldest and most natural mode of
language communication, complex information and most of human knowledge is maintained
and transmitted in written texts. Speech and text technologies process or produce language in
these two modes of realization. But language also has aspects that are shared between
speech and text such as dictionaries, most of grammar and the meaning of sentences. Thus
large parts of language technology cannot be subsumed under speech and text technologies.
Among those are technologies that link language to knowledge. We do not know how
language, knowledge and thought are represented in the human brain. Nevertheless, language
technology had to create formal representation systems that link language to concepts and
tasks in the real world. This provides the interface to the fast-growing area of knowledge
technologies.
In our communication we mix language with other modes of communication and other information
media. We combine speech with gesture and facial expressions. Digital texts are combined
with pictures and sounds. Movies may contain language in spoken and written form. Thus
speech and text technologies overlap and interact with many other technologies that facilitate
processing of multimodal communication and multimedia documents.
[Figure: language technologies in relation to speech technologies, text technologies, knowledge technologies, and multimedia & multimodality technologies]
For a comprehensive introduction to the field, the reader is referred to: Cole R.A., J. Mariani, H. Uszkoreit, G. Varile,
A. Zaenen, V. Zue, A. Zampolli (Eds.) (1997) Survey of the State of the Art in Human Language Technology,
Cambridge University Press and Giardini. (http://www.dfki.de/~hansu/HLT-Survey.pdf)
In the following, a selection of the most relevant language technologies is summarized.
Speech Recognition
Spoken language is recognized and transformed
into text as in dictation systems, into commands as
in robot control systems, or into some other internal
representation.
Speech Synthesis
Utterances in spoken language are produced from text
(text-to-speech systems) or from internal representations
of words or sentences (concept-to-speech systems).
Text Categorization
This technology assigns texts to categories. Texts may
belong to more than one category, and categories may
contain other categories. Filtering is a special case of
categorization with just two categories.
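The three properties named above (several categories per text, nested categories, filtering as the two-category case) can be illustrated with a minimal Python sketch. The keyword lists, the category hierarchy and the function names are hypothetical and chosen only for illustration; practical systems usually learn their category models from data.

```python
import re

# Hypothetical keyword lists per category; a real system would learn these from data.
KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "finance": {"merger", "shares", "price"},
}

# Hypothetical hierarchy: each category maps to the category that contains it.
PARENT = {"sports": "news", "finance": "news"}


def categorize(text):
    """Return every category whose keywords occur in the text, plus its ancestors."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    found = set()
    for category, keywords in KEYWORDS.items():
        if words & keywords:
            found.add(category)
            # Walk up the hierarchy, since categories may contain other categories.
            parent = PARENT.get(category)
            while parent is not None:
                found.add(parent)
                parent = PARENT.get(parent)
    return found


def passes_filter(text, wanted="finance"):
    """Filtering as two-category categorization: relevant versus not relevant."""
    return wanted in categorize(text)
```

A text mentioning both a team and a merger is assigned to both leaf categories and to their shared parent, while a text with no keyword hits receives no category at all.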
Text Summarization
The most relevant portions of a text are extracted as
a summary. The task depends on the required length
of the summary. Summarization is harder if the
summary has to be specific to a certain query.
Text Indexing
As a precondition for document retrieval, texts are
stored in an indexed database. Usually a text
is indexed for all word forms or – after lemmatization –
for all lemmas. Sometimes indexing is combined
with categorization and summarization.
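As a rough illustration of indexing by word forms or by lemmas, the following Python sketch builds a small inverted index. The lemma table and the function name are assumptions made for this example; a real system would use a morphological analyzer instead of a fixed table.

```python
import re
from collections import defaultdict

# Hypothetical toy lemma table; real systems use a morphological analyzer.
LEMMAS = {"texts": "text", "indexed": "index", "indexes": "index", "stored": "store"}


def build_index(documents, lemmatize=False):
    """Map each word form (or lemma) to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            # Either index the word form itself or replace it by its lemma.
            key = LEMMAS.get(word, word) if lemmatize else word
            index[key].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}
```

With lemmatization switched on, the forms "texts" and "text" end up under the same entry, so a query for one also finds documents containing the other.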
Text Retrieval
Texts that best match a given query or document are
retrieved from a database. The candidate documents
are ordered with respect to their expected relevance.
Indexing, categorization, summarization and retrieval
are often subsumed under the term information retrieval.
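Ordering candidates by expected relevance can be sketched in Python as follows. The relevance measure used here, cosine similarity over simple word counts, is an assumption made for this example and only one of many possible choices; the function names are hypothetical as well.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents):
    """Return (doc_id, score) pairs with a positive score, best match first."""
    q = vectorize(query)
    scored = [(doc_id, cosine(q, vectorize(text))) for doc_id, text in documents.items()]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > 0]
```

Because the query may itself be a whole document, the same function covers both retrieval by query and retrieval of similar documents.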
Information Extraction
Relevant pieces of information are discovered
and marked for extraction. The extracted pieces can be
the topic, named entities such as company, place or
person names, simple relations such as prices, destinations,
functions etc., or complex relations describing
accidents, company mergers or football matches.
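The simplest of these cases, named entities and prices, can be approximated with a name list and a pattern, as in the Python sketch below. The gazetteer entries, the assumed price format and the function name are all hypothetical; complex relations such as mergers require far more than this.

```python
import re

# Hypothetical gazetteer of known names; real systems use larger lists or trained models.
GAZETTEER = {"Berlin": "place", "Acme": "company"}

# Assumed price format for this sketch: an amount followed by a currency code.
PRICE = re.compile(r"\b(\d+(?:\.\d+)?)\s?(EUR|USD)\b")


def extract(text):
    """Return named entities and prices found in the text, with character offsets."""
    entities = []
    for name, kind in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), name, kind))
    prices = [(m.start(), float(m.group(1)), m.group(2)) for m in PRICE.finditer(text)]
    return {"entities": sorted(entities), "prices": prices}
```

The character offsets correspond to marking the pieces in the text before they are extracted.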
Report Generation
A report in natural language is produced that describes
the essential contents or changes of a database. The
report can contain accumulated numbers, maxima,
minima and the most drastic changes.
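A template-based sketch in Python shows how accumulated numbers, maxima, minima and the most drastic change can be turned into sentences. The function name, the input format and the sentence templates are assumptions made for this example; real report generators plan and word their output far more flexibly.

```python
def generate_report(name, values):
    """Describe a series of numbers in a few English sentences.

    `values` maps period labels (in order) to numbers and needs at least two entries.
    """
    labels = list(values)
    total = sum(values.values())
    high = max(labels, key=values.get)
    low = min(labels, key=values.get)
    # Most drastic change: largest absolute difference between consecutive periods.
    pairs = list(zip(labels, labels[1:]))
    before, after = max(pairs, key=lambda p: abs(values[p[1]] - values[p[0]]))
    change = values[after] - values[before]
    direction = "rose" if change > 0 else "fell"
    return (
        f"Total {name} was {total}. "
        f"The maximum was {values[high]} in {high} and the minimum was {values[low]} in {low}. "
        f"The most drastic change came between {before} and {after}, "
        f"when {name} {direction} by {abs(change)}."
    )
```

Called with quarterly figures, for example `generate_report("sales", {"Q1": 10, "Q2": 12, "Q3": 4, "Q4": 9})`, the function returns a short paragraph covering the total, the extremes and the largest quarter-to-quarter change.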
Translation Technologies
Technologies that translate texts or assist human translators.
Automatic translation is called machine translation.
Translation memories use large amounts of texts together
with existing translations for efficient look-up of possible
translations for words, phrases and sentences.
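The look-up step of a translation memory can be sketched in Python with approximate string matching from the standard library. The stored sentence pairs, the similarity threshold and the function name are hypothetical; production systems index their memories for speed rather than scanning every entry.

```python
import difflib

# Hypothetical memory of source sentences with existing translations (English to German).
MEMORY = {
    "Press the start button.": "Drücken Sie die Starttaste.",
    "Close the cover.": "Schließen Sie die Abdeckung.",
}


def lookup(sentence, threshold=0.7):
    """Return (stored source, translation, similarity) for the best fuzzy match, or None."""
    best = None
    for source, target in MEMORY.items():
        score = difflib.SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best
```

A sentence that differs from a stored one in a single word still retrieves the stored translation, which the human translator can then adapt.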
Generic CS Methods
Programming languages, algorithms for generic data types, and software engineering methods
for structuring and organizing software development and quality assurance.
Specialized Algorithms
Dedicated algorithms have been designed for parsing, generation and translation, for morphological
and syntactic processing with finite state automata/transducers and many other tasks.
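To give an impression of morphological processing with a finite state transducer, the following Python sketch maps surface forms to a lemma plus morphological tags. The two-word lexicon, the tag names and the function name are hypothetical and kept deliberately tiny; real transducers are compiled from full lexicons and rule sets.

```python
# Transitions: (state, input symbol) -> (next state, output string).
# A hypothetical toy lexicon with the nouns "cat" and "car" and an optional plural "-s".
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (2, "r"): (3, "r"),
    (3, "s"): (4, "+N+PL"),
}
# Accepting states, each with the output emitted at the end of the word.
FINAL = {3: "+N+SG", 4: ""}


def analyze(word):
    """Run the transducer over a surface form; return its analysis or None."""
    state, output = 0, []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None
        state, emitted = TRANSITIONS[(state, symbol)]
        output.append(emitted)
    if state not in FINAL:
        return None
    return "".join(output) + FINAL[state]
```

The transducer reads one input symbol per step and writes output as it goes, so analysis time grows only linearly with the length of the word.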
Linguistic Knowledge
Linguistic knowledge resources for many languages are utilized: dictionaries, morphological
and syntactic grammars, rules for semantic interpretation, pronunciation and intonation.