2010 Eighth IEEE European Conference on Web Services
978-0-7695-4310-9/10 $26.00 © 2010 IEEE
DOI 10.1109/ECOWS.2010.27

A Service-oriented Architecture for Text Analytics Enabled Business Applications

Kerstin Denecke
L3S Research Center
Hannover, Germany
Email: [email protected]

Marek Kowalkiewicz
SAP Research CEC Brisbane
Brisbane, Australia
Email: [email protected]

Abstract: Experience has shown that developing business applications based on text analysis normally requires a lot of time and expertise in computational linguistics. Several approaches to integrating text analysis systems with business applications have been proposed, but so far there has been no coordinated approach that would enable building scalable and flexible applications of text analysis in enterprise scenarios. In this paper, a service-oriented architecture for text processing applications in the business domain is introduced. It comprises various groups of processing components and knowledge resources. The architecture, created as a result of our experiences with building natural language processing applications in business scenarios, allows for the reuse of text analysis and other components, and facilitates the development of business applications. We verify our approach by showing how the proposed architecture can be applied to create a text analytics enabled business application that addresses a concrete business scenario.

I. INTRODUCTION

Huge amounts of structured and unstructured data are available in Desktop, Enterprise and Internet applications. For interpreting and processing unstructured data (which is the main focus of this paper), known text analysis methods can be exploited. Due to the immense volume of data available, it becomes necessary to have applications on hand that support automatic analysis, interpretation, and integration of data from different sources, and that leverage and analyse structured and unstructured data in conjunction with each other. Further, systems are required that can invoke or suggest appropriate actions and applications to allow for quick access to relevant information and applications. These issues are of particular interest for unstructured data, since its volume grows significantly in daily business (e.g., e-mails, competitive data on the Web). Text analysis could support this process significantly. Consider the following scenario:

Screen Corp. sells computer screens to wholesale distributors (1), computer stores (2) and, via on-line channels, directly to customers (3). It uses one e-mail address ([email protected]) to communicate with the customers. In order to streamline the work, Screen Corp. now uses an e-mail program that automatically scans mails directed to [email protected], redirects them to the corresponding service department and provides the service employees with additional information about the partner ((1), (2), or (3)) and the kind of request.

Known text analysis methods, including named entity recognition, sentiment analysis and text classification, can be used to implement this high-level scenario. The corresponding system could work as follows:

First, the system identifies the sender of an incoming e-mail and, by accessing internal data, selects the customer group the sender belongs to (e.g., a wholesale distributor). Further, the mail content is analysed and the mail is classified according to its content as a complaint, a request, positive feedback, etc. Through named entity recognition, the product name is determined. Based on the determined group information, the mail is redirected automatically to a responsible service employee or department. Further, the system identifies relevant background information on the customer and the product, the number of purchases, etc., and provides this information to the selected service employee. In this way, the work flow of this employee is facilitated by access to (a) e-mails lying within his or her competencies, and (b) useful and relevant information on customer and product. This reduces the processing time of incoming mail messages, and customers receive replies faster (which increases their satisfaction with the company's service support).

Hence, there is a business case for automating the analysis of natural language text stored in different applications and data resources in order to enable efficient data analytics. Further, integrating textual data with structured data is crucial, as is enabling quick access to relevant business, desktop and Internet applications. Often, text processing applications require similar subtasks and methods. Unfortunately, reuse does not take place because a common architecture and a common data exchange between components are missing. Of course, systems have already been developed that use text analysis in business applications [1], [2], [3]. However, a general architecture of such applications, their requirements and components still need to be developed and described.

In this paper, such an architecture for text processing applications in the business domain is presented and its components are described. In the following, we refer to systems that combine text analysis with applications relevant in the business domain as text analytics enabled business applications. The proposed architecture is intended to allow for easy creation of such applications for information workers, for example
a consultant or an office assistant working for a company. Applications designed in the proposed way can incorporate one or more text analysis services, such as Sentiment Analysis or Named Entity Recognition, within a larger application, and can invoke external applications for user interactions or for collecting data from them. We believe that a service-oriented architecture (SOA) that allows arranging components in a process-like manner is well suited for this purpose, since the ability to reuse components is a main paradigm of this kind of architecture.

One objective of this work is to provide an architecture that allows for combining text analysis with enterprise, desktop and Web applications in business application systems for business intelligence. In addition, we aim at facilitating system development by enabling reuse of components. Exploiting modules that each address a concrete text analysis problem allows even non-specialists in natural language processing to create applications that make use of text analysis.

The questions we want to address here are the following:
• What are possible application scenarios of text analytics in the business domain?
• What are the requirements for these applications?
• What components are required for a service-oriented architecture implementation of text analytics enabled business applications?

The paper is structured as follows: Section II provides an overview of related work. In Section III, requirements to be fulfilled by the architecture are collected. The spectrum of business applications using text analysis is very broad, and we present some possible scenarios in Section IV. The architecture itself is described in Section V. Its application in one concrete scenario is presented in Section VI. The paper finishes with a discussion of the proposed approach (Section VII) and conclusions (Section VIII).

II. RELATED WORK

Various systems have been developed to realise basic text analysis tasks as such or to integrate text analysis methods in processing pipelines. A processing pipeline consists of a chain of processing modules (processes, threads, etc.), arranged in a way that the output of each module is the input of the next. Some text analysis systems were implemented as service-oriented architectures. There are also business applications already available that make use of text analysis. In this section, relevant approaches in this area are described and differences to our work are presented.

A. Service-oriented Natural Language Processing (NLP) Systems

In recent years, research in the area of SOA has focused on modernising the design, analysis and delivery of service systems, and has aimed at automating design as well as at capturing domain expertise for a specific field in a reusable way [4]. Papazoglou and van den Heuvel [5] review current technologies and approaches that use concepts of SOA. We will focus on related work on SOA for text processing and on SOA in business applications.

Service-oriented architectures for text processing have been proposed by different research groups and target different tasks, including terminology acquisition [6] or text enrichment and annotation [7], [8]. Witte and Gitzinger propose a service-oriented architecture to integrate natural language processing (NLP) services into client applications [9]. Jastrzebski et al. [10] present a distributed service-oriented architecture for retrieving documents where client and server as well as storage and processing are separated. Their system includes a service for managing the information extraction process and the invocation of services.

Mikroyannidis et al. [11] introduce Parmenides, a business-oriented framework for information management on the Web that supports decision making in management via a customised analysis of the desired market. An application of ontology-based extraction in the context of E-business applications is described by Saggion et al. [12]. The system identifies, in natural language text, information that has been specified in the underlying ontology. The extraction facilities have been developed using GATE (see below).

The described systems are rather limited with respect to the problems and real-world scenarios that can be addressed, i.e. they do not aim at creating an architecture for systems dealing with varying business scenarios. In contrast, in this paper we propose an architecture that facilitates the creation of business applications that exploit text analysis. Further, the architecture allows for integrating external applications into text analytics enabled business applications.

B. NLP Frameworks

In the field of natural language processing (NLP), systems, architectures and frameworks addressing concrete NLP problems have been introduced. OpenCalais¹ is a tool set for NLP and Machine Learning that allows for metatagging of textual information and, in this way, for generating semantic content. In addition, a broad range of language processing toolkits is available (e.g., NLTK [13], MontyLingua² or LingPipe³). Some of them work in a pipeline fashion [14], [15]. For example, UAM Text Tools provide command-line programs that realise concrete NLP tasks (e.g., tokenization, sentence splitting, morphological analysis) and can be connected in various ways [14]. Language processing tools as such are clearly useful, but it is even more relevant to make use of them in more sophisticated systems in general and in business applications in particular. Showing a way of integrating text analysis methods into business applications in a re-usable manner is the focus of this paper.

¹ http://www.opencalais.com
² http://web.media.mit.edu/~hugo/montylingua/
³ http://alias-i.com/lingpipe/

Frameworks for creating modular NLP applications allow arranging modules that perform different tasks of linguistic processing (e.g., tokenization, POS tagging) in a processing
pipeline. They target linguists or knowledge engineers, allowing them to create new NLP tools by arranging services in a certain way. Example frameworks are GATE and UIMA. GATE (General Architecture for Text Engineering, [16]) is both a framework and a graphical environment for human language processing. It allows for combining different processing modules and language resources for building a pipeline of NLP components. The Unstructured Information Management Architecture (UIMA, [17]) framework is an open, scalable and extensible platform for building text analytics applications or search solutions that process unstructured information. It allows for specifying analytic pipelines, describes a set of design patterns and suggests certain data representations.

Both frameworks aim at realising specific NLP tasks; they do not provide a general architecture for more complex text analysis applications and do not allow for integrating services from external (business) applications such as the SAP system. In contrast, we want to specify an architecture that allows for such an integration of various text analysis methods and external applications.

C. Business Intelligence Applications

Business Intelligence (BI) refers to skills, technologies etc. that support decision making [18]. In the last couple of years, the focus in BI applications has been on structured data and on technologies including OLAP, data mining, and process mining [19]. New challenges arise when considering unstructured data, which is the focus of this paper.

Nie et al. [3] introduce the MBOI tool, which uses information extraction for discovering business opportunities on the Internet. Their main aim is to help users decide which company tenders require further investigation. Similarly, the LIXTO tool [20] is used for web data extraction for business intelligence, for example to acquire sales price information from online sales sites. The system h-TechSight [2] uses GATE's information extraction facilities to detect changes and trends in business information and to monitor markets. The presented systems are mostly self-contained, which hampers their reuse. Nevertheless, they comprise modules that could be useful in other business applications, too. This paper aims at presenting an architecture that allows for the reuse of such components.

Sureka, De and Varma [21] propose a generic architecture of a text analytics based retrieval system. It enables a warranty analyst to query unstructured data. Their architecture focuses on one concrete problem, namely gaining intelligence from the textual data stored in warranty claim forms. This is in contrast to our approach, where one architecture should enable the creation of very different business applications. Within Yowie [1], named entity recognition and information extraction services are integrated following the SOA paradigm. These technologies are exploited as entry points for external services. For example, entities in e-mails are identified and related to objects and applications on the Desktop, in the Enterprise system or on the Internet. Nevertheless, Yowie is also a self-contained system and does not follow a general system architecture for business applications.

These examples show that the different applications of text processing in the business domain use similar processing resources (e.g., information extraction, named entity recognition). However, a general architecture of text analytics enabled business application systems is missing. Such an architecture should characterise the necessary services and components and define the communication between components.

III. REQUIREMENTS

As stated before, our central goal is to bring text analysis into real-world business applications. In this section, we list the requirements to be fulfilled by our architecture. They are summarised in Table I and described in more detail in the following paragraphs.

The requirements can be grouped according to the various perspectives or actors of such a system. The (a) end-user works with a client application which enables access to the actual system. Normally, the end-user is not aware of, and does not care about, the underlying technology. The (b) system developer 'plugs' the different services together to create a new application addressing a concrete business scenario. The developer does not necessarily have knowledge of Natural Language Processing (NLP) or of implementing NLP components and services. The (c) service engineer creates and develops the processing services, for example NLP services for information extraction or named entity recognition. We refrain from discussing the role of the service engineer here, since our focus is on the general architecture of text analysis application systems and not on how text analysis services or other services are eventually implemented. For system development, it should be of no concern how the components are created. In addition to the various user perspectives, we finally assume the (d) system perspective, where we consider the properties desirable for the architecture as a whole. In the following, the different requirements are described in more detail.

A. End-user requirements

The 'end-user' of a system built according to the architecture we are proposing here can be any kind of user working with desktop applications. Our focus, however, is the business domain. Therefore, potential users are office assistants, consultants, service hotline employees, marketers, or salespeople. Some of the business users must use analytical tools to improve the results of a business process; others just need quick access to relevant information and applications [12].

The scenarios described in Section IV show some potential applications and provide insights into the variety of potential end-users. Considering the different user groups and scenarios, the architecture must be open and flexible with respect to the type of clients it is integrated with. Further, it must allow for considering the user context. Depending on the current working context, a user might have different tasks and access rights. For instance, office assistants may be able to check leave
requests from a company's employees, but are not allowed to make any changes. For their own leave requests, they of course have the corresponding rights. Since business processes and work flows might involve the use of various software applications, the architecture must allow for the automatic invocation of external applications.

TABLE I
REQUIREMENTS OF THE ARCHITECTURE

User requirements
  1   Open and flexible with respect to the type of clients integrated with
  2   Consideration of user context
  3   Ensuring integration and invocation of external applications
Software Engineer
  4   Easy integration of any client
  5   Reuse of components
System requirements
  6   Ensuring easy communication between Client and Server
  7   Enabling easy modification, replacement and extension of knowledge resources
  8   Enabling easy integration of new processing services
  9   Enabling automatic invocation of external applications (e.g. Enterprise, Desktop and Web applications)
  10  Ensuring flexible handling of results

B. System Developer Requirements

System developers form a second user group, i.e. those who create a text analytics enabled business application and integrate it into a client application. The main purpose of the proposed architecture is to facilitate their job. For this reason, the architecture should allow for an easy integration of any client and should enable developers to do this in an effective and efficient manner. Further, reuse of components is a main issue, since it may help to reduce the development effort significantly.

C. Architecture Requirements

Additional requirements are related to the architecture itself. Easy communication between Client and Server needs to be enabled. The processing services require different resources. It needs to be ensured that these resources can be easily modified, replaced or extended. We will propose a set of required processing services. Nevertheless, the architecture should be open enough to enable easy integration of new processing services. Results of services might be sent directly to the client application or be used by subsequent services. Therefore, the architecture should ensure a flexible handling of results.

IV. TEXT ANALYTICS ENABLED BUSINESS APPLICATIONS

In the last couple of years, the usefulness and necessity of text analytics enabled business applications have become even more visible [7]. Applications that garner the largest interest are semantic search and question answering, as well as applications analysing enterprise feedback, and risk and fraud applications for warranty claims, financial services etc. In this section, examples are provided where text analysis could support business processes. The examples were collected during discussions with customers and internal business units of one of the largest enterprise software vendors. They show how text analysis such as general and domain-specific named entity recognition (e.g., person names, product names, administrative activities), sentiment analysis or information extraction can support business processes. The objective of this section is to give an indication of the variety of potential applications of text analysis in the business domain. The section is also meant to justify our objective of defining an architecture that allows for creating corresponding systems in a limited amount of time by re-using components and structures.

In the business domain, activities and functions can be structured into four general groups:
1) Customer facing activities,
2) Supplier related activities,
3) Business execution activities,
4) Business management activities.

Customer facing activities are centred around the customer and include the marketing of products, the selling of products to customers and their support with product-related issues. Supplier related activities refer to all activities that are directed towards the supplier, i.e. identification and hiring of suppliers. Business execution activities include the tasks and activities performed within an enterprise to produce products. Finally, business management activities are mainly administrative activities within an enterprise. In the following, potential scenarios of text analytics enabled business applications are presented. They are assigned to the kind of organisational activity they address most.

A. Customer Facing Activities

Service Hotline. Large amounts of customer requests reach service hotlines every day via a number of channels, including voice communication and e-mails. E-mail requests sent to a help desk, or transcripts of voice communication, can be automatically analysed and redirected to the appropriate expert for immediate response. The expert gets additional information on the customer, products purchased etc. In this way, processing time is decreased, not least because relevant additional information for addressing the customer request is provided.

Customer Information Scenario. Having good knowledge about a customer is crucial for generating a well-suited product offer or reacting appropriately to a customer request or complaint. Information about a customer is available in heterogeneous sources: internal documents and databases, and also on the Web. Hence, a system that could automatically collect information from all relevant resources, analyse it and make it available to the user would be very useful. The described process would involve named entity recognition, and could include sentiment analysis, text summarization, information extraction and other text analysis technologies. The system could provide a sophisticated overview of the specified customer, which in turn could help the enterprise employee learn more about the customer and get a more complete view of them, their interests and opinions.
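
As an illustration only, the Service Hotline scenario can be sketched as a short chain of services in which each step adds annotations to a mail and hands it to the next step. The following Python sketch is not part of the system described in this paper. It assumes that simple keyword lexicons stand in for real text classification and named entity recognition services, and every name, address, lexicon entry and routing rule in it (`Mail`, `CUSTOMER_GROUPS`, `ROUTING`, and so on) is hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical illustration data; a real system would query enterprise
# applications and trained NLP services instead of these small lexicons.
CUSTOMER_GROUPS = {
    "buyer@wholesale.example": "wholesale distributor",
    "owner@store.example": "computer store",
}
PRODUCTS = {"sc-24 monitor", "sc-27 monitor"}
CATEGORY_KEYWORDS = {
    "complaint": {"broken", "defective", "refund", "disappointed"},
    "request": {"quote", "order", "availability", "price"},
    "positive feedback": {"thanks", "great", "excellent"},
}
ROUTING = {
    ("wholesale distributor", "complaint"): "key-account support",
    ("wholesale distributor", "request"): "wholesale sales",
}
DEFAULT_DEPARTMENT = "general service desk"


@dataclass
class Mail:
    sender: str
    body: str
    annotations: dict = field(default_factory=dict)


def identify_customer_group(mail):
    """Look up the sender in internal data; unknown senders count as end customers."""
    group = CUSTOMER_GROUPS.get(mail.sender.lower(), "end customer")
    mail.annotations["group"] = group
    return mail


def classify_content(mail):
    """Keyword-overlap stand-in for a text classification service."""
    words = set(re.findall(r"[a-z]+", mail.body.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    mail.annotations["category"] = best if scores[best] > 0 else "unclassified"
    return mail


def recognise_products(mail):
    """Lexicon look-up stand-in for domain-specific named entity recognition."""
    text = mail.body.lower()
    mail.annotations["products"] = sorted(p for p in PRODUCTS if p in text)
    return mail


def route(mail):
    """Choose a department from customer group and mail category."""
    key = (mail.annotations["group"], mail.annotations["category"])
    mail.annotations["department"] = ROUTING.get(key, DEFAULT_DEPARTMENT)
    return mail


# The output of each service is the input of the next one.
PIPELINE = [identify_customer_group, classify_content, recognise_products, route]


def process(mail):
    for service in PIPELINE:
        mail = service(mail)
    return mail.annotations


example = process(Mail("buyer@wholesale.example",
                       "The SC-24 monitor arrived broken, we want a refund."))
print(example)
```

Keeping each step as a separate function with the same signature means a step can be replaced, for instance by a call to a real classification service, without touching the others, which is the kind of component reuse the scenarios in this section call for.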

Customer Decision Support. Customers are confronted with a lot of products with similar functionalities (e.g., digital cameras, computers). To help its customers identify the most relevant product and finally come to a decision (and buy something), an enterprise might offer a system that collects product reviews available on the Web regarding the offered products. Through an interactive user interface, these reviews, or their content, could be made intuitively accessible. Product features could be identified and the statements could be clustered according to the expressed opinion. Based on the product features a customer is most interested in, the system could suggest which product suits the customer best.

B. Supplier Facing Activities

Bid Management Scenario. Managing a bid over various product categories is complex and often depends on a variety of information created and manipulated in business productivity software (e.g. Excel, Word, Outlook). A system could collect this information and associate it with relevant data in the Enterprise resource planning system to ensure that all information is available in the follow-up processes. This would help to increase the accuracy of offer creation in order to hit target margin and customer price.

C. Business Execution Activities

Risk management in day-to-day business. In daily business, employees often come across documents about product recalls, vendor issues, raw material shortages etc. However, there is no comprehensive approach that helps them collect and report these risks in a structured way. Actively involving each employee in risk management would help to identify risks at an early point in time. It would be desirable to have a system that automatically identifies and highlights relevant information, such as the name of supplier, customer, and material (which clearly involves text analysis), in the documents processed by employees. The system should also allow an employee to associate a risk with the identified information, which would result in a reported risk that can be considered elsewhere.

Collecting Competitive Data. Similar to the problem described in the customer information scenario in Section IV-A, information on competitors is available in distributed textual sources. To get an overview and a better understanding of competitors, competitive data, potential crisis areas and strategic information could be collected automatically by a system from internal documents and the Web. After that, the information could be related to a product or to certain product functionalities. A summary of current news and relevant data could then be generated, and current trends of the hottest topics or decisions determined. Relevant information could be made accessible to all interested parties, which would save them time spent searching for relevant information.

D. Applications Addressing Business Management Activities

Administration activities. Standard administrative activities are performed more or less regularly and are often time-consuming, since the relevant application or a given template needs to be found on a desktop or even in an enterprise system. Entered keywords could be interpreted automatically, and a relevant application could be started or a template could be opened. This time-effective way of accessing the respective business applications would result in higher productivity.

These examples show the spectrum of possible applications of text analysis in the business domain. They cover each of the four organisational activities existing in an enterprise. On the one hand, relevant information should be collected, analysed and presented to a user in an easily understandable way. On the other hand, quick access to relevant applications should be provided.

V. ARCHITECTURE

The application examples presented before show that different business applications in various business activities require similar text analysis techniques. Further, text processing technologies need to be linked to Enterprise, Web and Desktop applications within a business application. We now present an architecture for text analytics enabled business applications that supports these functionalities. To enable a flexible way of developing such applications, we decided to use the principle of service-oriented architectures, since single services can be easily integrated and re-used in various applications once the services have been implemented.

Fig. 1 shows the four components of the architecture. It comprises a client facing application and a server application acting as a facade to the system. Processing services group the computational components, and resources encapsulate data that need to be persisted for the system to work properly. Details such as acquiring data from the client application or from multiple data sources of several types are not shown explicitly. In the following, the single components are described in more depth.

Fig. 1. Architecture Overview

A. Client

The client in our architecture might be an existing system such as a text processor or e-mail client or any other client, including Web browsers. It allows starting functionalities of
the text analytics enabled business application that targets a concrete scenario. The end-user of the application interacts with the client to make use of the functionalities. Results received from the server are made available or accessible through the client.

B. Server

The server is responsible for the interaction between client and processing services, the invocation of services and the communication with external applications. In its role as a service orchestrator, it invokes services in the right order and transmits the results of one service as input to the next service if required. It prepares the responses of the processing services, collects the results of the services and transmits the results to the client (e.g., some generated HTML code). The final visualisation is realised through the client. Further, the server maintains information on the output that needs to be transmitted to the client. Service descriptions that comprise information on input and output of the processing services, required resources, and other prerequisites are maintained by the server.

The server also enables communication with other relevant applications. We can distinguish three different kinds of applications with which the server might interact in terms of opening applications or collecting data: Desktop Applications, Enterprise Applications and Internet Applications. For some of these applications, direct API calls from the server might be sufficient (e.g., open the 'send mail' window in Outlook, or collect customer data from enterprise applications). Other applications may require building custom extensions to allow for interactions with the server (effectively becoming an extended API). To realise communication with external applications, the server receives abstract invocation messages generated by a corresponding processing service (see below).

C. Processing Services

The processing services realise the actual processing. They might be independent of each other, or the output of one processing service might be required as input for another. We identified five different groups of processing services that are required to realise a business application based on text analysis. These include: NLP Services, Data Collection Ser-

NLP Services provide text analysis functionalities on different levels of granularity. Their main task is to analyse natural language, identify relevant pieces of information and make them available to follow-up services. NLP services can be grouped into three classes: (1) services for preparing the data for further processing (e.g., string preparation and normalisation, or a web page parser to identify text content), (2) services for basic text processing such as general or domain-specific named entity recognition and lexicon look-up, and (3) complex text processing and analysis services. The latter include, among others, services for sentiment analysis or document classification. It is important to also provide modules for more complex NLP tasks, since this makes system development as easy as possible for system developers, even those with limited NLP knowledge.

Filtering Services allow for filtering identified information and produced results with respect to different criteria. These criteria may include user-specified preferences, a user's working context, his or her role and responsibility in a company, access rights to a system etc. One potential filtering service is a personalization service that restricts the presented results to those a user is interested in or that fit his or her current working context. Filtering also includes the integration of information from different resources, which is relevant to avoid presenting the same information multiple times.

Application Services are responsible for starting external applications. On the one hand, a business application should allow for starting another relevant application (e.g., a specific external application such as a text processor with a specific template). On the other hand, it should allow for collecting data stored in external applications, for example in enterprise applications. Therefore, application services create a formal description of the kind of application that needs to be started, collect the required input information and provide details on the expected output. This formal description is then transmitted to the server, which takes care of starting an appropriate application and, if required, collecting the processing results of that application. For example, given the task of preparing a letter and including relevant contact details that have been collected by other processing services, an application service would create a description that a word processor should be opened, and that some specific contact
vices, External Application Services, Visualisation Services details have to be placed into a new document. This description
and Filtering Services. Details are given in the following is then transmitted to the server that interprets the message and
paragraphs. reacts appropriately.
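The hand-over between an application service and the server can be illustrated with a minimal sketch. The paper does not specify a message format, so everything here is an assumption made for illustration: the `InvocationMessage` fields, the `Server.register` and `Server.handle` methods, and the `"word_processor"` application kind are all hypothetical names. The point is only that the application service describes what should be started, while the server decides how to start it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InvocationMessage:
    """Abstract description of an application to start (hypothetical format)."""
    application_kind: str          # e.g. "word_processor" or "mail_client"
    action: str                    # e.g. "new_document"
    inputs: Dict[str, str] = field(default_factory=dict)
    expected_output: str = "none"  # what the server should collect afterwards


class Server:
    """Maps abstract invocation messages to concrete application starters."""

    def __init__(self) -> None:
        self._starters: Dict[str, Callable[[InvocationMessage], str]] = {}

    def register(self, application_kind: str,
                 starter: Callable[[InvocationMessage], str]) -> None:
        self._starters[application_kind] = starter

    def handle(self, message: InvocationMessage) -> str:
        starter = self._starters.get(message.application_kind)
        if starter is None:
            raise LookupError(f"no starter for {message.application_kind!r}")
        return starter(message)


def letter_application_service(contact: Dict[str, str]) -> InvocationMessage:
    """Application service: describes what to open, but does not open it."""
    return InvocationMessage(
        application_kind="word_processor",
        action="new_document",
        inputs=dict(contact),
        expected_output="document_path",
    )


def fake_word_processor(message: InvocationMessage) -> str:
    """Stand-in for a real API call; it only reports what would be done."""
    fields = ", ".join(f"{k}={v}" for k, v in sorted(message.inputs.items()))
    return f"{message.action} with {fields}"


server = Server()
server.register("word_processor", fake_word_processor)
msg = letter_application_service({"name": "A. Customer", "city": "Hannover"})
result = server.handle(msg)
```

Keeping the message free of any application-specific API detail is what would let the same application service be reused when the word processor behind the server is exchanged.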
Data Collection Services are responsible for collecting (textual) data from different resources, e.g., from web pages or weblog postings (e.g., by a Web crawling service). Further, having access to relevant documents from a user’s desktop or from an enterprise network is necessary. Since information is also available in external applications, another service should allow for collecting data stored within this kind of application. Among others, GPS information from a mobile phone, contact details stored in a mail program or customer details from enterprise applications can be collected by such a service.

The Visualisation Services take care of the visualisation and result presentation of the application. User needs for text analytical business applications range from alerts or the ability to integrate results from text analysis with structured data sources to combining text-derived information with transactional and operational information. An HTML Wrapper is one possible visualisation service that integrates the results of other services into an HTML page. This page can then be shown to the end user through the client.
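The orchestration role of the server, which invokes services in order and passes earlier results on to later services, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `ServiceDescription`, `orchestrate`, the toy `PRODUCT_LEXICON` and the two example services are hypothetical names, and the service description is reduced to the required inputs and the provided output.

```python
import html
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ServiceDescription:
    """What the server knows about a processing service (hypothetical fields)."""
    name: str
    requires: Tuple[str, ...]  # result keys that must already be available
    provides: str              # key under which this service's result is stored
    run: Callable[[Dict[str, Any]], Any]


def orchestrate(services: List[ServiceDescription], text: str) -> Dict[str, Any]:
    """Invoke services in order, feeding earlier results to later services."""
    results: Dict[str, Any] = {"text": text}
    for service in services:
        missing = [key for key in service.requires if key not in results]
        if missing:
            raise ValueError(f"{service.name} is missing inputs: {missing}")
        results[service.provides] = service.run(results)
    return results


PRODUCT_LEXICON = ("printer", "scanner")  # toy lexical resource (assumption)


def find_products(results: Dict[str, Any]) -> List[str]:
    """Basic NLP service: lexicon look-up by case-insensitive substring match."""
    lowered = results["text"].lower()
    return [term for term in PRODUCT_LEXICON if term in lowered]


def wrap_html(results: Dict[str, Any]) -> str:
    """Visualisation service: turn earlier results into an HTML fragment."""
    items = "".join(f"<li>{html.escape(p)}</li>" for p in results["products"])
    return f"<ul>{items}</ul>"


pipeline = [
    ServiceDescription("lexicon_lookup", ("text",), "products", find_products),
    ServiceDescription("html_wrapper", ("products",), "html", wrap_html),
]
output = orchestrate(pipeline, "My Printer stopped working.")
```

In this sketch the services never call each other directly. All results flow through the dictionary held by the orchestrator, which mirrors the description of the server as the component that collects results and forwards them.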

Fig. 2. Components of Hotline Scenario Application

D. Resources

Some of the processing services presented before require additional knowledge, which is stored in the resources of the proposed architecture. Three different kinds of resources can be distinguished. Knowledge Resources contain background knowledge on the domain or the enterprise itself (e.g. on the employees’ competencies or responsibilities). Enterprise knowledge is required mainly for filtering collected information. Lexical Resources provide relevant information mainly for NLP services and comprise, for example, lists of person names, customer or product names or locations. Further, more domain specific content is stored in lexical resources, such as risk management subjects or administrative activities. Lexical Resources can be used to identify relevant entities through string matching. Textual Resources include unstructured background information, for example Desktop documents, Web documents, or even links to this kind of resources that are of interest within the application.

VI. SERVICE HOTLINE SCENARIO

In this section, we show how the presented architecture is used in a concrete application which addresses the service hotline scenario introduced in section IV. In this scenario, the client application is an E-Mail program. Once a trigger event (here, an incoming mail) is recognised by the client, it contacts the server. The server invokes the relevant processing services. First, the mail message is sent to the entity recognition service that identifies general entities such as person names, locations and dates. Second, the specific named entity recognition identifies the tasks requested from the service employee (e.g., send a request, problem solving). Further, product names, company names and the like are identified in the text. The entity recognition services make use of lexical resources such as lists of company names, customer names etc.

The results of the entity recognition steps are sent to the server, which makes them available to the other processing services. The extracted tasks are sent to the filtering service Expert Finder. This service exploits a knowledge resource that provides information on which employee is responsible for what kind of tasks. The service then sends a list of identified experts back to the server. Further, additional information on the customer is collected from the enterprise system through the Transaction Data Collector (e.g. from SAP transactions). This information is sent to the server for further aggregation and processing. The server now has the information on responsible persons and additional customer information. This information is sent to the HTML Wrapper, which produces HTML from the customer data, extracted entities and tasks. The resulting HTML is again transmitted to the server. Finally, the client application needs to be invoked, i.e. a mail with the corresponding content needs to be created and sent. For this reason, the server calls the Desktop Application Starter, which creates an application starting description indicating that a mail program needs to be invoked, a new mail has to be opened and the produced HTML content has to be included in the mail. This abstract message is interpreted by the server, which triggers the mail program to send the corresponding mail. In this way, the original mail is forwarded to the identified experts with the additional information collected by the processing services. Mails are filtered automatically and re-directed to the appropriate expert. Fig. 2 shows the architecture for this scenario as described before.

VII. DISCUSSION

Creating business applications following the architecture presented in this paper has several benefits. First, very different business applications dealing with various scenarios can be developed by considering this architecture. This is in contrast to the existing work presented in section II that focused only on single, self-contained systems. Second, the architecture ensures re-use of existing modules. Once a set of processing modules has been implemented, resources only need to be integrated in an appropriate manner to create a business application. Third, the service-oriented structure of the architecture assures flexibility in creating business applications addressing other scenarios through the ability to add new services. Processing services and resources are strictly separated, which allows the same resources to be used by different processing services. In summary, the architecture offers a computational linguist or system developer a workflow environment that allows her or him to rapidly prototype and test applications built from services and resources specified by the architecture.

With respect to the collected requirements the following observations can be made: Due to the client-server structure, created business applications can be integrated into any client application. Further, the user context is considered by the different filtering services. Therefore, the user requirements are fulfilled by the architecture. This also holds true for the requirements of system developers. The service-oriented architecture allows for creating systems with services working in a processing pipeline. Services perform concrete tasks or address concrete problems (e.g., detecting opinions) and are separate from each other, which allows for easy reuse of components. In this way, for creating a concrete business

application system, implementation details of single services are not required. A system developer does not need to be a knowledge engineer and does not necessarily need knowledge about text analysis functionalities. This facilitates the implementation process and reduces the required knowledge for creating such applications.

Regarding system requirements, the architecture is open for integration of new services. A flexible handling of results is ensured by allowing further processing of results of single processing services, communication of results to the client application or their use in additional external applications. Since we refrained from discussing implementation details, ensuring the maintenance of resources depends on the final implementation of the architecture. We conclude that the previously collected requirements are fulfilled by the proposed architecture.

This paper focuses on a general architecture and refrains from implementation details. We are aware that there are different approaches to realise service-oriented architectures such as WS-* or RESTful. Both may have their benefits and shortcomings when they are used for implementing the proposed architecture. We leave it to the system developer to decide on the best suited approach. The proposed architecture is new in the sense that existing research as presented in section II either focuses on one concrete business application and uses a service-oriented architecture to realise this application, or addresses specific NLP tasks and refrains from providing a general architecture for complete application systems. The architecture presented in this paper provides the more general view on these applications. Single NLP systems or modules as produced with GATE or UIMA can be integrated in the business application as NLP services. We believe that the business application systems presented by others (see section II) can be realised by applying our architecture and that most of the services can be reused in various applications.

VIII. CONCLUSION

In this work, we described a general architecture that allows combining text analysis and external applications within a business application. We specified groups of necessary components and introduced some potential instances of such components. The architecture is new in the sense that it provides the general picture of business applications that use text analysis. By considering this architecture in business application development, re-use of components is ensured and development is facilitated. In future work, we will have a closer look into the user interaction with such systems. Further, we will focus on developing concrete applications addressing the concrete problems described in this paper and applying the proposed architecture to show that useful text-analytics-enabled business applications can be built efficiently by considering the proposed architecture.

REFERENCES
[1] M. Kowalkiewicz and K. Jünemann, “Yowie: Information extraction in a service enabled world,” in ICSOC, ser. Lecture Notes in Computer Science, A. Bouguettaya, I. Krüger, and T. Margaria, Eds., vol. 5364, 2008, pp. 732–733.
[2] D. Maynard, M. Yankova, R. Kourakis, and A. Kokossis, “Ontology-based information extraction for market monitoring and technology watch,” in ESWC Workshop End User Aspects of the Semantic Web, 2005.
[3] J. Nie, F. Paradis, and A. Tajarobi, “Discovery of business opportunities on the internet with information extraction,” in Workshop on Multi-Agent Information Retrieval and Recommender Systems (IJCAI), Edinburgh, Scotland, 2005, pp. 47–54.
[4] L.-J. Zhang, “Editorial: Modern services engineering,” in IEEE Transactions on Services Computing (October-December 2009), vol. 2(4), 2009, p. 276.
[5] M. P. Papazoglou and W.-J. Heuvel, “Service oriented architectures: approaches, technologies and research issues,” The VLDB Journal, vol. 16, no. 3, pp. 389–415, 2007.
[6] F. Cerbah and B. Daille, “A service oriented architecture for adaptable terminology acquisition,” in NLDB, ser. Lecture Notes in Computer Science, Z. Kedad, N. Lammari, E. Métais, F. Meziane, and Y. Rezgui, Eds., vol. 4592. Springer, 2007, pp. 420–426.
[7] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien, “Semtag and seeker: bootstrapping the semantic web via automated semantic annotation,” in WWW ’03: Proceedings of the 12th international conference on World Wide Web. New York, NY, USA: ACM, 2003, pp. 178–186.
[8] T. Stajner, D. Rusu, L. Dali, B. Fortuna, D. Mladenic, and M. Grobelnik, “Enrycher: service oriented text enrichment,” in Proceedings of the 11th International multi-conference Information Society IS-2009, Ljubljana, Slovenia, 2009.
[9] R. Witte and T. Gitzinger, “A general architecture for connecting NLP frameworks and desktop clients using web services,” in NLDB ’08: Proceedings of the 13th international conference on Natural Language and Information Systems. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 317–322.
[10] L. Jastrzebski, M. Piasecki, and G. Strzelecki, “Distributed service-oriented architecture for information extraction system semanta,” in Proceedings of the 5th international Conference on intelligent Systems Design and Applications. Washington, DC: ISDA. IEEE Computer Society, 2005, pp. 61–66.
[11] A. Mikroyannidis, B. Theodoulidis, and A. Persidis, “Parmenides: Towards business intelligence discovery from web data,” in IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), 2006, pp. 1057–1060.
[12] H. Saggion, A. Funk, D. Maynard, and K. Bontcheva, “Ontology-based information extraction for business intelligence,” in LNCS, vol. 4825, Berlin, Heidelberg, 2007, pp. 843–56.
[13] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. http://www.nltk.org/book: O’Reilly Media Inc., 2009.
[14] T. Obrebski and M. Stolarski, “UAM text tools - a flexible NLP architecture,” in Proceedings of LREC 2006, Genova, 2006.
[15] C. Grover et al., “A framework for text mining services,” in Proceedings of the Third UK e-Science Programme All Hands Meeting (AHM 2004), 2004, p. 67.
[16] H. Cunningham, “GATE, a general architecture for text engineering,” Computers and the Humanities, vol. 36, pp. 223–254, 2002.
[17] A. Ferrucci and A. Lally, “UIMA: an architectural approach to unstructured information processing in the corporate research environment,” Natural Language Engineering, vol. 10, pp. 327–48, 2006.
[18] M. Golfarelli, S. Rizzi, and I. Cella, “Beyond data warehousing: what’s next in business intelligence?” in DOLAP ’04: Proceedings of the 7th ACM international workshop on Data warehousing and OLAP. New York, NY, USA: ACM, 2004, pp. 1–6.
[19] B. de Ville, Microsoft Data Mining: Integrated Business Intelligence for e-Commerce and Knowledge Management. Boston: Digital Press, 2001.
[20] R. Baumgartner, O. Froelich, G. Gottlob, P. Harz, M. Herzog, and P. Lehmann, “Web data extraction for business intelligence: the lixto approach,” 2005, pp. 30–47.
[21] A. Sureka, S. De, and K. Varma, “A generic software architecture of a text processing system for analyzing product warranty claims data,” in Compute ’08: Proceedings of the 1st Bangalore annual Compute conference. New York, NY, USA: ACM, 2008, pp. 1–4.
