Technology

Methods

Fig. The input is converted into a CAS by a collection reader (CR), processed further by a number of analysis engines (AEs), and finally written back by a collection consumer (CC).

The distinctive feature of UIMA-HPC is its flexible, generic approach, which makes it applicable to any kind of UIMA pipeline, to workflows composed of such pipelines, and to whatever compute resources are available.

UIMA pipelines are the basic building blocks of information extraction workflows. Apache UIMA provides a native Java framework for mining unstructured data. A UIMA application is organized as a Collection Processing Engine (CPE) that consists of a UIMA Collection Reader (CR), one or more UIMA Analysis Engines (AEs), and one Collection Consumer (CC). The analyzed artifact (e.g. text or binary data) is stored in UIMA's internal data structure, the Common Analysis Structure (CAS). The framework also provides convenience methods for serializing CAS objects (XCAS) so that they can be stored persistently on disk. These stored XCAS files can then be read again by a CR. Our implementation exploits this procedure to transport data between physically separated hardware nodes.
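The reader/engine/consumer structure and the file-based hand-off described above can be illustrated with a minimal sketch in plain Java. It deliberately does not use the Apache UIMA API: `PipelineSketch`, `SimpleCas`, and the line-based file layout are hypothetical stand-ins for the CPE, the CAS, and XCAS serialization, and the sketch assumes single-line artifact text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class PipelineSketch {

    // Hypothetical stand-in for a CAS: the artifact text plus the annotations added so far.
    static final class SimpleCas {
        final String text;
        final List<String> annotations = new ArrayList<>();

        SimpleCas(String text) {
            this.text = text;
        }
    }

    // "Collection reader": restores a stored artifact from disk.
    // Assumed file layout: first line is the text, remaining lines are annotations.
    static SimpleCas read(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        SimpleCas cas = new SimpleCas(lines.isEmpty() ? "" : lines.get(0));
        cas.annotations.addAll(lines.subList(Math.min(1, lines.size()), lines.size()));
        return cas;
    }

    // "Collection consumer": writes the artifact back so that another node can pick it up.
    static void write(SimpleCas cas, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(cas.text);
        lines.addAll(cas.annotations);
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // One pipeline run: read, apply every "analysis engine" in order, write.
    static void run(Path in, Path out, List<Consumer<SimpleCas>> engines) throws IOException {
        SimpleCas cas = read(in);
        for (Consumer<SimpleCas> engine : engines) {
            engine.accept(cas);
        }
        write(cas, out);
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("artifact", ".txt");
        Path out = Files.createTempFile("annotated", ".txt");
        Files.write(in, List.of("aspirin inhibits COX-1"), StandardCharsets.UTF_8);

        // Toy engine: records whether a known term occurs in the text.
        Consumer<SimpleCas> termEngine = cas -> {
            if (cas.text.contains("aspirin")) {
                cas.annotations.add("Chemical:aspirin");
            }
        };
        run(in, out, List.of(termEngine));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

Because each run starts from a file and ends with a file, the output of one run can serve as the input of another run on a different node, which is the property the data transport relies on.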

 

Information extraction from chemical patents

The goal of the research project UIMA-HPC is to automate, and hence speed up, knowledge mining in patents. Multi-threaded analysis engines, developed according to the UIMA (Unstructured Information Management Architecture) standard, process the text and images of thousands of documents in parallel. The workflow control and execution capabilities of UNICORE (UNiform Interface to COmputing Resources) make it possible to allocate resources dynamically for each task and thus obtain the best ratio of CPU time to real time in an HPC environment.
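Document-level parallelism of this kind can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `ParallelDocuments` and `processAll` are hypothetical names, the token-counting analysis is a toy stand-in for an analysis engine, and neither the UIMA nor the UNICORE API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelDocuments {

    // Applies `analysis` to every document using a fixed pool of `threads` workers.
    // Results are returned in input order, regardless of completion order.
    static <R> List<R> processAll(List<String> documents, Function<String, R> analysis, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String doc : documents) {
                Callable<R> task = () -> analysis.apply(doc);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get()); // blocks until this document is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Size the pool to the cores of the current node.
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> docs = List.of("patent A text", "patent B", "patent C full text");
        // Toy analysis: count whitespace-separated tokens per document.
        List<Integer> tokenCounts = processAll(docs, d -> d.split("\\s+").length, cores);
        System.out.println(tokenCounts);
    }
}
```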

All UIMA components (CPE, CR, AE and CC) are specified by XML descriptor files, which contain fixed, predefined internal paths. A Grid system, however, requires network paths to be handled dynamically. We therefore use uimaFIT to generate all XML specifications at run time of a UIMA pipeline. With uimaFIT, the uniform resource identifiers (URIs) that the UIMA Java classes need to import can be adapted dynamically to any location. All our integrated pipelines are provided as Java archive (jar) files and run platform-independently on different operating systems. The UIMA framework architecture makes it easy to integrate existing software and to replace AEs within different UIMA pipelines.
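The idea of resolving locations at run time, rather than storing them in descriptor files, can be illustrated with a small plain-Java sketch. It does not show the uimaFIT API itself; `RuntimeLocations`, the parameter names (`inputDir`, `dictionary`, `outputDir`), and the relative paths are assumptions made for illustration.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

final class RuntimeLocations {

    // Resolves relative resource names against a base directory chosen at launch time
    // and returns absolute file URIs that could be handed to pipeline components.
    static Map<String, URI> resolve(Path baseDir, Map<String, String> relativeResources) {
        Path base = baseDir.toAbsolutePath().normalize();
        Map<String, URI> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : relativeResources.entrySet()) {
            resolved.put(entry.getKey(), base.resolve(entry.getValue()).normalize().toUri());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // The base directory comes from the job (first argument), not from a descriptor file.
        Path baseDir = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, String> resources = new LinkedHashMap<>();
        resources.put("inputDir", "in");                    // hypothetical parameter names
        resources.put("dictionary", "resources/terms.txt");
        resources.put("outputDir", "out");
        resolve(baseDir, resources)
                .forEach((name, uri) -> System.out.println(name + " = " + uri));
    }
}
```

The same jar can then be started in any working directory on any node, since nothing in it refers to a fixed file-system location.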

Existing Components

A workflow that demonstrates all UIMA components available at this time. Collection reader pipelets are shown in green, analysis engines in blue, and consumers in orange. The UIMA-View converter is shown in dark grey. Open-source software such as the OpenNLP components is seamlessly integrated into one workflow together with proprietary software components (ProMiner, chemoCR) by sharing the same UIMA type system.

Examples of implemented UIMA pipelines to process documents with medical and chemical content

Input | Integrated 3rd Party Software | Function | Annotations | Output
PDF | ABBYY FineReader (CLI) | OCR | SourceDocument Information | XCAS
PDF | PDFBox, iText | Text extraction | SourceDocument Information | XCAS
XCAS | ProMiner | Dictionary-based annotation | Chemistry, Diseases, Genes | XCAS
XCAS | Linda | Machine learning (ML) based annotation | Diseases, Genes, IUPAC terms | XCAS
XCAS | OSCAR | Dictionary- and ML-based annotation of chemical terms | Chemical terms | XCAS
XCAS | iText, PDFBox | Generating annotated PDF | - | Enriched PDF
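Because most pipelines in the table both read and write XCAS, they can be chained in almost any order. A minimal sketch of such a consistency check is given below; `WorkflowCheck`, `Step`, and `firstMismatch` are hypothetical names, and the input and output formats are taken from the table.

```java
import java.util.List;

final class WorkflowCheck {

    // One pipeline in a workflow, described only by its input and output format.
    static final class Step {
        final String name;
        final String input;
        final String output;

        Step(String name, String input, String output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    // Returns the index of the first step whose input format differs from the
    // previous step's output format, or -1 if the whole chain is consistent.
    static int firstMismatch(List<Step> steps) {
        for (int i = 1; i < steps.size(); i++) {
            if (!steps.get(i).input.equals(steps.get(i - 1).output)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Step> workflow = List.of(
                new Step("Text extraction", "PDF", "XCAS"),
                new Step("ProMiner", "XCAS", "XCAS"),
                new Step("OSCAR", "XCAS", "XCAS"),
                new Step("Generating annotated PDF", "XCAS", "Enriched PDF"));
        int bad = firstMismatch(workflow);
        System.out.println(bad < 0
                ? "workflow is consistent"
                : "format mismatch at step: " + workflow.get(bad).name);
    }
}
```

Swapping the two annotation steps keeps the chain valid, whereas placing the PDF generation step before an annotation step does not, since its output is no longer XCAS.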

UIMA and UNICORE

Fig. Complete architecture of the coupling between UIMA and UNICORE

To make UIMA pipelines on distributed, heterogeneous resources accessible through UNICORE, the pipelines have to meet the following requirements:

  • Installed on the target system,
  • Executable as stand-alone applications,
  • No hard-coded paths in file descriptors.


The overall architecture is shown in Figure 2. Because UIMA is a native Java library, it is cross-platform and can be installed on UNIX and Microsoft Windows based servers. The prerequisite is an installed Java Virtual Machine, version 6 or higher.

A UIMA pipeline is provided as a Java jar archive, which has to be available on a server’s file system. The input and output data are serialized CAS objects in an XML format (C). Using this one format for all pipelines leaves the choice of annotations and their order in a workflow unrestricted. The Java archive is made available through UNICORE by defining it as an application resource (B). Upon execution, UNICORE calls the jar archive via a system call using the standard arguments of the Java virtual machine. The XML application configuration files support any number of arguments, which can be defined on the client side, separately for every job, prior to execution. UIMA supports multithreading of embedded components, which makes it possible to use all cores of a node in the execution environment.
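The system call described above has the general shape `java [JVM options] -jar <archive> [application arguments]`. The following sketch only assembles such a command line; it is not UNICORE code, and `JarCommand`, `pipeline.jar`, and the `--input`/`--output` arguments are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class JarCommand {

    // Builds the argument vector: java <jvmOptions> -jar <jarPath> <appArgs>.
    static List<String> build(List<String> jvmOptions, String jarPath, List<String> appArgs) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.addAll(jvmOptions);
        command.add("-jar");
        command.add(jarPath);
        command.addAll(appArgs);
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build(
                List.of("-Xmx4g"),                                // standard JVM option
                "pipeline.jar",                                   // hypothetical archive name
                List.of("--input", "in", "--output", "out"));     // hypothetical job arguments
        // new ProcessBuilder(command).inheritIO().start() would launch the pipeline;
        // the sketch only prints the command line.
        System.out.println(String.join(" ", command));
    }
}
```

Keeping the per-job arguments separate from the archive path mirrors the description above: the archive is fixed as an application resource, while the arguments vary from job to job.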