






Averbis Information Discovery: User Manual

Version 5.12, 04/05/2019


Overview

Averbis Information Discovery is a leading text analytics and machine learning platform that allows you to gain insights into your structured and unstructured data and to explore important information in a highly flexible way. Averbis Information Discovery collects and analyzes all kinds of documents, such as patents, research literature, databases, websites, and other enterprise repositories.

By parsing and analyzing content and creating a searchable index, Averbis Information Discovery helps you perform text analytics across all relevant data on the internet and in your enterprise, and makes that data available for analysis and search. It allows you to explore facts and relationships across many sources that would otherwise remain hidden in unstructured data.

Getting started

General Administration

Users with administration rights can create new users and projects. When these users are logged in, they can see the "Project administration" and "User administration" areas.




Figure 1: Home page of an administration user


Project administration

In the project administration area, you first see a list with all projects that are currently available in the system.




Figure 2: Overview of created projects


  • Name: name of the project. The name also acts as a link to the corresponding project's overview page.

  • Description: description of the project.

  • Operations | Edit project: this allows you to modify the name and the description of the project.

  • Operations | Delete project: this allows you to delete a project.

Below the table is a button that you can use to create a new project.

User administration

In the user administration area, you first see a list with all user accounts that are currently available in the system. This list can be filtered using the text box on the top left.




Figure 3: Overview of registered users.


  • Username: the user’s login name.

  • Lastname: the user’s last name.

  • Firstname: the user’s first name.

  • Email: the user’s email address.

  • Blocked: if a user is temporarily blocked, a padlock icon is displayed here.

  • Administrator: if the user is an administrator, a checkmark is displayed here.

  • Operations | Rights: using this button you can see an overview of the rights that the user currently has. Rights cannot be edited here. Editing rights is done using the corresponding button in each project.

  • Operations | Edit: in the Edit dialog, you can edit the user profile data (firstname, lastname, email address). You can also use this dialog to block a user.

  • Operations | Change password: this allows you to enter a new user password.

  • Operations | Delete user: this allows you to delete the user.

Below the table is a button that you can use to create a new user.

Add and/or edit users

Use the 'Create new user' or 'Edit user' button to open a dialog and edit the user’s metadata.




Figure 4: Create new user.


In addition to editing the profile metadata, you can also assign an initial password when creating the user (to edit the password of an already existing user, please use the corresponding 'Change password' button in the user administration overview table).

You can also use this dialog to block the user.

Change password

Using the 'Change password' button, you can open a dialog which allows you to enter a new password.




Figure 5: Changing the password of an existing user.


General guidelines

When a user without global administration rights opens the application, his/her home page contains an overview of the projects assigned to this user (My projects). The project names act as links to the corresponding projects. On the project overview page, the user can find all the functions for which he/she has the relevant project rights.




Figure 6: Home page of a non-administrator user


After selecting a project, a page is displayed with a list of all the modules in the project. This list is also available on other pages via the project navigation menu in the upper right area.




Figure 7: Overview page of a project with buttons for opening each module.


Language and web interface localization

The web interface is currently available in German and English. The language is recognized automatically from the browser or the system settings of your operating system, and the user interface is displayed in the corresponding language.

Outer navigation bars

The top and left side outer navigation bars can be hidden when required. This saves space when the navigation tools are not required. To show/hide the navigation bars, click the small menu icon on the upper right edge of the application.




Figure 8: Menu icon to show/hide the outer navigation bars

Keyboard Shortcuts

To simplify working with the application, some functions are implemented with keyboard shortcuts. Press Shift + ? to display a summary of the defined shortcuts.


Figure 9: Summary of all defined keyboard shortcuts. Open with Shift + ?


Flash messages

To provide information about the progress and outcome of processes, or to display general information, standard flash messages are shown. The background color of a flash message depends on its category: information messages are blue, success messages green, and error messages red. Flash messages disappear automatically after a few seconds; flash messages that display errors, however, remain visible until they are closed manually by the user.




Figure 10: Flash messages that display errors are closed by clicking the cross mark in the top right corner.


Documentation

Complete user documentation is available that describes the functionality of each component. This documentation can be accessed directly from the help menu in the navigation bar on the left side of the web interface.

Embedded help

In addition to the complete online help, information is embedded directly in the interface in several places. You can access it wherever you see a blue question mark on a white background; move the mouse cursor over the question mark to display it.




Figure 11: Embedded help


Connector Management & Document Import

Managing Standard Connectors

Connectors are used to import documents into the system. A connector monitors a specific resource (such as a file system or a database), automatically imports new documents and picks up changes, so that imported documents are kept in sync with the document source. Connectors can also be scheduled to run at certain times of day, for example to import and update documents only at night and thus reduce system load during office hours.

Connectors can be created and administered on the connector management page. The figure below shows the connector management page with a list of all connectors that have been created within the current project:




Figure 12: Overview of all connectors.


  • Connector: The name of the connector.

  • Type: The connector type. For example file connector or database connector.

  • Active: Indicates whether the connector is active. Only active connectors import and update documents.

  • Schedules: Displays the periods of time in which the connector is active. 0-24 means that the connector is active 24 hours a day.

  • Statistics: The statistics show the following values:

    • Documents whose URLs have been reported by the connector.

    • Documents that have already been requested by the connector and whose contents have been received.

    • Documents that have already been enriched with metadata.

    • Documents that have already been saved.

  • Actions | Start connector: Starts the connector.

  • Actions | Stop connector: Stops the connector.

  • Actions | Reset connector: If you reset a connector, all documents from this connector are re-imported.

  • Actions | Edit connector: Opens the edit connector dialog. All parameters except the connector name can be edited.

  • Actions | Edit mapping: Opens the edit mapping dialog, where connector metadata fields like title and content can be mapped to document fields.

  • Actions | Schedule connector: Opens the schedule dialog.

  • Actions | Delete documents of connector: Deletes all documents that have been imported by the connector.

  • Actions | Delete connector: Deletes the connector. All documents that have been imported by the connector will be deleted as well.

To create a new connector, first select the connector type. After clicking the 'Create connector' button, the connector can be configured in the 'Create new connector' dialog. Please refer to the connector-specific documentation for further details.


File System Connector

A file system connector imports documents from file system resources. It monitors one or multiple directories (including sub-directories) and imports documents from files in these directories. The following file types are supported:

  • .txt
  • .pdf
  • .doc/docx
  • .ppt/pptx
  • .xls/xlsx
  • .html

There are currently two implementations: FileConnectorType and AverbisFileConnectorType. The AverbisFileConnectorType remembers its current position when stopped, so that it does not start over from the beginning when restarted.

A file system connector can be configured using the following parameters:

  • Name: Name of the connector. The name can be chosen freely and serves, e.g., as the label within the connector overview. It must not contain spaces, special characters, or underscores.

  • Start paths: In each line, you can specify a file system path that is taken into account by the connector. The connector traverses these directories recursively, i.e. all subdirectories are considered.

  • Exclude pattern: Here you can specify patterns to exclude certain files or file types (blacklist).

  • Include pattern (optional): Here you can specify patterns to include only certain files or file types (whitelist).
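
A hypothetical configuration might look as follows (paths and patterns are illustrative only; consult the connector documentation for the exact pattern syntax):

Start paths:
  /data/import/reports
  /data/import/archive

Exclude pattern: .*\.tmp
Include pattern: .*\.(pdf|txt)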

Database Connector

With a database connector, structured data can be imported via a database connection. The database connector supports JDBC-compliant databases and can crawl database tables using SQL queries. Each row of the SQL query result is treated as a separate document. The database connector keeps track of changes that are made to the database tables and automatically synchronizes these changes into Averbis Information Discovery.

In order to use the database connector, the database's JDBC driver has to be provided to the Tomcat server instance that is running Averbis Information Discovery. Please ask your system administrator to put the JDBC driver library into Tomcat's lib directory.


The database connector can be configured using the following parameters:

  • Name: Name of the connector. The name can be chosen freely and serves, e.g., as the label within the connector overview. It must not contain spaces, special characters, or underscores.

  • JDBC Driver Classname: Fully qualified class name of the database JDBC driver. E.g. com.mysql.jdbc.Driver

  • JDBC Connection URL: JDBC connection URL to the database. E.g. jdbc:mysql://localhost:3306/documentDB

  • Username: Database username.

  • Password: Database password.

  • Traversal SQL Query: SQL select query. E.g. SELECT id, title, content FROM documents

  • Primary Key Fields: Name of the column that represents the primary key and identifies a table row. E.g. id
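
Taken together, a hypothetical configuration for a MySQL source could look like this (values follow the examples above; the username is illustrative):

Name:                  DocumentDatabaseConnector
JDBC Driver Classname: com.mysql.jdbc.Driver
JDBC Connection URL:   jdbc:mysql://localhost:3306/documentDB
Username:              crawler
Password:              ********
Traversal SQL Query:   SELECT id, title, content FROM documents
Primary Key Fields:    id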

The database connector's default field mapping concatenates all queried columns (like id, title and content) and maps them into the document field named content. The field mapping can be configured in the connector field mapping dialog (see section Editing field mappings for further details). The figure below shows a custom field mapping that maps the database columns to document fields: the id column is mapped to the document_name field, while title and content are mapped to document fields of the same name.
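
Conceptually, this custom mapping transforms each result row as in the following Python sketch (hypothetical values):

# One row from the traversal query above.
row = {"id": "24552733", "title": "Some title", "content": "Some content"}

# Custom mapping: id -> document_name; title and content keep their names.
document = {
    "document_name": row["id"],
    "title": row["title"],
    "content": row["content"],
}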




Figure 13: Database connector custom field mapping.

Editing field mappings

Connectors read different sources and extract structured data from them. The extracted data is then written to the fields of a Solr core. Field mappings define which information from the original documents is written to which fields of the Solr index.

Specific default mappings can be specified for each index and connector throughout the system. These are automatically taken into account when a new connector is created.

When editing the field mappings, select a connector field on the left. On the right, select the core field into which you want the connector to write this data. All core fields that have been activated in the Solr schema configuration and are writable are available here. In addition to editing the default mappings, you can also specify further mappings or remove existing ones.

You can also specify an order for the mappings. This order is relevant when multiple connector fields are mapped to the same core field. If the core field can contain more than one value, the values are stored in the field in the order specified here. If the core field can contain only one value, the value written is the one from the mapping that is lowest in the sequence. For example, if two hypothetical mappings author and editor both target a multi-valued core field creator, the author values are stored before the editor values when author is listed first.

After you have edited a field mapping, you must reset the connector so that the changes to the mapping are taken into account.




Figure 14: Editing field mappings.


There are currently three different mapping types:

  • Copy Mapping: The default type: the connector field is mapped 1:1 to the specified document field.

  • Constant Mapping: Instead of a connector field, a constant value can be mapped to a document field.

  • Split Mapping: The value of a connector field is divided into several values by a separator character that you specify. This can be used to convert comma-separated lists into multi-valued document fields (see the sketch below).
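
As an illustration, a Split Mapping behaves like the following Python sketch (hypothetical values; the separator character and the whitespace handling are assumptions of this sketch, the actual behaviour is configured in the mapping dialog):

# Hypothetical connector field value holding a separated list.
raw_value = "Groundwater; Isotopes; Rivers"
separator = ";"

# The Split Mapping divides the value at the separator; each part becomes
# one value of a multi-valued document field.
document_field_values = [part.strip() for part in raw_value.split(separator)]
print(document_field_values)  # ['Groundwater', 'Isotopes', 'Rivers']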

Document Import

In addition to defining connectors that monitor and crawl different document sources, it is also possible to import pre-structured data into a search engine index. Unlike connectors, this data is imported once, i.e. no subsequent synchronization takes place.

Manage document imports

Any number of document sets can be imported into the application and deleted if necessary. For each set of imported documents, known as an import batch, you see a row in the overview table. In addition to the name of the import batch, you can also see how many documents are part of the batch. The status indicates whether the import is still running, was successful, or has failed.




Figure 15: Overview of all previously imported document batches.


Below the overview table you will find the form elements to import a new document set. To do this, enter a name and click the Browse button. A window opens in which the local file system is displayed.

You can import single files as well as ZIP archives containing several files. Make sure that there are no (hidden) subdirectories in such a ZIP file and that the files have the correct file extensions.


These import formats are currently available:

Text Importer

Text importers can be used to import any plain text files. The complete content of the file is imported into a single field. The file name is available later as metadata.

CAS Importer

Allows the import of serialized UIMA CAS files (currently as XMI). This way, for example, documents can be imported as gold standards.

Please note that the type system of such a CAS has to be compatible with the type system of Averbis Information Discovery.


Solr XML Importer

A simple XML format that allows the import of pre-structured data. During the import, the fields defined in the XML are written to search index fields of the same name. Please make sure that the field names in the XML file correspond to the field names of the search index associated with your project.

A special feature is the ability to import images along with documents and display them together. To upload images, pack the XML document(s) together with the images into a ZIP archive. Each document can then contain any number of image_reference fields; relative paths to the images are expected. Images can be stored in any subfolder within the ZIP archive. Supported image formats are .gif, .png, .jpg and .tif.

...
<field name="image_reference">images/image.png</field>
<field name="image_reference">./images/pics/picture.png</field>
...

An example of the supported import format is shown below:

<?xml version='1.0' encoding='UTF-8'?>
<!--Averbis Solr Import file generated from: medline15n0771.xml.gz-->
<update>
  <add>
    <doc>
      <field name="id">24552733</field>
      <field name="title">Treatment of sulfate-rich and low pH wastewater by sulfate reducing bacteria with iron shavings in a laboratory.</field>
      <field name="content">Sulfate-rich wastewater is an indirect threat to the environment especially at low pH. Sulfate reducing bacteria (SRB) could use sulfate as the terminal electron acceptor for the degradation of organic compounds and hydrogen transferring SO(4)(2-) to H2S. However their acute sensitivity to acidity leads to a greatest limitation of SRB applied in such wastewater treatment. With the addition of iron shavings SRB could adapt to such an acidic environment, and 57.97, 55.05 and 14.35% of SO(4)(2-) was reduced at pH 5, pH 4 and pH 3, respectively. Nevertheless it would be inhibited in too acidic an environment. The behavior of SRB after inoculation in acidic synthetic wastewater with and without iron shavings is presented, and some glutinous substances were generated in the experiments at pH 4 with SRB culture and iron shavings.</field>
      <field name="tag">Hydrogen-Ion Concentration; Iron; Oxidation-Reduction; Sulfur-Reducing Bacteria; Waste Water; Water Purification</field>
      <field name="author">Liu X, Gong W, Liu L</field>
      <field name="descriptor">Evaluation Studies; Journal Article; Research Support, Non-U.S. Gov't</field>
    </doc>
    <doc>
      <field name="id">24552734</field>
      <field name="title">Environmental isotopic and hydrochemical characteristics of groundwater from the Sandspruit Catchment, Berg River Basin, South Africa.</field>
      <field name="content">The Sandspruit catchment (a tributary of the Berg River) represents a drainage system, whereby saline groundwater with total dissolved solids (TDS) up to 10,870 mg/l, and electrical conductivity (EC) up to 2,140 mS/m has been documented. The catchment belongs to the winter rainfall region with precipitation seldom exceeding 400 mm/yr, as such, groundwater recharge occurs predominantly from May to August. Recharge estimation using the catchment water-balance method, chloride mass balance method, and qualified guesses produced recharge rates between 8 and 70 mm/yr. To understand the origin, occurrence and dynamics of the saline groundwater, a coupled analysis of major ion hydrochemistry and environmental isotopes (d(18)O, d(2)H and (3)H) data supported by conventional hydrogeological information has been undertaken. These spatial and multi-temporal hydrochemical and environmental isotope data provided insight into the origin, mechanisms and spatial evolution of the groundwater salinity. These data also illustrate that the saline groundwater within the catchment can be attributed to the combined effects of evaporation, salt dissolution, and groundwater mixing. The salinity of the groundwater tends to vary seasonally and evolves in the direction of groundwater flow. The stable isotope signatures further indicate two possible mechanisms of recharge; namely, (1) a slow diffuse type modern recharge through a relatively low permeability material as explained by heavy isotope signal and (2) a relatively quick recharge prior to evaporation from a distant high altitude source as explained by the relatively depleted isotopic signal and sub-modern to old tritium values. </field>
      <field name="tag">Groundwater; Isotopes; Rivers; Salinity; South Africa; Water Movements</field>
      <field name="author">Naicker S, Demlie M</field>
      <field name="descriptor">Journal Article; Research Support, Non-U.S. Gov't</field>
    </doc>
  </add>
</update>


Text Analysis

Text analysis is one of the core components of Averbis Information Discovery. This chapter describes how text analysis pipelines are created, configured, distributed to remote systems, and monitored. It also describes what options Averbis Information Discovery provides for evaluating and optimizing text analysis results.

Pipeline Configuration

The text analysis components and pipelines used in Averbis Information Discovery can be administered and monitored graphically in a centralized way. This is done in the Pipeline Configuration module.




Figure 16: Link for opening the graphical configuration of text analysis components.


The overview page lists all the text analysis pipelines available in the project. The following information and operations are provided in the table.

  • "Pipeline Name": name of the pipeline.

  • "Status": Status of the pipeline: STOPPED, STARTING or STARTED. As soon as the pipeline has started, it reserves system resources. Only after it has started does it accept analysis requests.

  • "Preconfigured": indicates whether the pipeline is a preconfigured pipeline. These pipelines cannot be edited.

  • "Throughput": here, two indicators for the pipeline throughput are given: the total number of processed texts, and the average number of processed texts per second. The statistics are reinitialized each time the pipeline stops/starts.

  • "Operations | Initialize pipeline" : this is used to initialize a pipeline. As soon as it has been initialized, it can process texts.

  • "Operations | Stop pipeline" : to save system resources, pipelines can also be stopped.

  • "Operations | Edit pipeline" : this is used to configure a pipeline, for example to add other components to it, to remove them or to modify their configuration parameters. Pipelines can only be edited when they are stopped.

  • "Operations | Update pipeline" : this is used to update the statistics (throughput) and status of the pipeline.

  • "Operations | Delete pipeline" : this allows pipelines to be permanently deleted, if they are no longer needed.




Figure 17: Overview of all available text analysis pipelines in the project.


To create new pipelines, use the 'Create pipeline' button below the overview table.

Pipeline details

Via the pencil icon in the operations column of the overview table, you can access the details page of a pipeline. At the top left, all components are displayed in the order in which they are used in the pipeline.

To the right of each component name, you can see component-specific throughput data: the total number of processed texts and the average number of texts per second. By clicking on a component, you can display all of its configurable parameters.




Figure 18: Detail view of an initialized pipeline.


As long as a pipeline is running, it cannot be edited. When you stop a non-preconfigured pipeline, you can reconfigure it on the details page. Instead of the throughput data, buttons are now displayed that can be used to remove components from the pipeline or move them to another position within the pipeline. The individual configuration parameters of the components are now also editable. Additional components can be added to the pipeline from the area on the right side.




Figure 19: Editing a pipeline.


The right-hand area with the available components is itself divided into several blocks: Preconfigured Annotators, PEAR Components and Available Annotators.

Preconfigured Annotators

Preconfigured annotators are annotators that Averbis has already preconfigured for a specific purpose. For example, a diagnostic annotator is nothing more than a GenericTerminologyAnnotator preconfigured with a diagnosis dictionary. Preconfigured annotators can also be made up of several components, i.e. they can be an aggregate of several components. This makes it possible to present components with complex interdependencies to the end user in a clear way.

PEAR components

PEAR components are components added by users. They can be integrated into pipelines just like the preconfigured or available annotators. More on this in the chapter Managing / Adding new text analysis components.

Available Annotators

The list of available annotators contains all general, i.e. not preconfigured, components detected in Averbis Information Discovery's component repository.

Managing / Adding new text analysis components

The application allows you to add new text analysis components at runtime; there is no need to reinstall or redeploy the application. For this purpose, so-called UIMA™ PEAR (Processing Engine ARchive) components are used. PEAR is a packaging format that allows text analysis components to be shipped alongside all needed resources in a single artifact.

You can find a list of all available PEAR components in the Pipeline Configuration, where you configure your text analysis pipeline. Adding new components is done within the Textanalysis: Components module.




Figure 20: Show and import UIMA PEAR components.

Text Analysis Processes

Any number of text analysis results can be generated and stored for all known document sources in Averbis Information Discovery. Text analysis results can be created either automatically through pipelines or manually. This way, you can obtain different semantic views of the same document, which enables you to evaluate several views side by side.




Figure 21: Overview of all currently created text analysis tasks.


The table contains the following columns:

  • "Type": indicates whether this is a manual or automatic text analysis.

  • "Name": name of the process. For example Demo - anatomy

  • "Status": Status of the process. It is either RUNNING or IDLE.

  • "Document source": the document source to which the task refers. In parentheses after the name is the number of processed fields. For example, if two fields (content and title) are processed in a corpus of 3000 documents, then 6000 will be indicated here at the end of the task.

  • "Pipeline": in the case of an automatic text analysis, the pipeline that was used for the text analysis is indicated here.

  • Download: Downloads the whole result as a set of UIMA XMI files.

  • Delete: Deletes the whole process and all its results.

When you create a new task, you can select whether it is a manual or an automatic text analysis.




Figure 22: Creating a new text analysis task: manual or automatic text analysis.


If you choose automatic text analysis, you are requested to give your text mining process a name and to specify the document source and the pipeline to be used.




Figure 23: Creating a new automated text analysis process: Give your process a name and enter the document source and the pipeline you want to use.

Annotation Editor: Viewing and Editing Annotations

To be able to make a judgment about text analysis components, it is frequently essential to have the results displayed graphically. You may also want to correct text analysis results manually or annotate documents completely manually, for example to create gold standards, which are then used to evaluate text analysis components. For all these purposes, the Annotation Editor can be used.

Viewing annotations inside a document source

The Annotation Editor can be used to display text analysis results graphically. Using the annotation editor, all documents from a document source can be easily viewed, section by section, and all annotations can be graphically highlighted.

In the Annotation Editor, you first select a document source (1). If document names have been given to the documents in the source, the name of the first document in the source is displayed (2). You then select the text analysis process that you wish to view (3).

Once you have selected the source and the text analysis, the first document in the corpus is displayed, section by section. Above the text, there is a checkbox for each available annotation that graphically highlights the content of that annotation (4). Using the right-hand checkbox (5), you can highlight all annotations at once or reset the highlighting of all annotations.

In the main window (6), you can see the corresponding section of the document with the currently activated highlights. Below the main window, there are buttons for navigating through the individual sections of a document (7). Above it there are similar buttons, which you can use to navigate between the individual documents in a source (8).




Figure 24: Displaying the annotations in the documents of a document source.


A table with a list of all the currently highlighted annotations can be displayed on the right of the main window.




Figure 25: Overview table of annotations.


To provide a better connection between the table and the graphical highlighting in the text, annotations from the table can be given special emphasis in the text. To do this, set the checkbox in front of the name of the relevant annotations. The corresponding annotations are then displayed in a bold, larger font in addition to the colored highlighting.




Figure 26: Specially emphasizing individual annotations.


The overview table is also used to view the individual attributes of the annotation. By expanding the annotation in the table, you can obtain a list of all the annotation’s attributes.




Figure 27: Showing an annotation's attributes.


Configuring section sizes

As described above, the documents are displayed section by section. By default, 5 sentences are displayed on each page. This setting can be configured in the interface by clicking on the gear icon at the top right.

In principle, you can combine character-based sectioning with annotation-based sectioning. While character-based sectioning is the standard, annotation-based sectioning has the advantage that you do not miss annotations that cross section boundaries. When combining both, the sections are always shown with a slight overlap: the end of section n is displayed again at the beginning of section n+1 to avoid the section being taken out of context. Furthermore, when sectioning by characters, the sectioning automatically ensures that section splits are not made in the middle of a word.

Any change to the section size in the graphical configuration is applied immediately after closing the window. Using the reset button, you can restore the configured default values.




Figure 28: Annotation Editor settings window.


Manually editing, adding and deleting annotations

The annotation editor can also be used to add annotations manually or to edit them. Using the button on the right, you can switch to edit mode.

In edit mode, a button appears above the main window for each activated annotation type (2). After selecting a type, you can create annotations of this type in the text: simply highlight an area of text in the main window using the mouse. A quick way of adding an annotation is to simply click a word; an annotation of the corresponding type is then created for the whole word.

Edit mode also allows you to delete existing annotations. To do this, click the cross mark in the overview table of annotations on the right.

After you have made changes to the document, these can be saved or discarded by clicking the buttons (3).




Figure 29: Editing Annotations.


In edit mode, you can also edit attributes of an annotation (only for annotations which are configured by Averbis as editable).




Figure 30: Editing the attributes of an annotation.


Displayed and editable annotation types, attributes and colours

Currently, the user cannot configure which annotation types and attributes are visible in the annotation editor, which colors are assigned to these annotation types, or which attributes are editable; this is preset by Averbis.

Text Analysis Evaluation

The results of various text analysis tasks can be evaluated against each other, e.g., to compare a text mining process against gold standards.

To do this, you first choose the document source (1) that serves as the basis of the evaluation. Then you choose the reference view (2) in the left part of the window and, on the right side (3), the text analysis process that you wish to evaluate.

Once you have chosen a source and two text analysis processes, you can evaluate the results visually, one against the other, in a split view with two separate annotation editors. The display of the sections in the right window is coupled to the sections in the left window. In addition to the color highlighting of the individual annotations, you can also see graphically which annotations on the two sides do not match. In addition to the labelling within the text, the annotations are also labelled accordingly in the tabular overview on the right side (4): mismatches are marked either in orange (false positives) or gray (false negatives).




Figure 31: Example of a DoseFormConcept annotation on the left that has no match on the right (TBCR).


"Matches" and "Partial Matches"

When evaluating, it is possible to distinguish between exact and partial matches. Annotations are marked as an exact match if their type, characterizing attributes and position in the text are identical.

To obtain an extra level between a hit and a no-hit, it is also possible to define a partial match. Annotations that are not exactly identical but still meet the configured criteria are marked accordingly in both the graphical and the table presentation. In the graphical presentation, they are italicized and underlined.




Figure 32: Displaying a partial match.


Configuring the match criteria

The definition of what should be considered as a match, partial match and mismatch can be configured by the user in the interface.

The general rule is that two annotations are considered a match when they are of the same type and are found at exactly the same position in the document. For each annotation type, you can then define which annotation attributes also have to match. For a concept, this could be the concept's unique ID: two concepts would then be identified as a match only if this attribute were identical in both annotations. The sketch below summarizes this rule.
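
A minimal Python sketch of this rule (the annotation representation is hypothetical; characterizing_attributes stands for the attributes configured per annotation type):

# Two annotations are an exact match if type, text position and all
# configured characterizing attributes are identical.
def is_exact_match(a, b, characterizing_attributes):
    return (a["type"] == b["type"]
            and a["begin"] == b["begin"]
            and a["end"] == b["end"]
            and all(a.get(attr) == b.get(attr)
                    for attr in characterizing_attributes))

concept_a = {"type": "Concept", "begin": 10, "end": 22, "conceptId": "D012214"}
concept_b = {"type": "Concept", "begin": 10, "end": 22, "conceptId": "D012214"}
print(is_exact_match(concept_a, concept_b, ["conceptId"]))  # True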

It is also possible to configure for each annotation type, when two annotations of this type should be considered as a partial match. Here you can choose between four different options:

  • "No partial matches": only exact matches are allowed.

  • "Annotations must overlap": a partial match is given whenever the annotations overlap.

  • "Allow fixed offset": at the beginning and end of the annotations, a configurable offset is allowed.

  • "Are within the same annotation of a specific type": a partial match is found whenever the annotations are within the same larger annotation. For example, if they are inside the same sentence.




Figure 33: Graphical configuration of the match criteria.


Corpus evaluation

Using the Evaluate metrics button, a window can be opened that displays the precision, recall, F1 score and standard deviation for either a single document or the whole corpus. The numbers are broken down by annotation type.
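
These metrics follow the standard definitions, with TP, FP and FN denoting true positives, false positives and false negatives per annotation type:

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
F1        = 2 * precision * recall / (precision + recall)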




Figure 34: Evaluation at corpus level.


In the Settings panel, you can configure which types are to be taken into account in the corpus evaluation.


Figure 35: Selecting the annotation types to be taken into account in the corpus evaluation.

Annotation Overview

For the quality assessment and improvement of text analysis pipelines, an aggregated overview of the assigned annotations is often helpful. For this purpose, the Annotation overview is used. You can create any number of these overviews. To do this, you first select a source and an existing text analysis process. Next, you select the annotation type to be analyzed.

After pressing the green button, the aggregation is calculated. Depending on the scope of the selected source, this may take some time. All overviews are listed in the table. As soon as an overview has been calculated, the results can be displayed via the list symbol.




Figure 36: Listing and management of the available annotation overviews.


Aggregation and Context

If you select an overview from the table using the list symbol, you will see an aggregated list of the annotations found for the corresponding type. By default, the list is sorted by frequency in descending order. By clicking on an annotation in the table, you can display example passages in which the annotation occurs. Besides analysis, the overview is also suitable for directly improving the results: false positives as well as false negatives can be identified and corrected this way.

Currently, the attributes that appear in the list for each annotation are preconfigured by Averbis. This setting cannot yet be changed via the GUI.

Text Analysis Web Service API

This section describes the Web Service API, which can be used to integrate text analysis capabilities into existing third-party systems. The interface is offered via a RESTful/XML service that is integrated into the Swagger framework. For the formal specification, please refer to the official documentation.

Analyse Text Web Service

The Analyse Text Web Service analyses plain text and returns annotations in JSON.

POST http(s)://HOST:PORT/APPLICATION_NAME/rest/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseText
  • URL parameter projectName specifies the project name that contains the pipeline.

  • URL parameter pipelineName specifies the name of the pipeline that will be used to analyse the text.

  • URL parameter language specifies the text language. Can be omitted if the pipeline is able to detect the text language.

  • URL parameter annotationTypes specifies a comma separated list of annotation types that will be contained in the response. Wildcards (*) are supported.

  • Request body parameter text specifies the text to be analysed.

Example Request:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' \
  -d 'Some sample text to be analysed' \
  'http://localhost:8080/information-discovery/rest/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseText?language=en&annotationTypes=de.averbis.types.Token%2Cde.averbis.types.Sentence'
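
The same request can be issued from any HTTP client. Below is a minimal Python sketch using the requests library, assuming the local installation and pipeline names from the curl example above; the analyseHtml endpoint described next can be called the same way with HTML5 content as the request body:

import requests

# Endpoint of a hypothetical local installation (see the curl example above).
url = ("http://localhost:8080/information-discovery/rest/textanalysis"
       "/projects/defaultProject/pipelines/defaultPipeline/analyseText")

params = {
    "language": "en",  # may be omitted if the pipeline detects the language
    "annotationTypes": "de.averbis.types.Token,de.averbis.types.Sentence",
}
headers = {"Content-Type": "text/plain", "Accept": "application/json"}

response = requests.post(url, params=params, headers=headers,
                         data="Some sample text to be analysed")
response.raise_for_status()
annotations = response.json()  # the annotations are returned as JSON
print(annotations)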

Analyse HTML Web Service

The 'Analyse HTML Web Service' analyses text contained in HTML5 and returns annotations in JSON.

POST http(s)://HOST:PORT/APPLICATION_NAME/rest/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseHtml
  • URL parameter projectName specifies the project name that contains the pipeline.

  • URL parameter pipelineName specifies the name of the pipeline that will be used to analyse the text.

  • URL parameter language specifies the text language. Can be omitted if the pipeline is able to detect the text language.

  • URL parameter annotationTypes specifies a comma separated list of annotation types that will be contained in the response. Wildcards (*) are supported.

  • Request body parameter text specifies the HTML5 content to be analysed.

Example Request:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' \
  -d '<html><body>Some sample html5 content to be analysed</body></html>' \
  'http://localhost:8080/information-discovery/rest/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseHtml?language=en&annotationTypes=de.averbis.types.Sentence%2Cde.averbis.types.Token'

Swagger-UI API Browser

Developers can test the functionality of the Text Analysis Web API and get an overview of it on the integrated Swagger-UI API browser page. In particular, sample requests can easily be generated and return values verified. The Swagger-UI API browser is available at:

http(s)://HOST:PORT/APPLICATION_NAME/rest/swagger-ui.html




Figure 37: Swagger-UI API Browser

Terminologies

In this module, you can manage the lexical resources that are used within the text analysis components.

Terminology Administration

This module lists all available terminologies within the current project. You can create new terminologies and import or export their content.

Add a new terminology

When adding a new terminology, you can specify the following parameters:

Terminology-ID

A unique identifier. E.g. MeSH_2017.

Label

A label. E.g. MeSH.

Version

A version number. E.g. 2017.

Concept type

The concept type when being used within text analysis. E.g. de.averbis.extraction.types.Concept.

Hierarchical

If you uncheck this box, the terminology will not contain any hierarchical relations (flat list).

Encrypted export

ConceptAnnotator dictionaries can be exported in encrypted form to prevent sensitive data from being stored on disk.

This parameter only affects Concept Dictionary XML exports. Other exports remain unencrypted.

In addition, you can specify which languages are available within the terminology.




Figure 38: Add a new terminology.


Available languages

Your terminology can contain terms for all languages selected here. There is no need to provide all languages for all terms, so there may be concepts that only have terms in a subset of these languages. Since in some situations one cross-lingual preferred term has to be computed, the system needs to decide which language to use if terms are missing in specific languages. For this, you can specify a language priority by moving the languages up/down in this list. If English is at the top, followed by German, the system tries to display the English preferred term; if no English preferred term is available, the German one is displayed. The sketch below illustrates this fallback.
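
A minimal Python sketch of this fallback behaviour (the data is hypothetical; the actual resolution happens internally):

# Preferred terms of one concept, keyed by language code.
preferred_terms = {"de": "Blinddarmentzündung"}
language_priority = ["en", "de"]  # English first, then German

def cross_lingual_preferred_term(terms, priority):
    # Take the first language in the priority list that has a term.
    for lang in priority:
        if lang in terms:
            return terms[lang]
    return None

print(cross_lingual_preferred_term(preferred_terms, language_priority))
# -> 'Blinddarmentzündung' (no English preferred term available)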

There is one special language called Diverse. Terms in this language are used for every language. You can use it to mark language-independent terms (e.g. Roman numerals).


Edit a terminology's metadata

You can edit the metadata that you specified when creating the terminology via the edit button.

Delete a terminology

The delete button allows you to delete a terminology when no import or export is currently running.

Import content

You can import content from OBO files (versions 1.2 and 1.4) into an existing terminology. For multilingual terminologies, version 1.4 needs to be used. Optionally, a mapping mode for each synonym can be imported, too.

The source file may be zipped to support large files.


The minimal structure of an OBO terminology looks like this:

Example of an OBO terminology

synonymtypedef: DEFAULT_MODE "Default Mapping Mode"
synonymtypedef: EXACT_MODE "Exact Mapping Mode"
synonymtypedef: IGNORE_MODE "Ignore Mapping Mode"

[Term]
id: 1
name: First Concept
synonym: "First Concept" DEFAULT_MODE []
synonym: "First Synonym" IGNORE_MODE []
synonym: "Second Synonym" EXACT_MODE []

[Term]
id: 2
name: First Child
is_a: 1 ! First Concept

(The synonymtypedef lines are optional and only needed if mapping modes are used.)

To import terms with mapping modes, the OBO terminology begins with the synonym type definitions, as shown in the first three lines of the OBO terminology in the example above.

Each concept begins with the flag "[Term]", followed by an "id" and a preferred name introduced by the flag "name". After that, you can add as many synonyms as you like with the flag "synonym", optionally followed by the desired mapping mode. Note: if you would like to define a mapping mode for your concept name (flag "name"), you have to add the term as a synonym as well, as shown in the example for "First Concept".

Furthermore, if your terminology contains a hierarchy, you can use "is_a" to refer to other concepts of your terminology.

To import a terminology like the one shown above, proceed as follows:

  1. In "Project Overview", click on "Terminology Administration".

  2. Click on "Create New Terminology". Fill in the dialog as described in Add Terminology.

  3. Once you have created a terminology, click the up arrow icon to the right of the terminology.

  4. In the "Import Terminology" dialog, select "OBO Importer" as import format. Then select the terminology you want to import from the file system. Click on "Import".

    1. By clicking on the "Refresh" button to the right of the terminology you can check the progress of the import. When the terminology has been fully imported, the status changes to "Completed".

    2. To browse your terminology, switch to the "Terminology Editor" by going to the "Project Overview" page and clicking on "Terminology Editor".



Figure 39: Import content into existing terminology


After an import has started, the current status is shown in the overview.


Figure 40: Status of currently running processes.


In addition, you can see some details of the latest import (including error messages).




Figure 41: Detailed information regarding the latest process.


After successful terminology import, terms, hierarchies and mapping modes can be checked in the Terminology Editor.




Figure 42: Terminology Editor showing imported terminology

Export content

To use a terminology within the text analysis, you need to export its content into the Concept Dictionary XML format.




Figure 43: Export a terminology.


After exporting a terminology into the Concept Dictionary XML format, you need to restart any pipeline that uses it in order to refresh its content.

Terminology Editor

The Terminology Editor allows you to edit the content of terminologies.

Free text search and autosuggest

The centered search bar at the top of the Terminology Editor is used for free text searches across multiple terminologies. You can include or exclude terminologies from the search by checking them in the drop-down menu next to the search bar. While you enter a search term, the system suggests possible matches via autosuggest, grouped by terminology.




Figure 44: Terminology auto suggest.


In a free text search, you can use the asterisk symbol (*) for truncation (e.g. Appendi*). The results of a free text search are listed in the upper right section, grouped by terminology.

The settings menu at the top right allows you to customize some search and autosuggest settings. You can specify whether Concept IDs are included in the search and define the number of hits to be displayed.




Figure 45: Configuration of search and autosuggest.


Displaying concepts hierarchically

The tree view in the Terminology Editor allows you to view a concept's position in the terminology hierarchy: just click on a concept in the list of search results.




Figure 46: Displaying concepts hierarchically.


You can configure whether the Concept ID should be shown in the tree as well, and whether the tree view should show the siblings of a concept along its hierarchy.




Figure 47: Tree with and without strictly focusing on the selected concept.


Terms

In the lower right corner of the window, you see the concept's details. The first tab shows the concept's synonyms. You can edit, add or delete synonyms here as well.




Figure 48: Adding new terms.


Mapping Mode

Every term has a so-called Mapping Mode. Mapping Modes are an efficient way of increasing the accuracy of terminology-based annotations. They make it possible to ignore certain synonyms that are irrelevant or lead to false positive hits (IGNORE). Synonyms can also be restricted to EXACT matches, which is especially useful for acronyms and abbreviations (AIDS != aid).

Currently, there are three Mapping Modes:

DEFAULT

The term is preprocessed in the same way the pipeline is configured.

EXACT

The term is only mapped when the string matches the text exactly, without any modification by preprocessing (including case).

IGNORE

The term will be ignored; it won't be used within the text analysis.


Relations

The second tab shows all relations known for the concept. You can use this view to add or delete relations, too. Currently, only hierarchical relations are supported. When adding a new relation, autosuggest helps you find the concept that you want to relate to.

Mapping Mode and comment

In the third tab, you can add a comment to a concept. In addition, you can set a concept-wide Mapping Mode. Terms that do not have a specific Mapping Mode inherit it from the concept.


Document search

Solr Core Administration

As soon as the Solr Admin module is used, the application has a default Solr core, which is displayed in the administration panel.

Averbis Information Discovery uses Solr to create a search index and to make documents searchable. Choose "Solr Core Administration" on the project overview to configure the basic settings.

Indexing pipeline

Documents that are imported or crawled go through a text analysis pipeline in order to add metadata to the search index.

The corresponding pipeline is selected here - a separate indexing pipeline can be used for each project.




Figure 49: Choosing the indexing pipeline.


If you choose an indexing pipeline, all documents that are imported or crawled in the future will be processed with it. If you want to use a different pipeline for processing search queries, you can set it in the Solr Core Management section.

You can also switch the indexing pipeline within a project. To avoid a heterogeneous set of metadata, all documents are then re-processed.

Query Pipeline

Here you can select which of the available pipelines should be used for analyzing the search query. By default, the same pipeline is used here as selected for indexing the documents.




Figure 50: Initial state in which no query pipeline is selected.




Figure 51: Choose a query pipeline.


Solr Core Overview

A so-called "Solr Core" is available for each project, the administration of which can be accessed via the "Solr Core Management" button on the project page.




Figure 52: Key figures and information on the search index of a project.


  • "Core Name": The name of the Solr instance (generated automatically)

  • "Path to solrconfig.xml": This is the path to the configuration file of this Solr instance. Expert settings can be made in this configuration file. After editing this file, the Solr instance must be restarted in order for the changed settings to take effect.

  • "Path to schema.xml": The index fields are configured in this configuration file. This file should only be edited manually in exceptional cases and by experts.

  • "Indexed documents": Number of documents currently in the index.

  • "Pending documents": Number of documents that are currently in the processing queue of the Solr instance.

After pending documents have been processed by Solr, a commit must take place before these documents actually become available in the index. Since a commit is quite resource-intensive, the number of commits is kept low: by default, a commit only takes place every 15 minutes. Processed documents therefore appear under the indexed documents with a delay.


  • "Operations": At the level of the Solr core, there are three operations available:

    • "Refresh": You can update the displayed key figures by clicking on this icon.

    • "Commit": This command executes a commit on the Solr core, making documents visible in the index that were not visible beforehand. By default, this happens every 30 minutes in the background.

    • "Delete all documents from the index": A click on this icon deletes all documents from the index.

Configuration of the search index schema

The configuration of the schema of the current search index can be reached via the module "Solr schema configuration".

Overview of all schema fields

Each Solr core has a schema that defines which information is stored in which kinds of fields. The Solr schema configuration lists all available fields in alphabetical order. The following information and operations are available for each field in the index:

  • "Field name": Name of the field as defined in the Solr schema. This name is often technical and hard for people to read. If a field is a system field, i.e. a field whose values must not be overwritten by the user, a small lock symbol is displayed to the right of the field name.

  • "Type": The type specifies the contents of this field. In addition to an abstract description (e.g. string), the complete class name of the field type is specified in parentheses.

  • "Active": This button controls whether the field's contents can be displayed or used elsewhere in the application. Active fields are available, for example, to be displayed in the search result, to form facets, or to be used in the query builder to formulate complex field-based search restrictions. Fields that are not activated can still be used by the system, but they are not available to users for manual configuration. If a field is activated, its row is highlighted in green.

  • "Label": The field name itself is often not suitable for display because it is not legible and not localized. Therefore, you can define meaningful display names for all fields in different languages. These names are used wherever the user accesses or displays field contents. If no display name is defined for the user's language, the raw field name is displayed.




Figure 53: Overview of the Solr core schema.

Dynamic fields

In the overview, dynamically generated Solr fields are also displayed as soon as they have been created (that is, as soon as they have been filled with values once). Once a field has data, it remains permanently in the overview, even if all documents containing values in this field have since been deleted.

Manage and use search interface

The functionality and appearance of the search interface can be influenced by configuration.

Configuring the display of search results

Starting from the overview page of a project, the display of search results can be configured using the "Field Layout Configuration" module. You can specify which fields/contents of the indexed documents are to be displayed in the interface. This applies both to the fields on the results overview page and to the fields on the detail page of a document (accessible by clicking on the title of the result). Fields that are only displayed on the overview page of the search results are highlighted in green. In addition to selecting the fields, you can also configure whether the field title should be displayed. If this option is activated, the display name created in the Solr schema management for the language of the respective user is shown.

In addition, the length of content of a particular field can be specified, as well as some style settings.




Figure 54: Configuring the display of search results.


Configure Facets

So-called facets provide the user with additional filter options. They are displayed on the left side of the search page. The configuration of facets can be accessed via the module "Facet Configuration" on the project overview page.

On the configuration page, you can select and configure the facet fields displayed in the user interface. When selecting a facet, you can configure whether the entries within the facet are AND- or OR-linked. With AND facets, only documents that combine all the terms selected in the facet are displayed. OR facets, on the other hand, offer the option of finding documents that contain only individual terms (e.g. documents of "Category 1" OR "Category 2").

In addition, you can configure how many entries are to be displayed within each facet. The order of the facets can be changed with the arrows; the display in the search interface follows the order in the administration panel. The display name of a facet is taken from the labels assigned in the Solr schema configuration (see above).


manageFacets


Figure 55: Configure Facets.
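
The difference between the two linkage types can also be illustrated in terms of Solr filter queries. The following Python sketch is purely illustrative: the manual does not specify the queries the application actually generates, and the field name category is a hypothetical example.

# Hypothetical sketch: how AND- and OR-linked facet selections could be
# expressed as Solr filter queries. The field name "category" is made up.
selected = ["Category 1", "Category 2"]
field = "category"

# AND-linked facet: every selected entry must match -> one filter query per term
fq_and = ['{}:"{}"'.format(field, term) for term in selected]
# -> ['category:"Category 1"', 'category:"Category 2"']

# OR-linked facet: at least one selected entry must match -> a single OR filter
fq_or = '{}:({})'.format(field, " OR ".join('"{}"'.format(t) for t in selected))
# -> 'category:("Category 1" OR "Category 2")'

print(fq_and)
print(fq_or)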


Configuring auto-completion

Settings for the automatic completion of search terms can be made via the "Autosuggest" module, which you access on the project overview page. There are various methods by which the system can suggest meaningful completions of the user's search input. Currently, four methods are available to choose from, and they can be freely combined as needed.

The proposals are grouped by their mode in the search interface. The order of the groups corresponds to the order in which the modes are listed here (if more than one mode is used). Use the arrow buttons to change the order.

In addition to the number of proposals per group, you can also specify a description for each group, which is displayed in the search interface above the respective proposal block.

Changes will take effect immediately after saving for all users of the search.

If one of the two concept-based methods is used, an additional field appears where you select which Solr field is to be used for the lookup. All fields that are recognized as concept-based fields are available for selection.


manageAutosuggest


Figure 56: Configuring auto-completion.


The methods are characterized as follows:

"Prefixed Facet Mode"

  • The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

  • The suggestions are intended to complete the term currently being entered; no additional terms are proposed (no multi-word suggestions).

  • The current search restrictions (e.g. via facets) are taken into account in the proposals. Therefore, only terms are suggested for which there are also hits, taking into account all active search restrictions.

  • The proposals are not based on the order of the terms in the documents. If you enter a search query that consists of several partial words, the proposed word does not have to appear directly after the preceding term of the search query in the documents.

"Shingled Prefixed Facet Mode"

  • The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

  • Unlike the simple prefixed facet mode, suggestions can consist of several words. In addition to completing the term currently being entered, terms that frequently occur directly adjacent or close to this term in the documents are also suggested. Entering Appen in this mode could therefore lead to suggestions such as treating appendicitis.

  • The current search restrictions (e.g. via facets) are taken into account in the proposals. Therefore, only terms are suggested for which there are also hits, taking into account all active search restrictions.

  • If the query consists of several words, the suggested completions are based on the last of these words. All terms before this last word are still used as filters. The entry Hospital Appendi could therefore also lead to the suggestion Hospital Treat Appendicitis, even if Treat Appendicitis does not occur in the immediate vicinity of Hospital in the text.

Concept Mode with guaranteed hits (concepts_hit)

  • The suggestions for completing the search query are taken from synonyms of the stored terminology.

  • Proposals show the wording of the synonym and the title of the terminology as well as the preferred name of the concept in the user’s language.

  • If you select a proposal (synonym), a search with the associated concept is executed.

  • Documents that contain the exact synonym text (as opposed to documents that can only be found via another synonym of the concept) are given a higher weighting and are displayed higher up in the results list.

  • Only proposals that guarantee at least one hit are displayed.

Concept Mode without guaranteed hits (concepts_all)

This mode differs from the concept mode with guaranteed hits in that proposals are also displayed that do not lead to a hit. All terms from the stored terminology are displayed.

Activating the concept modes is not yet fully supported via the GUI. Please contact support.


Search restrictions

Switch to the "Search" module of the project to get to the search page of the application. All search terms entered remain visible and traceable for the user at any time: you can easily see which search terms have led to the currently presented result set. The current search restrictions are listed next to each other on the left side of the search bar. They are highlighted in the same color as the corresponding highlighting in the text. If a restriction originates from a facet, the name of the facet is listed before the search term (see screenshot below).

If there are too many search restrictions to be displayed in the search bar, they are shown in a collapsible pop-up menu on the left of the search bar. The small cross symbol next to each search restriction removes that restriction and updates the search results accordingly. With the cross button to the right of the search bar you can also remove all current search restrictions at once.


searchrestrictions


Figure 57: Display of the current search restriction.


Faceted search

Facets represent one of the core functionalities of the search. With the help of the facets, the search results can be quickly limited to relevant results. In the admin panel you can configure for which categories facets should be displayed.

Within the facets, the most frequent terms of the respective category contained in the indexed documents appear. The number after each facet entry indicates how many documents in the index (or in the current search result set) match the corresponding term.

The facet entries can be clicked, whereupon the search result is limited accordingly. Different terms can be combined here, even across facets. This allows a high degree of flexibility in restricting the search results.


facet


Figure 58: Concept facet with selected restriction to 'Diagnosis'.


AND-linked facets

By default, all selected facet entries are AND-linked. This means that only documents matching all selected criteria are listed. The currently selected filters are highlighted in orange. A restriction can be removed by clicking on the facet entry again.

OR-linked facets

This filter yields result sets in which at least one of the selected criteria appears, i.e. documents may match only one or only a few of the selected terms. In the case of these OR-linked facets, a checkbox is displayed in front of each entry.

Querybuilder / Expert Search

The query builder provides a convenient mechanism for creating complex search queries. It allows you to combine different criteria into a single query using any fields from the index.

The Querybuilder can be opened using the magic wand icon in the search bar.


suchschlitzQbInactive


Figure 59: The magic wand on the right of the search bar opens the query builder.


The input mask allows you to add search restrictions on all activated schema fields. Depending on the type of the selected schema field, different comparison operators are available. Text fields allow the operators contains and contains not. Any text can be entered as a restricting value. The asterisk * is used as a wildcard.

Date fields provide the comparison operators >= and <=. Numeric fields provide the comparison operators =, <>, >= and <=. By combining two date or number conditions, the search can also be restricted to periods or ranges.


qb


Figure 60: Input mask of the query builder


Like text fields, concept-based fields allow the operators contains and contains not.

Any number of conditions can be added. These are linked with each other using the Boolean operators AND and OR. The criteria can also be grouped to create arbitrary logical combinations. In addition to the graphical display, the logical expression that results from the current combination of search restrictions is shown in the upper area of the query builder (see the example below). Once the complex search query has been created, it can be activated using the Apply button, and the search results are calculated accordingly. In addition, the magic wand icon in the search bar turns orange to indicate that a complex search restriction is active. The search query can be reopened by clicking this button and edited until the result matches your expectations.
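
For illustration, the expression shown in the upper area might look like the following (the field names title and publication_date are hypothetical examples, not predefined fields of your index):

(title contains cancer* AND publication_date >= 2015-01-01) OR title contains carcinoma*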

The query created using the Querybuilder is applied in addition to any other search restrictions, such as free-text searches or facet restrictions.

Document details and original document

The title field of a document serves as a link to a detail page containing additional information about the document (see "Solr Schema Configuration" module on the project overview page).

In addition to the detailed view, you can also download the underlying original documents (e.g. PDF, office document, etc.) if they are available. This is indicated by a small icon to the right of the document title; the symbol differs depending on the document category. Clicking on the file icon starts the download of the original document.

Export search results

Documents in the system can be exported - both individual documents and complete search result sets.

Selection of documents to be exported

If the user has the necessary permissions to export documents, checkboxes are provided on the search results page to mark individual documents. There is also a checkbox to mark all currently displayed documents. In addition, the button "Export search results" is displayed above the search results, with which the selected documents can be exported.

Another option is to export all documents that meet the current search restrictions. In this case, all checkboxes have to be deselected.


ExportDib


Figure 61: Controls to mark and export documents.


Selection of the exporter and the fields to be exported

After selecting the documents to be exported, a dialog box appears in which the exporter type can be selected. Currently, there is one exporter, which exports selected fields of the documents to an Excel document.

After selecting the fields to be included in the export and confirming with the "Export" button, the export starts. Once the export is complete, the result is offered for download.


ExportDialog


Figure 62: Selection of the exporter and the fields to be exported.


Document Classification

Manage classification

Administration of the label system

The target categories for the automatic classification of documents form the so-called label system, which can be edited and maintained in the "Label System" module. In a new project, the label system is initially empty.

Clicking on "Create new label" at the bottom left adds a new label. The pen symbol on the right-hand side is used to rename the label. The plus symbol to its right adds a new label as a child of the current label. It is therefore used to create hierarchically organized label systems. Clicking on the red cross symbol deletes labels (only labels that have no children can be deleted).

In a hierarchical labeling system, the hierarchical arrangement can also be edited via drag & drop.


manageLabelsystem


Figure 63: Labels can be added, edited, moved or deleted in the label system administration.


Administration of different classification sets

The starting point for the automatic classification of documents are so-called classification sets.


navigationItemManageClassification


Figure 64: Menu item for managing classification sets.


Create a new classification set

Any number of classification sets can be created for each project. This means that you can classify the same document source with different classification parameters.

There is only one label system per project. The same label system is used for each classification set. Please make sure that the label system has been created before you create a classification set.


To be able to view the results of the classification in the interface, you should select an indexing pipeline in Solr Core Management before you create classification sets.


When creating a new classification set, the following settings can be adjusted:

  • Name: Name under which this classification set is referenced.

  • Document fields: From all document fields known to the system, you can select those that are used for training the classifier (so-called features).

  • High confidence threshold: The system distinguishes between documents with high and low confidence for automatically classified documents. This parameter can be used to define the value above which the confidence is interpreted as "high".

  • Classifier: In principle, different implementations can be used for classification. At present, the implementation offered is a support vector machine.

    • SVM: Support vector machine

  • Single/multi-label: This parameter determines how many categories can be assigned to a single document. With Single, only one label is assigned. With Multi, a document can be assigned to several classes.

  • Classification method: The classification method determines how the machine selects from several candidates. Depending on whether it is a single-label or multi-label scenario, different options and configuration parameters are available:

    • Single-Label

      • Best Labels: With Single-Label-Classification there is only one classification method: the Best Labels method chooses the class with the highest confidence.

        • Threshold: The threshold value can be used to specify that only classes with a certain minimum confidence are taken into account. This allows filtering out assignments for which the machine is very uncertain.

    • Multi-Label: For Multi-Label Classification, several methods are available (for a deeper theoretical background, see Matthew R. Boutell: Learning multi-label scene classification):

      • All Labels: This method simply selects the available instance labels in a decreasing confidence order.

      • T-criterion: Using the T-criterion, labels are first filtered by a minimum confidence threshold of 0.5. If the confidences are too low, i.e. no labels are assigned, a second filter step is applied: it checks whether the entropy of the confidences is lower than the minimum entropy threshold, i.e. whether the confidences are distributed unevenly. If this is the case, the labels are assigned based on a lower minimum confidence threshold.

        • Entropy: 1.0 (default minimum entropy)

        • Threshold value: 0.1 (default minimum confidence)

      • C-criterion: This method selects the best predictions depending on the configuration parameters (i.e. the Percentage and Threshold values). It first selects the label with the highest confidence (provided it is larger than the threshold value) and then continues to assign labels whose confidence is at least the configured percentage (by default 75%) of the highest confidence value (see the sketch after this list).

        • Percentage value: 0.75

        • Threshold value: 0.1 (minimal default confidence).

      • Top n labels: This method selects those categories that have the highest confidence.

        • n: the number of classes to be assigned
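
The selection logic of the C-criterion and the Top n labels method can be illustrated with a short Python sketch. This is an illustrative reimplementation based on the descriptions above, not the platform's actual code; the label names and confidence values are made up.

def c_criterion(confidences, threshold=0.1, percentage=0.75):
    """Select labels according to the C-criterion described above."""
    # Sort labels by confidence, highest first.
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    # The best label must exceed the minimum confidence threshold.
    if not ranked or ranked[0][1] <= threshold:
        return []
    best = ranked[0][1]
    # Keep every label whose confidence reaches the configured percentage
    # of the highest confidence value.
    return [label for label, conf in ranked if conf >= percentage * best]

def top_n_labels(confidences, n=2):
    """Select the n labels with the highest confidence."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:n]]

# Made-up classifier output for one document:
confidences = {"Oncology": 0.82, "Radiology": 0.65, "Cardiology": 0.12}
print(c_criterion(confidences))   # ['Oncology', 'Radiology']
print(top_n_labels(confidences))  # ['Oncology', 'Radiology']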

The classification configuration can be changed on the classification administration page by clicking on the edit button.

After changing parameters of an existing classification set, re-training and re-classification are necessary for all changes to take effect.


Before documents can be automatically classified, the machine requires appropriate training material. This is a small set of intellectually (i.e. manually) classified documents that the machine uses to train a model.

Training data can be created in two ways. Either by manually assigning classes via the graphical user interface (please see "Browse classifications" below) or by importing a CSV file that contains appropriate assignments.

Import of training material

The button opens a dialog for importing a CSV file with training material. The CSV file must contain the name of the document in the first column (referred to as document_name in the system). The subsequent columns contain the label assignments (one column for each label in a multi-label scenario). The columns must be separated by semicolons. The values of the columns can be enclosed in double quotation marks if required (mandatory if the values contain semicolons).

Example:
trainset.csv

doc1;label_1;label_2
doc2;label_1;
doc3;label_1;label_3
...
      

The document name, which is used to identify the document in the list, must contain the value that is entered in the field document_name in the application.


If a training file contains several labels per document, but the selected training set is a single-label classification, only the first label is used.


If the document names or labels contain semicolons, the values must be enclosed in double quotation marks to avoid incorrectly interpreting the semicolon as a field separator.


Only values that are part of the label system in the application (or project) are allowed as labels (all others are ignored).


When you import training material, any labels that may already be assigned to the documents in the list are deleted.
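
A training file in this format can also be generated programmatically. The following minimal Python sketch uses the standard csv module; the document names and labels are hypothetical, and the quoting settings automatically take care of values containing semicolons:

import csv

# Hypothetical label assignments: document_name -> list of labels
assignments = {
    "doc1": ["label_1", "label_2"],
    "doc2": ["label_1"],
    "doc3": ["label_1", "label_3"],
}

with open("trainset.csv", "w", newline="", encoding="utf-8") as f:
    # Semicolon-separated; values containing semicolons are quoted automatically.
    writer = csv.writer(f, delimiter=";", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for document_name, labels in assignments.items():
        writer.writerow([document_name] + labels)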

Train a model

As soon as the system has access to training material, either through an imported training list or through manually assigned labels, a model can be trained using the corresponding button. Refresh the information on "State" and "Model" to follow the progress: the training has finished when "State" is IDLE and "Model" is READY.

Quality of the current model

After each training run, an evaluation is carried out to assess the current quality of the model. For this purpose, the machine uses the set of documents with intellectually confirmed labels. This set is divided into a training set (90%) and a test set (10%). The test set is classified by the machine on the basis of a model trained on the training set. The results of the automatic classification are then compared with the intellectually assigned labels. To smooth the results, the machine repeats this 10 times for different splits into test and training sets. The results of the tests can be viewed in the form of diagrams using the corresponding button. The diagrams show the following metrics per label, which are derived from the number of correct assignments (true positives, TP), false assignments (false positives, FP), missing assignments (false negatives, FN), and correct non-assignments (true negatives, TN):

Accuracy: The ratio of all correct assignments (and correct non-assignments) to the total number of all observations:

Accuracy = (TP + TN) / (TP + FP + FN + TN)


Precision: The ratio of correct assignments to all assignments made:

Precision = TP / (TP + FP)

If it is particularly important that there are no false assignments, this value is of particular relevance.


Recall: The ratio of correct assignments to the sum of all existing correct assignments:

Recall = TP / (TP + FN)

If you accept some false assignments in order to increase the number of hits, this value is of particular relevance.


F1-Score: A weighted average between Precision (P) and Recall (R):

F1 = 2 x (P x R) / (P + R)
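
For reference, these four metrics can be computed from the raw counts with a few lines of Python; the counts below are made-up example values:

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 score from raw counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up example counts for one label:
accuracy, precision, recall, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
# Accuracy=0.85 Precision=0.89 Recall=0.80 F1=0.84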

Automatic classification of all unclassified documents

As soon as an initial model has been created, all previously unclassified documents can be automatically classified on the basis of this model via the corresponding button on the classification configuration page.

Once the classification is complete, the results can be viewed in the graphical user interface. The assigned classes are displayed above each document (see "Browse classifications" below).

Status information

The overview table shows information about the current status of the classification set:

  • IDLE: No process is currently running.

  • TRAINING: A training is in progress. During this time, no other processes can be started on this classification set.

  • CLASSIFYING: Documents are currently being classified. During this time, no other processes can be started on this classification set.

  • ABORTING: A process (training or classification) is being aborted. During this time, no processes can be started on this classification set.

The resulting model of a classification set comes with additional information:

  • NONE: No model has been trained yet.

  • READY: A valid model exists and a classification process can be started.

  • OUTDATED: Since the last training, manual classifications have been added or automatic classifications have been confirmed or rejected. The model should be re-trained in order to make changes take effect.

  • INVALID: Changes were made to the label system or a manually assigned label was deleted, which invalidates the current model. The model has to be re-trained.

Index, evaluate and manually classify documents

For all classification sets, you can use a graphical user interface to navigate through the documents, review results, confirm or delete automatically assigned classes, and assign classes manually. You can access this browser view by clicking on "Classification" on the project overview page.

Structure of the interface

The interface is similar to the search interface, both in terms of its structure and its functionality. The classification page has three predefined facets on the left side of the screen that can be used to filter documents according to the assigned class (Label), the assigned confidences (Confidence), or the assignment status on the document level (Status).

This makes it very easy to display, for example, only those documents that have been automatically classified (Status = Autoclassified) and that have labels with low confidence (Confidence = low). By correcting or confirming the labels of the resulting documents, the classification model can be improved, i.e. the system learns exactly where it is currently most uncertain (so-called active learning).

To the right of the search input field, you can choose the classification set on which you want to work. If you have created several classification sets, you can quickly switch between them.

Confirm or reject automatically assigned labels

The labels that have been assigned to a document are depicted below its title information. Manually assigned labels are displayed in blue, automatically assigned classes are displayed in red (automatic label with low confidence) or green (automatic label with high confidence).

Automatically assigned labels have buttons to confirm or to delete the label. By confirming an automatically assigned label, it changes its color and is considered in the next training session to improve the model.

As soon as you confirm, delete or add labels, the model is considered OUTDATED. This means that since the last training session, new data has been collected to improve the model and re-training is necessary.


Execute actions on several selected documents

Similar to the conventional search interface, there are several document-centered actions for classification. In general, actions either refer to

  • exactly one document,

  • a selection of documents,

  • all documents of the project or

  • all documents corresponding to the current search restrictions.

For any of these actions, there is a small button with a distinctive icon under the document title. Use this button to apply the action exactly to the corresponding document.

The same icons are displayed on larger buttons below the search bar ("Label document(s)", "Classify document(s)", "Export classifications"). Clicking on these buttons applies the action to all documents that are marked with the checkbox to the left of their title. All documents on the current search result page are selected by clicking the uppermost checkbox on the page.

If no particular documents are selected, the action is applied to all documents that correspond to the current search restrictions. Since the result set can be very large, a window opens for approving the current selection before the corresponding process starts in the background.

Manually label documents

In addition to confirming or rejecting automatically assigned labels, categories can be assigned manually. The button attached to each document serves this purpose: it opens a window in which you can select the desired label(s). You can also manually label several documents at the same time by using the checkboxes to the left of the document titles in conjunction with the uppermost button.

When manually assigning labels, a window opens with labeling information:

  • "Not selected": This label has not been assigned to any of the selected documents.

  • "Partially selected": This label has already been assigned to some (but not all) of the selected documents (gray stripes).

  • "Completely selected": All selected documents already have this label (gray).

When assigning a label manually, an automatically assigned label of the same type, if present, is overwritten.

As an example, if you select 100 documents to assign label A and 10 of them already have an automatically assigned label A, the status of these 10 documents is switched to "Approved". An automatically assigned label B would not be replaced by this procedure (except in a single-label classification scenario, where only one label is allowed).

Classify documents automatically

The same selection mechanism as for manual labeling also applies to automatic classification (single documents, a selection of documents, or the current search result set). The "Classify document(s)" button automatically classifies all documents that have not been manually categorized.

As a result, automatically assigned category labels are displayed in red (automatic label with low confidence) or green (automatic label with high confidence). The corresponding facet filters on the left (Label, Confidence and Status) are updated when the page is refreshed.

If documents are automatically classified, all previously unconfirmed automatically assigned classes from previous runs are deleted for these documents.

Export labels

The assignment of (confirmed or manual) labels can be exported from the interface to a CSV file (button "Export classifications"). The format has the same structure as the input format that is allowed for importing training material.

Training and classifying directly from the search page

With the corresponding button at the top right of the page, a new model can be trained based on all previously manually classified or confirmed documents. Similarly, a second button at the top right is used to classify all unclassified documents based on the current model.

Classification Web Service

This section describes the possible integration of the classification component into existing third-party systems. The interface is offered as a RESTful/XML service, which is fully integrated into the Swagger framework. For the formal specification, please refer to the official documentation.

Web Service

The Web service accepts requests at the following URL:

https://HOST:PORT/information-discovery/rest/classification/projects/{projectName}/classificationSets/{classificationSetName}/classifyDocument?type={Importer}


The information on HOST and PORT depends on the specific installation and can be obtained from the system administrator.

  • {projectName} is the selected name of the created project in the application.

  • {classificationSetName} is the selected name of the created classification configuration in the application.

  • {Importer} is the importer type to process different input document types and can be one of:

    • CAS Importer

    • Solr XML Importer

    • Text Importer

Additional importers can be included for specific applications. Access to the service URL is not authenticated.

The first time the Web service is called after a restart, the requested classification model is loaded from the classification configuration into working memory so that service requests can be answered as quickly as possible. Therefore, with a newly started system or a new classification configuration, a first request should be made to warm up the web service, e.g. with a defined test data set. In addition to an automatic query by an integrating external system, the Swagger test page or a query via curl can also be used (see below).

Test page and simple query via Curl

Developers can test the functionality of the service and get an overview on the following page. In particular, sample requests can easily be generated and return values verified.


swagger classification


Figure 65: Swagger test page


Curl is a command line program for transferring data in computer networks; versions are available for Windows and Linux systems, among others. The following simple call retrieves classification results for two documents in Solr XML format:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/xml' -d '<?xml version="1.0" encoding="UTF-8"?>
<update>
	<add>
		<doc>
			<field name="document_name">doc1</field>
			<field name="title">Machine learning for automatic text classification</field>
			<field name="content">Machine learning is a subset of artificial intelligence in
				the field of computer science that often uses statistical techniques
				to give computers the ability to learn...</field>
		</doc>
		<doc>
			<field name="document_name">doc2</field>
			<field name="title">Document classification made easy</field>
			<field name="content">Document classification or document categorization is a
				problem in library science, information science and computer science.
				The task is to assign a document to one or more classes or
				categories...</field>
		</doc>
	</add>
</update>' \
'https://HOST:PORT/information-discovery/rest/classification/projects/{projectName}/classificationSets/{classificationSetName}/classifyDocument?type=Solr%20XML%20Importer'

Note that line breaks within the single-quoted XML payload are preserved by the shell, so no backslash continuations are needed (or allowed) inside the quotes.

Result Format (XML)

The answer of the web service is returned in XML format and contains the automatic classifications for the input data set. The output for each data record contains the identifier (document_name) and one or more categories with corresponding confidence values. In the example, both documents could be successfully classified, which is indicated by the attribute success=true:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
	<classifications>
		<classification documentIdentifier="doc1" success="true">
			<labels>
				<label confidence="0.98">Artificial Intelligence</label>
				<label confidence="0.89">Text Mining</label>
			</labels>
		</classification>
		<classification documentIdentifier="doc2" success="true">
			<labels>
				<label confidence="0.98">Information Science</label>
			</labels>
		</classification>
	</classifications>
</response>
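
On the client side, such a response can be processed with a few lines of code. The following minimal Python sketch uses only the standard library and extracts the labels and confidences per document from a response like the one above:

import xml.etree.ElementTree as ET

# Response body as returned by the service (here as a byte literal for
# demonstration; in practice this would be the body of the HTTP response):
response_xml = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc1" success="true">
            <labels>
                <label confidence="0.98">Artificial Intelligence</label>
                <label confidence="0.89">Text Mining</label>
            </labels>
        </classification>
    </classifications>
</response>"""

root = ET.fromstring(response_xml)
for classification in root.iter("classification"):
    doc_id = classification.get("documentIdentifier")
    success = classification.get("success") == "true"
    labels = [(label.text, float(label.get("confidence")))
              for label in classification.iter("label")]
    print(doc_id, success, labels)
    # doc1 True [('Artificial Intelligence', 0.98), ('Text Mining', 0.89)]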

If no category is assigned to a document due to selection criteria in the classification configuration (e.g. thresholds), the classification for the document also appears with success=true, but with an empty list of categories in the returned message:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc3" success="true">
            <labels/>
        </classification>
    </classifications>
</response>

If fields that are set active in the classification configuration are missing, corresponding error messages are added to the document classification. If the classification could still be carried out, this is indicated by success=true and the assigned categories are displayed:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc4" success="true">
            <labels>
				<label confidence="0.98">Artificial Intelligence</label>
				<label confidence="0.89">Text Mining</label>
            </labels>
            <errors>
                <error>Document has no title.</error>
            </errors>
        </classification>
    </classifications>
</response>

Multiple error messages for a document are listed separately:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc5" success="true">
            <labels>
				<label confidence="0.98">Artificial Intelligence</label>
            </labels>
            <errors>
                <error>Document has no title.</error>
                <error>Document has no content.</error>
                <error>Error on ...</error>
            </errors>
        </classification>
    </classifications>
</response>

If no classification can be performed due to an error, this is indicated by success=false and the output list of assigned categories is empty. A corresponding error message is added to the message returned:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
       <classification documentIdentifier="doc6" success="false">
            <labels/>
            <errors>
                <error>Document has no classifiable content.</error>
            </errors>
        </classification>
    </classifications>
</response>

A document without the document_name input field cannot be classified because a unique document identifier is required. Since no assignment to an individual document can be made without this identifier, the corresponding error message appears at the top level. Other documents are not affected, so their classifications are returned normally:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <errors>
        <error>1 document(s) without identifier.</error>
    </errors>
    <classifications>
        <classification documentIdentifier="doc1" success="true">
            <labels>
				<label confidence="0.98">Artificial Intelligence</label>
				<label confidence="0.89">Text Mining</label>
            </labels>
        </classification>
        <classification documentIdentifier="doc2" success="true">
            <labels>
				<label confidence="0.98">Information Science</label>
            </labels>
        </classification>
    </classifications>
</response>

If a global error prevents the classification of the documents, an error message is returned for the entire input, for example the message that no classification features could be extracted:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <errors>
        <error>Feature extraction failed.</error>
    </errors>
</response>


Text Analysis Component Reference

Type Systems

AverbisTypeSystem

de.averbis.textanalysis.typesystems.AverbisTypeSystem

The core type system for all default components.

Maven Coordinates

        
<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>components-core-typesystem</artifactId>
   <version>3.5.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

Sentence

Full Name: de.averbis.extraction.types.Sentence

Description: Annotation representing a sentence including the ending punctuation mark.

Parent Type: de.averbis.extraction.types.CoreAnnotation

Token

Full Name: de.averbis.extraction.types.Token

Description: Annotation for basic textual units including words, numbers and punctuation marks.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 1: Features

Name | Range | Element Type | Multiple References Allowed

posTag

de.averbis.extraction.types.POSTag




Description: The part of speech of this token (i.a. used for concept annotator to be restricted to certain pos types).

segments

uima.cas.FSArray

de.averbis.extraction.types.Segment



Description: Segments of this token (i.a. used for respective mode in concept annotator).

stem

de.averbis.extraction.types.Stem




Description: The stem of the token (i.a. used for respective mode in concept annotator).

isAbbreviation

uima.cas.Boolean




Description: Marker whether the token is (part of) an abbreviation.

abbreviations

uima.cas.FSArray

de.averbis.extraction.types.Abbreviation



Description: The abbreviations for the token; this may be used as a replacement in the concept annotator. Note that this goes in combination with "isAbbreviation", which marks whether the token is an abbreviation. Multiple entries mean that there is ambiguity as to which full form is correct. There will be components to resolve this ambiguity, which then remove wrong forms. Components that cannot do this ambiguity resolution must rely on the first (and hopefully only) entry being correct.

concepts

uima.cas.FSArray

de.averbis.extraction.types.Concept



Description: List of concepts containing/covering this token (this feature is used for indexing and highlighting with lucene/solr)

entities

uima.cas.FSArray

de.averbis.extraction.types.CoreAnnotation



Description: Contains entities such as Date,Time,Size,... discovered inside the Token (this feature is used for indexing and highlighting with lucene/solr)

ignoreByConceptMapper

uima.cas.Boolean




Description: If this feature is true, the ConceptAnnotator ignores the token. Use this if a pre-processing component has already identified the semantics of the token, e.g. dates, times, measurement values. Default value: false.

normalized

uima.cas.String




Description: Normalized version of this token (usually lower-case, without special characters and numbers). This feature is used for indexing/search with lucene/solr.

diacriticsFreeVersions

uima.cas.StringArray




Description: In the case that the normalized version contains diacritics, multiple versions without diacritics are stored in this array. This feature is used for indexing/search with lucene/solr.

isStopword

uima.cas.Boolean




Description: Indicates if the token is a stopword.

lemma

de.averbis.extraction.types.Lemma




Description: The Lemma of the token.

isInvariant

uima.cas.Boolean




Description: Defines whether a token is an invariant. Such a token should not undergo some morphologic analysis steps, such as stemming and/or decompounding. However, lemmatization might still be allowed. Typical invariants: IL-2 (gene name) or also product names or numbers (SR-2715) but also too short words (au).

tokenClass

uima.cas.String




Description: The optional string representing the class of the token concerning its surface form.

Abbreviation

Full Name: de.averbis.extraction.types.Abbreviation

Description: An abbreviation is a letter or group of letters, taken from a word or words. For example, the word "abbreviation" can be abbreviated as "abbr." or "abbrev."

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 2: Features

Name | Range | Element Type | Multiple References Allowed

fullForm

uima.cas.String




Description: The full form of an abbreviation. The full form, for example for HLA could be human leukocyte antigen.

textReference

de.averbis.extraction.types.CoreAnnotation




Description: Reference to the text span that contains the full form of the abbreviation/acronym.

definedHere

uima.cas.Boolean




Description: This feature is true if the abbreviation/acronym is defined for the first time in the text, e.g. in "interleukin 2 (Il-2) receptor", it can be true only for locally introduced abbreviations/acronyms.

stems

uima.cas.StringArray




Description: Stems of the full form.

segments

uima.cas.StringArray




Description: Segments of the full form.

tokens

uima.cas.StringArray




Description: Token strings of the full form.

Concept

Full Name: de.averbis.extraction.types.Concept

Description: A concept is a reference to an entry in a database, terminology, taxonomy, ontology, etc.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 3: Features

Name | Range | Element Type | Multiple References Allowed

dictCanon

uima.cas.String




Description: Canonical form (preferred term).

enclosingSpan

de.averbis.extraction.types.CoreAnnotation




Description: The span that this concept is contained within (i.e. its sentence).

negatedBy

de.averbis.extraction.types.CoreAnnotation




Description: Indicates which annotation negates the concept.

partialMatch

uima.cas.Boolean




Description: Specifies whether the annotation matches only part of the covered text. E.g. if coveredText is "Lungenabschnitte" and the generated Concept annotation is "Lunge", then this value is set to true.

matchedText

uima.cas.String




Description: The text in document which matched the synonym (in the respective mapping mode form, i.e., segment/stem/original etc.).

matchedTerm

uima.cas.String




Description: The synonym of the concept which caused the match (in the ConceptAnnotator-dictionary this is <term label=xxx>).

matchedVariant

uima.cas.String




Description: The variant of the synonym of the concept which caused the match (in the ConceptAnnotator-dictionary this is <variant label=xxx>). Note that one synonym (matchedTerm) can have several variants (i.e. spelling forms or mapping forms).

matchedTokens

uima.cas.FSArray

de.averbis.extraction.types.Token



Description: The Token annotations on which the concept was found. Note that there is also matchedAnnotations, which lists the actual annotations involved in the matching process (i.e., token, stem, segment, etc.).

matchedAnnotations

uima.cas.FSArray




Description: List of the actual annotations involved in the matching process (i.e., original, stem, segment, etc.). Note that there is also matchedTokens, which lists only matching Token annotations.

mappingMode

uima.cas.String




Description: The mode used for mapping (e.g., original, stem, segment...).

mappingFuzzynessScore

uima.cas.Float




Description: The score for the fuzzyness of the mapping (higher scores mean higher fuzzyness, i.e., less exact mappings).

uniqueID

uima.cas.String




Description: The unique concept ID, consisting of terminology name and concept ID; it should look like this: <terminologyName>:<conceptID>.

conceptID

uima.cas.String




Description: The concept id. For a unique id refer to uniqueID.

source

uima.cas.String




Description: the name of the terminology source.

Zone

Full Name: de.averbis.extraction.types.Zone

Description: An annotation concerning the document structure, e.g. header, title, abstract, etc.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 4: Features

Name | Range | Element Type | Multiple References Allowed

label

uima.cas.String




Description: Allows annotating the Zone with a semantic label. E.g. in the case of a section, the value might be Introduction, Appendix, ...

weight

uima.cas.Float




Description: The relevance or weight for a zone; used e.g. to weight information contained in the respective zone.

Header

Full Name: de.averbis.extraction.types.Header

Description: The header annotation of a document

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 5: Features

Name | Range | Element Type | Multiple References Allowed

docID

uima.cas.String




Description: The ID of the document.

source

uima.cas.String




Description: The source of the document.

fileName

uima.cas.String




Description: The name of the source file (often used by cas consumers which produce an output file for each CAS; this name is used as base).

fileEncoding

uima.cas.String




Description: The encoding of the file.

documentIndex

uima.cas.Integer




Description: The sequence number of the document, i.e. a document is number 5 in a complete sequence.

lastFile

uima.cas.Boolean




Description: Indicates if this is the last file.

sourceLanguage

uima.cas.String




Description: The document language of the source.

offsetInSource

uima.cas.Integer




Description: Byte offset of the start of document content within the original source file or other input source. Only used if the CAS document was retrieved from a source where one physical source file contained several conceptual documents. Zero otherwise.

documentSize

uima.cas.Integer




Description: Size of the original document in bytes before processing by the CAS Initializer. Either the absolute file size or the size within a file or other source.

sequenceNumber

uima.cas.Integer




Description: Sequence number used to verify the correct order when merging CASes.

lastSegment

uima.cas.Boolean




Description: For a CAS that represents a segment of a larger source document, this flag indicates whether this CAS is the final segment of the source document. This is useful for downstream components that want to take some action after having seen all of the segments of a particular source document.

POSTag

Full Name: de.averbis.extraction.types.POSTag

Description: Parent type for all specific part-of-speech types.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 6: Features

Name | Range | Element Type | Multiple References Allowed

tagsetId

uima.cas.String




Description: The name of the tag set the pos tag belongs to; e.g. the "Penn Treebank II Tags" (see http://bulba.sdsu.edu/jeanette/thesis/PennTags.html)

value

uima.cas.String




Description: The specific part-of-speech tag, as returned by the POS-Tagger (e.g., "NN" or "ADJ" etc)

Chunk

Full Name: de.averbis.extraction.types.Chunk

Description: A general type for chunks (NPs, VPs, PPs etc.). Note: there are 3 specific subtypes for common chunks: ChunkNP, ChunkVP, ChunkPP. For all other chunk types (e.g., SBAR; ADJP etc.) use this general type!

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 7: Features

Name | Range | Element Type | Multiple References Allowed

enclosedTokens

uima.cas.Integer




Description: The Token annotations enclosed by this chunk.

head

de.averbis.extraction.types.CoreAnnotation




Description: The head entity on which this chunk grammatically depends. Example: in "Der Vater des Kindes", "der Vater" is the head of "des Kindes".

dependents

uima.cas.FSArray

de.averbis.extraction.types.CoreAnnotation



Description: The entities which grammatically depend on this chunk. Example: in "Der Vater des Kindes", "des Kindes" is the dependent of "Der Vater".

value

uima.cas.String




Description: the specific chunk tag as returned by the chunker (e.g., "NP", "SBAR", "S" etc.).

ChunkNP

Full Name: de.averbis.extraction.types.ChunkNP

Description: A noun phrase (e.g. "the strange bird").

Parent Type: de.averbis.extraction.types.Chunk

ChunkVP

Full Name: de.averbis.extraction.types.ChunkVP

Description: A verb phrase (e.g. "has been thinking").

Parent Type: de.averbis.extraction.types.Chunk

ChunkPP

Full Name: de.averbis.extraction.types.ChunkPP

Description: A prepositional phrase (e.g. "in between").

Parent Type: de.averbis.extraction.types.Chunk

Segment

Full Name: de.averbis.extraction.types.Segment

Description: The segmentation of a text part; a segment is usually a subword (i.e., part of a token).

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 8: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: The string representation of the segment.

isValidSegmentation

uima.cas.Boolean




Description: Indicates if the segmentation is valid (viz. could be completely matched against the dictionary).

midStrings

uima.cas.StringArray




Description: The MID value, multiple for ambiguous MIDs (for MID see Morphosaurus Paper).

Section

Full Name: de.averbis.extraction.types.Section

Description: Text sections of a certain type.

Parent Type: de.averbis.extraction.types.Zone

Abstract

Full Name: de.averbis.extraction.types.Abstract

Description: Semantic abstract section found in the text.

Parent Type: de.averbis.extraction.types.Zone

Paragraph

Full Name: de.averbis.extraction.types.Paragraph

Description: Different paragraphs found in the document.

Parent Type: de.averbis.extraction.types.Zone

Title

Full Name: de.averbis.extraction.types.Title

Description: Marks a title in the document.

Parent Type: de.averbis.extraction.types.Zone

Relation

Full Name: de.averbis.extraction.types.Relation

Description: Describes a binary relation between two annotations. The relation is defined according to the SPO (subject, predicate, object) annotation.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 9: Features

Name | Range | Element Type | Multiple References Allowed

subject

de.averbis.extraction.types.CoreAnnotation




Description: An annotation representing the subject of the relation ("agens").

predicate

de.averbis.extraction.types.CoreAnnotation




Description: The annotation representing the predicate of the relation. E.g. in "BASF has integrated BAYER", 'has integrated' is the predicate, marked as ChunkVP; use the feature 'value' to define the type of the relation, e.g. value: acquisition.

object

de.averbis.extraction.types.CoreAnnotation




Description: The object of the relation.

value

uima.cas.String




Description: Type of the relation.

Entity

Full Name: de.averbis.extraction.types.Entity

Description: A named entity; not to be confused with a Concept. A (named) entity is a string representation in text referring to a class of entities. Thus, the entity mention does not have an identifier but a specific type (the category) assigned to it.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 10: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: This feature provides the text of the annotated mention. Important for easily representing discontinuous mentions such as 'T cell' in the expression 'T and B cell'.

label

uima.cas.String




Description: The type of the entity, e.g., PERSON, LOCATION, etc. The feature is named label because "type" is a reserved word.

parsedElements

uima.cas.FSArray

de.averbis.extraction.types.Entity



Description: Reference to all recognized entities inside this Entity such as Size, Time, Area, Date, Volume, ....

ResolvedEntity

Full Name: de.averbis.extraction.types.ResolvedEntity

Description: A special entity with an additional specific resolved form.

Parent Type: de.averbis.extraction.types.Entity


Table 11: Features

Name | Range | Element Type | Multiple References Allowed

resolvedType

uima.cas.String




Description: The type of the resolved form.

resolvedForm

uima.cas.String




Description: A string representing the resolved form of the entity.

Group

Full Name: de.averbis.extraction.types.Group

Description: Groups together a set of annotations that belong together, e.g., enumerations. One of them can be set to the "leading" concept. E.g. "the liver metastasis is hypodense and has a size of 3cm*2cm". lead: metastasis, other concepts: liver, hypodense, 3cm*2cm

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 12: Features

Name | Range | Element Type | Multiple References Allowed

leadingAnnotation

de.averbis.extraction.types.CoreAnnotation




Description: Optional annotation indicating the leading head of the group.

members

uima.cas.FSArray

de.averbis.extraction.types.CoreAnnotation



Description: Annotations contained in the group.

label

uima.cas.String




Description: Textual label describing the group elements.

Enumeration

Full Name: de.averbis.extraction.types.Enumeration

Description: A specific group representing an enumeration like "red, blue and green".

Parent Type: de.averbis.extraction.types.Group

Listing

Full Name: de.averbis.extraction.types.Listing

Description: A specific group representing a listing like "1. red 2. blue 3. green".

Parent Type: de.averbis.extraction.types.Group

InputParam

Full Name: de.averbis.extraction.types.InputParam

Description: InputParam is used to pass parameters to an analysis engine via a JCas object. This may be used to pass parameters in the process() method of an analysis engine rather than during initialization of the AEs. It is, e.g., necessary for the ConceptAnnotator, to which you may want to pass restrictions (such as "language" or "terminology") for each single text/JCas while only having one ConceptAnnotator instance.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 13: Features

Name | Range | Element Type | Multiple References Allowed

key

uima.cas.String




Description: The key of the input parameter.

values

uima.cas.StringArray




Description: The values of the input parameter.

Stem

Full Name: de.averbis.extraction.types.Stem

Description: The type for stem annotations.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 14: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: The string representation of the stem.

Category

Full Name: de.averbis.extraction.types.Category

Description: Category meta information on the document or a region of this document (use the context feature to identify which section this category refers to). E.g. language information of the document text or language information of specific sections.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 15: Features

Name | Range | Element Type | Multiple References Allowed

group

uima.cas.String




Description: The category group (e.g. HSG, language) to which the label belongs. For language categorization the group might be "lang" and then the labels could be "en", "de", "fr" etc.

label

uima.cas.String




Description: The label of the category annotation. E.g. in the case that we identified languages (de, en, fr, ...).

context

de.averbis.extraction.types.CoreAnnotation




Description: The text context which belongs to the given category annotation, e.g. Document, Section, Sentence.

rank

uima.cas.Integer




Description: The rank of the current category with respect to the context.

SummarySentence

Full Name: de.averbis.extraction.types.SummarySentence

Description: Annotation indicating a sentence that makes up a summary of the document.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 16: Features

Name | Range | Element Type | Multiple References Allowed

sentence

de.averbis.extraction.types.Sentence




Description: The sentence annotation that contains the content of this summary sentence.

descriptors

uima.cas.FSArray

de.averbis.extraction.types.Descriptor

false


Description: The descriptors extracted by the algorithm accounting for the selection of the sentence.

IndexTerm

Full Name: de.averbis.extraction.types.IndexTerm

Description: A term to be used for indexing a document.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 17: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: The string representation of the index term. Example: the normalized and stemmed string which represents a keyword in the free keywording scenario; for controlled keywording (= descriptor extraction), the dictCanon might be written here.

baseAnnotation

de.averbis.extraction.types.CoreAnnotation




Description: The annotation to be assigned as index term. This can, e.g., be a Concept or a Noun Phrase annotation from which the index term was derived.

Descriptor

Full Name: de.averbis.extraction.types.Descriptor

Description: An index term from an ontology; its type (or reference) is written in the feature annotation.

Parent Type: de.averbis.extraction.types.IndexTerm


Table 18: Features

Name | Range | Element Type | Multiple References Allowed

uid

uima.cas.String




Description: The unique identifier of the descriptor, e.g., a combination of terminology and concept id.

Keyword

Full Name: de.averbis.extraction.types.Keyword

Description: A keyword that is assigned freely (i.e., not taken from an ontology) to a document. Its type is written in the feature annotation

Parent Type: de.averbis.extraction.types.IndexTerm

GenericMetadata

Full Name: de.averbis.extraction.types.GenericMetadata

Description: Type holding generic metadata information of a document. Multiple annotations should be used to add additional metadata to the CAS.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 19: Features

Name | Range | Element Type | Multiple References Allowed

metadataFieldname

uima.cas.String




Description: To reduce the potential metadata field names, the predefined field names should be used where possible (add new field names if necessary). Predefined metadata field names are: title, summary, filesize, annotatorName.

value

uima.cas.String




Description: Value of the metadata field, e.g. metadataFieldname = title, value = "Brave new world".

POSTagNoun

Full Name: de.averbis.extraction.types.POSTagNoun

Description: The type for all POS-Tags of the type "Noun".

Parent Type: de.averbis.extraction.types.POSTag

POSTagVerb

Full Name: de.averbis.extraction.types.POSTagVerb

Description: The type for all POS-Tags of the type "Verb".

Parent Type: de.averbis.extraction.types.POSTag

POSTagAdj

Full Name: de.averbis.extraction.types.POSTagAdj

Description: The type for all POS-Tags of the type "Adjective".

Parent Type: de.averbis.extraction.types.POSTag

POSTagAdv

Full Name: de.averbis.extraction.types.POSTagAdv

Description: The type for all POS-Tags of the type "Adverb".

Parent Type: de.averbis.extraction.types.POSTag

POSTagPron

Full Name: de.averbis.extraction.types.POSTagPron

Description: The type for all POS-Tags of the type "Pronoun".

Parent Type: de.averbis.extraction.types.POSTag

POSTagDet

Full Name: de.averbis.extraction.types.POSTagDet

Description: The type for all POS-Tags of the type "Determiner".

Parent Type: de.averbis.extraction.types.POSTag

POSTagAdp

Full Name: de.averbis.extraction.types.POSTagAdp

Description: The type for all POS-Tags of the type "Preposition/Postposition".

Parent Type: de.averbis.extraction.types.POSTag

POSTagNum

Full Name: de.averbis.extraction.types.POSTagNum

Description: The type for all POS-Tags of the type "Numeral".

Parent Type: de.averbis.extraction.types.POSTag

POSTagConj

Full Name: de.averbis.extraction.types.POSTagConj

Description: The type for all POS-Tags of the type "Conjunction".

Parent Type: de.averbis.extraction.types.POSTag

POSTagPart

Full Name: de.averbis.extraction.types.POSTagPart

Description: The type for all POS-Tags of the type "Particle".

Parent Type: de.averbis.extraction.types.POSTag

POSTagPunct

Full Name: de.averbis.extraction.types.POSTagPunct

Description: The type for all POS-Tags of the type "Punctuation".

Parent Type: de.averbis.extraction.types.POSTag

ValidTextSegment

Full Name: de.averbis.extraction.types.ValidTextSegment

Description: Zone to mark valid text, in contrast to invalid text such as OCR (Optical Character Recognition) artefacts, number blocks, tables, etc.

Parent Type: de.averbis.extraction.types.Zone

Lemma

Full Name: de.averbis.extraction.types.Lemma

Description: The lemma is the canonical form of a lexeme. Lemmata can be retrieved from a lexicon or produced by a lemmatizer.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 20: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: The value of the lemma.

case

uima.cas.String




Description: The grammatical case, such as Nom (nominative) or Gen (genitive).

number

uima.cas.String




Description: Singular or plural.

gender

uima.cas.String




Description: The grammatical gender: fem, masc, or neutr.

Member

Full Name: de.averbis.extraction.types.Member

Description: Utility annotation for indicating a member of a group.

Parent Type: de.averbis.extraction.types.CoreAnnotation

8.1.2. NumericValueTypeSystem

de.averbis.textanalysis.typesystems.NumericValueTypeSystem

This type system contains types for representing numeric values.

Maven Coordinates

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>numeric-value-typesystem</artifactId>
    <version>3.5.0</version>
</dependency>
          

Imports

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

NumericValue

Full Name: de.averbis.textanalysis.types.numericvalue.NumericValue

Description: Represents a text span that can be interpreted as a numeric value; the value is stored in a feature.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 21: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.Double




Description: The actual double value of the numeric value.
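
A minimal sketch of how NumericValue annotations might be read from a processed CAS with uimaFIT (getter names assumed to follow the usual JCasGen conventions):

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.types.numericvalue.NumericValue;

public class NumericValueReader {

    // Prints each numeric value span together with its interpreted double value,
    // e.g. "fünfundzwanzig -> 25.0".
    public static void printNumericValues(JCas jcas) {
        for (NumericValue nv : JCasUtil.select(jcas, NumericValue.class)) {
            System.out.println(nv.getCoveredText() + " -> " + nv.getValue());
        }
    }
}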

Fraction

Full Name: de.averbis.textanalysis.types.numericvalue.Fraction

Description: A fraction of two NumericValue annotations.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 22: Features

Name | Range | Element Type | Multiple References Allowed

numerator

de.averbis.textanalysis.types.numericvalue.NumericValue




Description: The numerator of the fraction.

denominator

de.averbis.textanalysis.types.numericvalue.NumericValue




Description: The denominator of the fraction.

SimpleFraction

Full Name: de.averbis.textanalysis.types.numericvalue.SimpleFraction

Description: A fraction of two integer values.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 23: Features

Name | Range | Element Type | Multiple References Allowed

numerator

uima.cas.Integer




Description: The numerator of the fraction.

denominator

uima.cas.Integer




Description: The denominator of the fraction.

LanguageContainer

Full Name: de.averbis.textanalysis.types.numericvalue.LanguageContainer

Description: A container annotation specifying the language of the covered text.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 24: Features

Name | Range | Element Type | Multiple References Allowed

language

uima.cas.String




Description: The language locale like 'de' or 'en'.

ConjunctionFragment

Full Name: de.averbis.textanalysis.types.numericvalue.ConjunctionFragment

Description: A text span indicating a conjunction of numbers; it may also be located within a token, as in 'fünfundzwanzig'.

Parent Type: de.averbis.extraction.types.CoreAnnotation

RomanNumeral

Full Name: de.averbis.textanalysis.types.numericvalue.RomanNumeral

Description: Annotation for roman numerals.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 25: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.Integer




Description: Integer value of the roman numeral.

8.1.3. MeasurementTypeSystem

de.averbis.textanalysis.typesystems.MeasurementTypeSystem

This type system contains types for measurements and units.

Maven Coordinates

        
<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>measurement-typesystem</artifactId>
   <version>3.5.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

  • de.averbis.textanalysis.typesystems.NumericValueTypeSystem

Measurement

Full Name: de.averbis.textanalysis.types.measurement.Measurement

Description: A measurement combining a numeric value and a unit.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 26: Features

Name | Range | Element Type | Multiple References Allowed

unit

de.averbis.textanalysis.types.measurement.Unit




Description: The unit of the measurement.

value

de.averbis.textanalysis.types.numericvalue.NumericValue




Description: The numeric value of the measurement.

normalizedUnit

uima.cas.String




Description: Normalized string value of the unit.

normalizedAsciiUnit

uima.cas.String




Description: Ascii normalized string value of the unit.

normalizedValue

uima.cas.Double




Description: The normalized value of the measurement, i.e., the numeric value converted according to the transformation of the unit into its standard unit.

normalized

uima.cas.String




Description: The concatenation of the normalized numeric value and the ascii normalized unit.

parsedUnit

uima.cas.String




Description: Optional parsable unit string which replaces the unit annotation. It is utilized for normalization.
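
A minimal sketch of how the normalized features of a Measurement might be read with uimaFIT (getter names assumed to follow JCasGen conventions; the concrete standard units depend on the pipeline configuration):

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.types.measurement.Measurement;

public class MeasurementReader {

    // Prints each measurement with its normalized value and unit, e.g. a span
    // "5 mg" could yield "0.005 g" (the standard unit here is an assumption).
    public static void printMeasurements(JCas jcas) {
        for (Measurement m : JCasUtil.select(jcas, Measurement.class)) {
            System.out.println(m.getCoveredText() + " -> "
                    + m.getNormalizedValue() + " " + m.getNormalizedAsciiUnit()
                    + " (normalized: " + m.getNormalized() + ")");
        }
    }
}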

Unit

Full Name: de.averbis.textanalysis.types.measurement.Unit

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 27: Features

Name | Range | Element Type | Multiple References Allowed

normalizedAscii

uima.cas.String




Description: Ascii normalized string value of the unit.

parsed

uima.cas.String




Description: String value of the parsed/identified unit.

normalized

uima.cas.String




Description: Normalized string value of the unit.

dimension

uima.cas.String




Description: The dimension of the unit, in a form like '[L^3]' for volume.

MeasurementInterval

Full Name: de.averbis.textanalysis.types.measurement.MeasurementInterval

Description: An interval defined by two measurements, a low and high limit.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 28: Features

Name | Range | Element Type | Multiple References Allowed

low

de.averbis.textanalysis.types.measurement.Measurement




Description: The lower bound of the interval.

high

de.averbis.textanalysis.types.measurement.Measurement




Description: The upper bound of the interval.

lowExcluded

uima.cas.Boolean




Description: Marker set to true if the lower bound itself is not part of the interval.

highExcluded

uima.cas.Boolean




Description: Marker set to true if the upper bound itself is not part of the interval.
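
The two excluded markers decide whether the bounds themselves belong to the interval. A hedged sketch of a containment check over normalized values (getter names assumed to follow JCasGen conventions; null checks for open-ended intervals are omitted):

import de.averbis.textanalysis.types.measurement.MeasurementInterval;

public class IntervalCheck {

    // Tests whether a normalized value lies within the interval,
    // honoring lowExcluded/highExcluded.
    public static boolean contains(MeasurementInterval interval, double value) {
        double low = interval.getLow().getNormalizedValue();
        double high = interval.getHigh().getNormalizedValue();
        boolean aboveLow = interval.getLowExcluded() ? value > low : value >= low;
        boolean belowHigh = interval.getHighExcluded() ? value < high : value <= high;
        return aboveLow && belowHigh;
    }
}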

SimpleMeasurementInterval

Full Name: de.averbis.textanalysis.types.measurement.SimpleMeasurementInterval

Description: An interval extending MeasurementInterval with several primitive features representing two measurements.

Parent Type: de.averbis.textanalysis.types.measurement.MeasurementInterval


Table 29: Features

Name | Range | Element Type | Multiple References Allowed

lowNormalizedUnit

uima.cas.String




Description: The normalized unit of the lower bound.

lowNormalizedValue

uima.cas.Double




Description: The normalized value of the lower bound.

lowNormalized

uima.cas.String




Description: The normalized value combined with the normalized unit of the lower bound.

lowParsedUnit

uima.cas.String




Description: The parsed unit of the lower bound.

highNormalizedUnit

uima.cas.String




Description: The normalized unit of the upper bound.

highNormalizedValue

uima.cas.Double




Description: The normalized value of the upper bound.

highNormalized

uima.cas.String




Description: The normalized value combined with the normalized unit of the upper bound.

highParsedUnit

uima.cas.String




Description: The parsed unit of the upper bound.

RelativeMeasurementInterval

Full Name: de.averbis.textanalysis.types.measurement.RelativeMeasurementInterval

Description: A relative interval defined by two measurements, a base and deflection.

Parent Type: de.averbis.textanalysis.types.measurement.MeasurementInterval


Table 30: Features

Name | Range | Element Type | Multiple References Allowed

base

de.averbis.textanalysis.types.measurement.Measurement




Description: The base of the interval.

deflection

de.averbis.textanalysis.types.measurement.Measurement




Description: The deflection of the interval.

IntervalIndicator

Full Name: de.averbis.textanalysis.types.measurement.IntervalIndicator

Description: A textual representation indicating an interval, like '-' or 'bis'.

Parent Type: de.averbis.extraction.types.CoreAnnotation

ComparisonIndicator

Full Name: de.averbis.textanalysis.types.measurement.ComparisonIndicator

Description: A textual representation of a comparison like '<=' or 'unter', also able to indicate an interval.

Parent Type: de.averbis.extraction.types.CoreAnnotation

GreaterIndicator

Full Name: de.averbis.textanalysis.types.measurement.GreaterIndicator

Description: A textual representation indicating something is 'greater', also able to indicate an interval.

Parent Type: de.averbis.textanalysis.types.measurement.ComparisonIndicator

LessIndicator

Full Name: de.averbis.textanalysis.types.measurement.LessIndicator

Description: A textual representation indicating something is 'less', also able to indicate an interval.

Parent Type: de.averbis.textanalysis.types.measurement.ComparisonIndicator

NoUnit

Full Name: de.averbis.textanalysis.types.measurement.NoUnit

Description: A textual position that is not a unit.

Parent Type: de.averbis.extraction.types.CoreAnnotation

DictionaryMeasurementMention

Full Name: de.averbis.textanalysis.types.measurement.DictionaryMeasurementMention

Description: A textual representation indicating a measurement. This is a helper type for measurements that combine numeric values and units or cause other problems for the unit parsing.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 31: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: Parseable value of the measurement.

unit

uima.cas.String




Description: Parseable unit of the measurement.

8.1.4. TemporalTypeSystem

de.averbis.textanalysis.typesystems.TemporalTypeSystem

This type system contains types for representing temporal expressions and values.

Maven Coordinates

        
<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>temporal-typesystem</artifactId>
   <version>3.5.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

Timex3

Full Name: de.averbis.textanalysis.types.temporal.Timex3

Description: Represents a text span which can be interpreted as a temporal expression; its normalized value is stored in a feature.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 32: Features

Name | Range | Element Type | Multiple References Allowed

tid

uima.cas.String




Description: Non-optional attribute. Each TIMEX3 expression has to be identified by a unique ID number. This is automatically assigned by the annotation tool.

kind

uima.cas.String




Description: Non-optional attribute. Each TIMEX3 is assigned one of the following types: DATE, TIME, DURATION, or SET. The format of the value attribute is determined by the type of TIMEX3. For instance, a DURATION must have a value that begins with the letter ’P’ since durations represent a period of time. This will be elaborated on below in the value section. In addition, some optional attributes are used specifically with certain types of temporal expressions.

value

uima.cas.String




Description: The value attribute details which temporal information is contained in the TIMEX3. This value is given in an extended ISO 8601 format. Examples: T24:00, 2001-01-12TEV, 1984-01-03T12:00, XXXX-12-02, 1964-SU, P4M, PT20M

temporalFunction

uima.cas.Boolean




Description: Binary attribute which expresses that the value of the temporal expression needs to be determined via evaluation of a temporal function.

anchor

de.averbis.textanalysis.types.temporal.Timex3




Description: Optional attribute. It introduces the annotation of the temporal expression to which the TIMEX3 in question is temporally anchored.
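
A minimal sketch of how Timex3 annotations and their normalized values might be inspected with uimaFIT (getter names assumed to follow JCasGen conventions):

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.types.temporal.Timex3;

public class TimexReader {

    // Prints each temporal expression with its kind and extended ISO 8601 value,
    // e.g. "noon on January 3rd, 1984 -> TIME 1984-01-03T12:00".
    public static void printTimexes(JCas jcas) {
        for (Timex3 t : JCasUtil.select(jcas, Timex3.class)) {
            System.out.println(t.getCoveredText() + " -> "
                    + t.getKind() + " " + t.getValue());
        }
    }
}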

Date

Full Name: de.averbis.textanalysis.types.temporal.Date

Description: The expression describes a calendar time.

Parent Type: de.averbis.textanalysis.types.temporal.Timex3


Table 33: Features

Name | Range | Element Type | Multiple References Allowed

day

uima.tcas.Annotation




Description: The day of the actual date.

month

uima.tcas.Annotation




Description: The month of the actual date.

year

uima.tcas.Annotation




Description: The year of the actual date.

Time

Full Name: de.averbis.textanalysis.types.temporal.Time

Description: The expression refers to a time of the day, even if in a very indefinite way.

Parent Type: de.averbis.textanalysis.types.temporal.Timex3


Table 34: Features

Name | Range | Element Type | Multiple References Allowed

hour

uima.tcas.Annotation




Description: The hour of the actual time.

minute

uima.tcas.Annotation




Description: The minute of the actual time.

second

uima.tcas.Annotation




Description: The second of the actual time.

Duration

Full Name: de.averbis.textanalysis.types.temporal.Duration

Description: The expression describes a duration. This value is assigned to explicit durations.

Parent Type: de.averbis.textanalysis.types.temporal.Timex3

TemporalSet

Full Name: de.averbis.textanalysis.types.temporal.TemporalSet

Description: The expression describes a set of times.

Parent Type: de.averbis.textanalysis.types.temporal.Timex3

DocumentDate

Full Name: de.averbis.textanalysis.types.temporal.DocumentDate

Description: Annotation representing the date and time of the document, if available, e.g., the creation time.

Parent Type: de.averbis.textanalysis.types.temporal.Timex3

WeekDay

Full Name: de.averbis.textanalysis.types.temporal.WeekDay

Description: Annotation indicating a weekday, e.g., 'Monday'.

Parent Type: uima.tcas.Annotation


Table 35: Features

Name | Range | Element Type | Multiple References Allowed

dayOfWeek

uima.cas.Integer




Description: Number of the day, e.g., 1 for Monday.

DayTime

Full Name: de.averbis.textanalysis.types.temporal.DayTime

Description: Annotation indicating a time of day, e.g., 'in the morning'.

Parent Type: uima.tcas.Annotation


Table 36: Features

Name | Range | Element Type | Multiple References Allowed

timeOfDay

uima.cas.String




Description: String value specifying the time of the day.

TemporalIntervalBeginIndicator

Full Name: de.averbis.textanalysis.types.temporal.TemporalIntervalBeginIndicator

Description: Indicator for a possible begin of a temporal interval.

Parent Type: de.averbis.extraction.types.CoreAnnotation

TemporalIntervalEndIndicator

Full Name: de.averbis.textanalysis.types.temporal.TemporalIntervalEndIndicator

Description: Indicator for a possible end of a temporal interval.

Parent Type: de.averbis.extraction.types.CoreAnnotation

UnambiguousTimex

Full Name: de.averbis.textanalysis.types.temporal.UnambiguousTimex

Description: Helper annotation type pointing to a most likely unambiguous temporal expression. The term '1992' could represent a year, but also a measurement with that value. Other text spans like '1.1.2015' are most likely unambiguous dates, which can be represented by this type.

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 37: Features

Name | Range | Element Type | Multiple References Allowed

timex

de.averbis.textanalysis.types.temporal.Timex3




Description: The actual temporal expression.

DateInterval

Full Name: de.averbis.textanalysis.types.temporal.DateInterval

Description: The expression describes an interval (or set) of dates defined by the start and end date of an event.

Parent Type: de.averbis.textanalysis.types.temporal.TemporalSet


Table 38: Features

Name | Range | Element Type | Multiple References Allowed

startDate

de.averbis.textanalysis.types.temporal.Date




Description: The start date of the temporal interval.

endDate

de.averbis.textanalysis.types.temporal.Date




Description: The end date of the temporal interval.

8.1.5. SmpcTypeSystem

de.averbis.textanalysis.pharma.SmpcTypeSystem

-

Maven Coordinates

        
<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>idmp-typesystem</artifactId>
   <version>0.7.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisTypeSystem

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

  • de.averbis.textanalysis.typesystems.MeasurementTypeSystem

  • de.averbis.textanalysis.typesystems.TemporalTypeSystem

  • de.averbis.textanalysis.typesystems.pharma.PharmaConceptTypeSystem

SmPC

Full Name: de.averbis.textanalysis.types.pharma.smpc.SmPC

Description: -

Parent Type: uima.tcas.Annotation


Table 39: Features

Name | Range | Element Type | Multiple References Allowed

medicinalProduct

de.averbis.textanalysis.types.pharma.smpc.MedicinalProduct




Description: -

marketingAuthorisation

de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisation




Description: -

clinicalParticulars

de.averbis.textanalysis.types.pharma.smpc.ClinicalParticulars




Description: -

pharmaceuticalForm

de.averbis.textanalysis.types.pharma.smpc.PharmaceuticalForm




Description: -

SmpcContext

Full Name: de.averbis.textanalysis.types.pharma.smpc.SmpcContext

Description: -

Parent Type: uima.tcas.Annotation

MedicinalProduct

Full Name: de.averbis.textanalysis.types.pharma.smpc.MedicinalProduct

Description: -

Parent Type: uima.tcas.Annotation


Table 40: Features

Name | Range | Element Type | Multiple References Allowed

additionalMonitoringIndicator

uima.tcas.Annotation




Description: -

medicinalProductClassification

de.averbis.textanalysis.types.pharma.smpc.MedicinalProductClassification




Description: -

medicinalProductName

de.averbis.textanalysis.types.pharma.smpc.MedicinalProductName




Description: -

activeSubstances

de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer




Description: -

excipients

de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer




Description: -

administration

de.averbis.textanalysis.types.pharma.smpc.Administration




Description: -

shelfLifeStorage

de.averbis.textanalysis.types.pharma.smpc.ShelfLifeStorage




Description: -

MedicinalProductName

Full Name: de.averbis.textanalysis.types.pharma.smpc.MedicinalProductName

Description: -

Parent Type: uima.tcas.Annotation


Table 41: Features

Name | Range | Element Type | Multiple References Allowed

inventedNamePart

de.averbis.textanalysis.types.pharma.smpc.InventedProductName




Description: inventedNamePart

scientificNamePart

de.averbis.textanalysis.types.pharma.smpc.ScientificProductName




Description: scientificNamePart

strengthPart

de.averbis.textanalysis.types.pharma.smpc.ProductStrength




Description: strengthPart

pharmaceuticalDoseFormPart

de.averbis.textanalysis.types.pharma.smpc.ProductDoseForm




Description: pharmaceuticalDoseFormPart

formulationPart

uima.tcas.Annotation




Description: formulationPart

intendedUsePart

uima.tcas.Annotation




Description: intendedUsePart

targetPopulationPart

uima.tcas.Annotation




Description: targetPopulationPart

containerOrPackPart

uima.tcas.Annotation




Description: containerOrPackPart

devicePart

uima.tcas.Annotation




Description: devicePart

trademarkOrCompanyNamePart

uima.tcas.Annotation




Description: trademarkOrCompanyNamePart

timePeriodPart

uima.tcas.Annotation




Description: timePeriodPart

flavourPart

uima.tcas.Annotation




Description: flavourPart

InventedProductName

Full Name: de.averbis.textanalysis.types.pharma.smpc.InventedProductName

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 42: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

ScientificProductName

Full Name: de.averbis.textanalysis.types.pharma.smpc.ScientificProductName

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 43: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

ProductStrength

Full Name: de.averbis.textanalysis.types.pharma.smpc.ProductStrength

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 44: Features

Name | Range | Element Type | Multiple References Allowed

measurement

de.averbis.textanalysis.types.measurement.Measurement




Description: -

ProductDoseForm

Full Name: de.averbis.textanalysis.types.pharma.smpc.ProductDoseForm

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 45: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

Indication

Full Name: de.averbis.textanalysis.types.pharma.smpc.Indication

Description: -

Parent Type: uima.tcas.Annotation


Table 46: Features

Name | Range | Element Type | Multiple References Allowed

populationSpecifics

de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics




Description: -

otherTherapySpecifics

de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics




Description: -

diseaseStatus

uima.tcas.Annotation




Description: -

coMorbidity

uima.tcas.Annotation




Description: -

intendedEffect

uima.tcas.Annotation




Description: -

timingDuration

uima.tcas.Annotation




Description: -

indicationAsDiseaseSymptomProcedure

uima.tcas.Annotation




Description: -

Interaction

Full Name: de.averbis.textanalysis.types.pharma.smpc.Interaction

Description: -

Parent Type: uima.tcas.Annotation


Table 47: Features

Name | Range | Element Type | Multiple References Allowed

interactionType

uima.cas.String




Description: -

interactionEffect

uima.tcas.Annotation




Description: -

interactionIncidence

uima.tcas.Annotation




Description: -

managementActions

uima.tcas.Annotation




Description: -

Contraindication

Full Name: de.averbis.textanalysis.types.pharma.smpc.Contraindication

Description: -

Parent Type: uima.tcas.Annotation


Table 48: Features

Name | Range | Element Type | Multiple References Allowed

populationSpecifics

de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics




Description: -

otherTherapySpecifics

de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics




Description: -

contraIndicationsAsDiseaseSymptomProcedure

uima.tcas.Annotation




Description: -

diseaseStatus

uima.tcas.Annotation




Description: -

coMorbidity

uima.tcas.Annotation




Description: -

UndesirableEffect

Full Name: de.averbis.textanalysis.types.pharma.smpc.UndesirableEffect

Description: -

Parent Type: uima.tcas.Annotation


Table 49: Features

Name | Range | Element Type | Multiple References Allowed

undesirableEffect

uima.tcas.Annotation




Description: undesirableEffect

undesirableEffectAsSymptomConditionEffect

uima.tcas.Annotation




Description: undesirableEffectAsSymptomConditionEffect

frequencyOfOccurence

uima.tcas.Annotation




Description: frequencyOfOccurence

symptomConditionEffectClassification

uima.tcas.Annotation




Description: symptomConditionEffectClassification

PharmaceuticalForm

Full Name: de.averbis.textanalysis.types.pharma.smpc.PharmaceuticalForm

Description: -

Parent Type: uima.tcas.Annotation


Table 50: Features

Name | Range | Element Type | Multiple References Allowed

authorisedDosageForm

de.averbis.textanalysis.types.pharma.smpc.AuthorisedDosageForm




Description: -

manufacturedItem

de.averbis.textanalysis.types.pharma.smpc.ManufacturedItem




Description: -

AuthorisedDosageForm

Full Name: de.averbis.textanalysis.types.pharma.smpc.AuthorisedDosageForm

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 51: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

MedicinalProductClassification

Full Name: de.averbis.textanalysis.types.pharma.smpc.MedicinalProductClassification

Description: -

Parent Type: uima.tcas.Annotation


Table 52: Features

Name | Range | Element Type | Multiple References Allowed

classificationSystem

uima.tcas.Annotation




Description: -

classificationValue

uima.tcas.Annotation




Description: -

PopulationSpecifics

Full Name: de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics

Description: -

Parent Type: uima.tcas.Annotation


Table 53: Features

Name | Range | Element Type | Multiple References Allowed

age

uima.tcas.Annotation




Description: -

ageRange

uima.tcas.Annotation




Description: -

gender

uima.tcas.Annotation




Description: -

race

uima.tcas.Annotation




Description: -

healthStatus

uima.tcas.Annotation




Description: -

OtherTherapySpecifics

Full Name: de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics

Description: -

Parent Type: uima.tcas.Annotation


Table 54: Features

Name | Range | Element Type | Multiple References Allowed

therapyRelationshipType

uima.tcas.Annotation




Description: -

medication

uima.tcas.Annotation




Description: -

MarketingAuthorisation

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisation

Description: -

Parent Type: uima.tcas.Annotation


Table 55: Features

Name | Range | Element Type | Multiple References Allowed

marketingAuthorisationNumberContainer

de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumberContainer




Description: -

legalStatusOfSupply

uima.tcas.Annotation




Description: -

marketingAuthorisationHolder

de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolder




Description: -

firstAuthorisationDate

de.averbis.textanalysis.types.pharma.smpc.DateOfFirstAuthorisation




Description: -

lastRenewalDate

de.averbis.textanalysis.types.pharma.smpc.DateOfLatestRenewal




Description: -

revisionDate

de.averbis.textanalysis.types.pharma.smpc.DateOfRevision




Description: -

ClinicalParticulars

Full Name: de.averbis.textanalysis.types.pharma.smpc.ClinicalParticulars

Description: -

Parent Type: uima.tcas.Annotation


Table 56: Features

Name | Range | Element Type | Multiple References Allowed

therapeuticIndications

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.smpc.Indication

false


Description: -

undesirableEffects

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.smpc.UndesirableEffect

false


Description: -

interactions

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.smpc.Interaction

false


Description: -

contraIndications

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.smpc.Contraindication

false


Description: -

Administration

Full Name: de.averbis.textanalysis.types.pharma.smpc.Administration

Description: -

Parent Type: uima.tcas.Annotation


Table 57: Features

Name | Range | Element Type | Multiple References Allowed

routeOfAdministration

de.averbis.textanalysis.types.pharma.AdministrationConcept




Description: -

unitOfPresentation

uima.tcas.Annotation




Description: -

paediatricUseIndicator

de.averbis.textanalysis.types.pharma.smpc.PaediatricUseIndicator




Description: -

Container

Full Name: de.averbis.textanalysis.types.pharma.smpc.Container

Description: -

Parent Type: uima.tcas.Annotation


Table 58: Features

Name | Range | Element Type | Multiple References Allowed

packageDescription

uima.tcas.Annotation




Description: -

ShelfLifeStorage

Full Name: de.averbis.textanalysis.types.pharma.smpc.ShelfLifeStorage

Description: -

Parent Type: uima.tcas.Annotation


Table 59: Features

Name | Range | Element Type | Multiple References Allowed

shelfLifeContainer

de.averbis.textanalysis.types.pharma.smpc.ShelfLifeContainer




Description: -

specialPrecautionsForStorage

uima.tcas.Annotation




Description: -

ManufacturedItem

Full Name: de.averbis.textanalysis.types.pharma.smpc.ManufacturedItem

Description: -

Parent Type: uima.tcas.Annotation


Table 60: Features

Name | Range | Element Type | Multiple References Allowed

form

uima.tcas.Annotation




Description: -

manufacturedItemQuantity

uima.tcas.Annotation




Description: -

physicalCharacteristics

de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristics




Description: -

PhysicalCharacteristics

Full Name: de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristics

Description: -

Parent Type: uima.tcas.Annotation


Table 61: Features

Name | Range | Element Type | Multiple References Allowed

itemShape

de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsShape




Description: -

itemColor

de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsColor




Description: -

itemImprint

de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsImprint




Description: -

MarketingAuthorisationHolder

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolder

Description: -

Parent Type: uima.tcas.Annotation


Table 62: Features

Name | Range | Element Type | Multiple References Allowed

organisationId

uima.tcas.Annotation




Description: -

authorisationHolderName

de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderName




Description: -

authorisationHolderAddress

de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderAddress




Description: -

ContactPerson

Full Name: de.averbis.textanalysis.types.pharma.smpc.ContactPerson

Description: -

Parent Type: uima.tcas.Annotation


Table 63: Features

Name | Range | Element Type | Multiple References Allowed

confidentialityIndicator

uima.tcas.Annotation




Description: -

telecom

uima.tcas.Annotation




Description: -

name

uima.tcas.Annotation




Description: -

role

uima.tcas.Annotation




Description: -

SmpcDate

Full Name: de.averbis.textanalysis.types.pharma.smpc.SmpcDate

Description: -

Parent Type: uima.tcas.Annotation


Table 64: Features

Name | Range | Element Type | Multiple References Allowed

date

de.averbis.textanalysis.types.temporal.Date




Description: -

DateOfFirstAuthorisation

Full Name: de.averbis.textanalysis.types.pharma.smpc.DateOfFirstAuthorisation

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.SmpcDate

DateOfLatestRenewal

Full Name: de.averbis.textanalysis.types.pharma.smpc.DateOfLatestRenewal

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.SmpcDate

DateOfRevision

Full Name: de.averbis.textanalysis.types.pharma.smpc.DateOfRevision

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.SmpcDate

Substance

Full Name: de.averbis.textanalysis.types.pharma.smpc.Substance

Description: -

Parent Type: uima.tcas.Annotation


Table 65: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

ActiveSubstance

Full Name: de.averbis.textanalysis.types.pharma.smpc.ActiveSubstance

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.Substance


Table 66: Features

Name | Range | Element Type | Multiple References Allowed

referencedSubstance

de.averbis.textanalysis.types.pharma.smpc.Substance




Description: -

presentationStrength

de.averbis.textanalysis.types.pharma.smpc.PresentationStrength




Description: -

concentrationStrength

de.averbis.textanalysis.types.pharma.smpc.ConcentrationStrength




Description: -

PresentationStrength

Full Name: de.averbis.textanalysis.types.pharma.smpc.PresentationStrength

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 67: Features


Name | Range | Element Type | Multiple References Allowed

measurement

de.averbis.textanalysis.types.measurement.Measurement




Description: -

ConcentrationStrength

Full Name: de.averbis.textanalysis.types.pharma.smpc.ConcentrationStrength

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 68: Features

Name | Range | Element Type | Multiple References Allowed

measurement

de.averbis.textanalysis.types.measurement.Measurement




Description: -

Excipient

Full Name: de.averbis.textanalysis.types.pharma.smpc.Excipient

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.Substance


Table 69: Features

Name | Range | Element Type | Multiple References Allowed

form

uima.tcas.Annotation




Description: -

PharmacodynamicClassificationSystem

Full Name: de.averbis.textanalysis.types.pharma.smpc.PharmacodynamicClassificationSystem

Description: -

Parent Type: uima.tcas.Annotation

PharmacodynamicClassificationValue

Full Name: de.averbis.textanalysis.types.pharma.smpc.PharmacodynamicClassificationValue

Description: -

Parent Type: uima.tcas.Annotation

ShelfLifeTimePeriod

Full Name: de.averbis.textanalysis.types.pharma.smpc.ShelfLifeTimePeriod

Description: -

Parent Type: uima.tcas.Annotation


Table 70: Features

Name | Range | Element Type | Multiple References Allowed

form

uima.tcas.Annotation




Description: -

ShelfLifeTimeType

Full Name: de.averbis.textanalysis.types.pharma.smpc.ShelfLifeTimeType

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 71: Features

Name | Range | Element Type | Multiple References Allowed

prefTerm

uima.cas.String




Description: -

code

uima.cas.String




Description: -

SpecialPrecautionsForStorage

Full Name: de.averbis.textanalysis.types.pharma.smpc.SpecialPrecautionsForStorage

Description: -

Parent Type: uima.tcas.Annotation

PackageDescription

Full Name: de.averbis.textanalysis.types.pharma.smpc.PackageDescription

Description: -

Parent Type: uima.tcas.Annotation

PhysicalCharacteristicsShape

Full Name: de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsShape

Description: -

Parent Type: uima.tcas.Annotation

PhysicalCharacteristicsColor

Full Name: de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsColor

Description: -

Parent Type: uima.tcas.Annotation

PhysicalCharacteristicsImprint

Full Name: de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsImprint

Description: -

Parent Type: uima.tcas.Annotation

MarketingAuthorisationNumber

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumber

Description: -

Parent Type: uima.tcas.Annotation

MarketingAuthorisationHolderName

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderName

Description: -

Parent Type: uima.tcas.Annotation

MarketingAuthorisationHolderAddress

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderAddress

Description: -

Parent Type: uima.tcas.Annotation


Table 72: Features


Name | Range | Element Type | Multiple References Allowed

postAddress

uima.tcas.Annotation




Description: -

postCode

uima.tcas.Annotation




Description: -

city

uima.tcas.Annotation




Description: -

country

uima.tcas.Annotation




Description: -

AdditionalMonitoringIndicator

Full Name: de.averbis.textanalysis.types.pharma.smpc.AdditionalMonitoringIndicator

Description: -

Parent Type: uima.tcas.Annotation

PaediatricUseIndicator

Full Name: de.averbis.textanalysis.types.pharma.smpc.PaediatricUseIndicator

Description: -

Parent Type: uima.tcas.Annotation


Table 73: Features

Name | Range | Element Type | Multiple References Allowed

normalized

uima.cas.String




Description: -

ManufacturedItemQuantity

Full Name: de.averbis.textanalysis.types.pharma.smpc.ManufacturedItemQuantity

Description: -

Parent Type: uima.tcas.Annotation

PackageItemType

Full Name: de.averbis.textanalysis.types.pharma.smpc.PackageItemType

Description: -

Parent Type: uima.tcas.Annotation

PackageItemQuantity

Full Name: de.averbis.textanalysis.types.pharma.smpc.PackageItemQuantity

Description: -

Parent Type: uima.tcas.Annotation

PackageItemConcept

Full Name: de.averbis.textanalysis.types.pharma.smpc.PackageItemConcept

Description: -

Parent Type: de.averbis.extraction.types.Concept

City

Full Name: de.averbis.textanalysis.types.pharma.smpc.City

Description: -

Parent Type: uima.tcas.Annotation

Country

Full Name: de.averbis.textanalysis.types.pharma.smpc.Country

Description: -

Parent Type: uima.tcas.Annotation

PostAddress

Full Name: de.averbis.textanalysis.types.pharma.smpc.PostAddress

Description: -

Parent Type: uima.tcas.Annotation

PostCode

Full Name: de.averbis.textanalysis.types.pharma.smpc.PostCode

Description: -

Parent Type: uima.tcas.Annotation

CompanyName

Full Name: de.averbis.textanalysis.types.pharma.smpc.CompanyName

Description: -

Parent Type: uima.tcas.Annotation

CompanyPostfix

Full Name: de.averbis.textanalysis.types.pharma.smpc.CompanyPostfix

Description: -

Parent Type: uima.tcas.Annotation

StreetIndicator

Full Name: de.averbis.textanalysis.types.pharma.smpc.StreetIndicator

Description: -

Parent Type: uima.tcas.Annotation

MarketingAuthorisationNumberContainer

Full Name: de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumberContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 74: Features

Name | Range | Element Type | Multiple References Allowed

numbers

uima.cas.FSArray




Description: -

ConceptCell

Full Name: de.averbis.textanalysis.types.pharma.smpc.ConceptCell

Description: -

Parent Type: uima.tcas.Annotation


Table 75: Features

Name | Range | Element Type | Multiple References Allowed

concepts

uima.cas.FSArray




Description: -

SubstanceContainer

Full Name: de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 76: Features

Name | Range | Element Type | Multiple References Allowed

substances

uima.cas.FSArray




Description: -

ActiveSubstanceContainer

Full Name: de.averbis.textanalysis.types.pharma.smpc.ActiveSubstanceContainer

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer

ExcipientSubstanceContainer

Full Name: de.averbis.textanalysis.types.pharma.smpc.ExcipientSubstanceContainer

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer

Device

Full Name: de.averbis.textanalysis.types.pharma.smpc.Device

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 77: Features

Name | Range | Element Type | Multiple References Allowed

prefTerm

uima.cas.String




Description: -

code

uima.cas.String




Description: -

ExcipientRole

Full Name: de.averbis.textanalysis.types.pharma.smpc.ExcipientRole

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

ATCCode

Full Name: de.averbis.textanalysis.types.pharma.smpc.ATCCode

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 78: Features

Name | Range | Element Type | Multiple References Allowed

prefTerm

uima.cas.String




Description: -

code

uima.cas.String




Description: -

ShelfLifeContainer

Full Name: de.averbis.textanalysis.types.pharma.smpc.ShelfLifeContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 79: Features

Name | Range | Element Type | Multiple References Allowed

shelfLifes

uima.cas.FSArray




Description: -

8.1.6. SmpcSectionTypeSystem

de.averbis.textanalysis.typesystems.pharma.SmpcSectionTypeSystem

-

Maven Coordinates

        
<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>idmp-typesystem</artifactId>
   <version>0.7.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisTypeSystem

  • de.averbis.textanalysis.typesystems.NumericValueTypeSystem

SmpcAnnex

Full Name: de.averbis.textanalysis.types.pharma.SmpcAnnex

Description: -

Parent Type: de.averbis.extraction.types.Section

SmpcAnnexI

Full Name: de.averbis.textanalysis.types.pharma.SmpcAnnexI

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcAnnex

SmpcAnnexII

Full Name: de.averbis.textanalysis.types.pharma.SmpcAnnexII

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcAnnex

SmpcAnnexIII

Full Name: de.averbis.textanalysis.types.pharma.SmpcAnnexIII

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcAnnex

SmpcSection

Full Name: de.averbis.textanalysis.types.pharma.SmpcSection

Description: -

Parent Type: de.averbis.extraction.types.Section


Table 80: Features

Name | Range | Element Type | Multiple References Allowed

headline

de.averbis.textanalysis.types.pharma.SmpcSectionHeadline




Description: headline

content

uima.tcas.Annotation




Description: content

SmpcSectionHeadline

Full Name: de.averbis.textanalysis.types.pharma.SmpcSectionHeadline

Description: -

Parent Type: uima.tcas.Annotation


Table 81: Features

Name | Range | Element Type | Multiple References Allowed

number

uima.cas.Double




Description: number

text

uima.tcas.Annotation




Description: text

main

uima.cas.Boolean




Description: main

SmpcSectionHeadlineText

Full Name: de.averbis.textanalysis.types.pharma.SmpcSectionHeadlineText

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

SecondarySmpcHeadline

Full Name: de.averbis.textanalysis.types.pharma.SecondarySmpcHeadline

Description: -

Parent Type: uima.tcas.Annotation

HeadlineTag

Full Name: de.averbis.textanalysis.types.pharma.HeadlineTag

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 82: Features

Name | Range | Element Type | Multiple References Allowed

cssClass

uima.cas.String




Description: -

MainSmpcHeadline

Full Name: de.averbis.textanalysis.types.pharma.MainSmpcHeadline

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

SmpcSectionContent

Full Name: de.averbis.textanalysis.types.pharma.SmpcSectionContent

Description: -

Parent Type: de.averbis.textanalysis.types.numericvalue.LanguageContainer

NameOfTheMedicinalProductContent

Full Name: de.averbis.textanalysis.types.pharma.NameOfTheMedicinalProductContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

QualitativeAndQuantitativeCompositionContent

Full Name: de.averbis.textanalysis.types.pharma.QualitativeAndQuantitativeCompositionContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PharmaceuticalFormContent

Full Name: de.averbis.textanalysis.types.pharma.PharmaceuticalFormContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

ClinicalParticularsContent

Full Name: de.averbis.textanalysis.types.pharma.ClinicalParticularsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

TherapeuticIndicationsContent

Full Name: de.averbis.textanalysis.types.pharma.TherapeuticIndicationsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PosologyAndMethodOfAdministrationContent

Full Name: de.averbis.textanalysis.types.pharma.PosologyAndMethodOfAdministrationContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

ContraindicationsContent

Full Name: de.averbis.textanalysis.types.pharma.ContraindicationsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

SpecialWarningsAndPrecautionsForUseContent

Full Name: de.averbis.textanalysis.types.pharma.SpecialWarningsAndPrecautionsForUseContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

InteractionsContent

Full Name: de.averbis.textanalysis.types.pharma.InteractionsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

FertilityPregnancyLactationContent

Full Name: de.averbis.textanalysis.types.pharma.FertilityPregnancyLactationContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

EffectsOnAbilityContent

Full Name: de.averbis.textanalysis.types.pharma.EffectsOnAbilityContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

UndesirableEffectsContent

Full Name: de.averbis.textanalysis.types.pharma.UndesirableEffectsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

OverdoseContent

Full Name: de.averbis.textanalysis.types.pharma.OverdoseContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PharmacologicalPropertiesContent

Full Name: de.averbis.textanalysis.types.pharma.PharmacologicalPropertiesContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PharmacodynamicPropertiesContent

Full Name: de.averbis.textanalysis.types.pharma.PharmacodynamicPropertiesContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PharmacokineticPropertiesContent

Full Name: de.averbis.textanalysis.types.pharma.PharmacokineticPropertiesContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PreclinicalSafetyDataContent

Full Name: de.averbis.textanalysis.types.pharma.PreclinicalSafetyDataContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

PharmaceuticalParticularsContent

Full Name: de.averbis.textanalysis.types.pharma.PharmaceuticalParticularsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

ListOfExcipientsContent

Full Name: de.averbis.textanalysis.types.pharma.ListOfExcipientsContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

IncompatibilitiesContent

Full Name: de.averbis.textanalysis.types.pharma.IncompatibilitiesContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

ShelfLifeContent

Full Name: de.averbis.textanalysis.types.pharma.ShelfLifeContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

SpecialPrecautionsForStorageContent

Full Name: de.averbis.textanalysis.types.pharma.SpecialPrecautionsForStorageContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

NatureAndContentsOfContainerContent

Full Name: de.averbis.textanalysis.types.pharma.NatureAndContentsOfContainerContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

SpecialPrecautionsForDisposalAndOtherHandlingContent

Full Name: de.averbis.textanalysis.types.pharma.SpecialPrecautionsForDisposalAndOtherHandlingContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

MarketingAuthorisationHolderContent

Full Name: de.averbis.textanalysis.types.pharma.MarketingAuthorisationHolderContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

MarketingAuthorisationNumbersContent

Full Name: de.averbis.textanalysis.types.pharma.MarketingAuthorisationNumbersContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

DateOfAuthorisationContent

Full Name: de.averbis.textanalysis.types.pharma.DateOfAuthorisationContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

DateOfRevisionContent

Full Name: de.averbis.textanalysis.types.pharma.DateOfRevisionContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

GeneralClassificationForSupplyContent

Full Name: de.averbis.textanalysis.types.pharma.GeneralClassificationForSupplyContent

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.SmpcSectionContent

8.1.7. Module3TypeSystem

de.averbis.textanalysis.pharma.Module3TypeSystem

-

Maven Coordinates

<dependency>
   <groupId>de.averbis.textanalysis</groupId>
   <artifactId>idmp-typesystem</artifactId>
   <version>0.7.0</version>
</dependency>
        
      

Imports

  • de.averbis.textanalysis.typesystems.AverbisTypeSystem

  • de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

Product

Full Name: de.averbis.textanalysis.types.pharma.module3.Product

Description: Composition

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 83: Features

Name | Range | Element Type | Multiple References Allowed

compositions

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.module3.ProductComposition

true


Description: -

description

de.averbis.textanalysis.types.pharma.module3.ProductDescription




Description: description

ProductComposition

Full Name: de.averbis.textanalysis.types.pharma.module3.ProductComposition

Description: Composition

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 84: Features

Name | Range | Element Type | Multiple References Allowed

activeEntries

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.module3.CompositionEntry

true


Description: active entries

excipientEntries

uima.cas.FSArray

de.averbis.textanalysis.types.pharma.module3.CompositionEntry

true


Description: excipient entries

reference

de.averbis.extraction.types.CoreAnnotation




Description: reference

ProductDescription

Full Name: de.averbis.textanalysis.types.pharma.module3.ProductDescription

Description: ProductDescription

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 85: Features

Name | Range | Element Type | Multiple References Allowed

dosageForm

de.averbis.textanalysis.types.pharma.module3.DosageForm




Description: dosageForm

color

uima.cas.FSArray




Description: color

shape

uima.cas.FSArray




Description: shape

CompositionEntry

Full Name: de.averbis.textanalysis.types.pharma.module3.CompositionEntry

Description: Composition

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 86: Features

Name | Range | Element Type | Multiple References Allowed

active

uima.cas.Boolean




Description: active

substance

de.averbis.textanalysis.types.pharma.module3.Substance




Description: substance

strength

de.averbis.textanalysis.types.pharma.module3.StrengthContainer




Description: strength

role

de.averbis.textanalysis.types.pharma.module3.IngredientRoleContainer




Description: function

standard

de.averbis.textanalysis.types.pharma.module3.QualityStandardContainer




Description: qualityStandards

StrengthContainer

Full Name: de.averbis.textanalysis.types.pharma.module3.StrengthContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 87: Features

Name | Range | Element Type | Multiple References Allowed

strengths

uima.cas.FSArray




Description: strengths

IngredientRoleContainer

Full Name: de.averbis.textanalysis.types.pharma.module3.IngredientRoleContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 88: Features

Name | Range | Element Type | Multiple References Allowed

roles

uima.cas.FSArray




Description: functions

QualityStandardContainer

Full Name: de.averbis.textanalysis.types.pharma.module3.QualityStandardContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 89: Features

Name | Range | Element Type | Multiple References Allowed

standards

uima.cas.FSArray




Description: standards

IngredientRole

Full Name: de.averbis.textanalysis.types.pharma.module3.IngredientRole

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

Substance

Full Name: de.averbis.textanalysis.types.pharma.module3.Substance

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ConceptContainer

RouteOfAdministration

Full Name: de.averbis.textanalysis.types.pharma.module3.RouteOfAdministration

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ConceptContainer

DosageForm

Full Name: de.averbis.textanalysis.types.pharma.module3.DosageForm

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ConceptContainer

QualityStandard

Full Name: de.averbis.textanalysis.types.pharma.module3.QualityStandard

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

PhysicalCharacteristicsShape

Full Name: de.averbis.textanalysis.types.pharma.module3.PhysicalCharacteristicsShape

Description: -

Parent Type: uima.tcas.Annotation

PhysicalCharacteristicsColor

Full Name: de.averbis.textanalysis.types.pharma.module3.PhysicalCharacteristicsColor

Description: -

Parent Type: uima.tcas.Annotation

ConceptContainer

Full Name: de.averbis.textanalysis.types.pharma.module3.ConceptContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 90: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: concept

Manufacturer

Full Name: de.averbis.textanalysis.types.pharma.module3.Manufacturer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 91: Features

Name | Range | Element Type | Multiple References Allowed

operationTypeContainer

de.averbis.textanalysis.types.pharma.module3.OperationTypeContainer




Description: operationTypeContainer

name

de.averbis.textanalysis.types.pharma.module3.ManufacturerName




Description: name

postAddress

de.averbis.textanalysis.types.pharma.module3.PostAddress




Description: postAddress

city

de.averbis.textanalysis.types.pharma.module3.City




Description: city

postCode

de.averbis.textanalysis.types.pharma.module3.PostCode




Description: postCode

country

de.averbis.textanalysis.types.pharma.module3.Country




Description: country

OperationTypeContainer

Full Name: de.averbis.textanalysis.types.pharma.module3.OperationTypeContainer

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 92: Features

Name | Range | Element Type | Multiple References Allowed

operationTypes

uima.cas.FSArray




Description: operationTypes

PostAddress

Full Name: de.averbis.textanalysis.types.pharma.module3.PostAddress

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

ManufacturersContext

Full Name: de.averbis.textanalysis.types.pharma.module3.ManufacturersContext

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

OperationTypeContext

Full Name: de.averbis.textanalysis.types.pharma.module3.OperationTypeContext

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 93: Features

Name | Range | Element Type | Multiple References Allowed

elements

uima.cas.FSArray




Description: elements

Country

Full Name: de.averbis.textanalysis.types.pharma.module3.Country

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

StreetIndicatorPrefix

Full Name: de.averbis.textanalysis.types.pharma.module3.StreetIndicatorPrefix

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

StreetIndicatorPostfix

Full Name: de.averbis.textanalysis.types.pharma.module3.StreetIndicatorPostfix

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

OperationType

Full Name: de.averbis.textanalysis.types.pharma.module3.OperationType

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 94: Features

Name | Range | Element Type | Multiple References Allowed

value

uima.cas.String




Description: value

PostCode

Full Name: de.averbis.textanalysis.types.pharma.module3.PostCode

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

CompanyPostfix

Full Name: de.averbis.textanalysis.types.pharma.module3.CompanyPostfix

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

City

Full Name: de.averbis.textanalysis.types.pharma.module3.City

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

ManufacturerName

Full Name: de.averbis.textanalysis.types.pharma.module3.ManufacturerName

Description: -

Parent Type: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

ManufacturerAddressPart

Full Name: de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation

AdverseEventTypeSystem

Full Name: de.averbis.textanalysis.pharma.AdverseEventTypeSystem

Description: -

Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>adverse-event-typesystem</artifactId>
    <version>0.7.0</version>
</dependency>
         

Imports

  • de.averbis.textanalysis.typesystems.AverbisTypeSystem

AdverseEvent

Full Name: de.averbis.textanalysis.types.pharma.AdverseEvent

Description: -

Parent Type: de.averbis.extraction.types.CoreAnnotation


Table 95: Features

Name | Range | Element Type | Multiple References Allowed

concept

de.averbis.extraction.types.Concept




Description: -

label

uima.cas.String




Description: -

serious

uima.cas.Boolean




Description: -

Language Detection

LanguageCategorizer

General

The LanguageCategorizer recognizes and sets the text language on a CAS object.

Depending on the LanguageDetectorResource configured, one or multiple languages of the text are predicted.

Input

This component does not require annotations.

Output
  • Sets the language of the CAS object.

  • de.averbis.extraction.types.Category - (optional) category annotations are set.

Configuration

Implementation: de.averbis.textanalysis.components.languagecategorizer.LanguageCategorizer


Table 96: Configuration Parameters

Name | Type | MultiValued | Mandatory

allowedLanguages

Description: The list of languages allowed to be set as document language.

Default: en, de

String

true

false

useUnknownLanguage

Description: If the language cannot be determined or is not among the allowed languages, the document language is set to 'unknown' (true) or left unset (false).

Default: true

Boolean

false

false

overwriteExisting

Description: If true, an existing document language will be overwritten.

Default: false

Boolean

false

false

maxCharacterLimit

Description: The number of characters to be analysed. Can be used when categorizing large texts in order to increase performance.

Default: 20000

Integer

false

false

addCategoryAnnotations

Description: If true, UIMA Category annotations are added to CAS (languages and confidences).

Default: false

Boolean

false

false

shortTextSizeTrigger

Description: The number of characters below which the short-text-algorithm will be used to guess the language.

Default: 200

Integer

false

false

setDocumentLanguage

Description: If true, the determined language will be set as the document language on JCas.

Default: true

Boolean

false

false


Table 97: External Resources

Name | Optional | Interface/Implementation

languageDetectorResourceShort

Description: Resource holding a languageDetector for use on short texts.

false

de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource

languageDetectorResourceDefault

Description: Resource holding a languageDetector for use on default text.

false

de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource


Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-categorizer</artifactId>
    <version>3.5.0</version>
</dependency>
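
For illustration, a minimal uimaFIT wiring sketch follows (a hedged example, not Averbis-provided code: it assumes the classes above are on the classpath and that default model data is available; parameter and resource names are taken from Tables 96 and 97):

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.ExternalResourceFactory;

import de.averbis.textanalysis.components.languagecategorizer.LanguageCategorizer;
import de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource;

public class LanguageCategorizerSetup {

    public static AnalysisEngineDescription create() throws Exception {
        AnalysisEngineDescription desc = AnalysisEngineFactory.createEngineDescription(
                LanguageCategorizer.class,
                "allowedLanguages", new String[] { "en", "de" },
                "useUnknownLanguage", true,
                "maxCharacterLimit", 20000);
        // Both external resources of Table 97 are mandatory; here they are
        // bound to the same default LanguageDetectorResource implementation.
        ExternalResourceFactory.bindResource(desc,
                "languageDetectorResourceShort", LanguageDetectorResource.class);
        ExternalResourceFactory.bindResource(desc,
                "languageDetectorResourceDefault", LanguageDetectorResource.class);
        return desc;
    }
}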
        

LanguageDetectorResource

General

The default implementation of the LanguageDetectorResource uses the language-detector library by Optimaize, which is based on the Language Detection Library by Shuyo Nakatani. For language detection, the probability of each configured language is calculated from the character n-grams observed in the text using a naïve Bayes model.

The standard models are supplied for 16 languages. Note that these models are not trained by Averbis but come from the com.optimaize.langdetect library.

Configuration

Implementation: de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource


Table 98: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: languagedetector

String

false

false

category

Description: The model category to be used.

Default: default

String

false

false

availableLanguages

Description: The list of languages whose models should be loaded.

Default: en, de, fr, es, it, pt

String

false

false

useLowerCase

Description: If true, the text will be converted to lower case before determining the language category.

Default: false

Boolean

false

false

Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-detector-resource</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

LanguageSetter

General

This component sets the document language to a user-defined value.

Input

The component does not expect any annotations.

Output
  • The component sets the parameter documentLanguage in the CAS object.

Configuration

Implementation: de.averbis.textanalysis.components.languagesetter.LanguageSetter


Table 99: Configuration Parameters

Name | Type | MultiValued | Mandatory

language

Description: The document language to set if not already set in CAS.

String

false

true

overwriteExisting

Description: If true, an existing document language will be overwritten.

Default: false

Boolean

false

true

Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-setter</artifactId>
    <version>3.5.0</version>
</dependency>
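
A minimal usage sketch with uimaFIT (hedged: it assumes the component class is on the classpath; the parameter names come from Table 99):

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.components.languagesetter.LanguageSetter;

public class LanguageSetterExample {

    public static void main(String[] args) throws Exception {
        AnalysisEngine setter = AnalysisEngineFactory.createEngine(
                LanguageSetter.class,
                "language", "de",
                "overwriteExisting", false);
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("Ein kurzer deutscher Beispieltext.");
        setter.process(jcas);
        System.out.println(jcas.getDocumentLanguage()); // prints: de
    }
}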
        
      

Sentence Detection

OpennlpSentenceAnnotator

General

Machine learning techniques often detect sentence boundaries more reliably than simple rule-based approaches. This sentence annotator is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Input

The component does not expect annotations, but instead works on the document text.

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type de.averbis.extraction.types.Sentence

Configuration

Implementation: de.averbis.textanalysis.components.opennlpsentenceannotator.OpennlpSentenceAnnotator


Table 100: Configuration Parameters

Name | Type | MultiValued | Mandatory

splitLinebreak

Description: If true, sentence splits are additionally added at all line breaks.

Default: false

Boolean

false

false

enclosingSpanType

Description: If set, sentence detection is performed only within annotations of this type; otherwise it is performed on the whole document.

Default: -

String

false

false


Table 101: External Resources

Name | Optional | Interface/Implementation

opennlpSentenceDetectorResource

Description: Resource holding a map with available models (SentenceDetector) for languages

false

de.averbis.textanalysis.resources.opennlpsentencedetectorresource.OpennlpSentenceDetectorResource

Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-sentence-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
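
A hedged end-to-end sketch (assuming the component and resource classes above are available and their default models can be loaded; the Sentence JCas class name is taken from the Output section):

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.ExternalResourceFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class SentenceDetectionPipeline {

    public static void main(String[] args) throws Exception {
        AnalysisEngineDescription language = AnalysisEngineFactory.createEngineDescription(
                de.averbis.textanalysis.components.languagesetter.LanguageSetter.class,
                "language", "en", "overwriteExisting", false);
        AnalysisEngineDescription sentences = AnalysisEngineFactory.createEngineDescription(
                de.averbis.textanalysis.components.opennlpsentenceannotator.OpennlpSentenceAnnotator.class);
        // Bind the mandatory model resource from Table 101.
        ExternalResourceFactory.bindResource(sentences, "opennlpSentenceDetectorResource",
                de.averbis.textanalysis.resources.opennlpsentencedetectorresource.OpennlpSentenceDetectorResource.class);

        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("This is one sentence. This is another.");
        SimplePipeline.runPipeline(jcas, language, sentences);
        for (de.averbis.extraction.types.Sentence s
                : JCasUtil.select(jcas, de.averbis.extraction.types.Sentence.class)) {
            System.out.println(s.getCoveredText());
        }
    }
}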
        
      

OpennlpSentenceDetectorResource

General

This resource encapsulates the statistical Sentence Detector model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.opennlpsentencedetectorresource.OpennlpSentenceDetectorResource


Table 102: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: opennlpsentencedetector

String

false

false

genre

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: newspaper

String

false

false

Maven Coordinates:

<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-sentence-detector-resource</artifactId>
    <version>3.5.0</version>
</dependency>

RegexSentenceAnnotator

General

A simple and often very efficient approach to sentence recognition is to decompose the text using simple, language-specific rules. The RegexSentenceAnnotator splits sentences in the text wherever one of the valid block separators ".!?" appears. The separator is part of the sentence annotation.

However, a sentence split is only performed if the separator is followed by at least one blank character (or line break) and then an alphanumeric character that is not a lowercase letter.

The advantage of this method is that it is very fast and its behavior is easy to understand. In many applications, this simple approach is perfectly adequate.
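
The splitting rule just described can be illustrated with a plain Java regular expression (a simplified sketch, not the component's actual implementation):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSplitSketch {

    public static void main(String[] args) {
        // Split after ".!?" only if whitespace and a non-lowercase
        // alphanumeric character follow; the separator stays in the sentence.
        Pattern boundary = Pattern.compile("[.!?](?=\\s+[\\p{Alnum}&&[^\\p{Ll}]])");
        String text = "First sentence. Second one! no break here. and on? Yes.";
        Matcher m = boundary.matcher(text);
        int start = 0;
        while (m.find()) {
            System.out.println("[" + text.substring(start, m.end()) + "]");
            start = m.end();
        }
        System.out.println("[" + text.substring(start) + "]");
    }
}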

Known weaknesses and problems:

  • Splits sentences at abbreviations (etc., Prof. Dr. Maier).

  • Problems with dates (2. Mai 2012)

Input

The component does not expect annotations, but works on the document text.

Output

The component creates the following annotations:

  • de.averbis.extraction.types.Sentence

Configuration

Implementation: de.averbis.textanalysis.components.regexsentenceannotator.RegexSentenceAnnotator


Table 103: Configuration Parameters

Name | Type | MultiValued | Mandatory

regularExpression

Description: The regular expression to split sentences at.

Default: ([.?!])(\s)([^\p{Ll}])

String

false

true

implementation

Description: The implementation of the regular expression library. Available are: JavaPattern, JRegex, Brics, RE2J. Warning: not all implementations support all regex constructs.

Default: JavaPattern

String

false

true

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>regex-sentence-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

Tokenization

JTokAnnotator

General

This component uses the JTok library to recognize tokens, sentences and paragraphs. Cascaded regular expressions and language-specific resources are used. The JTok library currently provides resources for the languages en, de and it (without special genres). There are also resources for special genres in the languages de and fr. Note that resources cannot yet be loaded from the datapath.

Input

This component requires no specific annotations.

Output

The component creates the following annotations (depending on the configuration):

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Paragraph

  • de.averbis.extraction.types.Abbreviation

Configuration

Implementation: de.averbis.textanalysis.components.jtokannotator.JTokAnnotator


Table 104: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: jtokannotator

String

false

true

addTokens

Description: Create Token annotations.

Default: true

Boolean

false

true

addAbbreviations

Description: Create Abbreviation annotations.

Default: true

Boolean

false

true

addSentences

Description: Create Sentence annotations.

Default: true

Boolean

false

true

addParagraphs

Description: Create Paragraph annotations.

Default: true

Boolean

false

true

applyPostProcessing

Description: Apply additional postprocessing fixing common errors of sentence splitting.

Default: true

Boolean

false

true

genre

Description: Genre specifying an external configuration and its resources. If not available, the component falls back to the JTok configuration files.

Default: default

String

false

true

availableLanguages

Description: List of languages available to the annotator. If no language configuration is available for the given genre, the default configuration is applied. The default configuration is also used for languages not included in this list.

Default: de, en, fr

String

true

true

globalEnclosingSpan

Description: Type of annotations specifying the enclosing span that should be tokenized. This parameter overrides the parameter enclosingSpan.

Default: uima.tcas.DocumentAnnotation

String

false

true

enclosingSpan

Description: Type of annotations specifying the enclosing span that should be tokenized.

Default: de.averbis.extraction.types.Sentence

String

false

true

normalizationRequired

Description: If true, additional normalization is applied on every token.

Default: true

Boolean

false

true

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>jtok-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

OpennlpTokenAnnotator

General

Machine learning techniques can often solve tokenization problems better than simple rule-based approaches. This tokenizer is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Based on the included training module, this component can be easily adapted to new languages and genres by retraining.

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Sentence

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Token

Configuration

Implementation: de.averbis.textanalysis.components.opennlptokenannotator.OpennlpTokenAnnotator


Table 105: Configuration Parameters

Name | Type | MultiValued | Mandatory

splittingRules

Description: Additional regular expressions which can be used to make further splits on generated tokens

String

true

false

enclosingSpan

Description: Type of annotations specifying the enclosing span that should be tokenized.

Default: de.averbis.extraction.types.Sentence

String

false

true

normalizationRequired

Description: If true, additional normalization is applied on every token.

Default: true

Boolean

false

true


Table 106: External Resources

Name | Optional | Interface/Implementation

opennlpTokenizerResource

Description: Resource holding a map with available models (Tokenizer) for different languages

false

de.averbis.textanalysis.resources.opennlptokenizerresource.OpennlpTokenizerResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-token-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

OpennlpTokenizerResource

General

This resource encapsulates the statistical tokenizer model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.opennlptokenizerresource.OpennlpTokenizerResource


Table 107: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: opennlptokenizer

String

false

false

genre

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: newspaper

String

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-tokenizer-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

RegexTokenAnnotator

General

A simple and often very efficient approach to tokenization is to decompose the text using simple, language-specific rules. The RegexTokenAnnotator uses a set of defined delimiters to separate the words. Each time such a separator occurs in the text, a new token is started. The separators themselves (e.g. "-") are not marked as token annotations.

The advantage of this method is that it is very fast and its behavior is easy to understand. In many applications, this simple approach is perfectly adequate. In some applications, however, especially in special domains such as biomedical literature, it leads to the unintentional decomposition of tokens that are valid there (e.g. proper names such as "IL-2", which should not be separated), as illustrated below.
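
A plain Java sketch of this effect, using only a simplified subset of the default delimiter class from Table 108 (an illustration, not the component's implementation):

import java.util.Arrays;

public class RegexTokenSketch {

    public static void main(String[] args) {
        // Simplified subset of the default delimiters; the full default
        // also covers typographic quotes and further dash characters.
        String delimiters = "[|.+*;,!?/ :@_()\"'-]+";
        String text = "The cytokine IL-2 was measured.";
        // Note how the domain term "IL-2" is decomposed into "IL" and "2".
        System.out.println(Arrays.toString(text.split(delimiters)));
        // -> [The, cytokine, IL, 2, was, measured]
    }
}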

Input

The component does not necessarily expect annotations.

Output

The component creates the following annotations:

  • de.averbis.extraction.types.Token

Configuration

Implementation: de.averbis.textanalysis.components.regextokenannotator.RegexTokenAnnotator


Table 108: Configuration Parameters

Name | Type | MultiValued | Mandatory

regularExpression

Description: Regular expression to split a text.

Default: [|.+*;,!?/ :@_()"`„”““—’‘'¿-]

String

false

true

implementation

Description: The implementation of the regular expression library. Available are: JavaPattern, JRegex, Brics, RE2J. Warning: not all implementations support all regex constructs.

Default: JavaPattern

String

false

true

enclosingSpan

Description: Type of annotations specifying the enclosing span that should be tokenized.

Default: de.averbis.extraction.types.Sentence

String

false

true

normalizationRequired

Description: If true, additional normalization is applied on every token.

Default: true

Boolean

false

true

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>regex-token-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

InvariantTokenTagger

General

Invariant taggers mark tokens as invariant if they should not be treated by subsequent linguistic processing steps such as stemming or compound decomposition. This is the case, for example, with proper names: a compound decomposition of "Ingmar Bergmann" into "Berg" + "Mann" would not be correct. For this purpose, a flag "isInvariant" can be set on each token. If this flag is set to true, subsequent components leave the word untreated.

This component implements a simple rule-based approach: a token is tagged as invariant if it does not match a standard token pattern (regex). A valid, i.e. non-invariant, token is either written completely in capital letters ("DISPLAY") or in lowercase letters with an optional capital initial letter ("Display" or "display"); apart from that, only hyphens are allowed. Note that this check is not optimized for all languages.
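
A minimal plain-Java sketch of this check, using the default values from Table 109 (an illustration of the rule, not the component itself):

import java.util.regex.Pattern;

public class InvariantCheckSketch {

    public static void main(String[] args) {
        // Default validTokenPattern and validTokenLength from Table 109.
        // Note: the default pattern itself does not cover hyphenated tokens.
        Pattern valid = Pattern.compile("(\\p{Lu}?\\p{Ll}*|\\p{Lu}+)");
        int minLength = 4;
        for (String token : new String[] { "Display", "display", "DISPLAY", "IL-2", "mRNA", "on" }) {
            boolean invariant = token.length() < minLength || !valid.matcher(token).matches();
            System.out.println(token + " -> isInvariant=" + invariant);
        }
    }
}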

Input

The following annotations are mandatory for this component:

  • de.averbis.extraction.types.Token

Output

The component sets the feature "isInvariant" of the token annotations. No new annotations are produced.

Configuration

Implementation: de.averbis.textanalysis.components.invarianttokentagger.InvariantTokenTagger


Table 109: Configuration Parameters

Name | Type | MultiValued | Mandatory

validTokenPattern

Description: Allowed (i.e. not invariant) is defined as: either all upper case ("ANZEIGE") or lower case with one optional preceding upper-case letter ("Anzeige" or "anzeige"); hyphens are allowed.

Default: (\p{Lu}?\p{Ll}*|\p{Lu}+)

String

false

true

validTokenLength

Description: Allowed (~ !invariant) length of a token. Shorter tokens will be tagged as invariant.

Default: 4

Integer

false

true

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>invariant-token-tagger</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Stemming and Segmentation

SnowballStemAnnotator

General

The Snowball stemmer is based on the Porter stemming algorithm, the most common stemming approach. This is a rule-based procedure that applies a set of language-specific shortening rules until a minimum number of syllables is reached.

The Porter stemming algorithm is an "aggressive" stemming approach: the resulting word stems are not valid words and often not linguistically correct word stems.

Reference: An algorithm for suffix stripping, M. F. Porter, 1980

Input

The following annotations are mandatory for this component:

  • de.averbis.extraction.types.Token

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The following annotations are created:

  • de.averbis.extraction.types.Stem

In addition, feature references from the token annotations are made to the respective stem annotation.

Configuration

Implementation: de.averbis.textanalysis.components.snowballstemannotator.SnowballStemAnnotator


Table 110: Configuration Parameters

Name | Type | MultiValued | Mandatory

allLowerCase

Description: If true, the token to be stemmed will be transformed to lower case.

Default: false

Boolean

false

true

excludePattern

Description: A regular expression that specifies exceptions for the stemmer. If the pattern matches, the stemmer is skipped and the covered text of the token is assigned as the stem value. A parameter value like ".*itis" has the effect that a token "cellulitis" is not stemmed to "cell"; instead, the stem value is "cellulitis". The parameter allLowerCase is applied before this parameter and may therefore influence its behavior.

Default: .*itis

String

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>snowball-stem-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
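
To illustrate the excludePattern behavior described in Table 110, a hedged plain-Java sketch of the skip logic (the actual stemming step is left out):

import java.util.regex.Pattern;

public class ExcludePatternSketch {

    public static void main(String[] args) {
        // Default excludePattern from Table 110: matching tokens keep their
        // covered text as stem value instead of being stemmed.
        Pattern exclude = Pattern.compile(".*itis");
        for (String token : new String[] { "cellulitis", "cells" }) {
            if (exclude.matcher(token).matches()) {
                System.out.println(token + " -> " + token + " (stemmer skipped)");
            } else {
                System.out.println(token + " -> (stemmed normally)");
            }
        }
    }
}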
        
      

MorphoSemanticSegmentAnnotator

General

The morphosemantic analysis is based on the MorphoSaurus algorithm, which was originally developed for use in medical language.

See also: Foundation, Implementation and Evaluation of the MorphoSaurus System, dissertation by Kornél Markó, JULIE Lab, University of Jena, 2007.

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Token

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Segment

These annotations are also linked to the respective token and stored there as a reference.

Configuration

Implementation: de.averbis.textanalysis.components.morphosemanticsegmentannotator.MorphoSemanticSegmentAnnotator


Table 111: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: morphosemanticsegmentannotator

String

false

false

msiEngineLexiconFile

Description: The core engine lexicon.

Default: msi.data

String

false

true

msiEngineReplacementFile

Description: The core engine replacement file.

Default: replacement.xml

String

false

true

msiEngineAdditionalLexiconFiles

Description: Additional lexicon files for the core engine.

String

true

false

msiEngineLanguages

Description: The languages to load from the lexica.

Default: en, de

String

true

true

msiEngineNoMatchPlain

Description: -

Default: true

Boolean

false

true

msiEngineSegmenterMode

Description: The core engine segmenter mode: RIGHT, LEFT, BOTH.

Default: BOTH

String

false

true

msiEngineConcatPrefix

Description: Concat prefix.

Default: false

Boolean

false

true

msiEngineConcatSuffix

Description: Concat suffix.

Default: false

Boolean

false

true

msiEngineMIDs

Description: Attach MIDs to segmentation.

Default: false

Boolean

false

true

msiNoPreferredForIVs

Description: No preferred terms for type IVs.

Default: true

Boolean

false

true

enrichAbbreviations

Description: Defines if abbreviations should be enriched with segments.

Default: false

Boolean

false

true

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>morpho-semantic-segment-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Abbreviation Detection

AbbreviationAnnotator

General

This component uses a dictionary to recognize abbreviations. If the dictionary contains the full form of the abbreviation, this is saved in the annotation. The component can load different abbreviation lists (genres) for a specific language. Currently available:

  • de

    • default

    • bionlp

    • latin

    • law

    • literature_reference

  • en

    • default

    • bionlp

    • latin

    • oxford

  • fr

    • bionlp

    • latin

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Token

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Abbreviation

The abbreviation annotations created are associated with their full form, if available.

Configuration

Implementation: de.averbis.textanalysis.components.abbreviationannotator.AbbreviationAnnotator


Table 112: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: abbreviation

String

false

true

genres

Description: The genres of abbreviations that should be utilized.

Default: default

String

true

true

fullformTokenizerPattern

Description: The pattern for tokenizing the fullform of an abbreviation.

Default: \s+|,|\-

String

false

true

tokenizeFullform

Description: Option to tokenize the fullform of all abbreviations.

Default: true

Boolean

false

true

Maven Coordinates:

<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>abbreviation-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        

Numeric Values, Measurements, Times and Dates

NumericValueAnnotator

General

This component recognizes a wide variety of numeric expressions and determines their numerical value. These include simple numbers such as 2.3, but also more complex expressions such as ½ million or fuenfundzwanzig. Furthermore, the component is able to recognize roman numerals and assign the equivalent numeric value. Written-out numbers are currently only supported in English, German and French.

The functional elements of this component are divided into individual reusable components, which can also be individually configured and recombined. The main component NumericValueAnnotator consists of the following elements:

  1. ConjunctionFragment.ruta: These UIMA Ruta rules split tokens for detecting smaller numeric fragments.

  2. RomanNumeral.ruta: These UIMA Ruta rules annotate different kinds of roman numerals and calculate their numeric equivalents in a Java procedure.

  3. RutaTokenSeedAnnotator: This component adds annotations for the following dictionary lookup.

  4. SimpleDictionaryAnnotator: This component adds different annotations based on the given word lists.

  5. NumericValue.ruta: These UIMA Ruta rules annotate different kinds of numeric values.

Input

The component requires annotations for numerical base units. The exact type of these annotations can be set via the configuration parameter number; normally this is org.apache.uima.ruta.type.NUM. Annotations of this type are created automatically if they do not already exist and if the configuration parameter seeders has not been adjusted.

If configured accordingly, the component can additionally process certain other annotations, if present. These include LanguageContainer and the annotations of the types specified in the parameter noNumericValue.

Irrespective of these annotations, the component can also use any annotation types used in the rule-based implementation, for example Multiplicator or ConjunctionFragment.

Output

The component generates different types of annotations. The actual result of the component is:

  • de.averbis.textanalysis.types.numericvalue.NumericValue

Additionally it creates annotations of roman numerals:

  • de.averbis.textanalysis.types.numericvalue.RomanNumeral

The numeric value of the detected number is stored in the value feature.
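
A hedged sketch of reading the results with uimaFIT (the getValue() accessor is an assumption derived from the value feature via the usual JCasGen naming):

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.types.numericvalue.NumericValue;

public class NumericValueReader {

    public static void print(JCas jcas) {
        // Iterate over all NumericValue annotations created by the annotator.
        for (NumericValue nv : JCasUtil.select(jcas, NumericValue.class)) {
            System.out.println(nv.getCoveredText() + " -> " + nv.getValue());
        }
    }
}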

Configuration

Implementation: de.averbis.textanalysis.components.numericvalueannotator.NumericValueAnnotator


Table 113: Configuration Parameters

Name | Type | MultiValued | Mandatory

allowPeriodDecimalSeparator

Description: Option to allow the usage of a period for the decimal separator for all locales, e.g., also in German.

Default: true

Boolean

false

true

detectComplexPatterns

Description: Option to detect more complex patterns of numerical values like 2^1/2 or fuenfundzwanzig.

Default: true

Boolean

false

true

detectFractions

Description: Option to detect fractions like 125/75.

Default: true

Boolean

false

true

mergeConsecutiveEqualNumbers

Description: Option to merge consecutive equal numbers like '5 (five)'.

Default: false

Boolean

false

true

dictionaryLookup

Description: Option to apply dictionary lookup for detecting special numeric elements like ² or five.

Default: true

Boolean

false

true

decimalSeparator

Description: Regular expression to validate decimal separators as in 2.6.

Default: \.

String

false

true

thousandsSeparator

Description: Regular expression to validate thousands separators as in 3,000.

Default: ,

String

false

true

conjunctionFragment

Description: Regular expression to detect conjunction fragments like 'und' as in fuenfundzwanzig.

Default: and|und|et

String

false

true

simpleNumericValuesOnlyWithoutSpaces

Description: Simple numeric values with punctuation marks are only annotated if there are no spaces in between.

Default: true

Boolean

false

true

language

Description: Default value of the language. Will normally be overwritten by the DocumentAnnotation language or by the LanguageContainer language.

Default: x-unspecified

String

false

true

noNumericValue

Description: List of types specifying annotation spans in which no numeric value should be detected, e.g., Dates.

Default: -

String

true

true

number

Description: The basic annotation type for digits.

Default: org.apache.uima.ruta.type.NUM

String

false

true

languageSpecific

Description: If activated, language dependent values will automatically be assigned to parameters decimalSeparator, thousandsSeparator and conjunctionFragment.

Default: true

Boolean

false

true

allowLeadingZeros

Description: Option to allow numeric values to start with zeros like 02.

Default: false

Boolean

false

true

detectRomanNumerals

Description: Option to detect roman numerals like XIV, II, MMDC.

Default: false

Boolean

false

true

seeders

Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied.

Default: org.apache.uima.ruta.seed.DefaultSeeder

String

true

false

reindexOnly

Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed.

Default: uima.tcas.Annotation

String

true

false

indexOnlyMentionedTypes

Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed.

Default: false

Boolean

false

false

indexAdditionally

Description: A UIMA Ruta specific parameter specifying additional annotation types that should be indexed.

Default: **

String

true

false

strictImports

Description: A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved.

Default: true

Boolean

false

false

debug

Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution.

Default: false

Boolean

false

false

debugWithMatches

Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches.

Default: false

Boolean

false

false

Maven Coordinates:

<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>numeric-value-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
RutaTokenSeedAnnotator

A more detailed description of RutaTokenSeedAnnotator can be found in the corresponding chapter.

SimpleDictionaryAnnotator

A more detailed description of SimpleDictionaryAnnotator can be found in the corresponding chapter.

Rules

RomanNumeral.ruta

PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
UIMAFIT de.averbis.textanalysis.components.numericvalueannotator.RomanNumeralValueCalculator;

FOREACH(cap) CAP{REGEXP("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")}{
    cap{ -> CREATE(RomanNumeral)};
}

FOREACH(cw) CW{REGEXP("M?C?D?L?X?V?I?")}{
    cw { -> CREATE(RomanNumeral)};
}

EXEC(RomanNumeralValueCalculator, {RomanNumeral});
      

ConjunctionFragment.ruta

PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

STRING conjunctionFragment = "and|und|et";
BOOLEAN detectComplexPatterns = true;
BOOLEAN languageSpecific = true;
STRING language = "x-unspecified";

BLOCK(languageSpecific) Document{languageSpecific} {
    BLOCK(en) Document{language == "en"} {
        Document{-> conjunctionFragment = "and"};
    }
    BLOCK(de) Document{language == "de"} {
        Document{-> conjunctionFragment = "und"};
    }
    BLOCK(fr) Document{language == "fr"} {
        Document{-> conjunctionFragment = "et"};
    }
}

Document{detectComplexPatterns,-REGEXP(conjunctionFragment, "")} -> {conjunctionFragment -> ConjunctionFragment;};
      

NumericValue.ruta

PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

SCRIPT de.averbis.textanalysis.components.numericvalueannotator.RomanNumeral;

// configuration parameters:
BOOLEAN allowPeriodDecimalSeparator = true;
BOOLEAN allowLeadingZeros = false;
BOOLEAN detectComplexPatterns = true;
BOOLEAN detectFractions = true;
BOOLEAN mergeConsecutiveEqualNumbers = false;
STRING decimalSeparator = "\\.";
STRING thousandsSeparator = ",";
BOOLEAN simpleNumericValuesOnlyWithoutSpaces = true;
BOOLEAN languageSpecific = true;
STRING language = "x-unspecified";
TYPELIST noNumericValue;
TYPE number = NUM;
BOOLEAN detectRomanNumerals = false;

// additional variables
STRINGLIST localesWithPeriodDecimalSeparator = {"en"};
STRINGLIST localesWithCommaDecimalSeparator = {"de", "fr"};

// helper types
DECLARE NumberWithValue (DOUBLE value);
DECLARE NumberWithValue Multiplicator, Exponent;

// language specific settings
Document{IS(uima.tcas.DocumentAnnotation)-> GETFEATURE("language", language)};
LanguageContainer{-> GETFEATURE("language", language)};

BLOCK(languageSpecific) Document{languageSpecific} {
    BLOCK(separators) Document{CONTAINS(localesWithPeriodDecimalSeparator, language)} {
        Document{-> decimalSeparator = "\\.", thousandsSeparator = ","};
    }
    BLOCK(separators) Document{CONTAINS(localesWithCommaDecimalSeparator, language)} {
        Document{-> decimalSeparator = ",", thousandsSeparator = "\\."};
    }
    BLOCK(fr) Document{language == "fr"} {
        Document{-> thousandsSeparator = "\\s"};
    }
}

NumericValue{PARTOF(noNumericValue)-> UNMARK(NumericValue)};
ConjunctionFragment{PARTOF(Multiplicator)-> UNMARK(ConjunctionFragment)};

CONDITION isThousandsSep() = REGEXP(thousandsSeparator);
CONDITION isDecimalSep() = REGEXP(decimalSeparator);

DOUBLE value;
// normal numbers like 1,000.95
Document{simpleNumericValuesOnlyWithoutSpaces -> ADDRETAINTYPE(SPACE, BREAK)};
FOREACH(num) number{-PARTOF(noNumericValue)}{
    (num{-PARTOF(NumericValue)}
        (PM{isThousandsSep()} number{REGEXP("...")})*
        (PM{isDecimalSep()} number)
        ){PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
    (num{-PARTOF(NumericValue), num.ct!= "0"}
        (PM{isThousandsSep()} number{REGEXP("...")})+
        ){PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
    (num{-PARTOF(NumericValue), allowPeriodDecimalSeparator} PERIOD number)
            {PARSE(value, "en") -> CREATE(NumericValue, "value" = value)};
    (num{-PARTOF(NumericValue)})
                {PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
}
FOREACH(num) NumericValue{}{
    num{-IF(allowLeadingZeros), REGEXP("^0\\d.*") -> UNMARK(NumericValue)};
    W{-REGEXP("[ex]", true)} @num{-> UNMARK(NumericValue)} W;
//        W{-REGEXP("[ex]", true)} @num{-> UNMARK(NumericValue)};
    W{REGEXP("[A-Z]{1,3}")} @num{OR(REGEXP("\\d{1,2}"),REGEXP("\\d{2}\\.\\d{1}"))-> UNMARK(NumericValue)};
        NUM PERIOD NUM PERIOD @num{-> UNMARK(NumericValue)};
    num{mergeConsecutiveEqualNumbers, PARTOF(NumericValue) -> num.end = s2.end} WS* SPECIAL{REGEXP("[\\(\\[\\{]")} WS*
        n2:NumericValue{num.value == n2.value -> UNMARK(n2)} WS* s2:SPECIAL{REGEXP("[\\)\\]\\}]")};
}
Document{simpleNumericValuesOnlyWithoutSpaces -> REMOVERETAINTYPE(SPACE, BREAK)};


// Fractions with numerical values
BLOCK(dictionary) Document{detectFractions} {
        FOREACH(num) NumericValue{}{
                // fractions like 3/4
                num{-> UNMARK(NumericValue)} SPECIAL{REGEXP("/")} NumericValue{-> UNMARK(NumericValue),
                    GATHER(Fraction,1,3, "numerator" = 1, "denominator" = 3)};
                // fractions like Seven out of 38
                num{-> UNMARK(NumericValue)} SW? SW.ct=="of" NumericValue{-> UNMARK(NumericValue),
                    GATHER(Fraction,1,4, "numerator" = 1, "denominator" = 4)};
        }
}
// simple fractions
NumericValue{REGEXP("\\d")-> UNMARK(NumericValue)} @SPECIAL{REGEXP("/")} NumericValue{REGEXP("\\d")-> UNMARK(NumericValue),
            GATHER(Fraction,1,3, "numerator" = 1, "denominator" = 3)};


Fraction{-> CREATE(NumericValue, "value" = (Fraction.numerator.value / Fraction.denominator.value))};
SimpleFraction{-> CREATE(NumericValue, "value" = (SimpleFraction.numerator / SimpleFraction.denominator))};

BLOCK(complexPatterns) Document{detectComplexPatterns}{
        FOREACH(num, false) NumericValue{}{
            // exponents like 2^3, 2.3e13, 4²
                (num exp:Exponent)
                    {-> num.value=POW(num.value, exp.value), num.end = exp.end};

                (num SPECIAL.ct=="^" exp:NumericValue{-> UNMARK(NumericValue)})
                    {-> num.value=POW(num.value, exp.value), num.end = exp.end};

                (num W{REGEXP("e", true)} exp:NumericValue{-> UNMARK(NumericValue)})
                    {-> num.value = (num.value * (POW(10, exp.value))), num.end = exp.end};

                // multiplication like 3x4, 2*2
                (num ANY{REGEXP("×|x|\\*", true)} mult:NumericValue{-> UNMARK(NumericValue)})
                            {-> num.value = (num.value * mult.value), num.end = mult.end};

                pre:NumericValue{PARTOF(W),num.value != pre.value -> UNMARK(NumericValue)} SPECIAL?{REGEXP("-")} num{IS(NumericValue),PARTOF(W)
                    -> num.value = (num.value + pre.value), num.begin = pre.begin};

                // combination with multipliers like 3 million
                (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)} SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
//                    add1:NumericValue?{-> num.value = (num.value + add1.value), UNMARK(NumericValue)}
                    (
                            Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
                            add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(NumericValue)}
                    )*);

                // fünfundzwanzig
                (num{PARTOF(W)-> SHIFT(NumericValue,1,3)} ConjunctionFragment add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(NumericValue)})
                    {-> num.value = (num.value + add.value)};

                // 2+3
                (num{-> SHIFT(NumericValue,1,3)} SPECIAL.ct=="+" add:NumericValue{ -> UNMARK(NumericValue)})
                    {-> num.value = (num.value + add.value)};
        }
}

Document{detectRomanNumerals -> CALL(RomanNumeral)};
      

MeasurementAnnotator

General

This component detects units, measurements and quantities. It can trace the given unit back to SI base units and at the same time normalize the numerical value. For example, the text passage 10cm is recognized as 0.1 m (dimension L).
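
As a worked example of the normalization arithmetic (plain Java, not the component's code):

public class NormalizationArithmetic {

    public static void main(String[] args) {
        // "10cm": the prefix centi scales by 1e-2 relative to the SI base unit metre.
        double value = 10.0;
        double centiFactor = 1e-2;
        double normalizedValue = value * centiFactor; // 0.1
        System.out.println(normalizedValue + " m (dimension L)");
    }
}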

The functional elements of this component are divided into individual reusable components, which can also be individually configured and recombined. The main component MeasurementAnnotator consists of the following elements:

  1. UnitAnnotator: This component recognizes units (annotation type Unit) according to certain annotations, normally numbers.

  2. UnitNormalizer: This component normalizes given units (annotation type Unit).

  3. Measurement.ruta: These UIMA Ruta rules combine numeric value and unit annotations to form measurement annotations.

  4. MeasurementNormalizer: This component normalizes the numerical value depending on the given unit.

  5. RelativeMeasurementIntervalAnnotator: This component is a helper annotator for relative intervals.

These components are described in more detail below.

Input

The component does not expect any mandatory annotations, but requires annotations of the types that are set by the configuration parameters to work correctly.

Output

The component generates different types of annotations. The actual result of the component is:

  • de.averbis.textanalysis.types.measurement.Measurement

These annotations combine a de.averbis.textanalysis.types.numericvalue.NumericValue and a de.averbis.textanalysis.types.measurement.Unit annotation and store their normalized values. In addition, the component can also create annotations of type:

  • de.averbis.textanalysis.types.measurement.MeasurementInterval

Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.MeasurementAnnotator


Table 114: Configuration Parameters

Name | Type | MultiValued | Mandatory

anchorType

Description: Optional type for annotations after which the component should search for units.

Default: de.averbis.textanalysis.types.numericvalue.NumericValue

String

false

true

lookaheadType

Description: Optional type for basic annotations which should be used as lookahead starting from the anchorType. If no anchorType is given, the component tries to parse all annotations, but only single annotations and no combinations. This means that the given type needs to cover the complete unit.

Default: org.apache.uima.ruta.type.RutaBasic

String

false

true

lookaheadSize

Description: Amount of annotations of lookaheadType that are used as lookahead.

Default: 15

Integer

false

true

genres

Description: The categories/genres of unit data (subdirectories) that should be utilized. Multiple values are concatenated with a comma.

Default: default

String

false

true

languages

Description: The languages of unit data (subdirectories) that should be utilized. Multiple values are concatenated with a comma.

Default: en,de

String

false

true

ignoreWhitespaces

Description: If activated, whitespace characters are ignored while units are parsed.

Default: true

Boolean

false

true

leftRecursive

Description: If activated, multiplications and divisions are parsed from left to right, e.g., mg/s/m is (mg/s)/m. If deactivated, mg/s/m is mg/(s/m).

Default: true

Boolean

false

true

identifierLookahead

Description: Additional lookahead of the parser for multi-token units.

Default: 2

Integer

false

true

avoidNumberOnlyUnits

Description: If activated, units consisting only of numbers, like 2/2, will be ignored.

Default: true

Boolean

false

true

detectIntervals

Description: Option to detect intervals of measurements.

Default: true

Boolean

false

true

dictionaryLookup

Description: Option to include a simple dictionary lookup for specific textual mentions.

Default: true

Boolean

false

true

seeders

Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied.

Default: org.apache.uima.ruta.seed.DefaultSeeder

String

true

false

reindexOnly

Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed.

Default: uima.tcas.Annotation

String

true

false

indexOnlyMentionedTypes

Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed.

Default: false

Boolean

false

false

indexAdditionally

Description: A UIMA Ruta specific parameter specifying additional annotation types that should be indexed.

Default: **

String

true

false

strictImports

Description: A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved.

Default: true

Boolean

false

false

debug

Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution.

Default: false

Boolean

false

false

debugWithMatches

Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches.

Default: false

Boolean

false

false

Maven Coordinates:

<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>


UnitAnnotator

General

The component recognizes text passages with units, but not their normalized form. It has two different modes: either text passages around certain annotations, e.g. numeric values, are examined, or the text of certain annotations is examined itself. The first mode is activated by setting the configuration parameter anchorType: the text passages around annotations of the configured type are searched, and the size and range of these passages are determined by the configuration parameters lookaheadType and lookaheadSize. In the second mode, only the configuration parameter lookaheadType is set; only the text of annotations of this type is examined. In principle, a unit is recognized if the unit parser of the configured resource can parse it.

Input

The component is based on the annotations whose types are configured in the parameters anchorType and lookaheadType.

Output

The component creates annotations of type:

  • de.averbis.textanalysis.types.measurement.Unit

but does not set any features. For this purpose, the UnitNormalizer component must be used.

Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.UnitAnnotator


Table 115: Configuration Parameters

Name | Type | MultiValued | Mandatory

anchorType

Description: Optional type for annotations after which the component should search for units.

String

false

false

lookaheadType

Description: Optional type for basic annotations which should be used as lookahead starting from the anchorType. If no anchorType is given, the component tries to parse all annotations, but only single annotations and no combinations. This means that the given type needs to cover the complete unit.

Default: de.averbis.extraction.types.Token

String

false

false

lookaheadSize

Description: Amount of annotations of lookaheadType that are used as lookahead.

Default: 15

Integer

false

true

ignoreWhitespaces

Description: If activated, whitespace characters are ignored while units are parsed.

Default: true

Boolean

false

true


Table 116: External Resources

Name | Optional | Interface/Implementation

unitResource

Description: Resource holding the unit implementation and data.

false

de.averbis.textanalysis.resources.unitresource.UnitResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
UnitResource

General

This resource encapsulates the implementation for processing units, especially the parser for unit detection. The supported units and their synonyms are defined by the configuration parameters genres and languages and can be extended by additionally configured values. Later genres overwrite previous genres; each genre can contain new units as well as synonyms for units, prefixes and operations in different languages. The structure of the additional files is explained below.

Configuration

Adaptations and extensibility

The functionality of the resource is largely determined by additional properties files (file extension .txt). Each genre contains an optional set of specific files with different tasks. The additional files are structured as follows (folder structure is indicated with hyphens):

unit
- default
-- de
--- aliases.txt
--- operations.txt
--- prefixes.txt
-- en
--- aliases.txt
--- operations.txt
--- prefixes.txt
-- unit
--- units.txt

The main folder unit is configurable via the configuration parameter resourceSpecificSubdirectory. It contains a folder for each genre; in this example, a genre with the name default is given. Each genre folder may contain multiple language-specific folders and one language-independent folder. The language-specific folders contain up to three properties files: aliases.txt, operations.txt and prefixes.txt. The language-independent folder is named unit and contains exactly one properties file named units.txt. There is a functional dependency between these files.

units.txt

First, the file units.txt is processed. This properties file defines new units, either as new base units or as derived units. The functionality is explained by the following example:

U
hektar = m²*10000

The first line defines a new base unit with the symbol U, which also determines its dimension. The second line defines a new derived unit hektar (hectare), which corresponds to ten thousand square meters. The definition of derived units may only use terms known to the unit implementation or previously defined units, not synonyms from the other files.

operations.txt

Next, the files for operations are processed. These contain synonyms for arithmetic operations. The functionality is explained by the following example:

/ = Per, per, pro, Pro

Here several synonyms are introduced for a division.

aliases.txt

Next, the synonyms for units are processed. The functionality is explained by the following example:

minute = Minute, Minuten, Min, Min., minütiger, minütige, minütig

This line defines several German synonyms for the unit minute. The keyword on the left side must be a known unit, i.e. it must have been introduced either in the unit implementation or in units.txt.

prefixes.txt

Finally, the additional synonyms for prefixes are processed. The functionality is explained by the following example:

unit = gramm, meter
k = Kilo

The first line with the keyword unit lists all unit synonyms to which prefixes may be attached. The other lines contain known prefixes and their synonyms. The result of this file is that synonyms for two derived units are added: Kilogramm and Kilometer.

Maven Coordinates:
        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      


UnitNormalizer

General

This component parses unit annotations and recognizes the actual unit. If set, the text of the feature parsed is used; otherwise the covered text of the annotation. Unit annotations whose feature normalized is already set are skipped.

Input

The component processes annotations of type:

  • de.averbis.textanalysis.types.measurement.Unit.

Output

The component sets the features parsed, normalized, normalizedAscii and dimension of the annotations of the type de.averbis.textanalysis.types.measurement.Unit.

Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.UnitNormalizer


Table 117: Configuration Parameters

Name | Type | MultiValued | Mandatory

ignoreWhitespaces

Description: If activated, whitespace characters are ignored while units are parsed.

Default: true

Boolean

false

true


Table 118: External Resources

Name | Optional | Interface/Implementation

unitResource

Description: Resource holding the unit implementation and data.

false

de.averbis.textanalysis.resources.unitresource.UnitResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
MeasurementNormalizer

General

This component processes measurement annotations. First, the exact unit of the unit annotation set in the feature unit is parsed. If this feature does not contain an annotation, the value of the parsedUnit feature is used instead; this means that measurements without a real unit in the text, but with a separately set implicit unit, can also be normalized. Then the standard unit and a transformation from the parsed unit to it are determined. The standard unit is stored in the normalizedUnit feature, and the transformation is used to normalize the numeric value set in the value feature; the result is stored in the normalizedValue feature. The normalized feature holds the concatenation of the values of the normalizedValue and normalizedUnit features, as sketched below.
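
A hedged sketch of the resulting feature layout (accessor names are assumptions derived from the feature names above via the usual JCasGen naming):

import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.averbis.textanalysis.types.measurement.Measurement;

public class MeasurementReport {

    public static void print(JCas jcas) {
        for (Measurement m : JCasUtil.select(jcas, Measurement.class)) {
            // e.g. covered text "10cm" -> normalizedValue 0.1, normalizedUnit "m"
            System.out.println(m.getCoveredText() + " -> "
                    + m.getNormalizedValue() + " " + m.getNormalizedUnit());
        }
    }
}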

Input

The component processes annotations of type:

  • de.averbis.textanalysis.types.measurement.Measurement.

Output

The component sets the features normalized, normalizedValue and normalizedUnit of the annotations of type de.averbis.textanalysis.types.measurement.Measurement.

Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.MeasurementNormalizer


Table 119: Configuration Parameters

Name | Type | MultiValued | Mandatory

ignoreWhitespaces

Description: If activated, whitespace characters are ignored while units are parsed.

Default: true

Boolean

false

true


Table 120: External Resources

Name | Optional | Interface/Implementation

unitResource

Description: Resource holding the unit implementation and data.

false

de.averbis.textanalysis.resources.unitresource.UnitResource

Maven Coordinates:

<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

RelativeMeasurementIntervalAnnotator

General

This component processes relative measurement interval annotations and sets the low and high limits.

Input

The component processes annotations of type:

  • de.averbis.textanalysis.types.measurement.RelativeMeasurementInterval

Output

The component sets the features low and high of the given annotations of type de.averbis.textanalysis.types.measurement.MeasurementInterval. In this process, other annotations usually used for measurements are also created.

Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.RelativeMeasurementIntervalAnnotator


Table 121: Configuration Parameters

Name | Type | MultiValued | Mandatory

ignoreWhitespaces

Description: If activated, whitespace characters are ignored while units are parsed.

Default: true

Boolean

false

true


Table 122: External Resources

Name | Optional | Interface/Implementation

unitResource

Description: Resource holding the unit implementation and data.

false

de.averbis.textanalysis.resources.unitresource.UnitResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>measurement-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
Rules

Measurement.ruta

PACKAGE de.averbis.textanalysis.components.measurementannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.MeasurementTypeSystem;
SCRIPT de.averbis.textanalysis.components.measurementannotator.MeasurementInterval;

BOOLEAN detectIntervals = true;

(n:NumericValue SPECIAL?{-PARTOF(Unit)} u:Unit){-> CREATE(Measurement, "value" = n, "unit" = u)};

Document{detectIntervals -> CALL(MeasurementInterval)};
      
MeasurementInterval.ruta
PACKAGE de.averbis.textanalysis.components.measurementannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.MeasurementTypeSystem;

DECLARE RelativeIntervalPrefix;

ii:IntervalIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(ii)};
gi:GreaterIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(gi)};
li:LessIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(li)};

FOREACH(m) Measurement{}{
    m (p:RelativeIntervalPrefix m2:Measurement){ -> CREATE(RelativeMeasurementInterval, "base" = m, "deflection" = m2)};
    nv:NumericValue{-PARTOF(Measurement) -> CREATE(Measurement, "unit" = m.unit, "value" = nv)}->{m1:Measurement;}
        (p:RelativeIntervalPrefix @m)
        { -> CREATE(RelativeMeasurementInterval, "base" = m1, "deflection" = m)};

        ADDRETAINTYPE(WS);
    ANY{-PARTOF(IntervalIndicator)} SPACE[0,2] (l:NumericValue{-> CREATE(Measurement, "unit" = h.unit, "value" = l)} SPACE[0,2])? IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)};
    // 12 - 15 mg
    (l:Measurement SPACE[0,2] IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)})
        {-> CREATE(MeasurementInterval, "low" = l, "high" = h)};
    // 20-0-0-0 IE
     ANY{-PARTOF(NumericValue),-PARTOF(Measurement)} SPACE[0,2] (IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)})
        {-> CREATE(MeasurementInterval, "high" = h)};
    (l:m{-PARTOF(MeasurementInterval)} SPACE[0,2] IntervalIndicator (SPACE[0,2] h:Measurement)?)
                {-> CREATE(MeasurementInterval, "low" = l, "high" = h)};
    // 1,2 pg/ml 1,0 - 3,0       vs. Metformin 850 mg 1-0-1
    ANY{-PARTOF(IntervalIndicator)} SPACE[0,2] @m SPACE[0,2] (n1:NumericValue{-PARTOF(Measurement), -PARTOF(MeasurementInterval) -> CREATE(Measurement, "value" = n1, "unit" = m.unit)}->{m1:Measurement;}
        SPACE[0,2] IntervalIndicator SPACE[0,2]
        n2:NumericValue{-PARTOF(Measurement), -PARTOF(MeasurementInterval) -> CREATE(Measurement, "value" = n2, "unit" = m.unit)}->{m2:Measurement;}
        ) { -> CREATE(MeasurementInterval, "low" = m1, "high"=m2)} SPACE[0,2] ANY{-PARTOF(IntervalIndicator)};

    (GreaterIndicator SPACE[0,2] m{-PARTOF(MeasurementInterval)}){ -> CREATE(MeasurementInterval, "low" = m)};
    (LessIndicator SPACE[0,2] m{-PARTOF(MeasurementInterval)}){ -> CREATE(MeasurementInterval, "high" = m)};
        REMOVERETAINTYPE(WS);
}
      

TemporalExpressionAnnotator

General

This component can recognize different temporal expressions and normalize their values. This includes simple date formats such as "10.2.2015" or "12:30". The component supports English and German.

The functionality of this component is divided into individual reusable components that can be configured individually and recombined. The main component TemporalExpressionAnnotator consists of the following elements:

  1. RutaTokenSeedAnnotator: This component adds annotations for the subsequent dictionary lookup.

  2. SimpleDictionaryAnnotator: This component adds different annotations based on the given word lists, for example month names.

  3. TemporalExpression.ruta: These UIMA Ruta rules aggregate the Ruta scripts Dictionary, Date and Time.

  4. TemporalExpressionNormalizer: This component normalizes annotations of the types Date and Time, and sets their values.

Input

The component does not require any other annotations as input.

Output

The component generates different types of annotations. The actual results of the annotator are subtypes of the type:

  • de.averbis.textanalysis.types.temporal.Timex3

Currently, the following subtypes are supported:

  • de.averbis.textanalysis.types.temporal.Date

  • de.averbis.textanalysis.types.temporal.Time

Configuration

Implementation: de.averbis.textanalysis.components.temporalexpressionannotator.TemporalExpressionAnnotator


Table 123: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: temporalexpressionannotator

String

false

true

ignoreCase

Description: Option to ignore the case of terms in the dictionary.

Default: false

String

false

true

genres

Description: Dictionaries to be used for creating the initial temporal text fragments.

Default: default

String

true

true

defaultFilteredTypes

Description: Types that are filtered by default in the Ruta script.

Default: org.apache.uima.ruta.type.SPACE, org.apache.uima.ruta.type.MARKUP

String

true

true

enclosingSpanType

Description: The type of the enclosing spans in which the rules are applied.

Default: uima.tcas.DocumentAnnotation

String

false

true

anchorTypeName

Description: Anchor type for the dictionary lookup.

Default: org.apache.uima.ruta.type.ANY

String

false

true

seeders

Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied.

Default: org.apache.uima.ruta.seed.DefaultSeeder

String

true

false

reindexOnly

Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed.

Default: uima.tcas.Annotation

String

true

false

indexOnlyMentionedTypes

Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed.

Default: false

Boolean

false

false

indexAdditionally

Description: A UIMA Ruta specific parameter specifying additional annotation types that should be indexed.

Default: **

String

true

false

strictImports

Description: A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved.

Default: true

Boolean

false

false

debug

Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution.

Default: false

Boolean

false

false

debugWithMatches

Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches.

Default: false

Boolean

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>temporal-expression-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
RutaTokenSeedAnnotator

A more detailed description of RutaTokenSeedAnnotator can be found in the corresponding chapter.

SimpleDictionaryAnnotator

A more detailed description of SimpleDictionaryAnnotator can be found in the corresponding chapter.

TemporalExpressionNormalizer

General

This component normalizes annotations of the Date and Time types and sets their value features. Missing information, such as the year of a date, can be completed with the help of the anchor feature, more precisely with the annotation it contains. For an annotation of type Date, the corresponding anchor annotation is also determined automatically if the feature is not set: the nearest annotation preceding the one under consideration that contains the required information is used.
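A minimal Java sketch of this anchor selection, assuming the nearest preceding annotation is determined via begin offsets; DateAnn and completeYear are illustrative names, not the actual API:

import java.util.List;

public class AnchorCompletionSketch {

    // Simplified stand-in for a Date annotation (begin offset plus day/month/year).
    record DateAnn(int begin, Integer day, Integer month, Integer year) {}

    // A Date missing its year borrows it from the nearest preceding Date that has one.
    static Integer completeYear(DateAnn incomplete, List<DateAnn> allDates) {
        DateAnn anchor = null;
        for (DateAnn d : allDates) {
            if (d.begin() < incomplete.begin() && d.year() != null
                    && (anchor == null || d.begin() > anchor.begin())) {
                anchor = d;
            }
        }
        return anchor == null ? null : anchor.year();
    }

    public static void main(String[] args) {
        List<DateAnn> dates = List.of(
                new DateAnn(0, 24, 2, 2017),   // "24.2.2017"
                new DateAnn(20, 25, 2, null)); // "25.2." -- year missing
        System.out.println(completeYear(dates.get(1), dates)); // prints 2017
    }
}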

Input

The component processes annotations of the types:

  • de.averbis.textanalysis.types.temporal.Date

  • de.averbis.textanalysis.types.temporal.Time

Output

The component sets the features value of the annotations of the types de.averbis.textanalysis.types.temporal.Date and de.averbis.textanalysis.types.temporal.Time.

Configuration

Implementation: de.averbis.textanalysis.components.temporalexpressionannotator.TemporalExpressionNormalizer


Table 124: Configuration Parameters

Name | Type | MultiValued | Mandatory

intValueFeatureName

Description: The name of the feature holding the normalized int value.

Default: value

String

false

true

Maven Coordinates:

        
 <dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>temporal-expression-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
Rules

TemporalExpression.ruta

PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;


TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.EvaluationTypeSystem;

SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Dictionary;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Date;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Time;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.DateInterval;

CALL(Dictionary);
CALL(Date);
CALL(Time);
CALL(DateInterval);
      

Dictionary.ruta

PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

DECLARE IntValued (INT value);
DECLARE IntValued MonthInd, DayInd, YearInd, HourInd, MinuteInd, SecondInd;

DECLARE DayInd DayNumberInd;
DECLARE MonthInd MonthLongInd, MonthShortInd, MonthNumberInd;
DECLARE YearInd Year2DInd, Year4DInd;
DECLARE Year4DInd Year4DModernInd;

DECLARE YearPostfixInd, YearPrefixInd;
DECLARE TimePrefixInd, TimePostfixInd;
DECLARE OfInd;

// fix dictionary-based entries
mni:MonthNumberInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};
mni:DayNumberInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};
mni:YearInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};

INT int;

BLOCK(ClassifyNum) NUM{}{
    Document{PARSE(int)};

    Document{-PARTOF(Year4DInd), REGEXP("(?:19[0-9]{2})|(?:20[0-9]{2})") -> Year4DModernInd, Year4DModernInd.value=int};
    Document{-PARTOF(YearInd), REGEXP("[12]...") -> Year4DInd, Year4DInd.value=int};
    Document{-PARTOF(YearInd), REGEXP("..") -> Year2DInd, Year2DInd.value=int};
    Document{-PARTOF(HourInd), int <= 24, int >= 0 -> HourInd, HourInd.value=int};
    Document{-PARTOF(MinuteInd), int <= 60, int >= 0 -> MinuteInd, MinuteInd.value=int, SecondInd, SecondInd.value=int};
    Document{-PARTOF(DayNumberInd), int <= 31, int > 0, REGEXP("..?")  -> DayNumberInd, DayNumberInd.value = int};
    Document{-PARTOF(MonthNumberInd), int <= 12, int > 0, REGEXP("..?")  -> MonthNumberInd, MonthNumberInd.value = int};
}

s:SPECIAL{REGEXP("[´`']")} y:@Year2DInd{-> y.begin = s.begin};
DayNumberInd{ENDSWITH(W)} ->{
    Year2DInd{->UNMARK(Year2DInd)};Year2DInd{->UNMARK(MonthNumberInd)};
    };


DECLARE Dash, Slash;
BLOCK(ClassifySpecial) SPECIAL{}{
    Document{-PARTOF(Dash), REGEXP("[-]")-> Dash};
    Document{-PARTOF(Slash), REGEXP("[/]")-> Slash};
}

//MonthLongInd{PARTOF({POSTagVerb})-> UNMARK(MonthLongInd)};
      

Date.ruta

PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.temporalexpressionannotator.DictionaryRutaTypeSystem;

STRING language;
Document{IS(uima.tcas.DocumentAnnotation) -> GETFEATURE("language", language)};
Document{IS(LanguageContainer) -> GETFEATURE("language", language)};
Document{language=="x-unspecified" -> language = "en"};

ACTION CreateDate(ANNOTATION year, ANNOTATION month, ANNOTATION day) = CREATE(Date, "kind" = "DATE", "year" = year, "month" = month, "day" = day);

ADDFILTERTYPE(Date);

// hotfix combi with document
(y:@Year4DInd{STARTSWITH(Document)} Dash m:MonthNumberInd Dash d:DayNumberInd){-> CreateDate(y, m, d)};
ANY{-PARTOF(Dash)} @(y:@Year4DInd Dash m:MonthNumberInd Dash d:DayNumberInd){-> CreateDate(y, m, d)};

(d:DayNumberInd PERIOD m:MonthInd PERIOD? y:YearInd){-> CreateDate(y, m, d)};

(m:MonthInd{-IS(MonthNumberInd)} d:DayInd COMMA y:YearInd){-> CreateDate(y, m, d)};
(m:MonthInd{-IS(MonthNumberInd)} COMMA d:DayInd y:YearInd){-> CreateDate(y, m, d)};

BLOCK(en) Document{language == "en"} {
    (d:DayInd ANY?{OR(REGEXP("of"), IS(PERIOD))} m:MonthInd{-IS(MonthNumberInd)} COMMA y:@YearInd){-> CreateDate(y, m, d)} COMMA;
    (d:DayInd ANY?{OR(REGEXP("of"), IS(PERIOD))} m:MonthInd{-IS(MonthNumberInd)} y:@YearInd){-> CreateDate(y, m, d)};
    (m:MonthNumberInd{-PARTOF(Date)} Slash d:DayNumberInd Slash y:YearInd){-> CreateDate(y, m, d)};
    (m:MonthNumberInd{-PARTOF(Date)} Dash d:DayNumberInd Dash y:YearInd){-> CreateDate(y, m, d)};
    W{REGEXP("on", true)} (m:@MonthInd{-IS(MonthNumberInd)} d:DayNumberInd){-> CreateDate(null, m, d)};
    (m:MonthInd{-IS(MonthNumberInd)} OfInd? y:@YearInd){-> CreateDate(y, m, null)};
    (m:MonthInd{-IS(MonthNumberInd)} COMMA y:@YearInd){-> CreateDate(y, m, null)} COMMA;
    (m:MonthInd{-IS(MonthNumberInd)} d:DayInd){-> CreateDate(null, m, d)};
}
BLOCK(de) Document{language == "de"} {
    (d:DayNumberInd{-PARTOF(Date)} Slash m:MonthNumberInd Slash y:YearInd){-> CreateDate(y, m, d)};
    (d:DayNumberInd{-PARTOF(Date)} Dash m:MonthNumberInd Dash y:YearInd){-> CreateDate(y, m, d)};
}

(d:DayInd PERIOD? m:MonthInd y:YearInd){-> CreateDate(y, m, d)};

(m:MonthInd Slash? y:@YearInd){-> CreateDate(y, m, null)};
(m:MonthInd{-IS(MonthNumberInd)} d:DayInd){-> CreateDate(null, m, d)};
(d:DayInd PERIOD m:@MonthNumberInd p:PERIOD{p.begin==m.end}){-> CreateDate(null, m, d)};
(d:DayInd PERIOD m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};
(d:DayInd OfInd m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};
(d:DayInd m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};


(m:MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, null)};
(y:@Year4DModernInd){-> CreateDate(y, null, null)};
y:@Year2DInd{STARTSWITH(SPECIAL)-> CreateDate(y, null, null)};
YearPrefixInd y:@Year2DInd{-> CreateDate(y, null, null)} (SW{REGEXP("and|und")} Year2DInd{-> CreateDate(y, null, null)})?;

REMOVEFILTERTYPE(Date);

// vom 12. bis 14.08.2008
TemporalIntervalBeginIndicator d:DayInd{ -> CreateDate(date.year, date.month, d)} PERIOD
    TemporalIntervalEndIndicator date:Date{date.day != null};

ADDRETAINTYPE(WS);
Date{-> UNMARK(Date)} PM NUM{-PARTOF(Date)};
Date{CONTAINS(NUM,3,10) -> UNMARK(Date)} PM NUM;
Date{-> UNMARK(Date)} SPECIAL{-REGEXP("[\\)\\]\\}]")} ANY{-PARTOF(Date),-PARTOF(WS)};
NUM PM @Date{CONTAINS(NUM,3,10)-> UNMARK(Date)};
ANY{-PARTOF(Date)} PM @Date{-> UNMARK(Date)};
@Date{-CONTAINS(W)-> UNMARK(Date)} W{-REGEXP("T"),-PARTOF(Date)};
W{-PARTOF(Date)} @Date{STARTSWITH(NUM)-> UNMARK(Date)};
ANY{-PARTOF(Date),-PARTOF(WS)} SPECIAL{-REGEXP("[\\(\\[\\{]")} @Date{-> UNMARK(Date)};
d1:Date{ENDSWITH(Year2DInd)-> UNMARK(d1)} SPECIAL d2:Date{IS(Year4DInd) -> UNMARK(d2)};
REMOVERETAINTYPE(WS);

ANY{-PARTOF(NUM)} @Date{REGEXP("May")-> UNMARK(Date)} W;
@Date{STARTSWITH(Document),REGEXP("May")-> UNMARK(Date)} W;

OfInd d:@Date{OR(CONTAINS(Slash), CONTAINS(Dash)), -CONTAINS(MonthLongInd), -CONTAINS(MonthShortInd)-> UNMARK(d)};

BLOCK(en) Document{language == "en"} {
   SW.ct=="at" @Date{-> UNMARK(Date)};
}

ACTION Unambig(ANNOTATION timex) = CREATE(UnambiguousTimex, "timex" = timex);

FOREACH(date,false) Date{}{
    // 23.1.,24.2. und 25.2.2017
    d:Date{d.year == null-> d.anchor=date} ANY+{PARTOF({COMMA,POSTagConj,TemporalIntervalBeginIndicator,TemporalIntervalEndIndicator})} date{date.year != null};
    d:Date{d.year == null-> d.anchor=date.anchor} ANY+{PARTOF({COMMA,POSTagConj,TemporalIntervalBeginIndicator,TemporalIntervalEndIndicator})} date{date.anchor != null};


    // unambiguous
    date{OR(CONTAINS(MonthLongInd),CONTAINS(MonthShortInd)), CONTAINS(YearInd) -> Unambig(date)};
    date{date.day != null, date.month != null, date.year != null -> Unambig(date)};
}
      

Time.ruta

PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.temporalexpressionannotator.DictionaryRutaTypeSystem;

STRING language;
Document{IS(uima.tcas.DocumentAnnotation) -> GETFEATURE("language", language)};
Document{IS(LanguageContainer) -> GETFEATURE("language", language)};

ACTION CreateTime(ANNOTATION hour, ANNOTATION minute, ANNOTATION second) = CREATE(Time, "kind" = "TIME", "hour" = hour, "minute" = minute, "second" = second);

ADDFILTERTYPE(Time);

(h:HourInd COLON m:MinuteInd{REGEXP("..")} (COLON s:SecondInd)? TimePostfixInd?){-> CreateTime(h, m, s)};
(h:HourInd TimePostfixInd){-> CreateTime(h, null, null)};

REMOVEFILTERTYPE(Time);

d:Date{-> UNMARK(d)} ANY?{OR(REGEXP("T|,"), IS(TimePrefixInd))} t:Time{-> t.begin = d.begin, t.anchor = d};

ADDRETAINTYPE(WS);
REMOVERETAINTYPE(WS);
      

Part-of-Speech Tagging

FactoriePOSAnnotator

General

This POS Tagger is based on a Factorie Factor Graph model. The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Input

The component expects the following mandatory annotations:

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Token

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type:

  • de.averbis.extraction.types.POSTag

or, depending on the word type, the corresponding annotation. The following subtypes are available in the type system for this purpose:

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagAdp

  • de.averbis.extraction.types.POSTagAdv

  • de.averbis.extraction.types.POSTagConj

  • de.averbis.extraction.types.POSTagDet

  • de.averbis.extraction.types.POSTagNoun

  • de.averbis.extraction.types.POSTagNum

  • de.averbis.extraction.types.POSTagPart

  • de.averbis.extraction.types.POSTagPron

  • de.averbis.extraction.types.POSTagPunct

  • de.averbis.extraction.types.POSTagVerb

Configuration

Apart from the resource, this component has no configuration parameters.

FactoriePOSTaggerResource

General

This resource encapsulates the statistical POSTagger model based on Factorie. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.factoriepostaggerresource.FactoriePOSTaggerResource


Table 125: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: factoriepostagger

String

false

false

genre

Description: The genre of text to process; the combination of genre and document language determines which model is used; available genres: newspaper or bionlp.

Default: newspaper

String

false

false

documentAnnotatorClassName

Description: The implementation of the Factorie DocumentAnnotator.

Default: de.averbis.textanalysis.factorie.GenericForwardPosTagger

String

false

true

attributeClassName

Description: The implementation of the Factorie Attribute.

Default: de.averbis.textanalysis.factorie.GenericPosTag

String

false

true

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>factorie-postagger-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

OpennlpPOSAnnotator

General

This POSTagger is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Token

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of the following type:

  • de.averbis.extraction.types.POSTag

or its subtypes. If a word type cannot be assigned to a specific subtype, the above-mentioned parent type is used.

The following valid subtypes are available in the type system:

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagAdp

  • de.averbis.extraction.types.POSTagAdv

  • de.averbis.extraction.types.POSTagConj

  • de.averbis.extraction.types.POSTagDet

  • de.averbis.extraction.types.POSTagNoun

  • de.averbis.extraction.types.POSTagNum

  • de.averbis.extraction.types.POSTagPart

  • de.averbis.extraction.types.POSTagPron

  • de.averbis.extraction.types.POSTagPunct

  • de.averbis.extraction.types.POSTagVerb

Configuration

Implementation: de.averbis.textanalysis.components.opennlpposannotator.OpennlpPOSAnnotator


Table 126: Configuration Parameters

Name | Type | MultiValued | Mandatory

tokenBlockSize

Description: Sentences having more tokens than tokenBlockSize will be processed in blocks of this size to avoid overlong runtime of this component.

Default: 500

Integer

false

false


Table 127: External Resources

Name | Optional | Interface/Implementation

opennlpPOSTaggerResource

Description: Resource holding a map with available models (postagger) for different languages

false

de.averbis.textanalysis.resources.opennlppostaggerresource.OpennlpPOSTaggerResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-pos-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

OpennlpPOSTaggerResource

General

This resource encapsulates the statistical POSTagger model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.opennlppostaggerresource.OpennlpPOSTaggerResource


Table 128: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: opennlppostagger

String

false

false

genre

Description: The genre of text to process; the combination of genre and document language determines which model is used; available genres: newspaper or bionlp.

Default: newspaper

String

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-postagger-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Shallow Parsing

FactorieChunkAnnotator

General

This chunker is based on a Factorie factor graph model. The basic version includes trained models for the two standard languages (de, en), as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.POSTag

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Chunk

or, depending on the phrase type, the corresponding annotation. The following subtypes are available in the type system for this purpose:

  • de.averbis.extraction.types.ChunkNP

  • de.averbis.extraction.types.ChunkVP

  • de.averbis.extraction.types.ChunkPP

Configuration

Apart from the resource, this component has no configuration parameters.

FactorieChunkerResource

General

This resource encapsulates the statistical chunker model based on Factorie. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.factoriechunkerresource.FactorieChunkerResource


Table 129: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: factoriechunker

String

false

false

genre

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: newspaper

String

false

false

documentAnnotatorClassName

Description: The implementation of the Factorie DocumentAnnotator.

Default: de.averbis.textanalysis.factorie.BIOGenericChainChunker

String

false

true

attributeClassName

Description: The implementation of the Factorie Attribute.

Default: cc.factorie.app.nlp.load.BIOChunkTag

String

false

true

Maven Coordinates:

<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>factorie-chunker-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

OpennlpChunkAnnotator

General

This chunker is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the two standard languages (de, en), as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.POSTag

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The component creates annotations of type de.averbis.extraction.types.Chunk or, depending on the phrase type, the corresponding annotation.

The following subtypes are available in the type system for this purpose:

  • de.averbis.extraction.types.ChunkNP

  • de.averbis.extraction.types.ChunkVP

  • de.averbis.extraction.types.ChunkPP

Configuration

Implementation: de.averbis.textanalysis.components.opennlpchunkannotator.OpennlpChunkAnnotator


Table 130: Configuration Parameters

Name | Type | MultiValued | Mandatory

tokenBlockSize

Description: Sentences having more tokens than tokenBlockSize will be processed in blocks of this size to avoid overlong runtime of this component.

Default: 500

Integer

false

false


Table 131: External Resources

Name | Optional | Interface/Implementation

opennlpChunkerResource

Description: Resource holding a map with available models (chunker) for different languages.

false

de.averbis.textanalysis.resources.opennlpchunkerresource.OpennlpChunkerResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-chunk-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

OpennlpChunkerResource

General

This resource encapsulates the statistical Chunker model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.opennlpchunkerresource.OpennlpChunkerResource


Table 132: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: opennlpchunker

String

false

false

genre

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: newspaper

String

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>opennlp-chunker-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Enumerations

EnumerationAnnotator

General

This component can detect enumerations based on atomic text units (e.g., chunks) and conjunctions (e.g., the word "and").

Input

The component requires annotations of the configured types. Normally, these are POS tagger and chunker annotations:

  • de.averbis.extraction.types.POSTagConj

  • de.averbis.extraction.types.ChunkNP

Output

The component creates annotations of type:

  • de.averbis.textanalysis.types.Enumeration

It fills their members feature and sets their label feature to 'enumeration'.

Configuration

Implementation: de.averbis.textanalysis.components.enumerationannotator.EnumerationAnnotator


Table 133: Configuration Parameters

Name | Type | MultiValued | Mandatory

withinChunks

Description: If activated, enumerations are detected within annotations of the type chunkType.

Default: true

Boolean

false

true

combineChunks

Description: If activated, annotations of the type chunkType are combined to enumerations.

Default: true

Boolean

false

true

slashEnum

Description: If activated, a slash '/' indicates an enumeration.

Default: false

Boolean

false

true

chunkType

Description: The basic type of elements of an enumeration.

Default: de.averbis.extraction.types.ChunkNP

String

false

true

conjunctionType

Description: The basic type of enumeration indicator.

Default: de.averbis.extraction.types.POSTagConj

String

false

true

seeders

Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied.

Default: org.apache.uima.ruta.seed.DefaultSeeder

String

true

false

reindexOnly

Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed.

Default: uima.tcas.Annotation

String

true

false

indexOnlyMentionedTypes

Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed.

Default: false

Boolean

false

false

indexAdditionally

Description: A UIMA Ruta specific parameter specifying additional annotation types that should be indexed.

Default: **

String

true

false

strictImports

Description: A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved.

Default: true

Boolean

false

false

debug

Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution.

Default: false

Boolean

false

false

debugWithMatches

Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches.

Default: false

Boolean

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>enumeration-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      
Rules
PACKAGE de.averbis.textanalysis.components.enumerationannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.enumerationannotator.ListingRutaTypeSystem;

BOOLEAN withinChunks = true;
BOOLEAN combineChunks = true;
BOOLEAN slashEnum = false;
BOOLEAN addMissingChunks = true;
BOOLEAN extendChunksToConcepts = true;

TYPE chunkType = de.averbis.extraction.types.ChunkNP;
TYPE conjunctionType = de.averbis.extraction.types.POSTagConj;

ACTION Enum() = CREATE(Enumeration, "members" = Member, "label" = "enumeration");

DECLARE EnumIndicator;
(conjunctionType{-PARTOF(EnumIndicator)} (SPECIAL.ct=="/" conjunctionType)?){-> EnumIndicator};
(ei:EnumIndicator s:SPECIAL.ct=="-"){-> ei.end=s.end};

e:EnumIndicator{REGEXP("but") -> UNMARK(e)};

BLOCK(addMissingChunks) Document{addMissingChunks}{
     (POSTagDet?{-PARTOF(Chunk)} POSTagAdv* POSTagAdj* @POSTagNoun{-PARTOF(Chunk)}){-> ChunkNP, ChunkNP.value = "NP"};
     ChunkNP COMMA POSTagAdj{-PARTOF(Chunk)-> ChunkNP, ChunkNP.value = "NP"} EnumIndicator ChunkNP;
}

// TODO refactor to avoid redundant operations
BLOCK(extendChunksToConcepts) Document{extendChunksToConcepts}{
    // shortness of breath -> 1 ChunkNP
    c:Concept{CONTAINS(POSTagAdp)}->{np1:ChunkNP{np1.begin==c.begin} POSTagAdp np2:ChunkNP{np2.end==c.end -> np1.end=np2.end, UNMARK(np2)};};
    // CMV-Pneumonie
    c:Concept{CONTAINS(SPECIAL)}->{np1:ChunkNP{np1.begin==c.begin} SPECIAL.ct=="-" np2:ChunkNP{np2.end==c.end -> np1.end=np2.end, UNMARK(np2)};};
    // carotid bruits
    c:Concept{CONTAINS(ChunkNP,2,2)}->{np1:ChunkNP{np1.begin==c.begin -> np1.end = np2.end} np2:ChunkNP{-> UNMARK(np2)};};
    c:Concept{CONTAINS(ChunkNP)}<-{np1:ChunkNP{np1.begin==c.begin -> np1.end = np2.end} np2:POSTagNoun{-PARTOF(Chunk)};};
    // <Crohn><'s, or ulcerative colitis>
    c1:Concept{STARTSWITH(ChunkNP), -ENDSWITH(ChunkNP)}->{np1:ChunkNP{np1.begin == c1.begin -> np1.end = c1.end};}
        COMMA? conjunctionType c2:Concept;
    np:ChunkNP<-{pa:POSTagPart{pa.begin == np.begin} SW.ct=="s" ANY[0,2]{-PARTOF(Concept)} c:Concept{-> np.begin = c.begin};};
    // Akute Transplantat-gegen-Wirt Erkrankung
    c:Concept{-> ChunkNP}<-{np1:ChunkNP{np1.begin==c.begin -> UNMARK(np1)} ANY[0,3]{-PARTOF(Chunk)} np2:ChunkNP{np2.end==c.end-> UNMARK(np2)};};
}

BLOCK(withinChunks) Document{withinChunks}{
    // within chunks
    BLOCK(eachChunk) chunkType{CONTAINS(EnumIndicator)} {
        ((ANY+{-PARTOF(COMMA) -> Member} COMMA)* ANY+{-PARTOF(COMMA)-> Member} @EnumIndicator{-PARTOF(Enumeration)} #{-> Member}){-> Enum()};
    }
    // should have been a chunk, chunk misses adjectives in front of it
    ((COMMA? ANY+{-PARTOF(COMMA),PARTOF({POSTagAdj, POSTagAdv})-> Member})+ @EnumIndicator{-PARTOF(Enumeration)}
        ChunkNP{OR(STARTSWITH(POSTagAdj),STARTSWITH(POSTagAdv))-> Member}){-> Enum()};

    // adjectives after chunk used in medical documents
    (ChunkNP{-> Member}
            (ANY+{-PARTOF(COMMA),-PARTOF(ChunkNP),PARTOF({POSTagAdj, POSTagAdv})} COMMA)+
            ANY+{-PARTOF(COMMA),-PARTOF(ChunkNP),PARTOF({POSTagAdj, POSTagAdv})-> Member}
            @EnumIndicator{-PARTOF(Enumeration)}
            ANY+{-PARTOF(ChunkNP),PARTOF({POSTagAdj, POSTagAdv})-> Member}){-> Enum()};

}

BLOCK(combineChunks) Document{combineChunks}{

    // lentigo vs macular SK vs lentig maligna
    ((chunkType{-PARTOF(Enumeration)-> Member} EnumIndicator{-PARTOF(Enumeration)})[2,100] chunkType{-> Member}){-> Enum()};

        ((chunkType{-PARTOF(Enumeration) -> Member} SPECIAL.ct=="-"? COMMA)* chunkType{-PARTOF(Enumeration) -> Member} SPECIAL.ct=="-"?{-PARTOF(chunkType)} COMMA?{-PARTOF(chunkType)}
            @EnumIndicator{-PARTOF(Enumeration)} chunkType{-PARTOF(Enumeration) -> Member}){-> Enum()};
        // TODO broken chunking
        ((chunkType{-> Member} SPECIAL.ct=="-"? COMMA)* chunkType{-> Member} SPECIAL.ct=="-"?{-PARTOF(chunkType)} COMMA?{-PARTOF(chunkType)}
                @EnumIndicator{-PARTOF(Enumeration)} Chunk{-PARTOF({ChunkNP,ChunkPP,ChunkVP})-> Member}){-> Enum()} ANY{-PARTOF(POSTagAdp)};
}

BLOCK(slashEnum) Document{slashEnum}{
        ((chunkType{-PARTOF(Enumeration), -REGEXP(".") -> Member} SPECIAL.ct=="-"? SPECIAL.ct=="/")+
            chunkType{-PARTOF(Enumeration), -CONTAINS(COMMA), -STARTSWITH(NUM) -> Member}){-> Enum()};
}
      

Entity Detection

MalletEntityAnnotator

In computational linguistics, the recognition of proper names is the task of identifying and typing references to entities within a text. Typical proper names are people, places and organisations.

The recognition of proper names is based on machine learning methods, as this approach enables high recognition rates. The module can also be adapted to new domains, languages and entity types by retraining the statistical model.

General

The component is based on Conditional Random Fields (CRF), a machine learning method that is particularly well suited to this task. This component comes with a standard model that recognizes the classic named entities (people, places, organizations).

The component also provides a training module that can be used to easily train new models from existing training data. In this way, adaptation to a new text domain or text genre (e.g., social media or biomedical literature) and to other entity types is straightforward. For example, a gene and protein tagger can be created with little effort.

If the confidence calculation is switched on, the marginal probabilities of the respective words are calculated while retaining the remaining sequence (i.e., the predicted labels of the sentence). The confidence of an entity is then the average of the individual word probabilities of all words contained in the entity.

To make the tagger more precise, you can specify a minimum confidence that must be met for an entity to be annotated at all. The resulting increase in precision is of course achieved at the expense of recall.
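A small Java sketch of the described confidence logic, assuming the per-word marginal probabilities have already been computed (the Mallet API itself is not shown):

import java.util.List;

public class EntityConfidenceSketch {

    // Entity confidence = average of the marginal probabilities of its words.
    static double entityConfidence(List<Double> wordMarginals) {
        return wordMarginals.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        double confidenceThreshold = 0.8;                    // cf. parameter confidenceThreshold
        List<Double> marginals = List.of(0.95, 0.90, 0.70);  // per-word marginal probabilities
        double confidence = entityConfidence(marginals);     // ~0.85
        if (confidence >= confidenceThreshold) {
            System.out.println("annotate entity, confidence " + confidence);
        } else {
            System.out.println("drop entity"); // precision gained at the expense of recall
        }
    }
}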

Input

The component requires the following annotations:

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Token

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

Output

The following annotation is created:

  • de.averbis.extraction.types.Entity

The feature label specifies the entity class (for example PERS for persons, GEO for places and ORG for organisations in the basic model). However, you can also specify a special mapping using configuration parameters, which contains more specific entity types depending on the label.

Details
Background/Algorithm

The tagger is a further development of the JNET tagger. It is based on conditional random fields (CRFs), uses the Mallet implementation, and follows the tagger described in Settles (2004). CRFs are particularly well suited for the task of named entity recognition as they model correlations in the text. The linear-chain CRFs used here model the text as a sequence of words and thus capture dependencies inherent to the language.

Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA).
Feature Configuration

The feature configuration specifies which features are used during training. The default configuration is shown below; the commented-out lines (prefixed with '#') can be commented in again to enable the corresponding features.

offset_conjunctions = (-1) (1)
feat_lowercase_enabled = false
feat_wc_enabled = true
feat_bwc_enabled = true
feat_bioregexp_enabled = true
feat_plural_enabled = true
#token_ngrams = 2,3
#char_ngrams = 3,4
#prefix_sizes = 2,3
suffix_sizes = 2,3
#TESTLEX_lexicon = test.lex
Evaluation

Model "default"

The basic model (available for German and English) recognizes people, places and organisations. It has been trained on training data from the newspaper domain and is therefore very suitable for texts that are well-formed, grammatically correct and not colloquial.

German: Tiger-Korpus. Recall/Precision/F-Score: 0.88/0.93/0.90

English: MASC corpus. Recall/Precision/F-Score: 0.83/0.89/0.86

Configuration

Implementation: de.averbis.textanalysis.components.malletentityannotator.MalletEntityAnnotator


Table 134: Configuration Parameters

Name | Type | MultiValued | Mandatory

calculateConfidence

Description: If activated, the confidence of extracted entities will be calculated (takes extra time, so only turn on if really needed).

Default: false

Boolean

false

true

confidenceThreshold

Description: If parameter calculateConfidence is activated, only entity mentions which exceed this threshold are added.

Default: 0.0

Float

false

true

labelMapping

Description: Optional mapping file from label to entity type.

String

false

false

blackList

Description: Optional file containing exclusions for specific labels, e.g., Obama@GEO.

String

false

false

expandAbbreviations

Description: If activated, tokens that are acronyms/abbreviations are sent to the tagger in expanded form, i.e., as their full form. Setting this to true may improve tagger performance, but only if the model was trained on such data.

Default: false

Boolean

false

true

linkEntityToToken

Description: If activated, tokens underlying the entity will have a reference to the entity.

Default: false

Boolean

false

true

ignoreByConceptMapperAfterMapped

Description: If this parameter and the parameter linkEntityToToken are activated, the tokens underlying the new entities will be marked as ignored by the concept mapper.

Default: false

Boolean

false

true


Table 135: External Resources

Name | Optional | Interface/Implementation

malletEntityTaggerResource

Description: Resource holding the available models for different languages.

false

de.averbis.textanalysis.resources.malletentitytaggerresource.MalletEntityTaggerResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>mallet-entity-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

MalletEntityTaggerResource

General

This resource encapsulates the statistical CRF model based on Mallet. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

Configuration

Implementation: de.averbis.textanalysis.resources.malletentitytaggerresource.MalletEntityTaggerResource


Table 136: Configuration Parameters

Name | Type | MultiValued | Mandatory

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: malletentitytagger

String

false

false

genre

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: newspaper

String

false

false

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>mallet-entity-tagger-resource</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Concept Recognition

GenericTerminologyAnnotator

General

This component is a generic combination of up to three concept annotators based on the configured terminologies. It is designed to simplify the use of concept recognition by eliminating the need to configure the individual concept annotators and their resources. The managed components and resources are configured automatically, but this can be influenced by various configuration parameters. One of the most important parameters is terminologyNames, which defines the terminologies to be used. In a required preprocessing stage, these terminologies are first converted into serialized dictionaries.
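As a small illustration of the naming convention described by the dictionary parameters in Table 137 below, each entry in terminologyNames is expanded to '<name>.exact.xml' for the exact lookup and '<name>.xml' for the other lookups; the terminology names used in this Java sketch are hypothetical:

public class TerminologyDictionaryNames {
    public static void main(String[] args) {
        String[] terminologyNames = { "ICD10", "MeSH" }; // hypothetical terminology names
        for (String name : terminologyNames) {
            System.out.println(name + ".exact.xml"); // dictionary for the exact lookup
            System.out.println(name + ".xml");       // dictionary for original/stem/segment lookup
        }
    }
}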

Input

The component is based on possibly several ConceptAnnotators with different configurations. It therefore requires the annotations these ConceptAnnotators need to function correctly, such as sentences, tokens, stems, or segments.

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Concept (including subtypes)

The exact type depends on the terminology files used and the concept types specified in them.

Configuration

Implementation: de.averbis.textanalysis.components.terminologyannotator.GenericTerminologyAnnotator


Table 137: Configuration Parameters

Name | Type | MultiValued | Mandatory

useExactLookup

Description: Apply exact lookup.

Default: true

Boolean

false

true

useOriginalLookup

Description: Apply original lookup.

Default: true

Boolean

false

true

useStemLookup

Description: Apply lookup based on stems.

Default: true

Boolean

false

true

useSegmentLookup

Description: Apply lookup based on segments.

Default: true

Boolean

false

true

enableMatchedTokens

Description: Enable matched tokens again after processing. Sets the feature ignoredByConceptMapper of tokens covered by any Concept to false.

Default: true

Boolean

false

true

resourceSpecificSubdirectory

Description: Resource specific subdirectory against which all relative paths are resolved. This parameter overrides the default directory given by the implementation.

String

false

false

terminologyNames

Description: Names of the source terminologies.

String

true

false

resourceIdentifier

Description: Optional identifier for resources that are automatically created and bound within the concept annotators.

String

false

false

exactPreprocessingAnalysisEngineName

Description: Analysis engine name for exact preprocessing of the dictionary entries.

String

false

false

originalPreprocessingAnalysisEngineName

Description: Analysis engine name for original preprocessing of the dictionary entries.

String

false

false

stemPreprocessingAnalysisEngineName

Description: Analysis engine name for stem preprocessing of the dictionary entries.

String

false

false

exactDictionarySourceFileNames

Description: Names of the dictionaries used for exact lookup. The value is given by a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.exact.xml'.

String

false

false

originalDictionarySourceFileNames

Description: Names of the dictionaries used for original lookup. The value is given by a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'.

String

false

false

stemDictionarySourceFileNames

Description: Names of the dictionaries used for stem lookup. The value is given by a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'.

String

false

false

segmentDictionarySourceFileNames

Description: Names of the dictionaries used for segment lookup. The value is given by a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'.

String

false

false

ignoreAfterExact

Description: Ignore matched tokens after exact lookup.

Default: true

Boolean

false

true

ignoreAfterOriginal

Description: Ignore matched tokens after original lookup.

Default: true

Boolean

false

true

ignoreAfterStem

Description: Ignore matched tokens after stem lookup.

Default: true

Boolean

false

true

ignoreAfterSegment

Description: Ignore matched tokens after segment lookup.

Default: true

Boolean

false

true

exactLookup

Description: Apply exact lookup. This parameter overrides the default behavior of the implementation. The following values are allowed: ACTIVE, INACTIVE, UNKNOWN.

Default: UNKNOWN

String

false

true

originalLookup

Description: Apply original lookup. This parameter overrides the default behavior of the implementation. The following values are allowed: ACTIVE, INACTIVE, UNKNOWN.

Default: UNKNOWN

String

false

true

stemLookup

Description: Apply lookup based on stems. This parameter overrides the default behavior of the implementation. The following values are allowed: ACTIVE, INACTIVE, UNKNOWN.

Default: UNKNOWN

String

false

true

segmentLookup

Description: Apply lookup based on segments. This parameter overrides the default behavior of the implementation. The following values are allowed: ACTIVE, INACTIVE, UNKNOWN.

Default: UNKNOWN

String

false

true

exactCaseVariant

Description: Defines the case matching of the exact lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE.

Default: CASE_MATCH

String

false

true

originalCaseVariant

Description: Defines the case matching of the original lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE.

Default: CASE_IGNORE

String

false

true

stemCaseVariant

Description: Defines the case matching of the stem lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE

Default: CASE_IGNORE

String

false

true

segmentCaseVariant

Description: Defines the case matching of the segment lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE.

Default: CASE_IGNORE

String

false

true

exactMatchOnlyTermsWithNouns

Description: Defines if only concepts should be matched that comprise a noun in exact mode.

Default: false

Boolean

false

true

originalMatchOnlyTermsWithNouns

Description: Defines if only concepts should be matched that comprise a noun in original mode.

Default: false

Boolean

false

true

stemMatchOnlyTermsWithNouns

Description: Defines if only concepts should be matched that comprise a noun in stem mode.

Default: false

Boolean

false

true

segmentMatchOnlyTermsWithNouns

Description: Defines if only concepts should be matched that comprise a noun in segment mode.

Default: false

Boolean

false

true

exactMapResolvedAbbreviations

Description: If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in exact mode.

Default: false

Boolean

false

true

originalMapResolvedAbbreviations

Description: If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in original mode.

Default: false

Boolean

false

true

stemMapResolvedAbbreviations

Description: If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in stem mode.

Default: false

Boolean

false

true

segmentMapResolvedAbbreviations

Description: If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in segment mode.

Default: false

Boolean

false

true

exactFindAllMatches

Description: Finds all matches in a text passage in exact mode, including overlapping ones.

Default: false

Boolean

false

true

originalFindAllMatches

Description: Finds all matches in a text passage in original mode, including overlapping ones.

Default: false

Boolean

false

true

stemFindAllMatches

Description: Finds all matches in a text passage in stem mode, including overlapping ones.

Default: false

Boolean

false

true

segmentFindAllMatches

Description: Finds all matches in a text passage in segment mode, including overlapping ones.

Default: false

Boolean

false

true

exactFilterBestMatches

Description: Chooses the best match of all matches on a text passage in exact mode (via fuzziness score).

Default: true

Boolean

false

true

originalFilterBestMatches

Description: Chooses the best match of all matches on a text passage in original mode (via fuzziness score).

Default: true

Boolean

false

true

stemFilterBestMatches

Description: Chooses the best match of all matches on a text passage in stem mode (via fuzziness score).

Default: true

Boolean

false

true

segmentFilterBestMatches

Description: Chooses the best match of all matches on a text passage in segment mode (via fuzziness score).

Default: true

Boolean

false

true

makeConceptAnnotation

Description: This parameter specifies whether a concept annotation is created at all. If set to true, a concept annotation is made (i.e., added to the index); if set to false, no concept annotation is made, but the tokens underlying the potential concepts are set to be ignored. This is used, e.g., if the concept annotator is only meant to mark some phrase as ignored, without being interested in the concept annotation itself.

Default: true

Boolean

false

true


Table 138: External Resources

Name | Optional | Interface/Implementation

exactConceptDictionaryResource

Description: Dictionary resource for exact lookup overriding the default one.

true

de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource

stemConceptDictionaryResource

Description: Dictionary resource for stem lookup overriding the default one.

true

de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource

originalConceptDictionaryResource

Description: Dictionary resource for original lookup overriding the default one.

true

de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource

segmentConceptDictionaryResource

Description: Dictionary resource for segment lookup overriding the default one.

true

de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource

Maven Coordinates:

        
<dependency>
        <groupId>de.averbis.textanalysis</groupId>
        <artifactId>terminology-annotator</artifactId>
        <version>3.5.0</version>
</dependency>
        
      

Instead of the configuration parameters exactLookup, stemLookup and segmentLookup, only the parameters useExactLookup, useStemLookup and useSegmentLookup should be used when configuring this component.


WordlistAnnotator

Description

The WordlistAnnotator allows users to directly embed simple wordlists into pipelines. It identifies words from the wordlist in texts and creates an annotation of type Entity. Optionally, a 'label' and a 'value' can be specified in columns 2 and 3 of the wordlist to fill the corresponding attributes of type Entity (see example below).

Input

Above this annotator, the following annotator must be included in the pipeline:

Configuration


Table 50: Configuration Parameters

Name | Type | MultiValued | Mandatory

delimiter

The column separator of the wordlist, separating the searched term from its features.

String | false | true

ignoreCase

Option to ignore the case of the terms in the wordlist.

Possible values: ACTIVE | INACTIVE

Boolean | false | true

onlyLongest

Option to filter matches that are part of a longer match. Example: 'diabetes mellitus' but not 'diabetes'.

Possible values: ACTIVE | INACTIVE

Boolean | false | true

wordlist

The wordlist (dictionary) content.

The first line contains the complete package name of type Entity. If columns 2 and 3 are filled, line 1 has to be filled with the attribute names 'label' and 'value'.

The remaining lines contain the words of the wordlist (column 1) and optionally 'label' and 'value' values (columns 2 and 3).

Example Wordlist:

de.averbis.extraction.types.Entity;label;value

Lip;Organ;C00

Tongue;Organ;C01

String | false | false

Output

The annotator creates an annotation of type Entity.


Exemplary Annotation Type: de.averbis.extraction.types.Entity


Table 51: Features

Name | Description | Type

label | Represents the string in the feature "label" of the matched term in the wordlist. | String
value | Represents the string in the feature "value" of the matched term in the wordlist. | String

WebService Example

Example: for the input text "The lip", the annotator returns the following annotation:

    {
      "begin": 4,
      "end": 7,
      "type": "de.averbis.extraction.types.Entity",
      "coveredText": "lip",
      "id": 306,
      "componentId": null,
      "confidence": 0,
      "label": "Organ",
      "value": "C00",
      "parsedElements": null
    }

Indexing

CooccurrenceDescriptorAnnotator

General

Extracts keywords based on the co-occurrence of individual lexical units. Lexical units can be tokens, stems, segments, or lemmata. Scores are calculated for the selected lexical units within a keyword candidate and subsequently combined into a total score for the respective keyword candidate.

Units that frequently occur together with other lexical units in keyword candidates are given a higher weight than units that mostly occur alone in keyword candidates. As a result, this procedure tends to prefer keyword candidates that consist of several lexical units. It is therefore well suited, for example, to recognizing person names or to extracting complex and thus very specific terms.

Input

The component expects the following mandatory annotations:

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Concept

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagNoun

Depending on the setting, the following annotations are also used:

  • de.averbis.extraction.types.Zone

  • de.averbis.extraction.types.Stem

  • de.averbis.extraction.types.Segment

  • de.averbis.extraction.types.Lemma

Output

The component produces annotations of type:

  • de.averbis.extraction.types.Descriptor

Background

To calculate the scores of the lexical units, a co-occurrence matrix over all relevant lexical units is first built per document.

Then f(u), the so-called unit frequency, is calculated; it expresses in how many keyword candidates the lexical unit occurs. In addition, d(u), the so-called unit degree, is determined from the co-occurrence matrix.

The basic score of a lexical unit is then calculated as:

s(u) = d(u) / f(u)

This basic score is additionally weighted with the tf value of the keyword candidate, which expresses how often the keyword candidate appears in the current document.

This procedure is an extension and modification of the approach described in the RAKE paper.
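The following Java sketch illustrates this scoring under the definitions above; the candidate lists are made up, and SUM is used as the unit-score combination (cf. the scoreCombinationType parameter below):

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceScoreSketch {

    public static void main(String[] args) {
        // Keyword candidates as lists of lexical units (tokens, stems, segments or lemmata).
        List<List<String>> candidates = List.of(
                List.of("deep", "vein", "thrombosis"),
                List.of("vein"),
                List.of("deep", "vein"));

        Map<String, Integer> freq = new HashMap<>();   // f(u): in how many candidates u occurs
        Map<String, Integer> degree = new HashMap<>(); // d(u): co-occurrences within candidates
        for (List<String> cand : candidates) {
            for (String u : cand) {
                freq.merge(u, 1, Integer::sum);
                degree.merge(u, cand.size(), Integer::sum);
            }
        }

        // Unit score s(u) = d(u) / f(u); candidate score = combined unit scores (here: SUM),
        // additionally weighted with the candidate's tf value in the document.
        for (List<String> cand : candidates) {
            double tf = Collections.frequency(candidates, cand);
            double score = tf * cand.stream()
                    .mapToDouble(u -> (double) degree.get(u) / freq.get(u))
                    .sum();
            System.out.println(cand + " -> " + score);
        }
    }
}

Multi-unit candidates such as "deep vein thrombosis" score higher than single frequent units, which matches the preference described above.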

Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.CooccurrenceDescriptorAnnotator


Table 139: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
unitType | String | false | true | The complete long name of the type of a unit. Default: de.averbis.extraction.types.Segment
scoreCombinationType | String | false | true | Combination type used for scores: SUM, AVG, MAX. Default: MAX
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

DefaultDescriptorAnnotator

General

The "default" approach to uncontrolled keywording is mainly based on the tf-idf value of the keyword. Optionally, the position of the keyword in the text can be included in the weighting formula.

Input

The following annotations are mandatory for this component

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Concept

Depending on the setting, these other annotations are also used

  • de.averbis.extraction.types.Zone

Output

The component creates annotations of type:

  • de.averbis.extraction.types.Descriptor

Background

The tf value of a keyword is the frequency of this keyword in the current document. The tf value is normalized using the so-called "augmented tf-score" approach: the tf value of a keyword is normalized with the maximum tf value of the current document according to the formula:

tf_i_norm = 0.5 + 0.5 * tf_i / tf_max

The idf value of a keyword can also be used if required, and it can likewise be normalized; in this case, the maximum idf value of the IDF dictionary in use serves as the normalization factor.

The position weight is defined via the relative number of sentences that precede the first occurrence of a keyword candidate. If the keyword occurs for the first time in the first sentence, the weight is 1.
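
A compact sketch of the resulting weighting (illustrative only; the position weight follows one plausible reading of the definition above, and all inputs are hypothetical):

// Illustrative sketch of the tf-idf based scoring described above.
public class DefaultScoringSketch {

    /**
     * @param tf            frequency of the keyword in the current document
     * @param tfMax         maximum keyword frequency in the current document
     * @param idf           idf value of the keyword (e.g., from an IDF dictionary)
     * @param sentenceIndex 0-based index of the sentence with the first occurrence
     * @param sentenceCount number of sentences in the document
     */
    public static double score(int tf, int tfMax, double idf,
                               int sentenceIndex, int sentenceCount) {
        // Augmented tf normalization: tf_i_norm = 0.5 + 0.5 * tf_i / tf_max
        double tfNorm = 0.5 + 0.5 * (double) tf / tfMax;
        // Position weight: 1.0 if the keyword first occurs in the first sentence,
        // decreasing with the relative number of preceding sentences (one plausible reading).
        double positionWeight = 1.0 - (double) sentenceIndex / sentenceCount;
        return tfNorm * idf * positionWeight;
    }
}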

Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.DefaultDescriptorAnnotator


Table 140: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
positionBoost | Boolean | false | true | Option to boost by position. Default: true
idfBoost | Boolean | false | true | Option to boost by idf. Default: false
termFrequencyBoost | Boolean | false | true | Option to boost by term frequency. Default: true
idfDictionary | String | false | false | IDF dictionary file for idf boost.
resourceSpecificSubdirectory | String | false | false | Resource specific subdirectory against which all relative paths are resolved. Default: idfdictionary
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

TextrankDescriptorAnnotator

General

Extracts keywords based on the TextRank procedure. The text is represented internally as a graph determining the text coherence. The TextRank-based process is completely unsupervised and can therefore be used independently of a given document collection. It does not require any models or other resources.

Keyword extraction methods based on domain knowledge, such as an IDF dictionary, may produce better results under certain circumstances. In many cases, however, the domain in question is not known exactly in advance, so no suitable IDF dictionary can be created. In such cases, it is advisable to use the TextRank procedure.

Input

The following annotations are mandatory for this component

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.Concept

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagNoun

Depending on the setting, these other annotations are also used

  • de.averbis.extraction.types.Zone

  • de.averbis.extraction.types.Stem

  • de.averbis.extraction.types.Segment

  • de.averbis.extraction.types.Lemma

Output

The component produces annotations of type:

  • de.averbis.extraction.types.Descriptor

Background

Based on the TextRank algorithm (Mihalcea & Tarau, EMNLP 2004: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf). Lexical units (e.g., tokens, stems, segments, or lemmata) are assigned a score using the TextRank graph. These basic values are then used to calculate the scores of all concept annotations. Various combination options are supported: average, maximum value, and sum. In our experiments, the maximum value method usually produces the best results.

The TextRank graph contains the respective lexical units as nodes. The edges between these nodes represent a connection between these units in the text in terms of adjacency. The parameter windowSize defines the size of the window within which adjacent lexical units are considered.

In the optimization phase, the weights of the nodes are calculated based on the graph created for the respective text. Nodes that have many edges to neighboring nodes are potentially weighted higher.

The procedure allows an inherent normalization of the node weight. If the weights are normalized, they represent the probability of the "random surfer model", i.e. the probability of accidentally encountering the respective lexical unit in a text. Thus, the normalized scores of all nodes represent a probability distribution.
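
The following sketch illustrates the node-weight calculation (not the product implementation; the damping factor 0.85 is the usual TextRank value and an assumption here), including the maxIterations and convergenceThreshold parameters described in the configuration below:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative TextRank node weighting over a unit co-occurrence graph.
public class TextrankSketch {

    /** Ranks the lexical units of a text given as a sequence of units. */
    public static Map<String, Double> rank(List<String> units, int windowSize,
                                           int maxIterations, double convergenceThreshold) {
        // Link units that co-occur within the configured window (undirected graph).
        Map<String, Set<String>> neighbors = new HashMap<>();
        for (int i = 0; i < units.size(); i++) {
            for (int j = i + 1; j < Math.min(i + windowSize, units.size()); j++) {
                neighbors.computeIfAbsent(units.get(i), k -> new HashSet<>()).add(units.get(j));
                neighbors.computeIfAbsent(units.get(j), k -> new HashSet<>()).add(units.get(i));
            }
        }
        double d = 0.85; // usual TextRank damping factor (an assumption here)
        Map<String, Double> score = new HashMap<>();
        for (String u : neighbors.keySet()) {
            score.put(u, 1.0);
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            Map<String, Double> next = new HashMap<>();
            double maxDelta = 0;
            for (String u : neighbors.keySet()) {
                double sum = 0;
                for (String v : neighbors.get(u)) {
                    sum += score.get(v) / neighbors.get(v).size();
                }
                double s = (1 - d) + d * sum; // standard TextRank update
                next.put(u, s);
                maxDelta = Math.max(maxDelta, Math.abs(s - score.get(u)));
            }
            score = next;
            if (maxDelta < convergenceThreshold) {
                break; // convergence threshold reached
            }
        }
        return score;
    }
}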

The following figure shows a TextRank graph for a document about the horse meat scandal in spring 2013 (source: SPIEGEL Online). Segments were used as lexical units. The darker the color of a node, the higher the unit score. You can easily see that "meat", "horse", and "product" are central aspects.


textrank


Figure 66: TextRank graph for document on horse meat scandal 2013 (German)


Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.TextrankDescriptorAnnotator


Table 141: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
windowSize | Integer | false | true | The window size. Default: 3
unitType | String | false | true | The complete long name of the type of a unit. Default: de.averbis.extraction.types.Segment
scoreCombinationType | String | false | true | Combination type used for scores: SUM, AVG, MAX. Default: MAX
maxIterations | Integer | false | true | An internal parameter specifying the maximum number of iterations, if supported by the algorithm. Default: 100
convergenceThreshold | Float | false | true | An internal parameter specifying the threshold for convergence, if supported by the algorithm. Default: 1.0E-4
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

CooccurrenceKeywordAnnotator

General

Extracts keywords based on the co-occurrence of individual lexical units. Lexical units can be tokens, stems, segments, or lemmata. Scores are first calculated for the individual lexical units within a keyword candidate and are then combined into a total score for the respective keyword candidate.

Units that frequently occur together with other lexical units in a keyword candidate are weighted higher than units that mostly occur alone in keyword candidates. As a result, this procedure tends to prefer keyword candidates consisting of several lexical units. It is therefore well suited, for example, for recognizing personal names or for extracting complex and thus very specific terms.

Input

The following annotations are mandatory for this component

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.ChunkNP

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagNoun

Depending on the setting, these other annotations are also used

  • de.averbis.extraction.types.Zone

  • de.averbis.extraction.types.Stem

  • de.averbis.extraction.types.Segment

  • de.averbis.extraction.types.Lemma

Output

The component produces annotations of type:

  • de.averbis.extraction.types.Keyword

Background

See the description of the analogous descriptor component CooccurrenceDescriptorAnnotator.

Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.CooccurrenceKeywordAnnotator


Table 142: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
unitType | String | false | true | The complete long name of the type of a unit. Default: de.averbis.extraction.types.Segment
scoreCombinationType | String | false | true | Combination type used for scores: SUM, AVG, MAX. Default: MAX
maxNumberHeadTokens | Integer | false | true | Maximum number of head tokens. Default: 1
fuzzyClustering | Boolean | false | true | Option for fuzzy clustering. Default: false
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

DefaultKeywordAnnotator

General

The "default" approach to uncontrolled keywording is mainly based on the tf-idf value of the keyword. Optionally, the position of the keyword in the text can be included in the weighting formula.

Input

The following annotations are mandatory for this component

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.ChunkNP

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagNoun

Depending on the setting, these other annotations are also used

  • de.averbis.extraction.types.Zone

Output

The component produces annotations of type:

  • de.averbis.extraction.types.Keyword

Background

See the description of the analogous descriptor component DefaultDescriptorAnnotator.

Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.DefaultKeywordAnnotator


Table 143: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
positionBoost | Boolean | false | true | Option to boost by position. Default: true
idfBoost | Boolean | false | true | Option to boost by idf. Default: false
termFrequencyBoost | Boolean | false | true | Option to boost by term frequency. Default: true
idfDictionary | String | false | false | IDF dictionary file for idf boost.
resourceSpecificSubdirectory | String | false | false | Resource specific subdirectory against which all relative paths are resolved. Default: idfdictionary
maxNumberHeadTokens | Integer | false | true | Maximum number of head tokens. Default: 1
fuzzyClustering | Boolean | false | true | Option for fuzzy clustering. Default: false
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

TextrankKeywordAnnotator

General

Extracts keywords based on the TextRank procedure. The text is represented internally as a graph determining the text coherence. The TextRank-based process is completely unsupervised and can therefore be used independently of a given document collection. It does not require any models or other resources.

Keyword extraction methods based on domain knowledge, such as an IDF dictionary, may produce better results under certain circumstances. In many cases, however, the domain in question is not known exactly in advance, so no suitable IDF dictionary can be created. In such cases, it is advisable to use the TextRank procedure.

Input

The following annotations are mandatory for this component

  • de.averbis.extraction.types.Token

  • de.averbis.extraction.types.Sentence

  • de.averbis.extraction.types.ChunkNP

  • de.averbis.extraction.types.POSTagAdj

  • de.averbis.extraction.types.POSTagNoun

Depending on the setting, these other annotations are also used

  • de.averbis.extraction.types.Zone

  • de.averbis.extraction.types.Stem

  • de.averbis.extraction.types.Segment

  • de.averbis.extraction.types.Lemma

Output

The component produces annotations of type:

  • de.averbis.extraction.types.Keyword

Background

See the description of the analogous descriptor component TextrankDescriptorAnnotator.

Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.TextrankKeywordAnnotator


Table 144: Configuration Parameters

Name | Type | MultiValued | Mandatory | Description
windowSize | Integer | false | true | The window size. Default: 3
unitType | String | false | true | The complete long name of the type of a unit. Default: de.averbis.extraction.types.Segment
scoreCombinationType | String | false | true | Combination type used for scores: SUM, AVG, MAX. Default: MAX
maxIterations | Integer | false | true | An internal parameter specifying the maximum number of iterations, if supported by the algorithm. Default: 10
convergenceThreshold | Float | false | true | An internal parameter specifying the threshold for convergence, if supported by the algorithm. Default: 0.005
maxNumberHeadTokens | Integer | false | true | Maximum number of head tokens. Default: 1
fuzzyClustering | Boolean | false | true | Option for fuzzy clustering. Default: false
topN | Integer | false | true | Maximum number of annotations to produce per CAS. Default: 10
minScore | Float | false | true | Minimum score of annotations. Default: 0.0
normalizeScore | Boolean | false | true | Option to normalize the score. Default: true
conceptConfidenceBoost | Boolean | false | true | Option to boost concepts. Default: false
allowedZones | String | true | false | Name of zone labels: if set, only concepts of these zones will be considered.
zoneBoost | Boolean | false | true | Option to boost zones. Default: false

Maven Coordinates:

        
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
        
      

