
# Averbis Information Discovery: User Manual

Version 5.12, 04/05/2019

## Overview

Averbis Information Discovery is a leading text analytics and machine learning platform that lets you gain insights into your structured and unstructured data and explore important information in a highly flexible way. Averbis Information Discovery collects and analyzes all kinds of documents, such as patents, research literature, databases, websites, and other enterprise repositories.

By parsing and analyzing content and creating a searchable index, Averbis Information Discovery helps you perform text analytics across all relevant data on the internet and in your enterprise, and makes that data available for analysis and search. It allows you to explore facts and relationships across many sources that would otherwise remain hidden in unstructured data.

## Getting started

Users with administration rights can create new users and projects. When these users are logged in, they can see the "Project administration" and "User administration" areas.

In the project administration area, you first see a list with all projects that are currently available in the system.

Figure 2: Overview of created projects

• Name: name of the project. The name also functions as a link to the corresponding project. The link goes to the project’s overview page.

• Description: description of the project.

• Operations | Edit project: this allows you to modify the name and the description of the project.

• Operations | Delete project: this allows you to delete a project.

Below the table is a button that you can use to create a new project.

In the user administration area, you first see a list with all user accounts that are currently available in the system. This list can be filtered using the text box on the top left.

Figure 3: Overview of registered users.

• Lastname: the user’s last name.

• Firstname: the user’s first name.

• Email: the user’s email address.

• Blocked: if a user is temporarily blocked, a padlock icon is displayed here.

• Administrator: if the user is an administrator, a checkmark is displayed here.

• Operations | Rights: using this button you can see an overview of the rights that the user currently has. Rights cannot be edited here. Editing rights is done using the corresponding button in each project.

• Operations | Edit: in the Edit dialog, you can edit the user profile data (firstname, lastname, email address). You can also use this dialog to block a user.

• Operations | Change password: this allows you to enter a new user password.

• Operations | Delete user: this allows you to delete the user.

Below the table is a button that you can use to create a new user.

Use the 'Create new user' or 'Edit user' button to open a dialog and edit the user’s metadata.

Figure 4: Create new user.

You can also use this dialog to block the user.

Using the Change password button, you can open a dialog that allows you to enter a new password.

Figure 5: Changing the password of an existing user.

### General guidelines

When a user without global administration rights opens the application, his/her home page contains an overview of the projects assigned to this user (My projects). The project names act as links to the corresponding projects. On the project overview page, the user can find all the functions for which he/she has the relevant project rights.

After selecting a project, a page is displayed with a list of all the modules in the project. This list is also available on other pages with the project navigation menu in the upper right area.

Figure 7: Overview page of a project with buttons for opening each module.

### Language and web interface localization

The web interface is currently available in German and English. The language is recognized automatically from the browser or the system settings of your operating system and the content of the user interface is displayed in the corresponding language.

The top and left side outer navigation bars can be hidden when required. This saves space when the navigation tools are not required. To show/hide the navigation bars, click the small menu icon on the upper right edge of the application.

### Keyboard Shortcuts

To simplify working with the application, some functions are implemented with keyboard shortcuts. Press Shift + ? to display a summary of the defined shortcuts.

Figure 9: Summary of all defined keyboard shortcuts. Open with Shift + ?

### Flash messages

To provide information about the progress and outcome of processes, or to display general information, flash messages are shown in a style that is standard across all applications. The background color of a flash message depends on its category: information messages are blue, success messages green, and error messages red. Flash messages disappear automatically after a few seconds. Flash messages that display errors, however, remain visible until they are closed manually by the user.

Figure 10: Flash messages that display errors are closed by clicking the cross mark in the top right corner.

### Documentation

Complete user documentation is available that describes the functionality of each component. This documentation can be accessed directly from the help menu in the navigation bar on the left side of the web interface.

### Embedded help

In addition to the complete online help, you can find information in several places directly embedded in the interface. You can access this wherever you see a blue question mark on a white background. Move the mouse cursor over the question mark.

Figure 11: Embedded help

## Connector Management & Document Import

### Managing Standard Connectors

Connectors are used to import documents into the system. A connector monitors a specific resource (such as a file system or a database), automatically imports new documents and picks up changes, so that imported documents are kept in sync with the document source. Connectors can also be scheduled for certain times of day, for example to import and update documents only at night and reduce the system load during office hours.

Connectors can be created and administered on the connector management page. The figure below shows the connector management with the list of all connectors that have been created within the current project:

Figure 12: Overview of all connectors.

• Connector: The name of the connector.

• Type: The connector type. For example file connector or database connector.

• Active: Indicates whether the connector is active. Only active connectors import and update documents.

• Schedules: Displays the periods of time in which the connector is active. 0-24 means that the connector is active 24 hours a day.

• Statistics: The statistics show the following values:

• Documents whose URLs have been reported by the connector.

• Documents that have already been requested by the connector and whose contents have been received.

• Documents that have already been saved.

• Actions | Start connector : Starts the connector.

• Actions | Stop connector : Stops the connector.

• Actions | Reset connector : If you reset a connector, all documents from this connector are re-imported.

• Actions | Edit connector : Opens the edit connector dialog. All parameters except the connector name can be edited.

• Actions | Edit mapping : Opens the edit mapping dialog, where connector metadata fields like title and content can be mapped to document fields.

• Actions | Schedule connector : Opens the schedule dialog.

• Actions | Delete documents of connector : Deletes all documents that have been imported by the connector.

• Actions | Delete connector : Deletes the connector. All documents that have been imported by the connector will be deleted as well.

In order to create a new connector, the connector type has to be selected first. After clicking the Create connector button the connector can be configured in the create new connector dialog. Please refer to the connector specific documentation for further details.

#### File System Connector

A file system connector imports documents from file system resources. It monitors one or multiple directories (including sub-directories) and imports documents from files in these directories. The following file types are supported:

• .txt
• .pdf
• .doc/docx
• .ppt/pptx
• .xls/xlsx
• .html

There are currently two implementations: `FileConnectorType` and `AverbisFileConnectorType`. The `AverbisFileConnectorType` remembers the current position when stopping, so that it does not start from the beginning when restarting.

A file system connector can be configured using the following parameters:

• Name: Name of the connector. The name can be chosen freely and serves, for example, as the label in the connector overview. It must not contain spaces, special characters, or underscores.

• Start paths: On each line, you can specify a file system path that is taken into account by the connector. The connector traverses these directories recursively, i.e. all subdirectories are considered.

• Exclude pattern: Here you can specify patterns to exclude certain files or file types (Black List).

• Include pattern (optional): Here you can specify patterns to include certain files or file types only (White List).
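As an illustration, the include/exclude evaluation could look like the following sketch. This assumes shell-style wildcard patterns; the connector's actual pattern syntax may differ.

```python
# Sketch of how include/exclude patterns could be evaluated per file.
# Assumption: shell-style wildcards, as supported by Python's fnmatch.
from fnmatch import fnmatch

def is_accepted(path, include_patterns, exclude_patterns):
    """Return True if the file at `path` would be imported."""
    # Exclude patterns (black list) always win.
    if any(fnmatch(path, p) for p in exclude_patterns):
        return False
    # With no include patterns, everything not excluded is accepted.
    if not include_patterns:
        return True
    # Otherwise the file must match at least one include pattern (white list).
    return any(fnmatch(path, p) for p in include_patterns)

print(is_accepted("docs/report.pdf", ["*.pdf"], ["*draft*"]))        # True
print(is_accepted("docs/draft_report.pdf", ["*.pdf"], ["*draft*"]))  # False
```

Note that in this sketch the black list takes precedence over the white list, which matches the usual semantics of combined include/exclude filters.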

#### Database Connector

With a database connector, structured data can be imported via a database connection. The database connector supports JDBC compliant databases and can crawl database tables using SQL queries. Each row of the SQL query result is treated as a separate document. The database connector keeps track of changes that are made to the database tables and synchronizes these changes automatically.

In order to use the database connector, the database's JDBC driver has to be provided to the Tomcat server instance that runs the application. Please ask your system administrator to put the JDBC driver library into Tomcat's `lib` directory.

The database connector can be configured using the following parameters:

• Name: Name of the connector. The name can be chosen freely and serves, for example, as the label in the connector overview. It must not contain spaces, special characters, or underscores.

• JDBC Driver Classname: Fully qualified class name of the database JDBC driver, e.g. `com.mysql.jdbc.Driver`

• JDBC Connection URL: JDBC connection URL to the database, e.g. `jdbc:mysql://localhost:3306/documentDB`

• Traversal SQL Query: SQL select query. E.g. `SELECT id, title, content FROM documents`

• Primary Key Fields: Name of the column that represents the primary key and identifies a table row, e.g. `id`

The database connector's default field mapping concatenates all queried columns (such as id, title and content) and maps them into the document field named `content`. The field mapping can be configured in the connector field mapping dialog (see section Editing field mappings for further details). The figure below shows a custom field mapping that maps the database columns to document fields: the id column is mapped to the document_name field, while title and content are mapped to document fields of the same name.

Figure 13: Database connector custom field mapping.
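The row-to-document behaviour and the concatenating default mapping can be sketched as follows. The example uses an in-memory SQLite database in place of a JDBC source; the table, column and field names follow the examples above and are not the product's internal API.

```python
# Sketch of the database connector's behaviour: each row of the traversal
# query becomes one document; the default mapping concatenates all queried
# columns into `content`, while a custom mapping can route columns to
# individual document fields. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'First', 'Hello world')")

documents = []
for row in conn.execute("SELECT id, title, content FROM documents"):
    documents.append({
        "document_name": str(row[0]),              # custom mapping: id -> document_name
        "content": " ".join(str(v) for v in row),  # default mapping: concatenate all columns
    })

print(documents[0]["content"])  # "1 First Hello world"
```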

#### Editing field mappings

Connectors read different sources to extract structured data from them. The extracted data is then written to fields of a Solr core. Field mappings define which information from the original documents is written to which fields of the Solr index.

Specific default mappings can be specified for each index and connector throughout the system. These are automatically taken into account when a new connector is created.

When editing the field mappings, select a connector field on the left. On the right, select the core field in which you want the connector to write this data. All core fields that have been activated in the Solr schema configuration and are writable are available here. In addition to editing the default mappings, you can also specify further mappings or remove existing ones.

You can also specify an order for the mappings. This order is relevant when multiple connector fields are mapped to the same core field. If the core field can contain more than one value, the values are stored in the field in the order specified here. If the core field can contain only one value, it receives the value from the mapping that is lowest in the sequence.

After you have edited a field mapping, you must reset the connector so that the changes to the mapping are taken into account.

Figure 14: Editing field mappings.

There are currently three different mapping types:

• Copy Mapping: The default type. The connector field is mapped 1:1 to the specified document field.

• Constant Mapping: Instead of a connector field, a constant value can be mapped to a document field.

• Split Mapping: The value of a connector field is divided into several values by a separator character that you specify. This can be used to convert comma-separated lists into multi-valued document fields.
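A minimal sketch of the three mapping types; the mapping representation and function names are hypothetical, not the product's internal API.

```python
# Illustrative implementation of the three field mapping types.
def apply_mapping(mapping, source_record):
    kind = mapping["type"]
    if kind == "copy":
        # Copy Mapping: connector field is mapped 1:1 to the document field.
        return source_record[mapping["from"]]
    if kind == "constant":
        # Constant Mapping: a fixed value, independent of the source record.
        return mapping["value"]
    if kind == "split":
        # Split Mapping: one value becomes a multi-valued document field.
        return source_record[mapping["from"]].split(mapping["separator"])
    raise ValueError(f"unknown mapping type: {kind}")

record = {"title": "A doc", "keywords": "solr,uima,nlp"}
print(apply_mapping({"type": "copy", "from": "title"}, record))       # "A doc"
print(apply_mapping({"type": "split", "from": "keywords",
                     "separator": ","}, record))                      # ["solr", "uima", "nlp"]
```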

### Document Import

In addition to defining connectors that can monitor and search different document sources, it is also possible to import pre-structured data into a search engine index. Unlike connectors, this data is imported only once, i.e. no subsequent synchronization takes place.

#### Manage document imports

Any number of document sets can be imported in the application and deleted if necessary. For each set of imported documents, known as import batches, you see a row in the overview table. In addition to the name of the import batch, you can also see how many documents are part of the batch. The status indicates whether the import is still running, whether it was successful, or whether it has failed.

Figure 15: Overview of all previously imported document batches.

Below the overview table you will find the form elements to import a new document set. To do this, enter a name and click the Browse button. A window opens in which the local file system is displayed.

You can import single files as well as ZIP archives with several files. Make sure that there are no (hidden) subdirectories in such a ZIP file and that the files have the correct file extensions.

These import formats are currently available:

Text Importer

Text importers can be used to import any plain text files. The complete content of the file is imported into a single field. The file name is available later as metadata.

CAS Importer

Allows the import of serialized UIMA CAS files (currently as XMI). This means that documents can, for example, be imported as gold standards.

Please note that the type system of this CAS has to be compatible with the type system of the application.

Solr XML Importer

A simple XML format that allows the import of pre-structured data. During the import, the fields defined in the XML are written to fields of the same name in the search index. Please make sure that the field names in the XML file correspond to the field names of the search index associated with your project.

Images that can be imported with the documents and displayed together with them are a special feature. To upload images, you have to pack the XML document(s) together with the images into a ZIP archive. Each document can then contain any number of image_reference fields. Relative paths to the images are expected; the images can be stored in any subfolder within the ZIP archive. Supported image formats are .gif, .png, .jpg and .tif.

```
...
<field name="image_reference">images/image.png</field>
<field name="image_reference">./images/pics/picture.png</field>
...
```

An example of the supported import format is shown below

```
<?xml version='1.0' encoding='UTF-8'?>
<!--Averbis Solr Import file generated from: medline15n0771.xml.gz-->
<update>
<doc>
<field name="id">24552733</field>
<field name="title">Treatment of sulfate-rich and low pH wastewater by sulfate reducing bacteria with iron shavings in a laboratory.</field>
<field name="content">Sulfate-rich wastewater is an indirect threat to the environment especially at low pH. Sulfate reducing bacteria (SRB) could use sulfate as the terminal electron acceptor for the degradation of organic compounds and hydrogen transferring SO(4)(2-) to H2S. However their acute sensitivity to acidity leads to a greatest limitation of SRB applied in such wastewater treatment. With the addition of iron shavings SRB could adapt to such an acidic environment, and 57.97, 55.05 and 14.35% of SO(4)(2-) was reduced at pH 5, pH 4 and pH 3, respectively. Nevertheless it would be inhibited in too acidic an environment. The behavior of SRB after inoculation in acidic synthetic wastewater with and without iron shavings is presented, and some glutinous substances were generated in the experiments at pH 4 with SRB culture and iron shavings.</field>
<field name="tag">Hydrogen-Ion Concentration; Iron; Oxidation-Reduction; Sulfur-Reducing Bacteria; Waste Water; Water Purification</field>
<field name="author">Liu X, Gong W, Liu L</field>
<field name="descriptor">Evaluation Studies; Journal Article; Research Support, Non-U.S. Gov't</field>
</doc>
<doc>
<field name="id">24552734</field>
<field name="title">Environmental isotopic and hydrochemical characteristics of groundwater from the Sandspruit Catchment, Berg River Basin, South Africa.</field>
<field name="content">The Sandspruit catchment (a tributary of the Berg River) represents a drainage system, whereby saline groundwater with total dissolved solids (TDS) up to 10,870 mg/l, and electrical conductivity (EC) up to 2,140 mS/m has been documented. The catchment belongs to the winter rainfall region with precipitation seldom exceeding 400 mm/yr, as such, groundwater recharge occurs predominantly from May to August. Recharge estimation using the catchment water-balance method, chloride mass balance method, and qualified guesses produced recharge rates between 8 and 70 mm/yr. To understand the origin, occurrence and dynamics of the saline groundwater, a coupled analysis of major ion hydrochemistry and environmental isotopes (d(18)O, d(2)H and (3)H) data supported by conventional hydrogeological information has been undertaken. These spatial and multi-temporal hydrochemical and environmental isotope data provided insight into the origin, mechanisms and spatial evolution of the groundwater salinity. These data also illustrate that the saline groundwater within the catchment can be attributed to the combined effects of evaporation, salt dissolution, and groundwater mixing. The salinity of the groundwater tends to vary seasonally and evolves in the direction of groundwater flow. The stable isotope signatures further indicate two possible mechanisms of recharge; namely, (1) a slow diffuse type modern recharge through a relatively low permeability material as explained by heavy isotope signal and (2) a relatively quick recharge prior to evaporation from a distant high altitude source as explained by the relatively depleted isotopic signal and sub-modern to old tritium values. </field>
<field name="tag">Groundwater; Isotopes; Rivers; Salinity; South Africa; Water Movements</field>
<field name="author">Naicker S, Demlie M</field>
<field name="descriptor">Journal Article; Research Support, Non-U.S. Gov't</field>
</doc>
</update>
```
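If you generate import files programmatically, this format can be produced with standard XML tooling. The sketch below uses Python's standard library; the field names are placeholders and must match your index schema.

```python
# Build a minimal Solr XML import file (<update>/<doc>/<field> structure)
# using only the standard library.
import xml.etree.ElementTree as ET

def build_update(docs):
    """docs: list of documents, each a list of (field_name, value) pairs."""
    update = ET.Element("update")
    for doc_fields in docs:
        doc = ET.SubElement(update, "doc")
        for name, value in doc_fields:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(update, encoding="unicode")

xml_out = build_update([[("id", "1"), ("title", "Example document")]])
print(xml_out)
# <update><doc><field name="id">1</field><field name="title">Example document</field></doc></update>
```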

## Text Analysis

Text analysis is one of the core components of Averbis Information Discovery. This chapter describes how text analysis pipelines are created, configured, distributed to remote systems, and monitored. It also describes what options Averbis Information Discovery provides for evaluating and optimizing text analysis results.

### Pipeline Configuration

The text analysis components and pipelines used in the application can be graphically administered and monitored in a centralized way. This is done in the Pipeline configuration module.

Figure 16: Link for opening the graphical configuration of text analysis components.

The overview page lists all the text analysis pipelines available in the project. The following information and operations are provided in the table.

• "Pipeline Name": name of the pipeline.

• "Status": status of the pipeline: STOPPED, STARTING or STARTED. A started pipeline reserves system resources; only a started pipeline accepts analysis requests.

• "Preconfigured": indicates whether the pipeline is a preconfigured pipeline. These pipelines cannot be edited.

• "Throughput": here, two indicators for the pipeline throughput are given: the total number of processed texts, and the average number of processed texts per second. The statistics are reinitialized each time the pipeline stops/starts.

• "Operations | Initialize pipeline" : this is used to initialize a pipeline. As soon as it has been initialized, it can process texts.

• "Operations | Stop pipeline" : to save system resources, pipelines can also be stopped.

• "Operations | Edit pipeline" : this is used to configure a pipeline, for example to add other components to it, to remove them or to modify their configuration parameters. Pipelines can only be edited when they are stopped.

• "Operations | Update pipeline" : this is used to update the statistics (throughput) and status of the pipeline.

• "Operations | Delete pipeline" : this allows pipelines to be permanently deleted, if they are no longer needed.

Figure 17: Overview of all available text analysis pipelines in the project.

To create new pipelines, use the 'Create pipeline' button below the overview table.

### Pipeline details

With the pencil icon in the taskbar of the overview table, you can access the details page of the pipeline. At the top left, all components are displayed in the order in which they are used in the pipeline.

To the right of each component name, you can see the component-specific throughput data, indicating the total number of processed texts and the average number of texts per second. By clicking the relevant component, you can show all the configurable configuration parameters.

Figure 18: Detail view of an initialized pipeline.

As long as a pipeline is running, it cannot be edited. When you stop a non-preconfigured pipeline, you can reconfigure the pipeline in the details page. Buttons on the right are now displayed instead of the throughput data, which can be used to remove components from the pipeline, or to move them to another position within the pipeline. Individual configuration parameters of the components are now also editable. Other components can also be added to the pipeline from the right side.

Figure 19: Editing a pipeline.

The right-hand area with the available components is itself divided into several blocks: Preconfigured Annotators, PEAR Components and Available Annotators.

#### Preconfigured Annotators

Preconfigured annotators are annotators that Averbis has already preconfigured for a specific purpose. For example, a diagnostic annotator is nothing more than a GenericTerminologyAnnotator preconfigured with a diagnosis dictionary. Preconfigured annotators can also be made up of several components, i.e. an aggregate of several components. This can be used to present the end user components of complex interdependencies in a clear way.

#### PEAR components

PEAR components are those added by users. They can be integrated into pipelines just like the preconfigured or available annotators. More on this in the chapter Managing / Adding new text analysis components.

#### Available Annotators

The list of available annotators contains all general, i.e. not preconfigured, components detected in the application's component repository.

### Managing / Adding new text analysis components

The application allows you to add new text analysis components at runtime; there is no need to reinstall or redeploy the application. For this purpose, so-called UIMA™ PEAR (Processing Engine ARchive) components are used. PEAR is a packaging format that allows text analysis components to be shipped alongside all needed resources in a single artifact.

You will find a list of all available PEAR components in the Pipeline Configuration, where you configure your text analysis pipelines. Adding new components is done within the Textanalysis: Components module.

Figure 20: Show and import UIMA PEAR components.

### Text Analysis Processes

Any number of text analysis results can be generated and stored for all known document sources. Text analysis results can be created either automatically through pipelines or manually. This way, you can obtain different semantic views of the same document, which enables you to evaluate several views side by side.

Figure 21: Overview of all currently created text analysis tasks.

The table contains the following columns:

• "Type": indicates whether this is a manual or automatic text analysis.

• "Name": name of the process. For example Demo - anatomy

• "Status": Status of the process. It is either RUNNING or IDLE.

• "Document source": the document source to which the task refers. In parentheses after the name is the number of processed fields. For example, if two fields, content and title, are processed in a corpus of 3,000 documents, then at the end of the task 6,000 will be indicated here.

• "Pipeline": in the case of an automatic text analysis, the pipeline that was used for the text analysis is indicated here.

• Delete: deletes the whole process and all its results.

When you create a new task, you can select whether it is a manual or an automatic text analysis.

Figure 22: Creating a new text analysis task: manual or automatic text analysis.

If you choose automatic text analysis, you are requested to give your text mining process a name and to specify the document source and the pipeline to use.

Figure 23: Creating a new automated text analysis process: Give your process a name and enter the document source and the pipeline you want to use.

### Annotation Editor: Viewing and Editing Annotations

To be able to make a judgment about text analysis components, it is frequently essential to have the results displayed graphically. You may also want to correct text analysis results manually or annotate documents completely manually, for example to create gold standards, which are then used to evaluate text analysis components. For all these purposes, the Annotation Editor can be used.

#### Viewing annotations inside a document source

The Annotation Editor can be used to display text analysis results graphically. Using the annotation editor, all documents from a document source can be easily viewed, section by section, and all annotations can be graphically highlighted.

In Annotation Editor, you first select a document source (1). If document names have been given to the documents in the source, the name of the first document in the source is displayed (2). You then select the text analysis process that you wish to view (3).

Once you have selected the source and the text analysis, the first document in the corpus is displayed. The document is displayed section by section. There is a checkbox above the text of each available annotation to enable the content of the annotation to be graphically highlighted (4). Using the right-hand checkbox (5), you can highlight all annotations at once, or reset the highlighting of all annotations.

In the main window (6), you can see the corresponding section of the document with the currently activated highlights. Below the main window, there are buttons for navigating through the individual sections of a document (7). Above it there are similar buttons, which you can use to navigate between the individual documents in a source (8).

Figure 24: Displaying the annotations in the documents of a document source.

A table with a list of all the currently highlighted annotations can be displayed on the right of the main window.

Figure 25: Overview table of annotations.

To provide a better connection between the table and the graphical highlighting in the text, annotations from the table can be given special emphasis in the text. To do this, you set the checkbox in front of the name of the related annotations. This allows the corresponding annotations to be displayed in bold and large font, in addition to the colored highlighting.

Figure 26: Especially emphasizing individual annotations.

The overview table is also used to view the individual attributes of the annotation. By expanding the annotation in the table, you can obtain a list of all the annotation’s attributes.

Figure 27: Show annotations' attributes.

#### Configuring section sizes

As described above, the documents are displayed section by section. By default, 5 sentences are displayed on each page. This setting can be configured in the interface by clicking on the wheel at the right top.

In principle, you can combine character-based sectioning with annotation-based sectioning. While character-based sectioning is the default, annotation-based sectioning has the advantage that you do not miss annotations that cross section boundaries. When combining both, the sections are always shown with a slight overlap: the end of section n is displayed again at the beginning of section n+1 to avoid the section being taken out of context. Furthermore, when sectioning by characters, the sectioning automatically ensures that section splits are not made in the middle of a word.
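The character-based variant can be sketched roughly as follows. The section size, overlap size and the exact boundary handling are illustrative assumptions; the product's actual algorithm is not documented here.

```python
# Sketch of character-based sectioning with a slight overlap, where the
# section cut is never made in the middle of a word. For simplicity the
# overlap region itself may start mid-word.
def split_sections(text, size=40, overlap=10):
    sections, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # Move the cut back to the previous whitespace so words stay whole.
        if end < len(text) and not text[end].isspace():
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        sections.append(text[start:end].strip())
        if end >= len(text):
            break
        # Start the next section a little before the cut (the overlap).
        start = max(end - overlap, start + 1)
    return sections

print(split_sections("the quick brown fox jumps over the lazy dog",
                     size=20, overlap=5)[0])  # "the quick brown fox"
```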

Any change to the section size in the graphical configuration is applied immediately after closing the window. Using the reset button, you can restore the configured default values.

Figure 28: Annotation Editor settings window.

#### Manually editing, adding and deleting annotations

The annotation editor can also be used to add annotations manually or to edit them. Using the button on the right, you can switch to edit mode.

In edit mode, a button appears above the main window for each activated annotation type (2). After you select the type, you can create annotations of this type in the text. To create annotations of this type, simply highlight an area of text in the main window using the mouse. A quick way of adding an annotation is to simply click a word. An annotation of the corresponding type is then created for the whole word.

Edit mode also allows you to delete existing annotations. To do this, click the cross mark in the overview table of annotations on the right.

After you have made changes to the document, these can be saved or discarded by clicking the buttons (3).

Figure 29: Editing Annotations.

In edit mode, you can also edit attributes of an annotation (only for annotations which are configured by Averbis as editable).

Figure 30: Editing the attributes of an annotation.

#### Displayed and editable annotation types, attributes and colours

Currently, the user cannot configure which annotation types and attributes are visible in the annotation editor, which colors are assigned to these annotation types, and which attributes are editable. This is currently preset by Averbis.

### Text Analysis Evaluation

The results of various text analysis tasks can be evaluated against each other, e.g., to compare a text mining process against gold standards.

To do this, first choose the document source (1) which serves as the basis of the evaluation. Then choose the reference view (2) in the left part of the window and, on the right side (3), the text analysis process that you wish to evaluate.

Once you have chosen a source and two text analysis processes, you can evaluate the results visually, one against the other, in a split view with two separate annotation editors. The display of sections in the right window is coupled to the sections in the left window. In addition to the color highlighting of the individual annotations, you can also see graphically which annotations on the two sides do not match. Besides the graphic labelling within the text, the annotations are also labelled accordingly in the tabular overview on the right side (4). Mismatches there are marked either in orange (false positives) or gray (false negatives).

Figure 31: The image shows the example of a DoseFormConcept annotation on the left that does not match on the right: TBCR.

#### "Matches" and "Partial Matches"

When evaluating, it is possible to distinguish between exact and partial matches. Annotations are marked as an exact match if their type, characterizing attributes and position in the text are identical.

To obtain an extra level between a hit and a no-hit, it is also possible to define partial matches. Annotations that are not exactly identical but still meet the configured partial-match criteria are marked accordingly in both the graphical and the table presentation. In the graphical presentation they are italicized and underlined.

Figure 32: Displaying a partial match.

#### Configuring the match criteria

The definition of what should be considered as a match, partial match and mismatch can be configured by the user in the interface.

The general rule is that two annotations are considered a match when they are of the same type and are found at exactly the same place in the document. For each annotation type you can then define which annotation attributes also have to match. For a concept, this could be the concept’s unique ID: two concepts would then be identified as a match only if this attribute is identical in both annotations.

It is also possible to configure, for each annotation type, when two annotations of this type should be considered a partial match. Here you can choose between four different options:

• "No partial matches": only exact matches are allowed.

• "Annotations must overlap": a partial match is given whenever the annotations overlap.

• "Allow fixed offset": at the beginning and end of the annotations, a configurable offset is allowed.

• "Are within the same annotation of a specific type": a partial match is found whenever the annotations are within the same larger annotation. For example, if they are inside the same sentence.
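As an illustration, the match rules above can be sketched in Python. This is a simplified model with illustrative names, not the product's implementation; the attribute comparison and partial-match mode are configured in the interface as described.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    attributes: dict = field(default_factory=dict)

def is_exact_match(a, b, match_attributes=()):
    # Exact match: same type, same text position, all configured attributes equal.
    return (a.type == b.type and a.begin == b.begin and a.end == b.end
            and all(a.attributes.get(k) == b.attributes.get(k) for k in match_attributes))

def is_partial_match(a, b, mode="overlap", offset=0):
    # Partial match according to one of the configurable criteria.
    if a.type != b.type or mode == "none":
        return False
    if mode == "overlap":           # "Annotations must overlap"
        return a.begin < b.end and b.begin < a.end
    if mode == "fixed_offset":      # "Allow fixed offset" at begin and end
        return abs(a.begin - b.begin) <= offset and abs(a.end - b.end) <= offset
    raise ValueError(f"unknown mode: {mode}")
```

The "within the same annotation of a specific type" option would additionally require a containing annotation (e.g. the sentence) to be known; it is omitted here for brevity.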

Figure 33: Graphical configuration of the match criteria.

#### Corpus evaluation

Using the Evaluate metrics button, a window can be opened, displaying the precision, recall, F1 score and standard deviation for either a single document or the whole corpus. The numbers are split by annotation type.
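The precision, recall and F1 values shown in this window follow the standard definitions; for a single annotation type they can be derived from the match counts as follows (a generic sketch, not Averbis-specific code):

```python
def evaluation_metrics(true_positives, false_positives, false_negatives):
    # Precision: share of found annotations that are correct.
    found = true_positives + false_positives
    precision = true_positives / found if found else 0.0
    # Recall: share of reference annotations that were found.
    expected = true_positives + false_negatives
    recall = true_positives / expected if expected else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```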

Figure 34: Evaluation at corpus level.

In the Settings panel, you can configure which types are to be taken into account in the corpus evaluation.

Figure 35: Selecting the annotation types to be taken into account in the corpus evaluation.

### Annotation Overview

For the quality assessment and improvement of text analysis pipelines, an aggregated overview of the assigned annotations is often helpful. For this purpose, the Annotation overview is used. You can create any number of these overviews. To do this, you first select a source and an existing text analysis process. Next, you select the annotation type to be analyzed.

After pressing the green button, the aggregation is calculated. Depending on the scope of the selected source, this may take some time. All overviews are listed in the table. As soon as an overview has been calculated, the results can be displayed via the list symbol.

Figure 36: Listing and management of the available annotation overviews.

#### Aggregation and Context

If you select an overview from the table using the list symbol, you will see an aggregated list of the annotations found for the corresponding type. By default, the list is sorted in descending order by frequency. By clicking on an annotation in the table, you can display some example text in which the annotations occur. In addition to the analysis, the overview is also suitable for directly improving the results. In this way, false positives as well as false negatives can be identified and corrected.

Currently, the attributes that appear in the list for each annotation are preconfigured by Averbis. This setting cannot yet be made graphically via the GUI.

### Text Analysis Web Service API

This section describes the Web Service API which can be used to integrate text analysis capabilities in existing third-party systems. An interface is offered via a RESTful/XML service, which is integrated in the Swagger framework. For the formal specification please refer to the official documentation.

#### Analyse Text Web Service

The Analyse Text Web Service analyses plain text and returns annotations in JSON.

`POST http(s)://HOST:PORT/APPLICATION_NAME/rest/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseText`
• URL parameter `projectName` specifies the project name that contains the pipeline.

• URL parameter `pipelineName` specifies the name of the pipeline that will be used to analyse the text.

• URL parameter `language` specifies the text language. Can be omitted if the pipeline is able to detect the text language.

• URL parameter `annotationTypes` specifies a comma separated list of annotation types that will be contained in the response. Wildcards (`*`) are supported.

• Request body parameter `text` specifies the text to be analysed.

Example Request:

`curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' -d 'Some sample text to be analysed' 'http://localhost:8080/information-discovery/rest/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseText?language=en&annotationTypes=de.averbis.types.Token%2Cde.averbis.types.Sentence'`
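For integration from code, the same request can be assembled programmatically. The following Python sketch uses only the standard library; the helper names are our own, and only the URL template and parameters documented above are assumed.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

def build_analyse_text_url(base_url, project, pipeline, language=None, annotation_types=None):
    """Build the analyseText endpoint URL from the template above."""
    url = (f"{base_url}/rest/textanalysis/projects/{quote(project)}"
           f"/pipelines/{quote(pipeline)}/analyseText")
    params = {}
    if language:
        params["language"] = language
    if annotation_types:
        params["annotationTypes"] = ",".join(annotation_types)
    return url + ("?" + urlencode(params) if params else "")

def analyse_text(base_url, project, pipeline, text, **kwargs):
    """POST plain text to the pipeline and return the parsed JSON response."""
    request = Request(
        build_analyse_text_url(base_url, project, pipeline, **kwargs),
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)

url = build_analyse_text_url(
    "http://localhost:8080/information-discovery",
    "defaultProject",
    "defaultPipeline",
    language="en",
    annotation_types=["de.averbis.types.Token", "de.averbis.types.Sentence"],
)
```

The built `url` is identical to the one used in the curl example above.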

#### Analyse HTML Web Service

The Analyse HTML Web Service analyses text contained in HTML5 and returns annotations in JSON.

`POST http(s)://HOST:PORT/APPLICATION_NAME/rest/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseHtml`
• URL parameter `projectName` specifies the project name that contains the pipeline.

• URL parameter `pipelineName` specifies the name of the pipeline that will be used to analyse the text.

• URL parameter `language` specifies the text language. Can be omitted if the pipeline is able to detect the text language.

• URL parameter `annotationTypes` specifies a comma separated list of annotation types that will be contained in the response. Wildcards (`*`) are supported.

• Request body parameter `text` specifies the HTML5 content to be analysed.

Example Request:

`curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' -d '<html><body>Some sample html5 content to be analysed</body></html>' 'http://localhost:8080/information-discovery/rest/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseHtml?language=en&annotationTypes=de.averbis.types.Sentence%2Cde.averbis.types.Token'`

#### Swagger-UI API Browser

Developers can test the functionality of the Text Analysis Web API and get an overview via the integrated Swagger-UI API browser page. In particular, sample requests can easily be generated and return values checked. The Swagger-UI API browser is available at:

`http(s)://HOST:PORT/APPLICATION_NAME/rest/swagger-ui.html`

Figure 37: Swagger-UI API Browser

## Terminologies

In this module, you can manage the lexical resources that are used within the text analysis components.

The module lists all available terminologies within the current project. You can create new terminologies and import or export their content.

When adding a new terminology, you can specify the following parameters:

• Terminology ID: a unique identifier, e.g. MeSH_2017.

• Label: a label, e.g. MeSH.

• Version: a version number, e.g. 2017.

• Concept type: the concept type used within text analysis, e.g. de.averbis.extraction.types.Concept.

• Hierarchical: if this box is unchecked, the terminology will not contain any hierarchical relations (flat list).

• Encrypted export: ConceptAnnotator dictionaries can be exported in encrypted form to avoid storing sensitive data on disk. This parameter only affects Concept Dictionary XML exports; other exports remain unencrypted.

In addition, you can specify which languages are available within the terminology.

Figure 38: Add a new terminology.

##### Available languages

Your terminology can contain terms for all languages that are selected here. There is no need to provide all languages for all terms, so there may be concepts that only have terms in a subset of these languages. Since in some situations one cross-lingual preferred term has to be computed, the system needs to decide which language to use if terms in specific languages are missing. For this, you can specify a language priority by moving a language up or down in this list. If English is at the top, followed by German, the English preferred term is displayed; if no English preferred term is available, the German one is displayed.
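The fallback logic can be pictured as a walk down the priority list (a sketch; the function name and data layout are illustrative):

```python
def preferred_term(terms_by_language, language_priority):
    # Return the preferred term in the highest-priority language available.
    for language in language_priority:
        if language in terms_by_language:
            return terms_by_language[language]
    return None

# English first, German as fallback:
preferred_term({"de": "Blinddarmentzuendung"}, ["en", "de"])  # falls back to the German term
```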

There is one special language, called `Diverse`. Terms in this language match in every language. You can use it to mark language-independent terms (e.g. Roman numerals).

#### Edit a terminology's metadata

You can edit the metadata that you specified when creating the terminology via the edit button.

#### Delete a terminology

The delete button allows you to delete a terminology when there is no active import or export running.

#### Import content

You can import content from OBO files (versions 1.2 and 1.4) into an existing terminology. For multilingual terminologies, version 1.4 must be used. Optionally, a mapping mode for each synonym can be imported, too.

The source file may be zipped to support large files.

The minimal structure of your OBO terminology looks like this:

Example of an OBO terminology

```
synonymtypedef: DEFAULT_MODE "Default Mapping Mode"
synonymtypedef: EXACT_MODE "Exact Mapping Mode"
synonymtypedef: IGNORE_MODE "Ignore Mapping Mode"

[Term]
id: 1
name: First Concept
synonym: "First Concept" DEFAULT_MODE []
synonym: "First Synonym" IGNORE_MODE []
synonym: "Second Synonym" EXACT_MODE []

[Term]
id: 2
name: First Child
is_a: 1 ! First Concept
```

To import terms with mapping modes, the OBO terminology begins with the synonym type definitions, as shown in the first three lines of the OBO terminology in the example above.

Each concept begins with the flag "[Term]", followed by an "id" and a preferred name with the flag "name". After that, you can add as many synonyms as you like with the flag "synonym", optionally followed by the desired mapping mode. Note: if you would like to define a mapping mode for your concept name (flag "name"), you have to add the term as a synonym, as shown in the example for "First Concept".

Furthermore, if your terminology contains a hierarchy, you can use "is_a" to refer to other concepts of your terminology.
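To illustrate the structure, here is a minimal Python sketch that parses such an OBO fragment. It supports only the keys discussed above and is not the product's importer:

```python
import re

OBO_SAMPLE = """\
[Term]
id: 1
name: First Concept
synonym: "First Synonym" IGNORE_MODE []

[Term]
id: 2
name: First Child
is_a: 1 ! First Concept
"""

def parse_obo(text):
    # Collect concepts with name, synonyms (term, mapping mode) and parent ids.
    concepts, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"name": None, "synonyms": [], "parents": []}
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            if key == "id":
                concepts[value] = current
            elif key == "name":
                current["name"] = value
            elif key == "synonym":
                match = re.match(r'"([^"]*)"\s*(\w+)?', value)
                if match:
                    current["synonyms"].append((match.group(1), match.group(2) or "DEFAULT_MODE"))
            elif key == "is_a":
                current["parents"].append(value.split("!")[0].strip())
    return concepts

concepts = parse_obo(OBO_SAMPLE)
```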

To import a terminology like the one shown above, proceed as follows:

1. In "Project Overview", click on "Terminology Administration".

2. Click on "Create New Terminology". Fill in the dialog as described in Add Terminology.

3. Once you have created a terminology, click the up arrow icon to the right of the terminology.

4. In the "Import Terminology" dialog, select "OBO Importer" as import format. Then select the terminology you want to import from the file system. Click on "Import".

5. By clicking on the "Refresh" button to the right of the terminology you can check the progress of the import. When the terminology has been fully imported, the status changes to "Completed".

6. To browse your terminology, switch to the "Terminology Editor" by going to the "Project Overview" page and clicking on "Terminology Editor".

Figure 39: Import content into existing terminology

After an import has started, the current status is shown in the overview.

Figure 40: Status of currently running processes.

In addition, you can see details of the latest import (including error messages).

Figure 41: Detailed information regarding the latest process.

After successful terminology import, terms, hierarchies and mapping modes can be checked in the Terminology Editor.

Figure 42: Terminology Editor showing imported terminology

#### Export content

To use a terminology within the text analysis, you need to export its content into the `Concept Dictionary XML` format.

Figure 43: Export a terminology.

After exporting a terminology into the `Concept Dictionary XML` format, you need to restart the pipeline that uses it in order to refresh its content.

### Terminology Editor

The Terminology Editor allows you to edit the content of terminologies.

#### Free text search and autosuggest

The central search bar at the top of the Terminology Editor is used for free text search across multiple terminologies. You can include or exclude terminologies from the search by checking them in the drop-down menu next to the search bar. While you enter a search term, the system suggests possible matches via autosuggest, grouped by terminology.

Figure 44: Terminology auto suggest.

In a free text search, you can use the asterisk symbol (`*`) for truncation (e.g. `Appendi*`). The results of a free text search are listed in the upper right section, grouped by terminology.
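Truncation behaves like a simple wildcard match over terms. A sketch of the idea (illustrative only, not the product's search implementation):

```python
import fnmatch

def truncation_search(terms, pattern):
    # Case-insensitive wildcard match, e.g. "Appendi*" matches "Appendicitis".
    return [term for term in terms if fnmatch.fnmatch(term.lower(), pattern.lower())]
```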

The settings menu on the top right allows you to customize some search and autosuggest settings. You can specify whether concept IDs are included in the search and define the number of hits to display.

Figure 45: Configuration of search and autosuggest.

#### Displaying concepts hierarchically

The tree view in the Terminology Editor shows a concept's position in the terminology hierarchy. Just click a concept in the list of search results.

Figure 46: Displaying concepts hierarchically.

You can configure whether the Concept ID shall be shown in the tree as well, and whether the tree view shall show the siblings of a concept along its hierarchy.

Figure 47: Tree with and without strictly focusing on the selected concept.

#### Terms

In the lower right corner of the window you see the concept’s details. The first tab shows the concept's synonyms, which you can also edit, add or delete here.

#### Mapping Mode

Every term has a so-called Mapping Mode. Mapping Modes are an efficient way of increasing the accuracy of terminology-based annotations. They allow you to ignore certain synonyms that are irrelevant or lead to false positive hits (IGNORE). Synonyms can also be restricted to EXACT matches, which is especially useful for acronyms and abbreviations (AIDS != aid).

Currently, there are three Mapping Modes:

• DEFAULT: the term is preprocessed in the same way the pipeline is configured.

• EXACT: the term is only mapped when the string matches the text exactly, without any modification by preprocessing (including case).

• IGNORE: the term is ignored; it is not used within the text analysis.
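The effect of the three modes on term lookup can be sketched like this (lowercasing stands in for the pipeline's actual, configurable preprocessing):

```python
def term_matches(term, text_span, mapping_mode="DEFAULT"):
    if mapping_mode == "IGNORE":
        return False                 # term is excluded from text analysis
    if mapping_mode == "EXACT":
        return term == text_span     # exact string match, including case
    # DEFAULT: both sides go through the same preprocessing (lowercasing as a stand-in)
    return term.lower() == text_span.lower()

term_matches("AIDS", "aids", "EXACT")  # False: protects the acronym from matching "aids"
```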

#### Relations

The second tab shows all relations known for the concept. You can use this view to add or delete relations, too. Currently, only hierarchical relations are supported. When adding a new relation, autosuggest helps you find the concept that you want to relate.

#### Mapping Mode and comment

In the third tab, you can add a comment to a concept. In addition, you can set a concept-wide Mapping Mode: terms that do not have a specific Mapping Mode inherit it from the concept.

## Document search

As soon as the Solr Admin module is used, the application has a default Solr Core. This core is displayed in the administration panel.

Averbis Information Discovery uses Solr to create a search index and to make documents searchable. Choose "Solr Core Administration" on the project overview to configure the basic settings.

#### Indexing pipeline

Documents that are imported or crawled go through a text analysis pipeline in order to add metadata to the search index.

The corresponding pipeline is selected here - a separate indexing pipeline can be used for each project.

Figure 49: Choosing the indexing pipeline.

If you choose an indexing pipeline, all documents that are imported or crawled in the future will be processed with it. If you want to use a different pipeline for processing search queries, you can set it in the Solr Core Management section.

You can also switch the indexing pipeline within a project. To avoid a heterogeneous set of metadata, all documents are re-processed.

#### Query Pipeline

Here you can select which of the available pipelines should be used for analyzing the search query. By default, the same pipeline is used here as selected for indexing the documents.

Figure 50: Initial state in which no query pipeline is selected.

Figure 51: Choose a query pipeline.

#### Solr Core Overview

A so-called "Solr Core" is available for each project, the administration of which can be accessed via the "Solr Core Management" button on the project page.

Figure 52: Key figures and information on the search index of a project.

• "Core Name": The name of the Solr instance (generated automatically)

• "Path to solrconfig.xml": This is the path to the configuration file of this Solr instance. Expert settings can be made in this configuration file. After editing this file, the Solr instance must be restarted in order for the changed settings to take effect.

• "Path to schema.xml": The index fields are configured in this configuration file. This file should only be edited manually in exceptional cases and by experts.

• "Indexed documents": Number of documents currently in the index.

• "Pending documents": Number of documents that are currently in the processing queue of the Solr instance.

After pending documents have been processed by Solr, a commit must take place before these documents are actually available in the index. Since a commit is quite resource-intensive, the number of commits is kept low: by default, a commit only takes place every 15 minutes. The processed documents therefore appear under the indexed documents with a delay.

• "Operations": At the level of the Solr core, there are three operations available:

• "Refresh" : You can update the displayed key figures by clicking on this icon.

• "Commit" : This command executes a commit on the Solr core, making documents visible in the index that were not visible beforehand. By default, this happens every 30 minutes in the background.

• "Delete all documents from the index" : With a click on this icon, all documents are deleted from the index.

#### Configuration of the search index schema

The configuration of the schema of the current search index can be reached via the module "Solr schema configuration".

##### Overview of all schema fields

Each Solr core has a schema that defines which information is stored in which kinds of fields. The Solr schema configuration lists all available fields in alphabetical order. The following information and operations are available for each field in the index:

• "Field name": Name of the field as defined in the Solr schema. This name is often chosen in such a way that it is not easily readable. If a field is a system field, that is, a field whose values must not be overwritten by the user, a small lock symbol is displayed to the right of the field name.

• "Type": The type specifies the contents of this field. In addition to an abstract description (e.g. string), the complete class name of the field is specified in parentheses.

• "Active": This button controls whether the field contains information to be displayed or used elsewhere in the application. These fields are then available, for example, to be displayed in the search result, to form facets or to be used via query builder for the formulation of complex, field-based search restrictions. Fields that are not activated can still be used by the system, but they are not available for manual configuration to the users. If a field is activated, the line is highlighted in green.

• "Label": The field name itself is often not suitable for displaying because it is not legible, and it is not localized. Therefore, you can define meaningful display names for all fields in different languages. These names are used wherever the user accesses or displays field contents. If no corresponding display name is defined for the user’s language, the illegible field name is displayed.
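The label fallback described above amounts to a simple lookup. A sketch (the field name used here is made up for illustration):

```python
def display_label(labels, field_name, user_language):
    # Use the localized display name if defined; otherwise fall back to the raw field name.
    return labels.get(field_name, {}).get(user_language, field_name)

labels = {"dyn_concepts_ss": {"en": "Concepts", "de": "Konzepte"}}
display_label(labels, "dyn_concepts_ss", "en")  # localized label
display_label(labels, "dyn_concepts_ss", "fr")  # falls back to the raw field name
```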

Figure 53: Overview of the Solr cores scheme.

##### Dynamic fields

In the overview, dynamically generated Solr fields are also displayed as soon as they have been created (that is, as soon as they have been filled with values once). As soon as the field has data, it remains permanently in the overview, even if all documents containing values in this field have been deleted in the meantime.

### Manage and use search interface

The functionality and appearance of the search interface can be influenced by configuration.

#### Configuring the display of search results

Starting from the overview page of a project, the display of search results can be configured by using the "Field Layout Configuration" module. You can specify which fields/contents of the indexed documents are to be displayed in the interface. This applies to both the fields on the results overview page and the fields on the detail page of the documents (accessible by clicking on the title information of the result). Fields that are only displayed on the overview page of the search results are highlighted in green. In addition to selecting the fields, you can also configure whether the field title should be displayed, as well. If this option is activated, the display name created in the Solr schema management for the language of the respective user is displayed.

In addition, the length of content of a particular field can be specified, as well as some style settings.

Figure 54: Configuring the display of search results.

#### Configure Facets

So-called facets provide the user with additional filter options. They are displayed on the left side of the search page. The configuration of facets can be accessed via the module "Facet Configuration" on the project overview page.

On the configuration page, you can select and configure the facet fields displayed in the user interface. When selecting a facet, you can configure whether the entries within a facet are AND- or OR-linked. In the case of AND facets, only documents that contain all the terms selected in this facet are displayed. OR facets, on the other hand, offer the option of finding documents that contain only individual terms (e.g. documents of "Category 1" OR "Category 2").
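The difference between AND- and OR-linked facets can be sketched as a filter over document field values (an illustrative model, not the Solr implementation):

```python
def facet_filter(documents, facet_field, selected_terms, link="AND"):
    selected = set(selected_terms)
    if link == "AND":
        # Only documents that carry ALL selected terms in this facet.
        return [doc for doc in documents if selected <= set(doc.get(facet_field, []))]
    # OR: documents that carry at least one of the selected terms.
    return [doc for doc in documents if selected & set(doc.get(facet_field, []))]

docs = [{"category": ["Category 1"]},
        {"category": ["Category 1", "Category 2"]},
        {"category": ["Category 2"]}]
```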

In addition, you can configure how many entries are to be displayed within each facet. The order of the facets can be determined with the arrows; the display order in the search interface corresponds to the order in the administration panel. The display name of a facet is selected according to the labels assigned in the Solr schema configuration (see above).

Figure 55: Configure Facets.

#### Configuring auto-completion

Settings for automatic completion of search terms can be made via the "Autosuggest" module, which you access on the project overview page. There are various methods by which the system can suggest meaningful completions of the user's search input. Currently, four methods are available to choose from, and they can be freely combined as needed.

The proposals are grouped by their mode in the search interface. The order of the groups corresponds to the order in which the modes are listed here (if more than one mode is used). Use the arrow keys to change the order.

In addition to the number of proposals per group, you can also specify a description for each group, which is displayed in the search interface above the respective proposal block.

Changes will take effect immediately after saving for all users of the search.

If one of the two concept-based methods is used, an additional field appears where you select which Solr field is to be used for the lookup. All fields that are recognized as concept-based fields are available for selection.

Figure 56: Configuring auto-completion.

The methods are characterized as follows:

"Prefixed Facet Mode"

• The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

• The suggestions are intended to complete the term currently entered; no additional term is proposed (no multiple-word suggestions).

• The current search restrictions (e.g. via facets) are taken into account in the proposals. Therefore, only terms for which there are hits under all active search restrictions are suggested.

• The proposals are not based on the order of the terms in the documents. If you enter a search query that consists of several partial words, the proposed word does not have to appear directly after the preceding terms in the search query.

"Shingled Prefixed Facet Mode"

• The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

• Unlike the simple prefixed facet mode, suggestions can consist of several words. In addition to completing the term currently entered, the system also suggests terms that frequently occur directly next to or close to this term in the documents. Entering Appen in this mode could therefore lead to suggestions such as treating appendicitis.

• The current search restrictions (e.g. via facets) are taken into account in the proposals. Therefore, only terms for which there are hits under all active search restrictions are suggested.

• If the query consists of several words, the suggestions are based on the last of these words; all terms before this last word are used as filters. The entry Hospital Appendi could therefore also lead to the suggestion Hospital Treat Appendicitis, even if Treat Appendicitis is not in the immediate vicinity of Hospital in the text.

Concept Mode with guaranteed hits (concepts_hit)

• The suggestions for completing the search query are taken from synonyms of the stored terminology.

• Proposals show the wording of the synonym and the title of the terminology as well as the preferred name of the concept in the user’s language.

• If you select a proposal (synonym), a search with the associated concept is executed.

• Documents that contain the exact synonym text are given a higher weighting and are displayed higher in the results list than documents that can only be found via another synonym.

• Only proposals that guarantee at least one hit are displayed.

Concept Mode without guaranteed hits (concepts_all).

This mode differs from the conventional concept mode in that proposals are also displayed that do not lead to a hit. All terms from the stored terminology are displayed.
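A suggestion in the prefixed facet mode described above can be pictured as a frequency-ranked prefix lookup over indexed terms (a simplified model; the actual implementation also applies the active search restrictions):

```python
def prefix_suggestions(term_counts, prefix, limit=10):
    # Terms from the index that start with the typed prefix, most frequent first.
    matches = [(term, count) for term, count in term_counts.items()
               if term.startswith(prefix.lower())]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in matches[:limit]]

index_terms = {"appendicitis": 42, "appendix": 17, "appendectomy": 5, "arm": 99}
```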

#### Search restrictions

Switch to the "Search" module of the project to get to the search page of the application. All search terms entered remain visible to the user at all times, so you can easily see which search terms have led to the currently presented result set. The current search restrictions are listed next to each other on the left side of the search bar, highlighted in the same color as the corresponding highlighting in the text. If a restriction originates from a facet, the name of the facet is listed before the search term (see screenshot below).

If there are too many search restrictions to be displayed in the search bar, they are shown in a collapsible pop-up menu on the left of the search bar. The small cross symbol next to each search restriction removes that restriction and updates the search results accordingly. With the cross button to the right of the search bar you can remove all current search restrictions at once.

Figure 57: Display of the current search restriction.

#### Faceted search

Facets represent one of the core functionalities of the search. With the help of the facets, the search results can be quickly limited to relevant results. In the admin panel you can configure for which categories facets should be displayed.

Within the facets, the most frequent terms from the respective category that are contained in the indexed documents appear. The number after each facet entry indicates how many documents in the index (or the current search result set) match the corresponding term.

Clicking a facet entry limits the search result accordingly. Different terms can be combined, even across facets, which allows a high degree of flexibility in restricting the search results.

Figure 58: Concept facet with selected restriction to 'Diagnosis'.

By default, all selected facet entries are AND-linked. This means that only documents matching all selected criteria are listed. The currently selected filters are highlighted in orange. The restriction can be removed by clicking on the faceted entry again.

OR-linked facets, by contrast, yield result sets in which at least one of the selected criteria appears, even if only one or a few of the selected terms match. In the case of these OR-linked facets, a checkbox is displayed in front of each entry.

#### Querybuilder / Expert Search

The query builder provides a convenient mechanism for creating complex search queries. It allows you to combine different criteria into a query using any fields from the index.

The Querybuilder can be opened using the magic wand icon in the search bar.

Figure 59: The magic wand on the right of the search bar opens the query builder.

The input mask allows you to add search restrictions on all activated schema fields. Depending on the type of the selected schema field, different comparison operators are available. Text fields offer the operators `contains` and `contains not`; any text can be entered as a restricting value, and the asterisk `*` serves as a wildcard.

Date fields offer the comparison operators `>=` and `<=`. Numerical fields offer the comparison operators `=`, `<>`, `>=` and `<=`. By combining two date or number conditions, the search can also be restricted to periods or ranges.

Figure 60: Input mask of the query builder

Concept-based fields allow the operators `contains` and `contains not` like text fields.
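Internally, such a query can be thought of as a boolean combination of field conditions. The sketch below assumes Solr query syntax for the generated string, which the manual does not specify, so treat it purely as an illustration:

```python
def build_query(conditions, operator="AND"):
    # conditions: (field, comparator, value) triples as offered by the input mask.
    parts = []
    for field_name, comparator, value in conditions:
        if comparator == "contains":
            parts.append(f"{field_name}:{value}")
        elif comparator == "contains not":
            parts.append(f"-{field_name}:{value}")
        elif comparator == ">=":
            parts.append(f"{field_name}:[{value} TO *]")
        elif comparator == "<=":
            parts.append(f"{field_name}:[* TO {value}]")
        else:
            raise ValueError(f"unsupported comparator: {comparator}")
    return f" {operator} ".join(parts)
```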

Any number of conditions can be added and linked with the boolean operators AND and OR. The criteria can also be grouped to create arbitrary logical combinations. In addition to the graphical display, the logical expression that results from the current combination of search restrictions is shown in the upper area of the query builder. Once the complex search query has been created, it can be activated using the Apply button, and the search results are calculated accordingly. The magic wand icon in the search bar then turns orange to indicate that a complex search restriction is active. By clicking this button, the search query can be reloaded and edited until the result matches your expectations.

The query created using the query builder is applied in addition to any other search restrictions, such as free text searches or facet restrictions.

#### Document details and original document

The title field of a document serves as a link to a detail page containing additional information about the document (see "Solr Schema Configuration" module on the project overview page).

In addition to the detailed view, you can also download the underlying original documents (e.g. PDF, office document etc.) if they are available. You can recognize this by a small icon on the right of the document title. The symbol differs depending on the document category. Clicking on the file icon starts the download of the original document.

### Export search results

Documents in the system can be exported - both individual documents and complete search result sets.

#### Selection of documents to be exported

If the user has the necessary permissions to export documents, checkboxes are provided on the search results page to mark individual documents. There is also a checkbox to mark all currently displayed documents. In addition, the button "Export search results" is displayed above the search results, where the selected documents can be exported.

Another option is to export all documents that match the current search restrictions. In this case, all checkboxes must be deselected.

Figure 61: Controls to mark and export documents.

#### Selection of the exporter and the fields to be exported

After selecting the documents to be exported, a dialog box appears in which the exporter type can be selected. Currently, one exporter is available; it exports selected fields of the documents to an Excel document.

After selecting the fields to be included in the export and confirming with the "Export" button, the export starts. Once the export is complete, the result is offered for download.

Figure 62: Selection of the exporter and the fields to be exported.

## Document Classification

### Manage classification

#### Administration of the label system

The target categories for the automatic classification of documents are called the label system, which can be edited and maintained in the "Label System" module. In a new project, the label system is initially empty.

Clicking on "Create new label" at the bottom left adds a new label. The pen symbol on the right-hand side is used to rename the label. The plus symbol to its right adds a new label as a child of the current label. It is therefore used to create hierarchically organized label systems. Clicking on the red cross symbol deletes labels (only labels that have no children can be deleted).

In a hierarchical labeling system, the hierarchical arrangement can also be edited via drag & drop.

Figure 63: Labels can be added, edited, moved or deleted in the label system administration.

#### Administration of different classification sets

The starting point for the automatic classification of documents is the so-called classification set.

Figure 64: Menu item for managing classification sets.

##### Create a new classification set

Any number of classification sets can be created for each project. This means that you can classify the same document source with different classification parameters.

There is only one label system per project. The same label system is used for each classification set. Please make sure that the label system has been created before you create a classification set.

To be able to view the results of the classification in the interface, you should select an indexing pipeline in `Solr Core Management` before you create classification sets.

When creating a new classification set, the following settings can be adjusted:

• Name: Name under which this classification set is referenced.

• Document fields: From all document fields known to the system, you can select those that are used for training the classifier (so-called `features`).

• High confidence threshold: The system distinguishes between documents with high and low confidence for automatically classified documents. This parameter can be used to define the value above which the confidence is interpreted as "high".

• Classifier: In principle, different implementations can be used for classification. At present, the implementation offered is a `support vector machine`.

• SVM: `Support vector machine`

• Single/multi-label: This parameter determines how many categories can be assigned to a single document. With `Single`, only one label is assigned. With `Multi`, a document can be categorized into several classes.

• Classification method: The classification method determines how the machine selects from several candidates. Depending on whether it is a single-label or multi-label scenario, different options and configuration parameters are available:

• Single-Label

• Best Labels: With `Single-Label-Classification` there is only one classification method: the `Best Labels` method chooses the class with the highest confidence.

• Threshold: The threshold value can be used to ensure that only classes with a certain minimum confidence are taken into account. This filters out assignments about which the machine is very uncertain.

• Multi-Label: For `Multi-Label Classification`, several methods are available (for a deeper theoretical background, see Matthew R. Boutell: Learning multi-label scene classification):

• All Labels: This method simply selects all available labels for the instance, ordered by decreasing confidence.

• T-criterion: Using the T-criterion, labels are first filtered by a minimum confidence threshold of 0.5. If all confidences are too low, i.e. no labels are assigned, a second filter step is applied: it checks whether the entropy of the confidences is lower than the minimum entropy threshold, i.e. whether the confidences are distributed unevenly. If this is the case, the labels are assigned based on a lower minimum confidence threshold.

• Entropy: 1.0 (default minimum entropy)

• Threshold value: 0.1 (default minimum confidence)

• C-criterion: This method selects the best predictions depending on the configuration parameters (percentage and threshold values). It first selects the label with the highest confidence (provided it is larger than the threshold value) and then assigns every label whose confidence is at least the configured percentage (default 75%) of the highest confidence value.

• Percentage value: 0.75

• Threshold value: 0.1 (minimal default confidence).

• Top n labels: This method selects the n categories with the highest confidence.

• n: the number of classes to be assigned
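For illustration, the C-criterion and Top-n selection rules described above can be sketched in Python. This is a hypothetical sketch based on the parameter descriptions in this section; the function names and data shapes are assumptions, not the product's API:

```python
def c_criterion(confidences, percentage=0.75, threshold=0.1):
    """C-criterion sketch: keep the best label if its confidence exceeds
    the threshold, plus every label whose confidence is at least
    `percentage` of the best confidence."""
    best = max(confidences.values(), default=0.0)
    if best <= threshold:
        return []
    return sorted(
        (label for label, c in confidences.items() if c >= percentage * best),
        key=lambda label: confidences[label],
        reverse=True,
    )


def top_n_labels(confidences, n=1):
    """Top-n sketch: select the n labels with the highest confidence."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:n]
```

With confidences A=0.9, B=0.7, C=0.2 and the defaults, the C-criterion keeps A and B (0.7 is above 75% of 0.9) and drops C.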

The classification configuration can be changed on the classification administration page by clicking on the edit button.

After changing parameters of an existing classification set, re-training and re-classification are necessary for the changes to take effect.

Before documents can be automatically classified, the machine requires appropriate training material. This refers to a small set of intellectually classified documents used by the machine to train a model.

Training data can be created in two ways: either by manually assigning classes via the graphical user interface (please see "Browse classifications" below) or by importing a CSV file that contains the appropriate assignments.

##### Import of training material

The button opens a dialog for importing a CSV file with training material. The CSV file must contain the name of the document in the first column (referred to as `document_name` in the system). The subsequent columns contain the label assignments (one column per label in a multi-label scenario). The columns must be separated by semicolons. The values of the columns can be enclosed in double quotation marks if required (mandatory if the values contain semicolons).

Example: trainset.csv

```
doc1;label_1;label_2
doc2;label_1;
doc3;label_1;label_3
...
```

The document name, which is used to identify the document in the list, must contain the value that is entered in the field `document_name` in the application.

If a training file contains several labels per document, but the selected training set is a single-label classification, only the first label is used.

If the document names or labels contain semicolons, the values must be enclosed in double quotation marks to avoid incorrectly interpreting the semicolon as a field separator.

Only values that are part of the label system in the application (or project) are allowed as labels (all others are ignored).

When you import training material, any labels already assigned to the documents in the list are deleted.
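A training CSV in the format described above can be generated with Python's standard `csv` module; a semicolon delimiter with minimal quoting matches the rules on separators and double quotation marks. The document and label names below are made up for illustration:

```python
import csv

# Mapping of document_name -> assigned labels (illustrative values only)
assignments = {
    "doc1": ["label_1", "label_2"],
    "doc2": ["label_1"],
    "doc3": ["label_1", "label_3"],
}

with open("trainset.csv", "w", newline="", encoding="utf-8") as handle:
    # QUOTE_MINIMAL quotes a value automatically if it contains a semicolon
    writer = csv.writer(handle, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for document_name, labels in assignments.items():
        writer.writerow([document_name, *labels])
```

Values containing semicolons are enclosed in double quotation marks automatically, as the import format requires.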

##### Train a model

As soon as the system has access to training material, either through an imported training list or manually assigned labels, a model can be trained using the corresponding button. Refresh the view to update the information on "State" and "Model": the training has finished when "State" is IDLE and "Model" is READY.

##### Quality of the current model

After each training run, an evaluation is carried out to assess the current quality of the model. For this purpose, the machine uses the set of documents with intellectually confirmed labels. This set is divided into a training set (90%) and a test set (10%). The test set is classified by the machine on the basis of a model trained on the training set. The results of the automatic classification are then compared with the intellectually assigned labels. To smooth the results, the machine repeats this 10 times with different divisions into test and training sets. The results of the tests can be viewed as diagrams using the corresponding button. The diagrams show the following metrics per label, which are derived from the number of correct assignments (true positives - TP), false assignments (false positives - FP), missing assignments (false negatives - FN), and correct non-assignments (true negatives - TN):
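The repeated 90/10 split described above can be sketched as follows. This is a hypothetical helper for illustration; the product's internal evaluation code is not documented here:

```python
import random


def repeated_holdout(documents, train_and_score, runs=10, test_fraction=0.1):
    """Average a scoring function over repeated random train/test splits.

    `train_and_score` is any callable that trains on the training split,
    classifies the test split, and returns a quality score.
    """
    scores = []
    for seed in range(runs):
        rng = random.Random(seed)      # different division per run
        shuffled = documents[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)   # smoothed result
```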

Accuracy: The ratio of all correct assignments (and correct non-assignments) to the total sum of all observations:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision: The ratio of correct assignments to all assignments:

Precision = TP / (TP + FP)

If it is particularly important to avoid false assignments, this metric is the most relevant.

Recall: The ratio of correct assignments to the sum of all existing correct assignments:

Recall = TP / (TP + FN)

If you accept some false assignments in order to increase the number of hits, this metric is the most relevant.

F1-Score: A weighted average of Precision (P) and Recall (R):

F1 = 2 × (P × R) / (P + R)
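The four metrics follow directly from the assignment counts; a minimal sketch:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from assignment counts.

    Returns 0.0 for any ratio whose denominator is zero.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with TP=8, FP=2, FN=2 and TN=88, accuracy is 0.96 and precision, recall and F1 are each 0.8.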

##### Automatic classification of all unclassified documents

As soon as an initial model has been created, all previously unclassified documents can be automatically classified on the basis of this model via the corresponding button on the classification configuration page.

Once the classification is complete, the results can be viewed in the graphical user interface. The assigned classes are displayed above each document (see "Browse classifications" below).

##### Status information

The overview table shows the current status of the classification set:

• IDLE: No process is currently running.

• TRAINING: A training is in progress. During this time, no other processes can be started on this classification set.

• CLASSIFYING: Documents are currently being classified. During this time, no other processes can be started on this classification set.

• ABORTING: A process (training or classification) is being aborted. During this time, no processes can be started on this classification set.

The resulting model of a classification set comes with additional information:

• NONE: No model has been trained yet.

• READY: A valid model exists and a classification process can be started.

• OUTDATED: Since the last training, manual classifications have been added or automatic classifications have been confirmed or rejected. The model should be re-trained in order to make changes take effect.

• INVALID: Changes were made to the label system, or a manually assigned label was deleted, which invalidates the current model. The model has to be re-trained.

### Index, evaluate and manually classify documents

For all classification sets, you can use a graphical user interface to navigate through the documents, review results, confirm or delete automatically assigned classes, and assign classes manually. You can access this browser view by clicking on "Classification" on the project overview page.

#### Structure of the interface

The interface is similar to the search interface, both in terms of its structure and functionality. The classification page has three predefined facets on the left side of the screen that can be used to filter documents by assigned class (`Label`), assigned confidence (`Confidence`), or assignment status at the document level (`Status`).

This makes it very easy to display, for example, only those documents that have been automatically classified (`Status` = `Autoclassified`) and that have labels with low confidence (`Confidence` = `low`). By correcting or confirming the resulting documents, the classification model can be improved: the system learns exactly where it is currently most uncertain (so-called active learning).

To the right of the search input field, you can choose the classification set to work on. If you have created several classification sets, you can quickly switch between them.

#### Confirm or reject automatically assigned labels

The labels assigned to each document are shown below its title information. Manually assigned labels are displayed in blue; automatically assigned labels are displayed in red (low confidence) or green (high confidence).

Automatically assigned labels have buttons to confirm or delete the label. When an automatically assigned label is confirmed, it changes its color and is considered in the next training session to improve the model.

As soon as you confirm, delete or add labels, the model is considered `OUTDATED`. This means that since the last training session, new data has been collected to improve the model and re-training is necessary.

#### Execute actions on several selected documents

Similar to the conventional search interface, there are several document-centered actions for classification. In general, an action refers to

• exactly one document,

• a selection of documents,

• all documents of the project, or

• all documents corresponding to the current search restrictions.

For each of these actions, there is a small button with a distinctive icon under the document title. Use this button to apply the action to exactly that document.

The same icons are displayed on larger buttons below the search bar ("Label document(s)", "Classify document(s)", "Export classifications"). Clicking on these buttons applies the action to all documents that are marked with the checkbox to the left of their title. All documents on the current search result page are selected by clicking the uppermost checkbox on the page.

If no documents are selected, the action is applied to all documents that correspond to the current search restrictions. Since the result set can be very large, a window opens to approve the current selection before the corresponding process starts in the background.

#### Manually label documents

In addition to confirming or rejecting automatically assigned labels, categories can be assigned manually. The button attached to each document serves this purpose: it opens a window in which you can select the desired label(s). You can also manually label several documents at the same time by using the checkboxes to the left of the document titles in conjunction with the uppermost button.

When manually assigning labels, a window opens with labeling information:

• "Not selected": This label has not been assigned to any of the selected documents.

• "Partially selected": This label has already been assigned for some (not all) selected documents (gray stripes).

• "Completely selected": All selected documents already have this label (grey).

When a label is assigned manually, any automatically assigned label of the same type is overwritten.

For example, if you select 100 documents to assign label A and 10 of them already have an automatically assigned label A, the status for those 10 documents is switched to "Approved". An automatically assigned label B would not be replaced by this procedure (except in a single-label classification scenario, where only one label is allowed).

#### Classify documents automatically

The same selection mechanism as for manual labeling also applies to automatic classification (single documents, a selection of documents, or the current search result set). The "Classify document(s)" button automatically classifies all documents that have not been manually categorized.

As a result, automatically assigned category labels are displayed in red (low confidence) or green (high confidence). The corresponding facet filters on the left (Label, Confidence and Status) change when the page is refreshed.

If documents are automatically classified, all previously unconfirmed automatically assigned classes from earlier runs are deleted for these documents.

#### Export labels

The assignment of (confirmed or manual) labels can be exported from the interface to a CSV file (button "Export classifications"). The format has the same structure as the input format that is allowed for importing training material.

#### Training and classifying directly from the search page

A new model, based on all previously manually classified or confirmed documents, can be trained with the corresponding button at the top right of the page. Similarly, another button at the top right classifies all unclassified documents based on the current model.

### Classification Web Service

This section describes the possible integration of the classification component into existing third-party systems. An interface is offered as a RESTful/XML service, which is fully integrated into the Swagger framework. For the formal specification, please refer to the official documentation.

#### Web Service

The Web service accepts requests at the following URL:

`https://HOST:PORT/information-discovery/rest/classification/projects/{projectName}/classificationSets/{classificationSetName}/classifyDocument?type={Importer}`

The information on HOST and PORT depends on the specific installation and can be obtained from the system administrator.

• {projectName} is the selected name of the created project in the application.

• {classificationSetName} is the selected name of the created classification configuration in the application.

• {Importer} is the importer type to process different input document types and can be one of:

• CAS Importer

• Solr XML Importer

• Text Importer

Additional importers can be included for specific applications. Access to the service URL is not authenticated.

The first time the Web service is called after a restart, the requested classification model is loaded from the classification configuration into working memory so that service requests can be answered as quickly as possible. Therefore, with a newly started system or a new classification configuration, a first request should be made to warm up the web service, e.g. with a defined test data set. In addition to an automatic query by an integrating external system, the Swagger test page or a query via curl can also be used (see below).

#### Test page and simple query via Curl

Developers can test the functionality of the service and get an overview on the following page. In particular, sample requests can easily be generated and return values verified.

Figure 65: Swagger test page

Curl is a command-line program for transferring data in computer networks. Versions exist for Windows and Linux systems, among others. The following simple call retrieves classification results for two documents in Solr format:

```curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/xml' -d '<?xml version="1.0" encoding="UTF-8"?>
<update>
<doc>
<field name="document_name">doc1</field>
<field name="title">Machine learning for automatic text classification</field>
<field name="content">Machine learning is a subset of artificial intelligence in
the field of computer science that often uses statistical techniques
to give computers the ability to learn...</field>
</doc>
<doc>
<field name="document_name">doc2</field>
<field name="title">Document classification made easy</field>
<field name="content">Document classification or document categorization is a
problem in library science, information science and computer science.
The task is to assign a document to one or more classes or
categories...</field>
</doc>
</update>' \
'https://HOST:PORT/information-discovery/rest/classification/projects/{project}/classificationSets/{classificationSet}/classifyDocument?type=Solr%20XML%20Importer'```
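The Solr XML payload from the curl example can also be built programmatically, e.g. in Python with the standard library. The host, project, and field values remain placeholders, as in the curl call:

```python
import xml.etree.ElementTree as ET


def solr_update_payload(documents):
    """Build a Solr XML <update> payload from a list of field dictionaries."""
    update = ET.Element("update")
    for fields in documents:
        doc = ET.SubElement(update, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(update, encoding="unicode")


payload = solr_update_payload([
    {"document_name": "doc1",
     "title": "Machine learning for automatic text classification"},
    {"document_name": "doc2",
     "title": "Document classification made easy"},
])
```

The resulting string can then be sent as the request body with `Content-Type: text/plain`, as in the curl example.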

#### Result Format (XML)

The answer of the web service is returned in XML format and contains the automatic classifications for the input data set. The output for each data record contains the identifier (`document_name`) and one or more categories with corresponding confidence values. In the example, both documents were successfully classified, which is indicated by the attribute success=true:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc1" success="true">
<labels>
<label confidence="0.98">Artificial Intelligence</label>
<label confidence="0.89">Text Mining</label>
</labels>
</classification>
<classification documentIdentifier="doc2" success="true">
<labels>
<label confidence="0.98">Information Science</label>
</labels>
</classification>
</classifications>
</response>```
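On the client side, such a response can be parsed with standard XML tooling; a minimal Python sketch using the example response above:

```python
import xml.etree.ElementTree as ET

# Example response taken from the documentation above
RESPONSE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc1" success="true">
<labels>
<label confidence="0.98">Artificial Intelligence</label>
<label confidence="0.89">Text Mining</label>
</labels>
</classification>
<classification documentIdentifier="doc2" success="true">
<labels>
<label confidence="0.98">Information Science</label>
</labels>
</classification>
</classifications>
</response>"""


def parse_classifications(xml_text):
    """Map each documentIdentifier to its (label, confidence) pairs,
    skipping classifications with success="false"."""
    results = {}
    for node in ET.fromstring(xml_text).iter("classification"):
        if node.get("success") != "true":
            continue
        results[node.get("documentIdentifier")] = [
            (label.text, float(label.get("confidence")))
            for label in node.iter("label")
        ]
    return results
```

A document with an empty `<labels/>` element simply maps to an empty list.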

If no category is assigned to a document due to selection criteria in the classification configuration (e.g. thresholds), the classification for the document also appears with success=true, but with an empty list of categories in the returned message:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc3" success="true">
<labels/>
</classification>
</classifications>
</response>```

If fields that are set active in the classification configuration are missing, corresponding error messages are added to the document classification. If the classification could still be carried out, this is indicated by success=true and the assigned categories are displayed:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc4" success="true">
<labels>
<label confidence="0.98">Artificial Intelligence</label>
<label confidence="0.89">Text Mining</label>
</labels>
<errors>
<error>Document has no title.</error>
</errors>
</classification>
</classifications>
</response>```

Multiple error messages for a document are listed separately:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc5" success="true">
<labels>
<label confidence="0.98">Artificial Intelligence</label>
</labels>
<errors>
<error>Document has no title.</error>
<error>Document has no content.</error>
<error>Error on ...</error>
</errors>
</classification>
</classifications>
</response>```

If no classification can be performed due to an error, this is indicated by success=false and the output list of assigned categories is empty. A corresponding error message is added to the message returned:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<classifications>
<classification documentIdentifier="doc6" success="false">
<labels/>
<errors>
<error>Document has no classifiable content.</error>
</errors>
</classification>
</classifications>
</response>```

A document without the document_name input field cannot be classified because a unique document identifier is required. Since no assignment to an individual document can be made without this document identifier, the corresponding error message appears at the upper level. Other documents are not affected, so the other classifications will return normally:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<errors>
<error>1 document(s) without identifier.</error>
</errors>
<classifications>
<classification documentIdentifier="doc1" success="true">
<labels>
<label confidence="0.98">Artificial Intelligence</label>
<label confidence="0.89">Text Mining</label>
</labels>
</classification>
<classification documentIdentifier="doc2" success="true">
<labels>
<label confidence="0.98">Information Science</label>
</labels>
</classification>
</classifications>
</response>```

If a global error prevents classification of the documents, an error message is returned for the entire input, for example, the message that no classification characteristics could be extracted:

```<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
<errors>
<error>Feature extraction failed.</error>
</errors>
</response>```

## Text Analysis Component Reference

### Type Systems

#### AverbisTypeSystem

de.averbis.textanalysis.typesystems.AverbisTypeSystem

The core type system for all default components.

Maven Coordinates

```
<dependency>
<groupId>de.averbis.textanalysis</groupId>
<artifactId>components-core-typesystem</artifactId>
<version>3.5.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

##### Sentence

Full Name: `de.averbis.extraction.types.Sentence`

Description: Annotation representing a sentence including the ending punctuation mark.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### Token

Full Name: `de.averbis.extraction.types.Token`

Description: Annotation for basic textual units, including words, numbers and punctuation marks.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 1: Features

Name | Range | Element Type | Multiple References Allowed

`posTag`

`de.averbis.extraction.types.POSTag`

Description: The part of speech of this token (i.a. used by the concept annotator when restricted to certain POS types).

`segments`

`uima.cas.FSArray`

`de.averbis.extraction.types.Segment`

Description: Segments of this token (i.a. used for respective mode in concept annotator).

`stem`

`de.averbis.extraction.types.Stem`

Description: The stem of the token (i.a. used for respective mode in concept annotator).

`isAbbreviation`

`uima.cas.Boolean`

Description: Marker whether the token is (part of) an abbreviation.

`abbreviations`

`uima.cas.FSArray`

`de.averbis.extraction.types.Abbreviation`

Description: The abbreviations for the token; these may be used as replacements in the concept annotator. Note that this goes in combination with "isAbbreviation", which marks whether the token is an abbreviation. Multiple entries mean that it is ambiguous which full form is correct. There will be components that resolve this ambiguity and remove the wrong forms. Components that cannot perform this resolution must rely on the first (and hopefully only) entry being correct.

`concepts`

`uima.cas.FSArray`

`de.averbis.extraction.types.Concept`

Description: List of concepts containing/covering this token (this feature is used for indexing and highlighting with lucene/solr)

`entities`

`uima.cas.FSArray`

`de.averbis.extraction.types.CoreAnnotation`

Description: Contains entities such as Date,Time,Size,... discovered inside the Token (this feature is used for indexing and highlighting with lucene/solr)

`ignoreByConceptMapper`

`uima.cas.Boolean`

Description: If this feature is true, the ConceptAnnotator ignores the token. Use this if a pre-processing component has already identified the semantics of the token, e.g. dates, times, measurement values. Default value: false.

`normalized`

`uima.cas.String`

Description: Normalized version of this token (usually lower-case, without special characters and numbers). This feature is used for indexing/search with lucene/solr.

`diacriticsFreeVersions`

`uima.cas.StringArray`

Description: In the case that the normalized version contains diacritics, multiple versions without diacritics are stored in this array. This feature is used for indexing/search with lucene/solr.

`isStopword`

`uima.cas.Boolean`

Description: Indicates if the token is a stopword.

`lemma`

`de.averbis.extraction.types.Lemma`

Description: The Lemma of the token.

`isInvariant`

`uima.cas.Boolean`

Description: Defines whether a token is an invariant. Such a token should not undergo some morphologic analysis steps, such as stemming and/or decompounding. However, lemmatization might still be allowed. Typical invariants: IL-2 (gene name) or also product names or numbers (SR-2715) but also too short words (au).

`tokenClass`

`uima.cas.String`

Description: The optional string representing the class of the token concerning its surface form.

##### Abbreviation

Full Name: `de.averbis.extraction.types.Abbreviation`

Description: An abbreviation is a letter or group of letters, taken from a word or words. For example, the word "abbreviation" can be abbreviated as "abbr." or "abbrev."

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 2: Features

Name | Range | Element Type | Multiple References Allowed

`fullForm`

`uima.cas.String`

Description: The full form of an abbreviation. The full form, for example for HLA could be human leukocyte antigen.

`textReference`

`de.averbis.extraction.types.CoreAnnotation`

Description: Reference to the text span that contains the full form of the abbreviation/acronym.

`definedHere`

`uima.cas.Boolean`

Description: This feature is true if the abbreviation/acronym is defined for the first time in the text, e.g. in "interleukin 2 (Il-2) receptor", it can be true only for locally introduced abbreviations/acronyms.

`stems`

`uima.cas.StringArray`

Description: Stems of the full form.

`segments`

`uima.cas.StringArray`

Description: Segments of the full form.

`tokens`

`uima.cas.StringArray`

Description: Token strings of the full form.

##### Concept

Full Name: `de.averbis.extraction.types.Concept`

Description: A concept is a reference to an entry in a database, terminology, taxonomy, ontology, etc.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 3: Features

Name | Range | Element Type | Multiple References Allowed

`dictCanon`

`uima.cas.String`

Description: Canonical form (preferred term).

`enclosingSpan`

`de.averbis.extraction.types.CoreAnnotation`

Description: The span that this concept is contained within (i.e. its sentence).

`negatedBy`

`de.averbis.extraction.types.CoreAnnotation`

Description: Indicates which annotation negates the concept.

`partialMatch`

`uima.cas.Boolean`

Description: Specifies whether the annotation matches only part of the covered context. E.g., if the coveredText is "Lungenabschnitte" and the generated Concept annotation is "Lunge", this value is set to true.

`matchedText`

`uima.cas.String`

Description: The text in document which matched the synonym (in the respective mapping mode form, i.e., segment/stem/original etc.).

`matchedTerm`

`uima.cas.String`

Description: The synonym of the concept which caused the match (in the ConceptAnnotator dictionary this is `<term label=xxx>`).

`matchedVariant`

`uima.cas.String`

Description: The variant of the synonym of the concept which caused the match (in the ConceptAnnotator dictionary this is `<variant label=xxx>`). Note that one synonym (matchedTerm) can have several variants (i.e. spelling forms or mapping forms).

`matchedTokens`

`uima.cas.FSArray`

`de.averbis.extraction.types.Token`

Description: The Token annotations on which the concept was found. Note that there is also matchedAnnotations which list the actual annotations involved in the matching process (i.e., token, stem, segment, etc.).

`matchedAnnotations`

`uima.cas.FSArray`

Description: List of the actual annotations involved in the matching process (i.e., original, stem, segment, etc.). Note that there is also matchedTokens, which lists only matching Token annotations.

`mappingMode`

`uima.cas.String`

Description: The mode used for mapping (e.g., original, stem, segment...).

`mappingFuzzynessScore`

`uima.cas.Float`

Description: The score for the fuzzyness of the mapping (higher scores mean higher fuzzyness, i.e., less exact mappings).

`uniqueID`

`uima.cas.String`

Description: The unique concept id, including the terminology name and concept ID; it should look like this: `<terminologyName>:<conceptID>`.

`conceptID`

`uima.cas.String`

Description: The concept id. For a unique id refer to uniqueID.

`source`

`uima.cas.String`

Description: The name of the terminology source.

##### Zone

Full Name: `de.averbis.extraction.types.Zone`

Description: An annotation concerning the document structure, e.g. header, title, abstract, etc.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 4: Features

Name | Range | Element Type | Multiple References Allowed

``` label ```

`uima.cas.String`

Description: Allows annotating the Zone with a semantic label. E.g., in the case of a section the value might be Introduction, Appendix, etc.

``` weight ```

`uima.cas.Float`

Description: The relevance or weight for a zone; used e.g. to weight information contained in the respective zone.

##### Header

Full Name: `de.averbis.extraction.types.Header`

Description: The header annotation of a document.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 5: Features

Name | Range | Element Type | Multiple References Allowed

``` docID ```

`uima.cas.String`

Description: The ID of the document.

``` source ```

`uima.cas.String`

Description: The source of the document.

``` fileName ```

`uima.cas.String`

Description: The name of the source file (often used by cas consumers which produce an output file for each CAS; this name is used as base).

``` fileEncoding ```

`uima.cas.String`

Description: The encoding of the file.

``` documentIndex ```

`uima.cas.Integer`

Description: The index of the current document within the complete sequence, e.g., this document is number 5 in the sequence.

``` lastFile ```

`uima.cas.Boolean`

Description: Indicates if this is the last file.

``` sourceLanguage ```

`uima.cas.String`

Description: The document language of the source.

``` offsetInSource ```

`uima.cas.Integer`

Description: Byte offset of the start of the document content within the original source file or other input source. Only used if the CAS document was retrieved from a source where one physical source file contained several conceptual documents. Zero otherwise.

``` documentSize ```

`uima.cas.Integer`

Description: Size of the original document in bytes before processing by the CAS Initializer. Either the absolute file size or the size within a file or other source.

``` sequenceNumber ```

`uima.cas.Integer`

Description: Sequence number used to verify the correct order when merging CASes.

``` lastSegment ```

`uima.cas.Boolean`

Description: For a CAS that represents a segment of a larger source document, this flag indicates whether this CAS is the final segment of the source document. This is useful for downstream components that want to take some action after having seen all of the segments of a particular source document.

##### POSTag

Full Name: `de.averbis.extraction.types.POSTag`

Description: Parent type for all specific part-of-speech types.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 6: Features

Name | Range | Element Type | Multiple References Allowed

``` tagsetId ```

`uima.cas.String`

Description: The name of the tag set the POS tag belongs to, e.g., the "Penn Treebank II Tags" (see http://bulba.sdsu.edu/jeanette/thesis/PennTags.html)

``` value ```

`uima.cas.String`

Description: The specific part-of-speech tag, as returned by the POS tagger (e.g., "NN", "ADJ", etc.)

##### Chunk

Full Name: `de.averbis.extraction.types.Chunk`

Description: A general type for chunks (NPs, VPs, PPs, etc.). Note: there are three specific subtypes for common chunks: ChunkNP, ChunkVP, ChunkPP. For all other chunk types (e.g., SBAR, ADJP, etc.) use this general type!

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 7: Features

Name | Range | Element Type | Multiple References Allowed

``` enclosedTokens ```

`uima.cas.FSArray`

`de.averbis.extraction.types.Token`

Description: The Token annotations enclosed by this chunk.

``` head ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The head entity on which this chunk grammatically depends. Example: in "Der Vater des Kindes", "Der Vater" is the head of "des Kindes".

``` dependents ```

`uima.cas.FSArray`

`de.averbis.extraction.types.CoreAnnotation`

Description: The entities which grammatically depend on this chunk. Example: in "Der Vater des Kindes", "des Kindes" is the dependent of "Der Vater".

``` value ```

`uima.cas.String`

Description: The specific chunk tag as returned by the chunker (e.g., "NP", "SBAR", "S", etc.).

##### ChunkNP

Full Name: `de.averbis.extraction.types.ChunkNP`

Description: A noun phrase (e.g. "the strange bird").

Parent Type: `de.averbis.extraction.types.Chunk`

##### ChunkVP

Full Name: `de.averbis.extraction.types.ChunkVP`

Description: A verb phrase (e.g. "has been thinking").

Parent Type: `de.averbis.extraction.types.Chunk`

##### ChunkPP

Full Name: `de.averbis.extraction.types.ChunkPP`

Description: A prepositional phrase (e.g. "in between").

Parent Type: `de.averbis.extraction.types.Chunk`

##### Segment

Full Name: `de.averbis.extraction.types.Segment`

Description: The segmentation of a text part; a segment is usually a subword (i.e., part of a token).

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 8: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: The string representation of the segment.

``` isValidSegmentation ```

`uima.cas.Boolean`

Description: Indicates if the segmentation is valid (i.e., it could be completely matched against the dictionary).

``` midStrings ```

`uima.cas.StringArray`

Description: The MID value; multiple values for ambiguous MIDs (for MID, see the Morphosaurus paper).

##### Section

Full Name: `de.averbis.extraction.types.Section`

Description: Text sections of a certain type.

Parent Type: `de.averbis.extraction.types.Zone`

##### Abstract

Full Name: `de.averbis.extraction.types.Abstract`

Description: Semantic abstract section found in the text.

Parent Type: `de.averbis.extraction.types.Zone`

##### Paragraph

Full Name: `de.averbis.extraction.types.Paragraph`

Description: Different paragraphs found in the document.

Parent Type: `de.averbis.extraction.types.Zone`

##### Title

Full Name: `de.averbis.extraction.types.Title`

Description: Marks a title in the document.

Parent Type: `de.averbis.extraction.types.Zone`

##### Relation

Full Name: `de.averbis.extraction.types.Relation`

Description: Describes a binary relation between two annotations. The relation is defined according to the SPO (subject, predicate, object) annotation.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 9: Features

Name | Range | Element Type | Multiple References Allowed

``` subject ```

`de.averbis.extraction.types.CoreAnnotation`

Description: An annotation representing the subject of the relation ("agens").

``` predicate ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The annotation representing the predicate of the relation. Example: in "BASF has integrated BAYER", 'has integrated' is the predicate and is marked as a ChunkVP; use the feature 'value' to define the type of the relation, e.g., value = acquisition.

``` object ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The object of the relation.

``` value ```

`uima.cas.String`

Description: Type of the relation.

##### Entity

Full Name: `de.averbis.extraction.types.Entity`

Description: A named entity; not to be confused with a Concept. A (named) entity is a string representation in text referring to a class of entities. Thus, the entity mention does not have an identifier but a specific type (the category) assigned to it.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 10: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: This feature provides the text of the annotated mention. Important for easily representing discontinuous mentions such as 'T cell' in the expression 'T and B cell'.

``` label ```

`uima.cas.String`

Description: The type of the entity, e.g., PERSON, LOCATION, etc. The feature is named label because "type" is a reserved word.

``` parsedElements ```

`uima.cas.FSArray`

`de.averbis.extraction.types.Entity`

Description: Reference to all recognized entities inside this Entity such as Size, Time, Area, Date, Volume, ....

##### ResolvedEntity

Full Name: `de.averbis.extraction.types.ResolvedEntity`

Description: A special entity with an additional specific resolved form.

Parent Type: `de.averbis.extraction.types.Entity`

Table 11: Features

Name | Range | Element Type | Multiple References Allowed

``` resolvedType ```

`uima.cas.String`

Description: The type of the resolved form.

``` resolvedForm ```

`uima.cas.String`

Description: A string representing the resolved form of the entity.

##### Group

Full Name: `de.averbis.extraction.types.Group`

Description: Groups together a set of annotations that belong together, e.g., enumerations. One of them can be set as the "leading" concept. Example: in "the liver metastasis is hypodense and has a size of 3cm*2cm", the lead is metastasis and the other concepts are liver, hypodense, and 3cm*2cm.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 12: Features

Name | Range | Element Type | Multiple References Allowed

``` leadingAnnotation ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The annotation set as the leading concept of the group.

``` members ```

`uima.cas.FSArray`

`de.averbis.extraction.types.CoreAnnotation`

Description: Annotations contained in the group.

``` label ```

`uima.cas.String`

Description: Textual label describing the group elements.

##### Enumeration

Full Name: `de.averbis.extraction.types.Enumeration`

Description: A specific group representing an enumeration like "red, blue and green".

Parent Type: `de.averbis.extraction.types.Group`

##### Listing

Full Name: `de.averbis.extraction.types.Listing`

Description: A specific group representing a listing like "1. red 2. blue 3. green".

Parent Type: `de.averbis.extraction.types.Group`

##### InputParam

Full Name: `de.averbis.extraction.types.InputParam`

Description: InputParam is used to pass parameters to an analysis engine via a JCas object. This can be used to pass parameters in the process() method of an analysis engine rather than during initialization of the AEs. It is necessary, e.g., for the ConceptAnnotator, to which you may want to pass restrictions (such as "language" or "terminology") for each individual text/JCas while only having one ConceptAnnotator instance.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 13: Features

Name | Range | Element Type | Multiple References Allowed

``` key ```

`uima.cas.String`

Description: The key of the input parameter.

``` values ```

`uima.cas.StringArray`

Description: The values of the input parameter.
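
The mechanism described above (restrictions travelling with each text rather than with the engine configuration) can be mimicked outside UIMA with a small sketch. All class and parameter names below are illustrative, not the Averbis API:

```python
class Document:
    """Stands in for a JCas: the text plus per-document input parameters."""

    def __init__(self, text, params=None):
        self.text = text
        # key -> list of string values, mirroring InputParam's key/values features
        self.params = params or {}


class ConceptAnnotator:
    """A single engine instance; restrictions arrive with each document."""

    def process(self, doc):
        languages = doc.params.get("language", [])
        terminologies = doc.params.get("terminology", [])
        return f"languages={languages}, terminologies={terminologies}"


engine = ConceptAnnotator()
doc = Document("Lungenabschnitte ...", {"language": ["de"], "terminology": ["MeSH"]})
print(engine.process(doc))   # languages=['de'], terminologies=['MeSH']
```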

##### Stem

Full Name: `de.averbis.extraction.types.Stem`

Description: The type for stem annotations.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 14: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: The string representation of the stem.

##### Category

Full Name: `de.averbis.extraction.types.Category`

Description: Category meta information on the document or a region of the document (use the context feature to identify which section the category refers to), e.g., language information of the document text or of specific sections.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 15: Features

Name | Range | Element Type | Multiple References Allowed

``` group ```

`uima.cas.String`

Description: The category group (e.g., HSG, language) to which the label belongs. For language categorization the group might be "lang" and the labels could then be "en", "de", "fr", etc.

``` label ```

`uima.cas.String`

Description: The label of the category annotation, e.g., an identified language (de, en, fr, ...).

``` context ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The text context which belongs to the given category annotation, e.g. Document, Section, Sentence.

``` rank ```

`uima.cas.Integer`

Description: The rank of the current category with respect to the context.

##### SummarySentence

Full Name: `de.averbis.extraction.types.SummarySentence`

Description: Annotation indicating a sentence that makes up a summary of the document.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 16: Features

Name | Range | Element Type | Multiple References Allowed

``` sentence ```

`de.averbis.extraction.types.Sentence`

Description: The sentence annotation that contains the content of this summary sentence.

``` descriptors ```

`uima.cas.FSArray`

`de.averbis.extraction.types.Descriptor`

`false`

Description: The descriptors extracted by the algorithm that account for the selection of the sentence.

##### IndexTerm

Full Name: `de.averbis.extraction.types.IndexTerm`

Description: A term to be used for indexing a document.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 17: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: The string representation of the index term. Example: the normalized and stemmed string representing a keyword in the free keywording scenario; for controlled keywording (= descriptor extraction), the dictCanon might be written here.

``` baseAnnotation ```

`de.averbis.extraction.types.CoreAnnotation`

Description: The annotation to be assigned as index term. This can, e.g., be a Concept or a noun phrase annotation from which the index term was derived.

##### Descriptor

Full Name: `de.averbis.extraction.types.Descriptor`

Description: An index term from an ontology; its type (or reference) is written in the feature annotation.

Parent Type: `de.averbis.extraction.types.IndexTerm`

Table 18: Features

Name | Range | Element Type | Multiple References Allowed

``` uid ```

`uima.cas.String`

Description: The unique identifier of the descriptor, e.g., a combination of terminology and concept id.

##### Keyword

Full Name: `de.averbis.extraction.types.Keyword`

Description: A keyword that is assigned freely (i.e., not taken from an ontology) to a document. Its type is written in the feature annotation.

Parent Type: `de.averbis.extraction.types.IndexTerm`

##### GenericMetadata

Full Name: `de.averbis.extraction.types.GenericMetadata`

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 19: Features

Name | Range | Element Type | Multiple References Allowed

``` metadataFieldname ```

`uima.cas.String`

Description: To limit the set of metadata field names, the predefined field names should be used where possible (add new field names only if necessary). Predefined metadata field names are: title, summary, filesize, annotatorName.

``` value ```

`uima.cas.String`

Description: Value of the metadata field, e.g. metadataFieldname = title, value = "Brave new world".

##### POSTagNoun

Full Name: `de.averbis.extraction.types.POSTagNoun`

Description: The type for all POS-Tags of the type "Noun".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagVerb

Full Name: `de.averbis.extraction.types.POSTagVerb`

Description: The type for all POS-Tags of the type "Verb".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagAdj

Full Name: `de.averbis.extraction.types.POSTagAdj`

Description: The type for all POS-Tags of the type "Adjective".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagAdv

Full Name: `de.averbis.extraction.types.POSTagAdv`

Description: The type for all POS-Tags of the type "Adverb".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagPron

Full Name: `de.averbis.extraction.types.POSTagPron`

Description: The type for all POS-Tags of the type "Pronoun".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagDet

Full Name: `de.averbis.extraction.types.POSTagDet`

Description: The type for all POS-Tags of the type "Determiner".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagAdp

Full Name: `de.averbis.extraction.types.POSTagAdp`

Description: The type for all POS-Tags of the type "Preposition/Postposition".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagNum

Full Name: `de.averbis.extraction.types.POSTagNum`

Description: The type for all POS-Tags of the type "Numeral".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagConj

Full Name: `de.averbis.extraction.types.POSTagConj`

Description: The type for all POS-Tags of the type "Conjunction".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagPart

Full Name: `de.averbis.extraction.types.POSTagPart`

Description: The type for all POS-Tags of the type "Particle".

Parent Type: `de.averbis.extraction.types.POSTag`

##### POSTagPunct

Full Name: `de.averbis.extraction.types.POSTagPunct`

Description: The type for all POS-Tags of the type "Punctuation".

Parent Type: `de.averbis.extraction.types.POSTag`

##### ValidTextSegment

Full Name: `de.averbis.extraction.types.ValidTextSegment`

Description: Zone marking valid text, in contrast to invalid text such as OCR (Optical Character Recognition) artefacts, number blocks, tables, etc.

Parent Type: `de.averbis.extraction.types.Zone`

##### Lemma

Full Name: `de.averbis.extraction.types.Lemma`

Description: A Lemma is the canonical form of a lexeme. Lemmata can be retrieved from a lexicon or produced by a lemmatizer.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 20: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: The value of the lemma.

``` case ```

`uima.cas.String`

Description: Case such as Nom (Nominative) or Gen (Genitive) etc.

``` number ```

`uima.cas.String`

Description: Singular or plural.

``` gender ```

`uima.cas.String`

Description: fem or masc or neutr.

##### Member

Full Name: `de.averbis.extraction.types.Member`

Description: Utility annotation for indicating a member of a group.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

#### 8.1.2. NumericValueTypeSystem

de.averbis.textanalysis.typesystems.NumericValueTypeSystem

This type system contains types for representing numeric values.

Maven Coordinates

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>numeric-value-typesystem</artifactId>
    <version>3.5.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

##### NumericValue

Full Name: `de.averbis.textanalysis.types.numericvalue.NumericValue`

Description: Represents a text span which can be interpreted as a numeric value; the value is stored in a feature.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 21: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.Double`

Description: The actual double value of the numeric value.

##### Fraction

Full Name: `de.averbis.textanalysis.types.numericvalue.Fraction`

Description: A fraction of two NumericValue annotations.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 22: Features

Name | Range | Element Type | Multiple References Allowed

``` numerator ```

`de.averbis.textanalysis.types.numericvalue.NumericValue`

Description: The numerator of the fraction.

``` denominator ```

`de.averbis.textanalysis.types.numericvalue.NumericValue`

Description: The denominator of the fraction.

##### SimpleFraction

Full Name: `de.averbis.textanalysis.types.numericvalue.SimpleFraction`

Description: A fraction of two integer values.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 23: Features

Name | Range | Element Type | Multiple References Allowed

``` numerator ```

`uima.cas.Integer`

Description: The numerator of the fraction.

``` denominator ```

`uima.cas.Integer`

Description: The denominator of the fraction.

##### LanguageContainer

Full Name: `de.averbis.textanalysis.types.numericvalue.LanguageContainer`

Description: A container annotation specifying the language of the covered text.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 24: Features

Name | Range | Element Type | Multiple References Allowed

``` language ```

`uima.cas.String`

Description: The language locale like 'de' or 'en'.

##### ConjunctionFragment

Full Name: `de.averbis.textanalysis.types.numericvalue.ConjunctionFragment`

Description: A text span indicating a conjunction of numbers, which may also be located within a token, as in 'fünfundzwanzig'.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### RomanNumeral

Full Name: `de.averbis.textanalysis.types.numericvalue.RomanNumeral`

Description: Annotation for roman numerals.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 25: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.Integer`

Description: Integer value of the roman numeral.
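
The value feature holds the converted integer. A minimal conversion sketch (not the annotator's actual implementation) using the standard subtractive rule:

```python
ROMAN_DIGITS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a roman numeral string to its integer value.

    A smaller digit before a larger one (e.g. the 'I' in 'IV') is
    subtracted instead of added.
    """
    total = 0
    values = [ROMAN_DIGITS[ch] for ch in numeral.upper()]
    for i, value in enumerate(values):
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("XIV"))      # 14
print(roman_to_int("MCMXCII"))  # 1992
```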

#### 8.1.3. MeasurementTypeSystem

de.averbis.textanalysis.typesystems.MeasurementTypeSystem

This type system contains types for measurements and units.

Maven Coordinates

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>measurement-typesystem</artifactId>
    <version>3.5.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

• de.averbis.textanalysis.typesystems.NumericValueTypeSystem

##### Measurement

Full Name: `de.averbis.textanalysis.types.measurement.Measurement`

Description: A measurement combining a numeric value and a unit.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 26: Features

Name | Range | Element Type | Multiple References Allowed

``` unit ```

`de.averbis.textanalysis.types.measurement.Unit`

Description: The unit of the measurement.

``` value ```

`de.averbis.textanalysis.types.numericvalue.NumericValue`

Description: The numeric value of the measurement.

``` normalizedUnit ```

`uima.cas.String`

Description: Normalized string value of the unit.

``` normalizedAsciiUnit ```

`uima.cas.String`

Description: Ascii normalized string value of the unit.

``` normalizedValue ```

`uima.cas.Double`

Description: The normalized value of the measurement. This value is the result of the transformation of the numeric value according to the transformation of the unit to its standard unit.

``` normalized ```

`uima.cas.String`

Description: The concatenation of the normalized numeric value and the ascii normalized unit.

``` parsedUnit ```

`uima.cas.String`

Description: Optional parsable unit string which replaces the unit annotation. It is utilized for normalization.
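
The relationship between value, unit, normalizedValue, and normalized can be pictured as a unit conversion followed by string concatenation. A hedged sketch, assuming a few length units with metre as the standard unit (the factors and standard units the component actually uses may differ):

```python
# Hypothetical conversion factors to an assumed standard unit (metre).
LENGTH_TO_METRE = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def normalize_measurement(value, unit):
    """Return (normalizedValue, normalized) for a simple length measurement.

    normalizedValue transforms the numeric value to the standard unit;
    normalized concatenates it with the normalized ASCII unit string.
    """
    normalized_value = value * LENGTH_TO_METRE[unit]
    return normalized_value, f"{normalized_value} m"


print(normalize_measurement(3.0, "km"))   # (3000.0, '3000.0 m')
```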

##### Unit

Full Name: `de.averbis.textanalysis.types.measurement.Unit`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 27: Features

Name | Range | Element Type | Multiple References Allowed

``` normalizedAscii ```

`uima.cas.String`

Description: Ascii normalized string value of the unit.

``` parsed ```

`uima.cas.String`

Description: String value of the parsed/identified unit.

``` normalized ```

`uima.cas.String`

Description: Normalized string value of the unit.

``` dimension ```

`uima.cas.String`

Description: The dimension of the unit, in a form like '[L^3]' for volume.

##### MeasurementInterval

Full Name: `de.averbis.textanalysis.types.measurement.MeasurementInterval`

Description: An interval defined by two measurements, a low and high limit.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 28: Features

Name | Range | Element Type | Multiple References Allowed

``` low ```

`de.averbis.textanalysis.types.measurement.Measurement`

Description: The lower bound of the interval.

``` high ```

`de.averbis.textanalysis.types.measurement.Measurement`

Description: The upper bound of the interval.

``` lowExcluded ```

`uima.cas.Boolean`

Description: Marker set to true if the lower bound itself is not part of the interval.

``` highExcluded ```

`uima.cas.Boolean`

Description: Marker set to true if the upper bound itself is not part of the interval.
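
The lowExcluded and highExcluded markers decide whether the bounds themselves count as inside the interval. A small illustrative check (a hypothetical helper, assuming both bounds are already normalized to the same unit):

```python
def in_interval(value, low, high, low_excluded=False, high_excluded=False):
    """Check whether a normalized value lies inside a measurement interval.

    low_excluded / high_excluded mirror the lowExcluded / highExcluded
    features: when set, the bound itself is not part of the interval.
    """
    above_low = value > low if low_excluded else value >= low
    below_high = value < high if high_excluded else value <= high
    return above_low and below_high


print(in_interval(5.0, 5.0, 10.0))                     # True (bounds included)
print(in_interval(5.0, 5.0, 10.0, low_excluded=True))  # False
```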

##### SimpleMeasurementInterval

Full Name: `de.averbis.textanalysis.types.measurement.SimpleMeasurementInterval`

Description: An interval extending MeasurementInterval with several primitive features representing two measurements.

Parent Type: `de.averbis.textanalysis.types.measurement.MeasurementInterval`

Table 29: Features

Name | Range | Element Type | Multiple References Allowed

``` lowNormalizedUnit ```

`uima.cas.String`

Description: The normalized unit of the lower bound.

``` lowNormalizedValue ```

`uima.cas.Double`

Description: The normalized value of the lower bound.

``` lowNormalized ```

`uima.cas.String`

Description: The normalized value combined with the normalized unit of the lower bound.

``` lowParsedUnit ```

`uima.cas.String`

Description: The parsed unit of the lower bound.

``` highNormalizedUnit ```

`uima.cas.String`

Description: The normalized unit of the upper bound.

``` highNormalizedValue ```

`uima.cas.Double`

Description: The normalized value of the upper bound.

``` highNormalized ```

`uima.cas.String`

Description: The normalized value combined with the normalized unit of the upper bound.

``` highParsedUnit ```

`uima.cas.String`

Description: The parsed unit of the upper bound.

##### RelativeMeasurementInterval

Full Name: `de.averbis.textanalysis.types.measurement.RelativeMeasurementInterval`

Description: A relative interval defined by two measurements, a base and a deflection.

Parent Type: `de.averbis.textanalysis.types.measurement.MeasurementInterval`

Table 30: Features

Name | Range | Element Type | Multiple References Allowed

``` base ```

`de.averbis.textanalysis.types.measurement.Measurement`

Description: The base of the interval.

``` deflection ```

`de.averbis.textanalysis.types.measurement.Measurement`

Description: The deflection of the interval.

##### IntervalIndicator

Full Name: `de.averbis.textanalysis.types.measurement.IntervalIndicator`

Description: A textual representation indicating an interval, like '-' or 'bis'.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### ComparisonIndicator

Full Name: `de.averbis.textanalysis.types.measurement.ComparisonIndicator`

Description: A textual representation of a comparison like '<=' or 'unter', also able to indicate an interval.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### GreaterIndicator

Full Name: `de.averbis.textanalysis.types.measurement.GreaterIndicator`

Description: A textual representation indicating something is 'greater', also able to indicate an interval.

Parent Type: `de.averbis.textanalysis.types.measurement.ComparisonIndicator`

##### LessIndicator

Full Name: `de.averbis.textanalysis.types.measurement.LessIndicator`

Description: A textual representation indicating something is 'less', also able to indicate an interval.

Parent Type: `de.averbis.textanalysis.types.measurement.ComparisonIndicator`

##### NoUnit

Full Name: `de.averbis.textanalysis.types.measurement.NoUnit`

Description: A textual position that is not a unit.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### DictionaryMeasurementMention

Full Name: `de.averbis.textanalysis.types.measurement.DictionaryMeasurementMention`

Description: A textual representation indicating a measurement. This is a helper type for measurements that combine numeric values and units or cause other problems for the unit parsing.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 31: Features

Name | Range | Element Type | Multiple References Allowed

``` value ```

`uima.cas.String`

Description: Parseable value of the measurement.

``` unit ```

`uima.cas.String`

Description: Parseable unit of the measurement.

#### 8.1.4. TemporalTypeSystem

de.averbis.textanalysis.typesystems.TemporalTypeSystem

This type system contains types for representing temporal expressions and values.

Maven Coordinates

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>temporal-typesystem</artifactId>
    <version>3.5.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

##### Timex3

Full Name: `de.averbis.textanalysis.types.temporal.Timex3`

Description: Represents a text span which can be interpreted as a temporal expression; its interpretation is stored in the features below.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 32: Features

Name | Range | Element Type | Multiple References Allowed

``` tid ```

`uima.cas.String`

Description: Non-optional attribute. Each TIMEX3 expression has to be identified by a unique ID number. This is automatically assigned by the annotation tool.

``` kind ```

`uima.cas.String`

Description: Non-optional attribute. Each TIMEX3 is assigned one of the following types: DATE, TIME, DURATION, or SET. The format of the value attribute is determined by the type of TIMEX3. For instance, a DURATION must have a value that begins with the letter ’P’ since durations represent a period of time. This will be elaborated on below in the value section. In addition, some optional attributes are used specifically with certain types of temporal expressions.

``` value ```

`uima.cas.String`

Description: The value attribute details which temporal information is contained in the TIMEX3. The value is given in an extended ISO 8601 format. Examples: T24:00, 2001-01-12TEV, 1984-01-03T12:00, XXXX-12-02, 1964-SU, P4M, PT20M

``` temporalFunction ```

`uima.cas.Boolean`

Description: Binary attribute which expresses that the value of the temporal expression needs to be determined via evaluation of a temporal function.

``` anchor ```

`de.averbis.textanalysis.types.temporal.Timex3`

Description: Optional attribute. It introduces the annotation of the temporal expression to which the TIMEX3 in question is temporally anchored.
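
The correspondence between kind and the shape of value (durations start with 'P', times of day with 'T', dates are calendar-like) can be sketched with a rough classifier. This is illustrative only, not the annotator's actual validation, and it omits the SET kind:

```python
import re


def guess_timex3_kind(value):
    """Guess a TIMEX3 kind from the shape of its ISO-8601-style value string.

    DURATION values begin with 'P' (e.g. 'P4M', 'PT20M'); values beginning
    with 'T' are times of day (e.g. 'T24:00'); calendar-like values such as
    '1984-01-03T12:00' or underspecified ones like 'XXXX-12-02' count as DATE.
    """
    if value.startswith("P"):
        return "DURATION"
    if value.startswith("T"):
        return "TIME"
    if re.match(r"^[0-9X]{4}(-[0-9A-Z]{2,4})*", value):
        return "DATE"
    return "UNKNOWN"


print(guess_timex3_kind("P4M"))          # DURATION
print(guess_timex3_kind("T24:00"))       # TIME
print(guess_timex3_kind("XXXX-12-02"))   # DATE
```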

##### Date

Full Name: `de.averbis.textanalysis.types.temporal.Date`

Description: The expression describes a calendar time.

Parent Type: `de.averbis.textanalysis.types.temporal.Timex3`

Table 33: Features

Name | Range | Element Type | Multiple References Allowed

``` day ```

`uima.tcas.Annotation`

Description: The day of the actual date.

``` month ```

`uima.tcas.Annotation`

Description: The month of the actual date.

``` year ```

`uima.tcas.Annotation`

Description: The year of the actual date.

##### Time

Full Name: `de.averbis.textanalysis.types.temporal.Time`

Description: The expression refers to a time of the day, even if in a very indefinite way.

Parent Type: `de.averbis.textanalysis.types.temporal.Timex3`

Table 34: Features

Name | Range | Element Type | Multiple References Allowed

``` hour ```

`uima.tcas.Annotation`

Description: The hour of the actual time.

``` minute ```

`uima.tcas.Annotation`

Description: The minute of the actual time.

``` second ```

`uima.tcas.Annotation`

Description: The second of the actual time.

##### Duration

Full Name: `de.averbis.textanalysis.types.temporal.Duration`

Description: The expression describes a duration. This value is assigned to explicit durations.

Parent Type: `de.averbis.textanalysis.types.temporal.Timex3`

##### TemporalSet

Full Name: `de.averbis.textanalysis.types.temporal.TemporalSet`

Description: The expression describes a set of times.

Parent Type: `de.averbis.textanalysis.types.temporal.Timex3`

##### DocumentDate

Full Name: `de.averbis.textanalysis.types.temporal.DocumentDate`

Description: Annotation representing the date and time of the document, if available, e.g., the creation time.

Parent Type: `de.averbis.textanalysis.types.temporal.Timex3`

##### WeekDay

Full Name: `de.averbis.textanalysis.types.temporal.WeekDay`

Description: Annotation indicating a weekday e.g., 'Monday'.

Parent Type: `uima.tcas.Annotation`

Table 35: Features

Name | Range | Element Type | Multiple References Allowed

``` dayOfWeek ```

`uima.cas.Integer`

Description: Number of the day of the week, e.g., 1 for Monday.

##### DayTime

Full Name: `de.averbis.textanalysis.types.temporal.DayTime`

Description: Annotation indicating a time of day e.g., 'in the morning'.

Parent Type: `uima.tcas.Annotation`

Table 36: Features

Name | Range | Element Type | Multiple References Allowed

``` timeOfDay ```

`uima.cas.String`

Description: String value specifying the time of the day.

##### TemporalIntervalBeginIndicator

Full Name: `de.averbis.textanalysis.types.temporal.TemporalIntervalBeginIndicator`

Description: Indicator for a possible begin of a temporal interval.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### TemporalIntervalEndIndicator

Full Name: `de.averbis.textanalysis.types.temporal.TemporalIntervalEndIndicator`

Description: Indicator for a possible end of a temporal interval.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### UnambiguousTimex

Full Name: `de.averbis.textanalysis.types.temporal.UnambiguousTimex`

Description: Helper annotation type pointing to a most likely unambiguous temporal expression. The term '1992' could represent a year, but also a measurement with that value. Other text spans, like '1.1.2015', are most likely unambiguous dates, which can be represented by this type.

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 37: Features

| Name | Range | Description |
| --- | --- | --- |
| `timex` | `de.averbis.textanalysis.types.temporal.Timex3` | The actual temporal expression. |

##### DateInterval

Full Name: `de.averbis.textanalysis.types.temporal.DateInterval`

Description: The expression describes an interval (or set) of dates defined by the start and end date of an event.

Parent Type: `de.averbis.textanalysis.types.temporal.TemporalSet`

Table 38: Features

| Name | Range | Description |
| --- | --- | --- |
| `startDate` | `de.averbis.textanalysis.types.temporal.Date` | The start date of the temporal interval. |
| `endDate` | `de.averbis.textanalysis.types.temporal.Date` | The end date of the temporal interval. |

#### SmpcTypeSystem

Full Name: `de.averbis.textanalysis.pharma.SmpcTypeSystem`

Description: -

Maven Coordinates

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>idmp-typesystem</artifactId>
    <version>0.7.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisTypeSystem

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

• de.averbis.textanalysis.typesystems.MeasurementTypeSystem

• de.averbis.textanalysis.typesystems.TemporalTypeSystem

• de.averbis.textanalysis.typesystems.pharma.PharmaConceptTypeSystem
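When this type system is used in a custom UIMA component, the imports listed above are typically resolved by name through a type system descriptor. The following is a minimal sketch of such a descriptor; the surrounding file name and the assumption that the `idmp-typesystem` artifact is on the classpath are illustrative, not prescribed by this manual:

```xml
<!-- Sketch: importing the SMPC type system by name in a UIMA
     type system descriptor. Assumes the idmp-typesystem jar
     (which contains the descriptor resource) is on the classpath. -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <imports>
    <import name="de.averbis.textanalysis.pharma.SmpcTypeSystem"/>
  </imports>
</typeSystemDescription>
```

A by-name import is resolved against the classpath and UIMA datapath, so the same descriptor works regardless of where the jar is installed.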

##### SmPC

Full Name: `de.averbis.textanalysis.types.pharma.smpc.SmPC`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 39: Features

| Name | Range | Description |
| --- | --- | --- |
| `medicinalProduct` | `de.averbis.textanalysis.types.pharma.smpc.MedicinalProduct` | - |
| `marketingAuthorisation` | `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisation` | - |
| `clinicalParticulars` | `de.averbis.textanalysis.types.pharma.smpc.ClinicalParticulars` | - |
| `pharmaceuticalForm` | `de.averbis.textanalysis.types.pharma.smpc.PharmaceuticalForm` | - |

##### SmpcContext

Full Name: `de.averbis.textanalysis.types.pharma.smpc.SmpcContext`

Description: -

Parent Type: `uima.tcas.Annotation`

##### MedicinalProduct

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MedicinalProduct`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 40: Features

| Name | Range | Description |
| --- | --- | --- |
| `additionalMonitoringIndicator` | `uima.tcas.Annotation` | - |
| `medicinalProductClassification` | `de.averbis.textanalysis.types.pharma.smpc.MedicinalProductClassification` | - |
| `medicinalProductName` | `de.averbis.textanalysis.types.pharma.smpc.MedicinalProductName` | - |
| `activeSubstances` | `de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer` | - |
| `excipients` | `de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer` | - |
| `administration` | `de.averbis.textanalysis.types.pharma.smpc.Administration` | - |
| `shelfLifeStorage` | `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeStorage` | - |

##### MedicinalProductName

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MedicinalProductName`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 41: Features

| Name | Range | Description |
| --- | --- | --- |
| `inventedNamePart` | `de.averbis.textanalysis.types.pharma.smpc.InventedProductName` | inventedNamePart |
| `scientificNamePart` | `de.averbis.textanalysis.types.pharma.smpc.ScientificProductName` | scientificNamePart |
| `strengthPart` | `de.averbis.textanalysis.types.pharma.smpc.ProductStrength` | strengthPart |
| `pharmaceuticalDoseFormPart` | `de.averbis.textanalysis.types.pharma.smpc.ProductDoseForm` | pharmaceuticalDoseFormPart |
| `formulationPart` | `uima.tcas.Annotation` | formulationPart |
| `intendedUsePart` | `uima.tcas.Annotation` | intendedUsePart |
| `targetPopulationPart` | `uima.tcas.Annotation` | targetPopulationPart |
| `containerOrPackPart` | `uima.tcas.Annotation` | containerOrPackPart |
| `devicePart` | `uima.tcas.Annotation` | devicePart |
| `trademarkOrCompanyNamePart` | `uima.tcas.Annotation` | - |
| `timePeriodPart` | `uima.tcas.Annotation` | timePeriodPart |
| `flavourPart` | `uima.tcas.Annotation` | flavourPart |

##### InventedProductName

Full Name: `de.averbis.textanalysis.types.pharma.smpc.InventedProductName`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 42: Features

| Name | Range | Description |
| --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - |

##### ScientificProductName

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ScientificProductName`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 43: Features

| Name | Range | Description |
| --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - |

##### ProductStrength

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ProductStrength`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 44: Features

| Name | Range | Description |
| --- | --- | --- |
| `measurement` | `de.averbis.textanalysis.types.measurement.Measurement` | - |

##### ProductDoseForm

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ProductDoseForm`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 45: Features

| Name | Range | Description |
| --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - |

##### Indication

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Indication`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 46: Features

| Name | Range | Description |
| --- | --- | --- |
| `populationSpecifics` | `de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics` | - |
| `otherTherapySpecifics` | `de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics` | - |
| `diseaseStatus` | `uima.tcas.Annotation` | - |
| `coMorbidity` | `uima.tcas.Annotation` | - |
| `intendedEffect` | `uima.tcas.Annotation` | - |
| `timingDuration` | `uima.tcas.Annotation` | - |
| `indicationAsDiseaseSymptomProcedure` | `uima.tcas.Annotation` | - |

##### Interaction

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Interaction`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 47: Features

| Name | Range | Description |
| --- | --- | --- |
| `interactionType` | `uima.cas.String` | - |
| `interactionEffect` | `uima.tcas.Annotation` | - |
| `interactionIncidence` | `uima.tcas.Annotation` | - |
| `managementActions` | `uima.tcas.Annotation` | - |

##### Contraindication

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Contraindication`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 48: Features

| Name | Range | Description |
| --- | --- | --- |
| `populationSpecifics` | `de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics` | - |
| `otherTherapySpecifics` | `de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics` | - |
| `contraIndicationsAsDiseaseSymptomProcedure` | `uima.tcas.Annotation` | - |
| `diseaseStatus` | `uima.tcas.Annotation` | - |
| `coMorbidity` | `uima.tcas.Annotation` | - |

##### UndesirableEffect

Full Name: `de.averbis.textanalysis.types.pharma.smpc.UndesirableEffect`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 49: Features

| Name | Range | Description |
| --- | --- | --- |
| `undesirableEffect` | `uima.tcas.Annotation` | undesirableEffect |
| `undesirableEffectAsSymptomConditionEffect` | `uima.tcas.Annotation` | undesirableEffectAsSymptomConditionEffect |
| `frequencyOfOccurence` | `uima.tcas.Annotation` | frequencyOfOccurence |
| `symptomConditionEffectClassification` | `uima.tcas.Annotation` | symptomConditionEffectClassification |

##### PharmaceuticalForm

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PharmaceuticalForm`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 50: Features

| Name | Range | Description |
| --- | --- | --- |
| `authorisedDosageForm` | `de.averbis.textanalysis.types.pharma.smpc.AuthorisedDosageForm` | - |
| `manufacturedItem` | `de.averbis.textanalysis.types.pharma.smpc.ManufacturedItem` | - |

##### AuthorisedDosageForm

Full Name: `de.averbis.textanalysis.types.pharma.smpc.AuthorisedDosageForm`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 51: Features

| Name | Range | Description |
| --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - |

##### MedicinalProductClassification

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MedicinalProductClassification`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 52: Features

| Name | Range | Description |
| --- | --- | --- |
| `classificationSystem` | `uima.tcas.Annotation` | - |
| `classificationValue` | `uima.tcas.Annotation` | - |

##### PopulationSpecifics

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PopulationSpecifics`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 53: Features

| Name | Range | Description |
| --- | --- | --- |
| `age` | `uima.tcas.Annotation` | - |
| `ageRange` | `uima.tcas.Annotation` | - |
| `gender` | `uima.tcas.Annotation` | - |
| `race` | `uima.tcas.Annotation` | - |
| `healthStatus` | `uima.tcas.Annotation` | - |

##### OtherTherapySpecifics

Full Name: `de.averbis.textanalysis.types.pharma.smpc.OtherTherapySpecifics`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 54: Features

| Name | Range | Description |
| --- | --- | --- |
| `therapyRelationshipType` | `uima.tcas.Annotation` | - |
| `medication` | `uima.tcas.Annotation` | - |

##### MarketingAuthorisation

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisation`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 55: Features

| Name | Range | Description |
| --- | --- | --- |
| `marketingAuthorisationNumberContainer` | `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumberContainer` | - |
| `legalStatusOfSupply` | `uima.tcas.Annotation` | - |
| `marketingAuthorisationHolder` | `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolder` | - |
| `firstAuthorisationDate` | `de.averbis.textanalysis.types.pharma.smpc.DateOfFirstAuthorisation` | - |
| `lastRenewalDate` | `de.averbis.textanalysis.types.pharma.smpc.DateOfLatestRenewal` | - |
| `revisionDate` | `de.averbis.textanalysis.types.pharma.smpc.DateOfRevision` | - |

##### ClinicalParticulars

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ClinicalParticulars`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 56: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `therapeuticIndications` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.smpc.Indication` | `false` | - |
| `undesirableEffects` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.smpc.UndesirableEffect` | `false` | - |
| `interactions` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.smpc.Interaction` | `false` | - |
| `contraIndications` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.smpc.Contraindication` | `false` | - |

##### Administration

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Administration`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 57: Features

| Name | Range | Description |
| --- | --- | --- |
| `routeOfAdministration` | `de.averbis.textanalysis.types.pharma.AdministrationConcept` | - |
| `unitOfPresentation` | `uima.tcas.Annotation` | - |
| `paediatricUseIndicator` | `de.averbis.textanalysis.types.pharma.smpc.PaediatricUseIndicator` | - |

##### Container

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Container`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 58: Features

| Name | Range | Description |
| --- | --- | --- |
| `packageDescription` | `uima.tcas.Annotation` | - |

##### ShelfLifeStorage

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeStorage`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 59: Features

| Name | Range | Description |
| --- | --- | --- |
| `shelfLifeContainer` | `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeContainer` | - |
| `specialPrecautionsForStorage` | `uima.tcas.Annotation` | - |

##### ManufacturedItem

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ManufacturedItem`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 60: Features

| Name | Range | Description |
| --- | --- | --- |
| `form` | `uima.tcas.Annotation` | - |
| `manufacturedItemQuantity` | `uima.tcas.Annotation` | - |
| `physicalCharacteristics` | `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristics` | - |

##### PhysicalCharacteristics

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristics`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 61: Features

| Name | Range | Description |
| --- | --- | --- |
| `itemShape` | `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsShape` | - |
| `itemColor` | `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsColor` | - |
| `itemImprint` | `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsImprint` | - |

##### MarketingAuthorisationHolder

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolder`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 62: Features

| Name | Range | Description |
| --- | --- | --- |
| `organisationId` | `uima.tcas.Annotation` | - |
| `authorisationHolderName` | `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderName` | - |
| `authorisationHolderAddress` | `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderAddress` | - |

##### ContactPerson

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ContactPerson`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 63: Features

| Name | Range | Description |
| --- | --- | --- |
| `confidentialityIndicator` | `uima.tcas.Annotation` | - |
| `telecom` | `uima.tcas.Annotation` | - |
| `name` | `uima.tcas.Annotation` | - |
| `role` | `uima.tcas.Annotation` | - |

##### SmpcDate

Full Name: `de.averbis.textanalysis.types.pharma.smpc.SmpcDate`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 64: Features

| Name | Range | Description |
| --- | --- | --- |
| `date` | `de.averbis.textanalysis.types.temporal.Date` | - |

##### DateOfFirstAuthorisation

Full Name: `de.averbis.textanalysis.types.pharma.smpc.DateOfFirstAuthorisation`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.SmpcDate`

##### DateOfLatestRenewal

Full Name: `de.averbis.textanalysis.types.pharma.smpc.DateOfLatestRenewal`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.SmpcDate`

##### DateOfRevision

Full Name: `de.averbis.textanalysis.types.pharma.smpc.DateOfRevision`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.SmpcDate`

##### Substance

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Substance`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 65: Features

| Name | Range | Description |
| --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - |

##### ActiveSubstance

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ActiveSubstance`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.Substance`

Table 66: Features

| Name | Range | Description |
| --- | --- | --- |
| `referencedSubstance` | `de.averbis.textanalysis.types.pharma.smpc.Substance` | - |
| `presentationStrength` | `de.averbis.textanalysis.types.pharma.smpc.PresentationStrength` | - |
| `concentrationStrength` | `de.averbis.textanalysis.types.pharma.smpc.ConcentrationStrength` | - |

##### PresentationStrength

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PresentationStrength`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 67: Features

| Name | Range | Description |
| --- | --- | --- |
| `measurement` | `de.averbis.textanalysis.types.measurement.Measurement` | - |

##### ConcentrationStrength

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ConcentrationStrength`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 68: Features

| Name | Range | Description |
| --- | --- | --- |
| `measurement` | `de.averbis.textanalysis.types.measurement.Measurement` | - |

##### Excipient

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Excipient`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.Substance`

Table 69: Features

| Name | Range | Description |
| --- | --- | --- |
| `form` | `uima.tcas.Annotation` | - |

##### PharmacodynamicClassificationSystem

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PharmacodynamicClassificationSystem`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PharmacodynamicClassificationValue

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PharmacodynamicClassificationValue`

Description: -

Parent Type: `uima.tcas.Annotation`

##### ShelfLifeTimePeriod

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeTimePeriod`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 70: Features

| Name | Range | Description |
| --- | --- | --- |
| `form` | `uima.tcas.Annotation` | - |

##### ShelfLifeTimeType

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeTimeType`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 71: Features

| Name | Range | Description |
| --- | --- | --- |
| `prefTerm` | `uima.cas.String` | - |
| `code` | `uima.cas.String` | - |

##### SpecialPrecautionsForStorage

Full Name: `de.averbis.textanalysis.types.pharma.smpc.SpecialPrecautionsForStorage`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PackageDescription

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PackageDescription`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PhysicalCharacteristicsShape

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsShape`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PhysicalCharacteristicsColor

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsColor`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PhysicalCharacteristicsImprint

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PhysicalCharacteristicsImprint`

Description: -

Parent Type: `uima.tcas.Annotation`

##### MarketingAuthorisationNumber

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumber`

Description: -

Parent Type: `uima.tcas.Annotation`

##### MarketingAuthorisationHolderName

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderName`

Description: -

Parent Type: `uima.tcas.Annotation`

##### MarketingAuthorisationHolderAddress

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationHolderAddress`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 72: Features

| Name | Range | Description |
| --- | --- | --- |
| `postAddress` | `uima.tcas.Annotation` | - |
| `postCode` | `uima.tcas.Annotation` | - |
| `city` | `uima.tcas.Annotation` | - |
| `country` | `uima.tcas.Annotation` | - |

##### AdditionalMonitoringIndicator

Full Name: `de.averbis.textanalysis.types.pharma.smpc.AdditionalMonitoringIndicator`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PaediatricUseIndicator

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PaediatricUseIndicator`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 73: Features

| Name | Range | Description |
| --- | --- | --- |
| `normalized` | `uima.cas.String` | - |

##### ManufacturedItemQuantity

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ManufacturedItemQuantity`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PackageItemType

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PackageItemType`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PackageItemQuantity

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PackageItemQuantity`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PackageItemConcept

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PackageItemConcept`

Description: -

Parent Type: `de.averbis.extraction.types.Concept`

##### City

Full Name: `de.averbis.textanalysis.types.pharma.smpc.City`

Description: -

Parent Type: `uima.tcas.Annotation`

##### Country

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Country`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PostAddress

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PostAddress`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PostCode

Full Name: `de.averbis.textanalysis.types.pharma.smpc.PostCode`

Description: -

Parent Type: `uima.tcas.Annotation`

##### CompanyName

Full Name: `de.averbis.textanalysis.types.pharma.smpc.CompanyName`

Description: -

Parent Type: `uima.tcas.Annotation`

##### CompanyPostfix

Full Name: `de.averbis.textanalysis.types.pharma.smpc.CompanyPostfix`

Description: -

Parent Type: `uima.tcas.Annotation`

##### StreetIndicator

Full Name: `de.averbis.textanalysis.types.pharma.smpc.StreetIndicator`

Description: -

Parent Type: `uima.tcas.Annotation`

##### MarketingAuthorisationNumberContainer

Full Name: `de.averbis.textanalysis.types.pharma.smpc.MarketingAuthorisationNumberContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 74: Features

| Name | Range | Description |
| --- | --- | --- |
| `numbers` | `uima.cas.FSArray` | - |

##### ConceptCell

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ConceptCell`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 75: Features

| Name | Range | Description |
| --- | --- | --- |
| `concepts` | `uima.cas.FSArray` | - |

##### SubstanceContainer

Full Name: `de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 76: Features

| Name | Range | Description |
| --- | --- | --- |
| `substances` | `uima.cas.FSArray` | - |

##### ActiveSubstanceContainer

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ActiveSubstanceContainer`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer`

##### ExcipientSubstanceContainer

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ExcipientSubstanceContainer`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.smpc.SubstanceContainer`

##### Device

Full Name: `de.averbis.textanalysis.types.pharma.smpc.Device`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 77: Features

| Name | Range | Description |
| --- | --- | --- |
| `prefTerm` | `uima.cas.String` | - |
| `code` | `uima.cas.String` | - |

##### ExcipientRole

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ExcipientRole`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### ATCCode

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ATCCode`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 78: Features

| Name | Range | Description |
| --- | --- | --- |
| `prefTerm` | `uima.cas.String` | - |
| `code` | `uima.cas.String` | - |

##### ShelfLifeContainer

Full Name: `de.averbis.textanalysis.types.pharma.smpc.ShelfLifeContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 79: Features

| Name | Range | Description |
| --- | --- | --- |
| `shelfLifes` | `uima.cas.FSArray` | - |

#### SmpcSectionTypeSystem

Full Name: `de.averbis.textanalysis.typesystems.pharma.SmpcSectionTypeSystem`

Description: -

Maven Coordinates

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>idmp-typesystem</artifactId>
    <version>0.7.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisTypeSystem

• de.averbis.textanalysis.typesystems.NumericValueTypeSystem

##### SmpcAnnex

Full Name: `de.averbis.textanalysis.types.pharma.SmpcAnnex`

Description: -

Parent Type: `de.averbis.extraction.types.Section`

##### SmpcAnnexI

Full Name: `de.averbis.textanalysis.types.pharma.SmpcAnnexI`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcAnnex`

##### SmpcAnnexII

Full Name: `de.averbis.textanalysis.types.pharma.SmpcAnnexII`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcAnnex`

##### SmpcAnnexIII

Full Name: `de.averbis.textanalysis.types.pharma.SmpcAnnexIII`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcAnnex`

##### SmpcSection

Full Name: `de.averbis.textanalysis.types.pharma.SmpcSection`

Description: -

Parent Type: `de.averbis.extraction.types.Section`

Table 80: Features

| Name | Range | Description |
| --- | --- | --- |
| `headline` | `de.averbis.textanalysis.types.pharma.SmpcSectionHeadline` | - |
| `content` | `uima.tcas.Annotation` | content |

##### SmpcSectionHeadline

Full Name: `de.averbis.textanalysis.types.pharma.SmpcSectionHeadline`

Description: -

Parent Type: `uima.tcas.Annotation`

Table 81: Features

| Name | Range | Description |
| --- | --- | --- |
| `number` | `uima.cas.Double` | number |
| `text` | `uima.tcas.Annotation` | text |
| `main` | `uima.cas.Boolean` | main |

##### SmpcSectionHeadlineText

Full Name: `de.averbis.textanalysis.types.pharma.SmpcSectionHeadlineText`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### SecondarySmpcHeadline

Full Name: `de.averbis.textanalysis.types.pharma.SecondarySmpcHeadline`

Description: -

Parent Type: `uima.tcas.Annotation`

##### HeadlineTag

Full Name: `de.averbis.textanalysis.types.pharma.HeadlineTag`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 82: Features

| Name | Range | Description |
| --- | --- | --- |
| `cssClass` | `uima.cas.String` | - |

##### MainSmpcHeadline

Full Name: `de.averbis.textanalysis.types.pharma.MainSmpcHeadline`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### SmpcSectionContent

Full Name: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.numericvalue.LanguageContainer`

##### NameOfTheMedicinalProductContent

Full Name: `de.averbis.textanalysis.types.pharma.NameOfTheMedicinalProductContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### QualitativeAndQuantitativeCompositionContent

Full Name: `de.averbis.textanalysis.types.pharma.QualitativeAndQuantitativeCompositionContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PharmaceuticalFormContent

Full Name: `de.averbis.textanalysis.types.pharma.PharmaceuticalFormContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### ClinicalParticularsContent

Full Name: `de.averbis.textanalysis.types.pharma.ClinicalParticularsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### TherapeuticIndicationsContent

Full Name: `de.averbis.textanalysis.types.pharma.TherapeuticIndicationsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PosologyAndMethodOfAdministrationContent

Full Name: `de.averbis.textanalysis.types.pharma.PosologyAndMethodOfAdministrationContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### ContraindicationsContent

Full Name: `de.averbis.textanalysis.types.pharma.ContraindicationsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### SpecialWarningsAndPrecautionsForUseContent

Full Name: `de.averbis.textanalysis.types.pharma.SpecialWarningsAndPrecautionsForUseContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### InteractionsContent

Full Name: `de.averbis.textanalysis.types.pharma.InteractionsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### FertilityPregnancyLactationContent

Full Name: `de.averbis.textanalysis.types.pharma.FertilityPregnancyLactationContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### EffectsOnAbilityContent

Full Name: `de.averbis.textanalysis.types.pharma.EffectsOnAbilityContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### UndesirableEffectsContent

Full Name: `de.averbis.textanalysis.types.pharma.UndesirableEffectsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### OverdoseContent

Full Name: `de.averbis.textanalysis.types.pharma.OverdoseContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PharmacologicalPropertiesContent

Full Name: `de.averbis.textanalysis.types.pharma.PharmacologicalPropertiesContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PharmacodynamicPropertiesContent

Full Name: `de.averbis.textanalysis.types.pharma.PharmacodynamicPropertiesContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PharmacokineticPropertiesContent

Full Name: `de.averbis.textanalysis.types.pharma.PharmacokineticPropertiesContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PreclinicalSafetyDataContent

Full Name: `de.averbis.textanalysis.types.pharma.PreclinicalSafetyDataContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### PharmaceuticalParticularsContent

Full Name: `de.averbis.textanalysis.types.pharma.PharmaceuticalParticularsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### ListOfExcipientsContent

Full Name: `de.averbis.textanalysis.types.pharma.ListOfExcipientsContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### IncompatibilitiesContent

Full Name: `de.averbis.textanalysis.types.pharma.IncompatibilitiesContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### ShelfLifeContent

Full Name: `de.averbis.textanalysis.types.pharma.ShelfLifeContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### SpecialPrecautionsForStorageContent

Full Name: `de.averbis.textanalysis.types.pharma.SpecialPrecautionsForStorageContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### NatureAndContentsOfContainerContent

Full Name: `de.averbis.textanalysis.types.pharma.NatureAndContentsOfContainerContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### SpecialPrecautionsForDisposalAndOtherHandlingContent

Full Name: `de.averbis.textanalysis.types.pharma.SpecialPrecautionsForDisposalAndOtherHandlingContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### MarketingAuthorisationHolderContent

Full Name: `de.averbis.textanalysis.types.pharma.MarketingAuthorisationHolderContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### MarketingAuthorisationNumbersContent

Full Name: `de.averbis.textanalysis.types.pharma.MarketingAuthorisationNumbersContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### DateOfAuthorisationContent

Full Name: `de.averbis.textanalysis.types.pharma.DateOfAuthorisationContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### DateOfRevisionContent

Full Name: `de.averbis.textanalysis.types.pharma.DateOfRevisionContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

##### GeneralClassificationForSupplyContent

Full Name: `de.averbis.textanalysis.types.pharma.GeneralClassificationForSupplyContent`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.SmpcSectionContent`

#### Module3TypeSystem

`de.averbis.textanalysis.pharma.Module3TypeSystem`

-

Maven Coordinates

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>idmp-typesystem</artifactId>
    <version>0.7.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisTypeSystem

• de.averbis.textanalysis.typesystems.AverbisInternalTypeSystem

##### Product

Full Name: `de.averbis.textanalysis.types.pharma.module3.Product`

Description: Composition

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 83: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `compositions` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.module3.ProductComposition` | `true` | - |
| `description` | `de.averbis.textanalysis.types.pharma.module3.ProductDescription` | - | - | description |

##### ProductComposition

Full Name: `de.averbis.textanalysis.types.pharma.module3.ProductComposition`

Description: Composition

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 84: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `activeEntries` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.module3.CompositionEntry` | `true` | active entries |
| `excipientEntries` | `uima.cas.FSArray` | `de.averbis.textanalysis.types.pharma.module3.CompositionEntry` | `true` | excipient entries |
| `reference` | `de.averbis.extraction.types.CoreAnnotation` | - | - | reference |

##### ProductDescription

Full Name: `de.averbis.textanalysis.types.pharma.module3.ProductDescription`

Description: ProductDescription

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 85: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `dosageForm` | `de.averbis.textanalysis.types.pharma.module3.DosageForm` | - | - | dosageForm |
| `color` | `uima.cas.FSArray` | - | - | color |
| `shape` | `uima.cas.FSArray` | - | - | shape |

##### CompositionEntry

Full Name: `de.averbis.textanalysis.types.pharma.module3.CompositionEntry`

Description: Composition

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 86: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `active` | `uima.cas.Boolean` | - | - | active |
| `substance` | `de.averbis.textanalysis.types.pharma.module3.Substance` | - | - | substance |
| `strength` | `de.averbis.textanalysis.types.pharma.module3.StrengthContainer` | - | - | strength |
| `role` | `de.averbis.textanalysis.types.pharma.module3.IngredientRoleContainer` | - | - | function |
| `standard` | `de.averbis.textanalysis.types.pharma.module3.QualityStandardContainer` | - | - | qualityStandards |

##### StrengthContainer

Full Name: `de.averbis.textanalysis.types.pharma.module3.StrengthContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 87: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `strengths` | `uima.cas.FSArray` | - | - | strengths |

##### IngredientRoleContainer

Full Name: `de.averbis.textanalysis.types.pharma.module3.IngredientRoleContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 88: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `roles` | `uima.cas.FSArray` | - | - | functions |

##### QualityStandardContainer

Full Name: `de.averbis.textanalysis.types.pharma.module3.QualityStandardContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 89: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `standards` | `uima.cas.FSArray` | - | - | standards |

##### IngredientRole

Full Name: `de.averbis.textanalysis.types.pharma.module3.IngredientRole`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### Substance

Full Name: `de.averbis.textanalysis.types.pharma.module3.Substance`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ConceptContainer`

##### RouteOfAdministration

Full Name: `de.averbis.textanalysis.types.pharma.module3.RouteOfAdministration`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ConceptContainer`

##### DosageForm

Full Name: `de.averbis.textanalysis.types.pharma.module3.DosageForm`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ConceptContainer`

##### QualityStandard

Full Name: `de.averbis.textanalysis.types.pharma.module3.QualityStandard`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### PhysicalCharacteristicsShape

Full Name: `de.averbis.textanalysis.types.pharma.module3.PhysicalCharacteristicsShape`

Description: -

Parent Type: `uima.tcas.Annotation`

##### PhysicalCharacteristicsColor

Full Name: `de.averbis.textanalysis.types.pharma.module3.PhysicalCharacteristicsColor`

Description: -

Parent Type: `uima.tcas.Annotation`

##### ConceptContainer

Full Name: `de.averbis.textanalysis.types.pharma.module3.ConceptContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 90: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - | - | concept |

##### Manufacturer

Full Name: `de.averbis.textanalysis.types.pharma.module3.Manufacturer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 91: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `operationTypeContainer` | `de.averbis.textanalysis.types.pharma.module3.OperationTypeContainer` | - | - | operationTypeContainer |
| `name` | `de.averbis.textanalysis.types.pharma.module3.ManufacturerName` | - | - | name |
| `postAddress` | `de.averbis.textanalysis.types.pharma.module3.PostAddress` | - | - | - |
| `city` | `de.averbis.textanalysis.types.pharma.module3.City` | - | - | city |
| `postCode` | `de.averbis.textanalysis.types.pharma.module3.PostCode` | - | - | postCode |
| `country` | `de.averbis.textanalysis.types.pharma.module3.Country` | - | - | country |

##### OperationTypeContainer

Full Name: `de.averbis.textanalysis.types.pharma.module3.OperationTypeContainer`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 92: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `operationTypes` | `uima.cas.FSArray` | - | - | operationTypes |

##### PostAddress

Full Name: `de.averbis.textanalysis.types.pharma.module3.PostAddress`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

##### ManufacturersContext

Full Name: `de.averbis.textanalysis.types.pharma.module3.ManufacturersContext`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### OperationTypeContext

Full Name: `de.averbis.textanalysis.types.pharma.module3.OperationTypeContext`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 93: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `elements` | `uima.cas.FSArray` | - | - | elements |

##### Country

Full Name: `de.averbis.textanalysis.types.pharma.module3.Country`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

##### StreetIndicatorPrefix

Full Name: `de.averbis.textanalysis.types.pharma.module3.StreetIndicatorPrefix`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### StreetIndicatorPostfix

Full Name: `de.averbis.textanalysis.types.pharma.module3.StreetIndicatorPostfix`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### OperationType

Full Name: `de.averbis.textanalysis.types.pharma.module3.OperationType`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 94: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `value` | `uima.cas.String` | - | - | value |

##### PostCode

Full Name: `de.averbis.textanalysis.types.pharma.module3.PostCode`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

##### CompanyPostfix

Full Name: `de.averbis.textanalysis.types.pharma.module3.CompanyPostfix`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

##### City

Full Name: `de.averbis.textanalysis.types.pharma.module3.City`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

##### ManufacturerName

Full Name: `de.averbis.textanalysis.types.pharma.module3.ManufacturerName`

Description: -

Parent Type: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

##### ManufacturerAddressPart

Full Name: `de.averbis.textanalysis.types.pharma.module3.ManufacturerAddressPart`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

-

Maven Coordinates

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <version>0.7.0</version>
</dependency>
```

Imports

• de.averbis.textanalysis.typesystems.AverbisTypeSystem

##### AdverseEvent

Full Name: `de.averbis.textanalysis.types.pharma.AdverseEvent`

Description: -

Parent Type: `de.averbis.extraction.types.CoreAnnotation`

Table 95: Features

| Name | Range | Element Type | Multiple References Allowed | Description |
| --- | --- | --- | --- | --- |
| `concept` | `de.averbis.extraction.types.Concept` | - | - | - |
| `label` | `uima.cas.String` | - | - | - |
| `serious` | `uima.cas.Boolean` | - | - | - |

### Language Detection

#### LanguageCategorizer

##### General

The LanguageCategorizer recognizes and sets the text language on a CAS object.

Depending on the LanguageDetectorResource configured, one or multiple languages of the text are predicted.

##### Input

This component does not require annotations.

##### Output
• Sets the language of the CAS object.

• `de.averbis.extraction.types.Category` - (optional) category annotations are set.

##### Configuration

Implementation: de.averbis.textanalysis.components.languagecategorizer.LanguageCategorizer

Table 96: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `allowedLanguages` | `String` | `true` | `false` | `en, de` | The list of languages allowed to be set as document language. |
| `useUnknownLanguage` | `Boolean` | `false` | `false` | `true` | If a language cannot be determined or is not an allowed language, set the language to "unknown" (true) or leave it unset (false). |
| `overwriteExisting` | `Boolean` | `false` | `false` | `false` | If true, an existing document language will be overwritten. |
| `maxCharacterLimit` | `Integer` | `false` | `false` | `20000` | The number of characters to be analysed. Can be used to increase performance when categorizing large texts. |
| `addCategoryAnnotations` | `Boolean` | `false` | `false` | `false` | If true, UIMA Category annotations (languages and confidences) are added to the CAS. |
| `shortTextSizeTrigger` | `Integer` | `false` | `false` | `200` | The number of characters below which the short-text algorithm is used to guess the language. |
| `setDocumentLanguage` | `Boolean` | `false` | `false` | `true` | If true, the determined language is set as the document language on the JCas. |

Table 97: External Resources

| Name | Optional | Interface/Implementation | Description |
| --- | --- | --- | --- |
| `languageDetectorResourceShort` | `false` | `de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource` | Resource holding a language detector for short texts. |
| `languageDetectorResourceDefault` | `false` | `de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource` | Resource holding a language detector for default texts. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-categorizer</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### LanguageDetectorResource

##### General

The default implementation of the LanguageDetectorResource uses the language-detector library by Optimaize, which builds on Shuyo Nakatani's Language Detection Library. To detect the language, the probability of each configured language is computed from the character n-grams observed in the text using a naive Bayes model.

Standard models are supplied for 16 languages. Note that these models were not trained by Averbis; they come from the underlying `com.optimaize.langdetect` library.
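The n-gram scoring described above can be sketched in a few lines. This is a toy illustration of the general technique, not the optimaize implementation; the tiny "training texts" and the smoothing constant are invented for the example.

```python
import math
from collections import Counter

# Hypothetical per-language sample texts standing in for trained profiles.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und schläft",
}

def ngrams(text, n=3):
    """Character n-grams of the text (including spaces)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(texts):
    """Build a trigram frequency profile per language."""
    models = {}
    for lang, text in texts.items():
        counts = Counter(ngrams(text))
        models[lang] = (counts, sum(counts.values()))
    return models

def detect(text, models, vocab_size=10000):
    """Naive Bayes: sum log-probabilities of the observed trigrams
    per language (with add-one smoothing) and pick the best."""
    scores = {}
    for lang, (counts, total) in models.items():
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in ngrams(text)
        )
    return max(scores, key=scores.get)

models = train(TRAINING)
print(detect("the dog jumps", models))    # prints "en"
print(detect("der hund springt", models)) # prints "de"
```

The real library additionally normalizes the input and ships much larger profiles, but the scoring idea is the same.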

##### Configuration

Implementation: de.averbis.textanalysis.resources.languagedetectorresource.LanguageDetectorResource

Table 98: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | `String` | `false` | `false` | `languagedetector` | Resource-specific subdirectory against which all relative paths are resolved. |
| `category` | `String` | `false` | `false` | `default` | The model category to be used. |
| `availableLanguages` | `String` | `false` | `false` | `en, de, fr, es, it, pt` | The list of languages whose models should be loaded. |
| `useLowerCase` | `Boolean` | `false` | `false` | `false` | If true, the text is converted to lower case before determining the language category. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-detector-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### LanguageSetter

##### General

This component sets the document language to a user-defined value.

##### Input

The component does not expect any annotations.

##### Output
• The component sets the parameter `documentLanguage` in the CAS object.

##### Configuration

Implementation: de.averbis.textanalysis.components.languagesetter.LanguageSetter

Table 99: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `language` | `String` | `false` | `true` | - | The document language to set if not already set in the CAS. |
| `overwriteExisting` | `Boolean` | `false` | `true` | `false` | If true, an existing document language will be overwritten. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>language-setter</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Sentence Detection

#### OpennlpSentenceAnnotator

##### General

Using machine learning techniques, sentences can often be detected more reliably than with simple rule-based approaches. This sentence annotator is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" (biomedical literature).

##### Input

The component does not expect annotations, but instead works on the document text.

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type `de.averbis.extraction.types.Sentence`.

##### Configuration

Implementation: de.averbis.textanalysis.components.opennlpsentenceannotator.OpennlpSentenceAnnotator

Table 100: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `splitLinebreak` | `Boolean` | `false` | `false` | `false` | If true, additionally adds sentence splits at all line breaks. |
| `enclosingSpanType` | `String` | `false` | `false` | - | If set, sentence detection is performed only within annotations of this type; otherwise on the whole document. |

Table 101: External Resources

| Name | Optional | Interface/Implementation | Description |
| --- | --- | --- | --- |
| `opennlpSentenceDetectorResource` | `false` | `de.averbis.textanalysis.resources.opennlpsentencedetectorresource.OpennlpSentenceDetectorResource` | Resource holding a map of available SentenceDetector models per language. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-sentence-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpSentenceDetectorResource

##### General

This resource encapsulates the statistical Sentence Detector model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.opennlpsentencedetectorresource.OpennlpSentenceDetectorResource

Table 102: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | `String` | `false` | `false` | `opennlpsentencedetector` | Resource-specific subdirectory against which all relative paths are resolved. |
| `genre` | `String` | `false` | `false` | `newspaper` | The genre of the model family to be used (e.g. newspaper, bionlp). |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-sentence-detector-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### RegexSentenceAnnotator

##### General

A simple and often very efficient approach to sentence detection is splitting the text on the basis of language-specific rules. The RegexSentenceAnnotator splits the text into sentences at the separators ".", "!" and "?". The separator is part of the sentence annotation.

However, a sentence split is only performed if such a separator is followed by at least one whitespace character (or line break) and then an alphanumeric character that is not a lowercase letter.

The advantage of this method is that it is fast and its behaviour is easy for the user to follow. In many applications, this simple approach is perfectly adequate.

Known weaknesses and problems:

• Splits sentences at abbreviations (etc., Prof. Dr. Maier).

• Problems with dates (e.g. "2. Mai 2012").
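The splitting rule and its abbreviation weakness can be illustrated with a few lines of Python. This is a sketch of the described rule, not the shipped component; the "alphanumeric, no lowercase letter" condition is approximated with an ASCII lookahead.

```python
import re

# Split after ".", "!" or "?" only when followed by whitespace and then an
# upper-case letter or digit; the separator stays with the preceding sentence.
SPLIT = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')

def split_sentences(text):
    return SPLIT.split(text)

# Normal case: no split before the lowercase "z. b." abbreviation.
print(split_sentences("This is one. And two! z. b. stays together."))
# -> ['This is one.', 'And two! z. b. stays together.']

# Known weakness: capitalized abbreviations trigger spurious splits.
print(split_sentences("Prof. Dr. Maier came."))
# -> ['Prof.', 'Dr.', 'Maier came.']
```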

##### Input

The component does not expect annotations, but works on the document text.

##### Output

The component creates the following annotations:

• `de.averbis.extraction.types.Sentence`

##### Configuration

Implementation: de.averbis.textanalysis.components.regexsentenceannotator.RegexSentenceAnnotator

Table 103: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `regularExpression` | `String` | `false` | `true` | `([.?!)(\s)([^\p{Ll}])]` | The regular expression to split sentences at. |
| `implementation` | `String` | `false` | `true` | `JavaPattern` | The regular expression library implementation. Available: JavaPattern, JRegex, Brics, RE2J. Warning: not all implementations support all regex constructs. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>regex-sentence-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Tokenization

#### JTokAnnotator

##### General

This component uses the JTok library to recognize tokens, sentences and paragraphs, using cascaded regular expressions and language-specific resources. The JTok library currently provides resources for the languages `en`, `de` and `it` (without special genres); there are also resources for special genres in the languages `de` and `fr`. Note that resources cannot yet be loaded from the datapath.

##### Input

This component requires no specific annotations.

##### Output

The component creates the following annotations (depending on the configuration):

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Paragraph`

• `de.averbis.extraction.types.Abbreviation`

##### Configuration

Implementation: de.averbis.textanalysis.components.jtokannotator.JTokAnnotator

Table 104: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | `String` | `false` | `true` | `jtokannotator` | Resource-specific subdirectory against which all relative paths are resolved. |
| `addTokens` | `Boolean` | `false` | `true` | `true` | Create Token annotations. |
| `addAbbreviations` | `Boolean` | `false` | `true` | `true` | Create Abbreviation annotations. |
| `addSentences` | `Boolean` | `false` | `true` | `true` | Create Sentence annotations. |
| `addParagraphs` | `Boolean` | `false` | `true` | `true` | Create Paragraph annotations. |
| `applyPostProcessing` | `Boolean` | `false` | `true` | `true` | Apply additional postprocessing that fixes common sentence-splitting errors. |
| `genre` | `String` | `false` | `true` | `default` | Genre specifying an external configuration and its resources. If not available, the component falls back to the JTok config files. |
| `availableLanguages` | `String` | `true` | `true` | `de, en, fr` | List of languages available to the annotator. If no language configuration is available for the given genre, the default configuration is applied. The default configuration is also used for languages not in this list. |
| `globalEnclosingSpan` | `String` | `false` | `true` | `uima.tcas.DocumentAnnotation` | Type of annotations specifying the enclosing span to be tokenized. Overrides the parameter enclosingSpan. |
| `enclosingSpan` | `String` | `false` | `true` | `de.averbis.extraction.types.Sentence` | Type of annotations specifying the enclosing span to be tokenized. |
| `normalizationRequired` | `Boolean` | `false` | `true` | `true` | If true, additional normalization is applied to every token. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>jtok-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpTokenAnnotator

##### General

Machine learning techniques can often solve tokenization problems better than simple rule-based approaches. This tokenizer is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

Using the included training module, this component can easily be adapted to new languages and genres by retraining.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Sentence`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Token`

##### Configuration

Implementation: de.averbis.textanalysis.components.opennlptokenannotator.OpennlpTokenAnnotator

Table 105: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `splittingRules` | `String` | `true` | `false` | - | Additional regular expressions used to further split generated tokens. |
| `enclosingSpan` | `String` | `false` | `true` | `de.averbis.extraction.types.Sentence` | Type of annotations specifying the enclosing span to be tokenized. |
| `normalizationRequired` | `Boolean` | `false` | `true` | `true` | If true, additional normalization is applied to every token. |

Table 106: External Resources

| Name | Optional | Interface/Implementation | Description |
| --- | --- | --- | --- |
| `opennlpTokenizerResource` | `false` | `de.averbis.textanalysis.resources.opennlptokenizerresource.OpennlpTokenizerResource` | Resource holding a map of available Tokenizer models per language. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-token-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpTokenizerResource

##### General

This resource encapsulates the statistical tokenizer model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.opennlptokenizerresource.OpennlpTokenizerResource

Table 107: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | `String` | `false` | `false` | `opennlptokenizer` | Resource-specific subdirectory against which all relative paths are resolved. |
| `genre` | `String` | `false` | `false` | `newspaper` | The genre of the model family to be used (e.g. newspaper, bionlp). |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-tokenizer-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### RegexTokenAnnotator

##### General

A simple and often very efficient approach to tokenization is splitting the text on the basis of language-specific rules. The RegexTokenAnnotator uses a set of defined delimiters to separate words: each time a delimiter occurs in the text, a new token is started. The delimiters themselves (e.g. "-") are not marked as token annotations.

The advantage of this method is that it is fast and its behaviour is easy for the user to follow. In many applications, this simple approach is perfectly adequate. In some applications, however, especially in special domains such as biomedical literature, this procedure splits tokens that are valid there (e.g. proper names such as "IL-2", which should not be separated).
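A minimal sketch of delimiter-based tokenization illustrates both the normal case and the "IL-2" problem. The delimiter set below is a simplified ASCII subset of the component's default, chosen for the example.

```python
import re

# Simplified delimiter class (subset of the component's default set);
# the delimiters themselves produce no tokens.
DELIMITERS = r'[|.+*;,!?/ :@_()"-]'

def tokenize(text):
    # Split on any delimiter and drop the empty strings between
    # adjacent delimiters.
    return [t for t in re.split(DELIMITERS, text) if t]

print(tokenize("The display-unit works."))
# -> ['The', 'display', 'unit', 'works']

print(tokenize("IL-2 activity"))
# -> ['IL', '2', 'activity']  (unwanted split of a valid proper name)
```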

##### Input

The component does not necessarily expect annotations.

##### Output

The component creates the following annotations:

• `de.averbis.extraction.types.Token`

##### Configuration

Implementation: de.averbis.textanalysis.components.regextokenannotator.RegexTokenAnnotator

Table 108: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `regularExpression` | `String` | `false` | `true` | ``[\|.+*;,!?/ :@_()"`„”““—’‘'¿-]`` | Regular expression to split the text at. |
| `implementation` | `String` | `false` | `true` | `JavaPattern` | The regular expression library implementation. Available: JavaPattern, JRegex, Brics, RE2J. Warning: not all implementations support all regex constructs. |
| `enclosingSpan` | `String` | `false` | `true` | `de.averbis.extraction.types.Sentence` | Type of annotations specifying the enclosing span to be tokenized. |
| `normalizationRequired` | `Boolean` | `false` | `true` | `true` | If true, additional normalization is applied to every token. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>regex-token-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### InvariantTokenTagger

##### General

Invariant taggers mark tokens as invariant if they should not be treated by subsequent linguistic processing steps such as stemming or compound decomposition. This is the case, for example, with proper names: decomposing the compound "Bergmann" (as in "Ingmar Bergmann") into "Berg" + "Mann" would not be correct. For this purpose, the flag "isInvariant" can be set on each token. If this flag is set to true, subsequent components leave the word untreated.

The InvariantTokenTagger uses a simple rule-based approach: a token is tagged as invariant if it does not match a standard token pattern (regular expression). A valid, i.e. non-invariant, token is defined as either entirely upper case ("DISPLAY") or lower case with an optional upper-case initial ("Display" or "display"); apart from that, only hyphens are allowed. Note: this is not optimized for all languages.
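The rule can be sketched as follows. The pattern is an ASCII stand-in for the component's default `(\p{Lu}?\p{Ll}*|\p{Lu}+)`, and the hyphen and minimum-length handling follow the description above; this is an illustration, not the shipped implementation.

```python
import re

# ASCII approximation of (\p{Lu}?\p{Ll}*|\p{Lu}+): all upper case,
# or lower case with an optional leading capital.
VALID = re.compile(r'([A-Z]?[a-z]*|[A-Z]+)$')
MIN_LENGTH = 4  # mirrors the validTokenLength default

def is_invariant(token):
    # Tokens shorter than the minimum length are always invariant.
    if len(token) < MIN_LENGTH:
        return True
    # Hyphens are allowed, so each hyphen-separated part is checked.
    return not all(VALID.match(part) for part in token.split("-"))

print(is_invariant("DISPLAY"))  # False: all caps matches the pattern
print(is_invariant("Display"))  # False: capital + lower case matches
print(is_invariant("IL-2"))     # True: the part "2" does not match
print(is_invariant("ab"))       # True: below the minimum length
```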

##### Input

The following annotations are mandatory for this component:

• `de.averbis.extraction.types.Token`

##### Output

The component sets the feature "isInvariant" of the token annotations. No new annotations are produced.

##### Configuration

Implementation: de.averbis.textanalysis.components.invarianttokentagger.InvariantTokenTagger

Table 109: Configuration Parameters

| Name | Type | MultiValued | Mandatory | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `validTokenPattern` | `String` | `false` | `true` | `(\p{Lu}?\p{Ll}*\|\p{Lu}+)` | A valid (non-invariant) token is either all upper case ("ANZEIGE") or lower case with an optional leading upper-case letter ("Anzeige" or "anzeige"); hyphens are allowed. |
| `validTokenLength` | `Integer` | `false` | `true` | `4` | Minimum valid (non-invariant) token length. Shorter tokens are tagged as invariant. |

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>invariant-token-tagger</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Stemming and Segmentation

#### SnowballStemAnnotator

##### General

The SnowballStemmer is based on the Porter stemming algorithm, the most common stemming approach. It is a rule-based procedure that applies a set of language-specific shortening rules until a minimum number of syllables is reached.

Porter stemming is an "aggressive" approach: the resulting stems are not necessarily valid words and are often not linguistically correct word stems.

Reference: M. F. Porter, "An algorithm for suffix stripping", 1980.
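A toy illustration of suffix stripping combined with an exclude pattern like the component's `.*itis` default. The suffix list and minimum stem length are invented for the example; this is not the Snowball rule set.

```python
import re

# Exceptions keep their surface form, mirroring the excludePattern behaviour.
EXCLUDE = re.compile(r'.*itis$')

# Invented, heavily simplified suffix list (longest first).
SUFFIXES = ["ization", "ational", "ing", "ed", "es", "s"]

def stem(token):
    if EXCLUDE.match(token):
        # The stemmer is skipped; the covered text becomes the stem value.
        return token
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping a minimal stem length.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(stem("cells"))       # -> "cell"
print(stem("cellulitis"))  # -> "cellulitis" (caught by the exclude pattern)
```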

##### Input

The following annotations are mandatory for this component:

• `de.averbis.extraction.types.Token`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The following annotations are created:

• `de.averbis.extraction.types.Stem`

In addition, feature references from the token annotations are made to the respective stem annotation.

##### Configuration

Implementation: de.averbis.textanalysis.components.snowballstemannotator.SnowballStemAnnotator

Table 110: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `allLowerCase` | If true, the token to be stemmed is transformed to lower case before stemming. | `false` | `Boolean` | `false` | `true` |
| `excludePattern` | A regular expression that specifies exceptions for the stemmer. If the pattern matches, the stemmer is skipped and the covered text of the token is assigned as the stem value. A value like `.*itis` means that the token "cellulitis" is not stemmed to "cell"; instead, the stem value remains "cellulitis". The parameter allLowerCase is applied before this parameter and may thus influence its behavior. | `.*itis` | `String` | `false` | `false` |

Maven Coordinates:

```
<dependency>
<groupId>de.averbis.textanalysis</groupId>
<artifactId>snowball-stem-annotator</artifactId>
<version>3.5.0</version>
</dependency>
```

#### MorphoSemanticSegmentAnnotator

##### General

The morphosemantic analysis is based on the Morphosaurus algorithm. This was originally developed for use in medical language.

See also Foundation, Implementation and Evaluation of the MorphoSaurus System, dissertation by Kornél Markó, JULIE Lab, University of Jena, 2007.
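The basic idea of splitting a word into semantically meaningful segments can be pictured with a greedy longest-match sketch. The mini-lexicon and the matching strategy here are invented for illustration; the actual Morphosaurus lexicon and algorithm are far more elaborate:

```python
# Toy illustration of morphosemantic segmentation: greedy longest-match
# splitting of a word into known lexicon segments.
LEXICON = {"gastr", "o", "enter", "itis", "append", "ectomy"}

def segment(word: str):
    """Split word into lexicon segments, or return None if impossible."""
    segments, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):  # try longest match first
            if word[pos:end] in LEXICON:
                segments.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no full segmentation found
    return segments

print(segment("gastroenteritis"))  # ['gastr', 'o', 'enter', 'itis']
```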

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Token`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Segment`

These annotations are also linked to the respective token and stored there as a reference.

##### Configuration

Implementation: de.averbis.textanalysis.components.morphosemanticsegmentannotator.MorphoSemanticSegmentAnnotator

Table 111: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | Resource-specific subdirectory against which all relative paths are resolved. | `morphosemanticsegmentannotator` | `String` | `false` | `false` |
| `msiEngineLexiconFile` | The core engine lexicon. | `msi.data` | `String` | `false` | `true` |
| `msiEngineReplacementFile` | The core engine replacement file. | `replacement.xml` | `String` | `false` | `true` |
| `msiEngineAdditionalLexiconFiles` | Additional lexicon files for the core engine. | - | `String` | `true` | `false` |
| `msiEngineLanguages` | The languages to load from the lexica. | `en, de` | `String` | `true` | `true` |
| `msiEngineNoMatchPlain` | - | `true` | `Boolean` | `false` | `true` |
| `msiEngineSegmenterMode` | The core engine segmenter mode: RIGHT, LEFT, BOTH. | `BOTH` | `String` | `false` | `true` |
| `msiEngineConcatPrefix` | Concat prefix. | `false` | `Boolean` | `false` | `true` |
| `msiEngineConcatSuffix` | Concat suffix. | `false` | `Boolean` | `false` | `true` |
| `msiEngineMIDs` | Attach MIDs to segmentation. | `false` | `Boolean` | `false` | `true` |
| `msiNoPreferredForIVs` | No preferred terms for type IV. | `true` | `Boolean` | `false` | `true` |
| `enrichAbbreviations` | Defines whether abbreviations should be enriched with segments. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```
<dependency>
<groupId>de.averbis.textanalysis</groupId>
<artifactId>morpho-semantic-segment-annotator</artifactId>
<version>3.5.0</version>
</dependency>
```

### Abbreviation Detection

#### AbbreviationAnnotator

##### General

This component uses a dictionary to recognize abbreviations. If the dictionary contains the full form of the abbreviation, this is saved in the annotation. The component can load different abbreviation lists (genres) for a specific language. Currently available:

• de

    • default

    • bionlp

    • latin

    • law

    • literature_reference

• en

    • default

    • bionlp

    • latin

    • oxford

• fr

    • bionlp

    • latin
##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Token`

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Abbreviation`

The abbreviation annotations created are associated with their full form, if available.
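The lookup can be pictured schematically as follows. The mini-dictionary entries are invented examples; the real component loads its genre- and language-specific lists from the configured resources:

```python
# Schematic dictionary-based abbreviation detection. The entries below
# are made-up examples, not taken from the component's genre lists.
ABBREVIATIONS = {
    "e.g.": "for example",
    "DNA": "deoxyribonucleic acid",
    "ca.": "circa",
}

def annotate_abbreviations(tokens):
    """Return (token, fullform) pairs for tokens found in the dictionary."""
    return [(t, ABBREVIATIONS[t]) for t in tokens if t in ABBREVIATIONS]

print(annotate_abbreviations(["The", "DNA", "sample", ",", "ca.", "5", "ml"]))
# [('DNA', 'deoxyribonucleic acid'), ('ca.', 'circa')]
```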

##### Configuration

Implementation: de.averbis.textanalysis.components.abbreviationannotator.AbbreviationAnnotator

Table 112: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | Resource-specific subdirectory against which all relative paths are resolved. | `abbreviation` | `String` | `false` | `true` |
| `genres` | The genres of abbreviations that should be utilized. | `default` | `String` | `true` | `true` |
| `fullformTokenizerPattern` | The pattern for tokenizing the full form of an abbreviation. | `\s+\|,\|\-` | `String` | `false` | `true` |
| `tokenizeFullform` | Option to tokenize the full form of all abbreviations. | `true` | `Boolean` | `false` | `true` |

Maven Coordinates:

```
<dependency>
<groupId>de.averbis.textanalysis</groupId>
<artifactId>abbreviation-annotator</artifactId>
<version>3.5.0</version>
</dependency>
```

### Numeric Values, Measurements, Times and Dates

#### NumericValueAnnotator

##### General

This component recognizes a wide variety of numeric expressions and determines their numeric value. These include simple numbers such as 2.3, but also more complex expressions such as ½ million or fuenfundzwanzig (German for twenty-five). Furthermore, the component is able to recognize roman numerals and assign the equivalent numeric value. Written-out numbers are currently only supported in English, German and French.

The functional elements of this component are divided into individual reusable components that can also be configured individually and recombined. The main component NumericValueAnnotator consists of the following elements:

1. ConjunctionFragment.ruta: These UIMA Ruta rules split tokens for detecting smaller numeric fragments.

2. RomanNumeral.ruta: These UIMA Ruta rules annotate different kinds of roman numerals and calculate their numeric equivalents in a Java procedure.

3. RutaTokenSeedAnnotator: This component adds seed annotations for the subsequent dictionary lookup.

4. SimpleDictionaryAnnotator: This component adds different annotations based on the given word lists.

5. NumericValue.ruta: These UIMA Ruta rules annotate different kinds of numeric values.
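The roman numeral value calculation performed in element 2 can be sketched as follows (a minimal standalone illustration, not the component's Java implementation):

```python
# Sketch of roman numeral evaluation: add symbol values, but subtract
# when a smaller symbol precedes a larger one (IV = 4, XC = 90, ...).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        if i + 1 < len(numeral) and ROMAN_VALUES[numeral[i + 1]] > value:
            total -= value  # subtractive notation
        else:
            total += value
    return total

print(roman_to_int("XIV"))   # 14
print(roman_to_int("MMDC"))  # 2600
```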

##### Input

The component requires annotations for numerical base units. The exact type of these annotations is determined by the configuration parameter `number`. Normally this value is set to `org.apache.uima.ruta.type.NUM`. Annotations of this type are created automatically if they do not already exist and if the configuration parameter `seeders` has not been adjusted.

If configured accordingly, the component can also process certain other annotations, if present. This includes `LanguageContainer` and the annotations of the types specified in the parameter `noNumericValue`.

Irrespective of these annotations, the component can also use any type of annotations used in the rule-based implementation. These include for example `Multiplicator` or `ConjunctionFragment`.

##### Output

The component generates different types of annotations. The actual result of the component is:

• `de.averbis.textanalysis.types.numericvalue.NumericValue`

Additionally it creates annotations of roman numerals:

• `de.averbis.textanalysis.types.numericvalue.RomanNumeral`

The numeric value of the detected number is stored in the `value` feature.
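The locale-dependent separator handling controlled by the parameters `decimalSeparator`, `thousandsSeparator` and `languageSpecific` can be illustrated with a simplified sketch (not the component's implementation):

```python
# Simplified illustration of locale-dependent number normalization:
# English uses "." as decimal and "," as thousands separator,
# German the other way around.
SEPARATORS = {
    "en": {"decimal": ".", "thousands": ","},
    "de": {"decimal": ",", "thousands": "."},
}

def parse_numeric(text: str, language: str) -> float:
    sep = SEPARATORS[language]
    text = text.replace(sep["thousands"], "")   # drop grouping separators
    text = text.replace(sep["decimal"], ".")    # normalize decimal point
    return float(text)

print(parse_numeric("1,000.95", "en"))  # 1000.95
print(parse_numeric("1.000,95", "de"))  # 1000.95
```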

##### Configuration

Implementation: de.averbis.textanalysis.components.numericvalueannotator.NumericValueAnnotator

Table 113: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `allowPeriodDecimalSeparator` | Option to allow the usage of a period as the decimal separator for all locales, e.g., also in German. | `true` | `Boolean` | `false` | `true` |
| `detectComplexPatterns` | Option to detect more complex patterns of numeric values like 2^1/2 or fuenfundzwanzig. | `true` | `Boolean` | `false` | `true` |
| `detectFractions` | Option to detect fractions like 125/75. | `true` | `Boolean` | `false` | `true` |
| `mergeConsecutiveEqualNumbers` | Option to merge consecutive equal numbers like '5 (five)'. | `false` | `Boolean` | `false` | `true` |
| `dictionaryLookup` | Option to apply dictionary lookup for detecting special numeric elements like ² or five. | `true` | `Boolean` | `false` | `true` |
| `decimalSeparator` | Regular expression to validate decimal separators as in 2.6. | `\.` | `String` | `false` | `true` |
| `thousandsSeparator` | Regular expression to validate thousands separators as in 3,000. | `,` | `String` | `false` | `true` |
| `conjunctionFragment` | Regular expression to detect conjunction fragments like 'und' as in fuenfundzwanzig. | `and\|und\|et` | `String` | `false` | `true` |
| `simpleNumericValuesOnlyWithoutSpaces` | Simple numeric values with punctuation marks are only annotated if there are no spaces in between. | `true` | `Boolean` | `false` | `true` |
| `language` | Default value of the language. Normally overwritten by the DocumentAnnotation language or by the LanguageContainer language. | `x-unspecified` | `String` | `false` | `true` |
| `noNumericValue` | List of types specifying annotation spans in which no numeric value should be detected, e.g., dates. | - | `String` | `true` | `true` |
| `number` | The basic annotation type for digits. | `org.apache.uima.ruta.type.NUM` | `String` | `false` | `true` |
| `languageSpecific` | If activated, language-dependent values are automatically assigned to the parameters decimalSeparator, thousandsSeparator and conjunctionFragment. | `true` | `Boolean` | `false` | `true` |
| `allowLeadingZeros` | Option to allow numeric values that start with zeros, like 02. | `false` | `Boolean` | `false` | `true` |
| `detectRomanNumerals` | Option to detect roman numerals like XIV, II, MMDC. | `false` | `Boolean` | `false` | `true` |
| `seeders` | A UIMA Ruta specific parameter specifying the initial seeders that should be applied. | `org.apache.uima.ruta.seed.DefaultSeeder` | `String` | `true` | `false` |
| `reindexOnly` | A UIMA Ruta specific parameter specifying the annotation types that should be reindexed. | `uima.tcas.Annotation` | `String` | `true` | `false` |
| `indexOnlyMentionedTypes` | A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed. | `false` | `Boolean` | `false` | `false` |
| `indexAdditionally` | A UIMA Ruta specific parameter specifying additional annotation types that should be indexed. | `**` | `String` | `true` | `false` |
| `strictImports` | A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved. | `true` | `Boolean` | `false` | `false` |
| `debug` | A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution. | `false` | `Boolean` | `false` | `false` |
| `debugWithMatches` | A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches. | `false` | `Boolean` | `false` | `false` |

Maven Coordinates:

```
<dependency>
<groupId>de.averbis.textanalysis</groupId>
<artifactId>numeric-value-annotator</artifactId>
<version>3.5.0</version>
</dependency>
```

###### RutaTokenSeedAnnotator

A more detailed description of RutaTokenSeedAnnotator can be found in the corresponding chapter.

###### SimpleDictionaryAnnotator

A more detailed description of SimpleDictionaryAnnotator can be found in the corresponding chapter.

##### Rules

RomanNumeral.ruta

```
PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
UIMAFIT de.averbis.textanalysis.components.numericvalueannotator.RomanNumeralValueCalculator;

FOREACH(cap) CAP{REGEXP("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")}{
cap{ -> CREATE(RomanNumeral)};
}

FOREACH(cw) CW{REGEXP("M?C?D?L?X?V?I?")}{
cw { -> CREATE(RomanNumeral)};
}

EXEC(RomanNumeralValueCalculator, {RomanNumeral});
```

ConjunctionFragment.ruta

```
PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

STRING conjunctionFragment = "and|und|et";
BOOLEAN detectComplexPatterns = true;
BOOLEAN languageSpecific = true;
STRING language = "x-unspecified";

BLOCK(languageSepcific) Document{languageSpecific} {
BLOCK(en) Document{language == "en"} {
Document{-> conjunctionFragment = "and"};
}
BLOCK(de) Document{language == "de"} {
Document{-> conjunctionFragment = "und"};
}
BLOCK(fr) Document{language == "fr"} {
Document{-> conjunctionFragment = "et"};
}
}

Document{detectComplexPatterns,-REGEXP(conjunctionFragment, "")} -> {conjunctionFragment -> ConjunctionFragment;};
```

NumericValue.ruta

```
PACKAGE de.averbis.textanalysis.components.numericvalueannotator;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

SCRIPT de.averbis.textanalysis.components.numericvalueannotator.RomanNumeral;

// configuration parameters:
BOOLEAN allowPeriodDecimalSeparator = true;
BOOLEAN detectComplexPatterns = true;
BOOLEAN detectFractions = true;
BOOLEAN mergeConsecutiveEqualNumbers = false;
STRING decimalSeparator = "\\.";
STRING thousandsSeparator = ",";
BOOLEAN simpleNumericValuesOnlyWithoutSpaces = true;
BOOLEAN languageSpecific = true;
STRING language = "x-unspecified";
TYPELIST noNumericValue;
TYPE number = NUM;
BOOLEAN detectRomanNumerals = false;

STRINGLIST localesWithPeriodDecimalSeparator = {"en"};

// helper types
DECLARE NumberWithValue (DOUBLE value);
DECLARE NumberWithValue Multiplicator, Exponent;

// language specific settings
Document{IS(uima.tcas.DocumentAnnotation)-> GETFEATURE("language", language)};
LanguageContainer{-> GETFEATURE("language", language)};

BLOCK(languageSepcific) Document{languageSpecific} {
BLOCK(separators) Document{CONTAINS(localesWithPeriodDecimalSeparator, language)} {
Document{-> decimalSeparator = "\\.", thousandsSeparator = ","};
}
Document{-> decimalSeparator = ",", thousandsSeparator = "\\."};
BLOCK(fr) Document{language == "fr"} {
Document{-> thousandsSeparator = "\\s"};
}
}

NumericValue{PARTOF(noNumericValue)-> UNMARK(NumericValue)};
ConjunctionFragment{PARTOF(Multiplicator)-> UNMARK(ConjunctionFragment)};

CONDITION isThousandsSep() = REGEXP(thousandsSeparator);
CONDITION isDecimalSep() = REGEXP(decimalSeparator);

DOUBLE value;
// normal numbers like 1,000.95
FOREACH(num) number{-PARTOF(noNumericValue)}{
(num{-PARTOF(NumericValue)}
(PM{isThousandsSep()} number{REGEXP("...")})*
(PM{isDecimalSep()} number)
){PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
(num{-PARTOF(NumericValue), num.ct!= "0"}
(PM{isThousandsSep()} number{REGEXP("...")})+
){PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
(num{-PARTOF(NumericValue), allowPeriodDecimalSeparator} PERIOD number)
{PARSE(value, "en") -> CREATE(NumericValue, "value" = value)};
(num{-PARTOF(NumericValue)})
{PARSE(value, language) -> CREATE(NumericValue, "value" = value)};
}
FOREACH(num) NumericValue{}{
W{-REGEXP("[ex]", true)} @num{-> UNMARK(NumericValue)} W;
//        W{-REGEXP("[ex]", true)} @num{-> UNMARK(NumericValue)};
W{REGEXP("[A-Z]{1,3}")} @num{OR(REGEXP("\\d{1,2}"),REGEXP("\\d{2}\\.\\d{1}"))-> UNMARK(NumericValue)};
NUM PERIOD NUM PERIOD @num{-> UNMARK(NumericValue)};
num{mergeConsecutiveEqualNumbers, PARTOF(NumericValue) -> num.end = s2.end} WS* SPECIAL{REGEXP("[\\(\\[\\{]")} WS* n2:NumericValue{num.value == n2.value -> UNMARK(n2)} WS* s2:SPECIAL{REGEXP("[\\)\\]\\}]")};
}

Document{simpleNumericValuesOnlyWithoutSpaces -> REMOVERETAINTYPE(SPACE, BREAK)};

// Fractions with numerical values
BLOCK(dictionary) Document{detectFractions} {
FOREACH(num) NumericValue{}{
// fractions like 3/4
num{-> UNMARK(NumericValue)} SPECIAL{REGEXP("/")} NumericValue{-> UNMARK(NumericValue), GATHER(Fraction,1,3, "numerator" = 1, "denominator" = 3)};
// fractions like Seven out of 38
num{-> UNMARK(NumericValue)} SW? SW.ct=="of" NumericValue{-> UNMARK(NumericValue), GATHER(Fraction,1,4, "numerator" = 1, "denominator" = 4)};
}
}

// simple fractions
NumericValue{REGEXP("\\d")-> UNMARK(NumericValue)} @SPECIAL{REGEXP("/")} NumericValue{REGEXP("\\d")-> UNMARK(NumericValue), GATHER(Fraction,1,3, "numerator" = 1, "denominator" = 3)};

Fraction{-> CREATE(NumericValue, "value" = (Fraction.numerator.value / Fraction.denominator.value))};
SimpleFraction{-> CREATE(NumericValue, "value" = (SimpleFraction.numerator / SimpleFraction.denominator))};

BLOCK(complexPatterns) Document{detectComplexPatterns}{
FOREACH(num, false) NumericValue{}{
// exponents like 2^3, 2.3e13, 4²
(num exp:Exponent) {-> num.value = POW(num.value, exp.value), num.end = exp.end};
(num SPECIAL.ct=="^" exp:NumericValue{-> UNMARK(NumericValue)}) {-> num.value = POW(num.value, exp.value), num.end = exp.end};
(num W{REGEXP("e", true)} exp:NumericValue{-> UNMARK(NumericValue)}) {-> num.value = (num.value * (POW(10, exp.value))), num.end = exp.end};
// multiplication like 3x4, 2*2
(num ANY{REGEXP("×|x|\\*", true)} mult:NumericValue{-> UNMARK(NumericValue)}) {-> num.value = (num.value * mult.value), num.end = mult.end};
pre:NumericValue{PARTOF(W), num.value != pre.value -> UNMARK(NumericValue)} SPECIAL?{REGEXP("-")} num{IS(NumericValue), PARTOF(W) -> num.value = (num.value + pre.value), num.begin = pre.begin};
// combination with multipliers like 3 million
(num{IS(NumericValue)-> SHIFT(NumericValue,1,4)} SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
// add1:NumericValue?{-> num.value = (num.value + add1.value), UNMARK(NumericValue)}
( Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))} add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(NumericValue)} )*);
// fünfundzwanzig
(num{PARTOF(W)-> SHIFT(NumericValue,1,3)} ConjunctionFragment add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(NumericValue)}) {-> num.value = (num.value + add.value)};
// 2+3
(num{-> SHIFT(NumericValue,1,3)} SPECIAL.ct=="+" add:NumericValue{ -> UNMARK(NumericValue)}) {-> num.value = (num.value + add.value)};
}
}

Document{detectRomanNumerals -> CALL(RomanNumeral)};
```

#### MeasurementAnnotator

##### General

This component detects units, measurements and quantities. It can trace a given unit back to SI base units and normalize the numeric value at the same time. For example, the text passage `10cm` is recognized as 0.1 m (dimension L).

The functional elements of this component are divided into individual reusable components, which can also be configured individually and recombined. The main component MeasurementAnnotator consists of the following elements:

1. UnitAnnotator: This component recognizes units (annotation type Unit) following certain annotations, normally numbers.

2. UnitNormalizer: This component normalizes given units (annotation type Unit).

3. Measurement.ruta: These UIMA Ruta rules combine numeric value and unit annotations to form measurement annotations.

4. MeasurementNormalizer: This component normalizes the numeric value depending on the given unit.

5. RelativeMeasurementIntervalAnnotator: This component is a helper annotator for relative intervals.

These components are described in more detail below.
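The normalization described above (for example `10cm` → 0.1 m) can be sketched as follows. The factor table is a hand-written stand-in; the real component derives transformations from its unit resource:

```python
# Minimal sketch of normalizing a measurement to its SI base unit.
# The conversion table below is illustrative, not the unit resource.
TO_BASE = {
    "cm": ("m", 0.01),
    "mm": ("m", 0.001),
    "km": ("m", 1000.0),
    "mg": ("kg", 1e-6),
    "g":  ("kg", 0.001),
}

def normalize(value: float, unit: str) -> str:
    """Return the normalized value and base unit as a string."""
    base_unit, factor = TO_BASE[unit]
    # round() guards against floating-point noise in the product
    return f"{round(value * factor, 12)} {base_unit}"

print(normalize(10, "cm"))  # 0.1 m
```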
##### Input The component does not expect any mandatory annotations, but requires annotations of the types that are set by the configuration parameters to work correctly. ##### Output The component generates different types of annotations. The actual result of the component is: • `de.averbis.textanalysis.types.measurement.Measurement` These annotations combine a `de.averbis.textanalysis.types.numericvalue.NumericValue` and de.averbis.textanalysis.types.measurement.Unit` annotation and store their normalized values. In addition, the component can also create annotations of type: • `de.averbis.textanalysis.types.measurement.MeasurementInterval` ##### Configuration Implementation: de.averbis.textanalysis.components.measurementannotator.MeasurementAnnotator Table 114: Configuration Parameters NameTypeMultiValuedMandatory ``` anchorType ``` Description: Optional type for annotations after which the component should search for units. Default: ``` de.averbis.textanalysis.types.numericvalue.NumericValue ``` `String` `false` `true` ``` lookaheadType ``` Description: Optional type for basic annotation which should be used as lookahead starting from the anchorType. If no anchorType is given, then the component trys to parse all annotations, but only single annotations and no combinations. This means that the given type needs to cover the complete unit. Default: ``` org.apache.uima.ruta.type.RutaBasic ``` `String` `false` `true` ``` lookaheadSize ``` Description: Amount of annotations of lookaheadType that are used as lookahead. Default: ``` 15 ``` `Integer` `false` `true` ``` genres ``` Description: The categories/genres of unit data (subdirectories) that should be utilized. Multiple values are concatenated with a comma. Default: ``` default ``` `String` `false` `true` ``` languages ``` Description: The languages of unit data (subdirectories) that should be utilized. Multiple values are concatenated with a comma. 
Default: ``` en,de ``` `String` `false` `true` ``` ignoreWhitespaces ``` Description: If activated whitespace characters are ignored while units are parsed. Default: ``` true ``` `Boolean` `false` `true` ``` leftRecursive ``` Description: If activated multiplications and divisions are parsed from left to right, e.g., mg/s/m is (mg/s)/m. If deactivated mg/s/m is mg/(s/m). Default: ``` true ``` `Boolean` `false` `true` ``` identifierLookahead ``` Description: Additional lookahead of the parser for multi token units. Default: ``` 2 ``` `String` `false` `true` ``` avoidNumberOnlyUnits ``` Description: If activated, units with only number like 2/2 will be ignored. Default: ``` true ``` `String` `false` `true` ``` detectIntervals ``` Description: Option to detect intervals of measurements. Default: ``` true ``` `String` `false` `true` ``` dictionaryLookup ``` Description: Option to include a simple dictionary lookup for specific textual mentions. Default: ``` true ``` `Boolean` `false` `true` ``` seeders ``` Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied. Default: ``` org.apache.uima.ruta.seed.DefaultSeeder ``` `String` `true` `false` ``` reindexOnly ``` Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed. Default: ``` uima.tcas.Annotation ``` `String` `true` `false` ``` indexOnlyMentionedTypes ``` Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed. Default: ``` false ``` `Boolean` `false` `false` ``` indexAdditionally ``` Description: A UIMA Ruta specific parameter specifying if additional annotation types that should be indexed. Default: `**` `String` `true` `false` ``` strictImports ``` Description: A UIMA Ruta specific parameter specifying if only types that are explictly imported in the script are known and will be resolved. 
Default: ``` true ``` `Boolean` `false` `false` ``` debug ``` Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution. Default: ``` false ``` `Boolean` `false` `false` ``` debugWithMatches ``` Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches. Default: ``` false ``` `Boolean` `false` `false` Maven Coordinates: ``````<dependency> <groupId>de.averbis.textanalysis</groupId> <artifactId>measurement-annotator</artifactId> <version>3.5.0</version> </dependency> `````` ##### UnitAnnotator ##### General The component recognizes text passages with units, but not their normalized form. It has two different modes: either text passages are examined for certain annotations, e. g. numeric values, or the text of certain annotations are examined themselves. The first mode is activated by setting the configuration parameter `anchorType`. The text passages are searched for annotations of the configured type. The size and range of these text passages are determined by the configuration parameters `lookaheadType` and `lookaheadSize`. In the second mode only the configuration parameter `lookaheadType` is set. Only the text of the annotation of this type is examined here. In principle, a unit is recognized if the unit parser of the configured resource can recognize a unit. ##### Input The component is based on the annotations whose types are configured in the parameters `anchorType` and `lookaheadType`. ##### Output The component creates annotations of type: • `de.averbis.textanalysis.types.measurement.Unit` but does not set any features. For this purpose, the UnitNormalizer component must be used. 
##### Configuration Implementation: de.averbis.textanalysis.components.measurementannotator.UnitAnnotator Table 115: Configuration Parameters NameTypeMultiValuedMandatory ``` anchorType ``` Description: Optional type for annotations after which the component should search for units. `String` `false` `false` ``` lookaheadType ``` Description: Optional type for basic annotation which should be used as lookahead starting from the anchorType. If no anchorType is given, then the component trys to parse all annotations, but only single annotations and no combinations. This means that the given type needs to cover the complete unit. Default: ``` de.averbis.extraction.types.Token ``` `String` `false` `false` ``` lookaheadSize ``` Description: Amount of annotations of lookaheadType that are used as lookahead. Default: ``` 15 ``` `Integer` `false` `true` ``` ignoreWhitespaces ``` Description: If activated whitespace characters are ignored while units are parsed. Default: ``` true ``` `Boolean` `false` `true` Table 116: External Resources NameOptionalInterface/Implementation ``` unitResource ``` Description: Resource holding the unit implementation and data. `false` `de.averbis.textanalysis.resources.unitresource.UnitResource` Maven Coordinates: ``` ``` <dependency> <groupId>de.averbis.textanalysis</groupId> <artifactId>measurement-annotator</artifactId> <version>3.5.0</version> </dependency> ``` ``` ##### UnitResource ###### General This resource encapsulates the implementation for processing units, especially the parser for unit detection. The supported units and their synonyms are defined by the configuration parameters `genres` and `languages` and can be extended by additional configured values. Later genres overwrite previous genres, each genre can contain new units and synonyms for units, prefixes and operations in different languages. The structure of the additional files is explained below. 
###### Configuration Adaptations and extensibility The functionality of the resource is largely determined by additional properties files (file extension `. txt`). Each genre contains an optional number of specific files that have different tasks. The additional files are structured as follows (folder structure is indicated with hyphens): ```unit - default -- de --- aliases.txt --- operations.txt --- prefixes.txt -- en --- aliases.txt --- operations.txt --- prefixes.txt -- unit --- units.txt``` The main folder `unit` is configurable by the configuration parameter `resourceSpecificSubdirectory`. It contains a folder for each genre. In this example a genre with the name `default` is given. Each genre folder may contain multiple language-specific folders and one language-independent folder. The language-specific folders contain up to three property files: `aliases.txt`, `operations.txt` and `prefixes.txt`. The non-specified folder has the name `unit` and contains exactly one Properties file with the name `units.txt`. There is a functional dependency between the files. units.txt First the file `units.txt` is processed. This properties file defines new units, either as new base units or as derived units. The functionality is explained by the following example: ```U hektar = m²*10000``` The first line defines a new unit with the symbol `U`, which also determines the dimension. The second line defines a new derived unit `hectare`, which corresponds to ten thousand square meters. The definition of derived units may only use known terms of the unit implementation or previously defined units, not synonyms from the other files. operations.txt Next, the files are processed for operations. These contain synonyms for arithmetic operations. The functionality is explained by the following example: `/ = Per, per, pro, Pro` Here several synonyms are introduced for a division. aliases.txt Next, the synonyms for units are processed. 
The functionality is explained by the following example: `minute = Minute, Minuten, Min, Min., minütiger, minütige, minütig` This line defines several German synonyms for a unit `minute`. The keyword on the left side for the unit must be known, i. e. it must have been introduced in either the unit implementation or `units. txt`. Finally, the additional synonyms for prefixes are processed. The functionality is explained by the following example prefixes.txt ```unit = gramm, meter k = Kilo``` The first line with the keyword `unit` lists all unit synonyms to be prefixed. The other lines contain known prefixes and their synonyms. The result of this file is that a synonym for two (derived) units is added: `Kilogram` and `Kilometre`. ###### Maven Coordinates ``` ``` <dependency> <groupId>de.averbis.textanalysis</groupId> <artifactId>measurement-annotator</artifactId> <version>2.1.0-SNAPSHOT</version> </dependency> ``` ``` ##### UnitNormalizer ##### General This component parses unit annotations and recognizes the actual unit. If set, the text pf the feature `parsed` will be used, otherwise the text of the annotation in feature `coveredText`. Unit annotations whose feature `normalized` is already set are skipped. ##### Input The component processes annotations of type: • `de.averbis.textanalysis.types.measurement.Unit`. ##### Output The component sets the features `parsed`, `normalized`, `normalizedAscii` and `dimension` of the annotations of the type `de.averbis.textanalysis.types.measurement.Unit`. ##### Configuration Implementation: de.averbis.textanalysis.components.measurementannotator.UnitNormalizer Table 117: Configuration Parameters NameTypeMultiValuedMandatory ``` ignoreWhitespaces ``` Description: If activated whitespace characters are ignored while units are parsed. Default: ``` true ``` `Boolean` `false` `true` Table 118: External Resources NameOptionalInterface/Implementation ``` unitResource ``` Description: Resource holding the unit implementation and data. 
`false`

`de.averbis.textanalysis.resources.unitresource.UnitResource`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>measurement-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### MeasurementNormalizer

##### General

This component processes measurement annotations. First, the exact unit of the unit annotation set in the feature `unit` is parsed. If that feature does not contain an annotation, the value of the `parsedUnit` feature is used instead. This means that measurements that have no real unit in the text, but only an implicit unit set separately, can also be normalized. Then the standard unit and a transformation from the parsed unit to it are determined. The standard unit is stored in the `normalizedUnit` feature. The transformation is used to normalize the numeric value set in the `value` feature; the result is stored in the `normalizedValue` feature. The `normalized` feature holds the concatenation of the values of the `normalizedValue` and `normalizedUnit` features.

##### Input

The component processes annotations of type:

• `de.averbis.textanalysis.types.measurement.Measurement`

##### Output

The component sets the features `normalized`, `normalizedValue` and `normalizedUnit` of annotations of type `de.averbis.textanalysis.types.measurement.Measurement`.

##### Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.MeasurementNormalizer

Table 119: Configuration Parameters

NameTypeMultiValuedMandatory

``` ignoreWhitespaces ```

Description: If activated, whitespace characters are ignored while units are parsed.

Default: ``` true ```

`Boolean`

`false`

`true`

Table 120: External Resources

NameOptionalInterface/Implementation

``` unitResource ```

Description: Resource holding the unit implementation and data.
`false`

`de.averbis.textanalysis.resources.unitresource.UnitResource`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>measurement-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### RelativeMeasurementIntervalAnnotator

##### General

This component processes relative measurement interval annotations and sets their low and high limits.

##### Input

The component processes annotations of type:

• `de.averbis.textanalysis.types.measurement.RelativeMeasurementInterval`

##### Output

The component sets the features `low` and `high` of the given annotations of type `de.averbis.textanalysis.types.measurement.MeasurementInterval`. In this process, other annotations usually used for measurements are also created.

##### Configuration

Implementation: de.averbis.textanalysis.components.measurementannotator.RelativeMeasurementIntervalAnnotator

Table 121: Configuration Parameters

NameTypeMultiValuedMandatory

``` ignoreWhitespaces ```

Description: If activated, whitespace characters are ignored while units are parsed.

Default: ``` true ```

`Boolean`

`false`

`true`

Table 122: External Resources

NameOptionalInterface/Implementation

``` unitResource ```

Description: Resource holding the unit implementation and data.
`false`

`de.averbis.textanalysis.resources.unitresource.UnitResource`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>measurement-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

##### Rules

###### Measurement.ruta

```
PACKAGE de.averbis.textanalysis.components.measurementannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.MeasurementTypeSystem;
SCRIPT de.averbis.textanalysis.components.measurementannotator.MeasurementInterval;

BOOLEAN detectIntervals = true;

(n:NumericValue SPECIAL?{-PARTOF(Unit)} u:Unit){-> CREATE(Measurement, "value" = n, "unit" = u)};

Document{detectIntervals -> CALL(MeasurementInterval)};
```

###### MeasurementInterval.ruta

```
PACKAGE de.averbis.textanalysis.components.measurementannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.MeasurementTypeSystem;

DECLARE RelativeIntervalPrefix;
ii:IntervalIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(ii)};
gi:GreaterIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(gi)};
li:LessIndicator{PARTOF(RelativeIntervalPrefix)-> UNMARK(li)};

FOREACH(m) Measurement{}{
    m (p:RelativeIntervalPrefix m2:Measurement){
        -> CREATE(RelativeMeasurementInterval, "base" = m, "deflection" = m2)};
    nv:NumericValue{-PARTOF(Measurement)
        -> CREATE(Measurement, "unit" = m.unit, "value" = nv)}->{m1:Measurement;}
        (p:RelativeIntervalPrefix @m)
        { -> CREATE(RelativeMeasurementInterval, "base" = m1, "deflection" = m)};

    ADDRETAINTYPE(WS);
    ANY{-PARTOF(IntervalIndicator)} SPACE[0,2]
        (l:NumericValue{-> CREATE(Measurement, "unit" = h.unit, "value" = l)} SPACE[0,2])?
        IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)};

    // 12 - 15 mg
    (l:Measurement SPACE[0,2] IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)})
        {-> CREATE(MeasurementInterval, "low" = l, "high" = h)};

    // 20-0-0-0 IE
    ANY{-PARTOF(NumericValue),-PARTOF(Measurement)} SPACE[0,2]
        (IntervalIndicator SPACE[0,2] h:@m{-PARTOF(MeasurementInterval)})
        {-> CREATE(MeasurementInterval, "high" = h)};

    (l:m{-PARTOF(MeasurementInterval)} SPACE[0,2] IntervalIndicator (SPACE[0,2] h:Measurement)?)
        {-> CREATE(MeasurementInterval, "low" = l, "high" = h)};

    // 1,2 pg/ml 1,0 - 3,0 vs. Metformin 850 mg 1-0-1
    ANY{-PARTOF(IntervalIndicator)} SPACE[0,2] @m SPACE[0,2]
        (n1:NumericValue{-PARTOF(Measurement), -PARTOF(MeasurementInterval)
            -> CREATE(Measurement, "value" = n1, "unit" = m.unit)}->{m1:Measurement;}
        SPACE[0,2] IntervalIndicator SPACE[0,2]
        n2:NumericValue{-PARTOF(Measurement), -PARTOF(MeasurementInterval)
            -> CREATE(Measurement, "value" = n2, "unit" = m.unit)}->{m2:Measurement;}
        ) { -> CREATE(MeasurementInterval, "low" = m1, "high" = m2)}
        SPACE[0,2] ANY{-PARTOF(IntervalIndicator)};

    (GreaterIndicator SPACE[0,2] m{-PARTOF(MeasurementInterval)}){ -> CREATE(MeasurementInterval, "low" = m)};
    (LessIndicator SPACE[0,2] m{-PARTOF(MeasurementInterval)}){ -> CREATE(MeasurementInterval, "high" = m)};
    REMOVERETAINTYPE(WS);
}
```

#### TemporalExpressionAnnotator

##### General

This component recognizes different temporal expressions and normalizes their values. This includes simple date formats such as "10.2.2015" or "12:30". The component supports the English and German languages. The functional elements of this component are divided into individual reusable components that can also be configured individually and recombined. The main component TemporalExpressionAnnotator consists of the following elements:

1. RutaTokenSeedAnnotator: This component adds annotations for the subsequent dictionary lookup.

2. SimpleDictionaryAnnotator: This component adds different annotations based on the given word lists, for example month names.

3. TemporalExpression.ruta: These UIMA Ruta rules aggregate the Ruta scripts `Dictionary`, `Date` and `Time`.

4. TemporalExpressionNormalizer: This component normalizes annotations of the types `Date` and `Time`, and sets their values.

##### Input

The component does not require any other annotations as input.

##### Output

The component generates different types of annotations. The actual results of the annotator are subtypes of the type:

• `de.averbis.textanalysis.types.temporal.Timex3`

Currently, the following subtypes are supported:

• `de.averbis.textanalysis.types.temporal.Date`

• `de.averbis.textanalysis.types.temporal.Time`

##### Configuration

Implementation: de.averbis.textanalysis.components.temporalexpressionannotator.TemporalExpressionAnnotator

Table 123: Configuration Parameters

NameTypeMultiValuedMandatory

``` resourceSpecificSubdirectory ```

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: ``` temporalexpressionannotator ```

`String`

`false`

`true`

``` ignoreCase ```

Description: Option to ignore the case of terms in the dictionary.

Default: ``` false ```

`String`

`false`

`true`

``` genres ```

Description: Dictionaries to be used for creating the initial temporal text fragments.

Default: ``` default ```

`String`

`true`

`true`

``` defaultFilteredTypes ```

Description: Types filtered by default in the Ruta script.

Default: ``` org.apache.uima.ruta.type.SPACE, org.apache.uima.ruta.type.MARKUP ```

`String`

`true`

`true`

``` enclosingSpanType ```

Description: The type of the enclosing spans in which the rules are applied.

Default: ``` uima.tcas.DocumentAnnotation ```

`String`

`false`

`true`

``` anchorTypeName ```

Description: Anchor type for the dictionary lookup.
Default: ``` org.apache.uima.ruta.type.ANY ```

`String`

`false`

`true`

``` seeders ```

Description: A UIMA Ruta specific parameter specifying the initial seeders that should be applied.

Default: ``` org.apache.uima.ruta.seed.DefaultSeeder ```

`String`

`true`

`false`

``` reindexOnly ```

Description: A UIMA Ruta specific parameter specifying the annotation types that should be reindexed.

Default: ``` uima.tcas.Annotation ```

`String`

`true`

`false`

``` indexOnlyMentionedTypes ```

Description: A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed.

Default: ``` false ```

`Boolean`

`false`

`false`

``` indexAdditionally ```

Description: A UIMA Ruta specific parameter specifying additional annotation types that should be indexed.

Default: `**`

`String`

`true`

`false`

``` strictImports ```

Description: A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved.

Default: ``` true ```

`Boolean`

`false`

`false`

``` debug ```

Description: A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution.

Default: ``` false ```

`Boolean`

`false`

`false`

``` debugWithMatches ```

Description: A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches.

Default: ``` false ```

`Boolean`

`false`

`false`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>temporal-expression-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

###### RutaTokenSeedAnnotator

A more detailed description of RutaTokenSeedAnnotator can be found in the corresponding chapter.

###### SimpleDictionaryAnnotator

A more detailed description of SimpleDictionaryAnnotator can be found in the corresponding chapter.
#### TemporalExpressionNormalizer

##### General

This component normalizes annotations of the types `Date` and `Time`, and sets their `value` features. Missing information, such as the year of a date, can be completed with the help of the `anchor` feature, more precisely with the annotation it contains. In the case of an annotation of type `Date`, the corresponding anchor annotation is also determined automatically if the feature is not set: the nearest preceding annotation that contains the required information is used.

##### Input

The component processes annotations of the types:

• `de.averbis.textanalysis.types.temporal.Date`

• `de.averbis.textanalysis.types.temporal.Time`

##### Output

The component sets the feature `value` of annotations of the types `de.averbis.textanalysis.types.temporal.Date` and `de.averbis.textanalysis.types.temporal.Time`.

##### Configuration

Implementation: de.averbis.textanalysis.components.temporalexpressionannotator.TemporalExpressionNormalizer

Table 124: Configuration Parameters

NameTypeMultiValuedMandatory

``` intValueFeatureName ```

Description: The name of the feature holding the normalized int value.
Default: ``` value ```

`String`

`false`

`true`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>temporal-expression-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

##### Rules

TemporalExpression.ruta

```
PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.EvaluationTypeSystem;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Dictionary;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Date;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.Time;
SCRIPT de.averbis.textanalysis.components.temporalexpressionannotator.DateInterval;

CALL(Dictionary);
CALL(Date);
CALL(Time);
CALL(DateInterval);
```

Dictionary.ruta

```
PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;

DECLARE IntValued (INT value);
DECLARE IntValued MonthInd, DayInd, YearInd, HourInd, MinuteInd, SecondInd;
DECLARE DayInd DayNumberInd;
DECLARE MonthInd MonthLongInd, MonthShortInd, MonthNumberInd;
DECLARE YearInd Year2DInd, Year4DInd;
DECLARE Year4DInd Year4DModernInd;
DECLARE YearPostfixInd, YearPrefixInd;
DECLARE TimePrefixInd, TimePostfixInd;
DECLARE OfInd;

// fix dictionary-based entries
mni:MonthNumberInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};
mni:DayNumberInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};
mni:YearInd{CONTAINS(NUM,2,10) -> UNMARK(mni)};

INT int;
BLOCK(ClassifyNum) NUM{}{
    Document{PARSE(int)};
    Document{-PARTOF(Year4DInd), REGEXP("(?:19[0-9]{2})|(?:20[0-9]{2})")
        -> Year4DModernInd, Year4DModernInd.value=int};
    Document{-PARTOF(YearInd), REGEXP("[12]...") -> Year4DInd, Year4DInd.value=int};
    Document{-PARTOF(YearInd), REGEXP("..") -> Year2DInd, Year2DInd.value=int};
    Document{-PARTOF(HourInd), int <= 24, int >= 0 -> HourInd, HourInd.value=int};
    Document{-PARTOF(MinuteInd), int <= 60, int >= 0 -> MinuteInd, MinuteInd.value=int,
        SecondInd, SecondInd.value=int};
    Document{-PARTOF(DayNumberInd), int <= 31, int > 0, REGEXP("..?")
        -> DayNumberInd, DayNumberInd.value = int};
    Document{-PARTOF(MonthNumberInd), int <= 12, int > 0, REGEXP("..?")
        -> MonthNumberInd, MonthNumberInd.value = int};
}

s:SPECIAL{REGEXP("[´`']")} y:@Year2DInd{-> y.begin = s.begin};
DayNumberInd{ENDSWITH(W)} ->{
    Year2DInd{->UNMARK(Year2DInd)};Year2DInd{->UNMARK(MonthNumberInd)};
};

DECLARE Dash, Slash;
BLOCK(ClassifySpecial) SPECIAL{}{
    Document{-PARTOF(Dash), REGEXP("[-]")-> Dash};
    Document{-PARTOF(Slash), REGEXP("[/]")-> Slash};
}

//MonthLongInd{PARTOF({POSTagVerb})-> UNMARK(MonthLongInd)};
```

Date.ruta

```
PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.temporalexpressionannotator.DictionaryRutaTypeSystem;

STRING language;
Document{IS(uima.tcas.DocumentAnnotation) -> GETFEATURE("language", language)};
Document{IS(LanguageContainer) -> GETFEATURE("language", language)};
Document{language=="x-unspecified" -> language = "en"};

ACTION CreateDate(ANNOTATION year, ANNOTATION month, ANNOTATION day) =
    CREATE(Date, "kind" = "DATE", "year" = year, "month" = month, "day" = day);

ADDFILTERTYPE(Date);

// hotfix combi with document
(y:@Year4DInd{STARTSWITH(Document)} Dash m:MonthNumberInd Dash d:DayNumberInd){-> CreateDate(y, m, d)};
ANY{-PARTOF(Dash)} @(y:@Year4DInd Dash m:MonthNumberInd Dash d:DayNumberInd){-> CreateDate(y, m, d)};

(d:DayNumberInd PERIOD m:MonthInd PERIOD? y:YearInd){-> CreateDate(y, m, d)};
(m:MonthInd{-IS(MonthNumberInd)} d:DayInd COMMA y:YearInd){-> CreateDate(y, m, d)};
(m:MonthInd{-IS(MonthNumberInd)} COMMA d:DayInd y:YearInd){-> CreateDate(y, m, d)};

BLOCK(en) Document{language == "en"} {
    (d:DayInd ANY?{OR(REGEXP("of"), IS(PERIOD))} m:MonthInd{-IS(MonthNumberInd)}
        COMMA y:@YearInd){-> CreateDate(y, m, d)} COMMA;
    (d:DayInd ANY?{OR(REGEXP("of"), IS(PERIOD))} m:MonthInd{-IS(MonthNumberInd)}
        y:@YearInd){-> CreateDate(y, m, d)};
    (m:MonthNumberInd{-PARTOF(Date)} Slash d:DayNumberInd Slash y:YearInd){-> CreateDate(y, m, d)};
    (m:MonthNumberInd{-PARTOF(Date)} Dash d:DayNumberInd Dash y:YearInd){-> CreateDate(y, m, d)};
    W{REGEXP("on", true)} (m:@MonthInd{-IS(MonthNumberInd)} d:DayNumberInd){-> CreateDate(null, m, d)};
    (m:MonthInd{-IS(MonthNumberInd)} OfInd? y:@YearInd){-> CreateDate(y, m, null)};
    (m:MonthInd{-IS(MonthNumberInd)} COMMA y:@YearInd){-> CreateDate(y, m, null)} COMMA;
    (m:MonthInd{-IS(MonthNumberInd)} d:DayInd){-> CreateDate(null, m, d)};
}

BLOCK(de) Document{language == "de"} {
    (d:DayNumberInd{-PARTOF(Date)} Slash m:MonthNumberInd Slash y:YearInd){-> CreateDate(y, m, d)};
    (d:DayNumberInd{-PARTOF(Date)} Dash m:MonthNumberInd Dash y:YearInd){-> CreateDate(y, m, d)};
}

(d:DayInd PERIOD? m:MonthInd y:YearInd){-> CreateDate(y, m, d)};
(m:MonthInd Slash? y:@YearInd){-> CreateDate(y, m, null)};
(m:MonthInd{-IS(MonthNumberInd)} d:DayInd){-> CreateDate(null, m, d)};
(d:DayInd PERIOD m:@MonthNumberInd p:PERIOD{p.begin==m.end}){-> CreateDate(null, m, d)};
(d:DayInd PERIOD m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};
(d:DayInd OfInd m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};
(d:DayInd m:@MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, d)};
(m:MonthInd{-IS(MonthNumberInd)}){-> CreateDate(null, m, null)};
(y:@Year4DModernInd){-> CreateDate(y, null, null)};
y:@Year2DInd{STARTSWITH(SPECIAL)-> CreateDate(y, null, null)};
YearPrefixInd y:@Year2DInd{-> CreateDate(y, null, null)}
    (SW{REGEXP("and|und")} Year2DInd{-> CreateDate(y, null, null)})?;

REMOVEFILTERTYPE(Date);

// vom 12. bis 14.08.2008
TemporalIntervalBeginIndicator d:DayInd{ -> CreateDate(date.year, date.month, d)}
    PERIOD TemporalIntervalEndIndicator date:Date{date.day != null};

ADDRETAINTYPE(WS);
Date{-> UNMARK(Date)} PM NUM{-PARTOF(Date)};
Date{CONTAINS(NUM,3,10) -> UNMARK(Date)} PM NUM;
Date{-> UNMARK(Date)} SPECIAL{-REGEXP("[\\)\\]\\}]")} ANY{-PARTOF(Date),-PARTOF(WS)};
NUM PM @Date{CONTAINS(NUM,3,10)-> UNMARK(Date)};
ANY{-PARTOF(Date)} PM @Date{-> UNMARK(Date)};
@Date{-CONTAINS(W)-> UNMARK(Date)} W{-REGEXP("T"),-PARTOF(Date)};
W{-PARTOF(Date)} @Date{STARTSWITH(NUM)-> UNMARK(Date)};
ANY{-PARTOF(Date),-PARTOF(WS)} SPECIAL{-REGEXP("[\\(\\[\\{]")} @Date{-> UNMARK(Date)};
d1:Date{ENDSWITH(Year2DInd)-> UNMARK(d1)} SPECIAL d2:Date{IS(Year4DInd) -> UNMARK(d2)};
REMOVERETAINTYPE(WS);

ANY{-PARTOF(NUM)} @Date{REGEXP("May")-> UNMARK(Date)} W;
@Date{STARTSWITH(Document),REGEXP("May")-> UNMARK(Date)} W;

OfInd d:@Date{OR(CONTAINS(Slash), CONTAINS(Dash)), -CONTAINS(MonthLongInd), -CONTAINS(MonthShortInd)-> UNMARK(d)};

BLOCK(en) Document{language == "en"} {
SW.ct=="at" @Date{-> UNMARK(Date)};
}

ACTION Unambig(ANNOTATION timex) = CREATE(UnambiguousTimex, "timex" = timex);

FOREACH(date,false) Date{}{
// 23.1.,24.2. und 25.2.2017
d:Date{d.year == null-> d.anchor=date} ANY+{PARTOF({COMMA,POSTagConj,TemporalIntervalBeginIndicator,TemporalIntervalEndIndicator})} date{date.year != null};
d:Date{d.year == null-> d.anchor=date.anchor} ANY+{PARTOF({COMMA,POSTagConj,TemporalIntervalBeginIndicator,TemporalIntervalEndIndicator})} date{date.anchor != null};

// unambiguous
date{OR(CONTAINS(MonthLongInd),CONTAINS(MonthShortInd)), CONTAINS(YearInd) -> Unambig(date)};
date{date.day != null, date.month != null, date.year != null -> Unambig(date)};
}
```
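The anchor-propagation rules near the end of Date.ruta handle enumerations such as "23.1., 24.2. und 25.2.2017": partial dates without a year point, via their `anchor` feature, to the nearest following date that carries one, and the normalizer then completes the missing year. A minimal Python sketch of this completion logic (illustrative only; the real implementation is the Ruta rule set above combined with the TemporalExpressionNormalizer):

```python
def complete_years(dates):
    """Fill in missing years from the nearest following complete date.

    Each date is a dict with 'day', 'month' and an optional 'year'.
    Mirrors the anchor propagation in Date.ruta for date enumerations.
    """
    anchor_year = None
    # Walk right-to-left so every partial date sees the nearest later year.
    for date in reversed(dates):
        if date.get("year") is not None:
            anchor_year = date["year"]
        elif anchor_year is not None:
            date["year"] = anchor_year
    return dates

# "23.1., 24.2. und 25.2.2017"
dates = [
    {"day": 23, "month": 1},
    {"day": 24, "month": 2},
    {"day": 25, "month": 2, "year": 2017},
]
print(complete_years(dates))
```

Dates with no complete date anywhere to their right simply remain partial, just as an annotation without a resolvable anchor keeps an unset year.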

Time.ruta

```
PACKAGE de.averbis.textanalysis.components.temporalexpressionannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.TemporalTypeSystem;
TYPESYSTEM de.averbis.textanalysis.typesystems.NumericValueTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.temporalexpressionannotator.DictionaryRutaTypeSystem;

STRING language;
Document{IS(uima.tcas.DocumentAnnotation) -> GETFEATURE("language", language)};
Document{IS(LanguageContainer) -> GETFEATURE("language", language)};

ACTION CreateTime(ANNOTATION hour, ANNOTATION minute, ANNOTATION second) = CREATE(Time, "kind" = "TIME", "hour" = hour, "minute" = minute, "second" = second);

(h:HourInd COLON m:MinuteInd{REGEXP("..")} (COLON s:SecondInd)? TimePostfixInd?){-> CreateTime(h, m, s)};
(h:HourInd TimePostfixInd){-> CreateTime(h, null, null)};

REMOVEFILTERTYPE(Time);

d:Date{-> UNMARK(d)} ANY?{OR(REGEXP("T|,"), IS(TimePrefixInd))} t:Time{-> t.begin = d.begin, t.anchor = d};

REMOVERETAINTYPE(WS);
```
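The normalization performed by these components turns surface forms such as "10.2.2015" or "12:30" into canonical values. The following Python sketch illustrates the idea for exactly those two formats (an illustration only, not the UIMA implementation; the output format follows the TIMEX3-style values produced for `Date` and `Time`):

```python
import re

def normalize_date(text):
    """Normalize a German-style date "D.M.YYYY" to "YYYY-MM-DD"."""
    m = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4})", text)
    if not m:
        return None
    day, month, year = map(int, m.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"

def normalize_time(text):
    """Normalize a clock time "H:MM" to "THH:MM"."""
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", text)
    if not m:
        return None
    hour, minute = map(int, m.groups())
    return f"T{hour:02d}:{minute:02d}"

print(normalize_date("10.2.2015"))  # 2015-02-10
print(normalize_time("12:30"))      # T12:30
```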

### Part-of-Speech Tagging

#### FactoriePOSAnnotator

##### General

This POS tagger is based on a Factorie factor graph model. The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

##### Input

The component expects the following mandatory annotations:

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Token`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.POSTag`

or, depending on the word type, the corresponding annotation. The following subtypes are available in the type system for this purpose:

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagAdp`

• `de.averbis.extraction.types.POSTagAdv`

• `de.averbis.extraction.types.POSTagConj`

• `de.averbis.extraction.types.POSTagDet`

• `de.averbis.extraction.types.POSTagNoun`

• `de.averbis.extraction.types.POSTagNum`

• `de.averbis.extraction.types.POSTagPart`

• `de.averbis.extraction.types.POSTagPron`

• `de.averbis.extraction.types.POSTagPunct`

• `de.averbis.extraction.types.POSTagVerb`

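The mapping from a tagged word type to one of these annotation types can be pictured as a lookup with a fallback to the generic parent type. A hypothetical sketch (the tag names used as keys are assumptions for illustration, not the tagger's actual tag set):

```python
# Hypothetical mapping from coarse POS tags to the subtype names listed above;
# any tag without a specific subtype falls back to the parent type POSTag.
SUBTYPE_BY_TAG = {
    "ADJ": "POSTagAdj", "ADP": "POSTagAdp", "ADV": "POSTagAdv",
    "CONJ": "POSTagConj", "DET": "POSTagDet", "NOUN": "POSTagNoun",
    "NUM": "POSTagNum", "PART": "POSTagPart", "PRON": "POSTagPron",
    "PUNCT": "POSTagPunct", "VERB": "POSTagVerb",
}

def annotation_type(tag):
    """Return the fully qualified annotation type name for a coarse POS tag."""
    return "de.averbis.extraction.types." + SUBTYPE_BY_TAG.get(tag, "POSTag")

print(annotation_type("NOUN"))  # de.averbis.extraction.types.POSTagNoun
print(annotation_type("X"))     # de.averbis.extraction.types.POSTag
```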
##### Configuration

Apart from the resource, this component has no configuration parameters.

#### FactoriePOSTaggerResource

##### General

This resource encapsulates the statistical POSTagger model based on Factorie. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.factoriepostaggerresource.FactoriePOSTaggerResource

Table 125: Configuration Parameters

NameTypeMultiValuedMandatory

``` resourceSpecificSubdirectory ```

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: ``` factoriepostagger ```

`String`

`false`

`false`

``` genre ```

Description: The genre of text to process; the combination of genre and document language determines which model is used; available genres: newspaper or bionlp.

Default: ``` newspaper ```

`String`

`false`

`false`

``` documentAnnotatorClassName ```

Description: The implementation of the Factorie DocumentAnnotator.

Default: ``` de.averbis.textanalysis.factorie.GenericForwardPosTagger ```

`String`

`false`

`true`

``` attributeClassName ```

Description: The implementation of the Factorie Attribute.

Default: ``` de.averbis.textanalysis.factorie.GenericPosTag ```

`String`

`false`

`true`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>factorie-postagger-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpPOSAnnotator

##### General

This POSTagger is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the six standard languages (de, en, it, fr, pt, es) as well as the two genres "newspaper" and "bionlp" for biomedical literature.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Token`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of the following type:

• `de.averbis.extraction.types.POSTag`

or its subtypes. If a word type cannot be assigned to a specific subtype, the above-mentioned parent type is used.

The following valid subtypes are available in the type system:

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagAdp`

• `de.averbis.extraction.types.POSTagAdv`

• `de.averbis.extraction.types.POSTagConj`

• `de.averbis.extraction.types.POSTagDet`

• `de.averbis.extraction.types.POSTagNoun`

• `de.averbis.extraction.types.POSTagNum`

• `de.averbis.extraction.types.POSTagPart`

• `de.averbis.extraction.types.POSTagPron`

• `de.averbis.extraction.types.POSTagPunct`

• `de.averbis.extraction.types.POSTagVerb`

##### Configuration

Implementation: de.averbis.textanalysis.components.opennlpposannotator.OpennlpPOSAnnotator

Table 126: Configuration Parameters

NameTypeMultiValuedMandatory

``` tokenBlockSize ```

Description: Sentences having more tokens than tokenBlockSize will be processed in blocks of this size to avoid overlong runtime of this component.

Default: ``` 500 ```

`Integer`

`false`

`false`

Table 127: External Resources

NameOptionalInterface/Implementation

``` opennlpPOSTaggerResource ```

Description: Resource holding a map with available models (postagger) for different languages.

`false`

`de.averbis.textanalysis.resources.opennlppostaggerresource.OpennlpPOSTaggerResource`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-pos-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```
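The `tokenBlockSize` parameter described above splits overlong sentences into fixed-size blocks before tagging, bounding the component's runtime. A minimal sketch of that blocking (illustrative; the real component operates on UIMA token annotations, not strings):

```python
def blocks(tokens, block_size=500):
    """Split a token sequence into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

# A 1200-token sentence is processed as three blocks: 500, 500 and 200 tokens.
sentence = [f"tok{i}" for i in range(1200)]
print([len(b) for b in blocks(sentence)])  # [500, 500, 200]
```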

#### OpennlpPOSTaggerResource

##### General

This resource encapsulates the statistical POSTagger model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.opennlppostaggerresource.OpennlpPOSTaggerResource

Table 128: Configuration Parameters

NameTypeMultiValuedMandatory

``` resourceSpecificSubdirectory ```

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: ``` opennlppostagger ```

`String`

`false`

`false`

``` genre ```

Description: The genre of text to process; the combination of genre and document language determines which model is used; available genres: newspaper or bionlp.

Default: ``` newspaper ```

`String`

`false`

`false`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-postagger-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Shallow Parsing

#### FactorieChunkAnnotator

##### General

This chunker is based on a Factorie factor graph model. The basic version includes trained models for the two standard languages (de, en), as well as the two genres "newspaper" and "bionlp" for biomedical literature.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.POSTag`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Chunk`

or, depending on the phrase type, the corresponding annotation. The following subtypes are available in the type system for this purpose:

• `de.averbis.extraction.types.ChunkNP`

• `de.averbis.extraction.types.ChunkVP`

• `de.averbis.extraction.types.ChunkPP`

##### Configuration

Apart from the resource, this component has no configuration parameters.
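Chunker models of this kind typically emit per-token BIO tags (the chunker resource's default attribute class is a `BIOChunkTag`), which are then collapsed into phrase-level annotations such as ChunkNP, ChunkVP and ChunkPP. A sketch of that decoding step (illustrative only, not the component's actual code):

```python
def decode_bio(tags):
    """Collapse per-token BIO chunk tags (e.g. B-NP, I-NP, O) into chunk spans.

    Returns (chunk_type, first_token_index, last_token_index) triples,
    mirroring how chunk annotations span whole phrases.
    """
    chunks, start, kind = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and kind != tag[2:]):
            if kind is not None:
                chunks.append((kind, start, i - 1))   # close the previous chunk
            kind, start = tag[2:], i                  # open a new chunk
        elif tag == "O":
            if kind is not None:
                chunks.append((kind, start, i - 1))
            kind, start = None, None
    if kind is not None:
        chunks.append((kind, start, len(tags) - 1))   # close a trailing chunk
    return chunks

# "The quick fox | jumps | over | the dog"
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(decode_bio(tags))
```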

#### FactorieChunkerResource

##### General

This resource encapsulates the statistical chunker model based on Factorie. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.factoriechunkerresource.FactorieChunkerResource

Table 129: Configuration Parameters

NameTypeMultiValuedMandatory

``` resourceSpecificSubdirectory ```

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: ``` factoriechunker ```

`String`

`false`

`false`

``` genre ```

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: ``` newspaper ```

`String`

`false`

`false`

``` documentAnnotatorClassName ```

Description: The implementation of the Factorie DocumentAnnotator.

Default: ``` de.averbis.textanalysis.factorie.BIOGenericChainChunker ```

`String`

`false`

`true`

``` attributeClassName ```

Description: The implementation of the Factorie Attribute.

Default: ``` cc.factorie.app.nlp.load.BIOChunkTag ```

`String`

`false`

`true`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>factorie-chunker-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpChunkAnnotator

##### General

This chunker is based on a maximum entropy model (also known as logistic regression). The basic version includes trained models for the two standard languages (de, en), as well as the two genres "newspaper" and "bionlp" for biomedical literature.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.POSTag`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The component creates annotations of type `de.averbis.extraction.types.Chunk` or, depending on the phrase type, the corresponding annotation.

The following subtypes are available in the type system for this purpose:

• `de.averbis.extraction.types.ChunkNP`

• `de.averbis.extraction.types.ChunkVP`

• `de.averbis.extraction.types.ChunkPP`

##### Configuration

Implementation: de.averbis.textanalysis.components.opennlpchunkannotator.OpennlpChunkAnnotator

Table 130: Configuration Parameters

NameTypeMultiValuedMandatory

``` tokenBlockSize ```

Description: Sentences having more tokens than tokenBlockSize will be processed in blocks of this size to avoid overlong runtime of this component.

Default: ``` 500 ```

`Integer`

`false`

`false`

Table 131: External Resources

NameOptionalInterface/Implementation

``` opennlpChunkerResource ```

Description: Resource holding a map with available models (chunker) for different languages.

`false`

`de.averbis.textanalysis.resources.opennlpchunkerresource.OpennlpChunkerResource`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-chunk-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### OpennlpChunkerResource

##### General

This resource encapsulates the statistical Chunker model based on OpenNLP. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.opennlpchunkerresource.OpennlpChunkerResource

Table 132: Configuration Parameters

NameTypeMultiValuedMandatory

``` resourceSpecificSubdirectory ```

Description: Resource specific subdirectory against which all relative paths are resolved.

Default: ``` opennlpchunker ```

`String`

`false`

`false`

``` genre ```

Description: The genre of the model family to be used (e.g. newspaper, bionlp).

Default: ``` newspaper ```

`String`

`false`

`false`

Maven Coordinates:

```
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>opennlp-chunker-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Enumerations

#### EnumerationAnnotator

##### General

This component detects enumerations based on atomic text units (e.g., chunks) and conjunctions (e.g., the word "and").

##### Input

The component requires annotations of the configured types. Typically, these are POS tagger and chunker annotations:

• `de.averbis.extraction.types.POSTagConj`

• `de.averbis.extraction.types.ChunkNP`

##### Output

The component creates annotations of type:

• `de.averbis.textanalysis.types.Enumeration`

It fills their feature `members` and sets the feature `label` to 'enumeration'.

##### Configuration

Implementation: de.averbis.textanalysis.components.enumerationannotator.EnumerationAnnotator

Table 133: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `withinChunks` | If activated, enumerations are detected within annotations of the type chunkType. | `true` | `Boolean` | `false` | `true` |
| `combineChunks` | If activated, annotations of the type chunkType are combined to enumerations. | `true` | `Boolean` | `false` | `true` |
| `slashEnum` | If activated, a slash '/' indicates an enumeration. | `false` | `Boolean` | `false` | `true` |
| `chunkType` | The basic type of elements of an enumeration. | `de.averbis.extraction.types.ChunkNP` | `String` | `false` | `true` |
| `conjunctionType` | The basic type of the enumeration indicator. | `de.averbis.extraction.types.POSTagConj` | `String` | `false` | `true` |
| `seeders` | A UIMA Ruta specific parameter specifying the initial seeders that should be applied. | `org.apache.uima.ruta.seed.DefaultSeeder` | `String` | `true` | `false` |
| `reindexOnly` | A UIMA Ruta specific parameter specifying the annotation types that should be reindexed. | `uima.tcas.Annotation` | `String` | `true` | `false` |
| `indexOnlyMentionedTypes` | A UIMA Ruta specific parameter specifying if only annotation types that are explicitly mentioned in the rules should be indexed. | `false` | `Boolean` | `false` | `false` |
| `indexAdditionally` | A UIMA Ruta specific parameter specifying additional annotation types that should be indexed. | `**` | `String` | `true` | `false` |
| `strictImports` | A UIMA Ruta specific parameter specifying if only types that are explicitly imported in the script are known and will be resolved. | `true` | `Boolean` | `false` | `false` |
| `debug` | A UIMA Ruta specific parameter specifying if debug information should be created for the rule execution. | `false` | `Boolean` | `false` | `false` |
| `debugWithMatches` | A UIMA Ruta specific parameter specifying if debug information should be created for rule element matches. | `false` | `Boolean` | `false` | `false` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>enumeration-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```
##### Rules
```
PACKAGE de.averbis.textanalysis.components.enumerationannotator;

TYPESYSTEM de.averbis.textanalysis.typesystems.AverbisTypeSystem;
TYPESYSTEM de.averbis.textanalysis.components.enumerationannotator.ListingRutaTypeSystem;

BOOLEAN withinChunks = true;
BOOLEAN combineChunks = true;
BOOLEAN slashEnum = false;
BOOLEAN extendChunksToConcepts = true;

TYPE chunkType = de.averbis.extraction.types.ChunkNP;
TYPE conjunctionType = de.averbis.extraction.types.POSTagConj;

ACTION Enum() = CREATE(Enumeration, "members" = Member, "label" = "enumeration");

DECLARE EnumIndicator;
(conjunctionType{-PARTOF(EnumIndicator)} (SPECIAL.ct=="/" conjunctionType)?){-> EnumIndicator};
(ei:EnumIndicator s:SPECIAL.ct=="-"){-> ei.end=s.end};

e:EnumIndicator{REGEXP("but") -> UNMARK(e)};

ChunkNP COMMA POSTagAdj{-PARTOF(Chunk)-> ChunkNP, ChunkNP.value = "NP"} EnumIndicator ChunkNP;
}

// TODO refactor to avoid redundant operations
BLOCK(extendChunksToConcepts) Document{extendChunksToConcepts}{
// shortness of breath -> 1 ChunkNP
// CMV-Pneumonie
c:Concept{CONTAINS(SPECIAL)}->{np1:ChunkNP{np1.begin==c.begin} SPECIAL.ct=="-" np2:ChunkNP{np2.end==c.end -> np1.end=np2.end, UNMARK(np2)};};
// carotid bruits
c:Concept{CONTAINS(ChunkNP,2,2)}->{np1:ChunkNP{np1.begin==c.begin -> np1.end = np2.end} np2:ChunkNP{-> UNMARK(np2)};};
c:Concept{CONTAINS(ChunkNP)}<-{np1:ChunkNP{np1.begin==c.begin -> np1.end = np2.end} np2:POSTagNoun{-PARTOF(Chunk)};};
// <Crohn><'s, or ulcerative colitis>
c1:Concept{STARTSWITH(ChunkNP), -ENDSWITH(ChunkNP)}->{np1:ChunkNP{np1.begin == c1.begin -> np1.end = c1.end};}
COMMA? conjunctionType c2:Concept;
np:ChunkNP<-{pa:POSTagPart{pa.begin == np.begin} SW.ct=="s" ANY[0,2]{-PARTOF(Concept)} c:Concept{-> np.begin = c.begin};};
// Akute Transplantat-gegen-Wirt Erkrankung
c:Concept{-> ChunkNP}<-{np1:ChunkNP{np1.begin==c.begin -> UNMARK(np1)} ANY[0,3]{-PARTOF(Chunk)} np2:ChunkNP{np2.end==c.end-> UNMARK(np2)};};
}

BLOCK(withinChunks) Document{withinChunks}{
// within chunks
BLOCK(eachChunk) chunkType{CONTAINS(EnumIndicator)} {
((ANY+{-PARTOF(COMMA) -> Member} COMMA)* ANY+{-PARTOF(COMMA)-> Member} @EnumIndicator{-PARTOF(Enumeration)} #{-> Member}){-> Enum()};
}
// should have been a chunk, chunk misses adjectives in front of it

// adjectives after chunk used in medical documents
(ChunkNP{-> Member}
@EnumIndicator{-PARTOF(Enumeration)}

}

BLOCK(combineChunks) Document{combineChunks}{

// lentigo vs macular SK vs lentig maligna
((chunkType{-PARTOF(Enumeration)-> Member} EnumIndicator{-PARTOF(Enumeration)})[2,100] chunkType{-> Member}){-> Enum()};

((chunkType{-PARTOF(Enumeration) -> Member} SPECIAL.ct=="-"? COMMA)* chunkType{-PARTOF(Enumeration) -> Member} SPECIAL.ct=="-"?{-PARTOF(chunkType)} COMMA?{-PARTOF(chunkType)}
@EnumIndicator{-PARTOF(Enumeration)} chunkType{-PARTOF(Enumeration) -> Member}){-> Enum()};
// TODO broken chunking
((chunkType{-> Member} SPECIAL.ct=="-"? COMMA)* chunkType{-> Member} SPECIAL.ct=="-"?{-PARTOF(chunkType)} COMMA?{-PARTOF(chunkType)}
}

BLOCK(slashEnum) Document{slashEnum}{
((chunkType{-PARTOF(Enumeration), -REGEXP(".") -> Member} SPECIAL.ct=="-"? SPECIAL.ct=="/")+
chunkType{-PARTOF(Enumeration), -CONTAINS(COMMA), -STARTSWITH(NUM) -> Member}){-> Enum()};
}
```

### Entity Detection

#### MalletEntityAnnotator

In computational linguistics, the recognition of proper names is the task of identifying and typing references to entities within a text. Typical proper names are people, places and organisations.

The recognition of proper names is based on machine learning methods, as this approach enables high recognition rates. The module can also be adapted to new domains, languages and entity types by retraining the statistical model.

##### General

The component is based on Conditional Random Fields (CRF), a very good machine learning method for this task. This component comes with a standard model that recognizes the classic named entities (people, places, organizations).

The component also provides a training module that can be used to easily train new models from existing training data. In this way, adaptation to a new text domain or a new text genre (e. g. social media or biomedical literature) and adaptation to other entity types is very easy. For example, a gene and protein tagger can be created very easily.

If the confidence calculation is switched on, the marginal probabilities of the respective word are calculated while retaining the remaining sequence (i. e. the predicted labels of the sentence). For each entity, its confidence is then the average of all the individual word probabilities contained in the entity.

To make the tagger more precise, you can specify a minimum confidence that must be met for an entity to be annotated at all. The resulting increase in precision is of course achieved at the expense of recall.
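The confidence scheme described above can be sketched as follows. This is a minimal illustration with assumed names: the real component obtains the per-word marginal probabilities from the CRF; here they are simply given as an array.

```java
// Hypothetical sketch of entity confidence calculation and thresholding.
public class EntityConfidence {

    // Entity confidence: the average of the marginal probabilities of the
    // words contained in the entity (the labels of the rest of the sentence
    // are held fixed when the marginals are computed).
    public static double confidence(double[] tokenMarginals) {
        double sum = 0.0;
        for (double p : tokenMarginals) sum += p;
        return sum / tokenMarginals.length;
    }

    // An entity is annotated only if its confidence meets the configured
    // minimum threshold; raising the threshold trades recall for precision.
    public static boolean keep(double[] tokenMarginals, double confidenceThreshold) {
        return confidence(tokenMarginals) >= confidenceThreshold;
    }
}
```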

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Token`

This component requires that the document language is set, e.g., by a component like LanguageCategorizer or LanguageSetter.

##### Output

The following annotation is created:

• `de.averbis.extraction.types.Entity`

The feature `label` specifies the entity class (for example `PERS` for persons, `GEO` for places and `ORG` for organisations in the basic model). However, you can also specify a special mapping using configuration parameters, which contains more specific entity types depending on the label.

##### Details
###### Background/Algorithm

The tagger is a further development of the JNET tagger. It is based on conditional random fields (CRFs) and uses the Mallet implementation, following the approach described in Settles (2004). CRFs are particularly well suited for named entity recognition because they model correlations in the text: the linear-chain CRFs used here model the text as a sequence of words and thus capture dependencies inherent to the language.

Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets.
###### Feature Configuration

The feature configuration specifies which features are used during training. The default configuration is shown below; the commented-out lines can be enabled to use the corresponding features.

```
offset_conjunctions = (-1) (1)
feat_lowercase_enabled = false
feat_wc_enabled = true
feat_bwc_enabled = true
feat_bioregexp_enabled = true
feat_plural_enabled = true
#token_ngrams = 2,3
#char_ngrams = 3,4
#prefix_sizes = 2,3
#suffix_sizes = 2,3
#TESTLEX_lexicon = test.lex
```
###### Evaluation

Model "default"

The basic model (available for German and English) recognizes people, places and organisations. It has been trained on training data from the newspaper domain and is therefore very suitable for texts that are well-formed, grammatically correct and not colloquial.

German: Tiger-Korpus. Recall/Precision/F-Score: 0.88/0.93/0.90

English: MASC corpus. Recall/Precision/F-Score: 0.83/0.89/0.86

##### Configuration

Implementation: de.averbis.textanalysis.components.malletentityannotator.MalletEntityAnnotator

Table 134: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `calculateConfidence` | If activated, the confidence of extracted entities will be calculated (takes extra time, so only turn it on if really needed). | `false` | `Boolean` | `false` | `true` |
| `confidenceThreshold` | If the parameter calculateConfidence is activated, only entity mentions which exceed this threshold are added. | `0.0` | `Float` | `false` | `true` |
| `labelMapping` | Optional mapping file from label to entity type. | | `String` | `false` | `false` |
| `blackList` | Optional file containing exclusions for specific labels, e.g., Obama@GEO. | | `String` | `false` | `false` |
| `expandAbbreviations` | If activated, tokens which are acronyms/abbreviations are sent to the tagger in expanded form, i.e. their full form. This may improve tagger performance, but only if the model was trained on such data. | `false` | `Boolean` | `false` | `true` |
| `linkEntityToToken` | If activated, tokens underlying the entity will have a reference to the entity. | `false` | `Boolean` | `false` | `true` |
| `ignoreByConceptMapperAfterMapped` | If this parameter and the parameter linkEntityToToken are activated, the tokens underlying the new entities will be set to be ignored by the concept mapper. | `false` | `Boolean` | `false` | `true` |

Table 135: External Resources

| Name | Description | Optional | Interface/Implementation |
| --- | --- | --- | --- |
| `malletEntityTaggerResource` | Resource holding the available models for different languages. | `false` | `de.averbis.textanalysis.resources.malletentitytaggerresource.MalletEntityTaggerResource` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>mallet-entity-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### MalletEntityTaggerResource

##### General

This resource encapsulates the statistical CRF model based on Mallet. The language of the model used corresponds to the CAS language. The genre of the model can be set by the corresponding parameter.

##### Configuration

Implementation: de.averbis.textanalysis.resources.malletentitytaggerresource.MalletEntityTaggerResource

Table 136: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `resourceSpecificSubdirectory` | Resource specific subdirectory against which all relative paths are resolved. | `malletentitytagger` | `String` | `false` | `false` |
| `genre` | The genre of the model family to be used (e.g. newspaper, bionlp). | `newspaper` | `String` | `false` | `false` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>mallet-entity-tagger-resource</artifactId>
    <version>3.5.0</version>
</dependency>
```

### Concept Recognition

#### GenericTerminologyAnnotator

##### General

This component is a generic combination of up to three concept annotators based on the configured terminology. It is designed to simplify the use of concept recognition by eliminating the need to configure individual concept annotators and their resources. The configuration of the managed components and resources is automatic, but can be influenced by various configuration parameters. One of the most important is `terminologyNames`, which defines the terminologies to be used. These are first converted into serialized dictionaries in a necessary preprocessing stage.

##### Input

The component may run several ConceptAnnotators with different configurations. It therefore requires the annotations those ConceptAnnotators need to function correctly, such as sentences, tokens, stems or segments.

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Concept` (including subtypes)

The exact type depends on the terminology files used and the concept types specified in them.

##### Configuration

Implementation: de.averbis.textanalysis.components.terminologyannotator.GenericTerminologyAnnotator

Table 137: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `useExactLookup` | Apply exact lookup. | `true` | `Boolean` | `false` | `true` |
| `useOriginalLookup` | Apply original lookup. | `true` | `Boolean` | `false` | `true` |
| `useStemLookup` | Apply lookup based on stems. | `true` | `Boolean` | `false` | `true` |
| `useSegmentLookup` | Apply lookup based on segments. | `true` | `Boolean` | `false` | `true` |
| `enableMatchedTokens` | Enable matched tokens again after processing. Sets the feature ignoredByConceptMapper of tokens covered by any Concept to false. | `true` | `Boolean` | `false` | `true` |
| `resourceSpecificSubdirectory` | Resource specific subdirectory against which all relative paths are resolved. This parameter overrides the default directory given by the implementation. | | `String` | `false` | `false` |
| `terminologyNames` | Names of the source terminologies. | | `String` | `true` | `false` |
| `resourceIdentifier` | Optional identifier for resources that are automatically created and bound within the concept annotators. | | `String` | `false` | `false` |
| `exactPreprocessingAnalysisEngineName` | Analysis engine name for exact preprocessing of the dictionary entries. | | `String` | `false` | `false` |
| `originalPreprocessingAnalysisEngineName` | Analysis engine name for original preprocessing of the dictionary entries. | | `String` | `false` | `false` |
| `stemPreprocessingAnalysisEngineName` | Analysis engine name for stem preprocessing of the dictionary entries. | | `String` | `false` | `false` |
| `exactDictionarySourceFileNames` | Names of the dictionaries used for exact lookup, given as a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.exact.xml'. | | `String` | `false` | `false` |
| `originalDictionarySourceFileNames` | Names of the dictionaries used for original lookup, given as a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'. | | `String` | `false` | `false` |
| `stemDictionarySourceFileNames` | Names of the dictionaries used for stem lookup, given as a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'. | | `String` | `false` | `false` |
| `segmentDictionarySourceFileNames` | Names of the dictionaries used for segment lookup, given as a comma separated list. The parameter 'terminologyNames' overrides the value of this parameter by extending the names with '.xml'. | | `String` | `false` | `false` |
| `ignoreAfterExact` | Ignore matched tokens after exact lookup. | `true` | `Boolean` | `false` | `true` |
| `ignoreAfterOriginal` | Ignore matched tokens after original lookup. | `true` | `Boolean` | `false` | `true` |
| `ignoreAfterStem` | Ignore matched tokens after stem lookup. | `true` | `Boolean` | `false` | `true` |
| `ignoreAfterSegment` | Ignore matched tokens after segment lookup. | `true` | `Boolean` | `false` | `true` |
| `exactLookup` | Apply exact lookup. This parameter overrides the default behavior of the implementation. Allowed values: ACTIVE, INACTIVE, UNKNOWN. | `UNKNOWN` | `String` | `false` | `true` |
| `originalLookup` | Apply original lookup. This parameter overrides the default behavior of the implementation. Allowed values: ACTIVE, INACTIVE, UNKNOWN. | `UNKNOWN` | `String` | `false` | `true` |
| `stemLookup` | Apply lookup based on stems. This parameter overrides the default behavior of the implementation. Allowed values: ACTIVE, INACTIVE, UNKNOWN. | `UNKNOWN` | `String` | `false` | `true` |
| `segmentLookup` | Apply lookup based on segments. This parameter overrides the default behavior of the implementation. Allowed values: ACTIVE, INACTIVE, UNKNOWN. | `UNKNOWN` | `String` | `false` | `true` |
| `exactCaseVariant` | Defines the case matching of the exact lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | `CASE_MATCH` | `String` | `false` | `true` |
| `originalCaseVariant` | Defines the case matching of the original lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | `CASE_IGNORE` | `String` | `false` | `true` |
| `stemCaseVariant` | Defines the case matching of the stem lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | `CASE_IGNORE` | `String` | `false` | `true` |
| `segmentCaseVariant` | Defines the case matching of the segment lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | `CASE_IGNORE` | `String` | `false` | `true` |
| `exactMatchOnlyTermsWithNouns` | Defines if only concepts should be matched that comprise a noun in exact mode. | `false` | `Boolean` | `false` | `true` |
| `originalMatchOnlyTermsWithNouns` | Defines if only concepts should be matched that comprise a noun in original mode. | `false` | `Boolean` | `false` | `true` |
| `stemMatchOnlyTermsWithNouns` | Defines if only concepts should be matched that comprise a noun in stem mode. | `false` | `Boolean` | `false` | `true` |
| `segmentMatchOnlyTermsWithNouns` | Defines if only concepts should be matched that comprise a noun in segment mode. | `false` | `Boolean` | `false` | `true` |
| `exactMapResolvedAbbreviations` | If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in exact mode. | `false` | `Boolean` | `false` | `true` |
| `originalMapResolvedAbbreviations` | If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in original mode. | `false` | `Boolean` | `false` | `true` |
| `stemMapResolvedAbbreviations` | If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in stem mode. | `false` | `Boolean` | `false` | `true` |
| `segmentMapResolvedAbbreviations` | If true and there are abbreviations with marked full forms (Abbreviation annotation), the full form is mapped instead of the abbreviation from the text in segment mode. | `false` | `Boolean` | `false` | `true` |
| `exactFindAllMatches` | Finds all matches in a text passage in exact mode, including overlapping ones. | `false` | `Boolean` | `false` | `true` |
| `originalFindAllMatches` | Finds all matches in a text passage in original mode, including overlapping ones. | `false` | `Boolean` | `false` | `true` |
| `stemFindAllMatches` | Finds all matches in a text passage in stem mode, including overlapping ones. | `false` | `Boolean` | `false` | `true` |
| `segmentFindAllMatches` | Finds all matches in a text passage in segment mode, including overlapping ones. | `false` | `Boolean` | `false` | `true` |
| `exactFilterBestMatches` | Chooses the best match of all matches on a text passage in exact mode (via fuzziness score). | `true` | `Boolean` | `false` | `true` |
| `originalFilterBestMatches` | Chooses the best match of all matches on a text passage in original mode (via fuzziness score). | `true` | `Boolean` | `false` | `true` |
| `stemFilterBestMatches` | Chooses the best match of all matches on a text passage in stem mode (via fuzziness score). | `true` | `Boolean` | `false` | `true` |
| `segmentFilterBestMatches` | Chooses the best match of all matches on a text passage in segment mode (via fuzziness score). | `true` | `Boolean` | `false` | `true` |
| `makeConceptAnnotation` | Specifies whether a concept annotation is made at all. If set to true, a concept annotation is made (i.e. added to the index); if set to false, no concept annotation is made, but the tokens underlying the potential concepts are set to be ignored. This is used, e.g., if the concept annotator only serves to mark some phrase as ignored, without the concept annotation itself being of interest. | `true` | `Boolean` | `false` | `true` |

Table 138: External Resources

| Name | Description | Optional | Interface/Implementation |
| --- | --- | --- | --- |
| `exactConceptDictionaryResource` | Dictionary resource for exact lookup overriding the default one. | `true` | `de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource` |
| `stemConceptDictionaryResource` | Dictionary resource for stem lookup overriding the default one. | `true` | `de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource` |
| `originalConceptDictionaryResource` | Dictionary resource for original lookup overriding the default one. | `true` | `de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource` |
| `segmentConceptDictionaryResource` | Dictionary resource for segment lookup overriding the default one. | `true` | `de.averbis.textanalysis.resources.conceptdictionaryresource.ConceptDictionaryResource` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>terminology-annotator</artifactId>
    <version>3.5.0</version>
</dependency>
```

When configuring this component, use the parameters `useExactLookup`, `useStemLookup` and `useSegmentLookup` rather than the parameters `exactLookup`, `stemLookup` and `segmentLookup`.

### WordlistAnnotator

#### Description

The WordlistAnnotator allows users to directly embed simple wordlists into pipelines. It identifies words from the wordlist in texts and creates an annotation of type Entity. Optionally, a 'label' and a 'value' can be specified in columns 2 and 3 of the wordlist to fill the corresponding attributes of type Entity (see example below).
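The wordlist format can be illustrated with a minimal parser sketch. Class and method names here are assumptions for illustration, not the actual annotator code; only the file format (header line with the Entity type and attribute names, then one delimited term per line) is taken from this manual.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of parsing and looking up a WordlistAnnotator wordlist.
public class Wordlist {

    public final String typeName;                          // from header line (column 1)
    public final Map<String, String[]> entries = new HashMap<>(); // term -> {label, value}
    private final boolean ignoreCase;

    public Wordlist(String content, String delimiter, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        String[] lines = content.split("\\R");
        // line 1: fully qualified Entity type, optionally followed by 'label' and 'value'
        this.typeName = lines[0].split(delimiter)[0];
        for (int i = 1; i < lines.length; i++) {
            String[] cols = lines[i].split(delimiter);
            String term = ignoreCase ? cols[0].toLowerCase() : cols[0];
            String label = cols.length > 1 ? cols[1] : null;
            String value = cols.length > 2 ? cols[2] : null;
            entries.put(term, new String[] { label, value });
        }
    }

    // Returns {label, value} for a matched token, or null if there is no match.
    public String[] lookup(String token) {
        return entries.get(ignoreCase ? token.toLowerCase() : token);
    }
}
```

With the example wordlist from this section, looking up "lip" (case-insensitively) would yield the label "Organ" and the value "C00", which the annotator writes into the Entity features.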

#### Input

The following annotator must be included in the pipeline before this annotator:

#### Configuration

Table 50: Configuration Parameters

| Name | Description | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- |
| `delimiter` | The separator of different terms in the wordlist, separating the searched term from its features. | `string` | `false` | `true` |
| `ignoreCase` | Option to ignore the case of the terms in the wordlist. Possible values (default is underlined): ACTIVE \| INACTIVE | `boolean` | `false` | `true` |
| `onlyLongest` | Option to filter matches that are part of a longer match. Example: 'diabetes mellitus' but not 'diabetes'. Possible values (default is underlined): ACTIVE \| INACTIVE | `boolean` | `false` | `true` |
| `wordlist` | The wordlist (dictionary) content. The first line contains the complete package name of type Entity. If columns 2 and 3 are filled, line 1 has to be filled with the attribute names 'label' and 'value'. The remaining lines contain the words of the wordlist (column 1) and optionally 'label' and 'value' values (columns 2 and 3). Example wordlist: `de.averbis.extraction.types.Entity;label;value`, `Lip;Organ;C00`, `Tongue;Organ;C01` | `string` | `false` | `false` |

#### Output

The annotator creates an annotation of type Entity.

Exemplary Annotation Type: `de.averbis.extraction.types.Entity`

Table 51: Features

| Name | Description | Type |
| --- | --- | --- |
| `label` | The string from the "label" column of the matched term in the wordlist. | `String` |
| `value` | The string from the "value" column of the matched term in the wordlist. | `String` |

#### WebService Example

Example:

The lip

```json
{
    "begin": 4,
    "end": 7,
    "type": "de.averbis.extraction.types.Entity",
    "coveredText": "lip",
    "id": 306,
    "componentId": null,
    "confidence": 0,
    "label": "Organ",
    "value": "C00",
    "parsedElements": null
}
```

### Indexing

#### CooccurrenceDescriptorAnnotator

##### General

Extracts keywords based on the co-occurrence of individual lexical units. Lexical units can be tokens, stems, segments or lemmata. Scores are calculated for the selected lexical units within a keyword candidate and then combined into a total score for the respective keyword candidate.

Units that frequently occur together with other lexical units in keyword candidates are weighted higher than units that mostly occur alone in keyword candidates. As a result, the procedure tends to prefer keyword candidates consisting of several lexical units. It is therefore well suited, for example, to recognizing personal names or to extracting complex and thus very specific terms.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Concept`

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagNoun`

Depending on the configuration, the following annotations are also used:

• `de.averbis.extraction.types.Zone`

• `de.averbis.extraction.types.Stem`

• `de.averbis.extraction.types.Segment`

• `de.averbis.extraction.types.Lemma`

##### Output

The component produces annotations of type:

• `de.averbis.extraction.types.Descriptor`

##### Background

To calculate the scores of the lexical units, a co-occurrence matrix over all relevant lexical units is first built per document.

Then f(u), the so-called unit frequency, is calculated. It expresses in how many keyword candidates the lexical unit occurs. In addition, d(u), the so-called unit degree, is calculated, which reflects how many other units the unit co-occurs with.

The basic score of a lexical unit is then:

s(u) = d(u) / f(u)

This basic score is additionally weighted with the `tf` value of the keyword candidate, which expresses how often the keyword candidate appears in the current document.

This procedure is an extension and modification of the existing RAKE approach.
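The RAKE-style scoring described above can be sketched as follows. This is a simplified illustration with assumed names, not the Averbis implementation: each keyword candidate is a list of lexical units, f(u) counts the candidates a unit occurs in, d(u) accumulates its co-occurrence degree, and the candidate score sums s(u) = d(u) / f(u) over its units.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of RAKE-style co-occurrence scoring.
public class CooccurrenceScore {

    // Compute s(u) = d(u) / f(u) for every unit over all keyword candidates.
    public static Map<String, Double> unitScores(List<List<String>> candidates) {
        Map<String, Integer> freq = new HashMap<>();   // f(u): candidates containing u
        Map<String, Integer> degree = new HashMap<>(); // d(u): accumulated co-occurrence degree
        for (List<String> cand : candidates) {
            for (String u : cand) {
                freq.merge(u, 1, Integer::sum);
                // each occurrence contributes the candidate length to the degree
                degree.merge(u, cand.size(), Integer::sum);
            }
        }
        Map<String, Double> s = new HashMap<>();
        for (String u : freq.keySet()) {
            s.put(u, degree.get(u) / (double) freq.get(u));
        }
        return s;
    }

    // A candidate's basic score is the sum of its units' scores;
    // the component additionally weights this with the candidate's tf value.
    public static double candidateScore(List<String> cand, Map<String, Double> s) {
        double sum = 0.0;
        for (String u : cand) sum += s.getOrDefault(u, 0.0);
        return sum;
    }
}
```

Because units appearing in longer candidates accumulate a higher degree, multi-unit candidates tend to receive higher scores, matching the preference described above.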

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.CooccurrenceDescriptorAnnotator

Table 139: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `unitType` | The complete long name of the type of a unit. | `de.averbis.extraction.types.Segment` | `String` | `false` | `true` |
| `scoreCombinationType` | Combination type used for scores: SUM, AVG, MAX. | `MAX` | `String` | `false` | `true` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### DefaultDescriptorAnnotator

##### General

The "default" approach to uncontrolled keywording is based mainly on the tf-idf value of the keyword. Optionally, the position of the keyword in the text can be included in the weighting formula.

##### Input

The component requires the following annotations:

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Concept`

Depending on the configuration, the following annotations are also used:

• `de.averbis.extraction.types.Zone`

##### Output

The component creates annotations of type:

• `de.averbis.extraction.types.Descriptor`

##### Background

The `tf` value of a keyword is the frequency of this keyword in the current document. It is normalized via the so-called "augmented tf-score" approach: the `tf` value of a keyword is normalized with the maximum `tf` value of the current document according to the formula:

`tf_i_norm = 0.5 + 0.5 * tf_i / tf_max`

The `idf` value of a keyword can also be used if required; it is normalized with the maximum `idf` value of the IDF dictionary in use.

The position weight is defined via the relative number of sentences that occur before the first occurrence of a keyword candidate. If the keyword occurs for the first time in the first sentence, the weight is 1.
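The augmented tf normalization above can be transcribed directly. This is an illustrative sketch of the formula only, with assumed names:

```java
// Hypothetical sketch of the "augmented tf-score" normalization:
// tf_norm = 0.5 + 0.5 * tf / tf_max
public class AugmentedTf {

    public static double normalize(int tf, int tfMax) {
        return 0.5 + 0.5 * tf / (double) tfMax;
    }
}
```

The normalization maps every `tf` value into the range [0.5, 1.0], so that the most frequent keyword of a document gets weight 1.0 and rare keywords are not driven toward 0.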

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.DefaultDescriptorAnnotator

Table 140: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `positionBoost` | Option to boost by position. | `true` | `Boolean` | `false` | `true` |
| `idfBoost` | Option to boost by idf. | `false` | `Boolean` | `false` | `true` |
| `termFrequencyBoost` | Option to boost by term frequency. | `true` | `Boolean` | `false` | `true` |
| `idfDictionary` | IDF dictionary file for idf boost. | | `String` | `false` | `false` |
| `resourceSpecificSubdirectory` | Resource-specific subdirectory against which all relative paths are resolved. | `idfdictionary` | `String` | `false` | `false` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### TextrankDescriptorAnnotator

##### General

Extracts keywords based on the TextRank procedure. The text is internally represented as a graph that captures the coherence of the text. The TextRank-based procedure is completely unsupervised and can therefore be used independently of a given document collection. It does not require any models or other resources.

Keyword extraction methods based on domain knowledge, such as an IDF dictionary, may produce better results under certain circumstances. In many cases, however, the domain in question is not known exactly in advance, so no suitable IDF dictionary can be created. In such cases, it is advisable to use the TextRank procedure.

##### Input

The following annotations are mandatory for this component

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.Concept`

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagNoun`

Depending on the setting, these other annotations are also used

• `de.averbis.extraction.types.Zone`

• `de.averbis.extraction.types.Stem`

• `de.averbis.extraction.types.Segment`

• `de.averbis.extraction.types.Lemma`

##### Output

The component produces annotations of type:

• `de.averbis.extraction.types.Descriptor`

##### Background

Based on the TextRank algorithm (Mihalcea & Tarau, EMNLP 2004: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf). Lexical units (e.g., tokens, stems, segments, or lemmata) are assigned a score using the TextRank graph. These base values are then used to calculate the scores of all concept annotations. Various calculation options are supported: average, maximum value, and sum. In our experiments, the maximum value method usually produces the best results.
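The three combination options can be illustrated with a small hypothetical helper; the `SUM`/`AVG`/`MAX` modes mirror the `scoreCombinationType` parameter, but this is a sketch, not the actual implementation:

```python
def combine(unit_scores, mode="MAX"):
    """Combine the scores of a concept's lexical units into one
    concept score (hypothetical helper; modes mirror the
    scoreCombinationType parameter: SUM, AVG, MAX)."""
    if mode == "SUM":
        return sum(unit_scores)
    if mode == "AVG":
        return sum(unit_scores) / len(unit_scores)
    return max(unit_scores)  # MAX usually gives the best results

score = combine([0.2, 0.5, 0.3], mode="MAX")  # 0.5
```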

The TextRank graph contains the respective lexical units as nodes. The edges between these nodes represent a connection between the units in the text in terms of adjacency. The parameter `windowSize` defines the size of the window within which adjacent lexical units are considered.

In the optimization phase, the weights of the nodes are calculated based on the graph created for the respective text. Nodes that have many edges to neighboring nodes are potentially weighted higher.

The procedure allows an inherent normalization of the node weight. If the weights are normalized, they represent the probability of the "random surfer model", i.e. the probability of accidentally encountering the respective lexical unit in a text. Thus, the normalized scores of all nodes represent a probability distribution.

The following figure shows a TextRank graph for a document about the horse meat scandal in spring 2013 (source: SPIEGEL Online). Segments were used as lexical units. The darker the color of a node, the higher the unit score. You can easily see that "meat", "horse" and "product" are central aspects.

Figure 66: TextRank graph for document on horse meat scandal 2013 (German)
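A minimal sketch of the TextRank scoring described above, assuming a simplified PageRank-style update over the co-adjacency graph. The function names, the damping factor, and the toy input are illustrative assumptions, not the Averbis implementation:

```python
from collections import defaultdict

def textrank(units, window_size=3, damping=0.85,
             max_iterations=100, convergence_threshold=1e-4):
    """Score lexical units (e.g. stems or segments) by building an
    undirected adjacency graph and iterating a PageRank-style update."""
    # Connect units that co-occur within the sliding window.
    neighbors = defaultdict(set)
    for i, u in enumerate(units):
        for v in units[i + 1:i + window_size]:
            if u != v:
                neighbors[u].add(v)
                neighbors[v].add(u)
    nodes = list(neighbors)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iterations):
        new = {}
        for n in nodes:
            # Nodes with many (well-scored) neighbors are weighted higher.
            new[n] = (1 - damping) / len(nodes) + damping * sum(
                score[m] / len(neighbors[m]) for m in neighbors[n])
        delta = max(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < convergence_threshold:
            break
    return score

units = "horse meat scandal meat product analysis meat horse".split()
ranks = textrank(units)
top = max(ranks, key=ranks.get)  # "meat" is the most central unit here
```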

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.descriptor.TextrankDescriptorAnnotator

Table 141: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `windowSize` | The window size. | `3` | `Integer` | `false` | `true` |
| `unitType` | The complete long name of the type of a unit. | `de.averbis.extraction.types.Segment` | `String` | `false` | `true` |
| `scoreCombinationType` | Combination type used for scores: SUM, AVG, MAX. | `MAX` | `String` | `false` | `true` |
| `maxIterations` | An internal parameter specifying the maximum number of iterations, if supported by the algorithm. | `100` | `Integer` | `false` | `true` |
| `convergenceThreshold` | An internal parameter specifying the threshold for convergence, if supported by the algorithm. | `1.0E-4` | `Float` | `false` | `true` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### CooccurrenceKeywordAnnotator

##### General

Extracts keywords based on the co-occurrence of individual lexical units. Lexical units can be tokens, stems, segments, or lemmata. Scores are first calculated for the selected lexical units within a keyword candidate and are then combined into a total score for the respective keyword candidate.

Units that frequently occur together with other lexical units in a keyword candidate are weighted higher than units that mostly occur alone in keyword candidates. As a result, this procedure tends to prefer keyword candidates that consist of several lexical units. It is therefore well suited, for example, to recognizing personal names or to extracting complex and thus very specific terms.
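The weighting idea can be sketched with a small RAKE-style degree score. This is a hypothetical helper illustrating the principle, not the actual implementation:

```python
from collections import defaultdict

def cooccurrence_scores(candidates):
    """Score keyword candidates (lists of lexical units) by how often
    their units co-occur with other units across all candidates
    (RAKE-style degree scoring; an illustrative assumption)."""
    degree = defaultdict(int)
    for units in candidates:
        for u in set(units):
            # A unit's degree grows with the size of every candidate it
            # appears in, so units that occur together with others are
            # weighted higher than units that mostly stand alone.
            degree[u] += len(units)
    return {tuple(c): sum(degree[u] for u in c) for c in candidates}

candidates = [["angela", "merkel"], ["merkel"], ["government"]]
scores = cooccurrence_scores(candidates)
# The multi-unit candidate ("angela", "merkel") scores highest.
```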

##### Input

The following annotations are mandatory for this component

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.ChunkNP`

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagNoun`

Depending on the setting, these other annotations are also used

• `de.averbis.extraction.types.Zone`

• `de.averbis.extraction.types.Stem`

• `de.averbis.extraction.types.Segment`

• `de.averbis.extraction.types.Lemma`

##### Output

The component produces annotations of type:

• `de.averbis.extraction.types.Keyword`

##### Background

See the description of the analogous descriptor component CooccurrenceDescriptorAnnotator.

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.CooccurrenceKeywordAnnotator

Table 142: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `unitType` | The complete long name of the type of a unit. | `de.averbis.extraction.types.Segment` | `String` | `false` | `true` |
| `scoreCombinationType` | Combination type used for scores: SUM, AVG, MAX. | `MAX` | `String` | `false` | `true` |
| `maxNumberHeadTokens` | Maximum number of head tokens. | `1` | `Integer` | `false` | `true` |
| `fuzzyClustering` | Option for fuzzy clustering. | `false` | `Boolean` | `false` | `true` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### DefaultKeywordAnnotator

##### General

The "default" approach to uncontrolled keywording is based mainly on the tf-idf value of the keyword. Optionally, the position of the keyword in the text can be added to the weighting formula.

##### Input

The following annotations are mandatory for this component

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.ChunkNP`

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagNoun`

Depending on the setting, these other annotations are also used

• `de.averbis.extraction.types.Zone`

##### Output

The component produces annotations of type:

• `de.averbis.extraction.types.Keyword`

##### Background

See the description of the analogous descriptor component DefaultDescriptorAnnotator.

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.DefaultKeywordAnnotator

Table 143: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `positionBoost` | Option to boost by position. | `true` | `Boolean` | `false` | `true` |
| `idfBoost` | Option to boost by idf. | `false` | `Boolean` | `false` | `true` |
| `termFrequencyBoost` | Option to boost by term frequency. | `true` | `Boolean` | `false` | `true` |
| `idfDictionary` | IDF dictionary file for idf boost. | | `String` | `false` | `false` |
| `resourceSpecificSubdirectory` | Resource-specific subdirectory against which all relative paths are resolved. | `idfdictionary` | `String` | `false` | `false` |
| `maxNumberHeadTokens` | Maximum number of head tokens. | `1` | `Integer` | `false` | `true` |
| `fuzzyClustering` | Option for fuzzy clustering. | `false` | `Boolean` | `false` | `true` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

#### TextrankKeywordAnnotator

##### General

Extracts keywords based on the TextRank procedure. The text is internally represented as a graph that captures the coherence of the text. The TextRank-based procedure is completely unsupervised and can therefore be used independently of a given document collection. It does not require any models or other resources.

Keyword extraction methods based on domain knowledge, such as an IDF dictionary, may produce better results under certain circumstances. In many cases, however, the domain in question is not known exactly in advance, so no suitable IDF dictionary can be created. In such cases, it is advisable to use the TextRank procedure.

##### Input

The following annotations are mandatory for this component

• `de.averbis.extraction.types.Token`

• `de.averbis.extraction.types.Sentence`

• `de.averbis.extraction.types.ChunkNP`

• `de.averbis.extraction.types.POSTagAdj`

• `de.averbis.extraction.types.POSTagNoun`

Depending on the setting, these other annotations are also used

• `de.averbis.extraction.types.Zone`

• `de.averbis.extraction.types.Stem`

• `de.averbis.extraction.types.Segment`

• `de.averbis.extraction.types.Lemma`

##### Output

The component produces annotations of type:

• `de.averbis.extraction.types.Keyword`

##### Background

See the description of the analogous descriptor component TextrankDescriptorAnnotator.

##### Configuration

Implementation: de.averbis.textanalysis.components.indexing.keyword.TextrankKeywordAnnotator

Table 144: Configuration Parameters

| Name | Description | Default | Type | MultiValued | Mandatory |
| --- | --- | --- | --- | --- | --- |
| `windowSize` | The window size. | `3` | `Integer` | `false` | `true` |
| `unitType` | The complete long name of the type of a unit. | `de.averbis.extraction.types.Segment` | `String` | `false` | `true` |
| `scoreCombinationType` | Combination type used for scores: SUM, AVG, MAX. | `MAX` | `String` | `false` | `true` |
| `maxIterations` | An internal parameter specifying the maximum number of iterations, if supported by the algorithm. | `10` | `Integer` | `false` | `true` |
| `convergenceThreshold` | An internal parameter specifying the threshold for convergence, if supported by the algorithm. | `0.005` | `Float` | `false` | `true` |
| `maxNumberHeadTokens` | Maximum number of head tokens. | `1` | `Integer` | `false` | `true` |
| `fuzzyClustering` | Option for fuzzy clustering. | `false` | `Boolean` | `false` | `true` |
| `topN` | Maximum number of annotations to produce per CAS. | `10` | `Integer` | `false` | `true` |
| `minScore` | Minimum score of annotations. | `0.0` | `Float` | `false` | `true` |
| `normalizeScore` | Option to normalize the score. | `true` | `Boolean` | `false` | `true` |
| `conceptConfidenceBoost` | Option to boost concepts. | `false` | `Boolean` | `false` | `true` |
| `allowedZones` | Names of zone labels: if set, only concepts of these zones will be considered. | | `String` | `true` | `false` |
| `zoneBoost` | Option to boost zones. | `false` | `Boolean` | `false` | `true` |

Maven Coordinates:

```xml
<dependency>
    <groupId>de.averbis.textanalysis</groupId>
    <artifactId>indexing</artifactId>
    <version>3.5.0</version>
</dependency>
```

1. ftp://ftp.geneontology.org/pub/go/www/GO.format.obo-1_2.shtml, as of January 2017.
2. http://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html, as of January 2017.