Averbis Health Discovery: User Manual

Version 5.12, 04/23/2019

1. Overview

Health Discovery is a text mining and machine learning platform for analyzing large amounts of patient data. With Health Discovery, medical documents can be analyzed and searched for diagnoses, symptoms, prescriptions, special findings, and other criteria. Heterogeneous patient data in both structured and unstructured forms can be harmonized and analyzed by text mining, and can be accessed and searched via a unified interface.

Health Discovery has a modular structure. The various functionalities are roughly divided into the following categories:

  • General: There are some general modules in which projects and users with corresponding rights and roles can be created.

  • Sources: There are several ways to import documents into Health Discovery. Documents can be imported from your own client or from any server, either from files or from a database.

  • Terminology: Health Discovery allows you to create your own terminology or import terminologies. These can be integrated into text mining, and terms of these terminologies can be found in texts.

  • Text Analysis: This category contains various modules for configuring text mining pipelines, starting text mining processes and viewing text mining results. Different text mining pipelines can also be compared with each other.

  • Search: Health Discovery contains a semantic full-text search that can be configured and used in the various modules of this category.

  • Classification: Health Discovery contains a machine learning based classification module. Users can sort documents manually or automatically into different categories. An intuitive interface enables the training and evaluation of machine learning models.

This user manual first gives you a quick introduction to Health Discovery in the "Getting started" section. It then describes in detail the text mining components and pipelines that are included in Health Discovery.

2. Getting started

2.1. Login and import documents

  1. Enter the URL of Health Discovery in a web browser and log in with your user name and password. If you don’t know the URL or your credentials, contact your system administrator.

  2. On the "Home" page, select "Project Administration", create a new project and name it "default". You will be redirected to the "Project Overview" page.



Figure 1: Project overview of Health Discovery.

  3. On the "Project Overview" page, choose the "Import Documents" module, click on "New Import", give your import batch a name, and select the file format and the documents to be imported. You can import a single file or a zip container with multiple files. Make sure that the zip container doesn’t contain (hidden) subfolders and that the files have the correct file extension.

  4. By clicking on "Import", the document import will start. You can click on the "Refresh" button to the right of your document import to see the progress.




Figure 2: Import Documents in Health Discovery.


You can reach the "Project Overview" page at any time via the breadcrumb navigation in the upper left by clicking on "default".
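The requirement from step 3 (no subfolders or hidden entries in the zip container) can be checked before uploading. The following is a minimal sketch using only the Python standard library. The function name find_problem_entries and the rule that entries starting with a dot or located in "__MACOSX" count as hidden are assumptions for illustration; they are not part of Health Discovery.

```python
import io
import zipfile


def find_problem_entries(source) -> list[str]:
    """Return archive entries that sit in a subfolder or look hidden.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            parts = name.rstrip("/").split("/")
            # Directory entries end with "/"; nested files have more than one path part.
            in_subfolder = len(parts) > 1 or name.endswith("/")
            # Assumption: dot-files and macOS resource folders count as hidden.
            hidden = any(p.startswith(".") or p == "__MACOSX" for p in parts)
            if in_subfolder or hidden:
                problems.append(name)
    return problems


# Build a small in-memory archive for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("report1.txt", "Patient report 1")
    demo.writestr("__MACOSX/._report1.txt", "")
    demo.writestr("nested/report2.txt", "Patient report 2")
buffer.seek(0)

print(find_problem_entries(buffer))
```

An empty result means the archive is flat and contains no entries that this sketch considers hidden.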

2.2. Run a text mining process

Health Discovery typically contains predefined pipelines that are already available when the application starts. Therefore, you can start text mining processing immediately after importing the first documents. Proceed as follows:

  1. On the "Project Overview" page, select "Pipeline Configuration" and start a text mining pipeline, e.g. "discharge".

    Starting the pipeline may take a few minutes, as a lot of information is loaded into the main memory.

  2. Switch back to "Project Overview" and select "Processes".
  3. Click on "New Text Analysis".

  4. Give your text mining process a name, select the document source and the text mining pipeline, and click "Ok".

  5. The text analysis starts now. By refreshing the browser page you can monitor the progress of the text analysis.




Figure 3: Start a text mining process in Health Discovery.

2.3. View text mining results

As soon as a text mining process is running, you can see the results by following these steps:

  1. In "Project Overview", click on "Annotation Editor".

  2. Select the document source and the text analysis process you just ran.

  3. The annotation editor now displays the results of the text analysis graphically. In the upper legend you can select and deselect the different annotation types. On the right you can unfold a side menu, which shows details about the individual annotations found in the text.




Figure 4: View the text mining results in the Annotation Editor.


2.4. Configure your own pipeline

If you want to build your own pipeline from existing text mining components, proceed as follows:

  1. In "Project Overview", click on "Pipeline Configuration".

  2. Click on "Create Pipeline".

  3. Give your pipeline a name, optionally a description and click on "Ok".

  4. Click the pen icon ("Edit Pipeline") to the right of your pipeline.

  5. Select the desired components from the components on the right by clicking on the corresponding left arrow. For more information about the available components and which upstream components they require, see Available Text Mining Annotators & Web Service Specification.




Figure 5: Configure your own pipeline by moving the components from right to left.


2.5. Create your first terminology

If you want to create your own terminology, proceed as follows:

  1. In "Project Administration", select "Terminology Administration".

  2. Click on "Create Terminology".

  3. Assign a "Terminology ID", a "Label", a "Version". Choose whether the terminology should have a hierarchy or not. Leave the "Concept type" on "de.averbis.extraction.types.Concept" and the "Encrypted export" on disabled. Select the language(s) in which the terminology is to be created. Then click on "Ok".




Figure 6: Create your first terminology.


  4. Switch to the "Terminology Editor" by going to the "Project Overview" page and clicking on "Terminology Editor".

  5. Click on the "plus" to the right of your terminology to create the first concept.

  6. Enter a "Concept ID", a "Preferred Term" and optionally a "Comment" and click "Ok".


Figure 7: Create the first concept.


  7. By clicking on the button "Add Terms" you can add more synonyms to the concept.

  8. By clicking on the "plus" to the right of your newly created concept you can create further sub-concepts.

  9. When you have finished terminology work, switch back to the "Terminology Administration" module.

  10. There, click on the icon "Export Terminology" (blue icon with arrow down).

  11. Select "Concept Dictionary XML Exporter" as the export format and click on "Export". This makes the terminology visible for text mining (see also Integrate own terminologies into a text mining pipeline).

  12. By clicking on the "Refresh" button to the right of the terminology you can check the progress of the export. When the terminology has been fully exported, the status changes to "Completed".




Figure 8: Export the terminology to make it visible for text mining.

2.6. Integrate own terminologies into a text mining pipeline

You can import your own terminologies to Health Discovery. Optionally, a mapping mode for each synonym can be imported, too. To import terminologies, you must convert them to the OBO file format.

The minimal structure of your OBO terminology looks like this:

Example of an OBO terminology

synonymtypedef: DEFAULT_MODE "Default Mapping Mode" ! OPTIONAL - only if using mapping modes
synonymtypedef: EXACT_MODE "Exact Mapping Mode" ! OPTIONAL - only if using mapping modes
synonymtypedef: IGNORE_MODE "Ignore Mapping Mode" ! OPTIONAL - only if using mapping modes

[Term]
id: 1
name: First Concept
synonym: "First Concept" DEFAULT_MODE []
synonym: "First Synonym" IGNORE_MODE []
synonym: "Second Synonym" EXACT_MODE []

[Term]
id: 2
name: First Child
is_a: 1 ! First Concept

To import terms with mapping modes, the OBO terminology begins with the synonym type definitions, as shown in the first three lines of the OBO terminology in the example above.

Each concept begins with the flag "[Term]", followed by an "id" and a preferred name with the flag "name". After that you can add as many synonyms as you like with the flag "synonym", each optionally followed by the desired mapping mode. Note: if you would like to define a mapping mode for your concept name (flag "name"), you have to add the term as a synonym, as shown in the example for "First Concept".

Furthermore, if your terminology contains a hierarchy, you can use "is_a" to refer to other concepts of your terminology.
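If your terminology is available in another form, such as a spreadsheet or a database table, the OBO structure described above can be generated by a short script. The following sketch is an illustration only: the function name to_obo and the layout of the input dictionaries are assumptions, and quotation marks inside terms are not escaped.

```python
def to_obo(concepts, use_mapping_modes=True):
    """Serialize concepts into the minimal OBO structure described above.

    Each concept is a dict with "id", "name", an optional list
    "synonyms" of (term, mapping_mode) pairs, and an optional list
    "parents" of (parent_id, parent_name) pairs.
    """
    lines = []
    if use_mapping_modes:
        # The synonym type definitions are only needed for mapping modes.
        lines += [
            'synonymtypedef: DEFAULT_MODE "Default Mapping Mode"',
            'synonymtypedef: EXACT_MODE "Exact Mapping Mode"',
            'synonymtypedef: IGNORE_MODE "Ignore Mapping Mode"',
            "",
        ]
    for concept in concepts:
        lines.append("[Term]")
        lines.append(f"id: {concept['id']}")
        lines.append(f"name: {concept['name']}")
        for term, mode in concept.get("synonyms", []):
            suffix = f" {mode}" if (mode and use_mapping_modes) else ""
            lines.append(f'synonym: "{term}"{suffix} []')
        for parent_id, parent_name in concept.get("parents", []):
            # The text after "!" is an OBO comment naming the parent.
            lines.append(f"is_a: {parent_id} ! {parent_name}")
        lines.append("")
    return "\n".join(lines)


concepts = [
    {
        "id": "1",
        "name": "First Concept",
        "synonyms": [
            ("First Concept", "DEFAULT_MODE"),
            ("First Synonym", "IGNORE_MODE"),
            ("Second Synonym", "EXACT_MODE"),
        ],
    },
    {"id": "2", "name": "First Child", "parents": [("1", "First Concept")]},
]

obo_text = to_obo(concepts)
print(obo_text)
```

Write the resulting text to a file and select that file in the import dialog.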


To import a terminology like the one shown above, proceed as follows:

  1. In "Project Overview", click on "Terminology Administration".

  2. Click on "Create New Terminology". Fill in the dialog as described in Create your first terminology.

  3. Once you have created a terminology, click the up arrow icon to the right of the terminology.

  4. In the "Import Terminology" dialog, select "OBO Importer" as import format. Then select the terminology you want to import from the file system. Click on "Import".

  5. By clicking on the "Refresh" button to the right of the terminology you can check the progress of the import. When the terminology has been fully imported, the status changes to "Completed".

  6. To browse your terminology, switch to the "Terminology Editor" by going to the "Project Overview" page and clicking on "Terminology Editor".




Figure 9: Import your own terminologies into Health Discovery.


After successful terminology import, terms, hierarchies and mapping modes can be checked in the Terminology Editor.



Figure 10: Terminology Editor showing imported terminology


2.7. Use the Web Service

All text mining pipelines configured and started in Health Discovery can also be accessed via web service. To do this, proceed as follows:

  1. Add the suffix "/rest/swagger-ui.html" to the URL of Health Discovery (e.g. https://<YOURURL>/health-discovery/rest/swagger-ui.html).

  2. Click on "text-analysis-controller" and then on "/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseText".

  3. Enter "default" as "projectName" and "discharge" as "pipelineName" (or the name of another started pipeline).

  4. Enter any text in the field "text".




Figure 11: Use Swagger UI to access our RESTful Web Service.


  5. The field "language" can be left blank for the "discharge" pipeline, as the pipeline automatically recognizes the language.

  6. Click on "Try it out!"

In the field "Response Body" you can now view the return values in JSON format.
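The same call can be prepared from a script. The sketch below only builds the request and does not send it. It rests on several assumptions that you should verify in Swagger UI for your installation: that the endpoint path is relative to ".../rest", that the text is sent as a plain-text POST body, and that no additional authentication headers are needed. The host name example.org is a placeholder.

```python
import urllib.request
from urllib.parse import quote


def build_analyse_text_url(base_url: str, project_name: str, pipeline_name: str) -> str:
    """Compose the analyseText URL from the path shown in Swagger UI."""
    return (
        f"{base_url.rstrip('/')}/rest/textanalysis"
        f"/projects/{quote(project_name, safe='')}"
        f"/pipelines/{quote(pipeline_name, safe='')}/analyseText"
    )


def build_request(url: str, text: str) -> urllib.request.Request:
    """Prepare the HTTP request (assumption: POST with a plain-text body)."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/json",
        },
        method="POST",
    )


url = build_analyse_text_url("https://example.org/health-discovery", "default", "discharge")
request = build_request(url, "suspected history of appendicitis")
print(request.get_method(), request.full_url)

# Sending requires a reachable server and most likely credentials:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```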

3. Available Text Mining Annotators & Web Service Specification

Health Discovery contains a number of pipelines and text mining components. These can be configured in the "Pipeline Configuration" module. The individual components are described below. In addition to a short description, each entry specifies which annotation types the component requires as input and which types it generates. A web service example of an annotation of the corresponding type is also given.


3.1. BiologicallyDerivedProducts

3.1.1. Description

A biologically derived product is a material substance originating from a biological entity intended to be transplanted or infused into another biological entity. Examples of biologically derived products include hematopoietic stem cells such as bone marrow, peripheral blood, or cord blood extraction. This annotator extracts information about the type of the transplanted biological product, the amount of transplanted cells, and the date in the context of allogeneic transplantations.

Currently, the annotation is limited to the extraction of the biological product of CD34-positive stem cells.

3.1.2. Input

3.1.3. Output

Annotation Type: de.averbis.types.health.BiologicallyDerivedProduct


Table 1: Features BiologicallyDerivedProduct

Attribute | Description | Type
quantity | The volume of the product which was transplanted. | NumericValue
time | Temporal information (time, date or date interval) about the transplantation. kind: Possible values (default is underlined): null, TIME, DATE, DATEINTERVAL. value: value of the temporal information. | Timex3
matchedTerm | Matching synonym of the biologically derived product concept. | String
dictCanon | Preferred term of the biologically derived product concept. | String
conceptId | The ID of the concept. | String
source | The name of the terminology source. | String
uniqueId | Unique identifier of the concept of the format 'terminologyId:conceptId'. | String


3.1.4. Terminology Binding

Name | Languages | Version | Identifier | Comment
Averbis Lab Terminology | EN, DE | 2.0 | Averbis-Lab-Terminology_2.0 | Laboratory and vital signs parameters, ID based on LOINC codes (LOINC parts) composed by Averbis.


3.1.5. Web Service Example

Text Example: On 11/11/2008 transfusion of 4.5x 106 CD34-positive cells/kg

    {
      "begin": 29,
      "end": 42,
      "type": "de.averbis.types.health.BiologicallyDerivedProduct",
      "coveredText": "4.5x106 CD34",
      "id": 4735,
      "quantity": 4500000,
      "matchedTerm": "CD34+",
      "dictCanon": "CD34+",
      "conceptId": "78002-3",
      "time": {
        "kind": "DATE",
        "value": "2008-11-11"
      },
      "source": "Averbis-Lab-Terminology_2.0",
      "uniqueId": "Averbis-Lab-Terminology_2.0:78002-3"
    }
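The uniqueId feature combines the terminology identifier and the concept ID in the format 'terminologyId:conceptId'. A client can split it again as sketched below. The helper name split_unique_id is hypothetical, and the sketch assumes that the terminology identifier itself contains no colon.

```python
def split_unique_id(unique_id: str) -> tuple[str, str]:
    """Split 'terminologyId:conceptId' at the first colon."""
    terminology_id, separator, concept_id = unique_id.partition(":")
    if not separator or not terminology_id or not concept_id:
        raise ValueError(f"not in 'terminologyId:conceptId' format: {unique_id!r}")
    return terminology_id, concept_id


# Values taken from the web service example above.
annotation = {
    "conceptId": "78002-3",
    "source": "Averbis-Lab-Terminology_2.0",
    "uniqueId": "Averbis-Lab-Terminology_2.0:78002-3",
}

terminology_id, concept_id = split_unique_id(annotation["uniqueId"])
print(terminology_id == annotation["source"], concept_id == annotation["conceptId"])
```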


3.2. Chimerism

3.2.1. Description

This component annotates information about chimerism. In the field of transplantation medicine, a chimerism analysis is performed after stem cell or bone marrow transplantation to determine whether the recipient’s hematopoietic system is only derived from the donor or not. The chimerism is called "complete" if more than 95% of the tested hematopoietic cells originate from the donor, otherwise the chimerism is called "mixed".

3.2.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.2.3. Output

Annotation Type: de.averbis.types.health.Chimerism


Table 2: Chimerism Features

Attribute | Description | Type
kind | The kind of the actual chimerism. Possible values (default is underlined): null, COMPLETE, MIXED | String
value | Numeric value of chimerism. | NumericValue
date | Date of chimerism analysis. | Date

3.2.4. Web Service Example

Example:

Chimärismusanalyse vom 17.11.2008: Nachweis von 85,2 % Donorzellen.

{
         "begin": 48,
         "end": 66,
         "type": "de.averbis.types.health.Chimerism",
         "coveredText":  "85,2 % Donorzellen",
         "id": 3107,
         "date": "2008-11-17" ,
         "kind": "MIXED" ,
         "value": 85.2
}
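The relation between the features value and kind follows the rule from the description: more than 95% donor cells means "complete", anything else "mixed". The sketch below re-implements only this stated rule for illustration; it is not the annotator's actual code, and the function name chimerism_kind is hypothetical.

```python
def chimerism_kind(donor_percentage: float) -> str:
    """Classify a chimerism value using the 95% threshold from the description."""
    if not 0 <= donor_percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # "Complete" requires strictly more than 95% donor cells.
    return "COMPLETE" if donor_percentage > 95 else "MIXED"


# The value from the web service example above.
print(chimerism_kind(85.2))  # prints MIXED
```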


3.3. Clinical Sections

3.3.1. Description

This component detects sections in medical documents. These sections can refer to diagnoses, medications, therapies, etc.

3.3.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.3.3. Output

Annotation Type: de.averbis.types.health.ClinicalSection


Table 3: Clinical Section Features

Attribute | Description | Type
dictCanon | Preferred term of the clinical section concept. | String
matchedTerm | Matching synonym of the clinical section concept. | String
uniqueId | Unique identifier of the clinical section concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String

The type ClinicalSection has many subtypes that encode the kind of section, e.g. DiagnosisSection, LaboratorySection, AnamnesisSection, etc.


3.3.4. Terminology Binding


Table 4: Terminology Bindings

Name | Languages | Version | Identifier | Comment
clinical-Sections | EN, DE | 1.0 | clinical_sections_de, clinical_sections_en | Types of clinical sections, ID predominantly based on LOINC codes composed and enriched with synonyms by Averbis.

3.3.5. Web Service Example

Example:

Medication Citation|Active|CM| TraMADol HCl - 50 MG Oral Tablet;TAKE 1 TABLET 3 TIMES DAILY.; RPT~Tylenol Arthritis Ext Relief 650 MG TBCR;TAKE 1 TABLET 3-4 TIMES DAILY.; RPT~CeleBREX 200 MG Oral Capsule;TAKE 1 CAPSULE DAILY.; RPT~Folbic TABS;; RPT~Folic Acid 1 MG Oral Tablet;TAKE 1 TABLET DAILY.; RPT~PredniSONE 10 MG Oral Tablet;TAKE 1 TABLET AS NEEDED.; RPT~Cholestyramine 4 GM Oral Packet;MIX THE CONTENTS OF 1 POWDER PACKET WITH 2 TO 6 OZ OF NONCARBONATED BEVERAGE AND DRINK 3 TIMES DAILY.; RPT~Methotrexate 2.5 MG Oral Tablet;TAKE 1 TABLET WEEKLY.; RPT~Citracal Plus Oral Tablet;TAKE 2 TABLET DAILY; RPT~Multi Vitamin Daily TABS;TAKE 1 TABLET DAILY.; RPT~Miscellaneous Medication;Schiff "Move Free". 400 MG taken once daily; RPT


{
         "begin": 0,
         "end": 738,
         "type": "de.averbis.types.health.ClinicalSection",
         "coveredText": "Medication Citation|Active|CM|  \n TraMADol HCl - 50 MG Oral Tablet;TAKE 1 TABLET 3 TIMES DAILY.; RPT~Tylenol Arthritis Ext Relief 650 MG TBCR;TAKE 1 TABLET 3-4 TIMES DAILY.; RPT~CeleBREX 200 MG Oral Capsule;TAKE 1 CAPSULE DAILY.; RPT~Folbic TABS;; RPT~Folic Acid 1 MG Oral Tablet;TAKE 1 TABLET DAILY.; RPT~PredniSONE 10 MG Oral Tablet;TAKE 1 TABLET AS NEEDED.; RPT~Cholestyramine 4 GM Oral Packet;MIX THE CONTENTS OF 1 POWDER PACKET WITH 2 TO 6 OZ OF NONCARBONATED BEVERAGE AND DRINK 3 TIMES DAILY.; RPT~Methotrexate 2.5 MG Oral Tablet;TAKE 1 TABLET WEEKLY.; RPT~Citracal Plus Oral Tablet;TAKE 2 TABLET DAILY; RPT~Multi Vitamin Daily TABS;TAKE 1 TABLET DAILY.; RPT~Miscellaneous Medication;Schiff \" Move Free \" . 400 MG taken once daily; RPT",
         "id": 21810,
         "matchedTerm": "Medication",
         "dictCanon": "Medication",
         "conceptId": "29549-3",
         "source": "clinical_sections_en",
         "label": "MedicationSection",
         "uniqueId": "clinical_sections_en:29549-3"
}

3.4. Departments

3.4.1. Description

This component annotates medical departments in clinical notes, e.g. Paediatrics, Neurology, Orthodontics...

3.4.2. Input 

Above this annotator, the following annotators must be included in the pipeline:

3.4.3. Output

Annotation Type: de.averbis.types.health.Department


Table 5: Features

Attribute | Description | Type
dictCanon | Preferred term of the department (concept) as defined in the terminology. | String
matchedTerm | The term that matched a department concept in the terminology. | String
conceptId | The ID of the matched department concept in the terminology. | String
source | The name of the terminology source. | String
uniqueId | Unique identifier of the department concept of the format 'terminologyId:conceptId'. | String

3.4.4. Terminology Binding


Table 6: Terminology Binding

Country | Languages | Version | Identifier | Comment
United States, Germany | EN, DE | 1.0 | Averbis-SpecialistDepartment_1.0 | Terminology of department names, composed by Averbis and enriched with terms from SNOMED CT.


3.4.5. Web Service Example

Example Text: Service: NEONATOLOGY

 {
      "begin": 9,
      "end": 20,
      "type": "de.averbis.types.health.Department",
      "coveredText": "NEONATOLOGY",
      "id": 414,
      "matchedTerm": "Neonatology",
      "dictCanon": "Neonatology",
      "conceptId": "408445005",
      "source": "Averbis-SpecialistDepartment_1.0",
      "uniqueId": "Averbis-SpecialistDepartment_1.0:408445005"
 }
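The features begin and end locate the annotation in the analyzed text. In the example above they behave like zero-based character offsets with an exclusive end, so slicing the input text reproduces coveredText. The sketch below checks this for the example; treating the offsets this way in general is an assumption derived from the examples in this manual, and the helper name covered_text is hypothetical.

```python
def covered_text(document_text: str, annotation: dict) -> str:
    """Return the text span addressed by an annotation's begin/end offsets."""
    return document_text[annotation["begin"]:annotation["end"]]


text = "Service: NEONATOLOGY"
annotation = {"begin": 9, "end": 20, "coveredText": "NEONATOLOGY"}

print(covered_text(text, annotation) == annotation["coveredText"])  # prints True
```

For texts containing characters outside the Basic Multilingual Plane (for example emoji), offsets counted by the server may differ from Python string indices.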

3.5. Diagnoses

3.5.1. Description

This component detects a condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern.

3.5.2. Input

Above this annotator, the following annotators must be included in the pipeline:

To get the full functionality, the following annotators should also be included below this annotator in the given order:

3.5.3. Output

Annotation Type: de.averbis.types.health.Diagnosis


Table 7: Features

Attribute | Description | Type
dictCanon | Preferred term of the condition. | String
matchedTerm | The matching synonym of the Diagnosis concept. | String
uniqueId | Unique identifier of a concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String
negatedBy | Specifies the negation word, if one exists. | String
verificationStatus | Verification status of the actual diagnosis. Possible values (default is underlined): null, NEGATED, ASSURED, SUSPECTED, DIFFERENTIAL | String
clinicalStatus | Clinical status of the actual diagnosis. Possible values (default is underlined): null, ACTIVE, RESOLVED | String
kind | The kind of the diagnosis. Possible values (default is underlined): null, main, secondary | String
side | The laterality of the diagnosis. Possible values (default is underlined): null, RIGHT, LEFT, BOTH | String
laterality | The laterality of the diagnosis. Possible values (default is underlined): null, RIGHT, LEFT, BOTH. WARNING: This feature is deprecated and will be removed in V5.6 of Health Discovery. It will be replaced by the equivalent attribute 'side'. | String

3.5.4. Terminology Binding


Table 8: Terminology Bindings

Country | Name | Version | Identifier | Comment
United States | ICD-10-CM-Averbis | 2017 | ICD-10-CM-Averbis_2017 | International Classification of Diseases, 10th Edition, Clinical Modification, 2017, enriched with synonyms from SNOMED CT and by Averbis.
Germany | ICD-10-GM-Averbis | 2018 | ICD-10-GM-Averbis_2018 | International Classification of Diseases, 10th Edition, German Modification, 2018, enriched with synonyms by Averbis.

3.5.5. Web Service Example

Example:

suspected history of appendicitis

{
        "begin": 10,
        "end": 33,
        "type": "de.averbis.types.health.Diagnosis",
        "id": 627,
        "coveredText": "history of appendicitis",
        "negatedBy": null,
        "matchedTerm": "History of appendicitis",
        "verificationStatus": "SUSPECTED",
        "kind": null,
        "dictCanon": "Personal history of other diseases of the digestive system",
        "conceptId": "Z87.19",
        "source": "ICD-10-CM-Averbis_2017",
        "clinicalStatus": "RESOLVED",
        "belongsTo": null,
        "laterality": null,
        "uniqueId": "ICD-10-CM-Averbis_2017:Z87.19"
}
      

3.6. Diagnosis Status

3.6.1. Description

This annotator recognizes the status of diagnoses. Possible statuses include, for example, "suspected" or "history of".

3.6.2.  Input

Above this annotator, the following annotator must be included in the pipeline:

3.6.3. Output

This annotator sets the features belongsTo, verificationStatus and clinicalStatus in annotations of type Diagnosis. It also changes conceptId and uniqueId if the diagnosis does not belong to the patient but, for example, to a family member.

3.6.4. Web Service Example

Example 1 (ClinicalStatus):

history of appendicitis

{
        "begin": 3,
         "end": 26,
         "type": "de.averbis.types.health.Diagnosis",
         "id": 637,
         "coveredText": "history of appendicitis",
         "negatedBy": null,
         "verificationStatus": null,
         "kind": null,
         "dictCanon": "Personal history of other diseases of the digestive system",
         "conceptId": "Z87.19",
         "source": "ICD-10-CM-Averbis_2017",
         "clinicalStatus": "RESOLVED",
         "uniqueId":  "ICD-10-CM-Averbis_2017:Z87.19" 
}
      

Example 2 (FamilyDiagnosis):

father has diabetes mellitus

{
         "begin": 11,
         "end": 28,
         "type": "de.averbis.types.health.Diagnosis",
         "id": 656,
         "coveredText": "diabetes mellitus",
         "negatedBy": null,
         "verificationStatus": null,
         "kind": null,
         "dictCanon": "Type 2 diabetes mellitus without complications",
         "conceptId": "Z83.3",
         "source": "ICD-10-CM-Averbis_2017",
         "clinicalStatus": null,
         "belongsTo": "FAMILY" ,
         "uniqueId": "ICD-10-CM-Averbis_2017:Z83.3" 
}
      

3.7. Disambiguation

3.7.1. Description

In case of ambiguous annotations this component decides which annotations should be valid in the given context, e.g. within a list of laboratory values the parameter 'Calcium' represents a laboratory parameter and not an ingredient.

3.7.2. Input

This component requires annotations of at least one of the following types:

3.7.3. Output

Only the annotation that is evaluated as valid is kept; the others are discarded.

3.7.4. Web Service Example

There is no special web service return for Disambiguation.

3.8. Enumerations

3.8.1. Description

This component detects enumerations. The enumerations are recognized based on atomic text units (e.g. chunks) and conjunctions (e.g. the word "and").

3.8.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.8.3. Output

This component sets the following internal type that is not visible in the annotation editor:

Annotation Type: de.averbis.types.Enumeration

3.8.4. Web Service Example

The enumeration itself is not returned in the web service. However, the following example shows that both diagnoses are assigned the status "SUSPECTED".

Example:

suspicion of bronchitis or asthma bronchiale

{
         "begin": 13,
         "end": 23,
         "type": "de.averbis.types.health.Diagnosis",
         "id": 842,
         "coveredText": "Bronchitis",
         "negatedBy": null,
         "verificationStatus": "SUSPECTED",
         "kind": null,
         "dictCanon": "Bronchitis, not specified as acute or chronic",
         "conceptId": "J40",
         "source": "ICD-10-CM-Averbis_2017",
         "clinicalStatus": null,
         "uniqueId": "ICD-10-CM-Averbis_2017:J40" 
},
{
         "begin": 27,
         "end": 44,
         "type": "de.averbis.types.health.Diagnosis",
         "id": 862,
         "coveredText": "asthma bronchiale",
         "negatedBy": null,
         "verificationStatus": "SUSPECTED",
         "kind": null,
         "dictCanon": "Unspecified asthma, uncomplicated",
         "conceptId": "J45.909",
         "source": "ICD-10-CM-Averbis_2017",
         "clinicalStatus": null,
         "uniqueId": "ICD-10-CM-Averbis_2017:J45.909" 
}
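On the client side, such a response can be filtered for diagnoses with a given verification status. The following sketch assumes that the response body is a JSON array of annotation objects like the two shown above; the actual envelope of the web service response may differ, and the function name diagnoses_with_status is hypothetical. Only a subset of the features is reproduced here.

```python
import json

DIAGNOSIS_TYPE = "de.averbis.types.health.Diagnosis"


def diagnoses_with_status(annotations, status):
    """Keep only Diagnosis annotations with the given verificationStatus."""
    return [
        annotation
        for annotation in annotations
        if annotation.get("type") == DIAGNOSIS_TYPE
        and annotation.get("verificationStatus") == status
    ]


# Shortened version of the two annotations from the example above.
response_body = json.loads("""
[
  {"begin": 13, "end": 23, "type": "de.averbis.types.health.Diagnosis",
   "coveredText": "Bronchitis", "verificationStatus": "SUSPECTED", "conceptId": "J40"},
  {"begin": 27, "end": 44, "type": "de.averbis.types.health.Diagnosis",
   "coveredText": "asthma bronchiale", "verificationStatus": "SUSPECTED", "conceptId": "J45.909"}
]
""")

suspected = diagnoses_with_status(response_body, "SUSPECTED")
for annotation in suspected:
    print(annotation["conceptId"], annotation["coveredText"])
```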
      

3.9. GenericTerminologyAnnotator

3.9.1. Description

The GenericTerminologyAnnotator recognizes terms from terminologies created in Health Discovery’s "Terminology Editor" module.

3.9.2. Input

Above this annotator, the following annotator must be included in the pipeline:

3.9.3. Output

The component creates annotations of type:

Annotation Type: de.averbis.extraction.types.Concept


Table 9: Features

Attribute | Description | Type
dictCanon | Preferred term of the concept. | String
uniqueId | Unique identifier of a concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String
matchedTerm | The matching synonym of the terminology source. | String

The exact type depends on the terminology files used and the concept types specified in them.

3.9.4. Configuration


Table 10: Configuration

Name | Description | Type | MultiValued | Mandatory
terminologyNames | Names of the source terminologies. | String | true | false
useExactLookup | Apply exact lookup. | Boolean | false | true
useStemLookup | Apply lookup based on stems. | Boolean | false | true
useSegmentLookup | Apply lookup based on segments. | Boolean | false | true
exactCaseVariant | Defines the case matching of the exact lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | String | false | true
stemCaseVariant | Defines the case matching of the stem lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | String | false | true
segmentCaseVariant | Defines the case matching of the segment lookup. Available variants: CASE_MATCH, CASE_INSENSITIVE, CASE_FOLD_DIGITS, CASE_IGNORE. | String | false | true

3.9.5. Web Service Example

Example:

Appendizitis

{
  "begin": 0,
  "end": 12,
  "type": "de.averbis.types.health.Concept",
  "coveredText": "Appendizitis",
  "id": 303,
  "matchedTerm": "Appendizitis",
  "dictCanon": "Appendizitis",
  "conceptId": "2",
  "source": "test_1.0",
  "uniqueId": "test_1.0:2"
}
   

3.10. Gleason Score

3.10.1. Description

This component recognizes Gleason score annotations.

3.10.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.10.3. Output

Annotation Type: de.averbis.types.health.GleasonScore


Table 11: GleasonScore Features

Attribute | Description | Type
score | The combined score. | NumericValue
primary | The primary grade (not always available). | NumericValue
secondary | The secondary grade (not always available). | NumericValue

3.10.4. Web Service Example

Example: Gleason Pattern 3(60%) + 4(40%) = 7

{
         "begin": 0,
         "end": 35,
         "type": "de.averbis.types.health.GleasonScore",
         "id": 2723,
         "coveredText": "Gleason Pattern 3(60%) + 4(40%) = 7",
         "score": 7,
         "primaryGrade": 3,
         "secondaryGrade": 4
}
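Because the Gleason score is the sum of the primary and the secondary grade, a client can run a simple plausibility check on returned annotations. The sketch below uses the feature names from the web service example above (primaryGrade, secondaryGrade); the function name is hypothetical, and annotations without both grades are treated as not checkable.

```python
def gleason_is_consistent(annotation: dict) -> bool:
    """Check that score equals primary + secondary grade when both are present."""
    primary = annotation.get("primaryGrade")
    secondary = annotation.get("secondaryGrade")
    if primary is None or secondary is None:
        # The grades are not always available; nothing to verify then.
        return True
    return annotation["score"] == primary + secondary


annotation = {"score": 7, "primaryGrade": 3, "secondaryGrade": 4}
print(gleason_is_consistent(annotation))  # prints True
```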
      

3.11. GvHD

3.11.1. Description

This component recognizes information about the occurrence of a GvHD (Graft-versus-Host-Disease).

3.11.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.11.3. Output

Annotation Type: de.averbis.types.health.GvHDDiagnosisConcept

Annotation Type: de.averbis.types.health.GvHD


Table 12: GvHD Features

Attribute | Description | Type
continuanceStatus | GvHD status. Possible values (default is underlined): null, ACUTE, CHRONIC | String
grade | Grade of the GvHD diagnosis. Possible values (default is underlined): null, I, II, III, IV | String
stage | Stage of the GvHD diagnosis. Possible values (default is underlined): null, 1, 2, 3, 4, LIMITED, EXTENDED | String
organ | Organ diagnosed with GvHD. Possible values (default is underlined): null, SKIN, LIVER, INTESTINAL, EYE, LUNG, CONNECTIVE TISSUE, MUCOSA, VAGINAL | String
date | The date of the diagnosis. | Date

GvHD is a subtype of Diagnosis, i.e. it inherits all features of Diagnosis.


3.11.4. Web Service Example

Example:

Akute Transplantat-gegen-Wirt Erkrankung Stadium 3 der Haut, Schweregrad III

{
     "begin": 0,
     "end": 76,
     "type": "de.averbis.types.health.GvHD",
     "coveredText": "Akute Transplantat-gegen-Wirt Erkrankung Stadium 3 der Haut, Schweregrad III",
     "id": 3548,
     "organ": "SKIN",
     "date": null,
     "stage": "3",
     "grade": "III",
     "continuanceStatus": "ACUTE" 
}

3.12. Health Measurements

3.12.1. Description

This component detects measurements in medical texts.

3.12.2. Input

Above this annotator, the following annotator must be included in the pipeline:

When a measurement annotation is generated, a NumericValue and a unit are combined. The LaboratoryConcept annotation allows a measurement to be generated even when a unit is missing, e.g. Hb 11.

The Health Preprocessing pipeline block provides most of the prerequisite annotation types to ensure the proper functionality of the HealthMeasurement annotation. In order to use the positive effect of available LaboratoryConcept annotations, this annotator is included in Laboratory Values, but it can also be used separately.


3.12.3. Output

Annotation Type: de.averbis.types.health.Measurement


Table 13: Measurement Features

Attribute | Description | Type
unit | The unit of the measurement. | Unit
normalizedUnit | Normalized string value of the unit. | String
normalizedValue | Normalized value of the measurement. This value is the result of the transformation of the numeric value according to the transformation of the unit to its standard unit. | Double
value | The numeric value of the measurement. | NumericValue
dimension | The dimension of the unit, e.g. [M] standing for mass in the example below. | String

3.12.4. Web Service Example

HealthMeasurements are only returned in the context of Laboratory Values and Medications.
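The idea behind normalizedValue and normalizedUnit can be illustrated with a small unit conversion. The conversion table below is purely hypothetical: this manual does not state which standard units or factors Health Discovery uses, so gram is chosen here only as an example standard unit for mass.

```python
# Hypothetical table: unit -> (standard unit, factor to the standard unit).
TO_STANDARD_UNIT = {
    "g": ("g", 1.0),
    "mg": ("g", 1e-3),
    "kg": ("g", 1e3),
}


def normalize(value: float, unit: str) -> tuple[float, str]:
    """Convert a value to its standard unit using the table above."""
    standard_unit, factor = TO_STANDARD_UNIT[unit]
    return value * factor, standard_unit


normalized_value, normalized_unit = normalize(500, "mg")
print(normalized_value, normalized_unit)
```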

3.13. Health Preprocessing

3.13.1. Description

This pipeline block is responsible for preprocessing the input documents and preparing the minimal set of required annotations which serve as input for the subsequent components. Among others, this pipeline block recognizes and annotates words, sentences, abbreviations, temporal expressions and numerical values. Additionally, it filters out stopwords (i.e., commonly used words that carry little meaning) and corrects sentence segmentation errors caused by abbreviations.

For the optimal functionality of the subsequent components, it is recommended to run the Health Preprocessing beforehand.


3.13.2. Input

Above this annotator, one of the following annotators must be included in the pipeline:

3.13.3. Output

This component generates annotations which will be processed by the subsequent components, e.g. words, sentences, abbreviations, temporal expressions and numerical values.

3.13.4. Web Service Example

The annotations generated by the preprocessing pipeline block are not returned in the web service.

3.14. HLA

3.14.1. Description

This component annotates information about HLA (human leukocyte antigen).

3.14.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.14.3. Output

Annotation Type: de.averbis.types.health.HLA


Table 14: HLA Features

Attribute | Description | Type
parameter | The concept of HLA containing the matched and preferred term. | HLAConcept
male | Paternal HLA manifestation. | HLAValue
female | Maternal HLA manifestation. | HLAValue
samplingDate | Date of sampling. | Date
receiptDate | Date of receipt of sample. | Date
date | Date of observation. | Date

Annotation Type: de.averbis.types.health.HLAValue


Table 15: HLAValue Features

Attribute | Description | Type
alleleGroup | Allele group of actual HLA. | String
protein | Specific protein of actual HLA. | String
synonymousDNA | Synonymous DNA substitution within the coding region. | String
noncodingRegionVariant | Differences in non-coding region. | String
expressionNote | Suffix to code changes in expression. | String

Annotation Type: de.averbis.types.health.HLAConcept


Table 16: HLAConcept Features

Attribute | Description | Type
dictCanon | Preferred term of the HLA concept. | String
matchedTerm | Matching synonym of the HLA concept. | String
uniqueId | Unique identifier of the HLA concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String

3.14.4. Web Service Example

Example:

HLA-A 0101, 6801

{
         "begin": 0,
         "end": 16,
         "type": "de.averbis.types.health.HLA",
         "coveredText": "HLA-A 0101, 6801",
         "id": 1105,
         "date": null,
         "parameter": {
                 "begin": 0,
                 "end": 5,
                 "type": "de.averbis.types.health.HLAConcept",
                 "coveredText": "HLA-A",
                 "id": 863,
                 "matchedTerm": "HLA-A",
                 "dictCanon": "HLA-A",
                 "conceptId": "LP18319-1",
                 "source": "Averbis-Lab-Terminology_2.0",
                 "uniqueId": "Averbis-Lab-Terminology_2.0:LP18319-1" 
        },
         "receiptDate": null,
         "female": {
                 "begin": 12,
                 "end": 16,
                 "type": "de.averbis.types.health.HLAValue",
                 "coveredText": "6801",
                 "id": 1030,
                 "alleleGroup": "68",
                 "noncodingRegionVariant": null,
                 "protein": "01",
                 "synonymousDNA": null,
                 "expressionNote": null
        },
         "samplingDate": null,
         "male": {
                 "begin": 6,
                 "end": 10,
                 "type": "de.averbis.types.health.HLAValue",
                 "coveredText": "0101",
                 "id": 1007,
                 "alleleGroup": "01",
                 "noncodingRegionVariant": null,
                 "protein": "01",
                 "synonymousDNA": null,
                 "expressionNote": null
        }
}
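The relation between the covered text of an HLAValue and its alleleGroup and protein features in the example above can be sketched in Python. This is only an illustration under the assumption of a legacy four-digit notation such as "6801". The function `split_hla_value` is a hypothetical name, and the sketch is not the implementation used by the component, which may handle further notations.

```python
import re

# Assumption: a legacy four-digit HLA value such as "6801", where the first
# two digits are the allele group and the last two the specific protein.
FOUR_DIGIT = re.compile(r"(\d{2})(\d{2})")


def split_hla_value(text: str) -> dict:
    """Split a four-digit HLA value into allele group and protein."""
    match = FOUR_DIGIT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a four-digit HLA value: {text!r}")
    return {"alleleGroup": match.group(1), "protein": match.group(2)}


print(split_hla_value("6801"))
```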
      

3.15. Irradiation

3.15.1. Description

This component recognizes information about a previous irradiation therapy.

3.15.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.15.3. Output

Annotation Type: de.averbis.types.health.Irradiation


Table 17: Irradiation Features

Attribute | Description | Type
concept | The actual concept of the irradiation. | IrradiationTherapyConcept
irradiationDose | The irradiation dose. | IrradiationDose
startDate | The start date of the irradiation therapy. | String
endDate | The end date of the irradiation therapy. | String

Annotation Type: de.averbis.types.health.IrradiationDose


Table 18: IrradiationDose Features

Attribute | Description | Type
kind | The irradiation dose kind. Possible values (default is underlined): null, FRACTIONAL | String
dose | The dose. | Measurement

Annotation Type: de.averbis.types.health.IrradiationTherapyConcept


Table 19: IrradiationTherapyConcept Features

Attribute | Description | Type
dictCanon | Preferred term of the Irradiation concept. | String
matchedTerm | Matching synonym of the Irradiation concept. | String
uniqueId | Unique identifier of the Irradiation concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String

3.15.4. Web Service Example

Example (Irradiation):

Fraktionierte Ganzkörperbestrahlung (TBI) über opponierende Felder mit einer Gesamtdosis von 12 Gy vom 18.11. bis 20.11.2008

{
         "begin": 0,
         "end": 41,
         "type": "de.averbis.types.health.Irradiation",
         "coveredText": "Fraktionierte Ganzkörperbestrahlung (TBI)",
         "id": 5865,
         "endDate": "2008-11-20",
         "concept": {
                 "begin": 14,
                 "end": 41,
                 "type": "de.averbis.types.health.IrradiationTherapyConcept",
                 "coveredText": "Ganzkörperbestrahlung (TBI)",
                 "id": 5773,
                 "matchedTerm": "Ganzkörperbestrahlung",
                 "dictCanon": "Bestrahlung",
                 "conceptId": "10037794",
                 "source": "Averbis-Therapy_1.0",
                 "uniqueId": "Averbis-Therapy_1.0:10037794" 
        },
         "irradiationDose": {
                 "begin": 93,
                 "end": 98,
                 "type": "de.averbis.types.health.IrradiationDose",
                 "coveredText": "12 Gy",
                 "id": 5825,
                 "dose": {
                         "begin": 93,
                         "end": 98,
                         "type": "de.averbis.types.health.Measurement",
                         "coveredText": "12 Gy",
                         "id": 3561,
                         "unit": "Gy",
                         "normalizedUnit": "m²/s²",
                         "normalizedValue": 12,
                         "value": 12,
                         "dimension": "[L]²/[T]²" 
                },
                 "kind": "FRACTIONAL" 
        },
         "startDate": "2008-11-18" 
}
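Because startDate and endDate are returned as ISO date strings, a client can work with them directly. The following Python sketch uses a reduced copy of the annotation above and computes the number of calendar days covered by the therapy. The function `therapy_days` is a hypothetical helper on the client side, not part of the web service.

```python
from datetime import date

# Excerpt of the Irradiation annotation shown above (only the fields used here).
irradiation = {
    "startDate": "2008-11-18",
    "endDate": "2008-11-20",
    "irradiationDose": {"kind": "FRACTIONAL", "dose": {"value": 12, "unit": "Gy"}},
}


def therapy_days(annotation: dict) -> int:
    """Number of calendar days covered by the therapy, both ends included."""
    start = date.fromisoformat(annotation["startDate"])
    end = date.fromisoformat(annotation["endDate"])
    return (end - start).days + 1


print(therapy_days(irradiation))  # prints 3
```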
      

3.16. Laboratory Values

3.16.1. Description

This component detects laboratory values and vital signs.

The annotation of measurements is already integrated in this pipeline block. If other components (e.g. Medication) need measurements, those components should be placed after this block in the pipeline. For more details on measurements, see Health Measurements.


3.16.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.16.3. Output

Annotation Type: de.averbis.types.health.LaboratoryValue


Table 20: LaboratoryValue Features

Attribute | Description | Type
parameter | Parameter of actual laboratory value. | LaboratoryConcept
fact | Measurement of actual laboratory value. | Measurement
factAssessment | An optional relative assessment of the fact. | String
lowerLimit | Lower reference value of actual laboratory value. | Measurement
upperLimit | Upper reference value of actual laboratory value. | Measurement
interpretation | Interpretation of fact depending on reference values or interpretation in text (also possible without fact). Possible values (default is underlined): null, normal, abnormal, high, low | AbstractInterpretation
qualitativeValue | Qualitative value of the actual laboratory value. | QualitativeValue
belongsTo | Indicates whether the laboratory value belongs to a donor or recipient (e.g. in case of transplantations) or to a family member. Possible values (default is underlined): null, DONOR, FAMILY, RECIPIENT | String

Annotation Type: de.averbis.types.health.LaboratoryConcept


Table 21: LaboratoryConcept Features

Attribute | Description | Type
dictCanon | Preferred term of the LaboratoryConcept concept. | String
matchedTerm | Matching synonym of the LaboratoryConcept concept. | String
uniqueId | Unique identifier of the LaboratoryConcept concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String


Annotation Type: de.averbis.types.health.QualitativeValue


Table 22: QualitativeValue Features

Attribute | Description | Type
value | Qualitative statement on a laboratory value. Possible values (default is underlined): null, 1-, - -, 2-, - - -, 3-, 1+, ++, 2+, +++, 3+, APPROPRIATE, EVIDENCE, NEGATIVE, NO_EVIDENCE, POSITIVE, SPECKLED, STAINING, UNKOWN | String
modifier | Describes the characteristic of a qualitative value. Possible values (default is underlined): null, ABNORMAL, ALTERNATING, BORDERLINE, CENTROMERE, CIRCULAR, CONTINUOUS, CYTOPLASMATIC, DEMONSTRABLE, HOMOGENEOUS, MODERATE, NOT_QUANTIFIABLE, NUCLEOLAR, PERINUCLEOLAR, QUANTIFIABLE, QUALITATIVE, STRONG, WEAK | String


Annotation Type: de.averbis.types.health.BloodPressure


Table 23: BloodPressure Features

Attribute | Description | Type
systolic | Measurement of systolic blood pressure. | Measurement
diastolic | Measurement of diastolic blood pressure. | Measurement
interpretation | Interpretation of systolic and diastolic values depending on named interpretations in the text. Possible values (default is underlined): null, normal, abnormal, high, low | AbstractInterpretation

3.16.4. Terminology Binding


Table 24: Terminology Bindings

Name | Languages | Version | Identifier | Comment
Averbis Lab Terminology | EN, DE | 2.0 | Averbis-Lab-Terminology_2.0 | Laboratory and vital signs parameters, ID based on LOINC codes (LOINC parts) composed by Averbis.

3.16.5. Web Service Example

Example 1 (LaboratoryValue with interpretation):

Uric acid 9.6 mg/dl (3.5-7.0)

{
                 "begin": 0,
                 "end": 29,
                 "type": "de.averbis.types.health.LaboratoryValue",
                 "id": 3744,
                 "coveredText":  "Uric acid 9.6 mg/dl (3.5-7.0)",
                 "fact": {
                         "begin": 10,
                         "end": 19,
                         "type": "de.averbis.types.health.Measurement",
                         "id": 3087,
                         "coveredText": "9.6 mg/dl",
                         "unit": "mg/dL",
                         "normalizedUnit":  "kg/m³",
                         "normalizedValue": 0.096,
                         "value": 9.6,
                         "dimension": "[M]/[L]³" 
                },
                 "interpretation": "high",
                 "parameter": {
                         "begin": 0,
                         "end": 9,
                         "type": "de.averbis.types.health.LaboratoryConcept",
                         "id": 2965,
                         "coveredText": "Uric acid",
                         "dictCanon": "Urate", 
                         "conceptId": "LP15935-7",
                         "source": "Averbis-Lab-Terminology_2.0",
                         "uniqueId": "Averbis-Lab-Terminology_2.0:LP15935-7",
                         "matchedTerm": "Uric acid" 
                }
        }
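The interpretation "high" in Example 1 is consistent with comparing the fact against the reference range given in the text. The Python sketch below shows such a comparison. It is a simplified assumption about the logic, not the component's actual rule set, which can also take interpretations stated in the text into account. The function name `interpret` is hypothetical.

```python
from typing import Optional


def interpret(value: float, lower: Optional[float], upper: Optional[float]) -> Optional[str]:
    """Compare a value with its reference range.

    Returns "low", "high" or "normal", or None if no limit is available.
    """
    if lower is None and upper is None:
        return None
    if lower is not None and value < lower:
        return "low"
    if upper is not None and value > upper:
        return "high"
    return "normal"


# Uric acid 9.6 mg/dl with the reference range 3.5-7.0 from Example 1.
print(interpret(9.6, 3.5, 7.0))  # prints high
```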

Example 2 (QualitativeValue):

CMV antibody strong positive

{
         "begin": 0,
         "end": 28,
         "type": "de.averbis.types.health.LaboratoryValue",
         "coveredText": "CMV antibody strong positive",
         "id": 1133,
         "factAssessment": null,
         "fact": null,
         "interpretation": null,
         "parameter": {
                 "begin": 0,
                 "end": 12,
                 "type": "de.averbis.types.health.LaboratoryConcept",
                 "coveredText": "CMV antibody",
                 "id": 676,
                 "matchedTerm": "CMV antibody",
                 "dictCanon": "Cytomegalovirus Ab",
                 "conceptId": "LP37878-3",
                 "source": "Averbis-Lab-Terminology_2.0",
                 "uniqueId": "Averbis-Lab-Terminology_2.0:LP37878-3"
        },
         "upperLimit": null,
         "qualitativeValue": {
                 "value": "POSITIVE",
                 "modifier": "STRONG"
        },
         "lowerLimit": null,
         "belongsTo": null
}


Example 3 (BloodPressure):

RR 129/61 mmHg

{
         "begin": 0,
         "end": 14,
         "type": "de.averbis.types.health.BloodPressure",
         "id": 1717,
         "coveredText": "RR 129/61 mmHg",
         "systolic": {
                 "begin": 3,
                 "end": 6,
                 "type": "de.averbis.types.health.Measurement",
                 "id": 1363,
                 "coveredText":  "129",
                 "unit": "mmHg",
                 "normalizedUnit": "kg/(m·s²)",
                 "normalizedValue": 17198.538,
                 "value": 129,
                 "dimension": "[M]/([L]·[T]²)" 
        },
         "diastolic": {
                 "begin": 7,
                 "end": 14,
                 "type": "de.averbis.types.health.Measurement",
                 "id": 1178,
                 "coveredText": "61 mmHg",
                 "unit": "mmHg",
                 "normalizedUnit": "kg/(m·s²)",
                 "normalizedValue": 8132.642,
                 "value": 61,
                 "dimension": "[M]/([L]·[T]²)" 
        },
         "interpretation": null
}
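The begin and end offsets in Example 3 are character positions in the input text. The following Python sketch reproduces them with a regular expression. It assumes a reading written as "systolic/diastolic mmHg" and is only meant to show how the offsets relate to the text. The pattern and the function `find_blood_pressure` are hypothetical and do not reflect how the component is implemented.

```python
import re

# Assumption: a reading written as "<systolic>/<diastolic> mmHg".
BLOOD_PRESSURE = re.compile(r"(?P<systolic>\d{2,3})/(?P<diastolic>\d{2,3}\s*mmHg)")


def find_blood_pressure(text: str):
    """Return (begin, end, covered text) for the systolic and diastolic parts."""
    match = BLOOD_PRESSURE.search(text)
    if match is None:
        return None
    return {
        name: (match.start(name), match.end(name), match.group(name))
        for name in ("systolic", "diastolic")
    }


print(find_blood_pressure("RR 129/61 mmHg"))
```

For the input of Example 3 this yields the spans 3 to 6 and 7 to 14, the same offsets as in the web service output.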
      

3.17. Language Detection

3.17.1. Description

This component detects the language of a text and sets it accordingly. It currently supports German and English. In contrast to the LanguageSetter, this component determines the language individually for each document.

If no language can be detected, the language is set to 'German'.

3.17.2. Input

The component does not expect any annotations.

3.17.3. Output

The component sets the parameter 'documentLanguage' in the type uima.tcas.DocumentAnnotation.

3.17.4. Web Service Example

Example:

This is a sample text.

{
   "begin": 0,
   "end": 22,
   "type": "de.averbis.types.health.DocumentAnnotation",
   "coveredText": "This is a sample text.",
   "id": 8,
   "language": "en",
   "version": "5.4.0" 
}
      

3.18. LanguageSetter

3.18.1. Description

A language setter sets the text language in a document. It should only be used if the language is the same for all documents that are sent to this pipeline.

3.18.2. Input

The component does not expect any annotations.

3.18.3. Output

The component sets the parameter documentLanguage.

3.18.4. Configuration


Table 25: Configuration LanguageSetter

Name | Description | Type | MultiValued | Mandatory
language | The document language to set if not already set in CAS. | String | false | true
overwriteExisting | If true, an existing document language will be overwritten. | Boolean | false | true
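How the two parameters interact can be summarized in a few lines of Python. The sketch models only the decision described in the table. The function `set_language` and its argument names are hypothetical and not the component's API.

```python
from typing import Optional


def set_language(current: Optional[str], language: str, overwrite_existing: bool) -> str:
    """Return the document language after the setter has run.

    current is the language already stored for the document, or None.
    """
    if current is None or overwrite_existing:
        return language
    return current


print(set_language(None, "de", overwrite_existing=False))  # no language yet
print(set_language("en", "de", overwrite_existing=False))  # existing language kept
print(set_language("en", "de", overwrite_existing=True))   # existing language replaced
```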

3.18.5. Web Service Example

The language is currently not returned in the web service.

3.19. Laterality

3.19.1. Description

This component annotates the laterality or body site of different annotation types, e.g. Diagnosis, Procedure and Ophthalmology.

3.19.2. Input

Above this annotator, the following annotators must be included in the pipeline:


This annotator must be included above the annotators whose feature 'side' it sets.

3.19.3. Output

This annotator sets the feature 'side' in the above-mentioned annotation types.

3.19.4. Web Service Example

As a standalone component, this doesn’t return anything in the web service.

3.20. Medication

3.20.1. Description

This component detects medications, which are a combination of the active ingredient or preparation, a strength, a dose frequency, the dose form, the route of administration and date intervals or a single date.

3.20.2. Input

Above this annotator, the following annotators must be included in the pipeline:

For the annotation of measurements either the Laboratory Values block or the Health Measurements block should be executed beforehand.
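The ordering requirement can be checked mechanically before a pipeline is started. The Python sketch below models a pipeline as a plain list of block names, which is an assumption made for illustration only. Health Discovery does not necessarily expose pipelines in this form, and `medication_has_measurements` is a hypothetical helper.

```python
# Block names as used in this manual; the list-based pipeline model is hypothetical.
MEASUREMENT_BLOCKS = {"Laboratory Values", "Health Measurements"}


def medication_has_measurements(pipeline: list[str]) -> bool:
    """True if a measurement-producing block runs before Medication."""
    if "Medication" not in pipeline:
        return True  # nothing to check
    position = pipeline.index("Medication")
    return any(block in MEASUREMENT_BLOCKS for block in pipeline[:position])


print(medication_has_measurements(["Health Preprocessing", "Laboratory Values", "Medication"]))
print(medication_has_measurements(["Health Preprocessing", "Medication", "Laboratory Values"]))
```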

3.20.3. Output

Annotation Type: de.averbis.types.health.Medication


Table 26: Medication Features

Attribute | Description | Type
drug | Drug or multi drug of the actual medication. | Drug
doseFrequency | Dose frequency of the actual medication. Possible forms are a general DoseFrequency or the more detailed DayTimeDoseFrequency, WeekDayDoseFrequency, TimeMeasurementDoseFrequency etc. | DoseFrequency
doseForm | Dose form of the actual medication. | DoseFormConcept
date | Temporal information (time, date or time interval) about the actual medication. | Timex3
administrations | The routes of administration of this medication. | StringArray
rateQuantity | Amount of medication per unit of time, e.g., 2 doses. | Double
status | Status of the medication. Possible values (default is underlined): null, ADMISSION, ALLERGY, INPATIENT, DISCHARGE, NEGATED, CONSIDERED, INTENDED, FAMILY, CONDITIONING_TREATMENT | MedicationStatus
termTypes | Additional information on clinical drug, e.g. semantic clinical drug (RxNorm TermType). | String

Annotation Type: de.averbis.types.health.Drug


Table 27: Drug Features

Attribute | Description | Type
ingredient | Ingredient of the drug. | IngredientConcept
strength | Strength of the drug. | Strength

Drugs with more than one ingredient (multi drugs) are also detected and consist of multiple Drug-annotations.


Annotation Type: de.averbis.types.health.IngredientConcept

Attribute | Description | Type
dictCanon | Preferred term of the IngredientConcept concept. | String
matchedTerm | Matching synonym of the IngredientConcept concept. | String
uniqueId | Unique identifier of the IngredientConcept concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String

Annotation Type: de.averbis.types.health.Strength


Table 28: Strength Features

Attribute | Description | Type
dictCanon | Preferred term of the strength concept (optional). | String
uniqueId | Unique identifier of the strength concept of the format 'terminologyId:conceptId' (optional). | String
conceptId | The concept id (optional). | String
source | The name of the terminology source (optional). | String
measurement | The actual strength as a measurement. | Measurement

Annotation Type: de.averbis.types.health.DoseFormConcept


Table 29: DoseForm Features

Attribute | Description | Type
dictCanon | Preferred term of the dose form concept. | String
uniqueId | Unique identifier of the dose form concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String

Annotation Type: de.averbis.types.health.DoseFrequency


Table 30: DoseFrequency Features

Attribute | Description | Type
dictCanon | Preferred term of the dose frequency concept (optional). | String
matchedTerm | Matching synonym of the dose frequency concept. | String
uniqueId | Unique identifier of the dose frequency concept of the format 'terminologyId:conceptId' (optional). | String
conceptId | The concept id (optional). | String
source | The name of the terminology source (optional). | String
interval | The taking interval of a medication, e.g. day, week, month etc. | String
totalCount | Total count of taken drug units per interval. | Double
totalDose | Total dose of taken drug per interval. | Measurement
regimen | Annotation of the mentioned regimen, e.g. 'three times daily'. | Annotation
morning / midday / evening / atNight | Only available for DayTimeDoseFrequency: represents the count of drug units to be taken at the different times of day. | Double
monday / tuesday / … / sunday | Only available for WeekDayDoseFrequency: represents the count of drug units to be taken on the different days of the week. | Double

3.20.4. Terminology Binding


Table 31: Terminology Bindings

Country | Name | Version | Identifier | Comment
United States | RxNorm Ingredients | 2018.1 | RxNorm-Ingredients_2018.1 | Subset of RxNorm, a US-specific terminology in medicine that contains all medications available on the US market in 2018, enriched with synonyms by Averbis. This subset contains only the ingredients.
United States | RxNorm Strength | 2018.1 | RxNorm-Strength_2018.1 | Subset of RxNorm, a US-specific terminology in medicine that contains all medications available on the US market in 2018, enriched with synonyms by Averbis. This subset contains only the strengths.
United States | Averbis-Dose-Frequency | 1.0 | Averbis-Dose-Frequency_1.0 | Terminology of dose frequencies, ID based on SNOMED-CT codes composed and enriched by Averbis.
United States / Germany | Averbis Dose Form | 1.0 | Averbis-Dose-Form_1.0 | Terminology of dose forms, composed and enriched by Averbis. Based on SNOMED-CT, RxNorm and Abdamed.
Germany | Abdamed-Averbis | 2017 | Abdamed-Averbis_2017 | Database of pharmaceutical and medication terminology in Germany, 2017, enriched with synonyms by Averbis.

3.20.5. Web Service Example

Example 1:

Aspirin 100 mg 1-0-1 TAB vom 01.01. bis 31.01.2018

{
   "begin": 0,
   "end": 50,
   "type": "de.averbis.types.health.Medication",
   "coveredText": "Aspirin 100 mg 1-0-1 TAB vom 01.01. bis 31.01.2018",
   "id": 5144,
   "date": {
     "kind": "DATEINTERVAL",
     "startDate": "2018-01-01",
     "endDate": "2018-01-31" 
  },
   "administrations": [],
   "drugs": [
    {
       "begin": 0,
       "end": 14,
       "type": "de.averbis.types.health.Drug",
       "coveredText": "Aspirin 100 mg",
       "id": 4794,
       "ingredient": {
         "begin": 0,
         "end": 7,
         "type": "de.averbis.types.health.IngredientConcept",
         "coveredText": "Aspirin",
         "id": 4221,
         "matchedTerm": "Aspirin",
         "dictCanon": "Acetylsalicylsäure",
         "conceptId": "A01AD05-B01AC06-N02BA01",
         "source": "Abdamed-Averbis_2017",
         "uniqueId": "Abdamed-Averbis_2017:A01AD05-B01AC06-N02BA01" 
      },
       "strength": {
         "begin": 8,
         "end": 14,
         "type": "de.averbis.types.health.Measurement",
         "coveredText": "100 mg",
         "id": 3760,
         "unit": "mg",
         "matchedTerm": null,
         "dictCanon": null,
         "conceptId": null,
         "normalizedUnit": "kg",
         "source": null,
         "normalizedValue": 0.0001,
         "value": 100,
         "dimension": "[M]",
         "uniqueId": null
      }
    }
  ],
   "doseForm": {
     "begin": 21,
     "end": 24,
     "type": "de.averbis.types.health.DoseFormConcept",
     "coveredText": "TAB",
     "id": 4254,
     "matchedTerm": "TAB",
     "dictCanon": "Tablette",
     "conceptId": "SCT385055001",
     "source": "Averbis-Dose-Form_1.0",
     "uniqueId": "Averbis-Dose-Form_1.0:SCT385055001" 
  },
   "doseFrequency": {
     "begin": 15,
     "end": 20,
     "type": "de.averbis.types.health.DayTimeDoseFrequency",
     "coveredText": "1-0-1",
     "id": 4355,
     "totalDose": {
       "begin": 15,
       "end": 20,
       "type": "de.averbis.types.health.Measurement",
       "coveredText": "1-0-1",
       "id": 5406,
       "unit": "mg" ,
       "normalizedUnit": null,
       "normalizedValue": null,
       "value": 200,
       "dimension": "[M]" 
    },
     "midday": 0,
     "concept": null,
     "interval": "daytime",
     "totalCount": 2,
     "evening": 1,
     "atNight": null,
     "morning": 1
  },
   "status": null
}
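The DayTimeDoseFrequency in the example above can be traced back to the scheme "1-0-1" and the strength of 100 mg. The Python sketch below parses such a scheme and derives totalCount and totalDose. It assumes the common morning-midday-evening notation with an optional fourth value for the night. The function `parse_day_time_scheme` is hypothetical, and the component itself may support more notations than this sketch.

```python
def parse_day_time_scheme(scheme: str, strength_value: float) -> dict:
    """Parse a scheme such as "1-0-1" (morning-midday-evening[-atNight])."""
    counts = [float(part.replace(",", ".")) for part in scheme.split("-")]
    if len(counts) not in (3, 4):
        raise ValueError(f"unexpected scheme: {scheme!r}")
    names = ["morning", "midday", "evening", "atNight"]
    result = dict.fromkeys(names)          # slots without a value stay None
    result.update(zip(names, counts))
    result["totalCount"] = sum(counts)
    result["totalDose"] = result["totalCount"] * strength_value
    return result


# "Aspirin 100 mg 1-0-1": two units per day, 200 mg in total.
print(parse_day_time_scheme("1-0-1", strength_value=100))
```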
      

Example 2:

lisinopril (PRINIVIL,ZESTRIL) 5 MG tablet Take 5 mg by mouth daily.

{
   "begin": 0,
   "end": 66,
   "type": "de.averbis.types.health.Medication",
   "coveredText": "lisinopril (PRINIVIL,ZESTRIL) 5 MG tablet Take 5 mg by mouth daily",
   "id": 4396,
   "administrations": [
     "by mouth" 
  ],
   "drugs": [
    {
       "begin": 0,
       "end": 34,
       "type": "de.averbis.types.health.Drug",
       "coveredText": "lisinopril (PRINIVIL,ZESTRIL) 5 MG",
       "id": 3914,
       "ingredient": {
         "begin": 0,
         "end": 10,
         "type": "de.averbis.types.health.IngredientConcept",
         "coveredText": "lisinopril",
         "id": 2982,
         "matchedTerm": "LISINOPRIL",
         "dictCanon": "Lisinopril",
         "conceptId": "29046",
         "source": "RxNorm-Ingredients_2018.1",
         "uniqueId": "RxNorm-Ingredients_2018.1:29046" 
      },
       "strength": {
         "begin": 30,
         "end": 34,
         "type": "de.averbis.types.health.Measurement",
         "coveredText": "5 MG",
         "id": 2174,
         "unit": "mg",
         "matchedTerm": "5 MG",
         "dictCanon": "5 MG",
         "conceptId": "STR133",
         "normalizedUnit": "kg",
         "source": "RxNorm-Strength_2018.1",
         "normalizedValue": 0.000005,
         "value": 5,
         "dimension": "[M]",
         "uniqueId": "RxNorm-Strength_2018.1:STR133" 
      }
    }
  ],
   "doseForm": {
     "begin": 35,
     "end": 41,
     "type": "de.averbis.types.health.DoseFormConcept",
     "coveredText": "tablet",
     "id": 3081,
     "matchedTerm": "Tablet",
     "dictCanon": "Tablet dose form (qualifier value)",
     "conceptId": "SCT385055001",
     "source": "Averbis-Dose-Form_1.0",
     "uniqueId": "Averbis-Dose-Form_1.0:SCT385055001" 
  },
   "doseFrequency": {
     "begin": 61,
     "end": 66,
     "type": "de.averbis.types.health.TimeMeasurementDoseFrequency",
     "coveredText": "daily",
     "id" : 3406,
     "concept": {
       "begin": 61,
       "end": 66,
       "type": "de.averbis.types.health.DoseFrequencyConcept",
       "coveredText": "daily",
       "id": 3114,
       "matchedTerm": "Daily",
       "dictCanon": "Daily (qualifier value)",
       "conceptId": "69620002",
       "source": "Averbis-Dose-Frequency_1.0",
       "uniqueId": "Averbis-Dose-Frequency_1.0:69620002" 
    },
     "interval": "1/day",
     "totalCount": 1
  },
   "status": null
}
      

3.21. Medication Status

3.21.1. Description

This annotator recognizes the status of medications. Possible statuses include, for example, "INTENDED" or "FAMILY".

3.21.2. Input

Above this annotator, the following annotator must be included in the pipeline:

3.21.3. Output

This annotator sets the feature status in annotations of type Medication.

3.21.4. Web Service Example

Example:

A very good alternative, if the tumor is ER positive, is treatment with Tamoxifen.

{
         "begin": 72,
         "end": 81,
         "type": "de.averbis.types.health.Medication",
         "id": 2682,
         "coveredText":  "Tamoxifen",
         "drugs": [
             {
                         "begin": 72,
                         "end": 81,
                         "type": "de.averbis.types.health.Drug",
                         "id": 2331,
                         "coveredText": "Tamoxifen",
                         "ingredient": {
                                 "begin": 72,
                                 "end": 81,
                                 "type": "de.averbis.types.health.IngredientConcept",
                                 "id": 2170,
                                 "coveredText":  "Tamoxifen",
                                 "dictCanon": "Tamoxifen",
                                 "conceptId": "10324",
                                 "source": "RxNorm-Ingredients_2018.1",
                                 "uniqueId": "RxNorm-Ingredients_2018.1:10324" 
                        }
                }
        ],
         "doseForm": null,
         "doseFrequency": null,
         "status": "CONSIDERED" 
}
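A typical use of the status feature on the client side is to separate medications that the patient actually takes from those that are only mentioned. The Python sketch below filters a list of annotations accordingly. Which statuses count as "not taken" is an assumption made for this illustration, and `active_medications` is a hypothetical helper.

```python
# Assumption: these statuses mean the patient is not actually taking the drug.
NOT_TAKEN = {"ALLERGY", "NEGATED", "CONSIDERED", "INTENDED", "FAMILY"}


def active_medications(annotations: list[dict]) -> list[dict]:
    """Keep Medication annotations whose status does not rule out intake."""
    return [
        a for a in annotations
        if a.get("type") == "de.averbis.types.health.Medication"
        and a.get("status") not in NOT_TAKEN
    ]


annotations = [
    {"type": "de.averbis.types.health.Medication", "coveredText": "Tamoxifen", "status": "CONSIDERED"},
    {"type": "de.averbis.types.health.Medication", "coveredText": "Aspirin 100 mg", "status": None},
]
print([a["coveredText"] for a in active_medications(annotations)])
```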
      

3.22. Morphology

3.22.1. Description

This component detects morphology concepts. It is mainly used in pathology reports.

3.22.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.22.3. Output

The component creates annotations of type:

Annotation Type: de.averbis.types.health.MorphologyConcept


Table 32: MorphologyConcept Features

Attribute | Description | Type
dictCanon | Preferred term of the Morphology concept. | String
matchedTerm | Matching synonym of the Morphology concept. | String
uniqueId | Unique identifier of the Morphology concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String
negatedBy | Contains "true" if the concept is negated. | String

3.22.4. Terminology Binding


Table 33: Terminology Bindings

Country | Name | Version | Identifier | Comment
United States | ICD-O | 3.1 | ICD-O_3.1 | International Classification of Diseases for Oncology WHO edition, enriched with synonyms by Averbis.
Germany | ICD-O-DE | 3.1 | ICD-O-DE_3.1 | International Classification of Diseases for Oncology German Edition, enriched with synonyms by Averbis.

3.22.5. Web Service Example

Example:

Adenokarzinom des Rektums

{
         "begin": 0,
         "end": 13,
         "type": "de.averbis.types.health.MorphologyConcept",
         "id": 622,
         "coveredText": "Adenokarzinom",
         "negatedBy": null,
         "dictCanon": "Adenokarzinom",
         "conceptId": "8140/3",
         "source": "ICD-O-DE_3.1",
         "uniqueId": "ICD-O-DE_3.1:8140/3",
         "matchedTerm": "Adenokarzinom" 
}
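The uniqueId follows the format 'terminologyId:conceptId', so a client can recover both parts from it. The Python sketch below splits at the first colon, assuming that terminology identifiers themselves contain no colon. The function `split_unique_id` is a hypothetical helper.

```python
def split_unique_id(unique_id: str) -> tuple[str, str]:
    """Split 'terminologyId:conceptId' at the first colon."""
    terminology_id, separator, concept_id = unique_id.partition(":")
    if not separator:
        raise ValueError(f"not a unique id: {unique_id!r}")
    return terminology_id, concept_id


print(split_unique_id("ICD-O-DE_3.1:8140/3"))
```

Splitting at the first colon rather than the last keeps concept ids intact even if they should contain further separators.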
      

3.23. Negation

3.23.1. Description

This component detects negated expressions. The negations are detected and assigned to concept annotations that are affected by these expressions. The negation detection component is optimized for medical texts.

3.23.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.23.3. Output

This component sets the following internal type that is not visible in the annotation editor:

Annotation Type: de.averbis.types.health.MedicalNegation

If a concept is successfully negated, its feature negatedBy is set to the corresponding negation term. If the DiagnosisStatus annotator is included after the Negation component, the 'verificationStatus' feature is additionally set to NEGATED.

3.23.4. Web Service Example

Example:

No Crohn’s disease

{
         "begin": 3,
         "end": 18,
         "type": "de.averbis.types.health.Diagnosis",
         "id": 722,
         "coveredText": "Crohn's disease",
         "negatedBy": "no",
         "verificationStatus": "NEGATED",
         "kind": null,
         "dictCanon": "Crohn's disease, unspecified, without complications",
         "conceptId": "K50.90",
         "source": "ICD-10-CM-Averbis_2017",
         "clinicalStatus": null,
         "uniqueId": "ICD-10-CM-Averbis_2017:K50.90" 
}
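On the client side, negated concepts can be recognized by the two features described above. The following Python sketch checks both, using reduced copies of annotations from this manual. The function `is_negated` is a hypothetical helper and assumes that a missing or null negatedBy means "not negated".

```python
def is_negated(annotation: dict) -> bool:
    """True if a negation term or a NEGATED verification status is present."""
    return bool(annotation.get("negatedBy")) or annotation.get("verificationStatus") == "NEGATED"


diagnosis = {"coveredText": "Crohn's disease", "negatedBy": "no", "verificationStatus": "NEGATED"}
morphology = {"coveredText": "Adenokarzinom", "negatedBy": None}

print(is_negated(diagnosis), is_negated(morphology))
```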
      

3.24. Ophthalmology

3.24.1. Description

This component detects indicators for the left and the right eye, the intraocular pressure, mentions of visual acuity and concepts concerning the field of ophthalmology.

3.24.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.24.3. Output

Annotation Type: de.averbis.types.health.OphthalmologyConcept


Table 34: OphthalmologyConcept Features

Attribute | Description | Type
dictCanon | Preferred term of the Ophthalmology concept. | String
matchedTerm | Matching synonym of the Ophthalmology concept. | String
uniqueId | Unique identifier of the Ophthalmology concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id. | String
source | The name of the terminology source. | String
negatedBy | Specifies the negation word, if one exists. | String

Annotation Type: de.averbis.types.health.Tensio


Table 35: Tensio Features

Attribute | Description | Type
leftEye | Tensio measurement of left eye. | Measurement
rightEye | Tensio measurement of right eye. | Measurement

Annotation Type: de.averbis.types.health.RelevantVisualAcuity

Best or actual visual acuity, selected from multiple VisualAcuity or VisualAcuityValue annotations.

Annotation Type: de.averbis.types.health.VisualAcuity


Table 36: VisualAcuity Features

Attribute | Description | Type
leftEye | Left eye’s visual acuity. | VisualAcuityValue
rightEye | Right eye’s visual acuity. | VisualAcuityValue

Annotation Type: de.averbis.types.health.VisualAcuityValue


Table 37: VisualAcuityValue Features

Attribute | Description | Type
fact | Normalized value of visual acuity. | String
meter | Visual acuity measured with blackboard. | Boolean
correction | Normalized value of correction during measuring visual acuity. | String
refraction | The measured refraction. | Refraction
pinHole | Visual acuity measured with pin hole. | Boolean
additionalInformation | Kind of comment, e.g. "AR_NOT_POSSIBLE", "DOES_NOT_IMPROVE". | String

Annotation Type: de.averbis.types.health.Refraction


Table 38: Refraction Features

Attribute | Description | Type

sphere

The spheric value of the actual refraction.

NumericValue

cylinder

The cylinder value of the actual refraction.

NumericValue

axis

The axis value of the actual refraction.

NumericValue

3.24.4. Web Service Example

Example 1: Tensio

Tensio RA 13 mmHg LA 14 mmHg

{
         "begin": 7,
         "end": 28,
         "type":  "de.averbis.types.health.Tensio",
         "coveredText": "RA 13 mmHg LA 14 mmHg",
         "id": 3078,
         "rightEye": {
           "begin": 10,
           "end": 17,
           "type": "de.averbis.types.health.Measurement",
           "coveredText": "13 mmHg",
           "id": 2082,
           "unit": "mmHg",
           "normalizedUnit":  "kg/(m·s²)",
           "normalizedValue": 1733.1860000000001,
           "value": 13,
           "dimension": "[M]/([L]·[T]²)" 
        },
         "leftEye": {
           "begin": 21,
           "end": 28,
           "type": "de.averbis.types.health.Measurement",
           "coveredText": "14 mmHg",
           "id": 2107,
           "unit": "mmHg",
           "normalizedUnit": "kg/(m·s²)",
           "normalizedValue": 1866.508,
           "value": 14,
           "dimension": "[M]/([L]·[T]²)" 
        }
  }
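
The normalizedValue fields in this response are consistent with a conversion from mmHg to the SI unit pascal (kg/(m·s²)) at roughly 133.322 Pa per mmHg. The following Python sketch reproduces that arithmetic on a shortened copy of the annotation. The conversion factor and the helper name mmhg_to_pascal are assumptions inferred from the example output; they are not a documented part of the web service.

```python
import json

# Assumed conversion factor, inferred from the example output above
# (13 mmHg -> 1733.186); not taken from the product documentation.
PASCAL_PER_MMHG = 133.322


def mmhg_to_pascal(value_mmhg: float) -> float:
    """Convert a pressure in mmHg to pascal (kg/(m·s²))."""
    return value_mmhg * PASCAL_PER_MMHG


# Shortened copy of the Tensio annotation shown above.
tensio = json.loads("""
{
  "type": "de.averbis.types.health.Tensio",
  "rightEye": {"value": 13, "unit": "mmHg", "normalizedValue": 1733.1860000000001},
  "leftEye":  {"value": 14, "unit": "mmHg", "normalizedValue": 1866.508}
}
""")

for eye in ("rightEye", "leftEye"):
    measurement = tensio[eye]
    recomputed = mmhg_to_pascal(measurement["value"])
    # Compare with a tolerance because of floating-point rounding.
    matches = abs(recomputed - measurement["normalizedValue"]) < 1e-6
    print(eye, measurement["value"], measurement["unit"], "->", round(recomputed, 3), "Pa", matches)
```

Reading value together with unit, rather than normalizedValue alone, keeps the clinically familiar mmHg figure available to downstream code.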
      

Example 2: Visual Acuity

Visus RA 0,16 (AR +1,0 -3,25 84) LA sc 1/35 (AR nicht möglich)

{
         "begin": 0,
         "end": 62,
         "type": "de.averbis.types.health.VisualAcuity",
         "coveredText": "Visus \n RA \n 0,16 (AR +1,0 -3,25 84) \n LA \n sc 1/35 (AR nicht möglich)",
         "id": 7430,
         "rightEye": {
                 "begin": 9,
                 "end": 32,
                 "type": "de.averbis.types.health.VisualAcuityValue",
                 "coveredText": "0,16 (AR +1,0 -3,25 84)",
                 "id": 7032,
                 "additionalInformation": null,
                 "pinHole": false,
                 "fact": "0.16",
                 "refraction": {
                         "begin": 14,
                         "end": 32,
                         "type": "de.averbis.types.health.Refraction",
                         "coveredText":  "(AR +1,0 -3,25 84)",
                         "id": 6843,
                         "sphere": 1,
                         "cylinder": -3.25,
                         "axis": 84
                },
                 "meter": false,
                 "correction": "AR" 
        },
         "leftEye": {
                 "begin": 36,
                 "end": 62,
                 "type": "de.averbis.types.health.VisualAcuityValue",
                 "coveredText": "sc 1/35 (AR nicht möglich)",
                 "id": 7092,
                 "additionalInformation": {
                         "begin": 44,
                         "end": 62,
                         "type": "de.averbis.types.health.VisualAcuityAdditionalInformation",
                         "coveredText": "(AR nicht möglich)",
                         "id": 5950,
                         "normalized": "AR_NOT_POSSIBLE" 
                },
                 "pinHole": false,
                 "fact": "1/35",
                 "refraction": null,
                 "meter": true,
                 "correction":  "SC" 
        }
}
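
In this response, the fact feature holds either a decimal string ("0.16") or a fraction ("1/35"). If the two eyes are to be compared numerically, both notations have to be brought into one form. The Python sketch below shows one possible way to do that on the client side; the helper name acuity_to_float is hypothetical, and the assumption that fact always has one of these two shapes is derived from this single example only.

```python
from fractions import Fraction
from typing import Optional


def acuity_to_float(fact: Optional[str]) -> Optional[float]:
    """Turn a 'fact' string such as "0.16" or "1/35" into a float."""
    if fact is None:
        return None
    # Fraction accepts both decimal strings and "numerator/denominator".
    return float(Fraction(fact))


# Shortened copy of the two VisualAcuityValue annotations shown above.
visual_acuity = {
    "rightEye": {"fact": "0.16", "correction": "AR", "meter": False},
    "leftEye": {"fact": "1/35", "correction": "SC", "meter": True},
}

for eye, value in visual_acuity.items():
    print(eye, value["correction"], round(acuity_to_float(value["fact"]), 4))
```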
      

Example 3: Relevant Visual Acuity

Visus RA 0,16 (AR +1,0 -3,25 84) LA sc 1/35 (AR nicht möglich)

{
         "begin": 0,
         "end": 62,
         "type": "de.averbis.types.health.VisualAcuity",
         "coveredText": "Visus \n RA \n 0,16 (AR +1,0 -3,25 84) \n LA \n sc 1/35 (AR nicht möglich)",
         "id": 7430,
         "rightEye": {
                 "begin": 9,
                 "end": 32,
                 "type": "de.averbis.types.health.VisualAcuityValue",
                 "coveredText": "0,16 (AR +1,0 -3,25 84)",
                 "id": 7032,
                 "additionalInformation" : null,
                 "pinHole": false,
                 "fact": "0.16",
                 "refraction": {
                         "begin": 14,
                         "end": 32,
                         "type": "de.averbis.types.health.Refraction",
                         "coveredText": "(AR +1,0 -3,25 84)",
                         "id": 6843,
                         "sphere": 1,
                         "cylinder": -3.25,
                         "axis": 84
                },
                 "meter": false,
                 "correction": "AR" 
        },
         "leftEye": {
                 "begin": 36,
                 "end": 62,
                 "type": "de.averbis.types.health.VisualAcuityValue",
                 "coveredText": "sc 1/35 (AR nicht möglich)",
                 "id": 7092,
                 "additionalInformation" : {
                         "begin": 44,
                         "end": 62,
                         "type": "de.averbis.types.health.VisualAcuityAdditionalInformation",
                         "coveredText": "(AR nicht möglich)",
                         "id": 5950,
                         "normalized": "AR_NOT_POSSIBLE" 
                },
                 "pinHole": false,
                 "fact": "1/35",
                 "refraction": null,
                 "meter": true,
                 "correction": "SC" 
        }
}
      

Example 4: Ophthalmology Concept

Kataraktoperation

{
         "begin": 0,
         "end": 17,
         "type": "de.averbis.types.health.OphthalmologyConcept",
         "coveredText": "Kataraktoperation",
         "id": 800,
         "negatedBy": null,
         "matchedTerm": "Kataraktoperation",
         "dictCanon": "Katarakt-Operation",
         "conceptId": "110473004",
         "source": "Ophthalmologie_1.0",
         "uniqueId": "Ophthalmologie_1.0:110473004" 
}
      

3.25. Patient Information

3.25.1. Description

This component detects various information about the patient, such as admission and discharge dates, the patient's gender, and whether the patient is deceased. In addition, if a list of patient names has been imported into Averbis Health Discovery as a terminology, these patient names can be extracted as well.

3.25.2. Input

Above this annotator, the following annotators must be included in the pipeline:


The Health Preprocessing pipeline block provides the prerequisite annotation types to ensure the proper functionality of this annotator.

3.25.3. Annotation of patient names

Patient names can only be extracted from clinical notes if they exist as entries in a terminology called "patientnames". The following preparations are therefore necessary to annotate patient names.

Step 1: Create a terminology in the "Terminology Administration" with the Terminology-ID "patientnames", Concept-Type "de.averbis.textanalysis.types.health.PatientNameConcept" and language "Miscellaneous". Label and Version can be set freely. See Create your own Terminology for more details on how to create a terminology.





Figure 12: Add terminology named "patientnames"



Step 2: Import your list of patient names into the terminology in OBO format, or enter the patient names manually using the "Terminology Editor". To distinguish between first names and last names, each term must use the syntax Firstname[semicolon]Lastname, e.g. John;Doe.

Your OBO-file with patient names may look like:

[Term]
id: 1
name: Sue;Miller

[Term]
id: 2
name: John;Doe

....


View the results of your import/editing in the "Terminology Editor" to make sure everything worked out smoothly. The imported terminology/OBO-file should contain the patients' first and last name as preferred term. Synonyms do not need to be added.
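
If the patient names are available as a list, the [Term] stanzas described in Step 2 can be generated with a short script. The following Python sketch is one way to do this. The function name build_obo_terms and the sequential numeric IDs are illustrative assumptions that mirror the sample file; whether the import expects additional OBO header lines is not stated here and is not covered by the sketch.

```python
from typing import Iterable, Tuple


def build_obo_terms(names: Iterable[Tuple[str, str]]) -> str:
    """Return OBO [Term] stanzas with 'Firstname;Lastname' as the term name."""
    stanzas = []
    for term_id, (first_name, last_name) in enumerate(names, start=1):
        stanzas.append(f"[Term]\nid: {term_id}\nname: {first_name};{last_name}\n")
    # Stanzas are separated by a blank line.
    return "\n".join(stanzas)


obo_text = build_obo_terms([("Sue", "Miller"), ("John", "Doe")])
print(obo_text)
```

The resulting text can be saved to a file and imported as described in Step 2.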



Figure 13: Entries in terminology "patientnames", view in Terminology Editor



Step 3: Switch to the "Terminology Administration" and submit the terminology to the text analytics module.




Figure 14: Submit terminology for use in text analytic pipelines.



Step 4: Reuse an existing pipeline where "Patient Information" is included or create a pipeline and include the following annotators:


Step 5: (Re)Start the pipeline. After completing steps 1 through 5, the pipeline is now ready to annotate the imported patient names.


3.25.4. Output

Annotation Type: de.averbis.types.health.PatientInformation



Table 39: Patient Information Features

Attribute | Description | Type
firstName | The first part (before the semicolon) of the matching preferred term in the terminology "patientnames". | String
lastName | The last part (after the semicolon) of the matching preferred term in the terminology "patientnames". | String
gender | Gender of the patient. Possible values (default is underlined): null, female, male | String
deathDate | Date of death of the patient. | Date
deceased | Information as to whether the patient is deceased. Possible values (default is underlined): false, true | Boolean



Annotation Type: de.averbis.types.health.Hospitalisation


Table 40: Hospitalisation Features


Attribute | Description | Type

admissionDate

Date of admission to hospital.

Date

dischargeDate

Date of discharge from hospital.

Date


3.25.5. Terminology Binding for patientnames


Table 41: Terminology Binding

Country | Name | Version | Identifier | Comment
All | <define your name> | <define your version> | patientnames | To annotate patient names, a terminology with the ID "patientnames" has to be created and filled with the list of patient names to be annotated. See chapter Annotation of patient names for more details.


3.25.6. Web Service Example

Example text: We're reporting on the patient John Doe. He stayed in our hospital from 1/01/2018 until 2/01/2018.


{
      "begin": 0,
      "end": 98,
      "type": "de.averbis.types.health.PatientInformation",
      "coveredText": "We're reporting on the patient John Doe. He stayed in our hospital from 1/01/2018 until 2/01/2018.",
      "id": 3471,
      "firstName": "John",
      "lastName": "Doe",
      "deceased": false,
      "gender": "male",
      "deathDate": null
}
{
      "begin": 72,
      "end": 97,
      "type": "de.averbis.types.health.Hospitalisation",
      "coveredText": "1/01/2018 until 2/01/2018",
      "id": 3251,
      "admissionDate": "2018-01-01",
      "dischargeDate": "2018-02-01"
}
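
Because admissionDate and dischargeDate are returned as ISO dates in this example, derived values such as the length of stay can be computed directly from a Hospitalisation annotation. The Python sketch below illustrates this on a shortened copy of the annotation; the helper name length_of_stay_days is hypothetical and not part of the web service.

```python
from datetime import date

# Shortened copy of the Hospitalisation annotation shown above.
hospitalisation = {
    "type": "de.averbis.types.health.Hospitalisation",
    "admissionDate": "2018-01-01",
    "dischargeDate": "2018-02-01",
}


def length_of_stay_days(annotation: dict) -> int:
    """Days between admission and discharge of a Hospitalisation annotation."""
    admission = date.fromisoformat(annotation["admissionDate"])
    discharge = date.fromisoformat(annotation["dischargeDate"])
    return (discharge - admission).days


print(length_of_stay_days(hospitalisation))  # prints 31
```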


Example text: The patient died on 2/01/2018 in the course of a multiorgan failure.

    {
      "begin": 0,
      "end": 68,
      "type": "de.averbis.types.health.PatientInformation",
      "coveredText": "The patient died on 2/01/2018 in the course of a multiorgan failure.",
      "id": 2356,
      "firstName": null,
      "lastName": null,
      "deceased": true,
      "gender": null,
      "deathDate": "2018-02-01"
    }

3.26. Physical Therapies

3.26.1. Description

The component annotates physical therapies (e.g. cryotherapy, occupational therapy) from clinical notes.

3.26.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.26.3. Output

Annotation Type: de.averbis.types.health.PhysicalTherapy


Table 42: Features

Attribute | Description | Type
dictCanon | Preferred term of the physical therapy. | String
matchedTerm | The matching synonym of the physical therapy. | String
uniqueId | Unique identifier of a concept of the format 'terminologyId:conceptId'. | String
conceptId | The concept id of the physical therapy. | String
source | The identifier of the terminology. | String
negatedBy | Specifies the negation word, if one exists. | String

3.26.4. Terminology Binding


Table 43: Terminology Bindings

Country | Name | Version | Identifier | Comment
EN, DE | Averbis-Therapy | 1.0 | Averbis-Therapy_1.0 | Averbis' own multilingual terminology for physical and related therapies.

3.26.5. Web Service Example

Example:

10x cryotherapy


{
        "begin": 4,
        "end": 15,
        "type": "de.averbis.types.health.PhysicalTherapy",
        "id": 627,
        "coveredText": "cryotherapy",
        "negatedBy": null,
        "matchedTerm": "cryotherapy",
        "dictCanon": "cryotherapy",
        "conceptId": "PT000019",
        "source": "Averbis-Therapy_1.0",
        "uniqueId": "Averbis-Therapy_1.0:PT000019"
}
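
The uniqueId follows the format 'terminologyId:conceptId', so a client can split it back into its two parts. The Python sketch below does this for the annotation above. The helper name split_unique_id is hypothetical, and splitting at the first colon assumes that terminology identifiers themselves contain no colon.

```python
from typing import Tuple


def split_unique_id(unique_id: str) -> Tuple[str, str]:
    """Split 'terminologyId:conceptId' at the first colon."""
    terminology_id, separator, concept_id = unique_id.partition(":")
    if not separator:
        raise ValueError(f"not of the form terminologyId:conceptId: {unique_id!r}")
    return terminology_id, concept_id


# Shortened copy of the PhysicalTherapy annotation shown above.
annotation = {
    "source": "Averbis-Therapy_1.0",
    "conceptId": "PT000019",
    "uniqueId": "Averbis-Therapy_1.0:PT000019",
}

terminology_id, concept_id = split_unique_id(annotation["uniqueId"])
print(terminology_id, concept_id)
```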

3.27. Procedures

3.27.1. Description

The component annotates surgical procedures from clinical notes.


This component is currently only available in English.


3.27.2. Input

Above this annotator, the following annotators must be included in the pipeline:

To get the full functionality, the following annotators should also be included below this annotator in the given order:

3.27.3. Output

Annotation Type: de.averbis.types.health.Procedure


Table 44: Features

Attribute | Description | Type

dictCanon

Preferred term of the procedure.

String

matchedTerm

The matching synonym of the procedure.

String

uniqueId

Unique identifier of a concept of the format 'terminologyId:conceptId'.

String

conceptId

The concept id of the procedure.

String

source

The identifier of the terminology.

String

negatedBy

Specifies the negation word, if one exists.

String

status

Describes the status of the procedure.

Possible values (default is underlined): null | HISTORY | NEGATED | PLANNED

ProcedureStatus

side

The laterality of the procedure.

Possible values (default is underlined): null | RIGHT | LEFT | BOTH

Laterality

date

The date of the procedure.

Date

3.27.4. Terminology Binding


Table 45: Terminology Bindings

Country | Name | Version | Identifier | Comment

United States

SNOMED-CT-US

2018-09-01

SNOMED-CT-US_2018-09-01

The SNOMED CT United States (US) Edition, subtree of concept "387713003 Surgical procedure (procedure)"

3.27.5. Web Service Example

Example:

history of cholecystectomy

{
         "begin": 11,
         "end": 26,
         "type": "de.averbis.types.health.Procedure",
         "id": 627,
         "coveredText": "cholecystectomy",
         "negatedBy": null,
         "matchedTerm": "Cholecystectomy",
         "verificationStatus": null,
         "kind": null,
         "dictCanon": "Cholecystectomy (procedure)",
         "conceptId": "38102005",
         "source": "SNOMED-CT-US_2018-09-01",
         "clinicalStatus": "HISTORY",
         "belongsTo": null,
         "laterality": null,
         "uniqueId": "SNOMED-CT-US_2018-09-01:38102005"
}
      

3.28. Procedure Status

3.28.1. Description

The annotator recognizes the status of procedures. Possible statuses include, for example, "intended" or "family".

3.28.2. Input

Before this annotator, the following annotator must be included in the pipeline:

3.28.3. Output

This annotator sets the feature status in annotations of type Procedure.

3.28.4. Web Service Example

See Procedure.

3.29. RutaEngine

3.29.1. Description

The RutaEngine is a generic annotator which interprets and executes a rule-based scripting language for Apache UIMA, called UIMA Ruta. Due to its generic nature, the annotator is able to create and modify all available types of annotations.

Detailed documentation on the use of Ruta can be found in the official Apache UIMA manual.

3.29.2. Input

The component does not expect any annotations.

3.29.3. Output

The Entity type is described here as an exemplary and recommended placeholder for the types of annotations that this annotator can create. Entity is a generic type whose semantics are specified by its features label and value.

Annotation Type: de.averbis.extraction.types.Entity


Table 46: Features

Attribute | Description | Type

value

This feature provides the text of the annotated mention.

String

label

The type of the entity; e.g., PERSON, LOCATION etc.

String

3.29.4. Configuration

Name | Description | Type | MultiValued | Mandatory

rules

A String parameter representing the rule that should be applied by the analysis engine. If set, it replaces the content of the file specified by the mainScript parameter.

String

false

false

3.29.5. Web Service Example

Example:

Ruta Script:

"pack year" -> Keyword;
(n:NUM k:Keyword){-> CREATE(Entity, "label" = k.ct, "value" = n.ct)};


Text: 40 pack years

{
         "begin": 0,
         "end": 12,
         "type": "de.averbis.types.health.Entity",
         "id": 626,
         "coveredText": "40 pack years",
         "label": "pack years",
         "value": "40" 
}
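
For readers who are not familiar with Ruta, the effect of the two rules can be approximated outside the platform with a regular expression: a number followed by the keyword becomes one entity whose value is the number and whose label is the keyword text. The Python sketch below is only a rough analogue for illustration; it does not use UIMA or Ruta, and the pattern and the function name find_entities are assumptions.

```python
import re

# Number followed by the keyword; "years?" also accepts the singular form.
PATTERN = re.compile(r"(?P<value>\d+)\s+(?P<label>pack years?)")


def find_entities(text: str) -> list:
    """Return Entity-like dicts for every 'NUM Keyword' match in the text."""
    return [
        {
            "coveredText": match.group(0),
            "label": match.group("label"),
            "value": match.group("value"),
        }
        for match in PATTERN.finditer(text)
    ]


print(find_entities("40 pack years"))
```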
      

3.30. TNM

3.30.1. Description

This component detects and annotates abbreviated notations and free-text remarks of the TNM classification. It is able to distinguish the tumor (T), node (N), metastasis (M), the grading of the tumor and some additional information.

3.30.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.30.3. Output

Annotation Type: de.averbis.types.health.TNMNotation

The aggregating annotation for TNM annotations.


Table 47: TNMNotation Features

Attribute | Description | Type

tumour

The tumor annotations.

TNMTumour[]

node

The node annotations.

TNMNode[]

metastasis

The metastasis annotations.

TNMMetastasis[]

grading

The grading annotations.

TNMGrading[]

location

The location annotations.

TNMLocation[]

rclass

The R class annotations.

TNMRClass[]

certainty

The certainty annotations.

TNMCertainty[]

additional

The additional annotations, e.g. L for lymphatic invasion.

TNMAdditional[]

Annotation Type: de.averbis.types.health.TNMElement

Abstract type for a TNM element with modifiers.


Table 48: TNMElement Features

Attribute | Description | Type

value

The value of the TNM mention.

String

label

Label of the entity describing the type of the entity, e.g. GRADING.

String

modifiers

The prefix modifiers of the element.

String

postModifier

The postfix modifier of the element.

String

Possible TNMElements are:

  • de.averbis.types.health.TNMTumour

  • de.averbis.types.health.TNMNode

  • de.averbis.types.health.TNMMetastasis

Annotation Type: de.averbis.types.health.AbstractTNMLymphNodes

Abstract type for lymph node mentions.


Table 49: AbstractTNMLymphNodes Features

Attribute | Description | Type

attacked

Number of attacked lymph nodes.

Integer

tested

Number of tested lymph nodes.

Integer

Possible AbstractTNMLymphNodes are:

  • de.averbis.types.health.TNMLymphNodes

  • de.averbis.types.health.TNMLymphNodesSentinel

Annotation Type: de.averbis.types.health.TNMEntity

Abstract type for a TNM entity providing a label.


Table 50: TNMEntity Features

Attribute | Description | Type

value

The value of the TNM mention.

String

label

Label of the entity describing the type of the entity, e.g. GRADING.

String

Possible TNMEntities are:

  • de.averbis.types.health.TNMGrading

  • de.averbis.types.health.TNMLocation

  • de.averbis.types.health.TNMRClass

  • de.averbis.types.health.TNMCertainty

  • de.averbis.types.health.TNMAdditional

3.30.4. Web Service Example

The web service currently supports only TNMTumour, TNMNode, TNMMetastasis and TNMGrading.

Example (TNMNotation):

pTis, N0, Mx, G3

{
         "begin": 14,
         "end": 16,
         "type": "de.averbis.types.health.TNMGrading",
         "id": 1352,
         "coveredText": "G3",
         "value": "G3" 
},
{
         "begin": 10,
         "end": 12,
         "type": "de.averbis.types.health.TNMMetastasis",
         "id": 1342,
         "coveredText": "Mx",
         "value": "Mx" 
},
{
         "begin": 6,
         "end": 8,
         "type": "de.averbis.types.health.TNMNode",
         "id": 1330,
         "coveredText": "N0",
         "value": "N0" 
},
{
         "begin": 0,
         "end": 4,
         "type": "de.averbis.types.health.TNMTumour",
         "id": 1320,
         "coveredText": "pTis",
         "value": "Tis" 
}
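
Since the web service returns the TNM parts as separate annotations, a client usually has to collect them again. The Python sketch below groups shortened copies of the four annotations above by their short type name. The function name summarize_tnm and the approach of stripping the common type prefix are illustrative assumptions.

```python
TYPE_PREFIX = "de.averbis.types.health.TNM"

# Shortened copies of the four annotations shown above.
annotations = [
    {"type": "de.averbis.types.health.TNMGrading", "coveredText": "G3", "value": "G3"},
    {"type": "de.averbis.types.health.TNMMetastasis", "coveredText": "Mx", "value": "Mx"},
    {"type": "de.averbis.types.health.TNMNode", "coveredText": "N0", "value": "N0"},
    {"type": "de.averbis.types.health.TNMTumour", "coveredText": "pTis", "value": "Tis"},
]


def summarize_tnm(items: list) -> dict:
    """Map the short type name (e.g. 'Tumour') to the annotated value."""
    summary = {}
    for item in items:
        if item["type"].startswith(TYPE_PREFIX):
            summary[item["type"][len(TYPE_PREFIX):]] = item["value"]
    return summary


print(summarize_tnm(annotations))
```

If a document contains several TNM notations, this simple mapping keeps only the last value per type; grouping by text position would be needed to keep them apart.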
      

3.31. Topography

3.31.1. Description

This component detects topography concepts. It is mainly used in pathology reports.

3.31.2. Input

Above this annotator, the following annotators must be included in the pipeline:

3.31.3. Output

Annotation Type: de.averbis.types.health.TopographyConcept


Table 51: TopographyConcept Features

Attribute | Description | Type

dictCanon

Preferred term of the Topography concept.

String

matchedTerm

Matching synonym of the Topography concept.

String

uniqueId

Unique identifier of the Topography concept of the format 'terminologyId:conceptId'.

String

conceptId

The concept id.

String

source

The name of the terminology source.

String

negatedBy

Specifies the negation word, if one exists.

String

3.31.4. Terminology Binding


Table 52: Terminology Bindings

Country | Name | Version | Identifier | Comment

United States

ICD-O

3.1

ICD-O_3.1

International Classification of Diseases for Oncology WHO edition, enriched with synonyms by Averbis.

Germany

ICD-O-DE

3.1

ICD-O-DE_3.1

International Classification of Diseases for Oncology German Edition, enriched with synonyms by Averbis.

3.31.5. Web Service Example

Example:

Adenokarzinom des Rektums

{
         "begin": 18,
         "end": 25,
         "type": "de.averbis.types.health.TopographyConcept",
         "id": 580,
         "coveredText": "Rektums",
         "negatedBy": null,
         "dictCanon": "Rektum",
         "conceptId": "C20.9",
         "source": "ICD-O-DE_3.1",
         "uniqueId":  "ICD-O-DE_3.1:C20.9",
         "matchedTerm": "Rektum" 
}
      

3.32. WordlistAnnotator

3.32.1. Description

The WordlistAnnotator allows users to directly embed simple wordlists into pipelines. It identifies words from the wordlist in texts and creates an annotation of type Entity. Optionally, a 'label' and a 'value' can be specified in columns 2 and 3 of the wordlist to fill the corresponding attributes of type Entity (see example below).

3.32.2. Input

Above this annotator, the following annotator must be included in the pipeline:

3.32.3. Configuration


Table 53: Configuration Parameters

Name | Description | Type | MultiValued | Mandatory

delimiter

The separator of different terms in the wordlist, separating the searched term from its features.

string
false
true

ignoreCase

Option to ignore the case of the terms in the wordlist.

Possible values (default is underlined): ACTIVE | INACTIVE

boolean
false
true

onlyLongest

Option to filter matches that are part of a longer match. Example: 'diabetes mellitus' but not 'diabetes'.

Possible values (default is underlined): ACTIVE | INACTIVE

boolean
false
true

wordlist

The wordlist (dictionary) content.

The first line contains the complete package name of type Entity. If columns 2 and 3 are filled, line 1 has to be filled with the attribute names 'label' and 'value'.

The remaining lines contain the words of the wordlist (column 1) and optionally 'label' and 'value' values (columns 2 and 3).

Example Wordlist:

de.averbis.extraction.types.Entity;label;value

Lip;Organ;C00

Tongue;Organ;C01

string
false
false


3.32.4. Output

The annotator creates an annotation of type Entity.

Exemplary Annotation Type: de.averbis.extraction.types.Entity


Table 54: Features

Attribute | Description | Type
label | Represents the string in the feature "label" of the matched term in the wordlist. | String
value | Represents the string in the feature "value" of the matched term in the wordlist. | String

3.32.5. Web Service Example

Example:

The lip

    {
      "begin": 4,
      "end": 7,
      "type": "de.averbis.extraction.types.Entity",
      "coveredText": "lip",
      "id": 306,
      "componentId": null,
      "confidence": 0,
      "label": "Organ",
      "value": "C00",
      "parsedElements": null
    }
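
The wordlist layout (a header line with the type and feature names, followed by one delimited line per term) can be illustrated with a small stand-alone matcher. The Python sketch below parses the example wordlist and reproduces the begin, end, label and value of the annotation above. It is not the annotator's implementation: the function names, the whole-word matching via regular expressions, and the default of ignoring case are assumptions, and the onlyLongest option is not modeled.

```python
import re


def parse_wordlist(content: str, delimiter: str = ";") -> tuple:
    """Split wordlist content into (type name, feature names, entries)."""
    lines = [line for line in content.splitlines() if line.strip()]
    type_name, *features = lines[0].split(delimiter)
    entries = []
    for line in lines[1:]:
        term, *values = line.split(delimiter)
        entries.append((term, dict(zip(features, values))))
    return type_name, features, entries


def annotate(text: str, content: str, ignore_case: bool = True) -> list:
    """Return one Entity-like dict per whole-word match of a wordlist term."""
    type_name, _, entries = parse_wordlist(content)
    flags = re.IGNORECASE if ignore_case else 0
    result = []
    for term, features in entries:
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text, flags):
            result.append({
                "begin": match.start(),
                "end": match.end(),
                "type": type_name,
                "coveredText": match.group(0),
                **features,
            })
    return result


WORDLIST = """de.averbis.extraction.types.Entity;label;value
Lip;Organ;C00
Tongue;Organ;C01"""

print(annotate("The lip", WORDLIST))
```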

4. Available Text Mining Pipelines

The respective components are described in detail in Available Text Mining Annotators & Web Service Specification.

4.1. Discharge Pipeline

4.1.1. Description

This pipeline extracts the basic medical information in physician letters. Since these letters are mainly written when patients are discharged from the hospital or transferred to another doctor, they are called discharge letters. After some preprocessing, this pipeline annotates information concerning diagnoses, laboratory values and medications. The resulting annotations are then postprocessed with respect to enumerations, negations, disambiguation and possible status.

4.1.2. Components

The following components are part of the discharge pipeline:

4.2. HLA Pipeline

4.2.1. Description

This pipeline extracts information about the human leukocyte antigen (HLA) contained in special reports.

4.2.2. Components

The following components are part of the HLA pipeline:

4.3. Ophthalmology Pipeline

4.3.1. Description

This pipeline extracts information concerning diagnoses, laboratory values, medications, negations, visual acuity, tensio and further information in the field of ophthalmology.

4.3.2. Components

The following components are part of the ophthalmology pipeline:

4.4. Pathology Pipeline

4.4.1. Description

This pipeline extracts information from pathology reports. After some preprocessing, this pipeline annotates information concerning diagnoses, morphology, topography and TNM classification.

4.4.2. Components

The following components are part of the pathology pipeline:

4.5. Transplantation Pipeline

4.5.1. Description

This pipeline extracts information concerning diagnoses, laboratory values, medications, graft-versus-host-disease, conditioning regimens and negations from physician letters after a transplantation.

4.5.2. Components

The following components are part of the transplantation pipeline:

5. GUI Overview

5.1. General Administration

Users with administration rights can create new users and projects. When these users are logged in, they can see the "Project administration" and "User administration" areas.




Figure 15: Home page of an administration user


5.1.1. Project administration

In the project administration area, you first see a list with all projects that are currently available in the system.




Figure 16: Overview of created projects


  • Name: name of the project. The name also functions as a link to the corresponding project. The link goes to the project’s overview page.

  • Description: description of the project.

  • Operations | Edit project: this allows you to modify the name and the description of the project.

  • Operations | Delete project: this allows you to delete a project.

Below the table is a button that you can use to create a new project.

5.1.2. User administration

In the user administration area, you first see a list with all user accounts that are currently available in the system. This list can be filtered using the text box on the top left.




Figure 17: Overview of registered users.


  • Username: the user’s login name.

  • Lastname: the user’s last name.

  • Firstname: the user’s first name.

  • Email: the user’s email address.

  • Blocked: if a user is temporarily blocked, a padlock icon is displayed here.

  • Administrator: if the user is an administrator, a checkmark is displayed here.

  • Operations | Rights: using this button you can see an overview of the rights that the user currently has. Rights cannot be edited here. Editing rights is done using the corresponding button in each project.

  • Operations | Edit: in the Edit dialog, you can edit the user profile data (firstname, lastname, email address). You can also use this dialog to block a user.

  • Operations | Change password: this allows you to enter a new user password.

  • Operations | Delete user: this allows you to delete the user.

Below the table is a button that you can use to create a new user.

5.1.3. Add and/or edit users

Use the 'Create new user' or 'Edit user' button to open a dialog and edit the user’s metadata.




Figure 18: Create new user.


In addition to editing the profile metadata, you can also assign an initial password when creating the user (to edit the password of an already existing user, please use the corresponding 'Change password' button in the user administration overview table).

You can also use this dialog to block the user.

5.1.4. Change password

Using the Change password button, you can open a dialog which allows you to enter a new password.




Figure 19: Changing the password of an existing user.


5.2. General guidelines

When a user without global administration rights opens the application, his/her home page contains an overview of the projects assigned to this user (My projects). The project names act as links to the corresponding projects. On the project overview page, the user can find all the functions for which he/she has the relevant project rights.




Figure 20: Home page of a non-administrator user


After selecting a project, a page is displayed with a list of all the modules in the project. This list is also available on other pages with the project navigation menu in the upper right area.




Figure 21: Overview page of a project with buttons for opening each module.


5.3. Language and web interface localization

The web interface is currently available in German and English. The language is recognized automatically from the browser or the system settings of your operating system and the content of the user interface is displayed in the corresponding language.

5.4. Outer navigation bars

The top and left side outer navigation bars can be hidden when required. This saves space when the navigation tools are not required. To show/hide the navigation bars, click the small menu icon on the upper right edge of the application.




Figure 22: Menu icon to show/hide the outer navigation bars

5.5. Keyboard Shortcuts

To simplify working with the application, some functions are implemented with keyboard shortcuts. Press Shift + ? to display a summary of the defined shortcuts.


Figure 23: Summary of all defined keyboard shortcuts. Open with Shift + ?


5.6. Flash messages

Flash messages report the progress and outcome of processes and display general information. They are standardized across all applications. The background color of a flash message depends on the message category: information messages are blue, success messages are green, and error messages are red. Flash messages disappear automatically after a few seconds, except for error messages, which remain visible until the user closes them manually.




Figure 24: Flash messages that display errors are closed by clicking the cross mark in the top right corner.


5.7. Documentation

Complete user documentation is available that describes the functionality of each component. This documentation can be accessed directly from the help menu in the navigation bar on the left side of the web interface.

5.8. Embedded help

In addition to the complete online help, you can find information in several places directly embedded in the interface. You can access this wherever you see a blue question mark on a white background. Move the mouse cursor over the question mark.




Figure 25: Embedded help


6. Connector Management & Document Import

6.1. Managing Standard Connectors

Connectors are used to import documents into the system. A connector monitors a specific resource (like a file system or a database), automatically imports new documents and updates changes so that imported documents are kept in sync with the document source. Connectors can also be scheduled to certain times of day, for example to import and update documents only at night and reduce system load during office hours.

Connectors can be created and administered on the connector management page. The figure below shows the connector management with the list of all connectors that have been created within the current project:




Figure 26: Overview of all connectors.


  • Connector: The name of the connector.

  • Type: The connector type, for example file connector or database connector.

  • Active: Indicates whether the connector is active. Only active connectors import and update documents.

  • Schedules: Displays the periods of time in which the connector is active. 0-24 means that the connector is active 24 hours a day.

  • Statistics: The statistics show the following values

    • Documents whose URLs have been reported by the connector.

    • Documents that have already been requested by the connector and whose contents have been received.

    • Documents that have already been enriched with metadata.

    • Documents that have already been saved.

  • Actions | Start connector : Starts the connector.

  • Actions | Stop connector : Stops the connector.

  • Actions | Reset connector : If you reset a connector, all documents from this connector are re-imported.

  • Actions | Edit connector : Opens the edit connector dialog. All parameters except the connector name can be edited.

  • Actions | Edit mapping : Opens the edit mapping dialog where connector metadata fields like title and content can be mapped to document fields.

  • Actions | Schedule connector : Opens the schedule dialog.

  • Actions | Delete documents of connector : Deletes all documents that have been imported by the connector.

  • Actions | Delete connector : Deletes the connector. All documents that have been imported by the connector will be deleted as well.

In order to create a new connector, the connector type has to be selected first. After clicking the Create connector button the connector can be configured in the create new connector dialog. Please refer to the connector specific documentation for further details.


6.1.1. File System Connector

A file system connector imports documents from file system resources. It monitors one or multiple directories (including sub-directories) and imports documents from files in these directories. The following file types are supported:

  • .txt
  • .pdf
  • .doc/docx
  • .ppt/pptx
  • .xls/xlsx
  • .html

There are currently two implementations: FileConnectorType and AverbisFileConnectorType. The AverbisFileConnectorType remembers the current position when stopping, so that it does not start from the beginning when restarting.

A file system connector can be configured using the following parameters:

  • Name: Name of the connector. This name can be chosen freely and serves e. g. as label within the connector overview. It must not contain spaces, special characters, or underscores.

  • Start paths: For each line, you can specify a file system path that is taken into account by the connector. The connector runs through these directories recursively, i. e. all subdirectories are considered.

  • Exclude pattern: Here you can specify patterns to exclude certain files or file types (Black List).

  • Include pattern (optional): Here you can specify patterns to include certain files or file types only (White List).
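How the start paths and the two pattern lists interact can be pictured with the following minimal sketch. It assumes glob-style patterns such as *.pdf, that exclude patterns take precedence over include patterns, and that an empty include list means "include everything". The manual does not specify the pattern syntax or the precedence the connector actually uses, so these points, like the function name select_files, are assumptions made for illustration only.

```python
import fnmatch
import os


def select_files(start_paths, include_patterns=(), exclude_patterns=()):
    """Yield files below the start paths that pass the (assumed) glob filters."""
    for start_path in start_paths:
        # os.walk descends into all subdirectories, mirroring the recursive traversal.
        for directory, _subdirs, filenames in os.walk(start_path):
            for filename in sorted(filenames):
                # Assumption: a file matching any exclude pattern is skipped.
                if any(fnmatch.fnmatch(filename, p) for p in exclude_patterns):
                    continue
                # Assumption: an empty include list means "include everything".
                if include_patterns and not any(
                    fnmatch.fnmatch(filename, p) for p in include_patterns
                ):
                    continue
                yield os.path.join(directory, filename)
```

For example, select_files(["/data/reports"], ["*.pdf", "*.txt"], ["draft_*"]) would list all PDF and text files below the hypothetical directory /data/reports whose names do not start with draft_.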

6.1.2. Database Connector

With a database connector, structured data can be imported from a database connection. The database connector supports JDBC compliant databases and can crawl database tables using SQL queries. Each row from the SQL query result is treated as a separate document. The database connector keeps track of changes that are made to the database tables and synchronizes these changes automatically into the application.

In order to use the database connector, the database JDBC driver has to be provided to the Tomcat server instance that is running the application. Please ask your system administrator to put the database JDBC driver library into Tomcat's lib directory.


The database connector can be configured using the following parameters:

  • Name: Name of the connector. This name can be chosen freely and serves e. g. as label within the connector overview. It must not contain spaces, special characters, or underscores.

  • JDBC Driver Classname: Fully qualified class name of the database JDBC driver. E.g. com.mysql.jdbc.Driver

  • JDBC Connection URL: JDBC connection URL to the database. E.g. jdbc:mysql://localhost:3306/documentDB

  • Username: Database username.

  • Password: Database password.

  • Traversal SQL Query: SQL select query. E.g. SELECT id, title, content FROM documents

  • Primary Key Fields: Name of the column that represents the primary key and identifies a table row. E.g. id

The default field mapping of the database connector concatenates all queried columns (like id, title and content) and maps the result to the document field named content. The field mapping can be configured in the connector field mapping dialog (see section Editing field mappings for further details). The figure below shows a custom field mapping that maps the database columns to document fields. The id column is mapped to the document_name field; title and content are mapped to document fields of the same name.


connectorAdminFieldMapping


Figure 27: Database connector custom field mapping.
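The default behaviour described above (one document per result row, all queried columns concatenated into the content field) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of a JDBC connection, and the single space used to join the column values is an assumption. The function name rows_to_documents is hypothetical and does not reflect the connector's actual implementation.

```python
import sqlite3


def rows_to_documents(connection, traversal_query, primary_key_field):
    """Turn each result row into one document, keyed by its primary key."""
    cursor = connection.execute(traversal_query)
    columns = [description[0] for description in cursor.description]
    documents = {}
    for row in cursor:
        record = dict(zip(columns, row))
        # Default mapping: concatenate all queried columns into "content".
        # Assumption: the values are joined with a single space.
        content = " ".join(str(value) for value in row)
        documents[record[primary_key_field]] = {"content": content}
    return documents


# In-memory stand-in for the documents table from the example query.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)
connection.execute("INSERT INTO documents VALUES (1, 'Report', 'No findings.')")
docs = rows_to_documents(
    connection, "SELECT id, title, content FROM documents", "id"
)
```

With a custom field mapping, each column would instead be written to its own document field rather than being merged into content.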

6.1.3. Editing field mappings

Connectors read different sources to extract structured data from them. The extracted data is then written to fields of a Solr core. Field mappings define which information from the original documents is written to which fields of the Solr index.

Specific default mappings can be specified for each index and connector throughout the system. These are automatically taken into account when a new connector is created.

When editing the field mappings, select a connector field on the left. On the right, select the core field in which you want the connector to write this data. All core fields that have been activated in the Solr schema configuration and are writable are available here. In addition to editing the default mappings, you can also specify further mappings or remove existing ones.

You can also specify a sequence for the mappings. This order matters when multiple connector fields are mapped to the same core field. If the core field can hold more than one value, the values are written to the field in the order specified here. If the core field can hold only one value, it receives the value of the mapping that is lowest in the mapping sequence.

After you have edited a field mapping, you must reset the connector so that the changes to the mapping are taken into account.


connectorAdminEditFieldMapping


Figure 28: Editing field mappings.


There are currently three different mapping types:

  • Copy Mapping: The default type. The connector field is mapped 1:1 to the specified document field.

  • Constant Mapping: Instead of a connector field, a constant value can be mapped to a document field.

  • Split Mapping: The value of a connector field is divided into several values by a character to be entered. This can be used to convert comma-separated lists into multi valued document fields.
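The three mapping types and the role of the mapping sequence can be sketched as follows. The dictionary layout of a mapping, the function name apply_mappings and the parameter multi_valued_fields are hypothetical. The sketch also reads "lowest in the mapping sequence" as "last in the list", which is an interpretation rather than documented behaviour.

```python
def apply_mappings(connector_record, mappings, multi_valued_fields=()):
    """Apply copy, constant and split mappings in the given sequence."""
    document = {}
    for mapping in mappings:  # the list order is the mapping sequence
        kind, target = mapping["type"], mapping["target"]
        if kind == "copy":
            values = [connector_record[mapping["source"]]]
        elif kind == "constant":
            values = [mapping["value"]]
        elif kind == "split":
            raw = connector_record[mapping["source"]]
            values = [part.strip() for part in raw.split(mapping["separator"])]
        else:
            raise ValueError(f"unknown mapping type: {kind}")
        if target in multi_valued_fields:
            # Multi-valued field: values accumulate in sequence order.
            document.setdefault(target, []).extend(values)
        else:
            # Single-valued field (assumption): a later mapping overwrites
            # an earlier one, so the last mapping in the sequence wins.
            document[target] = values[-1]
    return document


record = {"title": "Report", "keywords": "heart, lung"}
mappings = [
    {"type": "copy", "source": "title", "target": "title"},
    {"type": "constant", "value": "import-2019", "target": "source"},
    {"type": "split", "source": "keywords", "separator": ",", "target": "tag"},
]
result = apply_mappings(record, mappings, multi_valued_fields={"tag"})
```

Here the split mapping turns the comma-separated keywords value into two separate values of the multi-valued tag field.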

6.2. Document Import

In addition to defining connectors that can monitor and search different document sources, it is also possible to import pre-structured data into a search engine index. Unlike connectors, this data is imported once, i. e. no subsequent synchronization takes place.

6.2.1. Manage document imports

Any number of document sets can be imported into the application and deleted if necessary. Each set of imported documents, known as an import batch, is shown as a row in the overview table. In addition to the name of the import batch, you can also see how many documents are part of the batch. The status indicates whether the import is still running, whether it was successful, or whether it has failed.


importBatchOverview


Figure 29: Overview of all previously imported document batches.


Below the overview table you will find the form elements to import a new document set. To do this, enter a name and click the Browse button. A window opens in which the local file system is displayed.

You can import single files as well as ZIP archives containing several files. Make sure that there are no (hidden) subdirectories in the ZIP archive and that the files have the correct file extensions.


These import formats are currently available:

Text Importer

Text importers can be used to import any plain text files. The complete content of the file is imported into a single field. The file name is available later as metadata.

CAS Importer

Allows the import of serialized UIMA CAS (currently as XMI). This means, for example, that documents can be imported as gold standards.

Please note that the type system of this CAS has to be compatible with the type system of the application.


Solr XML Importer

A simple XML format that allows the import of pre-structured data. During the import, the fields defined in the XML are written to the fields of the same name in the search index. Please make sure that the field names in the XML file correspond to the field names of the search index associated with your project.

A special feature is that images can be imported along with the documents and displayed together with them. To upload images, pack the XML document(s) together with the images into a ZIP archive. You can add as many image_reference fields to each document as you like. Relative paths to the images are expected. Images can be stored in any subfolders within the ZIP archive. Supported image formats are .gif, .png, .jpg and .tif.

...
<field name="image_reference">images/image.png</field>
<field name="image_reference">./images/pics/picture.png</field>
...
      

An example of the supported import format is shown below

<?xml version='1.0' encoding='UTF-8'?>
<!--Averbis Solr Import file generated from: medline15n0771.xml.gz-->
<update>
  <add>
    <doc>
      <field name="id">24552733</field>
      <field name="title">Treatment of sulfate-rich and low pH wastewater by sulfate reducing bacteria with iron shavings in a laboratory.</field>
      <field name="content">Sulfate-rich wastewater is an indirect threat to the environment especially at low pH. Sulfate reducing bacteria (SRB) could use sulfate as the terminal electron acceptor for the degradation of organic compounds and hydrogen transferring SO(4)(2-) to H2S. However their acute sensitivity to acidity leads to a greatest limitation of SRB applied in such wastewater treatment. With the addition of iron shavings SRB could adapt to such an acidic environment, and 57.97, 55.05 and 14.35% of SO(4)(2-) was reduced at pH 5, pH 4 and pH 3, respectively. Nevertheless it would be inhibited in too acidic an environment. The behavior of SRB after inoculation in acidic synthetic wastewater with and without iron shavings is presented, and some glutinous substances were generated in the experiments at pH 4 with SRB culture and iron shavings.</field>
      <field name="tag">Hydrogen-Ion Concentration; Iron; Oxidation-Reduction; Sulfur-Reducing Bacteria; Waste Water; Water Purification</field>
      <field name="author">Liu X, Gong W, Liu L</field>
      <field name="descriptor">Evaluation Studies; Journal Article; Research Support, Non-U.S. Gov't</field>
    </doc>
    <doc>
      <field name="id">24552734</field>
      <field name="title">Environmental isotopic and hydrochemical characteristics of groundwater from the Sandspruit Catchment, Berg River Basin, South Africa.</field>
      <field name="content">The Sandspruit catchment (a tributary of the Berg River) represents a drainage system, whereby saline groundwater with total dissolved solids (TDS) up to 10,870 mg/l, and electrical conductivity (EC) up to 2,140 mS/m has been documented. The catchment belongs to the winter rainfall region with precipitation seldom exceeding 400 mm/yr, as such, groundwater recharge occurs predominantly from May to August. Recharge estimation using the catchment water-balance method, chloride mass balance method, and qualified guesses produced recharge rates between 8 and 70 mm/yr. To understand the origin, occurrence and dynamics of the saline groundwater, a coupled analysis of major ion hydrochemistry and environmental isotopes (d(18)O, d(2)H and (3)H) data supported by conventional hydrogeological information has been undertaken. These spatial and multi-temporal hydrochemical and environmental isotope data provided insight into the origin, mechanisms and spatial evolution of the groundwater salinity. These data also illustrate that the saline groundwater within the catchment can be attributed to the combined effects of evaporation, salt dissolution, and groundwater mixing. The salinity of the groundwater tends to vary seasonally and evolves in the direction of groundwater flow. The stable isotope signatures further indicate two possible mechanisms of recharge; namely, (1) a slow diffuse type modern recharge through a relatively low permeability material as explained by heavy isotope signal and (2) a relatively quick recharge prior to evaporation from a distant high altitude source as explained by the relatively depleted isotopic signal and sub-modern to old tritium values. </field>
      <field name="tag">Groundwater; Isotopes; Rivers; Salinity; South Africa; Water Movements</field>
      <field name="author">Naicker S, Demlie M</field>
      <field name="descriptor">Journal Article; Research Support, Non-U.S. Gov't</field>
    </doc>
  </add>
</update>
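An import archive of this kind can also be generated programmatically. The following sketch builds the update/add/doc/field structure shown above with Python's standard library and packs it into a ZIP archive together with the referenced images. The function name build_import_archive and the file name documents.xml are hypothetical; the field names must match those of your own search index.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def build_import_archive(documents, images):
    """Return ZIP bytes containing one Solr XML file plus the referenced images."""
    update = ET.Element("update")
    add = ET.SubElement(update, "add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        # Fields are (name, value) pairs so that a name such as
        # image_reference can occur several times per document.
        for name, value in fields:
            ET.SubElement(doc, "field", name=name).text = value
    xml_bytes = ET.tostring(update, encoding="utf-8", xml_declaration=True)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("documents.xml", xml_bytes)  # hypothetical file name
        for relative_path, data in images.items():
            # The path inside the archive must equal the image_reference value.
            archive.writestr(relative_path, data)
    return buffer.getvalue()
```

A call such as build_import_archive([[("id", "1"), ("title", "Example"), ("image_reference", "images/image.png")]], {"images/image.png": png_bytes}) returns the archive as bytes, which can then be written to a file and uploaded in the import form. ElementTree escapes special characters such as & and < in field values automatically.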


7. Text Analysis

7.1. Pipeline Configuration

The text analysis components and pipelines used in the application can be graphically administered and monitored in a centralized way. This is done in the Pipeline configuration module.


pipelineConfNavItem


Figure 30: Link for opening the graphical configuration of text analysis components.


The overview page lists all the text analysis pipelines available in the project. The following information and operations are provided in the table.

  • "Pipeline Name": name of the pipeline.

  • "Status": Status of the pipeline: STOPPED, STARTING or STARTED. As soon as the pipeline has been started, it reserves system resources. It accepts analysis requests only after it has started.

  • "Preconfigured": indicates whether the pipeline is a preconfigured pipeline. These pipelines cannot be edited.

  • "Throughput": here, two indicators for the pipeline throughput are given: the total number of processed texts, and the average number of processed texts per second. The statistics are reinitialized each time the pipeline stops/starts.

  • "Operations | Initialize pipeline" : this is used to initialize a pipeline. As soon as it has been initialized, it can process texts.

  • "Operations | Stop pipeline" : to save system resources, pipelines can also be stopped.

  • "Operations | Edit pipeline" : this is used to configure a pipeline, for example to add other components to it, to remove them or to modify their configuration parameters. Pipelines can only be edited when they are stopped.

  • "Operations | Update pipeline" : this is used to update the statistics (throughput) and status of the pipeline.

  • "Operations | Delete pipeline" : this allows pipelines to be permanently deleted, if they are no longer needed.


pipelineConfOverview


Figure 31: Overview of all available text analysis pipelines in the project.
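The rules stated in the table above (a pipeline accepts analysis requests only once it has started, and only stopped, non-preconfigured pipelines can be edited) can be summarized in a few lines. This is an illustrative model only; the names PipelineStatus, can_edit and accepts_requests are hypothetical and not part of the product.

```python
from enum import Enum


class PipelineStatus(Enum):
    STOPPED = "STOPPED"
    STARTING = "STARTING"
    STARTED = "STARTED"


def can_edit(status, preconfigured):
    """Only stopped pipelines that are not preconfigured can be edited."""
    return status is PipelineStatus.STOPPED and not preconfigured


def accepts_requests(status):
    """Analysis requests are accepted only after the pipeline has started."""
    return status is PipelineStatus.STARTED
```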


To create new pipelines, use the 'Create pipeline' button below the overview table.

7.2. Pipeline details

With the pencil icon in the taskbar of the overview table, you can access the details page of the pipeline. At the top left, all components are displayed in the order in which they are used in the pipeline.

To the right of each component name, you can see the component-specific throughput data, indicating the total number of processed texts and the average number of texts per second. By clicking a component, you can display all of its configuration parameters.


pipelineDetails


Figure 32: Detail view of an initialized pipeline.


As long as a pipeline is running, it cannot be edited. When you stop a non-preconfigured pipeline, you can reconfigure the pipeline in the details page. Buttons on the right are now displayed instead of the throughput data, which can be used to remove components from the pipeline, or to move them to another position within the pipeline. Individual configuration parameters of the components are now also editable. Other components can also be added to the pipeline from the right side.


pipelineDetailsEdit


Figure 33: Editing a pipeline.


The right-hand area with the available components is itself divided into several blocks: Preconfigured Annotators, PEAR Components and Available Annotators.

7.2.1. Preconfigured Annotators

Preconfigured annotators are annotators that Averbis has already preconfigured for a specific purpose. For example, a diagnostic annotator is nothing more than a GenericTerminologyAnnotator preconfigured with a diagnosis dictionary. Preconfigured annotators can also be aggregates made up of several components. This makes it possible to present components with complex interdependencies to the end user in a clear way.

7.2.2. PEAR components

PEAR components are those added by users. They can be integrated in pipelines in the same way as the preconfigured or available annotators. More on this in the chapter Managing / Adding new text analysis components.

7.2.3. Available Annotators

The list of available annotators contains all general, i.e. not preconfigured, components detected in the application's component repository.

7.3. Managing / Adding new text analysis components

The application allows you to add new text analysis components at runtime. There is no need to reinstall or redeploy the application. For this purpose, so-called UIMA™ PEAR components (Processing Engine ARchive) are used. PEAR is a packaging format that allows text analysis components to be shipped together with all required resources in a single artifact.

You can find a list of all available PEAR components in the Pipeline Configuration, where you configure your text analysis pipeline. New components are added in the Textanalysis: Components module.


componentsAndPear


Figure 34: Show and import UIMA PEAR components.

7.4. Text Analysis Processes

Any number of text analysis results can be generated and stored for all known document sources in the application. Text analysis results can be created either automatically through pipelines or manually. This way, you can obtain different semantic views of the same document and evaluate several views side by side.


processOverview


Figure 35: Overview of all currently created text analysis tasks.


The table contains the following columns:

  • "Type": indicates whether this is a manual or automatic text analysis.

  • "Name": name of the process. For example Demo - anatomy

  • "Status": Status of the process. It is either RUNNING or IDLE.

  • "Document source": the document source to which the task refers. In parentheses after the name is the number of processed fields. For example if two fields, contents and title, are processed in a corpus of 3000 documents, then at the end of the task, 6000 will be indicated here.

  • "Pipeline": in the case of an automatic text analysis, the pipeline that was used for the text analysis is indicated here.

  • Download: Download the whole result as set of UIMA XMI files.

  • Delete: Delete whole process and all results.

When you create a new task, you can select whether it is a manual or an automatic text analysis.


processAddNew


Figure 36: Creating a new text analysis task: manual or automatic text analysis.


If you choose automatic text analysis, you are asked to give your text analysis process a name and to specify the document source and the pipeline.


processConfigureAutomaticTextAnalysis


Figure 37: Creating a new automated text analysis process: Give your process a name and enter the document source and the pipeline you want to use.

7.5. Annotation Editor: Viewing and Editing Annotations

To be able to make a judgment about text analysis components, it is frequently essential to have the results displayed graphically. You may also want to correct text analysis results manually or annotate documents completely manually, for example to create gold standards, which are then used to evaluate text analysis components. For all these purposes, the Annotation Editor can be used.

7.5.1. Viewing annotations inside a document source

The Annotation Editor can be used to display text analysis results graphically. Using the annotation editor, all documents from a document source can be easily viewed, section by section, and all annotations can be graphically highlighted.

In the Annotation Editor, you first select a document source (1). If document names have been given to the documents in the source, the name of the first document in the source is displayed (2). You then select the text analysis process that you wish to view (3).

Once you have selected the source and the text analysis, the first document in the corpus is displayed. The document is displayed section by section. There is a checkbox above the text of each available annotation to enable the content of the annotation to be graphically highlighted (4). Using the right-hand checkbox (5), you can highlight all annotations at once, or reset the highlighting of all annotations.

In the main window (6), you can see the corresponding section of the document with the currently activated highlights. Below the main window, there are buttons for navigating through the individual sections of a document (7). Above it there are similar buttons, which you can use to navigate between the individual documents in a source (8).


annotationEditorOne


Figure 38: Displaying the annotations in the documents of a document source.


A table with a list of all the currently highlighted annotations can be displayed on the right of the main window.


annotationEditorTwo


Figure 39: Overview table of annotations.


To provide a better connection between the table and the graphical highlighting in the text, annotations from the table can be given special emphasis in the text. To do this, you set the checkbox in front of the name of the related annotations. This allows the corresponding annotations to be displayed in bold and large font, in addition to the colored highlighting.


annotationEditorThree


Figure 40: Especially emphasizing individual annotations.


The overview table is also used to view the individual attributes of the annotation. By expanding the annotation in the table, you can obtain a list of all the annotation’s attributes.


annotationEditorFeatures


Figure 41: Show annotations' attributes.


7.5.2. Configuring section sizes

As described above, the documents are displayed section by section. By default, 5 sentences are displayed on each page. This setting can be configured in the interface by clicking the wheel icon at the top right.

In principle, you can combine character-based sectioning with annotation-based sectioning. While character-based sectioning is the standard, annotation-based sectioning may have the advantage that you do not miss annotations that span section boundaries. When both types of sectioning are combined, the sections are always shown with a slight overlap: the end of section n is displayed again at the beginning of section n+1 so that the section is not taken out of context. Furthermore, when sectioning by characters, the sectioning automatically ensures that sections are not split in the middle of a word.

Any change to the section size made in the graphical configuration is applied immediately after closing the window. Using the reset button, you can restore the configured default values.
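The idea of character-based sectioning with an overlap and without splitting words can be illustrated as follows. This is a minimal sketch and not the algorithm used by the Annotation Editor: the function name split_into_sections, its parameters, and the rule that the overlap starts at the beginning of a word are all assumptions.

```python
def split_into_sections(text, section_size, overlap):
    """Split text into sections of at most section_size characters (plus overlap)."""
    if section_size <= 0 or overlap < 0:
        raise ValueError("section_size must be positive, overlap non-negative")
    # First pass: choose cut positions directly after a space, so that
    # no section boundary falls in the middle of a word.
    cuts = [0]
    while cuts[-1] + section_size < len(text):
        end = cuts[-1] + section_size
        boundary = text.rfind(" ", cuts[-1], end)
        # Fall back to a hard cut if a single word is longer than a section.
        cuts.append(boundary + 1 if boundary != -1 else end)
    cuts.append(len(text))

    # Second pass: extend each section backwards by up to `overlap`
    # characters, so the end of section n reappears in section n+1.
    sections = []
    for begin, end in zip(cuts, cuts[1:]):
        start = max(begin - overlap, 0)
        if start > 0 and not text[start - 1].isspace():
            # The overlap would begin mid-word: move on to the next word start.
            next_space = text.find(" ", start, begin)
            start = next_space + 1 if next_space != -1 else begin
        sections.append(text[start:end].strip())
    return sections


sections = split_into_sections("one two three four five six", 10, 5)
```

In this example the second section starts by repeating the word "two" from the end of the first section, while "three" is not repeated in the third section because it does not fit into the five-character overlap.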


annotationEditorSettings


Figure 42: Annotation Editor settings window.


7.5.3. Manually editing, adding and deleting annotations

The annotation editor can also be used to add annotations manually or to edit them. Using the button on the right, you can switch to edit mode.

In edit mode, a button appears above the main window for each activated annotation type (2). After you select the type, you can create annotations of this type in the text. To create annotations of this type, simply highlight an area of text in the main window using the mouse. A quick way of adding an annotation is to simply click a word. An annotation of the corresponding type is then created for the whole word.

Edit mode also allows you to delete existing annotations. To do this, click the cross mark in the overview table of annotations on the right.

After you have made changes to the document, these can be saved or discarded by clicking the buttons (3).


annotationEditorEditMode


Figure 43: Editing Annotations.


In edit mode, you can also edit attributes of an annotation (only for annotations which are configured by Averbis as editable).


annotationEditorEditModeFeatures


Figure 44: Editing the attributes of an annotation.


7.5.4. Displayed and editable annotation types, attributes and colours

Currently, the user cannot configure which annotation types and attributes are visible in the annotation editor, which colors are assigned to these annotation types, and which attributes are editable. This is currently preset by Averbis.

7.6. Text Analysis Evaluation

The results of various text analysis tasks can be evaluated against each other, e.g., to compare a text mining process against gold standards.

To do this, first choose the document source (1) that serves as the basis of the evaluation. Then choose the reference view (2) in the left part of the window and, on the right side (3), the text analysis process that you wish to evaluate.

Once you have chosen a source and two text analysis processes, you can compare the results visually in a split view with two separate annotation editors. The sections shown in the right window are coupled to the sections in the left window. In addition to the color highlighting of the individual annotations, annotations that do not match between the two sides are marked graphically. They are also labelled accordingly in the tabular overview on the right side (4). Mistakes there are marked either in orange (false positives) or in gray (false negatives).


textanalysisCorpusEvaluationOne


Figure 45: The image shows the example of a DoseFormConcept annotation on the left that does not match on the right: TBCR.


7.6.1. "Matches" and "Partial Matches"

When evaluating, it is possible to distinguish between exact and partial matches. Annotations are marked as an exact match if their type, characterizing attributes and position in the text are identical.

To obtain an extra level between a hit and a no-hit, it is also possible to define a partial match. Annotations that are not exactly identical, but still meet the partial match criteria, are marked accordingly in both the graphical and the table presentation. In the graphical presentation they are italicized and underlined.


textanalysisCorpusEvaluationPartialMatch


Figure 46: Displaying a partial match.


7.6.2. Configuring the match criteria

The definition of what should be considered as a match, partial match and mismatch can be configured by the user in the interface.

The general rule is that two annotations are considered a match when they are of the same type and are found at exactly the same place in the document. For each annotation type, you can then define which annotation attributes also have to match. For a concept annotation, this could be the concept's unique ID. Two concepts would then be identified as a match only if this attribute is identical in both annotations.

It is also possible to configure for each annotation type, when two annotations of this type should be considered as a partial match. Here you can choose between four different options:

  • "No partial matches": only exact matches are allowed.

  • "Annotations must overlap": a partial match is given whenever the annotations overlap.

  • "Allow fixed offset": at the beginning and end of the annotations, a configurable offset is allowed.

  • "Are within the same annotation of a specific type": a partial match is found whenever the annotations are within the same larger annotation. For example, if they are inside the same sentence.


textanalysisCorpusEvaluationConfiguration


Figure 47: Graphical configuration of the match criteria.
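The match criteria and the four partial match options described above can be expressed compactly in code. The following sketch is illustrative only: the class Annotation, the function classify and the mode names are hypothetical, and it assumes that type and characterizing attributes must also agree for a partial match, which the manual does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int
    attributes: tuple = ()  # characterizing attributes, e.g. (("id", "C1"),)


def classify(reference, candidate, partial_mode="none", max_offset=0, containers=()):
    """Return 'match', 'partial' or 'mismatch' for two annotations."""
    # Assumption: type and attributes must agree even for partial matches.
    if reference.type != candidate.type or reference.attributes != candidate.attributes:
        return "mismatch"
    if (reference.begin, reference.end) == (candidate.begin, candidate.end):
        return "match"
    if partial_mode == "overlap":
        partial = reference.begin < candidate.end and candidate.begin < reference.end
    elif partial_mode == "offset":
        partial = (
            abs(reference.begin - candidate.begin) <= max_offset
            and abs(reference.end - candidate.end) <= max_offset
        )
    elif partial_mode == "same_container":
        # containers: (begin, end) spans of the larger annotation type,
        # for example sentences.
        partial = any(
            b <= reference.begin and reference.end <= e
            and b <= candidate.begin and candidate.end <= e
            for b, e in containers
        )
    else:  # "none": only exact matches are allowed
        partial = False
    return "partial" if partial else "mismatch"
```

For instance, two Diagnosis annotations covering characters 10 to 20 and 12 to 22 are a mismatch with "none", a partial match with "overlap", and a partial match with "offset" only if max_offset is at least 2.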


7.6.3. Corpus evaluation

Using the Evaluate metrics button, a window can be opened, displaying the precision, recall, F1 score and standard deviation for either a single document or the whole corpus. The numbers are split by annotation type.


textanalysisCorpusEvaluation


Figure 48: Evaluation at corpus level.
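For reference, precision, recall and F1 score follow from the numbers of true positives, false positives and false negatives per annotation type. The sketch below uses the standard definitions; how the application counts partial matches and how it computes the standard deviation is not described here, so neither is covered. The function name evaluation_metrics is hypothetical.

```python
def evaluation_metrics(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) using the standard definitions."""
    predicted = true_positives + false_positives  # annotations in the evaluated process
    expected = true_positives + false_negatives   # annotations in the reference
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 3 matching annotations, 1 false positive and no false negatives, precision is 0.75, recall is 1.0 and the F1 score is approximately 0.857.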


In the Settings panel, you can configure which types are to be taken into account in the corpus evaluation.


textanalysisCorpusEvaluationChooseTypes


Figure 49: Selecting the annotation types to be taken into account in the corpus evaluation.

7.7. Annotation Overview

For the quality assessment and improvement of text analysis pipelines, an aggregated overview of the assigned annotations is often helpful. For this purpose, the Annotation overview is used. You can create any number of these overviews. To do this, you first select a source and an existing text analysis process. Next, you select the annotation type to be analyzed.

After pressing the green button, the aggregation is calculated. Depending on the scope of the selected source, this may take some time. All overviews are listed in the table. As soon as an overview has been calculated, the results can be displayed via the list symbol.


annotationOverview


Figure 50: Listing and management of the available annotation overviews.


7.7.1. Aggregation and Context

If you select an overview from the table using the list symbol, you will see an aggregated list of the annotations found for the corresponding type. By default, the list is sorted in descending order by frequency. By clicking on an annotation in the table, you can display some example text in which the annotations occur. In addition to the analysis, the overview is also suitable for directly improving the results. In this way, false positives as well as false negatives can be identified and corrected.

Currently, the attributes that appear in the list for each annotation are preconfigured by Averbis. This setting cannot yet be changed via the GUI.
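Conceptually, the aggregated list corresponds to counting identical annotations and sorting the counts in descending order. A minimal sketch, assuming that annotations are aggregated by a single key such as the covered text (the key actually used is preconfigured by Averbis and not documented here):

```python
from collections import Counter


def aggregate_annotations(annotation_keys):
    """Count identical annotation keys, most frequent first."""
    return Counter(annotation_keys).most_common()


overview = aggregate_annotations(["fever", "cough", "fever", "headache", "fever"])
```

Rare entries at the end of such a list are often a good place to look for false positives.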

7.8. Text Analysis Web Service API

This chapter describes the REST web service API, which allows text analysis functionality to be integrated into external systems and workflows.

7.8.1. Authentication

The REST Web Service API uses API tokens to protect resources against unauthorized use. Users can create personalized API tokens and use them to access the Web Service API.


The generated API Token is only visible during initial creation. You won’t be able to see it again! Make sure to copy it and keep it in a safe place.


API tokens do not expire and can be used for any length of time. If an API token is to be deactivated, it can be invalidated via the interface. Calls using an invalidated API token are rejected. Only one API token per user is supported. Generating a new API token invalidates the existing API token.

The API token is expected to be transferred in the HTTP header named api-token as shown in the example request below:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' --header 'api-token: 2e559d4049190682d1433af1999a71483646abe72d75fbcba286eebd21d9f7e8' -d 'Some sample text' 'http://localhost:8080/information-discovery/rest/v1/textanalysis/projects/SampleProject/pipelines/samplePipeline/analyseText?language=en'

An HTTPS connection is required to securely transfer the API tokens.



7.8.2. Analyse Text Web Service

The Analyse Text Web Service analyses plain text and returns annotations in JSON.

POST http(s)://HOST:PORT/APPLICATION_NAME/rest/v1/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseText

  • URL parameter projectName specifies the project name that contains the pipeline.

  • URL parameter pipelineName specifies the name of the pipeline that will be used to analyse the text.

  • URL parameter language specifies the text language. Can be omitted if the pipeline is able to detect the text language.

  • URL parameter annotationTypes specifies a comma separated list of annotation types that will be contained in the response. Wildcards (*) are supported.

  • Request body parameter text specifies the text to be analysed.

Example Request:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' --header 'api-token: 2e559d4049190682d1433af1999a71483646abe72d75fbcba286eebd21d9f7e8' -d 'Some sample text to be analysed' 'http://localhost:8080/information-discovery/rest/v1/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseText?language=en&annotationTypes=de.averbis.types.Token%2Cde.averbis.types.Sentence'
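The URL used in the example request can also be assembled programmatically. The following Python sketch is an illustration under assumptions: the function name `build_analyse_text_url` is hypothetical, while the path layout and the parameter names `language` and `annotationTypes` are taken from the description above. The annotation types are joined with commas, which `urlencode` percent-encodes as `%2C`, as in the curl example.

```python
from urllib.parse import quote, urlencode


def build_analyse_text_url(base_url, project, pipeline, language=None, annotation_types=None):
    """Assemble the analyseText URL (hypothetical helper, for illustration only).

    base_url is expected to look like http(s)://HOST:PORT/APPLICATION_NAME.
    """
    path = "{}/rest/v1/textanalysis/projects/{}/pipelines/{}/analyseText".format(
        base_url.rstrip("/"),
        quote(project, safe=""),   # escape special characters in the names
        quote(pipeline, safe=""),
    )
    params = {}
    if language:
        # May be omitted if the pipeline detects the language itself.
        params["language"] = language
    if annotation_types:
        # Comma-separated list; wildcards (*) may be part of a type name.
        params["annotationTypes"] = ",".join(annotation_types)
    return path + ("?" + urlencode(params) if params else "")
```

For example, calling the function with the base URL `http://localhost:8080/information-discovery`, the project `defaultProject`, the pipeline `defaultPipeline`, the language `en` and the two types `de.averbis.types.Token` and `de.averbis.types.Sentence` yields the URL of the example request above.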



7.8.3. Analyse HTML Web Service

The Analyse HTML Web Service analyses text contained in HTML5 and returns annotations in JSON.

POST http(s)://HOST:PORT/APPLICATION_NAME/rest/v1/textanalysis/projects/{projectName}/pipelines/{pipelineName}/analyseHtml


  • URL parameter projectName specifies the project name that contains the pipeline.

  • URL parameter pipelineName specifies the name of the pipeline that will be used to analyse the text.

  • URL parameter language specifies the text language. Can be omitted if the pipeline is able to detect the text language.

  • URL parameter annotationTypes specifies a comma separated list of annotation types that will be contained in the response. Wildcards (*) are supported.

  • Request body parameter text specifies the HTML5 content to be analysed.

Example Request:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' --header 'api-token: 2e559d4049190682d1433af1999a71483646abe72d75fbcba286eebd21d9f7e8' -d '<html><body>Some sample html5 content to be analysed</body></html>' 'http://localhost:8080/information-discovery/rest/v1/textanalysis/projects/defaultProject/pipelines/defaultPipeline/analyseHtml?language=en&annotationTypes=de.averbis.types.Sentence%2Cde.averbis.types.Token'



7.8.4. Swagger-UI API Browser

Developers can get an overview of the Text Analysis Web API and test its functionality on the integrated Swagger-UI API browser page. In particular, sample requests can easily be generated and return values verified. The Swagger-UI API browser is available at:

http(s)://HOST:PORT/APPLICATION_NAME/rest/swagger-ui.html


swagger textanalysis


Figure 51: Swagger-UI API Browser

7.8.5. Result Format (XML)

The response of the web service is returned in XML format and contains the text analysis results for the input data set. For more information about the data format, see chapter Available Text Mining Annotators & Web Service Specification.

8. Terminologies/Lexical resources

In this module, you can manage the lexical resources, which are used within the text analysis components.

8.1. Terminology Editor

The Terminology Editor lets you edit the content of terminologies.

8.1.1. Free text search and autosuggest

The centered search bar at the top of the Terminology Editor is used for free text search across multiple terminologies. You can include or exclude terminologies from the search by checking them in the drop-down menu next to the search bar. While you enter a search term, the system suggests possible matches via autosuggest, grouped by terminology.


searchAutosuggest


Figure 52: Terminology auto suggest.


When doing a free text search, you can use the asterisk symbol (*) for truncation (e.g. Appendi*). The results of a free text search are listed in the upper right section, grouped by terminology.

The settings menu on the top right lets you customize some search and autosuggest settings. You can specify whether Concept IDs are included in the search, and define the number of hits to be displayed.


terminologyEditorSettings


Figure 53: Configuration of search and autosuggest.


8.1.2. Displaying concepts hierarchically

The tree view in the Terminology Editor shows the position of a concept in the terminology hierarchy. Just click on a concept within the list of search results.


treeD012214


Figure 54: Displaying concepts hierarchically.


You can configure whether the Concept ID shall be shown in the tree as well, and whether the tree view shall show the siblings of a concept along its hierarchy.


treeFocusVsNonFocus


Figure 55: Tree with and without strictly focusing on the selected concept.


8.1.3. Terms

In the lower right corner of the window you see the concept’s details. The first tab shows the concept's synonyms. You can edit, add or delete synonyms here as well.


terminologyEditorAddTerm


Figure 56: Adding new terms.


8.1.4. Mapping Mode

Every term has a so-called Mapping Mode. Mapping Modes are an efficient way of increasing the accuracy of terminology-based annotations. They allow you to ignore certain synonyms that are irrelevant or lead to false positive hits (IGNORE). Synonyms can also be restricted to EXACT matches, which is especially useful for acronyms and abbreviations (AIDS != aid).

Currently, there are three Mapping Modes:

DEFAULT

The term is preprocessed in the same way as the text, as configured in the pipeline.

EXACT

The term is only mapped when its string matches the text exactly, without any modification by preprocessing (including case).

IGNORE

Term will be ignored. It won’t be used within the text analysis.
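The effect of the three Mapping Modes can be illustrated with a small Python sketch. It is a simplified model, not the actual matching implementation: the function names are hypothetical, and the pipeline's preprocessing is assumed to consist of lowercasing only, whereas a real pipeline may apply further steps such as stemming.

```python
def preprocess(text: str) -> str:
    # Stand-in for the pipeline's preprocessing.
    # Assumption: lowercasing only; real pipelines may do more.
    return text.lower()


def term_matches(term: str, mode: str, token: str) -> bool:
    """Decide whether a terminology term maps to a token from the text."""
    if mode == "IGNORE":
        return False  # the term is never used in text analysis
    if mode == "EXACT":
        return term == token  # no preprocessing, case-sensitive
    if mode == "DEFAULT":
        return preprocess(term) == preprocess(token)
    raise ValueError(f"unknown Mapping Mode: {mode}")
```

With this model, the term AIDS in EXACT mode does not match the token aids, while the same term in DEFAULT mode does.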


8.1.5. Relations

The second tab shows all relations known for that concept. You can use this view to add or delete relations, too. Currently, only hierarchical relations are supported. When adding a new relation, you get an autosuggest to find the correct concept that you want to relate.

8.1.6. Mapping Mode and comment

In the third tab, you can add a comment to a concept. You can also set a concept-wide Mapping Mode. Terms that do not have a specific Mapping Mode inherit the concept's Mapping Mode.
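The inheritance rule can be written down as a one-line lookup. The sketch below is illustrative only; the function name is hypothetical, and using DEFAULT as the concept-wide fallback is an assumption.

```python
from typing import Optional


def effective_mapping_mode(term_mode: Optional[str], concept_mode: str = "DEFAULT") -> str:
    """Return the Mapping Mode that applies to a term.

    A term-specific mode wins; otherwise the concept-wide mode is inherited.
    The fallback value "DEFAULT" for the concept is an assumption.
    """
    return term_mode if term_mode is not None else concept_mode
```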


9. Document Search

9.1. Solr Core Administration

As soon as the Solr Admin module is used, the application has a default Solr Core. This core is displayed in the administration panel.

Health Discovery uses Solr to create a search index and to make documents searchable. Choose "Solr Core Administration" on the project overview to configure the basic settings.

9.1.1. Indexing pipeline

Documents that are imported or crawled go through a text analysis pipeline in order to add metadata to the search index.

The corresponding pipeline is selected here - a separate indexing pipeline can be used for each project.


solrCoresOverviewNoPipes


Figure 62: Choosing the indexing pipeline.


If you choose an indexing pipeline, all documents that are imported or crawled from then on are processed with it. If you want to use a different pipeline for processing search queries, you can set it in the Solr Core Management section.

You can also switch the indexing pipeline within a project. To avoid a heterogeneous set of metadata, all documents are re-processed.

9.1.2. Query Pipeline

Here you can select which of the available pipelines should be used for analyzing the search query. By default, the same pipeline is used here as selected for indexing the documents.


solrCoresNoQueryPipe


Figure 63: Initial state in which no query pipeline is selected.


solrCoresSetQueryPipe


Figure 64: Choose a query pipeline.


9.1.3. Solr Core Overview

A so-called "Solr Core" is available for each project, the administration of which can be accessed via the "Solr Core Management" button on the project page.


solrCoresOverview


Figure 65: Key figures and information on the search index of a project.


  • "Core Name": The name of the Solr instance (generated automatically)

  • "Path to solrconfig.xml": This is the path to the configuration file of this Solr instance. Expert settings can be made in this configuration file. After editing this file, the Solr instance must be restarted in order for the changed settings to take effect.

  • "Path to schema.xml": The index fields are configured in this configuration file. This file should only be edited manually in exceptional cases and by experts.

  • "Indexed documents": Number of documents currently in the index.

  • "Pending documents": Number of documents that are currently in the processing queue of the Solr instance.

After pending documents have been processed by Solr, a commit must take place before these documents are actually available in the index. Since a commit is quite resource-intensive, the number of commits is kept low. By default, a commit therefore only takes place every 15 minutes. The processed documents therefore appear under the indexed documents with a delay.


  • "Operations": At the level of the Solr core, there are three operations available:

    • "Refresh" : You can update the displayed key figures by clicking on this icon.

    • "Commit" : This command executes a commit on the Solr core, which makes processed documents that were not yet visible available in the index. By default, this happens every 30 minutes in the background.

    • "Delete all documents from the index" : With a click on this icon, all documents are deleted from the index.

9.1.4. Configuration of the search index schema

The configuration of the schema of the current search index can be reached via the module "Solr schema configuration".

9.1.4.1. Overview of all schema fields

Each Solr core has a schema that defines which information is stored in which kinds of fields. The Solr schema configuration lists all available fields in alphabetical order. The following information and operations are available for each field in the index:

  • "Field name": Name of the field as defined in the Solr schema. This name is often chosen in a way that is hard for people to read. If a field is a system field, that is, a field whose values must not be overwritten by the user, a small lock symbol is displayed to the right of the field name.

  • "Type": The type specifies the contents of this field. In addition to an abstract description (e. g. string) the complete class name of the field is specified in parentheses.

  • "Active": This button controls whether the field contains information to be displayed or used elsewhere in the application. These fields are then available, for example, to be displayed in the search result, to form facets or to be used via query builder for the formulation of complex, field-based search restrictions. Fields that are not activated can still be used by the system, but they are not available for manual configuration to the users. If a field is activated, the line is highlighted in green.

  • "Label": The field name itself is often not suitable for displaying because it is not legible, and it is not localized. Therefore, you can define meaningful display names for all fields in different languages. These names are used wherever the user accesses or displays field contents. If no corresponding display name is defined for the user’s language, the illegible field name is displayed.


schemaConf


Figure 66: Overview of the Solr core's schema.

9.1.4.2. Dynamic fields

In the overview, dynamically generated Solr fields are also displayed as soon as they have been created (that is, as soon as they have been filled with values once). As soon as the field has data, it remains permanently in the overview, even if all documents containing values in this field have been deleted in the meantime.

9.2. Manage and use search interface

The functionality and appearance of the search interface can be influenced by configuration.

9.2.1. Configuring the display of search results

Starting from the overview page of a project, the display of search results can be configured by using the "Field Layout Configuration" module. You can specify which fields/contents of the indexed documents are to be displayed in the interface. This applies to both the fields on the results overview page and the fields on the detail page of the documents (accessible by clicking on the title information of the result). Fields that are only displayed on the overview page of the search results are highlighted in green. In addition to selecting the fields, you can also configure whether the field title should be displayed, as well. If this option is activated, the display name created in the Solr schema management for the language of the respective user is displayed.

In addition, the length of content of a particular field can be specified, as well as some style settings.


manageDisplayFields


Figure 67: Configuring the display of search results.


9.2.2. Configure Facets

So-called facets provide the user with additional filter options. They are displayed on the left side of the search page. The configuration of facets can be accessed via the module "Facet Configuration" on the project overview page.

On the configuration page, you can select and configure the facet fields displayed in the user interface. When selecting a facet, you can configure whether the entries within a facet are AND- or OR-linked. In the case of AND facets, only documents that combine all the terms selected in this facet are displayed. OR facets, on the other hand, offer the option of finding documents that contain only individual terms (e. g. documents of "Category 1" OR "Category 2").
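The difference between AND- and OR-linked facets can be expressed with simple set operations. The following Python sketch only models the semantics described above and is not how the search index evaluates facets internally; the function name and the document layout (a dictionary mapping field names to lists of values) are assumptions for illustration.

```python
def filter_by_facet(docs, field, selected, mode="AND"):
    """Keep the documents that satisfy the selected facet entries.

    docs:     list of dicts mapping field names to lists of values (assumed layout)
    selected: facet entries chosen by the user
    mode:     "AND" = document must contain all entries,
              "OR"  = document must contain at least one entry
    """
    selected = set(selected)
    if not selected:
        return list(docs)  # no restriction selected
    result = []
    for doc in docs:
        values = set(doc.get(field, []))
        if mode == "AND":
            keep = selected <= values        # all selected entries present
        elif mode == "OR":
            keep = bool(selected & values)   # at least one entry present
        else:
            raise ValueError(f"unknown facet mode: {mode}")
        if keep:
            result.append(doc)
    return result
```

For two selected entries "Category 1" and "Category 2", the AND mode returns only documents assigned to both categories, while the OR mode also returns documents assigned to just one of them.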

In addition, you can configure how many entries are to be displayed within each facet. The order of the facets can be changed with the arrows. The order in the search interface follows the order in the administration panel. The display name of a facet is taken from the labels assigned in the Solr schema configuration (see above).


manageFacets


Figure 68: Configure Facets.


9.2.3. Configuring auto-completion

Settings for automatic completion of search terms can be made via the "Autosuggest" module, which you access on the project overview page. There are various methods for offering users suggestions that complete their searches in a meaningful way. Currently, four methods are available, and they can be freely combined as needed.

The proposals are grouped by their mode in the search interface. The order of the groups corresponds to the order in which the modes are listed here (if more than one mode is used). Use the arrow keys to change the order.

In addition to the number of proposals per group, you can also specify a description for each group, which is displayed in the search interface above the respective proposal block.

Changes will take effect immediately after saving for all users of the search.

If one of the two concept-based methods is used, an additional field appears where you select which Solr field is to be used for the lookup. All fields that are recognized as concept-based fields are available for selection.


manageAutosuggest


Figure 69: Configuring auto-completion.


The methods are characterized as follows:

"Prefixed Facet Mode"

  • The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

  • The suggestions are intended to complete the term currently entered, no additional term is proposed (no multiple word suggestions).

  • The current search restrictions (e. g. via facets) are taken into account in the proposals. Therefore, only those terms are suggested for which there are also hits in the body, taking into account all active search restrictions.

  • The proposals are not based on the order of the terms in the documents. If you enter a search query that consists of several words, the proposed word does not have to directly follow the preceding query terms in the documents.

"Shingled Prefixed Facet Mode"

  • The proposals for completing the search query come from the documents in the search index. No external sources are therefore used for the proposals.

  • Unlike simple prefixed facet mode, suggestions can consist of several words. In addition to completing the term currently entered, this mode also suggests terms that often occur directly next to or close to this term in the documents. Entering Appen in this mode could therefore lead to suggestions such as treating appendicitis.

  • The current search restrictions (e. g. via facets) are taken into account in the proposals. Therefore, only those terms are suggested for which there are also hits in the body, taking into account all active search restrictions.

  • If the query consists of several words, the suggestions are based on the last of these words. All terms before this last word are still used as filters. The entry Hospital Appendi could therefore also lead to the suggestion Hospital Treat Appendicitis, even if Treat Appendicitis is not in the immediate vicinity of Hospital in the text.

Concept Mode with guaranteed hits (concepts_hit)

  • The suggestions for completing the search query are taken from synonyms of the stored terminology.

  • Proposals show the wording of the synonym and the title of the terminology as well as the preferred name of the concept in the user’s language.

  • If you select a proposal (synonym), a search with the associated concept is executed.

  • Documents that contain the exact synonym text (that is, documents that cannot be found using another synonym) are given a higher weighting and are displayed higher in the results list.

  • Only proposals that guarantee at least one hit are displayed.

Concept Mode without guaranteed hits (concepts_all)

This mode differs from the concept mode with guaranteed hits in that proposals are also displayed that do not lead to a hit. All terms from the stored terminology are displayed.

The activation of the concept modes is not completely implemented via the GUI. Please contact support.


9.2.4. Search restrictions

Switch to the "Search" module of the project to get to the search page of the application. All search terms entered remain traceable for the user at any time. You can easily see which search terms have led to the currently presented result set. The current search restrictions are listed next to each other on the left side of the search bar. They are highlighted in the same color as the corresponding highlighting in the text. If the restriction by a term originates from a facet, the name of the facet is listed before the search term (see screenshot below).

If there are too many search restrictions to be displayed in the search bar, they are displayed in a collapsible pop-up menu on the left of the search bar. The small cross symbol next to each search restriction removes this restriction and updates the search results accordingly. With the cross button to the right of the search bar you can also remove all current search restrictions at once.


searchrestrictions


Figure 70: Display of the current search restriction.


9.2.5. Faceted search

Facets represent one of the core functionalities of the search. With the help of the facets, the search results can be quickly limited to relevant results. In the admin panel you can configure for which categories facets should be displayed.

Within the facets, the most frequent terms from the respective category appear, which are contained in the indexed documents. The number after the faceted entries indicates how many documents are contained in the index (or current search result set) that match the corresponding term.

The faceted entries can be clicked on, whereupon the search result will be limited accordingly. Different terms can be combined here - even across facets. This allows a high degree of flexibility in restricting the search results.


facet


Figure 71: Concept facet with selected restriction to 'Diagnosis'.


9.2.6. AND-linked facets

By default, all selected facet entries are AND-linked. This means that only documents matching all selected criteria are listed. The currently selected filters are highlighted in orange. The restriction can be removed by clicking on the faceted entry again.

9.2.7. OR-linked facets

This filter yields result sets in which at least one of the selected criteria appears, that is, documents in which only one or only a few of the selected terms appear. For these OR-linked facets, a checkbox is displayed in front of each entry.

9.2.8. Querybuilder / Expert Search

The query builder provides a convenient mechanism for creating complex search queries. It allows you to combine different criteria into a query using any fields from the index.

The Querybuilder can be opened using the magic wand icon in the search bar.


suchschlitzQbInactive


Figure 72: The magic wand on the right of the search bar opens the query builder.


The input mask allows you to add search restrictions on all activated schema fields. Depending on the type of the selected schema field, different comparison operators are available. Text fields allow the operators contains and contains not. Any text can be entered as a restricting value. The asterisk * is used as a wildcard.

Date fields offer the comparison operators >= and <=. Numerical fields offer the comparison operators =, <>, >= and <=. By combining two conditions on a date or number field, the search can also be restricted to periods or ranges.


qb


Figure 73: Input mask of the query builder


Concept-based fields allow the operators contains and contains not like text fields.

Any number of conditions can be added. These are linked with each other using the boolean operators AND and/or OR. The criteria can also be grouped together to create any logical combinations. In addition to the graphical display, you can also find the logical expression that results from the current compilation of search restrictions in the upper area of the query builder. Once the complex search query has been created, it can be activated using the Apply button. The search results are calculated accordingly. In addition, the magic wand icon in the search bar turns orange to indicate that a complex search restriction is active. The search query can be reloaded by clicking on this button and can be edited until the result matches your expectations.

The query created using the Querybuilder applies in addition to any other search restrictions, such as free text search or facet restrictions.

9.2.9. Document details and original document

The title field of a document serves as a link to a detail page containing additional information about the document (see "Solr Schema Configuration" module on the project overview page).

In addition to the detailed view, you can also download the underlying original documents (e.g. PDF, office document etc.) if they are available. You can recognize this by a small icon on the right of the document title. The symbol differs depending on the document category. Clicking on the file icon starts the download of the original document.

9.3. Export search results

Documents in the system can be exported - both individual documents and complete search result sets.

9.3.1. Selection of documents to be exported

If the user has the necessary permissions to export documents, checkboxes are provided on the search results page to mark individual documents. There is also a checkbox to mark all currently displayed documents. In addition, the button "Export search results" is displayed above the search results, where the selected documents can be exported.

Another option is to export all documents that meet the current search restrictions. In this case, all checkboxes have to be deselected.


ExportDib


Figure 74: Controls to mark and export documents.


9.3.2. Selection of the exporter and the fields to be exported

After selecting the documents to be exported, a dialog box appears in which the exporter type can be selected. Currently, there is one exporter, which exports selected fields of the documents to an Excel document.

After selecting the fields to be included in the export and confirming with the "Export" button, the export starts. Once the export is complete, the result is offered for download.


ExportDialog


Figure 75: Selection of the exporter and the fields to be exported.


10. Document Classification

10.1. Manage classification

10.1.1. Administration of the label system

The target categories for the automatic classification of documents form the label system, which can be edited and maintained in the module "Label System". In a new project, the label system is initially empty.

Clicking on "Create new label" at the bottom left adds a new label. The pen symbol on the right-hand side is used to rename the label. The plus symbol to its right adds a new label as a child of the current label. It is therefore used to create hierarchically organized label systems. Clicking on the red cross symbol deletes labels (only labels that have no children can be deleted).

In a hierarchical labeling system, the hierarchical arrangement can also be edited via drag & drop.


manageLabelsystem


Figure 76: Labels can be added, edited, moved or deleted in the label system administration.


10.1.2. Administration of different classification sets

The starting point for the automatic classification of documents are so-called classification sets.


navigationItemManageClassification


Figure 77: Menu item for managing classification sets.


10.1.2.1. Create a new classification set

Any number of classification sets can be created for each project. This means that you can classify the same document source with different classification parameters.

There is only one label system per project. The same label system is used for each classification set. Please make sure that the label system has been created before you create a classification set.


To be able to view the results of the classification in the interface, you should select an indexing pipeline in Solr Core Management before you create classification sets.


When creating a new classification set, the following settings can be adjusted:

  • Name: Name under which this classification set is referenced.

  • Document fields: From all document fields known to the system, you can select those that are used for training the classifier (so-called features).

  • High confidence threshold: The system distinguishes between documents with high and low confidence for automatically classified documents. This parameter can be used to define the value above which the confidence is interpreted as "high".

  • Classifier: In principle, different implementations can be used for classification. At present, the implementation offered is a support vector machine.

    • SVM: Support vector machine

  • Single/multi-label: This parameter determines how many categories can be assigned to a single document. With Single, only one label is assigned. With Multi, a document can be assigned to several classes.

  • Classification method: The classification method determines how the machine selects from several candidates. Depending on whether it is a single-label or multi-label scenario, different options and configuration parameters are available:

    • Single-Label

      • Best Labels: With Single-Label-Classification there is only one classification method: the Best Labels method chooses the class with the highest confidence.

        • Threshold: The threshold value can be used to ensure that only classes with a certain minimum confidence are taken into account. This filters out assignments about which the machine is very uncertain.

    • Multi-Label: For Multi-Label Classification several methods are available (for a deeper theoretical background, see Matthew R. Boutell: Learning multi-label scene classification ):

      • All Labels: This method simply selects the available instance labels in a decreasing confidence order.

      • T-criterion: Using the T-criterion, instances first get filtered by a minimum confidence threshold of 0.5. If the confidences are too low, i.e. no labels are assigned, another filter step is used. The second step checks if the entropy of the confidences is lower than the minimum entropy threshold, i.e. confidences are distributed unevenly. If this is the case, the labels are assigned based on a lower minimum confidence threshold.

        • Entropy: 1.0 (default minimum entropy)

        • Threshold value: 0.1 (default minimum confidence)

      • C-criterion: This method ensures the selection of the best prediction values depending on the configuration parameters (i.e. Percentage and Threshold values). It first selects the label with the highest confidence (larger than the threshold value) and continues to assign labels whose confidence is at least at 75% of the highest confidence value.

        • Percentage value: 0.75

        • Threshold value: 0.1 (minimal default confidence).

      • Top n labels: This method selects the n categories that have the highest confidence.

        • n: the number of classes to be assigned
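The selection methods listed above can be sketched in Python as follows. This is an illustrative reading of the descriptions, not the product's implementation. All function names are hypothetical. Confidences are assumed to be given as a dictionary from label to a value between 0 and 1, and the entropy is assumed to be the Shannon entropy (base 2) of the normalized confidences; the manual does not specify the exact formula or whether the comparisons are strict.

```python
import math


def best_label(conf, threshold=0.0):
    """Single-label: the class with the highest confidence, if it reaches the threshold."""
    if not conf:
        return []
    label = max(conf, key=conf.get)
    return [label] if conf[label] >= threshold else []


def top_n_labels(conf, n):
    """Multi-label: the n classes with the highest confidence."""
    return sorted(conf, key=conf.get, reverse=True)[:n]


def c_criterion(conf, percentage=0.75, threshold=0.1):
    """Best label (above threshold) plus all labels within `percentage` of its confidence."""
    if not conf:
        return []
    ranked = sorted(conf, key=conf.get, reverse=True)
    best = conf[ranked[0]]
    if best <= threshold:
        return []
    return [label for label in ranked if conf[label] >= percentage * best]


def entropy(values):
    """Shannon entropy (base 2) of the normalized values. Assumed definition."""
    values = list(values)
    total = sum(values)
    if total <= 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


def t_criterion(conf, min_entropy=1.0, low_threshold=0.1, high_threshold=0.5):
    """Filter at 0.5 first; if nothing remains and the confidences are
    distributed unevenly (low entropy), fall back to the lower threshold."""
    ranked = sorted(conf, key=conf.get, reverse=True)
    chosen = [label for label in ranked if conf[label] >= high_threshold]
    if chosen:
        return chosen
    if entropy(conf.values()) < min_entropy:
        return [label for label in ranked if conf[label] >= low_threshold]
    return []
```

For confidences of 0.8, 0.65 and 0.2 for three labels, the C-criterion with the default values keeps the first two labels, because 0.65 is above 75% of 0.8, while 0.2 is not.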

The classification configuration can be changed on the classification administration page by clicking on the edit button.

After changing parameters of an existing classification set, re-training and re-classification are necessary for all changes to take effect.


Before documents can be automatically classified, the machine requires appropriate training material. This refers to a small set of intellectually classified documents used by the machine to train a model.

Training data can be created in two ways: either by manually assigning classes via the graphical user interface (please see "Browse classifications" below) or by importing a CSV file that contains appropriate assignments.

10.1.2.2. Import of training material

The button opens a dialog for importing a CSV file with training material. The CSV file must contain the name of the document in the first column (referred to as document_name in the system). The subsequent columns contain the label assignments (one column for each label in a multi-label scenario). The columns must be separated by semicolons. Values can be enclosed in double quotation marks if required (mandatory if the values contain semicolons).

Example :
trainset.csv

doc1;label_1;label_2
doc2;label_1;
doc3;label_1;label_3
...
      

The document name, which is used to identify the document in the list, must contain the value that is entered in the field document_name in the application.


If a training file contains several labels per document, but the selected training set is a single-label classification, only the first label is used.


If the document names or labels contain semicolons, the values must be enclosed in double quotation marks to avoid incorrectly interpreting the semicolon as a field separator.


Only values that are part of the label system in the application (or project) are allowed as labels (all others are ignored).


When you import training material, any labels that may already be assigned to the documents in the list are deleted.
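
To check a training file before importing it, the rules above can be reproduced with Python's standard csv module. The following is a minimal sketch under stated assumptions: the function name and the known_labels parameter are hypothetical, and in the single-label case the first label is taken before unknown labels are filtered out (the manual does not specify the order of these two steps).

```python
import csv
import io

def read_training_rows(text, known_labels, single_label=False):
    """Map each document name to its labels from semicolon-separated text."""
    assignments = {}
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip empty lines and rows without a document name
        name = row[0].strip()
        labels = [cell.strip() for cell in row[1:] if cell.strip()]
        if single_label:
            labels = labels[:1]  # only the first label is used
        # labels outside the label system are ignored
        assignments[name] = [l for l in labels if l in known_labels]
    return assignments
```

Quoted values such as "doc;3" are read as a single field, so a semicolon inside a document name or label is not mistaken for a separator.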

10.1.2.3. Train a model

As soon as the system has access to training material by importing a training list or manually assigning labels, a model can be trained using the button. Use to update the information on "State" and "Model": the training has finished if "State" is IDLE and "Model" is READY.

10.1.2.4. Quality of the current model

After each training session, an evaluation is carried out to assess the current quality of the model. For this purpose, the machine uses the set of documents with intellectually confirmed labels. This set is divided into a training set (90%) and a test set (10%). The test set is classified by the machine on the basis of a model that has been trained on this training set. The results of the automatic classification are then compared with the intellectually assigned labels. To smooth the results, the machine repeats this 10 times for different divisions into test and training sets. The results of the tests can be viewed in the form of a diagram using the button. The diagrams show the following metrics per label, which are derived from the number of correct assignments (true positives - TP), false assignments (false positives - FP), missing assignments (false negatives - FN), and correct non-assignments (true negatives - TN):

Accuracy: The ratio of all correct assignments (and correct non-assignments) to the total number of observations:

Accuracy = (TP + TN) / (TP + FP + FN + TN)


Precision: The ratio of correct assignments to all assignments:

Precision = TP / (TP + FP)

This value is of particular relevance if it is important to you that there are as few misallocations as possible.


Recall: The ratio of correct assignments to the sum of all existing correct assignments:

Recall = TP / (TP + FN)

This value is of particular relevance if you are willing to accept some misallocations in order to increase the number of hits.


F1-Score: The harmonic mean of Precision (P) and Recall (R):

F1 = 2 x (P x R) / (P + R)
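
As a worked example, the four metrics can be computed from the counts of one label as follows. This is a minimal sketch: the function name is hypothetical, and returning 0.0 for an empty denominator is an assumption made here for convenience, not documented behavior of the application.

```python
def label_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for one label from its counts."""
    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# 100 test documents: 8 correct, 2 false and 2 missing assignments
metrics = label_metrics(tp=8, fp=2, fn=2, tn=88)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these counts the accuracy (0.96) is much higher than precision and recall (0.8 each), because the many correct non-assignments dominate the total. This is why precision, recall and F1 are usually more informative for rare labels.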

 

10.1.2.5. Automatic classification of all unclassified documents

As soon as an initial model has been created, all previously unclassified documents can be automatically classified on the basis of this model via on the classification configuration page.

Once the classification is complete, the results can be viewed in the graphical user interface. The assigned classes are displayed above each document (see "Browse classifications" below).

10.1.2.6. Status information

The overview table shows information on the current status of the classification set:

  • IDLE: No process is currently running.

  • TRAINING: A training is in progress. During this time, no other processes can be started on this classification set.

  • CLASSIFYING: Documents are currently being classified. During this time, no other processes can be started on this classification set.

  • ABORTING: A process (training or classification) is being aborted. During this time, no processes can be started on this classification set.

The resulting model of a classification set comes with additional information:

  • NONE: No model has been trained yet.

  • READY: A valid model exists and a classification process can be started.

  • OUTDATED: Since the last training, manual classifications have been added or automatic classifications have been confirmed or rejected. The model should be re-trained in order to make changes take effect.

  • INVALID: Changes were made to the label system or a manually assigned label was deleted, which invalidates the current model. The model has to be re-trained.
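
An integrating script that polls these two values could model them as enumerations. The sketch below only encodes the rules stated in this chapter; the class and function names are hypothetical and are not part of the product's API.

```python
from enum import Enum

class State(Enum):
    IDLE = "IDLE"
    TRAINING = "TRAINING"
    CLASSIFYING = "CLASSIFYING"
    ABORTING = "ABORTING"

class Model(Enum):
    NONE = "NONE"
    READY = "READY"
    OUTDATED = "OUTDATED"
    INVALID = "INVALID"

def can_start_process(state):
    """No process can be started while another one is running or aborting."""
    return state is State.IDLE

def training_finished(state, model):
    """Training has finished when "State" is IDLE and "Model" is READY."""
    return state is State.IDLE and model is Model.READY

def retraining_advised(model):
    """NONE, OUTDATED and INVALID all call for a (re-)training."""
    return model is not Model.READY
```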

10.2. Index, evaluate and manually classify documents

For all classification sets, you can use a graphical user interface to navigate through the documents, review results, confirm or delete automatically assigned classes, and assign classes manually. You can access this browser view by clicking on "Classification" on the project overview page.

10.2.1. Structure of the interface

The interface is similar to the search interface, both in terms of its structure and functionality. The classification page has three predefined facets on the left side of the screen that can be used to filter documents according to the assigned class (Label), the assigned confidences (Confidence) or the assignment status on the document level (Status).

This makes it very easy to display, for example, only those documents that have been automatically classified (Status = Autoclassified) and that have labels with low confidence (Confidence = low). Correcting or confirming the labels of the resulting documents improves the classification model, because the system learns exactly where it is currently most uncertain (so-called Active Learning).
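
The idea behind this facet combination can be illustrated with a small Python sketch that orders automatically classified documents by their least confident label. The record layout (name, status, labels) and the cut-off of 0.5 for "low" confidence are assumptions made for this example; the application defines its own confidence ranges.

```python
def review_queue(documents, low_confidence=0.5):
    """Names of autoclassified documents, least confident first."""
    candidates = []
    for doc in documents:
        if doc["status"] != "Autoclassified" or not doc["labels"]:
            continue
        # the weakest label decides how urgently a document needs review
        weakest = min(confidence for _, confidence in doc["labels"])
        if weakest < low_confidence:
            candidates.append((weakest, doc["name"]))
    return [name for _, name in sorted(candidates)]
```

Reviewing documents in this order spends manual effort where the model is least certain, which is the core of the Active Learning approach.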

To the right of the search input field, the classification set on which you want to work can be chosen. If you have created several classification sets, you can quickly switch between them.

10.2.2. Confirm or reject automatically assigned labels

The labels that have been assigned to each document are shown below the title information of each document. Manually assigned labels are displayed in blue; automatically assigned labels are displayed in red (low confidence) or green (high confidence).

Automatically assigned labels have a button to confirm and a button to delete the label. When you confirm an automatically assigned label, it changes its color and is considered in the next training session to improve the model.

As soon as you confirm, delete or add labels, the model is considered OUTDATED. This means that since the last training session, new data has been collected to improve the model and re-training is necessary.


10.2.3. Execute actions on several selected documents

Similar to the conventional search interface, there are several document-centered actions for classification. In general, actions either refer to

  • exactly one document,

  • a selection of documents

  • all documents of the project or

  • all documents corresponding to the current search restrictions.

For any of these actions, there is a small button with a distinctive icon under the document title. Use this button to apply the action exactly to the corresponding document.

The same icons are displayed on larger buttons below the search bar ("Label document(s)", "Classify document(s)", "Export classifications"). Clicking one of these buttons applies the action to all documents whose checkbox to the left of the title is marked. All documents on the current search result page can be selected by clicking the uppermost checkbox on the page.

If no documents are selected, the action is applied to all documents that correspond to the current search restrictions. Since the result set can be very large, a window opens for approving the current selection before the corresponding process starts in the background.

10.2.4. Manually label documents

In addition to confirming or rejecting automatically assigned labels, categories can be assigned manually. The button attached to each document serves this purpose. The button opens a window in which you can select the desired label(s). You can also manually label several documents at the same time by using the checkboxes to the left of the document titles in conjunction with the uppermost button.

When manually assigning labels, a window opens with labeling information:

  • "Not selected": This label has not been assigned to any of the selected documents.

  • "Partially selected": This label has already been assigned for some (not all) selected documents (gray stripes).

  • "Completely selected": All selected documents already have this label (grey).

When assigning a label manually, automatically assigned labels of the same type are automatically overwritten, if existing.

As an example, if you select 100 documents to assign label A and 10 of them already have an automatically assigned label A, the status of these 10 documents is switched to "Approved". An automatically assigned label B is not replaced by this procedure (except in a single-label classification scenario, where only one label is allowed).
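
The overwrite rule from this example can be written down as a small Python sketch. The data model is hypothetical: each document is represented as a mapping from label to a status string, and the status names "Automatic" and "Manual" are placeholders chosen for this illustration (only "Approved" is named in this manual).

```python
def assign_label(doc_labels, label, single_label=False):
    """Return the labels of a document after manually assigning `label`."""
    # In a single-label scenario only the newly assigned label remains.
    updated = {} if single_label else dict(doc_labels)
    previous = doc_labels.get(label)
    if previous in ("Automatic", "Approved"):
        updated[label] = "Approved"  # automatic label of the same type is confirmed
    else:
        updated[label] = "Manual"
    return updated
```

Assigning label A to a document that carries automatic labels A and B turns A into "Approved" and leaves B untouched, unless single_label is set.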

10.2.5. Classify documents automatically

The same selection mechanism as for manual labeling also applies to automatic classification (single documents, a selection of documents or the current search result set). The button "Classify document(s)" with the icon automatically classifies documents that are not manually categorized.

As a result, automatically assigned category labels are displayed in red (low confidence) or green (high confidence). The corresponding facet filters on the left (Label, Confidence and Status) are updated when the page is refreshed.

When documents are automatically classified, all unconfirmed automatically assigned classes from previous runs are deleted for these documents.

10.2.6. Export labels

The assignment of (confirmed or manual) labels can be exported from the interface to a CSV file (button "Export classifications"). The format has the same structure as the input format that is allowed for importing training material.
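
For reference, a file in this semicolon-separated structure can also be produced outside the application, for example to prepare training material from another system. The sketch below uses Python's standard csv module; the function name is hypothetical, and unlike the sample file in the import section it does not pad rows with empty trailing columns.

```python
import csv
import io

def export_rows(assignments):
    """Serialize {document_name: [labels]} as semicolon-separated CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=";",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only values that contain ';' or '"'
        lineterminator="\n",
    )
    for name, labels in assignments.items():
        writer.writerow([name, *labels])
    return buffer.getvalue()
```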

10.2.7. Training and classifying directly from the search page

With the button on the top right of the page, a new model can be trained based on all previously manually classified or confirmed documents. Similarly, the button on the top right is used to classify all unclassified documents based on the current model.

10.3. Classification Web Service

This section describes the possible integration of the classification component in existing third-party systems. An interface is offered via a RESTful/XML service, which is completely integrated in the Swagger framework. For the formal specification please refer to the official documentation.

10.3.1. Web Service

The Web service accepts requests at the following URL:


The information on HOST and PORT depends on the specific installation and can be obtained from the system administrator.

  • {projectName} is the selected name of the created project in the application.

  • {classificationSetName} is the selected name of the created classification configuration in the application.

  • {Importer} is the importer type to process different input document types and can be one of:

    • CAS Importer

    • Solr XML Importer

    • Text Importer

Additional importers can be included for specific applications. The access to the service URL is not authenticated.

The first time the Web service is called after a restart, the requested classification model is loaded from the classification configuration into working memory so that subsequent service requests can be answered as quickly as possible. Therefore, with a newly started system or a new classification configuration, a first request should be made to warm up the web service, e.g. with a defined test data set. In addition to an automatic query by an integrating external system, the Swagger test page or a query via curl can also be used (see below).
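
An integrating system has to URL-encode the importer name, since it contains spaces. The following Python sketch assembles a request URL; the path is taken from the curl example in section 10.3.2, while the function name and the host, port, project and classification set values in the usage line are placeholders.

```python
from urllib.parse import quote

def classify_url(host, port, project, classification_set, importer):
    """Build the classifyDocument URL with percent-encoded path and query parts."""
    path = (
        "/information-discovery/rest/classification"
        "/projects/{}/classificationSets/{}/classifyDocument"
    ).format(quote(project, safe=""), quote(classification_set, safe=""))
    return "https://{}:{}{}?type={}".format(host, port, path, quote(importer, safe=""))

# placeholder values; HOST and PORT depend on the installation
print(classify_url("localhost", 8443, "demo", "topics", "Solr XML Importer"))
```

Sending one request with a small test document to this URL after a restart is enough to trigger the model loading described above.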

10.3.2. Test page and simple query via Curl

Developers can test the functionality of the service and get an overview on the following page. In particular, sample requests can easily be generated and return values verified.




Figure 78: Swagger test page


Curl is a command line program for transferring information in computer networks. There are versions for Windows and Linux systems, among others. The following simple call receives classification results for two documents in Solr format:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/xml' -d '<?xml version="1.0" encoding="UTF-8"?>
<update>
	<add>
		<doc>
			<field name="document_name">doc1</field>
			<field name="title">Machine learning for automatic text classification</field>
			<field name="content">Machine learning is a subset of artificial intelligence in
				          the field of computer science that often uses statistical techniques
				          to give computers the ability to learn...</field>
		</doc>
		<doc>
			<field name="document_name">doc2</field>
			<field name="title">Document classification made easy</field>
			<field name="content">Document classification or document categorization is a
			             problem in library science, information science and computer science.
			             The task is to assign a document to one or more classes or
			             categories...</field>
		</doc>
	</add>
</update>' \
'https://HOST:PORT/information-discovery/rest/classification/projects/{project}/classificationSets/{classificationSet}/classifyDocument?type=Solr%20XML%20Importer'
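
The same request can be prepared programmatically. The Python sketch below builds the Solr XML body with the standard library, so that special characters in titles or contents are escaped correctly, and wraps it in a POST request with the headers used in the curl call. The function names are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

def solr_payload(documents):
    """Serialize a list of {field_name: value} dicts as Solr <update> XML."""
    update = ET.Element("update")
    add = ET.SubElement(update, "add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value  # escaped automatically on serialization
    body = ET.tostring(update, encoding="unicode")
    return ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

def build_request(url, documents):
    """Create (but do not send) the POST request for the classification service."""
    return urllib.request.Request(
        url,
        data=solr_payload(documents),
        method="POST",
        headers={"Content-Type": "text/plain", "Accept": "application/xml"},
    )
```

The prepared request can then be sent with urllib.request.urlopen(request) once the URL points to a running installation.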

10.3.3. Result Format (XML)

The response of the web service is returned in XML format and contains the automatic classifications for the input data set. The output for each data record contains the identifier (document_name) and one or more categories with corresponding confidence values. In the example, both documents were classified successfully, which is indicated by the attribute success=true:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
	<classifications>
		<classification documentIdentifier="doc1" success="true">
			<labels>
				<label confidence="0.98">Artificial Intelligence</label>
				<label confidence="0.89">Text Mining</label>
			</labels>
		</classification>
		<classification documentIdentifier="doc2" success="true">
			<labels>
				<label confidence="0.98">Information Science</label>
			</labels>
		</classification>
	</classifications>
</response>
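
A client can read such a response with any XML parser. As a minimal sketch in Python (the function name is hypothetical; the element and attribute names are those of the example response):

```python
import xml.etree.ElementTree as ET

def parse_labels(xml_data):
    """Map each documentIdentifier to a list of (label, confidence) pairs."""
    root = ET.fromstring(xml_data)
    result = {}
    for classification in root.iter("classification"):
        labels = [
            (label.text, float(label.get("confidence")))
            for label in classification.iter("label")
        ]
        result[classification.get("documentIdentifier")] = labels
    return result
```

A document for which no category was selected simply yields an empty list.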

If no category is assigned to a document due to selection criteria in the classification configuration (e.g. thresholds), the classification for the document also appears with success=true, but with an empty list of categories in the returned message:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc3" success="true">
            <labels/>
        </classification>
    </classifications>
</response>

If fields that are set active in the classification configuration are missing, corresponding error messages are added to the document classification. If the classification could still be carried out, this is indicated by success=true and the assigned categories are displayed:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc4" success="true">
            <labels>
                <label confidence="0.98">Artificial Intelligence</label>
                <label confidence="0.89">Text Mining</label>
            </labels>
            <errors>
                <error>Document has no title.</error>
            </errors>
        </classification>
    </classifications>
</response>

Multiple error messages for a document are listed separately:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc5" success="true">
            <labels>
                <label confidence="0.98">Artificial Intelligence</label>
            </labels>
            <errors>
                <error>Document has no title.</error>
                <error>Document has no content.</error>
                <error>Error on ...</error>
            </errors>
        </classification>
    </classifications>
</response>

If no classification can be performed due to an error, this is indicated by success=false and the output list of assigned categories is empty. A corresponding error message is added to the message returned:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <classifications>
        <classification documentIdentifier="doc6" success="false">
            <labels/>
            <errors>
                <error>Document has no classifiable content.</error>
            </errors>
        </classification>
    </classifications>
</response>

A document without the document_name input field cannot be classified because a unique document identifier is required. Since no assignment to an individual document can be made without this document identifier, the corresponding error message appears at the upper level. Other documents are not affected, so the other classifications will return normally:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <errors>
        <error>1 document(s) without identifier.</error>
    </errors>
    <classifications>
        <classification documentIdentifier="doc1" success="true">
            <labels>
                <label confidence="0.98">Artificial Intelligence</label>
                <label confidence="0.89">Text Mining</label>
            </labels>
        </classification>
        <classification documentIdentifier="doc2" success="true">
            <labels>
                <label confidence="0.98">Information Science</label>
            </labels>
        </classification>
    </classifications>
</response>

If a global error prevents classification of the documents, an error message is returned for the entire input, for example, the message that no classification characteristics could be extracted:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response>
    <errors>
        <error>Feature extraction failed.</error>
    </errors>
</response>
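
The error cases above differ only in where the errors element appears: directly below response for global errors, or inside a classification for document-specific errors. A client can separate the two levels as in the following Python sketch; the function name and the layout of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

def summarize_response(xml_data):
    """Collect global errors and the per-document outcome of a response."""
    root = ET.fromstring(xml_data)
    summary = {
        # errors directly below <response> concern the whole input
        "global_errors": [e.text for e in root.findall("./errors/error")],
        "documents": {},
    }
    for c in root.findall("./classifications/classification"):
        summary["documents"][c.get("documentIdentifier")] = {
            "success": c.get("success") == "true",
            "labels": [label.text for label in c.findall("./labels/label")],
            "errors": [e.text for e in c.findall("./errors/error")],
        }
    return summary
```

Note that success=true does not imply an empty error list, so a client that wants to report missing fields should inspect the errors of successful classifications as well.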


