

1. Apache UIMA Ruta Tutorial


2. The Goal

After completing this tutorial, the reader should:

  • be able to set up a ready-to-use environment and develop their own rules in the UIMA Ruta Workbench
  • have a clear understanding of Apache UIMA Ruta as a language, its basic syntax and functionality
  • be able to write basic Apache UIMA Ruta annotators
  • be able to create simple PEAR projects and use them in Averbis Products (Information Discovery / Health Discovery).

3. Overview

3.1. What is Apache UIMA Ruta?

Apache UIMA Ruta™ is a rule-based script language supported by Eclipse-based tooling. The language is designed to enable rapid development of text processing applications within Apache UIMA™. A special focus lies on the intuitive and flexible domain specific language for defining patterns of annotations. Writing rules for information extraction or other text processing applications is a tedious process. The Eclipse-based tooling for UIMA Ruta, called the Apache UIMA Ruta Workbench, was created to support the user and to facilitate every step when writing UIMA Ruta rules. Both the Ruta rule language and the UIMA Ruta Workbench integrate smoothly with Apache UIMA.

3.2. Core Concepts

  • The UIMA Ruta language is an imperative rule language extended with scripting elements.
  • A UIMA Ruta rule defines a pattern of annotations with additional conditions. If this pattern applies, then the actions of the rule are performed on the matched annotations.
  • A rule is composed of a sequence of rule elements. A rule element essentially consists of four parts: a matching condition, an optional quantifier, a list of conditions and a list of actions.
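As a rough illustration of the four parts of a rule element, the following sketch annotates a few title abbreviations. The type name TitleRT and the word pattern are made up for this example and do not appear elsewhere in this tutorial.

```ruta
DECLARE TitleRT;

// Rule element 1: matching condition W, no quantifier,
//                 one condition (REGEXP) and one action (create a TitleRT annotation).
// Rule element 2: matching condition PERIOD with the optional quantifier "?",
//                 no conditions and no actions.
W{REGEXP("Dr|Prof|Mr|Ms") -> TitleRT} PERIOD?;
```

Inside the curly brackets, everything before the arrow is a condition and everything after it is an action.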


  • Writing rules manually is a tedious and error-prone process. The UIMA Ruta Workbench was developed to facilitate this process by providing as much tooling support as possible.
  • The workbench allows, for example, syntax checking and auto completion, which make the development less error-prone.
  • The user can annotate documents and use these documents as unit tests for test-driven development or quality maintenance.
  • Sometimes, it is necessary to debug the rules because they do not match as expected. In this case, the explanation perspective provides views that explain every detail of the matching process.

4. Getting Started

Looking for a quick and smooth start with Apache UIMA Ruta? In this step-by-step tutorial, we set up the UIMA Ruta development environment, explain the syntax in a learning-by-doing approach and apply this knowledge to develop your own text analysis component.

4.1.1. Prerequisites:

4.2. Apache UIMA Ruta Workbench

4.2.1. Installation

As a prerequisite, the UIMA Ruta Workbench must first be installed in Eclipse. Please follow the Setting Up a Development Environment for UIMA Text Analysis Components guide.

4.2.2. UIMA Ruta Workbench Overview

The UIMA Ruta Workbench provides two main perspectives.

  1. The UIMA Ruta perspective, which provides the main functionality for working on UIMA Ruta projects.

  2. The Explain perspective, which provides functionality primarily used to explain how a set of rules is executed on input documents.

The following image shows the UIMA Ruta perspective.

4.2.3. Creating a new UIMA Ruta project

To create a new UIMA Ruta project in Eclipse, click File → New → Other... → UIMA Ruta → UIMA Ruta Project. This opens the corresponding wizard.

Enter a project name and click Finish. When asked whether to open the UIMA Ruta perspective, click "Yes". This creates everything you need to start.

4.2.4. Project Explorer

UIMA Ruta projects used within the UIMA Ruta Workbench need to have a certain folder structure. This structure is automatically set up when creating a project using the UIMA Ruta create project wizard.

A newly created project will have the following structure:

where:

Folder      Description
script      Source folder for UIMA Ruta scripts and packages.
descriptor  Build folder for UIMA components. Analysis engines and type systems are created automatically from the related script files.
input       Folder that contains the files that will be processed when launching a UIMA Ruta script. Such input files could be plain text, HTML or xmiCAS files.
output      Folder that contains the resulting xmiCAS files. One xmiCAS file is generated for each associated document in the input folder.
resources   Default folder for word lists, dictionaries and tables.
test        Folder for test-driven development.

4.2.5. Start writing the rules

To start writing rules, a new script file has to be created. It is good practice to first define a folder structure for the scripts. To do this, right click on the "script" folder, select "New" → "Folder", define the desired structure and click "Finish".



The actual script can now be created in the predefined folder structure.

Right click on the "helloworld" folder, select "New" → "UIMA Ruta File", name the script file and click "Finish".


By completing this step, the environment is set up and ready for development with the UIMA Ruta language.

4.2.6. Running the script and output

We can now start writing the rules in the newly created script.

For simplicity's sake, let's assume we want to annotate the capitalized words in the "Hello World!" text span as "HelloWorld" annotations:

  1. we declare the output annotation "HelloWorld"
  2. we annotate any capitalized word (denoted in Ruta as CW) as "HelloWorld"
  3. we save the changes with Ctrl+S or "File" → "Save"
PACKAGE uima.ruta.helloworld;

DECLARE HelloWorld;
CW{-> HelloWorld};

Before executing the script, we have to define the input text. Right click on the "input" folder: "New" → "File", type the file name and click "Finish". In the newly created file type a sample text "Hello World!" and save it by pressing Ctrl+S or "File" → "Save".


Run the script by pressing Ctrl+F11 or "Run" → "Run". A successful execution yields an .xmi file in the "output" folder:

Open the output .xmi file with UIMA Annotation Editor (right click on "test.xmi" → Open With → UIMA Annotation Editor).

As a result, the found annotations are listed in the "Annotation Browser View" which can be found in the right-hand section of the editor. 

Please be sure to have the UIMA Ruta perspective on.




This step completes the first goal of this tutorial. The next section tackles the UIMA Ruta syntax through a use case.

4.3. Learning by Example

Let's start exploring the Ruta environment by adding the following sample sentence to a new document in the input folder. In this example we will try to identify simple entities like name, occupation and date.

Example

Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844.

As the first step, create a new UIMA Ruta script (see above). Add the following declarations in the script. These represent our target annotation types. Type declarations always start with the keyword DECLARE followed by the short name of the new type.

DECLARE
PACKAGE uima.ruta.helloworld;

DECLARE OccupationRT; 

UIMA Ruta supports the common regular expressions as defined by the Java API. The REGEXP condition is fulfilled if the given pattern matches the text covered by the matched annotation.
A simple way to annotate the occupations in our example is to use a regular expression that matches exactly the words in question (i.e., philosopher, composer and poet).

REGEXP
W{REGEXP("philosopher|composer|poet") -> OccupationRT};

The REGEXP condition is applied to the matched annotation W, which denotes a basic word in UIMA Ruta. The curly brackets are a syntax element that encloses the conditions and/or actions applied to the matched annotation (i.e., W). The arrow symbol "->" separates conditions from actions. As a result, any word in the input example that matches the pattern "philosopher|composer|poet" is annotated as "OccupationRT".

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (OccupationRT = philosopher, OccupationRT = composer, OccupationRT = poet)

Please use the annotation editor by clicking on the resulting .xmi file in the output folder to inspect the annotations created (see above).
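If the text may also contain capitalized variants such as "Philosopher", a case-insensitive match could be useful. The sketch below assumes that the REGEXP condition accepts an optional second boolean argument that switches off case sensitivity, as we understand it from the Apache UIMA Ruta reference. Please verify this against the Ruta version you use. The rule reuses the OccupationRT type declared earlier in the script and would replace the rule above.

```ruta
// Assumption: the second argument "true" makes the pattern case-insensitive.
W{REGEXP("philosopher|composer|poet", true) -> OccupationRT};
```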

W

W is one of the basic tokens which form the Ruta Seeds.

In case we want to find a sequence of occupations in our example, it can be easily accomplished by annotating this as an enumeration. Simply add the following lines to your script:

Composition of rule elements | Greedy Quantifiers
DECLARE OccupationEnum; 
(OccupationRT (COMMA | "and"))+{-PARTOF(OccupationEnum)-> MARK(OccupationEnum, 1,2)} OccupationRT;

In the example above, repeated combinations of "OccupationRT" followed by a COMMA or an "and" form the first rule element. The alternative between the comma and the conjunction is expressed by a composition: in UIMA Ruta, the disjunctive and conjunctive rule elements ("|" for OR, "&" for AND) allow the composition of a series of elements.
In this case, the rule looks for an "OccupationRT" annotation just before either a COMMA or the conjunction "and". The whole combination has to match at least once, which is indicated by the greedy quantifier "+". More on this topic can be found below, in a separate section. The negated condition -PARTOF(OccupationEnum) ensures that the match is not already part of an existing "OccupationEnum" annotation.
The enumeration ends with the last mention of "OccupationRT", which is the second rule element. The entire span covered by these elements is then annotated as "OccupationEnum". The MARK action is used to indicate which rule elements should be part of the enumeration.

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (OccupationEnum = philosopher, composer and poet)

MARK

It is good practice to use the MARK action only when the target annotation has to be built from a selection of rule elements.
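To illustrate this note: as far as we understand the language, writing only the type name as an action behaves like a MARK action without index arguments, so the two rules in the following sketch should create annotations over the same spans. The type names CapitalizedA and CapitalizedB are hypothetical and only serve to compare both forms.

```ruta
DECLARE CapitalizedA, CapitalizedB;

// Shorthand: the type name alone is used as the action.
CW{-> CapitalizedA};
// Explicit form: MARK without indexes covers the matched rule element.
CW{-> MARK(CapitalizedB)};
```

Index arguments, as in MARK(OccupationEnum, 1, 2), are only needed when the new annotation should span several rule elements.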


Let us continue with annotating the name entity by introducing a new rule element: the wildcard "#". It can be used to skip some text or annotations until the next rule element is able to match. To prepare the ground, we first annotate "FirstNameRT" and "LastNameRT" by observing the name pattern. Add the following lines to your script:

Wildcard #
DECLARE FirstNameRT, LastNameRT;
DECLARE FullNameRT;
CW{-> FirstNameRT} CW PERIOD;
CW PERIOD CW{-> LastNameRT};
(FirstNameRT # LastNameRT){-> FullNameRT};

Although it does not extrapolate well outside our example, in the rule above, we use the wildcard to skip anything between the "FirstNameRT" and the "LastNameRT". The round brackets are used to isolate rule elements and consider them as a whole in the annotation process.

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (FullNameRT = Friedrich W. Nietzsche)


UIMA Ruta rules can contain an arbitrary number of conditions and actions, which is illustrated by the next example.

Combining multiple actions and conditions
DECLARE DateRT, MonthRT, YearRT;
WORDLIST MonthsList = "ListOfMonths.txt";
 
ANY{INLIST(MonthsList) -> MARK(MonthRT), MARK(DateRT,1,3)}
    PERIOD? NUM{REGEXP(".{2,4}") -> MARK(YearRT)};

This rule consists of three rule elements. The first one matches on every token whose covered text occurs in the word list named MonthsList. The second rule element is optional and does not need to be fulfilled, which is indicated by the quantifier ?. The last rule element matches on numbers that fulfill the regular expression REGEXP(".{2,4}") and are therefore between two and four characters long. If this rule successfully matches on a text passage, then its three actions are executed: an annotation of the type MonthRT is created for the first rule element, an annotation of the type YearRT is created for the last rule element and an annotation of the type DateRT is created for the span of all three rule elements. If the word list contains the correct entries, then this rule matches on strings like Dec. 2004, July 85 or 11.2008 and creates the corresponding annotations.

"MonthsList" is a variable of the Ruta type "WORDLIST". It refers to a list of months that has to be defined beforehand in a separate file (in this case, "ListOfMonths.txt"). This simple text file should be created in the "resources" folder of the project (i.e., src/main/resource).

We create the file by right clicking on the "resources" folder and selecting "New" → "File". Type the file name "ListOfMonths.txt" and click "Finish". In the newly created file, type the list of months in the format given below and save it by pressing Ctrl+S or "File" → "Save".

ListOfMonths.txt
September
October
November

This file provides the entries of our "MonthsList" word list. UIMA Ruta then annotates any word that matches an entry of the list above as MonthRT.

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (DateRT = October 1844, MonthRT = October, YearRT = 1844)
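The rule above covers only the month and the year. If the day should be included as well, one possible extension (not part of the original script) is a second rule that combines a preceding number with the existing DateRT annotation. The type name FullDateRT and the day pattern are assumptions made for this sketch.

```ruta
DECLARE FullDateRT;

// A one- or two-digit number followed by a DateRT annotation
// is annotated as a whole, e.g. "15" + "October 1844".
(NUM{REGEXP("[0-9]{1,2}")} DateRT){-> FullDateRT};
```

This rule has to be placed after the rule that creates DateRT, because rules are applied in the order in which they appear in the script.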

UIMA Ruta rules can be used not only to create or modify annotations, but also to set features of annotations. Going back to annotating the name in our example, the following rule defines and assigns a "FullNameRT" structure by storing the given "FirstNameRT" and "LastNameRT" annotations as feature values. Replace the declaration of "FullNameRT" and its corresponding rule with the following lines:

Feature for Annotations
DECLARE FullNameRT(FirstNameRT first, LastNameRT last);
(FirstNameRT # LastNameRT){ -> CREATE(FullNameRT, "first" = FirstNameRT , "last" = LastNameRT)};

The first statement of this example is a declaration that defines a new type of annotation named FullNameRT. This annotation has two features: One feature with the name “first” of the type FirstNameRT and one feature with the name “last” of the type “LastNameRT”.


The second statement of the example, which is a simple rule, creates one annotation of the type FullNameRT for each span starting with "FirstNameRT" and ending with "LastNameRT". In addition to creating the annotation, the "CREATE" action assigns a FirstNameRT annotation, which needs to be located within the matched span, to the feature “first” and a "LastNameRT" annotation to the feature “last”. The annotations mentioned in this example need to be present in advance.

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (FullNameRT = Friedrich W. Nietzsche, FullNameRT.first = Friedrich, FullNameRT.last = Nietzsche)


To refer to annotations in a simpler way, it is possible to store them locally in a variable called a label. A label holds a reference to the corresponding annotation and can be used, for instance, to assign features. The following rule shows a simple use case:

Labels
DECLARE FullNameRT(FirstNameRT first, LastNameRT last);
(fn:FirstNameRT # ln:LastNameRT ){ -> CREATE(FullNameRT, "first" = fn, "last" = ln)};

"fn" and "ln" are used as labels (i.e., local variables) for the "FirstNameRT" and "LastNameRT" annotations respectively.

Global variables for annotations are declared like other variables and are able to store annotations across rules as illustrated by the next example:

Global variables
DECLARE PersonRT(Annotation dateOfBirth);
ANNOTATION birthDate;
# d:DateRT{-> birthDate = d};
FullNameRT{-> PersonRT, PersonRT.dateOfBirth = birthDate};

The first line declares a new type "PersonRT" with the feature "dateOfBirth". The second line defines a variable named "birthDate", which can store one annotation. A variable able to hold several annotations is defined with "ANNOTATIONLIST". The next line assigns the first occurrence of a "DateRT" annotation to the annotation variable "birthDate". The last line creates an annotation of the type "PersonRT" and assigns the value of the variable "birthDate" to the feature "dateOfBirth" of the created annotation.

Output: Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844. (PersonRT = Friedrich W. Nietzsche, PersonRT.dateOfBirth = October 1844)


In the end, the complete code block should look like:

Global variables
PACKAGE uima.ruta.helloworld;
 
DECLARE OccupationRT;
DECLARE OccupationEnum;
DECLARE FirstNameRT, LastNameRT;
DECLARE FullNameRT(FirstNameRT first, LastNameRT last);
DECLARE DateRT, MonthRT, YearRT;
  
WORDLIST MonthsList = "ListOfMonths.txt";
  
W{REGEXP("philosopher|composer|poet") -> OccupationRT};
  
(OccupationRT (COMMA | "and"))+{-PARTOF(OccupationEnum)-> MARK(OccupationEnum, 1,2)} OccupationRT;
  
CW{-> FirstNameRT} CW PERIOD;
CW PERIOD CW{-> LastNameRT};
  
(FirstNameRT # LastNameRT){ -> CREATE(FullNameRT, "first" = FirstNameRT , "last" = LastNameRT)};
  
ANY{INLIST(MonthsList) -> MARK(MonthRT), MARK(DateRT,1,3)}
    PERIOD? NUM{REGEXP(".{2,4}") -> MARK(YearRT)};


For the complete list of actions and conditions, please visit the Apache UIMA Ruta Documentation.


There are two ways to integrate your UIMA Ruta annotators in Averbis Products:

  • apply script to Ruta engine
  • build PEARs


4.4. Define your pipeline with the generic UIMA Ruta annotator


Step 1: Navigate to your Averbis product instance and open the Pipeline Configuration administration:


Step 2: Create a new pipeline (e.g., HelloWorldPipeline)


Step 3: Open the pipeline editor by clicking the "Pencil" symbol button


Step 4: Since the written rules rely on the Ruta seeds (preexisting annotations) like "CW", a prerequisite component to add is the RutaEngine. Locate the RutaEngine component in the "Modules" column of the page and add it by clicking on the arrow symbol.

The pipeline then consists only of the RutaEngine component.


Step 5: Left click on "RutaEngine" to expand the component configuration window. Paste the script developed in the previous section into the "rules" parameter window.


Step 6: Save the pipeline configuration by pressing the "Save" button:


Step 7: Go back to "Pipeline Configuration" and start the pipeline by clicking the green button.


Step 8: Request the results following the example described here.

The text analysis result from the web service will look like this:

{
  "annotationDtos": [
    {
      "begin": 0,
      "end": 22,
      "type": "FullNameRT",
      "coveredText": "Friedrich W. Nietzsche",
      "id": 373,
      "last": {
        "begin": 13,
        "end": 22,
        "type": "LastNameRT",
        "coveredText": "Nietzsche",
        "id": 341
      },
      "first": {
        "begin": 0,
        "end": 9,
        "type": "FirstNameRT",
        "coveredText": "Friedrich",
        "id": 313
      }
    }
  ]
}

Here, "Friedrich W. Nietzsche" is annotated as "FullNameRT", and its features "first" and "last" hold the "FirstNameRT" annotation "Friedrich" and the "LastNameRT" annotation "Nietzsche", as defined in the script.

4.5. Create and upload a UIMA PEAR

A PEAR (Processing Engine ARchive) file is the UIMA standard packaging format for UIMA annotators. A PEAR package allows you to integrate any UIMA Ruta annotator, like the one developed in the previous section, into a productive text analysis pipeline.

4.5.1. PEAR configuration and export

We provide a UIMA Ruta PEAR template intended to facilitate the process of building your own PEARs. In the following, we present a step-by-step guide to configure and export a customized PEAR package.


Step 1: Go to your workspace. It can be any folder on your filesystem that you would like to use for storing your PEAR. However, it is recommended to use your Eclipse workspace which can be defined at the first start of the Eclipse IDE (e.g., under Windows OS, it can be defined as "C:\src\workspace").

Step 2: Open a terminal/command line in your workspace using the shortcut of the operating system you are using:

  • Windows: Ctrl + L, type cmd, press Enter
  • Linux: Ctrl + Alt + T (navigate to your workspace in terminal)

Step 3: Generate the PEAR project folder structure by executing the following command:

mvn archetype:generate -DarchetypeGroupId=de.averbis.textanalysis -DarchetypeArtifactId=ruta-pear-archetype -DarchetypeVersion=1.4.0

The desired project structure can now be defined by setting the pom.xml properties "groupId", "artifactId" and "version". If the properties are not defined, the default structure and names are used.

Step 4:  Set the pom.xml properties to define your own project configuration and confirm it:

Define value for property 'groupId' com.example: : com.example.pear
Define value for property 'artifactId' my-annotator: : my-ruta-pear-annotator
Define value for property 'version' 1.0.0-SNAPSHOT: : 1.0.1
Define value for property 'package' com.example: : com.example.pear.ruta
Confirm properties configuration:
groupId: com.example.pear
artifactId: my-ruta-pear-annotator
version: 1.0.1
package: com.example.pear.ruta
 Y: : Y

A successful execution ends with the following summary

[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: ruta-pear-archetype:0.2.0-SNAPSHOT
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: com.example.pear
[INFO] Parameter: artifactId, Value: my-ruta-pear-annotator
[INFO] Parameter: version, Value: 1.0.1
[INFO] Parameter: package, Value: com.example.pear.ruta
[INFO] Parameter: packageInPathFormat, Value: com/example/pear/ruta
[INFO] Parameter: package, Value: com.example.pear.ruta
[INFO] Parameter: version, Value: 1.0.1
[INFO] Parameter: groupId, Value: com.example.pear
[INFO] Parameter: artifactId, Value: my-ruta-pear-annotator
[INFO] Project created from Archetype in dir: C:\src\ws-uima\ruta-pear-archetype-template\my-annotator\my-ruta-pear-annotator
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1:35 min
[INFO] Finished at: 2019-01-22T13:11:23+01:00
[INFO] ------------------------------------------------------------------------

Step 5:  Import the newly created project in Eclipse by using the import wizard: "File" → "Import" → "Maven" → "Existing Maven Projects" → "Next". Browse to the project location and click "Finish".


Step 6: Open the "Main.ruta" script and lay down your rules


A good start is the Learning by example use case. We will proceed with building an actual annotator out of the given example.


Once the logic of the annotator has been developed, it has to be built with Maven before it can be integrated into a text analysis pipeline. To do this, execute the following command in the terminal/command line (opened in the project directory):

mvn clean install

A successful build generates the PEAR package in the "target" folder of the project:

target/my-ruta-pear-annotator-1.0.1.pear

The PEAR is now ready to be integrated in a text analysis pipeline.

4.5.2. Integrating PEAR in a Text Analysis Pipeline

A text analysis pipeline represents a certain combination of annotators set up within the Averbis tools: Information Discovery or Health Discovery.

Step 1: Navigate to your Averbis product instance and open the Components administration:


Step 2: Import my-ruta-pear-annotator-1.0.1.pear as a PEAR component:


Step 3: Navigate (go back) to your Averbis product instance and open the Pipeline Configuration administration:


Step 4: Create a new pipeline (e.g., HelloWorldPipeline)


Step 5: Open the pipeline editor by clicking the "Pencil" symbol button


Step 6: Since the written rules rely on the Ruta seeds (preexisting annotations) like "CW", the first component to add is the RutaEngine. Locate the RutaEngine component in the "Modules" column of the page and add it by clicking on the arrow symbol.


Step 7: Likewise, add the PEAR component my-ruta-pear-annotator-1.0.1.pear from the "PEAR Components" section. Save the pipeline configuration.


Step 8: Go back to "Pipeline Configuration" and start the pipeline by clicking the green button.

4.5.3. Viewing the pipeline output

You can use the following command to analyze text with the pipeline using the Information/Health Discovery web service interface.

Please make sure to use the correct Averbis product name, host name, port, project name and pipeline name of your Information/Health Discovery instance.

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' -d 'I am some sample text' 'https://$HOSTNAME:$PORT/information-discovery/rest/textanalysis/projects/$YOURPROJECTNAME/pipelines/$YOURPIPELINENAME/analyseText?annotationTypes=de.averbis.tutorials.Main.HelloWorldAnnotation'

For instance, if we want to get the "FullNameRT" annotations from our input example, the command would look like:

curl -X POST --header 'Content-Type: text/plain' --header 'Accept: application/json' -d 'Friedrich W. Nietzsche was a German philosopher, composer and poet born on 15 October 1844.' 'http://localhost:8080/information-discovery/rest/textanalysis/projects/Tutorials/pipelines/HelloWorldPipeline/analyseText?annotationTypes=com.example.pear.ruta.Main.FullNameRT'

The text analysis result from the web service will look like this:

{
  "annotationDtos": [
    {
      "begin": 0,
      "end": 22,
      "type": "com.example.pear.ruta.Main.FullNameRT",
      "coveredText": "Friedrich W. Nietzsche",
      "id": 545,
      "last": {
        "begin": 13,
        "end": 22,
        "type": "com.example.pear.ruta.Main.LastNameRT",
        "coveredText": "Nietzsche",
        "id": 513
      },
      "first": {
        "begin": 0,
        "end": 9,
        "type": "com.example.pear.ruta.Main.FirstNameRT",
        "coveredText": "Friedrich",
        "id": 485
      }
    }
  ]
}

Here, "Friedrich W. Nietzsche" is annotated as "FullNameRT", and its features "first" and "last" hold the "FirstNameRT" annotation "Friedrich" and the "LastNameRT" annotation "Nietzsche", as defined in the script.


Congratulations! This was the final step, which concludes the hands-on part of the tutorial. You should now be able to create your own annotator in UIMA Ruta, package it as a PEAR and integrate it into a text analysis pipeline.

To get a better understanding of the syntax and functionality of UIMA Ruta, which will allow you to develop more advanced annotators, please follow the next part of this tutorial and/or visit the official Apache UIMA Ruta Documentation.

5. Apache UIMA Ruta Language

This chapter provides a basic description of the Apache UIMA Ruta language and its syntax.

5.1. Rule elements and their matching order

If not specified otherwise, then the UIMA Ruta rules normally start the matching process with their first rule element. The first rule element searches for possible positions for its matching condition and then will advise the next rule element to continue the matching process. For that reason, writing rules that contain a first rule element with an optional quantifier is discouraged and will result in ignoring the optional attribute of the quantifier.

The starting rule element can also be manually specified by adding @ directly in front of the matching condition. In the following example, the rule first searches for capitalized words (CW) and then checks whether there is a period in front of the matched word.

PERIOD @CW;

The choice of the starting rule element can greatly influence the performance of the rule execution.
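As a small sketch of the anchor, the following rule looks for a capitalized word that is followed by a period and another capitalized word, and starts matching at the period. The type name CandidateRT is hypothetical. Whether this choice is actually faster than starting at CW depends on the documents and should be measured.

```ruta
DECLARE CandidateRT;

// Matching starts at each PERIOD; then the capitalized word before it
// and the capitalized word after it are checked.
CW{-> CandidateRT} @PERIOD CW;
```

The idea is that a text usually contains fewer periods than capitalized words, so fewer starting positions have to be examined.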

5.2. Basic annotations and tokens

Ruta creates a list of basic token annotations (or seeds). These tokens build a hierarchy shown in the following figure.

A detailed description is given in the following table:

Annotation   Parent       Description
ALL          -            parent type of all tokens
AMP          ANY          ampersand expression
ANY          ALL          all tokens except for markup
BREAK        WS           line break
CAP          W            word only containing capitalized letters
COLON        PM           colon
COMMA        PM           comma
CW           W            capitalized word
EXCLAMATION  SENTENCEEND  exclamation mark
NBSP         SPACE        non breaking space
NUM          ANY          sequence of digits
PERIOD       SENTENCEEND  period
PM           ANY          all kinds of punctuation marks
QUESTION     SENTENCEEND  question mark
SEMICOLON    PM           semicolon
SENTENCEEND  PM           all kinds of punctuation marks that indicate the end of a sentence
SPACE        WS           spaces
SPECIAL      ANY          all other tokens and symbols
SW           W            lower case word
W            ANY          all kinds of words
WS           ANY          all kinds of white spaces

5.3. Quantifiers

Name: Star Greedy
Syntax: *
Description: Matches on any number of annotations and always evaluates to true. Note that a rule element with a Star Greedy quantifier needs to match on different annotations than the next rule element.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM* NUM SW
Match: -
No Match: a 1 2 3 b, a 2 3 b, a 3 b, a b

Name: Star Reluctant
Syntax: *?
Description: Matches on any number of annotations and always evaluates to true, but stops matching on new annotations when the next rule element matches and evaluates to true on this annotation.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM*? NUM SW
Match: a 3 b
No Match: a 1 2 3 b, a 2 3 b, a b

Name: Plus Greedy
Syntax: +
Description: Matches on at least one annotation. Note that a rule element after a rule element with a Plus Greedy quantifier matches and evaluates on different conditions.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM+ NUM SW
Match: -
No Match: a 1 2 3 b, a 2 3 b, a 3 b, a b

Name: Plus Reluctant
Syntax: +?
Description: Matches on at least one annotation in order to evaluate to true, but stops when the next rule element is able to match on this annotation.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM+? NUM SW
Match: a 2 3 b
No Match: a 1 2 3 b, a 3 b, a b

Name: Question Greedy
Syntax: ?
Description: Matches optionally on an annotation and therefore always evaluates to true.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM? NUM SW
Match: a 2 3 b
No Match: a 1 2 3 b, a 3 b, a b

Name: Question Reluctant
Syntax: ??
Description: Matches optionally on an annotation if the next rule element does not match on the same annotation, and therefore always evaluates to true.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM?? NUM SW
Match: a 3 b
No Match: a 1 2 3 b, a 2 3 b, a b

Name: Min Max Greedy
Syntax: [x,y]
Description: Matches at least x and at most y annotations of its rule element to evaluate to true.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM[1,2] NUM SW
Match: a 1 2 3 b
No Match: a 2 3 b, a 3 b, a b

Name: Min Max Reluctant
Syntax: [x,y]?
Description: Matches at least x and at most y annotations of its rule element to evaluate to true, but stops matching on additional annotations if the next rule element is able to match on this annotation.
Input: a 1 2 3 b, a 2 3 b, a 3 b, a b
Example: SW NUM[1,2]? NUM SW
Match: a 2 3 b
No Match: a 1 2 3 b, a 3 b, a b

5.4. Control Structures

5.4.1.1. BLOCK

It is sometimes easier to express functionality with control structures known by programming languages rather than to engineer all functionality only with matching rules. The UIMA Ruta language provides the BLOCK element for some of these use cases. The BLOCK element starts with the keyword BLOCK followed by its name in parentheses. This structure has two major advantages: 1.) it facilitates the debugging process by using different names for each block (thereby, code gets segmented), 2.) the name can be used to execute this block using the CALL action. Hereby, it is possible to access only specific sets of rules of other script files, or to implement a recursive call of rules.

BLOCK
DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
    Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}

In the block above, the rule in the block statement is performed for each occurrence of an annotation of the type Sentence. The rule within the block matches on the complete document, which is the current sentence in the context of the block statement. As a consequence, this example creates an annotation of the type SentenceWithNoLeadingNP for each sentence that does not start with a NP annotation.
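The same construct can be applied to the annotations from the tutorial part. The following sketch, which is not part of the original script, restricts a rule to the text covered by each FullNameRT annotation. The type name MiddleInitialRT and the block name are assumptions made for this example.

```ruta
DECLARE MiddleInitialRT;

BLOCK(MiddleInitial) FullNameRT{} {
    // Inside the block, only the text covered by the current FullNameRT is considered.
    CW{-> MiddleInitialRT} PERIOD;
}
```

For "Friedrich W. Nietzsche", the inner rule should only be able to match on "W" followed by the period, because no other capitalized word inside the name is followed by a period.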

5.4.1.2. FOREACH

The syntax of the FOREACH block is very similar to the common BLOCK construct, but the contained rules are executed differently, which can lead to other results. Here, all contained rules are applied on each matched annotation consecutively. In a BLOCK construct, each rule is applied within the window of each matched annotation.

The following example illustrates the syntax and semantics of the FOREACH block:

FOREACH
FOREACH(num, true) NUM{}{
    num{-> SpecialNum} CW;
    SW{-> T5} num{-> SpecialNum};
}

The first line specifies that the FOREACH block iterates over all annotations of the type NUM and assigns each matched annotation to a new local variable named "num". The block contains two rules. Both rules start their matching process at the rule element with the matching condition num, meaning that they match directly on the annotation matched by the head rule. While the first rule checks whether a capitalized word follows the number, the second rule checks whether there is a lowercase word before the number. Thus, this construct efficiently annotates numbers with annotations of the type "SpecialNum" depending on their surroundings.
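
The difference in windowing between the two constructs can be made concrete with a small side-by-side sketch. It is an illustration under the assumption that a BLOCK restricts matching to the text covered by the annotation matched in its head; the block name NumWindow is hypothetical.

```ruta
DECLARE SpecialNum;

// BLOCK: the rule only sees the text covered by the current NUM annotation.
BLOCK(NumWindow) NUM{} {
    // Not expected to match: the CW lies outside the window of the NUM annotation.
    Document{-> SpecialNum} CW;
}

// FOREACH: the rule sees the whole document and is anchored at the loop variable.
FOREACH(num, true) NUM{}{
    // Can match: a capitalized word that follows the number is reachable.
    num{-> SpecialNum} CW;
}
```

This is why rules that need to look at the surroundings of the matched annotation are a better fit for FOREACH, while rules that should stay inside the matched annotation are a better fit for BLOCK.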

5.4.1.3. Inlined Rules

There are two more language constructs (-> and <-) that allow rules to be applied within a certain context. These rules are attached to an arbitrary rule element and are called inlined rules. The first construct (->) interprets the inlined rules as actions: they are executed if the surrounding rule was able to match, which makes this construct very similar to the block statement.

Inlined Rule As Action
DECLARE SentenceNoLeadingNP;
Sentence->{
    Document{-STARTSWITH(NP) -> SentenceNoLeadingNP};
};

The second one (<-) interprets the inlined rules as conditions. The surrounding rule can only match if at least one inlined rule was successfully applied. In the following example, a sentence is annotated with the type "SentenceWithNPNP", if there are two successive NP annotations within this sentence.

Inlined Rule As Condition
DECLARE SentenceWithNPNP;
Sentence{-> SentenceWithNPNP}<-{
    NP NP;
};

5.5. Importing scripts and type systems

UIMA Ruta script files with many rules quickly become hard to read. The UIMA Ruta language therefore allows other script files to be imported in order to increase the modularity of a project or to create rule libraries. The next example imports the rules of another script file together with all types known to it, and then executes that script. The script file SecondaryScript.ruta, which is located in the package uima/ruta/example, is imported by the SCRIPT statement and executed on the complete document by the CALL action.

SCRIPT uima.ruta.example.SecondaryScript;
Document{-> CALL(SecondaryScript)};

The types of important annotations of the application are often defined in a separate type system. The next example shows how to import those types.

TYPESYSTEM my.package.NamedEntityTypeSystem;
Person{PARTOF(Organization) -> UNMARK(Person)};

The type system descriptor file with the name NamedEntityTypeSystem.xml located in the package my/package is imported.
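
Both kinds of import can be combined in one main script. The sketch below simply puts the two examples above together; it assumes, for illustration only, that the types Person and Organization are defined in the imported type system and that SecondaryScript is the script that creates the corresponding annotations.

```ruta
// Import statements are kept at the top of the script, before any rules.
TYPESYSTEM my.package.NamedEntityTypeSystem;
SCRIPT uima.ruta.example.SecondaryScript;

// First run the imported rules on the complete document ...
Document{-> CALL(SecondaryScript)};
// ... then post-process their results using the imported types.
Person{PARTOF(Organization) -> UNMARK(Person)};
```

Keeping the post-processing rule in the main script leaves SecondaryScript reusable as a library in other projects.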

5.6. Filtering and Visibility

UIMA Ruta allows irrelevant annotations to be filtered out by making them invisible. Similarly, annotations that are invisible by default can be retained, that is, made visible again.

Document{->ADDFILTERTYPE(CW)};
Document{->ADDRETAINTYPE(BREAK)};

In the rules above, capitalized words (CW) are filtered out, that is, made invisible to subsequent rules, and line breaks (BREAK) are retained, that is, made visible (by default, BREAK annotations are invisible).
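
Visibility changes are typically scoped: a type is made visible, the rules that need it are applied, and the default is restored afterwards. The following is a small sketch of that pattern; the type LineEndWord is hypothetical, and REMOVERETAINTYPE is used as the counterpart of ADDRETAINTYPE.

```ruta
DECLARE LineEndWord;

// Make line breaks visible so that rules can match on them.
Document{-> ADDRETAINTYPE(BREAK)};

// A word at the end of a line; spaces in between stay invisible and are skipped.
W{-> LineEndWord} BREAK;

// Restore the default visibility so that later rules ignore line breaks again.
Document{-> REMOVERETAINTYPE(BREAK)};
```

Without the first rule, the second rule would not match, because the BREAK annotations would be invisible to it.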

5.7. Engineering paradigms

5.7.1. Candidate classification

The matching condition of the following rule element is the type Paragraph, so the rule considers all Paragraph annotations. The rule matches only if all three conditions, separated by commas, are fulfilled. The first condition, CONTAINS(Bold, 90, 100, true), states that 90% to 100% of the matched Paragraph annotation must also be covered by annotations of the type Bold. The boolean parameter true indicates that the amount of Bold annotations is calculated relative to the matched annotation; the two numbers 90 and 100 are therefore interpreted as percentages. The exact calculation of the coverage depends on the tokenization of the document and is not discussed here. The second condition, CONTAINS(Underlined, 90, 100, true), likewise states that at least 90% of the paragraph must be covered by annotations of the type Underlined. The third condition, ENDSWITH(COLON), requires the Paragraph annotation to end with a colon. It is fulfilled only if there is an annotation of the type COLON whose end offset equals the end offset of the matched Paragraph annotation.

DECLARE Headline;
Paragraph{CONTAINS(Bold, 90, 100, true), 
    CONTAINS(Underlined, 90, 100, true), ENDSWITH(COLON) 
    -> MARK(Headline)};
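
For comparison, when the boolean parameter is omitted, the two numbers of CONTAINS are read as absolute counts instead of percentages. The sketch below illustrates this variant; the type NumberParagraph is hypothetical, and Paragraph is assumed to be available as in the rule above.

```ruta
DECLARE NumberParagraph;
// Matches paragraphs that contain between two and four NUM annotations.
Paragraph{CONTAINS(NUM, 2, 4) -> MARK(NumberParagraph)};
```

Absolute counts are convenient for short, well-defined candidates, while percentages scale better across paragraphs of very different lengths.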

5.7.2. Bottom-up matching

Another approach is to start by annotating the simplest entities (as helper annotations) and to work upwards to the target annotation. In this example, an uppercase letter followed by a period is annotated as "Initial". A capitalized word followed by a comma and one or more "Initial" annotations is annotated as "Name". In the last rule, a sequence of "Name" annotations separated by commas is considered a valid "Author" annotation.

DECLARE Initial, Name, Author;
(CW{REGEXP(".")} PERIOD){-> Initial};
(CW COMMA Initial+){-> Name};
(Name (COMMA Name)*){-> Author};

5.7.3. Boundary matching

Helper annotations can also be used to define the boundaries of the target annotation. In this example, "AuthorStart" and "AuthorEnd" are created first and are then used as the start and end boundaries of the enclosing annotation "Author".

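DECLARE AuthorStart, AuthorEnd, Author;
// MARKFIRST annotates the first token of the matched Reference annotation.
Reference{-> MARKFIRST(AuthorStart)};
COLON{-> AuthorEnd} CW;
(AuthorStart # AuthorEnd){-> Author};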

5.7.4. Transformation-based

5.7.4.1. UNMARK action

The following rule consists of one condition and one action. The condition -CONTAINS(W) is negated (indicated by the character -) and is therefore fulfilled only if there are no annotations of the type W within the bounds of the matched Headline annotation. The action UNMARK(Headline) removes the matched Headline annotation. Put simply, headlines that contain no words at all are not headlines.

Headline{-CONTAINS(W) -> UNMARK(Headline)};

5.7.4.2. SHIFT action

Here, the action SHIFT(Headline, 1, 2) expands the matched Headline annotation to the next colon, if that Headline annotation is followed by a COLON annotation.

Headline{-> SHIFT(Headline, 1, 2)} COLON;

5.8. Best Practices @Averbis

Practice: No literal string matching
Examples:
    "dog"{-> Animal};
    "Hund"{-> Animal};
Remark: Literal strings introduce language dependency and can become computationally expensive. Unless it is explicitly intended, literal string matching should be avoided.
Alternative example:
    W{REGEXP("dog|Hund") -> Animal};

Practice: No conditions at wildcards
Examples:
    #{-PARTOF(CW) -> NoCapitalWords};
Remark: The wildcard element goes over each annotation in the context; therefore, evaluating a condition for each element might become computationally expensive.
Alternative example:
    ANY[0,100]{-PARTOF(CW) -> NoCapitalWords};

Practice: No disjunct/conjunct rule elements
Examples:
    ((CW NUM) | (NUM CW)){-> Street};
Remark: Readability is reduced.
Alternative example:
    ANY[2,2]{IS({NUM, CW}) -> Street};

Practice: No actions in inlined rules as conditions
Examples:
    Sentence{CONTAINS(CW)} <- {SW{-> CREATE(LastWord)} PERIOD;};
Remark: The CREATE action will be performed regardless of the matching condition.
Alternative example:
    Sentence{CONTAINS(CW)} -> {SW{-> CREATE(LastWord)} PERIOD;};

Practice: Modularization of scripts
Examples: -
Remark: Try to split the script into subscripts and make use of the CALL action to keep your scripts readable.
Alternative example: -

Practice: Keep EXEC usage in scripts minimal
Examples: -
Remark: EXEC can run into class loader problems in uimaFIT (external component).
Alternative example: -

Practice: Import type systems of all used types
Examples: -
Remark: Otherwise, types cannot be resolved during processing, which causes exceptions.
Alternative example: -