
Entity Extraction

Re:infer extracts two kinds of output from unstructured text: labels and entities. Labels describe the entire verbatim, e.g. "Cancellation", "Trade Failure", or "Urgent". Entities refer to specific parts of the verbatim, e.g. "Counterparty Name", "Customer ID", or "Cancellation Date".

In a downstream process, labels are used to triage, prioritise, and decide what kind of action should be taken. Entities are used to fill in fields of requests. For example, a downstream process may filter verbatims to those that have the "Cancellation" label, and then use the extracted "Customer ID" and "Cancellation Date" entities to call an API to automatically process the cancellation.
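As a rough illustration of the filtering step described above, the sketch below keeps only the verbatims labelled "Cancellation" and collects their extracted entity values. It is a minimal sketch under stated assumptions, not Re:infer's client code: the `label_names` key is a hypothetical, pre-simplified list of label names prepared by the caller, and the entity names `customer-id` and `cancellation-date` are placeholders for whatever your dataset defines.

```python
from typing import Any


def cancellation_requests(predictions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep verbatims labelled "Cancellation" and collect their entity values."""
    requests = []
    for prediction in predictions:
        # Assumption: "label_names" is a plain list of label names that the
        # caller has already derived from the prediction's labels.
        if "Cancellation" not in prediction["label_names"]:
            continue
        # Map each entity name to every value extracted for it, so that
        # missing and repeated entities both remain visible to the caller.
        fields: dict[str, list[str]] = {}
        for entity in prediction["entities"]:
            fields.setdefault(entity["name"], []).append(entity["formatted_value"])
        requests.append({"uid": prediction["uid"], "fields": fields})
    return requests
```

The returned `fields` mapping would then supply the arguments for the API call that processes the cancellation. Values are kept as lists because an entity may be absent or matched more than once.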

Re:infer comes with a number of built-in entities for common concepts (such as Organisation, Currency Code, or Date). You can customise Re:infer's built-in entities so that they are tailored to your specific use case. For example, Re:infer has a highly trained pre-built Date entity which you can use as a starting point for a more customised entity such as Renewal Date or Cancellation Date. Alternatively, you can start from scratch and teach Re:infer to recognise something completely new.

Entity Extraction

Configuring Entities

We will use an insurance use case as our example. The insurer's mailbox receives emails from brokers that should be triaged to different teams for processing. In this example, the dataset has already been trained and the taxonomy looks like this:

Example Taxonomy

This mailbox receives Renewal, Cancellation, and Admin requests which are occasionally Urgent. Re:infer has been trained to recognise each of these concepts, and Re:infer predictions can be used to triage the emails to the correct team by creating support tickets.

To ensure that the customer receives a quick response, we can extract some key data points that will help the downstream teams process the request. Specifically, we want to extract the policy number, insured organisation name, and broker name from the email. We can use entity extraction to do that.

Configured Entities

Since the policy number format is specific to this particular insurer, we configure the entity to be trainable from scratch. On the other hand, the insured organisation is a type of organisation, so we configure it to be trainable based on the built-in Organisation entity. Finally, we notice that brokers don't always put their name into the email, so we decide to use the broker email address (available from the comment metadata) to look up the corresponding name in an internal database, rather than extracting it as an entity.

The table below summarises these approaches.

| Configuration | When to use | Examples |
| --- | --- | --- |
| Trainable entity with no base entity | Most often used for various kinds of internal IDs, or when there is no suitable base entity in Re:infer. | Policy Number, Customer ID |
| Trainable entity with base entity | Used for customising an existing pre-built entity in Re:infer. | Cancellation Date (based on Date), Insured Organisation (based on Organisation) |
| Pre-built entities (not trainable) | Used for entities that should be matched exactly as defined, where training would invite mistakes. | ISIN |
| Using comment metadata instead of entities | Used when required information is already present in structured form in the comment metadata. | Sender Address, Sender Domain |
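The comment metadata approach, which we chose for the broker name, could be sketched as follows. Everything here is hypothetical: `BROKER_NAMES` stands in for the internal database, and the sender address is assumed to have already been read from the comment metadata as a plain string.

```python
from typing import Optional

# Hypothetical stand-in for the internal broker database,
# keyed by lower-cased sender address.
BROKER_NAMES = {
    "jane.doe@broker.example": "Jane Doe",
}


def broker_name_for(sender_address: str) -> Optional[str]:
    """Return the broker name for a sender address, or None if it is unknown."""
    # Normalise case and surrounding whitespace before the lookup.
    return BROKER_NAMES.get(sender_address.strip().lower())
```

Returning `None` for unknown senders lets the application decide what to do with them, for example routing the email to manual review.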

Using Entities in your Application

Re:infer provides multiple ways of fetching predictions, including predicted entities. Please consult the data download overview to understand which method will work best for your use case.

Whichever method you choose, you need to be aware of the following edge cases and handle them in your application:

  • Not all expected entities are present in the response
  • The response contains multiple matches for one or more entities
  • Not all entities present in the response are correct

In this section, we will go through each of these edge cases in more detail.

Not all entities are present in the response

You should expect to handle cases where not all expected entities are present. In the example below, the email has the policy number, but doesn't have the insured organisation name. Your application should be able to handle such partial information.

Missing Insured Organisation
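One way to handle partial information is to check which expected entities are absent before acting on a prediction. The sketch below assumes the entity list has the shape shown in the API response on this page. The name `policy-number` comes from that response, while `insured-organisation` is an assumed name for the second entity.

```python
# "policy-number" appears in the example response on this page;
# "insured-organisation" is an assumed entity name.
REQUIRED_ENTITIES = ("policy-number", "insured-organisation")


def missing_entities(entities, required=REQUIRED_ENTITIES):
    """Return the required entity names that have no match in `entities`."""
    present = {entity["name"] for entity in entities}
    # Preserve the order of `required` so the result is stable.
    return [name for name in required if name not in present]
```

An application might process the request automatically only when this function returns an empty list, and otherwise pass the email to a person together with the values that were found.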

The response contains multiple matches for one or more entities

You should also expect to handle the opposite of the previous case, namely cases where a comment has more entities than expected. In the example below, even though we expect one policy number and one insured organisation name per email, the email has multiple policy numbers.

Multiple matches for the same entity

Note that you can use the metadata in the response when handling such cases. For example, each entity's span records whether the match was found in the subject or the body (the content_part field), so we can choose to prefer policy numbers that appear in the email subject over those that appear in the email body. The example below shows the response that the API will return for our example email.

{
  "predictions": [
    {
      "uid": "aa05ba2250de48e3.7588b85f68f81c3b",
      "labels": [...],
      "entities": [
        {
          "id": "6a1d11118b60868e",
          "name": "policy-number",
          "span": {
            "content_part": "body",
            "message_index": 0,
            "utf16_byte_start": 200,
            "utf16_byte_end": 222,
            "char_start": 100,
            "char_end": 111
          },
          "kind": "policy-number",
          "formatted_value": "GHI-0204963"
        },
        {
          "id": "6a1d11118b60868e",
          "name": "policy-number",
          "span": {
            "content_part": "subject",
            "message_index": 0,
            "utf16_byte_start": 0,
            "utf16_byte_end": 22,
            "char_start": 0,
            "char_end": 11
          },
          "kind": "policy-number",
          "formatted_value": "GHI-0068448"
        },
        {...},
        {...},
        {...}
      ]
    }
  ],
  "model": {
    "version": 31,
    "time": "2021-07-14T15:00:57.608000Z"
  },
  "status": "ok"
}
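The subject-over-body preference mentioned above could be implemented along these lines. This is a minimal sketch that relies only on fields visible in the response (`name`, `span.content_part`, `span.char_start`, and `formatted_value`). The function name and the tie-breaking rule (earliest match within the same content part) are our own assumptions.

```python
def pick_policy_number(entities):
    """Choose one policy number, preferring subject matches over body matches."""
    candidates = [e for e in entities if e.get("name") == "policy-number"]
    if not candidates:
        return None
    # Lower rank wins; any other content part sorts after subject and body.
    rank = {"subject": 0, "body": 1}
    best = min(
        candidates,
        key=lambda e: (
            rank.get(e["span"]["content_part"], 2),
            e["span"]["char_start"],  # assumed tie-break: earliest match first
        ),
    )
    return best["formatted_value"]
```

Applied to the two policy-number entities shown in full above, this returns "GHI-0068448", the match found in the subject. For email threads, you may also want to take `message_index` into account.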

Not all entities present in the response are correct

Finally, since entities are extracted using machine learning, you should expect to receive some incorrect matches. How many will depend on the entity you are using. The Validation page of your dataset provides validation statistics that help you understand how well each entity is likely to perform.

Entity Validation
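Beyond consulting the validation statistics, an application can add its own sanity checks before acting on an extracted value. The sketch below filters policy numbers by format. The pattern (three capital letters, a hyphen, seven digits) is only an assumption inferred from the sample values on this page, such as "GHI-0204963", so replace it with your insurer's actual format.

```python
import re

# Assumption: format inferred from sample values such as "GHI-0204963".
POLICY_NUMBER_PATTERN = re.compile(r"[A-Z]{3}-[0-9]{7}")


def plausible_policy_numbers(entities):
    """Return policy-number values whose text matches the expected format."""
    return [
        entity["formatted_value"]
        for entity in entities
        if entity["name"] == "policy-number"
        and POLICY_NUMBER_PATTERN.fullmatch(entity["formatted_value"])
    ]
```

A format check only catches matches that are obviously malformed. Looking the value up in the system of record is a stronger safeguard against a well-formed but wrong policy number.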