
Labels and Entities

This page describes how to interpret labels and entities downloaded from the Re:infer platform for use in your application. It covers the labels and entities themselves - to find where they appear in the downloaded data, check the documentation for your chosen download method.

Labels

A comment can have zero, one, or multiple predicted labels. The example below shows two predicted labels (Order and Order > Missing) together with their confidence scores. This format is used by most API routes. An exception is the Dataset Export route which formats label names as strings instead of lists (to be consistent with the CSV export in the browser).

Some routes (currently Predict routes) will optionally return a list of threshold names ("high_recall", "balanced", "high_precision") that the label confidence score meets. This is a useful alternative to hand-picking thresholds, especially for very large taxonomies. In your application, you decide whether you are interested in "high_recall", "balanced", or "high_precision" results, then discard all labels which lack your chosen auto-threshold, and process the remaining labels as before.

{
  "labels": [
    {
      "name": ["Order"],
      "probability": 0.6598735451698303
    },
    {
      "name": ["Order", "Missing"],
      "probability": 0.6598735451698303
    }
  ]
}
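Since most API routes return label names as lists while the Dataset Export route flattens them to strings, it can help to convert between the two. A minimal Python sketch (the helper names are illustrative, not part of any Re:infer client library):

```python
# Most API routes return label names as lists of hierarchy levels;
# the Dataset Export route (and CSV export) flattens them into a
# single " > "-separated string.
def label_name_to_string(name_parts):
    """Convert ["Order", "Missing"] to "Order > Missing"."""
    return " > ".join(name_parts)

def label_name_to_parts(name):
    """Convert "Order > Missing" back to ["Order", "Missing"]."""
    return name.split(" > ")
```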

The Label object has the following format:

| Name | Type | Description |
|------|------|-------------|
| name | array&lt;string&gt; or string | All API routes except Dataset Export: the name of the predicted label, formatted as a list of hierarchical labels. For instance, the label Parent Label > Child Label will have the format ["Parent Label", "Child Label"]. Dataset Export API route: the name of the predicted label, formatted as a string with " > " separating hierarchical labels. |
| probability | number | Confidence score. A number between 0.0 and 1.0. |
| sentiment | number | Sentiment score. A number between -1.0 and 1.0. Only returned if sentiments are enabled in the dataset. |
| auto_thresholds | array&lt;string&gt; | A list of automatically computed thresholds that the label confidence score meets. The thresholds are returned as descriptive names (rather than values between 0.0 and 1.0) that can be used to easily filter out labels that don't meet your desired confidence levels. The threshold names "high_recall", "balanced" and "high_precision" correspond to three increasing confidence levels. Additional "sampled_0" ... "sampled_5" thresholds provide a more advanced way of performing aggregations for data-science applications, and can be ignored if you're processing comments one by one. |
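The auto_thresholds field makes it easy to keep only sufficiently confident predictions without hand-picking numeric thresholds. A minimal Python sketch (the helper name and sample label objects are illustrative, not part of the API):

```python
# Keep only labels whose auto_thresholds include the confidence level
# your application cares about ("high_recall", "balanced", or
# "high_precision"). Label dicts mirror the Label object described above.
def filter_by_auto_threshold(labels, level="high_precision"):
    return [
        label for label in labels
        if level in label.get("auto_thresholds", [])
    ]

# Illustrative data - real values come from the API response.
labels = [
    {"name": ["Order"], "probability": 0.92,
     "auto_thresholds": ["high_recall", "balanced", "high_precision"]},
    {"name": ["Order", "Missing"], "probability": 0.41,
     "auto_thresholds": ["high_recall"]},
]
confident = filter_by_auto_threshold(labels, "high_precision")
```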

Labels FAQ

Q: How can I download labels from the Re:infer platform?

A: The following download methods provide labels: Re:infer API, CSV downloads, and Re:infer command-line tool. Please take a look at the Downloading Data page for an overview of the available download methods, and the FAQ item below for a detailed comparison.

Q: Do all download methods provide the same information?

A: The tables below explain the differences between the download methods. A description of labels in the Explore page in the Re:infer web UI is provided for comparison.

Non-deterministic methods

The Explore page, CSV download, Re:infer command-line tool, and the Export API endpoint provide the latest available predictions. Note that after a new model version has been trained, but before all predictions have been recalculated, you will see a mix of predictions from the latest and the previous model versions. These methods are aware of assigned labels and will show them as assigned or with a confidence score of 1.

| Method | Assigned Labels | Predicted Labels |
|--------|-----------------|------------------|
| Explore Page | The Explore page visually differentiates assigned labels from predicted labels. It does not report confidence scores for assigned labels. | The Explore page is designed to support the model training workflow, so it shows selected predicted labels that the user may want to pin. It will preferentially show labels that meet a balanced threshold (derived from the F-score for that label), but may also show labels with lower probability as a suggestion, if the user is likely to want to pin them. |
| Export API | Returns assigned labels. | Returns all predicted labels (no threshold is applied). |
| CSV Download | Returns a confidence score of 1 for assigned labels. Note that predicted labels may also have a score of 1 if the model is very confident. | Returns all predicted labels (no threshold is applied). |
| Re:infer CLI | If a comment has assigned labels, returns both assigned and predicted labels for that comment. | Returns all predicted labels (no threshold is applied). |
Deterministic methods

In contrast to the non-deterministic methods above, Stream API and Predict API routes will return predictions from a specific model version. As such, these API routes behave as if you downloaded a comment from the platform and then sent it for prediction against a specific model version, and are not aware of assigned labels.

| Method | Assigned Labels | Predicted Labels |
|--------|-----------------|------------------|
| Stream API and Predict API | Not aware of assigned labels. | Return predicted labels with a confidence score above the provided label thresholds (or above the default value of 0.25 if no thresholds are provided). |

Using Labels in Automation

When designing an application that makes decisions on a per-verbatim basis, you will want to convert the confidence score of each label into a Yes-or-No answer. You can do that by determining the minimum confidence score at which you will treat the prediction as saying "yes, the label applies". We call this number the confidence score threshold.
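The conversion described above can be sketched in a few lines of Python. The threshold values and helper name here are placeholders for illustration; real thresholds should be chosen as described in the next section:

```python
# Illustrative per-label thresholds only - pick real values using
# validation data, as described in the Confidence Thresholds section
# of the integration guide.
THRESHOLDS = {
    "Order": 0.35,
    "Order > Missing": 0.55,
}
DEFAULT_THRESHOLD = 0.5  # fallback for labels without a tuned threshold

def label_applies(name_parts, probability):
    """Turn a label's confidence score into a Yes-or-No answer."""
    name = " > ".join(name_parts)
    return probability >= THRESHOLDS.get(name, DEFAULT_THRESHOLD)
```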

How to pick a confidence score threshold

A common misconception is to pick the threshold equal to the precision you'd like to achieve ("I want the labels to be correct at least 70% of the time, so I will pick labels with confidence scores above 0.70"). To understand thresholds and how to pick them, please check the Confidence Thresholds section of the integration guide.

Using Labels in Analytics

If you are exporting labels for use in an analytics application, it's important to decide whether to expose confidence scores to users. For users of business analytics applications, you should convert the confidence scores into presence or absence of the label using one of the approaches described in the Automation section. On the other hand, users of data science applications proficient in working with probabilistic data will benefit from access to raw confidence scores.

An important consideration is to make sure that all predictions in your analytics application are from the same model version. If you are upgrading your integration to fetch predictions from a new model version, all predictions will need to be reingested for the data to stay consistent.

Label Properties

If Quality-of-Service labels have been added to the dataset, the prediction response will contain a Quality-of-Service score for each comment. If Tone has been enabled on a dataset, the prediction response will contain a Tone score for each comment. Both scores can be found in the label_properties part of the response.

{
  "label_properties": [
    {
      "property_id": "0000000000000001",
      "property_name": "tone",
      "value": -1.8130283355712891
    },
    {
      "property_id": "0000000000000002",
      "property_name": "quality_of_service",
      "value": -3.006324252113699913
    }
  ]
}

The label property object has the following format:

| Name | Type | Description |
|------|------|-------------|
| property_name | string | Name of the label property. |
| property_id | string | Internal ID of the label property. |
| value | number | Value of the label property. A number between -10 and 10. |

Entities

A comment can have zero, one, or multiple predicted entities. The example below shows one predicted order_number entity. Note that unlike labels, entities do not have associated confidence scores.

"entities": [
  {
    "id": "0abe5b728ee17811",
    "name": "order_number",
    "span": {
      "content_part": "body",
      "message_index": 0,
      "utf16_byte_start": 58,
      "utf16_byte_end": 76,
      "char_start": 29,
      "char_end": 38
    },
    "kind": "order_number", # deprecated
    "formatted_value": "ABC-123456",
    "capture_ids": []
  }
]

The API returns entities in the following format:

| Name | Type | Description |
|------|------|-------------|
| id | string | Entity ID. |
| name | string | Entity name. |
| kind | string | (Deprecated) Entity kind. |
| formatted_value | string | Entity value. |
| span | Span | An object containing the location of the entity in the comment. |
| capture_ids | array&lt;int&gt; | The capture IDs of the groups to which an entity belongs. |

Formatting

Each entity has a span and a formatted_value. The span represents the boundaries of the entity in the corresponding comment. The formatted_value typically corresponds to the text covered by that span, except in some specific instances that we describe below.
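Using the span's character offsets, you can recover the raw text the entity was matched on. A minimal Python sketch (the comment body, offsets, and helper name here are made up for illustration; real values come from the API response):

```python
# Slice the comment text using the span's character offsets.
# char_start is inclusive and char_end is exclusive in this sketch.
def entity_text(comment_body, span):
    return comment_body[span["char_start"]:span["char_end"]]

# Illustrative comment body and span (not from a real response).
body = "Hello, please check on order ABC-123456 for me."
span = {"content_part": "body", "message_index": 0,
        "char_start": 29, "char_end": 39}
```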

Monetary Quantity

The Monetary Quantity entity will extract a wide variety of monetary amounts and apply a common formatting. For example, "1M USD", "USD 1000000", and "1,000,000 usd" will all be extracted as 1,000,000.00 USD. Since the extracted value is formatted in a consistent way, you can easily get the currency and the amount by splitting on whitespace.

However, if the currency is ambiguous, the extracted value will retain the ambiguous currency. For example, "$1M" and "$1,000,000" will be extracted as $1,000,000.00 rather than 1,000,000.00 USD, since a "$" sign could refer to a Canadian or Australian dollar as well as a US dollar.
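Because the formatted value is normalised, splitting on whitespace is enough to separate the amount from the currency when the currency is unambiguous. A sketch, with a hypothetical helper name:

```python
# Split a normalised Monetary Quantity value into (amount, currency).
# Unambiguous values look like "1,000,000.00 USD"; ambiguous ones keep
# the currency symbol attached, e.g. "$1,000,000.00".
def parse_monetary(formatted_value):
    parts = formatted_value.split()
    if len(parts) == 2:  # e.g. "1,000,000.00 USD"
        amount, currency = parts
        return float(amount.replace(",", "")), currency
    return None  # ambiguous currency - handle separately
```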

Date

The Date entity will extract any date appearing in a comment and normalise it using the standard ISO 8601 format, followed by the time in UTC. For instance, "Jan 25 2020", "25/01/2020" and "now" in an email sent on January 25 2020 will all be extracted as "2020-01-25 00:00 UTC".

This formatting will be applied to any entity that has a type corresponding to a date, such as cancellation dates, value dates, or any type of date that has been trained by the user.

If some parts of the date are missing, the timestamp of the comment will be used as an anchor; the date "at 4PM on the fifth of the month" in a message sent on May 1, 2020 will be extracted as "2020-05-05 16:00 UTC". If no timezone is provided, then the timezone of the comment is used, but the extracted date will always be returned in the UTC timezone.

Country

Country names are normalised to a common value; for instance, both strings "UK" and "United Kingdom" will have the formatted value "United Kingdom".

Capture IDs

If a comment was processed as rich text, contains a table, and an entity was matched in that table, the capture_ids property of that entity will contain a capture ID. Entities matched in the same row of the table will have the same capture ID, allowing them to be grouped together.

For instance, an Order ID could be associated to an Order Date. In a comment where multiple orders are referred to, one can distinguish the different order details by grouping entities by their capture IDs.

Today, entities matched in a table will belong to exactly one group, i.e. their capture_ids property will contain exactly one ID. In the future, the API may return multiple IDs.

In all other cases, the capture_ids property will be an empty list.
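The grouping described above can be sketched in Python. The entity dicts and helper name are illustrative; real entities come from the API response:

```python
from collections import defaultdict

# Group table-matched entities by capture ID, so that entities from the
# same table row (e.g. an order number and its order date) stay together.
def group_by_capture_id(entities):
    groups = defaultdict(list)
    for entity in entities:
        for capture_id in entity.get("capture_ids", []):
            groups[capture_id].append(entity["name"])
    return dict(groups)

# Illustrative data: two orders mentioned in the same table.
entities = [
    {"name": "order_number", "capture_ids": [1]},
    {"name": "order_date", "capture_ids": [1]},
    {"name": "order_number", "capture_ids": [2]},
]
```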

Entities FAQ

Q: How can I download entities from the Re:infer platform?

A: The following download methods provide entities: Re:infer API and Re:infer command-line tool. Please take a look at the Downloading Data overview to understand which method is suitable for your use-case. Note that CSV downloads will not include entities.

Models

Staging and Live Tags

For ease of use with integrations, a model version can be tagged as staging or live in the Re:infer UI. This tag can be provided to Predict API requests in place of the model version number. This allows your integration to fetch predictions from whichever model version the Staging or Live tag points to, which platform users can easily manage from the Re:infer UI.

Model Version Details

Details about a specific model version can be fetched using the Validation API endpoint.

Additionally, responses to prediction requests contain information about the model that was used to make the predictions.

"model": {
  "version": 2,
  "time": "2021-02-17T12:56:13.444000Z"
}
| Name | Type | Description |
|------|------|-------------|
| time | timestamp | When the model version was pinned. |
| version | number | Model version. |
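Since analytics applications should only mix predictions from a single model version, it can be worth checking the model object on every response. A minimal sketch (the expected version and helper name are placeholders for your integration):

```python
# Guard against mixing predictions from different model versions by
# checking the "model" object returned with each prediction response.
EXPECTED_MODEL_VERSION = 2  # the version your integration was built against

def check_model_version(response):
    version = response["model"]["version"]
    if version != EXPECTED_MODEL_VERSION:
        raise RuntimeError(
            f"Expected model version {EXPECTED_MODEL_VERSION}, got {version}"
        )
```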