Labels and Entities

Labels#

A comment can have zero, one, or multiple predicted labels. The example below shows two predicted labels (Order and Order > Missing) together with their confidence scores. This format is used by most API routes. The exception is the Dataset Export route, which formats label names as strings instead of lists (to be consistent with the CSV export in the browser).

Some routes (currently Predict routes) will optionally return a list of threshold names ("high_recall", "balanced", "high_precision") that the label confidence score meets. This is a useful alternative to hand-picking thresholds, especially for very large taxonomies. In your application, you decide whether you are interested in "high_recall", "balanced", or "high_precision" results, then discard all labels which lack your chosen auto-threshold, and process the remaining labels as before.

{
  "labels": [
    {
      "name": ["Order"],
      "probability": 0.6598735451698303
    },
    {
      "name": ["Order", "Missing"],
      "probability": 0.6598735451698303
    }
  ]
}
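The auto-threshold filtering described above can be sketched as follows. This is a minimal illustration, not part of the API: the function name `filter_by_auto_threshold` and the sample `auto_thresholds` values are assumptions made up for the example.

```python
from __future__ import annotations

from typing import Any


def filter_by_auto_threshold(
    labels: list[dict[str, Any]], threshold_name: str = "balanced"
) -> list[dict[str, Any]]:
    """Keep only the labels whose auto_thresholds list contains threshold_name."""
    return [
        label
        for label in labels
        if threshold_name in label.get("auto_thresholds", [])
    ]


# Hypothetical response fragment; probabilities and threshold lists are invented.
response = {
    "labels": [
        {
            "name": ["Order"],
            "probability": 0.92,
            "auto_thresholds": ["high_recall", "balanced", "high_precision"],
        },
        {
            "name": ["Order", "Missing"],
            "probability": 0.41,
            "auto_thresholds": ["high_recall"],
        },
    ]
}

kept = filter_by_auto_threshold(response["labels"], "balanced")
print([" > ".join(label["name"]) for label in kept])  # ['Order']
```

Labels that carry no `auto_thresholds` field at all are treated here as meeting no threshold, which is a choice of this sketch.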

The Label object has the following format:

| Name | Type | Description |
| --- | --- | --- |
| name | array<string> or string | All API routes except Dataset Export: The name of the predicted label, formatted as a list of hierarchical labels. For instance, the label Parent Label > Child Label will have the format ["Parent Label", "Child Label"]. Dataset Export API route: The name of the predicted label, formatted as a string with " > " separating hierarchical labels. |
| probability | number | Confidence score. A number between 0.0 and 1.0. |
| sentiment | number | Sentiment score. A number between -1.0 and 1.0. Only returned if sentiments are enabled in the dataset. |
| auto_thresholds | array<string> | A list of automatically computed thresholds that the label confidence score meets. The thresholds are returned as descriptive names (rather than values between 0.0 and 1.0) that can be used to easily filter out labels that don't meet your desired confidence levels. The threshold names "high_recall", "balanced" and "high_precision" correspond to three increasing confidence levels. Additional "sampled_0" ... "sampled_5" thresholds provide a more advanced way of performing aggregations for data-science applications, and can be ignored if you're processing comments on a one-by-one basis. |
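Because the name field arrives either as a list or as a " > "-separated string depending on the route, client code may want to normalise it to one shape. The helpers below are a minimal sketch with hypothetical names, and they assume that individual label names never contain the " > " separator themselves.

```python
from __future__ import annotations

from typing import Union

SEPARATOR = " > "


def label_name_to_path(name: Union[str, list[str]]) -> list[str]:
    """Return the hierarchical label as a list, whichever format it arrived in."""
    if isinstance(name, str):
        # Dataset Export style: "Parent Label > Child Label"
        return name.split(SEPARATOR)
    # All other routes: ["Parent Label", "Child Label"]
    return list(name)


def label_path_to_string(path: list[str]) -> str:
    """Render a hierarchical label in the Dataset Export string style."""
    return SEPARATOR.join(path)


print(label_name_to_path("Order > Missing"))       # ['Order', 'Missing']
print(label_name_to_path(["Order", "Missing"]))    # ['Order', 'Missing']
print(label_path_to_string(["Order", "Missing"]))  # Order > Missing
```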

Labels FAQ#

The tables below explain how assigned and predicted labels differ between the download methods. For comparison, they also describe how labels are shown on the Explore page in the Re:infer web UI.

Non-deterministic methods#

The Explore page, CSV download, Re:infer command-line tool, and the Export API endpoint provide the latest available predictions. Note that after a new model version has been trained, but before all predictions have been recalculated, you will see a mix of predictions from the latest and the previous model versions. These methods are aware of assigned labels and will show them as assigned or with a confidence score of 1.

| Method | Assigned Labels | Predicted Labels |
| --- | --- | --- |
| Explore Page | Explore page visually differentiates assigned labels from predicted labels. It does not report confidence scores for assigned labels. | Explore page is designed to support the model training workflow, so it shows selected predicted labels that the user may want to pin. It will preferentially show labels that meet a balanced threshold (derived from F-score for that label), but may also show labels with lower probability as a suggestion, if the user is likely to want to pin them. |
| Export API | Returns assigned labels. | Returns all predicted labels (no threshold is applied). |
| CSV Download | Returns a confidence score of 1 for assigned labels. Note that predicted labels may also have a score of 1 if the model is very confident. | Returns all predicted labels (no threshold is applied). |
| Re:infer CLI | If a comment has assigned labels, will return both assigned and predicted labels for that comment. | Returns all predicted labels (no threshold is applied). |

Deterministic methods#

In contrast to the non-deterministic methods above, Trigger API and Predict API routes will return predictions from a specific model version. As such, these API routes behave as if you downloaded a comment from the platform and then sent it for prediction against a specific model version, and are not aware of assigned labels.

| Method | Assigned Labels | Predicted Labels |
| --- | --- | --- |
| Trigger API and Predict API | Not aware of assigned labels. | Return predicted labels with confidence score above the provided label thresholds (or above the default value of 0.25 if no thresholds are provided). |
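If you want to apply a comparable filter on the client side to labels that were downloaded without any threshold, something like the sketch below may help. It is not the platform's implementation. The function name, the string keys used for per-label thresholds, the inclusive comparison, and the fallback to 0.25 for labels missing from the mapping are all assumptions of this example.

```python
from __future__ import annotations

from typing import Any, Optional

DEFAULT_THRESHOLD = 0.25  # default value quoted in the table above


def apply_label_thresholds(
    labels: list[dict[str, Any]],
    thresholds: Optional[dict[str, float]] = None,
) -> list[dict[str, Any]]:
    """Keep labels whose probability reaches their threshold.

    thresholds maps "Parent > Child" strings to minimum scores. Labels
    missing from the mapping fall back to DEFAULT_THRESHOLD (an assumption).
    """
    thresholds = thresholds or {}
    kept = []
    for label in labels:
        key = " > ".join(label["name"])
        if label["probability"] >= thresholds.get(key, DEFAULT_THRESHOLD):
            kept.append(label)
    return kept


# Invented sample data for illustration.
labels = [
    {"name": ["Order"], "probability": 0.66},
    {"name": ["Order", "Missing"], "probability": 0.30},
    {"name": ["Refund"], "probability": 0.10},
]

default_kept = apply_label_thresholds(labels)
custom_kept = apply_label_thresholds(labels, {"Order > Missing": 0.5})
print([l["name"] for l in default_kept])  # [['Order'], ['Order', 'Missing']]
print([l["name"] for l in custom_kept])   # [['Order']]
```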

Using Labels in Automation#

When designing an application that makes decisions on a per-verbatim basis, you will want to convert the confidence score of each label into a Yes-or-No answer. You can do that by determining the minimum confidence score at which you will treat the prediction as saying "yes, the label applies". We call this number the confidence score threshold.
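As a minimal sketch of that Yes-or-No conversion, the hypothetical helper below checks one label against a threshold. The threshold values in the usage lines are placeholders for illustration only, not recommendations.

```python
from __future__ import annotations

from typing import Any


def label_applies(
    labels: list[dict[str, Any]], name: list[str], threshold: float
) -> bool:
    """Return True if the named label is predicted at or above the threshold."""
    for label in labels:
        if label["name"] == name:
            return label["probability"] >= threshold
    # The label was not predicted for this comment at all.
    return False


# Labels in the list format used by most API routes.
labels = [
    {"name": ["Order"], "probability": 0.6598735451698303},
    {"name": ["Order", "Missing"], "probability": 0.6598735451698303},
]

print(label_applies(labels, ["Order", "Missing"], threshold=0.5))  # True
print(label_applies(labels, ["Order", "Missing"], threshold=0.8))  # False
print(label_applies(labels, ["Refund"], threshold=0.5))            # False
```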

How to pick a confidence score threshold

A common mistake is to set the threshold equal to the precision you'd like to get ("I want the labels to be correct at least 70% of the time, so I will pick labels with confidence scores above 0.70"). A confidence score is not the same thing as precision. To understand thresholds and how to pick them, see the Confidence Thresholds section of the integration guide.

Using Labels in Analytics#

If you are exporting labels for use in an analytics application, it's important to decide whether to expose confidence scores to users. For users of business analytics applications, you should convert the confidence scores into presence or absence of the label using one of the approaches described in the Automation section. On the other hand, users of data science applications proficient in working with probabilistic data will benefit from access to raw confidence scores.

An important consideration is to make sure that all predictions in your analytics application are from the same model version. If you are upgrading your integration to fetch predictions from a new model version, all predictions will need to be reingested for the data to stay consistent.
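One way to check this consistency is to store the model version next to each ingested prediction and audit it. The sketch below assumes a hypothetical storage shape in which every record keeps a `comment_id` and a `model` object with a `version` field. Both helper names are invented for the example.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def count_model_versions(records: list[dict[str, Any]]) -> Counter:
    """Count how many stored predictions came from each model version."""
    return Counter(record["model"]["version"] for record in records)


def stale_record_ids(records: list[dict[str, Any]], target_version: int) -> list[str]:
    """List the comments whose predictions need to be reingested."""
    return [
        record["comment_id"]
        for record in records
        if record["model"]["version"] != target_version
    ]


# Invented records for illustration.
records = [
    {"comment_id": "a", "model": {"version": 2}},
    {"comment_id": "b", "model": {"version": 3}},
    {"comment_id": "c", "model": {"version": 3}},
]

print(count_model_versions(records))
print(stale_record_ids(records, target_version=3))  # ['a']
```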

Entities#

A comment can have zero, one, or multiple predicted entities. The example below shows one predicted order_number entity. Note that, unlike labels, entities do not have associated confidence scores.

"entities": [
    {
        "id": "0abe5b728ee17811",
        "name": "order_number",
        "span": {
            "content_part": "body",
            "message_index": 0,
            "utf16_byte_start": 58,
            "utf16_byte_end": 76,
            "char_start": 29,
            "char_end": 38
        },
        "kind": "order_number", # deprecated
        "formatted_value": "ABC-123456"
    }
]

The API returns entities in the following format:

| Name | Type | Description |
| --- | --- | --- |
| id | string | Entity ID. |
| name | string | Entity name. |
| kind | string | (Deprecated) Entity kind. |
| formatted_value | string | Entity value. |
| span | Span | An object containing the location of the entity in the comment. |

Formatting#

Each entity has a span and a formatted_value. The span represents the boundaries of the entity in the corresponding comment. The formatted_value typically corresponds to the text covered by that span, except in some specific instances that we describe below.
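To recover the text covered by a span, you can slice the comment text with the span offsets. The sketch below rests on assumptions the text above does not confirm: that `char_start`/`char_end` count characters with an exclusive end, and that `utf16_byte_start`/`utf16_byte_end` are byte offsets into the UTF-16 encoding of the same text without a byte-order mark. The sample body and helper names are invented.

```python
from __future__ import annotations

from typing import Any


def text_from_char_span(text: str, span: dict[str, Any]) -> str:
    """Slice using character offsets (end assumed exclusive)."""
    return text[span["char_start"]:span["char_end"]]


def text_from_utf16_span(text: str, span: dict[str, Any]) -> str:
    """Slice using byte offsets into the UTF-16 (little-endian, no BOM) encoding."""
    encoded = text.encode("utf-16-le")
    chunk = encoded[span["utf16_byte_start"]:span["utf16_byte_end"]]
    return chunk.decode("utf-16-le")


# Invented comment body; offsets are computed here rather than taken from the API.
body = "Hi, my order ABC-12345 is late"
start = body.index("ABC-12345")
end = start + len("ABC-12345")
span = {
    "content_part": "body",
    "message_index": 0,
    "char_start": start,
    "char_end": end,
    # Each character in this ASCII-only sample takes two bytes in UTF-16.
    "utf16_byte_start": 2 * start,
    "utf16_byte_end": 2 * end,
}

print(text_from_char_span(body, span))   # ABC-12345
print(text_from_utf16_span(body, span))  # ABC-12345
```

The UTF-16 offsets are mainly useful in languages whose strings are indexed in UTF-16 code units. In Python the character offsets are usually the more convenient choice.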

Monetary Quantity#

The Monetary Quantity entity will extract a wide variety of monetary amounts and apply a common formatting. For example, "1M USD", "USD 1000000", and "1,000,000 usd" will all be extracted as 1,000,000.00 USD. Since the extracted value is formatted in a consistent way, you can easily get the currency and the amount by splitting on whitespace.

However, if the currency is ambiguous, the extracted value will retain the ambiguous currency. For example, "$1M" and "$1,000,000" will be extracted as $1,000,000.00 rather than 1,000,000.00 USD, since a "$" sign could refer to a Canadian or Australian dollar as well as a US dollar.
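The whitespace split, together with the ambiguous-currency case, can be sketched as follows. The `Money` type and `parse_monetary_quantity` function are hypothetical, and the parser only handles the two shapes shown above ("1,000,000.00 USD" and "$1,000,000.00"). Other shapes raise an error rather than being guessed at.

```python
from __future__ import annotations

import re
from decimal import Decimal
from typing import NamedTuple, Optional


class Money(NamedTuple):
    amount: Decimal
    currency: Optional[str]  # currency code such as "USD", or None if ambiguous
    symbol: Optional[str]    # ambiguous symbol such as "$", or None


# A leading non-digit symbol directly followed by the formatted amount.
_AMBIGUOUS = re.compile(r"^(?P<symbol>[^\d\s]+)(?P<amount>[\d,]+(?:\.\d+)?)$")


def parse_monetary_quantity(formatted_value: str) -> Money:
    """Split a formatted monetary value into amount and currency."""
    parts = formatted_value.split()
    if len(parts) == 2:
        amount_text, currency = parts
        return Money(Decimal(amount_text.replace(",", "")), currency, None)
    match = _AMBIGUOUS.match(formatted_value.strip())
    if match:
        amount = Decimal(match["amount"].replace(",", ""))
        return Money(amount, None, match["symbol"])
    raise ValueError(f"Unrecognised monetary value: {formatted_value!r}")


print(parse_monetary_quantity("1,000,000.00 USD"))
print(parse_monetary_quantity("$1,000,000.00"))
```

`Decimal` is used instead of `float` so that the amount keeps its exact decimal value.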

Date#

The Date entity will extract any date appearing in a comment and normalise it to the standard ISO 8601 date format, followed by the time in UTC. For instance, "Jan 25 2020", "25/01/2020" and "now" in an email sent on January 25, 2020 will all be extracted as "2020-01-25 00:00 UTC".

This formatting will be applied to any entity that has a type corresponding to a date, such as cancellation dates, value dates, or any type of dates that have been trained by the user.

If some parts of the date are missing, the timestamp of the comment will be used as an anchor; the date "at 4PM on the fifth of the month" in a message sent on May 1, 2020 will be extracted as "2020-05-05 16:00 UTC". If no timezone is provided, then the timezone of the comment is used, but the extracted date will always be returned in the UTC timezone.
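A formatted date value of the shape shown above can be turned into a timezone-aware datetime with the standard library. This is a minimal sketch that assumes the value always has the exact form "YYYY-MM-DD HH:MM UTC", as in the examples. The function name is made up.

```python
from datetime import datetime, timezone

# Matches values such as "2020-05-05 16:00 UTC"; the trailing " UTC" is a literal.
DATE_FORMAT = "%Y-%m-%d %H:%M UTC"


def parse_entity_date(formatted_value: str) -> datetime:
    """Parse a formatted date entity value into an aware UTC datetime."""
    naive = datetime.strptime(formatted_value, DATE_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_entity_date("2020-05-05 16:00 UTC")
print(parsed.isoformat())  # 2020-05-05T16:00:00+00:00
```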

Country#

Country names are normalised to a common value; for instance, both strings "UK" and "United Kingdom" will have the formatted value "United Kingdom".

Entities FAQ#

"model": {
    "version": 2,
    "time": "2021-02-17T12:56:13.444000Z"
}

| Name | Type | Description |
| --- | --- | --- |
| time | timestamp | When the model version was pinned. |
| version | number | Model version. |