Labels and general fields
This page describes how to interpret labels and general fields downloaded from the Communications Mining platform for use in your application. It covers the labels and general fields themselves; to understand where to find them in the downloaded data, check the documentation for your chosen download method.
A comment can have zero, one, or multiple predicted labels. The example below shows two predicted labels (Order and Order > Missing) together with their confidence scores. This format is used by most API routes. An exception is the Dataset Export route which formats label names as strings instead of lists (to be consistent with the CSV export in the browser).
Some routes (currently Predict routes) will optionally return a list of threshold names ("high_recall", "balanced", "high_precision") that the label confidence score meets. This is a useful alternative to hand-picking thresholds, especially for very large taxonomies. In your application, you decide whether you are interested in "high_recall", "balanced", or "high_precision" results, then discard all labels which lack your chosen auto-threshold, and process the remaining labels as before.
- All routes except Dataset Export
{ "labels": [ { "name": ["Order"], "probability": 0.6598735451698303 }, { "name": ["Order", "Missing"], "probability": 0.6598735451698303 } ] }
- Dataset Export
{ "labels": [ { "name": "Order", "probability": 0.6598735451698303 }, { "name": "Order > Missing", "probability": 0.6598735451698303 } ] }
- Predict (auto-thresholded)
{ "labels": [ { "name": ["Order"], "probability": 0.6598735451698303, "auto_thresholds": ["high_recall", "balanced", "sampled_2"] }, { "name": ["Order", "Missing"], "probability": 0.6598735451698303, "auto_thresholds": ["high_recall", "sampled_2"] } ] }
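The auto-threshold filtering described above can be sketched in a few lines of Python. This is a minimal illustration rather than an official client: the helper name `filter_by_auto_threshold` is hypothetical, and the input is assumed to be the `labels` list of an auto-thresholded Predict response, already parsed from JSON.

```python
def filter_by_auto_threshold(labels, threshold_name):
    """Keep only labels whose confidence score meets the named auto-threshold.

    `labels` is assumed to be the parsed "labels" list of a Predict response
    that was requested with auto-thresholds.
    """
    return [
        label
        for label in labels
        if threshold_name in label.get("auto_thresholds", [])
    ]


# Sample data copied from the auto-thresholded Predict example.
response = {
    "labels": [
        {
            "name": ["Order"],
            "probability": 0.6598735451698303,
            "auto_thresholds": ["high_recall", "balanced", "sampled_2"],
        },
        {
            "name": ["Order", "Missing"],
            "probability": 0.6598735451698303,
            "auto_thresholds": ["high_recall", "sampled_2"],
        },
    ]
}

balanced_labels = filter_by_auto_threshold(response["labels"], "balanced")
high_recall_labels = filter_by_auto_threshold(response["labels"], "high_recall")
```

With the sample data, only `["Order"]` survives the "balanced" filter, while both labels survive "high_recall".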
The Label object has the following format:
NAME | TYPE | DESCRIPTION |
---|---|---|
name | array<string> or string | All API routes except Dataset Export: The name of the predicted label, formatted as a list of hierarchical labels. For instance, the label Parent Label > Child Label will have the format ["Parent Label", "Child Label"]. Dataset Export API route: The name of the predicted label, formatted as a string with " > " separating hierarchical labels. |
probability | number | Confidence score. A number between 0.0 and 1.0. |
sentiment | number | Sentiment score. A number between -1.0 and 1.0. Only returned if sentiments are enabled in the dataset. |
auto_thresholds | array<string> | A list of automatically computed thresholds that the label confidence score meets. The thresholds are returned as descriptive names (rather than values between 0.0 and 1.0) that can be used to easily filter out labels that don't meet your desired confidence levels. The threshold names "high_recall", "balanced" and "high_precision" correspond to three increasing confidence levels. Additional "sampled_0" ... "sampled_5" thresholds provide a more advanced way of performing aggregations for data-science applications, and can be ignored if you're processing comments on a one-by-one basis. |
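Because the `name` field is a list on most routes and a string on the Dataset Export route, an application that consumes both may want to normalize it first. The sketch below is one possible approach; the helper names are hypothetical, and it assumes that individual label names never contain the " > " separator themselves.

```python
def label_name_parts(name):
    """Return the label hierarchy as a list of strings.

    Accepts either the list form (most API routes) or the string form
    used by the Dataset Export route.
    Assumption: no individual label name contains " > ".
    """
    if isinstance(name, str):
        return name.split(" > ")
    return list(name)


def label_name_string(name):
    """Return the label hierarchy as a single " > "-separated string."""
    return " > ".join(label_name_parts(name))


from_api = label_name_parts(["Order", "Missing"])
from_export = label_name_parts("Order > Missing")
```

Both calls produce the same list, so downstream code can compare labels without caring which route they came from.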
Q: How can I download labels from the Communications Mining platform?
A: The following download methods provide labels: Communications Mining API, CSV downloads, and Communications Mining command-line tool. Please take a look at the Downloading Data page for an overview of the available download methods, and the FAQ item below, for a detailed comparison.
Q: Do all download methods provide the same information?
A: The tables below explain the differences between the download methods. A description of labels in the Explore page in the Communications Mining web UI is provided for comparison.
Non-deterministic methods
The Explore page, CSV download, Communications Mining command-line tool, and the Export API endpoint provide the latest available predictions. Note that after a new model version has been trained, but before all predictions have been recalculated, you will see a mix of predictions from the latest and the previous model versions. These methods are aware of assigned labels and will show them as assigned or with a confidence score of 1.
METHOD | ASSIGNED LABELS | PREDICTED LABELS |
---|---|---|
Explore Page | Explore page visually differentiates assigned labels from predicted labels. It does not report confidence scores for assigned labels. | Explore page is designed to support the model training workflow, so it shows selected predicted labels that the user may want to pin. It will preferentially show labels that meet a balanced threshold (derived from F-score for that label), but may also show labels with lower probability as a suggestion, if the user is likely to want to pin them. |
Export API | Returns assigned labels. | Returns all predicted labels (no threshold is applied). |
CSV Download | Returns a confidence score of 1 for assigned labels. Note that predicted labels may also have a score of 1 if the model is very confident. | Returns all predicted labels (no threshold is applied). |
Communications Mining CLI | If a comment has assigned labels, will return both assigned and predicted labels for that comment. | Returns all predicted labels (no threshold is applied). |
Deterministic methods
In contrast to the non-deterministic methods above, Stream API and Predict API routes will return predictions from a specific model version. As such, these API routes behave as if you downloaded a comment from the platform and then sent it for prediction against a specific model version, and are not aware of assigned labels.
METHOD | ASSIGNED LABELS | PREDICTED LABELS |
---|---|---|
Stream API and Predict API | Not aware of assigned labels. | Return predicted labels with confidence score above the provided label thresholds (or above the default value of 0.25 if no thresholds are provided). |
When designing an application that makes decisions on a per-message basis, you will want to convert the confidence score of each label into a Yes-or-No answer. You can do that by determining the minimum confidence score at which you will treat the prediction as saying "yes, the label applies". We call this number the confidence score threshold.
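As a rough illustration of turning a confidence score into a Yes-or-No answer, the sketch below checks one label against one threshold. The function name and the threshold values are placeholders chosen for the example only; they are not recommendations for how to pick a threshold.

```python
def label_applies(labels, label_name, threshold):
    """Return True if `label_name` is predicted with confidence >= `threshold`.

    `labels` is the parsed "labels" list of a response, with names in list form.
    A label that is absent from the response is treated as "no".
    """
    for label in labels:
        if label["name"] == label_name:
            return label["probability"] >= threshold
    return False


# Sample data copied from the label example on this page.
labels = [
    {"name": ["Order"], "probability": 0.6598735451698303},
    {"name": ["Order", "Missing"], "probability": 0.6598735451698303},
]

# Placeholder thresholds, for illustration only.
applies_at_low_threshold = label_applies(labels, ["Order"], 0.5)
applies_at_high_threshold = label_applies(labels, ["Order"], 0.7)
```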
HOW TO PICK A CONFIDENCE SCORE THRESHOLD
A common mistake is to set the threshold equal to the precision you would like to get ("I want the labels to be correct at least 70% of the time, so I will pick labels with confidence scores above 0.70"). To understand thresholds and how to pick them, please check the Confidence Thresholds section of the integration guide.
If you are exporting labels for use in an analytics application, it's important to decide whether to expose confidence scores to users. For users of business analytics applications, you should convert the confidence scores into presence or absence of the label using one of the approaches described in the Automation section. On the other hand, users of data science applications proficient in working with probabilistic data will benefit from access to raw confidence scores.
An important consideration is to make sure that all predictions in your analytics application are from the same model version. If you are upgrading your integration to fetch predictions from a new model version, all predictions will need to be reingested for the data to stay consistent.
Label properties are returned in the label_properties part of the response.
{
"label_properties": [
{
"property_id": "0000000000000001",
"property_name": "tone",
"value": -1.8130283355712891
},
{
"id": "0000000000000002",
"name": "quality_of_service",
"value": -3.006324252113699913
}
]
}
The label property object has the following format:
NAME | TYPE | DESCRIPTION |
---|---|---|
name | string | Name of the label property. |
id | string | Internal ID of the label property. |
value | number | Value of the label property. A value between -10 and 10. |
The example below shows an order_number entity. Note that unlike labels, general fields do not have associated confidence scores.
"entities": [
{
"id": "0abe5b728ee17811",
"name": "order_number",
"span": {
"content_part": "body",
"message_index": 0,
"utf16_byte_start": 58,
"utf16_byte_end": 76,
"char_start": 29,
"char_end": 38
},
"kind": "order_number", # deprecated
"formatted_value": "ABC-123456",
"capture_ids": []
}
]
The API returns entities in the following format:
NAME | TYPE | DESCRIPTION |
---|---|---|
id | string | Entity ID. |
name | string | Entity name. |
kind | string | (Deprecated) Entity kind. |
formatted_value | string | Entity value. |
span | Span | An object containing the location of the entity in the comment. |
capture_ids | array<int> | The capture IDs of the groups to which an entity belongs. |
Each entity has a span and a formatted_value. The span represents the boundaries of the entity in the corresponding comment. The formatted_value typically corresponds to the text covered by that span, except in some specific instances that we describe below.
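To make the span fields concrete, the sketch below recovers the covered text from a comment body in two ways. It rests on assumptions that this page does not confirm: that `char_start` and `char_end` are character offsets usable as Python slice indices, and that `utf16_byte_start` and `utf16_byte_end` are byte offsets into the UTF-16 encoding of the same text. The comment body and helper names are hypothetical.

```python
def text_from_char_span(text, span):
    """Slice the text using character offsets (assumed to be code points)."""
    return text[span["char_start"]:span["char_end"]]


def text_from_utf16_span(text, span):
    """Slice the text using byte offsets into its UTF-16 encoding (assumed)."""
    encoded = text.encode("utf-16-le")  # little-endian, no byte-order mark
    chunk = encoded[span["utf16_byte_start"]:span["utf16_byte_end"]]
    return chunk.decode("utf-16-le")


# Hypothetical comment body; offsets are computed rather than hard-coded.
body = "Hello, my order ABC-123456 has not arrived."
start = body.index("ABC-123456")
end = start + len("ABC-123456")
span = {
    "char_start": start,
    "char_end": end,
    # For text without characters outside the Basic Multilingual Plane,
    # each character takes 2 bytes in UTF-16.
    "utf16_byte_start": 2 * start,
    "utf16_byte_end": 2 * end,
}

from_chars = text_from_char_span(body, span)
from_bytes = text_from_utf16_span(body, span)
```

The byte offsets are the more robust choice in languages whose strings are UTF-16 internally; in Python, the character offsets are simpler if the assumption above holds.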
Monetary Quantity
The Monetary Quantity entity extracts a wide variety of monetary amounts and applies a common format. For example, "1M USD", "USD 1000000", and "1,000,000 usd" will all be extracted as 1,000,000.00 USD. Since the extracted value is formatted consistently, you can easily get the currency and the amount by splitting on whitespace.
However, an amount of one million written with only a "$" sign will be extracted as $1,000,000.00 rather than 1,000,000.00 USD, since a "$" sign could refer to a Canadian or Australian dollar as well as a US dollar.
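Splitting the formatted value on whitespace can be sketched as follows. The function name is hypothetical, and the handling of the ambiguous "$" form (returning no currency code) is an assumption made for this example rather than documented behaviour.

```python
from decimal import Decimal


def parse_monetary_quantity(formatted_value):
    """Split a formatted value such as "1,000,000.00 USD" into (amount, currency).

    Returns a currency of None when no currency code follows the amount,
    as in the ambiguous "$1,000,000.00" form (assumed handling).
    """
    parts = formatted_value.split()
    amount_text = parts[0]
    currency = parts[1] if len(parts) > 1 else None
    # Drop a leading "$" and the thousands separators before parsing.
    amount = Decimal(amount_text.lstrip("$").replace(",", ""))
    return amount, currency


amount, currency = parse_monetary_quantity("1,000,000.00 USD")
ambiguous_amount, ambiguous_currency = parse_monetary_quantity("$1,000,000.00")
```

`Decimal` is used instead of `float` so that monetary amounts are not subject to binary rounding.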
Date
The Date entity extracts any date appearing in a comment and normalizes it using the standard ISO 8601 format, followed by the time in UTC. For instance, "Jan 25 2020", "25/01/2020" and "now" in an email sent on January 25 2020 will all be extracted as "2020-01-25 00:00 UTC".
This formatting will be applied to any entity that has a type corresponding to a date, such as cancellation dates, value dates, or any type of dates that have been trained by the user.
If some parts of the date are missing, the timestamp of the comment will be used as an anchor; the date "at 4PM on the fifth of the month" in a message sent on May 1, 2020 will be extracted as "2020-05-05 16:00 UTC". If no timezone is provided, then the timezone of the comment is used, but the extracted date will always be returned in the UTC timezone.
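A formatted date such as "2020-05-05 16:00 UTC" can be turned into a timezone-aware `datetime` with the standard library. This sketch assumes every formatted value follows exactly the pattern shown in the examples above; the function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_formatted_date(formatted_value):
    """Parse a value like "2020-01-25 00:00 UTC" into an aware datetime.

    Assumes the "YYYY-MM-DD HH:MM UTC" pattern shown in the examples.
    """
    naive = datetime.strptime(formatted_value, "%Y-%m-%d %H:%M UTC")
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_formatted_date("2020-05-05 16:00 UTC")
```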
Country
Country names are normalized to a common value; for instance, both strings "UK" and "United Kingdom" will have the formatted value "United Kingdom".
When an entity is matched inside a table, the capture_ids property of that entity will contain a capture ID. Entities matched in the same row of the table will have the same capture ID, allowing them to be grouped together.
For instance, an Order ID could be associated with an Order Date. In a comment where multiple orders are referred to, one can distinguish the different order details by grouping entities by their capture IDs.
For entities matched in a table, the capture_ids property currently contains exactly one ID. In the future, the API may return multiple IDs.
If an entity is not matched in a table, its capture_ids property will be an empty list.
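Grouping entities by capture ID might look like the sketch below. The entity names `order_id` and `order_date`, the sample values, and the helper name are all hypothetical; the code loops over every ID in `capture_ids` so that it keeps working if the API starts returning more than one.

```python
from collections import defaultdict


def group_by_capture_id(entities):
    """Group entities by capture ID.

    Returns (groups, ungrouped): `groups` maps each capture ID to the
    entities that carry it, and `ungrouped` lists entities whose
    capture_ids list is empty.
    """
    groups = defaultdict(list)
    ungrouped = []
    for entity in entities:
        capture_ids = entity.get("capture_ids", [])
        if not capture_ids:
            ungrouped.append(entity)
        for capture_id in capture_ids:
            groups[capture_id].append(entity)
    return dict(groups), ungrouped


# Hypothetical entities from a comment that mentions two orders in a table.
entities = [
    {"name": "order_id", "formatted_value": "ABC-123456", "capture_ids": [0]},
    {"name": "order_date", "formatted_value": "2020-01-25 00:00 UTC", "capture_ids": [0]},
    {"name": "order_id", "formatted_value": "ABC-654321", "capture_ids": [1]},
    {"name": "order_date", "formatted_value": "2020-02-01 00:00 UTC", "capture_ids": [1]},
    {"name": "country", "formatted_value": "United Kingdom", "capture_ids": []},
]

groups, ungrouped = group_by_capture_id(entities)
```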
Q: How can I download general fields from the Communications Mining platform?
A: The following download methods provide general fields: Communications Mining API and Communications Mining command-line tool. Please take a look at the Downloading Data overview to understand which method is suitable for your use-case. Note that CSV downloads will not include general fields.
A model version can be tagged as staging or live in the Communications Mining UI. This tag can be provided to Predict API requests in place of the model version number. This allows your integration to fetch predictions from whichever model version the Staging or Live tag points to, which platform users can easily manage from the Communications Mining UI.
Details about a specific model version can be fetched using the Validation API endpoint.
Additionally, responses to prediction requests contain information about the model that was used to make the predictions.
"model": {
"version": 2,
"time": "2021-02-17T12:56:13.444000Z"
}
NAME | TYPE | DESCRIPTION |
---|---|---|
time | timestamp | When the model version was pinned. |
version | number | Model version. |
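The model information can be used to check that a batch of predictions all came from the same model version before ingesting them into an analytics application. The sketch below assumes each parsed response carries the `model` object as a top-level key, which is an assumption about the response shape; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_model_info(model):
    """Return (version, pinned_at) from the "model" part of a response."""
    # Replace the trailing "Z" so that fromisoformat accepts the timestamp
    # on older Python versions as well.
    pinned_at = datetime.fromisoformat(model["time"].replace("Z", "+00:00"))
    return model["version"], pinned_at


def model_versions(responses):
    """Return the set of model versions seen across parsed responses.

    Assumes "model" is a top-level key of each response.
    """
    return {response["model"]["version"] for response in responses}


# Sample model object copied from the example above.
model = {"version": 2, "time": "2021-02-17T12:56:13.444000Z"}
version, pinned_at = parse_model_info(model)

responses = [{"model": model}, {"model": model}]
versions = model_versions(responses)
is_consistent = len(versions) == 1
```

If `versions` contains more than one entry, the predictions are mixed and would need to be reingested from a single model version.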