Communications Mining User Guide
Understanding data requirements
This article offers guidelines for the communications data volumes required to optimize the training experience and maximize the value provided by analytics and automation.
- Return on Investment (ROI)
- Complexity
- Technical limits
To get the most out of your Communications Mining™ implementation, we recommend starting with high-volume use cases. These cases benefit from Communications Mining's ability to process large amounts of message data efficiently for historical analytics, live monitoring, and automations.
The effort required to deploy a use case does not increase significantly with higher message volumes. Therefore, high-volume use cases tend to offer a better return on investment in terms of implementation effort compared to lower-volume use cases. This is important for organizations with limited resources or those that require external support for implementation.
However, if you have lower-volume scenarios with high business value, you should also consider these use cases. Many low-volume use cases are technically feasible and should not be dismissed.
Many use cases have a level of complexity (in terms of the number and complexity of labels and fields to be extracted) that is not well suited to very low volumes of messages. This is because the dataset may contain too few examples of varied and complex concepts or fields to effectively fine-tune and validate Communications Mining's specialized models. This applies both to the automated training provided by generative annotation and to further examples annotated by model trainers.
While some use cases may be technically feasible and have sufficient examples, lower volumes can sometimes result in a poorer annotation experience for model trainers. A larger data pool makes it easier for Communications Mining's active learning modes to identify and surface useful examples to annotate. A small pool of data tends to yield fewer quality examples across the taxonomy, which forces users to rely on annotating elusive or more complex examples.
Before you qualify and implement a use case based on the complexity and ROI considerations, it's important to consider the technical limits of Communications Mining.
To generate clusters, Communications Mining requires a minimum of 2048 messages in a dataset (which can be made up of multiple similar sources). Datasets with fewer than 2048 messages can use all Communications Mining features except clusters and generated label suggestions for clusters.
Use cases with fewer than 2048 messages should be very simple in terms of the number and complexity of labels and fields. Expect to annotate a much higher proportion of total messages for fine-tuning and validation than in higher-volume use cases. There may also be insufficient examples to annotate for labels or fields that do not occur frequently.
To ensure meaningful validation data, Communications Mining also expects a minimum of 25 annotated examples per label and field. Therefore, it’s important that you are able to source at least this number of examples from the data available.
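The two limits above (2048 messages for clusters, 25 annotated examples per label and field) can be turned into a quick pre-qualification check. The following is a minimal sketch only: the function, class, and constant names are hypothetical and are not part of any Communications Mining API, and the counts are assumed to come from your own export or estimate.

```python
from dataclasses import dataclass, field

# Thresholds taken from the guidance in this article.
MIN_MESSAGES_FOR_CLUSTERS = 2048
MIN_ANNOTATED_EXAMPLES = 25


@dataclass
class LimitReport:
    """Hypothetical result type for this sketch."""
    clusters_available: bool
    # Maps each label/field name to how many more annotated examples it needs.
    under_annotated: dict[str, int] = field(default_factory=dict)


def check_technical_limits(message_count: int,
                           annotated_counts: dict[str, int]) -> LimitReport:
    """Compare a dataset's size and annotation counts with the stated limits."""
    shortfalls = {
        name: MIN_ANNOTATED_EXAMPLES - count
        for name, count in annotated_counts.items()
        if count < MIN_ANNOTATED_EXAMPLES
    }
    return LimitReport(
        clusters_available=message_count >= MIN_MESSAGES_FOR_CLUSTERS,
        under_annotated=shortfalls,
    )


# Example with made-up numbers: a small dataset with one sparse label.
report = check_technical_limits(1500, {"Invoice query": 40, "Complaint": 12})
print(report.clusters_available)  # False
print(report.under_annotated)     # {'Complaint': 13}
```

A check like this only flags where the stated minimums are not met; it says nothing about whether the examples that do exist are of good quality.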
The following recommendations concern use cases with lower data volume, but high value and/or low complexity.
Generally, use cases should function as expected if their complexity aligns with the volume of message data. Very low volume use cases should typically be very simple, while high volume use cases can be more complex.
In some instances, synchronizing more than one year's worth of historical data can help in sourcing sufficient quality examples for training. This also provides the benefit of greater analytics in terms of trends and alerts.
Use cases with fewer than 20,000 messages (in terms of historical volumes or annual throughput) should be carefully considered in terms of complexity, ROI, and the effort required to support and enable the use case. While there is a chance that such use cases may be disqualified based on these considerations, they can still provide sufficient business value to proceed with.
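As a rough illustration, the volume thresholds mentioned in this article (2048 and 20,000 messages) can be expressed as a simple triage helper. This is a sketch under stated assumptions: the function name and the wording of the returned guidance are hypothetical, and the result is a prompt for discussion, not a qualification decision.

```python
# Thresholds taken from the prose in this article.
CLUSTER_MINIMUM = 2048
CAREFUL_REVIEW_THRESHOLD = 20_000


def volume_guidance(message_count: int) -> str:
    """Return the article's volume-based caution for a given message count.

    message_count is the historical volume or annual throughput.
    """
    if message_count < 0:
        raise ValueError("message_count must be non-negative")
    if message_count < CLUSTER_MINIMUM:
        return "very simple use cases only; clusters unavailable"
    if message_count < CAREFUL_REVIEW_THRESHOLD:
        return "review complexity, ROI, and support effort carefully"
    return "no volume-specific caution from this article"


for count in (800, 12_000, 75_000):
    print(count, "->", volume_guidance(count))
```

Low-volume use cases with high business value can still be worth pursuing, so the returned text should be weighed against the value of the use case.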
Every use case is unique, so there is no single guideline that fits all complexity scenarios. The labels and fields themselves can range from very simple to complex in terms of understanding and extraction.
The following table outlines rough guidelines for use case complexity.
Complexity | Labels | Extraction Fields | General Fields |
---|---|---|---|
Very Low | ~ 2 - 5 | N/A | 1 - 2 |
Low | ~ 5 - 15 | 1 - 2 for a few labels | 1 - 3 |
Medium | 15 - 50 | 1 - 5 for multiple labels | 1 - 5 * |
High | 50+ | 1 - 8+ for high proportion of labels | 1 - 5 * |
* Use cases with extraction fields should rely on these rather than general fields. If you are not using extraction fields, you can expect more general fields, but they may not add equivalent value.
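The label-count column of the complexity table can be sketched as a lookup. Because the table's ranges are approximate and share boundary values (5, 15, and 50), this sketch makes an assumption about where each boundary falls: 5 counts as Very Low, 15 as Low, and 50 as High. The function name is hypothetical, and the sketch ignores the extraction field and general field columns, which also affect complexity.

```python
def complexity_from_label_count(label_count: int) -> str:
    """Map a number of labels to a rough complexity tier.

    Boundary handling is an assumption of this sketch, since the
    guideline ranges are approximate and overlap at their edges.
    """
    if label_count < 1:
        raise ValueError("label_count must be at least 1")
    if label_count <= 5:
        return "Very Low"
    if label_count <= 15:
        return "Low"
    if label_count < 50:
        return "Medium"
    return "High"


for n in (3, 10, 30, 60):
    print(n, "->", complexity_from_label_count(n))
```

Treat the result as a starting point only: a taxonomy with few labels but many extraction fields per label may still be more complex than its label count suggests.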
# of Messages * | Limitations | Recommendation |
---|---|---|
Less than | | Should only be: |
2048 - 20,000 | | Should primarily be: |
20,000 - 50,000 | | Should primarily be: |
* Typically, only a small proportion of the historical data from which training examples are sourced is annotated. This proportion is usually higher for lower-volume and higher-complexity use cases.