Fixing annotating bias with Communications Mining
Communications Mining uses machine learning models to identify patterns in communications data like emails, chats and calls. Models extrapolate these patterns to make predictions for similar data in the future, driving downstream processes like automations and analytics.
For this approach to work, the data used to train a model needs to be representative of the communications it will make predictions on. When this is not the case, models will make mistakes that can seriously impact the performance of systems which rely on accurate predictions.
To help users build robust, well-performing models, we built a tool to ensure data used for training always matches the user’s target task. In this blog post we discuss how this tool works, and some of the problems we tackled during its development.
What is annotating bias?
Models in Communications Mining are trained on user-reviewed data. Users create labels for topics they care about, then annotate examples with labels that apply. A model is then automatically trained on this reviewed data to predict which labels apply.
Annotating data is difficult and time-consuming. Communications Mining leverages active learning to speed up the process, helping users annotate the most informative data points in the fastest time possible.
Since active learning selects specific data points, it tends to focus only on a subset of the underlying data. Furthermore, switching between concepts comes with a cognitive overhead. Users are encouraged to annotate groups of examples from similar topics at the same time, rather than constantly changing between themes.
This can lead to some topics appearing more or less frequently in the reviewed data than in the dataset as a whole. We call this annotating bias, because the data annotated by users no longer represents the underlying data.
Why should you care?
Communications Mining uses reviewed data during validation to assess model performance. If this data is biased towards certain topics, the validation results can be misleading.
Consider a shared mailbox for a multinational bank which contains emails from across EMEA. Communications Mining’s multilingual models can understand communications data in a mix of languages. However, if a user were to annotate emails in only a single language, the model may learn to focus on features specific to that language.
In this case, validation scores would be good for that model, as it performs well on all the annotated examples. On the other hand, performance on emails in other languages may be worse. The user would be unaware because there are no examples to highlight this in the reviewed data. This could lead to inefficiencies in any processes which rely on the model for accurate predictions.
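The maths behind annotating bias
Communications Mining's models estimate P(Label∣Document), the probability that a label applies to a given document. By Bayes' rule, this can be split into three components:
P(Label∣Document) = P(Document∣Label) × P(Label) / P(Document)
Each of these components is estimated from some or all of the dataset during training.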
- P(Document∣Label) Models the range of documents for a given topic. The model learns to estimate this from the annotated data, extrapolating using its knowledge of language and the world.
- P(Document) Models the different types of documents in the dataset and their relative frequencies. This is independent of labels and can be estimated from all examples (both reviewed and unreviewed).
- P(Label) Models the frequency of different topics. This can only be estimated from the annotated data, as it is specific to each use-case.
All three parts are required to find P(Label∣Document). However, both P(Label) and P(Document∣Label) depend heavily on the annotated data. When annotating bias is present, these estimates may not match the true distributions, leading to inaccuracies in P(Label∣Document).
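To see how a biased estimate of P(Label) shifts P(Label∣Document), here is a minimal sketch. The label names and all numbers are hypothetical, and it treats labels as mutually exclusive so that P(Document) can be obtained by normalising, which is a simplification of the multi-label setting described in this post.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(label | doc) is proportional to P(doc | label) * P(label)."""
    joint = {label: likelihoods[label] * priors[label] for label in priors}
    evidence = sum(joint.values())  # plays the role of P(Document)
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical values for a single email (assumptions, not real data).
likelihoods = {"Payment": 0.02, "Complaint": 0.01}   # P(Document | Label)
true_priors = {"Payment": 0.2, "Complaint": 0.8}     # P(Label) in the full dataset
biased_priors = {"Payment": 0.8, "Complaint": 0.2}   # P(Label) after over-annotating "Payment"

unbiased = posterior(likelihoods, true_priors)
biased = posterior(likelihoods, biased_priors)

print(f"P(Payment | doc) with true priors:   {unbiased['Payment']:.3f}")
print(f"P(Payment | doc) with biased priors: {biased['Payment']:.3f}")
```

With the same likelihoods, the over-annotated label goes from being the less likely prediction to the dominant one, purely because the estimated prior has changed.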
Given the vital role that reviewed data plays in training and validating models, we need to detect annotating bias and warn users when their data isn’t representative.
At the simplest level, annotating bias is a discrepancy between examples which have been reviewed by users and those which have not. Imagine a person is asked to check for annotating bias in a dataset. This person might look at common themes which appear in the reviewed data and then check how often these occur in the unreviewed data.
If the person finds a reliable rule for differentiating between these two groups, we can be confident that there is an imbalance. On the other hand, in a dataset with no annotating bias, a person would be unable to accurately predict whether examples are reviewed or not. The predictive performance of this person measures how much annotating bias is present in the dataset.
We used this idea as a starting point for our annotating bias model.
The comparison task can be automated with a machine learning model. This model differs from Communications Mining's core model, which predicts which labels or general fields apply to a document. Instead, the model is trained to identify reviewed data points.
The validation scores for the model show how easily the model can distinguish between reviewed and unreviewed examples, and therefore how much annotating bias is present in the dataset.
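The idea can be sketched with a small classifier that tries to tell reviewed from unreviewed points. The post does not specify the model or the metric used internally, so the plain logistic regression, the ROC AUC score, and the toy 2D points below are all assumptions. For brevity the score is computed on the training points; a real setup would use held-out validation data.

```python
import math


def train_logistic(points, targets, steps=500, lr=0.1):
    """Fit a two-feature logistic regression with plain gradient descent."""
    w, b, n = [0.0, 0.0], 0.0, len(points)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x, y), t in zip(points, targets):
            p = 1.0 / (1.0 + math.exp(-(w[0] * x + w[1] * y + b)))
            grad_w[0] += (p - t) * x
            grad_w[1] += (p - t) * y
            grad_b += p - t
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def auc(pos_scores, neg_scores):
    """Probability that a reviewed point outranks an unreviewed one (ties count half)."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            wins += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def bias_auc(reviewed, unreviewed):
    """Train a reviewed-vs-unreviewed classifier and return its AUC."""
    points = reviewed + unreviewed
    targets = [1.0] * len(reviewed) + [0.0] * len(unreviewed)
    w, b = train_logistic(points, targets)
    score = lambda pt: w[0] * pt[0] + w[1] * pt[1] + b  # monotone in P(reviewed)
    return auc([score(p) for p in reviewed], [score(p) for p in unreviewed])


# Two hypothetical topics, each a small cluster of 2D points.
topic_a = [(x, y) for x in (0, 1) for y in (0, 1)]
topic_b = [(x + 4, y + 4) for x, y in topic_a]
unreviewed = (topic_a + topic_b) * 10

unbiased_auc = bias_auc((topic_a + topic_b) * 3, unreviewed)  # both topics annotated
biased_auc = bias_auc(topic_a * 5, unreviewed)                # only topic A annotated

print(f"unbiased: {unbiased_auc:.2f}, biased: {biased_auc:.2f}")
```

When both topics are annotated in proportion, the classifier cannot do better than chance. When one topic is never annotated, the classifier separates the groups well above chance, which is the signal used to flag annotating bias.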
Classifying reviewed examples
A simple classifier model trained on the synthetic dataset has an average precision of over 80%. If the dataset were unbiased, we would expect the model to perform no better than random chance, so this high score reflects the bias we can see in the reviewed data.
Similar naive classifier models trained on real datasets could also reliably detect reviewed examples. This suggests that annotating bias was present in these datasets, but the exact source was unknown.
For the synthetic dataset, it’s easy to see the effect of annotating bias in the plotted data. This is not the case for a real dataset, where data lies in more than 2 dimensions and patterns are often much more complex.
Instead, we can look for patterns in examples that the model is confident are unreviewed. This approach showed that emails confidently predicted as being unreviewed often contained attachments with no text. Where these emails were present in the data, they were usually underrepresented in the reviewed examples.
This constitutes a clear annotating bias and shows the promise of a classifier model.
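One way to look for such patterns is to compare how often a feature appears among the examples the bias model is most confident are unreviewed with how often it appears in the reviewed set. The sketch below is illustrative only: the feature names, the `p_reviewed` scores, and the `min_gap` threshold are hypothetical.

```python
from collections import Counter


def feature_rates(examples):
    """Fraction of examples in which each feature appears."""
    counts = Counter(f for ex in examples for f in ex["features"])
    return {f: c / len(examples) for f, c in counts.items()}


def underrepresented_features(reviewed, unreviewed, top_k=3, min_gap=0.5):
    """Features common among confidently unreviewed examples but rare in reviewed ones."""
    # Lowest P(reviewed) first: these are the most confidently unreviewed examples.
    confident = sorted(unreviewed, key=lambda ex: ex["p_reviewed"])[:top_k]
    confident_rates = feature_rates(confident)
    reviewed_rates = feature_rates(reviewed)
    return {
        f: (rate, reviewed_rates.get(f, 0.0))
        for f, rate in confident_rates.items()
        if rate - reviewed_rates.get(f, 0.0) >= min_gap
    }


# Hypothetical emails; p_reviewed is the bias model's predicted P(reviewed).
reviewed = [
    {"features": {"has_text"}},
    {"features": {"has_text", "has_attachment"}},
    {"features": {"has_text"}},
]
unreviewed = [
    {"features": {"has_attachment"}, "p_reviewed": 0.02},
    {"features": {"has_attachment"}, "p_reviewed": 0.05},
    {"features": {"has_text"}, "p_reviewed": 0.60},
    {"features": {"has_attachment"}, "p_reviewed": 0.03},
    {"features": {"has_text", "has_attachment"}, "p_reviewed": 0.45},
]

gaps = underrepresented_features(reviewed, unreviewed)
print(gaps)
```

In this toy data, attachments appear in every confidently unreviewed email but in only a third of the reviewed ones, mirroring the attachment-only pattern described above.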
The annotating bias model is trained to distinguish between reviewed and unreviewed data. In this setting, the model tries to catch out the user by identifying patterns in their annotated data. This adversarial approach is a powerful way of inspecting the reviewed data, but also raises two interesting problems.
Trivial differences
Differences in reviewed and unreviewed data picked up by the model should have meaning to users. However, when we provided the naive bias model with detailed inputs, we found the model sometimes focused on insignificant patterns.
For example, .jpg files with GOCR in the name were confidently predicted as being unreviewed. There were no such examples in the reviewed set, but 160 in the unreviewed set, representing a small annotating bias.
However, there was nothing significant about GOCR in filenames, and these examples were just a subset of attachment-only emails in the dataset. In fact, all of these emails had confident, correct predictions for the dataset’s Auto-Generated label, meaning these features had no significance to Communications Mining’s annotating model either. Nevertheless, the bias model was using these features to make predictions.
Users shouldn’t have to label all combinations of meaningless features to get a good annotating bias score. For almost all concepts, we don’t need thousands of examples to fully capture the range of possible data points. Instead, the annotating bias model should only focus on differences which actually impact label predictions.
Unimportant topics
Datasets may contain data points which are never annotated by users because they are irrelevant for their target task.
Returning to our multinational banking example, teams could use Communications Mining to drive country-specific use cases. Each team would build a model customized to their target task, with all models using emails from the shared mailbox.
These use cases are likely to differ between teams. European countries may wish to track the effect of Brexit on their operations and would create a set of labels for this purpose. On the other hand, teams in the Middle East and Africa may have no use for Brexit-related emails and would ignore them in their model.
Not annotating Brexit-related emails is an example of annotating bias. However, this is a bias that is unimportant to users in the Middle East and Africa. The bias model should take this into account and only search for annotating bias in emails that the team deems useful.
We need to make it more difficult for the bias model to focus on small features, but guide this by what the user defines as useful. To do this, we can alter the inputs we pass to our annotating bias model.
Model inputs
The inputs to our core annotating model contain a large amount of information from the input text. This allows the model to learn complex relationships which influence label predictions. However, for the annotating bias model, this also lets the model focus on small, meaningless differences in features like filenames.
Dimensionality reduction is a way of filtering out information while maintaining meaningful properties of the original inputs. Using reduced inputs prevents the bias model from focusing on small features while retaining information that is important in a dataset.
Users only create labels for topics they want to track, so including labels during dimensionality reduction means we keep the most important input features. With this approach, our annotating bias model no longer focuses on small features and takes labels into account when estimating bias.
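The post does not say which dimensionality reduction method is used. As one possible illustration, the sketch below swaps in a simple label-aware projection: each reduced dimension is the direction from the overall mean of the reviewed inputs to the mean of the inputs carrying a given label. Features that do not vary with any label, such as a filename flag, get a weight of zero and disappear from the reduced inputs. The feature layout, the label names, and the one-label-per-example simplification are assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_directions(reviewed):
    """One direction per label: label centroid minus the overall centroid."""
    overall = mean_vector([x for x, _ in reviewed])
    directions = []
    for label in sorted({label for _, label in reviewed}):
        centroid = mean_vector([x for x, l in reviewed if l == label])
        directions.append([c - o for c, o in zip(centroid, overall)])
    return directions


def reduce_input(x, directions):
    """Project an input onto the label directions."""
    return [sum(xi * di for xi, di in zip(x, d)) for d in directions]


# Hypothetical features: [payment words, complaint words, GOCR in filename]
reviewed = [
    ([1.0, 0.0, 1.0], "Payment"),
    ([1.0, 0.0, 0.0], "Payment"),
    ([0.0, 1.0, 1.0], "Complaint"),
    ([0.0, 1.0, 0.0], "Complaint"),
]
directions = label_directions(reviewed)

with_gocr = reduce_input([1.0, 0.0, 1.0], directions)
without_gocr = reduce_input([1.0, 0.0, 0.0], directions)
other_topic = reduce_input([0.0, 1.0, 0.0], directions)
print(with_gocr, without_gocr, other_topic)
```

Two inputs that differ only in the filename flag map to the same reduced vector, so a bias model trained on the reduced inputs cannot use that flag, while inputs from different topics remain distinguishable.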
We use our annotating bias model for two main tasks in Communications Mining.
Balance scores
Detecting and addressing annotating bias is vital for reliable model validation scores. Because of this, we show the performance of the annotating bias model in the model rating.
This is in the form of a similarity measure between the reviewed and unreviewed data. A low similarity score indicates a big difference between reviewed and unreviewed data, highlighting annotating bias in the dataset.
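The exact formula behind the similarity score is not given in this post. As a purely hypothetical illustration of how classifier performance could be turned into a similarity measure, the sketch below maps a ROC AUC of 0.5 (reviewed and unreviewed data are indistinguishable) to a score of 1, and an AUC of 0 or 1 (perfectly separable) to a score of 0.

```python
def balance_score(auc):
    """Hypothetical mapping from bias-classifier AUC to a similarity in [0, 1]."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("auc must be between 0 and 1")
    # Distance from chance level (0.5), rescaled so that chance gives 1.0.
    return 1.0 - 2.0 * abs(auc - 0.5)


for auc in (0.5, 0.75, 1.0):
    print(f"AUC {auc:.2f} -> balance {balance_score(auc):.2f}")
```

The design point is only that the score moves in the opposite direction to the bias model's performance: the better the classifier separates the two groups, the lower the similarity.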
Rebalance
The best way to build an unbiased set of reviewed data is to annotate a random selection of examples. This way, the reviewed labels will match the underlying distribution on average. However, annotating in this way is inefficient, especially for rare concepts.
Instead, Communications Mining uses active learning to speed up the annotating process by targeting the most useful examples. These targeted examples do not always match the underlying data distribution, meaning annotating biases can gradually develop over time.
Active learning is not guaranteed to produce an unbiased set of reviewed examples. However, when annotating bias is detected, we can use the annotating bias model to address any imbalance. This way, we benefit from the reduced training time of active learning and the low annotating bias of random sampling.
This is the role of the Rebalance view, which shows data points that the bias model is confident are unreviewed, and therefore underrepresented in the reviewed data. Annotating these examples provides a quick way of addressing annotating bias in a dataset.
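Selecting examples in this way can be sketched as a ranking by the bias model's predicted probability of being reviewed. The function name, the example IDs, and the probabilities below are hypothetical, and the product's actual selection logic may differ.

```python
def rebalance_candidates(p_reviewed, k):
    """Return the k unreviewed example IDs the bias model is most sure are unreviewed.

    p_reviewed maps an example ID to the bias model's predicted P(reviewed).
    """
    return sorted(p_reviewed, key=p_reviewed.get)[:k]


# Hypothetical bias-model outputs for five unreviewed emails.
p_reviewed = {
    "email-1": 0.62,
    "email-2": 0.04,
    "email-3": 0.35,
    "email-4": 0.01,
    "email-5": 0.48,
}

to_annotate = rebalance_candidates(p_reviewed, k=2)
print(to_annotate)
```

Annotating the lowest-scoring examples first adds reviewed data exactly where the reviewed set differs most from the rest of the dataset.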
To demonstrate how rebalance improves Communications Mining's performance, we simulated users annotating examples following three active learning strategies.
- Random. Annotate a random selection of the unreviewed examples.
- Standard. Annotate examples that Communications Mining is most unsure of, or those with the highest prediction entropy. This is a common approach to active learning, and is equivalent to only using the Teach view in Communications Mining.
- Communications Mining. Follow Communications Mining’s active learning strategy, which suggests the top training actions for improving the current model. This includes the Rebalance view.
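The entropy-based selection in the Standard strategy can be sketched as follows. Treating each label as an independent yes/no prediction and summing the binary entropies is an assumption made here for the multi-label case, and the document IDs and probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def prediction_entropy(label_probs):
    """Total uncertainty of a document: sum of per-label binary entropies."""
    return sum(binary_entropy(p) for p in label_probs.values())


def most_uncertain(predictions, k):
    """IDs of the k documents with the highest prediction entropy."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: prediction_entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical label probabilities for three unreviewed documents.
predictions = {
    "a": {"Payment": 0.99, "Complaint": 0.01},  # confident
    "b": {"Payment": 0.50, "Complaint": 0.50},  # maximally unsure
    "c": {"Payment": 0.90, "Complaint": 0.20},  # somewhat unsure
}

print(most_uncertain(predictions, k=2))
```

Because this strategy only looks at model uncertainty, nothing stops it from repeatedly sampling the same region of the data, which is how annotating bias can build up.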
We ran these simulations on the open-source Reuters dataset provided by NLTK, which contains news articles tagged with one or more of 90 labels. For each run, the same randomly selected initial set of 100 examples was used. For each simulation step, we model users annotating 50 examples selected by the active learning strategy. Communications Mining then retrains and the process is repeated.
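The shape of this simulation loop can be sketched as below. This is not the code used for the experiments: the `train` and `select` callbacks are hypothetical placeholders, the pool is a list of integer IDs rather than Reuters articles, and only the random strategy is shown.

```python
import random


def simulate(pool, select, train, initial_size=100, batch_size=50, steps=3, seed=0):
    """Repeatedly train on the reviewed set, then annotate a new batch."""
    rng = random.Random(seed)
    # A fixed seed gives every strategy the same initial reviewed set.
    reviewed = set(rng.sample(pool, initial_size))
    history = []
    for _ in range(steps):
        model = train(reviewed)
        history.append((len(reviewed), model))
        unreviewed = [doc for doc in pool if doc not in reviewed]
        reviewed.update(select(model, unreviewed, batch_size, rng))
    return history


def random_strategy(model, unreviewed, batch_size, rng):
    """Random baseline: annotate a random selection of unreviewed examples."""
    return rng.sample(unreviewed, batch_size)


def placeholder_train(reviewed):
    """Stand-in for retraining; here the 'model' is just the training set size."""
    return len(reviewed)


history = simulate(list(range(1000)), random_strategy, placeholder_train)
print([n_reviewed for n_reviewed, _ in history])
```

Other strategies would plug in as different `select` functions, so that each one is compared from the same starting point and with the same annotation budget per step.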
The plot below shows the performance of Communications Mining’s annotating model on the Reuters task as more examples are annotated. The balance score is also shown, representing the amount of annotating bias present in the dataset.
Communications Mining’s active learning strategy produces similar balance scores to random sampling, but requires fewer examples to produce the same model performance. This means active learning with Rebalance gives the best of both standard active learning and random sampling: unbiased reviewed examples and good model performance in less time.
- To get accurate model validation scores, annotated data must be representative of the dataset as a whole.
- Communications Mining’s annotating bias model compares reviewed and unreviewed data to spot topics that are underrepresented in the dataset.
- The Rebalance view can be used to quickly address annotating bias in a dataset.
- Communications Mining's active learning leads to less annotating bias than standard approaches, and performs better than random sampling alone.