For the volumes of documents needed, see the Training and Retraining Pipelines section here.
When selecting the documents to be used for training, keep a few details in mind. First, remove garbage pages that do not include any fields of interest, or that include only 1 or 2. You can do this in Data Manager using the Delete button. Pages are not lost; they can always be recovered from the Deleted view.
Second, if your use case involves a highly diverse document type (like invoices or receipts), you need a highly diverse training set. At the same time, the dataset needs to be balanced: avoid having 10 times more documents from one vendor than from another. In general, 2-3 documents (about 4-6 pages, at 2 pages per document on average) from a given layout are enough. If some layouts are very common in your workflow and you want to make sure they are extracted correctly, you may include 5-7 samples (10-15 pages).
However, if your use case involves a document type with a very consistent layout (like a form), you need at least 30 samples of it, because if the training set is too small, the ML model training might fail.
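The balance guidance above can be checked before import with a small script. The following is only a sketch: it assumes you already have one layout or vendor name per document (Data Manager does not provide this list), and the function name `check_balance` and its parameters are hypothetical, chosen for illustration.

```python
from collections import Counter


def check_balance(layout_per_doc, max_ratio=10, min_single_layout=30):
    """Count documents per layout and return (counts, warnings).

    layout_per_doc: one layout/vendor name per document (assumed input).
    max_ratio: flag a layout with this many times more documents than the rarest.
    min_single_layout: minimum samples when the whole set has a single layout.
    """
    counts = Counter(layout_per_doc)
    warnings = []
    if not counts:
        return counts, ["dataset is empty"]

    if len(counts) == 1:
        # Very consistent document type (e.g. a form): needs enough samples.
        layout, n = next(iter(counts.items()))
        if n < min_single_layout:
            warnings.append(
                f"{layout}: only {n} samples, at least {min_single_layout} recommended"
            )
        return counts, warnings

    rarest = min(counts.values())
    for layout, n in sorted(counts.items()):
        if n < 2:
            warnings.append(f"{layout}: only {n} document, 2-3 recommended")
        if n >= max_ratio * rarest:
            warnings.append(
                f"{layout}: {n} documents, {n // rarest}x the rarest layout"
            )
    return counts, warnings


# Example: one vendor dominates and one layout has a single document.
counts, warnings = check_balance(["VendorA"] * 20 + ["VendorB"] * 2 + ["VendorC"])
for message in warnings:
    print(message)
```

The thresholds mirror the numbers in the text (a 10x imbalance, 2-3 documents per layout, 30 samples for a single consistent layout); adjust them to your own workflow.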
You can have multiple people use the same instance to label at the same time only if the following conditions are observed:
- no two users should be labelling the same document at the same time
- whenever fields are added, removed or their configuration is edited, this should be done by one user and all other users should immediately refresh their browser to see the changes. Making changes to fields while other people are labelling will cause unexpected behavior.
When you import a dataset without checking the "Make this a Testset" checkbox on the Import Data dialog, that dataset will be used for training. In this case you only need to focus on the labelling of the words (grey boxes) on the document. If once in a while the text that gets filled in the sidebar fields is not correct, that is not a problem; the ML model will still learn. In some cases, you may need to adjust the configuration of the fields, for instance by checking the Multi-line checkbox. But in general, the main focus is on labelling the words on the page.
There are many situations where a field will appear in multiple places in the same document or even on the same page. These should all be labelled as long as they have the same meaning. An example, from many utility bills, is the total amount. It often appears at the top, and also within a line item list in the middle, and then also in a pay slip at the bottom, which can be detached and sent in the mail with the check. In this situation, all three occurrences would be labelled. This is useful because if there is an OCR error, or the layout is different, and one of them cannot be identified, the model can still identify the other occurrences.
It is important to note that what counts is the meaning of the value, not the value itself. For instance, on some invoices which carry no tax, the net amount and the total amount have the same value. But they are clearly different concepts. Consequently, they should not both be labelled as total amount. Only the one whose meaning is to represent the total amount should be labelled as total amount.
When you import a dataset and you check the "Make this a Testset" checkbox on the Import Data dialog, that dataset will not be used by Training pipelines in AI Fabric, but only by Evaluation pipelines. In this case, it is important that the correct text is filled into the fields in the sidebar (or the top bar in the case of Column fields). This takes much longer to verify for each field, but it is the only way you will get a reliable metric of the accuracy of the ML model you are building.
See below the main actions you need to perform when labeling documents. A given field may be labeled in multiple places on the same page.
- Label field
  - Select words by dragging the mouse (rubber banding) or by clicking on them, holding down Shift to select multiple words.
  - Tap the shortcut key to label the field.
- Remove label
  - Select words, then tap the Delete or Backspace key on your keyboard.
- Group table row
  - After you have labeled some Column fields, and only if some rows span multiple lines of text, you may group them together by using the “/” key to indicate that they are part of the same table row. A green box will appear around the group.
- Ungroup table row
  - Select the group and tap “/” again.
- Make correction to OCR
  - Right-click on the word and edit the text in the tooltip that appears. This is rarely recommended, since in production the OCR will still make those errors. Consequently, it is usually best to just skip and move on.
- Make correction to labeled value
  - Click on the text in the sidebar or the top bar and edit the content. A small lock will appear to indicate the field has been manually edited. This is necessary when labelling test sets.
- Reset labeled value to auto-extracted value
  - Click on the lock, and the field will revert to its auto-extracted value.