Label Documents
For the volume of documents needed, see the Training and Retraining Pipelines section.
When selecting documents for training, keep a few details in mind. First, remove garbage pages, that is, pages that contain no fields of interest or only one or two of them. You can do this in Data Manager using the Delete button. Deleted pages are not lost; they can always be recovered from the Deleted view.
Second, if your use case involves a highly diverse document type (like invoices or receipts), you need a highly diverse training set. At the same time, the dataset needs to be balanced: you should avoid having 10 times more documents from one vendor than from another. In general, 2-3 documents (roughly 4-6 pages at an average of 2 pages per document) from a given layout are enough. If some layouts are very common in your workflow and you want to make sure they are extracted correctly, you may include 5-7 samples (10-15 pages) of each.
However, if your use case involves a document type with a very consistent layout (like a form), you need at least 30 samples of it, because ML model training might fail if the training set is too small.
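The balance guidelines above can be checked with a short script before import. The following is only a sketch under stated assumptions: the list of (document id, layout) pairs is a hypothetical manifest you would maintain yourself, not something Data Manager exports, and the function names and thresholds are illustrative values taken from the guidance above.

```python
from collections import Counter

def balance_report(docs, min_per_layout=2, max_ratio=10):
    """Flag layouts with too few documents or far more than the rarest layout.

    docs is a list of (document_id, layout) pairs (a hypothetical manifest).
    """
    counts = Counter(layout for _, layout in docs)
    smallest = min(counts.values())
    # Layouts below the suggested 2-3 documents per layout.
    under = sorted(name for name, n in counts.items() if n < min_per_layout)
    # Layouts with at least max_ratio times more documents than the rarest one.
    over = sorted(name for name, n in counts.items() if n >= max_ratio * smallest)
    return counts, under, over

def enough_for_fixed_layout(sample_count, minimum=30):
    """Check the minimum sample count suggested for a consistent layout."""
    return sample_count >= minimum

# Illustrative data only: 20 documents from one vendor, 2 and 1 from two others.
documents = (
    [(f"a-{i}", "vendor_a") for i in range(20)]
    + [(f"b-{i}", "vendor_b") for i in range(2)]
    + [("c-0", "vendor_c")]
)

counts, under, over = balance_report(documents)
print("documents per layout:", dict(counts))
print("under-represented:", under)
print("over-represented:", over)
print("25 form samples enough:", enough_for_fixed_layout(25))
```

In this made-up manifest, vendor_a would be trimmed and vendor_c would get at least one more document before training.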
You can have multiple people use the same instance to label at the same time only if the following conditions are observed:
- no two users should be labelling the same document at the same time
- whenever fields are added, removed or their configuration is edited, this should be done by one user and all other users should immediately refresh their browser to see the changes. Making changes to fields while other people are labelling will cause unexpected behavior.
When you import a dataset without checking the "Make this a Testset" checkbox on the Import Data dialog, that dataset will be used for training. In this case you only need to focus on labelling the words (grey boxes) on the document. If the text that gets filled into the sidebar fields is occasionally incorrect, that is not a problem; the ML model will still learn. In some cases, you may need to adjust the configuration of the fields, for instance by checking the Multi-line checkbox. But in general, the main focus is on labelling the words on the page.
There are many situations where a field appears in multiple places in the same document or even on the same page. All occurrences should be labelled as long as they have the same meaning. An example, found on many utility bills, is the total amount. It often appears at the top, again within a line item list in the middle, and once more on a pay slip at the bottom, which can be detached and sent in the mail with the check. In this situation, all three occurrences should be labelled. This is useful because if an OCR error or a different layout prevents one occurrence from being identified, the model can still identify the others.
It is important to note that what counts is the meaning of the value, not the value itself. For instance, on some invoices which carry no tax, the net amount and the total amount have the same value, but they are clearly different concepts. Consequently, they should not both be labelled as total amount. Only the one whose meaning is the total amount should be labelled as total amount.
When you import a dataset and check the "Make this a Testset" checkbox on the Import Data dialog, that dataset will not be used by Training pipelines in AI Fabric, but only by Evaluation pipelines. In this case, it is important that the correct text is filled into the fields in the sidebar (or the top bar in the case of Column fields). This takes much longer to verify for each field, but it is the only way to get a reliable metric of the accuracy of the ML model you are building.
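To see why the field text matters for a testset, consider a simplified exact-match score. This is a sketch only: the metric actually computed by Evaluation pipelines is not described here, and the function name, field names, and values below are all hypothetical.

```python
def field_accuracy(ground_truth, predictions):
    """Share of labelled fields whose predicted text matches exactly.

    Both arguments map a field name to its text; whitespace at the ends is ignored.
    """
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    hits = sum(
        1
        for name, expected in ground_truth.items()
        if predictions.get(name, "").strip() == expected.strip()
    )
    return hits / len(ground_truth)

# Hypothetical model output for one document.
predicted = {"total": "118.00", "net": "100.00", "date": "2021-03-04"}

# Testset labels with correct field text.
labelled = {"total": "118.00", "net": "100.00", "date": "2021-03-04"}

# Same labels, but the text of one field was left wrong in the sidebar.
mislabelled = dict(labelled, total="18.00")

print(field_accuracy(labelled, predicted))
print(field_accuracy(mislabelled, predicted))
```

With correct labels the score reflects the model; with the mislabelled field the score drops even though the model extracted the right value, which is why unverified field text makes the evaluation metric unreliable.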