You must first create a working folder for holding your ML data. This is referenced in all commands documented below.
Run the configuration steps below before launching Data Manager. To change the configuration later (for example, the OCR engine or a user password), stop Data Manager using the docker stop command, run the configuration commands, and then launch Data Manager again. For reference, see the Docker cheat sheet.
A default user with the username admin and the password admin is created automatically.
To create new users, stop the Data Manager container if it is running, use the following command, and then start the Data Manager container again:
docker run --rm -it -p <port_number>:80 -v "<path_to_working_folder>:/app/data" aiflprodweacr.azurecr.io/datamanager:latest --license-agreement accept --user <username> --passw <password>
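The stop, create user, restart sequence described above can be sketched as a small script. This is only an illustration under stated assumptions: the container name `datamanager`, the port `8080`, the folder `$HOME/dm-data`, and the user `alice` are hypothetical placeholders, and the script assumes the container was originally started with a name and without `--rm`, so that `docker start` can bring it back. By default it only prints the commands; set `DRY_RUN=0` to execute them from an interactive terminal (the `-it` flags need one).

```shell
#!/bin/sh
# Sketch: add a Data Manager user. All values below are hypothetical placeholders.
CONTAINER="${CONTAINER:-datamanager}"   # assumed name given at launch (docker run --name ...)
PORT="${PORT:-8080}"                    # assumed host port
WORKDIR="${WORKDIR:-$HOME/dm-data}"     # assumed working folder
IMAGE="aiflprodweacr.azurecr.io/datamanager:latest"
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print the commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = username, $2 = password
add_user() {
    run docker stop "$CONTAINER"
    run docker run --rm -it -p "$PORT:80" -v "$WORKDIR:/app/data" "$IMAGE" \
        --license-agreement accept --user "$1" --passw "$2"
    run docker start "$CONTAINER"
}

add_user alice 'S3cret!'
```

If your container was launched differently (for example with `--rm`), replace the final `docker start` with the launch command you normally use.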
Each user can also change their own password from the Settings -> Password view, accessible through the button at the top right of the screen.
To import documents into Data Manager, you must first configure an OCR service. You can do this from the Settings -> OCR view, accessible through the button at the top right of the screen. The OCR service must be reachable through a URL. Here are the possible URLs you can use:
- public URLs such as https://du.uipath.com/ocr?edition=enterprise or third party URLs from Google Vision OCR or Microsoft Read OCR
- URLs of UiPath Document OCR or Omnipage OCR standalone containers provided by UiPath, deployed on premises
- URLs of OCR ML Package deployed as ML Skills which have been made Public in AI Fabric on premises v2020.10
If you are running the OCR on the same machine as Data Manager, do not use "localhost" to refer to the local machine; use the IP address or domain name of the local machine instead. In the case of URLs of OCR deployed as Public ML Skill in AI Fabric on premises v2020.7 or later, use the URL as it appears in the AI Fabric ML Skill details screen.
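The "localhost" restriction is most likely a consequence of Data Manager running inside a Docker container, where "localhost" refers to the container itself rather than to the host machine. As a rough pre-check before pasting a URL into the Settings view, a helper like the one below can flag loopback addresses. `check_service_url` is a hypothetical function written for this illustration, not part of Data Manager, and port `5000` is a made-up example.

```shell
#!/bin/sh
# Sketch: flag service URLs that point at the loopback interface.
# check_service_url is a hypothetical helper, not part of Data Manager.
check_service_url() {
    # Reduce the URL to its host: drop the scheme, the path, and the port.
    host=${1#*://}
    host=${host%%/*}
    host=${host%%:*}
    case "$host" in
        localhost|127.*|0.0.0.0)
            echo "not usable from inside the container: $1" >&2
            return 1 ;;
        *)
            return 0 ;;
    esac
}

check_service_url "https://du.uipath.com/ocr?edition=enterprise" && echo "ok"
check_service_url "http://localhost:5000/" || echo "use the machine's IP address or domain name instead"
```

The same check applies to any other service URL entered in Data Manager that points at the machine it runs on.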
Choosing the OCR engine to be used for importing documents into Data Manager is a critical decision.
We recommend importing training data (train time) with the same OCR engine that will be used when the model is deployed (run time). Ideally, try a few different engines to see which works best on your documents, and only then make a decision.
The on-premises options are UiPath OCR container, Omnipage OCR container (both available from UiPath) and Microsoft Read container (available as preview from Microsoft) as well as UiPath OCR ML Skills deployed in AI Fabric on premises v2020.10 or later.
UiPath OCR supports the main Western European languages. Microsoft Read supports the main Western European languages plus Japanese and Chinese (Simplified). Omnipage works best on cleanly scanned documents and has the best language coverage.
Cloud based options are UiPath Document OCR (du.uipath.com/ocr), Google Cloud OCR and Microsoft Read Azure OCR. Google Cloud OCR has the best language coverage.
If you already have a model which can extract some of the fields that need labeling, and only a few extra fields require manual labeling, you can save a lot of time by using Data Manager's Prelabelling feature. You can configure Prelabelling from the Settings -> Prelabelling view, accessible through the button at the top right of the screen. The ML model must be reachable through a URL. Here are the possible URLs you can use:
- public URLs such as https://du.uipath.com/ie/invoices or https://du.uipath.com/ie/purchase_orders
- URLs of ML Skills in AI Fabric on premises v2020.4
- URLs of ML Skills which have been made Public in AI Fabric on premises v2020.7
ML Skills in AI Fabric Cloud cannot be used for prelabelling in Data Manager because they are not exposed as URLs. Likewise, ML Skills in AI Fabric on premises v2020.10 deployed in airgapped environments cannot be used for prelabelling.
If you are running the Prelabelling model on the same machine as Data Manager, then do not use "localhost" to refer to the local machine, but rather use the IP address or Domain Name of the local machine. In the case of URLs of Public ML Skills in AI Fabric on premises v2020.7 or later, use the URL as it appears in the AI Fabric ML Skill details screen.
After you activate Prelabelling, a Predict button appears on the top bar in Data Manager. Click it to prelabel the current document.
Enabling SSL encryption is not necessary when running Data Manager on your own machine or on a secure office network. However, if you plan to run Data Manager on a remote server open to the Internet, we strongly suggest enabling it. To do so, obtain the DNS name of the remote server, generate an HTTPS certificate (.crt file) and key (.key file) for that domain name, and place them in a folder called certs on the remote server. Then launch Data Manager using the following command:
docker run -d -p <port_number>:80 -v "<path_to_working_folder>:/app/data" -v "<path_to_certs_folder>:/certs" aiflprodweacr.azurecr.io/datamanager:latest --license-agreement accept --https-certificate /certs/<cert_filename.crt> --https-private-key /certs/<key_filename.key>
In this command, <cert_filename.crt> refers to the name of the .crt file and <key_filename.key> refers to the name of the .key file which you have placed in the certs folder.
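Before launching with HTTPS, it can help to confirm that both files are actually present in the certs folder, since a wrong file name only shows up once the container starts. The sketch below wraps the command above in a check. The function name `launch_https`, the port `443`, the folder paths, and the file names `server.crt` and `server.key` are hypothetical placeholders. By default it only prints the docker command; set `DRY_RUN=0` to execute it.

```shell
#!/bin/sh
# Sketch: verify the certs folder, then launch Data Manager with HTTPS.
# Paths, port, and file names are hypothetical placeholders.
IMAGE="aiflprodweacr.azurecr.io/datamanager:latest"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the docker command instead of running it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = host port, $2 = working folder, $3 = certs folder,
# $4 = certificate file name, $5 = key file name
launch_https() {
    for f in "$3/$4" "$3/$5"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
    done
    run docker run -d -p "$1:80" -v "$2:/app/data" -v "$3:/certs" "$IMAGE" \
        --license-agreement accept \
        --https-certificate "/certs/$4" --https-private-key "/certs/$5"
}

launch_https 443 "$HOME/dm-data" "$HOME/certs" server.crt server.key \
    || echo "fix the certs folder before launching"
```

The check only confirms that the files exist and are readable; it does not validate that the certificate matches the key or the server's domain name.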
To use the Retraining capability in AI Fabric, you need to use a set of fields based on the fields already extracted by the out-of-the-box pretrained models offered by UiPath (Invoice and Receipts extraction). This list of fields is called a schema. To make it easier to get started, we provide the schemas of the out-of-the-box models. These are zip files which you can import into Data Manager just as you would import a dataset: click the Import button at the top of the screen, then select the zip file from the dialog. Data Manager detects that it is a new schema and imports it directly.
The schemas for the pretrained ML models provided by UiPath are available at the following links:
Invoices-Japan ML Model only supports Google Cloud Vision OCR.