Create an Extractor Asset

A Document AI model is a trained and published model that can be consumed as an API and easily integrated with third-party systems.

A Document AI “Extractor Asset” is a trained extraction model that contextually extracts fields and table information from unstructured and structured documents.

Users must have any one of the following policies to create an Extractor Asset:

Administrator Policy
Creator Policy

This guide walks you through the following steps to create your first Extractor Asset.

Create a document set
Create an asset
Select documents
Annotate and train
Review results and validate
Publish the asset

Step 1: Create a document set

The first step in creating an asset is to add documents to the Document library. See the Upload documents section to learn how. If you already have a document set in the Document library, you can skip this step and proceed to Create an asset.

Step 2: Create an asset

You can create assets using our Asset Studio.

Go to the Asset Studio page, click Create Asset, then choose Classic AI.

In the Classic AI window that appears, enter a unique Asset name.
Optional: Enter a brief description and upload an image.
In Document type, you can create a new Document type on the go or select from an existing Document type.
- To create a new Document type, enter the name of the Document type that you wish to create and then press the Enter key.
- To select an existing Document type, search for the Document type and choose from the available results.

Nature of document

This option is only applicable when you create a new Document type.

In the Nature of document option that appears, select one or more of the following options for the Document type. At least one option is required.
- Free flow – Use this option to extract information from unstructured and semi-structured documents.
- ID – Use this option to extract information from documents such as a Driving License, Passport, and more.
- Form – Use this option to extract information from documents such as an Insurance Application form, Bank Account opening form, and more.

Note: You can select multiple options for the Nature of document, as some documents may be a combination of Form, ID, and Free flow formats.

In Asset Visibility, choose one of the following options.
- All Users (default): Choose this option to share the asset with everyone on the platform who has the appropriate permissions to view and manage the asset.
- Private: Choose this option to ensure that only you, the owner, can view and manage the asset.
Click Create and proceed to select documents.
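
The values entered in Step 2 can be pictured as a small record. The sketch below is only an illustration of the choices described above; the names `AssetDraft`, `Nature`, `Visibility`, and `problems` are hypothetical and are not part of the platform or its API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Nature(Enum):
    FREE_FLOW = "Free flow"
    ID = "ID"
    FORM = "Form"


class Visibility(Enum):
    ALL_USERS = "All Users"
    PRIVATE = "Private"


@dataclass
class AssetDraft:
    """Hypothetical record of the values entered in the Classic AI window."""
    name: str
    document_type: str
    natures: set = field(default_factory=set)      # only chosen for a new Document type
    visibility: Visibility = Visibility.ALL_USERS  # the guide lists All Users as the default
    description: str = ""                          # optional

    def problems(self, new_document_type: bool) -> list:
        """Return a list of missing inputs; an empty list means the draft is complete."""
        issues = []
        if not self.name.strip():
            issues.append("Asset name is required")
        if not self.document_type.strip():
            issues.append("Document type is required")
        if new_document_type and not self.natures:
            issues.append("Select at least one Nature of document")
        return issues


draft = AssetDraft(
    name="Invoice extractor",
    document_type="Invoice",
    natures={Nature.FORM, Nature.FREE_FLOW},  # multiple natures may be combined
)
print(draft.problems(new_document_type=True))
```

Modelling Nature of document as a set reflects the note that a single Document type can combine several natures.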

Step 3: Select documents

In the Document Sets section, select or search for the document set for annotation.
The files in the document set will be displayed in the right pane of the page.
To annotate files, check the boxes next to the documents.

Note: Select at least 10 documents per document type for the asset to train. If more documents are available, 25 is the recommended volume because it gives a better accuracy measure.
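
If you prepare document sets with a script before uploading them, a quick check against these two thresholds can be handy. This is a minimal sketch; the function name `selection_status` and its return labels are made up for illustration, and only the numbers 10 and 25 come from the note above.

```python
MIN_DOCS = 10          # minimum per document type, from the note above
RECOMMENDED_DOCS = 25  # recommended volume, from the note above


def selection_status(selected_count: int) -> str:
    """Classify a selection size against the documented thresholds."""
    if selected_count < MIN_DOCS:
        return "too few"
    if selected_count < RECOMMENDED_DOCS:
        return "trainable"
    return "recommended"


for count in (6, 10, 25):
    print(count, selection_status(count))
```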

Click Proceed to open the annotation page.

Step 4: Annotate and train

Data annotation is the process of labeling data to show the outcome you want your machine learning model to predict.

Users must have any one of the following policies to annotate an Extractor Asset:

Administrator Policy
Creator Policy
Annotator Policy

If you created the Document type on the go, follow the steps below to add fields, tables, and sections.

Add field

In the Document type section, click Add new fields.
In the Labels window that appears, click Add Field.
Enter the field name and select the appropriate data type from the drop-down list.
You have the option to choose from various data types to annotate and add to your Extractor Asset. Each data type serves a specific purpose and can be tailored to meet your document processing needs.
- Text: Choose this option if you wish to annotate only textual information against a field.
- Number: Choose this option if you wish to annotate numerical values against a field.
- Datetime: Choose this option if you wish to annotate dates and times against a field.
- Image: Choose this option if you wish to annotate images against a field. This allows for the extraction and handling of image data within documents.
- Currency: Choose this option if you wish to annotate currency-related information against a field.
- Checkbox: Choose this option if you wish to annotate a checkbox against a field.
- Checkbox(Group): Choose this option if you wish to annotate a group of checkboxes against a field.
Click Settings against a field and choose one of the following Expected Label Output options:
- Required once: Choose this option if the field is expected only once in the output, regardless of whether it appears and is annotated once or multiple times in the document.
- Required multiple: Choose this option if the field is expected multiple times in the output, depending on whether it appears and is annotated once or multiple times in the document. This option lets you annotate multiple instances of a field and generate multiple results for it.
Select the PII check box if the field contains personally identifiable information. The field value is then encrypted, which keeps the information in that field secure.
To add more fields, select Add field.
Use the delete icon to delete the field.
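
The difference between the two Expected Label Output options can be pictured with a short sketch. The function `shape_output` is hypothetical, and the guide does not say which instance is kept under Required once; keeping the first instance is an assumption made only for this example.

```python
def shape_output(values: list, expected: str):
    """Shape the annotated instances of one field according to its Expected Label Output.

    expected is "once" or "multiple" (hypothetical labels for the two options).
    """
    if expected == "once":
        # Assumption: the first instance is kept; the guide only says one value is output.
        return values[0] if values else None
    if expected == "multiple":
        return list(values)
    raise ValueError(f"unknown Expected Label Output: {expected!r}")


instances = ["INV-001", "INV-001-A"]
print(shape_output(instances, "once"))      # a single value
print(shape_output(instances, "multiple"))  # a list with every instance
```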

Add table

In the Labels window, click Add table.
Enter the table name and click Add field.
Enter the field name and select the appropriate data type from the drop-down list.
Click Settings against the table and choose one of the Expected Label Output options.
Select the PII check box if the table contains personally identifiable information in order to encrypt the table values. This option ensures data security of the table information.
To add more tables, select Add table.
Use the delete icon to delete the table.
The fields and tables you added are displayed in the right panel of the page. You are now ready to annotate your documents.
If you have selected an existing Document type, the fields and table headers will be displayed on the right panel and you can begin annotating the fields and tables.

Add section

Section and group is a feature that allows users to extract a group of fields and tables as they appear in the documents.
For more information on how to add a section, see the Add Section and Group page.

Annotate fields

The platform allows you to annotate a field using the following options:

Auto Annotation
Manual Annotation

Auto Annotation

Auto annotation labels data automatically. It extracts field information from the documents without manual effort, which makes annotation faster and more efficient.

Note: Auto Annotation is available only for text fields.

On the Annotation page, click Auto Annotate.
In the Auto Annotate window that appears, choose any one of the following options:
- Selected Documents: Choose this option if you wish to auto annotate only the selected documents.
- All Documents: Choose this option if you wish to auto annotate all the documents.
Click Start to initiate the auto annotation process.
After auto annotation is completed, quickly review the extracted data to ensure its accuracy, as auto annotation may occasionally make incorrect predictions. This step is crucial for verifying that the correct data has been captured.

Manual Annotation

Manual annotation is the process of labelling data by hand: you annotate each value against a specific field yourself.

On the Annotation page, select the field to annotate in the right pane and locate the target text in the document. Click at the top-left corner of the text and draw a bounding box, making sure the text is completely enclosed within the box.
Once the box is drawn, the text within the box will appear in the right pane.

Note: You can annotate text, number, date and time, image, and currency fields in the same way.

To add multiple instances of a value for a single field, click the add icon. This allows you to include additional occurrences of the target text for that field.
If the value is not extracted as intended, click on the delete symbol to remove the annotation and restart the annotation process for the same field.
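
The requirement that the text be completely enclosed within the box amounts to a simple geometric containment test. The sketch below illustrates that idea only; the `Box` type and the coordinate convention (origin at the top-left of the page, y increasing downward) are assumptions, not a description of how the platform stores annotations.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; assumes the origin is the page's top-left corner."""
    left: float
    top: float
    right: float
    bottom: float


def encloses(drawn: Box, text: Box) -> bool:
    """Return True if the drawn box fully contains the text's box."""
    return (
        drawn.left <= text.left
        and drawn.top <= text.top
        and drawn.right >= text.right
        and drawn.bottom >= text.bottom
    )


text_box = Box(left=100, top=200, right=260, bottom=220)
good = Box(left=95, top=195, right=265, bottom=225)
clipped = Box(left=95, top=195, right=240, bottom=225)  # cuts off the end of the text
print(encloses(good, text_box), encloses(clipped, text_box))
```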

Annotate table

The table annotation feature allows users to extract table information from documents. For more information about how to annotate tables, see Annotate a Table.

Annotate section

The Section and Group feature allows users to extract a group of repeating fields and tables from documents with ease.
For more information on how to annotate section and group, see Annotate Section and Group.

Train

Once the annotation is done for all documents, click on Train.
You can view the annotation summary, which provides an overview of the following details:
- Document status: This shows the number of documents annotated and not annotated.
- Field annotation: This shows the number of annotations per field.
- Tables: This shows the number of tables annotated.

Click on Proceed training to initiate asset training.
While training is in progress, you can go back to the Asset Studio, where your asset appears as a unique entry with the status “Training in progress”. Once training finishes, the status changes to “Training completed”, and you can open the asset from the Asset Studio to review the results.
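
The three figures in the annotation summary are straightforward counts. As a rough sketch of what each one measures, assume a hypothetical in-memory format where every document is a dict with `fields` (field name to list of annotated values) and `tables` (list of annotated table names); this format and the function name `annotation_summary` are illustrative only.

```python
from collections import Counter


def annotation_summary(documents: list) -> dict:
    """Count annotated documents, annotations per field, and annotated tables."""
    per_field = Counter()
    annotated = 0
    tables = 0
    for doc in documents:
        if doc["fields"] or doc["tables"]:
            annotated += 1
        for name, values in doc["fields"].items():
            per_field[name] += len(values)
        tables += len(doc["tables"])
    return {
        "annotated": annotated,
        "not_annotated": len(documents) - annotated,
        "field_annotations": dict(per_field),
        "tables": tables,
    }


docs = [
    {"fields": {"Invoice No": ["A1"], "Total": ["10", "12"]}, "tables": ["Items"]},
    {"fields": {"Invoice No": ["B2"]}, "tables": []},
    {"fields": {}, "tables": []},  # not yet annotated
]
print(annotation_summary(docs))
```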

Note: During the training phase, the documents are split into an 80:20 ratio, with 80% of the documents used for training and the remaining 20% for testing. The asset effectively learns from the provided training documents to develop a predictive model for identifying and extracting field information. It leverages the knowledge gained from these training documents to accurately extract the required fields from the test documents.
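
To get a feel for what an 80:20 split means for a given selection size, here is a minimal sketch. How the platform actually shuffles, rounds, or assigns documents is not documented here, so the shuffling, the rounding, and the name `split_documents` are all assumptions.

```python
import random


def split_documents(doc_ids, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle the documents and split them into training and test subsets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)         # fixed seed keeps the example repeatable
    cut = round(len(ids) * train_fraction)   # assumption: round to the nearest document
    return ids[:cut], ids[cut:]


train, test = split_documents(range(10))
print(len(train), "for training,", len(test), "for testing")
```

With the minimum of 10 documents, only 2 are held out for testing under this assumption, which is one way to see why a larger selection gives a better accuracy measure.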

To save and export the annotations for later use, click Export. You can store the annotated data and reuse it as a template for similar tasks or assets you create in the future.

To import annotations, click Import to bring in previously saved annotated data and apply it to assets you create in the future.

Step 5: Review results and validate

Review results

Click on the Asset in the Asset studio listing page and you will be directed to the Accuracy Results page.
You can view the accuracy percentage which is a metric used to evaluate the performance of the asset.
You can gain a comprehensive overview of the total documents used, categorized based on their purpose for training and testing.
You can view the complete list of documents used for testing the asset.
You can also review the predicted fields against the annotated fields and compare the results. The results are provided under 3 categories, namely:
1. Predicted correctly, where the annotated and predicted fields match
2. Predicted incorrectly, where the annotations and prediction do not match
3. Not predicted, where fields were not predicted by the asset
For each prediction, you will find a confidence score that indicates how confident the model is in that prediction, based on the training provided.
If the accuracy of the asset is lower than you expect, click Fine-tune to improve it. Fine-tuning involves adding more document samples with ample variations to the existing training data to improve the asset’s performance. For more information about Fine-tune, see Fine-tune an Extractor Asset.
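
The three result categories can be expressed as a comparison between annotated and predicted values. The sketch below is illustrative: the function names are hypothetical, and the accuracy formula (share of fields predicted correctly) is an assumption, since this guide does not define how the platform computes its accuracy percentage.

```python
def categorize(annotated: dict, predicted: dict) -> dict:
    """Label each annotated field with one of the three result categories."""
    result = {}
    for field_name, truth in annotated.items():
        guess = predicted.get(field_name)
        if guess is None:
            result[field_name] = "not predicted"
        elif guess == truth:
            result[field_name] = "predicted correctly"
        else:
            result[field_name] = "predicted incorrectly"
    return result


def accuracy_percent(categories: dict) -> float:
    """Assumed metric: percentage of fields that were predicted correctly."""
    if not categories:
        return 0.0
    correct = sum(1 for c in categories.values() if c == "predicted correctly")
    return 100.0 * correct / len(categories)


annotated = {"Name": "Jane Doe", "DOB": "1990-01-01", "ID No": "X123", "Expiry": "2030-05-05"}
predicted = {"Name": "Jane Doe", "DOB": "1990-01-10", "ID No": "X123"}
categories = categorize(annotated, predicted)
print(categories)
print(accuracy_percent(categories))
```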

Validate

To test the performance of the extraction asset on a new set of documents, use Validate.

Click on Validate placed next to Review Results.
Select a new document, preferably one that was not used during training, and click Proceed to initiate validation.
Once the validation is completed, you can see the accuracy against each field.

Step 6: Publish the asset

If the desired accuracy has been achieved, click Publish to open the publishing page.
Enter the name and description for the asset.
Optional: Upload a sample image for a visual representation.
Click Publish. The status of the asset changes to Published, and the asset can be accessed in the Asset Studio.

Note: Once the asset is published, you can download the API and its documentation. The API can be invoked independently or used within a specific use case. If you wish to consume this asset via API, see the Consume an Asset via API page.

If you wish to consume multiple versions of an asset, use URL aliases. They let you consume the different versions through a single API. For more information, see URL aliases.
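
Conceptually, an alias is a stable name that can be re-pointed at a different asset version without callers changing anything. The sketch below models only that idea; `AliasTable`, the alias name, and the version labels are hypothetical and do not reflect the platform's actual alias configuration or API.

```python
class AliasTable:
    """Hypothetical lookup from a stable alias to the asset version it currently serves."""

    def __init__(self):
        self._targets = {}

    def point(self, alias: str, version: str) -> None:
        """Point (or re-point) an alias at an asset version."""
        self._targets[alias] = version

    def resolve(self, alias: str) -> str:
        """Return the version a caller reaches through this alias."""
        return self._targets[alias]


table = AliasTable()
table.point("invoice-extractor", "v1")
first = table.resolve("invoice-extractor")
table.point("invoice-extractor", "v2")   # callers keep using the same alias
second = table.resolve("invoice-extractor")
print(first, second)
```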

You can also consume this asset in the Asset Monitor module. For more information, see the Consume an Asset via Create Transaction page.