The first step in creating an asset is to add documents to the Document library.
Users must have any one of the following policies to upload documents:
- Administrator Policy
- Creator Policy
This guide walks you through setting up your first document set. A document set is a collection of documents used to train and test an asset.
Step 1: Know the accepted document formats & storage limitations
Before uploading documents, check that they meet the following accepted formats and storage limitations (a minimal pre-check sketch follows this list).
- Supported file formats: PDF, PNG, JPEG, TIFF, DOCX.
- Maximum document size: 20 MB.
- Ensure that the file is not password protected and not zipped.
- Make sure images and documents have a resolution of at least 200 DPI (300 DPI is recommended).
- Upload documents without watermarks.
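As a convenience, the following is a minimal pre-check sketch in Python for screening files locally before you upload them. The folder name and the helper are illustrative assumptions rather than part of the product, and the sketch covers only the format and size checks; password protection, resolution, and watermarks still need to be verified separately.

```python
from pathlib import Path

# Accepted formats and limits listed in Step 1 of this guide.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg", ".tiff", ".docx"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # 20 MB

def precheck(path: Path) -> list[str]:
    """Return the reasons a file would be rejected (empty list if it looks fine)."""
    problems = []
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {path.suffix or 'no extension'}")
    if path.stat().st_size > MAX_SIZE_BYTES:
        problems.append("exceeds the 20 MB limit")
    return problems

# "to_upload" is a hypothetical local folder holding the files you plan to import.
for f in sorted(Path("to_upload").glob("*")):
    if f.is_file():
        issues = precheck(f)
        print(f"{f.name}: {'OK' if not issues else ', '.join(issues)}")
```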
Step 2: Create a document set
- Head to the Document Library module and then click Create document set.
- In the Create document set window that appears, enter a unique document set name and a brief description of the document set.
- Click Create to create a new document set.
Step 3: Upload files
You can upload documents using the following options:
- Manual Upload: Upload documents from your local system.
- Connector Upload: Configure the Amazon S3 connector to fetch documents from AWS and store them in the Document Library.
- Web Crawler Upload: Fetch web pages and documents from websites and store them in the Document Library.
Manual Upload
- On the Document Library page, select the document set you wish to import the documents into.
- In the Document set page that appears, click Import and then select the Files option to import the required documents from your local system.
Connector Upload
- On the Document Library page, select the document set you wish to import the documents into.
- In the Document set page that appears, click Import and then select the Amazon S3 option.
- In the Amazon S3 connector window that appears, enter the required details (a sketch for verifying the bucket and path outside the product follows this list):
  - In Choose your connection, select the connection that you wish to use.
  - In Bucket Name, enter the bucket name.
  - In Folder or file path, enter the folder or file path.
  - Click Add metadata to categorize the documents and make relevant information easier to retrieve.
- Optional: Use Test connection to test the connection.
- Click Start import to import the documents via the Amazon S3 connector.
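Before filling in the connector window, it can help to confirm that the bucket name and folder path actually resolve to the documents you expect. Here is a minimal sketch using boto3, the AWS SDK for Python; the bucket name and prefix are placeholders, and this check runs entirely outside the product, using whatever AWS credentials are configured in your environment.

```python
import boto3

# Placeholders: substitute the values you plan to enter in Bucket Name
# and Folder or file path in the Amazon S3 connector window.
BUCKET = "my-documents-bucket"
PREFIX = "contracts/2024/"

# Uses credentials from your environment (e.g. set up with `aws configure`).
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=25)

for obj in response.get("Contents", []):
    size_mb = obj["Size"] / (1024 * 1024)
    print(f"{obj['Key']}  ({size_mb:.2f} MB)")
```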
Web crawler Upload
- On the Document Library page, select the Document set you wish to import the documents into.
- In the Document set page that appears, click Import and then select the Web Crawler option.
- In the Web Crawler window that appears, select the Custom option.
- In URL, enter the URL that you wish to crawl.
- Optional: Add additional URLs if you wish to fetch web pages from other domains. For example, if you entered “abcd.com” as the first URL, you can add other domains (abcd.in, abcd.org) as additional URLs.
Note: It is recommended to provide valid URLs and related domains.
- In Scrap level, enter the depth to which you wish to fetch information from the website. The scrap level is the depth of pages that the web crawler will visit during its crawling process; it determines how deep into a website’s structure the crawler goes to gather information. For example, if you are crawling a news website and set the scrap level to 2, the crawler visits the homepage (level 0), follows links to articles (level 1), and may follow links within those articles to other pages (level 2). It will not go deeper than the specified level.
- In Maximum URLs, enter the total number of hyperlinks the web crawler will fetch information from. The maximum URLs value limits how many URLs the crawler processes during its crawling operation; this keeps the crawler’s workload under control and prevents it from endlessly crawling an excessively large number of URLs. For example, if you are crawling a bank’s website and set the maximum URLs to 100, the crawler stops after it has visited 100 different pages on the site. Both limits are illustrated in the sketch after these steps.
- Click Start import to import the web pages into the Document library.
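To make the Scrap level and Maximum URLs settings more concrete, here is a minimal breadth-first crawler sketch in Python that enforces both limits. It is only an illustration of how a depth limit and a page-count limit interact; the product’s crawler may implement these settings differently.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url: str, scrap_level: int, max_urls: int) -> list[str]:
    """Breadth-first crawl limited by depth (scrap level) and page count (maximum URLs)."""
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth); the start page is level 0
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue                   # skip pages that fail to load
        visited.append(url)
        if depth >= scrap_level:       # do not follow links beyond the configured level
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited

# Example: the homepage is level 0, its links are level 1, and links from
# those pages are level 2; crawling stops after 100 pages in any case.
pages = crawl("https://abcd.com", scrap_level=2, max_urls=100)
print(f"Fetched {len(pages)} pages")
```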