Module 1: Preparation Guide
LEARNING MODULE
Overview
The Amazon Textract and .NET Workloads badge demonstrates proficiency with the Amazon Textract service and .NET workloads. This preparation guide explains what you need to know to pass the assessment, topic by topic, with resources you can review. You should also have hands-on experience using the service, either with your own applications or an AWS tutorial.
Once you have prepared, advance to Module 2 to take the assessment exam.
Purpose
Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents such as PDFs and images. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables, with no manual effort.
Video: What is Amazon Textract?
Benefits
With Textract, you can realize these benefits:
- Drive higher business efficiency and faster decision making while reducing costs
- Extract key insights with high accuracy from virtually any document
- Scale up or scale down the document processing pipeline to quickly adapt to market demands
- Automate data processing securely with data privacy, encryption, and compliance standards
Capabilities
Textract’s capabilities include:
- Integration of document text detection into your apps. Textract removes the complexity of building text detection capabilities into your applications by making powerful and accurate analysis available with a simple API.
- Scalable document analysis. Textract enables you to analyze and extract data quickly from millions of documents, which can accelerate decision making.
- Multiple languages. Textract supports English, Spanish, German, Italian, French, and Portuguese.
- Multiple document formats. Textract can process PDF, TIFF, JPEG, and PNG documents.
Pricing
You should be familiar with the Amazon Textract pricing model and free tier. With Textract, you pay only for what you use. There are no minimum fees and no upfront commitments. Textract charges only for pages processed, whether you extract text, text with tables, form data, or query results, or process invoices and identity documents.
- Varying rates by API. Textract contains 5 APIs (Detect Document Text, Analyze Document, Analyze Expense, Analyze ID, Analyze Lending), each with specific rates charged per 1,000 pages.
- Rates can vary between AWS regions.
- You pay a reduced rate after you meet a monthly threshold. Once you meet an API’s monthly threshold, you pay a lower rate for the remainder of the month. For example, the Detect Document Text API charges less after your first million pages in a month. The thresholds and rates are different for each API.
- The AWS Free Tier lasts 3 months and gives you a varying number of pages free per API. For example, you get 1,000 pages/month free for the Detect Document Text API and 100 pages/month for the Analyze Expense API.
- You can use the AWS Pricing Calculator from the pricing page to estimate your costs.
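The reduced-rate threshold described above comes down to simple arithmetic. The following is a minimal sketch, assuming a two-tier rate for a single API. The threshold and both rates are illustrative placeholders, not values quoted from the pricing page, and EstimateMonthlyCost is a hypothetical helper name.

```csharp
using System;

// Hypothetical values for illustration only; check the Textract pricing page for real rates.
const long TierThresholdPages = 1_000_000;  // pages billed at the first-tier rate each month
const decimal FirstTierPer1000 = 1.50m;     // assumed USD per 1,000 pages up to the threshold
const decimal SecondTierPer1000 = 0.60m;    // assumed USD per 1,000 pages beyond the threshold

// Estimates one month's charge for a single API under a two-tier rate.
decimal EstimateMonthlyCost(long pages)
{
    long firstTierPages = Math.Min(pages, TierThresholdPages);
    long secondTierPages = Math.Max(0, pages - TierThresholdPages);
    return firstTierPages / 1000m * FirstTierPer1000
         + secondTierPages / 1000m * SecondTierPer1000;
}

Console.WriteLine(EstimateMonthlyCost(250_000));    // every page falls in the first tier
Console.WriteLine(EstimateMonthlyCost(1_500_000));  // 1,000,000 pages at tier one, 500,000 at tier two
```

The sketch ignores the free tier and regional differences. For a real estimate, use the AWS Pricing Calculator.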
Use Cases
The following are common use cases for using Amazon Textract:
- Creating an intelligent search index. Using Textract, you can create libraries of text detected in image and PDF files.
- Using intelligent text extraction for natural language processing (NLP). Textract provides you with control over how text is grouped as an input for NLP applications. It can extract text as words and lines. It also groups text by table cells if document table analysis is enabled.
- Accelerating the capture and normalization of data from different sources. Textract enables text and tabular data extraction from a wide variety of documents, such as financial documents, research reports, and medical notes.
- Automating data capture from forms. Textract enables structured data to be extracted from forms. With the APIs, you can build extraction capabilities into existing business workflows so that user data submitted through forms can be extracted into a usable format.
- Automating document classification and extraction. With Textract's Analyze Lending document processing API, you can automate the classification of lending documents into various document classes, and then automatically route the classified pages to the correct analysis operation for further processing.
Industry use cases for Textract include the following:
- Financial services: Accurately extract critical business data such as mortgage rates, applicant names, and invoice totals across a variety of financial forms to process loan and mortgage applications in minutes.
- Healthcare and life sciences: Better serve your patients and insurers by extracting important patient data from health intake forms, insurance claims, and pre-authorization forms. Keep data organized and in its original context, and eliminate manual review of output.
- Public sector: Easily extract relevant data from government-related forms such as small business loans, federal tax forms, and business applications with a high degree of accuracy.
Developer Guide - What is Amazon Textract?
Amazon Textract Product Detail Page - Use cases
Features
You should understand these features:
1. Optical character recognition. Textract uses optical character recognition (OCR) to automatically detect printed text, handwriting, and numbers in a scan or rendering of a document, such as a legal document or a scan of a book.
Developer Guide - Detecting Text
2. Analyze lending. Textract’s Analyze Lending API is a managed, preconfigured intelligent document processing API that fully automates the extraction of information from loan packages. You simply upload mortgage loan documents to the Analyze Lending API and its prebuilt machine learning models will classify and split the document package by document type.
Developer Guide - Analyze Lending
3. Form extraction. You can detect key-value pairs in document images automatically and retain the context without manual intervention. A key-value pair is a set of linked data items. For instance, in a document, the field “First Name” is the key and “Jane” is the value. This makes it easy to import the extracted data into a database or provide it as a variable in an application.
Developer Guide - Analyzing Documents - Form Extraction
4. Table extraction. Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are largely composed of structured data, such as financial reports or medical records with tables in columns and rows. You can load the extracted data into a database using a predefined schema. For example, rows of item numbers and quantities in an inventory report will retain their association so an inventory management application can easily increment item totals.
5. Signature detection. Textract provides the ability to detect signatures on any document or image. This makes it easy to automatically detect signatures on documents such as checks, loan application forms, and claims forms. The location of the signatures and associated confidence scores are included in the API response.
Developer Guide - Analyzing Documents - Signatures
6. Query-based extraction. Textract provides you with the flexibility to specify the data you need to extract from documents using queries. You can specify the information you need in the form of natural language questions (e.g., “What is the customer name?”) and receive the exact information (e.g., “John Doe”) as part of the API response. You do not need to know the data structure in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Textract Queries are pre-trained on a large variety of documents including paystubs, bank statements, W-2s, loan application forms, mortgage notes, claims documents, and insurance cards. The flexibility that Textract Queries provides reduces the need to implement post-processing, rely on manual reviews of extracted data, or train ML models. Query-based extraction is available only for documents in English.
Developer Guide - Analyzing Documents - Queries
7. Handwriting recognition. Many documents, such as medical intake forms and employment applications, include both handwritten and printed text. Amazon Textract can extract both from documents written in English with high confidence scores, whether the text is free-form or embedded in tables. Documents can also contain a mix of typed text and handwritten text.
Developer Guide - What is Amazon Textract?
8. Invoices and receipts. Invoices and receipts can have a wide variety of layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts and automatically extracts relevant data such as vendor name, invoice number, item prices, total amount, and payment terms. When you submit an invoice or a receipt to the AnalyzeExpense API, it returns a series of ExpenseDocument objects. Each ExpenseDocument is further separated into LineItemGroups and SummaryFields.
Developer Guide - Analyzing Invoices and Receipts
Invoice and Receipt Response Objects
9. Identity documents. Textract uses machine learning (ML) to understand the context of identity documents such as U.S. passports and driver’s licenses without the need for templates or configuration. You can automatically extract specific information such as date of expiry and date of birth, as well as intelligently identify and extract implied information such as name and address. Using Analyze ID, businesses providing ID verification services and those in finance, healthcare, and insurance can easily automate account creation, appointment scheduling, employment applications, and more by allowing customers to submit a picture or scan of their identity document.
Developer Guide - Analyzing Identity Documents
10. Built-in human review workflow. Textract is directly integrated with Amazon Augmented AI (A2I) so you can easily implement human review of printed text and handwriting extracted from documents. Choose a confidence threshold for your application, and all predictions with a confidence below the threshold are automatically sent to human reviewers for validation. You can also specify which key-value pairs should be sent for human review and configure A2I to send randomly selected documents for review as well.
AWS SDK for .NET
Use the AWS SDK for .NET to interact with Textract from .NET code. You should know the primary SDK classes and methods used to support the capabilities listed above under Features.
- To use the SDK, add the AWSSDK.Textract NuGet package to your C# project.
- To work with Textract, instantiate an instance of AmazonTextractClient and call its methods.
- Some SDK methods, with names ending in Async, are called asynchronously with the C# await keyword.
- Use the standard SDK pattern of creating request objects to pass to methods and processing the response objects returned. The SDK documentation for a method describes its request and response objects. Request and response objects have the same root name as the method they support. For example, the request and response objects for the DetectDocumentTextAsync method are named DetectDocumentTextRequest and DetectDocumentTextResponse.
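using System;
using System.IO;
using Amazon;
using Amazon.Textract;
using Amazon.Textract.Model;

// Create a client for the US East (N. Virginia) Region; it is disposed when the block ends.
using (var textractClient = new AmazonTextractClient(RegionEndpoint.USEast1))
{
    // Load the local image into memory so it can be sent inline with the request.
    var bytes = File.ReadAllBytes("example.png");
    Console.WriteLine("Detect Document Text");
    var detectResponse = await textractClient.DetectDocumentTextAsync(new DetectDocumentTextRequest
    {
        Document = new Document
        {
            Bytes = new MemoryStream(bytes)
        }
    });
    // Print the type and text of every block returned for the page.
    foreach (var block in detectResponse.Blocks)
    {
        Console.WriteLine($"Type {block.BlockType}, Text: {block.Text}");
    }
}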
Synchronous and Asynchronous Operations
Textract operations are grouped into "synchronous" and "asynchronous" types. This has no relation to C# async methods.
- "Synchronous" operations return results in near real-time. They are for detecting and analyzing text in single-page documents.
- "Asynchronous" operations run in the background. They are for multipage document processing. For example, a PDF file with over 1,000 pages takes a long time to process, but processing the PDF file asynchronously allows your application to complete other tasks while the operation completes. The names of the methods that begin an asynchronous operation start with the word "Start", such as StartDocumentAnalysis.
Developer Guide - Processing Documents with Synchronous Operations
Developer Guide - Processing Documents with Asynchronous Operations
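An asynchronous operation is typically started with a Start method, which returns a job ID, and then checked with the matching Get operation until the job leaves the IN_PROGRESS state. The following is a minimal sketch of that polling loop. To keep it runnable without AWS credentials, the status call is abstracted behind a hypothetical delegate (getStatus) and exercised with a simulated job. In real code the delegate would wrap a call such as GetDocumentAnalysis and return its job status. WaitForJobAsync and fakeStatus are illustrative names, not SDK members.

```csharp
using System;
using System.Threading.Tasks;

// Polls a status source until the job leaves IN_PROGRESS or the attempt budget runs out.
// getStatus is a hypothetical stand-in for a call such as GetDocumentAnalysis.
async Task<string> WaitForJobAsync(Func<Task<string>> getStatus, int maxAttempts, TimeSpan delay)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        string status = await getStatus();
        if (status != "IN_PROGRESS")
        {
            return status; // for example SUCCEEDED, FAILED, or PARTIAL_SUCCESS
        }
        await Task.Delay(delay);
    }
    throw new TimeoutException($"Job still running after {maxAttempts} status checks.");
}

// Simulated job that reports IN_PROGRESS twice and then SUCCEEDED.
int calls = 0;
Func<Task<string>> fakeStatus = () =>
    Task.FromResult(++calls < 3 ? "IN_PROGRESS" : "SUCCEEDED");

string finalStatus = await WaitForJobAsync(fakeStatus, maxAttempts: 10, delay: TimeSpan.FromMilliseconds(10));
Console.WriteLine($"Job finished with status {finalStatus} after {calls} checks");
```

Polling is only one option. The Developer Guide also describes receiving a completion notification through Amazon SNS, which avoids repeated status calls.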
Lines and Words of Text
Textract operations return detected text in a list of Block objects. These objects represent lines of text or textual words that are detected on a document page. A list of PAGE, LINE, and WORD objects is returned with parent-child relationships.
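The parent-child structure can be walked by indexing blocks by ID and following each parent's child IDs. The following is a minimal sketch that uses plain tuples as a simplified stand-in for the SDK's Block type, so it runs without the SDK. In the SDK, child IDs are found in a block's Relationships list under the CHILD relationship type. The sample IDs and text are made up for illustration, and ChildTexts is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the SDK's Block type: (Id, BlockType, Text, ChildIds).
var blocks = new List<(string Id, string BlockType, string Text, string[] ChildIds)>
{
    ("p1", "PAGE", "", new[] { "l1" }),
    ("l1", "LINE", "Hello Textract", new[] { "w1", "w2" }),
    ("w1", "WORD", "Hello", Array.Empty<string>()),
    ("w2", "WORD", "Textract", Array.Empty<string>()),
};

// Index blocks by Id so child references can be resolved quickly.
var byId = blocks.ToDictionary(b => b.Id);

// Returns the text of each child block of the given parent, in listed order.
List<string> ChildTexts(string parentId) =>
    byId[parentId].ChildIds.Select(id => byId[id].Text).ToList();

// Print every LINE together with the WORD blocks it contains.
foreach (var line in blocks.Where(b => b.BlockType == "LINE"))
{
    string words = string.Join(", ", ChildTexts(line.Id));
    Console.WriteLine($"LINE '{line.Text}' contains WORDS: {words}");
}
```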
Bounding Boxes
Textract operations return the location and geometry of items found on a document page. All extracted data is returned with bounding box coordinates—polygon frames that encompass each piece of identified data, such as a word, a line, a table, or individual cells within a table. This helps you audit where a word or number came from in the source document and guides you when search results provide scans of original documents. For example, when searching medical records for patient history details, you can easily find the source document and take note for future searches.
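Textract reports bounding box values as ratios of the overall page width and height, so drawing or cropping a region requires scaling them to the image size. The following is a minimal sketch of that conversion. ToPixels is a hypothetical helper, and the box values and page dimensions are made up for illustration.

```csharp
using System;

// Converts a bounding box given as ratios of the page size (0.0 to 1.0) into pixel coordinates.
(int X, int Y, int Width, int Height) ToPixels(
    float left, float top, float width, float height, int pageWidthPx, int pageHeightPx)
{
    return ((int)Math.Round(left * pageWidthPx),
            (int)Math.Round(top * pageHeightPx),
            (int)Math.Round(width * pageWidthPx),
            (int)Math.Round(height * pageHeightPx));
}

// Hypothetical box around a word on a 1000 x 2000 pixel page image.
var box = ToPixels(left: 0.25f, top: 0.5f, width: 0.125f, height: 0.0625f,
                   pageWidthPx: 1000, pageHeightPx: 2000);
Console.WriteLine($"x={box.X}, y={box.Y}, width={box.Width}, height={box.Height}");
```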
Adjustable Confidence Thresholds
When extracting information from documents, Textract returns a confidence score for everything it identifies so you can make informed decisions about how to use the results. For instance, if you extract information from tax records and want to ensure high accuracy, you can flag any item with a confidence score below 95% to be reviewed by a human. You can set a lower threshold for other documents where errors would have fewer negative consequences, such as when processing resumes or digitizing archived records.
Developer Guide - Best Practices for Amazon Textract - Use Confidence Scores
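The threshold approach described above can be sketched as a simple partition of results into accepted items and items flagged for review. The following is a minimal sketch that uses tuples in place of the SDK's Block type. Confidence values are on a 0 to 100 scale, as Textract reports them. The sample data, the 95 threshold, and the SplitByConfidence helper are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for detected items: (Text, Confidence), with confidence from 0 to 100.
var detections = new List<(string Text, float Confidence)>
{
    ("Invoice", 99.2f),
    ("Total", 97.5f),
    ("$1,284.00", 91.3f),
    ("Jane Doe", 78.0f),
};

// Splits items into those accepted automatically and those flagged for human review.
(List<string> Accepted, List<string> NeedsReview) SplitByConfidence(
    IEnumerable<(string Text, float Confidence)> items, float threshold)
{
    var accepted = new List<string>();
    var needsReview = new List<string>();
    foreach (var item in items)
    {
        if (item.Confidence >= threshold)
        {
            accepted.Add(item.Text);
        }
        else
        {
            needsReview.Add(item.Text);
        }
    }
    return (accepted, needsReview);
}

var result = SplitByConfidence(detections, threshold: 95f);
Console.WriteLine($"Accepted: {result.Accepted.Count}, needs review: {result.NeedsReview.Count}");
```

Lowering the threshold argument sends fewer items to review, which may suit less sensitive documents.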
Handling Throttled Calls and Dropped Connections
A Textract operation can fail if you exceed the maximum number of transactions per second (TPS), causing the service to throttle your application, or when your connection drops. You can manage throttling and dropped connections by automatically retrying the operation. Specify the number of retries by including the Config parameter when you create the Amazon Textract client. AWS recommends a retry count of 5. The AWS SDK retries an operation the specified number of times before failing and throwing an exception.
Developer Guide - Handling Throttled Calls and Dropped Connections
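The retry count can be set on the client configuration object. The following is a minimal configuration sketch, assuming the AWSSDK.Textract NuGet package is installed and credentials are available from the default credential chain. It uses the SDK's AmazonTextractConfig class and its MaxErrorRetry property. The Region is an assumption carried over from the earlier example.

```csharp
using Amazon;
using Amazon.Textract;

// Requires the AWSSDK.Textract NuGet package.
var config = new AmazonTextractConfig
{
    RegionEndpoint = RegionEndpoint.USEast1, // assumed Region, matching the earlier example
    MaxErrorRetry = 5                        // retry count recommended in the text above
};

// The client retries failed calls up to MaxErrorRetry times before throwing.
using var textractClient = new AmazonTextractClient(config);
```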
Amazon Textract endpoints and quotas
Quotas
Your use of Amazon Textract is subject to quotas. There are two kinds of quotas:
- Set quotas cannot be changed. These include accepted file formats, file size and page count limits, PDF-specific limits, image size and rotation, character size, character set, and ID types.
- Default quotas can be viewed or changed via the Service Quotas console. TPS quotas determine how often you can request that Textract process a new document. The concurrent job limit defines how many jobs can run in parallel at a given time.
You can estimate your quota needs with the Service Quotas Calculator.
Best Practices
You should be familiar with the following best practices for Textract:
- Provide an optimal input document. Use a high-quality image of at least 150 DPI, in a language and format Textract supports.
- Use confidence scores. Take into account the confidence scores returned by Textract API operations and the sensitivity of your use case. The optimal threshold depends on the application. In applications that are sensitive to detection errors (false positives), enforce a minimum confidence score threshold.
- Consider using human review. You can incorporate human review into your workflows. This is especially important for sensitive applications, such as business processes that involve financial decisions.
Developer Guide - Best Practices for Amazon Textract
Hands-on Experience
You should have experience using Textract to extract text, handwriting, and data from documents. You can use the tutorials and demos below if you don’t have an application to work with.
Tutorials
Extract text and structured data (AWS console tutorial)
Hello, Textract! (coding tutorial)
Sample applications
Community Videos
Intro to Textract and .NET 6 - EP01 by Tom Moore
Intro to Textract and .NET 6 - EP02 by Tom Moore
AWS Experience
Beginner or Intermediate
.NET Experience
Intermediate
Time to Complete
Up to 3 hours depending on prior experience
Services Used
Amazon Textract
Last updated
July 7, 2022
Modules
This tutorial is divided into the following modules. You may go through the modules fully, or skim and review, based on your experience and readiness.
- Preparation Guide (3 hours).
- Skills Assessment: Assess Amazon Textract and .NET Workloads