Interfacing with trp2

The Textract response parser (trp2) was the preferred way to handle Textract API output before Textractor was released. If your current workflow uses the older library, you can keep using its functions through the compatibility API.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

Several sets of extra dependencies are available to tailor the installation to your use case. The base package comes with sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more about extra dependencies in the documentation.

Calling Textract

[1]:
from textractor import Textractor

extractor = Textractor(profile_name="default")
# This path assumes that you are running the notebook from docs/source/notebooks
document = extractor.detect_document_text("../../../tests/fixtures/form.png")
[2]:
document
[2]:
This document holds the following data:
Pages - 1
Words - 259
Lines - 74
Key-values - 0
Checkboxes - 0
Tables - 0
Identity Documents - 0

Getting the trp2 document

All Document objects have a convenience function, to_trp2(), which is shorthand for TDocumentSchema().load(document.response) and creates a matching trp2 document. Note that it behaves as a converter, not as a proxy, so any changes made to the TDocument will not be passed back to the Document object.

[4]:
trp2_document = document.to_trp2()
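
The converter-versus-proxy distinction can be illustrated with a minimal sketch. It uses plain dictionaries instead of the real Document and TDocument classes, so no AWS call is needed. The trimmed response dictionary and the convert() helper are hypothetical stand-ins, and copy.deepcopy is only an analogy for the independent objects that a converter produces, not how to_trp2() is implemented.

```python
import copy

# Hypothetical, heavily trimmed stand-in for a Textract response.
response = {
    "Blocks": [
        {"Id": "line-1", "BlockType": "LINE", "Text": "Name: Jane"},
    ]
}


def convert(source: dict) -> dict:
    """Return an independent copy of the source (assumed analogy for a converter)."""
    return copy.deepcopy(source)


converted = convert(response)

# Editing the converted object does not touch the object it was built from.
converted["Blocks"][0]["Text"] = "Name: John"

original_text = response["Blocks"][0]["Text"]
converted_text = converted["Blocks"][0]["Text"]
print(original_text)
print(converted_text)
```

In practice this means that if you edit the trp2 document and also need an updated Textractor Document, you would have to build a new Document from the modified data rather than expect the existing one to change.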

Conclusion

Textractor comes with everything you need to reuse components from your current trp2 workflow alongside the newer caller, pretty printer, and directional finder.