Interfacing with trp2
The Textract response parser was the preferred way of handling Textract API output before the release of Textractor. If your current workflow uses the older library, you can reuse its functions through the compatibility API.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
There are various sets of dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation.
Calling Textract
[1]:
from textractor import Textractor
extractor = Textractor(profile_name="default")
# This path assumes that you are running the notebook from docs/source/notebooks
document = extractor.detect_document_text("../../../tests/fixtures/form.png")
[2]:
document
[2]:
This document holds the following data:
Pages - 1
Words - 259
Lines - 74
Key-values - 0
Checkboxes - 0
Tables - 0
Identity Documents - 0
Getting the trp2 document
All Document objects have a convenience function, to_trp2(), that is shorthand for TDocumentSchema().load(document.response) and creates a matching trp2 document. Note that this behaves as a converter, not as a proxy, so any changes made to the TDocument are not propagated back to the Document object.
[4]:
trp2_document = document.to_trp2()
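The converter-versus-proxy distinction can be illustrated without calling Textract. The sketch below is not Textractor or trp2 code: StandInDocument and StandInTDocument are hypothetical stand-ins, and the use of a deep copy is an assumption chosen only to mimic a conversion that builds new objects from the response.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class StandInTDocument:
    """Hypothetical stand-in for trp2's TDocument; owns its own block list."""
    blocks: list = field(default_factory=list)


@dataclass
class StandInDocument:
    """Hypothetical stand-in for Textractor's Document; keeps the raw response."""
    response: dict

    def to_trp2(self) -> StandInTDocument:
        # Converter behaviour: build a new object from a copy of the response
        # instead of returning a view onto self.response (which a proxy would do).
        return StandInTDocument(blocks=copy.deepcopy(self.response["Blocks"]))


document = StandInDocument(response={"Blocks": [{"BlockType": "LINE", "Text": "Name"}]})
trp2_document = document.to_trp2()

# Edit the converted copy only.
trp2_document.blocks[0]["Text"] = "Full name"

print(document.response["Blocks"][0]["Text"])  # original response is unchanged
print(trp2_document.blocks[0]["Text"])         # converted copy carries the edit
```

In practice this means that edits made with trp2 functions stay on the trp2 side. Keep working with trp2_document afterwards if those edits matter.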
Conclusion
Textractor comes with everything you need to reuse components from your current workflow alongside the newer caller, pretty printer, or directional finder.