Parsing an existing response
Because Amazon Textract is a paid service, you will likely want to reduce costs by developing and debugging against saved JSON responses. Textractor offers a simple interface for this.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
There are various sets of dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more about extra dependencies in the documentation.
Not calling Textract
There are two ways to parse an existing JSON response. The simplest, reminiscent of PIL.Image.open(), is Document.open(), which takes either a path or a file-like object and parses it automatically. The path can be an S3 path.
[1]:
from textractor.entities.document import Document
document = Document.open("../../../tests/fixtures/saved_api_responses/test_table.json")
[2]:
document
[2]:
This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0
[3]:
with open("../../../tests/fixtures/saved_api_responses/test_table.json", "r") as f:
    document = Document.open(f)
[4]:
document
[4]:
This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0
Instantiating from a dictionary
Another option is to call response_parser.parse() directly with a dict object.
[8]:
import json
from textractor.parsers import response_parser
[9]:
with open("../../../tests/fixtures/saved_api_responses/test_table.json", "r") as f:
    document = response_parser.parse(json.load(f))
[10]:
document
[10]:
This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0
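Before parsing, it can be useful to sanity-check a saved response using only the standard library. The sketch below assumes the usual Textract response shape: a top-level "Blocks" list whose items carry a "BlockType" field. The helper count_block_types and the sample data are hypothetical and not part of Textractor.

```python
import json
from collections import Counter


def count_block_types(response: dict) -> Counter:
    """Count the blocks of each type in a Textract-style response dict."""
    # A response without a "Blocks" key yields an empty Counter.
    blocks = response.get("Blocks", [])
    return Counter(block.get("BlockType", "UNKNOWN") for block in blocks)


# Tiny hand-written stand-in for a saved response (not real Textract output).
sample_response = json.loads(
    '{"Blocks": [{"BlockType": "PAGE"}, {"BlockType": "LINE"},'
    ' {"BlockType": "WORD"}, {"BlockType": "WORD"}]}'
)

counts = count_block_types(sample_response)
print(counts["PAGE"], counts["LINE"], counts["WORD"])  # prints: 1 1 2
```

With a real saved file, the same helper could be applied to the result of json.load(f) to get a rough idea of what the response contains before handing it to the parser.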
Conclusion
You can save money and speed up development by reusing saved responses. This approach is used in the Textractor unit tests.
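One possible way to organize response reuse is a small cache-on-disk helper: the paid call runs only when no saved JSON exists yet. This is a minimal sketch using only the standard library. The names load_or_fetch and fake_fetch are hypothetical, and fake_fetch merely stands in for whatever code actually calls Textract.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Return the cached response if it exists, else call `fetch` and save it."""
    if cache_path.exists():
        with cache_path.open("r", encoding="utf-8") as f:
            return json.load(f)
    response = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as f:
        json.dump(response, f)
    return response


# Demonstration with a fake fetcher standing in for a paid API call.
calls = []


def fake_fetch() -> dict:
    calls.append(1)  # record every "API call"
    return {"Blocks": [{"BlockType": "PAGE"}]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "response.json"
    first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch, writes the file
    second = load_or_fetch(cache_file, fake_fetch)  # reads from disk, no call

print(len(calls))  # prints: 1
```

The cached file is ordinary JSON, so it can then be opened with Document.open() in the same way as the fixture used throughout this page.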