Parsing an existing response

Amazon Textract is a paid service, so you will likely want to reduce costs by developing and debugging against existing JSON responses. We offer a simple interface to do so.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of extra dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation.

Not calling Textract

There are two ways to parse an existing JSON response. The simplest, reminiscent of PIL.Image.open(), is Document.open(), which takes either a path or a file-like object and parses it automatically. The path can also be an S3 path.

[1]:

from textractor.entities.document import Document

document = Document.open("../../../tests/fixtures/saved_api_responses/test_table.json")

[2]:

document

[2]:

This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0

[3]:

with open("../../../tests/fixtures/saved_api_responses/test_table.json", "r") as f:
    document = Document.open(f)

[4]:

document

[4]:

This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0

Instantiating from a dictionary

Another option is to use the response parser directly by passing a dict object to response_parser.parse().

[8]:

import json
from textractor.parsers import response_parser

[9]:

with open("../../../tests/fixtures/saved_api_responses/test_table.json", "r") as f:
    document = response_parser.parse(json.load(f))

[10]:

document

[10]:

This document holds the following data:
Pages - 1
Words - 51
Lines - 24
Key-values - 0
Checkboxes - 0
Tables - 1
Identity Documents - 0
Expense Documents - 0
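Because the saved response is an ordinary dictionary once loaded, it can be inspected with the standard library before it is parsed. The sketch below counts blocks by their BlockType field. The sample_response dictionary is a hand-written stand-in, not the contents of test_table.json, and count_block_types is a hypothetical helper name, not part of Textractor.

```python
from collections import Counter

# Hand-written stand-in for a saved response (an assumption for illustration);
# real Textract responses carry many more fields per block.
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Hello world"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Hello"},
        {"BlockType": "WORD", "Id": "w2", "Text": "world"},
    ],
}


def count_block_types(response: dict) -> Counter:
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


block_counts = count_block_types(sample_response)
print(dict(block_counts))
```

A quick check like this can help confirm that a saved file holds the kind of response you expect before handing it to the parser.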

Conclusion

You can save money and accelerate your development process by reusing saved responses. This approach is used in the Textractor unit tests.
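One way to put this into practice is a small on-disk cache that only calls the paid API when no saved response exists yet. The following is a minimal sketch using only the standard library. load_or_fetch is a hypothetical helper, not part of Textractor, and fake_fetch stands in for whatever function actually calls Textract and returns the response as a dict.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Return the cached JSON response if present, otherwise fetch and save it."""
    if cache_path.exists():
        with cache_path.open("r") as f:
            return json.load(f)
    response = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w") as f:
        json.dump(response, f)
    return response


# Stand-in for a real API call (assumption): records each invocation.
calls = []


def fake_fetch() -> dict:
    calls.append(1)
    return {"Blocks": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "response.json"
    first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch, writes the file
    second = load_or_fetch(cache_file, fake_fetch)  # reads the file, no call made
```

The dictionary returned by such a helper can then be passed to response_parser.parse(), as shown in the section on instantiating from a dictionary.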