Introduction to searching

Textractor offers several tool to process Textract output. In this example, we will build a robust solution for extracting specific fields from a specific form (i.e. not fields explicitly supported by Amazon Textract).

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation

Calling Textract

[1]:

import os
from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")
document = extractor.start_document_analysis(
    # Here we pass a Pillow image instead of path. This changes nothing as
    # Textractor supports most input types.
    file_source=Image.open("../../../tests/fixtures/form.png"),
    # We specify the features that we want, here, we only want keys and values
    # therefore we use TextractFeatures.FORMS.
    features=[TextractFeatures.FORMS],
    s3_upload_path="s3://textract-ocr/temp/",
    save_image=True
)

Let’s look at the asset.

[2]:

Image.open("../../../tests/fixtures/form.png")

[2]:

../_images/notebooks_introduction_to_searching_4_0.png

Now if you are looking for the rent, here of $2000, you could use the extracted checkboxes.

[3]:

[c for c in document.checkboxes if "rent" in str(c.key).lower()]

[3]:

[[ ] Rent Room, [ ] Space Rent $, [X] Rent $ 2000]

There are several results, but we really only care about what the checkboxe the customer checked.

[6]:

out = [c for c in document.checkboxes if "rent" in str(c.key).lower() and c.is_selected()]
out

[6]:

[[X] Rent $ 2000]

Saving on costs

As shown above, Amazon Textract is able to extract the checkbox without any issue, but what if you want to reduce the per-document processing costs and you are willing process the data a bit more. Here we will get the same result by leveraging word search.

[7]:

document = extractor.detect_document_text(
    file_source=Image.open("../../../tests/fixtures/form.png"),
)

[8]:

document

[8]:

This document holds the following data:
Pages - 1
Words - 499
Lines - 129
Key-values - 0
Checkboxes - 0
Tables - 0
Identity Documents - 0
Expense Documents - 0

[18]:

rents = document.search_lines("Rent", top_k=10, similarity_threshold=0.1)

[19]:

rents

[19]:

[Renters: Provide a complete signed copy of rent or lease agreement dated within the last 12 months, listing every person living,
 in the home(s). If subsidized, provide signed Housing documents listing every person in the home, rent and utility rebate.,
 Rent Room,
 Rent $ 2000,
 Subsidized Rent $,
 Space Rent $]

Now we can’t know if there was a checked checkbox next to the “Rent $ 2000” line, but maybe that’s a reasonable assumption for your use case.

Conclusion

Searching allows you to inject your knowledge of the data inside your workflow to improve the raw results returned by Textract. The more specific your input data is, the lower the costs as you can get away with parsing just the text output.