Visualizing Results

When debugging, it is usually helpful to see what went wrong. Textractor offers a simple API for visualizing its output, which can help a lot when developing heuristics.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of extra dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more about extra dependencies in the documentation.
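Before running the cells below, it can be useful to confirm which modules are importable in the current environment. The following is a minimal sketch using only the standard library. `textractor` is the import name used in this notebook, while `pypdfium2` is an assumed module name for the PDF extra and may differ in your version; `is_importable` is a hypothetical helper.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# "textractor" is the import name used in this notebook; "pypdfium2" is an
# assumed module name for the PDF extra and may differ in your version.
for name in ("textractor", "pypdfium2"):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```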

Calling Textract

[1]:

from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")
document = extractor.analyze_document(
    file_source=Image.open("../../../tests/fixtures/form.png"),
    features=[TextractFeatures.FORMS, TextractFeatures.TABLES],
    save_image=True,
)

Let’s look at the input image.

[2]:

Image.open("../../../tests/fixtures/form.png")

[2]:

../_images/notebooks_visualizing_results_3_0.png

[3]:

document

[3]:

This document holds the following data:
Pages - 1
Words - 494
Lines - 129
Key-values - 20
Checkboxes - 29
Tables - 1
Identity Documents - 0
Expense Documents - 0

[4]:

document.checkboxes.visualize()

[4]:

../_images/notebooks_visualizing_results_5_0.png
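Textract reports each element's bounding box as ratios of the page width and height, so drawing an overlay like the one above involves scaling those ratios to pixels. The sketch below shows that conversion on its own, independent of Textractor's internals. `to_pixel_box` is a hypothetical helper and the sample values are made up for illustration.

```python
def to_pixel_box(left, top, width, height, image_width, image_height):
    """Convert a normalized (0-1) bounding box to integer pixel corners."""
    x0 = round(left * image_width)
    y0 = round(top * image_height)
    x1 = round((left + width) * image_width)
    y1 = round((top + height) * image_height)
    return x0, y0, x1, y1


# Illustrative values only: a small box on a 1000x800 pixel image.
print(to_pixel_box(0.1, 0.2, 0.25, 0.1, 1000, 800))
```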

[5]:

document.key_values.visualize()

[5]:

../_images/notebooks_visualizing_results_6_0.png

Visualizing the result of a search

Here we will be looking for the word “Rent”.

[6]:

words = document.search_words("Rent", top_k=10, similarity_threshold=0.1)

[7]:

words.visualize()

[7]:

../_images/notebooks_visualizing_results_9_0.png
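To build intuition for how top_k and similarity_threshold interact, here is a minimal standalone sketch of a fuzzy word search. It is not Textractor's implementation: it uses the standard library's difflib as a stand-in similarity measure, and `fuzzy_search` and the sample word list are hypothetical. A low threshold such as 0.1 admits loose matches, and top_k then caps how many are returned.

```python
from difflib import SequenceMatcher


def fuzzy_search(words, keyword, top_k=10, similarity_threshold=0.1):
    """Rank words by similarity to keyword, keeping at most top_k matches."""
    scored = []
    for word in words:
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        score = SequenceMatcher(None, keyword.lower(), word.lower()).ratio()
        if score >= similarity_threshold:
            scored.append((score, word))
    # Highest similarity first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up sample words, for illustration only.
sample = ["Rent", "Rental", "Tenant", "Address", "42"]
for score, word in fuzzy_search(sample, "Rent", top_k=3):
    print(f"{word}: {score:.2f}")
```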

Visualizing Tables

Tables can be visualized as well, shown here in purple.

[8]:

document.tables.visualize()

[8]:

../_images/notebooks_visualizing_results_11_0.png
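When debugging outside a notebook, it can help to write the rendered overlays to disk. The sketch below assumes that visualize() returns either a single PIL image or a list of images, one per page; that return shape is an assumption to verify against your installed version. `save_visualizations` is a hypothetical helper, not part of Textractor, and it relies only on each object having a PIL-style save(path) method.

```python
from pathlib import Path


def save_visualizations(result, out_dir, prefix="page"):
    """Save one image, or a list of images, as PNG files and return the paths."""
    # Normalize to a list so single-page and multi-page results are handled alike.
    images = result if isinstance(result, (list, tuple)) else [result]
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, image in enumerate(images, start=1):
        target = out_path / f"{prefix}_{index}.png"
        image.save(target)  # PIL's Image.save accepts a path-like object
        saved.append(target)
    return saved
```

If the assumption about the return type holds, calling save_visualizations(document.tables.visualize(), "debug_out", prefix="tables") would write tables_1.png for the single-page form used here.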

Conclusion

Textractor includes visualization utilities that help you understand the Textract output and implement better heuristics.