Word Occurrence within Document

This page walks you through using Textractor to find all the occurrences of the word ‘Room’ in a document and visualize them overlaid on the document image.

Installation

To begin, install the amazon-textract-textractor package using pip. This example rasterizes PDFs and uses word embeddings for fuzzy search, so you need to install the torch and pdf extras:

pip install amazon-textract-textractor[torch,pdf]

There are various sets of extra dependencies available to tailor your installation to your use case. You can read more about them in the documentation.

Calling Textract

We use the synchronous API for this example, but any of the other APIs would support this use case as well.

[7]:

from textractor import Textractor

extractor = Textractor(profile_name="default")
document = extractor.detect_document_text(
    file_source="../../../tests/fixtures/tutorial.pdf",
    save_image=True
)

Retrieving word occurrences

Words and lines can be searched at the document and page level by calling the search_words() and search_lines() methods respectively.

[8]:

word_occurences = document.search_words(keyword="Room", top_k=15, similarity_threshold=0.5)
print("Number of occurrences of the word Room in the document = ", len(word_occurences))
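
To build some intuition for the top_k and similarity_threshold arguments, here is a minimal sketch of a threshold-then-rank search over a plain list of strings. It is not Textractor's implementation: search_words_sketch is a hypothetical helper, difflib's ratio stands in for the embedding-based similarity, and the assumption that scores at or above the threshold are kept is ours. The library's exact scoring may differ.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] using difflib (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_words_sketch(words, keyword, top_k=1, similarity_threshold=0.6):
    """Hypothetical helper: score every word, drop weak matches, keep the best top_k."""
    scored = [(similarity(word, keyword), word) for word in words]
    # Assumption: a word is kept when its score reaches the threshold.
    kept = [pair for pair in scored if pair[0] >= similarity_threshold]
    # The sort is stable, so words with equal scores keep their input order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in kept[:top_k]]


# Made-up word list, only for illustration.
words = ["Room", "room", "Rooms", "Broom", "Table", "Rate"]
matches = search_words_sketch(words, "Room", top_k=15, similarity_threshold=0.5)
print(matches)
```

In this sketch, lowering similarity_threshold lets more distant words through, while top_k only caps how many of the surviving matches are returned.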

Find words on document page through visualize()

To identify the location of these words in the document, call the visualize() method on the object returned by search_words(). It returns PIL images of the document pages that contain these word instances, with bounding boxes drawn around each of them.

[11]:

word_occurences.visualize()

[11]:

../_images/notebooks_finding_words_within_a_document_5_0.png
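
Amazon Textract reports bounding boxes as ratios of the page width and height, so drawing them on an image requires scaling them to pixels first. The sketch below shows that conversion in isolation. to_pixel_box is a hypothetical helper and the sample values are made up; it illustrates the idea rather than how visualize() is implemented internally.

```python
def to_pixel_box(x, y, width, height, image_width, image_height):
    """Convert a normalized box (values in [0, 1]) to (left, top, right, bottom) pixels."""
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + width) * image_width)
    bottom = round((y + height) * image_height)
    return left, top, right, bottom


# Hypothetical values: a word covering 10% of the width and 2% of the height
# of a 1000 x 2000 pixel page image.
box = to_pixel_box(0.25, 0.5, 0.1, 0.02, image_width=1000, image_height=2000)
print(box)
```

A tuple in this form can be passed to Pillow's ImageDraw.Draw(image).rectangle(box, outline="red") if you want to draw the boxes yourself.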

What if you don’t want to use PyTorch?

PyTorch gives the most accurate results, but it also uses more compute resources and might be overkill for your use case. For that reason, Textractor also ships with other built-in string matching algorithms that you can select through the similarity_metric argument.

[14]:

from textractor.data.constants import SimilarityMetric

word_occurences = document.search_words(
    keyword="Room",
    top_k=15,
    similarity_threshold=0.5,
    similarity_metric=SimilarityMetric.LEVENSHTEIN
)
print("Number of occurrences of the word Room in the document = ", len(word_occurences))

Number of occurrences of the word Room in the document =  10

[15]:

word_occurences.visualize()

[15]:

../_images/notebooks_finding_words_within_a_document_8_0.png
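
For readers unfamiliar with the Levenshtein metric used above, the following is a minimal, dependency-free sketch of edit distance and one common way to turn it into a similarity score. Both function names are ours, and the normalization by the longer string's length is an assumption; this page does not show how Textractor normalizes or thresholds the distance internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical (one common normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


for word in ["room", "rooms", "roam", "table"]:
    print(word, levenshtein("room", word), round(normalized_similarity("room", word), 2))
```

Unlike word embeddings, this metric only looks at characters, so it is cheap to compute but cannot tell that two differently spelled words are related in meaning.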

Conclusion

Textractor supports many more APIs and use cases. If this example did not address yours, we encourage you to look at the other examples.