Textractor Documentation

Textractor

Textractor is a Python package created to work seamlessly with four popular Amazon Textract APIs: the DetectDocumentText, StartDocumentTextDetection, AnalyzeDocument and StartDocumentAnalysis endpoints. The package contains utilities to call Textract services, convert JSON responses from API calls to programmable objects, visualize entities on the document and export document data in compatible formats. It is intended to help Textract customers set up their post-processing pipelines.
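To illustrate what "converting JSON responses to programmable objects" can mean, here is a minimal sketch that uses only the standard library. It assumes the general shape of a Textract response (a "Blocks" list whose entries carry "BlockType", "Id", "Text" and CHILD "Relationships"). The Word and Line classes, the parse_lines helper and the sample data are hypothetical and written for this example; they are not Textractor's own classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Word:
    text: str
    confidence: float


@dataclass
class Line:
    text: str
    words: List[Word] = field(default_factory=list)


def parse_lines(response: dict) -> List[Line]:
    """Group WORD blocks under their parent LINE blocks (illustrative helper)."""
    blocks_by_id: Dict[str, dict] = {b["Id"]: b for b in response["Blocks"]}
    lines: List[Line] = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        line = Line(text=block["Text"])
        # CHILD relationships list the ids of the words that make up the line.
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    line.words.append(Word(child["Text"], child["Confidence"]))
        lines.append(line)
    return lines


# Hand-written sample in the shape of a Textract response (not real API output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 42", "Confidence": 99.1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.5},
        {"BlockType": "WORD", "Id": "w2", "Text": "42", "Confidence": 98.7},
    ]
}

lines = parse_lines(sample_response)
print(lines[0].text, [w.text for w in lines[0].words])
```

Once the blocks are wrapped this way, downstream code can work with attributes such as line.words instead of following ids through the raw JSON. That is the kind of convenience the package aims to provide, with far more entity types than this sketch covers.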

Previous work in this space has been made available in the following packages:

  1. amazon-textract-caller (to call Textract without using boto3 explicitly)

  2. amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)

  3. amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)

  4. amazon-textract-prettyprinter (to produce string representations of document entities)

  5. amazon-textract-directional_finder (to perform geometric search on the document)

This package uses amazon-textract-caller as a dependency and wraps it to reduce the number of parameters the customer needs to pass. It also supports newer input formats for the document.

The functionality of the remaining packages has been refactored into this package, and their prominent features all remain available so that existing customer requirements are not disrupted.

This package also adds features that were not implemented in the earlier packages. These include:

  1. Semantic Document Search

  2. Querying key-value pairs by key

  3. Table access with numpy indexing

  4. New export formats: Excel, CSV and TXT

  5. Indication of duplicated document entities

  6. Availability of all of the above at both Document and Page level
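As a rough illustration of items 3 and 4, the sketch below shows what numpy-style indexing on a table and a CSV export could look like. The TableSketch class and its to_csv method are hypothetical names invented for this example, built on the standard library only; the actual Textractor table class and its export methods may differ.

```python
import csv
import io
from typing import List


class TableSketch:
    """Hypothetical stand-in for a table entity; not the Textractor Table class."""

    def __init__(self, rows: List[List[str]]):
        self.rows = rows

    def __getitem__(self, key):
        # Accept table[row, col], where each part is an int or a slice,
        # mirroring the shapes numpy returns for a 2-D array.
        row_key, col_key = key
        if isinstance(row_key, int):
            return self.rows[row_key][col_key]
        return [row[col_key] for row in self.rows[row_key]]

    def to_csv(self) -> str:
        # Serialize all rows as CSV text using the standard csv module.
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(self.rows)
        return buffer.getvalue()


table = TableSketch([["Item", "Qty"], ["Pen", "3"], ["Ink", "1"]])
print(table[1, 0])    # a single cell
print(table[1:, 1])   # the second column, header row skipped
print(table.to_csv())
```

Indexing with a (row, column) pair keeps table access close to how array data is usually handled in Python, and the same rows can be handed to a writer for export without an intermediate conversion step.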

Usage

API Reference