Textractor Documentation
Textractor is a Python package created to work seamlessly with four popular Amazon Textract APIs: DetectDocumentText, StartDocumentTextDetection, AnalyzeDocument and StartDocumentAnalysis. The package contains utilities to call Textract services, convert the JSON responses from API calls into programmable objects, visualize entities on the document, and export document data in compatible formats. It is intended to help Textract customers set up their post-processing pipelines.
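To make "convert the JSON responses into programmable objects" concrete, here is a minimal sketch that uses only the standard library. It is an illustration of the idea, not Textractor's implementation: the `LineEntity` class and `parse_lines` function are hypothetical names, and the sample response is hand-written in the shape Textract returns (a top-level `Blocks` list whose items carry `BlockType`, `Text`, `Page` and `Confidence`).

```python
import json
from dataclasses import dataclass


@dataclass
class LineEntity:
    """Hypothetical stand-in for a parsed line; not a Textractor class."""
    text: str
    page: int
    confidence: float


def parse_lines(response: dict) -> list[LineEntity]:
    """Collect the LINE blocks of a Textract-style response dictionary."""
    return [
        LineEntity(
            text=block.get("Text", ""),
            # "Page" can be absent in single-page synchronous responses.
            page=block.get("Page", 1),
            confidence=block.get("Confidence", 0.0),
        )
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Trimmed, hand-written sample (not real service output).
SAMPLE_RESPONSE = json.loads("""
{
  "Blocks": [
    {"BlockType": "PAGE", "Id": "p1", "Page": 1},
    {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Invoice 001", "Confidence": 99.1},
    {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Invoice", "Confidence": 99.3},
    {"BlockType": "LINE", "Id": "l2", "Page": 1, "Text": "Total due", "Confidence": 98.4}
  ]
}
""")

lines = parse_lines(SAMPLE_RESPONSE)
for line in lines:
    print(f"page {line.page}: {line.text} ({line.confidence:.1f})")
```

Working with typed objects like these, rather than nested dictionaries, is the kind of convenience the package aims to provide for every entity type.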
Previous work in this space has been made available in the following packages:
- amazon-textract-caller (to call Textract without the explicit use of boto3)
- amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)
- amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
- amazon-textract-prettyprinter (to string represent document entities)
- amazon-textract-directional_finder (to perform geometric search on the document)
This package uses amazon-textract-caller as a dependency and wraps it to reduce the number of parameters the customer needs to pass. It also supports newer input formats for the document.
The remaining packages have been refactored into this new package, and their main functionality remains available so that existing customer workflows are not disrupted.
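As a rough sketch of what the reduced-parameter wrapper looks like in practice, the helper below runs synchronous OCR on a single file and returns the text of each line. The helper name `extract_lines` and its injectable `extractor` argument are illustrative choices, not part of the package, and the code assumes an AWS profile named "default" is configured locally.

```python
def extract_lines(file_source, extractor=None):
    """Return the text of every detected line in a document.

    `extractor` is injectable so the function can be exercised without AWS
    access; when omitted, a real Textractor client is created.
    """
    if extractor is None:
        # Imported lazily so this module loads even without the package.
        from textractor import Textractor

        # Assumption: an AWS profile named "default" exists on this machine.
        extractor = Textractor(profile_name="default")

    # Synchronous OCR call through the Textractor wrapper.
    document = extractor.detect_document_text(file_source=file_source)
    return [line.text for line in document.lines]
```

A call such as `extract_lines("./invoice.png")` (a hypothetical local path) would contact the Textract service and incur the usual API charges.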
This package also adds features that were not implemented in the existing packages. These include:
- Semantic Document Search
- Query for key-values using keys
- Table access with NumPy indexing
- New export formats with Excel, CSV and TXT
- Indication of duplicated document entities
- Availability of all the above at Document and Page level.
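To give a feel for what NumPy-style table access means, the sketch below implements two-dimensional `[row, column]` indexing on a plain list of rows. `GridTable` is a hypothetical class written for illustration only; it is not Textractor's table object, whose behavior may differ in detail.

```python
class GridTable:
    """Hypothetical illustration of 2D indexing; not Textractor's Table class."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __getitem__(self, key):
        # Expects a (row, column) pair, e.g. table[1, 2] or table[1:, 0:2].
        row_key, col_key = key
        row_is_slice = isinstance(row_key, slice)
        col_is_slice = isinstance(col_key, slice)

        rows = self.rows[row_key] if row_is_slice else [self.rows[row_key]]
        picked = [r[col_key] if col_is_slice else [r[col_key]] for r in rows]

        # Two integer indices address a single cell; anything else is a sub-grid.
        if not row_is_slice and not col_is_slice:
            return picked[0][0]
        return picked


table = GridTable([
    ["Item", "Qty", "Price"],
    ["Pen", "2", "1.50"],
    ["Book", "1", "12.00"],
])

single_cell = table[1, 2]       # one cell
body = table[1:3, 1:3]          # rows 1-2, columns 1-2
first_column = table[1:, 0]     # column 0 without the header row
```

The appeal of this style is that a header row, a single column, or a rectangular block of cells can each be selected with one expression.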
Usage
- Installation
- Using Textractor in AWS Lambda
- Examples
- Using Textract OCR
- Parsing an existing response
- Introduction to searching
- Visualizing Results
- Word Occurrence within Document
- Exporting Form Data
- Table data extraction to Excel
- Using AnalyzeExpense
- Using AnalyzeID
- Using Queries
- Using Layout Analysis
- Tabular data linearization
- Tabular data linearization (Continued)
- Using Layout Analysis for Text Linearization
- Document Linearization to Markdown or HTML with Textractor
- Textractor for Large Language Models (LLM)
- Interfacing with trp2
- Signature Detection
- Going further
- CLI