CLI

Textractor comes with its own command-line interface, which aims to be easier to use than the default boto3 interface by adding several quality-of-life improvements.

First, install the package using pip install amazon-textract-textractor. Make sure that your Python bin directory is added to PATH; otherwise your shell will not find the textractor executable. If you are not using a virtual environment, the directory is probably not on PATH yet.

Available APIs

Textractor supports all Textract APIs and follows their official names as described here: https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html. A single subcommand, GetResult, fetches the results of the asynchronous APIs.

Synchronous APIs:

DetectDocumentText/detect-document-text (Returns words and lines)
AnalyzeDocument/analyze-document (Returns Forms, Tables and Query results)
AnalyzeExpense/analyze-expense (Returns standardized fields for invoices)
AnalyzeID/analyze-id (Returns standardized fields for driver’s licenses and passports)

Asynchronous APIs:

StartDocumentTextDetection/start-document-text-detection
StartDocumentAnalysis/start-document-analysis
StartExpenseAnalysis/start-expense-analysis

Getting document text

Let's say you have a file and you wish to run OCR on it:

textractor detect-document-text your_file.png output.json

This will call the Textract API and save the output to output.json. You can then use the Textractor Python module to post-process the response.
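As a minimal sketch of such post-processing, the following uses only the Python standard library rather than the Textractor module. It assumes that output.json holds the raw Textract response, in which a Blocks list contains entries with a BlockType and, for lines, a Text field. The helper names extract_lines and load_lines are illustrative and not part of Textractor.

```python
import json


def extract_lines(response: dict) -> list[str]:
    """Return the text of every LINE block, in the order the response lists them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def load_lines(path: str) -> list[str]:
    """Read a saved response (assumed to be raw Textract JSON) and extract its lines."""
    with open(path, encoding="utf-8") as handle:
        return extract_lines(json.load(handle))
```

Calling load_lines("output.json") would then return the detected lines as a list of strings, provided the file has the assumed layout.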

Processing a directory of files

If, instead of a single file, you want to process an entire directory of files, you could call the above on every file in the directory, but this would be a very long process. Instead, you can leverage Textract’s ability to scale to your workload using the asynchronous API.

ls your_dir/ | xargs -I{} textractor start-document-text-detection your_dir/{} --s3-upload-path s3://your-bucket/your-prefix/{}

You can also parallelize it simply by adding -P8 (for 8 concurrent processes). Here the printed output is also redirected to output.txt.

ls your_dir/ | xargs -P8 -I{} textractor start-document-text-detection your_dir/{} --s3-upload-path s3://your-bucket/your-prefix/{} > output.txt
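The same fan-out can be sketched in Python, which may be convenient when xargs is not available. This is only an illustration under stated assumptions: each invocation is assumed to print its JobId on standard output (as the redirect to output.txt suggests), the S3 prefix is a placeholder, and the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_start_command(file_path: Path, s3_prefix: str) -> list[str]:
    """Build the argv for one start-document-text-detection call."""
    return [
        "textractor",
        "start-document-text-detection",
        str(file_path),
        "--s3-upload-path",
        f"{s3_prefix.rstrip('/')}/{file_path.name}",
    ]


def start_job(command: list[str]) -> str:
    """Run one command and return its standard output (assumed to be the JobId)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def start_jobs(directory: str, s3_prefix: str, workers: int = 8) -> list[str]:
    """Start one job per regular file in `directory`, at most `workers` at a time."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    commands = [build_start_command(p, s3_prefix) for p in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(start_job, commands))
```

Threads are sufficient here because each worker only waits on a child process; the workers argument plays the role of -P8.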

You will notice that output.txt contains only hexadecimal identifiers like this: 628e39089ffa1b52d62d980ec1cf4f62cb7f785c83a708b2e17ebaaf21ad0d61. These are JobIDs and can be used to fetch the output of asynchronous operations.

Wait a few minutes (depending on the number of files you processed) and then fetch the results with GetResult.

cat output.txt | xargs -I{} textractor get-result {} DETECT_TEXT {}.json

Using -P8 would make the above faster, but be careful not to increase the concurrent process count too much, as you might run into rate limiting issues (see https://docs.aws.amazon.com/textract/latest/dg/limits.html for more details).
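A cautious alternative to raising the process count is to fetch results one at a time with a short pause between calls. The sketch below assumes output.txt contains one JobId per line, as described above; the helper names and the half-second pause are illustrative choices, not values taken from Textractor or the Textract limits.

```python
import subprocess
import time


def parse_job_ids(text: str) -> list[str]:
    """Return the non-empty lines of the job ID listing, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_get_result_command(job_id: str, api: str = "DETECT_TEXT") -> list[str]:
    """Build the argv that saves one job's result to <job_id>.json."""
    return ["textractor", "get-result", job_id, api, f"{job_id}.json"]


def fetch_all(job_ids: list[str], pause_seconds: float = 0.5) -> None:
    """Fetch results sequentially, pausing between calls (pause length is arbitrary)."""
    for job_id in job_ids:
        subprocess.run(build_get_result_command(job_id), check=True)
        time.sleep(pause_seconds)
```

Reading output.txt, passing its contents to parse_job_ids, and handing the result to fetch_all reproduces the xargs pipeline without concurrency.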

Visualizing the output

The textractor CLI allows you to overlay the output of Amazon Textract on top of an image for troubleshooting. It is only available for synchronous APIs (DetectDocumentText, AnalyzeDocument) and allows you to visualize words, lines, keys and values, and tables.

In this example we will overlay words and tables on top of the tests/fixtures/amzn_q2.png file. The image will be created in the same directory as the output.json file under the name output.json.png.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES

This will write output.json.png, showing the detected words and tables drawn over the page image.

This document has a lot of small words, making it difficult to read. You can add --font-size-ratio to the command to increase the font size.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES --font-size-ratio 1.0

The default ratio is 0.75.
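When scripting the overlay step, it can help to build the command and predict where the image will land. The sketch below only assembles arguments that appear in this document; the function names are hypothetical, and the output path rule follows the description above (the image is saved next to the JSON file with .png appended).

```python
from typing import Optional


def build_overlay_command(
    image: str,
    output_json: str,
    features: list[str],
    overlays: list[str],
    font_size_ratio: Optional[float] = None,
) -> list[str]:
    """Build an analyze-document argv that also requests an overlay image."""
    command = [
        "textractor", "analyze-document", image, output_json,
        "--features", *features,
        "--overlay", *overlays,
    ]
    if font_size_ratio is not None:
        # Only pass the flag when a value is given, so the CLI default applies otherwise.
        command += ["--font-size-ratio", str(font_size_ratio)]
    return command


def overlay_image_path(output_json: str) -> str:
    """Return the overlay image path as described above: the JSON path plus .png."""
    return f"{output_json}.png"
```

The resulting list can be passed to subprocess.run; positional arguments come first so that the multi-value options do not absorb them.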

Reference

Command-line interface for the Textractor library

usage: textractor [-h]
                  {detect-document-text,start-document-text-detection,analyze-document,start-document-analysis,analyze-expense,start-expense-analysis,analyze-id,get-result}
                  ...

Positional Arguments

subcommand

Possible choices: detect-document-text, start-document-text-detection, analyze-document, start-document-analysis, analyze-expense, start-expense-analysis, analyze-id, get-result

Sub-command help

Sub-commands

detect-document-text

Synchronous API for Optical Character Recognition

textractor detect-document-text [-h] [--profile-name PROFILE_NAME]
                                [--region-name REGION_NAME]
                                [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
                                [--linearize]
                                [--linearize-config-path LINEARIZE_CONFIG_PATH]
                                [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
                                [--font-size-ratio FONT_SIZE_RATIO]
                                file_source output_file

Positional Arguments

file_source: File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
output_file: Output file to save the response, can be an S3 path

Named Arguments

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to its pixel height (for example, 0.5 would be half the pixel height)

Default: 0.75

start-document-text-detection

Asynchronous API for Optical Character Recognition

textractor start-document-text-detection [-h]
                                         [--s3-upload-path S3_UPLOAD_PATH]
                                         [--s3-output-path S3_OUTPUT_PATH]
                                         [--profile-name PROFILE_NAME]
                                         [--region-name REGION_NAME]
                                         file_source

Positional Arguments

file_source: File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, provide an S3 destination with --s3-upload-path

Named Arguments

--s3-upload-path: Path to upload the input files to, required if file_source is not an S3 path
--s3-output-path: Path to write the response to
--profile-name: AWS profile name to use for the request
--region-name: AWS region to use for the request

analyze-document

Synchronous API for document analysis (forms, tables, queries, and signatures)

textractor analyze-document [-h] --features
                            {FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
                            [{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
                            [--queries QUERIES [QUERIES ...]]
                            [--profile-name PROFILE_NAME]
                            [--region-name REGION_NAME]
                            [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
                            [--linearize]
                            [--linearize-config-path LINEARIZE_CONFIG_PATH]
                            [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
                            [--font-size-ratio FONT_SIZE_RATIO]
                            file_source output_file

Positional Arguments

file_source: File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
output_file: Output file to save the response, can be an S3 path

Named Arguments

--features

Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT

--queries

List of queries, use quotes (") to escape spaces

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to its pixel height (for example, 0.5 would be half the pixel height)

Default: 0.75
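To illustrate the quoting that --queries needs, the following sketch builds an analyze-document command in Python and prints its shell form. Passing arguments as a list avoids manual quoting altogether, and shlex.join shows how the same command would be quoted in a shell. The file names and queries are made up for the example.

```python
import shlex


def build_query_command(file_source: str, output_file: str, queries: list[str]) -> list[str]:
    """Build an analyze-document argv that asks the given queries."""
    return [
        "textractor", "analyze-document", file_source, output_file,
        "--features", "QUERIES",
        "--queries", *queries,
    ]


command = build_query_command(
    "invoice.png", "output.json", ["What is the total?", "Who is the vendor?"]
)
# Each query is one list element, so spaces inside a query need no escaping here.
print(shlex.join(command))
```

shlex.join uses single quotes for the two queries; double quotes, as mentioned above, work equally well when typing the command by hand.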

start-document-analysis

Asynchronous API for document analysis (forms, tables, queries, and signatures)

textractor start-document-analysis [-h] --features
                                   {FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
                                   [{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
                                   [--queries QUERIES [QUERIES ...]]
                                   [--s3-upload-path S3_UPLOAD_PATH]
                                   [--s3-output-path S3_OUTPUT_PATH]
                                   [--profile-name PROFILE_NAME]
                                   [--region-name REGION_NAME]
                                   file_source

Positional Arguments

file_source: File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, provide an S3 destination with --s3-upload-path

Named Arguments

--features: Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT
--queries: List of queries, use quotes (") to escape spaces
--s3-upload-path: Path to upload the input files to, required if file_source is not an S3 path
--s3-output-path: Path to write the response to
--profile-name: AWS profile name to use for the request
--region-name: AWS region to use for the request

analyze-expense

Synchronous API for expense analysis

textractor analyze-expense [-h] [--profile-name PROFILE_NAME]
                           [--region-name REGION_NAME]
                           [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
                           [--linearize]
                           [--linearize-config-path LINEARIZE_CONFIG_PATH]
                           [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
                           [--font-size-ratio FONT_SIZE_RATIO]
                           file_source output_file

Positional Arguments

file_source: File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
output_file: Output file to save the response, can be an S3 path

Named Arguments

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to its pixel height (for example, 0.5 would be half the pixel height)

Default: 0.75

start-expense-analysis

Asynchronous API for expense analysis

textractor start-expense-analysis [-h] [--s3-upload-path S3_UPLOAD_PATH]
                                  [--s3-output-path S3_OUTPUT_PATH]
                                  [--profile-name PROFILE_NAME]
                                  [--region-name REGION_NAME]
                                  file_source

Positional Arguments

file_source: File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, provide an S3 destination with --s3-upload-path

Named Arguments

--s3-upload-path: Path to upload the input files to, required if file_source is not an S3 path
--s3-output-path: Path to write the response to
--profile-name: AWS profile name to use for the request
--region-name: AWS region to use for the request

analyze-id

API for identity document analysis (supports driver’s licenses and passports).

textractor analyze-id [-h] [--profile-name PROFILE_NAME]
                      [--region-name REGION_NAME]
                      [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
                      file_source output_file

Positional Arguments

file_source: File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
output_file: Output file to save the response, can be an S3 path

Named Arguments

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS

Print the output in a readable format

get-result

Try to fetch the result for a given job id

textractor get-result [-h] [--profile-name PROFILE_NAME]
                      [--region-name REGION_NAME]
                      [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
                      [--linearize]
                      [--linearize-config-path LINEARIZE_CONFIG_PATH]
                      job_id {DETECT_TEXT,ANALYZE,EXPENSE} output_file

Positional Arguments

job_id

Job ID, as returned by any of the asynchronous functions

api

Possible choices: DETECT_TEXT, ANALYZE, EXPENSE

API used to make the request

output_file

Output file to save the response, can be an S3 path

Named Arguments

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization
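When several asynchronous subcommands are in use, the api argument of get-result has to match the call that produced the JobId. The lookup below is a sketch: the DETECT_TEXT pairing comes from the example earlier in this document, while the other two pairings are inferred from the names and should be treated as assumptions. The dictionary and function names are hypothetical.

```python
# Assumed pairing between each asynchronous subcommand and the get-result api value.
API_FOR_SUBCOMMAND = {
    "start-document-text-detection": "DETECT_TEXT",
    "start-document-analysis": "ANALYZE",  # inferred from the name
    "start-expense-analysis": "EXPENSE",   # inferred from the name
}


def get_result_args(start_subcommand: str, job_id: str, output_file: str) -> list[str]:
    """Build a get-result argv for a job started with the given subcommand."""
    try:
        api = API_FOR_SUBCOMMAND[start_subcommand]
    except KeyError:
        raise ValueError(f"not an asynchronous subcommand: {start_subcommand}") from None
    return ["textractor", "get-result", job_id, api, output_file]
```

Keeping the pairing in one place means that a wrong api value surfaces as a ValueError before any request is made.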