CLI

Textractor comes with its own command line interface, which aims to be easier to use than the default boto3 interface by adding several quality-of-life improvements.

First, install the package using pip install amazon-textract-textractor. Make sure that your Python bin directory is added to PATH, otherwise your shell will not find the executable. If you are not using a virtual environment, this will probably be the case.
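A quick way to confirm that the executable is reachable is to look it up on PATH. The following is a minimal sketch using only the Python standard library; the helper name find_cli is hypothetical and not part of Textractor.

```python
import shutil


def find_cli(name: str = "textractor"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        print("textractor not found; check that your Python bin directory is on PATH")
    else:
        print(f"textractor found at {path}")
```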

Available APIs

Textractor supports all Textract APIs and follows their official names as described here: https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html. A single subcommand, GetResult (get-result), fetches the results of the asynchronous APIs.

Synchronous APIs:

  • DetectDocumentText/detect-document-text (Returns words and lines)

  • AnalyzeDocument/analyze-document (Returns Forms, Tables and Query results)

  • AnalyzeExpense/analyze-expense (Returns standardized fields for invoices)

  • AnalyzeID/analyze-id (Returns standardized fields for driver’s licenses and passports)

Asynchronous APIs:

  • StartDocumentTextDetection/start-document-text-detection

  • StartDocumentAnalysis/start-document-analysis

  • StartExpenseAnalysis/start-expense-analysis
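The subcommand names listed above appear to be the official API names rewritten in lowercase with hyphens. As an illustration only, the sketch below derives one from the other with a regular expression; the function api_to_subcommand is hypothetical and is not how the CLI itself defines its subcommands.

```python
import re


def api_to_subcommand(api_name: str) -> str:
    """Convert an API name such as 'DetectDocumentText' to 'detect-document-text'."""
    # Insert a hyphen wherever a lowercase letter is directly followed by an
    # uppercase letter, then lowercase the whole string.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "-", api_name).lower()


if __name__ == "__main__":
    for name in ("DetectDocumentText", "AnalyzeID", "GetResult"):
        print(name, "->", api_to_subcommand(name))
```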

Getting document text

Now let's say you have a file and you wish to run OCR on it:

textractor detect-document-text your_file.png output.json

This will call the Textract API and save the output to output.json. You can then use the Textractor Python module to post-process that response.
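If you only need the recognized lines, the saved file can also be read with the standard library. This sketch assumes that output.json contains the raw Textract response, that is, a top-level "Blocks" list whose entries carry "BlockType" and "Text" keys; the helper names are hypothetical.

```python
import json


def extract_lines(response: dict) -> list:
    """Collect the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def load_lines(path: str) -> list:
    """Read a saved response (assumed to be raw Textract JSON) and return its lines."""
    with open(path, encoding="utf-8") as handle:
        return extract_lines(json.load(handle))
```

Calling load_lines("output.json") would then return one string per detected line, provided the assumption about the file layout holds.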

Processing a directory of files

Now suppose that, instead of a single file, you wish to process an entire directory of files. You could call the above command on every file in the directory, but this would be a very long process. Instead, you can leverage Textract’s ability to scale to your workload by using the asynchronous API.

ls your_dir/ | xargs -I{} textractor start-document-text-detection your_dir/{} --s3-upload-path s3://your-bucket/your-prefix/{}

You can also parallelize it simply by adding -P8 (for 8 concurrent processes).

ls your_dir/ | xargs -P8 -I{} textractor start-document-text-detection your_dir/{} --s3-upload-path s3://your-bucket/your-prefix/{} > output.txt
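The same batch can be prepared from Python, which avoids shell quoting problems with unusual file names. The sketch below only builds the argument lists and does not run anything; the function name build_start_commands is hypothetical, and the bucket and prefix are the placeholders used above. Note that the local path keeps its directory while the S3 key uses only the file name.

```python
from pathlib import Path


def build_start_commands(paths, s3_prefix: str) -> list:
    """Build one start-document-text-detection argument list per input path."""
    prefix = s3_prefix.rstrip("/")
    commands = []
    for path in map(Path, paths):
        commands.append([
            "textractor", "start-document-text-detection", str(path),
            "--s3-upload-path", f"{prefix}/{path.name}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical usage: list the files of a local directory, if it exists.
    directory = Path("your_dir")
    files = sorted(p for p in directory.iterdir() if p.is_file()) if directory.is_dir() else []
    for command in build_start_commands(files, "s3://your-bucket/your-prefix/"):
        print(command)
```

Each list can be passed to subprocess.run, whose standard output would carry the job ID in the same way as the shell version.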

You will notice that output.txt contains only identifiers like this: 628e39089ffa1b52d62d980ec1cf4f62cb7f785c83a708b2e17ebaaf21ad0d61. Those are JobIDs and can be used to fetch the output of asynchronous operations.
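Before fetching results, it can help to filter out blank lines or stray messages that may have ended up in output.txt. The sketch below keeps only lines that look like job IDs. The pattern (1 to 64 letters, digits, hyphens, or underscores) is an assumption about the JobId format, not something the CLI guarantees, and a single-word message would still pass it.

```python
import re

# Assumed shape of a Textract JobId; adjust if your IDs differ.
JOB_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def read_job_ids(text: str) -> list:
    """Return the lines of `text` that look like job IDs, in order."""
    ids = []
    for line in text.splitlines():
        candidate = line.strip()
        if JOB_ID_PATTERN.match(candidate):
            ids.append(candidate)
    return ids
```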

Wait a few minutes (depending on the number of files you processed) and then fetch the results with GetResult.

cat output.txt | xargs -I{} textractor get-result {} DETECT_TEXT {}.json

Using -P8 would make the above faster, but be careful not to increase the concurrent process count too much as you might run into rate limiting issues (See https://docs.aws.amazon.com/textract/latest/dg/limits.html for more details).
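One way to keep the number of concurrent processes bounded is a thread pool with a fixed number of workers. This is a sketch under assumptions: the helper names are hypothetical, the default of 8 workers simply mirrors -P8, and the runner argument exists so the logic can be tried without calling the real CLI.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def get_result_command(job_id: str, api: str = "DETECT_TEXT") -> list:
    """Argument list equivalent to: textractor get-result JOB_ID API JOB_ID.json"""
    return ["textractor", "get-result", job_id, api, f"{job_id}.json"]


def fetch_all(job_ids, max_workers: int = 8, runner=None) -> list:
    """Run one get-result per job ID, at most `max_workers` at a time.

    Returns whatever `runner` returns for each command, in input order.
    By default the runner executes the command and returns its exit code.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    commands = [get_result_command(job_id) for job_id in job_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

Lowering max_workers is the simplest lever if you start hitting rate limits.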

Visualizing the output

The textractor CLI allows you to overlay the output of Amazon Textract on top of an image for troubleshooting. It is only available for synchronous APIs (DetectDocumentText, AnalyzeDocument) and allows you to visualize words, lines, keys and values, and tables.

In this example we will overlay words and tables on top of the tests/fixtures/amzn_q2.png file. The image will be created in the same directory as the output.json file under the name output.json.png.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES

This will yield the following:

[Image: Overlayer output]

This document has a lot of small words, making it difficult to read. You can add --font-size-ratio to the command to increase the font size (the default is 0.75).

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES --font-size-ratio 1.0

[Image: Overlayer output, bigger font]
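The overlay commands above can also be assembled programmatically, which makes it easy to catch a misspelled choice before calling the CLI. The sketch below is illustrative: the function name is hypothetical, the allowed values are copied from the reference section, and the returned image path follows the naming described above (the output file name with .png appended).

```python
FEATURES = {"FORMS", "TABLES", "QUERIES", "SIGNATURES", "LAYOUT"}
OVERLAYS = {"ALL", "WORDS", "LINES", "TABLES", "FORMS", "QUERIES", "SIGNATURE"}


def analyze_with_overlay(file_source, output_file, features, overlay, font_size_ratio=None):
    """Return (argument list, expected overlay image path) for analyze-document."""
    unknown = (set(features) - FEATURES) | (set(overlay) - OVERLAYS)
    if unknown:
        raise ValueError(f"unsupported choices: {sorted(unknown)}")
    cmd = ["textractor", "analyze-document", file_source, output_file,
           "--features", *features, "--overlay", *overlay]
    if font_size_ratio is not None:
        cmd += ["--font-size-ratio", str(font_size_ratio)]
    # The image is written next to the output file, with ".png" appended.
    return cmd, output_file + ".png"
```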

Reference

Command line interface for the Textractor library

usage: textractor [-h]
                  {detect-document-text,start-document-text-detection,analyze-document,start-document-analysis,analyze-expense,start-expense-analysis,analyze-id,get-result}
                  ...

Positional Arguments

subcommand

Possible choices: detect-document-text, start-document-text-detection, analyze-document, start-document-analysis, analyze-expense, start-expense-analysis, analyze-id, get-result

Sub-command help

Sub-commands

detect-document-text

Synchronous API for Optical Character Recognition

textractor detect-document-text [-h] [--profile-name PROFILE_NAME]
                                [--region-name REGION_NAME]
                                [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} ...]]
                                [--linearize]
                                [--linearize-config-path LINEARIZE_CONFIG_PATH]
                                [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} ...]]
                                [--font-size-ratio FONT_SIZE_RATIO]
                                file_source output_file
Positional Arguments
file_source

File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path

output_file

Output file to save the response, can be an S3 path

Named Arguments
--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURE

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to the pixel height; the default is 0.75

Default: 0.75
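Since the synchronous subcommands list JPEG, PNG, TIFF, and BMP as accepted types while the asynchronous ones also list PDF, a small pre-check can reject unsupported files early. This is a sketch only: the mapping from file types to extensions is an assumption, and the CLI may decide based on file content rather than on the extension.

```python
from pathlib import PurePosixPath

# Assumed extensions for the file types named in the help text.
SYNC_TYPES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}
ASYNC_TYPES = SYNC_TYPES | {".pdf"}


def is_supported(file_source: str, asynchronous: bool = False) -> bool:
    """Check the extension of a local or S3 path against the listed types."""
    suffix = PurePosixPath(file_source).suffix.lower()
    return suffix in (ASYNC_TYPES if asynchronous else SYNC_TYPES)
```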

start-document-text-detection

Asynchronous API for Optical Character Recognition

textractor start-document-text-detection [-h]
                                         [--s3-upload-path S3_UPLOAD_PATH]
                                         [--s3-output-path S3_OUTPUT_PATH]
                                         [--profile-name PROFILE_NAME]
                                         [--region-name REGION_NAME]
                                         file_source
Positional Arguments
file_source

File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, you can provide an S3 path to upload it to with --s3-upload-path

Named Arguments
--s3-upload-path

Path to upload the input files to, required if file_source is not an S3 path

--s3-output-path

Path to write the response to

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request
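The rule that --s3-upload-path is needed whenever file_source is not already in S3 can be checked before the CLI is invoked. The sketch below assumes that an S3 path is one that starts with s3://; both helper names are hypothetical.

```python
def is_s3_path(path: str) -> bool:
    """Assumption: S3 paths are written as s3://bucket/key."""
    return path.startswith("s3://")


def start_detection_command(file_source, s3_upload_path=None, s3_output_path=None):
    """Build the argument list for start-document-text-detection."""
    if not is_s3_path(file_source) and s3_upload_path is None:
        raise ValueError("a local file_source needs --s3-upload-path")
    cmd = ["textractor", "start-document-text-detection", file_source]
    if s3_upload_path is not None:
        cmd += ["--s3-upload-path", s3_upload_path]
    if s3_output_path is not None:
        cmd += ["--s3-output-path", s3_output_path]
    return cmd
```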

analyze-document

Synchronous API for document analysis (forms, tables, queries, and signatures)

textractor analyze-document [-h] --features
                            {FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
                            [{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
                            [--queries QUERIES [QUERIES ...]]
                            [--profile-name PROFILE_NAME]
                            [--region-name REGION_NAME]
                            [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} ...]]
                            [--linearize]
                            [--linearize-config-path LINEARIZE_CONFIG_PATH]
                            [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} ...]]
                            [--font-size-ratio FONT_SIZE_RATIO]
                            file_source output_file
Positional Arguments
file_source

File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path

output_file

Output file to save the response, can be an S3 path

Named Arguments
--features

Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT

--queries

List of queries, use quotes (") to escape spaces

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURE

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to the pixel height; the default is 0.75

Default: 0.75
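Queries usually contain spaces, so each one has to reach the CLI as a single argument. When the command is built in Python, passing a list of arguments sidesteps quoting entirely, and shlex.join shows what the equivalent shell line would look like. The function name and the example queries below are hypothetical.

```python
import shlex


def analyze_queries_command(file_source, output_file, queries):
    """Return (argument list, shell-quoted command line) for a QUERIES request."""
    argv = ["textractor", "analyze-document", file_source, output_file,
            "--features", "QUERIES", "--queries", *queries]
    # shlex.join quotes every argument that needs it for a POSIX shell.
    return argv, shlex.join(argv)


if __name__ == "__main__":
    _, line = analyze_queries_command(
        "form.png", "output.json", ["What is the total?", "Who signed?"]
    )
    print(line)
```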

start-document-analysis

Asynchronous API for document analysis (forms, tables, queries, and signatures)

textractor start-document-analysis [-h] --features
                                   {FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
                                   [{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
                                   [--queries QUERIES [QUERIES ...]]
                                   [--s3-upload-path S3_UPLOAD_PATH]
                                   [--s3-output-path S3_OUTPUT_PATH]
                                   [--profile-name PROFILE_NAME]
                                   [--region-name REGION_NAME]
                                   file_source
Positional Arguments
file_source

File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, you can provide an S3 path to upload it to with --s3-upload-path

Named Arguments
--features

Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT

--queries

List of queries, use quotes (") to escape spaces

--s3-upload-path

Path to upload the input files to, required if file_source is not an S3 path

--s3-output-path

Path to write the response to

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

analyze-expense

Synchronous API for expense analysis

textractor analyze-expense [-h] [--profile-name PROFILE_NAME]
                           [--region-name REGION_NAME]
                           [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} ...]]
                           [--linearize]
                           [--linearize-config-path LINEARIZE_CONFIG_PATH]
                           [--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURE} ...]]
                           [--font-size-ratio FONT_SIZE_RATIO]
                           file_source output_file
Positional Arguments
file_source

File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path

output_file

Output file to save the response, can be an S3 path

Named Arguments
--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization

--overlay

Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURE

Save an image of the document with the words, lines, form fields, and tables overlaid on top

--font-size-ratio

Scales the text up or down relative to the pixel height; the default is 0.75

Default: 0.75

start-expense-analysis

Asynchronous API for expense analysis

textractor start-expense-analysis [-h] [--s3-upload-path S3_UPLOAD_PATH]
                                  [--s3-output-path S3_OUTPUT_PATH]
                                  [--profile-name PROFILE_NAME]
                                  [--region-name REGION_NAME]
                                  file_source
Positional Arguments
file_source

File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3; for a local file, you can provide an S3 path to upload it to with --s3-upload-path

Named Arguments
--s3-upload-path

Path to upload the input files to, required if file_source is not an S3 path

--s3-output-path

Path to write the response to

--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

analyze-id

API for identity document analysis (supports driver’s licenses and passports).

textractor analyze-id [-h] [--profile-name PROFILE_NAME]
                      [--region-name REGION_NAME]
                      [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} ...]]
                      file_source output_file
Positional Arguments
file_source

File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path

output_file

Output file to save the response, can be an S3 path

Named Arguments
--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS

Print the output in a readable format

get-result

Try to fetch the result for a given job id

textractor get-result [-h] [--profile-name PROFILE_NAME]
                      [--region-name REGION_NAME]
                      [--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS} ...]]
                      [--linearize]
                      [--linearize-config-path LINEARIZE_CONFIG_PATH]
                      job_id {DETECT_TEXT,ANALYZE,EXPENSE} output_file
Positional Arguments
job_id

Job ID, as returned by any of the asynchronous functions

api

Possible choices: DETECT_TEXT, ANALYZE, EXPENSE

API used to make the request

output_file

Output file to save the response, can be an S3 path

Named Arguments
--profile-name

AWS profile name to use for the request

--region-name

AWS region to use for the request

--print

Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS

Print the output in a readable format

--linearize

Print the linearized document output

Default: False

--linearize-config-path

Configuration file for the linearization
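get-result needs to be told which API produced the job. The sketch below pairs each asynchronous subcommand with an api value; this pairing is inferred from the names alone and should be treated as an assumption, and the helper get_result_for is hypothetical.

```python
# Assumed pairing between the subcommand that started a job and the
# api positional argument expected by get-result.
API_FOR_SUBCOMMAND = {
    "start-document-text-detection": "DETECT_TEXT",
    "start-document-analysis": "ANALYZE",
    "start-expense-analysis": "EXPENSE",
}


def get_result_for(job_id: str, start_subcommand: str, output_file: str) -> list:
    """Build the get-result argument list for a job started by `start_subcommand`."""
    try:
        api = API_FOR_SUBCOMMAND[start_subcommand]
    except KeyError:
        raise ValueError(f"not an asynchronous subcommand: {start_subcommand}") from None
    return ["textractor", "get-result", job_id, api, output_file]
```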