CLI
Textractor comes with its very own command line interface that aims to be easier to use than the default boto3 interface by adding several quality of life improvements.
First install the package using pip install amazon-textract-textractor
make sure that you Python bin directory is added to PATH otherwise it will not find the executable. If you are not using a virtual environment this will probably be the case.
Available APIs
Textractor
supports all Textract APIs and follow their official names as described here: https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html. We use a single subcommand to fetch the results named GetResult
.
Synchronous APIs:
DetectDocumentText/detect-document-text (Returns words and lines)
AnalyzeDocument/analyze-document (Returns Forms, Tables and Query results)
AnalyzeExpense/analyze-expense (Returns standardized fields for invoices)
AnalyzeID/analyze-id (Returns standardized fields for driver’s license and passports)
Asynchronous APIs:
StartDocumentTextDetection/start-document-text-detection
StartDocumentAnalysis/start-document-analysis
StartExpenseAnalysis/start-expense-analysis
Getting document text
Now lets say you have a file and you wish to run OCR on it:
textractor detect-document-text your_file.png output.json
This will call the Textract API and save the output to output.json
. You could use the Textractor python module to post-process those response afterwards.
Processing a directory of files
Now if instead of a file, you wished to process an entire directory of files. You could call the above on every file in the directory, but this would prove to be a very long process. Instead you can leverage Textract’s ability to scale to your workload using the asynchronous API.
ls your_dir/ | xargs -I{} textractor start-document-text-detection {} --s3-upload-path s3://your-bucket/your-prefix/{}
You can also parallelize it simply by adding -P8 (for 8 concurrent processes).
ls your_dir/ | xargs -P8 -I{} textractor start-document-text-detection {} --s3-upload-path s3://your-bucket/your-prefix/{} > output.txt
You will notice that all you have in output.txt are UUID like this: 628e39089ffa1b52d62d980ec1cf4f62cb7f785c83a708b2e17ebaaf21ad0d61
. Those are JobIDs and can be used to fetch the output of asynchronous operations.
Wait a few minutes (dependending on the number of files your processed) and then fetch the result with GetResult
.
cat output.txt | xargs -I{} textractor get-result {} DETECT_TEXT {}.json
Using -P8
would make the above faster, but be careful not to increase the concurrent process count too much as you might run into rate limiting issues (See https://docs.aws.amazon.com/textract/latest/dg/limits.html for more details).
Visualizing the output
The textractor
CLI allows you to overlay the output of Amazon Textract on top of an image for troubleshooting. It is only available for synchronous APIs (DetectDocumentText, AnalyzeDocument) and allows you to visualize words, lines, key and values, and tables.
In this example we will overlay words and tables on top of the tests/fixtures/amzn_q2.png
file. The image will be created in the same directory as the output.json
file under the name output.json.png
.
textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES
This will yield the following (click to enlarge):
This document has a lot of small words, making it difficult to read. You can add --font-size-ratio
to the command to increase the font size.
textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay WORDS TABLES --font-size-ratio 1.0
(default it 0.75)
Reference
Commandline interface for the Textractor library
usage: textractor [-h]
{detect-document-text,start-document-text-detection,analyze-document,start-document-analysis,analyze-expense,start-expense-analysis,analyze-id,get-result}
...
Positional Arguments
- subcommand
Possible choices: detect-document-text, start-document-text-detection, analyze-document, start-document-analysis, analyze-expense, start-expense-analysis, analyze-id, get-result
Sub-command help
Sub-commands
detect-document-text
Synchronous API for Optical Character Recognition
textractor detect-document-text [-h] [--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
[--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
[--linearize]
[--linearize-config-path LINEARIZE_CONFIG_PATH]
[--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
[--font-size-ratio FONT_SIZE_RATIO]
file_source output_file
Positional Arguments
- file_source
File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
- output_file
Output file to save the response, can be an S3 path
Named Arguments
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS
Print the output in a readable format
- --linearize
Print the linearized document output
Default:
False
- --linearize-config-path
Configuration file for the linearization
- --overlay
Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS
Save an image of the document with the words, lines, form fields, and tables overlayed on top
- --font-size-ratio
Scales the text up or down, default is 0.75, which would be half the pixel height
Default:
0.75
start-document-text-detection
Asynchronous API for Optical Character Recognition
textractor start-document-text-detection [-h]
[--s3-upload-path S3_UPLOAD_PATH]
[--s3-output-path S3_OUTPUT_PATH]
[--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
file_source
Positional Arguments
- file_source
File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3, you you can provide an S3 path with –upload-s3-path
Named Arguments
- --s3-upload-path
Path to upload the input files to, required if input_file is not an S3 path
- --s3-output-path
Path to write the response to
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
analyze-document
Synchronous API for document analysis (forms, tables, queries, and signatures)
textractor analyze-document [-h] --features
{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
[{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
[--queries QUERIES [QUERIES ...]]
[--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
[--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
[--linearize]
[--linearize-config-path LINEARIZE_CONFIG_PATH]
[--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
[--font-size-ratio FONT_SIZE_RATIO]
file_source output_file
Positional Arguments
- file_source
File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
- output_file
Output file to save the response, can be an S3 path
Named Arguments
- --features
Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT
- --queries
List of queries, use quotes (”) to escape spaces
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS
Print the output in a readable format
- --linearize
Print the linearized document output
Default:
False
- --linearize-config-path
Configuration file for the linearization
- --overlay
Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS
Save an image of the document with the words, lines, form fields, and tables overlayed on top
- --font-size-ratio
Scales the text up or down, default is 0.75, which would be half the pixel height
Default:
0.75
start-document-analysis
Asynchronous API for document analysis (forms, tables, queries, and signatures)
textractor start-document-analysis [-h] --features
{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT}
[{FORMS,TABLES,QUERIES,SIGNATURES,LAYOUT} ...]
[--queries QUERIES [QUERIES ...]]
[--s3-upload-path S3_UPLOAD_PATH]
[--s3-output-path S3_OUTPUT_PATH]
[--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
file_source
Positional Arguments
- file_source
File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3, you you can provide an S3 path with –upload-s3-path
Named Arguments
- --features
Possible choices: FORMS, TABLES, QUERIES, SIGNATURES, LAYOUT
- --queries
List of queries, use quotes (”) to escape spaces
- --s3-upload-path
Path to upload the input files to, required if input_file is not an S3 path
- --s3-output-path
Path to write the response to
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
analyze-expense
Synchronous API for expense analysis
textractor analyze-expense [-h] [--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
[--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
[--linearize]
[--linearize-config-path LINEARIZE_CONFIG_PATH]
[--overlay {ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} [{ALL,WORDS,LINES,TABLES,FORMS,QUERIES,SIGNATURES,LAYOUTS} ...]]
[--font-size-ratio FONT_SIZE_RATIO]
file_source output_file
Positional Arguments
- file_source
File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
- output_file
Output file to save the response, can be an S3 path
Named Arguments
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS
Print the output in a readable format
- --linearize
Print the linearized document output
Default:
False
- --linearize-config-path
Configuration file for the linearization
- --overlay
Possible choices: ALL, WORDS, LINES, TABLES, FORMS, QUERIES, SIGNATURES, LAYOUTS
Save an image of the document with the words, lines, form fields, and tables overlayed on top
- --font-size-ratio
Scales the text up or down, default is 0.75, which would be half the pixel height
Default:
0.75
start-expense-analysis
Asynchronous API for expense analysis
textractor start-expense-analysis [-h] [--s3-upload-path S3_UPLOAD_PATH]
[--s3-output-path S3_OUTPUT_PATH]
[--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
file_source
Positional Arguments
- file_source
File to process, must be of type PDF, JPEG, PNG, TIFF, BMP. The file has to be in S3, you you can provide an S3 path with –upload-s3-path
Named Arguments
- --s3-upload-path
Path to upload the input files to, required if input_file is not an S3 path
- --s3-output-path
Path to write the response to
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
analyze-id
API for identity document analysis (supports driver’s license and passports).
textractor analyze-id [-h] [--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
[--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
file_source output_file
Positional Arguments
- file_source
File to process, must be of type JPEG, PNG, TIFF, BMP. Can be an S3 path
- output_file
Output file to save the response, can be an S3 path
Named Arguments
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS
Print the output in a readable format
get-result
Try to fetch the result for a given job id
textractor get-result [-h] [--profile-name PROFILE_NAME]
[--region-name REGION_NAME]
[--print {ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} [{ALL,TEXT,TABLES,FORMS,QUERIES,EXPENSES,SIGNATURES,IDS,LAYOUTS} ...]]
[--linearize]
[--linearize-config-path LINEARIZE_CONFIG_PATH]
job_id {DETECT_TEXT,ANALYZE,EXPENSE} output_file
Positional Arguments
- job_id
Job ID, as returned by any of the asynchronous functions
- api
Possible choices: DETECT_TEXT, ANALYZE, EXPENSE
API used to make the request
- output_file
Output file to save the response, can be an S3 path
Named Arguments
- --profile-name
AWS profile name to use for the request
- --region-name
AWS region to use for the request
Possible choices: ALL, TEXT, TABLES, FORMS, QUERIES, EXPENSES, SIGNATURES, IDS, LAYOUTS
Print the output in a readable format
- --linearize
Print the linearized document output
Default:
False
- --linearize-config-path
Configuration file for the linearization