Textract Caller

Textractor is the main class of this package and must be instantiated before any of the package's functionality can be used. Its main purpose is to call the Textract API and create Python objects for all the document entities returned in the API's JSON output. The response is parsed implicitly, and a Document object is returned containing all the document entities, their relationships and metadata.

The Textract APIs map to Textractor methods as listed below. Use these wrappers to make calls and parse the responses in one step.

(SYNC) DetectDocumentText : detect_document_text
(SYNC) AnalyzeDocument : analyze_document
(SYNC) AnalyzeID : analyze_id
(SYNC) AnalyzeExpense : analyze_expense
(ASYNC) StartDocumentTextDetection : start_document_text_detection
(ASYNC) StartDocumentAnalysis : start_document_analysis
(ASYNC) StartExpenseAnalysis : start_expense_analysis
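The mapping above can also be written as a small lookup table. This is only a sketch: the names API_TO_METHOD, method_for, is_async and call_api are hypothetical helpers for illustration and are not part of the package.

```python
# Textract API operation -> (Textractor method name, call mode), as listed above.
API_TO_METHOD = {
    "DetectDocumentText": ("detect_document_text", "SYNC"),
    "AnalyzeDocument": ("analyze_document", "SYNC"),
    "AnalyzeID": ("analyze_id", "SYNC"),
    "AnalyzeExpense": ("analyze_expense", "SYNC"),
    "StartDocumentTextDetection": ("start_document_text_detection", "ASYNC"),
    "StartDocumentAnalysis": ("start_document_analysis", "ASYNC"),
    "StartExpenseAnalysis": ("start_expense_analysis", "ASYNC"),
}


def method_for(api_name: str) -> str:
    """Return the Textractor method name that wraps the given Textract API."""
    return API_TO_METHOD[api_name][0]


def is_async(api_name: str) -> bool:
    """Return True when the API starts a job instead of answering directly."""
    return API_TO_METHOD[api_name][1] == "ASYNC"


def call_api(extractor, api_name: str, **kwargs):
    """Look up the wrapper on a Textractor-like object and call it."""
    return getattr(extractor, method_for(api_name))(**kwargs)
```

A lookup like this may help when the API to call is chosen at run time, for example from a configuration value.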

class textractor.textractor.Textractor(profile_name: str = None, region_name: str = None, kms_key_id: str = '')

Bases: object

Initializes the customer credentials needed to make calls to Textract. The boto3 package is used internally.

Parameters:

profile_name (str, optional) – Customer’s profile name as set in the ~/.aws/config file. A profile entry typically has this format: [default] region = us-west-2 output = json
region_name (str, optional) – If the AWS CLI isn’t set up, the user can pass a region to let boto3 pick up credentials from the system.
kms_key_id (str, optional) – Customer’s AWS KMS key (cryptographic key)
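As a minimal sketch of the constructor described above, the wrapper below forwards the three documented parameters. The helper name make_textractor and the profile name in the usage comment are hypothetical. The import is done inside the function so the sketch can be loaded without the package installed.

```python
from typing import Optional


def make_textractor(
    profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
    kms_key_id: str = "",
):
    """Create a Textractor with the documented constructor parameters."""
    # Imported lazily: requires the textractor package and, at call time,
    # AWS credentials that boto3 can find.
    from textractor.textractor import Textractor

    return Textractor(
        profile_name=profile_name,
        region_name=region_name,
        kms_key_id=kms_key_id,
    )


# Hypothetical usage, not executed here:
#   extractor = make_textractor(profile_name="my-profile")
#   extractor = make_textractor(region_name="us-west-2")
```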

analyze_document(file_source, features, queries: Union[QueriesConfig, List[Query], List[str]] = None, save_image: bool = True) → Document

Makes a call to the SYNC AnalyzeDocument API, implicitly parses the response and produces a Document object. This function is ideal for single-page PDFs or single images.

Parameters:

file_source (str or PIL.Image, required) – Path to a file stored locally or in an S3 bucket, or a PIL Image
features (list, required) – List of TextractFeatures to be extracted from the document by the Textract API
queries (Union[QueriesConfig, List[Query], List[str]]) – Queries to run on the document
save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns:

A Document object containing all the entities, relationships and metadata extracted by the Textract AnalyzeDocument API.

Return type:

Document
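The call described above might be wrapped as follows. This is a sketch under assumptions: analyze_single_page is a hypothetical helper, it takes any object exposing analyze_document with the documented keyword arguments, and the import path in the usage comment is assumed rather than taken from this page.

```python
def analyze_single_page(extractor, file_source, features, questions=None, keep_images=False):
    """Run the synchronous AnalyzeDocument wrapper on one image or single-page PDF.

    `questions` is a list of plain strings, one of the documented forms of
    the `queries` parameter. It is only forwarded when it is non-empty.
    """
    kwargs = {
        "file_source": file_source,
        "features": list(features),
        "save_image": keep_images,  # images are only needed for visualization
    }
    if questions:
        kwargs["queries"] = list(questions)
    return extractor.analyze_document(**kwargs)


# Hypothetical usage, not executed here (import path is an assumption):
#   from textractor.data.constants import TextractFeatures
#   document = analyze_single_page(
#       extractor, "form.png", [TextractFeatures.QUERIES], ["What is the total?"]
#   )
```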

analyze_expense(file_source: Union[str, List[Image], List[str]], save_image: bool = True)

Makes a call to the SYNC AnalyzeExpense API, implicitly parses the response and produces a Document object. This function is ideal for multipage PDFs or a list of images.

Parameters:

file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally or in an S3 bucket, or a list of PIL Images
save_image (bool, optional) – Whether to keep the file source as PIL Images inside the returned Document object, defaults to True

Raises:

IncorrectMethodException – Raised when the file source type is incompatible with the Textract API being called
InputError – Raised when the file source type is invalid
InvalidS3ObjectException – Raised when the file source region differs from the API region.
exception – Raised if the Textract API call fails

Returns:

Document

Return type:

Document
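Since several exception types are listed above, a caller may want to record which one occurred without stopping a batch. The sketch below is one way to do that. The helper try_analyze_expense is hypothetical, and it catches the broad Exception class on purpose because the module path of the listed exception classes is not given here.

```python
from typing import Any, Optional, Tuple


def try_analyze_expense(extractor, file_source) -> Tuple[Optional[Any], Optional[str]]:
    """Call analyze_expense and return (document, error_message).

    Exactly one of the two values is None. The error message starts with
    the exception class name so failures can be grouped afterwards.
    """
    try:
        document = extractor.analyze_expense(file_source=file_source, save_image=False)
    except Exception as err:  # covers the documented exception types
        return None, f"{type(err).__name__}: {err}"
    return document, None
```

In a batch job, the returned error string can be logged per file while the remaining files are still processed.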

analyze_id(file_source: Union[str, List[Image], List[str]], save_image: bool = True) → Document

AnalyzeID parses identity documents such as passports and driver’s licenses and returns the result as a dictionary of standardized fields. See https://docs.aws.amazon.com/textract/latest/dg/identitydocumentfields.html for a complete list.

Parameters:

file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally, on an S3 bucket or list of PIL Images
save_image (bool, optional) – Saves the images in the returned Document object for visualizing the results, defaults to True

Raises:

InputError – Raised when the file_source could not be parsed
InvalidS3ObjectException – Raised when the S3 object passed as file source is in a region that does not match the one used to create the Textractor object.
exception – Raised when the Textract call fails

Returns:

Document

Return type:

Document

detect_document_text(file_source, save_image: bool = True) → Document

Makes a call to the SYNC DetectDocumentText API, implicitly parses the response and produces a Document object. This function is ideal for single-page PDFs or single images.

Parameters:

file_source (str or PIL.Image, required) – Path to a file stored locally or in an S3 bucket, or a PIL Image
save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns:

A Document object containing all the entities, relationships and metadata extracted by the Textract DetectDocumentText API.

Return type:

Document

get_result(job_id: str, api: Union[TextractAPI, Textract_API]) → Document

Retrieves Textract API output for a given job id.

Parameters:

job_id (str, required) – Textract API JobID

Returns:

Returns a Document object

Return type:

Document
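A sketch of how get_result might be used to collect several jobs, assuming the job ids were saved earlier and the jobs have already completed. The helper fetch_results is hypothetical, and the value passed as api must be whatever TextractAPI member matches the API that started the jobs.

```python
def fetch_results(extractor, job_ids, api):
    """Fetch the parsed Document for each job id, keyed by job id."""
    documents = {}
    for job_id in job_ids:
        if not job_id:
            raise ValueError("job_id must be a non-empty string")
        # Keyword names follow the documented get_result signature.
        documents[job_id] = extractor.get_result(job_id=job_id, api=api)
    return documents
```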

start_document_analysis(file_source: Union[str, bytes, Image], features, s3_output_path: str = '', s3_upload_path: str = '', queries: Union[QueriesConfig, List[Query], List[str]] = None, client_request_token: str = '', job_tag: str = '', save_image: bool = True) → LazyDocument

Makes a call to the ASYNC StartDocumentAnalysis API and returns a lazy-loaded Document object (LazyDocument). This function is ideal for multipage PDFs or a single image.

Parameters:

file_source (Union[str, bytes, Image.Image], required) – Path to a file stored locally, on an S3 bucket or a PIL Image
features (list, required) – List of TextractFeatures to be extracted from the document by the Textract API
s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).
s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns:

Lazy-loaded Document object containing all the entities, relationships and metadata extracted by the Textract StartDocumentAnalysis API.

Return type:

LazyDocument
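The client_request_token parameter described above can be derived from the request itself, so that resubmitting the same file with the same features reuses the same token. The sketch below assumes the ClientRequestToken limits from the Textract API reference (1 to 64 characters from letters, digits, hyphen and underscore), which a SHA-256 hex digest satisfies. Both helper names and the S3 path in the usage comment are hypothetical.

```python
import hashlib


def idempotency_token(file_source: str, features) -> str:
    """Derive a stable 64-character hex token from the source path and features."""
    digest = hashlib.sha256()
    digest.update(file_source.encode("utf-8"))
    # Sort the feature names so their order does not change the token.
    for name in sorted(str(feature) for feature in features):
        digest.update(b"\0" + name.encode("utf-8"))
    return digest.hexdigest()


def start_analysis_once(extractor, file_source: str, features, s3_upload_path: str = ""):
    """Start an asynchronous analysis with a token derived from the request."""
    return extractor.start_document_analysis(
        file_source=file_source,
        features=features,
        s3_upload_path=s3_upload_path,
        client_request_token=idempotency_token(file_source, features),
        save_image=False,
    )


# Hypothetical usage, not executed here:
#   lazy_document = start_analysis_once(extractor, "s3://bucket/report.pdf", features)
```

Deriving the token this way is a design choice for retry safety. If the same file should be reprocessed on purpose, a different token is needed.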

start_document_text_detection(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True) → LazyDocument

Make a call to the ASYNC StartDocumentTextDetection API.

Parameters:

file_source (Union[str, bytes, Image.Image], required) – File bytes, path to a file stored locally or in an S3 bucket
s3_output_path (str) – Prefix to store the output on the S3 bucket (passed as param to Textractor).
s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentTextDetection requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns:

Lazy-loaded Document object

Return type:

LazyDocument
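Because detect_document_text is described as ideal for single pages, a caller might switch to the asynchronous method for longer documents. The sketch below shows one such rule. The helper detect_text and the page_count argument are hypothetical, and the caller is assumed to know the page count already.

```python
def detect_text(extractor, file_source, page_count: int, s3_upload_path: str = ""):
    """Pick the SYNC or ASYNC text detection wrapper from the page count.

    Returns a Document for a single page and a LazyDocument otherwise.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        return extractor.detect_document_text(file_source=file_source, save_image=False)
    return extractor.start_document_text_detection(
        file_source=file_source,
        s3_upload_path=s3_upload_path,  # needed when the file is not yet in S3
        save_image=False,
    )
```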

start_expense_analysis(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True) → LazyDocument

Makes a call to the ASYNC StartExpenseAnalysis API and returns a lazy-loaded Document object (LazyDocument). This function is ideal for multipage PDFs or a single image.

Parameters:

file_source (Union[str, bytes, Image.Image]) – Path to a file stored locally, on an S3 bucket or a PIL Image
s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).
s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartExpenseAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Raises:

InputError – Raised when the file source type is invalid
InvalidS3ObjectException – Raised when the file source region differs from the API region.
exception – Raised if the Textract API call fails

Returns:

Lazy-loaded Document object

Return type:

LazyDocument
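When many expense documents are submitted, the job_tag parameter can carry a label derived from the file name. The sketch below assumes the JobTag limits from the Textract API reference (at most 64 characters from letters, digits and the characters _ . - :). The helpers make_job_tag and submit_expenses and the S3 paths are hypothetical.

```python
import re


def make_job_tag(label: str) -> str:
    """Replace characters outside the assumed JobTag alphabet and cap the length."""
    return re.sub(r"[^a-zA-Z0-9_.\-:]", "-", label)[:64]


def submit_expenses(extractor, s3_paths):
    """Start one asynchronous expense analysis per S3 path.

    Returns a dict mapping each path to the LazyDocument returned for it.
    """
    jobs = {}
    for path in s3_paths:
        file_name = path.rsplit("/", 1)[-1]
        jobs[path] = extractor.start_expense_analysis(
            file_source=path,
            job_tag=make_job_tag(file_name),
            save_image=False,
        )
    return jobs
```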