Textract Caller

Textractor is the main class of this package and must be instantiated before any of the package's functionality can be used. Its main purpose is to call the Textract API and create Python objects for the document entities returned in the API's JSON output. The response is parsed implicitly, and a Document object is returned containing all the document entities, their relationships and metadata.

The mapping between Textract APIs and Textractor methods is listed below. These wrappers call the API and parse the response in one step.

  • (SYNC) DetectDocumentText : detect_document_text

  • (SYNC) AnalyzeDocument : analyze_document

  • (SYNC) AnalyzeID : analyze_id

  • (SYNC) AnalyzeExpense : analyze_expense

  • (ASYNC) StartDocumentTextDetection : start_document_text_detection

  • (ASYNC) StartDocumentAnalysis : start_document_analysis

  • (ASYNC) StartExpenseAnalysis : start_expense_analysis
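The mapping above can also be written as a small lookup table, which is handy when dispatching on an API name. This is only a sketch: the names API_TO_METHOD, method_for and async_methods are hypothetical helpers introduced here and are not part of Textractor.

```python
# Textract API operation -> (Textractor method name, is_async), as listed above.
API_TO_METHOD = {
    "DetectDocumentText": ("detect_document_text", False),
    "AnalyzeDocument": ("analyze_document", False),
    "AnalyzeID": ("analyze_id", False),
    "AnalyzeExpense": ("analyze_expense", False),
    "StartDocumentTextDetection": ("start_document_text_detection", True),
    "StartDocumentAnalysis": ("start_document_analysis", True),
    "StartExpenseAnalysis": ("start_expense_analysis", True),
}


def method_for(api_name: str) -> str:
    """Return the Textractor method name that wraps the given Textract API."""
    method_name, _is_async = API_TO_METHOD[api_name]
    return method_name


def async_methods() -> list:
    """Return the wrapper names of the ASYNC (Start*) operations, in list order."""
    return [name for name, is_async in API_TO_METHOD.values() if is_async]
```

With an existing Textractor instance, getattr(extractor, method_for("AnalyzeID")) would then resolve to the analyze_id wrapper.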

class textractor.textractor.Textractor(profile_name: Optional[str] = None, region_name: Optional[str] = None, kms_key_id: str = '')

Bases: object

Initializes the customer credentials needed to call Textract, using the boto3 package internally.

Parameters
  • profile_name (str, optional) – Customer’s profile name as set in the ~/.aws/config file. A profile typically has this format:

        [default]
        region = us-west-2
        output = json

  • region_name (str, optional) – If the AWS CLI isn’t set up, the user can pass a region to let boto3 pick up credentials from the system.

  • kms_key_id (str, optional) – Customer’s AWS KMS key (cryptographic key)
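A minimal sketch of constructing the class from these parameters. The helpers textractor_kwargs and make_extractor are hypothetical names introduced for illustration, and the guard that insists on either profile_name or region_name is a choice made in this sketch, not documented behaviour. The import is done lazily so that the argument helper can be used without the package installed.

```python
from typing import Optional


def textractor_kwargs(profile_name: Optional[str] = None,
                      region_name: Optional[str] = None,
                      kms_key_id: str = "") -> dict:
    """Keep only the constructor arguments the caller actually supplied."""
    if not profile_name and not region_name:
        # Sketch-level guard (an assumption, not documented behaviour).
        raise ValueError("pass profile_name or region_name")
    kwargs = {}
    if profile_name:
        kwargs["profile_name"] = profile_name
    if region_name:
        kwargs["region_name"] = region_name
    if kms_key_id:
        kwargs["kms_key_id"] = kms_key_id
    return kwargs


def make_extractor(profile_name=None, region_name=None, kms_key_id=""):
    """Build a Textractor; needs the package installed and AWS credentials."""
    # Imported lazily so textractor_kwargs works without the package.
    from textractor.textractor import Textractor
    return Textractor(**textractor_kwargs(profile_name, region_name, kms_key_id))
```

For example, make_extractor(profile_name="default") would use the [default] profile, while make_extractor(region_name="us-west-2") relies on credentials found on the system.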

analyze_document(file_source, features, queries: Optional[Union[QueriesConfig, List[Query], List[str]]] = None, s3_output_path: str = '', save_image: bool = True) Document

Makes a call to the SYNC AnalyzeDocument API, implicitly parses the response, and produces a Document object. This function is ideal for single-page PDFs or single images.

Parameters
  • file_source (str or PIL.Image, required) – Path to a file stored locally or on an S3 bucket, or a PIL Image

  • features (list, required) – List of TextractFeatures to be extracted from the Document by the Textract API

  • queries (Union[QueriesConfig, List[Query], List[str]], optional) – Queries to run on the document

  • s3_output_path (str, optional) – Prefix to store the output on the S3 bucket (passed as param to Textractor).

  • save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns

Returns a Document object containing all the entities, relationships and metadata extracted by the Textract AnalyzeDocument API.

Return type

Document
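The call described above might look like the following sketch. The wrapper analyze_single_page and the function example_call are hypothetical; the import path textractor.data.constants and the enum members QUERIES and TABLES are assumptions about the package layout, and "invoice.png" is a placeholder file name.

```python
def analyze_single_page(extractor, file_source, features, queries=None,
                        visualize=False):
    """Call the SYNC AnalyzeDocument wrapper and return the parsed Document.

    `extractor` is an already-constructed Textractor instance.
    """
    return extractor.analyze_document(
        file_source=file_source,
        features=features,
        queries=queries,
        # Images are only needed for drawing bounding boxes later.
        save_image=visualize,
    )


def example_call():
    """Illustrative call; requires the textractor package and AWS access."""
    from textractor.textractor import Textractor
    # Assumed import path and member names for the feature enum.
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(profile_name="default")
    return analyze_single_page(
        extractor,
        "invoice.png",  # hypothetical local file
        features=[TextractFeatures.QUERIES, TextractFeatures.TABLES],
        queries=["What is the invoice total?"],
    )
```

Passing save_image=False is a design choice for batch processing, where the page images are not needed for visualization.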

analyze_expense(file_source: Union[str, List[Image], List[str]], save_image: bool = True)

Makes a call to the SYNC AnalyzeExpense API, implicitly parses the response, and produces a Document object. This function is ideal for multipage PDFs or a list of images.

Parameters
  • file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally or on an S3 bucket, or a list of PIL Images

  • save_image (bool, optional) – Whether to keep the file source as PIL Images inside the returned Document object, defaults to True

Raises
  • IncorrectMethodException – Raised when the file source type is incompatible with the Textract API being called

  • InputError – Raised when the file source type is invalid

  • RegionMismatchError – Raised when the file source region is different from the API region.

  • exception – Raised if the Textract API call fails

Returns

Document

Return type

Document
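One way to act on the exceptions listed above is sketched below. To keep the block runnable without the package installed, it matches exceptions by class name rather than importing them, which is a shortcut taken in this sketch; explain_failure and analyze_expense_or_none are hypothetical helpers.

```python
# Exception names documented for analyze_expense, with their meaning.
DOCUMENTED_ERRORS = {
    "IncorrectMethodException": "file source type is incompatible with this API",
    "InputError": "file source type is invalid",
    "RegionMismatchError": "file source region differs from the API region",
}


def explain_failure(exc: Exception) -> str:
    """Map an exception to its documented cause, matching on the class name."""
    return DOCUMENTED_ERRORS.get(type(exc).__name__, "Textract API call failed")


def analyze_expense_or_none(extractor, file_source):
    """Return the Document, or None after reporting why the call failed."""
    try:
        return extractor.analyze_expense(file_source=file_source, save_image=False)
    except Exception as exc:  # broad on purpose: a generic exception is listed too
        print(f"analyze_expense failed: {explain_failure(exc)}")
        return None
```

In production code, catching the specific exception classes directly would be more robust than comparing names.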

analyze_id(file_source: Union[str, List[Image], List[str]], save_image: bool = True) Document

AnalyzeID parses identity documents such as passports and driver’s licenses and returns the result as a dictionary of standardized fields. See https://docs.aws.amazon.com/textract/latest/dg/identitydocumentfields.html for a complete list.

Parameters
  • file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally, on an S3 bucket or list of PIL Images

  • save_image (bool, optional) – Saves the images in the returned Document object for visualizing the results, defaults to True

Raises
  • InputError – Raised when the file_source could not be parsed

  • RegionMismatchError – Raised when the S3 object passed as file source is in a region that does not match the one used to create the Textractor object.

  • exception – Raised when the Textract call fails

Returns

Document

Return type

Document

detect_document_text(file_source, s3_output_path: str = '', save_image: bool = True) Document

Makes a call to the SYNC DetectDocumentText API, implicitly parses the response, and produces a Document object. This function is ideal for single-page PDFs or single images.

Parameters
  • file_source (str or PIL.Image, required) – Path to a file stored locally or on an S3 bucket, or a PIL Image

  • s3_output_path (str, optional) – S3 path to store the output.

  • save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns

Returns a Document object containing all the entities, relationships and metadata extracted by the Textract DetectDocumentText API.

Return type

Document

get_result(job_id: str, api: Union[TextractAPI, Textract_API]) Document

Retrieves Textract API output for a given job id.

Parameters
  • job_id (str, required) – Textract API JobID

  • api (Union[TextractAPI, Textract_API]) – Textract API that the job was started with

Returns

Returns a Document object

Return type

Document
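A sketch of fetching the output of an existing job. fetch_job_output and example_fetch are hypothetical helpers; the import path textractor.data.constants and the member name TextractAPI.ANALYZE are assumptions and should be checked against the installed version.

```python
def fetch_job_output(extractor, job_id: str, api):
    """Fetch and parse the output of an existing Textract job."""
    if not job_id:
        # Sketch-level guard; an empty job id cannot identify a job.
        raise ValueError("job_id must be a non-empty string")
    return extractor.get_result(job_id, api)


def example_fetch(job_id: str):
    """Illustrative call; requires the textractor package and AWS access."""
    from textractor.textractor import Textractor
    # Assumed import path and member name for the API enum.
    from textractor.data.constants import TextractAPI

    extractor = Textractor(profile_name="default")
    return fetch_job_output(extractor, job_id, TextractAPI.ANALYZE)
```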

start_document_analysis(file_source: Union[str, bytes, Image], features, s3_output_path: str = '', s3_upload_path: str = '', queries: Optional[Union[QueriesConfig, List[Query], List[str]]] = None, client_request_token: str = '', job_tag: str = '', save_image: bool = True) LazyDocument

Makes a call to the ASYNC StartDocumentAnalysis API, implicitly parses the response, and produces a LazyDocument object. This function is ideal for multipage PDFs or a single image.

Parameters
  • file_source (Union[str, bytes, Image.Image], required) – Path to a file stored locally, on an S3 bucket or a PIL Image

  • features (list, required) – List of TextractFeatures to be extracted from the Document by the TextractAPI

  • s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).

  • s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.

  • client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.

  • job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.

  • save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Returns

Returns a lazy-loaded Document object containing all the entities, relationships and metadata extracted by the Textract StartDocumentAnalysis API.

Return type

LazyDocument
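The client_request_token parameter only prevents duplicate jobs if retries send the same token. One possible approach, sketched here, derives the token deterministically from the inputs. request_token_for and start_analysis_once are hypothetical helpers; a SHA-256 hex digest is 64 alphanumeric characters, which should fit the token limits of the Textract API, although those limits are not stated in this document.

```python
import hashlib


def request_token_for(file_source: str, features: list) -> str:
    """Derive a stable token from the inputs so retries reuse the same job."""
    key = file_source + "|" + ",".join(sorted(str(f) for f in features))
    # A SHA-256 hex digest is 64 characters drawn from [0-9a-f].
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def start_analysis_once(extractor, s3_uri: str, features: list):
    """Start the ASYNC job with a token that is identical across retries."""
    return extractor.start_document_analysis(
        file_source=s3_uri,
        features=features,
        client_request_token=request_token_for(s3_uri, features),
        save_image=False,
    )
```

Sorting the feature names makes the token independent of the order in which features are listed.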

start_document_text_detection(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True)

Makes a call to the ASYNC StartDocumentTextDetection API.

Parameters
  • file_source (Union[str, bytes, Image.Image], required) – File bytes, path to a file stored locally or in an S3 bucket

  • s3_output_path (str) – Prefix to store the output on the S3 bucket (passed as param to Textractor).

  • s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.

  • client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentTextDetection requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.

  • job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.

Returns

Returns a job id which can be used to fetch the results

Return type

str

start_expense_analysis(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True) LazyDocument

Makes a call to the ASYNC StartExpenseAnalysis API, implicitly parses the response, and produces a LazyDocument object. This function is ideal for multipage PDFs or a single image.

Parameters
  • file_source (Union[str, bytes, Image.Image]) – Path to a file stored locally, on an S3 bucket or a PIL Image

  • s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).

  • s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.

  • client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartExpenseAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.

  • job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.

  • save_image (bool) – Flag to indicate if document images are to be stored within the Document object. This is optional and necessary only if the customer wants to visualize bounding boxes for their document entities.

Raises
  • InputError – Raised when the file source type is invalid

  • RegionMismatchError – Raised when the file source region is different from the API region.

  • exception – Raised if the Textract API call fails

Returns

Lazy-loaded Document object

Return type

LazyDocument
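The interaction between file_source and s3_upload_path can be sketched as follows. start_expense_job is a hypothetical helper, and treating any string that starts with "s3://" as an S3 location is an assumption of this sketch. Requiring an upload prefix for local files follows the note above that, without s3_upload_path, the data is expected to already be in S3.

```python
def start_expense_job(extractor, file_source: str, upload_prefix: str = ""):
    """Start an ASYNC expense job, uploading local files first when asked to."""
    kwargs = {"file_source": file_source, "save_image": False}
    if not file_source.startswith("s3://"):  # assumed S3 URI convention
        if not upload_prefix:
            raise ValueError("local files need an S3 upload prefix")
        # Textractor uploads the file under this prefix before calling Textract.
        kwargs["s3_upload_path"] = upload_prefix
    return extractor.start_expense_analysis(**kwargs)
```

The returned object is the LazyDocument produced by start_expense_analysis; the helper does not wait for the job to finish.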