Textract Caller
Textractor is the main class of this package and must be instantiated before any of the package’s functionality
can be used. Its main purpose is to call the Textract API and create Python objects for all the document entities
returned in the API’s JSON output. The response is parsed implicitly, and a Document object is returned
containing all the document entities, their relationships, and metadata.
The Textract APIs map to Textractor methods as listed below. Use these wrappers to make a call and parse the response in one step.
- (SYNC) DetectDocumentText : detect_document_text 
- (SYNC) AnalyzeDocument : analyze_document 
- (SYNC) AnalyzeID : analyze_id 
- (SYNC) AnalyzeExpense : analyze_expense 
- (ASYNC) StartDocumentTextDetection : start_document_text_detection 
- (ASYNC) StartDocumentAnalysis : start_document_analysis 
- (ASYNC) StartExpenseAnalysis : start_expense_analysis 
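The mapping above can also be written down as data, which is handy when the API to call is chosen at run time. This is a minimal sketch: `TEXTRACT_API_TO_METHOD` and `resolve_wrapper` are hypothetical names introduced here for illustration and are not part of the package; only the API names, method names, and SYNC/ASYNC labels come from the list above.

```python
# Textract API name -> (Textractor method name, call mode), copied from the list above.
# The dict and helper below are illustrative, not part of the textractor package.
TEXTRACT_API_TO_METHOD = {
    "DetectDocumentText": ("detect_document_text", "SYNC"),
    "AnalyzeDocument": ("analyze_document", "SYNC"),
    "AnalyzeID": ("analyze_id", "SYNC"),
    "AnalyzeExpense": ("analyze_expense", "SYNC"),
    "StartDocumentTextDetection": ("start_document_text_detection", "ASYNC"),
    "StartDocumentAnalysis": ("start_document_analysis", "ASYNC"),
    "StartExpenseAnalysis": ("start_expense_analysis", "ASYNC"),
}


def resolve_wrapper(extractor, api_name):
    """Return the bound method on `extractor` that wraps the given Textract API."""
    method_name, _mode = TEXTRACT_API_TO_METHOD[api_name]
    return getattr(extractor, method_name)
```

With a real Textractor instance, `resolve_wrapper(extractor, "AnalyzeID")` would return the bound `analyze_id` method.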
- class textractor.textractor.Textractor(profile_name: str = None, region_name: str = None, kms_key_id: str = '')
- Bases: object
- Initializes the customer credentials needed to make calls to Textract, using the boto3 package internally.
- Parameters:
- profile_name (str, optional) – Customer’s profile name as set in the ~/.aws/config file. A profile typically has this format:
      [default]
      region = us-west-2
      output = json
- region_name (str, optional) – If the AWS CLI isn’t set up, the user can pass a region to let boto3 pick up credentials from the system.
- kms_key_id (str, optional) – Customer’s AWS KMS key (cryptographic key)
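A minimal sketch of constructing the class from the parameters above. `build_textractor_kwargs` and `make_textractor` are hypothetical helpers written for this example; the import is done lazily inside the function so the snippet loads even where the package is not installed. Actually creating the object requires the package and usable AWS credentials.

```python
def build_textractor_kwargs(profile_name=None, region_name=None, kms_key_id=""):
    """Collect only the constructor arguments that were actually supplied."""
    kwargs = {}
    if profile_name is not None:
        kwargs["profile_name"] = profile_name
    if region_name is not None:
        kwargs["region_name"] = region_name
    if kms_key_id:
        kwargs["kms_key_id"] = kms_key_id
    return kwargs


def make_textractor(profile_name=None, region_name=None, kms_key_id=""):
    """Create a Textractor; needs the textractor package and AWS credentials."""
    # Lazy import: module path taken from the class signature documented above.
    from textractor.textractor import Textractor

    return Textractor(**build_textractor_kwargs(profile_name, region_name, kms_key_id))
```

For example, `make_textractor(region_name="us-west-2")` would rely on credentials that boto3 finds on the system, while `make_textractor(profile_name="default")` would use the named profile.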
 
 - analyze_document(file_source, features, queries: Union[QueriesConfig, List[Query], List[str]] = None, save_image: bool = True) Document
- Make a call to the SYNC AnalyzeDocument API, implicitly parse the response, and produce a Document object. This function is ideal for single-page PDFs or single images.
- Parameters:
- file_source (str or PIL.Image, required) – Path to a file stored locally or in an S3 bucket, or a PIL Image
- features (list, required) – List of TextractFeatures to be extracted from the document by the Textract API
- queries (Union[QueriesConfig, List[Query], List[str]]) – Queries to run on the document
- save_image (bool) – Whether to store document images within the Document object. Optional; needed only if the customer wants to visualize bounding boxes for their document entities.
 
- Returns:
- A Document object containing all the entities, relationships, and metadata extracted by the Textract AnalyzeDocument API.
- Return type:
- Document
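A minimal sketch of calling analyze_document with the parameters documented above. `analyze_forms_and_tables` and `demo` are hypothetical names, and `"invoice.png"` is a placeholder path. The import path `textractor.data.constants` for `TextractFeatures` is not stated in this section and is an assumption here. `demo` is defined but not called, because it needs AWS credentials.

```python
def analyze_forms_and_tables(extractor, file_source, features, queries=None):
    """Call the SYNC AnalyzeDocument wrapper using the documented keyword arguments."""
    return extractor.analyze_document(
        file_source=file_source,
        features=features,
        queries=queries,
        save_image=False,  # page images are only needed for visualization
    )


def demo():
    """Illustrative only: requires the textractor package and AWS credentials."""
    from textractor.textractor import Textractor
    # Assumed import path for the TextractFeatures enum.
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(region_name="us-west-2")
    return analyze_forms_and_tables(
        extractor,
        "invoice.png",  # placeholder: a local single-page image
        [TextractFeatures.FORMS, TextractFeatures.TABLES],
    )
```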
 
 - analyze_expense(file_source: Union[str, List[Image], List[str]], save_image: bool = True)
- Make a call to the SYNC AnalyzeExpense API, implicitly parse the response, and produce a Document object. This function is ideal for multipage PDFs or a list of images.
- Parameters:
- file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally or in an S3 bucket, or a list of PIL Images
- save_image (bool, optional) – Whether to keep the file source as PIL Images inside the returned Document object, defaults to True
 
- Raises:
- IncorrectMethodException – Raised when the file source type is incompatible with the Textract API being called 
- InputError – Raised when the file source type is invalid 
- InvalidS3ObjectException – Raised when the file source region differs from the API region.
- Exception – Raised if the Textract API call fails
 
- Returns:
- Document 
- Return type:
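The exceptions listed above fall into two groups: problems with the input, and failures of the Textract call itself. The sketch below separates the two. `try_analyze_expense` and `DOCUMENTED_INPUT_ERRORS` are hypothetical names. The exception classes are matched by class name only because this section does not state where they are imported from; real code would normally import them and catch them directly.

```python
# Exception class names copied from the "Raises" list above.
DOCUMENTED_INPUT_ERRORS = (
    "IncorrectMethodException",
    "InputError",
    "InvalidS3ObjectException",
)


def try_analyze_expense(extractor, file_source):
    """Return (document, None) on success or (None, reason) on failure."""
    try:
        document = extractor.analyze_expense(file_source=file_source, save_image=False)
    except Exception as exc:  # broad on purpose: the docs list a generic exception too
        name = type(exc).__name__
        if name in DOCUMENTED_INPUT_ERRORS:
            return None, f"rejected input: {name}"
        return None, f"Textract call failed: {name}"
    return document, None
```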
 
 - analyze_id(file_source: Union[str, List[Image], List[str]], save_image: bool = True) Document
- AnalyzeID parses identity documents such as passports and driver’s licenses and returns the result as a dictionary of standardized fields. See https://docs.aws.amazon.com/textract/latest/dg/identitydocumentfields.html for a complete list.
- Parameters:
- file_source (Union[str, List[Image.Image], List[str]]) – Path to a file stored locally or in an S3 bucket, or a list of PIL Images
- save_image (bool, optional) – Saves the images in the returned Document object for visualizing the results, defaults to True
 
- Raises:
- InputError – Raised when the file_source could not be parsed 
- InvalidS3ObjectException – Raised when the S3 object passed as file source is in a region that does not match the one used to create the Textractor object. 
- Exception – Raised when the Textract call fails
 
- Returns:
- Document 
- Return type:
- Document
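Because file_source accepts either a single path or a list, a small helper can build the right shape before calling analyze_id. This is a sketch under stated assumptions: `id_file_source` and `analyze_identity` are hypothetical names, and the file names in the usage note are placeholders.

```python
def id_file_source(*paths):
    """Return one path as a str, several paths as a list of str."""
    if not paths:
        raise ValueError("at least one path is required")
    return paths[0] if len(paths) == 1 else list(paths)


def analyze_identity(extractor, *paths):
    """Call the SYNC AnalyzeID wrapper on one or more image paths."""
    return extractor.analyze_id(file_source=id_file_source(*paths), save_image=False)
```

For instance, `analyze_identity(extractor, "license_front.png", "license_back.png")` would pass both images as a list in a single call.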
 
 - detect_document_text(file_source, save_image: bool = True) Document
- Make a call to the SYNC DetectDocumentText API, implicitly parse the response, and produce a Document object. This function is ideal for single-page PDFs or single images.
- Parameters:
- file_source (str or PIL.Image, required) – Path to a file stored locally or in an S3 bucket, or a PIL Image
- save_image (bool) – Whether to store document images within the Document object. Optional; needed only if the customer wants to visualize bounding boxes for their document entities.
 
- Returns:
- A Document object containing all the entities, relationships, and metadata extracted by the Textract DetectDocumentText API.
- Return type:
- Document
 
 - get_result(job_id: str, api: Union[TextractAPI, Textract_API]) Document
- Retrieves Textract API output for a given job id.
- Parameters:
- job_id (str, required) – Textract API JobID
- Returns:
- Returns a Document object
- Return type:
- Document
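A minimal sketch of fetching the output of a previously started job. `fetch_result` is a hypothetical helper. Per the signature, `api` is a TextractAPI (or Textract_API) value; its member names are not listed in this section, so the caller passes it in rather than the sketch guessing one.

```python
def fetch_result(extractor, job_id, api):
    """Retrieve the parsed output of a Textract job by its JobID."""
    if not isinstance(job_id, str) or not job_id:
        raise ValueError("job_id must be a non-empty string")
    # `api` identifies which Textract API produced the job (a TextractAPI value).
    return extractor.get_result(job_id, api)
```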
 - start_document_analysis(file_source: Union[str, bytes, Image], features, s3_output_path: str = '', s3_upload_path: str = '', queries: Union[QueriesConfig, List[Query], List[str]] = None, client_request_token: str = '', job_tag: str = '', save_image: bool = True) LazyDocument
- Make a call to the ASYNC StartDocumentAnalysis API, implicitly parse the response, and produce a Document object. This function is ideal for multipage PDFs or an image.
- Parameters:
- file_source (Union[str, bytes, Image.Image], required) – Path to a file stored locally or in an S3 bucket, or a PIL Image
- features (list, required) – List of TextractFeatures to be extracted from the document by the Textract API
- s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).
- s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
- client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
- job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
- save_image (bool) – Whether to store document images within the Document object. Optional; needed only if the customer wants to visualize bounding boxes for their document entities.
 
- Returns:
- A lazy-loaded Document object containing all the entities, relationships, and metadata extracted by the Textract StartDocumentAnalysis API.
- Return type:
- LazyDocument
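The client_request_token parameter is only useful if the same request yields the same token each time. One way to get that, sketched here under assumptions, is to derive the token from the request itself. `request_token` and `start_analysis_once` are hypothetical helpers, and the sketch assumes file_source is a path string. A SHA-256 hex digest is used because its 64 alphanumeric characters fit the token limits given in the Textract API reference, as far as I know.

```python
import hashlib


def request_token(file_source, features):
    """Derive a stable token: the same document and features give the same token."""
    feature_names = sorted(str(feature) for feature in features)
    payload = "|".join([file_source, *feature_names])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()  # 64 hex characters


def start_analysis_once(extractor, file_source, features, s3_upload_path=""):
    """Start the ASYNC job with a token that makes accidental restarts harmless."""
    return extractor.start_document_analysis(
        file_source=file_source,
        features=features,
        s3_upload_path=s3_upload_path,
        client_request_token=request_token(file_source, features),
        save_image=False,
    )
```

Sorting the feature names means the order in which features are listed does not change the token.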
 
 - start_document_text_detection(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True) LazyDocument
- Make a call to the ASYNC StartDocumentTextDetection API.
- Parameters:
- file_source (Union[str, bytes, Image.Image], required) – File bytes, path to a file stored locally or in an S3 bucket 
- s3_output_path (str) – Prefix to store the output on the S3 bucket (passed as param to Textractor). 
- s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
- client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartDocumentTextDetection requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
- job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
- save_image (bool) – Whether to store document images within the Document object. Optional; needed only if the customer wants to visualize bounding boxes for their document entities.
 
- Returns:
- Lazy-loaded Document object 
- Return type:
- LazyDocument
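The SYNC text-detection wrapper is described above as ideal for single-page input. A caller that knows the page count can pick between it and this ASYNC wrapper. The sketch below assumes the ASYNC text-detection wrapper suits multipage input in the same way the other ASYNC wrappers are described; `detect_text` and its `page_count` argument are hypothetical and supplied by the caller.

```python
def detect_text(extractor, file_source, page_count, s3_upload_path=""):
    """Use the SYNC wrapper for one page and the ASYNC wrapper otherwise."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        return extractor.detect_document_text(file_source=file_source, save_image=False)
    return extractor.start_document_text_detection(
        file_source=file_source,
        s3_upload_path=s3_upload_path,
        save_image=False,
    )
```

Note the two branches return different types: a Document from the SYNC call and a LazyDocument from the ASYNC call.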
 
 - start_expense_analysis(file_source: Union[str, bytes, Image], s3_output_path: str = '', s3_upload_path: str = '', client_request_token: str = '', job_tag: str = '', save_image: bool = True) LazyDocument
- Make a call to the ASYNC StartExpenseAnalysis API, implicitly parse the response, and produce a Document object. This function is ideal for multipage PDFs or an image.
- Parameters:
- file_source (Union[str, bytes, Image.Image]) – Path to a file stored locally or in an S3 bucket, or a PIL Image
- s3_output_path (str) – Path to store the output on the S3 bucket (passed as param to Textractor).
- s3_upload_path (str, optional) – If given, the document is automatically uploaded to the given S3 prefix before calling Textract. Files are uploaded under a UUID. If not given, the data is expected to already be in S3.
- client_request_token (str, optional) – The idempotent token that’s used to identify the start request. If you use the same token with multiple StartExpenseAnalysis requests, the same JobId is returned. Use ClientRequestToken to prevent the same job from being accidentally started more than once.
- job_tag (str, optional) – An identifier that you specify that’s included in the completion notification published to the Amazon SNS topic.
- save_image (bool) – Whether to store document images within the Document object. Optional; needed only if the customer wants to visualize bounding boxes for their document entities.
 
- Raises:
- InputError – Raised when the file source type is invalid 
- InvalidS3ObjectException – Raised when the file source region differs from the API region.
- Exception – Raised if the Textract API call fails
 
- Returns:
- Lazy-loaded Document object 
- Return type:
- LazyDocument
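The s3_upload_path parameter matters only when the input is not already in S3. The sketch below checks that before starting the job. `needs_upload` and `start_expense` are hypothetical helpers, and the sketch assumes S3 locations are written as `s3://bucket/key` strings, which this section does not spell out.

```python
def needs_upload(file_source):
    """True unless file_source is an S3 URI string (assumed form: s3://bucket/key)."""
    return not (isinstance(file_source, str) and file_source.startswith("s3://"))


def start_expense(extractor, file_source, s3_upload_path=""):
    """Start the ASYNC expense job, passing an upload prefix only when it is needed."""
    upload = needs_upload(file_source)
    if upload and not s3_upload_path:
        raise ValueError("input outside S3 needs s3_upload_path so it can be uploaded first")
    kwargs = {"file_source": file_source, "save_image": False}
    if upload:
        kwargs["s3_upload_path"] = s3_upload_path
    return extractor.start_expense_analysis(**kwargs)
```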