Document Entities

Document objects contain various entities. The Textract document analysis APIs recognize 6 document entities, namely WORD, LINE, KEY_VALUE_SET, SELECTION_ELEMENT, TABLE and CELL.

These are structures that occur in most documents and the package provides classes to programmatically store and access the information produced by Textract for these entities.

BoundingBox

The BoundingBox class contains all the co-ordinate information for a DocumentEntity. This class is mainly useful for locating the entity on the image of the document page.

class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)

Bases: SpatialObject

Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By default BoundingBox is set to work with denormalized co-ordinates: \(x \in [0, docwidth]\) and \(y \in [0, docheight]\). Use the as_normalized_dict function to obtain BoundingBox with normalized co-ordinates: \(x \in [0, 1]\) and \(y \in [0, 1]\).

Create a BoundingBox as shown below:

Directly: bb = BoundingBox(x, y, width, height)
From dict: bb = BoundingBox.from_dict(bb_dict) where bb_dict = {'x': x, 'y': y, 'width': width, 'height': height}

Use a BoundingBox as shown below:

Directly: print('The top left is: ' + str(bb.x) + ' ' + str(bb.y))
Convert to dict: bb_dict = bb.as_dict() returns {'x': x, 'y': y, 'width': width, 'height': height}

property area

Returns the area of the bounding box, handles negative bboxes as 0-area

Returns:: Bounding box area
Return type:: float
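The zero-area rule can be illustrated with a minimal sketch. The Box class below is a hypothetical stand-in, not the textractor BoundingBox, and it assumes that a "negative" box is one with a negative width or height.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a bounding box; not the textractor class."""

    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        # Assumption: a box with a negative width or height counts as empty.
        if self.width < 0 or self.height < 0:
            return 0.0
        return self.width * self.height


regular = Box(x=10, y=20, width=30, height=5)
inverted = Box(x=10, y=20, width=-30, height=5)
print(regular.area, inverted.area)
```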

as_denormalized_numpy()

Returns:: Returns denormalized co-ordinates x, y and dimensions width, height as numpy array.
Return type:: numpy.array

classmethod center_is_inside(bbox_a, bbox_b): Returns true if the center point of Bounding Box A is within Bounding Box B
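A minimal sketch of the center test, assuming boxes are plain (x, y, width, height) tuples and that points on the border count as inside. The helper name and the tuple layout are illustrative only; the library's own boundary handling may differ.

```python
def center_is_inside(box_a, box_b):
    """Return True if the center of box_a lies within box_b.

    Both boxes are hypothetical (x, y, width, height) tuples.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Center point of box A.
    cx, cy = ax + aw / 2, ay + ah / 2
    # Assumption: the borders of box B are inclusive.
    return bx <= cx <= bx + bw and by <= cy <= by + bh


cell = (0, 0, 100, 100)
word = (40, 40, 20, 10)        # center (50, 45) falls inside the cell
straddler = (90, 90, 40, 40)   # center (110, 110) falls outside the cell
print(center_is_inside(word, cell), center_is_inside(straddler, cell))
```

A center-based test of this kind is more tolerant than a full containment check when a box slightly overlaps the border of its container.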

classmethod enclosing_bbox(bboxes, spatial_object: Optional[SpatialObject] = None)

Parameters:

bboxes (List[BoundingBox]) – List of bounding boxes
spatial_object (SpatialObject) – Spatial object to be added to the returned bbox

Returns:

classmethod from_denormalized_borders(left: float, top: float, right: float, bottom: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized. If spatial_object is not None, the coordinates will be denormalized according to the spatial object.

Parameters:

left (float) – ~ [0, doc_width]
top (float) – ~ [0, doc_height]
right (float) – ~ [0, doc_width]
bottom (float) – ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

Return type:

BoundingBox

classmethod from_denormalized_corners(x1: float, y1: float, x2: float, y2: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized.

Parameters:

x1 (float) – Left ~ [0, doc_width]
y1 (float) – Top ~ [0, doc_height]
x2 (float) – Right ~ [0, doc_width]
y2 (float) – Bottom ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage)

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

Return type:

BoundingBox

classmethod from_denormalized_dict(bbox_dict: Dict[str, float])

Builds an axis aligned bounding box from a dictionary of: {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to the spatial object.

Parameters:

bbox_dict (dict) – {'x': x, 'y': y, 'width': width, 'height': height} of [0, doc_height] x [0, doc_width]
spatial_object (SpatialObject) – Some object with width and height attributes

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

Return type:

BoundingBox

classmethod from_denormalized_xywh(x: float, y: float, width: float, height: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left, width and height properties. The coordinates are assumed to be denormalized.

Parameters:

x (float) – Left ~ [0, doc_width]
y (float) – Top ~ [0, doc_height]
width (float) – Width ~ [0, doc_width]
height (float) – Height ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage)

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

Return type:

BoundingBox
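The corner-based and xywh-based constructors describe the same rectangle in two ways. The sketch below shows the arithmetic that relates them, assuming x2 >= x1 and y2 >= y1. The function corners_to_xywh is a hypothetical helper written for this illustration, not part of the package.

```python
def corners_to_xywh(x1, y1, x2, y2):
    """Convert top-left / bottom-right corners to an x, y, width, height dict.

    Hypothetical helper; assumes (x1, y1) is the top-left corner.
    """
    return {"x": x1, "y": y1, "width": x2 - x1, "height": y2 - y1}


# A box spanning from (10, 20) to (110, 70) in denormalized co-ordinates.
bb_dict = corners_to_xywh(10, 20, 110, 70)
print(bb_dict)
```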

classmethod from_normalized_dict(bbox_dict: Dict[str, float], spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned BoundingBox from a dictionary like {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to spatial_object.

Parameters:

bbox_dict (dict) – Dictionary of normalized co-ordinates.
spatial_object (SpatialObject) – Object with width and height attributes.

Returns:

Object with denormalized co-ordinates

Return type:

BoundingBox
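Denormalizing amounts to scaling the normalized values by the page dimensions. The following is a minimal sketch of that scaling using plain dictionaries. The function name denormalize and the page_width and page_height arguments are assumptions made for illustration; in the package these dimensions come from the spatial object.

```python
def denormalize(bbox_dict, page_width, page_height):
    """Scale a normalized bbox dict (values in [0, 1]) to page co-ordinates.

    Hypothetical helper; page_width and page_height stand in for the
    width and height attributes of a spatial object.
    """
    return {
        "x": bbox_dict["x"] * page_width,
        "y": bbox_dict["y"] * page_height,
        "width": bbox_dict["width"] * page_width,
        "height": bbox_dict["height"] * page_height,
    }


normalized = {"x": 0.25, "y": 0.5, "width": 0.5, "height": 0.125}
pixels = denormalize(normalized, page_width=800, page_height=1000)
print(pixels)
```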

get_distance(bbox)

Returns the distance between the center point of the bounding box and another bounding box

Returns:: Returns the distance as float
Return type:: float

get_intersection(bbox)

Returns the intersection of this object’s bbox and another BoundingBox

Returns:: A BoundingBox object
Return type:: BoundingBox
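For axis aligned boxes, the intersection is bounded by the larger of the two left and top edges and the smaller of the two right and bottom edges. The sketch below works on hypothetical (x, y, width, height) tuples and returns None when the boxes do not overlap. That return convention is a choice made for this example and is not necessarily what the package does.

```python
def intersection(box_a, box_b):
    """Return the overlap of two (x, y, width, height) boxes, or None.

    Hypothetical helper for illustration only.
    """
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    # No overlap when the candidate rectangle has no positive extent.
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)


overlap = intersection((0, 0, 10, 10), (5, 5, 10, 10))
disjoint = intersection((0, 0, 10, 10), (20, 20, 5, 5))
print(overlap, disjoint)
```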

classmethod is_inside(bbox_a, bbox_b): Returns true if Bounding Box A is within Bounding Box B

class textractor.entities.bbox.SpatialObject(width: float, height: float)

Bases: ABC

The SpatialObject interface defines an object that has a width and height. This is mostly used as a reference for BoundingBox so that it can provide normalized coordinates.

Document

The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.

class textractor.entities.document.Document(num_pages: int = 1)

Bases: SpatialObject, Linearizable

Represents the description of a single document, as it would appear in the input to the Textract API. Document serves as the root node of the object model hierarchy, which should be used as an intermediate form for most analytic purposes. The Document node also contains the metadata of the document.

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElements present in the Document.

Returns:: List of KeyValue objects, each representing a checkbox within the Document.
Return type:: EntityList[KeyValue]

directional_finder(word_1: str = '', word_2: str = '', page: int = -1, prefix: str = '', direction=Direction.BELOW, entities=[])

The function returns entity types present in entities by prepending the prefix provided by the user. This helps in cases of repeating key-values and checkboxes. The user can manipulate original data or produce a copy. The main advantage of this function is to be able to define direction.

Parameters:

word_1 (str, required) – The reference word from where x1, y1 coordinates are derived
word_2 (str, optional) – The second word preferably in the direction indicated by the parameter direction. When it isn’t given the end of page coordinates are used in the given direction.
page (int, required) – page number of the page in the document to search the entities in.
prefix (str, optional) – User provided prefix to prepend to the key. Without prefix, the method acts as a search by geometry function.
entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns:

Returns the EntityList of modified key-value and/or checkboxes

Return type:

EntityList

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Document.

Returns:: List of ExpenseDocument objects, each representing an expense document within the Document.
Return type:: EntityList[ExpenseDocument]

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv', sep: str = ';')

Export key-value entities and checkboxes in csv format.

Parameters:

include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.
sep (str) – Separator to be used in the csv file.
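To show what the sep parameter controls, here is a minimal sketch that writes key-value pairs with the standard csv module. The function write_key_values, the "Key" and "Value" header names and the sample pairs are hypothetical; the actual column layout produced by the package may differ.

```python
import csv
import io


def write_key_values(pairs, handle, sep=";"):
    """Write (key, value) pairs as delimited rows to an open text handle.

    Hypothetical helper; the header names are assumptions.
    """
    writer = csv.writer(handle, delimiter=sep, lineterminator="\n")
    writer.writerow(["Key", "Value"])
    for key, value in pairs:
        writer.writerow([key, value])


# An in-memory buffer keeps the example free of file system side effects.
buffer = io.StringIO()
write_key_values([("Name", "Jane Doe"), ("Total", "1,250.00")], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A semicolon separator leaves values that contain commas, such as amounts, unquoted in the output.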

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters:

include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.

Parameters:: filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) → List[KeyValue]

Return a list of KeyValue objects containing checkboxes if the document contains them.

Parameters:

selected (bool) – True/False Return SELECTED checkboxes
not_selected (bool) – True/False Return NOT_SELECTED checkboxes

Returns:

Returns checkboxes that match the conditions set by the flags.

Return type:

EntityList[KeyValue]
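The two flags act as independent switches over the SELECTED and NOT_SELECTED states. A minimal sketch of that logic, assuming checkboxes are represented as hypothetical (key, status) tuples rather than KeyValue objects:

```python
def filter_checkboxes(checkboxes, selected=True, not_selected=True):
    """Keep the checkboxes whose status matches the enabled flags.

    `checkboxes` is a list of hypothetical (key, status) tuples.
    """
    wanted = set()
    if selected:
        wanted.add("SELECTED")
    if not_selected:
        wanted.add("NOT_SELECTED")
    return [(key, status) for key, status in checkboxes if status in wanted]


form = [("Married", "SELECTED"), ("Single", "NOT_SELECTED"), ("Veteran", "SELECTED")]
only_ticked = filter_checkboxes(form, selected=True, not_selected=False)
print(only_ticked)
```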

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6)

Return up to top_k_matches key-value pairs for the key that is queried from the document.

Parameters:

key (str) – Query key to match
top_k_matches (int) – Maximum number of matches to return
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[KeyValue]
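The interplay of top_k_matches and similarity_threshold can be sketched with a small, self-contained fuzzy matcher. Everything below is an illustration under stated assumptions: the functions levenshtein, similarity and top_matches are hypothetical, and the similarity score (one minus the edit distance divided by the longer length, case-insensitive) is one common normalization, not necessarily the one the package uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def top_matches(query, keys, top_k=1, threshold=0.6):
    """Return up to top_k keys scoring at or above threshold, best first."""
    scored = [(similarity(query, key), key) for key in keys]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [key for _, key in kept[:top_k]]


keys = ["Invoice Number", "Invoice Date", "Total Due"]
best = top_matches("invoice numbr", keys, top_k=1)
print(best)
```

The threshold filters out weak candidates before the top-k cut, so a query with no close key yields an empty result rather than an arbitrary match.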

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList[Word]

property identity_document: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns:: List of IdentityDocument objects.
Return type:: EntityList

property identity_documents: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns:: List of IdentityDocument objects, each representing an identity document within the Document.
Return type:: EntityList[IdentityDocument]

property images: List[Image]

Returns all the page images in the Document.

Returns:: List of PIL Image objects.
Return type:: List[PIL.Image]

independent_words()

Returns:: Return all words in the document, outside of tables, checkboxes, key-values.
Return type:: EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Document.

Returns:: List of KeyValue objects, each representing a key-value pair within the Document.
Return type:: EntityList[KeyValue]

keys(include_checkboxes: bool = True) → List[str]

Returns all keys for key-value pairs and checkboxes if the document contains them.

Parameters:: include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.
Returns:: List of strings containing key names in the Document
Return type:: List[str]

property layouts: EntityList[Layout]

Returns all the Layout objects present in the Document

Returns:: List of Layout objects
Return type:: EntityList[Layout]

property lines: EntityList[Line]

Returns all the Line objects present in the Document.

Returns:: List of Line objects, each representing a line within the Document.
Return type:: EntityList[Line]

classmethod open(fp: Union[dict, str, Path, IO])

Create a Document object from a JSON file path, file handle or response dictionary

Parameters:: fp (Union[dict, str, Path, IO[AnyStr]]) – JSON file path, file handle or response dictionary to create the Document from
Raises:: InputError – Raised on input not being of type Union[dict, str, Path, IO[AnyStr]]
Returns:: Document object
Return type:: Document

page(page_no: int = 0)

Returns Page object/s depending on the input page_no. Follows zero-indexing.

Parameters:: page_no (int if single page, list of int if multiple pages) – if int, returns single Page Object, else if list, it returns a list of Page objects.
Returns:: Filters and returns Page objects depending on the input page_no
Return type:: Page or List[Page]
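The int-or-list behaviour of page_no can be sketched as a simple type dispatch. The function select_pages and the string placeholders standing in for Page objects are hypothetical and only illustrate the zero-indexed lookup described above.

```python
def select_pages(pages, page_no=0):
    """Return one page for an int, or a list of pages for a list of ints.

    Hypothetical helper; indices are zero-based.
    """
    if isinstance(page_no, int):
        return pages[page_no]
    if isinstance(page_no, list):
        return [pages[index] for index in page_no]
    raise TypeError("page_no must be an int or a list of ints")


pages = ["page-0", "page-1", "page-2"]  # placeholders for Page objects
first = select_pages(pages)
subset = select_pages(pages, [0, 2])
print(first, subset)
```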

property pages: List[Page]

Returns all the Page objects present in the Document.

Returns:: List of Page objects, each representing a Page within the Document.
Return type:: List

property queries: EntityList[Query]

Returns all the Query objects present in the Document.

Returns:: List of Query objects.
Return type:: EntityList[Query]

return_duplicates()

Returns a dictionary containing page numbers as keys and list of EntityList objects as values. Each EntityList instance contains the key-values and the last item is the table which contains duplicate information. This function is intended to let the Textract user know of duplicate objects extracted by the various Textract models.

Returns:: Dictionary containing page numbers as keys and list of EntityList objects as values.
Return type:: Dict[page_num, List[EntityList[DocumentEntity]]]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) → List[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters:

keyword (str) – Keyword that is used to query the document.
top_k (int) – Number of closest line objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of lines that contain the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) → List[Word]

Return a list of top_k words that match the keyword.

Parameters:

keyword (str) – Keyword that is used to query the document.
top_k (int) – Number of closest word objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of words that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Word]

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Document.

Returns:: List of Signature objects.
Return type:: EntityList[Signature]

property tables: EntityList[Table]

Returns all the Table objects present in the Document.

Returns:: List of Table objects, each representing a table within the Document.
Return type:: EntityList[Table]

property text: str

Returns the document text as one string

Returns:: Page text separated by line breaks
Return type:: str

to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False))

Returns the HTML representation of the document. This effectively calls Linearizable.to_html(), but adds <html><body></body></html> around the result and puts each page in a <div>.

Returns:: HTML text of the entity
Return type:: str
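The wrapping described above can be pictured with a short sketch. The function document_html and the sample page fragments are hypothetical, and the exact whitespace and attributes emitted by the package are not reproduced here.

```python
def document_html(page_fragments):
    """Wrap each per-page HTML fragment in a <div>, then in <html><body>.

    Hypothetical helper; fragments stand in for per-page to_html() output.
    """
    body = "".join(f"<div>{fragment}</div>" for fragment in page_fragments)
    return f"<html><body>{body}</body></html>"


html_text = document_html(["<p>Page one</p>", "<p>Page two</p>"])
print(html_text)
```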

to_trp2()

Parses the response to the trp2 format for backward compatibility

Returns:: TDocument object that can be used with the older Textractor libraries
Return type:: TDocument

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns:: Returns an EntityList object
Return type:: EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Document.

Returns:: List of Word objects, each representing a word within the Document.
Return type:: EntityList[Word]

LazyDocument

The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.

class textractor.entities.lazy_document.LazyDocument(job_id: str, api: TextractAPI, textract_client=None, images=None, output_config: Optional[OutputConfig] = None)

Bases: object

LazyDocument is a proxy for Document when using the async APIs. It will not load the response until one of its properties is used. You can access the underlying Document object using the document property.

property document: Document

Getter for the underlying Document object

Returns:: Proxied Document object
Return type:: Document
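The deferred-loading idea behind such a proxy can be sketched in a few lines. The LazyValue class and the fetch_result loader below are hypothetical and much simpler than LazyDocument: they only show that the expensive load runs on first access and is then cached.

```python
class LazyValue:
    """Hypothetical proxy that loads its target on first attribute access."""

    def __init__(self, loader):
        self._loader = loader
        self._value = None
        self._loaded = False

    @property
    def value(self):
        # Load once, then reuse the cached result.
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value

    def __getattr__(self, name):
        # Only called when normal lookup fails, so unknown attributes
        # are forwarded to the loaded object.
        return getattr(self.value, name)


calls = []


def fetch_result():
    """Stand-in for an expensive call such as polling a finished job."""
    calls.append("fetched")
    return {"pages": 2}


lazy = LazyValue(fetch_result)
calls_before = len(calls)        # nothing has been fetched yet
page_count = lazy.get("pages")   # first access triggers the load
lazy.get("pages")                # second access reuses the cached result
print(calls_before, page_count, calls)
```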

property s3_polling_interval: int

Getter for the polling interval

Returns:: Time between get_full_result calls
Return type:: int

property textract_polling_interval: int

Getter for the polling interval

Returns:: Time between get_full_result calls
Return type:: int

DocumentEntity

DocumentEntity is the class that all Document entities such as Word, Line, Table etc. inherit from. This class provides methods useful to all such entities.

class textractor.entities.document_entity.DocumentEntity(entity_id: str, bbox: BoundingBox)

Bases: Linearizable, ABC

An interface for all document entities within the document body, composing the hierarchy of the document object model. The purpose of this class is to define properties common to all document entities i.e. unique id and bounding box.

add_children(children)

Adds children to all entities that have parent-child relationships.

Parameters:: children (list) – List of child entities.

property bbox: BoundingBox

Returns:: Returns entire bounding box of entity
Return type:: BoundingBox

property children

Returns:: Returns children of entity
Return type:: list

property confidence: float

Returns the object confidence as predicted by Textract. If the confidence is not available, returns None

Returns:: Prediction confidence for a document entity, between 0 and 1
Return type:: float

property height: float

Returns:: Returns height for bounding box
Return type:: float

property raw_object: Dict

Returns:: Returns the raw dictionary object that was used to create this Python object
Return type:: Dict

remove(entity)

Recursively removes an entity from the child tree of a document entity and updates its bounding box

Parameters:: entity (DocumentEntity) – Entity
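Recursive removal from a child tree can be sketched as follows. The Node class is a hypothetical stand-in for a document entity, and the sketch deliberately omits the bounding box update that the real method performs.

```python
class Node:
    """Hypothetical tree node standing in for a document entity."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])

    def remove(self, target):
        # Drop the target from the direct children (identity comparison)...
        self.children = [child for child in self.children if child is not target]
        # ...then look for it further down the tree.
        for child in self.children:
            child.remove(target)


word_a, word_b = Node("word_a"), Node("word_b")
line = Node("line", [word_a, word_b])
table = Node("table", [line])

table.remove(word_a)
remaining = [child.name for child in line.children]
print(remaining)
```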

visit(word_set)

visualize(*args, **kwargs) → EntityList

Returns the object’s children in a visualization EntityList object

Returns:: Returns an EntityList object
Return type:: EntityList

property width: float

Returns:: Returns width for bounding box
Return type:: float

property x: float

Returns:: Returns x coordinate for bounding box
Return type:: float

property y: float

Returns:: Returns y coordinate for bounding box
Return type:: float

Word

Represents a single Word within the Document. This class contains the associated metadata with the Word entity including the text transcription, text type, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.word.Word(entity_id: str, bbox: BoundingBox, text: str = '', text_type: TextTypes = TextTypes.PRINTED, confidence: float = 0, is_clickable: bool = False, is_structure: bool = False)

Bases: DocumentEntity

To create a new Word object we need the following:

Parameters:

entity_id (str) – Unique identifier of the Word entity.
bbox (BoundingBox) – Bounding box of the Word entity.
text (str) – Transcription of the Word object.
text_type (TextTypes) – Enum value stating the type of text stored in the entity. Takes 2 values - PRINTED and HANDWRITING
confidence (float) – value storing the confidence of detection out of 100.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page: int

Returns:: Returns the page number of the page the Word entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property text: str

Returns:: Returns the text transcription of the Word entity.
Return type:: str

property text_type: TextTypes

Returns:: Returns the property of Word class that holds the text type of Word object.
Return type:: TextTypes

property words

Returns itself

Return type:: Word

Line

Represents a single Line Entity within the Document. The Textract API response returns groups of words as LINE BlockTypes. They contain Word entities as children.

This class contains the associated metadata with the Line entity including the entity ID, bounding box information, child words, page number, Page ID and confidence of detection.

class textractor.entities.line.Line(entity_id: str, bbox: BoundingBox, words: Optional[List[Word]] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new Line object we need the following:

Parameters:

entity_id (str) – Unique identifier of the Line entity.
bbox (BoundingBox) – Bounding box of the line entity.
words (list, optional) – List of the Word entities present in the line
confidence (float, optional) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) → List[Word]

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns EntityList of Word entities that match the input text type.
Return type:: EntityList[Word]

property page

Returns:: Returns the page number of the page the Line entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property text

Returns:: Returns the text transcription of the Line entity.
Return type:: str

property words

Returns:: Returns the line’s children
Return type:: List[Word]

Page

Represents a single page of a Document, as it appears in the Textract API output. The Page object also holds metadata such as the physical dimensions of the page (width and height, in pixels) and its child_ids.

class textractor.entities.page.Page(id: str, width: int, height: int, page_num: int = -1, child_ids=None)

Bases: SpatialObject, Linearizable

Creates a new Page object representing a single page of a document.

Parameters:

id (str) – Unique id of the Page
width (float) – Width of page, in pixels
height (float) – Height of page, in pixels
page_num (int) – Page number in the document linked to this Page object
child_ids (List) – IDs of child entities in the Page as determined by Textract

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElement present in the Page.

Returns:: List of KeyValue objects, each representing a checkbox within the Page.
Return type:: EntityList[KeyValue]

property container_layouts: EntityList[Layout]

Returns all the container Layout objects present in the Page.

Returns:: List of Layout objects.
Return type:: EntityList

directional_finder(word_1: str = '', word_2: str = '', prefix: str = '', direction=Direction.BELOW, entities=[])

Returns the entities of the types listed in entities, with the user-provided prefix prepended to their keys. This helps with repeating key-values and checkboxes. The user can manipulate the original data or produce a copy. The main advantage of this function is the ability to specify a search direction.

Parameters:

word_1 (str, required) – The reference word from which the x1, y1 coordinates are derived
word_2 (str, optional) – The second word, preferably in the direction indicated by the parameter direction. When it isn’t given, the end-of-page coordinates are used in the given direction.
prefix (str, optional) – User-provided prefix to prepend to the key. Without a prefix, the method acts as a search-by-geometry function
direction (Direction, optional) – Direction in which to search from word_1. default=Direction.BELOW
entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns:

Returns the EntityList of modified key-value and/or checkboxes

Return type:

EntityList
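The geometric idea behind a directional search can be sketched without the library. The snippet below is a minimal illustration and not the textractor implementation: `Box`, `entities_below`, `with_prefix` and the sample coordinates are hypothetical, and it assumes a top-left origin with y increasing downward.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Hypothetical stand-in for an entity with a bounding box."""
    label: str
    x: float
    y: float
    width: float
    height: float


def entities_below(reference: Box, candidates: List[Box], limit: Optional[Box] = None) -> List[Box]:
    """Return candidates whose top edge lies below the reference box.

    If `limit` is given, the search stops at its top edge; otherwise it
    runs to the end of the page (no lower bound).
    """
    top = reference.y + reference.height
    bottom = limit.y if limit is not None else float("inf")
    return [c for c in candidates if top <= c.y < bottom]


def with_prefix(prefix: str, boxes: List[Box]) -> List[Box]:
    """Return copies of the boxes with `prefix` prepended to each label."""
    return [Box(prefix + b.label, b.x, b.y, b.width, b.height) for b in boxes]


section_a = Box("Section A", 0.1, 0.10, 0.3, 0.02)
section_b = Box("Section B", 0.1, 0.50, 0.3, 0.02)
keys = [
    Box("Name", 0.1, 0.20, 0.1, 0.02),
    Box("Date", 0.1, 0.30, 0.1, 0.02),
    Box("Name", 0.1, 0.60, 0.1, 0.02),
]

found = with_prefix("A_", entities_below(section_a, keys, limit=section_b))
print([b.label for b in found])
```

The prefix is what lets the two "Name" keys be told apart once each has been tied to its own section.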

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Page.

Returns:: List of ExpenseDocument objects.
Return type:: EntityList

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv')

Export key-value entities and checkboxes in csv format.

Parameters:

include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters:

include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an Excel file and writes each table to a separate worksheet within the workbook. The workbook is stored at the filepath passed by the user.

Parameters:: filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) → EntityList[KeyValue]

Return a list of KeyValue objects containing checkboxes if the page contains them.

Parameters:

selected (bool) – If True, return SELECTED checkboxes
not_selected (bool) – If True, return NOT_SELECTED checkboxes

Returns:

Returns checkboxes that match the conditions set by the flags.

Return type:

EntityList[KeyValue]
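How the two flags combine can be shown with a small stand-alone sketch. `Checkbox` and the free function below are hypothetical stand-ins, not the library's classes; the sketch assumes a checkbox is kept when the flag matching its state is True.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkbox:
    """Hypothetical stand-in for a KeyValue whose value is a SelectionElement."""
    key: str
    is_selected: bool


def filter_checkboxes(boxes: List[Checkbox], selected: bool = True, not_selected: bool = True) -> List[Checkbox]:
    """Keep a checkbox if the flag matching its state is True."""
    return [
        b for b in boxes
        if (selected and b.is_selected) or (not_selected and not b.is_selected)
    ]


boxes = [Checkbox("Married", True), Checkbox("Single", False), Checkbox("Veteran", False)]
only_selected = filter_checkboxes(boxes, selected=True, not_selected=False)
print([b.key for b in only_selected])
```

Under this reading, leaving both flags at their defaults returns every checkbox, and setting both to False returns none.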

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) → EntityList[KeyValue]

Return up to top_k_matches key-value pairs for the key that is queried from the page.

Parameters:

key (str) – Query key to match
top_k_matches (int) – Maximum number of matches to return
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns:

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[KeyValue]
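The interplay of top_k_matches and similarity_threshold can be sketched with a plain Levenshtein similarity. This is an illustration only: the normalization used here (one minus distance divided by the longer length), the lower-casing, and the helper names are assumptions, and the library's own scoring may differ.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def top_matches(query: str, keys: List[str], top_k: int = 1, threshold: float = 0.6) -> List[Tuple[str, float]]:
    """Score every key, drop those below the threshold, return the best top_k."""
    scored = [(k, similarity(query.lower(), k.lower())) for k in keys]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


page_keys = ["Date of Birth", "Date of Issue", "Full Name"]
print(top_matches("date of birth", page_keys, top_k=2))
```

Raising the threshold trades recall for precision: near-miss keys drop out before the top-k cut is applied.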

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List[Word]]

Returns the page text and words sorted in reading order

Parameters:: config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()
Returns:: Tuple of page text and words
Return type:: Tuple[str, List[Word]]
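The effect of a few linearization options can be illustrated with a reduced stand-in. `MiniConfig` below is hypothetical and mirrors only four fields of TextLinearizationConfig, with defaults copied from the signature above; the `linearize` helper is a sketch of the idea, not the library's algorithm.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MiniConfig:
    """Hypothetical subset of TextLinearizationConfig."""
    layout_element_separator: str = "\n\n"
    max_number_of_consecutive_new_lines: int = 2
    hide_header_layout: bool = False
    hide_footer_layout: bool = False


def linearize(blocks: List[Tuple[str, str]], config: MiniConfig = MiniConfig()) -> str:
    """Join (kind, text) blocks in reading order, honouring the hide flags."""
    hidden = set()
    if config.hide_header_layout:
        hidden.add("header")
    if config.hide_footer_layout:
        hidden.add("footer")
    text = config.layout_element_separator.join(t for kind, t in blocks if kind not in hidden)
    # Collapse runs of newlines longer than the configured maximum.
    limit = config.max_number_of_consecutive_new_lines
    return re.sub(r"\n{%d,}" % (limit + 1), "\n" * limit, text)


blocks = [
    ("header", "ACME Corp"),
    ("text", "Invoice 42\n\n\n\nTotal due"),
    ("footer", "Page 1"),
]
print(linearize(blocks, MiniConfig(hide_header_layout=True, hide_footer_layout=True)))
```

Passing a config object rather than many keyword arguments keeps the call site short while still letting each option be overridden individually.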

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) → EntityList[Word]

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList[Word]
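Filtering by text type amounts to a simple comparison per word. The sketch below uses stand-in `TextTypes` and `Word` definitions (the enum values and the `text_type` attribute name are assumptions) to show the expected behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TextTypes(Enum):
    """Stand-in for the library enum; only the two documented members."""
    PRINTED = "PRINTED"
    HANDWRITING = "HANDWRITING"


@dataclass
class Word:
    """Hypothetical stand-in for a Word entity."""
    text: str
    text_type: TextTypes


def words_by_type(words: List[Word], text_type: TextTypes = TextTypes.PRINTED) -> List[Word]:
    """Return the words whose text type equals the requested one."""
    return [w for w in words if w.text_type == text_type]


words = [
    Word("Name:", TextTypes.PRINTED),
    Word("Ada", TextTypes.HANDWRITING),
    Word("Date:", TextTypes.PRINTED),
]
print([w.text for w in words_by_type(words, TextTypes.HANDWRITING)])
```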

independent_words() → EntityList[Word]

Returns:: Return all words in the document, outside of tables, checkboxes, key-values.
Return type:: EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Page.

Returns:: List of KeyValue objects, each representing a key-value pair within the Page.
Return type:: EntityList[KeyValue]

keys(include_checkboxes: bool = True) → List[str]

Returns all keys for key-value pairs and checkboxes if the page contains them.

Parameters:: include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.
Returns:: List of strings containing key names in the Page
Return type:: List[str]

property layouts: EntityList[Layout]

Returns all the Layout objects present in the Page.

Returns:: List of Layout objects.
Return type:: EntityList

property leaf_layouts: EntityList[Layout]

Returns all the leaf Layout objects present in the Page.

Returns:: List of Layout objects.
Return type:: EntityList

property lines: EntityList[Line]

Returns all the Line objects present in the Page.

Returns:: List of Line objects, each representing a line within the Page.
Return type:: EntityList[Line]

property page_layout: PageLayout

property queries: EntityList[Query]

Returns all the Query objects present in the Page.

Returns:: List of Query objects.
Return type:: EntityList

return_duplicates()

Returns a list of EntityList objects. Each EntityList contains the key-values whose information is duplicated in a table, followed by that table as the last item. This function lets the Textract user identify duplicate objects extracted by the different Textract models.

Returns:: List of EntityList objects each containing the intersection of KeyValue and Table entities on the page.
Return type:: List[EntityList]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: int = 0.6) → EntityList[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters:

keyword (str) – Keyword that is used to query the page.
top_k (int) – Number of closest line objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns:

Returns a list of lines that contain the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) → EntityList[Word]

Return a list of top_k words that match the keyword.

Parameters:

keyword (str, required) – Keyword that is used to query the document.
top_k (int, optional) – Number of closest word objects to be returned. default=1
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of words that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Word]

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Page.

Returns:: List of Signature objects.
Return type:: EntityList

property tables: EntityList[Table]

Returns all the Table objects present in the Page.

Returns:: List of Table objects, each representing a table within the Page.
Return type:: EntityList

property text: str

Returns the page text

Returns:: Linearized page text
Return type:: str

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns:: Returns an EntityList object
Return type:: EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Page.

Returns:: List of Word objects, each representing a word within the Page.
Return type:: EntityList[Word]

PageLayout

class textractor.entities.page_layout.PageLayout(titles: EntityList[Layout] = [], headers: EntityList[Layout] = [], footers: EntityList[Layout] = [], section_headers: EntityList[Layout] = [], page_numbers: EntityList[Layout] = [], lists: EntityList[Layout] = [], figures: EntityList[Layout] = [], tables: EntityList[Layout] = [], key_values: EntityList[Layout] = [])

Bases: object

Object representation of the layout components detected in the page.

property figures: EntityList[Layout]

Figures detected in the Page

Returns:: EntityList of figures detected in the page
Return type:: EntityList[Layout]

property footers: EntityList[Layout]

Footers detected in the Page

Returns:: EntityList of footers detected in the page
Return type:: EntityList[Layout]

property headers: EntityList[Layout]

Headers detected in the Page

Returns:: EntityList of headers detected in the page
Return type:: EntityList[Layout]

property key_values: EntityList[Layout]

KeyValues detected in the Page

Returns:: EntityList of keyvalues detected in the page
Return type:: EntityList[Layout]

property lists: EntityList[Layout]

Lists detected in the Page

Returns:: EntityList of lists detected in the page
Return type:: EntityList[Layout]

property page_numbers: EntityList[Layout]

Page numbers detected in the Page

Returns:: EntityList of page numbers detected in the page
Return type:: EntityList[Layout]

property section_headers: EntityList[Layout]

Section headers detected in the Page

Returns:: EntityList of section headers detected in the page
Return type:: EntityList[Layout]

property tables: EntityList[Layout]

Tables detected in the Page. This includes Tables detected by the AnalyzeDocument Tables API if used.

Returns:: EntityList of tables detected in the page
Return type:: EntityList[Layout]

property titles: EntityList[Layout]

Titles detected in the Page

Returns:: EntityList of titles detected in the page
Return type:: EntityList[Layout]
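One way to picture PageLayout is as a set of buckets keyed by layout type. The sketch below groups (label, text) pairs under the property names listed above; the label-to-property mapping and the use of plain tuples instead of Layout objects are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Map Textract layout block types to the PageLayout property names.
# The mapping itself is an assumption made for this sketch.
LABEL_TO_PROPERTY = {
    "LAYOUT_TITLE": "titles",
    "LAYOUT_HEADER": "headers",
    "LAYOUT_FOOTER": "footers",
    "LAYOUT_SECTION_HEADER": "section_headers",
    "LAYOUT_PAGE_NUMBER": "page_numbers",
    "LAYOUT_LIST": "lists",
    "LAYOUT_FIGURE": "figures",
    "LAYOUT_TABLE": "tables",
    "LAYOUT_KEY_VALUE": "key_values",
}


def group_layouts(layouts: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Bucket (label, text) pairs by PageLayout property name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label, text in layouts:
        prop = LABEL_TO_PROPERTY.get(label)
        if prop is not None:  # e.g. LAYOUT_TEXT has no dedicated bucket here
            groups[prop].append(text)
    return dict(groups)


sample = [
    ("LAYOUT_TITLE", "Report"),
    ("LAYOUT_TEXT", "Body"),
    ("LAYOUT_TABLE", "T1"),
    ("LAYOUT_TABLE", "T2"),
]
print(group_layouts(sample))
```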

Layout

Represents a single Layout Entity within the Document. The Textract API response returns groups of layout as LAYOUT_* BlockTypes.

class textractor.entities.layout.Layout(entity_id: str, bbox: BoundingBox, reading_order: int, label: str, confidence: float = 0)

Bases: DocumentEntity

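To create a new Layout object we need the following:

Parameters:

entity_id (str) – Unique identifier of the Layout entity
bbox (BoundingBox) – Bounding box of the Layout entity
reading_order (int) – Position of the Layout entity in the reading order
label (str) – Label of the Layout entity
confidence (float) – Confidence with which the Layout entity was detected. default=0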

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List[Word]]

Returns the layout object text and words sorted in reading order

Parameters:: config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()
Returns:: Tuple of layout text and words
Return type:: Tuple[str, List[Word]]

property page

Returns:: Returns the page number of the page the Layout entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property text

Maps to .get_text()

Returns:: Returns the linearized text of the entity
Return type:: str

property words

Table

Represents a Table entity within the document. Tables are hierarchical objects composed of TableCell objects, which implicitly form columns and rows.

Table object contains associated metadata within it. They include TableCell information, headers, page number and page ID of the page within which it exists in the document.

class textractor.entities.table.Table(entity_id, bbox: BoundingBox)

Bases: DocumentEntity

To create a new Table object we need the following:

Parameters:

entity_id – Unique identifier of the table.
bbox – Bounding box of the table.

add_cells(cells: List[TableCell])

Add TableCell objects to the Table. This function does not check the integrity of the table after the cells are added.

Parameters:: cells (list) – List of TableCell objects, each representing a single cell within the table. No specific ordering is assumed since it is implicitly ordered by row and column index.
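Because each cell carries its own row and column index, the order in which cells are added does not matter. The sketch below rebuilds a row-major grid from shuffled cells; `Cell` is a hypothetical stand-in for TableCell and the indices are assumed to be 1-based.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for TableCell; indices are assumed to be 1-based."""
    row_index: int
    col_index: int
    text: str


def to_rows(cells: List[Cell]) -> List[List[str]]:
    """Place each cell at its (row, column) position in a rectangular grid."""
    if not cells:
        return []
    row_count = max(c.row_index for c in cells)
    col_count = max(c.col_index for c in cells)
    grid = [["" for _ in range(col_count)] for _ in range(row_count)]
    for c in cells:
        grid[c.row_index - 1][c.col_index - 1] = c.text
    return grid


shuffled = [
    Cell(2, 2, "4.50"),
    Cell(1, 1, "Item"),
    Cell(2, 1, "Widget"),
    Cell(1, 2, "Price"),
]
print(to_rows(shuffled))
```

Positions for which no cell was supplied stay empty in this sketch, which is one reason an integrity check after adding cells can be useful.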

property checkboxes: List[SelectionElement]

property column_count

property column_headers: Dict[str, List[TableCell]]

Returns:: Returns the column headers of the Table entity.
Return type:: Dict[str, List[TableCell]]

property footers

Returns:: Returns the table footers.
Return type:: List[TableFooter]

get_cells_by_type(cell_type: CellTypes = CellTypes.COLUMN_HEADER)

Returns a dictionary of column_header (str) : List[TableCell] (in order).

Parameters:: cell_type (CellTypes) – supports CellTypes.COLUMN_HEADER as of now, will support SECTION_TITLE, FLOATING_TITLE, FLOATING_FOOTER, SUMMARY_CELL in the future.
Returns:: {column_header (str) : List[TableCell]}
Return type:: Dict[str, List[TableCell]]

get_columns_by_name(column_names, similarity_metric=SimilarityMetric.COSINE, similarity_threshold=0.6)

Returns a dictionary of format {column_name : List[TableCell]} for the column names listed in param column_names.

Parameters:

column_names (list) – List of column names of columns to be extracted from table.
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.COSINE is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a new Table consisting of columns passed in column_names.

Return type:

Table

get_table_range()

Returns:: Returns the number of rows and columns in the table.
Return type:: Tuple(int)

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

get_words_by_type(text_type=TextTypes.PRINTED)

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList[Word]

property page

Returns:: Returns the page number of the page the Table entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property row_count

strip_headers(column_headers: bool = True, in_table_title: bool = False, section_titles=False)

Returns a new Table object after removing all cells that are marked as column headers in the table from the API response.

Parameters:

column_headers (bool) – Remove the column headers
in_table_title (bool) – Remove the in-table titles
section_titles (bool) – Remove the in-table section titles

Returns:

Table object after removing the headers.

Return type:

Table

property table_type

Returns:: Returns the table type.
Return type:: TableTypes

property title

Returns:: Returns the table title.
Return type:: TableTitle

to_csv(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str

Returns the table in the Comma-Separated-Value (CSV) format

Parameters:

use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.
config – Text linearization configuration object for the table content

Returns:

Table as a CSV string.

Return type:

str
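What a CSV string for a table looks like can be shown with the standard library alone. The helper below is a sketch that serializes a list of rows; it is not the library's to_csv and makes no claim about its quoting or line-ending choices.

```python
import csv
import io
from typing import List


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize a list of rows to a CSV string, quoting cells where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["Item", "Price"], ["Widget, large", "4.50"]]
print(rows_to_csv(rows))
```

Cells that contain the delimiter are quoted by the csv module, which matters for OCR output where commas inside cell text are common.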

to_excel(filepath=None, workbook=None, save_workbook=True)

Exports the Table entity as an Excel document. The advantage of Excel over CSV is that it can accommodate the merged cells that often appear in Textract documents.

Parameters:

filepath (str) – Path to store the exported Excel file
workbook (xlsxwriter.Workbook) – if xlsxwriter workbook is passed to the function, the table is appended to the last sheet of that workbook.
save_workbook (bool) – Flag to save the workbook. If False, the workbook is returned by the function.

Returns:

Returns a workbook if save_workbook is False. Else, saves the .xlsx file at the filepath it was initialized with.

Return type:

xlsxwriter.Workbook

to_html() → str

Returns the table in the HTML format

Returns:: Table as an HTML string.
Return type:: str

to_pandas(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Converts the table to a pandas DataFrame

Parameters:

use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.
config – Text linearization configuration object for the table content

Returns:: Table as a pandas DataFrame.
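The meaning of use_columns can be illustrated without pandas. The hypothetical helper below splits a list of rows into column labels and body rows; when use_columns is False it falls back to integer labels, which is how a DataFrame built without explicit columns is labelled.

```python
from typing import List, Tuple, Union


def split_header(
    rows: List[List[str]], use_columns: bool = False
) -> Tuple[List[Union[int, str]], List[List[str]]]:
    """Return (column labels, body rows) for a single-row header."""
    if not rows:
        return [], []
    if use_columns:
        # The first row supplies the labels and is removed from the body.
        return list(rows[0]), rows[1:]
    # No header row: label the columns 0..n-1 and keep every row.
    return list(range(len(rows[0]))), rows


rows = [["Item", "Qty"], ["Pen", "2"], ["Ink", "1"]]
columns, body = split_header(rows, use_columns=True)
print(columns, body)
```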

to_txt()

property words

Returns all the Word objects present in the Table.

Returns:: List of Word objects, each representing a word within the Table.
Return type:: EntityList[Word]

TableCell

Represents a single TableCell object. A TableCell object contains information such as:

The position info of the cell within the encompassing Table
Properties such as merged-cells span
A hierarchy of words contained within the TableCell (optional)
Page information
Confidence of entity detection.

class textractor.entities.table_cell.TableCell(entity_id: str, bbox: BoundingBox, row_index: int, col_index: int, row_span: int, col_span: int, confidence: float = 0, is_column_header: bool = False, is_title: bool = False, is_footer: bool = False, is_summary: bool = False, is_section_title: bool = False)

Bases: DocumentEntity

To create a new TableCell object we need the following:

Parameters:

entity_id – Unique id of the TableCell object
bbox – Bounding box of the entity
row_index – Row index of position of cell within the table
col_index – Column index of position of cell within the table
row_span – Number of rows the cell spans (1 means the cell is not merged across rows)
col_span – Number of columns the cell spans (1 means the cell is not merged across columns)
confidence – Confidence out of 100 with which the Cell was detected.
is_column_header – Indicates if the cell is a column header
is_title – Indicates if the cell is a table title
is_footer – Indicates if the cell is a table footer
is_summary – Indicates if the cell is a summary cell
is_section_title – Indicates if the cell is a section title
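The index and span parameters together determine which grid positions a cell occupies. The helper below is a hypothetical sketch that assumes row_span counts rows and col_span counts columns, starting from the cell's own (row_index, col_index).

```python
from typing import List, Tuple


def covered_positions(
    row_index: int, col_index: int, row_span: int = 1, col_span: int = 1
) -> List[Tuple[int, int]]:
    """List every (row, column) grid position occupied by a cell."""
    return [
        (row, col)
        for row in range(row_index, row_index + row_span)
        for col in range(col_index, col_index + col_span)
    ]


print(covered_positions(1, 2, row_span=2, col_span=1))
```

A span of 1 in both directions yields a single position, which is the unmerged case.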

property checkboxes

property col_index

Returns:: Returns the column index of the cell in the Table.
Return type:: int

property col_span

Returns:: Returns the column span of the cell in the Table.
Return type:: int

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Returns the text in the cell as one space-separated string, together with the list of matching words.

Returns:: Tuple of the text in the cell and the word list
Return type:: Tuple[str, List]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList
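The filtering described above can be pictured with a small self-contained sketch. The TextTypes enum and Word class below are simplified stand-ins written for this illustration, not the package's own classes, and the enum values are placeholders; only the idea of keeping words whose text type matches the requested one is taken from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TextTypes(Enum):
    """Stand-in for the package's TextTypes enum (values are placeholders)."""
    PRINTED = "PRINTED"
    HANDWRITING = "HANDWRITING"


@dataclass
class Word:
    """Stand-in exposing only the two fields this sketch needs."""
    text: str
    text_type: TextTypes


def words_by_type(words: List[Word], text_type: TextTypes = TextTypes.PRINTED) -> List[Word]:
    """Keep only the words whose text type matches the requested one."""
    return [word for word in words if word.text_type == text_type]


cell_words = [
    Word("Total", TextTypes.PRINTED),
    Word("42", TextTypes.HANDWRITING),
    Word("USD", TextTypes.PRINTED),
]
printed = words_by_type(cell_words)
handwritten = words_by_type(cell_words, TextTypes.HANDWRITING)
print([word.text for word in printed])
print([word.text for word in handwritten])
```

With the real entities, the equivalent call would presumably be cell.get_words_by_type(TextTypes.HANDWRITING) on a TableCell obtained from a parsed document.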

property is_column_header

property is_footer

property is_section_title

property is_summary

property is_title

merge_direction()

Returns:: Indicates whether the merged cell is a row merge or a column merge: 0 for a row merge, 1 for a column merge, 2 for both, and None if the cell is not merged.
Return type:: int, str
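On the caller side, the codes listed above can be translated into readable labels. The helper below is hypothetical and not part of the package. It assumes only the documented codes (0, 1, 2 or None) and ignores any string label the method may return alongside the code.

```python
from typing import Optional

# Labels for the merge codes documented above; the wording is our own.
MERGE_LABELS = {0: "row merge", 1: "column merge", 2: "row and column merge"}


def describe_merge(code: Optional[int]) -> str:
    """Translate a merge_direction code into a human-readable label."""
    if code is None:
        return "not merged"
    return MERGE_LABELS.get(code, "unknown")


for code in (0, 1, 2, None):
    print(code, "->", describe_merge(code))
```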

property page

Returns:: Returns the page number of the page the TableCell entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property row_index

Returns:: Returns the row index of the cell in the Table.
Return type:: int

property row_span

Returns:: Returns the row span of the cell in the Table.
Return type:: int

property table_id

Returns:: Returns the ID of the Table the TableCell belongs to.
Return type:: str

property text: str

Returns the text in the cell as one space-separated string

Returns:: Text in the cell
Return type:: str

property words

Returns all the Word objects present in the TableCell.

Returns:: List of Word objects, each representing a word within the TableCell.
Return type:: list

TableTitle

Represents a single TableTitle object. The TableTitle object contains information such as:

The position of the title within the Document
The words that it contains
Confidence of entity detection

class textractor.entities.table_title.TableTitle(entity_id: str, bbox: BoundingBox)

Bases: DocumentEntity

Represents a title that is either in-table or floating

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property is_floating: bool

Returns:: Returns whether the TableTitle entity is floating or not.
Return type:: bool

property page

Returns:: Returns the page number of the page the TableTitle entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property text: str

Returns the text in the title as one space-separated string

Returns:: Text in the title
Return type:: str

property words

Returns all the Word objects present in the TableTitle.

Returns:: List of Word objects, each representing a word within the TableTitle.
Return type:: list

TableFooter

Represents a single TableFooter object. The TableFooter object contains information such as:

The position of the footer within the Document
The words that it contains
Confidence of entity detection

class textractor.entities.table_footer.TableFooter(entity_id: str, bbox: BoundingBox)

Bases: DocumentEntity

Represents a footer that is either in-table or floating

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page

Returns:: Returns the page number of the page the TableFooter entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property text: str

Returns the text in the footer as one space-separated string

Returns:: Text in the footer
Return type:: str

property words

Returns all the Word objects present in the TableFooter.

Returns:: List of Word objects, each representing a word within the TableFooter.
Return type:: list

KeyValue

The KeyValue entity is a document entity representing the Forms output. The key in a KeyValue is typically made up of words, and the Value can be Word elements or, in the case of checkboxes, a SelectionElement.

This class contains the metadata associated with the KeyValue entity, including the entity ID, bounding box information, value, existence of a checkbox, page number, Page ID and confidence of detection.

class textractor.entities.key_value.KeyValue(entity_id: str, bbox: BoundingBox, contains_checkbox: bool = False, value: Optional[Value] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new KeyValue object we require the following:

Parameters:

entity_id (str) – Unique identifier of the KeyValue entity.
bbox (BoundingBox) – Bounding box of the KeyValue entity.
contains_checkbox (bool) – True/False to indicate if the value is a checkbox.
value (Value) – Value object that maps to the KeyValue entity.
confidence (float) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList[Word]

is_selected() → bool

For KeyValues containing a selection item, returns its is_selected status

Returns:: Selection status of a selection item key value pair
Return type:: bool
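To illustrate how contains_checkbox and is_selected() can be combined when rendering a form field, here is a minimal sketch. FormField is a hypothetical stand-in for KeyValue that stores plain strings instead of Word objects. The '[X]' and '[ ]' tokens are the selection_element_selected and selection_element_not_selected defaults shown in the signature above. The exact text the package produces for a KeyValue may differ.

```python
from dataclasses import dataclass, field
from typing import List

SELECTED_TOKEN = "[X]"      # default selection_element_selected
NOT_SELECTED_TOKEN = "[ ]"  # default selection_element_not_selected


@dataclass
class FormField:
    """Hypothetical stand-in for a KeyValue, using plain strings for words."""
    key_words: List[str]
    value_words: List[str] = field(default_factory=list)
    contains_checkbox: bool = False
    selected: bool = False

    def is_selected(self) -> bool:
        return self.selected


def render(form_field: FormField) -> str:
    """Render 'key value', using a selection token when the value is a checkbox."""
    key_text = " ".join(form_field.key_words)
    if form_field.contains_checkbox:
        value_text = SELECTED_TOKEN if form_field.is_selected() else NOT_SELECTED_TOKEN
    else:
        value_text = " ".join(form_field.value_words)
    return f"{key_text} {value_text}"


print(render(FormField(["Name:"], ["Jane", "Doe"])))
print(render(FormField(["Subscribe"], contains_checkbox=True, selected=True)))
print(render(FormField(["Subscribe"], contains_checkbox=True, selected=False)))
```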

property key

Returns:: Returns EntityList[Word] object (a list of words) associated with the key.
Return type:: EntityList[Word]

property ocr_confidence: Return the average OCR confidence :return:

property page: int

Returns:: Returns the page number of the page the KeyValue entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property value: Value

Returns:: Returns the Value mapped to the key if it has been assigned.
Return type:: Value

property words: List[Word]

Returns all the Word objects present in the key and value of the KeyValue object.

Returns:: List of Word objects, each representing a word within the KeyValue entity.
Return type:: EntityList[Word]

Value

Represents a single Value Entity within the Document. The Textract API response returns groups of words as KEY_VALUE_SET BlockTypes. These may be of KEY or VALUE type which is indicated by the EntityType attribute in the JSON response.

This class contains the metadata associated with the Value entity, including the entity ID, bounding box information, child words, associated key ID, page number, Page ID, confidence of detection and whether it is a checkbox.

class textractor.entities.value.Value(entity_id: str, bbox: BoundingBox, confidence: float = 0)

Bases: DocumentEntity

To create a new Value object we need the following:

Parameters:

entity_id (str) – Unique identifier of the Value entity.
bbox (BoundingBox) – Bounding box of the Value entity.
confidence (float) – Confidence of detection, out of 100.

property contains_checkbox: bool

Returns True if the value associated is a SelectionElement.

Returns:: Returns True if the value associated is a checkbox/SelectionElement.
Return type:: bool

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters:: text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
Returns:: Returns list of Word entities that match the input text type.
Return type:: EntityList[Word]

property key_id: str

Returns the associated Key ID for the Value entity.

Returns:: Returns the associated KeyValue object ID.
Return type:: str

property page

Returns:: Returns the page number of the page the Value entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property words: List[Word]

Returns:: Returns a list of all words in the entity if there are any; otherwise returns the checkbox status of the Value entity.
Return type:: EntityList[Word]

SelectionElement

Represents a single SelectionElement/Checkbox/Clickable Entity within the Document.

This class contains the metadata associated with the SelectionElement entity, including the entity ID, bounding box information, selection status, page number, Page ID and confidence of detection.

class textractor.entities.selection_element.SelectionElement(entity_id: str, bbox: BoundingBox, status: SelectionStatus, confidence: float = 0)

Bases: DocumentEntity

To create a new SelectionElement object we need the following:

Parameters:

entity_id (str) – Unique identifier of the SelectionElement entity.
bbox (BoundingBox) – Bounding box of the SelectionElement
status (SelectionStatus) – SelectionStatus.SELECTED / SelectionStatus.NOT_SELECTED
confidence (float) – Confidence with which this entity is detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

is_selected() → bool

Returns:: Returns True / False depending on selection status of the SelectionElement.
Return type:: bool

property page

Returns:: Returns the page number of the page the SelectionElement entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property words: List[Word]

Returns:: Returns an empty Word list, as a SelectionElement does not have words.
Return type:: EntityList[Word]

Query

The Query entity is a document entity representing the Queries output. It pairs a query submitted to Textract with the result returned for it, if any.

This class contains the metadata associated with the Query entity, including the entity ID, query text, alias, query result, bounding box of the result, page number and Page ID.

class textractor.entities.query.Query(entity_id: str, query: str, alias: str, query_result: Optional[QueryResult], result_bbox: Optional[BoundingBox])

Bases: DocumentEntity

The Query object merges QUERY and QUERY_RESULT blocks. To create a new Query object we require the following:

Parameters:

entity_id (str) – Unique identifier of the Query entity.
query (str) – Text of the query.
alias (str) – Alias of the query.
query_result (QueryResult, optional) – QueryResult entity holding the answer, if there is one.
result_bbox (BoundingBox, optional) – Bounding box of the query result, if there is one.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the Query and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property has_result: bool

Returns:: Returns whether there was a result associated with the query
Return type:: bool
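A common use of has_result is to skip queries that Textract could not answer. The sketch below uses simplified stand-in classes (QueryStub and QueryResultStub, both invented for this illustration) whose fields mirror the constructor arguments listed above. The attribute name result and the helper answers_by_alias are assumptions, not the package's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QueryResultStub:
    """Stand-in for QueryResult: the answer text and its confidence."""
    answer: str
    confidence: float


@dataclass
class QueryStub:
    """Stand-in for Query: a question, its alias and an optional result."""
    query: str
    alias: str
    result: Optional[QueryResultStub] = None

    @property
    def has_result(self) -> bool:
        # A query has a result only when a QUERY_RESULT was attached to it.
        return self.result is not None


def answers_by_alias(queries: List[QueryStub]) -> Dict[str, str]:
    """Collect answers keyed by alias, skipping queries without a result."""
    return {q.alias: q.result.answer for q in queries if q.has_result}


queries = [
    QueryStub("What is the invoice date?", "INVOICE_DATE", QueryResultStub("2023-01-15", 97.5)),
    QueryStub("What is the PO number?", "PO_NUMBER"),
]
answers = answers_by_alias(queries)
print(answers)
```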

property page: int

Returns:: Returns the page number of the page the Query entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

QueryResult

The QueryResult entity is a document entity representing the answer to a Query, returned by Textract as a QUERY_RESULT block.

This class contains the metadata associated with the QueryResult entity, including the entity ID, answer text, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.query_result.QueryResult(entity_id: str, confidence: float, result_bbox: BoundingBox, answer: str)

Bases: DocumentEntity

The QueryResult object represents QUERY_RESULT blocks. To create a new QueryResult object we require the following:

Parameters:

entity_id (str) – Unique identifier of the QueryResult entity.
confidence (float) – Confidence with which the entity was detected.
result_bbox (BoundingBox) – Bounding box of the QueryResult entity.
answer (str) – Answer text returned for the query.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the QueryResult and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page: int

Returns:: Returns the page number of the page the QueryResult entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

Signature

Represents a single Signature Entity within the Document. The Textract API response returns signatures as SIGNATURE BlockTypes.

This class contains the metadata associated with the Signature entity, including the entity ID, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.signature.Signature(entity_id: str, bbox: BoundingBox, confidence: float = 0)

Bases: DocumentEntity

To create a new Signature object we need the following:

Parameters:

entity_id (str) – Unique identifier of the Signature entity.
bbox (BoundingBox) – Bounding box of the Signature entity.
confidence (float, optional) – Confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page

Returns:: Returns the page number of the page the Signature entity is present in.
Return type:: int

property page_id: str

Returns:: Returns the Page ID attribute of the page which the entity belongs to.
Return type:: str

property words

Returns:: Returns an empty list
Return type:: list

ExpenseDocument

The ExpenseDocument class is the object representation of a single expense document in an AnalyzeExpense response. Despite its name, it does not inherit from Document; it is a DocumentEntity that holds the summary fields and line item groups of the expense.

class textractor.entities.expense_document.ExpenseDocument(summary_fields: List[ExpenseField], line_items_groups: List[LineItemGroup], bounding_box: BoundingBox, page: int)

Bases: DocumentEntity

Represents the description of a single expense document.

property bbox

Returns:: Returns entire bounding box of entity
Return type:: BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property line_items_groups: List[LineItemGroup]

property page

property summary_fields_list

class textractor.entities.expense_document.Fields

Bases: dict

Dictionary that holds the summary fields. Properties are added dynamically to make the fields easier to discover.
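As a rough illustration of what "dynamically added properties" can look like on a dict subclass, the sketch below mirrors every string key as an attribute. AttrDict and the sample keys are hypothetical and are not part of textractor; the real Fields class may be implemented differently.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose string keys are also attributes."""

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Mirror the entry as an instance attribute so it appears in dir()
        # and in interactive tab completion.
        if isinstance(key, str) and key.isidentifier():
            super().__setattr__(key, value)


fields = AttrDict()
fields["TOTAL"] = "42.00"
fields["VENDOR_NAME"] = "Example Store"

print(fields["TOTAL"])     # dictionary-style access
print(fields.VENDOR_NAME)  # attribute-style access
```

Only assignments made through `fields[key] = value` are mirrored in this sketch; entries passed to the constructor are not.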

class textractor.entities.expense_document.FieldsGroups

Bases: dict

Summary Fields Group dictionary {GROUP_KEY_NAME: {GROUP_ID_1: [SUMMARY_FIELD1, SUMMARY_FIELD2]}}

get_group_bboxes(key: str): Return the enclosing bboxes for each group for a given group key :param key: Group key e.g VENDOR :return:

Expense

class textractor.entities.expense_field.Expense(bbox: BoundingBox, text: str, confidence: float, page: int)

Bases: DocumentEntity

Holds the Key or the Value of an Expense

property geometry

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the Expense and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page

property text

Maps to .get_text()

Returns:: Returns the linearized text of the entity
Return type:: str

class textractor.entities.expense_field.ExpenseField(type: ExpenseType, value: Expense, group_properties: List[ExpenseGroupProperty], page: int, label: Optional[Expense] = None, currency=None)

Bases: DocumentEntity

The ExpenseField holds the information of a given summary field: its key, value and type. The bounding box of an ExpenseField is the box that encloses all of its components.

property bbox: BoundingBox

Returns:: Returns entire bounding box of entity
Return type:: BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the ExpenseField and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property group_properties: List[ExpenseGroupProperty]

property key: Expense

property page: int

property type: ExpenseType

property value: Expense

class textractor.entities.expense_field.ExpenseGroupProperty(id: str, types: List[str])

Bases: object

Describes, for a given ExpenseField, which group the field belongs to and the types of that group.

id: str

types: List[str]

class textractor.entities.expense_field.ExpenseType(text: str, confidence: float, raw_object: object)

Bases: object

Type of an ExpenseField, e.g. TOTAL or SUBTOTAL

confidence: float

raw_object: object

text: str

class textractor.entities.expense_field.LineItemGroup(index, line_item_rows: List[LineItemRow], page: int)

Bases: DocumentEntity

A LineItemGroup contains several LineItemRow objects. In invoices it often resembles a table, but in receipts the structure can be looser and less aligned.

property bbox

Returns:: Returns entire bounding box of entity
Return type:: BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the LineItemGroup and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property index

property page

property rows

to_csv()

to_json()

to_pandas(include_EXPENSE_ROW=False)
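Because rows in a line item group do not always share the same fields, any tabular export has to decide how to line the columns up. The sketch below shows one plausible approach using only the standard library: take the union of field types as columns and leave missing cells empty. It is an illustration under assumptions, not a description of what to_csv() or to_pandas() do internally; rows_to_csv and the sample data are hypothetical.

```python
import csv
import io
from typing import List, Tuple

# Each row is a list of (field type, field text) pairs, standing in for
# the ExpenseField objects of a LineItemRow.
Row = List[Tuple[str, str]]

rows: List[Row] = [
    [("ITEM", "Coffee"), ("PRICE", "3.50")],
    [("ITEM", "Bagel"), ("QUANTITY", "2"), ("PRICE", "4.00")],
]


def rows_to_csv(line_rows: List[Row]) -> str:
    """Flatten loosely aligned rows into CSV text."""
    # Columns are the field types in order of first appearance.
    columns: List[str] = []
    for row in line_rows:
        for field_type, _ in row:
            if field_type not in columns:
                columns.append(field_type)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in line_rows:
        # Types missing from a row are filled with restval (empty string).
        writer.writerow(dict(row))
    return buffer.getvalue()


print(rows_to_csv(rows))
```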

class textractor.entities.expense_field.LineItemRow(index, line_item_expense_fields: List[ExpenseField], page: int)

Bases: DocumentEntity

A LineItemRow contains several ExpenseField objects that all belong to the same row. They do not always align into regular columns the way table cells do.

property bbox

Returns:: Returns entire bounding box of entity
Return type:: BoundingBox

property expenses

get(index)

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the LineItemRow and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property page

IdentityDocument

The IdentityDocument class is the object representation of an AnalyzeID response. It is similar to a dictionary. Despite its name, it does not inherit from Document, as the AnalyzeID response does not contain position information.

class textractor.entities.identity_document.IdentityDocument(fields=None)

Bases: SpatialObject

Represents the description of a single ID document.

property fields: Dict[str, IdentityField]

get(key: Union[str, AnalyzeIDFields]) → Optional[str]

keys() → List[str]

values() → List[str]
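The dictionary-like behaviour listed above (get, keys, values over a mapping of field names to field objects) can be sketched with small stand-in classes. StubField and StubIdentityDocument are hypothetical and only mirror the documented return types; in particular, passing a list of fields to the constructor is an assumption made for this sketch.

```python
from typing import Dict, List, Optional


class StubField:
    """Stand-in for IdentityField: a key, a value and a confidence."""

    def __init__(self, key: str, value: str, confidence: float):
        self.key = key
        self.value = value
        self.confidence = confidence


class StubIdentityDocument:
    """Stand-in exposing the same dictionary-like accessors."""

    def __init__(self, fields: Optional[List[StubField]] = None):
        # Index the fields by key, like a Dict[str, IdentityField].
        self.fields: Dict[str, StubField] = {f.key: f for f in (fields or [])}

    def get(self, key: str) -> Optional[str]:
        field = self.fields.get(key)
        return field.value if field is not None else None

    def keys(self) -> List[str]:
        return list(self.fields)

    def values(self) -> List[str]:
        return [field.value for field in self.fields.values()]


doc = StubIdentityDocument(
    [StubField("FIRST_NAME", "Jane", 99.1), StubField("LAST_NAME", "Doe", 98.7)]
)
print(doc.get("FIRST_NAME"), doc.keys(), doc.values())
```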

IdentityField

class textractor.entities.identity_field.IdentityField(key, value, confidence)

Bases: object

property confidence: float

property key: str

property value: str

Linearizable

Linearizable is an abstract base class that defines how a component can be linearized (converted to text).

class textractor.entities.linearizable.Linearizable

Bases: ABC

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', figure_layout_prefix='', 
figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str

Returns the linearized text of the entity

Parameters:: config – Text linearization configuration
Returns:: Linearized text of the entity
Return type:: str

abstract get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the entity and the matching words

Returns:: Tuple of text and word list
Return type:: Tuple[str, List[Word]]

property text: str

Maps to .get_text()

Returns:: Returns the linearized text of the entity
Return type:: str
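The relationship between these members can be sketched as a small abstract base class: subclasses implement get_text_and_words, and the text property maps to get_text(). This is a simplified sketch, not the library's code. The config argument is omitted, words are plain strings, and having get_text delegate to get_text_and_words is an assumption; SketchLinearizable and SketchLine are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SketchLinearizable(ABC):
    """Simplified stand-in for the Linearizable interface."""

    @abstractmethod
    def get_text_and_words(self) -> Tuple[str, List[str]]:
        """Return the linearized text and the words it was built from."""

    def get_text(self) -> str:
        # Assumption: the text is the first element of get_text_and_words().
        text, _ = self.get_text_and_words()
        return text

    @property
    def text(self) -> str:
        # Maps to .get_text(), as the property is documented to do.
        return self.get_text()


class SketchLine(SketchLinearizable):
    """Concrete entity: a line made of words joined by single spaces."""

    def __init__(self, words: List[str]):
        self.words = list(words)

    def get_text_and_words(self) -> Tuple[str, List[str]]:
        return " ".join(self.words), self.words


line = SketchLine(["Total", "due:", "42.00"])
print(line.text)
```

Because get_text_and_words is abstract, instantiating SketchLinearizable directly raises a TypeError; only subclasses that implement it can be created.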

to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False)) → str

Returns the HTML representation of the entity

Returns:: HTML text of the entity
Return type:: str
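The prefix and suffix options in the default HTML configuration suggest how a table becomes markup: each cell, row and table is wrapped in its configured pair of strings. The sketch below illustrates that wrapping idea with a hypothetical SketchConfig dataclass, which copies a few default values from the signature above, and a hypothetical render_table helper. It makes no claim about how to_html composes its output internally, and it does not escape HTML special characters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchConfig:
    """A few table options, with the defaults shown in the to_html signature."""

    table_prefix: str = "<table>"
    table_suffix: str = "</table>"
    table_row_prefix: str = "<tr>"
    table_row_suffix: str = "</tr>"
    table_cell_prefix: str = "<td>"
    table_cell_suffix: str = "</td>"
    table_row_separator: str = "\n"


def render_table(rows: List[List[str]], config: SketchConfig) -> str:
    """Wrap every cell, row and the whole table in its prefix and suffix."""
    rendered_rows = []
    for row in rows:
        cells = "".join(
            config.table_cell_prefix + cell + config.table_cell_suffix
            for cell in row
        )
        rendered_rows.append(
            config.table_row_prefix + cells + config.table_row_suffix
        )
    body = config.table_row_separator.join(rendered_rows)
    return config.table_prefix + body + config.table_suffix


html = render_table([["a", "b"], ["c", "d"]], SketchConfig())
print(html)
```

Swapping the prefixes and suffixes for empty strings and setting a tab as the cell separator would give a plain-text rendering from the same loop, which is the appeal of driving the output from configuration rather than from hard-coded markup.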

to_markdown(config: MarkdownLinearizationConfig = MarkdownLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='# ', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=True, table_column_header_threshold=0.9, table_linearization_format='markdown', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='## ', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str

Returns the markdown representation of the entity

Returns:: Markdown text of the entity
Return type:: str