Document Entities

Document objects contain various entities. The Textract document analysis APIs recognize six document entities: WORD, LINE, KEY_VALUE_SET, SELECTION_ELEMENT, TABLE and CELL.

These structures occur in most documents, and the package provides classes to programmatically store and access the information that Textract produces for them.

BoundingBox

The BoundingBox class contains all the co-ordinate information for a DocumentEntity. It is mainly useful for locating the entity on the image of the document page.

class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)

Bases: SpatialObject

Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By default BoundingBox is set to work with denormalized co-ordinates: \(x \in [0, docwidth]\) and \(y \in [0, docheight]\). Use the as_normalized_dict function to obtain BoundingBox with normalized co-ordinates: \(x \in [0, 1]\) and \(y \in [0, 1]\).

Create a BoundingBox as shown below:

  • Directly: bb = BoundingBox(x, y, width, height)

  • From dict: bb = BoundingBox.from_dict(bb_dict) where bb_dict = {'x': x, 'y': y, 'width': width, 'height': height}

Use a BoundingBox as shown below:

  • Directly: print('The top left is: ' + str(bb.x) + ' ' + str(bb.y))

  • Convert to dict: bb_dict = bb.as_dict() returns {'x': x, 'y': y, 'width': width, 'height': height}
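The relationship between denormalized and normalized co-ordinates can be sketched without the library. The snippet below is only an illustration, not the package's implementation: SketchBox and to_normalized are hypothetical names, and it assumes that normalization simply divides each value by the page width or height.

```python
from dataclasses import dataclass


@dataclass
class SketchBox:
    """Hypothetical stand-in for a bounding box (not the textractor class)."""
    x: float
    y: float
    width: float
    height: float


def to_normalized(box: SketchBox, doc_width: float, doc_height: float) -> dict:
    """Scale denormalized (pixel) values into the [0, 1] range."""
    return {
        "x": box.x / doc_width,
        "y": box.y / doc_height,
        "width": box.width / doc_width,
        "height": box.height / doc_height,
    }


# A 200 x 25 pixel box on an assumed 1000 x 500 pixel page.
box = SketchBox(x=100, y=50, width=200, height=25)
normalized = to_normalized(box, doc_width=1000, doc_height=500)
```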

property area

Returns the area of the bounding box. Negative bounding boxes are treated as having an area of 0.

Returns:

Bounding box area

Return type:

float
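As a rough sketch of the zero-area rule, the function below returns 0 whenever a dimension is negative. It assumes that a "negative" bounding box means one with a negative width or height; bbox_area is a hypothetical helper, not part of the package.

```python
def bbox_area(width: float, height: float) -> float:
    """Area of a box; a negative width or height is treated as an empty box."""
    if width < 0 or height < 0:
        return 0.0
    return width * height


regular_area = bbox_area(4, 2.5)
negative_area = bbox_area(-4, 2.5)
```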

as_denormalized_numpy()
Returns:

Returns denormalized co-ordinates x, y and dimensions width, height as numpy array.

Return type:

numpy.array

classmethod center_is_inside(bbox_a, bbox_b)

Returns true if the center point of Bounding Box A is within Bounding Box B

classmethod enclosing_bbox(bboxes, spatial_object: Optional[SpatialObject] = None)
Parameters:
  • bboxes (List[BoundingBox]) – list of bounding boxes

  • spatial_object (SpatialObject) – spatial object to be added to the returned bbox

Returns:

BoundingBox that encloses all of the given bounding boxes

classmethod from_denormalized_borders(left: float, top: float, right: float, bottom: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized. If spatial_object is not None, the coordinates will be denormalized according to the spatial object.

Parameters:
  • left (float) – ~ [0, doc_width]

  • top (float) – ~ [0, doc_height]

  • right (float) – ~ [0, doc_width]

  • bottom (float) – ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_corners(x1: float, y1: float, x2: float, y2: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized.

Parameters:
  • x1 (float) – Left ~ [0, doc_width]

  • y1 (float) – Top ~ [0, doc_height]

  • x2 (float) – Right ~ [0, doc_width]

  • y2 (float) – Bottom ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage).

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_dict(bbox_dict: Dict[str, float])

Builds an axis aligned bounding box from a dictionary of: {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to the spatial object.

Parameters:
  • bbox_dict (Dict[str, float]) – {'x': x, 'y': y, 'width': width, 'height': height} of [0, doc_height] x [0, doc_width]

  • spatial_object – Some object with width and height attributes

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_xywh(x: float, y: float, width: float, height: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left, width and height properties. The coordinates are assumed to be denormalized.

Parameters:
  • x (float) – Left ~ [0, doc_width]

  • y (float) – Top ~ [0, doc_height]

  • width (float) – Width ~ [0, doc_width]

  • height (float) – Height ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage).

Returns:

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]
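The corner-based and (x, y, width, height) constructors describe the same rectangle in two ways. The arithmetic that links them can be sketched as follows; corners_to_xywh and xywh_to_corners are hypothetical helper names used only for illustration and are not part of the package.

```python
from typing import Tuple

Quad = Tuple[float, float, float, float]


def corners_to_xywh(x1: float, y1: float, x2: float, y2: float) -> Quad:
    """Convert (left, top, right, bottom) into (x, y, width, height)."""
    return (x1, y1, x2 - x1, y2 - y1)


def xywh_to_corners(x: float, y: float, width: float, height: float) -> Quad:
    """Convert (x, y, width, height) back into (left, top, right, bottom)."""
    return (x, y, x + width, y + height)


xywh = corners_to_xywh(10, 20, 110, 70)
corners = xywh_to_corners(*xywh)
```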

classmethod from_normalized_dict(bbox_dict: Dict[str, float], spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned BoundingBox from a dictionary like {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to spatial_object.

Parameters:
  • bbox_dict (dict) – Dictionary of normalized co-ordinates.

  • spatial_object (SpatialObject) – Object with width and height attributes.

Returns:

Object with denormalized co-ordinates

Return type:

BoundingBox
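Denormalizing a dictionary of normalized co-ordinates amounts to scaling by the page size. The sketch below assumes a plain multiplication by width and height; denormalize is a hypothetical helper, and the page size is passed as two numbers instead of a SpatialObject to keep the example self-contained.

```python
from typing import Dict


def denormalize(
    bbox_dict: Dict[str, float], page_width: float, page_height: float
) -> Dict[str, float]:
    """Scale [0, 1] co-ordinates up to page units (for example, pixels)."""
    return {
        "x": bbox_dict["x"] * page_width,
        "y": bbox_dict["y"] * page_height,
        "width": bbox_dict["width"] * page_width,
        "height": bbox_dict["height"] * page_height,
    }


pixels = denormalize(
    {"x": 0.5, "y": 0.25, "width": 0.5, "height": 0.5},
    page_width=800,
    page_height=600,
)
```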

get_distance(bbox)

Returns the distance between the center point of the bounding box and another bounding box

Returns:

Returns the distance as float

Return type:

float
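One plausible reading of this distance is the Euclidean distance between the two center points. The sketch below follows that assumption and is not taken from the package; boxes are plain (x, y, width, height) tuples and center_distance is a hypothetical name.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def center(box: Box) -> Tuple[float, float]:
    """Center point of an axis aligned box."""
    x, y, width, height = box
    return (x + width / 2, y + height / 2)


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes (assumed metric)."""
    ax, ay = center(a)
    bx, by = center(b)
    return math.hypot(ax - bx, ay - by)


# Centers are (1, 1) and (4, 5), a 3-4-5 triangle apart.
distance = center_distance((0, 0, 2, 2), (3, 4, 2, 2))
```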

get_intersection(bbox)

Returns the intersection of this object’s bbox and another BoundingBox.

Returns:

A BoundingBox object
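The intersection of two axis aligned boxes can be sketched as below. This is an illustration under stated assumptions, not the package's code: boxes are plain (x, y, width, height) tuples, intersection is a hypothetical helper, and returning None for boxes that do not overlap is a choice made for this sketch only.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Overlapping region of two boxes, or None when they do not overlap."""
    left = max(a[0], b[0])
    top = max(a[1], b[1])
    right = min(a[0] + a[2], b[0] + b[2])
    bottom = min(a[1] + a[3], b[1] + b[3])
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)


overlap = intersection((0, 0, 10, 10), (5, 5, 10, 10))
disjoint = intersection((0, 0, 1, 1), (2, 2, 1, 1))
```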

classmethod is_inside(bbox_a, bbox_b)

Returns true if Bounding Box A is within Bounding Box B
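The two containment checks, is_inside and center_is_inside, differ in how strict they are. The sketch below shows one way to express both tests on plain (x, y, width, height) tuples. It assumes that touching edges count as inside, which the text above does not specify, and the function bodies are illustrative only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def is_inside(a: Box, b: Box) -> bool:
    """True when box a lies entirely within box b (edges may touch)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax >= bx and ay >= by and ax + aw <= bx + bw and ay + ah <= by + bh


def center_is_inside(a: Box, b: Box) -> bool:
    """True when the center point of box a falls within box b."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    cx, cy = ax + aw / 2, ay + ah / 2
    return bx <= cx <= bx + bw and by <= cy <= by + bh


page_region = (0, 0, 100, 100)
overhanging = (80, 80, 30, 30)  # sticks out past the right and bottom edges
```

Here overhanging fails the strict test but passes the center test, which is why the center-based check is the more forgiving of the two.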

class textractor.entities.bbox.SpatialObject(width: float, height: float)

Bases: ABC

The SpatialObject interface defines an object that has a width and height. This is mostly used as a reference for BoundingBox so that it can provide normalized coordinates.

Document

The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.

class textractor.entities.document.Document(num_pages: int = 1)

Bases: SpatialObject, Linearizable

Represents the description of a single document, as it would appear in the input to the Textract API. Document serves as the root node of the object model hierarchy, which should be used as an intermediate form for most analytic purposes. The Document node also contains the metadata of the document.

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElements present in the Document.

Returns:

List of KeyValue objects, each representing a checkbox within the Document.

Return type:

EntityList[KeyValue]

directional_finder(word_1: str = '', word_2: str = '', page: int = -1, prefix: str = '', direction=Direction.BELOW, entities=[])

The function returns entity types present in entities by prepending the prefix provided by the user. This helps in cases of repeating key-values and checkboxes. The user can manipulate original data or produce a copy. The main advantage of this function is to be able to define direction.

Parameters:
  • word_1 (str, required) – The reference word from where x1, y1 coordinates are derived

  • word_2 (str, optional) – The second word preferably in the direction indicated by the parameter direction. When it isn’t given the end of page coordinates are used in the given direction.

  • page (int, required) – page number of the page in the document to search the entities in.

  • prefix (str, optional) – User provided prefix to prepend to the key. Without a prefix, the method acts as a search by geometry function.

  • entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns:

Returns the EntityList of modified key-value and/or checkboxes

Return type:

EntityList

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Document.

Returns:

List of ExpenseDocument objects, each representing an expense document within the Document.

Return type:

EntityList[ExpenseDocument]

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv', sep: str = ';')

Export key-value entities and checkboxes in csv format.

Parameters:
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

  • sep (str) – Separator to be used in the csv file.
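The effect of the sep parameter can be illustrated with the standard csv module. This is a sketch only: kv_rows_to_csv is a hypothetical helper, the "Key" and "Value" header names are assumptions, and the actual layout of the exported file may differ.

```python
import csv
import io
from typing import Iterable, Tuple


def kv_rows_to_csv(rows: Iterable[Tuple[str, str]], sep: str = ";") -> str:
    """Serialize (key, value) pairs as delimited text using the given separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerow(["Key", "Value"])  # assumed header, for illustration
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = kv_rows_to_csv([("Name", "Jane Doe"), ("Subscribed", "SELECTED")])
```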

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters:
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.

Parameters:

filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) List[KeyValue]

Return a list of KeyValue objects containing checkboxes if the document contains them.

Parameters:
  • selected (bool) – True/False Return SELECTED checkboxes

  • not_selected (bool) – True/False Return NOT_SELECTED checkboxes

Returns:

Returns checkboxes that match the conditions set by the flags.

Return type:

EntityList[KeyValue]
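How the two flags combine can be sketched with plain tuples. The example assumes each checkbox is a (key, status) pair whose status is either "SELECTED" or "NOT_SELECTED"; filter_by_status is a hypothetical helper, not the library method.

```python
from typing import List, Tuple

Checkbox = Tuple[str, str]  # (key, status)


def filter_by_status(
    checkboxes: List[Checkbox], selected: bool = True, not_selected: bool = True
) -> List[Checkbox]:
    """Keep only the checkboxes whose status is enabled by the flags."""
    wanted = set()
    if selected:
        wanted.add("SELECTED")
    if not_selected:
        wanted.add("NOT_SELECTED")
    return [box for box in checkboxes if box[1] in wanted]


boxes = [("Newsletter", "SELECTED"), ("Phone contact", "NOT_SELECTED")]
only_selected = filter_by_status(boxes, selected=True, not_selected=False)
```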

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6)

Returns up to top_k_matches key-value pairs for the key that is queried from the document.

Parameters:
  • key (str) – Query key to match

  • top_k_matches (int) – Maximum number of matches to return

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[KeyValue]
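To give an intuition for how top_k_matches and similarity_threshold interact, the sketch below ranks candidate keys by a normalized Levenshtein similarity. It is not the library's implementation: the normalization (one minus edit distance over the longer length), the lowercasing, and the helper names are all assumptions made for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def top_matches(
    query: str, keys: List[str], top_k: int = 1, threshold: float = 0.6
) -> List[str]:
    """Keys at or above the threshold, best first, truncated to top_k."""
    scored = [(similarity(query, key), key) for key in keys]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [key for _, key in kept[:top_k]]


matches = top_matches("Adress", ["Address", "Name", "Date"], top_k=2)
```

Even with top_k=2, only "Address" is returned here because the other keys fall below the 0.6 threshold.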

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList[Word]

property identity_document: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns:

List of IdentityDocument objects.

Return type:

EntityList

property identity_documents: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns:

List of IdentityDocument objects, each representing an identity document within the Document.

Return type:

EntityList[IdentityDocument]

property images: List[Image]

Returns all the page images in the Document.

Returns:

List of PIL Image objects.

Return type:

List[PIL.Image]

independent_words()
Returns:

Return all words in the document, outside of tables, checkboxes, key-values.

Return type:

EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Document.

Returns:

List of KeyValue objects, each representing a key-value pair within the Document.

Return type:

EntityList[KeyValue]

keys(include_checkboxes: bool = True) List[str]

Returns the keys of all key-value pairs and checkboxes, if the document contains them.

Parameters:

include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.

Returns:

List of strings containing key names in the Document

Return type:

List[str]

property layouts: EntityList[Layout]

Returns all the Layout objects present in the Document

Returns:

List of Layout objects

Return type:

EntityList[Layout]

property lines: EntityList[Line]

Returns all the Line objects present in the Document.

Returns:

List of Line objects, each representing a line within the Document.

Return type:

EntityList[Line]

classmethod open(fp: Union[dict, str, Path, IO])

Create a Document object from a JSON file path, file handle or response dictionary

Parameters:

fp (Union[dict, str, Path, IO[AnyStr]]) – JSON file path, file handle or response dictionary

Raises:

InputError – Raised on input not being of type Union[dict, str, Path, IO[AnyStr]]

Returns:

Document object

Return type:

Document
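The three accepted input kinds can be told apart with simple type checks. The following sketch shows the general idea only: load_response is a hypothetical helper, TypeError stands in for the package's InputError, and the real method may handle additional cases that are not described here.

```python
import io
import json
from pathlib import Path
from typing import IO, Union


def load_response(fp: Union[dict, str, Path, IO]) -> dict:
    """Return a response dictionary from a dict, a JSON path or a file handle."""
    if isinstance(fp, dict):
        return fp  # already a parsed response
    if isinstance(fp, (str, Path)):
        with open(fp, "r", encoding="utf-8") as handle:
            return json.load(handle)
    if hasattr(fp, "read"):
        return json.load(fp)  # open file handle or file-like object
    raise TypeError("expected a dict, a path or a file handle")


from_dict = load_response({"Blocks": []})
from_handle = load_response(io.StringIO('{"Blocks": []}'))
```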

page(page_no: int = 0)

Returns Page object/s depending on the input page_no. Follows zero-indexing.

Parameters:

page_no (int if single page, list of int if multiple pages) – if int, returns single Page Object, else if list, it returns a list of Page objects.

Returns:

Filters and returns Page objects depending on the input page_no

Return type:

Page or List[Page]
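The int-or-list behaviour of page_no can be sketched as follows. select_pages is a hypothetical helper operating on a plain Python list, shown only to illustrate the zero-indexed lookup described above.

```python
from typing import List, Union


def select_pages(pages: list, page_no: Union[int, List[int]] = 0):
    """Return one page for an int, or a list of pages for a list of ints."""
    if isinstance(page_no, int):
        return pages[page_no]
    return [pages[index] for index in page_no]


pages = ["page-0", "page-1", "page-2"]
single = select_pages(pages, 1)
several = select_pages(pages, [0, 2])
```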

property pages: List[Page]

Returns all the Page objects present in the Document.

Returns:

List of Page objects, each representing a Page within the Document.

Return type:

List[Page]

property queries: EntityList[Query]

Returns all the Query objects present in the Document.

Returns:

List of Query objects.

Return type:

EntityList[Query]

return_duplicates()

Returns a dictionary containing page numbers as keys and list of EntityList objects as values. Each EntityList instance contains the key-values and the last item is the table which contains duplicate information. This function is intended to let the Textract user know of duplicate objects extracted by the various Textract models.

Returns:

Dictionary containing page numbers as keys and list of EntityList objects as values.

Return type:

Dict[page_num, List[EntityList[DocumentEntity]]]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters:
  • keyword (str) – Keyword that is used to query the document.

  • top_k (int) – Number of closest line objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of lines that contain the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Word]

Return a list of top_k words that match the keyword.

Parameters:
  • keyword (str) – Keyword that is used to query the document.

  • top_k (int) – Number of closest word objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of words that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Word]

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Document.

Returns:

List of Signature objects.

Return type:

EntityList[Signature]

property tables: EntityList[Table]

Returns all the Table objects present in the Document.

Returns:

List of Table objects, each representing a table within the Document.

Return type:

EntityList[Table]

property text: str

Returns the document text as one string

Returns:

Page text separated by line breaks

Return type:

str

to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False))

Returns the HTML representation of the document. This effectively calls Linearizable.to_html(), but adds <html><body></body></html> around the result and puts each page in a <div>.

Returns:

HTML text of the entity

Return type:

str
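The wrapping described above can be pictured with a few lines of string handling. This is a simplified sketch: wrap_document_html is a hypothetical helper, and the exact whitespace and attributes of the real output are not specified here.

```python
from typing import List


def wrap_document_html(page_fragments: List[str]) -> str:
    """Put each page fragment in a <div>, then wrap everything in <html><body>."""
    pages = "".join(f"<div>{fragment}</div>" for fragment in page_fragments)
    return f"<html><body>{pages}</body></html>"


html = wrap_document_html(["<p>Page one</p>", "<p>Page two</p>"])
```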

to_trp2()

Parses the response to the trp2 format for backward compatibility

Returns:

TDocument object that can be used with the older Textractor libraries

Return type:

TDocument

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns:

Returns an EntityList object

Return type:

EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Document.

Returns:

List of Word objects, each representing a word within the Document.

Return type:

EntityList[Word]

LazyDocument

class textractor.entities.lazy_document.LazyDocument(job_id: str, api: TextractAPI, textract_client=None, images=None, output_config: Optional[OutputConfig] = None)

Bases: object

LazyDocument is a proxy for Document when using the async APIs. It will not load the response until one of its properties is used. You can access the underlying Document object using the document property.

property document: Document

Getter for the underlying Document object

Returns:

Proxied Document object

Return type:

Document
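The lazy-loading proxy idea can be sketched in a few lines. The class below is not LazyDocument itself: LazyProxy, the loader callable and load_count are hypothetical names, and a SimpleNamespace stands in for the real Document. It only illustrates deferring work until an attribute is first used.

```python
from types import SimpleNamespace
from typing import Callable


class LazyProxy:
    """Illustrative proxy that builds its target on first attribute access."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader
        self._document = None
        self.load_count = 0  # counts how often the loader actually ran

    @property
    def document(self):
        if self._document is None:
            self._document = self._loader()
            self.load_count += 1
        return self._document

    def __getattr__(self, name: str):
        # Only called when normal lookup fails; forward to the loaded target.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.document, name)


lazy = LazyProxy(lambda: SimpleNamespace(text="hello"))
loads_before = lazy.load_count
first_text = lazy.text   # triggers the load
second_text = lazy.text  # reuses the cached target
```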

property s3_polling_interval: int

Getter for the polling interval

Returns:

Time between get_full_result calls

Return type:

int

property textract_polling_interval: int

Getter for the polling interval

Returns:

Time between get_full_result calls

Return type:

int

DocumentEntity

DocumentEntity is the class that all Document entities such as Word, Line, Table etc. inherit from. This class provides methods useful to all such entities.

class textractor.entities.document_entity.DocumentEntity(entity_id: str, bbox: BoundingBox)

Bases: Linearizable, ABC

An interface for all document entities within the document body, composing the hierarchy of the document object model. The purpose of this class is to define properties common to all document entities i.e. unique id and bounding box.

add_children(children)

Adds children to all entities that have parent-child relationships.

Parameters:

children (list) – List of child entities.

property bbox: BoundingBox
Returns:

Returns entire bounding box of entity

Return type:

BoundingBox

property children
Returns:

Returns children of entity

Return type:

list

property confidence: float

Returns the object confidence as predicted by Textract. If the confidence is not available, returns None

Returns:

Prediction confidence for a document entity, between 0 and 1

Return type:

float

property height: float
Returns:

Returns height for bounding box

Return type:

float

property raw_object: Dict
Returns:

Returns the raw dictionary object that was used to create this Python object

Return type:

Dict

remove(entity)

Recursively removes an entity from the child tree of a document entity and update its bounding box

Parameters:

entity (DocumentEntity) – Entity
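Recursive removal followed by a bounding box update might look like the sketch below. It is an assumption-laden illustration rather than the library's code: Node, enclosing and remove_entity are hypothetical names, boxes are (x, y, width, height) tuples, and the parent box is recomputed as the smallest box enclosing the remaining children.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Node:
    """Hypothetical stand-in for an entity with a box and child entities."""
    name: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def enclosing(boxes: List[Box]) -> Box:
    """Smallest axis aligned box that contains every box in the list."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


def remove_entity(node: Node, target: Node) -> bool:
    """Remove target from node's subtree; shrink boxes along the way."""
    removed = False
    kept = []
    for child in node.children:
        if child is target:
            removed = True
            continue
        if remove_entity(child, target):
            removed = True
        kept.append(child)
    node.children = kept
    if removed and node.children:
        node.box = enclosing([child.box for child in node.children])
    return removed


first = Node("first", (0, 0, 10, 10))
second = Node("second", (20, 0, 10, 10))
line = Node("line", (0, 0, 30, 10), [first, second])
was_removed = remove_entity(line, second)
```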

visit(word_set)

visualize(*args, **kwargs) EntityList

Returns the object’s children in a visualization EntityList object

Returns:

Returns an EntityList object

Return type:

EntityList

property width: float
Returns:

Returns width for bounding box

Return type:

float

property x: float
Returns:

Returns x coordinate for bounding box

Return type:

float

property y: float
Returns:

Returns y coordinate for bounding box

Return type:

float

Word

Represents a single Word within the Document. This class contains the associated metadata with the Word entity including the text transcription, text type, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.word.Word(entity_id: str, bbox: BoundingBox, text: str = '', text_type: TextTypes = TextTypes.PRINTED, confidence: float = 0, is_clickable: bool = False, is_structure: bool = False)

Bases: DocumentEntity

To create a new Word object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the Word entity.

  • bbox (BoundingBox) – Bounding box of the Word entity.

  • text (str) – Transcription of the Word object.

  • text_type (TextTypes) – Enum value stating the type of text stored in the entity. Takes 2 values - PRINTED and HANDWRITING

  • confidence (float) – value storing the confidence of detection out of 100.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property page: int
Returns:

Returns the page number of the page the Word entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property text: str
Returns:

Returns the text transcription of the Word entity.

Return type:

str

property text_type: TextTypes
Returns:

Returns the text type of the Word object.

Return type:

TextTypes

property words

Returns itself

Return type:

Word

Line

Represents a single Line Entity within the Document. The Textract API response returns groups of words as LINE BlockTypes. They contain Word entities as children.

This class contains the associated metadata with the Line entity including the entity ID, bounding box information, child words, page number, Page ID and confidence of detection.

class textractor.entities.line.Line(entity_id: str, bbox: BoundingBox, words: Optional[List[Word]] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new Line object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the Line entity.

  • bbox (BoundingBox) – Bounding box of the line entity.

  • words (list, optional) – List of the Word entities present in the line

  • confidence (float, optional) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]
Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns EntityList of Word entities that match the input text type.

Return type:

EntityList[Word]

property page
Returns:

Returns the page number of the page the Line entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property text
Returns:

Returns the text transcription of the Line entity.

Return type:

str

property words
Returns:

Returns the line’s children

Return type:

List[Word]

Page

Represents a single Document page, as it would appear in the Textract API output. The Page object also holds metadata such as the physical dimensions of the page (width and height, in pixels) and its child_ids.

class textractor.entities.page.Page(id: str, width: int, height: int, page_num: int = -1, child_ids=None)

Bases: SpatialObject, Linearizable

Creates a new Page, ideally representing a single page of a document.

Parameters:
  • id (str) – Unique id of the Page

  • width (float) – Width of page, in pixels

  • height (float) – Height of page, in pixels

  • page_num (int) – Page number in the document linked to this Page object

  • child_ids (List) – IDs of child entities in the Page as determined by Textract

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElement present in the Page.

Returns:

List of KeyValue objects, each representing a checkbox within the Page.

Return type:

EntityList[KeyValue]

property container_layouts: EntityList[Layout]

Returns all the container Layout objects present in the Page.

Returns:

List of Layout objects.

Return type:

EntityList

directional_finder(word_1: str = '', word_2: str = '', prefix: str = '', direction=Direction.BELOW, entities=[])

Returns the entities of the types listed in entities, with the prefix provided by the user prepended to their keys. This helps when key-values and checkboxes repeat on a page. The user can manipulate the original data or produce a copy. The main advantage of this function is that the search direction can be specified.

Parameters:
  • word_1 (str, required) – The reference word from where x1, y1 coordinates are derived

  • word_2 (str, optional) – The second word preferably in the direction indicated by the parameter direction. When it isn’t given the end of page coordinates are used in the given direction.

  • prefix (str, optional) – User-provided prefix to prepend to the key. Without a prefix, the method acts as a search-by-geometry function.

  • entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns:

Returns the EntityList of modified key-value and/or checkboxes

Return type:

EntityList
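The geometric idea behind a directional search can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's implementation: `KeyBox` and `find_below` are hypothetical names, positions are assumed to be normalized to the range 0 to 1, and the references are passed as objects rather than as strings.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class KeyBox:
    """Hypothetical stand-in for a key-value entity with a normalized position."""
    key: str
    y: float       # top edge: 0.0 is the top of the page, 1.0 the bottom
    height: float


def find_below(reference: KeyBox, entities: List[KeyBox],
               limit: Optional[KeyBox] = None, prefix: str = "") -> List[KeyBox]:
    """Return prefixed copies of the entities located below the reference."""
    start = reference.y + reference.height
    # Without a second reference, search down to the end of the page.
    end = limit.y if limit is not None else 1.0
    return [replace(e, key=prefix + e.key) for e in entities if start <= e.y < end]


applicant = KeyBox("Applicant", y=0.10, height=0.05)
spouse = KeyBox("Spouse", y=0.50, height=0.05)
fields = [KeyBox("Name", y=0.20, height=0.03), KeyBox("Name", y=0.60, height=0.03)]

found = find_below(applicant, fields, limit=spouse, prefix="applicant_")
print([e.key for e in found])  # ['applicant_Name']
```

Because the sketch returns copies, the repeated "Name" keys in the original list stay untouched; only the returned entities carry the prefix.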

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Page.

Returns:

List of ExpenseDocument objects.

Return type:

EntityList

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv')

Export key-value entities and checkboxes in csv format.

Parameters:
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters:
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.

Parameters:

filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) EntityList[KeyValue]

Return a list of KeyValue objects containing checkboxes if the page contains them.

Parameters:
  • selected (bool) – True/False Return SELECTED checkboxes

  • not_selected (bool) – True/False Return NOT_SELECTED checkboxes

Returns:

Returns checkboxes that match the conditions set by the flags.

Return type:

EntityList[KeyValue]
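The meaning of the two flags can be illustrated with a minimal sketch. The `Checkbox` dataclass and the free function below are hypothetical stand-ins rather than textractor classes; the sketch only assumes that each checkbox carries a selected or not-selected status.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkbox:
    """Hypothetical stand-in for a KeyValue that wraps a SelectionElement."""
    key: str
    selected: bool


def filter_checkboxes(boxes: List[Checkbox], selected: bool = True,
                      not_selected: bool = True) -> List[Checkbox]:
    """Keep a checkbox when the flag matching its status is True."""
    return [b for b in boxes
            if (b.selected and selected) or (not b.selected and not_selected)]


boxes = [Checkbox("Married", True), Checkbox("Single", False)]
only_selected = filter_checkboxes(boxes, selected=True, not_selected=False)
print([b.key for b in only_selected])  # ['Married']
```

With both flags left at their defaults every checkbox is returned, and with both set to False the result is empty.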

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[KeyValue]

Returns up to top_k_matches key-value pairs for the key that is queried from the page.

Parameters:
  • key (str) – Query key to match

  • top_k_matches (int) – Maximum number of matches to return

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns:

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[KeyValue]
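How a similarity threshold and a top-k limit interact can be shown with a small, self-contained sketch. It uses a plain edit-distance similarity normalized by the longer string; that normalization, and the helper names `levenshtein`, `similarity` and `top_matches`, are assumptions made for illustration and may differ from what the library does internally.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower when they differ."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def top_matches(query: str, keys: List[str], top_k_matches: int = 1,
                similarity_threshold: float = 0.6) -> List[Tuple[float, str]]:
    """Drop keys below the threshold, then return the best top_k_matches."""
    scored = [(similarity(query, k), k) for k in keys]
    kept = [s for s in scored if s[0] >= similarity_threshold]
    kept.sort(key=lambda s: s[0], reverse=True)
    return kept[:top_k_matches]


page_keys = ["Date of Birth", "Date of Issue", "Name"]
matches = top_matches("Date of birth", page_keys, top_k_matches=2)
print(matches[0][1])  # Date of Birth
```

The threshold is applied before the top-k cut, so fewer than top_k_matches results come back when too few keys are similar enough.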

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List[Word]]

Returns the page text and words sorted in reading order

Parameters:

config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()

Returns:

Tuple of page text and words

Return type:

Tuple[str, List[Word]]
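The effect of the separator options can be pictured with a simplified sketch. The option names are taken from the default configuration shown in the signature, but the `linearize` function itself is hypothetical and the exact semantics of these options in the library are an assumption here.

```python
import re
from typing import List


def linearize(blocks: List[List[str]],
              same_layout_element_separator: str = "\n",
              layout_element_separator: str = "\n\n",
              max_number_of_consecutive_new_lines: int = 2) -> str:
    """Join lines within a block, join blocks, then cap runs of newlines."""
    parts = [same_layout_element_separator.join(lines) for lines in blocks]
    text = layout_element_separator.join(parts)
    limit = max_number_of_consecutive_new_lines
    # Collapse any run longer than the limit down to exactly the limit.
    return re.sub(r"\n{%d,}" % (limit + 1), "\n" * limit, text)


blocks = [["Invoice"], ["Bill to:", "ACME Corp"]]
text = linearize(blocks, layout_element_separator="\n\n\n")
print(repr(text))  # 'Invoice\n\nBill to:\nACME Corp'
```

Even though the block separator asks for three newlines, the cap of two consecutive newlines wins in this sketch.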

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) EntityList[Word]

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList[Word]

independent_words() EntityList[Word]
Returns:

Returns all words on the page that lie outside of tables, checkboxes and key-values.

Return type:

EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Page.

Returns:

List of KeyValue objects, each representing a key-value pair within the Page.

Return type:

EntityList[KeyValue]

keys(include_checkboxes: bool = True) List[str]

Prints all keys for key-value pairs and checkboxes if the page contains them.

Parameters:

include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.

Returns:

List of strings containing key names in the Page

Return type:

List[str]

property layouts: EntityList[Layout]

Returns all the Layout objects present in the Page.

Returns:

List of Layout objects.

Return type:

EntityList

property leaf_layouts: EntityList[Layout]

Returns all the leaf Layout objects present in the Page.

Returns:

List of Layout objects.

Return type:

EntityList

property lines: EntityList[Line]

Returns all the Line objects present in the Page.

Returns:

List of Line objects, each representing a line within the Page.

Return type:

EntityList[Line]

property page_layout: PageLayout
property queries: EntityList[Query]

Returns all the Query objects present in the Page.

Returns:

List of Query objects.

Return type:

EntityList

return_duplicates()

Returns a list of EntityList objects. Each EntityList holds the key-values that duplicate information found in a table, with that table as its last item. This function is intended to let the Textract user know of duplicate objects extracted by the various Textract models.

Returns:

List of EntityList objects each containing the intersection of KeyValue and Table entities on the page.

Return type:

List[EntityList]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters:
  • keyword (str) – Keyword that is used to query the page.

  • top_k (int) – Number of closest line objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns:

Returns a list of lines that contain the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[Word]

Return a list of top_k words that match the keyword.

Parameters:
  • keyword (str, required) – Keyword that is used to query the document.

  • top_k (int, optional) – Number of closest word objects to be returned. default=1

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a list of words that match the queried key sorted from highest to lowest similarity.

Return type:

EntityList[Word]

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Page.

Returns:

List of Signature objects.

Return type:

EntityList

property tables: EntityList[Table]

Returns all the Table objects present in the Page.

Returns:

List of Table objects, each representing a table within the Page.

Return type:

EntityList

property text: str

Returns the page text

Returns:

Linearized page text

Return type:

str

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns:

Returns an EntityList object

Return type:

EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Page.

Returns:

List of Word objects, each representing a word within the Page.

Return type:

EntityList[Word]

PageLayout

class textractor.entities.page_layout.PageLayout(titles: EntityList[Layout] = [], headers: EntityList[Layout] = [], footers: EntityList[Layout] = [], section_headers: EntityList[Layout] = [], page_numbers: EntityList[Layout] = [], lists: EntityList[Layout] = [], figures: EntityList[Layout] = [], tables: EntityList[Layout] = [], key_values: EntityList[Layout] = [])

Bases: object

Object representation of the layout components detected in the page.

property figures: EntityList[Layout]

Figures detected in the Page

Returns:

EntityList of figures detected in the page

Return type:

EntityList[Layout]

property footers: EntityList[Layout]

Footers detected in the Page

Returns:

EntityList of footers detected in the page

Return type:

EntityList[Layout]

property headers: EntityList[Layout]

Headers detected in the Page

Returns:

EntityList of headers detected in the page

Return type:

EntityList[Layout]

property key_values: EntityList[Layout]

KeyValues detected in the Page

Returns:

EntityList of keyvalues detected in the page

Return type:

EntityList[Layout]

property lists: EntityList[Layout]

Lists detected in the Page

Returns:

EntityList of lists detected in the page

Return type:

EntityList[Layout]

property page_numbers: EntityList[Layout]

Page numbers detected in the Page

Returns:

EntityList of page numbers detected in the page

Return type:

EntityList[Layout]

property section_headers: EntityList[Layout]

Section headers detected in the Page

Returns:

EntityList of section headers detected in the page

Return type:

EntityList[Layout]

property tables: EntityList[Layout]

Tables detected in the Page. This includes Tables detected by the AnalyzeDocument Tables API if used.

Returns:

EntityList of tables detected in the page

Return type:

EntityList[Layout]

property titles: EntityList[Layout]

Titles detected in the Page

Returns:

EntityList of titles detected in the page

Return type:

EntityList[Layout]

Layout

Represents a single Layout Entity within the Document. The Textract API response returns groups of layout as LAYOUT_* BlockTypes.

class textractor.entities.layout.Layout(entity_id: str, bbox: BoundingBox, reading_order: int, label: str, confidence: float = 0)

Bases: DocumentEntity

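To create a new Layout object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the layout.

  • bbox (BoundingBox) – Bounding box of the layout.

  • reading_order (int) – Position of the layout in the reading order.

  • label (str) – Label of the layout.

  • confidence (float) – Confidence with which the layout was detected.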

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List[Word]]

Returns the layout object text and words sorted in reading order

Parameters:

config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()

Returns:

Tuple of page text and words

Return type:

Tuple[str, List[Word]]

property page
Returns:

Returns the page number of the page the Layout entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property text

Maps to .get_text()

Returns:

Returns the linearized text of the entity

Return type:

str

property words

Table

Represents a Table entity within the document. Tables are hierarchical objects composed of TableCell objects, which implicitly form columns and rows.

Table object contains associated metadata within it. They include TableCell information, headers, page number and page ID of the page within which it exists in the document.

class textractor.entities.table.Table(entity_id, bbox: BoundingBox)

Bases: DocumentEntity

To create a new Table object we need the following:

Parameters:
  • entity_id – Unique identifier of the table.

  • bbox – Bounding box of the table.

add_cells(cells: List[TableCell])

Add TableCell objects to the Table. This function does not check the integrity of the table after the cells are added.

Parameters:

cells (list) – List of TableCell objects, each representing a single cell within the table. No specific ordering is assumed since it is implicitly ordered by row and column index.
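Since cells carry their own row and column indices, the order in which they are added does not matter. A minimal sketch of that idea follows; `Cell` and `to_grid` are hypothetical stand-ins, and 1-based indices are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for TableCell; indices are assumed to be 1-based."""
    row_index: int
    col_index: int
    text: str


def to_grid(cells: List[Cell]) -> List[List[str]]:
    """Place each cell by its indices, regardless of the input order."""
    n_rows = max(c.row_index for c in cells)
    n_cols = max(c.col_index for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.row_index - 1][c.col_index - 1] = c.text
    return grid


# Deliberately shuffled input.
cells = [Cell(2, 2, "3"), Cell(1, 1, "Item"), Cell(2, 1, "Pens"), Cell(1, 2, "Qty")]
grid = to_grid(cells)
print(grid)  # [['Item', 'Qty'], ['Pens', '3']]
```

The sketch performs no integrity check either: a missing cell simply leaves an empty string in the grid.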

property checkboxes: List[SelectionElement]
property column_count
property column_headers: Dict[str, List[TableCell]]
Returns:

Returns the column headers of the Table entity.

Return type:

Dict[str, List[TableCell]]

property footers
Returns:

Returns the table footers.

Return type:

List[TableFooter]

get_cells_by_type(cell_type: CellTypes = CellTypes.COLUMN_HEADER)

Returns a dictionary of column_header (str) : List[TableCell] (in order).

Parameters:

cell_type (CellTypes) – supports CellTypes.COLUMN_HEADER as of now, will support SECTION_TITLE, FLOATING_TITLE, FLOATING_FOOTER, SUMMARY_CELL in the future.

Returns:

{column_header (str) : List[TableCell]}

Return type:

Dict[str, List[TableCell]]

get_columns_by_name(column_names, similarity_metric=SimilarityMetric.COSINE, similarity_threshold=0.6)

Returns a new Table containing only the columns whose names match those listed in param column_names.

Parameters:
  • column_names (list) – List of column names of columns to be extracted from table.

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.COSINE is the default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns:

Returns a new Table consisting of columns passed in column_names.

Return type:

Table

get_table_range()
Returns:

Returns the number of rows and columns in the table.

Return type:

Tuple[int, int]

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

get_words_by_type(text_type=TextTypes.PRINTED)

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList[Word]

property page
Returns:

Returns the page number of the page the Table entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property row_count
strip_headers(column_headers: bool = True, in_table_title: bool = False, section_titles=False)

Returns a new Table object after removing all cells that are marked as column headers in the table from the API response.

Parameters:
  • column_headers (bool) – Remove the column headers

  • in_table_title (bool) – Remove the in-table titles

  • section_titles (bool) – Remove the in-table section titles

Returns:

Table object after removing the headers.

Return type:

Table
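The key point is that the original table is left as it is and a new object is returned. A minimal sketch of that behaviour, using a hypothetical `Cell` stand-in and a plain list instead of a Table, might look like this; whether the library re-indexes the remaining rows is not covered here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for TableCell."""
    row_index: int
    col_index: int
    text: str
    is_column_header: bool = False


def strip_column_headers(cells: List[Cell]) -> List[Cell]:
    """Build a new list without header cells; the input list is not modified."""
    return [c for c in cells if not c.is_column_header]


cells = [
    Cell(1, 1, "Item", is_column_header=True),
    Cell(1, 2, "Qty", is_column_header=True),
    Cell(2, 1, "Pens"),
    Cell(2, 2, "3"),
]
body = strip_column_headers(cells)
print([c.text for c in body])  # ['Pens', '3']
```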

property table_type
Returns:

Returns the table type.

Return type:

TableTypes

property title
Returns:

Returns the table title.

Return type:

TableTitle

to_csv(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str

Returns the table in the Comma-Separated-Value (CSV) format

Parameters:
  • use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.

  • config – Text linearization configuration object for the table content

Returns:

Table as a CSV string.

Return type:

str
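To show what a CSV string for a table looks like, here is a short sketch built on Python's standard csv module. The `rows` data is made up for illustration, and this is not a description of how the library produces its output.

```python
import csv
import io

# Rows of cell text, header row first.
rows = [["Item", "Qty"], ["Pens, blue", "3"]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
# Item,Qty
# "Pens, blue",3
```

Cell text that contains the delimiter is quoted by the writer, which is why building the string by hand with `",".join(...)` would not be safe for table content.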

to_excel(filepath=None, workbook=None, save_workbook=True)

Exports the Table entity as an Excel document. The advantage of Excel over CSV is that it can accommodate merged cells, which are common in Textract documents.

Parameters:
  • filepath (str) – Path to store the exported Excel file

  • workbook (xlsxwriter.Workbook) – If an xlsxwriter workbook is passed to the function, the table is appended to the last sheet of that workbook.

  • save_workbook (bool) – Whether to save the workbook. If False, the workbook is returned by the function instead.

Returns:

Returns a workbook if save_workbook is False. Otherwise, saves the .xlsx file to the filepath it was initialized with.

Return type:

xlsxwriter.Workbook

to_html() str

Returns the table in the HTML format

Returns:

Table as an HTML string.

Return type:

str

to_pandas(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Converts the table to a pandas DataFrame

Parameters:
  • use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.

  • config – Text linearization configuration object for the table content

Returns:

The table as a pandas DataFrame.

to_txt()
property words

Returns all the Word objects present in the Table.

Returns:

List of Word objects, each representing a word within the Table.

Return type:

EntityList[Word]

TableCell

Represents a single TableCell object. A TableCell object contains information such as:

  • The position info of the cell within the encompassing Table

  • Properties such as merged-cells span

  • A hierarchy of words contained within the TableCell (optional)

  • Page information

  • Confidence of entity detection.

class textractor.entities.table_cell.TableCell(entity_id: str, bbox: BoundingBox, row_index: int, col_index: int, row_span: int, col_span: int, confidence: float = 0, is_column_header: bool = False, is_title: bool = False, is_footer: bool = False, is_summary: bool = False, is_section_title: bool = False)

Bases: DocumentEntity

To create a new TableCell object we need the following:

Parameters:
  • entity_id – Unique id of the TableCell object

  • bbox – Bounding box of the entity

  • row_index – Row index of position of cell within the table

  • col_index – Column index of position of cell within the table

  • row_span – Number of rows the cell spans (1 means no merged cells)

  • col_span – Number of columns the cell spans (1 means no merged cells)

  • confidence – Confidence out of 100 with which the Cell was detected.

  • is_column_header – Indicates if the cell is a column header

  • is_title – Indicates if the cell is a table title

  • is_footer – Indicates if the cell is a table footer

  • is_summary – Indicates if the cell is a summary cell

  • is_section_title – Indicates if the cell is a section title

property checkboxes
property col_index
Returns:

Returns the column index of the cell in the Table.

Return type:

int

property col_span
Returns:

Returns the column span of the cell in the Table.

Return type:

int

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]

Returns the text in the cell as one space-separated string, together with the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList
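The filtering described above can be sketched without the library. This is a minimal illustration, assuming each word carries a text type; TextTypeStub, WordStub and words_by_type are hypothetical stand-ins, not textractor classes:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TextTypeStub(Enum):
    """Hypothetical stand-in for TextTypes with the two documented members."""
    PRINTED = 0
    HANDWRITING = 1


@dataclass
class WordStub:
    """Hypothetical stand-in for a Word: its text and its text type."""
    text: str
    text_type: TextTypeStub


def words_by_type(
    words: List[WordStub],
    text_type: TextTypeStub = TextTypeStub.PRINTED,
) -> List[WordStub]:
    """Keep only the words whose text type matches the requested one."""
    return [word for word in words if word.text_type == text_type]


cell_words = [
    WordStub("Total", TextTypeStub.PRINTED),
    WordStub("paid", TextTypeStub.HANDWRITING),
]
print([word.text for word in words_by_type(cell_words)])
```

The default mirrors the documented default of TextTypes.PRINTED.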

property is_column_header
property is_section_title
property is_summary
property is_title
merge_direction()
Returns:

Determines whether the merged cell is a row or column merge. Returns 0 for a row merge, 1 for a column merge, 2 for both, and None if there is no merge.

Return type:

int, str
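The merge codes can be translated into readable labels. A minimal sketch, assuming only the integer codes documented for merge_direction; the helper name describe_merge_code and the label wording are hypothetical and not part of textractor:

```python
from typing import Optional

# Labels for the documented merge codes (the wording is illustrative).
MERGE_LABELS = {0: "row merge", 1: "column merge", 2: "row and column merge"}


def describe_merge_code(code: Optional[int]) -> str:
    """Translate a documented merge code into a readable label."""
    if code is None:
        return "no merge"
    return MERGE_LABELS.get(code, "unknown code")


print(describe_merge_code(0))
print(describe_merge_code(None))
```

If merge_direction() returns the code together with a string, as the listed return type suggests, pass only the integer part to the helper.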

property page
Returns:

Returns the page number of the page the TableCell entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property row_index
Returns:

Returns the row index of the cell in the Table.

Return type:

int

property row_span
Returns:

Returns the row span of the cell in the Table.

Return type:

int
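Index and span values together describe the grid area a merged cell occupies. A minimal sketch, assuming the index properties point at the first row and column of the cell and that col_index is the column counterpart of row_index; CellStub and covered_positions are hypothetical stand-ins, not textractor classes:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CellStub:
    """Hypothetical stand-in exposing the four TableCell properties used here."""
    row_index: int
    col_index: int
    row_span: int = 1
    col_span: int = 1


def covered_positions(cell: CellStub) -> List[Tuple[int, int]]:
    """List every (row, column) grid position that the cell occupies."""
    return [
        (row, col)
        for row in range(cell.row_index, cell.row_index + cell.row_span)
        for col in range(cell.col_index, cell.col_index + cell.col_span)
    ]


merged = CellStub(row_index=2, col_index=1, row_span=2, col_span=2)
print(covered_positions(merged))  # [(2, 1), (2, 2), (3, 1), (3, 2)]
```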

property table_id
Returns:

Returns the ID of the Table the TableCell belongs to.

Return type:

str

property text: str

Returns the text in the cell as one space-separated string

Returns:

Text in the cell

Return type:

str

property words

Returns all the Word objects present in the TableCell.

Returns:

List of Word objects, each representing a word within the TableCell.

Return type:

list

TableTitle

Represents a single TableTitle object. The TableTitle object contains information such as:

  • The position of the title within the Document

  • The words that it contains

  • Confidence of entity detection

class textractor.entities.table_title.TableTitle(entity_id: str, bbox: BoundingBox)

Bases: DocumentEntity

Represents a title that is either in-table or floating

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property is_floating: bool
Returns:

Returns whether the TableTitle entity is floating or not.

Return type:

bool

property page
Returns:

Returns the page number of the page the TableTitle entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property text: str

Returns the text in the title as one space-separated string

Returns:

Text in the title

Return type:

str

property words

Returns all the Word objects present in the TableTitle.

Returns:

List of Word objects, each representing a word within the TableTitle.

Return type:

list

KeyValue

The KeyValue entity is a document entity representing the Forms output. The key in a KeyValue is typically made up of words, and the value can be Word elements or, in the case of checkboxes, a SelectionElement.

This class contains the metadata associated with the KeyValue entity, including the entity ID, bounding box information, value, existence of checkbox, page number, Page ID and confidence of detection.

class textractor.entities.key_value.KeyValue(entity_id: str, bbox: BoundingBox, contains_checkbox: bool = False, value: Optional[Value] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new KeyValue object we require the following:

Parameters:
  • entity_id (str) – Unique identifier of the KeyValue entity.

  • bbox (BoundingBox) – Bounding box of the KeyValue entity.

  • contains_checkbox (bool) – True/False to indicate if the value is a checkbox.

  • value (Value) – Value object that maps to the KeyValue entity.

  • confidence (float) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList[Word]

is_selected() bool

For KeyValues containing a selection item, returns its is_selected status

Returns:

Selection status of a selection item key value pair

Return type:

bool

property key
Returns:

Returns EntityList[Word] object (a list of words) associated with the key.

Return type:

EntityList[Word]

property ocr_confidence

Returns the average OCR confidence.

property page: int
Returns:

Returns the page number of the page the KeyValue entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property value: Value
Returns:

Returns the Value mapped to the key if it has been assigned.

Return type:

Value

property words: List[Word]

Returns all the Word objects present in the key and value of the KeyValue object.

Returns:

List of Word objects, each representing a word within the KeyValue entity.

Return type:

EntityList[Word]
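A key and its value can be combined into a single line of text. The sketch below is an illustration only, not the library's linearization code; it borrows the defaults key_suffix=' ', selection_element_selected='[X]' and selection_element_not_selected='[ ]' from the TextLinearizationConfig signature shown above, and render_key_value is a hypothetical helper that works on plain strings:

```python
from typing import List, Optional


def render_key_value(
    key_words: List[str],
    value_words: List[str],
    selected: Optional[bool] = None,
    key_suffix: str = " ",
    selected_token: str = "[X]",
    not_selected_token: str = "[ ]",
) -> str:
    """Join a key and its value into one line of text.

    selected is None for a plain text value, and True or False when the
    value is a checkbox.
    """
    key_text = " ".join(key_words)
    if selected is None:
        value_text = " ".join(value_words)
    else:
        value_text = selected_token if selected else not_selected_token
    return f"{key_text}{key_suffix}{value_text}"


print(render_key_value(["Name:"], ["Jane", "Doe"]))      # Name: Jane Doe
print(render_key_value(["Citizen"], [], selected=True))  # Citizen [X]
```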

Value

Represents a single Value Entity within the Document. The Textract API response returns groups of words as KEY_VALUE_SET BlockTypes. These may be of KEY or VALUE type which is indicated by the EntityType attribute in the JSON response.

This class contains the metadata associated with the Value entity, including the entity ID, bounding box information, child words, associated key ID, page number, Page ID, confidence of detection and whether it is a checkbox.

class textractor.entities.value.Value(entity_id: str, bbox: BoundingBox, confidence: float = 0)

Bases: DocumentEntity

To create a new Value object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the Value entity.

  • bbox (BoundingBox) – Bounding box of the Value entity.

  • confidence (float) – value storing the confidence of detection out of 100.

property contains_checkbox: bool

Returns True if the value associated is a SelectionElement.

Returns:

Returns True if the value associated is a checkbox/SelectionElement.

Return type:

bool

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters:

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns:

Returns list of Word entities that match the input text type.

Return type:

EntityList[Word]

property key_id: str

Returns the associated Key ID for the Value entity.

Returns:

Returns the associated KeyValue object ID.

Return type:

str

property page
Returns:

Returns the page number of the page the Value entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property words: List[Word]
Returns:

Returns a list of all words in the entity if any exist; otherwise returns the checkbox status of the Value entity.

Return type:

EntityList[Word]

SelectionElement

Represents a single SelectionElement/Checkbox/Clickable Entity within the Document.

This class contains the metadata associated with the SelectionElement entity, including the entity ID, bounding box information, selection status, page number, Page ID and confidence of detection.

class textractor.entities.selection_element.SelectionElement(entity_id: str, bbox: BoundingBox, status: SelectionStatus, confidence: float = 0)

Bases: DocumentEntity

To create a new SelectionElement object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the SelectionElement entity.

  • bbox (BoundingBox) – Bounding box of the SelectionElement

  • status (SelectionStatus) – SelectionStatus.SELECTED / SelectionStatus.NOT_SELECTED

  • confidence (float) – Confidence with which this entity is detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

is_selected() bool
Returns:

Returns True / False depending on selection status of the SelectionElement.

Return type:

bool

property page
Returns:

Returns the page number of the page the SelectionElement entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property words: List[Word]
Returns:

Empty Word list, as a SelectionElement does not have words

Return type:

EntityList[Word]

Query

The Query entity is a document entity representing a query and, when one was found, its result.

This class contains the metadata associated with the Query entity, including the entity ID, query text, alias, query result, result bounding box, page number and Page ID.

class textractor.entities.query.Query(entity_id: str, query: str, alias: str, query_result: Optional[QueryResult], result_bbox: Optional[BoundingBox])

Bases: DocumentEntity

The Query object merges QUERY and QUERY_RESULT blocks. To create a new Query object we require the following:

Parameters:
  • entity_id (str) – Unique identifier of the Query entity.

  • query (str) – Text of the query.

  • alias (str) – Alias of the query.

  • query_result (QueryResult, optional) – QueryResult object that maps to the Query entity, if any.

  • result_bbox (BoundingBox, optional) – Bounding box of the query result, if any.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]

Used for linearization, returns the linearized text of the Query and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property has_result: bool
Returns:

Returns whether there was a result associated with the query

Return type:

bool
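Since a query may have no result, checking has_result before reading the answer avoids errors on a missing result. A minimal sketch with hypothetical stand-ins; QueryStub, QueryResultStub, the result attribute and answers_by_alias are illustrative names, not the textractor classes:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QueryResultStub:
    """Hypothetical stand-in for a query result: the answer and its confidence."""
    answer: str
    confidence: float


@dataclass
class QueryStub:
    """Hypothetical stand-in for a query with an optional result."""
    query: str
    alias: str
    result: Optional[QueryResultStub] = None

    @property
    def has_result(self) -> bool:
        return self.result is not None


def answers_by_alias(queries: List[QueryStub]) -> Dict[str, str]:
    """Map each alias to its answer, skipping queries without a result."""
    return {q.alias: q.result.answer for q in queries if q.has_result}


queries = [
    QueryStub("What is the invoice total?", "TOTAL", QueryResultStub("42.00", 98.5)),
    QueryStub("Who is the vendor?", "VENDOR"),
]
print(answers_by_alias(queries))  # {'TOTAL': '42.00'}
```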

property page: int
Returns:

Returns the page number of the page the Query entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

QueryResult

The QueryResult entity is a document entity representing the answer to a Query.

This class contains the metadata associated with the QueryResult entity, including the entity ID, confidence of detection, bounding box information, answer, page number and Page ID.

class textractor.entities.query_result.QueryResult(entity_id: str, confidence: float, result_bbox: BoundingBox, answer: str)

Bases: DocumentEntity

The QueryResult object represents QUERY_RESULT blocks. To create a new QueryResult object we require the following:

Parameters:
  • entity_id (str) – Unique identifier of the QueryResult entity.

  • confidence (float) – Confidence with which the entity was detected.

  • result_bbox (BoundingBox) – Bounding box of the QueryResult entity.

  • answer (str) – Answer to the query.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]

Used for linearization, returns the linearized text of the QueryResult and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property page: int
Returns:

Returns the page number of the page the QueryResult entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

Signature

Represents a single Signature Entity within the Document. The Textract API response returns signatures as SIGNATURE BlockTypes.

This class contains the metadata associated with the Signature entity, including the entity ID, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.signature.Signature(entity_id: str, bbox: BoundingBox, confidence: float = 0)

Bases: DocumentEntity

To create a new Signature object we need the following:

Parameters:
  • entity_id (str) – Unique identifier of the signature entity.

  • bbox (BoundingBox) – Bounding box of the signature entity.

  • words (list, optional) – List of the Word entities present in the signature

  • confidence (float, optional) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property page
Returns:

Returns the page number of the page the Signature entity is present in.

Return type:

int

property page_id: str
Returns:

Returns the Page ID attribute of the page which the entity belongs to.

Return type:

str

property words
Returns:

Returns an empty list

Return type:

list

ExpenseDocument

The ExpenseDocument class is the object representation of an expense document in an AnalyzeExpense response. Despite its name, it does not inherit from Document; it is a DocumentEntity that holds summary fields and line item groups.

class textractor.entities.expense_document.ExpenseDocument(summary_fields: List[ExpenseField], line_items_groups: List[LineItemGroup], bounding_box: BoundingBox, page: int)

Bases: DocumentEntity

Represents the description of a single expense document.

property bbox
Returns:

Returns entire bounding box of entity

Return type:

BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property line_items_groups: List[LineItemGroup]
property page
property summary_fields_list

class textractor.entities.expense_document.Fields

Bases: dict

Dictionary that holds summary fields. Properties are added dynamically to make the fields easier to discover.
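As a rough illustration of what "dynamically added properties" on a dict can look like, the sketch below exposes dictionary keys as attributes. AttrDict is a hypothetical stand-in written for this example and is not the textractor implementation; the keys and values are made up.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Include the keys so they show up in dir() and in autocompletion.
        return sorted(set(super().__dir__()) | {str(k) for k in self})


fields = AttrDict()
fields["TOTAL"] = ["$12.50"]           # illustrative value, not real Textract output
fields["VENDOR_NAME"] = ["ACME Corp"]  # illustrative value

print(fields.TOTAL)                    # same object as fields["TOTAL"]
print("VENDOR_NAME" in dir(fields))
```

Resolving attributes lazily in `__getattr__` keeps the dictionary as the single source of truth, so keys added later are visible without extra bookkeeping.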

class textractor.entities.expense_document.FieldsGroups

Bases: dict

Summary Fields Group dictionary {GROUP_KEY_NAME: {GROUP_ID_1: [SUMMARY_FIELD1, SUMMARY_FIELD2]}}

get_group_bboxes(key: str)

Returns the enclosing bounding boxes of each group for a given group key.

Parameters:

key – Group key, e.g. VENDOR
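The nested layout of the group dictionary and the idea of one enclosing box per group can be sketched in plain Python. Box, enclosing and group_boxes are hypothetical names for this example, and returning a list of boxes is an assumption; the actual return type of get_group_bboxes is not stated here.

```python
from collections import namedtuple

# Hypothetical stand-in for a bounding box: top-left corner plus size.
Box = namedtuple("Box", "x y width height")


def enclosing(boxes):
    """Smallest box that contains every box in `boxes`."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.width for b in boxes)
    y1 = max(b.y + b.height for b in boxes)
    return Box(x0, y0, x1 - x0, y1 - y0)


# Same shape as the documented structure:
# {GROUP_KEY_NAME: {GROUP_ID: [SUMMARY_FIELD, ...]}}.
# Each summary field is reduced to its box for this sketch.
groups = {
    "VENDOR": {
        "group-1": [Box(10, 10, 40, 10), Box(10, 25, 60, 10)],
        "group-2": [Box(200, 10, 30, 10)],
    }
}


def group_boxes(groups, key):
    """One enclosing box per group id found under `key`."""
    return [enclosing(fields) for fields in groups[key].values()]


print(group_boxes(groups, "VENDOR"))
```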

Expense

class textractor.entities.expense_field.Expense(bbox: BoundingBox, text: str, confidence: float, page: int)

Bases: DocumentEntity

Holds the Key or the Value of an Expense

property geometry
get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the Expense and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property page
property text

Maps to .get_text()

Returns:

Returns the linearized text of the entity

Return type:

str

class textractor.entities.expense_field.ExpenseField(type: ExpenseType, value: Expense, group_properties: List[ExpenseGroupProperty], page: int, label: Optional[Expense] = None, currency=None)

Bases: DocumentEntity

The ExpenseField holds the information for a given summary field: its key, value and type. The bounding box of an ExpenseField is the one that encloses all of its components.

property bbox: BoundingBox
Returns:

Returns entire bounding box of entity

Return type:

BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the ExpenseField and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property group_properties: List[ExpenseGroupProperty]
property key: Expense
property page: int
property type: ExpenseType
property value: Expense

class textractor.entities.expense_field.ExpenseGroupProperty(id: str, types: List[str])

Bases: object

Attached to a given ExpenseField. Records which group the field belongs to and the related types of that group.

id: str
types: List[str]

class textractor.entities.expense_field.ExpenseType(text: str, confidence: float, raw_object: object)

Bases: object

Type of an ExpenseField, e.g. TOTAL or SUBTOTAL

confidence: float
raw_object: object
text: str

class textractor.entities.expense_field.LineItemGroup(index, line_item_rows: List[LineItemRow], page: int)

Bases: DocumentEntity

A LineItemGroup contains several LineItemRow objects. In invoices it often resembles a table, but in receipts the structure can be looser and less aligned.
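To make the "looser than a table" point concrete, the sketch below flattens rows that do not share the same set of field types into CSV using only the standard library. The rows, the field type names and rows_to_csv are illustrative assumptions; this is not a description of how to_csv() is implemented.

```python
import csv
import io

# Illustrative rows: each row maps a field type to its text. Rows need not
# share the same set of types, which is what makes the structure "loose".
rows = [
    {"ITEM": "Coffee", "PRICE": "3.50"},
    {"ITEM": "Bagel", "QUANTITY": "2", "PRICE": "4.00"},
]


def rows_to_csv(rows):
    """Write rows to CSV text, using the union of field types as the header."""
    header = []
    for row in rows:
        for field_type in row:
            if field_type not in header:
                header.append(field_type)  # keep first-seen order
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=header, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # missing types are filled with restval
    return buffer.getvalue()


print(rows_to_csv(rows))
```

Taking the union of field types as the header means a row that lacks a type simply gets an empty cell instead of shifting the remaining columns.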

property bbox
Returns:

Returns entire bounding box of entity

Return type:

BoundingBox

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the LineItemGroup and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property index
property page
property rows
to_csv()
to_json()
to_pandas(include_EXPENSE_ROW=False)

class textractor.entities.expense_field.LineItemRow(index, line_item_expense_fields: List[ExpenseField], page: int)

Bases: DocumentEntity

A LineItemRow contains several ExpenseField objects that all lie inside the row. Unlike table cells, they do not always align into regular columns.

property bbox
Returns:

Returns entire bounding box of entity

Return type:

BoundingBox

property expenses
get(index)
get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the LineItemRow and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property page

IdentityDocument

The IdentityDocument class is the object representation of an AnalyzeID response. It behaves much like a dictionary. Despite its name, it does not inherit from Document, because the AnalyzeID response does not contain position information.

class textractor.entities.identity_document.IdentityDocument(fields=None)

Bases: SpatialObject

Represents the description of a single ID document.

property fields: Dict[str, IdentityField]
get(key: Union[str, AnalyzeIDFields]) → Optional[str]
keys() → List[str]
values() → List[str]
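The dictionary-like access pattern suggested by the get, keys and values signatures can be sketched with stand-in classes. Field and IdDoc are hypothetical names for this example, the sample values are made up, and returning None for a missing key is an assumption inferred from the Optional[str] return type.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for IdentityField(key, value, confidence)."""
    key: str
    value: str
    confidence: float


class IdDoc:
    """Hypothetical stand-in mirroring the documented get/keys/values shape."""

    def __init__(self, fields: Optional[Dict[str, Field]] = None):
        self._fields = fields or {}

    def get(self, key: str) -> Optional[str]:
        # Return the field's text, or None when the key is absent (assumed).
        field = self._fields.get(key)
        return field.value if field is not None else None

    def keys(self) -> List[str]:
        return list(self._fields)

    def values(self) -> List[str]:
        return [f.value for f in self._fields.values()]


doc = IdDoc({
    "FIRST_NAME": Field("FIRST_NAME", "JANE", 98.7),  # made-up sample values
    "LAST_NAME": Field("LAST_NAME", "DOE", 97.2),
})

print(doc.get("FIRST_NAME"))
print(doc.get("MIDDLE_NAME"))
print(doc.keys(), doc.values())
```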

IdentityField

class textractor.entities.identity_field.IdentityField(key, value, confidence)

Bases: object

property confidence: float
property key: str
property value: str

Linearizable

Linearizable is a class that defines how a component can be linearized (converted to text)

class textractor.entities.linearizable.Linearizable

Bases: ABC

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', figure_layout_prefix='', 
figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str

Returns the linearized text of the entity

Parameters:

config – Text linearization config

Returns:

Linearized text of the entity

Return type:

str

abstract get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the entity and the matching words

Returns:

Tuple of text and word list

Return type:

Tuple[str, List[Word]]

property text: str

Maps to .get_text()

Returns:

Returns the linearized text of the entity

Return type:

str
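The relationship between the abstract get_text_and_words, get_text and the text property can be sketched as a small abstract base class. Linear and Sentence are hypothetical names, the config argument is omitted for brevity, and having get_text delegate to get_text_and_words is an assumption; only "text maps to .get_text()" is stated above.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class Linear(ABC):
    """Hypothetical stand-in: one abstract method, two conveniences built on it."""

    @abstractmethod
    def get_text_and_words(self) -> Tuple[str, List[str]]:
        """Subclasses return the linearized text and the words it was built from."""

    def get_text(self) -> str:
        # Assumed delegation: keep the text, discard the word list.
        text, _ = self.get_text_and_words()
        return text

    @property
    def text(self) -> str:
        # Mirrors the documented "Maps to .get_text()".
        return self.get_text()


class Sentence(Linear):
    """Minimal concrete entity: words joined by single spaces."""

    def __init__(self, words: List[str]):
        self.words = words

    def get_text_and_words(self) -> Tuple[str, List[str]]:
        return " ".join(self.words), self.words


s = Sentence(["Total", "due"])
print(s.text)
```

With this layout a new entity type only has to implement one method to gain both the method form and the property form of text access.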

to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False)) → str

Returns the HTML representation of the entity

Returns:

HTML text of the entity

Return type:

str

to_markdown(config: MarkdownLinearizationConfig = MarkdownLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='# ', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=True, table_column_header_threshold=0.9, table_linearization_format='markdown', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='## ', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str

Returns the markdown representation of the entity

Returns:

Markdown text of the entity

Return type:

str
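The default signatures above differ mainly in their prefix and suffix values, for example title_prefix is '' for plain text, '<h1>' for HTML and '# ' for Markdown. The sketch below shows how such settings could drive one rendering routine. Wrapping and render are hypothetical names covering a tiny subset of the options, and the joining logic is an assumption rather than the library's linearization code.

```python
from dataclasses import dataclass


@dataclass
class Wrapping:
    """Hypothetical, tiny subset of a linearization config."""
    title_prefix: str = ""
    title_suffix: str = ""
    text_prefix: str = ""
    text_suffix: str = ""
    layout_element_separator: str = "\n\n"


# Prefix and suffix values copied from the default signatures documented above.
PLAIN = Wrapping()
HTML = Wrapping(title_prefix="<h1>", title_suffix="</h1>",
                text_prefix="<p>", text_suffix="</p>")
MARKDOWN = Wrapping(title_prefix="# ")


def render(title: str, body: str, cfg: Wrapping) -> str:
    """Wrap each element in its configured prefix and suffix, then join them."""
    parts = [
        f"{cfg.title_prefix}{title}{cfg.title_suffix}",
        f"{cfg.text_prefix}{body}{cfg.text_suffix}",
    ]
    return cfg.layout_element_separator.join(parts)


print(render("Invoice", "Total due: $12.50", HTML))
print(render("Invoice", "Total due: $12.50", MARKDOWN))
```

Keeping the output format entirely in configuration values means the traversal of the document stays the same for plain text, HTML and Markdown.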