Document Entities

Document objects contain various entities. The Textract document analysis APIs recognize six document entities: WORD, LINE, KEY_VALUE_SET, SELECTION_ELEMENT, TABLE and CELL.

These are structures that occur in most documents and the package provides classes to programmatically store and access the information produced by Textract for these entities.

BoundingBox

The BoundingBox class contains all the co-ordinate information for a DocumentEntity. It is mainly used to locate the entity on the image of the document page.

class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)

Bases: SpatialObject

Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By default BoundingBox is set to work with denormalized co-ordinates: \(x \in [0, docwidth]\) and \(y \in [0, docheight]\). Use the as_normalized_dict function to obtain BoundingBox with normalized co-ordinates: \(x \in [0, 1]\) and \(y \in [0, 1]\).

Create a BoundingBox as shown below:

  • Directly: bb = BoundingBox(x, y, width, height)

  • From dict: bb = BoundingBox.from_dict(bb_dict) where bb_dict = {'x': x, 'y': y, 'width': width, 'height': height}

Use a BoundingBox as shown below:

  • Directly: print('The top left is: ' + str(bb.x) + ' ' + str(bb.y))

  • Convert to dict: bb_dict = bb.as_dict() returns {'x': x, 'y': y, 'width': width, 'height': height}
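The difference between denormalized and normalized co-ordinates can be pictured with a small standalone sketch. It does not import the package: `SketchBox` and its `normalized` method are hypothetical stand-ins that only mirror the dictionary layout shown above, and the page size is an assumed example value.

```python
from dataclasses import dataclass


@dataclass
class SketchBox:
    """Hypothetical stand-in for a bounding box; not the library class."""
    x: float
    y: float
    width: float
    height: float

    @classmethod
    def from_dict(cls, d):
        # Same key layout as the dictionaries shown above.
        return cls(d["x"], d["y"], d["width"], d["height"])

    def as_dict(self):
        return {"x": self.x, "y": self.y, "width": self.width, "height": self.height}

    def normalized(self, doc_width, doc_height):
        # Scale absolute co-ordinates into the [0, 1] range.
        return SketchBox(self.x / doc_width, self.y / doc_height,
                         self.width / doc_width, self.height / doc_height)


bb = SketchBox.from_dict({"x": 100.0, "y": 50.0, "width": 200.0, "height": 25.0})
norm = bb.normalized(doc_width=1000.0, doc_height=500.0)
print("The top left is:", bb.x, bb.y)
print("Normalized:", norm.as_dict())
```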

property area

Returns the area of the bounding box. Negative bounding boxes are treated as having zero area.

Returns

Bounding box area

Return type

float
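One way to read the zero-area rule is sketched below. This is an illustration only: `sketch_area` is a hypothetical helper, and treating any negative width or height as an empty box is an assumption about what "negative" means here.

```python
def sketch_area(width: float, height: float) -> float:
    # Assumption: a negative width or height marks an empty box.
    if width < 0 or height < 0:
        return 0.0
    return width * height


print(sketch_area(4.0, 2.5))
print(sketch_area(-4.0, 2.5))
```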

as_denormalized_numpy()
Returns

Returns denormalized co-ordinates x, y and dimensions width, height as numpy array.

Return type

numpy.array

classmethod center_is_inside(bbox_a, bbox_b)

Returns true if the center point of Bounding Box A is within Bounding Box B

classmethod enclosing_bbox(bboxes, spatial_object: Optional[SpatialObject] = None)
Parameters
  • bboxes (List[BoundingBox]) – List of bounding boxes.

  • spatial_object (SpatialObject) – Spatial object to be added to the returned bbox.

Returns

classmethod from_denormalized_borders(left: float, top: float, right: float, bottom: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized. If spatial_object is not None, the coordinates will be denormalized according to the spatial object.

Parameters
  • left (float) – ~ [0, doc_width]

  • top (float) – ~ [0, doc_height]

  • right (float) – ~ [0, doc_width]

  • bottom (float) – ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes

Returns

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_corners(x1: float, y1: float, x2: float, y2: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized.

Parameters
  • x1 (float) – Left ~ [0, doc_width]

  • y1 (float) – Top ~ [0, doc_height]

  • x2 (float) – Right ~ [0, doc_width]

  • y2 (float) – Bottom ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage).

Returns

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_dict(bbox_dict: Dict[str, float])

Builds an axis aligned bounding box from a dictionary of {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to the spatial object.

Parameters
  • bbox_dict (dict) – {'x': x, 'y': y, 'width': width, 'height': height} of [0, doc_height] x [0, doc_width]

  • spatial_object – Some object with width and height attributes

Returns

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_denormalized_xywh(x: float, y: float, width: float, height: float, spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned bounding box from top-left, width and height properties. The coordinates are assumed to be denormalized.

Parameters
  • x (float) – Left ~ [0, doc_width]

  • y (float) – Top ~ [0, doc_height]

  • width (float) – Width ~ [0, doc_width]

  • height (float) – Height ~ [0, doc_height]

  • spatial_object (SpatialObject) – Some object with width and height attributes (e.g. Document, ConvertibleImage).

Returns

BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]

classmethod from_normalized_dict(bbox_dict: Dict[str, float], spatial_object: Optional[SpatialObject] = None)

Builds an axis aligned BoundingBox from a dictionary like {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to spatial_object.

Parameters
  • bbox_dict (dict) – Dictionary of normalized co-ordinates.

  • spatial_object (SpatialObject) – Object with width and height attributes.

Returns

Object with denormalized co-ordinates

Return type

BoundingBox
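The denormalization step described for from_normalized_dict amounts to scaling each value by the size of the spatial object. The sketch below shows that arithmetic on plain dictionaries. `PageSize` and `denormalize` are hypothetical names introduced for illustration, not part of the package.

```python
from dataclasses import dataclass


@dataclass
class PageSize:
    """Hypothetical object exposing width and height, like a SpatialObject."""
    width: float
    height: float


def denormalize(bbox_dict, spatial_object):
    # Horizontal values scale with the width, vertical values with the height.
    return {
        "x": bbox_dict["x"] * spatial_object.width,
        "y": bbox_dict["y"] * spatial_object.height,
        "width": bbox_dict["width"] * spatial_object.width,
        "height": bbox_dict["height"] * spatial_object.height,
    }


page = PageSize(width=800.0, height=600.0)
pixels = denormalize({"x": 0.25, "y": 0.5, "width": 0.5, "height": 0.125}, page)
print(pixels)
```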

get_distance(bbox)

Returns the distance between the center point of the bounding box and another bounding box

Returns

Returns the distance as float

Return type

float
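A center-to-center distance can be sketched as follows. The helper names are hypothetical, boxes are plain (x, y, width, height) tuples, and the use of Euclidean distance between the two centers is an assumption, since the description above does not name the metric.

```python
import math


def center(box):
    x, y, width, height = box
    return (x + width / 2, y + height / 2)


def center_distance(box_a, box_b):
    # Assumption: Euclidean distance between the two center points.
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(bx - ax, by - ay)


distance = center_distance((0.0, 0.0, 2.0, 2.0), (3.0, 4.0, 2.0, 2.0))
print(distance)
```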

get_intersection(bbox)

Returns the intersection of this object’s bbox and another BoundingBox.

Returns

A BoundingBox object
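For axis aligned boxes, an intersection can be computed from the overlapping edges, as in this standalone sketch. `intersection` is a hypothetical helper working on (x, y, width, height) tuples, and returning a zero-sized box when the inputs do not overlap is an assumption, not documented behavior.

```python
def intersection(box_a, box_b):
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    # Assumption: non-overlapping boxes collapse to zero width or height.
    return (left, top, max(0.0, right - left), max(0.0, bottom - top))


overlap = intersection((0.0, 0.0, 4.0, 4.0), (2.0, 2.0, 4.0, 4.0))
disjoint = intersection((0.0, 0.0, 1.0, 1.0), (5.0, 5.0, 1.0, 1.0))
print(overlap, disjoint)
```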

classmethod is_inside(bbox_a, bbox_b)

Returns true if Bounding Box A is within Bounding Box B
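The two containment checks, is_inside and center_is_inside, differ in how strict they are. The sketch below contrasts them. `Box` and both functions are hypothetical stand-ins, and treating points on the border as inside is an assumption.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def center_is_inside(a: Box, b: Box) -> bool:
    # Only the center point of A has to fall within B.
    cx, cy = a.x + a.width / 2, a.y + a.height / 2
    return b.x <= cx <= b.x + b.width and b.y <= cy <= b.y + b.height


def is_inside(a: Box, b: Box) -> bool:
    # Every edge of A has to fall within B.
    return (a.x >= b.x and a.y >= b.y
            and a.x + a.width <= b.x + b.width
            and a.y + a.height <= b.y + b.height)


word = Box(8.0, 8.0, 4.0, 4.0)
region = Box(0.0, 0.0, 11.0, 11.0)
print(center_is_inside(word, region), is_inside(word, region))
```

Here the word's center (10, 10) lies in the region, but its right and bottom edges extend past it, so only the looser check passes.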

class textractor.entities.bbox.SpatialObject(width: float, height: float)

Bases: ABC

The SpatialObject interface defines an object that has a width and height. It is mostly used as a reference for BoundingBox so that normalized coordinates can be provided.

DocumentEntity

DocumentEntity is the class that all Document entities such as Word, Line, Table etc. inherit from. This class provides methods useful to all such entities.

class textractor.entities.document_entity.DocumentEntity(entity_id: str, bbox: BoundingBox)

Bases: ABC

An interface for all document entities within the document body, composing the hierarchy of the document object model. The purpose of this class is to define properties common to all document entities i.e. unique id and bounding box.

add_children(children)

Adds children to all entities that have parent-child relationships.

Parameters

children (list) – List of child entities.

property bbox: BoundingBox
Returns

Returns entire bounding box of entity

Return type

BoundingBox

property children
Returns

Returns children of entity

Return type

list

abstract get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → Tuple[str, List]

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

property height: float
Returns

Returns height for bounding box

Return type

float

property raw_object: Dict
Returns

Returns the raw dictionary object that was used to create this Python object

Return type

Dict

remove(entity)

Recursively removes an entity from the child tree of a document entity and updates its bounding box.

Parameters

entity (DocumentEntity) – Entity
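The idea of removing an entity from a child tree and then updating the bounding box can be sketched with a small tree. Everything here is hypothetical: `Node` and `enclosing` are illustrative stand-ins, and recomputing a parent's box as the enclosing box of its remaining children is an assumption about what the update involves.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def enclosing(boxes: List[Box]) -> Box:
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


class Node:
    """Hypothetical tree node; not the library's DocumentEntity."""

    def __init__(self, box: Box, children: Optional[List["Node"]] = None):
        self.box = box
        self.children = children or []

    def remove(self, target: "Node") -> None:
        # Drop the target at this level, then descend into what is left.
        self.children = [c for c in self.children if c is not target]
        for child in self.children:
            child.remove(target)
        # Assumption: the parent's box encloses the remaining children.
        if self.children:
            self.box = enclosing([c.box for c in self.children])


first = Node((0.0, 0.0, 10.0, 5.0))
second = Node((20.0, 0.0, 10.0, 5.0))
line = Node(enclosing([first.box, second.box]), [first, second])
line.remove(second)
print(line.box)
```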

visit(word_set)

visualize(*args, **kwargs) → EntityList

Returns the object’s children in a visualization EntityList object

Returns

Returns an EntityList object

Return type

EntityList

property width: float
Returns

Returns width for bounding box

Return type

float

property x: float
Returns

Returns x coordinate for bounding box

Return type

float

property y: float
Returns

Returns y coordinate for bounding box

Return type

float

Word

Represents a single Word within the Document. This class contains the metadata associated with the Word entity, including the text transcription, text type, bounding box information, page number, Page ID and confidence of detection.

class textractor.entities.word.Word(entity_id: str, bbox: BoundingBox, text: str = '', text_type: TextTypes = TextTypes.PRINTED, confidence: float = 0, is_clickable: bool = False, is_structure: bool = False)

Bases: DocumentEntity

To create a new Word object we need the following:

Parameters
  • entity_id (str) – Unique identifier of the Word entity.

  • bbox (BoundingBox) – Bounding box of the Word entity.

  • text (str) – Transcription of the Word object.

  • text_type (TextTypes) – Enum value stating the type of text stored in the entity. Takes one of two values: PRINTED or HANDWRITING.

  • confidence (float) – Value storing the confidence of detection out of 100.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

property page: int
Returns

Returns the page number of the page the Word entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property text: str
Returns

Returns the text transcription of the Word entity.

Return type

str

property text_type: TextTypes
Returns

Returns the text type of the Word entity.

Return type

TextTypes

property words

Returns itself

Return type

Word

Line

Represents a single Line Entity within the Document. The Textract API response returns groups of words as LINE BlockTypes. They contain Word entities as children.

This class contains the metadata associated with the Line entity, including the entity ID, bounding box information, child words, page number, Page ID and confidence of detection.

class textractor.entities.line.Line(entity_id: str, bbox: BoundingBox, words: Optional[List[Word]] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new Line object we need the following:

Parameters
  • entity_id (str) – Unique identifier of the Line entity.

  • bbox (BoundingBox) – Bounding box of the line entity.

  • words (list, optional) – List of the Word entities present in the line

  • confidence (float, optional) – confidence with which the entity was detected.

get_text_and_words(config)

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) → List[Word]
Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns EntityList of Word entities that match the input text type.

Return type

EntityList[Word]
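Filtering words by text type is a straightforward selection over the child words. The sketch below illustrates this with stand-in types. `TextKind`, `SketchWord` and `words_by_type` are hypothetical names that only mimic the PRINTED / HANDWRITING distinction described above.

```python
from dataclasses import dataclass
from enum import Enum


class TextKind(Enum):
    """Hypothetical stand-in for the TextTypes enum."""
    PRINTED = 0
    HANDWRITING = 1


@dataclass
class SketchWord:
    text: str
    text_type: TextKind


def words_by_type(words, text_type=TextKind.PRINTED):
    # Keep only the words whose type matches the requested one.
    return [w for w in words if w.text_type == text_type]


words = [
    SketchWord("Total", TextKind.PRINTED),
    SketchWord("42", TextKind.HANDWRITING),
    SketchWord("USD", TextKind.PRINTED),
]
print([w.text for w in words_by_type(words, TextKind.HANDWRITING)])
```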

property page
Returns

Returns the page number of the page the Line entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property text
Returns

Returns the text transcription of the Line entity.

Return type

str

property words
Returns

Returns the line’s children

Return type

List[Word]

KeyValue

The KeyValue entity is a document entity representing the Forms output. The key of a KeyValue is typically made up of words, and the value can be Word elements or, in the case of checkboxes, a SelectionElement.

This class contains the metadata associated with the KeyValue entity, including the entity ID, bounding box information, value, existence of a checkbox, page number, Page ID and confidence of detection.

class textractor.entities.key_value.KeyValue(entity_id: str, bbox: BoundingBox, contains_checkbox: bool = False, value: Optional[Value] = None, confidence: float = 0)

Bases: DocumentEntity

To create a new KeyValue object we require the following:

Parameters
  • entity_id (str) – Unique identifier of the KeyValue entity.

  • bbox (BoundingBox) – Bounding box of the KeyValue entity.

  • contains_checkbox (bool) – True/False to indicate if the value is a checkbox.

  • value (Value) – Value object that maps to the KeyValue entity.

  • confidence (float) – confidence with which the entity was detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList[Word]

is_selected() → bool

For KeyValues containing a selection item, returns its is_selected status

Returns

Selection status of a selection item key value pair

Return type

bool

property key
Returns

Returns Line object associated with the key.

Return type

Line

property ocr_confidence

Returns the average OCR confidence.
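As a rough illustration, an average OCR confidence can be computed as the arithmetic mean of per-word confidences. This sketch assumes the average is taken over the words of the key and value, and that an entity without words reports 0. Neither detail is stated above, and `average_confidence` is a hypothetical helper.

```python
from statistics import mean


def average_confidence(confidences):
    # Assumption: no words means a confidence of 0 rather than an error.
    return mean(confidences) if confidences else 0.0


avg = average_confidence([90.0, 80.0, 100.0])
print(avg)
```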

property page: int
Returns

Returns the page number of the page the KeyValue entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property value: Value
Returns

Returns the Value mapped to the key if it has been assigned.

Return type

Value

property words: List[Word]

Returns all the Word objects present in the key and value of the KeyValue object.

Returns

List of Word objects, each representing a word within the KeyValue entity.

Return type

EntityList[Word]

Value

Represents a single Value Entity within the Document. The Textract API response returns groups of words as KEY_VALUE_SET BlockTypes. These may be of KEY or VALUE type which is indicated by the EntityType attribute in the JSON response.

This class contains the metadata associated with the Value entity, including the entity ID, bounding box information, child words, associated key ID, page number, Page ID, confidence of detection and whether it is a checkbox.

class textractor.entities.value.Value(entity_id: str, bbox: BoundingBox, confidence: float = 0)

Bases: DocumentEntity

To create a new Value object we need the following:

Parameters
  • entity_id (str) – Unique identifier of the Value entity.

  • bbox (BoundingBox) – Bounding box of the Value entity.

  • confidence (float) – Value storing the confidence of detection out of 100.

property contains_checkbox: bool

Returns True if the value associated is a SelectionElement.

Returns

Returns True if the value associated is a checkbox/SelectionElement.

Return type

bool

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) → str
get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

get_words_by_type(text_type: str = TextTypes.PRINTED) → List[Word]

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList[Word]

property key_id: str

Returns the associated Key ID for the Value entity.

Returns

Returns the associated KeyValue object ID.

Return type

str

property page
Returns

Returns the page number of the page the Value entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property words: List[Word]
Returns

Returns a list of all words in the entity if any exist; otherwise returns the checkbox status of the Value entity.

Return type

EntityList[Word]

SelectionElement

Represents a single SelectionElement/Checkbox/Clickable Entity within the Document.

This class contains the metadata associated with the SelectionElement entity, including the entity ID, bounding box information, selection status, page number, Page ID and confidence of detection.

class textractor.entities.selection_element.SelectionElement(entity_id: str, bbox: BoundingBox, status: SelectionStatus, confidence: float = 0)

Bases: DocumentEntity

To create a new SelectionElement object we need the following:

Parameters
  • entity_id (str) – Unique identifier of the SelectionElement entity.

  • bbox (BoundingBox) – Bounding box of the SelectionElement

  • status (SelectionStatus) – SelectionStatus.SELECTED / SelectionStatus.NOT_SELECTED

  • confidence (float) – Confidence with which this entity is detected.

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]
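The default configuration in the signature above names two tokens for checkboxes, `selection_element_selected='[X]'` and `selection_element_not_selected='[ ]'`. The sketch below shows how such tokens could stand in for a checkbox in linearized text. `selection_token` is a hypothetical helper, not the library's linearization code.

```python
def selection_token(selected: bool,
                    selected_token: str = "[X]",
                    not_selected_token: str = "[ ]") -> str:
    # Defaults mirror the two selection tokens in the default config.
    return selected_token if selected else not_selected_token


options = [("Email", True), ("Phone", False)]
for label, selected in options:
    print(selection_token(selected), label)
```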

is_selected() → bool
Returns

Returns True / False depending on selection status of the SelectionElement.

Return type

bool

property page
Returns

Returns the page number of the page the SelectionElement entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property words: List[Word]
Returns

Empty Word list, as SelectionElement entities do not have words.

Return type

EntityList[Word]

Table

Represents a Table entity within the document. Tables are hierarchical objects composed of TableCell objects, which implicitly form columns and rows.

A Table object also carries associated metadata: TableCell information, headers, and the page number and page ID of the page on which it appears in the document.

class textractor.entities.table.Table(entity_id, bbox: BoundingBox)

Bases: DocumentEntity

To create a new Table object we need the following:

Parameters
  • entity_id – Unique identifier of the table.

  • bbox – Bounding box of the table.

add_cells(cells: List[TableCell])

Add TableCell objects to the Table. This function does not check the integrity of the table after the cells are added.

Parameters

cells (list) – List of TableCell objects, each representing a single cell within the table. No specific ordering is assumed since it is implicitly ordered by row and column index.
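Because cells are placed by their row and column index, the order in which they are added does not matter. The sketch below rebuilds a grid from unordered cells to illustrate that. Cells are plain (row, column, text) tuples, `build_grid` is a hypothetical helper, and the 1-based indexing is an assumption.

```python
def build_grid(cells):
    # Assumption: row and column indices start at 1.
    rows = max(r for r, _, _ in cells)
    cols = max(c for _, c, _ in cells)
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for r, c, text in cells:
        grid[r - 1][c - 1] = text
    return grid


cells = [(2, 2, "1.50"), (1, 1, "Item"), (2, 1, "Pen"), (1, 2, "Price")]
grid = build_grid(cells)
for row in grid:
    print(row)
```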

property checkboxes: List[SelectionElement]

property column_count

property column_headers: Dict[str, List[TableCell]]
Returns

Returns the column headers of the Table entity.

Return type

Dict[str, List[TableCell]]

property footers
Returns

Returns the table footers.

Return type

List[TableFooter]

get_cells_by_type(cell_type: CellTypes = CellTypes.COLUMN_HEADER)

Returns a dictionary of column_header (str) : List[TableCell] (in order).

Parameters

cell_type (CellTypes) – supports CellTypes.COLUMN_HEADER as of now, will support SECTION_TITLE, FLOATING_TITLE, FLOATING_FOOTER, SUMMARY_CELL in the future.

Returns

{column_header (str) : List[TableCell]}

Return type

Dict[str, List[TableCell]]

get_columns_by_name(column_names, similarity_metric=SimilarityMetric.COSINE, similarity_threshold=0.6)

Returns a dictionary of format {column_name : List[TableCell]} for the column names listed in param column_names.

Parameters
  • column_names (list) – List of column names of columns to be extracted from table.

  • similarity_metric (str) – ‘cosine’, ‘euclidean’ or ‘levenshtein’. ‘cosine’ is chosen as default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns

Returns a new Table consisting of columns passed in column_names.

Return type

Table
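Matching requested column names against table headers with a similarity threshold can be sketched as below. Note the substitution: this example uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, not the cosine, euclidean or levenshtein metrics listed above, and `match_columns` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Case-insensitive ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_columns(headers, column_names, threshold=0.6):
    # Keep every header that is close enough to any requested name.
    return [
        header for header in headers
        if any(similarity(name, header) >= threshold for name in column_names)
    ]


matched = match_columns(["Date", "Amount Due"], ["amount"])
print(matched)
```

Raising the threshold makes matching stricter, so fewer near-miss headers are returned.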

get_table_range()
Returns

Returns the number of rows and columns in the table.

Return type

Tuple(int)

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization, returns the linearized text of the entity and the matching words

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

get_words_by_type(text_type=TextTypes.PRINTED)

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList[Word]

property page
Returns

Returns the page number of the page the Table entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property row_count
strip_headers(column_headers: bool = True, in_table_title: bool = False, section_titles=False)

Returns a new Table object after removing all cells that are marked as column headers in the table from the API response.

Parameters
  • column_headers (bool) – Remove the column headers

  • in_table_title (bool) – Remove the in-table titles

  • section_titles (bool) – Remove the in-table section titles

Returns

Table object after removing the headers.

Return type

Table
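
The filtering that strip_headers describes can be pictured with a small sketch. It uses a stand-in Cell dataclass rather than the library's TableCell, and the strip_column_headers helper is hypothetical; it only mirrors the effect of the column_headers flag and makes no claim about how the library implements it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """Stand-in for a table cell; not the library's TableCell."""
    text: str
    row_index: int
    col_index: int
    is_column_header: bool = False


def strip_column_headers(cells: List[Cell]) -> List[Cell]:
    """Return a new list without the cells flagged as column headers."""
    return [cell for cell in cells if not cell.is_column_header]


cells = [
    Cell("Item", 1, 1, is_column_header=True),
    Cell("Price", 1, 2, is_column_header=True),
    Cell("Tea", 2, 1),
    Cell("3.50", 2, 2),
]
body = strip_column_headers(cells)
print([cell.text for cell in body])
```

The input list is left untouched, which matches the documented behaviour of returning a new object instead of modifying the existing one.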

property table_type
Returns

Returns the table type.

Return type

TableTypes

property title
Returns

Returns the table title.

Return type

TableTitle

to_csv() str

Returns the table in the Comma-Separated-Value (CSV) format

Returns

Table as a CSV string.

Return type

str
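
To make the CSV string form concrete, here is a minimal sketch built on the standard csv module. The rows_to_csv helper and the sample rows are assumptions for illustration; the source does not specify the quoting or line endings that to_csv itself uses.

```python
import csv
import io
from typing import List


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize a grid of cell texts to a CSV string."""
    buffer = io.StringIO()
    # Minimal quoting: only fields containing the delimiter get quoted.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["Item", "Price"], ["Tea, green", "3.50"]]
csv_text = rows_to_csv(rows)
print(csv_text)
```

A flat grid like this has no way to express a cell that spans several rows or columns.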

to_excel(filepath=None, workbook=None, save_workbook=True)

Exports the Table entity as an Excel document. The advantage of Excel over CSV is that it can accommodate the merged cells that often appear in Textract documents.

Parameters
  • filepath (str) – Path to store the exported Excel file

  • workbook (xlsxwriter.Workbook) – If an xlsxwriter workbook is passed to the function, the table is appended to the last sheet of that workbook.

  • save_workbook (bool) – Flag to save the workbook. If False, the workbook is returned by the function instead of being saved.

Returns

Returns a workbook if save_workbook is False. Otherwise, saves the .xlsx file at the given filepath.

Return type

xlsxwriter.Workbook

to_pandas(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Converts the table to a pandas DataFrame

Parameters
  • use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.

  • config – Text linearization configuration object for the table content

Returns

to_txt()
property words

Returns all the Word objects present in the Table.

Returns

List of Word objects, each representing a word within the Table.

Return type

EntityList[Word]

TableCell

Represents a single TableCell object. A TableCell object contains information such as:

  • The position info of the cell within the encompassing Table

  • Properties such as merged-cells span

  • A hierarchy of words contained within the TableCell (optional)

  • Page information

  • Confidence of entity detection.

class textractor.entities.table_cell.TableCell(entity_id: str, bbox: BoundingBox, row_index: int, col_index: int, row_span: int, col_span: int, confidence: float = 0, is_column_header: bool = False, is_title: bool = False, is_footer: bool = False, is_summary: bool = False, is_section_title: bool = False)

Bases: DocumentEntity

To create a new TableCell object we need the following:

Parameters
  • entity_id – Unique id of the TableCell object

  • bbox – Bounding box of the entity

  • row_index – Row index of position of cell within the table

  • col_index – Column index of position of cell within the table

  • row_span – Number of rows the cell spans (1 means no merged cells)

  • col_span – Number of columns the cell spans (1 means no merged cells)

  • confidence – Confidence out of 100 with which the Cell was detected.

  • is_column_header – Indicates if the cell is a column header

  • is_title – Indicates if the cell is a table title

  • is_footer – Indicates if the cell is a table footer

  • is_summary – Indicates if the cell is a summary cell

  • is_section_title – Indicates if the cell is a section title
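
How the index and span parameters combine can be sketched as follows. The covered_positions helper is hypothetical, and the sketch assumes that row_span counts rows and col_span counts columns, as the span fields do in the Textract API response.

```python
from typing import List, Tuple


def covered_positions(row_index: int, col_index: int,
                      row_span: int, col_span: int) -> List[Tuple[int, int]]:
    """List every (row, column) grid position a cell occupies."""
    return [
        (row, col)
        for row in range(row_index, row_index + row_span)
        for col in range(col_index, col_index + col_span)
    ]


# A regular cell occupies exactly one position.
print(covered_positions(2, 1, 1, 1))
# A merged cell spanning 2 rows and 3 columns occupies 6 positions.
print(covered_positions(1, 1, 2, 3))
```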

property checkboxes
property col_index
Returns

Returns the column index of the cell in the Table.

Return type

int

property col_span
Returns

Returns the column span of the cell in the Table.

Return type

int

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str

Return the text in the cell as one space-separated string

Returns

Text in the cell

Return type

str

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]

Returns the text in the cell as one space-separated string

Returns

Text in the cell

Return type

Tuple[str, List]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList

property is_column_header
property is_section_title
property is_summary
property is_title
merge_direction()
Returns

Determines whether the merged cell is a row merge or a column merge. Returns 0 for a row merge, 1 for a column merge, 2 for both, and None if there is no merge.

Return type

int, str

property page
Returns

Returns the page number of the page the TableCell entity is present in.

Return type

int

property page_id: str
Returns

Returns the Page ID attribute of the page which the entity belongs to.

Return type

str

property row_index
Returns

Returns the row index of the cell in the Table.

Return type

int

property row_span
Returns

Returns the row span of the cell in the Table.

Return type

int

property table_id
Returns

Returns the ID of the Table the TableCell belongs to.

Return type

str

property text: str

Returns the text in the cell as one space-separated string

Returns

Text in the cell

Return type

str

property words

Returns all the Word objects present in the TableCell.

Returns

List of Word objects, each representing a word within the TableCell.

Return type

list

Page

Represents a single Document page, as it would appear in the Textract API output. The Page object also contains metadata such as the physical dimensions of the page (width and height, in pixels), child_ids, etc.

class textractor.entities.page.Page(id: str, width: int, height: int, page_num: int = -1, child_ids=None)

Bases: SpatialObject

Creates a new Page, ideally representing a single item in the dataset.

Parameters
  • id (str) – Unique id of the Page

  • width (float) – Width of page, in pixels

  • height (float) – Height of page, in pixels

  • page_num (int) – Page number in the document linked to this Page object

  • child_ids (List) – IDs of child entities in the Page as determined by Textract

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElement present in the Page.

Returns

List of KeyValue objects, each representing a checkbox within the Page.

Return type

EntityList[KeyValue]

property container_layouts: EntityList[Layout]

Returns all the container Layout objects present in the Page.

Returns

List of Layout objects.

Return type

EntityList

directional_finder(word_1: str = '', word_2: str = '', prefix: str = '', direction=Direction.BELOW, entities=[])

Returns the entity types listed in entities, with the user-provided prefix prepended to their keys. This helps with repeating key-values and checkboxes. The user can manipulate the original data or produce a copy. The main advantage of this function is the ability to specify a direction.

Parameters
  • word_1 (str, required) – The reference word from where x1, y1 coordinates are derived

  • word_2 (str, optional) – The second word, preferably in the direction indicated by the parameter direction. When it isn’t given, the end-of-page coordinates are used in the given direction.

  • prefix (str, optional) – User-provided prefix to prepend to the key. Without a prefix, the method acts as a search-by-geometry function

  • entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns

Returns the EntityList of modified key-value and/or checkboxes

Return type

EntityList

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Page.

Returns

List of ExpenseDocument objects.

Return type

EntityList

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv')

Export key-value entities and checkboxes in csv format.

Parameters
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.

Parameters

filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) EntityList[KeyValue]

Return a list of KeyValue objects containing checkboxes if the page contains them.

Parameters
  • selected (bool) – True/False Return SELECTED checkboxes

  • not_selected (bool) – True/False Return NOT_SELECTED checkboxes

Returns

Returns checkboxes that match the conditions set by the flags.

Return type

EntityList[KeyValue]
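
The way the two flags combine can be sketched with stand-in objects. The Checkbox dataclass and the filter_by_status helper below are hypothetical and are not part of the library; they only illustrate the flag semantics described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Checkbox:
    """Stand-in for a KeyValue holding a selection element."""
    key: str
    is_selected: bool


def filter_by_status(boxes: List[Checkbox], selected: bool = True,
                     not_selected: bool = True) -> List[Checkbox]:
    """Keep checkboxes whose status is enabled by the matching flag."""
    return [
        box for box in boxes
        if (selected and box.is_selected)
        or (not_selected and not box.is_selected)
    ]


boxes = [Checkbox("Married", True), Checkbox("Single", False)]
print([box.key for box in filter_by_status(boxes, not_selected=False)])
```

With both flags left at True every checkbox is kept, and with both set to False nothing is.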

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[KeyValue]

Returns up to top_k_matches key-value pairs for the key that is queried from the page.

Parameters
  • key (str) – Query key to match

  • top_k_matches (int) – Maximum number of matches to return

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.

  • similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type

EntityList[KeyValue]
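
The interplay of top_k_matches and similarity_threshold can be sketched with a plain-Python Levenshtein similarity. Everything here is an assumption for illustration: the helper names, the normalisation by the longer string, the lower-casing, and the strict comparison against the threshold may all differ from what the library does.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def top_matches(query: str, keys: List[str], top_k: int = 1,
                threshold: float = 0.6) -> List[Tuple[str, float]]:
    """Return up to top_k keys scoring above threshold, best first."""
    scored = [(key, similarity(query.lower(), key.lower())) for key in keys]
    kept = [pair for pair in scored if pair[1] > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


keys = ["Date of Birth", "Date of Issue", "Full Name"]
print(top_matches("date of birth", keys, top_k=2))
```

Raising the threshold trades recall for precision: near-miss keys drop out first, and an exact match always survives.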

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str

Returns the page text

Parameters

config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()

Returns

Linearized page text

Return type

str

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List[Word]]

Returns the page text and words sorted in reading order

Parameters

config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()

Returns

Tuple of page text and words

Return type

Tuple[str, List[Word]]

get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) EntityList[Word]

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList[Word]

independent_words() EntityList[Word]
Returns

Return all words in the document, outside of tables, checkboxes, key-values.

Return type

EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Page.

Returns

List of KeyValue objects, each representing a key-value pair within the Page.

Return type

EntityList[KeyValue]

keys(include_checkboxes: bool = True) List[str]

Prints all keys for key-value pairs and checkboxes if the page contains them.

Parameters

include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.

Returns

List of strings containing key names in the Page

Return type

List[str]

property layouts: EntityList[Layout]

Returns all the Layout objects present in the Page.

Returns

List of Layout objects.

Return type

EntityList

property leaf_layouts: EntityList[Layout]

Returns all the leaf Layout objects present in the Page.

Returns

List of Layout objects.

Return type

EntityList

property lines: EntityList[Line]

Returns all the Line objects present in the Page.

Returns

List of Line objects, each representing a line within the Page.

Return type

EntityList[Line]

property page_layout: PageLayout
property queries: EntityList[Query]

Returns all the Query objects present in the Page.

Returns

List of Query objects.

Return type

EntityList

return_duplicates()

Returns a list of EntityList objects. Each EntityList contains the key-values followed, as its last item, by the table that contains the duplicate information. This function is intended to alert the Textract user to duplicate objects extracted by the various Textract models.

Returns

List of EntityList objects each containing the intersection of KeyValue and Table entities on the page.

Return type

List[EntityList]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: int = 0.6) EntityList[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters
  • keyword (str) – Keyword that is used to query the page.

  • top_k (int) – Number of closest line objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.

  • similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6

Returns

Returns a list of lines that contain the queried key sorted from highest to lowest similarity.

Return type

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[Word]

Return a list of top_k words that match the keyword.

Parameters
  • keyword (str, required) – Keyword that is used to query the document.

  • top_k (int, optional) – Number of closest word objects to be returned. default=1

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns

Returns a list of words that match the queried key sorted from highest to lowest similarity.

Return type

EntityList[Word]

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Page.

Returns

List of Signature objects.

Return type

EntityList

property tables: EntityList[Table]

Returns all the Table objects present in the Page.

Returns

List of Table objects, each representing a table within the Page.

Return type

EntityList

property text: str

Returns the page text

Returns

Linearized page text

Return type

str

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns

Returns an EntityList object

Return type

EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Page.

Returns

List of Word objects, each representing a word within the Page.

Return type

EntityList[Word]

PageLayout

class textractor.entities.page_layout.PageLayout(titles: EntityList[Layout] = [], headers: EntityList[Layout] = [], footers: EntityList[Layout] = [], section_headers: EntityList[Layout] = [], page_numbers: EntityList[Layout] = [], lists: EntityList[Layout] = [], figures: EntityList[Layout] = [], tables: EntityList[Layout] = [], key_values: EntityList[Layout] = [])

Bases: object

Object representation of the layout components detected in the page.

property figures: EntityList[Layout]

Figures detected in the Page

Returns

EntityList of figures detected in the page

Return type

EntityList[Layout]

property footers: EntityList[Layout]

Footers detected in the Page

Returns

EntityList of footers detected in the page

Return type

EntityList[Layout]

property headers: EntityList[Layout]

Headers detected in the Page

Returns

EntityList of headers detected in the page

Return type

EntityList[Layout]

property key_values: EntityList[Layout]

KeyValues detected in the Page

Returns

EntityList of keyvalues detected in the page

Return type

EntityList[Layout]

property lists: EntityList[Layout]

Lists detected in the Page

Returns

EntityList of lists detected in the page

Return type

EntityList[Layout]

property page_numbers: EntityList[Layout]

Page numbers detected in the Page

Returns

EntityList of page numbers detected in the page

Return type

EntityList[Layout]

property section_headers: EntityList[Layout]

Section headers detected in the Page

Returns

EntityList of section headers detected in the page

Return type

EntityList[Layout]

property tables: EntityList[Layout]

Tables detected in the Page. This includes Tables detected by the AnalyzeDocument Tables API if used.

Returns

EntityList of tables detected in the page

Return type

EntityList[Layout]

property titles: EntityList[Layout]

Titles detected in the Page

Returns

EntityList of titles detected in the page

Return type

EntityList[Layout]

Document

The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.

class textractor.entities.document.Document(num_pages: int = 1)

Bases: SpatialObject

Represents the description of a single document, as it would appear in the input to the Textract API. Document serves as the root node of the object model hierarchy, which should be used as an intermediate form for most analytic purposes. The Document node also contains the metadata of the document.

property checkboxes: EntityList[KeyValue]

Returns all the KeyValue objects with SelectionElements present in the Document.

Returns

List of KeyValue objects, each representing a checkbox within the Document.

Return type

EntityList[KeyValue]

directional_finder(word_1: str = '', word_2: str = '', page: int = -1, prefix: str = '', direction=Direction.BELOW, entities=[])

Returns the entity types listed in entities, with the user-provided prefix prepended to their keys. This helps with repeating key-values and checkboxes. The user can manipulate the original data or produce a copy. The main advantage of this function is the ability to specify a direction.

Parameters
  • word_1 (str, required) – The reference word from where x1, y1 coordinates are derived

  • word_2 (str, optional) – The second word, preferably in the direction indicated by the parameter direction. When it isn’t given, the end-of-page coordinates are used in the given direction.

  • page (int, required) – page number of the page in the document to search the entities in.

  • prefix (str, optional) – User-provided prefix to prepend to the key. Without a prefix, the method acts as a search-by-geometry function

  • entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.

Returns

Returns the EntityList of modified key-value and/or checkboxes

Return type

EntityList

property expense_documents: EntityList[ExpenseDocument]

Returns all the ExpenseDocument objects present in the Document.

Returns

List of ExpenseDocument objects, each representing an expense document within the Document.

Return type

EntityList[ExpenseDocument]

export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv')

Export key-value entities and checkboxes in csv format.

Parameters
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')

Export key-value entities and checkboxes in txt format.

Parameters
  • include_kv (bool) – True if KVs are to be exported. Else False.

  • include_checkboxes (bool) – True if checkboxes are to be exported. Else False.

  • filepath (str) – Path to where file is to be stored.

export_tables_to_excel(filepath)

Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.

Parameters

filepath (str, required) – Path to store the exported Excel file.

filter_checkboxes(selected: bool = True, not_selected: bool = True) List[KeyValue]

Return a list of KeyValue objects containing checkboxes if the document contains them.

Parameters
  • selected (bool) – True/False Return SELECTED checkboxes

  • not_selected (bool) – True/False Return NOT_SELECTED checkboxes

Returns

Returns checkboxes that match the conditions set by the flags.

Return type

EntityList[KeyValue]

get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6)

Returns up to top_k_matches key-value pairs for the key that is queried from the document.

Parameters
  • key (str) – Query key to match

  • top_k_matches (int) – Maximum number of matches to return

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.

  • similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6

Returns

Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.

Return type

EntityList[KeyValue]

get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str
get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]

Returns list of Word entities that match the input text type.

Parameters

text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING

Returns

Returns list of Word entities that match the input text type.

Return type

EntityList[Word]
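
The filtering described above can be pictured with a minimal, library-free sketch. The TextTypes enum and Word record below are simplified stand-ins written for this illustration (they are not the textractor classes), and the free function get_words_by_type is a hypothetical helper that takes the word list explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TextTypes(Enum):
    # Stand-in for textractor's TextTypes; only the two members named above.
    PRINTED = 0
    HANDWRITING = 1


@dataclass
class Word:
    # Hypothetical minimal word record, not the library's Word class.
    text: str
    text_type: TextTypes


def get_words_by_type(
    words: List[Word], text_type: TextTypes = TextTypes.PRINTED
) -> List[Word]:
    """Keep only the words whose text_type equals the requested one."""
    return [word for word in words if word.text_type == text_type]


words = [
    Word("Invoice", TextTypes.PRINTED),
    Word("approved", TextTypes.HANDWRITING),
    Word("Total", TextTypes.PRINTED),
]
printed = get_words_by_type(words)
handwritten = get_words_by_type(words, TextTypes.HANDWRITING)
```

With a real Document the equivalent call would presumably be document.get_words_by_type(TextTypes.HANDWRITING), using the signature listed above.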

property identity_document: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns

List of IdentityDocument objects.

Return type

EntityList[IdentityDocument]

property identity_documents: EntityList[IdentityDocument]

Returns all the IdentityDocument objects present in the Document.

Returns

List of IdentityDocument objects, each representing an identity document within the Document.

Return type

EntityList[IdentityDocument]

property images: List[Image]

Returns all the page images in the Document.

Returns

List of PIL Image objects.

Return type

List[Image]

independent_words()
Returns

All words in the document that lie outside tables, checkboxes, and key-value pairs.

Return type

EntityList[Word]

property key_values: EntityList[KeyValue]

Returns all the KeyValue objects present in the Document.

Returns

List of KeyValue objects, each representing a key-value pair within the Document.

Return type

EntityList[KeyValue]

keys(include_checkboxes: bool = True) List[str]

Returns all keys for key-value pairs and, optionally, checkboxes if the document contains them.

Parameters

include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.

Returns

List of strings containing key names in the Document

Return type

List[str]
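
As a rough illustration of the include_checkboxes switch, the sketch below collects key names from two plain lists. The function collect_keys and its tuple-based inputs are hypothetical and exist only for this example; the ordering (key-value keys first, checkbox keys after) is an assumption, not documented behaviour.

```python
from typing import List, Tuple


def collect_keys(
    key_values: List[Tuple[str, str]],
    checkboxes: List[Tuple[str, bool]],
    include_checkboxes: bool = True,
) -> List[str]:
    """Gather key names, optionally followed by checkbox keys."""
    keys = [key for key, _ in key_values]
    if include_checkboxes:
        # Assumed ordering: checkbox keys are appended after key-value keys.
        keys += [key for key, _ in checkboxes]
    return keys


key_values = [("Name", "Jane Doe"), ("Date", "2023-01-01")]
checkboxes = [("Subscribe", True)]

all_keys = collect_keys(key_values, checkboxes)
text_keys = collect_keys(key_values, checkboxes, include_checkboxes=False)
```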

property lines: EntityList[Line]

Returns all the Line objects present in the Document.

Returns

List of Line objects, each representing a line within the Document.

Return type

EntityList[Line]

classmethod open(fp: Union[dict, str, IO])

Create a Document object from a JSON file path, file handle or response dictionary

Parameters

fp (Union[dict, str, IO[AnyStr]]) – JSON file path, file handle, or response dictionary

Raises

InputError – Raised when the input is not of type Union[dict, str, IO[AnyStr]]

Returns

Document object

Return type

Document
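
The three accepted input kinds suggest a dispatch on the argument type. The sketch below shows one way such a dispatch could look using only the standard library; load_response is a hypothetical helper, it returns the parsed response dictionary rather than a Document, and it raises TypeError as a stand-in for the library's InputError.

```python
import io
import json
import os
import tempfile
from typing import IO, Union


def load_response(fp: Union[dict, str, IO]) -> dict:
    """Return the response dictionary for any of the three input kinds."""
    if isinstance(fp, dict):
        # Already a response dictionary: use it as-is.
        return fp
    if isinstance(fp, str):
        # Treated here as a path to a local JSON file.
        with open(fp, "r", encoding="utf-8") as handle:
            return json.load(handle)
    if hasattr(fp, "read"):
        # Any readable file handle.
        return json.load(fp)
    # Stand-in for InputError in this sketch.
    raise TypeError(f"Unsupported input type: {type(fp).__name__}")


response = {"Blocks": []}
from_dict = load_response(response)
from_handle = load_response(io.StringIO(json.dumps(response)))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "response.json")
    with open(path, "w", encoding="utf-8") as out:
        json.dump(response, out)
    from_path = load_response(path)
```

With the library itself, the corresponding calls would presumably be Document.open(response), Document.open(path) or Document.open(handle).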

page(page_no: int = 0)

Returns one or more Page objects depending on the input page_no. Page numbers are zero-indexed.

Parameters

page_no (int if single page, list of int if multiple pages) – If an int, a single Page object is returned; if a list, a list of Page objects is returned.

Returns

Filters and returns Page objects depending on the input page_no

Return type

Page or List[Page]
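
The int-or-list behaviour can be sketched as follows. select_pages is a hypothetical helper operating on a plain sequence, and the TypeError for other argument types is an assumption of this sketch rather than documented behaviour.

```python
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


def select_pages(
    pages: Sequence[T], page_no: Union[int, List[int]] = 0
) -> Union[T, List[T]]:
    """Return one page for an int index, or a list of pages for a list."""
    if isinstance(page_no, int):
        return pages[page_no]
    if isinstance(page_no, list):
        return [pages[index] for index in page_no]
    # Assumed error handling for this sketch only.
    raise TypeError("page_no must be an int or a list of int")


pages = ["page-0", "page-1", "page-2"]  # placeholders for Page objects
first = select_pages(pages)             # zero-indexed: the first page
subset = select_pages(pages, [0, 2])    # first and third pages
```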

property pages: List[Page]

Returns all the Page objects present in the Document.

Returns

List of Page objects, each representing a Page within the Document.

Return type

List[Page]

property queries: EntityList[Query]

Returns all the Query objects present in the Document.

Returns

List of Query objects.

Return type

EntityList[Query]

return_duplicates()

Returns a dictionary with page numbers as keys and lists of EntityList objects as values. Each EntityList contains the key-values that carry duplicate information, followed by the table that duplicates them as its last item. This function is intended to inform the Textract user of duplicate objects extracted by the various Textract models.

Returns

Dictionary containing page numbers as keys and list of EntityList objects as values.

Return type

Dict[page_num, List[EntityList[DocumentEntity]]]

search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Line]

Return a list of top_k lines that contain the queried keyword.

Parameters
  • keyword (str) – Keyword that is used to query the document.

  • top_k (int) – Number of closest line objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar a document line must be to the queried keyword. default=0.6

Returns

Returns a list of lines that contain the queried keyword, sorted from highest to lowest similarity.

Return type

EntityList[Line]

search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Word]

Return a list of top_k words that match the keyword.

Parameters
  • keyword (str) – Keyword that is used to query the document.

  • top_k (int) – Number of closest word objects to be returned

  • similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is the default.

  • similarity_threshold (float) – Measure of how similar a document word must be to the queried keyword. default=0.6

Returns

Returns a list of words that match the queried keyword, sorted from highest to lowest similarity.

Return type

EntityList[Word]
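
To make the interplay of top_k and similarity_threshold concrete, here is a minimal sketch of a Levenshtein-based search over plain strings. It is an illustration only: the normalisation of edit distance to a score in [0, 1], the case-insensitive comparison, and the helper names levenshtein, similarity and search are all assumptions of this example, and the library's own scoring may differ.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def search(
    candidates: List[str],
    keyword: str,
    top_k: int = 1,
    similarity_threshold: float = 0.6,
) -> List[Tuple[str, float]]:
    """Score every candidate, drop weak matches, return the best top_k."""
    scored = [(text, similarity(keyword.lower(), text.lower())) for text in candidates]
    kept = [pair for pair in scored if pair[1] >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


words = ["Total", "Totals", "Date", "Subtotal"]
matches = search(words, "total", top_k=2)
```

Raising similarity_threshold narrows the candidate pool before top_k is applied, so a high threshold can return fewer than top_k results.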

property signatures: EntityList[Signature]

Returns all the Signature objects present in the Document.

Returns

List of Signature objects.

Return type

EntityList[Signature]

property tables: EntityList[Table]

Returns all the Table objects present in the Document.

Returns

List of Table objects, each representing a table within the Document.

Return type

EntityList[Table]

property text: str

Returns the document text as one string

Returns

Text of each page, separated by line breaks

Return type

str

to_trp2()

Parses the response to the trp2 format for backward compatibility

Returns

TDocument object that can be used with the older Textractor libraries

Return type

TDocument

visualize(*args, **kwargs)

Returns the object’s children in a visualization EntityList object

Returns

Returns an EntityList object

Return type

EntityList

property words: EntityList[Word]

Returns all the Word objects present in the Document.

Returns

List of Word objects, each representing a word within the Document.

Return type

EntityList[Word]