Entity Visualization

Most features that return DocumentEntity objects return them as an EntityList. EntityList extends the built-in list data type to provide visualization features for these entities.

EntityList

EntityList extends the list type with custom functions that print document entities in a well-formatted manner and visualize them on top of the document page using their BoundingBox information.

The two main functions within this class are pretty_print() and visualize(). Use pretty_print() to get a formatted string representation of your custom list of entities. Use visualize() to draw the bounding boxes of the entities on the document page images.

class textractor.visualizers.entitylist.EntityList(objs=None)

Bases: list, Generic[T], Linearizable

Creates a list type object, initially empty but extended with the list passed in objs.

Parameters

objs (list) – Custom list of objects that can be visualized with this class.
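The constructor behaviour described above can be pictured with a plain list subclass. This is a minimal sketch, not textractor's source: `SketchEntityList` is a hypothetical stand-in that only shows the "start empty, then extend with objs" pattern and that the result is still an ordinary list.

```python
from typing import Generic, Iterable, Optional, TypeVar

T = TypeVar("T")


class SketchEntityList(list, Generic[T]):
    """Hypothetical stand-in for illustration; not textractor's implementation."""

    def __init__(self, objs: Optional[Iterable[T]] = None):
        super().__init__()      # the list starts out empty
        if objs is not None:
            self.extend(objs)   # then takes on the items passed in objs


empty = SketchEntityList()
words = SketchEntityList(["Invoice", "Total"])

# Because the class derives from list, the usual list operations still apply.
first_word = words[0]
word_count = len(words)
```

Subclassing list in this way means indexing, slicing, iteration and len() keep working unchanged, while extra methods can be layered on top.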

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', figure_layout_prefix='', figure_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization. Returns the linearized text of the entities in the list and the matching words.

Returns

Tuple of text and word list

Return type

Tuple[str, List[Word]]

pretty_print(table_format: TableFormat = TableFormat.GITHUB, with_confidence: bool = False, with_geo: bool = False, with_page_number: bool = False, trim: bool = False) → str

Returns a formatted string output for each of the entities in the list according to its entity type.

Parameters
  • table_format (TableFormat) – Choose one of the defined TableFormat types to format the table output string. The choices are a predefined set taken from the PyPI tabulate package. It is used only if the list contains KeyValues or Tables from textractor.entities.

  • with_confidence (bool) – Flag to add the confidence of prediction to the entity string. Default: False.

  • with_geo (bool) – Flag to add the bounding box information to the entity string. Default: False.

  • with_page_number (bool) – Flag to add the page number to the entity string. Default: False.

  • trim (bool) – Flag to trim text in the entity string. Default: False.

Returns

A formatted string output for each of the entities in the list, according to its entity type.

Return type

str
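The way the boolean flags add detail to each entity string can be pictured with a small stand-in. Everything below is hypothetical and written only for illustration: `SketchWord`, its field names, `sketch_pretty_print` and the output layout are assumptions, not textractor's classes or its actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchWord:
    """Hypothetical entity; the field names are assumptions for this example."""
    text: str
    confidence: float  # assumed here to be a value between 0 and 1
    page: int


def sketch_pretty_print(
    entities: List[SketchWord],
    with_confidence: bool = False,
    with_page_number: bool = False,
) -> str:
    """Build one line per entity, adding a detail only when its flag is set."""
    lines = []
    for entity in entities:
        parts = [entity.text]
        if with_confidence:
            parts.append(f"confidence={entity.confidence:.2f}")
        if with_page_number:
            parts.append(f"page={entity.page}")
        lines.append(" | ".join(parts))
    return "\n".join(lines)


words = [SketchWord("Total", 0.987, 1), SketchWord("42.00", 0.912, 1)]
plain = sketch_pretty_print(words)
detailed = sketch_pretty_print(words, with_confidence=True, with_page_number=True)
```

With every flag left at its default of False, the sketch returns only the entity text; each flag that is switched on appends one more field to every line.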

visualize(with_text: bool = True, with_words: bool = True, with_confidence: bool = False, font_size_ratio: float = 0.5) → List

Returns a list of PIL Images with bounding boxes drawn around the entities in the list.

Parameters
  • with_text (bool) – Flag to print the OCR output of Textract on top of the text bounding box.

  • with_confidence (bool) – Flag to print the confidence of prediction on top of the entity bounding box.

Returns

A list of PIL Images with bounding boxes drawn around the entities in the list.

Return type

list
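Drawing a bounding box on a page image requires pixel coordinates. The sketch below shows one way that conversion can be done, assuming the box is stored as ratios of the page width and height (as Amazon Textract reports its bounding boxes). `to_pixel_box` and its parameter names are hypothetical helpers for this example and are not part of textractor.

```python
from typing import Tuple


def to_pixel_box(
    x: float,
    y: float,
    width: float,
    height: float,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized box (ratios of page size) to pixel corners.

    Returns (left, top, right, bottom) in pixels.
    """
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + width) * image_width)
    bottom = round((y + height) * image_height)
    return left, top, right, bottom


# A box starting a quarter of the way across and halfway down a 1000x800 page.
box = to_pixel_box(0.25, 0.5, 0.5, 0.125, image_width=1000, image_height=800)
# box is (250, 400, 750, 500)
```

The (left, top, right, bottom) ordering is the one accepted by Pillow's ImageDraw.rectangle, so a tuple like this could be passed to it directly when drawing on a page image.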