Entity Visualization

Most features that return DocumentEntity objects return them in an EntityList. EntityList extends the built-in list data type to provide visualization features for these entities.

EntityList

EntityList extends the list type with custom functions that print document entities in a well-formatted manner and draw them on top of the document page using their BoundingBox information.

The two main functions of this class are pretty_print() and visualize(). Use pretty_print() to get string-formatted output for your custom list of entities. Use visualize() to get the entities' bounding boxes drawn on the document page images.

class textractor.visualizers.entitylist.EntityList(objs=None)

Bases: list, Generic[T], Linearizable

Creates a list type object, initially empty but extended with the list passed in objs.

Parameters: objs (list) – Custom list of objects that can be visualized with this class.
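The constructor behavior described above (start empty, then extend with objs) can be illustrated with a minimal sketch. SketchEntityList is a hypothetical stand-in written for this example; it is not the Textractor class and omits Linearizable and all visualization logic.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class SketchEntityList(list, Generic[T]):
    """Hypothetical stand-in for illustration only, not textractor's EntityList."""

    def __init__(self, objs: Optional[List[T]] = None):
        super().__init__()       # the list starts out empty
        if objs is not None:
            self.extend(objs)    # then takes on the items passed in objs


# Assumed sample data: plain strings stand in for document entities.
words = SketchEntityList(["Invoice", "Total"])
empty = SketchEntityList()

# Because the class subclasses list, all regular list operations still work.
words.append("Due")
first_two = words[:2]
```

Since the object is still a list, slicing, iteration, and len() behave as usual; the extra methods are simply layered on top.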

get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))

Used for linearization. Returns the linearized text of the entity and the matching words.

Returns: Tuple of text and word list
Return type: Tuple[str, List[Word]]

pretty_print(table_format: TableFormat = TableFormat.GITHUB, with_confidence: bool = False, with_geo: bool = False, with_page_number: bool = False, trim: bool = False) → str

Returns a formatted string output for each of the entities in the list according to its entity type.

Parameters:

table_format (TableFormat) – One of the defined TableFormat types, used to decorate the table output string. The choices are the table formats predefined by the PyPI tabulate package. This parameter is used only if there are KeyValues or Tables in the list of textractor.entities.
with_confidence (bool) – Flag to add the confidence of prediction to the entity string. Default: False.
with_geo (bool) – Flag to add the bounding box information to the entity string. Default: False.
with_page_number (bool) – Flag to add the page number to the entity string. Default: False.
trim (bool) – Flag to trim text in the entity string. Default: False.

Returns:

A formatted string output for each of the entities in the list, according to its entity type.

Return type:

str
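To make the role of the boolean flags concrete, here is a minimal sketch of flag-controlled string formatting. Everything in it is an assumption made for illustration: SketchWord, sketch_pretty_print, and the "text | confidence=… | page=…" layout are hypothetical and do not reproduce Textractor's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchWord:
    """Hypothetical entity with the fields the flags refer to."""
    text: str
    confidence: float
    page: int


def sketch_pretty_print(
    entities: List[SketchWord],
    with_confidence: bool = False,
    with_page_number: bool = False,
) -> str:
    """Build one line per entity; each flag appends one optional field."""
    lines = []
    for entity in entities:
        parts = [entity.text]
        if with_confidence:
            parts.append(f"confidence={entity.confidence:.2f}")
        if with_page_number:
            parts.append(f"page={entity.page}")
        lines.append(" | ".join(parts))
    return "\n".join(lines)


# Assumed sample data, not real Textract output.
sample = [SketchWord("Total", 0.987, 1), SketchWord("42.00", 0.5, 2)]
plain = sketch_pretty_print(sample)
detailed = sketch_pretty_print(sample, with_confidence=True, with_page_number=True)
```

With all flags left at False the result contains only the entity text, which mirrors the defaults listed in the parameters above.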

visualize(with_text: bool = True, with_words: bool = True, with_confidence: bool = False, font_size_ratio: float = 0.5) → List

Returns a list of PIL Images with bounding boxes drawn around the entities in the list.

Parameters:

with_text (bool) – Flag to print the OCR output of Textract on top of the text bounding box.
with_confidence (bool) – Flag to print the confidence of prediction on top of the entity bounding box.

Returns:

A list of PIL Images with bounding boxes drawn around the entities in the list.

Return type:

list
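Drawing a bounding box on a page image requires pixel coordinates, while Amazon Textract reports box geometry as ratios of the page width and height. The sketch below shows that conversion under stated assumptions: to_pixel_box is a hypothetical helper written for this example, not a Textractor function, and the sample numbers are invented.

```python
from typing import Tuple


def to_pixel_box(
    left: float,
    top: float,
    width: float,
    height: float,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Convert a ratio-based box to (x0, y0, x1, y1) pixel corners.

    Hypothetical helper: left/top/width/height are fractions in [0, 1]
    of the page dimensions; the result is rounded to whole pixels.
    """
    x0 = round(left * image_width)
    y0 = round(top * image_height)
    x1 = round((left + width) * image_width)
    y1 = round((top + height) * image_height)
    return (x0, y0, x1, y1)


# Assumed example: a box covering the middle half of a 1000x800 page image.
corners = to_pixel_box(0.25, 0.5, 0.5, 0.25, image_width=1000, image_height=800)
```

A corner tuple in this form is what rectangle-drawing routines such as PIL's ImageDraw.rectangle accept, which is one plausible way to render a box onto a page image.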