Entity Visualization
Most features that return DocumentEntity objects return them as an EntityList. EntityList extends the list data type to provide visualization features for these entities.
EntityList
The EntityList is an extension of the list type with custom functions to print document entities in a well-formatted manner and to visualize them on top of the document page using their BoundingBox information.
The two main functions within this class are pretty_print() and visualize().
Use pretty_print() to get a string-formatted output of your custom list of entities.
Use visualize() to get the bounding box visualization of the entities on the document page images.
- class textractor.visualizers.entitylist.EntityList(objs=None)
Bases: list, Generic[T], Linearizable
Creates a list type object, initially empty but extended with the list passed in objs.
- Parameters:
objs (list) – Custom list of objects that can be visualized with this class.
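The constructor behaviour described above (start empty, then extend with objs) can be illustrated with a minimal stand-in. This is a sketch and not the library's source: the class name SketchEntityList is hypothetical, and it omits Linearizable and all printing and visualization methods.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class SketchEntityList(list, Generic[T]):
    """Hypothetical stand-in showing a list subclass built from `objs`."""

    def __init__(self, objs: Optional[List[T]] = None):
        # Start as an empty list.
        super().__init__()
        # Extend with the objects passed in, if any.
        if objs is not None:
            self.extend(objs)


empty = SketchEntityList()
filled = SketchEntityList(["word-1", "word-2"])

# The stand-in still behaves like a regular list.
print(len(empty), len(filled), filled[0])
```

Because the class derives from list, indexing, slicing, iteration and len() work as they do for any list; the real EntityList adds its formatting and visualization methods on top of that.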
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization. Returns the linearized text of the entity and the matching words.
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- pretty_print(table_format: TableFormat = TableFormat.GITHUB, with_confidence: bool = False, with_geo: bool = False, with_page_number: bool = False, trim: bool = False) str
Returns a formatted string output for each of the entities in the list according to its entity type.
- Parameters:
table_format (TableFormat) – One of the defined TableFormat types, used to decorate the table output string. The choices are the formats predefined by the PyPI tabulate package. It is used only if there are KeyValues or Tables in the list of textractor.entities.
with_confidence (bool) – Flag to add the confidence of prediction to the entity string. Default: False.
with_geo (bool) – Flag to add the bounding box information to the entity string. Default: False.
with_page_number (bool) – Flag to add the page number to the entity string. Default: False.
trim (bool) – Flag to trim text in the entity string. Default: False.
- Returns:
A formatted string with the output for each of the entities in the list, according to its entity type.
- Return type:
str
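As a hedged usage sketch, the helper below calls pretty_print() using only the keyword arguments documented above. The name describe_entities is hypothetical and not part of textractor; it accepts any object exposing a compatible pretty_print() method, such as an EntityList taken from a parsed document.

```python
def describe_entities(entities, detailed=False):
    """Return the pretty_print() string for an EntityList-like object.

    `describe_entities` is an illustrative helper, not a textractor API.
    """
    if detailed:
        # Add confidence, bounding box and page number to each entity string.
        return entities.pretty_print(
            with_confidence=True,
            with_geo=True,
            with_page_number=True,
        )
    # With no arguments, all optional flags keep their default of False.
    return entities.pretty_print()
```

Keeping the flags behind a single detailed switch is only one possible design; passing the individual flags through directly works equally well.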
- visualize(with_text: bool = True, with_words: bool = True, with_confidence: bool = False, font_size_ratio: float = 0.5) List
Returns a list of PIL Images with bounding boxes drawn around the entities in the list.
- Parameters:
with_text (bool) – Flag to print the OCR output of Textract on top of the text bounding box.
with_confidence (bool) – Flag to print the confidence of prediction on top of the entity bounding box.
- Returns:
A list of PIL Images with bounding boxes drawn around the entities in the list.
- Return type:
list
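Since visualize() returns a list of PIL Images, a common follow-up is to write them to disk. The sketch below assumes an EntityList-like object and uses only the documented visualize() arguments plus the standard save() method of PIL images. The helper name save_visualizations, the prefix argument and the file naming scheme are hypothetical.

```python
import os


def save_visualizations(entities, out_dir, prefix="page"):
    """Save each image returned by visualize() as a PNG and return the paths.

    `save_visualizations` is an illustrative helper, not a textractor API.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Draw bounding boxes with the OCR text and the confidence on top.
    images = entities.visualize(with_text=True, with_confidence=True)
    paths = []
    for index, image in enumerate(images):
        # `index` is the position in the returned list, which is not
        # necessarily the page number in the source document.
        path = os.path.join(out_dir, f"{prefix}_{index}.png")
        image.save(path)
        paths.append(path)
    return paths
```

In a notebook, the returned images can also be displayed directly instead of being saved.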