TextLinearizationConfig

class textractor.data.text_linearization_config.TextLinearizationConfig(remove_new_lines_in_leaf_elements: bool = True, max_number_of_consecutive_new_lines: int = 2, max_number_of_consecutive_spaces: int = None, hide_header_layout: bool = False, hide_footer_layout: bool = False, hide_figure_layout: bool = False, hide_table_layout: bool = False, hide_key_value_layout: bool = False, hide_page_num_layout: bool = False, page_num_prefix: str = '', page_num_suffix: str = '', same_paragraph_separator: str = ' ', same_layout_element_separator: str = '\n', layout_element_separator: str = '\n\n', list_element_separator: str = '\n', list_layout_prefix: str = '', list_layout_suffix: str = '', list_element_prefix: str = '', list_element_suffix: str = '', title_prefix: str = '', title_suffix: str = '', table_layout_prefix: str = '\n\n', table_layout_suffix: str = '\n', table_remove_column_headers: bool = False, table_column_header_threshold: float = 0.9, table_linearization_format: str = 'plaintext', table_add_title_as_caption: bool = False, table_add_footer_as_paragraph: bool = False, table_tabulate_format: str = 'github', table_tabulate_remove_extra_hyphens: bool = False, table_duplicate_text_in_merged_cells: bool = False, table_flatten_headers: bool = False, table_min_table_words: int = 0, table_column_separator: str = '\t', table_flatten_semi_structured_as_plaintext: bool = False, table_prefix: str = '', table_suffix: str = '', table_row_separator: str = '\n', table_row_prefix: str = '', table_row_suffix: str = '', table_cell_prefix: str = '', table_cell_suffix: str = '', table_cell_header_prefix: str = '', table_cell_header_suffix: str = '', table_cell_empty_cell_placeholder: str = '', table_cell_merge_cell_placeholder: str = '', table_cell_left_merge_cell_placeholder: str = '', table_cell_top_merge_cell_placeholder: str = '', table_cell_cross_merge_cell_placeholder: str = '', table_title_prefix: str = '', table_title_suffix: str = '', table_footers_prefix: str = '', 
table_footers_suffix: str = '', header_prefix: str = '', header_suffix: str = '', section_header_prefix: str = '', section_header_suffix: str = '', text_prefix: str = '', text_suffix: str = '', key_value_layout_prefix: str = '', key_value_layout_suffix: str = '', key_value_prefix: str = '', key_value_suffix: str = '', key_prefix: str = '', key_suffix: str = ' ', value_prefix: str = '', value_suffix: str = '', entity_layout_prefix: str = '', entity_layout_suffix: str = '', figure_layout_prefix: str = '', figure_layout_suffix: str = '', footer_layout_prefix: str = '', footer_layout_suffix: str = '', selection_element_selected: str = '[X]', selection_element_not_selected: str = '[ ]', heuristic_h_tolerance: float = 0.3, heuristic_line_break_threshold: float = 0.9, heuristic_overlap_ratio: float = 0.5, signature_token: str = '[SIGNATURE]', add_prefixes_and_suffixes_as_words: bool = False, add_prefixes_and_suffixes_in_text: bool = True)

Bases: object

The TextLinearizationConfig object defines how a document is linearized into a text string.
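
A minimal usage sketch, assuming the config is handed to a Textractor document through the config argument of get_text. The MARKDOWN_OVERRIDES dictionary and the linearize helper are hypothetical names introduced for illustration; only the keyword names are taken from the signature above.

```python
# Overrides chosen for illustration; every key is a parameter from the signature above.
MARKDOWN_OVERRIDES = {
    "title_prefix": "# ",
    "section_header_prefix": "## ",
    "table_linearization_format": "markdown",
    "hide_footer_layout": True,
    "hide_page_num_layout": True,
}


def linearize(document, **overrides):
    """Return the linearized text of `document` with the overrides above applied.

    `document` is assumed to be a textractor Document exposing get_text(config=...).
    """
    # Imported lazily so this module still loads where textractor is not installed.
    from textractor.data.text_linearization_config import TextLinearizationConfig

    config = TextLinearizationConfig(**{**MARKDOWN_OVERRIDES, **overrides})
    return document.get_text(config=config)
```

Any parameter that is not overridden keeps the default shown in the signature.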

add_prefixes_and_suffixes_as_words: bool = False: Controls whether the prefixes/suffixes are inserted in the words returned by get_text_and_words

add_prefixes_and_suffixes_in_text: bool = True: Controls whether the prefixes/suffixes are added to the linearized text

entity_layout_prefix: str = '': Prefix for LAYOUT_ENTITY elements (layout elements without a predicted layout type)

entity_layout_suffix: str = '': Suffix for LAYOUT_ENTITY elements (layout elements without a predicted layout type)

figure_layout_prefix: str = '': Prefix for figure layout elements

figure_layout_suffix: str = '': Suffix for figure layout elements

footer_layout_prefix: str = '': Prefix for footer layout elements

footer_layout_suffix: str = '': Suffix for footer layout elements

header_prefix: str = '': Prefix for header layout elements

header_suffix: str = '': Suffix for header layout elements

heuristic_h_tolerance: float = 0.3: How much the line below and above the current line should differ in width to be separated

heuristic_line_break_threshold: float = 0.9: How much space is acceptable between two lines before splitting them. Expressed as a multiple of the minimum line height

heuristic_overlap_ratio: float = 0.5: How much vertical overlap is tolerated between two subsequent lines before merging them into a single line
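
The line-break threshold can be pictured with a small sketch. This is not the library's implementation: the should_split_lines helper, its arguments, and the exact comparison are assumptions made for illustration, based only on the description of heuristic_line_break_threshold above.

```python
def should_split_lines(cur_top, cur_height, next_top, next_height, threshold=0.9):
    """Return True when the vertical gap between two lines exceeds the threshold.

    Assumption: the gap is compared against threshold * min(line heights),
    with coordinates growing downward as in normalized page coordinates.
    """
    gap = next_top - (cur_top + cur_height)
    return gap > threshold * min(cur_height, next_height)
```

Under this reading, a larger threshold tolerates more vertical space before two lines are treated as separate blocks.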

hide_figure_layout: bool = False: Hide figure layouts in the linearized output

hide_footer_layout: bool = False: Hide footer layouts in the linearized output

hide_header_layout: bool = False: Hide header layouts in the linearized output

hide_key_value_layout: bool = False: Hide key-value layouts in the linearized output

hide_page_num_layout: bool = False: Hide page numbers in the linearized output

hide_table_layout: bool = False: Hide table layouts in the linearized output

key_prefix: str = '': Prefix for key elements

key_suffix: str = ' ': Suffix for key elements

key_value_layout_prefix: str = '': Prefix for key_value layout elements (not for individual key-value elements)

key_value_layout_suffix: str = '': Suffix for key_value layout elements (not for individual key-value elements)

key_value_prefix: str = '': Prefix for key-value elements

key_value_suffix: str = '': Suffix for key-value elements
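
As a rough illustration of how the key, value, and key-value prefixes and suffixes could nest, the sketch below wraps a single pair. The render_key_value helper and the nesting order are assumptions, not the library's code; the defaults mirror the signature above.

```python
def render_key_value(key, value, *, key_value_prefix="", key_prefix="",
                     key_suffix=" ", value_prefix="", value_suffix="",
                     key_value_suffix=""):
    """Wrap a key and its value with the configured strings.

    Assumed order: pair prefix, wrapped key, wrapped value, pair suffix.
    """
    return (
        f"{key_value_prefix}{key_prefix}{key}{key_suffix}"
        f"{value_prefix}{value}{value_suffix}{key_value_suffix}"
    )
```

With the defaults, the single-space key_suffix is the only thing separating a key from its value, for example "Name:" and "Jane" become "Name: Jane".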

layout_element_separator: str = '\n\n': Separator to use when combining linearized layout elements

list_element_prefix: str = '': Prefix for elements in a list layout (children)

list_element_separator: str = '\n': Separator for elements in a list layout

list_element_suffix: str = '': Suffix for elements in a list layout (children)

list_layout_prefix: str = '': Prefix for list layout elements (parent)

list_layout_suffix: str = '': Suffix for list layout elements (parent)
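
The parent and child list settings can be sketched as follows. The render_list helper is hypothetical and only illustrates one plausible way the five list strings combine; the actual composition inside the library may differ.

```python
def render_list(items, *, list_layout_prefix="", list_layout_suffix="",
                list_element_prefix="", list_element_suffix="",
                list_element_separator="\n"):
    """Join list items, wrapping each child and then the whole list."""
    body = list_element_separator.join(
        f"{list_element_prefix}{item}{list_element_suffix}" for item in items
    )
    return f"{list_layout_prefix}{body}{list_layout_suffix}"
```

For instance, setting list_element_prefix to "- " would yield Markdown-style bullets under this sketch.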

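max_number_of_consecutive_new_lines: int = 2: Maximum number of consecutive new lines kept in the linearized text; extra new lines are removed

max_number_of_consecutive_spaces: int = None: Maximum number of consecutive spaces kept in the linearized text; extra spaces are removed (None skips this removal)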
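
The effect of the two limits can be approximated with regular expressions. The collapse_whitespace helper below is a hypothetical sketch of the behavior described above, not the library's own routine.

```python
import re


def collapse_whitespace(text, max_new_lines=2, max_spaces=None):
    """Cap runs of new lines and, optionally, runs of spaces."""
    # Replace any run longer than the limit with exactly `max_new_lines` new lines.
    text = re.sub(r"\n{%d,}" % (max_new_lines + 1), "\n" * max_new_lines, text)
    # None means spaces are left untouched.
    if max_spaces is not None:
        text = re.sub(r" {%d,}" % (max_spaces + 1), " " * max_spaces, text)
    return text
```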

page_num_prefix: str = '': Prefix for page number layout elements

page_num_suffix: str = '': Suffix for page number layout elements

remove_new_lines_in_leaf_elements: bool = True: Removes new lines in leaf layout elements, which removes extra whitespace

same_layout_element_separator: str = '\n'

same_paragraph_separator: str = ' '

section_header_prefix: str = '': Prefix for section header layout elements

section_header_suffix: str = '': Suffix for section header layout elements

selection_element_not_selected: str = '[ ]'

selection_element_selected: str = '[X]'

signature_token: str = '[SIGNATURE]': Signature representation in the linearized text

table_add_footer_as_paragraph: bool = False

table_add_title_as_caption: bool = False: When using html linearization format, adds the title inside the table as <caption></caption>

table_cell_cross_merge_cell_placeholder: str = '': Placeholder for cross merge cell (X)

table_cell_empty_cell_placeholder: str = '': Placeholder for empty cells

table_cell_header_prefix: str = '': Prefix for header cell

table_cell_header_suffix: str = '': Suffix for header cell

table_cell_left_merge_cell_placeholder: str = '': Placeholder for left merge cell (L)

table_cell_merge_cell_placeholder: str = '': Placeholder for merged cell

table_cell_prefix: str = '': Prefix for table cell

table_cell_suffix: str = '': Suffix for table cell

table_cell_top_merge_cell_placeholder: str = '': Placeholder for top merge cell (T)

table_column_header_threshold: float = 0.9: Threshold for a row to be selected as header when rendering as markdown. 0.9 means that 90% of the cells must have the is_header_cell flag.
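
The threshold can be read as a fraction test over the header flags of a row. The is_header_row helper is a hypothetical sketch, and the use of a greater-than-or-equal comparison is an assumption drawn from the "90% of the cells" wording above.

```python
def is_header_row(header_flags, threshold=0.9):
    """Return True when enough cells in the row carry the header flag.

    `header_flags` holds one boolean per cell (the is_header_cell flag).
    """
    if not header_flags:
        return False
    return sum(header_flags) / len(header_flags) >= threshold
```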

table_column_separator: str = '\t': Table column separator used when linearizing layout tables. Not used if AnalyzeDocument was called with the TABLES feature

table_duplicate_text_in_merged_cells: bool = False: Duplicate text in merged cells to preserve line alignment

table_flatten_headers: bool = False: Flatten table headers into a single row, unmerging the cells horizontally

table_flatten_semi_structured_as_plaintext: bool = False: Ignores table structure for SEMI_STRUCTURED tables and returns them as text

table_footers_prefix: str = '': Prefix for table footers if they are outside of the table (floating)

table_footers_suffix: str = '': Suffix for table footers if they are outside of the table (floating)

table_layout_prefix: str = '\n\n': Prefix for table elements

table_layout_suffix: str = '\n': Suffix for table elements

table_linearization_format: str = 'plaintext': How to represent tables in the linearized output. Choices are plaintext, markdown or html.

table_min_table_words: int = 0: Threshold below which tables will be rendered as words instead of using table layout

table_prefix: str = ''

table_remove_column_headers: bool = False: Remove pandas index column headers from tables

table_row_prefix: str = '': Prefix for table row

table_row_separator: str = '\n': Table row separator

table_row_suffix: str = '': Suffix for table row
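
One way to picture the row and cell settings together is the sketch below. The render_rows helper is hypothetical, and the way it combines cell wrappers with the column separator is an assumption made for illustration rather than documented behavior.

```python
def render_rows(rows, *, table_row_prefix="", table_row_suffix="",
                table_row_separator="\n", table_cell_prefix="",
                table_cell_suffix="", table_column_separator="\t"):
    """Wrap each cell, join cells into rows, then join the rows."""
    rendered = []
    for row in rows:
        cells = table_column_separator.join(
            f"{table_cell_prefix}{cell}{table_cell_suffix}" for cell in row
        )
        rendered.append(f"{table_row_prefix}{cells}{table_row_suffix}")
    return table_row_separator.join(rendered)
```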

table_suffix: str = ''

table_tabulate_format: str = 'github': Markdown tabulate format to use when tables are linearized as markdown

table_tabulate_remove_extra_hyphens: bool = False: By default, markdown tables have N hyphens to preserve alignment. This option reduces the number of hyphens to 1, which is the minimum number allowed by the GitHub Markdown spec
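
The hyphen reduction can be illustrated on an already rendered Markdown table. The shrink_separator_hyphens helper is a hypothetical post-processing sketch, not the library's implementation.

```python
import re


def shrink_separator_hyphens(markdown_table):
    """Reduce each run of hyphens in separator rows to a single hyphen."""
    lines = []
    for line in markdown_table.splitlines():
        # A separator row contains only pipes, hyphens, colons, and spaces.
        if re.fullmatch(r"[|\-: ]+", line) and "-" in line:
            line = re.sub(r"-+", "-", line)
        lines.append(line)
    return "\n".join(lines)
```

This shortens the output, which can matter when the linearized text is counted in tokens, at the cost of visual column alignment in the raw text.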

table_title_prefix: str = '': Prefix for table title if it is outside of the table (floating)

table_title_suffix: str = '': Suffix for table title if it is outside of the table (floating)

text_prefix: str = '': Prefix for text layout elements

text_suffix: str = '': Suffix for text layout elements

title_prefix: str = '': Prefix for title layout elements

title_suffix: str = '': Suffix for title layout elements

value_prefix: str = '': Prefix for value elements

value_suffix: str = '': Suffix for value elements