TextLinearizationConfig

class textractor.data.text_linearization_config.TextLinearizationConfig(remove_new_lines_in_leaf_elements: bool = True, max_number_of_consecutive_new_lines: int = 2, max_number_of_consecutive_spaces: Optional[int] = None, hide_header_layout: bool = False, hide_footer_layout: bool = False, hide_figure_layout: bool = False, hide_table_layout: bool = False, hide_key_value_layout: bool = False, hide_page_num_layout: bool = False, page_num_prefix: str = '', page_num_suffix: str = '', same_paragraph_separator: str = ' ', same_layout_element_separator: str = '\n', layout_element_separator: str = '\n\n', list_element_separator: str = '\n', list_layout_prefix: str = '', list_layout_suffix: str = '', list_element_prefix: str = '', list_element_suffix: str = '', title_prefix: str = '', title_suffix: str = '', table_layout_prefix: str = '\n\n', table_layout_suffix: str = '\n', table_remove_column_headers: bool = False, table_column_header_threshold: float = 0.9, table_linearization_format: str = 'plaintext', table_add_title_as_caption: bool = False, table_add_footer_as_paragraph: bool = False, table_tabulate_format: str = 'github', table_tabulate_remove_extra_hyphens: bool = False, table_duplicate_text_in_merged_cells: bool = False, table_flatten_headers: bool = False, table_min_table_words: int = 0, table_column_separator: str = '\t', table_flatten_semi_structured_as_plaintext: bool = False, table_prefix: str = '', table_suffix: str = '', table_row_separator: str = '\n', table_row_prefix: str = '', table_row_suffix: str = '', table_cell_prefix: str = '', table_cell_suffix: str = '', table_cell_header_prefix: str = '', table_cell_header_suffix: str = '', table_cell_empty_cell_placeholder: str = '', table_cell_merge_cell_placeholder: str = '', table_cell_left_merge_cell_placeholder: str = '', table_cell_top_merge_cell_placeholder: str = '', table_cell_cross_merge_cell_placeholder: str = '', table_title_prefix: str = '', table_title_suffix: str = '', table_footers_prefix: str = '', table_footers_suffix: str = '', header_prefix: str = '', header_suffix: str = '', section_header_prefix: str = '', section_header_suffix: str = '', text_prefix: str = '', text_suffix: str = '', key_value_layout_prefix: str = '', key_value_layout_suffix: str = '', key_value_prefix: str = '', key_value_suffix: str = '', key_prefix: str = '', key_suffix: str = ' ', value_prefix: str = '', value_suffix: str = '', entity_layout_prefix: str = '', entity_layout_suffix: str = '', figure_layout_prefix: str = '', figure_layout_suffix: str = '', footer_layout_prefix: str = '', footer_layout_suffix: str = '', selection_element_selected: str = '[X]', selection_element_not_selected: str = '[ ]', heuristic_h_tolerance: float = 0.3, heuristic_line_break_threshold: float = 0.9, heuristic_overlap_ratio: float = 0.5, signature_token: str = '[SIGNATURE]', add_prefixes_and_suffixes_as_words: bool = False, add_prefixes_and_suffixes_in_text: bool = True)

Bases: object

The TextLinearizationConfig object defines how a document is linearized into a text string
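A minimal sketch of building a configuration, using only field names and defaults from the signature above. Every field has a default, so only the options that differ need to be passed. The specific values chosen here (Markdown-style heading prefixes, hidden headers and footers) are illustrative assumptions, not recommended settings.

```python
from textractor.data.text_linearization_config import TextLinearizationConfig

# Override a handful of fields; everything else keeps its documented default.
config = TextLinearizationConfig(
    hide_header_layout=True,       # drop page headers from the output
    hide_footer_layout=True,       # drop page footers from the output
    hide_page_num_layout=True,     # drop page numbers from the output
    title_prefix="# ",             # illustrative choice: Markdown-style title
    section_header_prefix="## ",   # illustrative choice: Markdown-style section header
    table_linearization_format="markdown",
)

print(config.table_linearization_format)
print(repr(config.layout_element_separator))
```

The resulting object would then be handed to whichever call performs the linearization, such as the get_text_and_words method mentioned below; the exact call signature is not shown in this section and should be checked against the library.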

add_prefixes_and_suffixes_as_words: bool = False

Controls if the prefixes/suffixes will be inserted in the words returned by get_text_and_words

add_prefixes_and_suffixes_in_text: bool = True

Controls if the prefixes/suffixes will be added to the linearized text

entity_layout_prefix: str = ''

Prefix for LAYOUT_ENTITY elements (layout elements without a predicted layout type)

entity_layout_suffix: str = ''

Suffix for LAYOUT_ENTITY elements (layout elements without a predicted layout type)

figure_layout_prefix: str = ''

Prefix for figure layout elements

figure_layout_suffix: str = ''

Suffix for figure layout elements

footer_layout_prefix: str = ''

Prefix for footer layout elements

footer_layout_suffix: str = ''

Suffix for footer layout elements

header_prefix: str = ''

Prefix for header layout elements

header_suffix: str = ''

Suffix for header layout elements

heuristic_h_tolerance: float = 0.3

How much the lines above and below the current line must differ in width for them to be separated

heuristic_line_break_threshold: float = 0.9

How much space is acceptable between two lines before splitting them, expressed as a multiple of the minimum line height

heuristic_overlap_ratio: float = 0.5

How much vertical overlap is tolerated between two subsequent lines before merging them into a single line
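The two line-level heuristics above can be pictured with a small sketch. The formulas below are assumptions made for illustration (overlap and gap both normalized by the smaller of the two line heights); the library's actual computation may differ. The function names and the pixel-style coordinates are hypothetical.

```python
def vertical_overlap_ratio(top_a: float, height_a: float,
                           top_b: float, height_b: float) -> float:
    """Shared vertical extent of two lines, as a fraction of the smaller height (assumed formula)."""
    overlap = min(top_a + height_a, top_b + height_b) - max(top_a, top_b)
    return max(0.0, overlap) / min(height_a, height_b)


def gap_in_min_heights(top_a: float, height_a: float,
                       top_b: float, height_b: float) -> float:
    """Vertical gap between line a (above) and line b (below), in multiples of the smaller height."""
    gap = top_b - (top_a + height_a)
    return max(0.0, gap) / min(height_a, height_b)


# Two boxes that mostly share the same vertical band: candidates for merging
# into one line when compared against heuristic_overlap_ratio = 0.5.
print(vertical_overlap_ratio(100, 20, 105, 20))  # 0.75

# A gap of 1.5 line heights exceeds heuristic_line_break_threshold = 0.9,
# so the lines would be split in this sketch.
print(gap_in_min_heights(100, 20, 150, 20))      # 1.5
```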

hide_figure_layout: bool = False

Hide figure layouts in the linearized output

hide_footer_layout: bool = False

Hide footer layouts in the linearized output

hide_header_layout: bool = False

Hide header layouts in the linearized output

hide_key_value_layout: bool = False

Hide key-value layouts in the linearized output

hide_page_num_layout: bool = False

Hide page numbers in the linearized output

hide_table_layout: bool = False

Hide table layouts in the linearized output

key_prefix: str = ''

Prefix for key elements

key_suffix: str = ' '

Suffix for key elements

key_value_layout_prefix: str = ''

Prefix for key_value layout elements (not for individual key-value elements)

key_value_layout_suffix: str = ''

Suffix for key_value layout elements (not for individual key-value elements)

key_value_prefix: str = ''

Prefix for key-value elements

key_value_suffix: str = ''

Suffix for key-value elements
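To show how the key, value, and key-value prefixes and suffixes might nest, here is a stand-alone sketch. The nesting order (key-value wrapper outside, key before value) is an assumption, and linearize_key_value is a hypothetical helper, not a library function. The defaults mirror the documented ones, including the single-space key_suffix.

```python
def linearize_key_value(key: str, value: str,
                        key_prefix: str = "", key_suffix: str = " ",
                        value_prefix: str = "", value_suffix: str = "",
                        key_value_prefix: str = "", key_value_suffix: str = "") -> str:
    """Wrap a key and its value with their prefixes and suffixes (assumed ordering)."""
    return (
        f"{key_value_prefix}"
        f"{key_prefix}{key}{key_suffix}"
        f"{value_prefix}{value}{value_suffix}"
        f"{key_value_suffix}"
    )


# With the defaults, only the key_suffix space separates key and value.
print(linearize_key_value("Name:", "Jane Doe"))  # Name: Jane Doe

# Tag-style markers, an illustrative choice.
print(linearize_key_value("Name:", "Jane Doe",
                          key_prefix="<k>", key_suffix="</k>",
                          value_prefix="<v>", value_suffix="</v>"))
```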

layout_element_separator: str = '\n\n'

Separator to use when combining linearized layout elements

list_element_prefix: str = ''

Prefix for elements in a list layout (children)

list_element_separator: str = '\n'

Separator for elements in a list layout

list_element_suffix: str = ''

Suffix for elements in a list layout (children)

list_layout_prefix: str = ''

Prefix for list layout elements (parent)

list_layout_suffix: str = ''

Suffix for list layout elements (parent)
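The parent/child distinction in the list options can be sketched as follows. This is an illustration of the documented roles (element prefix and suffix around each child, layout prefix and suffix around the whole list), not the library's code; linearize_list is a hypothetical helper.

```python
from typing import List


def linearize_list(items: List[str],
                   list_layout_prefix: str = "", list_layout_suffix: str = "",
                   list_element_prefix: str = "", list_element_suffix: str = "",
                   list_element_separator: str = "\n") -> str:
    """Join list children with the separator, then wrap the whole list once."""
    body = list_element_separator.join(
        f"{list_element_prefix}{item}{list_element_suffix}" for item in items
    )
    return f"{list_layout_prefix}{body}{list_layout_suffix}"


# Markdown-style bullets: only the child prefix is set.
print(linearize_list(["First", "Second"], list_element_prefix="- "))

# HTML-style list: parent and child markers are both set.
print(linearize_list(["First", "Second"],
                     list_layout_prefix="<ul>", list_layout_suffix="</ul>",
                     list_element_prefix="<li>", list_element_suffix="</li>",
                     list_element_separator=""))
```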

max_number_of_consecutive_new_lines: int = 2

Maximum number of consecutive new lines kept in the linearized text; longer runs are trimmed to this length

max_number_of_consecutive_spaces: Optional[int] = None

Maximum number of consecutive spaces kept in the linearized text (None skips whitespace removal)
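The effect of the two whitespace limits can be approximated with regular expressions. This is a sketch of the described behavior, assuming that runs longer than the limit are trimmed down to the limit; collapse_whitespace is a hypothetical helper and not part of the library.

```python
import re
from typing import Optional


def collapse_whitespace(text: str,
                        max_new_lines: int = 2,
                        max_spaces: Optional[int] = None) -> str:
    """Cap runs of new lines and, optionally, runs of spaces."""
    # Any run longer than the cap is replaced by exactly `max_new_lines` new lines.
    text = re.sub(r"\n{%d,}" % (max_new_lines + 1), "\n" * max_new_lines, text)
    # None means spaces are left untouched, matching the documented default.
    if max_spaces is not None:
        text = re.sub(r" {%d,}" % (max_spaces + 1), " " * max_spaces, text)
    return text


sample = "Title\n\n\n\nBody   text"
print(repr(collapse_whitespace(sample)))                # 'Title\n\nBody   text'
print(repr(collapse_whitespace(sample, max_spaces=1)))  # 'Title\n\nBody text'
```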

page_num_prefix: str = ''

Prefix for page number layout elements

page_num_suffix: str = ''

Suffix for page number layout elements

remove_new_lines_in_leaf_elements: bool = True

Removes new lines inside leaf layout elements, which removes extra whitespace

same_layout_element_separator: str = '\n'

same_paragraph_separator: str = ' '

section_header_prefix: str = ''

Prefix for section header layout elements

section_header_suffix: str = ''

Suffix for section header layout elements

selection_element_not_selected: str = '[ ]'

selection_element_selected: str = '[X]'

signature_token: str = '[SIGNATURE]'

Signature representation in the linearized text

table_add_title_as_caption: bool = False

When using html linearization format, adds the title inside the table as <caption></caption>

table_cell_cross_merge_cell_placeholder: str = ''

Placeholder for cross merge cell (X)

table_cell_empty_cell_placeholder: str = ''

Placeholder for empty cells

table_cell_header_prefix: str = ''

Prefix for header cell

table_cell_header_suffix: str = ''

Suffix for header cell

table_cell_left_merge_cell_placeholder: str = ''

Placeholder for left merge cell (L)

table_cell_merge_cell_placeholder: str = ''

Placeholder for merged cell

table_cell_prefix: str = ''

Prefix for table cell

table_cell_suffix: str = ''

Suffix for table cell

table_cell_top_merge_cell_placeholder: str = ''

Placeholder for top merge cell (T)

table_column_header_threshold: float = 0.9

Threshold for a row to be selected as header when rendering as markdown. 0.9 means that 90% of the cells must have the is_header_cell flag.
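The threshold arithmetic can be illustrated with a short sketch. It assumes the comparison is "fraction of header cells greater than or equal to the threshold"; whether the library uses an inclusive or strict comparison is not stated here. is_header_row is a hypothetical helper, and the list of booleans stands in for each cell's is_header_cell flag.

```python
from typing import List


def is_header_row(header_flags: List[bool], threshold: float = 0.9) -> bool:
    """Return True when enough cells in the row carry the header flag."""
    if not header_flags:
        return False
    return sum(header_flags) / len(header_flags) >= threshold


print(is_header_row([True] * 10))                  # True: 100% of cells are header cells
print(is_header_row([True] * 9 + [False]))         # True: 9/10 meets the 0.9 threshold
print(is_header_row([True, True, False, False]))   # False: only 50% are header cells
```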

table_column_separator: str = '\t'

Column separator used when linearizing layout tables. Not used if AnalyzeDocument was called with the TABLES feature

table_duplicate_text_in_merged_cells: bool = False

Duplicate text in merged cells to preserve line alignment

table_flatten_headers: bool = False

Flatten table headers into a single row, unmerging the cells horizontally

table_flatten_semi_structured_as_plaintext: bool = False

Ignores table structure for SEMI_STRUCTURED tables and returns them as text

table_footers_prefix: str = ''

Prefix for table footers if they are outside of the table (floating)

table_footers_suffix: str = ''

Suffix for table footers if they are outside of the table (floating)

table_layout_prefix: str = '\n\n'

Prefix for table elements

table_layout_suffix: str = '\n'

Suffix for table elements

table_linearization_format: str = 'plaintext'

How to represent tables in the linearized output. Choices are plaintext, markdown or html.

table_min_table_words: int = 0

Minimum number of words a table must contain; tables below this threshold are rendered as words instead of using the table layout

table_prefix: str = ''

table_remove_column_headers: bool = False

Remove pandas index column headers from tables

table_row_prefix: str = ''

Prefix for table row

table_row_separator: str = '\n'

Table row separator

table_row_suffix: str = ''

Suffix for table row
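A rough picture of how the cell, row, and separator options could compose when a table is written out as text. This is a sketch under stated assumptions: the way the library actually combines these fields, and which of them apply to each linearization format, is not described in this section. linearize_rows is a hypothetical helper and its parameter names simply drop the table_ prefix.

```python
from typing import List


def linearize_rows(rows: List[List[str]],
                   column_separator: str = "\t", row_separator: str = "\n",
                   row_prefix: str = "", row_suffix: str = "",
                   cell_prefix: str = "", cell_suffix: str = "",
                   empty_cell_placeholder: str = "") -> str:
    """Wrap each cell, join cells into rows, then join rows (assumed ordering)."""
    lines = []
    for row in rows:
        cells = [
            f"{cell_prefix}{cell if cell else empty_cell_placeholder}{cell_suffix}"
            for cell in row
        ]
        lines.append(f"{row_prefix}{column_separator.join(cells)}{row_suffix}")
    return row_separator.join(lines)


rows = [["Item", "Qty"], ["Apple", ""]]

# Tab-separated output with a visible marker for the empty cell.
print(linearize_rows(rows, empty_cell_placeholder="-"))

# Tag-style markers, an illustrative choice.
print(linearize_rows(rows, column_separator="",
                     row_prefix="<tr>", row_suffix="</tr>",
                     cell_prefix="<td>", cell_suffix="</td>"))
```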

table_suffix: str = ''

table_tabulate_format: str = 'github'

Markdown tabulate format to use when tables are linearized as markdown

table_tabulate_remove_extra_hyphens: bool = False

By default, markdown tables have N hyphens to preserve alignment. This option reduces the number of hyphens to 1, the minimum allowed by the GitHub Markdown spec
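The hyphen reduction can be imitated on an already rendered table. The sketch below is a post-processing approximation written for illustration, not the library's implementation; shrink_separator_hyphens is a hypothetical helper. It only touches rows made up entirely of pipes, hyphens, colons, and whitespace, so a data row consisting solely of hyphens would also be shortened.

```python
import re


def shrink_separator_hyphens(markdown_table: str) -> str:
    """Collapse each run of hyphens on separator rows to a single hyphen."""
    lines = []
    for line in markdown_table.splitlines():
        # Separator rows contain only pipes, hyphens, colons, and whitespace.
        if re.fullmatch(r"\|[\s:\-|]+\|", line.strip()) and "-" in line:
            line = re.sub(r"-+", "-", line)
        lines.append(line)
    return "\n".join(lines)


table = "| Item   | Qty |\n|--------|-----|\n| Apple  | 3   |"
print(shrink_separator_hyphens(table))
# | Item   | Qty |
# |-|-|
# | Apple  | 3   |
```

The shortened table uses fewer characters, at the cost of columns no longer lining up when the raw text is read directly.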

table_title_prefix: str = ''

Prefix for table title if it is outside of the table (floating)

table_title_suffix: str = ''

Suffix for table title if it is outside of the table (floating)

text_prefix: str = ''

Prefix for text layout elements

text_suffix: str = ''

Suffix for text layout elements

title_prefix: str = ''

Prefix for title layout elements

title_suffix: str = ''

Suffix for title layout elements

value_prefix: str = ''

Prefix for value elements

value_suffix: str = ''

Suffix for value elements