Document Entities
Document objects contain various entities within them. Textract document analysis APIs recognize 6 document entities, namely: WORD, LINE, KEY_VALUE_SET, SELECTION_ELEMENT, TABLE, CELL.
These structures occur in most documents, and the package provides classes to programmatically store and access the information produced by Textract for these entities.
BoundingBox
The BoundingBox class contains all the co-ordinate information for a DocumentEntity. This class is mainly useful to locate the entity on the image of the document page.
- class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)
Bases:
SpatialObject
Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By default, BoundingBox is set to work with denormalized co-ordinates: \(x \in [0, docwidth]\) and \(y \in [0, docheight]\). Use the as_normalized_dict function to obtain a BoundingBox with normalized co-ordinates: \(x \in [0, 1]\) and \(y \in [0, 1]\).
Create a BoundingBox as shown below:
Directly:
bb = BoundingBox(x, y, width, height)
From dict:
bb = BoundingBox.from_dict(bb_dict)
where bb_dict = {'x': x, 'y': y, 'width': width, 'height': height}
Use a BoundingBox as shown below:
Directly:
print('The top left is: ' + str(bb.x) + ' ' + str(bb.y))
Convert to dict:
bb_dict = bb.as_dict()
returns {'x': x, 'y': y, 'width': width, 'height': height}
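As a rough illustration of the two co-ordinate systems, the sketch below converts a bounding-box dictionary between pixel (denormalized) and normalized values. The helper names `to_normalized` and `to_denormalized` and the page size are assumptions made for this example; they are not part of textractor.

```python
from typing import Dict


def to_normalized(bb: Dict[str, float], page_width: float, page_height: float) -> Dict[str, float]:
    """Scale pixel co-ordinates into the [0, 1] range (hypothetical helper)."""
    return {
        "x": bb["x"] / page_width,
        "y": bb["y"] / page_height,
        "width": bb["width"] / page_width,
        "height": bb["height"] / page_height,
    }


def to_denormalized(bb: Dict[str, float], page_width: float, page_height: float) -> Dict[str, float]:
    """Scale [0, 1] co-ordinates back to pixels (hypothetical helper)."""
    return {
        "x": bb["x"] * page_width,
        "y": bb["y"] * page_height,
        "width": bb["width"] * page_width,
        "height": bb["height"] * page_height,
    }


# Assumed page size of 1000 x 500 pixels.
pixel_box = {"x": 100.0, "y": 50.0, "width": 200.0, "height": 25.0}
normalized_box = to_normalized(pixel_box, page_width=1000.0, page_height=500.0)
restored_box = to_denormalized(normalized_box, page_width=1000.0, page_height=500.0)
print(normalized_box)
print(restored_box)
```

Horizontal values are scaled by the page width and vertical values by the page height, which is why a reference object with both attributes is needed to switch between the two forms.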
- property area
Returns the area of the bounding box. Negative bboxes are handled as 0-area.
- Returns:
Bounding box area
- Return type:
float
- as_denormalized_numpy()
- Returns:
Returns denormalized co-ordinates x, y and dimensions width, height as numpy array.
- Return type:
numpy.array
- classmethod center_is_inside(bbox_a, bbox_b)
Returns true if the center point of Bounding Box A is within Bounding Box B
- classmethod enclosing_bbox(bboxes, spatial_object: Optional[SpatialObject] = None)
- Parameters:
bboxes (List[BoundingBox]) – list of bounding boxes
spatial_object (SpatialObject) – spatial object to be added to the returned bbox
- Returns:
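The idea of an enclosing box can be sketched with plain tuples: take the smallest left and top edges and the largest right and bottom edges. The `enclosing_box` function and the `(x, y, width, height)` tuple layout below are assumptions for illustration only, and how the library treats an empty list is not covered here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def enclosing_box(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that contains every input box (hypothetical helper)."""
    left = min(x for x, _, _, _ in boxes)
    top = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = max(y + h for _, y, _, h in boxes)
    return (left, top, right - left, bottom - top)


# Two word boxes on the same line.
word_boxes = [(10, 10, 30, 10), (45, 12, 20, 10)]
line_box = enclosing_box(word_boxes)
print(line_box)
```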
- classmethod from_denormalized_borders(left: float, top: float, right: float, bottom: float, spatial_object: Optional[SpatialObject] = None)
Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized. If spatial_object is not None, the coordinates will be denormalized according to the spatial object.
- Parameters:
left (float) – ~ [0, doc_width]
top (float) – ~ [0, doc_height]
right (float) – ~ [0, doc_width]
bottom (float) – ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes
- Returns:
BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]
- classmethod from_denormalized_corners(x1: float, y1: float, x2: float, y2: float, spatial_object: Optional[SpatialObject] = None)
Builds an axis aligned bounding box from top-left and bottom-right coordinates. The coordinates are assumed to be denormalized.
- Parameters:
x1 (float) – Left ~ [0, doc_width]
y1 (float) – Top ~ [0, doc_height]
x2 (float) – Right ~ [0, doc_width]
y2 (float) – Bottom ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes (i.e: Document, ConvertibleImage).
- Returns:
BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]
- classmethod from_denormalized_dict(bbox_dict: Dict[str, float])
Builds an axis aligned bounding box from a dictionary of: {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to the spatial object.
- Parameters:
bbox_dict (Dict[str, float]) – {'x': x, 'y': y, 'width': width, 'height': height} of [0, doc_height] x [0, doc_width]
spatial_object – Some object with width and height attributes
- Returns:
BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]
- classmethod from_denormalized_xywh(x: float, y: float, width: float, height: float, spatial_object: Optional[SpatialObject] = None)
Builds an axis aligned bounding box from top-left, width and height properties. The coordinates are assumed to be denormalized.
- Parameters:
x (float) – Left ~ [0, doc_width]
y (float) – Top ~ [0, doc_height]
width (float) – Width ~ [0, doc_width]
height (float) – Height ~ [0, doc_height]
spatial_object (SpatialObject) – Some object with width and height attributes (i.e: Document, ConvertibleImage).
- Returns:
BoundingBox object in denormalized coordinates: ~ [0, doc_height] x [0, doc_width]
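The corner-based and width/height-based constructors describe the same rectangle in two ways. A minimal sketch of the arithmetic, using hypothetical helper functions rather than the library's own code:

```python
from typing import Tuple


def corners_to_xywh(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float, float, float]:
    """(left, top, right, bottom) -> (x, y, width, height)."""
    return (x1, y1, x2 - x1, y2 - y1)


def xywh_to_corners(x: float, y: float, width: float, height: float) -> Tuple[float, float, float, float]:
    """(x, y, width, height) -> (left, top, right, bottom)."""
    return (x, y, x + width, y + height)


xywh = corners_to_xywh(20, 30, 120, 80)
corners = xywh_to_corners(*xywh)
print(xywh, corners)
```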
- classmethod from_normalized_dict(bbox_dict: Dict[str, float], spatial_object: Optional[SpatialObject] = None)
Builds an axis aligned BoundingBox from a dictionary like {'x': x, 'y': y, 'width': width, 'height': height}. The coordinates will be denormalized according to spatial_object.
- Parameters:
bbox_dict (dict) – Dictionary of normalized co-ordinates.
spatial_object (SpatialObject) – Object with width and height attributes.
- Returns:
Object with denormalized co-ordinates
- Return type:
BoundingBox
- get_distance(bbox)
Returns the distance between the center point of the bounding box and another bounding box
- Returns:
Returns the distance as float
- Return type:
float
- get_intersection(bbox)
Returns the intersection of this object’s bbox and another BoundingBox.
- Returns:
a BoundingBox object
- classmethod is_inside(bbox_a, bbox_b)
Returns true if Bounding Box A is within Bounding Box B
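The geometric helpers above (area, center_is_inside, get_distance, get_intersection, is_inside) can be approximated with a small stand-alone class. This is a sketch under assumptions: the `Box` dataclass and the free functions are hypothetical, boundaries are treated as inclusive, and a box with a negative width or height is given zero area. The library's exact edge-case behaviour may differ.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Box:
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        # A negative dimension is treated as an empty box.
        return max(self.width, 0.0) * max(self.height, 0.0)

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def is_inside(a: Box, b: Box) -> bool:
    """True if box a lies entirely within box b."""
    return (
        a.x >= b.x
        and a.y >= b.y
        and a.x + a.width <= b.x + b.width
        and a.y + a.height <= b.y + b.height
    )


def center_is_inside(a: Box, b: Box) -> bool:
    """True if the center point of box a lies within box b."""
    cx, cy = a.center
    return b.x <= cx <= b.x + b.width and b.y <= cy <= b.y + b.height


def intersection(a: Box, b: Box) -> Box:
    """Overlap of a and b; disjoint boxes yield a negative dimension."""
    left, top = max(a.x, b.x), max(a.y, b.y)
    right = min(a.x + a.width, b.x + b.width)
    bottom = min(a.y + a.height, b.y + b.height)
    return Box(left, top, right - left, bottom - top)


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two center points."""
    return math.dist(a.center, b.center)


cell = Box(0, 0, 100, 50)
word = Box(70, 10, 40, 20)  # sticks out of the cell on the right
print(is_inside(word, cell), center_is_inside(word, cell))
print(intersection(word, cell).area)
```

The center-based test is the more forgiving of the two: a word that slightly overflows its cell fails `is_inside` but still passes `center_is_inside`.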
- class textractor.entities.bbox.SpatialObject(width: float, height: float)
Bases:
ABC
The SpatialObject interface defines an object that has a width and height. This is mostly used as a reference for BoundingBox to be able to provide normalized coordinates.
Document
The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.
- class textractor.entities.document.Document(num_pages: int = 1)
Bases:
SpatialObject
,Linearizable
Represents the description of a single document, as it would appear in the input to the Textract API. Document serves as the root node of the object model hierarchy, which should be used as an intermediate form for most analytic purposes. The Document node also contains the metadata of the document.
- property checkboxes: EntityList[KeyValue]
Returns all the KeyValue objects with SelectionElements present in the Document.
- Returns:
List of KeyValue objects, each representing a checkbox within the Document.
- Return type:
EntityList[KeyValue]
- directional_finder(word_1: str = '', word_2: str = '', page: int = -1, prefix: str = '', direction=Direction.BELOW, entities=[])
The function returns entity types present in entities by prepending the prefix provided by the user. This helps in cases of repeating key-values and checkboxes. The user can manipulate original data or produce a copy. The main advantage of this function is to be able to define direction.
- Parameters:
word_1 (str, required) – The reference word from where x1, y1 coordinates are derived
word_2 (str, optional) – The second word preferably in the direction indicated by the parameter direction. When it isn’t given the end of page coordinates are used in the given direction.
page (int, required) – page number of the page in the document to search the entities in.
prefix (str, optional) – User provided prefix to prepend to the key. Without prefix, the method acts as a search by geometry function.
entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.
- Returns:
Returns the EntityList of modified key-value and/or checkboxes
- Return type:
- property expense_documents: EntityList[ExpenseDocument]
Returns all the ExpenseDocument objects present in the Document.
- Returns:
List of ExpenseDocument objects, each representing an expense document within the Document.
- Return type:
EntityList[ExpenseDocument]
- export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv', sep: str = ';')
Export key-value entities and checkboxes in csv format.
- Parameters:
include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.
sep (str) – Separator to be used in the csv file.
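To make the role of sep concrete, the following sketch writes key-value pairs to CSV text with the standard library. The `write_kv_csv` helper, the `Key`/`Value` header row and the sample pairs are assumptions for this example; the column layout the library actually writes is not described here.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_kv_csv(pairs: Iterable[Tuple[str, str]], handle: TextIO, sep: str = ";") -> None:
    """Write (key, value) rows using `sep` as the delimiter (hypothetical helper)."""
    writer = csv.writer(handle, delimiter=sep)
    writer.writerow(["Key", "Value"])  # assumed header row
    for key, value in pairs:
        writer.writerow([key, value])


pairs = [("Name", "Jane Doe"), ("Subscribed", "SELECTED")]
buffer = io.StringIO()  # stands in for a file opened at `filepath`
write_kv_csv(pairs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A semicolon separator avoids clashes with commas that often appear inside extracted values such as addresses or amounts.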
- export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')
Export key-value entities and checkboxes in txt format.
- Parameters:
include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.
- export_tables_to_excel(filepath)
Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.
- Parameters:
filepath (str, required) – Path to store the exported Excel file.
- filter_checkboxes(selected: bool = True, not_selected: bool = True) List[KeyValue]
Return a list of KeyValue objects containing checkboxes if the document contains them.
- Parameters:
selected (bool) – True/False Return SELECTED checkboxes
not_selected (bool) – True/False Return NOT_SELECTED checkboxes
- Returns:
Returns checkboxes that match the conditions set by the flags.
- Return type:
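The two flags act as independent switches over the selection status. A minimal sketch of that filtering logic, using a hypothetical `Checkbox` record in place of the library's KeyValue objects:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkbox:
    key: str
    is_selected: bool


def filter_by_status(boxes: List[Checkbox], selected: bool = True, not_selected: bool = True) -> List[Checkbox]:
    """Keep checkboxes whose status is enabled by the matching flag (hypothetical helper)."""
    return [
        box
        for box in boxes
        if (box.is_selected and selected) or (not box.is_selected and not_selected)
    ]


boxes = [Checkbox("Newsletter", True), Checkbox("Phone contact", False)]
only_selected = filter_by_status(boxes, selected=True, not_selected=False)
print([box.key for box in only_selected])
```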
- get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6)
Return up to top_k_matches key-value pairs for the key that is queried from the document.
- Parameters:
key (str) – Query key to match
top_k_matches (int) – Maximum number of matches to return
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6
- Returns:
Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.
- Return type:
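The interplay of top_k_matches and similarity_threshold can be illustrated with a stand-alone fuzzy key lookup. This sketch assumes a case-insensitive Levenshtein similarity normalized as 1 - distance / longest length; the function names are hypothetical and the library's own scoring may be computed differently.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def top_matches(query: str, keys: List[str], top_k: int = 1, threshold: float = 0.6) -> List[str]:
    """Keys scoring at least `threshold`, best first, capped at `top_k`."""
    scored = [(similarity(query, key), key) for key in keys]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in kept[:top_k]]


document_keys = ["Date of Birth", "Full Name", "Address"]
best = top_matches("date of brith", document_keys)
print(best)
```

Raising the threshold trades recall for precision: OCR noise or a misspelled query is tolerated only while the score stays above it.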
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- property identity_document: EntityList[IdentityDocument]
Returns all the IdentityDocument objects present in the Document.
- Returns:
List of IdentityDocument objects.
- Return type:
EntityList[IdentityDocument]
- property identity_documents: EntityList[IdentityDocument]
Returns all the IdentityDocument objects present in the Document.
- Returns:
List of IdentityDocument objects, each representing an identity document within the Document.
- Return type:
EntityList[IdentityDocument]
- property images: List[Image]
Returns all the page images in the Document.
- Returns:
List of PIL Image objects.
- Return type:
PIL.Image
- independent_words()
- Returns:
Return all words in the document, outside of tables, checkboxes, key-values.
- Return type:
- property key_values: EntityList[KeyValue]
Returns all the KeyValue objects present in the Document.
- Returns:
List of KeyValue objects, each representing a key-value pair within the Document.
- Return type:
EntityList[KeyValue]
- keys(include_checkboxes: bool = True) List[str]
Prints all keys for key-value pairs and checkboxes if the document contains them.
- Parameters:
include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.
- Returns:
List of strings containing key names in the Document
- Return type:
List[str]
- property layouts: EntityList[Layout]
Returns all the Layout objects present in the Document.
- Returns:
List of Layout objects
- Return type:
EntityList[Layout]
- property lines: EntityList[Line]
Returns all the Line objects present in the Document.
- Returns:
List of Line objects, each representing a line within the Document.
- Return type:
EntityList[Line]
- classmethod open(fp: Union[dict, str, Path, IO])
Create a Document object from a JSON file path, file handle or response dictionary
- Parameters:
fp (Union[dict, str, Path, IO[AnyStr]]) – JSON file path, file handle or response dictionary
- Raises:
InputError – Raised on input not being of type Union[dict, str, Path, IO[AnyStr]]
- Returns:
Document object
- Return type:
- page(page_no: int = 0)
Returns the Page object(s) for the input page_no. Follows zero-indexing.
- property pages: List[Page]
Returns all the Page objects present in the Document.
- Returns:
List of Page objects, each representing a Page within the Document.
- Return type:
List
- property queries: EntityList[Query]
Returns all the Query objects present in the Document.
- Returns:
List of Query objects.
- Return type:
EntityList[Query]
- return_duplicates()
Returns a dictionary containing page numbers as keys and list of EntityList objects as values. Each EntityList instance contains the key-values and the last item is the table which contains duplicate information. This function is intended to let the Textract user know of duplicate objects extracted by the various Textract models.
- Returns:
Dictionary containing page numbers as keys and list of EntityList objects as values.
- Return type:
Dict[page_num, List[EntityList[DocumentEntity]]]
- search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Line]
Return a list of top_k lines that contain the queried keyword.
- Parameters:
keyword (str) – Keyword that is used to query the document.
top_k (int) – Number of closest line objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6
- Returns:
Returns a list of lines that contain the queried key sorted from highest to lowest similarity.
- Return type:
- search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) List[Word]
Return a list of top_k words that match the keyword.
- Parameters:
keyword (str) – Keyword that is used to query the document.
top_k (int) – Number of closest word objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6
- Returns:
Returns a list of words that match the queried key sorted from highest to lowest similarity.
- Return type:
- property signatures: EntityList[Signature]
Returns all the Signature objects present in the Document.
- Returns:
List of Signature objects.
- Return type:
EntityList[Signature]
- property tables: EntityList[Table]
Returns all the Table objects present in the Document.
- Returns:
List of Table objects, each representing a table within the Document.
- Return type:
EntityList[Table]
- property text: str
Returns the document text as one string
- Returns:
Page text separated by line return
- Return type:
str
- to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False))
Returns the HTML representation of the document. This effectively calls Linearizable.to_html(), but adds <html><body></body></html> around the result and puts each page in a <div>.
- Returns:
HTML text of the entity
- Return type:
str
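The wrapping described for to_html can be pictured as follows. This is only a sketch: `wrap_pages_as_html` is a hypothetical helper, the per-page fragments are made up, and the whitespace or attributes the library emits may differ.

```python
from typing import List


def wrap_pages_as_html(page_fragments: List[str]) -> str:
    """Put each page fragment in a <div>, then wrap everything in <html><body>."""
    page_divs = "".join(f"<div>{fragment}</div>" for fragment in page_fragments)
    return f"<html><body>{page_divs}</body></html>"


html = wrap_pages_as_html(["<p>Page one</p>", "<p>Page two</p>"])
print(html)
```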
- to_trp2()
Parses the response to the trp2 format for backward compatibility
- Returns:
TDocument object that can be used with the older Textractor libraries
- Return type:
TDocument
- visualize(*args, **kwargs)
Returns the object’s children in a visualization EntityList object
- Returns:
Returns an EntityList object
- Return type:
- property words: EntityList[Word]
Returns all the Word objects present in the Document.
- Returns:
List of Word objects, each representing a word within the Document.
- Return type:
EntityList[Word]
LazyDocument
The Document class is defined to host all the various DocumentEntity objects within it. DocumentEntity objects can be accessed, searched and exported using the functions given below.
- class textractor.entities.lazy_document.LazyDocument(job_id: str, api: TextractAPI, textract_client=None, images=None, output_config: Optional[OutputConfig] = None)
Bases:
object
LazyDocument is a proxy for Document when using the async APIs. It will not load the response until one of its properties is used. You can access the underlying Document object using the document property.
- property document: Document
Getter for the underlying Document object
- Returns:
Proxied Document object
- Return type:
Document
- property s3_polling_interval: int
Getter for the polling interval
- Returns:
Time between get_full_result calls
- Return type:
int
- property textract_polling_interval: int
Getter for the polling interval
- Returns:
Time between get_full_result calls
- Return type:
int
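The lazy-loading behaviour can be sketched with a generic proxy that defers an expensive load until the first attribute access. `LazyProxy`, `FakeDocument` and `load_document` are hypothetical names for this example; the real class polls the Textract service, which is not reproduced here.

```python
from typing import Any, Callable, List, Optional


class LazyProxy:
    """Defers calling `loader` until the wrapped object is first needed."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._document: Optional[Any] = None

    @property
    def document(self) -> Any:
        # Load once, then reuse the cached object.
        if self._document is None:
            self._document = self._loader()
        return self._document

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup fails, so proxy to the loaded object.
        return getattr(self.document, name)


class FakeDocument:
    text = "hello"


load_calls: List[int] = []


def load_document() -> FakeDocument:
    load_calls.append(1)  # record each (simulated) expensive fetch
    return FakeDocument()


lazy = LazyProxy(load_document)
calls_before_access = len(load_calls)
first_text = lazy.text
second_text = lazy.text
print(calls_before_access, first_text, len(load_calls))
```

Creating the proxy costs nothing; the load happens on the first access and is reused afterwards.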
DocumentEntity
DocumentEntity is the class that all Document entities such as Word, Line, Table etc. inherit from. This class provides methods useful to all such entities.
- class textractor.entities.document_entity.DocumentEntity(entity_id: str, bbox: BoundingBox)
Bases:
Linearizable
,ABC
An interface for all document entities within the document body, composing the hierarchy of the document object model. The purpose of this class is to define properties common to all document entities i.e. unique id and bounding box.
- add_children(children)
Adds children to all entities that have parent-child relationships.
- Parameters:
children (list) – List of child entities.
- property bbox: BoundingBox
- Returns:
Returns entire bounding box of entity
- Return type:
- property children
- Returns:
Returns children of entity
- Return type:
list
- property confidence: float
Returns the object confidence as predicted by Textract. If the confidence is not available, returns None
- Returns:
Prediction confidence for a document entity, between 0 and 1
- Return type:
float
- property height: float
- Returns:
Returns height for bounding box
- Return type:
float
- property raw_object: Dict
- Returns:
Returns the raw dictionary object that was used to create this Python object
- Return type:
Dict
- remove(entity)
Recursively removes an entity from the child tree of a document entity and updates its bounding box
- Parameters:
entity (DocumentEntity) – Entity
- visit(word_set)
- visualize(*args, **kwargs) EntityList
Returns the object’s children in a visualization EntityList object
- Returns:
Returns an EntityList object
- Return type:
- property width: float
- Returns:
Returns width for bounding box
- Return type:
float
- property x: float
- Returns:
Returns x coordinate for bounding box
- Return type:
float
- property y: float
- Returns:
Returns y coordinate for bounding box
- Return type:
float
Word
Represents a single Word within the Document.
This class contains the metadata associated with the Word entity, including the text transcription, text type, bounding box information, page number, Page ID and confidence of detection.
- class textractor.entities.word.Word(entity_id: str, bbox: BoundingBox, text: str = '', text_type: TextTypes = TextTypes.PRINTED, confidence: float = 0, is_clickable: bool = False, is_structure: bool = False)
Bases:
DocumentEntity
To create a new Word object we need the following:
- Parameters:
entity_id (str) – Unique identifier of the Word entity.
bbox (BoundingBox) – Bounding box of the Word entity.
text (str) – Transcription of the Word object.
text_type (TextTypes) – Enum value stating the type of text stored in the entity. Takes 2 values - PRINTED and HANDWRITING
confidence (float) – value storing the confidence of detection out of 100.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property page: int
- Returns:
Returns the page number of the page the Word entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property text: str
- Returns:
Returns the text transcription of the Word entity.
- Return type:
str
Line
Represents a single Line Entity within the Document.
The Textract API response returns groups of words as LINE BlockTypes. They contain Word entities as children.
This class contains the metadata associated with the Line entity, including the entity ID, bounding box information, child words, page number, Page ID and confidence of detection.
- class textractor.entities.line.Line(entity_id: str, bbox: BoundingBox, words: Optional[List[Word]] = None, confidence: float = 0)
Bases:
DocumentEntity
To create a new Line object we need the following:
- Parameters:
entity_id (str) – Unique identifier of the Line entity.
bbox (BoundingBox) – Bounding box of the line entity.
words (list, optional) – List of the Word entities present in the line
confidence (float, optional) – confidence with which the entity was detected.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns EntityList of Word entities that match the input text type.
- Return type:
- property page
- Returns:
Returns the page number of the page the Line entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
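The parent-child relationship between a line and its words can be sketched with two small dataclasses. The names `TextKind`, `SimpleWord` and `SimpleLine` are hypothetical, and joining word texts with a single space is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TextKind(Enum):
    PRINTED = "PRINTED"
    HANDWRITING = "HANDWRITING"


@dataclass
class SimpleWord:
    text: str
    kind: TextKind = TextKind.PRINTED


@dataclass
class SimpleLine:
    words: List[SimpleWord] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Assumed: a line's text is its words joined by spaces.
        return " ".join(word.text for word in self.words)

    def words_by_kind(self, kind: TextKind = TextKind.PRINTED) -> List[SimpleWord]:
        """Child words whose text type matches `kind`."""
        return [word for word in self.words if word.kind == kind]


line = SimpleLine(
    [SimpleWord("Signed"), SimpleWord("by"), SimpleWord("J. Doe", TextKind.HANDWRITING)]
)
handwritten = [word.text for word in line.words_by_kind(TextKind.HANDWRITING)]
print(line.text)
print(handwritten)
```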
Page
Represents a single Document page, as it would appear in the Textract API output.
The Page object also contains metadata such as the physical dimensions of the page (width, height, in pixels), child_ids etc.
- class textractor.entities.page.Page(id: str, width: int, height: int, page_num: int = -1, child_ids=None)
Bases:
SpatialObject
,Linearizable
Creates a new document, ideally representing a single item in the dataset.
- Parameters:
id (str) – Unique id of the Page
width (float) – Width of page, in pixels
height (float) – Height of page, in pixels
page_num (int) – Page number in the document linked to this Page object
child_ids (List) – IDs of child entities in the Page as determined by Textract
- property checkboxes: EntityList[KeyValue]
Returns all the KeyValue objects with SelectionElement present in the Page.
- Returns:
List of KeyValue objects, each representing a checkbox within the Page.
- Return type:
- property container_layouts: EntityList[Layout]
Returns all the container Layout objects present in the Page.
- Returns:
List of Layout objects.
- Return type:
- directional_finder(word_1: str = '', word_2: str = '', prefix: str = '', direction=Direction.BELOW, entities=[])
The function returns the entity types present in entities, prepending the prefix provided by the user. This helps in cases of repeating key-values and checkboxes. The user can manipulate the original data or produce a copy. The main advantage of this function is that the search direction can be specified.
- Parameters:
word_1 (str, required) – The reference word from where x1, y1 coordinates are derived
word_2 (str, optional) – The second word preferably in the direction indicated by the parameter direction. When it isn’t given the end of page coordinates are used in the given direction.
prefix (str, optional) – User-provided prefix to prepend to the key. Without a prefix, the method acts as a search-by-geometry function.
entities (List[DirectionalFinderType]) – List of DirectionalFinderType inputs.
- Returns:
Returns the EntityList of modified key-value and/or checkboxes
- Return type:
- property expense_documents: EntityList[ExpenseDocument]
Returns all the ExpenseDocument objects present in the Page.
- Returns:
List of ExpenseDocument objects.
- Return type:
- export_kv_to_csv(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.csv')
Export key-value entities and checkboxes in csv format.
- Parameters:
include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.
- export_kv_to_txt(include_kv: bool = True, include_checkboxes: bool = True, filepath: str = 'Key-Values.txt')
Export key-value entities and checkboxes in txt format.
- Parameters:
include_kv (bool) – True if KVs are to be exported. Else False.
include_checkboxes (bool) – True if checkboxes are to be exported. Else False.
filepath (str) – Path to where file is to be stored.
- export_tables_to_excel(filepath)
Creates an excel file and writes each table on a separate worksheet within the workbook. This is stored on the filepath passed by the user.
- Parameters:
filepath (str, required) – Path to store the exported Excel file.
- filter_checkboxes(selected: bool = True, not_selected: bool = True) EntityList[KeyValue]
Return a list of KeyValue objects containing checkboxes if the page contains them.
- Parameters:
selected (bool) – If True, return SELECTED checkboxes
not_selected (bool) – If True, return NOT_SELECTED checkboxes
- Returns:
Returns checkboxes that match the conditions set by the flags.
- Return type:
- get(key: str, top_k_matches: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[KeyValue]
Return up to top_k_matches key-value pairs for the key that is queried from the page.
- Parameters:
key (str) – Query key to match
top_k_matches (int) – Maximum number of matches to return
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6
- Returns:
Returns a list of key-value pairs that match the queried key sorted from highest to lowest similarity.
- Return type:
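As a rough illustration of how top_k_matches and similarity_threshold might interact, the following standalone sketch ranks keys by a normalized Levenshtein similarity. It does not call the library: levenshtein, similarity and top_matches are hypothetical helpers, and both the normalization and the "keep scores at or above the threshold" rule are assumptions rather than the library's documented behavior.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + cost,   # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def top_matches(
    query: str,
    key_values: Dict[str, str],
    top_k: int = 1,
    threshold: float = 0.6,
) -> List[Tuple[float, str, str]]:
    """Return up to top_k (score, key, value) tuples, best match first."""
    scored = [
        (similarity(query.lower(), key.lower()), key, value)
        for key, value in key_values.items()
    ]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:top_k]


form = {"First Name": "Ada", "Last Name": "Lovelace", "Date": "1843"}
matches = top_matches("Frst Name", form, top_k=2)
print([(round(score, 2), key) for score, key, _ in matches])
```

A misspelled query still finds "First Name" first, while "Date" falls below the threshold and is dropped.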
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List[Word]]
Returns the page text and words sorted in reading order
- Parameters:
config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()
- Returns:
Tuple of page text and words
- Return type:
Tuple[str, List[Word]]
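To give a feel for what a few of the configuration fields control, here is a standalone sketch that joins text blocks with a separator and caps consecutive newlines. MiniLinearizationConfig and linearize are hypothetical; only the two field names and their default values are taken from the signature above, and the real linearization logic is certainly more involved.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MiniLinearizationConfig:
    """Hypothetical subset of the options listed in the signature above."""
    layout_element_separator: str = "\n\n"
    max_number_of_consecutive_new_lines: int = 2


def linearize(blocks: List[str], config: MiniLinearizationConfig) -> str:
    """Join blocks with the separator, then cap runs of newlines (assumed behavior)."""
    text = config.layout_element_separator.join(blocks)
    limit = config.max_number_of_consecutive_new_lines
    # Collapse any run of newlines longer than the limit down to the limit.
    return re.sub(r"\n{%d,}" % (limit + 1), "\n" * limit, text)


blocks = ["Title", "First paragraph.\n\n\n\nStill first block.", "Footer"]
default_text = linearize(blocks, MiniLinearizationConfig())
compact_text = linearize(blocks, MiniLinearizationConfig(layout_element_separator="\n"))
print(repr(default_text))
print(repr(compact_text))
```

Changing a single field on the configuration object is enough to change the shape of the output, which is the main reason the method accepts a config instead of many separate arguments.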
- get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) EntityList[Word]
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- independent_words() EntityList[Word]
- Returns:
Return all words in the document, outside of tables, checkboxes, key-values.
- Return type:
- property key_values: EntityList[KeyValue]
Returns all the KeyValue objects present in the Page.
- Returns:
List of KeyValue objects, each representing a key-value pair within the Page.
- Return type:
- keys(include_checkboxes: bool = True) List[str]
Returns the keys of all key-value pairs, and of checkboxes if the page contains them.
- Parameters:
include_checkboxes (bool) – True/False. Set False if checkboxes need to be excluded.
- Returns:
List of strings containing key names in the Page
- Return type:
List[str]
- property layouts: EntityList[Layout]
Returns all the Layout objects present in the Page.
- Returns:
List of Layout objects.
- Return type:
- property leaf_layouts: EntityList[Layout]
Returns all the leaf Layout objects present in the Page.
- Returns:
List of Layout objects.
- Return type:
- property lines: EntityList[Line]
Returns all the Line objects present in the Page.
- Returns:
List of Line objects, each representing a line within the Page.
- Return type:
- property page_layout: PageLayout
- property queries: EntityList[Query]
Returns all the Query objects present in the Page.
- Returns:
List of Query objects.
- Return type:
- return_duplicates()
Returns a list of EntityList objects. Each EntityList instance contains the key-values, and its last item is the table that contains the duplicate information. This function is intended to let the Textract user know of duplicate objects extracted by the various Textract models.
- Returns:
List of EntityList objects each containing the intersection of KeyValue and Table entities on the page.
- Return type:
List[EntityList]
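One plausible way to detect such overlaps is a geometric test on bounding boxes. The sketch below is standalone and purely illustrative: Box, contains_center_of and find_duplicates are hypothetical names, and the "center of the key-value lies inside the table" criterion is an assumption, since the text above does not state which criterion the library applies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Hypothetical axis-aligned box: top-left corner plus width and height."""
    x: float
    y: float
    width: float
    height: float

    def contains_center_of(self, other: "Box") -> bool:
        """True when the center point of `other` lies inside this box."""
        cx = other.x + other.width / 2
        cy = other.y + other.height / 2
        return (
            self.x <= cx <= self.x + self.width
            and self.y <= cy <= self.y + self.height
        )


def find_duplicates(
    key_values: Dict[str, Box], tables: Dict[str, Box]
) -> List[List[str]]:
    """For each table, list the overlapping key-value names, with the table name last."""
    groups = []
    for table_name, table_box in tables.items():
        inside = [
            name for name, box in key_values.items()
            if table_box.contains_center_of(box)
        ]
        if inside:
            groups.append(inside + [table_name])  # table goes last
    return groups


key_values = {
    "Total": Box(0.1, 0.5, 0.2, 0.05),
    "Name": Box(0.1, 0.05, 0.2, 0.05),
}
tables = {"table-1": Box(0.05, 0.4, 0.9, 0.4)}
duplicates = find_duplicates(key_values, tables)
print(duplicates)
```

Each inner list mirrors the layout described above: the key-values come first and the table that duplicates them is the final item.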
- search_lines(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: int = 0.6) EntityList[Line]
Return a list of top_k lines that contain the queried keyword.
- Parameters:
keyword (str) – Keyword that is used to query the page.
top_k (int) – Number of closest line objects to be returned
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar page key is to queried key. default=0.6
- Returns:
Returns a list of lines that contain the queried key sorted from highest to lowest similarity.
- Return type:
- search_words(keyword: str, top_k: int = 1, similarity_metric: SimilarityMetric = SimilarityMetric.LEVENSHTEIN, similarity_threshold: float = 0.6) EntityList[Word]
Return a list of top_k words that match the keyword.
- Parameters:
keyword (str, required) – Keyword that is used to query the document.
top_k (int, optional) – Number of closest word objects to be returned. default=1
similarity_metric (SimilarityMetric) – SimilarityMetric.COSINE, SimilarityMetric.EUCLIDEAN or SimilarityMetric.LEVENSHTEIN. SimilarityMetric.LEVENSHTEIN is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6
- Returns:
Returns a list of words that match the queried key sorted from highest to lowest similarity.
- Return type:
- property signatures: EntityList[Signature]
Returns all the Signature objects present in the Page.
- Returns:
List of Signature objects.
- Return type:
- property tables: EntityList[Table]
Returns all the Table objects present in the Page.
- Returns:
List of Table objects, each representing a table within the Page.
- Return type:
- property text: str
Returns the page text
- Returns:
Linearized page text
- Return type:
str
- visualize(*args, **kwargs)
Returns the object’s children in a visualization EntityList object
- Returns:
Returns an EntityList object
- Return type:
- property words: EntityList[Word]
Returns all the Word objects present in the Page.
- Returns:
List of Word objects, each representing a word within the Page.
- Return type:
PageLayout
- class textractor.entities.page_layout.PageLayout(titles: EntityList[Layout] = [], headers: EntityList[Layout] = [], footers: EntityList[Layout] = [], section_headers: EntityList[Layout] = [], page_numbers: EntityList[Layout] = [], lists: EntityList[Layout] = [], figures: EntityList[Layout] = [], tables: EntityList[Layout] = [], key_values: EntityList[Layout] = [])
Bases:
object
Object representation of the layout components detected in the page.
- property figures: EntityList[Layout]
Figures detected in the Page
- Returns:
EntityList of figures detected in the page
- Return type:
- property footers: EntityList[Layout]
Footers detected in the Page
- Returns:
EntityList of footers detected in the page
- Return type:
- property headers: EntityList[Layout]
Headers detected in the Page
- Returns:
EntityList of headers detected in the page
- Return type:
- property key_values: EntityList[Layout]
KeyValues detected in the Page
- Returns:
EntityList of keyvalues detected in the page
- Return type:
- property lists: EntityList[Layout]
Lists detected in the Page
- Returns:
EntityList of lists detected in the page
- Return type:
- property page_numbers: EntityList[Layout]
Page numbers detected in the Page
- Returns:
EntityList of page numbers detected in the page
- Return type:
- property section_headers: EntityList[Layout]
Section headers detected in the Page
- Returns:
EntityList of section headers detected in the page
- Return type:
- property tables: EntityList[Layout]
Tables detected in the Page. This includes Tables detected by the AnalyzeDocument Tables API if used.
- Returns:
EntityList of tables detected in the page
- Return type:
- property titles: EntityList[Layout]
Titles detected in the Page
- Returns:
EntityList of titles detected in the page
- Return type:
Layout
Represents a single Layout entity within the Document.
The Textract API response returns groups of layout as LAYOUT_* BlockTypes.
- class textractor.entities.layout.Layout(entity_id: str, bbox: BoundingBox, reading_order: int, label: str, confidence: float = 0)
Bases:
DocumentEntity
To create a new Layout object we need the following:
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', 
entity_layout_suffix='', figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List[Word]]
Returns the layout object text and words sorted in reading order
- Parameters:
config (TextLinearizationConfig, optional) – Text linearization configuration object, defaults to TextLinearizationConfig()
- Returns:
Tuple of page text and words
- Return type:
Tuple[str, List[Word]]
- property page
- Returns:
Returns the page number of the page the Layout entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property text
Maps to .get_text()
- Returns:
Returns the linearized text of the entity
- Return type:
str
- property words
Table
Represents a Table entity within the document.
Tables are hierarchical objects composed of TableCell objects, which implicitly form columns and rows.
A Table object also carries associated metadata, including TableCell information, headers, page number, and the page ID of the page on which it appears in the document.
- class textractor.entities.table.Table(entity_id, bbox: BoundingBox)
Bases:
DocumentEntity
To create a new Table object we need the following:
- Parameters:
entity_id – Unique identifier of the table.
bbox – Bounding box of the table.
- add_cells(cells: List[TableCell])
Add TableCell objects to the Table. This function does not check the integrity of the table after the cells are added.
- Parameters:
cells (list) – List of TableCell objects, each representing a single cell within the table. No specific ordering is assumed since it is implicitly ordered by row and column index.
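The idea that cells are implicitly ordered by their row and column index can be sketched without the library. MiniCell and cells_to_rows below are hypothetical; the sketch ignores row and column spans and makes no assumption about whether indices start at 0 or 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MiniCell:
    """Hypothetical stand-in for a TableCell, reduced to position and text."""
    row_index: int
    col_index: int
    text: str


def cells_to_rows(cells: List[MiniCell]) -> List[List[str]]:
    """Rebuild a row-major grid from cells given in arbitrary order."""
    by_row: Dict[int, List[MiniCell]] = defaultdict(list)
    for cell in cells:
        by_row[cell.row_index].append(cell)
    return [
        [c.text for c in sorted(by_row[row], key=lambda c: c.col_index)]
        for row in sorted(by_row)
    ]


# Cells arrive unordered; the indices alone determine the layout.
unordered = [
    MiniCell(2, 2, "2.50"),
    MiniCell(1, 1, "Item"),
    MiniCell(2, 1, "Pen"),
    MiniCell(1, 2, "Price"),
]
grid = cells_to_rows(unordered)
print(grid)
```

Because the position travels with each cell, the caller does not need to sort the list before adding it.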
- property checkboxes: List[SelectionElement]
- property column_count
- property column_headers: Dict[str, List[TableCell]]
- Returns:
Returns the column headers of the Table entity.
- Return type:
Dict[str, List[TableCell]]
- Returns:
Returns the table footers.
- Return type:
List[TableFooter]
- get_cells_by_type(cell_type: CellTypes = CellTypes.COLUMN_HEADER)
Returns a dictionary of column_header (str) : List[TableCell] (in order).
- get_columns_by_name(column_names, similarity_metric=SimilarityMetric.COSINE, similarity_threshold=0.6)
Returns a dictionary of format {column_name : List[TableCell]} for the column names listed in param column_names.
- Parameters:
column_names (list) – List of column names of columns to be extracted from table.
similarity_metric (str) – ‘cosine’, ‘euclidean’ or ‘levenshtein’. ‘cosine’ is chosen as default.
similarity_threshold (float) – Measure of how similar document key is to queried key. default=0.6
- Returns:
Returns a new Table consisting of columns passed in column_names.
- Return type:
- get_table_range()
- Returns:
Returns the number of rows and columns in the table.
- Return type:
Tuple(int)
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- get_words_by_type(text_type=TextTypes.PRINTED)
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- property page
- Returns:
Returns the page number of the page the Table entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property row_count
- strip_headers(column_headers: bool = True, in_table_title: bool = False, section_titles=False)
Returns a new Table object after removing all cells that are marked as column headers in the table from the API response.
- Parameters:
column_headers (bool) – Remove the column headers
in_table_title (bool) – Remove the in-table titles
section_titles (bool) – Remove the in-table section titles
- Returns:
Table object after removing the headers.
- Return type:
- property table_type
- Returns:
Returns the table type.
- Return type:
- property title
- Returns:
Returns the table title.
- Return type:
- to_csv(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str
Returns the table in the Comma-Separated-Value (CSV) format
- Parameters:
use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.
config – Text linearization configuration object for the table content
- Returns:
Table as a CSV string.
- Return type:
str
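For readers who want to see what a CSV string for a small table looks like, the following standalone sketch serializes a grid of cell text with Python's standard csv module. rows_to_csv is a hypothetical helper and is not how the library necessarily builds its output; it mainly shows that cell text containing a comma must be quoted.

```python
import csv
import io
from typing import List


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize a row-major grid of cell text into a CSV string."""
    buffer = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["Item", "Price"], ["Pen, blue", "2.50"]]
csv_text = rows_to_csv(rows)
print(csv_text)
```

The second row's first cell is wrapped in double quotes in the output, so a CSV reader will still see two columns.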
- to_excel(filepath=None, workbook=None, save_workbook=True)
Exports the Table entity as an Excel document. The advantage of Excel over CSV is that it can accommodate merged cells, which are common in Textract documents.
- Parameters:
filepath (str) – Path to store the exported Excel file
workbook (xlsxwriter.Workbook) – if xlsxwriter workbook is passed to the function, the table is appended to the last sheet of that workbook.
save_workbook (bool) – Flag to save the workbook. If False, the workbook is returned by the function.
- Returns:
Returns a workbook if save_workbook is False. Otherwise, saves the .xlsx file to the filepath it was initialized with.
- Return type:
xlsxwriter.Workbook
- to_html() str
Returns the table in the HTML format
- Returns:
Table as an HTML string.
- Return type:
str
- to_pandas(use_columns=False, config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Converts the table to a pandas DataFrame
- Parameters:
use_columns – If the first row of the table is made of column headers, use them for the pandas dataframe. Only supports single row header.
config – Text linearization configuration object for the table content
- Returns:
- to_txt()
TableCell
Represents a single TableCell object. The TableCell object contains information such as:
The position info of the cell within the encompassing Table
Properties such as merged-cells span
A hierarchy of words contained within the TableCell (optional)
Page information
Confidence of entity detection.
- class textractor.entities.table_cell.TableCell(entity_id: str, bbox: BoundingBox, row_index: int, col_index: int, row_span: int, col_span: int, confidence: float = 0, is_column_header: bool = False, is_title: bool = False, is_footer: bool = False, is_summary: bool = False, is_section_title: bool = False)
Bases:
DocumentEntity
To create a new TableCell object we need the following:
- Parameters:
entity_id – Unique id of the TableCell object
bbox – Bounding box of the entity
row_index – Row index of position of cell within the table
col_index – Column index of position of cell within the table
row_span – Number of rows the cell spans (1 means no merged cells)
col_span – Number of columns the cell spans (1 means no merged cells)
confidence – Confidence out of 100 with which the Cell was detected.
is_column_header – Indicates if the cell is a column header
is_title – Indicates if the cell is a table title
is_footer – Indicates if the cell is a table footer
is_summary – Indicates if the cell is a summary cell
is_section_title – Indicates if the cell is a section title
- property checkboxes
- property col_index
- Returns:
Returns the column index of the cell in the Table.
- Return type:
int
- property col_span
- Returns:
Returns the column span of the cell in the Table.
- Return type:
int
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Returns the text in the cell as one space-separated string
- Returns:
Text in the cell
- Return type:
Tuple[str, List]
- get_words_by_type(text_type: TextTypes = TextTypes.PRINTED) List[Word]
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- property is_column_header
- property is_section_title
- property is_summary
- property is_title
- merge_direction()
- Returns:
Determines whether the merged cell is a row or column merge. Returns 0 for a row merge, 1 for a column merge, 2 for both, and None if there is no merge.
- Return type:
int, str
- property page
- Returns:
Returns the page number of the page the TableCell entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property row_index
- Returns:
Returns the row index of the cell in the Table.
- Return type:
int
- property row_span
- Returns:
Returns the row span of the cell in the Table.
- Return type:
int
- property table_id
- Returns:
Returns the ID of the Table the TableCell belongs to.
- Return type:
str
- property text: str
Returns the text in the cell as one space-separated string
- Returns:
Text in the cell
- Return type:
str
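The integer codes documented for merge_direction() are easy to misread at a call site. The following is a minimal sketch of one way to translate them into labels. `describe_merge` is a hypothetical helper, not part of textractor, and it relies only on the code values stated above. The documented return type is `int, str`; this sketch takes only the integer code.

```python
from typing import Optional

# Labels for the codes documented for merge_direction():
# 0 = row merge, 1 = column merge, 2 = both.
_MERGE_LABELS = {0: "row", 1: "column", 2: "both"}


def describe_merge(direction: Optional[int]) -> str:
    """Translate a merge direction code into a label (hypothetical helper)."""
    if direction is None:
        # None is documented as "there is no merge".
        return "none"
    if direction not in _MERGE_LABELS:
        raise ValueError(f"unexpected merge direction: {direction!r}")
    return _MERGE_LABELS[direction]


print(describe_merge(0))
print(describe_merge(None))
```

Rejecting unknown codes with an exception, rather than returning a default label, keeps a changed code set from passing silently.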
TableTitle
Represents a single TableTitle object. The TableTitle object contains information such as:
The position of the title within the Document
The words that it contains
Confidence of entity detection
- class textractor.entities.table_title.TableTitle(entity_id: str, bbox: BoundingBox)
Bases:
DocumentEntity
Represents a title that is either in-table or floating
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property is_floating: bool
- Returns:
Returns whether the TableTitle entity is floating or not.
- Return type:
bool
- property page
- Returns:
Returns the page number of the page the TableTitle entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property text: str
Returns the text in the title as one space-separated string
- Returns:
Text in the title
- Return type:
str
- property words
Returns all the Word objects present in the TableTitle.
- Returns:
List of Word objects, each representing a word within the TableTitle.
- Return type:
list
KeyValue
The KeyValue entity is a document entity representing the Forms output. The key in a KeyValue is typically made up of words, and the Value can consist of Word elements or, in the case of checkboxes, a SelectionElement.
This class contains the metadata associated with the KeyValue entity, including the entity ID, bounding box information, value, existence of checkbox, page number, Page ID and confidence of detection.
- class textractor.entities.key_value.KeyValue(entity_id: str, bbox: BoundingBox, contains_checkbox: bool = False, value: Optional[Value] = None, confidence: float = 0)
Bases:
DocumentEntity
To create a new KeyValue object we require the following:
- Parameters:
entity_id (str) – Unique identifier of the KeyValue entity.
bbox (BoundingBox) – Bounding box of the KeyValue entity.
contains_checkbox (bool) – True/False to indicate if the value is a checkbox.
value (Value) – Value object that maps to the KeyValue entity.
confidence (float) – confidence with which the entity was detected.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- get_words_by_type(text_type: str = TextTypes.PRINTED) List[Word]
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- is_selected() bool
For KeyValues containing a selection item, returns its is_selected status
- Returns:
Selection status of a selection item key value pair
- Return type:
bool
- property key
- Returns:
Returns EntityList[Word] object (a list of words) associated with the key.
- Return type:
- property ocr_confidence
Returns the average OCR confidence.
- property page: int
- Returns:
Returns the page number of the page the KeyValue entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property value: Value
- Returns:
Returns the Value mapped to the key if it has been assigned.
- Return type:
Value
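A common use of the key and value properties is to flatten form fields into a plain dictionary. The sketch below illustrates the idea with hypothetical stand-in classes (`StubWord`, `StubValue`, `StubKeyValue`) rather than the textractor entities themselves, so the attribute layout, in particular `StubValue.words`, is an assumption. It also covers the case where no Value has been assigned to a key.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StubWord:
    """Hypothetical stand-in for a Word entity; only the text is modelled."""
    text: str


@dataclass
class StubValue:
    """Hypothetical stand-in for a Value entity and its child words."""
    words: List[StubWord] = field(default_factory=list)


@dataclass
class StubKeyValue:
    """Hypothetical stand-in for a KeyValue entity."""
    key: List[StubWord]
    value: Optional[StubValue] = None  # the value may not have been assigned


def join_words(words: List[StubWord]) -> str:
    """Join word texts into one space-separated string."""
    return " ".join(word.text for word in words)


def key_values_to_dict(pairs: List[StubKeyValue]) -> Dict[str, str]:
    """Map each key text to its value text, using '' when no value is assigned."""
    result: Dict[str, str] = {}
    for pair in pairs:
        value_text = join_words(pair.value.words) if pair.value is not None else ""
        result[join_words(pair.key)] = value_text
    return result


pairs = [
    StubKeyValue(key=[StubWord("First"), StubWord("name:")],
                 value=StubValue([StubWord("Jane")])),
    StubKeyValue(key=[StubWord("Phone:")]),
]
form = key_values_to_dict(pairs)
print(form)
```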
Value
Represents a single Value entity within the Document.
The Textract API response returns groups of words as KEY_VALUE_SET BlockTypes. These may be of KEY or VALUE type, which is indicated by the EntityType attribute in the JSON response.
This class contains the metadata associated with the Value entity, including the entity ID, bounding box information, child words, associated key ID, page number, Page ID, confidence of detection and whether it is a checkbox.
- class textractor.entities.value.Value(entity_id: str, bbox: BoundingBox, confidence: float = 0)
Bases:
DocumentEntity
To create a new Value object we need the following:
- Parameters:
entity_id (str) – Unique identifier of the Value entity.
bbox (BoundingBox) – Bounding box of the Value entity.
confidence (float) – value storing the confidence of detection out of 100.
- property contains_checkbox: bool
Returns True if the value associated is a SelectionElement.
- Returns:
Returns True if the value associated is a checkbox/SelectionElement.
- Return type:
bool
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- get_words_by_type(text_type: str = TextTypes.PRINTED) List[Word]
Returns list of Word entities that match the input text type.
- Parameters:
text_type (TextTypes) – TextTypes.PRINTED or TextTypes.HANDWRITING
- Returns:
Returns list of Word entities that match the input text type.
- Return type:
- property key_id: str
Returns the associated Key ID for the Value entity.
- Returns:
Returns the associated KeyValue object ID.
- Return type:
str
- property page
- Returns:
Returns the page number of the page the Value entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
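The get_words_by_type method documented for Value and KeyValue selects words by their text type. The filtering idea can be sketched as below, assuming each word carries a text type. `StubTextTypes` and `TypedWord` are hypothetical stand-ins for illustration, not the textractor classes, and the function is not the library's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class StubTextTypes(Enum):
    """Hypothetical stand-in mirroring the two documented text types."""
    PRINTED = "PRINTED"
    HANDWRITING = "HANDWRITING"


@dataclass
class TypedWord:
    """Hypothetical stand-in for a Word entity with a text type."""
    text: str
    text_type: StubTextTypes = StubTextTypes.PRINTED


def filter_words_by_type(
    words: List[TypedWord],
    text_type: StubTextTypes = StubTextTypes.PRINTED,
) -> List[TypedWord]:
    """Keep only the words whose text type matches, preserving their order."""
    return [word for word in words if word.text_type == text_type]


words = [
    TypedWord("Total"),
    TypedWord("42", StubTextTypes.HANDWRITING),
    TypedWord("USD"),
]
handwritten = filter_words_by_type(words, StubTextTypes.HANDWRITING)
printed = filter_words_by_type(words)
print([word.text for word in handwritten])
print([word.text for word in printed])
```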
SelectionElement
Represents a single SelectionElement/Checkbox/Clickable entity within the Document.
This class contains the metadata associated with the SelectionElement entity, including the entity ID, bounding box information, selection status, page number, Page ID and confidence of detection.
- class textractor.entities.selection_element.SelectionElement(entity_id: str, bbox: BoundingBox, status: SelectionStatus, confidence: float = 0)
Bases:
DocumentEntity
To create a new SelectionElement object we need the following:
- Parameters:
entity_id (str) – Unique identifier of the SelectionElement entity.
bbox (BoundingBox) – Bounding box of the SelectionElement
status (SelectionStatus) – SelectionStatus.SELECTED / SelectionStatus.NOT_SELECTED
confidence (float) – Confidence with which this entity is detected.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- is_selected() bool
- Returns:
Returns True / False depending on selection status of the SelectionElement.
- Return type:
bool
- property page
- Returns:
Returns the page number of the page the SelectionElement entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
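The relationship between the selection status, is_selected() and the selection_element_selected / selection_element_not_selected defaults in the configuration shown above can be sketched as follows. `StubSelectionStatus` is a hypothetical stand-in for SelectionStatus, and `selection_token` is an illustrative helper. How the library itself renders a checkbox during linearization is an assumption here.

```python
from enum import Enum


class StubSelectionStatus(Enum):
    """Hypothetical stand-in mirroring the two documented statuses."""
    SELECTED = "SELECTED"
    NOT_SELECTED = "NOT_SELECTED"


def is_selected(status: StubSelectionStatus) -> bool:
    """Return True only for the SELECTED status."""
    return status is StubSelectionStatus.SELECTED


def selection_token(
    status: StubSelectionStatus,
    selected: str = "[X]",      # default of selection_element_selected
    not_selected: str = "[ ]",  # default of selection_element_not_selected
) -> str:
    """Pick the text token that stands for a checkbox in linearized text."""
    return selected if is_selected(status) else not_selected


print(selection_token(StubSelectionStatus.SELECTED))
print(selection_token(StubSelectionStatus.NOT_SELECTED))
```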
Query
The Query entity is a document entity representing the Queries output. It pairs a query with its result, merging the QUERY and QUERY_RESULT blocks of the Textract response.
This class contains the metadata associated with the Query entity, including the entity ID, query text, alias, query result, result bounding box, page number and Page ID.
- class textractor.entities.query.Query(entity_id: str, query: str, alias: str, query_result: Optional[QueryResult], result_bbox: Optional[BoundingBox])
Bases:
DocumentEntity
The Query object merges QUERY and QUERY_RESULT blocks. To create a new Query object we require the following:
- Parameters:
entity_id (str) – Unique identifier of the Query entity.
query (str) – Text of the query.
alias (str) – Alias of the query.
query_result (Optional[QueryResult]) – QueryResult object that maps to the Query entity, if any.
result_bbox (Optional[BoundingBox]) – Bounding box of the query result, if any.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the Query and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property has_result: bool
- Returns:
Returns whether there was a result associated with the query
- Return type:
bool
- property page: int
- Returns:
Returns the page number of the page the Query entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
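Because query_result is optional, callers usually check has_result before reading an answer. Here is a minimal sketch of that pattern, using hypothetical stand-ins (`StubQuery`, `StubQueryResult`) instead of the textractor classes. It assumes has_result simply reflects whether a result object is attached, which the documentation above does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubQueryResult:
    """Hypothetical stand-in for a QueryResult; only the answer and confidence."""
    answer: str
    confidence: float


@dataclass
class StubQuery:
    """Hypothetical stand-in for a Query entity."""
    query: str
    alias: str
    query_result: Optional[StubQueryResult] = None

    @property
    def has_result(self) -> bool:
        # Assumption: a query has a result when a result object is attached.
        return self.query_result is not None


def answers_by_alias(queries: List[StubQuery]) -> Dict[str, Optional[str]]:
    """Map each query alias to its answer, or None when there is no result."""
    return {
        q.alias: q.query_result.answer if q.has_result else None
        for q in queries
    }


queries = [
    StubQuery("What is the invoice total?", "TOTAL", StubQueryResult("42.00", 98.5)),
    StubQuery("What is the due date?", "DUE_DATE"),
]
answers = answers_by_alias(queries)
print(answers)
```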
QueryResult
The QueryResult entity is a document entity representing the answer to a Query. It corresponds to a QUERY_RESULT block of the Textract response.
This class contains the metadata associated with the QueryResult entity, including the entity ID, confidence of detection, result bounding box, answer, page number and Page ID.
- class textractor.entities.query_result.QueryResult(entity_id: str, confidence: float, result_bbox: BoundingBox, answer: str)
Bases:
DocumentEntity
The QueryResult object represents QUERY_RESULT blocks. To create a new QueryResult object we require the following:
- Parameters:
entity_id (str) – Unique identifier of the QueryResult entity.
confidence (float) – confidence with which the entity was detected.
result_bbox (BoundingBox) – Bounding box of the QueryResult entity.
answer (str) – Answer text of the query result.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the QueryResult and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property page: int
- Returns:
Returns the page number of the page the QueryResult entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
Signature
Represents a single Signature entity within the Document.
The Textract API response returns signatures as SIGNATURE BlockTypes.
This class contains the metadata associated with the Signature entity, including the entity ID, bounding box information, page number, Page ID and confidence of detection.
- class textractor.entities.signature.Signature(entity_id: str, bbox: BoundingBox, confidence: float = 0)
Bases:
DocumentEntity
To create a new Signature object we need the following:
- Parameters:
entity_id (str) – Unique identifier of the signature entity.
bbox (BoundingBox) – Bounding box of the signature entity.
words (list, optional) – List of the Word entities present in the signature
confidence (float, optional) – confidence with which the entity was detected.
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property page
- Returns:
Returns the page number of the page the Signature entity is present in.
- Return type:
int
- property page_id: str
- Returns:
Returns the Page ID attribute of the page which the entity belongs to.
- Return type:
str
- property words
- Returns:
Returns an empty list
- Return type:
list
ExpenseDocument
The ExpenseDocument class is the object representation of an AnalyzeExpense response. Despite its name it does not inherit from Document; it is a DocumentEntity that holds the summary fields and line item groups of a single expense document.
- class textractor.entities.expense_document.ExpenseDocument(summary_fields: List[ExpenseField], line_items_groups: List[LineItemGroup], bounding_box: BoundingBox, page: int)
Bases:
DocumentEntity
Represents the description of a single expense document.
- property bbox
- Returns:
Returns entire bounding box of entity
- Return type:
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True))
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property line_items_groups: List[LineItemGroup]
- property page
- property summary_fields_list
- class textractor.entities.expense_document.Fields
Bases:
dict
Dictionary to hold Summary Fields. Properties are added dynamically to enable ease of discovery.
- class textractor.entities.expense_document.FieldsGroups
Bases:
dict
Summary Fields Group dictionary {GROUP_KEY_NAME: {GROUP_ID_1: [SUMMARY_FIELD1, SUMMARY_FIELD2]}}
- get_group_bboxes(key: str)
Returns the enclosing bboxes for each group for a given group key.
- Parameters:
key (str) – Group key, e.g. VENDOR
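The nested layout {GROUP_KEY_NAME: {GROUP_ID_1: [SUMMARY_FIELD1, SUMMARY_FIELD2]}} and the idea of one enclosing box per group can be sketched as below. This is an illustration under simplifying assumptions, not the library's code: each summary field is reduced to a plain (x, y, width, height) tuple, and `enclosing_box` and `group_bboxes` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

# Simplified box: (x, y, width, height). The real groups hold field objects.
Box = Tuple[float, float, float, float]


def enclosing_box(boxes: List[Box]) -> Box:
    """Return the smallest box that contains every box in the list."""
    if not boxes:
        raise ValueError("at least one box is required")
    left = min(x for x, _, _, _ in boxes)
    top = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = max(y + h for _, y, _, h in boxes)
    return (left, top, right - left, bottom - top)


def group_bboxes(groups: Dict[str, Dict[str, List[Box]]], key: str) -> List[Box]:
    """Return one enclosing box per group ID stored under the given group key."""
    return [enclosing_box(fields) for fields in groups[key].values()]


groups = {
    "VENDOR": {
        "group-1": [(0.0, 0.0, 1.0, 1.0), (2.0, 1.0, 1.0, 2.0)],
        "group-2": [(5.0, 5.0, 2.0, 1.0)],
    }
}
vendor_boxes = group_bboxes(groups, "VENDOR")
print(vendor_boxes)
```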
Expense
- class textractor.entities.expense_field.Expense(bbox: BoundingBox, text: str, confidence: float, page: int)
Bases:
DocumentEntity
Holds the Key or the Value of an Expense
- property geometry
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the Expense and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property page
- property text
Maps to .get_text()
- Returns:
Returns the linearized text of the entity
- Return type:
str
- class textractor.entities.expense_field.ExpenseField(type: ExpenseType, value: Expense, group_properties: List[ExpenseGroupProperty], page: int, label: Optional[Expense] = None, currency=None)
Bases:
DocumentEntity
The ExpenseField holds the information of a given summary field: its key, value and type. The bounding box of an ExpenseField is the enclosing box of all its components.
- property bbox: BoundingBox
- Returns:
Returns entire bounding box of entity
- Return type:
BoundingBox
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the ExpenseField and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property group_properties: List[ExpenseGroupProperty]
- property page: int
- property type: ExpenseType
- class textractor.entities.expense_field.ExpenseGroupProperty(id: str, types: List[str])
Bases:
object
Describes, for a given ExpenseField, which group it belongs to and the related types of that group
- id: str
- types: List[str]
- class textractor.entities.expense_field.ExpenseType(text: str, confidence: float, raw_object: object)
Bases:
object
Type of an ExpenseField, e.g. TOTAL or SUBTOTAL
- confidence: float
- raw_object: object
- text: str
- class textractor.entities.expense_field.LineItemGroup(index, line_item_rows: List[LineItemRow], page: int)
Bases:
DocumentEntity
A LineItemGroup contains several LineItemRow objects. It is often similar to a table in invoices, but in receipts the table structure can be looser and less aligned.
- property bbox
- Returns:
Returns entire bounding box of entity
- Return type:
BoundingBox
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the LineItemGroup and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property index
- property page
- property rows
- to_csv()
- to_json()
- to_pandas(include_EXPENSE_ROW=False)
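The export methods above are listed without descriptions. As a rough illustration of flattening loosely aligned line item rows into tabular text, the sketch below uses only the standard library. The row layout (one dict per row, keyed by expense type) and the helper `rows_to_csv` are assumptions made for illustration; they are not the library's implementation and may not match its output format.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Write rows to CSV, using the union of keys (first-seen order) as the header."""
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    # restval fills in fields that a row does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical receipt rows; loosely aligned rows may be missing some fields.
rows = [
    {"ITEM": "Coffee", "QUANTITY": "2", "PRICE": "7.00"},
    {"ITEM": "Bagel", "PRICE": "3.50"},
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

Taking the union of keys as the header is one way to cope with receipt rows that do not all carry the same fields; missing cells are simply left empty.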
- class textractor.entities.expense_field.LineItemRow(index, line_item_expense_fields: List[ExpenseField], page: int)
Bases:
DocumentEntity
A LineItemRow contains several ExpenseField objects that are all inside the row. They do not always align in structured columns as tables do.
- property bbox
- Returns:
Returns entire bounding box of entity
- Return type:
BoundingBox
- property expenses
- get(index)
- get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the LineItemRow and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property page
IdentityDocument
The IdentityDocument class is the object representation of an AnalyzeID response. It is similar to a dictionary. Despite its name, it does not inherit from Document, as the AnalyzeID response does not contain position information.
- class textractor.entities.identity_document.IdentityDocument(fields=None)
Bases:
SpatialObject
Represents the description of a single ID document.
- property fields: Dict[str, IdentityField]
- get(key: Union[str, AnalyzeIDFields]) → Optional[str]
- keys() → List[str]
- values() → List[str]
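To illustrate the dictionary-like access pattern listed above (`get`, `keys`, `values`), here is a small stand-in class. `IdDocumentSketch` and `FieldName` are hypothetical; in particular, resolving an enum key through its `.value` is an assumption about how a `Union[str, AnalyzeIDFields]` key could be handled, not a statement about the library's code.

```python
from enum import Enum
from typing import Dict, List, Optional, Union


class FieldName(Enum):
    """Hypothetical stand-in for an enum of known ID field names."""
    FIRST_NAME = "FIRST_NAME"
    LAST_NAME = "LAST_NAME"


class IdDocumentSketch:
    """Dictionary-like holder of ID fields; a stand-in, not the textractor class."""

    def __init__(self, fields: Optional[Dict[str, str]] = None):
        self._fields: Dict[str, str] = dict(fields or {})

    def get(self, key: Union[str, FieldName]) -> Optional[str]:
        # Accept either a plain string or an enum member as the key.
        name = key.value if isinstance(key, FieldName) else key
        return self._fields.get(name)

    def keys(self) -> List[str]:
        return list(self._fields.keys())

    def values(self) -> List[str]:
        return list(self._fields.values())


doc = IdDocumentSketch({"FIRST_NAME": "JANE", "LAST_NAME": "DOE"})
print(doc.get(FieldName.FIRST_NAME), doc.get("LAST_NAME"), doc.get("MIDDLE_NAME"))
```

Returning `None` for an absent key mirrors the `Optional[str]` return annotation of `get`.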
IdentityField
Linearizable
Linearizable
is a class that defines how a component can be linearized (converted to text)
- class textractor.entities.linearizable.Linearizable
Bases:
ABC
- get_text(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', figure_layout_prefix='', 
figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str
Returns the linearized text of the entity
- Parameters:
config – Text linearization config
- Returns:
Linearized text of the entity
- Return type:
str
- abstract get_text_and_words(config: TextLinearizationConfig = TextLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='plaintext', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) Tuple[str, List]
Used for linearization, returns the linearized text of the entity and the matching words
- Returns:
Tuple of text and word list
- Return type:
Tuple[str, List[Word]]
- property text: str
Maps to .get_text()
- Returns:
Returns the linearized text of the entity
- Return type:
str
- to_html(config: HTMLLinearizationConfig = HTMLLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='<div>', page_num_suffix='</div>', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='<div>', list_layout_suffix='</div>', list_element_prefix='', list_element_suffix='', title_prefix='<h1>', title_suffix='</h1>', table_layout_prefix='<div>', table_layout_suffix='</div>', table_remove_column_headers=False, table_column_header_threshold=0.9, table_linearization_format='html', table_add_title_as_caption=True, table_add_footer_as_paragraph=True, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='', table_flatten_semi_structured_as_plaintext=False, table_prefix='<table>', table_suffix='</table>', table_row_separator='\n', table_row_prefix='<tr>', table_row_suffix='</tr>', table_cell_prefix='<td>', table_cell_suffix='</td>', table_cell_header_prefix='<th>', table_cell_header_suffix='</th>', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='<h1>', header_suffix='</h1>', section_header_prefix='<h2>', section_header_suffix='</h2>', text_prefix='<p>', text_suffix='</p>', key_value_layout_prefix='<div>', key_value_layout_suffix='</div>', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', 
value_prefix='', value_suffix='', entity_layout_prefix='<p>', entity_layout_suffix='</p>', figure_layout_prefix='<div>', figure_layout_suffix='</div>', footer_layout_prefix='<div>', footer_layout_suffix='</div>', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True, add_ids_to_html_tags=False, add_short_ids_to_html_tags=False)) str
Returns the HTML representation of the entity
- Returns:
HTML text of the entity
- Return type:
str
- to_markdown(config: MarkdownLinearizationConfig = MarkdownLinearizationConfig(remove_new_lines_in_leaf_elements=True, max_number_of_consecutive_new_lines=2, max_number_of_consecutive_spaces=None, hide_header_layout=False, hide_footer_layout=False, hide_figure_layout=False, hide_table_layout=False, hide_key_value_layout=False, hide_page_num_layout=False, page_num_prefix='', page_num_suffix='', same_paragraph_separator=' ', same_layout_element_separator='\n', layout_element_separator='\n\n', list_element_separator='\n', list_layout_prefix='', list_layout_suffix='', list_element_prefix='', list_element_suffix='', title_prefix='# ', title_suffix='', table_layout_prefix='\n\n', table_layout_suffix='\n', table_remove_column_headers=True, table_column_header_threshold=0.9, table_linearization_format='markdown', table_add_title_as_caption=False, table_add_footer_as_paragraph=False, table_tabulate_format='github', table_tabulate_remove_extra_hyphens=False, table_duplicate_text_in_merged_cells=False, table_flatten_headers=False, table_min_table_words=0, table_column_separator='\t', table_flatten_semi_structured_as_plaintext=False, table_prefix='', table_suffix='', table_row_separator='\n', table_row_prefix='', table_row_suffix='', table_cell_prefix='', table_cell_suffix='', table_cell_header_prefix='', table_cell_header_suffix='', table_cell_empty_cell_placeholder='', table_cell_merge_cell_placeholder='', table_cell_left_merge_cell_placeholder='', table_cell_top_merge_cell_placeholder='', table_cell_cross_merge_cell_placeholder='', table_title_prefix='', table_title_suffix='', table_footers_prefix='', table_footers_suffix='', header_prefix='', header_suffix='', section_header_prefix='## ', section_header_suffix='', text_prefix='', text_suffix='', key_value_layout_prefix='', key_value_layout_suffix='', key_value_prefix='', key_value_suffix='', key_prefix='', key_suffix=' ', value_prefix='', value_suffix='', entity_layout_prefix='', entity_layout_suffix='', 
figure_layout_prefix='', figure_layout_suffix='', footer_layout_prefix='', footer_layout_suffix='', selection_element_selected='[X]', selection_element_not_selected='[ ]', heuristic_h_tolerance=0.3, heuristic_line_break_threshold=0.9, heuristic_overlap_ratio=0.5, signature_token='[SIGNATURE]', add_prefixes_and_suffixes_as_words=False, add_prefixes_and_suffixes_in_text=True)) str
Returns the markdown representation of the entity
- Returns:
Markdown text of the entity
- Return type:
str
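The members above suggest a common pattern: a subclass implements only `get_text_and_words`, and the text-returning accessors build on it. The sketch below shows that pattern with stand-in classes. `LinearizableSketch` and `KeyValueSketch` are hypothetical, the config argument is omitted for brevity, and deriving `get_text` from the first element of the tuple is an assumption based on the documented return types rather than the library's source.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class LinearizableSketch(ABC):
    """Stand-in for the pattern: subclasses implement get_text_and_words only."""

    @abstractmethod
    def get_text_and_words(self) -> Tuple[str, List[str]]:
        """Return the linearized text and the words it was built from."""

    def get_text(self) -> str:
        # The text is the first element of the (text, words) tuple.
        text, _ = self.get_text_and_words()
        return text

    @property
    def text(self) -> str:
        # Convenience property that maps to get_text().
        return self.get_text()


class KeyValueSketch(LinearizableSketch):
    """Hypothetical leaf entity joining a key and a value with a separator."""

    def __init__(self, key: str, value: str, key_suffix: str = " "):
        self.key = key
        self.value = value
        self.key_suffix = key_suffix

    def get_text_and_words(self) -> Tuple[str, List[str]]:
        words = self.key.split() + self.value.split()
        return f"{self.key}{self.key_suffix}{self.value}", words


entity = KeyValueSketch("Total:", "12.50")
print(entity.text)
```

The `key_suffix` argument loosely echoes the `key_suffix=' '` default visible in the signatures above; in the sketch it is a plain constructor argument rather than a field of a configuration object.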