Tabular data linearization

This example shows how to table-aware linearization in Textractor can preserve cell text integrity and improve retrieval in a question answering task.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation

Calling Textract

[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig
[2]:
image = Image.open("../../../tests/fixtures/multiline_cells.jpeg")
image
[2]:
../_images/notebooks_tabular_data_linearization_2_0.png
[3]:
extractor = Textractor(region_name="us-west-2")

document = extractor.detect_document_text(
    file_source=image,
    save_image=True
)
[4]:
raw_text = document.get_text(TextLinearizationConfig(max_number_of_consecutive_new_lines=1))
print(raw_text.replace("\n", " "))
Guidance and Manufacturer's Declaration - Electromagnetic Immunity   The MiniMed 640G insulin pump is intended for use in the electromagnetic   environment specified below. The customer or the user of the MiniMed insulin   pump should assure that it is used in such an environment.   Immunity Test   IEC 60601 Test   Compliance   Electromagnetic   Level   Level   Environment -   Guidance   Electrostatic discharge   +8 kV contact   +30 kV air   For use in a typical   (ESD)   15 kV air   (<5% relative   domestic, commercial,   IEC 61000-4-2   humidity)   or hospital   environment.   Electrical fast transient/   +2 kV for power   Not applicable   Requirement does not   burst   supply lines   apply to this battery   IEC 61000-4-4   +1 kV for input/   powered device.   output lines   Surge   +1 kV line(s) to   Not applicable   Requirement does not   IEC 61000-4-5   line(s)   apply to this battery   2 kV line(s) to   powered device.   earth   Voltage dips, short   <5% UT   Not applicable   Requirement does not   interruptions and   (>95% dip in UT   apply to this battery   voltage variations on   ) for 0.5 cycle   powered device.   power supply lines   IEC 61000-4-11   Power frequency   400 A/m   400 A/m   Power frequency   (50/60 Hz) magnetic   (continuous   magnetic fields should   field   field at 60   be at levels   IEC 61000-4-8   seconds)   characteristic of a   typical location in a   4000 A/m (short   4000 A/m   typical commercial or   duration at 3   hospital environment.   seconds)   Note: UT is the a.c. mains voltage prior to application of the test level.   Product specifications and safety information   263
[5]:
extractor = Textractor(region_name="us-west-2")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
[6]:
document.tables[0].visualize()
[6]:
../_images/notebooks_tabular_data_linearization_6_0.png
[7]:
table_aware_text = document.tables[0].get_text(TextLinearizationConfig(table_linearization_format='markdown'))
print(table_aware_text)
| 0                                                                                                                                                                                                                | 1                                                                               | 2                                  | 3                                                                                                                                         |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| Guidance and Manufacturer's Declaration - Electromagnetic Immunity                                                                                                                                               |                                                                                 |                                    |                                                                                                                                           |
| The MiniMed 640G insulin pump is intended for use in the electromagnetic environment specified below. The customer or the user of the MiniMed insulin pump should assure that it is used in such an environment. |                                                                                 |                                    |                                                                                                                                           |
| Immunity Test                                                                                                                                                                                                    | IEC 60601 Test Level                                                            | Compliance Level                   | Electromagnetic Environment - Guidance                                                                                                    |
| Electrostatic discharge (ESD) IEC 61000-4-2                                                                                                                                                                      | +8 kV contact 15 kV air                                                         | +30 kV air (<5% relative humidity) | For use in a typical domestic, commercial, or hospital environment.                                                                       |
| Electrical fast transient/ burst IEC 61000-4-4                                                                                                                                                                   | +2 kV for power supply lines +1 kV for input/ output lines                      | Not applicable                     | Requirement does not apply to this battery powered device.                                                                                |
| Surge IEC 61000-4-5                                                                                                                                                                                              | +1 kV line(s) to line(s) 2 kV line(s) to earth                                  | Not applicable                     | Requirement does not apply to this battery powered device.                                                                                |
| Voltage dips, short interruptions and voltage variations on power supply lines IEC 61000-4-11                                                                                                                    | <5% UT (>95% dip in UT ) for 0.5 cycle                                          | Not applicable                     | Requirement does not apply to this battery powered device.                                                                                |
| Power frequency (50/60 Hz) magnetic field IEC 61000-4-8                                                                                                                                                          | 400 A/m (continuous field at 60 seconds) 4000 A/m (short duration at 3 seconds) | 400 A/m 4000 A/m                   | Power frequency magnetic fields should be at levels characteristic of a typical location in a typical commercial or hospital environment. |

The above is a table in the markdown format, we can paste in in a new cell to see it rendered properly. Note that we set table_column_header_threshold=0.5 otherwise the top row would not be identified as a header due to the first cell not being a header itself (see in blue above).

0

1

2

3

Guidance and Manufacturer’s Declaration - Electromagnetic Immunity

The MiniMed 640G insulin pump is intended for use in the electromagnetic environment specified below. The customer or the user of the MiniMed insulin pump should assure that it is used in such an environment.

Immunity Test

IEC 60601 Test Level

Compliance Level

Electromagnetic Environment - Guidance

Electrostatic discharge (ESD) IEC 61000-4-2

+8 kV contact 15 kV air

+30 kV air (<5% relative humidity)

For use in a typical domestic, commercial, or hospital environment.

Electrical fast transient/ burst IEC 61000-4-4

+2 kV for power supply lines +1 kV for input/ output lines

Not applicable

Requirement does not apply to this battery powered device.

Surge IEC 61000-4-5

+1 kV line(s) to line(s) 2 kV line(s) to earth

Not applicable

Requirement does not apply to this battery powered device.

Voltage dips, short interruptions and voltage variations on power supply lines IEC 61000-4-11

<5% UT (>95% dip in UT ) for 0.5 cycle

Not applicable

Requirement does not apply to this battery powered device.

Power frequency (50/60 Hz) magnetic field IEC 61000-4-8

400 A/m (continuous field at 60 seconds) 4000 A/m (short duration at 3 seconds)

400 A/m 4000 A/m

Power frequency magnetic fields should be at levels characteristic of a typical location in a typical commercial or hospital environment.

With the two versions of this tables, we then ask Claude questions pertaining to the document

[8]:
import json
import boto3

def get_response_from_claude(context, prompt_data):
    body = json.dumps({
        "prompt": f"""Human: Given the following document:
        {context}
        Answer the following:\n {prompt_data}
        Assistant:""",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = f'anthropic.claude-instant-v1' # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'

    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')

    return answer

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"

bedrock = boto3.client(service_name='bedrock-runtime',region_name='us-west-2',endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com')
[9]:
question = "What is the compliance level for IEC 61000-4-2?"

IEC 61000-4-2 is the electrostatic discharge test, the correct answer is +30 kV air (<5% relative humidity).

[10]:
document.pages[0].image.crop((200, 190, 1000, 430))
[10]:
../_images/notebooks_tabular_data_linearization_14_0.png
[11]:
print(get_response_from_claude(raw_text, question))
 Based on the document, the compliance level for IEC 61000-4-2 (Electrostatic discharge) is:

+8 kV contact
+30 kV air
15 kV air (<5% relative humidity)

We see that Claude using the raw text is unable to extract the correct answer as the IEC 60601 Test Level and Compliance Level are interlaced, with the resulting text being:

Electrostatic discharge   +8 kV contact   +30 kV air   For use in a typical   (ESD)   15 kV air   (<5% relative   domestic, commercial,   IEC 61000-4-2   humidity)   or hospital   environment.

If instead we use the table-aware linearization, the answer is available and properly aligned in the text, leading to a proper extraction:

[12]:
print(get_response_from_claude(table_aware_text, question))
 Based on the document, the compliance level for IEC 61000-4-2 (Electrostatic discharge (ESD)) is:
+30 kV air (<5% relative humidity)

Conclusion

By leveraging Textract Tables, we can build a better text representation of the tabular data, leading to a better performance in question answering tasks.