Tabular data linearization (Continued)

This example shows how to table-aware linearization in Textractor can apply various transformations on the table to improve QA performance.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation

Calling Textract

[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig
[34]:
image = Image.open("../../../tests/fixtures/vbat.png")
image
[34]:
../_images/notebooks_tabular_data_linearization_continued_2_0.png
[35]:
extractor = Textractor(region_name="us-west-2")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
[36]:
document.tables[0].visualize()
[36]:
../_images/notebooks_tabular_data_linearization_continued_4_0.png
[37]:
markdown_table = document.tables[0].get_text(TextLinearizationConfig(table_linearization_format='markdown'))

Let’s visualize the markdown output:

0

1

2

3

4

5

6

7

8

9

10

11

12

Symbol

Parameter

Conditions

Typ

Max(1)

Unit

Backup SRAM

RTC and LSE

1.2 V

2V

3V

3.4 V

3 y

Tj=25 °C

Tj=85 °C

Tj=105 °C

Tj=125 °C

IDD (VBAT)

Supply current in VBAT mode

OFF

OFF

0,02

0,02

0,03

0,05

0,5

4,1

10

24

UA

ON

OFF

1,33

1,45

1,58

1,7

4,4

22

48

87

OFF

ON

0,46

0,57

0,75

0,87

ON

ON

1,77

2

2,3

2,5

Markdown does not support merged cells. This is problematic as we see that the “Oversampling” columns end up mostly empty. Let’s test Claude v2 on that table representation.

[6]:
import json
import boto3

def get_response_from_claude(context, prompt_data):
    body = json.dumps({
        "prompt": f"""Human: Given the following document:
        {context}
        Answer the following:\n {prompt_data}
        Assistant:""",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = f'anthropic.claude-instant-v1' # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'

    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')

    return answer

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"

bedrock = boto3.client(service_name='bedrock-runtime',region_name='us-west-2',endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com')
[51]:
question = "What is the max supply current at 125°C if both Backup SRAM and RTC and LSE are OFF? Answer in one line"

The correct answer is 24uA

[52]:
print(get_response_from_claude(markdown_table, question))
 0.05 UA

We see that Claude using the text is unable to extract the correct answer. However we can change the TextLinearizationConfig to duplicate text in merged cells:

[53]:
markdown_table_with_duplication = document.tables[0].get_text(TextLinearizationConfig(table_linearization_format='markdown', table_duplicate_text_in_merged_cells=True))

Let’s visualize the markdown output:

0

1

2

3

4

5

6

7

8

9

10

11

12

Symbol

Parameter

Conditions

Conditions

Typ

Typ

Typ

Typ

Max(1)

Max(1)

Max(1)

Max(1)

Unit

Symbol

Parameter

Backup SRAM

RTC and LSE

1.2 V

2V

3V

3.4 V

3 y

3 y

3 y

3 y

Unit

Symbol

Parameter

Backup SRAM

RTC and LSE

1.2 V

2V

3V

3.4 V

Tj=25 °C

Tj=85 °C

Tj=105 °C

Tj=125 °C

Unit

IDD (VBAT)

Supply current in VBAT mode

OFF

OFF

0,02

0,02

0,03

0,05

0,5

4,1

10

24

UA

IDD (VBAT)

Supply current in VBAT mode

ON

OFF

1,33

1,45

1,58

1,7

4,4

22

48

87

UA

IDD (VBAT)

Supply current in VBAT mode

OFF

ON

0,46

0,57

0,75

0,87

UA

IDD (VBAT)

Supply current in VBAT mode

ON

ON

1,77

2

2,3

2,5

UA

[54]:
print(get_response_from_claude(markdown_table_with_duplication, question))
 24 UA

Let’s now try something different

[55]:
image = Image.open("../../../tests/fixtures/vbat2.png")
image
[55]:
../_images/notebooks_tabular_data_linearization_continued_19_0.png
[56]:
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
[57]:
document.tables[0].visualize()
[57]:
../_images/notebooks_tabular_data_linearization_continued_21_0.png
[58]:
markdown_table_2 = document.tables[0].get_text(
    TextLinearizationConfig(
        table_linearization_format='markdown'
    )
)

Symbol

Parameter

Conditions

Typ

Max ¹

Unit

TA=25°C

TA=85°C

TA=105 °C

VDD = 1.8 V

VDD 2.4 V

VDD 3.3 V

VDD 3.6 V

’DD_VBAT

Backup domain supply current

Backup SRAM ON, low-speed oscillator and RTC ON

1.29

1.42

1.68

12

19

UA

Backup SRAM OFF, low-speed oscillator and RTC ON

0.62

0.73

0.96

8

10

Backup SRAM ON, RTC OFF

0.79

0.81

0.86

9

16

Backup SRAM OFF, RTC OFF

0.10

0.10

0.10

5

7

[62]:
markdown_table_flattened_headers_2 = document.tables[0].get_text(
    TextLinearizationConfig(
        table_linearization_format='markdown',
        table_flatten_headers=True
    )
)

Symbol

Parameter

Conditions

Typ TA=25°C VDD = 1.8 V

Typ TA=25°C VDD 2.4 V

Typ TA=25°C VDD 3.3 V

Max ¹ TA=85°C VDD 3.6 V

Max ¹ TA=105 °C VDD 3.6 V

Unit

’DD_VBAT

Backup domain supply current

Backup SRAM ON, low-speed oscillator and RTC ON

1.29

1.42

1.68

12

19

UA

Backup SRAM OFF, low-speed oscillator and RTC ON

0.62

0.73

0.96

8

10

Backup SRAM ON, RTC OFF

0.79

0.81

0.86

9

16

Backup SRAM OFF, RTC OFF

0.1

0.1

0.1

5

7

Combining it all together

[91]:
markdown_table_flattened_headers_duplicated_text_2 = document.tables[0].get_text(
    TextLinearizationConfig(
        table_linearization_format='markdown',
        table_flatten_headers=True,
        table_duplicate_text_in_merged_cells=True,
    )
)

Symbol

Parameter

Conditions

Typ TA=25°C VDD = 1.8 V

Typ TA=25°C VDD 2.4 V

Typ TA=25°C VDD 3.3 V

Max ¹ TA=85°C VDD 3.6 V

Max ¹ TA=105 °C VDD 3.6 V

Unit

’DD_VBAT

Backup domain supply current

Backup SRAM ON, low-speed oscillator and RTC ON

1.29

1.42

1.68

12

19

UA

’DD_VBAT

Backup domain supply current

Backup SRAM OFF, low-speed oscillator and RTC ON

0.62

0.73

0.96

8

10

UA

’DD_VBAT

Backup domain supply current

Backup SRAM ON, RTC OFF

0.79

0.81

0.86

9

16

UA

’DD_VBAT

Backup domain supply current

Backup SRAM OFF, RTC OFF

0.1

0.1

0.1

5

7

UA

Conclusion

By leveraging Textract Tables, we can build a better text representation of the tabular data, leading to a better performance in question answering tasks.