Textractor for Large Language Models (LLMs)

This example explores how to use the various Textract APIs with Textractor to enrich the text given to a large language model, allowing us to process documents where some of the data is not in text form.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdf]. You can read more about extra dependencies in the documentation.

Calling Textract

[4]:
import os
import boto3
import json

from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures

def get_response_from_claude(context, prompt_data):
    body = json.dumps({
        "prompt": f"""Human: Given the following document:
        {context}
        Answer the following:\n {prompt_data}
        Assistant:""",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = 'anthropic.claude-instant-v1' # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'

    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')

    return answer

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"

bedrock = boto3.client(service_name='bedrock-runtime',region_name='us-west-2',endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com')
[5]:
image = Image.open("../../../tests/fixtures/form_1005.png").convert("RGB")
image
[5]:
[Image: form_1005.png, the Form 1005 "Request for Verification of Employment" used in this example]

Our example is a verification of employment form for a mortgage. This is a complex form with over 30 fields, selection elements (checkboxes), and signatures that we want to process using Amazon Bedrock and Claude. However, LLMs only take text as input, so we first have to convert the visual clues into textual clues. This can be done with Textractor.

[14]:
from textractor import Textractor
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(region_name="us-west-2")
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT],
    save_image=True
)
print(document.get_text())
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request 1. To (Name and address of employer)

2. From (Name and address of lender)

Alejandro Rosalez

Carlos Salazar 123 Any Street, Any Town, USA

100 Main Street, Anytown, USA

I

certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.

3. Signature of Lender

4. Title

5. Date

6. Lender's Number Carlos Salazar

Project Manager

12/12/2006

(Optional)

5555-5555-5555 I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information. 7. Name and Address of Applicant (include employee or badge number)

8. Signature of Applicant Paulo Santos

Paulo Santos

123 Any Street, Any Town, USA Part II - Verification of Present Employment 9. Applicant's Date of Employment

10. Present Position

11. Probability of Continued Employment

06/06/2006

General Manager

3 years

12A. Current Gross Base Pay (Enter Amount and Check Period)

13. For Military Personnel Only

14. If Overtime or Bonus is Applicable, Annual

Hourly

Pay Grade

10

Is Its Continuance Likely?

Monthly

Other (Specify)

Type

Monthly Amount

Overtime

Yes

No

$ 5600

Weekly

Bonus

Yes

No

12B. Gross Earnings

Base Pay

$

520

15. If paid hourly - average hours per

week

Type

Year To Date

Past Year

Past Year

Rations

$

162

40 hours

Thru

2006

Flight or

16. Date of applicant's next pay increase Base Pay

$

15.00

$ 20.00

$

30.00

Hazard

$

756

08/08/2007

Clothing

$

452

Overtime

$

15.00

$ 20.00

$

30.00

17. Projected amount of next pay increase

Quarters

$

986

$ 5600

Commissions

$

20.00

$ 20.00

$

15.00

Pro Pay

$

123

18. Date of applicant's last pay increase

09/08/2006

Overseas or

Bonus

$

20.00

$

20.00

$

15.00

Combat

$

645

19. Amount of last pay increase

Total

$ 70.00

$ 80.00

$ 90.00

Variable Housing Allowance

$

587

$ 4800
20. Remarks (If employee was off work for any length of time, please indicate time period and reason)
Not Applicable
Part III - Verification of Previous Employment
21. Date Hired

04/04/2004

23. Salary/Wage at Termination Per (Year) (Month) (Week)

22. Date Terminated

01/03/2005

Base

$ 9500

Overtime

1250

Commissions

4500

Bonus

4000

24. Reason for Leaving

25. Position Held Medical Issue

Device Operator

Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or

conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.

26. Signature of Employer

27. Title (Please print or type)

28. Date

Richard Roe

VA Secretary

01/05/2007

29. Print or type name signed in Item 26

30. Phone No.

Richard Roe

555-0100
Form 1005
July 96

As you may notice, the layout API alone is insufficient here: nothing in the text marks where a signature is present. Claude agrees:

[16]:
print(get_response_from_claude(
    document.get_text(),
    """
    - Did the applicant sign the document?
    """
))
 Based on the information provided in the document:

- No, the applicant (Paulo Santos) did not sign the document. Item 8 states "Signature of Applicant" but there is no signature filled in. The document is a "Request for Verification of Employment" form that is filled out and signed by the employer, not the applicant.

Let's instead introduce signatures as a [SIGNATURE] token in the resulting text.

[22]:
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES],
    save_image=True
)
print(document.get_text())
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request 1. To (Name and address of employer)

2. From (Name and address of lender)

Alejandro Rosalez

Carlos Salazar 123 Any Street, Any Town, USA

100 Main Street, Anytown, USA

I

certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.

3. Signature of Lender

4. Title

5. Date

6. Lender's Number

[SIGNATURE]

Carlos Salazar

Project Manager

12/12/2006

(Optional)

5555-5555-5555 I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information. 7. Name and Address of Applicant (include employee or badge number)

8. Signature of Applicant Paulo Santos 123 Any Street, Any Town, USA

[SIGNATURE]

Paulo Santos Part II - Verification of Present Employment 9. Applicant's Date of Employment

10. Present Position

11. Probability of Continued Employment

06/06/2006

General Manager

3 years

12A. Current Gross Base Pay (Enter Amount and Check Period)

13. For Military Personnel Only

14. If Overtime or Bonus is Applicable, Annual

Hourly

Pay Grade

10

Is Its Continuance Likely?

Monthly

Other (Specify)

Type

Monthly Amount

Overtime

Yes

No

$ 5600

Weekly

Bonus

Yes

No

12B. Gross Earnings

Base Pay

$

520

15. If paid hourly - average hours per

week

Type

Year To Date

Past Year

Past Year

Rations

$

162

40 hours

Thru

2006

Flight or

16. Date of applicant's next pay increase Base Pay

$

15.00

$ 20.00

$

30.00

Hazard

$

756

08/08/2007

Clothing

$

452

Overtime

$

15.00

$ 20.00

$

30.00

17. Projected amount of next pay increase

Quarters

$

986

$ 5600

Commissions

$

20.00

$ 20.00

$

15.00

Pro Pay

$

123

18. Date of applicant's last pay increase

09/08/2006

Overseas or

Bonus

$

20.00

$

20.00

$

15.00

Combat

$

645

19. Amount of last pay increase

Total

$ 70.00

$ 80.00

$ 90.00

Variable Housing Allowance

$

587

$ 4800
20. Remarks (If employee was off work for any length of time, please indicate time period and reason)
Not Applicable
Part III - Verification of Previous Employment
21. Date Hired

04/04/2004

23. Salary/Wage at Termination Per (Year) (Month) (Week)

22. Date Terminated

01/03/2005

Base

$ 9500

Overtime

1250

Commissions

4500

Bonus

4000

24. Reason for Leaving

25. Position Held Medical Issue

Device Operator

Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or

conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.

26. Signature of Employer

27. Title (Please print or type)

28. Date

Richard Roe

[SIGNATURE]

VA Secretary

01/05/2007

29. Print or type name signed in Item 26

30. Phone No.

Richard Roe

555-0100
Form 1005
July 96
[23]:
print(get_response_from_claude(
    document.get_text(),
    """
    - Did the applicant sign the document?
    """
))
 Based on the information provided:

- Yes, the applicant Paulo Santos signed the document. In Part I item 7, it states "Name and Address of Applicant (include employee or badge number) Paulo Santos 123 Any Street, Any Town, USA" and there is a signature in item 8 for Paulo Santos.
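
Before sending the text to the model, it can be useful to confirm that the enrichment actually took effect. The following is a minimal sketch, not part of Textractor: the helper name signature_lines is hypothetical, and it assumes the default [SIGNATURE] token always appears on a line of its own, as it does in the output above.

```python
SIGNATURE_TOKEN = "[SIGNATURE]"  # token seen in the linearized output above


def signature_lines(text, token=SIGNATURE_TOKEN):
    """Return the 1-based line numbers whose content is exactly the token.

    Hypothetical helper for illustration; assumes the token sits alone on a line.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip() == token
    ]


# Short excerpts copied from the two outputs above (not the full documents).
layout_only = "8. Signature of Applicant Paulo Santos\n\nPaulo Santos"
with_signatures = (
    "8. Signature of Applicant Paulo Santos 123 Any Street, Any Town, USA\n"
    "\n"
    "[SIGNATURE]\n"
    "\n"
    "Paulo Santos Part II - Verification of Present Employment"
)

print(signature_lines(layout_only))      # no token without the SIGNATURES feature
print(signature_lines(with_signatures))  # token present once SIGNATURES is enabled
```

A check like this only tells you that a signature marker exists somewhere in the text. Deciding which field it belongs to is still left to the model.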

Selection elements (checkboxes) are another kind of information that does not exist as text.

[24]:
print(get_response_from_claude(
    document.get_text(),
    """
    - Is the reported salary annual or hourly?
    """
))
 Based on the information provided in the document:

- The reported salary in item 12A is monthly. Item 12A specifies the pay period as "Monthly".

- Item 15 also indicates the applicant is paid hourly, reporting an average of 40 hours per week.

So in summary, the reported salary is monthly, but the applicant is paid on an hourly basis at an average of 40 hours per week.

All of the above is wrong: the applicant is paid annually, as checked in 12A, but the model never receives that information. We can enrich the text with selection element placeholders, namely [X] and [ ]. Note that these can be configured in the TextLinearizationConfig object.

[29]:
# This is the default configuration
config = TextLinearizationConfig(
    selection_element_selected="[X]",
    selection_element_not_selected="[ ]",
    signature_token="[SIGNATURE]",
)

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES, TextractFeatures.FORMS],
    save_image=True
)
print(document.get_text(config=config))
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request

2. From (Name and address of lender) Carlos Salazar 100 Main Street, Anytown, USA

1. To (Name and address of employer) Alejandro Rosalez 123 Any Street, Any Town, USA

I

certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.

5. Date 12/12/2006

6. Lender's Number (Optional) 5555-5555-5555

4. Title Project Manager

3. Signature of Lender Carlos Salazar

[SIGNATURE]

I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information.

8. Signature of Applicant Paulo Santos

7. Name and Address of Applicant (include employee or badge number) Paulo Santos 123 Any Street, Any Town, USA

[SIGNATURE]

Part II - Verification of Present Employment

11. Probability of Continued Employment 3 years

10. Present Position General Manager

9. Applicant's Date of Employment 06/06/2006

13. For Military Personnel Only

12A. Current Gross Base Pay (Enter Amount and Check Period) $ 5600

14. If Overtime or Bonus is Applicable,

Pay Grade

10

Annual [X]

Hourly [ ]

Is Its Continuance Likely?

Monthly Amount

Other (Specify) [ ]

Type

Monthly [ ]

Overtime

Yes [X]

No [ ]

Yes [ ]

No [X]

Bonus

Weekly [ ]

Base Pay $ 520

15. If paid hourly - average hours per week 40 hours

12B. Gross Earnings

Year To Date

Past Year

Rations $ 162

Type

Past Year

16. Date of applicant's next pay increase 08/08/2007

Thru 2006

Flight or Hazard 756 $

Base Pay

$

15.00

$ 20.00

$

30.00

Clothing $ 452

15.00

$ 20.00

30.00

17. Projected amount of next pay increase $ 5600

Overtime

$

$

Quarters $ 986

20.00

$ 20.00

15.00

18. Date of applicant's last pay increase 09/08/2006

$

Pro Pay $ 123

Commissions

$

Overseas or Combat $ 645

20.00

20.00

15.00

19. Amount of last pay increase $ 4800

Bonus

$

$

$

Variable Housing Allowance $ 587

Total

$ 70.00

$ 80.00

$ 90.00
 20. Remarks (If employee was off work for any length of time, please indicate time period and reason) Not Applicable
Not Applicable
Part III - Verification of Previous Employment
23. Salary/Wage at Termination Per (Year) (Month) (Week)

21. Date Hired 04/04/2004

Bonus 4000

Overtime 1250

Commissions 4500

Base $ 9500

22. Date Terminated 01/03/2005

25. Position Held Device Operator

24. Reason for Leaving Medical Issue

Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or

conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.

27. Title (Please print or type) VA Secretary

28. Date 01/05/2007

26. Signature of Employer Richard Roe

[SIGNATURE]

29. Print or type name signed in Item 26 Richard Roe

30. Phone No. 555-0100


Form 1005 July 96
July 96
[30]:
print(get_response_from_claude(
    document.get_text(),
    """
    - Is the reported salary annual or hourly?
    """
))
 Based on the information provided in the document:

- The reported salary in Part II, item 12A is annual. It specifically states the salary of $5600 is for the "Annual" period.
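
The placeholders also make it possible to spot-check the model's answer against the linearized text. The sketch below is an illustration under stated assumptions rather than a Textractor feature: the function name parse_selection_elements and the regular expression are hypothetical, and they assume each selection element is rendered as a label followed by [X] or [ ] at the end of a line, as in the output above.

```python
import re

# Matches a label followed by a selection placeholder at the end of a line,
# e.g. "Annual [X]" or "Hourly [ ]" (an empty "[]" is tolerated as well).
SELECTION_PATTERN = re.compile(r"^(?P<label>.+?)\s*\[(?P<mark>[Xx ]?)\]\s*$")


def parse_selection_elements(text):
    """Return (label, is_selected) pairs in the order they appear in the text.

    Hypothetical helper for illustration only.
    """
    results = []
    for line in text.splitlines():
        match = SELECTION_PATTERN.match(line.strip())
        if match:
            is_selected = match.group("mark").upper() == "X"
            results.append((match.group("label"), is_selected))
    return results


# Excerpt of the pay period checkboxes from the output above.
excerpt = """Annual [X]

Hourly [ ]

Other (Specify) [ ]

Monthly [ ]

Weekly [ ]"""

selected = [label for label, checked in parse_selection_elements(excerpt) if checked]
print(selected)
```

Note that this only works for self-describing labels. The Yes and No checkboxes in item 14 cannot be attributed to Overtime or Bonus from the label alone, which is the kind of contextual reasoning better left to the model.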

Conclusion

Large language models are only as good as their input information. By leveraging the Textract APIs to enrich the text representation provided as input, you can unblock intelligent document processing workflows without implementing complex heuristics.