Textractor for Large Language Models (LLM)
This example explores how the various Textract APIs can be used with Textractor to enrich the text given to a large language model, allowing us to process documents where some of the data is not in text form.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
There are various sets of extra dependencies available to tailor your installation to your use case. The base package has sensible defaults, but you may want to install the PDF extra dependencies if your workflow uses PDFs, with pip install amazon-textract-textractor[pdfium]. You can read more on extra dependencies in the documentation.
Calling Textract
[4]:
import os
import boto3
import json
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures

def get_response_from_claude(context, prompt_data):
    # Build the request body: the linearized document followed by the question
    body = json.dumps({
        "prompt": f"""Human: Given the following document:
{context}
Answer the following:\n {prompt_data}
Assistant:""",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = 'anthropic.claude-instant-v1' # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'
    # Uses the module-level `bedrock` client created below
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')
    return answer

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"
bedrock = boto3.client(service_name='bedrock-runtime',region_name='us-west-2',endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com')
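The prompt assembly in get_response_from_claude can also be separated from the network call, which makes it possible to inspect the request body without AWS credentials. The following is a minimal sketch under stated assumptions: build_claude_body is a hypothetical helper name (it is not part of Textractor or boto3), and the blank lines placed before "Human:" and "Assistant:" follow the turn format commonly documented for Anthropic text-completion prompts, which may or may not be required by the model version you use.

```python
import json


def build_claude_body(context: str, question: str, max_tokens: int = 2000) -> str:
    """Return a JSON request body for a Claude text-completion call.

    Hypothetical helper for illustration only; the turn markers are an
    assumption about the prompt format expected by the endpoint.
    """
    prompt = (
        "\n\nHuman: Given the following document:\n"
        f"{context}\n"
        f"Answer the following:\n{question}\n\n"
        "Assistant:"
    )
    return json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": max_tokens,
        "top_k": 1,
    })


# Inspect the prompt locally before sending it anywhere
body = build_claude_body("Form 1005 July 96", "Did the applicant sign the document?")
print(json.loads(body)["prompt"])
```

The resulting string could then be passed as the body argument of bedrock.invoke_model, as in the cell above.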
[5]:
image = Image.open("../../../tests/fixtures/form_1005.png").convert("RGB")
image
[5]:
Our example is a verification of employment form for a mortgage. This is a complex form with over 30 fields, selection elements (checkboxes), and signatures that we want to process using Amazon Bedrock and Claude. However, the LLM used here only takes text as input, so we first have to convert the visual cues into textual cues. This can be done with Textractor.
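To make the idea of turning visual cues into textual cues concrete, here is a toy sketch in plain Python. It is not how Textractor is implemented: the Element class and the linearize function are hypothetical names invented for this illustration, and the token strings simply mirror the placeholders used later in this example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A toy page element; this is NOT a Textractor class."""
    kind: str            # "text", "signature", or "checkbox"
    text: str = ""
    checked: bool = False


def linearize(elements, signature_token="[SIGNATURE]", selected="[X]", not_selected="[ ]"):
    """Render page elements as plain text, replacing visual cues with tokens."""
    lines = []
    for el in elements:
        if el.kind == "signature":
            # A signature has no text of its own, so emit a marker instead
            lines.append(signature_token)
        elif el.kind == "checkbox":
            mark = selected if el.checked else not_selected
            lines.append(f"{el.text} {mark}")
        else:
            lines.append(el.text)
    return "\n".join(lines)


page = [
    Element("text", "8. Signature of Applicant"),
    Element("signature"),
    Element("checkbox", "Annual", checked=True),
    Element("checkbox", "Hourly", checked=False),
]
print(linearize(page))
```

A text-only model reading this output can tell that a signature is present and which box is checked, which is the information a plain OCR transcript would lose.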
[14]:
from textractor import Textractor
from textractor.data.text_linearization_config import TextLinearizationConfig
extractor = Textractor(region_name="us-west-2")
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT],
    save_image=True
)
print(document.get_text())
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request 1. To (Name and address of employer)
2. From (Name and address of lender)
Alejandro Rosalez
Carlos Salazar 123 Any Street, Any Town, USA
100 Main Street, Anytown, USA
I
certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.
3. Signature of Lender
4. Title
5. Date
6. Lender's Number Carlos Salazar
Project Manager
12/12/2006
(Optional)
5555-5555-5555 I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information. 7. Name and Address of Applicant (include employee or badge number)
8. Signature of Applicant Paulo Santos
Paulo Santos
123 Any Street, Any Town, USA Part II - Verification of Present Employment 9. Applicant's Date of Employment
10. Present Position
11. Probability of Continued Employment
06/06/2006
General Manager
3 years
12A. Current Gross Base Pay (Enter Amount and Check Period)
13. For Military Personnel Only
14. If Overtime or Bonus is Applicable, Annual
Hourly
Pay Grade
10
Is Its Continuance Likely?
Monthly
Other (Specify)
Type
Monthly Amount
Overtime
Yes
No
$ 5600
Weekly
Bonus
Yes
No
12B. Gross Earnings
Base Pay
$
520
15. If paid hourly - average hours per
week
Type
Year To Date
Past Year
Past Year
Rations
$
162
40 hours
Thru
2006
Flight or
16. Date of applicant's next pay increase Base Pay
$
15.00
$ 20.00
$
30.00
Hazard
$
756
08/08/2007
Clothing
$
452
Overtime
$
15.00
$ 20.00
$
30.00
17. Projected amount of next pay increase
Quarters
$
986
$ 5600
Commissions
$
20.00
$ 20.00
$
15.00
Pro Pay
$
123
18. Date of applicant's last pay increase
09/08/2006
Overseas or
Bonus
$
20.00
$
20.00
$
15.00
Combat
$
645
19. Amount of last pay increase
Total
$ 70.00
$ 80.00
$ 90.00
Variable Housing Allowance
$
587
$ 4800
20. Remarks (If employee was off work for any length of time, please indicate time period and reason)
Not Applicable
Part III - Verification of Previous Employment
21. Date Hired
04/04/2004
23. Salary/Wage at Termination Per (Year) (Month) (Week)
22. Date Terminated
01/03/2005
Base
$ 9500
Overtime
1250
Commissions
4500
Bonus
4000
24. Reason for Leaving
25. Position Held Medical Issue
Device Operator
Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or
conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.
26. Signature of Employer
27. Title (Please print or type)
28. Date
Richard Roe
VA Secretary
01/05/2007
29. Print or type name signed in Item 26
30. Phone No.
Richard Roe
555-0100
Form 1005
July 96
As you may notice, the Layout feature alone is insufficient here: the extracted text gives no indication of whether the form was signed. Claude's answer reflects this:
[16]:
print(get_response_from_claude(
    document.get_text(),
    """
- Did the applicant sign the document?
"""
))
Based on the information provided in the document:
- No, the applicant (Paulo Santos) did not sign the document. Item 8 states "Signature of Applicant" but there is no signature filled in. The document is a "Request for Verification of Employment" form that is filled out and signed by the employer, not the applicant.
Let’s instead represent each signature as a [SIGNATURE] token inside the resulting text.
[22]:
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES],
    save_image=True
)
print(document.get_text())
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request 1. To (Name and address of employer)
2. From (Name and address of lender)
Alejandro Rosalez
Carlos Salazar 123 Any Street, Any Town, USA
100 Main Street, Anytown, USA
I
certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.
3. Signature of Lender
4. Title
5. Date
6. Lender's Number
[SIGNATURE]
Carlos Salazar
Project Manager
12/12/2006
(Optional)
5555-5555-5555 I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information. 7. Name and Address of Applicant (include employee or badge number)
8. Signature of Applicant Paulo Santos 123 Any Street, Any Town, USA
[SIGNATURE]
Paulo Santos Part II - Verification of Present Employment 9. Applicant's Date of Employment
10. Present Position
11. Probability of Continued Employment
06/06/2006
General Manager
3 years
12A. Current Gross Base Pay (Enter Amount and Check Period)
13. For Military Personnel Only
14. If Overtime or Bonus is Applicable, Annual
Hourly
Pay Grade
10
Is Its Continuance Likely?
Monthly
Other (Specify)
Type
Monthly Amount
Overtime
Yes
No
$ 5600
Weekly
Bonus
Yes
No
12B. Gross Earnings
Base Pay
$
520
15. If paid hourly - average hours per
week
Type
Year To Date
Past Year
Past Year
Rations
$
162
40 hours
Thru
2006
Flight or
16. Date of applicant's next pay increase Base Pay
$
15.00
$ 20.00
$
30.00
Hazard
$
756
08/08/2007
Clothing
$
452
Overtime
$
15.00
$ 20.00
$
30.00
17. Projected amount of next pay increase
Quarters
$
986
$ 5600
Commissions
$
20.00
$ 20.00
$
15.00
Pro Pay
$
123
18. Date of applicant's last pay increase
09/08/2006
Overseas or
Bonus
$
20.00
$
20.00
$
15.00
Combat
$
645
19. Amount of last pay increase
Total
$ 70.00
$ 80.00
$ 90.00
Variable Housing Allowance
$
587
$ 4800
20. Remarks (If employee was off work for any length of time, please indicate time period and reason)
Not Applicable
Part III - Verification of Previous Employment
21. Date Hired
04/04/2004
23. Salary/Wage at Termination Per (Year) (Month) (Week)
22. Date Terminated
01/03/2005
Base
$ 9500
Overtime
1250
Commissions
4500
Bonus
4000
24. Reason for Leaving
25. Position Held Medical Issue
Device Operator
Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or
conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.
26. Signature of Employer
27. Title (Please print or type)
28. Date
Richard Roe
[SIGNATURE]
VA Secretary
01/05/2007
29. Print or type name signed in Item 26
30. Phone No.
Richard Roe
555-0100
Form 1005
July 96
[23]:
print(get_response_from_claude(
    document.get_text(),
    """
- Did the applicant sign the document?
"""
))
Based on the information provided:
- Yes, the applicant Paulo Santos signed the document. In Part I item 7, it states "Name and Address of Applicant (include employee or badge number) Paulo Santos 123 Any Street, Any Town, USA" and there is a signature in item 8 for Paulo Santos.
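Because signatures now appear as literal tokens, a cheap sanity check can be run on the text before calling the model. The sketch below is plain Python; count_signature_tokens is a hypothetical helper, and the two sample strings are short excerpts adapted from the outputs above. Note that a count alone does not say which party signed; that still requires reading the surrounding text.

```python
SIGNATURE_TOKEN = "[SIGNATURE]"  # token shown in the linearized output above


def count_signature_tokens(text: str, token: str = SIGNATURE_TOKEN) -> int:
    """Count how many signature markers appear in linearized text."""
    return text.count(token)


# Excerpt as produced with the LAYOUT feature only
layout_only = "8. Signature of Applicant Paulo Santos\nPaulo Santos"

# Excerpt as produced with LAYOUT and SIGNATURES
with_signatures = (
    "8. Signature of Applicant Paulo Santos 123 Any Street, Any Town, USA\n"
    "[SIGNATURE]\n"
    "Paulo Santos"
)

print(count_signature_tokens(layout_only))      # 0
print(count_signature_tokens(with_signatures))  # 1
```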
Selection elements (checkboxes) are another kind of information that does not exist as text.
[24]:
print(get_response_from_claude(
    document.get_text(),
    """
- Is the reported salary annual or hourly?
"""
))
Based on the information provided in the document:
- The reported salary in item 12A is monthly. Item 12A specifies the pay period as "Monthly".
- Item 15 also indicates the applicant is paid hourly, reporting an average of 40 hours per week.
So in summary, the reported salary is monthly, but the applicant is paid on an hourly basis at an average of 40 hours per week.
All of the above is wrong: the applicant is paid annually, as checked in item 12A, but the model never receives that information. We can enrich the text with selection element placeholders, namely [X] and [ ]. Note that these can be configured in the TextLinearizationConfig object.
[29]:
# This is the default configuration
config = TextLinearizationConfig(
    selection_element_selected="[X]",
    selection_element_not_selected="[ ]",
    signature_token="[SIGNATURE]",
)
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES, TextractFeatures.FORMS],
    save_image=True
)
print(document.get_text(config=config))
Request for Verification of Employment
Privacy Act Notice: This information is to be used by the agency collecting it or its assignees in determining whether you qualify as a prospective mortgagor under its program. It will not be disclosed outside the agency except as required and permitted by law. You do not have to provide this information, but if you do not your application for approval as a prospec- tive mortgagor or borrower may be delayed or rejected. The information requested in this form is authorized by Title 38, USC, Chapter 37 (if VA); by 12 USC, Section 1701 et. seq. (if HUD/FHA); by 42 USC, Section 1452b (if HUD/CPD); and Title 42 USC, 1471 et. seq., or 7 USC, 1921 et. seq. (if USDA/FmHA).
Instructions: Lender - Complete items 1 through 7. Have applicant complete item 8. Forward directly to employer named in item 1. Employer - Please complete either Part II or Part III as applicable. Complete Part IV and return directly to lender named in item 2. The form is to be transmitted directly to the lender and is not to be transmitted through the applicant or any other party.
Part I - Request
2. From (Name and address of lender) Carlos Salazar 100 Main Street, Anytown, USA
1. To (Name and address of employer) Alejandro Rosalez 123 Any Street, Any Town, USA
I
certify that this verification has been sent directly to the employer and has not passed through the hands of the applicant or any other interested party.
5. Date 12/12/2006
6. Lender's Number (Optional) 5555-5555-5555
4. Title Project Manager
3. Signature of Lender Carlos Salazar
[SIGNATURE]
I have applied for a mortgage loan and stated that I am now or was formerly employed by you. My signature below authorizes verification of this information.
8. Signature of Applicant Paulo Santos
7. Name and Address of Applicant (include employee or badge number) Paulo Santos 123 Any Street, Any Town, USA
[SIGNATURE]
Part II - Verification of Present Employment
11. Probability of Continued Employment 3 years
10. Present Position General Manager
9. Applicant's Date of Employment 06/06/2006
13. For Military Personnel Only
12A. Current Gross Base Pay (Enter Amount and Check Period) $ 5600
14. If Overtime or Bonus is Applicable,
Pay Grade
10
Annual [X]
Hourly [ ]
Is Its Continuance Likely?
Monthly Amount
Other (Specify) [ ]
Type
Monthly [ ]
Overtime
Yes [X]
No [ ]
Yes [ ]
No [X]
Bonus
Weekly [ ]
Base Pay $ 520
15. If paid hourly - average hours per week 40 hours
12B. Gross Earnings
Year To Date
Past Year
Rations $ 162
Type
Past Year
16. Date of applicant's next pay increase 08/08/2007
Thru 2006
Flight or Hazard 756 $
Base Pay
$
15.00
$ 20.00
$
30.00
Clothing $ 452
15.00
$ 20.00
30.00
17. Projected amount of next pay increase $ 5600
Overtime
$
$
Quarters $ 986
20.00
$ 20.00
15.00
18. Date of applicant's last pay increase 09/08/2006
$
Pro Pay $ 123
Commissions
$
Overseas or Combat $ 645
20.00
20.00
15.00
19. Amount of last pay increase $ 4800
Bonus
$
$
$
Variable Housing Allowance $ 587
Total
$ 70.00
$ 80.00
$ 90.00
20. Remarks (If employee was off work for any length of time, please indicate time period and reason) Not Applicable
Not Applicable
Part III - Verification of Previous Employment
23. Salary/Wage at Termination Per (Year) (Month) (Week)
21. Date Hired 04/04/2004
Bonus 4000
Overtime 1250
Commissions 4500
Base $ 9500
22. Date Terminated 01/03/2005
25. Position Held Device Operator
24. Reason for Leaving Medical Issue
Part IV - Authorized Signature - Federal statutes provide severe penalties for any fraud, intentional misrepresentation, or criminal connivance or
conspiracy purposed to influence the issuance of any guaranty or insurance by the VA Secretary, the U.S.D.A., FmHA/FHA Commissioner, or the HUD/CPD Assistant Secretary.
27. Title (Please print or type) VA Secretary
28. Date 01/05/2007
26. Signature of Employer Richard Roe
[SIGNATURE]
29. Print or type name signed in Item 26 Richard Roe
30. Phone No. 555-0100
Form 1005 July 96
July 96
[30]:
print(get_response_from_claude(
    document.get_text(),
    """
- Is the reported salary annual or hourly?
"""
))
Based on the information provided in the document:
- The reported salary in Part II, item 12A is annual. It specifically states the salary of $5600 is for the "Annual" period.
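The checkbox placeholders also make the linearized text easy to inspect programmatically, for instance to cross-check a model's answer. The following is a rough sketch in plain Python: parse_selection_elements and the regular expression are hypothetical and assume each selection element ends its line with "[X]" or "[ ]", as in the output above. It does not use any Textractor API.

```python
import re

# Matches "<label> [X]" or "<label> [ ]" at the end of a line
SELECTION_RE = re.compile(r"^(?P<label>.+?)\s*\[(?P<mark>[Xx ]?)\]\s*$")


def parse_selection_elements(text):
    """Return (label, is_checked) pairs for lines ending in a checkbox token."""
    results = []
    for line in text.splitlines():
        match = SELECTION_RE.match(line.strip())
        if match:
            checked = match.group("mark").upper() == "X"
            results.append((match.group("label"), checked))
    return results


excerpt = """Annual [X]
Hourly [ ]
Is Its Continuance Likely?
Other (Specify) [ ]
Monthly [ ]
Weekly [ ]"""

for label, checked in parse_selection_elements(excerpt):
    print(label, checked)
```

On this excerpt only "Annual" is reported as checked, which matches the model's answer. Lines without a checkbox token, including [SIGNATURE] lines, are ignored.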
Conclusion
Large language models are only as good as their input. By leveraging Textract APIs to enrich the text representation provided as input, you can unblock intelligent document processing workflows without implementing complex heuristics.