Using AnalyzeExpense
Textract AnalyzeExpense is an API dedicated to processing invoices and receipts. It is available as both a synchronous and an asynchronous API.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
Several sets of extra dependencies are available to tailor the installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]
. You can read more on extra dependencies in the
[2]:
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.analyze_expense(
file_source="../../../tests/fixtures/invoice.png",
save_image=True,
)
[4]:
document.visualize(with_words=False)
[10]:
document
[10]:
This document holds the following data:
Pages - 1
Words - 98
Lines - 48
Key-values - 0
Checkboxes - 0
Tables - 0
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 1
Parsing the output
The AnalyzeExpense API output is captured in the expense_documents entity list on the main document. An expense document has two main components.
The first is the summary fields. These are key-value pairs normalized to a fixed list of field types. They differ from traditional key-values in that the key is optional and each field carries a normalized type. The full list of types can be found in the API documentation or in the data.constants file.
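To make the "optional key, normalized type" idea concrete, here is a minimal sketch of how such a field could be modeled. The class name SummaryFieldSketch and its attributes are hypothetical and exist only for illustration; this is not the Textractor class. The string form simply mirrors the "TYPE (KEY): VALUE" layout that appears in this notebook's printed output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryFieldSketch:
    """Hypothetical stand-in for a summary field (not the Textractor class)."""

    field_type: str            # normalized type, e.g. "DUE_DATE"
    value: str                 # text detected for the value
    key: Optional[str] = None  # label printed on the document, if any

    def __str__(self) -> str:
        # Show the key in parentheses only when the document printed one.
        label = f"{self.field_type} ({self.key})" if self.key else self.field_type
        return f"{label}: {self.value}"


# One field with a printed label and one without, using values from this notebook.
with_key = SummaryFieldSketch("DUE_DATE", "26/02/2019", key="DUE DATE")
without_key = SummaryFieldSketch("STATE", "NY")

print(with_key)
print(without_key)
```

Keeping the key optional lets a single structure represent both labeled values (a due date next to a "DUE DATE" caption) and unlabeled ones (a state abbreviation inside an address).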
[9]:
from textractor.data.constants import AnalyzeExpenseFields, AnalyzeExpenseFieldsGroup, AnalyzeExpenseLineItemFields
[12]:
expense_doc = document.expense_documents[0]
expense_doc
[12]:
Summary fields: 20
Line Item Groups: index 1: 3 rows
[13]:
expense_doc.summary_fields
[13]:
ADDRESS:
ADDRESS (BILL TO): John Smith\n2 Court Square\nNew York, NY 12210
ADDRESS (SHIP TO): John Smith\n3787 Pineview Drive\nCambridge, MA 12210
ADDRESS (FROM): East Repair Inc.\n1912 Harvest Lane\nNew York, NY 12210
STREET:
STREET: 2 Court Square
STREET: 3787 Pineview Drive
STREET: 1912 Harvest Lane
CITY:
CITY: New York,
CITY: Cambridge,
CITY: New York,
STATE:
STATE: NY
STATE: MA
STATE: NY
ZIP_CODE:
ZIP_CODE: 12210
ZIP_CODE: 12210
ZIP_CODE: 12210
NAME:
NAME: John Smith
NAME: John Smith
NAME: East Repair Inc.
NAME (Please make checks payable to:): East Repair Inc.
NAME: LOGO
ADDRESS_BLOCK:
ADDRESS_BLOCK: 2 Court Square\nNew York, NY 12210
ADDRESS_BLOCK: 3787 Pineview Drive\nCambridge, MA 12210
ADDRESS_BLOCK: 1912 Harvest Lane\nNew York, NY 12210
DUE_DATE:
DUE_DATE (DUE DATE): 26/02/2019
INVOICE_RECEIPT_DATE:
INVOICE_RECEIPT_DATE (INVOICE DATE): 11/02/2019
INVOICE_RECEIPT_ID:
INVOICE_RECEIPT_ID (INVOICE #): US-001
PO_NUMBER:
PO_NUMBER (P.O. #): 2312/2019
PAYMENT_TERMS:
PAYMENT_TERMS (TERMS & CONDITIONS): Payment is due within 15 days
RECEIVER_ADDRESS:
RECEIVER_ADDRESS (BILL TO): John Smith\n2 Court Square\nNew York, NY 12210
RECEIVER_ADDRESS (SHIP TO): John Smith\n3787 Pineview Drive\nCambridge, MA 12210
RECEIVER_NAME:
RECEIVER_NAME: John Smith
RECEIVER_NAME: John Smith
SUBTOTAL:
SUBTOTAL (Subtotal): 145.00 [USD]
TAX:
TAX (Sales Tax 6.25%): 9.06 [USD]
TOTAL:
TOTAL (TOTAL): $154.06 [USD]
VENDOR_ADDRESS:
VENDOR_ADDRESS (FROM): East Repair Inc.\n1912 Harvest Lane\nNew York, NY 12210
VENDOR_NAME:
VENDOR_NAME (Please make checks payable to:): East Repair Inc.
VENDOR_NAME: East Repair Inc.
VENDOR_NAME: LOGO
OTHER:
OTHER (Sales Tax): 6.25%
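A common follow-up to extraction is a consistency check on the monetary fields. The sketch below does not call Textractor; it works on plain strings copied from the output above, and parse_amount is a hypothetical helper introduced here for illustration. It assumes amounts use a period as the decimal separator.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Extract the first decimal number from a string such as '$154.06'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    return Decimal(match.group())


# Values copied by hand from the summary fields output above.
summary_values = {"SUBTOTAL": "145.00", "TAX": "9.06", "TOTAL": "$154.06"}

subtotal = parse_amount(summary_values["SUBTOTAL"])
tax = parse_amount(summary_values["TAX"])
total = parse_amount(summary_values["TOTAL"])

# Decimal avoids the rounding surprises of binary floats for currency.
totals_match = subtotal + tax == total
print(totals_match)
```

Decimal is used instead of float so that the comparison is exact for two-decimal currency values.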
The summary fields are further organized into semantic groups. For example, a document can contain several RECEIVER_ADDRESS fields, one for shipping and one for billing. The groups are available through the following property:
[14]:
expense_doc.summary_groups
[14]:
RECEIVER_BILL_TO:
ADDRESS (BILL TO): John Smith\n2 Court Square\nNew York, NY 12210
STREET: 2 Court Square
CITY: New York,
STATE: NY
ZIP_CODE: 12210
NAME: John Smith
ADDRESS_BLOCK: 2 Court Square\nNew York, NY 12210
RECEIVER_SHIP_TO:
ADDRESS (SHIP TO): John Smith\n3787 Pineview Drive\nCambridge, MA 12210
STREET: 3787 Pineview Drive
CITY: Cambridge,
STATE: MA
ZIP_CODE: 12210
NAME: John Smith
ADDRESS_BLOCK: 3787 Pineview Drive\nCambridge, MA 12210
VENDOR:
ADDRESS (FROM): East Repair Inc.\n1912 Harvest Lane\nNew York, NY 12210
STREET: 1912 Harvest Lane
CITY: New York,
STATE: NY
ZIP_CODE: 12210
NAME: East Repair Inc.
ADDRESS_BLOCK: 1912 Harvest Lane\nNew York, NY 12210
NAME (Please make checks payable to:): East Repair Inc.
NAME: LOGO
The second main component of the AnalyzeExpense output is the line item groups.
[16]:
expense_doc.line_items_groups
[16]:
[|QUANTITY: 1 | ITEM: Front and rear brake cables | UNIT_PRICE: 100.00 | PRICE: 100.00 | EXPENSE_ROW: 1 Front and rear brake cables 100.00 100.00 |
|QUANTITY: 2 | ITEM: New set of pedal arms | UNIT_PRICE: 15.00 | PRICE: 30.00 | EXPENSE_ROW: 2 New set of pedal arms 15.00 30.00 |
|QUANTITY: 3 | ITEM: Labor 3hrs | UNIT_PRICE: 5.00 | PRICE: 15.00 | EXPENSE_ROW: 3 Labor 3hrs 5.00 15.00 | ]
[18]:
expense_doc.line_items_groups[0].to_pandas()
[18]:
| | ITEM | PRICE | PRODUCT_CODE | QUANTITY | UNIT_PRICE |
|---|---|---|---|---|---|
| 0 | Front and rear brake cables | 100.00 | | 1 | 100.00 |
| 1 | New set of pedal arms | 30.00 | | 2 | 15.00 |
| 2 | Labor 3hrs | 15.00 | | 3 | 5.00 |
Like the summary fields, line item fields are normalized, in this case to the five types shown as columns above: ITEM, PRICE, PRODUCT_CODE, QUANTITY, and UNIT_PRICE.
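Because every row shares the same normalized columns, simple arithmetic checks are easy to write. The sketch below is an illustration on plain dictionaries whose values are copied from the table above; it does not use the Textractor objects, and the variable names are assumptions made for the example. It checks that QUANTITY times UNIT_PRICE equals PRICE for each row and sums the row prices.

```python
from decimal import Decimal

# Rows copied by hand from the line item table above (PRODUCT_CODE is empty).
rows = [
    {"ITEM": "Front and rear brake cables", "QUANTITY": "1",
     "UNIT_PRICE": "100.00", "PRICE": "100.00"},
    {"ITEM": "New set of pedal arms", "QUANTITY": "2",
     "UNIT_PRICE": "15.00", "PRICE": "30.00"},
    {"ITEM": "Labor 3hrs", "QUANTITY": "3",
     "UNIT_PRICE": "5.00", "PRICE": "15.00"},
]

# Collect the items whose extended price does not match quantity * unit price.
mismatched = [
    row["ITEM"]
    for row in rows
    if Decimal(row["QUANTITY"]) * Decimal(row["UNIT_PRICE"]) != Decimal(row["PRICE"])
]

# Sum of row prices, to compare against the SUBTOTAL summary field (145.00).
line_total = sum((Decimal(row["PRICE"]) for row in rows), Decimal("0"))

print(mismatched)
print(line_total)
```

An empty mismatched list together with a line total equal to the SUBTOTAL summary field suggests the line items and the summary were extracted consistently.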