Exporting Form Data
We now move from Textract OCR to Textract Forms, the API to extract key-value pairs. Here we want to export all key-values extracted from an image as a .csv file.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
There are various sets of dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdf]. You can read more about extra dependencies in the documentation.
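Before calling Textract, it can help to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of Textractor.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Look up the distribution name used in the pip command above.
version = installed_version("amazon-textract-textractor")
if version is None:
    print("amazon-textract-textractor is not installed")
else:
    print(f"amazon-textract-textractor {version} is installed")
```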
Calling Textract
We use the asynchronous API for this example but, as seen in the OCR example, the synchronous API exposes the same methods.
[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures
extractor = Textractor(profile_name="default")
document = extractor.start_document_analysis(
    # Here we pass a Pillow image instead of a path. This changes nothing as
    # Textractor supports most input types.
    file_source=Image.open("../../../tests/fixtures/form.png"),
    # We specify the features that we want. Here we only want keys and values,
    # therefore we use TextractFeatures.FORMS.
    features=[TextractFeatures.FORMS],
    s3_upload_path="s3://textract-ocr/temp/",
    save_image=True
)
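For comparison, a synchronous call might look like the sketch below. It assumes Textractor's analyze_document method accepts the same file_source, features and save_image arguments as the asynchronous call; the wrapper analyze_form_sync is a hypothetical helper introduced only for illustration. The imports are placed inside the function so the sketch can be defined without AWS credentials configured.

```python
def analyze_form_sync(image_path: str, profile_name: str = "default"):
    """Run Textract Forms synchronously on a single local image (sketch)."""
    # Imported lazily: nothing touches AWS until the function is called.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(profile_name=profile_name)
    # Assumption: analyze_document is the synchronous counterpart of
    # start_document_analysis, so no S3 upload path is passed for a local image.
    return extractor.analyze_document(
        file_source=image_path,
        features=[TextractFeatures.FORMS],
        save_image=True,
    )
```

Calling analyze_form_sync("../../../tests/fixtures/form.png") should return a document object that can be used in the same way as the one produced by the asynchronous call.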
Retrieving key-values and exporting as CSV
Form data (key-value pairs) is stored as a property at both the document and page level and can be accessed as shown below.
[2]:
# All key-values present in the document
document.key_values
[2]:
[Date : 04/23/2020,
Phone : 615-373-6883,
Address : BLVD,
Cellular : 683-426-2200,
Work : 726-448-6720,
Time : P.M.,
Phone : 626-200-4890,
Cleaning Tech : LEWIS,
Customer : CAMPBELL,
Day : Wednesday,
Name : CAMPBELL,
City : YORK,
E-Mail" : vilcomp@gmail.com,
Special Instructions or Directions: : ,
Sales Tax : 00,
Late Fee : 00,
TOTAL : 00]
[6]:
# Export the key-values as csv
document.export_kv_to_csv(
    include_kv=True,
    include_checkboxes=False,
    filepath=os.path.join(os.getcwd(), "kv.csv")
)
View CSV as dataframe
To verify the contents of the file stored, we open it as a Pandas dataframe.
[7]:
import pandas as pd
df_key_values = pd.read_csv(os.path.join(os.getcwd(), "kv.csv"))
df_key_values
[7]:
| | Key | Value |
|---|---|---|
| 0 | Date | 04/23/2020 |
| 1 | Phone | 615-373-6883 |
| 2 | Address | BLVD |
| 3 | Cellular | 683-426-2200 |
| 4 | Work | 726-448-6720 |
| 5 | Time | P.M. |
| 6 | Phone | 626-200-4890 |
| 7 | Cleaning Tech | LEWIS |
| 8 | Customer | CAMPBELL |
| 9 | Day | Wednesday |
| 10 | Name | CAMPBELL |
| 11 | City | YORK |
| 12 | E-Mail" | vilcomp@gmail.com |
| 13 | Special Instructions or Directions: | NaN |
| 14 | Sales Tax | 00 |
| 15 | Late Fee | 00 |
| 16 | TOTAL | 00 |
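The output contains a repeated key (Phone appears twice) and a key with no value, so a plain key-to-value dictionary would silently drop data. The sketch below groups values per key using only the standard library. It runs on a few sample rows copied from the output above, and it assumes the exported file is comma-separated with a Key,Value header, as the dataframe suggests. The helper group_by_key is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the output above. The "Key,Value" header layout
# is an assumption based on the dataframe columns.
SAMPLE_CSV = """Key,Value
Date,04/23/2020
Phone,615-373-6883
Phone,626-200-4890
Special Instructions or Directions:,
Sales Tax,00
"""


def group_by_key(csv_file):
    """Map each key to the list of all values seen for it, in file order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Key"]].append(row["Value"])
    return dict(grouped)


grouped = group_by_key(io.StringIO(SAMPLE_CSV))
# Keys that occur more than once in the export.
duplicates = {k: v for k, v in grouped.items() if len(v) > 1}
# Keys whose values are all empty strings.
empty = [k for k, v in grouped.items() if all(val == "" for val in v)]
print(duplicates)
print(empty)
```

To run the same check on the exported file, open it with open("kv.csv", newline="") and pass the file object to group_by_key. Unlike the dataframe view, the csv module keeps empty values as empty strings rather than NaN.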
Conclusion
Textractor supports many more APIs and use cases. If this example did not address yours, we encourage you to look at the other examples.