{ "cells": [ { "cell_type": "markdown", "id": "737e6086-5638-4e74-b26f-845adc4dfffc", "metadata": {}, "source": [ "# Exporting Form Data\n", "\n", "We now move from Textract OCR to Textract Forms, the API that extracts key-value pairs. In this example, we export all key-value pairs extracted from an image to a .csv file.\n", "\n", "## Installation\n", "\n", "To begin, install the `amazon-textract-textractor` package using pip.\n", "\n", "`pip install amazon-textract-textractor`\n", "\n", "Several sets of extra dependencies are available to tailor the installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extras with `pip install amazon-textract-textractor[pdf]`. You can read more about extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)." ] }, { "cell_type": "markdown", "id": "fac93401-7b94-4862-b87e-7d8376a29562", "metadata": {}, "source": [ "## Calling Textract\n", "\n", "We use the asynchronous API for this example but, as seen in the OCR example, the synchronous API exposes the same methods." ] }, { "cell_type": "code", "execution_count": 1, "id": "fa86d86f-6272-4021-b4da-aa751ae2e22e", "metadata": {}, "outputs": [], "source": [ "import os\n", "from PIL import Image\n", "from textractor import Textractor\n", "from textractor.data.constants import TextractFeatures\n", "\n", "extractor = Textractor(profile_name=\"default\")\n", "document = extractor.start_document_analysis(\n", " # Here we pass a Pillow image instead of a path. 
This changes nothing as\n", " # Textractor supports most input types.\n", " file_source=Image.open(\"../../../tests/fixtures/form.png\"),\n", " # We specify the features that we want. Here we only want keys and values,\n", " # so we use TextractFeatures.FORMS.\n", " features=[TextractFeatures.FORMS],\n", " s3_upload_path=\"s3://textract-ocr/temp/\",\n", " save_image=True\n", ")" ] }, { "cell_type": "markdown", "id": "b44d9f71-d05e-4fae-bdc9-d8c73267fea5", "metadata": {}, "source": [ "## Retrieving key-values and exporting as CSV\n", "Form data (key-value pairs) is stored as a property at both the document and page level and can be accessed as shown below." ] }, { "cell_type": "code", "execution_count": 2, "id": "d65a0edc-58e8-4dbd-a8f6-4883c5859c94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Date : 04/23/2020,\n", " Phone : 615-373-6883,\n", " Address : BLVD,\n", " Cellular : 683-426-2200,\n", " Work : 726-448-6720,\n", " Time : P.M.,\n", " Phone : 626-200-4890,\n", " Cleaning Tech : LEWIS,\n", " Customer : CAMPBELL,\n", " Day : Wednesday,\n", " Name : CAMPBELL,\n", " City : YORK,\n", " E-Mail\" : vilcomp@gmail.com,\n", " Special Instructions or Directions: : ,\n", " Sales Tax : 00,\n", " Late Fee : 00,\n", " TOTAL : 00]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# All key-values present in the document\n", "document.key_values" ] }, { "cell_type": "code", "execution_count": 6, "id": "705ffaff-ffe6-4db1-829a-565a61981124", "metadata": {}, "outputs": [], "source": [ "# Export the key-values as CSV to the current working directory\n", "document.export_kv_to_csv(\n", " include_kv=True,\n", " include_checkboxes=False,\n", " filepath=os.path.join(os.getcwd(), \"kv.csv\")\n", ")" ] }, { "cell_type": "markdown", "id": "56a77862-1431-4261-b001-13cd70f87070", "metadata": {}, "source": [ "## View CSV as dataframe\n", "To verify the contents of the stored file, we open it as a pandas DataFrame."
] }, { "cell_type": "code", "execution_count": 7, "id": "72bbc1d6-71ff-42e4-ac7b-8dc8a67faaf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\"><th></th><th>Key</th><th>Value</th></tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr><th>0</th><td>Date</td><td>04/23/2020</td></tr>\n",
 " <tr><th>1</th><td>Phone</td><td>615-373-6883</td></tr>\n",
 " <tr><th>2</th><td>Address</td><td>BLVD</td></tr>\n",
 " <tr><th>3</th><td>Cellular</td><td>683-426-2200</td></tr>\n",
 " <tr><th>4</th><td>Work</td><td>726-448-6720</td></tr>\n",
 " <tr><th>5</th><td>Time</td><td>P.M.</td></tr>\n",
 " <tr><th>6</th><td>Phone</td><td>626-200-4890</td></tr>\n",
 " <tr><th>7</th><td>Cleaning Tech</td><td>LEWIS</td></tr>\n",
 " <tr><th>8</th><td>Customer</td><td>CAMPBELL</td></tr>\n",
 " <tr><th>9</th><td>Day</td><td>Wednesday</td></tr>\n",
 " <tr><th>10</th><td>Name</td><td>CAMPBELL</td></tr>\n",
 " <tr><th>11</th><td>City</td><td>YORK</td></tr>\n",
 " <tr><th>12</th><td>E-Mail\"</td><td>vilcomp@gmail.com</td></tr>\n",
 " <tr><th>13</th><td>Special Instructions or Directions:</td><td>NaN</td></tr>\n",
 " <tr><th>14</th><td>Sales Tax</td><td>00</td></tr>\n",
 " <tr><th>15</th><td>Late Fee</td><td>00</td></tr>\n",
 " <tr><th>16</th><td>TOTAL</td><td>00</td></tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " Key Value\n", "0 Date 04/23/2020\n", "1 Phone 615-373-6883\n", "2 Address BLVD\n", "3 Cellular 683-426-2200\n", "4 Work 726-448-6720\n", "5 Time P.M.\n", "6 Phone 626-200-4890\n", "7 Cleaning Tech LEWIS\n", "8 Customer CAMPBELL\n", "9 Day Wednesday\n", "10 Name CAMPBELL\n", "11 City YORK\n", "12 E-Mail\" vilcomp@gmail.com\n", "13 Special Instructions or Directions: NaN\n", "14 Sales Tax 00\n", "15 Late Fee 00\n", "16 TOTAL 00" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_key_values = pd.read_csv(os.path.join(os.getcwd(), \"kv.csv\"))\n", "df_key_values" ] }, { "cell_type": "markdown", "id": "be6f8cb3", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Textractor supports many more APIs and use cases. If this example did not address yours, we encourage you to look at [the other examples](https://aws-samples.github.io/amazon-textract-textractor/examples.html)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }