{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "737e6086-5638-4e74-b26f-845adc4dfffc",
   "metadata": {},
   "source": [
    "# Exporting Form Data\n",
    "\n",
    "We now move from Textract OCR to Textract Forms, the API to extract key-value pairs. Here we want to export all key-values extracted from an image as a .csv file.\n",
    "\n",
    "## Installation\n",
    "\n",
    "To begin, install the `amazon-textract-textractor` package using pip.\n",
    "\n",
    "`pip install amazon-textract-textractor`\n",
    "\n",
    "There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fac93401-7b94-4862-b87e-7d8376a29562",
   "metadata": {},
   "source": [
    "## Calling Textract\n",
    "\n",
    "We use the asynchronous API for this example, but as seen in the OCR example the synchronous API exposes the same methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fa86d86f-6272-4021-b4da-aa751ae2e22e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from PIL import Image\n",
    "from textractor import Textractor\n",
    "from textractor.data.constants import TextractFeatures\n",
    "\n",
    "extractor = Textractor(profile_name=\"default\")\n",
    "document = extractor.start_document_analysis(\n",
    "    # Here we pass a Pillow image instead of path. This changes nothing as\n",
    "    # Textractor supports most input types.\n",
    "    file_source=Image.open(\"../../../tests/fixtures/form.png\"),\n",
    "    # We specify the features that we want, here, we only want keys and values\n",
    "    # therefore we use TextractFeatures.FORMS.\n",
    "    features=[TextractFeatures.FORMS],\n",
    "    s3_upload_path=\"s3://textract-ocr/temp/\",\n",
    "    save_image=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b44d9f71-d05e-4fae-bdc9-d8c73267fea5",
   "metadata": {},
   "source": [
    "## Retrieving key-values and exporting as CSV\n",
    "Form data/Key-values are stored at the document and page level as a property and can be accessed as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d65a0edc-58e8-4dbd-a8f6-4883c5859c94",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Date : 04/23/2020,\n",
       " Phone : 615-373-6883,\n",
       " Address : BLVD,\n",
       " Cellular : 683-426-2200,\n",
       " Work : 726-448-6720,\n",
       " Time : P.M.,\n",
       " Phone : 626-200-4890,\n",
       " Cleaning Tech : LEWIS,\n",
       " Customer : CAMPBELL,\n",
       " Day : Wednesday,\n",
       " Name : CAMPBELL,\n",
       " City : YORK,\n",
       " E-Mail\" : vilcomp@gmail.com,\n",
       " Special Instructions or Directions: : ,\n",
       " Sales Tax : 00,\n",
       " Late Fee : 00,\n",
       " TOTAL : 00]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# All key-values present in the document\n",
    "document.key_values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "705ffaff-ffe6-4db1-829a-565a61981124",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export the key-values as csv\n",
    "document.export_kv_to_csv(\n",
    "    include_kv=True,\n",
    "    include_checkboxes=False,\n",
    "    filepath=os.path.join(\"kv.csv\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56a77862-1431-4261-b001-13cd70f87070",
   "metadata": {},
   "source": [
    "## View CSV as dataframe\n",
    "To verify the contents of the file stored, we open it as a Pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "72bbc1d6-71ff-42e4-ac7b-8dc8a67faaf7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Key</th>\n",
       "      <th>Value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Date</td>\n",
       "      <td>04/23/2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Phone</td>\n",
       "      <td>615-373-6883</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Address</td>\n",
       "      <td>BLVD</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Cellular</td>\n",
       "      <td>683-426-2200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Work</td>\n",
       "      <td>726-448-6720</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Time</td>\n",
       "      <td>P.M.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Phone</td>\n",
       "      <td>626-200-4890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Cleaning Tech</td>\n",
       "      <td>LEWIS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Customer</td>\n",
       "      <td>CAMPBELL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Day</td>\n",
       "      <td>Wednesday</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Name</td>\n",
       "      <td>CAMPBELL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>City</td>\n",
       "      <td>YORK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>E-Mail\"</td>\n",
       "      <td>vilcomp@gmail.com</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Special Instructions or Directions:</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Sales Tax</td>\n",
       "      <td>00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Late Fee</td>\n",
       "      <td>00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>TOTAL</td>\n",
       "      <td>00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    Key              Value\n",
       "0                                  Date         04/23/2020\n",
       "1                                 Phone       615-373-6883\n",
       "2                               Address               BLVD\n",
       "3                              Cellular       683-426-2200\n",
       "4                                  Work       726-448-6720\n",
       "5                                  Time               P.M.\n",
       "6                                 Phone       626-200-4890\n",
       "7                         Cleaning Tech              LEWIS\n",
       "8                              Customer           CAMPBELL\n",
       "9                                   Day          Wednesday\n",
       "10                                 Name           CAMPBELL\n",
       "11                                 City               YORK\n",
       "12                              E-Mail\"  vilcomp@gmail.com\n",
       "13  Special Instructions or Directions:                NaN\n",
       "14                            Sales Tax                 00\n",
       "15                             Late Fee                 00\n",
       "16                                TOTAL                 00"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_key_values = pd.read_csv(os.path.join(os.getcwd(), \"kv.csv\"))\n",
    "df_key_values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be6f8cb3",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "There are many more supported APIs and use cases in Textractor, if this did not address your use case, we encourage you to look at [the other examples](https://aws-samples.github.io/amazon-textract-textractor/examples.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}