{ "cells": [ { "cell_type": "markdown", "id": "7c4c0bb1", "metadata": {}, "source": [ "# Parsing an existing response\n", "\n", "Since Amazon Textract is a paid service, it is likely that you will want to reduce your costs by developing and debugging with existing JSON responses. We offer a simple interface to do so.\n", "\n", "## Installation\n", "\n", "To begin, install the `amazon-textract-textractor` package using pip.\n", "\n", "`pip install amazon-textract-textractor`\n", "\n", "There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n", "\n", "## Not calling Textract\n", "\n", "There are two ways to parse an existing JSON. The simplest one, reminiscent of `PIL.Image.open()` is `Document.open()` which takes either a path or file-like object and parses it automatically. The path **can** be an S3 path." ] }, { "cell_type": "code", "execution_count": 1, "id": "73a1f8f1", "metadata": {}, "outputs": [], "source": [ "from textractor.entities.document import Document\n", "\n", "document = Document.open(\"../../../tests/fixtures/saved_api_responses/test_table.json\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "8b09b0d1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "This document holds the following data:\n", "Pages - 1\n", "Words - 51\n", "Lines - 24\n", "Key-values - 0\n", "Checkboxes - 0\n", "Tables - 1\n", "Identity Documents - 0\n", "Expense Documents - 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document" ] }, { "cell_type": "code", "execution_count": 3, "id": "547dfd45", "metadata": {}, "outputs": [], "source": [ "with open(\"../../../tests/fixtures/saved_api_responses/test_table.json\", \"r\") as f:\n", " document = Document.open(f)" ] }, { "cell_type": "code", "execution_count": 4, "id": "cd4ff320", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "This document holds the following data:\n", "Pages - 1\n", "Words - 51\n", "Lines - 24\n", "Key-values - 0\n", "Checkboxes - 0\n", "Tables - 1\n", "Identity Documents - 0\n", "Expense Documents - 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document" ] }, { "cell_type": "markdown", "id": "e7c833a3", "metadata": {}, "source": [ "## Instantiating from a dictionary\n", "\n", "Another possible solution is to use the `ResponseParser` directly with a `dict` object." ] }, { "cell_type": "code", "execution_count": 8, "id": "ab16a290", "metadata": {}, "outputs": [], "source": [ "import json\n", "from textractor.parsers import response_parser" ] }, { "cell_type": "code", "execution_count": 9, "id": "2140deeb", "metadata": {}, "outputs": [], "source": [ "with open(\"../../../tests/fixtures/saved_api_responses/test_table.json\", \"r\") as f:\n", " document = response_parser.parse(json.load(f))" ] }, { "cell_type": "code", "execution_count": 10, "id": "08c369e4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "This document holds the following data:\n", "Pages - 1\n", "Words - 51\n", "Lines - 24\n", "Key-values - 0\n", "Checkboxes - 0\n", "Tables - 1\n", "Identity Documents - 0\n", "Expense Documents - 0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document" ] }, { "cell_type": "markdown", "id": "93b94dbd", "metadata": {}, "source": [ "## Conclusion\n", "\n", "You can save and accelerate your development process by reusing responses. This approach is used in the Textractor unit tests." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }