Using Textract OCR
If you only want to use the Amazon Textract OCR engine, you have to choose between the synchronous DetectDocumentText API and the asynchronous StartDocumentTextDetection API. The former will block until the OCR inference completes, while the latter will return a job_id that you can use to get the results later.
Installation
To begin, install the amazon-textract-textractor package using pip.
pip install amazon-textract-textractor
Several sets of optional dependencies are available to tailor the installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdfium]. You can read more about extra dependencies in the documentation.
Synchronous example
This example assumes that you have set up your AWS credentials and have a default profile. If that is not the case, see this page to get started.
[1]:
from textractor import Textractor
extractor = Textractor(profile_name="default")
# This path assumes that you are running the notebook from docs/source/notebooks
document = extractor.detect_document_text("../../../tests/fixtures/form.png")
[2]:
document
[2]:
This document holds the following data:
Pages - 1
Words - 259
Lines - 74
Key-values - 0
Checkboxes - 0
Tables - 0
Identity Documents - 0
The next step depends on your goal. If you simply want one long string of extracted OCR text, you can use document.text.
[4]:
document.text
[4]:
'"Service You\'ll Be\nBragging\n&\nRoyal\nAbout GUARANTEED!"\nYou have a full week to inspect your carpet.\nIf a spot returns or if there is a concern we\'ll return.\nOnline@royalcarpetlincoln.com\nCarpet & Upholstery Cleaning\n5401 S. 20 St. Circle Lincoln NE 68512\nAlways Free Estimates 402-423-7200\nYour privacy is important to us! Your e-mail address will not be shared or sold.\nName ELIZABETH CAMPBELL\nE-Mail camp@gmail.com\nAddress 90 OLD HICKORY BLVD\nCity NEW YORK\nPhone 615-373-6883 Work 726-448-6720\nCellular\n683-426-2200\nDate 04/23/2020\nDay Wednesday\nTime 12.30 P.M.\nCondition of Carpet or Furniture:\nSpecial Instructions or Directions:\nPet Odors\nAllergy Concerns\nExcessive Wear\nSoiled Furniture\nPermanent Wear\nLoose Seams\nPermanent Shading\n-Laminate Floor Concerns\nCarpet Cleaning\n250\nPet Odors\n50\nFood Stains\n45\nSteam cleaning\n100\nwater Damge Repair\n400\nloose seams repair\n200\nTile cleaning\n200\nDelivery cost\n200\nSales Tax\nDue to Insurance regulations; Items such as: breakables Ns. computers\n1445.00 00\nglassware, grandfather clocks. bookshelves, or pianos can not be moved\nPayment is due upon receipt.\nAccounts over 30 days will\nLate Fee\nRoyal is not responsible for color transfer. change. bleeding, shrinking,\n5.00\n00\nbe assessed a $5.00 late fee\nseems in carpet or furniture Some stains. spots or shading may\n& a 5% finance charge\nor loose permanent all claims.\nbe due to their nature. Royal Cleaning Service reserves the\nTOTAL\n1450\n00\nright to Replace, Repair or Refund the cost of cleaning on\nCustomer ELIZABETH CAMPBELL\nCleaning Tech JOHN LEWIS\nWhen can we contact you about your next service?\n6 Mo\n12 Mo\nOther\nPhone 626-200-4890'
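In the output above, the OCR lines are separated by newline characters, so ordinary string handling is enough for simple lookups. The following is a minimal sketch of that idea. It uses a shortened copy of the output above in place of document.text, and find_lines is a hypothetical helper written for this illustration, not a Textractor function.

```python
from typing import List

# Shortened copy of the OCR output shown above; in practice, pass document.text.
sample_text = (
    "Name ELIZABETH CAMPBELL\n"
    "E-Mail camp@gmail.com\n"
    "Address 90 OLD HICKORY BLVD\n"
    "City NEW YORK\n"
    "Customer ELIZABETH CAMPBELL"
)


def find_lines(text: str, keyword: str) -> List[str]:
    """Return the lines of `text` that contain `keyword`, ignoring case.

    Hypothetical helper for illustration only; not part of Textractor.
    """
    needle = keyword.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


matches = find_lines(sample_text, "campbell")
print(matches)
```

A plain substring search like this is only suitable for quick inspection, because it knows nothing about the layout of the form.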
Asynchronous example
If you have a lot of data or multipage PDFs, the synchronous API quickly becomes unwieldy because it is rate-limited. A solution to this problem is to use the asynchronous API, which creates jobs whose results you can fetch later. The input file needs to be in an S3 bucket. If you only have it locally, you can provide an s3_upload_path and Textractor will take care of uploading the file to that location before calling Textract.
[3]:
from textractor import Textractor
extractor = Textractor(profile_name="default")
# This path assumes that you are running the notebook from docs/source/notebooks
document = extractor.start_document_text_detection(
"../../../tests/fixtures/form.png",
s3_upload_path="s3://textract-ocr/temp/",
)
Instead of a Document object, an asynchronous function returns a LazyDocument object, which is functionally identical to Document but does not load the Textract response until you use it. This allows you to make as many requests as you want without ever blocking.
[4]:
type(document)
[4]:
textractor.entities.lazy_document.LazyDocument
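Because the asynchronous call returns before the OCR results are loaded, one way to use it is to submit a whole batch of files first and read the results afterwards. The sketch below shows that pattern under a few assumptions: submit_all is a hypothetical helper, not a Textractor function, and it only relies on the start_document_text_detection call and the s3_upload_path argument used in the cell above. The file names in the usage comment are placeholders.

```python
from typing import Any, Iterable, List


def submit_all(extractor: Any, paths: Iterable[str], s3_upload_path: str) -> List[Any]:
    """Start one asynchronous text-detection job per file.

    `extractor` is expected to be a Textractor instance. The returned list
    holds whatever start_document_text_detection returns for each file
    (LazyDocument objects, per the text above), in the same order as `paths`.
    Hypothetical helper for illustration only.
    """
    return [
        extractor.start_document_text_detection(path, s3_upload_path=s3_upload_path)
        for path in paths
    ]


# Usage sketch (needs AWS credentials and an S3 bucket you can write to):
#   extractor = Textractor(profile_name="default")
#   documents = submit_all(extractor, ["a.png", "b.png"], "s3://textract-ocr/temp/")
#   texts = [d.text for d in documents]  # results are fetched here, on access
```

Submitting everything before reading any result lets the jobs run on the Textract side while your code continues.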
When you access one of the document's properties, such as text, the object issues a Textract call to fetch the results.
[6]:
document.text
[6]:
'"Service You\'ll Be\nBragging\n&\nRoyal\nAbout GUARANTEED!"\nYou have a full week to inspect your carpet.\nIf a spot returns or if there is a concern we\'ll return.\nOnline@royalcarpetlincoln.com\nCarpet & Upholstery Cleaning\n5401 S. 20 St. Circle Lincoln NE 68512\nAlways Free Estimates 402-423-7200\nYour privacy is important to us! Your e-mail address will not be shared or sold.\nName ELIZABETH CAMPBELL\nE-Mail camp@gmail.com\nAddress 90 OLD HICKORY BLVD\nCity NEW YORK\nPhone 615-373-6883 Work 726-448-6720\nCellular\n683-426-2200\nDate 04/23/2020\nDay Wednesday\nTime 12.30 P.M.\nCondition of Carpet or Furniture:\nSpecial Instructions or Directions:\nPet Odors\nAllergy Concerns\nExcessive Wear\nSoiled Furniture\nPermanent Wear\nLoose Seams\nPermanent Shading\n-Laminate Floor Concerns\nCarpet Cleaning\n250\nPet Odors\n50\nFood Stains\n45\nSteam cleaning\n100\nwater Damge Repair\n400\nloose seams repair\n200\nTile cleaning\n200\nDelivery cost\n200\nSales Tax\nDue to Insurance regulations; Items such as: breakables Ns. computers\n1445.00 00\nglassware, grandfather clocks. bookshelves, or pianos can not be moved\nPayment is due upon receipt.\nAccounts over 30 days will\nLate Fee\nRoyal is not responsible for color transfer. change. bleeding, shrinking,\n5.00\n00\nbe assessed a $5.00 late fee\nseems in carpet or furniture Some stains. spots or shading may\n& a 5% finance charge\nor loose permanent all claims.\nbe due to their nature. Royal Cleaning Service reserves the\nTOTAL\n1450\n00\nright to Replace, Repair or Refund the cost of cleaning on\nCustomer ELIZABETH CAMPBELL\nCleaning Tech JOHN LEWIS\nWhen can we contact you about your next service?\n6 Mo\n12 Mo\nOther\nPhone 626-200-4890'
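The deferred loading described in this section can be pictured with a small stand-alone sketch. LazyText below is a toy class written for this illustration. It is not Textractor's LazyDocument, and how the real class polls for or caches the response is not covered on this page. The fetch callable stands in for the call that retrieves the Textract results.

```python
from typing import Callable, Optional


class LazyText:
    """Toy stand-in for a lazily loaded result; not Textractor's LazyDocument."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch                # called on first access only
        self._text: Optional[str] = None   # cached result once fetched
        self.fetch_count = 0               # bookkeeping for this demonstration

    @property
    def text(self) -> str:
        # Fetch the result on first access, then reuse the cached value.
        if self._text is None:
            self.fetch_count += 1
            self._text = self._fetch()
        return self._text


lazy = LazyText(lambda: "Name ELIZABETH CAMPBELL")
print(lazy.fetch_count)  # nothing has been fetched at construction time
print(lazy.text)         # first access triggers the fetch
print(lazy.text)         # second access reuses the cached value
print(lazy.fetch_count)
```

Creating the object is cheap, and the cost of retrieving the result is paid only when a property is first read.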
Conclusion
Textractor supports many more APIs and use cases. If this did not address yours, we encourage you to look at the other examples.