Installation
Official package
Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs the minimal version of Textractor. The following extras can be used to add features:
pdfium (pip install amazon-textract-textractor[pdfium]) includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install amazon-textract-textractor[pdf]) includes pdf2image and is an additional way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install amazon-textract-textractor[torch]) includes sentence_transformers for better word search and matching. This will work on CPU but be noticeably slower than non-machine learning based approaches.
dev (pip install amazon-textract-textractor[dev]) includes all the dependencies above and everything else needed to test the code.
You can install several extras at once by separating their labels with commas, like this: pip install amazon-textract-textractor[pdf,torch].
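To check which extras ended up in your environment, a small script along these lines may help. It is only a sketch: it assumes that the import names match the package names listed above (pypdfium2, pdf2image, sentence_transformers), and the helper name available_extras is made up for this example and is not part of Textractor.

```python
from importlib.util import find_spec

# Extra label -> module that the extra is expected to make importable.
# Assumption: import names are the same as the package names listed above.
EXTRA_MODULES = {
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return the sorted labels of extras whose module can be located."""
    return sorted(
        extra
        for extra, module in extra_modules.items()
        if find_spec(module) is not None  # None means the module is not installed
    )


print("Installed extras:", available_extras() or "none")
```

The check only looks the modules up without importing them, so it stays fast even when the torch extra is installed.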
From Source
To install the package from source, first clone the repository with the following command:
git clone git@github.com:aws-samples/amazon-textract-textractor.git
Navigate into the amazon-textract-textractor directory in your terminal and run the following commands.
First, install the requirements with pip install -r requirements.txt
Then install the package in editable mode with pip install -e .
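After either installation route, you can confirm that the package is registered in the current environment. The snippet below is a minimal sketch using the standard library only; the helper name installed_version is made up for this example, and it assumes the distribution name is amazon-textract-textractor, as used in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="amazon-textract-textractor"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not present in this environment.
        return None


found = installed_version()
print("Textractor version:", found if found is not None else "not installed")
```

If this reports "not installed", check that pip ran in the same virtual environment as the Python interpreter you are using.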
Try it out
The Demo.ipynb notebook serves as a reference for some of the functionality the package provides.
Additionally, docs/tests/notebooks/ contains tutorials you can try out.