Installation

Official package

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default, this installs the minimal version of Textractor. The following extras add optional features:

pdfium (pip install amazon-textract-textractor[pdfium]) includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install amazon-textract-textractor[pdf]) includes pdf2image and is an alternative way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install amazon-textract-textractor[torch]) includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the approaches that do not rely on machine learning.
dev (pip install amazon-textract-textractor[dev]) includes all the dependencies above and everything else needed to test the code.

To install several extras at once, separate the labels with commas, for example: pip install amazon-textract-textractor[pdf,torch].
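After installing, it can be handy to confirm which extras actually landed in the current environment. The following is a minimal sketch, not part of Textractor itself: the helper name available_extras is hypothetical, and the mapping assumes the import names match the packages listed above (pypdfium2, pdf2image, sentence_transformers).

```python
from importlib.util import find_spec

# Assumed import names for the optional dependencies behind each extra.
EXTRA_MODULES = {
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(modules=EXTRA_MODULES):
    """Map each extra label to True if its module can be found, else False."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the module, so this check stays cheap.
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

An extra reported as missing can be added later by rerunning pip install with that label.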

From source

To install the package from source, first clone the repository with the following command:

git clone git@github.com:aws-samples/amazon-textract-textractor.git

Navigate into the amazon-textract-textractor directory in your terminal and run the following commands.

First, install the requirements with pip install -r requirements.txt

Then install the package in editable mode with pip install -e .
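To check that either installation route succeeded, you can ask Python for the installed distribution's version. This is a small sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution name is assumed to be the PyPI name used above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="amazon-textract-textractor"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("amazon-textract-textractor is not installed in this environment")
    else:
        print(f"amazon-textract-textractor {found} is installed")
```

If the script reports that the package is not installed, make sure the pip you ran belongs to the same interpreter or virtual environment you are running this check with.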

Try it out

The Demo.ipynb notebook serves as a reference for some of the functionality the package provides. Additionally, docs/tests/notebooks/ contains tutorials you can try out.