Installation
Official package
Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default, this installs the minimal version of Textractor. The following extras can be used to add features:
pdfium (pip install amazon-textract-textractor[pdfium]) includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install amazon-textract-textractor[pdf]) includes pdf2image and is an additional way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install amazon-textract-textractor[torch]) includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the non-machine-learning approaches.
dev (pip install amazon-textract-textractor[dev]) includes all the dependencies above and everything else needed to test the code.
You can pick several extras by separating the labels with commas, like this: pip install amazon-textract-textractor[pdf,torch]
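To see which extras are present in the current environment, a minimal sketch along the following lines may help. It assumes that each extra makes the module named in the list above importable (pypdfium2, pdf2image, sentence_transformers); the helper name available_extras and the EXTRA_MODULES mapping are hypothetical and are not part of Textractor.

```python
from importlib.util import find_spec

# Extra name -> module it is expected to make importable.
# The mapping is taken from the list of extras above; it is an assumption
# of this sketch, not something Textractor exposes.
EXTRA_MODULES = {
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the extra's module can be located."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

The check only looks up the module without importing it, so it stays fast even when sentence_transformers is installed.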
From Source
To install the package from source, first clone the repository with the following command:
git clone git@github.com:aws-samples/amazon-textract-textractor.git
Navigate into the amazon-textract-textractor directory in the terminal and run the following commands.
Install the requirements with pip install -r requirements.txt
Then install the package with pip install -e .
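After either installation route, one way to confirm that the package is registered in the active environment is to query its installed version. The sketch below uses the standard library's importlib.metadata with the distribution name from the pip command above; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="amazon-textract-textractor"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("amazon-textract-textractor is not installed in this environment")
    else:
        print(f"amazon-textract-textractor {found} is installed")
```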
Try it out
The Demo.ipynb notebook can be used as a reference to understand some of the functionality provided by the package.
Additionally, docs/tests/notebooks/ contains some tutorials you can try out.