Document Linearization to Markdown or HTML with Textractor

This example goes deeper on text linearization in Textractor. Text linearization is the conversion of a 2D document, with its words, lines, layouts and tables, into a single text string.
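
To make the idea concrete, here is a toy sketch of linearization written in plain Python. It is not Textractor's algorithm, which relies on the LAYOUT information returned by Textract. The Word class, the linearize function and the line_tolerance parameter are all hypothetical names chosen for illustration. Words are grouped into lines by vertical position and then read left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for an OCR word with a normalized position."""
    text: str
    x: float  # left edge, in [0, 1]
    y: float  # top edge, in [0, 1]


def linearize(words, line_tolerance=0.01):
    """Group words into lines by their y position, then read each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if it sits at (almost) the same height
        # as the first word of that line; otherwise it starts a new line.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Words deliberately listed out of reading order.
words = [
    Word("7/25/2008", 0.30, 0.12),
    Word("Pay", 0.10, 0.12),
    Word("ending:", 0.18, 0.10),
    Word("date:", 0.16, 0.12),
    Word("Period", 0.10, 0.10),
    Word("7/18/2008", 0.30, 0.10),
]
text = linearize(words)
print(text)
```

A real document also has columns, tables and headers, which is why a simple sort like this one is not enough and layout-aware linearization is needed.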

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of extra dependencies available to tailor your installation to your use case. The base package has sensible defaults, but if your workflow uses PDFs you may want to install the PDF extra dependencies with pip install amazon-textract-textractor[pdf]. You can read more on extra dependencies in the documentation.

Calling Textract

[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures
[2]:
image = Image.open("../../../tests/fixtures/paystub.jpg").convert("RGB")
image
[2]:
[Image: the sample paystub loaded from paystub.jpg]

We can call Textract’s AnalyzeDocument API on this image. To achieve the best possible reading order, we recommend using at least the LAYOUT, TABLES and SIGNATURES features. OCR is always included with any AnalyzeDocument call. In this case we will also include FORMS.

[3]:
extractor = Textractor(region_name="us-west-2")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.FORMS, TextractFeatures.SIGNATURES],
    save_image=True
)

Base linearization with .get_text() is available on every component of a document:

[4]:
print(document.tables[0].get_text())
CO.     FILE    DEPT.   CLOCK   NUMBER
ABC     126543  123456  12345   00000000

However, you can also use .to_html() and .to_markdown() to obtain HTML and Markdown output respectively.

[5]:
print(document.tables[0].to_html())
<table><tr><th>CO.</th><th>FILE</th><th>DEPT.</th><th>CLOCK</th><th>NUMBER</th></tr>
<tr><td>ABC</td><td>126543</td><td>123456</td><td>12345</td><td>00000000</td></tr>
</table>
[6]:
print(document.tables[0].to_markdown())
| CO.    |   FILE  |   DEPT.  |   CLOCK  |   NUMBER  |
|--------|---------|----------|----------|-----------|
| ABC    | 126543  |  123456  |   12345  | 00000000  |

Both the HTML and Markdown options will use the table cell types to identify headers automatically.
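
Because header cells come out as <th> and data cells as <td>, downstream code can tell them apart without any Textractor-specific logic. The following is a minimal sketch using only the html.parser module from Python's standard library. The TableCellCollector class is a hypothetical helper, not part of Textractor, and the input string is copied from the HTML output shown above.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Hypothetical helper: collects <th> text as headers and <td> text as rows."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._row = None   # cells of the <tr> currently being read
        self._tag = None   # "th" or "td" while inside a cell
        self._buf = ""     # text accumulated for the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._tag = tag
            self._buf = ""

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._tag is not None:
            self._buf += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._tag == tag:
            target = self.headers if tag == "th" else self._row
            if target is not None:
                target.append(self._buf.strip())
            self._tag = None
        elif tag == "tr":
            # Rows made only of header cells leave self._row empty and are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None


html = (
    "<table><tr><th>CO.</th><th>FILE</th><th>DEPT.</th><th>CLOCK</th><th>NUMBER</th></tr>\n"
    "<tr><td>ABC</td><td>126543</td><td>123456</td><td>12345</td><td>00000000</td></tr>\n"
    "</table>"
)

collector = TableCellCollector()
collector.feed(html)

# Pair each data row with the detected headers.
records = [dict(zip(collector.headers, row)) for row in collector.rows]
print(records)
```

This sketch assumes a single header row per table; tables whose header cells are mixed into data rows would need a different pairing strategy.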

You can apply your own configuration by building a TextLinearizationConfig object. Both .to_html() and .to_markdown() are themselves implemented with preset TextLinearizationConfig objects. Say you did not want header cells (<th>) in your table output: you could change the HTMLLinearizationConfig object and call .get_text() with it.

[7]:
from textractor.data.html_linearization_config import HTMLLinearizationConfig

config = HTMLLinearizationConfig()
config.table_cell_header_prefix = "<td>"
config.table_cell_header_suffix = "</td>"

print(document.tables[0].get_text(config))
<table><tr><td>CO.</td><td>FILE</td><td>DEPT.</td><td>CLOCK</td><td>NUMBER</td></tr>
<tr><td>ABC</td><td>126543</td><td>123456</td><td>12345</td><td>00000000</td></tr>
</table>
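
Conceptually, a linearization config can be thought of as a set of prefix and suffix strings that are wrapped around each element as it is written out. The toy sketch below illustrates that idea only; ToyTableConfig and render_table are hypothetical names and do not reflect how Textractor is implemented internally.

```python
from dataclasses import dataclass


@dataclass
class ToyTableConfig:
    """Hypothetical stand-in for a linearization config (not Textractor's class)."""
    table_prefix: str = "<table>"
    table_suffix: str = "</table>"
    row_prefix: str = "<tr>"
    row_suffix: str = "</tr>"
    header_prefix: str = "<th>"
    header_suffix: str = "</th>"
    cell_prefix: str = "<td>"
    cell_suffix: str = "</td>"


def render_table(rows, config, header_rows=1):
    """Wrap every cell, row and table in the strings supplied by the config."""
    out = [config.table_prefix]
    for index, row in enumerate(rows):
        is_header = index < header_rows
        prefix = config.header_prefix if is_header else config.cell_prefix
        suffix = config.header_suffix if is_header else config.cell_suffix
        cells = "".join(f"{prefix}{cell}{suffix}" for cell in row)
        out.append(f"{config.row_prefix}{cells}{config.row_suffix}")
    out.append(config.table_suffix)
    return "".join(out)


rows = [["CO.", "FILE"], ["ABC", "126543"]]

# Default config: the first row is rendered with header tags.
default_html = render_table(rows, ToyTableConfig())

# Overriding the header prefix and suffix turns header cells into plain cells.
plain_html = render_table(
    rows, ToyTableConfig(header_prefix="<td>", header_suffix="</td>")
)

print(default_html)
print(plain_html)
```

Keeping the formatting in a config object rather than in the rendering code is what lets one traversal of the document produce plain text, HTML or Markdown.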

All entities can be linearized

This is not limited to tables: all entities have the .get_text(), .to_html() and .to_markdown() methods.

[8]:
print(document.tables.to_markdown())
| CO.    |   FILE  |   DEPT.  |   CLOCK  |   NUMBER  |
|--------|---------|----------|----------|-----------|
| ABC    | 126543  |  123456  |   12345  | 00000000  |

|                |           |
|----------------|-----------|
| Period ending: | 7/18/2008 |
| Pay date:      | 7/25/2008 |

|          |                       |
|----------|-----------------------|
| Federal: | 3. $25 Additional Tax |
| State:   | 2                     |
| Local:   | 2                     |

| Earnings    | rate      | hours    | this period    | year to date    |
|-------------|-----------|----------|----------------|-----------------|
| Regular     | 10.00     | 32.00    | 320.00         | 16,640.00       |
| Overtime    | 15.00     | 1.00     | 15.00          | 780.00          |
| Holiday     | 10.00     | 8.00     | 80.00          | 4,160.00        |
| Tuition     |           |          | 37.43          | 1,946.80        |
|             | Gross Pay |          | $ 452.43       | 23,526.80       |

| Other Benefits and Information    | this period    | total to date    |
|-----------------------------------|----------------|------------------|
| Group Term Life                   | 0.51           | 27.00            |
| Loan Amt Paid                     |                | 840.00           |
| Vac Hrs                           |                | 40.00            |
| Sick Hrs                          |                | 16.00            |
| Title                             | Operator       |                  |

|            |                              |         |          |
|------------|------------------------------|---------|----------|
| Deductions | Statutory Federal Income Tax | -40.60  | 2,111.20 |
|            | Social Security Tax          | -28.05  | 1,458.60 |
|            | Medicare Tax                 | -6.56   | 341.12   |
|            | NY State Income Tax          | -8.43   | 438.36   |
|            | NYC Income Tax               | -5.94   | 308.88   |
|            | NY SUI/SDI Tax Other         | -0.60   | 31.20    |
|            | Bond                         | -5.00   | 100.00   |
|            | 401(k)                       | -28.85* | 1,500.20 |
|            | Stock Plan                   | -15.00  | 150.00   |
|            | Life Insurance               | -5.00   | 50.00    |
|            | Loan                         | -30.00  | 150.00   |
|            | Adjustment                   |         |          |
|            | Life Insurance               | + 13.50 |          |
|            |                              |         |          |
|            | Net Pay                      | $291.90 |          |

|                       |             |
|-----------------------|-------------|
| Payroll check number: | 0000000000  |
| Pay date:             | 7/25/2008   |
| Social Security No.   | 987-65-4321 |

|                      |                                           |         |
|----------------------|-------------------------------------------|---------|
| Pay to the order of: | JOHN STILES                               |         |
| This amount:         | TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS | $291.90 |
[9]:
print(document.key_values.get_text())
CO. ABC

CLOCK 12345

FILE 126543

DEPT. 123456

NUMBER 00000000

Period ending: 7/18/2008

Pay date: 7/25/2008

Social Security Number: 987-65-4321

Taxable Marital Status: Married

Federal: 3. $25 Additional Tax

JOHN STILES 101 MAIN STREET ANYTOWN, USA 12345

State: 2

Local: 2

total to date 27.00

Loan Amt Paid 840.00

Vac Hrs 40.00

Gross Pay $ 452.43

Sick Hrs 16.00

Title Operator

HOURLY RATE HAS BEEN CHANGED FROM $8.00

PER HOUR. $10.00

Life Insurance + 13.50

Net Pay $291.90

Your federal wages this period are $386.15

Payroll check number: 0000000000

Pay date: 7/25/2008

ANY COMPANY CORP. 475 ANY AVENUE ANYTOWN, USA 10101

Social Security No. 987-65-4321

Pay to the order of: JOHN STILES

This amount: TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS $291.90

VOID AFTER 00 DAYS Authorized AUTHORIZED SIGNATURE Signature

BANK NAME STREET ADDRESS CITY STATE ZIP SAMPLE NON-NEGOTIABLE VOID VOID VOID

What if you are passing this to an LLM and would like to cleanly separate keys and values? You can use a custom TextLinearizationConfig to add special tokens that act as delimiters.

[11]:
from textractor.data.text_linearization_config import TextLinearizationConfig

config = TextLinearizationConfig(
    key_prefix="<key>",
    key_suffix="</key>",
    value_prefix="<value>",
    value_suffix="</value>",
)
print(document.key_values.get_text(config))
<key>CO.</key><value>ABC </value>

<key>CLOCK</key><value>12345 </value>

<key>FILE</key><value>126543 </value>

<key>DEPT.</key><value>123456 </value>

<key>NUMBER</key><value>00000000 </value>

<key>Period ending:</key><value>7/18/2008 </value>

<key>Pay date:</key><value>7/25/2008 </value>

<key>Social Security Number:</key><value>987-65-4321 </value>

<key>Taxable Marital Status:</key><value>Married </value>

<key>Federal:</key><value>3. $25 Additional Tax </value>

<key>JOHN STILES</key><value>101 MAIN STREET ANYTOWN, USA 12345 </value>

<key>State:</key><value>2 </value>

<key>Local:</key><value>2 </value>

<key>total to date</key><value>27.00 </value>

<key>Loan Amt Paid</key><value>840.00 </value>

<key>Vac Hrs</key><value>40.00 </value>

<key>Gross Pay</key><value>$ 452.43 </value>

<key>Sick Hrs</key><value>16.00 </value>

<key>Title</key><value>Operator </value>

<key>HOURLY RATE HAS BEEN CHANGED FROM</key><value>$8.00 </value>

<key>PER HOUR.</key><value>$10.00 </value>

<key>Life Insurance</key><value>+ 13.50 </value>

<key>Net Pay</key><value>$291.90 </value>

<key>Your federal wages this period are</key><value>$386.15 </value>

<key>Payroll check number:</key><value>0000000000 </value>

<key>Pay date:</key><value>7/25/2008 </value>

<key>ANY COMPANY CORP.</key><value>475 ANY AVENUE ANYTOWN, USA 10101 </value>

<key>Social Security No.</key><value>987-65-4321 </value>

<key>Pay to the order of:</key><value>JOHN STILES </value>

<key>This amount:</key><value>TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS $291.90 </value>

<key>VOID AFTER 00 DAYS</key><value>Authorized AUTHORIZED SIGNATURE Signature </value>

<key>BANK NAME STREET ADDRESS CITY STATE ZIP</key><value>SAMPLE NON-NEGOTIABLE VOID VOID VOID </value>
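
The same delimiters also make the output easy to parse back into pairs in ordinary code. Below is a minimal sketch using only Python's re module; parse_pairs is a hypothetical helper, not a Textractor function, and the sample string is a shortened copy of the output above.

```python
import re

# Non-greedy groups so that each match stops at the first closing token.
PAIR_PATTERN = re.compile(r"<key>(.*?)</key><value>(.*?)</value>", re.DOTALL)


def parse_pairs(text):
    """Return (key, value) tuples in document order, with surrounding whitespace removed."""
    return [(key.strip(), value.strip()) for key, value in PAIR_PATTERN.findall(text)]


sample = (
    "<key>Period ending:</key><value>7/18/2008 </value>\n\n"
    "<key>Pay date:</key><value>7/25/2008 </value>\n\n"
    "<key>Net Pay</key><value>$291.90 </value>"
)

pairs = parse_pairs(sample)
for key, value in pairs:
    print(f"{key!r} -> {value!r}")
```

A list of tuples is used instead of a dictionary because the same key can appear more than once in a document, as "Pay date:" does in this paystub.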

We can now combine all of the above to get an output specifically tailored to your workflow:

[14]:
config = HTMLLinearizationConfig(
    table_cell_header_prefix = "<td>",
    table_cell_header_suffix = "</td>",
    key_prefix="<key>",
    key_suffix="</key>",
    value_prefix="<value>",
    value_suffix="</value>",
)
print(document.get_text(config))
1



<table><tr><td>CO.</td><td>FILE</td><td>DEPT.</td><td>CLOCK</td><td>NUMBER</td></tr>
<tr><td>ABC</td><td>126543</td><td>123456</td><td>12345</td><td>00000000</td></tr>
</table>



ANY COMPANY CORP. 475 ANY AVENUE ANYTOWN, USA 10101

<h2>Earnings Statement </h2>



<table><tr><td>Period ending:</td><td>7/18/2008</td></tr>
<tr><td>Pay date:</td><td>7/25/2008</td></tr>
</table>



 <key>Social Security Number:</key><value>987-65-4321 </value> <key>Taxable Marital Status:</key><value>Married </value> Exemptions/Allowances:



<table><tr><td>Federal:</td><td>3. $25 Additional Tax</td></tr>
<tr><td>State:</td><td>2</td></tr>
<tr><td>Local:</td><td>2</td></tr>
</table>



 <key>JOHN STILES</key><value>101 MAIN STREET ANYTOWN, USA 12345 </value>



<table><tr><td>Earnings</td><td>rate</td><td>hours</td><td>this period</td><td>year to date</td></tr>
<tr><td>Regular</td><td>10.00</td><td>32.00</td><td>320.00</td><td>16,640.00</td></tr>
<tr><td>Overtime</td><td>15.00</td><td>1.00</td><td>15.00</td><td>780.00</td></tr>
<tr><td>Holiday</td><td>10.00</td><td>8.00</td><td>80.00</td><td>4,160.00</td></tr>
<tr><td>Tuition</td><td></td><td></td><td>37.43</td><td>1,946.80</td></tr>
<tr><td></td><td>Gross Pay</td><td></td><td>$ 452.43</td><td>23,526.80</td></tr>
</table>
<table><tr><td>Deductions</td><td>Statutory Federal Income Tax</td><td>-40.60</td><td>2,111.20</td></tr>
<tr><td></td><td>Social Security Tax</td><td>-28.05</td><td>1,458.60</td></tr>
<tr><td></td><td>Medicare Tax</td><td>-6.56</td><td>341.12</td></tr>
<tr><td></td><td>NY State Income Tax</td><td>-8.43</td><td>438.36</td></tr>
<tr><td></td><td>NYC Income Tax</td><td>-5.94</td><td>308.88</td></tr>
<tr><td></td><td>NY SUI/SDI Tax Other</td><td>-0.60</td><td>31.20</td></tr>
<tr><td></td><td>Bond</td><td>-5.00</td><td>100.00</td></tr>
<tr><td></td><td>401(k)</td><td>-28.85*</td><td>1,500.20</td></tr>
<tr><td></td><td>Stock Plan</td><td>-15.00</td><td>150.00</td></tr>
<tr><td></td><td>Life Insurance</td><td>-5.00</td><td>50.00</td></tr>
<tr><td></td><td>Loan</td><td>-30.00</td><td>150.00</td></tr>
<tr><td></td><td>Adjustment</td><td></td><td></td></tr>
<tr><td></td><td>Life Insurance</td><td>+ 13.50</td><td></td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td>Net Pay</td><td>$291.90</td><td></td></tr>
</table>
*Excluded from federal taxable wages



 <key>Your federal wages this period are</key><value>$386.15 </value>



<table><tr><td>Other Benefits and Information</td><td>this period</td><td>total to date</td></tr>
<tr><td>Group Term Life</td><td>0.51</td><td>27.00</td></tr>
<tr><td>Loan Amt Paid</td><td></td><td>840.00</td></tr>
<tr><td>Vac Hrs</td><td></td><td>40.00</td></tr>
<tr><td>Sick Hrs</td><td></td><td>16.00</td></tr>
<tr><td>Title</td><td>Operator</td><td></td></tr>
</table>



<h2>Important Notes </h2>

EFFECTIVE THIS PAY PERIOD YOUR REGULAR <key>HOURLY RATE HAS BEEN CHANGED FROM</key><value>$8.00 </value> TO <key>PER HOUR.</key><value>$10.00 </value>

WE WILL BE STARTING OUR UNITED WAY FUND DRIVE SOON AND LOOK FORWARD TO YOUR PARTICIPATION.

ESTS8ET03

 <key>ANY COMPANY CORP.</key><value>475 ANY AVENUE ANYTOWN, USA 10101 </value>



<table><tr><td>Payroll check number:</td><td>0000000000</td></tr>
<tr><td>Pay date:</td><td>7/25/2008</td></tr>
<tr><td>Social Security No.</td><td>987-65-4321</td></tr>
</table>





<table><tr><td>Pay to the order of:</td><td>JOHN STILES</td><td></td></tr>
<tr><td>This amount:</td><td>TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS</td><td>$291.90</td></tr>
</table>



20 APP 1933 $ 1000.000 2001

SAMPLE NON-NEGOTIABLE VOID VOID VOID <key>VOID AFTER 00 DAYS</key><value>Authorized AUTHORIZED SIGNATURE Signature </value>

[SIGNATURE]


BANK NAME STREET ADDRESS CITY STATE ZIP

001379⑈ ⑆122000496⑆4040110157⑈

THEORIGINALDOCUMENTHASAREFLECTIVEWATERMARKONTHEBAOK.

Conclusion

In this tutorial, we have shown how the 50+ configuration options of TextLinearizationConfig can be used to produce an output specifically tailored to your workflow.