Table data extraction to Excel

This example uses Textractor to convert an image to a Microsoft Office Excel file that can be further processed manually.

Installation

To begin, install the amazon-textract-textractor package using pip.

pip install amazon-textract-textractor

There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with pip install amazon-textract-textractor[pdf]. You can read more on extra dependencies in the documentation

Calling Textract

[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType
[18]:
image = Image.open("../../../tests/fixtures/amzn_q2.png")
image
[18]:
../_images/notebooks_table_data_to_various_formats_2_0.png
[19]:
extractor = Textractor(profile_name="default")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
[20]:
document
[20]:
This document holds the following data:
Pages - 1
Words - 564
Lines - 318
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0

As expected, Textract identified the table. Let’s see how it looks.

[21]:
table = EntityList(document.tables[0])
document.tables[0].visualize()
[21]:
../_images/notebooks_table_data_to_various_formats_6_0.png

Now let’s export it to Excel.

[6]:
table[0].to_excel(filepath="out.xlsx")

If you open “out.xlsx” in Microsoft Office Excel you should see the table formatted with the the merged cells preserved.

[7]:
Image.open("imgs/excel.png")
[7]:
../_images/notebooks_table_data_to_various_formats_10_0.png

Table as a pandas DataFrame

[8]:
table[0].to_pandas()
[8]:
0 1 2 3 4 5 6
0 Three Months June 30, Ended Six Months June Ended 30, Twelve Months June 30, Ended
1 2021 2022 2021 2022 2021 2022
2
3 CASH, PERIOD CASH EQUIVALENTS, AND RESTRICTED ... $ 34,155 $ 36,599 $ 42,377 $ 36,477 $ 37,842 $ 40,667
4 OPERATING ACTIVITIES:
5 Net income (loss) 7,778 (2,028) 15,885 (5,872) 29,438 11,607
6 Adjustments to reconcile net income (loss) to ...
7 Depreciation content costs, and operating amor... 8,038 9,594 15,546 18,572 29,687 37,322
8 Stock-based compensation 3,591 5,209 5,897 8,459 10,747 15,319
9 Other operating expense (income), net 18 122 48 337 (372) 426
10 Other expense (income), net (1,258) 6,104 (2,714) 14,793 (5,092) 3,201
11 Deferred income taxes 701 (1,955) 2,404 (3,956) 1,063 (6,670)
12 Changes in operating assets and liabilities:
13 Inventories (209) (3,890) (513) (6,504) (4,082) (15,478)
14 Accounts receivable, net and other (4,462) (6,799) (6,717) (8,315) (13,294) (19,761)
15 Accounts payable 47 3,699 (8,219) (5,681) 8,689 6,140
16 Accrued expenses and other (1,685) (1,412) (5,745) (7,315) 1,071 553
17 Unearned revenue 156 321 1,056 1,657 1,467 2,915
18 Net cash provided by (used in) operating activ... 12,715 8,965 16,928 6,175 59,322 35,574
19 INVESTING ACTIVITIES:
20 Purchases of property and equipment (14,288) (15,724) (26,370) (30,675) (52,256) (65,358)
21 Proceeds from property and equipment sales and... 1,300 1,626 2,195 2,835 5,080 6,297
22 Acquisitions, net of cash acquired, and other (320) (259) (950) (6,600) (3,066) (7,635)
23 Sales and maturities of marketable securities 13,213 2,608 31,039 25,361 61,512 53,706
24 Purchases of marketable securities (21,985) (329) (36,660) (2,093) (74,929) (25,590)
25 Net cash provided by (used in) investing activ... (22,080) (12,078) (30,746) (11,172) (63,659) (38,580)
26 FINANCING ACTIVITIES:
27 Common stock repurchased - (3,334) - (6,000) - (6,000)
28 Proceeds from short-term debt. and other 1,176 4,865 3,102 18,608 6,848 23,462
29 Repayments of short-term debt, and other (1,176) (7,610) (3,177) (13,841) (6,817) (18,417)
30 Proceeds from long-term debt 18,516 12,824 18,627 12,824 19,158 13,200
31 Repayments of long-term debt (41) (1) (80) (1) (1,392) (1,511)
32 Principal repayments of finance leases (2,804) (2,059) (6,210) (4,836) (11,435) (9,789)
33 Principal repayments of financing obligations (28) (59) (95) (138) (116) (205)
34 Net cash provided by (used in) financing activ... 15,643 4,626 12,167 6,616 6,246 740
35 Foreign currency effect on cash, cash equivale... 234 (412) (59) (396) 916 (701)
36 Net increase (decrease) in cash, cash equivale... 6,512 1,101 (1,710) 1,223 2,825 (2,967)
37 CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, E... $ 40,667 $ 37,700 $ 40,667 $ 37,700 $ 40,667 $ 37,700
38 SUPPLEMENTAL CASH FLOW INFORMATION:
39 Cash paid for interest on debt $ 179 $ 349 $ 455 $ 628 $ 942 $ 1,271
40 Cash paid for operating leases 1,577 2,088 3,217 4,455 5,577 7,960
41 Cash paid for interest on finance leases 129 95 286 202 569 437
42 Cash paid for interest on financing obligations 35 55 68 113 127 198
43 Cash paid for income taxes, net of refunds 1,803 3,145 2,604 3,598 3,526 4,682
44 Assets acquired under operating leases 5,578 5,101 9,114 7,276 19,576 23,531
45 Property modifications and equipment acquired ... 1,642 61 3,709 227 9,976 3,579
46 Property lease arrangements and equipment reco... 1,193 986 2,080 2,351 3,486 6,117
47 Property lease arrangements, and equipment wit... 99 1,079 99 1,112 99 1,243

Table as CSV

This is a shorthand for table.to_pandas().to_csv().

[9]:
table[0].to_csv()
[9]:
',0,1,2,3,4,5,6\n0,,Three Months June,"30, Ended",Six Months June,"Ended 30,",Twelve Months June,"30, Ended"\n1,,2021,2022,2021,2022,2021,2022\n2,,,,,,,\n3,"CASH, PERIOD CASH EQUIVALENTS, AND RESTRICTED CASH, BEGINNING OF","$ 34,155","$ 36,599","$ 42,377","$ 36,477","$ 37,842","$ 40,667"\n4,OPERATING ACTIVITIES:,,,,,,\n5,Net income (loss),"7,778","(2,028)","15,885","(5,872)","29,438","11,607"\n6,Adjustments to reconcile net income (loss) to net cash from operating activities:,,,,,,\n7,"Depreciation content costs, and operating amortization lease of assets, property and and other equipment and capitalized","8,038","9,594","15,546","18,572","29,687","37,322"\n8,Stock-based compensation,"3,591","5,209","5,897","8,459","10,747","15,319"\n9,"Other operating expense (income), net",18,122,48,337,(372),426\n10,"Other expense (income), net","(1,258)","6,104","(2,714)","14,793","(5,092)","3,201"\n11,Deferred income taxes,701,"(1,955)","2,404","(3,956)","1,063","(6,670)"\n12,Changes in operating assets and liabilities:,,,,,,\n13,Inventories,(209),"(3,890)",(513),"(6,504)","(4,082)","(15,478)"\n14,"Accounts receivable, net and other","(4,462)","(6,799)","(6,717)","(8,315)","(13,294)","(19,761)"\n15,Accounts payable,47,"3,699","(8,219)","(5,681)","8,689","6,140"\n16,Accrued expenses and other,"(1,685)","(1,412)","(5,745)","(7,315)","1,071",553\n17,Unearned revenue,156,321,"1,056","1,657","1,467","2,915"\n18,Net cash provided by (used in) operating activities,"12,715","8,965","16,928","6,175","59,322","35,574"\n19,INVESTING ACTIVITIES:,,,,,,\n20,Purchases of property and equipment,"(14,288)","(15,724)","(26,370)","(30,675)","(52,256)","(65,358)"\n21,Proceeds from property and equipment sales and incentives,"1,300","1,626","2,195","2,835","5,080","6,297"\n22,"Acquisitions, net of cash acquired, and other",(320),(259),(950),"(6,600)","(3,066)","(7,635)"\n23,Sales and maturities of marketable securities,"13,213","2,608","31,039","25,361","61,512","53,706"\n24,Purchases of marketable securities,"(21,985)",(329),"(36,660)","(2,093)","(74,929)","(25,590)"\n25,Net cash provided by (used in) investing activities,"(22,080)","(12,078)","(30,746)","(11,172)","(63,659)","(38,580)"\n26,FINANCING ACTIVITIES:,,,,,,\n27,Common stock repurchased,-,"(3,334)",-,"(6,000)",-,"(6,000)"\n28,Proceeds from short-term debt. and other,"1,176","4,865","3,102","18,608","6,848","23,462"\n29,"Repayments of short-term debt, and other","(1,176)","(7,610)","(3,177)","(13,841)","(6,817)","(18,417)"\n30,Proceeds from long-term debt,"18,516","12,824","18,627","12,824","19,158","13,200"\n31,Repayments of long-term debt,(41),(1),(80),(1),"(1,392)","(1,511)"\n32,Principal repayments of finance leases,"(2,804)","(2,059)","(6,210)","(4,836)","(11,435)","(9,789)"\n33,Principal repayments of financing obligations,(28),(59),(95),(138),(116),(205)\n34,Net cash provided by (used in) financing activities,"15,643","4,626","12,167","6,616","6,246",740\n35,"Foreign currency effect on cash, cash equivalents, and restricted cash",234,(412),(59),(396),916,(701)\n36,"Net increase (decrease) in cash, cash equivalents, and restricted cash","6,512","1,101","(1,710)","1,223","2,825","(2,967)"\n37,"CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, END OF PERIOD","$ 40,667","$ 37,700","$ 40,667","$ 37,700","$ 40,667","$ 37,700"\n38,SUPPLEMENTAL CASH FLOW INFORMATION:,,,,,,\n39,Cash paid for interest on debt,$ 179,$ 349,$ 455,$ 628,$ 942,"$ 1,271"\n40,Cash paid for operating leases,"1,577","2,088","3,217","4,455","5,577","7,960"\n41,Cash paid for interest on finance leases,129,95,286,202,569,437\n42,Cash paid for interest on financing obligations,35,55,68,113,127,198\n43,"Cash paid for income taxes, net of refunds","1,803","3,145","2,604","3,598","3,526","4,682"\n44,Assets acquired under operating leases,"5,578","5,101","9,114","7,276","19,576","23,531"\n45,"Property modifications and equipment acquired under finance leases, net of remeasurements and","1,642",61,"3,709",227,"9,976","3,579"\n46,Property lease arrangements and equipment recognized during the construction period of build-to-suit,"1,193",986,"2,080","2,351","3,486","6,117"\n47,"Property lease arrangements, and equipment with derecognized the associated after leases the construction recognized period as operating of build-to-suit",99,"1,079",99,"1,112",99,"1,243"\n'

Table to Microsoft Office Excel

The only format that supports merged cells.

[10]:
table[0].to_excel("out.xlsx")

Conclusion

Textractor makes it easy to go from one format to another and streamlines your table processing workflows.