Table data extraction to Excel
This example uses Textractor to convert an image to a Microsoft Office Excel file that can be further processed manually.
Installation
To begin, install the amazon-textract-textractor
package using pip.
pip install amazon-textract-textractor
There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with pip install amazon-textract-textractor[pdfium]
. You can read more on extra dependencies in the documentation
Calling Textract
[1]:
import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType
[18]:
image = Image.open("../../../tests/fixtures/amzn_q2.png")
image
[18]:
[19]:
extractor = Textractor(profile_name="default")
document = extractor.analyze_document(
file_source=image,
features=[TextractFeatures.TABLES],
save_image=True
)
[20]:
document
[20]:
This document holds the following data:
Pages - 1
Words - 564
Lines - 318
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0
As expected, Textract identified the table. Let’s see how it looks.
[21]:
table = EntityList(document.tables[0])
document.tables[0].visualize()
[21]:
Now let’s export it to Excel.
[6]:
table[0].to_excel(filepath="out.xlsx")
If you open “out.xlsx” in Microsoft Office Excel you should see the table formatted with the the merged cells preserved.
[7]:
Image.open("imgs/excel.png")
[7]:
Table as a pandas DataFrame
[8]:
table[0].to_pandas()
[8]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | Three Months June | 30, Ended | Six Months June | Ended 30, | Twelve Months June | 30, Ended | |
1 | 2021 | 2022 | 2021 | 2022 | 2021 | 2022 | |
2 | |||||||
3 | CASH, PERIOD CASH EQUIVALENTS, AND RESTRICTED ... | $ 34,155 | $ 36,599 | $ 42,377 | $ 36,477 | $ 37,842 | $ 40,667 |
4 | OPERATING ACTIVITIES: | ||||||
5 | Net income (loss) | 7,778 | (2,028) | 15,885 | (5,872) | 29,438 | 11,607 |
6 | Adjustments to reconcile net income (loss) to ... | ||||||
7 | Depreciation content costs, and operating amor... | 8,038 | 9,594 | 15,546 | 18,572 | 29,687 | 37,322 |
8 | Stock-based compensation | 3,591 | 5,209 | 5,897 | 8,459 | 10,747 | 15,319 |
9 | Other operating expense (income), net | 18 | 122 | 48 | 337 | (372) | 426 |
10 | Other expense (income), net | (1,258) | 6,104 | (2,714) | 14,793 | (5,092) | 3,201 |
11 | Deferred income taxes | 701 | (1,955) | 2,404 | (3,956) | 1,063 | (6,670) |
12 | Changes in operating assets and liabilities: | ||||||
13 | Inventories | (209) | (3,890) | (513) | (6,504) | (4,082) | (15,478) |
14 | Accounts receivable, net and other | (4,462) | (6,799) | (6,717) | (8,315) | (13,294) | (19,761) |
15 | Accounts payable | 47 | 3,699 | (8,219) | (5,681) | 8,689 | 6,140 |
16 | Accrued expenses and other | (1,685) | (1,412) | (5,745) | (7,315) | 1,071 | 553 |
17 | Unearned revenue | 156 | 321 | 1,056 | 1,657 | 1,467 | 2,915 |
18 | Net cash provided by (used in) operating activ... | 12,715 | 8,965 | 16,928 | 6,175 | 59,322 | 35,574 |
19 | INVESTING ACTIVITIES: | ||||||
20 | Purchases of property and equipment | (14,288) | (15,724) | (26,370) | (30,675) | (52,256) | (65,358) |
21 | Proceeds from property and equipment sales and... | 1,300 | 1,626 | 2,195 | 2,835 | 5,080 | 6,297 |
22 | Acquisitions, net of cash acquired, and other | (320) | (259) | (950) | (6,600) | (3,066) | (7,635) |
23 | Sales and maturities of marketable securities | 13,213 | 2,608 | 31,039 | 25,361 | 61,512 | 53,706 |
24 | Purchases of marketable securities | (21,985) | (329) | (36,660) | (2,093) | (74,929) | (25,590) |
25 | Net cash provided by (used in) investing activ... | (22,080) | (12,078) | (30,746) | (11,172) | (63,659) | (38,580) |
26 | FINANCING ACTIVITIES: | ||||||
27 | Common stock repurchased | - | (3,334) | - | (6,000) | - | (6,000) |
28 | Proceeds from short-term debt. and other | 1,176 | 4,865 | 3,102 | 18,608 | 6,848 | 23,462 |
29 | Repayments of short-term debt, and other | (1,176) | (7,610) | (3,177) | (13,841) | (6,817) | (18,417) |
30 | Proceeds from long-term debt | 18,516 | 12,824 | 18,627 | 12,824 | 19,158 | 13,200 |
31 | Repayments of long-term debt | (41) | (1) | (80) | (1) | (1,392) | (1,511) |
32 | Principal repayments of finance leases | (2,804) | (2,059) | (6,210) | (4,836) | (11,435) | (9,789) |
33 | Principal repayments of financing obligations | (28) | (59) | (95) | (138) | (116) | (205) |
34 | Net cash provided by (used in) financing activ... | 15,643 | 4,626 | 12,167 | 6,616 | 6,246 | 740 |
35 | Foreign currency effect on cash, cash equivale... | 234 | (412) | (59) | (396) | 916 | (701) |
36 | Net increase (decrease) in cash, cash equivale... | 6,512 | 1,101 | (1,710) | 1,223 | 2,825 | (2,967) |
37 | CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, E... | $ 40,667 | $ 37,700 | $ 40,667 | $ 37,700 | $ 40,667 | $ 37,700 |
38 | SUPPLEMENTAL CASH FLOW INFORMATION: | ||||||
39 | Cash paid for interest on debt | $ 179 | $ 349 | $ 455 | $ 628 | $ 942 | $ 1,271 |
40 | Cash paid for operating leases | 1,577 | 2,088 | 3,217 | 4,455 | 5,577 | 7,960 |
41 | Cash paid for interest on finance leases | 129 | 95 | 286 | 202 | 569 | 437 |
42 | Cash paid for interest on financing obligations | 35 | 55 | 68 | 113 | 127 | 198 |
43 | Cash paid for income taxes, net of refunds | 1,803 | 3,145 | 2,604 | 3,598 | 3,526 | 4,682 |
44 | Assets acquired under operating leases | 5,578 | 5,101 | 9,114 | 7,276 | 19,576 | 23,531 |
45 | Property modifications and equipment acquired ... | 1,642 | 61 | 3,709 | 227 | 9,976 | 3,579 |
46 | Property lease arrangements and equipment reco... | 1,193 | 986 | 2,080 | 2,351 | 3,486 | 6,117 |
47 | Property lease arrangements, and equipment wit... | 99 | 1,079 | 99 | 1,112 | 99 | 1,243 |
Table as CSV
This is a shorthand for table.to_pandas().to_csv().
[9]:
table[0].to_csv()
[9]:
',0,1,2,3,4,5,6\n0,,Three Months June,"30, Ended",Six Months June,"Ended 30,",Twelve Months June,"30, Ended"\n1,,2021,2022,2021,2022,2021,2022\n2,,,,,,,\n3,"CASH, PERIOD CASH EQUIVALENTS, AND RESTRICTED CASH, BEGINNING OF","$ 34,155","$ 36,599","$ 42,377","$ 36,477","$ 37,842","$ 40,667"\n4,OPERATING ACTIVITIES:,,,,,,\n5,Net income (loss),"7,778","(2,028)","15,885","(5,872)","29,438","11,607"\n6,Adjustments to reconcile net income (loss) to net cash from operating activities:,,,,,,\n7,"Depreciation content costs, and operating amortization lease of assets, property and and other equipment and capitalized","8,038","9,594","15,546","18,572","29,687","37,322"\n8,Stock-based compensation,"3,591","5,209","5,897","8,459","10,747","15,319"\n9,"Other operating expense (income), net",18,122,48,337,(372),426\n10,"Other expense (income), net","(1,258)","6,104","(2,714)","14,793","(5,092)","3,201"\n11,Deferred income taxes,701,"(1,955)","2,404","(3,956)","1,063","(6,670)"\n12,Changes in operating assets and liabilities:,,,,,,\n13,Inventories,(209),"(3,890)",(513),"(6,504)","(4,082)","(15,478)"\n14,"Accounts receivable, net and other","(4,462)","(6,799)","(6,717)","(8,315)","(13,294)","(19,761)"\n15,Accounts payable,47,"3,699","(8,219)","(5,681)","8,689","6,140"\n16,Accrued expenses and other,"(1,685)","(1,412)","(5,745)","(7,315)","1,071",553\n17,Unearned revenue,156,321,"1,056","1,657","1,467","2,915"\n18,Net cash provided by (used in) operating activities,"12,715","8,965","16,928","6,175","59,322","35,574"\n19,INVESTING ACTIVITIES:,,,,,,\n20,Purchases of property and equipment,"(14,288)","(15,724)","(26,370)","(30,675)","(52,256)","(65,358)"\n21,Proceeds from property and equipment sales and incentives,"1,300","1,626","2,195","2,835","5,080","6,297"\n22,"Acquisitions, net of cash acquired, and other",(320),(259),(950),"(6,600)","(3,066)","(7,635)"\n23,Sales and maturities of marketable securities,"13,213","2,608","31,039","25,361","61,512","53,706"\n24,Purchases of marketable securities,"(21,985)",(329),"(36,660)","(2,093)","(74,929)","(25,590)"\n25,Net cash provided by (used in) investing activities,"(22,080)","(12,078)","(30,746)","(11,172)","(63,659)","(38,580)"\n26,FINANCING ACTIVITIES:,,,,,,\n27,Common stock repurchased,-,"(3,334)",-,"(6,000)",-,"(6,000)"\n28,Proceeds from short-term debt. and other,"1,176","4,865","3,102","18,608","6,848","23,462"\n29,"Repayments of short-term debt, and other","(1,176)","(7,610)","(3,177)","(13,841)","(6,817)","(18,417)"\n30,Proceeds from long-term debt,"18,516","12,824","18,627","12,824","19,158","13,200"\n31,Repayments of long-term debt,(41),(1),(80),(1),"(1,392)","(1,511)"\n32,Principal repayments of finance leases,"(2,804)","(2,059)","(6,210)","(4,836)","(11,435)","(9,789)"\n33,Principal repayments of financing obligations,(28),(59),(95),(138),(116),(205)\n34,Net cash provided by (used in) financing activities,"15,643","4,626","12,167","6,616","6,246",740\n35,"Foreign currency effect on cash, cash equivalents, and restricted cash",234,(412),(59),(396),916,(701)\n36,"Net increase (decrease) in cash, cash equivalents, and restricted cash","6,512","1,101","(1,710)","1,223","2,825","(2,967)"\n37,"CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, END OF PERIOD","$ 40,667","$ 37,700","$ 40,667","$ 37,700","$ 40,667","$ 37,700"\n38,SUPPLEMENTAL CASH FLOW INFORMATION:,,,,,,\n39,Cash paid for interest on debt,$ 179,$ 349,$ 455,$ 628,$ 942,"$ 1,271"\n40,Cash paid for operating leases,"1,577","2,088","3,217","4,455","5,577","7,960"\n41,Cash paid for interest on finance leases,129,95,286,202,569,437\n42,Cash paid for interest on financing obligations,35,55,68,113,127,198\n43,"Cash paid for income taxes, net of refunds","1,803","3,145","2,604","3,598","3,526","4,682"\n44,Assets acquired under operating leases,"5,578","5,101","9,114","7,276","19,576","23,531"\n45,"Property modifications and equipment acquired under finance leases, net of remeasurements and","1,642",61,"3,709",227,"9,976","3,579"\n46,Property lease arrangements and equipment recognized during the construction period of build-to-suit,"1,193",986,"2,080","2,351","3,486","6,117"\n47,"Property lease arrangements, and equipment with derecognized the associated after leases the construction recognized period as operating of build-to-suit",99,"1,079",99,"1,112",99,"1,243"\n'
Table to Microsoft Office Excel
The only format that supports merged cells.
[10]:
table[0].to_excel("out.xlsx")
Conclusion
Textractor makes it easy to go from one format to another and streamlines your table processing workflows.