Line Items

Understanding the line-items data format

What are line-items?

The lineitems data type is used for fields that extract tabular information from a specific type of table with pre-defined columns. There are many different lineitems fields, each tailored to a different extraction use-case.

For example, the invoice.lineitems field captures tables containing invoice line items, while the statement.transactions field returns credit and debit transaction rows from bank and credit card statements.

Specific AI products may include additional functionality on top of basic table extraction for a given use-case. For example, ndis.lineitems infers Support Item Reference Numbers from line-level description text. When available, always prefer the lineitems field that best matches your use-case over generic table extraction (i.e. generic.table).

Basic data structure

Fields with the lineitems data type return a common data structure. Each prediction is a list of tables, one for each table found in the source document. Each of these tables is a JSON object with three keys:

{
   "types": [...],
   "headers": [...],
   "cells": [[...], ...]
}

  • types aligns each extracted column to a specific column type.

  • headers identifies the text and position of each header cell within the source document.

  • cells contains the content of the table as an array of arrays: one array per row, with one item per column.
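As a rough illustration of this structure, the sketch below walks a prediction and reports the number of columns and rows in each table. The helper name summarise_tables and the sample prediction are hypothetical, made up here to show the shape described above; they are not output from a real document.

```python
def summarise_tables(tables):
    """Return a (column count, row count) pair for each table in a prediction."""
    summary = []
    for table in tables:
        n_cols = len(table["types"])   # one entry per extracted column
        n_rows = len(table["cells"])   # one inner array per row
        summary.append((n_cols, n_rows))
    return summary


# Hypothetical, heavily trimmed prediction: one table, two columns, one row.
sample_prediction = [
    {
        "types": [
            {"type": None},
            {"type": "sypht.invoice.lineitems.unitPrice", "score": 0.97207},
        ],
        "headers": [{"text": "NOTES"}, {"text": "ITEM PRICE"}],
        "cells": [[None, {"text": "50.00"}]],
    }
]

print(summarise_tables(sample_prediction))
```

Because types and each row of cells are described as having one item per column, their lengths can be compared as a quick sanity check before further processing.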

The following sections unpack each part of this data structure in detail.

Types

Each item in the types array contains a type identifier and confidence score for the corresponding table column. These identifiers can be used to interpret the cell content for that column in the table. For example, a column labelled "Item Price" might be classified as a sypht.invoice.lineitems.unitPrice column and contain prices for each listed item.

A type of null indicates the corresponding column does not match a pre-defined column type for the field. Header and cell content is still returned for these columns.

[
    {
        "type": "sypht.invoice.lineitems.unitPrice",
        "score": 0.97207
    },
...
]
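One way to use the types array is to look up the position of the column you care about. The following is a minimal sketch: the helper column_index and its min_score threshold are hypothetical choices for illustration, and the sample data assumes an unmatched column appears as an item whose type is null.

```python
def column_index(types, wanted_type, min_score=0.5):
    """Return the index of the first column classified as wanted_type, or None.

    min_score is an illustrative cut-off, not a recommended value.
    """
    for idx, col in enumerate(types):
        if not col:
            continue  # tolerate a missing entry
        score = col.get("score") or 0.0
        if col.get("type") == wanted_type and score >= min_score:
            return idx
    return None


# Hypothetical types array for a two-column table.
sample_types = [
    {"type": None},
    {"type": "sypht.invoice.lineitems.unitPrice", "score": 0.97207},
]

price_col = column_index(sample_types, "sypht.invoice.lineitems.unitPrice")
print(price_col)
```

The returned index can then be used to pick the matching item out of each row in cells.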

Headers

Headers encode information about the header cells detected in each table. In general, headers are not needed to interpret the content of the table for a lineitems field, but they may be useful for understanding the content of non-aligned columns (those with a null type) and how the data was originally presented in the source document.

Each object contains text and bounds information used to locate headers in the source.

{
    "id": "table.label.head:price:69",
    "type": "table.label",
    "text": "ITEM PRICE",
    "page_idx": 0,
    "tokens": [
        "ITEM",
        "PRICE"
    ],
    "bounds": {
        "pageNum": 1,
        "topLeft": {
            "x": 0.6981132075471698,
            "y": 0.4591194968553459
        },
        "bottomRight": {
            "x": 0.7458379578246392,
            "y": 0.47570040022870214
        },
        "tokenIds": [
            68,
            69
        ]
    }
},
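The bounds values in the sample above look like fractions of the page width and height rather than pixel positions. Assuming that reading is correct (it is an inference from the sample, not something stated here), a sketch for converting bounds into a pixel box on a rendered page image could look like this. The function name and the page dimensions are hypothetical.

```python
def bounds_to_pixels(bounds, page_width, page_height):
    """Convert bounds to a (left, top, right, bottom) pixel box.

    Assumes x and y are fractions of the page width and height respectively.
    """
    left = round(bounds["topLeft"]["x"] * page_width)
    top = round(bounds["topLeft"]["y"] * page_height)
    right = round(bounds["bottomRight"]["x"] * page_width)
    bottom = round(bounds["bottomRight"]["y"] * page_height)
    return left, top, right, bottom


# Hypothetical bounds with round numbers, on a 1000 x 2000 pixel page image.
sample_bounds = {
    "pageNum": 1,
    "topLeft": {"x": 0.5, "y": 0.25},
    "bottomRight": {"x": 0.75, "y": 0.5},
}

box = bounds_to_pixels(sample_bounds, page_width=1000, page_height=2000)
print(box)
```

A box in this form can be handed to most image libraries to crop or highlight the header on the page image.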

Cells

Each element in the cells array represents a row, and each row contains one item per column in the table. Row items may be null, indicating an empty cell for the given column. Rows with no extracted cells are omitted from the output.

When cells are present they contain a similar data structure to headers. This includes both text and bounds information.

"cells": [
    [
        null,
        {
            "type": "table.value",
            "text": "50.00",
            "page_idx": 0,
            "tokens": [
                "50.00"
            ],
            "bounds": {
                "pageNum": 1,
                "topLeft": {
                    "x": 0.5268220495745468,
                    "y": 0.49914236706689535
                },
                "bottomRight": {
                    "x": 0.6182019977802442,
                    "y": 0.5105774728416238
                },
                "tokenIds": [
                    87
                ]
            }
        },
        ...
    ],
    [
        ...
    ]
]
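Because row items may be null, code that reads cell text needs to guard against empty cells. A minimal sketch, using a hypothetical helper row_texts and a made-up row:

```python
def row_texts(row, empty=""):
    """Return the text of each cell in a row, substituting `empty` for null cells."""
    return [cell["text"] if cell is not None else empty for cell in row]


# Hypothetical row: the first column is empty, the second holds a value.
sample_row = [
    None,
    {"type": "table.value", "text": "50.00", "page_idx": 0, "tokens": ["50.00"]},
]

print(row_texts(sample_row))
```

Keeping a placeholder for empty cells preserves the one-item-per-column alignment with the types and headers arrays.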

Examples

Line-item fields are a densely packed source of structured information. While there is a lot of information available, it's usually quite simple in practice to pull out the specific information you need.

Here we provide an end-to-end example of uploading a document using the Sypht API and interpreting the results of a lineitems field in Python. We use the pandas library to format the tabular results.

import pandas as pd
from sypht.client import SyphtClient

# create the client used for the upload and fetch calls below
sc = SyphtClient()

# upload a document and run the invoices product
with open('invoice.png', 'rb') as f:
    doc_id = sc.upload(f, products=["invoices"])

# collect the extraction results
results = sc.fetch_results(doc_id)

# grab the lineitems field in this case
tables = results['invoice.lineitems']

for table in tables:
    # construct a DataFrame representing the table using the source doc headers
    df = pd.DataFrame(
        [
            # empty cells are returned as null (None), so guard before reading text
            [cell['text'] if cell else None for cell in row]
            for row in table['cells']
        ],
        columns=[header['text'] for header in table['headers']]
    )
    print(df)

Depending on the input file content, this sample produces a DataFrame with original document headers for columns and cell content in each row, e.g.:

Date     | Product Description | Misc. | Total ($/AUD)
1/1/2020 | Foo                 | Hello | $50.00
1/1/2021 | Bar                 | World | $100.00

Alternatively, we can use the aligned column types rather than the raw header text to label the DataFrame columns, like so:

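df = pd.DataFrame(
    [
        # empty cells are returned as null (None), so guard before reading text
        [cell['text'] if cell else None for cell in row]
        for row in table['cells']
    ],
    # label each column with its aligned type (None for unmatched columns)
    columns=[col['type'] for col in table['types']]
)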

This produces an equivalent table with columns aligned to specific invoice.lineitems types:

invoice.lineitems.date | invoice.lineitems.description | null  | invoice.lineitems.total
1/1/2020               | Foo                           | Hello | $50.00
1/1/2021               | Bar                           | World | $100.00
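If you do not need pandas, the same alignment can be turned into plain dictionaries keyed by column type. The sketch below is one possible approach rather than part of the Sypht client: rows_as_records is a hypothetical helper, it drops columns whose type is null, and the sample table is hand-built to mirror the example above.

```python
def rows_as_records(table):
    """Convert a lineitems table into a list of dicts keyed by column type.

    Columns with a null type are skipped; empty cells become None.
    """
    col_types = [col["type"] if col else None for col in table["types"]]
    records = []
    for row in table["cells"]:
        record = {}
        for col_type, cell in zip(col_types, row):
            if col_type is None:
                continue  # no pre-defined type matched this column
            record[col_type] = cell["text"] if cell else None
        records.append(record)
    return records


# Hand-built table mirroring the example above (not real API output).
sample_table = {
    "types": [
        {"type": "invoice.lineitems.date"},
        {"type": "invoice.lineitems.description"},
        {"type": None},
        {"type": "invoice.lineitems.total"},
    ],
    "headers": [],
    "cells": [
        [{"text": "1/1/2020"}, {"text": "Foo"}, {"text": "Hello"}, {"text": "$50.00"}],
        [{"text": "1/1/2021"}, {"text": "Bar"}, None, {"text": "$100.00"}],
    ],
}

for record in rows_as_records(sample_table):
    print(record)
```

Keying by type rather than header text keeps downstream code independent of how each supplier happens to label its columns.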
