Values

All predictions in version 2 of the Sypht API return data structures we call Values. Values can be simple data structures representing a simple extraction, or they can be complex data structures representing collections of nested values.

Values are encoded as self-describing JSON objects. At minimum, an encoded Value will contain a key named "type". The corresponding value of this key tells the consumer how to parse the specific value. Below is a list of some common Value types.
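Because every encoded Value carries a "type" key, a consumer can branch on it before reading any other field. The Python sketch below illustrates the idea for a few of the types listed on this page. The function name describe_value is hypothetical and is not part of any Sypht client library.

```python
from typing import Any, Dict


def describe_value(value: Dict[str, Any]) -> str:
    """Return a short summary of an encoded Value, chosen by its "type" key."""
    value_type = value.get("type")
    if value_type is None:
        raise ValueError("encoded Value is missing the 'type' key")
    if value_type == "bounds":
        return f"bounds on page {value['pageNum']}"
    if value_type in ("token", "tokens"):
        return f"{value_type}: {value['text']}"
    if value_type == "list":
        return f"list of {len(value['items'])} items"
    # Types this sketch does not handle are reported rather than guessed at.
    return f"unhandled type: {value_type}"
```

Returning a marker for unhandled types, rather than raising, lets a consumer keep working if it meets a type it does not yet know about.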

bounds

A Value of type bounds represents the location of some extraction. It includes the page number and the coordinates of two vertices, the top left and the bottom right, which together form a rectangle (bounding box) enclosing the area represented by the value. An example:

{
    "type": "bounds",
    "pageNum": 1,
    "topLeft": {
        "x": 0.38,
        "y": 0.36
    },
    "bottomRight": {
        "x": 0.43,
        "y": 0.37
    }
}

The x and y coordinates are encoded as floating point numbers. These numbers are ratios relative to the size of the page. For example, on a page 900 pixels wide, an x value of 0.5 corresponds to an absolute position of 450 pixels. Likewise, on a page 1200 pixels high, a y value of 0.25 corresponds to an absolute position of 300 pixels.
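The conversion from ratios to pixels can be sketched in Python as follows. The function name bounds_to_pixels and its page_width and page_height parameters are illustrative assumptions; the page dimensions must come from wherever the consumer renders the page, since the bounds Value does not carry them.

```python
from typing import Any, Dict, Tuple


def bounds_to_pixels(
    bounds: Dict[str, Any], page_width: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Convert a relative bounds Value to (left, top, right, bottom) in pixels."""
    left = bounds["topLeft"]["x"] * page_width
    top = bounds["topLeft"]["y"] * page_height
    right = bounds["bottomRight"]["x"] * page_width
    bottom = bounds["bottomRight"]["y"] * page_height
    return (left, top, right, bottom)


example = {
    "type": "bounds",
    "pageNum": 1,
    "topLeft": {"x": 0.5, "y": 0.25},
    "bottomRight": {"x": 0.75, "y": 0.5},
}
# On a 900 x 1200 pixel page this gives (450.0, 300.0, 675.0, 600.0).
box = bounds_to_pixels(example, 900, 1200)
```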

token

A Value of type token represents an extraction from a document that contains a single token (i.e. a word or unit). It contains information such as the characters that make up the token (with or without white space), the location of the token on the page (via a nested encoded bounds Value), and the index of the token relative to the rest of the tokens in the document. The index is useful for creating an ordered list of tokens. Here is an example of an encoded token:

{
    "type": "token",
    "bounds": {
        "type": "bounds",
        "pageNum": 1,
        "topLeft": {
            "x": 0.38,
            "y": 0.36
        },
        "bottomRight": {
            "x": 0.43,
            "y": 0.37
        }
    },
    "text": "Recovery",
    "textWithWhitespace": "Recovery ",
    "pageTokenId": 82,
    "tokenId": 82
}

Note that the value of the "bounds" key is just an encoded bounds Value as above. The pageTokenId is the index of the token relative to all the tokens on that particular page. The tokenId is the index of the token relative to all the tokens in the document.
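As a minimal sketch of how the index can be used, the Python function below sorts encoded tokens by tokenId and joins them back into text. The name join_tokens is hypothetical, and the sketch assumes that textWithWhitespace carries any trailing white space that followed the token, as it does in the example above.

```python
from typing import Any, Dict, List


def join_tokens(tokens: List[Dict[str, Any]]) -> str:
    """Order tokens by their document-wide index and rebuild the text."""
    ordered = sorted(tokens, key=lambda token: token["tokenId"])
    # Assumption: textWithWhitespace already includes the separator, if any.
    text = "".join(token["textWithWhitespace"] for token in ordered)
    return text.rstrip()


unordered = [
    {"type": "token", "text": "Coaching", "textWithWhitespace": "Coaching", "tokenId": 83},
    {"type": "token", "text": "Recovery", "textWithWhitespace": "Recovery ", "tokenId": 82},
]
joined = join_tokens(unordered)  # "Recovery Coaching"
```

Sorting by pageTokenId instead would only be meaningful for tokens that all sit on the same page.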

tokens

A tokens value represents a sequence of tokens that form a single extraction. In its encoded form, it contains the text, an encoded bounds Value, and a list of encoded token Values. For example:

{
    "type": "tokens",
    "text": "ACME Holdings Inc.",
    "bounds": {...},
    "tokens": [
        {
            "type": "token",
            "bounds": {...},
            "text": "ACME",
            "textWithWhitespace": "ACME ",
            "pageTokenId": 13,
            "tokenId": 13
        },
        ...
    ]
}

entity

An entity value represents an Entity, and is the output of an entity match field prediction. An entity has an id, entity_type, company_id, and data. These values represent the entity match reference data that was sent to Sypht via the entities API (see the entity matching and entities API documentation for further information). Here is an example of an encoded entity:

{
    "type": "entity",
    "id": "07_101_0106_6_3",
    "entity_type": "ndis-support-category",
    "company_id": "9333c845-875b-47d7-bb35-0873770f23d5",
    "data": {
        "Registration_Group_Number": "106",
        "Registration_Group_Name": "Assistance In Coordinating Or Managing Life Stages, Transitions And Supports",
        "Support_Category_Number": "7",
        "Support_Category_Name": "Support Coordination",
        "Support_Item_Number": "07_101_0106_6_3",
        "Support_Item_Name": "Psychosocial Recovery Coaching - Weekday Daytime",
        "Unit": "H",
        "Quote": "N",
        "Price_Limit:_NT_-_SA_TAS-WA_(MMM_1-5)": null,
        "Price_Limit:_ACT_-_NSW_QLD_-_VIC_(MMM_1-5)": null,
        "Price_Limit:_National_Non-Remote_(MMM_1-5)": "$83.15",
        "Price_Limit:_National_Remote_(MMM_6)": "$116.41",
        "Price_Limit:_National_Very_Remote_(MMM_7)": "$124.73",
        "Non-Face-to-Face_Support_Provision": "Y",
        "Provider_Travel": "Y",
        "Short_Notice_Cancellations.": "Y",
        "NDIA_Requested_Reports": "N",
        "Irregular_SIL_Supports": "N"
    }
}

This is an entity of type ndis-support-category and has unique identifier 07_101_0106_6_3. It came from the Sypht company (9333c845-875b-47d7-bb35-0873770f23d5), and the included data has keys and values that describe this particular entity.
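The example above shows that keys in data can map to null. A consumer that only wants populated reference fields might filter them out, as in this Python sketch. The helper name non_null_data is hypothetical.

```python
from typing import Any, Dict


def non_null_data(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Return the entity's data with null (None) values removed."""
    if entity.get("type") != "entity":
        raise ValueError("expected an encoded Value of type 'entity'")
    return {key: val for key, val in entity["data"].items() if val is not None}


entity = {
    "type": "entity",
    "id": "07_101_0106_6_3",
    "entity_type": "ndis-support-category",
    "data": {
        "Unit": "H",
        "Price_Limit:_NT_-_SA_TAS-WA_(MMM_1-5)": None,
        "Price_Limit:_National_Remote_(MMM_6)": "$116.41",
    },
}
populated = non_null_data(entity)
```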

list

A list value is a generic ordered collection of values. Each item within the list is itself an encoded value. All items in a list should share the same value type.

{
    "type": "list",
    "items": [
        ...
    ]
}
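A consumer may want to confirm that the items of a list really do share one value type before processing them. The following Python sketch does this. The function name list_item_type is hypothetical, and returning None for an empty list is a choice made for this illustration.

```python
from typing import Any, Dict, Optional


def list_item_type(value: Dict[str, Any]) -> Optional[str]:
    """Return the value type shared by all items, or None for an empty list."""
    if value.get("type") != "list":
        raise ValueError("expected an encoded Value of type 'list'")
    types = {item["type"] for item in value["items"]}
    if len(types) > 1:
        raise ValueError(f"list mixes value types: {sorted(types)}")
    return next(iter(types), None)
```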

table

A Value of type table represents a prediction that has extracted a table from a document. An encoded table contains a list of columns:

{
    "type": "table",
    "columns": [
        ...
    ]
}

There are two types of columns that can be encoded: column and derived-column.

column

A column of type column is a column whose data has been extracted directly from the document (as opposed to inferred or derived). Most columns fall into this class. A column has a category, a header, and a list of cells. The category identifies what kind of data the column holds. This is especially useful for line-item extractions from invoices. The list of currently available line-item categories is:

  • id
  • date
  • description
  • quantity
  • unitPrice
  • tax
  • subTotal
  • total
  • other

The header is an encoded "tokens" value identifying the tokens that make up the header of this column. The "cells" are a list of encoded values, the type of which depends on the column category. For example, a column with category "description" would contain cells each of value type "tokens". (Note: currently all columns, regardless of category, have cells of value type "tokens". This could be improved by using "date" values for "date" columns, and so on.) Here is an example of an encoded column:

{
  "type": "column",
  "category": "description",
  "header": {
      "type": "tokens",
      "text": "NOTES",
      "tokens": [...],
      "bounds": {...}
  },
  "cells": [
      {
          "type": "tokens",
          "text": "Recovery Coaching",
          "tokens": [...],
          "bounds": {...}
      },
      ...
  ]
}

derived-column

A derived column is a column whose cells are not directly extracted from content in the document, but rather inferred. As such, a derived column does not contain a header. A common example is an entity-match column that has inferred an entity for each row in a table. In such a case, each cell would be an encoded value of type "entity":

{
    "type": "derived-column",
    "category": "supportCode",
    "cells": [
        {
            "type": "entity",
            "id": "07_101_0106_6_3",
            "entity_type": "ndis-support-category",
            "company_id": "9333c845-875b-47d7-bb35-0873770f23d5",
            "data": {
                "Registration_Group_Number": "106",
                "Registration_Group_Name": "Assistance In Coordinating Or Managing Life Stages, Transitions And Supports",
                "Support_Category_Number": "7",
                "Support_Category_Name": "Support Coordination",
                "Support_Item_Number": "07_101_0106_6_3",
                "Support_Item_Name": "Psychosocial Recovery Coaching - Weekday Daytime",
                "Unit": "H",
                "Quote": "N",
                "Price_Limit:_NT_-_SA_TAS-WA_(MMM_1-5)": null,
                "Price_Limit:_ACT_-_NSW_QLD_-_VIC_(MMM_1-5)": null,
                "Price_Limit:_National_Non-Remote_(MMM_1-5)": "$83.15",
                "Price_Limit:_National_Remote_(MMM_6)": "$116.41",
                "Price_Limit:_National_Very_Remote_(MMM_7)": "$124.73",
                "Non-Face-to-Face_Support_Provision": "Y",
                "Provider_Travel": "Y",
                "Short_Notice_Cancellations.": "Y",
                "NDIA_Requested_Reports": "N",
                "Irregular_SIL_Supports": "N"
            }
        },
        ...
    ]
}
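Since a table is encoded column by column, consumers that want row-oriented data need to transpose it. The Python sketch below is one way to do that. It assumes that the cells of every column line up by row index, which this page implies but does not state, and it stops at the shortest column. The names table_to_rows and cell_to_plain are hypothetical.

```python
from typing import Any, Dict, List


def cell_to_plain(cell: Dict[str, Any]) -> Any:
    """Reduce a cell to a simple value: text for tokens, id for entities."""
    if cell["type"] == "tokens":
        return cell["text"]
    if cell["type"] == "entity":
        return cell["id"]
    return cell  # leave other value types untouched


def table_to_rows(table: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Transpose an encoded table into a list of rows keyed by column category."""
    columns = table["columns"]
    if not columns:
        return []
    # Assumption: cell i of each column belongs to row i.
    row_count = min(len(column["cells"]) for column in columns)
    return [
        {column["category"]: cell_to_plain(column["cells"][i]) for column in columns}
        for i in range(row_count)
    ]


table = {
    "type": "table",
    "columns": [
        {
            "type": "column",
            "category": "description",
            "cells": [{"type": "tokens", "text": "Recovery Coaching"}],
        },
        {
            "type": "derived-column",
            "category": "supportCode",
            "cells": [{"type": "entity", "id": "07_101_0106_6_3"}],
        },
    ],
}
rows = table_to_rows(table)
```

Keying each row by category means two columns with the same category (for example, two "other" columns) would overwrite one another; a consumer that expects that case would need a different key.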
