Entity matching

Match data to an external data source

How it works

Entity fields match information on a source document to user-provided reference data. This lets you establish a link between documents and records from an existing business database or directory.

A common use case for entity matching is linking an invoice issuer to a supplier database. Entity fields automatically learn a fuzzy match between information on the document (e.g. a Supplier Name, Address or Business number) and the reference data fields you've uploaded. The reference data is then returned as a standard prediction result, allowing you to build on these matches to power complex automation rules, derived field values and validation checks.

Getting started

To configure and use an entity match field, the following basic process applies:

  1. Upload entity data to Sypht via the entity storage API

  2. Configure and train an entity match field

  3. Extract the entity field from documents to establish a match

Because entities are pushed from your database to Sypht, there is no need to open up network or API access into secure internal data stores. You have complete control over what data is pushed and when.

Overview of the basic entity matching process.

Keeping reference data in sync

When an entity field is extracted on a document, data is matched against all entities available in the Sypht data store at that time. As reference data changes over time, you can push smaller differential updates via the entity storage API to replicate the addition, removal or modification of entities.

We recommend establishing a regular automated ETL process to keep reference data in sync.
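
A differential update can be derived by comparing the previous and current snapshots of your reference data. The following is a minimal sketch, assuming both snapshots fit in memory as dictionaries keyed by entity_id. The function name diff_entities is hypothetical and is not part of the Sypht API or client.

```python
def diff_entities(previous, current):
    """Compare two snapshots of reference data keyed by entity_id.

    Returns (upserts, deletes):
      upserts: {entity_id: data} for new or modified entities (to PUT)
      deletes: sorted list of entity_ids no longer present (to DELETE)
    """
    upserts = {
        entity_id: data
        for entity_id, data in current.items()
        if entity_id not in previous or previous[entity_id] != data
    }
    deletes = sorted(set(previous) - set(current))
    return upserts, deletes
```

Each entry in upserts would then be pushed with the PUT endpoint and each id in deletes removed with the DELETE endpoint described in the sections that follow.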

Using the entities API

This section explains the available API endpoints for storage, retrieval and search of entity data. Entities are stored in isolated collections for a given company_id, and within that company there may be multiple distinct entity types (e.g. separate entity types for the supplier and the receiving office, each with its own attributes and match logic).

Check out the open-source Sypht Python Client on GitHub for a reference implementation using the entities API.

Pushing entities

PUT storage/{company_id}/entity/{entity_type}/{entity_id}

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

  • entity_id a unique identifier for the entity

Request Body:

  • JSON-encoded entity data

    • An object with keys and values representing attributes of the entity

    • Complex JSON data structures (e.g. nested objects or lists) may be stored but are currently not supported as reference fields for search and match

    • Empty values should be represented as null

Example Supplier Entity data
{
  "name": "Joe's Plumbing",
  "address": "123 Water Street, Chippendale, 2008",
  "expense_code": "GL1234",
  "project_id": "PRJ-9876",
  "last_sync": "2021-01-01"
}
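
As an illustration of how the path parameters and request body fit together, the sketch below builds (but does not send) a PUT request using only the Python standard library. BASE_URL is a hypothetical placeholder, the bearer-token header is an assumption about authentication, and treating empty strings as empty values is one possible reading of the null rule above. Consult your Sypht account details for the real values.

```python
import json
import urllib.request
from urllib.parse import quote

# Hypothetical placeholder: substitute the API base URL for your account.
BASE_URL = "https://api.example.invalid"


def build_put_entity_request(company_id, entity_type, entity_id, data, token):
    """Build (but do not send) a PUT request that stores a single entity."""
    path = "storage/{}/entity/{}/{}".format(
        quote(company_id, safe=""),
        quote(entity_type, safe=""),
        quote(entity_id, safe=""),
    )
    # Assumption: empty strings count as empty values, so send them as null.
    body = {key: (None if value == "" else value) for key, value in data.items()}
    return urllib.request.Request(
        url=f"{BASE_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {token}",
        },
    )
```

The returned object can be passed to urllib.request.urlopen to perform the call.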

POST storage/{company_id}/bulkentity/{entity_type}/

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

Request Body:

  • JSON-encoded list of objects, each with an entity_id and the data to be stored

[
  {
    "entity_id": "id0",
    "data": {
      "rego_no": "qwer",
      "Contract Start": "2020-11-04"
    }
  },
  {
    "entity_id": "id1",
    "data": {
      "rego_no": "wert",
      "Contract Start": "2020-11-05"
    }
  },
  {
    "entity_id": "id2",
    "data": {
      "rego_no": "erty",
      "Contract Start": "2020-11-06"
    }
  }
]
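
The bulk request body can be assembled from records keyed by entity_id, as in the minimal sketch below. Both helper names are hypothetical. Splitting a large upload into batches is our own assumption: no request size limit is documented here, so the batch size is purely illustrative.

```python
def to_bulk_payload(records):
    """Convert {entity_id: data} into the list format used by the bulk endpoint."""
    return [
        {"entity_id": entity_id, "data": data}
        for entity_id, data in records.items()
    ]


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch yielded by chunked(to_bulk_payload(records), size) would be JSON-encoded and sent as the body of one POST request.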

Removing entities

DELETE storage/{company_id}/entity/{entity_type}/{entity_id}

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

  • entity_id a unique identifier for the entity

Retrieving entities

GET storage/{company_id}/entity/{entity_type}/{entity_id}

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

  • entity_id a unique identifier for the entity

Response Body:

{
  "name": "Joe's Plumbing",
  "address": "123 Water Street, Chippendale, 2008",
  "expense_code": "GL1234",
  "project_id": "PRJ-9876",
  "last_sync": "2021-01-01"
}

Searching entities

POST storage/{company_id}/entitysearch/{entity_type}/

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

Request Body:

  • JSON-encoded string containing exact and fuzzy match search constraints

    • Each of these should be an object with keys denoting an attribute to search against and value denoting the query string to search for

  • e.g. to search for a fuzzy match of "Plumbing" against the "name" attribute, with no exact constraints

{
  "exact": {},
  "fuzzy": {"name": "Plumbing"}
}

Response Body:

[
  {
    "item": {
      "name": "Joe's Plumbing",
      "address": "123 Water Street, Chippendale, 2008",
      "expense_code": "GL1234",
      "project_id": "PRJ-9876",
      "last_sync": "2021-01-01"
    },
    "score": 0.5753642
  },
  ...
]
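
The request and response shapes above can be handled with a couple of small helpers. This is a sketch only: the helper names are hypothetical, and it assumes that a higher score indicates a closer match, which the response format suggests but does not state.

```python
import json


def build_search_body(exact=None, fuzzy=None):
    """Serialise exact and fuzzy constraints into a search request body."""
    return json.dumps({"exact": exact or {}, "fuzzy": fuzzy or {}})


def best_match(results, min_score=0.0):
    """Return the item with the highest score at or above min_score, or None.

    Assumption: higher scores indicate closer matches.
    """
    candidates = [result for result in results if result["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda result: result["score"])["item"]
```

A min_score threshold lets you reject weak matches instead of accepting whatever ranks first; a suitable value depends on your data.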

Searching entities by id

POST storage/{company_id}/entitysearch/{entity_type}/by_id

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

Request Body:

  • JSON-encoded list of objects with entity_id

[
  {
    "entity_id": "id_0"
  },
  {
    "entity_id": "id_1"
  }
]

Response Body:

{
  "entities": [
    {
      "entity_id": "id_0",
      "data": {
        "name": "Joe's Plumbing",
        "address": "123 Water Street, Chippendale, 2008",
        "expense_code": "GL1234"
      },
      "error": false
    },
    {
      "entity_id": "id_1",
      "data": null,
      "error": true
    }
  ]
}
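
Because the response reports an error flag per entity, it is usually worth separating the ids that resolved from those that did not. A minimal sketch follows, assuming the response has already been decoded from JSON into a Python dict; the function name is hypothetical.

```python
def split_by_id_response(response):
    """Split a by_id response into found entities and unresolved ids.

    Returns (found, missing):
      found:   {entity_id: data} for entries without an error
      missing: list of entity_ids flagged with an error or lacking data
    """
    found = {}
    missing = []
    for entry in response["entities"]:
        if entry["error"] or entry["data"] is None:
            missing.append(entry["entity_id"])
        else:
            found[entry["entity_id"]] = entry["data"]
    return found, missing
```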

Retrieving a list of entity_ids

GET storage/{company_id}/entitysearch/{entity_type}

Path Parameters:

  • company_id your Sypht Company Id

  • entity_type type of entity e.g. supplier, vehicle or employee

Query Parameters:

  • page a page token. If omitted, the first page is returned; otherwise pass the next_page value from the previous response to request the following page

  • limit maximum number of entity_ids to return in the response

Response Body:

{
  "next_page": "page token",
  "entities": [
    "102013",
    "102015",
    "102019",
    "102034",
    "102051",
    "102056",
    "102057",
    "102068",
    "102072",
    "102074"
  ]
}
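
Following the next_page token until it runs out retrieves every id. The sketch below keeps the HTTP call outside the loop: fetch_page is a caller-supplied, hypothetical function that performs the GET request and returns the decoded response. It also assumes that the final page has a missing or empty next_page, which is not stated above.

```python
def iter_entity_ids(fetch_page, limit=100):
    """Yield every entity_id by following next_page tokens.

    fetch_page(page, limit) must perform the GET request and return the
    decoded JSON response as a dict. page is None for the first request.
    """
    page = None
    while True:
        response = fetch_page(page, limit)
        yield from response["entities"]
        page = response.get("next_page")
        # Assumption: a missing or empty next_page marks the last page.
        if not page:
            break
```

Passing the fetch function in as an argument keeps the pagination logic independent of any particular HTTP library and makes it easy to exercise with canned responses.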

Using the Sypht Client

pip install sypht

Retrieving all entity_ids

This client method wraps the paginated endpoint above, looping over every page to collect all entity_ids for the specified entity_type.

  • Returns a list of objects if verbose (the default)

    [{"entity_id": "id_0"}, {"entity_id": "id_1"}, ...]

  • Returns a list of entity_id values if not verbose

    ["id_0", "id_1", ...]

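from sypht.client import SyphtClient

# Authenticate with your Sypht API credentials
sc = SyphtClient('<client_id>', '<client_secret>')

# Collect every entity_id stored for the given entity type
entity_ids = sc.get_all_entity_ids(entity_type='test_type')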