Signals

Get to know Signals field concepts and output data formats

What are Signals?

Signals are a special type of field that uses previously extracted data to derive aggregated metrics or values. Depending on the signal type, you may compute statistics over your own document data set or over a shared pool of aggregated information.

Sample use-cases for signal field types include:

  • Comparing the Total on an invoice to the historical average to detect anomalies

  • Detecting document duplicates by searching for old data matching key fields

  • Computing the probability of observed field combinations to detect potential fraud

How do I access signals?

Signals are currently available via private Beta.

If you'd like to get started with signals on your data, please get in touch!

Signals come in three types, described below:

Signal Types

Probability Models and Fraud Signals

Probability models compute the probability of observing certain field values given other values that have been extracted from the document. This type of field can detect documents with unusual details and forms the basis of the fraud signal field provided by Sypht. The Sypht fraud signal field can detect a common form of invoice fraud in which a third party replaces the legitimate payment information of a known invoice issuer with their own payment details. The new details will not match the historical payment information for this issuer and will therefore have a low probability of being legitimate.

In the example used throughout this section, we calculate a fraud signal based on the probability of the observed bank account details (BSB and accountNo) appearing given the Australian Business Number (ABN).

Configuration

We define two sets of fields to compute a probability or fraud detection signal:

  • conditioned fields: the set of fields whose values are common to all documents and form the underlying set of documents from which we can derive probability estimates for the observed field values. In the above example this is the ABN of the issuer.

  • observed fields: the set of fields being observed from which we derive a probability estimate given the conditioned field values being present. In the example above these are the payment details (BSB and accountNo) that could be altered in a fraudulent document.

Output

The fraud signal field calculates a float value from the probability P(observed|conditioned), which is computed from historical data. Using the above example this is: "the probability of a document with a certain ABN also having the observed BSB and accountNo with respect to all other documents that have the same ABN". The complement of this probability, 1 - P(observed|conditioned), is computed and rounded to the nearest 0.05 to give the final signal value as the likelihood of fraud. (NOTE: the probability value is smoothed.)

Example output:
[{'id': 'sypht.signals.fraud',
  'field': 'signals.fraud',
  'type': 'float',
  'predictions': [{'value': '0.95',
                   'value_norm': 0.95,
                   'confidence': 0.92,
                   'confidence_norm': 0.92,
                   'bounds': None,
                   'source': {'type': 'multi-derived',
                              'sources': ['sypht.bank.bsb',
                                          'sypht.issuer.ABN',
                                          'sypht.bank.accountNo']},
                   'support': 'MEDIUM'}]}]

For example, if there are 1,000 documents with a certain ABN and only 50 of those documents have a given BSB and accountNo, then the signal will be 1 - 50/1000 = 0.95 (without smoothing). This is rounded to the nearest 0.05 to give 0.95 (no change). The value may be displayed as a percentage, which would be 95%. A signal above 70% would be considered strong enough to warrant inspection.
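The arithmetic in this worked example can be sketched in a few lines of Python. This is only an illustration of the unsmoothed calculation: the function name `fraud_signal` and its parameters are hypothetical, and the smoothing that Sypht applies is not described here, so it is left out.

```python
def fraud_signal(matching_docs: int, total_docs: int, step: float = 0.05) -> float:
    """Unsmoothed sketch: 1 - P(observed | conditioned), rounded to the nearest `step`.

    matching_docs: documents with the same conditioned values (e.g. ABN)
                   that also share the observed values (e.g. BSB and accountNo).
    total_docs:    all documents with the same conditioned values.
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    probability = matching_docs / total_docs   # P(observed | conditioned)
    complement = 1.0 - probability             # likelihood of fraud before rounding
    steps = round(complement / step)           # number of whole 0.05 increments
    return round(steps * step, 2)


print(fraud_signal(50, 1000))  # the example above: 1 - 50/1000
```

Because smoothing is omitted, the values this sketch returns for very small document counts will differ from the smoothed values reported by the service.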

Always consider the confidence of the signal, because it reflects the number of documents from which the probability calculation is derived. If there is one document, the signal will be 0%. For 2 documents the signal will be 35% (with smoothing included). In these situations there is clearly not enough data to determine whether a document is unusual or fraudulent, so the confidence will be very low. See the next section for how the confidence of the signal is calculated.

Signal Confidence

The signal output includes a confidence field that takes a value from 0 to 1 indicating the degree of overall confidence in the signal. This value is a function of:

  1. The mean of the confidence values given by the models that extracted the observed and conditioned fields of the document the signal is used on. Lower confidence in the extracted values (the ABN, BSB or accountNo using the above example) will result in a lower value. For example, if the ABN, BSB and accountNo have confidence values of 0.8809, 0.9998 and 0.9995, the mean will be 0.9601.

  2. The total number of documents used when calculating the probability, which is the number of documents with the given conditioned field values. Using the above example, this would be all documents with the same ABN. The smaller this number, the more the confidence value calculated in (1) is reduced. No reduction in confidence is made if there are 1000 or more documents. For 500 documents, the confidence is reduced by ~10%. For 100 documents, the confidence is reduced by ~33%. For 10 documents, the confidence is reduced by ~67%.
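The two factors above can be combined in a short sketch. The exact reduction function is not documented here; a logarithmic scale factor, log(n) / log(1000) capped at 1, happens to reproduce the quoted reductions (about 10%, 33% and 67% for 500, 100 and 10 documents), so it is used below purely as an assumption. The function names are hypothetical.

```python
import math
from statistics import mean


def document_count_factor(total_docs: int, full_confidence_at: int = 1000) -> float:
    """Assumed scale factor in [0, 1] based on the number of documents.

    Uses log(n) / log(full_confidence_at), capped at 1. This curve is an
    assumption that fits the figures quoted in the text, not a documented formula.
    """
    if total_docs < 1:
        return 0.0
    return min(1.0, math.log(total_docs) / math.log(full_confidence_at))


def signal_confidence(field_confidences: list, total_docs: int) -> float:
    """Mean extraction confidence, reduced when few documents are available."""
    return mean(field_confidences) * document_count_factor(total_docs)


# Confidence values for ABN, BSB and accountNo from the example in item 1.
confidences = [0.8809, 0.9998, 0.9995]
for n in (10, 100, 500, 1000):
    print(n, round(signal_confidence(confidences, n), 4))
```

With 1000 or more documents the factor is 1, so the result is simply the mean of the field confidences.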

The signal output includes a support field that can take on 3 values to provide an indicator of the total number of documents used in calculating the signal:

  • HIGH - the estimate is based on enough documents that we can be confident about the probability.

  • MEDIUM - there are enough documents that the probability signal can be treated as an indicator.

  • LOW - there might not be enough documents for the probability calculation to provide an accurate indicator.
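As a rough illustration of how a consumer might combine the signal value, confidence and support, the sketch below flags a document for inspection. The 0.70 signal threshold comes from the guidance above; the minimum confidence of 0.5 and the choice to ignore LOW support are assumptions made for this example, and `needs_inspection` is a hypothetical helper, not part of any Sypht client library.

```python
def needs_inspection(signal_field: dict,
                     signal_threshold: float = 0.70,
                     min_confidence: float = 0.5) -> bool:
    """Return True when the fraud signal is strong and backed by enough data.

    signal_threshold follows the 70% guidance in the text; min_confidence and
    the exclusion of LOW support are illustrative assumptions.
    """
    prediction = signal_field["predictions"][0]
    return (
        prediction["value_norm"] > signal_threshold
        and prediction["confidence_norm"] >= min_confidence
        and prediction["support"] != "LOW"
    )


# A trimmed copy of the example output shown earlier.
example = {
    "id": "sypht.signals.fraud",
    "field": "signals.fraud",
    "type": "float",
    "predictions": [{
        "value": "0.95",
        "value_norm": 0.95,
        "confidence": 0.92,
        "confidence_norm": 0.92,
        "support": "MEDIUM",
    }],
}
print(needs_inspection(example))
```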

Document Match

Document match models dynamically search for and match previously uploaded documents based on values extracted from a query document. This can be used to power a 3-way match of an invoice to purchase and delivery documentation or, as in the example below, to detect near-duplicates and so avoid the invoice being processed multiple times:

Configuration

  • fields: a list of field IDs to match against

  • exact: a boolean value indicating whether to use fuzzy or exact value matching

Document match searches all previously uploaded documents where the match fields have been extracted.
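To make the effect of the fields and exact options concrete, here is a small in-memory sketch. It is not how Sypht implements matching: the fuzzy comparison uses Python's `difflib.SequenceMatcher` with an arbitrary 0.9 similarity threshold, and the document layout (`file_id` plus a `fields` dictionary) and field IDs are assumptions made for the example.

```python
from difflib import SequenceMatcher


def values_match(a: str, b: str, exact: bool, min_ratio: float = 0.9) -> bool:
    """Compare two extracted values exactly, or fuzzily by similarity ratio."""
    if exact:
        return a == b
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= min_ratio


def match_documents(query: dict, history: list, fields: list, exact: bool) -> list:
    """Return a file_id reference for every historical document whose
    match fields were all extracted and all agree with the query document."""
    matches = []
    for doc in history:
        extracted = doc["fields"]
        if all(
            f in query and f in extracted
            and values_match(str(query[f]), str(extracted[f]), exact)
            for f in fields
        ):
            matches.append({"file_id": doc["file_id"]})
    return matches


# Hypothetical data: field IDs and file IDs are placeholders.
history = [
    {"file_id": "doc-1", "fields": {"issuer.name": "ACME Pty Ltd.", "invoice.total": "120.00"}},
    {"file_id": "doc-2", "fields": {"issuer.name": "Other Co", "invoice.total": "120.00"}},
    {"file_id": "doc-3", "fields": {"invoice.total": "120.00"}},  # issuer.name not extracted
]
query = {"issuer.name": "ACME Pty Ltd", "invoice.total": "120.00"}
print(match_documents(query, history, ["issuer.name", "invoice.total"], exact=False))
```

In this sketch a document is skipped when any match field is missing, which mirrors the statement above that only documents with the match fields extracted are searched.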

Output

A list of documentreference values, one for each matched document.

For example:

[
    {
        "file_id": "aaaaaa-bbbbb-..."
    },
    {
        "file_id": "cccccc-ddddd-..."
    }
    ...
]

Statistics

Statistical signals return basic metrics for numerical field values with respect to historical data. This can be used for data analysis and benchmarking of extractions against historical values or market data precedents.

For example, by configuring a value statistic signal over the invoice.total field, you can assess the percentile_rank of the Total on an invoice and use this in combination with field conditions to dynamically flag invoices with unusually high values (e.g. 99th percentile) for human review.

Configuration

  • source: the field identifier to compute historical metrics for. Only fields with numerical data types are accepted.

Output

A dictionary containing descriptive metrics including:

  • min: the minimum observed value of this field in previously extracted data

  • max: the maximum observed value of this field in previously extracted data

  • avg: the average (mean) observed value of this field in previously extracted data

  • variance: the numerical variance of this field in previously extracted data

  • percentile_rank: the percentile rank of the observed value with respect to previously extracted data

{
    "min": -594.30,
    "max": 275474.84,
    "avg": 1961.09,
    "variance": 5829136.35,
    "percentile_rank": 70.21478
}
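The metrics in this output can be reproduced locally with a short sketch. It assumes population variance and defines percentile rank as the percentage of historical values at or below the observed value; the service may use different definitions. `value_statistics` is a hypothetical helper and the sample values are made up.

```python
from statistics import fmean, pvariance


def value_statistics(history: list, observed: float) -> dict:
    """Descriptive metrics for `observed` relative to historical values.

    Assumptions: population variance, and percentile rank computed as the
    share of historical values less than or equal to the observed value.
    """
    if not history:
        raise ValueError("history must contain at least one value")
    at_or_below = sum(1 for value in history if value <= observed)
    return {
        "min": min(history),
        "max": max(history),
        "avg": fmean(history),
        "variance": pvariance(history),
        "percentile_rank": 100.0 * at_or_below / len(history),
    }


# Made-up historical invoice totals and a newly observed total.
stats = value_statistics([100.0, 200.0, 300.0, 400.0], observed=300.0)
print(stats)

# Flag unusually high values for human review, e.g. the 99th percentile.
print(stats["percentile_rank"] >= 99)
```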
