Consuming garbage

Let’s say you own a service that consumes data from an external domain.

  • It’s an old domain with numerous edge cases and a complex business process.
  • You receive semi-structured, unversioned data. The delivery mechanism does not matter a whole lot for our considerations. We’ll refer to these data inputs as “events.”
  • The producer offers you a sandbox integration, but that only covers a limited set of use cases. Many real-world edge cases and some features are not supported.

There is also no feasible way to change this situation, at least not within relevant timescales.

The JSON payloads you receive share some common fields but vary significantly, with up to hundreds of fields and deeply nested objects.

Example JSON payload:

    {
      "category": "transfer.bank",
      "status": "COMPLETED",
      "transaction_id": "abc12345-def6-7890-ghij-klmnopqrstuv",
      "user_reference": "user56789-abcd-1234-efgh-ijklmnopqrst",
      "initiator_reference": "admin12345-abcd-5678-ijkl-mnopqrstuvwx",
      "account_reference": "acc12345-xyz-6789-abcd-qrstuvwx5678",
      "previous_transaction_id": null,
      "account_summary": {
        "currency": "EUR",
        "ledger_total": 5000,
        "available_total": 4500,
        "reserved_total": 300,
        "pending_total": 200,
        "affected_amount": 100,
        "detailed_balances": {
          "EUR": {
            "currency": "EUR",
            "ledger_total": 5000,
            "available_total": 4500,
            "reserved_total": 300,
            "pending_total": 200,
            "affected_amount": 100
          }
        }
      },
      "transfer_details": {
        "transfer_id": "trans56789-ijkl-2345-mnop-qrstuvwx7890",
        "amount": 100,
        "created_at": "2023-09-15T12:34:56Z",
        "updated_at": "2023-09-15T12:36:00Z",
        "linked_transaction_id": "link2345-qrst-6789-ijkl-mnopwx5678yz",
        "status": "COMPLETED",
        "response_info": {
          "response_code": "200",
          "response_message": ""
        },
        "funding_source": {
          "amount": 100,
          "source_details": {
            "type": "BANK_ACCOUNT",
            "id": "bank5678-abcd-1234-efgh-ijklmnop1234",
            "active": true,
            "name": "Personal Checking Account",
            "default": true,
            "created_at": "2022-06-01T09:15:00Z",
            "updated_at": "2023-01-20T10:20:30Z"
          },
          "transaction_log": {
            "log_id": "log12345-efgh-6789-abcd-mnopqrstuvwx",
            "message": "Funds transferred successfully",
            "duration_ms": 450,
            "timed_out": false,
            "gateway_response": {
              "status_code": "00",
              "message": "Approved"
            }
          }
        }
      },
      "processing_time_ms": 560,
      "initiated_at": "2023-09-15T12:34:00Z",
      "processed_at": "2023-09-15T12:35:00Z",
      "amount": 100,
      "currency": "EUR",
      "network": "SEPA",
      "recipient": {
        "country": "DE",
        "bank_code": "10020030",
        "account_number": "DE89370400440532013000",
        "type": "PERSONAL"
      },
      "metadata": {
        "notes": "Monthly rent payment"
      },
      "network_reference_id": "02dfd12312",
      "auth_indicator": "NOT_REQUIRED",
      "acquirer_id": null,
      ...
    }

To make things worse, the documentation is often inadequate:

  • Enum values are incomplete.
  • Not all fields are documented.
  • Descriptions are vague, and field names may use custom terminology, making internet searches unhelpful.
  • Even documented fields are mostly marked as optional. However, specific fields must be set in certain combinations — which, of course, are undocumented.

A lot of real-world software engineering revolves around handling such messy data. Below are some ideas and patterns our team is using (or should have used).

Get the ball rolling

While consulting documentation is advisable, it rarely covers everything. Common cases are usually documented best, so focus on handling those first.

The only way to progress is to develop a domain model, start building, process incoming data, and adjust your model as you go.

Store all the data

Your first step should be to persist all data that you receive. Think like a scientist gathering evidence to refine a hypothesis.

Simply attach an ID and timestamp to the data and store it in a database. Cold storage can be considered later, but this isn’t the time to worry about that.
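
As a minimal sketch, this can be a single table keyed by your own ID. The snippet below uses Python and sqlite3 purely for illustration; the table and column names (raw_events, received_at) are assumptions, not anything the producer defines.

    import json
    import sqlite3
    import uuid
    from datetime import datetime, timezone

    conn = sqlite3.connect("events.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS raw_events (
            id          TEXT PRIMARY KEY,
            received_at TEXT NOT NULL,
            payload     TEXT NOT NULL  -- the event exactly as received
        )
        """
    )

    def store_raw_event(payload: dict) -> str:
        """Persist the untouched payload together with our own ID and timestamp."""
        event_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO raw_events (id, received_at, payload) VALUES (?, ?, ?)",
            (event_id, datetime.now(timezone.utc).isoformat(), json.dumps(payload)),
        )
        conn.commit()
        return event_id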

Start with loose constraints

The end goal is, of course, some business process on your end. Eventually you have to do something with the data, which means you need to properly model and process it.

  • Concretely, that means transforming it into a struct or object. These initial objects should allow optional/null values for non-critical fields.
  • If you receive enum values, consider storing them as strings. If you need to parse them as enums yourself, add an UNKNOWN value as a fallback, so a new value from the producer does not break your parsing setup (see the sketch below).
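
Here is a loosely constrained model as a sketch, continuing the Python examples. The field names come from the example payload above; which fields count as critical, and the extra PENDING status, are assumptions made for illustration.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Any, Optional

    class TransferStatus(Enum):
        COMPLETED = "COMPLETED"
        PENDING = "PENDING"    # assumed value, for illustration only
        UNKNOWN = "UNKNOWN"    # fallback so new values don't break parsing

        @classmethod
        def parse(cls, raw: Optional[str]) -> "TransferStatus":
            try:
                return cls(raw)
            except ValueError:
                return cls.UNKNOWN

    @dataclass
    class TransferEvent:
        transaction_id: str                # critical: required
        amount: Optional[int] = None       # non-critical: nullable
        currency: Optional[str] = None
        status: TransferStatus = TransferStatus.UNKNOWN
        category: Optional[str] = None     # enum-ish value kept as a plain string

    def parse_event(payload: dict[str, Any]) -> TransferEvent:
        return TransferEvent(
            transaction_id=payload["transaction_id"],  # fail loudly only on what we truly need
            amount=payload.get("amount"),
            currency=payload.get("currency"),
            status=TransferStatus.parse(payload.get("status")),
            category=payload.get("category"),
        )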

When persisting this data, consider serializing it as JSON or a string, even in relational databases. This makes schema changes—like adding or removing fields or nullability—much easier.

Don’t trust the documentation

Even in cases where documentation is available, trusting it blindly might not be the best idea. Just today we had an incident at work where an external service stopped sending a mandatory field and deserialization on our end started to fail. We made the field optional, added handling for the missing value, and then reprocessed the faulty messages.

What part of the data is really necessary for your core flow? Do you want to break your application just because a contract violation occurred, even on an important field?

Refining your model

As you progress, your model of this domain will become clearer. You’ll notice patterns like “Event A is followed by Event B, except if C happens,” “If field X has the value DEFAULT, then field Y needs to be set to true,” or “Field Z will only ever be a string of up to 20 characters.”

Start enforcing constraints:

  • Make fields non-optional.
  • Add assertions during the parsing or processing steps to validate your assumptions, as sketched below.
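
Continuing the Python sketch: the checks below encode the example invariants quoted above. field_x, field_y, and field_z are placeholders, and TransferEvent/parse_event refer to the earlier snippet.

    def validate_event(event: TransferEvent, payload: dict) -> None:
        # "If field X has the value DEFAULT, then field Y needs to be set to true."
        if payload.get("field_x") == "DEFAULT":
            assert payload.get("field_y") is True, "field_y must be true when field_x is DEFAULT"

        # "Field Z will only ever be a string of up to 20 characters."
        field_z = payload.get("field_z")
        if field_z is not None:
            assert isinstance(field_z, str) and len(field_z) <= 20, "unexpected field_z"

        # A field we now consider mandatory for our flow.
        assert event.currency is not None, "currency is missing"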

Handle broken assumptions

Unexpected inputs will happen. While your service might handle these gracefully, your model still needs to be updated.

Log the ID of the object, the timestamp, and other relevant data. Alert yourself via your regular channels (Slack, Sentry, etc.) and regularly investigate what happened. Once you understand which assumption was broken, go back and add proper handling for the new case.
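
As a sketch of this failure path, again reusing the earlier Python snippets: the except clause catches whatever your parsing and validation can raise, and the logger stands in for whichever channel (Slack, Sentry, ...) you actually alert on.

    import logging

    logger = logging.getLogger("event_consumer")

    def process_stored_event(event_id: str, payload: dict) -> None:
        try:
            event = parse_event(payload)
            validate_event(event, payload)
            # ... run the actual business logic here ...
        except (AssertionError, KeyError, ValueError) as exc:
            # Log enough context (our ID, the reason) to investigate later.
            logger.error("Broken assumption for event %s: %s", event_id, exc, exc_info=True)
            raise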

Store processing results

Processing failures may be frequent, and it’s impractical to investigate every alert. While tools like Sentry or Datadog can aggregate similar alerts, you might not have access to them.

Maintain another table where you store events that you failed to process and why. This table also allows you to reprocess those events after you have fixed the issue.
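
A sketch of such a table, plus a reprocessing pass over it, reusing the sqlite3 connection and the raw_events table from the first snippet (all names are assumptions):

    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS failed_events (
            event_id  TEXT PRIMARY KEY,
            failed_at TEXT NOT NULL,
            reason    TEXT NOT NULL
        )
        """
    )

    def record_failure(event_id: str, reason: str) -> None:
        conn.execute(
            "INSERT OR REPLACE INTO failed_events (event_id, failed_at, reason) VALUES (?, ?, ?)",
            (event_id, datetime.now(timezone.utc).isoformat(), reason),
        )
        conn.commit()

    def reprocess_failures() -> None:
        """Run after a fix: retry every stored failure against its raw payload."""
        rows = conn.execute(
            "SELECT f.event_id, r.payload FROM failed_events f "
            "JOIN raw_events r ON r.id = f.event_id"
        ).fetchall()
        for event_id, payload_json in rows:
            try:
                process_stored_event(event_id, json.loads(payload_json))
                conn.execute("DELETE FROM failed_events WHERE event_id = ?", (event_id,))
                conn.commit()
            except Exception as exc:
                record_failure(event_id, str(exc))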

Ask wtf is going on

Sometimes you might be very confused about the meaning of specific fields, how events relate temporally, or which edge cases to consider. Reviewing historical data might not help, because the producer’s data model has changed over time: certain invariants were not enforced on the producer side when you started this project, but now are. For example, a field was initially sometimes missing but is now always present. You never got this information; maybe it was never in their release notes or other communications. If you now look back at the complete history, you will see a lot of cases that break the invariant and conclude that, in fact, the invariant does not exist.

When in doubt, reach out. Contact customer support, project liaisons, or even the engineering team of the producer. A few emails or a quick call can often clarify things much faster than trial and error.

Exposing your data

Make data accessible to domain experts

You might be the domain expert by now, but usually the domain experts are non-technical people in your organization. Direct access to production data may not be ideal for compliance or performance reasons. Moreover, the raw data might not be the best view anyway, as certain patterns only emerge once you cross-reference data from other services.

Spend some time setting up a data warehouse and add ETL pipelines for high-level views. This makes it much easier to spot inconsistencies, broken assumptions not caught by your checks, and weird edge cases.
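
As a toy example of the kind of high-level view such a pipeline could materialize (daily counts per category and status), run here against the raw_events table from earlier; a real setup would do this in your warehouse, and the query assumes your SQLite build ships the JSON1 functions.

    daily_counts = conn.execute(
        """
        SELECT json_extract(payload, '$.category') AS category,
               json_extract(payload, '$.status')   AS status,
               date(received_at)                   AS day,
               count(*)                            AS n
        FROM raw_events
        GROUP BY category, status, day
        ORDER BY day
        """
    ).fetchall()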

Use raw event data for specific needs

Some subdomains may need data that your core domain model doesn’t account for. For example, customer support might require different views of the process. The clean way would be to add those fields to your own domain model, but that has costs if you want to stay consistent, like backfilling the values in your database.

One way to avoid this, especially if the requirements are not yet fully clear, is to read from the raw data you stored at the very beginning instead of modifying your core domain model prematurely. Your persisted domain model and the extra data from the raw event are then combined into the required view.
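
A sketch of such a combined view, e.g. for customer support: the core fields come from the persisted domain model, while the extras (notes, recipient country) are pulled straight from the stored raw payload shown earlier.

    def support_view(event_id: str, event: TransferEvent) -> dict:
        row = conn.execute(
            "SELECT payload FROM raw_events WHERE id = ?", (event_id,)
        ).fetchone()
        raw = json.loads(row[0]) if row else {}
        return {
            # from the core domain model
            "transaction_id": event.transaction_id,
            "status": event.status.value,
            "amount": event.amount,
            # extra details read directly from the raw event
            "notes": raw.get("metadata", {}).get("notes"),
            "recipient_country": raw.get("recipient", {}).get("country"),
        }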

Closing remarks

Always remember: the goal isn’t perfection, but building a system that adapts to change while supporting your business. In most domains, not every edge case needs to be handled, and not everything needs to be perfectly planned ahead. Embrace incomplete knowledge and turn unreliable inputs into (more) reliable outputs.