APIs Contracts and Integration

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/07 APIs Contracts and Integration.md.

APIs are long-lived contracts. Integration quality determines how safely systems can evolve, how quickly teams can ship, and how much production risk appears at service boundaries. A good contract makes behavior explicit enough that producers can change internals without surprising consumers.

An API contract is more than a payload schema. It includes identity, authorization expectations, consistency, error behavior, rate limits, retry safety, versioning policy, observability, and operational ownership.

Core mental model

Every integration answers six questions:

Question | Contract concern | Failure if ignored
What is being exchanged? | Resource, command, query, event, file, stream | Consumers infer meaning from implementation details
Who owns the truth? | Source of record, derived view, cache | Split brain, stale writes, impossible reconciliation
What can change? | Compatibility rules, versioning policy | Accidental breaking changes
What happens when it fails? | Timeouts, retries, idempotency, compensation | Duplicate work, data loss, cascading failure
How is misuse reported? | Error model, validation, problem details | Ad hoc parsing, brittle client logic
How is it verified? | Contract tests, schema checks, examples | Integration drift discovered in production

Contract types

Contract type | Typical technologies | Best for | Main risks
Synchronous API | REST, GraphQL, gRPC, internal RPC | Immediate request response workflows | Tight availability coupling, timeout chains
Asynchronous event | Kafka, RabbitMQ, SQS, SNS, NATS, webhooks | Facts, notifications, workflows, projections | Duplicate delivery, ordering assumptions, schema drift
File contract | CSV, JSON, Parquet, XML, fixed width | Batch exchange, analytics, partner integrations | Ambiguous format rules, partial loads, encoding problems
Database contract | Shared tables, views, CDC streams | Legacy integration, reporting, change capture | Leaked internals, lock contention, unsafe schema changes
UI contract | Backend for frontend response shapes, component props | Product surfaces, frontend backend boundaries | Accidental coupling to backend domain model
Operational contract | Health checks, metrics, alerts, runbooks, SLOs | Safe operation and incident response | False health, missing ownership, alert fatigue

Shared databases should be treated as integration debt unless explicitly designed as a stable contract. A read-only view, CDC topic, or API facade is usually safer than letting another service depend on private tables.

API design principles

  • Design around stable domain concepts, not current database tables.
  • Make identity explicit and immutable wherever possible.
  • Separate commands from queries when their behavior, consistency, or authorization differs.
  • Make idempotency explicit for every mutating operation that clients may retry.
  • Prefer additive changes and deprecation windows over flag day migrations.
  • Use stable error models that machines can parse and humans can diagnose.
  • Document consistency guarantees, including read after write behavior.
  • Document timeout and retry expectations for both client and server.
  • Avoid exposing private lifecycle states unless they are part of the product contract.
  • Prefer small, cohesive contracts over broad "god" endpoints.
  • Include examples for success, validation failure, authorization failure, conflict, and retry.
  • Treat observability fields such as request ID and correlation ID as contract elements.

Choosing REST, GraphQL, gRPC, or events

Style | Strengths | Weaknesses | Choose when
REST | Simple resource model, broad tooling, cache friendly, easy debugging | Can become chatty, weak typing unless OpenAPI is maintained, hard for graph shaped reads | Public APIs, CRUD resources, partner integrations, web clients
GraphQL | Client selected fields, schema introspection, good for aggregate views | Resolver complexity, authorization per field, caching complexity, N+1 risks | Product clients need flexible reads across related entities
gRPC | Strong contracts, efficient binary transport, streaming, code generation | Browser support needs bridges, harder ad hoc debugging, versioning discipline required | Internal service to service calls, low latency systems, typed platforms
Async events | Loose temporal coupling, scalable fanout, replayable history with log based brokers | Eventual consistency, duplicates, ordering gaps, harder debugging | State changes, workflows, audit trails, projections, integrations that should survive receiver downtime

Useful rule: use synchronous APIs for questions and commands that need immediate acceptance, and events for facts that already happened or workflows that can proceed asynchronously.

REST contracts

REST works best when resources, representations, and status codes are consistent. A REST contract should define resource identity, allowed methods, representation shape, filtering, pagination, error bodies, and concurrency behavior.

Resource modeling

Pattern | Good | Bad
Stable nouns | /v1/customers/{customer_id}/invoices | /v1/getCustomerInvoices
Explicit commands | POST /v1/invoices/{id}/void | PATCH /v1/invoices/{id} with { "status": "voided" } when voiding has business rules
Collection creation | POST /v1/invoices | GET /v1/createInvoice?...
Full replacement | PUT /v1/customer-profiles/{id} | POST /v1/updateEverything
Partial update | PATCH /v1/customer-profiles/{id} | PUT that silently merges omitted fields

HTTP method semantics

Method | Expected semantics | Idempotent | Notes
GET | Retrieve a representation | Yes | Must not mutate business state
HEAD | Retrieve metadata | Yes | Useful for cache and existence checks
POST | Create subordinate resource or run command | Not by default | Can be made idempotent with a key
PUT | Replace or create at known URI | Yes | Client knows the resource ID
PATCH | Apply partial change | Usually no | Can be idempotent if patch format is designed that way
DELETE | Remove or tombstone | Yes | Repeats leave the same state but may return 404; document whether clients should treat that as success

REST good example

POST /v1/payments HTTP/1.1
Content-Type: application/json
Idempotency-Key: pay_2026_06_11_0001
X-Request-ID: req_01hx

{
  "account_id": "acct_123",
  "invoice_id": "inv_456",
  "amount": {
    "currency": "USD",
    "value": "129.00"
  },
  "payment_method_id": "pm_789"
}

HTTP/1.1 201 Created
Content-Type: application/json
Location: /v1/payments/pay_abc
X-Request-ID: req_01hx

{
  "id": "pay_abc",
  "status": "authorized",
  "created_at": "2026-06-11T12:30:00Z"
}

Why it is good:

  • The operation has an explicit idempotency key.
  • Money is represented as decimal string plus currency, not a float.
  • The response includes a stable resource identifier.
  • The server returns 201 Created and a Location header.
  • Request tracing is part of the contract.
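The decimal string rule matters because binary floats cannot represent most cent amounts exactly. Below is a minimal Python sketch of a server side parser for the `amount` object above. `parse_money` and the three currency allowlist are hypothetical. A real service would also check each currency's minor units, for example rejecting "129.001" for USD.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical allowlist; a real service would validate against ISO 4217.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def parse_money(amount):
    """Parse {"currency": ..., "value": ...} without ever going through float."""
    currency = amount.get("currency")
    value = amount.get("value")
    if currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency!r}")
    # A JSON number would already have been parsed as a binary float.
    if not isinstance(value, str):
        raise ValueError("value must be a decimal string, not a number")
    try:
        parsed = Decimal(value)
    except InvalidOperation:
        raise ValueError(f"invalid decimal: {value!r}") from None
    if not parsed.is_finite() or parsed <= 0:
        raise ValueError("value must be a positive, finite amount")
    return currency, parsed


print(0.1 + 0.2)                        # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```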

REST bad example

GET /createPayment?account=123&invoice=456&amount=129 HTTP/1.1

Why it is bad:

  • GET mutates state.
  • No idempotency key exists.
  • Amount has no currency and may be parsed imprecisely.
  • The endpoint name describes implementation action rather than resource semantics.
  • Sensitive or important values may leak through logs and caches.

GraphQL contracts

GraphQL is a contract centered on a typed schema. The schema is not only documentation. It is executable, introspectable, and consumed by code generation. The hard part is preserving semantics as fields and resolvers evolve.

GraphQL design guidelines

Topic | Recommended practice | Avoid
Field names | Domain language with stable meaning | Names copied from database columns
Nullability | Use non-null only when the value is always present | Marking fields non-null (!) before that is proven, because relaxing an output field to nullable later is breaking
Pagination | Cursor based connections for lists | Unbounded arrays
Mutations | Payload object with result and user facing errors | Boolean success flags
Errors | Typed domain errors in mutation payloads plus GraphQL errors for infrastructure failures | Putting every failure in top level errors
Authorization | Enforce at resolver and field level | Returning forbidden fields as null without explanation
Deprecation | Use @deprecated(reason: "...") with a migration path | Removing fields immediately

GraphQL good example

type Query {
  customer(id: ID!): Customer
}

type Customer {
  id: ID!
  displayName: String!
  invoices(first: Int!, after: String): InvoiceConnection!
}

type InvoiceConnection {
  edges: [InvoiceEdge!]!
  pageInfo: PageInfo!
}

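type InvoiceEdge {
  cursor: String!
  node: Invoice!
}

# Supporting types so the example schema is self-contained.
# The status values are hypothetical; other Invoice fields are omitted.
type PageInfo {
  hasNextPage: Boolean!
  endCursor: String
}

enum InvoiceStatus {
  DRAFT
  OPEN
  PAID
  VOID
}

type Invoice {
  id: ID!
  status: InvoiceStatus!
}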

type Mutation {
  voidInvoice(input: VoidInvoiceInput!): VoidInvoicePayload!
}

input VoidInvoiceInput {
  invoiceId: ID!
  reason: String!
  idempotencyKey: String!
}

type VoidInvoicePayload {
  invoice: Invoice
  errors: [UserError!]!
}

type UserError {
  code: String!
  message: String!
  path: [String!]!
}

GraphQL bad example

type Mutation {
  updateInvoiceStatus(id: ID!, status: String!): Boolean!
}

Why it is bad:

  • status is an unbounded string.
  • Business commands are hidden behind generic updates.
  • No idempotency key exists.
  • The boolean result cannot explain validation, authorization, or conflict failures.
  • Consumers must query again to learn the resulting state.

gRPC contracts

gRPC contracts are usually expressed with Protocol Buffers. They are strong at typed service boundaries, but they require discipline because field numbers, enum values, and default values become compatibility constraints.

Protobuf evolution rules

Change | Compatibility | Notes
Add optional field with new number | Usually safe | Old clients ignore it
Remove field but reserve number and name | Safe after consumers stop using it | Prevents accidental reuse
Reuse field number | Breaking | Old clients may parse wrong data
Change field type | Usually breaking | Some wire compatible cases still change semantics
Add enum value | Usually wire compatible | Clients must handle unknown values
Rename field | Wire compatible but source disruptive | Generated clients may break; the JSON mapping uses field names, so JSON clients break too
Change default meaning | Breaking | Even if schema compiles

gRPC good example

syntax = "proto3";

package billing.v1;

service PaymentService {
  rpc CreatePayment(CreatePaymentRequest) returns (CreatePaymentResponse);
  rpc GetPayment(GetPaymentRequest) returns (GetPaymentResponse);
}

message Money {
  string currency = 1;
  string value = 2;
}

message CreatePaymentRequest {
  string account_id = 1;
  string invoice_id = 2;
  Money amount = 3;
  string payment_method_id = 4;
  string idempotency_key = 5;
}

message CreatePaymentResponse {
  Payment payment = 1;
}

message GetPaymentRequest {
  string payment_id = 1;
}

message GetPaymentResponse {
  Payment payment = 1;
}

message Payment {
  string id = 1;
  PaymentStatus status = 2;
  string created_at = 3;
}

enum PaymentStatus {
  PAYMENT_STATUS_UNSPECIFIED = 0;
  PAYMENT_STATUS_AUTHORIZED = 1;
  PAYMENT_STATUS_CAPTURED = 2;
  PAYMENT_STATUS_FAILED = 3;
}

gRPC operational contract

  • Set deadlines on every client call.
  • Propagate cancellation to downstream work.
  • Use status codes consistently: INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS, FAILED_PRECONDITION, UNAVAILABLE, DEADLINE_EXCEEDED.
  • Put machine readable details in error metadata or rich error details.
  • Do not treat transport success as business success if the response can contain domain failures.
  • Version packages intentionally, such as billing.v1 and billing.v2.

Async event contracts

Events should describe facts that already happened. They are not remote procedure calls with a queue in the middle. A good event contract states what the event means, what invariants hold, how it is keyed, what ordering is promised, how duplicates are handled, and how schemas evolve.

Event categories

Event type | Meaning | Example | Consumer expectation
Domain event | Business fact inside a bounded context | InvoiceVoided | Stable domain semantics
Integration event | Published fact for external consumers | BillingInvoiceVoidedV1 | Backward compatible external contract
Notification event | Signal that something may need attention | CustomerEmailChanged | Consumer may call API for details
CDC event | Database row change | invoices.row.updated | Low level replication semantics
Command message | Request for another component to do work | SendReceiptEmail | Single owner should process or reject

Domain events and integration events are related but not always identical. Internal domain events can be rich and change with the model. Integration events should be stable, documented, and intentionally versioned.

Event envelope

{
  "event_id": "evt_01j0",
  "event_type": "billing.invoice_voided",
  "schema_version": 1,
  "occurred_at": "2026-06-11T12:31:00Z",
  "producer": "billing-service",
  "correlation_id": "req_01hx",
  "causation_id": "cmd_778",
  "subject": "invoice:inv_456",
  "data": {
    "invoice_id": "inv_456",
    "account_id": "acct_123",
    "voided_by": "user_999",
    "reason": "duplicate_invoice"
  }
}

Event design checklist

  • Name events in past tense, such as InvoiceVoided, not VoidInvoice.
  • Include a globally unique event_id.
  • Include stable subject identifiers.
  • Include occurred_at from the producer, not only broker ingestion time.
  • Include schema version or subject version.
  • Include correlation and causation IDs.
  • Define partition key and ordering scope.
  • Define duplicate handling expectations.
  • Define replay safety and consumer side effects.
  • Avoid leaking private database columns.
  • Avoid "updated" events unless the changed meaning is clear.
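Below is a sketch of a producer side helper that fills in the envelope fields from the example above, plus a check a publisher could run before sending. The function names, the `evt_` prefix built from `uuid4`, and the choice of mandatory fields are assumptions for illustration. They are not a documented library.

```python
import uuid
from datetime import datetime, timezone

# Fields from the envelope example; which ones are mandatory is a team decision.
REQUIRED_ENVELOPE_FIELDS = (
    "event_id", "event_type", "schema_version", "occurred_at",
    "producer", "correlation_id", "causation_id", "subject", "data",
)


def make_event(event_type, schema_version, producer, subject, data,
               correlation_id, causation_id, occurred_at=None):
    """Build an envelope with a globally unique ID and producer side time."""
    occurred_at = (occurred_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "event_id": f"evt_{uuid.uuid4().hex}",
        "event_type": event_type,
        "schema_version": schema_version,
        # When the fact happened, not when the broker ingested it.
        "occurred_at": occurred_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "producer": producer,
        "correlation_id": correlation_id,
        "causation_id": causation_id,
        "subject": subject,
        "data": data,
    }


def validate_envelope(event):
    """Return a list of problems; an empty list means the envelope is complete."""
    problems = [f"missing {name}" for name in REQUIRED_ENVELOPE_FIELDS if name not in event]
    if not isinstance(event.get("schema_version"), int):
        problems.append("schema_version must be an integer")
    return problems


event = make_event(
    "billing.invoice_voided", 1, "billing-service", "invoice:inv_456",
    {"invoice_id": "inv_456", "account_id": "acct_123", "reason": "duplicate_invoice"},
    correlation_id="req_01hx", causation_id="cmd_778",
)
assert validate_envelope(event) == []
```

The bad `product_updated` event shown later in this note would fail this check on its missing ID and version.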

Good event

{
  "event_type": "catalog.product_price_changed",
  "schema_version": 2,
  "event_id": "evt_123",
  "occurred_at": "2026-06-11T10:15:00Z",
  "subject": "product:prod_123",
  "data": {
    "product_id": "prod_123",
    "previous_price": { "currency": "USD", "value": "19.00" },
    "new_price": { "currency": "USD", "value": "21.00" },
    "effective_at": "2026-06-12T00:00:00Z"
  }
}

Why it is good:

  • The event names the business fact.
  • The old and new values are explicit.
  • The effective time is part of the contract.
  • Consumers can deduplicate by event_id.
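Deduplicating by `event_id` can be sketched as an idempotent consumer. `PriceProjection` is hypothetical. The in-memory set stands in for a durable dedupe table, and that table has to commit together with the side effect.

```python
class PriceProjection:
    """Hypothetical consumer that applies product_price_changed events once."""

    def __init__(self):
        self.processed_event_ids = set()  # stand-in for a durable dedupe table
        self.prices = {}

    def handle(self, event):
        """Apply the event; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self.processed_event_ids:
            return False  # already applied: acknowledge without reapplying
        data = event["data"]
        # In a real store, the update and the dedupe record must commit in one
        # transaction, or a crash between them reintroduces duplicates.
        self.prices[data["product_id"]] = data["new_price"]
        self.processed_event_ids.add(event_id)
        return True
```

A broker redelivery of `evt_123` then becomes a no-op instead of a second price change.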

Bad event

{
  "event_type": "product_updated",
  "data": {
    "id": 123,
    "p": 21,
    "u": "2026-06-11"
  }
}

Why it is bad:

  • The business meaning is unclear.
  • Field names are not self describing.
  • There is no event ID, schema version, or correlation ID.
  • It is not obvious whether p is price, points, priority, or something else.
  • Consumers must infer what changed.

Schema evolution

Compatibility is a product decision, not a serialization feature. A change is compatible only if existing consumers continue to behave correctly without coordinated deployment.

Change compatibility matrix

Change | REST JSON | GraphQL | Protobuf | Event payload
Add optional field | Usually safe | Safe if nullable | Safe with new field number | Usually safe
Add required field to request | Breaking | Breaking input change | Breaking behavior | Breaking for producers
Remove response field | Breaking if consumed | Breaking | Breaking unless reserved and unused | Breaking
Rename field | Breaking | Breaking | Wire compatible but source disruptive | Breaking
Change field meaning | Breaking | Breaking | Breaking | Breaking
Add enum value | Risky | Risky | Wire compatible but client risky | Risky
Narrow validation | Breaking for existing clients | Breaking | Breaking | Breaking
Widen validation | Usually safe | Usually safe | Usually safe | Usually safe
Change ordering guarantees | Breaking if documented or relied on | Breaking | Breaking | Breaking

Evolution rules

  • Add fields before requiring them.
  • Keep old fields during a deprecation window.
  • Emit both old and new fields when migrating names.
  • Version when semantics change, not only when syntax changes.
  • Reserve removed Protobuf field numbers and names.
  • Treat enum expansion as potentially breaking until clients prove they handle unknown values.
  • Run consumer driven contract tests before releasing provider changes.
  • Keep examples current with schemas.
  • Record ownership and compatibility windows in the contract repository.

Deprecation sequence

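One step of a deprecation sequence is emitting both the old and the new field name during the compatibility window. It can be sketched as a serializer plus a tolerant reader. The `name` to `display_name` rename is a hypothetical example.

```python
# Hypothetical rename: "name" is deprecated in favor of "display_name".
DEPRECATED_ALIASES = {"display_name": "name"}


def serialize_customer(customer, include_deprecated=True):
    """Emit new field names, plus deprecated aliases while the window is open."""
    body = {"id": customer["id"], "display_name": customer["display_name"]}
    if include_deprecated:
        for new_name, old_name in DEPRECATED_ALIASES.items():
            body[old_name] = body[new_name]  # same value under the old name
    return body


def read_display_name(body):
    """Migrated consumer: prefer the new field, fall back to the old one."""
    return body.get("display_name", body.get("name"))
```

Flipping `include_deprecated` to False is the removal step. It should only happen once usage data shows no consumer still reads the old name.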

Versioning

Versioning is a tool for compatibility, not a substitute for it. The best versioning strategy depends on who consumes the API, how much coordination exists, and how expensive duplicate versions are to operate.

Versioning style | Example | Best for | Tradeoffs
URL path | /v1/invoices | Public REST APIs | Simple routing, but versions can spread through URLs
Header | Accept: application/vnd.company.billing.v1+json | Clients that can control headers | Cleaner URLs, harder browser debugging
Schema subject | billing.invoice_voided.v1 | Event streams | Clear per event compatibility
Package namespace | billing.v1.PaymentService | gRPC | Strong codegen separation
Field deprecation | GraphQL @deprecated | Gradual client migration | Requires usage visibility and discipline
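Below is a sketch of resolving the requested major version from either the URL path or the vendor media type shown in the table. The regular expressions, the `SUPPORTED_VERSIONS` set, and the decision to reject unversioned requests instead of defaulting to a "latest" version are all assumptions.

```python
import re

PATH_VERSION = re.compile(r"^/v(\d+)/")
MEDIA_VERSION = re.compile(r"application/vnd\.company\.billing\.v(\d+)\+json")
SUPPORTED_VERSIONS = {1, 2}  # hypothetical


def resolve_version(path, accept_header=""):
    """Return the major version from the path, else from the Accept header."""
    match = PATH_VERSION.match(path) or MEDIA_VERSION.search(accept_header or "")
    if match is None:
        # Deliberately no implicit "latest": an unversioned call is an error.
        raise ValueError("request must name an API version")
    version = int(match.group(1))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version v{version}")
    return version
```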

Versioning guidelines

  • Version public contracts more explicitly than private contracts.
  • Keep compatibility inside a major version.
  • Do not create v2 for every additive field.
  • Create a new major version when meaning, invariants, or required behavior changes.
  • Support old versions only as long as there is an owner and a removal policy.
  • Publish migration examples, not only schema diffs.
  • Avoid "latest" endpoints. The same request can change meaning without the client opting in, which makes behavior hard to reproduce.

Backwards and forwards compatibility

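Backward compatibility means newer code can read data written by older code, for example a new consumer processing events from an old producer. Forward compatibility means older code can read data written by newer code, for example an old consumer safely ignoring fields added by a new producer. Because producers and consumers deploy independently, most contracts need both.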

Rule | Why it matters
Unknown fields should be ignored by readers unless strict validation is required | Allows additive changes
Unknown enum values should map to UNKNOWN or be handled explicitly | Prevents crashes on producer expansion
Optional means optional in behavior, not only in syntax | Consumers must not be forced to infer missing data
Defaults must be stable | Silent default changes are semantic breaks
Consumers should not depend on response field order | JSON object order should not be meaningful
Producers should not reuse identifiers | Reuse breaks caches, audit trails, and idempotency
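The first two rules can be sketched as a tolerant reader in Python. The `PaymentStatus` values mirror the gRPC example earlier, lowercased as they might appear in JSON. The `UNKNOWN` member and `read_payment` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class PaymentStatus(Enum):
    UNKNOWN = "unknown"  # explicit landing spot for values added later
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    FAILED = "failed"


@dataclass(frozen=True)
class Payment:
    id: str
    status: PaymentStatus


def read_payment(body):
    """Tolerant reader: take only known fields, never crash on new enum values."""
    try:
        status = PaymentStatus(body.get("status"))
    except ValueError:
        status = PaymentStatus.UNKNOWN
    # Keys in `body` that this consumer does not know are simply never read.
    return Payment(id=body["id"], status=status)
```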

Idempotent APIs

Mutating APIs need a retry story. If a client times out after sending a request, it may not know whether the server committed the work. Idempotency lets the client retry safely without creating duplicate side effects.

Idempotency patterns

Pattern | Example | Use when | Caveats
Idempotency key | Idempotency-Key: abc123 | POST commands may be retried | Server must store key, payload hash, status, and response
Client generated ID | PUT /customers/cus_123 | Client can allocate stable resource IDs | Requires ID collision rules
Natural unique key | order_number unique per merchant | Business domain already has uniqueness | Natural keys can change or be reused in poorly modeled domains
Conditional request | If-Match: "etag-123" | Updates need optimistic concurrency | Requires ETag or version field
Consumer dedupe (inbox) | Event ID tracked by consumer | Async consumers must handle duplicates | Consumer storage must be transactional with side effects

Idempotency key behavior

Situation | Expected response
First request with key | Process command and store response
Duplicate request with same payload after the first completed | Replay original response
Duplicate request with same payload while the first is still processing | Return 409 Conflict, 202 Accepted, or wait within timeout, documented explicitly
Duplicate key with different payload | Return 409 Conflict with a stable error code
Key expired | Treat as new request or return explicit expiration error, documented explicitly

Idempotency storage

Minimum server record:

  • Idempotency key.
  • Authenticated principal or tenant.
  • Request payload hash.
  • Operation name.
  • Processing status.
  • Final status code and response body.
  • Created time and expiration time.

Never scope an idempotency key globally if tenants share the API. Scope it by tenant, actor, operation, or resource as appropriate.
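The key behavior table and the minimum record can be combined into an in-memory sketch. Everything here is hypothetical: the class name, the `IDEMPOTENCY_REQUEST_IN_PROGRESS` code, and two of the documented choices from the table (in-progress duplicates get 409, expired keys are treated as new). A production store would need an atomic insert, such as a unique constraint, so that two concurrent first requests cannot both run the handler.

```python
import hashlib
import json
import time


class IdempotencyStore:
    """In-memory sketch; a real store needs an atomic insert-if-absent."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.records = {}  # (tenant, operation, key) -> record

    @staticmethod
    def payload_hash(payload):
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def execute(self, tenant, operation, key, payload, handler):
        """Run handler(payload) at most once per (tenant, operation, key)."""
        scope = (tenant, operation, key)  # never a global key namespace
        digest = self.payload_hash(payload)
        now = self.clock()
        record = self.records.get(scope)
        if record is not None and now >= record["expires_at"]:
            record = None  # documented choice: an expired key starts fresh
        if record is not None:
            if record["payload_hash"] != digest:
                return 409, {"code": "IDEMPOTENCY_KEY_PAYLOAD_MISMATCH"}
            if record["status"] == "processing":
                return 409, {"code": "IDEMPOTENCY_REQUEST_IN_PROGRESS"}
            return record["response"]  # replay the stored response verbatim
        self.records[scope] = {
            "payload_hash": digest,
            "status": "processing",
            "created_at": now,
            "expires_at": now + self.ttl,
            "response": None,
        }
        try:
            response = handler(payload)
        except Exception:
            # Assumes the handler rolled back, so a retry may run it again.
            del self.records[scope]
            raise
        self.records[scope].update(status="complete", response=response)
        return response
```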

Error models

Errors are part of the contract. They should be stable enough for programmatic handling and descriptive enough for humans to debug. Avoid making clients parse English messages.

Error response shape

{
  "type": "https://docs.example.com/errors/insufficient_funds",
  "title": "Insufficient funds",
  "status": 409,
  "code": "INSUFFICIENT_FUNDS",
  "detail": "The account balance is lower than the requested debit amount.",
  "request_id": "req_01hx",
  "fields": [
    {
      "path": "amount.value",
      "code": "AMOUNT_EXCEEDS_BALANCE",
      "message": "Amount exceeds available balance."
    }
  ]
}
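The fields `type`, `title`, `status`, and `detail` match the problem details format of RFC 9457. `code`, `request_id`, and `fields` are extension members. Below is a minimal Python sketch that renders this body from a typed error. `ApiError`, `FieldError`, and the convention of deriving `type` from a documentation base URL are hypothetical.

```python
from dataclasses import asdict, dataclass, field

PROBLEM_CONTENT_TYPE = "application/problem+json"  # media type from RFC 9457


@dataclass
class FieldError:
    path: str
    code: str
    message: str


@dataclass
class ApiError:
    status: int
    code: str
    title: str
    detail: str
    fields: list = field(default_factory=list)

    def to_problem(self, request_id, docs_base="https://docs.example.com/errors/"):
        """Render the stable body; clients branch on `code`, never on `detail`."""
        return {
            "type": docs_base + self.code.lower(),
            "title": self.title,
            "status": self.status,
            "code": self.code,
            "detail": self.detail,
            "request_id": request_id,
            "fields": [asdict(f) for f in self.fields],
        }


error = ApiError(
    status=409,
    code="INSUFFICIENT_FUNDS",
    title="Insufficient funds",
    detail="The account balance is lower than the requested debit amount.",
    fields=[FieldError("amount.value", "AMOUNT_EXCEEDS_BALANCE", "Amount exceeds available balance.")],
)
body = error.to_problem(request_id="req_01hx")
```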

HTTP status guidance

Status | Use for | Do not use for
400 Bad Request | Malformed syntax, invalid JSON | Business conflict
401 Unauthorized | Missing or invalid authentication | Authenticated user without permission
403 Forbidden | Authenticated but not allowed | Missing token
404 Not Found | Resource absent or intentionally hidden | Validation errors
409 Conflict | State conflict, idempotency mismatch, unique constraint conflict | Generic server errors
422 Unprocessable Entity | Well formed request with domain validation errors | Parse errors
429 Too Many Requests | Rate limit or quota exceeded | Permanent authorization failure
500 Internal Server Error | Unexpected server fault | Known client mistakes
502 Bad Gateway | Upstream returned invalid response | Client validation
503 Service Unavailable | Temporary overload or maintenance | Permanent failure
504 Gateway Timeout | Upstream deadline exceeded | Local validation

Good error model

{
  "code": "IDEMPOTENCY_KEY_PAYLOAD_MISMATCH",
  "message": "The idempotency key was already used with a different request payload.",
  "request_id": "req_abc",
  "retryable": false
}

Bad error model

{
  "error": "Something went wrong"
}

Why it is bad:

  • There is no stable code.
  • Retry behavior is unclear.
  • No request ID exists for support.
  • The client cannot distinguish validation, conflict, authorization, or server failure.

Pagination

Pagination is a consistency contract. It defines how clients traverse a changing collection without duplicates, gaps, or unbounded memory usage.

Pagination strategies

Strategy | Example | Strengths | Weaknesses
Offset | ?limit=50&offset=100 | Simple, good for small static lists | Slow at high offsets, duplicates or gaps while data changes
Page number | ?page=3&page_size=50 | Familiar for UI | Same consistency issues as offset
Cursor | ?limit=50&after=cursor_abc | Stable for changing data, efficient with index | Cursor must be opaque and well designed
Keyset | ?limit=50&created_before=...&id_before=... | Efficient and deterministic | More complex API shape

Pagination response example

{
  "data": [
    {
      "id": "inv_123",
      "created_at": "2026-06-11T09:00:00Z"
    }
  ],
  "page": {
    "limit": 50,
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wNi0xMVQwOTowMDowMFoiLCJpZCI6Imludl8xMjMifQ",
    "has_more": true
  }
}

Pagination checklist

  • Use deterministic ordering with a unique tiebreaker, such as created_at desc, id desc.
  • Make cursors opaque.
  • Include has_more or equivalent.
  • Define maximum page size.
  • Define behavior when items are inserted or deleted during traversal.
  • Keep filters stable across pages.
  • Do not let clients combine cursor from one query with different filters.
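Below is a sketch of cursor pagination that follows the checklist, using an in-memory collection. The cursor is base64url encoded JSON of the last row's sort key, which reproduces the `next_cursor` in the response example above. That encoding is opaque by convention only. A production cursor might also be signed or bound to the query's filters. The function names and the page size cap are assumptions.

```python
import base64
import json


def encode_cursor(created_at, item_id):
    """Opaque cursor: base64url of the last row's sort key, padding stripped."""
    raw = json.dumps({"created_at": created_at, "id": item_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")


def decode_cursor(cursor):
    padded = cursor + "=" * (-len(cursor) % 4)
    position = json.loads(base64.urlsafe_b64decode(padded))
    return position["created_at"], position["id"]


def list_page(items, limit, after=None, max_limit=100):
    """Keyset page ordered by created_at desc, id desc (id breaks ties)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    limit = min(limit, max_limit)
    ordered = sorted(items, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if after is not None:
        boundary = decode_cursor(after)
        # SQL equivalent: WHERE (created_at, id) < (:c, :i), where row values exist.
        ordered = [r for r in ordered if (r["created_at"], r["id"]) < boundary]
    page = ordered[: limit + 1]  # fetch one extra row to compute has_more
    has_more = len(page) > limit
    page = page[:limit]
    next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    return {"data": page, "page": {"limit": limit, "next_cursor": next_cursor, "has_more": has_more}}
```

Because the boundary is a key and not an offset, rows inserted ahead of the cursor do not shift later pages.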

Filtering, sorting, and field selection

Filtering and sorting are contracts because they define query semantics, index expectations, and authorization behavior.

Feature | Good | Bad
Filtering | ?status=paid&created_after=2026-06-01T00:00:00Z | ?where=status='paid'
Sorting | ?sort=-created_at,id with documented fields | Arbitrary SQL fragments
Search | ?query=receipt with documented matching rules | Unspecified fuzzy behavior
Field selection | ?fields=id,status,total | Returning private fields and asking clients to ignore them
Includes | ?include=customer,line_items with limits | Recursive expansion without depth limits

Filtering rules should state:

  • Allowed fields and operators.
  • Date and timezone semantics.
  • Case sensitivity.
  • Null handling.
  • Authorization interaction.
  • Maximum complexity.
  • Index or performance limits.
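Below is a sketch of allowlist based parsing for the sort and filter parameters in the table. The field names, the allowed values, and the automatic `id` tiebreaker are hypothetical choices. Date and null handling would need their own documented rules.

```python
# Hypothetical allowlists for an invoices collection.
SORTABLE_FIELDS = {"created_at", "id", "total"}
FILTER_VALUES = {"status": {"draft", "open", "paid", "void"}}
RESERVED_PARAMS = {"sort", "limit", "after"}
MAX_SORT_KEYS = 3


def parse_sort(param):
    """Parse "-created_at,id" into [(field, descending)], rejecting unknowns."""
    keys = []
    for token in filter(None, param.split(",")):
        descending = token.startswith("-")
        name = token[1:] if descending else token
        if name not in SORTABLE_FIELDS:
            raise ValueError(f"cannot sort by {name!r}")
        keys.append((name, descending))
    if len(keys) > MAX_SORT_KEYS:
        raise ValueError("too many sort keys")
    if all(name != "id" for name, _ in keys):
        keys.append(("id", True))  # unique tiebreaker keeps ordering deterministic
    return keys


def parse_filters(query):
    """Accept only documented filters and values; never forward raw expressions."""
    filters = {}
    for name, value in query.items():
        if name in RESERVED_PARAMS:
            continue
        allowed = FILTER_VALUES.get(name)
        if allowed is None:
            raise ValueError(f"unknown filter {name!r}")
        if value not in allowed:  # exact, case sensitive match
            raise ValueError(f"invalid value for {name}: {value!r}")
        filters[name] = value
    return filters
```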

Concurrency control

Integration contracts must define what happens when two actors update the same resource.

Optimistic concurrency with ETag

GET /v1/invoices/inv_123 HTTP/1.1

HTTP/1.1 200 OK
ETag: "invoice-version-7"

{
  "id": "inv_123",
  "memo": "Original memo"
}

PATCH /v1/invoices/inv_123 HTTP/1.1
If-Match: "invoice-version-7"
Content-Type: application/json

{
  "memo": "Updated memo"
}

If the resource changed after the read, return:

HTTP/1.1 412 Precondition Failed
Content-Type: application/json

{
  "code": "RESOURCE_VERSION_CONFLICT",
  "message": "The invoice was modified after the provided version.",
  "request_id": "req_123"
}
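The server side of this exchange can be sketched with a version counter. The in-memory `invoices` dict and the helper names are hypothetical. So is the choice to answer a missing `If-Match` with 428 Precondition Required (RFC 6585). In a real store, the version check and the write must be one atomic compare and set.

```python
# Hypothetical in-memory store: invoice ID -> version counter and body.
invoices = {"inv_123": {"version": 7, "body": {"id": "inv_123", "memo": "Original memo"}}}


def etag_for(record):
    return f'"invoice-version-{record["version"]}"'


def patch_invoice(invoice_id, if_match, changes):
    """Apply a partial update only if the client saw the current version."""
    record = invoices[invoice_id]
    if if_match is None:
        return 428, {"code": "PRECONDITION_REQUIRED"}
    if if_match != etag_for(record):
        return 412, {"code": "RESOURCE_VERSION_CONFLICT",
                     "message": "The invoice was modified after the provided version."}
    record["body"].update(changes)
    record["version"] += 1  # every successful write produces a new ETag
    return 200, {"etag": etag_for(record), "body": record["body"]}
```

A second client still holding `"invoice-version-7"` gets 412 and must read the invoice again before retrying.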

Timeouts, retries, and circuit breakers

Distributed systems fail through partial failure, not only full outage. Every integration should define deadlines, retry budgets, and overload behavior. See also 05 Distributed Systems.

Timeout budget


Timeouts must shrink as calls go deeper. A downstream timeout longer than the caller timeout wastes work and increases load during incidents, because the downstream service keeps working after the caller has already given up.
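One way to enforce shrinking timeouts is to pass an absolute deadline down the chain and derive each downstream timeout from what remains. Below is a Python sketch built on a hypothetical `Deadline` helper with a fixed local reserve. gRPC already carries deadlines on the wire, so in gRPC the work is mostly propagating the caller's context.

```python
import time


class Deadline:
    """Absolute deadline propagated down a call chain (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + timeout_s

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def child_timeout(self, reserve_s, cap_s=None):
        """Timeout for a downstream call: what is left, minus local headroom."""
        budget = self.remaining() - reserve_s
        if budget <= 0:
            raise TimeoutError("deadline exhausted before downstream call")
        return min(budget, cap_s) if cap_s is not None else budget
```

With a 2 second budget, 0.5 seconds already spent, and 0.2 seconds reserved for local work, the downstream call gets about 1.3 seconds rather than its own, possibly longer, default.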

Retry guidance

Failure | Retry? | Notes
Network connection reset before response | Yes if operation is idempotent | Use idempotency key for mutations
408 Request Timeout | Usually | Respect operation safety
409 Conflict | Usually no | Client must change state or refresh
429 Too Many Requests | Yes after Retry-After | Apply jitter
500 Internal Server Error | Maybe | Retry only if documented safe
502 Bad Gateway | Usually | Bounded exponential backoff
503 Service Unavailable | Usually | Respect Retry-After
504 Gateway Timeout | Maybe | Risk of duplicate commit unless idempotent
Validation error | No | Fix request
Authorization failure | No | Fix credentials or permissions

Retry checklist

  • Retry only idempotent operations or operations protected by idempotency keys.
  • Use exponential backoff with jitter.
  • Set a maximum retry count and total retry deadline.
  • Respect Retry-After.
  • Do not retry every layer independently without a shared budget.
  • Avoid retrying on permanent errors.
  • Record retry attempts in logs and metrics.
  • Test duplicate request behavior.
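The checklist can be sketched as a retry wrapper. The sketch assumes that `send` is a hypothetical callable returning `(status, headers)`, that only idempotent or key protected calls are retried, that `Retry-After` is read in its delta seconds form only, and that backoff uses full jitter. Connection resets could be mapped to a retryable outcome in the same way.

```python
import random
import time

# From the retry table; 500 is left out because it is only safe when documented.
RETRYABLE_STATUSES = {408, 429, 502, 503, 504}


def call_with_retries(send, idempotent, max_attempts=4, base_s=0.1, cap_s=2.0,
                      total_budget_s=5.0, sleep=time.sleep, rng=random.random,
                      clock=time.monotonic):
    """Call send() -> (status, headers), retrying only safe, transient failures."""
    if not idempotent:
        return send()  # unprotected mutation: one attempt, surface the result
    started = clock()
    attempt = 1
    while True:
        status, headers = send()
        if status not in RETRYABLE_STATUSES or attempt >= max_attempts:
            return status, headers
        if "Retry-After" in headers:
            delay = float(headers["Retry-After"])  # HTTP-date form not handled
        else:
            # Exponential backoff with full jitter, capped per attempt.
            delay = rng() * min(cap_s, base_s * 2 ** (attempt - 1))
        if clock() - started + delay > total_budget_s:
            return status, headers  # respect the total retry budget
        sleep(delay)
        attempt += 1
```

Injecting `sleep`, `rng`, and `clock` keeps the policy testable without real waiting.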

Circuit breaker states


Circuit breakers protect dependencies and callers from repeated calls to a failing service. They should be paired with timeouts, fallback behavior, and observability. A circuit breaker without a defined fallback or user facing error can turn a partial outage into confusing application behavior.
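Below is a minimal single threaded circuit breaker showing the three states. The thresholds, the `fallback` callable, and the class itself are illustrative assumptions. Under concurrency, the half open state would also need to limit how many trial calls get through.

```python
import time


class CircuitBreaker:
    """Minimal closed -> open -> half open breaker with an injectable clock."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # fail fast; do not touch the dependency
            self.state = "half_open"  # let a trial request through
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        self.state, self.failures = "closed", 0  # success closes the circuit
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```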

Integration failure modes

Failure mode | Example | Mitigation
Duplicate request | Client retries after timeout and creates two payments | Idempotency keys, unique constraints, response replay
Lost event | Producer commits database write but crashes before publishing | Transactional outbox (see Design Patterns/Outbox Pattern)
Duplicate event | Broker redelivery after consumer crash | Consumer dedupe table, idempotent side effects
Out of order event | InvoicePaid arrives before InvoiceCreated | Partition by aggregate, version checks, buffering
Poison message | Consumer cannot parse one event and blocks partition | Dead letter queue, schema validation, alerting
Slow dependency | Payment provider latency consumes all threads | Deadlines, bulkheads, circuit breakers
Partial failure | Local state committed but remote call failed | Saga, compensation, reconciliation job
Schema drift | Producer changes field meaning silently | Contract tests, schema registry, compatibility checks
Rate limit | Partner API returns 429 under load | Token bucket, backoff, queueing, quota dashboard
Clock skew | Expiration or ordering based on local clocks | Server timestamps, monotonic versions, tolerance windows
Authorization drift | Consumer role loses permission unexpectedly | Synthetic checks, explicit scopes, owner alerts

Consumer driven contracts

Consumer driven contracts capture what each consumer actually depends on. They reduce the false confidence of provider only tests and prevent accidental breaking changes.


Consumer driven contract workflow

  1. Consumer defines expected request and response interactions.
  2. Consumer tests itself against a mock generated from the contract.
  3. Contract is published to a shared broker or repository.
  4. Provider CI verifies the provider implementation against published contracts.
  5. Deployment is blocked if a provider change breaks an active consumer contract.
  6. Usage and ownership metadata identify who must migrate before removal.

What to include

  • Request method, path, headers, query parameters, and body shape.
  • Required response fields and their meanings.
  • Error cases the consumer handles.
  • Authentication and authorization assumptions.
  • Version or provider state setup.
  • Matching rules that avoid overfitting to irrelevant values.

What not to include

  • Provider implementation details.
  • Incidental response fields the consumer does not use.
  • Exact timestamps or generated IDs unless they are semantically required.
  • A single golden payload that makes additive changes look breaking.
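Matching on types instead of exact values is what keeps a contract from overfitting. Below is a hand rolled sketch of such a matcher. It is not the API of Pact or any other tool, and the expectation shape is hypothetical.

```python
# Hypothetical consumer expectation: only the fields this consumer reads.
CONSUMER_EXPECTATION = {
    "id": str,
    "status": str,
    "amount": {"currency": str, "value": str},
}


def satisfies(expected, actual, path="$"):
    """Return mismatches; extra provider fields are ignored on purpose."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [f"{path}: expected object"]
        problems = []
        for key, sub in expected.items():
            if key not in actual:
                problems.append(f"{path}.{key}: missing")
            else:
                problems.extend(satisfies(sub, actual[key], f"{path}.{key}"))
        return problems
    if not isinstance(actual, expected):
        return [f"{path}: expected {expected.__name__}, got {type(actual).__name__}"]
    return []
```

A provider that adds `created_at` still passes, while one that switches `amount.value` to a JSON number fails with a precise path.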

Contract testing

Contract testing sits between unit tests and end to end tests. It verifies boundaries without requiring the entire system to run.

Test type | Purpose | Example
Schema validation | Payload conforms to OpenAPI, GraphQL, Protobuf, JSON Schema, or Avro | Validate event before publish
Provider contract test | Provider satisfies published consumer expectations | Pact provider verification
Consumer contract test | Consumer handles documented provider responses | Mock provider generated from contract
Compatibility test | New schema is compatible with prior schema | Schema registry compatibility gate
Example test | Documentation examples execute successfully | OpenAPI examples validated in CI
Negative contract test | Errors follow stable model | Invalid request returns 422 with field errors
Replay test | Consumer can process historical events | Replay topic snapshot in staging

Contract test checklist

  • Validate both success and failure responses.
  • Include at least one retry safe mutation case.
  • Include authorization failures.
  • Include pagination edge cases.
  • Include unknown enum or unknown field behavior where relevant.
  • Verify idempotency conflict behavior.
  • Verify deprecated fields remain available during the compatibility window.
  • Run provider verification in CI before deployment.
  • Keep generated clients and schemas in sync.
  • Fail builds on undocumented breaking changes.

OpenAPI and schema repositories

For REST APIs, OpenAPI is most useful when treated as executable contract source, not hand written decoration.

Good practices:

  • Store specs in version control.
  • Generate server validation or client SDKs where practical.
  • Validate examples during CI.
  • Run breaking change detection against the previous released spec.
  • Publish rendered docs from the same source.
  • Include error schemas and headers, not only happy path bodies.
  • Include security schemes and scope requirements.

Bad practices:

  • Updating docs after implementation by memory.
  • Using an untyped object schema for every response.
  • Omitting error responses.
  • Describing behavior in prose that contradicts the schema.
  • Allowing undocumented fields to become relied upon by consumers.

Webhooks

Webhooks are asynchronous contracts delivered over HTTP. They need the same rigor as events plus additional delivery and security rules.

Webhook contract checklist

  • Sign payloads with a documented algorithm.
  • Include timestamp and protect against replay.
  • Include event ID for deduplication.
  • Retry with bounded backoff.
  • Document retry schedule and final failure behavior.
  • Treat non-2xx responses as failed delivery.
  • Provide a manual replay mechanism.
  • Provide endpoint verification where needed.
  • Keep payload schema versioned.
  • Avoid requiring immediate synchronous callback from the receiver.

Webhook signature example

POST /webhooks/billing HTTP/1.1
Content-Type: application/json
X-Webhook-ID: evt_123
X-Webhook-Timestamp: 1781179200
X-Webhook-Signature: v1=4f7a...

Receiver rules:

  • Verify timestamp freshness.
  • Compute signature over the raw body, not reparsed JSON.
  • Deduplicate by webhook ID.
  • Return 2xx only after durable acceptance.
  • Process side effects asynchronously when possible.
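The receiver rules above can be sketched as follows. The signing scheme is an assumption, modeled on schemes several providers use: HMAC-SHA256 over `<timestamp>.<raw body>`, hex-encoded, with a `v1=` prefix as in the example headers. The 300-second tolerance is also an assumption. A real contract must document its own algorithm and window. The in-memory ID set stands in for a durable deduplication store.

```python
from __future__ import annotations

import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed freshness window


def sign(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    """Assumed scheme: HMAC-SHA256 over '<timestamp>.<raw body>', hex encoded."""
    message = timestamp.encode() + b"." + raw_body
    return "v1=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


class WebhookReceiver:
    def __init__(self, secret: bytes, now=time.time):
        self.secret = secret
        self.now = now
        self.seen_ids: set[str] = set()  # stand-in for a durable dedup store

    def handle(self, headers: dict[str, str], raw_body: bytes) -> int:
        """Return the HTTP status code the receiver would send."""
        event_id = headers.get("X-Webhook-ID", "")
        timestamp = headers.get("X-Webhook-Timestamp", "")
        signature = headers.get("X-Webhook-Signature", "")
        if not event_id:
            return 400
        if not timestamp.isdigit() or abs(self.now() - int(timestamp)) > TOLERANCE_SECONDS:
            return 400  # missing or stale timestamp: possible replay
        # Sign the raw bytes exactly as received, never re-serialized JSON.
        expected = sign(self.secret, timestamp, raw_body)
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return 401
        if event_id in self.seen_ids:
            return 200  # duplicate delivery: acknowledge, do not repeat side effects
        self.seen_ids.add(event_id)  # durable acceptance would happen here
        # Side effects would be enqueued here and processed asynchronously.
        return 202
```

Because the sender treats any 2xx as success, returning 200 for a duplicate stops further retries without repeating work. `hmac.compare_digest` avoids leaking how many signature characters matched through response timing.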

Security and authorization as contract

Security behavior must be documented because clients build workflows around it.

| Concern | Contract detail |
| --- | --- |
| Authentication | Token type, mTLS, API key, session, service account |
| Authorization | Required scopes, roles, resource ownership checks |
| Tenant isolation | How the tenant is selected and validated |
| Sensitive fields | Redaction rules and role-based visibility |
| Audit | Which actions generate audit entries |
| Rate limits | Quotas, windows, headers, and retry behavior |
| Idempotency scope | Whether keys are scoped by tenant, principal, or operation |

Avoid returning different error shapes for security failures. Answering 404 instead of 403 to hide whether a resource exists is acceptable, but the choice must be intentional and applied consistently across endpoints.
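A minimal sketch of hiding existence consistently: a missing invoice and another tenant's invoice produce the same status and the same problem-details shape. The field names `type`, `title`, `status`, and `detail` follow RFC 9457 (Problem Details for HTTP APIs). The `Invoice` type, the store, and the problem type URI are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:  # hypothetical resource
    invoice_id: str
    tenant_id: str
    amount: int


def not_found(invoice_id: str) -> tuple[int, dict]:
    # One response shape for "absent" and "not yours", so callers cannot probe.
    return 404, {
        "type": "https://example.com/problems/not-found",  # hypothetical URI
        "title": "Resource not found",
        "status": 404,
        "detail": f"Invoice {invoice_id} was not found.",
    }


def get_invoice(store: dict[str, Invoice], tenant_id: str,
                invoice_id: str) -> tuple[int, dict]:
    invoice = store.get(invoice_id)
    # The tenant check runs before anything about the invoice is revealed.
    if invoice is None or invoice.tenant_id != tenant_id:
        return not_found(invoice_id)
    return 200, {"invoice_id": invoice.invoice_id, "amount": invoice.amount}
```

Routing both branches through one helper keeps the contract consistent. A later change to the error body then cannot accidentally make the two cases distinguishable.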

Observability contract

Integration behavior is only debuggable if both sides can correlate traffic.

Required signals

| Signal | Purpose |
| --- | --- |
| Request ID | Single request debugging |
| Correlation ID | Business workflow across services |
| Causation ID | Parent event or command that caused the work |
| Consumer name and version | Identify affected clients |
| Contract version | Detect incompatible deployments |
| Latency histogram | Understand tail latency |
| Error code metric | Track contract-level failures |
| Retry count | Detect instability hidden by retries |
| Idempotency replay count | Detect duplicate traffic patterns |
| Dead letter count | Detect consumer parse or processing failures |
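The ID signals only help if every hop propagates them the same way. In this sketch, each outgoing call or event gets a fresh request ID and keeps the incoming correlation ID. Its causation ID points at the message that triggered the work. The `X-*` header names are hypothetical. Using `User-Agent` to carry consumer name and version is one possible convention, not a requirement.

```python
from __future__ import annotations

import uuid

# Hypothetical header names; a real contract should document its own.
REQUEST_ID = "X-Request-ID"
CORRELATION_ID = "X-Correlation-ID"
CAUSATION_ID = "X-Causation-ID"
CONTRACT_VERSION = "X-Contract-Version"


def outgoing_headers(incoming: dict[str, str], cause_id: str, consumer: str,
                     consumer_version: str, contract_version: str) -> dict[str, str]:
    """Build headers for a call or event caused by an incoming message."""
    return {
        # Fresh ID for this single request or message.
        REQUEST_ID: str(uuid.uuid4()),
        # Reuse the workflow ID; start a new workflow only if none was sent.
        CORRELATION_ID: incoming.get(CORRELATION_ID) or str(uuid.uuid4()),
        # The event or command that caused this work.
        CAUSATION_ID: cause_id,
        # Identify the calling client and the contract version it targets.
        "User-Agent": f"{consumer}/{consumer_version}",
        CONTRACT_VERSION: contract_version,
    }
```

With correlation and causation IDs in place, a trace backend or log query can rebuild the whole workflow as a tree rather than a flat list of requests.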

Documentation quality

A contract document should let a new consumer integrate without reading provider code.

Include:

  • Overview and ownership.
  • Authentication and authorization.
  • Resource or event model.
  • Request and response schemas.
  • Error model.
  • Pagination, filtering, and sorting.
  • Idempotency and concurrency behavior.
  • Rate limits and quotas.
  • Versioning and deprecation policy.
  • Retry, timeout, and delivery guarantees.
  • Examples for common and edge cases.
  • Change log.

Design review checklist

Use this before publishing or changing a contract.

Semantics

  • Is the source of truth clear?
  • Are resource identifiers stable and scoped correctly?
  • Are commands named after business actions?
  • Are events named as past-tense facts?
  • Are consistency guarantees explicit?
  • Are state transitions documented?

Failure behavior

  • Does every mutating operation have an idempotency story?
  • Are timeouts and retry policies documented?
  • Are retryable and non-retryable errors distinguishable?
  • Is partial failure handled through compensation or reconciliation?
  • Are duplicate events and duplicate requests safe?
  • Is rate limiting explicit?
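On the client side, "distinguishable" means a retry decision can be made from the response alone. This sketch uses an assumed policy, not a universal rule. A request is retried only if its status is in a documented retryable set and repeating it is safe, either because the method is idempotent or because the request carries an idempotency key. A contract could equally expose this through stable error codes in the body.

```python
from __future__ import annotations

# Assumed policy for illustration: statuses a client may retry.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}


def should_retry(status: int, method: str, has_idempotency_key: bool) -> bool:
    if status not in RETRYABLE_STATUSES:
        # e.g. 400, 401, 403, 404, 409, 422: a retry would fail the same way.
        return False
    # A non-idempotent mutation may already have taken effect; repeating it is
    # only safe when the provider can deduplicate it by idempotency key.
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
```

The second check ties retry policy to the idempotency story. A timed-out `POST` without a key is exactly the case where retrying risks duplicate work.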

Evolution

  • Is the change additive?
  • If not additive, is there a new version or migration plan?
  • Are deprecated fields measured before removal?
  • Are unknown fields and enum values handled safely?
  • Are generated clients updated?
  • Are examples and docs changed with the schema?
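Handling unknown fields and enum values safely usually means a tolerant reader. The consumer ignores fields it does not know and maps unrecognized enum values to an explicit fallback instead of failing. In this sketch, `PaymentStatus`, `Payment`, and the payload shape are hypothetical. The `UNKNOWN` member exists only on the client side.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class PaymentStatus(Enum):
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    DECLINED = "declined"
    UNKNOWN = "unknown"  # client-side bucket for values the provider adds later


@dataclass(frozen=True)
class Payment:
    payment_id: str
    status: PaymentStatus
    raw_status: str  # original value, kept for logging and metrics


def parse_payment(payload: dict) -> Payment:
    raw = payload["status"]
    try:
        status = PaymentStatus(raw)
    except ValueError:
        status = PaymentStatus.UNKNOWN  # a new enum value must not crash the consumer
    # Fields not read here are ignored rather than rejected.
    return Payment(payment_id=payload["payment_id"], status=status, raw_status=raw)
```

Keeping `raw_status` lets the consumer count unknown values in metrics. That is often the first signal that the provider has shipped an additive change the client should now handle explicitly.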

Testing

  • Do provider tests verify consumer contracts?
  • Do consumer tests run against contract mocks?
  • Are schema compatibility checks in CI?
  • Are examples executable or validated?
  • Are negative cases covered?
  • Is event replay tested for consumers with side effects?

Operations

  • Are owner, on-call path, and support channel known?
  • Are request IDs and correlation IDs propagated?
  • Are contract errors visible in metrics?
  • Is there an alert for schema validation failures?
  • Is there a replay or reconciliation process?
  • Is there a safe rollback plan?

Concrete end-to-end example

Scenario: an order service asks billing to authorize payment. Billing publishes a payment event. Fulfillment consumes the event.


Contract decisions:

  • Orders uses an idempotency key when calling Billing.
  • Billing stores the payment and outbox event in one transaction.
  • Outbox publishes at least once.
  • Fulfillment deduplicates by event_id.
  • PaymentAuthorized includes order_id, payment_id, account_id, and amount.
  • Fulfillment does not assume global ordering across all orders.
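The contract decisions above can be sketched in a single process. SQLite stands in for Billing's database, and a plain callable stands in for the broker. Table names, function names, and the in-memory deduplication set are hypothetical. A production consumer would store processed event IDs durably, ideally in the same transaction as its own side effects.

```python
from __future__ import annotations

import json
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY,
                               payment_id TEXT NOT NULL, order_id TEXT NOT NULL,
                               amount INTEGER NOT NULL);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL,
                             published INTEGER NOT NULL DEFAULT 0);
    """)


def authorize_payment(conn, idempotency_key, order_id, account_id, amount) -> str:
    """Billing: store the payment and its outbox event in one transaction."""
    row = conn.execute("SELECT payment_id FROM payments WHERE idempotency_key = ?",
                       (idempotency_key,)).fetchone()
    if row:
        return row[0]  # retried request: return the original result, no new payment
    payment_id = f"pay_{uuid.uuid4().hex[:8]}"
    event = {"event_id": str(uuid.uuid4()), "type": "PaymentAuthorized",
             "order_id": order_id, "payment_id": payment_id,
             "account_id": account_id, "amount": amount}
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO payments VALUES (?, ?, ?, ?)",
                     (idempotency_key, payment_id, order_id, amount))
        conn.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                     (event["event_id"], json.dumps(event)))
    return payment_id


def relay_outbox(conn, publish) -> None:
    """Publish committed events. A crash after publish() re-sends: at least once."""
    rows = conn.execute("SELECT event_id, payload FROM outbox "
                        "WHERE published = 0 ORDER BY rowid").fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                         (event_id,))


class Fulfillment:
    def __init__(self):
        self.processed: set[str] = set()  # stand-in for a durable dedup table
        self.shipments: list[str] = []

    def on_payment_authorized(self, event: dict) -> None:
        if event["event_id"] in self.processed:
            return  # duplicate delivery: already handled
        self.shipments.append(event["order_id"])
        self.processed.add(event["event_id"])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    first = authorize_payment(conn, "ord_42-authorize", "ord_42", "acct_7", 1999)
    retry = authorize_payment(conn, "ord_42-authorize", "ord_42", "acct_7", 1999)
    assert first == retry  # the retry did not create a second payment
    delivered = []
    relay_outbox(conn, delivered.append)
    fulfillment = Fulfillment()
    for event in delivered + delivered:  # simulate redelivery of every event
        fulfillment.on_payment_authorized(event)
    print(fulfillment.shipments)  # ['ord_42']
```

Because events are read only from committed outbox rows, the "ghost event" failure in the bad variant cannot occur. The price is at-least-once delivery, which the consumer absorbs through deduplication by `event_id`.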

Bad variant:

  • Orders calls Billing without idempotency.
  • Billing publishes event before database commit.
  • Fulfillment uses event arrival order as truth.
  • Failed inventory reservation is only logged.

Result: duplicate payment risk, ghost events, inconsistent fulfillment, and no reliable reconciliation path.

Practical heuristics

  • If a client might retry it, make it idempotent.
  • If a consumer might branch on it, give it a stable code.
  • If a field has business meaning, document the meaning, not only the type.
  • If an event name contains Updated, ask what actually happened.
  • If an endpoint returns a list, define pagination before production traffic exists.
  • If an enum may grow, force clients to handle unknown values.
  • If two services must update state together, design for partial failure from the start.
  • If contract tests are hard to write, the contract is probably too implicit.
  • If a breaking change seems harmless, find the consumers before shipping it.
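As one way of defining pagination up front, here is a sketch of keyset pagination with an opaque cursor. The response shape (`data`, `next_cursor`) and the cursor encoding are assumptions. Keeping the cursor opaque lets the provider change how it is built without breaking clients, because clients only pass it back unchanged.

```python
from __future__ import annotations

import base64
import json


def encode_cursor(last_id: str) -> str:
    # Opaque to clients: they must not parse or construct it.
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> str:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def list_page(items: list[dict], limit: int, cursor: str | None = None) -> dict:
    """Keyset pagination over items ordered by a stable, unique 'id'."""
    ordered = sorted(items, key=lambda item: item["id"])
    if cursor is not None:
        after = decode_cursor(cursor)
        ordered = [item for item in ordered if item["id"] > after]
    page = ordered[:limit]
    # Offer a next cursor only when more items remain after this page.
    next_cursor = encode_cursor(page[-1]["id"]) if len(ordered) > limit else None
    return {"data": page, "next_cursor": next_cursor}
```

Unlike offset pagination, a keyset cursor does not skip or repeat items when new rows are inserted before the current position, provided the ordering key is stable and unique.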