
Monitoring & Observability

Configure OpenTelemetry tracing, log aggregation, and health monitoring for the Payment Gateway.


The Payment Gateway emits structured logs, distributed traces (via OpenTelemetry), and health check endpoints. This guide covers connecting these to common observability stacks.


Health Endpoints

Both backends expose two health endpoints that require no authentication:

GET /health          → 200 OK if process is running (shallow)
GET /internal/health → deep health check with dependency checks

Use /health for simple liveness probes, and /internal/health for load balancer health checks and uptime monitors, since it also verifies dependencies.

Notes:

  • MongoDB is a critical dependency (failures make /internal/health non-200).
  • Cache (Redis/Garnet) is non-critical (failures are reported but do not necessarily fail overall health).

Example with curl:

curl -sf https://admin.yourcompany.com/health && echo "OK"
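If you run under Docker Compose, the shallow endpoint can also drive Docker's own healthcheck. A sketch (the internal port 8080 is an assumption; substitute the backend's actual listen port):

```yaml
# docker-compose fragment — container-level healthcheck via /health (sketch)
services:
  payment-gateway-admin-backend:
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```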

Structured Logging

Both backends emit JSON-structured logs via Zerolog. Set MPG_LOG_LEVEL=info and MPG_LOG_FORMAT=json in production.

Log fields:

Field       Description
level       Log level (info, warn, error)
time        ISO 8601 timestamp
requestId   Unique request identifier
method      HTTP method
path        Request path
status      HTTP status code
latency     Request duration (ms)
error       Error message (on error levels)
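For illustration, a hypothetical log line with these fields, and a quick way to pull a field out of it without jq (the values are invented; real output will differ):

```shell
# A hypothetical log line illustrating the fields above:
line='{"level":"info","time":"2024-05-01T12:00:00Z","requestId":"b7f1c2","method":"POST","path":"/api/payments","status":201,"latency":48}'
echo "$line"

# Quick status extraction without jq:
echo "$line" | grep -o '"status":[0-9]*'
```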

Shipping Logs to Loki (Grafana)

With Docker Compose, use the Loki Docker logging driver (this requires the grafana/loki-docker-driver plugin to be installed on the host):

services:
  payment-gateway-admin-backend:
    logging:
      driver: loki
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
        loki-external-labels: "service=admin-backend,env=production"

Or use a Promtail agent that tails container log files.
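A minimal Promtail configuration for that approach might look like the following. This is a sketch: the log path assumes Docker's default json-file location, and the labels are placeholders.

```yaml
# promtail-config.yaml — a sketch, not a production config
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: payment-gateway
    static_configs:
      - targets: [localhost]
        labels:
          job: payment-gateway
          __path__: /var/lib/docker/containers/*/*-json.log
```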

Shipping Logs to Elasticsearch (ELK)

Use a Filebeat container to tail Docker logs and forward to Elasticsearch:

filebeat:
  image: docker.elastic.co/beats/filebeat:8.x
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - /var/run/docker.sock:/var/run/docker.sock:ro
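The container also needs a filebeat.yml mounted. A minimal sketch (the elasticsearch hostname and the JSON-decoding step are assumptions; adapt to your cluster):

```yaml
# filebeat.yml — a minimal sketch for forwarding Docker container logs
filebeat.autodiscover:
  providers:
    - type: docker
      hints.enabled: true

processors:
  # The gateway logs JSON; lift the fields out of the "message" string.
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```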

OpenTelemetry Tracing

The gateway instruments all HTTP requests with OpenTelemetry traces. Enable tracing by setting:

MPG_OTEL_ENABLED=true
MPG_OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
MPG_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

Connecting to Grafana Tempo

Add a Tempo + OTEL Collector to your Docker Compose:

otel-collector:
  image: otel/opentelemetry-collector-contrib:latest
  command: ["--config=/etc/otel-collector-config.yaml"]
  volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
  ports:
    - "4317:4317" # OTLP gRPC
    - "4318:4318" # OTLP HTTP

tempo:
  image: grafana/tempo:latest
  command: ["-config.file=/etc/tempo.yaml"]
  volumes:
    - ./tempo.yaml:/etc/tempo.yaml # mount the config the command references

otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
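To query the traces in Grafana, provision a Tempo datasource. A sketch (the URL assumes Tempo's default HTTP port, 3200):

```yaml
# grafana/provisioning/datasources/tempo.yaml — a sketch
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```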

Connecting to Jaeger

Recent Jaeger releases accept OTLP natively; point the exporter at Jaeger's OTLP gRPC port:
MPG_OTEL_EXPORTER_OTLP_PROTOCOL=grpc
MPG_OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
MPG_OTEL_EXPORTER_OTLP_INSECURE=true
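For local development, the all-in-one image is enough. A Compose sketch (ports follow Jaeger's documented defaults; verify against your version):

```yaml
# docker-compose fragment — Jaeger all-in-one with OTLP ingest (sketch)
jaeger:
  image: jaegertracing/all-in-one:latest
  environment:
    COLLECTOR_OTLP_ENABLED: "true" # default in recent releases; explicit for older ones
  ports:
    - "16686:16686" # Jaeger UI
    - "4317:4317"   # OTLP gRPC (matches the exporter settings above)
```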

Connecting to Honeycomb / Datadog / Grafana Cloud

For SaaS OTEL backends, set the endpoint URL and authentication headers:

MPG_OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
MPG_OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY

Request Tracing

Every request receives a unique X-Request-ID header. Log lines include this field as requestId. Use it to correlate logs and traces for a specific request across both backends.
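A sketch of that correlation from the command line. The sample log file below is hypothetical; in production you would read the real container logs instead:

```shell
# Write a hypothetical two-line log sample to correlate against.
cat > /tmp/gateway-sample.log <<'EOF'
{"level":"info","requestId":"req-42","path":"/api/payments","status":200,"latency":12}
{"level":"error","requestId":"req-43","path":"/api/payments","status":500,"error":"upstream timeout"}
EOF

# All log lines for a single request:
grep '"requestId":"req-43"' /tmp/gateway-sample.log
```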


Alerting Recommendations

Condition                                Suggested Alert
/internal/health returns non-200         PagerDuty P1
error log rate > 10/min                  PagerDuty P2
Transaction failed rate > 5% in 10 min   Slack warning
Worker PDF job queue depth > 100         Slack warning
MongoDB connections > 80% of pool        Slack warning

Configure alerts in your observability platform (Grafana Alerting, Datadog Monitors, etc.) against these signals.
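As an example, the error-rate condition could be expressed as a Loki ruler rule. This is a sketch: the service label and the level-matching string assume the Loki labels and JSON log format shown earlier.

```yaml
# Loki ruler rule — a sketch of the "error log rate > 10/min" alert
groups:
  - name: payment-gateway
    rules:
      - alert: HighErrorLogRate
        expr: sum(rate({service="admin-backend"} |= `"level":"error"` [5m])) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Error log rate above 10/min on admin-backend
```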


Dashboards

If using Grafana, create panels for:

  • Request rate by service and status code (from logs or OTEL metrics)
  • P50/P95/P99 latency by endpoint
  • Transaction status distribution (succeeded/failed/pending) over time
  • PDF worker queue depth and processing time
  • Active DB connections vs pool maximum
  • Rate limit hits (429 response rate)
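If your logs flow through Loki, the first panel could be driven by a LogQL query like the following sketch (the service label assumes the Loki setup above):

```logql
sum by (status) (
  rate({service="admin-backend"} | json | __error__="" [1m])
)
```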
