API Design Part 20: Operations & Deployment

Building an API is only half the job. The other half is running it reliably. Operational concepts are not afterthoughts but design decisions that need to be considered from the start.

Goals

After this article you will be able to:

Ensure environment parity between dev, staging, and production
Choose deployment strategies that enable zero downtime
Define and test rollback processes
Write runbooks for typical incidents

Key questions

How do we make sure rollbacks work?
Which runbooks do we absolutely need?
How do we achieve environment parity?

Environment parity

Differences between environments are one of the most common causes of works-on-my-machine problems in production.

The 12-Factor principle

Environment configuration

Aspect	Dev	Staging	Production
Database	Local/Container	Managed (smaller)	Managed (HA)
Secrets	Local Vault	Staging Vault (isolated)	Production Vault
Replicas	1	2	3+ (auto-scaling)
TLS	Self-signed	Let's Encrypt	Managed Certs
Logging	Console	Aggregated	Aggregated + Retention
Feature Flags	All enabled	Selective	Controlled Rollout

Config Management Pattern

# base/deployment.yaml - shared base
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
        - name: api
          image: api:VERSION  # gets replaced
          envFrom:
            - configMapRef:
                name: api-config
            - secretRef:
                name: api-secrets
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: deployment-patch.yaml
configMapGenerator:
  - name: api-config
    literals:
      - LOG_LEVEL=info
      - CACHE_TTL=300

# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
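
On the application side, the values injected through the ConfigMap still need to be read and validated. The following is a minimal sketch, assuming the API reads `LOG_LEVEL` and `CACHE_TTL` at startup. The `ApiConfig` shape, the `loadConfig` helper, the set of allowed log levels, and the interpretation of `CACHE_TTL` as seconds are all assumptions made for illustration.

```typescript
// Hypothetical config shape; only the variable names LOG_LEVEL and CACHE_TTL come from the ConfigMap.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface ApiConfig {
    logLevel: LogLevel;
    cacheTtlSeconds: number; // assumption: CACHE_TTL is given in seconds
}

const LOG_LEVELS: LogLevel[] = ['debug', 'info', 'warn', 'error'];

function isLogLevel(value: string): value is LogLevel {
    return (LOG_LEVELS as string[]).indexOf(value) !== -1;
}

// Reads and validates configuration from an environment-like object.
// Throws on missing or invalid values so a misconfigured pod fails at startup
// instead of misbehaving later.
function loadConfig(env: { [key: string]: string | undefined }): ApiConfig {
    const rawLevel = env['LOG_LEVEL'];
    if (rawLevel === undefined || !isLogLevel(rawLevel)) {
        throw new Error('LOG_LEVEL must be one of ' + LOG_LEVELS.join(', '));
    }

    const rawTtl = env['CACHE_TTL'];
    const ttl = Number(rawTtl);
    const validTtl = rawTtl !== undefined && rawTtl.trim() !== ''
        && isFinite(ttl) && Math.floor(ttl) === ttl && ttl >= 0;
    if (!validTtl) {
        throw new Error('CACHE_TTL must be a non-negative integer');
    }

    return { logLevel: rawLevel, cacheTtlSeconds: ttl };
}

// Same code in every environment; only the injected values differ.
const productionConfig = loadConfig({ LOG_LEVEL: 'info', CACHE_TTL: '300' });
```

In a real service the argument would typically be `process.env`. Passing the environment in as a parameter keeps the function easy to test.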

Health Endpoints

Health endpoints are the interface between the API and the orchestrator. They should only be reachable internally and must not reveal sensitive details.

The three health checks

Implementation

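// health.controller.ts
import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';

// DatabaseService and CacheService are application-specific providers (imports omitted).
// Local response type for the readiness endpoint.
interface HealthCheckResult {
    status: string;
    checks: Record<string, string>;
}

@Controller('health')
export class HealthController {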
    constructor(
        private readonly db: DatabaseService,
        private readonly cache: CacheService
    ) {
    }

    // Liveness: the process is running
    @Get('live')
    live(): { status: string } {
        return {status: 'ok'};
    }

    // Readiness: can handle traffic
    @Get('ready')
    async ready(): Promise<HealthCheckResult> {
        const checks = await Promise.allSettled([
            this.checkDatabase(),
            this.checkCache(),
        ]);

        const results = {
            database: checks[0].status === 'fulfilled' ? 'ok' : 'degraded',
            cache: checks[1].status === 'fulfilled' ? 'ok' : 'degraded',
        };

        const healthy = checks.every(c => c.status === 'fulfilled');

        if (!healthy) {
            throw new ServiceUnavailableException({
                status: 'degraded',
                // Only expose details internally
                checks: results
            });
        }

        return {status: 'ok', checks: results};
    }

    // Startup: initial checks completed
    @Get('startup')
    async startup(): Promise<{ status: string }> {
        if (!this.db.isConnected()) {
            throw new ServiceUnavailableException('Database not ready');
        }
        if (!this.cache.isWarmedUp()) {
            throw new ServiceUnavailableException('Cache warming up');
        }
        return {status: 'ok'};
    }

    private async checkDatabase(): Promise<void> {
        await this.db.query('SELECT 1');
    }

    private async checkCache(): Promise<void> {
        await this.cache.ping();
    }
}
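
The readiness handler above returns per-dependency details in its response body. One way to keep those details internal is to strip them before the body leaves the cluster. This is a minimal sketch, not part of the controller: `buildReport`, `toResponseBody`, and the `internal` flag are hypothetical, and how a caller is recognized as internal (separate port, network policy, header set by the gateway) is left open.

```typescript
type DependencyStatus = 'ok' | 'degraded';

interface ReadinessReport {
    status: DependencyStatus;
    checks: { [dependency: string]: DependencyStatus };
}

interface ReadinessBody {
    status: DependencyStatus;
    checks?: { [dependency: string]: DependencyStatus };
}

// Aggregates per-dependency results: overall 'ok' only if every dependency is 'ok'.
function buildReport(checks: { [dependency: string]: DependencyStatus }): ReadinessReport {
    const allOk = Object.keys(checks).every(dep => checks[dep] === 'ok');
    return { status: allOk ? 'ok' : 'degraded', checks: checks };
}

// Internal callers get the full report; everyone else only sees the overall status.
function toResponseBody(report: ReadinessReport, internal: boolean): ReadinessBody {
    return internal ? report : { status: report.status };
}

const readinessExample = buildReport({ database: 'ok', cache: 'degraded' });
const externalBody = toResponseBody(readinessExample, false); // no per-dependency details
const internalBody = toResponseBody(readinessExample, true);  // includes the checks map
```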

Kubernetes configuration

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: api
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3

          startupProbe:
            httpGet:
              path: /health/startup
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 30  # 30 × 10s = up to 5 minutes for startup
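
To sanity-check probe settings, it helps to turn them into a time budget. The sketch below uses a rough upper bound of `initialDelaySeconds + periodSeconds × failureThreshold` and ignores `timeoutSeconds` and scheduling jitter. `ProbeTiming` and `probeBudgetSeconds` are illustrative names, not Kubernetes APIs.

```typescript
interface ProbeTiming {
    initialDelaySeconds: number;
    periodSeconds: number;
    failureThreshold: number;
}

// Approximate upper bound (in seconds) until a continuously failing probe
// has exhausted its failure threshold.
function probeBudgetSeconds(p: ProbeTiming): number {
    return p.initialDelaySeconds + p.periodSeconds * p.failureThreshold;
}

// Values taken from the probe configuration above.
const startupBudget = probeBudgetSeconds({ initialDelaySeconds: 10, periodSeconds: 10, failureThreshold: 30 });  // 310
const livenessBudget = probeBudgetSeconds({ initialDelaySeconds: 5, periodSeconds: 10, failureThreshold: 3 });   // 35
const readinessBudget = probeBudgetSeconds({ initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 3 });   // 20
```

Under these assumptions a container gets a bit more than five minutes to start, a hung process is restarted after roughly half a minute, and an unready pod is taken out of rotation after about 20 seconds.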

Deployment strategies

The right deployment strategy minimizes risk and downtime.

Strategy overview

Rolling Update (Kubernetes Default)

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 additional pod
      maxUnavailable: 0  # all replicas stay available at all times
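
The effect of `maxSurge` and `maxUnavailable` can be worked out with simple arithmetic. The sketch below computes the pod-count bounds during a rolling update. Percentages are resolved against the replica count, rounding up for `maxSurge` and down for `maxUnavailable`, as Kubernetes does. The function names are illustrative.

```typescript
// Resolves an absolute number or a percentage string such as "25%" against the replica count.
function resolveCount(value: number | string, replicas: number, roundUp: boolean): number {
    if (typeof value === 'number') {
        return value;
    }
    const match = /^(\d+)%$/.exec(value.trim());
    if (match === null) {
        throw new Error('expected a number or a percentage like "25%"');
    }
    const exact = (replicas * parseInt(match[1], 10)) / 100;
    return roundUp ? Math.ceil(exact) : Math.floor(exact);
}

interface RolloutBounds {
    maxTotalPods: number;     // upper bound on pods (old + new) during the rollout
    minAvailablePods: number; // lower bound on available pods during the rollout
}

function rollingUpdateBounds(replicas: number, maxSurge: number | string, maxUnavailable: number | string): RolloutBounds {
    const surge = resolveCount(maxSurge, replicas, true);
    const unavailable = resolveCount(maxUnavailable, replicas, false);
    if (surge === 0 && unavailable === 0) {
        throw new Error('maxSurge and maxUnavailable must not both be 0');
    }
    return { maxTotalPods: replicas + surge, minAvailablePods: replicas - unavailable };
}

// Configuration from above: 3 replicas, maxSurge 1, maxUnavailable 0.
const zeroDowntimeBounds = rollingUpdateBounds(3, 1, 0); // { maxTotalPods: 4, minAvailablePods: 3 }
```

With `maxUnavailable: 0` capacity never drops below the desired replica count, at the cost of needing room in the cluster for one extra pod.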

Blue-Green with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    blueGreen:
      activeService: api-active
      previewService: api-preview
      autoPromotionEnabled: false  # manual promotion
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
      postPromotionAnalysis:
        templates:
          - templateName: success-rate

Canary with traffic splitting

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 100
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc
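
To get a feeling for how long such a rollout takes at minimum, the steps can be walked through in code. This is a simplified model, not how Argo Rollouts works internally: it assumes analysis steps take no time and only the pauses add to the clock. `CanaryStep` and `simulateCanary` are hypothetical names.

```typescript
type CanaryStep =
    | { kind: 'setWeight'; weight: number }
    | { kind: 'pause'; minutes: number }
    | { kind: 'analysis'; template: string };

interface TimelineEntry {
    atMinute: number;     // minutes since the rollout started
    canaryWeight: number; // percentage of traffic on the canary at that point
    action: string;
}

// Walks the steps in order and records when each one starts.
function simulateCanary(steps: CanaryStep[]): TimelineEntry[] {
    let minute = 0;
    let weight = 0;
    const timeline: TimelineEntry[] = [];
    for (const step of steps) {
        if (step.kind === 'setWeight') {
            weight = step.weight;
            timeline.push({ atMinute: minute, canaryWeight: weight, action: 'set weight to ' + weight + '%' });
        } else if (step.kind === 'pause') {
            timeline.push({ atMinute: minute, canaryWeight: weight, action: 'pause ' + step.minutes + 'm' });
            minute += step.minutes;
        } else {
            timeline.push({ atMinute: minute, canaryWeight: weight, action: 'run analysis ' + step.template });
        }
    }
    return timeline;
}

// The steps from the Rollout above.
const canaryTimeline = simulateCanary([
    { kind: 'setWeight', weight: 10 },
    { kind: 'pause', minutes: 5 },
    { kind: 'analysis', template: 'error-rate' },
    { kind: 'setWeight', weight: 50 },
    { kind: 'pause', minutes: 10 },
    { kind: 'analysis', template: 'latency-check' },
    { kind: 'setWeight', weight: 100 },
]);
```

Under these assumptions full promotion happens no earlier than minute 15, and the canary never carries more than 10% of traffic before the first analysis has run.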

Canary Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01  # <1% Errors
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status_class="5xx",app="api",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{app="api",version="canary"}[5m]))
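
The logic behind this template can be sketched in a few lines. The code below is an illustration, not Argo's implementation. It assumes that a measurement fails when the error rate is not below the threshold, and that the whole run fails once the number of failed measurements exceeds `failureLimit`. Treating zero traffic as a rate of 0 is also an assumption, since the PromQL division would yield no usable value in that case.

```typescript
// Share of 5xx responses among all responses; 0 when there was no traffic (assumption).
function errorRate(errors5xx: number, totalRequests: number): number {
    return totalRequests === 0 ? 0 : errors5xx / totalRequests;
}

type AnalysisVerdict = 'successful' | 'failed';

// Checks each measurement against the threshold and fails the run
// as soon as more than failureLimit measurements have failed.
function evaluateAnalysis(rates: number[], threshold: number, failureLimit: number): AnalysisVerdict {
    let failures = 0;
    for (const rate of rates) {
        if (!(rate < threshold)) {
            failures += 1;
            if (failures > failureLimit) {
                return 'failed';
            }
        }
    }
    return 'successful';
}

// Three bad minutes are still tolerated with failureLimit 3, a fourth one fails the run.
const toleratedRun = evaluateAnalysis([0.02, 0.02, 0.02, 0.001], 0.01, 3); // 'successful'
const abortedRun = evaluateAnalysis([0.02, 0.02, 0.02, 0.02], 0.01, 3);    // 'failed'
```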

Rollback strategies

Rollbacks must be faster than fixes. Every minute counts.

Rollback decision tree

Automatic rollback for canary deployments

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate
        startingStep: 1
        args:
          - name: service
            value: api-canary
      # A failed background analysis aborts the rollout and shifts traffic back to stable;
      # the canary pods are scaled down after this delay
      abortScaleDownDelaySeconds: 30

Manual rollback process

# Kubernetes: roll back to the previous ReplicaSet
kubectl rollout undo deployment/api

# Argo Rollouts: abort and roll back
kubectl argo rollouts abort api
kubectl argo rollouts undo api

# Helm: roll back to the previous revision (omit the revision number;
# "helm rollback api 1" would roll back to revision 1 specifically)
helm rollback api

# Docker Compose: redeploy the previous image
# (pin the previous image tag in the compose file first)
docker compose pull api
docker compose up -d api

Database migrations during rollback

Runbooks

Runbooks are documented procedures for recurring operational tasks.

Runbook structure

During an incident, a runbook must be readable in seconds. This structure has proven itself:

Title

Runbook: <incident type>

Metadata

Field	Example
Severity	P1 / P2 / P3
On-Call Team	Platform
Escalation	after 15 min → Tech Lead
Last Updated	2026-01-15
Last Tested	2026-01-10

Symptoms

What does the user see?
Which alerts are firing?
Which metrics look abnormal?

Diagnosis

Step-by-step diagnosis
Relevant dashboards/logs
Common causes

Mitigation

Immediate actions (< 5 min)
Short-term fixes
Rollback decision

Resolution

Fix the root cause
Validation
Post-mortem trigger

Communication

Status page template
Stakeholder notification
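
If runbooks are kept as structured data, their completeness and freshness can be checked automatically. The following sketch models the structure above as a type. The field names, the `missingSections` helper, and the `daysSinceLastTest` helper are assumptions made for illustration. Dates are expected as ISO strings (`YYYY-MM-DD`).

```typescript
interface RunbookMetadata {
    severity: 'P1' | 'P2' | 'P3';
    onCallTeam: string;
    escalation: string;
    lastUpdated: string; // ISO date, e.g. "2026-01-15"
    lastTested: string;  // ISO date, e.g. "2026-01-10"
}

interface Runbook {
    title: string;
    metadata: RunbookMetadata;
    symptoms: string[];
    diagnosis: string[];
    mitigation: string[];
    resolution: string[];
    communication: string[];
}

// Names of the sections that have no entries yet.
function missingSections(rb: Runbook): string[] {
    const sections = [
        { label: 'symptoms', items: rb.symptoms },
        { label: 'diagnosis', items: rb.diagnosis },
        { label: 'mitigation', items: rb.mitigation },
        { label: 'resolution', items: rb.resolution },
        { label: 'communication', items: rb.communication },
    ];
    return sections.filter(s => s.items.length === 0).map(s => s.label);
}

// Whole days between the last test and the given date (both ISO dates, interpreted as UTC).
function daysSinceLastTest(rb: Runbook, today: string): number {
    const millis = Date.parse(today) - Date.parse(rb.metadata.lastTested);
    return Math.floor(millis / 86400000);
}

const draftRunbook: Runbook = {
    title: 'Runbook: API High Error Rate',
    metadata: { severity: 'P1', onCallTeam: 'Platform', escalation: 'after 15 min → Tech Lead', lastUpdated: '2026-01-15', lastTested: '2026-01-10' },
    symptoms: ['Alert: api_error_rate > 0.01'],
    diagnosis: ['Identify the scope'],
    mitigation: ['Check for a recent deployment'],
    resolution: [],
    communication: [],
};
```

A CI job could, for example, fail when `missingSections` returns anything or when the last test is older than a threshold the team has agreed on.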

Example: API High Error Rate

Runbook: API High Error Rate (>1%)

Metadata (example)

Field	Value
Severity	P1
On-Call Team	Backend
Escalation	15 min → Backend Lead → CTO

Symptoms (example)

Alert: api_error_rate > 0.01
User reports: 500 errors, service unavailable
Dashboard: error spike in Grafana

Diagnosis (example)

1. Identify the scope

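# Which endpoints are affected?
# (-G with --data-urlencode avoids curl globbing on {} and [] and encodes spaces)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=topk(5, sum by (route) (rate(http_requests_total{status_class="5xx"}[5m])))'

# Since when? (GNU date syntax)
start=$(date -u -d '1 hour ago' +%s)
end=$(date -u +%s)
curl -sG "http://prometheus:9090/api/v1/query_range" \
  --data-urlencode 'query=sum(rate(http_requests_total{status_class="5xx"}[5m]))' \
  --data-urlencode "start=${start}" --data-urlencode "end=${end}" --data-urlencode "step=60"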

2. Check error logs

# Most recent errors (--tail=-1 because kubectl returns only 10 lines per pod by default when a selector is used)
kubectl logs -l app=api --since=10m --tail=-1 | grep -i error | tail -50

# Group by error type (for JSON logs)
kubectl logs -l app=api --since=10m --tail=-1 | jq -r '.error.type // empty' | sort | uniq -c

3. Check dependencies

# Database and cache (via the readiness endpoint)
kubectl exec deploy/api -- curl -s localhost:3000/health/ready | jq

# Downstream services
kubectl exec deploy/api -- curl -s downstream:8080/health

Mitigation (example)

Immediately (< 2 min)

Check for a rollback: was there a recent deployment?
```
kubectl rollout history deployment/api
```
Run the rollback (if yes):
```
kubectl rollout undo deployment/api
```

If unrelated to a deployment

Increase replicas (for load problems):

kubectl scale deployment/api --replicas=10

Activate the circuit breaker (on downstream failure):

kubectl set env deployment/api DOWNSTREAM_CIRCUIT_OPEN=true

Resolution (example)

Identify the root cause in logs/traces
Develop and test a fix
Staged rollout of the fix
Watch metrics for 30 min
Write a post-mortem (if P1)

Communication (example)

Statuspage (Initial)

Investigating: We are investigating elevated error rates affecting the API. Some requests may fail. Updates to follow.

Statuspage (Mitigated)

Identified: The issue has been identified and a fix is being deployed. Error rates are returning to normal.

Statuspage (Resolved)

Resolved: The API is operating normally. A post-incident review will be conducted.

Must-Have Runbooks

Runbook	Trigger	Priority
High Error Rate	Error rate > 1%	P1
High Latency	p99 > SLO × 2	P1
Database Connection Issues	DB health check fails	P1
Out of Memory	OOM kills detected	P1
Certificate Expiry	< 7 days until expiry	P2
Disk Space Low	> 80% usage	P2
Deployment Failure	Pipeline failed	P2
Rate Limit Exceeded	429s > 5% of traffic	P3
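
The trigger column of this table translates directly into predicates over a metrics snapshot. The sketch below encodes the table as data. `MetricsSnapshot` and its field names are hypothetical, and in practice these conditions would usually live in the alerting system rather than in application code.

```typescript
interface MetricsSnapshot {
    errorRate: number;           // fraction of failed requests, 0.01 = 1%
    latencyP99Ms: number;
    latencySloMs: number;
    dbHealthy: boolean;
    oomKills: number;
    daysUntilCertExpiry: number;
    diskUsage: number;           // fraction, 0.8 = 80%
    pipelineFailed: boolean;
    rateLimitedShare: number;    // fraction of responses with status 429
}

interface RunbookTrigger {
    runbook: string;
    priority: 'P1' | 'P2' | 'P3';
    fires: (m: MetricsSnapshot) => boolean;
}

// One entry per row of the table above.
const RUNBOOK_TRIGGERS: RunbookTrigger[] = [
    { runbook: 'High Error Rate', priority: 'P1', fires: m => m.errorRate > 0.01 },
    { runbook: 'High Latency', priority: 'P1', fires: m => m.latencyP99Ms > m.latencySloMs * 2 },
    { runbook: 'Database Connection Issues', priority: 'P1', fires: m => !m.dbHealthy },
    { runbook: 'Out of Memory', priority: 'P1', fires: m => m.oomKills > 0 },
    { runbook: 'Certificate Expiry', priority: 'P2', fires: m => m.daysUntilCertExpiry < 7 },
    { runbook: 'Disk Space Low', priority: 'P2', fires: m => m.diskUsage > 0.8 },
    { runbook: 'Deployment Failure', priority: 'P2', fires: m => m.pipelineFailed },
    { runbook: 'Rate Limit Exceeded', priority: 'P3', fires: m => m.rateLimitedShare > 0.05 },
];

// Returns the runbooks whose trigger condition holds, in table order.
function firingRunbooks(m: MetricsSnapshot): string[] {
    return RUNBOOK_TRIGGERS.filter(t => t.fires(m)).map(t => t.priority + ': ' + t.runbook);
}

const sampleSnapshot: MetricsSnapshot = {
    errorRate: 0.02, latencyP99Ms: 120, latencySloMs: 100, dbHealthy: true, oomKills: 0,
    daysUntilCertExpiry: 5, diskUsage: 0.5, pipelineFailed: false, rateLimitedShare: 0.01,
};
const firingNow = firingRunbooks(sampleSnapshot); // ['P1: High Error Rate', 'P2: Certificate Expiry']
```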

Capacity Planning

Capacity planning prevents surprises during traffic spikes.

Baseline metrics

Capacity Baseline

Measure:

Requests/second (normal, peak, max tested)
Memory per request
CPU per request
Connections (DB, Redis, external)

Example:

Metrik	Normal	Peak	Max Test
RPS	100	500	2000
Memory/Pod	256 MB	400 MB	800 MB
CPU/Pod	100m	400m	800m
DB Connections	10	25	50
Latency p99	50ms	100ms	300ms

Auto-scaling configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
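
The core of the HPA decision is the documented formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)`, evaluated per metric, with the highest result winning. The sketch below reproduces only that arithmetic plus the min/max clamp. It deliberately ignores the tolerance band, the stabilization windows, and the `behavior` policies, so the result is the value the autoscaler moves towards, not necessarily the next step. The function and type names are illustrative.

```typescript
interface MetricReading {
    current: number; // current average value per pod (e.g. CPU utilization in %, or RPS)
    target: number;  // configured target value per pod
}

// Applies the HPA formula to each metric, takes the largest proposal,
// and clamps the result to [minReplicas, maxReplicas].
function desiredReplicas(currentReplicas: number, metrics: MetricReading[], minReplicas: number, maxReplicas: number): number {
    const highest = metrics
        .map(m => Math.ceil((currentReplicas * m.current) / m.target))
        .reduce((max, proposal) => Math.max(max, proposal), 0);
    return Math.min(maxReplicas, Math.max(minReplicas, highest));
}

// 3 pods at 90% CPU (target 70) and 250 RPS per pod (target 100):
// CPU proposes ceil(3 × 90 / 70) = 4, RPS proposes ceil(3 × 250 / 100) = 8.
const scaleTarget = desiredReplicas(3, [{ current: 90, target: 70 }, { current: 250, target: 100 }], 3, 20); // 8
```

With the scale-up policy above (at most 100% per 60 seconds), going from 3 to 8 pods would take more than one period: 3 → 6 → 8.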

Graceful Shutdown

A clean shutdown prevents lost requests.

// main.ts
import { NestFactory } from '@nestjs/core';
import { setTimeout as sleep } from 'node:timers/promises';
// AppModule, HealthService and DatabaseService are application-specific (imports omitted).

async function bootstrap() {
    const app = await NestFactory.create(AppModule);

    const server = await app.listen(3000);

    // Timeouts for keep-alive connections
    server.keepAliveTimeout = 65000;
    server.headersTimeout = 66000;

    let shuttingDown = false;

    process.on('SIGTERM', async () => {
        if (shuttingDown) {
            return;
        }
        shuttingDown = true;

        // Safety net: force the exit if the graceful path hangs
        const forceExit = setTimeout(() => {
            console.error('Shutdown timeout, forcing exit');
            process.exit(1);
        }, 30000);

        console.log('SIGTERM received, starting graceful shutdown');

        // Resolve the provider before the application context is closed
        const db = app.get(DatabaseService);

        // 1. Set the health check to "not ready"
        app.get(HealthService).setShuttingDown(true);

        // 2. Wait until the load balancer has removed us (grace period)
        await sleep(10000);

        // 3. Finish in-flight requests
        await app.close();

        // 4. Close DB connections
        await db.disconnect();

        console.log('Graceful shutdown complete');
        clearTimeout(forceExit);
    });
}

bootstrap();

# Kubernetes Pod Spec
spec:
  # Must cover the preStop sleep (10s) plus the app's 30s shutdown timeout
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: [ "/bin/sh", "-c", "sleep 10" ]
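
The timings in the application and in the pod spec have to fit together, because the grace period starts counting when termination begins and therefore includes the preStop hook. The sketch below checks that budget. `ShutdownBudget` and `shutdownHeadroomSeconds` are illustrative names, and the numbers in the examples are only there to show the arithmetic.

```typescript
interface ShutdownBudget {
    terminationGracePeriodSeconds: number; // pod spec
    preStopSeconds: number;                // duration of the preStop hook
    appForceExitSeconds: number;           // the app's own forced-exit timeout after SIGTERM
}

// Seconds left between the app's forced exit and the SIGKILL from Kubernetes.
// A negative value means the pod is killed before the app's own timeout can fire.
function shutdownHeadroomSeconds(b: ShutdownBudget): number {
    return b.terminationGracePeriodSeconds - b.preStopSeconds - b.appForceExitSeconds;
}

const tightBudget = shutdownHeadroomSeconds({ terminationGracePeriodSeconds: 30, preStopSeconds: 10, appForceExitSeconds: 30 }); // -10
const safeBudget = shutdownHeadroomSeconds({ terminationGracePeriodSeconds: 60, preStopSeconds: 10, appForceExitSeconds: 30 });  // 20
```

With a 10-second preStop sleep and a 30-second in-app timeout, a grace period of 30 seconds leaves no headroom, while 60 seconds does.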

Rules and anti-patterns

Do

Environment parity: same code, different config
Separate health endpoints: liveness ≠ readiness ≠ startup
Test rollbacks: at least once per quarter
Maintain runbooks: update after every incident
Graceful shutdown: no lost requests
Expand-contract: backward-compatible migrations

Don't

Manual deployments: always go through the pipeline
Different branches per environment: one artifact, many configs
Secrets in code: always load from Vault/Secrets Manager
Unbounded scaling: define max limits
Ignoring runbooks: there is no time to improvise during an incident
Immediate schema changes: plan for a rollback window

Artifact: Operations readiness list

# Operations Readiness Checklist

## Environment & Config

- [ ] Dev/staging/prod environment parity documented
- [ ] Config via environment variables (12-Factor)
- [ ] Secrets in Vault/Secrets Manager
- [ ] Feature flags for controlled rollout

## Health & Monitoring

- [ ] /health/live endpoint (liveness)
- [ ] /health/ready endpoint (readiness)
- [ ] /health/startup endpoint (for slow starts)
- [ ] Kubernetes probes configured
- [ ] Alerts for health failures

## Deployment

- [ ] Zero-downtime deployment strategy chosen
- [ ] Deployment pipeline automated
- [ ] Canary/blue-green for critical services
- [ ] Deployment duration < 10 minutes

## Rollback

- [ ] Rollback process documented
- [ ] Rollback tested (< 5 minutes)
- [ ] Database migrations backward compatible
- [ ] Rollback window defined (e.g. 7 days)

## Runbooks

- [ ] High Error Rate runbook
- [ ] High Latency runbook
- [ ] Database Issues runbook
- [ ] Deployment Failure runbook
- [ ] Runbooks tested/rehearsed

## Capacity & Scaling

- [ ] Baseline metrics documented
- [ ] Auto-scaling configured
- [ ] Max limits defined
- [ ] Load tests run against 2x peak

## Graceful Operations

- [ ] Graceful shutdown implemented
- [ ] terminationGracePeriodSeconds configured
- [ ] Connection draining active
- [ ] Keep-alive timeouts set

Checklist

Must-have before go-live

[ ] Dev/staging/prod environments set up
[ ] Config management via environment variables
[ ] Health endpoints (live, ready) implemented
[ ] Zero-downtime deployment possible
[ ] Rollback process documented and tested
[ ] At least 3 critical runbooks written

Should-have

[ ] Blue-green or canary deployments
[ ] Automated canary analysis
[ ] Auto-scaling configured
[ ] Graceful shutdown implemented

Nice-to-have

[ ] GitOps with Argo CD
[ ] Progressive delivery with feature flags
[ ] Chaos engineering in staging

What's next

In the next part we cover data protection & data lifecycle: how to meet GDPR requirements, define retention policies, and implement the right to erasure.

This is part 20 of the API Design series. All parts can be found in the series: API Design.