Service & Support

Last updated: 15 March 2026

Cluster Lifecycle — End-to-End Process

Overview

This document describes every step from cluster creation to decommissioning. Each process is designed so that non-technical stakeholders can understand what happens, while engineers have clear runbooks.


1. Cluster Creation Process

Trigger

Process Flow

Customer Request
       │
       ▼
┌──────────────┐
│ Requirements  │ ◄── Cluster size, region, environment type,
│ Gathering     │     compliance needs, network requirements
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Approval &    │ ◄── Internal review, capacity check,
│ Scheduling    │     billing setup
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Infrastructure│ ◄── Terraform/Pulumi provisions servers,
│ Provisioning  │     networking, storage, load balancers
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Kubernetes    │ ◄── RKE2 cluster bootstrap via Rancher,
│ Bootstrap     │     control plane HA setup
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Platform      │ ◄── Cilium, Kyverno, Falco, cert-manager,
│ Components    │     monitoring agents, backup agents
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ GitOps        │ ◄── ArgoCD registration, tenant repo setup,
│ Registration  │     baseline config deployment
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Security      │ ◄── RBAC setup, SSO integration, network
│ Configuration │     policies, pod security policies
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Validation &  │ ◄── Automated tests, connectivity checks,
│ Testing       │     monitoring verification, backup test
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Handover      │ ◄── Customer access granted, documentation
│               │     provided, kickoff meeting
└──────────────┘

Timeline

| Step | Duration | Responsible |
|---|---|---|
| Requirements gathering | 1-2 days | Customer Success + Customer |
| Approval & scheduling | 1 day | CTO / Platform Lead |
| Infrastructure provisioning | 30-60 minutes (automated) | Platform Engineer |
| Kubernetes bootstrap | 15-30 minutes (automated) | Platform Engineer |
| Platform components | 15-30 minutes (GitOps) | Automated via ArgoCD |
| GitOps registration | 15 minutes | Platform Engineer |
| Security configuration | 30-60 minutes | Platform Engineer |
| Validation & testing | 30-60 minutes | Platform Engineer |
| Handover | 1-2 hours | Customer Success |
| Total (technical) | 2-4 hours | |
| Total (including coordination) | 3-5 business days | |

Automation Details

Infrastructure Provisioning (Terraform)

Input:  cluster_name, environment, size, region, node_count
Output: Servers provisioned, networking configured, DNS records created

Cluster Bootstrap (Rancher)

Input:  Infrastructure details, cluster template, RKE2 version
Output: Running Kubernetes cluster registered in Rancher

Platform Stack (ArgoCD)

Input:  Cluster registered in ArgoCD, tenant baseline repo
Output: All platform components deployed and healthy
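The first automation stage can be sketched as a small wrapper that assembles the Terraform command from the inputs listed above. This is a sketch only: the variable names mirror the documented inputs, but the actual module layout and variable names in the infrastructure repo may differ.

```shell
# Sketch: build the provisioning command from the documented inputs.
# Variable names (cluster_name, environment, size, region, node_count)
# come from the "Input" line above; the -var flags are illustrative.
build_provision_cmd() {
  cluster_name="$1"; environment="$2"; size="$3"; region="$4"; node_count="$5"
  echo "terraform apply" \
    "-var=cluster_name=${cluster_name}" \
    "-var=environment=${environment}" \
    "-var=size=${size}" \
    "-var=region=${region}" \
    "-var=node_count=${node_count}"
}
```

For example, `build_provision_cmd acme prod medium fsn1 5` prints the full `terraform apply` invocation for review before it is executed.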

2. Cluster Monitoring Process

What We Monitor

| Layer | Metrics | Tools |
|---|---|---|
| Infrastructure | CPU, memory, disk, network, node health | Prometheus + node_exporter |
| Kubernetes | Pod health, deployments, services, events | kube-state-metrics |
| Networking | Network policies, traffic flows, DNS | Cilium + Hubble |
| Security | Policy violations, runtime alerts | Kyverno + Falco |
| Applications | Custom metrics (if exposed) | Prometheus |
| Certificates | Expiration dates | cert-manager |
| Backups | Backup success/failure, last backup time | Velero |

Alert Routing

Alert Fires
    │
    ▼
Alertmanager
    │
    ├── P1 (Critical): PagerDuty → On-call engineer (immediate)
    │                   + Slack #incidents
    │                   + Customer notification (if customer-facing)
    │
    ├── P2 (High):     Slack #alerts → Acknowledged within 1h
    │                   + PagerDuty (business hours)
    │
    ├── P3 (Medium):   Slack #alerts → Next business day
    │                   + Ticket created automatically
    │
    └── P4 (Low):      Slack #monitoring → Weekly review
                        + Logged for trends
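The routing tree above reduces to a severity lookup. The sketch below summarizes each branch as a string; in practice these map to Alertmanager receiver and route configs, which are not shown here.

```shell
# Sketch of the alert routing rules above as a severity lookup.
# Channel names are taken from the diagram; real routing lives in
# the Alertmanager configuration.
route_alert() {
  case "$1" in
    P1) echo "PagerDuty + Slack #incidents" ;;
    P2) echo "Slack #alerts + PagerDuty (business hours)" ;;
    P3) echo "Slack #alerts + auto-ticket" ;;
    P4) echo "Slack #monitoring" ;;
    *)  echo "unknown severity" >&2; return 1 ;;
  esac
}
```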

Standard Alert Rules

| Alert | Severity | Condition |
|---|---|---|
| Node down | P1 | Node unreachable > 5 min |
| Control plane unhealthy | P1 | API server / etcd unavailable |
| Cluster unreachable | P1 | Management cluster cannot reach workload cluster |
| Pod crash looping | P2 | Pod restarted > 5 times in 10 min |
| High CPU usage | P2 | Node CPU > 90% for 15 min |
| High memory usage | P2 | Node memory > 90% for 15 min |
| Disk usage critical | P2 | Disk > 85% full |
| Certificate expiring | P2 | Certificate expires < 14 days |
| Backup failure | P2 | Backup failed or missed schedule |
| Kyverno policy violation | P3 | Blocked resource creation |
| Falco security alert | P2/P3 | Runtime security event (severity-dependent) |
| High pod restart rate | P3 | > 10 restarts/hour across namespace |
| PV usage high | P3 | PersistentVolume > 80% full |
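Each row of this table corresponds to a Prometheus alerting rule. As an illustration, the sketch below renders the "Node down" row as a PrometheusRule fragment; the expression, group name, and label keys are assumptions, since the actual rules live in the GitOps repository.

```shell
# Illustrative only: the "Node down" row (P1, unreachable > 5 min)
# expressed as a Prometheus alerting rule. Job label and rule group
# name are assumptions.
render_node_down_rule() {
  cat <<'EOF'
groups:
- name: node-health
  rules:
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 5m
    labels:
      severity: P1
    annotations:
      summary: "Node unreachable for more than 5 minutes"
EOF
}
```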

Customer Dashboards (Grafana)

Each customer gets read-only access to:


3. Cluster Upgrade Process

Upgrade Types

| Type | Frequency | Downtime | Process |
|---|---|---|---|
| Kubernetes minor version | Every 3-4 months | Zero (rolling) | Planned maintenance |
| Kubernetes patch version | Monthly | Zero (rolling) | Automated |
| Platform component updates | Monthly | Zero | GitOps |
| OS security patches | Weekly | Zero (rolling) | Automated |
| Emergency security patches | As needed | Minimal | Expedited |

Kubernetes Version Upgrade Process

1. New K8s version released
       │
       ▼
2. Internal testing (1-2 weeks)
   - Deploy on internal test cluster
   - Run compatibility tests
   - Validate platform components
       │
       ▼
3. Staging rollout
   - Upgrade customer staging/dev clusters first
   - Monitor for 1 week
       │
       ▼
4. Production rollout
   - Schedule maintenance window (agreed with customer)
   - Rolling upgrade: one node at a time
   - Cordon → Drain → Upgrade → Uncordon
   - Validate after each node
       │
       ▼
5. Post-upgrade validation
   - All pods healthy
   - All services accessible
   - Monitoring operational
   - Backup operational
       │
       ▼
6. Customer notification
   - Upgrade complete confirmation
   - Version change documented
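The per-node loop in step 4 can be sketched as a dry-run plan generator: it prints the command sequence for each node instead of executing it, so the plan can be reviewed in the maintenance window. The drain flags mirror those used in the runbooks later in this document; the actual RKE2 upgrade step is cluster-specific and left as a comment.

```shell
# Emits the rolling-upgrade command sequence, one node at a time,
# matching step 4 (Cordon -> Drain -> Upgrade -> Uncordon -> Validate).
# Dry-run sketch: prints commands rather than running them.
plan_rolling_upgrade() {
  for node in "$@"; do
    echo "kubectl cordon ${node}"
    echo "kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data"
    echo "# upgrade RKE2 on ${node} (method depends on cluster setup), then:"
    echo "kubectl uncordon ${node}"
    echo "kubectl get nodes ${node}   # validate before moving to the next node"
  done
}
```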

Maintenance Windows

| Support Tier | Maintenance Window |
|---|---|
| Essential | Tuesday-Thursday, 02:00-06:00 CET |
| Business | Coordinated with customer, 48h notice |
| Enterprise | Customer-defined window, 1 week notice |

4. Backup & Restore Process

Backup Schedule

| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full cluster backup (Velero) | Daily at 02:00 CET | 30 days | Hetzner StorageBox |
| etcd snapshot | Every 6 hours | 7 days | Local + remote |
| Persistent volume snapshots | Daily | 14 days | Hetzner Block Storage |
| Platform configuration (Git) | Every commit | Unlimited | Git history |
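Velero's `--ttl` flag takes a Go-style duration, so the retention column translates to hours. A minimal helper, with a hedged example of how it would feed a schedule definition (the schedule name is illustrative):

```shell
# Converts a retention value in days into the duration format Velero's
# --ttl flag accepts (e.g. 30 days -> 720h0m0s).
retention_to_ttl() {
  days="$1"
  echo "$(( days * 24 ))h0m0s"
}

# Illustrative usage (schedule name is an assumption):
#   velero schedule create daily-full \
#     --schedule="0 2 * * *" --ttl "$(retention_to_ttl 30)"
```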

Restore Process

Restore Request
       │
       ├── Partial restore (single namespace/app)
       │        │
       │        ▼
       │   Velero restore --include-namespaces <ns>
       │   Duration: 5-30 minutes
       │
       └── Full cluster restore
                │
                ▼
           1. Provision new infrastructure
           2. Bootstrap new cluster
           3. Restore from Velero backup
           4. Validate all components
           5. Update DNS / ingress
           Duration: 1-4 hours

Recovery Time Objectives

| Scenario | RTO | RPO |
|---|---|---|
| Single pod/deployment failure | < 5 min (auto-heal) | 0 |
| Single node failure | < 15 min (auto-replace) | 0 |
| Namespace restore | < 30 min | < 24 hours |
| Full cluster restore | < 4 hours | < 24 hours |
| Management cluster failure | < 2 hours | < 6 hours |

5. Node Replacement Process

Automatic (Self-Healing)

Node becomes unhealthy
       │
       ▼
Rancher detects unhealthy node (5 min)
       │
       ▼
Alert fired (P1)
       │
       ▼
On-call engineer validates
       │
       ▼
New node provisioned (automated)
       │
       ▼
Workloads rescheduled automatically
       │
       ▼
Old node cordoned and removed
       │
       ▼
Incident report created

Manual (Planned)

Maintenance scheduled
       │
       ▼
Cordon node (prevent new scheduling)
       │
       ▼
Drain node (evict workloads gracefully)
       │
       ▼
Perform maintenance / replace node
       │
       ▼
Uncordon node / join new node
       │
       ▼
Validate workloads rescheduled

6. Cluster Decommissioning Process

Trigger

Process

Decommission Request
       │
       ▼
1. Confirm with customer (written approval required)
       │
       ▼
2. Final backup (retained for 90 days)
       │
       ▼
3. Export customer data (provided to customer)
       │
       ▼
4. Remove from monitoring and alerting
       │
       ▼
5. Remove from ArgoCD and GitOps
       │
       ▼
6. Remove from Rancher
       │
       ▼
7. Destroy Kubernetes cluster
       │
       ▼
8. Destroy infrastructure (servers, storage, networking)
       │
       ▼
9. Archive tenant Git repositories
       │
       ▼
10. Update billing (stop invoicing)
       │
       ▼
11. Send decommission confirmation to customer
       │
       ▼
12. After 90 days: delete final backup

Data Retention

| Data Type | Retention After Decommission |
|---|---|
| Customer workload data | 0 (exported to customer, then deleted) |
| Backup data | 90 days |
| Monitoring/log data | 30 days |
| Billing records | 10 years (German tax law) |
| Contracts | 10 years (German commercial law) |
| Git repository (archived) | 1 year |

Customer Onboarding — End-to-End Process

Overview

Customer onboarding is the most critical process for retention. A smooth onboarding sets the foundation for a long-term relationship. This process is designed for non-technical decision-makers and their teams.


Onboarding Timeline

Week 1          Week 2          Week 3          Week 4
│               │               │               │
├─ Kickoff      ├─ Cluster      ├─ App          ├─ Go-Live
│  Meeting      │  Provisioned  │  Migration    │  + Handover
│               │               │               │
├─ Requirements ├─ Access       ├─ Training     ├─ Support
│  Finalized    │  Configured   │  Sessions     │  Transition
│               │               │               │
├─ Contract     ├─ GitOps       ├─ Dry Run      ├─ 30-Day
│  Signed       │  Setup        │  Deployment   │  Check-in

Total onboarding time: 2-4 weeks (depending on complexity)


Phase 1: Pre-Sales to Contract (Week 0)

Steps

| Step | Owner | Duration | Deliverable |
|---|---|---|---|
| Discovery call | Sales | 30 min | Customer needs understood |
| Technical assessment | CTO / Sr. Engineer | 1-2 hours | Feasibility confirmed |
| Architecture proposal | CTO | 1-2 days | Proposed setup document |
| Pricing proposal | Sales | 1 day | Commercial offer |
| Contract negotiation | Sales + Legal | 3-10 days | Signed contract (AVV + MSA) |

Required Documents (Germany)

| Document | Purpose |
|---|---|
| Master Service Agreement (MSA) | Main contract |
| Auftragsverarbeitungsvertrag (AVV) | GDPR data processing agreement (mandatory) |
| Service Level Agreement (SLA) | Uptime and support commitments |
| Technical Specification | Cluster architecture, sizing |
| Pricing Schedule | Detailed cost breakdown |

Phase 2: Kickoff (Week 1)

Kickoff Meeting Agenda (90 minutes)

  1. Introductions (10 min)
     - Platform team introduction
     - Customer team introduction
     - Roles and responsibilities

  2. Platform Overview (20 min)
     - How the platform works (non-technical overview)
     - What we manage vs. what the customer manages
     - Rancher UI walkthrough

  3. Requirements Review (30 min)
     - Cluster architecture confirmation
     - Application inventory
     - Network requirements
     - Compliance requirements
     - Integration requirements (CI/CD, monitoring)

  4. Access Setup (15 min)
     - SSO configuration (customer's IdP)
     - RBAC role mapping
     - Who gets what access

  5. Timeline & Next Steps (15 min)
     - Milestone dates
     - Communication channels (Slack, email)
     - Escalation contacts

Information We Need from Customer

| Item | Description | Urgency |
|---|---|---|
| Application list | What apps will run on the cluster | Week 1 |
| Container readiness | Are apps already containerized? | Week 1 |
| DNS domains | Customer domains for ingress | Week 1 |
| IdP details | OIDC/SAML configuration for SSO | Week 1 |
| Network requirements | IP ranges, firewall rules, VPN needs | Week 1 |
| Compliance requirements | Specific regulations (ISO, BAFIN, etc.) | Week 1 |
| Team contacts | Admin contacts, on-call contacts | Week 1 |
| CI/CD setup | Current CI/CD tools and workflows | Week 2 |

Phase 3: Cluster Provisioning (Week 2)

Actions

| Action | Responsible | Duration |
|---|---|---|
| Provision infrastructure | Platform Engineer | 1 hour |
| Bootstrap Kubernetes cluster | Platform Engineer | 30 min |
| Deploy platform components | Automated (ArgoCD) | 30 min |
| Configure SSO integration | Platform Engineer | 1-2 hours |
| Configure RBAC | Platform Engineer | 30 min |
| Setup GitOps repositories | Platform Engineer | 1 hour |
| Configure monitoring dashboards | Platform Engineer | 1 hour |
| Configure backup schedule | Platform Engineer | 30 min |
| Validation testing | Platform Engineer | 1 hour |
| Total | | ~1 day |

Customer Access Delivery

After provisioning, the customer receives:

| Item | How |
|---|---|
| Rancher UI access | SSO login URL + role assignment |
| Grafana dashboards | SSO login URL (read-only) |
| kubectl access | Rancher-provided kubeconfig |
| GitOps repository | GitHub/GitLab repo access |
| Support channel | Slack channel created |
| Documentation portal | Access to customer docs |

Phase 4: Application Migration (Week 3)

Migration Support Options

| Option | Description | Our Role |
|---|---|---|
| Self-service | Customer deploys their own apps | We provide docs + support |
| Guided migration | We help containerize and deploy | Hands-on assistance |
| Full migration | We containerize, deploy, validate | Full professional service |

Migration Steps (Guided)

  1. Application assessment: Review existing apps, dependencies
  2. Containerization: Create Dockerfiles, optimize images
  3. Kubernetes manifests: Create Deployments, Services, Ingress
  4. GitOps integration: Set up ArgoCD application definitions
  5. Staging deployment: Deploy to non-prod cluster first
  6. Testing: Validate functionality, performance, connectivity
  7. Production deployment: Deploy to production cluster
  8. DNS cutover: Point production DNS to new ingress
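Step 4 (GitOps integration) amounts to one ArgoCD `Application` resource per app. The sketch below renders a minimal manifest; the repo URL, path layout (`apps/<name>`), project, and namespace are placeholders, not the actual tenant conventions.

```shell
# Renders a minimal ArgoCD Application manifest for one app.
# Repo URL, path, and project are illustrative placeholders.
render_argocd_app() {
  app="$1"; repo="$2"; ns="$3"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${app}
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ${repo}
    targetRevision: main
    path: apps/${app}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${ns}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}
```

The `syncPolicy.automated` block makes ArgoCD reconcile the app continuously, which is what keeps staging and production in line with Git after the cutover in step 8.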

Phase 5: Go-Live & Handover (Week 4)

Go-Live Checklist

Training Sessions

| Session | Duration | Audience | Content |
|---|---|---|---|
| Platform Overview | 2 hours | All team members | Rancher UI, dashboards, basic operations |
| Deployment Workflow | 2 hours | DevOps / Developers | GitOps workflow, ArgoCD, CI/CD integration |
| Monitoring & Alerting | 1 hour | DevOps / Ops | Grafana dashboards, alert interpretation |
| Incident Reporting | 1 hour | All team members | How to report issues, severity levels |

30-Day Check-in

Scheduled 30 days after go-live:


Onboarding Success Metrics

| Metric | Target |
|---|---|
| Time to first cluster | < 5 business days |
| Time to go-live | < 4 weeks |
| Customer satisfaction (NPS) at 30 days | > 8/10 |
| Support tickets in first 30 days | < 5 |
| Zero P1 incidents during onboarding | 100% |

Support, Escalation & Incident Management

Support Model Overview

Support Tiers

| Tier | Availability | Channels | Response SLA | Included With |
|---|---|---|---|---|
| Essential | Mon-Fri 9:00-18:00 CET | Email, Ticket | P1: 4h, P2: 8h, P3: 24h, P4: 48h | All plans |
| Business | Mon-Fri 7:00-22:00 CET | Email, Ticket, Slack | P1: 1h, P2: 4h, P3: 8h, P4: 24h | €800/mo add-on |
| Enterprise | 24/7/365 | Email, Ticket, Slack, Phone | P1: 15min, P2: 1h, P3: 4h, P4: 8h | €2,500/mo add-on |

What's Supported vs. Not Supported

| Supported (In Scope) | Not Supported (Out of Scope) |
|---|---|
| Kubernetes cluster operations | Application code debugging |
| Platform component issues | Customer application logic |
| Node failures and replacements | Database query optimization |
| Network policy configuration | Custom application performance tuning |
| Security policy management | Third-party software support |
| Backup and restore operations | CI/CD pipeline development |
| Monitoring and alerting setup | Application architecture consulting* |
| Kubernetes version upgrades | Custom development* |
| SSL certificate management | Training beyond onboarding* |

*Available as Professional Services at additional cost


Severity Levels

P1 — Critical

| Attribute | Value |
|---|---|
| Definition | Production cluster down, data loss risk, security breach |
| Business Impact | Customer's production services unavailable |
| Examples | Control plane failure, all nodes down, data corruption, active security incident |
| Response Time | Essential: 4h / Business: 1h / Enterprise: 15min |
| Resolution Target | 4 hours |
| Communication | Every 30 minutes until resolved |
| Escalation | Automatic to CTO after 2 hours |

P2 — High

| Attribute | Value |
|---|---|
| Definition | Significant degradation, partial outage, failed backups |
| Business Impact | Some services impacted, workaround may exist |
| Examples | Single node failure, high error rates, backup failure, cert expiring < 48h |
| Response Time | Essential: 8h / Business: 4h / Enterprise: 1h |
| Resolution Target | 8 hours |
| Communication | Every 2 hours until resolved |
| Escalation | Automatic to Platform Lead after 4 hours |

P3 — Medium

| Attribute | Value |
|---|---|
| Definition | Non-critical issue, minor impact |
| Business Impact | Minor inconvenience, no production impact |
| Examples | Dashboard issue, non-critical alert, minor config change needed |
| Response Time | Essential: 24h / Business: 8h / Enterprise: 4h |
| Resolution Target | 3 business days |
| Communication | Daily update |
| Escalation | Manual after 2 business days |

P4 — Low

| Attribute | Value |
|---|---|
| Definition | Question, feature request, minor improvement |
| Business Impact | None |
| Examples | Documentation question, dashboard customization, feature request |
| Response Time | Essential: 48h / Business: 24h / Enterprise: 8h |
| Resolution Target | 5 business days |
| Communication | Upon completion |
| Escalation | None (scheduled backlog) |
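The response times across the four severity sections form a tier-by-severity matrix. The lookup below collects those values in one place (the function name is illustrative; the values are copied from the sections above).

```shell
# Response-time SLA lookup; values copied from the P1-P4 sections.
response_sla() {
  tier="$1"; sev="$2"
  case "${tier}:${sev}" in
    Essential:P1)  echo "4h" ;;    Essential:P2)  echo "8h" ;;
    Essential:P3)  echo "24h" ;;   Essential:P4)  echo "48h" ;;
    Business:P1)   echo "1h" ;;    Business:P2)   echo "4h" ;;
    Business:P3)   echo "8h" ;;    Business:P4)   echo "24h" ;;
    Enterprise:P1) echo "15min" ;; Enterprise:P2) echo "1h" ;;
    Enterprise:P3) echo "4h" ;;    Enterprise:P4) echo "8h" ;;
    *) return 1 ;;
  esac
}
```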

Incident Management Process

Incident Lifecycle

Detection
    │
    ├── Automated (monitoring alert)
    │   └── PagerDuty pages on-call engineer
    │
    └── Customer-reported
        └── Ticket/Slack/Phone → Triaged by support
    │
    ▼
Triage (5 minutes)
    │
    ├── Assign severity (P1-P4)
    ├── Assign owner (on-call or support engineer)
    └── Create incident ticket
    │
    ▼
Investigation (varies by severity)
    │
    ├── Check monitoring dashboards
    ├── Review logs (Loki/Grafana)
    ├── Check recent changes (ArgoCD/Git)
    ├── Check infrastructure health (Rancher)
    └── Identify root cause
    │
    ▼
Resolution
    │
    ├── Apply fix (manual or via GitOps)
    ├── Validate fix
    ├── Monitor for recurrence
    └── Confirm with customer
    │
    ▼
Post-Incident
    │
    ├── Update ticket with resolution
    ├── Postmortem (P1/P2 only)
    ├── Action items created
    └── Customer communication sent

Incident Communication Templates

P1 Initial Notification (to customer)

Subject: [P1] Incident — {Cluster Name} — {Brief Description}

We are aware of an issue affecting your {cluster/service}.

Impact: {Description of impact}
Status: Investigating
Next update: In 30 minutes

Our team is actively working on resolution.

P1 Update

Subject: [P1] Update — {Cluster Name} — {Brief Description}

Status: {Investigating / Identified / Mitigated / Resolved}
Update: {What we've done since last update}
Next steps: {What we're doing next}
Next update: In 30 minutes

P1 Resolution

Subject: [P1] Resolved — {Cluster Name} — {Brief Description}

The incident has been resolved.

Root cause: {Brief explanation}
Resolution: {What was done}
Duration: {Start time — End time}

A detailed postmortem will be shared within 3 business days.

Escalation Matrix

Automatic Escalation (Time-Based)

| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| P1 | On-call engineer (0 min) | Platform Lead (30 min) | CTO (2h) | CEO (4h) |
| P2 | Support engineer (0 min) | Sr. Platform Eng (2h) | Platform Lead (4h) | CTO (8h) |
| P3 | Support engineer (0 min) | Sr. Platform Eng (1 day) | Platform Lead (2 days) | |
| P4 | Support engineer (0 min) | | | |

Manual Escalation (Customer-Initiated)

Customers can request escalation at any time through:

Escalation Responsibilities

| Level | Role | Responsibility |
|---|---|---|
| L1 | On-call / Support Engineer | First response, initial diagnosis, known-issue resolution |
| L2 | Senior Platform Engineer | Complex troubleshooting, infrastructure issues |
| L3 | Platform Lead / CTO | Architecture decisions, emergency changes, vendor escalation |
| L4 | CEO | Customer executive communication, business decisions |

On-Call Process

On-Call Rotation

| Parameter | Value |
|---|---|
| Rotation length | 1 week (Monday 09:00 to Monday 09:00) |
| Team size for rotation | Minimum 3 engineers |
| Handoff process | 15-min sync at rotation change |
| Primary + Secondary | Always 2 engineers on-call |
| Tool | PagerDuty |

On-Call Expectations

| Expectation | Requirement |
|---|---|
| Acknowledgement time | < 5 minutes (P1), < 15 minutes (P2) |
| Availability | Reachable by phone at all times |
| Response capability | Laptop + internet access within 15 minutes |
| Escalation | If unable to resolve in 30 min, escalate to L2 |
| Documentation | Log all actions in incident ticket |

On-Call Compensation (Germany)

| Item | Compensation |
|---|---|
| Weekday on-call standby | €50/night |
| Weekend on-call standby | €100/day |
| Public holiday standby | €150/day |
| Actual incident work (outside business hours) | €75/hour |
| Rest time after night incident (> 2h work) | Next morning off |
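The standby rates above are flat per-night and per-day amounts, so pay for a rotation week is simple arithmetic. A sketch (function name is illustrative; rates copied from the table):

```shell
# Standby pay for one rotation, in EUR, using the rates in the table:
# €50/weekday night, €100/weekend day, €150/holiday, €75/incident hour.
standby_pay() {
  weekday_nights="$1"; weekend_days="$2"; holidays="$3"; incident_hours="$4"
  echo $(( weekday_nights * 50 + weekend_days * 100 + holidays * 150 + incident_hours * 75 ))
}
```

For example, a normal week (5 weekday nights, 2 weekend days, no incidents) comes to €450 of standby pay.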

Postmortem Process

When Required

Postmortem Template

# Postmortem: {Incident Title}

## Summary
- Date: {YYYY-MM-DD}
- Duration: {X hours Y minutes}
- Severity: {P1/P2}
- Affected customers: {List}
- Impact: {Description}

## Timeline (CET)
- HH:MM — {Event}
- HH:MM — {Event}
- ...

## Root Cause
{Detailed technical explanation}

## Resolution
{What was done to resolve}

## What Went Well
- {Item}

## What Went Wrong
- {Item}

## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| {Action} | {Person} | {Date} | Open |

## Lessons Learned
{Key takeaways}

Postmortem Timeline

| Step | Deadline |
|---|---|
| Draft postmortem | 24 hours after resolution |
| Internal review | 48 hours after resolution |
| Share with customer | 3 business days after resolution |
| Action items tracked | Ongoing, reviewed weekly |

Support Metrics & KPIs

| Metric | Target | Measurement |
|---|---|---|
| First response time (P1) | < SLA | PagerDuty / Ticket system |
| Mean time to acknowledge (MTTA) | < 10 min (P1) | PagerDuty |
| Mean time to resolve (MTTR) | P1: < 4h, P2: < 8h | Ticket system |
| SLA compliance | > 99.5% | Monthly report |
| Customer satisfaction (CSAT) | > 4.5/5 | Post-ticket survey |
| Incidents per customer per month | < 1 | Monthly report |
| Postmortems completed on time | 100% | Tracked in tickets |
| On-call alert noise | < 5 alerts/week per cluster | PagerDuty analytics |

Operational Runbooks

Runbook Index

| ID | Runbook | Severity | Trigger |
|---|---|---|---|
| RB-001 | Node Failure | P1/P2 | Node unreachable alert |
| RB-002 | Control Plane Failure | P1 | API server / etcd alert |
| RB-003 | Cluster Unreachable | P1 | Management cluster cannot reach workload cluster |
| RB-004 | Certificate Expiration | P2 | cert-manager alert |
| RB-005 | Backup Failure | P2 | Velero backup failure alert |
| RB-006 | Storage Failure | P1/P2 | PV/disk alert |
| RB-007 | Network Failure | P1/P2 | CNI / connectivity alert |
| RB-008 | Cluster Upgrade Failure | P2 | Upgrade process error |
| RB-009 | High Resource Usage | P2/P3 | CPU/Memory/Disk threshold alert |
| RB-010 | Security Incident | P1 | Falco / Kyverno alert |
| RB-011 | Pod Crash Loop | P2/P3 | Pod restart alert |
| RB-012 | DNS Resolution Failure | P2 | DNS health check failure |
| RB-013 | Ingress Failure | P1/P2 | Ingress controller down |
| RB-014 | etcd Restore | P1 | Data corruption / etcd failure |
| RB-015 | Full Cluster Restore | P1 | Complete cluster loss |

RB-001: Node Failure

Alert

KubernetesNodeNotReady — Node has been in NotReady state for > 5 minutes

Diagnosis Steps

# 1. Check node status from management cluster
kubectl get nodes -o wide

# 2. Check node conditions
kubectl describe node <node-name>

# 3. Check if node is reachable via SSH
ssh <node-ip> "uptime"

# 4. Check system services
ssh <node-ip> "systemctl status rke2-agent" # or rke2-server for control plane

# 5. Check system resources
ssh <node-ip> "df -h && free -m && uptime"

# 6. Check Rancher cluster status
# Navigate to Rancher UI → Cluster → Nodes

Resolution Steps

If node is reachable but K8s service is down:

# Restart RKE2 agent
ssh <node-ip> "systemctl restart rke2-agent"

# Wait 2-3 minutes, verify
kubectl get nodes

If node is unreachable (hardware/network failure):

# 1. Cordon the node to prevent scheduling
kubectl cordon <node-name>

# 2. Drain workloads (if possible)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s

# 3. Provision replacement node via Terraform
cd infrastructure/terraform/<customer>
terraform apply -target=module.worker_node_<n>

# 4. Join new node to cluster (automated via Rancher)

# 5. Verify new node is Ready
kubectl get nodes

# 6. Remove old node
kubectl delete node <old-node-name>

Customer Communication


RB-002: Control Plane Failure

Alert

KubernetesAPIServerDown or EtcdClusterUnhealthy

Diagnosis Steps

# 1. Check if API server responds
kubectl cluster-info
curl -k https://<api-server-ip>:6443/healthz

# 2. Check etcd health
ssh <control-plane-node> "ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health"

# 3. Check RKE2 server logs
ssh <control-plane-node> "journalctl -u rke2-server -n 100"

# 4. Check all control plane components
ssh <control-plane-node> "crictl ps | grep -E 'kube-api|etcd|controller|scheduler'"

Resolution Steps

If single control plane node failure (HA cluster with 3 CP nodes):

If etcd quorum lost (2/3 nodes down):

# 1. This is a P1 — cluster is down
# 2. Attempt to recover at least one more etcd member
ssh <surviving-node> "systemctl restart rke2-server"

# 3. If recovery fails, restore from etcd snapshot
# See RB-014: etcd Restore

If all control plane nodes down:


RB-003: Cluster Unreachable

Alert

ManagedClusterUnreachable — Rancher cannot communicate with downstream cluster

Diagnosis Steps

# 1. Check Rancher UI — is cluster showing disconnected?

# 2. Check if cluster nodes are reachable from management network
ping <cluster-node-ip>
ssh <cluster-node-ip> "kubectl get nodes"

# 3. Check Rancher agent on downstream cluster
ssh <cluster-node> "kubectl -n cattle-system get pods"
ssh <cluster-node> "kubectl -n cattle-system logs deployment/cattle-cluster-agent"

# 4. Check network connectivity (firewall, VPN, routing)
ssh <cluster-node> "curl -k https://<rancher-url>/healthz"

# 5. Check if it's a Rancher issue (all clusters affected?)
# Check Rancher UI for other cluster statuses

Resolution Steps

Network issue:

# Check and fix firewall rules
# Verify VPN tunnel is up (if applicable)
# Check load balancer health

Rancher agent issue:

# Restart cattle-cluster-agent
ssh <cluster-node> "kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent"

Rancher server issue:

# Check Rancher pods on management cluster
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deployment/rancher

RB-004: Certificate Expiration

Alert

CertificateExpiringSoon — Certificate expires in < 14 days

Diagnosis Steps

# 1. Check which certificate is expiring
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>

# 2. Check cert-manager logs
kubectl -n cert-manager logs deployment/cert-manager

# 3. Check certificate details
kubectl get secret <cert-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -dates

Resolution Steps

cert-manager renewal failure:

# 1. Check cert-manager challenges
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n <namespace>

# 2. Common issues:
#    - DNS challenge: Check DNS provider credentials
#    - HTTP challenge: Check ingress is accessible
#    - Rate limits: Let's Encrypt rate limit (wait or use staging)

# 3. Delete and recreate certificate if needed
kubectl delete certificate <cert-name> -n <namespace>
# ArgoCD will recreate it from Git

RKE2 internal certificates:

# RKE2 auto-rotates internal certs, but if needed:
ssh <control-plane-node> "rke2 certificate rotate"
ssh <control-plane-node> "systemctl restart rke2-server"

RB-005: Backup Failure

Alert

VeleroBackupFailed — Scheduled backup did not complete

Diagnosis Steps

# 1. Check Velero backup status
kubectl -n velero get backups
kubectl -n velero describe backup <backup-name>

# 2. Check Velero logs
kubectl -n velero logs deployment/velero

# 3. Check backup storage location
kubectl -n velero get backupstoragelocation
kubectl -n velero describe backupstoragelocation default

# 4. Check storage connectivity
# Verify StorageBox credentials and connectivity

Resolution Steps

Storage connectivity issue:

# Check and update storage credentials
kubectl -n velero get secret cloud-credentials -o yaml

# Verify storage endpoint is reachable
kubectl -n velero exec deployment/velero -- \
  wget -q --spider <storage-endpoint>

Volume snapshot failure:

# Check volume snapshot class
kubectl get volumesnapshotclass

# Check if PV supports snapshots
kubectl get pv <pv-name> -o yaml

# Manual backup trigger
velero backup create manual-backup-$(date +%Y%m%d) \
  --include-namespaces <namespace>

RB-009: High Resource Usage

Alert

NodeCPUHigh (> 90%), NodeMemoryHigh (> 90%), NodeDiskHigh (> 85%)

Diagnosis Steps

# 1. Identify which node is affected
kubectl top nodes

# 2. Find resource-hungry pods
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory

# 3. Check for resource limits
kubectl get pods -A -o json | jq '.items[] |
  select(.spec.containers[].resources.limits == null) |
  .metadata.namespace + "/" + .metadata.name'

# 4. Check for eviction pressure
kubectl describe node <node-name> | grep -A5 Conditions

Resolution Steps

Short-term (immediate relief):

# Identify and scale down non-critical workloads
kubectl -n <namespace> scale deployment <name> --replicas=1

# Evict pods from overloaded node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Long-term:

# Add worker node
cd infrastructure/terraform/<customer>
# Increase node count and apply

# Or resize existing nodes (requires node replacement)
# Update Terraform with larger instance type

RB-010: Security Incident

Alert

FalcoSecurityAlert — Runtime security event detected

IMMEDIATE Actions (First 5 Minutes)

1. DO NOT delete evidence
2. Assess scope — which cluster, namespace, pod
3. Determine if active attack or false positive
4. If active attack:
   a. Isolate affected pod/namespace (network policy)
   b. Do NOT delete the pod (preserve forensic data)
   c. Escalate to Security Engineer immediately
   d. Notify CTO

Diagnosis Steps

# 1. Check Falco alerts
kubectl -n falco logs daemonset/falco | grep -i "Warning\|Error\|Critical"

# 2. Check what triggered the alert
# Falco alert will contain:
#   - Rule name
#   - Output fields (container, process, file, network)
#   - Priority

# 3. Inspect the suspicious pod
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name>

# 4. Check Kyverno violations
kubectl get policyreport -A

# 5. Check network flows (Hubble/Cilium)
hubble observe --namespace <namespace> --pod <pod-name>

Containment Steps

# 1. Isolate the namespace with network policy
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-namespace
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

# 2. Capture pod state for forensics
kubectl -n <namespace> get pod <pod-name> -o yaml > forensics/pod-state.yaml
kubectl -n <namespace> logs <pod-name> > forensics/pod-logs.txt

# 3. If confirmed malicious — kill the pod
kubectl -n <namespace> delete pod <pod-name>

Customer Communication


RB-014: etcd Restore

When Needed

Restore from Snapshot

# 1. Stop RKE2 on all control plane nodes
for node in cp1 cp2 cp3; do
  ssh $node "systemctl stop rke2-server"
done

# 2. Find latest snapshot
ssh cp1 "ls -la /var/lib/rancher/rke2/server/db/snapshots/"

# 3. Restore on first control plane node
ssh cp1 "rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>"

# 4. Start RKE2 on first node
ssh cp1 "systemctl start rke2-server"

# 5. Wait for first node to be ready
ssh cp1 "kubectl get nodes"

# 6. Remove old etcd data on other nodes
for node in cp2 cp3; do
  ssh $node "rm -rf /var/lib/rancher/rke2/server/db/etcd"
done

# 7. Restart RKE2 on other nodes (they will rejoin)
for node in cp2 cp3; do
  ssh $node "systemctl start rke2-server"
done

# 8. Verify cluster health
kubectl get nodes
kubectl get pods -A

RB-015: Full Cluster Restore

When Needed

Process

1. Assess damage — what is lost, what is recoverable
       │
       ▼
2. Provision new infrastructure (Terraform)
   Duration: 30-60 minutes
       │
       ▼
3. Bootstrap new RKE2 cluster (Rancher)
   Duration: 15-30 minutes
       │
       ▼
4. Deploy platform components (ArgoCD)
   Duration: 15-30 minutes
       │
       ▼
5. Restore from Velero backup
   velero restore create --from-backup <latest-backup>
   Duration: 30-120 minutes (depends on data size)
       │
       ▼
6. Validate all workloads
   - Check all deployments
   - Check all services
   - Check persistent data
   Duration: 30-60 minutes
       │
       ▼
7. Update DNS and ingress
   Duration: 5-15 minutes (+ DNS propagation)
       │
       ▼
8. Notify customer — service restored
       │
       ▼
9. Full postmortem within 3 business days

Total Expected Recovery Time: 2-4 hours
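
The workload validation in step 6 can be scripted rather than eyeballed. A minimal sketch that fails if any deployment is short of its desired replica count (checking services and persistent data still needs tenant-specific logic):

```shell
#!/usr/bin/env bash
# Flag every deployment whose ready replica count is below the desired count.
set -euo pipefail

failed=0
while read -r ns name desired ready; do
  if [ "${ready:-0}" != "$desired" ]; then
    echo "NOT READY: $ns/$name ($ready/$desired)"
    failed=1
  fi
done < <(kubectl get deployments -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name} {.spec.replicas} {.status.readyReplicas}{"\n"}{end}')

exit $failed
```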


Runbook Maintenance

| Activity | Frequency | Owner |
|---|---|---|
| Review all runbooks | Quarterly | Platform Lead |
| Update after incidents | After every P1/P2 | Incident owner |
| Test DR runbooks (RB-014, RB-015) | Quarterly | SRE team |
| Add new runbooks | As new failure modes discovered | Platform team |
| Customer-specific runbooks | At onboarding + annually | Customer Success |

Service Level Agreement (SLA) & Support Tiers

Platform Availability SLA

Uptime Commitment

| Component | Essential | Business | Enterprise |
|---|---|---|---|
| Management Platform (Rancher, ArgoCD) | 99.5% | 99.9% | 99.95% |
| Customer Cluster Control Plane | 99.5% | 99.9% | 99.95% |
| Monitoring & Alerting | 99.0% | 99.5% | 99.9% |
| Backup Operations | 99.0% | 99.5% | 99.9% |
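
Each uptime percentage translates into a concrete monthly downtime budget. A quick sketch of the arithmetic for a 30-day month (43,200 minutes); the function name is illustrative:

```shell
# Allowed downtime per 30-day month (43200 minutes) for a given uptime target
downtime_minutes() {
  awk -v u="$1" 'BEGIN { printf "%.1f\n", 43200 * (100 - u) / 100 }'
}

downtime_minutes 99.5    # Essential:  216.0 minutes (3.6 hours)
downtime_minutes 99.9    # Business:    43.2 minutes
downtime_minutes 99.95   # Enterprise:  21.6 minutes
```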

What Uptime Means

Exclusions

The following are NOT counted as downtime:


SLA Credits

If we miss the uptime SLA, customers receive service credits:

| Uptime Achieved | Credit (% of Monthly Fee) |
|---|---|
| 99.0% - 99.49% | 10% |
| 98.0% - 98.99% | 25% |
| 95.0% - 97.99% | 50% |
| < 95.0% | 100% |

Credit Process

  1. Customer submits credit request within 30 days of incident
  2. We validate against monitoring data
  3. Credit applied to next invoice
  4. Maximum credit: 100% of one month's platform management fee
  5. Credits do not apply to infrastructure pass-through costs
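
The credit tiers are mechanical to apply once uptime is validated. A sketch of the lookup (thresholds mirror the table above; the function name is illustrative):

```shell
# Map achieved monthly uptime (%) to the service credit (% of the monthly fee)
credit_percent() {
  awk -v u="$1" 'BEGIN {
    if      (u >= 99.5) print 0
    else if (u >= 99.0) print 10
    else if (u >= 98.0) print 25
    else if (u >= 95.0) print 50
    else                print 100
  }'
}

credit_percent 99.2   # -> 10
credit_percent 96.5   # -> 50
```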

Support Tier Comparison

Essential Support (Included)

| Feature | Detail |
|---|---|
| Price | Included with all plans |
| Availability | Monday-Friday, 09:00-18:00 CET |
| Channels | Email, Ticket portal |
| P1 Response | 4 hours |
| P2 Response | 8 hours |
| P3 Response | 24 hours |
| P4 Response | 48 hours |
| Named contacts | 2 |
| Monthly review | No |
| Dedicated Slack | No |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.5% |

Business Support (€800/month)

| Feature | Detail |
|---|---|
| Price | €800/month |
| Availability | Monday-Friday, 07:00-22:00 CET |
| Channels | Email, Ticket portal, Slack |
| P1 Response | 1 hour |
| P2 Response | 4 hours |
| P3 Response | 8 hours |
| P4 Response | 24 hours |
| Named contacts | 5 |
| Monthly review | Yes (30 min) |
| Dedicated Slack | Yes (shared channel) |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.9% |

Enterprise Support (€2,500/month)

| Feature | Detail |
|---|---|
| Price | €2,500/month |
| Availability | 24/7/365 |
| Channels | Email, Ticket portal, Slack, Phone |
| P1 Response | 15 minutes |
| P2 Response | 1 hour |
| P3 Response | 4 hours |
| P4 Response | 8 hours |
| Named contacts | Unlimited |
| Monthly review | Yes (weekly 30 min sync) |
| Dedicated Slack | Yes (with SLA on responses) |
| Phone support | Yes (dedicated number) |
| Dedicated engineer | Yes (named contact) |
| Uptime SLA | 99.95% |
| Custom maintenance windows | Yes |
| Priority upgrade scheduling | Yes |
| Quarterly business review | Yes |

Maintenance Windows

Scheduled Maintenance

| Support Tier | Notice Period | Window |
|---|---|---|
| Essential | 48 hours | Tue-Thu, 02:00-06:00 CET |
| Business | 5 business days | Agreed with customer |
| Enterprise | 10 business days | Customer-defined |

Emergency Maintenance

| Condition | Notice | Approval |
|---|---|---|
| Critical security patch | Best effort (min 2 hours) | No customer approval needed |
| Data integrity risk | Best effort (min 1 hour) | No customer approval needed |
| Non-critical but urgent | 24 hours | Notification only |

Operational Commitments

Backup Guarantees

| Commitment | Target |
|---|---|
| Daily backup execution | 99.5% success rate |
| Backup data retention | 30 days minimum |
| Restore test (upon request) | Within 2 business days |
| Full cluster restore | < 4 hours RTO |
| Data loss maximum | < 24 hours RPO |
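
The daily-backup and 30-day-retention commitments map directly onto a Velero schedule. A sketch, assuming the Velero CLI is configured against the cluster (the schedule name and start time are illustrative):

```shell
# Daily backup at 02:00, retained for 30 days (720h TTL)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h

# Verify recent runs succeeded
velero backup get
```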

Security Commitments

| Commitment | Target |
|---|---|
| Security patch deployment | Critical: < 24 hours; High: < 72 hours |
| Vulnerability scanning | Weekly |
| Security incident notification | < 1 hour after detection |
| GDPR breach notification | < 72 hours (as required by law) |
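
The weekly vulnerability scan can be as simple as iterating over every image currently running in the cluster. A minimal sketch, assuming Trivy is the scanner (the tool choice is an assumption; the document does not name one):

```shell
# Collect the unique set of images running in the cluster, then scan each
kubectl get pods -A \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | sort -u \
  | while read -r img; do
      trivy image --severity HIGH,CRITICAL "$img"
    done
```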

Platform Update Commitments

CommitmentTarget
Kubernetes version supportN-2 minor versions
Kubernetes upgrade after GA releaseWithin 60 days
Platform component updatesMonthly
OS security patchesWeekly

Reporting

Monthly Platform Report (All Tiers)

Delivered by 5th business day of each month:

Monthly Review Meeting (Business + Enterprise)

Agenda:

  1. Uptime and incident review
  2. Capacity and performance review
  3. Security posture review
  4. Upcoming maintenance and upgrades
  5. Customer requests and roadmap

Quarterly Business Review (Enterprise Only)

Agenda:

  1. All monthly review items
  2. Strategic infrastructure planning
  3. Cost optimization recommendations
  4. Technology roadmap alignment
  5. Contract and SLA review