Cluster Lifecycle — End-to-End Process
Overview
This document describes every step from cluster creation to decommissioning. Each process is designed so that non-technical stakeholders can understand what happens, while engineers have clear runbooks.
1. Cluster Creation Process
Trigger
- New customer onboarded
- Existing customer requests additional cluster
- Internal team needs new environment
Process Flow
Customer Request
│
▼
┌──────────────┐
│ Requirements │ ◄── Cluster size, region, environment type,
│ Gathering │ compliance needs, network requirements
└──────┬───────┘
│
▼
┌──────────────┐
│ Approval & │ ◄── Internal review, capacity check,
│ Scheduling │ billing setup
└──────┬───────┘
│
▼
┌──────────────┐
│ Infrastructure│ ◄── Terraform/Pulumi provisions servers,
│ Provisioning │ networking, storage, load balancers
└──────┬───────┘
│
▼
┌──────────────┐
│ Kubernetes │ ◄── RKE2 cluster bootstrap via Rancher,
│ Bootstrap │ control plane HA setup
└──────┬───────┘
│
▼
┌──────────────┐
│ Platform │ ◄── Cilium, Kyverno, Falco, cert-manager,
│ Components │ monitoring agents, backup agents
└──────┬───────┘
│
▼
┌──────────────┐
│ GitOps │ ◄── ArgoCD registration, tenant repo setup,
│ Registration │ baseline config deployment
└──────┬───────┘
│
▼
┌──────────────┐
│ Security │ ◄── RBAC setup, SSO integration, network
│ Configuration │ policies, pod security policies
└──────┬───────┘
│
▼
┌──────────────┐
│ Validation & │ ◄── Automated tests, connectivity checks,
│ Testing │ monitoring verification, backup test
└──────┬───────┘
│
▼
┌──────────────┐
│ Handover │ ◄── Customer access granted, documentation
│ │ provided, kickoff meeting
└──────────────┘
Timeline
| Step | Duration | Responsible |
|---|---|---|
| Requirements gathering | 1-2 days | Customer Success + Customer |
| Approval & scheduling | 1 day | CTO / Platform Lead |
| Infrastructure provisioning | 30-60 minutes (automated) | Platform Engineer |
| Kubernetes bootstrap | 15-30 minutes (automated) | Platform Engineer |
| Platform components | 15-30 minutes (GitOps) | Automated via ArgoCD |
| GitOps registration | 15 minutes | Platform Engineer |
| Security configuration | 30-60 minutes | Platform Engineer |
| Validation & testing | 30-60 minutes | Platform Engineer |
| Handover | 1-2 hours | Customer Success |
| Total (technical) | 2-4 hours | |
| Total (including coordination) | 3-5 business days | |
Automation Details
Infrastructure Provisioning (Terraform)
Input: cluster_name, environment, size, region, node_count
Output: Servers provisioned, networking configured, DNS records created
Cluster Bootstrap (Rancher)
Input: Infrastructure details, cluster template, RKE2 version
Output: Running Kubernetes cluster registered in Rancher
Platform Stack (ArgoCD)
Input: Cluster registered in ArgoCD, tenant baseline repo
Output: All platform components deployed and healthy
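The ArgoCD step above boils down to registering the cluster and pointing an Application at the tenant baseline repo. A minimal sketch is shown below; the repo URL, project name, path convention, and namespace are illustrative placeholders, not our actual values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-baseline
  namespace: argocd
spec:
  project: platform                  # assumed ArgoCD project name
  source:
    repoURL: https://git.example.com/platform/tenant-baseline.git   # placeholder
    targetRevision: main
    path: clusters/<cluster-name>
  destination:
    name: <cluster-name>             # cluster as registered in ArgoCD
    namespace: platform-system
  syncPolicy:
    automated:
      prune: true                    # remove resources deleted from Git
      selfHeal: true                 # revert manual drift
```

With automated sync enabled, "deployed and healthy" in the output above means ArgoCD reports the Application as Synced and Healthy.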
2. Cluster Monitoring Process
What We Monitor
| Layer | Metrics | Tools |
|---|---|---|
| Infrastructure | CPU, memory, disk, network, node health | Prometheus + node_exporter |
| Kubernetes | Pod health, deployments, services, events | kube-state-metrics |
| Networking | Network policies, traffic flows, DNS | Cilium + Hubble |
| Security | Policy violations, runtime alerts | Kyverno + Falco |
| Applications | Custom metrics (if exposed) | Prometheus |
| Certificates | Expiration dates | cert-manager |
| Backups | Backup success/failure, last backup time | Velero |
Alert Routing
Alert Fires
│
▼
Alertmanager
│
├── P1 (Critical): PagerDuty → On-call engineer (immediate)
│ + Slack #incidents
│ + Customer notification (if customer-facing)
│
├── P2 (High): Slack #alerts → Acknowledged within 1h
│ + PagerDuty (business hours)
│
├── P3 (Medium): Slack #alerts → Next business day
│ + Ticket created automatically
│
└── P4 (Low): Slack #monitoring → Weekly review
+ Logged for trends
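The routing tree above maps onto an Alertmanager configuration roughly like the following sketch. The `severity` label values, receiver names, and channels are assumptions; real integration keys live outside this document:

```yaml
route:
  receiver: slack-monitoring            # default catch-all (P4)
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="P1"']
      receiver: pagerduty-critical      # pages on-call and posts to #incidents
      repeat_interval: 30m
    - matchers: ['severity="P2"']
      receiver: slack-alerts
    - matchers: ['severity="P3"']
      receiver: slack-alerts            # ticket creation assumed via a separate webhook

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
    slack_configs:
      - channel: "#incidents"
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
  - name: slack-monitoring
    slack_configs:
      - channel: "#monitoring"
```

Business-hours-only PagerDuty for P2 would need time-interval routing on top of this sketch.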
Standard Alert Rules
| Alert | Severity | Condition |
|---|---|---|
| Node down | P1 | Node unreachable > 5 min |
| Control plane unhealthy | P1 | API server / etcd unavailable |
| Cluster unreachable | P1 | Management cluster cannot reach workload cluster |
| Pod crash looping | P2 | Pod restarted > 5 times in 10 min |
| High CPU usage | P2 | Node CPU > 90% for 15 min |
| High memory usage | P2 | Node memory > 90% for 15 min |
| Disk usage critical | P2 | Disk > 85% full |
| Certificate expiring | P2 | Certificate expires < 14 days |
| Backup failure | P2 | Backup failed or missed schedule |
| Kyverno policy violation | P3 | Blocked resource creation |
| Falco security alert | P2/P3 | Runtime security event (severity-dependent) |
| High pod restart rate | P3 | > 10 restarts/hour across namespace |
| PV usage high | P3 | PersistentVolume > 80% full |
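Two of the rules above, expressed as a PrometheusRule sketch. Metric names assume kube-state-metrics and node_exporter as listed in the monitoring table; the rule names, namespace, and label scheme are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-health            # illustrative name
  namespace: monitoring        # assumed monitoring namespace
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDown
          # Node Ready condition false/unknown for more than 5 minutes
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: P1
          annotations:
            summary: "Node {{ $labels.node }} NotReady for more than 5 minutes"
        - alert: DiskUsageCritical
          # Filesystem more than 85% full, ignoring tmpfs/overlay mounts
          expr: |
            (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
              - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"})
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.85
          labels:
            severity: P2
          annotations:
            summary: "Disk on {{ $labels.instance }} is more than 85% full"
```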
Customer Dashboards (Grafana)
Each customer gets read-only access to:
- Cluster overview dashboard
- Node health dashboard
- Workload status dashboard
- Resource usage dashboard
- Certificate status dashboard
- Backup status dashboard
3. Cluster Upgrade Process
Upgrade Types
| Type | Frequency | Downtime | Process |
|---|---|---|---|
| Kubernetes minor version | Every 3-4 months | Zero (rolling) | Planned maintenance |
| Kubernetes patch version | Monthly | Zero (rolling) | Automated |
| Platform component updates | Monthly | Zero | GitOps |
| OS security patches | Weekly | Zero (rolling) | Automated |
| Emergency security patches | As needed | Minimal | Expedited |
Kubernetes Version Upgrade Process
1. New K8s version released
│
▼
2. Internal testing (1-2 weeks)
- Deploy on internal test cluster
- Run compatibility tests
- Validate platform components
│
▼
3. Staging rollout
- Upgrade customer staging/dev clusters first
- Monitor for 1 week
│
▼
4. Production rollout
- Schedule maintenance window (agreed with customer)
- Rolling upgrade: one node at a time
- Cordon → Drain → Upgrade → Uncordon
- Validate after each node
│
▼
5. Post-upgrade validation
- All pods healthy
- All services accessible
- Monitoring operational
- Backup operational
│
▼
6. Customer notification
- Upgrade complete confirmation
- Version change documented
Maintenance Windows
| Support Tier | Maintenance Window |
|---|---|
| Essential | Tuesday-Thursday, 02:00-06:00 CET |
| Business | Coordinated with customer, 48h notice |
| Enterprise | Customer-defined window, 1 week notice |
4. Backup & Restore Process
Backup Schedule
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full cluster backup (Velero) | Daily at 02:00 CET | 30 days | Hetzner StorageBox |
| etcd snapshot | Every 6 hours | 7 days | Local + remote |
| Persistent volume snapshots | Daily | 14 days | Hetzner Block Storage |
| Platform configuration (Git) | Every commit | Unlimited | Git history |
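The daily full backup from the table above corresponds to a Velero Schedule roughly like this sketch. Note the hedge on timezones: Velero evaluates cron expressions in the server's timezone (typically UTC), so 02:00 CET in winter is 01:00 UTC:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup      # illustrative name
  namespace: velero
spec:
  schedule: "0 1 * * *"        # 01:00 UTC ≈ 02:00 CET (no automatic DST shift)
  template:
    includedNamespaces: ["*"]  # full cluster backup
    snapshotVolumes: true
    storageLocation: default   # assumed to point at the StorageBox
    ttl: 720h                  # 30-day retention
```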
Restore Process
Restore Request
│
├── Partial restore (single namespace/app)
│ │
│ ▼
│        velero restore create --from-backup <backup-name> --include-namespaces <ns>
│        Duration: 5-30 minutes
│
└── Full cluster restore
│
▼
1. Provision new infrastructure
2. Bootstrap new cluster
3. Restore from Velero backup
4. Validate all components
5. Update DNS / ingress
Duration: 1-4 hours
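The partial restore can also be expressed declaratively as a Velero Restore resource, which is useful when restores should go through Git review. Names below are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-<ns>           # placeholder name
  namespace: velero
spec:
  backupName: <backup-name>    # the backup to restore from
  includedNamespaces:
    - <ns>
```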
Recovery Time Objectives
| Scenario | RTO | RPO |
|---|---|---|
| Single pod/deployment failure | < 5 min (auto-heal) | 0 |
| Single node failure | < 15 min (auto-replace) | 0 |
| Namespace restore | < 30 min | < 24 hours |
| Full cluster restore | < 4 hours | < 24 hours |
| Management cluster failure | < 2 hours | < 6 hours |
5. Node Replacement Process
Automatic (Self-Healing)
Node becomes unhealthy
│
▼
Rancher detects unhealthy node (5 min)
│
▼
Alert fired (P1)
│
▼
On-call engineer validates
│
▼
New node provisioned (automated)
│
▼
Workloads rescheduled automatically
│
▼
Old node cordoned and removed
│
▼
Incident report created
Manual (Planned)
Maintenance scheduled
│
▼
Cordon node (prevent new scheduling)
│
▼
Drain node (evict workloads gracefully)
│
▼
Perform maintenance / replace node
│
▼
Uncordon node / join new node
│
▼
Validate workloads rescheduled
6. Cluster Decommissioning Process
Trigger
- Customer contract ends
- Customer requests cluster removal
- Environment no longer needed
Process
Decommission Request
│
▼
1. Confirm with customer (written approval required)
│
▼
2. Final backup (retained for 90 days)
│
▼
3. Export customer data (provided to customer)
│
▼
4. Remove from monitoring and alerting
│
▼
5. Remove from ArgoCD and GitOps
│
▼
6. Remove from Rancher
│
▼
7. Destroy Kubernetes cluster
│
▼
8. Destroy infrastructure (servers, storage, networking)
│
▼
9. Archive tenant Git repositories
│
▼
10. Update billing (stop invoicing)
│
▼
11. Send decommission confirmation to customer
│
▼
12. After 90 days: delete final backup
Data Retention
| Data Type | Retention After Decommission |
|---|---|
| Customer workload data | 0 (exported to customer, then deleted) |
| Backup data | 90 days |
| Monitoring/log data | 30 days |
| Billing records | 10 years (German tax law) |
| Contracts | 10 years (German commercial law) |
| Git repository (archived) | 1 year |
Customer Onboarding — End-to-End Process
Overview
Customer onboarding is the most critical process for retention. A smooth onboarding sets the foundation for a long-term relationship. This process is designed for non-technical decision-makers and their teams.
Onboarding Timeline
Week 1 Week 2 Week 3 Week 4
│ │ │ │
├─ Kickoff ├─ Cluster ├─ App ├─ Go-Live
│ Meeting │ Provisioned │ Migration │ + Handover
│ │ │ │
├─ Requirements ├─ Access ├─ Training ├─ Support
│ Finalized │ Configured │ Sessions │ Transition
│ │ │ │
├─ Contract ├─ GitOps ├─ Dry Run ├─ 30-Day
│ Signed │ Setup │ Deployment │ Check-in
Total onboarding time: 2-4 weeks (depending on complexity)
Phase 1: Pre-Sales to Contract (Week 0)
Steps
| Step | Owner | Duration | Deliverable |
|---|---|---|---|
| Discovery call | Sales | 30 min | Customer needs understood |
| Technical assessment | CTO / Sr. Engineer | 1-2 hours | Feasibility confirmed |
| Architecture proposal | CTO | 1-2 days | Proposed setup document |
| Pricing proposal | Sales | 1 day | Commercial offer |
| Contract negotiation | Sales + Legal | 3-10 days | Signed contract (AVV + MSA) |
Required Documents (Germany)
| Document | Purpose |
|---|---|
| Master Service Agreement (MSA) | Main contract |
| Auftragsverarbeitungsvertrag (AVV) | GDPR data processing agreement — mandatory |
| Service Level Agreement (SLA) | Uptime and support commitments |
| Technical Specification | Cluster architecture, sizing |
| Pricing Schedule | Detailed cost breakdown |
Phase 2: Kickoff (Week 1)
Kickoff Meeting Agenda (90 minutes)
- Introductions (10 min)
- Platform team introduction
- Customer team introduction
- Roles and responsibilities
- Platform Overview (20 min)
- How the platform works (non-technical overview)
- What we manage vs. what the customer manages
- Rancher UI walkthrough
- Requirements Review (30 min)
- Cluster architecture confirmation
- Application inventory
- Network requirements
- Compliance requirements
- Integration requirements (CI/CD, monitoring)
- Access Setup (15 min)
- SSO configuration (customer's IdP)
- RBAC role mapping
- Who gets what access
- Timeline & Next Steps (15 min)
- Milestone dates
- Communication channels (Slack, email)
- Escalation contacts
Information We Need from Customer
| Item | Description | Urgency |
|---|---|---|
| Application list | What apps will run on the cluster | Week 1 |
| Container readiness | Are apps already containerized? | Week 1 |
| DNS domains | Customer domains for ingress | Week 1 |
| IdP details | OIDC/SAML configuration for SSO | Week 1 |
| Network requirements | IP ranges, firewall rules, VPN needs | Week 1 |
| Compliance requirements | Specific regulations (ISO, BAFIN, etc.) | Week 1 |
| Team contacts | Admin contacts, on-call contacts | Week 1 |
| CI/CD setup | Current CI/CD tools and workflows | Week 2 |
Phase 3: Cluster Provisioning (Week 2)
Actions
| Action | Responsible | Duration |
|---|---|---|
| Provision infrastructure | Platform Engineer | 1 hour |
| Bootstrap Kubernetes cluster | Platform Engineer | 30 min |
| Deploy platform components | Automated (ArgoCD) | 30 min |
| Configure SSO integration | Platform Engineer | 1-2 hours |
| Configure RBAC | Platform Engineer | 30 min |
| Setup GitOps repositories | Platform Engineer | 1 hour |
| Configure monitoring dashboards | Platform Engineer | 1 hour |
| Configure backup schedule | Platform Engineer | 30 min |
| Validation testing | Platform Engineer | 1 hour |
| Total | | ~1 day |
Customer Access Delivery
After provisioning, the customer receives:
| Item | How |
|---|---|
| Rancher UI access | SSO login URL + role assignment |
| Grafana dashboards | SSO login URL (read-only) |
| kubectl access | Rancher-provided kubeconfig |
| GitOps repository | GitHub/GitLab repo access |
| Support channel | Slack channel created |
| Documentation portal | Access to customer docs |
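Read-only access in the table above typically reduces to binding the customer's SSO group to Kubernetes' built-in view ClusterRole. A minimal sketch, assuming the group name is whatever claim the customer's IdP sends:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: customer-readonly      # illustrative name
subjects:
  - kind: Group
    name: customer-devops      # placeholder: group claim from the customer's IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                   # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Write access is granted the same way with a more permissive role, scoped per namespace via RoleBindings rather than cluster-wide.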
Phase 4: Application Migration (Week 3)
Migration Support Options
| Option | Description | Our Role |
|---|---|---|
| Self-service | Customer deploys their own apps | We provide docs + support |
| Guided migration | We help containerize and deploy | Hands-on assistance |
| Full migration | We containerize, deploy, validate | Full professional service |
Migration Steps (Guided)
- Application assessment: Review existing apps, dependencies
- Containerization: Create Dockerfiles, optimize images
- Kubernetes manifests: Create Deployments, Services, Ingress
- GitOps integration: Set up ArgoCD application definitions
- Staging deployment: Deploy to non-prod cluster first
- Testing: Validate functionality, performance, connectivity
- Production deployment: Deploy to production cluster
- DNS cutover: Point production DNS to new ingress
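For step 3 of the guided migration, a minimal Deployment and Service sketch is usually the starting point. Image, names, and resource figures below are placeholders; setting requests and limits up front also keeps workloads out of the missing-limits check in RB-009:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder
  namespace: example
spec:
  replicas: 2
  selector:
    matchLabels: { app: example-app }
  template:
    metadata:
      labels: { app: example-app }
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 500m, memory: 256Mi }
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: example
spec:
  selector: { app: example-app }
  ports:
    - port: 80
      targetPort: 8080
```

An Ingress (with a cert-manager annotation for TLS) and an ArgoCD Application definition complete the set before staging deployment.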
Phase 5: Go-Live & Handover (Week 4)
Go-Live Checklist
- [ ] All applications deployed and healthy
- [ ] Monitoring dashboards showing data
- [ ] Alerts configured and tested
- [ ] Backups running successfully
- [ ] SSL certificates active and auto-renewing
- [ ] Customer team has access (SSO working)
- [ ] Customer team trained on basics
- [ ] Support channel active
- [ ] Runbooks for customer-specific scenarios created
- [ ] DNS cutover completed
- [ ] Load testing passed (if applicable)
Training Sessions
| Session | Duration | Audience | Content |
|---|---|---|---|
| Platform Overview | 2 hours | All team members | Rancher UI, dashboards, basic operations |
| Deployment Workflow | 2 hours | DevOps / Developers | GitOps workflow, ArgoCD, CI/CD integration |
| Monitoring & Alerting | 1 hour | DevOps / Ops | Grafana dashboards, alert interpretation |
| Incident Reporting | 1 hour | All team members | How to report issues, severity levels |
30-Day Check-in
Scheduled 30 days after go-live:
- Review platform health
- Address any issues
- Discuss expansion needs
- Gather feedback
- Adjust monitoring/alerting if needed
Onboarding Success Metrics
| Metric | Target |
|---|---|
| Time to first cluster | < 5 business days |
| Time to go-live | < 4 weeks |
| Customer satisfaction (NPS) at 30 days | > 8/10 |
| Support tickets in first 30 days | < 5 |
| Onboardings with zero P1 incidents | 100% |
Support, Escalation & Incident Management
Support Model Overview
Support Tiers
| Tier | Availability | Channels | Response SLA | Included With |
|---|---|---|---|---|
| Essential | Mon-Fri 9:00-18:00 CET | Email, Ticket | P1: 4h, P2: 8h, P3: 24h, P4: 48h | All plans |
| Business | Mon-Fri 7:00-22:00 CET | Email, Ticket, Slack | P1: 1h, P2: 4h, P3: 8h, P4: 24h | €800/mo add-on |
| Enterprise | 24/7/365 | Email, Ticket, Slack, Phone | P1: 15min, P2: 1h, P3: 4h, P4: 8h | €2,500/mo add-on |
What's Supported vs. Not Supported
| Supported (In Scope) | Not Supported (Out of Scope) |
|---|---|
| Kubernetes cluster operations | Application code debugging |
| Platform component issues | Customer application logic |
| Node failures and replacements | Database query optimization |
| Network policy configuration | Custom application performance tuning |
| Security policy management | Third-party software support |
| Backup and restore operations | CI/CD pipeline development |
| Monitoring and alerting setup | Application architecture consulting* |
| Kubernetes version upgrades | Custom development* |
| SSL certificate management | Training beyond onboarding* |
*Available as Professional Services at additional cost
Severity Levels
P1 — Critical
| Attribute | Value |
|---|---|
| Definition | Production cluster down, data loss risk, security breach |
| Business Impact | Customer's production services unavailable |
| Examples | Control plane failure, all nodes down, data corruption, active security incident |
| Response Time | Essential: 4h / Business: 1h / Enterprise: 15min |
| Resolution Target | 4 hours |
| Communication | Every 30 minutes until resolved |
| Escalation | Automatic to CTO after 2 hours |
P2 — High
| Attribute | Value |
|---|---|
| Definition | Significant degradation, partial outage, failed backups |
| Business Impact | Some services impacted, workaround may exist |
| Examples | Single node failure, high error rates, backup failure, cert expiring < 48h |
| Response Time | Essential: 8h / Business: 4h / Enterprise: 1h |
| Resolution Target | 8 hours |
| Communication | Every 2 hours until resolved |
| Escalation | Automatic to Platform Lead after 4 hours |
P3 — Medium
| Attribute | Value |
|---|---|
| Definition | Non-critical issue, minor impact |
| Business Impact | Minor inconvenience, no production impact |
| Examples | Dashboard issue, non-critical alert, minor config change needed |
| Response Time | Essential: 24h / Business: 8h / Enterprise: 4h |
| Resolution Target | 3 business days |
| Communication | Daily update |
| Escalation | Manual after 2 business days |
P4 — Low
| Attribute | Value |
|---|---|
| Definition | Question, feature request, minor improvement |
| Business Impact | None |
| Examples | Documentation question, dashboard customization, feature request |
| Response Time | Essential: 48h / Business: 24h / Enterprise: 8h |
| Resolution Target | 5 business days |
| Communication | Upon completion |
| Escalation | None (scheduled backlog) |
Incident Management Process
Incident Lifecycle
Detection
│
├── Automated (monitoring alert)
│ └── PagerDuty pages on-call engineer
│
└── Customer-reported
└── Ticket/Slack/Phone → Triaged by support
│
▼
Triage (5 minutes)
│
├── Assign severity (P1-P4)
├── Assign owner (on-call or support engineer)
└── Create incident ticket
│
▼
Investigation (varies by severity)
│
├── Check monitoring dashboards
├── Review logs (Loki/Grafana)
├── Check recent changes (ArgoCD/Git)
├── Check infrastructure health (Rancher)
└── Identify root cause
│
▼
Resolution
│
├── Apply fix (manual or via GitOps)
├── Validate fix
├── Monitor for recurrence
└── Confirm with customer
│
▼
Post-Incident
│
├── Update ticket with resolution
├── Postmortem (P1/P2 only)
├── Action items created
└── Customer communication sent
Incident Communication Templates
P1 Initial Notification (to customer)
Subject: [P1] Incident — {Cluster Name} — {Brief Description}
We are aware of an issue affecting your {cluster/service}.
Impact: {Description of impact}
Status: Investigating
Next update: In 30 minutes
Our team is actively working on resolution.
P1 Update
Subject: [P1] Update — {Cluster Name} — {Brief Description}
Status: {Investigating / Identified / Mitigated / Resolved}
Update: {What we've done since last update}
Next steps: {What we're doing next}
Next update: In 30 minutes
P1 Resolution
Subject: [P1] Resolved — {Cluster Name} — {Brief Description}
The incident has been resolved.
Root cause: {Brief explanation}
Resolution: {What was done}
Duration: {Start time — End time}
A detailed postmortem will be shared within 3 business days.
Escalation Matrix
Automatic Escalation (Time-Based)
| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| P1 | On-call engineer (0 min) | Platform Lead (30 min) | CTO (2h) | CEO (4h) |
| P2 | Support engineer (0 min) | Sr. Platform Eng (2h) | Platform Lead (4h) | CTO (8h) |
| P3 | Support engineer (0 min) | Sr. Platform Eng (1 day) | Platform Lead (2 days) | — |
| P4 | Support engineer (0 min) | — | — | — |
Manual Escalation (Customer-Initiated)
Customers can request escalation at any time through:
- Slack: Mention @platform-lead
- Email: escalation@platform.example.com
- Phone (Enterprise only): Dedicated escalation number
Escalation Responsibilities
| Level | Role | Responsibility |
|---|---|---|
| L1 | On-call / Support Engineer | First response, initial diagnosis, known-issue resolution |
| L2 | Senior Platform Engineer | Complex troubleshooting, infrastructure issues |
| L3 | Platform Lead / CTO | Architecture decisions, emergency changes, vendor escalation |
| L4 | CEO | Customer executive communication, business decisions |
On-Call Process
On-Call Rotation
| Parameter | Value |
|---|---|
| Rotation length | 1 week (Monday 09:00 — Monday 09:00) |
| Team size for rotation | Minimum 3 engineers |
| Handoff process | 15-min sync at rotation change |
| Primary + Secondary | Always 2 engineers on-call |
| Tool | PagerDuty |
On-Call Expectations
| Expectation | Requirement |
|---|---|
| Acknowledgement time | < 5 minutes (P1), < 15 minutes (P2) |
| Availability | Reachable by phone at all times |
| Response capability | Laptop + internet access within 15 minutes |
| Escalation | If unable to resolve in 30 min, escalate to L2 |
| Documentation | Log all actions in incident ticket |
On-Call Compensation (Germany)
| Item | Compensation |
|---|---|
| Weekday on-call standby | €50/night |
| Weekend on-call standby | €100/day |
| Public holiday standby | €150/day |
| Actual incident work (outside business hours) | €75/hour |
| Rest time after night incident (> 2h work) | Next morning off |
Postmortem Process
When Required
- All P1 incidents
- P2 incidents with customer impact
- Any incident with data loss
- Any security incident
- Any incident lasting > 4 hours
Postmortem Template
# Postmortem: {Incident Title}
## Summary
- Date: {YYYY-MM-DD}
- Duration: {X hours Y minutes}
- Severity: {P1/P2}
- Affected customers: {List}
- Impact: {Description}
## Timeline (CET)
- HH:MM — {Event}
- HH:MM — {Event}
- ...
## Root Cause
{Detailed technical explanation}
## Resolution
{What was done to resolve}
## What Went Well
- {Item}
## What Went Wrong
- {Item}
## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| {Action} | {Person} | {Date} | Open |
## Lessons Learned
{Key takeaways}
Postmortem Timeline
| Step | Deadline |
|---|---|
| Draft postmortem | 24 hours after resolution |
| Internal review | 48 hours after resolution |
| Share with customer | 3 business days after resolution |
| Action items tracked | Ongoing, reviewed weekly |
Support Metrics & KPIs
| Metric | Target | Measurement |
|---|---|---|
| First response time (P1) | < SLA | PagerDuty / Ticket system |
| Mean time to acknowledge (MTTA) | < 10 min (P1) | PagerDuty |
| Mean time to resolve (MTTR) | P1: < 4h, P2: < 8h | Ticket system |
| SLA compliance | > 99.5% | Monthly report |
| Customer satisfaction (CSAT) | > 4.5/5 | Post-ticket survey |
| Incidents per customer per month | < 1 | Monthly report |
| Postmortems completed on time | 100% | Tracked in tickets |
| On-call alert noise | < 5 alerts/week per cluster | PagerDuty analytics |
Operational Runbooks
Runbook Index
| ID | Runbook | Severity | Trigger |
|---|---|---|---|
| RB-001 | Node Failure | P1/P2 | Node unreachable alert |
| RB-002 | Control Plane Failure | P1 | API server / etcd alert |
| RB-003 | Cluster Unreachable | P1 | Management cluster cannot reach workload cluster |
| RB-004 | Certificate Expiration | P2 | cert-manager alert |
| RB-005 | Backup Failure | P2 | Velero backup failure alert |
| RB-006 | Storage Failure | P1/P2 | PV/disk alert |
| RB-007 | Network Failure | P1/P2 | CNI / connectivity alert |
| RB-008 | Cluster Upgrade Failure | P2 | Upgrade process error |
| RB-009 | High Resource Usage | P2/P3 | CPU/Memory/Disk threshold alert |
| RB-010 | Security Incident | P1 | Falco / Kyverno alert |
| RB-011 | Pod Crash Loop | P2/P3 | Pod restart alert |
| RB-012 | DNS Resolution Failure | P2 | DNS health check failure |
| RB-013 | Ingress Failure | P1/P2 | Ingress controller down |
| RB-014 | etcd Restore | P1 | Data corruption / etcd failure |
| RB-015 | Full Cluster Restore | P1 | Complete cluster loss |
RB-001: Node Failure
Alert
KubernetesNodeNotReady — Node has been in NotReady state for > 5 minutes
Diagnosis Steps
# 1. Check node status from management cluster
kubectl get nodes -o wide
# 2. Check node conditions
kubectl describe node <node-name>
# 3. Check if node is reachable via SSH
ssh <node-ip> "uptime"
# 4. Check system services
ssh <node-ip> "systemctl status rke2-agent" # or rke2-server for control plane
# 5. Check system resources
ssh <node-ip> "df -h && free -m && uptime"
# 6. Check Rancher cluster status
# Navigate to Rancher UI → Cluster → Nodes
Resolution Steps
If node is reachable but K8s service is down:
# Restart RKE2 agent
ssh <node-ip> "systemctl restart rke2-agent"
# Wait 2-3 minutes, verify
kubectl get nodes
If node is unreachable (hardware/network failure):
# 1. Cordon the node to prevent scheduling
kubectl cordon <node-name>
# 2. Drain workloads (if possible)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# 3. Provision replacement node via Terraform
cd infrastructure/terraform/<customer>
terraform apply -target=module.worker_node_<n>
# 4. Join new node to cluster (automated via Rancher)
# 5. Verify new node is Ready
kubectl get nodes
# 6. Remove old node
kubectl delete node <old-node-name>
Customer Communication
- P1 if control plane node and cluster has < 3 control plane nodes
- P2 if worker node (workloads auto-rescheduled)
- Notify customer via Slack/email within 15 minutes
RB-002: Control Plane Failure
Alert
KubernetesAPIServerDown or EtcdClusterUnhealthy
Diagnosis Steps
# 1. Check if API server responds
kubectl cluster-info
curl -k https://<api-server-ip>:6443/healthz
# 2. Check etcd health
ssh <control-plane-node> "ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
endpoint health"
# 3. Check RKE2 server logs
ssh <control-plane-node> "journalctl -u rke2-server -n 100"
# 4. Check all control plane components
ssh <control-plane-node> "crictl ps | grep -E 'kube-api|etcd|controller|scheduler'"
Resolution Steps
If single control plane node failure (HA cluster with 3 CP nodes):
- Cluster continues to operate with 2/3 nodes
- Follow RB-001 to replace the failed node
- Urgency: P2 (cluster still operational)
If etcd quorum lost (2/3 nodes down):
# 1. This is a P1 — cluster is down
# 2. Attempt to recover at least one more etcd member
ssh <surviving-node> "systemctl restart rke2-server"
# 3. If recovery fails, restore from etcd snapshot
# See RB-014: etcd Restore
If all control plane nodes down:
- Follow RB-015: Full Cluster Restore
- P1 — all hands on deck
RB-003: Cluster Unreachable
Alert
ManagedClusterUnreachable — Rancher cannot communicate with downstream cluster
Diagnosis Steps
# 1. Check Rancher UI — is cluster showing disconnected?
# 2. Check if cluster nodes are reachable from management network
ping <cluster-node-ip>
ssh <cluster-node-ip> "kubectl get nodes"
# 3. Check Rancher agent on downstream cluster
ssh <cluster-node> "kubectl -n cattle-system get pods"
ssh <cluster-node> "kubectl -n cattle-system logs deployment/cattle-cluster-agent"
# 4. Check network connectivity (firewall, VPN, routing)
ssh <cluster-node> "curl -k https://<rancher-url>/healthz"
# 5. Check if it's a Rancher issue (all clusters affected?)
# Check Rancher UI for other cluster statuses
Resolution Steps
Network issue:
# Check and fix firewall rules
# Verify VPN tunnel is up (if applicable)
# Check load balancer health
Rancher agent issue:
# Restart cattle-cluster-agent
ssh <cluster-node> "kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent"
Rancher server issue:
# Check Rancher pods on management cluster
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deployment/rancher
RB-004: Certificate Expiration
Alert
CertificateExpiringSoon — Certificate expires in < 14 days
Diagnosis Steps
# 1. Check which certificate is expiring
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
# 2. Check cert-manager logs
kubectl -n cert-manager logs deployment/cert-manager
# 3. Check certificate details
kubectl get secret <cert-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -dates
Resolution Steps
cert-manager renewal failure:
# 1. Check cert-manager challenges
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n <namespace>
# 2. Common issues:
# - DNS challenge: Check DNS provider credentials
# - HTTP challenge: Check ingress is accessible
# - Rate limits: Let's Encrypt rate limit (wait or use staging)
# 3. Delete and recreate certificate if needed
kubectl delete certificate <cert-name> -n <namespace>
# ArgoCD will recreate it from Git
RKE2 internal certificates:
# RKE2 auto-rotates internal certs, but if needed:
ssh <control-plane-node> "rke2 certificate rotate"
ssh <control-plane-node> "systemctl restart rke2-server"
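For context when deleting and letting ArgoCD recreate a certificate, the Git-managed resource looks roughly like the sketch below. The issuer name and DNS name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls                # placeholder
  namespace: <namespace>
spec:
  secretName: app-tls          # Secret cert-manager writes the key pair into
  dnsNames:
    - app.customer.example.com # placeholder domain
  issuerRef:
    name: letsencrypt-prod     # assumed ClusterIssuer name
    kind: ClusterIssuer
```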
RB-005: Backup Failure
Alert
VeleroBackupFailed — Scheduled backup did not complete
Diagnosis Steps
# 1. Check Velero backup status
kubectl -n velero get backups
kubectl -n velero describe backup <backup-name>
# 2. Check Velero logs
kubectl -n velero logs deployment/velero
# 3. Check backup storage location
kubectl -n velero get backupstoragelocation
kubectl -n velero describe backupstoragelocation default
# 4. Check storage connectivity
# Verify StorageBox credentials and connectivity
Resolution Steps
Storage connectivity issue:
# Check and update storage credentials
kubectl -n velero get secret cloud-credentials -o yaml
# Verify storage endpoint is reachable
kubectl -n velero exec deployment/velero -- \
wget -q --spider <storage-endpoint>
Volume snapshot failure:
# Check volume snapshot class
kubectl get volumesnapshotclass
# Check if PV supports snapshots
kubectl get pv <pv-name> -o yaml
# Manual backup trigger
velero backup create manual-backup-$(date +%Y%m%d) \
--include-namespaces <namespace>
RB-009: High Resource Usage
Alert
NodeCPUHigh (> 90%), NodeMemoryHigh (> 90%), NodeDiskHigh (> 85%)
Diagnosis Steps
# 1. Identify which node is affected
kubectl top nodes
# 2. Find resource-hungry pods
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# 3. Check for resource limits
kubectl get pods -A -o json | jq '.items[] |
select(.spec.containers[].resources.limits == null) |
.metadata.namespace + "/" + .metadata.name'
# 4. Check for eviction pressure
kubectl describe node <node-name> | grep -A5 Conditions
Resolution Steps
Short-term (immediate relief):
# Identify and scale down non-critical workloads
kubectl -n <namespace> scale deployment <name> --replicas=1
# Evict pods from overloaded node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Long-term:
# Add worker node
cd infrastructure/terraform/<customer>
# Increase node count and apply
# Or resize existing nodes (requires node replacement)
# Update Terraform with larger instance type
RB-010: Security Incident
Alert
FalcoSecurityAlert — Runtime security event detected
IMMEDIATE Actions (First 5 Minutes)
1. DO NOT delete evidence
2. Assess scope — which cluster, namespace, pod
3. Determine if active attack or false positive
4. If active attack:
a. Isolate affected pod/namespace (network policy)
b. Do NOT delete the pod (preserve forensic data)
c. Escalate to Security Engineer immediately
d. Notify CTO
Diagnosis Steps
# 1. Check Falco alerts
kubectl -n falco logs daemonset/falco | grep -i "Warning\|Error\|Critical"
# 2. Check what triggered the alert
# Falco alert will contain:
# - Rule name
# - Output fields (container, process, file, network)
# - Priority
# 3. Inspect the suspicious pod
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name>
# 4. Check Kyverno violations
kubectl get policyreport -A
# 5. Check network flows (Hubble/Cilium)
hubble observe --namespace <namespace> --pod <pod-name>
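If Falco is configured with `json_output: true` (an assumption; check this cluster's Falco Helm values), alerts can be filtered by priority instead of grepping raw text. `fromjson?` silently skips any non-JSON log lines:

```shell
# Assumes Falco emits JSON events (json_output: true).
# time, rule, priority and output are standard Falco event fields.
kubectl -n falco logs daemonset/falco --since=1h \
  | jq -Rr 'fromjson?
            | select(.priority == "Critical" or .priority == "Warning")
            | .time + " " + .rule + " :: " + .output'
```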
Containment Steps
# 1. Isolate the namespace with network policy
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: isolate-namespace
namespace: <namespace>
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
# 2. Capture pod state for forensics
mkdir -p forensics
kubectl -n <namespace> get pod <pod-name> -o yaml > forensics/pod-state.yaml
kubectl -n <namespace> logs <pod-name> > forensics/pod-logs.txt
# 3. If confirmed malicious — kill the pod
kubectl -n <namespace> delete pod <pod-name>
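When isolating the entire namespace would take down unaffected workloads, a narrower variant of step 1 quarantines just the suspect pod. This is a sketch; the `quarantine` label is a convention invented here, not a platform default:

```shell
# Label only the suspect pod, then deny all traffic to/from that label.
# The pod keeps running, so forensic data is preserved.
kubectl -n <namespace> label pod <pod-name> quarantine=true

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-pod
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
  - Ingress
  - Egress
EOF
```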
Customer Communication
- Always notify customer of confirmed security incidents
- Provide incident report within 24 hours
- Full postmortem within 5 business days
- GDPR notification if personal data involved (72-hour deadline)
RB-014: etcd Restore
When Needed
- etcd data corruption
- etcd quorum permanently lost
- Accidental deletion of critical resources
Restore from Snapshot
# 1. Stop RKE2 on all control plane nodes
for node in cp1 cp2 cp3; do
ssh $node "systemctl stop rke2-server"
done
# 2. Find latest snapshot
ssh cp1 "ls -la /var/lib/rancher/rke2/server/db/snapshots/"
# 3. Restore on first control plane node
ssh cp1 "rke2 server \
--cluster-reset \
--cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>"
# 4. Start RKE2 on first node
ssh cp1 "systemctl start rke2-server"
# 5. Wait for first node to be ready (on the node itself, kubectl and its
# kubeconfig live under RKE2's own paths)
ssh cp1 "/var/lib/rancher/rke2/bin/kubectl \
--kubeconfig /etc/rancher/rke2/rke2.yaml get nodes"
# 6. Remove old etcd data on other nodes
for node in cp2 cp3; do
ssh $node "rm -rf /var/lib/rancher/rke2/server/db/etcd"
done
# 7. Restart RKE2 on other nodes (they will rejoin)
for node in cp2 cp3; do
ssh $node "systemctl start rke2-server"
done
# 8. Verify cluster health
kubectl get nodes
kubectl get pods -A
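Beyond step 8, a couple of extra checks confirm that etcd actually regained quorum across all three control plane nodes. A sketch, run on a control plane node (RKE2 places its admin kubeconfig at /etc/rancher/rke2/rke2.yaml and its binaries under /var/lib/rancher/rke2/bin):

```shell
# Make RKE2's kubectl and kubeconfig available in this shell.
export PATH=$PATH:/var/lib/rancher/rke2/bin
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# All control plane nodes should report Ready
kubectl get nodes -o wide

# Expect one Running etcd static pod per control plane node
kubectl -n kube-system get pods -o wide | grep etcd
```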
RB-015: Full Cluster Restore
When Needed
- Complete cluster loss (all nodes destroyed)
- Catastrophic infrastructure failure
- Disaster recovery scenario
Process
1. Assess damage — what is lost, what is recoverable
│
▼
2. Provision new infrastructure (Terraform)
Duration: 30-60 minutes
│
▼
3. Bootstrap new RKE2 cluster (Rancher)
Duration: 15-30 minutes
│
▼
4. Deploy platform components (ArgoCD)
Duration: 15-30 minutes
│
▼
5. Restore from Velero backup
velero restore create --from-backup <latest-backup>
Duration: 30-120 minutes (depends on data size)
│
▼
6. Validate all workloads
- Check all deployments
- Check all services
- Check persistent data
Duration: 30-60 minutes
│
▼
7. Update DNS and ingress
Duration: 5-15 minutes (+ DNS propagation)
│
▼
8. Notify customer — service restored
│
▼
9. Full postmortem within 3 business days
Total Expected Recovery Time: 2-4 hours
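The workload validation in step 6 can be partially automated. A sketch that flags any Deployment whose ready replica count still lags its desired count after the Velero restore, plus any PVC that has not re-bound:

```shell
# Deployments whose readyReplicas < desired replicas (missing counts as 0)
kubectl get deployments -A -o json | jq -r '
  .items[]
  | select((.status.readyReplicas // 0) < (.spec.replicas // 0))
  | .metadata.namespace + "/" + .metadata.name'

# PVCs not yet Bound (with -A, STATUS is the third column)
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'
```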
Runbook Maintenance
| Activity | Frequency | Owner |
|---|---|---|
| Review all runbooks | Quarterly | Platform Lead |
| Update after incidents | After every P1/P2 | Incident owner |
| Test DR runbooks (RB-014, RB-015) | Quarterly | SRE team |
| Add new runbooks | As new failure modes discovered | Platform team |
| Customer-specific runbooks | At onboarding + annually | Customer Success |
Service Level Agreement (SLA) & Support Tiers
Platform Availability SLA
Uptime Commitment
| Component | Essential | Business | Enterprise |
|---|---|---|---|
| Management Platform (Rancher, ArgoCD) | 99.5% | 99.9% | 99.95% |
| Customer Cluster Control Plane | 99.5% | 99.9% | 99.95% |
| Monitoring & Alerting | 99.0% | 99.5% | 99.9% |
| Backup Operations | 99.0% | 99.5% | 99.9% |
What Uptime Means
- 99.5% = max 3.65 hours downtime/month
- 99.9% = max 43.8 minutes downtime/month
- 99.95% = max 21.9 minutes downtime/month
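The downtime figures follow directly from the percentages, assuming an average month of 730 hours (43,800 minutes). A quick sketch to reproduce them:

```shell
# Convert an uptime SLA percentage into maximum allowed downtime per month.
# 99.50% -> 219.0 min (3.65 h), 99.90% -> 43.8 min, 99.95% -> 21.9 min
for sla in 99.5 99.9 99.95; do
  awk -v s="$sla" 'BEGIN { printf "%.2f%% -> %.1f min/month\n", s, (100 - s) / 100 * 43800 }'
done
```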
Exclusions
The following are NOT counted as downtime:
- Scheduled maintenance (with agreed notice period)
- Customer-caused outages (misconfigurations, resource exhaustion)
- Force majeure (natural disasters, war, pandemic)
- Third-party provider outages (Hetzner, DNS providers)
- Customer application issues
SLA Credits
If we miss the uptime SLA, customers receive service credits:
| Uptime Achieved | Credit (% of Monthly Fee) |
|---|---|
| 99.0% - 99.49% | 10% |
| 98.0% - 98.99% | 25% |
| 95.0% - 97.99% | 50% |
| < 95.0% | 100% |
Credit Process
- Customer submits credit request within 30 days of incident
- We validate against monitoring data
- Credit applied to next invoice
- Maximum credit: 100% of one month's platform management fee
- Credits do not apply to infrastructure pass-through costs
Support Tier Comparison
Essential Support (Included)
| Feature | Detail |
|---|---|
| Price | Included with all plans |
| Availability | Monday-Friday, 09:00-18:00 CET |
| Channels | Email, Ticket portal |
| P1 Response | 4 hours |
| P2 Response | 8 hours |
| P3 Response | 24 hours |
| P4 Response | 48 hours |
| Named contacts | 2 |
| Monthly review | No |
| Dedicated Slack | No |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.5% |
Business Support (€800/month)
| Feature | Detail |
|---|---|
| Price | €800/month |
| Availability | Monday-Friday, 07:00-22:00 CET |
| Channels | Email, Ticket portal, Slack |
| P1 Response | 1 hour |
| P2 Response | 4 hours |
| P3 Response | 8 hours |
| P4 Response | 24 hours |
| Named contacts | 5 |
| Monthly review | Yes (30 min) |
| Dedicated Slack | Yes (shared channel) |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.9% |
Enterprise Support (€2,500/month)
| Feature | Detail |
|---|---|
| Price | €2,500/month |
| Availability | 24/7/365 |
| Channels | Email, Ticket portal, Slack, Phone |
| P1 Response | 15 minutes |
| P2 Response | 1 hour |
| P3 Response | 4 hours |
| P4 Response | 8 hours |
| Named contacts | Unlimited |
| Monthly review | Yes (upgraded to a weekly 30 min sync) |
| Dedicated Slack | Yes (with SLA on responses) |
| Phone support | Yes (dedicated number) |
| Dedicated engineer | Yes (named contact) |
| Uptime SLA | 99.95% |
| Custom maintenance windows | Yes |
| Priority upgrade scheduling | Yes |
| Quarterly business review | Yes |
Maintenance Windows
Scheduled Maintenance
| Support Tier | Notice Period | Window |
|---|---|---|
| Essential | 48 hours | Tue-Thu, 02:00-06:00 CET |
| Business | 5 business days | Agreed with customer |
| Enterprise | 10 business days | Customer-defined |
Emergency Maintenance
| Condition | Notice | Approval |
|---|---|---|
| Critical security patch | Best effort (min 2 hours) | No customer approval needed |
| Data integrity risk | Best effort (min 1 hour) | No customer approval needed |
| Non-critical but urgent | 24 hours | Notification only |
Operational Commitments
Backup Guarantees
| Commitment | Target |
|---|---|
| Daily backup execution | 99.5% success rate |
| Backup data retention | 30 days minimum |
| Restore test (upon request) | Within 2 business days |
| Full cluster restore | < 4 hours RTO |
| Data loss maximum | < 24 hours RPO |
Security Commitments
| Commitment | Target |
|---|---|
| Security patch deployment | Critical: < 24 hours, High: < 72 hours |
| Vulnerability scanning | Weekly |
| Security incident notification | < 1 hour after detection |
| GDPR breach notification | < 72 hours (as per law) |
Platform Update Commitments
| Commitment | Target |
|---|---|
| Kubernetes version support | N-2 minor versions |
| Kubernetes upgrade after GA release | Within 60 days |
| Platform component updates | Monthly |
| OS security patches | Weekly |
Reporting
Monthly Platform Report (All Tiers)
Delivered by 5th business day of each month:
- Cluster uptime percentage
- Number of incidents by severity
- Backup success rate
- Resource utilization trends
- Security scan results summary
Monthly Review Meeting (Business + Enterprise)
Agenda:
- Uptime and incident review
- Capacity and performance review
- Security posture review
- Upcoming maintenance and upgrades
- Customer requests and roadmap
Quarterly Business Review (Enterprise Only)
Agenda:
- All monthly review items
- Strategic infrastructure planning
- Cost optimization recommendations
- Technology roadmap alignment
- Contract and SLA review