Cluster Lifecycle — End-to-End Process
Overview
This document describes every step from cluster creation to decommissioning. Each process is designed so that non-technical stakeholders can understand what happens, while engineers have clear runbooks.
1. Cluster Creation Process
Trigger
- New customer onboarded
- Existing customer requests additional cluster
- Internal team needs new environment
Process Flow
Customer Request
│
▼
┌──────────────┐
│ Requirements │ ◄── Cluster size, region, environment type,
│ Gathering │ compliance needs, network requirements
└──────┬───────┘
│
▼
┌──────────────┐
│ Approval & │ ◄── Internal review, capacity check,
│ Scheduling │ billing setup
└──────┬───────┘
│
▼
┌──────────────┐
│ Infrastructure│ ◄── Terraform/Pulumi provisions servers,
│ Provisioning │ networking, storage, load balancers
└──────┬───────┘
│
▼
┌──────────────┐
│ Kubernetes │ ◄── RKE2 cluster bootstrap via Rancher,
│ Bootstrap │ control plane HA setup
└──────┬───────┘
│
▼
┌──────────────┐
│ Platform │ ◄── Cilium, Kyverno, Falco, cert-manager,
│ Components │ monitoring agents, backup agents
└──────┬───────┘
│
▼
┌──────────────┐
│ GitOps │ ◄── ArgoCD registration, tenant repo setup,
│ Registration │ baseline config deployment
└──────┬───────┘
│
▼
┌──────────────┐
│ Security │ ◄── RBAC setup, SSO integration, network
│ Configuration │ policies, pod security policies
└──────┬───────┘
│
▼
┌──────────────┐
│ Validation & │ ◄── Automated tests, connectivity checks,
│ Testing │ monitoring verification, backup test
└──────┬───────┘
│
▼
┌──────────────┐
│ Handover │ ◄── Customer access granted, documentation
│ │ provided, kickoff meeting
└──────────────┘
Timeline
| Step | Duration | Responsible |
|---|---|---|
| Requirements gathering | 1-2 days | Customer Success + Customer |
| Approval & scheduling | 1 day | CTO / Platform Lead |
| Infrastructure provisioning | 30-60 minutes (automated) | Platform Engineer |
| Kubernetes bootstrap | 15-30 minutes (automated) | Platform Engineer |
| Platform components | 15-30 minutes (GitOps) | Automated via ArgoCD |
| GitOps registration | 15 minutes | Platform Engineer |
| Security configuration | 30-60 minutes | Platform Engineer |
| Validation & testing | 30-60 minutes | Platform Engineer |
| Handover | 1-2 hours | Customer Success |
| Total (technical) | 2-4 hours | |
| Total (including coordination) | 3-5 business days | |
Automation Details
Infrastructure Provisioning (Terraform)
Input: cluster_name, environment, size, region, node_count
Output: Servers provisioned, networking configured, DNS records created
Cluster Bootstrap (Rancher)
Input: Infrastructure details, cluster template, RKE2 version
Output: Running Kubernetes cluster registered in Rancher
Platform Stack (ArgoCD)
Input: Cluster registered in ArgoCD, tenant baseline repo
Output: All platform components deployed and healthy
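The ArgoCD step above boils down to registering the cluster and pointing an Application at the tenant baseline repo. A minimal sketch is shown below; the repo URL, project name, path convention, and namespace are illustrative placeholders, not our actual values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-baseline
  namespace: argocd
spec:
  project: platform                  # assumed ArgoCD project name
  source:
    repoURL: https://git.example.com/platform/tenant-baseline.git   # placeholder
    targetRevision: main
    path: clusters/<cluster-name>
  destination:
    name: <cluster-name>             # cluster as registered in ArgoCD
    namespace: platform-system
  syncPolicy:
    automated:
      prune: true                    # remove resources deleted from Git
      selfHeal: true                 # revert manual drift
```

With automated sync enabled, "deployed and healthy" in the output above means ArgoCD reports the Application as Synced and Healthy.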
2. Cluster Monitoring Process
What We Monitor
| Layer | Metrics | Tools |
|---|---|---|
| Infrastructure | CPU, memory, disk, network, node health | Prometheus + node_exporter |
| Kubernetes | Pod health, deployments, services, events | kube-state-metrics |
| Networking | Network policies, traffic flows, DNS | Cilium + Hubble |
| Security | Policy violations, runtime alerts | Kyverno + Falco |
| Applications | Custom metrics (if exposed) | Prometheus |
| Certificates | Expiration dates | cert-manager |
| Backups | Backup success/failure, last backup time | Velero |
Alert Routing
Alert Fires
│
▼
Alertmanager
│
├── P1 (Critical): PagerDuty → On-call engineer (immediate)
│ + Slack #incidents
│ + Customer notification (if customer-facing)
│
├── P2 (High): Slack #alerts → Acknowledged within 1h
│ + PagerDuty (business hours)
│
├── P3 (Medium): Slack #alerts → Next business day
│ + Ticket created automatically
│
└── P4 (Low): Slack #monitoring → Weekly review
+ Logged for trends
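The routing tree above maps onto an Alertmanager configuration roughly like the following sketch. The `severity` label values, receiver names, and channels are assumptions; real integration keys live outside this document:

```yaml
route:
  receiver: slack-monitoring            # default catch-all (P4)
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="P1"']
      receiver: pagerduty-critical      # pages on-call and posts to #incidents
      repeat_interval: 30m
    - matchers: ['severity="P2"']
      receiver: slack-alerts
    - matchers: ['severity="P3"']
      receiver: slack-alerts            # ticket creation assumed via a separate webhook

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
    slack_configs:
      - channel: "#incidents"
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
  - name: slack-monitoring
    slack_configs:
      - channel: "#monitoring"
```

Business-hours-only PagerDuty for P2 would need time-interval routing on top of this sketch.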
Standard Alert Rules
| Alert | Severity | Condition |
|---|---|---|
| Node down | P1 | Node unreachable > 5 min |
| Control plane unhealthy | P1 | API server / etcd unavailable |
| Cluster unreachable | P1 | Management cluster cannot reach workload cluster |
| Pod crash looping | P2 | Pod restarted > 5 times in 10 min |
| High CPU usage | P2 | Node CPU > 90% for 15 min |
| High memory usage | P2 | Node memory > 90% for 15 min |
| Disk usage critical | P2 | Disk > 85% full |
| Certificate expiring | P2 | Certificate expires < 14 days |
| Backup failure | P2 | Backup failed or missed schedule |
| Kyverno policy violation | P3 | Blocked resource creation |
| Falco security alert | P2/P3 | Runtime security event (severity-dependent) |
| High pod restart rate | P3 | > 10 restarts/hour across namespace |
| PV usage high | P3 | PersistentVolume > 80% full |
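Two of the rules above, expressed as a PrometheusRule sketch. Metric names assume kube-state-metrics and node_exporter as listed in the monitoring table; the rule names, namespace, and label scheme are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-health            # illustrative name
  namespace: monitoring        # assumed monitoring namespace
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDown
          # Node Ready condition false/unknown for more than 5 minutes
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: P1
          annotations:
            summary: "Node {{ $labels.node }} NotReady for more than 5 minutes"
        - alert: DiskUsageCritical
          # Filesystem more than 85% full, ignoring tmpfs/overlay mounts
          expr: |
            (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
              - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"})
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.85
          labels:
            severity: P2
          annotations:
            summary: "Disk on {{ $labels.instance }} is more than 85% full"
```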
Customer Dashboards (Grafana)
Each customer gets read-only access to:
- Cluster overview dashboard
- Node health dashboard
- Workload status dashboard
- Resource usage dashboard
- Certificate status dashboard
- Backup status dashboard
3. Cluster Upgrade Process
Upgrade Types
| Type | Frequency | Downtime | Process |
|---|---|---|---|
| Kubernetes minor version | Every 3-4 months | Zero (rolling) | Planned maintenance |
| Kubernetes patch version | Monthly | Zero (rolling) | Automated |
| Platform component updates | Monthly | Zero | GitOps |
| OS security patches | Weekly | Zero (rolling) | Automated |
| Emergency security patches | As needed | Minimal | Expedited |
Kubernetes Version Upgrade Process
1. New K8s version released
│
▼
2. Internal testing (1-2 weeks)
- Deploy on internal test cluster
- Run compatibility tests
- Validate platform components
│
▼
3. Staging rollout
- Upgrade customer staging/dev clusters first
- Monitor for 1 week
│
▼
4. Production rollout
- Schedule maintenance window (agreed with customer)
- Rolling upgrade: one node at a time
- Cordon → Drain → Upgrade → Uncordon
- Validate after each node
│
▼
5. Post-upgrade validation
- All pods healthy
- All services accessible
- Monitoring operational
- Backup operational
│
▼
6. Customer notification
- Upgrade complete confirmation
- Version change documented
Maintenance Windows
| Support Tier | Maintenance Window |
|---|---|
| Essential | Tuesday-Thursday, 02:00-06:00 CET |
| Business | Coordinated with customer, 48h notice |
| Enterprise | Customer-defined window, 1 week notice |
4. Backup & Restore Process
Backup Schedule
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| Full cluster backup (Velero) | Daily at 02:00 CET | 30 days | Hetzner StorageBox |
| etcd snapshot | Every 6 hours | 7 days | Local + remote |
| Persistent volume snapshots | Daily | 14 days | Hetzner Block Storage |
| Platform configuration (Git) | Every commit | Unlimited | Git history |
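The daily full backup from the table above corresponds to a Velero Schedule roughly like this sketch. Note the hedge on timezones: Velero evaluates cron expressions in the server's timezone (typically UTC), so 02:00 CET in winter is 01:00 UTC:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup      # illustrative name
  namespace: velero
spec:
  schedule: "0 1 * * *"        # 01:00 UTC ≈ 02:00 CET (no automatic DST shift)
  template:
    includedNamespaces: ["*"]  # full cluster backup
    snapshotVolumes: true
    storageLocation: default   # assumed to point at the StorageBox
    ttl: 720h                  # 30-day retention
```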
Restore Process
Restore Request
│
├── Partial restore (single namespace/app)
│ │
│ ▼
│        velero restore create --from-backup <backup-name> --include-namespaces <ns>
│        Duration: 5-30 minutes
│
└── Full cluster restore
│
▼
1. Provision new infrastructure
2. Bootstrap new cluster
3. Restore from Velero backup
4. Validate all components
5. Update DNS / ingress
Duration: 1-4 hours
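The partial restore can also be expressed declaratively as a Velero Restore resource, which is useful when restores should go through Git review. Names below are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-<ns>           # placeholder name
  namespace: velero
spec:
  backupName: <backup-name>    # the backup to restore from
  includedNamespaces:
    - <ns>
```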
Recovery Time Objectives
| Scenario | RTO | RPO |
|---|---|---|
| Single pod/deployment failure | < 5 min (auto-heal) | 0 |
| Single node failure | < 15 min (auto-replace) | 0 |
| Namespace restore | < 30 min | < 24 hours |
| Full cluster restore | < 4 hours | < 24 hours |
| Management cluster failure | < 2 hours | < 6 hours |
5. Node Replacement Process
Automatic (Self-Healing)
Node becomes unhealthy
│
▼
Rancher detects unhealthy node (5 min)
│
▼
Alert fired (P1)
│
▼
On-call engineer validates
│
▼
New node provisioned (automated)
│
▼
Workloads rescheduled automatically
│
▼
Old node cordoned and removed
│
▼
Incident report created
Manual (Planned)
Maintenance scheduled
│
▼
Cordon node (prevent new scheduling)
│
▼
Drain node (evict workloads gracefully)
│
▼
Perform maintenance / replace node
│
▼
Uncordon node / join new node
│
▼
Validate workloads rescheduled
6. Cluster Decommissioning Process
Trigger
- Customer contract ends
- Customer requests cluster removal
- Environment no longer needed
Process
Decommission Request
│
▼
1. Confirm with customer (written approval required)
│
▼
2. Final backup (retained for 90 days)
│
▼
3. Export customer data (provided to customer)
│
▼
4. Remove from monitoring and alerting
│
▼
5. Remove from ArgoCD and GitOps
│
▼
6. Remove from Rancher
│
▼
7. Destroy Kubernetes cluster
│
▼
8. Destroy infrastructure (servers, storage, networking)
│
▼
9. Archive tenant Git repositories
│
▼
10. Update billing (stop invoicing)
│
▼
11. Send decommission confirmation to customer
│
▼
12. After 90 days: delete final backup
Data Retention
| Data Type | Retention After Decommission |
|---|---|
| Customer workload data | 0 (exported to customer, then deleted) |
| Backup data | 90 days |
| Monitoring/log data | 30 days |
| Billing records | 10 years (German tax law) |
| Contracts | 10 years (German commercial law) |
| Git repository (archived) | 1 year |
Customer Onboarding — End-to-End Process
Overview
Customer onboarding is the most critical process for retention. A smooth onboarding sets the foundation for a long-term relationship. This process is designed for non-technical decision-makers and their teams.
Onboarding Timeline
Week 1 Week 2 Week 3 Week 4
│ │ │ │
├─ Kickoff ├─ Cluster ├─ App ├─ Go-Live
│ Meeting │ Provisioned │ Migration │ + Handover
│ │ │ │
├─ Requirements ├─ Access ├─ Training ├─ Support
│ Finalized │ Configured │ Sessions │ Transition
│ │ │ │
├─ Contract ├─ GitOps ├─ Dry Run ├─ 30-Day
│ Signed │ Setup │ Deployment │ Check-in
Total onboarding time: 2-4 weeks (depending on complexity)
Phase 1: Pre-Sales to Contract (Week 0)
Steps
| Step | Owner | Duration | Deliverable |
|---|---|---|---|
| Discovery call | Sales | 30 min | Customer needs understood |
| Technical assessment | CTO / Sr. Engineer | 1-2 hours | Feasibility confirmed |
| Architecture proposal | CTO | 1-2 days | Proposed setup document |
| Pricing proposal | Sales | 1 day | Commercial offer |
| Contract negotiation | Sales + Legal | 3-10 days | Signed contract (AVV + MSA) |
Required Documents (Germany)
| Document | Purpose |
|---|---|
| Master Service Agreement (MSA) | Main contract |
| Auftragsverarbeitungsvertrag (AVV) | GDPR data processing agreement — mandatory |
| Service Level Agreement (SLA) | Uptime and support commitments |
| Technical Specification | Cluster architecture, sizing |
| Pricing Schedule | Detailed cost breakdown |
Phase 2: Kickoff (Week 1)
Kickoff Meeting Agenda (90 minutes)
- Introductions (10 min)
- Platform team introduction
- Customer team introduction
- Roles and responsibilities
- Platform Overview (20 min)
- How the platform works (non-technical overview)
- What we manage vs. what the customer manages
- Rancher UI walkthrough
- Requirements Review (30 min)
- Cluster architecture confirmation
- Application inventory
- Network requirements
- Compliance requirements
- Integration requirements (CI/CD, monitoring)
- Access Setup (15 min)
- SSO configuration (customer's IdP)
- RBAC role mapping
- Who gets what access
- Timeline & Next Steps (15 min)
- Milestone dates
- Communication channels (Slack, email)
- Escalation contacts
Information We Need from Customer
| Item | Description | Urgency |
|---|---|---|
| Application list | What apps will run on the cluster | Week 1 |
| Container readiness | Are apps already containerized? | Week 1 |
| DNS domains | Customer domains for ingress | Week 1 |
| IdP details | OIDC/SAML configuration for SSO | Week 1 |
| Network requirements | IP ranges, firewall rules, VPN needs | Week 1 |
| Compliance requirements | Specific regulations (ISO, BAFIN, etc.) | Week 1 |
| Team contacts | Admin contacts, on-call contacts | Week 1 |
| CI/CD setup | Current CI/CD tools and workflows | Week 2 |
Phase 3: Cluster Provisioning (Week 2)
Actions
| Action | Responsible | Duration |
|---|---|---|
| Provision infrastructure | Platform Engineer | 1 hour |
| Bootstrap Kubernetes cluster | Platform Engineer | 30 min |
| Deploy platform components | Automated (ArgoCD) | 30 min |
| Configure SSO integration | Platform Engineer | 1-2 hours |
| Configure RBAC | Platform Engineer | 30 min |
| Setup GitOps repositories | Platform Engineer | 1 hour |
| Configure monitoring dashboards | Platform Engineer | 1 hour |
| Configure backup schedule | Platform Engineer | 30 min |
| Validation testing | Platform Engineer | 1 hour |
| Total | | ~1 day |
Customer Access Delivery
After provisioning, the customer receives:
| Item | How |
|---|---|
| Rancher UI access | SSO login URL + role assignment |
| Grafana dashboards | SSO login URL (read-only) |
| kubectl access | Rancher-provided kubeconfig |
| GitOps repository | GitHub/GitLab repo access |
| Support channel | Slack channel created |
| Documentation portal | Access to customer docs |
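Read-only access in the table above typically reduces to binding the customer's SSO group to Kubernetes' built-in view ClusterRole. A minimal sketch, assuming the group name is whatever claim the customer's IdP sends:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: customer-readonly      # illustrative name
subjects:
  - kind: Group
    name: customer-devops      # placeholder: group claim from the customer's IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                   # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Write access is granted the same way with a more permissive role, scoped per namespace via RoleBindings rather than cluster-wide.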
Phase 4: Application Migration (Week 3)
Migration Support Options
| Option | Description | Our Role |
|---|---|---|
| Self-service | Customer deploys their own apps | We provide docs + support |
| Guided migration | We help containerize and deploy | Hands-on assistance |
| Full migration | We containerize, deploy, validate | Full professional service |
Migration Steps (Guided)
- Application assessment: Review existing apps, dependencies
- Containerization: Create Dockerfiles, optimize images
- Kubernetes manifests: Create Deployments, Services, Ingress
- GitOps integration: Set up ArgoCD application definitions
- Staging deployment: Deploy to non-prod cluster first
- Testing: Validate functionality, performance, connectivity
- Production deployment: Deploy to production cluster
- DNS cutover: Point production DNS to new ingress
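For step 3 of the guided migration, a minimal Deployment and Service sketch is usually the starting point. Image, names, and resource figures below are placeholders; setting requests and limits up front also keeps workloads out of the missing-limits check in RB-009:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder
  namespace: example
spec:
  replicas: 2
  selector:
    matchLabels: { app: example-app }
  template:
    metadata:
      labels: { app: example-app }
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 500m, memory: 256Mi }
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: example
spec:
  selector: { app: example-app }
  ports:
    - port: 80
      targetPort: 8080
```

An Ingress (with a cert-manager annotation for TLS) and an ArgoCD Application definition complete the set before staging deployment.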
Phase 5: Go-Live & Handover (Week 4)
Go-Live Checklist
- [ ] All applications deployed and healthy
- [ ] Monitoring dashboards showing data
- [ ] Alerts configured and tested
- [ ] Backups running successfully
- [ ] SSL certificates active and auto-renewing
- [ ] Customer team has access (SSO working)
- [ ] Customer team trained on basics
- [ ] Support channel active
- [ ] Runbooks for customer-specific scenarios created
- [ ] DNS cutover completed
- [ ] Load testing passed (if applicable)
Training Sessions
| Session | Duration | Audience | Content |
|---|---|---|---|
| Platform Overview | 2 hours | All team members | Rancher UI, dashboards, basic operations |
| Deployment Workflow | 2 hours | DevOps / Developers | GitOps workflow, ArgoCD, CI/CD integration |
| Monitoring & Alerting | 1 hour | DevOps / Ops | Grafana dashboards, alert interpretation |
| Incident Reporting | 1 hour | All team members | How to report issues, severity levels |
30-Day Check-in
Scheduled 30 days after go-live:
- Review platform health
- Address any issues
- Discuss expansion needs
- Gather feedback
- Adjust monitoring/alerting if needed
Onboarding Success Metrics
| Metric | Target |
|---|---|
| Time to first cluster | < 5 business days |
| Time to go-live | < 4 weeks |
| Customer satisfaction (NPS) at 30 days | > 8/10 |
| Support tickets in first 30 days | < 5 |
| Onboardings with zero P1 incidents | 100% |
Support, Escalation & Incident Management
Support Model Overview
Support Tiers
| Tier | Availability | Channels | Response SLA | Included With |
|---|---|---|---|---|
| Essential | Mon-Fri 9:00-18:00 CET | Email, Ticket | P1: 4h, P2: 8h, P3: 24h, P4: 48h | All plans |
| Business | Mon-Fri 7:00-22:00 CET | Email, Ticket, Slack | P1: 1h, P2: 4h, P3: 8h, P4: 24h | €800/mo add-on |
| Enterprise | 24/7/365 | Email, Ticket, Slack, Phone | P1: 15min, P2: 1h, P3: 4h, P4: 8h | €2,500/mo add-on |
What's Supported vs. Not Supported
| Supported (In Scope) | Not Supported (Out of Scope) |
|---|---|
| Kubernetes cluster operations | Application code debugging |
| Platform component issues | Customer application logic |
| Node failures and replacements | Database query optimization |
| Network policy configuration | Custom application performance tuning |
| Security policy management | Third-party software support |
| Backup and restore operations | CI/CD pipeline development |
| Monitoring and alerting setup | Application architecture consulting* |
| Kubernetes version upgrades | Custom development* |
| SSL certificate management | Training beyond onboarding* |
*Available as Professional Services at additional cost
Severity Levels
P1 — Critical
| Attribute | Value |
|---|---|
| Definition | Production cluster down, data loss risk, security breach |
| Business Impact | Customer's production services unavailable |
| Examples | Control plane failure, all nodes down, data corruption, active security incident |
| Response Time | Essential: 4h / Business: 1h / Enterprise: 15min |
| Resolution Target | 4 hours |
| Communication | Every 30 minutes until resolved |
| Escalation | Automatic to CTO after 2 hours |
P2 — High
| Attribute | Value |
|---|---|
| Definition | Significant degradation, partial outage, failed backups |
| Business Impact | Some services impacted, workaround may exist |
| Examples | Single node failure, high error rates, backup failure, cert expiring < 48h |
| Response Time | Essential: 8h / Business: 4h / Enterprise: 1h |
| Resolution Target | 8 hours |
| Communication | Every 2 hours until resolved |
| Escalation | Automatic to Platform Lead after 4 hours |
P3 — Medium
| Attribute | Value |
|---|---|
| Definition | Non-critical issue, minor impact |
| Business Impact | Minor inconvenience, no production impact |
| Examples | Dashboard issue, non-critical alert, minor config change needed |
| Response Time | Essential: 24h / Business: 8h / Enterprise: 4h |
| Resolution Target | 3 business days |
| Communication | Daily update |
| Escalation | Manual after 2 business days |
P4 — Low
| Attribute | Value |
|---|---|
| Definition | Question, feature request, minor improvement |
| Business Impact | None |
| Examples | Documentation question, dashboard customization, feature request |
| Response Time | Essential: 48h / Business: 24h / Enterprise: 8h |
| Resolution Target | 5 business days |
| Communication | Upon completion |
| Escalation | None (scheduled backlog) |
Incident Management Process
Incident Lifecycle
Detection
│
├── Automated (monitoring alert)
│ └── PagerDuty pages on-call engineer
│
└── Customer-reported
└── Ticket/Slack/Phone → Triaged by support
│
▼
Triage (5 minutes)
│
├── Assign severity (P1-P4)
├── Assign owner (on-call or support engineer)
└── Create incident ticket
│
▼
Investigation (varies by severity)
│
├── Check monitoring dashboards
├── Review logs (Loki/Grafana)
├── Check recent changes (ArgoCD/Git)
├── Check infrastructure health (Rancher)
└── Identify root cause
│
▼
Resolution
│
├── Apply fix (manual or via GitOps)
├── Validate fix
├── Monitor for recurrence
└── Confirm with customer
│
▼
Post-Incident
│
├── Update ticket with resolution
├── Postmortem (P1/P2 only)
├── Action items created
└── Customer communication sent
Incident Communication Templates
P1 Initial Notification (to customer)
Subject: [P1] Incident — {Cluster Name} — {Brief Description}
We are aware of an issue affecting your {cluster/service}.
Impact: {Description of impact}
Status: Investigating
Next update: In 30 minutes
Our team is actively working on resolution.
P1 Update
Subject: [P1] Update — {Cluster Name} — {Brief Description}
Status: {Investigating / Identified / Mitigated / Resolved}
Update: {What we've done since last update}
Next steps: {What we're doing next}
Next update: In 30 minutes
P1 Resolution
Subject: [P1] Resolved — {Cluster Name} — {Brief Description}
The incident has been resolved.
Root cause: {Brief explanation}
Resolution: {What was done}
Duration: {Start time — End time}
A detailed postmortem will be shared within 3 business days.
Escalation Matrix
Automatic Escalation (Time-Based)
| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| P1 | On-call engineer (0 min) | Platform Lead (30 min) | CTO (2h) | CEO (4h) |
| P2 | Support engineer (0 min) | Sr. Platform Eng (2h) | Platform Lead (4h) | CTO (8h) |
| P3 | Support engineer (0 min) | Sr. Platform Eng (1 day) | Platform Lead (2 days) | — |
| P4 | Support engineer (0 min) | — | — | — |
Manual Escalation (Customer-Initiated)
Customers can request escalation at any time through:
- Slack: Mention @platform-lead
- Email: escalation@platform.example.com
- Phone (Enterprise only): Dedicated escalation number
Escalation Responsibilities
| Level | Role | Responsibility |
|---|---|---|
| L1 | On-call / Support Engineer | First response, initial diagnosis, known-issue resolution |
| L2 | Senior Platform Engineer | Complex troubleshooting, infrastructure issues |
| L3 | Platform Lead / CTO | Architecture decisions, emergency changes, vendor escalation |
| L4 | CEO | Customer executive communication, business decisions |
On-Call Process
On-Call Rotation
| Parameter | Value |
|---|---|
| Rotation length | 1 week (Monday 09:00 — Monday 09:00) |
| Team size for rotation | Minimum 3 engineers |
| Handoff process | 15-min sync at rotation change |
| Primary + Secondary | Always 2 engineers on-call |
| Tool | PagerDuty |
On-Call Expectations
| Expectation | Requirement |
|---|---|
| Acknowledgement time | < 5 minutes (P1), < 15 minutes (P2) |
| Availability | Reachable by phone at all times |
| Response capability | Laptop + internet access within 15 minutes |
| Escalation | If unable to resolve in 30 min, escalate to L2 |
| Documentation | Log all actions in incident ticket |
On-Call Compensation (Germany)
| Item | Compensation |
|---|---|
| Weekday on-call standby | €50/night |
| Weekend on-call standby | €100/day |
| Public holiday standby | €150/day |
| Actual incident work (outside business hours) | €75/hour |
| Rest time after night incident (> 2h work) | Next morning off |
Postmortem Process
When Required
- All P1 incidents
- P2 incidents with customer impact
- Any incident with data loss
- Any security incident
- Any incident lasting > 4 hours
Postmortem Template
# Postmortem: {Incident Title}
## Summary
- Date: {YYYY-MM-DD}
- Duration: {X hours Y minutes}
- Severity: {P1/P2}
- Affected customers: {List}
- Impact: {Description}
## Timeline (CET)
- HH:MM — {Event}
- HH:MM — {Event}
- ...
## Root Cause
{Detailed technical explanation}
## Resolution
{What was done to resolve}
## What Went Well
- {Item}
## What Went Wrong
- {Item}
## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| {Action} | {Person} | {Date} | Open |
## Lessons Learned
{Key takeaways}
Postmortem Timeline
| Step | Deadline |
|---|---|
| Draft postmortem | 24 hours after resolution |
| Internal review | 48 hours after resolution |
| Share with customer | 3 business days after resolution |
| Action items tracked | Ongoing, reviewed weekly |
Support Metrics & KPIs
| Metric | Target | Measurement |
|---|---|---|
| First response time (P1) | < SLA | PagerDuty / Ticket system |
| Mean time to acknowledge (MTTA) | < 10 min (P1) | PagerDuty |
| Mean time to resolve (MTTR) | P1: < 4h, P2: < 8h | Ticket system |
| SLA compliance | > 99.5% | Monthly report |
| Customer satisfaction (CSAT) | > 4.5/5 | Post-ticket survey |
| Incidents per customer per month | < 1 | Monthly report |
| Postmortems completed on time | 100% | Tracked in tickets |
| On-call alert noise | < 5 alerts/week per cluster | PagerDuty analytics |
Operational Runbooks
Runbook Index
| ID | Runbook | Severity | Trigger |
|---|---|---|---|
| RB-001 | Node Failure | P1/P2 | Node unreachable alert |
| RB-002 | Control Plane Failure | P1 | API server / etcd alert |
| RB-003 | Cluster Unreachable | P1 | Management cluster cannot reach workload cluster |
| RB-004 | Certificate Expiration | P2 | cert-manager alert |
| RB-005 | Backup Failure | P2 | Velero backup failure alert |
| RB-006 | Storage Failure | P1/P2 | PV/disk alert |
| RB-007 | Network Failure | P1/P2 | CNI / connectivity alert |
| RB-008 | Cluster Upgrade Failure | P2 | Upgrade process error |
| RB-009 | High Resource Usage | P2/P3 | CPU/Memory/Disk threshold alert |
| RB-010 | Security Incident | P1 | Falco / Kyverno alert |
| RB-011 | Pod Crash Loop | P2/P3 | Pod restart alert |
| RB-012 | DNS Resolution Failure | P2 | DNS health check failure |
| RB-013 | Ingress Failure | P1/P2 | Ingress controller down |
| RB-014 | etcd Restore | P1 | Data corruption / etcd failure |
| RB-015 | Full Cluster Restore | P1 | Complete cluster loss |
RB-001: Node Failure
Alert
KubernetesNodeNotReady — Node has been in NotReady state for > 5 minutes
Diagnosis Steps
# 1. Check node status from management cluster
kubectl get nodes -o wide
# 2. Check node conditions
kubectl describe node <node-name>
# 3. Check if node is reachable via SSH
ssh <node-ip> "uptime"
# 4. Check system services
ssh <node-ip> "systemctl status rke2-agent" # or rke2-server for control plane
# 5. Check system resources
ssh <node-ip> "df -h && free -m && uptime"
# 6. Check Rancher cluster status
# Navigate to Rancher UI → Cluster → Nodes
Resolution Steps
If node is reachable but K8s service is down:
# Restart RKE2 agent
ssh <node-ip> "systemctl restart rke2-agent"
# Wait 2-3 minutes, verify
kubectl get nodes
If node is unreachable (hardware/network failure):
# 1. Cordon the node to prevent scheduling
kubectl cordon <node-name>
# 2. Drain workloads (if possible)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# 3. Provision replacement node via Terraform
cd infrastructure/terraform/<customer>
terraform apply -target=module.worker_node_<n>
# 4. Join new node to cluster (automated via Rancher)
# 5. Verify new node is Ready
kubectl get nodes
# 6. Remove old node
kubectl delete node <old-node-name>
Customer Communication
- P1 if control plane node and cluster has < 3 control plane nodes
- P2 if worker node (workloads auto-rescheduled)
- Notify customer via Slack/email within 15 minutes
RB-002: Control Plane Failure
Alert
KubernetesAPIServerDown or EtcdClusterUnhealthy
Diagnosis Steps
# 1. Check if API server responds
kubectl cluster-info
curl -k https://<api-server-ip>:6443/healthz
# 2. Check etcd health
ssh <control-plane-node> "ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
endpoint health"
# 3. Check RKE2 server logs
ssh <control-plane-node> "journalctl -u rke2-server -n 100"
# 4. Check all control plane components
ssh <control-plane-node> "crictl ps | grep -E 'kube-api|etcd|controller|scheduler'"
Resolution Steps
If single control plane node failure (HA cluster with 3 CP nodes):
- Cluster continues to operate with 2/3 nodes
- Follow RB-001 to replace the failed node
- Urgency: P2 (cluster still operational)
If etcd quorum lost (2/3 nodes down):
# 1. This is a P1 — cluster is down
# 2. Attempt to recover at least one more etcd member
ssh <surviving-node> "systemctl restart rke2-server"
# 3. If recovery fails, restore from etcd snapshot
# See RB-014: etcd Restore
If all control plane nodes down:
- Follow RB-015: Full Cluster Restore
- P1 — all hands on deck
RB-003: Cluster Unreachable
Alert
ManagedClusterUnreachable — Rancher cannot communicate with downstream cluster
Diagnosis Steps
# 1. Check Rancher UI — is cluster showing disconnected?
# 2. Check if cluster nodes are reachable from management network
ping <cluster-node-ip>
ssh <cluster-node-ip> "kubectl get nodes"
# 3. Check Rancher agent on downstream cluster
ssh <cluster-node> "kubectl -n cattle-system get pods"
ssh <cluster-node> "kubectl -n cattle-system logs deployment/cattle-cluster-agent"
# 4. Check network connectivity (firewall, VPN, routing)
ssh <cluster-node> "curl -k https://<rancher-url>/healthz"
# 5. Check if it's a Rancher issue (all clusters affected?)
# Check Rancher UI for other cluster statuses
Resolution Steps
Network issue:
# Check and fix firewall rules
# Verify VPN tunnel is up (if applicable)
# Check load balancer health
Rancher agent issue:
# Restart cattle-cluster-agent
ssh <cluster-node> "kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent"
Rancher server issue:
# Check Rancher pods on management cluster
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deployment/rancher
RB-004: Certificate Expiration
Alert
CertificateExpiringSoon — Certificate expires in < 14 days
Diagnosis Steps
# 1. Check which certificate is expiring
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
# 2. Check cert-manager logs
kubectl -n cert-manager logs deployment/cert-manager
# 3. Check certificate details
kubectl get secret <cert-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -dates
Resolution Steps
cert-manager renewal failure:
# 1. Check cert-manager challenges
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n <namespace>
# 2. Common issues:
# - DNS challenge: Check DNS provider credentials
# - HTTP challenge: Check ingress is accessible
# - Rate limits: Let's Encrypt rate limit (wait or use staging)
# 3. Delete and recreate certificate if needed
kubectl delete certificate <cert-name> -n <namespace>
# ArgoCD will recreate it from Git
RKE2 internal certificates:
# RKE2 auto-rotates internal certs, but if needed:
ssh <control-plane-node> "rke2 certificate rotate"
ssh <control-plane-node> "systemctl restart rke2-server"
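For context when deleting and letting ArgoCD recreate a certificate, the Git-managed resource looks roughly like the sketch below. The issuer name and DNS name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls                # placeholder
  namespace: <namespace>
spec:
  secretName: app-tls          # Secret cert-manager writes the key pair into
  dnsNames:
    - app.customer.example.com # placeholder domain
  issuerRef:
    name: letsencrypt-prod     # assumed ClusterIssuer name
    kind: ClusterIssuer
```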
RB-005: Backup Failure
Alert
VeleroBackupFailed — Scheduled backup did not complete
Diagnosis Steps
# 1. Check Velero backup status
kubectl -n velero get backups
kubectl -n velero describe backup <backup-name>
# 2. Check Velero logs
kubectl -n velero logs deployment/velero
# 3. Check backup storage location
kubectl -n velero get backupstoragelocation
kubectl -n velero describe backupstoragelocation default
# 4. Check storage connectivity
# Verify StorageBox credentials and connectivity
Resolution Steps
Storage connectivity issue:
# Check and update storage credentials
kubectl -n velero get secret cloud-credentials -o yaml
# Verify storage endpoint is reachable
kubectl -n velero exec deployment/velero -- \
wget -q --spider <storage-endpoint>
Volume snapshot failure:
# Check volume snapshot class
kubectl get volumesnapshotclass
# Check if PV supports snapshots
kubectl get pv <pv-name> -o yaml
# Manual backup trigger
velero backup create manual-backup-$(date +%Y%m%d) \
--include-namespaces <namespace>
RB-009: High Resource Usage
Alert
NodeCPUHigh (> 90%), NodeMemoryHigh (> 90%), NodeDiskHigh (> 85%)
Diagnosis Steps
# 1. Identify which node is affected
kubectl top nodes
# 2. Find resource-hungry pods
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# 3. Check for resource limits
kubectl get pods -A -o json | jq '.items[] |
select(.spec.containers[].resources.limits == null) |
.metadata.namespace + "/" + .metadata.name'
# 4. Check for eviction pressure
kubectl describe node <node-name> | grep -A5 Conditions
Resolution Steps
Short-term (immediate relief):
# Identify and scale down non-critical workloads
kubectl -n <namespace> scale deployment <name> --replicas=1
# Evict pods from overloaded node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Long-term:
# Add worker node
cd infrastructure/terraform/<customer>
# Increase node count and apply
# Or resize existing nodes (requires node replacement)
# Update Terraform with larger instance type
RB-010: Security Incident
Alert
FalcoSecurityAlert — Runtime security event detected
IMMEDIATE Actions (First 5 Minutes)
1. DO NOT delete evidence
2. Assess scope — which cluster, namespace, pod
3. Determine if active attack or false positive
4. If active attack:
a. Isolate affected pod/namespace (network policy)
b. Do NOT delete the pod (preserve forensic data)
c. Escalate to Security Engineer immediately
d. Notify CTO
Diagnosis Steps
# 1. Check Falco alerts
kubectl -n falco logs daemonset/falco | grep -i "Warning\|Error\|Critical"
# 2. Check what triggered the alert
# Falco alert will contain:
# - Rule name
# - Output fields (container, process, file, network)
# - Priority
# 3. Inspect the suspicious pod
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name>
# 4. Check Kyverno violations
kubectl get policyreport -A
# 5. Check network flows (Hubble/Cilium)
hubble observe --namespace <namespace> --pod <pod-name>
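If Falco is configured with `json_output: true` (an assumption; check this cluster's Falco Helm values), alerts can be filtered by priority instead of grepping raw text. `fromjson?` silently skips any non-JSON log lines:

```shell
# Assumes Falco emits JSON events (json_output: true).
# time, rule, priority and output are standard Falco event fields.
kubectl -n falco logs daemonset/falco --since=1h \
  | jq -Rr 'fromjson?
            | select(.priority == "Critical" or .priority == "Warning")
            | .time + " " + .rule + " :: " + .output'
```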
Containment Steps
# 1. Isolate the namespace with network policy
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: isolate-namespace
namespace: <namespace>
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
# 2. Capture pod state for forensics
mkdir -p forensics
kubectl -n <namespace> get pod <pod-name> -o yaml > forensics/pod-state.yaml
kubectl -n <namespace> logs <pod-name> > forensics/pod-logs.txt
# 3. If confirmed malicious — kill the pod
kubectl -n <namespace> delete pod <pod-name>
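When isolating the entire namespace would take down unaffected workloads, a narrower variant of step 1 quarantines just the suspect pod. This is a sketch; the `quarantine` label is a convention invented here, not a platform default:

```shell
# Label only the suspect pod, then deny all traffic to/from that label.
# The pod keeps running, so forensic data is preserved.
kubectl -n <namespace> label pod <pod-name> quarantine=true

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-pod
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
  - Ingress
  - Egress
EOF
```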
Customer Communication
- Always notify customer of confirmed security incidents
- Provide incident report within 24 hours
- Full postmortem within 5 business days
- GDPR notification if personal data involved (72-hour deadline)
RB-014: etcd Restore
When Needed
- etcd data corruption
- etcd quorum permanently lost
- Accidental deletion of critical resources
Restore from Snapshot
# 1. Stop RKE2 on all control plane nodes
for node in cp1 cp2 cp3; do
ssh $node "systemctl stop rke2-server"
done
# 2. Find latest snapshot
ssh cp1 "ls -la /var/lib/rancher/rke2/server/db/snapshots/"
# 3. Restore on first control plane node
ssh cp1 "rke2 server \
--cluster-reset \
--cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>"
# 4. Start RKE2 on first node
ssh cp1 "systemctl start rke2-server"
# 5. Wait for first node to be ready (on the node itself, kubectl and its
# kubeconfig live under RKE2's own paths)
ssh cp1 "/var/lib/rancher/rke2/bin/kubectl \
--kubeconfig /etc/rancher/rke2/rke2.yaml get nodes"
# 6. Remove old etcd data on other nodes
for node in cp2 cp3; do
ssh $node "rm -rf /var/lib/rancher/rke2/server/db/etcd"
done
# 7. Restart RKE2 on other nodes (they will rejoin)
for node in cp2 cp3; do
ssh $node "systemctl start rke2-server"
done
# 8. Verify cluster health
kubectl get nodes
kubectl get pods -A
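Beyond step 8, a couple of extra checks confirm that etcd actually regained quorum across all three control plane nodes. A sketch, run on a control plane node (RKE2 places its admin kubeconfig at /etc/rancher/rke2/rke2.yaml and its binaries under /var/lib/rancher/rke2/bin):

```shell
# Make RKE2's kubectl and kubeconfig available in this shell.
export PATH=$PATH:/var/lib/rancher/rke2/bin
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# All control plane nodes should report Ready
kubectl get nodes -o wide

# Expect one Running etcd static pod per control plane node
kubectl -n kube-system get pods -o wide | grep etcd
```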
RB-015: Full Cluster Restore
When Needed
- Complete cluster loss (all nodes destroyed)
- Catastrophic infrastructure failure
- Disaster recovery scenario
Process
1. Assess damage — what is lost, what is recoverable
│
▼
2. Provision new infrastructure (Terraform)
Duration: 30-60 minutes
│
▼
3. Bootstrap new RKE2 cluster (Rancher)
Duration: 15-30 minutes
│
▼
4. Deploy platform components (ArgoCD)
Duration: 15-30 minutes
│
▼
5. Restore from Velero backup
velero restore create --from-backup <latest-backup>
Duration: 30-120 minutes (depends on data size)
│
▼
6. Validate all workloads
- Check all deployments
- Check all services
- Check persistent data
Duration: 30-60 minutes
│
▼
7. Update DNS and ingress
Duration: 5-15 minutes (+ DNS propagation)
│
▼
8. Notify customer — service restored
│
▼
9. Full postmortem within 3 business days
Total Expected Recovery Time: 2-4 hours
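The workload validation in step 6 can be partially automated. A sketch that flags any Deployment whose ready replica count still lags its desired count after the Velero restore, plus any PVC that has not re-bound:

```shell
# Deployments whose readyReplicas < desired replicas (missing counts as 0)
kubectl get deployments -A -o json | jq -r '
  .items[]
  | select((.status.readyReplicas // 0) < (.spec.replicas // 0))
  | .metadata.namespace + "/" + .metadata.name'

# PVCs not yet Bound (with -A, STATUS is the third column)
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'
```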
Runbook Maintenance
| Activity | Frequency | Owner |
|---|---|---|
| Review all runbooks | Quarterly | Platform Lead |
| Update after incidents | After every P1/P2 | Incident owner |
| Test DR runbooks (RB-014, RB-015) | Quarterly | SRE team |
| Add new runbooks | As new failure modes discovered | Platform team |
| Customer-specific runbooks | At onboarding + annually | Customer Success |
Service Level Agreement (SLA) & Support Tiers
Platform Availability SLA
Uptime Commitment
| Component | Essential | Business | Enterprise |
|---|---|---|---|
| Management Platform (Rancher, ArgoCD) | 99.5% | 99.9% | 99.95% |
| Customer Cluster Control Plane | 99.5% | 99.9% | 99.95% |
| Monitoring & Alerting | 99.0% | 99.5% | 99.9% |
| Backup Operations | 99.0% | 99.5% | 99.9% |
What Uptime Means
- 99.5% = max 3.65 hours downtime/month
- 99.9% = max 43.8 minutes downtime/month
- 99.95% = max 21.9 minutes downtime/month
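The downtime figures follow directly from the percentages, assuming an average month of 730 hours (43,800 minutes). A quick sketch to reproduce them:

```shell
# Convert an uptime SLA percentage into maximum allowed downtime per month.
# 99.50% -> 219.0 min (3.65 h), 99.90% -> 43.8 min, 99.95% -> 21.9 min
for sla in 99.5 99.9 99.95; do
  awk -v s="$sla" 'BEGIN { printf "%.2f%% -> %.1f min/month\n", s, (100 - s) / 100 * 43800 }'
done
```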
Exclusions
The following are NOT counted as downtime:
- Scheduled maintenance (with agreed notice period)
- Customer-caused outages (misconfigurations, resource exhaustion)
- Force majeure (natural disasters, war, pandemic)
- Third-party provider outages (Hetzner, DNS providers)
- Customer application issues
SLA Credits
If we miss the uptime SLA, customers receive service credits:
| Uptime Achieved | Credit (% of Monthly Fee) |
|---|---|
| 99.0% - 99.49% | 10% |
| 98.0% - 98.99% | 25% |
| 95.0% - 97.99% | 50% |
| < 95.0% | 100% |
Credit Process
- Customer submits credit request within 30 days of incident
- We validate against monitoring data
- Credit applied to next invoice
- Maximum credit: 100% of one month's platform management fee
- Credits do not apply to infrastructure pass-through costs
Support Tier Comparison
Essential Support (Included)
| Feature | Detail |
|---|---|
| Price | Included with all plans |
| Availability | Monday-Friday, 09:00-18:00 CET |
| Channels | Email, Ticket portal |
| P1 Response | 4 hours |
| P2 Response | 8 hours |
| P3 Response | 24 hours |
| P4 Response | 48 hours |
| Named contacts | 2 |
| Monthly review | No |
| Dedicated Slack | No |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.5% |
Business Support (€800/month)
| Feature | Detail |
|---|---|
| Price | €800/month |
| Availability | Monday-Friday, 07:00-22:00 CET |
| Channels | Email, Ticket portal, Slack |
| P1 Response | 1 hour |
| P2 Response | 4 hours |
| P3 Response | 8 hours |
| P4 Response | 24 hours |
| Named contacts | 5 |
| Monthly review | Yes (30 min) |
| Dedicated Slack | Yes (shared channel) |
| Phone support | No |
| Dedicated engineer | No |
| Uptime SLA | 99.9% |
Enterprise Support (€2,500/month)
| Feature | Detail |
|---|---|
| Price | €2,500/month |
| Availability | 24/7/365 |
| Channels | Email, Ticket portal, Slack, Phone |
| P1 Response | 15 minutes |
| P2 Response | 1 hour |
| P3 Response | 4 hours |
| P4 Response | 8 hours |
| Named contacts | Unlimited |
| Monthly review | Yes (upgraded to a weekly 30 min sync) |
| Dedicated Slack | Yes (with SLA on responses) |
| Phone support | Yes (dedicated number) |
| Dedicated engineer | Yes (named contact) |
| Uptime SLA | 99.95% |
| Custom maintenance windows | Yes |
| Priority upgrade scheduling | Yes |
| Quarterly business review | Yes |
Maintenance Windows
Scheduled Maintenance
| Support Tier | Notice Period | Window |
|---|---|---|
| Essential | 48 hours | Tue-Thu, 02:00-06:00 CET |
| Business | 5 business days | Agreed with customer |
| Enterprise | 10 business days | Customer-defined |
Emergency Maintenance
| Condition | Notice | Approval |
|---|---|---|
| Critical security patch | Best effort (min 2 hours) | No customer approval needed |
| Data integrity risk | Best effort (min 1 hour) | No customer approval needed |
| Non-critical but urgent | 24 hours | Notification only |
Operational Commitments
Backup Guarantees
| Commitment | Target |
|---|---|
| Daily backup execution | 99.5% success rate |
| Backup data retention | 30 days minimum |
| Restore test (upon request) | Within 2 business days |
| Full cluster restore | < 4 hours RTO |
| Data loss maximum | < 24 hours RPO |
Security Commitments
| Commitment | Target |
|---|---|
| Security patch deployment | Critical: < 24 hours, High: < 72 hours |
| Vulnerability scanning | Weekly |
| Security incident notification | < 1 hour after detection |
| GDPR breach notification | < 72 hours (as per law) |
Platform Update Commitments
| Commitment | Target |
|---|---|
| Kubernetes version support | N-2 minor versions |
| Kubernetes upgrade after GA release | Within 60 days |
| Platform component updates | Monthly |
| OS security patches | Weekly |
Reporting
Monthly Platform Report (All Tiers)
Delivered by 5th business day of each month:
- Cluster uptime percentage
- Number of incidents by severity
- Backup success rate
- Resource utilization trends
- Security scan results summary
Monthly Review Meeting (Business + Enterprise)
Agenda:
- Uptime and incident review
- Capacity and performance review
- Security posture review
- Upcoming maintenance and upgrades
- Customer requests and roadmap
Quarterly Business Review (Enterprise Only)
Agenda:
- All monthly review items
- Strategic infrastructure planning
- Cost optimization recommendations
- Technology roadmap alignment
- Contract and SLA review