How to ensure cloud operations delivers on performance, cost and availability

By Subha Shrinivasan, VP of Customer Success, Rakuten Symphony
June 21, 2023 (5 minute read)

Cloud operations is the art of orchestrating and managing a Kubernetes (K8s) cluster after deployment such that it delivers on the continuous availability, performance and cost objectives of the company. In addition, it is the everyday management and monitoring of the K8s cluster deployed across data centers. 

Six principles contribute to efficient cloud operations and orchestration, and organizations must ensure that they have the skills and tools to deliver them all.

Observability (OBF) 

An OBF dashboard serves as the window into every event in the cluster. Therefore, organizations must build custom dashboards that offer a view of health, capacity, software changes, audits and insightful analytics into the K8s cluster. 

An OBF framework must be designed and built using Kubernetes-native tools or external tools built for the purpose.

  1. kube-state-metrics is a Kubernetes add-on service (distinct from the metrics server) that listens to the API server and exposes metrics about the state of K8s objects
  2. Prometheus and Grafana are external tools: Prometheus pulls essential metrics from exporters, which often run as sidecars, and stores them as time series, and Grafana displays them as graphs
  3. Liveness and readiness probes at the pod level for availability and health checks
  4. Kubernetes exports several events which can be pulled through built-in APIs and displayed on a custom dashboard 

The data assimilated from various sources can then be ingested into a data lake, pushed into streaming pipelines such as Kafka, and then visualized on a dashboard. Triggers and alarms can then be set to act upon critical changes in the cluster. 
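As an illustration of the triggers described above, here is a minimal Python sketch of one alert rule: fire only when a metric stays above a threshold for several consecutive samples, so that a single spike does not page anyone. The `Sample` type, the `should_alert` function and the threshold values are hypothetical names chosen for this example, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since epoch (assumed unit)
    value: float      # e.g. CPU utilisation as a fraction between 0.0 and 1.0


def should_alert(samples, threshold, min_consecutive):
    """Return True if the newest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    # Sort by time so the rule works even if samples arrive out of order.
    recent = sorted(samples, key=lambda s: s.timestamp)[-min_consecutive:]
    return all(s.value > threshold for s in recent)


cpu = [Sample(1, 0.50), Sample(2, 0.95), Sample(3, 0.97), Sample(4, 0.99)]
print(should_alert(cpu, threshold=0.9, min_consecutive=3))
```

In practice a rule like this would usually live in the alerting layer of the monitoring stack rather than in custom code; the sketch only shows the logic.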

Day 2 operations  

Day 2 operations of the cluster are the decisions made regularly based on the inputs from OBF, together with the manual housekeeping tasks done on the cluster. In large-scale deployments, however, manually maintaining and operating hundreds of thousands of cluster elements (microservices, pods and infrastructure) is challenging.

Automation should focus primarily on scaling applications in and out, redeployment, capacity management, failover and recovery, and auto-healing.

The automation of Day 2 operations must focus on continuous availability and eliminating the need for human intervention in known failure scenarios. This approach helps drive predictability and acts as a blueprint for future deployments. 
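To make the scale in and out decision concrete, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with a small tolerance band to avoid flapping. The function name, the replica bounds and the example numbers are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        wanted = current_replicas
    else:
        wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas at 90% average CPU against a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

Encoding the rule once, rather than deciding by hand each time, is what gives the predictability the section calls for.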

Change management  

Most outages in a running cluster are caused by a change executed without a comprehensive plan. This includes changes to hardware and software configurations, upgrades, the network, the storage layer and pod redeployments.

A sound change management system and change control process can prevent massive outages after a change window. No change must ever be executed on a live production system without the following:

  • Validation process: All changes validated in a lower environment
  • Change management committee: Review and approval of CM procedures
  • Rollback plan: If applied changes fail
  • Continuity of services: Pre- and post-check scripts to capture the before and after state
  • Fallback cluster or environment: In the event of failure  
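The pre- and post-check item in the list above can be sketched as a comparison of two state snapshots. This minimal Python example assumes each snapshot is a plain dictionary mapping a workload name to its status; the function names, the status strings and the sample workloads are hypothetical.

```python
def diff_state(before, after):
    """Return {name: (old_status, new_status)} for every workload that changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name, "<absent>")
        new = after.get(name, "<absent>")
        if old != new:
            changes[name] = (old, new)
    return changes


def change_is_safe(before, after, healthy="Running"):
    """True when everything that was healthy before the change is still healthy."""
    return all(after.get(name) == healthy
               for name, status in before.items() if status == healthy)


before = {"api": "Running", "db": "Running", "batch": "Pending"}
after = {"api": "Running", "db": "CrashLoopBackOff", "cache": "Running"}

print(diff_state(before, after))
print(change_is_safe(before, after))  # a False result would trigger the rollback plan
```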

Cluster maintenance 

Cluster maintenance is maintaining all the hardware and software in a K8s cluster at the correct versions, cleaning up unwanted jobs and freeing up resources for other requirements. 

Kubernetes is evolving continuously, and new releases arrive at a rapid pace. At a minimum, the operations team needs to keep the clusters no more than two releases behind the latest (N-2), as many bug fixes go into these releases. On top of this, the OS, hardware and firmware also need to be kept up to date.
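The N-2 rule is easy to check automatically. The sketch below assumes that "release" means a Kubernetes minor version (for example 1.27) and that versions are written like `v1.27.3`; both the interpretation and the function names are assumptions made for this example.

```python
def parse_minor(version):
    """Extract (major, minor) from a version string such as 'v1.27.3'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def within_n_minus(cluster_version, latest_version, max_lag=2):
    """True if the cluster is at most `max_lag` minor releases behind the latest."""
    c_major, c_minor = parse_minor(cluster_version)
    l_major, l_minor = parse_minor(latest_version)
    if c_major != l_major:
        # Assumption: a major-version gap always counts as out of policy.
        return False
    return (l_minor - c_minor) <= max_lag


print(within_n_minus("v1.25.4", "v1.27.3"))  # two minors behind: still in policy
print(within_n_minus("v1.24.9", "v1.27.3"))  # three minors behind: upgrade due
```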

The second aspect of maintenance is to watch out for resource leaks: rogue jobs that consume excessive resources, zombie processes, and pods that have terminated but still hold on to disk space, for example. The operations team needs to build methods of procedure (MOPs) to consistently eliminate such leakage before the cluster becomes unresponsive or jobs fail. How often these MOPs run depends on business objectives, but the interval should be no longer than one week.
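One cleanup step from such a MOP might look like the following sketch, which selects pods that finished more than a week ago as candidates for deletion. It works on plain dictionaries rather than a live cluster; the field names `name`, `phase` and `finished_at` are hypothetical, while `Succeeded` and `Failed` are the terminal pod phases Kubernetes reports.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_PHASES = {"Succeeded", "Failed"}


def stale_pods(pods, now, max_age=timedelta(days=7)):
    """Return names of pods in a terminal phase that finished more than max_age ago."""
    return [p["name"] for p in pods
            if p["phase"] in TERMINAL_PHASES and now - p["finished_at"] > max_age]


now = datetime(2023, 6, 21, tzinfo=timezone.utc)
pods = [
    {"name": "etl-1", "phase": "Succeeded", "finished_at": now - timedelta(days=10)},
    {"name": "etl-2", "phase": "Failed", "finished_at": now - timedelta(days=2)},
    {"name": "web-1", "phase": "Running", "finished_at": now - timedelta(days=30)},
]
print(stale_pods(pods, now))
```

Only `etl-1` qualifies here: `etl-2` finished too recently and `web-1` is still running.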

Incident management & restoration 

When there is a service outage, the incident management team is responsible for restoring service within the committed time limits (SLAs and KPIs). The team's objective should be to progressively decrease the time it takes to restore service after similar issues. The final aim is to drive enough automation in the cloud that such errors do not repeat and are caught ahead of time in the future.

The incident management team is accountable for restoration, and that restoration must be progressively automated. To achieve this, the team must include site reliability engineers (SREs) who can build automation tools aimed at restoration.
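Whether restoration time is actually decreasing can be tracked with a simple mean-time-to-restore (MTTR) comparison between two periods. This is a sketch only: the incident records, their field names and the minute-based timestamps are hypothetical.

```python
from statistics import mean


def mttr_minutes(incidents):
    """Mean time to restore, in minutes, over a list of incident records."""
    return mean(i["restored_min"] - i["detected_min"] for i in incidents)


def is_improving(earlier, later):
    """True if the later period restored service faster on average."""
    return mttr_minutes(later) < mttr_minutes(earlier)


last_quarter = [{"detected_min": 0, "restored_min": 90},
                {"detected_min": 10, "restored_min": 70}]
this_quarter = [{"detected_min": 0, "restored_min": 40},
                {"detected_min": 5, "restored_min": 35}]

print(mttr_minutes(last_quarter), mttr_minutes(this_quarter))
print(is_improving(last_quarter, this_quarter))
```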

Hardening & security  

K8s security is complex because the security hardening needs to be done at various levels. The security policies usually involve securing nodes, networks, pods, users and data from security threats/compromises. 

An ideal security practice must involve securing Kubernetes at all levels: 

  • Access Control: Well-managed tenants, role bindings, and root permissions  
  • Cluster: Comply with the latest Kubernetes CIS benchmarks 
  • Node: Use a hardened OS and comply with Host L1/L2 CIS benchmarks
  • API security: Appropriate RBAC and admission controllers in place
  • Network: Well-laid-out DMZ, firewall rules, DNS configs, IPsec
  • Pod: Check RBAC and scan images for vulnerabilities

These must be applied to avoid malicious attacks or downtime in the cluster. 
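Some pod-level checks can be automated before a workload is admitted. The sketch below inspects a pod manifest, represented as a Python dictionary, for two common findings: containers that run in privileged mode and images not pinned to a specific tag. It is an illustration under stated assumptions, not a replacement for an image scanner or an admission controller; the function name and the sample manifest are hypothetical, while `securityContext.privileged` and `image` are standard pod spec fields.

```python
def audit_pod(pod):
    """Return a list of human-readable findings for one pod manifest."""
    findings = []
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name", "<unnamed>")
        if c.get("securityContext", {}).get("privileged", False):
            findings.append(f"{name}: runs privileged")
        image = c.get("image", "")
        # Look for a tag only in the last path segment, so that a registry
        # port such as 'registry:5000/app' is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            findings.append(f"{name}: image '{image}' is not pinned to a specific tag")
    return findings


pod = {"spec": {"containers": [
    {"name": "web", "image": "nginx:latest", "securityContext": {"privileged": True}},
    {"name": "api", "image": "registry:5000/team/api:1.4.2"},
]}}
for finding in audit_pod(pod):
    print(finding)
```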

Executed in tandem, the six principles ensure that the cloud serves its purpose optimally.
