Cloud operations is the practice of orchestrating and managing a Kubernetes (K8s) cluster after deployment so that it meets the company's continuous availability, performance and cost objectives. It also covers the everyday management and monitoring of K8s clusters deployed across data centers.
Six principles contribute to efficient cloud operations and orchestration, and organizations must ensure they have the skills and tools to deliver on all of them.
An observability framework (OBF) dashboard serves as the window into every event in the cluster. Organizations must therefore build custom dashboards that show health, capacity, software changes and audits, and that provide insightful analytics on the K8s cluster.
The OBF must be designed and built using Kubernetes-native tools or external tools built for this purpose.
The data assimilated from various sources can then be ingested into a data lake, pushed into streaming pipelines such as Kafka, and then visualized on a dashboard. Triggers and alarms can then be set to act upon critical changes in the cluster.
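As a minimal sketch of what such triggers might look like, the Python snippet below evaluates one metrics sample against a list of threshold rules. The `AlertRule` type, the metric names and the threshold values are all hypothetical and chosen for illustration; a real deployment would normally express these rules in its monitoring tool rather than in application code.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when `metric` compares true against `threshold`."""
    name: str
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float


def evaluate_rules(sample: Dict[str, float], rules: List[AlertRule]) -> List[str]:
    """Return the names of all rules that fire for one metrics sample.

    Metrics missing from the sample are skipped rather than treated as zero.
    """
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and rule.compare(value, rule.threshold):
            fired.append(rule.name)
    return fired


# Illustrative rules only; real thresholds depend on the workload.
RULES = [
    AlertRule("node-cpu-high", "node_cpu_utilization", operator.gt, 0.90),
    AlertRule("disk-nearly-full", "disk_used_fraction", operator.ge, 0.85),
    AlertRule("ready-nodes-low", "ready_node_count", operator.lt, 3),
]

sample = {
    "node_cpu_utilization": 0.95,
    "disk_used_fraction": 0.40,
    "ready_node_count": 2,
}
print(evaluate_rules(sample, RULES))  # ['node-cpu-high', 'ready-nodes-low']
```

Keeping the rules as data makes it easy to review thresholds alongside the dashboards they support.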
Day 2 operations are the regular decisions made on the basis of OBF inputs, together with the manual housekeeping tasks performed on the cluster. In large-scale deployments, however, manually maintaining and operating hundreds of thousands of cluster elements (microservices, pods and infrastructure) is challenging.
Automation should focus primarily on scaling applications in and out, redeployment, capacity management, failover and recovery, and auto-healing.
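To make the scale in and out case concrete, here is a small sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, the replica bounds and the sample numbers are assumptions for illustration; the 10% tolerance mirrors the documented default but is configurable in a real cluster.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Return the replica count suggested by the HPA-style ratio formula.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas]. No change is suggested when
    the metric is within `tolerance` of the target.
    """
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid flapping
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, wanted))


# Metrics here are average CPU utilization in percent (illustrative values).
print(desired_replicas(4, 90, 60))   # scale out: 6
print(desired_replicas(6, 20, 60))   # scale in: 2
print(desired_replicas(5, 63, 60))   # within tolerance: 5
print(desired_replicas(8, 120, 60))  # capped at max_replicas: 10
```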
The automation of Day 2 operations must focus on continuous availability and eliminating the need for human intervention in known failure scenarios. This approach helps drive predictability and acts as a blueprint for future deployments.
Most outages in a running cluster are caused by a change executed without a comprehensive plan. This includes changes to hardware and software configurations, upgrades, the network, the storage layer and pod redeployments.
A sound change management system and change control process can prevent massive outages after a change window. No change should ever be executed on a live production system without the following:
Cluster maintenance means keeping all the hardware and software in a K8s cluster at the correct versions, cleaning up unwanted jobs and freeing up resources for other requirements.
Kubernetes is evolving continuously, and new releases are coming at a rapid pace. At a minimum, the operations team needs to keep the clusters no more than two releases behind the latest (N-2), as many bug fixes also go into these releases. On top of this, OS, hardware and firmware upgrades also need to be kept current.
The second aspect of maintenance is to watch out for rogue jobs that consume excessive resources, zombie processes, and terminated pods that still hold onto disk space, for example. The operations team needs to build MOPs that consistently eliminate such resource leakage before the cluster becomes unresponsive or jobs fail. How often these MOPs run depends on business objectives, but the interval should be one week at most.
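The N-2 rule can be checked mechanically. The sketch below, with hypothetical function names and illustrative version strings, parses Kubernetes-style versions such as `v1.28.4` and reports how many minor releases a cluster is behind the latest one. How the two version strings are obtained is left out and depends on the environment.

```python
import re
from typing import Tuple

# Matches versions such as "v1.28.4", "1.29" or "v1.28.5-vendor.1".
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a Kubernetes-style version string."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def minor_releases_behind(cluster_version: str, latest_version: str) -> int:
    """Return how many minor releases the cluster lags the latest release."""
    c_major, c_minor = parse_version(cluster_version)
    l_major, l_minor = parse_version(latest_version)
    if c_major != l_major:
        raise ValueError("major versions differ; compare manually")
    return max(0, l_minor - c_minor)


def within_support_window(cluster_version: str, latest_version: str,
                          max_skew: int = 2) -> bool:
    """True when the cluster is no older than N-`max_skew` minor releases."""
    return minor_releases_behind(cluster_version, latest_version) <= max_skew


# Version numbers below are examples, not a statement of current releases.
print(within_support_window("v1.28.4", "v1.30.1"))  # True  (N-2)
print(within_support_window("v1.27.9", "v1.30.1"))  # False (N-3)
```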
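One step of such a cleanup procedure might be selecting pods that have finished but are still present in the cluster. The following sketch works on plain records rather than a live API; the `PodRecord` type, the sample pods and the seven-day retention default are assumptions for illustration, while `Succeeded` and `Failed` are the terminal pod phases defined by Kubernetes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Pod phases in which the pod will not run again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


@dataclass(frozen=True)
class PodRecord:
    """Hypothetical snapshot of a pod, e.g. exported from the cluster API."""
    namespace: str
    name: str
    phase: str
    finished_at: Optional[datetime]  # None while the pod is still active


def cleanup_candidates(pods: List[PodRecord], now: datetime,
                       retention: timedelta = timedelta(days=7)) -> List[PodRecord]:
    """Return terminated pods that finished at least `retention` ago."""
    cutoff = now - retention
    return [
        pod for pod in pods
        if pod.phase in TERMINAL_PHASES
        and pod.finished_at is not None
        and pod.finished_at <= cutoff
    ]


now = datetime(2024, 6, 15, tzinfo=timezone.utc)
pods = [
    PodRecord("batch", "report-1", "Succeeded", now - timedelta(days=10)),
    PodRecord("batch", "report-2", "Failed", now - timedelta(days=2)),
    PodRecord("web", "frontend-0", "Running", None),
]
for pod in cleanup_candidates(pods, now):
    print(f"{pod.namespace}/{pod.name}")  # batch/report-1
```

The function only reports candidates; deleting them is left as a separate, reviewed step so that the selection logic can be run safely at any time.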
When there is a service outage, the incident management team is responsible for restoring service within the committed time limits (SLAs and KPIs). The team's objective should be to progressively decrease the time it takes to resolve similar issues. The final aim is to drive enough automation into the cloud that such errors do not recur and are caught ahead of time.
The incident management team is accountable for restoration, and that restoration must be progressively automated. To achieve this, the team must include site reliability engineers (SREs) who can build automation tools aimed at restoration.
K8s security is complex because hardening must be done at several levels. Security policies usually involve securing nodes, networks, pods, users and data against threats and compromises.
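Tracking whether restoration times are actually improving requires a consistent measure. As one possible sketch, the snippet below computes the mean time to restore per incident category and lists incidents that exceeded a restoration limit. The `Incident` record, the category names and the 60-minute limit are hypothetical; real values would come from the incident tracker and the agreed SLAs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record exported from an incident tracker."""
    category: str
    opened_at: datetime
    restored_at: datetime

    @property
    def time_to_restore(self) -> timedelta:
        return self.restored_at - self.opened_at


def mean_time_to_restore(incidents: List[Incident]) -> Dict[str, timedelta]:
    """Return the average restoration time for each incident category."""
    durations = defaultdict(list)
    for incident in incidents:
        durations[incident.category].append(incident.time_to_restore)
    return {
        category: sum(values, timedelta()) / len(values)
        for category, values in durations.items()
    }


def sla_breaches(incidents: List[Incident], sla: timedelta) -> List[Incident]:
    """Return incidents whose restoration took longer than the SLA limit."""
    return [i for i in incidents if i.time_to_restore > sla]


start = datetime(2024, 6, 1, 9, 0)
incidents = [
    Incident("dns", start, start + timedelta(minutes=30)),
    Incident("dns", start, start + timedelta(minutes=10)),
    Incident("storage", start, start + timedelta(minutes=90)),
]
print(mean_time_to_restore(incidents)["dns"])  # 0:20:00
print([i.category for i in sla_breaches(incidents, timedelta(minutes=60))])  # ['storage']
```

Comparing these per-category averages from one reporting period to the next shows where automation is paying off and where it is still missing.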
An ideal security practice must involve securing Kubernetes at all levels:
These must be applied to avoid malicious attacks or downtime in the cluster.
Executed in tandem, the six principles ensure that the cloud serves its purpose optimally.