Kubernetes operators: Maximizing efficiency or adding complexity?

by
Ravi Kumar Alluboyina
VP Engineering
Rakuten Symphony
June 13, 2023
15 minute read

Exploring the benefits and challenges of Kubernetes operators

Kubernetes operators enable application vendors to expand the functionality of core Kubernetes by allowing them to manage the deployment and lifecycle of their applications. By defining a Custom Resource Definition (CRD) and custom controllers, it is possible to develop a software extension for Kubernetes that is tailored to the specific orchestration requirements of a particular application and can therefore manage it more effectively. Although this approach can be highly advantageous for application vendors and developers, it may pose significant challenges for cluster administrators and end-users. While there are compelling reasons to create operators for applications, ensuring consistency in the definitions and establishing robust higher-level building blocks will make operator frameworks more appealing to all parties involved. We will look at some interesting solutions to this problem toward the end of this blog.
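
As a concrete sketch, a minimal CRD for a hypothetical `MyDatabase` application type might look like the following. All names here are illustrative, not taken from any real operator:

```yaml
# Illustrative only: a minimal CRD declaring a new MyDatabase resource type.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mydatabases.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: MyDatabase
    plural: mydatabases
    singular: mydatabase
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
```

A custom controller then watches `MyDatabase` objects and reconciles the cluster toward the declared spec.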

Kubernetes, the opinionated stack

Kubernetes was originally recognized as a highly opinionated stack: application developers were compelled to adhere to specific deployment patterns in order to onboard their applications onto the platform. In other words, the stack imposes a set of rules and constructs, and there is a generally right way of composing them to get the most out of it. These patterns included essential constructs such as Pods, Persistent Volume Claims (PVCs), ConfigMaps, Secrets, Deployments, StatefulSets, DaemonSets, Services and others, each serving a unique purpose. Pods provided compute resources, PVCs facilitated storage, ConfigMaps and Secrets were used for configuration, Deployments and StatefulSets managed scalability and availability, and Services/Endpoints provided network access.

Although Kubernetes supports many other resource types, the aforementioned constructs are application-agnostic and serve as foundational building blocks for virtually any type of application deployment.
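
To ground these building blocks, here is a minimal, illustrative StatefulSet that exercises most of them at once (the image and all names are examples only):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db          # network access via a headless Service
  replicas: 2                   # scalability / availability
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
        - name: db
          image: postgres:15    # compute
          envFrom:
            - configMapRef:
                name: demo-db-config   # configuration
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # storage via per-pod PVCs
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Note that nothing in this manifest tells Kubernetes anything about what the application is, only what resources it consumes.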

The need for operators and custom resources

Kubernetes constructs are designed to model the resources required for application deployment, such as storage, compute and network. However, certain aspects of applications cannot be adequately mapped to these constructs. While progress has been made in areas such as data protection through specialized custom resources like VolumeSnapshot, VolumeSnapshotClass and VolumeSnapshotContent, there remains a significant amount of work to be done. Custom resources and operators are being developed to address native shortcomings in a way that makes interacting with the stack easier and more predictable. These shortcomings include:

  • Lack of application awareness in the core resource types, which applications require to support automated operations
  • Extended operations, such as snapshots, backup, clones, granular restore (such as a table or a schema in databases), topology updates, data rebalancing and other administrative actions
  • Kubernetes primarily focuses on compute management and offloads the storage management to the Container Storage Interface (CSI) plugin. There are instances where coordinated planning is required from Kubernetes core scheduler, CSI, Container Network Interface (CNI) and Container Runtime Interface (CRI).

For data-heavy applications such as SQL, NoSQL, DataBus and big data workloads, application awareness is a must in order to successfully automate data protection lifecycle management. There is a pressing need for application-aware data management.

For example, consider these complexities:

  • Scaling, from the perspective of Kubernetes, means creating more pods and PVCs as needed. But who triggers the data rebalancing or data replication when moving from two replicas to three?
  • Snapshotting an application often involves moving the application to read-only mode or freezing it.
  • Backing up an application might involve copying only a selected set of volumes. For example, if an application keeps three-way replicated data, there is no point backing up all three volumes!
  • Smart or granular restore definitely involves application awareness: restore only the data volumes (the data PVCs, not the log volume). Extreme cases involve restoring a keyspace, schema or table in a database.
  • An additional use case involves serving as an application controller that generates applications as needed, similar to the Factory or Builder Patterns used in many programming languages.
  • Handling upgrades in a multi-tier application. In the majority of cases, there is a definite upgrade order. Note that most applications are retrofitted to run on the Kubernetes platform. Let us ask ourselves: how many data-heavy applications are built as true microservices following the twelve-factor methodology, one of the core guidelines for building microservices-based, OCI-compliant containers?
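
As one concrete illustration of the freeze problem mentioned above, backup tools such as Velero support pre/post hooks declared as pod annotations. The container name and fsfreeze paths below are hypothetical, and the exact behavior depends on the backup tool in use:

```yaml
# Illustrative pod-template annotations for Velero backup hooks:
# freeze the filesystem before the snapshot, thaw it afterwards.
metadata:
  annotations:
    pre.hook.backup.velero.io/container: fsfreeze
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/var/lib/data"]'
    post.hook.backup.velero.io/container: fsfreeze
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/var/lib/data"]'
```

Even this only covers filesystem-level quiescing; putting a database into a consistent read-only mode still needs application-specific logic.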

Operators are meant to solve all of the above problems, and some of the mature operators do deliver predictable functionality in a format that is easy to work with repeatably.

What then is the challenge with operators?

Operators are a superior way to extend Kubernetes to complex application deployments. However, there are some glaring issues that need to be addressed to reduce the operator overhead:

  • Lack of consistent listing of applications (kubectl get? helm list?)
  • Lack of consistent definitions
  • Lack of consistent lifecycle management patterns
  • Lack of consistent troubleshooting and debugging methods

Let’s consider each briefly.

Lack of consistent listing

Consider a simple scenario: listing all the applications running on a Kubernetes cluster that has multiple helm releases and custom operators. Running "helm list" provides only a partial view, covering helm releases and some operator deployments, but not all running applications. Building a GUI for such a cluster presents navigation challenges for users. One possible solution is to organize helm and operator applications into different namespaces, though determining how to divide namespaces in a Kubernetes cluster is itself a complex issue with multiple patterns available:

  • Namespace per org
  • Namespace per application
  • Namespace per application group (pipeline)
  • Namespace per application type (databases, big data)
  • Namespace per operator

On a side note, some operators dictate how namespaces are created. Overall, it’s safe to say that there needs to be some consistency in how application listings work. The Kubernetes Application CRD is one possible solution.
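
For instance, the SIG Apps Application CRD groups an application's resources under a single object, which makes a uniform "kubectl get applications" listing possible. The sketch below uses illustrative names and assumes the Application CRD is installed on the cluster:

```yaml
# One Application object describing which resources make up "demo-db".
apiVersion: app.k8s.io/v1beta1
kind: Application
metadata:
  name: demo-db
spec:
  selector:
    matchLabels:
      app: demo-db
  componentKinds:
    - group: apps
      kind: StatefulSet
    - group: ""
      kind: Service
```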

Lack of consistent definitions

With the operator pattern, application vendors define CRDs and controllers. Vendors are free to define the schema in the CRD. Of course, they need to adhere to the basic template with apiVersion, kind, metadata, spec and status, but what goes in the spec and status sections is, in most cases, completely up to the vendor. And here is where the trouble starts.

To illustrate this point, let us consider a simple scale-up/down operation. Every Kubernetes admin knows how to scale a StatefulSet: either ‘kubectl scale’ or editing ‘spec.replicas’ in the YAML spec using ‘kubectl edit’. This is consistent across any StatefulSet. The actions taken on scale-out or scale-in are entirely up to the entrypoint.
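
For reference, the uniform StatefulSet path looks like this (names illustrative); both routes edit the same well-known field:

```yaml
# Scale any StatefulSet the same way:
#   kubectl scale statefulset demo-db --replicas=3
# or edit the field directly with `kubectl edit statefulset demo-db`:
spec:
  replicas: 3
```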

However, in some cases, changes may be needed on the already-running pods before a new pod can be created as part of a scale-out operation. Changes to the entrypoint alone cannot handle this; a custom controller that works at the application level needs to intervene and orchestrate the scale-out or scale-in operation.
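
The kind of orchestration such a controller performs can be sketched in a few lines of plain Python. Everything here, the member names, the peer-list update, the rebalance step, is hypothetical and stands in for real calls against the Kubernetes and application APIs:

```python
# Illustrative only: an application-aware scale-out reconcile step.
# A real operator would watch a custom resource and call the Kubernetes
# API; here the cluster is just a dict of member name -> state.

def reconcile(desired_replicas, cluster):
    """Bring `cluster` up to `desired_replicas`, preparing the already
    running members before each new member is created."""
    steps = []
    while len(cluster) < desired_replicas:
        # 1. Reconfigure running members first (e.g. announce the new
        #    peer) -- something a bare entrypoint cannot coordinate.
        for name in sorted(cluster):
            steps.append(f"update-peer-list {name}")
        # 2. Only then create the new member (a new pod + PVC).
        new_name = f"member-{len(cluster)}"
        cluster[new_name] = "joining"
        steps.append(f"create {new_name}")
        # 3. Trigger data rebalancing onto the new member.
        steps.append(f"rebalance -> {new_name}")
        cluster[new_name] = "ready"
    return steps

cluster = {"member-0": "ready", "member-1": "ready"}
plan = reconcile(3, cluster)
```

The point is the ordering: the peer-list update and rebalance steps must happen around pod creation, which is exactly the coordination a generic StatefulSet cannot express.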

There is typically a command line client (CLI client) or a document that describes which fields to change in the YAML spec to trigger the scale-up. Every operator might use different specs and command line parameters to handle application lifecycle operations.

In summary, every operator defines its own schema and practices for a standard operation such as scale-out/scale-in. In an environment where thousands of applications are deployed, orchestration is extremely difficult if there is no consistency in definitions. Consider how the following operators each handle scaling:

i. Etcd operator

https://github.com/coreos/etcd-operator#resize-an-etcd-cluster

ii. MongoDB Operator

https://github.com/mongodb/mongodb-kubernetes-operator/blob/master/docs/deploy-configure.md#scale-a-replica-set

iii. Cassandra Operator

https://github.com/instaclustr/cassandra-operator/blob/master/examples/example-datacenter.yaml
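
To make the divergence concrete, here is roughly how the same "run five replicas" intent is spelled for each of these operators. The field names are paraphrased from their documentation and may not be exact; check the linked references for the authoritative schemas:

```yaml
# etcd-operator: the replica count lives under spec.size
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
spec:
  size: 5
---
# mongodb-kubernetes-operator: the same intent is spec.members
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
spec:
  members: 5
---
# cassandra-operator: nodes per datacenter, e.g. spec.nodes
apiVersion: cassandraoperator.instaclustr.com/v1alpha1
kind: CassandraDataCenter
spec:
  nodes: 5
```

Three operators, three different fields, for what is conceptually one operation.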

Lack of consistent lifecycle management patterns

Many operators ship a custom CLI which is different from the standard application-management CLIs. This CLI typically manages the CRs or YAML definitions rather than the application itself. Just take a look at one of the Postgres operators.

i. Postgres backup: take a look at a typical Postgres operator's documentation for logical backups; a pg_dumpall trigger sits behind the declarative syntax.

ii. Etcd backups: to back up an etcd cluster, there is yet another operator.

https://github.com/coreos/etcd-operator#backup-and-restore-an-etcd-cluster

Lack of consistent debugging and troubleshooting

There is a lack of consistency in debugging and troubleshooting when it comes to Kubernetes operators, and a similar issue arises with metrics, monitoring and alerting. For example, generating an average CPU utilization figure for an application requires organizing resources by namespace or, alternatively, using labels and selectors. This becomes difficult to manage if there is no control over label addition and removal. It's worth noting that anyone with access to a pod can add or remove labels, so it's important to consider the issues that may arise from this.
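
For example, assuming a standard Prometheus/cAdvisor metrics setup, a per-namespace average CPU figure could be computed with a recording rule like the following, which only holds up if the namespace layout (or the labeling scheme) is kept consistent:

```yaml
# Hypothetical Prometheus recording rule: average CPU usage per namespace.
groups:
  - name: app-cpu
    rules:
      - record: app:cpu_usage:avg
        expr: avg by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```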

A better approach

Alternatives such as the super operator framework from Rakuten Cloud, a Rakuten Symphony division, allow vendors to model any application and its lifecycle management operations using a consistent schema and lifecycle hooks, eliminating consistency issues. Then there is the Kubernetes Application CRD. Other open-source initiatives, such as project Nephio and efforts to standardize Network Function/Network Service definitions, also offer solutions.

Why use our approach? The Rakuten Cloud-Native Orchestrator (formerly Symcloud Orchestrator) framework is simple yet powerful: it enables developers and application administrators alike to compose, deploy and manage complex application stacks, workloads and data pipelines. One type of application supported by the framework is the Rakuten Cloud-Native application. These applications are driven by the configuration defined within the backing Application Bundle.

Our Application Bundles are a robust, inclusive collection of all the artifacts required to deploy and manage an application, giving you the widest range of automation options. A bundle contains one or more application container images, referenced within a manifest file that describes the components the application is composed of, the dependencies between services, resource requirements, affinity/anti-affinity rules and the custom actions required for application management. As a result, an Application Bundle can be viewed as the starting point for creating an application and, as such, a means of abstracting the underlying infrastructure from the user.

For more information, here is the link to the documentation.

Conclusion

While the build/distribute/run/manage methodology, leveraging Docker/OCI for application images, Kubernetes/controllers/CRI-O for the application runtime and controllers/CRDs for application lifecycle management, benefits vendors, cluster administrators and end-users face a different challenge. With hundreds of operators, each having its own CLI and custom YAML formats, administrators and end-users must understand the intricacies of each operator in addition to the core Kubernetes resources to effectively triage and troubleshoot applications.

At Rakuten Cloud, we have been working on this issue for the last seven years. The goal is to build a pattern, called Application Bundles, that lets application developers onboard any application while remaining simple enough for end-users, administrators and automation tooling to maintain consistent operational practices. We have reference bundles for most mainstream applications, ranging across SQL, NoSQL, big data, DataBus and telco workloads.

Credit

What I’ve described here is mostly based on lessons I’ve learned through my significant work orchestrating data-intensive applications at Rakuten Symphony.

Rakuten Cloud extends Kubernetes’ agility, efficiency and portability to all stateful applications, including complicated big data, database, AI/ML and custom applications, on any infrastructure, including on-premises, hybrid and multi-cloud ecosystems.