Installing and configuring Kubeflow on Rakuten Cloud-Native Platform

by Sricharan Mahavadi, Director of Enterprise Solutions and Customer Success, Rakuten Cloud
November 16, 2023 | 14 minute read

I’m sure many of you have heard of, or are familiar with, the term DevOps: a development and operations paradigm that integrates everything from planning to coding, deployment and operation. Alongside DevOps there is MLOps, which takes that paradigm a significant step further.

With MLOps, you can create a machine learning (ML) solution that uses field data to self-train and iterate, enabling organizations to maximize reliability and efficiency while requiring less manual labor or fewer experts.

Through MLOps, data scientists can focus on testing and validating data with a performance model that tunes the solution for performance, accuracy, availability or any other metric worth improving.

This article outlines the general considerations to keep in mind when choosing a cloud-native platform to host Kubeflow, then walks through the steps for installing upstream Kubeflow and configuring it with the cloud-native persistent storage and networking stack.

What is Kubeflow?

Kubeflow is an open-source ML platform that runs natively on Kubernetes. The Kubeflow project has multiple distinct software components that each address specific stages of the ML lifecycle, including model development, model training, model serving, automated ML, and CI/CD for models and the data ecosystem. Kubeflow is ideal for data scientists who want to build and experiment with ML pipelines. It is also for ML engineers and operations teams who want to deploy ML systems in various environments for model development, testing and production-level serving using CI/CD automation.

Why MLOps on Rakuten Cloud-Native Platform?

Rakuten Cloud-Native Platform (Rakuten CNP, formerly known as Symcloud Platform) provides a supercharged Kubernetes platform with native integration between cloud-native storage and the cloud-native networking stack. It also includes an application management system that fully automates the management of both clusters and applications. Rakuten CNP has the built-in capability to create managed application snapshots that enable cloning, backup and migration of applications between on-prem and cloud or between data centers within an enterprise.

Rakuten CNP fully automates end-to-end cluster provisioning, even for the most challenging platform deployments, for several applications including Kubeflow as well as custom application configurations.

Four Kubeflow deployment considerations

  1. Cloud-native persistent storage layer: Installing Kubeflow requires an enterprise-grade, cloud-native persistent storage layer that is scalable, reliable, resilient, performant and secure. The Kubeflow install creates various Deployments, StatefulSets and PersistentVolumeClaims (PVCs) that require an enterprise-grade CSI storage class supporting snapshots, clones, backups and replication of data and applications. The Kubeflow platform also requires shared (ReadWriteMany) PVCs that can be mounted by the multiple containers that make up an ML pipeline.
  2. Advanced networking & compute support: Kubeflow deploys various CRDs, custom resources and services; installs and configures the Istio service mesh; and configures Ingress and load balancer services. The Kubernetes platform should support cloud-native load balancers like MetalLB and a CNI (Container Network Interface) plugin like Calico or Open vSwitch (OVS) for network communication between pods.
  3. Advanced compute and GPU operator support: Some AI/ML use cases require GPU computation. The platform hosting Kubeflow must discover advanced GPU hardware such as MIG GPU slices and be able to allocate it to Jupyter Notebooks and ML applications. The platform should have built-in observability to monitor system performance and utilization. It should also autoscale based on compute demand. Rakuten CNP comes with NVIDIA’s GPU operator installed, which exposes the GPUs to the containers requesting them.
  4. Role-based access control (RBAC) and multitenancy: To be operationalized, Kubeflow applications require support for RBAC and multitenancy, so the platform should have advanced RBAC support.

Kubeflow components on Rakuten Cloud-Native Platform (Kubernetes Platform)

Various Kubeflow components are deployed as part of a Kubeflow installation. The image below outlines how these components interact with each other.

Kubeflow components. Source: Adapted from Kubeflow documentation

Steps for installing Kubeflow on Rakuten Cloud-Native Platform

1. Install Rakuten Cloud-Native Platform - https://docs.robin.io/platform/5.4.1/install.html#

2. Set up the MetalLB load balancer to use a specific IP pool range for the load balancer service.

MetalLB can be installed during Rakuten CNP installation, or after installation using:
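
One possible way to define the IP pool, sketched here with upstream MetalLB resources rather than any Rakuten CNP tooling: the snippet below writes an `IPAddressPool` and an `L2Advertisement` manifest to a file. It assumes MetalLB v0.13 or later running in the `metallb-system` namespace in layer 2 mode. The resource names, the file name and the address range (a documentation-only range) are placeholders, not values from this setup.

```shell
# Placeholder range (RFC 5737 documentation addresses); replace it with
# free addresses from your own network.
POOL_RANGE="192.0.2.240-192.0.2.250"

# Write the MetalLB address pool and its layer 2 advertisement to a file.
cat > metallb-pool.yaml <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kubeflow-pool        # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - ${POOL_RANGE}
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: kubeflow-l2          # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - kubeflow-pool
EOF
```

The file can then be applied with `kubectl apply -f metallb-pool.yaml`.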

3. Prepare the PVC YAML to reflect the right storage class, and configure storage management options such as replication, encryption and media type.

4. Install Kubeflow

Refer to the official Kubeflow install documentation:

Kubeflow release version: v1.6.0
(the latest release is https://github.com/kubeflow/manifests/tree/v1.6.1)

4.1. Download the Kubeflow release tar file, extract it and cd into the manifests directory:
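
A possible way to carry out this step, assuming `curl` and `tar` are available on the host and that the release is fetched from GitHub's standard tag archive URL. The file name and function name are illustrative.

```shell
KUBEFLOW_VERSION="v1.6.0"
TARBALL="kubeflow-manifests-${KUBEFLOW_VERSION}.tar.gz"
TARBALL_URL="https://github.com/kubeflow/manifests/archive/refs/tags/${KUBEFLOW_VERSION}.tar.gz"

# Download the release archive and unpack it into ./manifests.
# --strip-components=1 drops the archive's top-level directory so the
# contents land directly in ./manifests.
fetch_kubeflow_manifests() {
  curl -fsSL -o "$TARBALL" "$TARBALL_URL" || return 1
  mkdir -p manifests || return 1
  tar -xzf "$TARBALL" -C manifests --strip-components=1
}
```

Calling `fetch_kubeflow_manifests` leaves the extracted files in `./manifests`.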

4.1.a. Download kustomize and add it to the host PATH:
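
As a sketch only: the upstream v1.6 manifests list kustomize 3.2.0 as a prerequisite, so the snippet below fetches that version. It assumes a Linux amd64 host, the release asset naming used on the kustomize GitHub releases page, and `$HOME/bin` as the install location; the function name is illustrative.

```shell
KUSTOMIZE_VERSION="3.2.0"
KUSTOMIZE_URL="https://github.com/kubernetes-sigs/kustomize/releases/download/v${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64"
INSTALL_DIR="${HOME}/bin"   # assumed install location

# Download the kustomize binary and mark it executable.
install_kustomize() {
  mkdir -p "$INSTALL_DIR" || return 1
  curl -fsSL -o "${INSTALL_DIR}/kustomize" "$KUSTOMIZE_URL" || return 1
  chmod +x "${INSTALL_DIR}/kustomize"
}

# Make the install directory visible to the current shell session.
export PATH="${INSTALL_DIR}:${PATH}"
```

After running `install_kustomize`, `kustomize version` should report the installed release.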

4.2. Install Kubeflow with a single command:

cd manifests
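
The upstream `kubeflow/manifests` README installs every component with one retrying command. A sketch of it follows, wrapped in a function whose name is illustrative. It assumes `kustomize` and `kubectl` are on the PATH, the current directory is the manifests root, and `kubectl` points at the target cluster.

```shell
# Build the 'example' kustomization and apply it, retrying until every
# resource is accepted. Early attempts can fail because some custom
# resources are applied before their CRDs are established.
install_kubeflow() {
  while ! kustomize build example | kubectl apply -f -; do
    echo "Retrying to apply resources"
    sleep 10
  done
}
```

Running `install_kubeflow` may print the retry message several times before it finishes.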

Check if all the pods are running:
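
One way to do this, assuming the namespaces created by the upstream v1.6 example install (the function name is illustrative):

```shell
# Namespaces created by the upstream v1.6 example install.
KUBEFLOW_NAMESPACES="cert-manager istio-system auth knative-eventing knative-serving kubeflow kubeflow-user-example-com"

# List the pods in each namespace so their status can be inspected.
check_kubeflow_pods() {
  for ns in $KUBEFLOW_NAMESPACES; do
    echo "== ${ns} =="
    kubectl get pods -n "$ns"
  done
}
```

Call `check_kubeflow_pods` and repeat until every pod reports `Running` (or `Completed` for one-off jobs); this can take several minutes after the install.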

4.3. Deploy MetalLB to provide an external IP for Kubeflow (refer to step 2 if not already done).
Ensure that an external load balancer IP is allocated to the istio-ingressgateway service:
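
A minimal sketch of that check, assuming the service name and namespace used by the upstream manifests (`istio-ingressgateway` in `istio-system`); the function name is illustrative.

```shell
# Print the external IP that the load balancer assigned to the Istio
# ingress gateway; empty output means no IP has been allocated yet.
kubeflow_external_ip() {
  kubectl get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

If the output is empty, the service may not be of type `LoadBalancer`. One option in that case is `kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec":{"type":"LoadBalancer"}}'`, after which MetalLB should assign an address from its pool.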

4.4. Check that all Kubeflow features are working:

Open a browser and go to the Kubeflow UI using the load balancer service IP:
http://10.9.232.xx

The default username/password for the Kubeflow application is
user@example.com/12341234

Create a Jupyter Notebook using a PersistentVolumeClaim:
Select the new workspace volume using a Rakuten CNP storage class.

Select the data volume with the access mode “ReadWriteMany” using the Rakuten Cloud-Native Platform immediate storage class.

You can also choose to create a custom volume from a PVC spec and launch a Jupyter Notebook with it.
The notebook will then have access to both PVCs, shared and local, provisioned through the CSI storage class.

PVC spec example:
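
A hedged sketch of such a spec, not the platform's reference example: the snippet below writes a ReadWriteMany PVC manifest to a file. The storage class name `robin-immediate`, the PVC name, the size and the file name are assumptions; list the classes actually available with `kubectl get storageclass`. The namespace is the default user namespace of the upstream example install.

```shell
# Assumed storage class name; verify it with 'kubectl get storageclass'.
STORAGE_CLASS="robin-immediate"

# Write a shared (ReadWriteMany) PVC manifest to a file.
cat > notebook-shared-pvc.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-shared-data            # hypothetical name
  namespace: kubeflow-user-example-com
  # Storage options such as replication go under metadata.annotations;
  # see the platform storage documentation for the supported keys.
spec:
  storageClassName: ${STORAGE_CLASS}
  accessModes:
    - ReadWriteMany                     # mountable by several pods at once
  resources:
    requests:
      storage: 10Gi                     # illustrative size
EOF
```

Apply it with `kubectl apply -f notebook-shared-pvc.yaml`, then select the resulting volume as an existing data volume when creating the notebook.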

Advanced PVC parameters (for advanced storage options) can be configured by adding the appropriate annotations to the spec file:

https://docs.robin.io/platform/latest/manage_storage.html#readwritemany-rwx-volumes

To further validate Kubeflow, you can:

  1. Create and run Kubeflow pipelines from a Jupyter Notebook using Kale
  2. Create and run Katib hyperparameter tuning experiments from a Jupyter Notebook using Kale
  3. Create model servers using KServe

Conclusion

In this article, we covered several considerations for setting up a multi-tenant deployment of Kubeflow and walked through the step-by-step installation of Kubeflow on Rakuten CNP. Rakuten CNP is a fully integrated Kubernetes platform that comes with the cloud-native storage, compute and networking capabilities needed to run Kubeflow deployments at scale, which provides significant advantages for MLOps applications such as Kubeflow.


Disclaimer - Please note that Rakuten Cloud-Native Platform does not directly support the Kubeflow application on its platform. However, this exercise is a good starting point for organizations looking to implement Kubeflow on Kubernetes in their MLOps journey.

For more information about exploring Rakuten Cloud-Native Platform and Storage, please visit: https://symphony.rakuten.com/cloud