Kubeflow v1.7 simplifies Kubernetes native MLOps via enhanced UI, Katib Tuning API and new training frameworks
Kubeflow v1.7 users are capitalizing on their Python knowledge to build seamless workflows without running Kubernetes CLI commands or building container images for each iteration. With new UIs in multiple components, developers can correlate configuration parameters with logs, which lets them analyze results quickly. When coupled with Kubeflow’s pythonic workflows and Kubernetes operating efficiencies, these enhancements save model developers significant time and toil.
This release includes hundreds of new commits, and the following sections provide details on the top features. You will also find updates on platform dependencies, changelogs, breaking changes, ecosystem partners, the new Security Team and Kubeflow’s transition to the CNCF. We encourage active users to contribute to our next release, and provide links to join the community at the end.
Selected and Highlighted deliveries
Katib
Katib includes new enhancements to the UI and SDK. The new Katib UI provides simplified, fine-grained configuration and log correlation. Sorting and filtering have also been added, giving a more organized view of your many experiments. Together, these features reduce the need to run low-level commands by hand to locate logs and correlate them with hyperparameter experiment configurations. This simplifies in-depth performance analysis and subsequent iterations on model parameters.
In 1.7, the Katib SDK provides new features including a Tune API and the ability to retrieve trial metrics from the Katib database. Model developers and data scientists can call the Tune API to start a hyperparameter Experiment without any knowledge of underlying systems such as Kubernetes or Docker; the API automatically converts user training scripts into a Katib Experiment. The Katib changelog provides details on over 100 updates and bug fixes, including these top SDK and UI features:
- Narrow down Katib RBAC rules (#2091)
- Support Postgres as a Katib DB (#1921)
- [SDK] Create Tune API in the Katib SDK (#1951)
- [SDK] Get Trial Metrics from Katib DB (#2050)
- More Suggestion container fields in Katib Config (#2000)
- Katib UI: Enable pagination/sorting/filtering (#2017 and #2040)
- Katib UI: Add authorization mechanisms (#1983)
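As a rough illustration of the Tune API described above, the sketch below follows the pattern shown in the Katib SDK documentation: a plain Python objective function plus a search space is handed to `KatibClient().tune(...)`. The experiment name, the toy objective, the parameter ranges, and the trial count are assumptions chosen for illustration. Calling `submit_tune_experiment` additionally assumes the `kubeflow-katib` package is installed and a cluster running Katib is reachable.

```python
def objective(parameters):
    # Katib packages this function into a container, so it must be
    # self-contained: keep any imports inside the function body.
    # Trial parameters may arrive as strings, hence the explicit casts.
    a = int(parameters["a"])
    b = float(parameters["b"])
    result = 4 * a - b ** 2
    # Katib's default metrics collector reads "name=value" lines from stdout.
    print(f"result={result}")
    # Returned only so the function is easy to check locally;
    # Katib itself relies on the printed metric line.
    return result


def submit_tune_experiment(name="tune-example"):
    # Imported lazily so the objective can be tried locally without the SDK.
    import kubeflow.katib as katib

    # Hypothetical search space for the toy objective above.
    parameters = {
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    }

    client = katib.KatibClient()
    client.tune(
        name=name,
        objective=objective,
        parameters=parameters,
        objective_metric_name="result",
        max_trial_count=12,
    )
    return client
```

Running `objective({"a": "10", "b": "0.1"})` locally is a quick way to confirm the metric line is printed in the expected `result=<value>` form before submitting anything to a cluster.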
Training operator
Enhancements to Kubeflow’s unified distributed training operator include configuration options for fine-tuned resource scaling (processor, memory, storage). The operator now supports the Horizontal Pod Autoscaler (HPA) for PyTorch Elastic workloads: users specify a target metric and utilization in the Job Spec, and the PyTorchJob scales up or down automatically to match demand while honoring the elastic policy configured by the user. These enhancements simplify user workflows significantly and reduce operational toil and costs. The Job Spec is flexible and supports multiple scheduler types: Kubernetes, Volcano, and custom schedulers. Major 1.7 training operator features include:
- PodGroup enhancements (#1574)
- Integration with other training frameworks - PaddlePaddle (#520)
- Enhancements on PyTorch Elastic training (#1645, #1626)
- Support coscheduling plugin (#1722)
- [SDK] Create Unify Training Client (#1719)
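To make the HPA support for PyTorch Elastic workloads more concrete, here is a minimal sketch that builds a PyTorchJob manifest as a Python dictionary. The job name, container image, replica bounds, restart limit, and the 80% CPU target are placeholders. The field names follow the `elasticPolicy` section of the PyTorchJob spec as commonly shown in the training operator examples, so verify them against the CRD installed in your cluster.

```python
import json


def elastic_pytorchjob(name, image, min_replicas=1, max_replicas=4,
                       cpu_utilization=80):
    """Build a PyTorchJob manifest with an elastic policy and an HPA metric."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "elasticPolicy": {
                "rdzvBackend": "c10d",
                # Bounds within which the job may scale.
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "maxRestarts": 10,
                # Target metric/utilization used for autoscaling.
                "metrics": [
                    {
                        "type": "Resource",
                        "resource": {
                            "name": "cpu",
                            "target": {
                                "type": "Utilization",
                                "averageUtilization": cpu_utilization,
                            },
                        },
                    }
                ],
            },
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": min_replicas,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "pytorch", "image": image}
                            ]
                        }
                    },
                }
            },
        },
    }


manifest = elastic_pytorchjob(
    name="elastic-example",
    image="example.com/my-team/elastic-train:latest",  # placeholder image
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.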
Pipelines
In Kubeflow 1.7, the Pipelines Working Group (KFP) has continued its efforts towards KFP v2 with its latest 2.0.0-alpha.7 release. This release includes the following key enhancements:
- Pipelines as components: Pipelines can themselves be used as components in other pipelines, just as you would use any other single-step component in a pipeline. (#8179, #8204, #8209, #8220)
- Sub-DAG visualization that allows pipeline users to dive deep into sub-graph components of their pipeline. (#8326)
- Miscellaneous bug and vulnerability fixes.
Model developers recognize the time-saving pythonic workflows in Kubeflow Pipelines, which speed iteration by not requiring the generation of new images for pre-prod experimentation. The new V2 UI and SDK in Kubeflow 1.7 provide valuable details on each pipeline step. This simplifies the correlation and analysis of parameters, metadata and artifacts during iteration.
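The "pipelines as components" feature listed above can be sketched roughly as follows with the KFP v2 SDK. The component, the pipeline names, and the arithmetic are hypothetical; the snippet assumes a KFP v2 pre-release SDK is installed, and `kfp` is imported inside the function so the module still loads where the SDK is absent.

```python
def build_nested_pipeline():
    # Imported lazily; requires the KFP v2 SDK (a 2.0.0 pre-release or later).
    from kfp import dsl

    @dsl.component
    def add(a: int, b: int) -> int:
        # A single-step, lightweight Python component.
        return a + b

    @dsl.pipeline(name="add-twice")
    def add_twice(a: int, b: int) -> int:
        # An ordinary pipeline that exposes one output.
        first = add(a=a, b=b)
        second = add(a=first.output, b=b)
        return second.output

    @dsl.pipeline(name="outer-pipeline")
    def outer(x: int = 1):
        # The inner pipeline is called like any other component.
        inner = add_twice(a=x, b=2)
        add(a=inner.output, b=3)

    return outer
```

With the SDK installed, the returned pipeline can then be compiled in the usual way, for example with `kfp.compiler.Compiler().compile(pipeline_func=build_nested_pipeline(), package_path="outer.yaml")`, where the output path is a placeholder.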
Kubeflow web apps (Notebooks, Volumes, TensorBoards) and Controllers
Kubeflow 1.7 delivers new web apps enhancements that expose more information to the end users and improve their UI interactions.
A valuable new delivery is that all of the main tables now support filtering (#6754) and sorting (#6742), and can show objects from all namespaces at once (#6778). This allows end users to navigate through their tools and apps (notebooks, tensorboards, volumes, etc.) more efficiently.
Additionally, 1.7 provides an update to the Notebooks form page (#6826) as well as dedicated details pages (#6769, #6788) for the different types of tools managed by Kubeflow. These pages allow users to view logs, events and YAML configurations, and they link to one another (for example, going to a volume’s details page from a notebook’s details page). Previously, this information was available only through the Kubernetes API, which required elevated privileges and deeper knowledge of Kubernetes CLI commands. With these new features the user has a simpler, more organized and more secure way of accessing crucial Kubeflow resource information.
Other notable features are small improvements to our user stories around PodDefaults. Aside from additional use cases, like defining sidecar and init containers (#6749), Kubeflow’s TensorBoard stack now integrates with PodDefaults (#6874, #6924). These enhancements enable users to re-use their existing PodDefaults to gain S3 access from both Notebook and TensorBoard servers.
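As an illustration of the S3 use case mentioned above, the sketch below builds a PodDefault manifest that injects S3 credentials from a Kubernetes Secret into matching pods. The resource name, namespace, label, Secret name, and environment variable names are assumptions for illustration, and the Secret is assumed to already exist in the same namespace.

```python
import json


def s3_poddefault(namespace, secret_name="s3-credentials"):
    """Build a PodDefault that exposes S3 credentials as environment variables."""
    # Each variable is read from a key of the same name in the Secret.
    env = [
        {
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
        }
        for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    ]
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "PodDefault",
        "metadata": {"name": "s3-access", "namespace": namespace},
        "spec": {
            # Human-readable description of this configuration.
            "desc": "Allow access to S3",
            # Pods carrying this label receive the settings below.
            "selector": {"matchLabels": {"s3-access": "true"}},
            "env": env,
        },
    }


poddefault = s3_poddefault(namespace="my-profile")  # placeholder namespace
print(json.dumps(poddefault, indent=2))
```

Because the PodDefault is matched by label, the same object can serve both Notebook and TensorBoard pods once both are created with the `s3-access: "true"` label.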
Platform dependencies, breaking changes, add-ons
Kubeflow 1.7 includes hundreds of commits. The Kubeflow release process includes several rounds of testing by the Kubeflow working groups and Kubeflow distributions. Kubeflow’s configuration options provide a high degree of flexibility. After considering all of the testing options, the 1.7 Release Team narrowed the critical dependencies for consistent testing and documentation to the following.
| Kubernetes | Istio | Knative | Kustomize | cert-manager | Dex | Argo | Tekton | oidc-authservice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.25 / 1.24 | 1.16 | 1.8.1 | 3.2 or 5.0 | 1.10.1 | 2.31.2 | 3.3.8 | 1.5 | e236439 |
Another valuable platform enhancement is the support of additional processor architectures including IBM’s Power (#6684). This effort provides the foundation to add other processor types as well.
The 1.7 documentation includes overall installation instructions from the Manifest Working Group, and detailed feature reviews from each Kubeflow working group. Most of the working groups have broken their changelogs into subsections that highlight core features, UI enhancements, miscellaneous updates, bug fixes and breaking changes.
Working Group Changelogs including breaking changes
The community has continued its work to identify core components and add-ons. Significant enhancements in add-ons include the continued integration with KServe’s v0.10 release, as well as a new serving option from BentoML. The BentoML team has done a tremendous job in supporting the Kubeflow community in 1.7 and their documentation is excellent. Details on BentoML are available here.
What’s next
The community continues to see a large increase in activity since the announcement that Kubeflow will be donated to the CNCF by Google. The community holds regular meetings to review progress on the checklist items needed for the CNCF due diligence (meeting notes).
During the 1.7 release cycle, the community formed a Security Team, which is working to improve the security profile of Kubeflow components and their dependencies. The Security Team has completed these three deliveries:
- Set up a GitHub directory, Slack channel and regular meeting schedule with notes
- Created an inventory of container images used by Kubeflow
- Created a list of Common Vulnerabilities and Exposures (CVEs) in the container images
Going forward, the Security Team will work to develop ongoing policies and to remedy security issues. For example, fixing CVEs is an ongoing maintenance requirement, and this function is currently provided by Kubeflow distribution providers as a value-added delivery. Some distributions and end users are working to fix CVEs in the upstream projects, and the Security Team is looking for help defining those expectations and delivering the fixes.
The Kubeflow team is working on integration efforts with the Ray and MLflow communities. The Ray integration has moved closer to user testing, and users can find more information on this tracking issue. The MLflow integration is progressing and is tracked here.
How to get started with 1.7
To try out Kubeflow 1.7, we recommend our installation page, where you can choose from a selection of Kubeflow distributions. For more advanced users, we recommend the manifest installation guide.
Join the Community
We would like to thank everyone for their contribution to Kubeflow 1.7, especially Dominik Fleischmann for his work as the v1.7 Release Manager. As you can see, the Kubeflow community is vibrant and diverse, solving real-world problems for organizations worldwide.
Want to help? The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!
- Visit our Kubeflow website or Kubeflow GitHub Page
- Join the Kubeflow Slack channel
- Join the kubeflow-discuss mailing list
- Attend a weekly community meeting