AI Workload Conformance In Kubernetes: A Feature-Driven Approach
This article describes a proposal for feature-driven conformance for AI workloads in Kubernetes. The proposal addresses two problems: the communication gap between the stakeholders who define conformance requirements and the developers who test them, and the difficulty of verifying that those requirements are actually met. We will explore the problem, the proposed solution, its technical implementation, and the benefits it offers.
Understanding the Challenge of AI Workload Conformance
Defining conformance requirements for AI/ML workloads, such as those involving GPUs and Dynamic Resource Allocation (DRA), presents unique challenges. A significant communication gap often exists because the stakeholders who define these requirements (including Model Builders, Hardware Vendors, and Product Managers) may not be proficient in reading Go code, yet the verification logic for these requirements typically resides exclusively in Go files under `test/e2e`. This disconnect leads to several critical issues:
The Problem of Documentation Drift
One of the primary issues is documentation drift. Currently, the descriptions in conformance.yaml are scraped from Go comments. Over time, these comments tend to diverge from the actual test logic, creating discrepancies between the documented requirements and the implemented tests. This can lead to confusion and uncertainty about whether the system truly conforms to the intended specifications.
Stakeholder Accessibility
Another significant challenge is the accessibility of the conformance criteria to non-coders. Stakeholders who lack coding expertise find it difficult to audit the "Contract" of AI Conformance. This lack of transparency can hinder collaboration and understanding among different teams and stakeholders involved in the AI workload lifecycle.
Complexity Concerns with Existing Solutions
While Behavior-Driven Development (BDD) frameworks like Cucumber could offer a solution, introducing a full BDD framework would add considerable technical debt and complexity to the Kubernetes build system. This complexity could outweigh the benefits, making it essential to explore alternative approaches.
The Proposal: A Two-Style Guide Approach
To address these challenges without incurring significant technical debt, a two-style-guide approach is proposed. It keeps the existing runtime standard (Ginkgo) for executing tests while expressing the requirements in Gherkin syntax so that they are readable by non-coders. The two style guides are linked by a build-time check that keeps the specification and the implementation consistent with each other.
Style Guide A: The Specification (The "What")
This style guide focuses on defining the requirements in a human-readable format. It targets a broad audience, including AI Working Group members and Product Managers, who need to understand the specifications without delving into code.
Key Aspects of Style Guide A
- Gherkin Syntax: Requirements are written in standard `.feature` files using Gherkin syntax. Gherkin is a plain-text format that uses natural language to describe the behavior of software, making it accessible to both technical and non-technical stakeholders.
- Location: These feature files are located in the `test/e2e/ai-conformance/features/` directory, providing a centralized location for all AI conformance specifications.
- Tags: Each feature file must include the `@conformance` tag. This tag helps identify the files relevant to conformance testing, ensuring they are included in the conformance checks.
Style Guide B: The Implementation (The "How")
This style guide targets test developers and focuses on implementing the tests that verify the specifications defined in Style Guide A. It ensures that the tests accurately reflect the requirements.
Key Aspects of Style Guide B
- Ginkgo Framework: Tests are written using the standard Ginkgo testing framework, which is already integrated into the Kubernetes ecosystem. This avoids the need to introduce new runtime dependencies.
- Mapping Rules: Strict mapping rules are enforced to ensure that the tests mirror the feature files precisely. These rules are crucial for maintaining consistency between the specification and the implementation.
- Mapping Rule 1: The `It(