Creating A METHODS.md File: A Comprehensive Guide

by Alex Johnson

In performance benchmarking, a well-documented methodology is paramount: it ensures that results are reproducible, interpretable, and comparable across different systems and configurations. This article explains why a comprehensive METHODS.md file, a component frequently referenced in benchmark repositories yet often missing, is critical, and provides a detailed guide to constructing one effectively. Such documentation lets users understand the scientific methodology behind the metrics, interpret and compare results, and validate findings to the standards of rigor expected of publication-quality work.

The Problem: The Absence of a Standardized Methodology Document

Many benchmark repositories suffer from a common issue: the absence of a central document detailing the methodology employed. This often manifests as references to a missing METHODS.md file within various README documents. For instance, you might find lines like “See METHODS.md for full protocol and plotting notes” or “For a full protocol (definitions, plotting, black-&-white-friendly figure templates), keep or add a project-level METHODS.md.”

These references highlight the intended purpose of the METHODS.md file: to provide a comprehensive overview of the benchmark execution, metric calculation, and results interpretation. The absence of such a document leads to several critical issues.

Impact of Missing Methodology Documentation

The lack of a METHODS.md file significantly impacts the usability and credibility of benchmarks. Key issues include:

  1. Inability to Understand Scientific Methodology: Without a detailed methodology document, users struggle to grasp the underlying scientific principles and processes used to derive the reported metrics. This lack of transparency makes it difficult to trust and utilize the benchmark results effectively.
  2. Difficulty in Interpreting and Comparing Results: Benchmarks become challenging to interpret and compare when the methodology is unclear. Users cannot discern whether differences in results stem from actual performance variations or methodological discrepancies. This ambiguity undermines the value of the benchmarks.
  3. Impediment to Publication-Quality Benchmarks: High-quality benchmarks, suitable for academic and industry publications, require meticulous methodology documentation. The absence of a METHODS.md file prevents benchmarks from meeting the rigorous standards expected in scholarly and professional contexts.
  4. Erosion of Credibility: Benchmarks lacking clear methodology documentation suffer from reduced credibility. Users are less likely to trust results when the methods used to generate them are opaque.
  5. Challenges in Reproducing and Validating Findings: The inability to reproduce or validate benchmark findings poses a significant problem. Without a detailed methodology, replicating the benchmark setup and execution becomes an arduous task, hindering verification efforts.

The Proposed Solution: Creating a Comprehensive METHODS.md File

To address the issues arising from the absence of methodology documentation, the proposed solution is to create a comprehensive METHODS.md file at the root of the repository. This file should cover all aspects of the benchmark methodology, including core principles, metric definitions, proxy metric methodologies, statistical methods, plotting and visualization standards, environment documentation, and reporting standards. Let's explore each of these components in detail.

1. Benchmark Methodology

The METHODS.md file should begin with an overview of the benchmark methodology, outlining the core principles that guide the benchmarking process. This section should include a clear statement of intent, setting the stage for the detailed explanations that follow.

# Methods and Protocols

## Overview
This document describes the methodology, metrics, statistical approaches,
and reporting standards used across all benchmarks in this repository.

## Core Principles
- Platform-agnostic measurements
- Reproducible results with documented variance
- Vendor-neutral implementations
- Scientifically sound statistical methods

This introductory section establishes the foundational principles underpinning the benchmarks. These principles emphasize the importance of platform-agnostic measurements, ensuring that the benchmarks can be executed across various hardware and software configurations. Reproducibility is another cornerstone, highlighting the need for results that can be consistently replicated with documented variance. Vendor-neutral implementations ensure fairness and impartiality, while scientifically sound statistical methods guarantee the rigor of the analysis.

2. Metrics Definitions

A crucial part of the METHODS.md file is the precise definition of each metric used in the benchmarks. Each metric should be thoroughly documented, including its definition, calculation method, units of measurement, and when it should be measured (e.g., during steady-state or including warm-up periods). Metrics are the heart of any benchmark, providing the quantitative data needed to evaluate performance. Clear and consistent definitions are essential for accurate interpretation.

Throughput

For example, the throughput metric should be defined as follows:

Throughput:

  • Definition: Items processed per unit time
  • Calculation method: Total items processed divided by total time
  • Units: entries/s, tokens/s, MB/s
  • When to measure: steady-state vs. including warm-up

This definition clarifies what throughput represents—the rate at which items are processed. It specifies the method for calculation, ensuring consistency across measurements. Units of measurement are provided to standardize reporting, and guidance is given on the timing of measurements, distinguishing between steady-state and warm-up periods.
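As a concrete illustration, the following minimal sketch (the helper and its parameters are hypothetical, not taken from any specific benchmark harness) computes steady-state throughput from per-batch measurements, excluding a warm-up window as discussed above:

from typing import List

def throughput_items_per_s(items_per_batch: List[int],
                           batch_times_s: List[float],
                           warmup_batches: int = 5) -> float:
    """Items processed per second over the steady-state window.

    Drops the first `warmup_batches` measurements, then divides
    total items processed by total elapsed time.
    """
    items = sum(items_per_batch[warmup_batches:])
    elapsed = sum(batch_times_s[warmup_batches:])
    return items / elapsed if elapsed > 0 else 0.0

# Example: batches of 512 entries taking ~0.25 s each in steady state
# yield roughly 512 / 0.25 = 2048 entries/s.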

Tail Latency (P95/P99)

Similarly, tail latency, a critical metric for understanding worst-case performance, should be defined with precision:

Tail Latency (P95/P99):

  • Definition: The latency at or below which 95% (P95) or 99% (P99) of requests complete
  • Importance: Reflects the latency experienced by the slowest requests, critical for user experience
  • Calculation approach: exact vs. approximation methods (e.g., nearest-rank, linear interpolation)
  • Sample size requirements: Sufficient data points for accurate percentile estimation
  • Interpretation guidelines: Understanding the impact of tail latency on overall system performance
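For example, with N = 200 recorded latencies, the nearest-rank P95 is the 190th smallest value (ceil(0.95 × 200) = 190) and P99 is the 198th; with fewer than 100 samples, the nearest-rank P99 is simply the maximum observed value, which is why the sample size guidelines in the statistical methods section matter.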

Time-to-First-Batch (TTFB)

The time-to-first-batch metric, crucial for assessing startup performance, should be defined as follows:

Time-to-First-Batch (TTFB):

  • Definition: The time taken for the first batch of data to be processed
  • Measurement approach: Start timing from request initiation, stop when the first batch processing completes
  • Importance for GPU utilization: TTFB can significantly affect GPU utilization during startup
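A minimal measurement sketch, assuming the data pipeline is exposed as a Python iterable of batches (the loader object here is hypothetical):

import time

def time_to_first_batch_s(loader) -> float:
    """Seconds from request initiation until first-batch processing completes.

    `loader` is any iterable that yields batches; this sketch treats
    consuming the first item as "processing" the first batch.
    """
    start = time.perf_counter()         # request initiation
    first_batch = next(iter(loader))    # blocks until the first batch arrives
    _ = first_batch                     # downstream processing would happen here
    return time.perf_counter() - start  # stop when first-batch processing completes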

Request Amplification

Request amplification, which quantifies the overhead introduced by a system, should be defined as:

Request Amplification:

  • Definition: The ratio of total requests generated to the number of original requests
  • Calculation formula: Total requests / Original requests
  • Interpretation: Higher amplification indicates greater overhead
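For example, if a benchmark issues 1,000 original requests and instrumentation records 3,500 requests reaching the underlying system, the request amplification is 3,500 / 1,000 = 3.5, meaning each logical request generates 3.5 actual requests on average.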

3. Proxy Metrics Methodology

In some cases, direct measurement of certain metrics may be impractical or infeasible. In such situations, proxy metrics can be used to estimate performance characteristics. The METHODS.md file should detail the methodology used for proxy metrics, including the rationale, formulas, assumptions, and limitations.

GPU Utilization Impact (Proxy)

Estimating the impact of optimizations on GPU utilization can be achieved using a proxy metric based on data loading latency improvements. The methodology for this proxy metric should be clearly documented:

### GPU Utilization Proxy

**Method:** We estimate GPU idle time reduction based on data loading
latency improvements.

**Formula:**

idle_reduction_pct = (TTFB_baseline - TTFB_optimized) / step_time_avg * 100


**Assumptions:**

*   Single GPU, data-parallel training
*   Step time = data_loading + compute + communication
*   Compute and communication remain constant
*   Data loading is serialized with compute

**Limitations:**

*   Actual impact varies with model architecture
*   Pipeline parallelism changes the relationship
*   Prefetching may hide latency improvements

This section explains the method for estimating GPU utilization impact, providing the formula used, the assumptions underlying the calculation, and the limitations that should be considered when interpreting the results. By explicitly stating these factors, the methodology gains transparency and robustness.
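As a direct transcription of the formula above into code (variable names are illustrative):

def gpu_idle_reduction_pct(ttfb_baseline_s: float,
                           ttfb_optimized_s: float,
                           step_time_avg_s: float) -> float:
    """Estimated GPU idle-time reduction as a percentage of an average step.

    Implements: idle_reduction_pct =
        (TTFB_baseline - TTFB_optimized) / step_time_avg * 100,
    valid only under the single-GPU, serialized-data-loading assumptions above.
    """
    return (ttfb_baseline_s - ttfb_optimized_s) / step_time_avg_s * 100.0

# Example with placeholder numbers: TTFB drops from 2.4 s to 0.9 s against a
# 5.0 s average step time -> (2.4 - 0.9) / 5.0 * 100 = 30% estimated reduction.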

Cost per Token Estimation

Estimating the cost per token, a critical metric for cost-sensitive applications, requires a clear methodology:

### Cost per Token Estimation

**Method:** Derive from runtime changes and public cloud pricing.

**Formula:**

cost_per_1M_tokens = (runtime_hours × instance_hourly_rate) / (tokens_processed / 1e6)


**Data Sources:**

*   AWS EC2 pricing: [link to AWS pricing]
*   Google Cloud pricing: [link to Google Cloud pricing]
*   Azure pricing: [link to Azure pricing]

**Assumptions:**

*   Include compute costs only (not storage/networking)
*   Use on-demand pricing (not spot/reserved)
*   Linear scaling within the tested range
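To make the formula concrete, a minimal sketch (the hourly rate and token count below are placeholders, not quoted prices):

def cost_per_1m_tokens(runtime_hours: float,
                       instance_hourly_rate_usd: float,
                       tokens_processed: int) -> float:
    """cost_per_1M_tokens = (runtime_hours * instance_hourly_rate)
                            / (tokens_processed / 1e6)

    Compute cost only, on-demand pricing, per the assumptions above.
    """
    return (runtime_hours * instance_hourly_rate_usd) / (tokens_processed / 1e6)

# Example with placeholder numbers: 2 hours on a $4.00/hour instance while
# processing 80M tokens -> (2 * 4.00) / 80 = $0.10 per 1M tokens.
print(cost_per_1m_tokens(2.0, 4.00, 80_000_000))  # 0.1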

Energy per Sample

Estimating energy consumption per sample is crucial for understanding the energy efficiency of a system. The METHODS.md file should detail the approach used:

### Energy per Sample Estimation

**Method:** Calculate from power draw or TDP estimates.

**Formula:**

energy_kJ_per_1k_samples = (power_watts × runtime_seconds) / samples


**Measurement Approaches:**

1.  Direct: Use RAPL, nvidia-smi, or power meters
2.  Estimated: Use TDP × utilization%
3.  Cloud: Use provider sustainability reports when available

**Reporting Requirements:**

*   State measurement method
*   Include measurement window
*   Report sampling frequency
*   Note whether idle power is subtracted

This comprehensive methodology covers various approaches for measuring or estimating energy consumption, from direct measurements using specialized tools to estimates based on TDP and cloud provider reports. The reporting requirements ensure that energy consumption metrics are presented in a consistent and transparent manner.
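The same calculation as code (a minimal sketch; the power draw and sample count are placeholders):

def energy_kj_per_1k_samples(power_watts: float,
                             runtime_seconds: float,
                             samples: int) -> float:
    """Energy per 1,000 samples, in kilojoules.

    Total energy (J) = power_watts * runtime_seconds; dividing by the sample
    count gives J per sample, which is numerically equal to kJ per 1k samples.
    """
    return (power_watts * runtime_seconds) / samples

# Example with placeholder numbers: 300 W sustained for 600 s over 50,000
# samples -> 180,000 J / 50,000 = 3.6 kJ per 1k samples.
print(energy_kj_per_1k_samples(300.0, 600.0, 50_000))  # 3.6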

4. Statistical Methods

Statistical methods play a critical role in ensuring the reliability and validity of benchmark results. The METHODS.md file should detail the statistical techniques used for data analysis, including percentile calculation, warm-up exclusion, and sample size guidelines.

Percentile Calculation

The method used for calculating percentiles, such as P95 and P99, should be clearly defined. Including a code snippet can further clarify the implementation:

import math
from typing import List

def percentile(values: List[float], pct: float) -> float:
    """
    Calculate percentile using the nearest-rank method.

    Args:
        values: List of measurements
        pct: Percentile as fraction (0.95 for P95)

    Returns:
        Percentile value
    """
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    # Nearest-rank: the ceil(pct * N)-th smallest value, as a 0-based index.
    idx = max(0, math.ceil(pct * len(sorted_vals)) - 1)
    return sorted_vals[idx]

This code snippet provides a clear, executable definition of the percentile calculation method, ensuring consistency and reproducibility.
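As a usage sketch, the nearest-rank result from the function above can be cross-checked against a linear-interpolation estimate from the standard library (assuming Python 3.8+ for statistics.quantiles; the latency data here is synthetic):

import random
import statistics

latencies_ms = [random.uniform(5, 50) for _ in range(200)]  # synthetic data

p95_nearest = percentile(latencies_ms, 0.95)                # nearest-rank (above)
p95_interp = statistics.quantiles(latencies_ms, n=100)[94]  # linear interpolation

print(f"P95 nearest-rank: {p95_nearest:.2f} ms, interpolated: {p95_interp:.2f} ms")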

Warm-up Exclusion

Warm-up exclusion is a crucial step in benchmark execution, as the initial iterations often exhibit performance characteristics that differ from steady-state behavior. The METHODS.md file should specify the methodology for warm-up exclusion:

  • Discard first N iterations (configurable, default 5% of total)
  • Report warm-up separately from steady-state
  • Document cold-start scenarios explicitly
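A minimal sketch of this exclusion policy (the function name is illustrative; the 5% default mirrors the guideline above):

from typing import List, Tuple

def split_warmup(measurements: List[float],
                 warmup_fraction: float = 0.05) -> Tuple[List[float], List[float]]:
    """Split measurements into (warm-up, steady-state) segments.

    Discards the first N iterations from the steady-state series, where N
    defaults to 5% of the total, so warm-up can be reported separately.
    """
    n_warmup = int(len(measurements) * warmup_fraction)
    return measurements[:n_warmup], measurements[n_warmup:]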

Sample Size Guidelines

Sample size is a critical determinant of the statistical power of benchmark results. The METHODS.md file should provide guidelines on appropriate sample sizes for different metrics:

  • Minimum 50 iterations for stable percentiles
  • Minimum 100 for P99 estimates
  • Report confidence intervals when sample size < 100
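For small samples, one way to attach the recommended confidence interval is a simple bootstrap around the percentile estimate; the sketch below reuses the percentile function from this section and shows one possible approach, not a prescribed method:

import random
from typing import List, Tuple

def bootstrap_ci(values: List[float], pct: float = 0.95,
                 n_resamples: int = 1000,
                 confidence: float = 0.95) -> Tuple[float, float]:
    """Bootstrap confidence interval for a percentile estimate."""
    if not values:
        return 0.0, 0.0
    estimates = sorted(
        percentile([random.choice(values) for _ in values], pct)
        for _ in range(n_resamples)
    )
    lo = estimates[int((1 - confidence) / 2 * (n_resamples - 1))]
    hi = estimates[int((1 + confidence) / 2 * (n_resamples - 1))]
    return lo, hi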

5. Plotting and Visualization Standards

Visual representation of benchmark results is essential for effective communication. The METHODS.md file should outline the standards for plotting and visualization, ensuring that figures are clear, informative, and adhere to best practices.

Figure Requirements

Specific requirements for figures should be detailed, including:

  • Black-and-white compatibility (grayscale fills or hatch patterns)
  • Units on all axes
  • Sample size shown (N=...)
  • Measurement time window included
  • Environment table (hardware, OS, versions)
  • Error bars or confidence intervals when applicable

Format Templates

Providing sample figure templates can further standardize the presentation of results:

## Sample Figure Template

Title: Checkpoint Write Throughput Across Storage Tiers

[Bar chart or line plot]

X-axis: Storage Configuration (labeled)
Y-axis: Throughput (MB/s)

Caption:
Figure 1: Checkpoint write throughput comparison (N=100 iterations each,
8×16MB shards, concurrency=4). Environment: AWS EC2 i3.2xlarge, Ubuntu 22.04,
kernel 5.15, measured on 2024-01-15. Error bars show 95% confidence intervals.
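A figure following this template can be produced with a short script; the sketch below assumes matplotlib is available and uses placeholder numbers rather than measured results:

import matplotlib.pyplot as plt

# Placeholder data: mean throughput (MB/s) and 95% CI half-widths per tier.
tiers = ["Local NVMe", "Network SSD", "Object store"]
throughput_mb_s = [950.0, 410.0, 180.0]
ci_95 = [35.0, 22.0, 15.0]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(tiers, throughput_mb_s, yerr=ci_95, capsize=4,
       color=["0.85", "0.6", "0.35"],  # grayscale fills stay readable in B&W
       edgecolor="black", hatch="//")  # hatching distinguishes bars without color
ax.set_xlabel("Storage Configuration")
ax.set_ylabel("Throughput (MB/s)")
ax.set_title("Checkpoint Write Throughput Across Storage Tiers (N=100)")
fig.tight_layout()
fig.savefig("checkpoint_throughput.png", dpi=300)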

6. Environment Documentation

Documenting the environment in which benchmarks are executed is critical for reproducibility. The METHODS.md file should include a checklist of required information:

## Environment Reporting Checklist

**Hardware:**

*   CPU model and core count
*   GPU model and count (if applicable)
*   RAM capacity and speed
*   Storage type (NVMe, SSD, HDD, network-attached)

**Software:**

*   OS and kernel version
*   Python version
*   Key dependency versions (PyYAML, torch, etc.)
*   Filesystem type and mount options

**Network (if applicable):**

*   Bandwidth and latency to storage
*   Network filesystem protocol and version

**Configuration:**

*   Benchmark parameters used
*   Any non-default settings
*   Number of iterations and sample size
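Much of this checklist can be captured automatically at benchmark time; a minimal sketch using only the standard library (the output filename and keys are illustrative):

import json
import os
import platform

def collect_environment() -> dict:
    """Capture a basic software/hardware snapshot for the environment table."""
    return {
        "os": platform.platform(),
        "kernel": platform.release(),
        "python": platform.python_version(),
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),
        # GPU model, RAM speed, storage type, and mount options still need to
        # be recorded manually or via vendor tools (e.g., nvidia-smi).
    }

with open("environment.json", "w") as f:
    json.dump(collect_environment(), f, indent=2)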

7. Reporting Standards

The METHODS.md file should also outline the standards for reporting benchmark results, specifying the minimum requirements and the guidelines for publication-quality reporting.

Minimal Reporting Requirements:

  1. Attach all CSV and YAML outputs
  2. Document command line used
  3. Include metadata.yaml
  4. State measurement date and duration
  5. Describe any anomalies or outliers

Publication-Quality Reporting:

  • Follow minimal requirements
  • Add comparison figures with error bars
  • Include environment table
  • Provide raw data or access method
  • Document any data cleaning or filtering
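Since PyYAML already appears among the key dependencies, the metadata.yaml attachment can be generated in a few lines; the field names below are illustrative rather than a fixed schema:

import sys
import yaml  # PyYAML

metadata = {
    "command_line": " ".join(sys.argv),
    "measurement_date": "2024-01-15",  # illustrative
    "duration_seconds": 3600,          # illustrative
    "iterations": 100,
    "anomalies": "none observed",
}

with open("metadata.yaml", "w") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)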

Benefits of a Comprehensive METHODS.md

Creating a comprehensive METHODS.md file offers numerous benefits, enhancing the usability and credibility of benchmarks.

Enabling Reproducible Research

Detailed methodology documentation is essential for enabling reproducible research. When the methods used to generate benchmark results are clearly documented, others can replicate the experiments and verify the findings.

Providing Scientific Credibility

A well-documented methodology lends scientific credibility to benchmarks. Transparency in methods enhances trust in the results and fosters confidence in the benchmarking process.

Helping Users Interpret Results Correctly

Comprehensive documentation aids users in correctly interpreting benchmark results. By understanding the methodology, users can accurately assess the implications of the findings and avoid misinterpretations.

Supporting Academic and Industry Publications

Benchmarks with thorough methodology documentation are more likely to be suitable for academic and industry publications. The rigorous standards of these contexts demand clear and detailed explanations of methods.

Clarifying Proxy Metric Limitations

Proxy metrics, while useful, come with inherent limitations. Documenting these limitations is crucial for ensuring that the results are interpreted appropriately and that the uncertainties associated with the proxy metrics are understood.

Standardizing Reporting Across Users

A METHODS.md file promotes standardized reporting across different users. By providing clear guidelines and templates, it ensures that results are presented in a consistent format, facilitating comparison and analysis.

Implementation Steps

Implementing a comprehensive METHODS.md file involves several key steps:

  1. Create METHODS.md with sections outlined above
  2. Review and refine metric definitions
  3. Add example calculations
  4. Include code snippets for key algorithms
  5. Add figure templates and examples
  6. Link from all benchmark READMEs
  7. Request community review
  8. Keep document updated as methods evolve

Conclusion

Creating a comprehensive METHODS.md file is essential for any benchmark repository aiming for reproducibility, credibility, and usability. By meticulously documenting the methodology, metrics, statistical methods, and reporting standards, benchmark creators can ensure that their results are not only reliable but also readily interpretable and comparable. This, in turn, fosters trust in the benchmarks and facilitates their adoption in both academic and industry settings. Embracing a rigorous approach to methodology documentation is a cornerstone of sound benchmarking practice.

For further information on best practices in benchmarking, visit the SPEC (Standard Performance Evaluation Corporation) website.