Fair Model Comparison: Removing Offending Dates & Horizon Evaluation
Introduction
Fair and accurate comparison of forecasting models requires careful examination of the data used for evaluation and identification of any anomalies or external factors that might skew the results. A recent review of the evaluation dashboard revealed forecast dates exhibiting issues across several models, prompting a deeper investigation and corrective action. This article describes the specific problems encountered, their potential impact on model evaluation, and the steps taken to address them so that the comparison is more robust and reliable.
Identifying Problematic Forecast Dates
Our initial review highlighted several forecast dates that presented challenges for model evaluation. One notable case involved the Chronos model, for which three forecasts were omitted due to extenuating circumstances, such as data freezes potentially related to the inauguration in January 2025. These irregularities could compromise the integrity of the model comparison: although the model's outputs were technically correct, the underlying data issues warranted exclusion to maintain fairness.
Another instance involved the UMass-flusion/spatial model, whose forecast was entirely missing from the evaluation for 2025-01-25. Further investigation revealed that some models had submitted a -1 horizon, raising questions about whether those models were being properly evaluated and whether the evaluation process accounted for such submissions at all. Understanding these missing forecasts and deciding how negative horizons should be handled is crucial for accurate model assessment.
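A coverage check run over the submission data before scoring can make this kind of gap visible early. The sketch below is a minimal example assuming a long-format table with `model`, `reference_date`, and `horizon` columns; the file name and column names are assumptions for illustration, not the dashboard's actual schema.

```python
import pandas as pd

# Assumed long-format submissions table; file and column names are illustrative.
submissions = pd.read_csv("submissions.csv")  # columns: model, reference_date, horizon, ...

# A completely missing forecast: a model with no rows at all for a reference date.
all_dates = set(submissions["reference_date"].unique())
missing = (
    submissions.groupby("model")["reference_date"]
    .apply(lambda s: sorted(all_dates - set(s)))
    .rename("missing_reference_dates")
)
print(missing[missing.apply(len) > 0])

# Negative horizons, such as the -1 horizon submissions noted above.
negative = submissions.loc[submissions["horizon"] < 0, ["model", "reference_date", "horizon"]]
print(negative.drop_duplicates())
```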
Furthermore, we considered the broader implications of these date-specific issues. In particular, we questioned whether the evaluation process, built around the N evaluations discussed below, might inadvertently exclude an entire week of forecasting data across all models. Such a systematic exclusion could introduce bias and affect the overall ranking and performance assessment of the models. Addressing this concern requires a comprehensive understanding of the evaluation methodology and its potential limitations.
Addressing the Challenges: A Multi-Faceted Approach
To ensure a fair model comparison, the identified issues need to be tackled, either individually or through a comprehensive solution that addresses all of them at once. The key considerations for addressing these challenges include the following (a short code sketch combining them appears after the list):
- Data Preprocessing and Cleaning: Thoroughly examine the data used for evaluation, identifying and correcting any anomalies, errors, or inconsistencies. This may involve removing extraneous data points, imputing missing values, or adjusting for known biases. A clean and reliable dataset is the foundation for accurate model evaluation.
- Horizon Handling: Develop a clear and consistent approach for handling forecast horizons, particularly negative or invalid horizons. Define specific rules for including or excluding forecasts based on their horizon values and ensure that these rules are applied consistently across all models.
- Evaluation Methodology: Review the evaluation methodology to identify any potential biases or limitations. Consider alternative evaluation metrics and techniques that are more robust to data irregularities or model-specific issues. Ensure that the chosen methodology accurately reflects the models' performance and predictive capabilities.
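As a concrete illustration of how these three considerations fit together, the sketch below filters a scores table before summarizing it. The file name, the column names (`model`, `reference_date`, `horizon`, `wis`), the 0-3 horizon range, and the use of WIS as the score are assumptions made for the example rather than the dashboard's actual configuration.

```python
import pandas as pd

scores = pd.read_csv("evaluation_scores.csv")  # assumed columns: model, reference_date, horizon, wis

# 1. Data cleaning: drop forecast dates known to be affected by data issues.
OFFENDING_DATES: set[str] = set()  # filled in with the dates identified during review
scores = scores[~scores["reference_date"].isin(OFFENDING_DATES)]

# 2. Horizon handling: keep only the horizons the evaluation is defined over.
VALID_HORIZONS = {0, 1, 2, 3}  # assumed convention; -1 and other invalid values are dropped
scores = scores[scores["horizon"].isin(VALID_HORIZONS)]

# 3. Evaluation methodology: report a robust statistic alongside the mean so a few
#    irregular dates cannot dominate the ranking.
summary = scores.groupby("model")["wis"].agg(["mean", "median", "count"]).sort_values("mean")
print(summary)
```

Keeping the exclusion list and horizon rule in one place makes it easy to confirm that every model is filtered identically before any scores are compared.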
Chronos Model and Extenuating Circumstances
For the Chronos model, the primary concern was the three forecasts deemed problematic due to external factors, potentially including data freezes associated with the inauguration in January 2025. These irregularities could skew the model's apparent performance: although the model's outputs remained accurate, the underlying data issues warranted careful consideration. The decision to exclude these forecasts was driven by the desire to maintain fairness and to evaluate the model on data representative of typical forecasting conditions.
To address this issue, we reviewed the affected dates and their corresponding data to identify the specific factors contributing to the irregularities and to assess their potential impact on the model's performance. Understanding the root causes allowed an informed decision about how to handle the affected forecasts; in this case they were removed from the evaluation to avoid any bias introduced by the extenuating circumstances, so that the Chronos model was evaluated on its performance under normal forecasting conditions.
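One way to keep such exclusions transparent and reproducible is to record them per model and apply them in a single place. The sketch below assumes the same long-format scores table as above; the model label and the three dates are placeholders standing in for the Chronos forecasts identified in the review, not a confirmed list.

```python
import pandas as pd

# Placeholder exclusion list; the label and dates stand in for the three affected Chronos forecasts.
EXCLUSIONS = {
    "chronos": ["2025-01-18", "2025-01-25", "2025-02-01"],
}

def apply_exclusions(scores: pd.DataFrame, exclusions: dict) -> pd.DataFrame:
    """Drop the (model, reference_date) pairs listed in `exclusions`; other models are untouched."""
    mask = pd.Series(False, index=scores.index)
    for model, dates in exclusions.items():
        mask |= (scores["model"] == model) & scores["reference_date"].isin(dates)
    return scores.loc[~mask]

# usage: clean_scores = apply_exclusions(scores, EXCLUSIONS)
```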
However, excluding these forecasts also raises the question of information loss: each excluded forecast is a data point that could provide insight into the model's behavior and performance. The trade-off between fairness and information loss therefore needs to be weighed explicitly. In this case the potential bias introduced by the extenuating circumstances was judged to outweigh the information lost by excluding the forecasts, but the decision should be revisited if similar situations arise, and alternative approaches for handling them should be explored.
UMass-flusion/spatial Model and Missing Forecasts
The case of the UMass-flusion/spatial model presented a different set of challenges. The complete absence of a forecast in the evaluation for 2025-01-25 immediately raised concerns about the integrity of the evaluation process. Further investigation revealed that some models had submitted a -1 horizon, prompting questions about the handling of such submissions. A negative horizon implies forecasting into the past, which is typically not a valid operation. Therefore, it is crucial to understand how these negative horizons were being treated and whether they were being properly accounted for in the evaluation.
The discovery of the -1 horizon submissions led to a broader discussion about the evaluation methodology. Were these models with negative horizons being evaluated at all? If so, how were their forecasts being interpreted and compared to other models? These questions highlighted the need for a clear and consistent approach for handling invalid horizon values. A robust evaluation process should include mechanisms for detecting and addressing such issues, ensuring that all models are evaluated on a level playing field.
To resolve this issue, several steps were taken. First, the evaluation process was reviewed to determine how negative horizons were being handled. It was confirmed that models submitting a -1 horizon were not being properly evaluated, as their forecasts were being effectively ignored. This raised concerns about fairness, as these models were not being penalized for their invalid submissions. To address this, the evaluation process was modified to explicitly reject forecasts with negative horizons, ensuring that all models are subject to the same evaluation criteria. Second, the models that had submitted negative horizons were notified of the issue and provided with guidance on how to correct their submissions. This helped to prevent similar issues from arising in future evaluations.
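A lightweight validation step along these lines can surface the problem at submission or scoring time instead of silently dropping the forecast. The function below is a sketch; the `horizon` column name and the 0-3 range are assumptions about the hub's conventions.

```python
import pandas as pd

def validate_horizons(submission: pd.DataFrame, valid_horizons=range(0, 4)) -> pd.DataFrame:
    """Raise on horizons outside the agreed range rather than silently ignoring those rows."""
    bad = submission[~submission["horizon"].isin(valid_horizons)]
    if not bad.empty:
        raise ValueError(
            f"Invalid horizons {sorted(bad['horizon'].unique())} in {len(bad)} rows; "
            "these forecasts would otherwise be skipped by the evaluation."
        )
    return submission
```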
The N Evaluations and Potential Week Exclusion
A further concern that emerged from our review was the potential for the evaluation process to inadvertently exclude an entire week of forecasting data across all models. This concern stemmed from the way the evaluations were structured and the timing of the forecast submissions. If the evaluations were consistently aligned with specific days of the week, and if the forecast submissions were made at a certain point in the week, it was possible that an entire week's worth of forecasts could be excluded from the evaluation.
This potential exclusion could have significant implications for the accuracy and fairness of the model comparison. If certain models were better at forecasting during the excluded week, their performance would be unfairly penalized. Conversely, models that struggled during that week would be unfairly rewarded. To address this concern, a thorough analysis of the evaluation process was conducted. This analysis involved examining the dates of the evaluations, the timing of the forecast submissions, and the alignment of these factors with the weekly cycle.
The analysis revealed that, while the potential for week exclusion existed, it was not consistently occurring. The evaluations were spread across different days of the week, and the timing of the submissions varied. However, the analysis did highlight the need for greater awareness of this potential issue. To mitigate the risk of week exclusion, several recommendations were made. First, the evaluation schedule should be carefully reviewed to ensure that evaluations are evenly distributed across the week. Second, the timing of the forecast submissions should be adjusted to minimize the potential for alignment with specific days of the week. Finally, the evaluation metrics should be chosen to minimize the impact of any week-specific biases.
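A simple coverage check can make any week-level gap explicit. The sketch below assumes weekly reference dates falling on Saturdays (pandas' `W-SAT` frequency) and the same column names as earlier; both are assumptions about this hub's schedule rather than confirmed facts.

```python
import pandas as pd

scores = pd.read_csv("evaluation_scores.csv", parse_dates=["reference_date"])

# Every Saturday between the first and last reference date should appear in the evaluation.
expected = pd.date_range(
    scores["reference_date"].min(), scores["reference_date"].max(), freq="W-SAT"
)
present = set(scores["reference_date"].dt.normalize())
missing_weeks = [d.date() for d in expected if d not in present]
print("Weeks absent from the evaluation:", missing_weeks or "none")
```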
Resolution and Sub-Tasks
Addressing the issues identified in the evaluation dashboard requires a comprehensive approach. A resolution may involve tackling all the concerns simultaneously or breaking them down into smaller, more manageable sub-tasks. The key is to prioritize the issues based on their potential impact on the fairness and accuracy of the model comparison and to develop a clear plan for addressing each issue.
One potential approach is to create sub-tasks for each of the identified problems. For example, a sub-task could be created to address the Chronos model's extraneous situations, another to handle the UMass-flusion/spatial model's missing forecasts, and a third to investigate the potential week exclusion issue. This approach allows for focused attention on each problem and facilitates the development of tailored solutions. Alternatively, a more holistic approach could be adopted, where a single task is created to address all the issues simultaneously. This approach may be more efficient but requires a broader understanding of the evaluation process and its potential limitations.
Regardless of the approach taken, it is crucial to involve all stakeholders in the resolution process. This includes the model developers, the evaluation team, and the end-users of the forecasts. By working together, these stakeholders can ensure that the issues are addressed effectively and that the model comparison is fair and accurate.
Supersedes #45
This discussion builds upon the previous efforts outlined in #45, further elaborating on the identified issues and proposing a comprehensive resolution strategy. The goal is to establish a robust and transparent evaluation process that fosters confidence in the models and their predictions.
Conclusion
Ensuring fair and accurate model comparison is a continuous process that requires vigilance and attention to detail. By identifying and addressing issues such as problematic forecast dates, invalid horizon values, and week-exclusion biases, we can improve the reliability and trustworthiness of our evaluations. The steps outlined in this article underscore the importance of data quality and careful evaluation methodology, strengthening the foundation for reliable model comparisons and, ultimately, more accurate predictions.
For further information on best practices in forecasting and model evaluation, you may find the resources available at The M Competitions to be valuable.