Remove dead recoverable_exceptions and is_recoverable_fetch_e #5120

Open
saitcakmak wants to merge 2 commits into facebook:main from saitcakmak:export-D98932195

@saitcakmak

Summary:
Follow-up to D98924467, which decoupled metric fetch errors from trial
status in the Orchestrator. The orchestrator no longer uses
`recoverable_exceptions` or `is_recoverable_fetch_e` to decide trial
fate, making them dead code.

- Remove `Metric.recoverable_exceptions` class attribute and
  `Metric.is_recoverable_fetch_e` classmethod from `ax/core/metric.py`.

Differential Revision: D98932195

…ook#5119)

Summary:

Design doc: D98741656

When `fetch_trials_data_results` returned a `MetricFetchE` for an
optimization config metric, the orchestrator marked the trial as
ABANDONED. This discarded good data, inflated the failure rate, and was
inconsistent with the Client layer, which keeps trials COMPLETED with
incomplete metrics via `MetricAvailability` (D93924193).

This diff removes the trial abandonment behavior. Metric fetch errors
are now logged (with traceback via `logger.exception`) but trial status
is unchanged. `MetricAvailability` tracks data completeness, and the
failure rate check uses it to detect persistent metric issues.
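
A minimal sketch of the behavior described above, not the Orchestrator's actual code: a metric fetch error is logged with its traceback and reported as a missing metric, while the trial's status is left untouched. `Trial`, `TrialStatus`, and `log_fetch_errors` are hypothetical stand-ins invented for illustration; the real logic lives in `_fetch_and_process_trials_data_results` and operates on `MetricFetchE` results.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import List, Mapping, Union

logger = logging.getLogger("orchestrator_sketch")


class TrialStatus(Enum):
    # Hypothetical stand-in for the real trial status enum.
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    ABANDONED = "ABANDONED"


@dataclass
class Trial:
    # Hypothetical stand-in for a trial; only the fields the sketch needs.
    index: int
    status: TrialStatus


def log_fetch_errors(
    trial: Trial, results: Mapping[str, Union[float, Exception]]
) -> List[str]:
    """Log every failed metric fetch and return the names of the failed metrics.

    The trial status is deliberately never modified here.
    """
    failed: List[str] = []
    for metric_name, result in results.items():
        if isinstance(result, Exception):
            # The PR logs via `logger.exception`; this sketch passes the stored
            # exception through `exc_info` because no exception is being
            # handled at this point, which yields the same traceback output.
            logger.error(
                "Failed to fetch metric %r for trial %d",
                metric_name,
                trial.index,
                exc_info=result,
            )
            failed.append(metric_name)
    return failed


trial = Trial(index=0, status=TrialStatus.COMPLETED)
failed_metrics = log_fetch_errors(
    trial, {"loss": 0.3, "latency": RuntimeError("backend timeout")}
)
```

In this sketch the caller gets back the names of the metrics that failed, which is roughly the information a completeness tracker such as `MetricAvailability` would need; how the real code records it is not shown in this PR description.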

Changes:
- `_fetch_and_process_trials_data_results`: Removed the branch that
  marked trials ABANDONED for metric fetch errors and the separate
  `is_available_while_running` branch. All metric fetch errors are
  now simply logged and the method continues. The `_report_metric_fetch_e`
  hook is still called so subclasses (e.g. `AxSweepOrchestrator`) can
  react to errors (create pastes, build error tables, etc.).
- `error_if_failure_rate_exceeded`: Merged `_check_if_failure_rate_exceeded`
  into this method to avoid duplicate computation. Now counts both
  runner failures (FAILED/ABANDONED) and metric-incomplete trials (via
  `compute_metric_availability`) toward the failure rate.
- `_get_failure_rate_exceeded_error`: Rewritten with an actionable
  error message listing runner failures, metric-incomplete trials,
  missing metrics, and affected trial indices.
- Removed dead code: `_mark_err_trial_status`,
  `_num_trials_bad_due_to_err`, `_num_metric_fetch_e_encountered`,
  `_check_if_failure_rate_exceeded`, `METRIC_FETCH_ERR_MESSAGE`.
- Kept `_report_metric_fetch_e` as a no-op hook so subclasses like
  `AxSweepOrchestrator` can still react to metric fetch errors.
- Updated telemetry (`OrchestratorCompletedRecord`) to use
  `_count_metric_incomplete_trials` (via `compute_metric_availability`)
  for both `num_metric_fetch_e_encountered` and
  `num_trials_bad_due_to_err`.
- Updated `AxSweepOrchestrator` test assertions: trials now stay
  COMPLETED (not ABANDONED) after metric fetch errors.
- `Metric.recoverable_exceptions` and `Metric.is_recoverable_fetch_e`
  are kept for now since `pts/` metrics still reference them; cleanup
  will follow in a separate diff.
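
The combined failure rate check described in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the actual `error_if_failure_rate_exceeded` implementation: `TrialSummary`, `failure_rate_error`, the `missing_metrics` field (standing in for the output of `compute_metric_availability`), and the choice of all trials as the denominator are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TrialStatus(Enum):
    # Hypothetical stand-in for the real trial status enum.
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    ABANDONED = "ABANDONED"


@dataclass
class TrialSummary:
    # Hypothetical per-trial record; `missing_metrics` stands in for whatever
    # the real metric availability computation reports.
    index: int
    status: TrialStatus
    missing_metrics: List[str] = field(default_factory=list)


def failure_rate_error(
    trials: List[TrialSummary], tolerated_rate: float
) -> Optional[str]:
    """Return an actionable message if the failure rate is too high, else None."""
    if not trials:
        return None
    # Runner failures: trials that ended FAILED or ABANDONED.
    runner_failed = sorted(
        t.index
        for t in trials
        if t.status in (TrialStatus.FAILED, TrialStatus.ABANDONED)
    )
    # Metric-incomplete trials: COMPLETED but with at least one missing metric.
    incomplete = [
        t for t in trials if t.status is TrialStatus.COMPLETED and t.missing_metrics
    ]
    incomplete_indices = sorted(t.index for t in incomplete)
    # Assumption: both kinds of bad trial count equally, over all trials.
    rate = (len(runner_failed) + len(incomplete_indices)) / len(trials)
    if rate <= tolerated_rate:
        return None
    missing = sorted({m for t in incomplete for m in t.missing_metrics})
    return (
        f"Failure rate {rate:.0%} exceeds the tolerated {tolerated_rate:.0%}: "
        f"{len(runner_failed)} runner failure(s) in trials {runner_failed}, "
        f"{len(incomplete_indices)} metric-incomplete trial(s) in trials "
        f"{incomplete_indices}, missing metrics: {missing}."
    )


trials = [
    TrialSummary(0, TrialStatus.COMPLETED),
    TrialSummary(1, TrialStatus.FAILED),
    TrialSummary(2, TrialStatus.COMPLETED, missing_metrics=["latency"]),
    TrialSummary(3, TrialStatus.COMPLETED),
]
message = failure_rate_error(trials, tolerated_rate=0.25)
```

Counting both categories in one pass mirrors the stated motivation for merging `_check_if_failure_rate_exceeded` into `error_if_failure_rate_exceeded`: the counts are computed once and reused for the error message.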

Differential Revision: D98924467

Summary:
Follow-up to D98924467, which decoupled metric fetch errors from trial
status in the Orchestrator. The orchestrator no longer uses
`recoverable_exceptions` or `is_recoverable_fetch_e` to decide trial
fate, making them dead code.

- Remove `Metric.recoverable_exceptions` class attribute and
  `Metric.is_recoverable_fetch_e` classmethod from `ax/core/metric.py`.

Differential Revision: D98932195

@meta-cla meta-cla bot added the CLA Signed label Apr 1, 2026

@meta-codesync

meta-codesync bot commented Apr 1, 2026

@saitcakmak has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98932195.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.39%. Comparing base (6cebd1c) to head (982d09e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5120      +/-   ##
==========================================
- Coverage   96.40%   96.39%   -0.02%     
==========================================
  Files         613      613              
  Lines       68142    68140       -2     
==========================================
- Hits        65694    65683      -11     
- Misses       2448     2457       +9     

Labels

CLA Signed, fb-exported, meta-exported