[ML] Fix test failure updating model deployment with stale cluster state. #128667

davidkyle · 2025-05-30T11:16:36Z

When updating a model deployment (changing the number of allocations, for example), calculating the new state is an expensive operation, so it is done outside of the ClusterStateUpdateTask. However, if another cluster state update happens while the update is being computed, the submitted update fails due to an out-of-date cluster state.

The fix is quite easy: compute the model deployment update outside of the ClusterStateUpdateTask, then merge it with the latest state when executing the task. The code already checks that the deployment update is compatible with the new state (areClusterStatesCompatibleForRebalance(...)), which makes the merge safe.
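The compute-outside, merge-inside pattern described above can be sketched roughly as follows. This is a toy model, not the Elasticsearch code: `State`, `computeAssignmentUpdate`, `compatibleForRebalance` and `applyToLatest` are hypothetical names, and the compatibility rule used here (the node set must be unchanged) is only an assumed stand-in for whatever areClusterStatesCompatibleForRebalance(...) actually checks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for the cluster state update flow; none of these types are Elasticsearch classes. */
final class StaleStateMergeSketch {

    /** Immutable snapshot: a version, the node set, allocations per deployment, and unrelated metadata. */
    static final class State {
        final long version;
        final Set<String> nodes;
        final Map<String, Integer> allocations;
        final Map<String, String> otherMetadata;

        State(long version, Set<String> nodes, Map<String, Integer> allocations, Map<String, String> otherMetadata) {
            this.version = version;
            this.nodes = Set.copyOf(nodes);
            this.allocations = Map.copyOf(allocations);
            this.otherMetadata = Map.copyOf(otherMetadata);
        }
    }

    /** Expensive step, run outside the update task against a snapshot that may become stale. */
    static Map<String, Integer> computeAssignmentUpdate(State snapshot, String deploymentId, int newAllocations) {
        Map<String, Integer> updated = new HashMap<>(snapshot.allocations);
        updated.put(deploymentId, newAllocations);
        return updated;
    }

    /** Hypothetical compatibility rule (an assumption for this sketch): the node set must not have changed. */
    static boolean compatibleForRebalance(State computedFrom, State latest) {
        return computedFrom.nodes.equals(latest.nodes);
    }

    /** Cheap step, run inside the update task: merge only the assignment part into the latest state. */
    static State applyToLatest(State computedFrom, State latest, Map<String, Integer> updatedAssignment) {
        if (!compatibleForRebalance(computedFrom, latest)) {
            throw new IllegalStateException("cluster changed incompatibly; recompute the assignment");
        }
        // Everything except the assignment is taken from the latest state, so unrelated
        // updates that landed while the assignment was being computed are preserved.
        return new State(latest.version + 1, latest.nodes, updatedAssignment, latest.otherMetadata);
    }
}
```

Only the assignment is carried from the stale snapshot into the task; the rest of the state is read at execution time, so an unrelated concurrent update no longer invalidates the submission.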

Closes #121165

elasticsearchmachine · 2025-05-30T11:17:01Z

Pinging @elastic/ml-core (Team:ML)

davidkyle · 2025-05-30T11:17:54Z

...va/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

@@ -80,9 +78,6 @@ public class TrainedModelAssignmentClusterService implements ClusterStateListene

    private static final Logger logger = LogManager.getLogger(TrainedModelAssignmentClusterService.class);

-    private static final TransportVersion RENAME_ALLOCATION_TO_ASSIGNMENT_TRANSPORT_VERSION = TransportVersions.V_8_3_0;


These version checks are redundant in 9.0 and 9.1. The 8.x backports will need to keep them, however.

davidkyle · 2025-05-30T11:19:00Z

...va/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

-        ActionListener<ClusterState> updatedStateListener = ActionListener.wrap(
-            updatedState -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
+        ActionListener<TrainedModelAssignmentMetadata.Builder> updatedAssignmentListener = ActionListener.wrap(
+            updatedAssignment -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {


This is the fix: the new assignment state is passed rather than the updated cluster state.

davidkyle · 2025-05-30T11:19:37Z

.../java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

@@ -375,20 +375,17 @@ public void clusterChanged(ClusterChangedEvent event) {
        final boolean isResetMode = MlMetadata.getMlMetadata(event.state()).isResetMode();
        TrainedModelAssignmentMetadata modelAssignmentMetadata = TrainedModelAssignmentMetadata.fromState(event.state());
        final String currentNode = event.state().nodes().getLocalNodeId();
-        final boolean isNewAllocationSupported = event.state()


Another version-related change that is irrelevant for 9.x.

Use latest state

06bec76

davidkyle added the >test, :ml, auto-backport, v8.19.0, v9.1.0, v9.0.3 and v8.18.3 labels May 30, 2025

elasticsearchmachine added the Team:ML label May 30, 2025

davidkyle removed the v8.19.0 and v8.18.3 labels May 30, 2025
