Skip to content

[ML] Fix test failure updating model deployment with stale cluster state. #128667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

davidkyle
Copy link
Member

When updating a model deployment - changing the number of allocations for example- calculating the new state is an expensive operation so it is done outside of the ClusterStateUpdateTask. However, if there was another clusterstate update while computing the update then the submitted update fails due to an out of data cluster state.

The fix is quite easy, compute the model deployment update outside of the ClusterStateUpdateTask then merge it with the latest state when executing the task. The code already has a check that the deployment update is compatible with the new state (areClusterStatesCompatibleForRebalance(...)) making it safe to merge the new state.

Closes #121165

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning auto-backport Automatically create backport pull requests when merged v8.19.0 v9.1.0 v9.0.3 v8.18.3 labels May 30, 2025
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label May 30, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@@ -80,9 +78,6 @@ public class TrainedModelAssignmentClusterService implements ClusterStateListene

private static final Logger logger = LogManager.getLogger(TrainedModelAssignmentClusterService.class);

private static final TransportVersion RENAME_ALLOCATION_TO_ASSIGNMENT_TRANSPORT_VERSION = TransportVersions.V_8_3_0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These version checks are redundant in 9.0 and 9.1. The 8.x backports will need to keep them however.

ActionListener<ClusterState> updatedStateListener = ActionListener.wrap(
updatedState -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
ActionListener<TrainedModelAssignmentMetadata.Builder> updatedAssignmentListener = ActionListener.wrap(
updatedAssignment -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix, here the new assignment state is passed rather than the updated cluster state.

@@ -375,20 +375,17 @@ public void clusterChanged(ClusterChangedEvent event) {
final boolean isResetMode = MlMetadata.getMlMetadata(event.state()).isResetMode();
TrainedModelAssignmentMetadata modelAssignmentMetadata = TrainedModelAssignmentMetadata.fromState(event.state());
final String currentNode = event.state().nodes().getLocalNodeId();
final boolean isNewAllocationSupported = event.state()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another version change that is irrelevant for 9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged :ml Machine learning Team:ML Meta label for the ML team >test Issues or PRs that are addressing/adding tests v9.0.3 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PyTorchModelIT testUpdateDeployment_GivenAllocationsAreIncreased failure
2 participants