-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] Fix test failure updating model deployment with stale cluster state. #128667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/ml-core (Team:ML) |
@@ -80,9 +78,6 @@ public class TrainedModelAssignmentClusterService implements ClusterStateListene | |||
|
|||
private static final Logger logger = LogManager.getLogger(TrainedModelAssignmentClusterService.class); | |||
|
|||
private static final TransportVersion RENAME_ALLOCATION_TO_ASSIGNMENT_TRANSPORT_VERSION = TransportVersions.V_8_3_0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These version checks are redundant in 9.0 and 9.1. The 8.x backports will need to keep them however.
ActionListener<ClusterState> updatedStateListener = ActionListener.wrap( | ||
updatedState -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() { | ||
ActionListener<TrainedModelAssignmentMetadata.Builder> updatedAssignmentListener = ActionListener.wrap( | ||
updatedAssignment -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix, here the new assignment state is passed rather than the updated cluster state.
@@ -375,20 +375,17 @@ public void clusterChanged(ClusterChangedEvent event) { | |||
final boolean isResetMode = MlMetadata.getMlMetadata(event.state()).isResetMode(); | |||
TrainedModelAssignmentMetadata modelAssignmentMetadata = TrainedModelAssignmentMetadata.fromState(event.state()); | |||
final String currentNode = event.state().nodes().getLocalNodeId(); | |||
final boolean isNewAllocationSupported = event.state() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another version change that is irrelevant for 9
When updating a model deployment - changing the number of allocations for example- calculating the new state is an expensive operation so it is done outside of the ClusterStateUpdateTask. However, if there was another clusterstate update while computing the update then the submitted update fails due to an out of data cluster state.
The fix is quite easy, compute the model deployment update outside of the ClusterStateUpdateTask then merge it with the latest state when executing the task. The code already has a check that the deployment update is compatible with the new state (
areClusterStatesCompatibleForRebalance(...)
) making it safe to merge the new state.Closes #121165