-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil #2726
[Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil #2726
Conversation
|
||
// Validation for invalid case: FT explicitly enabled and GcsFaultToleranceOptions is not nil | ||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil { | ||
panic("Invalid configuration: GCS Fault Tolerance is explicitly disabled (ray.io/ft-enabled: false) while GcsFaultToleranceOptions is set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't panic here. It is user's error instead of our error. Make the instance in failed state instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MortalHappiness
It is confirmed that panic needs to be removed.
I’d like to discuss the initTemplateAnnotations
function. If I remove the panic and instead set the instance to a failed state, since it’s not using a pointer, it might not be possible to propagate the failed state back to the higher-level instance.
If I modify initTemplateAnnotations
to pass the instance as a pointer, the failed state will propagate back to the calling function, e.g., DefaultWorkerPodTemplate
. However, the instance passed to it is also not using a pointer, so the original instance wouldn’t be affected. If I were to change that, it would involve many modifications.
I want to confirm if it’s okay just to push forward and make these changes directly.
Or did I actually misunderstand something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't pass pointer. You can either:
- Change the function signature to returns
error
. And let the caller handle it. Propogate the error the callers. - Don't validate spec here. Validate it in function like
validateRayClusterSpec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, no problem
If there is anything that needs to be changed, please let me know. |
if podTemplate.Annotations == nil { | ||
podTemplate.Annotations = make(map[string]string) | ||
} | ||
|
||
// Validation for invalid case: FT explicitly enabled and GcsFaultToleranceOptions is not nil | ||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a function validateRayClusterSpec
and add the check in the very beginning of rayClusterReconcile
?
You can refer to the following links:
if err := r.validateRayClusterStatus(instance); err != nil { |
if err := validateRayJobSpec(rayJobInstance); err != nil { |
also cc @rueian for review. |
|
@@ -220,10 +220,26 @@ func (r *RayClusterReconciler) validateRayClusterStatus(instance *rayv1.RayClust | |||
return nil | |||
} | |||
|
|||
// Validation for invalid Ray Cluster configurations. | |||
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error { | |||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides this case, we also need to reject the case where the instance
has RAY_REDIS_ADDRESS
env set but has no instance.Annotations[utils.RayFTEnabledAnnotationKey] == "true"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this validateRayClusterSpec
validation should also be invoked in the optional validation webhook:
func (r *RayCluster) validateRayCluster() error { |
Note that the validation webhook is usually the recommended way to validate a k8s CR, but users can choose not to enable it. That is the reason we need to do validation in both reconciliation and the webhook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, no problem
logger.Error(err, "The RayCluster spec is invalid") | ||
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.InvalidRayClusterSpec), | ||
"The RayCluster spec is invalid %s/%s: %v", instance.Namespace, instance.Name, err) | ||
return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to requeue the invalid RayCluster.
f85d476
to
392120e
Compare
…nv set but has no instance.Annotations[utils.RayFTEnabledAnnotationKey] == "true"
if r.Annotations[RayFTEnabledAnnotationKey] == "false" && r.Spec.GcsFaultToleranceOptions != nil { | ||
return field.Invalid(field.NewPath("spec").Child("gcsFaultToleranceOptions"), r.Spec.GcsFaultToleranceOptions, "GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled") | ||
} | ||
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && r.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && r.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil { | |
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && utils.EnvVarExists("RAY_REDIS_ADDRESS ", r.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, is it possible that r.Spec.HeadGroupSpec.Template.Spec.Containers
doesn't exist at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I use utils, it will cause import cycle
.
I thinks we need to move ray-operator/controllers/ray/utils/constant.go
to other place to avoid import cycle
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, is it possible that
r.Spec.HeadGroupSpec.Template.Spec.Containers
doesn't exist at this point?
OK, I will handle this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thinks we need to move ray-operator/controllers/ray/utils/constant.go to other place to avoid import cycle?
One thing we could try is moving the EnvVarExists
and RayContainerIndex
to the /ray/v1
while keeping their references in the utils
.
If I use functions in utils, it will cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can leave the webhook change for another PR, so this PR will be easier to review.
func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance *rayv1.RayCluster) (ctrl.Result, error) { | ||
var reconcileErr error | ||
logger := ctrl.LoggerFrom(ctx) | ||
|
||
// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it below utils.ManagedByExternalController
.
func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance *rayv1.RayCluster) (ctrl.Result, error) { | ||
var reconcileErr error | ||
logger := ctrl.LoggerFrom(ctx) | ||
|
||
// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment. validateRayClusterSpec
is to validate the whole RayClusterSpec instead of GcsFaultToleranceOptions` only.
// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled |
// Validation for invalid Ray Cluster configurations. | ||
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error { | ||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil { | ||
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled") | |
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is set to false") |
@@ -220,10 +220,33 @@ func (r *RayClusterReconciler) validateRayClusterStatus(instance *rayv1.RayClust | |||
return nil | |||
} | |||
|
|||
// Validation for invalid Ray Cluster configurations. | |||
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't need to be the member function of RayClusterReconciler
. You can open another PR to update validateRayClusterStatus
.
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error { | |
func validateRayClusterSpec(instance *rayv1.RayCluster) error { |
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled") | ||
} | ||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil { | ||
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use utils.EnvVarExists
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil { | ||
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env { | ||
if env.Name == "RAY_REDIS_ADDRESS" { | ||
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is disabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is disabled") | |
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is not set or set to false") |
} | ||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil { | ||
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env { | ||
if env.Name == "RAY_REDIS_ADDRESS" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add RAY_REDIS_ADDRESS
to constant.go
.
…conciler, to avoid HeadGroupSpec container count < 0, use utils.EnvVarExists to check env variables
PTAL thx |
@@ -229,6 +246,13 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance | |||
return ctrl.Result{}, nil | |||
} | |||
|
|||
if err := validateRayClusterSpec(instance); err != nil { | |||
logger.Error(err, "The RayCluster spec is invalid") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add instance.Namespace and instance.Name into the message?
logger.Error(err, "The RayCluster spec is invalid") | ||
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.InvalidRayClusterSpec), | ||
"The RayCluster spec is invalid %s/%s: %v", instance.Namespace, instance.Name, err) | ||
return ctrl.Result{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ctrl.Result{}, err | |
return ctrl.Result{}, nil |
I guess returning an err here still requeues the invalid CR. Could you double check on that?
… avoid rescheduling in `validateRayClusterSpec`
|
|
||
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && | ||
len(instance.Spec.HeadGroupSpec.Template.Spec.Containers) > 0 && | ||
instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil
check is not necessary.
instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil { | ||
|
||
if utils.EnvVarExists(utils.RAY_REDIS_ADDRESS, instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env) { | ||
return fmt.Errorf("%s should not be set when %s is set to false", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("%s should not be set when %s is set to false", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey) | |
return fmt.Errorf("%s environment variable should not be set when %s annotation is not set to true", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey) |
@CheyuWu would you mind fixing the conflicts? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will open a follow up PR to address some issues directly.
Why are these changes needed?
The annotation explicitly disables GCS FT (i.e., ray.io/ft-enabled: "false") and GcsFaultToleranceOptions is not nil. This case is invalid.
Therefore, I add a validation to check
Related issue number
Closes #2694
Checks