Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil #2726

Merged
merged 15 commits into from
Jan 14, 2025

Conversation

CheyuWu
Copy link
Contributor

@CheyuWu CheyuWu commented Jan 10, 2025

Why are these changes needed?

The annotation explicitly disables GCS FT (i.e., ray.io/ft-enabled: "false") and GcsFaultToleranceOptions is not nil. This case is invalid.
Therefore, I add a validation to check

Related issue number

Closes #2694

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 10, 2025

@MortalHappiness @kevin85421 PTAL

@CheyuWu CheyuWu changed the title feat: validation of RayFTEnabled == false and GcsFaultToleranceOption… [Feature] Vvalidation of RayFTEnabled is false and GcsFaultToleranceOption is not nil Jan 10, 2025
@CheyuWu CheyuWu changed the title [Feature] Vvalidation of RayFTEnabled is false and GcsFaultToleranceOption is not nil [Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil Jan 10, 2025

// Validation for invalid case: FT explicitly enabled and GcsFaultToleranceOptions is not nil
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil {
panic("Invalid configuration: GCS Fault Tolerance is explicitly disabled (ray.io/ft-enabled: false) while GcsFaultToleranceOptions is set")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't panic here. It is user's error instead of our error. Make the instance in failed state instead.

Copy link
Contributor Author

@CheyuWu CheyuWu Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MortalHappiness
It is confirmed that panic needs to be removed.

I’d like to discuss the initTemplateAnnotations function. If I remove the panic and instead set the instance to a failed state, since it’s not using a pointer, it might not be possible to propagate the failed state back to the higher-level instance.

If I modify initTemplateAnnotations to pass the instance as a pointer, the failed state will propagate back to the calling function, e.g., DefaultWorkerPodTemplate. However, the instance passed to it is also not using a pointer, so the original instance wouldn’t be affected. If I were to change that, it would involve many modifications.

I want to confirm if it’s okay just to push forward and make these changes directly.

Or did I actually misunderstand something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't pass pointer. You can either:

  • Change the function signature to returns error. And let the caller handle it. Propogate the error the callers.
  • Don't validate spec here. Validate it in function like validateRayClusterSpec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, no problem

@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 10, 2025

  • Added new rayClusterType -> InvalidConfiguration
  • Use instance as a pointer and pass the parameter into initTemplateAnnotations

If there is anything that needs to be changed, please let me know.

if podTemplate.Annotations == nil {
podTemplate.Annotations = make(map[string]string)
}

// Validation for invalid case: FT explicitly enabled and GcsFaultToleranceOptions is not nil
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a function validateRayClusterSpec and add the check in the very beginning of rayClusterReconcile?

You can refer to the following links:

if err := r.validateRayClusterStatus(instance); err != nil {

if err := validateRayJobSpec(rayJobInstance); err != nil {

@kevin85421 kevin85421 self-assigned this Jan 10, 2025
@kevin85421
Copy link
Member

also cc @rueian for review.

@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 11, 2025

  • I reverted the previous commit and added the validateRayClusterSpec function
  • add unit test for validateRayClusterSpec

@@ -220,10 +220,26 @@ func (r *RayClusterReconciler) validateRayClusterStatus(instance *rayv1.RayClust
return nil
}

// Validation for invalid Ray Cluster configurations.
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error {
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides this case, we also need to reject the case where the instance has RAY_REDIS_ADDRESS env set but has no instance.Annotations[utils.RayFTEnabledAnnotationKey] == "true".

Copy link
Contributor

@rueian rueian Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this validateRayClusterSpec validation should also be invoked in the optional validation webhook:

func (r *RayCluster) validateRayCluster() error {

Note that the validation webhook is usually the recommended way to validate a k8s CR, but users can choose not to enable it. That is the reason we need to do validation in both reconciliation and the webhook.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, no problem

logger.Error(err, "The RayCluster spec is invalid")
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.InvalidRayClusterSpec),
"The RayCluster spec is invalid %s/%s: %v", instance.Namespace, instance.Name, err)
return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to requeue the invalid RayCluster.

@CheyuWu CheyuWu force-pushed the feat/gcsFT/validation branch from f85d476 to 392120e Compare January 12, 2025 16:38
if r.Annotations[RayFTEnabledAnnotationKey] == "false" && r.Spec.GcsFaultToleranceOptions != nil {
return field.Invalid(field.NewPath("spec").Child("gcsFaultToleranceOptions"), r.Spec.GcsFaultToleranceOptions, "GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled")
}
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && r.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && r.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil {
if r.Annotations[RayFTEnabledAnnotationKey] != "true" && utils.EnvVarExists("RAY_REDIS_ADDRESS ", r.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, is it possible that r.Spec.HeadGroupSpec.Template.Spec.Containers doesn't exist at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I use utils, it will cause import cycle.
I thinks we need to move ray-operator/controllers/ray/utils/constant.go to other place to avoid import cycle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, is it possible that r.Spec.HeadGroupSpec.Template.Spec.Containers doesn't exist at this point?

OK, I will handle this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinks we need to move ray-operator/controllers/ray/utils/constant.go to other place to avoid import cycle?

One thing we could try is moving the EnvVarExists and RayContainerIndex to the /ray/v1 while keeping their references in the utils.

@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 12, 2025

  • I removed the action of requeue the invalid RayCluster.
  • Add a condition for reject the case where the instance has RAY_REDIS_ADDRESS env set but has no instance.Annotations[utils.RayFTEnabledAnnotationKey] == "true"
  • Create a new function and relative parameters in validateRayClusterSpec

If I use functions in utils, it will cause import cycle.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can leave the webhook change for another PR, so this PR will be easier to review.

func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance *rayv1.RayCluster) (ctrl.Result, error) {
var reconcileErr error
logger := ctrl.LoggerFrom(ctx)

// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it below utils.ManagedByExternalController.

func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance *rayv1.RayCluster) (ctrl.Result, error) {
var reconcileErr error
logger := ctrl.LoggerFrom(ctx)

// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment. validateRayClusterSpec is to validate the whole RayClusterSpec instead of GcsFaultToleranceOptions` only.

Suggested change
// Validate that GcsFaultToleranceOptions is nil when FT is explicitly disabled

// Validation for invalid Ray Cluster configurations.
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error {
if instance.Annotations[utils.RayFTEnabledAnnotationKey] == "false" && instance.Spec.GcsFaultToleranceOptions != nil {
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled")
return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is set to false")

@@ -220,10 +220,33 @@ func (r *RayClusterReconciler) validateRayClusterStatus(instance *rayv1.RayClust
return nil
}

// Validation for invalid Ray Cluster configurations.
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't need to be the member function of RayClusterReconciler. You can open another PR to update validateRayClusterStatus.

Suggested change
func (r *RayClusterReconciler) validateRayClusterSpec(instance *rayv1.RayCluster) error {
func validateRayClusterSpec(instance *rayv1.RayCluster) error {

return fmt.Errorf("GcsFaultToleranceOptions should be nil when ray.io/ft-enabled is disabled")
}
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil {
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use utils.EnvVarExists

if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil {
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env {
if env.Name == "RAY_REDIS_ADDRESS" {
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is disabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is disabled")
return fmt.Errorf("RAY_REDIS_ADDRESS should not be set when ray.io/ft-enabled is not set or set to false")

}
if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" && instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env != nil {
for _, env := range instance.Spec.HeadGroupSpec.Template.Spec.Containers[0].Env {
if env.Name == "RAY_REDIS_ADDRESS" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add RAY_REDIS_ADDRESS to constant.go.

…conciler, to avoid HeadGroupSpec container count < 0, use utils.EnvVarExists to check env variables
@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 13, 2025

PTAL thx

@@ -229,6 +246,13 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance
return ctrl.Result{}, nil
}

if err := validateRayClusterSpec(instance); err != nil {
logger.Error(err, "The RayCluster spec is invalid")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add instance.Namespace and instance.Name into the message?

logger.Error(err, "The RayCluster spec is invalid")
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.InvalidRayClusterSpec),
"The RayCluster spec is invalid %s/%s: %v", instance.Namespace, instance.Name, err)
return ctrl.Result{}, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ctrl.Result{}, err
return ctrl.Result{}, nil

I guess returning an err here still requeues the invalid CR. Could you double check on that?

… avoid rescheduling in `validateRayClusterSpec`
@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 13, 2025

  • Add instance.Namespace and instance.Name into the logger message
  • return nil in validateRayClusterSpec to avoid requeues


if instance.Annotations[utils.RayFTEnabledAnnotationKey] != "true" &&
len(instance.Spec.HeadGroupSpec.Template.Spec.Containers) > 0 &&
instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil check is not necessary.

instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env != nil {

if utils.EnvVarExists(utils.RAY_REDIS_ADDRESS, instance.Spec.HeadGroupSpec.Template.Spec.Containers[utils.RayContainerIndex].Env) {
return fmt.Errorf("%s should not be set when %s is set to false", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("%s should not be set when %s is set to false", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey)
return fmt.Errorf("%s environment variable should not be set when %s annotation is not set to true", utils.RAY_REDIS_ADDRESS, utils.RayFTEnabledAnnotationKey)

@kevin85421
Copy link
Member

@CheyuWu would you mind fixing the conflicts?

@CheyuWu
Copy link
Contributor Author

CheyuWu commented Jan 14, 2025

  • fix merge conflict
  • change error log message
  • remove not necessary check

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a follow up PR to address some issues directly.

@kevin85421 kevin85421 merged commit 92c2907 into ray-project:master Jan 14, 2025
24 checks passed
win5923 pushed a commit to win5923/kuberay that referenced this pull request Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GCS FT] GCS FT misconfiguration
4 participants