KEP-4444: add "prefer same node" semantics to PreferClose #4931

Open · wants to merge 2 commits into `master`

`keps/sig-network/4444-service-traffic-distribution/README.md` (48 changes: 38 additions & 10 deletions)
@@ -13,6 +13,7 @@
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Story 3](#story-3)
- [Story 4 (DNS)](#story-4-dns)
- [Notes/Constraints/Caveats](#notesconstraintscaveats)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
@@ -266,6 +267,18 @@ NOTE: Implementations reserve the right to refine the behavior associated with
implementation. This simplifies their deployment process and reduces the
complexity of managing cross-cluster applications.

#### Story 4 (DNS)

* **Requirement:** As a cluster administrator, I plan to deploy CoreDNS so that there
is an endpoint on every node, and I want clients to preferentially connect to the
endpoint on their own node, to reduce latency. I'm not worried about CoreDNS
endpoints becoming overloaded by unbalanced traffic distribution, because "1
endpoint per node" is substantially more replicas than the service "needs" anyway.
* **Solution:** Set `trafficDistribution: PreferClose` (see the example manifest
after this story).
* **Effect:** The Kubernetes implementation will recognize that the DNS service is
deployed in a way that supports "prefer same node" traffic distribution, and behave
accordingly.

> **Review comment (Member):** Yeah, this is common, to have the DNS resolver per
> node ... This also removes the nodeLocalDNS hack to divert the traffic, and
> people/installers (kops, kubeadm, ...) can just replace the DNS Deployment with a
> DaemonSet.
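
To make Story 4 concrete, here is a minimal sketch of such a Service. It assumes
CoreDNS is deployed as a DaemonSet whose Pods carry the `k8s-app: kube-dns` label;
the names, namespace, and labels are illustrative, not prescriptive.

```yaml
# Sketch only: a DNS Service whose endpoints are backed by a CoreDNS DaemonSet
# (one Pod per node). Names and labels are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns               # assumed label on the CoreDNS DaemonSet Pods
  trafficDistribution: PreferClose  # prefer topologically close endpoints
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
```

With the change proposed in this PR, the same manifest would get same-node
preference automatically once the implementation detects the DaemonSet-backed
topology described later in the Design Details.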

### Notes/Constraints/Caveats

This proposal is our third attempt at an API revolving around such a
@@ -311,14 +324,29 @@ the value configured for `trafficDistribution`
* This leverages existing implementation, requiring no major changes.

#### `PreferClose`
* **Meaning:**
* Attempts to route traffic to endpoints within the same zone as the client. If no
endpoints are available within the zone, traffic would be routed to other zones.
* If the Service has endpoints that are part of a DaemonSet deployed to all nodes,
then it attempts to route traffic to endpoints on the same node as the client. If
no endpoints are available on the same node, traffic would be routed to other
nodes (though still preferring the same zone, if relevant).

> **Review comment (Member) on lines +331 to +333:** Just a comment: this is super
> great! Having used `internalTrafficPolicy: Local`, its one shortfall is when that
> endpoint is briefly down (when we upgrade the DaemonSet, for example). As a user
> I'd prefer traffic to stay local to the node, but I don't mind if it occasionally
> leaves the node (or the zone), as long as it's temporary. This is just a comment,
> so you're welcome to resolve this.

* This preference will be implemented by the use of Hints within EndpointSlices.
* We already use `Hints.ForZones` to implement
`service.kubernetes.io/topology-mode: Auto`. In a similar manner, the
EndpointSlice controller will now also populate zone hints for
`trafficDistribution: PreferClose` -- although in this case, the zone hint will
match the zone of the endpoint itself.
* For the same-node preference, the EndpointSlice controller will set a new
`Hints.ForSameNode` field (to `true`) to indicate to the service proxy that
clients on the same node as this endpoint should prefer it. This will be set
when:
* the Service has `PreferClose` traffic distribution, and
* the Service has endpoints on all or all-but-one of the Ready nodes, and
* the endpoints of the Service are Pods that have `OwnerReferences` indicating
that they are all part of the same DaemonSet.

> **@adrianmoisey (Member), Oct 24, 2024** (on "all or all-but-one of the Ready nodes"):
> This comment may be similar to what Antonio is saying below. If a DaemonSet is
> configured with `updateStrategy.rollingUpdate.maxUnavailable` > 1, this won't match.
>
> Also, a use case we have is that we run a statsd DaemonSet on all nodes, except for
> a handful of nodes that are short-lived and have a very specific task, and running
> the DaemonSet there costs too much.
>
> As a user I'd want the "SameNode" functionality to be the "preferred" method, but I
> also understand that this feature is covering a wide range of use cases.

> **Author (Contributor):** And would you want prefer-same-node semantics for that
> Service?
>
> There is definitely a tradeoff if we want to try to "auto-detect" this.
>
> Maybe I should try to figure out the TopologySpreadConstraints version of this...

> **@adrianmoisey (Member):**
>
> > And would you want prefer-same-node semantics for that Service?
>
> Definitely! Since we are running the DaemonSet on most of our nodes, we wouldn't
> worry about overloading a specific node.
>
> What is the thought behind having the safeguards described here? Is the worry that
> a single node could be overloaded with traffic, and maybe the correct choice would
> have been to send traffic to a different Pod (at the cost of going over the network)?
>
> I may be missing something here (context: I've never used the topology hints
> feature, but would really like to, so maybe I haven't encountered its downsides
> yet), but the name "PreferClose" makes me assume that traffic would be sent to a
> pod on the same node if available; if it can't go to a local pod, then to a pod in
> the same zone; and if that fails, then to a random pod in the cluster.

> **Author (Contributor):** The issue is that we don't know anything about the
> clients of the service, and we can't just assume that they are distributed in a
> way that ensures prefer-same-node will result in good behavior. If, for some
> reason, there are endpoints on Nodes A, B, and C, but most of the clients are on
> Node A, then prefer-same-node would result in the Node A endpoint getting
> overloaded while the Node B and C endpoints were idle. (If the nodes are
> heterogeneous in some way that affects scheduling, this might even be a
> predictable outcome.)
>
> If I were attaching the prefer-same-node semantics to a new traffic distribution
> value, then this wouldn't be a problem because we could just say "don't use
> `trafficDistribution: PreferSameNode` if you don't want that". But since there are
> already people using `PreferClose`, we have to be careful not to break them.

> **@adrianmoisey (Member), Oct 28, 2024:** Right, so I guess this KEP is about
> `PreferClose`; does it make sense for the thing I'm asking for to be under a
> different option?
>
> I'm still not sure I like this test:
>
> > the Service has endpoints on all or all-but-one of the Ready nodes
>
> But I also can't think of a better way to determine if the DaemonSet is running
> everywhere. I guess the situation I'm worried about is when a DaemonSet is being
> updated and multiple Pods are not ready at a time. However, in theory this
> scenario is temporary, so it isn't a big deal.
>
> Maybe it's worth excluding control-plane nodes from the set of nodes?

> **Review comment (Member) on lines +346 to +348:** Isn't this a bit of a stretch,
> relying on these specifics? Wouldn't it be better to add a new algorithm like
> `PreferNode` than to try to special-case the existing one?

* While it may seem redundant to populate the hints here since kube-proxy can
already derive the zone hint from the endpoints zone (as they would be the
same), we will still use this for implementation simply because of the reason
@@ -363,7 +391,9 @@ NOTE: The expectation remains that *all* endpoints within an EndpointSlice must
Aware
Hints](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2433-topology-aware-hints/README.md#kube-proxy), i.e. _"This is to provide safer transitions between enabled and disabled states. Without this fallback, endpoints could easily get overloaded as hints were being added or removed from some EndpointSlices but had not yet propagated to all of them."_

Additionally, if a service has any endpoints for which (a) the `NodeName` is set to
the name of kube-proxy's node, and (b) the `ForSameNode` hint is set, then kube-proxy
will only send traffic for the service to those same-node endpoints.
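
To make the proposed data flow concrete, here is a rough sketch of an EndpointSlice
as the controller might populate it under this proposal. The `forSameNode` hint is
the new field proposed above, not an existing API field, and the names, zones, and
addresses are purely illustrative.

```yaml
# Sketch only: an EndpointSlice for a DaemonSet-backed DNS Service, assuming the
# proposed forSameNode hint is adopted as described above. Field name and shape
# reflect this proposal, not a released API.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: kube-dns-abc12                # hypothetical name
  namespace: kube-system
  labels:
    kubernetes.io/service-name: kube-dns
addressType: IPv4
ports:
- name: dns
  port: 53
  protocol: UDP
endpoints:
- addresses: ["10.0.1.5"]
  nodeName: node-a
  zone: us-east-1a
  hints:
    forZones:
    - name: us-east-1a                # existing zone hint, matching the endpoint's own zone
    forSameNode: true                 # proposed hint: clients on node-a should prefer this endpoint
- addresses: ["10.0.2.7"]
  nodeName: node-b
  zone: us-east-1b
  hints:
    forZones:
    - name: us-east-1b
    forSameNode: true
```

Given a slice like this, the kube-proxy on `node-a` would send Service traffic from
local clients only to `10.0.1.5` (the endpoint whose `nodeName` matches its own node
and whose `forSameNode` hint is set), while a node with no local endpoint would fall
back to the zone-preferred routing described earlier.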

### Choice of field name
The name `trafficDistribution` is meant to capture the highly
@@ -387,8 +417,6 @@ traffic
so as not to confuse with the actual process of selecting the complete set of
pods backing a service.

### Intersection with internal/externalTrafficPolicy

The intersection of the field with `internalTrafficPolicy` and