KEP-4444: add "prefer same node" semantics to PreferClose #4931
@@ -13,6 +13,7 @@
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Story 3](#story-3)
- [Story 4 (DNS)](#story-4-dns)
- [Notes/Constraints/Caveats](#notesconstraintscaveats)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
@@ -266,6 +267,18 @@ NOTE: Implementations reserve the right to refine the behavior associated with
implementation. This simplifies their deployment process and reduces the
complexity of managing cross-cluster applications.

#### Story 4 (DNS)

* **Requirement:** As a cluster administrator, I plan to deploy CoreDNS so that there
  is an endpoint on every node, and I want clients to preferentially connect to the
  endpoint on their own node, to reduce latency. I'm not worried about CoreDNS
  endpoints becoming overloaded by unbalanced traffic distribution, because "1
  endpoint per node" is substantially more replicas than the service "needs" anyway.
* **Solution:** Set `trafficDistribution: PreferClose` (see the sketch after this list)
* **Effect:** The Kubernetes implementation will recognize that the DNS service is
  deployed in a way that supports "prefer same node" traffic distribution, and behave
  accordingly.
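To make Story 4 concrete, here is a minimal sketch in Go using the `k8s.io/api/core/v1` types. The Service name, namespace, selector, and port are illustrative assumptions, not taken from the KEP; the only part the story actually prescribes is `trafficDistribution: PreferClose`.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// dnsService sketches the Story 4 setup: a Service whose endpoints come from a
// DaemonSet (one CoreDNS pod per node), opting in to PreferClose.
func dnsService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "kube-dns", Namespace: "kube-system"},
		Spec: corev1.ServiceSpec{
			// Illustrative selector; it just needs to match the DaemonSet's pods.
			Selector: map[string]string{"k8s-app": "coredns"},
			Ports: []corev1.ServicePort{
				{Name: "dns", Port: 53, Protocol: corev1.ProtocolUDP},
			},
			// The only knob Story 4 asks the administrator to set.
			TrafficDistribution: ptr.To(corev1.ServiceTrafficDistributionPreferClose),
		},
	}
}
```

With this in place the administrator configures nothing else; whether same-node preference applies is decided by the implementation, as described below.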

### Notes/Constraints/Caveats

This proposal is our third attempt at an API revolving around such a
@@ -311,14 +324,29 @@ the value configured for `trafficDistribution`
* This leverages existing implementation, requiring no major changes.

#### `PreferClose`
* **Meaning:** Attempts to route traffic to endpoints within the same zone as
  the client. If no endpoints are available within the zone, traffic would be
  routed to other zones.
* **Meaning:**
  * Attempts to route traffic to endpoints within the same zone as the client. If no
    endpoints are available within the zone, traffic would be routed to other zones.
  * If the Service has endpoints that are part of a DaemonSet deployed to all nodes,
    then it attempts to route traffic to endpoints on the same node as the client. If
    no endpoints are available on the same node, traffic would be routed to other
    nodes (though still preferring the same zone, if relevant).
Comment on lines +331 to +333:

> Just a comment: This is super great! This is just a comment, so you're welcome to resolve this.

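The fallback order described in the new "Meaning" bullets can be summarized with a small illustrative sketch. This is not the kube-proxy implementation; the types and function below are local stand-ins, and the same-node flag represents the proposed hint discussed later in this hunk.

```go
package example

// endpoint is a local stand-in for an EndpointSlice endpoint plus its hints.
type endpoint struct {
	node        string   // node the endpoint runs on
	zoneHints   []string // zones this endpoint is hinted for (Hints.ForZones)
	forSameNode bool     // proposed same-node hint
}

// eligibleEndpoints applies the PreferClose fallback order described above:
// same node (when same-node hints are in play), then same zone, then anywhere.
func eligibleEndpoints(all []endpoint, clientNode, clientZone string) []endpoint {
	var sameNode, sameZone []endpoint
	for _, ep := range all {
		if ep.forSameNode && ep.node == clientNode {
			sameNode = append(sameNode, ep)
		}
		for _, z := range ep.zoneHints {
			if z == clientZone {
				sameZone = append(sameZone, ep)
				break
			}
		}
	}
	switch {
	case len(sameNode) > 0:
		return sameNode // an endpoint exists on the client's own node
	case len(sameZone) > 0:
		return sameZone // otherwise stay within the client's zone
	default:
		return all // otherwise fall back to every endpoint
	}
}
```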
* This preference will be implemented by the use of Hints within EndpointSlices.
  * We already use Hints to implement `service.kubernetes.io/topology-mode: Auto`.
    In a similar manner, the EndpointSlice controller will now also populate hints
    for `trafficDistribution: PreferClose` -- although in this case, the zone hint will
    match the zone of the endpoint itself.
  * We already use `Hints.ForZones` to implement
    `service.kubernetes.io/topology-mode: Auto`. In a similar manner, the
    EndpointSlice controller will now also populate zone hints for
    `trafficDistribution: PreferClose` -- although in this case, the zone hint will
    match the zone of the endpoint itself.
  * For the same-node preference, the EndpointSlice controller will set a new
    `Hints.ForSameNode` field (to `true`) to indicate to the service proxy that
    clients on the same node as this endpoint should prefer it. This will be set
    when:
    * the Service has `PreferClose` traffic distribution, and
    * the Service has endpoints on all or all-but-one of the Ready nodes, and
> This comment may be similar to what Antonio is saying below. Also, a use case we
> have is that we run a statsd DaemonSet on all nodes, except for a handful of nodes
> that are short-lived and have a very specific task, and running the DaemonSet there
> costs too much. As a user I'd want the "SameNode" functionality to be the
> "preferred" method, but I also understand that this feature is covering a wide
> range of use-cases.

> And would you want prefer-same-node semantics for that Service? There is definitely
> a tradeoff if we want to try to "auto-detect" this. Maybe I should try to figure
> out the

> Definitely! What is the thought behind having the safeguards described here? Is the
> worry that a single node could be overloaded with traffic, and maybe the correct
> choice would have been to send traffic to a different Pod (at the cost of going
> over the network)? I may be missing something here (context: I've never used the
> topology hints feature, but would really like to, so maybe I haven't encountered
> its downsides yet), but the name "PreferClose" makes me assume that traffic would
> be sent to a pod on the same node if available; if it can't go to a local pod, then
> to a pod in the same zone; and if that fails, then to a random pod in the cluster.

> The issue is that we don't know anything about the clients of the service, and we
> can't just assume that they are distributed in a way that ensures prefer-same-node
> will result in good behavior. If, for some reason, there are endpoints on Nodes A,
> B, and C, but most of the clients are on Node A, then prefer-same-node would result
> in the Node A endpoint getting overloaded while the Node B and C endpoints were
> idle. (If the nodes are heterogeneous in some way that affects scheduling this
> might even be a predictable outcome.) If I was attaching the prefer-same-node
> semantics to a new traffic distribution value then this wouldn't be a problem
> because we could just say "don't use

> Right, so I guess this KEP is about
>
> I'm still not sure I like this test:
>
> But I also can't think of a better way to determine if the DaemonSet is running
> everywhere. Maybe it's worth excluding control-plane nodes from the set of nodes?

    * the endpoints of the Service are Pods that have `OwnerReferences` indicating
      that they are all part of the same DaemonSet.
Comment on lines +346 to +348:

> is this not much of a stretch to rely on these specifics?

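A hedged sketch of the controller-side check that the bullets above describe. The function and variable names are illustrative, not from the KEP or the real EndpointSlice controller; only the three conditions themselves come from the proposal.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldSetSameNodeHint returns true when the proposed same-node hint would be
// populated: PreferClose is requested, every endpoint Pod is controlled by the
// same DaemonSet, and those Pods cover all (or all but one) of the Ready nodes.
func shouldSetSameNodeHint(svc *corev1.Service, endpointPods []*corev1.Pod, nodes []*corev1.Node) bool {
	if svc.Spec.TrafficDistribution == nil ||
		*svc.Spec.TrafficDistribution != corev1.ServiceTrafficDistributionPreferClose {
		return false
	}

	// Every endpoint Pod must have the same DaemonSet as its controller.
	var ownerUID string
	nodesWithEndpoint := map[string]bool{}
	for _, pod := range endpointPods {
		owner := metav1.GetControllerOf(pod)
		if owner == nil || owner.Kind != "DaemonSet" {
			return false
		}
		if ownerUID == "" {
			ownerUID = string(owner.UID)
		} else if ownerUID != string(owner.UID) {
			return false
		}
		nodesWithEndpoint[pod.Spec.NodeName] = true
	}

	// Endpoints must exist on all, or all but one, of the Ready nodes.
	readyNodes, covered := 0, 0
	for _, node := range nodes {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				readyNodes++
				if nodesWithEndpoint[node.Name] {
					covered++
				}
				break
			}
		}
	}
	return readyNodes > 0 && readyNodes-covered <= 1
}
```

Whether control-plane or other special-purpose nodes should be excluded from the Ready-node count, as raised in the thread above, is left open here.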
* While it may seem redundant to populate the hints here since kube-proxy can
  already derive the zone hint from the endpoint's zone (as they would be the
  same), we will still use this for implementation simply because of the reason
@@ -363,7 +391,9 @@ NOTE: The expectation remains that *all* endpoints within an EndpointSlice must
Aware
Hints](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2433-topology-aware-hints/README.md#kube-proxy), i.e. _"This is to provide safer transitions between enabled and disabled states. Without this fallback, endpoints could easily get overloaded as hints were being added or removed from some EndpointSlices but had not yet propagated to all of them."_

<<[UNRESOLVED Name for the field is being discussed]>>
Additionally, if a service has any endpoints for which (a) the `NodeName` is set to
the name of kube-proxy's node, and (b) the `ForSameNode` hint is set, then kube-proxy
will only send traffic for the service to those same-node endpoints.
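Since the field name is explicitly unresolved here, the following is only a hypothetical sketch using the KEP's working name `ForSameNode`. The types are local stand-ins that mirror the shape of `discovery.k8s.io/v1` `EndpointHints`; they are not the real API.

```go
package example

// ForZone mirrors the existing discovery.k8s.io/v1 ForZone hint.
type ForZone struct {
	Name string
}

// EndpointHints mirrors discovery.k8s.io/v1 EndpointHints, extended with the
// hint this KEP proposes; the field name may change per the UNRESOLVED note.
type EndpointHints struct {
	// ForZones exists today and backs zone-aware routing.
	ForZones []ForZone
	// ForSameNode, when true, tells the service proxy that clients on the same
	// node as this endpoint should prefer it.
	ForSameNode *bool
}
```

kube-proxy would then apply the rule in the paragraph above: if any endpoint with its own node's `NodeName` carries this hint, it routes local clients only to those endpoints.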

### Choice of field name
The name `trafficDistribution` is meant to capture the highly
@@ -387,8 +417,6 @@ traffic
so as not to confuse with the actual process of selecting the complete set of
pods backing a service.

<<[/UNRESOLVED]>>

### Intersection with internal/externalTrafficPolicy

The intersection of the field with `internalTrafficPolicy` and
> yeah, this is common, to have the DNS resolver per node ... this also removes the
> nodeLocalDNS hack to divert the traffic, and people/installers: kops, kubeadm, ...
> can just replace the DNS deployment by a daemonset