Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP-4817]: DRA: Resource Claim Status with possible standardized network interface data #4861

Merged
merged 11 commits into from
Oct 10, 2024

Conversation

LionelJouin
Copy link
Member

@LionelJouin LionelJouin commented Sep 19, 2024

  • One-line PR description: Add driver-owned fields in ResourceClaim.Status with possible standardized network interface data

I followed option 5 here: https://docs.google.com/document/d/1smjEm7WxFm8btQwywdr_F5BZkiifatNqDvrQUaesu1M/edit?usp=sharing and this PoC here: https://gist.github.com/LionelJouin/5cfc11eecf73663b5657ed3be1eb6c00

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 19, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @LionelJouin. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 19, 2024
@johnbelamaric
Copy link
Member

/assign @johnbelamaric @aojea

@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2024
@kannon92
Copy link
Contributor

/retitle [KEP-4817]: DRA: Resource Claim Status with possible standardized network interface data

We have another KEP that is about device health status so renaming this to be a bit more clear what the KEP is about.

@k8s-ci-robot k8s-ci-robot changed the title KEP-4817 - Resource Claim Device Status [KEP-4817]: DRA: Resource Claim Status with possible standardized network interface data Sep 19, 2024
Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, left a few comments, this is a good start.

Can you fix the MD formatting so we don't have really long lines? In vi, it's :gwap and it will wrap your paragraph nicely...

Also feel free to include any text from https://docs.google.com/document/d/1smjEm7WxFm8btQwywdr_F5BZkiifatNqDvrQUaesu1M/edit?tab=t.0#heading=h.wui6s3gos7b2 that you find useful.

keps/sig-node/4817-resource-claim-device-status/README.md Outdated Show resolved Hide resolved
keps/sig-node/4817-resource-claim-device-status/README.md Outdated Show resolved Hide resolved

The API changes define a new `DeviceStatuses` field in the existing `ResourceClaimStatus` struct. `DeviceStatuses` is a slice of a new struct `AllocatedDeviceStatus` which holds device specific information.

A device, identified by `<driver name>/<pool name>/<device name>` can be represented only once in the `DeviceStatuses` slice and will also mention which request caused the device to be allocated. The state and characteristics of the device are reported in the `Conditions`, representing the operational state of the device and in the `DeviceInfo`, an arbitrary data slice representing device specific characteristics. Additionally, for networking devices, a field `NetworkDeviceInfo` can be used to report the IPs, the MAC address and the interface name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one approach, where there is DeviceInfo and NetworkDeviceInfo. I am not sure that will pass API review muster, but maybe. An alternative is what is described in Option 5. Instead of 3 fields, there would be two fields: Conditions and DeviceInfo. The DeviceInfo field is a one-of API type. So, something like:

type AllocatedDeviceStatus struct {
  ...
  DeviceInfo *DeviceInfo
}

type DeviceInfo struct {
    // +optional
	// +oneOf=DeviceInfoType
    Network *DeviceInfoNetwork

   // +optional
   // +oneOf=DeviceInfoType
   General *DeviceInfoGeneral
}

type DeviceInfoGeneral struct {
  Opaque RawExtension
}

type DeviceInfoNetwork struct {
   InterfaceName string
   IPAddresses []IPAddress
   MACAddress string

  Opaque RawExtension
}

@thockin WDYT? Every one-of would have the Opaque RawExtension plus one-of specific fields. Should that therefore just bubble up one level, so that we have a RawExtension, and then an optional device-type based one-of? Or, what about compound devices that may be several types. Should we just make the Opaque and Network fields separate, optional fields that may or may not be populated for any given device? I am not sure the one-of construction actually helps. (In other words, go with what is already here in the KEP).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I interpreted it like any driver can use the opaque field (DeviceInfo) + their own type. So the oneOf would be between specific type (NetworkDeviceInfo, GPUDeviceInfo...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the way you did it is less prescriptive - it allows a device to serve multiple functions. So I rather like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be a little more verbose about my last comment:

  1. we don't know what devices will come out in the future. So, whose to say they won't stick an optical connection on a GPU...or maybe they have already? So, maybe the GPU and NIC will be a single device, not two separate devices

  2. As VM architects, there may be specific devices that "go together as one". The GPU/NIC combos is an example. Right now, we are using MatchAttributes with a pcieRoot attribute to align the internal topology. That's a lot of work for the user. Instead, I am looking at building a "compound device driver" on top of the partitionable device model. This would allow a single "device" owned by that driver to be allocated, but in reality that driver would proxy calls for that device to multiple other drivers. So, the VM architect could "pair" specific devices, such that you cannot get one without the other. Those devices would have separate functions, represented by the separate underlying drivers. I would want to be able to stick information about each underlying device in the right "device info field" for that function.

Thinking about it, this is still limited of course, to one device per function. Another option would be to have the one-of, but allow multiple entries per driver/pool/device tuple.

//
// +optional
// +listType=atomic
DeviceInfo []runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above regarding this versus a one-of

@johnbelamaric
Copy link
Member

Also, you're going to need a section on the authorization for the driver to write to the status, as is actively under discussion in Slack.

Signed-off-by: Lionel Jouin <[email protected]>
keps/sig-node/4817-resource-claim-device-status/README.md Outdated Show resolved Hide resolved
// per request.
//
// +required
Request string `json:"request" protobuf:"bytes,1,rep,name=request"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? More generally, can we drop the "Allocated" part from the API?

It is not useful to post status of arbitrary, unrelated devices, but it can happen at least temporarily, for example directly after a claim got deallocated.

Speaking of which, who removes such stale entries?

A claim can go from "new" to "allocated" back to "deallocated" and then to "allocated" again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I removed it, I guess this is repetitive as this information can be retrieved from .Status.Allocation.Devices.Results.

In the Notes/Constraints/Caveats section, I stated The accuracy of the information depends on the implementation of the DRA Drivers, the reported status of the device may not always reflect the real time changes of the state of the device. so the DRA-Driver should update these information based on when it configures/check/unconfigured devices.

Probably if the ResourceClaim gets deallocated then the DeviceStatuses can be cleared as it might be deleted or reallocated in which case the driver will have to reconfigure the devices in the pods/containers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that could be added. In that case, calling this AllocatedDeviceStatus would make sense because it is meant and documented as "only for allocated devices".

// +listType=atomic
Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"`

// DeviceInfo contains Arbitrary driver-specific data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// DeviceInfo contains Arbitrary driver-specific data.
// DeviceInfo contains arbitrary driver-specific data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can a consumer reliably determine how to parse this content? The existing driver field above might not be enough: for example, two different drivers "foo.example.com" and "bar.example.com" might both provide data according to some common specification.

A consumer shouldn't have to keep a database that lists whether a driver uses such a common specification.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can a consumer reliably determine how to parse this content? The existing driver field above might not be enough: for example, two different drivers "foo.example.com" and "bar.example.com" might both provide data according to some common specification.

A consumer shouldn't have to keep a database that lists whether a driver uses such a common specification.

One option is to require a GVK in here. But another is just to say - you shouldn't be consuming opaque data. If there is a common spec, it should be a built in type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the idea yes as it would be quite difficult to cover all use cases with the specs we would provide in this KEP. Letting the driver reporting arbitrary status could still be useful I guess (for troubleshooting for example)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. but no one except the driver should rely on that data

// This can include both IPv4 and IPv6 addresses.
//
// +optional
IPs []string `json:"ips,omitempty" protobuf:"bytes,2,rep,name=ips"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have a type here instead of string? Does mask, family add value?

type IP struct { address mask family }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced this field with a NetworkAddresses one:

type NetworkDeviceInfo struct {
...
    // NetworkAddresses lists the network addresses assigned to the device's network interface.
    // This can include both IPv4 and IPv6 addresses.
    //
    // +optional
    NetworkAddresses []NetworkAddress `json:"networkAddresses,omitempty" protobuf:"bytes,2,rep,name=networkAddresses"`
...
}

// NetworkAddress provides a network address related details such as IP and Mask.
type NetworkAddress struct {
    // CIDR contains the network address in CIDR notation, which includes
    // both the address and the associated subnet mask.
    // e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6.
    //
    // +required
    CIDR string `json:"cidr,omitempty" protobuf:"bytes,1,rep,name=cidr"`
}

For reference here is how the NetworkAttachmentDefinition Status looks like and here is for the CNI Result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have validation on types, I think we don't need to force the types and it removes complexity if we just use an string and we validate is a valid CIDR , WDYT @thockin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it back to a single field

type NetworkDeviceData struct {
    ...
    // Addresses lists the network addresses assigned to the device's network interface.
    // This can include both IPv4 and IPv6 addresses.
    // The addresses are in the CIDR notation, which includes both the address and the
    // associated subnet mask.
    // e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6.
    //
    // +optional
    // +listType=atomic
    Addresses []string `json:"addresses,omitempty" protobuf:"bytes,2,opt,name=addresses"`
    ...
}

* Typos
* Establish a standard instead of laying the foundation for creating a
  standard
* KEP Name
* DeviceStatuses keys
* AllocatedDeviceStatus request field removed
* IPs to a new struct

Signed-off-by: Lionel Jouin <[email protected]>
// NetworkData contains network-related information specific to the device.
//
// +optional
NetworkData NetworkDeviceData `json:"networkData,omitempty" protobuf:"bytes,6,opt,name=networkData"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about NetworkData vs. Data. So let's consider a situation where I've got users using NetworkAttachmentDefinition style specifications, and so they've got fields like:

    Name       string      `json:"name"`
    Interface  string      `json:"interface,omitempty"`
    IPs        []string    `json:"ips,omitempty"`
    Mac        string      `json:"mac,omitempty"`
    Mtu        int         `json:"mtu,omitempty"`
    Default    bool        `json:"default,omitempty"`
    DNS        DNS         `json:"dns,omitempty"`
    DeviceInfo *DeviceInfo `json:"device-info,omitempty"`
    Gateway    []string    `json:"gateway,omitempty"`

The so, I have this mapping...

    Interface  --> InterfaceName
    IPs  --> Addresses
    Mac --> HWAddress

Which is 3/x, so the other values would be remapped to... The data field?

I'm mostly looking for clarity, and trying to figure out standardization (where it makes sense), and also -- extensibility. That is, that with the */network-status: annotation with net-attach-def, we've definitely had to slowly expand that struct over time. So I wonder if we should focus on building some standards for the Data field if these other fields would go there? Or, could we have something as part of the NetworkDeviceData that is somehow extensible, and maybe think about forming a standard for it via Multinetworking WG?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to store the rest, yes, the Data field. The main idea is that the Data field is 100% driver specific and so there are no guarantees (by the Kubernetes project) with respect to this data structure or API evolution.

The question I have is what data needs to be consumed by other ecosystem projects? That's what should be in a K8s built-in API.

Or, could we have something as part of the NetworkDeviceData that is somehow extensible, and maybe think about forming a standard for it via Multinetworking WG?

Maybe NetworkDeviceData should be a K8s one-of?

Or, to be really flexible, instead of two fields, we have one field Data []DeviceData, sort of like we did originally but now an array instead of just a single value:

type AllocatedDeviceStatus struct {
 ...
 Data []DeviceData
}

type DeviceData struct {
  // exactly one must be populated

  // +optional
  Opaque *runtime.RawExtension

  // +optional
  NetworkData *NetworkDeviceData

  // +optional
  SomeFutureThing *MultinetThing
}

The sort of hard part is now we can have an aribtrary number of each type, but maybe that's OK.

Copy link
Member

@aojea aojea Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dougbtv one question, what of those things are configuration option and what of those things need to return a value from the driver?

It look most of those things are configuration options, that already are reported in status, the thing we try to standardize are the things that are reported by the driver that are not set in configuration.

I guess the question is, what of those values are set by the user and what of those values are reported by the driver only (they are not configured) , like addresses and interface name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier text implies that Data may include multiple representations of the same data in different formats:

`Data` being an arbitrary data field allows the DRA Driver to store
device specific data in different formats. For example, a Network Device being
configured by via a CNI plugin could get its `Data` field filled with the CNI
result for troubleshooting purpose and with a Network result in a modern
standard format (closer to Pod.Status.PodIPs for example) used by 3rd party
controllers.

Or maybe that's an artifact from an earlier version of the text and the second part is supposed to be the NetworkData field?

If there is any intention that Data could have multiple representations of the same data, then it seems like it ought to be explicitly an array or map. Otherwise how are you supposed to be able to reliably extract the representation you wanted?

So I think I like John's version above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is definitely not for config - config goes in the DeviceClaim.Config list, which is a bunch of RawExtensions (or in the DeviceClass.Config).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I misinterpreted it sorry @dougbtv

@johnbelamaric is the list not going to be confusing to identify which information goes where?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnbelamaric is the list not going to be confusing to identify which information goes where?

Yes, each list entry needs an identifier if we do this. Maybe make it a list-map?

obj.status.devices[i].driver.{opaque,network,whatever}

driver already exists in device[i] as the name of the driver.

obj.status.devices[i].status.{opaque,network,whatever}

Meh. Not better than data to me.

With data as a slice as I am suggesting:

obj.status.devices[i].data[j].{opaque,network,whatever}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj.status.devices[i].data[j].{opaque,network,whatever}

So a claim has multiple devices, each device has multiple datas (datae?), each of which has exactly one field set, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's right. fun stuff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data is a list-map with some sort of driver-defined identifier for the key

// NetworkData contains network-related information specific to the device.
//
// +optional
NetworkData NetworkDeviceData `json:"networkData,omitempty" protobuf:"bytes,6,opt,name=networkData"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier text implies that Data may include multiple representations of the same data in different formats:

`Data` being an arbitrary data field allows the DRA Driver to store
device specific data in different formats. For example, a Network Device being
configured by via a CNI plugin could get its `Data` field filled with the CNI
result for troubleshooting purpose and with a Network result in a modern
standard format (closer to Pod.Status.PodIPs for example) used by 3rd party
controllers.

Or maybe that's an artifact from an earlier version of the text and the second part is supposed to be the NetworkData field?

If there is any intention that Data could have multiple representations of the same data, then it seems like it ought to be explicitly an array or map. Otherwise how are you supposed to be able to reliably extract the representation you wanted?

So I think I like John's version above.


// NetworkDeviceData provides network-related details for the allocated device.
// This information may be filled by drivers or other components to configure
// or identify the device within a network context.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably include the interface index? Name, ips, and hwaddr are all mutable,.

(In which case maybe you don't need to include the interface name?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have enough privilege to rename an interface, none of the rest really matters, right? That doesn't stop us from reporting podIPs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't worrying just about privilege/attacks. Maybe there are use cases where the interface eventually gets renamed for some reason?

It just seems like, if it's possible to refer to the device via an immutable identifier rather than a mutable one, then we should?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with that principle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference of the interface in the kubernetes cluster is: Driver + "/" + Pool + "/" + Device
The index is the reference of the interface in a Linux operating system, Devices are supposed to abstract any network interface, most commonly the linux ones, but not only those.
The drivers owns the interface, if it is renamed, then I expect the driver to update the status, but I don't expect the consumer of the api to use the index of Linux interface, so interface name sounds much better to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although, now that I think, index make sense to ident, we can have both fields optional, index and name?

// network interface.
//
// +optional
InterfaceName string `json:"interfaceName,omitempty" protobuf:"bytes,1,opt,name=interfaceName"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this optional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there cases that you want to report the addresses but not the interface name, I can think a few cases with network attached devices

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DPDK / userspace stuff might not have an interface representation

Comment on lines 226 to 230
// Addresses lists the network addresses assigned to the device's network interface.
// This can include both IPv4 and IPv6 addresses.
// The addresses are in the CIDR notation, which includes both the address and the
// associated subnet mask.
// e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So first, those examples are wrong; those are descriptions of subnets, not descriptions of IPs on those subnets. (You need, eg, "192.0.2.5/24").

But also, (1) this is contrary to everything else in kubernetes, where we always use plain IP addresses when referring to IP addresses, and only use CIDR notation for describing subnets and masks; (2) the CIDR notation for interface addresses is only really useful for IPv4 anyway, and doesn't really make sense for IPv6; (3) it is not clear what the user stories would be that would motivate including the link-local-subnet-mask but not any other routing information.

So I think these should be plain IPs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we say "IPs" almost everywhere, and where we use "Addresses" it's because it could include non-IP values. Which is this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"192.0.2.0" and "2001:db8::" are valid IP addresses. Just try:

vm-003 ~ # ip -6 addr add 2001:db8::/64 dev eth0
vm-003 ~ # ip -6 addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet6 2001:db8::/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fd00::c0a8:3/120 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::200:ff:fe01:3/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever

It is a common misunderstanding that host 0 (zero) is reserved. It isn't. But that misunderstanding may be buit-in in older systems/equipment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But because of that misunderstanding 192.0.2.5/24 may be clearer

Copy link

@uablrek uablrek Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CIDR notation does make sense for IPv6. /64 is not required, except for some generated addresses such as the link-local address and SLAAC. IMO it make perfect sense to have one /64 range for an entire K8s cluster, and divide it into POD, ClusterIP, and external addresses with CIDR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you pasted inhttps://github.com//pull/4861#discussion_r1793266465

the output of ip -a contains ips and masks, and this mimics that behavior

3: br-6a4e01b79274: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:5e:f6:25 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.1/24 brd 192.168.8.255 scope global br-6a4e01b79274
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::1/64 scope global

so I do think it makes sense to have the cidr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"192.0.2.0" and "2001:db8::" are valid IP addresses.

Yes, I know, but I'm fairly certain that's not what they actually meant, and if it was they should change the example to be less confusing anyway.

CIDR notation does make sense for IPv6. /64 is not required

CIDR notation makes sense for describing IPv6 subnets. It does not make sense for describing IPv6 addresses because knowing the size of the subnet an address was allocated out of doesn't help you do anything in IPv6.

If an interface has the IPv4 address "192.0.2.5/24", that tells you that 192.0.2.6 is link-local and 192.0.3.5 is not. (This is not always true, but the entire point of the CIDR notation is to convey exactly that information to you.) But if it has the IPv6 address "2001:db8::1/64", that does not imply either that "2001:db8::2" is link-local or that "2001:db9::1" is not, because IPv6 subnetting is more complicated than that. There's nothing useful you can do with just the prefix length; it's only meaningful if you also have the rest of the information from the router advertisement.

the output of ip -a contains ips and masks, and this mimics that behavior

The output of ip -a contains a LOT of information. Why do we want the mask, but not the rest of it? What is the use case?

Copy link
Member

@aojea aojea Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at minimum you need an IP if the Device is a pod like only interface, ala PodIP, but if is a kind of L2 connection to a external network, you may want to have the mask , at least to know how many more interfaces you can add on that net ... there are cases you ask for an interface in one network, but you don't get to decide how large is that network, so knowing the mask help you to prevent a claim being stuck

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is "you" here? The DRA driver? The end user? What is the exact thing that someone can do when the prefix length is exposed here (without needing any other information that is also not exposed here) that they could not do with the prefix length not exposed.

Knowing the subnet size does not tell you how many IPs are available on the subnet so I don't see how it helps with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end user can get this information and use it for example to get some hints on who (pod) can talk to whom (pod/external device). The CIDR could also tell you if what you got is what you requested (I requested an IP in the 192.0.2.0/24 subnet, is my address 192.0.2.x/24, or is there something wrong somewhere and my address is 192.0.2.5/8)

This NetworkDeviceData struct is likely to evolve in the future when people will start to solve more complex use-cases (as the NetworkStatus evolved in the network-attachment-definition API (Multus)). So I would prefer to have the addresses in the CIDR format now than just IP addresses and potentially a new field in the future for the CIDR.

// HWAddress represents the hardware address (e.g. MAC Address) of the device's network interface.
//
// +optional
HWAddress string `json:"hwAddress,omitempty" protobuf:"bytes,3,opt,name=hwAddress"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the use case for including IP addresses in the result is that people may want to connect to the pod via this interface. That's something someone might want to do even if they aren't directly poking around with the allocated device.

What is the use case for including HW addresses? What sort of user would have a reason to need to know the hwaddr, but would not be doing something where they had to look directly at the allocated device in question (which would allow them to find the hwaddr themselves)?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not directly involved any more, but I think non-standard (even seemingly weird) use-cases that don't use ip-addresses should be supported. An example is non-IP traffic, another is to allow "something" to send ip-packets to the interface even when it has no ip-address, or has an address shared by many.

The entity that consumes this status will probably not be an application that wants to connect to the pod, but something very specialized (router, load-balancer, ...?), and we can't predict what they will be.

Copy link
Contributor

@danwinship danwinship Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but will those use cases be able to work with just the hwaddr, or are they going to potentially need some other information as well?

Basically, on one end of the API space we have "indicate just the name/index of the device that was created", and on the other end we have "provide a full CNI ADD response". And my question is, why did we end up exactly here? Are we sure we need all of the fields we have, and we don't need any other fields?


## Proposal

### API - ResourceClaim.Status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between "Proposal" and "Design Details" has always been confusing to me, but in this case it's especially confusing that the details of the proposal appear before the user stories, so I think this should mostly move to "Design Details".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this has never been clear - it's long been on my list of things to fix "one day" in the KEP template. I would not block this KEP just for that.

keps/sig-node/4817-resource-claim-device-status/README.md Outdated Show resolved Hide resolved
traffic to pods over networks attached via the network devices, the IP
addresses of the network device(s) must then be reflected in the
`ResourceClaim.Status` allowing the network service controller(s) to access
them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this motivates having the IPs be available, but not the hardware addrs (or anything else). Is that correct?

In the `ResourceClaim.Status.Devices`, instead of having opaque field (`Data`) and
specific type fields, an object reference could be used for each device. The custom
object would be created and maintained by the driver to report the status of the
devices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. It's very Gateway-API-ish (eg, ParametersReference)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be my last choice, as if you want to build a controller (e.g. EndpointSlice controller) that gets the IPs for some pods, then you will need to find those pods, then get the ResouceClaims for those pods (current scenario with what is being proposed), then get the status object for those ResourceClaims.

The best would be the New Pod.Status Field alternative, but from what I understood in the Multi-Network KEP and some other discussion (e.g. here), this is not possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin CRXes!

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't hate this. Some questions, but mostly minor. I do worry about the VAP being too rigid for real life.


## Proposal

### API - ResourceClaim.Status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this has never been clear - it's long been on my list of things to fix "one day" in the KEP template. I would not block this KEP just for that.

// +required
Device string `json:"device" protobuf:"bytes,3,rep,name=device"`

// Conditions contains the latest observation of the device's state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for hardware errors, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, other than Ready, it's all up to the driver what lands in here. This will be the only place to put that information in the API server (we will not have it in pod status, for example).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have / need guidance on what should go into Conditions and what should go into Data ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidance on what should go in Conditions and Data are in the "Proposal" section:

The state and characteristics of the
device are reported in the `Conditions`, representing the operational state of
the device and in the `Data`, an arbitrary data field representing device
specific characteristics. 
....

`Data` being an arbitrary data field allows the DRA Driver to store
device specific data in different formats. For example, a Network Device being
configured by via a CNI plugin could get its `Data` field filled with the CNI
result for troubleshooting purpose and with a Network result in a modern
standard format (closer to Pod.Status.PodIPs for example) used by 3rd party
controllers.

or did you mean more detailed ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditions are not an awesome API for most things, but they are the only way to express things without knowing the "extension" schema.

I suspect we will have to feel this distinction out as we go, and crystallize it more later.

Comment on lines 226 to 230
// Addresses lists the network addresses assigned to the device's network interface.
// This can include both IPv4 and IPv6 addresses.
// The addresses are in the CIDR notation, which includes both the address and the
// associated subnet mask.
// e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we say "IPs" almost everywhere, and where we use "Addresses" it's because it could include non-IP values. Which is this?

### Write Permission

To prevent unauthorized or accidental modifications by entities that do not
have access to a particular resource, a `ValidatingAdmissionPolicy` will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not love this trend - does it need a VAP per driver?

`ResourceClaim.Status.Devices` is running on the same node.

Here is a `ResourceClaim` allocated on a node. This would only work for now if
exactly one node is set:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow - this seems very rigid? Any driver that doesn't follow this exact standard can't be secured? Or do we end up embedding thdriver name into the VAP, and have one VAP per driver?

object.status.devices != oldObject.status.devices
validations:
- expression: >- # User node must be the same node as the one where the ResourceClaim is allocated.
variables.userNodeName != variables.objectNodeName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't VAP's asserting the affirmative? I.e. this should be == ? If not, I am confused about my own VAPs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will re-try this VAP with my PoC,validations should work fine, but I remember having issues with the matchConditions

@johnbelamaric
Copy link
Member

johnbelamaric commented Oct 9, 2024

I don't hate this.

High praise from @thockin! 😆

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2024
- Authorization implemented to allow only the user on the same node as the
allocated `ResourceClaim` to write the status of the devices.

#### Beta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with @danwinship offline, @LionelJouin @dougbtv what do you think if we add to beta criteria having a multus DRA network driver ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely be interested in it! And there's a sort of basis that exists pre-1.31 in: k8snetworkplumbingwg/multus-cni#1078

But yes, it'd need extensions to this functionality as well as a Multus DRA driver itself, as the challenge with that implementation is that it uses an example reference driver, and directly being a driver would be a good way to see how well the spec fits with an implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added 1 real-world usage as beta requirement.

@johnbelamaric
Copy link
Member

One small nit for PRR, not needed to approve. Once there is SIG approval, I can give the PRR approval.

@thockin
Copy link
Member

thockin commented Oct 9, 2024

@danwinship @aojea I am OK with this KEP - what are you looking for to LGTM vs. things that can be discussed after?

@mrunalp
Copy link
Contributor

mrunalp commented Oct 10, 2024

/approve
from SIG Node

Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with PRR and from the DRA side (modulo the exact API we can work out later). I will give the PRR approval once this has SIG node and SIG net approval.

@johnbelamaric
Copy link
Member

I am happy with PRR and from the DRA side (modulo the exact API we can work out later). I will give the PRR approval once this has SIG node and SIG net approval.

Race condition with @mrunalp. So now @thockin has already approved but left an open question for @danwinship and @aojea so I will approve but leave the LGTM to them.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, LionelJouin, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@danwinship
Copy link
Contributor

/lgtm
to move forward with

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2024
@thockin
Copy link
Member

thockin commented Oct 10, 2024

The comment thread about API shape (#4861 (review)) has become "outdated". :(

@thockin thockin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2024
@shaneutt
Copy link
Member

We discussed this in the SIG Network community today and there were no major concerns or objections stated (though there was a soft suggestion that we might want to explore more options on how we handle the status if we have time).

/unhold

@k8s-ci-robot k8s-ci-robot merged commit 499e983 into kubernetes:master Oct 10, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 10, 2024
@aojea
Copy link
Member

aojea commented Oct 10, 2024

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Pre-Alpha
Development

Successfully merging this pull request may close these issues.