Nvidia GPU is not shareable between apps

Description

nvidia-device-plugin does not seem to allow GPU sharing between apps. A TC user tried to deploy a second app with a GPU allocated, but got
`0/1 nodes are available: 1 Insufficient nvidia.com/gpu.` As soon as the first app was stopped, the second one deployed.

Reading through the docs, I found this (quoted from the k8s docs):
```
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
```
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v1-8-onwards
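To illustrate the quoted rule, here is a minimal sketch in Python of whole-GPU accounting. It is a toy model, not the actual Kubernetes scheduler or nvidia-device-plugin code: the `Node` class, the `schedule` and `release` helpers, and the node name `truenas` are all hypothetical names made up for this example. It only shows why a second app stays pending on a single-GPU node until the first app releases the GPU.

```python
from dataclasses import dataclass, field

GPU_RESOURCE = "nvidia.com/gpu"


@dataclass
class Node:
    """Hypothetical stand-in for a node advertising whole GPUs."""
    name: str
    allocatable_gpus: int
    assigned: dict = field(default_factory=dict)  # pod name -> GPUs held

    def free_gpus(self) -> int:
        return self.allocatable_gpus - sum(self.assigned.values())


def schedule(nodes, pod_name, gpu_request):
    """Place a pod on the first node with enough free whole GPUs."""
    # GPUs are requested as whole units; a fraction such as 0.5 is rejected.
    if not isinstance(gpu_request, int) or gpu_request < 1:
        raise ValueError("GPU requests must be a positive integer")
    for node in nodes:
        if node.free_gpus() >= gpu_request:
            node.assigned[pod_name] = gpu_request
            return f"{pod_name} scheduled on {node.name}"
    # No overcommit: every node is short on GPUs, so the pod stays pending.
    return (f"0/{len(nodes)} nodes are available: "
            f"{len(nodes)} Insufficient {GPU_RESOURCE}.")


def release(node, pod_name):
    """Free the GPUs held by a stopped pod."""
    node.assigned.pop(pod_name, None)


if __name__ == "__main__":
    node = Node("truenas", allocatable_gpus=1)
    print(schedule([node], "app-1", 1))
    print(schedule([node], "app-2", 1))  # rejected while app-1 holds the GPU
    release(node, "app-1")
    print(schedule([node], "app-2", 1))  # succeeds once the GPU is free
```

In this model the GPU is an integer counter per node, which matches the behavior seen above: the second app only deploys after the first one is stopped.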

With a bit more searching, I found this project, which can enable GPU sharing:
https://github.com/AliyunContainerService/gpushare-scheduler-extender

I'm fairly new to k8s, so I'm not sure how feasible this would be to implement.

For people coming from a Docker environment, which allows sharing, this seems "limiting".

Also: I noticed the driver is `460.xx`, while `470.xx` has been released.

Problem/Justification

None

Impact

None

Activity

Waqar Ahmed September 7, 2021 at 1:32 PM

Please file a suggestion ticket for the requested feature. Closing this issue.

Stavros Kois September 6, 2021 at 5:28 PM

The bug flag was indeed my mistake.
Subscribed to this upstream issue to follow it for more info!

Waqar Ahmed September 6, 2021 at 5:25 PM

I don't think this should be considered a bug, as this is upstream behavior: GPUs can't be shared between pods. There is an open Kubernetes issue that you can track as well ( https://github.com/kubernetes/kubernetes/issues/52757 ). As for the scheduler link you shared, I would advise you to create a suggestion ticket. However, let's see whether the newer NVIDIA driver version is available on the apt mirror we are using, so that we can update it.

Behaves as Intended

Created September 6, 2021 at 5:18 PM
Updated July 6, 2022 at 9:01 PM
Resolved September 7, 2021 at 1:31 PM