Apps Won't Start After RAM Upgrade

Description

I upgraded this system from ~60GiB to 256GiB. After this upgrade, apps would not start and were stuck on "Deploying."

I was able to reproduce the issue by reinstalling the original RAM (apps started as normal) and then installing the new RAM a second time (apps stuck in "Deploying" again). I let it sit for well over an hour to see if there might be any progress, but there wasn't.

Results of some CLI commands that appear pertinent follow. I think what ultimately resolved the issue for me was one or both of:

1) Using the GUI to move the App installation to a new pool
2) Removing apps that wouldn't terminate (Traefik, PiHole & k8s-gateway)

Initial errors looked like this:

2021-09-23 20:11:49
Scaled up replica set plex-746dc75cc5 to 1
2021-09-23 20:11:49
Created pod: plex-746dc75cc5-bfj5t
0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.

Seems similar:

2021-09-23 19:03:31
Created pod: traefik-b99bc5-h9v2p
2021-09-23 19:03:31
Marking for deletion Pod ix-traefik/traefik-b99bc5-v8vp9
0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
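Both excerpts show the scheduler rejecting pods because of a node taint. A minimal sketch for pulling the taint keys out of messages like these, so they can be compared against what the node currently reports; the function name taint_keys is hypothetical and the input is assumed to be event text in the format shown above:

```shell
# Print each distinct taint key found in scheduler messages read from stdin.
# Expects text such as:
#   ... 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
taint_keys() {
  sed -n 's/.*had taint {\([^:}]*\).*/\1/p' | sort -u
}

# Possible usage on the host (not run here):
#   k3s kubectl get events -A | taint_keys
#   k3s kubectl describe nodes | grep -i taints
```

If the key printed here still appears in the node's Taints line after the apps service has finished starting, that would be consistent with the pods staying unscheduled.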

Wound up finding kube-system was stuck on ContainerCreating:

scale# k3s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
ix-organizr organizr-5bd974bd88-6lvx5 0/1 Terminating 0 5d7h
ix-pihole3 svclb-pihole3-dns-2mpmd 0/1 Terminating 0 5d1h
ix-pihole3 svclb-pihole3-dns-tcp-vzwd8 0/1 Terminating 0 5d1h
ix-k8s-gateway k8s-gateway-6486ffbb68-xd4kt 0/1 Terminating 0 5d1h
kube-system openebs-zfs-node-484q5 0/2 Terminating 7 4d1h
ix-traefik svclb-traefik-tcp-44ctg 0/2 Terminating 12 4d1h
ix-k8s-gateway svclb-k8s-gateway-t8xp9 0/1 Pending 0 8m14s
ix-tautulli tautulli-569844459c-fcgvm 0/1 Init:0/1 0 15m
ix-librespeed librespeed-764b78db8b-9t9lb 0/1 Init:0/1 0 15m
ix-plex plex-746dc75cc5-98j2q 0/1 Init:0/1 0 15m
ix-pihole3 pihole3-5d57d479bc-x7ndq 0/1 Init:0/1 0 15m
kube-system openebs-zfs-controller-0 0/5 ContainerCreating 0 11m
kube-system coredns-7448499f4d-vztmx 1/1 Running 0 15m
ix-traefik traefik-b99bc5-9xwgp 1/1 Running 0 15m
scale#
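With this many pods, it can help to filter the listing down to the ones that are not Running. A small sketch, assuming the six-column layout shown above (STATUS in the fourth column); the function name not_running is hypothetical:

```shell
# Print namespace, pod name, and status for every pod whose STATUS is not
# "Running". Reads `kubectl get pods -A` output on stdin and skips the header
# and any line that does not look like a pod row.
not_running() {
  awk '$1 != "NAMESPACE" && NF >= 6 && $4 != "Running" { print $1, $2, $4 }'
}

# Possible usage on the host (not run here):
#   k3s kubectl get pods -A | not_running
```

Note that this also lists pods in Completed, which is expected for finished jobs but not for long-running app pods like the ones here.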

And received errors like this:

2021-09-23 22:47:31
MountVolume.MountDevice failed for volume "pvc-ce610695-dadb-48f7-9082-2bbd55db770b" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name zfs.csi.openebs.io not found in the list of registered CSI drivers

I deleted the pods stuck in Terminating using the instructions here: https://www.truenas.com/community/threads/plex-failure-after-major-failure.93300/post-645881
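The exact commands from the linked post are not reproduced here. As a hedged sketch of one common approach, the helper below prints (but does not run) a force-delete command for each pod in Terminating, using kubectl's --grace-period=0 and --force flags. The function name terminating_delete_cmds is hypothetical, and whether force deletion is appropriate for a given pod is an assumption to check first, since it skips graceful shutdown:

```shell
# Emit one force-delete command per pod whose STATUS is "Terminating".
# Reads `kubectl get pods -A` output on stdin; only prints commands so they
# can be reviewed before anything is deleted.
terminating_delete_cmds() {
  awk '$4 == "Terminating" {
    printf "k3s kubectl delete pod %s -n %s --grace-period=0 --force\n", $2, $1
  }'
}

# Possible usage on the host (not run here):
#   k3s kubectl get pods -A | terminating_delete_cmds
```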

scale# k3s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
ix-traefik svclb-traefik-tcp-x2szt 0/2 Error 2 6h57m
ix-traefik traefik-b99bc5-smrl4 0/1 Completed 0 6h33m
ix-pihole3 pihole3-5d57d479bc-88rld 0/1 Completed 0 6h33m
ix-k8s-gateway k8s-gateway-785cd758d9-jwqc7 0/1 Completed 0 6h33m
ix-librespeed librespeed-764b78db8b-6zc8v 0/1 Completed 0 6h33m
ix-tautulli tautulli-569844459c-cn87d 0/1 Completed 0 6h33m
ix-k8s-gateway svclb-k8s-gateway-t8xp9 0/1 Error 1 9h
kube-system coredns-7448499f4d-vztmx 0/1 Completed 2 9h
kube-system openebs-zfs-node-78zgg 0/2 Completed 1 6h53m
ix-pihole3 svclb-pihole3-dns-tcp-7xps7 0/1 Error 1 6h55m
kube-system openebs-zfs-controller-0 0/5 Error 10 9h
ix-pihole3 svclb-pihole3-dns-ld2f4 0/1 Pending 0 6h55m
ix-plex plex-56486f6bf6-ljzc6 0/1 Init:0/1 0 7m31s

I tried deleting the kube-system pods, which had worked when I switched back to the old RAM, but now I'm stuck at ContainerCreating.

scale# k3s kubectl delete pods coredns-7448499f4d-vztmx openebs-zfs-node-78zgg openebs-zfs-controller-0 -n kube-system
pod "coredns-7448499f4d-vztmx" deleted
pod "openebs-zfs-node-78zgg" deleted
pod "openebs-zfs-controller-0" deleted
scale# k3s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
ix-traefik svclb-traefik-tcp-x2szt 0/2 Error 2 7h7m
ix-traefik traefik-b99bc5-smrl4 0/1 Completed 0 6h44m
ix-pihole3 pihole3-5d57d479bc-88rld 0/1 Completed 0 6h43m
ix-k8s-gateway k8s-gateway-785cd758d9-jwqc7 0/1 Completed 0 6h44m
ix-librespeed librespeed-764b78db8b-6zc8v 0/1 Completed 0 6h44m
ix-tautulli tautulli-569844459c-cn87d 0/1 Completed 0 6h43m
ix-k8s-gateway svclb-k8s-gateway-t8xp9 0/1 Error 1 9h
ix-pihole3 svclb-pihole3-dns-tcp-7xps7 0/1 Error 1 7h5m
ix-pihole3 svclb-pihole3-dns-ld2f4 0/1 Pending 0 7h5m
ix-plex plex-56486f6bf6-ljzc6 0/1 Init:0/1 0 17m
kube-system coredns-7448499f4d-mcdp2 0/1 ContainerCreating 0 7m1s
kube-system openebs-zfs-controller-0 0/5 ContainerCreating 0 3m13s
kube-system openebs-zfs-node-jlwdx 0/2 ContainerCreating 0 2m1s

scale# k3s kubectl describe -n kube-system po coredns-7448499f4d-tjmxv

[too long to post the entire message]

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20m default-scheduler Successfully assigned kube-system/coredns-7448499f4d-tjmxv to ix-truenas
Warning NetworkNotReady 8m56s (x6 over 9m5s) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Warning FailedSync 2m40s kubelet error determining status: rpc error: code = Unknown desc = Error: No such container: b7a4616ddbd16addc194357a0d748424a87eb66ef481c5dcbd864a63c80c7ac2
Warning FailedCreatePodSandBox 28s (x4 over 6m52s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "coredns-7448499f4d-tjmxv": operation timeout: context deadline exceeded
Normal SandboxChanged 28s (x3 over 4m40s) kubelet Pod sandbox changed, it will be killed and re-created.
scale#

The output of k3s kubectl describe pods suggests the problem is:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m default-scheduler 0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
Warning FailedScheduling 16m default-scheduler 0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
Warning FailedScheduling 8m31s default-scheduler 0/1 nodes are available: 1 node(s) had taint {ix-svc-stop: }, that the pod didn't tolerate.
Normal Scheduled 8m18s default-scheduler Successfully assigned kube-system/openebs-zfs-controller-0 to ix-truenas
Warning FailedCreatePodSandBox 4m3s (x2 over 6m17s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "openebs-zfs-controller-0": operation timeout: context deadline exceeded
Warning FailedSync 3m39s (x2 over 3m51s) kubelet error determining status: rpc error: code = Unknown desc = Error: No such container: 419cb42b30cd18126516010c022d5342da7feb697fb256c020055f1bda569ee9
Normal SandboxChanged 3m33s kubelet Pod sandbox changed, it will be killed and re-created.
Normal AddedInterface 95s multus Add eth0 [172.16.0.120/16] from ix-net
Normal Pulled 95s kubelet Container image "k8s.gcr.io/sig-storage/csi-resizer:v1.1.0" already present on machine
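Since the full describe output is too long to post, a per-reason tally of the Events section can summarize it. A rough sketch, assuming each event row begins with its Type (Warning or Normal) followed by the Reason, as in the excerpts above; the function name event_reasons is hypothetical:

```shell
# Count events by type and reason from `kubectl describe` output on stdin.
# Prints lines like "3 Warning FailedScheduling", most frequent first.
event_reasons() {
  awk '$1 == "Warning" || $1 == "Normal" { n[$1 " " $2]++ }
       END { for (k in n) print n[k], k }' | sort -k1,1nr -k2
}

# Possible usage on the host (not run here):
#   k3s kubectl describe pods -n kube-system | event_reasons
```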

Problem/Justification

None

Impact

None

Activity

Rick Bollar October 19, 2021 at 4:23 PM

I wasn't able to reproduce in the window I had – I have since done a bare metal reinstall after the RAM upgrade, so that environment no longer exists.

Waqar Ahmed October 17, 2021 at 4:16 PM

Were you able to switch the RAM and see if the issue persists? Thanks

Waqar Ahmed October 8, 2021 at 7:18 PM

Yes, I saw that in the description, but the debug did not point to the same facts, and I was not able to check other data points that might have shown what is going on. Anyway, please let me know if you are able to reproduce this next week. A debug taken at that point would be very helpful. Thanks!

Rick Bollar October 8, 2021 at 7:04 PM

I knew the apps weren't working because their web portals wouldn't start, and this is what the pod list looked like:

scale# k3s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
ix-organizr organizr-5bd974bd88-6lvx5 0/1 Terminating 0 5d7h
ix-pihole3 svclb-pihole3-dns-2mpmd 0/1 Terminating 0 5d1h
ix-pihole3 svclb-pihole3-dns-tcp-vzwd8 0/1 Terminating 0 5d1h
ix-k8s-gateway k8s-gateway-6486ffbb68-xd4kt 0/1 Terminating 0 5d1h
kube-system openebs-zfs-node-484q5 0/2 Terminating 7 4d1h
ix-traefik svclb-traefik-tcp-44ctg 0/2 Terminating 12 4d1h
ix-k8s-gateway svclb-k8s-gateway-t8xp9 0/1 Pending 0 8m14s
ix-tautulli tautulli-569844459c-fcgvm 0/1 Init:0/1 0 15m
ix-librespeed librespeed-764b78db8b-9t9lb 0/1 Init:0/1 0 15m
ix-plex plex-746dc75cc5-98j2q 0/1 Init:0/1 0 15m
ix-pihole3 pihole3-5d57d479bc-x7ndq 0/1 Init:0/1 0 15m
kube-system openebs-zfs-controller-0 0/5 ContainerCreating 0 11m
kube-system coredns-7448499f4d-vztmx 1/1 Running 0 15m
ix-traefik traefik-b99bc5-9xwgp 1/1 Running 0 15m
scale#

All of the apps were stuck in Pending or Init.

Unfortunately, I had to completely destroy the apps installation and reinstall from scratch to make them work again – I have a scheduled downtime next week, and I will try dropping the RAM again to see if I can reproduce the issue on the new system.

Waqar Ahmed October 8, 2021 at 6:20 PM

ping

Cannot Reproduce
Created September 25, 2021 at 11:16 AM
Updated July 6, 2022 at 9:00 PM
Resolved October 19, 2021 at 5:08 PM