This error stems from a P1 incident, which itself originated from an emerging Google regional network issue. Three GKE nodes in that region were impacted: the kubelet daemon on each node was unable to update node status and other metrics, so the nodes changed to NotReady and some customer services went down.
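As a quick check during such an incident, the node conditions can be inspected from the cluster; the node name below is just a placeholder:

```bash
# List nodes and their readiness status
kubectl get nodes

# Inspect the conditions the kubelet reported for an affected node
# (replace the node name with the real one)
kubectl describe node gke-example-node-1 | grep -A 10 "Conditions:"
```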
After the root cause was mitigated and the nodes went back to the Ready state, GCP Log Explorer showed that the kubelet on one GKE node kept posting this error:
```
E0505 03:59:46.470928 1543 kubelet_volumes.go:154] orphaned pod "7935559a-a41b-4b44-960b-6a31fcab81f8" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
```
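The same messages can also be seen on the node itself. Assuming the node runs the kubelet as a systemd unit (as GKE's Container-Optimized OS does), something like this works after SSHing in:

```bash
# Filter the kubelet journal for orphaned-pod errors
journalctl -u kubelet --no-pager | grep "orphaned pod"
```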
Concept
The solution for this error is simple, but let's first understand some concepts. Why do we have orphaned pods? See Kubernetes Garbage Collection:
Some Kubernetes objects are owners of other objects. For example, a ReplicaSet is the owner of a set of Pods. The owned objects are called dependents of the owner object.
If you delete an object without deleting its dependents automatically, the dependents are said to be orphaned.
In background cascading deletion, Kubernetes deletes the owner object immediately and the garbage collector then deletes the dependents in the background.
The default behavior is to delete the dependents in the background, which is what happens when --cascade is omitted or explicitly set to background.
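For example, the cascading behavior can be controlled explicitly when deleting an owner object; the ReplicaSet name below is a placeholder:

```bash
# Default: the ReplicaSet is deleted immediately and its Pods are
# garbage-collected in the background
kubectl delete replicaset my-replicaset --cascade=background

# Orphan the dependents instead: the ReplicaSet is deleted but its Pods keep running
kubectl delete replicaset my-replicaset --cascade=orphan
```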
Solution
As we can see, after the nodes became Ready again, the old pods were deleted and their controllers redeployed new ones. However, some orphaned pods hung around because their volume paths were still present on disk, which is a known issue.
On the host node, cd to the /var/lib/kubelet/pods path and manually delete them.
```bash
cd /var/lib/kubelet/pods
```
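Continuing on the node, here is a minimal cleanup sketch using the pod UID from the kubelet error above, assuming its data is no longer needed (check the volumes directory before removing anything):

```bash
# The directory name under /var/lib/kubelet/pods matches the pod UID
# reported in the kubelet error
ls /var/lib/kubelet/pods/7935559a-a41b-4b44-960b-6a31fcab81f8/volumes/

# If nothing important remains, remove the orphaned pod directory;
# the kubelet stops logging the "orphaned pod" error once the path is gone
sudo rm -rf /var/lib/kubelet/pods/7935559a-a41b-4b44-960b-6a31fcab81f8
```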