(1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)
We've just upgraded our production cluster to 1.17.2.
Since the update on Saturday, we've had this strange outage: after a NIC bond failure (which recovers shortly afterwards), Kubelet has all of its connections broken and won't try to re-establish them unless it is manually restarted.
Here is the timeline of the last time it occurred:
01:31:16: The kernel detects a failure on the bond interface. The outage lasts a short while, and the link eventually recovers.
Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044 systemd-networkd: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044 systemd-networkd: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044 systemd-networkd: bond-mngmt: Configured
As expected, all watches are closed. The message is the same for all of them:
... Jan 28 01:31:44 baremetal044 kubelet-wrapper: W0128 04:31:44.352736 2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received ...
After that, these messages start appearing:
`Jan 28 01:31:44 baremetal44 kubelet-wrapper: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`
I'd guess that shouldn't be a problem for a short while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart Kubelet around 9 AM to get things back to normal.
# journalctl --since '2020-01-28 01:31' | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d':' | sort | uniq -dc
   9757 01
  20663 02
  20622 03
  20651 04
  20664 05
  20666 06
  20664 07
  20661 08
  16655 09
      3 10
The Apiservers were up and running, all other nodes were up and running, and everything else was pretty uneventful. This node was the only one affected by the problem (today).
Is there any way to mitigate this kind of event?
Would this be a bug?
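As a stopgap until the root cause is known, one possible mitigation is a small watchdog that restarts Kubelet once the error shown above keeps repeating. This is only a sketch under stated assumptions: the function names, the 10-match threshold, the one-minute window, and the systemd unit name `kubelet` are all hypothetical and would need adapting to the node's actual setup. It does not explain or fix why the connection is never re-established.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. Names, threshold, and time window are assumptions.

PATTERN='use of closed network connection'

# Print the number of lines on stdin that contain the error string.
# grep exits non-zero when nothing matches, so "|| true" keeps callers safe.
count_closed_conn_errors() {
    grep -cF -- "$PATTERN" || true
}

# Succeed (exit status 0) when the match count reaches the threshold.
needs_restart() {
    [ "$1" -ge "$2" ]
}

# One watchdog pass: inspect the last minute of journal output and
# restart Kubelet if the error appeared often enough.
# ASSUMPTION: Kubelet runs as a systemd unit named "kubelet".
watchdog_once() {
    count=$(journalctl --since '1 min ago' --no-pager | count_closed_conn_errors)
    if needs_restart "$count" "${THRESHOLD:-10}"; then
        systemctl restart kubelet
    fi
}
```

`watchdog_once` is meant to be called periodically, for example from a cron job or a systemd timer that runs once a minute. Given the roughly 20,000 matches per hour seen above, a threshold of 10 per minute would trigger quickly while ignoring an isolated error.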