hashwing

Member Since 5 years ago

@gzsunrun , Guangzhou, China

8 followers
4 following
233 stars
65 repos

86 contributions in the last year

Pinned
⚡ Fork of Collaborative Web IDE by Google
⚡ Package config is a Configuration file parser for INI format
Activity
Jan 19, 4 days ago
started
started
started
Jan 18, 5 days ago
Jan 17, 6 days ago
started
Dec 30, 3 weeks ago
push

hashwing push hashwing/togo

Fix the migrations table missing a primary key

commit sha: ce2f695cdbb911ca0c932ed2566bb6854a23dab9
Dec 27, 3 weeks ago
issue

hashwing issue comment kubernetes/kubernetes

(1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)

We've just upgraded our production cluster to 1.17.2.

Since the update on Saturday, we've had a strange outage: after a NIC bond failure (which recovers shortly afterwards), Kubelet has all of its connections broken and won't try to reestablish them unless it is manually restarted.

Here is the timeline of the last time it occurred:

01:31:16: The kernel detects a failure on the bond interface. It lasts for a while, and the interface eventually recovers.

Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Configured

As expected, all watches are closed. Message is the same for them all:

...
Jan 28 01:31:44 baremetal044 kubelet-wrapper[2039]: W0128 04:31:44.352736    2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received
...

Then these messages begin:

`Jan 28 01:31:44 baremetal44 kubelet-wrapper[2039]: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`

I'm guessing this shouldn't be a problem for a while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart Kubelet around 9 AM to get things back to normal.

# journalctl --since '2020-01-28 01:31' | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d':' | sort | uniq -dc
   9757 01
  20663 02
  20622 03
  20651 04
  20664 05
  20666 06
  20664 07
  20661 08
  16655 09
      3 10
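
The pipeline above counts matching log lines per hour. As a rough illustration of the same idea, here is a minimal Python sketch. The function name `count_by_hour` and the sample lines are hypothetical and not part of the original report; the sketch assumes the default journalctl layout in which the third space-separated field is an `HH:MM:SS` timestamp. Unlike `uniq -dc`, it also reports hours that have only a single match.

```python
from collections import Counter


def count_by_hour(lines, needle="use of closed"):
    """Count lines containing `needle`, grouped by the hour of their timestamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        # Split on single spaces, like `cut -d' '`; field 3 is HH:MM:SS.
        fields = line.split(" ")
        if len(fields) < 3:
            continue
        hour = fields[2].split(":")[0]
        counts[hour] += 1
    # Sort by hour so the result reads like the `sort | uniq -c` output.
    return dict(sorted(counts.items()))


# Hypothetical, synthetic lines for illustration only (not real kubelet output).
sample = [
    "Jan 28 01:31:44 host app[1]: write tcp: use of closed network connection",
    "Jan 28 01:45:02 host app[1]: write tcp: use of closed network connection",
    "Jan 28 02:00:10 host app[1]: write tcp: use of closed network connection",
    "Jan 28 02:00:11 host other[2]: unrelated message",
]
print(count_by_hour(sample))  # {'01': 2, '02': 1}
```

To run it against a live journal, the lines could be piped in from `journalctl --since ...` and read from standard input instead of the hard-coded list.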

Apiservers were up and running, all other nodes were up and running, everything else pretty uneventful. This one was the only one affected (today) by this problem.

Is there any way to mitigate this kind of event?

Would this be a bug?

hashwing

Hello, your email has been received. Wishing you good health and progress in your studies.

Dec 24, 4 weeks ago
started
started
Dec 23, 1 month ago
started
started
Dec 22, 1 month ago
started
Dec 21, 1 month ago
started
started
Dec 16, 1 month ago
started
Dec 15, 1 month ago
started
Dec 9, 1 month ago
started