Developer's comments:
"k8s API must have returned NotFound for the secret (see here) where the basic cluster info is stored. If the secret is not found, Rook will assume that it's a new cluster and create new creds here. The killer here is if the k8s API is returning invalid responses, the operator will do the wrong thing. If K8s is returning invalid responses to the caller instead of failure codes, we really need a fix from K8s or else a way to know when the K8s API is unstable."
"Kubernetes as a distributed application platform must be able to survive network partitions and other temporary or catastrophic events. It's based on etcd for its config store for this reason. If there is ever a loss of quorum, etcd will halt and the cluster should stop working. Similarly, Ceph is also designed to halt if the cluster is too unhealthy, rather than continuing and corrupting things.
Is there a possibility that your K8s automation is resetting K8s in some way that would be causing this? I haven't heard of others experiencing this issue. This corruption is completely unexpected from K8s. Otherwise, Rook or other applications can't rely on it as a distributed platform."
Comments from the bug reporter:
"I believe if either of those two things behaved differently, similar stories to ours wouldn't be popping up out there in the k8s world:
https://medium.com/flant-com/rook-cluster-recovery-580efcd275db
I didn't see any root cause in that blog post but I'm fairly certain their cluster disappeared due to a similar situation with his control plane. We've seen it happen twice now in 2 different clusters."
This is the same Flant article, only in English. Counting the Flant article as one of them, that makes three reports of this problem in total.