[Jenkins-infra] Pluginsite: post-mortem 2017-11-29
me at olblak.com
Thu Nov 30 16:14:37 UTC 2017
Yesterday, from 3:10PM UTC to 5:20PM UTC (according Datadog),
'plugins.jenkins.io' was down.
The reason of this outage was due to an "un-catched" breaking change
with the upgrade of the ingress controller.
We upgraded the ingress container from
nginx-ingress-controller:0.9.0-beta.19 but started from
nginx-ingress-controller:0.9.0-beta.18, annotation name changed
from ingress.kubernetes.io to nginx.ingress.kubernetes.io. which had for
consequence to break pluginsite redirect rules.
It wasn't a big modification (and it was easy to rollblack), but
unfortunately it tooks a lot of time to be detected.
In order to avoid this situation to appear again in the futur, we need a
better way to do kubernetes regression tests, and to improve downtime
More information about the Jenkins-infra