[Jenkins-infra] Pluginsite: post-mortem 2017-11-29

Olblak me at olblak.com
Thu Nov 30 16:14:37 UTC 2017


Hi,

Yesterday, from 3:10PM UTC to 5:20PM UTC (according Datadog),
'plugins.jenkins.io' was down.

The reason of this outage was due to an "un-catched" breaking change
with the upgrade of the ingress controller.

We upgraded the ingress container from
nginx-ingress-controller:0.9.0-beta.15 to
nginx-ingress-controller:0.9.0-beta.19 but started from
nginx-ingress-controller:0.9.0-beta.18, annotation name changed
from ingress.kubernetes.io to nginx.ingress.kubernetes.io. which had for
consequence to break pluginsite redirect rules.

It wasn't a big modification (and it was easy to rollblack), but
unfortunately it tooks a lot of time to be detected.

In order to avoid this situation to appear again in the futur, we need a
better way to do kubernetes regression tests, and to improve downtime
notification.


More information about the Jenkins-infra mailing list