[Jenkins-infra] Pluginsite: post-mortem 2017-11-29

Olblak me at olblak.com
Fri Dec 1 14:14:33 UTC 2017

Pluginsite returned 404 because, after the ingress container upgrade,
the Pluginsite API started receiving HTTP requests with the wrong path.

 Normal behavior: http://pluginsite/api -> http://api.pluginsite/
 What happened: http://pluginsite/api -> http://api.pluginsite/api/
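
To make the difference concrete, here is a minimal Python sketch of the
path the backend ends up receiving. The function `backend_path` and its
parameters are illustrative assumptions, not the ingress controller's
actual code; it only models a simple prefix rewrite.

```python
def backend_path(request_path, ingress_path, rewrite_target=None):
    """Return the path forwarded to the backend (simplified model).

    rewrite_target=None models an ingress that ignores the rewrite rule
    and forwards the request path unchanged.
    """
    if rewrite_target is None or not request_path.startswith(ingress_path):
        return request_path
    # Drop the matched prefix and re-root the remainder on the target.
    remainder = request_path[len(ingress_path):].lstrip("/")
    return rewrite_target.rstrip("/") + "/" + remainder


# Rewrite rule honored: the /api prefix is stripped before reaching the API.
print(backend_path("/api/", "/api", rewrite_target="/"))
# Rewrite rule ignored: the API receives /api/ as-is.
print(backend_path("/api/", "/api"))
```

The second call is the situation described above: the rule is silently
ignored, so the API is asked for a path it presumably does not serve.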

Each time we deploy or update an application, Kubernetes deploys new
containers and uses liveness and readiness probes to know whether those
new containers are running and ready to receive traffic.
If they are, traffic is re-routed from the old containers to the new
ones and the old containers are deleted.
But if for some reason the checks don't pass, traffic stays on the old
containers, and the new containers are restarted until the checks succeed.
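
The decision described above can be sketched roughly as follows. This is
a toy model, not how Kubernetes is implemented: `rolling_update`,
`is_ready` and `max_restarts` are hypothetical names, and the restart cap
exists only so the sketch terminates.

```python
def rolling_update(old, new, is_ready, max_restarts=3):
    """Return the containers that should serve traffic after an update.

    is_ready(container, attempt) is a hypothetical stand-in for the
    liveness and readiness probes.
    """
    for attempt in range(max_restarts + 1):
        if all(is_ready(container, attempt) for container in new):
            # Probes pass: traffic moves over, old containers are deleted.
            return new
        # Probes failed: the new containers are restarted and checked again.
    # Probes never passed: traffic stays on the old containers.
    return old


# A container that only becomes ready after one restart.
print(rolling_update(["api-v1"], ["api-v2"], lambda c, attempt: attempt >= 1))
```

Note that `is_ready` only looks at each new container on its own, which
is why a problem between two applications can go unnoticed.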

The benefit of this approach is that application upgrades are totally
transparent for end users (most of the time), and we have a safety net
for each application in case of problems. Unfortunately, the probes only
check each application on its own, so this doesn't work for linked
applications, which is exactly what happened here.

Currently, the only way to catch this kind of error is to first deploy
everything on a sandbox cluster.
But it's time-consuming and this step is sometimes skipped for small

This step must be tested and automated with a sandbox cluster (like
Minikube) before any modifications.
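
As one possible shape for such an automated check, here is a minimal
smoke test in Python using only the standard library. The URL in the
example is a placeholder and the expected status codes are assumptions;
the idea is simply to request the public routes of the sandbox
deployment and fail the pipeline if any of them answers with an
unexpected status such as 404.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; keep their code.
        return error.code


def smoke_test(expected, fetch=fetch_status):
    """Return a list of (url, expected, actual) for every mismatch."""
    failures = []
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            failures.append((url, want, got))
    return failures


# Example with a fake fetcher standing in for the sandbox cluster
# (hypothetical URL; a real run would use the default fetch_status).
fake_fetch = {"http://pluginsite.example/api/": 404}.get
print(smoke_test({"http://pluginsite.example/api/": 200}, fetch=fake_fetch))
```

Because the request goes through the ingress rather than straight to
the container, a check like this exercises the link between the two
applications, which the per-container probes do not. Connection errors
are deliberately left uncaught so that an unreachable sandbox also
fails the run.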

Regarding monitoring, we were already (badly) monitoring the Pluginsite
API (for the index age: https://git.io/vbq0T), but the check was
configured to report an error only if data had been missing for at least
90 minutes, which of course is far too long. I reduced it to 2 minutes.

On Thu, Nov 30, 2017, at 06:17 PM, R. Tyler Croy wrote:
> (replies inline)
> On Thu, 30 Nov 2017, Olblak wrote:
> > Hi,
> > 
> > Yesterday, from 3:10PM UTC to 5:20PM UTC (according to Datadog),
> > 'plugins.jenkins.io' was down.
> > 
> > This outage was caused by an uncaught breaking change in the
> > upgrade of the ingress controller.
> > 
> > We upgraded the ingress container from
> > nginx-ingress-controller:0.9.0-beta.15 to
> > nginx-ingress-controller:0.9.0-beta.19, but starting with
> > nginx-ingress-controller:0.9.0-beta.18 the annotation prefix changed
> > from ingress.kubernetes.io to nginx.ingress.kubernetes.io, which
> > broke the pluginsite redirect rules.
> > 
> > It wasn't a big modification (and it was easy to roll back), but
> > unfortunately it took a long time to be detected.
> > 
> > To keep this situation from happening again in the future, we need a
> > better way to do Kubernetes regression tests, and to improve downtime
> > notification.
> One thing that was interesting about this was that plugins.jenkins.io was
> responding to requests but with a 404. I've added a monitor which should
> hopefully help catch that in the future
> (https://app.datadoghq.com/monitors#3445672?group=all&live=1h)
> I'm curious what kind of testing you think we could introduce into the
> Jenkins Pipeline which would have caught this kind of issue?
> Cheers
> - R. Tyler Croy
> ------------------------------------------------------
>      Code: <https://github.com/rtyler>
>   Chatter: <https://twitter.com/agentdero>
>      xmpp: rtyler at jabber.org
>   % gpg --keyserver keys.gnupg.net --recv-key 1426C7DC3F51E16F
> ------------------------------------------------------
