[Jenkins-infra] Outage Friday morning post-mortem

Kohsuke Kawaguchi kk at kohsuke.org
Tue May 26 16:01:12 UTC 2015

We had an outage of LDAP and Confluence on Friday morning.

   - LDAP server (slapd) is configured to restart every day
   <https://issues.jenkins-ci.org/browse/INFRA-240>. For some reason, when
   the restart ran Thursday evening, it killed slapd but a new one didn't
   come online. This happened around May 21 20:00 EDT. No monitoring was set
   up to detect this.

   - As a result, no one was able to log in to JIRA & Confluence.

   - The intersection of the people who had access and the people who knew
   where LDAP was running was empty until the next morning PDT. I think
   Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal chat,
   and I got the daemon started at May 22 10:05 EDT.

   - The LDAP outage lasted a good 14 hours.

   - At around the same time, Confluence started acting up. According to
   the monitoring, this happened around May 22 10:10 EDT.

   - A quick investigation revealed that this was caused by cache
   depletion. Confluence is very slow, so we have a static cache layer in
   front that serves pre-generated HTML files. Probably because the periodic
   cache regeneration failed due to the LDAP outage, almost all the cached
   files were gone.

   - This resulted in everyone hitting Confluence hard, and Confluence
   couldn't keep up with it.

   - The maintenance screen was put up around 11:05 EDT, the cache
   regeneration process was started again, and by 12:10 EDT, the cache was
   sufficiently populated and the maintenance screen was taken down.
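
As a rough illustration of the detection that was missing for slapd, a
periodic TCP probe of the LDAP port would have noticed that nothing was
listening. This is only a sketch under assumptions, not the project's actual
monitoring: the function name is made up for the example, port 389 is the
standard LDAP port (636 for LDAPS), and an open port only proves that
something is listening, so a real check should go on to attempt an LDAP bind.

```python
import socket


def ldap_port_is_open(host, port=389, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it does not speak the LDAP protocol.
    """
    try:
        # create_connection resolves the name and connects; close right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Run from cron or a monitoring agent every few minutes, a False result from
something like ldap_port_is_open("ldap.example.org") (a placeholder host
name) would be the signal to page someone.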

Here are the action items I'd take from this:

   - slapd needs to be managed by Upstart, not /etc/init.d, so that if the
   process is lost, it'll automatically get restarted. With this, the nightly
   LDAP restart shouldn't be an issue.

   - LDAP needs to be monitored so that we know it's responsive. Ideally
   this should check the certificate expiration date as well so that we get
   warned when the expiration gets imminent.

   - The size of the static cache in front of Confluence needs to be
   monitored so that we can see when it starts to go down.

   - Static cache generator code needs to be checked to find out why it can
   deplete the cache. I suspect it's the full regeneration process going
   rogue, but surely it should be able to keep the old cache file around if
   the new one fails to generate.
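
The last item, keeping the old cache file when regeneration fails, can be
sketched as a write-to-temp-then-rename step. This is a minimal sketch and not
the actual cache generator: regenerate_cached_page and the render callable are
hypothetical names, and it assumes the cache is one HTML file per page on a
local filesystem.

```python
import os
import tempfile


def regenerate_cached_page(cache_path, render):
    """Rewrite one cached HTML file, keeping the old copy if rendering fails.

    `render` is a hypothetical callable that returns the page HTML as a
    string and raises on failure (for example when LDAP is unreachable).
    Returns True if the cache file was replaced, False if the old one was kept.
    """
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Write in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(render())
        # mkstemp creates files as 0600; assume the web server needs read access.
        os.chmod(tmp_path, 0o644)
        # Atomic swap: readers see either the old file or the new one.
        os.replace(tmp_path, cache_path)
        return True
    except Exception:
        # Rendering or writing failed: drop the partial file, keep the old cache.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
```

With this shape, a failed regeneration run leaves every existing cached file
in place, so an upstream outage makes the cache stale instead of empty.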

Kohsuke Kawaguchi