[Jenkins-infra] Outage Friday morning post-mortem
Kohsuke Kawaguchi
kk at kohsuke.org
Tue May 26 16:01:12 UTC 2015
We had an outage of LDAP and Confluence on Friday morning.
- The LDAP server (slapd) is configured to restart every day
<https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently for some
reason, when the restart ran on Thursday night, it killed slapd but a new
one didn't come online. This happened around May 21 20:00 EDT. No monitoring
was set up to detect this.
- As a result, no one was able to log in to JIRA & Confluence.
- The intersection of the people who had access and the people who knew
where LDAP was running was zero until the next morning in PDT. I think
Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal chat,
and I got the daemon started at May 22 10:05 EDT.
- The LDAP outage lasted a good 14 hours.
- At around the same time, Confluence started acting up. According to
the monitoring, this happened around May 22 10:10 EDT.
- A quick investigation revealed that this was caused by cache
depletion. Confluence is very slow, so we have a static cache layer in
front that serves pre-generated HTML files. Probably because the periodic
cache regeneration failed due to the LDAP outage, almost all the cached
files were gone.
- This resulted in everyone hitting Confluence hard, and Confluence
couldn't keep up with it.
- The maintenance screen was put up around 11:05 EDT, the cache
regeneration process was started again, and by 12:10 EDT, the cache was
sufficiently populated and the maintenance screen was taken down.
Here is my take on the actions we need based on this:
- slapd needs to be managed by Upstart, not /etc/init.d so that if the
process is lost, it'll automatically get restarted. With this, nightly LDAP
restart shouldn't be an issue.
- LDAP needs to be monitored so that we know it's responsive. Ideally
this should check the certificate expiration date as well so that we get
warned when the expiration gets imminent.
- Static cache size in Confluence needs to be monitored so that we can
see when it starts to go down.
- Static cache generator code needs to be checked to find out why it can
deplete the cache. I suspect it's the full regeneration process going
rogue, but surely it should be able to keep the old cache file around if
the new one fails to generate.
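
To illustrate the last point, here is a minimal sketch of how a regeneration step could keep the old cache file when the new one fails to generate. This is not the actual cache generator code; the function name `regenerate_cached_page` and the `render` callable are hypothetical, and I'm assuming each cached page is a single HTML file on local disk.

```python
import os
import tempfile


def regenerate_cached_page(cache_path, render):
    """Refresh one cached HTML file, keeping the old copy on failure.

    `render` is a hypothetical callable that returns the page HTML as a
    string and raises an exception on failure (e.g. when LDAP is down).
    Returns True if the file was refreshed, False if the old one was kept.
    """
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    try:
        html = render()
    except Exception:
        # Rendering failed: leave the existing cache file untouched.
        return False

    # Write to a temporary file in the same directory, so that the final
    # os.replace() is a rename within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # Swap the new file into place only after it is fully written.
        os.replace(tmp_path, cache_path)
    except Exception:
        # Remove the partial temporary file and keep the old cache file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
    return True
```

The idea is that the old file is never deleted up front: it is only replaced once a complete new file exists, so a failed regeneration leaves stale content in place instead of an empty cache.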
--
Kohsuke Kawaguchi