[Jenkins-infra] Outage Friday morning post-mortem
aheritier at gmail.com
Tue May 26 17:05:04 UTC 2015
It was mainly my fault. I wasn't sure if LDAP was hosted on cucumber or
dockerized on lettuce and when I (too quickly) checked on cucumber I didn't
remember that the service was named slapd (and not *ldap*). At least next
time I'll know.
For me, you are missing the most important point:
* The intersection of the people who had access and the people who knew
where LDAP was running was zero, until the morning next day in PDT.
Even if everything is automated and monitored, we really need to be sure
that several people in the community have the rights, skills, and knowledge of
the infrastructure to be able to solve such issues. To solve this we need:
* To be sure we have enough people with the required rights, spread around the
world, to handle such tasks
* To document (it can be a few lines in a markdown page) where each
service runs and how to restart it. This documentation mustn't be hosted on our
own infra (so that it stays accessible even if everything is down).
On Tue, May 26, 2015 at 6:01 PM, Kohsuke Kawaguchi <kk at kohsuke.org> wrote:
> We had an outage of LDAP and Confluence Friday morning.
> - LDAP server (slapd) is configured to restart every day
> <https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently for some
> reason, when it happened last night, it killed slapd but a new one didn't
> come online. This happened around May 21 20:00 EDT. No monitoring was set up
> to detect this.
> - As a result, no one was able to login to JIRA & Confluence.
> - The intersection of the people who had access and the people who
> knew where LDAP was running was zero, until the morning next day in PDT. I
> think Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal
> chat, and I got the daemon started at May 22 10:05 EDT.
> - The LDAP outage lasted a good 14 hours.
> - At around the same time, Confluence started acting up. According to
> the monitoring, this happened around May 22 10:10 EDT.
> - A quick investigation revealed that this was caused by cache
> depletion. Confluence is very slow, so we have a static cache layer in
> front that serves pre-generated HTML files. Probably because the periodic
> cache regeneration failed due to the LDAP outage, almost all the cached files
> were gone.
> - This resulted in everyone hitting Confluence hard, and Confluence
> couldn't keep up with it.
> - The maintenance screen was put up around 11:05 EDT, the cache
> regeneration process was started again, and by 12:10 EDT, the cache was
> sufficiently populated and the maintenance screen was taken down.
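The cache depletion described above is the kind of failure that a "write to a temporary file, then swap" pattern is meant to avoid. The following is only a minimal Python sketch of that pattern, not the actual cache generator: the function name `regenerate_cache_file` and the `render` callback are hypothetical, and the real generator may work quite differently.

```python
import os
import tempfile


def regenerate_cache_file(path, render):
    """Replace the cached file at `path` with the output of render().

    Hypothetical helper: `render` stands in for whatever fetches the page
    from Confluence. If it raises (for example because LDAP is down), the
    existing cached file is left untouched.
    Returns True on success, False when the old copy was kept.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(render())
        # mkstemp creates the file readable by the owner only; widen the
        # permissions so a front-end web server can read it.
        os.chmod(tmp_path, 0o644)
        # The swap only happens once the new content is fully written.
        os.replace(tmp_path, path)
        return True
    except Exception:
        # Regeneration failed: drop the partial file, keep the old cache.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
```

With this shape, a failed regeneration leaves a stale page in the cache instead of no page at all, so the front layer can keep serving while the backend problem is fixed.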
> Here is my take on actions based on this:
> - slapd needs to be managed by Upstart, not /etc/init.d so that if the
> process is lost, it'll automatically get restarted. With this, nightly LDAP
> restart shouldn't be an issue.
> - LDAP needs to be monitored so that we know it's responsive. Ideally
> this should check the certificate expiration date as well so that we get
> warned if the expiration gets imminent.
> - Static cache size in Confluence needs to be monitored so that we can
> see when it starts to go down.
> - Static cache generator code needs to be checked to find out why it
> can deplete the cache. I suspect it's the full regeneration process going
> rogue, but surely it should be able to keep the old cache file around if
> the new one fails to generate.
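The LDAP monitoring item in the list above (responsiveness plus certificate expiration) could be sketched roughly as follows in Python, using only the standard library. Several things here are assumptions rather than facts from this thread: that the server is reachable over LDAPS on port 636, the 30-day warning threshold, and the function names. A TCP connect only shows that something is listening; a real check would probably also perform an LDAP bind or search.

```python
import socket
import ssl
import time


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if expired).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "May 22 10:05:00 2016 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def ldaps_cert_not_after(host, port=636, timeout=5.0):
    """Fetch the notAfter field of the certificate served on host:port.

    Assumes LDAPS (TLS from the first byte) and a certificate that
    validates against the system trust store.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_ldap(host, port=636, warn_days=30):
    """Return a list of human-readable problems (empty means healthy)."""
    if not port_is_open(host, port):
        return ["%s:%d is not accepting connections" % (host, port)]
    remaining = days_until_expiry(ldaps_cert_not_after(host, port))
    if remaining < warn_days:
        return ["certificate expires in %.1f days" % remaining]
    return []
```

Run from cron or a monitoring agent on a machine outside the LDAP host, a non-empty result from `check_ldap` would be the signal to alert on.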
> Kohsuke Kawaguchi
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier