[Jenkins-infra] Outage Friday morning post-mortem
aheritier at gmail.com
Tue Jun 9 07:45:10 UTC 2015
Same thing today :(
I restarted LDAP.
I'm trying to see if I can fix Confluence, but I'm not sure I'll be able to.
On Tue, May 26, 2015 at 6:01 PM, Kohsuke Kawaguchi <kk at kohsuke.org> wrote:
> We had the outage of LDAP and Confluence Friday morning.
> - LDAP server (slapd) is configured to restart every day
> <https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently for some
> reason, when it happened last night, it killed slapd but a new one didn't
> come online. This happened around May 21 20:00 EDT. No monitoring was set up
> to detect this.
> - As a result, no one was able to log in to JIRA & Confluence.
> - The intersection of the people who had access and the people who
> knew where LDAP was running was empty until the next morning PDT. I
> think Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal
> chat, and I got the daemon started at May 22 10:05 EDT.
> - The LDAP outage lasted a good 14 hours.
> - At around the same time, Confluence started acting up. According to
> the monitoring, this happened around May 22 10:10 EDT.
> - A quick investigation revealed that this was caused by cache
> depletion. Confluence is very slow, so we have a static cache layer in
> front that serves pre-generated HTML files. Probably because the periodic
> cache regeneration failed due to the LDAP outage, almost all the cached files
> were gone.
> - This resulted in everyone hitting Confluence hard, and Confluence
> couldn't keep up with it.
> - The maintenance screen was put up around 11:05 EDT, the cache
> regeneration process was started again, and by 12:10 EDT, the cache was
> sufficiently populated and the maintenance screen was taken down.
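
To make the cache depletion described above visible before it hurts, here is a minimal sketch in Python that counts the pre-generated files and compares the count with a known-good baseline. The cache directory layout, the ".html" suffix, and the 50% threshold are all assumptions on my side; I have not checked how the real cache layer stores its files.

```python
from pathlib import Path


def count_cached_pages(cache_dir: Path, suffix: str = ".html") -> int:
    """Count pre-generated files under cache_dir, recursively.

    The suffix is an assumption about how the cache names its files.
    """
    return sum(1 for p in cache_dir.rglob(f"*{suffix}") if p.is_file())


def cache_is_depleted(current: int, baseline: int, min_ratio: float = 0.5) -> bool:
    """Return True when the cache has shrunk below min_ratio of the baseline."""
    if baseline <= 0:
        return False  # no baseline recorded yet, nothing to compare against
    return current < baseline * min_ratio
```

A cron job could record the count every few minutes and alert when cache_is_depleted() turns true.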
> Here is my take on the action items based on this:
> - slapd needs to be managed by Upstart, not /etc/init.d, so that if the
> process is lost, it'll automatically get restarted. With this, the nightly LDAP
> restart shouldn't be an issue.
> - LDAP needs to be monitored so that we know it's responsive. Ideally
> this should check the certificate expiration date as well so that we get
> warned if the expiration gets imminent.
> - Static cache size in Confluence needs to be monitored so that we can
> see when it starts to go down.
> - Static cache generator code needs to be checked to find out why it
> can deplete the cache. I suspect it's the full regeneration process going
> rogue, but surely it should be able to keep the old cache file around if
> the new one fails to generate.
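
On the last point, one way to keep the old cache file around when regeneration fails is to render into a temporary file and rename it over the old one only on success. This is a rough sketch, not the actual generator code: regenerate_page() and its render callback are hypothetical names, and treating an empty page as a failure is my assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def regenerate_page(target: Path, render: Callable[[], str]) -> bool:
    """Rewrite target with the output of render(); keep the old file on failure."""
    try:
        html = render()
    except Exception:
        return False  # rendering failed: the previous cached copy stays in place
    if not html.strip():
        return False  # assumption: an empty page counts as a failed render
    # Write next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        os.chmod(tmp_name, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_name, target)  # swaps in the new file in one step
    except OSError:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        return False
    return True
```

With this shape, a failing render (for example because the login behind it cannot reach LDAP) leaves the stale page being served instead of removing it.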
> Kohsuke Kawaguchi
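
For the LDAP monitoring item, here is a minimal sketch using only the Python standard library. It assumes the server speaks LDAPS on port 636, which I have not verified for our setup, and the function names are made up for illustration. A TCP connect only shows that something is listening; a real check should also do an actual bind or search.

```python
import socket
import ssl
import time
from typing import Optional


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def ldaps_cert_days_left(host: str, port: int = 636, timeout: float = 5.0) -> float:
    """Fetch the server certificate over TLS and return days until it expires.

    Port 636 (LDAPS) is an assumption. With the default context, an expired
    or untrusted certificate makes the handshake raise an ssl.SSLError,
    which a monitor should treat as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

A monitor could alert when port_is_open() is false or when ldaps_cert_days_left() drops below, say, 14 days.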
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier