[Jenkins-infra] Outage Friday morning post-mortem

Arnaud Héritier aheritier at gmail.com
Tue Jun 9 07:45:10 UTC 2015


Same thing today :(
I restarted LDAP.
I'm trying to see if I can fix Confluence, but I'm not sure I'll be able to
do it myself.

On Tue, May 26, 2015 at 6:01 PM, Kohsuke Kawaguchi <kk at kohsuke.org> wrote:

>
> We had an outage of LDAP and Confluence on Friday morning.
>
>    - The LDAP server (slapd) is configured to restart every day
>    <https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently, for some
>    reason, when the restart ran on Thursday night, it killed slapd but a new
>    one didn't come online. This happened around May 21 20:00 EDT. No
>    monitoring was set up to detect this.
>
>    - As a result, no one was able to login to JIRA & Confluence.
>
>    - The intersection of the people who had access and the people who
>    knew where LDAP was running was empty until the next morning PDT. I
>    think Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal
>    chat, and I got the daemon started at May 22 10:05 EDT.
>
>    - The LDAP outage lasted a good 14 hours.
>
>    - At around the same time, Confluence started acting up. According to
>    the monitoring, this happened around May 22 10:10 EDT.
>
>    - A quick investigation revealed that this was caused by cache
>    depletion. Confluence is very slow, so we have a static cache layer in
>    front that serves pre-generated HTML files. Probably because the periodic
>    cache regeneration failed due to the LDAP outage, almost all the cached
>    files were gone.
>
>    - This resulted in everyone hitting Confluence hard, and Confluence
>    couldn't keep up with it.
>
>    - The maintenance screen was put up around 11:05 EDT, the cache
>    regeneration process was started again, and by 12:10 EDT, the cache was
>    sufficiently populated and the maintenance screen was taken down.
>
> Here is my take on the actions based on this:
>
>    - slapd needs to be managed by Upstart, not /etc/init.d, so that if the
>    process is lost, it'll automatically get restarted. With this, the nightly
>    LDAP restart shouldn't be an issue.
>
>    - LDAP needs to be monitored so that we know it's responsive. Ideally
>    this should check the certificate expiration date as well so that we get
>    warned if the expiration gets imminent.
>
>    - Static cache size in Confluence needs to be monitored so that we can
>    see when it starts to go down.
>
>    - Static cache generator code needs to be checked to find out why it
>    can deplete the cache. I suspect it's the full regeneration process going
>    rogue, but surely it should be able to keep the old cache file around if
>    the new one fails to generate.
>
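
For the cache size item, a rough sketch of the kind of check that could feed
the monitoring. It assumes the static cache is a directory tree of .html
files; the directory layout, the baseline and the 50% threshold are all
assumptions, since the original message does not describe them.

```python
from pathlib import Path


def count_cached_pages(cache_root: Path, pattern: str = "*.html") -> int:
    """Count the pre-generated files under cache_root (searched recursively)."""
    return sum(1 for p in cache_root.rglob(pattern) if p.is_file())


def cache_looks_depleted(current: int, baseline: int,
                         min_ratio: float = 0.5) -> bool:
    """True when the cache has shrunk below min_ratio of a known-good baseline.

    With no usable baseline (<= 0) this reports False rather than guessing.
    """
    if baseline <= 0:
        return False
    return current < baseline * min_ratio
```

The baseline would be the file count seen when the cache is healthy; graphing
the raw count over time would show the cache starting to go down before it is
empty.
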
>
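
On keeping the old cache file around when the new one fails to generate: I
have not looked at the actual generator code, so this is only a sketch of the
usual write-to-temp-then-rename pattern, not a description of how it works
today. regenerate_cached_page() and the render callback are hypothetical
names.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def regenerate_cached_page(target: Path, render: Callable[[], str]) -> bool:
    """Replace `target` with freshly rendered HTML, but only on success.

    If rendering raises or returns nothing, the existing file is left
    untouched and False is returned.
    """
    try:
        html = render()
    except Exception:
        return False  # e.g. Confluence or LDAP is down: keep the old copy
    if not html.strip():
        return False  # an empty page is treated as a failed generation

    # Write next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        os.replace(tmp_name, target)  # swaps the file in a single step
    except OSError:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        return False
    return True
```

With something like this, a failed regeneration run leaves stale pages in
place instead of an empty cache.
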
> --
> Kohsuke Kawaguchi
>
> _______________________________________________
> Jenkins-infra mailing list
> Jenkins-infra at lists.jenkins-ci.org
> http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
>
>


-- 
-----
Arnaud Héritier
http://aheritier.net
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier