[Jenkins-infra] Outage Friday morning post-mortem

Arnaud Héritier aheritier at gmail.com
Tue Jun 9 07:53:04 UTC 2015

Confluence is back.
I didn't have to touch it.
It's a little slow, but I hope the caches will be filled before the US
wakes up.

On Tue, Jun 9, 2015 at 9:45 AM, Arnaud Héritier <aheritier at gmail.com> wrote:

> Same thing today :(
> I restarted LDAP.
> I'm trying to see if I can fix Confluence, but I'm not sure I'll be able to
> do it myself.
> On Tue, May 26, 2015 at 6:01 PM, Kohsuke Kawaguchi <kk at kohsuke.org> wrote:
>> We had an outage of LDAP and Confluence on Friday morning.
>>    - LDAP server (slapd) is configured to restart every day
>>    <https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently for some
>>    reason, when it happened last night, it killed slapd but a new one didn't
>>    come online. This happened around May 21 20:00 EDT. No monitoring was set up
>>    to detect this.
>>    - As a result, no one was able to log in to JIRA & Confluence.
>>    - The intersection of the people who had access and the people who
>>    knew where LDAP was running was zero until the next morning PDT. I
>>    think Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal
>>    chat, and I got the daemon started at May 22 10:05 EDT.
>>    - The LDAP outage lasted a good 14 hours.
>>    - At around the same time, Confluence started acting up. According to
>>    the monitoring, this happened around May 22 10:10 EDT.
>>    - A quick investigation revealed that this was caused by cache
>>    depletion. Confluence is very slow, so we have a static cache layer in
>>    front that serves pre-generated HTML files. Probably because the periodic
>>    cache regeneration failed due to the LDAP outage, almost all the cached
>>    files were gone.
>>    - This resulted in everyone hitting Confluence hard, and Confluence
>>    couldn't keep up with it.
>>    - The maintenance screen was put up around 11:05 EDT, the cache
>>    regeneration process was started again, and by 12:10 EDT, the cache was
>>    sufficiently populated and the maintenance screen was taken down.
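
The cache depletion described above is what you get when a regeneration pass
removes or overwrites a page before its replacement exists. A minimal Python
sketch of one way to avoid that, not the actual generator code: `render` is a
hypothetical callable that fetches and returns the page's HTML, and
`regenerate_page` is an illustrative name. The new file is written next to the
old one and swapped in only if rendering succeeded.

```python
import os
import tempfile


def regenerate_page(cache_path, render):
    """Rebuild one cached HTML file; keep the old copy if anything fails.

    `render` is a hypothetical callable returning the page's HTML as a str.
    Returns True if the cache file was replaced, False if the old one was kept.
    """
    cache_dir = os.path.dirname(cache_path) or "."
    try:
        html = render()  # e.g. fails while LDAP or Confluence is down
    except Exception:
        return False  # nothing written, old cache file untouched

    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem and replaces the old file in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        os.replace(tmp_path, cache_path)
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        return False
    return True
```

Note that `mkstemp` creates the file readable by its owner only, so a real
generator would probably need to adjust permissions before the web server can
serve it.
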
>> Here is my take on action items based on this:
>>    - slapd needs to be managed by Upstart, not /etc/init.d, so that if
>>    the process is lost, it'll automatically get restarted. With this, the
>>    nightly LDAP restart shouldn't be an issue.
>>    - LDAP needs to be monitored so that we know it's responsive. Ideally
>>    this should check the certificate expiration date as well so that we get
>>    warned if the expiration gets imminent.
>>    - Static cache size in Confluence needs to be monitored so that we
>>    can see when it starts to go down.
>>    - Static cache generator code needs to be checked to find out why it
>>    can deplete the cache. I suspect it's the full regeneration process going
>>    rogue, but surely it should be able to keep the old cache file around if
>>    the new one fails to generate.
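
For the LDAP monitoring item above, a minimal Python sketch of what such a
probe could check. It assumes the directory is reachable over LDAPS on port
636, which this thread does not confirm, and all function names are made up
for illustration. A real check would more likely be a plugin in whatever
monitoring system is already in place.

```python
import socket
import ssl
import time


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. 'Jan  1 00:00:00 2021 GMT'.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def ldaps_cert_not_after(host, port=636, timeout=5.0):
    """Fetch the server certificate's notAfter string (port 636 is assumed).

    The default context verifies the certificate, so an already expired or
    untrusted certificate raises ssl.SSLError here, which is itself an alert.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job or monitoring plugin could alert when `port_is_open` returns False
or when `days_until_expiry(ldaps_cert_not_after(host))` drops below some
threshold such as 30 days.
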
>> --
>> Kohsuke Kawaguchi
>> _______________________________________________
>> Jenkins-infra mailing list
>> Jenkins-infra at lists.jenkins-ci.org
>> http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
> --
> -----
> Arnaud Héritier
> http://aheritier.net
> Mail/GTalk: aheritier AT gmail DOT com
> Twitter/Skype : aheritier

Arnaud Héritier
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier