[Jenkins-infra] Outage Friday morning post-mortem

Arnaud Héritier aheritier at gmail.com
Mon Jul 27 09:19:51 UTC 2015


I manually restarted LDAP today.
The wiki was still broken after 20 minutes, so I restarted it too, but it is
still failing.
I suppose it might be a cache issue like the one KK saw in May.
Maybe we should stop Apache for one hour to see if it helps (I don't know
where KK set up the maintenance page).

On Tue, Jun 9, 2015 at 9:53 AM, Arnaud Héritier <aheritier at gmail.com> wrote:

> Confluence is back.
> I didn't have to touch it.
> It is a little bit slow, but I hope the caches will be filled before the US
> wakes up.
>
> On Tue, Jun 9, 2015 at 9:45 AM, Arnaud Héritier <aheritier at gmail.com>
> wrote:
>
>> Same thing today :(
>> I restarted LDAP.
>> I'm trying to see if I can fix Confluence, but I'm not sure I can do it
>> myself.
>>
>> On Tue, May 26, 2015 at 6:01 PM, Kohsuke Kawaguchi <kk at kohsuke.org>
>> wrote:
>>
>>>
>>> We had an outage of LDAP and Confluence on Friday morning.
>>>
>>>    - The LDAP server (slapd) is configured to restart every day
>>>    <https://issues.jenkins-ci.org/browse/INFRA-240>. Apparently, for
>>>    some reason, when the restart ran that night, it killed slapd but a new
>>>    one didn't come online. This happened around May 21 20:00 EDT. No
>>>    monitoring was set up to detect this.
>>>
>>>    - As a result, no one was able to log in to JIRA & Confluence.
>>>
>>>    - The intersection of the people who had access and the people who
>>>    knew where LDAP was running was empty until the next morning PDT. I
>>>    think Arnaud, Kostyasha, or James pinged me over IRC and CloudBees internal
>>>    chat, and I got the daemon started at May 22 10:05 EDT.
>>>
>>>    - The LDAP outage lasted a good 14 hours.
>>>
>>>    - At around the same time, Confluence started acting up. According
>>>    to the monitoring, this happened around May 22 10:10 EDT.
>>>
>>>    - A quick investigation revealed that this was caused by cache
>>>    depletion. Confluence is very slow, so we have a static cache layer in
>>>    front that serves pre-generated HTML files. Probably because the periodic
>>>    cache regeneration failed due to the LDAP outage, almost all the cached
>>>    files were gone.
>>>
>>>    - This resulted in everyone hitting Confluence hard, and Confluence
>>>    couldn't keep up with it.
>>>
>>>    - The maintenance screen was put up around 11:05 EDT, the cache
>>>    regeneration process was started again, and by 12:10 EDT, the cache was
>>>    sufficiently populated and the maintenance screen was taken down.
>>>
>>> Here is my take on the actions to follow from this:
>>>
>>>    - slapd needs to be managed by Upstart, not /etc/init.d, so that if
>>>    the process is lost, it'll automatically get restarted. With this, the
>>>    nightly LDAP restart shouldn't be an issue.
>>>
>>>    - LDAP needs to be monitored so that we know it's responsive.
>>>    Ideally this should check the certificate expiration date as well so that
>>>    we get warned when the expiration is imminent.
>>>
>>>    - Static cache size in Confluence needs to be monitored so that we
>>>    can see when it starts to go down.
>>>
>>>    - Static cache generator code needs to be checked to find out why it
>>>    can deplete the cache. I suspect it's the full regeneration process going
>>>    rogue, but surely it should be able to keep the old cache file around if
>>>    the new one fails to generate.
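
For the LDAP monitoring item above, a minimal sketch of what such a check could
look like, using only the Python standard library. The function names, the
LDAPS port 636, and the idea of running this from a periodic job are
assumptions for illustration; nothing here describes how the Jenkins
infrastructure actually monitors slapd. Note that a successful TCP connection
only shows that something is listening, not that slapd answers queries.

```python
import socket
import ssl
from datetime import datetime, timezone


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_not_after(host, port=636, timeout=5.0):
    """Return the notAfter string of the certificate served on host:port.

    Assumes the server speaks TLS directly on that port (LDAPS); a server
    that only offers StartTLS on port 389 would need a different approach.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days remaining before a notAfter date such as 'May 22 14:05:00 2016 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0
```

A periodic job could alert when port_is_open() returns False or when
days_until_expiry() drops below a chosen threshold, say 30 days.
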
>>>
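
For the static cache size item, one possible shape for the check, assuming the
cache is a directory tree of pre-generated .html files. The directory layout,
the file suffix, and the 50% threshold are all hypothetical; the real cache
may be organised differently.

```python
import os


def count_cached_pages(cache_dir, suffix=".html"):
    """Count cached files under cache_dir, including subdirectories."""
    total = 0
    for _root, _dirs, files in os.walk(cache_dir):
        total += sum(1 for name in files if name.endswith(suffix))
    return total


def cache_is_depleted(current, baseline, min_ratio=0.5):
    """True when the cache has shrunk below min_ratio of its usual size."""
    return current < baseline * min_ratio
```

Recording count_cached_pages() at regular intervals would give the trend, and
cache_is_depleted() could drive an alert before the cache is nearly empty.
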
>>>
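
On keeping the old cache file around when regeneration fails: a common way to
get that behaviour is to write the new page to a temporary file and rename it
over the old one only once it is complete. This is a sketch of that general
technique, not the actual generator code; regenerate_cached_page and the
render callback are hypothetical names.

```python
import os
import tempfile


def regenerate_cached_page(path, render):
    """Replace the cached file at path with render()'s output.

    If render() raises (for example because a login against LDAP fails),
    the existing cached file is left untouched and False is returned.
    """
    directory = os.path.dirname(os.path.abspath(path))
    try:
        html = render()
    except Exception:
        return False  # keep serving the old copy

    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem and swaps the file in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

With this pattern a failed regeneration leaves a stale page rather than a
missing one. mkstemp creates the file readable only by its owner, so a real
deployment would need to adjust permissions for the web server.
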
>>> --
>>> Kohsuke Kawaguchi
>>>
>>> _______________________________________________
>>> Jenkins-infra mailing list
>>> Jenkins-infra at lists.jenkins-ci.org
>>> http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
>>>
>>>
>>
>>
>> --
>> -----
>> Arnaud Héritier
>> http://aheritier.net
>> Mail/GTalk: aheritier AT gmail DOT com
>> Twitter/Skype : aheritier
>>
>
>
>
> --
> -----
> Arnaud Héritier
> http://aheritier.net
> Mail/GTalk: aheritier AT gmail DOT com
> Twitter/Skype : aheritier
>



-- 
-----
Arnaud Héritier
http://aheritier.net
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier