[Jenkins-infra] Confluence outage post mortem

Kohsuke Kawaguchi kk at kohsuke.org
Mon Apr 6 03:42:15 UTC 2015


I think Daniel (or maybe someone else) reported this afternoon that
Confluence was down.

I then discovered in Datadog that eggplant became inaccessible around
7:55am PT. This didn't trigger a PagerDuty alert because I had the
monitoring incorrectly set up to stay silent when no data arrives (I have
since fixed this problem.)
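
To illustrate the failure mode, here is a minimal sketch of a check that
treats missing data as its own state and pages on it. This is not the
actual Datadog configuration; the function names, the threshold, and the
600-second window are all hypothetical.

```python
from typing import Optional


def monitor_state(latest_value: Optional[float],
                  seconds_since_last_point: float,
                  threshold: float,
                  no_data_after: float = 600.0) -> str:
    """Classify a metric as OK, ALERT or NO_DATA (hypothetical helper)."""
    # A host that stops reporting must not look healthy: missing or stale
    # data is a distinct state rather than an implicit "OK".
    if latest_value is None or seconds_since_last_point > no_data_after:
        return "NO_DATA"
    return "ALERT" if latest_value > threshold else "OK"


def should_page(state: str, page_on_no_data: bool = True) -> bool:
    """Decide whether a state should page someone."""
    # With page_on_no_data=False, an unreachable host stays silent, which
    # is the kind of misconfiguration described above.
    return state == "ALERT" or (state == "NO_DATA" and page_on_no_data)
```

With these assumptions, should_page(monitor_state(None, 0.0, 90.0)) is
true, while passing page_on_no_data=False suppresses the page.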

eggplant was responding to ping, and SSH connections were accepted, but
the SSH handshake never completed. I'm not sure exactly what happened to
that box, but I filed an OSUOSL support ticket to reset the machine.
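
The "accepts connections but never handshakes" symptom can be told apart
from a plain outage with a small probe. The sketch below is only an
approximation: it checks whether the server sends its SSH identification
line (the first step of the handshake) after the TCP connection is
accepted. The function names and the timeout are hypothetical.

```python
import socket


def has_ssh_banner(data: bytes) -> bool:
    """True if any received line looks like an SSH identification string."""
    return any(line.startswith(b"SSH-") for line in data.splitlines())


def probe_ssh(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Return 'unreachable', 'accepted_no_banner' or 'ok'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or name resolution failed.
        return "unreachable"
    with sock:
        try:
            # The timeout given to create_connection also applies here.
            data = sock.recv(256)
        except OSError:
            # TCP connected, but the server never spoke.
            return "accepted_no_banner"
    return "ok" if has_ssh_banner(data) else "accepted_no_banner"
```

A result of "accepted_no_banner" would match what eggplant was doing: the
kernel still accepts the connection, but the SSH daemon never answers.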

Once the machine came back up, I noticed that the memory footprint of
Confluence was lower than the normal level, and it was just not writing as
much data as it normally does (thanks Datadog!) In the browser, responses
were indeed a bit slower, but I was still able to see pages OK.

I only realized much later that Confluence was actually not responding.
Instead, the caching layers were serving every request they could handle,
which included wiki pages and static resources, so the browser appeared
to be loading pages.
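
One way to avoid being fooled by a cache is to look at the response
headers. As a rough sketch, the standard HTTP Age header indicates that a
response was not freshly generated by the origin server for this request.
Whether the caching layers in front of Confluence emit this header is an
assumption, and the function name is hypothetical.

```python
from typing import Mapping


def likely_served_from_cache(headers: Mapping[str, str]) -> bool:
    """Heuristic: an Age header means a cache answered, not the origin."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return "age" in normalized
```

A health check that only passes when this returns false for a
non-cacheable URL would have flagged the outage much earlier.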

Confluence was not responding because somebody (probably Larry) had
installed the Tomcat manager app, and this was trying to verify its
plain-text LDAP connection to ldap.jenkins-ci.org, which was failing. We
disabled plain-text LDAP for security reasons a week or so ago, and I
didn't realize that this would prevent Confluence from starting, since it
didn't affect the already running Confluence instance.

-- 
Kohsuke Kawaguchi