[Jenkins-infra] Confluence instability post-mortem

Kohsuke Kawaguchi kk at kohsuke.org
Tue Mar 24 02:03:46 UTC 2015

A related problem here is that when lettuce was moved from old puppet
management to new puppet management, it must have been reinstalled from
scratch. And we lost nagios during this transition --- try hitting
http://nagios.jenkins-ci.org/ and you'll see it yourself.

So we are flying blind when it comes to monitoring, which means that when a
Confluence issue like this happens, we don't notice.

2015-03-23 18:59 GMT-07:00 Kohsuke Kawaguchi <kk at kohsuke.org>:

> As I was about to issue a security advisory, I've noticed that Confluence
> is acting up. It accepts inbound HTTP connections, but very slowly, and
> then even if it accepts connections, it fails to render HTML in a timely
> manner.
> Now I think I know what's going on.
> The way the system is put together is that there's Apache at the very
> front, and requests for the wiki are forwarded to nginx, which acts as the
> cache layer. On a cache miss, nginx forwards the request on to Tomcat,
> which runs Confluence.
> I think the root cause of the problem is that /srv/wiki/cache wasn't
> fully populated. I discovered this at the very end, and I still don't
> know why it was only partially populated, but this explains everything.
> Normally, the cache tier responds to most requests. But now that the cache
> is gone, Confluence takes far more load than usual. Unfortunately, Tomcat
> was configured to spin up as many as 200 request handling threads, yet it
> only had 15 DB connections in the pool. So almost all of the 200 request
> handling threads ended up competing for available database connections.
> This was quite visible in the thread dump.
> I've doubled the DB connection pool size to 30, as per this
> KB document
> <https://confluence.atlassian.com/display/CONFKB/Confluence+Slows+and+Times+Out+During+Periods+of+High+Load+Due+to+DB+Connection+Pool> (which
> had to be done outside Puppet as this file contains passwords and so
> cannot be managed in infra-puppet), and reduced the maximum number of
> request handling threads from 200 to 75. This way, even if Confluence sees
> increased load, it doesn't end up accepting more connections than it
> can serve.
> I've also kicked off regeneration of the static cache. Confluence CPU usage is
> down and the site is mostly snappy, and it'll get better as the static
> cache fills up.
> --
> Kohsuke Kawaguchi

Kohsuke Kawaguchi
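
The contention described in the quoted message (many request handling
threads sharing a small DB connection pool) can be illustrated with a rough
sketch. This is not Tomcat or Confluence code: the function name, the fixed
per-request query time, and the one-connection-per-request model are all
assumptions made for illustration. Only the numbers 200/15 and 75/30 come
from the message.

```python
import threading
import time


def simulate(request_threads, pool_size, query_seconds=0.05):
    """Hypothetical model: every request needs one DB connection for a fixed time."""
    pool = threading.BoundedSemaphore(pool_size)   # stands in for the DB connection pool
    start = threading.Barrier(request_threads)     # release all requests at once
    lock = threading.Lock()
    stats = {"in_use": 0, "peak_in_use": 0, "had_to_wait": 0}

    def handle_request():
        start.wait()
        # Try to take a connection without blocking; if none is free,
        # record that this request had to queue, then block until one is.
        if not pool.acquire(blocking=False):
            with lock:
                stats["had_to_wait"] += 1
            pool.acquire()
        with lock:
            stats["in_use"] += 1
            stats["peak_in_use"] = max(stats["peak_in_use"], stats["in_use"])
        time.sleep(query_seconds)                  # pretend to run a query
        with lock:
            stats["in_use"] -= 1
        pool.release()

    workers = [threading.Thread(target=handle_request) for _ in range(request_threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return stats


if __name__ == "__main__":
    print("200 threads, 15 connections:", simulate(200, 15))
    print(" 75 threads, 30 connections:", simulate(75, 30))
```

In this model the number of connections in use never exceeds the pool size,
so every request beyond that has to queue. Lowering the thread cap while
raising the pool size shrinks that queue, which is the intent of the change
described above.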