[Jenkins-infra] Confluence instability post-mortem

Kohsuke Kawaguchi kk at kohsuke.org
Tue Mar 24 04:44:17 UTC 2015


I've played a bit with Datadog, and eggplant (JIRA & Confluence) is now
monitored through Datadog with PagerDuty integration. I like it.

There are really only two servers that we want to monitor, cucumber and
eggplant, and that costs just $30/month. I think it's a good use of our
money to help the infra work. Any thoughts?

If anyone else wants to see the dashboard, I can add you to the "Jenkins"
org.

2015-03-23 19:03 GMT-07:00 Kohsuke Kawaguchi <kk at kohsuke.org>:

> A related problem here is that when lettuce was moved from the old Puppet
> management to the new Puppet management, it must have been reinstalled
> from scratch. We also lost Nagios during this transition: try hitting
> http://nagios.jenkins-ci.org/ and you'll see for yourself.
>
> So we are flying blind when it comes to monitoring, which means that when
> a Confluence issue like this happens, we don't notice.
>
> 2015-03-23 18:59 GMT-07:00 Kohsuke Kawaguchi <kk at kohsuke.org>:
>
>> As I was about to issue a security advisory, I noticed that Confluence
>> is acting up. It accepts inbound HTTP connections, but very slowly, and
>> even once a connection is accepted, it fails to render HTML in a timely
>> manner.
>>
>> Now I think I know what's going on.
>>
>> The way the system is put together, Apache sits at the very front, and
>> requests for the wiki are forwarded to nginx, which acts as the cache
>> layer. On a cache miss, nginx forwards the request on to Tomcat,
>> which runs Confluence.
>>
>> I think the root cause of the problem is that /srv/wiki/cache wasn't
>> fully populated. I discovered this only at the very end, and I still
>> don't know why it was only partially populated, but it explains
>> everything.
>>
>> Normally, the cache tier responds to most requests. But now that the cache
>> is gone, Confluence takes far more load than usual. Unfortunately, Tomcat
>> was configured to spin up to 200 request handling threads, yet it only had
>> 15 DB connections in the pool. So almost all of the 200 request handling
>> threads ended up competing for the available database connections. This
>> was quite visible in the thread dump.
>>
>> I've doubled the DB connection pool size to 30 as per this
>> KB document
>> <https://confluence.atlassian.com/display/CONFKB/Confluence+Slows+and+Times+Out+During+Periods+of+High+Load+Due+to+DB+Connection+Pool> (which
>> had to be done outside Puppet, as this file contains passwords and so
>> cannot be managed in infra-puppet), and reduced the maximum number of
>> request handling threads from 200 to 75. This way, even if Confluence
>> sees increased load, it doesn't end up accepting far more connections
>> than it can serve.
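
To make the pool arithmetic above concrete, here is a rough sketch of the
worst case before and after the change. It assumes, as a simplification,
that every busy request thread needs exactly one DB connection at the same
time; the function name is hypothetical and not part of any Tomcat or
Confluence configuration.

```python
def max_waiting_threads(request_threads: int, db_connections: int) -> int:
    """Worst-case number of request threads blocked waiting for a DB
    connection, assuming every thread wants one connection at once
    (a simplifying assumption, not a measured figure)."""
    return max(0, request_threads - db_connections)


# Figures taken from the message: 200 threads / 15 connections before,
# 75 threads / 30 connections after.
before = max_waiting_threads(200, 15)
after = max_waiting_threads(75, 30)

print(f"before: up to {before} threads waiting on the pool")
print(f"after:  up to {after} threads waiting on the pool")
```

Under that assumption the worst case drops from 185 waiting threads to 45.
Real contention depends on how long each request actually holds a
connection, so treat this as an upper bound only.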
>>
>> I've also kicked off regeneration of the static cache. Confluence CPU
>> usage is down and the site is mostly snappy, and it'll get better as the
>> static cache fills up.
>>
>>
>> --
>> Kohsuke Kawaguchi
>>
>
>
>
> --
> Kohsuke Kawaguchi
>



-- 
Kohsuke Kawaguchi