[Jenkins-infra] "What the hell was up with Confluence today?"
R. Tyler Croy
tyler at monkeypox.org
Wed Apr 5 23:52:42 UTC 2017
olblak and I received a "shit ton" (my words) of Confluence alerts, along with
a few from JIRA. Our usual procedure for figuring out what the hell is wrong
with Confluence is to kick the damned thing over a few times, which olblak did,
but nothing was fixed :(
Our hypothesis was that either: (a) we were getting hit by lots more traffic,
or (b) some remote resource that Confluence was dependent on was taking longer.
Sidenote: building microservices is apparently super hard for everybody,
including Atlassian. Dependent services should never take down an app, because
the app should have been coded with timeouts and fault tolerance! All services fail!
</rant>
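To make the timeout point concrete, here is a minimal Python sketch. It is not
anything Confluence actually does: call_with_timeout, slow_db_query and the
"stale" fallback value are names made up for illustration. The idea is only
that a slow dependency should cost the caller a bounded wait and a degraded
answer, not a hung worker.

```python
import concurrent.futures
import time

# Shared pool for calls to the (hypothetical) dependent service.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread; return fallback if it is too slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting here. The worker thread keeps running, so
        # real code should also set a timeout on the underlying connection.
        return fallback
    except Exception:
        # Any other failure in the dependency degrades to the fallback too.
        return fallback


def slow_db_query():
    """Stand-in for a backend that has become slow (hypothetical)."""
    time.sleep(0.3)
    return "fresh"
```

With this, call_with_timeout(slow_db_query, 0.05, "stale") gives up after about
50 ms and returns the fallback, while a fast call returns its own result.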
Whenever you're responding to alerts, Datadog is your friend! I noticed, when
looking at host:lettuce in Datadog, that the "Apache - Hits rate" hadn't
changed significantly, but the CPU and "Apache - Workers" metrics both spiked
hard today. That eliminated, for me, option (a).
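The reasoning behind ruling out (a) is roughly Little's law: the average number
of busy workers is about the request rate times the average time each request
takes. A small Python sketch with made-up numbers (the 20 hits/s and both
latencies are hypothetical, not values read from the Datadog graphs):

```python
def busy_workers(hits_per_second, seconds_per_request):
    """Little's law: average requests in flight = arrival rate * time in system."""
    return hits_per_second * seconds_per_request


# Hypothetical numbers, for illustration only.
normal = busy_workers(hits_per_second=20, seconds_per_request=0.5)
slow_backend = busy_workers(hits_per_second=20, seconds_per_request=5.0)
print(normal, slow_backend)
```

With the hit rate held constant, the only way for the worker count to climb is
for each request to take longer, which points at something downstream.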
I verified that LDAP wasn't the culprit, since it sometimes is, which led me to
believe that the DB tier at the OSUOSL might be at fault. pwnguin from
OSUOSL was able to confirm and rectify the situation in #osuosl:
19:16 < rtyler> pwnguin: is it really slower today?
19:16 < rtyler> been hunting down why confluence has been crap all morning
19:24 <@pwnguin> load of 40 on nearly all the nodes
19:26 < rtyler> O_O
19:27 < rtyler> I think that might explain our issues then
19:33 < rtyler> pwnguin: will you ping once the load subsides, I expect to see confluence cool down if that's our issue
19:34 <@pwnguin> almost done
19:43 <@pwnguin> rtyler: so, it seems fixed now, but i'm not yet sure what the pattern is that reveals hosts affected vs not
19:44 < rtyler> okie will keep an eye out
19:51 <@pwnguin> rtyler: huh, looks like we also saw a load alert on mysql1
23:09 < rtyler> pwnguin: based on my metrics in datadog, things look like they suddenly disappeared at about 15:00 PST
23:09 < rtyler> does that match any changes on your end?
23:09 < rtyler> as in, load fell off a cliff
23:31 <@pwnguin> rtyler: https://munin.osuosl.org/osuosl.bak/gprod3.osuosl.bak/cpu.html <- like that?
23:35 < rtyler> haha, yeah
23:36 < rtyler> what was up today?
23:37 <@pwnguin> yesterday, i pushed out a change to cfengine to add a logrotate for ganeti
23:37 <@pwnguin> the cfengine change was missing a parens, and thus broken.
23:37 < rtyler> O_O
23:38 <@pwnguin> on our CI that led to a failure. on our ganeti nodes, that led to the cron job spinlock'd
23:38 <@pwnguin> so every hour, another cfengine process was running and consuming another core
23:39 <@pwnguin> wasn't till today that load was high enough to affect you I guess
23:40 < rtyler> wow
23:40 <@pwnguin> i pushed out the missing paren like five minutes after we saw it broke in CI, but didn't see any alerts warranting we check in on the nodes.
23:41 <@pwnguin> so i figured things were fine till i saw that nearly all the nodes had high load, and saw that cfengine processes were taking multiple cores
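The pile-up pwnguin describes, where every hourly run adds another stuck
process, is the sort of thing a non-blocking lock around a scheduled job guards
against. The Python sketch below is an illustration only, not what OSUOSL or
cfengine does: try_acquire_lock, run_once and the lock path are hypothetical
names, and it relies on flock(2), so it assumes a Unix-like host.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes this fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(path, job):
    """Run job() only if no other run holds the lock; report whether it ran."""
    lock = try_acquire_lock(path)
    if lock is None:
        return False  # a previous run is still going (or stuck): skip this one
    try:
        job()
        return True
    finally:
        lock.close()  # closing the file releases the flock


if __name__ == "__main__":
    # Hypothetical lock location for the demo.
    lock_path = os.path.join(tempfile.mkdtemp(), "hourly-job.lock")
    print(run_once(lock_path, lambda: None))
```

This does not fix a job that spins, but it caps the damage at one busy core
instead of one more per hour.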
Everything with Confluence looks normal now that the backend DB is responding
in a timely fashion. olblak and I have some items we identified today to make
sure Confluence uses all the resources available to it, but unfortunately this
time it looks like this was out of our hands :(
Cheers
- R. Tyler Croy
------------------------------------------------------
Code: <https://github.com/rtyler>
Chatter: <https://twitter.com/agentdero>
xmpp: rtyler at jabber.org
% gpg --keyserver keys.gnupg.net --recv-key 1426C7DC3F51E16F
------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lettuce-apacheworkers.png
Type: image/png
Size: 21852 bytes
Desc: not available
URL: <http://lists.jenkins-ci.org/pipermail/jenkins-infra/attachments/20170405/75120a61/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lettuce-cpu.png
Type: image/png
Size: 19838 bytes
Desc: not available
URL: <http://lists.jenkins-ci.org/pipermail/jenkins-infra/attachments/20170405/75120a61/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lettuce-hits.png
Type: image/png
Size: 18570 bytes
Desc: not available
URL: <http://lists.jenkins-ci.org/pipermail/jenkins-infra/attachments/20170405/75120a61/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ganeti-load.png
Type: image/png
Size: 37536 bytes
Desc: not available
URL: <http://lists.jenkins-ci.org/pipermail/jenkins-infra/attachments/20170405/75120a61/attachment-0007.png>