[Jenkins-infra] "What the hell was up with Confluence today?"

R. Tyler Croy tyler at monkeypox.org
Wed Apr 5 23:52:42 UTC 2017

olblak and I received a "shit ton" (my words) of Confluence alerts, along with
a few from JIRA. Our usual procedure for figuring out what the hell is wrong
with Confluence is to kick the damned thing over a few times, which olblak did,
but nothing was fixed :(

Our hypothesis was that either: (a) we were getting hit by lots more traffic,
or (b) some remote resource that Confluence was dependent on was taking longer.

Sidenote: building microservices is apparently super hard for everybody,
including Atlassian. A service you depend on should never take down your app,
because the app should have been coded with timeouts and fault tolerance! All
services fail!
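To make the sidenote concrete, here is a minimal sketch of wrapping a call to a dependency in a timeout with a fallback value. It is not how Confluence works internally; `call_with_timeout`, `slow_dependency`, and the "cached data" fallback are all hypothetical names made up for illustration.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; give up and return fallback after timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # dependency was too slow
    except Exception:
        return fallback  # dependency failed outright
    finally:
        # Don't block the caller waiting for the slow call to finish.
        pool.shutdown(wait=False)


def slow_dependency():
    """Hypothetical stand-in for a backend (DB, LDAP, ...) having a bad day."""
    time.sleep(0.5)
    return "fresh data"


# The dependency takes 0.5s but we only wait 0.05s, so we get the fallback.
result = call_with_timeout(slow_dependency, timeout_s=0.05, fallback="cached data")
```

Note that the abandoned call keeps running in its thread until it finishes; the point is only that the caller stops waiting on it.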

Whenever you're responding to alerts, Datadog is your friend! I noticed, when
looking at host:lettuce in Datadog, that the "Apache - Hits rate" hadn't
changed significantly, but the CPU and "Apache - Workers" metrics both spiked
hard today. That eliminated, for me, option (a).
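The reasoning here is roughly Little's law: the number of busy workers is about the request rate times how long each request takes. If the hit rate is flat and the busy workers climb, each request must be taking longer. A back-of-the-envelope sketch, with made-up numbers that are not measurements from lettuce:

```python
def busy_workers(hits_per_sec, avg_response_s):
    """Little's law: requests in flight = arrival rate * time each one takes."""
    return hits_per_sec * avg_response_s


# Same (hypothetical) traffic in both cases; only the response time changes.
normal = busy_workers(hits_per_sec=40, avg_response_s=0.25)
degraded = busy_workers(hits_per_sec=40, avg_response_s=2.0)

print(f"normal: {normal:.0f} busy workers, degraded: {degraded:.0f}")
```

So a slow backend alone, with no extra traffic, is enough to exhaust a worker pool.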

I verified that LDAP wasn't the culprit, since it sometimes is, which led me to
believe that the DB tier at the OSUOSL might be. pwnguin from OSUOSL was able
to confirm and rectify the situation in #osuosl:

    19:16 < rtyler> pwnguin: is it really slower today?
    19:16 < rtyler> been hunting down why confluence has been crap all morning
    19:24 <@pwnguin> load of 40 on nearly all the nodes
    19:26 < rtyler> O_O
    19:27 < rtyler> I think that might explain our issues then
    19:33 < rtyler> pwnguin: will you ping once the load subsides, I expect to see confluence cool down if that's our issue
    19:34 <@pwnguin> almost done
    19:43 <@pwnguin> rtyler: so, it seems fixed  now, but i'm not yet sure what the pattern is that reveals hosts affected vs not
    19:44 < rtyler> okie will keep an eye out
    19:51 <@pwnguin> rtyler: huh, looks like we also saw a load alert on mysql1
    23:09 < rtyler> pwnguin: based on my metrics in datadog, things look like they suddenly disappearead at about 15:00 PST
    23:09 < rtyler> does that match any changes on your end?
    23:09 < rtyler> as in, load fell off a cliff
    23:31 <@pwnguin> rtyler: https://munin.osuosl.org/osuosl.bak/gprod3.osuosl.bak/cpu.html <- like that?
    23:35 < rtyler> haha, yeah 
    23:36 < rtyler> what was up today?
    23:37 <@pwnguin> yesterday, i pushed out a change to cfengine to add a logrotate for ganeti
    23:37 <@pwnguin> the cfengine change was missing a parens, and thus broken.
    23:37 < rtyler> O_O
    23:38 <@pwnguin> on our CI that led to a failure. on our ganeti nodes, that led to the cron job spinlock'd
    23:38 <@pwnguin> so every hour, another cfengine process was running and consuming another core
    23:39 <@pwnguin> wasn't till today that load was high enough to affect you I guess
    23:40 < rtyler> wow
    23:40 <@pwnguin> i pushed out the missing paren like five minutes after we saw it broke in CI, but didn't see any alerts warranting we check in on the nodes.
    23:41 <@pwnguin> so i figured things were fine till i saw that nearly all the nodes had high load, and saw that cfengine processes were taking multiple cores
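The failure mode pwnguin describes, where each hourly run starts while the previous ones are still stuck, is commonly guarded against by having the job take a non-blocking lock and exit if it can't. This is not what OSUOSL did (their fix was the missing paren); it is only a sketch of that general guard. It assumes a Unix host, and `try_acquire` and the lock file name are hypothetical.

```python
import fcntl
import os
import tempfile


def try_acquire(lock_path):
    """Return an open file holding an exclusive lock, or None if it's taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file in a fresh temp directory.
lock_path = os.path.join(tempfile.mkdtemp(), "hourly-job.lock")

first = try_acquire(lock_path)   # this hour's run gets the lock
second = try_acquire(lock_path)  # next hour's run, started while the first is stuck

if second is None:
    print("previous run still holds the lock; skipping this run")
```

With a guard like this, a wedged run costs one core instead of one more core every hour.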

Everything with Confluence looks normal now that the backend DB is responding
in a timely fashion. olblak and I identified some items today to make sure
Confluence uses all the resources available to it, but unfortunately it looks
like this one was out of our hands :(

- R. Tyler Croy

