[Jenkins-infra] Confluence, Nginx and 99 reasons Docker hates me: a report.
R. Tyler Croy
tyler at monkeypox.org
Mon Jan 18 03:43:21 UTC 2016
We had some wiki availability issues today that were partially my fault and
partially related to trying to bring "build13" of the docker confluence image
online.
KK made the change earlier last week to disable LDAP caching but for some
reason Docker wasn't pulling the new container properly. This is what I set out
to fix about 6 hours ago.
First, I discovered that newer versions of Docker had no problem pulling the
docker container and we did not have consistent versions of Docker installed
across our machines (1.5.0, 1.7.0 and 1.9.1 by my survey). With this commit
I ensured that we would have 1.9.1 consistently installed. This required some
changes to the forked version of garethr-docker puppet module we use since it's
been changed quite a bit to accommodate newer options in later Docker versions.
COOL, surely that must have been the end of my day.
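A version survey like the one above can be sketched in a few lines. This is
only an illustration, not the tooling actually used: the host names and the
`find_version_drift` helper are hypothetical, and in practice the version
strings would come from running `docker --version` on each machine.

```python
def find_version_drift(versions_by_host, target):
    """Return hosts whose Docker version differs from the target, sorted by name."""
    return sorted(
        host for host, version in versions_by_host.items() if version != target
    )


# Hypothetical survey results; the host names are placeholders.
survey = {
    "host-a": "1.5.0",
    "host-b": "1.7.0",
    "host-c": "1.9.1",
}

drifted = find_version_drift(survey, target="1.9.1")
print(drifted)
```

Anything the helper returns is a machine that still needs the pinned version
rolled out.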
Second, after rolling out the Docker changes the wiki became unavailable.
Investigation led to two problems, one I have seen before with Docker a few
times already in our infrastructure: stale IPTables routing rules. When Docker
sets up its networking it will install some rules into a couple chains in the
`filter` and `nat` tables. Periodically it has failed to clean up these rules,
leading to requests not being routed between confluence-cache and confluence
containers. The second problem I identified was that there was an internal IP
address hard-coded for the confluence-cache container, which no longer existed,
so naturally it wasn't finding the right confluence container. I addressed
*that* with this change.
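For the stale-rule problem, one rough way to spot leftovers is to compare the
DNAT destinations in the `nat` table against the IPs of containers that are
actually running. The sketch below assumes rules in `iptables-save` format and
a set of live container IPs gathered separately (for example via
`docker inspect`). The sample rules, addresses, and the
`find_stale_dnat_rules` helper are all made up for illustration.

```python
import re

# Matches the destination IP in a DNAT rule as printed by `iptables-save`.
DNAT_DEST = re.compile(r"--to-destination (\d+\.\d+\.\d+\.\d+)")


def find_stale_dnat_rules(rules_text, live_ips):
    """Return DNAT rules whose destination IP is not a running container's IP."""
    stale = []
    for line in rules_text.splitlines():
        match = DNAT_DEST.search(line)
        if match and match.group(1) not in live_ips:
            stale.append(line.strip())
    return stale


# Made-up sample in `iptables-save -t nat` format; the addresses are placeholders.
sample_rules = """\
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:8080
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 8081 -j DNAT --to-destination 172.17.0.9:8081
"""

stale = find_stale_dnat_rules(sample_rules, live_ips={"172.17.0.2"})
for rule in stale:
    print(rule)
```

This only reports candidates. Whether a rule is safe to delete still needs a
human look, since not every DNAT rule on a host belongs to Docker.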
While debugging this, I noticed another cute behavior of docker with its named
containers support. Since we name our containers (e.g. `confluence`), the
docker daemon will actually persist the tag and some of the options passed into
the `docker run` invocation. I.e. `docker run -e SOME=foo --name bleepbloop rtyler/myimage`
would persist the environment variable options (SOME=foo) until I stopped and
removed the container (e.g. `docker rm bleepbloop`).
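Put differently, getting new options onto a named container means stopping it,
removing it, and running it again. A minimal sketch of that sequence, reusing
the example names above, follows. The `recreate_commands` helper is
hypothetical and only builds the command lines; it does not execute anything.

```python
def recreate_commands(name, image, env=None):
    """Build the argv lists to stop, remove, and re-run a named container."""
    run = ["docker", "run"]
    # Sort the variables so the generated command line is deterministic.
    for key, value in sorted((env or {}).items()):
        run += ["-e", f"{key}={value}"]
    run += ["--name", name, image]
    return [
        ["docker", "stop", name],
        ["docker", "rm", name],
        run,
    ]


for argv in recreate_commands("bleepbloop", "rtyler/myimage", {"SOME": "foo"}):
    print(" ".join(argv))
```

Each list could be handed to something like `subprocess.run` on the Docker
host, in order.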
To remedy this, I nuked all the previous incantations of named containers from
the host running confluence. That finished, I could FINALLY run `build13` of
the confluence container which had the LDAP cache setting change that KK made
earlier. Bringing that up I discovered another issue...
Third, lots of spammers and bots are regularly hitting the wiki, which I
suspected was keeping confluence from coming online and staying online, so I
made this commit to deny those bots at the Apache proxy level (refresher,
requests go: Apache (ssl termination) -> Nginx (cache) -> Confluence).
All that said and done, it still does not appear that the current configuration
of Confluence can sustain the traffic levels without LDAP caching enabled, so I
unfortunately have pinned things back down to `build7`.
You may be asking yourself at this point of the email: "why is he writing all
this out?" Welp, this is effectively what I spent my Sunday doing, and it would
be a shame if nobody but me learned from this colossal waste of time. :)
Anywho, that's that. Confluence is back online, and I'm probably not going to
touch it for at least a few days, lest I go crazy.
- R. Tyler Croy
% gpg --keyserver keys.gnupg.net --recv-key 3F51E16F