[Jenkins-infra] Confluence, Nginx and 99 reasons Docker hates me: a report.

R. Tyler Croy tyler at monkeypox.org
Tue Jan 19 15:31:43 UTC 2016


(replies inline)

On Tue, 19 Jan 2016, Arnaud Héritier wrote:

> The bot was fixed/restarted?

Yes, the bot was restarted as far as I can tell. That means KK probably did it
right after he got home from his skiing trip last night :)



> 
> On Mon, Jan 18, 2016 at 6:43 PM, Larry Shatzer, Jr. <larrys at gmail.com>
> wrote:
> 
> > We need to get that running on a box more people have access to so we can
> > kick it when it does this. It is becoming more critical to keep spam away.
> > I do not want to spend all my day off deleting spam.
> >
> > On Mon, Jan 18, 2016 at 8:44 AM, R. Tyler Croy <tyler at monkeypox.org>
> > wrote:
> >
> >> (replies inline)
> >>
> >> On Mon, 18 Jan 2016, Larry Shatzer, Jr. wrote:
> >>
> >> > And now it appears the wiki spam bot is dead, since there are probably
> >> > around a thousand spam pages, all of which it would normally have killed.
> >>
> >>
> >> Due to access constraints, I didn't touch the backend spam bot at all, for
> >> what it's worth. It may have cratered due to the connectivity issues,
> >> though. I'll ping KK about it as soon as I see him online.
> >>
> >>
> >> > On Sun, Jan 17, 2016 at 8:43 PM, R. Tyler Croy <tyler at monkeypox.org>
> >> > wrote:
> >> >
> >> > >
> >> > > We had some wiki availability issues today, which were partially my fault
> >> > > and partially related to trying to bring "build13" of the Docker
> >> > > Confluence image into production.
> >> > >
> >> > > KK made the change earlier last week to disable LDAP caching, but for
> >> > > some reason Docker wasn't pulling the new image properly. This is what I
> >> > > set out to fix about 6 hours ago.
> >> > >
> >> > >
> >> > > First, I discovered that newer versions of Docker had no problem pulling
> >> > > the image, and that we did not have consistent versions of Docker
> >> > > installed across our machines (1.5.0, 1.7.0 and 1.9.1 by my survey).
> >> > > With this commit[1] I ensured that we would have 1.9.1 consistently
> >> > > installed. This required some changes to our fork of the garethr-docker
> >> > > Puppet module, since the module has changed quite a bit to accommodate
> >> > > newer options in later Docker versions.
> >> > >
> >> > >
> >> > > COOL, surely that must have been the end of my day.
> >> > >
> >> > >
> >> > > Second, after rolling out the Docker changes, the wiki became
> >> > > unavailable. Investigation led to two problems. The first is one I have
> >> > > already seen a few times with Docker in our infrastructure: stale
> >> > > iptables routing rules. When Docker sets up its networking it installs
> >> > > rules into a couple of chains in the `filter` and `nat` tables, and
> >> > > periodically it has failed to clean up these rules, so requests were not
> >> > > routed between the confluence-cache and confluence containers. The
> >> > > second problem was that the confluence-cache container had a hard-coded
> >> > > internal IP address that no longer existed, so naturally it wasn't
> >> > > finding the right confluence container. I addressed *that* with this[2]
> >> > > change.
> >> > >
> >> > >
> >> > > While debugging this, I noticed another cute behavior of Docker's named
> >> > > container support. Since we name our containers (e.g. `confluence`),
> >> > > the Docker daemon will actually persist the tag and some of the options
> >> > > passed to the `docker run` invocation. For example,
> >> > > `docker run -e SOME=foo --name bleepbloop rtyler/myimage` would persist
> >> > > the environment variable option (SOME=foo) until I stopped and removed
> >> > > the container (e.g. `docker rm bleepbloop`).
> >> > >
> >> > > To remedy this, I nuked all the previous incantations of named containers
> >> > > from the host running Confluence. With that finished, I could FINALLY
> >> > > run `build13` of the confluence container, which had the LDAP cache
> >> > > setting change that KK made earlier. Bringing that up, I discovered
> >> > > another issue...
> >> > >
> >> > >
> >> > > Third, lots of spammers and bots regularly hit the wiki, which I
> >> > > suspected was keeping Confluence from coming online and staying online,
> >> > > so I made this commit[3] to deny those bots at the Apache proxy level
> >> > > (refresher: requests go Apache (SSL termination) -> Nginx (cache) ->
> >> > > Confluence).
> >> > >
> >> > >
> >> > > All that said and done, it still does not appear that the current
> >> > > configuration of Confluence can sustain the traffic levels without LDAP
> >> > > caching enabled, so I unfortunately have pinned things back down to
> >> > > `build7`.
> >> > >
> >> > >
> >> > >
> >> > > You may be asking yourself at this point in the email: "Why is he
> >> > > writing all this out?" Welp, this is effectively what I spent my Sunday
> >> > > doing, and it would be a shame if nobody but me learned from this
> >> > > colossal waste of time. :)
> >> > >
> >> > >
> >> > > Anywho, that's that. Confluence is back online, and I'm probably not
> >> > > going to touch it for at least a few days, lest I go crazy.
> >> > >
> >> > >
> >> > > [1] https://github.com/jenkins-infra/jenkins-infra/commit/0107e79b0aa7b5bd9acd3d4d6b268c4178331beb
> >> > > [2] https://github.com/jenkins-infra/jenkins-infra/commit/f95c0e67803e9129c54a3f7fe8fce2940f7ad874
> >> > > [3] https://github.com/jenkins-infra/jenkins-infra/commit/675e4bdfc7bdd96b34046dc872f73f7f514e4e49
> >> > >
> >> > >
> >> > > Cheers
> >> > > - R. Tyler Croy
> >> > >
> >> > > ------------------------------------------------------
> >> > >      Code: <https://github.com/rtyler>
> >> > >   Chatter: <https://twitter.com/agentdero>
> >> > >
> >> > >   % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
> >> > > ------------------------------------------------------
> >> > >
> >> > > _______________________________________________
> >> > > Jenkins-infra mailing list
> >> > > Jenkins-infra at lists.jenkins-ci.org
> >> > > http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
> >> > >
> >> > >
> >>
> >> - R. Tyler Croy
> >>
> >
> >
> 
> 
> -- 
> -----
> Arnaud Héritier
> http://aheritier.net
> Mail/GTalk: aheritier AT gmail DOT com
> Twitter/Skype : aheritier

- R. Tyler Croy
