[Jenkins-infra] So what the hell happened this morning?

Arnaud Héritier aheritier at gmail.com
Tue Apr 26 23:55:52 UTC 2016


Nice summary, Tyler.
Thanks a lot for your help.
Another point: when something goes wrong and only a few of us are online, it
is difficult to work on the issue and keep our community informed at the
same time.
Could we add publishing a status page to our backlog?
I think a service like https://www.statuspage.io/pricing is too
expensive for us, but maybe something simple and homemade would do?
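
To make the "homemade" idea concrete, here is a minimal sketch and not a
finished design. It assumes we only want an UP/DOWN line per service, probed
over HTTPS, and rendered to a static HTML page. The hostnames come from
Tyler's mail; the https:// scheme, the choice of services, and the function
names (probe, render, build_page) are all assumptions made for illustration.

```python
import html
import urllib.request
from datetime import datetime, timezone

# Hypothetical service list: hostnames are from the thread, the https://
# scheme and the selection of services to probe are assumptions.
SERVICES = {
    "mirrors": "https://mirrors.jenkins.io/",
    "updates": "https://updates.jenkins.io/",
    "pkg": "https://pkg.jenkins.io/",
}


def probe(url, timeout=5.0):
    """Return True if the URL answers without an HTTP error before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError/HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def render(results, now=None):
    """Render a {service name: is_up} mapping as a small static HTML page."""
    now = now or datetime.now(timezone.utc)
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(
            html.escape(name), "UP" if ok else "DOWN"
        )
        for name, ok in sorted(results.items())
    )
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>Infra status</title></head><body>\n"
        "<h1>Infra status</h1>\n"
        "<p>Checked at {} UTC</p>\n"
        "<table>\n{}\n</table>\n"
        "</body></html>\n"
    ).format(now.strftime("%Y-%m-%d %H:%M:%S"), rows)


def build_page(services=None, check=probe):
    """Probe every service and return the rendered page as a string."""
    services = SERVICES if services is None else services
    return render({name: check(url) for name, url in services.items()})
```

The string returned by build_page() could be written to a file from cron and
served by any static host. If we did this, the page would probably have to
live outside the infrastructure it reports on, otherwise it goes down
together with everything else.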



On Tue, Apr 26, 2016 at 11:44 PM, R. Tyler Croy <tyler at monkeypox.org> wrote:

>
> First off, let me tell you, there is nothing I love more than 2am calls from
> Daniel and Arnaud :)
>
> Anyways, with the potential infra compromise which is blogged about here[0],
> we migrated a *lot* of infrastructure in very short order to AWS. We're still
> not entirely finished with that process, but as part of it LDAP, Rating, CI
> and Mirrorbrain/Updates were moved into AWS.
>
> Mirrorbrain, as you might know, powers {mirrors,updates,pkg}.jenkins.io and
> uses a PostgreSQL database to keep track of which files are located on which
> mirrors.
>
> During the migration to AWS, we adopted AWS RDS for our PostgreSQL server.
>
> Last night, with an uptick in traffic hitting the jenkins-mirrorbrain host
> (home to {mirrors,updates,pkg}.jenkins.io), AWS RDS experienced a total
> network connection loss between the EC2 instance and the RDS instance[1]. The
> behavior was, as you can see in [1], that all DB connections from the EC2
> instance to the RDS instance flat-lined. The host was unable to reconnect to
> PostgreSQL until it was stopped and started (not even a reboot worked). This
> caused requests to stack up in Apache on jenkins-mirrorbrain, which also
> slowed Confluence down, since the plugin-info macro periodically requests
> updates.jenkins-ci.org/latest/update-center.json and was unable to retrieve
> it successfully.
>
> Unfortunately this "super fucked up behavior" (my words) may be a "known
> issue"[2] that others have experienced with AWS RDS (pgsql). The work-around,
> which was manually deployed, was to spin up a read replica and point
> Apache/MirrorBrain to that instead of the write-master instance. This means
> Apache is hitting a read replica, and the mirrorbrain scripts for scanning
> mirrors are hitting the write-master (ticket filed to puppetize this
> here[3]).
>
> This seems to prevent the issue from manifesting itself, but I'm definitely
> not feeling too confident about that hack.
>
>
> Another issue that was noticed while dealing with all of this is that I
> neglected to provide Elastic IPs for many of the hosts I provisioned in AWS
> last week. This means we're going to have to go back through later, assign
> Elastic IPs, and migrate many of our DNS records, which is not going to be
> pleasant.
>
>
> I have also created a private runbooks[4] repository on GitHub where we can
> migrate runbooks and infra documentation from the protected space on
> Confluence, which will allow us to follow our runbooks for Confluence being
> down when Confluence is down :)
>
>
> Moving forward we're going to need to figure out a more highly available
> distribution process. The unfortunate nature of jenkins-mirrorbrain is that
> it must have a very large file tree under /srv/releases to properly serve and
> sync with other mirrors, and potentially distributing that across N
> webservers to get some redundancy would dramatically complicate our current
> release process.
>
>
> So there are the details. Today kind of sucked, but we recovered in a couple
> of hours from a fairly absurd bug in AWS. Yay us.
>
>
> [0] https://jenkins.io/blog/2016/04/22/possible-infra-compromise/
> [1] http://i.imgur.com/lGGmRGe.png
> [2] https://twitter.com/rhoml/status/724952449199403008
> [3] https://issues.jenkins-ci.org/browse/INFRA-669
> [4] https://github.com/jenkins-infra/runbooks
>
>
> - R. Tyler Croy
>
> ------------------------------------------------------
>      Code: <https://github.com/rtyler>
>   Chatter: <https://twitter.com/agentdero>
>
>   % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
> ------------------------------------------------------


-- 
-----
Arnaud Héritier
http://aheritier.net
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier