[Jenkins-infra] So what the hell happened this morning?
R. Tyler Croy
tyler at monkeypox.org
Tue Apr 26 21:44:10 UTC 2016
First off, let me tell you, there is nothing I love more than 2am calls from
Daniel and Arnaud :)
Anyways, with the potential infra compromise which is blogged about here[0],
we migrated a *lot* of infrastructure in very short order to AWS. We're still
not entirely finished with that process, but as part of it LDAP, Rating, CI,
and MirrorBrain/Updates were moved into AWS.
MirrorBrain, as you might know, powers {mirrors,updates,pkg}.jenkins.io and uses
a PostgreSQL database to keep track of which files are located on which
mirrors.
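(For the curious: that tracking boils down to a table of file paths, each with
an array of mirror ids pointing into a table of mirrors. Something like the
query below, which is a sketch from memory; the exact table/column names and
the example path may not match what our MirrorBrain version actually uses:)

    % psql -U mirrorbrain -d mirrorbrain -c \
        "SELECT s.identifier, s.baseurl
           FROM filearr f
           JOIN server s ON s.id = ANY(f.mirrors)
          WHERE f.path = 'war/2.0/jenkins.war';"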
During the migration to AWS, we adopted AWS RDS for our PostgreSQL server.
Last night, during an uptick in traffic hitting the jenkins-mirrorbrain host
(home to {mirrors,updates,pkg}.jenkins.io), AWS RDS experienced a total network
connection loss between the EC2 instance and the RDS instance[1]. The behavior
was, as you can see in [1], that all DB connections flat-lined from the EC2
instance to the RDS instance. The host was unable to reconnect to PostgreSQL
until it was stopped and started (not even a reboot worked). This caused
requests to stack up in Apache on jenkins-mirrorbrain, which also slowed
Confluence down, since the plugin-info macro periodically requests
updates.jenkins-ci.org/latest/update-center.json and was unable to retrieve
it successfully.
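(If you want to sanity-check this sort of thing yourself: opening a fresh
connection to the RDS endpoint from the EC2 host is enough to tell. During the
outage an attempt like the following, with a made-up hostname here, would
simply hang or fail:)

    % psql -h mirrorbrain.abcdefgh.us-east-1.rds.amazonaws.com \
        -U mirrorbrain -d mirrorbrain -c 'SELECT 1;'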
Unfortunately this "super fucked up behavior" (my words) may be a "known
issue"[2] that others have experienced with AWS RDS (pgsql). The work-around,
which was manually deployed, was to spin up a read replica and point
Apache/MirrorBrain to that instead of the write-master instance. This means
Apache is hitting a read-replica, and the mirrorbrain scripts for scanning
mirrors are hitting the write-master (filed to puppetize here[3]).
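(To make the hack concrete, it looks roughly like this. The instance
identifiers and hostnames below are made up, and the config snippets are from
memory, so don't treat the exact directives as gospel:)

    # Spin up a read replica of the write-master in RDS
    % aws rds create-db-instance-read-replica \
        --db-instance-identifier mirrorbrain-ro \
        --source-db-instance-identifier mirrorbrain

    # Apache (mod_dbd, which mod_mirrorbrain queries) gets pointed at
    # the replica, i.e. in the vhost config:
    #   DBDriver pgsql
    #   DBDParams "host=mirrorbrain-ro.abcdefgh.us-east-1.rds.amazonaws.com
    #              user=mirrorbrain dbname=mirrorbrain password=..."
    #
    # ...while /etc/mirrorbrain.conf, used by the mb scanner scripts
    # which need write access, keeps pointing at the write-master:
    #   [main]
    #   dbhost = mirrorbrain.abcdefgh.us-east-1.rds.amazonaws.com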
This seems to prevent the issue from manifesting itself, but I'm definitely not
feeling too confident about that hack.
Another issue that was noticed while dealing with all of this is that I
neglected to provide Elastic IPs for many of the hosts I provisioned in AWS
last week. Which means we're going to have to go back through later and assign
Elastic IPs and migrate many of our DNS records, which is not going to be
pleasant.
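(The mechanics are at least simple, if tedious; per host it's roughly the
following, with made-up ids, plus updating the matching DNS record afterwards:)

    # Allocate a VPC Elastic IP and attach it to the instance
    % aws ec2 allocate-address --domain vpc
    % aws ec2 associate-address \
        --instance-id i-1234abcd \
        --allocation-id eipalloc-1234abcd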
I have also created a private runbooks[4] repository on GitHub where we can
migrate runbooks and infra documentation from the protected space on
Confluence, which will allow us to follow our runbooks for Confluence being
down when Confluence is down :)
Moving forward, we're going to need to figure out a more highly available
distribution process. The unfortunate nature of jenkins-mirrorbrain is that it
must have a very large file tree under /srv/releases to properly serve and sync
with other mirrors, and potentially distributing that across N webservers to
get some redundancy would dramatically complicate our current release process.
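(For scale: every additional webserver would need a full, continuously synced
copy of that tree, i.e. each node running something like the rsync below
against a hypothetical master, and keeping N copies of /srv/releases
consistent in the middle of a release is exactly the complication I mean:)

    % rsync -a --delete releases-master:/srv/releases/ /srv/releases/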
So those are the details. Today kind of sucked, but we recovered in a couple of
hours from a fairly absurd bug in AWS. Yay us.
[0] https://jenkins.io/blog/2016/04/22/possible-infra-compromise/
[1] http://i.imgur.com/lGGmRGe.png
[2] https://twitter.com/rhoml/status/724952449199403008
[3] https://issues.jenkins-ci.org/browse/INFRA-669
[4] https://github.com/jenkins-infra/runbooks
- R. Tyler Croy
------------------------------------------------------
Code: <https://github.com/rtyler>
Chatter: <https://twitter.com/agentdero>
% gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
------------------------------------------------------