[Jenkins-infra] So what the hell happened this morning?

Thu Apr 28 10:17:01 UTC 2016

Tyler,

Daniel reported that this morning the problem came back for few users
They had such error : http://pastebin.com/vzf8uSrZ
I didn't find how you configured the R/O vs R/W
I found only one configuration file for mirror brain which seems to go to
the R/W DB
I don't know if it can be related but in mirrorbrains logs we again had
various "Name or service not known" errors today when it tried to verify
the mirrors (like last Tuesday)
https://gist.github.com/aheritier/d3961ac39a5e042a5c853670ce1f73a7
Since few hours we also have the same error about the DB access
in /var/log/apache2/error.log
[Thu Apr 28 06:01:00.879680 2016] [dbd:error] [pid 26193:tid
140561210119936] (20014)Internal error: AH00629: Can't connect to pgsql:
could not translate host name "jenkinsinfra01-rr1......." to address: Name
or service not known\n
....

Cheers

On Tue, Apr 26, 2016 at 11:44 PM, R. Tyler Croy <tyler at monkeypox.org> wrote:

>
> First off, let me tell you, there is nothing I love more than 2am calls
> from
> Daniel and Arnaud :)
>
> Anyways, with the potential infra compromise which is blogged about
> here[0],
> we migrated a *lot* of infrastructure in very short order to AWS. We're
> still
> not entirely finished with that process, but as part of it LDAP, Rating,
> CI and
> Mirrorbrain/Updates were moved into AWS.
>
> Mirrorbrain as you might know powers {mirrors,updates,pkg}.jenkins.io and
> uses
> a PostgreSQL database to keep track of which files are located on which
> mirrors.
>
> During the migration to AWS, we adopted AWS RDS for our postgresql server.
>
> Last night, an uptick in traffic hitting the jenkins-mirrorbrain host
> (home to
> {mirrors,updates,pkg}.jenkins.io), AWS RDS experienced a total network
> connection loss between the EC2 instance and the RDS instance[1]. The
> behavior
> was, as you can see in [1], that all DB Connections flat-lined from the EC2
> instance to the RDS instance. The host was unable to reconnect to
> PostgreSQL
> until it was stopped and started (not even a reboot worked). This caused
> requests to stack up in Apache on jenkins-mirrorbrain which also slowed
> Confluence down, since the plugin-info macro requests
> updates.jenkins-ci.org/latest/update-center.json periodically, and was
> unable
> to retrieve it successfully.
>
> Unfortunately this "super fucked up behavior" (my words) may be a "known
> issue"[2] that others have experienced with AWS RDS (pgsql). The
> work-around,
> which was manually deployed, was to spin up a read replica and point
> Apache/MirrorBrain to that instead of the write-master instance. This means
> Apache is hitting a read-replica, and the mirrorbrain scripts for scanning
> mirrors are hitting the write-master (filed to puppetize here[3])
>
> This seems to prevent the issue from manifesting itself, but I'm
> definitely not
> feeling too confident about that hack.
>
>
> Another issue that was noticed while dealing with all of this, is that I
> neglected to provide Elastic IPs for many of the hosts I provisioned in AWS
> last week.  WHich means we're going to have to go back through later and
> assign
> Elastic IPs and migrate many of our DNS records, which is not going to be
> pleasant.
>
>
> I have also created a private runbooks[4] repository on GitHub where we can
> migrate runbooks and infra documentation from the protected space on
> Confluence, which will allow us to follow our runbooks for Confluence being
> down when Confluence is down :)
>
>
> Moving forward we're going to need to figure out a more highly available
> distribution process. The unfortunate nature of jenkins-mirrorbrain is
> that it
> must have a very large file tree under /srv/releases to properly serve and
> sync
> with other mirrors, and potentially distributing that across N webservers
> to
> get some redundancy would dramatically complicate our current release
> process.
>
>
> So there are the details, today's kind of sucked, but we recovered in a
> couple
> hours from a fairly absurd bug in AWS. Yay us.
>
>
> [0] https://jenkins.io/blog/2016/04/22/possible-infra-compromise/
> [1] http://i.imgur.com/lGGmRGe.png
> [2] https://twitter.com/rhoml/status/724952449199403008
> [3] https://issues.jenkins-ci.org/browse/INFRA-669
> [4] https://github.com/jenkins-infra/runbooks
>
>
> - R. Tyler Croy
>
> ------------------------------------------------------
>      Code: <https://github.com/rtyler>
>   Chatter: <https://twitter.com/agentdero>
>
>   % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
> ------------------------------------------------------
>
> _______________________________________________
> Jenkins-infra mailing list
> Jenkins-infra at lists.jenkins-ci.org
> http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
>
>

-- 
-----
Arnaud Héritier
http://aheritier.net
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.jenkins-ci.org/pipermail/jenkins-infra/attachments/20160428/b9621208/attachment.html>