[Jenkins-infra] So what the hell happened this morning?

R. Tyler Croy tyler at monkeypox.org
Thu Apr 28 19:15:42 UTC 2016


(replies inline)

On Thu, 28 Apr 2016, Arnaud Héritier wrote:

> Tyler,
> 
> Daniel reported that this morning the problem came back for a few users.
> They got this error: http://pastebin.com/vzf8uSrZ
> I couldn't find how you configured the R/O vs. R/W split.
> I found only one configuration file for MirrorBrain, which seems to point
> to the R/W DB.
> I don't know if it's related, but in MirrorBrain's logs we again had
> various "Name or service not known" errors today when it tried to verify
> the mirrors (like last Tuesday):
> https://gist.github.com/aheritier/d3961ac39a5e042a5c853670ce1f73a7
> For the past few hours we have also had the same DB access error
> in /var/log/apache2/error.log:
> [Thu Apr 28 06:01:00.879680 2016] [dbd:error] [pid 26193:tid
> 140561210119936] (20014)Internal error: AH00629: Can't connect to pgsql:
> could not translate host name "jenkinsinfra01-rr1......." to address: Name
> or service not known\n



The same issue cropped up again this morning. Basically, VPC networking
combined with AWS RDS (pgsql) "sucks".
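
For anyone who wants to check for the "Name or service not known" failure
quoted above from the host itself, here is a minimal Python sketch. It is not
what we ran; the function name, the retry counts, and the use of port 5432
are assumptions for illustration. It retries name resolution a few times and
reports whether the hostname ever resolved.

```python
import socket
import time


def can_resolve(hostname, attempts=3, delay=1.0, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves within `attempts` tries.

    `resolver` is injectable (an assumption of this sketch) so the check
    can be exercised without touching real DNS.
    """
    for attempt in range(attempts):
        try:
            # 5432 is the default PostgreSQL port; only the lookup matters here.
            resolver(hostname, 5432)
            return True
        except socket.gaierror:
            # gaierror is the failure that libpq reports as
            # "could not translate host name ... Name or service not known".
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling can_resolve() with the database hostname from the Apache config
returns False if every lookup attempt failed, which would match the
dbd:error lines in the Apache log.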


ctennis and I fixed this today by dumping RDS and loading up postgresql on the
machine directly.
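
A rough sketch of what a dump-and-load like that can look like, written as a
small Python helper that only builds the commands. The exact commands we used
are not recorded in this thread, so the host name, database name, user, and
dump file below are all hypothetical, and the choice of pg_dump's custom
format with pg_restore is an assumption.

```python
import shlex


def build_dump_and_restore(rds_host, dbname, user, dump_path="mirrorbrain.dump"):
    """Build (but do not run) pg_dump / pg_restore argument lists.

    All arguments are placeholders for illustration only.
    """
    # Dump the remote (RDS) database in pg_dump's custom archive format.
    dump = [
        "pg_dump", "--host", rds_host, "--username", user,
        "--format=custom", "--file", dump_path, dbname,
    ]
    # Restore into a PostgreSQL running on the same machine; the target
    # database is assumed to exist already.
    restore = [
        "pg_restore", "--host", "localhost", "--username", user,
        "--dbname", dbname, "--no-owner", dump_path,
    ]
    return dump, restore


if __name__ == "__main__":
    # Hypothetical values; print the commands for review instead of running them.
    for cmd in build_dump_and_restore("rds.example.invalid", "mirrorbrain", "mb"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run
anywhere; they can be pasted into a shell once the values are filled in.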


So now I have more to introduce into Puppetized management :)



> On Tue, Apr 26, 2016 at 11:44 PM, R. Tyler Croy <tyler at monkeypox.org> wrote:
> 
> >
> > First off, let me tell you, there is nothing I love more than 2am calls
> > from Daniel and Arnaud :)
> >
> > Anyways, with the potential infra compromise which is blogged about
> > here[0], we migrated a *lot* of infrastructure in very short order to AWS.
> > We're still not entirely finished with that process, but as part of it
> > LDAP, Rating, CI and Mirrorbrain/Updates were moved into AWS.
> >
> > Mirrorbrain, as you might know, powers {mirrors,updates,pkg}.jenkins.io
> > and uses a PostgreSQL database to keep track of which files are located on
> > which mirrors.
> >
> > During the migration to AWS, we adopted AWS RDS for our postgresql server.
> >
> > Last night, during an uptick in traffic hitting the jenkins-mirrorbrain
> > host (home to {mirrors,updates,pkg}.jenkins.io), AWS RDS experienced a
> > total network connection loss between the EC2 instance and the RDS
> > instance[1]. The behavior was, as you can see in [1], that all DB
> > connections from the EC2 instance to the RDS instance flat-lined. The
> > host was unable to reconnect to PostgreSQL until it was stopped and
> > started (not even a reboot worked). This caused requests to stack up in
> > Apache on jenkins-mirrorbrain, which also slowed Confluence down, since
> > the plugin-info macro periodically requests
> > updates.jenkins-ci.org/latest/update-center.json and was unable to
> > retrieve it successfully.
> >
> > Unfortunately this "super fucked up behavior" (my words) may be a "known
> > issue"[2] that others have experienced with AWS RDS (pgsql). The
> > work-around, which was manually deployed, was to spin up a read replica
> > and point Apache/MirrorBrain to that instead of the write-master instance.
> > This means Apache is hitting a read replica, and the mirrorbrain scripts
> > for scanning mirrors are hitting the write-master (filed to puppetize
> > here[3]).
> >
> > This seems to prevent the issue from manifesting itself, but I'm
> > definitely not feeling too confident about that hack.
> >
> >
> > Another issue that was noticed while dealing with all of this is that I
> > neglected to provide Elastic IPs for many of the hosts I provisioned in AWS
> > last week. This means we're going to have to go back through later, assign
> > Elastic IPs, and migrate many of our DNS records, which is not going to be
> > pleasant.
> >
> >
> > I have also created a private runbooks[4] repository on GitHub where we can
> > migrate runbooks and infra documentation from the protected space on
> > Confluence, which will allow us to follow our runbooks for Confluence being
> > down when Confluence is down :)
> >
> >
> > Moving forward we're going to need to figure out a more highly available
> > distribution process. The unfortunate nature of jenkins-mirrorbrain is
> > that it must have a very large file tree under /srv/releases to properly
> > serve and sync with other mirrors, and potentially distributing that
> > across N webservers to get some redundancy would dramatically complicate
> > our current release process.
> >
> >
> > So there are the details. Today kind of sucked, but we recovered in a
> > couple of hours from a fairly absurd bug in AWS. Yay us.
> >
> >
> > [0] https://jenkins.io/blog/2016/04/22/possible-infra-compromise/
> > [1] http://i.imgur.com/lGGmRGe.png
> > [2] https://twitter.com/rhoml/status/724952449199403008
> > [3] https://issues.jenkins-ci.org/browse/INFRA-669
> > [4] https://github.com/jenkins-infra/runbooks
> >
> >
> > - R. Tyler Croy
> >
> > ------------------------------------------------------
> >      Code: <https://github.com/rtyler>
> >   Chatter: <https://twitter.com/agentdero>
> >
> >   % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
> > ------------------------------------------------------
> >
> > _______________________________________________
> > Jenkins-infra mailing list
> > Jenkins-infra at lists.jenkins-ci.org
> > http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
> >
> >
> 
> 
> -- 
> -----
> Arnaud Héritier
> http://aheritier.net
> Mail/GTalk: aheritier AT gmail DOT com
> Twitter/Skype : aheritier

- R. Tyler Croy

------------------------------------------------------
     Code: <https://github.com/rtyler>
  Chatter: <https://twitter.com/agentdero>

  % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
------------------------------------------------------