[Jenkins-infra] Fwd: [support.osuosl.org #21424] cabbage.jenkins-ci.org (140.211.15.130) hang

Kohsuke Kawaguchi kk at kohsuke.org
Wed Sep 12 22:39:53 UTC 2012




-------- Original Message --------
Subject: [support.osuosl.org #21424] cabbage.jenkins-ci.org (140.211.15.130) hang
Date: Wed, 12 Sep 2012 13:47:00 -0700
From: Justin Dugger via RT <support at osuosl.org>
Reply-To: support at osuosl.org
To: kk at kohsuke.org
CC: infra at lists.jenkins-ci.org

I'd be fine with replacing drives, but I'm having trouble identifying 
which one to replace.

According to the df command, you've got 20 gigs in use. And vgdisplay 
shows roughly a terabyte of free disk space. This system predates our 
ganeti KVM infrastructure; would a VM in our cluster be a good enough 
solution?

The main reason to avoid the VM infra is I/O throughput, so we'd want to 
move it over to our DB cluster if we turn cabbage into a VM.

Justin

On Mon Sep 10 18:59:35 2012, kk at kohsuke.org wrote:
> On 09/10/2012 06:43 PM, Justin Dugger via RT wrote:
> > Okay, I've poked around a bit more, and I don't have much more
> > information. There's nothing in the logs from just prior to the
> > lockup. Nothing on the console (i.e. it was nonresponsive). The only
> > thing I noticed was one disk in the disk array blinking in an
> > on-off pattern.
>
> Thank you for checking! And I hate to say this, but it went
> unresponsive again.
>
> Both hangs happened during or shortly after an lvcreate call, so this
> is starting to look like a hardware problem to me.
>
> I suspect some part of the disk died in a way that causes the
> controller to hang. That in turn makes the Linux kernel hang on I/O,
> so one by one processes get stuck in system calls, and eventually
> sshd locks up.
>
> This time I could see from the other shells I was using that after one
> process became uninterruptible, the other shells kept working for a
> while (for example, 'ls ~' still ran). But eventually it went silent.
>
> I wonder if there's any way to verify this hypothesis.
>
> Or should we just start rescuing what we can and put in a new disk, if
> none of us can afford to put much effort into this?
>
>
> > On Mon Sep 10 17:41:31 2012, jldugger wrote:
> >> Sorry about not keeping you up to date. I rebooted it about 40
> >> minutes ago.
> >>
> >> On Mon Sep 10 17:13:16 2012, kk at kohsuke.org wrote:
> >> > Sorry to be so impatient, but are there any updates on this?
> >> >
> >> > I wonder if we can at least have the machine rebooted.
> >> >
> >> > On 09/10/2012 11:48 AM, OSL Systems Support Team via RT wrote:
> >> > > Greetings,
> >> > >
> >> > > This message has been automatically generated in response to the
> >> > > creation of a support ticket call:
> >> > >
> >> > >          "cabbage.jenkins-ci.org (140.211.15.130) hang",
> >> > >
> >> > > a summary of which appears below.
> >> > >
> >> > > There is no need to reply to this message right now. Your ticket
> >> > > has been assigned an ID of [support.osuosl.org #21424]. Please
> >> > > include this string in the subject line of all future
> >> > > correspondence about this issue. You may also catch us on irc
> >> > > (irc.freenode.net) in #osuosl.
> >> > >
> >> > >
> >> > >
> >> > >                          Thank you.
> >> > >                          support at osuosl.org
> >> > >
> >> > >
> >> > > -------------------------------------------------------------------------
> >> > >
> >> > > Hi, OSUOSL,
> >> > >
> >> > > I ran "sudo lvcreate -L 200G -n srv lvm" about 15 minutes ago on
> >> > > 140.211.15.130. Several minutes after that I noticed that the
> >> > > system doesn't accept any ssh connections, even though it's
> >> > > still responding to ping.
> >> > >
> >> > > This certainly looks like lvcreate caused the kernel to hang ---
> >> > > and if so I'm wondering if there's any problem with the
> >> > > underlying block devices.
> >> > >
> >> > > Tyler told me that he just filed ticket #21416 this morning,
> >> > > where this same system got rebooted, and that also made us
> >> > > wonder if this might be somehow related to that.
> >> > >
> >> > > Do you have any insights on what might be going on? Otherwise
> >> > > we'd like to get this machine rebooted and maybe check the SMART
> >> > > status of the disks or something.
> >> > >
> >> >
> >> >
> >>
> >>
> >
> >
> >
> >
>
>
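
One possible way to probe the hypothesis quoted above (processes getting
stuck one by one in uninterruptible I/O until sshd locks up) would be to
list the processes in state "D" while a shell still responds. The following
is only a minimal Python sketch, assuming a Linux /proc filesystem. The
helper names parse_stat and find_uninterruptible are made up for
illustration and do not come from any tool mentioned in this thread.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren].strip())
    comm = text[open_paren + 1:close_paren]
    # The field right after the closing parenthesis is the process state.
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process whose state is 'D'.

    'D' is the kernel's code for uninterruptible sleep, usually waiting on I/O.
    proc_root is a parameter only so the function can be pointed at a
    test directory (an assumption of this sketch).
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```

Calling find_uninterruptible() from an already-open shell shortly after an
lvcreate call would show whether lvcreate, and then other processes, pile
up in state D. A growing list would be consistent with blocked I/O, but it
would not by itself identify which disk is failing.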







