[Jenkins-infra] [support.osuosl.org #21424] cabbage.jenkins-ci.org (140.211.15.130) hang
Kohsuke Kawaguchi
kk at kohsuke.org
Tue Sep 11 01:59:22 UTC 2012
On 09/10/2012 06:43 PM, Justin Dugger via RT wrote:
> Okay, I've poked around a bit more, and I don't have much more
> information. There's nothing in the logs from just prior to the lockup.
> Nothing on the console (i.e. it was nonresponsive). The only thing I
> noticed was that the disk array had one disk blinking in an on-off pattern.
Thank you for checking! And I hate to say this, but it went unresponsive
again.
Both hangs happened during or shortly after an lvcreate call, so this is
starting to look like a hardware problem to me.
I suspect some part of the disk died in a way that causes the controller
to hang. This causes the Linux kernel to hang when doing I/O, so one by
one processes get stuck trying to make system calls, and eventually sshd
locks up.
This time I could see from the other shells I was using that after one
process went into uninterruptible sleep, the other shells kept working for
a while (for example, 'ls ~' still ran). But eventually they went silent.
I wonder if there's any way to verify this hypothesis.
Or should we just start rescuing what we can and put in a new disk, if
none of us can afford to put much effort into this?
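One rough way to check the hypothesis would be to watch whether processes
pile up in uninterruptible sleep (state "D" in /proc/<pid>/stat) after the
next lvcreate. Below is a minimal Python sketch, assuming Python is
available on the box and that the script is started before the hang, since
launching anything new at that point may itself block on disk. The function
names are made up for illustration.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or stat was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

A process showing up in state D briefly is normal during ordinary disk I/O.
What would support the theory is a list that keeps growing across repeated
runs and never drains, starting with lvcreate itself.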
> On Mon Sep 10 17:41:31 2012, jldugger wrote:
>> Sorry bout not keeping you up to date, I rebooted it about 40 minutes ago.
>>
>> On Mon Sep 10 17:13:16 2012, kk at kohsuke.org wrote:
>> > Sorry to be so impatient, but are there any updates on this?
>> >
>> > I wonder if we can at least have the machine rebooted.
>> >
>> > On 09/10/2012 11:48 AM, OSL Systems Support Team via RT wrote:
>> > > Greetings,
>> > >
>> > > This message has been automatically generated in response to the
>> > > creation of a support ticket call:
>> > >
>> > > "cabbage.jenkins-ci.org (140.211.15.130) hang",
>> > >
>> > > a summary of which appears below.
>> > >
>> > > There is no need to reply to this message right now. Your ticket has been
>> > > assigned an ID of [support.osuosl.org #21424]. Please include this string
>> > > in the subject line of all future correspondence about this issue. You may
>> > > also catch us on irc (irc.freenode.net) in #osuosl.
>> > >
>> > >
>> > >
>> > > Thank you.
>> > > support at osuosl.org
>> > >
>> > > -------------------------------------------------------------------------
>> > >
>> > > Hi, OSUOSL,
>> > >
>> > > I just ran "sudo lvcreate -L 200G -n srv lvm" about 15 minutes ago on
>> > > 140.211.15.130. Several minutes after that I noticed that the system
>> > > doesn't accept any ssh connections, even though it's still responding
>> > > to ping.
>> > >
>> > > This certainly looks like lvcreate caused the kernel to hang --- and if
>> > > so I'm wondering if there's any problem with the underlying block
>> > > devices.
>> > >
>> > > Tyler told me that he just filed ticket #21416 this morning, where this
>> > > same system got rebooted, and that also made us wonder if this might be
>> > > somehow related to that.
>> > >
>> > > Do you have any insights on what might be going on? Otherwise we'd like
>> > > to get this machine rebooted and maybe check the SMART status of the
>> > > disks or something.
>> > >
--
Kohsuke Kawaguchi http://kohsuke.org/