[Jenkins-infra] Fwd: [support.osuosl.org #21424] cabbage.jenkins-ci.org (140.211.15.130) hang

Kohsuke Kawaguchi kkawaguchi at cloudbees.com
Thu Sep 13 16:20:51 UTC 2012


To put this in context, Justin's suggestion appears to be to retire the
physical hardware and move the system into a virtual machine, which I
assume would be better maintained.

I think this would be great so long as we can get adequate storage space 
and adequate computing power.

I say "adequate storage" because getting by with just 20GB has been a
sore point, and we need another 60GB or so right now to serve as the
fallback mirror, not to mention room for growth.

And the same goes for "adequate computing power" because our plan was to 
use this machine to run Confluence or Wiki over time.

I wouldn't worry too much about I/O bandwidth because even 
JIRA/Confluence would be primarily memory bound and not I/O bound.



I was going to suggest that we purchase a new disk for this computer, so
FWIW I'm willing to send some money to OSUOSL if that makes it easier
for us to get a comparable system in a VM.

Any thoughts on this?
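
On verifying the hang hypothesis from the quoted thread below (processes
piling up one by one in uninterruptible I/O wait): a minimal sketch,
assuming a Linux /proc filesystem, that lists processes currently in state
"D" (uninterruptible sleep). The function names are made up for
illustration. A growing list would be consistent with a stuck controller
but does not prove a hardware fault, and the script itself may block if
the system is already wedged.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process whose state is 'D'."""
    if not os.path.isdir(proc_root):
        return []
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry is unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running this periodically from an already-open shell while repeating the
lvcreate call would show whether the set of stuck processes keeps growing
before sshd stops responding.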


On 09/12/2012 03:39 PM, Kohsuke Kawaguchi wrote:
> -------- Original Message --------
> Subject: [support.osuosl.org #21424] cabbage.jenkins-ci.org
> (140.211.15.130) hang
> Date: Wed, 12 Sep 2012 13:47:00 -0700
> From: Justin Dugger via RT <support at osuosl.org>
> Reply-To: support at osuosl.org
> To: kk at kohsuke.org
> CC: infra at lists.jenkins-ci.org
>
> I'd be fine with replacing drives, but I'm having trouble identifying
> which one to replace.
>
> According to the df command, you've got 20 gigs in use. And vgdisplay
> shows roughly a terabyte of free disk space. This system predates our
> ganeti KVM infrastructure; would a VM in our cluster be a good enough
> solution?
>
> The main reason to avoid the VM infra is I/O throughput, so we'd want to
> move it over to our DB cluster if we turn cabbage into a VM.
>
> Justin
>
> On Mon Sep 10 18:59:35 2012, kk at kohsuke.org wrote:
>> On 09/10/2012 06:43 PM, Justin Dugger via RT wrote:
>> > Okay, I've poked around a bit more, and I don't have much more
>> > information. There's nothing in the logs from just prior to the
>> > lockup. Nothing on the console (i.e., it was nonresponsive). The only
>> > thing I noticed was that, on the disk array, one disk was blinking
>> > in an on-off pattern.
>>
>> Thank you for checking! And I hate to say this, but it went
>> unresponsive again.
>>
>> Both hangs happened during or shortly after an lvcreate call, so this
>> is starting to look like a hardware problem to me.
>>
>> I suspect some part of the disk died in a way that causes the
>> controller to hang. This causes the Linux kernel to hang when doing
>> I/O, so one by one processes get stuck in system calls, and
>> eventually sshd locks up.
>>
>> This time I could see from the other shells I had open that after one
>> process became uninterruptible, the other shells kept working for a
>> while (for example, 'ls ~'), but eventually they went silent too.
>>
>> I wonder if there's any way to verify this hypothesis.
>>
>> Or should we just start rescuing what we can and put in a new disk, if
>> none of us can afford to put much effort into this?
>>
>>
>> > On Mon Sep 10 17:41:31 2012, jldugger wrote:
>> >> Sorry about not keeping you up to date; I rebooted it about 40
>> >> minutes ago.
>> >>
>> >> On Mon Sep 10 17:13:16 2012, kk at kohsuke.org wrote:
>> >> > Sorry to be so impatient, but are there any updates on this?
>> >> >
>> >> > I wonder if we can at least have the machine rebooted.
>> >> >
>> >> > On 09/10/2012 11:48 AM, OSL Systems Support Team via RT wrote:
>> >> > > Greetings,
>> >> > >
>> >> > > This message has been automatically generated in response to
>> the
>> >> > > creation of a support ticket call:
>> >> > >
>> >> > >          "cabbage.jenkins-ci.org (140.211.15.130) hang",
>> >> > >
>> >> > > a summary of which appears below.
>> >> > >
>> >> > > There is no need to reply to this message right now. Your
>> ticket has
>> >> > been
>> >> > >   assigned an ID of [support.osuosl.org #21424]. Please include
>> this
>> >> > string
>> >> > > in the subject line of all future correspondence about this
>> issue.
>> >> > You may
>> >> > >   also catch us on irc (irc.freenode.net) in #osuosl.
>> >> > >
>> >> > >
>> >> > >
>> >> > >                          Thank you.
>> >> > >                          support at osuosl.org
>> >> > >
>> >> > >
>> >> >
>> -------------------------------------------------------------------------
>> >> > >
>> >> > > Hi, OSUOSL,
>> >> > >
>> >> > > I ran "sudo lvcreate -L 200G -n srv lvm" about 15 minutes ago on
>> >> > > 140.211.15.130. Several minutes after that I noticed that the
>> >> > > system doesn't accept any ssh connections, even though it's still
>> >> > > responding to ping.
>> >> > >
>> >> > > This certainly looks like lvcreate caused the kernel to hang
>> --- and
>> >> > if
>> >> > > so I'm wondering if there's any problem with the underlying
>> block
>> >> > devices.
>> >> > >
>> >> > > Tyler told me that he just filed a ticket #21416 this morning
>> where
>> >> > this
>> >> > > same system got rebooted, and that also made us wonder if this
>> might
>> >> > be
>> >> > > somehow related to that.
>> >> > >
>> >> > > Do you have any insights on what might be going on? Otherwise
>> we'd
>> >> > like
>> >> > > to get this machine rebooted and maybe check the SMART status
>> of the
>> >> > > disks or something.
>> >> > >
>> >> >
>> >> >
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>>
>
>
>
>
>
>
> _______________________________________________
> Jenkins-infra mailing list
> Jenkins-infra at lists.jenkins-ci.org
> http://lists.jenkins-ci.org/mailman/listinfo/jenkins-infra
>


-- 
Kohsuke Kawaguchi | CloudBees, Inc. | http://cloudbees.com/
Try Nectar, our professional version of Jenkins

