Small Victories

I’ve moaned about the state of the SGI cluster here at WeSC before. I’m now happy to report that it’s almost back to full fitness.

After playing about with the external CD-ROM a bit more, it became apparent that it would only behave itself if you followed a careful procedure: power down the Origin it was attached to, power-cycle the CD-ROM, and then power the Origin back on.

At this point the internal disk of the dead node wasn’t even showing up on the SCSI bus, so it was definitely knackered. Fortunately the UK is home to a fine purveyor of second-hand SGI parts. We ordered a replacement drive (complete with SGI firmware) and it arrived the next day. Ian Mapleson (for it is he that runs the SGI depot) has also written some excellent articles on Irix administration, one of which details an easy way to clone a root disk. With this info I was able to clone one of the other nodes in the cluster. A quick edit of /etc/sys_id, so that it won’t wake up thinking it’s the wrong machine, and we are ready to go.

The drive goes into the dead machine, we power it up and hey presto! one working Irix box.

I am jubilant until I realise that all the nodes of the cluster share the same CXFS volume and that this node no longer has a valid CXFS license. And of course we have no backups. (Actually this isn’t quite true: it later transpired that there were copies of the license file on another machine, but a backup that isn’t documented isn’t a very useful backup.)

I put in a support call to SGI (remembering that this machine has no maintenance contract) without much hope. The very next day SGI email me the license file! SGI, you may be at death’s door but you are lovely people.

At this point all that’s left to do is to debug the Condor install, which doesn’t seem to be working properly. But that is somebody else’s problem.