Category Archives: SysAdmin

How Many Ways Can We Fail Today?

The SGI cluster here at WeSC is beginning to get me down. One of its nodes has been down since I started work here and I’ve finally gotten around to looking at it.

Step 1 was to try to get access to the serial console. After hunting around for a cable I then had to fight with minicom to get it running, a process that would have been significantly easier if the terminals weren’t all running at a non-standard baud rate. Anyway, a quick re-boot of the machine showed that it was finding its internal disk but failing to find its OS. Given the number of times the power has failed recently I wouldn’t be at all surprised if the partition table is corrupted. So it seemed like a re-install was worth trying before getting a replacement disk.

After finding a set of Irix install instructions that I could actually understand I hooked up the ancient external SCSI CD drive and put in the disk that contains the install tools. It took a couple of attempts to convince the drive that it should close but after that it made all the right kinds of whirring noises and I was quietly hopeful.

So boot to the command monitor and:

boot -f cdrom(1,1,7)sash64

And the monitor helpfully responds with a ‘no media found’ message. After trying several other CDs I realized that I couldn’t even ls them never mind run them. The conclusion? Knackered CD drive. Arse.

For my next trick: installing over bootp using an SGI Fuel workstation as a server.

A Productive Week – Cfengine and SSH-agent

Now that lack-of-sleep madness has passed I’ve managed to actually get some work done. In particular I’ve been chipping away at some of the tedious manual labour that comes with administering multiple machines.

To start off with I finally knuckled down to working out how to use ssh-agent. This nice article from SecurityFocus helped me get started. The most difficult bit was getting ssh-agent to run from fluxbox on start up. To fix that I added the following lines into .fluxbox/apps

[startup] {eval `ssh-agent -s`}
[startup] {ssh-add < /dev/null}

which pops up a dialog box for my passphrase on login.

I also started to get down to sorting out configuration management using cfengine. One of the things that I've never been able to work out was how to make rules depend on one another. So if you have a rule that adds a line into the iptables config, how do you then tell cfengine that iptables needs to be restarted? After hunting around on the web I found an example that does almost exactly what I need. I hacked up a quick example that would sort my root alias and then run the sendmail newaliases command.

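editfiles:
    # Only touch /etc/aliases if the root alias is not already present
    { /etc/aliases
        BeginGroupIfNoSuchLine "root:           wescroot@wesc.ac.uk"
            DeleteLinesStarting "#root"
            Append "root:           wescroot@wesc.ac.uk"
        EndGroup
        DefineClasses "aliaseschanged"
    }

shellcommands:
    # Runs only when the edit above defined the aliaseschanged class
    aliaseschanged::
        '/usr/bin/newaliases'
        useshell=false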

Basically aliaseschanged is only set if the editfiles rule needs to be executed. So newaliases is only run if we actually update the aliases file. I have a more complicated set of rules that does the same thing for iptables. Next week globus4.
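
For anyone who doesn't read cfengine, here is a rough sketch of the same "edit, then run the follow-up only if something changed" idea in Python. It is not how cfengine works internally, just an illustration of the pattern. The names ensure_line and converge, the drop_prefix argument and the injectable runner are all made up for this example.

```python
import subprocess
from pathlib import Path


def ensure_line(path, line, drop_prefix=None):
    """Make sure `line` is in the file; return True only if the file was changed."""
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    if line in lines:
        return False  # already in the desired state, nothing to do
    if drop_prefix is not None:
        # Rough equivalent of DeleteLinesStarting
        lines = [l for l in lines if not l.startswith(drop_prefix)]
    lines.append(line)  # rough equivalent of Append
    p.write_text("\n".join(lines) + "\n")
    return True


def converge(path, line, command, drop_prefix=None, runner=subprocess.run):
    """Edit the file and run `command` only when the edit changed something."""
    changed = ensure_line(path, line, drop_prefix)
    if changed:  # plays the role of the aliaseschanged class
        runner(command, check=True)
    return changed
```

Calling converge("/etc/aliases", "root:           wescroot@wesc.ac.uk", ["/usr/bin/newaliases"], drop_prefix="#root") twice in a row should edit the file and run newaliases the first time and do nothing the second time, which is the behaviour I wanted from the cfengine rules.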

Aftermath

Happily I can report that my two Sun V880 servers are now running again. I discovered early this morning that the reason I couldn’t boot from DVD is that the Toshiba DVD drives on these servers have a firmware bug which means they cannot read 2k sectors. So I had to spend some time downloading the Solaris 10 CD image. So if anyone is interested in the Solaris equivalent of booting from a rescue disk here it is.

At the OpenBoot prompt (you may have to send BREAK to the boot process to get the prompt, which in minicom is Ctrl-A then F) type:

boot cdrom -s

and wait for what seems like an eternity while it boots up. You are then free to mount filesystems and fsck things to your heart’s content.

Of course once I had the Sun boxes up and running we discovered that the Origin300 cluster was completely dead. The PROM has lost all its hardware info.

Current Mood: fatalistic

Mr Murphy, I see you’ve brought your SledgeHammer

Came into work to find that about half my machines were down. Mostly just switched off, but some in indeterminate stages of “not working right”. Powered on all the machines that were obviously down and re-booted the others (I wouldn’t normally do that, but with so many machines down I had to remove the easy targets, not to mention the fact that console management in the machine room is woefully inadequate for anything more than brute-force diagnostics).

So at this point it was becoming fairly clear that we had had a little incident with the power supply. While waiting to see what would come back up I tried to log into my mail and found that down also. Clearly whatever interrupted our supply hit the whole building. A look through the logs of those machines that were up shows that everything re-booted on Saturday morning. Definitely a power problem.

Further prodding of those services that refuse to come up reveals two culprits: bouscat (the server that world+dog get dumped on) has let the magic smoke out, though it can probably be frankensteined. More seriously, the two Sun V880 servers that run one of our main data-stores are refusing to boot. This elicits much running about in search of the correct serial cable (see previous post in which I clearly caught the attention of the fuckup fairy).

After finding a cable that would give me console access I discover that the root filesystem is failing fsck and mounting read-only. This apparently is enough to completely hose a Solaris 8 box. Oh Joy. After consulting with #solaris I can poke OpenBoot in the requisite manner to boot in single-user mode. This gets me precisely nowhere. Even in single-user mode it refuses to give me a shell where I could do anything useful. Tonight I shall mostly be reading the Solaris documentation to try to work out how to get the box to boot from DVD so that I have some hope of fixing it from the rescue system.

Current Mood: Maybe London Wasn’t So Bad