Archive for the ‘SysAdmin’ Category

Backup Scripts

September 30th, 2006

Over the last couple of weeks I’ve been chipping away at the problem of our department having no backups whatsoever. As we are a small department with few machines and a fairly small amount of data, I’ve decided that systems like Bacula and Amanda are overkill for our situation.

I’ve written a set of small scripts to handle our most pressing backup needs. Over the next few posts I’ll describe how I’ve backed things up and the scripts and tools I’ve used to do it. None of this is rocket science, but if it saves even one person one hour of work it’ll have been worth writing down.

All the scripts in the next few posts can be found in the WeSC subversion repository.

Part 1: MySQL

Part 2: Subversion

Part 3: Rotating and Culling Backups

SysAdmin

Small Victories

August 8th, 2006

I’ve moaned about the state of the SGI cluster here at WeSC before. I’m now happy to report that it’s almost back to full fitness.

After playing about with the external CD-ROM a bit more it became apparent that it would only behave itself if you followed a careful procedure: power down the Origin it is attached to, power-cycle the CD-ROM, and then power the Origin back on.

At this point the internal disk of the dead node wasn’t even showing up on the SCSI bus, so it was definitely knackered. Fortunately the UK is home to a fine purveyor of second-hand SGI parts. We ordered a replacement drive (complete with SGI firmware) and it arrived the next day. Ian Mapleson (for it is he that runs the SGI depot) has also written some excellent articles on Irix administration, one of which details an easy way to clone a root disk. With this info I was able to clone one of the other nodes in the cluster. A quick edit of /etc/sys_id, so that it won’t wake up thinking it’s the wrong machine, and we are ready to go.

The drive goes into the dead machine, we power it up and hey presto! one working Irix box.

I am jubilant until I realise that all the nodes of the cluster share the same CXFS volume and that this node no longer has a valid CXFS license. And of course we have no backups (actually this isn’t quite true: it later transpired that there were copies of the license file on another machine, but a backup that isn’t documented isn’t a very useful backup).

I put in a support call to SGI (remembering that this machine has no maintenance contract) without much hope. The very next day SGI email me the license file! SGI, you may be at death’s door but you are lovely people.

At this point all that’s left to do is to debug the Condor install, which doesn’t seem to be working properly. But that is somebody else’s problem.

SysAdmin

How Many Ways Can We Fail Today?

July 17th, 2006

The SGI cluster here at WeSC is beginning to get me down. One of its nodes has been down since I started work here and I’ve finally gotten around to looking at it.

Step 1 was to try and get access to the serial console. After hunting around for a cable I then had to fight with minicom to get it running, a process that would have been significantly easier if the terminals weren’t all running at a non-standard baud rate. Anyway, a quick re-boot of the machine showed that it was finding its internal disk but failing to find its OS. Given the number of times the power has failed recently I wouldn’t be at all surprised if the partition table is corrupted. So it seemed like a re-install was worth trying before getting a replacement disk.

After finding a set of Irix install instructions that I could actually understand I hooked up the ancient external SCSI CD drive and put in the disk that contains the install tools. It took a couple of attempts to convince the drive that it should close but after that it made all the right kinds of whirring noises and I was quietly hopeful.

So boot to the command monitor and:

boot -f cdrom(1,1,7)sash64

And the monitor helpfully responds with a ‘no media found’ message. After trying several other CDs I realized that I couldn’t even ls them never mind run them. The conclusion? Knackered CD drive. Arse.

For my next trick: installing over bootp using an SGI Fuel workstation as a server.

SysAdmin

A Productive Week - Cfengine and SSH-agent

May 12th, 2006

Now that lack-of-sleep madness has passed I’ve managed to actually get some work done. In particular I’ve been chipping away at some of the tedious manual labour that comes with administering multiple machines.

To start off with I finally knuckled down to working out how to use ssh-agent. This nice article from SecurityFocus helped me get started. The most difficult bit was getting ssh-agent to run from fluxbox on start up. To fix that I added the following lines to .fluxbox/apps:

[startup] {eval `ssh-agent -s`}
[startup] {ssh-add < /dev/null}

which pops up a dialog box for my passphrase on login.

I also started to get down to sorting out configuration management using cfengine. One of the things that I’ve never been able to work out was how to make rules depend on one another. So if you have a rule that adds a line into the iptables config, how do you then tell cfengine that iptables needs to be restarted? After hunting around on the web I found an example that does almost exactly what I need. I hacked up a quick example that would sort my root alias and then run the sendmail newaliases command.

editfiles:
    { /etc/aliases
        BeginGroupIfNoSuchLine "root:           wescroot@wesc.ac.uk"
            DeleteLinesStarting "#root"
            Append "root:           wescroot@wesc.ac.uk"
        EndGroup
        DefineClasses "aliaseschanged"
    }

shellcommands:
    aliaseschanged::
        '/usr/bin/newaliases'
            useshell=false

Basically aliaseschanged is only set if the editfiles rule actually has to edit the file. So newaliases is only run if we actually update the aliases file. I have a more complicated set of rules that does the same thing for iptables. Next week: Globus 4.
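For comparison, here is a minimal sketch of the same "only act when something changed" idea in plain shell, outside cfengine. The ensure_line helper is a hypothetical name made up for illustration, not part of cfengine, and unlike the rule above it does not delete the commented-out #root line.

```shell
# Hypothetical helper (not part of cfengine): append LINE to FILE only if
# no identical line is already there.
# Exit status: 0 if the file was changed, 1 if it was left alone.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats LINE as a fixed string.
    if grep -qxF -e "$line" "$file" 2>/dev/null; then
        return 1    # line already present, nothing to do
    fi
    printf '%s\n' "$line" >> "$file"
}

# Usage, commented out so that pasting this block changes nothing:
# if ensure_line /etc/aliases "root:           wescroot@wesc.ac.uk"; then
#     /usr/bin/newaliases
# fi
```

The exit status plays the role of the aliaseschanged class: the dependent command only runs when the edit really happened.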

SysAdmin

Aftermath

April 19th, 2006

Happily I can report that my two Sun V880 servers are now running again. I discovered early this morning that the reason I couldn’t boot from DVD is that the Toshiba DVD drives on these servers have a firmware bug which means they cannot read 2k sectors. So I had to spend some time downloading the Solaris 10 CD image instead. If anyone is interested in the Solaris equivalent of booting from a rescue disk, here it is.

At the OpenBoot prompt (you may have to send BREAK to the boot process to get the prompt, which on minicom is Ctrl-A F) type:

boot cdrom -s

and wait for what seems like an eternity while it boots up. You are then free to mount filesystems and fsck things to your heart’s content.

Of course once I had the Sun boxes up and running we discovered that the Origin300 cluster was completely dead. The PROM has lost all its hardware info.

Current Mood: fatalistic

SysAdmin

Mr Murphy, I see you’ve brought your SledgeHammer

April 18th, 2006

Came into work to find that about half my machines were down. Mostly just switched off, but some in indeterminate stages of “not working right”. Powered on all the machines that were obviously down and re-booted the others (I wouldn’t normally do that, but with so many machines down I had to remove the easy targets, not to mention the fact that console management in the machine room is woefully inadequate for anything more than brute-force diagnostics).

So at this point it was becoming fairly clear that we had had a little incident with the power supply. While waiting to see what would come back up I tried to log into my mail and found that down also. Clearly whatever interrupted our supply hit the whole building. A look through the logs of those machines that were up shows that everything re-booted on Saturday morning. Definitely a power problem.

Further prodding of those services that refuse to come up reveals two culprits: bouscat (the server that world+dog get dumped on) has let the magic smoke out, though it can probably be frankensteined. More seriously, the two Sun V880 servers that run one of our main data-stores are refusing to boot. This elicits much running about in search of the correct serial cable (see previous post in which I clearly caught the attention of the fuckup fairy).

After finding a cable that would give me console access I discover that the root filesystem is failing fsck and mounting read-only. This apparently is enough to completely hose a Solaris 8 box. Oh Joy. After consulting with #solaris I can poke OpenBoot in the requisite manner to boot in single-user mode. This gets me precisely nowhere. Even in single-user mode it refuses to get me to a shell where I could do anything useful. Tonight I shall mostly be reading the Solaris documentation to try to work out how to get the box to boot from DVD so that I have some hope of fixing it from the rescue system.

Current Mood: Maybe London Wasn’t So Bad

SysAdmin

Quick OpenSSL Tip

March 6th, 2006

Here is how I create random passwords:

head -c 8 /dev/random > /tmp/xxx
openssl enc -base64 -in /tmp/xxx | head -c 8

Is there a better way?
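One possible answer, offered as a sketch rather than the definitive way: openssl has a rand subcommand that reads from its own random generator and can base64-encode the result, so no temporary file with the raw bytes is left behind in /tmp. The random_password function name is made up for illustration, and the fallback branch assumes a coreutils-style base64 command is available when openssl is not.

```shell
# Hypothetical helper: print an 8-character random password.
# 6 random bytes encode to exactly 8 base64 characters, with no "=" padding.
random_password() {
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -base64 6
    else
        # Assumed fallback when openssl is not on the PATH.
        head -c 6 /dev/urandom | base64
    fi
}

random_password
```

Like the original two-liner this gives 48 bits of randomness in 8 characters; asking for more bytes (in multiples of 3, to avoid padding) gives a longer password.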

SysAdmin