Category Archives: SysAdmin

Solaris 10 – A Wailing and Gnashing of Teeth

I’ve been playing around with Solaris 10 on my development V880 at work and generally enjoying it. ZFS is a thing of beauty and Zones have turned out to be very helpful. It’s not unusual for a researcher to come along needing to set up gigabytes of database temporarily while they try out some things. Being able to throw a zone up and give them full access to it without worrying about them messing up your server is very handy.

Of course the patching system on Solaris leaves something to be desired. The pretty GUI tools that Sun recommends don’t work if you have zones running on your machine. So I’ve resorted to using PCA, which works but is horribly slow.

Anyway, on Thursday I get a call from Information Services telling me they have been informed by JANET that hostx is aggressively scanning the network, so they have disconnected the network port it lives on. hostx happens to be a Solaris 10 zone on the aforementioned V880. So I wander into the machine room, log into the console and halt the zone. A quick look through the filesystem reveals the traces of the Solaris Telnet Worm. Great. In six years, the only machine I’ve had hacked.

Eventually I work out what happened. When the advisory about the telnet vulnerability came out I disabled telnet on the Solaris 10 machines and voiced my displeasure that telnet was enabled by default. However for reasons that I can’t explain I didn’t actually patch the machine.

A month later I set up a new zone for one of the researchers. And it turns on telnet again. Because that’s the sensible thing to do. Zone gets rooted, I look like a moron.

I’m going to go away and put PCA into a cron job like I should have done originally.

Solaris 10 and Telnet

Yes, it’s still turned on by default, at least in Solaris 10 update 2. That’ll teach me for not port-scanning new *nix installs where I’m not completely familiar with the OS defaults.
Seriously though, is there a good reason for telnet to be on by default in this day and age?


So it’s Sunday and I decide to run apt-get on my Debian box which hosts this here blog and the peapod project page. MySQL 5 gets an update. apt-get stops it, apt-get starts it, and it dies on its arse.

Feb 11 15:54:02 hlynes mysqld[707]: 070211 15:54:02 [Note] /usr/sbin/mysqld: Shutdown complete
Feb 11 15:54:02 hlynes mysqld[707]:
Feb 11 15:54:03 hlynes mysqld_safe[6754]: ended
Feb 11 15:55:46 hlynes init: Trying to re-exec init
Feb 11 16:01:45 hlynes mysqld_safe[10043]: mysqld got signal 11;
Feb 11 16:01:45 hlynes mysqld_safe[10043]: This could be because you hit a bug. It is also possible that this binary
Feb 11 16:01:45 hlynes mysqld_safe[10043]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 11 16:01:45 hlynes mysqld_safe[10043]: or misconfigured. This error can also be caused by malfunctioning hardware.
Feb 11 16:01:45 hlynes mysqld_safe[10043]: We will try our best to scrape up some info that will hopefully help diagnose
Feb 11 16:01:45 hlynes mysqld_safe[10043]: the problem, but since we have already crashed, something is definitely wrong
Feb 11 16:01:45 hlynes mysqld_safe[10043]: and this may fail.

After much googling it turns out to be caused by the fact that this is a Bytemark virtual machine running under UML (User-mode Linux). Apparently the TLS (thread-local storage) libraries that are part of NPTL do some very strange things with memory that work just fine on a normal kernel but not on one running under UML.

Anyway the workaround is to move /lib/tls out of the way. Apparently this can fox apache also.
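For the record, "out of the way" can be as simple as a rename. This is a minimal sketch rather than the exact command I ran: the function name and the .disabled suffix are my own invention, and the directory argument is only there so you can try it on a scratch directory first.

```shell
# Sketch: move the TLS-enabled libraries aside so the loader falls back to
# the plain copies in /lib. The ".disabled" suffix is an arbitrary choice.
disable_tls_libs() {
    # $1 is the lib directory; defaults to /lib (override it for a dry run)
    libdir="${1:-/lib}"
    if [ -d "$libdir/tls" ]; then
        mv "$libdir/tls" "$libdir/tls.disabled"
    fi
}

# As root on the affected machine, then restart the daemon that crashed:
#   disable_tls_libs
#   /etc/init.d/mysql start
```

Renaming rather than deleting means the directory can be put back if something else turns out to need it.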

Wikis, Wikis Everywhere

One of the first things I did when I arrived at my current job was to whack a copy of MediaWiki on one of my servers so that I had somewhere to write ad-hoc documentation. As time has passed we have built up quite a few docs and had to lock down the wiki, as some of the documents are externally viewable.

It has become apparent that we could do with moving to a wiki that is more suited to our current usage. So my list of wiki requirements is:

  • Be able to re-use apache authentication since we already have working mod_auth_pam setups talking to central LDAP.
  • Different parts of the wiki editable/viewable by different users.
  • Pages can be public or only viewable by logged in users.
  • Able to use multiple different auth mechanisms concurrently, e.g. Apache, PAM, wiki login.
  • Themes/stylesheets for different sections of the wiki
  • File upload/attachments. I don’t want people to have to ftp documents and link to them manually.

So far the contenders are Midgard, MoinMoin and TWiki. I’m leaning towards Moin because it’s written in Python. What I’d really like to hear is some comments from people who have similar requirements. What wiki did you go with and what were your experiences?


Backups Part 3: Rotating and Culling Backups

I’ve now got backup scripts happily creating copies of all my subversion repositories and MySQL databases every 24 hours. This is great, but it means you end up with an awful lot of backups. I really don’t need a backup from every day going back forever. But it is nice to have snapshots going back into history in case something subtle has gone wrong.

What I’d really like is to copy the oldest backups into a different directory every seven days, and delete all the backups in the main directory that are older than seven days. Of course I’ll then end up with piles of backups building up in the weekly directory. So I’d like to go through the weekly directory every month, copy the oldest backups into another directory and delete all the weekly backups that are more than one month old.

To do this I give you the snappily named rotate-backups

rotate-backup            rotates backups in a given directory weekly,or monthly

-b                       directory to rotate backups in
-f                       file containing list of backup dirs
-t                       time period (weekly,monthly)

The config file is just a newline-separated list of directories. To make it work I put a script in /etc/cron.weekly like:

rotate-backups -t weekly -f /etc/rotate-backups

and one in cron.monthly:

rotate-backups -t monthly -f /etc/rotate-backups

The backup script makes an assumption that backups created on the same day are from the same backup run. It copies the ‘oldest’ backups by copying files from the same day as the oldest file in the directory. This way it doesn’t have to know anything about what you are backing up or what your naming conventions are.

Also it culls old backups relative to the date of the latest file in the directory. This means that if you stop taking backups the script won’t keep deleting files until you have none left.
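The two rules above are easier to see in code. What follows is a rough sketch of one pass, not the rotate-backups script itself: the function name rotate_dir and its positional arguments are made up for illustration, it relies on GNU find's -printf, and it will trip over filenames containing newlines.

```shell
# Hypothetical sketch of one rotation pass; needs GNU find.
# usage: rotate_dir SRC_DIR DEST_DIR DAYS
rotate_dir() {
    src="$1"; dest="$2"; days="$3"

    # One line per file: "<mtime epoch> <YYYY-MM-DD> <path>", oldest first.
    listing=$(find "$src" -maxdepth 1 -type f \
                  -printf '%T@ %TY-%Tm-%Td %p\n' | sort -n)
    [ -n "$listing" ] || return 0   # empty directory, nothing to do

    # Files sharing a day with the oldest file count as one backup run.
    oldest_day=$(printf '%s\n' "$listing" | head -n 1 | cut -d' ' -f2)

    # Cull relative to the newest file, not to today's date, so the
    # directory never empties itself if the backups stop.
    newest_epoch=$(printf '%s\n' "$listing" | tail -n 1 | cut -d' ' -f1)
    cutoff=$(( ${newest_epoch%.*} - $days * 86400 ))

    mkdir -p "$dest"
    printf '%s\n' "$listing" | while read -r epoch day file; do
        if [ "$day" = "$oldest_day" ]; then
            cp -p "$file" "$dest/"      # promote the oldest run
        fi
        if [ "${epoch%.*}" -lt "$cutoff" ]; then
            rm -f "$file"               # older than DAYS before the newest
        fi
    done
    return 0
}

# e.g. the weekly pass:  rotate_dir /var/backups/mysql /var/backups/mysql/weekly 7
```

The weekly subdirectory in that last line is only an example layout. Because the loop only looks at plain files directly inside SRC_DIR, a destination directory nested under it is left alone.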

Backups – Part 2: Subversion

We run a subversion service for a number of different research groups. Generally we create a separate subversion repository for each group. Obviously looking after this data is important. It’s not a good idea to lose months of someone’s work.

Fortunately backing up a subversion repository is pretty simple. Subversion ships with a utility called svnadmin, one of the functions of which is to dump a repository to a file.

As the repository owner do:

svnadmin dump /path/to/repository > subversion_dump_file

I have a directory full of subversion repositories so what I really want is a script that will find all the subversion repositories in the directory and dump them with sensible filenames. With my usual lack of imagination I’ve called this script svn_backup. It runs svnadmin verify against each file in the directory it’s given. If any of them turn out to be subversion repositories it dumps them using svnadmin dump.
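The core of that loop is only a few lines. This is a cut-down sketch, not the script itself: the function name dump_repos, the positional arguments and the dump filename pattern are all assumptions made for illustration, and the real thing takes the flags shown in the usage message.

```shell
# Sketch only: verify each entry, dump the ones that are repositories.
# usage: dump_repos SUBVERSION_DIR [BACKUP_DIR]
dump_repos() {
    svn_dir="$1"; backup_dir="${2:-/tmp}"
    stamp=$(date +%Y%m%d)
    for entry in "$svn_dir"/*; do
        # svnadmin verify fails on anything that is not a repository,
        # so non-repositories are skipped silently.
        if svnadmin verify -q "$entry" >/dev/null 2>&1; then
            out="$backup_dir/$(basename "$entry")-$stamp.svndump"
            # Print the dump's name on success so it can be logged.
            svnadmin dump -q "$entry" > "$out" && echo "$out"
        fi
    done
}

# e.g.  dump_repos /var/www/subversion /var/backups/svn >> /tmp/svn.log
```

Printing each filename on stdout is what makes it easy to keep a log of what was written.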

$ ./svn_backup
svn_backup  -s subversion_dir [-b backup_dir] [ -o logfile]
        script to backup subversion repositories

        -s      directory containing subversion repositories
        -b      directory to dump backups to. defaults to /tmp
        -o      file to output names of backups to

So I now have an entry in cron.daily like:

svn_backup -s /var/www/subversion -o /tmp/svn.log -b /var/backups/svn

The reason I write the backups to a log file is that it allows me to run a script once a week that copies the latest backups to Amazon’s S3 storage system.

The scripts:

Backups – Part 1: MySQL

Like most people I’ve got a number of MySQL servers with different databases running things like wikis and other web apps. MySQL ships with a handy little tool called mysqldump which can be used to dump (as the name suggests) a MySQL DB to a text file containing the SQL commands necessary to re-create it.

The first thing I like to do is to create a backup user that only has enough privileges to do the backup.
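Something along these lines should do it. Treat it as a sketch under assumptions: the privilege list (SELECT and LOCK TABLES) is my guess at the minimum a plain mysqldump needs, so add more if your dumps need them, and the GRANT ... IDENTIFIED BY form is MySQL 5.x syntax that newer servers no longer accept. The helper name backup_user_sql is made up; it only prints the statement so you can read it before piping it anywhere.

```shell
# Prints a GRANT statement for a dump-only account (MySQL 5.x syntax).
# usage: backup_user_sql USER HOST PASSWORD
backup_user_sql() {
    user="$1"; host="$2"; pass="$3"
    printf "GRANT SELECT, LOCK TABLES ON *.* TO '%s'@'%s' IDENTIFIED BY '%s';\n" \
        "$user" "$host" "$pass"
}

# Run it as a MySQL admin user:
#   backup_user_sql backup localhost mypasswd | mysql -u root -p
```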


With this done you should be able to do something like:

mysqldump -hlocalhost -ubackup -pmypasswd --databases mysql > test_backup.mysql

With that in place it was an easy task to write a script that can read a config file for the backup login info and options to pass to mysqldump. This script has a number of benefits over sticking a mysqldump command straight into cron:

  1. It stores its settings in an external config file so you can use the same script in several settings.
  2. It backs up each database on the server into a separate dump file.
  3. The backup options are in the config file so you can back up different servers with different tweaks, e.g. locking the DB for consistency.

The line in the config file looks like:

localhost:backup:mypasswd:--opt --add-drop-table
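Splitting a line like that and looping over the databases takes very little shell. The sketch below is an illustration, not the mysql_backup script: the function names, the variable names and the .sql output naming are all hypothetical, and it skips information_schema as a precaution.

```shell
# Sketch: split one "host:user:password:options" line into its fields.
# Everything after the third colon is kept as the mysqldump options.
parse_backup_line() {
    IFS=: read -r db_host db_user db_pass db_opts <<EOF
$1
EOF
}

# Hypothetical per-database loop built on top of it.
# usage: dump_all CONFIG_LINE BACKUP_DIR
dump_all() {
    parse_backup_line "$1"; backup_dir="$2"
    # -N drops the column header so each output line is a database name
    mysql -h"$db_host" -u"$db_user" -p"$db_pass" -N -e 'SHOW DATABASES' |
    while read -r db; do
        if [ "$db" = information_schema ]; then
            continue
        fi
        # $db_opts is deliberately unquoted so it splits into separate flags
        mysqldump -h"$db_host" -u"$db_user" -p"$db_pass" $db_opts \
            --databases "$db" > "$backup_dir/$db.sql"
    done
}

# e.g.  dump_all "$(head -n 1 /etc/mysql_backup)" /var/backups/mysql
```

One dump file per database means a single wiki can be restored without touching anything else on the server.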

I add a quick script to /etc/cron.daily like

mysql_backup -b /var/backups/mysql -f /etc/mysql_backup

I can now sleep a bit easier. This isn’t the only way to back up MySQL; proper DBAs will know all about replication and various other tricks.

Next time: subversion repositories.

Backup Scripts

Over the last couple of weeks I’ve been chipping away at the problem of our department having no backups whatsoever. As we are a small department with few machines and a fairly small amount of data, I’ve decided that systems like Bacula and Amanda are overkill for our situation.

I’ve written a set of small scripts to handle our most pressing backup needs. Over the next few posts I’ll describe how I’ve backed things up and the scripts and tools I’ve used to do it. None of this is rocket science, but if it saves even one person one hour of work it’ll have been worth writing down.

All the scripts in the next few posts can be found in the WeSC subversion repository.

Part 1: MySQL

Part 2: Subversion

Part 3: Rotating and Culling Backups

Small Victories

I’ve moaned about the state of the SGI cluster here at WeSC before. I’m now happy to report that it’s almost back to full fitness.

After playing about with the external CD-ROM a bit more it became apparent that it would only behave itself if you followed a careful procedure of powering down the Origin it was attached to; power-cycling the CD-ROM and then powering the Origin back on.

At this point the internal disk of the dead node wasn’t even showing up on the SCSI bus, so it was definitely knackered. Fortunately the UK is home to a fine purveyor of second-hand SGI parts. We ordered a replacement drive (complete with SGI firmware) and it arrived the next day. Ian Mapleson (for it is he that runs the SGI depot) has also written some excellent articles on Irix administration, one of which details an easy way to clone a root disk. With this info I was able to clone one of the other nodes in the cluster. A quick edit of /etc/sys_id so that it won’t wake up thinking it’s the wrong machine and we are ready to go.

The drive goes into the dead machine, we power it up and hey presto! one working Irix box.

I am jubilant until I realise that all the nodes of the cluster share the same CXFS volume and that this node no longer has a valid CXFS license. And of course we have no backups (actually this isn’t quite true: it later transpired that there were copies of the license file on another machine, but a backup that isn’t documented isn’t a very useful backup).

I put in a support call to SGI (remembering that this machine has no maintenance contract) without much hope. The very next day SGI email me the license file! SGI, you may be at death’s door but you are lovely people.

At this point all that’s left to do is to debug the Condor install, which doesn’t seem to be working properly. But that is somebody else’s problem.