Dealing With Stupid Programs That Think They Need X

The new compute cluster is beginning to feel like a production system. I’m currently run off my feet installing software for the stream of new users. Mostly this is fine, but occasionally I run into software that makes me want to bang my head repeatedly on my desk until the pain goes away or, more accurately, makes me want to bang the programmer’s head on the desk.

Just today we received a Linux port of a code that has been running on the Windows Condor pool for a while now. Everything seemed fine except for its stubborn refusal to run if it couldn’t find a windowing system. Bear in mind that it doesn’t actually produce any graphical output; it just dies if it can’t connect to X. After a bit of futzing around we discovered that the people who normally run this code do something like:

Xvfb :1 -screen 0 1024x1024x8 &
export DISPLAY=:1
./stupid_code_that_wants_X

Xvfb is the X virtual framebuffer. It gives you a running X server that draws into memory, so no graphics hardware or physical display is needed.

That works just great locally, but if you want to launch it as a script under the job scheduling system (we use PBSpro) you need to be a bit more careful. What happens if two of these jobs land on the same machine? One of them will fail because display 1 is already taken. What I really needed was a script that tries to launch Xvfb and, on failure, increments the display number until it finds one that is free. For your edification, here it is:

get_xvfb_pid () {
	# PID of the last Xvfb listed by ps that belongs to $USERNAME (empty if none)
	XVFB_PID=`ps -efww | grep -v grep | grep Xvfb |\
		grep "$USERNAME" | tail -n 1 | awk '{print $2}'`
	}

create_xvfb () {
	# Start Xvfb on the first free display, beginning at :1
	USERNAME=`whoami`
	DISPLAYNO=1
	xvfb_success=""
	while [ -z "$xvfb_success" ]
		do
		# Remember the newest Xvfb we already own, if any
		get_xvfb_pid
		old_XVFB_PID=$XVFB_PID
		XVFB_PID=""
		Xvfb :${DISPLAYNO} -screen 0 1024x1024x8 > /dev/null 2>&1 &
		sleep 1
		get_xvfb_pid
		# Success means one of our Xvfb processes is running and it is
		# not the one we saw before launching
		if [ -n "$XVFB_PID" ] && [ "$XVFB_PID" != "$old_XVFB_PID" ]
			then
			echo "Started XVFB on display $DISPLAYNO process $XVFB_PID"
			xvfb_success=1
		else
			# The new server exited, most likely because the display
			# was taken, so try the next one
			DISPLAYNO=$(($DISPLAYNO + 1))
			XVFB_PID=""
		fi
		done
	export XVFB_PID
	export DISPLAY=:${DISPLAYNO}
	}

kill_xvfb () {
	kill $XVFB_PID
	}

You can source the helper and call it from a script or a shell like this:

[arccacluster8]$ . ./xvfb_helper
[arccacluster8]$ create_xvfb
Started XVFB on display 1 process 9563
[arccacluster8 ~]$ echo $DISPLAY
:1
[arccacluster8 ~]$ echo $XVFB_PID
9563
[arccacluster8 ~]$ ps -efw | grep Xvfb
username    9563  9498  0 19:31 pts/8    00:00:00 Xvfb :1 -screen 0 1024x1024x8
[arccacluster8 ~]$ kill_xvfb
[arccacluster8 ~]$ ps -efw | grep Xvfb
[arccacluster8 ~]$
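
A variation on the same idea, offered as a sketch rather than as part of the helper above: the shell already knows the PID of the last background job as $!, and kill -0 tests whether that process is still alive, so there is no real need to grep through ps. The names start_xvfb and stop_xvfb, the upper limit on display numbers, and the XVFB_CMD override (there only so the loop can be tried on a machine without Xvfb) are all assumptions made for illustration.

```shell
# Sketch only: start_xvfb, stop_xvfb and XVFB_CMD are hypothetical names.
# XVFB_CMD lets a stand-in command replace Xvfb when trying the loop out.

start_xvfb () {
	# usage: start_xvfb [first_display] [last_display]
	xvfb_cmd="${XVFB_CMD:-Xvfb}"
	xvfb_display="${1:-1}"
	xvfb_last="${2:-99}"
	while [ "$xvfb_display" -le "$xvfb_last" ]
		do
		"$xvfb_cmd" ":${xvfb_display}" -screen 0 1024x1024x8 > /dev/null 2>&1 &
		XVFB_PID=$!
		# Give the server a moment to either settle or exit
		sleep 1
		if kill -0 "$XVFB_PID" 2> /dev/null
			then
			# Still running, so assume it claimed the display
			DISPLAY=":${xvfb_display}"
			export XVFB_PID DISPLAY
			echo "Started $xvfb_cmd on display $xvfb_display process $XVFB_PID"
			return 0
		fi
		# It exited (most likely the display was taken): reap it and move on
		wait "$XVFB_PID" 2> /dev/null || true
		xvfb_display=$((xvfb_display + 1))
		done
	XVFB_PID=""
	echo "No free display up to :$xvfb_last" >&2
	return 1
	}

stop_xvfb () {
	# Kill the server started by start_xvfb, if there is one
	if [ -n "$XVFB_PID" ]
		then
		kill "$XVFB_PID" 2> /dev/null || true
		XVFB_PID=""
	fi
	}
```

In a job script this would probably mean sourcing the file, calling start_xvfb, setting trap stop_xvfb EXIT so the server is cleaned up even if the job dies, and then running the program. The one-second sleep is still a guess at how long Xvfb takes to give up on a busy display, so the race has not gone away; it has merely become tidier.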

I submit that this is a disgraceful hack, but it might come in handy to someone else.