Resource Manager: Torque
This is the second part of a four-part tutorial on installing and configuring a queuing system and scheduler. The full tutorial includes:
- Using a Scheduler and Queue
- Resource Manager: Torque
- Scheduler: Maui
- Torque and Maui Sanity Check: Submitting a Job
There is also a troubleshooting page:
From the Cluster Resources page on Torque,
- "TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project..."
Because torque branched off from PBS, it still retains a lot of the old commands and names. PBS stands for Portable Batch System. From here on, I'll still call it torque, but many commands have "pbs" in them rather than "torque".
Before you get and install torque, you'll want to make sure you have all the necessary compilers installed. If you don't, the configure script will give you errors about which ones you're missing.
To get the most recent version of torque, visit http://www.clusterresources.com/downloads/torque/ and find the most recent version. At the time of this writing, that happens to be torque-2.2.1.tar.gz. Copy the link location of the file. From
/usr/local/src, issue the following command for the most current file:
wget http://www.clusterresources.com/downloads/torque/torque-2.2.1.tar.gz
Next, untar the file with
tar xvf torque-2.2.1.tar.gz
Move into the directory that just created with
cd torque-2.2.1, or whatever your directory is named. We're ready to run
./configure (as part of the Source Installation Paradigm, which you might want to check out if this seems unfamiliar to you). We'll add a number of arguments to configure in order to let torque know we want a server, and how to set up the server. To see all of the possible arguments, type
./configure --help. What we'll use is this:
./configure --with-default-server=<your server name> --with-server-home=/var/spool/pbs --with-rcp=scp
--with-default-server specifies the head node, which will run the torque server process. Be sure to replace
<your server name> with your actual head node's hostname!
--with-server-home sets the directory where torque will run from.
/var/spool/pbs is by no means standard, but it's the paradigm I'll be using. Others use a directory like
/home/torque; I don't like confusing my processes with users.
--with-rcp=scp sets the default file-copying mechanism. Technically, scp (for secure copy) is the default, but if you don't specify it and
scp isn't found, configure will move on to the next mechanism it finds, which we don't want.
If ./configure finishes successfully, you're ready to move on to the next step. If not, address the issues before running the command again. When it does finish successfully, it will end with a line like
config.status: executing depfiles commands, but no message about being finished. Next, run
make
A lot of what looks like gibberish will scroll by, and it may take somewhere around five minutes. Again, it will finish without a confirmation message. The last part of the script finished on mine with
make: Leaving directory `/usr/local/src/torque-2.2.1/doc'
make: Entering directory `/usr/local/src/torque-2.2.1'
make: Nothing to be done for `all-am'.
make: Leaving directory `/usr/local/src/torque-2.2.1'
Finally, you're ready to run
make install
You won't get a confirmation message for this, either, and it'll finish similarly to the way the last one finished. To make sure it was installed correctly, try using
which to locate one of the binaries, like this:
gyrfalcon:~# which pbs_server
/usr/local/sbin/pbs_server
If it can't find it, double check that the binary was installed with
gyrfalcon:~# ls /usr/local/sbin | grep pbs
pbs_demux  pbs_iff  pbs_mom  pbs_sched  pbs_server
If it's there in
/usr/local/sbin but
which doesn't find it, you'll need to edit
/etc/login.defs. Locate the line for
ENV_SUPATH and add
/usr/local/sbin to it. The line for
ENV_PATH should be right below it; add
/usr/local/bin to it.
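For reference, the relevant lines might look like the following after editing. The exact default PATH values vary by distribution, so treat this as a sketch rather than a drop-in replacement:

```
# /etc/login.defs (fragment)
# PATH for root: /usr/local/sbin added
ENV_SUPATH      PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# PATH for ordinary users: /usr/local/bin added
ENV_PATH        PATH=/usr/local/bin:/usr/bin:/bin
```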
To start the torque server running on the head node and create a new database of jobs, issue
pbs_server -t create
Now, if you run
ps aux | grep pbs, you'll see the server running. However, if you run
qstat -q to list the queues and their statuses,
you'll see nothing, because no queues have been set up yet. To begin configuring queues for torque, we need
qmgr, an interface to the batch system. You can run
qmgr
to start it up in interactive mode, or enter the commands one at a time on the command line:
qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=1"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
Additionally, you can run commands to set the server's operators (its administrative users):
qmgr -c "set server operators = root@localhost"
qmgr -c "set server operators += kwanous@localhost"
At this point, running
qstat -q to view available queues should give you something like this:
gyrfalcon:~# qstat -q

server: gyrfalcon

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   0 --   E R
                                                ----- -----
                                                    0     0
Excellent, we have a queue called "batch" and it's empty. You can also view your qmgr settings with
qmgr -c "print server"
Time to try submitting a job to the queue. First, switch over to a different user account (don't run this as root) with
su - <username>. Then, try to submit a job that just sleeps for thirty seconds and does nothing:
echo "sleep 30" | qsub
The purpose of this is to see whether the job shows up in the queue when you run
qstat after submitting it. Below is a transcript of my test.
kwanous@gyrfalcon:~$ echo "sleep 30" | qsub
0.gyrfalcon
kwanous@gyrfalcon:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.gyrfalcon               STDIN            kwanous                0 Q batch
Excellent, the job shows up! Unfortunately, though, it won't run... the state is "Q" (I assume for "queued"), and it needs to be scheduled. That's what we'll install Maui for later.
Introducing Torque to the Worker Nodes
Now we need to tell the
pbs_server which worker nodes are available and will be running
pbs_mom, a client that allows the server to give them jobs to run. We do this by creating the file
/var/spool/pbs/server_priv/nodes. With your favorite text editor, add each worker node's hostname on a line by itself. If a node has more than one processor, append
np=X to its line, where X is the processor count. Mine looks like this:
eagle np=4
goshawk np=4
harrier np=4
kestrel np=4
kite np=4
osprey np=4
owl np=4
peregrine np=4
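If you already keep a machines file listing your worker hostnames (as in the Cluster Time-saving Tricks page), a short loop can generate the nodes file for you. This is just a sketch: the example hostnames and np=4 are assumptions, and on a real head node the output would go to /var/spool/pbs/server_priv/nodes.

```shell
#!/bin/sh
# Sketch: build a torque nodes file from a list of worker hostnames.
# The hostnames and NP=4 below are examples -- substitute your own.
# On the head node, write to /var/spool/pbs/server_priv/nodes instead of ./nodes.
NP=4

# Example machines file, one hostname per line
printf '%s\n' eagle goshawk harrier kestrel > machines

# Append "np=$NP" to each hostname
while read host; do
    echo "$host np=$NP"
done < machines > nodes

cat nodes
```

Adjust NP per node (or per group of nodes) if your cluster isn't uniform.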
With that, torque configuration on the head node is done.
Installing Torque on the Worker Nodes
Now we need to install a smaller version of torque, called
pbs_mom, on all of the worker nodes. Move back into the directory we untarred earlier,
/usr/local/src/torque*. There's a handy way to create the packages for the torque clients. Run
make packages
and they'll be created for you. This time you'll get a confirmation message:
Done. The package files are self-extracting packages that can be copied and executed on your production machines. Use --help for options.
You'll see some new files in the directory now if you run an
ls. The one we're interested in is
torque-package-mom-linux-*.sh, where the * is your architecture. We need to copy that file to all the worker nodes. You can either copy it over to a shared NFS mount, or see my Cluster Time-saving Tricks on how to
copy a file to all the nodes using the
rsync command. I'm copying it over to my NFS mount with
cp torque-package-mom-linux-i686.sh /shared/usr/local/src/
Once it's on each worker node, they each need to run the script with
./torque-package-mom-linux-i686.sh --install
You have a couple of options for doing this on each node. You can ssh over and run it manually, or you can check out my Cluster Time-saving Tricks page to learn how to write a quick script to run the command over ssh without having to log into each node. If you're going with the second route, the command to use is
for x in `cat machines`; do ssh $x /<full path to package>/torque-package-mom-linux-i686.sh --install; done
Before we can start up
pbs_mom on each of the nodes, they need to know who the server is. You can do this by creating a file
/var/spool/pbs/server_name that contains the hostname of the head node on each worker node, or you can copy the file to all of the nodes at once with a short script (assuming you've created a file at
~/machines with the hostnames of the worker nodes as outlined in the Cluster Time-saving Tricks page):
for x in `cat ~/machines`; do rsync -plarv /var/spool/pbs/server_name $x:/var/spool/pbs/; done
Next, if you're using a NFS-mounted file system, you need to create a file on each of the worker nodes at
/var/spool/pbs/mom_priv/config with the contents
$usecp <full hostname of head node>:<home directory path on head node> <home directory path on worker node>
The path is the same for me on my head node or worker node, and my file looks like this:
$usecp gyrfalcon.raptor.loc:/shared/home /shared/home
Again, this file can be created on each of the worker nodes, or you can create it and copy it over to each of the nodes. If you're using the latter technique, assuming you've created a
machines file with all the host names, and you've created a
config file, the command to run from the head node is
for x in `cat ~/machines`; do rsync -plarv config $x:/var/spool/pbs/mom_priv/; done
After you've done that,
pbs_mom is ready to be started on each of the worker nodes. Again, you can ssh in to each node and run
pbs_mom, or the script equivalent is
for x in `cat ~/machines`; do ssh $x pbs_mom; done
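To check that the moms actually came up, you can loop over the machines file again and look for the process on each node. This is a sketch, not something from the torque distribution: the hostnames are examples, and the RSH variable (defaulting to ssh with a short timeout) is just there so the loop fails fast on unreachable nodes.

```shell
#!/bin/sh
# Sketch: report whether pbs_mom is running on each worker node.
# RSH is an assumption of this sketch -- it lets you swap the remote-run
# command; the default is ssh with a short, non-interactive timeout.
RSH=${RSH:-"ssh -o BatchMode=yes -o ConnectTimeout=2"}

# Example machines file -- on your cluster, use the ~/machines file
# you created earlier instead.
printf '%s\n' eagle goshawk > machines

for x in `cat machines`; do
    if $RSH $x pgrep pbs_mom > /dev/null 2>&1; then
        echo "$x: pbs_mom running"
    else
        echo "$x: pbs_mom NOT running"
    fi
done | tee mom_report
```

Any node reported as not running is worth an ssh visit to check /var/spool/pbs/mom_logs.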
Everyone Playing Nice on Torque
Finally, it's time to make sure the server monitors the pbs_moms that are running. Terminate the current queues with
qterm
and then start up the pbs server process again with
pbs_server
Then, to see all the available worker nodes in the queue, run
pbsnodes -a
(I don't know why this command doesn't have an underscore.) Each of the nodes should check in with a little report like my node peregrine's below.
peregrine
     state = free
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux peregrine 2.6.21-2-686 #1 SMP Wed Jul 11 03:53:02 UTC 2007 i686,sessions=? 0,nsessions=? ,nusers=0,idletime=1910856,totmem=3004480kb,availmem=2953608kb,physmem=1028496kb,ncpus=8,loadave=0.00,netload=180898837,state=free,jobs=,varattr=,rectime=1200191204
Ready to continue? Move on to installing Maui, the scheduler.