Technical Note - Grid Computing: SCS Grid
Summary
There doesn't seem to be any ECS-focused documentation aimed at the users wishing to run jobs on the SCS Grid,
so here are the basics.
A more complete description can be found in the provider's documentation.
Details
General
ECS administers two "Grids", known in administration circles as the ECS Grid and the SCS Grid.
You may also find these referred to as the SGE Grid and the Condor Grid respectively.
The two are separate.
The SCS Grid runs under control of a Condor instance and exists to make use of the computing
power of the University's ITS-provided student computing service (SCS) lab machines at times
when they are unused, i.e. when they should have no one logged in at the console.
Jobs, basically text files describing the tasks to be carried out and the files needed to do so, are submitted
from the Condor master machine (users will thus need to obtain a login account for this machine through ITS)
into the Condor queuing system, where they wait their turn until an "unused" machine is able to start them.
(Note that being able to start them is not the same as being able to run them to completion.)
At present, users at the console of a machine have priority over Grid jobs running on the same machine
to the extent that a Grid job will be suspended upon a machine where there is console activity, so users
submitting Grid jobs should be aware that there is no guaranteed run time for any given task.
Basically, it'll finish when it finishes.
The Condor Master
The ITS machine from which all Condor activities are carried out is
vuwunicondgrd01.vuw.ac.nz
As of November 2010 the machine may be accessed using the name
condor01.vuw.ac.nz
The Condor Master runs Red Hat Enterprise Linux (RHEL) and ssh access is via port 10 (this is an ITS standard, apparently).
If your username on the Condor Master does not match your ECS username, you will need to type:
ssh vuwunicondgrd01.vuw.ac.nz -p 10 -l username
to log on; otherwise, with matching usernames, a simple
ssh vuwunicondgrd01.vuw.ac.nz -p 10
will suffice.
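If typing the port and username each time becomes tedious, an OpenSSH client lets you record them once in its per-user config file. The following is a minimal sketch, assuming OpenSSH on your ECS machine; the alias `condor`, the function name `add_condor_alias` and the placeholder `username` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the alias "condor" and the function name are hypothetical;
# replace "username" with your login on the Condor Master.

# Append a host entry to an ssh client config file (default: ~/.ssh/config).
add_condor_alias() {
    config_file="${1:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<'EOF'
Host condor
    HostName vuwunicondgrd01.vuw.ac.nz
    Port 10
    User username
EOF
}

# Usage (uncomment to apply to your own config):
# add_condor_alias
# ssh condor
```

With such an entry in place, `ssh condor` should be enough to log on, and scp or rsync run over ssh should pick up the same port and username.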
Your home directory on the Condor Master will be of the form
/home/username
You can use scp to move files between your ECS filestore and the Condor Master, e.g.
scp -P 10 localfile username@vuwunicondgrd01.vuw.ac.nz:/path/to/remotedir/.
scp -P 10 username@vuwunicondgrd01.vuw.ac.nz:file_in_homedir.txt /path/to/local/dir/.
It is possible to use rsync to maintain a level of synchronisation between directories
within ECS filestore and on the Condor Master, e.g.
% hostname
some-machine.ecs.vuw.ac.nz
% cd ~/top/of/my/local/condorgrid/directories
% rsync -avi --rsh="ssh -p 10" vuwunicondgrd01.vuw.ac.nz:/home/username/remotedir/ ./remotedir/
Note that you need to tell rsync to talk ssh over port 10, via its --rsh option.
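The same --rsh option works in the other direction, for pushing input files up to the Condor Master before a submission. Below is a minimal sketch of a pair of helper functions, one per direction; the function names, the `REMOTE_USER` and `DRY_RUN` variables and the placeholder `username` are assumptions made for illustration, not part of any ITS or ECS tooling.

```shell
#!/bin/sh
# Sketch only: function and variable names here are hypothetical.
MASTER="vuwunicondgrd01.vuw.ac.nz"
REMOTE_USER="${REMOTE_USER:-username}"

# Run rsync over ssh on port 10; with DRY_RUN=1 just print the command.
run_rsync() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "rsync -avi --rsh='ssh -p 10' $1 $2"
    else
        rsync -avi --rsh="ssh -p 10" "$1" "$2"
    fi
}

# Copy a local directory up to the Condor Master before submitting.
push_to_master() {
    run_rsync "$1" "${REMOTE_USER}@${MASTER}:$2"
}

# Copy results from the Condor Master back into ECS filestore.
pull_from_master() {
    run_rsync "${REMOTE_USER}@${MASTER}:$1" "$2"
}

# Usage:
# push_to_master ./remotedir/ /home/username/remotedir/
# pull_from_master /home/username/remotedir/ ./remotedir/
```

Setting `DRY_RUN=1` first is a cheap way to check the source and destination before anything is copied.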
Setting up the environment
The machines ITS provide for the SCS run a Windows operating system and
are effectively installed using a single OS+packages image rolled out from the centre.
ECS users will thus find a limited range of native software packages, placed onto the SCS
machines at the request of various academics across the University, for use by students in
teaching labs or self-study, accessible by default.
At the time of writing, ITS does not seem to have a list of the user-accessible packages
on the SCS available.
Where Grid jobs merely require such packages to be operated in a batch mode, such jobs can
obviously make use of those packages.
In general, you will need to use the full path to the executables you wish to invoke, e.g.:
c:\"program files"\r\r-2.6.0\bin\r.exe --no-save < myscript.r
With software packages in use within ECS but not available on the SCS, ITS may consent to installing
the Windows version, if available, of that package within the image they roll out to the SCS machines.
Users new to the SCS machines should be aware that ITS are, understandably, reluctant to alter
their images once a teaching trimester is underway and there may well be issues in having multiple
versions of the same package installed at the same time. Planning ahead so as to liaise with those
who have requested packages be installed on the SCS is thus a good thing.
Users wishing to make use of the service to run bespoke codes or programs which normally operate
within an ECS, or other, UNIX environment have two options.
- Recompile sources, or otherwise ensure that any binaries run, against the Cygwin emulation software and then upload a matching Cygwin DLL as part of the job submission payload. The ITS machines do not provide, and hence do not constrain the user to, any particular Cygwin DLL version.
- Recompile sources using a Windows compiler to produce native binaries, taking care to ensure that any run-time dependencies will be resolved when running on an SCS machine. The ITS machines do provide access to one Windows compiler suite.
Do I have a home on the Grid?
Not a permanent home, no: it's more like a rented bach, with the Condor Master
as your trailer of stuff from home.
Basically, Condor would normally upload all the files needed for a task from a user's directory
on the Condor Master into a temporary directory, %_CONDOR_SCRATCH_DIR%, on each remote machine.
At the end of the job execution, Condor can download files back to filestore areas on the
Condor Master.
This means that a user might first need to copy files from their ECS filestore onto
the Condor Master so as to then make them accessible to the Condor Grid, and copy
files off the Condor Master back into their ECS filestore for post processing.
To continue with the bach/trailer metaphor, you pack it before you go, unpack when you arrive,
pack it again before you leave and unpack when you get back home.
Files in the temporary directory on machines the jobs were executed on
are not currently preserved between individual job executions.
(So someone tidies up the bach after you have been there, too).
It may be possible to have ITS load large static data sets onto the SCS image so that
such data appears at a constant path on any SCS machine; however, as mentioned
above, ITS are, understandably, reluctant to alter the image once a teaching trimester
is underway.
Where will the input and output files be?
Because this is a distributed batch processing environment, there's usually no clear
indication as to which machine(s) your job(s) will end up running on.
You thus need to give a little more thought to the location of input and output files than
if you were simply running a job on your own workstation where everything is local to
the machine.
Within the SCS Grid, Condor can stage files and directories to and from the master
machine.
Preserving results after execution
In order to get your output back to somewhere more useful to you, you need to tell the
task that it should copy the files back. Condor will then, by default, copy back any altered
or newly created files from the remote execution node.
Cleaning up
With the temporary area allocated to your job being automatically removed once the
job has completed, there is no explicit cleaning required for simple job submission.
Where do stdin, stdout and stderr appear?
When one runs programs locally, program output and error messages will often appear
on the console, or in the terminal emulator, and one can usually perform command line
redirection for input.
When you are running a non-interactive job on a remote machine however, it is likely that
you aren't going to see any console output during the execution of the program.
Condor can therefore be instructed to redirect the stdout and stderr channels to file so that
they may be inspected after the job has finished.
The submission script has directives that allow the user to name files for the redirection.
If you do not specify filenames then you implicitly get /dev/null, or the platform equivalent.
In order to create unique filenames for each task submitted, use can be made of the fact
that Condor provides for a number of variables to be expanded within the submission script.
Condor allows one to submit multiple jobs (tasks) within a single submission. The submission
is referred to as a Cluster, and within each cluster the individual instances of the job are each known
as a Process, numbered starting from 0.
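To make the numbering concrete, here is a small illustration of the names a pattern such as myjob.out.$(Cluster).$(Process) would produce. It is not a Condor command, just a shell loop that mimics the expansion; the function name, the prefix myjob.out and the cluster number 1234 are made up for the example.

```shell
#!/bin/sh
# Illustration only: this mimics how a pattern such as
#   myjob.out.$(Cluster).$(Process)
# expands for a submission of "count" processes. It does not talk to
# Condor, and the names used are hypothetical.
expand_names() {
    prefix="$1"; cluster="$2"; count="$3"
    process=0                      # Process numbering starts at 0
    while [ "$process" -lt "$count" ]; do
        echo "${prefix}.${cluster}.${process}"
        process=$((process + 1))
    done
}

# Three processes in (made-up) cluster 1234:
expand_names myjob.out 1234 3
```

This prints myjob.out.1234.0, myjob.out.1234.1 and myjob.out.1234.2, one per line, so each process in the cluster ends up with its own file.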
A basic script
Within a directory containing
cygwin1.dll hworld.cmd hworld.exe
with
hworld.cmd being a very basic script (11 lines) like this:
universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = hworld.exe
TransferInputFiles = cygwin1.dll
output = hworld.out.$(Cluster).$(Process)
error = hworld.err.$(Cluster).$(Process)
log = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
one can submit a job by typing
condor_submit hworld.cmd
If the job submission returns something like this
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1234.
then you would expect to see the following files at the end of the job,
along with any files created on the remote execution node(s).
hworld.err.1234.0
hworld.out.1234.0
hworld.log.1234.0
Emailed output
By default, Condor will send an email message to the user's mailbox on the Condor Master
on completion of the job, along these lines:
This is an automated email from the Condor system
on machine "vuwunicondgrd01.vuw.ac.nz". Do not reply.
Condor job 1234.0
/home/username/hworld/hworld.exe
has exited normally with status 12
Submitted at: Mon Jul 13 12:37:48 2009
Completed at: Mon Jul 13 12:38:30 2009
Real Time: 0 00:00:42
Virtual Image Size: 0 Kilobytes
Statistics from last run:
Allocation/Run time: 0 00:00:02
Remote User CPU Time: 0 00:00:00
Remote System CPU Time: 0 00:00:00
Total Remote CPU Time: 0 00:00:00
Statistics totaled from all runs:
Allocation/Run time: 0 00:00:02
Network:
1.8 MB Run Bytes Received By Job
12.0 B Run Bytes Sent By Job
1.8 MB Total Bytes Received By Job
12.0 B Total Bytes Sent By Job
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor-admin@ecs.vuw.ac.nz
The Official Condor Homepage is http://www.cs.wisc.edu/condor
Various attributes of the notification process can be altered within the
job submission script.
Running Java programs
A couple of people have been running these of late, so here's some basic info.
Let's assume that you will be running two instances of a Java program called myprog.jar
which, when run on your own machine, reads from a datafile mydata.txt and produces an
output file called myoutput.txt.
The job submission script, which will be called myprog.cmd, might be expected to look
something like this
universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = myprog.bat
TransferInputFiles = myprog.jar mydata.txt
output = hworld.out.$(Cluster).$(Process)
error = hworld.err.$(Cluster).$(Process)
log = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 2
Note, in comparison with the basic script, that rather than running the Java program directly, our executable is a DOS
batch (.bat) file in which we will invoke Java to run our JAR file, with a command such as
java -jar myprog.jar
and that we copy both the JAR and our data file over using TransferInputFiles.
However, when we run the two instances of the program, both instances will, if the program runs,
produce output files named myoutput.txt.
This represents a problem.
Condor will automatically fetch back, to the directory on the master from where we submitted the
job, all files that have changed on each execution host, which means it will overwrite the output
file.
What's needed here is a mechanism for ensuring that the files that get created are named uniquely.
In the same way that we can use the Condor information about the Cluster and Process
to differentiate the stdout, stderr and log files, we can use it to solve the file name issue
by making those values available as environment variables in the DOS environment.
This is done by modifying the environment value in the submission script so that it looks like this
(note the quotes)
environment = "path=c:\WINDOWS\SYSTEM32 CONDOR_CLUSTER=$(Cluster) CONDOR_PROCESS=$(Process)"
Within our DOS batch file, we can access the DOS environment variables using the standard
%VARNAME% syntax.
This allows us either to alter our program so that it accepts the Condor information as arguments and alters the output filenames
internally, or simply to rename the output file as part of the set of commands in the DOS batch file that will be run.
The latter approach works because Condor transfers the files after the executable has ended, which means after
all the commands in the DOS batch file have run.
We might thus have a DOS batch file that looks like this
java -jar myprog.jar
ren myoutput.txt myoutput.%CONDOR_CLUSTER%.%CONDOR_PROCESS%
Note: the DOS command ren is the equivalent of the UNIX mv.
Similarly, if we wish to have the program change its operation based on those values, we might have a DOS batch file
looking like this
java -jar myprog.jar %CONDOR_CLUSTER% %CONDOR_PROCESS%
Note that the latter approach, making the executable aware of its place in the set of jobs,
can be useful as a mechanism for, e.g., invoking different behaviours or defining initial conditions
such as seeds for random number generation.
One can submit a job by typing
condor_submit myprog.cmd
If the job submission returns something like this
Submitting job(s).
Logging submit event(s).
2 job(s) submitted to cluster 1234.
then you would expect to see the following files at the end of the job,
along with any files created on the remote execution node(s).
hworld.err.1234.0
hworld.err.1234.1
hworld.out.1234.0
hworld.out.1234.1
hworld.log.1234.0
hworld.log.1234.1
myoutput.1234.0
myoutput.1234.1
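With more than a handful of processes it can be handy to check, on the Condor Master, which per-process files have actually come back. The following is a minimal sketch, assuming files named prefix.cluster.process in the current directory; the function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name and its arguments are hypothetical.
# Lists the per-process files that have NOT yet appeared in the
# current directory for a given cluster.
missing_outputs() {
    prefix="$1"; cluster="$2"; count="$3"
    process=0
    while [ "$process" -lt "$count" ]; do
        file="${prefix}.${cluster}.${process}"
        [ -e "$file" ] || echo "$file"
        process=$((process + 1))
    done
}

# Usage, for the two-process submission above:
# missing_outputs hworld.out 1234 2
```

No output means every expected file is present; any name printed belongs to a process whose output has not been transferred back (yet).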
Java versions
As of Feb 2011, there were four versions of Java available on the SCS machines,
with one machine examined showing:
10/09/2008 08:34 a.m. <DIR> jdk1.5.0_09
10/09/2008 08:22 a.m. <DIR> jre1.5.0_06
10/09/2008 08:23 a.m. <DIR> jre1.6.0_07
09/03/2010 05:10 p.m. <DIR> jre6
The default version, which I am told is the one without any version information
visible, is 1.6.0_18.
ECS machines had access to OpenJDK 7 at this time.