Technical Note - Grid Computing: SCS Grid

This is a technical note. If you have no interest in navigating away via the space-wasting but University-standard look-and-feel side bars, you can get rid of them by adding

?skin=vuwecsclean2,vuwecs,pattern

to the URL.

Summary

There doesn't seem to be any ECS-focused documentation aimed at the users wishing to run jobs on the SCS Grid, so here are the basics.

A more complete description can be found in the provider's documentation.

Details

General

ECS administers two "Grids" known, in administration circles, as the ECS Grid and the SCS Grid. You may also find these referred to as the SGE Grid and the Condor Grid respectively. The two are separate.

The Tech Note for the ECS Grid is at:

http://ecs.victoria.ac.nz/Support/TechNoteEcsGrid

The SCS Grid runs under control of a Condor instance and exists to make use of the computing power of the University's ITS-provided student computing service (SCS) lab machines at times when they are unused, ie when they should have no one logged in at the console.

Jobs, basically text files describing the tasks to be carried out and the files needed to do so, are submitted from the Condor master machine (users will thus need to obtain a login account for this machine with their ITS username) into the Condor queuing system, where they remain, in turn, until an "unused" machine is able to start them. (Note that being able to start them is not the same as being able to run them to completion.)

At present, users at the console of a machine have priority over Grid jobs running on the same machine: a Grid job will be suspended on any machine where there is console activity. Users submitting Grid jobs should therefore be aware that there is no guaranteed run time for any given task. Basically, it'll finish when it finishes.

There is a low-volume mailing list which is used to inform users of the SCS/Condor grid of matters which may be of interest to them. You can subscribe your School or University email address here:

http://ecs.victoria.ac.nz/mailman/listinfo.cgi/scs-grid-users

The Condor Master

The ITS machine from which all Condor activities are carried out is vuwunicondgrd01.vuw.ac.nz

As of November 2010 the machine may be accessed using the name condor01.vuw.ac.nz

The Condor Master runs Red Hat Enterprise Linux (RHEL) and ssh access is via port 10. (This is an ITS standard, apparently.)

If your username on the Condor Master does not match your ECS username, you will need to type:

ssh vuwunicondgrd01.vuw.ac.nz -p 10 -l username

to log on; otherwise, with matching usernames, a simple

ssh vuwunicondgrd01.vuw.ac.nz -p 10

will suffice.

Your home directory on the Condor Master will be of the form /home/username

You can use scp to move files between your ECS filestore and the Condor Master, eg

scp -P 10  localfile  username@vuwunicondgrd01.vuw.ac.nz:/path/to/remotedir/.

scp -P 10  username@vuwunicondgrd01.vuw.ac.nz:file_in_homedir.txt /path/to/local/dir/.

It is possible to use rsync to maintain a level of synchronisation between directories within ECS filestore and on the Condor Master, eg.

% hostname
some-machine.ecs.vuw.ac.nz
% cd ~/top/of/my/local/condorgrid/directories
% rsync -avi --rsh="ssh -p 10" vuwunicondgrd01.vuw.ac.nz:/home/username/remotedir/ ./remotedir/

Note that you need to tell rsync to talk ssh over port 10 via the --rsh option.

Setting up the environment

The machines ITS provide for the SCS run a Windows operating system and the machines are effectively installed using a single OS+packages image rolled out from the centre.

ECS users will thus find a limited range of native software packages, placed onto the SCS machines at the request of various academics across the University, for use by students in teaching labs or self-study, accessible by default.

ITS does not seem to have a list of the user-accessible packages on the SCS available at time of writing.

Where Grid jobs merely require such packages to be operated in a batch mode, such jobs can obviously make use of those packages.

In general, you will need to use the full path to the executables you wish to invoke, eg:

c:\"program files"\r\r-2.6.0\bin\r.exe --no-save < myscript.r

With software packages in use within ECS but not available on the SCS, ITS may consent to installing the Windows version, if available, of that package within the image they roll-out to the SCS machines.

Users new to the SCS machines should be aware that ITS are, understandably, reluctant to alter their images once a teaching trimester is underway and there may well be issues in having multiple versions of the same package installed at the same time. Planning ahead so as to liaise with those who have requested packages be installed on the SCS is thus a good thing.

Users wishing to make use of the service to run bespoke codes or programs which normally operate within an ECS, or other, UNIX environment have two approaches:

  1. Recompile sources, or otherwise ensure that any binaries run, against the Cygwin emulation software and then upload a matching Cygwin DLL as part of the job submission payload. The ITS machines do not provide, nor hence constrain the user to, any particular Cygwin DLL version.
  2. Recompile sources using a Windows compiler to produce native binaries, taking care to ensure that any run-time dependencies will be resolved when running on an SCS machine. The ITS machines do provide access to one Windows compiler suite.

Do I have a home on the Grid?

Not a permanent home, no: it's more like a rented bach with the Condor Master as your trailer of stuff from home.

Basically, Condor would normally upload all the files needed for a task into a temporary directory, %_CONDOR_SCRATCH_DIR%, on each remote machine from a user's directory on the Condor Master.

At the end of the job execution, Condor can download files back to filestore areas on the Condor master.

This means that a user might first need to copy files from their ECS filestore onto the Condor Master so as to then make them accessible to the Condor Grid, and copy files off the Condor Master back into their ECS filestore for post processing.

To continue with the bach/trailer metaphor, you pack it before you go, unpack when you arrive, pack it again before you leave and unpack when you get back home.

Files in the temporary directory on machines the jobs were executed on are not currently preserved between individual job executions.

(So someone tidies up the bach after you have been there, too).

It may be possible to have ITS load large static data sets onto the SCS image so that such data appears at a constant path on any SCS machine; however, as mentioned above, ITS are, understandably, reluctant to alter the image once a teaching trimester is underway.

Where will the input and output files be?

Because this is a distributed batch processing environment, there's usually no clear indication as to which machine(s) your job(s) will end up running on.

You thus need to give a little more thought to the location of input and output files than if you were simply running a job on your own workstation where everything is local to the machine.

Within the SCS Grid, Condor can stage files and directories to and from the master machine.

Preserving results after execution

In order to get your output back to somewhere more useful to you, you need to tell Condor, in the submission script, that it should transfer files (the ShouldTransferFiles and WhenToTransferOutput lines in the basic script below). Condor will then, by default, copy back any altered or newly created files from the remote execution node.

Cleaning up

With the temporary area allocated to your job being automatically removed once the job has completed, there is no explicit cleaning required for simple job submission.

Where do stdin, stdout and stderr appear?

When one runs programs locally, program output and error messages will often appear on the console, or in the terminal emulator, and one can usually perform command line redirection for input.

When you are running a non-interactive job on a remote machine however, it is likely that you aren't going to see any console output during the execution of the program.

Condor can therefore be instructed to redirect the stdout and stderr channels to file so that they may be inspected after the job has finished.

The submission script has directives that allow the user to name files for the redirection.

If you do not specify filenames then you implicitly get /dev/null, or platform equivalent.

In order to create unique filenames for each task submitted, use can be made of the fact that Condor provides for a number of variables to be expanded within the submission script.

Condor allows one to submit multiple jobs (tasks) within a single submission. The submission is referred to as a Cluster, and within each cluster the individual instances of the job are known as Processes, numbered starting from 0.

A basic script

Within a directory containing

cygwin1.dll hworld.cmd  hworld.exe 

with hworld.cmd being a very basic script (11 lines) like this:

universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = hworld.exe
TransferInputFiles  = cygwin1.dll
output     = hworld.out.$(Cluster).$(Process)
error      = hworld.err.$(Cluster).$(Process)
log        = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL") 
ShouldTransferFiles  = YES
WhenToTransferOutput = ON_EXIT
queue 1

one can submit a job by typing

condor_submit hworld.cmd

If the job submission returns something like this

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1234.

then you would expect to see the following files at the end of the job, along with any files created on the remote execution node(s).

hworld.err.1234.0
hworld.out.1234.0
hworld.log.1234.0
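
The naming pattern above can be checked mechanically once a job has finished. What follows is a minimal sketch in Python, assuming Python 3 is available on whichever machine you inspect the results from. The helper names expected_files and missing_files are hypothetical and are not part of Condor.

```python
from pathlib import Path


def expected_files(prefix, cluster, processes):
    """Return the stdout/stderr/log names the submission script asks for.

    Mirrors the prefix.kind.$(Cluster).$(Process) pattern used above;
    Process numbers start at 0.
    """
    names = []
    for process in range(processes):
        for kind in ("out", "err", "log"):
            names.append(f"{prefix}.{kind}.{cluster}.{process}")
    return names


def missing_files(directory, prefix, cluster, processes):
    """List the expected files that are not (yet) present in directory."""
    base = Path(directory)
    return [name for name in expected_files(prefix, cluster, processes)
            if not (base / name).exists()]


if __name__ == "__main__":
    # Cluster 1234 with a single process, as in the example submission.
    for name in expected_files("hworld", 1234, 1):
        print(name)
```

Calling missing_files(".", "hworld", 1234, 1) in the submission directory would then return an empty list once all three files have come back.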

Emailed output

By default, Condor will send an email message to the user's mailbox on the Condor Master on completion of the job.

This is an automated email from the Condor system
on machine "vuwunicondgrd01.vuw.ac.nz".  Do not reply.

Condor job 1234.0
        /home/username/hworld/hworld.exe
has exited normally with status 12


Submitted at:        Mon Jul 13 12:37:48 2009
Completed at:        Mon Jul 13 12:38:30 2009
Real Time:           0 00:00:42

Virtual Image Size:  0 Kilobytes

Statistics from last run:
Allocation/Run time:     0 00:00:02
Remote User CPU Time:    0 00:00:00
Remote System CPU Time:  0 00:00:00
Total Remote CPU Time:   0 00:00:00

Statistics totaled from all runs:
Allocation/Run time:     0 00:00:02

Network:
    1.8 MB Run Bytes Received By Job
   12.0 B  Run Bytes Sent By Job
    1.8 MB Total Bytes Received By Job
   12.0 B  Total Bytes Sent By Job


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor-admin@ecs.vuw.ac.nz
The Official Condor Homepage is http://www.cs.wisc.edu/condor

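Various attributes of the notification process, such as when a message is sent and to which address, can be altered within the job submission script using the notification and notify_user commands.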

Running Java programs

There have been a couple of people running these of late, so here's some basic info.

Let's assume that you will be running two instances of a Java program called myprog.jar which, when run on your own machine, reads from a datafile mydata.txt and produces an output file called myoutput.txt.

The job submission script, which will be called myprog.cmd, might be expected to look something like this:

universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = myprog.bat
TransferInputFiles  = myprog.jar mydata.txt
output     = hworld.out.$(Cluster).$(Process)
error      = hworld.err.$(Cluster).$(Process)
log        = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL") 
ShouldTransferFiles  = YES
WhenToTransferOutput = ON_EXIT
queue 2

Note, in comparison with the basic script, that rather than running the Java program directly, our executable is a DOS batch (.bat) file in which we will invoke Java to run our JAR file, with a command such as

java -jar myprog.jar

and that we copy both the JAR and our data file over using TransferInputFiles.

However, when we run the two instances of the program, both instances will, if the program runs, produce output files named myoutput.txt.

This represents a problem.

Condor will automatically fetch back, to the directory on the master from where we submitted the job, all files that have changed on each execution host, which means that whichever instance is fetched back last will overwrite the other's myoutput.txt.

What's needed here is a mechanism for ensuring that the files that get created are named uniquely.

In the same way that we can use the Condor information about the Cluster and Process to differentiate the stdout, stderr and log files, we can use it to solve the file name issue by making those values available as environment variables in the DOS environment.

This is done by modifying the environment value in the submission script so that it looks like this (note the quotes)

environment = "path=c:\WINDOWS\SYSTEM32 CONDOR_CLUSTER=$(Cluster) CONDOR_PROCESS=$(Process)"

Within our DOS batch file, we can access the DOS environment variables using the standard %VARNAME% syntax.

This allows us to either alter our program so that it accepts the Condor information as arguments and alters the output filenames internally, or simply rename the output file as part of the set of commands in the DOS batch file that will be run.

The latter approach works because Condor transfers the files after the executable has ended, which means after all the commands in the DOS batch file have run.

We might thus have a DOS batch file that looks like this

java -jar myprog.jar
ren myoutput.txt myoutput.%CONDOR_CLUSTER%.%CONDOR_PROCESS%

Note: the DOS command ren is the equivalent of the UNIX mv.

Similarly, if we wish to have the program change its operation based on those values, we might have a DOS batch file looking like this

java -jar myprog.jar %CONDOR_CLUSTER% %CONDOR_PROCESS%

Note that the latter approach, making the executable aware of its place in the set of jobs, can be useful as a mechanism, eg, for invoking different behaviours or defining initial conditions such as random number generation.
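
As an illustration of the random number case, the sketch below derives a distinct but repeatable seed from the two values passed on the command line. It is written in Python purely for brevity and is an assumption-laden sketch, not part of any program described here. The function name seed_for and the stride value are hypothetical, and the same arithmetic carries over directly to a Java program reading its arguments.

```python
import random


def seed_for(cluster, process, stride=100000):
    """Combine Cluster and Process into one integer seed.

    stride is an arbitrary assumption: it only needs to exceed the
    largest number of processes queued in one submission, so that two
    different (cluster, process) pairs never map to the same seed.
    """
    if not 0 <= process < stride:
        raise ValueError("process must be in range(stride)")
    return cluster * stride + process


if __name__ == "__main__":
    # Values as they would arrive from %CONDOR_CLUSTER% %CONDOR_PROCESS%.
    for process in (0, 1):
        rng = random.Random(seed_for(1234, process))
        # Re-running with the same pair reproduces the same sequence.
        print(process, seed_for(1234, process), rng.random())
```

Because the seed depends only on the Cluster and Process numbers, any single instance can later be re-run on its own with the same starting conditions.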

One can submit a job by typing

condor_submit myprog.cmd

If the job submission returns something like this

Submitting job(s).
Logging submit event(s).
2 job(s) submitted to cluster 1234.

then you would expect to see the following files at the end of the job, along with any files created on the remote execution node(s).

hworld.err.1234.0
hworld.err.1234.1
hworld.out.1234.0
hworld.out.1234.1
hworld.log.1234.0
hworld.log.1234.1

myoutput.1234.0
myoutput.1234.1

Java versions

As of Feb 2011, there were four versions of Java available on the SCS machines, with one machine examined showing:

10/09/2008  08:34 a.m.    <DIR>          jdk1.5.0_09
10/09/2008  08:22 a.m.    <DIR>          jre1.5.0_06
10/09/2008  08:23 a.m.    <DIR>          jre1.6.0_07
09/03/2010  05:10 p.m.    <DIR>          jre6

The default version, which I am told is the one without any version information visible, is 1.6.0_18.

ECS machines had access to the OpenJDK 7 at this time.

Update May 2015

Someone in SEF who was looking to use Java on the SCS/Condor Grid was informed by ITS that their lab machines no longer had a deployment of Java. Despite this assertion it was, however, possible to find and run against the following Java versions (minimum number of machines with each version listed):

java version "1.6.0_17"       1
java version "1.6.0_18"      32
java version "1.6.0_24"       3
java version "1.6.0_25"       1
java version "1.7.0_07"       9
java version "1.7.0_51"      10
java version "1.8.0_25"       1

Given the ITS assertion, it is probably best not to plan on using Java on the SCS/Condor Grid.

Using Java to ease data transfer

Because of the way that the remote shell is set up when you run jobs on the SCS/Condor Grid, you may not have access to common utilities that you would expect to have in an interactive session.

Su Nguyen, from SECS, was looking to transfer a complete directory structure over but was unable to access the local machine's unzip utility, so he wrote some Java code that allowed him to do the packing and unpacking as part of his SCS/Condor Grid job.

Su says that his program is invoked as follows

java -jar ZIPJ.jar zip <dir_name> <zip_name>

java -jar ZIPJ.jar unzip <zip_name> <dir_name> 

and the Java source is attached to this page.

You may need to rename the file in order to create the JAR-file.
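
For preparing or inspecting such archives on the submitting side, the same two operations can be sketched in Python. This is not Su's program, and it does not assume that Python exists on the SCS machines. It is only an illustration of the pack and unpack steps, with hypothetical function names zip_dir and unzip_to, and it assumes Python 3 on the machine where you run it.

```python
import zipfile
from pathlib import Path


def zip_dir(dir_name, zip_name):
    """Pack every file under dir_name into zip_name, keeping relative paths.

    Empty directories are not recorded in this sketch, and zip_name
    should live outside dir_name so the archive does not include itself.
    """
    root = Path(dir_name)
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store names relative to the directory so the tree can
                # be recreated under any target directory.
                archive.write(path, path.relative_to(root).as_posix())


def unzip_to(zip_name, dir_name):
    """Recreate the packed directory tree under dir_name."""
    with zipfile.ZipFile(zip_name) as archive:
        archive.extractall(dir_name)
```

The resulting file is an ordinary zip archive, so it could be listed in TransferInputFiles like any other input file.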

Delayed starting of SCS/Condor Grid jobs

This facility has been useful to some users of the SCS/Condor Grid so here it is.

OK, here's how to make the jobs start at, say, 1am, when you think you might get six or seven hours of uninterrupted computing on each machine you can get your hands on overnight.

It's a distillation of Section 2.12.1 (Time Scheduling for Job Execution) in the Condor Manual:

http://www.cs.wisc.edu/condor/manual/

Choose the Stable Release, Online HTML to view the relevant stuff.

Basically, you need to add three lines to the submission file, eg

deferral_time      = 1291291200
deferral_window    = 18000
deferral_prep_time = 120

Those values are in seconds.

The reason the deferral_time is so large is that it's the number of seconds since midnight GMT on Jan 1, 1970.

What the above says is:

start my job (or jobs, if you queue more than one per submission file) at exactly 2010-12-03 01:00:00 local time, or within a window of 5 hours afterwards; however, don't try to grab any resources until 2 minutes before the time I'd like it to start.
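
The arithmetic behind that reading can be checked with a short Python sketch. It assumes Python 3, and it assumes the local time zone is New Zealand daylight time (UTC+13), which is the offset the example value corresponds to. The function name describe_deferral and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: New Zealand daylight time, UTC+13. Use 12 for standard time.
NZDT = timezone(timedelta(hours=13))


def describe_deferral(deferral_time, deferral_window, deferral_prep_time, tz=NZDT):
    """Turn the three deferral values (all in seconds) into wall-clock times."""
    start = datetime.fromtimestamp(deferral_time, tz)
    return {
        "claim_from": start - timedelta(seconds=deferral_prep_time),
        "start_at": start,
        "start_no_later_than": start + timedelta(seconds=deferral_window),
    }


if __name__ == "__main__":
    for label, when in describe_deferral(1291291200, 18000, 120).items():
        print(label, when.strftime("%Y-%m-%d %H:%M:%S"))
```

For the values above this gives a start of 01:00:00 on 2010-12-03, resources claimed from 00:58:00, and a latest start of 06:00:00.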

Note that you need to be specific about the date, not just the time of day, to remove possible confusion.

When you submit the job(s) they'll appear as "Idle" in the queue up until 2 minutes before the proposed start time when, assuming there are resources available, they'll start to grab resources.

The reason for having the 2 minutes is that without it, you might grab the resources straight away, ie hog the machine with nothing else running on it until the start time. This potentially wastes resources and defeats the object of the grid.

The reason for having the window is to allow your job to start if it can't get resources at the exact time specified.

Whilst I am sure you will be able to do the math each time you need to calculate the number of seconds since Jan 1, 1970, on most UNIX boxes you should be able to do the following

$ date -d "2010-12-03 01:00:00" +%s

and get back

1291291200

The Condor manual suggests you'd type

$ date --date "12/03/2010 01:00:00" +%s

with an American date style, though personally I get confused.

I am not sure how you would achieve that from a Windows box, but you have access to the Condor master so you could do it on there.
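
One platform-neutral possibility, offered as a sketch rather than a tested recipe for any particular machine, is to do the calculation in Python 3. The function name deferral_time_for is hypothetical, and the caller must supply the local offset from UTC explicitly (13 hours for New Zealand daylight time, which is what the example value assumes).

```python
from datetime import datetime, timedelta, timezone


def deferral_time_for(local_start, utc_offset_hours):
    """Seconds since 1970-01-01 00:00:00 UTC for a local start time.

    local_start is "YYYY-MM-DD HH:MM:SS"; utc_offset_hours is the local
    zone's offset from UTC, eg 13 for New Zealand daylight time or 12
    for New Zealand standard time.
    """
    naive = datetime.strptime(local_start, "%Y-%m-%d %H:%M:%S")
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return int(aware.timestamp())


if __name__ == "__main__":
    print(deferral_time_for("2010-12-03 01:00:00", 13))  # prints 1291291200
```

Passing the offset explicitly avoids relying on the time zone settings of whichever machine the calculation is run on.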

Topic attachments

SuNguyen.java (3 K, 28 May 2013 - 14:32, Main.kevin): Source of a Java zip/unzip program